LLaVA 1.5 - Improved Baselines with Visual Instruction Tuning

VLM Paper Review

LLaVA 1.5 - Improved Baselines with Visual Instruction Tuning

코딩하는머글 2024. 6. 9. 23:57

1. Abstract

Fully connected vision-language cross-modal connector(in llava)는 강력하고, 데이터 효율적이다.
CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts을 이용해서, 최신 벤치마크에 대한 강력한 베이스라인을 세울 수 있었다.
Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ∼1 day on a single 8-A100 node

2. Introduction

LMM의 성장에 따라서, 일상적인 목표(general-purpose assistants, Apple의 multimodal llm ferret, gpt4v 등)을 도와주는 방향으로 성장. -> 인공지능의 핵심 과제인 일반적인 테스크를 수행하는. 즉, 인간의 의도에 맞게 현실 세계에 작업~
최근 LMM 연구는 visual instruction tuning이라는 새로운 컨셉으로 진행된다. 실제 라바에서 사용하고 있으며, 기존 논문 리뷰에서 많이 언급 됐을 것을 생각하고 진행.
실제로, Natural instruction-following과 visual reasoning capabilities -> 문제를 풀기 위해서 시각정 정보를 해석하고 진행하는 능력. 에 놀라운 결과를 보여주고 있다.
LLM 성능을 분석하기 위한 각 벤치마크에 대한 논문, https://arxiv.org/pdf/2306.13394.pdf (MME)
Seed-bench: Each evaluation dimension contains multiple-choice questions with groundtruth options derived from human annotation. Spatial understanding / temporal understanding
POPE for better evaluation of object hallucination
MMBench: robustly evaluating the various abilities of vision-language models.
MM-Vet:MM-Vet focuses on the integration of different core VL capabilities, including recognition, OCR, knowledge, language generation, spatial awareness, and math.
최근에는 향상된 성능이 아래와 같은 이유로 향상되었다고 설명하고 있다.
Recent works further demonstrate improved performance by scaling up the pretraining data, instruction-following data, visual encoders, or langauge models], respectively.
1) Scaling up the retraining data. - Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities / 사전 학습 데이터 셋이 구성된 학습 환경과 특정한 패턴의 데이터셋을 클린하는 작업을 통해, utilize a large-scale, weakly labeled, web-crawled set of image-text pairs으로 pre-training을 진행한 문. (In the first stage of pre-training, we mainly utilize a large-scale, weakly labeled, web-crawled set of image-text pairs. Our pre-training dataset is composed of several publicly accessible sources and some in-house data. We made an effort to clean the dataset of certain patterns. As summarized in Table 2, the original dataset contains a total of 5 billion image-text pairs, and after cleaning, 1.4 billion data remain, with 77.3% English (text) data and 22.7% Chinese (text) data.)
2) Instruction-following data - InstructBLIP에서는 실제 학습데이터의 prompt 분포와 zero-shot을 할 때의 prompt 분포가 맞지 않는 문제가 발생해서. 프롬프트 설명이 충분함에도 zero-shot 성능이 낮은 문제를 데이터 관점에서 학습. Intruction을 생성하기 위해 비슷한 유형끼리 cluster를 모으고, cluster별로 instruction template을 생성 -> 다른 배경에서 COT나 Flan-T5의 reasoning 데이터셋으로 구축하는 등의 방식으로 성능이 향상된 배경. -> 해당 연구에서는 26개의 데이터에서 held-in dataset으로 instruction tuning과 13개의 held-out dataset을 zero-shot evaluation에 활용하는 등, instruction이 단순히 llm이 생성하는데 조건을 주는 것뿐 아니라, image encoder에서 어떤 부분을 추출해야 하는지 조건을 줄 수 있도록 구성.
3) Visual Encoder: Qwen-VL는 vision encoder를 위해서 Vision Transformer (ViT) 활용 실제로 14stride로 패치를 자름.
4) language models: 옾은 이미지와 mixing multimodal-language 데이터가 LMM 성능 향상에 highlights 하고 있다.
LLaVA는 다양한 downstream task에서 영향력을 보이고 있다.
(including region-level -> 지역 단위로 보는 것. 노루들 사이에는 뭐가 있냐~
pixel-level understanding -> category나 object를 identify 말고 reasoning segmentation. The task is designed to output a segmentation mask given a complex and implicit query text
biomedical assistants / image generation / adversarial studies 등
두개의 개선점을 제안 an MLP cross-modal connector / incorporating academic task 하며, 더 나은 multimodal understanding 능력을 이끌고 있다.
designed된 visual resamplers를 학습하는 InstructBLIP/Qwen-VL와 대조적으로 LLaVA LLMs을 위한 간단한 architecture을 사용하고, 오직 a simple fully connected projection layer on 600k image-text paires를 요구한다.
8-A100으로 하루 안에 학습 가능하고, 다양한 벤치마크에서 최신 결과를 얻을 수 있다.
Qwen-VL은 in-house 데이터를 포함했지만, LLaVA는 오직 public 데이터로 학습했다. 이를 통해 LMM을 위한 easily-reproducible baselines을 기대하고 있는 저자…

3. Background

Instruction-following LMM:

일반적인 구조는 visual feature를 encode하기 위해서 pre-trained visual backbone을 포함하고, 사용자 instructions을 이해하고 응답 생성을 위해 pre-trained large language model (LLM)을 포함. 또한 vision encoder 출력을 LM에 align 시키기 위해서 vision-language cross-modal connector이 포함.
LLaVA는 LLMs을 위한 가장 간단한 구조로 생각하고, Visual resampler(visual patches의 수를 줄이기 위해)을 위한 Qformer가 선택적으로 적용될 수 있다.
instruction-following LMM을 학습하는 건 two-stage protocol을 따른다.
1. vision-language alignment pretraining stage는 image-text pairs에 LM의 word embedding space와 함께visual features을 align한다. 이전 연구에서는 더 적은 pair(~600K)가 활용됐고, 최근 연구들이 큰 양의 pair을 성능 향상을 위해 사용하면서 VL connector를 사전학습 시키고 있다.
2. visual instruction tuning stage는 model을 visual instructions에 tune하고, 사용자에게 시각 내용이 포함된 다양한 응답을 가능하게 한다.

Multimodal instruction-following data:

자연어에서는 Lima에서 이미 instruction-following 데이터의 성능이 보여짐.
LLaVA는 text-only GPT-4를 이용하여 COCO의 bounding box와 caption dataset을 multimodal instruction-following data로 확장시켰다.
3개의 instruction-following dataset인 conversational-style QA/ detailed description / complex reasoning.
LLaVA는 textual understanding / million-scales / region-level conversations로 확장시킴.
InstructBLIP은 model의 visual capabilities를 강화하기 위한 academic-task-oriented VQA 데이터셋을 포함.
하지만, flamingo에서 실험한 논문에 따르면, naive한 데이터를 접목시키는 것은 VQA dataset에 overfit시키는 경향성 또한 존재. -> 해당 저자들은 llava pipeline에 VQA dataset를 conversational style로 변환할 것을 제안한다. 이를 통해 학습의 효율성이 있지만, 추가적인 데이터 scaling의 복잡성이 추가.

4. Improved Baselines of LLaVA

LLaVA는 Real-life visual instruction-following task에 대한 다양한 벤치마크에서 성능이 우수했고, 오직 academic benchmarks(single-word 같이 짧은 답변을 요구하는)에서는 떨어졌다. -> 큰 데이터셋에서 학습되지 않았기 때문에.
데이터, 모델, input image resolution의 scaling 효과를 연구
Res - input image resolution
PT - the number of samples in pretraining
IT - instruction tuning stage
12개의 벤치마크에 따른 LLM 성능 비교
최종적으로 LLaVA는 visual instruction tuning에 데이터 효율적이면서, 적은 compute와 training-data로 다른 방식에 비해 높은 성능을 성취

Response formatting prompts:

InstructBLIP처럼, short-form과 long-form VQA 사이의 inability to balance를 발견. 이는 following reasons 때문에
1. Ambiguous prompts on the response format. (Q: {} / A: {}) 해당 prompts는 명확한 답변 형식을 가리키지 않고, short-form 답변에 LLM이 overfit될 수 있다. behavorially to short-form answers even for natural visual conversations.
2. Not fine-tuning the LLM. -> InstructBLIP이 Qformer를 instruction tuning을 위해서만 finetuing. Qformer’s visual output tokens가 LLM의 출력의 길이를 조절한다(prefix tuning으로서). 그러나 Qformer는 라마와 같은 모델과 비교해서는 제한된 용량으로 해당 능력이 부족하다.
따라서, single response formatting prompt를 사용해서 output format을 명확히 가르키기 위한 방안을 제시.(예시와 같이 VQA)
As shown in Table 1, by merely including VQAv2 [12] in training, LLaVA’s performance on MME significantly improves (1323.8 vs 502.8) and outperforms InstructBLIP by 111 points. -> 이유 알아보기

MLP vision-language connector:

Self-supervised learning에서 linear projection에서 MLP로 변화하고 향상된 성능에 영향받아서, VL connector의 representation power가 향상되는 걸 발견했다.
simclr: this only influences the unsupervised training stage(Linear - BN - ReLU - Linear - BN 구조의 MLP) -> visual representation을 to a 128-dimensional latent space

Academic task oriented data:

table1에서 보이는 것과 같이, academic-task-oriented VQA datasets을 추가했다. for VQA, OCR, and region-level perception
InstructBLIP이 open- knowledge VQA (OKVQA [33], A-OKVQA [37]) and OCR (OCRVQA [34], TextCaps [39])의 데이터셋을 학습했던 것 대비, OKVQA/OCR을 하나만 사용한 LLaVA가 성능이 더 좋았다.
region-level VQA datasets(Visual Genome, RefCOCO) 또한 localizing fine-grained visual details를 향상시키는 것을 발견했다.

Additional scaling

Input image resolution을 scale up하고 -> 이미지 디테일 잘 보기 위해서 / GQA 데이터셋을 visual knowledge source로서 추가했다.
ShareGPT 데이터를 병합하고 / LLM을 13B으로 늘렸다. -> 실제로 MM-Vet의 결과가 visual conversations를 위한 LLM의 성능에 중요했다고 보여진다..

5. Discussion

Comparison with SoTA:

Academic VQA 벤치마크와 instruction-following LMM을 위한 최신 벤치마크로 평가했다.
적은 pretraining과 instruction tuning.데이터를 사용했음에도 12개 중 11개의 벤치마크에서 최고 성능
LLaVA-1.5 achieves the best performance with the simplest architecture, academic compute and public datasets, and yields a fully-reproducible and affordable baseline for future research
해당 결과로 Visual instruction tuning이 pretraining보다 중요하다고 제안했고,
LMM이 상당한 VL alignment pretraing을 요구한다는 질문을 야기. -> vision encoder(clip, open clip, Eva-clip 등)가 이미 web-scale image-text paired dataset에 사전학습 되었어도~
LLaVA-1.5(7B)가 IDEFICS(80B)와 Flaming-like LLM with billions of trainable parameters for cross-modal connection의 성능을 능가 했다.

Zero-shot format instruction generalization:

LLaVA가 제한된 포멧 instructions으로 학습되어도, 다른 것들을 잘 generalize하더라~
VizWiz의 경우, 제한된 내용이 답변하기 불충분하면 Unanswerable을 출력. (Table에서 instruction변경해서 11.1% -> 67.8)

Zero-shot multilingual capability:

Multilingual multimodal instruction following을 위해 finetuned 하지 않았지만, following multilingual instructions이 가능할 것으로 보인다. -> ShareGPT(ChatGPT conversation 데이터를 활용해서) 안에 부분적인 multilingual language instruction 때문에.
LLaVA-1.5 outperforms Qwen-VL-Chat by 7.3% (63.6% vs 56.7%)

Computational cost:

Instruction tuning을 위한 LCS-558K 데이터셋과 training iterations, batch size를 어느정도 같게 LLaVA처럼~
Image input resolution을 336px로 해서, lava-1.5의 학습이 2배 가까이 증가. pretraining에 6시간, visual instruction tuning에 20시간(8xA100 사용)

Limitations:

1) LLaVA는 full image patches를 사용해서, 각각의 학습 횟수를 지연시킬 수 있다.
visual resampler가 visual patches를 줄이지만, LLaVA와 비교할 때 유사한 양의 훈련 데이터를 사용할 때 수렴(convergence)을 효율적으로 달성할 수 없다 -> resample들이 더 많은 훈련 가능한 매개변수를 가지고 있어서.
따라서 sample-efficient visual resampler가 instruction-following mm의 scaling-up에 기여할 수 있다고 본다.
2) LLaVA-1.5는 processing multiple images이 불가능하다. -> instruction-following data의 부족과 context length의 제한으로.
3) LLaVA-1.5는 following complex instructions에 효율성이 있지만, 문제 해결 능력은 특정 도메인에서 제한적이다. -> 그러나 이것 또한 더 capable한 LM과 high-quality, targeted visual instruction using data로 향상될 수 있다.
4) hallucination에 대한 경향도가 줄어듦에도 불구하고, 환각을 유발하거나 거짓 정보를 퍼뜨릴 가능성이 있고, 의약 분야에서는 조심히 사용해야 한다.

Appendix

학습 비용 감소와 효율성 증진을 위한 multiple 전략

For all VQA datasets, 같은 학습 이미지의 QA pairs를 하나의 conversation으로 병함.
For ShareGPT, we filter out invalid conversations as [41]. Unlike Vicuna, long conversations that surpass 2048 tokens은 생략. Multiple conversation을 구분하는 것보다~ (약40K conversations.)
Each QA pair in A-OKVQA [37] Multiple-choice 데이터가 부족할 경우를 counterbalance하기 위해 k times를 augment한다. 질문의 답변 수만큼!
80K conversations are sampled from OCRVQA [34].
ForVisual Genome, we sample 10 annotations for images with additional annotations.
For RefCOCO, conversations are dissected into segments, each containing fewer than 10 conversations. 언어와 시각 모드의 데이터를 섞지 않고 각 배치에서는 하나의 모드만 사용함으로써 훈련 속도를 높일 수 있었고, 이는 최종 모델의 성능에는 영향을 미치지 않았다는 것을 의미
We obverse that language conversations are often longer than visual ones. For each batch, we sample conversations only from a single modality, and this speeds up the training by 25%, and does not affect the final outcome.

All data splits are concatenated together and sampled with the same probability

'VLM Paper Review' 카테고리의 다른 글

Llava vs GPT-4V (0)	2024.06.10
PaLI-3 (0)	2024.06.10
BLIP2 (0)	2024.06.09
BLIP (0)	2024.06.09
LLaVA (0)	2024.06.09

현재글LLaVA 1.5 - Improved Baselines with Visual Instruction Tuning

코딩하는 머글

AI 및 연구관련 논문을 정리하는 곳입니다.

Today :
Yesterday :

코딩하는 머글