0. Abstract
- This paper presents a comprehensive evaluation of publicly available large multimodal models by building an LVLM evaluation Hub (LVLM-eHub).
- It consists of 8 representative LVLMs (e.g., InstructBLIP and MiniGPT-4), which are thoroughly evaluated by a quantitative capability evaluation and an online arena platform.
- The former evaluates 6 categories of multimodal capabilities of LVLMs, such as visual question answering and embodied artificial intelligence, on 40 standard text-related visual benchmarks, while the latter provides the user-level evaluation of LVLMs in an open-world question-answering scenario.
- The paper reports several innovative findings, as follows.
- First, an instruction-tuned LVLM with massive in-domain data (i.e., data highly specific to particular domains), such as InstructBLIP, may overfit many existing tasks while generalizing poorly in the open-world scenario.
- Second, an instruction-tuned LVLM with moderate instruction-following data (suggesting that more systematically curated instruction-following data may be needed) may suffer from object hallucination issues (i.e., it generates objects that are inconsistent with the target images in its descriptions) → this either makes current evaluation metrics such as CIDEr for image captioning ineffective or produces wrong answers.
- Third, employing a multi-turn reasoning evaluation framework (reasoning over multiple steps, presumably similar in spirit to chain-of-thought prompting) could mitigate the issue of object hallucination, shedding light on developing an effective metric for LVLM evaluation.
1. Introduction
- LVLMs have made remarkable progress in multimodal vision-language learning for multimodal tasks such as VQA and multimodal conversation.
- An LVLM capitalizes on the knowledge of an LLM and efficiently aligns visual features with the textual space.
- Flamingo, a pioneering LVLM, integrates visual features into LLMs through cross-attention layers. Later studies proposed more efficient vision-text interactions and training methods, and employed instruction tuning.
- There has been little effort toward a systematic evaluation of LVLMs. → However, evaluation is crucial for identifying their strengths and weaknesses and for guiding their future development.
- Recent work presents a systematic investigation of object hallucination in LVLMs by proposing a polling-based object probing evaluation (POPE) method → ground-truth objects are extracted with segmentation → negative samples are constructed to build POPE → question templates are written to poll the LVLM about both ground-truth and nonexistent objects.
- Polling: the process by which a system collects required information or checks status through multiple attempts or requests.
- ImageNetVC studies how well LVLMs can master visual commonsense knowledge.
- Another work comprehensively evaluates the performance of LVLMs on text-related visual recognition, such as optical character recognition.
- GVT evaluates LVLM's visual semantic understanding and fine-grained perception capabilities (What makes for good visual tokenizers for large language models).
- → Nevertheless, these studies only evaluate a portion of LVLMs on specific tasks, lacking an overall understanding of LVLM’s capabilities.
- Specifically, the quantitative capability evaluation extensively evaluates 6 categories of multimodal capabilities of LVLMs, including visual perception, visual knowledge acquisition, visual reasoning, visual commonsense, object hallucination, and embodied intelligence (see Fig. 1 (a)) → it is built on 40 standard text-related visual benchmarks.
- → On the other hand, the online arena platform features anonymous randomized pairwise battles in a crowd-sourced manner, providing a user-level model ranking in the open-world question-answering scenario (see Fig. 1 (b)).
- Innovative findings of LVLM-eHub:
- (1) An instruction-tuned LVLM with massive in-domain data, such as InstructBLIP, suffers from overfitting and generalizes poorly in open-world scenarios (see Fig. 1 (a)).
- (2) With moderate instruction-following data, an instruction-tuned LVLM may cause object hallucination issues, generating objects that are inconsistent with the target images in its descriptions. This leads to incorrect answers or renders current evaluation metrics, such as CIDEr for image captioning, ineffective.
- (3) We find that a multi-turn reasoning evaluation pipeline can mitigate the issue of object hallucination, indicating that developing an effective metric for LVLM evaluation is urgent.
- Summary of contributions:
- (1) We propose LVLM-eHub, which is, to the best of our knowledge, the first comprehensive evaluation benchmark for large vision-language models.
- (2) LVLM-eHub provides an extensive evaluation of 6 categories of multimodal capabilities of LVLMs on more than 40 text-based visual tasks.
- (3) LVLM-eHub builds an online arena platform for LVLMs, which features anonymous randomized pairwise user-level comparison in an open-world scenario.
- (4) Our evaluation results reveal several innovative findings, providing a foundational framework for the assessment of innovative strategies aimed at enhancing zero-shot multimodal techniques.
2. LVLM Evaluation Hub
- 2-1 Quantitative Capability Evaluation
- We summarize 6 categories of capabilities and collect corresponding benchmarks for quantitative evaluation.
- Visual Perception: Visual perception is the ability to recognize the scene or objects in images, the preliminary ability of the human visual system. Image classification (ImgCls) uses the ImageNet1K, CIFAR10, Pets37, and Flowers102 benchmarks, while multi-class identification (MCI) and object counting (OC) use the GVT benchmark. ImgCls and MCI measure how well an LVLM grasps high-level semantic information, while OC assesses the recognition ability for fine-grained objects.
- Visual Knowledge Acquisition: Visual knowledge acquisition entails understanding images beyond perception to acquire knowledge. This evaluation covers Optical Character Recognition (OCR) using twelve benchmarks (IIIT5K, IC13, IC15, Total-Text, CUTE80, SVT, SVTP, COCO-Text, WordArt, CTW, HOST, and WOST), Key Information Extraction (KIE) using SROIE and FUNSD, and Image Captioning (ImgCap) using two benchmarks (NoCaps and Flickr30K). The OCR task measures whether a model can accurately identify and extract text from images or scanned documents. The KIE task further poses challenges in extracting structured information from unstructured or semi-structured text. ImgCap assesses whether a model can generate a good natural language description of the content of an image.
- Visual Reasoning: Visual reasoning requires a comprehensive understanding of images and related texts. We utilize three tasks: visual question answering (VQA), knowledge-grounded image description (KGID) using two benchmarks (ScienceQA and VizWiz), and visual entailment using one benchmark (SNLI-VE). These three tasks are posed in VQA form across different domains → a capable LVLM should understand the objects and scenes in an image and reason to generate answers that are semantically meaningful and relevant to the question asked.
- Visual Commonsense: Visual commonsense refers to the general visual knowledge commonly shared across the world, as opposed to the visual information specific to a single image. This evaluation tests the model's understanding of commonly shared human knowledge about generic visual concepts using ImageNetVC and visual commonsense reasoning (VCR). Specifically, ImageNetVC is utilized for zero-shot visual commonsense evaluation, such as color and shape, while VCR covers various scenes, such as spatial, causal, and mental commonsense.
- Embodied Intelligence: Embodied intelligence aims to create agents, such as robots, that learn to solve challenging tasks requiring environmental interaction. Recently, LLMs and LVLMs have exhibited exceptional effectiveness in guiding agents to complete a series of tasks. In this evaluation, we utilize high-level tasks as in EmbodiedGPT and employ Minecraft, VirtualHome, Meta-World, and Franka Kitchen as benchmarks.
- Object Hallucination: It is known that LVLMs suffer from the object hallucination problem, i.e., the generated results are inconsistent with the target images in the descriptions. Evaluating the degree of object hallucination for different LVLMs helps us understand their respective weaknesses. To this end, we evaluate the object hallucination problem of LVLMs on the MSCOCO dataset.
- 2-2 Online Evaluation with LVLM Arena
- It is infeasible to design quantitative evaluations that fully cover LVLM capabilities → evaluating LVLM responses constitutes an open-ended problem.
- Inspired by FastChat, we introduce the LVLM Arena, an online evaluation framework for LVLMs’ pairwise battle with human judgment.
- Figure 2 illustrates the LVLM Arena, comprising three primary components: matchmaking, chat, and voting.
- Initially, two models are sampled from the model zoo. Users then converse side-by-side with the models, which remain anonymous. Subsequently, users vote for the superior model.
- Matchmaking: The matchmaking module samples two models in a tournament style based on their Elo rating. However, due to the currently limited size of the model hub, we employ random sampling.
- Chat: Users chat side-by-side with two sampled models (which remain anonymous) using images or text inputs. Different from quantitative evaluation, users can chat about anything. Our existing online platform supports only single-round chats due to multi-round chats’ high computational and memory demands. Future updates will address this constraint.
- Voting: After the chat session, users vote for their preferred model. Four options are available: Model A, Model B, Tie, and Both are bad. The Elo rating is subsequently updated using voting results.
- In contrast to limited quantitative evaluations, the LVLM Arena provides an open-world evaluation framework that enables users to chat with models about anything, emulating real-world conditions. Besides, users serve as the judge for the battle, which brings more convincing evaluation results than traditional evaluation metrics.
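- To make the Matchmaking and Voting mechanics above concrete, here is a minimal Elo-update sketch. This is not the authors' code: the K-factor, the initial rating of 1000, and treating 'Tie' / 'Both are bad' as a draw are assumptions made for illustration.

```python
# Minimal sketch of an Elo update driven by Arena votes.
# Assumptions (not from the paper): K = 32, initial rating 1000,
# and "Tie" / "Both are bad" both counted as a draw (score 0.5 each).

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a: float, r_b: float, outcome: str, k: float = 32.0):
    """outcome is one of 'A', 'B', 'tie', 'both_bad' (the four vote options)."""
    score_a = {"A": 1.0, "B": 0.0}.get(outcome, 0.5)  # draws -> 0.5
    e_a = expected_score(r_a, r_b)
    r_a += k * (score_a - e_a)
    r_b += k * ((1.0 - score_a) - (1.0 - e_a))
    return r_a, r_b

# Replaying a list of collected battles to obtain a ranking:
ratings = {m: 1000.0 for m in ["mPLUG-Owl", "MiniGPT-4", "Otter", "InstructBLIP"]}
battles = [("mPLUG-Owl", "InstructBLIP", "A"), ("MiniGPT-4", "Otter", "tie")]
for a, b, vote in battles:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], vote)
```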
- 2-3 Zero-shot Evaluation
- LVLMs are capable of capturing a wide range of multimodal patterns and relationships. We evaluate the above 6 categories of capabilities of LVLMs by investigating their zero-shot performance on various tasks.
- Question Answering: Prompting with visual question answering can be used to solve many downstream tasks, which assesses how well an LVLM understands the underlying language and visual features. We design proper prompts to ensure that the model can produce meaningful results. For example, the text prompt for OCR can be "what is written in the image?". Then, we evaluate the generated answers using the corresponding metric, such as accuracy.
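- A minimal sketch of this prompt-then-score loop is given below. The `lvlm.generate(image, prompt)` interface and the exact-match normalization are illustrative assumptions, not the paper's actual evaluation harness.

```python
# Hypothetical sketch of prompt-based zero-shot QA evaluation.
# Assumption: `lvlm` exposes generate(image, prompt) -> str; real LVLMs
# (BLIP2, InstructBLIP, ...) each have their own interface.

def normalize(text: str) -> str:
    return text.strip().lower().rstrip(".")

def evaluate_qa(lvlm, samples, prompt="what is written in the image?"):
    """samples: list of (image, ground_truth_answer) pairs."""
    correct = 0
    for image, gt in samples:
        pred = lvlm.generate(image, prompt)               # free-form answer
        correct += int(normalize(pred) == normalize(gt))  # exact-match check
    return correct / max(len(samples), 1)                 # accuracy
```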
- Prefix-based Score: For multi-choice QA tasks, we can utilize a visual encoder to obtain visual prompts for a given image. The visual prompts are then prefixed to the text embeddings, which are fed into the LLM. The likelihood of the image-text pair can then be computed, which is referred to as a prefix-based score. We obtain a prefix-based score for each text prompt of the candidate answers, and the answer with the largest prefix-based score is selected as the final answer.
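- A hedged sketch of computing this score is shown below. The `encode_image`, `embed_tokens`, and `lm` interfaces are placeholders invented for illustration; every LVLM wires the visual prefix into its language model differently.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def prefix_based_score(lvlm, tokenizer, image, question, candidate):
    """Log-likelihood of `candidate` given the visual prefix + question.

    Assumed (illustrative) interfaces:
      lvlm.encode_image(image) -> visual prompt embeddings [1, V, D]
      lvlm.embed_tokens(ids)   -> text embeddings          [1, T, D]
      lvlm.lm(inputs_embeds=x) -> output with .logits      [1, V+T, vocab]
    """
    text = f"Question: {question} Answer: {candidate}"
    ids = tokenizer(text, return_tensors="pt").input_ids   # [1, T]
    vis = lvlm.encode_image(image)                          # [1, V, D]
    txt = lvlm.embed_tokens(ids)                            # [1, T, D]
    inputs = torch.cat([vis, txt], dim=1)                   # visual tokens as prefix
    logits = lvlm.lm(inputs_embeds=inputs).logits           # [1, V+T, vocab]
    logp = F.log_softmax(logits[:, :-1], dim=-1)            # next-token log-probs
    # Score only the answer tokens: text token t is predicted at position V + t - 1.
    ans_start = len(tokenizer(f"Question: {question} Answer:").input_ids)
    score = 0.0
    for t in range(ans_start, ids.shape[1]):
        score += logp[0, vis.shape[1] + t - 1, ids[0, t]].item()
    return score

# The candidate with the largest prefix-based score is the prediction:
# pred = max(candidates, key=lambda c: prefix_based_score(lvlm, tok, img, q, c))
```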
- Multi-turn Reasoning: Following IdealGPT, we use a multi-turn reasoning framework to evaluate complex visual reasoning tasks. Specifically, we utilize an LLM such as ChatGPT to generate sub-questions for a given question, an LVLM to provide the corresponding sub-answers, and another LLM to reason over the sub-answers and assess their quality. This pipeline iterates until a satisfactory answer is obtained.
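- The loop can be sketched as follows. The `chat_llm(prompt) -> str` and `lvlm.answer(image, question) -> str` call signatures, the prompts, and the stopping criterion are all simplified assumptions rather than the IdealGPT implementation.

```python
# Illustrative sketch of a multi-turn reasoning pipeline (IdealGPT-style).
# Assumed interfaces: chat_llm(prompt) -> str (e.g., a ChatGPT wrapper),
# lvlm.answer(image, question) -> str. Prompts are simplified placeholders.

def multi_turn_reasoning(chat_llm, lvlm, image, question, max_rounds: int = 4) -> str:
    history = []  # list of (sub_question, sub_answer) pairs
    for _ in range(max_rounds):
        # 1) An LLM decomposes the main question into sub-questions.
        sub_qs = chat_llm(
            f"Main question: {question}\nPrevious findings: {history}\n"
            "List the sub-questions still needed to answer the main question."
        ).splitlines()
        # 2) The LVLM answers each sub-question about the image.
        for sq in filter(None, map(str.strip, sub_qs)):
            history.append((sq, lvlm.answer(image, sq)))
        # 3) Another LLM call judges whether the evidence now suffices.
        verdict = chat_llm(
            f"Main question: {question}\nSub-answers: {history}\n"
            "If the evidence is sufficient, reply 'ANSWER: <final answer>'; "
            "otherwise reply 'CONTINUE'."
        )
        if verdict.startswith("ANSWER:"):
            return verdict.removeprefix("ANSWER:").strip()
    return chat_llm(f"Give your best final answer to: {question}\nEvidence: {history}")
```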
- User Study: Evaluating the quality of the text generated by an LVLM requires a thorough understanding of the underlying language and context. In embodied artificial intelligence tasks, the LVLM generates a plan for a given instruction, which should be evaluated along various aspects such as recognition accuracy and conciseness of the answers. It is hard to implement such an evaluation using an existing metric. Thus, user studies are conducted to assess the quality, relevance, and usefulness of the text generated by the LVLM in a specific context. To maintain evaluation fairness, we randomly shuffle the models' output order and anonymize outputs during evaluation.
- The user study is designed to be efficient while posing no potential risks to participants, and it does not include an IRB approval.
3. Experiment and Analysis
- We perform a zero-shot evaluation to assess the 6 categories of capabilities of LVLMs.
- 3-1 Results on Visual Perception
- (1) mPLUG-Owl and LLaVA perform best on coarse-grained classification tasks (i.e., ImageNet1K and CIFAR10). The commonality is that they update the LLM with 158K instruction-following data → both are trained with a large amount of instruction data.
- (2) InstructBLIP presents good perception ability in fine-grained ImgCls, OC, and MCI tasks. The main reason is that InstructBLIP may be fine-tuned on various existing VQA datasets, which may make it overfit on these tasks → it performs well across these tasks because it was trained on diverse VQA sets.
- (3) The performance of LVLMs on ImgCls is significantly inferior to the supervised SOTA, indicating plenty of room for improving LVLMs' perception ability → ImgCls performance falls far short of supervised SOTA, showing room for improvement.
- 3-2 Results on Visual Knowledge Acquisition
- We evaluate the acquisition of visual knowledge through various tasks, namely Optical Character Recognition (OCR), Key Information Extraction (KIE), and Image Captioning, all performed in a Visual Question Answering (VQA) fashion.
- In Table 3, we have the following observations.
- First, BLIP2, InstructBLIP, and VPGTrans achieve dominant performance in all tasks. This may be because these models use a large visual encoder (i.e., ViT-g/14) and a Q-Former updated with massive image-text pairs. A stronger visual encoder and adaptation module can extract better tokens that capture the global and local context, leading to remarkable improvement in visual knowledge acquisition.
- Second, InstructBLIP consistently presents the best results on all tasks. The main reason is that InstructBLIP overfits these tasks by fine-tuning on massive VQA data.
- 3-3 Results on Visual Reasoning
- Visual reasoning encompasses the ability to comprehensively understand images and perform cognitive tasks. In this section, we evaluate the visual reasoning ability of LVLMs on various tasks, including Visual Question Answering (VQA), Knowledge-Grounded Image Description (KGID), and Visual Entailment (VE) tasks.
- Table 4 shows the zero-shot performance in visual reasoning, and we have the following observations.
- First, compared with BLIP2, InstructBLIP again presents better results overall because it overfits many tasks by fine-tuning massive VQA data.
- Second, instruction-tuned LVLMs, except for InstructBLIP, generally perform worse than BLIP2. The common words in the instruction data often influence the generated content, which cannot be evaluated by the current metrics (see Appendix C).
- Third, instruction-tuned LVLMs consistently surpass BLIP2 on SNLI-VE where the final answer is obtained by multi-turn reasoning. It shows that instruction-following fine-tuning can produce promising content once a good evaluation scheme is employed.
- 3-4 Results on Visual Commonsense
- We use two challenging visual commonsense benchmarks in a zero-shot setting, namely ImageNetVC and Visual Commonsense Reasoning (VCR).
- In Table 5, we can see that all of these LVLMs exhibit some ability to solve visual commonsense problems.
- First, InstructBLIP performs best (68.41%) among these LVLMs on the ImageNetVC dataset. The main reason is that it is fine-tuned on 1.6M fine-grained VQA data, making it well adapted to answering visual commonsense questions. (Again, the conclusion boils down to simply having more data…)
- Second, LLaMA-Adapter V2 (46.20%) and LLaVA (46.20%) show the same best performance among these LVLMs on the VCR dataset. The main reason is that instruction-following data is used to update the LLM. Note that the final answer for VCR is obtained by multi-turn reasoning. This also shows the significant role of a good evaluation scheme in producing promising content for instruction-tuned models.
- 3-5 Results on Object Hallucination
- In this section, we focus on evaluating such object hallucination problems on the MSCOCO captioning dataset.
- Following the POPE evaluation pipeline, which is a multi-step QA procedure, we prompt LVLMs with multiple Yes-or-No questions, for example, 'Is there a person in the image?'. We use accuracy as the evaluation metric.
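- A minimal sketch of this polling step, using the same question template, is shown below. The `lvlm.answer` interface and the way negative objects are supplied are simplified assumptions; POPE's actual negative sampling strategies (random, popular, adversarial) are not reproduced here.

```python
# Minimal sketch of POPE-style Yes/No polling for object hallucination.
# samples: list of (image, ground_truth_objects, sampled_negative_objects);
# lvlm.answer(image, question) -> free-form text (assumed interface).

def pope_accuracy(lvlm, samples) -> float:
    correct, total = 0, 0
    for image, gt_objects, negative_objects in samples:
        # Poll both real objects (expected "yes") and nonexistent ones (expected "no").
        polls = [(obj, "yes") for obj in gt_objects] + \
                [(obj, "no") for obj in negative_objects]
        for obj, expected in polls:
            reply = lvlm.answer(image, f"Is there a {obj} in the image?")
            pred = "yes" if reply.strip().lower().startswith("yes") else "no"
            correct += int(pred == expected)
            total += 1
    return correct / max(total, 1)
```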
- InstructBLIP performs best on the hallucination problem, followed by BLIP2, with both reaching average accuracies of more than 80%. We find that instruction-tuned models, except for InstructBLIP, perform worse than BLIP2 because they tend to answer 'Yes' to the questions, which shows that LVLMs are prone to generating objects that frequently occur in the instruction data.
- Such object hallucination problems can be alleviated by a multi-turn reasoning pipeline, as shown in the experiments on SNLI-VE and VCR.
- 3-6 Results on Embodied Intelligence
- To appraise the effectiveness of planning outputs using the given image, we conducted a user study involving 15 participants.
- Specifically, the participants rated the generated plans from the different LVLM models using a scoring system similar to prior work.
- The evaluation comprised five dimensions with scores ranging from 1 to 5. These dimensions included object recognition accuracy, spatial relationship understanding, level of conciseness in the response, reasonability of the planning, and executability of the planning. The resulting average scores for the different models among the participants are presented in Table 7 below.
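- For concreteness, the aggregation behind such a table is simply a per-model, per-dimension mean over participants; a tiny sketch with made-up example data is shown below.

```python
# Hypothetical aggregation of user-study ratings (1-5) into per-model
# averages as reported in Table 7. The example data is made up.
from collections import defaultdict

DIMENSIONS = ["recognition", "spatial", "conciseness", "reasonability", "executability"]

def average_scores(ratings):
    """ratings: list of (participant_id, model, dimension, score in 1-5)."""
    scores = defaultdict(lambda: defaultdict(list))
    for _, model, dim, score in ratings:
        assert dim in DIMENSIONS, dim
        scores[model][dim].append(score)
    return {m: {d: sum(v) / len(v) for d, v in dims.items()} for m, dims in scores.items()}

example = [(1, "BLIP2", "executability", 2), (2, "BLIP2", "executability", 3)]
print(average_scores(example))  # {'BLIP2': {'executability': 2.5}}
```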
- We present quantitative evaluation results for Franka Kitchen, Minecraft, and Meta-World. Based on the evaluation results, we observe that visual instruction data is essential for embodied tasks. BLIP2 lacked visual instruction tuning, which greatly affected its capability to produce reasonable and executable plans.
- 3-7 Results on Online Arena Evaluation
- We have collected 634 pieces of evaluation data since we launched the LVLM Arena.
- The collected data shows almost the same number of battle outcomes for ‘Model A wins’ and ‘Model B wins.’ Moreover, 21.8% battle outcomes are voted as ‘both are bad,’ implying that the current LVLMs still struggle to generate good answers for open-world visual questions.
- Furthermore, we rank the selected 8 LVLMs by Elo rating using the collected data, following FastChat. As shown in Fig. 1 (b), mPLUG-Owl, MiniGPT-4, and Otter, which are fine-tuned with large amounts of instruction-following data while updating many parameters, are the top-3 models in the open-world VQA scenario, indicating the significance of instruction-following tuning and effective parameter updates.
- Moreover, InstructBLIP performs best on the in-domain capability evaluation while ranking much lower than many instruction-tuned models in the arena, implying a severe overfitting issue, as shown in Fig. 1.
- 3-8 Takeaway Analysis
- First, the quality of visual instruction data matters more than quantity in open-world VQA. We observe that MiniGPT-4, which is tuned with only 3.5K high-quality visual instruction samples, performs much better in our Multi-Modality Arena than InstructBLIP, which is tuned on visual instruction data adapted from various existing VQA datasets.
- Second, a strong visual encoder can help extract detailed information from the image, leading to good performance in OCR tasks. For instance, we see that BLIP2, InstructBLIP, and VPGTrans achieve better performance than the remaining 5 LVLMs. This may be because the visual encoder ViT-g/14 used in BLIP2, InstructBLIP, and VPGTrans is more powerful than ViT-L/14 employed in the remaining LVLMs.
- Third, multi-turn reasoning helps alleviate the hallucination issue, indicating that an evaluation method with critical thinking can induce the correct prediction from the model. We find that an LVLM with multi-turn reasoning can determine whether an object exists in the image more accurately than with single-turn reasoning. Hence, multi-turn reasoning is appropriate for assessing the full potential of the model.
- Fourth, LVLMs tuned with high-quality instruction-following data present more promising planning ability than models not tuned with instruction data, as demonstrated in Table 7.
4. Conclusion
- This paper proposes a comprehensive evaluation benchmark for large vision-language models called LVLM-eHub that incorporates both quantitative performance evaluation and human feedback evaluation.
- For the quantitative evaluation, we employ 16 tasks spanning over 40+ text-related visual datasets to assess the six essential capabilities of LVLM models. Additionally, we have established an online LVLM Arena to gather human feedback on LVLM models continually.
- This arena serves as an invaluable resource, providing an Elo rating rank that offers an LVLM ranking in the open-world scenario. Our evaluation results reveal several important findings, stimulating the future development of LVLMs. We will make ongoing efforts to build a platform for LVLM evaluation, as discussed above.