VLM Paper Review

Llava vs GPT-4V

코딩하는머글 2024. 6. 10. 13:30

참고 블로그(https://encord.com/blog/gpt-vision-vs-llava/)

1. Architectural Difference

  • GPT-4 is primarily built upon a transformer-based design, where it excels in natural language understanding and generation. After training, the model is fine-tuned using reinforcement learning from human feedback. GPT-4 can process text and image inputs and generate text-based responses, unlike its predecessors where it can only process text prompts.
  • LLaVA, on the other hand, leverages the capabilities of Vicuna, an open-sourced chatbot trained by fine-tuning LLaMA and a visual model. For processing image inputs, LLaVA uses a pre-trained CLIP visual encoder which extracts visual features from the input images and links them to language embeddings of pre-trained LLaMA using an adaptable projection matrix. This projection effectively transforms visual elements into language embedding tokens, thereby establishing a connection between textual and visual data.

2. Performance Comparison to SOTA

  • GPT-4 and LLaVA are not compared on the same benchmark datasets.
  • GPT-4’s performance is evaluated on a narrow suite of standard academic vision benchmarks. Thorough benchmark assessments were performed, which encompassed simulated examinations originally designed for human candidates. These evaluations encompassed a range of tests, such as the Olympiads and AP exams.
  • GPT-4 not only outperforms existing models by a substantial margin in English but also exhibits robust performance in various other languages. When tested on translated versions of MMLU, GPT-4 outshines the English-language state-of-the-art in 24 out of the 26 languages considered.

  • LLaVA's performance comparison to SOTA reveals promising results across various benchmarks. In tasks like ScienceQA, LLaVA's accuracy closely rivals that of the SOTA model, showcasing its proficiency in comprehending visual content and delivering effective question answering, particularly for out-of-domain questions.
  • Moreover, LLaVA excels in a conversational context, demonstrating the ability to understand and respond to queries in a manner aligned with human intent. In an evaluation dataset containing 30 unseen images, LLaVA outperforms GPT-4 with an 85.1% relative score, affirming the effectiveness of the proposed self-instruct method in multimodal settings.
  • Despite being trained on a relatively small multimodal instruction-following dataset with approximately 80,000 unique images, LLaVA showcases strikingly similar reasoning abilities to multimodal GPT-4, as demonstrated through rigorous evaluation.
  • Surprisingly, in challenging scenarios where the prompts demand in-depth image understanding, LLaVA's performance closely aligns with that of multimodal GPT-4, even on out-of-domain images. LLaVA effectively comprehends the scenes and adeptly follows user instructions to provide relevant responses. In contrast, other models like BLIP-2 and OpenFlamingo tend to focus on describing the image rather than adhering to the user's instructions for answering appropriately. This highlights LLaVA's strong proficiency in instruction-following, positioning it as a highly competitive contender among multimodal AI models.

3. Performance on Various Computer Vision Tasks

  • Object Detection
    • Llava의 경우에는 작은 물체에 대해서 인간과 같이 인식하지 못하거나, misidentify하는 경향이 있다. 그러나 gpt는 상황에 맞게 잘 detection한다.
  • Sudoku and Crossword Puzzle
    • Both LLaVA and GPT-4 encounter challenges when tasked with solving a sudoku puzzle. LLaVA tends to struggle to comprehend the image and understand the task's nuances. On the other hand, GPT-4 exhibits an understanding of the task but often misinterprets the sudoku grid, resulting in consistently incorrect answers.
    • Conversely, when presented with a crossword puzzle, GPT-4 demonstrates a better grasp of the task and successfully solves the puzzle, albeit with occasional errors. LLaVA, however, takes a different approach by offering explanations on how to solve the puzzle rather than providing direct answers, reflecting its conversational instruction-following abilities.
  • OCR
    • While LLaVA encounters challenges in deciphering handwritten texts, it exhibits a commendable self-awareness regarding the underlying issues affecting its reading ability.
    • Despite not having the extensive training data available to GPT-4, LLaVA acknowledges its limitations and provides users with actionable recommendations for improved performance.
    • In contrast, GPT-4 demonstrates a higher proficiency in handling handwritten text, with only two minor errors detected in its interpretation.
    • Llava의 경우 90도 회전했을 때, 글자 읽기에 어려움이 있었음.
    • When confronted with text rotated beyond 90 degrees, LLaVA encounters difficulty in reading the text. Furthermore, neither of the chatbots demonstrates the capability to decipher overlapped text effectively.
  • Mathematical OCR and Reasoning
    • This illustrates GPT-4's proficiency in both mathematical Optical Character Recognition (OCR) and reasoning, highlighting an area where LLaVA falls short. GPT는 이유까지 잘 설명하고, 단계별 설명이 가능하지만, Llava의 경우에는 실패하는 경우가 많다. → step-by-step을 통해, GPT4에서 Curriculum or Incremental learning이 활용됐을 것으로 보임. 공식적이지는 X
  • VQA
    • LLaVA and GPT-4 excel in interpreting images, whether they're paintings or memes. They demonstrate a strong grasp of visual content and provide accurate responses to questions based on the images.
    • However, LLaVA struggles to deliver prompt and accurate answers in scenarios necessitating Optical Character Recognition (OCR). For instance, when presented with an image and tasked to provide answers based on the information extracted from it, LLaVA often furnishes misleading responses.
    • GPT-4는 효율적으로 정보를 추출하고 답변하지만, Llava는 잘못된 답을 전달하는 경향이 있다.
  • Science QA
    • Since both LLaVA and GPT-4 have been trained with a focus on academic content, they excel in the domain of science question answering. These models exhibit a strong capacity to grasp and interpret labeled diagrams, offering clear and comprehensive explanations.
  • Data Analysis
    • In data analysis, when presented with a graph, LLaVA primarily offers a description of the visual representation. In contrast, GPT-4 goes the extra mile by providing more elaborate insights, complete with observations derived from the data presented in the graph.

4. Performance on Prompt Injection Attacks

  • Prompt injection attacks란? involve manipulating the input or prompts given to AI models to generate responses that may be biased, harmful, or inappropriate. → Attackers insert specific language or instructions to influence the AI model's output in unintended ways, potentially causing misinformation or promoting harmful content.
  • Evaluating the multimodal AI chatbots' performance in handling prompt injections is crucial because it sheds light on their safety measures. Since these chatbots are accessible to the public, assessing their ability to resist manipulated prompts is of utmost importance.
  • Conflicted Text in Image: In the presence of text within an image, GPT-4 disregards the text prompt and follows the instruction contained in the image itself. Conversely, LLaVA sticks to the text input provided. → This difference in behavior is noteworthy, as it highlights a potential vulnerability when it comes to malicious or biased content injection into the chatbot's responses.
  • Embedding text within an image could serve as a mechanism for introducing inappropriate or harmful instructions to the AI model, as GPT-4 does not consider the textual content in such cases and may execute tasks that could be considered undesirable or problematic.
  • Hidden Text → 해당 부분도 마찬가지로, gpt는 구체적 설명, Llava는 묘사를 진행 Given that multimodal chatbots are capable of generating outputs based on the text within images, there exists a potential vulnerability whereby malicious information can be concealed within an image using embedded text. To ensure the responsible and safe use of these chatbots, it is imperative that they are trained and equipped to detect and handle such scenarios effectively.

  • GPT-4 and LLaVA represent two competing multimodal AI chatbots, each with its strengths and areas of improvement.
  • GPT-4 performs well in many computer vision tasks compared to LLaVA and OpenAI is constantly working on improving its security. However, its accessibility is limited and available for research upon request.
  • LLaVA's performance is noteworthy, especially given its training on a smaller dataset, and it is accessible to the public through open-sourcing. However, in the context of ongoing research on the security of AI chatbots, this accessibility may raise concerns.

'VLM Paper Review' 카테고리의 다른 글

PaLI-3  (0) 2024.06.10
LLaVA 1.5 - Improved Baselines with Visual Instruction Tuning  (0) 2024.06.09
BLIP2  (0) 2024.06.09
BLIP  (0) 2024.06.09
LLaVA  (0) 2024.06.09