AI Paper Review


코딩하는머글 2024. 6. 10. 14:07

Honeybee: Locality-enhanced Projector for Multimodal LLM

DINOv2, CLIP

LLaVA-1.6: Improved reasoning, OCR, and world knowledge

MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

TableVQA-Bench: A Visual Question Answering Benchmark on Multiple Table Domains

Interleaved Data (InternLM-XComposer2 & MiniGPT-5)

Image Retrieval & Knowledge-VQA (OK-VQA)

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

CogVLM: Visual Expert for Pretrained Language Models

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions 

LWM: World Model on Million-Length Video and Language with RingAttention

PaLI-3 Vision Language Models: Smaller, Faster, Stronger

Ferret: Refer and Ground Anything Anywhere at Any Granularity

RingAttention: Ring Attention with Blockwise Transformers for Near-Infinite Context

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

VisLingInstruct: Elevating Zero-Shot Learning in Multi-Modal Language Models with Autonomous Instruction Optimization

MoAI: Mixture of All Intelligence for Large Language and Vision Models

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

Synth^2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings

ImageBind: One Embedding Space To Bind Them All

SGLang: Efficient Execution of Structured Language Model Programs

OWL-ViT, OWLv2: Simple Open-Vocabulary Object Detection with Vision Transformers

ZeRO++: Parallel Training Optimization

Video-LLM & Benchmarks (Datasets)

VoT & DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM

Libra: a package for transformation of differential systems for multiloop integrals

Data Mixing