Honeybee: Locality-enhanced Projector for Multimodal LLM
DINOv2 & CLIP
LLaVA-1.6: Improved reasoning, OCR, and world knowledge
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
TableVQA-Bench: A Visual Question Answering Benchmark on Multiple Table Domains
Interleaved data (InternLM-XComposer2 & MiniGPT-5)
Image Retrieval & Knowledge-VQA (OK-VQA)
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
CogVLM: Visual Expert for Pretrained Language Models
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
LWM: World Model on Million-Length Video and Language with RingAttention
PaLI-3 Vision Language Models: Smaller, Faster, Stronger
Ferret: Refer and Ground Anything Anywhere at Any Granularity
RingAttention: Ring Attention with Blockwise Transformers for Near-Infinite Context
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
VisLingInstruct: Elevating Zero-Shot Learning in Multi-Modal Language Models with Autonomous Instruction Optimization
MoAI: Mixture of All Intelligence for Large Language and Vision Models
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Synth^2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings
ImageBind: One Embedding Space To Bind Them All
SGLang: Efficient Execution of Structured Language Model Programs
OWL-ViT & OWLv2: Simple Open-Vocabulary Object Detection with Vision Transformers
ZeRO++: Parallel Training Optimization
Video-LLM & Benchmarks (Datasets)
VoT & DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM
Libra: Building Decoupled Vision System on Large Language Models
Data Mixing