0. Abstract
- Existing information retrieval (IR): current systems handle only narrow, fixed tasks, such as searching for images with text descriptions, searching for a news article with a headline image, or finding a similar photo with a query image.
- UniIR: a unified instruction-guided multimodal retriever that handles eight distinct retrieval tasks across modalities.
- To achieve this, it is trained jointly on 10 separate multimodal IR datasets.
- M-BEIR (a multimodal retrieval benchmark with comprehensive results) is introduced to standardize the evaluation of universal multimodal information retrieval.
1. Introduction
- IR has become an essential process in areas such as kNN retrieval and RAG.
- Models like CLIP are trained for specific domains and serve limited user needs, so a general-purpose neural retriever is needed.
- Hence the UniIR framework: a single retriever able to accomplish (possibly) any retrieval task.
- UniIR relies on instructions to retrieve candidates from different modalities.
- To train UniIR, the authors build M-BEIR: a benchmark of instruction-following multimodal retrieval tasks built on 10 existing diverse datasets, unifying their queries and targets into a single task formulation.
- Queries are curated so that each one defines the user's retrieval intention.
2. UniIR Framework
- Problem Definition
- query q: text (q_t) / image (q_i) / image-text pair (q_i, q_t)
- retrieval candidate c: text (c_t) / image (c_i) / image-text pair (c_i, c_t)
- Eight existing retrieval tasks arise from the combinations of these query and candidate modalities.
- language task instruction (q_inst): represents the intention of the retrieval task (what the search goal is, whether an image, text, or both should be retrieved, and which domain is relevant).
- unified retriever model f: takes any type of query and retrieves any type of target specified by the instruction q_inst, i.e. c* = argmax_{c ∈ C} f(q, q_inst, c) (a minimal sketch of this retrieval step follows below)
- C: the heterogeneous candidate pool
- f(·): the scoring function, optimized for maximum dot-product retrieval
- c*: the predicted (retrieved) candidate
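A minimal sketch of this retrieval step, assuming the query and candidates have already been embedded by f(·); the array shapes and function name are illustrative, not taken from the paper's code:

```python
# Maximum dot-product retrieval over a heterogeneous candidate pool.
import numpy as np

def retrieve_top1(query_vec: np.ndarray, candidate_vecs: np.ndarray) -> int:
    """Return the index of c* = argmax_{c in C} f(q, q_inst, c)."""
    scores = candidate_vecs @ query_vec   # (|C|,) dot-product similarity scores
    return int(np.argmax(scores))         # index of the best-scoring candidate

# Toy example with random embeddings standing in for f(.) outputs
rng = np.random.default_rng(0)
q = rng.standard_normal(512)              # fused query embedding
C = rng.standard_normal((1000, 512))      # heterogeneous candidate pool
best_idx = retrieve_top1(q, C)
```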
- UniIR Model
- The model is built and evaluated with two multimodal fusion mechanisms (score-level fusion & feature-level fusion).
- (a) Score-level Fusion
- The score-level fusion variants of CLIP and BLIP (CLIP_SF and BLIP_SF) employ distinct encoders for vision and text.
- The vision encoder f_I and the uni-modal text encoder f_T encode the image and text inputs into separate vectors.
- The vectors are combined by a weighted sum into a unified representation vector:
    - for queries: f(q_i, q_t, q_inst) = w1·f_I(q_i) + w2·f_T(q_t, q_inst)
    - for targets: f(c_i, c_t) = w3·f_I(c_i) + w4·f_T(c_t)
- The similarity score between a query q and a target c is then a weighted sum of the within-modality and cross-modality similarity scores.
- w1, w2, w3, w4 are learnable parameters that act as importance weights (a minimal sketch follows below).
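A minimal PyTorch sketch of score-level fusion under these definitions; the encoder interfaces, class name, and use of generic callables for f_I and f_T are assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn

class ScoreLevelFusion(nn.Module):
    def __init__(self, image_encoder, text_encoder):
        super().__init__()
        self.image_encoder = image_encoder   # f_I: images -> (B, dim) embeddings
        self.text_encoder = text_encoder     # f_T: text   -> (B, dim) embeddings
        # w1..w4: learnable importance weights for the weighted sums
        self.w = nn.Parameter(torch.ones(4))

    def encode_query(self, q_image, q_text_with_inst):
        # f(q_i, q_t, q_inst) = w1 * f_I(q_i) + w2 * f_T(q_inst + q_t)
        return self.w[0] * self.image_encoder(q_image) \
             + self.w[1] * self.text_encoder(q_text_with_inst)

    def encode_target(self, c_image, c_text):
        # f(c_i, c_t) = w3 * f_I(c_i) + w4 * f_T(c_t)
        return self.w[2] * self.image_encoder(c_image) \
             + self.w[3] * self.text_encoder(c_text)

    def score(self, query_vec, target_vec):
        # dot-product similarity between fused query and target vectors
        return query_vec @ target_vec.t()
```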
- (b) Feature-level Fusion
- Instead of processing each uni-modal input separately, this variant integrates features during the encoding phase.
- This fusion method computes a unified feature vector for multi-modal queries or candidates using mixed-modality attention layers.
- For CLIP feature-level fusion (CLIP_FF), the pre-trained vision encoder f_I and text encoder f_T are combined with a 2-layer multi-modal Transformer that acts as the mixed-modality encoder f_MIX.
- BLIP feature-level fusion (BLIP_FF) begins by extracting image embeddings with the vision encoder f_I; these embeddings are then integrated with text embeddings through the cross-attention layers of BLIP's image-grounded text encoder (f_MIX).
- In both CLIP_FF and BLIP_FF, the output of f_MIX is a single feature vector that combines information from both image and text modalities.
- The final representations are f_MIX(q_i, q_t, q_inst) for the query and f_MIX(c_i, c_t) for the target; the similarity score between query and target is the dot product of these two vectors (a rough sketch of a mixed-modality encoder follows below).
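A rough sketch of feature-level fusion: a small mixed-modality Transformer attends over concatenated image and text token features and pools them into one vector. The layer sizes, pooling token, and class name are assumptions, not the exact CLIP_FF/BLIP_FF configuration:

```python
import torch
import torch.nn as nn

class FeatureLevelFusion(nn.Module):
    def __init__(self, dim=512, n_heads=8, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                           batch_first=True)
        self.mix = nn.TransformerEncoder(layer, num_layers=n_layers)  # f_MIX
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))  # learned pooling token

    def forward(self, image_tokens, text_tokens):
        # image_tokens: (B, N_img, dim), text_tokens: (B, N_txt, dim)
        B = image_tokens.size(0)
        x = torch.cat([self.cls.expand(B, -1, -1), image_tokens, text_tokens], dim=1)
        x = self.mix(x)          # mixed-modality attention over both modalities
        return x[:, 0]           # unified query/candidate feature vector
```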
- All four model variants are fine-tuned with a query-target contrastive objective.
- To keep a consistent instruction-tuning format, q_inst is prepended to q_t as a prefix (i.e., the text input is q_inst + q_t).
- If either the image or the text input is missing, padding tokens are fed in instead (a sketch of the objective and prefix convention follows below).
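A sketch of the query-target contrastive objective (InfoNCE-style with in-batch negatives) together with the instruction-as-prefix convention; the temperature value, in-batch-negative choice, and helper names are assumptions, not the paper's exact settings:

```python
import torch
import torch.nn.functional as F

def build_query_text(q_inst: str, q_text: str) -> str:
    # q_inst is prepended to q_t; an empty string stands in when the text side
    # is missing (the actual model would feed padding tokens instead).
    return f"{q_inst} {q_text}".strip()

def contrastive_loss(query_embs: torch.Tensor,
                     target_embs: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """query_embs, target_embs: (B, dim); row i of each forms a positive pair."""
    q = F.normalize(query_embs, dim=-1)
    t = F.normalize(target_embs, dim=-1)
    logits = q @ t.t() / temperature                  # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    # All other in-batch targets act as negatives for each query.
    return F.cross_entropy(logits, labels)
```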
3. M-BEIR Benchmark
- 10 datasets organized into 8 tasks. Each task is accompanied by human-authored instructions, encompassing 1.5 million queries and a pool of 5.6 million retrieval candidates in total (test results presumably appear in Section 4, Experiments).
- Data format
- Each task in M-BEIR includes a set of queries Q = {q1, q2, ...} and a set of candidates C = {c1, c2, ...}, where both q and c can carry text and/or image modalities.
- A human-authored instruction q_inst is provided to specify the intent of the retrieval task. Each query instance in the M-BEIR dataset includes: a query q, an instruction q_inst, a list of relevant (positive) candidates c+, and, when available, a list of irrelevant (negative) candidates c−.
- Every M-BEIR query instance has at least one positive candidate and may have no negative candidates, so the default retrieval setting is to retrieve the positive candidates (a hypothetical instance layout is sketched below).
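A hypothetical illustration of one M-BEIR query instance, following the field list above; the keys, paths, and instruction string are made up for clarity and are not copied from the released data files:

```python
# Illustrative query instance (hypothetical values).
example_instance = {
    "query": {"text": "a dog catching a frisbee", "image": None},
    "instruction": "Retrieve an everyday image that matches the caption.",
    "positive_candidates": [{"text": None, "image": "images/000123.jpg"}],
    "negative_candidates": [],   # may be empty; at least one positive is guaranteed
}
```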
- Dataset Collection
- Covers four domains: everyday imagery, fashion items, Wikipedia entries, and news articles. It integrates 8 multimodal retrieval tasks by leveraging a variety of datasets.
- Data Selection
- retrieval-focused datasets: OVEN (tasks 6, 7), EDIS (task 3), CIRR (task 7), FashionIQ (task 7)
- image-caption datasets: MS-COCO (task 4), Fashion200K (tasks 1, 4), VisualNews (tasks 1, 4)
- image-similarity measurement dataset: NIGHTS (task 5)
- retrieval-based VQA datasets: InfoSeek (tasks 6, 8), WebQA (tasks 2, 3)
- All of these are repurposed as retrieval tasks within the M-BEIR benchmark.
- For the image-caption datasets, each image-caption pair is cast as a retrieval query-candidate pair, following MSCOCO.
- For the other datasets, the original queries are adopted; the annotated gold candidates are used as positive candidates c+ and the annotated hard negatives as irrelevant candidates c−.
- The candidate pool provided by the original dataset is adopted.
- M-BEIR covers 8 different multimodal retrieval tasks and 4 domains with a global pool of 5.6 million candidates.
- Instruction Annotation Guideline
- Each instruction describes a multimodal retrieval task by its intent, domain, query modality, and target candidate modality; instructions are written per dataset and task, reflecting their characteristics (see Table 1 and appendix Tables 18, 19, plus the hypothetical template sketched below).
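A hypothetical template showing how one instruction could encode the four elements named above (intent, domain, query modality, target modality); the wording is illustrative and is not the annotation used in the paper:

```python
# Illustrative instruction template (hypothetical wording).
def make_instruction(intent: str, domain: str, query_mod: str, target_mod: str) -> str:
    return (f"Given a {query_mod} query, {intent} "
            f"by retrieving a matching {target_mod} from the {domain} domain.")

print(make_instruction("find the described product", "fashion", "text", "image"))
```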
- Evaluation Metrics
- The standard retrieval evaluation metric Recall@k is used, as in MSCOCO evaluation.
- The recall implementation follows that of CLIP/BLIP on MSCOCO, which counts a retrieved instance as correct if it overlaps with the relevant instances (see the sketch after this list).
- Recall@5 is used for all datasets, except Fashion200K and FashionIQ, which use Recall@10.
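A minimal sketch of this Recall@k definition: a query counts as correct if any of its top-k retrieved candidates overlaps with its relevant (positive) set; function names and ID types are illustrative:

```python
def recall_at_k(retrieved_ids, positive_ids, k: int = 5) -> float:
    # 1.0 if any of the top-k retrieved candidates is a relevant instance
    return 1.0 if set(retrieved_ids[:k]) & set(positive_ids) else 0.0

def mean_recall_at_k(all_retrieved, all_positives, k: int = 5) -> float:
    # Average over all queries in the evaluation set
    scores = [recall_at_k(r, p, k) for r, p in zip(all_retrieved, all_positives)]
    return sum(scores) / len(scores)
```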
4. Experiments
- Evaluation under two retrieval scenarios
- (1) retrieving from the full M-BEIR pool of 5.6 million candidates, which combines the retrieval corpora of all tasks (does this correspond to a clustered FAISS index?)
- (2) retrieving from a task-specific pool (with homogeneous candidates) provided by the original dataset, which enables fair comparison with existing SoTA retrievers.
- The task-specific pool setting corresponds to M-BEIR local. The retrieval process involves a two-step pipeline (see the sketch after this list).
- 1 - extract multimodal feature vectors for all the queries and candidates in the pool.
- 2 - use FAISS to index the candidate vectors and run efficient similarity search in the dense vector space.
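A sketch of the two-step pipeline with FAISS, assuming the embeddings from step 1 are already available; a flat inner-product index and toy pool size are used here for simplicity, since the paper's exact index configuration is not specified in these notes:

```python
import faiss
import numpy as np

dim = 512
candidate_embs = np.random.rand(10000, dim).astype("float32")  # step 1 output (toy size)
query_embs = np.random.rand(8, dim).astype("float32")

index = faiss.IndexFlatIP(dim)       # exact maximum inner-product search
index.add(candidate_embs)            # index the candidate pool
scores, ids = index.search(query_embs, 5)   # step 2: retrieve top-5 per query
```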
- Baselines
- Zero-shot SoTA Retriever
- These models cannot understand the intent of the retrieval task since the input is only the query q, so they are expected to achieve low performance in the standard setting (1) with a heterogeneous candidate pool.
- Single/Multi-task Fine-tuned Baselines
- CLIP & BLIP baselines take in only q and c, trained with the query-target contrastive objective to maximize positive-pair similarity; the multi-task baseline retrievers are fine-tuned on the M-BEIR training data with the instructions removed.
- Experimental Results
- Zero-shot retrievers cannot comprehend retrieval intention
- BLIP2: recall on WebQA drops from 35.2% to 0% when the targets change from text snippets to image-text pairs.
- Similarly, for an EDIS query, BLIP2 retrieves distracting candidates from the wrong modality.
- Instruction-tuning improves retrieval on M-BEIR
- Performance improves on most tasks.
- Instruction-tuning does not significantly improve within-modality retrieval tasks (e.g., tasks 2 and 5), since these do not require the embedding model to understand intent.
- UniIR can precisely follow instructions
- Retrieval errors are classified into three categories: incorrect modality, incorrect domain, and other errors.
- Multi-task models showed high error rates of 58.8% and 50.9% for retrieving instances of the wrong modality.
- With instruction fine-tuning, these error rates drop significantly, to 2.7% and 15.2%.
- In Figure 4, the UniIR model retrieves the single correct pair; even when it is wrong, it still retrieves candidates with at least positive similarity to the query.
- UniIR can generalize to unseen retrieval tasks
- UniIR models outperform SoTA retriever baselines by a significant margin on held-out datasets during zero-shot evaluation.
- With instruction-tuning, UniIR models show superior generalization to unseen tasks and datasets compared to their multi-task counterparts trained without instructions.
- Aligning the fine-tuning model architecture with its pre-training is key to generalization.
- Comparison with Existing Methods
- UniIR vs Zero-shot Retrievers
- UniIR vs Single-task Tuning
- Generalization Performance on Held-Out Datasets and Tasks
6. Conclusion
- UniIR: a framework to build universal multimodal information retrieval models.
- M-BEIR benchmark: enables the training and evaluation of UniIR models.