0. Abstract
- Existing information retrieval (IR): current systems handle only narrow, fixed tasks, such as searching for images with text descriptions, searching for a news article with a headline image, or finding a similar photo with a query image.
- UniIR: a unified instruction-guided multimodal retriever that handles eight distinct retrieval tasks across modalities.
- To achieve this, it is trained jointly on 10 separate multimodal IR datasets.
- M-BEIR (a multimodal retrieval benchmark with comprehensive results) is introduced to standardize the evaluation of universal multimodal information retrieval.
1. Introduction
- IR has become an essential process in areas such as kNN retrieval and RAG.
- Models like CLIP are trained for specific domains and serve limited user needs, so a general-purpose neural retriever is needed.
- Hence the UniIR framework: a single retriever able to accomplish (possibly) any retrieval task.
- UniIR relies on instructions to retrieve candidates from different modalities.
- To train UniIR, the authors build M-BEIR: a benchmark of instruction-following multimodal retrieval tasks built on 10 existing diverse datasets, unifying their queries and targets into a single task formulation.
- Queries are curated so that each one defines the user's retrieval intention.
2. UniIR Framework
- Problem Definition
- query q: text (q_t) / image (q_i) / image-text pair (q_i, q_t)
- retrieval candidate c: text (c_t) / image (c_i) / image-text pair (c_i, c_t)
- Eight existing retrieval tasks arise from the combinations of these query and candidate modalities.
- language task instruction (q_inst): represents the intention of the retrieval task (what the search goal is, whether an image, text, or both should be retrieved, and which domain is relevant).
- unified retriever model f: takes any type of query and retrieves any type of target specified by the instruction q_inst, i.e. c* = argmax_{c ∈ C} f(q, q_inst, c) (a minimal sketch of this retrieval step follows below)
- C: the heterogeneous candidate pool
- f(·): the scoring function, optimized for maximum dot-product retrieval
- c*: the predicted (retrieved) candidate
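A minimal sketch of this retrieval step, assuming the query and candidates have already been embedded by f(·); the array shapes and function name are illustrative, not taken from the paper's code:

```python
# Maximum dot-product retrieval over a heterogeneous candidate pool.
import numpy as np

def retrieve_top1(query_vec: np.ndarray, candidate_vecs: np.ndarray) -> int:
    """Return the index of c* = argmax_{c in C} f(q, q_inst, c)."""
    scores = candidate_vecs @ query_vec   # (|C|,) dot-product similarity scores
    return int(np.argmax(scores))         # index of the best-scoring candidate

# Toy example with random embeddings standing in for f(.) outputs
rng = np.random.default_rng(0)
q = rng.standard_normal(512)              # fused query embedding
C = rng.standard_normal((1000, 512))      # heterogeneous candidate pool
best_idx = retrieve_top1(q, C)
```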
- UniIR Model
- The model is built and evaluated with two multimodal fusion mechanisms (score-level fusion & feature-level fusion).
- (a) Score-level Fusion
- The score-level fusion variants of CLIP and BLIP (CLIP_SF and BLIP_SF) employ distinct encoders for vision and text.
- The vision encoder f_I and the uni-modal text encoder f_T encode the image and text inputs into separate vectors.
- The vectors are combined by a weighted sum into a unified representation vector:
    - for queries: f(q_i, q_t, q_inst) = w1·f_I(q_i) + w2·f_T(q_t, q_inst)
    - for targets: f(c_i, c_t) = w3·f_I(c_i) + w4·f_T(c_t)
- The similarity score between a query q and a target c is then a weighted sum of the within-modality and cross-modality similarity scores.
- w1, w2, w3, w4 are learnable parameters that act as importance weights (a minimal sketch follows below).
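A minimal PyTorch sketch of score-level fusion under these definitions; the encoder interfaces, class name, and use of generic callables for f_I and f_T are assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn

class ScoreLevelFusion(nn.Module):
    def __init__(self, image_encoder, text_encoder):
        super().__init__()
        self.image_encoder = image_encoder   # f_I: images -> (B, dim) embeddings
        self.text_encoder = text_encoder     # f_T: text   -> (B, dim) embeddings
        # w1..w4: learnable importance weights for the weighted sums
        self.w = nn.Parameter(torch.ones(4))

    def encode_query(self, q_image, q_text_with_inst):
        # f(q_i, q_t, q_inst) = w1 * f_I(q_i) + w2 * f_T(q_inst + q_t)
        return self.w[0] * self.image_encoder(q_image) \
             + self.w[1] * self.text_encoder(q_text_with_inst)

    def encode_target(self, c_image, c_text):
        # f(c_i, c_t) = w3 * f_I(c_i) + w4 * f_T(c_t)
        return self.w[2] * self.image_encoder(c_image) \
             + self.w[3] * self.text_encoder(c_text)

    def score(self, query_vec, target_vec):
        # dot-product similarity between fused query and target vectors
        return query_vec @ target_vec.t()
```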
- (b) Feature-level Fusion
- Instead of processing each uni-modal input separately, this variant integrates features during the encoding phase.
- This fusion method computes a unified feature vector for multi-modal queries or candidates using mixed-modality attention layers.
- For CLIP feature-level fusion (CLIP_FF), the pre-trained vision encoder f_I and text encoder f_T are combined with a 2-layer multi-modal Transformer that acts as the mixed-modality encoder f_MIX.
- BLIP feature-level fusion (BLIP_FF) begins by extracting image embeddings with the vision encoder f_I; these embeddings are then integrated with text embeddings through the cross-attention layers of BLIP's image-grounded text encoder (f_MIX).
- In both CLIP_FF and BLIP_FF, the output of f_MIX is a single feature vector that combines information from both image and text modalities.
- The final representations are f_MIX(q_i, q_t, q_inst) for the query and f_MIX(c_i, c_t) for the target; the similarity score between query and target is the dot product of these two vectors (a rough sketch of a mixed-modality encoder follows below).
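A rough sketch of feature-level fusion: a small mixed-modality Transformer attends over concatenated image and text token features and pools them into one vector. The layer sizes, pooling token, and class name are assumptions, not the exact CLIP_FF/BLIP_FF configuration:

```python
import torch
import torch.nn as nn

class FeatureLevelFusion(nn.Module):
    def __init__(self, dim=512, n_heads=8, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                           batch_first=True)
        self.mix = nn.TransformerEncoder(layer, num_layers=n_layers)  # f_MIX
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))  # learned pooling token

    def forward(self, image_tokens, text_tokens):
        # image_tokens: (B, N_img, dim), text_tokens: (B, N_txt, dim)
        B = image_tokens.size(0)
        x = torch.cat([self.cls.expand(B, -1, -1), image_tokens, text_tokens], dim=1)
        x = self.mix(x)          # mixed-modality attention over both modalities
        return x[:, 0]           # unified query/candidate feature vector
```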
- All four model variants are fine-tuned with a query-target contrastive objective.
- To keep a consistent instruction-tuning format, q_inst is prepended to q_t as a prefix (i.e., the text input is q_inst + q_t).
- If either the image or the text input is missing, padding tokens are fed in instead (a sketch of the objective and prefix convention follows below).
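A sketch of the query-target contrastive objective (InfoNCE-style with in-batch negatives) together with the instruction-as-prefix convention; the temperature value, in-batch-negative choice, and helper names are assumptions, not the paper's exact settings:

```python
import torch
import torch.nn.functional as F

def build_query_text(q_inst: str, q_text: str) -> str:
    # q_inst is prepended to q_t; an empty string stands in when the text side
    # is missing (the actual model would feed padding tokens instead).
    return f"{q_inst} {q_text}".strip()

def contrastive_loss(query_embs: torch.Tensor,
                     target_embs: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """query_embs, target_embs: (B, dim); row i of each forms a positive pair."""
    q = F.normalize(query_embs, dim=-1)
    t = F.normalize(target_embs, dim=-1)
    logits = q @ t.t() / temperature                  # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    # All other in-batch targets act as negatives for each query.
    return F.cross_entropy(logits, labels)
```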
3. M-BEIR Benchmark
- 10 datasets organized into 8 tasks. Each task is accompanied by human-authored instructions, encompassing 1.5 million queries and a pool of 5.6 million retrieval candidates in total (test results presumably appear in Section 4, Experiments).
- Data format
- Each task in M-BEIR includes a set of queries Q = {q1, q2, ...} and a set of candidates C = {c1, c2, ...}, where both q and c can carry text and/or image modalities.
- A human-authored instruction q_inst is provided to specify the intent of the retrieval task. Each query instance in the M-BEIR dataset includes: a query q, an instruction q_inst, a list of relevant (positive) candidates c+, and, when available, a list of irrelevant (negative) candidates c−.
- Every M-BEIR query instance has at least one positive candidate and may have no negative candidates, so the default retrieval setting is to retrieve the positive candidates (a hypothetical instance layout is sketched below).
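A hypothetical illustration of one M-BEIR query instance, following the field list above; the keys, paths, and instruction string are made up for clarity and are not copied from the released data files:

```python
# Illustrative query instance (hypothetical values).
example_instance = {
    "query": {"text": "a dog catching a frisbee", "image": None},
    "instruction": "Retrieve an everyday image that matches the caption.",
    "positive_candidates": [{"text": None, "image": "images/000123.jpg"}],
    "negative_candidates": [],   # may be empty; at least one positive is guaranteed
}
```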
- Dataset Collection
- Covers four domains: everyday imagery, fashion items, Wikipedia entries, and news articles. It integrates 8 multimodal retrieval tasks by leveraging a variety of datasets.
- Data Selection
- retrieval-focused datasets: OVEN (tasks 6, 7), EDIS (task 3), CIRR (task 7), FashionIQ (task 7)
- image-caption datasets: MS-COCO (task 4), Fashion200K (tasks 1, 4), VisualNews (tasks 1, 4)
- image-similarity measurement dataset: NIGHTS (task 5)
- retrieval-based VQA datasets: InfoSeek (tasks 6, 8), WebQA (tasks 2, 3)
- All of these are repurposed as retrieval tasks within the M-BEIR benchmark.
- For the image-caption datasets, each image-caption pair is cast as a retrieval query-candidate pair, following MSCOCO.
- For the other datasets, the original queries are adopted; the annotated gold candidates are used as positive candidates c+ and the annotated hard negatives as irrelevant candidates c−.
- The candidate pool provided by the original dataset is adopted.
- M-BEIR covers 8 different multimodal retrieval tasks and 4 domains with a global pool of 5.6 million candidates.
- Instruction Annotation Guideline
- Each instruction describes a multimodal retrieval task by its intent, domain, query modality, and target candidate modality; instructions are written per dataset and task, reflecting their characteristics (see Table 1 and appendix Tables 18, 19, plus the hypothetical template sketched below).
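A hypothetical template showing how one instruction could encode the four elements named above (intent, domain, query modality, target modality); the wording is illustrative and is not the annotation used in the paper:

```python
# Illustrative instruction template (hypothetical wording).
def make_instruction(intent: str, domain: str, query_mod: str, target_mod: str) -> str:
    return (f"Given a {query_mod} query, {intent} "
            f"by retrieving a matching {target_mod} from the {domain} domain.")

print(make_instruction("find the described product", "fashion", "text", "image"))
```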
- Evaluation Metrics
- The standard retrieval evaluation metric Recall@k is used, as in MSCOCO evaluation.
- The recall implementation follows that of CLIP/BLIP on MSCOCO, which counts a retrieved instance as correct if it overlaps with the relevant instances (see the sketch after this list).
- Recall@5 is used for all datasets, except Fashion200K and FashionIQ, which use Recall@10.
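A minimal sketch of this Recall@k definition: a query counts as correct if any of its top-k retrieved candidates overlaps with its relevant (positive) set; function names and ID types are illustrative:

```python
def recall_at_k(retrieved_ids, positive_ids, k: int = 5) -> float:
    # 1.0 if any of the top-k retrieved candidates is a relevant instance
    return 1.0 if set(retrieved_ids[:k]) & set(positive_ids) else 0.0

def mean_recall_at_k(all_retrieved, all_positives, k: int = 5) -> float:
    # Average over all queries in the evaluation set
    scores = [recall_at_k(r, p, k) for r, p in zip(all_retrieved, all_positives)]
    return sum(scores) / len(scores)
```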
4. Experiments
- Evaluation under two retrieval scenarios
- (1) retrieving from the full M-BEIR pool of 5.6 million candidates, which combines the retrieval corpora of all tasks (does this correspond to a clustered FAISS index?)
- (2) retrieving from a task-specific pool (with homogeneous candidates) provided by the original dataset, which enables fair comparison with existing SoTA retrievers.
- The task-specific pool setting corresponds to M-BEIR local. The retrieval process involves a two-step pipeline (see the sketch after this list).
- 1 - extract multimodal feature vectors for all the queries and candidates in the pool.
- 2 - use FAISS to index the candidate vectors and run efficient similarity search in the dense vector space.
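A sketch of the two-step pipeline with FAISS, assuming the embeddings from step 1 are already available; a flat inner-product index and toy pool size are used here for simplicity, since the paper's exact index configuration is not specified in these notes:

```python
import faiss
import numpy as np

dim = 512
candidate_embs = np.random.rand(10000, dim).astype("float32")  # step 1 output (toy size)
query_embs = np.random.rand(8, dim).astype("float32")

index = faiss.IndexFlatIP(dim)       # exact maximum inner-product search
index.add(candidate_embs)            # index the candidate pool
scores, ids = index.search(query_embs, 5)   # step 2: retrieve top-5 per query
```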
- Baselines
- Zero-shot SoTA Retriever
- These models cannot understand the intent of the retrieval task since the input is only the query q, so they are expected to achieve low performance in the standard setting (1) with a heterogeneous candidate pool.
- Single/Multi-task Fine-tuned Baselines
- CLIP & BLIP baselines take in only q and c, trained with the query-target contrastive objective to maximize positive-pair similarity; the multi-task baseline retrievers are fine-tuned on the M-BEIR training data with the instructions removed.
- Experimental Results
- Zero-shot retrievers cannot comprehend retrieval intention
- BLIP2: recall on WebQA drops from 35.2% to 0% when the targets change from text snippets to image-text pairs.
- Similarly, for an EDIS query, BLIP2 retrieves distracting candidates from the wrong modality.
- Instruction-tuning improves retrieval on M-BEIR
- Performance improves on most tasks.
- Instruction-tuning does not significantly improve within-modality retrieval tasks (e.g., tasks 2 and 5), since these do not require the embedding model to understand intent.
- UniIR can precisely follow instructions
- Retrieval errors are classified into three categories: incorrect modality, incorrect domain, and other errors.
- Multi-task models showed high error rates of 58.8% and 50.9% for retrieving instances of the wrong modality.
- With instruction fine-tuning, these error rates drop significantly, to 2.7% and 15.2%.
- In Figure 4, the UniIR model retrieves the single correct pair; even when it is wrong, it still retrieves candidates with at least positive similarity to the query.
- UniIR can generalize to unseen retrieval tasks
- UniIR models outperform SoTA retriever baselines by a significant margin on held-out datasets during zero-shot evaluation.
- With instruction-tuning, UniIR models show superior generalization to unseen tasks and datasets compared to their multi-task counterparts trained without instructions.
- Aligning the fine-tuning model architecture with its pre-training is key to generalization.
- Comparison with Existing Methods
- UniIR vs Zero-shot Retrievers
- UniIR vs Single-task Tuning
- Generalization Performance on Held-Out Datasets and Tasks
6. Conclusion
- UniIR: a framework to build universal multimodal information retrieval models.
- M-BEIR benchmark: enables the training and evaluation of UniIR models.