Data Review

UniIR


0. Abstract

  • Existing information retrieval (IR): current systems only handle simple, fixed tasks such as searching for images with text descriptions, searching for a news article with a headline image, or finding a similar photo with a query image
  • UniIR is a unified instruction-guided multimodal retriever capable of handling eight retrieval tasks across modalities
  • To this end, it is trained on 10 individual multimodal IR datasets
  • M-BEIR(a multimodal retrieval benchmark with comprehensive results) → standardize the evaluation of universal multimodal information retrieval

1. Introduction

  • IR: has become a necessary process (in areas such as kNN retrieval and RAG)
  • Models like CLIP are trained for specific domains and assume a fixed user need → a general-purpose neural retriever is required
  • Hence the UniIR framework → a single retriever able to accomplish (possibly) any retrieval task
  • UniIR relies on instructions in order to retrieve candidates across different modalities
  • To train UniIR → M-BEIR (a benchmark of instruction-following multimodal retrieval tasks, built on 10 existing diverse datasets and unifying their queries and targets in a unified task formulation)
  • Each query is paired with a curated instruction that defines the user's retrieval intention

2. UniIR Framework

  • Problem Definition (unifying the eight existing retrieval tasks)
    • query q: text (qt) / image (qi) / image-text pair (qi, qt)
    • retrieval candidate c: text (ct) / image (ci) / image-text pair (ci, ct)
    • language task instruction (qinst): represents the intention of the retrieval task (the search goal, whether an image, text, or both should be retrieved, and the relevant domain)
    • unified retriever model f capable of taking any type of query to retrieve any type of target specified by the instruction qinst: c∗ = argmax_{c ∈ C} f(q, qinst) · f(c)
      • C: denotes the heterogeneous candidate pool
      • f(·): the function optimized for maximum dot-product retrieval
      • c∗: predicted result
  • UniIR Model
    • The model is studied with two multimodal fusion mechanisms (score-level fusion & feature-level fusion); a minimal sketch appears at the end of this section
    • (a) Score-level Fusion
      • score-level fusion variants for CLIP and BLIP (CLIPSF and BLIPSF) employ distinct encoders for vision and text.
      • The vision encoder fI and the uni-modal text encoder fT encode the image and text inputs into separate vectors
      • These vectors are combined by a weighted sum to form a unified representation vector: for queries, f(qi, qt, qinst) = w1·fI(qi) + w2·fT(qt, qinst); for targets, f(ci, ct) = w3·fI(ci) + w4·fT(ct)
      • The similarity score between a query q and a target c is calculated as a weighted sum of the within-modality and cross-modality similarity scores
      • w1, w2, w3, w4 are learnable parameters that reflect importance weights
    • (b) Feature-level Fusion
      • Instead of processing uni-modal data separately, this approach integrates features during the encoding phase
      • This fusion method computes a unified feature vector for multi-modal queries or candidates using mixed-modality attention layers.
      • For CLIP feature-level fusion (CLIPFF) → the pre-trained vision encoder fI and text encoder fT are combined with a 2-layer multi-modal Transformer (mixed-modality encoder fMIX)
      • BLIP feature-level fusion (BLIPFF) → begins by extracting image embeddings with the vision encoder fI; these embeddings are then integrated with text embeddings through the cross-attention layers of BLIP’s image-grounded text encoder (fMIX)
      • In both CLIPFF and BLIPFF, the output from fMIX is a comprehensive feature vector that combines information from both image and text modalities.
      • The final representations for the query and the target are fMIX(qi, qt, qinst) and fMIX(ci, ct); the similarity score between the query and the target is the dot product of these two vectors
      • All four model variants above are fine-tuned with a query-target contrastive objective
      • To keep a consistent instruction-tuning format, qinst is prepended to qt as a prefix (qinst + qt)
      • If either the image or the text is missing, a padding token is used as the input
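
To make the bullets above concrete, here is a minimal PyTorch sketch of the score-level fusion idea (CLIPSF-style) together with an in-batch query-target contrastive loss. `vision_encoder`, `text_encoder`, and the temperature value are placeholders chosen for this sketch, not the authors' implementation; padding-token handling for missing modalities is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScoreLevelFusion(nn.Module):
    """Sketch of CLIPSF-style score-level fusion (not the official UniIR code).

    vision_encoder / text_encoder are assumed to be modules that map a batch
    of images / tokenized (instruction + text) inputs to fixed-size embeddings.
    """
    def __init__(self, vision_encoder: nn.Module, text_encoder: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.text_encoder = text_encoder
        # w1..w4: learnable importance weights for the weighted sums
        self.w = nn.Parameter(torch.ones(4))

    def encode_query(self, q_image, q_text_with_inst):
        # f(qi, qt, qinst) = w1 * fI(qi) + w2 * fT(qinst + qt)
        return self.w[0] * self.vision_encoder(q_image) + self.w[1] * self.text_encoder(q_text_with_inst)

    def encode_target(self, c_image, c_text):
        # f(ci, ct) = w3 * fI(ci) + w4 * fT(ct)
        return self.w[2] * self.vision_encoder(c_image) + self.w[3] * self.text_encoder(c_text)


def contrastive_loss(q_emb, c_emb, temperature=0.07):
    """In-batch query-target contrastive objective (InfoNCE-style sketch);
    the i-th query's positive target is assumed to be the i-th candidate."""
    logits = F.normalize(q_emb, dim=-1) @ F.normalize(c_emb, dim=-1).t() / temperature
    labels = torch.arange(q_emb.size(0), device=q_emb.device)
    return F.cross_entropy(logits, labels)
```

Feature-level fusion (CLIPFF / BLIPFF) would instead pass image and text features through the mixed-modality encoder fMIX and use its single output vector in place of the weighted sum.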

3. M-BEIR Benchmark

  • 10 datasets organized into 8 tasks / each task is accompanied by human-authored instructions, encompassing 1.5 million queries and a pool of 5.6 million retrieval candidates in total (test results presumably appear in Section 4, Experiments)
  • Data format
    • each task in M-BEIR includes queries Q = {q1 , q2 , ...}, a set of candidates C = {c1 , c2 , ...}, where q and c both support text and image modality
    • a human-authored instruction qinst is provided to specify the intent of the retrieval task. Each query instance in the M-BEIR dataset includes a query q / an instruction qinst / a list of relevant (positive) candidates c+ / a list of potentially available irrelevant (negative) candidates c−
    • Every M-BEIR query instance has at least one positive candidate and possibly no negative candidates → the default retrieval setting is to retrieve the positive candidates
  • Dataset Collection
    • Domains: everyday imagery / fashion items / Wikipedia entries / news articles. M-BEIR integrates 8 multimodal retrieval tasks by leveraging a variety of datasets.
    • Data Selection (the numbers after each dataset denote the M-BEIR task IDs it covers)
      • retrieval-focused datasets (OVEN 6 7, EDIS 3, CIRR 7, FashionIQ 7)
      • image-caption datasets (MS-COCO 4, Fashion200K 1 4, VisualNews 1 4)
      • image-similarity measurement dataset (NIGHTS 5)
      • retrieval-based VQA datasets (InfoSeek 6 8, WebQA 2 3)
      • repurposed as retrieval tasks within the M-BEIR benchmark
        • For the image-caption datasets, image-caption pairs are turned into retrieval tasks following MSCOCO.
        • For the other datasets, the original queries are adopted; the annotated gold candidates are used as positive candidates c+ and the annotated hard negatives as irrelevant candidates c−.
        • The provided candidate pool is adopted.
      • M-BEIR covers 8 different multimodal retrieval tasks and 4 domains with a global pool of 5.6 million candidates.
    • Instruction Annotation Guideline
      • An instruction describes a multimodal retrieval task by intent, domain, query modality, and target candidate modality → instructions are written for each dataset considering its characteristics and task (see Table 1 and appendix Tables 18, 19)
  • Evaluation Metrics
    • Standard retrieval evaluation metric: Recall@k (a toy example follows at the end of this section)
    • For MSCOCO, the recall implementation of CLIP/BLIP is followed, which counts a retrieved instance as correct if it overlaps with the relevant instances
    • Recall@5 for all datasets (except Fashion200K and FashionIQ → Recall@10)
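
As a toy illustration of the data format and metric described above, here is a hypothetical M-BEIR-style query instance and a per-query Recall@k check; the field names and example values are mine, not the official M-BEIR schema.

```python
# Hypothetical M-BEIR-style query instance (illustrative field names/values only).
example_instance = {
    "query": {"text": "a man riding a horse on the beach", "image": None},
    "instruction": "Retrieve an everyday image that matches the caption.",
    "positive_candidates": ["coco_img_000123"],  # at least one positive is guaranteed
    "negative_candidates": [],                    # may be empty
}

def recall_at_k(retrieved_ids, positive_ids, k=5):
    """Counts the query as correct (1.0) if any top-k retrieved candidate
    overlaps with its relevant (positive) candidates, else 0.0."""
    positives = set(positive_ids)
    return float(any(cand in positives for cand in retrieved_ids[:k]))

# Dataset-level Recall@k is the mean of this value over all queries.
print(recall_at_k(["coco_img_000999", "coco_img_000123"],
                  example_instance["positive_candidates"], k=5))  # -> 1.0
```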

4. Experiments

  • Evaluation under two retrieval scenarios
    • (1) retrieving from the M-BEIR 5.6 million candidate pool, which consists of the retrieval corpus from all tasks → (is this the case where the index is clustered in FAISS?)
    • (2) retrieving from a task-specific pool(with homogeneous candidates) provided by the original dataset, which is to enable fair comparison with existing SoTA retrievers.
    • task-specific pool == M-BEIR local / the retrieval process involves a two-step pipeline (a FAISS sketch appears at the end of this section)
      • 1 - extract multimodal feature vectors for all the queries and candidates in the pool.
      • 2 - utilize FAISS for efficient similarity searches in dense vector spaces, to index and retrieve candidates.
  • Baselines
    • Zero-shot SoTA Retriever
      • these models cannot understand the intent of the retrieval task since the input is only the query q → they are expected to achieve low performance in the standard setting (1) with a heterogeneous candidate pool
    • Single/Multi-task Fine-tuned Baselines
      • CLIP & BLIP: take only q and c, using the query-target contrastive training objective to maximize positive-pair similarity → the multi-task baseline retrievers are built by fine-tuning on the M-BEIR training data with instructions excluded
  • Experimental Results
    • Zero-shot retrievers cannot comprehend retrieval intention
      • BLIP2: recall on WebQA drops from 35.2% to 0% when the target modality switches from text snippets to image-text pairs
      • For EDIS as well, BLIP2 retrieves distracting candidates from the wrong modality for an EDIS query
    • Instruction-tuning improves retrieval on M-BEIR
      • Performance is good on most tasks
      • Instruction-tuning does not significantly improve within-modality retrieval tasks (like tasks 2 and 5), as these do not require the embedding model to understand intent
    • UniIR can precisely follow instructions
      • Retrieval errors are classified into three categories: incorrect modality / incorrect domain / other errors
      • The multi-task models show high error rates of 58.8% and 50.9% for retrieving instances of the wrong modality
      • With instruction fine-tuning, these error rates drop significantly to 2.7% and 15.2%
      • In Figure 4, the UniIR model retrieves the one correct pair; even when it is wrong, it still retrieves candidates with at least positive similarity
    • UniIR can generalize to unseen retrieval tasks
        1. UniIR models outperform SoTA retriever baselines by a significant margin on held-out datasets during zero-shot evaluation.
        2. With instruction-tuning, UniIR models show superior generalization on unseen tasks and datasets compared to their multi-task counterparts trained without instructions.
    • Aligning the model architecture with pre-training matters: fusion mechanisms that match the backbone's pre-training design tend to work better
  • Comparison with Existing Methods
    • UniIR vs Zero-shot Retrievers
    • UniIR vs Single-task Tuning
    • Generalization Performance on Held-Out Datasets and Tasks
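
Below is a minimal sketch of the two-step retrieval pipeline described above, assuming embeddings have already been produced by a trained retriever (random vectors stand in here); the flat inner-product index and L2 normalization are choices made for this sketch, not necessarily the exact FAISS configuration used in the paper.

```python
import numpy as np
import faiss  # pip install faiss-cpu

# Step 1 (assumed done elsewhere): extract multimodal feature vectors for all
# queries and candidates with the trained UniIR encoder. Random stand-ins here.
dim = 512
rng = np.random.default_rng(0)
candidate_vecs = rng.standard_normal((10_000, dim)).astype("float32")
query_vecs = rng.standard_normal((8, dim)).astype("float32")

# Normalize so inner product == cosine similarity (a common convention).
faiss.normalize_L2(candidate_vecs)
faiss.normalize_L2(query_vecs)

# Step 2: index the candidate pool and run dense similarity search with FAISS.
index = faiss.IndexFlatIP(dim)
index.add(candidate_vecs)

k = 5  # matches the Recall@5 evaluation setting
scores, ids = index.search(query_vecs, k)
print(ids.shape)  # (num_queries, k): indices of the top-k candidates per query
```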

6. Conclusion

  • UniIR: a framework to build universal multimodal information retrieval models.
  • M-BEIR benchmark: enable the training and evaluation of UniIR models.