Parrot Captions Teach CLIP to Spot Text

코딩하는머글 2024. 6. 10. 00:33

Abstract

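CLIP has a strong text spotting tendency inherited from its training data.
It ignores the visual semantics of an image and focuses only on text, for example by parroting the text embedded in the image.
In LAION-2B, a large share of captions are parrot captions, i.e. they spell out the text embedded in the images.
Measuring LAION-style image-text similarity → the visual text has the dominant influence on the score.
To check whether parrot captions really create the text spotting bias, the authors train CLIP models on LAION subsets curated by parrot-caption-oriented criteria.
- This turned out to be harmful for the learned visual representations.
Therefore, both the design of CLIP-like models and dataset curation pipelines built on CLIP score filtering need to be revisited.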

Introduction

CLIP is efficient and simple, but it is biased with respect to visual text, color, and gender.
This paper studies the visual text bias (text spotting).
When the dataset is clustered and the clusters are ranked by CLIP score, many of the top-scoring samples turn out to be images whose caption text also appears as pixels in the image.
This points to reliance on visual text rather than on text supervision (aligning visual and language concepts) → such captions are called Parrot Captions → aren't they teaching CLIP to merely spot text instead of perceiving the actual visual concepts?
3 perspectives for understanding the impact (dataset, models, model training)
- Captions in LAION-2B have a significant bias towards describing visual text content embedded in the images
- Released CLIP models have strong text spotting bias almost in every style of web images, resulting in the CLIP-filtering datasets inherently biased towards visual text dominant data.
  - LAION-2B was built by filtering on cosine similarities from OpenAI’s ViT-B/32 CLIP model → the analysis finds that the OpenCLIP model is even more biased toward text spotting than the original CLIP model
- CLIP models easily learn text spotting capacity from parrot captions while failing to connect the vision-language semantics, just like a text spotting parrot.
  - text spotting → leads to lower zero-shot performance on downstream tasks, so CLIP and data curation pipelines based on contrastive scores need to be revisited.

Related Work

Contrastive Vision-Language Pre-training
- LAION-2B (2B images) → OpenCLIP closely follows CLIP.
Studying of CLIP Behaviors
- Disentangling visual and written concepts → similar in spirit to this paper
- LoGoPrompt → uses the visual text content as additional information / no training
Data Curation with Text Removal
- DiHT → filters out image-text pairs with high OCR confidence and a high matching text ratio → improves the pre-training dataset
- This paper instead aims to expose the consequences of the data bias, analyze the effect of text spotting, and surface the problems that follow from it

Terminology

Data processing (clustering / text spotting (OCR) / text inpainting)
1. Cluster the images → to analyze images from diverse domains
2. Apply a text spotting model → detect and recognize the text
3. Match the spotted text with the caption (Algorithm 1: turn the caption and ocr_text into sets, then take the intersection)
4. Inpaint to remove the text for CLIP’s pattern ablation
Reference terms
- Embedded Text: text spotted by OCR models from the images. To study the correlation of embedded text with captions, we define different kinds of embedded text as,
  - All-Emb. Text: all the text is spotted from an image.
  - Co-Emb. Text: spotted text that concurrently appears in the image’s corresponding caption.
  - Syn-Emb. Text: synthetic text rendered in an image with a fixed font and a blank background.
- Co-Emb. Text Rate (CoTR): the word set IoU of Co-Emb. text and captions → Algorithm 1
- Parrot Caption: captions with CoTR > 0.
- Image w/ or w/o Embedded Text: spotted text results of a given image are non-empty or empty.
- Text Removal Image: do inpainting in the specific spotted text area (All-Emb., Co-Emb., or Random). The random variant is implemented by sampling other images' text areas.
- Relative Scores (RSA/RSC): the difference of the CLIP score between images modified by different inpainting operations while keeping the same captions. RSA and RSC are short for the relative scores before and after removing All-Emb. text and Co-Emb. text, respectively.
- Image Clusters: image partitions based on K-Means.
- CLIP and OpenCLIP: the CLIP models trained on the WIT-400M [30] and LAION-2B [33] datasets, respectively.
- N-gram Vocabulary (Vocab): the set of all contiguous N word sequences extracted from a text corpus, such as the collection of all captions or embedded text.
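
The CoTR and Parrot Caption definitions above can be sketched in a few lines of Python. This is a minimal sketch that follows the formula quoted later in these notes, `co_emb_text_rate = len(co_emb_text) / len(cap_words)`, so the intersection is normalized by the caption's word set. Whitespace tokenization and lowercasing are assumptions made here for illustration, and the function names are hypothetical.

```python
def co_emb_text_rate(caption: str, ocr_words: list[str]) -> float:
    """Fraction of distinct caption words that also appear in the spotted text."""
    # Assumption: whitespace tokenization and case-insensitive matching.
    cap_words = set(caption.lower().split())
    if not cap_words:
        return 0.0
    co_emb_text = cap_words & {w.lower() for w in ocr_words}
    return len(co_emb_text) / len(cap_words)


def is_parrot_caption(caption: str, ocr_words: list[str]) -> bool:
    """A parrot caption is any caption with CoTR > 0."""
    return co_emb_text_rate(caption, ocr_words) > 0


caption = "Keep Calm and Carry On poster"
ocr_words = ["KEEP", "CALM", "AND", "CARRY", "ON"]
print(co_emb_text_rate(caption, ocr_words))   # 5 of the 6 caption words are spotted
print(is_parrot_caption(caption, ocr_words))  # True
```

An image with no spotted text gives an empty `ocr_words` list and therefore a CoTR of 0.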

Profiling LAION-2B Data

Implementation Details
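- Clustering with CLIP Features: K-Means / large memory consumption → reduce dimensions with PCA / the whole dataset is partitioned with the same feature extraction pipeline and the trained K-Means
- Spotting model → DeepSolo (checkpoint ViTAEv2-S) → limitation: it cannot handle more than 100 distinct words per image → (this affects only ~2% of the dataset, so the authors proceed anyway)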
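
The "partition with the trained K-Means" step amounts to assigning every image feature to its nearest centroid. The sketch below shows only that assignment step in plain Python, assuming the PCA-reduced features and the trained centroids are already available; the 2-D toy values and function names are hypothetical and stand in for real CLIP features.

```python
import math


def assign_cluster(feature, centroids):
    """Return the index of the nearest centroid (Euclidean distance)."""
    distances = [math.dist(feature, c) for c in centroids]
    return distances.index(min(distances))


def partition(features, centroids):
    """Group sample indices by the cluster they are assigned to."""
    clusters = {i: [] for i in range(len(centroids))}
    for idx, feat in enumerate(features):
        clusters[assign_cluster(feat, centroids)].append(idx)
    return clusters


# Toy stand-ins for PCA-reduced CLIP features and trained K-Means centroids.
centroids = [(0.0, 0.0), (1.0, 1.0)]
features = [(0.1, -0.1), (0.9, 1.2), (0.2, 0.3)]
print(partition(features, centroids))  # {0: [0, 2], 1: [1]}
```

At LAION scale a library implementation would be used instead; the point here is only that new data is partitioned without refitting the clustering.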
Statistic and Observations from LAION-2B
- co_emb_text_rate = len(co_emb_text) / len(cap_words) → the rate over all images
- co_emb_text_rate w/ Emb. Text → the rate over only the images with at least one OCR result
- Fuzzy Co-Emb. Text Rate → ??
- To inspect the data distribution, all images are split into 100 clusters and then divided into 3 types according to the OCR results
  - Every cluster contains embedded text.
  - Parrot captions show up across a wide variety of scenes
  - around 60% of captions at least precisely parrot one concurrent word (Co-Emb. Text Rate > 0) → LAION is biased toward parrot captions
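
The statistics above can be illustrated on a toy dataset. This is a sketch under assumptions: the dataset-level rate is taken to be the mean of the per-sample rates (the notes do not say how it is aggregated), tokenization is by whitespace, and the four samples are invented for illustration.

```python
def cotr(caption, ocr_words):
    """Per-sample Co-Emb. Text Rate, normalized by the caption's word set."""
    cap_words = set(caption.lower().split())
    if not cap_words:
        return 0.0
    return len(cap_words & {w.lower() for w in ocr_words}) / len(cap_words)


def profile(samples):
    """samples: list of (caption, ocr_words) pairs."""
    n = len(samples)
    rates = [cotr(c, o) for c, o in samples]
    # Keep only the rates of images where the OCR model spotted something.
    rates_with_text = [r for r, (_, o) in zip(rates, samples) if o]
    return {
        "emb_text_fraction": len(rates_with_text) / n,
        "cotr_all": sum(rates) / n,
        "cotr_with_emb_text": (
            sum(rates_with_text) / len(rates_with_text) if rates_with_text else 0.0
        ),
        "parrot_fraction": sum(r > 0 for r in rates) / n,
    }


samples = [
    ("sale 50% off", ["sale", "50%", "off"]),  # caption fully parrots the image text
    ("a dog on grass", []),                    # no embedded text
    ("coffee shop sign", ["coffee"]),          # one concurrent word
    ("red car", ["stop"]),                     # embedded text, but not in the caption
]
print(profile(samples))
```

Restricting to images with embedded text can only raise the rate, since images without text always contribute 0.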

Inspecting Pre-Trained CLIP Models

LAION-2B dataset collection pipeline → uses the CLIP score from OpenAI’s model to filter out image-text pairs scoring below 0.28 → this is why LAION data has a high proportion of parrot captions
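
The filtering rule above is a threshold on the cosine similarity between an image embedding and a text embedding. The sketch below assumes the embeddings have already been computed by a CLIP model; the 2-D vectors, the dictionary layout, and the function names are hypothetical placeholders.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def clip_filter(pairs, threshold=0.28):
    """Keep image-text pairs whose CLIP score is at least the threshold."""
    return [
        p for p in pairs
        if cosine_similarity(p["image_emb"], p["text_emb"]) >= threshold
    ]


pairs = [
    {"id": "a", "image_emb": [1.0, 0.0], "text_emb": [0.6, 0.8]},  # score 0.6
    {"id": "b", "image_emb": [1.0, 0.0], "text_emb": [0.0, 1.0]},  # score 0.0
]
print([p["id"] for p in clip_filter(pairs)])  # ['a']
```

If the scoring model rewards visible text that matches the caption, a threshold like this preferentially keeps parrot captions, which is the concern raised in these notes.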
Ablation of Embedded Text Removal
- Text Removal via Inpainting: text is inpainted based on the OCR results. Because of the OCR model's limitations, both All-Emb. and Co-Emb. removal versions are generated; in addition, a version using randomly sampled spotted-text polygons is generated → diversifies the inpainting conditions
- Results: OCR and text inpainting → 6 types of LAION images
  - images embedded with text > images without embedded text (CLIP scores)
  - CLIP scores drop sharply after text removal
  - S(Co-Emb) - S(All-Emb) > 0 → presumably because the larger inpainted area loses more visual information, or because the text spotting predictions are imperfect
  - However, some samples reach a higher CLIP score after the embedded text is removed
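
The relative scores used in this ablation can be written out explicitly. The sketch assumes the sign convention "score before removal minus score after removal", so a large positive value means the CLIP score depended heavily on the removed text; the notes only say "difference", so this convention and the toy scores are assumptions.

```python
def relative_scores(s_original, s_all_removed, s_co_removed):
    """Relative scores for one image-caption pair (caption kept fixed).

    s_original:    CLIP score of the unmodified image
    s_all_removed: CLIP score after inpainting all embedded text (All-Emb.)
    s_co_removed:  CLIP score after inpainting only the Co-Emb. text
    """
    return {
        "RSA": s_original - s_all_removed,
        "RSC": s_original - s_co_removed,
    }


scores = relative_scores(s_original=0.35, s_all_removed=0.20, s_co_removed=0.24)
print(scores)

# A negative relative score corresponds to the samples whose CLIP score
# goes up once the embedded text is removed.
print(relative_scores(0.25, 0.27, 0.26)["RSA"] < 0)  # True
```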
Prompting with Syn-Emb. Text
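- OpenCLIP shows stronger text spotting than the CLIP model
- Both OpenCLIP and CLIP are sensitive to concurrent words
- The 1-gram and 2-gram results are grouped by text length
- Co-Emb. text is not laid out regularly in images, so contiguous word sequences are hard to extract from it
- All models are better at spotting longer words → possibly because the tokenizer used in the text encoder makes them easier to tell apart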
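
Building the 1-gram and 2-gram vocabularies and grouping them by text length can be sketched as follows. Whitespace tokenization, lowercasing, and measuring length in characters (including the space inside a 2-gram) are assumptions for illustration; the notes do not specify them.

```python
from collections import defaultdict


def ngram_vocab(texts, n):
    """Set of all contiguous n-word sequences found in the given texts."""
    vocab = set()
    for text in texts:
        words = text.lower().split()
        for i in range(len(words) - n + 1):
            vocab.add(" ".join(words[i:i + n]))
    return vocab


def group_by_length(vocab):
    """Group n-grams by their character length."""
    groups = defaultdict(set)
    for gram in vocab:
        groups[len(gram)].add(gram)
    return dict(groups)


captions = ["keep calm and carry on"]
unigrams = ngram_vocab(captions, 1)
bigrams = ngram_vocab(captions, 2)
print(sorted(bigrams))  # ['and carry', 'calm and', 'carry on', 'keep calm']
print(group_by_length(unigrams))
```

Each n-gram in a length group could then be rendered as Syn-Emb. text and scored against its own caption, which is the kind of probe this section describes.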

Training CLIP on Emb. Text Curated Data

Training CLIP models on LAION-2B subsets selected by different embedded-text-oriented criteria
Experiment Setups
- Evaluation: DataComp benchmark using 38 zero-shot classification and retrieval tasks as evaluation
- average performance of DataComp benchmark (Avg.) / ImageNet (IN) / Retrieval (Ret)
Ablation Study on Data Curation
- Curation I: Embedded Text in Images: impact of embedded text on overall pre-train data quality
  - images embedded with text → reduce the pretraining dataset quality
  - the model trained on the images embedded with text → the strongest text spotting capacity
- Curation II: Co-Emb. Text Rate (CoTR)
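  - As CoTR increases, performance on all zero-shot benchmarks drops sharply → parrot captions degrade pre-training data quality more than embedded text itself does
  - As CoTR increases, the text spotting capacity does not get stronger → possibly because the average caption length decreases in high-CoTR data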
- Curation III: Relative Score from Text Removal
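  - The data is split into subsets by comparing images before and after text removal: samples dominated by embedded text (RSA) or by parrot captions (RSC) have relatively high CLIP scores
  - CLIP models trained on high-RSA or high-RSC data perform badly on downstream tasks
  - The key point is that Avg.S correlates positively with RSA and RSC, so the CLIP score from a biased pre-trained model is hard to use as a data filtering strategy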
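
The three curation criteria can be pictured as simple predicates over per-sample metadata. This is only a sketch: the field names, the criterion names, and the 0.5 CoTR cut-off are hypothetical, and the 0.3 RSA cut-off is borrowed from the "RSA ≥ 0.3" setting mentioned in these notes.

```python
# Hypothetical per-sample metadata produced by the OCR and inpainting steps.
samples = [
    {"id": 1, "has_emb_text": False, "cotr": 0.0, "rsa": 0.00},
    {"id": 2, "has_emb_text": True,  "cotr": 0.0, "rsa": 0.05},
    {"id": 3, "has_emb_text": True,  "cotr": 0.6, "rsa": 0.35},
    {"id": 4, "has_emb_text": True,  "cotr": 1.0, "rsa": 0.10},
]

# One predicate per curation criterion (thresholds are illustrative).
CRITERIA = {
    "with_emb_text": lambda s: s["has_emb_text"],  # Curation I
    "no_parrot": lambda s: s["cotr"] == 0.0,       # Curation II, CoTR = 0
    "high_cotr": lambda s: s["cotr"] >= 0.5,       # Curation II, high CoTR
    "high_rsa": lambda s: s["rsa"] >= 0.3,         # Curation III
}


def curate(samples, criterion):
    """Return the subset of samples that satisfies the named criterion."""
    keep = CRITERIA[criterion]
    return [s for s in samples if keep(s)]


for name in CRITERIA:
    print(name, [s["id"] for s in curate(samples, name)])
```

Each subset would then be used to train a separate CLIP model and compared on the same benchmark.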
Ablation Study on Text-Oriented Tasks
- BLIP is chosen to evaluate tasks such as VQA, image captioning, and retrieval
- 10 pre-train / VQA → 10 finetune, Others → 5 finetune
- RSA ≥ 0.3 → Retrieval, Captioning, VQA → bad
- parrot captions → perform well on downstream tasks that only require reading the text in the image, but the data mix has to be chosen carefully.