PaLI-3

코딩하는머글 2024. 6. 10. 00:32

Intro

PaLI-3 focuses on a smaller model than the earlier PaLI and PaLI-X (5B parameters, built on pretrained backbones).
Three main components:
- Contrastive pretraining of the image encoder on web-scale image-text data
- Improved dataset mixture for PaLI multimodal training
- Training at higher resolutions
New SOTA results on tasks that require visually-situated text understanding and object localization.
Using the SigLIP recipe, the authors also build a SOTA multilingual contrastive vision model (2B parameters).
Contributions:
- Compares classification pretrained ViT models with contrastively pretrained SigLIP models inside the PaLI framework → the latter perform better on visually-situated text understanding tasks and localization tasks.
- About 10x smaller than the SOTA model PaLI-X → still strong performance, especially on understanding visually-situated text.

Related Work

Image encoders pretrained in two ways are compared within the PaLI framework:
- classification pretraining using large weakly labeled datasets
- contrastive pretraining on web-scale noisy data
Scaling up the PaLI vision encoder from ViT-G (2B) to ViT-e (4B) improved performance on VL tasks.
Two multimodal understanding capabilities:
- Natural scene understanding (captioning, VQA, object detection/localization) → PaLI-17B
- Visually-situated text understanding (document and infographics QA) → Pix2struct
VLMs have mostly advanced by focusing on one of these task types, with a training recipe tailored to it.
PaLI-X, however, reaches SOTA in both categories → OCR-related training + a 55B-parameter model.
This paper therefore combines the benefits of a contrastively pretrained ViT with an improved, balanced training recipe in PaLI-3 → SOTA-level results in both categories at a 5B scale.

Model

1. Architecture

Figure 1: Overview of the PaLI-3 (5B) model: images are encoded into visual tokens individually by the contrastively pretrained 2B SigLIP vision model. Along with a query, these visual tokens are passed to a 3B encoder-decoder UL2 Transformer which produces the desired answer. In such a setup, a contrastively pretrained model provides significantly more useful tokens than a classification pretrained model as in previous PaLI models.
PaLI-3: the ViT encodes the image into tokens, which are passed together with the text (the question, prompt, instruction) through an encoder-decoder transformer that generates the text output.
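As a rough illustration of this data flow, the sketch below projects visual tokens to the language model width and places them in front of the text embeddings. It is a minimal sketch, not the paper's implementation: the function names, the tiny dimensions, and the plain-list matrices are all assumptions made for readability.

```python
def linear_project(tokens, weight):
    """Multiply each token (length d_in) by a d_in x d_out weight matrix."""
    d_in, d_out = len(weight), len(weight[0])
    return [
        [sum(tok[k] * weight[k][j] for k in range(d_in)) for j in range(d_out)]
        for tok in tokens
    ]


def build_encoder_input(visual_tokens, projection, text_embeddings):
    """Project visual tokens, then prepend them to the embedded text tokens."""
    projected = linear_project(visual_tokens, projection)
    return projected + text_embeddings


# Hypothetical toy sizes: one visual token of width 3, LM width 2.
visual_tokens = [[1.0, 2.0, 3.0]]
projection = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_embeddings = [[9.0, 9.0]]
encoder_input = build_encoder_input(visual_tokens, projection, text_embeddings)
```

The encoder-decoder then sees a single sequence in which the image comes first and the prompt follows.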
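PaLI-3 components:
- Visual component:
  - Initialized from a contrastively pretrained ViT-G/14 model (2B parameters) trained with the SigLIP recipe
  - A ViT-G/14 image embedding model and a text embedding transformer are trained to embed images and texts separately, and a binary classifier (sigmoid cross-entropy) decides whether an image and a text belong together → ~~similar to CLIP or ALIGN, but more effective / scalable / robust~~
- Full PaLI model:
  - The outputs of the ViT image encoder form the visual tokens, which are linearly projected and prepended to the embedded input text tokens.
  - The UL2 encoder-decoder LM receives a separate prompt for each task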
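The sigmoid cross-entropy objective mentioned for SigLIP can be sketched as follows. Every image-text pair in a batch is treated as an independent binary decision: matching pairs get label +1, all other pairs get label -1. This is a minimal pure-Python sketch; the `temperature` and `bias` values are illustrative assumptions here, and a real implementation would learn them and use batched tensor operations.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid_pairwise_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss; row i of each input is assumed to be a matching pair."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0  # +1 on the diagonal, -1 elsewhere
            logit = temperature * dot(imgs[i], txts[j]) + bias
            total += log_sigmoid(label * logit)
    return -total / n


images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = sigmoid_pairwise_loss(images, aligned_texts)
loss_swapped = sigmoid_pairwise_loss(images, swapped_texts)
```

Unlike a softmax-based contrastive loss, no normalization over the whole batch is needed, since each pair contributes its own binary term.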
2. Stages of training
- Training is similar to PaLI/PaLI-X and proceeds in multiple stages
- Stage 0: Unimodal pretraining
  - The image encoder is contrastively pretrained following the SigLIP training protocol
  - ~~PaLI/PaLI-X → used a JFT classification pretrained encoder~~
  - Unlike PaLI/PaLI-X, PaLI-3 applies a model-based filtering approach (similar to LAION-400M) that keeps 40% of the image-text pairs; the image encoder is trained at 224x224 resolution
    - LAION (filtering out unsuitable image-text pairs)
  - The text encoder-decoder is a 3B UL2 model trained following the mixture of denoisers procedure
- Stage 1: Multimodal training
  - The image encoder is initially kept frozen at its (224x224) resolution
  - WebLI is enriched with PDF documents and images with dense text to improve document and text understanding
- Stage 2: Resolution increase
  - The resolution is then increased, keeping PaLI-3 checkpoints at 812×812 and 1064×1064 resolution
  - The data mixture focuses on visually-situated text and object detection
- Task specialization (transfer)
  - Most tasks fine-tune the 812x812 resolution checkpoint; document understanding tasks go up to 1064x1064
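To get a feel for what the resolution increase costs, the arithmetic below counts visual tokens per image at each resolution used in the stages above. It assumes one token per non-overlapping 14x14 patch (the "/14" in ViT-G/14) and no extra tokens such as a class token; that token layout is an assumption of this sketch, not something stated in these notes.

```python
PATCH_SIZE = 14  # taken from the "/14" in ViT-G/14


def num_visual_tokens(resolution, patch_size=PATCH_SIZE):
    """Number of patches for a square image, one token per patch (assumed)."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of the patch size")
    per_side = resolution // patch_size
    return per_side * per_side


token_counts = {res: num_visual_tokens(res) for res in (224, 812, 1064)}
# 224 -> 16 x 16 = 256, 812 -> 58 x 58 = 3364, 1064 -> 76 x 76 = 5776
```

All three resolutions are exact multiples of 14, and under these assumptions the sequence of visual tokens grows by more than 20x between 224 and 1064.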

Experiments

Dataset strategy (link)
1. Classification or contrastively pretrained ViT?
- Setup: fixed 224×224 resolution (i.e. only Stage 1 is included) / 20% of the full PaLI-3 schedule
- SigLIP models provide moderate gains on "simpler" tasks such as captioning and question-answering, and large gains for more "complicated" scene-text and spatial understanding tasks such as TextVQA and RefCOCO → better in the harder cases, so the rest of the work proceeds with the contrastively pretrained encoder.
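2. Visually-situated text understanding
- Table legend: a: PaLI-X / b: PaLI / c: ERNIE-Layout / "†": additional VQA data similar to the target was used in training
- InfographicVQA and DocVQA are fine-tuned at 1064 / the rest at 812
- Scores go up with external OCR input. AI2D and ChartQA, however, require not only understanding but also strong reasoning over diagrams and charts, so there is no gain there → PaLI-3 falls slightly behind PaLI-X (PaLI-X's 32B language model is good at reasoning)
- As in PaLI-X, OCR tokens are obtained with GCP (Google vision API)
- Compared with prior SOTA: with OCR → -0.7 / without OCR → +4.4; on TextCaps, TextVQA, InfographicVQA and DocVQA → +8. On average, PaLI-3 with and without external OCR differ by only 1.8 points → suggests that the image encoder itself has learned OCR capability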
3. Referring Expression Segmentation
- A VQ-VAE is used to predict segmentation masks
  - PaLI-3 is fine-tuned on the combined training sets of RefCOCO, RefCOCO+, and RefCOCOg → each training example passes a referring expression together with a box and its segmentation mask
  - As a result, contrastive pretraining turns out to be more effective
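The notes only state that a VQ-VAE is used for the masks, so the following shows just the generic vector-quantization step that makes this possible: continuous latents are replaced by the index of their nearest codebook entry, and a sequence of such indices is something a language model can emit as tokens. The codebook, the latents, and the function names are hypothetical; the actual mask tokenizer in the paper is not reproduced here.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook vector."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]


def dequantize(indices, codebook):
    """Look the codebook vectors back up from discrete indices."""
    return [codebook[k] for k in indices]


# Hypothetical 3-entry codebook and three latent vectors of a mask encoder.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
latents = [[0.1, -0.2], [3.5, 4.2], [0.9, 1.3]]
mask_token_ids = quantize(latents, codebook)
reconstructed = dequantize(mask_token_ids, codebook)
```

A decoder network (omitted here) would turn the looked-up vectors back into a mask.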
4. Natural image understanding
- General vision-language understanding tasks (COCO, VQAv2, OKVQA, TallyQA)
- 812x812 resolution / no external OCR module → these images contain almost no text
- PaLI-3 falls behind on COCO and OKVQA
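5. Video Captioning and Question Answering
- PaLI-X samples 16 frames, passes each one independently through the ViT image encoder, and used such inputs in pretraining. PaLI-3 did not use multi-frame inputs in pretraining
- Under-performs only on the 3 CIDEr (captioning) scores
- Given the model size, it looks strong in both performance and practicality.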
6. Direct Image Encoder Evaluation
- The image encoder (ViT-G model) is evaluated without the LM
  - 1. image classification capabilities using the standard ImageNet benchmark
  - 2. multilingual image-text retrieval on the Crossmodal-3600 benchmark
  - 3. linear probing of the representation in the few-shot setting
  - The best and largest classification pretrained image encoders are better on standard classification tasks, but worse than SigLIP on VL tasks
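For the retrieval part of this evaluation, a common metric is recall@k over an image-text similarity matrix. The sketch below is a simplified version that assumes exactly one correct text per image, stored at the same index; real benchmarks can have several captions per image, and the similarity values here are made up for illustration.

```python
def recall_at_k(similarity, k=1):
    """similarity[i][j] scores query i against candidate j; the match is j == i."""
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Hypothetical scores for 3 images (rows) against 3 captions (columns).
similarity = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],
    [0.1, 0.2, 0.7],
]
r_at_1 = recall_at_k(similarity, k=1)
r_at_2 = recall_at_k(similarity, k=2)
```

Because only the image encoder and its paired text encoder are involved, this kind of metric isolates the quality of the contrastive representation from the language model.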
Model Fairness, Biases and Other Potential Issues
- 1. FairFace / MIAP are used to evaluate biases, fairness, and other potential issues
  - FairFace dataset → used to generate captions
  - Perspective API (with a set threshold) → measures toxicity and profanity, among other potential issues
- 2. Level of demographic parity in the model → whether outputs are independent of sensitive attributes
  - Uses the CelebA dataset → images are fed with a specific occupation as the prefix
  - Like PaLI-X, PaLI-3 tends to assign a higher log-perplexity score to women than to men, but this shows up for fewer occupations in PaLI-3 → a male-leaning tendency still exists
- 3. Performance across all subgroups on a detection task using the MIAP dataset
  - The error rate is very low (in all subgroups)
- Limitations (as in PaLI-X): fairness is a societal concept that a statistical metric cannot fully capture / automated tools are used for inference / no classification for sensitive attributes (gender, ethnicity, etc.) is provided / because CelebA and FairFace data are used, the diversity of gender presentation is not covered / toxicity measurement relies on machine-generated captions.
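The demographic parity check in item 2 compares log-perplexity scores between groups. A minimal sketch of that comparison is given below, assuming log-perplexity means the average negative log-probability per generated token; the per-token log-probabilities and group names are placeholders, not values from the paper.

```python
def avg_log_perplexity(token_log_probs):
    """Average negative log-probability per token of one generation."""
    return -sum(token_log_probs) / len(token_log_probs)


def group_means(scores_by_group):
    """Mean score per group, e.g. per perceived-gender subgroup."""
    return {g: sum(s) / len(s) for g, s in scores_by_group.items()}


def parity_gap(scores_by_group):
    """Largest difference between group means; 0 means the groups look alike."""
    means = group_means(scores_by_group)
    return max(means.values()) - min(means.values())


# Placeholder scores for one occupation prefix, two hypothetical groups.
scores = {
    "group_a": [avg_log_perplexity([-1.0, -1.0]), avg_log_perplexity([-1.0, -3.0])],
    "group_b": [avg_log_perplexity([-2.0, -2.0]), avg_log_perplexity([-3.0, -5.0])],
}
gap = parity_gap(scores)
```

Repeating this per occupation and counting how many occupations show a noticeable gap gives the kind of summary the notes describe.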
Conclusion
- classification pretraining vs contrastive pretraining → the latter yields better and more efficient VLMs, especially on localization and text understanding tasks
- A small (5B) VLM can still reach SOTA-level results
