3. A Guide to VLM Training

data curation pipeline으로 scaling law를 깨고 잘 학습시킬 수 있다.
VLM Training 시 Data, Grounding, alignment가 중요하다.

3. 1 Training Data

DataComp → Pretraining dataset에 대한 벤치마크
Data Pruning
- heuristic하게 low-quality pair 제거
- pre-trained VLM을 이용해서 랭킹 후 poorly aligned data pair를 모두 버림
- 다양하고 균형있는 데이터셋 만들기
  - Ranking based on pretrained VLM
  - CLIP-score : image와 text embedding 사이의 cosine similarity 계산
  - → image-text pair의 alignment를 rank할 수 있다.

Diversity and Balancing

diverse, well-balanced dataset은 일반화 성능을 높인다
text-based sampling: ImageNet class와 overlap되는 캡션이 있는 pair만 남김
image-based sampling : image를 encoding한 후 FAISS를 이용해 100,000 group으로 클러스터링
MetaCLIP : Wikipedia의 500,000 쿼리를 이용해 많은 개념을 담는 data distribution을 만들고자함
→ Web-crawled data 는 길 꼬리를 가진 분포이기에 Balanced dataset을 모으는 것 자체가 어렵다.
training data에서 많은 개념들을 커버해야 zero-shot abilities가 좋아진다.

3. 1. 1 Improving the training data with synthetic data

Bootstrapping Language-Image Pre-training(BLIP) : synthetic sample을 생성하고, noisy captions를 거르면서 부트스트랩
LLaVA를 캡셔닝 모델로 사용하면, text-to-image model 학습을 효율적으로 할 수 있음
text-to-image generative model을 이용한 연구도 있음
SynCLR, SynthCLIP 같은 경우는 LLM으로 caption을 만든 오직 합성 데이터만 사용함

3. 1. 2 Using data augmentation

SLIP → self-supervised loss term과 유사한 것 사용
- input image가 2개의 augmentation을 만들어서 positive pair를 만듬
- vision encoder에만 SSL loss가 사용됨
CLIP-rocket → SSL loss가 cross modal이 되도록
- CLIP contrastive loss가 image-text pair의 augmentation에 사용됨
- Strong projector(2 layer MLP, 매우 invariant한 representation) / Weak projector(CLIP) 각각의 projector를 사용

3. 1. 3 Interleaved data curation

Natural Interleaved data
- OBELICS : web에서 동시에 작성된 이미지와 text를 모으고, DOM을 이용해 정제, 사람이 추가 정제
Synthetic interleaved data
- MMC4 : text only data를 image로 개조함, CLIP similarity score로 pair를 맞춰줌

3. 1. 4 Assesing multimodal data quality

train되는 underlying data distribution을 밝히는 연구가 활발함.
아직 명확한 방법은 없으나, text itself(QRating), image itself(VILA, LAION), alignment(CLIP) 등 으로 측정

Harnessing human expertise : the power of data annotation

VLM을 더 뛰어나게 만들 수 있음
복잡한 관계를 알아내는 annotation을 가진 데이터로 학습한 모델은 더 복잡한 것을 이해하고 세밀한 캡션을 생성함
QKVQA, A-QKVQA 등
DCI dataset

3. 2 Software

3. 2. 1 Using existing public software repositories

OpenCLIP
hugginface/transformer

3. 2. 3 speeding up training

torch.compile
xformers : attention mechanism 효율적으로
FFCV : datafile을 load 빠르게

Masking

way to quickly improve the training efficiency
randomly masking image token → speed up training time

3. 2. 4 Importance of other hyper-parameters

image resolution, visual encoder capacity, visual pretraining data 등 중요

3. 3 Which model to use?

simple contrastive training
masking strategies to predict missing texts or image patches
generative paradigm
learning only a mapping between the LLM and vision encoder representations

3. 3. 1 When to use contrastive models like CLIP

CLIP text encoder를 이미지 retrieval으로 사용할 수 있다.
더 복잡한 모델을 만들기 위한 좋은 베이스임
CLIP은 generative model이 아니므로 caption을 생성할 수 없고, retrieval로만 사용할 수 있다.
밑바닥부터 훈련하기엔 리소스가 너무 큼

3. 3. 2 When to use masking

marked된 image와 text 전부에서 reconstruction을 학습하므로 그들의 distribution을 동시에 모델링 할 수 있다.
contrastive learning과 다르게 다시 input space로 representation space로 다시 매핑하는 decoder를 사용해야함
batch dependency가 없다.
대부분의 VLM은 masking과 contrastive learning을 동시에 사용한다.

3. 3. 3 When to use a generative model

diffusion이나 autoregressive criteria에 기반한 Generative model들은 text prompt로부터 photorealistic한 image를 만들 수 없다.
누군가는 world model이 되기 위해 생성이 필요하다하고, 누군가는 필요없다고 한다.
활용 관점에서는 abstract representation을 디코딩하면서 무엇을 배우는 지 알면 좋은데, CLIP은 word embedding을 받아 K-NN 등 추가적인 과정을 거쳐 유사한 이미지를 가져오지만, generative model은 바로 생성한다.
text와 image의 implicit joint distribution을 배울 수 있다.
그러나 계산량 비싸다.

3. 3. 4 When to use LLM on pretrained backbone

리소스가 제한될 때 사용
Mapping만 배우면 된다
이슈
1. LLM의 hallucination으로부터 영향받을 가능성
2. pretrain될 때 bias

3. 4 Improving grounding

grounding : 의미적인 연결을 구축하는 과정

풀고자하는 문제 : 프롬프트를 이해하지 못함 or 할루시네이션
object의 왼쪽 오른쪽에 있는 지, 방향, 개수 세기 등 관계를 이해하는 것과 관련 있음

3. 4. 1 Using bounding boxes annotations

visual concept과 textual concepts align하고 locate하기 위해 IoU(Intersection Over Union) loss 사용
image에서 object의 위치, caption과 일치하는 object들을 알게 되면서 모델이 관계를 더 잘 학습
예시 : X-VLM
kosmos-2 같은 모델 : web crawl image-text pair를 명사로 추출하고 GLIP 모델로 bounding box를 prediction
- detection 모델이 좋아야함.

3. 4. 2 Negative captioning

contrastive objectives에서 많이 사용됨 → collaps 없애기, generalization 성능 올리기 등
positive pair ↔ negative pair : 다른 class나 category를 구분할 수 있는 underlying pattern 포함
VLM에서도 negative sample 사용

3. 5 Improving alignment

Instruction tuning과 RLHF를 VLM에서도 사용함.
Instruction tuning : instruction, inputs, desired reponse가 담긴 supervised-data로 fine tuning
- 100~1000개의 샘플 ex) LLaVA
RLHF : align model output with human preference
- human preference를 흉내낼 수 있는 reward model. ex) LLaVA-RLHF

3. 5. 1 A LLaVA story

vicuna language model encoder + CLIP ViT-L/14 vision encoder
encoder output은 linear projector를 통해 same dimensional space로 보냄
LLaVA 1.5 → instruction tuning 올림, cross-modal fully connected layer
LLaVA-RLHF → vision instruction data의 부족 → Factually Augmented RLHF 알고리즘 적용
- text domain의 RLHF를 image caption 등 정보를 이용하여 vision language task에서도 사용
LLaVA-Next → SOTA
1. image resolution 올림
2. instruction tuning data mixture
3. 34B-parameters LLM backbone

3. 5. 2 Multimodal in-context learning

적은 샘플을 context로 주는 것 만으로도 학습 가능 (without extra fine-tuning)
- given instruction & image → generate answer

3. 6 Improving text-rich image understanding

multimodal LLM이 VLM의 zero-shot task를 가능하게 만듬
OCR학습 없이 OCR을 잘 할 수 있다.
그러나, 이미지에 있는 text를 해석하기는 어려워한다
- → 아마 natural image가 training data에서 대부분이기 때문일 것이다.
- 해결 방법
  - Insturction tuninig with fine-grained text-rich data : LLaVAR
  - Dealing with fine-grained text in high resolution images : Monkey
  - Decoupled Scene Text Recognition Module and MM-LLM : Lumos
  - scene text recognition(STR) module을 사용하여 MM-LLM에 feed

3. 7 Parameter-Efficient Fine-Tuning

전체 파라미터 보단 부분만 학습시켜서 downstream task에 적용

LoRa based
Prompt-based
Adapter-based
Mapping-based

LoRA-based

pure language model과 vision language model에 LoRA 적용
QLoRA : quntization을 추가
VeRA : performance 유지하면서 trainable parameter 더 줄임
DoRA : pre-trained weight을 magnitude와 direction으로 decompose

Prompt-based

Vision-language pre-training은 text와 image를 shared feature space에 보내고, 이것이 prompting을 통해서 zero-shot transfer를 가능하게 함.
Context Operation(CoOp) manual prompt 대신 learnable vector 사용

Adapter-based

Adapter는 pretrained network 사이에 더해지는 모듈
CLIP-Adapter : vision, language branch 각각에 feature adapter로 fine-tuning
→ weight sharing 연구 중 → full fine-tuning에 필적함
LLaMa-Adapter V2 : 더 많은 learnable parameter를 Unlock하고 visiual token에 통합.

Mapping-based

아키텍쳐에 대한 지식없이 더 간단한 방법
각 unimodal module(vision encoder & LLM)을 frozen하고 adapter layer없이 ‘매핑’만 학습

------------------------------------------------------------------------------------

참고 논문: "An Introduction to Vision-Language Modeling" (Florian Bordes et al., 2024).
출처: [arXiv:2405.17247](https://arxiv.org/abs/2405.17247)

'데이터 > Machine Learning' 카테고리의 다른 글

Adding Conditional Control to Text-to-Image Diffusion Models 논문 리뷰 (0)	2025.01.05
[논문 정리]Introduction to VLM(3/3) (1)	2024.11.24
[논문 정리]Introduction to VLM(1/3) (2)	2024.10.27
Transformer 기반 모델의 3가지 아키텍처(Encoders, Decoders, Encoder-Decoders)에 대해 알아보자 (1)	2024.04.14
한국어 사전학습 ELECTRA + 오픈소스 데이터로 돈안드는 긍부정 분류 모델 만들기 (2)	2024.02.18

알고싶코

[논문 정리]Introduction to VLM(2/3)

3. A Guide to VLM Training

3. 1 Training Data

3. 1. 1 Improving the training data with synthetic data

3. 1. 2 Using data augmentation

3. 1. 3 Interleaved data curation

3. 1. 4 Assesing multimodal data quality

3. 2 Software

3. 2. 1 Using existing public software repositories

3. 2. 3 speeding up training

3. 2. 4 Importance of other hyper-parameters

3. 3 Which model to use?

3. 3. 1 When to use contrastive models like CLIP

3. 3. 2 When to use masking

3. 3. 3 When to use a generative model

3. 3. 4 When to use LLM on pretrained backbone

3. 4 Improving grounding

3. 4. 1 Using bounding boxes annotations

3. 4. 2 Negative captioning

3. 5 Improving alignment

3. 5. 1 A LLaVA story

3. 5. 2 Multimodal in-context learning

3. 6 Improving text-rich image understanding

3. 7 Parameter-Efficient Fine-Tuning

'데이터 > Machine Learning' 카테고리의 다른 글

티스토리툴바

[논문 정리]Introduction to VLM(2/3)

3. A Guide to VLM Training

3. 1 Training Data

3. 1. 1 Improving the training data with synthetic data

3. 1. 2 Using data augmentation

3. 1. 3 Interleaved data curation

3. 1. 4 Assesing multimodal data quality

3. 2 Software

3. 2. 1 Using existing public software repositories

3. 2. 3 speeding up training

3. 2. 4 Importance of other hyper-parameters

3. 3 Which model to use?

3. 3. 1 When to use contrastive models like CLIP

3. 3. 2 When to use masking

3. 3. 3 When to use a generative model

3. 3. 4 When to use LLM on pretrained backbone

3. 4 Improving grounding

3. 4. 1 Using bounding boxes annotations

3. 4. 2 Negative captioning

3. 5 Improving alignment

3. 5. 1 A LLaVA story

3. 5. 2 Multimodal in-context learning

3. 6 Improving text-rich image understanding

3. 7 Parameter-Efficient Fine-Tuning

'데이터 > Machine Learning' 카테고리의 다른 글

'데이터/Machine Learning' Related Articles

티스토리툴바