[Paper Summary] Introduction to VLM (3/3)

성장하기 2024. 11. 24. 21:41

4. Approaches for Responsible VLM Evaluation

It is important to estimate visio-linguistic abilities (how well words map to visual clues)
Consider VQA, zero-shot prediction, bias, hallucination, etc.

4. 1 Benchmarking visio-linguistic abilities

Evaluates the ability to associate specific words or phrases with the matching visual clues

4. 1. 1 Image captioning

Generated captions are scored with BLEU or ROUGE → but BLEU is too heuristic
CLIPScore: similarity between the CLIP representations of the image and the caption
→ Limitation: the score depends on the quality of the CLIP model
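
To make the CLIPScore note concrete, here is a minimal sketch. It uses the definition from the original CLIPScore paper (Hessel et al., 2021), which rescales the clipped cosine similarity by w = 2.5. The three-dimensional vectors are toy stand-ins for real CLIP embeddings, not model output.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def clip_score(image_emb, caption_emb, w=2.5):
    """CLIPScore = w * max(cos(image, caption), 0), with w = 2.5."""
    return w * max(cosine(image_emb, caption_emb), 0.0)

# Toy vectors standing in for CLIP embeddings (assumption, not real output).
image_emb = [0.6, 0.8, 0.0]
matching_caption_emb = [0.6, 0.8, 0.0]       # same direction as the image
mismatched_caption_emb = [-0.6, -0.8, 0.0]   # opposite direction

good = clip_score(image_emb, matching_caption_emb)
bad = clip_score(image_emb, mismatched_caption_emb)
print(good, bad)
```

Since both embeddings come from the same CLIP model, any blind spot of that model carries straight into the score, which is the limitation noted above.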

4. 1. 2 Text-to-image consistency

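CLIPScore: measures the alignment between the text prompt and the generated image
A Language Model generates questions about the text caption
→ The measurement is only valid if every generated question is correct, with no hallucination.
VPEval: a program that combines several modules such as OCR and VQA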
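
The question-based idea can be sketched roughly as follows. The question list and the `toy_vqa_answer` stub are hypothetical placeholders for an LM question generator and a real VQA model; the score is simply the fraction of questions the VQA model answers as expected on the generated image.

```python
def consistency_score(questions, vqa_answer):
    """Fraction of caption-derived questions answered as expected."""
    correct = sum(
        1 for q in questions if vqa_answer(q["question"]) == q["expected"]
    )
    return correct / len(questions)

# Hypothetical questions an LM might derive from the caption
# "a red cube on top of a blue sphere".
questions = [
    {"question": "Is there a cube?", "expected": "yes"},
    {"question": "What color is the cube?", "expected": "red"},
    {"question": "Is the cube on top of the sphere?", "expected": "yes"},
]

# Stub standing in for a VQA model looking at the generated image.
TOY_ANSWERS = {
    "Is there a cube?": "yes",
    "What color is the cube?": "red",
    "Is the cube on top of the sphere?": "no",  # spatial relation is wrong
}

def toy_vqa_answer(question):
    return TOY_ANSWERS[question]

score = consistency_score(questions, toy_vqa_answer)
print(score)
```

If the LM produces a wrong or hallucinated question, its expected answer is wrong too, and the score penalizes a correct image. That is why the generated questions have to be reliable.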

4. 1. 3 Visual question answering

VQA is close to the main task for VLMs
Answers are multiple-choice or open-ended
Scoring relies on exact string matching against candidate answers built from human-written ground truth
→ Recent work focuses on generative models and OOD evaluation
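
A small sketch of exact-match scoring, assuming a simple normalization (lowercasing, stripping punctuation and extra whitespace) that I chose for illustration. It also shows why this metric is a poor fit for generative models: a correct free-form sentence still counts as wrong.

```python
import string

def normalize(answer):
    """Lowercase, drop punctuation, and collapse whitespace."""
    answer = answer.lower().strip()
    answer = answer.translate(str.maketrans("", "", string.punctuation))
    return " ".join(answer.split())

def exact_match(prediction, references):
    """True if the prediction equals any human reference after normalization."""
    return normalize(prediction) in {normalize(r) for r in references}

references = ["two", "2"]
short_answer = exact_match("Two.", references)
generative_answer = exact_match("There are two dogs", references)
print(short_answer, generative_answer)
```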

Selective prediction

How well the model gives correct answers and avoids giving wrong ones
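
One common way to quantify this is coverage and risk at a confidence threshold: the model answers only when its confidence is high enough, and we measure how much it answers and how often those answers are wrong. The sketch below uses made-up predictions and is only one possible formulation.

```python
def coverage_and_risk(predictions, threshold):
    """Coverage = share of questions answered; risk = error rate among them."""
    answered = [p for p in predictions if p["confidence"] >= threshold]
    coverage = len(answered) / len(predictions)
    if not answered:
        return coverage, 0.0
    errors = sum(1 for p in answered if not p["correct"])
    return coverage, errors / len(answered)

# Made-up model outputs, for illustration only.
predictions = [
    {"confidence": 0.9, "correct": True},
    {"confidence": 0.8, "correct": True},
    {"confidence": 0.4, "correct": False},
    {"confidence": 0.3, "correct": True},
]

selective = coverage_and_risk(predictions, threshold=0.5)
answer_all = coverage_and_risk(predictions, threshold=0.0)
print(selective, answer_all)
```

Raising the threshold trades coverage for lower risk. Here the model answers half of the questions with no errors, versus all of them with a 25% error rate.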

Visual Dialog

A series of questions about an image
Tests the model as an agent that can hold a multi-turn conversation about the image

4. 1. 4 Text-centric Visual Question Answering

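Differences from standard VQA
1. The model has to read and locate the text in the image
2. The model has to reason about how that text interacts with the other content in the image
Tasks: Text Recognition, Scene Text-Centric VQA, Document-Oriented VQA, Key Information Extraction (KIE), Handwritten Mathematical Expression Recognition, etc.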

4. 1. 5 Zero-shot image classification

Ability on classification tasks the model was not explicitly trained for
↔ few-shot learning: fine-tuning on a small amount of data
Changing the prompt structure or customizing CLIP to a specific task can yield performance competitive with existing ImageNet benchmark results.
Zero-shot ability depends on whether the concept was present in the training data.
Generalization on Out-of-Distribution (OOD) tasks
- CLIP may be good at zero-shot because its training dataset was large enough to cover essentially all of ImageNet.
- Generalization degrades sharply once inputs leave the training distribution.
- Modifying the tokens of test examples so that they follow the ImageNet data distribution improves OOD performance.
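
The zero-shot setup from the start of this section can be sketched like this: wrap each class name in a prompt template, embed the prompts, and pick the class whose prompt is closest to the image embedding. The template `"a photo of a {}"` is a commonly used CLIP prompt. The lookup-table text encoder and all vectors are toy assumptions standing in for a real model.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def zero_shot_classify(image_emb, labels, encode_text, template="a photo of a {}"):
    """Return the label whose prompt embedding is most similar to the image."""
    prompts = [template.format(label) for label in labels]
    scores = [cosine(image_emb, encode_text(p)) for p in prompts]
    best = max(range(len(labels)), key=lambda i: scores[i])
    return labels[best], scores

# Toy text encoder: a lookup table instead of a real CLIP text tower.
TOY_TEXT_EMBEDDINGS = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}

def toy_encode_text(prompt):
    return TOY_TEXT_EMBEDDINGS[prompt]

image_emb = [0.2, 0.95, 0.1]  # toy image embedding, closest to "dog"
label, scores = zero_shot_classify(image_emb, ["cat", "dog", "car"], toy_encode_text)
print(label)
```

No classifier weights are trained here. Changing the template changes the prompt embeddings, which is why prompt structure affects zero-shot accuracy.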

4. 1. 6 Visio-linguistic compositional reasoning

Benchmarks are designed to confuse the model (for example, by swapping the word order in a sentence)
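
A rough sketch of the word-order idea: build a hard negative by swapping two words, then check whether a scorer prefers the true caption. The bag-of-words scorer below is a deliberately naive stand-in (my assumption, not a real VLM). It shows that a model which ignores word order cannot pass this kind of test.

```python
def swap_words(caption, word_a, word_b):
    """Create a hard negative caption by swapping two words."""
    swapped = []
    for token in caption.split():
        if token == word_a:
            swapped.append(word_b)
        elif token == word_b:
            swapped.append(word_a)
        else:
            swapped.append(token)
    return " ".join(swapped)

def bag_of_words_score(caption, image_tags):
    """Naive image-text score: share of image tags mentioned in the caption."""
    return len(set(caption.split()) & image_tags) / len(image_tags)

caption = "a dog chasing a cat"
negative = swap_words(caption, "dog", "cat")
image_tags = {"dog", "cat", "chasing"}  # toy description of the image content

pos_score = bag_of_words_score(caption, image_tags)
neg_score = bag_of_words_score(negative, image_tags)
passes = pos_score > neg_score  # the benchmark requires a strict preference
print(negative, pos_score, neg_score, passes)
```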

4. 1. 7 Dense captioning and crop caption matching

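Because of the text tokenizer, current VLMs can only handle short text descriptions
Plain summarization loses much of the information in the image → Densely Captioned Images (DCI)
The image is split into distinct parts → each part gets a human annotation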

4. 1. 8 Synthetic data based visio-linguistic evaluations

With real data, it is hard to find an image that matches a negative caption.
Location bias needs to be addressed, e.g. ‘coffee cup under the table’
→ Photorealistic Unreal Graphics (PUG) dataset

4. 2 Benchmarking Bias and disparities in VLMs

4. 2. 1 Benchmarking bias via classifications

Measures bias with respect to attributes by which people can be classified (gender, skin tone, etc.), for example through occupation classification predictions
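
As an illustration only, one simple way to turn this into a number is to compare how often a given occupation is predicted for each group. The records below are made up, the group names are placeholders, and the max-minus-min disparity is my choice of summary statistic rather than a method named in the paper.

```python
from collections import defaultdict

def prediction_rates(records, target_label):
    """Per-group rate at which the classifier predicts target_label."""
    counts = defaultdict(lambda: [0, 0])  # group -> [target hits, total]
    for record in records:
        counts[record["group"]][1] += 1
        if record["predicted"] == target_label:
            counts[record["group"]][0] += 1
    return {group: hits / total for group, (hits, total) in counts.items()}

# Made-up occupation predictions for images of people from two groups.
records = [
    {"group": "A", "predicted": "doctor"},
    {"group": "A", "predicted": "doctor"},
    {"group": "A", "predicted": "nurse"},
    {"group": "A", "predicted": "doctor"},
    {"group": "B", "predicted": "nurse"},
    {"group": "B", "predicted": "doctor"},
    {"group": "B", "predicted": "nurse"},
    {"group": "B", "predicted": "nurse"},
]

rates = prediction_rates(records, "doctor")
disparity = max(rates.values()) - min(rates.values())
print(rates, disparity)
```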

4. 2. 2 Benchmarking bias via embeddings

Bias measurement focused on the embedding space
Looks at the relationship between text and image representations
European American → lighter skin / African American → darker skin

4. 2. 3 Language biases might impact your benchmark!

Unimodal bias can be exposed with a blind algorithm (one that never sees the image)
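
A minimal sketch of such a blind baseline: it never looks at the image and just predicts the most frequent training answer for each question type. Grouping questions by their first two words is an assumption I made for illustration, and the data is made up.

```python
from collections import Counter, defaultdict

def question_type(question, n_words=2):
    """Crude question type: the first n_words of the question, lowercased."""
    return " ".join(question.lower().split()[:n_words])

def fit_blind_baseline(train):
    """Map each question type to its most frequent training answer."""
    by_type = defaultdict(Counter)
    for question, answer in train:
        by_type[question_type(question)][answer] += 1
    return {qtype: c.most_common(1)[0][0] for qtype, c in by_type.items()}

def blind_accuracy(baseline, test):
    hits = sum(1 for q, a in test if baseline.get(question_type(q)) == a)
    return hits / len(test)

train = [
    ("Is there a dog?", "yes"),
    ("Is there a cat?", "yes"),
    ("Is there a car?", "no"),
    ("What color is the banana?", "yellow"),
    ("What color is the sky?", "blue"),
    ("What color is the lemon?", "yellow"),
]
test = [
    ("Is there a bird?", "yes"),
    ("What color is the corn?", "yellow"),
    ("What color is the grass?", "green"),
]

baseline = fit_blind_baseline(train)
acc = blind_accuracy(baseline, test)
print(baseline, acc)
```

If a text-only baseline like this scores well above chance, the benchmark can be partly solved from language priors alone.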

4. 2. 4 Evaluating how specific concepts in the training data impact downstream task

Concepts seen often during training are solved well downstream; concepts seen rarely are not

4. 3 Benchmarking hallucinations

Captions can contain object hallucinations; these benchmarks try to measure them
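
One well-known metric of this kind is CHAIR, which looks at the share of objects mentioned in a caption that are not actually in the image. The notes above do not name a specific metric, so treat this as an illustrative per-caption simplification with hand-written object lists.

```python
def hallucination_rate(mentioned_objects, image_objects):
    """Share of distinct mentioned objects that are absent from the image."""
    mentioned = set(mentioned_objects)
    if not mentioned:
        return 0.0
    hallucinated = mentioned - set(image_objects)
    return len(hallucinated) / len(mentioned)

# Objects extracted from a generated caption vs. ground-truth annotations
# (both lists are made up for this example).
mentioned = ["dog", "frisbee", "bench"]
in_image = ["dog", "frisbee", "grass"]

rate = hallucination_rate(mentioned, in_image)
print(rate)  # "bench" is hallucinated: 1 of 3 mentioned objects
```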

4. 4 Benchmarking memorization

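Memorizing training data happens frequently in LLMs and diffusion models
Why it is hard to assess in VLMs
- CLIP has no decoder, so the information is hard to recover
- VLMs such as CoCa and LLaVA have limited generation capabilities, so demonstrating cross-modal memorization remains an open problem.
When a CLIP model remembers the objects in a training image → called déjà vu memorization
- CLIP: trained on image-text pairs
- reference CLIP model: trained on images only, without captions
→ Run inference with both and compare the difference to quantify memorization
Several regularization techniques can also reduce memorization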
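
The target-versus-reference comparison can be sketched as a gap in how many of a training image's objects each model recovers. How the objects are decoded from each model is outside the scope of these notes, so the predicted lists below are made up, and using recall as the measure is an assumption for illustration.

```python
def object_recall(predicted_objects, ground_truth_objects):
    """Share of the image's true objects that were recovered."""
    truth = set(ground_truth_objects)
    return len(set(predicted_objects) & truth) / len(truth)

# Objects annotated in one training image (toy example).
ground_truth = ["dog", "frisbee", "bench", "tree"]

# Hypothetical objects recovered with each model.
target_predicted = ["dog", "frisbee", "bench"]   # model trained on the pair
reference_predicted = ["dog", "frisbee"]         # reference model

target_recall = object_recall(target_predicted, ground_truth)
reference_recall = object_recall(reference_predicted, ground_truth)
memorization_gap = target_recall - reference_recall
print(target_recall, reference_recall, memorization_gap)
```

Objects that both models recover can be explained by general correlations. Only the extra objects recovered by the target model point to memorization of that training sample.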

4. 5 Red Teaming

Red Teaming: exploiting a model's public interface to make it generate undesirable output
→ Once red team evaluation has identified the risks, they can be mitigated afterwards with post-processing.

5. Extending VLM to videos

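Beyond static images, video is also widely studied
The temporal dimension of video makes storage, GPU memory, and training harder, with costs growing with the frame rate.
Early video-text models were trained with self-supervised (contrastive) learning like image-text models, but this was not a good fit because temporal alignment with the text matters more for video.
Recently, there are attempts to align a pre-trained LLM with a video encoder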

5. 1 Early work on videos based on BERT

VideoBERT: visual tokens and textual tokens from the video captions → mixed and fed to a transformer → pretraining objective: masking & reconstruction, as in the original BERT
Multimodal Event Representation Learning Over Time (MERLOT): captures the temporal alignment with the text description
- contrastive objective: local text tokens & frame visual tokens
- masked language modeling objective
- temporal reordering objective
What the model can do
- Answer questions using knowledge learned from video
- Answer difficult questions across a wide range of datasets and benchmarks

5. 2 Enabling text generation using an early-fusion VLM

VideoOFA
- 2-stage framework
  - learn fundamental visual-language representations from image-text data
  - continue pretraining the backbone VLM to learn specific concepts such as temporal reasoning

5. 3 Using a pretrained LLM

Aims to make the most of a strong LLM
Align the visual backbone to the LLM
Video-LLaMA
- BLIP-2, Video Q-former, Audio Q-former
- conversational agent
MiniGPT4-Video
- Concatenates every 4 adjacent visual tokens into one to reduce the number of input tokens
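
The token-merging step in MiniGPT4-Video can be pictured with plain lists: every 4 neighboring token vectors are concatenated along the feature dimension, so the sequence becomes 4 times shorter and each token 4 times wider. This is a shape-level sketch with toy values, not the model's actual code.

```python
def merge_adjacent_tokens(tokens, group=4):
    """Concatenate every `group` adjacent token vectors into one vector."""
    if len(tokens) % group != 0:
        raise ValueError("number of tokens must be divisible by group size")
    merged = []
    for start in range(0, len(tokens), group):
        chunk = tokens[start:start + group]
        merged.append([value for token in chunk for value in token])
    return merged

# 8 toy visual tokens, each of dimension 2.
tokens = [[float(i), float(i) + 0.5] for i in range(8)]
merged = merge_adjacent_tokens(tokens)

print(len(tokens), len(tokens[0]))   # 8 tokens of size 2
print(len(merged), len(merged[0]))   # 2 tokens of size 8
```

No information is dropped. The cost moves from sequence length, which dominates attention cost in the LLM, to a wider input to the projection layer.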

5. 4 Opportunities in evaluations

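EgoSchema dataset → where interactions take place
A question like “What game are the people playing?” can be answered from a single frame, so it does not make a good video benchmark.
For a video VLM, the ability to reason and understand the world also matters.
→ Synthetic data that is physically impossible is an effective probe.
Asked ‘Is this physically plausible?’, models fail to beat random performance. (humans reach 80% accuracy)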

5. 5 Challenges in leveraging video data

Supervision along the temporal dimension is weak
Existing data describes the scene but not the action or motion.
CLIP also has a noun bias
VideoPrism: because of the limits of imperfect captions, the video encoder is trained on video only
Video processing is computationally expensive and redundant.
More efficient training protocols are needed

------------------------------------------------------------------------------------

Reference paper: "An Introduction to Vision-Language Modeling" (Florian Bordes et al., 2024).
Source: [arXiv:2405.17247](https://arxiv.org/abs/2405.17247)