Stacked Hybrid-Attention and Group Collaborative Learning for Unbiased Scene Graph Generation

Xingning Dong et al. / Stacked Hybrid-Attention and Group Collaborative Learning for Unbiased Scene Graph Generation / CVPR 2022

1. Task Definition

First, let us briefly introduce what Scene Graph Generation is.

Scene Graph Generation (SGG) is the task of converting an input image into a graph.

[Figure 1]

Figure 1 shows the pipeline of an SGG model. Concretely, the model takes an image containing a person and a horse as input and generates a graph.

The graph G that we want to generate has four components: V, E, R, and O.

V is the set of nodes, made up of the object detector's proposals, and E is the set of edges, which connect related objects.

SGG also performs a classification task that determines the class label of each node and each edge.

R denotes the relation class of each edge, and O denotes the class of each object.

The final graph is therefore a collection of <subject, predicate, object> triplets such as (person, feeding, horse).

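Generating the graph G from an image I then factorizes as P(G | I) = P(V | I) P(E | V, I) P(R, O | V, E, I), where the three factors are: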

P(V | I ) - object detector

P(E | V, I ) - relation proposal network

P(R, O | V, E, I ) - Classification models for entity and predicate.

Modeling these three terms defines the problem of generating a scene graph.

The focus of this work in particular is unbiased SGG: the goal is to train an SGG model that is not biased toward particular classes and can predict a diverse set of relations (similar to class-imbalanced training).

2. Motivation

So what SGG models exist today, and what are their limitations? Let us take a look.

Scene Graph Generation

Existing SGG methods have put a great deal of effort into generating scene graphs that reflect the visual context, that is, into learning a context that captures the relations among the objects in a scene.

  1. Early work studied the features used to represent a scene. It looked at how to use the features extracted by a Faster R-CNN object detector to train the model, and went on to use language features (the words of the class labels) to learn a better scene graph context.

  2. More recent work focuses on how to extract context at the model level. These methods typically model the context with sequential models such as LSTMs, with the message propagation schemes used in the GNN domain, or with self-attention networks. However, raising the expressive power in this way brought only marginal improvement on the label class bias present in scene graph data. Concretely, models predict frequent classes such as 'on' well, but these carry little meaning from a scene graph generation point of view. Relations in the tail classes such as 'standing on' are learned poorly, even though they are the important relations that describe the visual context well. State-of-the-art SGG research therefore aims at unbiased SGG. These methods usually mitigate the bias problem by 1) reducing model bias through data resampling, 2) training with a re-weighted loss, or 3) transferring knowledge through a transfer learning framework. This paper can be seen as belonging to 3).

Limitations of Prior Work

  1. Language semantics are learned with simple schemes such as concatenation.

  2. Existing unbiased training techniques overfit to the tail and sacrifice too much head performance.

Idea of This Work

  1. Borrow an architecture from multi-modal learning to extract language semantics more effectively.

  2. Borrow the expert training technique from class incremental learning to propose a training scheme for SGG models that performs well on both head and tail classes.

3. Method

The figure below shows the overall architecture of the proposed model.

[Figure 2]
  1. The image is passed through the proposal network to extract visual features (bounding boxes, convolutional features) and language features (class label words).

  2. The visual and language features are used to encode the object embeddings and the relation embeddings, respectively. The structure used for this encoding is Stacked Hybrid-Attention, the first contribution of the paper. It is covered in more detail below.

  3. The embeddings produced by the encoder are used to train a decoder for objects and a decoder for relations. Here it is enough to think of these as classifiers. The difference from prior work is Group Collaborative Learning in the relation decoding part. This module mitigates the class imbalance of relations and is the second contribution of the paper. It is also covered in detail below.

Stacked Hybrid-Attention (SHA)

As mentioned above, SHA starts from the observation that combining visual and language features by concatenation or summation, as in prior work, is not enough to capture the inter-modal and intra-modal relations between the two. Thinking a little further, there are relations among visual features (person image <-> horse image) and relations among words ('human' word <-> 'horse' word), and these coexist in a multi-modal form, so a plain summation is a poor fit. SHA reuses an architecture from existing multi-modal learning, so it is easy to understand. The figure below shows the structure of SHA.

[Figure 3]

There is an SA module and a CA module, and both are built on multi-head attention. The difference is that the SA module aims at intra-modal refinement and takes features of the same modality as input (image with image), while the CA module is a cross-attention module that takes both modalities together to extract semantics. The paper argues that this lets the model use the features better when modeling context.
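To make the SA/CA distinction concrete, here is a minimal pure-Python sketch of one hybrid layer. It is not the paper's implementation: it uses single-head attention without learned projections, and summing the SA and CA outputs is an assumption made for illustration. The function names and the toy feature values are hypothetical.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys_values):
    """Single-head scaled dot-product attention (no learned projections)."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[d] for w, kv in zip(weights, keys_values))
                    for d in range(dim)])
    return out

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def hybrid_attention_layer(visual, language):
    """SA attends within a modality; CA attends to the other modality.
    Assumption: the two outputs are combined by summation."""
    new_visual = add(attention(visual, visual), attention(visual, language))
    new_language = add(attention(language, language), attention(language, visual))
    return new_visual, new_language

def stacked_hybrid_attention(visual, language, num_layers=2):
    for _ in range(num_layers):
        visual, language = hybrid_attention_layer(visual, language)
    return visual, language

# Toy inputs (hypothetical): two region features and two label word embeddings.
visual = [[1.0, 0.0], [0.0, 1.0]]
language = [[0.5, 0.5], [1.0, -1.0]]
v_out, l_out = stacked_hybrid_attention(visual, language)
```

Both modalities must share the same feature dimension for the dot products to be defined; a real model would apply learned projections, multiple heads, and normalization on top of this.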

Group Collaborative Learning (GCL)

Group Collaborative Learning can be understood as taking the structure of class incremental learning and applying it to SGG in order to resolve the class imbalance of relations. Let us look in detail at how it addresses the bias, starting with the figure of Group Collaborative Learning below.

[Figure 4]

As the figure shows, the method goes through several stages, from Predicate Class Grouping to Collaborative Knowledge Distillation. The key idea borrowed from class incremental learning can be summarized as: "The given data is imbalanced, so let us train several models (several experts) A, B, ..., E separately, each in a balanced setting. Each of A, B, C, D, and E then has classes it is specialized in predicting, and we share that knowledge with a single model (transfer, knowledge distillation) to obtain one model that predicts all classes well."

This may sound complicated, but the idea is simply to build several experts separately and then inject the knowledge of those experts into one student.

Step 1. Predicate Class Grouping. This step decides how many experts to use. The distribution over all classes is very long-tailed and heavily imbalanced, but if we sort the classes and cut the sorted list from the front into groups, each group becomes relatively balanced. That is, the blue relations form Group 1, the blue + green relations form Group 2, and so on, for K groups in total. Within each group the distribution is relatively balanced.
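The grouping in Step 1 can be sketched as follows. This is only an illustration: the ratio threshold `mu`, the function names, and the predicate counts are hypothetical, and the cutting rule (start a new group once the largest count in the current group exceeds `mu` times the next class's count) is an assumption about how "relatively balanced" is enforced.

```python
def group_predicates(counts, mu=4.0):
    """Sort classes by frequency and cut into groups whose
    max/min count ratio stays within mu (hypothetical threshold).
    Counts are assumed to be positive."""
    ordered = sorted(counts, key=counts.get, reverse=True)
    groups = []
    for name in ordered:
        if groups and counts[groups[-1][0]] / counts[name] <= mu:
            groups[-1].append(name)
        else:
            groups.append([name])
    return groups

def cumulative_class_sets(groups):
    """Classifier k covers every class from group 1 up to group k,
    matching the 'blue, blue + green, ...' description."""
    covered, result = [], []
    for g in groups:
        covered = covered + g
        result.append(covered)
    return result

# Hypothetical predicate counts with a long-tailed shape.
counts = {"on": 1000, "has": 600, "near": 300,
          "riding": 120, "standing on": 40, "flying in": 10}
groups = group_predicates(counts, mu=4.0)
class_sets = cumulative_class_sets(groups)
```

With these counts the sketch yields three groups, so K = 3; a smaller `mu` gives more, finer groups and a larger `mu` gives fewer.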

Step 2. Balanced Sample Preparation. This step makes the model see the rare classes within each group relatively more often. Only undersampling is applied: rare classes are dropped only a little and frequent classes are dropped a lot, so that the balance within the group shifts toward the rarely occurring classes.
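A minimal sketch of the undersampling in Step 2 is shown below. The keep-rate rule (the smallest class count in the group divided by the class's own count, capped at 1) is an assumption chosen for illustration rather than the paper's exact formula, and the counts are hypothetical. Under this rule the rarest class in the group is never dropped.

```python
import random

def keep_rates(class_counts):
    """Keep rate per class: 1.0 for the rarest class in the group,
    proportionally smaller for more frequent classes (assumed rule)."""
    reference = min(class_counts.values())
    return {name: min(1.0, reference / n) for name, n in class_counts.items()}

def undersample(samples, rates, rng):
    """Keep each (label, data) sample with probability rates[label]."""
    return [s for s in samples if rng.random() < rates[s[0]]]

# Hypothetical counts for the classes of one group.
group_counts = {"on": 1000, "has": 600, "near": 300}
rates = keep_rates(group_counts)
samples = [(name, i) for name, n in group_counts.items() for i in range(n)]
balanced = undersample(samples, rates, random.Random(0))
```

After sampling, each class contributes roughly the same number of examples to the group's classifier, which is the balanced setting the step aims for.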

Step 3. Class Probability Prediction / Parallel Classifier Optimization. This is the same as training an ordinary classifier. Using cross entropy, one classifier per group is trained in parallel, K classifiers in total.
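As a rough sketch of Step 3, the parallel objective can be written as a sum of per-classifier cross-entropy terms. The assumption here is that a sample only contributes to the classifiers whose class set contains its label; the class sets and logits are hypothetical values.

```python
import math

def cross_entropy(logits, target_index):
    """Cross entropy of a softmax over logits, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[target_index]

def parallel_ce_loss(logits_per_classifier, class_sets, label):
    """Sum the CE of every classifier whose class set contains the label
    (assumption: other classifiers skip this sample)."""
    total = 0.0
    for logits, classes in zip(logits_per_classifier, class_sets):
        if label in classes:
            total += cross_entropy(logits, classes.index(label))
    return total

# Two hypothetical classifiers: the second also covers a tail class.
class_sets = [["on", "has"], ["on", "has", "riding"]]
logits = [[2.0, 0.5], [1.0, 0.2, 3.0]]
loss_head = parallel_ce_loss(logits, class_sets, "on")      # both classifiers
loss_tail = parallel_ce_loss(logits, class_sets, "riding")  # second one only
```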

Step 4. Collaborative Knowledge Distillation. At this point each classifier holds specialized knowledge. Group 1 knows a lot about the head classes, Group K knows a lot about the tail classes, and the classifiers in between know about the body classes. We line them up and transfer knowledge by training with a KL-divergence loss. The order of knowledge transfer is discussed later, with the experiments. Taking the Adjacency scheme first, classifier 1 passes knowledge to classifier 2, classifier 2 passes it to classifier 3, and so on, so knowledge propagates as a chain. As a result, the final K-th classifier receives all of the knowledge in sequence, giving a classifier that predicts well across head and tail classes alike.
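The distillation in Step 4 can be sketched as below. Several details are assumptions made for illustration: each classifier's classes are taken to be a prefix of the next classifier's classes, the student's logits are cropped to the teacher's classes before the softmax, and "top-down" is read as every classifier teaching all later ones. The function names are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_pairs(num_classifiers, mode="adjacency"):
    """(teacher, student) index pairs. Adjacency is a chain;
    top_down pairs every classifier with all later ones (assumed reading)."""
    if mode == "adjacency":
        return [(k, k + 1) for k in range(num_classifiers - 1)]
    if mode == "top_down":
        return [(m, n) for m in range(num_classifiers)
                for n in range(m + 1, num_classifiers)]
    raise ValueError(mode)

def collaborative_kd_loss(logits_per_classifier, mode="adjacency"):
    """Average KL(teacher || student) over the pairs; needs >= 2 classifiers."""
    pairs = distillation_pairs(len(logits_per_classifier), mode)
    total = 0.0
    for teacher, student in pairs:
        t_logits = logits_per_classifier[teacher]
        # Crop the student to the classes the teacher knows about.
        s_logits = logits_per_classifier[student][:len(t_logits)]
        total += kl_divergence(softmax(t_logits), softmax(s_logits))
    return total / len(pairs)

# Hypothetical logits of K = 3 classifiers for one sample.
example_logits = [[2.0, 0.5], [1.0, 0.2, 3.0], [0.1, 0.1, 0.1, 0.1]]
kd_adjacent = collaborative_kd_loss(example_logits, "adjacency")
kd_top_down = collaborative_kd_loss(example_logits, "top_down")
```

In training this term would be added to the cross-entropy losses of the K classifiers, and only the last classifier, which covers all classes, would be used at inference.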

4. Experiment & Result

The experiments use the standard evaluation setup to verify how effective the proposed model is and whether each component of the model contributes.

Metric

For unbiased SGG, the evaluation metric is mR@K. Given the top-K triplets (<subject, relation, object>) predicted by the model, it measures how many of the ground-truth (GT) triplets are recovered. Averaging over all triplets gives R@K; computing R@K per class and dividing by the number of classes gives mean R@K (mR@K).
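The difference between R@K and mR@K can be sketched with a simplified computation. This sketch pools hits over all images and counts a prediction as correct only on an exact triplet match, ignoring bounding-box overlap, so it is a simplification of the benchmark protocol. The example triplets are hypothetical.

```python
from collections import defaultdict

def recall_stats(predictions, ground_truth, k):
    """Per-predicate hit and total counts; predictions are ranked lists."""
    hits, totals = defaultdict(int), defaultdict(int)
    for pred, gt in zip(predictions, ground_truth):
        top_k = set(pred[:k])
        for triplet in gt:
            predicate = triplet[1]
            totals[predicate] += 1
            if triplet in top_k:
                hits[predicate] += 1
    return hits, totals

def recall_at_k(predictions, ground_truth, k):
    hits, totals = recall_stats(predictions, ground_truth, k)
    return sum(hits.values()) / sum(totals.values())

def mean_recall_at_k(predictions, ground_truth, k):
    hits, totals = recall_stats(predictions, ground_truth, k)
    return sum(hits[p] / totals[p] for p in totals) / len(totals)

# One image: the model finds both 'on' triplets but misses 'feeding'.
gt = [[("person", "on", "horse"), ("person", "on", "grass"),
       ("person", "feeding", "horse")]]
pred = [[("person", "on", "horse"), ("person", "on", "grass"),
         ("horse", "on", "grass")]]
r_at_3 = recall_at_k(pred, gt, 3)
mr_at_3 = mean_recall_at_k(pred, gt, 3)
```

Here R@3 is 2/3 while mR@3 is 0.5, because the missed tail predicate weighs as much as the frequent one once recall is averaged per class.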

Task

There are three tasks:

SGDET - Image -> object detection / object classification / predicate classification.

This is the typical setting of generating a graph from an image alone. It is the hardest of the three tasks: the model literally has to learn the mapping from an image to the graph itself. It therefore checks the performance of everything together: the object detector, graph edge prediction, and the object and relation classifiers.

SGCLS - Ground-truth boxes -> object classification / predicate classification.

Here the model builds the scene graph given the image and the ground-truth bounding boxes. Since it does not depend on the object detector, it is somewhat easier than SGDET above. It measures only the performance of the object and predicate classifiers.

PREDCLS - Ground-truth boxes, object categories -> predicate classification.

Finally, here the model builds the scene graph given the image, the ground-truth bounding boxes, and also the object classes. Since it does not depend on the object detector and the object classes are already known, it is the easiest task. It measures only the performance of the predicate classifier.

Result

[Figure 5: results table]

The table above compares mR@K for K = 20, 50, and 100 on each task. The paper proposes SHA and GCL. SHA is an architecture for the model's encoder, so it is specific to this paper, whereas GCL is a training scheme and is therefore model-agnostic (it can also be applied to models from other papers). The paper also runs experiments applying GCL to two existing models: Motif, which estimates context with an LSTM to predict relations, and VCTree, which predicts through a TreeLSTM structure. Reading the results, GCL substantially improves the mR@K of the existing models, and the proposed self-attention-based model that uses the SHA layers together with GCL performs best.

[Figure 6: ablation table]

The table above is an ablation study that breaks the proposed method into its components and demonstrates the usefulness of each one. Removing GCL makes the model easily biased. Performance improves further when knowledge distillation merges the classifiers into one and knowledge is transferred, which shows that the transfer learning is effective. The gain is smaller by comparison, but the SA and CA layers of SHA are each shown to help as well.

[Figure 7]

The figure above gives a more concrete example of how GCL proceeds under several parameter settings. By adjusting the parameters, the model can be trained with a different number of groups.

The corresponding results are as follows.

[Figure 8]

The Top-down scheme turns out to be more effective than the Adjacency scheme, and performance also varies noticeably depending on how the groups are divided. Even with the other groupings, though, the method still outperforms the existing models.

5. Conclusion

This paper proposes a framework for SGG that addresses the multi-modality of visual and language features and tackles the class imbalance problem.

Take home message

As SGG models attract more attention, papers that bring ideas from other fields into SGG seem to be accepted more and more often. If you look at which ideas from existing vision research can be adapted to the SGG setting and study SGG from that angle, it looks like a field with plenty of room left for good contributions.

Author

윤강훈 (Kanghoon Yoon)

  • Affiliation: KAIST Industrial Engineering Department

  • Ph.D. student in DSAIL

Reference & Additional materials

  1. Visual translation embedding network for visual relation detection

  2. Representation learning for scene graph completion via jointly structural and visual embedding

  3. Neural Motifs: Scene Graph Parsing with Global Context

  4. Graph R-CNN for Scene Graph Generation

  5. GPS-net: Graph property sensing network for scene graph generation
