Are Transformers More Robust Than CNNs?

Y. Bai et al., "Are Transformers More Robust Than CNNs?", NeurIPS 2021

1. Problem Definition

  • Vision Transformer (ViT) networks are widely believed to be stronger and more robust than CNNs.

  • However, this work questions that belief through a series of experiments and re-examines robustness under fairly designed experimental conditions.

  • The authors conclude that CNNs can be sufficiently robust against adversarial attacks as well.

  • Along the way, they additionally confirm that pre-training on massive amounts of data is not strictly necessary for Transformers to surpass CNN performance.

2. Motivation

  • Transformers, pure attention-based models, have surpassed CNN performance without convolutional inductive biases and are actively being studied for detection, instance segmentation, and semantic segmentation.

  • Moreover, recent studies have reported that Transformers are more robust than CNNs to out-of-distribution (OOD) samples and adversarial attacks.

    • However, the authors argue that these results were obtained under unfair conditions.

    • The Transformers had more #params, and the training dataset, number of epochs, augmentation strategy, etc. were not matched (as the later experiments show, several conditions favored ViT).

  • This work re-examines robustness to adversarial attacks and OOD samples through a fair comparison.

    • They find that CNNs become more robust to perturbation-based and patch-based attacks when they follow the Transformers' training recipes.

    • Transformers are still found to be more robust on OOD samples, and this holds even without pre-training. An ablation study shows that self-attention is the reason for this phenomenon.

๐Ÿ’ก ์ด ์—ฐ๊ตฌ๊ฐ€ ๋‹ค๋ฅธ Architecture๋ผ๋ฆฌ์˜ ๊ฐ•๊ฑด์„ฑ์„ ๋น„๊ตํ•˜๋Š” ํ‘œ์ค€์ด ๋˜๊ธธ ๋ฐ”๋ž€๋‹ค๊ณ  ์ €์ž๋Š” ๋ฐํžˆ๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค

3. Method

  • This chapter covers the following, both of which appear frequently in the experiments, so please keep them in mind:

  1. Training conditions for CNNs vs. ViTs

  2. The attacks and OOD datasets used

3.1 Training CNNs and Transformers

  • After training, the CNN and ViT reach very similar Top-1 accuracies: 76.8% and 76.9%, respectively.

CNN

  • ResNet-50์ด ViT์™€ ๋น„์Šทํ•œ #params๋ฅผ ๊ฐ€์ง€๋ฏ€๋กœ ์ฑ„ํƒ

  • ImageNet์— ํ•™์Šต

  • ๊ธฐํƒ€ ํ•™์Šต ๋””ํ…Œ์ผ(SGD-momentum, 100eph, L2๊ทœ์ œ)

ViT

  • Following the recipe of DeiT, which performs well without external data, DeiT-S (similar #params to ResNet-50) is adopted as the default ViT.

  • AdamW, with three augmentations (RandAug, CutMix, MixUp).

  • To match the ResNet training setup, Random Erasing, Stochastic Depth, and Repeated Augmentation are not used. DeiT is normally trained for 300 epochs, but for the same reason it is trained for only 100 here.
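As a quick reference, the aligned setup above can be summarized in a small config sketch. Only settings stated in this section are filled in; all other hyperparameters (learning rates, weight-decay values, ...) are deliberately omitted:

```python
# A minimal summary of the aligned training recipes described above.
# Only settings stated in this section are included.
aligned_recipes = {
    "ResNet-50": {
        "optimizer": "SGD-momentum",
        "epochs": 100,
        "regularization": "L2",
        "augmentations": [],  # no strong augmentation in the CNN recipe
    },
    "DeiT-S": {
        "optimizer": "AdamW",
        "epochs": 100,  # reduced from DeiT's usual 300 for a fair comparison
        "augmentations": ["RandAug", "CutMix", "MixUp"],
        # Random Erasing, Stochastic Depth, and Repeated Aug are dropped
        # to keep the setup comparable to ResNet-50.
    },
}
```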

3.2 Robustness Evaluations

3.2.1 Adversarial Attack

PGD

  • PGD (Projected Gradient Descent): a perturbation that is hard for a human to notice but can fool the machine.
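To make the idea concrete, here is a minimal PGD sketch against a toy logistic-regression model with an analytic input gradient. This is only an illustration of the projected gradient-ascent loop, not the paper's ImageNet setup; all names and values below are ours:

```python
import numpy as np

def pgd_attack(x, y, w, b, epsilon=0.1, alpha=0.02, steps=10):
    """L_inf PGD against a logistic-regression model p = sigmoid(w.x + b).
    Gradient ascent on the loss w.r.t. the input, projected back into the
    epsilon-ball around x after every step."""
    x_adv = x.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x_adv @ w + b)))
        grad = (p - y) * w                                # d(loss)/dx, analytic
        x_adv = x_adv + alpha * np.sign(grad)             # ascent step
        x_adv = np.clip(x_adv, x - epsilon, x + epsilon)  # projection
    return x_adv

# usage: the perturbed input stays within epsilon of x but lowers the
# model's confidence in the true class y = 1
w, b = np.array([1.0, -2.0]), 0.0
x, y = np.array([0.5, -0.5]), 1.0
x_adv = pgd_attack(x, y, w, b)
```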


TPA

  • TPA (Texture Patch Attack): an attack that fools the network by pasting texture-bearing patches onto the image.


3.2.2 OOD

  • The descriptions in the paper and on PapersWithCode (PWC) differ slightly; this summary follows PWC.

    • ImageNet-A: a set of images that ResNet models misclassify with high confidence, i.e., natural images whose distribution differs somewhat from the training distribution. Looking at the actual images, the wrong answers are often understandable.

    • ImageNet-C: an image set with various common corruptions applied at several severities.

    • Stylized-ImageNet: a dataset in which each image is re-rendered with various textures.
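ImageNet-C builds its corrupted images by applying each corruption at five severity levels. A hedged sketch of one such corruption, Gaussian noise, with an illustrative severity-to-sigma mapping rather than ImageNet-C's exact values:

```python
import numpy as np

def gaussian_noise(image, severity=1):
    """ImageNet-C-style corruption sketch: Gaussian noise whose standard
    deviation grows with the severity level (1-5). The severity-to-sigma
    mapping here is illustrative, not ImageNet-C's exact values."""
    sigma = [0.04, 0.06, 0.08, 0.09, 0.10][severity - 1]
    rng = np.random.default_rng(0)  # seeded for reproducibility
    noisy = image + rng.normal(0.0, sigma, image.shape)
    return np.clip(noisy, 0.0, 1.0)  # keep pixels in the valid range

img = np.full((4, 4, 3), 0.5)               # dummy gray "image" in [0, 1]
corrupted = gaussian_noise(img, severity=5)
```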

4. Experiment

  • ์‹คํ—˜์€ ํฌ๊ฒŒ ๋‘ ๊ฐœ์˜ ํŒŒํŠธ๋กœ ๊ตฌ์„ฑ๋˜์–ด์žˆ์Šต๋‹ˆ๋‹ค.

  1. ์ ๋Œ€์  ๊ณต๊ฒฉ์— ๋Œ€ํ•œ ๊ฐ•๊ฑด์„ฑ

  2. OOD Sample์— ๋Œ€ํ•œ ๊ฐ•๊ฑด์„ฑ

4.1 Adversarial Robustness

  • 5000์žฅ์˜ ImageNet ๊ฒ€์ฆ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜์˜€์Œ

4.1.1 Robustness to Perturbation-Based Attacks

  • Raising AutoAttack's perturbation budget completely fools both models.

  • But keep in mind that neither model has undergone any adversarial training.

    Adversarial Training

    $$\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \max_{\|\delta\| \le \epsilon} L(\theta, x + \delta, y) \right]$$

    Here $\theta$ denotes the model parameters, $\delta$ the perturbation, and $\mathcal{D}$ the data distribution.

    • ์„ญ๋™์„ ์ฃผ์–ด์„œ Loss๋ฅผ ์ตœ๋Œ€ํ™”ํ•˜๋Š” sample ์—์„œ์˜ ์ตœ์  parameter๋ฅผ ์ฐพ์œผ๋ผ๋Š” ๋‚ด์šฉ์˜ ์ˆ˜์‹์ด๋‹ค

    • ์ •ํ™•ํžˆ๋Š” PGD๊ฐ€ ์‚ฌ์šฉ๋˜์—ˆ๋Š”๋ฐ ๋ฐ˜๋ณต์ ์ธ step์„ ํ†ตํ•ด์„œ ์ตœ์  ๊ณต๊ฒฉ์ง€์ ์„ ์ฐพ๋Š” ๋ฐฉ๋ฒ•์ด๋ผ ์ดํ•ดํ•˜๋ฉด ๋˜๊ฒ ๋‹ค
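A toy sketch of the min-max objective on a logistic-regression model (ours, not the paper's setup): the inner maximization is approximated with a one-step sign-gradient perturbation instead of full multi-step PGD, and the outer minimization is a gradient step on the perturbed example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adv_train_step(w, b, x, y, epsilon=0.1, lr=0.5):
    """One min-max step on a toy logistic-regression model.
    Inner max: one-step sign-gradient perturbation of x (a cheap stand-in
    for multi-step PGD); outer min: gradient step on the perturbed point."""
    grad_x = (sigmoid(x @ w + b) - y) * w    # d(loss)/dx
    x_adv = x + epsilon * np.sign(grad_x)    # approximate worst case in the ball
    err = sigmoid(x_adv @ w + b) - y         # d(loss)/dz at x_adv
    return w - lr * err * x_adv, b - lr * err

w, b = np.zeros(2), 0.0
x, y = np.array([1.0, -1.0]), 1.0
for _ in range(50):
    w, b = adv_train_step(w, b, x, y)
# after training, even the worst-case perturbed input is classified correctly
```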

    Adversarial Training on Transformers

    • CNN์€ ๋ฌธ์ œ ์—†์—ˆ์œผ๋‚˜ Transformer๋Š” ๊ฐ•ํ•œ Augmentation์ด PGD์™€ ํ•จ๊ป˜ ์ ์šฉ๋˜๋‹ˆ collapse๋˜์–ด๋ฒ„๋ฆฌ๋Š” ๋ฌธ์ œ๊ฐ€ ์žˆ์—ˆ๋‹ค

    • ๋”ฐ๋ผ์„œ Augmentation์„ eph์ฆ๊ฐ€์— ๋”ฐ๋ผ ์ ์  ๊ฐ•๋„๋ฅผ ๋†’์—ฌ๊ฐ€๋ฉฐ ํ•™์Šตํ•œ ๊ฒฐ๊ณผ 44%์˜ robustness๋ฅผ ์–ป์—ˆ๋‹ค

    Transformers with CNNsโ€™ Training Recipes

    • CNN์—์„œ ์‚ฌ์šฉ๋œ ํ•™์Šต์กฐ๊ฑด(M-SGD, ๊ฐ•ํ•œ Augmentation ๋ฐฐ์ œ)์„ Transformer์— ์‚ฌ์šฉํ–ˆ๋”๋‹ˆ ํ•™์Šต์ด ์•ˆ์ •๋˜๊ธด ํ–ˆ์ง€๋งŒ clean data์— ๋Œ€ํ•œ ์„ฑ๋Šฅ๊ณผ PGD-100์— ๋Œ€ํ•œ ๋ฐฉ์–ด์œจ์ด ํ•˜๋ฝํ–ˆ๋‹ค

    • ์ด๋Ÿฌํ•œ ํ˜„์ƒ์ด ๋‚˜ํƒ€๋‚œ ์ด์œ ๋Š” ๊ฐ•ํ•œ Augmentation์„ ๊ทœ์ œํ•ด overfitting์ด ์‰ฝ๊ฒŒ ์ผ์–ด๋‚ฌ๊ธฐ ๋•Œ๋ฌธ์ด๊ณ  ์ด์ „ ์—ฐ๊ตฌ์—์„œ ๋ฐํ˜€์กŒ๋“ฏ์ด Transformer ์ž์ฒด๊ฐ€ SGD์™€๊ฐ™์€ optimizer์—์„œ ์ตœ์ ์ ์„ ์ž˜ ์ฐพ์ง€ ๋ชปํ•˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค

    CNNs with Transformersโ€™ Training Recipes


    • The ResNet-50 + ReLU results show it is less robust than ViT. Rather than stopping there, the authors say this motivated a new experiment: applying the Transformers' recipes to CNNs.

    • The optimizer and strong regularization used by Transformers either had little effect or caused training to collapse.

    • So they replaced the non-smooth ReLU with GELU, which Transformers use; ReLU is known to be an activation that is vulnerable to adversarial attacks.

    • As a result, ResNet-50 + GELU matched DeiT's adversarial robustness, refuting the conclusion of prior work.
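The smoothness difference is easy to see numerically: ReLU's gradient jumps from 0 to 1 at the origin, while GELU's varies continuously. A small sketch (the tanh approximation below is the form used in many Transformer codebases):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gelu(x):
    # tanh approximation of GELU, common in Transformer codebases
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def num_grad(f, x, h=1e-4):
    """Central-difference numerical derivative."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Gradient just right vs just left of the origin:
jump_relu = num_grad(relu, 0.01) - num_grad(relu, -0.01)  # jumps by ~1
jump_gelu = num_grad(gelu, 0.01) - num_grad(gelu, -0.01)  # near 0: smooth
```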

4.1.2 Robustness to Patch-Based Attacks

  • By default, the attack uses 4 patches covering up to about 10% of the target image's area. Neither model was adversarially trained against TPA. The stated reason is a bit confusing, but I read it as: adversarial training against TPA yields non-trivial gains, i.e., performance becomes too good for a meaningful comparison.

  • Table 3์˜ ๊ฒฐ๊ณผ๋ฅผ ๋ณด๋ฉด CNN์€ Transformer์˜ ๊ฐ•๊ฑด์„ฑ์— ๋ฏธ์น˜์ง€ ๋ชปํ•˜๊ณ  ๊ธฐ์กด ์—ฐ๊ตฌ๋“ค์˜ ์ฃผ์žฅ์ด ๋งž์•„๋ณด์ธ๋‹ค

  • ํ•˜์ง€๋งŒ ์ €์ž๋“ค์€ TPA์˜ ํŠน์„ฑ์— ์ฃผ๋ชฉํ•˜์—ฌ ์ƒˆ๋กœ์šด ์ง€์ ์„ ํ•œ๋‹ค. TPA๋Š” ์ด๋ฏธ์ง€์œ„์— ์ธ์œ„์ ์ธ patch๊ฐ€ ๋ถ™๋Š” ํ˜•ํƒœ์ด๋‹ค. ์ด๋Š” patch๋ฅผ ์ž˜๋ผ ๋ถ™์ด๊ฑฐ๋‚˜ ์‚ญ์ œํ•˜๋Š” CutMix์™€ ์œ ์‚ฌํ•˜๋ฉฐ CutMix๋Š” ViT์—๋งŒ ์ ์šฉ๋˜์—ˆ๊ธฐ๋•Œ๋ฌธ์— ViT์—๊ฒŒ TPA๊ฐ€ ๋‹น์—ฐํžˆ ์œ ๋ฆฌํ•œ task๋ผ๋Š” ๊ฒƒ์ด๋‹ค

  • As evidence, they trained ResNet-50 with the three strong augmentations used for ViT and measured its TPA robustness; the results are shown in Table 4.

  • As hypothesized, the presence or absence of CutMix largely determined the performance.

  • With RandAug + CutMix, ResNet-50 exceeded DeiT's robustness to TPA, refuting the prior claim that Transformers are more robust than CNNs to patch-based attacks.

4.2 Robustness on OOD Samples

  • ์ด ์ฑ•ํ„ฐ์—์„œ๋Š” DeiT์˜ Recipes ์ค‘ ์–ด๋–ค ๊ฒƒ์„ ์–ด๋–ป๊ฒŒ ResNet์— ์ ์šฉํ•  ๊ฒƒ์ธ์ง€ ์ •ํ•œ ๋’ค์— ResNet์„ ํ•™์Šต ํ›„ ์„ฑ๋Šฅ์„ DeiT์™€ ๋น„๊ตํ•˜๋Š” ๋‚ด์šฉ์„ ๋‹ด๊ณ ์žˆ๋‹ค

4.2.1 Aligning Training Recipes

  • Even without pre-training on large-scale data, ViT was more robust (ResNet-50* is described below).

A Fully Aligned Version(Step 0)

  • ResNet-50* ์€ DeiT์˜ recipe๋ฅผ ๋”ฐ๋ผ opimizer(Adam-W), lr scheduler and strong augmentation์„ ์ ์šฉํ–ˆ์ง€๋งŒ ResNet-50์— ๋น„ํ•ด์„œ ๋ˆˆ์— ๋„๋Š” ์„ฑ๋Šฅ ํ–ฅ์ƒ์€ ์—†์—ˆ๋‹ค(Table 5)

    • ๋”ฐ๋ผ์„œ ์„ธ ์Šคํ…์„ ๊ฑฐ์ณ DeiT์™€ ์กฐ๊ฑด์„ ๊ฐ™์ดํ•˜๋Š” ์ตœ์ ์˜ setup์„ ์ฐพ์•„๋ณธ๋‹ค(Ablation)

Step 1 : Aligning Learning Rate Scheduler

  • In Table 6, a cosine decay schedule performs better than step decay, so it is adopted.

Step 2 : Aligning Optimizer

  • Table 6์—์„œ, Adam-W๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์€ ResNet์˜ ์„ฑ๋Šฅ๊ณผ ๊ฐ•๊ฑด์„ฑ์„ ๋ชจ๋‘ ํ•ด์ณค๋‹ค. ๋”ฐ๋ผ์„œ M-SGD์‚ฌ์šฉ

Step 3 : Aligning Augmentation Strategies

  • Various combinations were examined; the presence of strong augmentation improves OOD performance. Nevertheless, DeiT still performed best.

Comparing ResNet With Best Training Recipes To DeiT-S

  • Step์„ ๊ฑฐ์ณ ์„ธ๊ฐ€์ง€ training recipe๋ฅผ ์กฐ์‚ฌํ–ˆ์Œ์—๋„ ResNet์€ DeiT์˜ OOD์„ฑ๋Šฅ์„ ๋”ฐ๋ผ๊ฐ€์ง€ ๋ชปํ–ˆ๋‹ค

    • ์ด๊ฒƒ์€ Transformer์™€ CNN์‚ฌ์ด OOD์„ฑ๋Šฅ์„ ๊ฐ€๋ฅธ key๊ฐ€ training recipe์— ์žˆ์ง€ ์•Š์„ ์ˆ˜ ์žˆ์Œ์„ ์•”์‹œํ•œ๋‹ค

Model Size

  • A new experiment also compares models by #params. ResNet entries marked with * have all three recipe steps applied; "Best" uses the best combination found above.

  • Overall, DeiT showed the best OOD performance across all parameter counts.

4.2.2 Distillation

  • Result 1 (teacher: DeiT, student: ResNet): contrary to common wisdom, the student performs worse; DeiT remains better.

  • Result 2 (teacher: ResNet, student: DeiT): DeiT is still better.

  • Taken together with 4.2.1, these results suggest that DeiT's strong generalization comes from the Transformer architecture itself, not from the training setup or knowledge distillation.
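As background, soft-label knowledge distillation matches the student's temperature-softened output distribution to the teacher's. A generic sketch (not the exact DeiT distillation head used in the paper):

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as is conventional in soft-label distillation."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return float(np.sum(p_t * (np.log(p_t) - np.log(p_s))) * T * T)

# usage: zero when the student matches the teacher, positive otherwise
same = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
diff = distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```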

4.2.3 Hybrid Architecture

  • Hybrid-DeiT is a hybrid model that feeds the output of ResNet-18's res_4 block into DeiT-Mini.

  • Adding the Transformer structure to the CNN (ResNet) made it more robust than ResNet-50, but still less robust than a pure Transformer. This shows that the Transformer's self-attention mechanism is an essential ingredient for improving robustness.
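A miniature of the hybrid forward pass (dimensions and weights are illustrative, not ResNet-18/DeiT-Mini's): a CNN stage produces a C x H x W feature map, which is flattened into H*W tokens of dimension C and handed to a self-attention stage:

```python
import numpy as np

def self_attention(tokens, Wq, Wk, Wv):
    """Single-head self-attention over a token sequence (minimal sketch)."""
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # each row sums to 1
    return attn @ V

rng = np.random.default_rng(0)
C, H, W, d = 8, 4, 4, 8
feature_map = rng.normal(size=(C, H, W))   # stand-in for a CNN-stage output
tokens = feature_map.reshape(C, H * W).T   # (H*W, C): one token per location
Wq, Wk, Wv = (rng.normal(size=(C, d)) for _ in range(3))
out = self_attention(tokens, Wq, Wk, Wv)   # (H*W, d) attended tokens
```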

4.2.4 300-Epoch Training

  • CNNs are typically trained for 100 epochs, while Transformers are usually trained for around 300. Training with this matched 300-epoch setup gives the results in Table 9.

  • For an even fairer comparison, ResNet-101 and ResNet-200, whose clean accuracy is higher than DeiT's, were also tested; DeiT still showed higher OOD performance.

  • This allows the claim that Transformers are more robust than CNNs on OOD samples.

5. Conclusion

  • When experiments previously run under unfair conditions are redone with proper controls, Transformers are not more robust than CNNs against adversarial attacks.

  • The Transformer's OOD performance was also shown to be related to self-attention.

  • The authors hope this work deepens the understanding of Transformers and enables fair comparisons between Transformers and CNNs.

๊ฐœ์ธ์  ์˜๊ฒฌ์œผ๋กœ..

  • ViT์˜ ๋“ฑ์žฅ์€ ๋งŽ์€ ์ด์Šˆ๋ฅผ ๋‚ณ์•˜์Šต๋‹ˆ๋‹ค. ์ฒ˜์Œ CNN์ดํ›„ Image๋ถ„๋ฅ˜๋ฅผ ์œ„ํ•œ ๊ทผ์›์ ์ธ ์ƒˆ๋กœ์šด ๋ฐฉ๋ฒ•๋ก  ์ œ์‹œ์˜€๊ณ  ๋ฌด์—‡๋ณด๋‹ค ์„ฑ๋Šฅ์ด ์ข‹์•˜์Šต๋‹ˆ๋‹ค. ์‹ฌ์ง€์–ด ์ตœ๊ทผ ์—ฐ๊ตฌ๋“ค์—์„œ๋Š” ViT๊ฐ€ CNN๋ณด๋‹ค ๊ฐ•๊ฑดํ•˜๊ธฐ๊นŒ์ง€ ํ•˜๋‹ค๋Š” ๊ฒฐ๊ณผ๋ฅผ ๋„์ถœํ•˜๋ฉด์„œ Vision์˜ ์˜์—ญ์€ ์ด์ œ (์—„์ฒญ๋‚œ pretrain dataset์„ ๊ฐ€์ง„ ์‚ฌ์—…์ฒด๊ฐ€ ํ•™์Šตํ•œ) ViT๊ฐ€ ๋ชจ๋‘ ๊ฐ€์ ธ๊ฐˆ ๊ฒƒ์ด๋ผ๋Š” ์˜ˆ์ƒ์„ ํ•˜๊ธฐ๋„ ํ–ˆ์Šต๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ ํ•™๊ณ„์˜ ์ด๋Ÿฐ ๋ฏฟ์Œ ์ž์ฒด์— ์˜๋ฌธ์„ ๊ฐ€์ง€๊ณ  ๋„์ „ํ•˜๋Š”๊ฒŒ ์‰ฌ์šด์ผ์ด ์•„๋‹ˆ์—ˆ์„ ๊ฒƒ์ด๋ผ๊ณ  ์ƒ๊ฐํ•ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฐ ์—ฐ๊ตฌ๋ฅผ ๋‚ด๋†“์€ ์—ฐ๊ตฌ์ž๋“ค์˜ ์‹ค๋ ฅ๊ณผ ์ž์‹ ๊ฐ์—์„œ ๋˜ ํ•œ๋ฒˆ ๊ฒธ์†ํ•ด์•ผํ•จ์„ ๋А๋‚๋‹ˆ๋‹ค.


Author Information

  • ํ™์„ฑ๋ž˜ SungRae Hong

    • Master's Course, KAIST Knowledge Service Engineering

    • Interested In : SSL, Vision DL, Audio DL

    • Contact : sun.hong@kaist.ac.kr

6. Reference & Additional materials

  • Y. Bai et al., "Are Transformers More Robust Than CNNs?", NeurIPS 2021.

  • Github Implementation LINK
