10 Knowledge Distillation

Lecture 10 - Knowledge Distillation | MIT 6.S965

EfficientML.ai Lecture 9 - Knowledge Distillation (MIT 6.5940, Fall 2023, Zoom)

10.5 KD for Object Detection

In the object detection domain, two additional problems must be solved.

The foreground and background must be separated well.
The bounding boxes must be localized accurately.

Note that bounding-box prediction is a regression problem, not a classification problem.

10.5.1 Distillation Pipeline for Object Detection

Learning Efficient Object Detection Models with Knowledge Distillation paper (2017)

To suit the characteristics of object detection, this paper performs KD with the following procedure.

feature imitation

Adaptation

The intermediate feature maps of the teacher and the student are compared.

A 1x1 conv is used to match the number of channels.

$$ L_{Hint}(V, Z) = \| V - Z \|_{2}^{2} $$
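A minimal sketch of the hint loss with a channel adapter, in plain Python. It assumes that $V$ is the teacher feature map and $Z$ is the student feature map after the 1x1 conv; the helper names (`conv1x1`, `hint_loss`) and the toy weights are hypothetical and only illustrate the idea.

```python
def conv1x1(feature, weight):
    """Apply a 1x1 convolution: a per-pixel linear map over channels.

    feature: nested lists shaped [C_in][H][W]
    weight:  nested lists shaped [C_out][C_in]
    """
    height, width = len(feature[0]), len(feature[0][0])
    return [
        [
            [
                sum(w_row[c] * feature[c][i][j] for c in range(len(feature)))
                for j in range(width)
            ]
            for i in range(height)
        ]
        for w_row in weight
    ]


def hint_loss(teacher_feat, adapted_student_feat):
    """Squared L2 distance ||V - Z||_2^2 between two equally shaped maps."""
    return sum(
        (v - z) ** 2
        for v_ch, z_ch in zip(teacher_feat, adapted_student_feat)
        for v_row, z_row in zip(v_ch, z_ch)
        for v, z in zip(v_row, z_row)
    )


# Toy example: the teacher has 2 channels, the student has 1 (both 1x2 maps).
teacher = [[[1.0, 2.0]], [[3.0, 4.0]]]
student = [[[1.0, 2.0]]]
adapter = [[1.0], [2.0]]  # hypothetical 1x1 conv weights, shape [2][1]

adapted = conv1x1(student, adapter)  # student map lifted to 2 channels
loss = hint_loss(teacher, adapted)
```

In practice the adapter weights would be learned together with the student, so the student is not forced to have the same width as the teacher.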

Detection head

Both the classification and regression outputs are computed, and then the loss is calculated.

Weighted Cross Entropy Loss

The class imbalance between foreground and background is addressed by using different weights for the two in the classification loss.

$L_{soft}(P_{s}, P_{t}) = - \sum w_{c} P_{t} \log P_{s}$
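The weighted soft loss can be sketched as follows. This is an illustration only: the class weights, the logits, and the convention that class 0 is background are assumptions made for the example, not values taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_soft_ce(student_logits, teacher_logits, class_weights):
    """L_soft = -sum_c w_c * P_t[c] * log(P_s[c])."""
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    return -sum(w * t * math.log(s) for w, t, s in zip(class_weights, p_t, p_s))


# Class 0 plays the role of background here; the weights are illustrative.
weights = [1.5, 1.0, 1.0]
teacher_logits = [2.0, 0.5, 0.1]
good_student = [1.9, 0.6, 0.0]  # close to the teacher's prediction
bad_student = [0.0, 2.0, 0.0]   # disagrees with the teacher

loss_good = weighted_soft_ce(good_student, teacher_logits, weights)
loss_bad = weighted_soft_ce(bad_student, teacher_logits, weights)
```

A student that agrees with the teacher gets a lower loss, and raising the weight of one class makes disagreement on that class more expensive.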
Bounded Regression Loss

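$$ L_{b}(R_{s}, R_{t}, y) = \begin{cases} \| R_{s} - y \|_{2}^{2}, & \text{if } \| R_{s} - y \|_{2}^{2} + m > \| R_{t} - y \|_{2}^{2} \\ 0, & \text{otherwise} \end{cases} $$

A margin $m$ is used here: once the student's regression error plus $m$ no longer exceeds the teacher's error (that is, the student beats the teacher by at least $m$), the loss becomes 0 and the regression term stops contributing to training.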
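A small sketch of the bounded regression loss, assuming boxes are given as flat coordinate lists. The box values and the margin are made-up numbers for illustration.

```python
def squared_l2(a, b):
    """Squared L2 distance between two coordinate lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def bounded_regression_loss(r_student, r_teacher, target, margin):
    """Penalize the student only while it is not better than the teacher by `margin`."""
    student_err = squared_l2(r_student, target)
    teacher_err = squared_l2(r_teacher, target)
    if student_err + margin > teacher_err:
        return student_err
    return 0.0


target = [0.0, 0.0, 1.0, 1.0]          # ground-truth box y
teacher_box = [0.1, 0.0, 1.0, 1.0]     # teacher error is about 0.01
worse_student = [0.3, 0.0, 1.0, 1.0]   # student error is about 0.09
better_student = [0.0, 0.0, 1.0, 1.0]  # student error is 0

loss_worse = bounded_regression_loss(worse_student, teacher_box, target, margin=0.005)
loss_better = bounded_regression_loss(better_student, teacher_box, target, margin=0.005)
```

The teacher's prediction acts as a bound rather than a target: the student is never pulled toward the teacher's box, only toward the ground truth while it still lags behind.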

10.5.2 Convert Regression to Classification Problem

Localization Distillation for Dense Object Detection paper (2022)

Alternatively, the bounding-box regression problem can be converted into a classification problem before KD is performed.

bounding box

The x axis is divided into 6 intervals, and the y axis is divided into 6 intervals.
Each interval is treated as a class.
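The conversion can be sketched as below. The binning follows the 6-interval description above; distilling the per-interval distribution with a temperature softmax and a KL divergence is an assumption borrowed from standard classification KD, and the image size, logits, and temperature are hypothetical.

```python
import math


def to_bin(coord, extent, num_bins=6):
    """Map a coordinate in [0, extent) to one of `num_bins` interval indices."""
    index = int(coord / extent * num_bins)
    return min(max(index, 0), num_bins - 1)


def softmax(logits, temperature=1.0):
    """Temperature-scaled, numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same intervals."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# A box corner at (x=130, y=40) in a 300x300 image becomes two class labels.
x_class = to_bin(130, 300)
y_class = to_bin(40, 300)

# Hypothetical per-interval logits for the x coordinate.
teacher_x_logits = [0.1, 0.5, 3.0, 0.5, 0.1, 0.0]
student_x_logits = [0.0, 2.0, 1.0, 0.0, 0.0, 0.0]
temperature = 2.0  # assumed value

ld_loss = kl_divergence(
    softmax(teacher_x_logits, temperature),
    softmax(student_x_logits, temperature),
)
```

Once each coordinate is a distribution over intervals, the teacher can pass on how uncertain it is about a box edge, which a single regressed number cannot express.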

10.6 KD for Semantic Segmentation

Structured Knowledge Distillation for Semantic Segmentation paper (2019)

In the semantic segmentation domain, a KD method that uses a discriminator has been proposed (adversarial distillation).

structured KD

feature imitation: performed in much the same way as in the classification and detection domains
Discriminator

adversarial loss: the student is trained so that it can fool the discriminator.
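As a rough illustration of the adversarial term, the sketch below uses a generic GAN log-loss. This is an assumption: the exact formulation in the paper may differ, and the discriminator scores here are made-up probabilities that a segmentation map came from the teacher.

```python
import math


def discriminator_loss(d_teacher, d_student):
    """The discriminator wants D(teacher map) -> 1 and D(student map) -> 0."""
    return -(math.log(d_teacher) + math.log(1.0 - d_student))


def student_adversarial_loss(d_student):
    """The student wants the discriminator to rate its map as teacher-like."""
    return -math.log(d_student)


# Hypothetical discriminator scores for a student output.
fooled = student_adversarial_loss(0.9)  # discriminator is fooled: small loss
caught = student_adversarial_loss(0.1)  # discriminator is not fooled: large loss
```

The two losses pull in opposite directions, so the student and the discriminator are updated in alternation.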

10.7 KD for GAN

GAN Compression: Efficient Architectures for Interactive Conditional GANs paper (2020)

KD for GAN

The training objective is as follows.

$$ \mathcal{L} = \mathcal{L}_{cGAN}(x) + \lambda_{recon} \mathcal{L}_{recon} + \lambda_{distill} \mathcal{L}_{distill}(x) $$

Reconstruction Loss

$$ \mathcal{L}_{recon} = \begin{cases} \| G(x) - y \| & \text{paired cGANs} \\ \| G(x) - G'(x) \| & \text{unpaired cGANs} \end{cases} $$

Distillation Loss

$$ \mathcal{L}_{distill} = \sum_{k=1}^{n} \| G_{k}(x) - f_{k}(G'_{k}(x)) \| $$

cGAN Loss

$$ \mathcal{L}_{cGAN} = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x}[\log (1 - D(x, G(x)))] $$
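The three terms can be combined as in the sketch below. Several things are assumptions: the text does not specify the norm (L1 is used here) or which of $G$ and $G'$ is the teacher (the argument names are kept neutral), and the adapter `f_k` is a plain Python function standing in for a learned channel-matching layer. All numbers are toy values.

```python
import math


def l1(a, b):
    """L1 distance between two flat lists (assumed norm)."""
    return sum(abs(p - q) for p, q in zip(a, b))


def cgan_loss(d_real, d_fake):
    """Batch mean of log D(x, y) plus batch mean of log(1 - D(x, G(x)))."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def reconstruction_loss(g_out, y=None, g_prime_out=None):
    """Paired: compare G(x) with y. Unpaired: compare G(x) with G'(x)."""
    reference = y if y is not None else g_prime_out
    return l1(g_out, reference)


def distillation_loss(g_feats, g_prime_feats, adapters):
    """sum_k || G_k(x) - f_k(G'_k(x)) || over the chosen layers."""
    return sum(
        l1(g_k, f_k(gp_k))
        for g_k, gp_k, f_k in zip(g_feats, g_prime_feats, adapters)
    )


def total_objective(cgan, recon, distill, lambda_recon, lambda_distill):
    return cgan + lambda_recon * recon + lambda_distill * distill


recon = reconstruction_loss([0.5, 0.5], y=[1.0, 0.0])       # paired case
adapters = [lambda feat: [2.0 * v for v in feat]]           # hypothetical f_1
distill = distillation_loss([[1.0, 2.0]], [[0.5, 1.0]], adapters)
cgan = cgan_loss(d_real=[0.5], d_fake=[0.5])
total = total_objective(cgan, recon, distill, lambda_recon=10.0, lambda_distill=1.0)
```

The cGAN term is a minimax objective, so in training the discriminator would maximize it while the generator minimizes it together with the other two terms.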

10.8 KD for NLP

MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices paper (2020), https://arxiv.org/abs/2004.02984

In the NLP domain, the MobileBERT paper implements KD by transferring the teacher's feature maps and attention information.

MobileBERT

Feature Map Transfer (FMT)
Attention Transfer (AT)

10.9 Network Augmentation

NETWORK AUGMENTATION FOR TINY DEEP LEARNING paper (2022), https://arxiv.org/pdf/2110.08890.pdf

Methods such as data augmentation and dropout, which large models use to avoid overfitting, actually hurt tiny models.

data augmentation: cutout, mixup, rotation, flip, etc.
dropout

The following figure shows the performance difference when these techniques are applied to a tiny model.

AutoAugment, dropout

10.9.1 Training Process

NetAug instead augments the model itself (reverse dropout).

Conversely, it induces overfitting in large models, so care is needed.

augment model: the number of channels in each layer is increased, and a dynamic neural network is trained with weight sharing.
Step 1: the forward and backward passes of the original and augmented models are performed together.

NetAug step 1 (left: original tiny model, right: augmented model)

The loss function is then expressed as the combination of two terms, base supervision and auxiliary supervision.

$$ \mathcal{L}_{aug} = \mathcal{L}(W_{base}) + \alpha \mathcal{L}([W_{base}, W_{aug}]) $$

scaling factor $\alpha$: controls how much the auxiliary supervision contributes to the loss
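A toy sketch of the NetAug loss with weight sharing: the base model is the first `base_width` hidden units of a wider, augmented two-layer network, and both are evaluated on the same shared weights. The network shape, the squared-error loss, and all numbers are assumptions chosen to keep the example small.

```python
def forward(x, w1, w2, width):
    """Two-layer ReLU network that uses only the first `width` hidden units."""
    hidden = [
        max(0.0, sum(wi * xi for wi, xi in zip(w1[h], x)))
        for h in range(width)
    ]
    return sum(w2[h] * hidden[h] for h in range(width))


def squared_error(pred, target):
    return (pred - target) ** 2


def netaug_loss(x, target, w1, w2, base_width, alpha):
    """L_aug = L(W_base) + alpha * L([W_base, W_aug]) on shared weights."""
    aug_width = len(w1)  # the augmented model uses every hidden unit
    base_loss = squared_error(forward(x, w1, w2, base_width), target)
    aug_loss = squared_error(forward(x, w1, w2, aug_width), target)
    return base_loss + alpha * aug_loss


# Shared weights: units 0-1 form the base model, units 2-3 exist only in
# the augmented model.
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
w2 = [1.0, 1.0, 0.5, 0.5]

loss = netaug_loss([1.0, 2.0], target=4.0, w1=w1, w2=w2, base_width=2, alpha=1.0)
```

Because the base weights appear in both terms, gradients from the augmented model also update the tiny model, and only the base slice needs to be kept for inference.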

10.9.2 NetAug Learning Curve

The following figure shows the performance obtained when NetAug is applied to training on the ImageNet dataset.

Learning curves on ImageNet

tiny model (MbV2-Tiny): NetAug prevents under-fitting and improves performance.
large model (ResNet50): NetAug causes over-fitting.