Hard Negative Mixing for Contrastive Learning

방망이 깎는 연구자 2023. 1. 12. 22:22

https://arxiv.org/abs/2010.01028

1. Introduction

Contrastive learning (CL) is a technique widely used in self-supervised learning (SSL) and can be applied in an unsupervised manner. It pulls the embeddings of two transformed versions of the same image close together and pushes the embeddings of any other image away. Recent work such as MoCo and InfoMin applies hand-crafted sets of data augmentations and shows that they help representation learning.

At the same time, data mixing techniques at the pixel level, such as CutMix and MixUp, and at the feature level, such as Manifold mixup, are known to help model training.

Meanwhile, because the number of negatives directly affects the contrastive loss, existing approaches either increase the batch size substantially (as in SimCLR) or maintain a large memory bank (as in MoCo).

However, increasing the memory or batch size is known to give diminishing returns in performance.

“More negative samples does not necessarily mean hard negative samples.”

In this paper, the authors argue that hard negatives, an important aspect of contrastive learning, have been neglected so far.

Encouraged by the success of existing data mixing approaches, the authors propose hard negative mixing: feature-level mixing of hard negative samples that can be computed on the fly with minimal computational overhead.

The authors' contributions are as follows.

a) They take a closer look at MoCo-based contrastive self-supervised learning and observe the need for harder negatives.

b) They propose hard negative mixing, which synthesizes hard negatives directly in the embedding space and adapts them to each positive query. They propose mixing the hardest existing negatives with each other, and also mixing the hardest negatives with the query itself.

c) They show that, across a wide range of hyperparameters, the method improves both the generalization of the visual representations and the utilization of the embedding space.

d) They report good linear probing performance.

2. Related work

Most early self-supervised learning methods design proxy classification tasks (thinking of them as upstream or pretext tasks may make this easier to follow) that predict properties of a transformation applied to a single image.

Beyond papers such as MoCo, SimCLR, PIRL, CMC, and SwAV, there is also work (DeepCluster, ICT) that learns to form clusters in feature space while jointly performing contrastive learning or transformation prediction tasks.

The performance of most contrastive methods hinges on how data augmentation is used. Recent work shows that heavy data augmentation applied to the same image is crucial for learning useful representations, because it modulates the hardness of the self-supervised task through the positive pair. In contrast, the authors' method changes the hardness of the proxy task from the negative pair side.

The selection of negatives was also discussed before this paper (Mining on Manifolds, Debiased Contrastive Learning, On Mutual Information in Contrastive Learning for Visual Representations, Contrastive Learning with Adversarial Examples).

The authors of Mining on Manifolds mine hard negatives from a large set by focusing on features that are neighbors with respect to the Euclidean distance but not with respect to a manifold distance defined over the nearest-neighbor graph.

Debiased Contrastive Learning is interested in approximating the "true" distribution of negative samples and presents a debiased version of the contrastive loss in an effort to mitigate the effect of false negatives.

On Mutual Information in Contrastive Learning for Visual Representations presents a variational extension of InfoNCE that is coupled with modified strategies for negative sampling.

Finally, Contrastive Learning with Adversarial Examples proposes using adversarial examples to generate more challenging positive pairs and hard negative pairs.

Mixing for contrastive learning. Mixup and its many variants (Un-Mix, Manifold mixup, Attentive CutMix, CutMix) have been shown to be very effective strategies when paired with a cross-entropy loss. Manifold mixup is a feature-space regularizer that makes the network less confident on interpolations of hidden states. The benefits of interpolation have only recently been studied for losses other than cross-entropy.

In Un-Mix, the authors propose mixing in image/pixel space for self-supervised learning, whereas the authors of this paper create query-specific synthetic points on the fly in the embedding space.

The authors of Embedding Expansion learn interpolations between embeddings for supervised metric learning on fine-grained recognition tasks.

In contrast, MoCHi needs no class annotations, performs no selection of negatives, and samples only a single random interpolation between multiple pairs.

In addition, the paper goes beyond mixing negatives with each other: it mixes the query (the positive side) with negatives to obtain even harder negatives and shows improved performance.

3. Understanding hard negatives in unsupervised contrastive learning

3.1 Contrastive learning with memory

$f$ is a CNN encoder for visual representation learning that maps an input image $x$ to an embedding vector $z = f(x)$, $z \in \mathbb{R}^d$.

In addition, let $Q$ be a "memory bank" of size $K$ (a set of $K$ embeddings in $\mathbb{R}^d$). Let the query $q$ and key $k$ embeddings form a positive pair, which is contrasted with every feature $n$ in the bank of negatives ($Q$), also called the queue.

The most widely used contrastive loss is the following.

$$L_{q,k,Q} = -\log \cfrac{\exp(q^T k/\tau)}{\exp(q^T k/\tau) + \sum_{n \in Q} \exp(q^T n/\tau)}$$

Here $\tau$ is a temperature parameter and all embeddings are $\ell_2$-normalized. Many prior works have successfully built the query and key from two transformations of the same image (SimCLR, MoCo v2, PIRL, etc.).
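
As a rough illustration of the loss above, here is a minimal pure-Python sketch. The helper names (`l2_normalize`, `dot`, `contrastive_loss`), the toy 3-dimensional vectors, and the temperature `tau=0.2` are assumptions made for this example and are not taken from the paper's code. It only shows that a negative close to the query contributes far more loss than a distant one.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def contrastive_loss(q, k, negatives, tau=0.2):
    """-log( exp(q.k/tau) / (exp(q.k/tau) + sum_n exp(q.n/tau)) )."""
    pos = math.exp(dot(q, k) / tau)
    neg = sum(math.exp(dot(q, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))

# Toy embeddings (hypothetical values, chosen only for illustration).
q = l2_normalize([1.0, 0.2, 0.0])
k = l2_normalize([0.9, 0.3, 0.1])      # positive key, close to q
easy = l2_normalize([-1.0, 0.0, 0.5])  # negative far from q
hard = l2_normalize([0.8, 0.5, 0.0])   # negative close to q

loss_easy = contrastive_loss(q, k, [easy])
loss_hard = contrastive_loss(q, k, [hard])
print(loss_easy, loss_hard)
```

With a single easy negative the loss is almost zero, while a single hard negative already yields a sizeable loss, which is the intuition behind caring about how hard the negatives are rather than only how many there are.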

The memory bank $Q$ contains the negatives for each positive pair. It can be defined as an "external" memory containing every other image in the dataset, as a queue of the last batches, or simply as every other image in the current minibatch.

The log-likelihood function in the equation above is defined over the probability distribution created by applying a softmax function for each input/query $q$.

Let $p_{z_i}$ be the matching probability for the query and feature $z_i \in Z = Q \cup \{ k \}$. Then the gradient of the loss with respect to the query $q$ is as follows.

$$\cfrac{\partial L_{q,k,Q}}{\partial q} = -\cfrac{1}{\tau} \Bigg( (1-p_k) \cdot k - \sum_{n \in Q} p_n \cdot n \Bigg), \quad \text{where} \ p_{z_i} = \cfrac{\exp(q^T z_i/\tau)}{\sum_{j \in Z} \exp(q^T z_j/\tau)}$$

$p_k$ and $p_n$ are the matching probabilities of the key and of a negative feature, i.e. for $z_i=k$ and $z_i=n$. The contributions of the positive and negative logits to the loss are identical to those of a $(K+1)$-way cross-entropy classification loss, where the logit for the key corresponds to the query's latent class and all gradients are scaled by $1/\tau$.
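
The gradient expression can be sanity-checked numerically. The sketch below is only an illustration: the function names, the toy vectors, and `tau = 0.2` are assumptions, and $q$ is treated as a free variable (the $\ell_2$-normalization of $q$ is not differentiated through). It compares the closed-form gradient with a central finite difference of the loss.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax_probs(q, k, negs, tau):
    """Matching probabilities p_{z_i} over Z = {k} followed by the negatives."""
    logits = [dot(q, z) / tau for z in [k] + negs]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def loss(q, k, negs, tau):
    """Contrastive loss: -log p_k."""
    return -math.log(softmax_probs(q, k, negs, tau)[0])

def analytic_grad(q, k, negs, tau):
    """-(1/tau) * ((1 - p_k) * k - sum_n p_n * n), computed per coordinate."""
    p = softmax_probs(q, k, negs, tau)
    grad = []
    for d in range(len(q)):
        pos_term = (1.0 - p[0]) * k[d]
        neg_term = sum(p[i + 1] * n[d] for i, n in enumerate(negs))
        grad.append(-(pos_term - neg_term) / tau)
    return grad

def numeric_grad(q, k, negs, tau, eps=1e-6):
    """Central finite differences of the loss with respect to q."""
    grad = []
    for d in range(len(q)):
        q_plus, q_minus = list(q), list(q)
        q_plus[d] += eps
        q_minus[d] -= eps
        grad.append((loss(q_plus, k, negs, tau) - loss(q_minus, k, negs, tau)) / (2 * eps))
    return grad

# Toy unit vectors (hypothetical values).
q = [0.6, 0.8, 0.0]
k = [0.8, 0.6, 0.0]
negs = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
tau = 0.2

grad_a = analytic_grad(q, k, negs, tau)
grad_n = numeric_grad(q, k, negs, tau)
print(grad_a)
print(grad_n)
```

The two gradients agree up to finite-difference error. The formula also shows that each negative pulls on the query in proportion to its matching probability $p_n$, so negatives with near-zero $p_n$ contribute almost nothing.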

3.2 Hard negatives in contrastive learning

Hard negatives are crucial in CL. MoCo-v2 differs from MoCo in three main ways: 1) an added MLP head, 2) a cosine learning rate schedule, and 3) more challenging data augmentation. The appendix discusses how the more challenging data augmentation makes the proxy task harder. Even though proxy task performance drops, linear classification performance improves.

The paper then discusses how MoCHi obtains a similar effect by mixing harder negatives to modulate the proxy task.

3.3 A class oracle-based analysis

ImageNet class label annotations (a class oracle) are used to analyze the negatives in contrastive learning. Define the false negatives (FN) as all negative features in the memory $Q$ that correspond to images of the same class as the query. The authors first quantify false negatives in contrastive learning and then check how much they affect linear classification performance. The class annotations can also be used to train a contrastive self-supervised learning oracle, in which the FN are ignored among each query's negatives during training and performance is then measured on downstream tasks.

This is related to SupCon, where the contrastive loss is used in a supervised manner with positive pairs formed from samples that share the same label. Unlike SupCon, the oracle uses labels only to discard, for each query, the negatives that have the same label.

The proportion of FN is quantified for the oracle and for MoCo-v2 by looking at the 1024 highest negative logits across training epochs. In all cases, as the representation improves, more and more FN (same-class logits) rank near the top. By discarding them from the negative queue, the class oracle version can bring same-class embeddings closer together.
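
The oracle analysis can be sketched in a few lines. The functions `false_negative_fraction` and `oracle_negatives`, and the toy logits and labels, are hypothetical names and values made up for this example. The paper looks at the top 1024 logits, and the toy call uses `top=3` only to keep the example small.

```python
def false_negative_fraction(logits, labels, query_label, top=1024):
    """Fraction of the `top` highest negative logits that share the query's class."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top]
    same_class = sum(1 for i in ranked if labels[i] == query_label)
    return same_class / len(ranked)

def oracle_negatives(logits, labels, query_label):
    """Class-oracle variant: drop same-class (false negative) logits before the loss."""
    return [l for l, y in zip(logits, labels) if y != query_label]

# Toy queue of five negatives with their (normally unavailable) class labels.
logits = [0.9, 0.1, 0.8, 0.3, 0.7]
labels = ["cat", "dog", "dog", "cat", "bird"]

fraction = false_negative_fraction(logits, labels, "cat", top=3)
kept = oracle_negatives(logits, labels, "cat")
print(fraction, kept)
```

Here the three highest logits belong to a cat, a dog, and a bird, so one third of the top logits are false negatives for a cat query, and the oracle keeps only the three non-cat logits.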

Performance results with the class oracle, together with a supervised upper bound trained with cross-entropy (CE), are shown at the bottom of Fig 1 ~~(shouldn't this be Table 1?)~~!

78.0 (MoCo-v2, 200 epochs) → 81.8 (MoCo-v2 oracle, 200 epochs) → 86.2 (supervised).

4. Feature space mixing of hard negatives

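This section presents an approach for synthesizing hard negatives, i.e. mixing some of the hardest negative features of the contrastive loss with each other, or mixing the query with some of the hardest negative features.

This hard negative mixing approach is called MoCHi and is parameterized by $(N, s, \grave s)$: the number $N$ of hardest negatives kept, the number $s$ of points synthesized by mixing pairs of negatives, and the number $\grave s$ of points synthesized by mixing the query with negatives.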

4.1 Mixing the hardest negatives

For a query $q$, its key $k$, and negative/queue features $n \in Q$ from a queue of size $K$, the loss for the query is composed of the logits $l(z_i) = q^T z_i / \tau$ fed into the softmax function.

Let $\tilde Q = \{ n_1, \ldots , n_K \}$ be the ordered set of all negative features, such that

$l(n_i) > l(n_j),\ \forall i < j$, i.e. the set of negative features sorted by decreasing similarity to that particular query feature.

For each query, the authors propose synthesizing $s$ hard negative features by creating convex linear combinations of pairs of its "hardest" existing negatives. The hardest negatives are defined by truncating the ordered set $\tilde Q$, i.e. keeping only the first $N < K$ items.

$H = \{h_1, \ldots , h_s \}$ is the set of synthetic points to be generated. A synthetic point $h_k \in H$ is given as follows.

$$h_k = \cfrac{\tilde h_k}{\Vert \tilde h_k \Vert_2}, \quad \text{where} \ \tilde h_k = \alpha_k n_i + (1-\alpha_k) n_j$$

Here $n_i, n_j \in \tilde Q^N$ are negative features randomly chosen from the set $\tilde Q^N=\{ n_1, \ldots ,n_N \}$ of the closest $N$ negatives, $\alpha_k \in (0, 1)$ is a randomly chosen coefficient, and $\Vert \cdot \Vert_2$ is the $\ell_2$-norm.

After mixing, the logits $l(h_k)$ are computed and appended as additional negative logits for query $q$. This process is repeated for each query in the batch.

Since all the other logits $l(z_i)$ are already computed, the extra computational cost only involves $s$ dot products between the query and the synthesized features, which is computationally equivalent to increasing the memory by $s \ll K$.
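
The steps of Section 4.1 (sort, truncate to the top $N$, mix random pairs, re-normalize) can be sketched as follows. This is a pure-Python illustration under assumptions, not the authors' implementation: the function name `mix_hardest_negatives`, the random toy queue, and the choice to draw two distinct negatives per mix are all made up for the example. Sorting by the dot product is equivalent to sorting by the logit because $\tau > 0$.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def mix_hardest_negatives(q, queue, N, s, rng):
    """Synthesize s negatives as normalized convex mixes of pairs from the N hardest."""
    # Sort the queue by decreasing similarity to the query and keep the first N items.
    hardest = sorted(queue, key=lambda n: dot(q, n), reverse=True)[:N]
    synthetic = []
    for _ in range(s):
        # Assumption: the two mixed negatives are distinct (sampled without replacement).
        n_i, n_j = rng.sample(hardest, 2)
        # random() draws from [0, 1); the boundary value 0 is practically never hit.
        alpha = rng.random()
        mixed = [alpha * a + (1.0 - alpha) * b for a, b in zip(n_i, n_j)]
        synthetic.append(l2_normalize(mixed))
    return synthetic

# Toy setup: 64 random unit vectors in 8 dimensions act as the queue (hypothetical sizes).
rng = random.Random(0)
queue = [l2_normalize([rng.gauss(0.0, 1.0) for _ in range(8)]) for _ in range(64)]
q = l2_normalize([rng.gauss(0.0, 1.0) for _ in range(8)])

H = mix_hardest_negatives(q, queue, N=16, s=4, rng=rng)
extra_logits = [dot(q, h) / 0.2 for h in H]  # the only extra cost: s dot products
print(extra_logits)
```

The extra logits would simply be appended to the existing negative logits of the query before the softmax.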

$$\grave s=\ <q,\tilde h_k>$$

4.2 Mixing for even harder negatives

If hard negatives are created as convex combinations of existing negative features and, for the purposes of this analysis, the effects of the $\ell_2$-normalization are ignored, the generated features lie inside the convex hull of the hardest negatives.

Early in training, when in most cases the query is not linearly separable from the negatives, this synthesis can produce negatives much harder than the current ones. ~~(This just means that early on, training has not progressed enough to tell negatives and positives apart.)~~

As training progresses, and assuming linear separability is achieved, synthesizing features this way does not necessarily produce negatives harder than the hardest existing ones.

It does, however, still stretch the space around the query, pushing the memory negatives further away and increasing the uniformity of the space. This stretching of the space around the query can also be seen in the t-SNE visualization.

To explore this intuition to its full extent, the authors propose mixing the query with the hardest negatives to get even harder negatives for the proxy task.

Therefore, for each query, $\grave s$ additional synthetic hard negative features are synthesized by mixing the query's own feature with a feature randomly chosen from the set $\tilde Q^N$ of hardest negatives. Let $\grave H=\{ \grave h_1, \ldots ,\grave h_{\grave s} \}$ be the set of synthetic points generated by mixing the query and negatives.

Then, similarly to the equation for $h_k$ in Section 4.1 (Eq. (3) in the paper), the synthetic points are $\grave h_k = \cfrac{\grave{\tilde h}_k}{\Vert \grave{\tilde h}_k \Vert_2}$, where $\grave{\tilde h}_k = \beta_k q + (1-\beta_k) n_j$.

Here $n_j$ is a negative feature randomly chosen from $\tilde Q^N$, while $\beta_k \in (0, 0.5)$ is a randomly chosen mixing coefficient for the query. $\beta_k < 0.5$ guarantees that the query's contribution is always smaller than that of the negative.
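
The query-mixing step can be sketched in the same way. As before, this is an illustration under assumptions: the function name `mix_query_with_negatives` and the toy vectors are hypothetical, and the hardest negatives are given directly instead of being selected from a queue.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def mix_query_with_negatives(q, hardest, s_prime, rng):
    """Synthesize s_prime negatives as normalized mixes beta*q + (1-beta)*n_j."""
    synthetic = []
    for _ in range(s_prime):
        n_j = rng.choice(hardest)
        # beta stays at or below 0.5 so the negative dominates the mix;
        # uniform() may return an endpoint, which is negligible for a sketch.
        beta = rng.uniform(0.0, 0.5)
        mixed = [beta * a + (1.0 - beta) * b for a, b in zip(q, n_j)]
        synthetic.append(l2_normalize(mixed))
    return synthetic

# Toy unit vectors (hypothetical values); all negatives have positive similarity to q.
q = [1.0, 0.0, 0.0]
hardest = [
    l2_normalize([0.8, 0.6, 0.0]),
    l2_normalize([0.7, 0.0, 0.7]),
    l2_normalize([0.6, -0.8, 0.0]),
]

rng = random.Random(0)
H_prime = mix_query_with_negatives(q, hardest, s_prime=3, rng=rng)
print([round(dot(q, h), 3) for h in H_prime])
```

Each synthetic point lies on the arc between the query and the chosen negative, so it is at least as similar to the query as that negative. This is why these points can be harder than anything already in the queue.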

4.3 Discussion and analysis of MoCHi

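The contrastive loss is computed on embeddings produced by an MLP head rather than a linear layer.

As a result, the embeddings whose dot products enter the loss are not the ones used for target tasks; a lower-layer embedding is used there instead.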

Is the proxy task more difficult?

Proxy task performance is shown for two MoCHi variants, both when the synthetic features are included (lines with no marker) and when they are not (lines with triangle marker).

When mixing only pairs of negatives ($\grave s=0$, green lines), the model learns faster but ends up not far from the baseline.

In fact, once the features converge, $\max\ l(h_k) < \max\ l(n_j),\ h_k \in H,\ n_j \in \tilde Q^N$. (Once training has gone well, it is natural that the highest logit among negative-negative mixtures is below the highest logit among the real negatives.)

This is not the case, however, when negatives are additionally synthesized by mixing with the query.

As Fig 2b shows, at the end of training $\max\ l(\grave h_k) > \max\ l(n_j),\ \grave h_k \in \grave H$. (Once training has gone well, the query-mixed hard negatives have higher logits than the real negatives: pairs mixed with the query are more similar to it.)
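
The two inequalities can be reproduced on a toy 2D example. Everything here is an assumption made for illustration: the angles, the mixing weights, and the helper names `unit`, `mix`, and `sim`. All negatives sit on one side of the query, mimicking the linearly separable case late in training.

```python
import math

def unit(deg):
    """Unit vector in 2D at the given angle in degrees."""
    r = math.radians(deg)
    return [math.cos(r), math.sin(r)]

def mix(a, b, w):
    """Normalized convex mix w*a + (1-w)*b."""
    m = [w * x + (1.0 - w) * y for x, y in zip(a, b)]
    norm = math.hypot(m[0], m[1])
    return [x / norm for x in m]

def sim(a, b):
    return sum(x * y for x, y in zip(a, b))

q = unit(0)
negatives = [unit(40), unit(60), unit(90)]  # all on one side of the query

hardest_real = max(sim(q, n) for n in negatives)  # similarity of the 40-degree negative
h = mix(negatives[0], negatives[1], 0.5)          # negative-negative mix, points at 50 degrees
h_prime = mix(q, negatives[0], 0.4)               # query-negative mix with beta = 0.4

print(sim(q, h), hardest_real, sim(q, h_prime))
```

The negative-negative mix ends up less similar to the query than the hardest real negative, while the query-negative mix ends up more similar. This matches the ordering of the maximum logits described above.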

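That is, when the synthetic negatives are discarded, the final proxy task performance is similar to the MoCo-v2 baseline, but when they are taken into account, the final proxy task performance is much lower. With MoCHi, the hardness of the proxy task can be modulated through the hardness of the negatives.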

Oracle insights for MoCHi.

In Fig 2c, the percentage of synthesized features obtained by mixing two false negatives (lines with square markers) increases over time but remains very small, around 1%.

At the same time, about 8% of the synthetic features are partially false negatives (lines with triangle markers), i.e. at least one of the two components is a false negative.

For the oracle variant of MoCHi, false negatives are not allowed to participate in synthesizing hard negatives.

Not only does the MoCHi oracle reach a higher upper bound (82.5 vs 81.8 for MoCo-v2) and further close the gap to the CE upper bound, but the appendix also shows that, after longer training, the MoCHi oracle recovers most of the performance loss relative to CE.

i.e. 79.0 (MoCHi, 200 epochs) → 82.5 (MoCHi oracle, 200 epochs) → 85.2 (MoCHi oracle, 800 epochs) → 86.2 (supervised).

It is worth noting that by the end of training, MoCHi has a slightly lower percentage of false negatives in the top logits than MoCo-v2 (rightmost values of Fig 2c).

MoCHi adds synthetic negative points that are (at least partially) false negatives and pushes same-class embeddings apart, yet it achieves higher linear classification performance on ImageNet-100.

That is, although the absolute similarity of same-class features may decrease, the method appears to produce a more linearly separable space.

This led the authors to look more closely at how synthetic hard negatives affect the utilization of the embedding space.

Measuring the utilization of the embedding space.
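
A minimal sketch of how such utilization could be measured, assuming the alignment and uniformity measures of Wang and Isola (the two quantities named again in the main observations of Section 5). The exponent 2 and `t=2.0` are their commonly used defaults and are assumptions here, as are the function names and toy points. This post does not state how the plotted values in the figures are signed.

```python
import math
from itertools import combinations

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def alignment(positive_pairs):
    """Mean squared distance between positive pairs (lower means closer positives)."""
    return sum(sq_dist(x, y) for x, y in positive_pairs) / len(positive_pairs)

def uniformity(features, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs (lower means more spread out)."""
    potentials = [math.exp(-t * sq_dist(x, y)) for x, y in combinations(features, 2)]
    return math.log(sum(potentials) / len(potentials))

def unit(deg):
    r = math.radians(deg)
    return [math.cos(r), math.sin(r)]

# Toy 2D features on the unit circle (hypothetical values).
clustered = [unit(a) for a in (0, 5, 10, 15)]   # uses a small part of the space
spread = [unit(a) for a in (0, 90, 180, 270)]   # uses the whole circle

print(uniformity(clustered), uniformity(spread))
print(alignment([(unit(0), unit(0)), (unit(30), unit(40))]))
```

Under these definitions the spread-out features get a much lower uniformity value than the clustered ones, which is one way to quantify that the embedding space is better utilized.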

5. Experiments

Representations are learned on ImageNet-1K and on the ImageNet-100 subset.

Based on MoCo-v2. MoCo-v2 results are copied over. 4 GPUs.

For linear classification on ImageNet-100, the common protocol is followed and results are reported on the validation set.

Initial LR 10.0 (30.0), with a schedule that drops at epochs 30, 40, and 50.

Training uses K = 16k.

For MoCHi, a warm-up of 10 epochs.

On PASCAL VOC, a Faster R-CNN is fine-tuned.

A note on reporting variance in results.

It is unfortunate that SSL papers do not discuss variance. Training and evaluating a ResNet-50 model on ImageNet-1K takes 6-7 days. Where a standard deviation is shown, it is measured over at least 3 runs.

Fig 3b shows that a large number of MoCHi combinations give consistent performance gains.

The other ablations also show MoCHi improving over MoCo-v2 (+0.7%).

Table 1 compares the best-performing MoCHi variants and shows their gains over the MoCo-v2 baseline.

It also includes a comparison with a recent method that uses mixup in pixel space to synthesize harder images. ~~(Is this iMix?)~~

Comparison with the state of the art on ImageNet-1K, PASCAL VOC and COCO.

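Results are presented after training on the ImageNet-1K training set.

Based on the average negative logits plot, and because both the queue and the dataset are about 10 times larger for this training set, the authors experiment with smaller values of N than on ImageNet-100.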

Main observations

a) MoCHi does not show performance gains over MoCo-v2 for linear classification on ImageNet-1K. The authors attribute this to the bias induced by training with hard negatives on the same dataset as the downstream task.

Figures 3c and 2c show how hard negative mixing reduces alignment and increases uniformity for the dataset used during training.

b) MoCHi helps the model learn faster and, after 100 epochs, shows larger gains over MoCo-v2 in transfer learning.

c) The harder negative strategy presented in Section 4.2 helps a lot for shorter training.

d) In 200 epochs MoCHi can achieve performance similar to MoCo-v2 after 800 epochs on PASCAL VOC.

e) From all the MoCHi runs reported in Table 2 as well as in the Appendix, we see that performance gains are consistent across multiple hyperparameter configurations.

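Table 3 presents object detection (OD) and instance segmentation results on the COCO dataset.

Setup: Mask R-CNN with batch normalization. The image scale is in [640, 800] during training and 800 at inference. Models are fine-tuned and compared on val2017.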

MoCHi and MoCo use the same hyper-parameters as the ImageNet supervised counterpart (i.e. we did not do any method-specific tuning).

From Table 3 we see that MoCHi displays consistent gains over both the supervised baseline and MoCo-v2, for both 100 and 200 epoch pre-training.

In fact, MoCHi is able to reach the AP performance similar to supervised pre-training for instance segmentation (33.2) after only 100 epochs of pre-training.

6. Conclusions

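The paper identifies the need for harder negatives.

It presents a hard negative mixing method that further improves representations learned in an unsupervised way, giving better transfer learning and better utilization of the embedding space.

It also learns generalizable representations faster, which matters given the computational cost of SSL. Although the hyperparameters needed for maximum gains are specific to the training set, multiple MoCHi configurations provide considerable gains, and hard negative mixing is found to have a consistently positive effect on transfer learning.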