Metric learning

Metric learning is used in a wide range of settings. Example applications include clustering, ranking, and unsupervised learning.

First, as the title suggests: what is a metric? A metric is a distance function between two points x and y. To qualify as a distance function, it must satisfy the following four conditions.

Non-negativity: $f(x,y) \geq 0$
Identity of indiscernibles: $f(x,y) = 0 \iff x = y$
Symmetry: $f(x,y) = f(y,x)$
Triangle inequality: $f(x,z) \leq f(x,y) + f(y,z)$
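As a small sanity check, the four conditions can be tested numerically on a handful of points. This is only a sketch: the helper `satisfies_metric_axioms` and the sample points are made up for illustration, and passing on a finite set does not prove that a function is a metric.

```python
import itertools
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def squared_euclidean(x, y):
    """Squared Euclidean distance (not a metric: it breaks the triangle inequality)."""
    return euclidean(x, y) ** 2

def satisfies_metric_axioms(f, points, tol=1e-9):
    """Check the four metric conditions on every combination of `points`."""
    for x, y in itertools.product(points, repeat=2):
        if f(x, y) < -tol:                      # non-negativity
            return False
        if (f(x, y) <= tol) != (x == y):        # identity of indiscernibles
            return False
        if abs(f(x, y) - f(y, x)) > tol:        # symmetry
            return False
    for x, y, z in itertools.product(points, repeat=3):
        if f(x, z) > f(x, y) + f(y, z) + tol:   # triangle inequality
            return False
    return True

# Hypothetical sample points, chosen only for illustration.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 3.0)]
print(satisfies_metric_axioms(euclidean, points))
print(satisfies_metric_axioms(squared_euclidean, points))
```

The squared distance fails on the three collinear points: going from (0,0) to (2,0) costs 4, while the two hops through (1,0) cost only 1 + 1.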

Building on such distance functions, metric learning can be divided into classical statistical methods and deep learning methods. The classical methods rely on Euclidean distance and Mahalanobis distance. However, distance in raw image space is not very intuitive, and pixel-wise Euclidean distance assumes that the dimensions are independent (see LDA), so it may not be an appropriate metric.
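To illustrate the independence point, here is a minimal 2D sketch comparing the two classical distances. The covariance matrix and the points are hypothetical, and `mahalanobis_2d` is a hand-rolled helper for this example rather than a library function.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance in 2D for a symmetric covariance [[a, b], [b, d]]."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    inv = [[d / det, -b / det], [-b / det, a / det]]   # closed-form 2x2 inverse
    dx, dy = x[0] - y[0], x[1] - y[1]
    q = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.sqrt(q)

cov = [[1.0, 0.9], [0.9, 1.0]]   # hypothetical: two strongly correlated dimensions
origin = (0.0, 0.0)
along = (1.0, 1.0)               # moves with the correlation
against = (1.0, -1.0)            # moves against the correlation

print(euclidean(origin, along), euclidean(origin, against))
print(mahalanobis_2d(origin, along, cov), mahalanobis_2d(origin, against, cov))
```

Euclidean distance treats both displacements as equally far, while Mahalanobis distance considers the displacement against the correlation much farther. With an identity covariance the two distances coincide.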

So how has deep learning approached the problem?

I will organize the discussion around two broad families: softmax based approaches and contrastive learning based approaches.

Softmax based approach

Early metric learning: trained simply to separate decision boundaries

Early metric learning simply trained a multi-class classifier with softmax and cross entropy. Let us first break the softmax formula down.

Here $f$ is the learned feature vector and the input to the FC layer, and $a$ is the output of the FC layer. $W$ is the weight of the final FC layer, and $W_j$ can be interpreted as the linear classifier for class $j$.
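A minimal sketch of this setup, assuming the standard form $a_j = W_j^\top f + b_j$ followed by softmax and cross entropy. The bias term `b` and the toy weights are assumptions for illustration, not values from the original post.

```python
import math

def fc_logits(W, b, f):
    """a_j = W_j . f + b_j for each class j (W is a list of per-class weight rows)."""
    return [sum(w * x for w, x in zip(W_j, f)) + b_j for W_j, b_j in zip(W, b)]

def softmax(a):
    m = max(a)                              # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in a]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Negative log probability assigned to the true class."""
    return -math.log(probs[label])

W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # hypothetical 3-class weights
b = [0.0, 0.0, 0.0]
f = [2.0, 0.5]                              # hypothetical feature vector
probs = softmax(fc_logits(W, b, f))
loss = cross_entropy(probs, 0)
print(probs, loss)
```

Each row $W_j$ scores the feature for one class, which is why it reads as a per-class linear classifier.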

L-Softmax (Large Margin Softmax)

In problems where softmax is used, the model pipeline typically looks like this.

Image → Feature extraction by CNN → FC layer → Softmax function → Cross entropy loss

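Here the FC layer acts as a linear classifier that sets the decision boundary. Why? The class with the largest logit $W_j^\top x + b_j$ wins, so the boundary between classes $i$ and $j$ is the hyperplane on which the two logits are equal.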

t-SNE visualization of CNN features learned with softmax.

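Unlike other approaches, then, it does not try to learn distances over positive and negative samples. (Does that mean the other approaches do work with positives and negatives? Yes: the contrastive learning based approaches discussed later operate explicitly on such pairs.)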

*Liu et al.* ask why features overlap across classes like this, and how the per-class feature distributions could be made more compact. *Liu et al.* already showed, in the formula of Fig. 1, how a CNN feature passes through the FC layer to reach the softmax.

Let us think about this once more.

The inner product can be written in two forms: $W^Tx$ and $\|W\| \ \|x\| \cos\theta$.

Expanding the FC layer formula in the latter form,

$f_j = \|W_j\| \ \|x_i\| \cos(\theta_j)$, where $0 \leq \theta_j \leq \pi$, and applying the softmax activation on top of this gives the following.
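The formula itself is not reproduced in this text. Assuming the standard softmax cross-entropy loss with the bias dropped, and writing $y_i$ for the label of sample $x_i$ (a symbol the passage above does not introduce), it takes the form:

$$L_i = -\log \cfrac{\exp(\|W_{y_i}\| \ \|x_i\| \cos\theta_{y_i})}{\sum_j \exp(\|W_j\| \ \|x_i\| \cos\theta_j)}$$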

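Because the softmax function is usually trained with a cross-entropy loss, in the simplest case the output should be [1,0] for class 1 and [0,1] for class 2. The vectors $W_1$ and $W_2$ are therefore learned so that, in terms of angle, the features of each class gather around their own class vector. If we additionally push the classes apart in angle during training, we get the following t-SNE result.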

The stronger the margin used in training, the sharper the clusters in the t-SNE plot become. In short, L-softmax learns discriminative features by widening the angle between classes.
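The margin can be sketched as follows, assuming the piecewise function $\psi(\theta) = (-1)^k \cos(m\theta) - 2k$ for $\theta \in [k\pi/m, (k+1)\pi/m]$ used in the L-Softmax paper. The function names are chosen here for illustration, and $m$ is the integer margin.

```python
import math

def psi(theta, m):
    """Piecewise margin function: (-1)^k cos(m*theta) - 2k on [k*pi/m, (k+1)*pi/m]."""
    k = min(int(theta * m / math.pi), m - 1)
    return (-1) ** k * math.cos(m * theta) - 2 * k

def l_softmax_target_logit(w_norm, x_norm, theta, m):
    """Logit of the ground-truth class: ||W|| * ||x|| * psi(theta)."""
    return w_norm * x_norm * psi(theta, m)

thetas = [i * math.pi / 200 for i in range(201)]
for m in (1, 2, 4):
    # psi never exceeds cos(theta), so the target class must win by a margin.
    print(m, all(psi(t, m) <= math.cos(t) + 1e-12 for t in thetas))
```

With $m = 1$ this reduces to the plain cosine. A larger $m$ lowers the target logit for the same angle, so the network has to shrink $\theta$ further to classify correctly.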

A-Softmax (Angular Softmax or Sphere Face)

Liu et al. want the decision boundary to be determined by the "angle" alone, rather than also by the magnitude $\|W\|$. To map features onto the surface of a hypersphere, they fix $\|W\|=1$ and $bias=0$, which assumes a sphere centered at the origin. In general, the decision boundary between class 1 and class 2 is as follows. For convenience, assume the prediction is class 1.

$$\|W_1\| \ \|f\| \cos(\theta_1) > \|W_2\| \ \|f\| \cos(\theta_2)$$

$$\|W\|=1: \quad \|f\| \cos(\theta_1) > \|f\| \cos(\theta_2) \iff \cos(\theta_1) > \cos(\theta_2)$$

Finally, inserting the per-class margin that we want to open up, as in L-softmax, gives the following result.
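The resulting condition is not reproduced in this text. In the form used in the SphereFace paper, with an integer margin $m \geq 1$ and $\theta_1 \in [0, \pi/m]$, a sample is assigned to class 1 when:

$$\cos(m\theta_1) > \cos(\theta_2)$$

Since $\cos(m\theta_1) \leq \cos(\theta_1)$ on that interval, this is a stricter requirement than $\cos(\theta_1) > \cos(\theta_2)$.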

Arcface

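We are almost there. Earlier work normalized the weights, but ArcFace normalizes the input $x$ as well as the weights. This idea did not originate with ArcFace (see CosFace for details). If only $W$ is normalized, the logit still depends on $\|x\|$: for two features $x_1, x_2$ at a similar $L_2$ distance whose $\cos(\theta)$ values differ only slightly, this can produce results we do not want. To prevent this, $x$ is normalized too and then scaled by $s$. With both the feature and the weight normalized, the decision boundary really is determined by the "angle" alone. The learned embedding features end up distributed on a hypersphere of radius $s$.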

The authors of the paper $L_2$-normalize $x_i$ and then re-scale it to $s$.

This figure captures the whole philosophy of ArcFace. Let us walk through it step by step.

Apply $L_2$ normalization to $W$ and $x$.
Take the inner product of $W$ and $x$. That is, the FC layer operates on the normalized $W$ and $x$.
Doing this against the representative $W$ vectors of the target class and of every non-target class yields $\cos(\theta)$ for each class.
Apply $\arccos$, the inverse of $\cos$, to extract $\theta_{y_{i}}$ alone.
Add the margin: $\theta_{y_{i}}$ → $\theta_{y_{i}} + m$
Apply $\cos$ so that the result can be read as a logit: $\theta_{y_{i}} + m$ → $\cos (\theta_{y_{i}}+m)$
Multiply by $s$ to re-scale the feature: $\cos (\theta_{y_{i}}+m)$ → $s \cdot \cos (\theta_{y_{i}}+m)$
Apply the softmax function, then the cross-entropy loss, and backpropagate.

In case the steps are hard to follow, here are the formulas.

This is the formula with normalization and the scaling factor applied.

On top of this, a constant margin is added.
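The two formulas are not reproduced in this text. In the form given in the ArcFace paper, with $N$ the batch size (a symbol assumed here), the normalized and scaled loss reads:

$$L = -\cfrac{1}{N}\sum_{i=1}^{N}\log\cfrac{e^{s\cos\theta_{y_i}}}{e^{s\cos\theta_{y_i}}+\sum_{j\neq y_i}e^{s\cos\theta_j}}$$

and with the additive angular margin $m$ on the target class:

$$L = -\cfrac{1}{N}\sum_{i=1}^{N}\log\cfrac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j\neq y_i}e^{s\cos\theta_j}}$$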

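Training increases the similarity between $W_{y_i}$ and $f_{\theta}(X_i)$ and decreases the similarity between the other class vectors $W_{j\neq y_i}$ and $f_{\theta}(X_i)$. To restate: the model learns an embedding space in which samples can be classified by the angles between the per-class representative vectors $W_{j}$ and the embedding vector $f_{\theta}(X_i)$. The margin additionally enforces $\theta_{y_{i}} + m < \theta_{j,i}$, so that inter-class separation grows while intra-class variation shrinks.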

That wraps up the first approach to metric learning. With a plain softmax loss for classification, the decision boundary is rarely drawn the way we would like. To address this, softmax based approaches treat the FC layer not as a mere matrix multiplication but through the geometry of the inner product, so that embeddings are compact within a class and well separated between classes. Angular losses are widely used in face recognition and image retrieval, and they have evolved to work well on open-set as well as closed-set problems. I leave further research trends in this approach for readers to explore on their own.

Contrastive learning based approach

The next section covers contrastive learning (CL). The goal of CL is to learn a space in which similar sample pairs are close together and dissimilar pairs are far apart. CL applies to both supervised and unsupervised methods. In the unsupervised setting it is widely used in self-supervised learning.

Let us briefly go through the CL methods.

Image classification results of a supervised learning method

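The figure uses a cheetah as the input image. In the classification output, classes such as leopard and jaguar receive higher scores than boat or shopping cart. This suggests that similar features are at work for cheetah, leopard, and jaguar, and it leads to the assumption that well-extracted features carry similarity information between instances. That is the broad philosophy behind CL. CL therefore works with positive pairs and negative pairs: it pulls positive pairs together and pushes negative pairs apart. Let me restate this more compactly.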

Given input samples $\{x_i\}$, each sample has a label $y_i \in \{1,\dots,L\}$ among $L$ classes. We want to learn a function $f_\theta(\cdot): X \rightarrow \mathbb{R}^n$ that encodes $x_i$ into an embedding vector, such that examples of the same class have similar embeddings and examples of different classes have very different embeddings.

Now let us express this as a loss function.

$$L_{cont}(x_i,x_j,\theta)=\mathbb{1}[y_i=y_j] \ \|f_\theta(x_i)-f_\theta(x_j)\|_2^2+\mathbb{1}[y_i\neq y_j] \ \max(0,\epsilon-\|f_\theta(x_i)-f_\theta(x_j)\|_2)^2$$

Here $\epsilon$ is a hyperparameter that defines the lower bound on the distance between samples of different classes.
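A minimal sketch of this loss for a single pair of already-computed embeddings. The function names and the default `eps=1.0` are assumptions for illustration.

```python
import math

def l2_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(z_i, z_j, same_class, eps=1.0):
    """Pairwise contrastive loss on embeddings z_i = f(x_i), z_j = f(x_j)."""
    d = l2_distance(z_i, z_j)
    if same_class:
        return d ** 2                    # pull same-class pairs together
    return max(0.0, eps - d) ** 2        # push different-class pairs until d >= eps

print(contrastive_loss([0.0, 0.0], [3.0, 4.0], same_class=True))
print(contrastive_loss([0.0, 0.0], [3.0, 4.0], same_class=False))
print(contrastive_loss([0.0, 0.0], [0.3, 0.4], same_class=False))
```

A different-class pair that is already farther apart than `eps` contributes nothing, so only pairs that violate the bound drive the gradient.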

Contrastive learning has evolved from loss functions of this kind. Let us go through them one at a time.

Triplet loss method

Triplet loss was first used for face recognition, to recognize the same person across different poses and angles (see FaceNet).

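Given an anchor input $x$, we select one positive sample $x^+$ and one negative sample $x^-$ ($x^+$ belongs to the same class as $x$, and $x^-$ to a different class). With the following loss function, triplet loss learns to minimize the distance between the anchor and the positive while maximizing the distance between the anchor and the negative.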

$$L_{triplet}(x,x^+,x^-)=\sum_{x\in X}\max\left(0, \ \|f(x)-f(x^+)\|_2^2-\|f(x)-f(x^-)\|_2^2+\epsilon\right)$$

$\epsilon$ is the margin parameter: the minimum offset between the distances of similar and dissimilar pairs. In triplet loss, the choice of $x^-$ has a very large effect.
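A minimal sketch of the per-triplet term, operating on embeddings that have already been computed. The names and the default `eps=0.2` are assumptions for illustration.

```python
def sq_dist(u, v):
    """Squared L2 distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, eps=0.2):
    """max(0, d(a, p)^2 - d(a, n)^2 + eps) for one triplet."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + eps)

anchor, positive = [0.0, 0.0], [0.1, 0.0]
easy_negative = [2.0, 0.0]     # far from the anchor: contributes no loss
hard_negative = [0.2, 0.0]     # close to the anchor: violates the margin
print(triplet_loss(anchor, positive, easy_negative))
print(triplet_loss(anchor, positive, hard_negative))
```

The easy negative yields zero loss and therefore no gradient, which is one way to see why the choice of $x^-$ matters so much.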

N-pair Loss

N-pair loss generalizes triplet loss to include comparison with multiple negative samples.

Given an $(N+1)$-tuplet of training samples $\{x,x^+,x_1^-,\dots,x_{N-1}^-\}$ containing one positive and $N-1$ negatives, N-pair loss is defined as follows.

$$L_{N-pair}(x,x^+,\{x_i^-\}_{i=1}^{N-1})=\log\left(1+\sum_{i=1}^{N-1}\exp\left(f(x)^\top f(x_i^-)-f(x)^\top f(x^+)\right)\right)$$

$$=-\log\cfrac{\exp(f(x)^\top f(x^+))}{\exp(f(x)^\top f(x^+))+\sum_{i=1}^{N-1}\exp(f(x)^\top f(x_i^-))}$$

If we sample only one negative per class, this is equivalent to the softmax loss for multi-class classification.
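The equality of the two forms can be checked numerically. The sketch below assumes the embeddings are already computed; the function names and toy vectors are made up for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def n_pair_loss(anchor, positive, negatives):
    """log(1 + sum_i exp(a.n_i - a.p)), the first form of the loss."""
    pos = dot(anchor, positive)
    return math.log(1.0 + sum(math.exp(dot(anchor, n) - pos) for n in negatives))

def n_pair_loss_softmax_form(anchor, positive, negatives):
    """-log softmax probability of the positive, the second form of the loss."""
    scores = [dot(anchor, positive)] + [dot(anchor, n) for n in negatives]
    mx = max(scores)
    log_sum = mx + math.log(sum(math.exp(s - mx) for s in scores))
    return log_sum - scores[0]

anchor = [1.0, 0.5]                                    # hypothetical embeddings
positive = [0.9, 0.6]
negatives = [[-1.0, 0.2], [0.1, -0.8], [0.5, 0.5]]
print(n_pair_loss(anchor, positive, negatives))
print(n_pair_loss_softmax_form(anchor, positive, negatives))
```

Both functions return the same value up to floating-point error, which is the softmax connection noted above.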

NCE

Noise Contrastive Estimation (NCE) is a method for estimating the parameters of a statistical model. The main idea of NCE is to run logistic regression to tell target data apart from noise.

Let $x \sim P(x|C=1;\theta) = p_\theta(x)$ be a target sample and $\tilde{x} \sim P(\tilde{x}|C=0;\theta) = q(\tilde{x})$ a noise sample. Because logistic regression models the logit (the log-odds), NCE models the logit of a sample $u$ coming from the target data distribution rather than from the noise distribution.

$$\ell_\theta(u)=\log\cfrac{p_\theta(u)}{q(u)}=\log p_\theta(u)-\log q(u)$$

Convert the logits into probabilities with the sigmoid $\sigma(\cdot)$, then apply cross entropy.

$$L_{NCE}=-\cfrac{1}{N}\sum_{i=1}^N\left[\log\sigma(\ell_\theta(x_i))+\log(1-\sigma(\ell_\theta(\tilde{x}_i)))\right]$$

$$\text{where} \ \sigma(\ell)=\cfrac{1}{1+\exp(-\ell)}=\cfrac{p_\theta}{p_\theta+q}$$

Here I described the NCE loss with just one positive sample and one noise sample, but later papers also apply multiple noise samples in CL, and those are worth a look.
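A toy sketch of the loss with one noise sample per target. The Gaussian densities, the sample values, and all function names are assumptions chosen only to make the example runnable; a real model would learn $p_\theta$.

```python
import math

def sigmoid(l):
    return 1.0 / (1.0 + math.exp(-l))

def logit(p_u, q_u):
    """l_theta(u) = log p_theta(u) - log q(u)."""
    return math.log(p_u) - math.log(q_u)

def nce_loss(p_theta, q, targets, noises):
    """p_theta and q are density functions; targets and noises are paired lists."""
    total = 0.0
    for x, x_noise in zip(targets, noises):
        total += math.log(sigmoid(logit(p_theta(x), q(x))))
        total += math.log(1.0 - sigmoid(logit(p_theta(x_noise), q(x_noise))))
    return -total / len(targets)

def normal_pdf(mu, sigma):
    """Return a 1D Gaussian density function (used only for this toy example)."""
    def pdf(x):
        z = (x - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return pdf

targets = [0.1, -0.3]                # hypothetical data samples
noises = [4.0, -5.0]                 # hypothetical noise samples
q = normal_pdf(0.0, 3.0)             # noise distribution
good_model = normal_pdf(0.0, 1.0)    # model that matches the data
bad_model = normal_pdf(5.0, 1.0)     # model that does not
print(nce_loss(good_model, q, targets, noises), nce_loss(bad_model, q, targets, noises))
```

The model that places its mass near the data gets the lower loss, and `sigmoid(logit(p, q))` equals `p / (p + q)` as in the identity above.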

InfoNCE

This is the concept that contrastive learning rests on most heavily, and many self-supervised learning methods use it. Inspired by NCE, the InfoNCE loss of Contrastive Predictive Coding (CPC) uses a categorical cross-entropy loss to identify the positive sample among a set of unrelated noise samples.

Given a context vector $c$, the positive sample should be drawn from the conditional distribution $p(x|c)$, whereas the $N-1$ negative samples are drawn from the proposal distribution $p(x)$, independent of the context $c$. For brevity, label all samples as $X = \{x_i\}^N_{i=1}$ and call the positive sample $x_{pos}$. The probability of correctly detecting the positive sample is as follows.

$$p(C=pos|X,c)=\cfrac{p(x_{pos}|c)\prod_{i=1,\dots,N; \ i\neq pos}p(x_i)}{\sum_{j=1}^N\left[p(x_j|c)\prod_{i=1,\dots,N; \ i\neq j}p(x_i)\right]}=\cfrac{\cfrac{p(x_{pos}|c)}{p(x_{pos})}}{\sum_{j=1}^N\cfrac{p(x_j|c)}{p(x_j)}}=\cfrac{f(x_{pos},c)}{\sum_{j=1}^N f(x_j,c)}$$

The scoring function, that is, the encoder function, satisfies $f(x,c)\propto\cfrac{p(x|c)}{p(x)}$.

The InfoNCE loss optimizes the negative log probability of classifying the positive sample correctly.

$$L_{InfoNCE}=-\mathbb{E}\left[\log\cfrac{f(x,c)}{\sum_{x'\in X}f(x',c)}\right]$$
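For a single context, the term inside the expectation is a categorical cross entropy over the $N$ candidates. A minimal sketch, assuming the scores are supplied as log-values $\log f(x_j, c)$ (a representation chosen here for numerical stability):

```python
import math

def info_nce(log_scores, pos_index):
    """-log( f(x_pos, c) / sum_j f(x_j, c) ), with log_scores[j] = log f(x_j, c)."""
    mx = max(log_scores)
    log_sum = mx + math.log(sum(math.exp(s - mx) for s in log_scores))
    return log_sum - log_scores[pos_index]

uninformative = [0.0, 0.0, 0.0, 0.0]     # every candidate scored the same
confident = [5.0, 0.0, 0.0, 0.0]         # positive (index 0) scored much higher
print(info_nce(uninformative, 0))        # equals log(4): a uniform guess among N = 4
print(info_nce(confident, 0))
```

When the scorer cannot tell the candidates apart, the loss is $\log N$, and it approaches zero as the positive's score dominates.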

The fact that $f(x,c)$ estimates the density ratio $\cfrac{p(x|c)}{p(x)}$ is connected to mutual information optimization. To maximize the mutual information between the input $x$ and the context vector $c$, expand it as follows.

$$I(x;c)=\sum_{x,c}p(x,c)\log\cfrac{p(x,c)}{p(x)p(c)}=\sum_{x,c}p(x,c)\log\cfrac{p(x|c)}{p(x)}$$

Here $\cfrac{p(x|c)}{p(x)}$ is the term estimated by $f$.

In sequence prediction tasks, rather than directly modeling the future value $p_k(x_{t+k}|c_t)$ after time $t$, CPC models a density function that preserves the mutual information between $x_{t+k}$ and $c_t$.

$$f_k(x_{t+k},c_t)=\exp(z_{t+k}^\top W_k c_t)\propto\cfrac{p(x_{t+k}|c_t)}{p(x_{t+k})}$$

Here $z_{t+k}$ is the encoded input and $W_k$ is a trainable weight matrix.
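The log-bilinear score is straightforward to compute. The sketch below uses a hypothetical 2x2 matrix and toy vectors; in CPC, $W_k$ would be learned separately for each step $k$.

```python
import math

def cpc_score(z_future, W_k, c_t):
    """f_k = exp(z^T W_k c), with W_k given as a list of rows."""
    # First W_k c, then the inner product with z.
    Wc = [sum(w * c_val for w, c_val in zip(row, c_t)) for row in W_k]
    return math.exp(sum(z * v for z, v in zip(z_future, Wc)))

W_k = [[1.0, 0.0], [0.0, 1.0]]    # hypothetical weights (identity for readability)
c_t = [1.0, 0.0]                  # hypothetical context vector
aligned = [1.0, 0.0]              # encoded future that agrees with the context
unrelated = [0.0, 1.0]            # encoded future orthogonal to the context
print(cpc_score(aligned, W_k, c_t), cpc_score(unrelated, W_k, c_t))
```

The score is always positive, as a density ratio should be, and it can be fed directly into an InfoNCE-style normalization over candidate futures.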

Wrapping up the overview

Setting the softmax methods aside, contrastive learning methods share the following common themes.

1. Data augmentation

Contrastive learning can be done in either a supervised or an unsupervised manner. As long as positive pairs and negative pairs are well defined, a model can be trained with contrastive learning without much trouble.

In the unsupervised setting, the key is to build positive pairs by applying data augmentation to an arbitrary sample. Data augmentation can therefore be an important factor in determining model performance in contrastive learning.

2. Batch size

Using a large batch during training is essential for methods that rely on in-batch negative samples. Only when the batch is large enough can the loss function cover a sufficiently diverse set of negative samples, so that the model learns meaningful representations by distinguishing among many different samples.

3. Hard negative mining

A hard negative is a sample that has a different label from the anchor but whose embedding lies very close to the anchor embedding. In the supervised setting, task-specific hard negatives are easy to identify. In the unsupervised setting, however, hard negative mining is much trickier. Increasing the training batch size or the memory bank size implicitly brings in more hard negatives, but the burden of large memory usage cannot be ignored.
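In the supervised case, one simple strategy is to pick the closest differently-labeled sample within a batch. The sketch below shows this one strategy only, not the only way to mine hard negatives; the function name and toy batch are made up for illustration.

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def hardest_negative(anchor_idx, embeddings, labels):
    """Index of the closest sample whose label differs from the anchor's, or None."""
    best, best_d = None, float("inf")
    for i, (z, y) in enumerate(zip(embeddings, labels)):
        if y == labels[anchor_idx]:
            continue                         # same class (or the anchor itself): skip
        d = sq_dist(embeddings[anchor_idx], z)
        if d < best_d:
            best, best_d = i, d
    return best

# Hypothetical batch of four embeddings with two classes.
embeddings = [[0.0, 0.0], [0.1, 0.0], [0.5, 0.0], [3.0, 0.0]]
labels = [0, 0, 1, 1]
print(hardest_negative(0, embeddings, labels))
```

For the anchor at index 0, the sample at index 2 is the hard negative: it has a different label but sits much closer than the one at index 3. Without labels this filter is unavailable, which is part of why the unsupervised case is harder.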

Kyungjin Cho