RegularFace: Deep Face Recognition via Exclusive Regularization

Proposed method

In this paper, we propose ‘exclusive regularization’, which enlarges the distance between samples of different classes in order to improve feature discriminability.

Suppose $W \in \mathbb{R}^{D\times C}$ is the weight matrix of the classification layer that maps $D$-dimensional features to $C$-dimensional class confidence scores, and $W_i$ and $W_j$ are the $i$-th and $j$-th columns of $W$. The exclusive loss is defined as the average, over all classes, of the largest cosine similarity between a class weight vector and any other class weight vector: $$ \begin{equation} \mathcal{L}_{r}(W) = \frac{1}{C}\sum_i \max_{j\neq i} \frac{W_i \cdot W_j}{||W_i||_2 \cdot ||W_j||_2}. \label{eq:l-exc} \end{equation} $$
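To make the definition concrete, here is a minimal pure-Python sketch that evaluates the exclusive loss on two toy sets of class weight vectors. The helper names (`cosine`, `exclusive_loss`) and the toy vectors are assumptions made for illustration and are not part of the released code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def exclusive_loss(class_weights):
    """Mean over classes of the largest cosine similarity to any other class.

    `class_weights` holds one vector per class (the columns W_i of W).
    """
    num_class = len(class_weights)
    total = 0.0
    for i, w_i in enumerate(class_weights):
        total += max(
            cosine(w_i, w_j) for j, w_j in enumerate(class_weights) if j != i
        )
    return total / num_class


# Hypothetical toy weights, chosen only for illustration.
crowded = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
# Three directions spaced 120 degrees apart on the unit circle.
spread = [
    (math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
    for k in range(3)
]

loss_crowded = exclusive_loss(crowded)  # each class has a neighbour at 45 degrees
loss_spread = exclusive_loss(spread)    # every pair is 120 degrees apart
```

In this toy setting the evenly spread weights give a lower loss than the crowded ones, which is the behaviour that minimizing the regularizer is meant to encourage.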

Interestingly, the same idea has been adopted in two other concurrent works at CVPR 2019:

UniformFace: Learning Deep Equidistributed Representation for Face Recognition
Unequal-Training for Deep Face Recognition With Long-Tailed Noisy Data

Illustration:

Figure: Illustration of face embeddings trained under various loss functions; points in different colors indicate different identities. (a) Softmax loss learns separable decision boundaries. (b) Angular softmax loss learns angularly separable decision boundaries. (c) Center loss[1] ‘pulls’ embeddings of the same identity towards their center, in order to obtain compact and discriminative representations. (d) SphereFace[2] (A-Softmax loss) proposes the ‘angular margin’ to clamp representations within a narrow angle. (e) Our proposed RegularFace introduces an ‘inter-class push force’ that explicitly ‘pushes’ representations of different identities far away from each other.

As depicted in the figure above, our method "pushes" representations of different identities away from each other, improving inter-class separability.

A demonstrative implementation in PyTorch:

import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class ExclusiveLinear(nn.Module):
  """Cosine classifier that also returns the exclusive regularization term."""

  def __init__(self, feat_dim=512, num_class=10572, norm_data=True, radius=20):
    super(ExclusiveLinear, self).__init__()
    self.num_class = num_class
    self.feat_dim = feat_dim
    self.norm_data = norm_data
    self.radius = float(radius)
    # One row per class, i.e. the transpose of W in the equation above.
    self.weight = nn.Parameter(torch.randn(self.num_class, self.feat_dim))
    self.reset_parameters()

  def reset_parameters(self):
    stdv = 1. / math.sqrt(self.weight.size(1))
    self.weight.data.uniform_(-stdv, stdv)

  def forward(self, x):
    # Pairwise cosine similarity between class weights, shape (C, C).
    weight_norm = F.normalize(self.weight, p=2, dim=1)
    cos = torch.mm(weight_norm, weight_norm.t())
    # clamp() is not in-place, so the result has to be assigned back.
    cos = cos.clamp(-1, 1)

    # Exclude self-similarity on a detached copy, so `cos` itself is not modified.
    cos1 = cos.detach().clone()
    cos1.fill_diagonal_(-100)

    # For each class i, mark its most similar other class j.
    _, indices = torch.max(cos1, dim=1)
    mask = torch.zeros_like(cos)  # created on the same device as the weights
    mask.scatter_(1, indices.view(-1, 1), 1)

    # Mean over classes of the largest cosine similarity to another class.
    exclusive_loss = (cos * mask).sum() / self.num_class

    if self.norm_data:
      # Project features onto a hypersphere of the given radius.
      x = F.normalize(x, p=2, dim=1)
      x = x * self.radius

    return F.linear(x, weight_norm), exclusive_loss

Merits of our method:

Easily improves inter-class separability and feature discriminability without hyper-parameter tuning.
Computationally light when the number of identities is small. On CASIA-WebFace, the extra overhead our method brings about is negligible.
Improves performance on top of SphereFace[2] and center loss[1].
Easy to implement and straightforward to interpret.

Weaknesses of our method:

Inefficient and memory-hungry on datasets with a large number of identities. Computing the exclusive loss in Equation $\ref{eq:l-exc}$ requires maintaining a $C\times C$ cosine similarity matrix ("cos" in the code above), where $cos_{i,j}$ is the cosine similarity between $W_i$ and $W_j$. When the number of identities $C$ is large, this matrix becomes very large, so the computation is slow and consumes a lot of memory.
Brings only insignificant improvement on top of ArcFace[3]. One possible reason is that the additive margin in ArcFace controls the between-class margin at a much finer granularity. Well-tuned decision margins already lead to large between-class variance, especially when the number of classes (identities) is large (see the figure above). In contrast, SphereFace uses a multiplicative factor $m$ to set the margin of the decision boundary between classes; $m$ can only be an integer, so the margin can only be adjusted coarsely.
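To give a rough sense of scale for the first weakness, the sketch below estimates the size of a single dense $C\times C$ float32 matrix. The helper name `cosine_matrix_bytes` and the two larger class counts are assumptions chosen for illustration; only 10572 comes from the implementation above.

```python
def cosine_matrix_bytes(num_class, bytes_per_element=4):
    """Size in bytes of a dense num_class x num_class matrix (float32 by default)."""
    return num_class * num_class * bytes_per_element


# 10572 is the default num_class in the implementation above;
# the larger counts are hypothetical, for comparison only.
for num_class in (10572, 100_000, 1_000_000):
    gib = cosine_matrix_bytes(num_class) / 2**30
    print(f"C = {num_class:>9,d}: {gib:10.2f} GiB")
```

This works out to roughly 0.4 GiB, 37 GiB, and 3,700 GiB respectively. It counts one matrix only; the mask in the code above has the same shape, so actual usage during training would be higher.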

Citation:

If our method is helpful to your research, please consider citing:

@InProceedings{zhao2019regularface,
  author = {Zhao, Kai and Xu, Jingyi and Cheng, Ming-Ming},
  title = {RegularFace: Deep Face Recognition via Exclusive Regularization},
  booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2019}
}

Reference:

[1] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pages 499–515. Springer, 2016.

[2] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. SphereFace: Deep hypersphere embedding for face recognition. In IEEE Conf. Comput. Vis. Pattern Recog., volume 1, 2017.

[3] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In IEEE Conf. Comput. Vis. Pattern Recog., volume 1, 2019.