Heatmap Regression without Soft-argmax for Facial Landmark Detection
Chiao-An Yang Raymond A. Yeh
Purdue University, Department of Computer Science
ICCV 2025
arXiv GitHub Hugging Face
TLDR
  1. $\tt Soft\text{-}argmax$ approximation is unnecessary and suboptimal for heatmap regression.
  2. Without $\tt Soft\text{-}argmax$, our method is competitive while achieving a 2.2$\times$ faster training convergence.

Abstract

Facial landmark detection is an important task in computer vision with numerous applications, such as head pose estimation, expression analysis, face swapping, etc. Heatmap regression-based methods have been widely used to achieve state-of-the-art results in this task. These methods involve computing the argmax over the heatmaps to predict a landmark. Since $\tt argmax$ is not differentiable, these methods use a differentiable approximation, $\tt Soft\text{-}argmax$, to enable end-to-end training on deep-nets. In this work, we revisit this long-standing choice of using $\tt Soft\text{-}argmax$ and demonstrate that it is not the only way to achieve strong performance. Instead, we propose an alternative training objective based on the classic structured prediction framework. Empirically, our method achieves state-of-the-art performance on three facial landmark benchmarks (WFLW, COFW, and 300W), converging $2.2\times$ faster during training while maintaining better and competitive accuracy.

Qualitative Comparison
Figure 1: Qualitative comparison on WFLW between STAR and ours. The ground truth landmarks are marked in green while predictions are marked in red. The regions highlighted in blue circles emphasize where our method outperforms STAR.

Approach

Let $\mathbf{X} \in \mathbb{R}^{3 \times H \times W}$ to be an input facial image labeled with $N$ landmarks $\mathbf{y} = [\mathbf{y}_1, \mathbf{y}_2, \cdots, \mathbf{y}_N]$, where each $\mathbf{y}_n$ represents the pixel location $(u,v) \in \{0, \dots, H-1\}\times \{0, \dots, W-1\}$ of the landmark, i.e., $\mathbf{y} \in \mathbb{R}^{N \times 2}$. The task of landmark detection is to learn a model, parameterized by $\theta$, that predicts landmark $\hat{\mathbf{y}}$ given the input face image $\mathbf{X}$.

In heatmap regression methods, given an input, the model outputs $N$ heatmap $\hat{\mathbf{H}} = [\hat{\mathbf{H}}_1, \dots, \hat{\mathbf{H}}_N] \in \mathbb{R}^{N\times H \times W}$ where each heatmap $\hat{\mathbf{H}}_n$ represents a score of each pixel being a landmark. They are often trained by minimizing the distance between a soft approximation of the landmark coodinates $\tilde{\mathbf{y}}_n$ and the ground truth $\mathbf{y}_n$, \begin{align} \sum_n \|\tilde{\mathbf{y}}_n - \mathbf{y}_n\|_2^2. \end{align} where the $\tilde y_n$ is approximated from the heatmap $\hat{\mathbf{H}}_n$ using $\tt Soft\text{-}argmax$, \begin{align} \tilde{\mathbf{y}}_n = {\tt Soft\text{-}argmax}(\hat{\mathbf{H}}_n) \triangleq \sum_{\mathbf{y}'} \mathbf{y}' \cdot \texttt{Softmax}(\hat{\mathbf{H}}_n)[\mathbf{y}'], \end{align}

Inspired by the well-studied structure prediction research, we propose the following loss to replace the $\tt Soft\text{-}argmax$ approximation in heatmap regression methods, \begin{align} \epsilon \ln \left(\sum_{\hat{\mathbf{y}}_n} \exp \frac{\Delta(\mathbf{y}_n, \hat{\mathbf{y}}_n) + F_n(\hat{\mathbf{y}}_n, \mathbf{X}, \theta)}{\epsilon}\right) - F_n(\mathbf{y}_n, \mathbf{X}, \theta). \end{align} Here we define $F_n(\mathbf{y}_n,\mathbf{X};\theta)$ as the scoring function of each possible coordinates for each landmark and $\Delta$ is an margin function of choice, e.g. $\ell_1$, $\ell_2$.

Demonstration

We consider a simplified model with a single landmark on a 1D heatmap. We choose the dataset to have a single training sample at ground truth $\mathbf{y} = 5$. We visualize how gradient descent updates the heatmap,

Demo GIF

Figure 2: Comparison between Soft-argmax and ours. We use $\uparrow$ to denote where $F$ is increasing, i.e., positive updates, and $\downarrow$ to denote where F is decreasing, i.e., negative updates.

After a few steps, our argmax is already located at $\mathbf{y}=5$, with a significant gap between its maximum and second maximum. Meanwhile, Soft-argmax is still struggling to find the correct maximum. We observe signigicantly faster convergence when using our approach. In abstract, Soft-argmax gradually “moves” the peaks towards the target $\mathbf{y} = 5$, while ours directly increases the peak at the target while decreasing the others.

Experiments

Table
Table 1: Comparison to SOTA facial landmark detection methods on the full set of WFLW. We report the inter-ocular NME$\downarrow$, FR%$\downarrow$, and AUC%$\uparrow$. “C” and “H” correspond to coordinate/heatmap-regression respectively.
2D
Figure 3: Comparison of convergence efficiency between STAR's Soft-argmax and ours. The first two columns visualize the $\hat{\mathbf{H}}$ trained with STAR at epochs 1 and 5. The last two columns visualize the $\hat{\mathbf{H}}$ trained with our loss at epochs 1 and 2. Each row corresponds to a different sample.
Curve
Figure 4: Comparison of NME curve throughout training between STAR and ours. We plot the curve of inter-ocular NME$\uparrow$ over training epochs for both methods. The performance improves and converges much faster using our method than using STAR.

Citation

@inproceedings{yang2025regression,  title={Heatmap Regression without Soft-Argmax for Facial Landmark Detection},  author={Yang, Chiao-An and Yeh, Raymond A},  booktitle={Proc. ICCV},  year={2025} }