
Average gradient outer product as a mechanism for deep neural collapse (2402.13728v6)

Published 21 Feb 2024 in cs.LG and stat.ML

Abstract: Deep Neural Collapse (DNC) refers to the surprisingly rigid structure of the data representations in the final layers of Deep Neural Networks (DNNs). Though the phenomenon has been measured in a variety of settings, its emergence is typically explained via data-agnostic approaches, such as the unconstrained features model. In this work, we introduce a data-dependent setting where DNC forms due to feature learning through the average gradient outer product (AGOP). The AGOP is defined with respect to a learned predictor and is equal to the uncentered covariance matrix of its input-output gradients averaged over the training dataset. The Deep Recursive Feature Machine (Deep RFM) is a method that constructs a neural network by iteratively mapping the data with the AGOP and applying an untrained random feature map. We demonstrate empirically that DNC occurs in Deep RFM across standard settings as a consequence of the projection with the AGOP matrix computed at each layer. Further, we theoretically explain DNC in Deep RFM in an asymptotic setting and as a result of kernel learning. We then provide evidence that this mechanism holds for neural networks more generally. In particular, we show that the right singular vectors and values of the weights can be responsible for the majority of within-class variability collapse for DNNs trained in the feature learning regime. As observed in recent work, this singular structure is highly correlated with that of the AGOP.

References (47)
  1. The neural tangent kernel in high dimensions: Triple descent and a multi-scale theory of generalization. In International Conference on Machine Learning, pp.  74–84. PMLR, 2020.
  2. A random matrix perspective on mixtures of nonlinearities for deep learning. arXiv preprint arXiv:1912.00827, 2019.
  3. Mechanism of feature learning in convolutional neural networks. arXiv preprint arXiv:2309.00570, 2023.
  4. Gradient descent induces alignment between weights and the empirical NTK for deep non-linear networks. arXiv preprint arXiv:2402.05271, 2024.
  5. Kernel learning in ridge regression "automatically" yields exact low rank solution. arXiv preprint arXiv:2310.11736, 2023.
  6. Neural collapse in deep linear network: From balanced to imbalanced data. arXiv preprint arXiv:2301.00437, 2023.
  7. Exploring deep neural networks via layer-peeled model: Minority collapse in imbalanced training. In Proceedings of the National Academy of Sciences (PNAS), volume 118, 2021.
  8. Improved generalization bounds for transfer learning via neural collapse. In First Workshop on Pre-training: Perspectives, Pitfalls, and Paths Forward at ICML, 2022.
  9. Linking neural collapse and L2 normalization with improved out-of-distribution detection in deep neural networks. Transactions on Machine Learning Research (TMLR), 2022.
  10. Neural collapse under MSE loss: Proximity to and dynamics on the central path. In International Conference on Learning Representations (ICLR), 2022.
  11. A law of data separation in deep learning. arXiv preprint arXiv:2210.17020, 2022.
  12. Neural collapse for unconstrained feature model under cross-entropy loss with imbalanced data. arXiv preprint arXiv:2309.09725, 2023.
  13. Universality laws for high-dimensional learning with random features. IEEE Transactions on Information Theory, 69(3):1932–1964, 2022.
  14. Limitations of neural collapse for understanding generalization in deep learning. arXiv preprint arXiv:2202.08384, 2022.
  15. Neural tangent kernel: Convergence and generalization in neural networks. In Conference on Neural Information Processing Systems (NeurIPS), 2018.
  16. An unconstrained layer-peeled perspective on neural collapse. In International Conference on Learning Representations (ICLR), 2022.
  17. Karoui, N. E. The spectrum of kernel random matrices. The Annals of Statistics, pp.  1–50, 2010.
  18. Kothapalli, V. Neural collapse: A review on modelling principles and generalization. In Transactions on Machine Learning Research (TMLR), 2023.
  19. The asymmetric maximum margin bias of quasi-homogeneous neural networks. arXiv preprint arXiv:2210.03820, 2022.
  20. Towards resolving the implicit bias of gradient descent for matrix factorization: Greedy low-rank learning. arXiv preprint arXiv:2012.09839, 2020.
  21. ReLU soothes the NTK condition number and accelerates optimization for wide neural networks. arXiv preprint arXiv:2305.08813, 2023.
  22. Neural collapse under cross-entropy loss. In Applied and Computational Harmonic Analysis, volume 59, 2022.
  23. Neural collapse with unconstrained features. arXiv preprint arXiv:2011.11619, 2020.
  24. Prevalence of neural collapse during the terminal phase of deep learning training. In Proceedings of the National Academy of Sciences (PNAS), volume 117, 2020.
  25. Neural collapse in the intermediate hidden layers of classification neural networks. arXiv preprint arXiv:2308.02760, 2023.
  26. Explicit regularization and implicit bias in deep network classifiers trained with the square loss. arXiv preprint arXiv:2101.00072, 2020.
  27. Feature learning in neural networks and kernel machines that recursively learn features. arXiv preprint arXiv:2212.13881, 2022.
  28. Linear recursive feature machines provably recover low-rank matrices. arXiv preprint arXiv:2401.04553, 2024.
  29. Feature learning in deep classifiers through intermediate neural collapse. Technical Report, 2023.
  30. Roughgarden, T. Beyond worst-case analysis. Communications of the ACM, 62(3):88–96, 2019.
  31. A generalized representer theorem. In International conference on computational learning theory, pp.  416–426. Springer, 2001.
  32. Neural (tangent kernel) collapse. arXiv preprint arXiv:2305.16427, 2023.
  33. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM (JACM), 51(3):385–463, 2004.
  34. On the robustness of neural collapse and the neural collapse of robustness. arXiv preprint arXiv:2311.07444, 2023.
  35. Deep neural collapse is provably optimal for the deep unconstrained features model. arXiv preprint arXiv:2305.13165, 2023.
  36. Imbalance trouble: Revisiting neural-collapse geometry. In Conference on Neural Information Processing Systems (NeurIPS), 2022.
  37. Extended unconstrained features model for exploring deep neural collapse. In International Conference on Machine Learning (ICML), 2022.
  38. Perturbation analysis of neural collapse. arXiv preprint arXiv:2210.16658, 2022.
  39. A consistent estimator of the expected gradient outer product. In UAI, pp. 819–828, 2014.
  40. Linear convergence analysis of neural collapse with unconstrained features. In OPT 2022: Optimization for Machine Learning (NeurIPS 2022 Workshop), 2022.
  41. How far pre-trained models are from neural collapse on the target dataset informs their transferability. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  5549–5558, 2023.
  42. On the emergence of simplex symmetry in the final and penultimate layers of neural network classifiers. In Mathematical and Scientific Machine Learning, 2022.
  43. Woodbury, M. A. Inverting modified matrices. Department of Statistics, Princeton University, 1950.
  44. Dynamics in deep classifiers trained with the square loss: Normalization, low rank, neural collapse, and generalization bounds. In Research, volume 6, 2023.
  45. Efficient estimation of the central mean subspace via smoothed gradient outer products. arXiv preprint arXiv:2312.15469, 2023.
  46. On the optimization landscape of neural collapse under MSE loss: Global optimality with unconstrained features. In International Conference on Machine Learning (ICML), 2022.
  47. Catapults in SGD: Spikes in the training loss and their impact on generalization through feature learning. arXiv preprint arXiv:2306.04815, 2023.

Summary

  • The paper argues that the AGOP is a primary driver of deep neural collapse, with projection onto the AGOP reducing within-class variability in DNNs.
  • It analyzes Deep RFM, a backpropagation-free model, to isolate the AGOP projection as the collapse-inducing step, and shows that in DNNs the right singular structure of the weight matrices, rather than ReLU, performs the effective linear denoising.
  • Theoretical and empirical results link AGOP to optimal kernel structures, offering actionable insights for DNN training and architecture design.

This paper, "Average gradient outer product as a mechanism for deep neural collapse" (2402.13728), proposes that feature learning, specifically through the Average Gradient Outer Product (AGOP), is the primary mechanism driving Deep Neural Collapse (DNC) in deep neural networks (DNNs). DNC is a phenomenon where feature representations of training data in the final and intermediate layers of overparameterized DNNs trained on classification tasks exhibit a rigid geometric structure: within-class features collapse to their class mean (NC1), and class means form an Equiangular Tight Frame (ETF) or orthogonal basis (NC2).

Existing theoretical explanations for DNC often rely on the Unconstrained Features Model (UFM), which is data-agnostic and simplifies the network structure, thus not fully capturing the role of feature learning or the data itself in the process. This paper bridges that gap by arguing that the AGOP, a statistic related to how the network uses its inputs for prediction, is central to DNC formation.

The paper first investigates how within-class variability (NC1) is reduced through the layers of a standard DNN. It shows that the improvement in NC1 occurs mainly due to the application of the right singular vectors and corresponding singular values of the weight matrix ($S_l V_l^\top$), rather than the ReLU non-linearity, which was often implicitly assumed to be the primary driver in feature-agnostic models like the UFM. The right singular component can be seen as performing a linear projection that effectively "denoises" the data by discarding variance in less important directions.
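
The kind of per-component ablation described above can be sketched as follows: decompose a layer's weight matrix via SVD and compare a simple within-/between-class variability ratio after applying only $S_l V_l^\top$ versus the full $\mathrm{ReLU}(W_l h)$ map. This is an illustrative reconstruction with a crude NC1 proxy, not the paper's exact protocol or metric.

```python
import numpy as np

def within_between_ratio(H, y, num_classes):
    """Crude NC1 proxy: within-class scatter over between-class scatter (smaller = more collapsed)."""
    mu_g = H.mean(axis=0)
    within, between = 0.0, 0.0
    for c in range(num_classes):
        Hc = H[y == c]
        mu_c = Hc.mean(axis=0)
        within += ((Hc - mu_c) ** 2).sum()
        between += len(Hc) * ((mu_c - mu_g) ** 2).sum()
    return within / between

def collapse_per_component(H, y, W, num_classes):
    """Which part of a layer reduces within-class variability: S V^T or the full ReLU layer?

    H: pre-layer features (N x d_in), W: layer weight matrix (d_out x d_in).
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    H_sv = H @ (np.diag(S) @ Vt).T            # right singular map  S V^T h
    H_full = np.maximum(H @ W.T, 0.0)         # full layer          ReLU(W h)
    return {
        "input": within_between_ratio(H, y, num_classes),
        "after S V^T": within_between_ratio(H_sv, y, num_classes),
        "after ReLU(W h)": within_between_ratio(H_full, y, num_classes),
    }
```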

Drawing upon the Neural Feature Ansatz (NFA), which states that the Gram matrix of weights $W_l^\top W_l$ is approximately proportional to the AGOP with respect to the inputs at layer $l$, the authors argue that this linear denoising is performed by projection onto the AGOP. This highlights the AGOP as a key operator in feature learning that directly contributes to the reduction of within-class variability.
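
As a rough illustration of what checking the NFA looks like in code, the sketch below computes the AGOP of a two-layer ReLU network with respect to its inputs by accumulating input-output Jacobians, and correlates it entrywise with $W_1^\top W_1$. The architecture, shapes, and correlation measure are illustrative choices, not the paper's experimental setup.

```python
import numpy as np

def agop_two_layer(W1, W2, X):
    """AGOP of f(x) = W2 relu(W1 x) with respect to its inputs.

    X: (N x d) inputs. Returns the d x d average of J(x)^T J(x) over the data,
    where J(x) is the input-output Jacobian at x.
    """
    d = X.shape[1]
    M = np.zeros((d, d))
    for x in X:
        mask = (W1 @ x > 0).astype(float)      # ReLU derivative at the pre-activations
        J = W2 @ (W1 * mask[:, None])          # Jacobian: W2 diag(mask) W1
        M += J.T @ J
    return M / len(X)

def nfa_correlation(W1, W2, X):
    """Neural Feature Ansatz check: entrywise Pearson correlation of W1^T W1 and the AGOP."""
    A = (W1.T @ W1).ravel()
    B = agop_two_layer(W1, W2, X).ravel()
    A, B = A - A.mean(), B - B.mean()
    return float(A @ B / (np.linalg.norm(A) * np.linalg.norm(B)))

# Example usage with random (untrained) weights; compare the value before and after training.
rng = np.random.default_rng(0)
X = rng.standard_normal((256, 10))
W1 = rng.standard_normal((64, 10)) / np.sqrt(10)
W2 = rng.standard_normal((3, 64)) / np.sqrt(64)
print(nfa_correlation(W1, W2, X))
```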

To demonstrate the AGOP's role more directly, the paper utilizes the Deep Recursive Feature Machine (Deep RFM). Deep RFM is a backpropagation-free model proposed in prior work (2309.00570) as an abstraction of DNN feature learning. It recursively transforms data by applying the square root of the AGOP (estimated from a shallow kernel machine) and then a random feature map (like a random ReLU network layer). Algorithm 1 in the paper outlines this process:

Algorithm 1: Deep Recursive Feature Machine (Deep RFM)
Input: X_1, Y, {k_l}_{l=1}^{L+1}, L, {Φ_l}_{l=1}^{L}
Output: α_{L+1}, {M_l}_{l=1}^{L}

For l = 1 to L:
  Normalize data X_l
  Learn kernel regression coefficients α_l = Y(k_l(X_l, X_l) + μI)^(-1)
  Construct predictor f_l(⋅) = α_l k_l(⋅, X_l)
  Compute AGOP: M_l = Σ_{c,i} ∇f_l(x^l_{ci}) ∇f_l(x^l_{ci})^T = X_l α_l^T α_l X_l^T
  Transform data X_{l+1} = Φ_l(M_l^{1/2} X_l)
End For
Normalize data X_{L+1}
Learn coefficients α_{L+1} = Y(k_{L+1}(X_{L+1}, X_{L+1}) + μI)^(-1)
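
Below is a minimal NumPy sketch of the loop in Algorithm 1 under simplifying assumptions: a Gaussian kernel (the paper's experiments use a Laplace kernel), one-hot labels Y of shape $K \times n$, per-sample normalization, and an unscaled random ReLU feature map. It is meant to make the structure of Deep RFM concrete, not to reproduce the paper's exact configuration or results.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=2.0):
    """k(x, z) = exp(-||x - z||^2 / (2 sigma^2)); X: (n x d), Z: (m x d)."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def agop(X, alpha, sigma=2.0):
    """AGOP of the kernel predictor f(x) = alpha k(X, x): average of J(x)^T J(x) over the data."""
    n, d = X.shape
    K_mat = gaussian_kernel(X, X, sigma)
    M = np.zeros((d, d))
    for j in range(n):
        # grad_x k(x, x_i) at x = x_j equals -(x_j - x_i) / sigma^2 * k(x_j, x_i)
        grads = -(X[j] - X) / sigma ** 2 * K_mat[j][:, None]   # (n x d)
        J = alpha @ grads                                       # (K x d) Jacobian at x_j
        M += J.T @ J
    return M / n

def deep_rfm(X, Y, depth=5, width=512, mu=1e-3, sigma=2.0, seed=0):
    """Minimal Deep RFM loop: kernel ridge -> AGOP -> sqrt(AGOP) projection -> random ReLU features."""
    rng = np.random.default_rng(seed)
    features = [X]
    for _ in range(depth):
        Xl = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)  # normalize data (one plausible choice)
        K_mat = gaussian_kernel(Xl, Xl, sigma)
        alpha = Y @ np.linalg.inv(K_mat + mu * np.eye(len(Xl)))      # kernel ridge coefficients (K x n)
        M = agop(Xl, alpha, sigma)
        evals, evecs = np.linalg.eigh(M)
        M_sqrt = evecs @ np.diag(np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T
        W = rng.standard_normal((width, Xl.shape[1])) / np.sqrt(Xl.shape[1])
        X = np.maximum((Xl @ M_sqrt) @ W.T, 0.0)                     # random ReLU feature map
        features.append(X)
    return features
```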

The paper empirically shows that Deep RFM exhibits DNC (Figures 2 and 3), with the within-class variability collapsing and class means forming an ETF. Crucially, experiments reveal that in Deep RFM, the DNC improvement is primarily driven by the multiplication with the AGOP square root ($M_l^{1/2}$), while the random feature map ($\Phi_l$) has little effect or can even slightly worsen the NC1 metric.

This empirical finding is supported by a theoretical proposition demonstrating that random feature maps with ReLU activations tend to increase distances between vectors rather than collapsing them (Figure 4), explaining why this component of Deep RFM (and the ReLU in DNNs) doesn't significantly reduce within-class variability.

Further theoretical analyses strengthen the connection between AGOP and DNC in Deep RFM.

  1. Asymptotic Analysis: Under assumptions of high-dimensional, full-rank data and linearized kernels/feature maps ($k(X) \approx X^\top X + \lambda_k I$, $\Phi(X) \approx X + \lambda_\Phi^{1/2} I$), the paper proves (Theorem 1) that applying the Deep RFM transformation iteratively leads to the data Gram matrix converging exponentially to a collapsed state ($yy^\top + \lambda_\Phi I$). The rate of collapse depends on the ratio $\lambda_k/\lambda_\Phi$, suggesting that the "non-linearity" introduced by the feature map (represented by $\lambda_\Phi$) is crucial for convergence, while deviations from perfect linearity in the kernel ($\lambda_k$) can slow it down.
  2. Non-asymptotic Analysis: By analyzing a relaxed version of a parametrized kernel ridge regression problem (which includes a single layer of RFM), the paper shows (Theorem 2) that the optimal kernel matrix is exactly the matrix representing the collapsed state ($I_K \otimes \mathbf{1}_n\mathbf{1}_n^\top$). This suggests that models implicitly or explicitly minimizing such objectives are biased towards forming collapsed representations. Since RFM is based on kernel ridge regression and uses the AGOP, this provides a theoretical link between AGOP and the optimal kernel structure corresponding to NC1 (a small numerical check of this collapsed structure follows the list).
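
As a small sanity check (not from the paper), the target kernel matrix $I_K \otimes \mathbf{1}_n\mathbf{1}_n^\top$ in Theorem 2 is exactly the Gram matrix obtained when every sample's feature equals its class mean and the class means are orthonormal, i.e., a fully NC1-collapsed representation:

```python
import numpy as np

K, n = 4, 5                                    # K classes, n samples per class
class_means = np.eye(K)                        # orthonormal class means

# Fully collapsed features: every sample equals its class mean.
H = np.repeat(class_means, n, axis=0)          # (K*n) x K feature matrix

gram = H @ H.T                                 # N x N Gram / kernel matrix
target = np.kron(np.eye(K), np.ones((n, n)))   # I_K ⊗ (1_n 1_n^T)

assert np.allclose(gram, target)
```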

In summary, the paper provides substantial empirical and theoretical evidence that feature learning through the AGOP is a core mechanism underlying the formation of Deep Neural Collapse. It shows that in both DNNs and Deep RFM, the linear transformation associated with the AGOP (or the weight matrix component $S_l V_l^\top$ in DNNs) is primarily responsible for collapsing within-class variability, whereas non-linearities like ReLU play a less significant role in this specific aspect of collapse.

Practical Implications and Implementation:

  • Understanding DNC: This work provides a feature-learning-centric view of DNC, complementing existing UFM-based theories. Understanding DNC as a consequence of AGOP-based feature learning can guide further research into its benefits (generalization, robustness) and drawbacks (e.g., in imbalanced datasets).
  • DNN Training: The finding that the weight matrix's right singular space, related to the AGOP via NFA, drives NC1 suggests that training methods influencing the alignment of weights with the AGOP could implicitly control DNC. Learning rates, weight decay, and initialization strategies known to affect NFA correlation (2402.05271, 2306.04815) could be relevant knobs for controlling DNC in practice.
  • Deep RFM as a Model: Deep RFM is a backpropagation-free architecture. While currently not competitive with state-of-the-art DNNs on large-scale tasks, its ability to exhibit DNC through an interpretable AGOP mechanism makes it a valuable theoretical tool. If its performance could be improved, it might inspire more efficient alternatives to gradient descent for deep feature learning.
  • Designing Architectures/Regularization: The theoretical analysis points to the parameters $\lambda_k$ and $\lambda_\Phi$ as influential on collapse speed. Future work could explore designing non-linearities or regularization techniques that effectively tune these parameters in DNNs to promote desired DNC properties. For example, modifications to ReLU or batch normalization might affect $\lambda_\Phi$, while weight decay might influence $\lambda_k$.
  • Implementation: Implementing Deep RFM (Algorithm 1) involves the following steps (a minimal kernel-side sketch follows this list):
    • Choosing a kernel function (e.g., the Laplace kernel used in the paper's experiments).
    • Computing the kernel matrix $k_l(X_l, X_l)$.
    • Solving the kernel ridge regression, $\alpha_l = Y(k_l(X_l, X_l) + \mu I)^{-1}$. This requires inverting an $N \times N$ matrix, which can be computationally expensive for large datasets but is feasible for moderate $N$.
    • Computing the AGOP matrix $M_l = X_l \alpha_l^\top \alpha_l X_l^\top$, which costs $O(N^2 d_l)$ or $O(N^2 K^2 + N d_l K)$ depending on how it is computed.
    • Computing the matrix square root $M_l^{1/2}$. This requires an eigen-decomposition of $M_l$ ($O(d_l^3)$, or $O(N^3)$ if using the $N \times N$ version $X_l^\top M_l X_l$).
    • Applying a random feature map $\Phi_l(Z) = \sigma(W_l Z)$, where $W_l$ is a random matrix. This is computationally efficient, similar to a standard linear layer followed by ReLU.
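
As a sketch of the first few implementation steps, the snippet below sets up a Laplace kernel and the kernel ridge solve from Algorithm 1; the bandwidth, ridge parameter, and label convention (one-hot $Y$ of shape $K \times n$) are illustrative choices rather than the paper's settings.

```python
import numpy as np

def laplace_kernel(X, Z, bandwidth=10.0):
    """Laplace kernel k(x, z) = exp(-||x - z|| / bandwidth); X: (n x d), Z: (m x d)."""
    dists = np.sqrt(((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1))
    return np.exp(-dists / bandwidth)

def kernel_ridge_fit(X, Y, mu=1e-3, bandwidth=10.0):
    """alpha = Y (k(X, X) + mu I)^{-1}, with one-hot labels Y of shape (K x n)."""
    K_mat = laplace_kernel(X, X, bandwidth)
    return Y @ np.linalg.inv(K_mat + mu * np.eye(len(X)))       # O(n^3) solve

def kernel_ridge_predict(alpha, X_train, X_test, bandwidth=10.0):
    """f(x) = alpha k(X_train, x), evaluated at each row of X_test."""
    return alpha @ laplace_kernel(X_train, X_test, bandwidth)   # (K x m) predictions
```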

The primary computational bottlenecks for Deep RFM are the kernel regression and the AGOP computation, especially for large $N$. DNNs trained with gradient descent implicitly learn the AGOP alignment but avoid explicit computation of the AGOP matrix, which is $O(d^2)$ in size. The paper's experiments on CIFAR-10, MNIST, and SVHN use typical dataset sizes (up to 50,000 samples) and network widths (512), demonstrating that the phenomenon is observable in moderate-scale settings. The theoretical results, while sometimes relying on simplified settings (e.g., linearized kernels, high dimensions), provide valuable insights into the underlying mechanisms that may translate to more complex, real-world scenarios.