
Average gradient outer product as a mechanism for deep neural collapse (2402.13728v6)

Published 21 Feb 2024 in cs.LG and stat.ML

Abstract: Deep Neural Collapse (DNC) refers to the surprisingly rigid structure of the data representations in the final layers of Deep Neural Networks (DNNs). Though the phenomenon has been measured in a variety of settings, its emergence is typically explained via data-agnostic approaches, such as the unconstrained features model. In this work, we introduce a data-dependent setting where DNC forms due to feature learning through the average gradient outer product (AGOP). The AGOP is defined with respect to a learned predictor and is equal to the uncentered covariance matrix of its input-output gradients averaged over the training dataset. The Deep Recursive Feature Machine (Deep RFM) is a method that constructs a neural network by iteratively mapping the data with the AGOP and applying an untrained random feature map. We demonstrate empirically that DNC occurs in Deep RFM across standard settings as a consequence of the projection with the AGOP matrix computed at each layer. Further, we theoretically explain DNC in Deep RFM in an asymptotic setting and as a result of kernel learning. We then provide evidence that this mechanism holds for neural networks more generally. In particular, we show that the right singular vectors and values of the weights can be responsible for the majority of within-class variability collapse for DNNs trained in the feature learning regime. As observed in recent work, this singular structure is highly correlated with that of the AGOP.

References (47)
  1. The neural tangent kernel in high dimensions: Triple descent and a multi-scale theory of generalization. In International Conference on Machine Learning, pp.  74–84. PMLR, 2020.
  2. A random matrix perspective on mixtures of nonlinearities for deep learning. arXiv preprint arXiv:1912.00827, 2019.
  3. Mechanism of feature learning in convolutional neural networks. arXiv preprint arXiv:2309.00570, 2023.
  4. Gradient descent induces alignment between weights and the empirical NTK for deep non-linear networks. arXiv preprint arXiv:2402.05271, 2024.
  5. Kernel learning in ridge regression "automatically" yields exact low rank solution. arXiv preprint arXiv:2310.11736, 2023.
  6. Neural collapse in deep linear network: From balanced to imbalanced data. arXiv preprint arXiv:2301.00437, 2023.
  7. Exploring deep neural networks via layer-peeled model: Minority collapse in imbalanced training. In Proceedings of the National Academy of Sciences (PNAS), volume 118, 2021.
  8. Improved generalization bounds for transfer learning via neural collapse. In First Workshop on Pre-training: Perspectives, Pitfalls, and Paths Forward at ICML, 2022.
  9. Linking neural collapse and L2 normalization with improved out-of-distribution detection in deep neural networks. Transactions on Machine Learning Research (TMLR), 2022.
  10. Neural collapse under MSE loss: Proximity to and dynamics on the central path. In International Conference on Learning Representations (ICLR), 2022.
  11. A law of data separation in deep learning. arXiv preprint arXiv:2210.17020, 2022.
  12. Neural collapse for unconstrained feature model under cross-entropy loss with imbalanced data. arXiv preprint arXiv:2309.09725, 2023.
  13. Universality laws for high-dimensional learning with random features. IEEE Transactions on Information Theory, 69(3):1932–1964, 2022.
  14. Limitations of neural collapse for understanding generalization in deep learning. arXiv preprint arXiv:2202.08384, 2022.
  15. Neural tangent kernel: Convergence and generalization in neural networks. In Conference on Neural Information Processing Systems (NeurIPS), 2018.
  16. An unconstrained layer-peeled perspective on neural collapse. In International Conference on Learning Representations (ICLR), 2022.
  17. Karoui, N. E. The spectrum of kernel random matrices. The Annals of Statistics, pp.  1–50, 2010.
  18. Kothapalli, V. Neural collapse: A review on modelling principles and generalization. In Transactions on Machine Learning Research (TMLR), 2023.
  19. The asymmetric maximum margin bias of quasi-homogeneous neural networks. arXiv preprint arXiv:2210.03820, 2022.
  20. Towards resolving the implicit bias of gradient descent for matrix factorization: Greedy low-rank learning. arXiv preprint arXiv:2012.09839, 2020.
  21. ReLU soothes the NTK condition number and accelerates optimization for wide neural networks. arXiv preprint arXiv:2305.08813, 2023.
  22. Neural collapse under cross-entropy loss. In Applied and Computational Harmonic Analysis, volume 59, 2022.
  23. Neural collapse with unconstrained features. arXiv preprint arXiv:2011.11619, 2020.
  24. Prevalence of neural collapse during the terminal phase of deep learning training. In Proceedings of the National Academy of Sciences (PNAS), volume 117, 2020.
  25. Neural collapse in the intermediate hidden layers of classification neural networks. arXiv preprint arXiv:2308.02760, 2023.
  26. Explicit regularization and implicit bias in deep network classifiers trained with the square loss. arXiv preprint arXiv:2101.00072, 2020.
  27. Feature learning in neural networks and kernel machines that recursively learn features. arXiv preprint arXiv:2212.13881, 2022.
  28. Linear recursive feature machines provably recover low-rank matrices. arXiv preprint arXiv:2401.04553, 2024.
  29. Feature learning in deep classifiers through intermediate neural collapse. Technical Report, 2023.
  30. Roughgarden, T. Beyond worst-case analysis. Communications of the ACM, 62(3):88–96, 2019.
  31. A generalized representer theorem. In International conference on computational learning theory, pp.  416–426. Springer, 2001.
  32. Neural (tangent kernel) collapse. arXiv preprint arXiv:2305.16427, 2023.
  33. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM (JACM), 51(3):385–463, 2004.
  34. On the robustness of neural collapse and the neural collapse of robustness. arXiv preprint arXiv:2311.07444, 2023.
  35. Deep neural collapse is provably optimal for the deep unconstrained features model. arXiv preprint arXiv:2305.13165, 2023.
  36. Imbalance trouble: Revisiting neural-collapse geometry. In Conference on Neural Information Processing Systems (NeurIPS), 2022.
  37. Extended unconstrained features model for exploring deep neural collapse. In International Conference on Machine Learning (ICML), 2022.
  38. Perturbation analysis of neural collapse. arXiv preprint arXiv:2210.16658, 2022.
  39. A consistent estimator of the expected gradient outer product. In UAI, pp. 819–828, 2014.
  40. Linear convergence analysis of neural collapse with unconstrained features. In OPT 2022: Optimization for Machine Learning (NeurIPS 2022 Workshop), 2022.
  41. How far pre-trained models are from neural collapse on the target dataset informs their transferability. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  5549–5558, 2023.
  42. On the emergence of simplex symmetry in the final and penultimate layers of neural network classifiers. In Mathematical and Scientific Machine Learning, 2022.
  43. Woodbury, M. A. Inverting modified matrices. Department of Statistics, Princeton University, 1950.
  44. Dynamics in deep classifiers trained with the square loss: Normalization, low rank, neural collapse, and generalization bounds. In Research, volume 6, 2023.
  45. Efficient estimation of the central mean subspace via smoothed gradient outer products. arXiv preprint arXiv:2312.15469, 2023.
  46. On the optimization landscape of neural collapse under MSE loss: Global optimality with unconstrained features. In International Conference on Machine Learning (ICML), 2022.
  47. Catapults in SGD: Spikes in the training loss and their impact on generalization through feature learning. arXiv preprint arXiv:2306.04815, 2023.

Summary

  • The paper argues that the AGOP is a primary driver of deep neural collapse, with projection onto the AGOP reducing within-class variability in DNNs.
  • It analyzes Deep RFM, a backpropagation-free model, to isolate the AGOP projection as the collapse-inducing step, and shows that in DNNs the right singular structure of the weight matrices, rather than ReLU, performs the effective linear denoising.
  • Theoretical and empirical results link AGOP to optimal kernel structures, offering actionable insights for DNN training and architecture design.

This paper, "Average gradient outer product as a mechanism for deep neural collapse" (2402.13728), proposes that feature learning, specifically through the Average Gradient Outer Product (AGOP), is the primary mechanism driving Deep Neural Collapse (DNC) in deep neural networks (DNNs). DNC is a phenomenon where feature representations of training data in the final and intermediate layers of overparameterized DNNs trained on classification tasks exhibit a rigid geometric structure: within-class features collapse to their class mean (NC1), and class means form an Equiangular Tight Frame (ETF) or orthogonal basis (NC2).

Existing theoretical explanations for DNC often rely on the Unconstrained Features Model (UFM), which is data-agnostic and simplifies the network structure, thus not fully capturing the role of feature learning or the data itself in the process. This paper bridges that gap by arguing that the AGOP, a statistic related to how the network uses its inputs for prediction, is central to DNC formation.

The paper first investigates how within-class variability (NC1) is reduced through the layers of a standard DNN. It shows that the improvement in NC1 occurs mainly due to the application of the right singular vectors and corresponding singular values of the weight matrix ($S_l V_l^\top$), rather than the ReLU non-linearity, which was often implicitly assumed to be the primary driver in feature-agnostic models like the UFM. The right singular component can be seen as performing a linear projection that effectively "denoises" the data by discarding variance in less important directions.
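
The kind of per-component ablation described above can be sketched as follows: decompose a layer's weight matrix via SVD and compare a simple within-/between-class variability ratio after applying only $S_l V_l^\top$ versus the full $\mathrm{ReLU}(W_l h)$ map. This is an illustrative reconstruction with a crude NC1 proxy, not the paper's exact protocol or metric.

```python
import numpy as np

def within_between_ratio(H, y, num_classes):
    """Crude NC1 proxy: within-class scatter over between-class scatter (smaller = more collapsed)."""
    mu_g = H.mean(axis=0)
    within, between = 0.0, 0.0
    for c in range(num_classes):
        Hc = H[y == c]
        mu_c = Hc.mean(axis=0)
        within += ((Hc - mu_c) ** 2).sum()
        between += len(Hc) * ((mu_c - mu_g) ** 2).sum()
    return within / between

def collapse_per_component(H, y, W, num_classes):
    """Which part of a layer reduces within-class variability: S V^T or the full ReLU layer?

    H: pre-layer features (N x d_in), W: layer weight matrix (d_out x d_in).
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    H_sv = H @ (np.diag(S) @ Vt).T            # right singular map  S V^T h
    H_full = np.maximum(H @ W.T, 0.0)         # full layer          ReLU(W h)
    return {
        "input": within_between_ratio(H, y, num_classes),
        "after S V^T": within_between_ratio(H_sv, y, num_classes),
        "after ReLU(W h)": within_between_ratio(H_full, y, num_classes),
    }
```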

Drawing upon the Neural Feature Ansatz (NFA), which states that the Gram matrix of weights $W_l^\top W_l$ is approximately proportional to the AGOP with respect to the inputs at layer $l$, the authors argue that this linear denoising is performed by projection onto the AGOP. This highlights the AGOP as a key operator in feature learning that directly contributes to the reduction of within-class variability.
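
As a rough illustration of what checking the NFA looks like in code, the sketch below computes the AGOP of a two-layer ReLU network with respect to its inputs by accumulating input-output Jacobians, and correlates it entrywise with $W_1^\top W_1$. The architecture, shapes, and correlation measure are illustrative choices, not the paper's experimental setup.

```python
import numpy as np

def agop_two_layer(W1, W2, X):
    """AGOP of f(x) = W2 relu(W1 x) with respect to its inputs.

    X: (N x d) inputs. Returns the d x d average of J(x)^T J(x) over the data,
    where J(x) is the input-output Jacobian at x.
    """
    d = X.shape[1]
    M = np.zeros((d, d))
    for x in X:
        mask = (W1 @ x > 0).astype(float)      # ReLU derivative at the pre-activations
        J = W2 @ (W1 * mask[:, None])          # Jacobian: W2 diag(mask) W1
        M += J.T @ J
    return M / len(X)

def nfa_correlation(W1, W2, X):
    """Neural Feature Ansatz check: entrywise Pearson correlation of W1^T W1 and the AGOP."""
    A = (W1.T @ W1).ravel()
    B = agop_two_layer(W1, W2, X).ravel()
    A, B = A - A.mean(), B - B.mean()
    return float(A @ B / (np.linalg.norm(A) * np.linalg.norm(B)))

# Example usage with random (untrained) weights; compare the value before and after training.
rng = np.random.default_rng(0)
X = rng.standard_normal((256, 10))
W1 = rng.standard_normal((64, 10)) / np.sqrt(10)
W2 = rng.standard_normal((3, 64)) / np.sqrt(64)
print(nfa_correlation(W1, W2, X))
```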

To demonstrate the AGOP's role more directly, the paper utilizes the Deep Recursive Feature Machine (Deep RFM). Deep RFM is a backpropagation-free model proposed in prior work (2309.00570) as an abstraction of DNN feature learning. It recursively transforms data by applying the square root of the AGOP (estimated from a shallow kernel machine) and then a random feature map (like a random ReLU network layer). Algorithm 1 in the paper outlines this process:

Algorithm 1: Deep Recursive Feature Machine (Deep RFM)
Input: X_1, Y, {k_l}_{l=1}^{L+1}, L, {Φ_l}_{l=1}^{L}
Output: α_{L+1}, {M_l}_{l=1}^{L}

For l = 1 to L:
  Normalize data X_l
  Learn kernel regression coefficients α_l = Y(k_l(X_l, X_l) + μI)^(-1)
  Construct predictor f_l(⋅) = α_l k_l(⋅, X_l)
  Compute AGOP: M_l = Σ_{c,i} ∇f_l(x^l_{ci}) ∇f_l(x^l_{ci})^T = X_l α_l^T α_l X_l^T
  Transform data X_{l+1} = Φ_l(M_l^{1/2} X_l)
End For
Normalize data X_{L+1}
Learn coefficients α_{L+1} = Y(k_{L+1}(X_{L+1}, X_{L+1}) + μI)^(-1)
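
Below is a minimal NumPy sketch of the loop in Algorithm 1 under simplifying assumptions: a Gaussian kernel (the paper's experiments use a Laplace kernel), one-hot labels Y of shape $K \times n$, per-sample normalization, and an unscaled random ReLU feature map. It is meant to make the structure of Deep RFM concrete, not to reproduce the paper's exact configuration or results.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=2.0):
    """k(x, z) = exp(-||x - z||^2 / (2 sigma^2)); X: (n x d), Z: (m x d)."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def agop(X, alpha, sigma=2.0):
    """AGOP of the kernel predictor f(x) = alpha k(X, x): average of J(x)^T J(x) over the data."""
    n, d = X.shape
    K_mat = gaussian_kernel(X, X, sigma)
    M = np.zeros((d, d))
    for j in range(n):
        # grad_x k(x, x_i) at x = x_j equals -(x_j - x_i) / sigma^2 * k(x_j, x_i)
        grads = -(X[j] - X) / sigma ** 2 * K_mat[j][:, None]   # (n x d)
        J = alpha @ grads                                       # (K x d) Jacobian at x_j
        M += J.T @ J
    return M / n

def deep_rfm(X, Y, depth=5, width=512, mu=1e-3, sigma=2.0, seed=0):
    """Minimal Deep RFM loop: kernel ridge -> AGOP -> sqrt(AGOP) projection -> random ReLU features."""
    rng = np.random.default_rng(seed)
    features = [X]
    for _ in range(depth):
        Xl = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)  # normalize data (one plausible choice)
        K_mat = gaussian_kernel(Xl, Xl, sigma)
        alpha = Y @ np.linalg.inv(K_mat + mu * np.eye(len(Xl)))      # kernel ridge coefficients (K x n)
        M = agop(Xl, alpha, sigma)
        evals, evecs = np.linalg.eigh(M)
        M_sqrt = evecs @ np.diag(np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T
        W = rng.standard_normal((width, Xl.shape[1])) / np.sqrt(Xl.shape[1])
        X = np.maximum((Xl @ M_sqrt) @ W.T, 0.0)                     # random ReLU feature map
        features.append(X)
    return features
```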

The paper empirically shows that Deep RFM exhibits DNC (Figures 2 and 3), with the within-class variability collapsing and class means forming an ETF. Crucially, experiments reveal that in Deep RFM, the DNC improvement is primarily driven by the multiplication with the AGOP square root ($M_l^{1/2}$), while the random feature map ($\Phi_l$) has little effect or can even slightly worsen the NC1 metric.

This empirical finding is supported by a theoretical proposition demonstrating that random feature maps with ReLU activations tend to increase distances between vectors rather than collapsing them (Figure 4), explaining why this component of Deep RFM (and the ReLU in DNNs) doesn't significantly reduce within-class variability.

Further theoretical analyses strengthen the connection between AGOP and DNC in Deep RFM.

  1. Asymptotic Analysis: Under assumptions of high-dimensional, full-rank data and linearized kernels/feature maps ($k(X) \approx X^\top X + \lambda_k I$, $\Phi(X) \approx X + \lambda_\Phi^{1/2} I$), the paper proves (Theorem 1) that applying the Deep RFM transformation iteratively leads to the data Gram matrix converging exponentially to a collapsed state ($yy^\top + \lambda_\Phi I$). The rate of collapse depends on the ratio $\lambda_k/\lambda_\Phi$, suggesting that the "non-linearity" introduced by the feature map (represented by $\lambda_\Phi$) is crucial for convergence, while deviations from perfect linearity in the kernel ($\lambda_k$) can slow it down.
  2. Non-asymptotic Analysis: By analyzing a relaxed version of a parametrized kernel ridge regression problem (which includes a single layer of RFM), the paper shows (Theorem 2) that the optimal kernel matrix is exactly the matrix representing the collapsed state ($I_K \otimes \mathbf{1}_n\mathbf{1}_n^\top$). This suggests that models implicitly or explicitly minimizing such objectives are biased towards forming collapsed representations. Since RFM is based on kernel ridge regression and uses the AGOP, this provides a theoretical link between AGOP and the optimal kernel structure corresponding to NC1 (a small numerical check of this collapsed structure follows the list).
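
As a small sanity check (not from the paper), the target kernel matrix $I_K \otimes \mathbf{1}_n\mathbf{1}_n^\top$ in Theorem 2 is exactly the Gram matrix obtained when every sample's feature equals its class mean and the class means are orthonormal, i.e., a fully NC1-collapsed representation:

```python
import numpy as np

K, n = 4, 5                                    # K classes, n samples per class
class_means = np.eye(K)                        # orthonormal class means

# Fully collapsed features: every sample equals its class mean.
H = np.repeat(class_means, n, axis=0)          # (K*n) x K feature matrix

gram = H @ H.T                                 # N x N Gram / kernel matrix
target = np.kron(np.eye(K), np.ones((n, n)))   # I_K ⊗ (1_n 1_n^T)

assert np.allclose(gram, target)
```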

In summary, the paper provides substantial empirical and theoretical evidence that feature learning through the AGOP is a core mechanism underlying the formation of Deep Neural Collapse. It shows that in both DNNs and Deep RFM, the linear transformation associated with the AGOP (or the weight matrix component $S_l V_l^\top$ in DNNs) is primarily responsible for collapsing within-class variability, whereas non-linearities like ReLU play a less significant role in this specific aspect of collapse.

Practical Implications and Implementation:

  • Understanding DNC: This work provides a feature-learning-centric view of DNC, complementing existing UFM-based theories. Understanding DNC as a consequence of AGOP-based feature learning can guide further research into its benefits (generalization, robustness) and drawbacks (e.g., in imbalanced datasets).
  • DNN Training: The finding that the weight matrix's right singular space, related to the AGOP via NFA, drives NC1 suggests that training methods influencing the alignment of weights with the AGOP could implicitly control DNC. Learning rates, weight decay, and initialization strategies known to affect NFA correlation (2402.05271, 2306.04815) could be relevant knobs for controlling DNC in practice.
  • Deep RFM as a Model: Deep RFM is a backpropagation-free architecture. While currently not competitive with state-of-the-art DNNs on large-scale tasks, its ability to exhibit DNC through an interpretable AGOP mechanism makes it a valuable theoretical tool. If its performance could be improved, it might inspire more efficient alternatives to gradient descent for deep feature learning.
  • Designing Architectures/Regularization: The theoretical analysis points to the parameters $\lambda_k$ and $\lambda_\Phi$ as influential on collapse speed. Future work could explore designing non-linearities or regularization techniques that effectively tune these parameters in DNNs to promote desired DNC properties. For example, modifications to ReLU or batch normalization might affect $\lambda_\Phi$, while weight decay might influence $\lambda_k$.
  • Implementation: Implementing Deep RFM (Algorithm 1) involves the following steps (a minimal kernel-side sketch follows this list):
    • Choosing a kernel function (e.g., the Laplace kernel used in the paper's experiments).
    • Computing the kernel matrix $k_l(X_l, X_l)$.
    • Solving the kernel ridge regression, $\alpha_l = Y(k_l(X_l, X_l) + \mu I)^{-1}$. This requires inverting an $N \times N$ matrix, which can be computationally expensive for large datasets but is feasible for moderate $N$.
    • Computing the AGOP matrix $M_l = X_l \alpha_l^\top \alpha_l X_l^\top$, which costs $O(N^2 d_l)$ or $O(N^2 K^2 + N d_l K)$ depending on how it is computed.
    • Computing the matrix square root $M_l^{1/2}$. This requires an eigen-decomposition of $M_l$ ($O(d_l^3)$, or $O(N^3)$ if using the $N \times N$ version $X_l^\top M_l X_l$).
    • Applying a random feature map $\Phi_l(Z) = \sigma(W_l Z)$, where $W_l$ is a random matrix. This is computationally efficient, similar to a standard linear layer followed by ReLU.
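
As a sketch of the first few implementation steps, the snippet below sets up a Laplace kernel and the kernel ridge solve from Algorithm 1; the bandwidth, ridge parameter, and label convention (one-hot $Y$ of shape $K \times n$) are illustrative choices rather than the paper's settings.

```python
import numpy as np

def laplace_kernel(X, Z, bandwidth=10.0):
    """Laplace kernel k(x, z) = exp(-||x - z|| / bandwidth); X: (n x d), Z: (m x d)."""
    dists = np.sqrt(((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1))
    return np.exp(-dists / bandwidth)

def kernel_ridge_fit(X, Y, mu=1e-3, bandwidth=10.0):
    """alpha = Y (k(X, X) + mu I)^{-1}, with one-hot labels Y of shape (K x n)."""
    K_mat = laplace_kernel(X, X, bandwidth)
    return Y @ np.linalg.inv(K_mat + mu * np.eye(len(X)))       # O(n^3) solve

def kernel_ridge_predict(alpha, X_train, X_test, bandwidth=10.0):
    """f(x) = alpha k(X_train, x), evaluated at each row of X_test."""
    return alpha @ laplace_kernel(X_train, X_test, bandwidth)   # (K x m) predictions
```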

The primary computational bottlenecks for Deep RFM are the kernel regression and the AGOP computation, especially for large $N$. DNNs trained with gradient descent implicitly learn the AGOP alignment but avoid explicit computation of the AGOP matrix, which is $O(d^2)$ in size. The paper's experiments on CIFAR-10, MNIST, and SVHN use typical dataset sizes (up to 50,000 samples) and network widths (512), demonstrating that the phenomenon is observable in moderate-scale settings. The theoretical results, while sometimes relying on simplified settings (e.g., linearized kernels, high dimensions), provide valuable insights into the underlying mechanisms that may translate to more complex, real-world scenarios.