Limitations of Neural Collapse for Understanding Generalization in Deep Learning (2202.08384v1)

Published 17 Feb 2022 in cs.LG, cs.CV, and stat.ML

Abstract: The recent work of Papyan, Han, & Donoho (2020) presented an intriguing "Neural Collapse" phenomenon, showing a structural property of interpolating classifiers in the late stage of training. This opened a rich area of exploration studying this phenomenon. Our motivation is to study the upper limits of this research program: How far will understanding Neural Collapse take us in understanding deep learning? First, we investigate its role in generalization. We refine the Neural Collapse conjecture into two separate conjectures: collapse on the train set (an optimization property) and collapse on the test distribution (a generalization property). We find that while Neural Collapse often occurs on the train set, it does not occur on the test set. We thus conclude that Neural Collapse is primarily an optimization phenomenon, with as-yet-unclear connections to generalization. Second, we investigate the role of Neural Collapse in feature learning. We show simple, realistic experiments where training longer leads to worse last-layer features, as measured by transfer-performance on a downstream task. This suggests that neural collapse is not always desirable for representation learning, as previously claimed. Finally, we give preliminary evidence of a "cascading collapse" phenomenon, wherein some form of Neural Collapse occurs not only for the last layer, but in earlier layers as well. We hope our work encourages the community to continue the rich line of Neural Collapse research, while also considering its inherent limitations.

Authors (3)
  1. Like Hui (5 papers)
  2. Mikhail Belkin (76 papers)
  3. Preetum Nakkiran (43 papers)
Citations (50)

Summary

  • The paper finds that neural collapse observed during training does not extend to test data, undermining its assumed link to generalization.
  • The paper rigorously defines train collapse, weak test-collapse, and strong test-collapse, demonstrating that strong test-collapse is theoretically infeasible.
  • Empirical results across datasets and architectures reveal that increased train collapse may correlate with poorer transfer learning performance and overall generalization.

This paper, "Limitations of Neural Collapse for Understanding Generalization in Deep Learning" (Hui et al., 2022), investigates the significance of the "Neural Collapse" (NC) phenomenon, first observed in [Papyan24652], particularly its purported connection to generalization in deep learning. Neural Collapse describes a phenomenon where, during the late stages of training deep neural networks for classification, the last-layer feature representations for training samples of the same class converge to a single point, and these class-specific points exhibit a specific geometric structure (e.g., a Simplex Equiangular Tight Frame).
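For concreteness, the Simplex Equiangular Tight Frame (ETF) structure can be stated as follows (this is the standard NC characterization, sketched here rather than quoted from the paper): after centering by the global feature mean and renormalizing, the class-mean features $\tilde{\mu}_1, \dots, \tilde{\mu}_k$ satisfy approximately

$$
\langle \tilde{\mu}_c, \tilde{\mu}_{c'} \rangle \;\approx\;
\begin{cases}
1 & c = c', \\
-\dfrac{1}{k-1} & c \neq c',
\end{cases}
$$

i.e., the $k$ class means are unit vectors that are maximally and equally separated.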

The authors argue that the role of NC in understanding generalization has been unclear due to ambiguities in whether NC refers to behavior on the training set or the test set, and the role of sample size. To clarify this, they propose more precise definitions of NC:

  1. Train-Collapse: Features of training samples of the same class converge to a single point as training time $t \to \infty$. This is an optimization property defined for a specific training set $S$.
  2. Weak Test-Collapse: Features of test samples converge to one of $k$ distinct points (where $k$ is the number of classes) as $t \to \infty$, for almost all test samples drawn from the data distribution $\mathcal{D}$.
  3. Strong Test-Collapse: Features of test samples converge to the specific point associated with their Bayes-optimal class as $t \to \infty$, for almost all test samples.

Crucially, these definitions require the collapse to occur for any finite training set size $n$, as $t \to \infty$.
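Written compactly, with $h_t$ denoting the last-layer feature map at training time $t$, $S_c$ the training samples of class $c$, and $c^*(x)$ the Bayes-optimal class of $x$ (this notation is a sketch consistent with the descriptions above, not necessarily the paper's exact formalism):

$$
\begin{aligned}
\text{Train-Collapse:} &\quad \lim_{t \to \infty} \| h_t(x) - h_t(x') \| = 0 \quad \text{for all } x, x' \in S_c \text{ and all classes } c, \\
\text{Weak Test-Collapse:} &\quad \exists\, \mu^{(t)}_1, \dots, \mu^{(t)}_k \ \text{such that} \ \lim_{t \to \infty} \min_{j} \| h_t(x) - \mu^{(t)}_j \| = 0 \quad \text{for } \mathcal{D}\text{-almost-all } x, \\
\text{Strong Test-Collapse:} &\quad \lim_{t \to \infty} \| h_t(x) - \mu^{(t)}_{c^*(x)} \| = 0 \quad \text{for } \mathcal{D}\text{-almost-all } x,
\end{aligned}
$$

where $\mu^{(t)}_c$ is the feature point associated with class $c$ at time $t$.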

Theoretical Feasibility and Practical Implications:

The paper argues that Strong Test-Collapse (Definition 3) is theoretically infeasible in realistic settings. If test samples mapped perfectly to class-specific points regardless of training data size, it would imply that a Bayes-optimal classifier could be extracted even from models trained on very few samples, which contradicts fundamental statistical learning principles.

Weak Test-Collapse (Definition 2) is theoretically possible but unlikely to occur with standard training methods like SGD. While it doesn't imply learning the Bayes-optimal classifier, it still implies mapping a continuous test distribution to a discrete set of points in feature space, which is a strong property.

Train-Collapse (Definition 1), on the other hand, is widely observed empirically [Papyan24652] and in this paper. It describes how the network fits the training data very precisely in the late training phase.

Empirical Findings: Train vs. Test Collapse:

The authors conduct extensive experiments on various datasets (MNIST, FashionMNIST, CIFAR-10, SVHN, STL-10) and architectures (ResNet, DenseNet, VGG). They quantify the "degree of collapse" using variance measures similar to [Papyan24652] (a code sketch of these measures follows the list):

  • TrainVariance: Measures the variance of features within each class on the training set, normalized by the variance between class means on the training set.
  • TestVariance (Strong): Measures the variance of features within each class on the test set, normalized by the variance between class means on the test set.
  • TestVariance (Weak): Measures the variance of features within clusters found by k-means on the test set features, normalized by variance between these clusters.
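A minimal sketch of how such collapse measures can be computed from extracted last-layer features is shown below. The exact normalization used in the paper may differ; the variable names, the feature-extraction step, and the use of scikit-learn's KMeans for the weak variant are assumptions of this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

def within_between_variance(features, labels):
    """Within-class feature variance normalized by between-class variance.
    Lower values indicate stronger collapse. `features` is (n, d), `labels` is (n,)."""
    classes = np.unique(labels)
    global_mean = features.mean(axis=0)
    within, between = 0.0, 0.0
    for c in classes:
        cls_feats = features[labels == c]
        mu_c = cls_feats.mean(axis=0)
        # average squared distance of class-c features to their class mean
        within += ((cls_feats - mu_c) ** 2).sum(axis=1).mean()
        # squared distance of the class mean to the global mean
        between += ((mu_c - global_mean) ** 2).sum()
    return (within / len(classes)) / (between / len(classes))

def weak_test_variance(test_features, k):
    """Weak variant: cluster test features into k groups with k-means,
    then measure within-cluster vs. between-cluster variance."""
    cluster_ids = KMeans(n_clusters=k, n_init=10).fit_predict(test_features)
    return within_between_variance(test_features, cluster_ids)

# TrainVariance:          within_between_variance(train_feats, train_labels)
# TestVariance (strong):  within_between_variance(test_feats, test_labels)
# TestVariance (weak):    weak_test_variance(test_feats, k=num_classes)
```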

The experiments consistently show:

  • Train-Collapse occurs reliably in most settings, with TrainVariance approaching a small value over training time.
  • Test-Collapse (both strong and weak) does not occur. TestVariance remains significantly higher than TrainVariance, exhibiting a "generalization gap" in collapse (Figure 1, Figure 3). The test features do not collapse to discrete points.

This indicates that Neural Collapse, as observed in practice, is primarily a phenomenon specific to the training data and the optimization process, rather than a structural property of the model's behavior on the underlying test distribution.

Neural Collapse and Generalization:

The paper argues that neither Train-Collapse nor Weak Test-Collapse is sufficient for good generalization: it is possible to construct models that satisfy either property yet generalize poorly. Strong Test-Collapse would be sufficient, but is infeasible in practice.

Furthermore, the authors present evidence suggesting that Train-Collapse might be anti-correlated with generalization quality in certain settings:

  1. Varying Dataset Size: On CIFAR-10 and FashionMNIST, increasing the size of the training dataset leads to lower Strong TestVariance (more test "collapse" in the sense of reduced variance, though not full collapse) and better test accuracy. However, larger datasets also result in higher TrainVariance (less train collapse) at the end of training (Figure 4). This suggests that the phenomena on the train and test sets behave differently and can even move in opposite directions in response to changes in training data size, highlighting that train collapse is not a reliable indicator of test behavior.
  2. Transfer Learning: They pre-train models on binary super-class tasks (e.g., odd/even digits on MNIST, animals/objects on CIFAR-10) and then fine-tune them on the original multi-class task (10-way, 8-way). They save checkpoints during pre-training and evaluate their quality for transfer learning by fine-tuning on the downstream task (see the sketch after this list). They find that checkpoints exhibiting more Train-Collapse (lower TrainVariance) on the pre-training task lead to worse downstream performance after fine-tuning (Figure 5). This counter-intuitive result suggests that the feature representations learned in the heavily collapsed late stage of training on the simple pre-training task are less useful for transfer to a related, more complex task.
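A hedged sketch of this checkpoint-evaluation protocol is given below, assuming PyTorch, a ResNet-style model with an `fc` classifier head, and illustrative hyperparameters and loader names (none of which are taken from the paper's code):

```python
import copy
import torch

def evaluate_transfer(checkpoints, train_loader, test_loader,
                      num_classes, finetune_epochs=5):
    """Fine-tune each pre-training checkpoint on the downstream task and
    report its downstream test accuracy (higher = better transfer)."""
    accuracies = []
    for ckpt in checkpoints:
        model = copy.deepcopy(ckpt)
        # Replace the pre-training head (binary super-class) with a fresh
        # head for the downstream multi-class task.
        model.fc = torch.nn.Linear(model.fc.in_features, num_classes)
        opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
        loss_fn = torch.nn.CrossEntropyLoss()
        model.train()
        for _ in range(finetune_epochs):
            for x, y in train_loader:
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for x, y in test_loader:
                correct += (model(x).argmax(dim=1) == y).sum().item()
                total += y.numel()
        accuracies.append(correct / total)
    return accuracies
```

Plotting the resulting accuracies against each checkpoint's TrainVariance on the pre-training task is what exposes the anti-correlation described above (Figure 5).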

Cascading Collapse (Preliminary):

As a preliminary observation, the paper explores whether the collapse phenomenon extends to earlier layers of the network. In experiments with a fully-connected network on MNIST, they observe that earlier layers also show a decrease in within-class variance over training time, but this collapse appears to happen later and to a lesser degree than in the last hidden layer (Figure 6). They term this "cascading collapse" and suggest it as an interesting avenue for future research into the optimization dynamics of deep networks, again framing it primarily as an optimization phenomenon.
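One minimal way to probe this per-layer behavior is to collect intermediate activations with forward hooks and feed them into the same within/between-class variance measure sketched earlier. This assumes a PyTorch model; the layer names and flattening choice are illustrative, not the paper's code.

```python
import torch

def collect_layer_features(model, loader, layer_names):
    """Gather activations at the named layers so the within/between-class
    variance measure can be computed per layer (and per checkpoint)."""
    feats = {name: [] for name in layer_names}
    hooks = []
    for name, module in model.named_modules():
        if name in layer_names:
            hooks.append(module.register_forward_hook(
                lambda mod, inp, out, name=name:
                    feats[name].append(out.flatten(1).detach().cpu())
            ))
    model.eval()
    with torch.no_grad():
        for x, _ in loader:
            model(x)
    for h in hooks:
        h.remove()
    return {name: torch.cat(chunks).numpy() for name, chunks in feats.items()}

# Example (hypothetical layer names for a fully-connected MNIST network):
# per_layer = collect_layer_features(model, train_loader, ["fc1", "fc2", "fc3"])
# then apply within_between_variance(per_layer[name], train_labels) for each layer.
```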

Conclusion for Practitioners:

The paper concludes that while Neural Collapse is a fascinating optimization phenomenon that reliably occurs on the training data in the late stages of deep learning, its direct relevance and benefits for generalization and learning high-quality transferable representations are limited and perhaps even negative in some scenarios. For developers and engineers, this suggests:

  • Observing Neural Collapse (on the training set) is an indicator of reaching a specific phase of optimization, but it should not be directly interpreted as a cause or guarantee of good generalization performance on unseen data.
  • Training deep into the "Neural Collapse" phase might not always yield the best representations for downstream transfer tasks. If the goal is good transfer learning, achieving zero training loss and high train collapse might be counterproductive compared to earlier training stages or using different training strategies.
  • Metrics related to Train-Collapse might serve as indicators of optimization progress or state on the training data, but are poor proxies for evaluating generalization performance or feature quality for new data. Test-time behavior is fundamentally different.
  • Future research on NC should continue exploring its role in optimization dynamics (like cascading collapse), but its connection to generalization requires careful distinction between train-time and test-time phenomena.