Pathologies of Predictive Diversity in Deep Ensembles (2302.00704v3)

Published 1 Feb 2023 in cs.LG and stat.ML

Abstract: Classic results establish that encouraging predictive diversity improves performance in ensembles of low-capacity models, e.g. through bagging or boosting. Here we demonstrate that these intuitions do not apply to high-capacity neural network ensembles (deep ensembles), and in fact the opposite is often true. In a large scale study of nearly 600 neural network classification ensembles, we examine a variety of interventions that trade off component model performance for predictive diversity. While such interventions can improve the performance of small neural network ensembles (in line with standard intuitions), they harm the performance of the large neural network ensembles most often used in practice. Surprisingly, we also find that discouraging predictive diversity is often benign in large-network ensembles, fully inverting standard intuitions. Even when diversity-promoting interventions do not sacrifice component model performance (e.g. using heterogeneous architectures and training paradigms), we observe an opportunity cost associated with pursuing increased predictive diversity. Examining over 1000 ensembles, we observe that the performance benefits of diverse architectures/training procedures are easily dwarfed by the benefits of simply using higher-capacity models, despite the fact that such higher capacity models often yield significantly less predictive diversity. Overall, our findings demonstrate that standard intuitions around predictive diversity, originally developed for low-capacity ensembles, do not directly apply to modern high-capacity deep ensembles. This work clarifies fundamental challenges to the goal of improving deep ensembles by making them more diverse, while suggesting an alternative path: simply forming ensembles from ever more powerful (and less diverse) component models.
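To make the quantities discussed in the abstract concrete, the sketch below shows one simple way to form a deep ensemble's prediction by averaging member probabilities and to measure predictive diversity as average pairwise disagreement between member predictions. This is an illustrative assumption, not the paper's exact diversity metric or experimental setup; the array shapes and function names are hypothetical.

```python
# Minimal sketch (assumed, not from the paper): ensemble by averaging member
# softmax outputs; quantify predictive diversity as mean pairwise disagreement.
import numpy as np

def ensemble_predict(member_probs: np.ndarray) -> np.ndarray:
    """member_probs: (M, N, C) softmax outputs of M members on N examples."""
    return member_probs.mean(axis=0)  # (N, C) averaged class probabilities

def pairwise_disagreement(member_probs: np.ndarray) -> float:
    """Average fraction of examples on which two members' hard predictions differ."""
    labels = member_probs.argmax(axis=-1)  # (M, N) hard predictions per member
    m = labels.shape[0]
    rates = [
        (labels[i] != labels[j]).mean()
        for i in range(m) for j in range(i + 1, m)
    ]
    return float(np.mean(rates))

# Toy usage with random "predictions": M=4 members, N=1000 examples, C=10 classes.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 1000, 10))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

print(ensemble_predict(probs).shape)   # (1000, 10)
print(pairwise_disagreement(probs))    # higher value = more predictive diversity
```

In this framing, the interventions studied in the paper trade off the accuracy of each member (the quality of each slice of `member_probs`) against the disagreement measured above; the paper's finding is that for large-capacity members, pushing that disagreement up tends not to help the averaged prediction.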

