ERM++: An Improved Baseline for Domain Generalization (2304.01973v4)

Published 4 Apr 2023 in cs.LG and cs.CV

Abstract: Domain Generalization (DG) aims to develop classifiers that can generalize to new, unseen data distributions, a critical capability when collecting new domain-specific data is impractical. A common DG baseline minimizes the empirical risk on the source domains. Recent studies have shown that this approach, known as Empirical Risk Minimization (ERM), can outperform most more complex DG methods when properly tuned. However, these studies have primarily focused on a narrow set of hyperparameters, neglecting other factors that can enhance robustness and prevent overfitting and catastrophic forgetting, properties which are critical for strong DG performance. In our investigation of training data utilization (i.e., duration and setting validation splits), initialization, and additional regularizers, we find that tuning these previously overlooked factors significantly improves model generalization across diverse datasets without adding much complexity. We call this improved, yet simple baseline ERM++. Despite its ease of implementation, ERM++ improves DG performance by over 5% compared to prior ERM baselines on a standard benchmark of 5 datasets with a ResNet-50 and over 15% with a ViT-B/16. It also outperforms all state-of-the-art methods on DomainBed datasets with both architectures. Importantly, ERM++ is easy to integrate into existing frameworks like DomainBed, making it a practical and powerful tool for researchers and practitioners. Overall, ERM++ challenges the need for more complex DG methods by providing a stronger, more reliable baseline that maintains simplicity and ease of use. Code is available at https://github.com/piotr-teterwak/erm_plusplus

Citations (6)

Summary

  • The paper introduces ERM++, an enhanced baseline that expands hyperparameter tuning to include training duration, initialization, and new regularization techniques.
  • It employs automated learning-rate and training-duration selection together with a full-data training strategy, contributing to over 5% improvement with a ResNet-50 and more than 15% with a ViT-B/16 architecture.
  • Advanced pre-training methods and model parameter averaging in ERM++ effectively reduce overfitting and catastrophic forgetting on diverse DG benchmarks.

The paper "ERM++: An Improved Baseline for Domain Generalization" presents an enhanced baseline method for Domain Generalization (DG) by optimizing the traditional Empirical Risk Minimization (ERM) approach. The authors identify additional hyperparameters that can further bolster ERM’s performance in DG settings by reducing overfitting and catastrophic forgetting, which are critical issues when adapting to unseen data distributions.

Key Contributions:

  1. Hyperparameter Tuning Beyond Basics: Traditional ERM setups focus on tuning only a few hyperparameters such as learning rate, weight decay, batch size, and dropout. This paper expands the scope to consider training duration, initialization, and additional regularizers, leading to their improved baseline, named ERM++.
  2. Training Amount Optimization:
    • Auto-LR: An automated procedure that adjusts the learning rate and determines training duration based on validation performance, ensuring convergence without excessive overfitting.
    • Full Data Usage: Instead of holding out a fixed training-validation split, ERM++ uses a two-pass strategy that ultimately lets the model train on the full dataset, including the validation data, maximizing data utility (a minimal sketch appears after this list).
  3. Initialization Improvements:
    • The paper evaluates and integrates modern pre-training techniques, such as AugMix-trained and DINOv2 weights, which provide stronger initializations for neural networks and thereby improve convergence and generalization performance (see the backbone-loading sketch after this list).
  4. Regularization Techniques:
    • Model Parameter Averaging (MPA): Averaging model iterates over the course of training leads to flatter minima and better generalization (see the MPA sketch after this list).
    • Warm Start and Unfreezing BatchNorm: A warm-start phase that initially trains only part of the network helps preserve the pre-trained initialization, while leaving batch-normalization layers unfrozen provides additional regularization (a warm-start sketch follows this list).
  5. Experimental Results:
    • ERM++ shows over a 5% improvement over prior ERM baselines on a suite of DG benchmarks with a ResNet-50 backbone and more than a 15% improvement using a ViT-B/16 architecture.
    • The paper also explores how data similarity to pre-training distributions influences DG performance, stressing that ERM++ with robust initializations excels even with dissimilar datasets.
  6. Efficiency and Impact:
    • By fine-tuning overlooked parameters and efficiently utilizing available data, ERM++ establishes a new benchmark for DG. The approach is compatible with existing methods and frameworks, facilitating easy adoption and further research development.
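
To make the full-data idea concrete, the following is a minimal sketch of a two-pass schedule: pass 1 trains against a held-out validation split only to select the training duration, and pass 2 restarts from the same initialization and trains on all source data (train + validation) for that duration. The synthetic data, toy model, optimizer settings, and epoch budget are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

torch.manual_seed(0)
train_set = TensorDataset(torch.randn(256, 16), torch.randint(0, 4, (256,)))
val_set = TensorDataset(torch.randn(64, 16), torch.randint(0, 4, (64,)))

def make_model() -> nn.Module:
    torch.manual_seed(1)  # identical initialization in both passes
    return nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

def train_one_epoch(model: nn.Module, dataset, opt) -> None:
    for x, y in DataLoader(dataset, batch_size=32, shuffle=True):
        loss = nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

def accuracy(model: nn.Module, dataset: TensorDataset) -> float:
    x, y = dataset.tensors
    return (model(x).argmax(dim=1) == y).float().mean().item()

# Pass 1: train on the training split and record the best epoch on validation.
model = make_model()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
best_epoch, best_acc = 0, 0.0
for epoch in range(1, 21):
    train_one_epoch(model, train_set, opt)
    acc = accuracy(model, val_set)
    if acc > best_acc:
        best_epoch, best_acc = epoch, acc

# Pass 2: restart from the same initialization and train on train + validation
# data for the duration selected in pass 1, so no labeled data is wasted.
final_model = make_model()
final_opt = torch.optim.SGD(final_model.parameters(), lr=0.1)
for _ in range(best_epoch):
    train_one_epoch(final_model, ConcatDataset([train_set, val_set]), final_opt)
```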
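
For the initialization point, stronger pre-trained weights can be swapped in from public checkpoints. The sketch below loads a DINOv2 ViT-B/14 backbone from torch.hub (network access required) and attaches a fresh linear head; the 7-class head and input size are illustrative assumptions, and an AugMix-pretrained ResNet-50 could be substituted in the same way.

```python
import torch
import torch.nn as nn

# Load a DINOv2 ViT-B/14 backbone and attach a new, randomly initialized head.
# The 7-class head is an illustrative assumption (PACS, for example, has 7 classes).
backbone = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')
model = nn.Sequential(
    backbone,           # forward() returns a 768-dim CLS embedding for ViT-B/14
    nn.Linear(768, 7),  # randomly initialized classification head
)

with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))  # spatial size must be a multiple of 14
print(logits.shape)  # torch.Size([1, 7])
```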
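
Model parameter averaging can be sketched with PyTorch's built-in weight-averaging utility, which maintains a running average of the weights alongside the model being optimized; the toy model, data, and step count below are placeholders rather than the paper's configuration.

```python
import torch
import torch.nn as nn
from torch.optim.swa_utils import AveragedModel

# Toy stand-ins for the real backbone and DomainBed loaders.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
avg_model = AveragedModel(model)  # keeps a running average of the weights (MPA)

for step in range(100):
    x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
    loss = nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    avg_model.update_parameters(model)  # update the weight average every step

# Evaluate with avg_model rather than the final iterate; for BatchNorm
# backbones, recompute running statistics first with
# torch.optim.swa_utils.update_bn(loader, avg_model).
```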
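
Finally, a warm start amounts to freezing the pre-trained backbone while only the new head is trained, then unfreezing everything for full fine-tuning; the tiny modules and step counts below are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Tiny stand-ins: `backbone` plays the role of a pre-trained network and
# `head` is a freshly initialized classifier.
backbone = nn.Sequential(nn.Linear(16, 32), nn.ReLU())
head = nn.Linear(32, 4)
model = nn.Sequential(backbone, head)

def run_steps(num_steps: int) -> None:
    # Only parameters with requires_grad=True are handed to the optimizer.
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(params, lr=0.1)
    for _ in range(num_steps):
        x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
        loss = nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Warm-start phase: backbone frozen, only the head learns.
for p in backbone.parameters():
    p.requires_grad = False
run_steps(50)

# Main phase: unfreeze the backbone and fine-tune the whole network.
for p in backbone.parameters():
    p.requires_grad = True
run_steps(200)
```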

Overall, ERM++ offers a robust and efficient enhancement to existing DG baselines, backed by thorough experimentation across multiple datasets, elucidating the critical role of training procedures, initialization strategies, and regularization in achieving state-of-the-art DG performance.
