Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
104 tokens/sec
GPT-4o
12 tokens/sec
Gemini 2.5 Pro Pro
40 tokens/sec
o3 Pro
5 tokens/sec
GPT-4.1 Pro
38 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Data Augmentation Revisited: Rethinking the Distribution Gap between Clean and Augmented Data (1909.09148v2)

Published 19 Sep 2019 in cs.LG, cs.CV, and stat.ML

Abstract: Data augmentation has been widely applied as an effective methodology to improve generalization in particular when training deep neural networks. Recently, researchers proposed a few intensive data augmentation techniques, which indeed improved accuracy, yet we notice that these methods augment data have also caused a considerable gap between clean and augmented data. In this paper, we revisit this problem from an analytical perspective, for which we estimate the upper-bound of expected risk using two terms, namely, empirical risk and generalization error, respectively. We develop an understanding of data augmentation as regularization, which highlights the major features. As a result, data augmentation significantly reduces the generalization error, but meanwhile leads to a slightly higher empirical risk. On the assumption that data augmentation helps models converge to a better region, the model can benefit from a lower empirical risk achieved by a simple method, i.e., using less-augmented data to refine the model trained on fully-augmented data. Our approach achieves consistent accuracy gain on a few standard image classification benchmarks, and the gain transfers to object detection.

Citations (61)

Summary

We haven't generated a summary for this paper yet.