- The paper introduces a novel gradient caching technique that enables scaling large contrastive learning batches without proportional memory increase.
- It segments training into sub-batches and accumulates gradients, preserving the benefits of large batch sizes on consumer-grade GPUs.
- Experimental results on dense passage retrieval show accuracy comparable to standard large-batch training, with only about a 20% increase in runtime.
Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup: An Expert Overview
The paper "Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup" addresses a significant challenge within the domain of contrastive learning: memory limitations when training with large batch sizes. The authors present a novel gradient caching technique that enables the scaling of batch size without a proportional increase in memory usage, thereby democratizing access to state-of-the-art models even for those with constrained hardware resources.
Context and Previous Work
Contrastive learning has proven effective for representation learning across numerous natural language processing tasks. A key factor in its success is the use of large batch sizes, which provide a substantial number of in-batch negative samples and thus improve the quality of learned representations. However, the memory needed to encode and backpropagate through an entire large batch at once limits the applicability of this approach, particularly in settings where cutting-edge hardware is unavailable.
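To make the role of in-batch negatives concrete, here is a minimal PyTorch-style sketch of an InfoNCE-style loss in which every other example in the batch serves as a negative; the function name, shapes, and temperature are illustrative choices rather than details from the paper.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(queries: torch.Tensor,
                              keys: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE-style loss: the positive for query i is key i, and every
    other key in the batch acts as a negative."""
    # Similarity of every query against every key: (batch_size, batch_size).
    logits = queries @ keys.t() / temperature
    # Matching pairs sit on the diagonal, so the target for row i is i.
    targets = torch.arange(queries.size(0), device=queries.device)
    # A larger batch means more off-diagonal (negative) terms per query.
    return F.cross_entropy(logits, targets)
```

The quality of each update therefore improves with batch size, but so does the memory needed to hold the whole batch's computation graph, which is exactly the tension the paper targets.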
Prior work such as DPR for dense retrieval and SimCLR for visual representations has demonstrated the benefits of large batch sizes, but requires substantial computational resources. Naive gradient accumulation does not help: because the contrastive loss is computed within each small sub-batch, the number of in-batch negatives, and therefore the effective batch size, remains limited by what fits in memory.
Methodology
The authors propose a solution that decouples backpropagation through the contrastive loss from backpropagation through the encoder, allowing the use of large batch sizes without exceeding memory limits. This is accomplished by:
- Gradient Caching Technique: The gradients of the loss with respect to the representations are computed and cached before backpropagating through the encoder. Because the loss depends only on the comparatively small representation vectors, this step needs little memory, and the memory-intensive encoder backward pass can then be split into several smaller passes rather than a single large one.
- Sub-batch Processing: Instead of processing the entire large batch at once, the batch is divided into manageable sub-batches, and the encoder gradients are accumulated over these sub-batches using the cached representation gradients, reproducing a full large-batch update without the corresponding memory overhead (see the sketch after this list).
- Implementation Efficiency: The technique matches the accuracy of large-batch training with only about a 20% increase in runtime, making state-of-the-art results attainable on consumer-grade GPUs.
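The following is a minimal sketch of this two-stage procedure, assuming a single PyTorch encoder and a loss that operates on the concatenated representations; the function and argument names are placeholders, not the authors' implementation or any library's API, and randomness such as dropout (which the paper handles carefully) is ignored here.

```python
import torch

def gradient_cached_step(encoder, optimizer, batch, loss_fn, sub_batch_size):
    """One large-batch contrastive update computed in memory-friendly pieces.

    encoder:  module mapping an input chunk to (chunk_size, dim) representations
    loss_fn:  contrastive loss over the full batch of representations
    batch:    tensor holding the inputs for the whole (large) batch
    """
    sub_batches = batch.split(sub_batch_size)

    # Stage 1: forward pass without building the autograd graph, so memory
    # stays bounded by a single sub-batch.
    with torch.no_grad():
        reps = torch.cat([encoder(sb) for sb in sub_batches])

    # Stage 2: compute the full-batch loss on the detached representations and
    # cache the gradient of the loss with respect to each representation.
    reps.requires_grad_()
    loss = loss_fn(reps)
    loss.backward()
    cached_grads = reps.grad.split(sub_batch_size)

    # Stage 3: re-encode each sub-batch with the graph enabled and push the
    # cached gradients through the encoder; parameter gradients accumulate
    # across sub-batches, reproducing the full large-batch gradient.
    optimizer.zero_grad()
    for sb, g in zip(sub_batches, cached_grads):
        encoder(sb).backward(gradient=g)

    # A single optimizer step, equivalent to one large-batch update.
    optimizer.step()
    return loss.detach()
```

Peak memory is set by the sub-batch size while the optimizer still sees the gradient of the full large-batch loss; the price is running the encoder forward twice, which lines up with the roughly 20% runtime overhead reported above.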
Experimental Results
The experiments conducted using a dense passage retriever demonstrate the technique's effectiveness in maintaining retrieval accuracy. The results show that the gradient cache technique achieves comparable accuracy to the original large-batch DPR method, underscoring the importance of batch size in contrastive learning.
Additionally, the paper reports scalability improvements. Because the representation gradients are cached and the computation graph is rebuilt one sub-batch at a time, peak memory is governed by the sub-batch size rather than the total batch size, so the effective batch can be scaled with essentially constant memory, potentially narrowing the resource gap between academia and industry.
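As a hypothetical illustration of this scaling behaviour, the sketch above can be exercised with a toy encoder and random data; the model, dimensions, and loss below are invented for the example, and only 64 examples are ever encoded with gradients at a time even though the update covers 1024.

```python
import torch
import torch.nn.functional as F
from torch import nn

# Toy stand-ins (hypothetical, for illustration only): a small encoder and
# random inputs. With sub_batch_size=64, the autograd graph never holds more
# than 64 examples, regardless of how large the total batch grows.
encoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 64))
optimizer = torch.optim.AdamW(encoder.parameters(), lr=1e-4)

def toy_full_batch_loss(reps: torch.Tensor) -> torch.Tensor:
    # Degenerate contrastive loss over all representations: each example's
    # "positive" is itself and every other example is a negative. Enough to
    # exercise gradient_cached_step; a real setup would pair queries/passages.
    reps = F.normalize(reps, dim=-1)
    logits = reps @ reps.t() / 0.05
    targets = torch.arange(reps.size(0))
    return F.cross_entropy(logits, targets)

# Stands in for a batch too large to encode with gradients all at once.
big_batch = torch.randn(1024, 32)
loss = gradient_cached_step(encoder, optimizer, big_batch, toy_full_batch_loss,
                            sub_batch_size=64)
```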
Implications and Future Directions
The introduction of gradient caching significantly reduces the dependency on high-end hardware for contrastive learning and opens up avenues for researchers with limited resources to contribute to the field effectively. This has broader implications for the dissemination of advanced machine learning techniques, enabling a more inclusive research environment.
In terms of future developments, separating the loss computation from the memory-heavy encoder backward pass prompts further exploration of other areas of AI where similar memory constraints exist. This could lead to new approaches for optimizing memory usage in various neural network architectures, potentially advancing fields such as computer vision and unsupervised learning.
Conclusion
This paper presents a practical and innovative approach to large batch size contrastive learning in memory-constrained environments. By effectively decoupling the gradient computation process, the authors provide a method that retains the advantages of large batches without substantial hardware investment. The implications are significant, offering both theoretical and applied contributions to the field of AI and machine learning.