Compressing DMA Engine: Leveraging Activation Sparsity for Training Deep Neural Networks

Published 3 May 2017 in cs.LG and cs.AR | (1705.01626v1)

Abstract: Popular deep learning frameworks require users to fine-tune their memory usage so that the training data of a deep neural network (DNN) fits within the GPU physical memory. Prior work tries to address this restriction by virtualizing the memory usage of DNNs, enabling both CPU and GPU memory to be utilized for memory allocations. Despite its merits, virtualizing memory can incur significant performance overheads when the time needed to copy data back and forth from CPU memory is higher than the latency to perform the computations required for DNN forward and backward propagation. We introduce a high-performance virtualization strategy based on a "compressing DMA engine" (cDMA) that drastically reduces the size of the data structures that are targeted for CPU-side allocations. The cDMA engine offers an average 2.6x (maximum 13.8x) compression ratio by exploiting the sparsity inherent in offloaded data, improving the performance of virtualized DNNs by an average 32% (maximum 61%).


Authors (5)

Citations (169)


Summary

The paper introduces a cDMA engine that compresses activation maps using a zero-value compression algorithm to reduce data transfer overhead.
It achieves compression ratios up to 13.8× and an average performance improvement of 32% across varied DNN architectures.
The solution integrates within existing GPU memory controllers, offering scalable virtualization with minimal hardware modifications.

Analysis of Virtualization Strategy for Training Deep Neural Networks

The paper "Compressing DMA Engine: Leveraging Activation Sparsity for Training Deep Neural Networks" presents a high-performance method to address memory constraints during the training of deep neural networks (DNNs) on GPUs. The authors propose a compression strategy focused on exploiting the sparsity inherent in activation maps, significantly alleviating the bottlenecks associated with data movement between CPU and GPU. Their approach leverages a compressing DMA engine (cDMA), which is integrated into the GPU’s memory architecture, enabling a substantial reduction in data transfer volume through zero-value compression (ZVC) of activation maps.

Key Contributions

Virtualized Memory Usage: The paper discusses the challenges of memory limitations within GPU architectures. Previous solutions have sought to virtualize DNN memory usage by allowing CPU memory to supplement GPU memory. However, such strategies introduce performance penalties when data transfer latency surpasses computation latency. This work introduces a cDMA engine that compresses activation data, thereby minimizing the size and performance overhead of data transfers.
Compression Algorithm: The proposed ZVC algorithm capitalizes on activation sparsity, converting data structures into compact forms for efficient transfer over PCIe. ZVC achieves compression ratios of up to 13.8× across varied DNN architectures and operations, with an average across networks of 2.6×.
Architectural Integration: The authors present a complete cDMA engine design that sits inside the existing GPU memory controllers, which keeps the added design overhead small. The placement matters because keeping the PCIe link saturated with compressed data requires reading uncompressed data from DRAM at roughly the compression ratio times the PCIe bandwidth. Compressing at the memory controllers lets the DRAM fetch rate keep up with the link, so the benefit scales with the achieved compression ratio.
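The idea behind zero-value compression can be sketched in software. The following is a minimal illustration, not the paper's hardware design: it assumes 32-bit floating-point activations and one 32-bit mask per group of 32 values, and the function names `zvc_compress` and `zvc_decompress` are hypothetical. Each group is stored as a bitmask marking the non-zero positions, followed by only the non-zero values.

```python
import struct

GROUP = 32  # assumption: one 32-bit mask covers 32 consecutive values


def zvc_compress(values):
    """Encode floats as [mask][non-zero values] per group of 32."""
    out = bytearray()
    for start in range(0, len(values), GROUP):
        chunk = values[start:start + GROUP]
        mask = 0
        nonzero = []
        for i, v in enumerate(chunk):
            if v != 0.0:
                mask |= 1 << i          # bit i set: position i is non-zero
                nonzero.append(v)
        out += struct.pack("<I", mask)  # 4-byte mask
        out += struct.pack(f"<{len(nonzero)}f", *nonzero)  # 4 bytes each
    return bytes(out)


def zvc_decompress(blob, count):
    """Rebuild `count` floats, re-inserting zeros where mask bits are clear."""
    values = []
    offset = 0
    while len(values) < count:
        (mask,) = struct.unpack_from("<I", blob, offset)
        offset += 4
        for i in range(min(GROUP, count - len(values))):
            if mask & (1 << i):
                (v,) = struct.unpack_from("<f", blob, offset)
                offset += 4
                values.append(v)
            else:
                values.append(0.0)
    return values


# Hypothetical, mostly-zero activation vector (64 values, 2 non-zero).
activations = [0.0, 1.5, 0.0, 0.0, 2.25] + [0.0] * 59
blob = zvc_compress(activations)
assert zvc_decompress(blob, len(activations)) == activations
print(len(activations) * 4, "->", len(blob), "bytes")
```

The encoding is lossless, and its fixed cost is one mask bit per value, so a fully dense group grows slightly while a fully zero group shrinks to the mask alone.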

Detailed Findings

The study identifies significant sparsity in DNN activation maps, especially after ReLU operations, with activation sparsity averaging 62% during training across multiple networks. These measurements underpin the ZVC strategy: the more zeros an activation map contains, the less data has to be offloaded.
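The link between sparsity and compression ratio can be made concrete with a short calculation. This is a sketch under stated assumptions rather than the paper's own model: it assumes 32-bit values and one mask bit per value, and the helper names are hypothetical. With sparsity s, each value costs 1 mask bit plus 32 bits when it is non-zero, giving a ratio of 32 / (1 + 32(1 - s)).

```python
def relu(xs):
    """ReLU clamps negative inputs to zero, which is where the zeros come from."""
    return [x if x > 0.0 else 0.0 for x in xs]


def sparsity(xs):
    """Fraction of values that are exactly zero."""
    return sum(1 for x in xs if x == 0.0) / len(xs)


def zvc_ratio(s, bits_per_value=32, mask_bits=1):
    """Idealized ZVC ratio at sparsity s (assumed encoding, see lead-in)."""
    return bits_per_value / (mask_bits + bits_per_value * (1.0 - s))


pre_activation = [-1.0, 2.0, -3.0, 0.5]   # hypothetical layer outputs
s = sparsity(relu(pre_activation))
print(f"sparsity={s:.2f}, ratio={zvc_ratio(s):.2f}x")
print(f"ratio at 62% sparsity: {zvc_ratio(0.62):.2f}x")
```

Under these assumptions the ratio is capped at 32× for an all-zero map and falls just below 1× for a fully dense one. The value at 62% sparsity need not match the reported 2.6× average, because an average of per-layer ratios is not the same as the ratio at the average sparsity.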

The experimental results show an average performance improvement of 32% (maximum 61%) for virtualized DNN training. The baseline is a vDNN implementation that incurs up to a 52% performance penalty because of constrained PCIe bandwidth.
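Why compression recovers performance can be seen in a simple overlap model. The sketch below is an illustrative assumption, not the paper's evaluation methodology: it supposes that a layer's offload is fully overlapped with its computation, so the layer takes as long as the slower of the two, and all numbers are hypothetical.

```python
def layer_time(compute_s, offload_bytes, pcie_bytes_per_s, compression_ratio=1.0):
    """Time for one layer when the offload overlaps with computation.

    The layer stalls only if the (compressed) transfer outlasts the compute.
    """
    transfer_s = offload_bytes / (compression_ratio * pcie_bytes_per_s)
    return max(compute_s, transfer_s)


# Hypothetical layer: 10 ms of compute, 256 MB of activations, 12.8 GB/s link.
compute_s = 0.010
offload_bytes = 256e6
pcie = 12.8e9

uncompressed = layer_time(compute_s, offload_bytes, pcie)
compressed = layer_time(compute_s, offload_bytes, pcie, compression_ratio=2.6)
print(f"uncompressed: {uncompressed * 1e3:.1f} ms, compressed: {compressed * 1e3:.1f} ms")
```

In this model, compression helps only on layers where the transfer is the bottleneck. Once the compressed transfer fits under the compute time, further compression gives no additional speedup for that layer.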

In addition to the performance benefits, the design is inexpensive to implement, requiring only minor modifications to existing GPU architectures and memory management systems. Because ZVC is lossless, it does not affect the training accuracy or convergence properties of the DNN.

Implications and Future Directions

The cDMA engine improves memory management for GPU-accelerated DNN training by reducing the cost of offloading activation maps to CPU memory. It applies across different network architectures without demanding substantial hardware changes.

Future advancements in CPU-GPU interconnects, such as NVIDIA’s NVLink, could further amplify the impact of this compression strategy, although these improvements will not negate the need for efficient data management in multi-GPU nodes sharing communication resources.

The paper suggests extending the cDMA engine to compress data before it is stored in GPU memory, which could reduce the memory footprint at multiple processing stages. This direction could also reduce energy consumption as DNNs grow more complex and resource-intensive.

In summary, the paper demonstrates a pragmatic approach to mitigating GPU memory challenges through effective use of inherent network characteristics, bridging a critical gap in contemporary deep learning infrastructure.
