RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing (1912.12953v1)

Published 30 Dec 2019 in cs.DC and cs.AR

Abstract: Personalized recommendation systems leverage deep learning models and account for the majority of data center AI cycles. Their performance is dominated by memory-bound sparse embedding operations with unique irregular memory access patterns that pose a fundamental challenge to accelerate. This paper proposes a lightweight, commodity DRAM compliant, near-memory processing solution to accelerate personalized recommendation inference. The in-depth characterization of production-grade recommendation models shows that embedding operations with high model-, operator- and data-level parallelism lead to memory bandwidth saturation, limiting recommendation inference performance. We propose RecNMP which provides a scalable solution to improve system throughput, supporting a broad range of sparse embedding models. RecNMP is specifically tailored to production environments with heavy co-location of operators on a single server. Several hardware/software co-optimization techniques such as memory-side caching, table-aware packet scheduling, and hot entry profiling are studied, resulting in up to 9.8x memory latency speedup over a highly-optimized baseline. Overall, RecNMP offers 4.2x throughput improvement and 45.8% memory energy savings.

Summary

The paper presents RecNMP, a near-memory processing (NMP) architecture embedded within DRAM buffer chips to accelerate personalized recommendation systems by handling bandwidth-intensive sparse embedding operations locally.
RecNMP achieves up to a 9.8× memory latency speedup and a 4.2× throughput improvement over a highly optimized baseline by relieving the memory bandwidth saturation caused by embedding operations.
RecNMP is compliant with commodity DRAM, reduces memory energy by 45.8%, and provides a practical starting point for future NMP-based acceleration of AI models in data centers.

The paper presents RecNMP, a near-memory processing (NMP) design that speeds up personalized recommendation inference. Recommendation systems account for the majority of AI cycles in data centers, and their deep learning models are dominated by memory-bound sparse embedding operations. These operations have irregular memory access patterns that make them hard to accelerate. RecNMP addresses this by placing lightweight processing near memory while remaining compliant with commodity DRAM.

Characterization of Recommendation Models

The paper characterizes production-grade recommendation models and identifies a major bottleneck: memory bandwidth saturation caused by embedding operations. These operations exhibit high model-, operator- and data-level parallelism, but that parallelism saturates memory bandwidth and limits inference performance. At Facebook, recommendation models are reported to consume over 70% of AI cycles. Despite this significance, research on optimizing them remains limited compared to work on CNNs and RNNs.

Recommendation models use both dense (continuous) and sparse (categorical) features. Sparse features are backed by large embedding tables accessed through SparseLengthsSum (SLS), which gathers a small number of rows from a large table and sums them into a single pooled vector. The paper highlights two challenges with SLS operations: irregular table indices make accesses hard to predict, and the tables overwhelm on-chip memory resources, so traditional caching is not effective.
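To make the SLS access pattern concrete, here is a minimal pure-Python sketch of a pooled embedding lookup. It illustrates the operator's semantics only and is not the production implementation; the toy table, the index values, and the function name sparse_lengths_sum are assumptions chosen for the example.

```python
def sparse_lengths_sum(table, indices, lengths):
    """Pooled embedding lookup: for each segment, sum the table rows it names.

    table   : list of rows, each a list of floats (the embedding table)
    indices : flat list of row indices for all segments, back to back
    lengths : number of indices belonging to each segment
    """
    if sum(lengths) != len(indices):
        raise ValueError("lengths must sum to len(indices)")
    dim = len(table[0])
    pooled_rows = []
    pos = 0
    for n in lengths:
        pooled = [0.0] * dim
        # Each index may point anywhere in the table: this is the
        # irregular, hard-to-predict access the text describes.
        for idx in indices[pos:pos + n]:
            row = table[idx]
            for d in range(dim):
                pooled[d] += row[d]
        pooled_rows.append(pooled)
        pos += n
    return pooled_rows


# Toy table (assumption): 6 rows, 2 columns, row r = [r, 10 * r].
table = [[float(r), float(10 * r)] for r in range(6)]
indices = [0, 5, 2, 3, 3]   # two segments, stored back to back
lengths = [2, 3]            # segment 1 has 2 lookups, segment 2 has 3
result = sparse_lengths_sum(table, indices, lengths)
print(result)
```

Each output row depends on a handful of rows scattered across the table, so real tables with millions of rows turn almost every lookup into a separate DRAM access.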

RecNMP Architecture

RecNMP is a near-memory processing architecture embedded within DRAM buffer chips. It executes bandwidth-intensive embedding operations locally and exploits rank-level parallelism, so that pooled results, rather than every looked-up embedding row, cross the off-chip memory channel. This relieves the off-chip bottleneck and yields up to 8× higher memory bandwidth. The design is DDR4-compatible and uses lightweight functional units tailored to SLS-family operators.
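The idea of rank-level parallelism can be sketched in software as follows. This is a toy model under stated assumptions, not the paper's hardware: the modulo address mapping in rank_of, the rank count of 4, and the table contents are all hypothetical. Each simulated rank sums the rows it owns, and the host only combines one partial vector per rank.

```python
def rank_of(row_index, num_ranks):
    # Hypothetical address mapping (assumption): rows interleave across ranks.
    return row_index % num_ranks


def pooled_sum_by_rank(table, indices, num_ranks):
    """Sum the looked-up rows, accumulating a partial sum inside each rank."""
    dim = len(table[0])
    partials = [[0.0] * dim for _ in range(num_ranks)]
    reads_per_rank = [0] * num_ranks
    for idx in indices:
        r = rank_of(idx, num_ranks)
        reads_per_rank[r] += 1
        # Near-memory step: the row is read and accumulated locally.
        for d in range(dim):
            partials[r][d] += table[idx][d]
    # Host step: combine one partial vector per rank.
    total = [sum(p[d] for p in partials) for d in range(dim)]
    return total, reads_per_rank


# Toy table (assumption): 16 rows, row r = [r, 2 * r].
table = [[float(r), float(2 * r)] for r in range(16)]
indices = [1, 4, 6, 9, 11, 12, 14, 15]
total, reads = pooled_sum_by_rank(table, indices, num_ranks=4)
print(total, reads)
```

In this toy run the eight row reads spread evenly over four ranks, and the host receives four partial vectors instead of eight full rows. With larger pooling factors, the gap between rows read inside the ranks and vectors sent to the host grows accordingly.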

NMP instructions compress DDR commands, allowing higher parallelism without exhausting the bandwidth of the command/address (C/A) bus. This compression is key to sustaining the irregular access patterns of sparse embedding lookups. RecNMP's programming model is a heterogeneous computing model, similar to OpenCL, that coordinates the host and the NMP units.
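As a rough illustration of command compression, the sketch below packs several pieces of information into one compact instruction word, including a small bit mask that says which DDR commands (activate, read, precharge) the instruction implies. The field names and bit widths are hypothetical and are not the paper's actual instruction format; the point is only that one short word can stand in for several separately issued commands.

```python
# Hypothetical field layout (assumption): (name, width in bits), most
# significant field first. This is NOT the paper's real encoding.
FIELDS = [
    ("opcode", 4),
    ("ddr_cmds", 3),      # bit mask: which DDR commands this instruction implies
    ("address", 32),
    ("vector_size", 3),
    ("psum_tag", 4),      # which partial sum this lookup accumulates into
]

ACT, RD, PRE = 0b100, 0b010, 0b001  # activate, read, precharge flags


def pack(values):
    """Pack a dict of field values into a single integer word."""
    word = 0
    for name, width in FIELDS:
        v = values[name]
        if not 0 <= v < (1 << width):
            raise ValueError(f"{name}={v} does not fit in {width} bits")
        word = (word << width) | v
    return word


def unpack(word):
    """Recover the field values from a packed word."""
    values = {}
    for name, width in reversed(FIELDS):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values


inst = {"opcode": 1, "ddr_cmds": ACT | RD | PRE,
        "address": 0x1234ABCD, "vector_size": 2, "psum_tag": 5}
word = pack(inst)
total_bits = sum(width for _, width in FIELDS)
print(total_bits, hex(word), unpack(word))
```

A fixed-width word like this is decoded near memory into the individual DDR commands, so the host-facing C/A bus carries one instruction per lookup rather than one transfer per command.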

Performance Benefits and Evaluation

RecNMP achieves up to a 9.8× memory latency speedup over a highly optimized baseline. Memory-side caching contributes to this gain, and table-aware packet scheduling and hot entry profiling add further locality-based optimization. Evaluated with production traces rather than purely random access patterns, RecNMP improves throughput by 4.2× and reduces memory energy by 45.8%.
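One plausible reading of hot entry profiling is sketched below: count how often each table index appears in a batch and mark only the frequently reused entries as worth keeping in a memory-side cache. The threshold, the batch contents, and the function names are assumptions for illustration; the paper's actual profiling mechanism may differ.

```python
from collections import Counter


def profile_hot_entries(batch_indices, threshold):
    """Return the set of indices that occur at least `threshold` times."""
    counts = Counter(batch_indices)
    return {idx for idx, count in counts.items() if count >= threshold}


def tag_lookups(batch_indices, hot):
    """Attach a cacheable flag to each lookup; cold entries bypass the cache."""
    return [(idx, idx in hot) for idx in batch_indices]


# Toy batch of embedding indices (assumption).
batch = [7, 3, 7, 9, 3, 7, 1, 3, 8]
hot = profile_hot_entries(batch, threshold=3)
tagged = tag_lookups(batch, hot)
cacheable = sum(1 for _, is_hot in tagged if is_hot)
print(sorted(hot), cacheable, len(tagged))
```

Here two of the five distinct entries cover six of the nine lookups, so caching only those two captures most of the reuse without letting single-use entries evict them.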

RecNMP's benefits hold for both single and co-located models. Offloading SLS reduces cache interference for co-located non-SLS layers such as fully connected (FC) operations, reducing their latency by up to 30%. Although it adds hardware to the DIMM, RecNMP's area and power overheads remain modest for DRAM components.

Conclusion

RecNMP offers a practical solution to the memory bottlenecks that limit personalized recommendation systems in data centers. The paper lays groundwork for further exploration of NMP-based architectures as AI models grow more complex and deployments demand higher throughput. Future work may refine the co-optimization strategies and further simplify the instruction set to broaden the applicability and efficiency of near-memory architectures in emerging AI applications.