GPU Graph Processing on CXL-Based Microsecond-Latency External Memory (2312.03113v1)
Abstract: In GPU graph analytics, using external memory such as host DRAM and solid-state drives is a cost-effective way to process large graphs that exceed the capacity of the GPU onboard memory. This paper studies Compute Express Link (CXL) memory as an alternative external memory for GPU graph processing, to determine whether this emerging memory expansion technology enables graph processing that is as fast as with host DRAM. Through analysis and evaluation using FPGA prototypes, we show that representative GPU graph traversal algorithms involving fine-grained random access can tolerate an external memory latency of up to a few microseconds, introduced by the CXL interface as well as by the underlying memory devices. This insight indicates that microsecond-latency flash memory may be used in CXL memory devices to realize even more cost-effective GPU graph processing while still achieving performance close to that of host DRAM.
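One way to build intuition for why microsecond-level latency can be tolerated is Little's Law, which relates throughput, latency, and the number of requests in flight. The following is a minimal sketch of that general relationship, not the paper's own analysis; the function name, link bandwidth, and request size are hypothetical values chosen only for illustration.

```python
def required_outstanding_requests(bandwidth_bytes_per_s, latency_s, request_bytes):
    """Little's Law: requests in flight = request rate * latency."""
    requests_per_s = bandwidth_bytes_per_s / request_bytes
    return requests_per_s * latency_s


# Hypothetical values for illustration only (not taken from the paper).
LINK_BANDWIDTH = 32e9  # bytes per second, assumed external-memory link rate
REQUEST_SIZE = 64      # bytes per fine-grained random read, assumed

if __name__ == "__main__":
    for latency_us in (0.5, 1.0, 2.0, 4.0):
        n = required_outstanding_requests(
            LINK_BANDWIDTH, latency_us * 1e-6, REQUEST_SIZE
        )
        print(f"latency {latency_us:.1f} us -> about {n:.0f} requests in flight")
```

Under these assumed numbers, the required concurrency grows linearly with latency, so a device that can keep enough independent reads in flight may sustain the same throughput at a higher latency.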