
DCRA: A Distributed Chiplet-based Reconfigurable Architecture for Irregular Applications (2311.15443v2)

Published 26 Nov 2023 in cs.AR and cs.DC

Abstract: In recent years, the growing demand to process large graphs and sparse datasets has led to increased research efforts to develop hardware- and software-based architectural solutions to accelerate them. While some of these approaches achieve scalable parallelization with up to thousands of cores, industry adoption of these proposals has remained slow. To help close this gap, we identified a set of questions and considerations that current research has not examined in depth. Starting from a tile-based architecture, we put forward a Distributed Chiplet-based Reconfigurable Architecture (DCRA) for irregular applications that carefully considers the fabrication constraints that made prior work either hard or costly to implement, or too rigid to be applied. We identify and study pre-silicon, package-time, and compile-time configurations that help optimize DCRA for different deployments and target metrics. To enable that, we propose a practical path for manufacturing chip packages by composing variable numbers of DCRA and memory dies, with a software-configurable torus network to connect them. We evaluate six applications and four datasets, with several configurations and memory technologies, to provide a detailed analysis of the performance, power, and cost of DCRA as a compute node for scale-out sparse data processing. Finally, we present our findings and discuss how DCRA's framework for design exploration can help guide architects to build scalable and cost-efficient systems for irregular applications.
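For readers unfamiliar with the torus topology mentioned in the abstract, the following is a minimal sketch of how neighbors and hop counts work on a generic 2D torus. It is not taken from the DCRA paper or its simulator: the function names `torus_neighbors` and `torus_hops`, the coordinate scheme, and the 4x4 grid size are all assumptions chosen for illustration.

```python
def torus_neighbors(x, y, width, height):
    """Return the four neighbors of node (x, y) on a width x height 2D torus.

    Hypothetical helper for illustration; edges wrap around, so every
    node has exactly four neighbors regardless of its position.
    """
    return [
        ((x + 1) % width, y),   # east (wraps to column 0 at the right edge)
        ((x - 1) % width, y),   # west (wraps to the last column at the left edge)
        (x, (y + 1) % height),  # south
        (x, (y - 1) % height),  # north
    ]


def torus_hops(a, b, width, height):
    """Minimum hop count between nodes a and b on a 2D torus.

    In each dimension a message may travel directly or take the
    wraparound link, whichever is shorter.
    """
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    return min(dx, width - dx) + min(dy, height - dy)


if __name__ == "__main__":
    # Opposite corners of a 4x4 grid: 6 hops on a mesh, but the
    # wraparound links of a torus reduce it to 2.
    print(torus_hops((0, 0), (3, 3), 4, 4))
    print(torus_neighbors(0, 0, 4, 4))
```

The wraparound links are what distinguish a torus from a plain mesh: they shorten worst-case distances and give every node the same number of neighbors, which is one common reason the topology is used to connect tiles or dies.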
