Parendi: Thousand-Way Parallel RTL Simulation (2403.04714v2)

Published 7 Mar 2024 in cs.DC and cs.AR

Abstract: Hardware development critically depends on cycle-accurate RTL simulation. However, as chip complexity increases, conventional single-threaded simulation becomes impractical due to stagnant single-core performance. Parendi is an RTL simulator that addresses this challenge by exploiting the abundant fine-grained parallelism inherent in RTL simulation and efficiently mapping it onto the massively parallel Graphcore IPU (Intelligence Processing Unit) architecture. Parendi scales up to 5888 cores on 4 Graphcore IPU sockets. It allows us to run large RTL designs up to 4× faster than the most powerful state-of-the-art x64 multicore systems. To achieve this performance, we developed new partitioning and compilation techniques and carefully quantified the synchronization, communication, and computation costs of parallel RTL simulation. The paper comprehensively analyzes these factors and details the strategies that Parendi uses to optimize them.
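
As a rough illustration of why computation, communication, and synchronization costs all limit parallel speedup, the toy model below estimates the time of one simulated cycle. It is a sketch under stated assumptions, not Parendi's partitioner or cost model: the function names, the greedy longest-processing-time assignment, and the fixed per-cycle communication and synchronization costs are all hypothetical choices made here for illustration.

```python
import heapq


def lpt_partition(task_costs, num_cores):
    """Assign task costs to cores greedily, largest task first (assumed heuristic)."""
    heap = [(0, core) for core in range(num_cores)]  # (current load, core id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_cores)]
    for cost in sorted(task_costs, reverse=True):
        load, core = heapq.heappop(heap)  # least-loaded core so far
        assignment[core].append(cost)
        heapq.heappush(heap, (load + cost, core))
    return assignment


def cycle_time(assignment, comm_cost, sync_cost):
    """Toy per-cycle cost: the slowest core plus fixed communication and sync costs."""
    compute = max((sum(tasks) for tasks in assignment), default=0)
    return compute + comm_cost + sync_cost


def speedup(task_costs, num_cores, comm_cost, sync_cost):
    """Ratio of serial time (sum of all task costs) to the modeled parallel cycle time."""
    serial = sum(task_costs)
    parallel = cycle_time(lpt_partition(task_costs, num_cores), comm_cost, sync_cost)
    return serial / parallel


if __name__ == "__main__":
    tasks = [4, 3, 3, 2, 2, 2]  # hypothetical per-task compute costs
    print(speedup(tasks, num_cores=2, comm_cost=1, sync_cost=1))
```

In this model, adding cores shrinks only the compute term. Once each core holds very little work, the fixed communication and synchronization terms dominate the cycle time, which is the kind of trade-off the abstract says the paper quantifies.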
