Parendi: Thousand-Way Parallel RTL Simulation (2403.04714v2)
Abstract: Hardware development critically depends on cycle-accurate RTL simulation. However, as chip complexity increases, conventional single-threaded simulation becomes impractical due to stagnant single-core performance. Parendi is an RTL simulator that addresses this challenge by exploiting the abundant fine-grained parallelism inherent in RTL simulation and efficiently mapping it onto the massively parallel Graphcore IPU (Intelligence Processing Unit) architecture. Parendi scales up to 5888 cores on 4 Graphcore IPU sockets and runs large RTL designs up to 4$\times$ faster than the most powerful state-of-the-art x64 multicore systems. To achieve this performance, we developed new partitioning and compilation techniques and carefully quantified the synchronization, communication, and computation costs of parallel RTL simulation. The paper analyzes these three costs in detail and describes the strategies that Parendi uses to optimize them.
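The three costs named in the abstract can be illustrated with a simple bulk-synchronous cost model. The sketch below is an assumption made for illustration, not Parendi's actual model: the function names, the additive form, and all numbers are hypothetical. It treats one simulated RTL cycle as a compute phase bounded by the slowest core, followed by a communication phase and a barrier.

```python
# Hypothetical bulk-synchronous cost model for one simulated RTL cycle.
# All names and numbers are illustrative assumptions, not taken from the paper.

def bsp_cycle_time(compute_costs, comm_cost, sync_cost):
    """Return the estimated time of one simulated cycle.

    compute_costs: per-core compute time for the cycle (same unit as the others)
    comm_cost:     time spent exchanging values between cores
    sync_cost:     time spent in the barrier that ends the cycle
    """
    # The compute phase finishes only when the slowest core is done.
    return max(compute_costs) + comm_cost + sync_cost


def speedup(single_core_time, compute_costs, comm_cost, sync_cost):
    """Return the speedup of the parallel estimate over a single-core time."""
    return single_core_time / bsp_cycle_time(compute_costs, comm_cost, sync_cost)


if __name__ == "__main__":
    # Three cores with uneven load: the 5.0 straggler dominates the compute phase.
    print(bsp_cycle_time([3.0, 5.0, 4.0], comm_cost=1.0, sync_cost=0.5))
    # Perfectly balanced load of the same total work shortens the cycle.
    print(bsp_cycle_time([4.0, 4.0, 4.0], comm_cost=1.0, sync_cost=0.5))
```

Under this model, adding cores shrinks only the `max(compute_costs)` term, so communication and synchronization set a floor on the cycle time. That is one way to see why all three costs need to be quantified together.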