Parendi: Thousand-Way Parallel RTL Simulation (2403.04714v2)
Abstract: Hardware development critically depends on cycle-accurate RTL simulation. However, as chip complexity increases, conventional single-threaded simulation becomes impractical due to stagnant single-core performance. Parendi is an RTL simulator that addresses this challenge by exploiting the abundant fine-grained parallelism inherent in RTL simulation and efficiently mapping it onto the massively parallel Graphcore IPU (Intelligence Processing Unit) architecture. Parendi scales up to 5888 cores on 4 Graphcore IPU sockets. It allows us to run large RTL designs up to 4$\times$ faster than the most powerful state-of-the-art x64 multicore systems. To achieve this performance, we developed new partitioning and compilation techniques and carefully quantified the synchronization, communication, and computation costs of parallel RTL simulation. The paper comprehensively analyzes these factors and details the strategies that Parendi uses to optimize them.