Stencil-HMLS: A multi-layered approach to the automatic optimisation of stencil codes on FPGA (2310.01914v1)
Abstract: The challenges associated with effectively programming FPGAs have been a major blocker in popularising reconfigurable architectures for HPC workloads. However, new compiler technologies such as MLIR provide capabilities that can potentially extract domain-specific information and drive the automatic structuring of codes for FPGAs. In this paper we explore domain-specific optimisations for stencils, a fundamental access pattern in scientific computing, to obtain high performance on FPGAs via automated code structuring. We propose Stencil-HMLS, a multi-layered approach to the automatic optimisation of stencil codes, and introduce the HLS dialect, which brings FPGA programming into the MLIR ecosystem. Using the PSyclone Fortran DSL, we demonstrate an improvement of 14-100$\times$ with respect to the next best performing state-of-the-art tool. Furthermore, our approach is 14 to 92 times more energy-efficient than the next most energy-efficient approach.