- The paper introduces eGPU, a SIMT soft GPGPU tailored for FPGAs, achieving a record 771 MHz operating frequency.
- It uses FPGA resources such as ALMs, DSP Blocks, and embedded memories in proportions that keep resource usage balanced.
- Benchmarks on FFT and QR decomposition demonstrate its superior performance and flexibility compared to traditional soft processor architectures.
eGPU: A 750 MHz Class Soft GPGPU for FPGA
Introduction
The paper "eGPU: A 750 MHz Class Soft GPGPU for FPGA" (2307.08378) presents the design and implementation of eGPU, a SIMT soft processor tailored for FPGAs. Distinguishing itself from traditional soft processors, eGPU targets high-performance applications by leveraging the inherent resources of FPGAs, such as soft logic, embedded memories, and DSP Blocks. The architecture is aimed at enabling complex FPGA system designs, such as linear solvers common in wireless systems, via push-button software compilation.
Architecture Overview
The eGPU is organized as a single streaming multiprocessor (SM) that supports 512 threads and contains 16 scalar processors (SPs). It supports both IEEE 754 FP32 and INT32 arithmetic. Notably, the eGPU exceeds 770 MHz in an Intel Agilex device while using 5600 ALMs and 24 DSP Blocks, without requiring any synthesis constraints.
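The figures above imply some simple arithmetic, sketched below. The thread count, SP count, and clock frequency come from the text; the assumption that each SP retires one operation per cycle is ours, made only to give a rough upper bound, and the function names are hypothetical.

```python
THREADS_PER_SM = 512   # from the text
SPS_PER_SM = 16        # from the text
CLOCK_HZ = 771e6       # reported operating frequency


def wavefronts(threads=THREADS_PER_SM, sps=SPS_PER_SM):
    """Number of SP-wide passes needed to issue one instruction to all threads."""
    return -(-threads // sps)  # ceiling division


def peak_ops_per_second(sps=SPS_PER_SM, clock_hz=CLOCK_HZ, ops_per_sp_per_cycle=1):
    """Upper bound on throughput, ASSUMING one operation per SP per cycle."""
    return sps * clock_hz * ops_per_sp_per_cycle


if __name__ == "__main__":
    print("passes per instruction:", wavefronts())
    print(f"peak rate (assumed): {peak_ops_per_second() / 1e9:.3f} Gop/s")
```

Under these assumptions, one instruction applied to all 512 threads takes 32 passes through the 16 SPs.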
Figure 1: Representative SM Architecture.
Core Architectural Features
The eGPU achieves a clock frequency of 771 MHz, surpassing other known soft processors of comparable complexity. This result is accomplished with an unconstrained compile, enabled by careful consideration of the FPGA's critical path constraints, primarily in the DSP Blocks configured in FP32 mode.
Resource Efficiency
Resource balance in the eGPU design mimics the native balance of logic, DSP, and memory in FPGA architectures, allowing multiple eGPU instances to be compactly packed with high efficiency and minimal penalty on speed.
Flexible ISA
The Instruction Set Architecture (ISA) lets each instruction target a chosen subset of the initialized threads. This improves efficiency for operations such as reductions, where the number of useful threads shrinks at every step, without requiring thread divergence. The same flexibility helps work around the bandwidth limits of shared memory built from FPGA embedded memories.
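To illustrate why per-instruction thread targeting helps reductions, here is a minimal Python simulation of a tree reduction. It is a sketch, not the eGPU's actual instruction encoding: the function name, the counting of "thread slots", and the full-width baseline are assumptions introduced for comparison.

```python
def tree_reduce(values):
    """Sum `values` (length must be a power of two) with a SIMT-style tree reduction.

    Returns (total, slots_subset, slots_full), where
      slots_subset counts thread slots issued when each step targets only
                   the threads that have work, and
      slots_full   counts slots for a hypothetical baseline that issues every
                   step to all len(values) threads and masks the idle ones.
    """
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    shared = list(values)      # stands in for shared memory
    slots_subset = 0
    slots_full = 0
    active = n // 2
    while active >= 1:
        # Only threads 0 .. active-1 are targeted by this step.
        for tid in range(active):
            shared[tid] += shared[tid + active]
        slots_subset += active
        slots_full += n
        active //= 2
    return shared[0], slots_subset, slots_full


if __name__ == "__main__":
    total, subset, full = tree_reduce(list(range(512)))
    print(total, subset, full)
```

For 512 inputs the subset-targeted version issues far fewer thread slots than the full-width baseline, which is the saving the paragraph above describes.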
Comparisons to Other Architectures
Compared with other soft GPGPUs such as Guppy, FGPU, and MIAOW, eGPU reaches a higher operating frequency and uses resources more efficiently. Although its pipeline is shallower, it still achieves substantial performance gains over these more deeply pipelined alternatives.
Figure 2: SP Architecture.
Benchmarks
FFT
For the FFT benchmark, the eGPU uses SIMT execution to map butterfly computations across threads. Profiling attributes 12% of cycles to address generation, 13% to butterfly computation, and 75% to shared memory access, which makes memory access the main target for future optimization.
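The mapping of butterflies to threads can be sketched as follows. This is a generic radix-2 decimation-in-time FFT written so that each simulated thread handles one butterfly per stage; it is an illustration under our own assumptions (radix 2, bit-reversed input order, the name `fft_simt`), not the kernel used in the paper. The three phases in the loop body correspond to the three categories in the profile: address generation, shared memory access, and butterfly arithmetic.

```python
import cmath


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of integer i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fft_simt(x):
    """Radix-2 FFT where each simulated thread computes one butterfly per stage."""
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    bits = n.bit_length() - 1
    # Load inputs into "shared memory" in bit-reversed order.
    shared = [complex(x[bit_reverse(i, bits)]) for i in range(n)]
    half = 1
    while half < n:
        for tid in range(n // 2):          # one thread per butterfly
            # Address generation
            group, j = divmod(tid, half)
            i0 = group * 2 * half + j
            i1 = i0 + half
            w = cmath.exp(-2j * cmath.pi * j / (2 * half))
            # Shared memory reads
            a = shared[i0]
            b = shared[i1]
            # Butterfly computation
            t = w * b
            # Shared memory writes (pairs are disjoint within a stage)
            shared[i0] = a + t
            shared[i1] = a - t
        half *= 2
    return shared
```

Each butterfly performs two reads and two writes for a small amount of arithmetic, which is consistent with memory access dominating the cycle count.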
QRD
The QR decomposition benchmark uses the Modified Gram-Schmidt method and targets small matrices. Here, the ability to isolate thread subsets, together with efficient norm computation, significantly improves processing efficiency, yielding better results than hard GPUs on similar workloads.
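For reference, a plain Python version of Modified Gram-Schmidt is sketched below. It is sequential and uses lists rather than threads, so it shows the algorithm only; the comments note where a SIMT implementation would plausibly use a reduction or a thread subset, and those notes are our assumptions rather than details from the paper.

```python
import math


def mgs_qr(a):
    """Modified Gram-Schmidt QR of an m x n matrix (list of rows, full column rank).

    Returns (q, r): q is m x n with orthonormal columns, r is n x n upper triangular.
    """
    m, n = len(a), len(a[0])
    v = [[float(a[i][j]) for i in range(m)] for j in range(n)]  # working columns
    q = [[0.0] * m for _ in range(n)]                           # columns of Q
    r = [[0.0] * n for _ in range(n)]
    for k in range(n):
        # Norm of column k: a sum of squares, i.e. a reduction over m elements.
        r[k][k] = math.sqrt(sum(c * c for c in v[k]))
        q[k] = [c / r[k][k] for c in v[k]]
        # Only the remaining n-k-1 columns are updated: a shrinking subset of work.
        for j in range(k + 1, n):
            r[k][j] = sum(qc * vc for qc, vc in zip(q[k], v[j]))
            v[j] = [vc - r[k][j] * qc for vc, qc in zip(v[j], q[k])]
    q_rows = [[q[j][i] for j in range(n)] for i in range(m)]
    return q_rows, r
```

The amount of work shrinks with every outer iteration, which is why being able to target only the threads that still have columns to update matters for small matrices.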
Implications and Future Developments
The eGPU shows that a soft GPGPU can combine a high clock frequency with efficient resource use inside a complex FPGA system design. It does not aim to replace standard GPGPUs; its strength is the efficient execution of small datasets, which suits applications that demand low latency and high processing efficiency.
Future developments could explore further architectural optimizations, particularly in memory bandwidth enhancement and deeper integration of specialized processing elements.
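As a rough indication of what a memory bandwidth improvement could be worth, the sketch below applies Amdahl's law to the FFT profile reported above (75% of cycles in shared memory access). The speedup factors passed in are hypothetical, and the model assumes the remaining 25% of cycles are unaffected.

```python
def overall_speedup(memory_fraction, memory_speedup):
    """Amdahl's law: overall speedup when only the memory-bound share gets faster."""
    return 1.0 / ((1.0 - memory_fraction) + memory_fraction / memory_speedup)


if __name__ == "__main__":
    for factor in (2, 4, 8):  # hypothetical memory access speedups
        print(factor, round(overall_speedup(0.75, factor), 3))
```

Under this model, the overall gain for that FFT profile cannot exceed 4x no matter how fast memory access becomes, because address generation and butterfly arithmetic still account for a quarter of the original cycles.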
Figure 3: I-WORD.
Conclusion
The paper presents a soft GPGPU that achieves high performance on FPGAs in terms of both operating frequency and resource utilization. Its flexible ISA, which can target thread subsets per instruction, makes it well suited to implementing complex algorithms in FPGA systems, particularly in signal processing applications.