
eGPU: A 750 MHz Class Soft GPGPU for FPGA (2307.08378v1)

Published 17 Jul 2023 in cs.AR

Abstract: This paper introduces the eGPU, a SIMT soft processor designed for FPGAs. Soft processors typically achieve modest operating frequencies, a fraction of the headline performance claimed by modern FPGA families, and obtain correspondingly modest performance results. We propose a GPGPU architecture structured specifically to take advantage of both the soft logic and embedded features of the FPGA. We also consider the physical location of the embedded memories and DSP Blocks relative to the location and number of soft logic elements in order to have a design with balanced resources. Our goal is to create a high performance soft processor able to implement complex portions of FPGA system designs, such as the linear solvers commonly used in wireless systems, through push-button compilation from software. The eGPU architecture is a streaming multiprocessor (SM) machine with 512 threads. Each SM contains 16 scalar processors (SP). Both IEEE754 FP32 and INT32 integer arithmetic are supported. We demonstrate a single SM eGPU in an Intel Agilex device, requiring 5600 ALMs and 24 DSP Blocks, which closes timing at over 770 MHz from a completely unconstrained compile. Multiple eGPUs can also be tightly packed together into a single Agilex FPGA logic region, with minimal speed penalty.

Summary

  • The paper introduces eGPU, a SIMT soft GPGPU tailored for FPGAs, which closes timing at 771 MHz in an Intel Agilex device.
  • It uses FPGA resources such as ALMs, DSP Blocks, and embedded memories in proportions that keep resource usage balanced.
  • Benchmarks on FFT and QR decomposition illustrate its performance and flexibility relative to other soft processor architectures.

Introduction

The paper "eGPU: A 750 MHz Class Soft GPGPU for FPGA" (2307.08378) presents the design and implementation of eGPU, a SIMT soft processor tailored for FPGAs. Distinguishing itself from traditional soft processors, eGPU targets high-performance applications by leveraging the inherent resources of FPGAs, such as soft logic, embedded memories, and DSP Blocks. The architecture is aimed at enabling complex FPGA system designs, such as linear solvers common in wireless systems, via push-button software compilation.

Architecture Overview

The eGPU architecture is a streaming multiprocessor (SM) machine with 512 threads, in which each SM contains 16 scalar processors (SPs). It supports both IEEE754 FP32 and INT32 arithmetic. A single-SM eGPU in an Intel Agilex device requires 5600 ALMs and 24 DSP Blocks and closes timing at over 770 MHz from a completely unconstrained compile.

Figure 1: Representative SM Architecture.
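The thread and SP counts quoted in the overview imply some simple ratios, worked through in the minimal sketch below. The assumptions are not taken from the paper: threads are assumed to be divided evenly across SPs, and each SP is assumed to retire one operation per clock cycle. The function names are hypothetical and exist only for illustration.

```python
import math

THREADS = 512        # threads per eGPU, as stated in the summary
SPS_PER_SM = 16      # scalar processors per SM, as stated in the summary
CLOCK_HZ = 771e6     # reported operating frequency

def threads_per_sp(threads=THREADS, sps=SPS_PER_SM):
    """Threads mapped to each SP, assuming an even split (assumption)."""
    return threads // sps

def min_issue_cycles(threads=THREADS, sps=SPS_PER_SM):
    """Lower bound on cycles for one instruction to cover all threads,
    assuming each SP handles one thread per cycle (assumption)."""
    return math.ceil(threads / sps)

def peak_ops_per_second(sps=SPS_PER_SM, clock_hz=CLOCK_HZ):
    """Upper bound on operations per second for one SM,
    assuming one operation per SP per cycle (assumption)."""
    return sps * clock_hz

print(threads_per_sp())        # 32
print(min_issue_cycles())      # 32
print(peak_ops_per_second())   # 12336000000.0
```

These figures are upper bounds under the stated assumptions. Measured throughput depends on memory access, which the FFT profile reported in the benchmarks shows to be the dominant cost.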

Core Architectural Features

Performance

The eGPU achieves a clock frequency of 771 MHz, surpassing other known soft processors of comparable complexity. This result is accomplished with an unconstrained compile, enabled by careful consideration of the FPGA's critical path constraints, primarily in the DSP Blocks configured in FP32 mode.

Resource Efficiency

Resource balance in the eGPU design mimics the native balance of logic, DSP, and memory in FPGA architectures, allowing multiple eGPU instances to be compactly packed with high efficiency and minimal penalty on speed.

Flexible ISA

The Instruction Set Architecture (ISA) lets each instruction target a chosen subset of the initialized threads. Operations such as reductions, where the number of useful threads shrinks at every step, can therefore run without thread divergence. The same flexibility helps in coping with the limited bandwidth of the FPGA's shared memory structures.
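To make the reduction case concrete, the sketch below models a tree reduction in plain Python. It is not eGPU code and does not use the eGPU ISA. It only illustrates, under the assumption that each step is one instruction applied to the first `active` threads, how the active subset halves every step while no thread takes a divergent branch. The name `masked_tree_reduce` is hypothetical.

```python
def masked_tree_reduce(values):
    """Sum `values` (length must be a power of two) with a tree reduction.

    Each pass of the while loop models one instruction issued to threads
    0..active-1 only. Unselected threads do nothing, so there is no
    per-thread branch on the thread index.
    Returns the sum and the list of active-thread counts per step.
    """
    data = list(values)
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")

    steps = []
    active = n // 2
    while active >= 1:
        # Each selected thread adds its partner's value into its own slot.
        for tid in range(active):
            data[tid] += data[tid + active]
        steps.append(active)
        active //= 2
    return data[0], steps

total, steps = masked_tree_reduce(range(16))
print(total, steps)  # 120 [8, 4, 2, 1]
```

With 16 inputs the reduction needs four steps, using 8, 4, 2, and 1 threads in turn. A machine that always issues to every thread would spend the same number of steps but keep idle threads occupied at each one.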

Comparisons to Other Architectures

Compared with soft GPU architectures such as Guppy, FGPU, and MIAOW, the eGPU reaches higher operating frequencies and uses FPGA resources more efficiently. Although the eGPU's pipeline is shallower, it achieves substantial performance gains over deeply pipelined alternatives.

Figure 2: SP Architecture.

Benchmarks

FFT

The FFT benchmark uses SIMT execution to map butterfly computations across multiple threads. Profiling attributes 12% of cycles to address generation, 13% to butterfly computation, and 75% to shared memory access, which marks memory access as the main target for future optimization.
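The mapping of butterflies to threads can be sketched as follows. This is a generic radix-2 decimation-in-time FFT written in plain Python, not the paper's implementation. The loop over `tid` stands in for the threads that would run in parallel, and the comments mark which lines correspond to address generation and which to the butterfly itself. All function names are hypothetical.

```python
import cmath

def bit_reverse_permute(x):
    """Reorder the input so an in-place decimation-in-time FFT can run."""
    n = len(x)
    bits = n.bit_length() - 1
    out = [0j] * n
    for i in range(n):
        rev = int(format(i, f"0{bits}b")[::-1], 2) if bits else 0
        out[rev] = complex(x[i])
    return out

def fft_simt_style(x):
    """Radix-2 FFT where every stage is n/2 independent butterflies."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    data = bit_reverse_permute(x)

    half = 1
    while half < n:
        # One stage: each "thread" tid owns exactly one butterfly, and no
        # two butterflies in a stage touch the same pair of elements.
        for tid in range(n // 2):
            # Address generation: locate the pair and its twiddle index.
            group, j = divmod(tid, half)
            top = group * 2 * half + j
            bot = top + half
            w = cmath.exp(-2j * cmath.pi * j / (2 * half))
            # Butterfly computation.
            a, b = data[top], w * data[bot]
            data[top], data[bot] = a + b, a - b
        half *= 2
    return data

spectrum = fft_simt_style([1, 0, 0, 0, 0, 0, 0, 0])
print([round(abs(v), 6) for v in spectrum])  # eight values of 1.0
```

Each butterfly reads two elements and writes two elements for a handful of arithmetic operations, which is consistent with shared memory access dominating the reported cycle profile.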

QRD

The QR decomposition (QRD) benchmark uses the Modified Gram-Schmidt method on small matrices. The ability to isolate subsets of threads, together with efficient norm computation, improves processing efficiency, and the reported results are better than those of hard GPUs on workloads of similar size.
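For reference, a minimal Modified Gram-Schmidt QR factorization is sketched below in plain Python. It is the textbook algorithm rather than the eGPU kernel, and the name `mgs_qr` is hypothetical. The norm computation and the per-column updates are the parts that, by the description above, would map to reductions and to shrinking subsets of threads.

```python
import math

def mgs_qr(a):
    """Modified Gram-Schmidt QR of an m x n matrix (list of rows).

    Assumes m >= n and full column rank.
    Returns (q, r): q is m x n with orthonormal columns, r is n x n
    upper triangular, and q * r reproduces a.
    """
    m, n = len(a), len(a[0])
    # Work column-wise: v[j] is the j-th column still to be orthogonalized.
    v = [[float(a[i][j]) for i in range(m)] for j in range(n)]
    q_cols = [[0.0] * m for _ in range(n)]
    r = [[0.0] * n for _ in range(n)]

    for k in range(n):
        # Norm computation: a sum of squares, i.e. a reduction.
        norm = math.sqrt(sum(x * x for x in v[k]))
        if norm == 0.0:
            raise ValueError("matrix is rank deficient")
        r[k][k] = norm
        q_cols[k] = [x / norm for x in v[k]]
        # Only columns k+1..n-1 remain active, so the working set shrinks.
        for j in range(k + 1, n):
            r[k][j] = sum(qi * vj for qi, vj in zip(q_cols[k], v[j]))
            v[j] = [vj - r[k][j] * qi for qi, vj in zip(q_cols[k], v[j])]

    # Convert Q from a list of columns back to a list of rows.
    q = [[q_cols[j][i] for j in range(n)] for i in range(m)]
    return q, r

q, r = mgs_qr([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
```

Unlike classical Gram-Schmidt, the modified variant subtracts each projection from the already updated columns, which generally gives better numerical behavior in FP32.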

Implications and Future Developments

The eGPU architecture points to promising advancements in FPGA-based design, offering high clock frequencies and efficient resource use in complex system designs. While it does not aim to replace standard GPGPUs, its efficient execution of small datasets offers substantial applications in areas that demand low latency and high processing efficiency.

Future developments could explore further architectural optimizations, particularly in memory bandwidth enhancement and deeper integration of specialized processing elements.

Figure 3: I-WORD.

Conclusion

This research presents an eGPU architecture that achieves high performance on FPGAs, in terms of both operating frequency and efficient resource utilization. Its flexible ISA and resource-balanced design make it well suited to implementing complex algorithms in FPGA systems, with potential for broader adoption in signal processing applications.
