
Optimizing Foundation Model Inference on a Many-tiny-core Open-source RISC-V Platform

(arXiv:2405.19284)
Published May 29, 2024 in cs.DC, cs.AI, and cs.AR

Abstract

Transformer-based foundation models have become crucial for various domains, most notably natural language processing (NLP) and computer vision (CV). These models are predominantly deployed on high-performance GPUs or hardwired accelerators with highly customized, proprietary instruction sets. Until now, limited attention has been given to RISC-V-based general-purpose platforms. In our work, we present the first end-to-end inference results of transformer models on an open-source many-tiny-core RISC-V platform, implementing distributed Softmax primitives and leveraging ISA extensions for SIMD floating-point operand streaming and instruction repetition, as well as specialized DMA engines to minimize costly main memory accesses and to tolerate their latency. We focus on two foundational transformer topologies: encoder-only and decoder-only models. For encoder-only models, we demonstrate a speedup of up to 12.8x between the most optimized implementation and the baseline version. We reach over 79% FPU utilization and 294 GFLOPS/W, outperforming State-of-the-Art (SoA) accelerators by more than 2x in hardware utilization while achieving comparable throughput per computational unit. For decoder-only topologies, we achieve a 16.1x speedup in the Non-Autoregressive (NAR) mode and up to a 35.6x speedup in the Autoregressive (AR) mode compared to the baseline implementation. Compared to the best SoA dedicated accelerator, we achieve 2.04x higher FPU utilization.

Figure: RISC-V compute cluster architecture with the Xfrep and Xssr ISA extensions.

Overview

  • The paper presents an investigation into the efficient execution of transformer-based Foundation Models (FMs) on an open-source, many-tiny-core RISC-V platform by developing an optimized library and leveraging RISC-V ISA extensions.

  • Key optimizations made include the use of advanced Direct Memory Access (DMA) engines, specialized RISC-V ISA extensions, and precision scalability across different data formats, achieving notable speedups in model inference performance and efficiency.

  • The research highlights the first fully open-source deployment of Vision Transformer (ViT) and GPT models on a RISC-V hardware architecture, with benchmarking results that outperform state-of-the-art platforms in hardware utilization.

Optimizing Foundation Model Inference on a Many-tiny-core Open-source RISC-V Platform

The paper "Optimizing Foundation Model Inference on a Many-tiny-core Open-source RISC-V Platform" presents a comprehensive investigation into the end-to-end inference of transformer models on an open-source, many-tiny-core RISC-V platform. This work is a collaboration between researchers from ETH Zurich, the University of Bologna, and Politecnico di Torino. The study is novel in demonstrating how foundation models (FMs) can be efficiently executed on a RISC-V-based platform, addressing the current gap in RISC-V-based FM deployment.

Key Contributions

  1. Open-source FM Library: The authors have developed an open-source library that supports both encoder-only and decoder-only models, leveraging the hardware capabilities and Instruction Set Architecture (ISA) extensions of the RISC-V multi-core platform. These capabilities include advanced Direct Memory Access (DMA) engines and cluster-to-cluster data transfers, which enhance performance by reducing main memory accesses.

  2. Kernel Optimization with ISA Extensions: The research provides a detailed analysis of the performance boost obtained by employing specialized RISC-V ISA extensions for SIMD floating-point operand streaming (Xssr) and instruction repetition (Xfrep). These optimizations result in speedups of up to 35.6x for decoder models in autoregressive (AR) mode, 16.1x in non-autoregressive (NAR) mode, and 12.8x for encoder-only models like Vision Transformers (ViTs), with the encoder-only kernels reaching over 79% FPU utilization. A plain-C sketch of the inner loop these extensions accelerate follows this list.

  3. Precision Scalability: The study explores performance scalability across different data precisions—FP64, FP32, FP16, and FP8. The optimized library benchmarks show that using lower precision formats significantly improves efficiency, achieving up to 294 GFLOPS/W with FP8 precision.

  4. First Fully Open-source Deployment: This work pioneers the full open-source deployment of ViT and GPT models on an open-source RISC-V hardware architecture, showcasing flexibility and the potential for large-scale adoption.

  5. Comprehensive Benchmarking: Benchmarking results demonstrate that the proposed end-to-end inference engine outperforms state-of-the-art (SoA) platforms in terms of hardware utilization, achieving 2.04x higher FPU utilization than the best SoA dedicated accelerator, with a minimum speedup of 1.81x compared to the best competitor.
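
To make the role of the ISA extensions concrete, the following plain-C rendering of the fused multiply-add loop that dominates GEMM and attention kernels shows what Xssr and Xfrep remove from the instruction stream. This is a minimal illustrative sketch only: the function name and signature are not from the paper, and the real kernels target the platform's runtime rather than portable C.

```c
#include <stddef.h>

/* Plain-C rendering of the FMA inner loop that dominates GEMM and attention
 * kernels. On the tiny cores targeted here, the Xssr extension maps the two
 * operand streams (a[i], b[i]) onto stream registers, so the explicit loads
 * disappear from the issue stream, and Xfrep repeats the fused multiply-add
 * in hardware, so the loop counter and branch instructions disappear as
 * well, leaving the FPU to issue one useful operation per cycle. */
float dotp_f32(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        /* With Xssr: a[i] and b[i] arrive implicitly via streaming registers.
         * With Xfrep: this single fmadd is repeated n times without any
         * branch or address-update instructions. */
        acc += a[i] * b[i];
    }
    return acc;
}
```

With both extensions active, the core's issue slots are spent almost entirely on floating-point operations, which is what makes the high FPU utilization figures reported above attainable.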

Detailed Analysis

The researchers targeted both encoder-only (e.g., ViT) and decoder-only (e.g., GPT) transformer-based foundation models to validate their framework. The attention layers, whose cost scales quadratically with the input sequence length, were optimized using the FlashAttention-2 algorithm, which computes attention block-wise with an online softmax so that the full attention matrix is never materialized, reducing both latency and main memory accesses.
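
The core idea can be seen in a small, self-contained sketch of the online-softmax formulation that FlashAttention-style kernels build on. This is not the paper's kernel: it handles a single head in FP32, processes one key per step within each block (the real implementation computes whole tiles with GEMMs), and uses illustrative names and layouts.

```c
#include <math.h>
#include <stddef.h>

/* Single-head sketch of online-softmax attention: keys/values are visited in
 * blocks of size Bc, and for each query row we keep a running max m, a
 * running denominator l, and a running weighted sum acc, so the full L x L
 * score matrix is never stored. Layouts: q[L][d], k[L][d], v[L][d],
 * out[L][d], all row-major FP32, with head dimension d <= 128. */
void attention_online_softmax(const float *q, const float *k, const float *v,
                              float *out, size_t L, size_t d, size_t Bc)
{
    float scale = 1.0f / sqrtf((float)d);
    for (size_t i = 0; i < L; ++i) {               /* one query row at a time */
        float m = -INFINITY, l = 0.0f;
        float acc[128] = {0.0f};                   /* running sum over values */
        for (size_t j0 = 0; j0 < L; j0 += Bc) {    /* key/value blocks        */
            size_t j1 = (j0 + Bc < L) ? j0 + Bc : L;
            for (size_t j = j0; j < j1; ++j) {
                float s = 0.0f;                    /* s = (q_i . k_j) * scale */
                for (size_t t = 0; t < d; ++t)
                    s += q[i * d + t] * k[j * d + t];
                s *= scale;
                float m_new = (s > m) ? s : m;
                float corr  = expf(m - m_new);     /* rescale old statistics  */
                float p     = expf(s - m_new);
                l = l * corr + p;
                for (size_t t = 0; t < d; ++t)
                    acc[t] = acc[t] * corr + p * v[j * d + t];
                m = m_new;
            }
        }
        for (size_t t = 0; t < d; ++t)
            out[i * d + t] = acc[t] / l;           /* final normalization     */
    }
}
```

The running rescaling by exp(m - m_new) is what allows the softmax denominator to be accumulated block by block, which is why the attention matrix never needs to leave on-chip memory.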

For encoder-only models, such as different variants of ViTs (Base, Large, Huge), the study achieved significant speedups by spatially tiling the GEMM operations across clusters and employing temporal tiling when needed. Double buffering techniques were also used to hide memory transfer latencies effectively. The hierarchical interconnect of the platform facilitated efficient data transfers at different levels of the memory hierarchy, further improving performance by minimizing costly main memory accesses.
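
A minimal sketch of the double-buffering pattern is shown below; memcpy stands in for the cluster DMA engine, which on the real platform runs asynchronously so that fetching the next tile overlaps with computation on the current one. Buffer sizes, names, and the callback interface are illustrative assumptions, not the paper's API.

```c
#include <string.h>

#define TILE_M 32
#define TILE_N 32

/* Temporal tiling with double buffering: while the cores compute on the tile
 * in buf[cur], the next tile is fetched into buf[1 - cur]. memcpy() stands in
 * for the asynchronous DMA transfer; with a real DMA engine only a completion
 * wait is needed before swapping buffers. */
static float buf[2][TILE_M * TILE_N];   /* two tile buffers in local L1 memory */

void process_tiles(const float *A, int num_tiles,
                   void (*compute_tile)(const float *tile))
{
    int cur = 0;
    memcpy(buf[cur], A, sizeof(buf[cur]));          /* preload the first tile  */
    for (int t = 0; t < num_tiles; ++t) {
        int nxt = 1 - cur;
        if (t + 1 < num_tiles)                      /* start "DMA" of next tile */
            memcpy(buf[nxt], A + (size_t)(t + 1) * TILE_M * TILE_N,
                   sizeof(buf[nxt]));
        compute_tile(buf[cur]);                     /* overlaps with the copy
                                                       when it is a real async
                                                       DMA transfer            */
        cur = nxt;                                  /* swap buffers            */
    }
}
```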

For decoder-only models, the study focused on architectural and kernel optimizations for both non-autoregressive and autoregressive modes. The results showcased substantial performance improvements, particularly in FP8 data precision, demonstrating the efficacy of mixed-precision execution. The usage of cluster-to-cluster data transfers allowed for efficient layer fusion in the MLP and MHA blocks, essential for reducing intermediate memory accesses and enhancing computational efficiency.
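
The benefit of fusion can be illustrated with a single-row MLP sketch in which the hidden activation stays in a local buffer between the two projections instead of making a round trip to main memory. The GELU approximation, shapes, and function names below are illustrative assumptions; the paper's kernels additionally distribute rows across clusters and move tiles with cluster-to-cluster DMA.

```c
#include <math.h>
#include <stddef.h>

static float gelu(float x)              /* tanh approximation of GELU */
{
    return 0.5f * x * (1.0f + tanhf(0.7978845608f * (x + 0.044715f * x * x * x)));
}

/* Fused MLP for one activation row: h = GELU(x * W1 + b1) is kept in the
 * local buffer h_local (d_ff elements) and consumed immediately by the second
 * projection, so the intermediate is never written back to main memory.
 * W1 is [d_model][d_ff] and W2 is [d_ff][d_model], both row-major. */
void mlp_row_fused(const float *x, size_t d_model, size_t d_ff,
                   const float *W1, const float *b1,
                   const float *W2, const float *b2,
                   float *y, float *h_local)
{
    for (size_t j = 0; j < d_ff; ++j) {             /* first projection       */
        float s = b1[j];
        for (size_t i = 0; i < d_model; ++i)
            s += x[i] * W1[i * d_ff + j];
        h_local[j] = gelu(s);                       /* stays in local memory  */
    }
    for (size_t o = 0; o < d_model; ++o) {          /* second projection      */
        float s = b2[o];
        for (size_t j = 0; j < d_ff; ++j)
            s += h_local[j] * W2[j * d_model + o];
        y[o] = s;
    }
}
```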

Implications and Future Directions

The implications of this research are profound, both practically and theoretically. By demonstrating the feasibility of running large-scale transformer models on an open-source RISC-V platform, the study opens avenues for cost-effective and transparent AI deployments. This democratization of AI hardware aligns well with the increasing need for open-source ecosystems in machine learning research and applications.

The strong numerical results, such as achieving 294 GFLOPS/W with FP8 precision and more than doubling the FPU utilization of best-in-class SoA platforms, suggest that future developments could focus on further optimizing the RISC-V ISA for AI workloads. Additionally, expanding the scale of the architecture to multi-chiplet systems offers a promising direction for future research, potentially enabling the handling of even larger models and more complex AI tasks.

In conclusion, this research significantly pushes the boundaries of RISC-V platform capabilities in the context of foundation model inference, setting a new benchmark in open-source hardware and software co-design for AI applications.
