Optimizing the Deployment of Tiny Transformers on Low-Power MCUs (2404.02945v1)
Abstract: Transformer networks are rapidly becoming SotA in many fields, such as NLP and CV. Similarly to CNNs, there is a strong push for deploying Transformer models at the extreme edge, ultimately fitting the tiny power budget and memory footprint of MCUs. However, early approaches in this direction are mostly ad-hoc, platform-specific, and model-specific. This work aims to enable and optimize the flexible, multi-platform deployment of encoder Tiny Transformers on commercial MCUs. We propose a complete framework to perform end-to-end deployment of Transformer models onto single- and multi-core MCUs. Our framework provides an optimized library of kernels that maximizes data reuse and avoids unnecessary data marshaling operations in the crucial attention block. A novel multi-head self-attention (MHSA) inference schedule, named Fused-Weight Self-Attention, is introduced, fusing the linear projection weights offline to further reduce the number of operations and parameters. Furthermore, to mitigate the memory peak reached during the computation of the attention map, we present a Depth-First Tiling scheme for MHSA. We evaluate our framework on three different MCU classes exploiting the ARM and RISC-V ISAs, namely the STM32H7, the STM32L4, and GAP9 (RV32IMC-XpulpV2). We reach an average of 4.79x and 2.0x lower latency compared to the SotA libraries CMSIS-NN (ARM) and PULP-NN (RISC-V), respectively. Moreover, we show that our MHSA depth-first tiling scheme reduces the memory peak by up to 6.19x, while the fused-weight attention reduces the runtime by 1.53x and the number of parameters by 25%. We report significant improvements across several Tiny Transformers: for instance, when executing a Transformer block for radar-based hand-gesture recognition on GAP9, we achieve a latency of 0.14 ms and an energy consumption of 4.92 µJ, 2.32x lower than the SotA PULP-NN library on the same platform.
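To make the two MHSA optimizations described in the abstract concrete, the sketch below shows (i) the algebraic identity behind fused-weight attention, (X W_q)(X W_k)^T = X (W_q W_k^T) X^T, which allows the query and key projection weights to be fused offline into a single matrix, and (ii) a generic row-tiled attention loop that never materializes the full S x S attention map at once. This is a minimal floating-point NumPy illustration, not the paper's quantized MCU kernels; the tensor sizes, variable names, and tile granularity are assumptions chosen for clarity, and the row-tiled loop stands in for the paper's Depth-First Tiling schedule, whose exact tiling strategy is not detailed in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)
S, E = 16, 64                      # sequence length and embedding size (illustrative)
X = rng.standard_normal((S, E))    # input token embeddings
Wq, Wk, Wv = (rng.standard_normal((E, E)) for _ in range(3))

# --- Fused-weight self-attention: since (X Wq)(X Wk)^T = X (Wq Wk^T) X^T,
# the product Wq @ Wk.T can be computed once, offline, replacing two
# projection matrices (and two runtime weight matmuls) with one.
scores_standard = (X @ Wq) @ (X @ Wk).T / np.sqrt(E)
W_qk = Wq @ Wk.T                   # fused offline
scores_fused = (X @ W_qk) @ X.T / np.sqrt(E)
assert np.allclose(scores_standard, scores_fused)

# --- Depth-first (row-tiled) attention: process a few query rows at a time,
# so only a (tile x S) slice of the S x S attention map is ever live in memory.
def tiled_attention(Q, K, V, tile=4):
    S, d = Q.shape
    out = np.empty((S, V.shape[1]))
    for i in range(0, S, tile):
        s = Q[i:i + tile] @ K.T / np.sqrt(d)           # slice of the score map
        s = np.exp(s - s.max(axis=1, keepdims=True))   # numerically stable softmax
        out[i:i + tile] = (s / s.sum(axis=1, keepdims=True)) @ V
    return out

Q, K, V = X @ Wq, X @ Wk, X @ Wv
full = tiled_attention(Q, K, V, tile=S)                # materializes the whole map
assert np.allclose(tiled_attention(Q, K, V, tile=4), full)
```

Note that with the four projection matrices of a standard MHSA block (W_q, W_k, W_v, W_o), fusing W_q and W_k into a single W_qk removes one of the four, which is consistent with the roughly 25% parameter reduction reported in the abstract.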
- C. R. Banbury, V. J. Reddi, M. Lam, W. Fu, A. Fazel, J. Holleman, X. Huang, R. Hurtado, D. Kanter, A. Lokhmotov et al., “Benchmarking TinyML Systems: Challenges and Direction,” arXiv preprint arXiv:2003.04821, 2020.
- J. Lin, W.-M. Chen, Y. Lin, C. Gan, S. Han et al., “MCUNet: Tiny deep learning on IoT devices,” Advances in Neural Information Processing Systems, vol. 33, pp. 11711–11722, 2020.
- S. Jain, A. Gural, M. Wu, and C. Dick, “Trained quantization thresholds for accurate and efficient fixed-point inference of deep neural networks,” Proceedings of Machine Learning and Systems, vol. 2, pp. 112–128, 2020.
- C. Gong, Y. Chen, Y. Lu, T. Li, C. Hao, and D. Chen, “VecQ: Minimal loss DNN model compression with vectorized weight quantization,” IEEE Transactions on Computers, vol. 70, no. 5, pp. 696–710, 2021.
- S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” in 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, 2016.
- K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
- A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
- T. Lin, Y. Wang, X. Liu, and X. Qiu, “A survey of transformers,” AI Open, vol. 3, pp. 111–132, 2022.
- A. Burrello, M. Scherer, M. Zanghieri, F. Conti, and L. Benini, “A microcontroller is all you need: Enabling transformer execution on low-power IoT endnodes,” in 2021 IEEE International Conference on Omni-Layer Intelligent Systems (COINS), 2021, pp. 1–6.
- A. Burrello, A. Garofalo, N. Bruschi, G. Tagliavini, D. Rossi, and F. Conti, “DORY: Automatic end-to-end deployment of real-world DNNs on low-cost IoT MCUs,” IEEE Transactions on Computers, vol. 70, no. 8, pp. 1253–1268, 2021.
- C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of Machine Learning Research, vol. 21, no. 1, Jan. 2020.
- H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
- J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio, Eds. Association for Computational Linguistics, 2019, pp. 4171–4186.
- A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with CLIP latents,” arXiv preprint arXiv:2204.06125, vol. 1, no. 2, p. 3, 2022.
- P. Busia, A. Cossettini, T. M. Ingolfsson, S. Benatti, A. Burrello, V. J. B. Jung, M. Scherer, M. A. Scrugli, A. Bernini, P. Ducouret, P. Ryvlin, P. Meloni, and L. Benini, “Reducing false alarms in wearable seizure detection with EEGformer: A compact transformer model for MCUs,” IEEE Transactions on Biomedical Circuits and Systems, pp. 1–13, 2024.
- D. Hendrycks and K. Gimpel, “Gaussian error linear units (GELUs),” arXiv preprint arXiv:1606.08415, 2016.
- A. Garofalo, M. Rusci, F. Conti, D. Rossi, and L. Benini, “PULP-NN: Accelerating quantized neural networks on parallel ultra-low-power RISC-V processors,” Philosophical Transactions of the Royal Society A, vol. 378, no. 2164, p. 20190155, 2020.
- L. Lai, N. Suda, and V. Chandra, “CMSIS-NN: Efficient neural network kernels for ARM Cortex-M CPUs,” arXiv preprint arXiv:1801.06601, 2018.
- A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret, “Transformers are RNNs: Fast autoregressive transformers with linear attention,” in International Conference on Machine Learning. PMLR, 2020, pp. 5156–5165.
- K. M. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlós, P. Hawkins, J. Q. Davis, A. Mohiuddin, L. Kaiser, D. B. Belanger, L. J. Colwell, and A. Weller, “Rethinking attention with performers,” in 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
- D. Bolya, C.-Y. Fu, X. Dai, P. Zhang, and J. Hoffman, “Hydra attention: Efficient attention with many heads,” in European Conference on Computer Vision. Springer, 2022, pp. 35–49.
- B. Peng, E. Alcaide, Q. Anthony, A. Albalak, S. Arcadinho et al., “RWKV: Reinventing RNNs for the transformer era,” in Findings of the Association for Computational Linguistics: EMNLP 2023, H. Bouamor, J. Pino, and K. Bali, Eds. Singapore: Association for Computational Linguistics, Dec. 2023, pp. 14048–14077.
- S. Wang, B. Z. Li, M. Khabsa, H. Fang, and H. Ma, “Linformer: Self-attention with linear complexity,” arXiv preprint arXiv:2006.04768, 2020.
- Z. Qin, X. Han, W. Sun, D. Li, L. Kong, N. Barnes, and Y. Zhong, “The devil in linear transformer,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics, 2022, pp. 7025–7041.
- N. Shazeer, “Fast transformer decoding: One write-head is all you need,” arXiv preprint arXiv:1911.02150, 2019.
- J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebron, and S. Sanghai, “GQA: Training generalized multi-query transformer models from multi-head checkpoints,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H. Bouamor, J. Pino, and K. Bali, Eds. Singapore: Association for Computational Linguistics, Dec. 2023, pp. 4895–4901.
- T. Dao, “FlashAttention-2: Faster attention with better parallelism and work partitioning,” in The Twelfth International Conference on Learning Representations, 2024.
- W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large language model serving with PagedAttention,” in Proceedings of the 29th Symposium on Operating Systems Principles, ser. SOSP ’23. New York, NY, USA: Association for Computing Machinery, 2023, pp. 611–626.
- R. Pope, S. Douglas, A. Chowdhery, J. Devlin, J. Bradbury, J. Heek, K. Xiao, S. Agrawal, and J. Dean, “Efficiently scaling transformer inference,” Proceedings of Machine Learning and Systems, vol. 5, 2023.
- M. Milakov and N. Gimelshein, “Online normalizer calculation for softmax,” arXiv preprint arXiv:1805.02867, 2018.
- L. Liu, Z. Qu, Z. Chen, F. Tu, Y. Ding, and Y. Xie, “Dynamic sparse attention for scalable transformer acceleration,” IEEE Transactions on Computers, vol. 71, no. 12, pp. 3165–3178, 2022.
- T. J. Ham, Y. Lee, S. H. Seo, S. Kim, H. Choi, S. J. Jung, and J. W. Lee, “ELSA: Hardware-software co-design for efficient, lightweight self-attention mechanism in neural networks,” in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), 2021, pp. 692–705.
- S. G. Bhaskaracharya, J. Demouth, and V. Grover, “Automatic Kernel Generation for Volta Tensor Cores,” arXiv preprint arXiv:2006.12645, 2020.
- C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh, “Basic linear algebra subprograms for Fortran usage,” ACM Trans. Math. Softw., vol. 5, no. 3, pp. 308–323, Sep. 1979.
- M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: A system for large-scale machine learning,” in Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI’16. USA: USENIX Association, 2016, pp. 265–283.
- A. Ivanov, N. Dryden, T. Ben-Nun, S. Li, and T. Hoefler, “Data movement is all you need: A case study on optimizing transformers,” Proceedings of Machine Learning and Systems, vol. 3, pp. 711–732, 2021.
- W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single shot MultiBox detector,” in Lecture Notes in Computer Science. Springer International Publishing, 2016, pp. 21–37.
- L. Mei, P. Houshmand, V. Jain, S. Giraldo, and M. Verhelst, “ZigZag: Enlarging joint architecture-mapping design space exploration for DNN accelerators,” IEEE Transactions on Computers, vol. 70, no. 8, pp. 1160–1174, 2021.
- L. Mei, K. Goetschalckx, A. Symons, and M. Verhelst, “DeFiNES: Enabling fast exploration of the depth-first scheduling space for DNN accelerators through analytical modeling,” in 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2023, pp. 570–583.
- M. Alwani, H. Chen, M. Ferdman, and P. Milder, “Fused-layer CNN accelerators,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016, pp. 1–12.
- E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “OPTQ: Accurate quantization for generative pre-trained transformers,” in The Eleventh International Conference on Learning Representations, 2023.
- A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
- P. Busia, M. A. Scrugli, V. J.-B. Jung, L. Benini, and P. Meloni, “A noisy beat is worth 16 words: A tiny transformer for low-power arrhythmia classification on microcontrollers,” arXiv preprint arXiv:2402.10748, 2024.
- S. Kim, A. Gholami, Z. Yao, M. W. Mahoney, and K. Keutzer, “I-BERT: Integer-only BERT quantization,” in International Conference on Machine Learning. PMLR, 2021, pp. 5506–5518.
Authors: Victor J. B. Jung, Alessio Burrello, Moritz Scherer, Francesco Conti, Luca Benini