Efficient Hardware Accelerator Based on Medium Granularity Dataflow for SpTRSV (2406.10511v3)

Published 15 Jun 2024 in cs.DC, cs.AR, cs.NA, cs.PF, and math.NA

Abstract: Sparse triangular solve (SpTRSV) is widely used across many domains. Numerous studies have targeted CPUs, GPUs, and dedicated hardware accelerators, whose dataflows can be categorized as coarse or fine granularity. Coarse dataflows offer good spatial locality but suffer from low parallelism, while fine dataflows provide high parallelism but disrupt the spatial structure, leading to an increased node count and poor data reuse. This paper proposes a novel hardware accelerator for SpTRSV and SpTRSV-like DAGs. The accelerator implements a medium-granularity dataflow through hardware-software codesign and achieves both excellent spatial locality and high parallelism. Additionally, a partial-sum caching mechanism is introduced to reduce the blocking frequency of processing elements (PEs), and a reordering algorithm for intra-node edge computation is developed to enhance data reuse. Experimental results on 245 benchmarks with node counts reaching up to 85,392 demonstrate that this work achieves average performance improvements of 7.0$\times$ (up to 27.8$\times$) over CPUs and 5.8$\times$ (up to 98.8$\times$) over GPUs. Compared to the state-of-the-art technique (DPU-v2), this work shows a 2.5$\times$ (up to 5.9$\times$) average performance improvement and a 1.7$\times$ (up to 4.1$\times$) average energy efficiency improvement.
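For context on what SpTRSV computes, the sketch below shows a plain sequential forward substitution on a lower-triangular matrix stored in CSR format; it is not the paper's accelerator design, and the names (row_ptr, col_idx, vals) are illustrative assumptions. The row-to-row dependencies visible here (row i consumes previously solved x[j], j < i) form the DAG whose partitioning into coarse, fine, or medium granularity the abstract discusses.

```c
#include <stdio.h>

/* Minimal sketch (not the paper's accelerator): sequential SpTRSV solving
 * L x = b for lower-triangular L in CSR format. Each row i depends on
 * already-solved entries x[j] with j < i; these dependencies form the DAG
 * that coarse-, fine-, and medium-granularity dataflows partition differently. */
void sptrsv_csr_lower(int n, const int *row_ptr, const int *col_idx,
                      const double *vals, const double *b, double *x)
{
    for (int i = 0; i < n; ++i) {
        double sum = b[i];
        double diag = 1.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            int j = col_idx[k];
            if (j < i)
                sum -= vals[k] * x[j];   /* consume previously solved entries */
            else if (j == i)
                diag = vals[k];          /* diagonal element of row i */
        }
        x[i] = sum / diag;
    }
}

int main(void)
{
    /* 3x3 lower-triangular example:
     * [2 0 0; 1 3 0; 0 4 5] x = [2, 5, 13]  =>  x ~= [1, 1.333, 1.533] */
    int row_ptr[] = {0, 1, 3, 5};
    int col_idx[] = {0, 0, 1, 1, 2};
    double vals[]  = {2.0, 1.0, 3.0, 4.0, 5.0};
    double b[]     = {2.0, 5.0, 13.0};
    double x[3];
    sptrsv_csr_lower(3, row_ptr, col_idx, vals, b, x);
    printf("x = [%g, %g, %g]\n", x[0], x[1], x[2]);
    return 0;
}
```

In this serial form each row waits on all earlier rows it references, which is exactly the parallelism bottleneck that dataflow partitioning across processing elements is meant to relieve.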
