
Libfork: portable continuation-stealing with stackless coroutines (2402.18480v1)

Published 28 Feb 2024 in cs.DC

Abstract: Fully-strict fork-join parallelism is a powerful model for shared-memory programming due to its optimal time scaling and strong bounds on memory scaling. The latter is rarely achieved due to the difficulty of implementing continuation stealing in traditional High Performance Computing (HPC) languages -- where it is often impossible without modifying the compiler or resorting to non-portable techniques. We demonstrate how stackless coroutines (a new feature in C++20) can enable fully-portable continuation stealing and present libfork a lock-free fine-grained parallelism library, combining coroutines with user-space, geometric segmented-stacks. We show our approach is able to achieve optimal time/memory scaling, both theoretically and empirically, across a variety of benchmarks. Compared to openMP (libomp), libfork is on average 7.2x faster and consumes 10x less memory. Similarly, compared to Intel's TBB, libfork is on average 2.7x faster and consumes 6.2x less memory. Additionally, we introduce non-uniform memory access (NUMA) optimizations for schedulers that demonstrate performance matching busy-waiting schedulers.


Summary

  • The paper introduces Libfork, a portable C++20 library that uses stackless coroutines for continuation-stealing, running on average 7.2× faster than OpenMP (libomp) while using 10× less memory.
  • It combines a lock-free, fine-grained scheduler with user-space, geometrically growing segmented stacks to optimize both computational speed and memory footprint in high performance computing.
  • The study benchmarks Libfork against established libraries, OpenMP (libomp) and Intel’s TBB, demonstrating significant performance and memory improvements.

Exploring Efficient Parallelism with Libfork: Embedding Continuation-Stealing in C++20 Coroutines

Introduction

In the field of High Performance Computing (HPC), efficiently harnessing the power of modern multicore processors is pivotal. A recent advance in this direction is "Libfork", a C++20 library that leverages stackless coroutines to achieve fine-grained parallelism. This article presents a comprehensive overview of Libfork, emphasizing its use of continuation-stealing with lock-free, segmented stacks to optimize both computational speed and memory usage in a portable manner. Because it builds on the coroutines introduced in C++20, Libfork is fully portable: it requires neither compiler modifications nor platform-specific techniques.

Core Concepts and Implementation

The crux of Libfork's approach lies in its use of stackless coroutines for continuation-stealing, a strategy known for its efficiency in exploiting parallelism. The library pairs a lock-free, fine-grained parallelism mechanism with user-space, geometric segmented stacks, aiming for theoretically optimal time and memory scaling across a variety of benchmarks. Most notably, Libfork delivers a significant performance boost and reduced memory consumption compared with well-established libraries such as OpenMP (libomp) and Intel’s Threading Building Blocks (TBB).
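
To make the segmented-stack idea concrete, below is a minimal sketch, not libfork's actual implementation: segments grow geometrically, each new segment doubling in size, so the number of allocations is logarithmic in peak frame usage and total memory stays within a constant factor of the high-water mark. The class name, initial segment size, and growth factor are illustrative choices, and alignment handling is omitted.

    #include <cstddef>
    #include <new>
    #include <vector>

    // Sketch of a geometric segmented stack (illustrative, not libfork's code).
    // Frames are bump-allocated from the newest segment; when it is full, a
    // segment twice as large is allocated, giving O(log n) segments overall.
    class SegmentedStack {
        struct Segment { std::byte* base; std::size_t size; std::size_t used; };
        std::vector<Segment> segments_;
        std::size_t next_size_ = 4096;  // assumed initial segment size

    public:
        void* allocate(std::size_t n) {  // alignment handling omitted for brevity
            if (segments_.empty() || segments_.back().used + n > segments_.back().size) {
                while (next_size_ < n) next_size_ *= 2;
                segments_.push_back(
                    {static_cast<std::byte*>(::operator new(next_size_)), next_size_, 0});
                next_size_ *= 2;  // geometric growth
            }
            Segment& s = segments_.back();
            void* p = s.base + s.used;
            s.used += n;
            return p;
        }

        void deallocate(std::size_t n) {  // frames are freed in LIFO order
            segments_.back().used -= n;
            if (segments_.back().used == 0 && segments_.size() > 1) {
                ::operator delete(segments_.back().base);
                segments_.pop_back();
            }
        }

        ~SegmentedStack() {
            for (Segment& s : segments_) ::operator delete(s.base);
        }
    };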

Coroutines and Parallelism

Libfork builds on C++20 stackless coroutines to support asynchronous execution and cooperative multitasking. The paper spells out the distinction between stackful and stackless coroutines, laying the groundwork for understanding how Libfork's stackless implementation manages resources and execution flow efficiently.
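
To ground the distinction, here is a minimal, self-contained C++20 stackless coroutine, independent of libfork: the compiler reifies the function's locals into a single heap-allocated frame, and each suspension is an ordinary return to the caller, so no dedicated stack is required.

    #include <coroutine>
    #include <exception>
    #include <iostream>

    // A minimal stackless coroutine: its state lives in one compiler-generated
    // frame, and co_yield suspends by simply returning control to the caller.
    struct Counter {
        struct promise_type {
            int current = 0;
            Counter get_return_object() {
                return Counter{std::coroutine_handle<promise_type>::from_promise(*this)};
            }
            std::suspend_always initial_suspend() noexcept { return {}; }
            std::suspend_always final_suspend() noexcept { return {}; }
            std::suspend_always yield_value(int v) { current = v; return {}; }
            void return_void() {}
            void unhandled_exception() { std::terminate(); }
        };
        std::coroutine_handle<promise_type> handle;  // copy/move handling omitted
        ~Counter() { if (handle) handle.destroy(); }
    };

    Counter count_to(int n) {
        for (int i = 0; i < n; ++i) co_yield i;  // suspends without its own stack
    }

    int main() {
        Counter c = count_to(3);
        c.handle.resume();                         // run to the first co_yield
        while (!c.handle.done()) {
            std::cout << c.handle.promise().current << '\n';
            c.handle.resume();                     // resume from the last suspension
        }
    }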

The fork-join model, central to structuring task-based concurrency, is revisited to underscore the efficiency of continuation-stealing over traditional child-stealing strategies. The paper then details the implementation: how tasks map onto coroutines, how task frames are allocated and freed on user-space segmented stacks, and how adaptive scheduling accommodates Non-Uniform Memory Access (NUMA) optimizations. A sketch of this fork-join style follows.
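
The sketch below follows the fork/call/join vocabulary used in the paper and the library's public documentation; the exact spellings (bracket versus parenthesis syntax, scheduler and include names) vary between libfork versions, so treat it as illustrative rather than canonical.

    #include <libfork.hpp>  // include path depends on how libfork is installed

    // Fork-join Fibonacci in libfork's documented style (illustrative).
    inline constexpr auto fib = [](auto fib, int n) -> lf::task<int> {
        if (n < 2) {
            co_return n;
        }
        int a, b;
        co_await lf::fork[&a, fib](n - 1);  // child task; the continuation may be stolen
        co_await lf::call[&b, fib](n - 2);  // ordinary call: no stealable continuation
        co_await lf::join;                  // both a and b are ready past this point
        co_return a + b;
    };

    int main() {
        lf::lazy_pool pool{};  // a work-stealing scheduler from the library
        int result = lf::sync_wait(pool, fib, 30);
    }

A fork makes the child task available for other workers to steal while the parent's continuation proceeds; the join is the synchronization point past which all forked results are guaranteed ready.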

Performance Evaluation

The experimental analysis shows an average speedup of 7.2× over libomp together with a 10× reduction in memory usage, with similarly favorable comparisons against Intel’s TBB (2.7× faster, 6.2× less memory). This efficiency is attributed to the segmented-stack model, which avoids the memory waste of pre-allocated per-task stacks, and to the coroutine-based design, which adds little scheduling overhead.

Implications and Future Directions

The successful demonstration of Libfork's performance and its fully portable model holds significant implications for the future of parallel programming in C++. It not only challenges existing frameworks but also opens avenues for exploring finer-grained parallelism without the constraint of platform dependency or the necessity for compiler modifications.

Future work could explore further optimizations in coroutine management, better integration with hardware-specific features, and the expansion of Libfork's application to broader domains beyond those tested. Additionally, as C++ continues to evolve, there is potential for integrating new language features that could enhance Libfork's performance and usability.

Conclusion

Libfork represents a significant step forward in the pursuit of efficient, portable parallel programming. By leveraging modern C++'s stackless coroutines for continuation-stealing and introducing segmented stacks for resource management, it sets a new benchmark for fine-grained parallelism libraries. Looking ahead, the continual development of Libfork and similar initiatives will be critical in fully exploiting the capabilities of multicore processors in HPC and beyond.