
Libfork: portable continuation-stealing with stackless coroutines

(2402.18480)
Published Feb 28, 2024 in cs.DC

Abstract

Fully-strict fork-join parallelism is a powerful model for shared-memory programming due to its optimal time scaling and strong bounds on memory scaling. The latter is rarely achieved due to the difficulty of implementing continuation stealing in traditional High Performance Computing (HPC) languages, where it is often impossible without modifying the compiler or resorting to non-portable techniques. We demonstrate how stackless coroutines (a new feature in C++20) can enable fully-portable continuation stealing and present libfork, a lock-free fine-grained parallelism library that combines coroutines with user-space, geometric segmented stacks. We show our approach is able to achieve optimal time/memory scaling, both theoretically and empirically, across a variety of benchmarks. Compared to OpenMP (libomp), libfork is on average 7.2x faster and consumes 10x less memory. Similarly, compared to Intel's TBB, libfork is on average 2.7x faster and consumes 6.2x less memory. Additionally, we introduce non-uniform memory access (NUMA) optimizations for schedulers that demonstrate performance matching busy-waiting schedulers.

Overview

  • Libfork is a C++20 library that enhances fine-grained parallelism through stackless coroutines and continuation-stealing with lock-free, segmented stacks, aiming for maximum computational speed and minimal memory usage.

  • The library outperforms established frameworks: on average it is 7.2x faster than OpenMP (libomp) while using 10x less memory, and 2.7x faster than Intel’s TBB while using 6.2x less memory.

  • Libfork builds on stackless coroutines rather than stackful ones, which lets it implement continuation stealing without compiler modifications or platform-specific code.

  • Libfork's demonstrated capabilities suggest a promising future for parallel programming in C++, with potential further optimizations and broader application domains beyond the initial benchmarks.

Exploring Efficient Parallelism with Libfork: Embedding Continuation-Stealing in C++20 Coroutines

Introduction

Harnessing modern multicore processors efficiently is central to High Performance Computing (HPC). Libfork is a C++20 library that uses stackless coroutines to provide fine-grained parallelism. The paper shows how Libfork implements continuation stealing with a lock-free design and segmented stacks, optimizing both computational speed and memory usage. Because it relies only on standard C++20 coroutines, Libfork is fully portable: it requires neither compiler modifications nor platform-specific techniques.

Core Concepts and Implementation

Libfork's central idea is to use stackless coroutines to implement continuation stealing, a strategy that is difficult to implement portably in traditional HPC languages. The library is lock-free and pairs coroutines with user-space, geometric segmented stacks, with the goal of reaching the theoretical optima for time and memory scaling. Across a variety of benchmarks, Libfork runs faster and uses less memory than well-established libraries such as OpenMP (libomp) and Intel’s Threading Building Blocks (TBB).
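
The idea behind a geometric segmented stack can be sketched as follows. This is a minimal illustration and not libfork's implementation: the class name `SegmentedStack`, its member functions, the 64-byte starting capacity, and the doubling factor are all assumptions made for the example. Whenever the top segment cannot hold a request, a new segment with twice the capacity is allocated, so the number of segments grows only logarithmically with the stack's total size.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical sketch of a geometric segmented stack (not libfork's API).
class SegmentedStack {
public:
    explicit SegmentedStack(std::size_t first_capacity = 64)
        : first_capacity_(first_capacity) {
        assert(first_capacity > 0);
    }

    // Bump-allocate `bytes` from the top segment, growing if it does not fit.
    void* allocate(std::size_t bytes) {
        bytes = round_up(bytes);
        if (segments_.empty() || top().used + bytes > top().capacity) {
            grow(bytes);
        }
        Segment& seg = top();
        void* p = seg.data.get() + seg.used;
        seg.used += bytes;
        return p;
    }

    // Release the most recent allocation (strictly last-in, first-out).
    void deallocate(std::size_t bytes) {
        bytes = round_up(bytes);
        assert(!segments_.empty() && top().used >= bytes);
        top().used -= bytes;
        if (top().used == 0) {
            segments_.pop_back();  // hand the empty segment back to the heap
        }
    }

    std::size_t segment_count() const { return segments_.size(); }

    // Total bytes currently reserved across all live segments.
    std::size_t capacity() const {
        std::size_t total = 0;
        for (const Segment& seg : segments_) total += seg.capacity;
        return total;
    }

private:
    struct Segment {
        std::unique_ptr<std::byte[]> data;
        std::size_t capacity;
        std::size_t used;
    };

    // Round every request up so that returned pointers stay aligned.
    static std::size_t round_up(std::size_t bytes) {
        constexpr std::size_t align = alignof(std::max_align_t);
        return (bytes + align - 1) / align * align;
    }

    Segment& top() { return segments_.back(); }

    // Geometric growth: each new segment is twice as large as the previous one.
    void grow(std::size_t bytes) {
        std::size_t cap = segments_.empty() ? first_capacity_ : 2 * top().capacity;
        while (cap < bytes) cap *= 2;
        segments_.push_back(Segment{std::make_unique<std::byte[]>(cap), cap, 0});
    }

    std::size_t first_capacity_;
    std::vector<Segment> segments_;
};
```

Freeing a segment as soon as it empties keeps the sketch short, but it can thrash when a workload repeatedly crosses a segment boundary; a production design would likely cache at least one spare segment.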

Coroutines and Parallelism

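Libfork builds on C++20 stackless coroutines, a language feature for asynchronous programming and cooperative multitasking. The paper explains the distinction between stackful and stackless coroutines: a stackful coroutine owns a full call stack and can suspend from within nested function calls, whereas a stackless coroutine stores only its own frame and can suspend only from its own body. This distinction lays the groundwork for understanding how Libfork's stackless implementation manages resources and execution flow.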

The paper revisits the fork-join model, which is central to structuring task-based concurrency, to show why continuation stealing is more efficient than the traditional child-stealing strategy. It then details the implementation: how tasks are mapped onto coroutines, how tasks are allocated and deallocated on user-space segmented stacks, and how the scheduler adapts to Non-Uniform Memory Access (NUMA) architectures.
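
The difference between the two stealing strategies shows up even with a single worker. The following sketch is a simulation only: it uses no threads and performs no actual stealing, and the function names `peak_child_stealing` and `peak_continuation_stealing` are hypothetical. It models a loop that forks `n` children and reports the largest number of items that sat in the worker's queue at once.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Child stealing: each fork pushes the child task and the parent keeps
// running, so every child is queued before any of them executes.
std::size_t peak_child_stealing(std::size_t n) {
    std::deque<std::function<void()>> queue;
    std::size_t peak = 0;
    for (std::size_t i = 0; i < n; ++i) {
        queue.push_back([] { /* child work would go here */ });
        peak = std::max(peak, queue.size());
    }
    // Join: with no thieves around, the worker drains its own queue.
    while (!queue.empty()) {
        std::function<void()> child = std::move(queue.back());
        queue.pop_back();
        child();
    }
    return peak;
}

// Continuation stealing: each fork pushes the parent's continuation and the
// worker runs the child immediately. If nobody stole the continuation, the
// worker pops it back and resumes the parent.
std::size_t peak_continuation_stealing(std::size_t n) {
    // A continuation is modeled as "the loop index to resume from".
    std::deque<std::size_t> queue;
    std::size_t peak = 0;
    std::size_t i = 0;
    while (i < n) {
        queue.push_back(i + 1);  // make the rest of the loop stealable
        peak = std::max(peak, queue.size());
        /* the child for iteration i would run here, right away */
        assert(!queue.empty());  // no thief exists in this simulation
        i = queue.back();        // resume our own continuation
        queue.pop_back();
    }
    return peak;
}
```

With one worker, the child-stealing queue grows linearly in `n`, while the continuation-stealing queue never holds more than one entry and the execution order matches the serial program. This is the intuition behind the stronger memory bounds the paper associates with continuation stealing.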

Performance Evaluation

The experiments show an average speedup of 7.2× over libomp with a 10× reduction in memory usage, and an average speedup of 2.7× over Intel’s TBB with a 6.2× reduction in memory usage. This efficiency is attributed to the segmented-stack model, which limits memory wastage, and to the coroutine-based design, which adds little overhead.

Implications and Future Directions

The successful demonstration of Libfork's performance and its fully portable model holds significant implications for the future of parallel programming in C++. It not only challenges existing frameworks but also opens avenues for exploring finer-grained parallelism without the constraint of platform dependency or the necessity for compiler modifications.

Future work could explore further optimizations in coroutine management, better integration with hardware-specific features, and the expansion of Libfork's application to broader domains beyond those tested. Additionally, as C++ continues to evolve, there is potential for integrating new language features that could enhance Libfork's performance and usability.

Conclusion

Libfork represents a significant step forward in the pursuit of efficient, portable parallel programming. By leveraging modern C++'s stackless coroutines for continuation-stealing and introducing segmented stacks for resource management, it sets a new benchmark for fine-grained parallelism libraries. Looking ahead, the continual development of Libfork and similar initiatives will be critical in fully exploiting the capabilities of multicore processors in HPC and beyond.
