
Libfork: portable continuation-stealing with stackless coroutines (2402.18480v1)

Published 28 Feb 2024 in cs.DC

Abstract: Fully-strict fork-join parallelism is a powerful model for shared-memory programming due to its optimal time scaling and strong bounds on memory scaling. The latter is rarely achieved due to the difficulty of implementing continuation stealing in traditional High Performance Computing (HPC) languages -- where it is often impossible without modifying the compiler or resorting to non-portable techniques. We demonstrate how stackless coroutines (a new feature in C++20) can enable fully-portable continuation stealing and present libfork a lock-free fine-grained parallelism library, combining coroutines with user-space, geometric segmented-stacks. We show our approach is able to achieve optimal time/memory scaling, both theoretically and empirically, across a variety of benchmarks. Compared to openMP (libomp), libfork is on average 7.2x faster and consumes 10x less memory. Similarly, compared to Intel's TBB, libfork is on average 2.7x faster and consumes 6.2x less memory. Additionally, we introduce non-uniform memory access (NUMA) optimizations for schedulers that demonstrate performance matching busy-waiting schedulers.


Summary

  • The paper introduces Libfork, a portable C++20 library that uses stackless coroutines for continuation-stealing, running on average 7.2× faster than OpenMP (libomp) while using 10× less memory.
  • It combines a lock-free, fine-grained scheduler with user-space, geometrically growing segmented stacks to optimize both computational speed and memory footprint in high performance computing.
  • The study benchmarks Libfork against established libraries, OpenMP (libomp) and Intel’s TBB, demonstrating significant performance and memory improvements.

Exploring Efficient Parallelism with Libfork: Embedding Continuation-Stealing in C++20 Coroutines

Introduction

In the field of High Performance Computing (HPC), efficiently harnessing the power of modern multicore processors is pivotal. A recent advance in this direction is "Libfork", a C++20 library that leverages stackless coroutines to achieve fine-grained parallelism. This article presents a comprehensive overview of Libfork, emphasizing its use of continuation-stealing with lock-free, segmented stacks to optimize both computational speed and memory usage in a portable manner. Because it builds on the coroutines introduced in C++20, Libfork is fully portable: it requires neither compiler modifications nor platform-specific techniques.

Core Concepts and Implementation

The crux of Libfork's approach lies in its use of stackless coroutines for continuation-stealing, a strategy known for its efficiency in exploiting parallelism. The library pairs a lock-free, fine-grained parallelism mechanism with user-space, geometric segmented stacks, aiming for theoretically optimal time and memory scaling across a variety of benchmarks. Most notably, Libfork delivers a significant performance boost and reduced memory consumption compared with well-established libraries such as OpenMP (libomp) and Intel’s Threading Building Blocks (TBB).
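
To make the segmented-stack idea concrete, below is a minimal sketch, not libfork's actual implementation: segments grow geometrically, each new segment doubling in size, so the number of allocations is logarithmic in peak frame usage and total memory stays within a constant factor of the high-water mark. The class name, initial segment size, and growth factor are illustrative choices, and alignment handling is omitted.

    #include <cstddef>
    #include <new>
    #include <vector>

    // Sketch of a geometric segmented stack (illustrative, not libfork's code).
    // Frames are bump-allocated from the newest segment; when it is full, a
    // segment twice as large is allocated, giving O(log n) segments overall.
    class SegmentedStack {
        struct Segment { std::byte* base; std::size_t size; std::size_t used; };
        std::vector<Segment> segments_;
        std::size_t next_size_ = 4096;  // assumed initial segment size

    public:
        void* allocate(std::size_t n) {  // alignment handling omitted for brevity
            if (segments_.empty() || segments_.back().used + n > segments_.back().size) {
                while (next_size_ < n) next_size_ *= 2;
                segments_.push_back(
                    {static_cast<std::byte*>(::operator new(next_size_)), next_size_, 0});
                next_size_ *= 2;  // geometric growth
            }
            Segment& s = segments_.back();
            void* p = s.base + s.used;
            s.used += n;
            return p;
        }

        void deallocate(std::size_t n) {  // frames are freed in LIFO order
            segments_.back().used -= n;
            if (segments_.back().used == 0 && segments_.size() > 1) {
                ::operator delete(segments_.back().base);
                segments_.pop_back();
            }
        }

        ~SegmentedStack() {
            for (Segment& s : segments_) ::operator delete(s.base);
        }
    };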

Coroutines and Parallelism

Libfork builds on C++20 stackless coroutines to support asynchronous execution and cooperative multitasking. The paper spells out the distinction between stackful and stackless coroutines, laying the groundwork for understanding how Libfork's stackless implementation manages resources and execution flow efficiently.
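
To ground the distinction, here is a minimal, self-contained C++20 stackless coroutine, independent of libfork: the compiler reifies the function's locals into a single heap-allocated frame, and each suspension is an ordinary return to the caller, so no dedicated stack is required.

    #include <coroutine>
    #include <exception>
    #include <iostream>

    // A minimal stackless coroutine: its state lives in one compiler-generated
    // frame, and co_yield suspends by simply returning control to the caller.
    struct Counter {
        struct promise_type {
            int current = 0;
            Counter get_return_object() {
                return Counter{std::coroutine_handle<promise_type>::from_promise(*this)};
            }
            std::suspend_always initial_suspend() noexcept { return {}; }
            std::suspend_always final_suspend() noexcept { return {}; }
            std::suspend_always yield_value(int v) { current = v; return {}; }
            void return_void() {}
            void unhandled_exception() { std::terminate(); }
        };
        std::coroutine_handle<promise_type> handle;  // copy/move handling omitted
        ~Counter() { if (handle) handle.destroy(); }
    };

    Counter count_to(int n) {
        for (int i = 0; i < n; ++i) co_yield i;  // suspends without its own stack
    }

    int main() {
        Counter c = count_to(3);
        c.handle.resume();                         // run to the first co_yield
        while (!c.handle.done()) {
            std::cout << c.handle.promise().current << '\n';
            c.handle.resume();                     // resume from the last suspension
        }
    }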

The fork-join model, central to structuring task-based concurrency, is revisited to underscore the efficiency of continuation-stealing over traditional child-stealing strategies. The paper then details the implementation: how tasks map onto coroutines, how task frames are allocated and freed on user-space segmented stacks, and how adaptive scheduling accommodates Non-Uniform Memory Access (NUMA) optimizations. A sketch of this fork-join style follows.
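
The sketch below follows the fork/call/join vocabulary used in the paper and the library's public documentation; the exact spellings (bracket versus parenthesis syntax, scheduler and include names) vary between libfork versions, so treat it as illustrative rather than canonical.

    #include <libfork.hpp>  // include path depends on how libfork is installed

    // Fork-join Fibonacci in libfork's documented style (illustrative).
    inline constexpr auto fib = [](auto fib, int n) -> lf::task<int> {
        if (n < 2) {
            co_return n;
        }
        int a, b;
        co_await lf::fork[&a, fib](n - 1);  // child task; the continuation may be stolen
        co_await lf::call[&b, fib](n - 2);  // ordinary call: no stealable continuation
        co_await lf::join;                  // both a and b are ready past this point
        co_return a + b;
    };

    int main() {
        lf::lazy_pool pool{};  // a work-stealing scheduler from the library
        int result = lf::sync_wait(pool, fib, 30);
    }

A fork makes the child task available for other workers to steal while the parent's continuation proceeds; the join is the synchronization point past which all forked results are guaranteed ready.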

Performance Evaluation

The experimental analysis shows an average speedup of 7.2× over libomp together with a 10× reduction in memory usage, with similarly favorable comparisons against Intel’s TBB (2.7× faster, 6.2× less memory). This efficiency is attributed to the segmented-stack model, which avoids the memory waste of pre-allocated per-task stacks, and to the coroutine-based design, which adds little scheduling overhead.

Implications and Future Directions

The successful demonstration of Libfork's performance and its fully portable model holds significant implications for the future of parallel programming in C++. It not only challenges existing frameworks but also opens avenues for exploring finer-grained parallelism without the constraint of platform dependency or the necessity for compiler modifications.

Future work could explore further optimizations in coroutine management, better integration with hardware-specific features, and the expansion of Libfork's application to broader domains beyond those tested. Additionally, as C++ continues to evolve, there is potential for integrating new language features that could enhance Libfork's performance and usability.

Conclusion

Libfork represents a significant step forward in the pursuit of efficient, portable parallel programming. By leveraging modern C++'s stackless coroutines for continuation-stealing and introducing segmented stacks for resource management, it sets a new benchmark for fine-grained parallelism libraries. Looking ahead, the continual development of Libfork and similar initiatives will be critical in fully exploiting the capabilities of multicore processors in HPC and beyond.