Libfork: portable continuation-stealing with stackless coroutines (2402.18480v1)
Abstract: Fully-strict fork-join parallelism is a powerful model for shared-memory programming due to its optimal time scaling and strong bounds on memory scaling. The latter is rarely achieved due to the difficulty of implementing continuation stealing in traditional High Performance Computing (HPC) languages, where it is often impossible without modifying the compiler or resorting to non-portable techniques. We demonstrate how stackless coroutines (a new feature in C++20) can enable fully portable continuation stealing and present libfork, a lock-free fine-grained parallelism library that combines coroutines with user-space, geometric segmented stacks. We show that our approach achieves optimal time/memory scaling, both theoretically and empirically, across a variety of benchmarks. Compared to OpenMP (libomp), libfork is on average 7.2x faster and consumes 10x less memory. Similarly, compared to Intel's TBB, libfork is on average 2.7x faster and consumes 6.2x less memory. Additionally, we introduce non-uniform memory access (NUMA) optimizations for schedulers that demonstrate performance matching busy-waiting schedulers.