DCAFE: Dynamic load-balanced loop Chunking & Aggressive Finish Elimination for Recursive Task Parallel Programs

Published 21 Feb 2015 in cs.DC | (1502.06086v1)

Abstract: In this paper, we present two symbiotic optimizations to optimize recursive task parallel (RTP) programs by reducing the task creation and termination overheads. Our first optimization Aggressive Finish-Elimination (AFE) helps reduce the redundant join operations to a large extent. The second optimization Dynamic Load-Balanced loop Chunking (DLBC) extends the prior work on loop chunking to decide on the number of parallel tasks based on the number of available worker threads, at runtime. Further, we discuss the impact of exceptions on our optimizations and extend them to handle RTP programs that may throw exceptions. We implemented DCAFE (= DLBC+AFE) in the X10v2.3 compiler and tested it over a set of benchmark kernels on two different hardwares (a 16-core Intel system and a 64-core AMD system). With respect to the base X10 compiler extended with loop-chunking of Nandivada et al Nandivada et al.(2013)Nandivada, Shirako, Zhao, and Sarkar, DCAFE achieved a geometric mean speed up of 5.75x and 4.16x on the Intel and AMD system, respectively. We also present an evaluation with respect to the energy consumption on the Intel system and show that on average, compared to the LC versions, the DCAFE versions consume 71.2% less energy.