Sizey: Memory-Efficient Execution of Scientific Workflow Tasks (2407.16353v1)
Abstract: As the amount of available data continues to grow in fields as diverse as bioinformatics, physics, and remote sensing, the importance of scientific workflows in the design and implementation of reproducible data analysis pipelines increases. When developing workflows, resource requirements must be defined for each type of task in the workflow. Typically, task types vary widely in their computational demands because they are simply wrappers for arbitrary black-box analysis tools. Furthermore, the resource consumption for the same task type can also vary considerably due to different inputs. Since underestimating memory resources leads to bottlenecks and task failures, workflow developers tend to overestimate memory resources. However, overprovisioning of memory wastes resources and limits cluster throughput. Addressing this problem, we propose Sizey, a novel online memory prediction method for workflow tasks. During workflow execution, Sizey simultaneously trains multiple machine learning models and then dynamically selects the best model for each workflow task. To evaluate the quality of each model, we introduce a novel resource allocation quality (RAQ) score based on memory prediction accuracy and efficiency. Sizey's prediction models are retrained and re-evaluated online during workflow execution, continuously incorporating metrics from completed tasks. Our evaluation with a prototype implementation of Sizey uses metrics from six real-world scientific workflows from the popular nf-core framework and shows a median reduction in memory waste over time of 24.68% compared to the respective best-performing state-of-the-art baseline.
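The abstract outlines the core loop: train several predictors side by side, score each with a RAQ metric that balances prediction accuracy against memory efficiency, pick the best model per task, and keep retraining as tasks complete. The sketch below illustrates that idea in Python under explicit assumptions; the specific RAQ weighting, the single input-size feature, the candidate model set, and the safety margin are illustrative choices, not the paper's exact formulation.

```python
# Minimal sketch of an online multi-model memory sizer in the spirit of Sizey.
# The RAQ weighting, features, and margin below are assumptions for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor


class OnlineMemorySizer:
    """Maintains several regressors for one task type and picks the best by a RAQ-style score."""

    def __init__(self, alpha=0.5, safety_margin=0.1):
        # Candidate predictors trained side by side (illustrative choice of models).
        self.models = {
            "linear": LinearRegression(),
            "knn": KNeighborsRegressor(n_neighbors=3),
            "forest": RandomForestRegressor(n_estimators=50, random_state=0),
        }
        self.alpha = alpha                # trade-off between accuracy and efficiency
        self.safety_margin = safety_margin
        self.X, self.y = [], []           # observed (input size, peak memory) pairs

    def record(self, input_size_mb, peak_mem_mb):
        # Feed back metrics from a completed task and retrain all models online.
        self.X.append([input_size_mb])
        self.y.append(peak_mem_mb)
        if len(self.y) >= 3:
            for model in self.models.values():
                model.fit(np.array(self.X), np.array(self.y))

    def _raq(self, model):
        # Illustrative RAQ-style score: combines prediction accuracy and memory
        # efficiency on the tasks seen so far (higher is better).
        X, y = np.array(self.X), np.array(self.y)
        pred = np.maximum(model.predict(X), 1.0)
        accuracy = 1.0 - np.mean(np.abs(pred - y) / y)      # relative-error term
        efficiency = np.mean(np.minimum(y / pred, 1.0))     # penalizes over-allocation
        return self.alpha * accuracy + (1.0 - self.alpha) * efficiency

    def predict(self, input_size_mb, default_mem_mb=4096.0):
        # Fall back to a static default until enough observations exist.
        if len(self.y) < 3:
            return default_mem_mb
        best = max(self.models.values(), key=self._raq)
        estimate = float(best.predict(np.array([[input_size_mb]]))[0])
        return estimate * (1.0 + self.safety_margin)
```

A workflow scheduler could keep one such sizer per task type, call `record()` whenever a task finishes, and use `predict()` to size the next submission; a task that still exceeds its allocation would be retried with a larger request, mirroring the feedback loop described in the abstract.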
Authors: Jonathan Bader, Fabian Skalski, Fabian Lehmann, Dominik Scheinert, Jonathan Will, Lauritz Thamsen, Odej Kao