Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
139 tokens/sec
GPT-4o
47 tokens/sec
Gemini 2.5 Pro Pro
43 tokens/sec
o3 Pro
4 tokens/sec
GPT-4.1 Pro
47 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Design and Implementation of an Analysis Pipeline for Heterogeneous Data (2403.15721v3)

Published 23 Mar 2024 in cs.DC

Abstract: Managing and preparing complex data for deep learning, a prevalent approach in large-scale data science can be challenging. Data transfer for model training also presents difficulties, impacting scientific fields like genomics, climate modeling, and astronomy. A large-scale solution like Google Pathways with a distributed execution environment for deep learning models exists but is proprietary. Integrating existing open-source, scalable runtime tools and data frameworks on high-performance computing (HPC) platforms is crucial to address these challenges. Our objective is to establish a smooth and unified method of combining data engineering and deep learning frameworks with diverse execution capabilities that can be deployed on various high-performance computing platforms, including cloud and supercomputers. We aim to support heterogeneous systems with accelerators, where Cylon and other data engineering and deep learning frameworks can utilize heterogeneous execution. To achieve this, we propose Radical-Cylon, a heterogeneous runtime system with a parallel and distributed data framework to execute Cylon as a task of Radical Pilot. We thoroughly explain Radical-Cylon's design and development and the execution process of Cylon tasks using Radical Pilot. This approach enables the use of heterogeneous MPI-communicators across multiple nodes. Radical-Cylon achieves better performance than Bare-Metal Cylon with minimal and constant overhead. Radical-Cylon achieves (4~15)% faster execution time than batch execution while performing similar join and sort operations with 35 million and 3.5 billion rows with the same resources. The approach aims to excel in both scientific and engineering research HPC systems while demonstrating robust performance on cloud infrastructures. This dual capability fosters collaboration and innovation within the open-source scientific research community.

Citations (1)

Summary

We haven't generated a summary for this paper yet.

X Twitter Logo Streamline Icon: https://streamlinehq.com

Tweets