Papers
Topics
Authors
Recent
2000 character limit reached

HPCClusterScape: Increasing Transparency and Efficiency of Shared High-Performance Computing Clusters for Large-scale AI Models (2310.02120v2)

Published 3 Oct 2023 in cs.HC and cs.DC

Abstract: The emergence of large-scale AI models, like GPT-4, has significantly impacted academia and industry, driving the demand for high-performance computing (HPC) to accelerate workloads. To address this, we present HPCClusterScape, a visualization system that enhances the efficiency and transparency of shared HPC clusters for large-scale AI models. HPCClusterScape provides a comprehensive overview of system-level (e.g., partitions, hosts, and workload status) and application-level (e.g., identification of experiments and researchers) information, allowing HPC operators and machine learning researchers to monitor resource utilization and identify issues through customizable violation rules. The system includes diagnostic tools to investigate workload imbalances and synchronization bottlenecks in large-scale distributed deep learning experiments. Deployed in industrial-scale HPC clusters, HPCClusterScape incorporates user feedback and meets specific requirements. This paper outlines the challenges and prerequisites for efficient HPC operation, introduces the interactive visualization system, and highlights its contributions in addressing pain points and optimizing resource utilization in shared HPC clusters.

Citations (2)

Summary

We haven't generated a summary for this paper yet.

Slide Deck Streamline Icon: https://streamlinehq.com

Whiteboard

Dice Question Streamline Icon: https://streamlinehq.com

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Lightbulb Streamline Icon: https://streamlinehq.com

Continue Learning

We haven't generated follow-up questions for this paper yet.

List To Do Tasks Checklist Streamline Icon: https://streamlinehq.com

Collections

Sign up for free to add this paper to one or more collections.