Design and Operation of Shared Machine Learning Clusters on Campus

Published 4 Oct 2021 in cs.DC and cs.NI | (2110.01556v2)

Abstract: Amid the rapid advancements in large ML models, universities worldwide are investing substantial funds and efforts into GPU clusters. However, managing a shared GPU cluster poses a pyramid of challenges, from hardware configuration to resource allocation among users. This paper introduces SING, a full-stack solution designed to streamline the management of shared GPU clusters in academic institutions. Motivated by the pressing need for efficient resource sharing and the challenges posed by limited staffing, we present a comprehensive view of SING's architecture and design choices, which achieves operational efficiency (i.e., low maintenance cost and high resource utilization). We also share experience and insights from the real-world operations of SING, including analysis of its usage patterns and management of incidents and failures. This paper is part of our ongoing effort to improve the management of shared ML clusters. We open-source relevant resources to facilitate the development and operation of similar clusters for ML.