MPIrigen: MPI Code Generation through Domain-Specific Language Models

(2402.09126)
Published Feb 14, 2024 in cs.DC, cs.AI, cs.CL, cs.LG, and cs.SE

Abstract

The imperative need to scale computation across numerous nodes highlights the significance of efficient parallel computing, particularly in the realm of Message Passing Interface (MPI) integration. The challenging parallel programming task of generating MPI-based parallel programs has remained unexplored. This study first investigates the performance of state-of-the-art language models in generating MPI-based parallel programs. Findings reveal that widely used models such as GPT-3.5 and PolyCoder (a specialized multi-lingual code model) exhibit notable performance degradation when generating MPI-based programs compared to general-purpose programs. In contrast, domain-specific models such as MonoCoder, which are pretrained on the MPI-related programming languages C and C++, outperform larger models. Subsequently, we introduce a dedicated downstream task of MPI-based program generation by fine-tuning MonoCoder on HPCorpusMPI. We call the resulting model MPIrigen. We propose an innovative preprocessing technique that performs completion only after observing the whole code, thus enabling better completion with a wider context. Comparative analysis against GPT-3.5 zero-shot performance, using a novel HPC-oriented evaluation method, demonstrates that MPIrigen excels in generating accurate MPI functions, achieving up to 0.8 accuracy in location and function prediction and more than 0.9 accuracy in argument prediction. The success of this tailored solution underscores the importance of domain-specific fine-tuning in optimizing language models for parallel computing code generation, paving the way for a new generation of automatic parallelization tools. The sources of this work are available at our GitHub MPIrigen repository: https://github.com/Scientific-Computing-Lab-NRCN/MPI-rigen

Overview

  • MPIrigen utilizes domain-specific Language Models (LMs) to automate MPI-based parallel program generation, addressing challenges in integrating MPI for large-scale computations.

  • The research introduces HPCorpusMPI, a novel dataset focused on MPI domain decomposition codes, and demonstrates MPIrigen's superior performance over conventional models like GPT-3.5.

  • A significant contribution is the development of a code pre-processing technique that improves code generation accuracy by allowing better context comprehension.

  • The findings highlight the importance of domain-specific fine-tuning for LMs in parallel computing and suggest future applications in automatic code generation across various areas of parallel computing.

Advancements in MPI Code Generation with Domain-Specific Language Models

Introduction to MPIrigen and its Context

The perpetual growth of computational demands necessitates efficiently scaling computations across numerous nodes, underscoring the significance of the Message Passing Interface (MPI) in parallel computing. MPI is the cornerstone of large-scale computation, especially for domain decomposition. However, integrating MPI into parallel programs remains challenging due to the intricate nature of parallel programming and the limitations of the static tools aimed at automating it. Enter MPIrigen, a novel approach that leverages domain-specific Language Models (LMs) to generate MPI-based parallel programs.
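
To make the setting concrete, here is a minimal sketch of the kind of MPI domain decomposition program the study targets. It is our own illustration, not code from the paper: each rank owns a slice of a 1-D array and exchanges one-cell halos with its neighbors.

```c
/* Illustrative 1-D domain decomposition with halo exchange
 * (not taken from the paper). */
#include <mpi.h>
#include <stdio.h>

#define N 1000  /* hypothetical global problem size */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local_n = N / size;        /* cells owned by this rank */
    double local[local_n + 2];     /* +2 ghost cells for halos */
    for (int i = 1; i <= local_n; i++)
        local[i] = rank;           /* dummy initialization */

    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    /* Halo exchange: send edge cells, receive into ghost cells. */
    MPI_Sendrecv(&local[1],           1, MPI_DOUBLE, left,  0,
                 &local[local_n + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&local[local_n],     1, MPI_DOUBLE, right, 1,
                 &local[0],           1, MPI_DOUBLE, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d of %d owns %d cells\n", rank, size, local_n);
    MPI_Finalize();
    return 0;
}
```

Even this toy example shows why MPI generation is hard for a language model: the decomposition arithmetic, neighbor bookkeeping, and communication calls must all be consistent with one another.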

Leveraging Language Models for MPI Parallelization

Recent trends have showcased a shift towards data-driven methods, particularly large language models (LLMs), for a range of programming tasks, including the challenging domain of parallel programming. While prior approaches have successfully generated OpenMP pragmas with LMs, the generation of complex, multi-function MPI code remains a relatively unexplored frontier.
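
The gap between the two tasks is easy to see side by side. The following C sketch is our own hedged illustration, not from the paper: an OpenMP parallelization is often a single pragma on an existing loop, whereas the MPI version of the same reduction requires restructuring the code around explicit communication.

```c
/* Illustrative contrast (not from the paper): the same reduction
 * parallelized with one OpenMP pragma vs. explicit MPI calls. */
#include <mpi.h>
#include <omp.h>

double sum_openmp(const double *a, int n) {
    double s = 0.0;
    /* One pragma is the entire parallelization. */
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

double sum_mpi(const double *local, int local_n) {
    /* MPI needs explicit communication: each rank reduces its own
     * slice, then partial sums are combined across all ranks. */
    double partial = 0.0, total = 0.0;
    for (int i = 0; i < local_n; i++)
        partial += local[i];
    MPI_Allreduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);
    return total;
}
```

Generating the OpenMP version requires inserting one line; generating the MPI version requires choosing the right function, placing it correctly, and supplying a consistent argument list, which is exactly the task MPIrigen addresses.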

MPIrigen introduces a dedicated fine-tuning approach, building upon MonoCoder, a PolyCoder-style model pre-trained on C and C++, the languages in which most MPI code is written. This model is fine-tuned on a novel corpus, HPCorpusMPI, which focuses solely on MPI domain decomposition codes. This step marks a significant stride towards harnessing domain-specific LMs to generate MPI-based parallel programs.

Key Contributions and Experimental Results

The paper presents several crucial contributions to the field of parallel computing and LMs:

  • Creation of HPCorpusMPI: The first dataset focused solely on MPI domain decomposition codes, providing an essential resource for training and evaluating MPI-focused LMs.
  • Code Pre-processing Technique: An innovative approach that enhances code completion tasks by enabling better comprehension of a wider context, thereby significantly improving the generation accuracy.
  • Superior Performance of MPIrigen: Compared to state-of-the-art models like GPT-3.5, MPIrigen more accurately generates MPI functions, including correct placement, function calls, and argument generation (the three prediction targets are illustrated in the sketch after this list).

Experimental results reveal that MPIrigen significantly outperforms existing models across these metrics, reaching up to 0.8 accuracy in location and function prediction and above 0.9 in argument prediction, establishing its efficacy in generating accurate and efficient MPI code.

Implications and Future Directions

The research encapsulated in MPIrigen underscores the vital role of domain-specific fine-tuning in optimizing LMs for the nuanced task of parallel computing code generation. The findings not only contribute to the theoretical understanding of LMs in programming tasks but also offer practical implications by simplifying the process of integrating MPI in parallel programs.

Future developments may explore extending the methodologies established by MPIrigen to other areas of parallel computing, potentially leading to a broader application of LMs in automatic code generation. Moreover, the advancements in LMs, coupled with domain-specific datasets and fine-tuning procedures, pave the way for more sophisticated tools that could further streamline the development of parallel programs.

Acknowledgments

The creation of MPIrigen was supported by several organizations, including the Israeli Council for Higher Education, Intel Corporation, and the Lynn and William Frankel Center for Computer Science. Computational support was provided by the NegevHPC project and Intel Developer Cloud, highlighting the collaborative effort involved in this research.

In conclusion, MPIrigen represents a pivotal advancement in the domain of parallel programming, offering a promising route towards the automated generation of MPI-based parallel programs. The successful application of domain-specific LMs in this context not only enriches the toolbox of parallel programmers but also sets the stage for future research directions in the application of artificial intelligence in high-performance computing.
