Multi-lingual Evaluation of Code Generation Models

Published 26 Oct 2022 in cs.LG and cs.CL | (2210.14868v3)

Abstract: We present new benchmarks on evaluation code generation models: MBXP and Multilingual HumanEval, and MathQA-X. These datasets cover over 10 programming languages and are generated using a scalable conversion framework that transpiles prompts and test cases from the original Python datasets into the corresponding data in the target language. Using these benchmarks, we are able to assess the performance of code generation models in a multi-lingual fashion, and discovered generalization ability of LLMs on out-of-domain languages, advantages of multi-lingual models over mono-lingual, the ability of few-shot prompting to teach the model new languages, and zero-shot translation abilities even on mono-lingual settings. Furthermore, we use our code generation model to perform large-scale bootstrapping to obtain synthetic canonical solutions in several languages, which can be used for other code-related evaluations such as code insertion, robustness, or summarization tasks. Overall, our benchmarks represents a significant step towards a deeper understanding of LLMs' code generation abilities. We publicly release our code and datasets at https://github.com/amazon-research/mxeval.

Abstract PDF Upgrade to Chat

Authors (25)

First 10 authors:

Citations (125)

View on Semantic Scholar

Summary

The paper introduces new multilingual benchmarks and a scalable framework to convert Python datasets into over 10 programming languages.
It conducts a large-scale evaluation of models from 125M to 13B parameters, revealing that multilingual models excel over mono-lingual ones in out-of-domain tasks.
The study demonstrates that few-shot prompting and cross-language knowledge spillover significantly enhance the accuracy of code generation across languages.

Multi-lingual Evaluation of Code Generation Models: A Summary

The paper focuses on the evaluation of code generation models across multiple programming languages, highlighting the development and introduction of new benchmarks—MBXP, Multilingual HumanEval, and MathQA-X. These benchmarks aim to facilitate the understanding and assessment of code generation models in a multilingual context. The research leverages a scalable framework to transcribe existing Python datasets into more than ten other languages, providing a comprehensive basis for performance evaluation of these models' multilingual capabilities.

Core Contributions

Benchmarks and Dataset Conversion: The authors present a robust framework capable of converting execution-based evaluation datasets from Python to multiple programming languages, ensuring a scalable approach to dataset preparation. They focus specifically on function completion-style tasks and have managed to convert these tasks while maintaining the necessary testing infrastructure to ensure functional correctness across languages.
Comprehensive Evaluation: The paper conducts a large-scale evaluation using several trained models, ranging from 125 million to 13 billion parameters, to compare multi-lingual models against mono-lingual ones. This is executed through various tasks such as in-domain and out-of-domain code generation, few-shot prompting, zero-shot translation, and robustness assessments.
Synthetic Solutions: Through the use of large-scale bootstrapping, the study generates synthetic canonical solutions in new languages, extending the applicability of these datasets for tasks like code insertion and summarization.

Key Findings

Multi-lingual models generally outperform mono-lingual counterparts when the model size reaches a sufficiently large capacity, demonstrating the value of trained models across diverse languages.
The presence of cross-language knowledge spillover, where data from one language is embedded within another, contributes significantly to the models’ out-of-domain language capabilities. This phenomenon facilitates the generation of syntactically and semantically correct programs even in languages not specifically targeted during training.
Few-shot prompting significantly enhances the models' abilities to generate syntactically proper code for out-of-domain languages, reducing non-assertion errors such as syntax or compilation errors.
LLMs exhibit notable zero-shot translation abilities, where reference solutions in one language improve function completion in another. This is observable even on mono-lingual models, suggesting that knowledge of the target language is a critical factor in translation performance.

Implications and Future Directions

The work highlights both practical and theoretical implications. Practically, the findings suggest that training large-scale, multi-lingual models could be more efficient and effective than maintaining several specialized models, enabling better cross-lingual support systems for developers. Theoretically, the study sheds light on the potential mechanisms of knowledge transfer within code generation models, providing a foundation for future research in LLM generalization across domains.

The framework established in this research also holds promise for further developments in AI. This includes enhancing the robustness of code models through rigorous perturbation tests and augmenting the translation capabilities among non-programming languages, applicable in diverse real-world scenarios. The extension to more programming languages and the exploration of compositionality across language pairs are natural next steps in this line of research.

The release of these benchmarks and datasets provides a platform for future research, potentially paving the way for innovative development in code generation and program synthesis. Researchers can leverage this work to explore new methodologies for inherently multi-lingual systems, making strides toward more sophisticated integrative AI solutions.

Markdown Report Issue