SpecTra: Enhancing the Code Translation Ability of Language Models by Generating Multi-Modal Specifications (2405.18574v2)

Published 28 May 2024 in cs.SE

Abstract: LLMs are increasingly being used for the task of automated code translation, which has important real-world applications. However, most existing approaches use only the source code of a program as an input to an LLM, and do not consider the different kinds of specifications that can be extracted from a program. In this paper, we propose SpecTra, a multi-stage approach that uses a novel self-consistency filter to first generate high-quality static specifications, test cases, and natural language descriptions from a given program, and then uses these along with the source code to improve the quality of LLM-generated translations. We evaluate SpecTra on three code translation tasks - C to Rust, C to Go, and JavaScript to TypeScript - and show that it can enhance the performance of six popular LLMs on these tasks by up to 10 percentage points and a relative improvement of 26%. Our research suggests that generating high-quality specifications could be a promising and efficient way to improve the performance of LLMs for code translation. We make our code and data available, anonymized for review.

Summary

The paper presents SpecTra, a novel multi-modal approach that augments LLM code translation with validated static, input-output, and descriptive specifications.
The methodology employs self-consistency filtering and test cases to ensure the quality of generated specifications during the translation process.
Evaluations across three translation tasks show performance improvements of up to 10 percentage points and a relative gain of 26%, enhancing both accuracy and idiomatic expression.

Overview of SpecTra: Enhancing Code Translation with Multi-Modal Specifications

The paper "SpecTra: Enhancing the Code Translation Ability of Language Models by Generating Multi-Modal Specifications" introduces a methodology for improving the performance of LLMs on automated code translation. The authors address a gap in existing techniques: most of them give the LLM only the program's source code and ignore the specifications that can be extracted from the program to guide translation.

Methodology

The core contribution of the paper is the SpecTra approach, a multi-stage methodology that leverages a combination of static specifications, test cases, and natural language descriptions to augment LLM-based code translations. The process is as follows:

1. Specification Generation: SpecTra begins by generating multiple candidate specifications from the given code, utilizing a self-consistency filter for validation. The approach creates three types of specifications:
   - Static Specifications: Structured representations of the program's behavior.
   - Input-Output Specifications: Specific examples of input-output behavior.
   - Descriptions: Natural language summaries of the code functionality.
2. Specification Validation: The generated specifications are validated for self-consistency. Static and descriptive specifications are verified by regenerating the source code from the specification and checking the regenerated code against the original using test cases. For input-output specifications, the original program is executed to check that each input produces the stated output.
3. Specification-Guided Translation: Once validated, each type of specification is integrated into the translation task sequentially. This process is designed to combine the idiomatic expressiveness of LLM-generated code with the functional accuracy traditionally associated with transpilers.
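The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Spec`, `passes_tests`, `filter_self_consistent`, and `translate_with_specs` are hypothetical, the LLM calls (`regenerate`, `translate`) and the program runners are passed in as plain callables, and "sequentially" is read here as trying one validated specification at a time until a translation passes the tests.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

# An input-output example: (program input, expected output).
TestCase = Tuple[str, str]
# Hypothetical runner: executes `code` on `given` input and returns its output.
Runner = Callable[[str, str], str]


@dataclass(frozen=True)
class Spec:
    kind: str  # "static", "io", or "description"
    text: str


def passes_tests(run: Runner, code: str, tests: Sequence[TestCase]) -> bool:
    """True if `code` yields the expected output for every test input."""
    return all(run(code, given) == expected for given, expected in tests)


def filter_self_consistent(
    candidates: Sequence[Spec],
    regenerate: Callable[[Spec], str],  # assumed LLM call: spec -> source-language code
    run: Runner,
    tests: Sequence[TestCase],
) -> List[Spec]:
    """Keep a spec only if code regenerated from it behaves like the original."""
    return [s for s in candidates if passes_tests(run, regenerate(s), tests)]


def translate_with_specs(
    source: str,
    specs: Sequence[Spec],
    translate: Callable[[str, Spec], str],  # assumed LLM call: (source, spec) -> target code
    run_target: Runner,
    tests: Sequence[TestCase],
) -> Optional[str]:
    """Try validated specs one at a time; return the first passing translation."""
    for spec in specs:
        candidate = translate(source, spec)
        if passes_tests(run_target, candidate, tests):
            return candidate
    return None


# --- Toy stand-ins so the sketch runs without an LLM or a compiler ---
PROGRAMS = {
    "double_c": lambda x: str(2 * int(x)),        # the "original" program
    "double_regen": lambda x: str(int(x) + int(x)),
    "square_regen": lambda x: str(int(x) ** 2),
    "double_rust": lambda x: str(int(x) * 2),     # the "translation"
}


def run(code: str, given: str) -> str:
    return PROGRAMS[code](given)


tests = [("1", "2"), ("3", "6")]
# Input-output examples are checked by executing the original program.
io_spec_ok = passes_tests(run, "double_c", tests)

candidates = [
    Spec("description", "returns twice its input"),
    Spec("description", "returns the square of its input"),  # wrong on purpose
]
kept = filter_self_consistent(
    candidates,
    regenerate=lambda s: "double_regen" if "twice" in s.text else "square_regen",
    run=run,
    tests=tests,
)
translation = translate_with_specs(
    "double_c", kept, translate=lambda src, s: "double_rust", run_target=run, tests=tests
)
```

In this toy run the incorrect description is discarded because the code regenerated from it fails the test cases, and only the surviving specification is passed on to translation. In a real setting the callables would wrap model prompts and a compile-and-execute harness for the source and target languages.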

Evaluation and Results

SpecTra was evaluated on three code translation tasks (C to Rust, C to Go, and JavaScript to TypeScript) using six popular LLMs. Compared with translating from source code alone, it improved performance by up to 10 percentage points and by up to 26% in relative terms. The gains were most evident where a single specification type was insufficient on its own and combining several modalities compensated.
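The two reported figures measure different things: a percentage-point gain is an absolute difference between success rates, while a relative improvement divides that difference by the baseline rate. The short sketch below shows the arithmetic. The function names and the 40% and 50% rates are purely illustrative assumptions, not numbers from the paper, and the reported maxima need not come from the same model and task.

```python
def absolute_gain_pp(baseline_pct: float, improved_pct: float) -> float:
    """Difference between two success rates, in percentage points."""
    return improved_pct - baseline_pct


def relative_gain_pct(baseline_pct: float, improved_pct: float) -> float:
    """Gain expressed as a percentage of the baseline success rate."""
    return 100.0 * (improved_pct - baseline_pct) / baseline_pct


# Hypothetical rates chosen only to illustrate the arithmetic.
baseline, improved = 40.0, 50.0
pp = absolute_gain_pp(baseline, improved)
rel = relative_gain_pct(baseline, improved)
print(f"{pp} percentage points, {rel}% relative")
```

The same absolute gain yields a larger relative improvement when the baseline is low, which is why both figures are usually reported together.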

Implications and Future Directions

The implications of this research are twofold. Practically, it offers a method to harness specifications for more accurate and idiomatic code translations, potentially reducing the technical debt associated with maintaining legacy code. Theoretically, it provides insights into how multi-modal information can be utilized to enhance the capabilities of LLMs beyond traditional code-related applications.

For future research, the paper suggests exploring the generation of specifications in formal languages or assert statements for automatic cross-verification. Another promising direction is to evaluate the utility of these specifications in other code-related tasks such as debugging or code synthesis.

Conclusion

SpecTra represents an innovative step towards improving automated code translation by integrating multi-modal specifications into existing LLM frameworks. The proposed method not only enhances the quality of translations but also bridges the gap between traditional rule-based correctness and LLM-driven idiomatic coding practices. As AI continues to evolve, methodologies like SpecTra could lead to more reliable and maintainable software systems.
