Towards Translating Real-World Code with LLMs: A Study of Translating to Rust (2405.11514v2)
Abstract: LLMs show promise in code translation - the task of translating code written in one programming language to another language - due to their ability to write code in most programming languages. However, LLMs' effectiveness on translating real-world code remains largely unstudied. In this work, we perform the first substantial study on LLM-based translation to Rust by assessing the ability of five state-of-the-art LLMs: GPT-4, Claude 3, Claude 2.1, Gemini Pro, and Mixtral. We conduct our study on code extracted from real-world open-source projects. To enable our study, we develop FLOURINE, an end-to-end code translation tool that uses differential fuzzing to check whether a Rust translation is I/O equivalent to the original source program, eliminating the need for pre-existing test cases. As part of our investigation, we assess both the LLMs' ability to produce an initially successful translation and their capacity to fix a previously generated buggy one. If the original and translated programs are not I/O equivalent, we apply a set of automated feedback strategies, including feedback to the LLM with counterexamples. Our results show that the most successful LLM can translate 47% of our benchmarks; we also provide insights into next steps for improvement.
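FLOURINE's actual fuzzing harness is not shown here; the sketch below only illustrates the differential-testing idea the abstract describes - feed the same inputs to the original program's behavior and its candidate translation, and report any input on which their outputs diverge as a counterexample. Both functions and the generator are hypothetical stand-ins, not the tool's API; the candidate contains a deliberate bug in Rust's `%` semantics for negative operands.

```rust
// Stand-in for the original program's I/O behavior (hypothetical example).
fn reference(x: i64) -> i64 {
    x.abs() % 7
}

// Candidate "translation" under test. Deliberate bug: for negative x,
// Rust's % yields a negative remainder, diverging from the reference.
fn candidate(x: i64) -> i64 {
    x % 7
}

// Tiny deterministic generator (linear congruential) so the sketch needs
// no external crates; a real fuzzer would use coverage-guided input mutation.
struct Lcg(u64);

impl Lcg {
    fn next(&mut self) -> i64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // Arithmetic shift keeps sign and bounds |x| <= 2^47, so abs() is safe.
        (self.0 as i64) >> 16
    }
}

// Differential loop: return the first input on which the two programs diverge.
fn find_counterexample(trials: u32) -> Option<i64> {
    let mut rng = Lcg(42);
    for _ in 0..trials {
        let x = rng.next();
        if reference(x) != candidate(x) {
            return Some(x);
        }
    }
    None
}

fn main() {
    match find_counterexample(10_000) {
        Some(x) => println!("counterexample input: {}", x),
        None => println!("no divergence found"),
    }
}
```

A found counterexample is exactly the kind of concrete input/output mismatch that can be fed back to the LLM as repair feedback.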
- “C to Go translator.” https://github.com/gotranspile/cxgo.
- “Sharpen - automated Java-to-C# conversion.” https://github.com/mono/sharpen.
- “C2rust transpiler.” https://c2rust.com/.
- Z. Tang, M. Agarwal, A. Shypula, B. Wang, D. Wijaya, J. Chen, and Y. Kim, “Explain-then-translate: an analysis on improving program translation with self-generated explanations,” in Findings of the Association for Computational Linguistics: EMNLP 2023 (H. Bouamor, J. Pino, and K. Bali, eds.), (Singapore), pp. 1741–1788, Association for Computational Linguistics, Dec. 2023.
- B. Rozière, M. Lachaux, L. Chanussot, and G. Lample, “Unsupervised translation of programming languages,” in NeurIPS, 2020.
- B. Rozière, J. Zhang, F. Charton, M. Harman, G. Synnaeve, and G. Lample, “Leveraging automated unit tests for unsupervised code translation,” in ICLR, OpenReview.net, 2022.
- M. Szafraniec, B. Roziere, H. Leather, F. Charton, P. Labatut, and G. Synnaeve, “Code translation with compiler representations,” ICLR, 2023.
- R. Pan, A. R. Ibrahimzada, R. Krishna, D. Sankar, L. P. Wassi, M. Merler, B. Sobolev, R. Pavuluri, S. Sinha, and R. Jabbarvand, “Lost in translation: A study of bugs introduced by large language models while translating code,” 2024.
- P. Jana, P. Jha, H. Ju, G. Kishore, A. Mahajan, and V. Ganesh, “Attention, compilation, and solver-based symbolic analysis are all you need,” arXiv preprint arXiv:2306.06755, 2023.
- R. Puri, D. S. Kung, G. Janssen, W. Zhang, G. Domeniconi, V. Zolotov, J. Dolby, J. Chen, M. Choudhury, L. Decker, et al., “Codenet: A large-scale ai for code dataset for learning a diversity of coding tasks,” arXiv preprint arXiv:2105.12655, 2021.
- W. U. Ahmad, M. G. R. Tushar, S. Chakraborty, and K.-W. Chang, “Avatar: A parallel corpus for java-python program translation,” arXiv preprint arXiv:2108.11590, 2021.
- J. Liu, C. S. Xia, Y. Wang, and L. Zhang, “Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation,” Advances in Neural Information Processing Systems, vol. 36, 2024.
- M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021.
- P. Deligiannis, A. Lal, N. Mehrotra, and A. Rastogi, “Fixing rust compilation errors using llms,” arXiv preprint arXiv:2308.05177, 2023.
- J. Zhang, P. Nie, J. J. Li, and M. Gligoric, “Multilingual code co-evolution using large language models,” in Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 695–707, 2023.
- Q. Zhang, J. Wang, G. H. Xu, and M. Kim, “Heterogen: transpiling c to heterogeneous hls code with automated test generation and program repair,” in Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’22, (New York, NY, USA), p. 1017–1029, Association for Computing Machinery, 2022.
- B. Mariano, Y. Chen, Y. Feng, G. Durrett, and I. Dillig, “Automated transpilation of imperative to functional code using neural-guided program synthesis,” Proceedings of the ACM on Programming Languages, vol. 6, no. OOPSLA1, pp. 1–27, 2022.
- H. F. Eniser, V. Wüstholz, and M. Christakis, “Automatically testing functional properties of code translation models,” arXiv preprint arXiv:2309.12813, 2023.
- M. Jiao, T. Yu, X. Li, G. Qiu, X. Gu, and B. Shen, “On the evaluation of neural code translation: Taxonomy and benchmark,” in 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 1529–1541, IEEE, 2023.
- H. Zhang, C. David, Y. Yu, and M. Wang, “Ownership guided C to Rust translation,” in Computer Aided Verification (CAV), vol. 13966 of LNCS, pp. 459–482, Springer, 2023.
- M. Emre, R. Schroeder, K. Dewey, and B. Hardekopf, “Translating C to safer Rust,” Proceedings of the ACM on Programming Languages, vol. 5, no. OOPSLA, pp. 1–29, 2021.
- Y. Noller, C. S. Păsăreanu, M. Böhme, Y. Sun, H. L. Nguyen, and L. Grunske, “Hydiff: Hybrid differential software analysis,” in Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, pp. 1273–1285, 2020.
- M. Böhme, B. C. d. S. Oliveira, and A. Roychoudhury, “Regression tests to expose change interaction errors,” in Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, pp. 334–344, 2013.
- H. Palikareva, T. Kuchta, and C. Cadar, “Shadow of a doubt: testing for divergences between software versions,” in Proceedings of the 38th International Conference on Software Engineering, pp. 1181–1192, 2016.
- S. Person, G. Yang, N. Rungta, and S. Khurshid, “Directed incremental symbolic execution,” Acm Sigplan Notices, vol. 46, no. 6, pp. 504–515, 2011.
- J. Guo, Y. Jiang, Y. Zhao, Q. Chen, and J. Sun, “Dlfuzz: Differential fuzzing testing of deep learning systems,” in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 739–743, 2018.
- W. Jin, A. Orso, and T. Xie, “Automated behavioral regression testing,” in 2010 Third international conference on software testing, verification and validation, pp. 137–146, IEEE, 2010.
- S. Nilizadeh, Y. Noller, and C. S. Pasareanu, “Diffuzz: differential fuzzing for side-channel analysis,” in 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pp. 176–187, IEEE, 2019.
- T. Petsios, A. Tang, S. Stolfo, A. D. Keromytis, and S. Jana, “Nezha: Efficient domain-independent differential testing,” in 2017 IEEE Symposium on security and privacy (SP), pp. 615–632, IEEE, 2017.
- W. Li, J. Ruan, G. Yi, L. Cheng, X. Luo, and H. Cai, “PolyFuzz: Holistic greybox fuzzing of Multi-Language systems,” in 32nd USENIX Security Symposium (USENIX Security 23), (Anaheim, CA), pp. 1379–1396, USENIX Association, Aug. 2023.
- J. J. Garzella, M. Baranowski, S. He, and Z. Rakamarić, “Leveraging compiler intermediate representation for multi- and cross-language verification,” in Verification, Model Checking, and Abstract Interpretation (D. Beyer and D. Zufferey, eds.), (Cham), pp. 90–111, Springer International Publishing, 2020.
- C. S. Xia, Y. Wei, and L. Zhang, “Automated program repair in the era of large pre-trained language models,” in ICSE, IEEE, 2023.
- J. Kong, M. Cheng, X. Xie, S. Liu, X. Du, and Q. Guo, “Contrastrepair: Enhancing conversation-based automated program repair via contrastive test case pairs,” arXiv preprint arXiv:2403.01971, 2024.
- H. W. Kuhn, “The hungarian method for the assignment problem,” in 50 Years of Integer Programming 1958-2008 - From the Early Years to the State-of-the-Art (M. Jünger, T. M. Liebling, D. Naddef, G. L. Nemhauser, W. R. Pulleyblank, G. Reinelt, G. Rinaldi, and L. A. Wolsey, eds.), pp. 29–47, Springer, 2010.
- E. T. Bray, “The javascript object notation (json) data interchange format,” RFC 8259, RFC Editor, 12 2017.
- K. Serebryany, “Continuous fuzzing with libfuzzer and addresssanitizer,” in 2016 IEEE Cybersecurity Development (SecDev), pp. 157–157, 2016.
- C. S. Xia and L. Zhang, “Conversational automated program repair,” arXiv preprint arXiv:2301.13246, 2023.
- “Clippy: A bunch of lints to catch common mistakes and improve your rust code.” https://rust-lang.github.io/rust-clippy/.
- O. Tange, “GNU parallel 20240122 (‘Frederik X’),” Jan. 2024. GNU Parallel is a general parallelizer to run multiple serial command line programs in parallel without changing them.
- J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
- “Claude.” https://www.anthropic.com/index/introducing-claude.
- “Gemini.” https://blog.google/technology/ai/google-gemini-ai/.
- A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. d. l. Casas, E. B. Hanna, F. Bressand, et al., “Mixtral of experts,” arXiv preprint arXiv:2401.04088, 2024.
- “Moov ach.” https://github.com/moov-io/ach.
- “S2 geometry library in go.” https://github.com/golang/geo.
- “Open source implementation of audio processing technology codec (aptx).” https://github.com/pali/libopenaptx.
- “Engine for making things with a ms-dos feel, but for modern platforms.” https://github.com/mattiasgustavsson/dos-like/blob/main/source/libs/opl.h.
- “go-gt.” https://github.com/ThePaw/go-gt.
- “String comparison and edit distance algorithms library.” https://github.com/hbollon/go-edlib.
- “2d triangulation library.” https://github.com/tchayen/triangolatte.
- S. Ouyang, J. M. Zhang, M. Harman, and M. Wang, “Llm is like a box of chocolates: the non-determinism of chatgpt in code generation,” 2023.