CodeMind: Evaluating Large Language Models for Code Reasoning (2402.09664v5)
Abstract: LLMs have been widely used to automate programming tasks. Their capabilities are typically evaluated by assessing the quality of generated code through tests or proofs. The extent to which they can reason about code is a critical question that reveals important insights about their true capabilities. This paper introduces CodeMind, a framework designed to gauge the code reasoning abilities of LLMs through three explicit and implicit code reasoning tasks: Independent Execution Reasoning (IER), Specification Reasoning (SR), and Dynamic Semantics Reasoning (DSR). The first evaluates the ability of LLMs to simulate the execution of a given program on given inputs and predict the output (IER). The second assesses whether LLMs can incorporate the test data provided in the specification when generating code (SR). Finally, CodeMind evaluates the ability of LLMs to understand overall code semantics given only a specific input/output pair (DSR). Our extensive evaluation of ten LLMs across four widely used benchmarks using CodeMind shows that LLMs, depending on their size and training strategy, can reason about some dynamic aspects of code. However, their performance drops for code with higher complexity, non-trivial logical and arithmetic operators, non-primitive types, and API calls. We show that these reasoning tasks evaluate LLMs differently, and that a comprehensive evaluation of code reasoning requires all of them. Finally, we show that LLMs' performance in bug repair is not correlated with any of the code reasoning tasks and that, except for advanced frontier models, LLMs do not incorporate code reasoning when performing bug repair.
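To make the IER task concrete, the sketch below shows one plausible way to score a single IER instance: execute the program to obtain the ground-truth output, prompt a model to predict the output for the same input, and compare the two. The prompt wording and the hypothetical `llm_predict` callback are assumptions for illustration, not CodeMind's actual prompts or evaluation harness.

```python
# Minimal IER-style check (illustrative sketch, not CodeMind's implementation).

def ground_truth(code: str, test_input, entry_point: str = "f"):
    """Execute the program under test and return entry_point(test_input)."""
    namespace = {}
    exec(code, namespace)                      # run the code to define the function
    return namespace[entry_point](test_input)  # compute the reference output

def ier_correct(code: str, test_input, llm_predict) -> bool:
    """Ask the model to predict the output and compare it with the real result."""
    expected = ground_truth(code, test_input)
    prompt = (
        "Given the following code and input, predict the output.\n"
        f"Code:\n{code}\nInput: {test_input!r}\nOutput:"
    )
    predicted = llm_predict(prompt)            # hypothetical LLM call
    return str(predicted).strip() == str(expected)

# Example usage with a trivial program and a stubbed model response:
sample_code = "def f(x):\n    return sorted(set(x))"
print(ier_correct(sample_code, [3, 1, 3, 2], llm_predict=lambda p: "[1, 2, 3]"))
```

In practice, the comparison would need output normalization (whitespace, value formatting) and sandboxed execution, but the core idea is simply checking the model's predicted output against the program's actual behavior.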