OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models (2401.06628v2)
Abstract: Advancing automated programming necessitates robust and comprehensive code generation benchmarks, yet current evaluation frameworks largely neglect object-oriented programming (OOP) in favor of functional programming (FP), e.g., HumanEval and MBPP. To address this, our study introduces a pioneering OOP-focused benchmark, featuring 431 Python programs that encompass essential OOP concepts and features like classes and encapsulation methods. We propose a novel evaluation metric, pass@o, tailored for OOP, enhancing traditional pass@k measures. Our evaluation of 23 leading LLMs, including both general and code-specialized models, reveals three key insights: 1) pass@o offers a more relevant and comprehensive assessment for OOP code generation; 2) despite excelling in FP, code-specialized LLMs such as WizardCoder lag behind models like ChatGPT in OOP; 3) the poor performance of all advanced LLMs on our OOP benchmark highlights a critical need for improvement in this field. Our benchmark and scripts are publicly released at: https://github.com/alphadl/OOP-eval.
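The pass@o metric is presented as an OOP-tailored enhancement of the traditional pass@k measure; its exact definition is given in the paper itself. As a reference point, the following is a minimal sketch of the conventional unbiased pass@k estimator that pass@o builds on, as popularized by the HumanEval line of work. The function name and interface here are illustrative and are not taken from the released scripts.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    Given n generated samples per problem, of which c pass all unit
    tests, estimate the probability that at least one of k randomly
    drawn samples is correct:

        pass@k = 1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k must
        # contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 samples, 2 correct, k=1 -> expected success rate 0.2.
print(pass_at_k(10, 2, 1))
```

Averaging this quantity over all problems in a benchmark yields the reported pass@k score; pass@o, as described in the abstract, adapts this style of execution-based scoring to OOP-specific test signals.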
References
- SantaCoder: don't reach for the stars! arXiv preprint.
- The Falcon series of open language models. arXiv preprint.
- Program synthesis with large language models. arXiv preprint.
- Qwen technical report. arXiv preprint.
- MultiPL-E: a scalable and polyglot approach to benchmarking neural code generation. IEEE Transactions on Software Engineering.
- Evaluating large language models trained on code. arXiv preprint.
- DeepSeek LLM: scaling open-source language models with longtermism. arXiv preprint.
- CodeScore: evaluating code generation by learning code execution. arXiv preprint.
- CodeApex: a bilingual programming evaluation benchmark for large language models. arXiv preprint.
- Measuring coding challenge competence with APPS. arXiv preprint.
- SPoC: search-based pseudocode to code. In NeurIPS.
- Efficient memory management for large language model serving with PagedAttention. In SOSP.
- StarCoder: may the source be with you! arXiv preprint.
- Competition-level code generation with AlphaCode. Science.
- Chin-Yew Lin. 2004. ROUGE: a package for automatic evaluation of summaries. In ACL.
- Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. arXiv preprint.
- Error analysis prompting enables human-like translation evaluation in large language models: a case study on ChatGPT. arXiv preprint.
- WizardCoder: empowering code large language models with Evol-Instruct. arXiv preprint.
- Adaptive machine translation with large language models. In EAMT.
- CodeGen: an open large language model for code with multi-turn program synthesis. arXiv preprint.
- OpenAI. 2023. GPT-4 technical report. arXiv preprint.
- Training language models to follow instructions with human feedback. In NeurIPS.
- BLEU: a method for automatic evaluation of machine translation. In ACL.
- Towards making the most of ChatGPT for machine translation. arXiv preprint.
- CodeBLEU: a method for automatic evaluation of code synthesis. arXiv preprint.
- Code Llama: open foundation models for code. arXiv preprint.
- Beyond the imitation game: quantifying and extrapolating the capabilities of language models. arXiv preprint.
- Mark Stefik and Daniel G. Bobrow. 1985. Object-oriented programming: themes and variations. AI Magazine.
- Bjarne Stroustrup. 1988. What is object-oriented programming? IEEE Software.
- InternLM Team. 2023a. InternLM: a multilingual language model with progressively enhanced capabilities.
- MosaicML NLP Team. 2023b. Introducing MPT-7B: a new standard for open-source, commercially usable LLMs. Accessed: 2023-05-05.
- Prompt-to-OS (P2OS): revolutionizing operating systems and human-computer interaction with integrated AI generative models. arXiv preprint.
- Llama 2: open foundation and fine-tuned chat models. arXiv preprint.
- Attention is all you need. In NeurIPS.
- Execution-based evaluation for open-domain code generation. arXiv preprint.
- Peter Wegner. 1990. Concepts and paradigms of object-oriented programming. ACM SIGPLAN OOPS Messenger.
- Emergent abilities of large language models. arXiv preprint.
- WizardLM: empowering large language models to follow complex instructions. arXiv preprint.
- CERT: continual pre-training on sketches for library-oriented code generation. In IJCAI.
- CodeGeeX: a pre-trained model for code generation with multilingual evaluations on HumanEval-X. arXiv preprint.
- Can ChatGPT replace Stack Overflow? A study on robustness and reliability of large language model code generation. arXiv preprint.
- Can ChatGPT understand too? A comparative study on ChatGPT and fine-tuned BERT. arXiv preprint.