OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models (2401.06628v2)
Abstract: Advancing automated programming requires robust and comprehensive code generation benchmarks, yet current evaluation frameworks largely neglect object-oriented programming (OOP) in favor of functional programming (FP), e.g., HumanEval and MBPP. To address this, our study introduces a pioneering OOP-focused benchmark, featuring 431 Python programs that encompass essential OOP concepts and features like classes and encapsulation methods. We propose a novel evaluation metric, pass@o, tailored to OOP and extending the traditional pass@k measure. Our evaluation of 23 leading LLMs, including both general and code-specialized models, reveals three key insights: 1) pass@o offers a more relevant and comprehensive assessment for OOP code generation; 2) despite excelling in FP, code-specialized LLMs like WizardCoder lag behind models like ChatGPT in OOP; 3) the poor performance of all advanced LLMs on our OOP benchmark highlights a critical need for improvement in this field. Our benchmark and scripts are publicly released at: https://github.com/alphadl/OOP-eval.
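For context on the metric the abstract extends: pass@o builds on the standard unbiased pass@k estimator of Chen et al. (2021), which computes the probability that at least one of k samples, drawn from n generations of which c pass the tests, is correct. The sketch below implements only that baseline pass@k estimator; the OOP-specific pass@o definition is given in the paper itself, and the function name here is illustrative.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that a
    random size-k subset of n generations, c of which are correct,
    contains at least one correct sample."""
    if n - c < k:
        return 1.0  # every size-k draw must include a correct sample
    # 1 - C(n-c, k) / C(n, k), computed stably as a running product
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))
```

For example, with n = 4 samples of which c = 2 are correct, pass@2 is 1 − C(2,2)/C(4,2) = 5/6; averaging this estimator over all benchmark problems yields the reported score.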
- SantaCoder: don’t reach for the stars! arXiv preprint.
- The Falcon series of open language models. arXiv preprint.
- Program synthesis with large language models. arXiv preprint.
- Qwen technical report. arXiv preprint.
- MultiPL-E: A scalable and polyglot approach to benchmarking neural code generation. IEEE Transactions on Software Engineering.
- Evaluating large language models trained on code. arXiv preprint.
- DeepSeek LLM: Scaling open-source language models with longtermism. arXiv preprint.
- CodeScore: Evaluating code generation by learning code execution. arXiv preprint.
- CodeApex: A bilingual programming evaluation benchmark for large language models. arXiv preprint.
- Measuring coding challenge competence with APPS. arXiv preprint.
- SPoC: Search-based pseudocode to code. In NeurIPS.
- Efficient memory management for large language model serving with PagedAttention. In SOSP.
- StarCoder: may the source be with you! arXiv preprint.
- Competition-level code generation with AlphaCode. Science.
- Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In ACL.
- Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. arXiv preprint.
- Error analysis prompting enables human-like translation evaluation in large language models: A case study on ChatGPT. arXiv preprint.
- WizardCoder: Empowering code large language models with Evol-Instruct. arXiv preprint.
- Adaptive machine translation with large language models. In EAMT.
- CodeGen: An open large language model for code with multi-turn program synthesis. arXiv preprint.
- OpenAI. 2023. GPT-4 technical report. arXiv preprint.
- Training language models to follow instructions with human feedback. In NeurIPS.
- BLEU: a method for automatic evaluation of machine translation. In ACL.
- Towards making the most of ChatGPT for machine translation. arXiv preprint.
- CodeBLEU: a method for automatic evaluation of code synthesis. arXiv preprint.
- Code Llama: Open foundation models for code. arXiv preprint.
- Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint.
- Mark Stefik and Daniel G Bobrow. 1985. Object-oriented programming: Themes and variations. AI magazine.
- Bjarne Stroustrup. 1988. What is object-oriented programming? IEEE software.
- InternLM Team. 2023a. InternLM: A multilingual language model with progressively enhanced capabilities.
- MosaicML NLP Team. 2023b. Introducing MPT-7B: A new standard for open-source, commercially usable LLMs. Accessed: 2023-05-05.
- Prompt-to-OS (P2OS): Revolutionizing operating systems and human-computer interaction with integrated AI generative models. arXiv preprint.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint.
- Attention is all you need. In NeurIPS.
- Execution-based evaluation for open-domain code generation. arXiv preprint.
- Peter Wegner. 1990. Concepts and paradigms of object-oriented programming. ACM SIGPLAN OOPS Messenger.
- Emergent abilities of large language models. arXiv preprint.
- WizardLM: Empowering large language models to follow complex instructions. arXiv preprint.
- CERT: Continual pre-training on sketches for library-oriented code generation. In IJCAI.
- CodeGeeX: A pre-trained model for code generation with multilingual evaluations on HumanEval-X. arXiv preprint.
- Can ChatGPT replace StackOverflow? A study on robustness and reliability of large language model code generation. arXiv preprint.
- Can ChatGPT understand too? A comparative study on ChatGPT and fine-tuned BERT. arXiv preprint.