InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models (2404.07940v3)
Abstract: Large language models for code (code LLMs) have witnessed tremendous progress in recent years. Alongside this rapid development, many popular evaluation benchmarks, such as HumanEval, DS-1000, and MBPP, have emerged to measure the performance of code LLMs, with a particular focus on code generation tasks. However, these benchmarks are insufficient to cover the full range of capabilities expected of code LLMs, which span beyond code generation to answering diverse coding-related questions. To fill this gap, we propose InfiBench, to our knowledge the first large-scale freeform question-answering (QA) benchmark for code, comprising 234 carefully selected, high-quality Stack Overflow questions spanning 15 programming languages. InfiBench uses four types of model-free automatic metrics to evaluate response correctness, with domain experts carefully concretizing the criteria for each question. We conduct a systematic evaluation of more than 100 recent code LLMs on InfiBench, leading to a series of novel and insightful findings. Our detailed analyses showcase potential directions for further advancement of code LLMs. InfiBench is fully open source at https://infi-coder.github.io/infibench and is continuously expanding to foster more scientific and systematic practices for code LLM evaluation.
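As a concrete illustration of what a model-free automatic metric can look like, the sketch below scores a response by weighted keyword matching, one plausible style of criterion for grading freeform code QA. This is a minimal sketch, not InfiBench's actual implementation: the function `keyword_match_score`, the weight scheme, and the case-insensitive matching are illustrative assumptions; in the benchmark itself, domain experts concretize the criteria per question.

```python
import re

def keyword_match_score(response: str, keywords: dict[str, float]) -> float:
    """Weighted keyword matching: the score is the fraction of total
    keyword weight found in the response. Hypothetical helper for
    illustration; real per-question criteria are authored by experts."""
    total = sum(keywords.values())
    if total == 0:
        return 0.0
    hit = sum(
        weight for kw, weight in keywords.items()
        # escape the keyword so API names like "copy()" match literally
        if re.search(re.escape(kw), response, flags=re.IGNORECASE)
    )
    return hit / total

# Example: a rubric for a question about copying Python lists.
rubric = {"copy()": 0.5, "shallow": 0.3, "deepcopy": 0.2}
print(keyword_match_score("Use list.copy(); it makes a shallow copy.", rubric))  # 0.8
```

A metric of this shape stays model-free and cheap to run at scale, which matters when grading freeform answers from more than 100 models.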
- Athiwaratkun, B. et al. Multi-lingual evaluation of code generation models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=Bo7eeXm6An8.
- Austin, J. et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
- Beeching, E. et al. Open LLM Leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023.
- Brown, T. et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
- Chen, M. et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- DeepSeek-AI. DeepSeek Coder: Let the code write itself. https://deepseekcoder.github.io/, 2023.
- Devlin, J. et al. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- Fan, A. et al. Large language models for software engineering: Survey and open problems. arXiv preprint arXiv:2310.03533, 2023.
- Gemini Team, Google. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- GitHub. GitHub Copilot - Your AI pair programmer. https://github.com/features/copilot, 2023.
- Hendrycks, D. et al. Measuring coding challenge competence with APPS. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.
- Hou, X. et al. Large language models for software engineering: A systematic literature review. arXiv preprint arXiv:2308.10620, 2023.
- Lai, Y. et al. DS-1000: A natural and reliable benchmark for data science code generation. In International Conference on Machine Learning, pp. 18319–18345. PMLR, 2023.
- Li, R. et al. StarCoder: May the source be with you! arXiv preprint arXiv:2305.06161, 2023.
- Li, Y. et al. Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097, 2022.
- Lin, C.-Y. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pp. 74–81, 2004.
- Liu, J. et al. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=1qvx610Cu7.
- Luo, Z. et al. WizardCoder: Empowering code large language models with Evol-Instruct. arXiv preprint arXiv:2306.08568, 2023.
- Muennighoff, N. et al. OctoPack: Instruction tuning code large language models. arXiv preprint arXiv:2308.07124, 2023.
- Nijkamp, E. et al. CodeGen: An open large language model for code with multi-turn program synthesis. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=iaYcJKpY2B_.
- OpenAI. GPT-4 technical report. OpenAI, 2023. URL https://cdn.openai.com/papers/gpt-4.pdf.
- Rozière, B. et al. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
- Stack Exchange. All sites - Stack Exchange, 2024. URL https://stackexchange.com/sites?view=list#users.
- Touvron, H. et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
- Touvron, H. et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
- Vaswani, A. et al. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- Zheng, Q. et al. CodeGeeX: A pre-trained model for code generation with multilingual evaluations on HumanEval-X. In KDD, 2023.