Papers
Topics
Authors
Recent
Detailed Answer
Quick Answer
Concise responses based on abstracts only
Detailed Answer
Well-researched responses based on abstracts and relevant paper content.
Custom Instructions Pro
Preferences or requirements that you'd like Emergent Mind to consider when generating responses
Gemini 2.5 Flash
Gemini 2.5 Flash 60 tok/s
Gemini 2.5 Pro 51 tok/s Pro
GPT-5 Medium 18 tok/s Pro
GPT-5 High 14 tok/s Pro
GPT-4o 77 tok/s Pro
Kimi K2 159 tok/s Pro
GPT OSS 120B 456 tok/s Pro
Claude Sonnet 4 38 tok/s Pro
2000 character limit reached

TaskEval: Assessing Difficulty of Code Generation Tasks for Large Language Models (2407.21227v2)

Published 30 Jul 2024 in cs.SE and cs.AI

Abstract: LLMs excel in code-related tasks like code generation, but benchmark evaluations often overlook task characteristics, such as difficulty. Moreover, benchmarks are usually built using tasks described with one single prompt, despite the formulation of prompts having a profound impact on the outcome. This paper introduces a generalist approach, TaskEval, a framework using diverse prompts and Item Response Theory (IRT) to efficiently assess LLMs' capabilities and benchmark task characteristics, improving the understanding of their performance. Using two code generation benchmarks, HumanEval+ and ClassEval, as well as 5 code generation LLMs, we show that TaskEval is capable of characterizing the properties of tasks. Using topic analysis, we identify and analyze the tasks of respectively 17 and 21 topics within the benchmarks. We also cross-analyze tasks' characteristics with programming constructs (e.g., variable assignment, conditions, etc.) used by LLMs, emphasizing some patterns with tasks' difficulty. Finally, we conduct a comparison between the difficulty assessment of tasks by human-annotators and LLMs. Orthogonal to current benchmarking evaluation efforts, TaskEval can assist researchers and practitioners in fostering better assessments of LLMs. The tasks' characteristics can be used to identify shortcomings within existing benchmarks. This could be used to generate additional related tasks for the evaluation or improvement of LLM.

Summary

We haven't generated a summary for this paper yet.

List To Do Tasks Checklist Streamline Icon: https://streamlinehq.com

Collections

Sign up for free to add this paper to one or more collections.

Lightbulb On Streamline Icon: https://streamlinehq.com

Continue Learning

We haven't generated follow-up questions for this paper yet.

Don't miss out on important new AI/ML research

See which papers are being discussed right now on X, Reddit, and more:

“Emergent Mind helps me see which AI papers have caught fire online.”

Philip

Philip

Creator, AI Explained on YouTube