- The paper introduces SWE-bench, a benchmark using real GitHub issue data to evaluate LM-generated patches for complex multi-file edits.
- It employs a three-stage filtering pipeline and robust test-based evaluations to validate patch applicability and code correctness.
- Experiments show state-of-the-art LMs achieve very low resolution rates, highlighting challenges in long-context understanding and code localization.
This paper introduces SWE-bench (arXiv 2310.06770), a benchmark designed to evaluate language models (LMs) on realistic software engineering tasks sourced from real-world GitHub repositories. The authors argue that existing code generation benchmarks such as HumanEval are too simplistic: they focus on self-contained problems solvable in a few lines and fail to capture the complexity of real-world software development, which often involves navigating large codebases, understanding interactions across multiple files, and performing complex edits.
SWE-bench Construction and Task:
- Collection: A three-stage pipeline scrapes merged pull requests from 12 popular Python repositories, keeps PRs that resolve a GitHub issue and modify test files, and applies execution-based filtering to verify that the contributed tests fail before and pass after the reference patch, yielding 2,294 task instances.
- Task: Given the issue text and the full repository at the pre-PR commit, the model must generate a patch; success is judged by running the repository's unit tests.
Key Features and Challenges of SWE-bench:
- Realism: Tasks are actual bugs or feature requests submitted by users.
- Large Context: Models must process issue descriptions (avg. 195 words) and potentially navigate large codebases (avg. 438K lines, 3K files).
- Cross-Context Editing: Solutions often require coordinated changes across multiple functions, classes, and files (avg. 1.7 files, 3 functions, 32.8 lines edited in reference solutions).
- Robust Evaluation: Uses the repository's own testing framework, combining tests added specifically to address the issue (fail-to-pass tests) with existing regression tests (pass-to-pass tests); a minimal evaluation sketch follows this list.
- Updatability: The collection pipeline is largely automated, allowing continuous updates with new issues.
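The check described above can be illustrated with a short sketch. This is not the official harness; the function names and the pytest invocation are assumptions, but it shows the resolution criterion: every fail-to-pass test must pass and every pass-to-pass test must stay green after the model's patch is applied.

```python
# Minimal sketch of SWE-bench-style test evaluation (illustrative only;
# run_test/is_resolved and the pytest call are assumptions, not the official harness).
import subprocess

def run_test(repo_dir: str, test_id: str) -> bool:
    """Run one test with the repository's own framework (pytest assumed here)."""
    result = subprocess.run(
        ["python", "-m", "pytest", test_id, "-q"],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return result.returncode == 0

def is_resolved(repo_dir: str, fail_to_pass: list[str], pass_to_pass: list[str]) -> bool:
    """Resolved only if the applied patch fixes the issue's tests
    without breaking any regression tests."""
    return (all(run_test(repo_dir, t) for t in fail_to_pass)
            and all(run_test(repo_dir, t) for t in pass_to_pass))
```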
SWE-Llama:
- To evaluate open models, the authors fine-tuned CodeLlama-Python 7B and 13B models, creating SWE-Llama.
- Training Data (SWE-bench-train): A separate dataset of ~19,000 issue-PR pairs from 37 different repositories was collected, without the requirement for test file changes.
- Fine-tuning: LoRA was applied to the attention layers only. Models were trained to generate the gold patch given the issue text and the oracle-retrieved files (the files edited in the gold patch) as context, up to 30k tokens (see the sketch after this list).
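As a rough illustration of this setup, the sketch below attaches LoRA adapters to the attention projections of a CodeLlama-Python base model using Hugging Face PEFT. The rank, alpha, and dropout values are placeholders, not the paper's hyperparameters.

```python
# Illustrative LoRA setup on attention projections with Hugging Face PEFT;
# r, lora_alpha, and lora_dropout are assumed values, not the paper's settings.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-Python-hf")
lora_cfg = LoraConfig(
    r=16,                       # adapter rank (assumed)
    lora_alpha=16,              # scaling factor (assumed)
    lora_dropout=0.05,          # (assumed)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention layers only
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapters are updated
```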
Experimental Setup and Baselines:
- Retrieval: Since codebases exceed context limits, retrieval is necessary. Two methods were used:
- BM25 (Sparse): Retrieves files based on text similarity between the issue and file contents/paths. Tested with varying context token limits (13k, 27k, 50k); a retrieval sketch follows this list.
- Oracle: Retrieves only the files modified in the reference PR solution (an upper bound for retrieval).
- Models: Evaluated Claude 2, ChatGPT-3.5 (16k context), GPT-4 (32k context), and SWE-Llama 7B/13B (100k+ context).
- Input Prompt: Included instructions, issue text, retrieved file contents, and an example patch format.
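A simplified version of the sparse retrieval step is sketched below using the rank_bm25 package. The file walk and whitespace tokenizer are simplifying assumptions rather than the paper's exact preprocessing.

```python
# Minimal sketch of BM25 file retrieval over a repository (rank_bm25 package;
# the tokenizer and .py-only file walk are simplifications, not the paper's setup).
from pathlib import Path
from rank_bm25 import BM25Okapi

def retrieve_files(repo_dir: str, issue_text: str, top_k: int = 20) -> list[str]:
    paths = list(Path(repo_dir).rglob("*.py"))
    # Index each file as its path plus contents, tokenized by whitespace.
    docs = [f"{p}\n{p.read_text(errors='ignore')}".split() for p in paths]
    bm25 = BM25Okapi(docs)
    scores = bm25.get_scores(issue_text.split())
    ranked = sorted(zip(scores, paths), key=lambda x: x[0], reverse=True)
    return [str(p) for _, p in ranked[:top_k]]

# Retrieved files are then concatenated into the prompt (instructions, issue text,
# file contents, example patch format) until the context budget is exhausted.
```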
Results:
- Overall Performance: State-of-the-art models struggle significantly. The best performance using BM25 retrieval was Claude 2, resolving only 1.96% of issues.
- Oracle Retrieval: Performance improves with oracle retrieval (Claude 2 reaches 4.8%), highlighting the importance of context selection.
- Context Length: Performance generally decreases with larger BM25 context windows, suggesting models struggle to identify relevant information ("lost in the middle" effect). Performance improved significantly when using a highly condensed "oracle-collapsed" context, further indicating localization difficulties.
- SWE-Llama Performance: SWE-Llama 13B was competitive with Claude 2 in the oracle setting (3.97% vs 4.80%) but performed worse with BM25 retrieval (0.70% vs 1.96%), suggesting sensitivity to the context distribution shift from its oracle-based fine-tuning.
- Patch Applicability: While resolution rates are low, models often generate patches that apply syntactically (e.g., with BM25 retrieval, SWE-Llama 13B applied 53.6% of its patches vs. 43.1% for Claude 2). SWE-Llama models had higher application rates and required fewer post-generation fixes to apply correctly; a minimal applicability check is sketched after this list.
- Edit Complexity: Model-generated patches that successfully applied were much simpler (shorter, fewer files/functions modified) than the human-written gold patches.
- Qualitative Analysis: Models often produce syntactically correct but simplistic fixes, sometimes ignoring codebase conventions or failing to leverage existing utilities. Human solutions often involve more refactoring or structural improvements.
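The applicability check referenced above can be approximated with `git apply --check`. This is a simplification: as noted above, some generated patches only apply after minor post-generation fixes, which this sketch does not attempt.

```python
# Illustrative patch-applicability check (a simplification of the benchmark's
# handling; patches needing post-generation fixes would fail this strict check).
import subprocess

def patch_applies(repo_dir: str, patch_path: str) -> bool:
    """Return True if the model-generated diff applies cleanly to the checkout."""
    result = subprocess.run(
        ["git", "apply", "--check", patch_path],
        cwd=repo_dir, capture_output=True,
    )
    return result.returncode == 0
```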
Conclusion:
SWE-bench presents a significantly more challenging and realistic evaluation for LMs in software engineering than previous benchmarks. Current models achieve very low success rates, indicating substantial room for improvement in areas like long-context understanding, code localization, complex reasoning, and multi-file editing. The benchmark and associated resources (training data, SWE-Llama models) aim to drive progress towards LMs that are more practically useful for real-world software development.