- The paper introduces SWE-bench, a benchmark using real GitHub issue data to evaluate LM-generated patches for complex multi-file edits.
- It employs a three-stage filtering pipeline and robust test-based evaluations to validate patch applicability and code correctness.
- Experiments show state-of-the-art LMs achieve very low resolution rates, highlighting challenges in long-context understanding and code localization.
This paper introduces SWE-bench (arXiv 2310.06770), a benchmark designed to evaluate language models (LMs) on realistic software engineering tasks sourced from real-world GitHub repositories. The authors argue that existing code generation benchmarks such as HumanEval are too simplistic: they focus on self-contained problems solvable in a few lines and fail to capture the complexity of real-world software development, which often involves navigating large codebases, understanding interactions across multiple files, and performing complex edits.
SWE-bench Construction and Task:
- Collection: A three-stage pipeline scrapes merged pull requests from 12 popular Python repositories, keeps PRs that resolve a GitHub issue and modify test files, and applies execution-based filtering to verify that the contributed tests fail before and pass after the reference patch, yielding 2,294 task instances.
- Task: Given the issue text and the full repository at the pre-PR commit, the model must generate a patch; success is judged by running the repository's unit tests.
Key Features and Challenges of SWE-bench:
- Realism: Tasks are actual bugs or feature requests submitted by users.
- Large Context: Models must process issue descriptions (avg. 195 words) and potentially navigate large codebases (avg. 438K lines, 3K files).
- Cross-Context Editing: Solutions often require coordinated changes across multiple functions, classes, and files (avg. 1.7 files, 3 functions, 32.8 lines edited in reference solutions).
- Robust Evaluation: Uses the repository's own testing framework, combining tests added specifically to address the issue (fail-to-pass tests) with existing regression tests (pass-to-pass tests); a minimal evaluation sketch follows this list.
- Updatability: The collection pipeline is largely automated, allowing continuous updates with new issues.
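The check described above can be illustrated with a short sketch. This is not the official harness; the function names and the pytest invocation are assumptions, but it shows the resolution criterion: every fail-to-pass test must pass and every pass-to-pass test must stay green after the model's patch is applied.

```python
# Minimal sketch of SWE-bench-style test evaluation (illustrative only;
# run_test/is_resolved and the pytest call are assumptions, not the official harness).
import subprocess

def run_test(repo_dir: str, test_id: str) -> bool:
    """Run one test with the repository's own framework (pytest assumed here)."""
    result = subprocess.run(
        ["python", "-m", "pytest", test_id, "-q"],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return result.returncode == 0

def is_resolved(repo_dir: str, fail_to_pass: list[str], pass_to_pass: list[str]) -> bool:
    """Resolved only if the applied patch fixes the issue's tests
    without breaking any regression tests."""
    return (all(run_test(repo_dir, t) for t in fail_to_pass)
            and all(run_test(repo_dir, t) for t in pass_to_pass))
```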
SWE-Llama:
- To evaluate open models, the authors fine-tuned CodeLlama-Python 7B and 13B models, creating SWE-Llama.
- Training Data (SWE-bench-train): A separate dataset of ~19,000 issue-PR pairs from 37 different repositories was collected, without the requirement for test file changes.
- Fine-tuning: LoRA was applied to the attention layers only. Models were trained to generate the gold patch given the issue text and the oracle-retrieved files (the files edited in the gold patch) as context, up to 30k tokens (see the sketch after this list).
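As a rough illustration of this setup, the sketch below attaches LoRA adapters to the attention projections of a CodeLlama-Python base model using Hugging Face PEFT. The rank, alpha, and dropout values are placeholders, not the paper's hyperparameters.

```python
# Illustrative LoRA setup on attention projections with Hugging Face PEFT;
# r, lora_alpha, and lora_dropout are assumed values, not the paper's settings.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-Python-hf")
lora_cfg = LoraConfig(
    r=16,                       # adapter rank (assumed)
    lora_alpha=16,              # scaling factor (assumed)
    lora_dropout=0.05,          # (assumed)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention layers only
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapters are updated
```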
Experimental Setup and Baselines:
- Retrieval: Since codebases exceed context limits, retrieval is necessary. Two methods were used:
- BM25 (Sparse): Retrieves files based on text similarity between the issue and file contents/paths. Tested with varying context token limits (13k, 27k, 50k); a retrieval sketch follows this list.
- Oracle: Retrieves only the files modified in the reference PR solution (an upper bound for retrieval).
- Models: Evaluated Claude 2, ChatGPT-3.5 (16k context), GPT-4 (32k context), and SWE-Llama 7B/13B (100k+ context).
- Input Prompt: Included instructions, issue text, retrieved file contents, and an example patch format.
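A simplified version of the sparse retrieval step is sketched below using the rank_bm25 package. The file walk and whitespace tokenizer are simplifying assumptions rather than the paper's exact preprocessing.

```python
# Minimal sketch of BM25 file retrieval over a repository (rank_bm25 package;
# the tokenizer and .py-only file walk are simplifications, not the paper's setup).
from pathlib import Path
from rank_bm25 import BM25Okapi

def retrieve_files(repo_dir: str, issue_text: str, top_k: int = 20) -> list[str]:
    paths = list(Path(repo_dir).rglob("*.py"))
    # Index each file as its path plus contents, tokenized by whitespace.
    docs = [f"{p}\n{p.read_text(errors='ignore')}".split() for p in paths]
    bm25 = BM25Okapi(docs)
    scores = bm25.get_scores(issue_text.split())
    ranked = sorted(zip(scores, paths), key=lambda x: x[0], reverse=True)
    return [str(p) for _, p in ranked[:top_k]]

# Retrieved files are then concatenated into the prompt (instructions, issue text,
# file contents, example patch format) until the context budget is exhausted.
```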
Results:
- Overall Performance: State-of-the-art models struggle significantly. The best performance using BM25 retrieval was Claude 2, resolving only 1.96% of issues.
- Oracle Retrieval: Performance improves with oracle retrieval (Claude 2 reaches 4.8%), highlighting the importance of context selection.
- Context Length: Performance generally decreases with larger BM25 context windows, suggesting models struggle to identify relevant information ("lost in the middle" effect). Performance improved significantly when using a highly condensed "oracle-collapsed" context, further indicating localization difficulties.
- SWE-Llama Performance: SWE-Llama 13B was competitive with Claude 2 in the oracle setting (3.97% vs 4.80%) but performed worse with BM25 retrieval (0.70% vs 1.96%), suggesting sensitivity to the context distribution shift from its oracle-based fine-tuning.
- Patch Applicability: While resolution rates are low, models often generate patches that apply syntactically (e.g., with BM25 retrieval, SWE-Llama 13B applied 53.6% of its patches vs. 43.1% for Claude 2). SWE-Llama models had higher application rates and required fewer post-generation fixes to apply correctly; a minimal applicability check is sketched after this list.
- Edit Complexity: Model-generated patches that successfully applied were much simpler (shorter, fewer files/functions modified) than the human-written gold patches.
- Qualitative Analysis: Models often produce syntactically correct but simplistic fixes, sometimes ignoring codebase conventions or failing to leverage existing utilities. Human solutions often involve more refactoring or structural improvements.
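The applicability check referenced above can be approximated with `git apply --check`. This is a simplification: as noted above, some generated patches only apply after minor post-generation fixes, which this sketch does not attempt.

```python
# Illustrative patch-applicability check (a simplification of the benchmark's
# handling; patches needing post-generation fixes would fail this strict check).
import subprocess

def patch_applies(repo_dir: str, patch_path: str) -> bool:
    """Return True if the model-generated diff applies cleanly to the checkout."""
    result = subprocess.run(
        ["git", "apply", "--check", patch_path],
        cwd=repo_dir, capture_output=True,
    )
    return result.returncode == 0
```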
Conclusion:
SWE-bench presents a significantly more challenging and realistic evaluation for LMs in software engineering than previous benchmarks. Current models achieve very low success rates, indicating substantial room for improvement in areas like long-context understanding, code localization, complex reasoning, and multi-file editing. The benchmark and associated resources (training data, SWE-Llama models) aim to drive progress towards LMs that are more practically useful for real-world software development.