SWE-bench
📣 New benchmark: CodeClash (website, github) evaluates SWE agents on goals, not tasks
📣 New: Meet mini, the 100-line AI agent that still gets 65% on SWE-bench Verified!
Software engineering agents, benchmarks, and models.
Built and maintained by researchers from Stanford University and Princeton University.
This organization contains the source code for several projects in the SWE-* open source ecosystem, including:
- SWE-bench, a benchmark for evaluating AI systems on real world GitHub issues.
- SWE-agent, a system that automatically solves GitHub issues using an LM agent.
- SWE-smith, a toolkit for generating SWE training data at scale.
- mini, an AI agent written in just 100 lines of code that scores >70% on SWE-bench Verified.
Also check out the supporting infrastructure for working with SWE-* projects
Repositories
- SWE-smith: [NeurIPS 2025 D&B Spotlight] Scaling Data for SWE-agents
- SWE-bench: SWE-bench: Can Language Models Resolve Real-world Github Issues?
- experiments: Open sourced predictions, execution logs, trajectories, and results from model inference + evaluation runs on the SWE-bench task.
- SWE-smith-envs: Artifacts for building environments (Docker images) for repositories represented in SWE-smith
- reading-list: Academic papers and works related to SWE-bench and SWE-agents
- humanevalfix-results (archived): Evaluation data + results for SWE-agent inference on HumanEvalFix task