We're the new open-source SOTA AI Agent on SWE-bench Verified.
Score: 69.8% — 349/500 tasks solved.

Key tech behind the run:
• debug_script() sub-agent using pdb
• strategic_planning() tool powered by o3
• Automated guardrails that course-correct mid-run

🧵

Refact.ai Agent is open-source, so we've made our full SWE-bench Verified pipeline available on GitHub.

You can run it end-to-end and reproduce our Agent’s approach and 69.8% score.

➡️ https://github.com/smallcloudai/refact-bench


Model setup:

• Orchestration: Claude 3.7
• debug_script(): Claude 3.7 + o4-mini
• strategic_planning(): o3
• Temp: 0 for Claude
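The setup above can be captured in a single config. A minimal sketch (the structure and key names here are hypothetical, not Refact.ai's actual config format):

```python
# Hypothetical config mirroring the model setup listed above.
AGENT_CONFIG = {
    # Claude 3.7 orchestrates the run; temperature 0 for determinism.
    "orchestration": {"model": "claude-3.7", "temperature": 0},
    "tools": {
        # debug_script() combines Claude 3.7 with o4-mini.
        "debug_script": {"models": ["claude-3.7", "o4-mini"]},
        # strategic_planning() is backed by o3.
        "strategic_planning": {"model": "o3"},
    },
}
```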

For each benchmark task, our AI Agent made one multi-step run to produce a single, correct final solution.

We introduced a new sub-agent — debug_script().

It uses pdb to debug, modify, and generate scripts, gathering:
1. Which files are affected
2. What caused the failure
3. How it might be fixed.

We required at least one and up to three calls per task.
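The real sub-agent drives pdb interactively; a minimal sketch of the three-part report it gathers might look like this, using the traceback module in place of an interactive debugger (all names are hypothetical, not Refact.ai's implementation):

```python
import traceback

def debug_report(script_source: str) -> dict:
    """Run a failing script and summarize the failure (hypothetical sketch).

    Gathers the three items debug_script() reports: which files are
    affected, what caused the failure, and a hint for how it might be fixed.
    """
    report = {"affected_files": [], "cause": None, "fix_hint": None}
    try:
        exec(compile(script_source, "<task_script>", "exec"), {})
    except Exception as exc:
        # Walk the traceback to collect every file involved in the failure.
        for frame, lineno in traceback.walk_tb(exc.__traceback__):
            report["affected_files"].append((frame.f_code.co_filename, lineno))
        report["cause"] = f"{type(exc).__name__}: {exc}"
        # A real sub-agent would ask the model for a fix; here we just
        # record the failing location as a starting point.
        report["fix_hint"] = f"inspect line {report['affected_files'][-1][1]}"
    return report
```

For example, `debug_report("x = 1/0")` reports the cause as `ZeroDivisionError: division by zero` with `<task_script>` among the affected files.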

The Agent needed to be more reliable to solve SWE-bench tasks at pass@1.

🛡️We added automatic guardrails:
A script runs static checks on model outputs. If it detects the Agent going off track, it injects mid-run helper messages (as if from a "user") to nudge it back in the right direction.

These small actions make a big difference in stability.
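A minimal sketch of the pattern, assuming regex-based checks and a chat history stored as role/content messages (the specific checks and nudges below are invented for illustration):

```python
import re

# Hypothetical static checks: each maps a pattern in the agent's output
# to a corrective message injected into the chat as if from a "user".
GUARDRAILS = [
    (re.compile(r"git\s+(push|commit)"),
     "Please don't commit; only edit files in the working tree."),
    (re.compile(r"I (cannot|can't) (solve|fix)", re.IGNORECASE),
     "Re-read the debug_script() report and try a narrower fix."),
]

def apply_guardrails(agent_output: str, chat: list) -> bool:
    """Run static checks; on a hit, append a mid-run helper message."""
    for pattern, nudge in GUARDRAILS:
        if pattern.search(agent_output):
            chat.append({"role": "user", "content": nudge})
            return True
    return False
```

Because the nudge arrives through the normal chat channel, the model treats it like any other user guidance and course-corrects without restarting the run.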

🧠The strategic_planning() tool (powered by o3) stepped in when deeper reasoning was needed.

It analyzed the debug_script() report, brainstormed the solution, and applied fixes directly — no patches or diffs.

One mandatory call per task, lean and focused.
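The flow can be sketched as a function that takes the debug report (here a plain dict), asks a reasoning model for replacement file contents, and writes them straight into the workspace. The report keys, prompt wording, and `model` callable are assumptions for illustration; in the real pipeline the model is o3:

```python
def strategic_planning(report: dict, model, workspace: dict) -> None:
    """One mandatory call: reason over the debug report, apply fixes directly.

    `model` is any callable mapping a prompt to a list of
    (path, new_content) edits. No patches or diffs: files are rewritten
    in place (`workspace` maps path -> file content).
    """
    prompt = (
        f"Failure cause: {report['cause']}\n"
        f"Affected files: {report['affected_files']}\n"
        "Propose full replacement contents for each file to fix the bug."
    )
    for path, new_content in model(prompt):
        workspace[path] = new_content  # direct write, not a diff
```

Applying whole-file rewrites instead of diffs sidesteps patch-application failures, a common source of lost pass@1 attempts.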

Before SWE-bench Verified, we applied lessons from our SOTA SWE-bench Lite run:

• Made tools more tolerant of the model’s uncertainty
• Renamed them for clarity: definition()→search_symbol_definition(), etc.
• Reduced chat compression
❌Dropped multi-step planning
• & more

What makes Refact.ai special isn’t just the score — it’s our end-to-end approach.

We build for real-world results, not just leaderboards.

Delegate your everyday programming tasks to our AI Agent, preview every step, and guide the process whenever you like.

***

🖇Explore the technical details of our setup: https://refact.ai/blog/2025/open-source-sota-on-swe-bench-verified-refact-ai/

Try Refact.ai Agent — SOTA on SWE-bench Verified — in your IDE, today:

• VS Code: http://marketplace.visualstudio.com/items?itemName=smallcloud.codify
• JetBrains: http://plugins.jetbrains.com/plugin/20647-refact-ai

Refact.ai is the #1 open-source AI Agent on SWE-bench Verified with a 69.8% score
