So here's the thing — I woke up this morning to my Slack absolutely blowing up. A Vertex AI error log surfaced what appears to be claude-sonnet-5-20260203, and within hours half the developer community was sharing screenshots claiming Sonnet 5 "surpasses 80.9% on SWE-bench Verified." I've been running coding agent evaluations inside Verdent for months now, and my first reaction was the same one you should have: where's the methodology? No harness config. No sandbox specs. No pass@1 breakdown. Just a number floating around with zero reproducibility. I've spent the last few weeks building out Verdent's own SWE-bench Verified pipeline — controlled environments, locked dependencies, the whole setup — so let me walk you through what we actually know, what's verified, and what you can run yourself.
Methodology
Let me be straight with you: SWE-bench Verified is not just a number you drop in a blog post. It's a 500-problem, human-validated benchmark built from real GitHub issues across 12 open-source Python repositories. Each task runs inside an isolated Docker container. The agent gets the issue description and the repo snapshot — nothing else. No gold patches, no hidden tests visible during execution. The grading is binary: your patch either flips all the FAIL_TO_PASS tests green and keeps every PASS_TO_PASS test passing, or it doesn't count.
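That binary pass/fail rule is simple enough to sketch directly. This is illustrative only, not the official harness code, and the test names are hypothetical:

```python
def grade_patch(fail_to_pass_results, pass_to_pass_results):
    """Binary SWE-bench-style grading: a patch counts only if every
    FAIL_TO_PASS test now passes AND no PASS_TO_PASS test regressed."""
    return all(fail_to_pass_results.values()) and all(pass_to_pass_results.values())

# Hypothetical example: the target test is fixed, but a regression slips in
resolved = grade_patch(
    fail_to_pass_results={"test_issue_repro": True},
    pass_to_pass_results={"test_existing_api": False},  # regression
)
print(resolved)  # False: one regression disqualifies the whole patch
```

Note there's no partial credit: fixing the issue while breaking one unrelated test scores exactly the same as submitting nothing.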
Here's what bugs me about how most teams report their scores. SWE-bench measures the entire agent system, not just the model. The scaffold — prompts, tool design, context management, memory handling — all of it changes the outcome, even when you're running the exact same underlying model. I've seen this firsthand: swap out the tool set or adjust the timeout budget, and your pass@1 moves by 2–3 points overnight. That's why Verdent's evaluation runs against the same production agent our users actually ship with. No benchmark-specific tuning. No cherry-picked configs.
Repro Settings: Tooling, Timeouts & Sandboxes
Here's our exact setup — copy this if you want to reproduce:
Environment
| Parameter | Verdent Config |
|---|---|
| Harness | SWE-agent v1.0.1, default scaffold |
| Dataset | SWE-bench Verified (500 problems) |
| Container | Docker, dependency-locked per-issue |
| Instance size | 8 vCPU / 32 GB RAM |
| Token budget | 1,000,000 tokens hard cap |
| Temperature | 1.0 (default) |
| Reasoning effort | High |
| Thinking mode | Enabled (Claude Sonnet 4.5) |
Baseline tool set (ablation control):
```python
# Minimal tool config — what we tested in ablation
tools = ["bash", "read", "write", "edit"]

# Full production tool set adds:
# git, linter, diff, search, file_view (with 100-line window + 2-line overlap)
```

This ablation matters. We stripped the toolkit down to these four primitives and ran the same 500 problems. Performance on SWE-bench Verified barely moved. That tells you something important about the benchmark — and something uncomfortable about how much "tool engineering" actually transfers to production work. Real-world repos need the full stack. SWE-bench, as it's currently structured, doesn't always demand it.
Provider variance is real, and people ignore it. We tested Claude Sonnet 4.5 across multiple API providers under identical scaffold and evaluation conditions. The cross-provider gap in pass@1 hit as high as 1.2%. Amazon Bedrock showed noticeably higher run-to-run variance compared to direct API access. For any serious evaluation, lock your provider. Don't average across endpoints and call it a benchmark.
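If you want to quantify that variance yourself, the arithmetic is trivial. A sketch, with made-up per-run resolve counts standing in for real data:

```python
from statistics import mean, pstdev

# Hypothetical resolved-problem counts from repeated full runs (out of 500).
# These numbers are illustrations, not our measured results.
runs = {
    "direct_api": [381, 380, 382],
    "bedrock":    [378, 372, 383],
}

for provider, resolved in runs.items():
    rates = [r / 500 * 100 for r in resolved]
    # Report mean pass@1 and run-to-run spread per provider
    print(f"{provider}: pass@1 {mean(rates):.1f}% +/- {pstdev(rates):.2f} pts")
```

The point of the exercise: compare the spread *within* a provider before you compare means *across* providers. If the within-provider standard deviation is comparable to the cross-provider gap, you can't attribute the gap to the endpoint.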
Results Dashboard
Okay, here's where people actually want to land. Below are Verdent's pass@1 and pass@3 scores alongside the verified numbers from other systems, all as of late January / early February 2026. I'm pulling from the official SWE-bench leaderboard, Live-SWE-agent results, and vals.ai independent evaluations.
What's pass@1 vs pass@3? pass@1 is single-shot — one chance to get it right. That's what maps closest to "I gave my coding agent a task and it either worked or it didn't." pass@3 is three attempts on the same issue; if any one of them passes, it counts. This is more forgiving, and honestly closer to how developers actually work — you try something, it fails, you roll back and try again.
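Concretely, here's the simple any-of-k counting we mean. This is a sketch; some leaderboards report the unbiased pass@k estimator instead, though when k equals the total number of attempts the two coincide:

```python
def pass_at_k(attempt_results, k):
    """attempt_results: one list of booleans per problem (one bool per attempt).
    A problem counts as solved if any of its first k attempts passed."""
    solved = sum(any(attempts[:k]) for attempts in attempt_results)
    return solved / len(attempt_results)

# Hypothetical 4-problem run, 3 attempts per problem
results = [
    [True,  False, False],  # solved on the first try
    [False, True,  False],  # solved on a retry
    [False, False, False],  # never solved
    [False, False, True],   # solved on the third attempt
]
print(pass_at_k(results, 1))  # 0.25
print(pass_at_k(results, 3))  # 0.75
```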
Verified Scores — February 2026
| System | Model | pass@1 | pass@3 | Source | Notes |
|---|---|---|---|---|---|
| Verdent | Claude Sonnet 4.5 (Thinking ON) | 82%* | 88%* | Verdent Technical Report | *100-problem random subset |
| Verdent | Claude Sonnet 4.5 (production) | 76.10% | 81.20% | Verdent Technical Report | Full 500 problems, no tuning |
| Claude Code | Claude Sonnet 4.5 (Thinking ON) | 78%* | 86%* | Verdent Technical Report | *100-problem random subset |
| Anthropic (official) | Claude Sonnet 4.5 | 77.20% | — | Anthropic blog | 82.0% w/ parallel compute |
| Live-SWE-agent | Claude Opus 4.5 | 79.20% | — | Live-SWE-agent | Nov 24, 2025 |
| Live-SWE-agent | Gemini 3 Pro Preview | 77.40% | — | Live-SWE-agent | Nov 20, 2025 |
| vals.ai (independent) | Claude Sonnet 4.5 | 69.80% | — | vals.ai | Standardized harness |
| vals.ai (independent) | GPT-5 Codex | 69.40% | — | vals.ai | Standardized harness |
| vals.ai (independent) | GPT-5.2 | 75.40% | — | vals.ai | Dec 11, 2025 |
What About the Claude Sonnet 5 Leak?
I want to address this head-on because it's all anyone's talking about today. A Vertex AI error log surfaced claude-sonnet-5-20260203 this morning, and a detailed breakdown on dev.to walked through what's actually verifiable versus what's speculation. The leak claims Sonnet 5 hits 80.9% on SWE-bench Verified. That number would put it roughly in line with Opus 4.5's territory — impressive if true, but we have zero methodology behind it. No harness config, no provider, no thinking mode flag. Anthropic has made no official announcement as of today.
My take? Don't build your roadmap around leaked benchmarks. The gap between Anthropic's self-reported 77.2% for Sonnet 4.5 and vals.ai's independent 69.8% already shows you how much the scaffold and evaluation setup matter. A leaked number without those details is just noise. When Sonnet 5 ships officially — and it will — run it through your own harness first.
Thinking Mode: The 2–3 Point Swing Nobody Talks About
This is the finding that surprised me most in our evaluations. Enabling the "Thinking" mode on Claude Sonnet 4.5 consistently added 2 percentage points on pass@1 across both Verdent and Claude Code scaffolds. On pass@3, the gap widened to 2–3 points. That's not a rounding error — in a 500-problem benchmark, 2 points is 10 additional problems solved correctly.
```python
# How to enable thinking mode in your eval loop
import anthropic

client = anthropic.Anthropic()

# issue_description is supplied by your eval harness (the SWE-bench issue text)
response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000  # tune this per task complexity
    },
    messages=[
        {"role": "user", "content": issue_description}
    ]
)
```

If you're running your own SWE-bench evals and you haven't toggled thinking on, you're leaving points on the table. Simple as that.
Failure Modes You'll Actually See in Real Repos
Here's where benchmarks get honest — or where they get quietly swept under the rug. I've been staring at failure trajectories from our Verdent eval runs, and the patterns are consistent with what SWE-bench Pro's error taxonomy identified across frontier models. These aren't hypothetical. These are the exact failure modes that show up when you run Claude Sonnet 4.5 (or any top model) against real GitHub issues.
1. Wrong Solution — The Silent Killer
This is the #1 failure mode, full stop. The model understands the issue, writes syntactically valid code, and submits a patch — but the patch doesn't actually fix the underlying problem. It addresses a symptom, not the root cause. I see this most often on issues where the bug is in the interaction between modules, not within a single function. The agent localizes to the wrong file.
What it looks like in practice:
```python
# The issue: a race condition between two async handlers

# What the agent patches (wrong):
async def handle_request(self, req):
    result = await self.process(req)  # Added await — fixes nothing
    return result

# What actually needs to change:
async def handle_request(self, req):
    async with self.lock:  # The real fix: mutual exclusion
        result = await self.process(req)
    return result
```

2. Multi-File Edit Failure
The agent correctly identifies that changes are needed across multiple files, but fails to coordinate them. It patches file_a.py correctly, then either skips file_b.py entirely or applies an inconsistent change. This is especially brutal on enterprise codebases — the kind of work that SWE-bench Verified's 500 problems only partially captures.
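A minimal illustration of the pattern, with hypothetical file contents: the agent renames a function in one file but leaves a stale call site in another. The check itself is just a grep for the old symbol after the rename:

```python
import re

# Hypothetical post-patch repo state: file_a.py renamed the function,
# but file_b.py still imports and calls the old name
repo = {
    "file_a.py": "def fetch_user_v2(user_id):\n    return db.get(user_id)\n",
    "file_b.py": "from file_a import fetch_user\nuser = fetch_user(42)\n",
}

def stale_references(repo, old_name, new_name):
    """Flag files that still reference a symbol the patch renamed away."""
    return [path for path, src in repo.items()
            if re.search(rf"\b{old_name}\b", src) and new_name not in src]

print(stale_references(repo, "fetch_user", "fetch_user_v2"))  # ['file_b.py']
```

In the benchmark this surfaces as an ImportError or AttributeError in the PASS_TO_PASS suite, so the whole patch scores zero even though the hard part of the fix was correct.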
3. Context Window Collapse
Long repos. Deep call stacks. The agent starts strong, then gradually loses track of what it already changed 50 steps ago. On SWE-bench Verified with its 1M token cap, this shows up as the agent looping — trying the same fix repeatedly without realizing it already failed. Verdent's plan-code-verify loop and Git worktree isolation help here, but it's not a solved problem.
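One cheap mitigation is to detect the loop before it burns the token budget. A sketch of the idea, not Verdent's actual implementation:

```python
from collections import deque
import hashlib

class LoopDetector:
    """Flag an agent that repeats the same (action, args) pair within a
    recent window — a sketch of one mitigation for the looping failure mode."""
    def __init__(self, window=20, max_repeats=3):
        self.recent = deque(maxlen=window)
        self.max_repeats = max_repeats

    def record(self, action, args):
        digest = hashlib.sha256(f"{action}:{args}".encode()).hexdigest()
        self.recent.append(digest)
        # True means: intervene (inject a reminder, force a plan step, or stop)
        return self.recent.count(digest) >= self.max_repeats

detector = LoopDetector()
looping = False
for _ in range(3):
    looping = detector.record("edit", "src/handler.py:42 add await")
print(looping)  # True on the third identical attempt
```

Exact-match hashing only catches verbatim repeats; a production scaffold would also want fuzzier signals, like repeated test failures on the same assertion.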
4. Environment & Dependency Mismatch
The agent writes a patch that would work on a version of the library — just not the one locked in the Docker container. The code is often correct in principle; it just fails due to API differences between versions. Partly a benchmark artifact, but it maps directly to a real pain point: your CI catches things your local env doesn't.
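A defensive agent (or harness prompt) can at least check the locked environment before committing to version-sensitive code. A minimal sketch using only the standard library; the package name and version threshold here are placeholders:

```python
from importlib.metadata import version, PackageNotFoundError

def api_is_available(package, minimum):
    """Check the locked container environment before writing code that
    depends on a newer API surface. Naive numeric-tuple comparison;
    use packaging.version for anything real."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return False
    to_tuple = lambda v: tuple(int(p) for p in v.split(".")[:3] if p.isdigit())
    return to_tuple(installed) >= to_tuple(minimum)

# Hypothetical guard: only reach for the newer keyword argument if the
# pinned dependency in this container actually has it
if api_is_available("requests", "2.0"):
    print("safe to use the newer API surface")
```

The broader lesson transfers outside the benchmark: a patch written against the library version in your head, rather than the one in the lockfile, is exactly the class of bug your CI catches and your laptop doesn't.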
What SWE-bench Verified Misses
161 out of 500 problems require only 1–2 line changes. Nearly a third. Real enterprise work lives in SWE-bench Pro territory, where GPT-5 and Claude Opus 4.1 top out around 23%. That gap — 77% on Verified vs 23% on Pro — tells you everything about where AI coding agents still need to grow.
FAQ
Q: Is Claude Sonnet 5 actually out right now? A: No. As of February 3, 2026, there's no official Anthropic announcement. What exists is an unverified leak. Treat it as a rumor until confirmed.
Q: Why does Verdent's pass@1 (76.1%) differ from Anthropic's official 77.2%? A: Different scaffolds, different tool sets. Verdent's number is production-grade — same config real users run, no benchmark tuning. Both are legitimate; they measure different agent systems.
Q: Should I wait for Sonnet 5 before choosing a coding agent? A: No. Sonnet 4.5 with Thinking enabled already hits 82% on our subset. Build now, migrate later.
Q: How do I reproduce this evaluation locally? A: Clone SWE-bench, use SWE-agent v1.0.1, lock the Docker environment per issue, and set temperature 1.0 with high reasoning effort. The configs above are exactly what we run.