🆚 Claude Sonnet 5 vs. Sonnet 4.6, GLM-5.2, Kimi K2.7 Code & Qwen 3.7 Max
Complete Guide — July 1, 2026
📌 Executive Summary
Claude Sonnet 5 achieves 53 on the Artificial Analysis Intelligence Index. With max effort it improves 6 points over Sonnet 4.6, reaching the same Intelligence Index as GPT-5.5 with high reasoning. It is the #5 model on the Artificial Analysis Intelligence Index, only 2–3 points behind GPT-5.5 (xhigh) and Opus 4.8 (max).
With max effort, Sonnet 5 works harder than previous Anthropic models — using ~40% more output tokens per Intelligence Index task than Sonnet 4.6, and ~3× the agentic turns for knowledge work evaluations AA-Briefcase and GDPval-AA. This behavior scales well with the effort setting, with max effort using around 6× more turns than low effort on GDPval-AA.
Cost impact: Sonnet 5 costs $2.29 per task on the Intelligence Index — a ~2× increase compared to Sonnet 4.6 and ~15% more than Claude Opus 4.8. This is driven entirely by increased token usage.
There is a catch: Sonnet 5 uses a new tokenizer, so the same text can map to up to 1.35× more tokens than before. Anthropic set the introductory price so the switch stays roughly cost-neutral.
Part 1 — Claude Sonnet 5 vs. Claude Sonnet 4.6
📊 1.1 Benchmark Performance
Anthropic published a benchmark table comparing Sonnet 5, Sonnet 4.6, and Opus 4.8. Sonnet 5 beats its predecessor in every tested category and closes much of the gap to Opus 4.8.
| Benchmark | Sonnet 4.6 | Sonnet 5 | Opus 4.8 |
|---|---|---|---|
| SWE-bench Pro (agentic coding) | 58.1% | 63.2% | 69.2% |
| Terminal-Bench 2.1 | 67.0% | 80.4% | 82.7% |
| OSWorld-Verified (computer use) | 78.5% | 81.2% | 83.4% |
| HLE with tools | 46.8% | 57.4% | 57.9% |
| GDPval-AA v2 (knowledge work) | 1,395 | 1,618 | 1,615 |
On SWE-bench Pro, Sonnet 5 scores 63.2% compared to Sonnet 4.6's 58.1%, bringing it within striking distance of Opus 4.8's 69.2%. On Terminal-Bench 2.1, the gap narrows further: 80.4% for Sonnet 5 versus 67.0% for Sonnet 4.6 and 82.7% for Opus 4.8.
On HLE (multidisciplinary reasoning), Sonnet 5 scores 43.2% without tools and 57.4% with tools; the latter essentially matches Opus 4.8's 57.9%.
On GDPval-AA v2 (knowledge work), it scores 1,618 — surpassing Opus 4.8's 1,615 and far exceeding Sonnet 4.6's 1,395.
💰 1.2 Pricing: Sticker Price vs. Real Cost Per Task
Per-token list price:
Sonnet 5 keeps the same $15 per 1M output tokens list price as Sonnet 4.6, compared with $25 for Opus 4.8. Anthropic is also offering an introductory one-third reduction to $10 until September 1.
Cache pricing: $0.30/M for cache hits (90% discount).
Tokenizer inflation (hidden cost):
Sonnet 5 uses an updated tokenizer, the same one introduced with Opus 4.7. The same text can map to roughly 1.0 to 1.35× more tokens.
The updated tokenizer's 1.0 to 1.35× token expansion could quietly erode the pricing advantage for certain workloads — enterprise customers should run their own cost analyses rather than relying on headline per-token prices.
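The effect of tokenizer expansion on the effective price can be sketched with a few lines of arithmetic. This is a rough illustration only: it treats the quoted 10 and 15 figures as output prices in USD per 1M tokens (an assumption), and the function name `effective_price` is made up for this example.

```python
# Output prices in USD per 1M tokens, as quoted in this guide (assumed to be output prices).
INTRO_PRICE = 10.0
STANDARD_PRICE = 15.0
# Expansion range quoted for the updated tokenizer.
EXPANSION_MIN, EXPANSION_MAX = 1.0, 1.35


def effective_price(list_price: float, expansion: float) -> float:
    """Price per 1M old-tokenizer tokens once the same text maps to more tokens."""
    return list_price * expansion


for label, price in [("intro", INTRO_PRICE), ("standard", STANDARD_PRICE)]:
    low = effective_price(price, EXPANSION_MIN)
    high = effective_price(price, EXPANSION_MAX)
    print(f"{label}: ${low:.2f} to ${high:.2f} per 1M old-tokenizer tokens")
```

Under these assumptions the worst case at the intro price stays below the old $15 level, while the worst case at standard pricing rises well above it. Real workloads will land somewhere inside the range, which is why measuring your own text matters.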
Cost per completed task:
Artificial Analysis estimated Sonnet 5's operating cost at $2.29 per task — about twice Sonnet 4.6 and about 15% above Opus 4.8.
⚠️ Key takeaway: At the intro price ($10), Sonnet 5 is excellent value. At standard pricing ($15) after September 1, the higher token consumption and tokenizer inflation can make it more expensive per completed task than Opus 4.8, even though its per-token sticker price is lower.
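The per-task figures quoted from Artificial Analysis can be turned into rough implied costs for the other two models. This is a back-calculation from rounded ratios ("about twice", "about 15% above"), so the results are approximations and not published numbers.

```python
SONNET_5_COST = 2.29  # USD per Intelligence Index task, as quoted

# Both ratios are rounded in the source, so these are rough estimates only.
implied_sonnet_46 = SONNET_5_COST / 2.0    # Sonnet 5 is "about twice" Sonnet 4.6
implied_opus_48 = SONNET_5_COST / 1.15     # Sonnet 5 is "about 15% above" Opus 4.8

print(f"Implied Sonnet 4.6 cost per task: about ${implied_sonnet_46:.2f}")
print(f"Implied Opus 4.8 cost per task:   about ${implied_opus_48:.2f}")
```

The implied values come out near $1.15 and $2.00 per task, which shows how Sonnet 5 at max effort can end up costing more per task than Opus 4.8 despite a lower per-token price.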
⚙️ 1.3 Effort Levels: The Key Cost-Control Dial
The model exposes five effort levels: low, medium, high, xhigh (extra high), and max. Higher effort spends more tokens on reasoning, which raises both quality and cost.
Sonnet 5 adds an xhigh effort setting relative to Sonnet 4.6, matching the five effort levels available on Opus 4.8 (max, xhigh, high, medium, low).
Sonnet 5 is now capable of matching Opus 4.8 performance on some task categories when set to higher effort levels. Opus 4.8 remains the stronger choice for the highest-accuracy requirements on agentic search, computer use, and cybersecurity work that requires reduced guardrails. The practical recommendation: use Sonnet 5 with a high or xhigh effort setting for complex agentic work and reserve Opus 4.8 for tasks where even a small accuracy difference is costly.
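The recommendation above can be written down as a small routing helper. This is a minimal sketch, not an API client: the `Task` fields, the `recommend` function, and the model labels are hypothetical names for illustration, and the "medium" default for simpler work is an assumption rather than something the text specifies.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Task:
    complex_agentic: bool            # multi-step planning, tool use, long code edits
    accuracy_critical: bool          # even a small accuracy gap would be costly
    needs_extra_depth: bool = False  # hypothetical flag for escalating to xhigh


def recommend(task: Task) -> Tuple[str, Optional[str]]:
    """Return (model label, effort level) following the recommendation above."""
    if task.accuracy_critical:
        # The text reserves Opus 4.8 for these tasks but names no effort level.
        return ("Opus 4.8", None)
    if task.complex_agentic:
        return ("Sonnet 5", "xhigh" if task.needs_extra_depth else "high")
    # Assumption: simpler work runs at a middle setting to save tokens.
    return ("Sonnet 5", "medium")


print(recommend(Task(complex_agentic=True, accuracy_critical=False)))
```

Keeping the decision in one function makes it easy to audit which requests are allowed to reach the most expensive setting.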
✅ 1.4 When to Use Sonnet 5 / ❌ When Not To
✅ Use Sonnet 5 when:
It "can make plans, use tools like browsers and terminals, and run autonomously at a level that, just a few months ago, required larger and more expensive models."
Industry reporting highlights improvements in multi-step plan retention, tool orchestration, and longer-horizon code edits as the primary engineering gains.
Early testers told Anthropic the model finishes complex jobs where older Sonnets gave up, and that it checks its own output without being asked.
On safety, Sonnet 5 demonstrates a lower rate of undesirable behaviors than its predecessor, making it safer to use in agentic contexts. It is better at refusing malicious requests and sidestepping hijack attempts in prompt-injection attacks. It also hallucinates and engages in sycophantic behavior at a lower rate than Sonnet 4.6.
Best uses: multi-step coding, tool-use chains, agentic knowledge work, computer use, and safety-critical pipelines. It is especially attractive during the promo window (before Sept 1) and as a cost-efficient replacement for Opus 4.8 on knowledge-work tasks (GDPval-AA).
❌ Avoid Sonnet 5 (or use lower effort) when:
Companies rushed to deploy AI agents, then recoiled at the bills. Agents loop, call tools, and burn tokens fast.
- Simple, single-turn, or high-volume tasks where Sonnet 4.6 is cheaper and fast enough
- Strict latency budgets — Sonnet 5 max effort can take well over 20 minutes per task on knowledge-work evals
- After Sept 1, 2026 with tight cost-per-task budgets, where the tokenizer effect amplifies spend
- Cybersecurity tasks — Anthropic explicitly notes it "has a much lower ability to perform dangerous cybersecurity tasks than our current Opus models."
Part 2 — Sonnet 5 vs. Sonnet 4.6, GLM-5.2, Kimi K2.7 Code & Qwen 3.7 Max
📊 2.1 Full Comparison Table
| Model | AA Intelligence Index | SWE-bench Pro | Context | Open Weight? | Output price (USD per 1M tokens) |
|---|---|---|---|---|---|
| Claude Sonnet 5 | 53 | 63.2% | 1M | ❌ Proprietary | $15 (intro $10) |
| Claude Sonnet 4.6 | 47 | 58.1% | 1M | ❌ Proprietary | $15 |
| GLM-5.2 | 51 | 62.1% | 1M | ✅ MIT | ~$4.40 |
| Kimi K2.7 Code | ~42 | Proprietary benchmarks only | 256K | ✅ Mod. MIT | $4.00 |
| Qwen 3.7 Max | 56.6 | 60.6% | 1M | ❌ Proprietary | $7.50 |
⚠️ Benchmark caveat: AA Intelligence Index versions differ across models (v4.0 vs v4.1). Treat cross-version comparisons as directional guides, not precise rankings.
🧠 2.2 Model Deep-Dives
🔷 Claude Sonnet 5 — The Reliability King
Best fit: Balanced agentic reliability, safety guarantees, native tool orchestration, Claude Code ecosystem, and computer-use workflows.
It is a single model that covers a continuous cost-performance curve from light tasks to near-Opus-grade autonomous work, depending on how much compute budget a developer assigns to each call.
Anthropic is no longer presenting Sonnet as merely the "good enough" cheaper model. Sonnet 5 is built for agentic work: tasks where the model must make a plan, use tools, read files, browse, run commands, write or edit code, check the result, and continue without constant user correction.
Key trade-off: Higher "effort level" modes consume more tokens, and max-effort cost can exceed Opus 4.8 at similar quality.
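A simple cost model makes this trade-off concrete: output cost per task is roughly turns × output tokens per turn × price. The sketch below uses made-up turn and token counts (labeled hypothetical in the code) together with the $15 and $25 output prices quoted in this guide, and it ignores input tokens and caching.

```python
def task_cost(turns: int, output_tokens_per_turn: int, price_per_million: float) -> float:
    """Output-token cost of one task in USD: turns x tokens per turn x price."""
    return turns * output_tokens_per_turn * price_per_million / 1_000_000


# Hypothetical workloads: the turn and token counts are invented for illustration.
# Prices are the standard output prices quoted in this guide (USD per 1M tokens).
sonnet_5_max = task_cost(turns=60, output_tokens_per_turn=3_000, price_per_million=15.0)
opus_48 = task_cost(turns=25, output_tokens_per_turn=3_500, price_per_million=25.0)

print(f"Sonnet 5 (max effort, hypothetical): ${sonnet_5_max:.2f}")
print(f"Opus 4.8 (hypothetical):             ${opus_48:.2f}")
```

With these invented numbers the cheaper-per-token model costs more per task because it takes more turns, which is the pattern the trade-off describes.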
🔶 Claude Sonnet 4.6 — The Steady Workhorse
Best fit: Legacy systems, high-volume simple tasks, strict latency budgets, or any workload where Sonnet 4.6 was already "good enough" and you want predictable spend.
Sonnet 5 demonstrates significant improvements over Sonnet 4.6, which was released in February, on agentic capabilities such as reasoning, tool use, and software coding. Sonnet 4.6 remains the right choice when those improvements don't justify the cost premium.
🟢 GLM-5.2 — The Value King (Open Weights)
Best fit: Frontier-adjacent coding at dramatically lower cost, self-hosting, long-horizon coding agents, privacy-sensitive deployments.
GLM-5.2 is Z.ai's latest open-weight large language model, released on June 16, 2026 under an unrestricted MIT license. It features 744 billion total parameters with approximately 40 billion active parameters per token (via Mixture-of-Experts), a 1-million-token context window, and architectural innovations that make it competitive with — and in several categories better than — frontier proprietary models.
On the Artificial Analysis Intelligence Index v4.1 — which aggregates 9 evaluations including GDPval, Terminal-Bench, Humanity's Last Exam, and GPQA Diamond — GLM-5.2 scores 51, making it the leading open-weights model.
GLM-5.2 beats GPT-5.5 on SWE-bench Pro (62.1 vs 58.6), FrontierSWE (74.4% vs 72.6%), PostTrainBench, MCP-Atlas, and Design Arena.
On Terminal-Bench 2.1, it scores 81.0 — a massive jump from GLM-5.1's 63.5, within reach of Claude Opus 4.8 (85.0).
Pricing on Fireworks is $4.40 per 1M output tokens, roughly 5–7× cheaper than Opus 4.8 and GPT-5.5 on output. The model is MIT-licensed with open weights, a 1M-token context, and a 131K-token max output, and it can be run locally via 1-bit GGUF on consumer hardware.
Key trade-off: Reproducibility is contested — at least one prominent commentator calls it bench-maxxed. Validate on your own tasks before committing to production.
🟡 Kimi K2.7 Code — The Token-Efficiency King
Best fit: Long-horizon, high-volume coding agents where inference cost per turn is a primary KPI.
Kimi K2.7 Code uses a Mixture-of-Experts architecture with 1T total parameters and 32B activated parameters, 61 layers, 384 experts, MLA attention, and a 256K-token context window.
External benchmark evaluations show that Kimi K2.7 Code significantly improves instruction compliance and long-horizon coding performance compared to K2.6, while reducing overthinking tendencies by 30% on average.
Because thinking is always on and reasoning tokens are billed as output, a 30% reduction is a direct cost cut on every task — not just a quality claim.
K2.7 beat Opus 4.8 on MCP Mark Verified (81.1 vs 76.4), suggesting better tool invocation accuracy in agentic workflows. At ~6× cheaper, K2.7 is the value play for tool-heavy agent pipelines.
API output pricing is $4.00 per million tokens. Cache hits drop input to $0.19 per million.
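The claim that fewer thinking tokens is a direct cost cut can be checked with a short calculation. This sketch assumes the 4.00 figure is the output price in USD per 1M tokens, assumes the same price applies before and after the reduction, and uses hypothetical token counts for a single turn.

```python
OUTPUT_PRICE = 4.00        # USD per 1M output tokens, as quoted for Kimi K2.7 Code
THINKING_REDUCTION = 0.30  # average reduction versus K2.6 reported in the text


def turn_cost(thinking_tokens: float, answer_tokens: float) -> float:
    """Reasoning tokens are billed as output, so both counts use the output price."""
    return (thinking_tokens + answer_tokens) * OUTPUT_PRICE / 1_000_000


# Hypothetical turn: 8,000 thinking tokens and 2,000 answer tokens.
before = turn_cost(8_000, 2_000)
after = turn_cost(8_000 * (1 - THINKING_REDUCTION), 2_000)
saving = 1 - after / before

print(f"Cost per turn: ${before:.4f} before, ${after:.4f} after")
print(f"Saving on the whole turn: {saving:.0%}")
```

In this example the saving on the whole turn is smaller than 30% because only the thinking tokens shrink; the more a workload is dominated by reasoning, the closer the saving gets to the headline figure.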
Key trade-offs:
- Always-on thinking mode — mandatory for reliable agentic performance; cannot be disabled.
- Context is limited to 256K tokens; Opus 4.8's 1M window is about 4× larger.
- All published benchmark gains are from Moonshot AI's own proprietary benchmarks. The model has not been submitted to DeepSWE, an independent coding benchmark that produces a more discriminating signal for teams configuring model routing systems.
🔴 Qwen 3.7 Max — The Reasoning & Long-Autonomy King
Best fit: Math-heavy and scientific reasoning, ultra-long autonomous runs, strong coding at significantly lower output cost than Claude.
Qwen 3.7 Max ships with a 1M-token context window, a 56.6 score on the Artificial Analysis Intelligence Index, and benchmarks that top Claude Opus 4.6 Max on Terminal-Bench 2.0, SWE-Bench Pro, and MCP-Atlas.
On GPQA Diamond, Qwen 3.7 Max posts 92.4 (ahead of Claude Opus 4.6 Max at 91.3, behind GPT-5.5 at 93.6), and it scores 97.1 on HMMT 2026 February, the highest score in its comparison group.
In a 35-hour autonomous kernel optimization run, the model made 1,158 tool calls and achieved a 10× geometric mean speedup over the Triton reference implementation.
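For readers unfamiliar with the metric, a geometric mean speedup averages ratios multiplicatively, so one very fast kernel cannot dominate the result the way it would in an arithmetic mean. The per-kernel speedups below are hypothetical (the source reports only the 10× aggregate); the 35 hours and 1,158 tool calls are the figures quoted above.

```python
from statistics import geometric_mean

# Hypothetical per-kernel speedups; the source reports only the aggregate figure.
speedups = [4.0, 10.0, 25.0]

gm = geometric_mean(speedups)               # cube root of 4 * 10 * 25
arithmetic = sum(speedups) / len(speedups)  # pulled upward by the largest value

# Average spacing between tool calls over the reported 35-hour run.
seconds_per_call = 35 * 3600 / 1158

print(f"geometric mean: {gm:.1f}x, arithmetic mean: {arithmetic:.1f}x")
print(f"about {seconds_per_call:.0f} seconds per tool call on average")
```

The second figure suggests the run averaged a tool call roughly every two minutes, which gives a sense of how much unattended work the model was doing between calls.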
API pricing is $2.50 per million input tokens and $7.50 per million output tokens, with cached input dropping to $0.25/M. That makes Qwen 3.7 Max roughly half the price of Claude Opus 4.8 on input and less than a third on output.
Key trade-offs:
- Qwen 3.7 Max's low hallucination rate is partly an artifact of higher abstention: its attempt rate fell to 48.0%, the lowest among comparable frontier models. In plain English, it refuses to answer more often, which lowers wrong answers but also lowers usefulness on edge cases. That trade-off matters if you're plugging it into an agent that needs to push through ambiguity.
- Qwen 3.7 Max is closed-weights and API-only.
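The abstention trade-off can be illustrated with a small calculation. The 48.0% attempt rate is the figure quoted above; the 80% accuracy on attempted questions and the 90% comparison attempt rate are hypothetical values chosen only to show the mechanism.

```python
def outcome_rates(attempt_rate: float, accuracy_when_attempting: float) -> dict:
    """Split all questions into correct, wrong, and abstained fractions."""
    correct = attempt_rate * accuracy_when_attempting
    wrong = attempt_rate * (1 - accuracy_when_attempting)
    return {"correct": correct, "wrong": wrong, "abstained": 1 - attempt_rate}


# 0.48 is the attempt rate quoted in the text. The 0.80 accuracy and the 0.90
# comparison attempt rate are hypothetical values for illustration only.
cautious = outcome_rates(attempt_rate=0.48, accuracy_when_attempting=0.80)
eager = outcome_rates(attempt_rate=0.90, accuracy_when_attempting=0.80)

for name, rates in [("cautious", cautious), ("eager", eager)]:
    print(name, {key: round(value, 3) for key, value in rates.items()})
```

At equal accuracy, the cautious model produces fewer wrong answers but also fewer correct ones, which is why a low hallucination rate alone does not indicate usefulness inside an agent loop.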
🎯 2.3 Summary: Which Model for Which Use Case?
| Use Case | Best Pick |
|---|---|
| Agentic multi-step work (best overall reliability) | Claude Sonnet 5 |
| Long-horizon coding, large codebases (1M context) | GLM-5.2 |
| Token efficiency / cost savings in coding agent loops | Kimi K2.7 Code (–30% thinking tokens) |
| Math, science reasoning + long autonomous agent runs | Qwen 3.7 Max |
| Open-weight / self-hostable frontier coding model | GLM-5.2 (MIT) |
| Budget API at scale for coding | Kimi K2.7 or Qwen 3.7 Max |
| Safety-critical pipelines + computer use | Claude Sonnet 5 |
| Simple, high-volume, cost-sensitive single-turn tasks | Claude Sonnet 4.6 |
💡 Practical Routing Rules (Works Across All Five Models)
These rules come directly from the "why Sonnet 5 got more expensive per task" lesson: agent costs are dominated by turns × tokens per turn, not the sticker price.
- Default to balanced effort; escalate only when needed. Effort remains the recommended way of configuring model performance and latency.
- Budget for tokenizer inflation before Sept 1. Sonnet 5 uses a revised tokenizer that may process the same input as 1.0 to 1.35× as many tokens as the previous tokenizer. Developers running high-volume agentic workflows should measure their actual token consumption against the new tokenizer before standard pricing takes effect.
- Use a "cheap builder + expensive reviewer" pattern. Send cheap, high-volume work (formatting, classification, boilerplate, retrieval queries) to a lower-cost model, and escalate to a more expensive frontier model only for the small fraction of tasks that genuinely need it. Routing keeps the blended cost per request low without sacrificing quality on hard problems.
- For the introductory window: Starting June 30, Claude Sonnet 5 is the default model for free and Pro plans. At launch, it is priced at $10 per million output tokens through August 31, after which it rises to $15, still well below the $25 output pricing of Opus 4.8.
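The blended-cost argument behind the routing pattern can be sketched as follows. This assumes every request first runs on the cheap model and only a fraction is escalated afterwards; the per-request costs and the 10% escalation rate are hypothetical, and `blended_cost` is a made-up helper name.

```python
def blended_cost(cheap_cost: float, expensive_cost: float, escalation_rate: float) -> float:
    """Expected cost per request when every request runs on the cheap model
    and a fraction is then escalated to the expensive model."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return cheap_cost + escalation_rate * expensive_cost


# Hypothetical per-request costs in USD and a hypothetical 10% escalation rate.
routed = blended_cost(cheap_cost=0.02, expensive_cost=0.30, escalation_rate=0.10)
always_expensive = 0.30

print(f"routed: ${routed:.3f} per request, always expensive: ${always_expensive:.3f}")
```

The saving depends almost entirely on the escalation rate, so it is worth logging how often requests are escalated and revisiting the routing criteria when that rate drifts upward.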
🏁 Final Verdict
Claude Sonnet 5 is the model you test first when you want Opus-like follow-through but do not want Opus-level cost. It is the new default for production Claude agents, with the promo window providing excellent value through August 31.
GLM-5.2 is the open-weight value leader: GPT-5.5 trails it on SWE-bench Pro (58.6 vs 62.1) and FrontierSWE (72.6% vs 74.4%), and costs approximately 3.6× more on input and 6.8× more on output.
Kimi K2.7 Code is already one of the cheapest capable coding models available, priced at $4.00 per million output tokens and cutting thinking tokens by 30% versus K2.6.
Qwen 3.7 Max scored 56.6 on the Artificial Analysis Intelligence Index, making it the highest-ranked Chinese AI model on that leaderboard at launch. It carries a 1M-token context window and costs $2.50/M input, and Alibaba's internal testing reports a 35-hour autonomous coding run that fired 1,158 tool calls.
The bottom line: Chinese models now match or closely trail Claude Opus 4.8 and GPT-5.5 on most coding and agentic benchmarks while costing 5–30× less per token. The key differentiators for Sonnet 5 remain its safety guarantees, agentic reliability, native tooling ecosystem, computer-use capabilities, and the Claude Code integration — plus exceptional value during the introductory pricing window through August 31, 2026.
Sources: Artificial Analysis — Claude Sonnet 5 Agentic Cost Review (Jun 30, 2026); TechCrunch, VentureBeat, The Next Web, MarkTechPost, TechTimes reporting on Sonnet 5 launch (Jun 30, 2026); Anthropic Platform Docs (pricing); Flowtivity / CometAPI / Codersera / Lushbinary / VentureBeat on Kimi K2.7 Code; Flowtivity / kie.ai / avenchat / Semgrep / llm-stats on GLM-5.2; Qwen official blog, DataCamp, felloai, overchat.ai, ofox.ai on Qwen 3.7 Max.