Opus 4.7 is live: vision jumps from 54% to 98%, but a tokenizer change could silently raise costs 35%
Anthropic shipped Claude Opus 4.7 today. Pricing unchanged ($5/M input, $25/M output), API identifier claude-opus-4-7, available everywhere — Claude products, API, Bedrock, Vertex AI, Microsoft Foundry.
Looks like a "free upgrade." But dig in, and several changes will directly affect how you work. This isn't just a model ID swap.
Opus 4.6's vision was in the "technically present but not reliable" bucket — 54.5% visual accuracy, basically coin-flip territory. 4.7 jumps to 98.5%, and the max resolution now goes up to 2576px on the long edge (~3.75 megapixels), 3x higher than before.
This isn't "a bit better." This is going from toy to tool.
Concrete impact:
If you've been routing around weak vision by converting images to text first, time to re-evaluate your workflow.
4.7 substantially improved literal instruction following. Previously, Claude would "cleverly" skip parts it deemed unimportant or interpret your requirements loosely. Now it executes to the letter.
Upside: complex multi-step instructions, strict output formatting, edge case handling — all more reliable.
Downside: your existing prompts might break.
Claude used to "understand your intent" and gloss over imprecise wording. Now it does exactly what you say. If your prompts have vague phrasing, redundant instructions, or contradictory requirements — 4.6 might have been covering for you. 4.7 won't.
Migration tips:
This is a "model got smarter, but you need to get more precise" upgrade.
Official numbers:
| Benchmark | Improvement |
|---|---|
| 93-task coding benchmark | +13% |
| Rakuten-SWE-Bench (production) | 3x resolution rate |
| Multi-step workflows | +14%, fewer tool errors |
The 3x on Rakuten-SWE-Bench is the headline — these are real production tasks, not synthetic benchmarks. The +14% on multi-step workflows plus fewer tool errors means long task chains jumped a tier in reliability.
Pair this with Claude Code changes: the default effort level bumped from high to the new xhigh (between high and max), so the model spends more reasoning tokens on complex tasks. The new /ultrareview command gives you dedicated code review sessions — Pro and Max users get 3 free per month.
4.7 updated the tokenizer. The same input text now maps to roughly 1.0–1.35× the token count, depending on content type.
Pricing didn't change, but the same input eats more tokens. Real-world cost could go up 0–35%.
If you run token-budget-sensitive applications, do a token count comparison on real data before upgrading. For long-document workflows especially, a 35% bump is not a rounding error.
Document reasoning errors down 21%. BigLaw Bench legal accuracy at 90.9%.
Legal has always been an LLM weak spot — not because models are "dumb," but because legal text demands precision. The difference between "or" and "and" can flip a conclusion. A 21% error reduction is substantive progress.
Combined with the vision upgrade, contract review becomes a much more viable workflow: scanned document in → clause extraction → risk analysis, with reliability up across the whole chain.
Safety profile roughly matches 4.6: low deception, low sycophancy, improved prompt injection resistance.
But two intentional limits:
Cybersecurity capabilities deliberately reduced: Compared to Mythos Preview, 4.7 actively dialed back cyber capabilities. High-risk requests get detected and blocked automatically. Security researchers can apply to the Cyber Verification Program for legitimate access.
Controlled substance harm reduction: An acknowledged weak point — the model is sub-optimal at providing harm reduction info for controlled substances.
The official wording is "largely well-aligned and trustworthy, though not fully ideal," noting that Mythos Preview remains the best-aligned model. 4.7 made a trade-off — prioritize practical capability, maintain alignment without breaking new ground.
Anthropic explicitly says Opus 4.7 is less broadly capable than Mythos Preview, but surpasses Opus 4.6 across multiple practical benchmarks: office tasks, vision, document reasoning, long context, biology, coding, long-horizon coherence.
This hints at Anthropic's product strategy: the Mythos line pushes frontiers (broader capability, possibly more expensive or restricted), the Opus line is the workhorse (strong, stable, reasonably priced). For most users, 4.7 has more practical value than Mythos Preview.
| Your situation | Recommendation |
|---|---|
| Vision-heavy workflows (diagrams, screenshots, scans) | Upgrade now — the delta is huge |
| Long task chains / multi-step workflows | Upgrade — reliability and tool use both improved |
| Heavy Claude Code user | Upgrade — xhigh default + /ultrareview |
| Document analysis / legal | Upgrade — reasoning and precision noticeably up |
| Token-budget-sensitive high-throughput apps | Test first — tokenizer change could raise costs |
| Production systems with lots of existing prompts | Test before upgrading — instruction following change may need prompt tuning |
| Cybersecurity-adjacent needs | Skip or apply — capabilities deliberately reduced |
Opus 4.7 isn't a "bigger, pricier" upgrade. It's a "same price, re-allocated capability points" upgrade. Vision and instruction following are the biggest winners. The tokenizer change is the biggest hidden cost. Run your prompt test suite before upgrading — and if you don't have one, now's a good time to build one.