Opus 4.7 Deep Dive: What to Upgrade For, What to Watch Out For

Anthropic shipped Claude Opus 4.7 today. Pricing unchanged ($5/M input, $25/M output), API identifier claude-opus-4-7, available everywhere — Claude products, API, Bedrock, Vertex AI, Microsoft Foundry.

Looks like a "free upgrade." But dig in, and several changes will directly affect how you work. This isn't just a model ID swap.

Vision: From "Works" to "Actually Useful"

Opus 4.6's vision was in the "technically present but not reliable" bucket — 54.5% visual accuracy, basically coin-flip territory. 4.7 jumps to 98.5%, and the max resolution now goes up to 2576px on the long edge (~3.75 megapixels), 3x higher than before.

This isn't "a bit better." This is going from toy to tool.

Concrete impact:

Technical diagrams: Architecture, flow, and ER diagrams can now be fed in directly instead of being manually transcribed
Chemical structures: Molecular diagrams from papers can be read straight out
Screenshots: UI and error screenshots get significantly more reliable
Scanned documents: OCR on contracts, invoices, and scanned PDFs now has a practical baseline

If you've been routing around weak vision by converting images to text first, time to re-evaluate your workflow.

Instruction Following: Good News Is Also Bad News

4.7 substantially improved literal instruction following. Previously, Claude would "cleverly" skip parts it deemed unimportant or interpret your requirements loosely. Now it executes to the letter.

Upside: complex multi-step instructions, strict output formatting, edge case handling — all more reliable.

Downside: your existing prompts might break.

Claude used to "understand your intent" and gloss over imprecise wording. Now it does exactly what you say. If your prompts have vague phrasing, redundant instructions, or contradictory requirements — 4.6 might have been covering for you. 4.7 won't.

Migration tips:

Test existing prompts in low-stakes scenarios first
Focus on spots where "Claude seemed to be guessing your intent" before
Convert implicit expectations into explicit instructions

This is a "model got smarter, but you need to get more precise" upgrade.

Coding: Numbers and Feel

Official numbers:

Benchmark	Improvement
93-task coding benchmark	+13%
Rakuten-SWE-Bench (production)	3x resolution rate
Multi-step workflows	+14%, fewer tool errors

The 3x on Rakuten-SWE-Bench is the headline — these are real production tasks, not synthetic benchmarks. The +14% on multi-step workflows plus fewer tool errors means long task chains jumped a tier in reliability.

Pair this with Claude Code changes: the default effort level bumped from high to the new xhigh (between high and max), so the model spends more reasoning tokens on complex tasks. The new /ultrareview command gives you dedicated code review sessions — Pro and Max users get 3 free per month.

Tokenizer Changes: Same Price, More Tokens

4.7 updated the tokenizer. The same input text now maps to roughly 1.0–1.35× the token count, depending on content type.

Pricing didn't change, but the same input eats more tokens. Real-world cost could go up 0–35%.

If you run token-budget-sensitive applications, do a token count comparison on real data before upgrading. For long-document workflows especially, a 35% bump is not a rounding error.

Document Reasoning and Legal Scenarios

Document reasoning errors down 21%. BigLaw Bench legal accuracy at 90.9%.

Legal has always been an LLM weak spot — not because models are "dumb," but because legal text demands precision. The difference between "or" and "and" can flip a conclusion. A 21% error reduction is substantive progress.

Combined with the vision upgrade, contract review becomes a much more viable workflow: scanned document in → clause extraction → risk analysis, with reliability up across the whole chain.

Safety and Limits: Deliberate Trade-offs

Safety profile roughly matches 4.6: low deception, low sycophancy, improved prompt injection resistance.

But two intentional limits:

Cybersecurity capabilities deliberately reduced: Compared to Mythos Preview, 4.7 actively dialed back cyber capabilities. High-risk requests get detected and blocked automatically. Security researchers can apply to the Cyber Verification Program for legitimate access.
Controlled substance harm reduction: An acknowledged weak point — the model is sub-optimal at providing harm reduction info for controlled substances.

The official wording is "largely well-aligned and trustworthy, though not fully ideal," noting that Mythos Preview remains the best-aligned model. 4.7 made a trade-off — prioritize practical capability, maintain alignment without breaking new ground.

Relation to Mythos Preview

Anthropic explicitly says Opus 4.7 is less broadly capable than Mythos Preview, but surpasses Opus 4.6 across multiple practical benchmarks: office tasks, vision, document reasoning, long context, biology, coding, long-horizon coherence.

This hints at Anthropic's product strategy: the Mythos line pushes frontiers (broader capability, possibly more expensive or restricted), the Opus line is the workhorse (strong, stable, reasonably priced). For most users, 4.7 has more practical value than Mythos Preview.

Upgrade Decision Matrix

Your situation	Recommendation
Vision-heavy workflows (diagrams, screenshots, scans)	Upgrade now — the delta is huge
Long task chains / multi-step workflows	Upgrade — reliability and tool use both improved
Heavy Claude Code user	Upgrade — xhigh default + /ultrareview
Document analysis / legal	Upgrade — reasoning and precision noticeably up
Token-budget-sensitive high-throughput apps	Test first — tokenizer change could raise costs
Production systems with lots of existing prompts	Test before upgrading — instruction following change may need prompt tuning
Cybersecurity-adjacent needs	Skip or apply — capabilities deliberately reduced

In One Line

Opus 4.7 isn't a "bigger, pricier" upgrade. It's a "same price, re-allocated capability points" upgrade. Vision and instruction following are the biggest winners. The tokenizer change is the biggest hidden cost. Run your prompt test suite before upgrading — and if you don't have one, now's a good time to build one.