The test setup
The benchmark used 100 prompts across three task classes: code generation, long-document summarisation, and structured JSON output. Both models were accessed via API with identical system prompts.
Results
| Task | Llama 3 70B | Claude Sonnet |
|---|---|---|
| Code (pass@1) | 71% | 78% |
| Summarisation | Good | Better |
| JSON fidelity | 94% | 99% |
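For readers who want to reproduce a number like the JSON fidelity row, here is a minimal sketch of one way to score it. It assumes "fidelity" means the raw model output parses as a JSON object and contains every expected top-level key; the post does not spell out its exact criterion, and the names `is_valid_json_output`, `json_fidelity`, and `required_keys` are hypothetical.

```python
import json


def is_valid_json_output(raw: str, required_keys: set) -> bool:
    """Return True if raw parses as a JSON object containing all required keys.

    Assumed criterion for illustration; the benchmark's actual check may differ.
    """
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False
    # Only a JSON object (dict) can carry the expected top-level keys.
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


def json_fidelity(outputs: list, required_keys: set) -> float:
    """Fraction of outputs that pass the validity check (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    passed = sum(is_valid_json_output(o, required_keys) for o in outputs)
    return passed / len(outputs)


if __name__ == "__main__":
    # Hypothetical sample outputs, not taken from the benchmark.
    sample_outputs = [
        '{"name": "a", "age": 1}',  # valid, all keys present
        '{"name": "b"}',            # valid JSON, missing a key
        'not json',                 # does not parse
        '[1, 2]',                   # parses, but is not an object
    ]
    print(json_fidelity(sample_outputs, {"name", "age"}))
```

A stricter variant could validate value types or a full schema; this version only checks parseability and key presence.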
Cost difference
Self-hosted Llama 3 runs at roughly $0.0004 per 1k tokens on a single A10G. Claude Sonnet costs $0.003 per 1k input tokens and $0.015 per 1k output tokens.
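To see what those rates mean for a concrete workload, the arithmetic can be sketched as below. The prices are the ones quoted above; the workload size is a made-up example, and the sketch assumes the Llama 3 figure applies equally to input and output tokens, which the quoted rate does not state explicitly.

```python
# Prices in USD per 1k tokens, as quoted in this post.
LLAMA_PER_1K = 0.0004          # assumed to cover input and output alike
CLAUDE_INPUT_PER_1K = 0.003
CLAUDE_OUTPUT_PER_1K = 0.015


def llama_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a workload on self-hosted Llama 3 at a flat per-token rate."""
    return (input_tokens + output_tokens) / 1000 * LLAMA_PER_1K


def claude_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a workload on Claude Sonnet with separate input/output rates."""
    return (
        input_tokens / 1000 * CLAUDE_INPUT_PER_1K
        + output_tokens / 1000 * CLAUDE_OUTPUT_PER_1K
    )


if __name__ == "__main__":
    # Hypothetical workload: 1M input tokens, 250k output tokens.
    inp, out = 1_000_000, 250_000
    llama = llama_cost(inp, out)
    claude = claude_cost(inp, out)
    print(f"Llama 3: ${llama:.2f}  Claude Sonnet: ${claude:.2f}  "
          f"ratio: {claude / llama:.1f}x")
```

For this example workload the totals come to roughly $0.50 for Llama 3 and $6.75 for Claude Sonnet. The gap widens as the share of output tokens grows, since output is where the Claude rate is highest.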
Takeaway
If JSON fidelity and instruction following matter most, Claude Sonnet wins: it scored 99% versus 94% on JSON fidelity and 78% versus 71% on code pass@1. If cost at scale matters and you can tolerate a few more errors, self-hosted Llama 3 is compelling.