# DeepSeek V4 Flash benchmark checklist

> Benchmark the actual OpenClaw prompt stack, not a generic chat prompt.

Benchmarks should prove that the Flash-first route holds up under realistic OpenClaw and production workloads. Use closed models as comparison references, not as the center of the page.
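In practice, that means the harness replays the same system prompt, conversation history, and tool schemas the agent actually sends. The sketch below shows the shape of such a request; `OPENCLAW_SYSTEM_PROMPT`, `PRIOR_TURNS`, `TOOL_SCHEMAS`, and the `v4-flash` model id are hypothetical stand-ins for whatever OpenClaw really injects.

```python
OPENCLAW_SYSTEM_PROMPT = "..."   # stand-in for the real agent system prompt
PRIOR_TURNS: list[dict] = []     # replayed conversation history
TOOL_SCHEMAS: list[dict] = []    # the agent's tool definitions

def openclaw_style_request(task: str) -> dict:
    # A generic chat prompt skips exactly the tokens that dominate
    # agent cost: the system prompt, tool schemas, and prior turns.
    return {
        "model": "v4-flash",  # placeholder model id
        "messages": [
            {"role": "system", "content": OPENCLAW_SYSTEM_PROMPT},
            *PRIOR_TURNS,
            {"role": "user", "content": task},
        ],
        "tools": TOOL_SCHEMAS,
    }
```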
## Evaluation matrix

### Compare by workload
The matrix makes the site more crawlable and gives readers a concrete testing plan: each row names a primary model, the references to compare against, and what to measure.
| Workload | Primary model | Compare with | Measure |
|---|---|---|---|
| OpenClaw agent planning | V4 Flash | V4 Pro, GPT | Completion rate, retries, cost per solved task |
| Retrieval answer generation | V4 Flash | Claude, GPT | Citation accuracy, unsupported claims, cache-hit ratio |
| Code explanation batches | V4 Flash | V4 Pro, Claude | Developer acceptance, follow-up turns, token cost |
| Multimodal or Google workflow | Gemini | V4 Flash for text-only steps | Modality coverage, handoff cost, latency |
| Realtime xAI ecosystem work | Grok | V4 Flash for non-realtime tasks | Freshness need, tool result quality, routing overhead |
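To turn the Measure column into numbers, a scoring pass over per-task run logs is enough. This is a minimal sketch: the `TaskRun` fields and the per-million-token prices are placeholder assumptions, not DeepSeek's published schema or rate card.

```python
from dataclasses import dataclass

@dataclass
class TaskRun:
    solved: bool          # did the agent complete the task?
    retries: int          # extra attempts after the first response
    input_tokens: int     # prompt tokens across all attempts
    cached_tokens: int    # input tokens served from the prompt cache
    output_tokens: int    # completion tokens across all attempts

# Placeholder per-million-token prices; substitute the real rate card.
PRICE_IN, PRICE_CACHED, PRICE_OUT = 0.27, 0.07, 1.10

def score_workload(runs: list[TaskRun]) -> dict[str, float]:
    solved = sum(1 for r in runs if r.solved)
    cost = sum(
        (r.input_tokens - r.cached_tokens) * PRICE_IN / 1e6
        + r.cached_tokens * PRICE_CACHED / 1e6
        + r.output_tokens * PRICE_OUT / 1e6
        for r in runs
    )
    return {
        "completion_rate": solved / len(runs),
        "mean_retries": sum(r.retries for r in runs) / len(runs),
        # Failed tasks still burn tokens, so total spend is divided
        # by solved tasks only: failures inflate this number.
        "cost_per_solved_task": cost / max(solved, 1),
    }
```

Dividing total spend by solved tasks, rather than all tasks, is what makes retry-heavy models look as expensive as they really are.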
## Checklist

### What to record
Cost is only one column. In OpenClaw and agent workflows, repeated prompts, retries, and escalations decide the real cost, so record them directly (a sketch follows the list):

- Track cache-hit ratio separately from raw input-token volume.
- Score retry cost and failed-turn cost, not only first-response price.
- Include a Pro escalation lane so Flash is tested as the default, not the only model.
- Keep provider-specific strengths visible without weakening the Flash headline.
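A minimal sketch of the first three items, assuming injected `flash`/`pro` callables and a hypothetical `Result` shape; none of this is a real client API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Result:
    passed_checks: bool   # e.g. schema valid, tool call well-formed
    input_tokens: int
    cached_tokens: int

@dataclass
class LaneStats:
    calls: int = 0
    escalations: int = 0
    input_tokens: int = 0
    cached_tokens: int = 0

    def record(self, r: Result) -> None:
        self.calls += 1
        self.input_tokens += r.input_tokens
        self.cached_tokens += r.cached_tokens

    @property
    def cache_hit_ratio(self) -> float:
        # Reported separately from raw input volume: a workload can
        # grow in tokens while its cache hits quietly collapse.
        return self.cached_tokens / max(self.input_tokens, 1)

def run_turn(task: str,
             flash: Callable[[str], Result],
             pro: Callable[[str], Result],
             stats: LaneStats) -> Result:
    # Flash is the default lane; Pro is an escalation, not a rival.
    result = flash(task)
    stats.record(result)
    if not result.passed_checks:
        stats.escalations += 1
        result = pro(task)
        stats.record(result)
    return result
```

Keeping escalations visible in `LaneStats` lets the benchmark report how often Flash needed backup, instead of quietly blending Pro's quality into Flash's numbers.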