gpt-oss-120b Benchmark & Insights

OpenAI · Cloudflare Workers AI
Updated Mar 25, 2026
Sample size: 40 runs (in window)
Accuracy: 67.5% (consensus match · 40d)
Confidence: 95% (over 40 runs)
Window end: Mar 25, 2026 (most recent run)
Input price: $0.35/MTok (prompt tokens)
Output price: $0.75/MTok (completion tokens)
Model insights
  • 01 The weakest full-sample performer: 10 false "unsafe" calls in just 40 days, all divergences packed into a Feb–Mar stretch where it reached for "high" as readily as "low".
  • 02 Cheap, but a roughly one-in-three error rate (32.5%) makes the savings moot.
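
The arithmetic behind these figures can be sketched as follows. This is a minimal illustration, not how the page computes its numbers: 67.5% accuracy over 40 runs works out to 27 consensus matches and 13 misses, and the listed prices give a per-run cost once token counts are known. The token counts in the usage lines (2,000 prompt, 500 completion) are hypothetical, and the function names are made up for this sketch.

```python
# Figures taken from the stats listed on this page.
RUNS = 40
ACCURACY = 0.675                # consensus match rate
INPUT_PRICE_PER_MTOK = 0.35     # USD per million prompt tokens
OUTPUT_PRICE_PER_MTOK = 0.75    # USD per million completion tokens


def error_summary(runs: int, accuracy: float) -> tuple[int, int, float]:
    """Return (matches, misses, error rate) for a run count and accuracy."""
    matches = round(runs * accuracy)
    misses = runs - matches
    return matches, misses, misses / runs


def run_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Cost in USD of one run at the listed per-million-token prices."""
    return (
        prompt_tokens * INPUT_PRICE_PER_MTOK
        + completion_tokens * OUTPUT_PRICE_PER_MTOK
    ) / 1_000_000


matches, misses, error_rate = error_summary(RUNS, ACCURACY)
print(f"{matches} matches, {misses} misses, error rate {error_rate:.1%}")

# Hypothetical token counts, chosen only to illustrate the price formula.
print(f"cost per run: ${run_cost(2_000, 500):.6f}")
```

Whether a low per-run cost offsets the misses depends on what a wrong call costs downstream, which the page does not quantify.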
Recent forecasts