gpt-oss-120b Benchmark & Insights
OpenAI · Cloudflare Workers AI
Updated Mar 25, 2026
Sample size: 40 runs (in window)
Accuracy: 67.5% (consensus match · 40d)
Confidence: 95% (over 40 runs)
Window end: Mar 25, 2026 (most recent run)
Input price: $0.35/MTok (prompt tokens)
Output price: $0.75/MTok (completion tokens)
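To put the per-token prices in concrete terms, here is a minimal sketch that converts them into a per-request cost. The function name `request_cost` and the token counts in the usage line are hypothetical, chosen only for illustration; the two rates are the ones listed above.

```python
# Prices in USD per million tokens, taken from the figures listed above.
INPUT_PRICE_PER_MTOK = 0.35
OUTPUT_PRICE_PER_MTOK = 0.75


def request_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Return the cost in USD of a single request at the listed rates."""
    total = (prompt_tokens * INPUT_PRICE_PER_MTOK
             + completion_tokens * OUTPUT_PRICE_PER_MTOK)
    return total / 1_000_000


# Hypothetical request size (an assumption, not a measured value).
cost = request_cost(prompt_tokens=2_000, completion_tokens=500)
print(f"${cost:.6f} per request")
```

At these rates, output tokens cost a little more than twice as much as input tokens, so completion length tends to dominate the bill for long answers.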
Model insights
- 01 The weakest full-sample performer: 10 false "unsafe" calls in just 40 days, all divergences packed into a Feb–Mar stretch where it reached for "high" as readily as "low".
- 02 Cheap, but a roughly one-in-three error rate (32.5%) makes the savings moot.
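The "one-in-three" figure follows from the headline numbers. The sketch below works through the arithmetic using only values stated on this page (40 runs, 67.5% accuracy, 10 false "unsafe" calls); the function name `error_breakdown` is hypothetical, and how the remaining divergences are classified is not stated in the source.

```python
def error_breakdown(runs: int, accuracy: float, false_unsafe: int) -> dict:
    """Derive divergence counts from the sample size and accuracy."""
    correct = round(runs * accuracy)      # runs that matched consensus
    divergences = runs - correct          # runs that did not match
    return {
        "correct": correct,
        "divergences": divergences,
        "error_rate": divergences / runs,
        # Divergences not accounted for by false "unsafe" calls; the
        # source does not say how these are classified.
        "other_divergences": divergences - false_unsafe,
    }


stats = error_breakdown(runs=40, accuracy=0.675, false_unsafe=10)
print(stats)
```

With these inputs, 27 of 40 runs match consensus and 13 diverge, which gives an error rate of 32.5%.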
Recent forecasts
| Date | Conf. | Risk | Safe |
|---|---|---|---|
| Mar 25, 2026 | 95% | low | Safe |
| Mar 24, 2026 | 98% | low | Safe |
| Mar 23, 2026 | 97% | high | Unsafe |
| Mar 22, 2026 | 85% | medium | Unsafe |
| Mar 21, 2026 | 95% | medium | Unsafe |
| Mar 20, 2026 | 96% | medium | Safe |
| Mar 19, 2026 | 90% | low | Safe |
| Mar 18, 2026 | 98% | low | Safe |
| Mar 17, 2026 | 93% | medium | Unsafe |
| Mar 16, 2026 | 97% | high | Unsafe |
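As a rough way to read the ten forecasts above, the following sketch tallies verdicts and risk levels and averages the stated confidence. The rows are copied from the listing; the names `FORECASTS` and `summarize` are hypothetical, and this is a summary of the ten rows shown, not of the full 40-run window.

```python
from collections import Counter
from statistics import mean

# (date, confidence %, risk, verdict), copied from the forecasts above.
FORECASTS = [
    ("Mar 25, 2026", 95, "low", "Safe"),
    ("Mar 24, 2026", 98, "low", "Safe"),
    ("Mar 23, 2026", 97, "high", "Unsafe"),
    ("Mar 22, 2026", 85, "medium", "Unsafe"),
    ("Mar 21, 2026", 95, "medium", "Unsafe"),
    ("Mar 20, 2026", 96, "medium", "Safe"),
    ("Mar 19, 2026", 90, "low", "Safe"),
    ("Mar 18, 2026", 98, "low", "Safe"),
    ("Mar 17, 2026", 93, "medium", "Unsafe"),
    ("Mar 16, 2026", 97, "high", "Unsafe"),
]


def summarize(rows):
    """Count verdicts and risk levels, and average the confidence."""
    verdicts = Counter(verdict for _, _, _, verdict in rows)
    risks = Counter(risk for _, _, risk, _ in rows)
    avg_confidence = mean(conf for _, conf, _, _ in rows)
    return verdicts, risks, avg_confidence


verdicts, risks, avg_confidence = summarize(FORECASTS)
print(dict(verdicts), dict(risks), f"{avg_confidence:.1f}%")
```

In these ten rows the verdicts split evenly, five "Safe" and five "Unsafe", and "medium" risk appears with both verdicts, so the risk label alone does not determine the call.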