gpt-5.1 Benchmark & Insights

OpenAI · OpenAI API

Updated Jun 6, 2026

Sample size: 111 runs (in window)
Accuracy: 85.8% (consensus match · 113d)
Confidence: 93% (over 113 runs)
Window end: Jun 6, 2026 (most recent run)
Input price: $1.25/MTok (prompt tokens)
Output price: $10.00/MTok (completion tokens)
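
As a rough illustration of how the two per-token prices above combine into a per-run cost, here is a minimal sketch. The function name `run_cost_usd` and the token counts in the example call are hypothetical; only the two prices come from this page.

```python
# Prices listed on this page, in USD per million tokens (MTok).
INPUT_USD_PER_MTOK = 1.25    # prompt tokens
OUTPUT_USD_PER_MTOK = 10.00  # completion tokens


def run_cost_usd(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the cost of a single run in USD (hypothetical helper)."""
    total = (prompt_tokens * INPUT_USD_PER_MTOK
             + completion_tokens * OUTPUT_USD_PER_MTOK)
    return total / 1_000_000


# Hypothetical token counts, for illustration only.
print(f"${run_cost_usd(2_000, 500):.4f}")
```

At these rates an output token costs eight times as much as an input token, so completion length is likely to dominate the cost of a run.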

Model insights

01 The riskiest failure mode on the board: 19 underratings against zero overratings, including 6 days where the model said "safe" on a consensus "unsafe" day, mostly in a Feb–Mar cluster.
02 For a driving-safety signal, this one-sided optimism is worse than its accuracy score suggests; gpt-5.4 and gpt-5.5 corrected it.
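
The counts in insight 01 could be tallied along these lines. This is only a sketch under stated assumptions: it assumes each day carries an ordinal risk level and a boolean safe flag for both the model and the consensus, and that an "underrating" means the model's risk level is below the consensus level. The `DayResult` type, its field names, and the sample rows are hypothetical and are not taken from the benchmark.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class DayResult:
    """One forecast day (hypothetical schema, not the benchmark's own)."""
    model_risk: int        # assumed ordinal scale, higher = riskier
    consensus_risk: int
    model_safe: bool
    consensus_safe: bool


def tally_disagreements(days: Iterable[DayResult]) -> dict[str, int]:
    """Count directional disagreements between model and consensus."""
    counts = {"underratings": 0, "overratings": 0, "safe_on_unsafe": 0}
    for d in days:
        if d.model_risk < d.consensus_risk:
            counts["underratings"] += 1
        elif d.model_risk > d.consensus_risk:
            counts["overratings"] += 1
        # The most dangerous case: model says "safe", consensus says "unsafe".
        if d.model_safe and not d.consensus_safe:
            counts["safe_on_unsafe"] += 1
    return counts


# Made-up rows, for illustration only.
sample = [
    DayResult(2, 2, True, True),    # agrees with consensus
    DayResult(1, 3, True, False),   # underrating, and "safe" on an "unsafe" day
    DayResult(2, 3, False, False),  # underrating, but still flagged unsafe
]
print(tally_disagreements(sample))
```

Counting the two directions separately is what exposes the asymmetry: a single accuracy figure treats an underrating and an overrating as the same kind of miss.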

Recent forecasts

Date | Conf. | Risk | Safe