Live benchmark · pipeline ran 07:00 EST today

15+ LLMs, one structured task.
A daily public benchmark with a real-world ground truth

Every morning a Cloudflare Worker hands the NWS forecast to 15+ frontier LLMs with the same prompt and the same JSON schema, then publishes each model's verdict, confidence, and agreement rate. It is an apples-to-apples comparison: model vs. model, version vs. version, provider vs. provider.

How it works

Same input, same schema, same reasoning effort. Every model. Every day.

A scheduled Cloudflare Worker fetches raw forecast data, normalizes it, and sends the same prompt to each model. Results land on this public page every morning at 07:00 EST.

01
Fetch
A cron job pulls hourly, daily, and alert feeds from the NWS forecast API.
cron · weather.gov
02
Analyze
One queued task per model. Each LLM returns a structured verdict, reasons, and confidence.
15+ LLMs · queue
03
Consensus
Verdicts are aggregated into a majority call, and the public page goes live.
leaderboard · insights
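The first two steps above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the function names, the task shape, and the office, grid, and zone parameters are all hypothetical. Only the endpoint paths follow the public api.weather.gov layout.

```typescript
// Hypothetical sketch of steps 01 and 02; names are illustrative only.
interface ModelTask {
  model: string;
  prompt: string;
  date: string; // ISO date the task belongs to, e.g. "2026-01-06"
}

// Step 01: build the three NWS feed URLs for one grid point and zone.
// office, gridX, gridY and zoneId are placeholders supplied by the caller.
function nwsUrls(office: string, gridX: number, gridY: number, zoneId: string) {
  const base = "https://api.weather.gov";
  const grid = `${base}/gridpoints/${office}/${gridX},${gridY}`;
  return {
    hourly: `${grid}/forecast/hourly`,
    daily: `${grid}/forecast`,
    alerts: `${base}/alerts/active?zone=${zoneId}`,
  };
}

// Step 02: one queued task per model, every task carrying the identical prompt.
function buildTasks(models: string[], prompt: string, date: string): ModelTask[] {
  return models.map((model) => ({ model, prompt, date }));
}
```

In a Worker, the cron handler would presumably fetch those URLs and push each task onto a queue. That wiring is left out here so the sketch stays runnable on its own.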
Why a panel of models

One model can be wrong. 15 rarely agree by accident.

We treat each model as an independent analyzer. When they disagree, we surface the disagreement instead of hiding it behind a single answer.

Consensus, not opinion
A daily verdict is a majority vote across all models on safety and risk level. Ties break toward the more cautious call.
majority · safe / risk
Public track record
Accuracy, agreement rate, and behavioral drift are published on each model's profile.
permanent
Disagreement is a signal
When models split, you see exactly which model dissented and why.
visible splits
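A majority vote with a cautious tie-break, as described above, might look like the sketch below. The type and field names are assumptions. So is the definition of agreement rate, taken here as the share of models whose safe/unsafe vote matches the majority. The page does not spell out either.

```typescript
// Hypothetical types; the real schema may differ.
type Risk = "low" | "medium" | "high";

interface PanelVerdict {
  model: string;
  safe: boolean;
  risk: Risk;
}

interface Consensus {
  safe: boolean;
  risk: Risk;
  agreementRate: number; // assumed: fraction of models matching the majority on `safe`
  dissenters: string[];  // models that voted against the majority on `safe`
}

const RISK_ORDER: Risk[] = ["low", "medium", "high"];

function consensus(verdicts: PanelVerdict[]): Consensus {
  if (verdicts.length === 0) throw new Error("no verdicts to aggregate");

  const safeVotes = verdicts.filter((v) => v.safe).length;
  // "Safe" needs a strict majority, so an even split falls to the cautious call.
  const safe = safeVotes > verdicts.length - safeVotes;

  // Most-voted risk level; walking low -> high with >= sends ties to the higher level.
  let risk: Risk = "low";
  let best = -1;
  for (const level of RISK_ORDER) {
    const count = verdicts.filter((v) => v.risk === level).length;
    if (count >= best) {
      best = count;
      risk = level;
    }
  }

  const dissenters = verdicts.filter((v) => v.safe !== safe).map((v) => v.model);
  const agreementRate = (verdicts.length - dissenters.length) / verdicts.length;
  return { safe, risk, agreementRate, dissenters };
}
```

Returning the dissenters alongside the majority call keeps the split visible instead of folding it into a single answer.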
The panel

15+ models, one prompt

Same forecast input, same JSON schema out, same reasoning/thinking effort.

By the numbers

Five months in. Here's what we've seen.

Days assessed
153
since Jan 6, 2026
Verdicts logged
800+
by multiple LLMs
AI models
18
diversity
Insights
23%
models fail
Common questions

FAQ

We don't know exactly how the models were trained, what data they saw, or what biases they carry. However, by asking every model the same real-world question (is it safe to drive today?), we get a direct, apples-to-apples comparison: model vs. model, version vs. version, provider vs. provider.

A single configured zone - currently Morris County, New Jersey (NOAA grid). We know you would prefer to see the Bay Area forecast here.

Each model provides a confidence value (0–100%) alongside its verdict. This reflects how confident the model is in its assessment based on the available data. A low score does not necessarily mean that the verdict is incorrect; it may simply indicate that the conditions were ambiguous or that the model was less decisive.

All models are given the same data: visibility, precipitation, wind gusts, temperature, and active alerts. Each model returns two fields: "safe", which is true if it is generally safe to drive and false if conditions are hazardous, and "risk level", the overall risk for typical drivers (low, medium, or high).
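As a rough sketch, the response could be typed and checked like this. The key names (riskLevel, reasons, confidence) and the parseVerdict helper are assumptions for illustration. The page only says that each model returns a safe flag, a risk level, reasons, and a confidence value from 0 to 100.

```typescript
// Hypothetical schema; actual key names on the wire are not documented here.
type RiskLevel = "low" | "medium" | "high";

interface ModelVerdict {
  safe: boolean;        // true if it is generally safe to drive
  riskLevel: RiskLevel; // overall risk for typical drivers
  reasons: string[];
  confidence: number;   // 0 to 100
}

const LEVELS: string[] = ["low", "medium", "high"];

// Returns a typed verdict, or null when the model's output does not fit the schema.
function parseVerdict(raw: string): ModelVerdict | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not JSON at all
  }
  if (typeof data !== "object" || data === null) return null;
  const d = data as Record<string, unknown>;

  if (typeof d.safe !== "boolean") return null;
  if (typeof d.riskLevel !== "string" || !LEVELS.includes(d.riskLevel)) return null;
  if (!Array.isArray(d.reasons) || !d.reasons.every((r) => typeof r === "string")) return null;
  if (typeof d.confidence !== "number" || d.confidence < 0 || d.confidence > 100) return null;

  return {
    safe: d.safe,
    riskLevel: d.riskLevel as RiskLevel,
    reasons: d.reasons as string[],
    confidence: d.confidence,
  };
}
```

Rejecting malformed output outright, rather than repairing it, keeps every model held to the same schema.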

Treat it as a second opinion, not authority. The official NWS forecasts are always the source of truth.

Yes. Every analysis, every raw forecast, and the verdict pages are server-rendered HTML with no auth. Linkable, scrapeable, archive-friendly.

Once a day, at 07:00 EST. Today's page revalidates every 5 minutes; archived dates are cached at the edge for 24 hours since the data is immutable.
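One way to express that caching split is a per-page Cache-Control header, sketched below. This assumes plain HTTP caching with the standard s-maxage directive. The site may well rely on a framework-level revalidation setting instead, and the cacheControlFor helper is hypothetical.

```typescript
const FIVE_MINUTES = 5 * 60;        // 300 seconds
const ONE_DAY = 24 * 60 * 60;       // 86400 seconds

// pageDate and today are ISO dates such as "2026-01-06".
function cacheControlFor(pageDate: string, today: string): string {
  // Today's page can still change, so shared caches refresh it every 5 minutes.
  // Archived dates are immutable, so the edge can hold them for 24 hours.
  const ttl = pageDate === today ? FIVE_MINUTES : ONE_DAY;
  return `public, s-maxage=${ttl}`;
}
```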

15+ opinions. One structured task. A new datapoint every morning.
Today's report is live. Browse the archive to see how the panel handled past storms, fog, and wind events.