Live benchmark · pipeline ran 07:00 EST today

15+ LLMs, one structured task.
A daily public benchmark with a real-world ground truth

Every morning a Cloudflare Worker hands the NWS forecast to 15+ frontier LLMs with the same prompt and the same JSON schema, then publishes each model's verdict, confidence, and agreement rate. Apples-to-apples - model vs. model, version vs. version, provider vs. provider.

How it works

Same input, same schema, same reasoning effort. Every model. Every day.

A scheduled Cloudflare Worker fetches raw forecast data, normalizes it, and routes one prompt per model. Results land on a public page here every morning at 07:00 EST.

Fetch

A cron job pulls hourly, daily, and alert feeds from the NWS forecast API.

cron · weather.gov

Analyze

One queued task per model. Each LLM returns a structured verdict, reasons, and confidence.

15+ LLMs · queue

Consensus

Verdicts are aggregated into a majority call, and the public page goes live.

leaderboard · insights
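
For readers who want to picture the Fetch and Analyze steps, here is a minimal sketch in Python. It is not the Worker's actual code: the model names, the `Task` type, the prompt wording, and both helper functions are hypothetical. The input field names (`startTime`, `temperature`, `windSpeed`, `shortForecast`, `probabilityOfPrecipitation`) follow the period objects returned by the NWS forecast API.

```python
from dataclasses import dataclass

# Hypothetical model identifiers; the real panel is larger.
MODELS = ["model-a", "model-b", "model-c"]


@dataclass(frozen=True)
class Task:
    model: str
    prompt: str


def normalize_periods(periods):
    """Keep only the fields the prompt needs from NWS-style forecast periods."""
    rows = []
    for p in periods:
        # The precipitation value can be null in the feed; treat that as 0%.
        pop = (p.get("probabilityOfPrecipitation") or {}).get("value")
        rows.append({
            "start": p["startTime"],
            "temp": p["temperature"],
            "wind": p["windSpeed"],
            "precip_pct": 0 if pop is None else pop,
            "summary": p["shortForecast"],
        })
    return rows


def build_tasks(rows, models=MODELS):
    """Render the prompt once and pair it with every model, so inputs are identical."""
    lines = [
        f'{r["start"]}: {r["summary"]}, temp {r["temp"]}, '
        f'wind {r["wind"]}, precip {r["precip_pct"]}%'
        for r in rows
    ]
    prompt = "Is it safe to drive today? Reply using the shared JSON schema.\n" + "\n".join(lines)
    return [Task(model=m, prompt=prompt) for m in models]
```

Building the prompt once and reusing the same string for every task is what keeps the comparison apples-to-apples: no model can receive a slightly different input.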

Why a panel of models

One model can be wrong. 15 rarely agree by accident.

We treat each model as an independent analyzer. When they disagree, we surface the disagreement instead of hiding it behind a single answer.

Consensus, not opinion

A daily verdict is a majority vote across all models on safety and risk level. Ties break toward the more cautious call.

majority · safe / risk
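
A minimal sketch of how such a vote could be tallied, assuming each verdict is a dict with a boolean `safe` and a `risk_level` of low, medium, or high. The function name, the field names, and the use of a plurality for the three-way risk level are assumptions for illustration, not the production aggregation.

```python
from collections import Counter

# Ordered from least to most cautious.
RISK_ORDER = ["low", "medium", "high"]


def consensus(verdicts):
    """Aggregate per-model verdicts into one call, breaking ties cautiously."""
    if not verdicts:
        raise ValueError("no verdicts to aggregate")

    safe_votes = sum(1 for v in verdicts if v["safe"])
    unsafe_votes = len(verdicts) - safe_votes
    # A strict majority is needed to call the day safe; a tie goes to "not safe".
    safe = safe_votes > unsafe_votes

    counts = Counter(v["risk_level"] for v in verdicts)
    top = max(counts.values())
    tied = [level for level, n in counts.items() if n == top]
    # Among the most-voted levels, pick the highest (most cautious) one.
    risk_level = max(tied, key=RISK_ORDER.index)

    return {"safe": safe, "risk_level": risk_level}
```

With one model voting safe/low and another voting unsafe/medium, this sketch returns unsafe/medium, which is the cautious side of both ties.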

Public track record

Accuracy, agreement rate, and behavioral drift are published on each model's profile.

permanent
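
One plausible way to compute an agreement rate is the share of days on which a model's safe/unsafe call matched the panel's consensus. The sketch below assumes that definition; the function name and the date-keyed dicts are hypothetical, and the published metric may be calculated differently.

```python
def agreement_rate(model_calls, consensus_calls):
    """Fraction of shared days on which a model's 'safe' call matched consensus.

    Both arguments map an ISO date string to a boolean 'safe' call.
    Returns None when the model and the consensus share no days.
    """
    shared = [day for day in model_calls if day in consensus_calls]
    if not shared:
        return None
    matches = sum(1 for day in shared if model_calls[day] == consensus_calls[day])
    return matches / len(shared)
```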

Disagreement is a signal

When models split, you see exactly which model dissented and why.

visible splits

The panel

15+ models, one prompt

Same forecast input, same JSON schema out, same reasoning/thinking effort.

Moonshot AI · Workers AI

nemotron-3-120b-a12b

Nvidia · Workers AI

New models join once released

Native providers · Cloudflare Workers AI

By the numbers

Three months in. Here's what we've seen

Days assessed

198

since Jan 6, 2026

Verdicts logged

800+

by multiple LLMs

AI models

diversity

Insights

23%

models fail

Common questions

FAQ

Why are we doing this?

We don't know exactly how the models were trained, what data they saw, or what biases they carry. However, by asking every model the same real-world question (is it safe to drive today?), we get a direct, apples-to-apples comparison: model vs. model, version vs. version, provider vs. provider.

What region does the forecast cover?

A single configured zone - currently Morris County, New Jersey (NOAA grid). We know you would prefer to see the Bay Area forecast here.

What does the "confidence" score mean?

Each model provides a confidence value (0–100%) alongside its verdict. This reflects how confident the model is in its assessment based on the available data. A low score does not necessarily mean that the verdict is incorrect; it may simply indicate that the conditions were ambiguous or that the model was less decisive.

How are "safe" and "risk" defined?

All models are given the same data: visibility, precipitation, wind gusts, temperature, and active alerts. Each model then returns two fields: "safe", which is true if it is generally safe to drive and false if conditions are hazardous, and "risk level", the overall risk for typical drivers (low, medium, or high).
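
As an illustration, a reply in this shape could be checked as follows. The exact key names (`safe`, `risk_level`, `confidence`, `reasons`) and the validator itself are assumptions made for this sketch; the real schema may name or constrain the fields differently.

```python
import json

RISK_LEVELS = ("low", "medium", "high")


def validate_verdict(raw):
    """Return a list of problems; an empty list means the verdict is usable."""
    problems = []
    if not isinstance(raw.get("safe"), bool):
        problems.append("safe must be true or false")
    if raw.get("risk_level") not in RISK_LEVELS:
        problems.append("risk_level must be low, medium, or high")
    conf = raw.get("confidence")
    # bool is a subclass of int in Python, so rule it out explicitly.
    if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0 <= conf <= 100:
        problems.append("confidence must be a number from 0 to 100")
    reasons = raw.get("reasons")
    if not isinstance(reasons, list) or not all(isinstance(r, str) for r in reasons):
        problems.append("reasons must be a list of strings")
    return problems


# Example: a well-formed reply produces no problems.
reply = json.loads(
    '{"safe": false, "risk_level": "high", "confidence": 82, "reasons": ["freezing rain"]}'
)
assert validate_verdict(reply) == []
```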

Can I trust this for actual driving decisions?

Treat it as a second opinion, not an authority. The official NWS forecast is always the source of truth.

Is the data open?

Yes. Every analysis, every raw forecast, and every verdict page is server-rendered HTML with no auth. Linkable, scrapeable, archive-friendly.

How often does it run?

Once a day, at 07:00 EST. Today's page revalidates every 5 minutes; archived dates are cached at the edge for 24 hours since the data is immutable.
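
A caching policy like this can be expressed with the standard `Cache-Control` header. The sketch below is one way to pick the value per page; the function name and the choice of the `s-maxage` directive are assumptions, and the site's real headers may differ.

```python
from datetime import date


def cache_control(page_date, today):
    """Pick a Cache-Control value for a report page dated page_date."""
    if page_date >= today:
        # Today's page can still change, so shared caches hold it for 5 minutes.
        return "public, s-maxage=300"
    # Archived pages are immutable, so shared caches can hold them for 24 hours.
    return "public, s-maxage=86400"


# Example: an archived date gets the long edge lifetime.
assert cache_control(date(2026, 1, 6), date(2026, 1, 7)) == "public, s-maxage=86400"
```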

15+ opinions. One structured task. A new datapoint every morning.

Today's report is live. Browse the archive to see how the panel handled past storms, fog, and wind events.
