15+ LLMs, one structured task.
A daily public benchmark with a real-world ground truth
Every morning a Cloudflare Worker hands the NWS forecast to 15+ frontier LLMs with the same prompt and the same JSON schema, then publishes each model's verdict, confidence, and agreement rate. Apples-to-apples - model vs. model, version vs. version, provider vs. provider.
Same input, same schema, same reasoning effort. Every model. Every day.
A scheduled Cloudflare Worker fetches raw forecast data, normalizes it, and sends the same prompt to each model. Results land on this public page every morning at 7am EST.
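As a rough illustration of the "one prompt per model" step, the sketch below builds an identical request for every model from one normalized forecast. The field names, units, prompt wording, and the `buildRequests` helper are assumptions made for this example, not the Worker's actual code.

```typescript
// Normalized forecast fields mentioned on this page.
// Exact names and units are assumptions for illustration.
interface Forecast {
  visibilityMiles: number;
  precipitationChancePct: number;
  windGustMph: number;
  temperatureF: number;
  activeAlerts: string[];
}

// Hypothetical request shape: every model gets the same prompt and effort.
interface ModelRequest {
  model: string;
  prompt: string;
  reasoningEffort: "low" | "medium" | "high";
}

function buildRequests(forecast: Forecast, models: string[]): ModelRequest[] {
  // Build the prompt once so that every model sees byte-identical input.
  const prompt =
    "Is it safe to drive today? Answer using the required JSON schema.\n" +
    JSON.stringify(forecast);
  return models.map(
    (model): ModelRequest => ({ model, prompt, reasoningEffort: "medium" })
  );
}
```

Building the prompt once, outside the loop, is what keeps the comparison fair: only the model differs between requests.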
One model can be wrong. 15 rarely agree by accident.
We treat each model as an independent analyzer. When they disagree, we surface the disagreement instead of hiding it behind a single answer.
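One plausible way to surface disagreement is to compute a majority verdict, the share of models that side with it, and the list of dissenters. The sketch below assumes a simple unweighted majority vote; the `Verdict` shape and the `summarize` function are hypothetical, and the published agreement rate may be defined differently.

```typescript
// Hypothetical shape of one model's answer; field names are assumptions.
interface Verdict {
  model: string;
  safe: boolean;
  confidence: number; // 0 to 100
}

interface Consensus {
  majoritySafe: boolean | null; // null when the vote is an exact tie
  agreementRate: number;        // share of models on the majority side, 0 to 1
  dissenters: string[];         // models that voted against the majority
}

function summarize(verdicts: Verdict[]): Consensus {
  if (verdicts.length === 0) {
    throw new Error("summarize needs at least one verdict");
  }
  const safeCount = verdicts.filter((v) => v.safe).length;
  const unsafeCount = verdicts.length - safeCount;

  // An exact tie has no majority, so nobody is labeled a dissenter.
  if (safeCount === unsafeCount) {
    return { majoritySafe: null, agreementRate: 0.5, dissenters: [] };
  }

  const majoritySafe = safeCount > unsafeCount;
  return {
    majoritySafe,
    agreementRate: Math.max(safeCount, unsafeCount) / verdicts.length,
    dissenters: verdicts
      .filter((v) => v.safe !== majoritySafe)
      .map((v) => v.model),
  };
}
```

Keeping the dissenters in the result, rather than collapsing to a single answer, is what lets a page show who disagreed and not only how many.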
15+ models, one prompt
Same forecast input, same JSON schema out, same reasoning/thinking effort.
Three months in. Here's what we've seen
FAQ
We don't know exactly how the models were trained, what data they saw, or what biases they carry. However, by asking every model the same real-world question (is it safe to drive today?), we get a direct, apples-to-apples comparison: model vs. model, version vs. version, provider vs. provider.
A single configured zone - currently Morris County, New Jersey (NOAA grid). We know you would prefer to see the Bay Area forecast here.
Each model provides a confidence value (0–100%) alongside its verdict. This reflects how confident the model is in its assessment based on the available data. A low score does not necessarily mean that the verdict is incorrect; it may simply indicate that the conditions were ambiguous or that the model was less decisive.
All models are given the same data: visibility, precipitation, wind gusts, temperature, and active alerts. Each model then returns "safe", which is true if it is generally safe to drive and false if conditions are hazardous, and "risk level", the overall risk for typical drivers (low, medium, or high).
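For readers who scrape the results, here is a minimal sketch of what checking a model's JSON answer against such a schema might look like. The key names `risk_level` and `confidence` and the `parseVerdict` helper are assumptions; only "safe", the three risk levels, and the 0–100 confidence range come from this page.

```typescript
const RISK_LEVELS = ["low", "medium", "high"] as const;
type RiskLevel = (typeof RISK_LEVELS)[number];

// Assumed JSON shape; the real schema's key names may differ.
interface DrivingVerdict {
  safe: boolean;
  risk_level: RiskLevel;
  confidence: number; // 0 to 100
}

function isRiskLevel(x: unknown): x is RiskLevel {
  return typeof x === "string" && (RISK_LEVELS as readonly string[]).indexOf(x) !== -1;
}

// Returns the parsed verdict, or null if the text is not valid JSON
// or does not match the expected shape.
function parseVerdict(raw: string): DrivingVerdict | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;

  const fields = data as Record<string, unknown>;
  const safe = fields["safe"];
  const risk = fields["risk_level"];
  const confidence = fields["confidence"];

  if (typeof safe !== "boolean") return null;
  if (!isRiskLevel(risk)) return null;
  if (typeof confidence !== "number" || !Number.isFinite(confidence)) return null;
  if (confidence < 0 || confidence > 100) return null;

  return { safe, risk_level: risk, confidence };
}
```

Rejecting malformed answers outright, instead of coercing them, keeps one model's formatting slip from being counted as a verdict.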
Treat it as a second opinion, not authority. The official NWS forecasts are always the source of truth.
Yes. Every analysis, every raw forecast, and the verdict pages are server-rendered HTML with no auth. Linkable, scrapeable, archive-friendly.
Once a day, at 07:00 EST. Today's page revalidates every 5 minutes; archived dates are cached at the edge for 24 hours since the data is immutable.
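The two cache lifetimes could be expressed as a small policy function like the sketch below. The `cacheControlFor` helper and the exact header values are assumptions; only the 5-minute and 24-hour durations come from the answer above.

```typescript
const FIVE_MINUTES_S = 5 * 60;
const ONE_DAY_S = 24 * 60 * 60;

// Picks a Cache-Control value for a page, given dates as YYYY-MM-DD strings.
// Header values are illustrative, not the site's actual configuration.
function cacheControlFor(pageDate: string, today: string): string {
  if (pageDate === today) {
    // Today's page can still change, so the edge refreshes it every 5 minutes.
    return `public, s-maxage=${FIVE_MINUTES_S}`;
  }
  // Archived dates are immutable, so the edge can hold them for 24 hours.
  return `public, s-maxage=${ONE_DAY_S}`;
}
```

Passing `today` in as a parameter, instead of reading the clock inside the function, keeps the policy easy to test and leaves the time-zone decision to the caller.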