Methodology

How the number is built

Plain-English version: we tracked seven concrete things that have to happen for NBA basketball to return to Seattle. We assigned a probability to each, weighted them, and combined them. The number on the front page is what that combination produces. This is not an official NBA source, betting advice, private information, or a machine-learning model. It is a transparent forecast from public evidence, audited assumptions, and a published scorecard.

The seven components

The model has exactly seven critical-path components — the "Road to Tipoff" gates. New factors such as competing-city status, TV-deal economics, and league-leadership public statements are tracked as context signals on the dashboard, but they are not additional headline gates without an explicit spec change and editorial approval record.

Each component has a current state (e.g. "exploring", "range_announced"), an audited baseline probability for reaching the success state by the 2030 deadline, and a weight reflecting how predictive the component is. Recent news items also produce a separate diagnostic context signal for each component. In the current production headline, those signals are displayed as context only; they do not change the audited baseline probabilities used for the headline.

The math

Today's number on the front page is produced by the weighted logit pool. Each component's baseline probability is converted to log-odds, weighted, summed, and converted back:

pooled_log_odds = sum(weight_i * logit(clamp(p_i, 0.01, 0.99))) / sum(weight_i)
p_aggregate = sigmoid(pooled_log_odds)
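The two lines above can be sketched in Python as follows. This is a minimal illustration, not the production code: the function names and the example baselines and weights are hypothetical, and only the formula itself comes from this page.

```python
import math

def clamp(p: float, lo: float = 0.01, hi: float = 0.99) -> float:
    """Keep a probability away from 0 and 1 so its log-odds stay finite."""
    return max(lo, min(hi, p))

def logit(p: float) -> float:
    """Convert a probability to log-odds."""
    return math.log(p / (1.0 - p))

def sigmoid(x: float) -> float:
    """Convert log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def logit_pool(probs, weights) -> float:
    """Weighted average of clamped log-odds, mapped back to a probability."""
    pooled_log_odds = sum(
        w * logit(clamp(p)) for p, w in zip(probs, weights)
    ) / sum(weights)
    return sigmoid(pooled_log_odds)

# Hypothetical inputs for illustration only (not the site's audited baselines).
probs = [0.80, 0.60, 0.50]
weights = [2.0, 1.0, 1.0]
p_aggregate = logit_pool(probs, weights)
print(round(p_aggregate, 3))
```

Because the weights are divided by their sum, only their ratios matter: doubling every weight leaves the result unchanged.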

The cutover from weighted geometric mean to weighted logit pool was ratified on 2026-05-02 (see the editorial approval record). Logit pool is a heuristic weighted log-odds aggregator — not a Bayesian update — and correlated inputs can double-count evidence. On the same baselines it sits ~2.5 points above the geometric mean because it does not penalize weakest-link components as harshly. The geometric-mean implementation is retained in the codebase so historical model runs (recorded with the prior method) keep rendering correctly.
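To see why the two methods diverge, here is a small side-by-side sketch. It assumes the prior method was a weight-normalized geometric mean of the component probabilities; that exact form is an assumption, as are the function names and the example inputs. The clamp is omitted for brevity, so inputs must lie strictly between 0 and 1.

```python
import math

def weighted_geometric_mean(probs, weights) -> float:
    """Assumed form of the prior method: exp of the weighted mean of log(p)."""
    total = sum(weights)
    return math.exp(sum(w * math.log(p) for p, w in zip(probs, weights)) / total)

def weighted_logit_pool(probs, weights) -> float:
    """Current method: sigmoid of the weighted mean of log-odds."""
    total = sum(weights)
    z = sum(w * math.log(p / (1.0 - p)) for p, w in zip(probs, weights)) / total
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical inputs: two strong components and one weak link.
probs = [0.90, 0.85, 0.30]
weights = [1.0, 1.0, 1.0]

geo = weighted_geometric_mean(probs, weights)
pool = weighted_logit_pool(probs, weights)
print(f"geometric mean: {geo:.3f}, logit pool: {pool:.3f}")
```

In this toy case the weak 0.30 component drags the geometric mean down further than it drags the logit pool. The size of the gap depends on the inputs, so these numbers say nothing about the gap on the site's actual baselines.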

Weights are linear coefficients in the logit pool. They are not intuitive percent-importance numbers; see the per-component sensitivity bars on the homepage's Forces strip for the actual headline impact of each component.
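One way to see why a weight is not a percent-importance number is to bump each component's probability by the same amount and watch the headline move. The sketch below is only an illustration of that idea. It is not how the Forces strip is computed, and the bump size, function names, and inputs are all hypothetical.

```python
import math

def logit_pool(probs, weights) -> float:
    """Weighted logit pool with the 0.01 to 0.99 clamp."""
    clamped = [min(0.99, max(0.01, p)) for p in probs]
    z = sum(w * math.log(p / (1.0 - p)) for p, w in zip(clamped, weights)) / sum(weights)
    return 1.0 / (1.0 + math.exp(-z))

def headline_impacts(probs, weights, bump: float = 0.05):
    """Change in the pooled probability when each component alone rises by `bump`."""
    base = logit_pool(probs, weights)
    impacts = []
    for i in range(len(probs)):
        bumped = list(probs)
        bumped[i] = probs[i] + bump
        impacts.append(logit_pool(bumped, weights) - base)
    return impacts

# Hypothetical inputs: two components with identical weights.
probs = [0.90, 0.50]
weights = [1.0, 1.0]
impacts = headline_impacts(probs, weights)
print([round(x, 4) for x in impacts])
```

Even with equal weights, the same +0.05 bump moves the headline more for the component at 0.90 than for the one at 0.50, because log-odds change faster near the extremes.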

The dashboard withholds the headline if the latest model run is missing its replayable component snapshot or if the snapshot recompute does not reproduce the stored probability within one-millionth. This audit trail keeps component changes from silently altering the public estimate.
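A minimal sketch of that replay check, under stated assumptions: the snapshot is modeled as a list of (baseline probability, weight) pairs, "within one-millionth" is read as an absolute tolerance of 1e-6, and every name below is hypothetical.

```python
import math
from typing import Optional, Sequence, Tuple

TOLERANCE = 1e-6  # assumed absolute tolerance for "within one-millionth"

def logit_pool(snapshot: Sequence[Tuple[float, float]]) -> float:
    """Recompute the pooled probability from (probability, weight) pairs."""
    total_weight = sum(w for _, w in snapshot)
    z = sum(
        w * math.log(min(0.99, max(0.01, p)) / (1.0 - min(0.99, max(0.01, p))))
        for p, w in snapshot
    ) / total_weight
    return 1.0 / (1.0 + math.exp(-z))

def headline_or_withhold(
    stored_probability: float,
    snapshot: Optional[Sequence[Tuple[float, float]]],
) -> Optional[float]:
    """Return the stored headline only if the snapshot replays it; else None."""
    if not snapshot:
        return None  # missing snapshot: withhold
    if abs(logit_pool(snapshot) - stored_probability) > TOLERANCE:
        return None  # replay does not reproduce the stored value: withhold
    return stored_probability

print(headline_or_withhold(0.5, [(0.5, 1.0), (0.5, 2.0)]))
```

Returning None rather than a corrected number mirrors the rule above: a mismatch withholds the headline and does not silently replace it.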

What the launch headline can be

The editorial approval record uses 50% to 70% as a launch audit guardrail. If the audited model output falls outside that range, the headline is withheld until the assumptions are reviewed and documented. The output is not forced into the band, and the band is not moved post-hoc to fit the result. Any change to the guardrail requires a new editorial-approval-record entry.
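The guardrail behaves as a gate, not a clamp. A short sketch of the difference, assuming inclusive bounds and a hypothetical function name:

```python
from typing import Optional

GUARDRAIL_LOW = 0.50
GUARDRAIL_HIGH = 0.70

def launch_headline(model_output: float) -> Optional[float]:
    """Publish the audited output unchanged if it is inside the band.

    Outside the band the headline is withheld (None). The value is never
    clipped to the nearest edge of the band.
    """
    if GUARDRAIL_LOW <= model_output <= GUARDRAIL_HIGH:
        return model_output
    return None

print(launch_headline(0.62))  # inside the band: published as-is
print(launch_headline(0.74))  # outside the band: withheld pending review
```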

Known limitations

  1. Conditional independence assumption. Both V1 (geometric mean) and V2 (logit pool) treat components as independent. They aren't: ownership/financing, expansion terms, and arena readiness all feed into final league approval, while draft rules and schedule slots follow official league action. We document this rather than fix it in the formula.
  2. No statistical training data. The NBA has only a handful of modern expansion precedents: the 1988/1989 Miami, Charlotte, Minnesota, and Orlando expansion wave; 1995 Toronto/Vancouver; and 2004 Charlotte. That is too small a sample to train a calibrated model. The baselines are engineered from public evidence, not learned from history.
  3. Logit pool is a heuristic, not Bayesian. It does not guarantee calibration, and correlated inputs can double-count evidence.
  4. Market may price private information. Participants who know owners, GMs, or agents may have an information edge we will not match. Persistent market disagreement can also reflect liquidity, bid/ask spreads, fees, stale prices, noisy trading, or resolution wording that does not exactly match the model target.
  5. Editorial bias risk. The team is Seattle-friendly. The scorecard explicitly tracks model vs market on resolved sub-questions so any directional bias surfaces as a measurable miss.

Context signals

The dashboard's "context signals" strip surfaces a per-component news pressure score derived from classified news items in the last 30 days; items older than that are not ingested. For each item that the news classifier marked as affecting one of the seven critical-path components, we add importance × sentiment × decay to that component's score, where sentiment ∈ {+1, −1, 0} and decay is 0.95 raised to the item's age in days (a half-life of about 13.5 days). The result is a signed pressure value: positive means recent news has been favorable, negative means recent news has been unfavorable.
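The scoring rule can be sketched as below. This assumes the decay term is 0.95 raised to the item's age in whole days, which is the reading consistent with the stated half-life. The data class, field names, and the component name are hypothetical.

```python
from dataclasses import dataclass
from datetime import date

DAILY_DECAY = 0.95
WINDOW_DAYS = 30

@dataclass
class NewsItem:
    component: str   # which critical-path component the item affects
    importance: int  # classifier importance, 1 to 5
    sentiment: int   # +1, -1, or 0
    published: date

def pressure_scores(items, today: date) -> dict:
    """Sum importance * sentiment * decay per component over the 30-day window."""
    scores = {}
    for item in items:
        age_days = (today - item.published).days
        if age_days < 0 or age_days > WINDOW_DAYS:
            continue  # outside the window: ignored
        decay = DAILY_DECAY ** age_days
        contribution = item.importance * item.sentiment * decay
        scores[item.component] = scores.get(item.component, 0.0) + contribution
    return scores

# Hypothetical items for illustration only.
today = date(2026, 6, 1)
items = [
    NewsItem("arena_readiness", 4, +1, date(2026, 6, 1)),   # today, favorable
    NewsItem("arena_readiness", 2, -1, date(2026, 5, 22)),  # 10 days old, unfavorable
    NewsItem("arena_readiness", 5, +1, date(2026, 4, 1)),   # too old, ignored
]
scores = pressure_scores(items, today)
print(scores)
```

A neutral item (sentiment 0) contributes nothing regardless of its importance, so the score reflects direction as well as volume.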

Eleven RSS feeds are currently active across national, local, business, and official lanes: ESPN NBA, CBS Sports NBA, Yahoo Sports NBA, Seattle Times Sports, KING 5 Seattle Sports, KOMO Seattle Sports, Front Office Sports, Sportico, GeekWire, Cascade PBS / Crosscut, and NBA Communications. Each fetched article is keyword-filtered, deduplicated by URL (and by a per-feed fingerprint that catches the same article being republished by the same outlet), and classified by Claude Haiku for tag, sentiment, importance on a 1-5 scale, and which of the seven components it affects. Reddit ingestion is currently disabled; we may re-enable it in a future update.

When the same news event is reported across multiple outlets, the pipeline collapses them into a single timeline entry rather than counting each one separately. Articles already grouped into a probability-move event are left as that event's sources; significant articles (importance 4 or 5) outside any such event are checked pairwise — only pairs published within 48 hours of each other and from different outlets are even considered, and a Claude AI step then decides whether each candidate pair really describes the same news event before they're merged. The dedup step is deliberately cautious: when the AI is unsure, it leaves the articles separate. Each merged cluster gets a single Claude-generated rollup headline; the rollup is cached against a fingerprint that covers the contributing articles' titles and classifier tags, so a later correction to any one article's classification triggers a fresh rollup.
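The pre-filter that decides which pairs reach the AI judge can be sketched as follows. The 48-hour window, the different-outlet rule, and the importance cutoff come from the description above. The class, field names, and inclusive handling of the 48-hour boundary are assumptions, and the judge step itself is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from itertools import combinations

MAX_GAP = timedelta(hours=48)
MIN_IMPORTANCE = 4

@dataclass(frozen=True)
class Article:
    url: str
    outlet: str
    importance: int
    published: datetime

def candidate_pairs(articles):
    """Pairs of significant articles from different outlets, at most 48 hours apart."""
    significant = [a for a in articles if a.importance >= MIN_IMPORTANCE]
    return [
        (a, b)
        for a, b in combinations(significant, 2)
        if a.outlet != b.outlet and abs(a.published - b.published) <= MAX_GAP
    ]

# Hypothetical articles for illustration only.
a = Article("https://example.com/1", "Seattle Times Sports", 5, datetime(2026, 1, 1, 9))
b = Article("https://example.com/2", "ESPN NBA", 4, datetime(2026, 1, 2, 9))
c = Article("https://example.com/3", "ESPN NBA", 4, datetime(2026, 1, 5, 9))
d = Article("https://example.com/4", "GeekWire", 3, datetime(2026, 1, 1, 10))

pairs = candidate_pairs([a, b, c, d])
print([(x.url, y.url) for x, y in pairs])
```

Only the pair (a, b) survives: c is more than 48 hours from a and shares an outlet with b, and d falls below the importance cutoff. Each surviving pair would then go to the judge, which merges only when it is confident the two articles describe the same event.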

News pressure is diagnostic context, not a forecast input in the current production headline. Future V2 overlay work may translate context signals into bounded log-odds adjustments, but the headline will not use those adjustments until source-quality weighting, tag caps, and audit gates are in place.

Known limitations of this layer: articles are not yet uniformly weighted by source quality, though when Reddit ingestion is active those community-lane items are capped at importance 2 and cannot be classified as confirmed_news. The dedup step errs on the side of keeping articles separate rather than merging them incorrectly, so the typical failure mode is the same story occasionally appearing more than once, rather than two different stories being incorrectly collapsed. Two outlets covering the same event more than 48 hours apart are also never offered to the AI judge and will appear as separate entries. Both behaviors will be revisited as source-quality weighting and resolved-evidence calibration arrive.

Public scorecard

We maintain a public scorecard of resolved sub-questions where our model's 30-days-prior forecast can be compared like-for-like against a real-money market's 30-days-prior forecast on the same event. Brier score is the comparison metric.

A Brier score is a penalty for wrong predictions, computed as the mean squared difference between the forecast probability and the 0/1 outcome: 0 means a perfect call, 1 means a fully confident call that was wrong every time, and 0.25 is what guessing 50/50 every time earns. Lower means better Brier performance on resolved comparable events; it is evidence about forecast quality, not proof of full calibration. We measure model and market on the same yes/no event, then compare.
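For reference, the standard Brier score for yes/no events is the mean squared difference between forecast and outcome. A minimal sketch with a hypothetical function name and made-up forecasts:

```python
def brier_score(forecasts, outcomes) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("need equal-length, non-empty forecasts and outcomes")
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

# Made-up example: the same three resolved events scored for two forecasters.
outcomes = [1, 0, 1]
model_brier = brier_score([0.8, 0.3, 0.6], outcomes)
market_brier = brier_score([0.7, 0.4, 0.5], outcomes)
print(round(model_brier, 4), round(market_brier, 4))
```

Scoring both forecasters against the same outcome list is what makes the comparison like-for-like.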

The dashboard surfaces a verdict label next to the scorecard number. It moves through six discrete states:

  • Unscored (unscored) — no resolved comparable sub-questions yet.
  • Tracking (tracking) — fewer than eight events have resolved; that is not enough data to tell whether the model has any edge over the market, so no performance claim is made. This is neutral, not negative.
  • Outperforming (outperforming) — ≥ 8 resolutions, model Brier at least 0.02 lower than market.
  • Matching (matching) — ≥ 8 resolutions, |model − market Brier| less than 0.02.
  • Underperforming (underperforming) — ≥ 8 resolutions, market Brier at least 0.02 lower than model.
  • Reverted tracker (reverted_tracker) — two consecutive 8-event windows underperforming, OR an explicit editorial decision to stop reporting model-vs-market performance and revert to a tracker.
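The six states can be expressed as a small decision function. This is a sketch under stated assumptions: the function and parameter names are hypothetical, and the reverted state is passed in as a flag because it depends on window history or an editorial decision that this function cannot see.

```python
MIN_RESOLUTIONS = 8
BRIER_THRESHOLD = 0.02

def verdict(
    n_resolved: int,
    model_brier: float,
    market_brier: float,
    reverted: bool = False,
) -> str:
    """Map scorecard inputs to one of the six verdict labels."""
    if reverted:
        return "reverted_tracker"
    if n_resolved == 0:
        return "unscored"
    if n_resolved < MIN_RESOLUTIONS:
        return "tracking"
    lift = market_brier - model_brier  # positive when the model scores better
    if lift >= BRIER_THRESHOLD:
        return "outperforming"
    if lift <= -BRIER_THRESHOLD:
        return "underperforming"
    return "matching"

print(verdict(0, 0.0, 0.0))
print(verdict(5, 0.10, 0.30))
print(verdict(8, 0.20, 0.25))
print(verdict(12, 0.24, 0.25))
```

Note that a large Brier gap with only five resolutions still returns "tracking": the sample-size gate is checked before any comparison is made.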

The scorecard is the only place the site reports limited forecast-quality evidence. The dashboard never says "we beat the market" outside the scorecard. The 0.02 Brier-lift threshold is roughly the difference between predicting 0.50 and predicting 0.48 on a yes/no event that resolves No (0.2500 versus 0.2304); at a sample size of 8, deltas smaller than that are probably noise. Both the threshold and the 8-resolution minimum are non-trivial barriers by design.

Affiliate disclosure

Seattle Tipoff does not currently display affiliate links. If that changes, affiliate relationships will be clearly labeled near the relevant links and will not affect the probability estimate.

Reading the dashboard in 30 seconds

  • The big number is the model's estimate of the probability that a Seattle NBA franchise plays a regular-season game before Jan. 1, 2030.
  • The market number is the best available comparable real-money prediction-market benchmark when a fresh qualifying market exists; it can still be noisy because of liquidity, spreads, and resolution wording.
  • The delta is model minus market — the dashboard explains it, and the explanation always allows for "the market may price private information we cannot see."
  • The forces strip and context-signals strip are diagnostic. Forces shows component-impact on the headline; context signals show news flow.
  • The scorecard shows a limited calibration record on resolved comparable events; unscored or tracking means no performance claim yet.