Methodology

Overview

This dashboard rests on a Bayesian model that turns two sources of information, expert consensus and player usage/game environment data, into weekly fantasy-point predictions expressed as full probability distributions rather than single numbers (e.g., mean, sd). The design has three moving parts: two structurally different models, each fit in two distributional families; a stacking step that combines them; and an evaluation on a season the model never saw during development. What follows is the full account for wide receivers and running backs, from the data through the held-out test (a brief overview for other positions is at the end).

The data and the two signals

All player statistics, schedules, usage data, injury reports, and Vegas lines come from the nflverse project, retrieved with the nflreadr package. The expert signal is the FantasyPros aggregated Expert Consensus Rank (ECR), also retrieved through the nflverse archive.

Each position is trained on the 2022, 2023, and 2024 seasons and tested on 2025, which was sequestered throughout development. The receiver models use 5,402 active player-weeks, filtered to receivers with an offensive snap share above 20% who were not ruled out by injury; the running-back models use 2,806 active, non-DNP back-weeks, without the snap-share floor, since a back’s fantasy value can come from a smaller, carry-heavy or goal-line role below that threshold.

Two models, by design

The heart of the design is two models that are structurally different, not the same model nudged in two directions. That difference is what makes “expert versus data” a genuine comparison rather than a modeling artifact.

Model A (expert-anchored) uses only the expert consensus projection (the calibrated ECR) and an injury-status flag, plus a player-specific random intercept. It is, in effect, the experts’ all-things-considered estimate, recalibrated against history and shrunk toward the player’s own track record.

Model B (data-driven) never sees the expert consensus. It works from player usage and game environment alone, again with a player random intercept; its value proposition is a forecast built from observable signals rather than expert judgment. The usage inputs are four-week rolling measures and differ by position: for receivers, target share, air yards, and snap share; for running backs, carry share, target share, snap share, and per-carry rates of yards after contact and broken tackles. The game-environment inputs are common to both positions: the Vegas-implied team total and spread, the opponent’s recent points allowed to the position in question, and a home indicator, alongside the injury flag both models share.

Model predictors

Both models draw from a position-specific feature table, built once per season and frozen on the same logic across years and positions. Every continuous predictor is z-scored against that position’s training distribution so coefficients sit on a common scale and the priors hold the same meaning across variables.
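As a rough illustration of the standardization step, the Python sketch below estimates the mean and standard deviation on training rows only and reuses them for later seasons. The function names are hypothetical, and the choice of the population standard deviation is an assumption; the source does not say which estimator the pipeline uses.

```python
from statistics import fmean, pstdev

def fit_zscore(train_values):
    """Estimate the scaling constants on training data only."""
    mean = fmean(train_values)
    sd = pstdev(train_values)  # assumption: population sd
    if sd == 0:
        raise ValueError("a constant predictor cannot be standardized")
    return mean, sd

def apply_zscore(values, mean, sd):
    """Standardize any season with the frozen training constants."""
    return [(v - mean) / sd for v in values]

# Toy target-share values, not real data.
train_target_share = [0.10, 0.20, 0.30]
mean, sd = fit_zscore(train_target_share)
holdout_z = apply_zscore([0.20, 0.40], mean, sd)
```

Freezing the constants this way means a holdout value is expressed in training-season units, so a coefficient keeps the same meaning when the model is applied to a new year.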

The expert signal (used by Model A only):

consensus_proj — the FantasyPros aggregated ECR converted to expected fantasy points via a frozen rank-to-points calibration. I could not find a free archive of historical weekly ECR-based point projections (with a transparent and stably defined conversion process), so the expert signal enters the model as an expected fantasy-point value rather than as a raw rank. The conversion is a monotonic rank-to-points calibration fit separately for each position on its own 2022–2023 active, non-DNP player-weeks. It bins player-weeks by integer ECR rank, takes the mean realized half-PPR fantasy points per rank, runs a pool-adjacent-violators isotonic regression (R’s stats::isoreg) on the binned means with the y-axis negated, and then re-negates the fit so the result is non-increasing in rank: a better rank maps to at least as many expected points as a worse one. Predictions for ranks not seen in the binned table are obtained by linear interpolation, with flat extrapolation past the highest and lowest observed ranks. The resulting consensus_proj column is the same predictor everywhere, frozen and applied unchanged to 2024 and 2025.
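The calibration described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the pipeline's code: the project uses R's stats::isoreg, the function names here are hypothetical, the pooling is unweighted across rank bins (my reading of the description), and the numbers are toy values.

```python
from collections import defaultdict

def pava_nondecreasing(y):
    """Unweighted pool-adjacent-violators fit, non-decreasing."""
    blocks = []  # each block holds [sum, count]
    for value in y:
        blocks.append([value, 1])
        # Merge backwards while the previous block mean exceeds the last one.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fitted = []
    for s, n in blocks:
        fitted.extend([s / n] * n)
    return fitted

def fit_rank_to_points(ranks, points):
    """Bin by integer rank, average, then fit a non-increasing curve."""
    by_rank = defaultdict(list)
    for r, p in zip(ranks, points):
        by_rank[int(r)].append(p)
    xs = sorted(by_rank)
    means = [sum(by_rank[r]) / len(by_rank[r]) for r in xs]
    # Negate, fit non-decreasing, re-negate: the result falls with rank.
    neg_fit = pava_nondecreasing([-m for m in means])
    return xs, [-v for v in neg_fit]

def predict_points(rank, xs, ys):
    """Linear interpolation inside the table, flat extrapolation outside."""
    if rank <= xs[0]:
        return ys[0]
    if rank >= xs[-1]:
        return ys[-1]
    for i in range(len(xs) - 1):
        x0, x1 = xs[i], xs[i + 1]
        if x0 <= rank <= x1:
            return ys[i] + (ys[i + 1] - ys[i]) * (rank - x0) / (x1 - x0)

# Toy data: rank 3 outscored rank 2 on average, so the two bins get pooled.
xs, ys = fit_rank_to_points([1, 1, 2, 3, 5], [20.0, 16.0, 10.0, 12.0, 4.0])
```

In the toy data the raw bin means are not monotone, and the pooling step replaces the two offending bins with their common average, which is the behavior the isotonic fit is there to provide.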

The usage signals (four-week rolling trailing means, reset at season boundaries; used by Model B):

target_share_l4, snap_share_l4 — the player’s share of team pass targets, and offensive snap share (trailing four weeks, non-active weeks counted as zero); both positions.
air_yards_l4 (receivers) — share of the team’s air yards directed at the player.
carry_share_l4 (running backs) — fraction of the team’s rush attempts handed to the back.
pfr_yac_per_att_l4, pfr_broken_tackles_per_att_l4 (running backs) — per-carry yards after contact and broken tackles, from Pro-Football-Reference advanced rushing data, capturing the back’s own efficiency apart from raw volume.

The game-environment signals (used by Model B, both positions):

vegas_team_total, spread_for_team — the Vegas-implied team scoring total for the upcoming game, derived from the market total and spread; and the point spread signed from the player’s team’s perspective (negative when favored).
opp_pts_allowed_l4 — the opposing defense’s trailing four-week mean of fantasy points allowed to the player’s position (to receivers for WRs, to backs for RBs), on the same scoring scale.
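The source says only that the implied team total is derived from the market total and spread. A common convention, assumed here rather than confirmed by the source, splits the total evenly and shifts each side by half the spread. The function name is hypothetical and the numbers are toy values.

```python
def implied_team_total(market_total, spread_for_team):
    """Implied points for one team.

    spread_for_team is signed from the team's perspective and is
    negative when the team is favored, so a favorite gets more than half.
    """
    return market_total / 2.0 - spread_for_team / 2.0

# A 47-point total with the team favored by 3.
favorite = implied_team_total(47.0, -3.0)
underdog = implied_team_total(47.0, 3.0)
```

Under this convention the two implied totals always sum to the market total, which is a quick sanity check on the sign of the spread column.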

Binary status flags: injury_status (both models) is a flag for a Questionable / Doubtful / Out / IR listing on the final pre-game report. It is included because the FantasyPros consensus snapshot is fixed in time and the injury report might update between snapshot and kickoff; giving the flag to both models puts them on an equal decision-time information footing. The home flag (Model B only) is a binary home indicator.

Opening-week and missing-data handling. Where a rolling four-week mean is undefined (early in a season, or for a player without four prior active weeks), the value falls back to that player’s prior-season active-weeks mean; if that is also missing, to the position’s population mean computed on 2022–2023 active weeks. Each rolling feature carries an _was_imputed companion flag so downstream checks can identify the fallback rows. The Vegas inputs and the home flag are observed directly and need no fallback; consensus_proj is imputed to the population mean only when the underlying ECR is itself missing, which is rare.
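The fallback chain for rolling features can be sketched as follows. This is a simplified Python illustration with hypothetical names: it treats the input as the player's season-to-date weekly values, and it returns the companion flag alongside the value.

```python
def rolling_with_fallback(prior_weeks, prior_season_mean, population_mean, window=4):
    """Return (value, was_imputed) for one rolling feature.

    prior_weeks: this season's earlier weekly values, oldest first.
    prior_season_mean: the player's prior-season mean, or None if missing.
    population_mean: the position-level mean used as the last resort.
    """
    if len(prior_weeks) >= window:
        recent = prior_weeks[-window:]
        return sum(recent) / window, False
    if prior_season_mean is not None:
        return prior_season_mean, True
    return population_mean, True

# Toy snap-share values, not real data.
observed = rolling_with_fallback([0.5, 0.5, 0.5, 0.5, 1.0], None, 0.3)
from_last_season = rolling_with_fallback([0.5], 0.4, 0.3)
from_population = rolling_with_fallback([0.5], None, 0.3)
```

Returning the flag with the value keeps the imputed rows identifiable downstream without a second pass over the data.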

Model A/B equations

Formally, with \(i\) indexing player-weeks and \(p[i]\) the player, both models share the same hierarchical structure and differ only in their predictor matrix \(X\):

\[ \eta_i = X_i \beta + u_{p[i]}, \qquad u_p \sim \mathcal{N}(0, \sigma_u^2). \]

\(\eta_i\) is the linear predictor on the appropriate distributional scale (see Distributional families below) and \(u_{p[i]}\) is the player-level random intercept. The two models differ only in what \(X\) contains:

Model A: intercept, consensus_proj_z, injury_status.
Model B: intercept, vegas_team_total_z, spread_for_team_z, target_share_l4_z, air_yards_l4_z, snap_share_l4_z, opp_pts_allowed_wr_l4_z, injury_status, home.

These are the receiver predictor sets; Model A is identical for running backs, and Model B for backs swaps air yards for carry share and adds the two per-carry efficiency rates: intercept, vegas_team_total_z, spread_for_team_z, carry_share_l4_z, target_share_l4_z, snap_share_l4_z, pfr_yac_per_att_l4_z, pfr_broken_tackles_per_att_l4_z, opp_pts_allowed_rb_l4_z, injury_status, home.

All continuous predictors are standardized (the _z suffix) against the training-set mean and standard deviation, so the priors \(\beta \sim \mathcal{N}(0, 2.5)\) and \(\sigma_u \sim \mathcal{N}(0, 1.5)\) retain a common meaning across variables. Auxiliary distributional parameters (the negative-binomial dispersion, the gamma shape, the zero-inflation logit) carry their own intercepts and, in Model B, may depend additionally on snap_share_l4_z, since a player on the field more is harder to put at zero. In Model A those auxiliaries are kept sparse, intercept-only or intercept-plus-injury_status, to keep Model A strongly anchored to ECR ranking and thus keep the expert consensus-anchored model interpretable.

Why two models rather than one?

A natural question is why not fit a single model with the consensus rank as one covariate among the usage features, and let the data decide its weight. That is a perfectly reasonable model, and for raw point prediction it might do just as well; the two-model design was chosen for other reasons. The first and main one is the goal of the project, to make the everyday act of weighing expert judgment against the numbers visible, which requires two separate predictive distributions you can hold side by side rather than a single distribution with the expert signal dissolved into a coefficient. The disagreement view, the head-to-head comparison, and the Expert/Data/Blend split all depend on having two genuinely distinct forecasts. The second reason is that stacking whole predictive distributions is more robust to either model being misspecified than trusting one model’s assumed functional form, and it yields a clean, inspectable answer to how much weight each signal carries, which a single model’s entangled coefficients cannot, especially since consensus and usage are correlated enough that their coefficients would compete for the same variance. The cost is a more elaborate pipeline; the benefit is transparency and a decomposition that might tell me something interesting.

Distributional families

Weekly fantasy scoring is awkward to model: a spike of near-zero (floor) weeks sits beside a long upside (ceiling) tail. Four families were fit to that shape, each as both Model A and Model B, and judged on 2024 data under a strict convergence gate, requiring every fit to reach an R-hat below 1.01, zero divergent transitions, and a bulk effective sample size above 400, with a single permitted refit at higher adaptation. Two families failed the gate even after the refit and were disqualified: a hurdle-lognormal, whose expert model would not converge (R-hat 1.02), and a skew-normal, whose data model both failed to converge and mixed poorly (R-hat 1.02, effective sample size 178). That left a zero-inflated negative binomial, which handles the zero spike on the natural integer scale, and a gamma with a log link, a continuous and naturally right-skewed positive distribution.
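The convergence gate is simple enough to state as a predicate. The Python sketch below uses hypothetical names and assumes each fit has been summarized by its worst R-hat, its divergence count, and its smallest bulk effective sample size. The divergence counts and the first ESS value in the examples are placeholders, since the source reports only the R-hat and ESS figures quoted above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FitDiagnostics:
    max_rhat: float
    divergent_transitions: int
    min_bulk_ess: float

def passes_gate(d: FitDiagnostics) -> bool:
    """R-hat below 1.01, no divergences, bulk ESS above 400."""
    return d.max_rhat < 1.01 and d.divergent_transitions == 0 and d.min_bulk_ess > 400

def survives(first_fit: FitDiagnostics, refit: Optional[FitDiagnostics] = None) -> bool:
    """A fit survives if it passes directly or after its single permitted refit."""
    if passes_gate(first_fit):
        return True
    return refit is not None and passes_gate(refit)

# R-hat 1.02 alone is enough to fail; the other fields are placeholders.
hurdle_expert = FitDiagnostics(max_rhat=1.02, divergent_transitions=0, min_bulk_ess=1000.0)
# R-hat 1.02 and ESS 178 as reported; the divergence count is a placeholder.
skew_data = FitDiagnostics(max_rhat=1.02, divergent_transitions=0, min_bulk_ess=178.0)
clean_fit = FitDiagnostics(max_rhat=1.003, divergent_transitions=0, min_bulk_ess=1500.0)
```

A family is kept only if both of its fits, Model A and Model B, survive the gate.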

On held-out 2024 accuracy the two survivors were nearly tied, with a pooled CRPS of 3.16 for the negative binomial against 3.18 for the gamma. Rather than crown one on a margin that small, the project asked a different question: do the two families actually predict differently? They do, diverging beyond their typical level on about a quarter of player-weeks, and so both were retained and combined, on the principle that two models that disagree this often may carry independent signal and are therefore both worth keeping.

Model zinb/gamma equations

The two retained families are written formally as follows, suppressing the player random intercept already specified above and letting \(\eta_i^{\text{main}}\) denote the linear predictor from Model A or Model B.

Zero-inflated negative binomial (zinb_rounded). The fantasy-point outcome is rounded to a non-negative integer \(y_i\) and modeled as a mixture:

\[ y_i \mid \pi_i, \mu_i, \phi \;\sim\; \pi_i\, \delta_0 \;+\; (1 - \pi_i)\, \mathrm{NegBin}(\mu_i, \phi), \] \[ \log \mu_i = \eta_i^{\text{main}}, \qquad \mathrm{logit}\, \pi_i = \eta_i^{\text{zi}}, \]

where \(\delta_0\) is a point mass at zero, \(\eta^{\text{main}}\) models the count mean, and \(\eta^{\text{zi}}\) is a separate linear predictor for the zero-inflation probability (intercept-plus-injury_status in Model A; additionally snap_share_l4_z in Model B). The integer rounding is the cost of using a discrete family on continuous fantasy points and is documented as a small approximation.
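To make the mixture concrete, here is a small Python sketch of the log probability mass function. It assumes the mean and dispersion parameterization of the negative binomial, with variance \(\mu + \mu^2/\phi\), which is how Stan-based tools usually define it; the function names are hypothetical.

```python
import math

def negbin_logpmf(y, mu, phi):
    """Negative binomial log-pmf with mean mu and dispersion phi."""
    return (math.lgamma(y + phi) - math.lgamma(phi) - math.lgamma(y + 1)
            + phi * math.log(phi / (phi + mu))
            + y * math.log(mu / (mu + phi)))

def zinb_logpmf(y, mu, phi, pi):
    """Zero-inflated negative binomial log-pmf.

    pi is the zero-inflation probability: a zero can come from the
    point mass or from the count distribution itself.
    """
    nb = negbin_logpmf(y, mu, phi)
    if y == 0:
        return math.log(pi + (1.0 - pi) * math.exp(nb))
    return math.log1p(-pi) + nb

# Toy parameters: mean count 8, dispersion 2, 10% structural zeros.
log_p_zero = zinb_logpmf(0, 8.0, 2.0, 0.1)
log_p_ten = zinb_logpmf(10, 8.0, 2.0, 0.1)
```

With these toy parameters the overall mean is \((1 - \pi)\mu = 7.2\), which shows how the zero-inflation term pulls the expected score below the count mean.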

Gamma with log link (gamma_log). Fantasy points are floored at a tiny positive value \(y_i^+ = \max(y_i, 0.01)\) so the support is the positive reals, and modeled as

\[ y_i^+ \mid \mu_i, \alpha_i \;\sim\; \mathrm{Gamma}\!\big(\text{shape} = \alpha_i,\; \text{rate} = \alpha_i / \mu_i\big), \] \[ \log \mu_i = \eta_i^{\text{main}}, \qquad \log \alpha_i = \eta_i^{\text{shape}}, \]

where \(\alpha_i\) is the gamma shape parameter — also modeled with covariates — controlling how heavy the right tail is. The flooring is, like the integer rounding above, a small concession to the family’s support; it touches only the rare exact-zero weeks.
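The gamma density in its mean and shape form can be written directly from the equation above. This Python sketch uses a hypothetical function name and applies the 0.01 floor from the text before evaluating the density.

```python
import math

def gamma_log_density(points, mu, alpha):
    """Log density of Gamma(shape=alpha, rate=alpha/mu) at the floored outcome."""
    y = max(points, 0.01)  # floor so the outcome lies in the positive reals
    rate = alpha / mu
    return (alpha * math.log(rate) - math.lgamma(alpha)
            + (alpha - 1.0) * math.log(y) - rate * y)

# With alpha = 1 the gamma reduces to an exponential with mean mu.
check = gamma_log_density(5.0, 10.0, 1.0)
```

Parameterizing the rate as \(\alpha/\mu\) keeps the mean fixed at \(\mu\) whatever the shape, so the shape predictor changes only the spread and tail weight.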

The same two families carry the running-back models. Rather than re-run the full bake-off, the two receiver survivors were fit to running-back data and re-checked under the identical convergence gate; both passed, and the negative binomial was again the slight favorite on model-averaged CRPS (3.55 against 3.57 on held-out 2024 backs).

Bayesian priors

All models are specified with standard weakly informative priors: Normal(0, 2.5) on standardized fixed-effect coefficients, Normal(0, 1.5) on the player random-effect standard deviation, and Normal(0, 2.5) on auxiliary distributional parameters. Models were fit in brms on a cmdstanr backend.

Combining the models via stacking

The four fitted models (two structural models in two families) are combined by stacking, a method that weights predictive distributions by how well they actually forecast out of sample rather than by how well they fit (Yao, Vehtari, Simpson, and Gelman, 2018). The combination happens in two steps, and the weights come out different for the two positions. Within each family, Model A and Model B are stacked using leave-one-out predictive scores; for receivers the weight on the expert-anchored Model A comes out to 0.49 in the negative-binomial family and 0.13 in the gamma family, and across families about 69% of the weight goes on the negative-binomial blend. Running backs are stacked the same way and land further toward usage, with Model A weights of 0.28 and 0.01 and about 25% on the negative-binomial blend.
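For two models, stacking reduces to choosing one weight that maximizes the summed log score of the mixture. The Python sketch below does that by grid search over toy pointwise log predictive densities. It illustrates the objective only, not the project's fitting code; the function name and the numbers are hypothetical.

```python
import math

def stack_two(lpd_a, lpd_b, grid_size=1001):
    """Weight on model A that maximizes sum_i log(w*p_A,i + (1-w)*p_B,i).

    lpd_a, lpd_b: pointwise out-of-sample log predictive densities.
    """
    best_w, best_score = 0.0, -math.inf
    for k in range(grid_size):
        w = k / (grid_size - 1)
        score = sum(
            math.log(w * math.exp(la) + (1.0 - w) * math.exp(lb))
            for la, lb in zip(lpd_a, lpd_b)
        )
        if score > best_score:
            best_w, best_score = w, score
    return best_w

# Toy case: A is better on two rows, B on one, so the weight is interior.
w_mixed = stack_two([-1.0, -1.0, -3.0], [-3.0, -3.0, -1.0])
# Toy case: A is better on every row, so it takes all the weight.
w_dominant = stack_two([-1.0, -1.0], [-2.0, -2.0])
```

The first toy case shows why stacking differs from picking a winner: the model that is worse on average still earns weight when it covers rows the other model handles badly.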

Regrouping those weights by which model each piece came from gives a clean summary: for receivers the stacking output draws roughly 38% on the expert-anchored models and 62% on the data-driven ones, and for running backs about 8% and 92%. That decomposition into an expert half and a data half is exactly how the Expert, Data, and Blend views are defined, whatever lean the blend is set to. How much of either position’s stacking output actually drives the deployed blend is the subject of the next section.
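The regrouping is plain arithmetic on the weights reported above, and it can be checked with a short Python sketch (the function name is hypothetical). Because the inputs are rounded, the results match the quoted shares only approximately.

```python
def expert_share(gamma_zinb, w_a_zinb, w_a_gamma):
    """Total weight on Model A across both families.

    gamma_zinb: cross-family weight on the negative-binomial blend.
    w_a_zinb, w_a_gamma: within-family weights on Model A.
    """
    return gamma_zinb * w_a_zinb + (1.0 - gamma_zinb) * w_a_gamma

wr_expert = expert_share(0.69, 0.49, 0.13)  # 0.3784, roughly 38% expert
rb_expert = expert_share(0.25, 0.28, 0.01)  # 0.0775, roughly 8% expert
```

The data-driven share is the complement in each case, roughly 62% for receivers and 92% for running backs.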

From stacking weights to a deployed hedge

Stacking gave the four models the weights above, and the natural thing is to deploy them as the blend’s lean. Looking harder at them changed my mind. The within-family and cross-family weights are fit by maximizing leave-one-out predictive accuracy on 2022–2024, and for both positions that objective turns out to be nearly flat along the axis that matters most here, the lean between the expert-anchored and data-driven halves. Sweeping that lean from pure expert to pure data moves the pooled CRPS by only about two to four hundredths of a point across the whole sensible middle range, and the lean that looks best on the development seasons disagrees with the lean that looks best on the held-out 2025; for receivers the development optimum sits near 0.34 expert while the holdout prefers 0.64, and for running backs the two are 0.24 and 0.72. The stacking weights’ own intervals say the same thing another way, with the running-back expert weight carrying a 95% interval from 0.14 to 0.43 inside the negative-binomial family. When the objective cannot tell the leans apart and the development and holdout optima point in opposite directions, the single number stacking returns is mostly noise, and deploying it means betting the whole blend on a coin the data never actually flipped.

The principled response to an unidentified weight is to fall back on a sensible prior, and the sensible prior for a project built around hedging is an equal one. So the deployed blend is now a 50/50 mix of the expert and data halves for both positions, rather than the stacking argmax, which had leaned to 62% data for receivers and 92% for running backs. On the held-out season this equal hedge does what a hedge should: its pooled CRPS comes in below either single model for both positions, which for running backs recovers the accuracy the over-leaned stacking weight had given away. It also pulls the running-back ranking back toward the expert consensus. The one cost sits at the running-back flex floor, the 6 fp line, where the usage-data model is the better-calibrated source and an equal blend gives back some of that calibration; the average miss there rises from about four to about six percentage points, while every other slot-threshold stays in its usual two-to-three-point range. I judged that an acceptable price for a blend that genuinely hedges, and the dashboard now lets you set the lean yourself: a slider on the projection pages moves the blend between pure data and pure expert, defaulting to the equal hedge, so for something like a flex-floor call on a running back you can lean it toward the data model, where the calibration is better.
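One way to picture the lean is as a mixture of predictive draws: each blended draw comes from the expert half with probability equal to the lean, and from the data half otherwise. The Python sketch below assumes the blend is implemented as such a mixture, which the source implies but does not spell out; the function name and the toy draws are hypothetical.

```python
import random

def blend_draws(expert_draws, data_draws, expert_lean=0.5, n=4000, seed=0):
    """Sample n draws from a mixture of two predictive distributions."""
    if not 0.0 <= expert_lean <= 1.0:
        raise ValueError("expert_lean must be between 0 and 1")
    rng = random.Random(seed)
    blended = []
    for _ in range(n):
        # Pick the source half first, then one draw from that half.
        source = expert_draws if rng.random() < expert_lean else data_draws
        blended.append(rng.choice(source))
    return blended

# Toy predictive draws in fantasy points, not real output.
expert = [8.0, 10.0, 12.0]
data = [4.0, 6.0, 8.0]
equal_hedge = blend_draws(expert, data)            # the 50/50 default
pure_data = blend_draws(expert, data, expert_lean=0.0)
```

Because the blend is a mixture rather than an average of the two means, it keeps both models' tails, which is what lets a threshold probability such as a flex-floor call respond to the lean.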

One note on the pre-registration that follows. The eleven predictions in each scorecard below were written and locked against the stacking weights, and they are reported exactly as they resolved against that configuration; nothing here rewrites them. The shift to an equal hedge is a deployment decision made afterward, justified by how poorly the weights are identified rather than by the holdout outcome, and it leaves the registered test untouched. What the live dashboard shows you now is the deployed hedge; what the pre-registration tables record is how the stacking weights, the configuration I registered, actually performed.

Holdout discipline, pre-registration, and evaluating predictions

Until it can be tested in real time on unrealized predictions (e.g., in the upcoming 2026 season), the project’s credibility rests on 2025 being genuinely untouched until the end. Development, including every family choice, prior adjustment, and feature decision, used only 2022–2024; a guard script blocked any accidental reference to the holdout during those phases. Before either position’s 2025 evaluation was run, a set of mostly methodological predictions about what it would show was written down and locked: eleven predictions for receivers and eleven for running backs, spanning ranking accuracy, calibration, where the blend would and would not help, and which player situations would drive expert-data disagreement. How they resolved is below.

How predictions are evaluated

The model was selected and is judged primarily on the continuous ranked probability score (CRPS), a proper scoring rule that rewards a forecast for placing probability mass near what actually happened and penalizes both overconfidence and vagueness (Gneiting and Raftery, 2007). Alongside CRPS, calibration is summarized by the Expected Calibration Error (ECE), or the average gap between the model’s stated probability and the realized exceedance frequency, broken out by probability bins. Zero means perfectly calibrated; lower is better. For example, ECE = 0.02 reads as “on average, when the model says ‘X% chance,’ the actual rate is about 2 percentage points off.” The Track Record page shows the same quantity visually as reliability curves, of which ECE is the scalar summary. In addition, the Track Record page compares weekly ranking accuracy against the raw expert consensus, and reports mean absolute error; the Edge Finder page stratifies results by situational archetype — seven judgment-defined flags described in detail below — to locate where the two signals might systematically diverge.
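Both headline metrics can be computed from simple inputs. The Python sketch below estimates CRPS from predictive draws with the standard sample formula, and ECE from stated probabilities and realized outcomes. The function names are hypothetical, and the ten equal-width bins and count weighting in the ECE are assumptions; the source does not state the binning it uses.

```python
def crps_from_draws(draws, observed):
    """Sample CRPS: mean|X - y| minus half the mean |X - X'|."""
    n = len(draws)
    xs = sorted(draws)
    term1 = sum(abs(x - observed) for x in xs) / n
    # For sorted draws, sum_{i<j} (x_j - x_i) = sum_i (2i - n + 1) * x_i.
    pair_sum = sum((2 * i - n + 1) * x for i, x in enumerate(xs))
    term2 = pair_sum / (n * n)  # equals 0.5 * mean over all pairs of |x_i - x_j|
    return term1 - term2

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Count-weighted mean gap between stated and realized rates per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, hit in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p = 1.0 goes in the top bin
        bins[idx].append((p, hit))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        hit_rate = sum(h for _, h in members) / len(members)
        ece += len(members) / total * abs(mean_p - hit_rate)
    return ece

# Toy inputs: two draws around an outcome, and two 90% calls with one hit.
toy_crps = crps_from_draws([0.0, 2.0], 1.0)
toy_ece = expected_calibration_error([0.9, 0.9], [1, 0])
```

CRPS is in fantasy points, so it reads like an absolute error that also credits a forecast for being appropriately sharp, while ECE is on the probability scale and reads directly in percentage points.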

What the 2025 holdout showed

When the holdout was finally scored, the receiver predictions came in at six confirmed, three contradicted, and four genuinely ambiguous (thirteen verdicts in all, because the first prediction is scored at three thresholds), and the running-back predictions at eight, two, and one. Reporting the misses and the ambiguities rather than quietly dropping them is part of the point, and the two complete scorecards are below.

A few findings run across both positions. The probabilities are well-calibrated: every slot-threshold calibration error landed inside its pre-registered band or better for both receivers and backs (the receiver WR1-floor error came in just below its band’s lower edge, which is why that sub-prediction is scored ambiguous), several below their development anchors. The hedge earns its keep on the proper scoring rule, since at the deployed 50/50 lean the blend’s pooled CRPS comes in below either single model for both positions, the payoff of not having to guess in advance whether the experts or the data will be closer; under the originally registered weights the running-back blend fell just short, which is part of what flagged the lean as wrong, and the deployment section above tells that story. I also pre-registered two situational predictions: that the consensus would under-project handcuff backs, and that the usage model would underrate returning stars. On careful post-hoc scrutiny neither holds up. The handcuff under-projection was an artifact of defining the flag on the carries a back actually got; defined prospectively, the consensus prices elevation candidates about right. The returning-star pattern cannot be told apart from noise on a single season’s cases. The disagreements between the signals are real but mostly idiosyncratic, which is itself the argument for hedging rather than trusting either one’s outliers.

The misses are as informative as the passes, and they rhyme across positions. The per-week cross-family weight is noisier than the development seasons suggested for both, and worse for backs, where only 8 of 17 weeks sat in band; that instability is exactly what a cross-family hedge is built to absorb. And the experts win the ranking, with raw consensus at least as accurate as the blend for receivers and clearly ahead for backs, which is no surprise for a model never built to out-rank a wisdom-of-the-crowd signal. The shape of the expert-data disagreement is taken up on the Edge Finder page; the situation-specific blind spots I expected to find there did not survive scrutiny.

The eleven pre-registered wide-receiver predictions and how they resolved

Each prediction below was written with a 2024 anchor value and a specific falsification condition before any 2025 prediction or evaluation had been run. The rightmost column gives the non-technical meaning of each result.

Terms used here. Cross-blend or Blend refers to the deployed cross-family mixture — the Blend you see throughout the dashboard. Pure refers to any one of the four un-blended base models: Model A or Model B fit in either the ZINB family or the gamma family. γ (gamma) is the cross-family stacking weight, the share of the Blend that comes from the ZINB family. LPD (log predictive density) is how well a model’s predictive distribution fits an actual outcome; higher is better, and the per-row LPD correlation between two models is how much they agree on which player-weeks are easy or hard to forecast. ECE and CRPS are defined in the How predictions are evaluated section above. The disagreement archetypes are fill-ins, emerging players, late-season expansions, and returning stars.

ID	What it tested	2024 anchor	2025 result	Verdict	What this means
P1a	How close the Blend’s stated probability of clearing the WR1 floor came to the observed exceedance rate (cross-blend ECE)	0.035	0.018, CI [0.012, 0.037]	AMBIGUOUS	The model’s stated chance of clearing the WR1 floor was off by about 2 percentage points on average — actually better than required, but the point estimate sat just below the band’s lower edge, so the formal test is ambiguous.
P1b	Same for the WR1 target threshold	0.016	0.013 (within band)	PASS	The model’s stated chance of clearing the WR1 target was off by about 1 percentage point on average — solidly within the expected range.
P1c	Same for the WR1 ceiling threshold	0.009	0.023 (within band)	PASS	Same metric at the ceiling threshold — about 2 percentage points off on average, well inside the band.
P2	Whether the expert-anchored (Model A) and data-driven (Model B) models agree on which player-weeks are well- vs poorly-predicted, measured by per-row LPD correlation in both retained families	0.90–0.95 staged	zinb 0.952, gamma 0.877	PASS	The expert and data models agreed on which weeks were easy or hard to forecast far more often than not, in both distributional families.
P3	Whether the Blend’s mix between the ZINB and gamma families (γ) stayed stable across the 17 weeks — within [0.45, 0.85] on at least 14 of 17	0.439 / 0.635 staged; deployed γ 0.691	7 of 17 weeks in band	FAIL	The Blend’s mix between its two distributional families bounced around more across weeks than the 2024 staged results had suggested. The relative weight optimal for ZINB versus gamma families changed week to week, which is exactly the situation a cross-family hedge is designed for.
P4	Whether ceiling-threshold calibration was at least as good as floor-threshold calibration at the WR1 slot, repeating the 2024 pattern	ceiling 0.026 better	ceiling 0.023 vs floor 0.018	AMBIGUOUS	In 2024 the ceiling probabilities were better-calibrated than the floor; in 2025 they came out nearly tied — a directional flip, but inside the noise.
P5	Whether the model’s biggest weekly misses were extreme outcomes it had already flagged as unlikely (top 20 surprises landing in the predictive tails)	19 boom / 1 bust	20 of 20 (all boom)	PASS	When the model was most “wrong,” it was a tail event the model already marked unlikely — typically a big boom — rather than systematic mis-prediction of a typical player.
P6	How often switching between a pure-ZINB and a pure-gamma predictive (the cross-family γ slider) changes a receiver’s stated WR1-target probability by more than 15 percentage points	0.1%	0.109%	PASS	Switching between the two distributional families moves a receiver’s WR1-target probability by more than 15 percentage points on only about 0.1% of player-weeks; for almost all weeks the two families produce very similar WR1 threshold probabilities (i.e., similarly low predicted probabilities of high scores for most player-weeks).
P7	Among the closest start/sit calls (predicted-mean difference ≤ 2 fp), how often the recommended starter flips when swapping a pure-gamma predictive for a pure-ZINB predictive (the cross-family γ slider)	28.4%	17.9%	FAIL	Only about 18% of close-call pairs flip when you swap between a pure-ZINB and a pure-gamma predictive — fewer than the 28% the 2024 pattern suggested. The two distributional families agreed more often on close-call start/sit decisions in 2025 than in 2024, but disagreement was still common enough (about 1 in 5 close calls) to warrant blending across families.
P8	Whether the Blend beats the best single un-blended component (the ‘best pure’) on CRPS in the disagreement archetypes, by ≥ 0.010	cross 3.24 vs pure 3.28	cross 3.168 vs single-best pure 3.229 (PASS) or per-archetype oracle 3.135 (FAIL)	AMBIGUOUS	The pre-reg text defined “best pure” two incompatible ways. Under the more natural reading the Blend beats the best 2024-identifiable pure; under a per-archetype retrospective oracle (only available after the fact) it loses. Both readings are reported.
P9	Whether the Blend stays close to the retrospectively best pure on 2025 — on both ECE and CRPS — in the disagreement archetypes	ECE +0.001, CRPS −0.041	ECE +0.0103, CRPS +0.0013	AMBIGUOUS	Even compared to whichever pure would have looked best after the fact, the Blend held up: within bootstrap noise on calibration, comfortably inside the tolerance on CRPS. The “you can’t pick the best pure in advance” rationale for hedging stands.
P10	Whether the expert-anchored model in the ZINB family (pure_A_zinb) systematically under-projects fill-in-situation player-weeks compared to what those players actually scored	n = 97, gap 1.633 fp	n = 55, gap 1.498 fp	PASS	Across 2025 fill-in player-weeks (a depth player pressed into a larger role by an absent teammate), the expert consensus-anchored model in the ZINB family projected about 1.5 fewer points than the player actually scored. Registered as a pass on the flag as defined; that flag conditions on realized current-week usage, and the prospectively-defined effect is ≈ zero (see “What the 2025 holdout showed”).
P11	Whether the data-driven model in the ZINB family (pure_B_zinb) is better-calibrated than the expert-anchored model in the same family (pure_A_zinb) at the slot floor thresholds (averaged across WR1, WR2, Flex)	+0.024 in data’s favor	−0.011 (expert slightly better)	FAIL	The 2024 advantage of the usage-driven ZINB model over the expert consensus-anchored ZINB model at the slot floors did not replicate — in 2025 the sign flipped slightly the other way, with the consensus-anchored model marginally better. Variability in which model is better calibrated — across player types, across weeks, and (if it replicates) across years — is why I hedge across models rather than commit to one.

Final tally across the thirteen sub-predictions: 6 PASS, 3 FAIL, 4 AMBIGUOUS.

The misses and ambiguities matter at least as much as the passes. Among the failures, P3 told me the per-week cross-family γ is noisier than the 2024 staged estimates suggested; P7 said close-call start/sit decisions were less sensitive to distributional family in 2025 than 2024 implied, though flips were still common enough to matter (about 1 in 5 close calls); P11 reversed direction, with the expert model slightly better at the floor on the holdout rather than worse. Of the four ambiguous verdicts, three sit inside bootstrap noise (P1a’s calibration miss is actually in the better-calibrated direction), and P8 surfaced a drafting flaw, with “best pure” defined two incompatible ways in the same locked prediction, a lesson for any future pre-registration. Variability in which model (experts vs. data, or ZINB vs. gamma family) is better calibrated across player types, across weeks, and (on this evidence) across years is why I hedge across models rather than commit to one. None of these results undermines the project’s central claim; each refines it.

The eleven pre-registered running-back predictions and how they resolved

ID	What it tested	2024 anchor	2025 result	Verdict	What this means
P1	Whether the blend’s stated chance of clearing each of the nine RB slot-threshold lines matched the observed rate (ECE within ±0.03 of 2024)	nine ECEs, 0.007–0.044	all nine in band, several below anchor	PASS	The model’s stated chances of clearing each floor/target/ceiling line for backs were about as accurate as in development, often better — the most important property for the dashboard’s numbers.
P2	Whether the expert and data models agree on which back-weeks are easy vs hard to forecast (per-row LPD correlation per family)	zinb 0.939 / gamma 0.834	zinb 0.956 / gamma 0.851	PASS	The two models agreed even more tightly out of sample — so tightly that, as the deployment section explains, the weight between them is barely identifiable.
P3	Whether the family mix (γ) stayed in a stable band across the 17 weeks (≥12 of 17 in [0.15, 0.75])	deployed γ 0.254	8 of 17 in band	FAIL	The weekly family mix wandered more than development suggested, more so than for receivers — exactly the instability a cross-family hedge is meant to absorb.
P4	Whether the expert model under-projects roster-confirmed handcuff backs the week their workload spikes	n = 117, −4.08 fp	n = 39, −6.24 fp, CI [−8.65, −3.90]	PASS	On the weeks a back inherits a starter’s carries, the consensus projected 6+ fewer points than he scored. Registered as a pass on the flag as defined; that flag conditions on realized current-week usage, and the prospectively-defined effect is ≈ zero (see “What the 2025 holdout showed”).
P5	Whether the blend’s CRPS stays within 0.010 of the best single component	blend 3.383 vs best pure 3.391	blend 3.675 vs best pure 3.621 (+0.054, CI [−0.004, 0.114])	AMBIGUOUS	Under the originally registered (data-heavy) weights the blend fell just short of the best single component, the interval brushing zero — a borderline miss the deployment section returns to.
P6	Whether the biggest misses were upside booms the model had flagged unlikely (booms among the top 20 surprises)	20 of 20	20 of 20	PASS	When the model was most “wrong” about a back, it was a long-TD or workload-spike game it had already marked unlikely, not a systematic mis-read.
P7	Whether the blend ranks within-week RB pairs at least as accurately as raw ECR (within 1 pp)	blend 69.1% vs ECR 68.1%	blend 67.6% vs ECR 70.1% (−2.5 pp, CI [−3.4, −1.6])	FAIL	For ordering backs against each other, ECR alone beat the blend, and the gap was real; the model was never built to out-rank a consensus, and its value is calibrated probabilities.
P8	Whether ceiling calibration was at least as good as floor calibration at RB1, as in development	ceiling better by 0.021	floor 0.020 vs ceiling 0.020 (−0.0006)	PASS	The two came out essentially tied rather than ceiling-better, but inside the tolerance — the registered ordering held.
P9	How often the closest calls flip when the lean is pushed to pure expert (within ±5 pp of 0.400)	0.400	0.421, CI [0.407, 0.436]	PASS	About 42% of near-tie back pairs reorder when you swing the lean fully to the expert side — close to development, and why the new lean slider is informative on close calls.
P10	How often the closest calls flip when the family mix is pushed to pure ZINB (within ±5 pp of 0.230)	0.230	0.235, CI [0.219, 0.252]	PASS	The two distributional families rarely reverse a close call (about 1 in 4), replicating the receiver result.
P11	Whether the data model is better-calibrated than the expert at the RB1 and RB2 floors, as in development	B better at both	RB1 A 0.039 / B 0.023; RB2 A 0.037 / B 0.036 (B better both)	PASS	The usage model kept its floor-calibration edge at both starter tiers, unlike receivers where this reversed; the same property holds at the flex floor, which is why a flex-RB floor call is best read off the data side.

Tally: 8 PASS, 2 FAIL, 1 AMBIGUOUS.
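
The calibration check in P1 can be illustrated with a binned expected calibration error. This is a minimal sketch, assuming equal-width probability bins and a row-weighted mean absolute gap between the stated chance and the observed rate. The bin count, the function names `expectedCalibrationError` and `withinBand`, and the input layout are illustrative assumptions, not the dashboard's actual diagnostic code.

```javascript
// Binned expected calibration error (ECE) for one threshold line.
// probs[i]: stated chance that player-week i clears the line.
// hits[i]:  1 if the player-week actually cleared it, else 0.
// Assumption: equal-width bins; the source does not state its binning scheme.
function expectedCalibrationError(probs, hits, nBins = 10) {
  const bins = Array.from({ length: nBins }, () => ({ n: 0, sumP: 0, sumHit: 0 }));
  probs.forEach((p, i) => {
    // Clamp so p = 1 lands in the last bin rather than out of range.
    const b = Math.min(nBins - 1, Math.floor(p * nBins));
    bins[b].n += 1;
    bins[b].sumP += p;
    bins[b].sumHit += hits[i];
  });
  let ece = 0;
  for (const { n, sumP, sumHit } of bins) {
    if (n === 0) continue;
    // Weight each bin's |mean stated chance - observed rate| by its share of rows.
    ece += (n / probs.length) * Math.abs(sumP / n - sumHit / n);
  }
  return ece;
}

// P1-style band check: holdout ECE within +/-0.03 of the development anchor.
function withinBand(holdoutEce, anchorEce, tolerance = 0.03) {
  return Math.abs(holdoutEce - anchorEce) <= tolerance;
}
```

Under this reading, a line passes when `withinBand` is true for its holdout and anchor ECEs, and P1 passes when all nine lines do.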

The deployed model, in numbers

	Wide receivers	Running backs
Training seasons	2022–2024, 5,402 player-weeks	2022–2024, 2,806 player-weeks
Held-out test	2025, 1,829 realized	2025, 915 realized
Retained families	zero-inflated negative binomial, gamma (log)	same two families
Stacking weights (LOO)	Model A 0.49 (neg-binomial) / 0.13 (gamma); cross-family 0.69 on neg-binomial → ≈38% expert	Model A 0.28 (neg-binomial) / 0.01 (gamma); cross-family 0.25 on neg-binomial → ≈8% expert
Deployed blend	equal 50/50 expert/data hedge; lean adjustable on the projection pages	equal 50/50 expert/data hedge; lean adjustable on the projection pages
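
The "≈38% expert" and "≈8% expert" figures in the stacking row are consistent with a weighted average of Model A's within-family weights, using the cross-family weight as the mixing share. A short sketch of that arithmetic follows; the function name `expertShare` is hypothetical, and the composition rule is my reading of the table rather than a formula the source states.

```javascript
// Overall expert (Model A) share implied by the two-level stacking weights:
// within each family Model A has its own weight, and the cross-family weight
// says how much of the final mix each family receives.
function expertShare(weightAInNegBin, weightAInGamma, crossFamilyOnNegBin) {
  return crossFamilyOnNegBin * weightAInNegBin +
         (1 - crossFamilyOnNegBin) * weightAInGamma;
}

const wrShare = expertShare(0.49, 0.13, 0.69); // wide receivers
const rbShare = expertShare(0.28, 0.01, 0.25); // running backs
console.log(wrShare.toFixed(2), rbShare.toFixed(2)); // prints "0.38 0.08"
```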

Other positions at a glance

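{
  // Show only the panel for the selected position and hide the other.
  // otherPos is assumed to be the page's position-selector value, defined in another cell.
  for (const el of document.querySelectorAll(".other-te")) el.style.display = otherPos === "TE" ? "" : "none";
  for (const el of document.querySelectorAll(".other-qb")) el.style.display = otherPos === "QB" ? "" : "none";
  // Return an empty fragment so the cell itself renders nothing.
  return html``;
}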

The quarterback model is the same two-model, two-family, stacked design, trained on 2022–2024 (1,668 active, non-DNP quarterback weeks) and tested on the sequestered 2025 season (547 realized weeks), exactly as the other three were.

Model B trades the receiving frame for a passing-and-rushing one: trailing pass-attempt volume and air yards, trailing rushing carries — the dual-threat term that has no receiver analog — the Vegas total and spread, a home flag, and recent points allowed to quarterbacks, plus one quality term, trailing passing EPA per dropback, kept after it earned its place on the development data. Model A is unchanged (the calibrated ECR, an injury flag, and a player random intercept). The dispersion side carries a dual usage lever: pass attempts tighten the distribution and suppress busts, while carries widen it but lift the floor, the statistical signature of a dual-threat quarterback.

Stacking leans toward the continuous family, the mirror image of tight end. The two families carried are a skew-normal and the zero-inflated negative binomial, and the cross-family weight lands about 85% on the skew-normal, the opposite of the tight end’s ~95%-discrete lean. The reason is that quarterback scoring is high, roughly symmetric, and carries a small genuine negative tail (sacks and turnovers can push a week below zero) that a real-valued family represents naturally. The discrete (zero-inflated negative binomial) family is not dropped, though; it stays a co-contributor, and the 2025 holdout confirmed the balance, with the skew-normal holding narrowly ahead out of sample rather than reversing. On the expert/data axis the deployed blend is the same equal 50/50 hedge, lean-adjustable on the projection pages.
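
Because a stacked blend is a weighted mixture of predictive distributions, the blended chance of any event, such as clearing a line, is the same weighted average of the component chances. A minimal sketch using the approximate 85/15 quarterback cross-family weight; the function name `mixtureProbability` and the two component probabilities are made up for illustration.

```javascript
// Probability of an event under a weighted mixture of predictive distributions.
// Each component supplies its own probability of the event and a mixture weight.
function mixtureProbability(components) {
  const totalWeight = components.reduce((sum, c) => sum + c.weight, 0);
  if (Math.abs(totalWeight - 1) > 1e-9) throw new Error("weights must sum to 1");
  return components.reduce((sum, c) => sum + c.weight * c.prob, 0);
}

// Hypothetical chances of clearing the floor line under each family.
const pFloorBlend = mixtureProbability([
  { family: "skew-normal", weight: 0.85, prob: 0.62 },
  { family: "zinb", weight: 0.15, prob: 0.58 },
]);
console.log(pFloorBlend.toFixed(3)); // prints "0.614"
```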

One roster tier, like tight end, but for a different reason. Quarterbacks do not flex in standard leagues, so the model ships a single QB1 tier (floor 15 / target 20 / ceiling 25 fp, half-PPR) with no shared Flex line, and a replacement baseline (R_QB = 13.98 fp) set on the same scorecard logic as the other positions. The situational archetypes are rebuilt for the position, including a dual-threat flag defined as a stable type from prior-season carries rather than a recent-form streak, so a dual-threat stays flagged through a quiet rushing week.

Calibration held out of sample. Every pre-registered quarterback band held on the holdout, with the stated chances tracking observed frequencies within a few percentage points across the floor, target, and ceiling; the data-driven model reads the floor a touch better than the expert, so a quarterback floor call is the case for leaning the blend toward Data. The same holds within situations: across the rebuilt archetypes, including the dual-threat quarterbacks whose carries tend to lift their floor, the stated chances stay in line with what actually happened.
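
The "stated chances" at the floor, target, and ceiling can be read directly off a set of predictive draws. The sketch below uses the QB1 lines quoted above (15 / 20 / 25 fp); the draws are invented, and treating "clearing" a line as scoring at or above it is an assumption, since the text does not say whether the comparison is strict.

```javascript
// QB1 tier lines from the text (half-PPR fantasy points).
const QB1_LINES = { floor: 15, target: 20, ceiling: 25 };

// Share of predictive draws at or above each line.
// Assumption: "clearing" a line means scoring >= the line.
function clearProbabilities(draws, lines = QB1_LINES) {
  const out = {};
  for (const [name, line] of Object.entries(lines)) {
    out[name] = draws.filter((d) => d >= line).length / draws.length;
  }
  return out;
}

// Hypothetical predictive draws for one quarterback-week.
const exampleDraws = [8.2, 12.5, 15.0, 17.4, 19.9, 21.3, 22.8, 24.1, 26.7, 31.0];
console.log(clearProbabilities(exampleDraws)); // { floor: 0.8, target: 0.5, ceiling: 0.2 }
```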

The seven pre-registered quarterback predictions and how they resolved

Seven predictions were written and locked against the 2022–2024 development data, each with a falsification condition, before any 2025 quarterback outcome was read. Six held; the lone miss is a thin, two-sided surprise tail that touches none of the deployed numbers.

ID	What it tested	Dev anchor	2025 result	Verdict	What this means
P1	Three-line calibration error (ECE) within ±0.03 of development, across the QB1 floor/target/ceiling	floor 0.036 / target 0.023 / ceiling 0.012	floor 0.057 / target 0.025 / ceiling 0.038, all in band	PASS	The stated chances of clearing a quarterback line held their calibration out of sample at every threshold.
P2	Whether the expert and data models agree on which QB-weeks are easy vs hard to forecast (per-row LPD correlation, each family)	skew 0.934 / zinb 0.933	skew 0.916 / zinb 0.925	PASS	The two models still tracked each other closely — the agreement that justifies hedging between them.
P3	Whether the cross-family blend’s CRPS beats the best single un-blended component out of sample	blend 4.048 vs best-pure 4.022 (a small in-sample cost)	blend 4.470 vs best-pure 4.495	PASS	The hedge reversed its in-sample cost out of sample, beating the best single model — exactly its purpose.
P4	Whether the biggest weekly misses were booms the model flagged unlikely (at least 14 of the top-20 surprises in the upper tail)	16/20 boom	13/20 boom	FAIL	The lone miss, reported as-is: 2025’s surprise tail ran more two-sided (more busts) than the boom-skewed development pattern. A thin (n=20) composition gap, not a calibration failure — P1 held — and it touches none of the deployed numbers.
P5	Whether the blend ranks within-week QB pairs at least as accurately as raw ECR	blend 66.2% vs ECR 64.2%	blend 62.8% vs ECR 62.6%	PASS	The blend drew even with the consensus on ranking out of sample, holding the line it was asked to hold.
P6	Cross-family revision rule: revise the ~85%-skew weight only if the discrete family beats the skew-normal on CRPS by more than 0.08	skew ahead by 0.051	skew 4.468 vs zinb 4.515 (skew +0.047)	CONFIRM	The skew-normal stayed ahead; the cross-family weight was confirmed, not revised.
P7	Whether a quarterback’s tendency to beat or miss his ranking (player random effect) persists out of sample	top and bottom random effects face-valid in development	correlation 0.148, sign-match 56% (n=32)	PASS	A modest but real signal carried over — descriptive, not an accuracy lever, since the random effect’s aggregate value is about nil.

Tally: 5 PASS, 1 CONFIRM, 1 FAIL — six of seven held.
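
Two pieces of the table lend themselves to a short sketch: the CRPS itself and the P6 revision rule. The first function is the standard sample-based CRPS estimator (mean absolute error of the draws minus half their mean pairwise distance); whether the project computes CRPS this way is an assumption. The second encodes the rule as written, with the 0.08 margin from the table. Both function names are hypothetical.

```javascript
// Sample-based CRPS for one observation y given predictive draws:
// mean|x - y| - 0.5 * mean|x - x'| (lower is better).
function crpsFromDraws(draws, y) {
  const m = draws.length;
  let absErr = 0;
  let spread = 0;
  for (let i = 0; i < m; i++) {
    absErr += Math.abs(draws[i] - y);
    for (let j = 0; j < m; j++) spread += Math.abs(draws[i] - draws[j]);
  }
  return absErr / m - spread / (2 * m * m);
}

// P6-style rule: revise the cross-family weight only if the challenger family
// beats the favored family on mean CRPS by more than the margin.
function crossFamilyVerdict(favoredCrps, challengerCrps, margin = 0.08) {
  return favoredCrps - challengerCrps > margin ? "REVISE" : "CONFIRM";
}

console.log(crossFamilyVerdict(4.468, 4.515)); // quarterback, skew-normal vs zinb: "CONFIRM"
console.log(crossFamilyVerdict(2.273, 2.284)); // tight end, zinb vs gamma: "CONFIRM"
```

With a single draw the estimator reduces to absolute error, which is why CRPS can be read on the same fantasy-point scale as a point-forecast miss.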

The tight-end model is the same two-model, two-family, stacked design, trained on 2022–2024 (2,844 active, non-DNP tight-end weeks) and tested on the sequestered 2025 season (1,045 realized weeks), exactly as the receiver and back models were.

Model B keeps the receiver’s usage and game-environment frame — target share, air yards, and snap share, with the Vegas total and spread and a home flag — with one position-specific swap: the opponent-defense term is recent points allowed to tight ends rather than to receivers. Model A is unchanged (the calibrated ECR, an injury flag, and a player random intercept). Both distributional families carry over, fit and re-checked under the identical convergence gate.

Stacking leans hard toward the discrete family. The cross-family weight lands around 95% on the zero-inflated negative binomial — far more than for receivers (~69%) or backs (~25%) — because that family so clearly suits a low-count, zero-heavy position. The 2025 holdout confirmed the direction but only narrowly: out of sample the two families were nearly tied, so the negative binomial is best read as consistently if narrowly the better-fitting family for tight ends rather than a runaway. On the expert/data axis the deployed blend is the same equal 50/50 hedge, lean-adjustable on the projection pages.

One roster tier, not three. Tight end is top-heavy with no coherent middle, so it ships a single TE1 tier (floor 5 / target 8 / ceiling 12 fp, half-PPR) rather than a WR1/WR2 split, while still riding the shared Flex bars so an elite tight end can be measured against the same absolute lines as a receiver. The replacement baseline (R_TE = 4.70 fp) and the seven situational archetypes are rebuilt on trailing target share rather than snap share, since blocking tight ends play high snaps with few targets.

Calibration is the softest of the three positions. Every pre-registered tight-end band held on the holdout, but the misses run larger at the low end — roughly two to six percentage points, widest at the floor and target lines, where tight-end scoring is spikiest. The data-driven model reads those lines best, so a tight-end floor or target call is the clearest case for leaning the blend toward Data.

The six pre-registered tight-end predictions and how they resolved

Six predictions were written and locked against the 2022–2024 development data, each with a falsification condition, before any 2025 tight-end outcome was read. Five held; the lone ambiguity is ranking, which the blend was never built to win.

ID	What it tested	Dev anchor	2025 result	Verdict	What this means
P1	Six-line calibration error (ECE) within ±0.03 of development, across the TE1 and Flex floor/target/ceiling lines	floors the soft spot (TE1 floor 0.054)	6/6 in band; floors held (TE1 floor 0.061, Flex 0.049)	PASS	The stated chances held their calibration out of sample, including at the floor — the spot flagged softest in development.
P2	Whether the expert and data models agree on which TE-weeks are easy vs hard to forecast (per-row LPD correlation per family)	zinb 0.977 / gamma 0.964	zinb 0.959 / gamma 0.960	PASS	The two models still tracked each other closely — the agreement that justifies hedging between them.
P3	Whether the cross-family blend’s CRPS beats the best single un-blended component	blend 2.145 vs best-pure 2.147	blend 2.271 vs best-pure gamma 2.285	PASS	The blend earned its keep, edging the best single model on the proper score (the best-pure even flipped families out of sample).
P4	Whether the biggest weekly misses were booms the model had flagged unlikely (top-20 surprises in the upper tail)	20/20 boom	20/20 boom	PASS	When the model was most “wrong” about a tight end, it was a real touchdown spike it had marked a long shot, not a systematic miss.
P5	Whether the blend ranks within-week TE pairs at least as accurately as raw ECR	blend 71.2% vs ECR 70.1% (in-sample)	blend 70.0% vs ECR 71.1%	AMBIGUOUS	The in-sample ranking edge reversed out of sample; the blend does not out-rank the consensus, consistent with its stated value (calibrated probabilities, not ranking).
P6	Cross-family revision rule: whether gamma beats zinb on CRPS by more than 0.08 (which would force re-weighting the ~95%-zinb mix)	zinb ahead by 0.063	zinb 2.273 vs gamma 2.284 (zinb +0.011)	CONFIRM	The zinb family stayed ahead, if narrowly; the cross-family weight was confirmed, not revised.

Tally: 4 PASS, 1 CONFIRM, 1 AMBIGUOUS — five of six held.
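
The ranking rows (P5 here, P7 for backs) compare how often a score orders two players from the same week the way their realized points did. A minimal sketch of that pairwise accuracy, assuming tied pairs are skipped and that every score is oriented so higher is better (for ECR, the negated rank). The row layout, field names, and toy data are hypothetical.

```javascript
// Share of within-week player pairs that a score orders the same way as the
// realized points. Assumption: pairs tied on the score or the outcome are skipped.
function pairwiseAccuracy(rows, scoreKey) {
  const byWeek = new Map();
  for (const r of rows) {
    if (!byWeek.has(r.week)) byWeek.set(r.week, []);
    byWeek.get(r.week).push(r);
  }
  let correct = 0;
  let total = 0;
  for (const group of byWeek.values()) {
    for (let i = 0; i < group.length; i++) {
      for (let j = i + 1; j < group.length; j++) {
        const dScore = group[i][scoreKey] - group[j][scoreKey];
        const dActual = group[i].actual - group[j].actual;
        if (dScore === 0 || dActual === 0) continue;
        total += 1;
        if (Math.sign(dScore) === Math.sign(dActual)) correct += 1;
      }
    }
  }
  return total === 0 ? NaN : correct / total;
}

// Toy data: negEcr is the negated consensus rank, so higher means better.
const rankingRows = [
  { week: 1, player: "A", blend: 12, negEcr: -3, actual: 14 },
  { week: 1, player: "B", blend: 9, negEcr: -5, actual: 6 },
  { week: 1, player: "C", blend: 10, negEcr: -4, actual: 11 },
  { week: 2, player: "D", blend: 8, negEcr: -10, actual: 5 },
  { week: 2, player: "E", blend: 7, negEcr: -9, actual: 9 },
];
console.log(pairwiseAccuracy(rankingRows, "blend"));  // 0.75
console.log(pairwiseAccuracy(rankingRows, "negEcr")); // 1
```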

Situational archetypes

Some players are tagged with one or more archetype flags: per-position, pre-game, player-week flags computed from the same features that feed the models (snap-, carry-, and target-share lags, ECR-rank lags, injury-report status, prior-season usage, and, for the positions that use it, roster availability for qualifying teammates). The definitions are frozen, and the diagnostics and the dashboard export both source them. The point of carving the holdout up this way is to ask where the expert consensus disagrees with what the usage and game-environment data alone are saying, and which side is reliably closer when they do. A player-week tests that question more sharply when the player is in some kind of transition than when nothing about his role has changed. The four positions use parallel but not identical flag sets, since each role is read through its own usage channel: snaps and targets for receivers, carries for backs, target share for tight ends, and passing volume plus rushing for quarterbacks. Pick a position:

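{
  // Show only the archetype panel for the selected position and hide the others.
  // archPos is assumed to be the page's position-selector value, defined in another cell.
  for (const el of document.querySelectorAll(".arch-wr")) el.style.display = archPos === "WR" ? "" : "none";
  for (const el of document.querySelectorAll(".arch-rb")) el.style.display = archPos === "RB" ? "" : "none";
  for (const el of document.querySelectorAll(".arch-te")) el.style.display = archPos === "TE" ? "" : "none";
  for (const el of document.querySelectorAll(".arch-qb")) el.style.display = archPos === "QB" ? "" : "none";
  // Return an empty fragment so the cell itself renders nothing.
  return html``;
}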

Wide-receiver archetypes

The seven flags, with thresholds set by judgment and documented in code:

fill_in_situation — a same-position teammate who had been a recent contributor is unavailable on a pre-game basis (ruled Out, Doubtful, or IR, listed inactive, held out of practice all week, or marked down sharply in the week’s consensus ranking), and the focal player, a plausible beneficiary by his recent role and ranked outside the startable tier (ECR > 60), stands to absorb the vacated work. Computed entirely from pre-game information, with no current-week snap or usage data.
emerging_player_elevation — the same pre-game elevation pattern as a fill-in, but for a more highly ranked focal player (ECR ≤ 60), where the consensus has already taken notice.
late_season_expansion — the rolling four-week snap share is at least 0.15 higher than four weeks ago, and ECR has not improved by more than 20 places over that span.
stable_veteran — current four-week snap share above 0.65, sustained above 0.65 for at least six prior active weeks, no current injury listing, and ECR within 20 places of each of the last four weeks.
recent_role_change — a week-over-week snap-share jump above 0.30, not already classified as fill-in or emerging.
rookie_or_low_sample — fewer than six prior active weeks of history, and not already flagged as emerging.
star_returning — ECR rank ≤ 24, on the injury report in either of the last two weeks, and not currently ruled Out or IR.
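
Two of the receiver flags above can be sketched as plain predicates. The thresholds (0.15, 0.65, 20 places, six weeks) come from the list; the field names are hypothetical, and reading "sustained" as "the six most recent prior active weeks all clear the line" is my assumption, not something the definitions spell out.

```javascript
// late_season_expansion: rolling snap share up at least 0.15 over four weeks
// ago, without ECR improving by more than 20 places over that span.
function lateSeasonExpansion({ snapShareNow, snapShare4WeeksAgo, ecrNow, ecr4WeeksAgo }) {
  const snapGain = snapShareNow - snapShare4WeeksAgo;
  const ecrImprovement = ecr4WeeksAgo - ecrNow; // a lower rank is better
  return snapGain >= 0.15 && ecrImprovement <= 20;
}

// stable_veteran: high, sustained snap share, no injury listing, steady ECR.
// Assumption: "sustained" means the six most recent prior active weeks all exceed 0.65.
function stableVeteran({ snapShareNow, priorSnapShares, injuryListed, ecrNow, last4Ecr }) {
  const recent = priorSnapShares.slice(-6);
  const sustained = recent.length === 6 && recent.every((s) => s > 0.65);
  const ecrSteady = last4Ecr.length === 4 &&
    last4Ecr.every((rank) => Math.abs(rank - ecrNow) <= 20);
  return snapShareNow > 0.65 && sustained && !injuryListed && ecrSteady;
}
```

Both functions take only values known before kickoff, matching the pre-game framing of the flags.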

Running-back archetypes

The set is not identical to the receivers’: the emerging-player flag has no clean backfield analog and is dropped, the rookie and low-sample cases are split into two flags, the fill-in flag is defined prospectively from a teammate’s pre-game unavailability rather than from realized carries, and a dual-threat flag marks the genuine two-way backs. The eight flags, with thresholds again set by judgment and documented in code:

fill_in_rb — a recent-contributor running-back teammate is unavailable pre-game (injury report, inactive list, no practice all week, or a sharp ECR drop) and the focal back is a plausible beneficiary by his recent carry and snap role. Pre-game only, no current-week carries.
is_rookie — the player’s NFL entry season (a roster years_exp of zero).
low_sample — a non-rookie with fewer than six prior active weeks.
late_season_expansion — trailing four-week carry share at least 0.15 higher than four weeks earlier, while ECR has not improved by more than 20 places.
recent_role_change — a week-over-week trailing-carry-share jump above 0.30 that is not already a fill-in.
stable_veteran — current trailing carry share above 0.50, sustained above 0.50 for at least six prior active weeks, no current injury listing, and ECR within 20 places of each of the last four weeks. The 0.50 line is deliberately lower than the receivers’ 0.65 snap-share cut, since a 0.65 carry share would capture only a handful of true bell-cows.
star_returning — a highly ranked back (ECR ≤ 24) on the injury report in either of the prior two weeks but not currently ruled Out or IR.
dual-threat back (dual_threat_rb) — a genuine two-way back: prior-season usage of at least 4 carries per game and a receiving share of at least 20% of total yards, plus at least 3.5 targets per game or an 11% team target share (cumulative season-to-date for backs with no prior season). Receptions are a stable, game-script-independent source of points, so these backs tend to carry steadier weekly floors.
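
The dual-threat definition combines three conditions, the last of which is an either/or. A sketch of that logic, assuming "total yards" means rushing plus receiving yards; the function and field names are hypothetical, and the inputs are prior-season values (or season-to-date values for backs with no prior season, as the text describes).

```javascript
// dual_threat_rb: enough carries, a meaningful receiving share of yards, and
// either per-game target volume or team target share.
// Assumption: total yards = rushing yards + receiving yards.
function dualThreatRb({ carriesPerGame, receivingYards, rushingYards, targetsPerGame, teamTargetShare }) {
  const totalYards = receivingYards + rushingYards;
  const receivingShare = totalYards > 0 ? receivingYards / totalYards : 0;
  const targetVolume = targetsPerGame >= 3.5 || teamTargetShare >= 0.11;
  return carriesPerGame >= 4 && receivingShare >= 0.20 && targetVolume;
}
```

A pure early-down runner fails on receiving share, and a receiving specialist with few carries fails on the carry floor, so only backs used both ways are flagged.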

Tight-end archetypes

The tight-end set drops the handcuff/fill-in concept and reads role through trailing target share rather than snap share, since blocking tight ends play heavy snaps with few targets. Seven flags:

emerging_player_elevation — trailing four-week target share at least 0.06 above the player’s prior-season average, and not already a stable veteran; organic growth into a larger role.
late_season_expansion — trailing four-week target share up at least 0.06 over four weeks earlier, without ECR improving by more than 12 places.
recent_role_change — a week-over-week target-share jump above 0.05 not already caught by the two flags above.
is_rookie — the player’s NFL entry season.
low_sample — a non-rookie with fewer than six prior active weeks.
stable_veteran — trailing target share above 0.15 (the featured-tight-end line, about the active-TE 75th percentile), sustained above it for at least four prior weeks, no current injury listing, and ECR within 20 places of each of the last four weeks.
star_returning — a highly ranked tight end (ECR ≤ 12, position-scaled down from the receivers’ 24) on the injury report in either of the last two weeks but not currently ruled Out or IR.

Quarterback archetypes

Active quarterbacks all throw at starter volume, so “featured” usage can’t be read off passing share the way it is elsewhere; the quarterback set is built around consensus rank, attempt-volume transitions, and rushing. Eight flags:

stable_starter — trailing four-week mean ECR in the QB1 tier (≤ 12) with no current injury listing; the entrenched starter. A deliberately rank-based definition, distinct from the usage-based “stable” flags elsewhere, because every active quarterback clears starter passing volume.
dual-threat QB (rushing_qb) — a stable rushing type: prior-season carries per game of at least 5 (cumulative season-to-date for passers with no prior season), so a dual-threat stays flagged through a low-rushing week rather than only in his big-rushing ones.
new_starter — a week-over-week jump in trailing pass attempts above 7, the signature of a just-installed starter, not already an emerging or late-expansion case.
emerging_elevation — trailing pass attempts at least 5 above the player’s prior-season average, and not a stable starter.
late_season_expansion — trailing pass attempts up at least 8 over four weeks earlier, without ECR improving by more than 8 places.
star_returning — a highly ranked passer (ECR ≤ 8) on the injury report in either of the last two weeks but not currently ruled Out or IR.
is_rookie — the player’s NFL entry season.
low_sample — a non-rookie with fewer than six prior active weeks.
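
The star_returning flag has the same shape at every position and differs only in its rank cutoff (24 for receivers and backs, 12 for tight ends, 8 for quarterbacks). A sketch of that shared definition; the field names and the status strings "Out" and "IR" as literal values are assumptions about how the inputs might be encoded.

```javascript
// Position-scaled ECR cutoffs for star_returning, as listed in the flag sets.
const STAR_ECR_CUTOFF = { WR: 24, RB: 24, TE: 12, QB: 8 };

// star_returning: highly ranked, on the injury report in either of the last
// two weeks, and not currently ruled Out or on IR.
function starReturning({ position, ecr, onReportLastWeek, onReportTwoWeeksAgo, currentStatus }) {
  const cutoff = STAR_ECR_CUTOFF[position];
  if (cutoff === undefined) throw new Error(`Unknown position: ${position}`);
  const recentlyListed = onReportLastWeek || onReportTwoWeeksAgo;
  const unavailable = currentStatus === "Out" || currentStatus === "IR";
  return ecr <= cutoff && recentlyListed && !unavailable;
}
```

The same rank of 10 therefore flags a tight end but not a quarterback, which is the point of scaling the cutoff to the depth of each position.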

Two caveats hold across every set. First, the thresholds are judgment-set rather than fit to an objective, chosen for face-validity against football intuition. Second, the per-archetype samples on a single holdout season are modest: these are player-weeks, not distinct players, so a starter flagged across many weeks contributes many rows, and even so the situational comparisons are underpowered on one season. When I stratified the 2025 holdout by these situations, the patterns I expected did not reliably survive scrutiny. With the flags defined collider-free (that is, without conditioning on realized current-week usage) and the uncertainty reported honestly, the models look comparably accurate across situations rather than one reliably winning in any of them. The flags are situational context, and the Edge Finder and Track Record pages report them with that understanding.

Limitations

This is a proof of concept. It covers wide receivers, running backs, tight ends, and quarterbacks, on a half-PPR scale, and rests on a single held-out season, so the situational findings, drawn from small numbers of fill-in or returning-star weeks, are suggestive rather than settled. The Expert, Data, and Blend views are model-based summaries, not the raw numbers experts publish, and the consensus signal is a rank converted to points through a fixed mapping.

References

Yao, Y., Vehtari, A., Simpson, D., & Gelman, A. (2018). Using stacking to average Bayesian predictive distributions. Bayesian Analysis, 13(3), 917–1007.
Gneiting, T., & Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477), 359–378.
Bürkner, P.-C. (2017). brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical Software, 80(1), 1–28. doi:10.18637/jss.v080.i01
Silver, N. (2012). The Signal and the Noise. Penguin Press.