Methodology

Overview

This dashboard rests on a Bayesian model that turns two sources of information, expert consensus and player usage/game-environment data, into weekly fantasy-point predictions expressed as full probability distributions rather than single summary numbers (such as a mean or a standard deviation). The design has three moving parts: two structurally different models, each fit in two distributional families, and combined by stacking. The combined model is then evaluated on a season it never saw during development. What follows is the full account, from the data through the held-out test.

The data and the two signals

All player statistics, schedules, snap counts, air yards, injury reports, and Vegas lines come from the nflverse project, retrieved with the nflreadr package. The expert signal is the FantasyPros aggregated Expert Consensus Rank (ECR), also retrieved through the nflverse archive.

The model is trained on the 2022, 2023, and 2024 seasons (5,402 active player-weeks, filtered to receivers with an offensive snap share above 20% who were not ruled out by injury), and tested on 2025, which was sequestered throughout development.

Two models, by design

The heart of the design is two models that are structurally different, not the same model nudged in two directions. That difference is what makes “expert versus data” a genuine comparison rather than a modeling artifact.

Model A (expert-anchored) uses only the expert consensus projection (the calibrated ECR) and an injury-status flag, plus a player-specific random intercept. It is, in effect, the experts’ all-things-considered estimate, recalibrated against history and shrunk toward the player’s own track record.

Model B (data-driven) never sees the expert consensus. It works from player usage (four-week rolling target share, air-yards share, and snap share, plus an injury flag) and game environment (Vegas-implied team total and spread, the opponent’s recent points allowed to receivers, and a home indicator), again with a player random intercept. Its value proposition is a forecast built from observable usage and game environment alone.

Model predictors

Both models draw from the same feature table, built once per season and frozen on the same logic across years. Every continuous predictor is z-scored against the training distribution before entering the model so coefficients sit on a common scale and the priors hold the same meaning across variables.
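The standardization step can be sketched as follows. This is a minimal Python illustration, not the project's R code: the function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator is used. The key point it shows is that the mean and standard deviation are learned from training rows only and then reused unchanged on later seasons.

```python
from statistics import mean, pstdev

def fit_scaler(train_values):
    """Learn (mean, sd) from training rows only (hypothetical helper)."""
    mu = mean(train_values)
    sd = pstdev(train_values)  # assumption: population sd; the text does not specify
    if sd == 0:
        raise ValueError("a constant predictor cannot be standardized")
    return mu, sd

def apply_scaler(values, mu, sd):
    """Standardize any rows (training or holdout) with the frozen training statistics."""
    return [(v - mu) / sd for v in values]

# Illustrative target-share values, not real data.
train = [0.10, 0.20, 0.30]
holdout = [0.40]

mu, sd = fit_scaler(train)
train_z = apply_scaler(train, mu, sd)
holdout_z = apply_scaler(holdout, mu, sd)  # holdout never influences mu or sd
```

Because the holdout rows are scaled with training statistics, a coefficient means the same thing in 2025 as it did during fitting.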

The expert signal (used by Model A only):

  • consensus_proj: the FantasyPros aggregated ECR converted to expected fantasy points via a frozen rank-to-points calibration. I could not find a free archive of historical weekly ECR-based point projections (with a transparent and stably defined conversion process), so the expert signal is the ECR rank converted by the project itself to an expected fantasy-point value, rather than a published projection or a raw rank. The conversion is a monotonic rank-to-points calibration fit once, on 2022–2023 active non-DNP WR player-weeks. It bins player-weeks by integer ECR rank, takes the mean realized half-PPR fantasy points per rank, and runs a pool-adjacent-violators isotonic regression (R’s stats::isoreg) on the binned means with the y-axis negated, then negates the fit again so the result is non-increasing in rank: a better rank maps to at least as many expected points as a worse one. Predictions for ranks not seen in the binned table are obtained by linear interpolation, with flat extrapolation past the highest and lowest observed ranks. The resulting consensus_proj column is the same predictor everywhere, frozen and applied unchanged to 2024 and 2025.
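The rank-to-points calibration described above can be sketched in a few lines. This is an illustrative Python re-implementation under stated assumptions, not the project's code (which uses R's stats::isoreg): the function names and the toy data are hypothetical, and the pool-adjacent-violators step is unweighted on the binned means, mirroring the description.

```python
from collections import defaultdict

def pav_nondecreasing(y):
    """Unweighted pool-adjacent-violators fit, non-decreasing in index order."""
    blocks = []  # each block holds [sum, count]
    for value in y:
        blocks.append([value, 1])
        # Pool backwards while the previous block's mean exceeds the current one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fitted = []
    for s, n in blocks:
        fitted.extend([s / n] * n)
    return fitted

def fit_rank_to_points(ranks, points):
    """Bin by integer rank, average, then fit a non-increasing curve."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r, p in zip(ranks, points):
        sums[r] += p
        counts[r] += 1
    xs = sorted(sums)
    means = [sums[r] / counts[r] for r in xs]
    # Negate, fit non-decreasing, negate back: the result is non-increasing in rank.
    fitted = [-v for v in pav_nondecreasing([-m for m in means])]
    return xs, fitted

def predict_points(rank, xs, fitted):
    """Linear interpolation between observed ranks, flat beyond the ends."""
    if rank <= xs[0]:
        return fitted[0]
    if rank >= xs[-1]:
        return fitted[-1]
    for i in range(len(xs) - 1):
        if xs[i] <= rank <= xs[i + 1]:
            frac = (rank - xs[i]) / (xs[i + 1] - xs[i])
            return fitted[i] + (fitted[i + 1] - fitted[i]) * frac

# Toy data (not real): rank 3 out-scored rank 2, so those two bins get pooled.
xs, fitted = fit_rank_to_points([1, 1, 2, 3, 3, 5], [18, 14, 9, 11, 13, 4])
unseen_rank_estimate = predict_points(4, xs, fitted)
```

In the toy data the binned means for ranks 2 and 3 violate monotonicity, so the fit replaces both with their pooled average; rank 4, which was never observed, is interpolated between ranks 3 and 5.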

The usage signals, each a rolling four-week trailing mean computed per player and reset at season boundaries (used by Model B):

  • target_share_l4 — fraction of the team’s pass attempts targeted at the player, averaged over the trailing four active weeks; injury-DNP weeks are excluded from the average.
  • air_yards_l4 — share of the team’s air yards directed at the player, trailing four weeks with non-missing entries.
  • snap_share_l4 — the player’s offensive snap share, trailing four weeks including non-active weeks counted as zero. A player who missed games drags this measure down, which is the intended behavior for a usage signal.

The game-environment signals (used by Model B):

  • vegas_team_total — the Vegas-implied team scoring total for the upcoming game, derived from the market total and spread.
  • spread_for_team — the point spread signed from the player’s team’s perspective (negative when favored).
  • opp_pts_allowed_wr_l4 — the opposing defense’s trailing four-week mean of total fantasy points allowed to opposing WRs, on the same scoring scale.
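For readers unfamiliar with how a team total falls out of the market total and spread, here is a minimal Python sketch. It assumes the conventional arithmetic (half the game total, shifted by half the spread) and the sign convention stated above (negative spread when favored); the text does not show the project's exact derivation, and the function name is hypothetical.

```python
def implied_team_total(market_total, spread_for_team):
    """Implied points for one team.

    market_total: the over/under for the game.
    spread_for_team: spread from this team's perspective (negative when favored).
    Assumption: conventional split of the total, not confirmed project code.
    """
    return market_total / 2 - spread_for_team / 2

# Illustrative game: total of 47.5, home team favored by 3.5.
favorite = implied_team_total(47.5, -3.5)
underdog = implied_team_total(47.5, +3.5)
```

The two implied totals always sum back to the market total, and the favorite's total exceeds the underdog's by exactly the spread.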

Binary status flags:

  • injury_status — flag for a Questionable / Doubtful / Out / IR listing on the final pre-game report. Both models include it, because the FantasyPros consensus snapshot is fixed in time and the injury report might update between snapshot and kickoff; including the flag is intended to put both models on equal decision-time information footing.
  • home — binary home indicator, Model B only.

Opening-week and missing-data handling. Where a rolling four-week mean is undefined (early in a season, or for a player without four prior active weeks), the value falls back to that player’s prior-season active-weeks mean; if that is also missing, to the league-wide WR-population mean computed on 2022–2023 active weeks. Each rolling feature carries an _was_imputed companion flag so downstream checks can identify the fallback rows. The Vegas inputs and the home flag are observed directly and need no fallback; consensus_proj is imputed to the population mean only when the underlying ECR is itself missing, which is rare.
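The fallback chain for the active-week rolling features can be sketched like this. It is a hedged Python illustration: the function name and arguments are hypothetical, and excluding the current week from the window is my assumption about how "trailing" is implemented. The snap-share variant, which counts non-active weeks as zero, would need a different input list.

```python
def rolling_l4_feature(prior_values, prior_season_mean=None,
                       population_mean=0.0, window=4):
    """Return (value, was_imputed) for the upcoming week.

    prior_values: this season's earlier active-week values, oldest first.
    The list restarts each season, which gives the reset at season boundaries.
    """
    if len(prior_values) >= window:
        recent = prior_values[-window:]
        return sum(recent) / window, False
    if prior_season_mean is not None:
        # First fallback: the player's own prior-season active-weeks mean.
        return prior_season_mean, True
    # Second fallback: league-wide WR population mean from 2022-2023 active weeks.
    return population_mean, True

# Illustrative values only.
observed = rolling_l4_feature([0.20, 0.30, 0.10, 0.20, 0.40])
from_last_season = rolling_l4_feature([0.25], prior_season_mean=0.18)
from_population = rolling_l4_feature([], population_mean=0.12)
```

The second element of each result plays the role of the _was_imputed companion flag, so downstream checks can separate observed rolling means from fallback values.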

Note: Model A/B equations

Formally, with \(i\) indexing player-weeks and \(p[i]\) the player, both models share the same hierarchical structure and differ only in their predictor matrix \(X\):

\[ \eta_i = X_i \beta + u_{p[i]}, \qquad u_p \sim \mathcal{N}(0, \sigma_u^2). \]

\(\eta_i\) is the linear predictor on the appropriate distributional scale (see Distributional families below) and \(u_{p[i]}\) is the player-level random intercept. The two models differ only in what \(X\) contains:

  • Model A: intercept, consensus_proj_z, injury_status.
  • Model B: intercept, vegas_team_total_z, spread_for_team_z, target_share_l4_z, air_yards_l4_z, snap_share_l4_z, opp_pts_allowed_wr_l4_z, injury_status, home.

All continuous predictors are standardized (the _z suffix) against the training-set mean and standard deviation, so the priors \(\beta \sim \mathcal{N}(0, 2.5)\) and \(\sigma_u \sim \mathcal{N}(0, 1.5)\) retain a common meaning across variables. Auxiliary distributional parameters (the negative-binomial dispersion, the gamma shape, the zero-inflation logit) carry their own intercepts and, in Model B, may depend additionally on snap_share_l4_z, since a player who is on the field more is harder to put at zero. In Model A those auxiliaries are kept sparse (intercept-only or intercept-plus-injury_status) so that Model A stays strongly anchored to the ECR ranking and remains interpretable as the expert-consensus model.

Why two models rather than one?

A natural question is why not fit a single model with the consensus rank as one covariate among the usage features, and let the data decide its weight. That is a perfectly reasonable model, and for raw point prediction it might do just as well; the two-model design was chosen for other reasons. The first and main one is the goal of the project, to make the everyday act of weighing expert judgment against the numbers visible, which requires two separate predictive distributions you can hold side by side rather than a single distribution with the expert signal dissolved into a coefficient. The disagreement view, the head-to-head comparison, and the Expert/Data/Blend split all depend on having two genuinely distinct forecasts. The second reason is that stacking whole predictive distributions is more robust to either model being misspecified than trusting one model’s assumed functional form, and it yields a clean, inspectable answer to how much weight each signal carries, which a single model’s entangled coefficients cannot, especially since consensus and usage are correlated enough that their coefficients would compete for the same variance. The cost is a more elaborate pipeline; the benefit is transparency and a decomposition that might tell us something interesting.

Distributional families

Weekly receiver scoring is awkward to model: a spike of near-zero (floor) weeks sits beside a long upside (ceiling) tail. Four families were fit to that shape, each as both Model A and Model B, and judged on 2024 data under a strict convergence gate, requiring every fit to reach an R-hat below 1.01, zero divergent transitions, and a bulk effective sample size above 400, with a single permitted refit at higher adaptation. Two families failed the gate even after the refit and were disqualified: a hurdle-lognormal, whose expert model would not converge (R-hat 1.02), and a skew-normal, whose data model both failed to converge and mixed poorly (R-hat 1.02, effective sample size 178). That left a zero-inflated negative binomial, which handles the zero spike on the natural integer scale, and a gamma with a log link, a continuous and naturally right-skewed positive distribution.
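The convergence gate is simple enough to state as a predicate. The Python sketch below is illustrative only: the function name is hypothetical, the R-hat and effective-sample-size figures for the two disqualified fits are the ones quoted above, and the divergence counts and remaining values are placeholders because the text does not report them.

```python
def passes_gate(rhat_max, n_divergent, ess_bulk_min,
                rhat_limit=1.01, ess_limit=400):
    """True only if all three convergence criteria are met."""
    return (rhat_max < rhat_limit
            and n_divergent == 0
            and ess_bulk_min > ess_limit)

# R-hat 1.02 and ESS 178 come from the text; every other number is a placeholder.
fits = {
    "hurdle_lognormal_model_A": dict(rhat_max=1.02, n_divergent=0, ess_bulk_min=1000),
    "skew_normal_model_B": dict(rhat_max=1.02, n_divergent=0, ess_bulk_min=178),
    "hypothetical_clean_fit": dict(rhat_max=1.003, n_divergent=0, ess_bulk_min=1500),
}
verdicts = {name: passes_gate(**diag) for name, diag in fits.items()}
```

A family is only retained if both its Model A and Model B fits pass, which is why a single failing fit was enough to disqualify each of the two dropped families.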

On held-out 2024 accuracy the two survivors were nearly tied, with a pooled CRPS of 3.16 for the negative binomial against 3.18 for the gamma. Rather than crown one on a margin that small, the project asked a different question: do the two families actually predict differently? They do, diverging beyond their typical level on about a quarter of player-weeks, so both were retained and combined, on the principle that two models that disagree fairly often might carry independent signals and are therefore both worth keeping.

Note: Model zinb/gamma equations

The two retained families are written formally as follows, suppressing the player random intercept already specified above and letting \(\eta_i^{\text{main}}\) denote the linear predictor from Model A or Model B.

Zero-inflated negative binomial (zinb_rounded). The fantasy-point outcome is rounded to a non-negative integer \(y_i\) and modeled as a mixture:

\[ y_i \mid \pi_i, \mu_i, \phi \;\sim\; \pi_i\, \delta_0 \;+\; (1 - \pi_i)\, \mathrm{NegBin}(\mu_i, \phi), \] \[ \log \mu_i = \eta_i^{\text{main}}, \qquad \mathrm{logit}\, \pi_i = \eta_i^{\text{zi}}, \]

where \(\delta_0\) is a point mass at zero, \(\eta^{\text{main}}\) models the count mean, and \(\eta^{\text{zi}}\) is a separate linear predictor for the zero-inflation probability (intercept-plus-injury_status in Model A; additionally snap_share_l4_z in Model B). The integer rounding is the cost of using a discrete family on continuous fantasy points and is documented as a small approximation.

Gamma with log link (gamma_log). Fantasy points are floored at a tiny positive value \(y_i^+ = \max(y_i, 0.01)\) so the support is the positive reals, and modeled as

\[ y_i^+ \mid \mu_i, \alpha_i \;\sim\; \mathrm{Gamma}\!\big(\text{shape} = \alpha_i,\; \text{rate} = \alpha_i / \mu_i\big), \] \[ \log \mu_i = \eta_i^{\text{main}}, \qquad \log \alpha_i = \eta_i^{\text{shape}}, \]

where \(\alpha_i\) is the gamma shape parameter — also modeled with covariates — controlling how heavy the right tail is. The flooring is, like the integer rounding above, a small concession to the family’s support; it touches only the rare exact-zero weeks.
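To make the zero-inflated mixture concrete, the sketch below evaluates its probability mass function and a threshold-exceedance probability in Python. It assumes the mean/dispersion parameterization of the negative binomial (variance \(\mu + \mu^2/\phi\)), which is the usual one in Stan-based tools but is not spelled out in the text; the function names and parameter values are hypothetical.

```python
import math

def negbin_logpmf(y, mu, phi):
    """Log pmf of a negative binomial with mean mu and dispersion phi (assumed form)."""
    return (math.lgamma(y + phi) - math.lgamma(phi) - math.lgamma(y + 1)
            + phi * math.log(phi / (mu + phi))
            + y * math.log(mu / (mu + phi)))

def zinb_pmf(y, mu, phi, pi):
    """Mixture: point mass at zero with probability pi, negative binomial otherwise."""
    count_part = (1 - pi) * math.exp(negbin_logpmf(y, mu, phi))
    return pi + count_part if y == 0 else count_part

def zinb_exceedance(threshold, mu, phi, pi):
    """P(Y >= threshold) for a non-negative integer threshold."""
    return 1.0 - sum(zinb_pmf(k, mu, phi, pi) for k in range(threshold))

# Illustrative parameters, not fitted values.
mu, phi, pi = 8.0, 2.0, 0.10
p_zero = zinb_pmf(0, mu, phi, pi)
p_at_least_15 = zinb_exceedance(15, mu, phi, pi)
```

Note that zeros come from two sources in this family: the structural point mass and the negative binomial's own mass at zero, so the probability of a zero week is always at least \(\pi_i\).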

Bayesian priors

All models are specified with standard weakly informative priors: Normal(0, 2.5) on standardized fixed-effect coefficients, Normal(0, 1.5) on the player random-effect standard deviation, and Normal(0, 2.5) on auxiliary distributional parameters. Models were fit in brms on a cmdstanr backend.

Combining the models via stacking

The four fitted models (two structural models in two families) are combined by stacking, a method that weights predictive distributions by how well they actually forecast out of sample rather than by how well they fit (Yao, Vehtari, Simpson, and Gelman, 2018). The combination happens in two steps. Within each family, Model A and Model B are stacked using leave-one-out predictive scores: the weight on the expert-anchored Model A comes out to 0.49 in the negative-binomial family and 0.13 in the gamma family. Across families, the two within-family blends are stacked on a common fantasy-point scoring scale, placing about 69% of the weight on the negative-binomial blend.
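For two models, stacking reduces to a one-dimensional problem: pick the weight on Model A that maximizes the summed log of the blended pointwise predictive densities. The Python sketch below illustrates that objective with a simple ternary search. It is not the project's implementation; the function name and the density values are hypothetical, and in practice the inputs would be leave-one-out pointwise densities.

```python
import math

def stack_two(dens_a, dens_b, iters=100):
    """Weight w on model A maximizing sum(log(w * pA + (1 - w) * pB)), with w in [0, 1].

    dens_a, dens_b: strictly positive pointwise predictive densities, one per observation.
    The objective is concave in w, so a ternary search finds the maximum.
    """
    def objective(w):
        return sum(math.log(w * a + (1 - w) * b) for a, b in zip(dens_a, dens_b))

    lo, hi = 0.0, 1.0
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if objective(m1) < objective(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

# Illustrative densities: each model is better on some rows.
w_model_a = stack_two([0.30, 0.05, 0.20, 0.10], [0.10, 0.25, 0.15, 0.12])
```

The weight rewards a model for covering rows the other model handles badly, which is why a model with a worse average score can still earn substantial weight.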

Regrouping those weights by which model each piece came from gives a clean summary of the deployed blend: it draws roughly 38% on the expert-anchored models and 62% on the data-driven ones. That decomposition is exactly how the Expert, Data, and Blend views on the rest of the site are defined.
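The regrouping is plain arithmetic on the three reported weights, shown here as a short Python check. The variable names are mine; the numbers are the rounded weights quoted above.

```python
w_expert_zinb = 0.49   # within-family weight on Model A, negative binomial
w_expert_gamma = 0.13  # within-family weight on Model A, gamma
w_zinb_family = 0.69   # cross-family weight on the negative-binomial blend

# Effective weight of each of the four base models in the deployed blend.
components = {
    "A_zinb": w_zinb_family * w_expert_zinb,
    "B_zinb": w_zinb_family * (1 - w_expert_zinb),
    "A_gamma": (1 - w_zinb_family) * w_expert_gamma,
    "B_gamma": (1 - w_zinb_family) * (1 - w_expert_gamma),
}

expert_share = components["A_zinb"] + components["A_gamma"]  # about 0.38
data_share = components["B_zinb"] + components["B_gamma"]    # about 0.62
```

Most of the expert share comes through the negative-binomial family; the gamma family contributes only about four points of it.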

It can look surprising that the blend leans toward the data model when the raw expert consensus is the better ranker (see Calibration). Stacking weights are tuned to forecast the whole distribution well rather than to order players, and a model can lose the ranking race while producing better-calibrated distributions; the negative-binomial family splits the two models roughly evenly, while the gamma family, whose job is the magnitude of a productive week, leans hard on the usage signal that predicts it. A good deal of what the experts know also tends to be present already in a receiver’s usage, so adding the expert model on top yields limited incremental gain.

Holdout discipline, pre-registration, and evaluating predictions

Until it can be tested in real time on unrealized predictions (e.g., in the upcoming 2026 season), the project’s credibility rests on 2025 being genuinely untouched until the end. Development, including every family choice, prior adjustment, and feature decision, used only 2022–2024; a guard script blocked any accidental reference to the holdout during those phases. Before the 2025 evaluation was run, a set of mostly methodological predictions about what that evaluation would show was written down and locked: eleven predictions (thirteen counting sub-parts), spanning ranking accuracy, calibration, where the blend would and would not help, and which player situations would drive expert-data disagreement. When the holdout was finally scored, six were confirmed, three were contradicted, and four were genuinely ambiguous. Reporting the misses and the ambiguities, rather than quietly dropping them, is part of the point.

How predictions are evaluated

The model was selected and is judged primarily on the continuous ranked probability score (CRPS), a proper scoring rule that rewards a forecast for placing probability mass near what actually happened and penalizes both overconfidence and vagueness (Gneiting and Raftery, 2007). Alongside CRPS, calibration is summarized by the Expected Calibration Error (ECE): the average gap between the model’s stated probability and the realized exceedance frequency, computed within probability bins and averaged across them. Zero means perfectly calibrated; lower is better. For example, ECE = 0.02 reads as “on average, when the model says ‘X% chance,’ the actual rate is about 2 percentage points off.” The Calibration page shows the same quantity visually as reliability curves, of which ECE is the scalar summary. In addition, the Calibration page compares weekly ranking accuracy against the raw expert consensus and reports mean absolute error; the Disagreement page stratifies results by situational archetype (seven judgment-defined flags described in detail below) to locate where the two signals might systematically diverge.
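Both metrics are easy to compute from posterior predictive draws. The Python sketch below uses the standard sample-based CRPS estimator and a binned ECE. It is illustrative rather than the project's code: the function names are hypothetical, and the choice of ten equal-width bins is an assumption, since the text does not state the binning.

```python
def crps_from_draws(draws, observed):
    """Sample-based CRPS: E|X - y| - 0.5 * E|X - X'| over predictive draws X."""
    n = len(draws)
    spread_to_obs = sum(abs(x - observed) for x in draws) / n
    spread_within = sum(abs(x - z) for x in draws for z in draws) / (2 * n * n)
    return spread_to_obs - spread_within

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Bin-size-weighted mean gap between stated probability and realized frequency.

    probs: stated exceedance probabilities in [0, 1].
    outcomes: 1 if the threshold was actually cleared, else 0.
    """
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins[idx].append((p, o))
    ece = 0.0
    for rows in bins:
        if rows:
            mean_p = sum(p for p, _ in rows) / len(rows)
            freq = sum(o for _, o in rows) / len(rows)
            ece += len(rows) / len(probs) * abs(mean_p - freq)
    return ece

# A point forecast's CRPS is just its absolute error.
point_forecast_score = crps_from_draws([5.0], 7.0)
# A 20% forecast that comes true one time in five is perfectly calibrated.
calibrated = expected_calibration_error([0.2] * 5, [1, 0, 0, 0, 0])
```

CRPS is on the fantasy-point scale, which is why the pooled values near 3.2 quoted elsewhere on this page can be read roughly as points of distributional error per player-week.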

Note: The eleven pre-registered predictions and how they resolved

Each prediction below was written with a 2024 anchor value and a specific falsification condition before any 2025 prediction or evaluation had been run. The rightmost column gives the non-technical meaning of each result.

Terms used here. Cross-blend or Blend refers to the deployed cross-family mixture — the Blend you see throughout the dashboard. Pure refers to any one of the four un-blended base models: Model A or Model B fit in either the ZINB family or the gamma family. γ (gamma) is the cross-family stacking weight, the share of the Blend that comes from the ZINB family. LPD (log predictive density) is how well a model’s predictive distribution fits an actual outcome; higher is better, and the per-row LPD correlation between two models is how much they agree on which player-weeks are easy or hard to forecast. ECE and CRPS are defined in the How predictions are evaluated section above. The disagreement archetypes are fill-ins, emerging players, late-season expansions, and returning stars.

| ID | What it tested | 2024 anchor | 2025 result | Verdict | What this means |
| --- | --- | --- | --- | --- | --- |
| P1a | How close the Blend’s stated probability of clearing the WR1 floor came to the observed exceedance rate (cross-blend ECE) | 0.035 | 0.018, CI [0.012, 0.037] | AMBIGUOUS | The model’s stated chance of clearing the WR1 floor was off by about 2 percentage points on average. That is better than required, but the point estimate sat just below the band’s lower edge, so the formal test is ambiguous. |
| P1b | Same for the WR1 target threshold | 0.016 | 0.013 (within band) | PASS | The model’s stated chance of clearing the WR1 target was off by about 1 percentage point on average, solidly within the expected range. |
| P1c | Same for the WR1 ceiling threshold | 0.009 | 0.023 (within band) | PASS | Same metric at the ceiling threshold: about 2 percentage points off on average, well inside the band. |
| P2 | Whether the expert-anchored (Model A) and data-driven (Model B) models agree on which player-weeks are well- vs poorly-predicted, measured by per-row LPD correlation in both retained families | 0.90–0.95 staged | zinb 0.952, gamma 0.877 | PASS | The expert and data models agreed on which weeks were easy or hard to forecast far more often than not, in both distributional families. |
| P3 | Whether the Blend’s mix between the ZINB and gamma families (γ) stayed stable across the 17 weeks: within [0.45, 0.85] on at least 14 of 17 | 0.439 / 0.635 staged; deployed γ 0.691 | 7 of 17 weeks in band | FAIL | The Blend’s mix between its two distributional families bounced around more across weeks than the 2024 staged results had suggested. The optimal relative weight on the ZINB versus the gamma family changed week to week, which is exactly the situation a cross-family hedge is designed for. |
| P4 | Whether ceiling-threshold calibration was at least as good as floor-threshold calibration at the WR1 slot, repeating the 2024 pattern | ceiling 0.026 better | ceiling 0.023 vs floor 0.018 | AMBIGUOUS | In 2024 the ceiling probabilities were better calibrated than the floor; in 2025 they came out nearly tied, a directional flip but inside the noise. |
| P5 | Whether the model’s biggest weekly misses were extreme outcomes it had already flagged as unlikely (top 20 surprises landing in the predictive tails) | 19 boom / 1 bust | 20 of 20 (all boom) | PASS | When the model was most “wrong,” it was a tail event the model already marked unlikely (typically a big boom) rather than systematic mis-prediction of a typical player. |
| P6 | How often switching between a pure-ZINB and a pure-gamma predictive (the cross-family γ slider) changes a receiver’s stated WR1-target probability by more than 15 percentage points | 0.1% | 0.109% | PASS | Switching between the two distributional families moves a receiver’s WR1-target probability by more than 15 percentage points on only about 0.1% of player-weeks; for almost all weeks the two families produce very similar WR1 threshold probabilities (i.e., similarly low predicted probabilities of high scores for most player-weeks). |
| P7 | Among the closest start/sit calls (predicted-mean difference ≤ 2 fp), how often the recommended starter flips when swapping a pure-gamma predictive for a pure-ZINB predictive (the cross-family γ slider) | 28.4% | 17.9% | FAIL | Only about 18% of close-call pairs flip when you swap between a pure-ZINB and a pure-gamma predictive, fewer than the 28% the 2024 pattern suggested. The two distributional families agreed more often on close-call start/sit decisions in 2025 than in 2024, but disagreement was still common enough (about 1 in 5 close calls) to warrant blending across families. |
| P8 | Whether the Blend beats the best single un-blended component (the ‘best pure’) on CRPS in the disagreement archetypes, by ≥ 0.010 | cross 3.24 vs pure 3.28 | cross 3.168 vs single-best pure 3.229 (PASS) or per-archetype oracle 3.135 (FAIL) | AMBIGUOUS | The pre-reg text defined “best pure” two incompatible ways. Under the more natural reading the Blend beats the best 2024-identifiable pure; under a per-archetype retrospective oracle (only available after the fact) it loses. Both readings are reported. |
| P9 | Whether the Blend stays close to the retrospectively best pure on 2025, on both ECE and CRPS, in the disagreement archetypes | ECE +0.001, CRPS −0.041 | ECE +0.0103, CRPS +0.0013 | AMBIGUOUS | Even compared to whichever pure would have looked best after the fact, the Blend held up: within bootstrap noise on calibration, comfortably inside the tolerance on CRPS. The “you can’t pick the best pure in advance” rationale for hedging stands. |
| P10 | Whether the expert-anchored model in the ZINB family (pure_A_zinb) systematically under-projects fill-in-situation player-weeks compared to what those players actually scored | n = 97, gap 1.633 fp | n = 55, gap 1.498 fp | PASS | Across 2025 fill-in player-weeks (a depth player pressed into a larger role by an absent teammate), the expert consensus-anchored model in the ZINB family projected about 1.5 fewer points than the player actually scored, a real if modest expert blind spot. |
| P11 | Whether the data-driven model in the ZINB family (pure_B_zinb) is better-calibrated than the expert-anchored model in the same family (pure_A_zinb) at the slot floor thresholds (averaged across WR1, WR2, Flex) | +0.024 in data’s favor | −0.011 (expert slightly better) | FAIL | The 2024 advantage of the usage-driven ZINB model over the expert consensus-anchored ZINB model at the slot floors did not replicate; in 2025 the sign flipped slightly the other way, with the consensus-anchored model marginally better. Variability in which model is better calibrated (across player types, across weeks, and, if it replicates, across years) is why we hedge across models rather than commit to one. |

Final tally across the thirteen sub-predictions: 6 PASS, 3 FAIL, 4 AMBIGUOUS.

The misses and ambiguities matter at least as much as the passes. Among the failures, P3 told us the per-week cross-family γ is noisier than the 2024 staged estimates suggested; P7 said close-call start/sit decisions were less sensitive to distributional family in 2025 than 2024 implied, though flips were still common enough to matter (about 1 in 5 close calls); P11 reversed direction, with the expert model slightly better at the floor on the holdout rather than worse. Of the four ambiguous verdicts, three sit inside bootstrap noise (P1a’s calibration miss is actually in the better-calibrated direction), and P8 surfaced a drafting flaw, with “best pure” defined two incompatible ways in the same locked prediction, a lesson for any future pre-registration. Variability in which model (experts vs. data, or ZINB vs. gamma family) is better calibrated across player types, across weeks, and (on this evidence) across years is why we hedge across models rather than commit to one. None of these results undermines the project’s central claim; each refines it.

Situational archetypes

Some players are tagged with one or more archetype flags. These are not labels invented at evaluation time. They are seven interrelated player-week flags (some are defined to exclude others) computed from the same features already feeding the model (snap-share lags, ECR-rank lags, injury-report status, prior-season snap share, and roster availability for qualifying teammates) through a single shared definition (R/situational_flags.R) that both the diagnostic scripts and the dashboard export source, so every page on this site is talking about the same objects.

The point of carving up the holdout data in this way is that it helps us answer one of the project’s central questions: Where does the expert consensus disagree with what the player usage/game environment data alone is saying, and which side is reliably closer when they do? The core insight is that a particular player-week might offer a stronger test of that question when the player is in some sort of transition — newly elevated by a teammate’s injury, freshly emerged into a larger snap or target share, performance climbing late in a season, or a high-ranked starter just coming off the injury report — than when nothing about his role has changed. So rather than report aggregate calibration only, the diagnostics and the dashboard partition player-weeks into multiple situations where the expert-versus-data tension might be most apparent.

The seven flags, with thresholds set by judgment and documented in code:

  • fill_in_situation — a teammate WR is unavailable (Out, IR, Doubtful, or a depth-chart-eligible WR listed inactive that week), the focal player’s current-season snap share is at least 0.15 above his prior-season baseline, his prior-season baseline was a non-starter level (< 0.55), and his ECR rank is > 60. Intended to flag depth players pressed into a larger role by an absence.
  • emerging_player_elevation — the same teammate-driven elevation pattern as a fill-in, but with an ECR rank ≤ 60 (i.e., more highly ranked players). Usage data signals a key emerging player, and the expert consensus has taken notice.
  • late_season_expansion — the rolling four-week snap share is at least 0.15 higher than four weeks ago, and ECR has not improved by more than 20 places over that span. Usage is climbing without the rankings reflecting it.
  • stable_veteran — current four-week snap share above 0.65, sustained above 0.65 for at least six prior active weeks, no current injury listing, and ECR within 20 places of where it was each of the last four weeks. Established starters with no clear signal of role change.
  • recent_role_change — a week-over-week snap-share jump above 0.30, and not already classified as fill-in or emerging. Catches abrupt role shifts the first two flags miss.
  • rookie_or_low_sample — fewer than six prior active weeks of history, and not already flagged as emerging. Players whose history is too thin for the rolling features to do much work.
  • star_returning — ECR rank ≤ 24, on the injury report (Questionable, Doubtful, Out, or IR) in either of the last two weeks, and not currently ruled Out or IR. Highly-ranked players recently dinged but expected to play.
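A few of the flag rules above can be written out directly to show how the exclusions interact. The Python sketch below is illustrative and is not a port of R/situational_flags.R: the dictionary keys are hypothetical names for inputs the real code derives from the feature table, only four of the seven flags are covered, and treating the prior-season baseline condition as part of the emerging-player rule is my reading of "the same teammate-driven elevation pattern".

```python
def teammate_driven_elevation(row):
    """Shared pattern behind the fill-in and emerging-player flags."""
    jump = row["snap_share_current"] - row["snap_share_prior_season"]
    return (row["teammate_wr_unavailable"]
            and jump >= 0.15
            and row["snap_share_prior_season"] < 0.55)

def classify(row):
    """Return the subset of four flags that apply to one player-week."""
    flags = set()
    if teammate_driven_elevation(row):
        # The ECR rank decides which of the two elevation flags applies.
        if row["ecr_rank"] > 60:
            flags.add("fill_in_situation")
        else:
            flags.add("emerging_player_elevation")
    elevated = flags & {"fill_in_situation", "emerging_player_elevation"}
    if row["week_over_week_snap_jump"] > 0.30 and not elevated:
        flags.add("recent_role_change")
    if row["prior_active_weeks"] < 6 and "emerging_player_elevation" not in flags:
        flags.add("rookie_or_low_sample")
    if (row["ecr_rank"] <= 24
            and row["on_injury_report_last_two_weeks"]
            and not row["currently_out_or_ir"]):
        flags.add("star_returning")
    return flags

# Hypothetical depth receiver stepping in for an injured teammate.
depth_player = {
    "teammate_wr_unavailable": True, "snap_share_current": 0.62,
    "snap_share_prior_season": 0.40, "ecr_rank": 75,
    "week_over_week_snap_jump": 0.35, "prior_active_weeks": 10,
    "on_injury_report_last_two_weeks": False, "currently_out_or_ir": False,
}
depth_flags = classify(depth_player)
```

In this example the week-over-week jump exceeds 0.30, but recent_role_change is suppressed because the player-week is already explained as a fill-in, which is the exclusion the definitions describe.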

These flags drive three uses in turn. First, the 2022–2023 fits were evaluated out-of-sample on the 2024 development data, with calibration checked within each archetype to learn where the model’s probabilities tilted in particular kinds of weeks. Then, the locked model (trained on 2022–2024) was walked through the 2025 holdout and stratified by archetype, to see whether the systematic patterns the development phase suggested showed up again on data the model had never seen. They did, in a particular and informative way. The expert consensus-anchored model under-projects fill-ins and emerging players by around 1.4 to 1.5 fp on average, while the data-driven model under-projects returning stars by about the same amount. In the dashboard, the Disagreement page reports these patterns directly, and the player explorer surfaces the flags as badges on each player-week.

Two caveats are worth stating. First, the thresholds are judgment-set rather than fit by an algorithm (e.g., 0.65 for “stable,” 0.15 for elevation, 0.30 for an abrupt role change, ECR 24 for “star”). They were intentionally chosen for face validity against football intuition rather than optimized against an objective, but the resulting flags were not systematically inspected to confirm that the inclusion and exclusion decisions look sensible across player-weeks. Second, the per-archetype sample sizes on a single holdout season are modest. For example, in the 2025 holdout data, 55 player-weeks were flagged as fill-in, 35 as returning-star, 23 as emerging, 50 as recent-role-change, and 124 as late-season expansion, on up through 257 rookie/low-sample and 468 stable-veteran player-weeks, against a total of 1,829 realized player-weeks. (Note: those counts are player-weeks, not unique players; a stable veteran flagged across many weeks contributes many rows, so the underlying number of distinct players in each archetype is smaller, especially for stable veterans.) Given these caveats, the archetypal patterns are suggestive rather than settled, and the dashboard reports them with that understanding.

The deployed model, in numbers

| Component | Setting |
| --- | --- |
| Training seasons | 2022–2024 (5,402 player-weeks) |
| Held-out test | 2025 (1,829 realized player-weeks) |
| Retained families | zero-inflated negative binomial, gamma (log link) |
| Within-family weight on Model A | 0.49 (neg-binomial), 0.13 (gamma) |
| Cross-family weight on neg-binomial | 0.69 |
| Deployed blend, by model | ≈38% expert-anchored, ≈62% data-driven |

Limitations

This is a proof of concept. It covers wide receivers only, on a half-PPR scale, and rests on a single held-out season, so the situational findings, drawn from small numbers of fill-in or returning-star weeks, are suggestive rather than settled. The Expert, Data, and Blend views are model-based summaries, not the raw numbers experts publish, and the consensus signal is a rank converted to points through a fixed mapping.

References

  • Yao, Y., Vehtari, A., Simpson, D., & Gelman, A. (2018). Using stacking to average Bayesian predictive distributions. Bayesian Analysis, 13(3), 917–1007.
  • Gneiting, T., & Raftery, A. E. (2007). Strictly Proper Scoring Rules, Prediction, and Estimation. Journal of the American Statistical Association, 102(477), 359–378.
  • Bürkner, P.-C. (2017). brms: An R Package for Bayesian Multilevel Models Using Stan. Journal of Statistical Software, 80(1), 1–28. doi:10.18637/jss.v080.i01.
  • Silver, N. (2012). The Signal and the Noise. Penguin Press.

FFHedge · 2025 season validation archive · a reluctant criminologists project.