Automatic diagnostics: what StatsPAI checks for you
Every fitted StatsPAI result carries a self-audit. You do not have to remember which assumption each estimator leans on — the result knows, and will tell you:
import statspai as sp
res = sp.ivreg("y ~ (d ~ z)", data=df)
res.violations() # structured list of flagged concerns (may be empty)
sp.audit(res) # reviewer checklist: what's checked, passed, missing
res.violations() inspects diagnostics the estimator already computed — it
never re-runs a test or touches your data, so it is instant. Each entry is a
dict an agent can branch on:
{"kind": "assumption", "severity": "warning", "test": "weak_instrument",
"value": 4.2, "threshold": 10.0,
"message": "First-stage F = 4.20 < 10 (Stock-Yogo 5% bias) — weak instrument …",
"recovery_hint": "Use sp.anderson_rubin_ci …",
"alternatives": ["sp.anderson_rubin_ci", "sp.iv"]}
sp.audit(res) is a superset of violations(): it adds the robustness /
sensitivity checks a referee would ask for (present, failed, or still missing)
and folds in every live violation, so one call gives the full picture.
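Because each violations() entry is a plain dict, routing logic needs nothing beyond the keys shown in the example entry. The following is a minimal sketch, assuming entries carry `severity` and `test` as in that example. The `triage` helper and the hand-written `sample` list are hypothetical and only stand in for the output of res.violations():

```python
from collections import defaultdict


def triage(violations):
    """Group violation entries by severity: {severity: [test names]}."""
    by_severity = defaultdict(list)
    for v in violations:
        # .get() guards against entries that omit a key. Which keys are
        # guaranteed is an assumption; the guide only shows one example.
        by_severity[v.get("severity", "unknown")].append(v.get("test", "?"))
    return dict(by_severity)


# Hand-written entry copying the documented shape; not real output.
sample = [
    {"kind": "assumption", "severity": "warning", "test": "weak_instrument",
     "value": 4.2, "threshold": 10.0,
     "alternatives": ["sp.anderson_rubin_ci", "sp.iv"]},
]

grouped = triage(sample)
print(grouped)
```

An empty list yields an empty dict, so the same code path handles a clean result.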
Two design commitments make these signals trustworthy:
- Fit-time and structured API agree. Where an estimator warns at fit time
(weak IV, few clusters, separation), the same concern appears in
violations(), never one without the other.
- Calibrated not to cry wolf. Thresholds are set so the field's canonical good examples stay silent (e.g. California Prop-99 does not trip the synthetic-control pre-fit check). A diagnostic that fires on the textbook example would erode trust, not build it.
The checklist
| Family | Check | Fires when | Points to |
|---|---|---|---|
| DID | parallel trends | pre-trend joint test p < 0.10 (Roth 2022) | sp.sensitivity_rr, sp.callaway_santanna, sp.did_imputation |
| IV | weak instrument | first-stage F < 10 (Stock-Yogo) | sp.anderson_rubin_ci, sp.iv(method='liml') |
| Panel / OLS | few clusters | # clusters < 30 (Cameron-Gelbach-Miller 2008) | sp.wild_cluster_bootstrap, sp.wild_cluster_ci_inv |
| Synthetic control | poor pre-fit | pre-RMSPE / pre-period SD > 0.6 | sp.synth_compare, sp.augsynth, sp.synth_sensitivity |
| RD | manipulation | McCrary density test p < 0.05 | sp.rddensity, sp.rdplotdensity, sp.rdrandinf |
| Matching | residual imbalance | max post-match SMD > 0.25 (Stuart 2010) | sp.ebalance, sp.cbps, sp.love_plot |
| Matching / IPW | overlap | min propensity weight share < 0.05 | sp.trimming |
| DML / AIPW | weak overlap | > 5% of units at the trimming bound | sp.trimming, sp.overlap_weights, sp.cbps |
| Logit / probit | (quasi-)separation | \|slope coef\| > 15 (Albert-Anderson 1984) | penalised logit / drop the separating predictor |
| Poisson | over-dispersion | Pearson dispersion > 1.5 | sp.nbreg, robust SEs (quasi-Poisson) |
| Count | excess zeros | observed − predicted zeros > 0.05 | sp.zip_model, sp.zinb |
| Heckman | no selection | inverse-Mills p > 0.10 | sp.regress (more efficient) |
| Heckman | unstable rho | \|rho\| > 0.99 (weak exclusion restriction) | strengthen the exclusion; compare sp.regress |
| Tobit | extreme censoring | > 90% of observations censored | sp.heckman, report bounds |
| Cox | non-proportional hazards | Schoenfeld PH test p < 0.05 | time interaction / stratify, sp.aft |
| Bayesian | non-convergence | r̂ > 1.01, bulk-ESS < 400, or divergences > 0 | more draws / chains, reparameterise |
| Any | numerical | non-positive or non-finite standard errors | check collinearity (sp.vif), sandwich setup |
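
To make one row concrete, the Poisson over-dispersion check compares the Pearson chi-square statistic with its residual degrees of freedom. The sketch below is the textbook calculation, not StatsPAI's implementation. The function name and the toy data are illustrative assumptions, and 1.5 is simply the cutoff listed in the table:

```python
def pearson_dispersion(y, mu, n_params):
    """Pearson chi-square divided by residual degrees of freedom.

    Under a Poisson model Var(y) = mu, so values near 1 are expected;
    values well above 1 indicate over-dispersion.
    """
    if len(y) != len(mu):
        raise ValueError("y and mu must have the same length")
    dof = len(y) - n_params
    if dof <= 0:
        raise ValueError("need more observations than parameters")
    chi2 = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return chi2 / dof


# Toy counts with an intercept-only fit (fitted mean = sample mean).
y = [0, 0, 1, 0, 12, 0, 2, 9]
mu = [sum(y) / len(y)] * len(y)

ratio = pearson_dispersion(y, mu, n_params=1)
print(round(ratio, 2), ratio > 1.5)
```

Here the sample mean is 3 and the squared deviations sum to 158, so the ratio is (158 / 3) / 7, roughly 7.5, far above the cutoff.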
The thresholds live in one place (statspai.core._agent_summary) and are shared
by violations() and audit(), so the two can never disagree, and a future
correctness fix that moves a cutoff moves both at once.
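
The single-source-of-truth idea can be sketched as follows. Everything in this block is hypothetical (`THRESHOLDS`, `fires`, `violations_view`, `audit_view`); the actual contents of statspai.core._agent_summary are not shown in this guide. The point is only that both views read the same cutoff, so changing it once changes both:

```python
# Hypothetical registry: one cutoff per check, read by both views.
THRESHOLDS = {
    "weak_instrument": 10.0,  # first-stage F below this fires
    "few_clusters": 30,       # cluster count below this fires
}


def fires(test, value):
    """Return True when `value` falls below the shared cutoff for `test`."""
    return value < THRESHOLDS[test]


def violations_view(diagnostics):
    """List only the checks that fired."""
    return [
        {"test": t, "value": v, "threshold": THRESHOLDS[t]}
        for t, v in diagnostics.items()
        if fires(t, v)
    ]


def audit_view(diagnostics):
    """Report every check as passed or failed, using the same cutoffs."""
    return {t: ("failed" if fires(t, v) else "passed")
            for t, v in diagnostics.items()}


# Stored diagnostics for an imaginary fit.
stored = {"weak_instrument": 4.2, "few_clusters": 48}
print(violations_view(stored))
print(audit_view(stored))
```

Both checks in this toy registry fire when the value is too low; a fuller version would also need to record the direction of each comparison.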
Using it in a workflow
res = sp.dml(df, y="y", treat="d", covariates=X, model="irm")
for v in res.violations():
    if v["severity"] in ("warning", "error"):
        print(v["message"])
        print(" → try:", ", ".join(v["alternatives"]))
# Or get the full reviewer view, sorted by how ready the result is:
report = sp.audit(res)
report["coverage"] # passed / total, in [0, 1]
report["summary"] # {passed, failed, missing, n_total}
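
In an automated pipeline the report can act as a gate. The following is a minimal sketch, assuming the report is dict-like with the `coverage` and `summary` keys shown above. The `ready_to_report` helper, its 0.8 default, and the hand-written `sample_report` are hypothetical:

```python
def ready_to_report(report, min_coverage=0.8):
    """Pass only if no check failed and enough checks have been run."""
    summary = report["summary"]
    return summary["failed"] == 0 and report["coverage"] >= min_coverage


# Hand-written stand-in for sp.audit(res); not real output.
sample_report = {
    "coverage": 0.75,  # passed / total = 6 / 8
    "summary": {"passed": 6, "failed": 0, "missing": 2, "n_total": 8},
}

print(ready_to_report(sample_report))                    # False: 0.75 < 0.8
print(ready_to_report(sample_report, min_coverage=0.7))  # True
```

Requiring zero failures as well as a coverage floor keeps a result with many missing checks from passing just because nothing has failed yet.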
For the checks that re-run against the data (rather than inspecting stored
diagnostics), see sp.assumption_audit, and for the design-level robustness
sweep see the robustness workflow guide.