Public health & epidemiology with StatsPAI
StatsPAI is usable as an epidemiology and public-health toolkit, not only an econometrics one. The same import that exposes difference-in-differences and synthetic control also exposes the association measures, standardization, g-methods, marginal structural models, target-trial emulation, survival models, and complex-survey estimators that epidemiologists and public-health researchers reach for. This guide is the map: it shows which study design points to which function, and links to the deeper family guides.
This page is scope-honest. StatsPAI's cross-language parity
certification (sp.list_functions(validation_status="certified")) is
currently anchored on econometrics benchmarks. Most of the
epidemiology surface below is API-stable but not yet parity-certified
against R's epiR, survey, or MendelianRandomization. Treat those
point estimates as correct-by-construction and well tested, but
validate against your reference package before publication, and
always read sp.describe_function(name)["limitations"] first. See
Stability tiers.
Now certified: the point-treatment g-methods core (sp.ipw,
sp.g_computation, sp.g_estimation), IP-weighted survival, and the
E-value are parity-certified on real NHEFS data against both the
Hernán-Robins Causal Inference: What If textbook and an independent
base-R / survival / EValue reference — see
Reproducing What If on NHEFS.
1. Pick the function by study design
| Study design / question | Reach for | Family guide |
|---|---|---|
| 2×2 table: risk/odds/rate of disease by exposure | sp.odds_ratio, sp.relative_risk, sp.risk_difference, sp.incidence_rate_ratio, sp.number_needed_to_treat, sp.attributable_risk | Epidemiological measures |
| Confounding by a categorical third variable | sp.mantel_haenszel, sp.breslow_day_test | Epidemiological measures |
| Compare rates across populations with different age structure | sp.direct_standardize, sp.indirect_standardize (SMR) | Epidemiological measures |
| Screening / diagnostic test accuracy | sp.sensitivity_specificity, sp.diagnostic_test, sp.roc_curve, sp.cohen_kappa | Epidemiological measures |
| Weighing the evidence for causation | sp.bradford_hill | Epidemiological measures |
| Point-treatment effect, confounding measured at baseline | sp.g_computation, sp.ipw, sp.tmle, sp.aipw | G-methods for time-varying confounding |
| Time-varying treatment with time-varying confounders affected by past treatment | sp.msm, sp.gformula.ice, sp.gformula.mc, sp.ltmle | G-methods for time-varying confounding |
| Observational data, but you want to reason like a trial | sp.target_trial.protocol + sp.target_trial.emulate + sp.clone_censor_weight | Target-trial emulation |
| Time-to-event outcome (mortality, relapse, time-to-diagnosis) | sp.cox, sp.kaplan_meier, sp.survreg, sp.aft | Survival analysis |
| Informative censoring / loss-to-follow-up | sp.ipcw | Survival analysis |
| Complex survey data (NHANES, BRFSS, DHS, …) | sp.svydesign, sp.svymean, sp.svytotal, sp.svyglm, sp.rake | Complex-survey analysis |
| Genetic instruments for a modifiable exposure | sp.mendelian_randomization and the MR family | Mendelian randomization |
| Continuous exposure → dose-response curve | sp.dose_response, sp.vcnet | G-methods for time-varying confounding |
| Unmeasured-confounding sensitivity | sp.evalue | Robustness workflow |
| Power / sample size | sp.power, sp.mde | n/a |
2. The five-minute epidemiology tour
Every snippet below runs offline and is self-contained.
Association measures from a 2×2 table
import statspai as sp
# A cohort: 50 exposed cases / 950 exposed non-cases,
# 10 unexposed cases / 990 unexposed non-cases.
rr = sp.relative_risk(50, 950, 10, 990)
print(rr.estimate, rr.ci) # RR = 5.0, with a log-binomial CI
orr = sp.odds_ratio(50, 20, 30, 40)
print(orr.estimate, orr.ci) # OR = 3.33 (Woolf CI); Haldane-corrected on zero cells
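To cross-check the point estimates above by hand, the same arithmetic can be sketched in plain Python. This sketch does not call StatsPAI. It assumes the argument order is (exposed cases, exposed non-cases, unexposed cases, unexposed non-cases), as the comments above suggest. The helper names are hypothetical, and the intervals use the Katz log method for the RR and the Woolf method for the OR, so they may not match StatsPAI's own intervals exactly.

```python
import math

Z95 = 1.959963984540054  # 97.5th percentile of the standard normal


def relative_risk_by_hand(a, b, c, d, z=Z95):
    """a, b = exposed cases / non-cases; c, d = unexposed cases / non-cases."""
    rr = (a / (a + b)) / (c / (c + d))
    # Katz standard error of log(RR); an assumption, not StatsPAI's documented method
    se_log = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    return rr, (rr * math.exp(-z * se_log), rr * math.exp(z * se_log))


def odds_ratio_by_hand(a, b, c, d, z=Z95):
    """Cross-product odds ratio with a Woolf (log-scale) interval."""
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return odds_ratio, (odds_ratio * math.exp(-z * se_log),
                        odds_ratio * math.exp(z * se_log))


rr, rr_ci = relative_risk_by_hand(50, 950, 10, 990)
orr, or_ci = odds_ratio_by_hand(50, 20, 30, 40)
print(round(rr, 2), rr_ci)   # RR = 5.0
print(round(orr, 2), or_ci)  # OR = 3.33
```

Neither helper handles zero cells; a continuity correction such as Haldane's would be needed there.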
Confounder-adjusted association (Mantel–Haenszel)
import numpy as np
import statspai as sp
# Two strata, each a 2x2 [[exposed-case, exposed-noncase],
# [unexp-case, unexp-noncase]]
tables = np.array([[[10, 5], [8, 12]],
                   [[20, 15], [6, 9]]])
mh = sp.mantel_haenszel(tables, measure="OR")
print(mh.estimate) # pooled OR adjusted for stratum
print(mh.homogeneity_p) # Breslow–Day test for effect homogeneity
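The pooled estimate can be checked against the textbook Mantel–Haenszel formula, OR_MH = Σ(a·d/n) / Σ(b·c/n), summed over strata. The sketch below is a hand computation in NumPy on the same two tables; it does not call StatsPAI, and the variable names are illustrative only.

```python
import numpy as np

# Same layout as above: [[exposed-case, exposed-noncase],
#                        [unexp-case,   unexp-noncase]] per stratum
tables = np.array([[[10, 5], [8, 12]],
                   [[20, 15], [6, 9]]], dtype=float)

a, b = tables[:, 0, 0], tables[:, 0, 1]
c, d = tables[:, 1, 0], tables[:, 1, 1]
n = tables.sum(axis=(1, 2))          # stratum sizes: 35 and 50

stratum_or = (a * d) / (b * c)       # 3.0 and 2.0
or_mh = np.sum(a * d / n) / np.sum(b * c / n)

# Crude OR from the collapsed table, ignoring the stratifying variable
collapsed = tables.sum(axis=0)
crude_or = (collapsed[0, 0] * collapsed[1, 1]) / (collapsed[0, 1] * collapsed[1, 0])

print(stratum_or)          # [3. 2.]
print(round(or_mh, 3))     # 2.388
print(crude_or)            # 2.25
```

The pooled value (246/103) sits between the two stratum-specific odds ratios, weighted by stratum size, and differs from the crude 2.25.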
Age-standardized rates
import statspai as sp
std = sp.direct_standardize(
    events=[50, 60],             # events in each age band
    population=[1000, 2000],     # person-time / population in each band
    standard_weights=[0.4, 0.6]  # reference (standard) population structure
)
print(std.rate, std.ci)
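Direct standardization is a weighted average of the band-specific rates, with weights taken from the standard population. A minimal NumPy sketch of that arithmetic on the same inputs follows. It does not call StatsPAI, and the interval shown is a normal approximation under an assumed Poisson variance, which may differ from whatever std.ci uses.

```python
import numpy as np

events = np.array([50, 60], dtype=float)
population = np.array([1000, 2000], dtype=float)
weights = np.array([0.4, 0.6])        # must sum to 1

band_rates = events / population      # 0.05 and 0.03
standardized = np.sum(weights * band_rates)
crude = events.sum() / population.sum()

# Assumed Poisson variance of each band rate: events / population**2
se = np.sqrt(np.sum(weights**2 * events / population**2))
ci = (standardized - 1.96 * se, standardized + 1.96 * se)

print(round(standardized, 4))   # 0.038
print(round(crude, 4))          # 0.0367
print(ci)
```

The standardized rate (0.038) is higher than the crude rate (about 0.037) because the standard population puts relatively more weight on the first band, which has the higher rate, than the observed population does.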
3. The modern causal-epidemiology core
Modern epidemiology's hardest problem is time-varying confounding affected by prior treatment (Robins). Standard regression adjustment is biased there; g-methods are the answer. StatsPAI ships the three canonical g-methods plus the target-trial framework that ties them to a protocol:
import statspai as sp
# `wide` is assumed to be a wide-format pandas DataFrame with one row per
# subject and columns id, L0, A0, L1, A1, Y (it is not constructed here).
# Parametric g-formula (iterative conditional expectation) under
# "always treat" vs "never treat" across two time points.
always = sp.gformula.ice(
    data=wide, id_col="id", time_col=None,
    treatment_cols=["A0", "A1"],
    confounder_cols=[["L0"], ["L1"]],
    outcome_col="Y", treatment_strategy=[1, 1],
)
never = sp.gformula.ice(
    data=wide, id_col="id", time_col=None,
    treatment_cols=["A0", "A1"],
    confounder_cols=[["L0"], ["L1"]],
    outcome_col="Y", treatment_strategy=[0, 0],
)
print(always.value - never.value)  # g-formula contrast
See the g-methods family guide for the marginal
structural model (sp.msm), longitudinal TMLE (sp.ltmle), and the
full target-trial workflow that protects against immortal-time bias.
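To make the bias of standard regression adjustment concrete, the sketch below simulates two time points in which L1 is affected by A0 and also confounds A1, then compares a hand-rolled iterative conditional expectation (ICE) g-formula with a single regression that adjusts for everything. It uses only NumPy and does not call StatsPAI. The data-generating process, the coefficients, and the helper names are all assumptions chosen for illustration, and the linear models are correct here only because the simulation is linear.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000


def expit(x):
    return 1.0 / (1.0 + np.exp(-x))


# Assumed data-generating process: A0 raises L1, and L1 drives both A1 and Y.
L0 = rng.normal(size=n)
A0 = rng.binomial(1, expit(0.5 * L0))
L1 = 0.5 * L0 + 0.8 * A0 + rng.normal(size=n)
A1 = rng.binomial(1, expit(0.5 * L1 + 0.3 * A0))
Y = 1.0 * A0 + 1.0 * A1 + 1.0 * L1 + 0.5 * L0 + rng.normal(size=n)
# True always-vs-never contrast: 1.0 (A0) + 1.0 (A1) + 0.8 (A0 -> L1 -> Y) = 2.8


def ols(X, y):
    """Least-squares coefficients with an intercept in position 0."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta


def predict(beta, X):
    return np.column_stack([np.ones(X.shape[0]), X]) @ beta


def ice_mean(a0, a1):
    # Step 1: model Y given full history, then predict with A1 set to a1.
    b1 = ols(np.column_stack([A0, L0, A1, L1]), Y)
    q1 = predict(b1, np.column_stack([A0, L0, np.full(n, a1), L1]))
    # Step 2: model that prediction given baseline history, set A0 to a0.
    b0 = ols(np.column_stack([A0, L0]), q1)
    q0 = predict(b0, np.column_stack([np.full(n, a0), L0]))
    return q0.mean()


ice_contrast = ice_mean(1, 1) - ice_mean(0, 0)

# Naive approach: one regression adjusting for L0 and L1, summing both
# treatment coefficients. Conditioning on L1 blocks the A0 -> L1 -> Y path.
beta = ols(np.column_stack([A0, A1, L0, L1]), Y)
naive_contrast = beta[1] + beta[2]

print(ice_contrast)    # close to the true 2.8
print(naive_contrast)  # close to 2.0, missing the effect mediated by L1
```

Dropping L1 from the naive regression does not help either: L1 then confounds the effect of A1, which is why a method that handles the two time points in sequence is needed.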
4. Reporting standards
Public-health and clinical journals expect design-specific reporting checklists. StatsPAI exposes structured reporting hooks rather than free-form prose:
- Target-trial protocol (the 7 protocol elements): sp.target_trial.protocol(...).summary() prints the eligibility, treatment strategies, assignment, follow-up (starting at time zero), outcome, causal contrast, and analysis plan as a structured block, and sp.target_trial.to_paper(...) renders a STROBE-style Methods & Results narrative [@hernan2016using].
- Estimator citations: mature estimators carry .cite() so the exact methodological reference lands in your bibliography.
- Sensitivity to unmeasured confounding: report an sp.evalue(...) E-value alongside the point estimate [@vanderweele2017sensitivity].
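For reference, the E-value for a risk ratio has a closed form: E = RR + sqrt(RR × (RR − 1)), applied to 1/RR when the estimate is protective. The sketch below implements that formula in plain Python so a reported value can be sanity-checked. The function names are hypothetical and this is not the sp.evalue signature, which is not shown on this page.

```python
import math


def e_value(rr):
    """E-value for a risk ratio point estimate."""
    if rr <= 0:
        raise ValueError("risk ratio must be positive")
    if rr < 1:
        rr = 1.0 / rr   # protective estimates: work on the reciprocal
    return rr + math.sqrt(rr * (rr - 1.0))


def e_value_for_ci(rr, lower, upper):
    """E-value for the confidence limit closest to the null (RR = 1)."""
    if lower <= 1.0 <= upper:
        return 1.0      # interval already includes the null
    return e_value(lower if rr > 1 else upper)


print(round(e_value(5.0), 2))                    # 9.47
print(round(e_value(0.2), 2))                    # 9.47 (same strength, protective)
print(round(e_value_for_ci(5.0, 2.55, 9.8), 2))  # E-value for the lower limit
print(e_value_for_ci(1.3, 0.9, 1.9))             # 1.0
```

An E-value of 9.47 means an unmeasured confounder would need risk-ratio associations of at least 9.47 with both exposure and outcome to fully explain away an observed RR of 5.0.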
5. Honest limitations (read before you publish)
- Parity is not yet certified for most epi methods. The numbers are well tested internally and recover known truths on simulated data, but cross-language certification against R's epidemiology packages is on the roadmap, not done. Re-run your headline estimate in your reference package.
- Positivity and sequential exchangeability are assumptions, not outputs. G-methods identify causal effects only when treatment is (sequentially) unconfounded given the measured covariates and every covariate stratum has a nonzero chance of each treatment. Inspect weight distributions (sp.msm reports trimming) and think hard about unmeasured confounders.
- Competing risks are not yet first-class. sp.cox / sp.kaplan_meier treat censoring as non-informative; there is no Fine–Gray subdistribution-hazard or cumulative-incidence-function estimator yet. For competing-risks settings, validate carefully and watch this space.
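A crude first look at positivity needs no estimator at all: tabulate treatment within each covariate stratum and flag any stratum where one arm is empty. The sketch below does this with the standard library on made-up records. The data, the stratum labels, and the helper name are hypothetical, and a clean result on observed cells does not establish positivity for continuous or high-dimensional covariates.

```python
from collections import Counter

# Hypothetical toy records: (covariate stratum, treatment indicator)
records = [
    ("age<50", 1), ("age<50", 0), ("age<50", 1),
    ("age50-69", 1), ("age50-69", 0),
    ("age70+", 0), ("age70+", 0),
]


def positivity_violations(rows, arms=(0, 1)):
    """Return the strata in which at least one treatment arm has no subjects."""
    counts = Counter(rows)
    strata = sorted({stratum for stratum, _ in rows})
    return [s for s in strata if any(counts[(s, arm)] == 0 for arm in arms)]


violations = positivity_violations(records)
print(violations)  # ['age70+']
```

Here nobody aged 70+ was treated, so no amount of weighting can recover the treated outcome in that stratum from the data alone.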