Public health & epidemiology with StatsPAI

StatsPAI is usable as an epidemiology and public-health toolkit, not only an econometrics one. The same import that exposes difference-in-differences and synthetic control also exposes the association measures, standardization, g-methods, marginal structural models, target-trial emulation, survival models, and complex-survey estimators that epidemiologists and public-health researchers reach for. This guide is the map: it shows which study design points to which function, and links to the deeper family guides.

This page is candid about scope. StatsPAI's cross-language parity certification (sp.list_functions(validation_status="certified")) is currently anchored on econometrics benchmarks. Most of the epidemiology surface below is API-stable but not yet parity-certified against R's epiR, survey, or MendelianRandomization. Treat those point estimates as correct-by-construction and well tested, but validate against your reference package before publication, and always read sp.describe_function(name)["limitations"] first. See Stability tiers.

Now certified: the point-treatment g-methods core (sp.ipw, sp.g_computation, sp.g_estimation), IP-weighted survival, and the E-value are parity-certified on real NHEFS data against both the Hernán-Robins Causal Inference: What If textbook and an independent base-R / survival / EValue reference — see Reproducing What If on NHEFS.

1. Pick the function by study design

| Study design / question | Reach for | Family guide |
| --- | --- | --- |
| 2×2 table: risk/odds/rate of disease by exposure | `sp.odds_ratio`, `sp.relative_risk`, `sp.risk_difference`, `sp.incidence_rate_ratio`, `sp.number_needed_to_treat`, `sp.attributable_risk` | Epidemiological measures |
| Confounding by a categorical third variable | `sp.mantel_haenszel`, `sp.breslow_day_test` | Epidemiological measures |
| Compare rates across populations with different age structure | `sp.direct_standardize`, `sp.indirect_standardize` (SMR) | Epidemiological measures |
| Screening / diagnostic test accuracy | `sp.sensitivity_specificity`, `sp.diagnostic_test`, `sp.roc_curve`, `sp.cohen_kappa` | Epidemiological measures |
| Weighing the evidence for causation | `sp.bradford_hill` | Epidemiological measures |
| Point-treatment effect, confounding measured at baseline | `sp.g_computation`, `sp.ipw`, `sp.tmle`, `sp.aipw` | G-methods for time-varying confounding |
| Time-varying treatment with time-varying confounders affected by past treatment | `sp.msm`, `sp.gformula.ice`, `sp.gformula.mc`, `sp.ltmle` | G-methods for time-varying confounding |
| Observational data, but you want to reason like a trial | `sp.target_trial.protocol` + `sp.target_trial.emulate` + `sp.clone_censor_weight` | Target-trial emulation |
| Time-to-event outcome (mortality, relapse, time-to-diagnosis) | `sp.cox`, `sp.kaplan_meier`, `sp.survreg`, `sp.aft` | Survival analysis |
| Informative censoring / loss-to-follow-up | `sp.ipcw` | Survival analysis |
| Complex survey data (NHANES, BRFSS, DHS, …) | `sp.svydesign`, `sp.svymean`, `sp.svytotal`, `sp.svyglm`, `sp.rake` | Complex-survey analysis |
| Genetic instruments for a modifiable exposure | `sp.mendelian_randomization` and the MR family | Mendelian randomization |
| Continuous exposure → dose-response curve | `sp.dose_response`, `sp.vcnet` | G-methods for time-varying confounding |
| Unmeasured-confounding sensitivity | `sp.evalue` | Robustness workflow |
| Power / sample size | `sp.power`, `sp.mde` | (none) |

2. The five-minute epidemiology tour

Every snippet below runs offline and is self-contained.

Association measures from a 2×2 table

import statspai as sp

# A cohort: 50 exposed cases / 950 exposed non-cases,
#           10 unexposed cases / 990 unexposed non-cases.
rr = sp.relative_risk(50, 950, 10, 990)
print(rr.estimate, rr.ci)          # RR = 5.0, with a log-binomial CI

orr = sp.odds_ratio(50, 20, 30, 40)
print(orr.estimate, orr.ci)        # OR = 3.33 (Woolf CI); Haldane-corrected on zero cells
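Because these measures are not yet parity-certified, a hand computation can serve as a quick cross-check. The sketch below uses only the standard library and is not StatsPAI's implementation. It assumes the argument order (exposed cases, exposed non-cases, unexposed cases, unexposed non-cases) and uses Wald intervals on the log scale (Katz for the risk ratio, Woolf for the odds ratio), so its limits may differ slightly from whatever interval the library reports. The helper names are hypothetical.

```python
import math

def ratio_with_ci(estimate, se_log, z=1.96):
    """Return (estimate, lower, upper) using a Wald interval on the log scale."""
    half_width = z * se_log
    return estimate, estimate * math.exp(-half_width), estimate * math.exp(half_width)

def relative_risk(a, b, c, d):
    # a, b = exposed cases / non-cases; c, d = unexposed cases / non-cases
    rr = (a / (a + b)) / (c / (c + d))
    se_log = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))  # Katz log method
    return ratio_with_ci(rr, se_log)

def odds_ratio(a, b, c, d):
    or_ = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # Woolf method
    return ratio_with_ci(or_, se_log)

rr, rr_lo, rr_hi = relative_risk(50, 950, 10, 990)
or_, or_lo, or_hi = odds_ratio(50, 20, 30, 40)
print(f"RR = {rr:.2f} ({rr_lo:.2f}, {rr_hi:.2f})")
print(f"OR = {or_:.2f} ({or_lo:.2f}, {or_hi:.2f})")
```

The point estimates should match the 5.0 and 3.33 quoted in the snippet above. Neither helper handles zero cells, which is where a continuity correction such as Haldane's would come in.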

Confounder-adjusted association (Mantel–Haenszel)

import numpy as np
import statspai as sp

# Two strata, each a 2x2 [[exposed-case, exposed-noncase],
#                         [unexp-case,   unexp-noncase]]
tables = np.array([[[10, 5], [8, 12]],
                   [[20, 15], [6, 9]]])
mh = sp.mantel_haenszel(tables, measure="OR")
print(mh.estimate)        # pooled OR adjusted for stratum
print(mh.homogeneity_p)   # Breslow–Day test for effect homogeneity
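To see what the pooled estimate summarizes, the Mantel–Haenszel odds ratio can be computed by hand as sum(a·d/n) / sum(b·c/n) over strata. The following is a minimal standard-library sketch on the same two tables, with hypothetical helper names. It is not how StatsPAI computes the estimate internally, and it omits the variance and the homogeneity test.

```python
def mantel_haenszel_or(tables):
    """Pooled OR = sum(a*d/n) / sum(b*c/n) over strata laid out as [[a, b], [c, d]]."""
    num = den = 0.0
    for (a, b), (c, d) in tables:
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    return num / den

tables = [[[10, 5], [8, 12]],
          [[20, 15], [6, 9]]]

# Stratum-specific odds ratios.
stratum_ors = [(a * d) / (b * c) for (a, b), (c, d) in tables]

# Collapse over strata to get the crude (unadjusted) 2x2 table.
a, b, c, d = (sum(t[i][j] for t in tables) for i in (0, 1) for j in (0, 1))
crude_or = (a * d) / (b * c)

pooled_or = mantel_haenszel_or(tables)
print(stratum_ors, crude_or, round(pooled_or, 4))
```

Comparing the crude and pooled values shows how much the stratifying variable shifts the association, while the spread of the stratum-specific odds ratios is what a homogeneity test such as Breslow–Day assesses.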

Age-standardized rates

import statspai as sp

std = sp.direct_standardize(
    events=[50, 60],            # events in each age band
    population=[1000, 2000],    # person-time / population in each band
    standard_weights=[0.4, 0.6] # reference (standard) population structure
)
print(std.rate, std.ci)
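Direct standardization is a weighted average of stratum-specific rates, with weights taken from the standard population. The sketch below reproduces that arithmetic for the same inputs using only the standard library. The normal-approximation interval assumes Poisson-distributed event counts, which is an assumption of this sketch and may not be the interval method StatsPAI uses.

```python
import math

events = [50, 60]
population = [1000, 2000]
standard_weights = [0.4, 0.6]

# Stratum-specific rates and the crude (unweighted) rate for comparison.
stratum_rates = [d / n for d, n in zip(events, population)]
crude_rate = sum(events) / sum(population)

# Directly standardized rate: stratum rates weighted by the standard population.
adjusted_rate = sum(w * r for w, r in zip(standard_weights, stratum_rates))

# Variance under Poisson event counts: sum of w^2 * d / n^2.
variance = sum(w**2 * d / n**2
               for w, d, n in zip(standard_weights, events, population))
se = math.sqrt(variance)
ci = (adjusted_rate - 1.96 * se, adjusted_rate + 1.96 * se)

print(f"crude = {crude_rate:.4f}, adjusted = {adjusted_rate:.4f}")
print(f"95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
```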

3. The modern causal-epidemiology core

Modern epidemiology's hardest problem is time-varying confounding affected by prior treatment (Robins). Standard regression adjustment is biased there; g-methods are the answer. StatsPAI ships the three canonical g-methods plus the target-trial framework that ties them to a protocol:

import statspai as sp

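# `wide` is not defined in this snippet: it is assumed to be a wide-format table
# with one row per subject and columns id, L0, A0, L1, A1, Y.
#
# Parametric g-formula (iterative conditional expectation) under
# "always treat" vs "never treat" across two time points.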
always = sp.gformula.ice(
    data=wide, id_col="id", time_col=None,
    treatment_cols=["A0", "A1"],
    confounder_cols=[["L0"], ["L1"]],
    outcome_col="Y", treatment_strategy=[1, 1],
)
never = sp.gformula.ice(
    data=wide, id_col="id", time_col=None,
    treatment_cols=["A0", "A1"],
    confounder_cols=[["L0"], ["L1"]],
    outcome_col="Y", treatment_strategy=[0, 0],
)
print(always.value - never.value)   # g-formula contrast
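The iterative-conditional-expectation idea can be illustrated without any modelling library. The sketch below is a nonparametric (stratified-mean) version on simulated binary data, under an assumed data-generating process in which A0 affects L1 and L1 affects both A1 and Y. It is not StatsPAI's implementation, which the snippet above describes as parametric, and every function name here is hypothetical.

```python
import random
from collections import defaultdict

def simulate(n, seed=0):
    """Simulate (L0, A0, L1, A1, Y) with treatment-confounder feedback."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        l0 = int(rng.random() < 0.5)
        a0 = int(rng.random() < 0.3 + 0.4 * l0)
        l1 = int(rng.random() < 0.2 + 0.3 * l0 + 0.3 * a0)  # affected by A0
        a1 = int(rng.random() < 0.2 + 0.3 * l1 + 0.3 * a0)  # affected by L1
        y = a0 + a1 + 2.0 * l1 + l0 + rng.gauss(0.0, 1.0)
        rows.append((l0, a0, l1, a1, y))
    return rows

def mean_by(pairs):
    """Average the values within each key of an iterable of (key, value) pairs."""
    totals, counts = defaultdict(float), defaultdict(int)
    for key, value in pairs:
        totals[key] += value
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}

def ice_gformula(rows, strategy):
    a0_star, a1_star = strategy
    # Step 1: E[Y | L0, L1, A0 = a0*, A1 = a1*] among subjects who followed the strategy.
    q2 = mean_by(((l0, l1), y) for l0, a0, l1, a1, y in rows
                 if a0 == a0_star and a1 == a1_star)
    # Step 2: average the step-1 predictions over L1 given L0, among A0 = a0*.
    q1 = mean_by((l0, q2[(l0, l1)]) for l0, a0, l1, a1, y in rows
                 if a0 == a0_star)
    # Step 3: average over the baseline distribution of L0 in the full sample.
    return sum(q1[row[0]] for row in rows) / len(rows)

rows = simulate(20_000)
contrast = ice_gformula(rows, (1, 1)) - ice_gformula(rows, (0, 0))
print(round(contrast, 2))  # true contrast under this simulation is 2.6
```

In this simulation the true always-versus-never contrast is 2.6, of which 0.6 flows through L1. An outcome regression that conditions on L1 would block that pathway and recover only the direct part, which is the bias the paragraph above refers to.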

See the g-methods family guide for the marginal structural model (sp.msm), longitudinal TMLE (sp.ltmle), and the full target-trial workflow that protects against immortal-time bias.

4. Reporting standards

Public-health and clinical journals expect design-specific reporting checklists. StatsPAI exposes structured reporting hooks rather than free-form prose:

Target-trial protocol (the seven protocol elements): sp.target_trial.protocol(...).summary() prints the eligibility criteria, treatment strategies, assignment procedure, follow-up period (anchored at an explicit time zero), outcome, causal contrast, and analysis plan as a structured block, and sp.target_trial.to_paper(...) renders a STROBE-style Methods & Results narrative [@hernan2016using].
Estimator citations: mature estimators carry .cite() so the exact methodological reference lands in your bibliography.
Sensitivity to unmeasured confounding: report an sp.evalue(...) E-value alongside the point estimate [@vanderweele2017sensitivity].
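For readers who want to sanity-check a reported E-value, the closed-form expression for a risk ratio is RR + sqrt(RR × (RR − 1)), applied after inverting protective estimates. The sketch below implements that formula with hypothetical helper names. It makes no claim about the signature or return type of sp.evalue, and it covers risk ratios only.

```python
import math

def e_value(rr):
    """E-value for a risk-ratio point estimate."""
    if rr <= 0:
        raise ValueError("risk ratio must be positive")
    rr = max(rr, 1 / rr)  # protective estimates (RR < 1) are inverted first
    return rr + math.sqrt(rr * (rr - 1))

def e_value_for_ci(lower, upper):
    """E-value for the confidence limit closest to the null; 1.0 if the CI includes 1."""
    if lower <= 1 <= upper:
        return 1.0
    return e_value(lower if lower > 1 else upper)

print(round(e_value(5.0), 2))
print(round(e_value_for_ci(2.55, 9.80), 2))
```

Reporting both numbers is the usual practice: the first says how strong an unmeasured confounder would need to be to explain away the point estimate, the second how strong it would need to be to move the interval to include the null.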

5. Honest limitations (read before you publish)

Parity is not yet certified for most epi methods. The numbers are well tested internally and recover known truths on simulated data, but cross-language certification against R's epidemiology packages is on the roadmap, not done. Re-run your headline estimate in your reference package.
Positivity and sequential exchangeability are assumptions, not outputs. G-methods identify causal effects only when treatment is (sequentially) unconfounded given the measured covariates and every covariate stratum has a positive probability of receiving each treatment level. Inspect weight distributions (sp.msm reports trimming) and think hard about unmeasured confounders.
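Competing risks are not yet first-class. sp.cox / sp.kaplan_meier treat censoring as non-informative; there is no Fine–Gray subdistribution-hazard or cumulative-incidence-function estimator yet. If you censor at the competing event, one minus the Kaplan–Meier estimate overstates the cumulative incidence of the event of interest, so cross-check competing-risks analyses against a dedicated package until native support arrives.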
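The weight inspection recommended in the positivity item above can be sketched in a few lines. The example below computes stabilized inverse-probability weights for a point treatment from stratum frequencies and reports three common diagnostics: the mean weight, the largest weight, and the effective sample size. The data and helper name are hypothetical and the code is independent of sp.msm.

```python
from collections import Counter

def stabilized_weights(treatment, covariate):
    """Stabilized IP weights P(A = a) / P(A = a | L = l) from observed frequencies."""
    n = len(treatment)
    joint = Counter(zip(covariate, treatment))
    by_covariate = Counter(covariate)
    by_treatment = Counter(treatment)
    return [
        (by_treatment[a] / n) / (joint[(l, a)] / by_covariate[l])
        for l, a in zip(covariate, treatment)
    ]

# Toy data: 100 subjects with L = 0 (20 treated) and 100 with L = 1 (60 treated).
covariate = [0] * 100 + [1] * 100
treatment = [1] * 20 + [0] * 80 + [1] * 60 + [0] * 40

weights = stabilized_weights(treatment, covariate)
mean_weight = sum(weights) / len(weights)
max_weight = max(weights)
effective_n = sum(weights) ** 2 / sum(w * w for w in weights)

print(f"mean = {mean_weight:.3f}, max = {max_weight:.2f}, ESS = {effective_n:.1f}")
```

A mean stabilized weight far from 1, a handful of very large weights, or an effective sample size much smaller than the actual one are all signals of near-violations of positivity or a misspecified treatment model.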