Survival analysis for public health
Time-to-event is the native outcome of cohort epidemiology and clinical research. Mortality, relapse, time-to-diagnosis, and time-to-discharge are all durations that are typically right-censored. StatsPAI ships the standard estimators (Kaplan–Meier, Cox proportional hazards, parametric AFT, frailty models) plus the causal-survival tools (inverse-probability-of-censoring weights, causal survival forests) that connect survival outcomes to the methods covered in the g-methods guide.
Scope note: sp.cox is parity-tested against R's survival::coxph and
Stata's stcox (see tests/r_parity/, tests/stata_parity/). The
causal-survival surface (sp.ipcw, sp.causal_survival_forest,
sp.synth_survival) is API-stable but not yet cross-language certified
— validate before publication and read
sp.describe_function(name)["limitations"].
1. Kaplan–Meier survival curves
import statspai as sp
km = sp.kaplan_meier(data=df, duration="time", event="event", group="arm")
print(km.median_survival) # median survival per group
print(km.survival_table) # risk table: n.risk, n.event, S(t), CI
km.plot() # step survival curves with at-risk counts
event uses the standard coding: 1 = event observed, 0 = right-censored.
Pass group= to stratify; omit it for a single pooled curve.
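To make the columns of the risk table concrete, the product-limit calculation behind a Kaplan–Meier curve can be sketched in plain Python. This is an illustrative sketch, not StatsPAI's implementation: the function name kaplan_meier_sketch and the toy data are assumptions introduced here, and confidence intervals and grouping are omitted.

```python
from collections import Counter


def kaplan_meier_sketch(durations, events):
    """Return [(t, n_at_risk, n_events, S(t))] for each distinct event time.

    Hypothetical helper for illustration; events use 1 = observed, 0 = censored.
    """
    deaths = Counter(t for t, e in zip(durations, events) if e == 1)
    exits = Counter(durations)  # events and censorings both leave the risk set
    n_at_risk = len(durations)
    surv = 1.0
    table = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d > 0:
            # Product-limit step: multiply by the conditional survival at t.
            surv *= 1.0 - d / n_at_risk
            table.append((t, n_at_risk, d, surv))
        # Subjects censored at t are counted as at risk at t, then removed.
        n_at_risk -= exits[t]
    return table


# Toy data, made up for this example.
durations = [2, 3, 3, 5, 8, 8, 9]
events = [1, 1, 0, 1, 0, 1, 0]
km_table = kaplan_meier_sketch(durations, events)

# Median survival: first event time at which S(t) falls to 0.5 or below.
median_survival = next((t for t, _, _, s in km_table if s <= 0.5), None)
```

If the curve never reaches 0.5, the median is undefined and this sketch returns None.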
2. Cox proportional hazards
import statspai as sp
# Formula or explicit columns — both work
cox = sp.cox(data=df, duration="time", event="event", x=["arm", "age", "sex"],
             ties="efron")
print(cox.summary())
The reported coefficients are log hazard ratios; exponentiate for
the hazard ratio (pass hazard_ratio=True to print HRs directly). The
counting-process internals mean Cox accepts time-varying covariates
via (start, stop, event) rows, and you can add:
- strata= for a stratified baseline hazard,
- robust="hc1" or cluster= for robust / clustered standard errors,
- ties="efron" (default) / "breslow" for tie handling.
Always sanity-check the proportional-hazards assumption (Schoenfeld residuals) before trusting a single HR over long follow-up.
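The conversion from a log hazard ratio to a hazard ratio with a Wald interval is simple enough to check by hand. The sketch below uses only the standard library; the function name hazard_ratio_ci and the coefficient and standard error are hypothetical values chosen for illustration, not output from sp.cox.

```python
import math


def hazard_ratio_ci(log_hr, se, z=1.959964):
    """Exponentiate a log hazard ratio and its Wald confidence limits.

    Hypothetical helper; z defaults to the 97.5% standard normal quantile.
    """
    return (
        math.exp(log_hr),
        math.exp(log_hr - z * se),
        math.exp(log_hr + z * se),
    )


# Made-up coefficient and standard error for a treatment indicator.
hr, lower, upper = hazard_ratio_ci(log_hr=-0.35, se=0.12)
print(f"HR = {hr:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

The interval is built on the log scale and then exponentiated, so it is asymmetric around the hazard ratio.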
3. Parametric survival (AFT) and frailty
When you need a fully parametric model — for extrapolation, or because
the accelerated-failure-time interpretation is more natural — use
sp.survreg / sp.aft:
import statspai as sp
# Weibull / lognormal / loglogistic / exponential AFT
m = sp.survreg(data=df, duration="time", event="event", x=["arm", "age"],
               dist="weibull")
print(m.summary())
aft = sp.aft("time ~ arm + age", data=df, family="lognormal")
For clustered / shared-frailty survival (recurrent events, multi-centre
cohorts) the sp.survival module also exposes a frailty model.
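The accelerated-failure-time interpretation mentioned above can be illustrated with a small calculation: in a Weibull AFT model, a coefficient acts multiplicatively on the time scale, so the ratio of median survival times between two covariate values is the exponentiated coefficient. The sketch assumes the common parameterisation log T = intercept + beta * x + error / shape; the function name and all numbers are hypothetical, and StatsPAI's own parameterisation may differ.

```python
import math


def weibull_aft_median(intercept, beta, x, shape):
    """Median survival time under an assumed Weibull AFT parameterisation.

    S(t) = exp(-(t / scale) ** shape) with scale = exp(intercept + beta * x).
    """
    scale = math.exp(intercept + beta * x)
    return scale * math.log(2.0) ** (1.0 / shape)


# Made-up parameters: beta = 0.4 for a binary treatment indicator.
median_control = weibull_aft_median(intercept=3.0, beta=0.4, x=0, shape=1.5)
median_treated = weibull_aft_median(intercept=3.0, beta=0.4, x=1, shape=1.5)

# The time ratio equals exp(beta) regardless of the shape parameter.
time_ratio = median_treated / median_control
```

A time ratio above 1 means the covariate stretches survival times; this contrasts with a hazard ratio, where values below 1 indicate benefit.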
4. Informative censoring: IPCW
When loss to follow-up depends on measured covariates that also predict the outcome, censoring is informative and naïve KM/Cox estimates can be biased. Inverse-probability-of-censoring weights reweight the subjects who remain under observation so that they represent the population that would have been observed under complete follow-up [@robins2000marginal]:
import statspai as sp
ipcw = sp.ipcw(
    data=df, time="time", event="event",
    censor_covariates=["age", "ldl", "on_treatment"],
    stabilize=True,
    method="pooled_logistic",
)
print(ipcw.summary_stats) # weight distribution — watch for extremes
weights = ipcw.weights # feed into a weighted outcome analysis
These weights compose directly with sp.clone_censor_weight for
per-protocol target-trial analyses (see
g-methods).
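For intuition about what stabilized weights are, the sketch below builds them for a single subject from per-interval probabilities of remaining uncensored, as a pooled logistic model would supply. It is a minimal illustration under stated assumptions, not the internals of sp.ipcw: the function name ipcw_weights_sketch and the probabilities are hypothetical.

```python
from itertools import accumulate
from operator import mul


def ipcw_weights_sketch(p_uncens_cond, p_uncens_marg=None):
    """Cumulative inverse-probability-of-censoring weights for one subject.

    p_uncens_cond[k]: P(uncensored through interval k | covariate history).
    p_uncens_marg[k]: numerator probability for stabilized weights (optional).
    """
    denom = list(accumulate(p_uncens_cond, mul))
    if p_uncens_marg is None:
        return [1.0 / d for d in denom]
    numer = list(accumulate(p_uncens_marg, mul))
    return [n / d for n, d in zip(numer, denom)]


# Made-up fitted probabilities over three follow-up intervals.
cond = [0.95, 0.90, 0.80]
marg = [0.96, 0.93, 0.90]
unstabilized = ipcw_weights_sketch(cond)
stabilized = ipcw_weights_sketch(cond, marg)
```

Because the numerator is also a probability, stabilized weights stay closer to 1 than unstabilized ones, which is why they are usually less variable.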
5. Causal survival effects
For heterogeneous treatment effects on a survival outcome under
unconfoundedness, sp.causal_survival_forest estimates a
restricted-mean-survival-time or survival-probability contrast that
varies with covariates; sp.synth_survival brings the synthetic-control
idea to time-to-event panel settings.
import statspai as sp
csf = sp.causal_survival_forest(
    data=df, duration="time", event="event",
    treatment="arm", covariates=["age", "sex", "ldl"],
)
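The restricted-mean-survival-time contrast named above is the difference between two areas under survival curves up to a horizon tau. The sketch below computes that area for a step curve in plain Python; the function name rmst_sketch, the horizon, and both curves are hypothetical, and it says nothing about how sp.causal_survival_forest estimates the curves.

```python
def rmst_sketch(times, surv, tau):
    """Area under a right-continuous step survival curve on [0, tau].

    times: increasing event times; surv[i] is S(t) from times[i] until the
    next event time. S(t) is taken to be 1 before the first event time.
    """
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in zip(times, surv):
        if t >= tau:
            break
        area += prev_s * (t - prev_t)  # rectangle up to this event time
        prev_t, prev_s = t, s
    return area + prev_s * (tau - prev_t)  # final rectangle up to tau


# Made-up step curves for two arms, horizon tau = 10 time units.
rmst_treated = rmst_sketch([2, 5, 9], [0.9, 0.7, 0.6], tau=10)
rmst_control = rmst_sketch([1, 4, 6], [0.8, 0.5, 0.3], tau=10)
rmst_contrast = rmst_treated - rmst_control
```

The contrast is in time units (here, expected event-free time gained over the first 10 units), which is often easier to communicate than a hazard ratio.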
6. What is not here yet
- Competing risks: there is no Fine–Gray subdistribution-hazard model or cumulative-incidence-function / Aalen–Johansen estimator yet. In a setting with competing events (e.g. death from other causes), cause-specific Cox is available (model each cause with censoring on the others), but the subdistribution approach is on the roadmap.
- Time-dependent ROC / net-benefit / decision-curve analysis.
If your analysis depends on these, validate carefully and consider
cross-checking with R's cmprsk / survival in the interim.
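The cause-specific approach described in the competing-risks item amounts to recoding the status column once per cause, treating events from every other cause as censored. A minimal sketch of that recoding follows; the function name, the integer cause coding, and the data are assumptions made for illustration.

```python
def cause_specific_event(status, cause_of_interest):
    """Recode a multi-cause status column into a 0/1 event indicator.

    status: 0 = right-censored; 1, 2, ... = cause of the observed event.
    Events from other causes are treated as censored at their event time.
    """
    return [1 if s == cause_of_interest else 0 for s in status]


# Made-up status column: 1 = death from the disease, 2 = death from other causes.
status = [0, 1, 2, 1, 0, 2, 2]
event_cause1 = cause_specific_event(status, 1)
event_cause2 = cause_specific_event(status, 2)
```

Each recoded column could then serve as the event indicator in a separate Cox fit. The resulting hazard ratios describe cause-specific hazards and do not translate directly into differences in cumulative incidence.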