Complex-survey analysis (NHANES, BRFSS, DHS, …)¶
Most population-health data are not simple random samples. NHANES, BRFSS, DHS, MEPS and their kin use stratification, clustering, and unequal selection probabilities. Ignoring the design biases point estimates and understates standard errors. StatsPAI's
sp.surveyfamily mirrors R'ssurveypackage: declare the design once, then every estimator returns design-corrected standard errors, design effects (DEFF), and design degrees of freedom.
Scope note: the survey estimators are API-stable and reproduce the
standard linearization (Taylor-series) variance, but are not yet
parity-certified against R's survey::svyglm. Validate before
publication and read sp.describe_function(name)["limitations"].
1. Declare the design once¶
import statspai as sp
design = sp.svydesign(
data=nhanes,
weights="WTMEC2YR", # examination weight column
strata="SDMVSTRA", # stratification variable
cluster="SDMVPSU", # primary sampling unit
nest=True, # PSU ids are nested within strata
# fpc="FPC", # finite population correction, if available
)
Everything downstream takes this design object, so the design is
specified in exactly one place and cannot drift between estimates.
2. Means, totals, and proportions¶
import statspai as sp
m = sp.svymean(["bmi", "sbp"], design)
print(m.summary()) # Estimate / Std.Err / t / p / 95% CI / DEFF per variable
print(m.estimate, m.std_error, m.deff, m.dof)
tot = sp.svytotal("had_visit", design) # population total with design SE
The reported DEFF (design effect) tells you how much the complex design inflated the variance relative to simple random sampling — a DEFF of 1.5 means your effective sample size is two-thirds of the nominal n. The design degrees of freedom (≈ #PSUs − #strata) drive the t-based confidence intervals, which is why survey CIs are wider than naïve ones.
3. Survey-weighted regression¶
import statspai as sp
# Linear
fit = sp.svyglm("sbp ~ age + bmi + smoker", design, family="gaussian")
print(fit.summary())
# Logistic (design-based prevalence-odds model)
logit = sp.svyglm("hypertension ~ age + bmi + smoker",
design, family="binomial")
print(logit.summary())
Coefficients are population-level (weighted) associations; the standard errors and CIs already account for stratification and clustering.
4. Calibration and raking¶
When your sample margins do not match known population totals (e.g. census age × sex distributions), calibrate the weights so they do — either by raking (iterative proportional fitting) or by Deville–Särndal linear calibration:
import statspai as sp
# Raking to known marginal distributions
cal = sp.rake(
data=df,
margins={
"age_group": {"18-34": 0.30, "35-54": 0.40, "55+": 0.30},
"sex": {"M": 0.49, "F": 0.51},
},
weight="base_weight",
)
calibrated_weights = cal # feed back into sp.svydesign(weights=...)
# Linear calibration to known control totals
lin = sp.linear_calibration(df, totals={"age": 4.2e6, "income": 1.1e9},
weight="base_weight")
5. Why this matters for public health¶
Reporting an unweighted mean from NHANES, or a model SE that ignores the
PSU clustering, is one of the most common reproducibility errors in
applied public-health analysis. Declaring the design with sp.svydesign
and routing every estimate through the sp.svy* functions makes the
correction automatic and auditable.
A known gap: design-based survival analysis (a survey-weighted Cox
model) is not yet wired into the survey family — for now, extract the
calibrated weights and pass them to a weighted analysis manually, and
validate against R's survey::svycoxph.