
statspai.survey

survey

Survey design and weighted estimation — StatsPAI's answer to R's survey package and Stata's svy: prefix.

Supports stratified, clustered, and weighted survey designs with design-corrected standard errors for means, totals, and regression.

>>> import statspai as sp
>>> design = sp.svydesign(data=df, weights='pw', strata='stratum',
...                       cluster='psu')
>>> design.mean('income')
>>> design.total('income')
>>> design.glm('income ~ education + age')

SurveyDesign

Declare a complex survey design.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | DataFrame | Survey microdata. | required |
| weights | str or array-like | Sampling weights (inverse probability). If str, column name in data. | required |
| strata | str or None | Stratification variable (column name). | None |
| cluster | str or None | Primary sampling unit (PSU) variable (column name). | None |
| fpc | str or None | Finite population correction: column of stratum population sizes or sampling fractions. Values < 1 are treated as fractions; otherwise as population counts. | None |
| nest | bool | If True, PSU ids are nested within strata (re-labelled internally). | False |
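The fpc and nest rules above can be illustrated with a small standalone sketch. The helper names `sampling_fraction` and `nest_psu_ids` are hypothetical and not part of StatsPAI, and converting a population count to a fraction as n_h / N_h (sampled PSUs over population PSUs in the stratum) is an assumption about how counts would typically be used.

```python
def sampling_fraction(fpc_value, n_sampled_psus):
    """Interpret one fpc entry following the rule described above.

    Values below 1 are read as sampling fractions. Anything else is read as
    a stratum population count N_h and converted to n_h / N_h (assumption).
    """
    if fpc_value < 1:
        return fpc_value
    return n_sampled_psus / fpc_value


def nest_psu_ids(strata, psus):
    """Make PSU ids unique across strata by pairing each id with its stratum."""
    return [(h, p) for h, p in zip(strata, psus)]


strata = ["A", "A", "B", "B"]
psus = [1, 2, 1, 2]                  # ids restart at 1 inside every stratum

nested = nest_psu_ids(strata, psus)
n_unique_raw = len(set(psus))        # PSU 1 in A and PSU 1 in B collide
n_unique_nested = len(set(nested))   # every (stratum, PSU) pair is distinct

f_from_fraction = sampling_fraction(0.1, 20)   # already a fraction
f_from_count = sampling_fraction(200, 20)      # 20 sampled out of 200
```

Without the re-labelling, the two strata would appear to share PSUs, which would understate the number of independent clusters.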

mean

mean(variables: Union[str, List[str]], alpha: float = 0.05) -> SurveyResult

Design-corrected weighted mean(s).

total

total(variables: Union[str, List[str]], alpha: float = 0.05) -> SurveyResult

Design-corrected weighted total(s).

glm

glm(formula: str, family: str = 'gaussian', alpha: float = 0.05)

Survey-weighted generalised linear model.

svydesign

svydesign(data: DataFrame, weights: Union[str, ndarray], strata: Optional[str] = None, cluster: Optional[str] = None, fpc: Optional[str] = None, nest: bool = False) -> SurveyDesign

Create a survey design object. This is the functional interface to SurveyDesign.

Parameters are identical to those of SurveyDesign.

Examples:

>>> import statspai as sp
>>> design = sp.svydesign(data=df, weights='pw', strata='region',
...                       cluster='psu_id')
>>> design.mean('income')
>>> design.glm('income ~ age + education')

svymean

svymean(variables: Union[str, List[str]], design: 'SurveyDesign', alpha: float = 0.05) -> SurveyResult

Survey-weighted mean with design-corrected standard errors.

Uses the same Taylor-series linearisation as R's survey::svymean.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| variables | str or list of str | Column name(s) in the design's data. | required |
| design | SurveyDesign | | required |
| alpha | float | | 0.05 |

Returns:

Type Description
SurveyResult
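To make the linearisation concrete, here is a minimal pure-Python sketch of a weighted mean with a Taylor-linearised standard error for a stratified, clustered design. It is not StatsPAI's implementation: the function name `linearised_mean` is hypothetical, no finite population correction is applied, and strata with a single PSU are simply skipped, which is a simplification.

```python
from collections import defaultdict
from math import sqrt


def linearised_mean(y, w, strata, psu):
    """Weighted mean and its Taylor-linearised standard error (no fpc)."""
    w_sum = sum(w)
    mean = sum(wi * yi for wi, yi in zip(w, y)) / w_sum

    # Linearised score of each observation, summed to PSU totals per stratum.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for yi, wi, h, c in zip(y, w, strata, psu):
        psu_totals[h][c] += wi * (yi - mean) / w_sum

    variance = 0.0
    for totals in psu_totals.values():
        n_h = len(totals)
        if n_h < 2:
            continue  # simplification: single-PSU strata add no variance
        centre = sum(totals.values()) / n_h
        ss = sum((t - centre) ** 2 for t in totals.values())
        variance += n_h / (n_h - 1) * ss
    return mean, sqrt(variance)


# One stratum, every observation its own PSU, equal weights:
# the result reduces to the usual sample mean and s / sqrt(n).
y = [1.0, 2.0, 3.0, 4.0]
w = [1.0, 1.0, 1.0, 1.0]
strata = ["all"] * 4
psu = [1, 2, 3, 4]
mean, se = linearised_mean(y, w, strata, psu)
```

The degenerate example is chosen so the output can be checked by hand against the simple-random-sampling formula.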

svytotal

svytotal(variables: Union[str, List[str]], design: 'SurveyDesign', alpha: float = 0.05) -> SurveyResult

Survey-weighted total with design-corrected standard errors.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| variables | str or list of str | | required |
| design | SurveyDesign | | required |
| alpha | float | | 0.05 |

Returns:

Type Description
SurveyResult
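As an illustration of a design-corrected total, the sketch below computes the weighted total and a between-PSU variance within strata, scaled by a finite population correction (1 - f_h). The name `design_total` and the `sampling_fraction` mapping are hypothetical; this is a sketch under a single-stage design assumption, not the library's code.

```python
from collections import defaultdict


def design_total(y, w, strata, psu, sampling_fraction=None):
    """Weighted total and standard error with an optional per-stratum fpc."""
    sampling_fraction = sampling_fraction or {}
    total = sum(wi * yi for wi, yi in zip(w, y))

    # Weighted PSU totals, grouped by stratum.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for yi, wi, h, c in zip(y, w, strata, psu):
        psu_totals[h][c] += wi * yi

    variance = 0.0
    for h, totals in psu_totals.items():
        n_h = len(totals)
        if n_h < 2:
            continue  # simplification: single-PSU strata add no variance
        centre = sum(totals.values()) / n_h
        ss = sum((t - centre) ** 2 for t in totals.values())
        f_h = sampling_fraction.get(h, 0.0)  # 0.0 means no correction
        variance += (1.0 - f_h) * n_h / (n_h - 1) * ss
    return total, variance ** 0.5


# Simple random sample of n = 4 from N = 40, so each weight is 10 and f = 0.1.
y = [1.0, 2.0, 3.0, 4.0]
w = [10.0] * 4
strata = ["all"] * 4
psu = [1, 2, 3, 4]
total, se = design_total(y, w, strata, psu, sampling_fraction={"all": 0.1})
```

For this example the variance matches the textbook simple-random-sampling value N² (1 - f) s² / n = 600.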

svyglm

svyglm(formula: str, design: 'SurveyDesign', family: str = 'gaussian', alpha: float = 0.05)

Survey-weighted generalised linear model.

Fits weighted least squares (WLS) for the gaussian family, or weighted iteratively reweighted least squares (IRLS) for the binomial and poisson families, and computes design-corrected standard errors via the sandwich estimator.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| formula | str | "y ~ x1 + x2" style formula. | required |
| design | SurveyDesign | | required |
| family | str | "gaussian", "binomial" (logistic), or "poisson". | 'gaussian' |
| alpha | float | | 0.05 |

Returns:

Type Description
SurveyResult with regression coefficient estimates.
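The gaussian case can be sketched with NumPy as WLS plus a cluster-level sandwich covariance. This is only an illustration under simplifying assumptions (one stratum, PSUs treated as sampled with replacement, at least two PSUs, no fpc); `wls_sandwich` is a hypothetical name, it takes a design matrix instead of a formula, and the binomial and poisson IRLS paths are not shown.

```python
import numpy as np


def wls_sandwich(X, y, w, psu):
    """WLS coefficients and cluster-robust (sandwich) standard errors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    psu = np.asarray(psu)

    # "Bread": inverse of the weighted cross-product matrix X' W X.
    XtWX = X.T @ (w[:, None] * X)
    beta = np.linalg.solve(XtWX, X.T @ (w * y))
    bread = np.linalg.inv(XtWX)

    # Weighted score of each observation, summed within PSUs.
    scores = (w * (y - X @ beta))[:, None] * X
    labels, idx = np.unique(psu, return_inverse=True)
    G = np.zeros((len(labels), X.shape[1]))
    np.add.at(G, idx, scores)

    # "Meat": between-PSU variation of the score totals.
    n_c = len(labels)
    G = G - G.mean(axis=0)
    meat = n_c / (n_c - 1) * G.T @ G

    cov = bread @ meat @ bread
    return beta, np.sqrt(np.diag(cov))


x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1, 10.9])
w = np.array([2.0, 2.0, 1.0, 1.0, 3.0, 3.0])
psu = np.array([1, 1, 2, 2, 3, 3])
X = np.column_stack([np.ones_like(x), x])   # intercept + slope
beta, se = wls_sandwich(X, y, w, psu)
```

With an intercept-only model, equal weights, and one observation per PSU, this reduces to the sample mean and s / sqrt(n), which is a convenient sanity check.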

rake

rake(data: DataFrame, margins: Dict[str, Dict], weight: Optional[str] = None, max_iter: int = 100, tol: float = 1e-06) -> CalibrationResult

Raking (iterative proportional fitting): adjusts weights iteratively until the weighted category shares match the target margins.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| data | DataFrame | | required |
| margins | dict | {column_name: {category: target_proportion}}. E.g. {"sex": {"M": 0.49, "F": 0.51}, "age_group": {"18-34": 0.3, ...}}. | required |
| weight | str | Existing design weight column. If None, starts with equal weights. | None |
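A minimal pure-Python sketch of iterative proportional fitting shows what the margins dictionary drives. The function `rake_weights` is hypothetical and works on a list of dicts instead of a DataFrame; it returns a plain list of weights, not a CalibrationResult, and it assumes every row's category appears in the targets and that the target proportions sum to 1.

```python
def rake_weights(rows, margins, weights=None, max_iter=100, tol=1e-6):
    """Rake weights so weighted category shares match target proportions."""
    w = list(weights) if weights is not None else [1.0] * len(rows)
    for _ in range(max_iter):
        max_change = 0.0
        for col, targets in margins.items():
            total = sum(w)
            # Current weighted share of each category of this variable.
            share = {cat: 0.0 for cat in targets}
            for wi, row in zip(w, rows):
                share[row[col]] += wi / total
            # Scale each row so this margin matches its target exactly.
            for i, row in enumerate(rows):
                factor = targets[row[col]] / share[row[col]]
                max_change = max(max_change, abs(factor - 1.0))
                w[i] *= factor
        if max_change < tol:
            break  # every margin already matched within tolerance
    return w


rows = [
    {"sex": "M", "age": "young"},
    {"sex": "M", "age": "old"},
    {"sex": "F", "age": "young"},
    {"sex": "F", "age": "old"},
    {"sex": "F", "age": "old"},
]
margins = {
    "sex": {"M": 0.49, "F": 0.51},
    "age": {"young": 0.4, "old": 0.6},
}
raked = rake_weights(rows, margins)
total = sum(raked)
share_m = sum(wi for wi, r in zip(raked, rows) if r["sex"] == "M") / total
share_young = sum(wi for wi, r in zip(raked, rows) if r["age"] == "young") / total
```

Each pass fixes one margin exactly and slightly disturbs the others, so the loop repeats until no adjustment factor differs from 1 by more than `tol`.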

linear_calibration

linear_calibration(data: DataFrame, totals: Dict[str, float], weight: Optional[str] = None) -> CalibrationResult

Deville-Särndal (1992) linear calibration.

Finds calibrated weights w_i = g_i d_i, where d_i are the design weights, by minimising the chi-square distance Σ (w_i - d_i)² / d_i = Σ d_i (g_i - 1)² subject to Σ g_i d_i x_{ik} = T_k for each auxiliary variable k with known total T_k.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| totals | dict | {column_name: population_total} for continuous auxiliary variables. | required |
| weight | str | Design weight column. If None, uses equal weights. | None |
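Under the chi-square distance Σ (w_i - d_i)² / d_i, linear calibration has a closed-form solution: w_i = d_i (1 + x_i'λ) with λ = (Σ d_i x_i x_i')⁻¹ (T - Σ d_i x_i). The NumPy sketch below implements that formula. The name `linear_calibrate`, the array-based interface, and the constant column used to calibrate to a population size are illustrative assumptions, not StatsPAI's API.

```python
import numpy as np


def linear_calibrate(X, d, totals):
    """Calibrated weights w = d * (1 + X @ lam) that reproduce the totals."""
    X = np.asarray(X, dtype=float)
    d = np.asarray(d, dtype=float)
    T = np.asarray(totals, dtype=float)

    gap = T - X.T @ d                       # shortfall of the design-weighted totals
    lam = np.linalg.solve(X.T @ (d[:, None] * X), gap)
    g = 1.0 + X @ lam                       # calibration factors g_i
    return g * d


# Hypothetical data: a constant column (population size) and age.
X = np.array([
    [1.0, 20.0],
    [1.0, 30.0],
    [1.0, 40.0],
    [1.0, 50.0],
    [1.0, 60.0],
])
d = np.full(5, 10.0)                        # design weights: sum 50, age total 2000
totals = np.array([52.0, 2200.0])           # known population size and age total
w_cal = linear_calibrate(X, d, totals)
calibrated_totals = X.T @ w_cal
```

Linear calibration can produce negative weights when the targets are far from the design-weighted totals, which is one reason bounded or raking-type distances are sometimes preferred.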