`statspai.survey`¶

survey ¶

Survey design and weighted estimation — StatsPAI's answer to R's survey package and Stata's svy: prefix.

Supports stratified, clustered, and weighted survey designs with design-corrected standard errors for means, totals, and regression.

>>> import statspai as sp
>>> design = sp.svydesign(data=df, weights='pw', strata='stratum',
...                       cluster='psu')
>>> design.mean('income')
>>> design.total('income')
>>> design.glm('income ~ education + age')

SurveyDesign ¶

Declare a complex survey design.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Survey microdata.	required
`weights`	`str or array-like`	Sampling weights (inverse probability). If str, column name in data.	required
`strata`	`str or None`	Stratification variable (column name).	`None`
`cluster`	`str or None`	Primary sampling unit (PSU) variable (column name).	`None`
`fpc`	`str or None`	Finite population correction: a column of stratum population sizes or sampling fractions. Values below 1 are treated as sampling fractions; all other values are treated as population counts.	`None`
`nest`	`bool`	If True, PSU ids are treated as nested within strata and are relabelled internally.	`False`
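
The `nest` and `fpc` conventions in the table can be illustrated in plain Python. This is a sketch of the documented rules only, not StatsPAI's internal code: the helper names `relabel_nested_psu` and `fpc_to_fraction` are hypothetical, and turning a population count into a fraction as n_h / N_h (sampled PSUs over stratum population size) is an assumption taken from common survey practice.

```python
def relabel_nested_psu(strata, psu):
    """Give every distinct (stratum, PSU) pair its own integer id.

    Hypothetical helper: mimics what nest=True is described as doing,
    so that PSU id 1 in stratum "A" and PSU id 1 in stratum "B"
    are no longer confused with each other.
    """
    ids = {}
    return [ids.setdefault((s, p), len(ids)) for s, p in zip(strata, psu)]


def fpc_to_fraction(fpc_value, n_sampled):
    """Apply the documented rule: < 1 is a fraction, otherwise a count.

    Assumption: a population count N_h is converted to the sampling
    fraction n_h / N_h, where n_h is the number of sampled PSUs.
    """
    if fpc_value < 1:
        return float(fpc_value)
    return n_sampled / fpc_value


# PSU ids 1 and 2 are reused in both strata, so they need relabelling.
nested_ids = relabel_nested_psu(["A", "A", "B", "B"], [1, 2, 1, 2])
fraction_from_count = fpc_to_fraction(100, n_sampled=5)
fraction_as_given = fpc_to_fraction(0.1, n_sampled=5)
```

After relabelling, the four observations belong to four distinct PSUs, which is what a variance estimator needs when ids are only unique within a stratum.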

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(0)
>>> n = 300
>>> df = pd.DataFrame({
...     "stratum": rng.integers(0, 3, size=n),
...     "psu": rng.integers(0, 30, size=n),
...     "wt": rng.uniform(0.5, 2.0, size=n),
...     "income": rng.normal(50, 10, size=n),
... })
>>> design = sp.SurveyDesign(df, weights="wt", strata="stratum",
...                          cluster="psu")
>>> design.n
300
>>> m = design.mean("income")
>>> type(m).__name__
'SurveyResult'

mean ¶

mean(variables: Union[str, List[str]], alpha: float = 0.05) -> SurveyResult

Design-corrected weighted mean(s).

total ¶

total(variables: Union[str, List[str]], alpha: float = 0.05) -> SurveyResult

Design-corrected weighted total(s).

glm ¶

glm(formula: str, family: str = 'gaussian', alpha: float = 0.05) -> SurveyResult

Survey-weighted generalised linear model.

svydesign ¶

svydesign(data: DataFrame, weights: Union[str, ndarray], strata: Optional[str] = None, cluster: Optional[str] = None, fpc: Optional[str] = None, nest: bool = False) -> SurveyDesign

Create a survey design object — functional interface.

Parameters are identical to those of `SurveyDesign`.

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(0)
>>> n = 300
>>> df = pd.DataFrame({
...     "region": rng.integers(0, 3, size=n),
...     "psu_id": rng.integers(0, 30, size=n),
...     "pw": rng.uniform(0.5, 2.0, size=n),
...     "income": rng.normal(50, 10, size=n),
...     "age": rng.normal(40, 12, size=n),
... })
>>> design = sp.svydesign(data=df, weights='pw', strata='region',
...                       cluster='psu_id')
>>> type(design).__name__
'SurveyDesign'
>>> m = design.mean('income')
>>> g = design.glm('income ~ age')

svymean ¶

svymean(variables: Union[str, List[str]], design: 'SurveyDesign', alpha: float = 0.05) -> SurveyResult

Survey-weighted mean with design-corrected standard errors.

Uses the same Taylor-series linearisation as R's `survey::svymean`.
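
To make the linearisation idea concrete, here is a minimal NumPy sketch of a weighted mean with a linearised standard error. It is not StatsPAI's implementation: the function name `linearised_mean` is hypothetical, and the sketch assumes with-replacement sampling of PSUs, no finite population correction, PSU ids that are unique within each stratum, and that single-PSU strata contribute nothing to the variance.

```python
import numpy as np


def linearised_mean(y, w, strata, psu):
    """Weighted mean and a Taylor-linearised standard error (sketch)."""
    y, w = np.asarray(y, float), np.asarray(w, float)
    strata, psu = np.asarray(strata), np.asarray(psu)

    w_sum = w.sum()
    mean = (w * y).sum() / w_sum

    # Linearised variable: each unit's weighted influence on the ratio
    # sum(w * y) / sum(w).
    z = w * (y - mean) / w_sum

    var = 0.0
    for h in np.unique(strata):
        in_h = strata == h
        # Sum the linearised values within each PSU of stratum h.
        psu_totals = np.array([z[in_h & (psu == c)].sum()
                               for c in np.unique(psu[in_h])])
        n_h = len(psu_totals)
        if n_h > 1:  # assumption: single-PSU strata are skipped
            dev = psu_totals - psu_totals.mean()
            var += n_h / (n_h - 1) * (dev ** 2).sum()
    return mean, np.sqrt(var)


# Sanity case: one stratum, one unit per PSU, equal weights.
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
mean, se = linearised_mean(y, w=np.ones(5), strata=np.zeros(5),
                           psu=np.arange(5))
```

In this degenerate design the estimator collapses to the textbook standard error of the mean, `np.std(y, ddof=1) / np.sqrt(len(y))`, which is a convenient check on the formula.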

Parameters:

Name	Type	Description	Default
`variables`	`str or list of str`	Column name(s) in the design's data.	required
`design`	`SurveyDesign`		required
`alpha`	`float`		`0.05`

Returns:

Type	Description
`SurveyResult`

Examples:

>>> import numpy as np, pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(0)
>>> n = 200
>>> df = pd.DataFrame({
...     "income": rng.normal(50, 10, n),
...     "region": rng.integers(0, 4, n),   # strata
...     "psu_id": rng.integers(0, 20, n),  # clusters
...     "pw": rng.uniform(1.0, 3.0, n),    # sampling weights
... })
>>> design = sp.svydesign(data=df, weights="pw", strata="region",
...                       cluster="psu_id")
>>> res = sp.svymean("income", design)
>>> list(res.estimate.index)
['income']

svytotal ¶

svytotal(variables: Union[str, List[str]], design: 'SurveyDesign', alpha: float = 0.05) -> SurveyResult

Survey-weighted total with design-corrected standard errors.
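
As an illustration of what a design-corrected total involves, the sketch below computes the weighted total and a between-PSU standard error with NumPy. It is a simplified stand-in rather than StatsPAI's code: the name `linearised_total` is hypothetical, and the sketch assumes with-replacement PSU sampling, no finite population correction, and PSU ids that are unique within each stratum.

```python
import numpy as np


def linearised_total(y, w, strata, psu):
    """Weighted total and a between-PSU standard error (sketch)."""
    y, w = np.asarray(y, float), np.asarray(w, float)
    strata, psu = np.asarray(strata), np.asarray(psu)

    # A total is already linear in the weighted values, so no ratio
    # adjustment is needed before aggregating to PSUs.
    z = w * y

    var = 0.0
    for h in np.unique(strata):
        in_h = strata == h
        psu_totals = np.array([z[in_h & (psu == c)].sum()
                               for c in np.unique(psu[in_h])])
        n_h = len(psu_totals)
        if n_h > 1:  # assumption: single-PSU strata are skipped
            dev = psu_totals - psu_totals.mean()
            var += n_h / (n_h - 1) * (dev ** 2).sum()
    return z.sum(), np.sqrt(var)


# Simple random sample of 4 units from a population of 40:
# every sampling weight is 40 / 4 = 10.
y = np.array([2.0, 4.0, 6.0, 8.0])
total, se = linearised_total(y, w=np.full(4, 10.0),
                             strata=np.zeros(4), psu=np.arange(4))
```

For this equal-weight case the standard error matches the classical formula N * s / sqrt(n) with N = 40 and n = 4.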

Parameters:

Name	Type	Default
`variables`	`str or list of str`	required
`design`	`SurveyDesign`	required
`alpha`	`float`	`0.05`

Returns:

Type	Description
`SurveyResult`

Examples:

>>> import numpy as np, pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(0)
>>> n = 200
>>> df = pd.DataFrame({
...     "income": rng.normal(50, 10, n),
...     "region": rng.integers(0, 4, n),   # strata
...     "psu_id": rng.integers(0, 20, n),  # clusters
...     "pw": rng.uniform(1.0, 3.0, n),    # sampling weights
... })
>>> design = sp.svydesign(data=df, weights="pw", strata="region",
...                       cluster="psu_id")
>>> res = sp.svytotal("income", design)
>>> list(res.estimate.index)
['income']

svyglm ¶

svyglm(formula: str, design: 'SurveyDesign', family: str = 'gaussian', alpha: float = 0.05) -> SurveyResult

Survey-weighted generalised linear model.

Fits WLS (for gaussian family) or weighted IRLS (for binomial/poisson) and computes design-corrected standard errors via the sandwich estimator.
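
For the gaussian case, the two steps described above (a weighted least-squares fit followed by a sandwich variance) can be sketched with NumPy. This is an illustration under stated assumptions, not the library's code: `wls_sandwich` is a hypothetical name, the design is treated as a single stratum, PSUs are assumed sampled with replacement, and no finite population correction is applied.

```python
import numpy as np


def wls_sandwich(X, y, w, psu):
    """WLS coefficients with a cluster-level sandwich covariance (sketch)."""
    X, y, w = np.asarray(X, float), np.asarray(y, float), np.asarray(w, float)
    psu = np.asarray(psu)

    # Weighted normal equations: (X' W X) beta = X' W y.
    XtWX = X.T @ (w[:, None] * X)
    beta = np.linalg.solve(XtWX, X.T @ (w * y))

    # Per-observation score contributions, then summed within each PSU.
    resid = y - X @ beta
    scores = X * (w * resid)[:, None]
    clusters = np.unique(psu)
    G = np.array([scores[psu == c].sum(axis=0) for c in clusters])

    # "Meat": between-PSU variability of the score totals.
    n_c = len(clusters)
    dev = G - G.mean(axis=0)
    meat = n_c / (n_c - 1) * dev.T @ dev

    # Sandwich: bread @ meat @ bread with bread = (X' W X)^-1.
    bread = np.linalg.inv(XtWX)
    cov = bread @ meat @ bread
    return beta, np.sqrt(np.diag(cov))


rng = np.random.default_rng(0)
n = 200
age = rng.integers(20, 65, n).astype(float)
X = np.column_stack([np.ones(n), age])          # intercept + age
y = 10.0 + 0.5 * age + rng.normal(0, 5, n)      # synthetic outcome
w = rng.uniform(1.0, 3.0, n)                    # sampling weights
psu = rng.integers(0, 20, n)                    # cluster ids
beta, se = wls_sandwich(X, y, w, psu)
```

The point estimates are the ordinary weighted least-squares coefficients; only the standard errors change when the clustering is taken into account.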

Parameters:

Name	Type	Description	Default
`formula`	`str`	`"y ~ x1 + x2"` style formula.	required
`design`	`SurveyDesign`		required
`family`	`str`	`"gaussian"`, `"binomial"` (logistic), or `"poisson"`.	`'gaussian'`
`alpha`	`float`		`0.05`

Returns:

Type	Description
`SurveyResult`	Regression coefficient estimates.

Examples:

>>> import numpy as np, pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(0)
>>> n = 200
>>> df = pd.DataFrame({
...     "income": rng.normal(50, 10, n),
...     "age": rng.integers(20, 65, n),
...     "region": rng.integers(0, 4, n),   # strata
...     "psu_id": rng.integers(0, 20, n),  # clusters
...     "pw": rng.uniform(1.0, 3.0, n),    # sampling weights
... })
>>> design = sp.svydesign(data=df, weights="pw", strata="region",
...                       cluster="psu_id")
>>> res = sp.svyglm("income ~ age", design)
>>> list(res.estimate.index)
['Intercept', 'age']

rake ¶

rake(data: DataFrame, margins: Dict[str, Dict], weight: Optional[str] = None, max_iter: int = 100, tol: float = 1e-06) -> CalibrationResult

Raking (iterative proportional fitting).

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`		required
`margins`	`dict`	`{column_name: {category: target_proportion}}`. E.g. `{"sex": {"M": 0.49, "F": 0.51}, "age_group": {"18-34": 0.3, ...}}`.	required
`weight`	`str`	Existing design weight column. If `None`, starts with equal weights.	`None`
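
The iterative proportional fitting loop behind raking can be sketched in a few lines of NumPy. This is an illustrative version, not StatsPAI's implementation: `rake_weights` is a hypothetical name, the stopping rule (largest gap between current and target proportion below `tol`) is an assumption, and every target category is assumed to appear in the sample.

```python
import numpy as np


def rake_weights(columns, margins, max_iter=100, tol=1e-6):
    """Rake equal starting weights to the target proportions (sketch).

    columns: {name: array of category labels}
    margins: {name: {category: target_proportion}}
    """
    n = len(next(iter(columns.values())))
    w = np.ones(n)  # equal starting weights
    for _ in range(max_iter):
        max_gap = 0.0
        for name, targets in margins.items():
            labels = np.asarray(columns[name])
            total = w.sum()  # fixed while this variable is adjusted
            for category, share in targets.items():
                mask = labels == category
                current = w[mask].sum() / total
                max_gap = max(max_gap, abs(current - share))
                # Scale this category so its weighted share hits the target.
                w[mask] *= share / current
        if max_gap < tol:  # assumed convergence criterion
            return w, True
    return w, False


rng = np.random.default_rng(0)
n = 200
columns = {
    "sex": rng.choice(["M", "F"], size=n),
    "age_group": rng.choice(["18-34", "35-54", "55+"], size=n),
}
margins = {
    "sex": {"M": 0.49, "F": 0.51},
    "age_group": {"18-34": 0.30, "35-54": 0.40, "55+": 0.30},
}
weights, converged = rake_weights(columns, margins)
share_m = weights[columns["sex"] == "M"].sum() / weights.sum()
```

Each pass fixes one margin exactly and slightly disturbs the others, which is why the loop repeats until all margins agree with their targets at once.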

Examples:

>>> import statspai as sp
>>> import numpy as np
>>> import pandas as pd
>>> rng = np.random.default_rng(0)
>>> n = 200
>>> df = pd.DataFrame({
...     "sex": rng.choice(["M", "F"], size=n),
...     "age_group": rng.choice(["18-34", "35-54", "55+"], size=n),
... })
>>> margins = {
...     "sex": {"M": 0.49, "F": 0.51},
...     "age_group": {"18-34": 0.30, "35-54": 0.40, "55+": 0.30},
... }
>>> res = sp.rake(df, margins=margins)
>>> bool(res.converged)
True

linear_calibration ¶

linear_calibration(data: DataFrame, totals: Dict[str, float], weight: Optional[str] = None) -> CalibrationResult

Deville-Särndal (1992) linear calibration.

Find calibrated weights w_i = g_i d_i, where d_i are the design weights, minimising the chi-square distance Σ (w_i - d_i)² / d_i = Σ d_i (g_i - 1)² subject to Σ g_i d_i x_{ik} = T_k for each auxiliary variable k with known total T_k.
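
Under the standard chi-square distance Σ (w_i - d_i)² / d_i with w_i = g_i d_i, this problem has a closed-form solution: g_i = 1 + x_i'λ, where λ solves (Σ d_i x_i x_i') λ = T - Σ d_i x_i. The NumPy sketch below applies that formula directly. It assumes this distance function and is not taken from StatsPAI's source; the name `linear_calibrate` is hypothetical.

```python
import numpy as np


def linear_calibrate(X, d, totals):
    """Closed-form linear calibration weights (sketch).

    X: (n, k) auxiliary variables, d: (n,) design weights,
    totals: (k,) known population totals T_k.
    """
    X = np.asarray(X, float)
    d = np.asarray(d, float)
    totals = np.asarray(totals, float)

    # Solve (sum_i d_i x_i x_i') lam = T - sum_i d_i x_i.
    A = X.T @ (d[:, None] * X)
    lam = np.linalg.solve(A, totals - X.T @ d)

    g = 1.0 + X @ lam   # calibration factors g_i
    return g * d        # calibrated weights w_i = g_i * d_i


rng = np.random.default_rng(0)
n = 200
X = np.column_stack([rng.normal(50.0, 10.0, n),   # income
                     rng.normal(40.0, 12.0, n)])  # age
d = np.ones(n)                                    # equal design weights
totals = X.sum(axis=0) * np.array([1.05, 0.98])   # synthetic targets
w_cal = linear_calibrate(X, d, totals)
```

Unlike raking, the linear method needs no iteration, but it can produce negative weights when the targets are far from the design-weighted totals.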

Parameters:

Name	Type	Description	Default
`totals`	`dict`	`{column_name: population_total}` for continuous auxiliary variables.	required
`weight`	`str`	Design weight column. If `None`, uses equal weights.	`None`

Examples:

>>> import statspai as sp
>>> import numpy as np
>>> import pandas as pd
>>> rng = np.random.default_rng(0)
>>> n = 200
>>> df = pd.DataFrame({
...     "income": rng.normal(50.0, 10.0, size=n),
...     "age": rng.normal(40.0, 12.0, size=n),
... })
>>> totals = {"income": df["income"].sum() * 1.05,
...           "age": df["age"].sum() * 0.98}
>>> res = sp.linear_calibration(df, totals=totals)
>>> cal_total = float((res.calibrated_weights * df["income"]).sum())
>>> bool(abs(cal_total - totals["income"]) < 1e-4)
True
