`statspai.experimental`¶

experimental ¶

Experimental design and analysis tools.

Provides randomization, balance checking, attrition analysis, and pre-analysis plan generation for RCTs.

RandomizationResult ¶

Bases: ResultProtocolMixin

Results from randomization.

Produced by `randomize`. Carries the assigned data frame (`.data`), arm counts and an optional balance summary, with a formatted `.summary()`.

Examples:

>>> import statspai as sp
>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(0)
>>> n = 200
>>> df = pd.DataFrame({"age": rng.normal(40, 10, n),
...                    "income": rng.normal(50, 15, n),
...                    "district": rng.integers(0, 4, n)})
>>> res = sp.randomize(df, strata="district",
...                    balance_vars=["age", "income"], seed=1)
>>> type(res).__name__
'RandomizationResult'
>>> int(res.n_treated + res.n_control)
200
>>> isinstance(res.summary(), str)
True

BalanceResult ¶

Bases: ResultProtocolMixin

Results from balance check.

Produced by `balance_check`. Exposes the per-covariate balance `.table` (means, raw and normalized differences, t-test p-values), the omnibus F-test, and `.summary()` / `.plot()` (a love plot of normalized differences).

Examples:

>>> import statspai as sp
>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(0)
>>> n = 200
>>> treated = rng.integers(0, 2, n)
>>> df = pd.DataFrame({"treated": treated,
...                    "age": rng.normal(40, 10, n),
...                    "income": rng.normal(50, 15, n)})
>>> bal = sp.balance_check(df, treatment="treated",
...                        covariates=["age", "income"])
>>> type(bal).__name__
'BalanceResult'
>>> bal.n_treat + bal.n_control
200
>>> list(bal.table["variable"])
['age', 'income']
>>> isinstance(bal.summary(), str)
True

plot ¶

plot(ax: Any = None, **kwargs: Any) -> Any

Love plot of normalized differences.

AttritionResult ¶

Bases: ResultProtocolMixin

Results from attrition analysis.

Produced by `attrition_test`. Exposes the overall and arm-specific attrition rates, the differential-attrition chi-squared test, and an optional table of covariate predictors of attrition via `.summary()`.

Examples:

>>> import statspai as sp
>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(0)
>>> n = 400
>>> treated = rng.integers(0, 2, n)
>>> observed = (rng.random(n) < np.where(treated == 1, 0.9, 0.75)).astype(int)
>>> age = rng.normal(40, 10, n)
>>> df = pd.DataFrame({"treated": treated,
...                    "endline_observed": observed, "age": age})
>>> res = sp.attrition_test(df, treatment="treated",
...                         observed="endline_observed", covariates=["age"])
>>> type(res).__name__
'AttritionResult'
>>> res.n_total
400
>>> isinstance(res.summary(), str)
True

OptimalDesignResult ¶

Bases: ResultProtocolMixin

Results from optimal design calculation.

Returned by `optimal_design`. Carries the required total and per-arm sample sizes, the cluster count and size (for cluster designs), the intra-cluster correlation, the minimum detectable effect, and the target power.

Examples:

>>> import statspai as sp
>>> result = sp.optimal_design(
...     design="cluster", mde=0.2, sigma=1.0, icc=0.05, cluster_size=20
... )
>>> type(result).__name__
'OptimalDesignResult'
>>> result.design_type
'Cluster RCT'
>>> bool(result.n_total > 0 and result.n_clusters > 0)
True

randomize ¶

randomize(data: DataFrame, n_arms: int = 2, prob: Optional[List[float]] = None, strata: Optional[str] = None, cluster: Optional[str] = None, method: str = 'simple', balance_vars: Optional[List[str]] = None, n_rerand: int = 0, rerand_threshold: float = 0.001, seed: Optional[int] = None, treatment_col: str = 'treatment') -> RandomizationResult

Randomize units to treatment and control.

Equivalent to R's randomizr::complete_ra() / block_ra() / cluster_ra().

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Data with units to randomize.	required
`n_arms`	`int`	Number of treatment arms.	`2`
`prob`	`list of float`	Probability of each arm. Default: equal.	`None`
`strata`	`str`	Stratification variable for block randomization.	`None`
`cluster`	`str`	Cluster variable for cluster randomization.	`None`
`method`	`str`	'simple', 'complete', 'stratified', 'cluster'.	`'simple'`
`balance_vars`	`list of str`	Variables to check balance on (for re-randomization).	`None`
`n_rerand`	`int`	Number of re-randomization iterations (0 = no re-randomization).	`0`
`rerand_threshold`	`float`	Mahalanobis distance threshold for re-randomization.	`0.001`
`seed`	`int`	Random seed for reproducibility.	`None`
`treatment_col`	`str`	Name of treatment column to create.	`'treatment'`

Returns:

Type	Description
`RandomizationResult`

Examples:

>>> import statspai as sp
>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(0)
>>> n = 200
>>> df = pd.DataFrame({"district": rng.integers(0, 4, n),
...                    "age": rng.normal(40, 10, n),
...                    "income": rng.normal(50, 15, n)})
>>> result = sp.randomize(df, strata="district",
...                       balance_vars=["age", "income"], seed=1)
>>> type(result).__name__
'RandomizationResult'
>>> int(result.n_treated + result.n_control)
200
>>> bool(isinstance(result.summary(), str))
True
>>> df_randomized = result.data
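
Block (stratified) randomization, which the `strata` argument requests, can be pictured as a complete randomization run separately inside each stratum. The sketch below is a minimal pure-Python illustration of that idea, not statspai's implementation. The helper `block_randomize` and its arguments are hypothetical names, and the rule for odd-sized strata (the extra unit goes to a randomly chosen arm) is an assumption.

```python
import random
from collections import defaultdict


def block_randomize(strata, seed=None):
    """Assign 0/1 treatment within each stratum (hypothetical helper).

    `strata` is a sequence of stratum labels, one per unit. Within each
    stratum, half of the units (rounded down or up at random) are treated.
    """
    rng = random.Random(seed)

    # Group unit indices by stratum label.
    groups = defaultdict(list)
    for idx, label in enumerate(strata):
        groups[label].append(idx)

    assignment = [0] * len(strata)
    for label in sorted(groups):
        units = groups[label]
        rng.shuffle(units)
        # For odd-sized strata, randomly decide which arm gets the extra unit.
        n_treat = len(units) // 2
        if len(units) % 2 == 1 and rng.random() < 0.5:
            n_treat += 1
        for idx in units[:n_treat]:
            assignment[idx] = 1
    return assignment


districts = [i % 4 for i in range(200)]  # four strata of 50 units each
treatment = block_randomize(districts, seed=1)
```

Because assignment happens inside each stratum, every district ends up with the same treated share, which simple randomization does not guarantee.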

balance_check ¶

balance_check(data: DataFrame, treatment: str, covariates: List[str], alpha: float = 0.05) -> BalanceResult

Check covariate balance between treatment and control.

Computes normalized differences, t-tests, and omnibus F-test.

Equivalent to Stata's iebaltab and R's cobalt::bal.tab().

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Data containing the treatment indicator and the covariates.	required
`treatment`	`str`	Binary treatment variable (0/1).	required
`covariates`	`list of str`	Covariates to check balance on.	required
`alpha`	`float`	Significance level.	`0.05`

Returns:

Type	Description
`BalanceResult`

Notes

If the omnibus F-test cannot be computed (e.g. non-numeric or perfectly collinear covariates), `omnibus_f` and `omnibus_p` are reported as NaN and a `StatsPAIWarning` is emitted.

Examples:

>>> import statspai as sp
>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(0)
>>> n = 200
>>> df = pd.DataFrame({"treated": rng.integers(0, 2, n),
...                    "age": rng.normal(40, 10, n),
...                    "income": rng.normal(50, 15, n),
...                    "education": rng.normal(12, 3, n)})
>>> bal = sp.balance_check(df, treatment="treated",
...                        covariates=["age", "income", "education"])
>>> type(bal).__name__
'BalanceResult'
>>> list(bal.table["variable"])
['age', 'income', 'education']
>>> bool(isinstance(bal.summary(), str))
True
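
The normalized difference reported in the balance table can be illustrated with a few lines of standard-library Python. The sketch assumes the common convention of scaling the difference in means by sqrt((s_t^2 + s_c^2) / 2); this page does not state which scaling the library uses, and `normalized_difference` is a hypothetical helper name.

```python
import math
from statistics import mean, variance


def normalized_difference(x_treat, x_control):
    """Difference in means scaled by a pooled standard deviation.

    Uses sqrt((s_t^2 + s_c^2) / 2) as the scale, a common convention;
    the exact scaling used by the library is an assumption here.
    """
    pooled_sd = math.sqrt((variance(x_treat) + variance(x_control)) / 2)
    return (mean(x_treat) - mean(x_control)) / pooled_sd


age_treat = [38.0, 41.0, 44.0, 47.0]
age_control = [36.0, 39.0, 42.0, 45.0]
nd = normalized_difference(age_treat, age_control)
```

Unlike a t-statistic, this quantity does not grow with the sample size, which is why it is a popular choice for love plots.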

attrition_test ¶

attrition_test(data: DataFrame, treatment: str, observed: str, covariates: Optional[List[str]] = None) -> AttritionResult

Test for differential attrition in an RCT.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Data containing the treatment and observation indicators.	required
`treatment`	`str`	Treatment indicator (0/1).	required
`observed`	`str`	Indicator for whether outcome is observed (1) or missing (0).	required
`covariates`	`list of str`	Baseline covariates to test as predictors of attrition.	`None`

Returns:

Type	Description
`AttritionResult`

Notes

If the attrition-predictor regression for a covariate fails (e.g. constant or degenerate values), its row in `covariate_tests` is reported as NaN and a `StatsPAIWarning` names the covariate.

Examples:

>>> import statspai as sp
>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(0)
>>> n = 400
>>> treated = rng.integers(0, 2, n)
>>> # attrition slightly higher in the control arm
>>> p_obs = 0.9 - 0.1 * (treated == 0)
>>> df = pd.DataFrame({
...     "treated": treated,
...     "endline_observed": (rng.uniform(size=n) < p_obs).astype(int),
...     "age": rng.normal(40, 10, n),
...     "income": rng.normal(50, 15, n),
...     "education": rng.integers(8, 18, n).astype(float),
... })
>>> result = sp.attrition_test(df, treatment='treated',
...                            observed='endline_observed',
...                            covariates=['age', 'income', 'education'])
>>> _ = result.summary()
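
A differential-attrition chi-squared test compares the share of missing outcomes across arms using the 2x2 table of arm by observed status. The following is a minimal sketch of a Pearson test on that table using only the standard library. The helper `differential_attrition_chi2` is hypothetical, no continuity correction is applied (whether the library applies one is not stated here), and the code assumes both arms and both statuses occur in the data.

```python
import math


def differential_attrition_chi2(treated, observed):
    """Pearson chi-squared test on the 2x2 table of arm by observed status.

    Returns (attrition rate in treatment, attrition rate in control,
    chi-squared statistic, p-value). No continuity correction is applied,
    which is an assumption of this sketch.
    """
    # counts[arm][obs] with arm, obs in {0, 1}
    counts = [[0, 0], [0, 0]]
    for t, o in zip(treated, observed):
        counts[t][o] += 1

    n = len(treated)
    row = [sum(counts[0]), sum(counts[1])]
    col = [counts[0][0] + counts[1][0], counts[0][1] + counts[1][1]]

    chi2 = 0.0
    for a in (0, 1):
        for o in (0, 1):
            expected = row[a] * col[o] / n
            chi2 += (counts[a][o] - expected) ** 2 / expected

    # Survival function of a chi-squared variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    attr_treat = counts[1][0] / row[1]
    attr_control = counts[0][0] / row[0]
    return attr_treat, attr_control, chi2, p_value


# 200 units per arm: 20 lost in treatment, 50 lost in control.
treated = [1] * 200 + [0] * 200
observed = [1] * 180 + [0] * 20 + [1] * 150 + [0] * 50
attr_t, attr_c, chi2, p = differential_attrition_chi2(treated, observed)
```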

attrition_bounds ¶

attrition_bounds(data: DataFrame, y: str, treatment: str, observed: Optional[str] = None, method: str = 'lee', alpha: float = 0.05) -> Dict[str, Any]

Compute bounds on treatment effects under attrition.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Data containing the outcome and the treatment indicator.	required
`y`	`str`	Outcome variable.	required
`treatment`	`str`	Treatment indicator (0/1).	required
`observed`	`str`	Indicator for observed outcome. If None, uses non-missing y.	`None`
`method`	`str`	Bounding method: 'lee' (Lee 2009), 'manski' (worst-case).	`'lee'`
`alpha`	`float`		`0.05`

Returns:

Type	Description
`dict`	Keys: 'lower_bound', 'upper_bound', 'naive_ate', 'method', 'n_obs', 'n_total', 'attrition_rate'.

Examples:

>>> import statspai as sp
>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(0)
>>> n = 400
>>> treated = rng.integers(0, 2, n)
>>> y = 1.0 * treated + rng.normal(size=n)
>>> observed = (rng.random(n) < np.where(treated == 1, 0.9, 0.75)).astype(int)
>>> y = np.where(observed == 1, y, np.nan)
>>> df = pd.DataFrame({"y": y, "treated": treated, "observed": observed})
>>> res = sp.attrition_bounds(df, y="y", treatment="treated",
...                           observed="observed", method="lee")
>>> sorted(res.keys())
['attrition_rate', 'lower_bound', 'method', 'n_obs', 'n_total',
 'naive_ate', 'upper_bound']
>>> bool(res["lower_bound"] <= res["naive_ate"] <= res["upper_bound"])
True
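
The idea behind the `'lee'` method (Lee 2009) is to trim the arm with the higher response rate until both arms have the same share of observed outcomes, once from the top and once from the bottom of its outcome distribution. The sketch below illustrates that trimming step in pure Python under the usual monotonicity assumption. It is not the library's code: `lee_bounds` is a hypothetical helper, the number of trimmed observations is rounded down, and no standard errors are computed.

```python
from statistics import mean


def lee_bounds(y, treated, observed):
    """Trimming bounds on the treatment effect for always-observed units.

    Assumes monotonicity: treatment moves selection in one direction only.
    The number of trimmed observations is rounded down, a simplification.
    """
    y_t = sorted(v for v, t, o in zip(y, treated, observed) if t == 1 and o == 1)
    y_c = sorted(v for v, t, o in zip(y, treated, observed) if t == 0 and o == 1)
    n_t = sum(1 for t in treated if t == 1)
    n_c = len(treated) - n_t

    # Compare response rates len(y_t)/n_t and len(y_c)/n_c with integers
    # to avoid floating-point rounding in the trimming count.
    excess = len(y_t) * n_c - len(y_c) * n_t
    if excess >= 0:
        # Treatment arm has excess responders: trim it.
        k = excess // n_c
        lower = mean(y_t[: len(y_t) - k]) - mean(y_c)  # drop the k largest
        upper = mean(y_t[k:]) - mean(y_c)              # drop the k smallest
    else:
        # Control arm has excess responders: trim it instead.
        k = (-excess) // n_t
        lower = mean(y_t) - mean(y_c[k:])              # control mean at its highest
        upper = mean(y_t) - mean(y_c[: len(y_c) - k])  # control mean at its lowest
    return lower, upper


# 10 treated units, all observed; 10 control units, 2 with missing outcomes.
y = [float(v) for v in range(1, 11)] + [float(v) for v in range(8)] + [None, None]
treated = [1] * 10 + [0] * 10
observed = [1] * 10 + [1] * 8 + [0, 0]
lower, upper = lee_bounds(y, treated, observed)
```

In this toy data the untrimmed difference in observed means lies between the two bounds, mirroring the final check in the doctest above.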

References

[@lee2009training]

optimal_design ¶

optimal_design(design: str = 'individual', sigma: float = 1.0, mde: Optional[float] = None, power: float = 0.8, alpha: float = 0.05, n_arms: int = 2, prop_treat: float = 0.5, icc: float = 0.0, cluster_size: Optional[int] = None, n_clusters: Optional[int] = None, cost_per_cluster: Optional[float] = None, cost_per_unit: Optional[float] = None, r2: float = 0.0, baseline_mean: float = 0.0) -> OptimalDesignResult

Compute optimal sample size and design parameters.

Parameters:

Name	Type	Description	Default
`design`	`str`	'individual', 'cluster', 'stratified'.	`'individual'`
`sigma`	`float`	Standard deviation of the outcome.	`1.0`
`mde`	`float`	Minimum detectable effect. If None, compute MDE given n.	`None`
`power`	`float`	Statistical power (1 - Type II error).	`0.8`
`alpha`	`float`	Significance level.	`0.05`
`n_arms`	`int`	Number of treatment arms.	`2`
`prop_treat`	`float`	Proportion assigned to treatment.	`0.5`
`icc`	`float`	Intra-cluster correlation (for cluster designs).	`0.0`
`cluster_size`	`int`	Average cluster size.	`None`
`n_clusters`	`int`	Number of clusters (if fixed).	`None`
`cost_per_cluster`	`float`	Cost of adding a cluster (for optimal allocation).	`None`
`cost_per_unit`	`float`	Cost per individual unit.	`None`
`r2`	`float`	R-squared from baseline covariates (variance reduction).	`0.0`
`baseline_mean`	`float`		`0.0`

Returns:

Type	Description
`OptimalDesignResult`

Examples:

>>> import statspai as sp
>>> result = sp.optimal_design(
...     design="cluster", mde=0.2, sigma=1.0, icc=0.05, cluster_size=20
... )
>>> print(result.summary())
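
To make the role of `mde`, `sigma`, `power`, `alpha`, `prop_treat`, `icc`, `cluster_size` and `r2` concrete, here is a sketch of the textbook normal-approximation sample-size formula for a two-arm trial, with a cluster design inflated by the design effect 1 + (m - 1) * icc. It is an illustration under those assumptions, not a reproduction of what `optimal_design` computes; `required_sample_size` is a hypothetical helper, and treating `r2` as a simple (1 - r2) variance reduction is a simplification.

```python
import math
from statistics import NormalDist


def required_sample_size(mde, sigma=1.0, power=0.8, alpha=0.05,
                         prop_treat=0.5, icc=0.0, cluster_size=None, r2=0.0):
    """Total sample size for a two-arm trial with a two-sided test.

    Textbook normal-approximation formula; a cluster design is inflated by
    the design effect 1 + (m - 1) * icc. Returns (n_total, n_clusters),
    with n_clusters set to None for an individual design.
    """
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    n = (multiplier ** 2 * sigma ** 2 * (1 - r2)
         / (prop_treat * (1 - prop_treat) * mde ** 2))
    if cluster_size is None:
        return math.ceil(n), None
    # Inflate by the design effect, then round up to whole clusters.
    n *= 1 + (cluster_size - 1) * icc
    n_clusters = math.ceil(n / cluster_size)
    return n_clusters * cluster_size, n_clusters


individual = required_sample_size(mde=0.2)
clustered = required_sample_size(mde=0.2, icc=0.05, cluster_size=20)
```

Even a modest intra-cluster correlation raises the required sample noticeably, because units in the same cluster carry partly redundant information.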
