`statspai.rd`¶

rd ¶

Regression Discontinuity (RD) module for StatsPAI.

Regression-discontinuity tools exposed through the StatsPAI API:

Core estimation: - Sharp, Fuzzy, and Kink RD estimation with robust bias-corrected inference (CCT 2014) - Covariate-adjusted local polynomial estimation (Calonico et al. 2019) - Donut-hole RD for manipulation near cutoff - Regression Discontinuity in Time (Hausman-Rapson 2018) - Multi-cutoff and multi-score RD (Cattaneo et al. 2024) - Boundary discontinuity / 2D RD designs (Cattaneo-Titiunik-Yu 2025)

Bandwidth selection: - MSE-optimal: mserd, msetwo, msecomb1, msecomb2 - CER-optimal: cerrd, certwo, cercomb1, cercomb2 (Calonico-Cattaneo-Farrell 2020) - Fuzzy-specific and covariate-adjusted bandwidth selection

Inference: - Honest confidence intervals (Armstrong-Kolesar 2018, 2020) - Local randomization inference with Fisher exact tests (Cattaneo-Titiunik-VB 2016) - Window selection and sensitivity analysis - Rosenbaum sensitivity bounds

Treatment effect heterogeneity: - CATE estimation via fully interacted local linear (Calonico et al. 2025) - ML + RD: Causal Forest, Gradient Boosting, LASSO-assisted RD

External validity: - Extrapolation away from cutoff (Angrist-Rokkanen 2015) - Multi-cutoff extrapolation (Cattaneo et al. 2024) - External validity diagnostics

Diagnostics & visualization: - Diagnostic dashboard (rdsummary) - IMSE-optimal binned scatter with pointwise CI bands - Density manipulation testing (CJM 2020) - Bandwidth sensitivity, covariate balance, placebo cutoff tests - Power analysis and sample size calculations

Examples:

sp.rd is a callable dispatcher: the default path is the CCT robust bias-corrected estimator (method="rdrobust"), and every other RD variant is one method= away.

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(42)
>>> n = 1000
>>> x = rng.uniform(-1, 1, n)
>>> y = 0.8 * (x >= 0) + 0.5 * x + rng.normal(0, 0.3, n)
>>> df = pd.DataFrame({"x": x, "y": y})
>>> res = sp.rd(df, y="y", x="x", c=0)          # CCT robust bias-corrected
>>> type(res).__name__
'CausalResult'
>>> round(float(res.estimate), 3)
0.844
>>> honest = sp.rd(df, y="y", x="x", c=0, method="honest")
>>> round(float(honest.estimate), 3)            # Armstrong-Kolesar honest CI
0.781

RDMultiResult ¶

Bases: ResultProtocolMixin

Results from multi-cutoff/multi-score RD.

Returned by :func:rdmc. Holds per-cutoff estimates in cutoff_results plus the pooled estimate / SE / CI, and exposes :meth:summary and a forest :meth:plot.

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(0)
>>> n = 900
>>> running_var = rng.uniform(0, 100, n)
>>> cutoffs = [30, 60]
>>> nearest = np.array([min(cutoffs, key=lambda c: abs(rv - c))
...                     for rv in running_var])
>>> treated = (running_var >= nearest).astype(float)
>>> score = 1.0 + 0.02 * running_var + 2.0 * treated + rng.normal(0, 1, n)
>>> df = pd.DataFrame({"score": score, "running_var": running_var})
>>> result = sp.rdmc(df, y="score", x="running_var", cutoffs=[30, 60])
>>> isinstance(result, sp.RDMultiResult)
True
>>> result.n_cutoffs
2

plot ¶

plot(ax: Optional[Any] = None, **kwargs: Any) -> Any

Forest plot of cutoff-specific and pooled estimates.

RDInterferenceResult `dataclass` ¶

Bases: ResultProtocolMixin

Direct + spillover RDD effects under network interference.

Returned by :func:rd_interference. Carries the boundary direct effect and the neighbour-boundary spillover effect, each with a local-linear standard error, plus the sample size and bandwidth.

Examples:

>>> import statspai as sp
>>> import numpy as np
>>> import pandas as pd
>>> rng = np.random.default_rng(1)
>>> n = 600
>>> r = rng.uniform(-1, 1, size=n)
>>> rn = rng.uniform(-1, 1, size=n)
>>> y = (0.5 * (r >= 0) + 0.3 * (rn >= 0)
...      + 0.2 * r + 0.1 * rn + rng.normal(scale=0.3, size=n))
>>> df = pd.DataFrame({"y": y, "run": r, "nbr_run": rn})
>>> res = sp.rd_interference(df, y="y", running="run",
...                          neighbour_running="nbr_run")
>>> isinstance(res, sp.RDInterferenceResult)
True
>>> isinstance(res.summary(), str)
True

MultiScoreRDResult `dataclass` ¶

Bases: ResultProtocolMixin

Multi-score RDD effect on the eligibility boundary.

Attributes:

Name	Type	Description
`boundary_effect`	`float`	Local average treatment effect on the boundary manifold.
`se`	`float`	Standard error of `boundary_effect`.
`n_obs`	`int`	Number of observations used (after dropping missing rows).
`boundary_share`	`float`	Fraction of observations inside the kernel window.
`bandwidth`	`float`	Bandwidth applied to the (max-) distance to the boundary.

Examples:

>>> import statspai as sp
>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(0)
>>> n = 2000
>>> m = rng.uniform(400, 800, n)
>>> v = rng.uniform(400, 800, n)
>>> treat = ((m >= 600) & (v >= 600)).astype(int)
>>> y = 10.0 + 0.01 * m + 0.01 * v + 5.0 * treat + rng.normal(0, 2, n)
>>> df = pd.DataFrame({"y": y, "math": m, "verbal": v})
>>> res = sp.rd_multi_score(
...     df, y="y", running_vars=["math", "verbal"], cutoffs=[600.0, 600.0])
>>> isinstance(res, sp.MultiScoreRDResult)
True
>>> res.n_obs
2000
>>> bool(res.se >= 0.0)
True

DistRDResult `dataclass` ¶

Bases: ResultProtocolMixin

RDD effect on each quantile.

Attributes:

Name	Type	Description
`quantiles`	`ndarray`	Quantiles of `y` at which the effect is evaluated.
`qte`	`ndarray`	RDD effect on the conditional CDF at each quantile.
`se`	`ndarray`	Standard error of each `qte` estimate.
`bandwidth`	`float`	Bandwidth used in the local-linear fit.
`n_obs`	`int`	Number of complete observations.

Examples:

>>> import statspai as sp
>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(0)
>>> n = 400
>>> x = rng.uniform(-1, 1, n)
>>> y = 1.0 + 0.5 * x + (x >= 0) * 0.8 + rng.normal(0, 1, n)
>>> data = pd.DataFrame({"y": y, "x": x})
>>> res = sp.rd_distribution(data, y="y", running="x")
>>> bool(res.qte.shape == res.quantiles.shape)
True

BayesRDHTEResult `dataclass` ¶

Bases: ResultProtocolMixin

Bayesian RDD with HTE posterior summaries.

Returned by :func:rd_bayes_hte. posterior_mean / posterior_sd / posterior_ci summarise the average treatment effect at the cutoff, while cate / cate_sd give the per-unit covariate-conditional effect posteriors.

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(0)
>>> n = 400
>>> run = rng.uniform(-1, 1, size=n)
>>> x = rng.normal(size=n)
>>> treat = (run >= 0).astype(float)
>>> y = (1.0 + 0.5 * run + (0.4 + 0.3 * x) * treat
...      + rng.normal(scale=0.2, size=n))
>>> df = pd.DataFrame({'y': y, 'run': run, 'x': x})
>>> res = sp.rd_bayes_hte(df, y='y', running='run', covariates=['x'],
...                       cutoff=0.0, n_draws=500, seed=0)
>>> res.cate.shape == (res.n_obs,)
True
>>> bool(res.posterior_sd >= 0)
True

DDDResult `dataclass` ¶

Bases: ResultProtocolMixin

Distributional discontinuity design output.

Attributes:

Name	Type	Description
`quantiles`	`ndarray`	Quantiles of `y` at which the effects are evaluated.
`rdd_effect`	`ndarray`	Level jump in the conditional CDF at the cutoff (RDD).
`rkd_effect`	`ndarray`	Slope jump in the conditional CDF at the cutoff (RKD).
`bandwidth`	`float`	Bandwidth used in the local-linear fit.
`n_obs`	`int`	Number of complete observations.

Examples:

>>> import statspai as sp
>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(0)
>>> n = 400
>>> x = rng.uniform(-1, 1, n)
>>> y = 1.0 + 0.5 * x + (x >= 0) * 0.8 + rng.normal(0, 1, n)
>>> data = pd.DataFrame({"y": y, "x": x})
>>> res = sp.rd_distributional_design(data, y="y", running="x")
>>> bool(res.rdd_effect.shape == res.quantiles.shape)
True

rdplot ¶

rdplot(data: DataFrame, y: str, x: str, c: float = 0, nbins: Optional[int] = None, binselect: str = 'esmv', p: int = 4, kernel: str = 'triangular', ci_level: float = 0.95, shade_ci: bool = True, donut: float = 0, show_bw: bool = False, h: Optional[float] = None, covs: Optional[List[str]] = None, weights: Optional[str] = None, hide_ci: bool = False, scatter: bool = True, ax: Optional[Any] = None, figsize: Tuple[float, float] = (10, 7), title: Optional[str] = None, x_label: Optional[str] = None, y_label: Optional[str] = None) -> Tuple[Any, Any]

RD plot: binned scatter with polynomial fit on each side of the cutoff.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`		required
`y`	`str`	Outcome and running variable names.	required
`x`	`str`	Outcome and running variable names.	required
`c`	`float`	Cutoff.	`0`
`nbins`	`int`	Bins per side. If None, uses data-driven selection via binselect.	`None`
`binselect`	`str`	Bin selection method (when nbins=None): - 'es' : IMSE-optimal evenly spaced - 'espr' : IMSE-optimal evenly spaced (mimicking variance) - 'qs' : IMSE-optimal quantile-spaced - 'qspr' : IMSE-optimal quantile-spaced (mimicking variance) - 'esmv' : IMSE-optimal evenly spaced with variance mimicking (default) - 'qsmv' : IMSE-optimal quantile-spaced with variance mimicking	`'esmv'`
`p`	`int`	Polynomial order for the fitted curve.	`4`
`kernel`	`str`	Kernel for the fitted curve.	`'triangular'`
`ci_level`	`float`	Confidence level for pointwise CI bands.	`0.95`
`shade_ci`	`bool`	Show confidence interval bands around the polynomial fit.	`True`
`donut`	`float`	If > 0, shades the donut region \|x - c\| <= donut.	`0`
`show_bw`	`bool`	If True, shades the bandwidth window.	`False`
`h`	`float`	Bandwidth to display.	`None`
`covs`	`list of str`	Covariates to partial out before binning and plotting.	`None`
`weights`	`str`	Column name for observation weights in polynomial fitting.	`None`
`hide_ci`	`bool`	If True, suppress CI bands entirely.	`False`
`scatter`	`bool`	Show binned scatter points.	`True`
`ax`	`matplotlib Axes`		`None`
`figsize`	`tuple`		`(10, 7)`
`title`	`str`		`None`
`x_label`	`str`		`None`
`y_label`	`str`		`None`

Returns:

Type	Description
`(fig, ax)`

Examples:

>>> import numpy as np, pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(42)
>>> n = 1000
>>> x = rng.uniform(-1, 1, n)
>>> y = 0.5 * x + 2.0 * (x >= 0) + rng.normal(0, 0.4, n)
>>> df = pd.DataFrame({'y': y, 'x': x})
>>> fig, ax = sp.rdplot(df, y='y', x='x', c=0, title='RD plot')
>>> fig.savefig('rd_plot.png')

Typical flow: visualise first, then estimate with :func:rdrobust:

>>> result = sp.rdrobust(df, y='y', x='x', c=0)
>>> abs(result.estimate - 2.0) < 0.3
True

rdplotdensity ¶

rdplotdensity(data: DataFrame, x: str, c: float = 0, p: int = 2, n_grid: int = 50, h: Optional[float] = None, ci_level: float = 0.95, hist: bool = True, nbins: int = 30, ax: Optional[Any] = None, figsize: Tuple[float, float] = (10, 7), title: Optional[str] = None) -> Tuple[Any, Any]

Boundary-adaptive density discontinuity plot at the RD cutoff.

Implements the local-polynomial density estimator of Cattaneo, Jansson & Ma (2020, Journal of the American Statistical Association 115(531), 1449-1455): for each side of the cutoff the empirical CDF F̂ is constructed and a local polynomial of order p is fit to F̂. The slope at the cutoff equals the density f̂(c±), and the asymptotic variance comes from the same local-polynomial sandwich. Boundary-adaptive — does not require a separate boundary kernel.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`		required
`x`	`str`	Running variable.	required
`c`	`float`	Cutoff.	`0`
`p`	`int`	Polynomial order for the CDF regression (p=2 recommended; p=1 is faster but with worse boundary behavior).	`2`
`n_grid`	`int`	Grid points per side for the density curve.	`50`
`h`	`float`	Bandwidth. If None, side-specific Silverman pilot.	`None`
`ci_level`	`float`	Confidence level for CI bands.	`0.95`
`hist`	`bool`	Overlay histogram.	`True`
`nbins`	`int`	Number of histogram bins per side.	`30`
`ax`	`matplotlib Axes`		`None`
`figsize`	`tuple`		`(10, 7)`
`title`	`str`		`None`

Returns:

Type	Description
`(fig, ax)`

References

Cattaneo, M.D., Jansson, M. and Ma, X. (2020). "Simple Local Polynomial Density Estimators." JASA 115(531), 1449-1455. [@cattaneo2020simple]

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(42)
>>> df = pd.DataFrame({"x": rng.uniform(-1, 1, 500)})
>>> fig, ax = sp.rdplotdensity(df, x="x", c=0)

rdbwselect ¶

rdbwselect(data: DataFrame, y: str, x: str, c: float = 0, fuzzy: Optional[str] = None, deriv: int = 0, p: int = 1, q: Optional[int] = None, covs: Optional[List[str]] = None, kernel: str = 'triangular', bwselect: str = 'mserd', cluster: Optional[str] = None, all: bool = False) -> DataFrame

Bandwidth selection for local polynomial RD estimation.

Implements all eight MSE-optimal and CER-optimal bandwidth selection procedures from Calonico, Cattaneo, and Farrell (2020). MSE-optimal bandwidths minimize the mean squared error of the RD point estimator, while CER-optimal bandwidths minimize the coverage error rate of robust bias-corrected confidence intervals.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Input dataset.	required
`y`	`str`	Outcome variable column name.	required
`x`	`str`	Running variable column name.	required
`c`	`float`	RD cutoff value.	`0`
`fuzzy`	`str`	Treatment variable name for fuzzy RD. When provided, bandwidth accounts for first-stage variance in the Wald / IV estimator.	`None`
`deriv`	`int`	Derivative order. 0 = standard RD (jump in level), 1 = regression kink design (change in slope).	`0`
`p`	`int`	Polynomial order for point estimation (1 = local linear).	`1`
`q`	`int`	Polynomial order for bias correction. Default is p + 1.	`None`
`covs`	`list of str`	Covariate column names. When provided, the variance estimates used in bandwidth selection account for covariate adjustment, typically yielding narrower bandwidths.	`None`
`kernel`	`str`	Kernel function: 'triangular', 'uniform', or 'epanechnikov'.	`'triangular'`
`bwselect`	`str`	Bandwidth selection method. One of: `'mserd'` : MSE-optimal common bandwidth (default) `'msetwo'` : MSE-optimal separate left/right bandwidths `'msecomb1'` : min of mserd, mseleft, mseright `'msecomb2'` : median of mserd, mseleft, mseright `'cerrd'` : CER-optimal common bandwidth `'certwo'` : CER-optimal separate left/right `'cercomb1'` : min of cerrd, cerleft, cerright `'cercomb2'` : median of cerrd, cerleft, cerright	`'mserd'`
`cluster`	`str`	Cluster variable name for cluster-robust variance estimation.	`None`
`all`	`bool`	If True, compute and return all eight bandwidth types.	`False`

Returns:

Type	Description
`DataFrame`	DataFrame with columns `[method, h_left, h_right, b_left, b_right, n_left, n_right]`. When `all=False`, contains a single row for the selected method. When `all=True`, contains eight rows for all methods.

Notes

The CER-optimal bandwidth is related to the MSE-optimal bandwidth by:

h_CER = h_MSE * n^{-1/((2p+3)(2p+5))}

For p=1 (local linear), this gives h_CER ~ h_MSE * n^{-1/35}, which is strictly narrower than h_MSE. The narrower bandwidth yields confidence intervals with better coverage properties at the cost of slightly wider intervals.

The MSE-optimal bandwidth has rate n^{-1/(2p+3)} while the CER rate is n^{-1/(2p+3) - 1/((2p+3)(2p+5))}. For large samples the difference is meaningful: CER bandwidths produce robust CIs that achieve their nominal coverage rate, whereas MSE bandwidths can exhibit substantial coverage distortion.

References

Calonico, S., Cattaneo, M.D. and Farrell, M.H. (2020). "Optimal Bandwidth Choice for Robust Bias-Corrected Inference in Regression Discontinuity Designs." Econometrics Journal, 23(2), 192-210. [@calonico2020optimal]

Calonico, S., Cattaneo, M.D. and Titiunik, R. (2014). "Robust Nonparametric Confidence Intervals for Regression-Discontinuity Designs." Econometrica, 82(6), 2295-2326. [@calonico2014robust]

Examples:

Basic MSE-optimal bandwidth:

>>> import statspai as sp
>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(42)
>>> n = 2000
>>> x = rng.uniform(-1, 1, n)
>>> y = 0.5 * x + 3.0 * (x >= 0) + rng.normal(0, 0.3, n)
>>> df = pd.DataFrame({'y': y, 'x': x})
>>> bw = sp.rdbwselect(df, y='y', x='x', c=0)
>>> bool('h_left' in bw.columns)  # DataFrame with bandwidth info
True

Compare all eight bandwidth methods:

>>> bw_all = sp.rdbwselect(df, y='y', x='x', c=0, all=True)
>>> bool(len(bw_all) >= 1)
True

CER-optimal bandwidth for better coverage:

>>> bw_cer = sp.rdbwselect(df, y='y', x='x', c=0, bwselect='cerrd')
>>> bool(bw_cer is not None)
True

Fuzzy RD with covariates:

>>> treated = ((x >= 0) & (rng.uniform(0, 1, n) < 0.85)).astype(float)
>>> df['treated'] = treated
>>> df['age'] = rng.normal(40, 10, n)
>>> df['gender'] = rng.integers(0, 2, n).astype(float)
>>> bw_fuzzy = sp.rdbwselect(df, y='y', x='x', c=0,
...                          fuzzy='treated', covs=['age', 'gender'])
>>> bool(bw_fuzzy is not None)
True

rdbwsensitivity ¶

rdbwsensitivity(data: DataFrame, y: str, x: str, c: float = 0, fuzzy: Optional[str] = None, p: int = 1, kernel: str = 'triangular', bw_grid: Optional[List[float]] = None, n_grid: int = 15, bw_range: Tuple[float, float] = (0.5, 2.0), alpha: float = 0.05, ax: Optional[Any] = None, figsize: Tuple[float, float] = (10, 6)) -> DataFrame

Bandwidth sensitivity analysis for RD estimates.

Re-estimates the RD effect across a grid of bandwidths to assess robustness. Plots point estimates with confidence intervals.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`		required
`y`	`str`	Outcome and running variable names.	required
`x`	`str`	Outcome and running variable names.	required
`c`	`float`	Cutoff.	`0`
`fuzzy`	`str`	Treatment variable for fuzzy RD.	`None`
`p`	`int`	Polynomial order.	`1`
`kernel`	`str`		`'triangular'`
`bw_grid`	`list of float`	Explicit bandwidth values to evaluate. If None, auto-generates a grid as multiples of the MSE-optimal bandwidth.	`None`
`n_grid`	`int`	Number of grid points if bw_grid is None.	`15`
`bw_range`	`tuple`	Range of multipliers for the optimal bandwidth.	`(0.5, 2.0)`
`alpha`	`float`		`0.05`
`ax`	`matplotlib Axes`		`None`
`figsize`	`tuple`		`(10, 6)`

Returns:

Type	Description
`DataFrame`	Columns: bandwidth, estimate, se, ci_lower, ci_upper, pvalue.

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(42)
>>> n = 500
>>> x = rng.uniform(-1, 1, n)
>>> y = 0.8 * (x >= 0) + 0.5 * x + rng.normal(0, 0.3, n)
>>> df = pd.DataFrame({"x": x, "y": y})
>>> tab = sp.rdbwsensitivity(df, y="y", x="x", c=0.0, n_grid=5)
>>> tab.shape
(5, 6)
>>> list(tab.columns)
['bandwidth', 'estimate', 'se', 'ci_lower', 'ci_upper', 'pvalue']
>>> [round(v, 3) for v in tab["estimate"].tolist()]
[0.834, 0.816, 0.797, 0.789, 0.785]

rdbalance ¶

rdbalance(data: DataFrame, x: str, c: float = 0, covs: Optional[List[str]] = None, p: int = 1, kernel: str = 'triangular', h: Optional[float] = None, alpha: float = 0.05) -> DataFrame

Covariate balance test at the RD cutoff.

For each covariate, estimates the discontinuity at the cutoff using local polynomial regression. Pre-treatment covariates should show no significant jump at the cutoff if the RD design is valid.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`		required
`x`	`str`	Running variable name.	required
`c`	`float`	Cutoff.	`0`
`covs`	`list of str`	Covariate names to test. If None, tests all numeric columns except x.	`None`
`p`	`int`		`1`
`kernel`	`str`		`'triangular'`
`h`	`float`	Manual bandwidth. If None, uses MSE-optimal per covariate.	`None`
`alpha`	`float`		`0.05`

Returns:

Type	Description
`DataFrame`	Columns: covariate, estimate, se, z, pvalue, significant.

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(42)
>>> x = rng.uniform(-1, 1, 500)
>>> y = 0.5 * (x >= 0) + 0.8 * x + rng.normal(0, 0.3, 500)
>>> w1 = 0.3 * x + rng.normal(0, 0.5, 500)
>>> w2 = rng.normal(0, 1, 500)
>>> df = pd.DataFrame({"x": x, "y": y, "w1": w1, "w2": w2})
>>> bal = sp.rdbalance(df, x="x", covs=["w1", "w2"])
>>> bal["covariate"].tolist()
['w1', 'w2']
>>> bool(bal["significant"].any())
False

rdplacebo ¶

rdplacebo(data: DataFrame, y: str, x: str, c: float = 0, placebo_cutoffs: Optional[List[float]] = None, n_placebo: int = 10, side: str = 'both', fuzzy: Optional[str] = None, p: int = 1, kernel: str = 'triangular', alpha: float = 0.05, ax: Optional[Any] = None, figsize: Tuple[float, float] = (10, 6)) -> DataFrame

Placebo cutoff test for RD validity.

Estimates the RD effect at fake cutoff points where no treatment effect should exist. Significant effects at placebo cutoffs suggest the RD design may be invalid.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`		required
`y`	`str`	Outcome and running variable names.	required
`x`	`str`	Outcome and running variable names.	required
`c`	`float`	True cutoff.	`0`
`placebo_cutoffs`	`list of float`	Explicit placebo cutoff values. If None, auto-generates from the data distribution.	`None`
`n_placebo`	`int`	Number of placebo cutoffs if auto-generating.	`10`
`side`	`str`	Which side of the true cutoff to place placebos: 'left', 'right', or 'both'.	`'both'`
`fuzzy`	`str`	Treatment variable for fuzzy RD.	`None`
`p`	`int`		`1`
`kernel`	`str`		`'triangular'`
`alpha`	`float`		`0.05`
`ax`	`matplotlib Axes`		`None`
`figsize`	`tuple`		`(10, 6)`

Returns:

Type	Description
`DataFrame`	Columns: cutoff, estimate, se, ci_lower, ci_upper, pvalue, is_true_cutoff.

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(42)
>>> x = rng.uniform(-1, 1, 500)
>>> y = 0.5 * (x >= 0) + 0.8 * x + rng.normal(0, 0.3, 500)
>>> df = pd.DataFrame({"x": x, "y": y})
>>> pl = sp.rdplacebo(df, y="y", x="x", n_placebo=4)
>>> true_row = pl[pl["is_true_cutoff"]]
>>> round(float(true_row["estimate"].iloc[0]), 2)
0.51
>>> int((pl[~pl["is_true_cutoff"]]["pvalue"] < 0.05).sum())
0

rdsummary ¶

rdsummary(data: DataFrame, y: str, x: str, c: float = 0, fuzzy: Optional[str] = None, covs: Optional[List[str]] = None, p: int = 1, kernel: str = 'triangular', alpha: float = 0.05, verbose: bool = True, plot: bool = False, full: bool = False) -> Dict[str, Any]

One-stop RD diagnostic battery.

Runs the main RD estimate plus all standard validation checks in a single call, returning a structured summary.

Diagnostics run (standard): 1. Main RD estimate (conventional + robust) 2. Density manipulation test (CJM 2020) 3. Covariate balance at cutoff (if covs provided) 4. Bandwidth sensitivity

Extended diagnostics (full=True): 5. Honest CI (Armstrong-Kolesár 2020) 6. Power analysis (MDE at 80%) 7. Placebo cutoff tests (5 placebos per side) 8. Multiple bandwidth selectors comparison

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`		required
`y`	`str`	Outcome and running variable.	required
`x`	`str`	Outcome and running variable.	required
`c`	`float`	Cutoff.	`0`
`fuzzy`	`str`	Treatment variable for fuzzy RD.	`None`
`covs`	`list of str`	Pre-treatment covariates for balance test.	`None`
`p`	`int`		`1`
`kernel`	`str`		`'triangular'`
`alpha`	`float`		`0.05`
`verbose`	`bool`	Print formatted summary to console.	`True`
`plot`	`bool`	Generate a multi-panel diagnostic plot.	`False`
`full`	`bool`	Run extended diagnostics (honest CI, power, placebos).	`False`

Returns:

Type	Description
`dict with keys:`	'estimate': CausalResult from rdrobust 'density_test': CausalResult from rddensity 'balance': pd.DataFrame (if covs given) 'bw_sensitivity': pd.DataFrame 'honest_ci': CausalResult (if full=True) 'power': RDPowerResult (if full=True) 'placebos': pd.DataFrame (if full=True) 'bandwidth_comparison': pd.DataFrame (if full=True) 'figure': matplotlib Figure (if plot=True)

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(42)
>>> x = rng.uniform(-1, 1, 500)
>>> y = 0.5 * (x >= 0) + 0.8 * x + rng.normal(0, 0.3, 500)
>>> w1 = 0.3 * x + rng.normal(0, 0.5, 500)
>>> df = pd.DataFrame({"x": x, "y": y, "w1": w1})
>>> res = sp.rdsummary(df, y="y", x="x", covs=["w1"],
...                    verbose=False)
>>> sorted(res.keys())
['balance', 'bw_sensitivity', 'density_test', 'estimate']
>>> round(float(res["estimate"].estimate), 2)
0.51

rd_honest ¶

rd_honest(data: DataFrame, y: str, x: str, c: float = 0.0, M: Optional[float] = None, kernel: str = 'triangular', h: Optional[float] = None, alpha: float = 0.05, opt_criterion: str = 'mse') -> CausalResult

Honest confidence intervals for regression discontinuity designs.

Implements Armstrong & Kolesár (2018, 2020): CIs that are valid uniformly over the class of regression functions whose second derivative is bounded by M. These "honest" CIs account for the smoothing bias that standard local-polynomial CIs ignore, yielding correct coverage even in finite samples.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Input data.	required
`y`	`str`	Outcome variable name.	required
`x`	`str`	Running variable name.	required
`c`	`float`	RD cutoff.	`0`
`M`	`float`	Upper bound on \|f''(c)\|. If `None`, estimated from a local quadratic fit on each side of the cutoff.	`None`
`kernel`	`str`	Kernel function: `"triangular"`, `"epanechnikov"`, or `"uniform"`.	`"triangular"`
`h`	`float`	Bandwidth. If `None`, chosen by the criterion in opt_criterion.	`None`
`alpha`	`float`	Significance level for the confidence interval.	`0.05`
`opt_criterion`	`str`	Bandwidth selection criterion when h is `None`: `"mse"` (MSE-optimal, Imbens-Kalyanaraman style) or `"flci"` (minimises honest CI length).	`"mse"`

Returns:

Type	Description
`CausalResult`	Result object with `model_info` containing: - `honest_ci` : tuple – honest confidence interval - `naive_ci` : tuple – standard CI for comparison - `M` : float – smoothness bound used - `bias_bound`: float – estimated worst-case bias - `bandwidth` : float – bandwidth used

Examples:

>>> import numpy as np, pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(42)
>>> n = 1000
>>> x = rng.uniform(-1, 1, n)
>>> y = 0.5 * x + 2.0 * (x >= 0) + rng.normal(0, 0.4, n)
>>> df = pd.DataFrame({'y': y, 'x': x})
>>> result = sp.rd_honest(df, y='y', x='x', c=0)
>>> bool(abs(result.estimate - 2.0) < 0.3)  # true jump is 2.0
True
>>> lo, hi = result.model_info['honest_ci']
>>> nlo, nhi = result.model_info['naive_ci']
>>> bool((hi - lo) > (nhi - nlo))  # honest CI widens for worst-case bias
True

References

Armstrong, T. B. and Kolesár, M. (2018). Optimal Inference in a Class of Regression Models. Econometrica. [@armstrong2018optimal]

Armstrong, T. B. and Kolesár, M. (2020). Simple and honest confidence intervals in nonparametric regression. Quantitative Economics. [@armstrong2020simple]

rdmc ¶

rdmc(data: DataFrame, y: str, x: str, cutoffs: List[float], bandwidth: Optional[float] = None, kernel: str = 'triangular', pooling: str = 'ivw', alpha: float = 0.05) -> RDMultiResult

Multi-cutoff RD design.

Estimates treatment effects at multiple cutoffs and pools them.

Equivalent to rdmulti::rdmc() in R/Stata.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`		required
`y`	`str`	Outcome variable.	required
`x`	`str`	Running variable.	required
`cutoffs`	`list of float`	Cutoff values.	required
`bandwidth`	`float`	Bandwidth for local polynomial. If None, uses Silverman rule.	`None`
`kernel`	`str`		`'triangular'`
`pooling`	`str`	Pooling method: 'ivw' (inverse-variance weighted) or 'equal'.	`'ivw'`
`alpha`	`float`		`0.05`

Returns:

Type	Description
`RDMultiResult`

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(0)
>>> n = 900
>>> running_var = rng.uniform(0, 100, n)
>>> cutoffs = [30, 60]
>>> nearest = np.array([min(cutoffs, key=lambda c: abs(rv - c))
...                     for rv in running_var])
>>> treated = (running_var >= nearest).astype(float)
>>> score = 1.0 + 0.02 * running_var + 2.0 * treated + rng.normal(0, 1, n)
>>> df = pd.DataFrame({"score": score, "running_var": running_var})
>>> result = sp.rdmc(df, y="score", x="running_var", cutoffs=[30, 60])
>>> result.n_cutoffs
2
>>> summary_text = result.summary()
>>> ax = result.plot()

rdms ¶

rdms(data: DataFrame, y: str, x1: str, x2: str, cutoff1: float = 0, cutoff2: float = 0, bandwidth: Optional[float] = None, kernel: str = 'triangular', alpha: float = 0.05) -> CausalResult

Multi-score / Geographic RD design.

Handles two-dimensional running variables for geographic boundaries.

Equivalent to rdmulti::rdms() and Keele & Titiunik (2015).

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`		required
`y`	`str`	Outcome variable.	required
`x1`	`str`	First running variable (e.g., latitude distance to boundary).	required
`x2`	`str`	Second running variable (e.g., longitude distance to boundary).	required
`cutoff1`	`float`	Cutoff for x1.	`0`
`cutoff2`	`float`	Cutoff for x2.	`0`
`bandwidth`	`float`		`None`
`kernel`	`str`		`'triangular'`
`alpha`	`float`		`0.05`

Returns:

Type	Description
`CausalResult`

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(1)
>>> n = 800
>>> dist_lat = rng.uniform(-5, 5, n)
>>> dist_lon = rng.uniform(-5, 5, n)
>>> treated = (dist_lat >= 0).astype(float)
>>> outcome = (1.0 + 1.5 * treated + 0.1 * dist_lat
...            + 0.1 * dist_lon + rng.normal(0, 1, n))
>>> df = pd.DataFrame({"outcome": outcome,
...                    "dist_lat": dist_lat, "dist_lon": dist_lon})
>>> result = sp.rdms(df, y="outcome", x1="dist_lat", x2="dist_lon",
...                  bandwidth=3.0)
>>> summary_text = result.summary()

rdsampsi ¶

rdsampsi(tau: float, var_left: float = 1.0, var_right: float = 1.0, h_left: float = 0.5, h_right: float = 0.5, alpha: float = 0.05, target_power: float = 0.8, ratio: float = 1.0) -> RDSampSiResult

Minimum sample size for a given power in an RD design.

Parameters:

Name	Type	Description	Default
`ratio`	`float`	`n_right / n_left`. Default 1.0 assumes equal allocation.	`1.0`

Examples:

>>> import statspai as sp
>>> res = sp.rdsampsi(tau=0.15, target_power=0.80)
>>> res.n_total
978
>>> (res.n_left, res.n_right)
(489, 489)

rdrandinf ¶

rdrandinf(data: DataFrame, y: str, x: str, c: float = 0, wl: Optional[float] = None, wr: Optional[float] = None, statistic: str = 'diffmeans', p: int = 0, covs: Optional[List[str]] = None, kernel: str = 'uniform', n_perms: int = 1000, fuzzy: Optional[str] = None, alpha: float = 0.05, seed: int = 42) -> CausalResult

Randomization inference for regression discontinuity designs.

Under the local randomization assumption, units within a small window around the cutoff are treated as if randomly assigned. Inference is based on Fisher's randomization test.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Input dataset.	required
`y`	`str`	Outcome variable name.	required
`x`	`str`	Running variable name.	required
`c`	`float`	RD cutoff value.	`0`
`wl`	`float`	Window left bound offset from cutoff (typically negative). The left edge of the window is `c + wl`.	`None`
`wr`	`float`	Window right bound offset from cutoff (typically positive). The right edge of the window is `c + wr`.	`None`
`statistic`	`str`	Test statistic: 'diffmeans', 'ksmirnov', 'ranksum', or 'all'.	`'diffmeans'`
`p`	`int`	Polynomial order for adjustment (0 = unadjusted).	`0`
`covs`	`list of str`	Covariate names to partial out before testing.	`None`
`kernel`	`str`	Kernel weighting (only 'uniform' currently supported for local randomization).	`'uniform'`
`n_perms`	`int`	Number of permutations for Fisher randomization test.	`1000`
`fuzzy`	`str`	Actual treatment variable for fuzzy RD. The Wald (IV) estimator is computed within the window.	`None`
`alpha`	`float`	Significance level.	`0.05`
`seed`	`int`	Random seed for reproducibility.	`42`

Returns:

Type	Description
`CausalResult`	Result with treatment effect estimate, permutation and asymptotic p-values, and confidence interval.

Notes

The confidence interval is obtained by test inversion when statistic='diffmeans': the set of hypothesised effect values tau_0 that are not rejected by the permutation test at level alpha.

References

Cattaneo, M.D., Titiunik, R. and Vazquez-Bare, G. (2016). "Inference in Regression Discontinuity Designs under Local Randomization." The Stata Journal, 16(2), 331-367. [@cattaneo2016inference]

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(42)
>>> n = 500
>>> x = rng.uniform(-1, 1, n)
>>> y = 0.8 * (x >= 0) + 0.5 * x + rng.normal(0, 0.3, n)
>>> df = pd.DataFrame({"x": x, "y": y})
>>> res = sp.rdrandinf(df, y="y", x="x", c=0.0, wl=-0.3,
...                    wr=0.3, n_perms=500, seed=42)
>>> round(float(res.estimate), 3)
0.926
>>> res.pvalue is not None
True

rdwinselect ¶

rdwinselect(data: DataFrame, x: str, c: float = 0, covs: Optional[List[str]] = None, wmin: Optional[float] = None, wstep: Optional[float] = None, nwindows: int = 10, statistic: str = 'diffmeans', p: int = 0, seed: int = 42, alpha: float = 0.15) -> DataFrame

Data-driven window selection for local randomization RD.

Tests covariate balance at successively larger windows around the cutoff. The recommended window is the largest for which all covariates remain balanced (p > alpha).

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Input dataset.	required
`x`	`str`	Running variable name.	required
`c`	`float`	RD cutoff value.	`0`
`covs`	`list of str`	Covariate names to test balance for. If None, uses quantiles of the running variable as pseudo-covariates.	`None`
`wmin`	`float`	Minimum half-window width. Defaults to the smallest gap between adjacent observations near the cutoff.	`None`
`wstep`	`float`	Window increment. Defaults to `(max_range - wmin) / nwindows`.	`None`
`nwindows`	`int`	Number of windows to evaluate.	`10`
`statistic`	`str`	Test statistic for balance testing.	`'diffmeans'`
`p`	`int`	Polynomial order for adjustment.	`0`
`seed`	`int`	Random seed.	`42`
`alpha`	`float`	Significance level for balance (lenient by default to be conservative about window selection).	`0.15`

Returns:

Type	Description
`DataFrame`	Columns: window_left, window_right, n_left, n_right, p_value, balanced. Rows sorted by window width.

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(42)
>>> n = 500
>>> x = rng.uniform(-1, 1, n)
>>> z1 = 0.2 * x + rng.normal(0, 1, n)
>>> df = pd.DataFrame({"x": x, "z1": z1})
>>> tab = sp.rdwinselect(df, x="x", c=0.0, covs=["z1"],
...                      nwindows=5)
>>> tab.shape
(5, 6)
>>> list(tab.columns)
['window_left', 'window_right', 'n_left', 'n_right', 'p_value', 'balanced']
>>> tab["n_left"].tolist()
[26, 74, 129, 193, 249]

rdsensitivity ¶

rdsensitivity(data: DataFrame, y: str, x: str, c: float = 0, wlist: Optional[List[float]] = None, nwindows: int = 20, statistic: str = 'diffmeans', p: int = 0, n_perms: int = 500, seed: int = 42, alpha: float = 0.05) -> DataFrame

Sensitivity of RD estimates across different window widths.

For each window, runs rdrandinf and records the estimate, standard error, and p-value. Optionally produces a plot if matplotlib is available.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Input dataset.	required
`y`	`str`	Outcome variable name.	required
`x`	`str`	Running variable name.	required
`c`	`float`	RD cutoff value.	`0`
`wlist`	`list of float`	Symmetric half-window widths to evaluate. If None, an evenly-spaced grid is generated automatically.	`None`
`nwindows`	`int`	Number of windows when `wlist` is None.	`20`
`statistic`	`str`	Test statistic for inference.	`'diffmeans'`
`p`	`int`	Polynomial order for adjustment.	`0`
`n_perms`	`int`	Number of permutations per window.	`500`
`seed`	`int`	Random seed.	`42`
`alpha`	`float`	Significance level.	`0.05`

Returns:

Type	Description
`DataFrame`	Columns: window, estimate, se, pvalue, ci_lower, ci_upper, significant.

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(42)
>>> x = rng.uniform(-1, 1, 500)
>>> y = 0.5 * (x >= 0) + 0.8 * x + rng.normal(0, 0.3, 500)
>>> df = pd.DataFrame({"x": x, "y": y})
>>> sens = sp.rdsensitivity(
...     df, y="y", x="x", wlist=[0.25, 0.5, 0.75],
...     n_perms=200, seed=42,
... )
>>> sens["estimate"].round(3).tolist()
[0.66, 0.834, 1.078]
>>> bool(sens["significant"].all())
True

rdrbounds ¶

rdrbounds(data: DataFrame, y: str, x: str, c: float = 0, wl: Optional[float] = None, wr: Optional[float] = None, gamma_list: Optional[List[float]] = None, statistic: str = 'ranksum', n_perms: int = 1000, seed: int = 42) -> DataFrame

Rosenbaum sensitivity bounds for RD under local randomization.

Assesses how much hidden bias (departure from random assignment) would be needed to explain away the estimated treatment effect. For each gamma (odds ratio), computes upper and lower p-value bounds under worst-case confounding.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Input dataset.	required
`y`	`str`	Outcome variable name.	required
`x`	`str`	Running variable name.	required
`c`	`float`	RD cutoff value.	`0`
`wl`	`float`	Window left bound offset (typically negative).	`None`
`wr`	`float`	Window right bound offset (typically positive).	`None`
`gamma_list`	`list of float`	Odds ratios to evaluate. Defaults to [1, 1.5, 2, 2.5, 3, 4, 5]. gamma=1 is pure randomization; gamma>1 allows confounding.	`None`
`statistic`	`str`	Test statistic (ranksum is standard for Rosenbaum bounds).	`'ranksum'`
`n_perms`	`int`	Number of permutations for p-value computation.	`1000`
`seed`	`int`	Random seed.	`42`

Returns:

Type	Description
`DataFrame`	Columns: gamma, pvalue_upper, pvalue_lower.

Notes

Under Rosenbaum's model, if gamma=Gamma, the probability that unit i is treated satisfies:

1/(1+Gamma) <= P(D_i=1) <= Gamma/(1+Gamma)

instead of the uniform 1/2 under pure randomization. The bounds on the p-value are obtained by computing the worst-case assignment probabilities at each gamma level.

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(42)
>>> n = 500
>>> x = rng.uniform(-1, 1, n)
>>> y = 0.8 * (x >= 0) + 0.5 * x + rng.normal(0, 0.3, n)
>>> df = pd.DataFrame({"x": x, "y": y})
>>> tab = sp.rdrbounds(df, y="y", x="x", c=0.0, wl=-0.3,
...                    wr=0.3, gamma_list=[1.0, 1.5, 2.0],
...                    n_perms=500, seed=42)
>>> tab.shape
(3, 3)
>>> list(tab.columns)
['gamma', 'pvalue_upper', 'pvalue_lower']
>>> tab["gamma"].tolist()
[1.0, 1.5, 2.0]

rdhte ¶

rdhte(data: DataFrame, y: str, x: str, z: Union[str, List[str]], c: float = 0, p: int = 1, h: Optional[float] = None, b: Optional[float] = None, kernel: str = 'triangular', bwselect: str = 'mserd', cluster: Optional[str] = None, alpha: float = 0.05, eval_points: Optional[ndarray] = None, n_eval: int = 20) -> CausalResult

Estimate conditional average treatment effects (CATE) in RD designs.

Fits a fully interacted local polynomial model on each side of the cutoff and computes CATE(z) = (alpha_R - alpha_L) + z'(gamma_R - gamma_L).

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Input dataset.	required
`y`	`str`	Outcome variable name.	required
`x`	`str`	Running variable name.	required
`z`	`str or list of str`	Covariate(s) for treatment effect heterogeneity.	required
`c`	`float`	RD cutoff value.	`0`
`p`	`int`	Polynomial order for the running variable (1 = local linear).	`1`
`h`	`float`	Bandwidth for estimation. If None, MSE-optimal bandwidth is selected.	`None`
`b`	`float`	Bandwidth for bias correction. Defaults to h.	`None`
`kernel`	`str`	Kernel function: 'triangular', 'uniform', or 'epanechnikov'.	`'triangular'`
`bwselect`	`str`	Bandwidth selection method: 'mserd' or 'msetwo'.	`'mserd'`
`cluster`	`str`	Cluster variable name for cluster-robust standard errors.	`None`
`alpha`	`float`	Significance level for confidence intervals.	`0.05`
`eval_points`	`ndarray`	Z values at which to evaluate CATE. Each row is a point in Z-space. If None, n_eval equally spaced quantiles (10th to 90th pctile) are used.	`None`
`n_eval`	`int`	Number of evaluation points when eval_points is not provided.	`20`

Returns:

Type	Description
`CausalResult`	estimate: average CATE across evaluation points (the ATE) detail: DataFrame with [z_value, cate, se, ci_lower, ci_upper, pvalue] model_info: coefficients, heterogeneity test, bandwidth info

Examples:

>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(42)
>>> n = 2000
>>> X = rng.uniform(-1, 1, n)
>>> Z = rng.normal(0, 1, n)
>>> tau_z = 2.0 + 1.5 * Z  # CATE varies with Z
>>> Y = 0.5 * X + tau_z * (X >= 0) + rng.normal(0, 0.5, n)
>>> df = pd.DataFrame({'y': Y, 'x': X, 'z': Z})
>>> import statspai as sp
>>> result = sp.rdhte(df, y='y', x='x', z='z', c=0)
>>> bool(abs(result.estimate - 2.0) < 1.0)  # average CATE near 2
True

rdbwhte ¶

rdbwhte(data: DataFrame, y: str, x: str, z: Union[str, List[str]], c: float = 0, p: int = 1, kernel: str = 'triangular') -> float

MSE-optimal bandwidth selection for the fully interacted RD model.

Accounts for the additional variance introduced by the covariate interaction terms compared to the standard local polynomial RD bandwidth.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Input dataset.	required
`y`	`str`	Outcome variable name.	required
`x`	`str`	Running variable name.	required
`z`	`str or list of str`	Covariate(s) for treatment effect heterogeneity.	required
`c`	`float`	RD cutoff value.	`0`
`p`	`int`	Polynomial order.	`1`
`kernel`	`str`	Kernel function.	`'triangular'`

Returns:

Type	Description
`float`	MSE-optimal bandwidth.

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(42)
>>> n = 500
>>> x = rng.uniform(-1, 1, n)
>>> z = rng.integers(0, 2, n)
>>> y = (0.8 * (x >= 0) + 0.3 * z * (x >= 0) + 0.5 * x
...      + rng.normal(0, 0.3, n))
>>> df = pd.DataFrame({"x": x, "y": y, "z": z})
>>> h = sp.rdbwhte(df, y="y", x="x", z="z", c=0.0)
>>> round(float(h), 3)
0.174

rdhte_lincom ¶

rdhte_lincom(result: CausalResult, weights: ndarray, alpha: float = 0.05) -> Dict[str, Any]

Compute a weighted linear combination of CATE estimates.

Useful for computing group-specific average treatment effects, e.g., the average CATE for males vs females.

Parameters:

Name	Type	Description	Default
`result`	`CausalResult`	Result from rdhte().	required
`weights`	`ndarray`	Linear combination weights. Must have length equal to the number of evaluation points in result.detail.	required
`alpha`	`float`	Significance level.	`0.05`

Returns:

Type	Description
`dict`	Keys: 'estimate', 'se', 'ci', 'pvalue'.

Examples:

Compute the average CATE for the first half of the Z grid:

>>> import numpy as np, pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(42)
>>> n = 2000
>>> X = rng.uniform(-1, 1, n)
>>> Z = rng.normal(0, 1, n)
>>> Y = 0.5 * X + (2.0 + 1.5 * Z) * (X >= 0) + rng.normal(0, 0.5, n)
>>> df = pd.DataFrame({'y': Y, 'x': X, 'z': Z})
>>> result = sp.rdhte(df, y='y', x='x', z='z', c=0)
>>> n_pts = len(result.detail)
>>> w1 = np.zeros(n_pts)
>>> w1[:n_pts // 2] = 1.0 / (n_pts // 2)  # first half of the Z grid
>>> lincom1 = sp.rdhte_lincom(result, w1)
>>> bool('estimate' in lincom1)
True

rd2d_bw ¶

rd2d_bw(data: DataFrame, y: str, x1: str, x2: str, treatment: str, boundary: Optional[Callable] = None, approach: str = 'distance', p: int = 1, kernel: str = 'triangular') -> float

Bandwidth selection for 2D boundary RD.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Input dataset.	required
`y`	`str`	Outcome variable name.	required
`x1`	`str`	Running variable names.	required
`x2`	`str`	Running variable names.	required
`treatment`	`str`	Binary treatment indicator.	required
`boundary`	`callable`	Boundary function f(x1) -> x2. None implies x1 = 0.	`None`
`approach`	`str`	'distance' or 'location'.	`'distance'`
`p`	`int`	Polynomial order.	`1`
`kernel`	`str`	Kernel function.	`'triangular'`

Returns:

Type	Description
`float`	MSE-optimal bandwidth.

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(42)
>>> n = 800
>>> x1 = rng.uniform(-1, 1, n)
>>> x2 = rng.uniform(-1, 1, n)
>>> treat = (x1 >= 0).astype(int)
>>> y = 1.5 * treat + 0.5 * x1 + 0.3 * x2 + rng.normal(0, 0.5, n)
>>> df = pd.DataFrame({"y": y, "x1": x1, "x2": x2, "treat": treat})
>>> h = sp.rd2d_bw(df, y="y", x1="x1", x2="x2", treatment="treat")
>>> round(float(h), 3)
0.338

rd2d_plot ¶

rd2d_plot(data: DataFrame, y: str, x1: str, x2: str, treatment: str, boundary: Optional[Callable[..., Any]] = None, result: Optional[CausalResult] = None, plot_type: str = 'scatter', ax: Optional[Any] = None, figsize: Tuple[float, float] = (10, 8)) -> Tuple[Any, Any]

2D boundary RD visualization.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Input dataset.	required
`y`	`str`	Outcome variable name.	required
`x1`	`str`	Running variable names.	required
`x2`	`str`	Running variable names.	required
`treatment`	`str`	Binary treatment indicator.	required
`boundary`	`callable`	Boundary function f(x1) -> x2. None implies x1 = 0.	`None`
`result`	`CausalResult`	Result from `rd2d()`, used for bandwidth and effect info.	`None`
`plot_type`	`str`	`'scatter'`: 2D scatter of (x1, x2) colored by treatment status, with boundary curve and optional bandwidth region. `'heatmap'`: outcome values displayed as a heatmap with boundary overlay. `'boundary_effects'`: treatment effect estimates along the boundary (requires `result` with multiple eval points).	`'scatter'`
`ax`	`matplotlib Axes`	Pre-existing axes to draw on.	`None`
`figsize`	`tuple`	Figure size.	`(10, 8)`

Returns:

Type	Description
`(fig, ax)`

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(42)
>>> n = 800
>>> x1 = rng.uniform(-1, 1, n)
>>> x2 = rng.uniform(-1, 1, n)
>>> treat = (x1 >= 0).astype(int)
>>> y = 1.5 * treat + 0.5 * x1 + 0.3 * x2 + rng.normal(0, 0.5, n)
>>> df = pd.DataFrame({"y": y, "x1": x1, "x2": x2, "treat": treat})
>>> res = sp.rd2d(df, y="y", x1="x1", x2="x2", treatment="treat")
>>> fig, ax = sp.rd2d_plot(
...     df, y="y", x1="x1", x2="x2", treatment="treat", result=res
... )

rd_forest ¶

rd_forest(data: DataFrame, y: str, x: str, c: float = 0, covs: Optional[List[str]] = None, h: Optional[float] = None, n_trees: int = 500, min_leaf: int = 20, honesty: bool = True, alpha: float = 0.05, seed: int = 42) -> CausalResult

Causal Forest for RD — heterogeneous treatment effect estimation.

Adapts the Athey-Wager (2019) generalized random forests framework to an RD context. Within bandwidth h, treatment is D = 1(X >= c). Two separate random forests estimate E[Y|Z, D=0] and E[Y|Z, D=1], and the CATE is their difference.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Input dataset.	required
`y`	`str`	Outcome variable.	required
`x`	`str`	Running variable.	required
`c`	`float`	RD cutoff.	`0`
`covs`	`list of str`	Covariate names used as features for heterogeneity detection. Must not include the running variable x.	`None`
`h`	`float`	Bandwidth (uses IK-style automatic selection if None).	`None`
`n_trees`	`int`	Number of trees in each forest.	`500`
`min_leaf`	`int`	Minimum leaf size (larger → more regularisation).	`20`
`honesty`	`bool`	Split-sample (honest) estimation: half the data for tree construction, the other half for leaf predictions.	`True`
`alpha`	`float`	Significance level for confidence intervals.	`0.05`
`seed`	`int`	Random seed.	`42`

Returns:

Type	Description
`CausalResult`	estimate = average CATE; detail = DataFrame with per-obs CATE and SE; model_info includes variable importance.

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(42)
>>> n = 800
>>> x = rng.uniform(-1, 1, n)
>>> z1 = rng.normal(size=n)
>>> z2 = rng.normal(size=n)
>>> treat = (x >= 0).astype(float)
>>> y = (1.0 + 0.8 * z1) * treat + 0.5 * x + rng.normal(0, 0.5, n)
>>> df = pd.DataFrame({"y": y, "x": x, "z1": z1, "z2": z2})
>>> res = sp.rd_forest(
...     df, y="y", x="x", c=0, covs=["z1", "z2"],
...     n_trees=50, seed=42,
... )
>>> round(float(res.estimate), 2)
1.2
>>> list(res.model_info["variable_importance"].keys())[0]
'z1'

rd_boost ¶

rd_boost(data: DataFrame, y: str, x: str, c: float = 0, covs: Optional[List[str]] = None, h: Optional[float] = None, n_estimators: int = 200, max_depth: int = 3, learning_rate: float = 0.1, alpha: float = 0.05, seed: int = 42) -> CausalResult

Gradient Boosting for RD — flexible CATE estimation.

Fits separate GBM models on each side of the cutoff and estimates individual-level CATEs as mu_1(z) - mu_0(z). Standard errors are obtained via bootstrap.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Input dataset.	required
`y`	`str`	Outcome variable.	required
`x`	`str`	Running variable.	required
`c`	`float`	RD cutoff.	`0`
`covs`	`list of str`	Covariate names for heterogeneity.	`None`
`h`	`float`	Bandwidth (auto-selected if None).	`None`
`n_estimators`	`int`	Number of boosting rounds.	`200`
`max_depth`	`int`	Maximum tree depth per round.	`3`
`learning_rate`	`float`	Shrinkage factor.	`0.1`
`alpha`	`float`	Significance level.	`0.05`
`seed`	`int`	Random seed.	`42`

Returns:

Type	Description
`CausalResult`	estimate = average CATE; detail = per-obs CATE and bootstrap SE.

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(42)
>>> n = 600
>>> x = rng.uniform(-1, 1, n)
>>> z1 = rng.normal(size=n)
>>> z2 = rng.normal(size=n)
>>> treat = (x >= 0).astype(float)
>>> y = (1.0 + 0.8 * z1) * treat + 0.5 * x + rng.normal(0, 0.5, n)
>>> df = pd.DataFrame({"y": y, "x": x, "z1": z1, "z2": z2})
>>> res = sp.rd_boost(
...     df, y="y", x="x", c=0, covs=["z1", "z2"],
...     n_estimators=50, seed=42,
... )
>>> round(float(res.estimate), 2)
1.17

rd_lasso ¶

rd_lasso(data: DataFrame, y: str, x: str, c: float = 0, covs: Optional[List[str]] = None, h: Optional[float] = None, kernel: str = 'triangular', cv_folds: int = 5, alpha: float = 0.05) -> CausalResult

LASSO-assisted RD via post-double-selection.

Uses LASSO to select relevant covariates from a potentially large set, then runs a local linear RD with the selected covariates. Follows Belloni, Chernozhukov, and Hansen (2014).

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Input dataset.	required
`y`	`str`	Outcome variable.	required
`x`	`str`	Running variable.	required
`c`	`float`	RD cutoff.	`0`
`covs`	`list of str`	Candidate covariates (can be large set).	`None`
`h`	`float`	Bandwidth (auto-selected if None).	`None`
`kernel`	`str`	Kernel for local linear regression ('triangular', 'uniform', 'epanechnikov').	`'triangular'`
`cv_folds`	`int`	Cross-validation folds for LASSO penalty selection.	`5`
`alpha`	`float`	Significance level.	`0.05`

Returns:

Type	Description
`CausalResult`	estimate = RD treatment effect with LASSO-selected covariates; model_info includes selected_covariates and LASSO paths.

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(42)
>>> n = 800
>>> x = rng.uniform(-1, 1, n)
>>> treat = (x >= 0).astype(float)
>>> Z = rng.normal(size=(n, 8))
>>> y = (1.0 * treat + 0.5 * x + 0.7 * Z[:, 0]
...      + rng.normal(0, 0.5, n))
>>> covs = [f"z{i}" for i in range(8)]
>>> df = pd.DataFrame({"y": y, "x": x})
>>> for i, name in enumerate(covs):
...     df[name] = Z[:, i]
>>> res = sp.rd_lasso(df, y="y", x="x", c=0, covs=covs)
>>> round(float(res.estimate), 2)
1.16
>>> res.model_info["n_selected"]
1

rd_cate_summary ¶

rd_cate_summary(data: DataFrame, y: str, x: str, c: float = 0, covs: Optional[List[str]] = None, h: Optional[float] = None, methods: Optional[List[str]] = None, alpha: float = 0.05, seed: int = 42) -> dict

Run multiple ML-RD methods and compare CATE estimates.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Input dataset.	required
`y`	`str`	Outcome variable.	required
`x`	`str`	Running variable.	required
`c`	`float`	RD cutoff.	`0`
`covs`	`list of str`	Covariates for heterogeneity / selection.	`None`
`h`	`float`	Bandwidth (shared across methods).	`None`
`methods`	`list of str`	Subset of `['forest', 'boost', 'lasso']`. Default: all three.	`None`
`alpha`	`float`	Significance level.	`0.05`
`seed`	`int`	Random seed.	`42`

Returns:

Type	Description
`dict`	Keys = method names, values = CausalResult. Additional key `'comparison'` = pd.DataFrame summarising ATEs. Additional key `'heterogeneity_drivers'` = top variable importances from the forest (if run).

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(42)
>>> n = 600
>>> x = rng.uniform(-1, 1, n)
>>> z1 = rng.normal(size=n)
>>> z2 = rng.normal(size=n)
>>> treat = (x >= 0).astype(float)
>>> y = (1.0 + 0.8 * z1) * treat + 0.5 * x + rng.normal(0, 0.5, n)
>>> df = pd.DataFrame({"y": y, "x": x, "z1": z1, "z2": z2})
>>> out = sp.rd_cate_summary(
...     df, y="y", x="x", c=0, covs=["z1", "z2"],
...     methods=["forest", "lasso"], seed=42,
... )
>>> round(float(out["forest"].estimate), 2)
1.16

rd_extrapolate ¶

rd_extrapolate(data: DataFrame, y: str, x: str, c: float = 0, covs: Optional[List[str]] = None, treatment: Optional[str] = None, eval_points: Optional[ndarray] = None, n_eval: int = 20, method: str = 'ols', h_local: Optional[float] = None, alpha: float = 0.05) -> CausalResult

Angrist-Rokkanen (2015) extrapolation of RD effects away from the cutoff.

If Y(0), Y(1) are independent of X conditional on covariates Z, then treatment effects can be estimated at any value of the running variable, not just at the cutoff.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Input dataset.	required
`y`	`str`	Outcome variable name.	required
`x`	`str`	Running variable name.	required
`c`	`float`	RD cutoff value.	`0`
`covs`	`list of str`	Covariate names for conditional independence. Required.	`None`
`treatment`	`str`	Treatment variable for fuzzy RD. If None, sharp design assumed (D = 1{X >= c}).	`None`
`eval_points`	`ndarray`	Running variable values at which to extrapolate the CATE. If None, `n_eval` equally spaced points spanning the data range.	`None`
`n_eval`	`int`	Number of evaluation points when `eval_points` is None.	`20`
`method`	`str`	Estimation method: `'ols'`, `'ipw'`, or `'doubly_robust'`.	`'ols'`
`h_local`	`float`	Bandwidth for local RD estimate at the cutoff (for comparison). If None, the MSE-optimal bandwidth from `rdrobust` is used.	`None`
`alpha`	`float`	Significance level.	`0.05`

Returns:

Type	Description
`CausalResult`	`estimate` is the average treatment effect across evaluation points. `detail` is a DataFrame with columns `[x_value, cate, se, ci_lower, ci_upper]`. `model_info` contains conditional independence test results and the local RD estimate for comparison.

Notes

The key identifying assumption is conditional independence: Y(0), Y(1) ⊥ X | Z. A partial F-test is run on each side of the cutoff to check whether X retains predictive power for Y after conditioning on Z. A rejection suggests the assumption may fail.

References

Angrist, J.D. and Rokkanen, M. (2015). "Wanna Get Away? Regression Discontinuity Estimation of Exam School Effects Away from the Cutoff." JASA, 110(512), 1331-1344. [@angrist2015wanna]

Examples:

>>> import statspai as sp
>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(42)
>>> n = 2000
>>> Z = rng.normal(0, 1, n)
>>> X = Z + rng.normal(0, 0.5, n)
>>> D = (X >= 0).astype(int)
>>> Y = 1.0 + 2.0 * Z + 3.0 * D + rng.normal(0, 0.5, n)
>>> df = pd.DataFrame({'y': Y, 'x': X, 'z': Z})
>>> result = sp.rd_extrapolate(df, y='y', x='x', c=0, covs=['z'])
>>> bool(abs(result.estimate - 3.0) < 1.0)
True

rd_multi_extrapolate ¶

rd_multi_extrapolate(data: DataFrame, y: str, x: str, cutoffs: List[float], eval_points: Optional[ndarray] = None, method: str = 'linear', alpha: float = 0.05) -> CausalResult

Multi-cutoff RD extrapolation (Cattaneo, Keele, Titiunik, Vazquez-Bare 2021).

Estimates local RD effects at each cutoff, then interpolates/extrapolates a treatment effect function tau(x) through those point estimates.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Input dataset.	required
`y`	`str`	Outcome variable name.	required
`x`	`str`	Running variable name.	required
`cutoffs`	`list of float`	Cutoff values. Must contain at least 2 cutoffs.	required
`eval_points`	`ndarray`	Running variable values at which to predict tau(x). Defaults to 30 equally spaced points spanning the data range.	`None`
`method`	`str`	Interpolation method: `'linear'`: tau(x) = a + b*x `'polynomial'`: polynomial of degree min(len(cutoffs)-1, 3) `'weighted'`: inverse-variance weighted local linear	`'linear'`
`alpha`	`float`	Significance level.	`0.05`

Returns:

Type	Description
`CausalResult`	`estimate` is the average extrapolated effect. `detail` is a DataFrame with columns `[x_value, cate_extrapolated, se, ci_lower, ci_upper]`. `model_info` contains cutoff-specific estimates and heterogeneity test results.

References

Cattaneo, M.D., Keele, L., Titiunik, R. and Vazquez-Bare, G. (2021). "Extrapolating Treatment Effects in Multi-Cutoff Regression Discontinuity Designs." Journal of the American Statistical Association, 116(536), 1941-1952. doi:10.1080/01621459.2020.1751646 [@cattaneo2021extrapolating]

Examples:

>>> import statspai as sp
>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(42)
>>> n = 3000
>>> X = rng.uniform(-2, 4, n)
>>> tau_true = 2.0 + 0.5 * X
>>> D = ((X >= 1) | (X >= 3)).astype(int)
>>> Y = 0.5 * X + tau_true * D + rng.normal(0, 0.5, n)
>>> df = pd.DataFrame({'y': Y, 'x': X})
>>> result = sp.rd_multi_extrapolate(df, y='y', x='x', cutoffs=[1.0, 3.0])
>>> bool(result.estimate > 0)
True

rd_external_validity ¶

rd_external_validity(data: DataFrame, y: str, x: str, c: float = 0, covs: Optional[List[str]] = None, target_x_range: Optional[tuple] = None, alpha: float = 0.05) -> Dict[str, Any]

Diagnostic assessment of RD external validity.

Compares covariate distributions between the cutoff neighborhood and a target population, runs the conditional independence test, and provides a recommendation on whether extrapolation is credible.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Input dataset.	required
`y`	`str`	Outcome variable name.	required
`x`	`str`	Running variable name.	required
`c`	`float`	RD cutoff value.	`0`
`covs`	`list of str`	Covariate names for overlap and CI testing.	`None`
`target_x_range`	`tuple of (float, float)`	Running variable range `(x_low, x_high)` defining the target population. Defaults to the full data range.	`None`
`alpha`	`float`	Significance level.	`0.05`

Returns:

Type	Description
`dict`	Keys: `'local_estimate'`: local RD effect at the cutoff (CausalResult). `'ci_test'`: conditional independence test results (dict or None). `'overlap'`: covariate overlap statistics (dict or None). `'extrapolated_estimate'`: extrapolated ATE for the target population if CI test passes (CausalResult or None). `'recommendation'`: human-readable guidance string.

Examples:

>>> import statspai as sp
>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(42)
>>> n = 2000
>>> Z = rng.normal(0, 1, n)
>>> X = Z + rng.normal(0, 0.5, n)
>>> D = (X >= 0).astype(int)
>>> Y = 1.0 + 2.0 * Z + 3.0 * D + rng.normal(0, 0.5, n)
>>> df = pd.DataFrame({'y': Y, 'x': X, 'z': Z})
>>> diag = sp.rd_external_validity(df, y='y', x='x', c=0, covs=['z'])
>>> bool('recommendation' in diag)
True

rd_interference ¶

rd_interference(data: DataFrame, y: str, running: str, neighbour_running: str, cutoff: float = 0.0, bandwidth: Optional[float] = None, kernel: str = 'triangular', alpha: float = 0.05) -> RDInterferenceResult

Sharp RDD with network interference (Cabrelli-Marconi 2024).

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`		required
`y`	`str`	Outcome.	required
`running`	`str`	Own running variable.	required
`neighbour_running`	`str`	Average running variable across neighbours (precomputed).	required
`cutoff`	`float`		`0.0`
`bandwidth`	`float`	Defaults to IQR of own running variable.	`None`
`kernel`	`str`		`'triangular'`
`alpha`	`float`		`0.05`

Returns:

Type	Description
`RDInterferenceResult`

Examples:

>>> import statspai as sp
>>> import numpy as np
>>> import pandas as pd
>>> rng = np.random.default_rng(1)
>>> n = 600
>>> r = rng.uniform(-1, 1, size=n)    # own running variable
>>> rn = rng.uniform(-1, 1, size=n)   # avg neighbour running variable
>>> y = (0.5 * (r >= 0) + 0.3 * (rn >= 0)
...      + 0.2 * r + 0.1 * rn + rng.normal(scale=0.3, size=n))
>>> df = pd.DataFrame({"y": y, "run": r, "nbr_run": rn})
>>> res = sp.rd_interference(df, y="y", running="run",
...                          neighbour_running="nbr_run")
>>> type(res).__name__
'RDInterferenceResult'
>>> res.n_obs
600
>>> bool(np.isfinite(res.direct_effect))
True

rd_multi_score ¶

rd_multi_score(data: DataFrame, y: str, running_vars: List[str], cutoffs: List[float], bandwidth: Optional[float] = None, kernel: str = 'triangular', alpha: float = 0.05) -> MultiScoreRDResult

Multi-score RDD: treatment if all running variables exceed cutoffs.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`		required
`y`	`str`	Outcome.	required
`running_vars`	`list of str`	Multiple running variables.	required
`cutoffs`	`list of float`	One cutoff per running variable (same length).	required
`bandwidth`	`float`	Defaults to median IQR across running vars.	`None`
`kernel`	`str`		`'triangular'`
`alpha`	`float`		`0.05`

Returns:

Type	Description
`MultiScoreRDResult`

Examples:

>>> import statspai as sp
>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(0)
>>> n = 2000
>>> sat_math = rng.uniform(400, 800, n)
>>> sat_verbal = rng.uniform(400, 800, n)
>>> treat = ((sat_math >= 600) & (sat_verbal >= 600)).astype(int)
>>> y = (10.0 + 0.01 * sat_math + 0.01 * sat_verbal
...      + 5.0 * treat + rng.normal(0, 2, n))
>>> df = pd.DataFrame({"y": y, "math": sat_math, "verbal": sat_verbal})
>>> res = sp.rd_multi_score(
...     df, y="y", running_vars=["math", "verbal"],
...     cutoffs=[600.0, 600.0])
>>> res.n_obs
2000
>>> isinstance(res.boundary_effect, float)
True
>>> print(res.summary())

rd_distribution ¶

rd_distribution(data: DataFrame, y: str, running: str, cutoff: float = 0.0, quantiles: Optional[ndarray] = None, bandwidth: Optional[float] = None, kernel: str = 'triangular', alpha: float = 0.05) -> DistRDResult

Distribution-valued sharp RDD.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`		required
`y`	`str`		required
`running`	`str`		required
`cutoff`	`float`		`0.0`
`quantiles`	`array - like`	Defaults to (0.1, 0.25, 0.5, 0.75, 0.9).	`None`
`bandwidth`	`float`		`None`
`kernel`	`str`		`'triangular'`
`alpha`	`float`		`0.05`

Returns:

Type	Description
`DistRDResult`	RDD effect (`qte`) and standard error (`se`) on the conditional CDF of `y` at each requested quantile.

Examples:

>>> import statspai as sp
>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(0)
>>> n = 400
>>> x = rng.uniform(-1, 1, n)
>>> y = 1.0 + 0.5 * x + (x >= 0) * 0.8 + rng.normal(0, 1, n)
>>> data = pd.DataFrame({"y": y, "x": x})
>>> res = sp.rd_distribution(data, y="y", running="x", cutoff=0.0)
>>> res.n_obs
400
>>> bool(res.qte.shape == res.se.shape == res.quantiles.shape)
True

rd_bayes_hte ¶

rd_bayes_hte(data: DataFrame, y: str, running: str, covariates: List[str], cutoff: float = 0.0, bandwidth: Optional[float] = None, kernel: str = 'triangular', alpha: float = 0.05, n_draws: int = 2000, seed: int = 0) -> BayesRDHTEResult

Bayesian RDD allowing CATE to depend on covariates.

Parameters:

Name	Type	Default
`data`	`DataFrame`	required
`y`	`str`	required
`running`	`str`	required
`covariates`	`list of str`	required
`cutoff`	`float`	`0.0`
`bandwidth`	`float`	`None`
`kernel`	`str`	`'triangular'`
`alpha`	`float`	`0.05`
`n_draws`	`int`	`2000`
`seed`	`int`	`0`

Returns:

Type	Description
`BayesRDHTEResult`

Examples:

Sharp RD with a covariate-modulated jump at the cutoff. Without PyMC installed this falls back to the conjugate closed-form posterior:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(0)
>>> n = 400
>>> run = rng.uniform(-1, 1, size=n)
>>> x = rng.normal(size=n)
>>> treat = (run >= 0).astype(float)
>>> y = (1.0 + 0.5 * run + (0.4 + 0.3 * x) * treat
...      + rng.normal(scale=0.2, size=n))
>>> df = pd.DataFrame({'y': y, 'run': run, 'x': x})
>>> res = sp.rd_bayes_hte(df, y='y', running='run', covariates=['x'],
...                       cutoff=0.0, n_draws=500, seed=0)
>>> res.n_obs
400
>>> len(res.cate)
400
>>> bool(np.isfinite(res.posterior_mean))
True
>>> lo, hi = res.posterior_ci
>>> bool(lo <= hi)
True
>>> print(res.summary())

rd_distributional_design ¶

rd_distributional_design(data: DataFrame, y: str, running: str, cutoff: float = 0.0, quantiles: Optional[ndarray] = None, bandwidth: Optional[float] = None, kernel: str = 'triangular') -> DDDResult

Joint RDD + RKD on the conditional distribution of Y.

Parameters:

Name	Type	Default
`data`	`DataFrame`	required
`y`	`str`	required
`running`	`str`	required
`cutoff`	`float`	`0.0`
`quantiles`	`array - like`	`None`
`bandwidth`	`float`	`None`
`kernel`	`str`	`'triangular'`

Returns:

Type	Description
`DDDResult`	Quantile-by-quantile level jump (`rdd_effect`) and slope jump (`rkd_effect`) in the conditional CDF of `y` at the cutoff.

Examples:

>>> import statspai as sp
>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(0)
>>> n = 400
>>> x = rng.uniform(-1, 1, n)
>>> y = 1.0 + 0.5 * x + (x >= 0) * 0.8 + rng.normal(0, 1, n)
>>> data = pd.DataFrame({"y": y, "x": x})
>>> res = sp.rd_distributional_design(data, y="y", running="x", cutoff=0.0)
>>> res.n_obs
400
>>> len(res.rdd_effect)
5

rd_bias_aware_fuzzy ¶

rd_bias_aware_fuzzy(data: DataFrame, y: str, x: str, fuzzy: str, c: float = 0.0, M_y: Optional[float] = None, M_d: Optional[float] = None, h: Optional[float] = None, kernel: str = 'triangular', alpha: float = 0.05, cluster: Optional[str] = None, n_grid: int = 401) -> CausalResult

Bias-aware confidence interval for fuzzy RD (Noack & Rothe 2024).

Constructs an Anderson--Rubin-style CI for the Wald-ratio fuzzy RD parameter that takes worst-case smoothing bias of the numerator and denominator local-linear estimates into account. The CI is robust to weak first stages and avoids the power asymmetry documented in Kaliski-Keane-Neal (2025).

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`		required
`y`	`str`	Outcome column.	required
`x`	`str`	Running variable column.	required
`fuzzy`	`str`	Treatment indicator column for the fuzzy first stage.	required
`c`	`float`	RD cutoff.	`0.0`
`M_y`	`float`	Smoothness bounds on \|g_Y''\| and \|g_D''\|. If `None`, both are estimated from a local-quadratic fit on each side of the cutoff (the same default as :func:`rd_honest`).	`None`
`M_d`	`float`	Smoothness bounds on \|g_Y''\| and \|g_D''\|. If `None`, both are estimated from a local-quadratic fit on each side of the cutoff (the same default as :func:`rd_honest`).	`None`
`h`	`float`	Bandwidth. If `None`, defaults to a Silverman pilot.	`None`
`kernel`	`str`	Kernel.	``'triangular'``
`alpha`	`float`	Significance level.	`0.05`
`cluster`	`str`	Cluster variable for variance estimation.	`None`
`n_grid`	`int`	Grid resolution for the AR-style test inversion.	`401`

Returns:

Type	Description
`CausalResult`	Result with `model_info['bias_aware']` containing `M_y`, `M_d`, `naive_ci` (standard CCT-style fuzzy CI), `bias_aware_ci`, `rejection_grid` and `first_stage_F`.

Notes

The bias-aware CI is the set of τ_0 for which the test statistic

T(τ_0) = (Δ̂_Y − τ_0 · Δ̂_D) / σ̂(τ_0)

has |T(τ_0)| ≤ cv(b(τ_0)) where b(τ_0) is the worst-case bias- to-noise ratio of the numerator–denominator combination. The grid inversion accommodates non-convex CIs that arise under weak first stages.

Examples:

>>> import statspai as sp
>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(0)
>>> n = 600
>>> age = rng.uniform(60, 70, n)
>>> # Take-up jumps sharply at the cutoff -> strong first stage:
>>> p = 0.05 + 0.9 * (age >= 65)
>>> retired = (rng.uniform(size=n) < np.clip(p, 0, 1)).astype(float)
>>> earnings = 20 - 4.0 * retired + 0.1 * (age - 65) + rng.normal(0, 1.0, n)
>>> df = pd.DataFrame({'age': age, 'earnings': earnings, 'retired': retired})
>>> r = sp.rd_bias_aware_fuzzy(df, y='earnings', x='age', fuzzy='retired',
...                            c=65.0, n_grid=101)
>>> lo, hi = r.model_info['bias_aware']['bias_aware_ci']
>>> bool(lo < hi)
True

rd_dashboard ¶

rd_dashboard(data: DataFrame, y: str, x: str, c: float = 0.0, covs: Optional[List[str]] = None, fuzzy: Optional[str] = None, bw_grid: Optional[Sequence[float]] = None, h: Optional[float] = None, figsize: Tuple[float, float] = (12, 9), title: Optional[str] = None, save: Optional[str] = None) -> Tuple[Any, Any]

Four-panel RD diagnostic dashboard.

Combines the four checks every RD analysis should report:

RD plot (top-left): IMSE-binned scatter with polynomial fit.
Density discontinuity (top-right): :func:rdplotdensity output for a McCrary-style manipulation test.
Covariate balance (bottom-left): mean of each cov on each side of the cutoff with point ranges.
Bandwidth sensitivity (bottom-right): point estimate and CI across a grid of bandwidths.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`		required
`y`	`str`		required
`x`	`str`		required
`c`	`float`	Cutoff.	`0.0`
`covs`	`list of str`	Covariates for the balance panel. If None, the panel shows the density's binomial near-cutoff test instead.	`None`
`fuzzy`	`str`	Fuzzy treatment column (passed through to RD plot/sensitivity).	`None`
`bw_grid`	`sequence of float`	Bandwidths to evaluate in the sensitivity panel. If None, uses `[0.5, 0.75, 1.0, 1.25, 1.5, 2.0] × h_mse`.	`None`
`h`	`float`	Reference bandwidth used for plotting and as the basis for `bw_grid`. If None, MSE-optimal bandwidth from rdrobust.	`None`
`figsize`	`tuple`		`(12, 9)`
`title`	`str`	Suptitle.	`None`
`save`	`str`	If a path is given, also save the figure (extension determines format).	`None`

Returns:

Type	Description
`(fig, axes) where axes is a 2x2 numpy array.`

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(42)
>>> n = 500
>>> x = rng.uniform(-1, 1, n)
>>> y = 0.8 * (x >= 0) + 0.5 * x + rng.normal(0, 0.3, n)
>>> df = pd.DataFrame({"x": x, "y": y})
>>> fig, axes = sp.rd_dashboard(df, y="y", x="x", c=0.0)
>>> axes.shape
(2, 2)
>>> fig.savefig("rd_dashboard.png")

rd_compare ¶

rd_compare(data: DataFrame, y: str, x: str, c: float = 0.0, methods: Sequence[str] = _DEFAULT_COMPARE_METHODS, fuzzy: Optional[str] = None, method_kwargs: Optional[Dict[str, Dict[str, Any]]] = None, alpha: float = 0.05) -> DataFrame

Compare multiple RD estimators on the same data.

Returns a tidy DataFrame: one row per method with point estimate, SE, p-value and CI. Useful for robustness tables across estimation families (local-polynomial vs honest vs local-randomisation vs flexible-covariate-adjusted).

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`		required
`y`	`str`		required
`x`	`str`		required
`c`	`float`		`0.0`
`methods`	`sequence of str`	Method aliases recognised by the :data:`sp.rd._RD_METHOD_ALIASES` dispatcher. Default: `('rdrobust', 'honest', 'randinf')`.	`_DEFAULT_COMPARE_METHODS`
`fuzzy`	`str`	Fuzzy treatment column passed through to all methods that accept it.	`None`
`method_kwargs`	`dict of dict`	Per-method extra kwargs, e.g. `{'rdrobust': {'kernel': 'uniform'}}`.	`None`
`alpha`	`float`	Confidence level forwarded to each estimator that accepts `alpha`.	`0.05`

Returns:

Type	Description
`DataFrame`	Columns: `method`, `estimate`, `se`, `pvalue`, `ci_lower`, `ci_upper`, `n_obs`, `status`.

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(42)
>>> n = 500
>>> x = rng.uniform(-1, 1, n)
>>> y = 0.8 * (x >= 0) + 0.5 * x + rng.normal(0, 0.3, n)
>>> df = pd.DataFrame({"x": x, "y": y})
>>> tab = sp.rd_compare(df, y="y", x="x", c=0.0,
...                     methods=["rdrobust"])
>>> tab.shape
(1, 8)
>>> tab.loc[0, "method"], tab.loc[0, "status"]
('rdrobust', 'ok')
>>> round(float(tab.loc[0, "estimate"]), 3)
0.813

rd_robustness_table ¶

rd_robustness_table(data: DataFrame, y: str, x: str, c: float = 0.0, fuzzy: Optional[str] = None, kernels: Sequence[str] = ('triangular', 'epanechnikov', 'uniform'), bwselects: Sequence[str] = ('mserd', 'cerrd', 'msetwo'), polynomials: Sequence[int] = (1, 2), donuts: Sequence[float] = (0.0,), covs: Optional[List[str]] = None, cluster: Optional[str] = None, alpha: float = 0.05) -> DataFrame

Sweep over (kernel, bwselect, polynomial, donut) and return a robustness table for reporting.

Each row is one specification. The DataFrame contains both "Conventional" and "Robust" point estimates and CIs from :func:rdrobust, plus the bandwidth used. Suitable for direct .to_latex(...) / .to_excel(...) export, or feed into :func:statspai.outreg2 for native multi-column tables.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`		required
`y`	`str`		required
`x`	`str`		required
`c`	`float`		`0.0`
`fuzzy`	`optional, forwarded to rdrobust`		`None`
`covs`	`optional, forwarded to rdrobust`		`None`
`cluster`	`optional, forwarded to rdrobust`		`None`
`kernels`	`sequences`	Specification grid.	`('triangular', 'epanechnikov', 'uniform')`
`bwselects`	`sequences`	Specification grid.	`('triangular', 'epanechnikov', 'uniform')`
`polynomials`	`sequences`	Specification grid.	`('triangular', 'epanechnikov', 'uniform')`
`donuts`	`sequences`	Specification grid.	`('triangular', 'epanechnikov', 'uniform')`
`alpha`	`float`		`0.05`

Returns:

Type	Description
`pd.DataFrame with one row per specification and columns:`	kernel, bwselect, p, donut, h, b, estimate_conv, se_conv, ci_conv_lo, ci_conv_hi, pvalue_conv, estimate_rbc, se_rbc, ci_rbc_lo, ci_rbc_hi, pvalue_rbc, n_left, n_right, status

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(42)
>>> n = 500
>>> x = rng.uniform(-1, 1, n)
>>> y = 0.8 * (x >= 0) + 0.5 * x + rng.normal(0, 0.3, n)
>>> df = pd.DataFrame({"x": x, "y": y})
>>> tab = sp.rd_robustness_table(
...     df, y="y", x="x", c=0.0,
...     kernels=["triangular", "uniform"],
...     bwselects=["mserd"], polynomials=[1],
... )
>>> tab.shape
(2, 19)
>>> tab["kernel"].tolist()
['triangular', 'uniform']
>>> round(float(tab.loc[0, "estimate_conv"]), 3)
0.765

multi_cutoff_rd ¶

multi_cutoff_rd(*args: Any, **kwargs: Any) -> RDMultiResult

User-friendly alias for :func:sp.rdmc (multi-cutoff RD).

See :func:statspai.rd.rdmulti.rdmc for full documentation.

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(42)
>>> n = 600
>>> x = rng.uniform(-1, 1, n)
>>> y = 0.8 * (x >= 0) + 0.5 * x + rng.normal(0, 0.3, n)
>>> df = pd.DataFrame({"x": x, "y": y})
>>> res = sp.multi_cutoff_rd(
...     df, y="y", x="x", cutoffs=[-0.4, 0.0, 0.4]
... )
>>> res.n_cutoffs
3
>>> len(res.cutoff_results)
3
>>> round(float(res.pooled_estimate), 3)
0.258

geographic_rd ¶

geographic_rd(*args: Any, **kwargs: Any) -> CausalResult

User-friendly alias for :func:sp.rdms (multi-score RD).

Geographic RD is the most common multi-score special case: the running score is a signed distance to a political boundary. Cattaneo et al. (2024) dispatch it via rdms; this alias makes the intent explicit.

See :func:statspai.rd.rdmulti.rdms for the full signature.

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(0)
>>> n = 2000
>>> x1 = rng.uniform(-1, 1, n)
>>> x2 = rng.uniform(-1, 1, n)
>>> treat = ((x1 >= 0) & (x2 >= 0)).astype(float)
>>> y = 0.8 * treat + 0.5 * x1 + 0.3 * x2 + rng.normal(0, 0.3, n)
>>> df = pd.DataFrame({"x1": x1, "x2": x2, "y": y})
>>> res = sp.geographic_rd(df, y="y", x1="x1", x2="x2", bandwidth=0.5)
>>> type(res).__name__
'CausalResult'
>>> bool(hasattr(res, "estimate"))
True

References

keele2015geographic

boundary_rd ¶

boundary_rd(*args: Any, **kwargs: Any) -> CausalResult

User-friendly alias for :func:sp.rd2d (boundary discontinuity design).

Cattaneo, Titiunik & Yu (2025) boundary discontinuity design for 2D running variables (lat/long, for example).

See :func:statspai.rd.rd2d.rd2d for the full signature.

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(0)
>>> n = 1500
>>> x1 = rng.uniform(-1, 1, n)
>>> x2 = rng.uniform(-1, 1, n)
>>> treat = (x1 >= 0).astype(int)
>>> y = 0.7 * treat + 0.4 * x1 + 0.2 * x2 + rng.normal(0, 0.3, n)
>>> df = pd.DataFrame({"x1": x1, "x2": x2, "treat": treat, "y": y})
>>> res = sp.boundary_rd(df, y="y", x1="x1", x2="x2", treatment="treat")
>>> type(res).__name__
'CausalResult'
>>> bool(hasattr(res, "estimate"))
True

References

cattaneo2025boundary

multi_score_rd ¶

multi_score_rd(*args: Any, **kwargs: Any) -> MultiScoreRDResult

User-friendly alias for :func:sp.rd_multi_score.

Multi-score RD when eligibility depends on more than one discontinuous rule (e.g. income AND age thresholds). Separate from boundary RDD in that the rules are axis-aligned.

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(42)
>>> n = 800
>>> x1 = rng.uniform(-1, 1, n)
>>> x2 = rng.uniform(-1, 1, n)
>>> treat = (x1 >= 0) & (x2 >= 0)
>>> y = (0.8 * treat + 0.5 * x1 + 0.3 * x2
...      + rng.normal(0, 0.3, n))
>>> df = pd.DataFrame({"x1": x1, "x2": x2, "y": y})
>>> res = sp.multi_score_rd(
...     df, y="y", running_vars=["x1", "x2"], cutoffs=[0.0, 0.0]
... )
>>> res.n_obs
800
>>> round(float(res.boundary_effect), 3)
0.137

fit ¶

fit(data: Any = None, y: Any = None, x: Any = None, c: Any = 0, *, method: str = 'rdrobust', **kwargs: Any) -> Any

Alias for :func:_rd_dispatch. See sp.rd.__doc__ for usage.

statspai.rd¶

rd ¶

RDMultiResult ¶

plot ¶

RDInterferenceResult dataclass ¶

MultiScoreRDResult dataclass ¶

DistRDResult dataclass ¶

BayesRDHTEResult dataclass ¶

DDDResult dataclass ¶

rdplot ¶

rdplotdensity ¶

rdbwselect ¶

rdbwsensitivity ¶

rdbalance ¶

rdplacebo ¶

rdsummary ¶

rd_honest ¶

rdmc ¶

rdms ¶

rdsampsi ¶

rdrandinf ¶

rdwinselect ¶

rdsensitivity ¶

rdrbounds ¶

rdhte ¶

rdbwhte ¶

rdhte_lincom ¶

rd2d_bw ¶

rd2d_plot ¶

rd_forest ¶

rd_boost ¶

rd_lasso ¶

rd_cate_summary ¶

rd_extrapolate ¶

rd_multi_extrapolate ¶

rd_external_validity ¶

rd_interference ¶

rd_multi_score ¶

rd_distribution ¶

rd_bayes_hte ¶

rd_distributional_design ¶

rd_bias_aware_fuzzy ¶

rd_dashboard ¶

rd_compare ¶

rd_robustness_table ¶

multi_cutoff_rd ¶

geographic_rd ¶

boundary_rd ¶

multi_score_rd ¶

fit ¶

`statspai.rd`¶

RDInterferenceResult `dataclass` ¶

MultiScoreRDResult `dataclass` ¶

DistRDResult `dataclass` ¶

BayesRDHTEResult `dataclass` ¶

DDDResult `dataclass` ¶