`statspai.matching`¶

matching ¶

Matching module for StatsPAI.

Unified interface for matching estimators:

Nearest-neighbor matching (propensity score, Mahalanobis, Euclidean)
Exact matching
Coarsened Exact Matching (CEM)
Propensity score stratification / subclassification
Abadie-Imbens (2011) bias correction
Entropy balancing (Hainmueller 2012)
Covariate Balancing Propensity Score (Imai-Ratkovic 2014)
Genetic Matching (Diamond-Sekhon 2013)
Stable Balancing Weights (Zubizarreta 2015)
Optimal pair / full / cardinality matching (Rosenbaum 1989, 2012)
Overlap weights (Li-Morgan-Zaslavsky 2018)

The single entry point is `match()`, a method-aware dispatcher that routes `method=` to the correct estimator. The standalone functions (`ebalance`, `cbps`, `genmatch`, `sbw`, `optimal_match`, `cardinality_match`, `overlap_weights`) remain available for users who need estimator-specific parameters.

References

Rosenbaum, P.R. and Rubin, D.B. (1983). Biometrika, 70(1), 41-55.
Abadie, A. and Imbens, G.W. (2006). Econometrica, 74(1), 235-267.
Abadie, A. and Imbens, G.W. (2011). JBES, 29(1), 1-11.
Iacus, S.M., King, G., and Porro, G. (2012). Political Analysis, 20(1), 1-24.
Hainmueller, J. (2012). Political Analysis, 20(1), 25-46.
Imai, K. and Ratkovic, M. (2014). JRSS-B, 76(1), 243-263.
Diamond, A. and Sekhon, J.S. (2013). REStat, 95(3), 932-945.
Zubizarreta, J.R. (2015). JASA, 110(511), 910-922.
Li, F., Morgan, K.L., and Zaslavsky, A.M. (2018). JASA, 113(521), 390-400.
Rosenbaum, P.R. (2012). JASA, 107(498), 691-700.
Cunningham, S. (2021). Causal Inference: The Mixtape. Yale University Press.
[@rosenbaum1983central]

MatchEstimator ¶

Unified matching estimator supporting multiple distance × method combinations.

This is the object-oriented backend behind `sp.match()`. Most users should call `sp.match()`; construct `MatchEstimator` directly only when you want to hold the configured estimator and call `.fit()` yourself. `.fit()` returns a `CausalResult`.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Input data.	required
`y`	`str`	Outcome column.	required
`treat`	`str`	Binary (0/1) treatment column.	required
`covariates`	`list of str`	Variables to match on.	required
`distance`	`str`	`'propensity'`, `'mahalanobis'`, `'euclidean'` or `'exact'`.	`None`
`method`	`str`	`'nearest'`, `'stratify'` or `'cem'` (legacy `'psm'` / `'mahalanobis'` are also accepted).	`'nearest'`
`estimand`	`str`	`'ATT'` or `'ATE'`.	`'ATT'`

Examples:

>>> import statspai as sp
>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(0)
>>> n = 200
>>> age = rng.normal(40, 8, n)
>>> edu = rng.normal(12, 2, n)
>>> ps = 1 / (1 + np.exp(-(0.05 * (age - 40) + 0.1 * (edu - 12))))
>>> training = rng.binomial(1, ps)
>>> wage = 20 + 0.3 * age + 0.5 * edu + 4.0 * training + rng.normal(0, 3, n)
>>> df = pd.DataFrame({"wage": wage, "training": training,
...                    "age": age, "edu": edu})
>>> est = sp.MatchEstimator(df, y="wage", treat="training",
...                         covariates=["age", "edu"], distance="propensity")
>>> result = est.fit()
>>> type(result).__name__
'CausalResult'
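To make the "distance × method" idea concrete, the following is a minimal sketch of 1:1 nearest-neighbour matching with replacement under a Mahalanobis distance. It is not StatsPAI's implementation: the helper `nn_att_mahalanobis` is hypothetical, and the choices shown (full-sample covariance, matching with replacement, no caliper or bias correction) are assumptions made for illustration.

```python
import numpy as np

def nn_att_mahalanobis(X, d, y):
    """1:1 nearest-neighbour ATT with replacement (illustrative sketch only)."""
    X = np.asarray(X, dtype=float)
    d = np.asarray(d).astype(bool)
    y = np.asarray(y, dtype=float)
    # The inverse sample covariance defines the Mahalanobis metric
    # (assumption: covariance taken over the full sample).
    VI = np.linalg.inv(np.cov(X, rowvar=False))
    Xt, Xc = X[d], X[~d]
    diff = Xt[:, None, :] - Xc[None, :, :]               # shape (n_t, n_c, k)
    dist2 = np.einsum("ijk,kl,ijl->ij", diff, VI, diff)  # squared distances
    nearest = dist2.argmin(axis=1)                       # closest control per treated unit
    # ATT = mean over treated units of (own outcome - matched control outcome)
    return float(np.mean(y[d] - y[~d][nearest]))

rng = np.random.default_rng(0)
Xc = rng.normal(size=(50, 2))
Xt = Xc[:20].copy()              # each treated unit has an exact twin among the controls
X = np.vstack([Xt, Xc])
d = np.r_[np.ones(20), np.zeros(50)]
y = 1.0 + 2.0 * d + X.sum(axis=1)  # noiseless outcome, true effect = 2
att = nn_att_mahalanobis(X, d, y)
```

Because every treated unit has an exact covariate twin and the outcome is noiseless, the matched difference recovers the effect of 2 up to floating-point error. Swapping `VI` for the identity matrix would give Euclidean matching.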

fit ¶

fit() -> CausalResult

Fit matching estimator and return results.

PSBalanceResult ¶

Bases: ResultProtocolMixin

Container for propensity score balance diagnostics.

Attributes:

Name	Type	Description
`table`	`DataFrame`	Balance statistics per covariate: mean_treat, mean_control, smd_raw, smd_weighted, variance_ratio, ks_stat.
`ps`	`Series`	Estimated propensity scores.

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(42)
>>> n = 300
>>> x1 = rng.normal(size=n)
>>> x2 = rng.normal(size=n)
>>> p = 1.0 / (1.0 + np.exp(-(0.5 * x1 - 0.5 * x2 - 0.5)))
>>> d = rng.binomial(1, p)
>>> df = pd.DataFrame({'d': d, 'x1': x1, 'x2': x2})
>>> bal = sp.ps_balance(df, treatment='d', covariates=['x1', 'x2'])
>>> isinstance(bal, sp.PSBalanceResult)
True
>>> bal.table['smd_weighted'].round(2).tolist()
[0.02, -0.06]

summary ¶

summary() -> str

Formatted balance summary table.

love_plot ¶

love_plot(threshold: float = 0.1, **kwargs: Any) -> Tuple[Any, Any]

Convenience method: calls love_plot() from balance data.

BalanceDiagnosticsResult ¶

Bases: ResultProtocolMixin

Container for raw/weighted matching balance diagnostics.

Returned by `sp.balance_diagnostics`. Holds a per-covariate table (raw vs. weighted SMDs, variance ratios, KS stats) and a `summary_stats` dict (max/mean SMDs, imbalance counts, effective sample size, propensity overlap). Call `.summary()` for a report.

Examples:

>>> import statspai as sp
>>> df = sp.cps_wage()
>>> bal = sp.balance_diagnostics(
...     df, treatment='union',
...     covariates=['education', 'experience', 'tenure'])
>>> isinstance(bal, sp.BalanceDiagnosticsResult)
True
>>> bal.summary_stats['n_obs']
3000

OptimalMatchResult `dataclass` ¶

Bases: ResultProtocolMixin

Result of `optimal_match` (optimal 1:1 Hungarian matching).

Attributes:

Name	Type	Description
`pairs`	`DataFrame`	One row per matched pair with columns `treated_idx`, `control_idx`, `distance`.
`distances`	`ndarray`	Matching distance for each matched pair.
`ate`	`float`	Matched-pair average treatment effect on the treated (ATT).
`se`	`float`	Analytic standard error of `ate` over the matched pairs.
`n_treated, n_matched`	`int`	Number of treated units and number of retained matched pairs.

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(42)
>>> n = 300
>>> x1 = rng.normal(size=n)
>>> x2 = rng.normal(size=n)
>>> p = 1.0 / (1.0 + np.exp(-(0.5 * x1 - 0.5 * x2 - 0.5)))
>>> d = rng.binomial(1, p)
>>> y = 1.0 + 2.0 * d + x1 + x2 + rng.normal(size=n)
>>> df = pd.DataFrame({'y': y, 'd': d, 'x1': x1, 'x2': x2})
>>> res = sp.optimal_match(df, treatment='d', outcome='y',
...                        covariates=['x1', 'x2'])
>>> isinstance(res, sp.OptimalMatchResult)
True
>>> res.n_matched
111
>>> round(res.ate, 2)
1.88
>>> res.pairs.columns.tolist()
['treated_idx', 'control_idx', 'distance']
>>> bool(len(res.distances) == res.n_matched)
True

CardinalityMatchResult `dataclass` ¶

Bases: ResultProtocolMixin

Result of `cardinality_match` (Zubizarreta cardinality matching).

Attributes:

Name	Type	Description
`treated_matched, control_matched`	`ndarray`	Row indices (into the cleaned data) of the matched treated and control units making up each pair.
`ate`	`float`	Matched-pair average treatment effect on the treated (ATT).
`se`	`float`	Analytic standard error of `ate` over the matched pairs.
`n_matched_pairs`	`int`	Number of matched pairs retained.
`balance`	`DataFrame`	Post-match balance table with columns `covariate`, `SMD`, `\|SMD\|` (standardised mean differences).

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(42)
>>> n = 300
>>> x1 = rng.normal(size=n)
>>> x2 = rng.normal(size=n)
>>> p = 1.0 / (1.0 + np.exp(-(0.5 * x1 - 0.5 * x2 - 0.5)))
>>> d = rng.binomial(1, p)
>>> y = 1.0 + 2.0 * d + x1 + x2 + rng.normal(size=n)
>>> df = pd.DataFrame({'y': y, 'd': d, 'x1': x1, 'x2': x2})
>>> res = sp.cardinality_match(df, treatment='d', outcome='y',
...                            covariates=['x1', 'x2'],
...                            smd_tolerance=0.1)
>>> isinstance(res, sp.CardinalityMatchResult)
True
>>> res.n_matched_pairs
107
>>> round(res.ate, 2)
1.86
>>> res.balance['|SMD|'].round(3).tolist()
[0.111, 0.082]

GenMatchResult `dataclass` ¶

Bases: ResultProtocolMixin

Output of `sp.genmatch` (Diamond-Sekhon genetic matching).

Holds the ATT estimate and bootstrap SE, the optimal covariate weight vector, the matched control indices, and a pre/post balance table. Call .summary() for a formatted report.

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(42)
>>> n = 300
>>> x1 = rng.normal(size=n)
>>> x2 = rng.normal(size=n)
>>> p = 1.0 / (1.0 + np.exp(-(0.5 * x1 - 0.5 * x2 - 0.5)))
>>> d = rng.binomial(1, p)
>>> y = 1.0 + 2.0 * d + x1 + x2 + rng.normal(size=n)
>>> df = pd.DataFrame({'y': y, 'd': d, 'x1': x1, 'x2': x2})
>>> res = sp.genmatch(df, y='y', treat='d', covariates=['x1', 'x2'],
...                   population_size=10, generations=5)
>>> isinstance(res, sp.GenMatchResult)
True
>>> res.n_treated
111

SBWResult ¶

Bases: CausalResult

Stable balancing weights with a diagnostic panel.

Thin subclass of `CausalResult` that attaches the weight vector, effective sample size, and covariate balance table. Returned by `sbw`.

Examples:

>>> import statspai as sp
>>> df = sp.cps_wage().iloc[:400].copy()
>>> res = sp.sbw(df, treat="union",
...              covariates=["education", "experience", "tenure"],
...              y="log_wage", delta=0.05)
>>> isinstance(res, sp.SBWResult)
True
>>> res.estimand
'ATT'
>>> list(res.balance.columns)
['mean_treated', 'mean_control', 'SMD_before', 'SMD_after']

PSMatch2Result ¶

Bases: ResultProtocolMixin

Container for a sp.psmatch2 run.

Attributes:

Name	Type	Description
`matched_data`	`DataFrame`	The input data plus the psmatch2 columns (`_pscore`, `_treated`, `_support`, `_weight`, `_y`; plus `_n1` …, `_nn`, `_pdif` for nearest-neighbour matching). Also available as `.data`.
`att, se, pvalue, ci`	`float / tuple`	Average treatment effect on the treated and its inference.
`estimand`	`str`	Always `'ATT'` for `psmatch2`.
`result`	`CausalResult`	The underlying `sp.match` result.

Methods:

Name	Description
`matched_sample`	Rows that entered the matched sample (`_weight` not missing).
`balance`	Post-matching covariate balance on the weighted matched sample.
`psplot`	Propensity-score density before/after matching.
`psm_did`	Frequency-weighted PSM-DID regression.

Examples:

>>> import statspai as sp
>>> df = sp.cps_wage()
>>> m = sp.psmatch2(df, outcome='log_wage', treat='union',
...                 covariates=['education', 'experience', 'tenure'])
>>> '_weight' in m.matched_data.columns
True
>>> bal = m.balance()                      # post-matching balance
>>> fig, ax = m.psplot()

matched_sample ¶

matched_sample(*, on_support: bool = True, drop_unmatched: bool = True) -> DataFrame

Return the rows that make up the matched sample.

Parameters:

Name	Type	Description	Default
`on_support`	`bool`	Keep only rows with `_support == 1`. Has no effect when matching was run with `common_support='none'` (every row is on support).	`True`
`drop_unmatched`	`bool`	Drop rows with a missing `_weight` — i.e. controls never used as a match and treated units that found no match. This is the sample Stata uses for `[fweight=_weight]` regressions.	`True`

Returns:

Type	Description
`DataFrame`

balance ¶

balance(covariates: Optional[Sequence[str]] = None, *, threshold: float = 0.1) -> BalanceDiagnosticsResult

Covariate balance before vs after matching (Stata pstest).

Standardized mean differences are reported two ways, exactly like pstest:

smd_raw — before matching: unweighted SMD over the full treated vs control sample.
smd_weighted — after matching: SMD with the _weight frequency weights, so a control used twice counts twice and unmatched / off-support units drop out (weight 0).
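The raw versus weighted comparison above can be sketched in a few lines of NumPy. This is an illustration, not the library's code: the helper `smd` is hypothetical, and the denominator (pooled standard deviation of the raw groups) is one common convention that may differ from what `balance()` or Stata's `pstest` uses.

```python
import numpy as np

def smd(x, d, w=None):
    """Standardized mean difference, optionally with frequency weights (sketch)."""
    x = np.asarray(x, dtype=float)
    d = np.asarray(d).astype(bool)
    w = np.ones_like(x) if w is None else np.asarray(w, dtype=float)
    mean_t = np.average(x[d], weights=w[d])
    mean_c = np.average(x[~d], weights=w[~d])
    # Assumed convention: pooled SD computed from the raw (unweighted) groups.
    sd = np.sqrt((x[d].var(ddof=1) + x[~d].var(ddof=1)) / 2.0)
    return (mean_t - mean_c) / sd

x = np.array([1.0, 2.0, 3.0, 0.0, 1.0, 2.0, 3.0])
d = np.array([1, 1, 1, 0, 0, 0, 0])
# Matching weights: the control with x = 0 was never used, so it gets weight 0.
weight = np.array([1, 1, 1, 0, 1, 1, 1])

smd_raw = smd(x, d)                # full treated vs control sample
smd_weighted = smd(x, d, weight)   # matched sample only
```

Here the unmatched control drags the raw control mean down, giving a raw SMD of about 0.43, while the weighted SMD is zero because the three retained controls mirror the treated units.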

Parameters:

Name	Type	Description	Default
`covariates`	`list of str`	Variables to assess. Defaults to the matching covariates.	`None`
`threshold`	`float`	\|SMD\| balance threshold.	`0.1`

Returns:

Type	Description
`BalanceDiagnosticsResult`	`.table` (per-covariate before vs after SMD, variance ratio, KS) and `.summary_stats`.

psplot ¶

psplot(*, before: bool = True, n_grid: int = 300, ax: Any = None, figsize: tuple[float, float] = (8.0, 4.5), title: Optional[str] = None) -> tuple[Any, Any]

Propensity-score density by treatment group, after matching.

Controls are reweighted by _weight so the plotted control density reflects the matched sample, not the raw pool. With before=True the raw (unweighted) densities are overlaid as dashed lines so the user can see how matching tightened overlap.

Returns:

Type	Description
`(fig, ax)`

psm_did ¶

psm_did(panel: DataFrame, *, id: str, y: str, time: Optional[str] = None, post: Optional[str] = None, treat: Optional[str] = None, treat_time: Optional[Any] = None, covariates: Optional[Sequence[str]] = None, fixed_effects: Optional[Sequence[str]] = None, cluster: Optional[Union[str, List[str]]] = None, on_support: bool = True, weight: str = 'fweight', alpha: float = 0.05) -> CausalResult

Frequency-weighted PSM-DID on a panel.

Implements the Stata workflow:

psmatch2 d x1 x2, out(y) ...        // produces _weight
// merge _weight back onto the panel by id, then
reg y i.treat##i.post [fweight=_weight] if _support==1

The matching _weight (and _support) are merged onto panel by id, the matched sample is selected, and the weighted difference-in-differences regression

y ~ treat + post + treat:post (+ covariates | fixed_effects)

is fitted with `sp.feols`. The `treat:post` coefficient is the PSM-DID treatment effect.
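As a rough illustration of what the weighted regression computes, the sketch below fits `y ~ treat + post + treat:post` by weighted least squares in plain NumPy. It stands in for `sp.feols` only for the point estimate; the helper `weighted_did` and the toy panel are hypothetical, and standard errors, fixed effects, and clustering are deliberately omitted.

```python
import numpy as np

def weighted_did(y, treat, post, w):
    """Coefficient on treat*post from weighted least squares (point estimate only)."""
    y, treat, post, w = (np.asarray(a, dtype=float) for a in (y, treat, post, w))
    X = np.column_stack([np.ones_like(y), treat, post, treat * post])
    sw = np.sqrt(w)  # WLS is OLS on rows scaled by sqrt(weight)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return float(beta[3])

# Toy panel: unit A is treated; control B was matched twice (_weight = 2);
# control C was never matched (_weight = 0).
# Rows:  A_pre, A_post, B_pre, B_post, C_pre, C_post
y     = [1.0, 4.0, 0.0, 1.0, 5.0, 10.0]
treat = [1, 1, 0, 0, 0, 0]
post  = [0, 1, 0, 1, 0, 1]
w     = [1, 1, 2, 2, 0, 0]

did_weighted = weighted_did(y, treat, post, w)           # uses only A and B
did_unweighted = weighted_did(y, treat, post, [1] * 6)   # C pulls the control trend up
```

With the matching weights the estimate is (4 - 1) - (1 - 0) = 2; without them the unmatched control C changes the control trend and the estimate drops to 0, which is why the weights are merged onto the panel before the regression.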

Parameters:

Name	Type	Description	Default
`panel`	`DataFrame`	Long panel (one row per unit-period).	required
`id`	`str`	Unit identifier. Must also exist in the matching data so the per-unit `_weight` can be merged in.	required
`y`	`str`	Outcome in the panel.	required
`time`	`str`	Time variable. Used with `treat_time` to build `post` if `post` is not supplied directly.	`None`
`post`	`str`	Binary post-period indicator. Provide this or `time` + `treat_time`.	`None`
`treat`	`str`	Time-invariant treated-group indicator in the panel. Defaults to the matching treatment variable.	`None`
`treat_time`	`scalar`	First treated period; `post = time >= treat_time`.	`None`
`covariates`	`list of str`	Additional time-varying controls.	`None`
`fixed_effects`	`list of str`	Columns absorbed as fixed effects (e.g. `[id, time]` for TWFE).	`None`
`cluster`	`str or list`	Cluster variable(s) for the standard errors.	`None`
`on_support`	`bool`	Keep only matched units on common support.	`True`
`weight`	`('fweight', 'none')`	`'fweight'` weights the regression by `_weight`; `'none'` runs the matched-sample DiD unweighted.	`'fweight'`
`alpha`	`float`	Significance level for the returned CI.	`0.05`

Returns:

Type	Description
`CausalResult`	`.estimate` is the DiD (`treat:post`) coefficient; the full weighted regression is stored in `model_info['feols_result']`.

summary ¶

summary() -> SummaryText

Stata-style text summary of the matched ATT.

cite ¶

cite(format: str = 'bibtex') -> Any

Citation for the matching estimator (delegates to the result).

balanceplot ¶

balanceplot(result: CausalResult, threshold: float = 0.1, ax: Any = None, figsize: tuple = (8, None), title: Optional[str] = None) -> Tuple[Any, Any]

Love plot: covariate balance visualization (SMD dot plot).

Displays standardized mean differences (SMD) for each covariate. The standard threshold for good balance is |SMD| < 0.1.

Parameters:

Name	Type	Description	Default
`result`	`CausalResult`	Result from `match()` or `ebalance()`.	required
`threshold`	`float`	SMD threshold lines.	`0.1`
`ax`	`matplotlib Axes`		`None`
`figsize`	`tuple`	Height auto-scales with number of covariates if None.	`(8, None)`
`title`	`str`		`None`

Returns:

Type	Description
`(fig, ax)`

Examples:

>>> import statspai as sp
>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(0)
>>> n = 200
>>> age = rng.normal(40, 8, n)
>>> edu = rng.normal(12, 2, n)
>>> ps = 1 / (1 + np.exp(-(0.05 * (age - 40) + 0.1 * (edu - 12))))
>>> training = rng.binomial(1, ps)
>>> wage = 20 + 0.3 * age + 0.5 * edu + 4.0 * training + rng.normal(0, 3, n)
>>> df = pd.DataFrame({"wage": wage, "training": training,
...                    "age": age, "edu": edu})
>>> result = sp.match(df, y="wage", treat="training",
...                   covariates=["age", "edu"])
>>> fig, ax = sp.balanceplot(result)
>>> fig.savefig("balance.png")
>>> type(ax).__name__
'Axes'

psplot ¶

psplot(data: DataFrame, treat: str, covariates: List[str], *, n_bins: int = 40, ax: Any = None, figsize: tuple = (8, 5), title: Optional[str] = None, labels: tuple = ('Control', 'Treated'), colors: tuple = ('#3498DB', '#E74C3C'), trim: Optional[float] = None) -> Tuple[Any, Any]

Propensity score distribution plot (common support diagnostic).

Overlays histograms of the estimated propensity score for treated and control groups, so the user can visually assess whether the common support (overlap) assumption holds.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`		required
`treat`	`str`	Binary treatment column.	required
`covariates`	`list of str`	Covariates used to estimate the propensity score.	required
`n_bins`	`int`	Number of histogram bins.	`40`
`ax`	`matplotlib Axes`		`None`
`figsize`	`tuple`		`(8, 5)`
`title`	`str`		`None`
`labels`	`tuple of str`	Labels for (control, treated).	`('Control', 'Treated')`
`colors`	`tuple of str`	Colors for (control, treated).	`('#3498DB', '#E74C3C')`
`trim`	`float`	If set, draw vertical lines at (trim, 1-trim) to show the recommended trimming region.	`None`

Returns:

Type	Description
`(fig, ax)`

Examples:

>>> import statspai as sp, numpy as np, pandas as pd
>>> rng = np.random.default_rng(0)
>>> n = 400
>>> x1, x2 = rng.normal(size=n), rng.normal(size=n)
>>> D = rng.binomial(1, 1 / (1 + np.exp(-(x1 + 0.5 * x2))))
>>> df = pd.DataFrame({"D": D, "x1": x1, "x2": x2})
>>> fig, ax = sp.psplot(df, treat="D", covariates=["x1", "x2"])

propensity_score ¶

propensity_score(data: DataFrame, treatment: str, covariates: List[str], method: str = 'logit', trimming: Optional[str] = None) -> Series

Estimate propensity scores P(D=1|X).

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Input data.	required
`treatment`	`str`	Name of binary treatment column (0/1).	required
`covariates`	`list of str`	Covariate column names.	required
`method`	`(logit, probit, gbm)`	Estimation method. `'logit'` uses IRLS (no sklearn needed). `'probit'` uses scipy.optimize. `'gbm'` tries sklearn GradientBoostingClassifier, falling back to logit with interactions.	`'logit'`
`trimming`	`(None, crump)`	If `'crump'`, apply Crump et al. (2009) trimming after estimation. Trimmed observations receive `NaN` scores.	`None`

Returns:

Type	Description
`Series`	Propensity scores indexed like data.

Examples:

>>> import statspai as sp
>>> df = sp.cps_wage()
>>> ps = sp.propensity_score(df, treatment='union',
...                          covariates=['education', 'experience',
...                                      'tenure'])
>>> round(float(ps.mean()), 3)  # matches the union share of 0.177
0.177

GBM scores with Crump trimming — poorly overlapping observations receive NaN:

>>> ps_trim = sp.propensity_score(df, treatment='union',
...                               covariates=['education', 'experience',
...                                           'tenure'],
...                               method='gbm', trimming='crump')

Typical diagnostics flow afterwards: pass the scores (or derived IPW weights) to `sp.overlap_plot`, `sp.ps_balance`, or `sp.balance_diagnostics`.
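The `'logit'` option is described above as using IRLS. A minimal sketch of that technique, assuming a plain Newton/IRLS loop with an intercept and no regularisation, might look like the following; the helper `logit_irls` is hypothetical and is not the library's routine.

```python
import numpy as np

def logit_irls(X, d, n_iter=25, tol=1e-10):
    """Logistic propensity scores via iteratively reweighted least squares (sketch)."""
    X = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])  # add intercept
    d = np.asarray(d, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)                        # IRLS working weights
        # Newton step: solve (X' W X) step = X' (d - p)
        step = np.linalg.solve(X.T @ (X * W[:, None]), X.T @ (d - p))
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return 1.0 / (1.0 + np.exp(-X @ beta))

rng = np.random.default_rng(42)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
p_true = 1.0 / (1.0 + np.exp(-(0.5 * x1 - 0.5 * x2 - 0.5)))
d = rng.binomial(1, p_true)
ps = logit_irls(np.column_stack([x1, x2]), d)
```

A logit model with an intercept reproduces the sample treatment share in its fitted mean, so `ps.mean()` equals `d.mean()` up to convergence error. This is the same property the `cps_wage` example above relies on.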

overlap_plot ¶

overlap_plot(data: DataFrame, treatment: str, covariates: List[str], ps: Optional[Series] = None, method: str = 'logit', ax: Any = None, figsize: Tuple[float, float] = (8, 4), title: str = 'Propensity Score Overlap') -> Tuple[Any, Any]

Mirrored density plot of propensity scores by treatment group.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Input data.	required
`treatment`	`str`	Binary treatment column.	required
`covariates`	`list of str`	Covariates for PS estimation (ignored if ps supplied).	required
`ps`	`Series`	Pre-estimated propensity scores.	`None`
`method`	`str`	PS estimation method if ps is None.	`'logit'`
`ax`	`matplotlib Axes`	Axes to plot on. If None, a new figure is created.	`None`
`figsize`	`tuple`	Figure size (width, height).	`(8, 4)`
`title`	`str`	Plot title.	`'Propensity Score Overlap'`

Returns:

Type	Description
`(fig, ax) : tuple`	Matplotlib figure and axes.

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(42)
>>> n = 300
>>> x1 = rng.normal(size=n)
>>> x2 = rng.normal(size=n)
>>> p = 1.0 / (1.0 + np.exp(-(0.5 * x1 - 0.5 * x2 - 0.5)))
>>> d = rng.binomial(1, p)
>>> df = pd.DataFrame({'d': d, 'x1': x1, 'x2': x2})
>>> fig, ax = sp.overlap_plot(df, treatment='d',
...                           covariates=['x1', 'x2'])
>>> fig.savefig('overlap.png')

Reuse pre-estimated propensity scores and set a custom title:

>>> ps = sp.propensity_score(df, treatment='d',
...                          covariates=['x1', 'x2'])
>>> fig, ax = sp.overlap_plot(df, treatment='d',
...                           covariates=['x1', 'x2'], ps=ps,
...                           title='PS overlap')

trimming ¶

trimming(data: DataFrame, treatment: str, covariates: List[str], method: str = 'crump', ps: Optional[Series] = None, ps_method: str = 'logit') -> DataFrame

Trim sample to optimal overlap region.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Input data.	required
`treatment`	`str`	Binary treatment column.	required
`covariates`	`list of str`	Covariates for PS estimation (if ps not supplied).	required
`method`	`(crump, sturmer)`	`'crump'` uses Crump et al. (2009) optimal rule. `'sturmer'` trims at the fixed [0.1, 0.9] interval.	`'crump'`
`ps`	`Series`	Pre-estimated propensity scores. If None, estimated via ps_method.	`None`
`ps_method`	`str`	Method for PS estimation if ps is None.	`'logit'`

Returns:

Type	Description
`DataFrame`	Trimmed data (rows with PS in the overlap region).

Examples:

Strong selection on covariates creates limited overlap:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(42)
>>> n = 300
>>> x1 = rng.normal(size=n)
>>> x2 = rng.normal(size=n)
>>> p = 1.0 / (1.0 + np.exp(-(2.0 * x1 - 2.0 * x2)))
>>> d = rng.binomial(1, p)
>>> df = pd.DataFrame({'d': d, 'x1': x1, 'x2': x2})

Crump et al. (2009) optimal rule drops poor-overlap rows:

>>> trimmed = sp.trimming(df, treatment='d',
...                       covariates=['x1', 'x2'])
>>> (len(df), len(trimmed))
(300, 206)

Fixed [0.1, 0.9] trimming keeps a narrower sample:

>>> trimmed_s = sp.trimming(df, treatment='d',
...                         covariates=['x1', 'x2'],
...                         method='sturmer')
>>> len(trimmed_s)
180
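For intuition, the fixed-interval rule amounts to a simple mask on the propensity score. The sketch below assumes inclusive bounds, which may not match the library's boundary handling, and the helper `trim_fixed` is hypothetical. The data-driven Crump et al. threshold is not reproduced here.

```python
import numpy as np

def trim_fixed(ps, lower=0.1, upper=0.9):
    """Boolean mask of rows whose propensity score lies in [lower, upper] (sketch)."""
    ps = np.asarray(ps, dtype=float)
    return (ps >= lower) & (ps <= upper)   # assumption: bounds are inclusive

ps = np.array([0.02, 0.10, 0.35, 0.80, 0.95])
keep = trim_fixed(ps)   # the first and last rows fall outside the interval
```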

love_plot ¶

love_plot(data: DataFrame, treatment: str, covariates: List[str], weights: Optional[Union[ndarray, Series]] = None, threshold: float = 0.1, ps_method: str = 'logit', ax: Any = None, figsize: Tuple[float, Optional[float]] = (7, None), title: str = 'Covariate Balance (Love Plot)') -> Tuple[Any, Any]

Love plot: dot plot of standardized mean differences before/after.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Input data.	required
`treatment`	`str`	Binary treatment column.	required
`covariates`	`list of str`	Covariate columns.	required
`weights`	`array-like`	IPW or matching weights. If None, inverse-PS weights are computed.	`None`
`threshold`	`float`	SMD threshold for the vertical dashed line (default 0.1).	`0.1`
`ps_method`	`str`	PS estimation method for balance computation.	`'logit'`
`ax`	`matplotlib Axes`		`None`
`figsize`	`tuple`	(width, height). Height defaults to 0.4 * n_covariates + 1.	`(7, None)`
`title`	`str`	Plot title.	`'Covariate Balance (Love Plot)'`

Returns:

Type	Description
`(fig, ax) : tuple`

Examples:

>>> import statspai as sp
>>> df = sp.cps_wage()
>>> fig, ax = sp.love_plot(df, treatment='union',
...                        covariates=['education', 'experience',
...                                    'tenure'])
>>> fig.savefig('love_plot.png')

>>> # Tighter balance threshold and custom title
>>> fig, ax = sp.love_plot(df, treatment='union',
...                        covariates=['education', 'experience',
...                                    'tenure'],
...                        threshold=0.05,
...                        title='Balance: union vs non-union')

ps_balance ¶

ps_balance(data: DataFrame, treatment: str, covariates: List[str], weights: Optional[Union[ndarray, Series]] = None, method: str = 'logit') -> PSBalanceResult

Compute comprehensive propensity score balance table.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Input data.	required
`treatment`	`str`	Binary treatment column.	required
`covariates`	`list of str`	Covariate columns to assess balance for.	required
`weights`	`array-like`	IPW or matching weights. If None, inverse-PS weights are computed automatically from estimated propensity scores.	`None`
`method`	`str`	PS estimation method ('logit', 'probit', 'gbm').	`'logit'`

Returns:

Type	Description
`PSBalanceResult`	Object with `.table`, `.ps`, `.summary()`, `.love_plot()`.

Examples:

Simulated data with confounded treatment assignment:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(42)
>>> n = 300
>>> x1 = rng.normal(size=n)
>>> x2 = rng.normal(size=n)
>>> p = 1.0 / (1.0 + np.exp(-(0.5 * x1 - 0.5 * x2 - 0.5)))
>>> d = rng.binomial(1, p)
>>> df = pd.DataFrame({'d': d, 'x1': x1, 'x2': x2})

Without weights, ATE inverse-propensity weights are computed from the estimated propensity scores:

>>> bal = sp.ps_balance(df, treatment='d', covariates=['x1', 'x2'])
>>> bal.table['smd_raw'].round(2).tolist()
[0.68, -0.45]
>>> bal.table['smd_weighted'].round(2).tolist()
[0.02, -0.06]
>>> fig, ax = bal.love_plot()

balance_diagnostics ¶

balance_diagnostics(data: DataFrame, treatment: str, covariates: List[str], weights: Optional[Union[ndarray, Series, str]] = None, ps: Optional[Union[ndarray, Series, str]] = None, method: str = 'logit', threshold: float = 0.1) -> BalanceDiagnosticsResult

Unified balance diagnostics for matching and weighting estimators.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Analysis frame.	required
`treatment`	`str`	Binary treatment indicator.	required
`covariates`	`list of str`	Covariates to audit.	required
`weights`	`array-like or str`	Observation weights after matching/weighting. If omitted, ATE inverse-propensity weights are computed from `ps`.	`None`
`ps`	`array-like or str`	Propensity scores. If omitted, estimated with `method`.	`None`
`method`	`(logit, probit, gbm)`	Propensity-score model when `ps` is not supplied.	`'logit'`
`threshold`	`float`	Balance threshold for absolute standardized mean differences.	`0.1`

Returns:

Type	Description
`BalanceDiagnosticsResult`	`.table` has one row per covariate; `.summary_stats` records max/mean SMDs, imbalance counts, effective sample size, and propensity-score overlap.

Examples:

With no weights, ATE inverse-propensity weights are computed from the estimated propensity scores:

>>> import statspai as sp
>>> df = sp.cps_wage()
>>> bal = sp.balance_diagnostics(
...     df, treatment='union',
...     covariates=['education', 'experience', 'tenure'])
>>> bal.summary_stats['n_obs']
3000
>>> bool(bal.summary_stats['n_imbalanced_weighted']
...      <= bal.summary_stats['n_imbalanced_raw'])
True

Typical post-estimation flow — audit your own weights and scores:

>>> import numpy as np
>>> ps = sp.propensity_score(
...     df, 'union', ['education', 'experience', 'tenure'])
>>> w = np.where(df['union'] == 1, 1 / ps, 1 / (1 - ps))
>>> bal = sp.balance_diagnostics(
...     df, treatment='union',
...     covariates=['education', 'experience', 'tenure'],
...     weights=w, ps=ps)
>>> bool(bal.summary_stats['effective_sample_size'] > 0)
True
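The effective sample size reported in `summary_stats` can be understood through Kish's formula, (Σw)² / Σw². The sketch below assumes that formula and the ATE inverse-propensity weights used in the example above; both helper names are hypothetical, and the library's exact definition may differ.

```python
import numpy as np

def ate_ipw_weights(d, ps):
    """ATE inverse-propensity weights: 1/ps for treated, 1/(1-ps) for controls."""
    d = np.asarray(d, dtype=float)
    ps = np.asarray(ps, dtype=float)
    return d / ps + (1.0 - d) / (1.0 - ps)

def kish_ess(w):
    """Kish effective sample size (assumed definition): (sum w)^2 / sum(w^2)."""
    w = np.asarray(w, dtype=float)
    return float(w.sum() ** 2 / np.sum(w ** 2))

d = np.array([1, 1, 0, 0])
ess_equal = kish_ess(ate_ipw_weights(d, [0.5, 0.5, 0.5, 0.5]))    # all weights equal
ess_extreme = kish_ess(ate_ipw_weights(d, [0.1, 0.5, 0.5, 0.5]))  # one weight of 10
```

Equal weights give an effective sample size equal to the number of rows (4 here), while a single extreme propensity score cuts it to roughly 2.3. This is why poor overlap shows up as a small effective sample size.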

optimal_match ¶

optimal_match(data: DataFrame, treatment: str, outcome: str, covariates: List[str], metric: str = 'mahalanobis', caliper: Optional[float] = None) -> OptimalMatchResult

Optimal 1:1 matching via the Hungarian algorithm.

Each treated unit is matched to exactly one control; the total sum of matched distances is globally minimised. Requires n_treated ≤ n_control.
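To see why a globally minimised total distance differs from greedy nearest-neighbour matching, consider this toy comparison on a single covariate. The brute-force search stands in for the Hungarian algorithm and is only practical for tiny inputs; all function names are hypothetical, and nothing here is the library's implementation. SciPy's `scipy.optimize.linear_sum_assignment` solves the same assignment problem efficiently.

```python
from itertools import permutations

def greedy_match(treated, controls):
    """Match each treated unit, in order, to its nearest unused control."""
    available = set(range(len(controls)))
    pairs = []
    for i, t in enumerate(treated):
        j = min(available, key=lambda k: abs(t - controls[k]))
        available.remove(j)
        pairs.append((i, j))
    return pairs

def optimal_match_bruteforce(treated, controls):
    """Enumerate every 1:1 assignment and keep the one with the smallest total distance."""
    best = min(
        permutations(range(len(controls)), len(treated)),
        key=lambda perm: sum(abs(t - controls[j]) for t, j in zip(treated, perm)),
    )
    return list(enumerate(best))

def total_distance(pairs, treated, controls):
    return sum(abs(treated[i] - controls[j]) for i, j in pairs)

treated = [5.0, 6.0]
controls = [5.6, 4.0, 9.0]

greedy = greedy_match(treated, controls)               # [(0, 0), (1, 1)]
optimal = optimal_match_bruteforce(treated, controls)  # [(0, 1), (1, 0)]
greedy_total = total_distance(greedy, treated, controls)
optimal_total = total_distance(optimal, treated, controls)
```

Greedy matching gives the first treated unit its closest control (5.6) and leaves the second with a poor match, for a total distance of about 2.6. The optimal assignment accepts a slightly worse first match to reach a total of about 1.4.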

Parameters:

Name	Type	Description	Default
`caliper`	`float`	Drop any pair with distance greater than `caliper`.	`None`

Examples:

Simulated observational data with two confounders (true ATT = 2):

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(42)
>>> n = 300
>>> x1 = rng.normal(size=n)
>>> x2 = rng.normal(size=n)
>>> p = 1.0 / (1.0 + np.exp(-(0.5 * x1 - 0.5 * x2 - 0.5)))
>>> d = rng.binomial(1, p)
>>> y = 1.0 + 2.0 * d + x1 + x2 + rng.normal(size=n)
>>> df = pd.DataFrame({'y': y, 'd': d, 'x1': x1, 'x2': x2})
>>> res = sp.optimal_match(df, treatment='d', outcome='y',
...                        covariates=['x1', 'x2'])
>>> res.n_matched
111
>>> round(res.ate, 2)
1.88
>>> res.pairs.columns.tolist()
['treated_idx', 'control_idx', 'distance']

cardinality_match ¶

cardinality_match(data: DataFrame, treatment: str, outcome: str, covariates: List[str], smd_tolerance: float = 0.1) -> CardinalityMatchResult

Cardinality matching — maximise the number of matched pairs subject to a standardised-mean-difference tolerance on every covariate.

Formulation (Zubizarreta 2014):

maximise   sum_j z_j
s.t.       |mean(X_k | T=1) - sum_j z_j X_{jk} / sum_j z_j|
            <= smd_tolerance * SD(X_k)   ∀ k
           z_j ∈ {0, 1}  for each control j

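Uses a continuous LP relaxation (`scipy.optimize.linprog`), then rounds the weights to 0/1 via a threshold, which is typically adequate for applied work. The rounded solution is not guaranteed to satisfy the tolerance exactly: in the example below, one |SMD| is 0.111 with `smd_tolerance=0.1`. The matched-pair sample is formed by pairing each selected control, in sequence, with its nearest treated unit in covariate space.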
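The relaxation can be sketched as follows. Multiplying the balance constraint by Σ z_j makes it linear in z, so it fits `scipy.optimize.linprog` directly. This is a sketch under stated assumptions, not the library's code: the helper `cardinality_lp`, the use of the treated-group standard deviation, and the 0.5 rounding threshold are all hypothetical choices.

```python
import numpy as np
from scipy.optimize import linprog

def cardinality_lp(X_t, X_c, smd_tolerance=0.1, threshold=0.5):
    """LP relaxation of cardinality matching over controls (illustrative sketch)."""
    X_t = np.asarray(X_t, dtype=float)
    X_c = np.asarray(X_c, dtype=float)
    target = X_t.mean(axis=0)                         # treated covariate means
    slack = smd_tolerance * X_t.std(axis=0, ddof=1)   # assumed SD convention
    dev = X_c - target                                # control deviations, shape (n_c, k)
    # |sum_j z_j X_jk / sum_j z_j - target_k| <= slack_k, multiplied by sum_j z_j:
    #   sum_j z_j (dev_jk - slack_k) <= 0   and   sum_j z_j (-dev_jk - slack_k) <= 0
    A_ub = np.vstack([(dev - slack).T, (-dev - slack).T])
    b_ub = np.zeros(A_ub.shape[0])
    # linprog minimises, so maximise sum(z) by minimising -sum(z); 0 <= z_j <= 1.
    res = linprog(c=-np.ones(len(X_c)), A_ub=A_ub, b_ub=b_ub, bounds=(0, 1))
    z = res.x
    return z, z >= threshold                          # relaxed weights, rounded selection

rng = np.random.default_rng(0)
X_c = rng.normal(0.0, 1.0, size=(50, 2))   # controls
X_t = rng.normal(0.5, 1.0, size=(20, 2))   # treated, shifted mean
z, selected = cardinality_lp(X_t, X_c)
```

The fractional solution `z` satisfies the balance constraints by construction, but the rounded `selected` mask need not, so balance should be re-checked after rounding.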

Examples:

Simulated observational data with two confounders (true ATT = 2):

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(42)
>>> n = 300
>>> x1 = rng.normal(size=n)
>>> x2 = rng.normal(size=n)
>>> p = 1.0 / (1.0 + np.exp(-(0.5 * x1 - 0.5 * x2 - 0.5)))
>>> d = rng.binomial(1, p)
>>> y = 1.0 + 2.0 * d + x1 + x2 + rng.normal(size=n)
>>> df = pd.DataFrame({'y': y, 'd': d, 'x1': x1, 'x2': x2})
>>> res = sp.cardinality_match(df, treatment='d', outcome='y',
...                            covariates=['x1', 'x2'],
...                            smd_tolerance=0.1)
>>> res.n_matched_pairs
107
>>> round(res.ate, 2)
1.86
>>> res.balance['|SMD|'].round(3).tolist()
[0.111, 0.082]
