`statspai.decomposition`¶

decomposition ¶

Decomposition Analysis module for StatsPAI.

Decomposition toolkit covering mean, distributional, inequality, demographic, and causal decomposition methods under a unified API: sp.decompose(method=...).

Methods (19 in total — Yu-Elwert added in v1.15)

Mean decomposition

- oaxaca: Blinder-Oaxaca (Blinder 1973; Oaxaca 1973) with 5 reference-coefficient choices (A, B, pooled/Neumark, Cotton, Reimers)
- gelbach: Gelbach (2016) sequential orthogonal decomposition of omitted-variable bias
- fairlie: Fairlie (2005) nonlinear decomposition for logit/probit
- bauer_sinning / yun_nonlinear: Bauer-Sinning (2008) + Yun (2005) detailed nonlinear decomposition

Distributional decomposition

- rif: Recentered Influence Function regression + OB decomposition (Firpo-Fortin-Lemieux 2009)
- ffl: Firpo-Fortin-Lemieux (2018) two-step detailed decomposition
- dfl: DiNardo-Fortin-Lemieux (1996) reweighting
- machado_mata: Machado-Mata (2005) quantile decomposition
- melly: Melly (2005) analytical quantile decomposition
- cfm: Chernozhukov-Fernández-Val-Melly (2013) counterfactual distributions via distribution regression

Inequality decomposition

- subgroup: between/within decomposition (Theil T/L, GE, Gini, Atkinson, CV²)
- shapley_inequality: Shorrocks-Shapley (2013) allocation of inequality to covariates
- gini_source: Lerman-Yitzhaki (1985) Gini source decomposition

Demographic / standardisation

- kitagawa: Kitagawa (1955) two-factor rate decomposition
- das_gupta: Das Gupta (1993) multi-factor decomposition

Causal decomposition

- gap_closing: Lundberg (2022) gap-closing estimator (regression / IPW / AIPW)
- mediation: VanderWeele (2014) natural direct/indirect effects
- disparity / causal_jvw: Jackson-VanderWeele (2018) causal disparity decomposition
- yu_elwert: Yu & Elwert (2025) nonparametric causal decomposition of group disparities into baseline, prevalence, effect, and selection components (efficient-influence-function-based; ML-friendly)

Unified Entry

sp.decompose(method=..., **kwargs) dispatches to any of the above.

Polish (v1.15)

Every result class now inherits DecompResultMixin, which exposes a common .confint(), .cite(), .to_dict(), .to_json(), .to_excel(), and .to_word() surface in addition to each method's bespoke .summary() / .plot() / .to_latex(). Plots share a common palette and minimalist style via :mod:`statspai.decomposition.plots` (forest plots, mediation forest, Yu-Elwert mechanism plot, RIF heatmap, …).

OaxacaResult ¶

Bases: DecompResultMixin

Result container for Oaxaca-Blinder decomposition.

Attributes:

Name	Type	Description
`overall`	`dict`	Keys: `'gap'`, `'explained'`, `'unexplained'`, `'explained_se'`, `'unexplained_se'`, `'unexplained_a'`, `'unexplained_b'` (threefold components).
`detailed`	`DataFrame`	Variable-level decomposition with columns `contribution`, `se`, `pct_of_explained`.
`group_stats`	`dict`	Per-group means, coefficients, standard errors, sample sizes.
`reference`	`str or int`	Reference weight specification used.

Examples:

>>> import statspai as sp
>>> res = sp.oaxaca(
...     data=sp.cps_wage(), y="log_wage", group="female",
...     x=["education", "experience", "tenure"], reference=0)
>>> isinstance(res, sp.OaxacaResult)
True
>>> bool({"gap", "explained", "unexplained"}.issubset(res.overall))
True
>>> "contribution" in res.detailed.columns
True
>>> print(res.summary())

summary ¶

summary() -> str

Return formatted decomposition summary.

plot ¶

plot(figsize: Any = (8, 5), kind: str = 'waterfall', **kwargs: Any) -> Any

Bar / forest chart of per-variable explained contributions.

Parameters:

Name	Type	Description	Default
`kind`	`('waterfall', 'forest')`	`"waterfall"` (default) is a sign-coloured bar chart with optional 95% CI whiskers; `"forest"` shows point estimates with CI lines and greys out non-significant rows.	`"waterfall"`

to_latex ¶

to_latex() -> str

Return a LaTeX-formatted decomposition table.

GelbachResult ¶

Bases: DecompResultMixin

Result container for Gelbach (2016) decomposition.

Attributes:

Name	Type	Description
`total_change`	`float`	Total change in the base coefficient when added controls are included: beta_base - beta_full.
`decomposition`	`DataFrame`	Per-variable contributions with columns `delta`, `se`, `pct_of_change`.
`base_coef`	`float`	Coefficient of interest from the base (short) regression.
`full_coef`	`float`	Coefficient of interest from the full (long) regression.
`base_var`	`str`	Name of the variable of interest.

Examples:

>>> import statspai as sp
>>> res = sp.gelbach(
...     data=sp.cps_wage(), y="log_wage",
...     base_x=["education"],
...     added_x=["experience", "tenure", "union"])
>>> isinstance(res, sp.GelbachResult)
True
>>> res.base_var
'education'
>>> bool(abs((res.base_coef - res.full_coef) - res.total_change) < 1e-6)
True
>>> print(res.summary())

summary ¶

summary() -> str

Return formatted Gelbach decomposition summary.

plot ¶

plot(figsize: Any = (8, 5), color: str = '#4CAF50') -> Any

Horizontal bar chart of Gelbach contributions.

Returns:

Type	Description
`(fig, ax)`

to_latex ¶

to_latex() -> str

Return LaTeX table of the decomposition.

DFLResult `dataclass` ¶

Bases: DecompResultMixin

Container for DFL reweighting decomposition results.

plot ¶

plot(**kwargs: Any) -> Any

Delegate to plots.dfl_plot().

FFLResult `dataclass` ¶

Bases: DecompResultMixin

Container for Firpo-Fortin-Lemieux two-step decomposition.

MachadoMataResult `dataclass` ¶

Bases: DecompResultMixin

Container for Machado-Mata decomposition.

MellyResult `dataclass` ¶

Bases: DecompResultMixin

Container for Melly quantile decomposition.

CFMResult `dataclass` ¶

Bases: DecompResultMixin

Chernozhukov-Fernández-Val-Melly counterfactual distribution result.

YuElwertResult `dataclass` ¶

Bases: DecompResultMixin

Yu-Elwert (2025) causal decomposition of a group disparity.

Attributes:

Name	Type	Description
`disparity`	`float`	Observed gap `E[Y\|R=1] - E[Y\|R=0]`.
`baseline`	`float`	Counterfactual disparity if no one were treated.
`prevalence`	`float`	Contribution of differential treatment uptake (group A vs. B), scaled by the reference group's average treatment effect.
`effect`	`float`	Contribution of group heterogeneity in average treatment effects, scaled by the advantaged group's treatment prevalence.
`selection`	`float`	Group-specific covariance between treatment assignment and individual-level effect heterogeneity — the signature mechanism of Yu-Elwert.
`se`	`dict[str, float] \| None`	Standard errors keyed by component (`disparity`, `baseline`, `prevalence`, `effect`, `selection`).
`ci`	`dict[str, (float, float)] \| None`	Two-sided confidence intervals at level 1 − `alpha` (95% when `alpha=0.05`).
`detailed`	`DataFrame`	Tidy table of (component, value, se, ci_low, ci_high).
`nuisance`	`dict`	Diagnostic snapshot — group sizes, fitted per-cell means, plus `fallback_cell_count` and `bootstrap_failure_count` when applicable so the user can audit a degenerate run.
`method`	`str`	`"plugin"` or `"efficient"`.

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(0)
>>> n = 400
>>> r = rng.integers(0, 2, size=n)
>>> x1 = rng.normal(size=n)
>>> x2 = rng.normal(size=n)
>>> p = 1 / (1 + np.exp(-(0.5 * r + 0.3 * x1)))
>>> t = (rng.uniform(size=n) < p).astype(int)
>>> y = (1.0 + 0.5 * x1 + 0.3 * x2 + (0.8 + 0.4 * r) * t
...      + rng.normal(size=n))
>>> df = pd.DataFrame({"y": y, "t": t, "r": r, "x1": x1, "x2": x2})
>>> res = sp.yu_elwert_decompose(
...     df, y="y", treatment="t", group="r", x=["x1", "x2"],
...     inference="none",
... )
>>> type(res).__name__
'YuElwertResult'
>>> res.method
'plugin'

gelbach ¶

gelbach(data: DataFrame, y: str, base_x: Sequence[str], added_x: Sequence[str], var_of_interest: Optional[str] = None, alpha: float = 0.05) -> GelbachResult

Gelbach (2016) decomposition of omitted variable bias.

When controls are added to a regression, the coefficient on a variable of interest may change. This function decomposes that change into contributions from each added variable, answering: "Which added controls explain the change, and by how much?"

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Input dataset.	required
`y`	`str`	Outcome variable name.	required
`base_x`	`list of str`	Variables in the base (short) specification.	required
`added_x`	`list of str`	Variables added to obtain the full (long) specification.	required
`var_of_interest`	`str`	Which base variable's coefficient change to decompose. Defaults to the first element of `base_x`.	`None`
`alpha`	`float`	Significance level.	`0.05`

Returns:

Type	Description
`GelbachResult`	Result object with `.summary()`, `.plot()`, `.to_latex()`.

Notes

The Gelbach identity:

.. math::

    \hat\beta^{\text{base}}_k - \hat\beta^{\text{full}}_k
    = \sum_{j \in \text{added}} \tilde\gamma_{kj} \, \hat\beta^{\text{full}}_j

where :math:`\tilde\gamma_{kj}` is the coefficient on base variable :math:`k` from regressing added variable :math:`j` on all base variables (including the constant).
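The identity above can be checked numerically. The following is a minimal sketch in pure Python, assuming simulated data and a small hand-written OLS helper (`solve` and `ols` are illustrative names, not part of StatsPAI); it is not the library's implementation.

```python
import random

def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(columns, y):
    """OLS coefficients of y on the given columns; an intercept is prepended."""
    x = [[1.0] * len(y)] + columns
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in x] for ci in x]
    xty = [sum(p * q for p, q in zip(ci, y)) for ci in x]
    return solve(xtx, xty)

# Simulated data (hypothetical): the added controls are correlated with educ
rng = random.Random(0)
n = 500
educ = [rng.gauss(12, 2) for _ in range(n)]
exper = [0.5 * e + rng.gauss(0, 1) for e in educ]
tenure = [-0.3 * e + rng.gauss(0, 1) for e in educ]
y = [0.1 * e + 0.05 * x + 0.02 * t + rng.gauss(0, 0.5)
     for e, x, t in zip(educ, exper, tenure)]

beta_base = ols([educ], y)[1]              # short regression: y ~ educ
full = ols([educ, exper, tenure], y)       # long regression
beta_full = full[1]

# delta_j = gamma_kj * beta_full_j, with gamma_kj from added_j ~ educ
gammas = [ols([educ], added)[1] for added in (exper, tenure)]
deltas = [g * b for g, b in zip(gammas, full[2:])]

total_change = beta_base - beta_full
print(total_change, sum(deltas))           # the two agree up to rounding
```

The per-variable `deltas` add up to `total_change` regardless of the order in which the added controls are listed, which is what makes the decomposition order-invariant.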

Examples:

>>> import statspai as sp
>>> result = sp.gelbach(
...     data=sp.cps_wage(), y="log_wage",
...     base_x=["education"],
...     added_x=["experience", "tenure", "union"],
... )
>>> type(result).__name__
'GelbachResult'
>>> result.base_var
'education'
>>> bool(abs((result.base_coef - result.full_coef) - result.total_change) < 1e-6)
True

References

[@gelbach2016covariates]

rifreg ¶

rifreg(formula: str, data: DataFrame, statistic: StatisticKind = 'quantile', tau: float = 0.5, quantile_convention: Literal['statspai', 'dineq'] = 'statspai') -> RIFResult

RIF regression (Firpo, Fortin & Lemieux 2009).

Parameters:

Name	Type	Description	Default
`formula`	`str`	`"y ~ x1 + x2"` style.	required
`data`	`DataFrame`		required
`statistic`	`('quantile', 'variance', 'gini')`		`"quantile"`
`tau`	`float`	Quantile level (default 0.5 = median UQPE).	`0.5`
`quantile_convention`	`('statspai', 'dineq')`	Quantile RIF convention for `statistic="quantile"`.	`"statspai"`

Examples:

>>> import statspai as sp
>>> df = sp.cps_wage()
>>> r = sp.rifreg(
...     'log_wage ~ education + experience',
...     data=df, statistic='quantile', tau=0.5,
... )
>>> type(r).__name__
'RIFResult'
>>> list(r.params.index)
['Intercept', 'education', 'experience']
>>> r.statistic, r.tau
('quantile', 0.5)

rif_decomposition ¶

rif_decomposition(formula: str, data: DataFrame, group: str, statistic: StatisticKind = 'quantile', tau: float = 0.5, reference: int = 0, quantile_convention: Literal['statspai', 'dineq'] = 'statspai') -> RIFDecompositionResult

RIF Oaxaca-Blinder decomposition (FFL 2009, Section 5).

Decomposes the between-group difference in a distributional statistic into explained (covariate endowment) and unexplained (coefficient) parts at the chosen statistic.

Parameters:

Name	Type	Description	Default
`group`	`str`	Binary (0/1) group indicator column.	required
`reference`	`int`	Which group's coefficients to use as the reference (0 or 1).	`0`
`quantile_convention`	`('statspai', 'dineq')`	Quantile RIF convention for `statistic="quantile"`. Use `"dineq"` for R `dineq::rif` parity.	`"statspai"`

Examples:

>>> import statspai as sp
>>> df = sp.cps_wage()
>>> r = sp.rif_decomposition(
...     'log_wage ~ education + experience + tenure',
...     data=df, group='female', statistic='quantile', tau=0.5,
... )
>>> print(r.summary())
>>> r.total_diff, r.explained, r.unexplained

>>> # Decompose the gender gap in the Gini coefficient
>>> r = sp.rif_decomposition(
...     'log_wage ~ education + experience + tenure',
...     data=df, group='female', statistic='gini',
... )

rif_values ¶

rif_values(y: ndarray, statistic: StatisticKind = 'quantile', tau: float = 0.5, quantile_convention: Literal['statspai', 'dineq'] = 'statspai') -> ndarray

Compute the RIF of each observation.

Delegates to :func:`statspai.decomposition._common.influence_function`, which is the canonical implementation shared with the FFL two-step decomposition. Supported statistics (expanded in this release):

``quantile`` (with ``tau``), ``mean``, ``variance``, ``std``,
``log_var``, ``iqr``, ``gini``, ``theil_t``, ``theil_l``,
``atkinson`` (ε = 1).

Parameters:

Name	Type	Description	Default
`y`	`(n,) array`		required
`statistic`	`str`		`'quantile'`
`tau`	`float`	Quantile level (only used when `statistic="quantile"`).	`0.5`
`quantile_convention`	`('statspai', 'dineq')`	Quantile RIF convention. `"statspai"` preserves the historical empirical-CDF/Silverman path; `"dineq"` mirrors the R `dineq::rif` convention (Hmisc type-7 quantile, R `bw.nrd0` Gaussian density, and `I(y < q)`).	`"statspai"`
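To make the idea of a recentered influence function concrete, here is a minimal pure-Python sketch for the variance and for a quantile. It assumes the textbook formulas RIF(y; σ²) = (y − μ)² and RIF(y; q_τ) = q_τ + (τ − 1{y ≤ q_τ}) / f(q_τ), with a Gaussian kernel density and a rule-of-thumb bandwidth. The function names are illustrative, and the quantile and bandwidth details are assumptions that do not necessarily match either the `"statspai"` or the `"dineq"` convention.

```python
import math
import random
import statistics

def rif_variance(y):
    """RIF(y; variance) = (y - mean)^2; its sample mean is the population variance."""
    mu = statistics.fmean(y)
    return [(v - mu) ** 2 for v in y]

def rif_quantile(y, tau=0.5):
    """RIF(y; q_tau) = q_tau + (tau - 1{y <= q_tau}) / f(q_tau)."""
    n = len(y)
    ys = sorted(y)
    q = ys[max(0, math.ceil(tau * n) - 1)]          # empirical-CDF quantile
    # Gaussian kernel density at q with a Silverman-style bandwidth (assumption)
    sd = statistics.stdev(y)
    iqr = ys[int(0.75 * (n - 1))] - ys[int(0.25 * (n - 1))]
    h = 0.9 * min(sd, iqr / 1.34) * n ** (-0.2)
    f_q = sum(math.exp(-0.5 * ((q - v) / h) ** 2) for v in y)
    f_q /= n * h * math.sqrt(2 * math.pi)
    return [q + (tau - (v <= q)) / f_q for v in y]

rng = random.Random(0)
y = [rng.lognormvariate(0.0, 0.5) for _ in range(1000)]
rif_var = rif_variance(y)
rif_med = rif_quantile(y, tau=0.5)

# The mean of a RIF recovers the statistic it is recentered on
print(statistics.fmean(rif_var), statistics.pvariance(y))
print(statistics.fmean(rif_med), sorted(y)[499])
```

A quantile RIF takes only two distinct values (one for observations at or below the quantile, one for those above), which is why RIF regression at a quantile behaves like a rescaled linear probability model.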

dfl_decompose ¶

dfl_decompose(data: DataFrame, y: str, group: str, x: Sequence[str], stat: str = 'mean', tau: float = 0.5, reference: int = 0, weights: Optional[Union[str, ndarray]] = None, trim: float = 0.001, inference: str = 'analytical', n_boot: int = 299, alpha: float = 0.05, quantile_grid: Optional[Sequence[float]] = None, seed: Optional[int] = 12345) -> DFLResult

DFL (1996) reweighting decomposition at a chosen distributional statistic.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`		required
`y`	`str`	Outcome variable name.	required
`group`	`str`	Binary (0/1) group indicator.	required
`x`	`Sequence[str]`	Covariates used for the propensity model.	required
`stat`	`('mean', 'variance', 'std', 'quantile', 'iqr', 'gini', 'log_var')`		`'mean'`
`tau`	`float`	Quantile level (used when `stat='quantile'`).	`0.5`
`reference`	`(0, 1)`	0: reweight Group B to look like A's X (default); the counterfactual is F_{Y<1\|0>}, i.e. A's X distribution with B's outcome structure. 1: reweight Group A to look like B's X; the counterfactual is F_{Y<0\|1>}, i.e. B's X distribution with A's outcome structure. Warning: `reference` has different economic semantics across method families. In DFL, `reference=0` yields cf = A's X, B's β (reweighting approach). In `machado_mata` / `melly` / `cfm`, `reference=0` yields cf = A's β, B's X (coefficient-substitution approach). These are opposite counterfactual constructions. Within each method the labels are internally consistent (DFL structure = A − cf; MM composition = A − cf). When comparing estimates across methods, read the per-method docstrings carefully.	`0`
`weights`	`str, array or None`	Sample weights.	`None`
`trim`	`float`	Clip propensity scores to `[trim, 1 - trim]`.	`0.001`
`inference`	`('none', 'bootstrap', 'analytical')`		`'analytical'`
`n_boot`	`int`	Number of bootstrap replications.	`299`
`alpha`	`float`	Significance level for confidence intervals (0.05 gives 95% CIs).	`0.05`
`quantile_grid`	`sequence of τ ∈ (0, 1) or None`	If provided, also compute quantile-process decomposition on this grid.	`None`
`seed`	`int or None`		`12345`

Returns:

Type	Description
`DFLResult`

Examples:

>>> import statspai as sp
>>> df = sp.cps_wage()
>>> r = sp.dfl_decompose(df, y='log_wage', group='female',
...                      x=['education', 'experience', 'tenure'],
...                      stat='quantile', tau=0.5)
>>> r.summary()
>>> r.gap, r.composition, r.structure

>>> # Variance decomposition with bootstrap inference
>>> r = sp.dfl_decompose(df, y='log_wage', group='female',
...                      x=['education', 'experience', 'tenure'],
...                      stat='variance', inference='bootstrap',
...                      n_boot=49, seed=12345)
>>> r.se['gap']

See also :func:`sp.decompose`, the unified dispatcher, which reaches this method via sp.decompose('dfl', data=df, ...).
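The reweighting idea can be sketched without a propensity model when there is a single categorical covariate: the DFL factor then reduces to the ratio of cell shares, ψ(x) = P(x | A) / P(x | B). The toy data, the coding of group 1 as "A", and all variable names below are assumptions made for illustration; this is not the library's estimator.

```python
from collections import Counter

# (group, cell, outcome): group 1 = A, group 0 = B (coding assumed for this sketch)
rows = [
    (1, "hs", 10.0), (1, "hs", 12.0),
    (1, "college", 20.0), (1, "college", 22.0), (1, "college", 24.0),
    (0, "hs", 9.0), (0, "hs", 11.0), (0, "hs", 10.0),
    (0, "college", 18.0), (0, "college", 20.0),
]
a = [(c, y) for g, c, y in rows if g == 1]
b = [(c, y) for g, c, y in rows if g == 0]

share_a = {c: k / len(a) for c, k in Counter(c for c, _ in a).items()}
share_b = {c: k / len(b) for c, k in Counter(c for c, _ in b).items()}

# Reweight B's observations so that their X distribution matches A's
psi = [share_a[c] / share_b[c] for c, _ in b]

mean_a = sum(y for _, y in a) / len(a)
mean_b = sum(y for _, y in b) / len(b)
# Counterfactual mean: A's X distribution with B's outcome structure
mean_cf = sum(w * y for w, (_, y) in zip(psi, b)) / sum(psi)

structure = mean_a - mean_cf      # same X, different outcome structure
composition = mean_cf - mean_b    # same outcome structure, different X
print(mean_cf, structure, composition)
```

With continuous or many covariates the cell shares are replaced by an estimated propensity score, which is where the `trim` argument becomes relevant.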

ffl_decompose ¶

ffl_decompose(data: DataFrame, y: str, group: str, x: Sequence[str], stat: str = 'quantile', tau: float = 0.5, reference: int = 0, weights: Optional[Union[str, ndarray]] = None, trim: float = 0.001, inference: str = 'analytical', n_boot: int = 299, alpha: float = 0.05, seed: Optional[int] = 12345) -> FFLResult

Firpo-Fortin-Lemieux two-step detailed distributional decomposition.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`		required
`y`	`str`		required
`group`	`str`	Binary {0, 1} group indicator.	required
`x`	`Sequence[str]`		required
`stat`	`{'quantile', 'mean', 'variance', 'std', 'iqr', 'gini', 'log_var', 'theil_t', 'theil_l', 'atkinson'}`		`'quantile'`
`tau`	`float`	Quantile level (used when `stat='quantile'`).	`0.5`
`reference`	`int {0, 1}`	0: B reweighted to look like A's X (composition = effect of A's X on B's outcomes relative to observed B)	`0`
`weights`	`(str, array or None)`		`None`
`trim`	`float`	Propensity trim.	`0.001`
`inference`	`('analytical', 'bootstrap', 'none')`		`'analytical'`
`n_boot`	`int`		`299`
`alpha`	`float`		`0.05`
`seed`	`int or None`		`12345`

Examples:

>>> import statspai as sp
>>> df = sp.cps_wage()
>>> r = sp.ffl_decompose(df, y='log_wage', group='female',
...                      x=['education', 'experience', 'tenure'],
...                      stat='quantile', tau=0.5)
>>> print(r.summary())
...Firpo-Fortin-Lemieux Two-Step Decomposition...
>>> bool(abs(r.gap - (r.composition + r.structure
...                    + r.spec_error + r.reweight_error)) < 1e-6)
True

References

[@firpo2018decomposing]

melly_decompose ¶

melly_decompose(data: DataFrame, y: str, group: str, x: Sequence[str], tau_grid: Optional[Sequence[float]] = None, reference: int = 0, n_tau_qr: int = 99) -> MellyResult

Melly (2005) quantile decomposition.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`		required
`y`	`str`	Outcome column name.	required
`group`	`str`	Group indicator column name.	required
`x`	`Sequence[str]`	Covariate column names.	required
`tau_grid`	`Sequence[float] or None`	Reporting τ grid.	`None`
`reference`	`(0, 1)`	Same convention as `machado_mata`: `reference=0` uses A's β on B's X (coefficient-swap counterfactual F_{Y<0\|1>}), opposite to `dfl_decompose` whose `reference=0` uses A's X with B's β.	`0`
`n_tau_qr`	`int`	QR estimation grid resolution.	`99`

Returns:

Type	Description
`MellyResult`

Examples:

>>> import statspai as sp
>>> df = sp.cps_wage()
>>> r = sp.melly_decompose(df, y='log_wage', group='female',
...                        x=['education', 'experience', 'tenure'],
...                        tau_grid=[0.25, 0.5, 0.75])
>>> print(r.summary())
...Melly (2005) Quantile Decomposition...
>>> list(r.quantile_grid['tau'])
[0.25, 0.5, 0.75]
>>> set(['gap', 'composition', 'structure']) <= set(r.quantile_grid.columns)
True

References

[@melly2005decomposition]

cfm_decompose ¶

cfm_decompose(data: DataFrame, y: str, group: str, x: Sequence[str], tau_grid: Optional[Sequence[float]] = None, reference: int = 0, n_thresh: int = 40, ks_test: bool = True) -> CFMResult

Chernozhukov-Fernández-Val-Melly (2013) counterfactual decomposition.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`		required
`y`	`str`	Outcome column name.	required
`group`	`str`	Group indicator column name.	required
`x`	`Sequence[str]`	Covariate column names.	required
`tau_grid`	`Sequence[float] or None`		`None`
`reference`	`(0, 1)`	Same convention as `machado_mata` / `melly_decompose`: `reference=0` builds the counterfactual from A's distribution regression coefficients applied to B's X (F_{Y<0\|1>}), opposite to the reweighting convention in `dfl_decompose`.	`0`
`n_thresh`	`int`	Number of thresholds for distribution regression.	`40`
`ks_test`	`bool`	Whether to compute the Kolmogorov-Smirnov gap test.	`True`

Returns:

Type	Description
`CFMResult`

Examples:

>>> import statspai as sp
>>> df = sp.decomposition.datasets.cps_wage()
>>> r = sp.cfm_decompose(df, y="log_wage", group="female",
...                      x=["education", "experience"],
...                      tau_grid=[0.25, 0.5, 0.75], n_thresh=15)
>>> type(r).__name__
'CFMResult'
>>> sorted(r.overall)
['mean_composition', 'mean_gap', 'mean_structure']

fairlie ¶

fairlie(data: DataFrame, y: str, group: str, x: Sequence[str], model: str = 'logit', reference: int = 0, n_sim: int = 500, seed: Optional[int] = 12345) -> NonlinearDecompResult

Fairlie (2005) nonlinear decomposition for binary outcomes.

Procedure: (1) fit the model on the reference group; (2) rank-match one group onto the other; (3) compute the mean predicted probability under the counterfactual X; (4) obtain each variable-level contribution as the change in mean prediction when that variable is swapped to the other group's value.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`		required
`y`	`str`	Binary {0, 1} outcome.	required
`group`	`str`	Binary group indicator.	required
`x`	`Sequence[str]`		required
`model`	`('logit', 'probit')`		`'logit'`
`reference`	`(0, 1)`		`0`
`n_sim`	`int`	Number of random matchings to average over.	`500`
`seed`	`int or None`		`12345`

Examples:

>>> import statspai as sp
>>> df = sp.cps_wage()
>>> res = sp.fairlie(df, y="union", group="female",
...                  x=["education", "experience"], model="logit",
...                  n_sim=50, seed=0)
>>> res.method
'Fairlie'
>>> list(res.detailed["variable"])
['education', 'experience']

bauer_sinning ¶

bauer_sinning(data: DataFrame, y: str, group: str, x: Sequence[str], model: str = 'logit', reference: int = 0, variant: str = 'yun') -> NonlinearDecompResult

Bauer-Sinning (2008) nonlinear Oaxaca-Blinder decomposition with Yun (2004, 2005) weights for detailed contributions.

Implements the three-fold equivalent:

    gap = [E(p_a(X_a)) - E(p_r(X_a))]  # not used here

but uses Yun's weight decomposition:

    explained_j = w_j · (E(p_r(X_a)) - E(p_r(X_b)))

where

    w_j = (Δx̄_j · β_r_j) / Σ_k (Δx̄_k · β_r_k)
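A small worked example of the Yun weights may help. The sketch below assumes hypothetical group means, reference coefficients, and a hypothetical aggregate explained component; `yun_weights` is an illustrative helper, not a StatsPAI function.

```python
def yun_weights(mean_a, mean_b, beta_ref):
    """w_j = (dx_j * beta_j) / sum_k (dx_k * beta_k); the weights sum to one."""
    contrib = [(xa - xb) * b for xa, xb, b in zip(mean_a, mean_b, beta_ref)]
    total = sum(contrib)
    return [c / total for c in contrib]

# Hypothetical inputs for two covariates (education, experience)
mean_a = [13.2, 18.0]        # covariate means in group a
mean_b = [12.5, 20.0]        # covariate means in group b
beta_ref = [0.08, -0.01]     # reference-group index coefficients

weights = yun_weights(mean_a, mean_b, beta_ref)

# Hypothetical aggregate explained part: E(p_r(X_a)) - E(p_r(X_b))
explained_total = 0.031
explained = [w * explained_total for w in weights]
print(weights, explained)
```

Because the weights sum to one, the detailed contributions always add up to the aggregate explained component, even though the underlying model is nonlinear.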

Parameters:

Name	Type	Default
`model`	`('logit', 'probit')`	`'logit'`
`reference`	`(0, 1)`	`0`
`variant`	`'yun'`	`'yun'`

Examples:

>>> import statspai as sp
>>> df = sp.cps_wage()
>>> res = sp.bauer_sinning(df, y="union", group="female",
...                        x=["education", "experience"], model="logit")
>>> res.method
'Bauer-Sinning (Yun weights)'
>>> bool(abs((res.explained + res.unexplained) - res.gap) < 1e-8)
True

inequality_index ¶

inequality_index(y: ndarray, index: str = 'theil_t', weights: Optional[ndarray] = None, eps: float = 1.0, alpha: Optional[float] = None) -> float

Compute a single inequality index.

Parameters:

Name	Type	Description	Default
`y`	`np.ndarray`	Outcome (e.g. income).	required
`index`	`str`	One of theil_t, theil_l, mld, ge0, ge1, ge2, atkinson, gini, cv2.	`'theil_t'`
`weights`	`ndarray or None`		`None`
`eps`	`float`	Atkinson inequality-aversion parameter.	`1.0`
`alpha`	`float or None`	GE parameter override.	`None`

Examples:

>>> import statspai as sp
>>> import numpy as np
>>> income = np.random.default_rng(0).lognormal(10.0, 0.5, size=500)
>>> g = sp.inequality_index(income, index="gini")
>>> bool(0 < g < 1)
True
>>> bool(sp.inequality_index(income, index="theil_t") > 0)
True

References

[@shorrocks1984inequality]

subgroup_decompose ¶

subgroup_decompose(data: DataFrame, y: str, by: str, index: str = 'theil_t', weights: Optional[Union[str, ndarray]] = None, eps: float = 1.0, alpha: Optional[float] = None) -> SubgroupDecompResult

Subgroup decomposition (between / within) of an inequality index.

Supported for additive GE family (theil_t, theil_l, mld, ge0, ge1, ge2, cv2, atkinson(ε=1)). Gini returns Dagum (1997) Gini_B / Gini_W / Gini_overlap.
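For the Theil T index the between/within split has a closed form. The following is a minimal unweighted sketch in pure Python on simulated wages; the helper names are illustrative and the code is not taken from the library.

```python
import math
import random
from statistics import fmean

def theil_t(y):
    """Theil T: mean of (y/mu) * ln(y/mu)."""
    mu = fmean(y)
    return fmean((v / mu) * math.log(v / mu) for v in y)

def theil_subgroups(y, groups):
    """Return (between, within) components of Theil T for the given grouping."""
    mu, n = fmean(y), len(y)
    cells = {}
    for v, g in zip(y, groups):
        cells.setdefault(g, []).append(v)
    between = within = 0.0
    for vals in cells.values():
        # Income share of the subgroup: population share times relative mean
        share = (len(vals) / n) * (fmean(vals) / mu)
        between += share * math.log(fmean(vals) / mu)
        within += share * theil_t(vals)
    return between, within

rng = random.Random(1)
groups = [rng.randint(0, 1) for _ in range(800)]
wage = [rng.lognormvariate(2.5 + 0.2 * g, 0.5) for g in groups]

between, within = theil_subgroups(wage, groups)
total = theil_t(wage)
print(between, within, total)    # between + within equals total
```

The within term weights each subgroup's own Theil T by its income share, which is the property that makes the GE family exactly additive across subgroups.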

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`		required
`y`	`str`	Outcome.	required
`by`	`str`	Grouping variable.	required
`index`	`str`	Inequality index name.	`'theil_t'`
`weights`	`(str, array or None)`		`None`
`eps`	`float`	Atkinson parameter.	`1.0`
`alpha`	`float or None`	GE parameter override.	`None`

Examples:

>>> import statspai as sp
>>> import numpy as np
>>> df = sp.cps_wage(n=800, seed=1)
>>> df["wage"] = np.exp(df["log_wage"])
>>> res = sp.subgroup_decompose(df, y="wage", by="female", index="theil_t")
>>> bool(res.total > 0)
True
>>> bool(abs(res.between + res.within - res.total) < 1e-6)
True

References

[@shorrocks1984inequality], [@dagum1997approach]

source_decompose ¶

source_decompose(data: DataFrame, sources: Sequence[str], weights: Optional[Union[str, ndarray]] = None) -> SourceDecompResult

Lerman-Yitzhaki (1985) Gini source decomposition.

Total income = Σ sources. Each source's contribution is S_k · R_k · G_k / G_total where S_k is its share of total mean, R_k the Gini correlation with total rank, G_k its own Gini.
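The S_k · R_k · G_k formula can be verified on simulated data. This pure-Python sketch assumes the covariance form of the Gini, G = 2·cov(y, F(y)) / mean(y), with simple fractional ranks and no sample weights; the helper names are illustrative rather than library functions.

```python
import random
from statistics import fmean

def cov(u, v):
    """Population covariance of two equally long sequences."""
    mu, mv = fmean(u), fmean(v)
    return fmean((p - mu) * (q - mv) for p, q in zip(u, v))

def frac_rank(v):
    """Fractional rank in (0, 1]; ties are not averaged (fine for continuous data)."""
    order = sorted(range(len(v)), key=v.__getitem__)
    ranks = [0.0] * len(v)
    for pos, i in enumerate(order, start=1):
        ranks[i] = pos / len(v)
    return ranks

rng = random.Random(0)
labor = [rng.lognormvariate(9.5, 0.4) for _ in range(400)]
capital = [rng.lognormvariate(8.0, 0.7) for _ in range(400)]
total = [l + c for l, c in zip(labor, capital)]

f_total = frac_rank(total)
g_total = 2 * cov(total, f_total) / fmean(total)

contributions = {}
for name, src in {"labor": labor, "capital": capital}.items():
    f_src = frac_rank(src)
    s_k = fmean(src) / fmean(total)              # share of total mean
    g_k = 2 * cov(src, f_src) / fmean(src)       # the source's own Gini
    r_k = cov(src, f_total) / cov(src, f_src)    # Gini correlation with total rank
    contributions[name] = s_k * r_k * g_k / g_total

print(g_total, contributions)    # the contributions sum to one
```

The shares sum to one because covariance is linear in the source incomes, so the total Gini splits exactly across sources.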

Examples:

>>> import statspai as sp
>>> import numpy as np
>>> import pandas as pd
>>> rng = np.random.default_rng(0)
>>> df = pd.DataFrame({
...     "labor": rng.lognormal(9.5, 0.4, 400),
...     "capital": rng.lognormal(8.0, 0.7, 400),
... })
>>> res = sp.source_decompose(df, sources=["labor", "capital"])
>>> bool(0 < res.total_gini < 1)
True
>>> len(res.sources)
2

References

[@lerman1985income]

shapley_inequality ¶

shapley_inequality(data: DataFrame, y: str, x: Sequence[str], index: str = 'theil_t', weights: Optional[Union[str, ndarray]] = None) -> ShapleyInequalityResult

Shorrocks-Shapley decomposition of an inequality index across covariates.

For each subset S ⊆ covariates, compute the predicted outcome ŷ_S = X_S · β_S (OLS) and evaluate index I(ŷ_S). The marginal contribution of variable j to I is averaged over all orderings yielding its Shapley value φ_j.
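The averaging over orderings can be sketched with a toy value function. In the block below the subset values stand in for I(ŷ_S) and are made-up numbers; `shapley` is an illustrative helper that enumerates all orderings exactly, without the OLS step or the permutation sampler.

```python
from itertools import permutations

def shapley(players, value):
    """Average each player's marginal contribution over all orderings."""
    phi = dict.fromkeys(players, 0.0)
    orderings = list(permutations(players))
    for order in orderings:
        seen = frozenset()
        for p in order:
            phi[p] += value(seen | {p}) - value(seen)
            seen = seen | {p}
    return {p: total / len(orderings) for p, total in phi.items()}

# Hypothetical index values I(y_hat_S) for every covariate subset S
toy_index = {
    frozenset(): 0.00,
    frozenset({"education"}): 0.10,
    frozenset({"experience"}): 0.04,
    frozenset({"education", "experience"}): 0.12,
}

phi = shapley(["education", "experience"], toy_index.__getitem__)
print(phi)
```

Here education receives (0.10 + 0.08) / 2 = 0.09 and experience (0.02 + 0.04) / 2 = 0.03, and the two values sum to the index of the full model, 0.12.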

Parameters:

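Name	Type	Description	Default
`data`	`DataFrame`	Input dataset.	required
`y`	`str`	Outcome variable name.	required
`x`	`Sequence[str]`	Covariates across which inequality is allocated.	required
`index`	`str`	Inequality index name.	`'theil_t'`
`weights`	`(str, array or None)`		`None`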

Examples:

>>> import statspai as sp
>>> import numpy as np
>>> df = sp.cps_wage(n=600, seed=2)
>>> df["wage"] = np.exp(df["log_wage"])
>>> res = sp.shapley_inequality(
...     df, y="wage", x=["education", "experience"], index="theil_t")
>>> bool(res.total > 0)
True
>>> len(res.shapley)
2

Notes

Combinatorial cost: O(2^|x|). For |x| ≤ 10 this is fine; for larger x the function warns and uses a random permutation sampler.

References

[@shorrocks2013decomposition]

kitagawa_decompose ¶

kitagawa_decompose(data: DataFrame, rate: str, group: str, by: Union[str, Sequence[str]], weights: Optional[str] = None, normalize: str = 'symmetric') -> KitagawaResult

Kitagawa (1955) two-factor rate decomposition.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Tidy data. Either individual-level (aggregated internally) or pre-aggregated cell-level with columns: `group`, `by`, `rate`, optional `weights` (population size in each cell).	required
`rate`	`str`	Column holding the category-specific rate (or 0/1 outcome at the individual level).	required
`group`	`str`	Binary group indicator.	required
`by`	`str or list of str`	Category variable(s) defining cells.	required
`weights`	`str or None`	Cell population weights. If None, each row treated as individual-level data (weight = 1).	`None`
`normalize`	`('symmetric', 'a', 'b')`	'a': rate effect evaluated at A's composition 'b': rate effect evaluated at B's composition 'symmetric': average (default)	`'symmetric'`
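As an illustration of the symmetric normalisation, the sketch below applies the classical Kitagawa formulas to pre-aggregated cells: each effect is evaluated at the average of the two groups' values. The cell rates and shares are made-up numbers and `kitagawa` is an illustrative helper. In this symmetric form the two effects already exhaust the gap, so no separate interaction term appears; how the library reports `interaction` is not shown here.

```python
def kitagawa(rates_a, shares_a, rates_b, shares_b):
    """Symmetric two-factor split of the crude-rate gap A - B over common cells."""
    rate_effect = sum(
        (rates_a[c] - rates_b[c]) * (shares_a[c] + shares_b[c]) / 2
        for c in rates_a
    )
    composition_effect = sum(
        (shares_a[c] - shares_b[c]) * (rates_a[c] + rates_b[c]) / 2
        for c in rates_a
    )
    return rate_effect, composition_effect

# Hypothetical cell-level rates and population shares by age group
rates_a = {"young": 0.20, "mid": 0.25, "old": 0.30}
shares_a = {"young": 0.5, "mid": 0.3, "old": 0.2}
rates_b = {"young": 0.30, "mid": 0.35, "old": 0.40}
shares_b = {"young": 0.4, "mid": 0.3, "old": 0.3}

crude_a = sum(rates_a[c] * shares_a[c] for c in rates_a)
crude_b = sum(rates_b[c] * shares_b[c] for c in rates_b)
rate_effect, composition_effect = kitagawa(rates_a, shares_a, rates_b, shares_b)

print(crude_a - crude_b, rate_effect, composition_effect)
```

In this example the crude gap of about −0.11 splits into a rate effect of about −0.10 and a composition effect of about −0.01.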

Examples:

>>> import statspai as sp
>>> import numpy as np
>>> import pandas as pd
>>> rng = np.random.default_rng(0)
>>> rows = []
>>> for grp, base in [(0, 0.2), (1, 0.3)]:
...     for age, share in [('young', 0.5), ('mid', 0.3), ('old', 0.2)]:
...         n = int(400 * share) + (10 if grp == 1 else 0)
...         rate = base + {'young': 0.0, 'mid': 0.05, 'old': 0.1}[age]
...         for _ in range(n):
...             rows.append({'group': grp, 'age': age,
...                          'y': int(rng.random() < rate)})
>>> df = pd.DataFrame(rows)
>>> res = sp.kitagawa_decompose(df, rate='y', group='group', by='age')
>>> # Gap = rate effect + composition effect + interaction (exact identity)
>>> float(round(res.gap - (res.rate_effect + res.composition_effect
...                        + res.interaction), 10))
0.0

References

[@kitagawa1955components]

das_gupta ¶

das_gupta(data_a: DataFrame, data_b: DataFrame, factor_names: Sequence[str]) -> DasGuptaResult

Das Gupta (1993) multi-factor decomposition.

Decomposes the difference in a product-form aggregate into each factor's contribution using symmetric averaging across all possible orderings.

Parameters:

Name	Type	Description	Default
`data_a`	`DataFrame`	Factor values for group A, with the same factor columns as `data_b`. Each row contributes the factor value. The aggregate for each group is computed as Σ_i ∏_f factor_{f,i}. For single-row DataFrames (one population, no stratification) the aggregate is simply ∏_f factor_f.	required
`data_b`	`DataFrame`	Factor values for group B, with the same factor columns as `data_a`. Each row contributes the factor value. The aggregate for each group is computed as Σ_i ∏_f factor_{f,i}. For single-row DataFrames (one population, no stratification) the aggregate is simply ∏_f factor_f.	required
`factor_names`	`list of str`	Factor column names.	required

Notes

Assumes: rate = f_1 * f_2 * ... * f_m (aggregate product form). For additive forms use kitagawa_decompose.
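For the single-row (unstratified) case, symmetric averaging over orderings can be sketched directly: start from population B, switch one factor at a time to A's value, and average each factor's effect over every ordering. The helper name is illustrative and the factor values are hypothetical; the code is a sketch of the idea rather than the library's routine.

```python
from itertools import permutations
from math import prod

def das_gupta_effects(a, b):
    """Average each factor's swap effect (b -> a) over all factor orderings."""
    names = list(a)
    effects = dict.fromkeys(names, 0.0)
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(b)                      # start from population B
        for name in order:
            before = prod(current.values())
            current[name] = a[name]            # switch this factor to A's value
            effects[name] += prod(current.values()) - before
    return {k: v / len(orderings) for k, v in effects.items()}

# Crude birth rate = fertility x share_women (product form)
a = {"fertility": 0.10, "share_women": 0.50}
b = {"fertility": 0.08, "share_women": 0.55}

effects = das_gupta_effects(a, b)
gap = prod(a.values()) - prod(b.values())
print(gap, effects)
```

With two factors this reduces to the familiar formula in which each factor's difference is multiplied by the average of the other factor: 0.525 × 0.02 = 0.0105 for fertility and 0.09 × (−0.05) = −0.0045 for the share of women, summing to the gap of 0.006.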

Examples:

>>> import statspai as sp
>>> import pandas as pd
>>> # Crude birth rate = fertility x share_women (product form).
>>> data_a = pd.DataFrame({'fertility': [0.10], 'share_women': [0.50]})
>>> data_b = pd.DataFrame({'fertility': [0.08], 'share_women': [0.55]})
>>> res = sp.das_gupta(data_a, data_b,
...                    factor_names=['fertility', 'share_women'])
>>> # Factor effects sum to the total gap (exact identity)
>>> float(round(res.gap - res.factor_effects['effect'].sum(), 12))
0.0

References

[@dasgupta1993standardization]

gap_closing ¶

gap_closing(data: DataFrame, y: str, group: str, x: Sequence[str], method: str = 'aipw', target_dist: int = 1, trim: float = 0.001, inference: str = 'analytical', n_boot: int = 299, alpha: float = 0.05, seed: Optional[int] = 12345) -> GapClosingResult

Lundberg (2021) gap-closing estimator.

Computes the counterfactual mean gap that would remain if one group's covariate distribution were shifted to match the other's.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`		required
`y`	`str`	Outcome column name.	required
`group`	`str`	Group indicator column name.	required
`x`	`Sequence[str]`	Covariate column names.	required
`method`	`('regression', 'ipw', 'aipw')`	AIPW is doubly robust (recommended).	`'aipw'`
`target_dist`	`(0, 1)`	1: shift Group A's covariate distribution to match Group B's. 0: shift Group B's to match Group A's.	`1`
`trim`	`float`	Propensity trim.	`0.001`
`inference`	`('analytical', 'bootstrap', 'none')`		`'analytical'`

Examples:

>>> import statspai as sp
>>> import numpy as np
>>> import pandas as pd
>>> rng = np.random.default_rng(0)
>>> n = 600
>>> g = rng.integers(0, 2, n)
>>> x1 = rng.normal(g * 0.6, 1.0)          # covariate differs by group
>>> x2 = rng.normal(0, 1.0, n)
>>> y = 1.0 + 0.8 * x1 + 0.5 * x2 + 0.4 * g + rng.normal(0, 1.0, n)
>>> df = pd.DataFrame({'y': y, 'group': g, 'x1': x1, 'x2': x2})
>>> res = sp.gap_closing(df, y='y', group='group', x=['x1', 'x2'],
...                      method='aipw', inference='none')
>>> # Identity: closed gap = observed gap - counterfactual gap
>>> float(round(res.closed_gap - (res.observed_gap
...                               - res.counterfactual_gap), 10))
0.0

References

[@lundberg2021gap]

mediation_decompose ¶

mediation_decompose(data: DataFrame, y: str, treatment: str, mediator: str, covariates: Optional[Sequence[str]] = None, inference: str = 'analytical', n_boot: int = 299, alpha: float = 0.05, seed: Optional[int] = 12345) -> MediationDecompResult

Linear nested-models mediation decomposition (VanderWeele 2014 four-way simplified to natural direct / indirect under linearity).

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`		required
`y`	`str`	Continuous outcome.	required
`treatment`	`str`	Binary exposure.	required
`mediator`	`str`	Mediator.	required
`covariates`	`list of str or None`		`None`
`inference`	`('analytical', 'bootstrap')`		`'analytical'`

Returns:

Type	Description
`MediationDecompResult`	Result with NDE, NIE, CDE, and proportion mediated.

Notes

Under the purely linear model used here, the controlled direct effect CDE(m*) evaluated at the reference level m* = E[M | A=0] coincides numerically with the natural direct effect (NDE). The cde field is therefore redundant in this implementation — it is retained for API compatibility with VanderWeele's four-way decomposition, but users should not treat it as independent information from nde unless a nonlinear or interaction-heterogeneous extension is added.
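Under linearity and without covariates, the direct and indirect effects can be sketched with the product-of-coefficients construction: the NDE is the coefficient on the exposure in the outcome model, and the NIE is the exposure-to-mediator slope times the mediator-to-outcome slope. The closed-form OLS expressions and variable names below are assumptions for illustration and are not taken from the library's code.

```python
import random
from statistics import fmean

def cov(u, v):
    """Population covariance of two equally long sequences."""
    mu, mv = fmean(u), fmean(v)
    return fmean((p - mu) * (q - mv) for p, q in zip(u, v))

rng = random.Random(3)
n = 500
a = [rng.randint(0, 1) for _ in range(n)]                    # binary exposure
m = [0.8 * ai + rng.gauss(0, 1) for ai in a]                 # A -> M
y = [0.5 * ai + 1.0 * mi + rng.gauss(0, 1) for ai, mi in zip(a, m)]

# Mediator model M ~ A
alpha = cov(a, m) / cov(a, a)

# Outcome model Y ~ A + M: two-regressor OLS slopes in closed form
det = cov(a, a) * cov(m, m) - cov(a, m) ** 2
nde = (cov(a, y) * cov(m, m) - cov(m, y) * cov(a, m)) / det      # coefficient on A
beta_m = (cov(m, y) * cov(a, a) - cov(a, y) * cov(a, m)) / det   # coefficient on M

nie = alpha * beta_m
total = cov(a, y) / cov(a, a)                                # Y ~ A
print(nde, nie, total)           # nde + nie equals total up to rounding
```

The identity total = NDE + NIE holds exactly in sample for OLS, which is the same omitted-variable algebra that underlies the Gelbach decomposition.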

Examples:

Simulate A -> M -> Y with a direct A -> Y path (true NDE = 0.5, NIE = 0.8):

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(3)
>>> n = 500
>>> a = rng.binomial(1, 0.5, size=n)            # binary exposure
>>> m = 0.8 * a + rng.normal(size=n)            # A -> M
>>> y = 0.5 * a + 1.0 * m + rng.normal(size=n)  # direct + mediated
>>> df = pd.DataFrame({'y': y, 'a': a, 'm': m})
>>> result = sp.mediation_decompose(df, y='y', treatment='a',
...                                 mediator='m')
>>> round(result.nde, 2), round(result.nie, 2)
(0.39, 0.78)
>>> round(result.propn_mediated, 2)  # share of total effect via M
0.66

Bootstrap CIs for the decomposition components:

>>> result = sp.mediation_decompose(df, y='y', treatment='a',
...                                 mediator='m',
...                                 inference='bootstrap',
...                                 n_boot=99, seed=1)
>>> result.ci['nie']  # (lower, upper)
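For readers who want to see what a percentile bootstrap interval for the indirect effect involves, here is a minimal sketch in plain NumPy. It is an assumption about the general technique, not a description of how `mediation_decompose` builds `result.ci`; `nie_hat` and `percentile_ci` are hypothetical helpers, and the product-of-coefficients estimator is used for the NIE.

```python
import numpy as np

def nie_hat(a, m, y):
    """Product-of-coefficients NIE from two OLS fits (m ~ a, then y ~ a + m)."""
    ones = np.ones(len(a))
    alpha_a = np.linalg.lstsq(np.column_stack([ones, a]), m, rcond=None)[0][1]
    beta_m = np.linalg.lstsq(np.column_stack([ones, a, m]), y, rcond=None)[0][2]
    return alpha_a * beta_m

def percentile_ci(a, m, y, n_boot=99, alpha=0.05, seed=1):
    """Resample rows with replacement and take empirical quantiles of the estimates."""
    rng = np.random.default_rng(seed)
    n = len(a)
    draws = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # row indices for one bootstrap sample
        draws[b] = nie_hat(a[idx], m[idx], y[idx])
    lo, hi = np.quantile(draws, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

# Same data-generating process as the example above (true NIE = 0.8).
rng = np.random.default_rng(3)
n = 500
a = rng.binomial(1, 0.5, size=n).astype(float)
m = 0.8 * a + rng.normal(size=n)
y = 0.5 * a + 1.0 * m + rng.normal(size=n)

lower, upper = percentile_ci(a, m, y, n_boot=99, alpha=0.05, seed=1)
print(f"NIE percentile CI: ({lower:.2f}, {upper:.2f})")
```

Fixing `seed` makes the interval reproducible. A small `n_boot` such as 99 keeps examples fast but gives noisy interval endpoints.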

disparity_decompose ¶

disparity_decompose(data: DataFrame, y: str, group: str, mediator: str, covariates: Optional[Sequence[str]] = None, target_level: Optional[float] = None) -> DisparityDecompResult

Jackson & VanderWeele (2018) causal disparity decomposition.

Decomposes an observed group disparity in Y into:

- initial disparity: what would remain if mediator M were set to a reference level (e.g. the mean of M in the reference group, group=0).
- mediator-attributable: the complementary share, i.e. total disparity minus initial disparity.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`		required
`y`	`str`	Outcome.	required
`group`	`str`	Binary group (0/1, where 1 = disadvantaged).	required
`mediator`	`str`	Mediator.	required
`covariates`	`list or None`		`None`
`target_level`	`float or None`	Value at which to fix mediator for the "initial" counterfactual. Default: mean of M in reference group (group=0).	`None`

Examples:

>>> import statspai as sp
>>> import numpy as np
>>> import pandas as pd
>>> rng = np.random.default_rng(0)
>>> n = 600
>>> g = rng.integers(0, 2, n)                  # 1 = disadvantaged
>>> m = rng.normal(2.0 - 0.8 * g, 1.0)         # mediator (e.g. education)
>>> y = 50.0 + 5.0 * m + 3.0 * g + rng.normal(0, 2.0, n)
>>> df = pd.DataFrame({'y': y, 'group': g, 'med': m})
>>> res = sp.disparity_decompose(df, y='y', group='group', mediator='med')
>>> # Identity: total = initial + mediator-attributable
>>> round(res.total_disparity - (res.initial_disparity
...                              + res.mediator_attributable), 8)
0.0
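The identity in the example above can be reproduced by hand in the simplest case. The sketch below assumes a linear outcome model with no covariates and no group-by-mediator interaction; whether the library uses this exact specification is an assumption, and all variable names are illustrative. With the mediator fixed at a common level for everyone, the remaining predicted gap is the group coefficient, and the mediator-attributable part is what is left of the total.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 600
g = rng.integers(0, 2, n).astype(float)  # 1 = disadvantaged
m = rng.normal(2.0 - 0.8 * g, 1.0)       # mediator
y = 50.0 + 5.0 * m + 3.0 * g + rng.normal(0, 2.0, n)

# Outcome model y ~ 1 + g + m, fitted by least squares.
X = np.column_stack([np.ones(n), g, m])
b0, b_g, b_m = np.linalg.lstsq(X, y, rcond=None)[0]

total = y[g == 1].mean() - y[g == 0].mean()

# Counterfactual: fix the mediator at the reference-group mean for everyone.
target = m[g == 0].mean()
initial = (b0 + b_g + b_m * target) - (b0 + b_m * target)  # reduces to b_g
mediator_part = total - initial

print(f"total={total:.3f}  initial={initial:.3f}  mediator={mediator_part:.3f}")
```

In this specification the mediator-attributable part equals `b_m` times the group difference in mean M, because OLS residuals average to zero within each group when the model contains an intercept and the group dummy.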

References

jackson2018decomposition

yu_elwert_decompose ¶

yu_elwert_decompose(data: DataFrame, y: str, treatment: str, group: str, x: Sequence[str], *, method: str = 'plugin', inference: str = 'bootstrap', n_boot: int = 499, alpha: float = 0.05, trim: float = 0.005, cluster: Optional[str] = None, seed: Optional[int] = 12345) -> YuElwertResult

Nonparametric causal decomposition of a group disparity.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Long-format panel with one row per observation.	required
`y`	`str`	Name of the (continuous) outcome column.	required
`treatment`	`str`	Binary treatment indicator (0/1).	required
`group`	`str`	Binary group indicator (0/1) — `1` = advantaged / index group.	required
`x`	`sequence of str`	Adjustment covariates (used to identify within-group CATEs).	required
`method`	`('plugin', 'efficient')`	`"plugin"` uses within-cell OLS for outcomes and within-group logit for the propensity and computes plug-in expectations (Yu-Elwert 2025, Section 4.1). `"efficient"` augments each moment with the doubly-robust correction term — recommended when nuisance functions might be misspecified.	`"plugin"`
`inference`	`('bootstrap', 'none')`	`"bootstrap"` returns SEs and percentile CIs from the non-parametric (cluster-aware) bootstrap. `"none"` skips inference.	`"bootstrap"`
`n_boot`	`int`		`499`
`alpha`	`float`	Two-sided significance level; confidence intervals have nominal coverage `1 - alpha`.	`0.05`
`trim`	`float`	Lower/upper clip for fitted propensities (only used in `method="efficient"`).	`0.005`
`cluster`	`str or None`	Column name to use for cluster bootstrap.	`None`
`seed`	`int or None`		`12345`

Returns:

Type	Description
`YuElwertResult`

Notes

Identification requires conditional ignorability of treatment given (R, X), where R is the group indicator and X the adjustment covariates (no unmeasured confounders within group). The framework does not require R itself to be unconfounded, distinguishing it from causal-mediation approaches.

The "selection" component is zero whenever individuals are randomly assigned to treatment (no selection on individual gain) or whenever the CATE is constant within group (no heterogeneity to select on). A non-zero selection term — particularly one of opposite sign in the two groups — flags that targeting differs systematically across groups, often the lever a designer can pull.
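To make the role of the selection term concrete, the sketch below simulates potential outcomes directly (something never observable in real data) and checks an accounting identity of the baseline / prevalence / effect / selection form. The particular reference weights used for the prevalence and effect terms are an assumption made for illustration; they are not taken from the library or verified against the paper. Selection is measured as the within-group covariance between treatment D and the individual effect tau.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000
r = rng.integers(0, 2, size=n)                       # 1 = index group, 0 = comparison group
y0 = 1.0 + 0.5 * r + rng.normal(size=n)              # untreated potential outcome
tau = 0.8 + 0.4 * r + rng.normal(scale=0.5, size=n)  # individual treatment effect

# Group 1 selects into treatment on individual gain; group 0 is treated at random.
p = 1 / (1 + np.exp(-(0.5 * r + 1.0 * r * (tau - tau.mean()))))
d = (rng.uniform(size=n) < p).astype(float)
y = y0 + d * tau                                     # observed outcome

def parts(mask):
    """Within-group means and Cov(D, tau) for the rows selected by mask."""
    e_y0, e_d, e_tau = y0[mask].mean(), d[mask].mean(), tau[mask].mean()
    cov = (d[mask] * tau[mask]).mean() - e_d * e_tau
    return e_y0, e_d, e_tau, cov

y0_a, d_a, tau_a, cov_a = parts(r == 1)
y0_b, d_b, tau_b, cov_b = parts(r == 0)

baseline = y0_a - y0_b               # gap in untreated outcomes
prevalence = tau_b * (d_a - d_b)     # gap in treatment rates
effect = d_a * (tau_a - tau_b)       # gap in average effects
selection = cov_a - cov_b            # gap in selection on gain
gap = y[r == 1].mean() - y[r == 0].mean()

print(f"gap={gap:.3f}  sum of parts={baseline + prevalence + effect + selection:.3f}")
```

The four parts add up to the observed gap exactly, since Y = Y0 + D * tau. Setting `tau` to a constant within each group, or removing the dependence of `p` on `tau`, drives the corresponding covariance to zero (exactly in the first case, up to sampling noise in the second), matching the two conditions stated above.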

Examples:

>>> import statspai as sp
>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(0)
>>> n = 400
>>> r = rng.integers(0, 2, size=n)
>>> x1 = rng.normal(size=n)
>>> x2 = rng.normal(size=n)
>>> p = 1 / (1 + np.exp(-(0.5 * r + 0.3 * x1)))
>>> t = (rng.uniform(size=n) < p).astype(int)
>>> y = (1.0 + 0.5 * x1 + 0.3 * x2 + (0.8 + 0.4 * r) * t
...      + rng.normal(size=n))
>>> df = pd.DataFrame({"y": y, "t": t, "r": r, "x1": x1, "x2": x2})
>>> res = sp.decompose(
...     "yu_elwert", data=df, y="y", treatment="t", group="r",
...     x=["x1", "x2"], inference="none",
... )
>>> type(res).__name__
'YuElwertResult'
>>> res.method
'plugin'

References

yu2025nonparametric

decompose ¶

decompose(method: str, /, **kwargs: Any) -> Any

Unified entry point for all decomposition methods.

Parameters:

Name	Type	Description	Default
`method`	`str`	One of the methods listed in `available_methods()`. Aliases are supported (e.g. 'mm' → 'machado_mata').	required
`**kwargs`	`Any`	Method-specific keyword arguments (see individual function signatures for details).	`{}`

Returns:

Type	Description
`Any`	Method-specific result class with `.summary()`, `.plot()`, `.to_latex()`, `._repr_html_()`.

Examples:

>>> import statspai as sp
>>> df = sp.decomposition.datasets.cps_wage()
>>> r = sp.decompose('oaxaca', data=df, y='log_wage', group='female',
...                  x=['education', 'experience', 'tenure'])
>>> r.summary()

>>> r = sp.decompose('ffl', data=df, y='log_wage', group='female',
...                  x=['education', 'experience', 'tenure'],
...                  stat='quantile', tau=0.5)
>>> r.summary()

>>> # NOTE: ``method='aipw'`` below is passed through to
>>> # ``gap_closing``'s own ``method`` parameter; the dispatcher's
>>> # own method arg is positional-only so there is no collision.
>>> r = sp.decompose('gap_closing', data=df, y='log_wage',
...                  group='female',
...                  x=['education', 'experience', 'tenure'],
...                  method='aipw')

Convention warning for dfl vs machado_mata / melly / cfm: reference=0 has different semantics across method families (reweighting vs coefficient-swap). See the per-method docstrings before comparing composition/structure estimates across methods.
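The note in the example above about the positional-only `method` argument can be illustrated with a toy dispatcher. Everything here except the `decompose(method, /, **kwargs)` signature shape is hypothetical: `_REGISTRY`, `_ALIASES`, `register`, and the two `toy_*` functions are stand-ins, not the library's implementation.

```python
from typing import Any, Callable, Dict

_REGISTRY: Dict[str, Callable[..., Any]] = {}  # hypothetical name -> function map
_ALIASES = {"mm": "machado_mata"}              # hypothetical alias table

def register(name: str):
    """Decorator that stores a function in the registry under the given name."""
    def deco(fn):
        _REGISTRY[name] = fn
        return fn
    return deco

@register("gap_closing")
def toy_gap_closing(*, data, method="regression"):
    # This function has its own `method` keyword (the estimator choice).
    return {"estimator": method, "n": len(data)}

@register("machado_mata")
def toy_machado_mata(*, data):
    return {"estimator": "machado_mata", "n": len(data)}

def decompose(method: str, /, **kwargs: Any) -> Any:
    # `method` is positional-only, so a keyword named `method` lands in kwargs
    # and is forwarded untouched to the selected function.
    name = _ALIASES.get(method, method)
    if name not in _REGISTRY:
        raise ValueError(f"unknown method {method!r}; choose from {sorted(_REGISTRY)}")
    return _REGISTRY[name](**kwargs)

out = decompose("gap_closing", data=[1, 2, 3], method="aipw")
print(out)
```

Without the `/` in the signature, the call would fail with a `TypeError` for multiple values of `method`. With it, the first positional argument selects the decomposition and the keyword selects the estimator.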

available_methods ¶

available_methods() -> list[str]

Return list of all registered decomposition method names.

Examples:

>>> import statspai as sp
>>> methods = sp.available_methods()
>>> bool("oaxaca" in methods)
True
>>> bool("ffl" in methods and "rif" in methods)
True

cps_wage ¶

cps_wage(n: int = 3000, seed: Optional[int] = 42) -> DataFrame

CPS-style wage data with a gender gap.

Columns:

- female : int {0, 1}
- education : years
- experience : years
- tenure : years
- union : int {0, 1}
- married : int {0, 1}
- log_wage : float

Examples:

>>> import statspai as sp
>>> df = sp.cps_wage()
>>> df.shape
(3000, 7)
>>> sorted(df["female"].unique().tolist())
[0, 1]

chilean_households ¶

chilean_households(n: int = 2500, seed: Optional[int] = 42) -> DataFrame

Chilean-style household income with urban/rural gap.

Columns:

- rural : int {0, 1}
- head_education : years
- head_age : years
- household_size : int
- log_income : float

Examples:

>>> import statspai as sp
>>> df = sp.chilean_households()
>>> df.shape
(2500, 5)
>>> "log_income" in df.columns
True

mincer_wage_panel ¶

mincer_wage_panel(n: int = 5000, seed: Optional[int] = 42) -> DataFrame

Two-period Mincer wage distribution with a structural shift.

Useful for DFL / FFL examples where Group 0 is early period and Group 1 is late period.

Columns:

- period : int {0, 1}
- education : years
- experience : years
- union : int
- occupation_high_skill : int
- log_wage : float

Examples:

>>> import statspai as sp
>>> df = sp.mincer_wage_panel()
>>> df.shape
(5000, 6)
>>> sorted(df["period"].unique().tolist())
[0, 1]

disparity_panel ¶

disparity_panel(n: int = 3000, seed: Optional[int] = 42) -> DataFrame

Synthetic disparity panel with treatment, mediator, outcome.

Columns:

- group : int {0, 1} (disadvantaged = 1)
- education : years (mediator)
- parent_income : float (confounder)
- age : years (confounder)
- income : float (outcome)

Examples:

>>> import statspai as sp
>>> df = sp.disparity_panel()
>>> df.shape
(3000, 5)
>>> "income" in df.columns
True
