`statspai.bcf`¶

bcf ¶

Bayesian Causal Forest (BCF) for heterogeneous treatment effects.

Decomposes the outcome into a prognostic function mu(X) and a treatment effect function tau(X), each modeled by BART:

Y_i = mu(X_i) + tau(X_i) * D_i + epsilon_i

This separation allows regularization-induced confounding (RIC) to be mitigated, producing better CATE estimates than standard BART.

References

Hahn, P. R., Murray, J. S., & Carvalho, C. M. (2020). Bayesian Regression Tree Models for Causal Inference: Regularization, Confounding, and Heterogeneous Effects. Bayesian Analysis, 15(3), 965-1056. [@hahn2020bayesian]

BayesianCausalForest ¶

Bayesian Causal Forest estimator.

Lower-level engine behind :func:sp.bcf; fit() returns the same :class:~statspai.core.results.CausalResult.

Parameters:

Name	Type	Default
`data`	`DataFrame`	required
`y`	`str`	required
`treat`	`str`	required
`covariates`	`list of str`	required
`n_trees_mu`	`int`	`200`
`n_trees_tau`	`int`	`50`
`n_bootstrap`	`int`	`200`
`n_folds`	`int`	`5`
`alpha`	`float`	`0.05`
`random_state`	`int`	`42`

Examples:

>>> import statspai as sp
>>> import numpy as np, pandas as pd
>>> from statspai.bcf.bcf import BayesianCausalForest
>>> rng = np.random.default_rng(0)
>>> n = 120
>>> x1 = rng.normal(size=n)
>>> x2 = rng.normal(size=n)
>>> e = 1 / (1 + np.exp(-0.5 * x1))
>>> treat = (rng.uniform(size=n) < e).astype(int)
>>> y = 1.0 + 0.5 * x1 + x2 + treat * (1.0 + x1) + rng.normal(size=n)
>>> df = pd.DataFrame({"y": y, "treat": treat, "x1": x1, "x2": x2})
>>> est = BayesianCausalForest(
...     data=df, y="y", treat="treat", covariates=["x1", "x2"],
...     n_trees_mu=50, n_trees_tau=25, n_bootstrap=25, n_folds=2,
...     random_state=0,
... )
>>> result = est.fit()
>>> type(result).__name__
'CausalResult'
>>> result.model_info["cate"].shape
(120,)

fit ¶

fit() -> CausalResult

Fit BCF and return treatment effect estimates.

effect ¶

effect(X_new: Optional[ndarray] = None) -> ndarray

Predict CATE for new observations.

BCFLongResult `dataclass` ¶

Bases: ResultProtocolMixin

Longitudinal BCF output container.

Holds per-time and average treatment effects plus per-(unit, time) CATE estimates produced by :func:bcf_longitudinal.

Examples:

>>> import statspai as sp
>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(53)
>>> rows = []
>>> for u in range(40):
...     x1, x2, u_eff = rng.normal(), rng.normal(), rng.normal(0, 0.3)
...     for t in range(4):
...         d = int(rng.uniform() < 0.5)
...         y = (0.5 * x1 + 0.3 * x2 + 1.5 * d + u_eff
...              + 0.2 * t + rng.normal(0, 0.3))
...         rows.append({"unit": u, "time": t, "y": y, "d": d,
...                      "x1": x1, "x2": x2})
>>> df = pd.DataFrame(rows)
>>> res = sp.bcf_longitudinal(
...     df, outcome="y", treatment="d", unit="unit", time="time",
...     covariates=["x1", "x2"], n_trees_mu=80, n_trees_tau=30,
...     n_bootstrap=30, random_state=53,
... )
>>> res.average_ate  # ATE averaged over time
>>> res.per_time_ate  # per-time ATE table

to_dict ¶

to_dict() -> dict

JSON-safe dict of every field (agent-native serialization).

BCFOrdinalResult `dataclass` ¶

Bases: ResultProtocolMixin

Output of :func:bcf_ordinal.

Attributes:

Name	Type	Description
`cate`	`DataFrame`	Per-unit CATE estimates `tau_k(x_i)` for each dose level `k = 1..K` (one column per level).
`cate_se`	`DataFrame`	Bootstrap standard errors with the same shape.
`ate`	`Series`	Aggregate ATE(k) = E_i[tau_k(X_i)].
`ate_se`	`Series`	Standard error of `ate[k]` via pooled bootstrap.
`ate_ci`	`DataFrame`	Lower/upper 95% CI per dose level.
`levels`	`list`	Sorted unique dose levels actually observed.
`method`	`str`
`diagnostics`	`dict`

Examples:

>>> import statspai as sp
>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(0)
>>> n = 400
>>> X = rng.normal(size=(n, 3))
>>> T = rng.integers(0, 4, size=n)  # 4 ordered dose levels
>>> Y = X[:, 0] + 0.5 * T + rng.normal(0, 0.3, n)
>>> df = pd.DataFrame(
...     {"Y": Y, "T": T, "X0": X[:, 0], "X1": X[:, 1], "X2": X[:, 2]}
... )
>>> res = sp.bcf_ordinal(
...     df, y="Y", treat="T", covariates=["X0", "X1", "X2"],
...     n_bootstrap=50, random_state=0,
... )
>>> res.ate  # ATE(k) vs baseline k=0, one row per dose
>>> res.levels  # sorted dose levels observed

to_dict ¶

to_dict() -> dict

JSON-safe dict of every field (agent-native serialization).

BCFFactorExposureResult `dataclass` ¶

Bases: ResultProtocolMixin

Output of :func:bcf_factor_exposure.

Holds the PCA factor loadings/scores, per-factor ATEs, and the aggregated total mixture effect produced by :func:bcf_factor_exposure.

Examples:

>>> import statspai as sp
>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(0)
>>> n, p = 300, 10
>>> X = rng.normal(size=(n, 3))
>>> Z = rng.normal(size=(n, p))
>>> Y = X[:, 0] + Z.sum(axis=1) * 0.1 + rng.normal(size=n)
>>> cols = {f"x{j}": X[:, j] for j in range(3)}
>>> cols.update({f"z{j}": Z[:, j] for j in range(p)})
>>> cols["Y"] = Y
>>> df = pd.DataFrame(cols)
>>> res = sp.bcf_factor_exposure(
...     df, y="Y",
...     exposures=[f"z{j}" for j in range(p)],
...     covariates=[f"x{j}" for j in range(3)],
...     n_factors=3, n_bootstrap=40, random_state=0,
... )
>>> res.total_mixture_ate  # sum of per-factor ATEs
>>> res.per_factor_ate  # per-factor ATE table

to_dict ¶

to_dict() -> dict

JSON-safe dict of every field (agent-native serialization).

bcf_longitudinal ¶

bcf_longitudinal(data: DataFrame, *, outcome: str, treatment: str, unit: str, time: str, covariates: Sequence[str], n_trees_mu: int = 200, n_trees_tau: int = 50, n_bootstrap: int = 100, alpha: float = 0.05, random_state: int = 42) -> BCFLongResult

Longitudinal Bayesian Causal Forest (BCFLong).

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Long-format panel. `(unit, time)` should be unique per row.	required
`outcome`	`str`		required
`treatment`	`str`		required
`unit`	`str`		required
`time`	`str`		required
`covariates`	`sequence of str`		required
`n_trees_mu`	`int`	BCF regularisation: `n_trees_tau < n_trees_mu` shrinks toward homogeneous effects.	`200`
`n_trees_tau`	`int`	BCF regularisation: `n_trees_tau < n_trees_mu` shrinks toward homogeneous effects.	`200`
`n_bootstrap`	`int`		`100`
`alpha`	`float`		`0.05`
`random_state`	`int`		`42`

Returns:

Type	Description
`BCFLongResult`

Notes

The hierarchy is handled by iteratively:

Estimate unit random intercepts u_i as the unit-demeaned average of current residuals.
Fit the mu/tau forests at each time slice on Y - u_i.
Update residuals and iterate (2 passes suffice in practice).

Point estimates come from the final pass; uncertainty comes from nonparametric unit-level cluster-bootstrap — the correct resampling unit for panel data.

References

Prevot, Häring, Nichols, Holmes & Ganjgahi (arXiv:2508.08418, 2025). [@prevot2025hierarchical] Hahn, Murray, Carvalho (2020), Bayesian Analysis.

Examples:

>>> import statspai as sp
>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(53)
>>> rows = []
>>> for u in range(40):
...     x1, x2, u_eff = rng.normal(), rng.normal(), rng.normal(0, 0.3)
...     for t in range(4):
...         d = int(rng.uniform() < 0.5)
...         y = (0.5 * x1 + 0.3 * x2 + 1.5 * d + u_eff
...              + 0.2 * t + rng.normal(0, 0.3))
...         rows.append({"unit": u, "time": t, "y": y, "d": d,
...                      "x1": x1, "x2": x2})
>>> df = pd.DataFrame(rows)
>>> res = sp.bcf_longitudinal(
...     df, outcome="y", treatment="d", unit="unit", time="time",
...     covariates=["x1", "x2"], n_trees_mu=80, n_trees_tau=30,
...     n_bootstrap=30, random_state=53,
... )
>>> res.per_time_ate

bcf_ordinal ¶

bcf_ordinal(data: DataFrame, *, y: str, treat: str, covariates: CovariatesArg, baseline: Any = None, n_trees_mu: int = 200, n_trees_tau: int = 50, n_bootstrap: int = 100, n_folds: int = 5, alpha: float = 0.05, random_state: int = 42) -> BCFOrdinalResult

BCF for an ordered multi-level treatment.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`		required
`y`	`str`	Outcome column.	required
`treat`	`str`	Integer/ordered-categorical treatment column with values `0, 1, ..., K`.	required
`covariates`	`sequence of str`	Pre-treatment covariate columns.	required
`baseline`	`any`	Dose level used as the reference `k=0`. Defaults to the smallest observed value of `treat`.	`None`
`n_trees_mu`	`int`	Forwarded to :func:`sp.bcf.bcf` for each incremental BCF fit.	`200`
`n_trees_tau`	`int`	Forwarded to :func:`sp.bcf.bcf` for each incremental BCF fit.	`200`
`n_bootstrap`	`int`	Forwarded to :func:`sp.bcf.bcf` for each incremental BCF fit.	`200`
`n_folds`	`int`	Forwarded to :func:`sp.bcf.bcf` for each incremental BCF fit.	`200`
`alpha`	`int`	Forwarded to :func:`sp.bcf.bcf` for each incremental BCF fit.	`200`
`random_state`	`int`	Forwarded to :func:`sp.bcf.bcf` for each incremental BCF fit.	`200`

Returns:

Type	Description
`BCFOrdinalResult`

Notes

The cumulative decomposition assumes monotone-increasing exposure — i.e. the dose-response is well defined through a single ordering. For unordered multi-valued treatments, use :func:sp.multi_treatment.

Examples:

>>> import statspai as sp
>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(0)
>>> n = 800
>>> X = rng.normal(size=(n, 3))
>>> T = rng.integers(0, 4, size=n)  # 4 dose levels
>>> tau = 0.5 * T + 0.3 * X[:, 0] * T   # heterogeneous dose response
>>> Y = X[:, 0] + tau + rng.normal(0, 0.3, n)
>>> df = pd.DataFrame({'Y': Y, 'T': T, 'X0': X[:, 0], 'X1': X[:, 1], 'X2': X[:, 2]})
>>> res = sp.bcf_ordinal(df, y='Y', treat='T',
...                     covariates=['X0', 'X1', 'X2'],
...                     n_bootstrap=50, random_state=0)
>>> sorted(res.levels) == [0, 1, 2, 3]
True

bcf_factor_exposure ¶

bcf_factor_exposure(data: DataFrame, *, y: str, exposures: Sequence[str], covariates: Sequence[str], n_factors: int = 3, binarize: str = 'median', loadings: Optional[Union[DataFrame, ndarray]] = None, n_trees_mu: int = 200, n_trees_tau: int = 50, n_bootstrap: int = 100, n_folds: int = 5, alpha: float = 0.05, random_state: int = 42) -> BCFFactorExposureResult

BCF on PCA-factor scores of a high-dimensional exposure vector.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`		required
`y`	`str`	Outcome column.	required
`exposures`	`sequence of str`	High-dimensional exposure columns. Standardised internally.	required
`covariates`	`sequence of str`	Pre-exposure confounders.	required
`n_factors`	`int`	Number of PCA factors to keep. Ignored if `loadings` is given.	`3`
`binarize`	`('median', 'zero', 'none')`	How to turn each factor score into a binary "high vs low" exposure for the BCF step: `'median'` — split at the factor median; `'zero''` — split at 0 (post-centering); `'none'` — keep the score continuous and pass as a numeric treatment (uses a simple linearised BCF via a continuous-T wrapper; for cleaner continuous-T inference use :func:`sp.dose_response`).	`'median'`
`loadings`	`DataFrame or ndarray`	Pre-computed exposure loadings (`exposures × factors`). If provided, overrides PCA; useful when the user has a theoretical factor structure (dietary patterns, polygenic scores, etc.).	`None`
`n_trees_mu`	`int`	Forwarded to :func:`sp.bcf.bcf` for each factor's BCF.	`200`
`n_trees_tau`	`int`	Forwarded to :func:`sp.bcf.bcf` for each factor's BCF.	`200`
`n_bootstrap`	`int`	Forwarded to :func:`sp.bcf.bcf` for each factor's BCF.	`200`
`n_folds`	`int`	Forwarded to :func:`sp.bcf.bcf` for each factor's BCF.	`200`
`alpha`	`int`	Forwarded to :func:`sp.bcf.bcf` for each factor's BCF.	`200`
`random_state`	`int`	Forwarded to :func:`sp.bcf.bcf` for each factor's BCF.	`200`

Returns:

Type	Description
`BCFFactorExposureResult`

Notes

The total mixture ATE is the sum of per-factor ATEs. Variance is aggregated under a local independence assumption across factors — which is exact when factors come from an orthonormal SVD but is only an approximation under user-supplied (non-orthogonal) loadings.

Examples:

>>> import statspai as sp, numpy as np, pandas as pd
>>> rng = np.random.default_rng(0)
>>> n, p = 500, 10
>>> X = rng.normal(size=(n, 3))
>>> Z = rng.normal(size=(n, p))
>>> Y = X[:, 0] + Z.sum(axis=1) * 0.1 + rng.normal(size=n)
>>> cols = {f'x{j}': X[:, j] for j in range(3)}
>>> cols.update({f'z{j}': Z[:, j] for j in range(p)})
>>> cols['Y'] = Y
>>> df = pd.DataFrame(cols)
>>> res = sp.bcf_factor_exposure(
...     df, y='Y',
...     exposures=[f'z{j}' for j in range(p)],
...     covariates=[f'x{j}' for j in range(3)],
...     n_factors=3, n_bootstrap=40, random_state=0,
... )
>>> len(res.factor_loadings.columns) == 3
True

statspai.bcf¶

bcf ¶

BayesianCausalForest ¶

fit ¶

effect ¶

BCFLongResult dataclass ¶

to_dict ¶

BCFOrdinalResult dataclass ¶

to_dict ¶

BCFFactorExposureResult dataclass ¶

to_dict ¶

bcf_longitudinal ¶

bcf_ordinal ¶

bcf_factor_exposure ¶

`statspai.bcf`¶

BCFLongResult `dataclass` ¶

BCFOrdinalResult `dataclass` ¶

BCFFactorExposureResult `dataclass` ¶