Skip to content

statspai.bcf

bcf

Bayesian Causal Forest (BCF) for heterogeneous treatment effects.

Decomposes the outcome into a prognostic function mu(X) and a treatment effect function tau(X), each modeled by BART:

Y_i = mu(X_i) + tau(X_i) * D_i + epsilon_i

This separation allows regularization-induced confounding (RIC) to be mitigated, producing better CATE estimates than standard BART.

References

Hahn, P. R., Murray, J. S., & Carvalho, C. M. (2020). Bayesian Regression Tree Models for Causal Inference: Regularization, Confounding, and Heterogeneous Effects. Bayesian Analysis, 15(3), 965-1056. [@hahn2020bayesian]

BayesianCausalForest

Bayesian Causal Forest estimator.

Parameters:

Name Type Description Default
data DataFrame
required
y str
required
treat str
required
covariates list of str
required
n_trees_mu int
200
n_trees_tau int
50
n_bootstrap int
200
n_folds int
5
alpha float
0.05
random_state int
42

fit

fit() -> CausalResult

Fit BCF and return treatment effect estimates.

effect

effect(X_new: Optional[ndarray] = None) -> ndarray

Predict CATE for new observations.

BCFLongResult dataclass

Longitudinal BCF output container.

BCFOrdinalResult dataclass

Output of :func:bcf_ordinal.

Attributes:

Name Type Description
cate DataFrame

Per-unit CATE estimates tau_k(x_i) for each dose level k = 1..K (one column per level).

cate_se DataFrame

Bootstrap standard errors with the same shape.

ate Series

Aggregate ATE(k) = E_i[tau_k(X_i)].

ate_se Series

Standard error of ate[k] via pooled bootstrap.

ate_ci DataFrame

Lower/upper 95% CI per dose level.

levels list

Sorted unique dose levels actually observed.

method str
diagnostics dict

BCFFactorExposureResult dataclass

Output of :func:bcf_factor_exposure.

bcf_longitudinal

bcf_longitudinal(data: DataFrame, *, outcome: str, treatment: str, unit: str, time: str, covariates: Sequence[str], n_trees_mu: int = 200, n_trees_tau: int = 50, n_bootstrap: int = 100, alpha: float = 0.05, random_state: int = 42) -> BCFLongResult

Longitudinal Bayesian Causal Forest (BCFLong).

Parameters:

Name Type Description Default
data DataFrame

Long-format panel. (unit, time) should be unique per row.

required
outcome str
required
treatment str
required
unit str
required
time str
required
covariates sequence of str
required
n_trees_mu int

BCF regularisation: n_trees_tau < n_trees_mu shrinks toward homogeneous effects.

200
n_trees_tau int

BCF regularisation: n_trees_tau < n_trees_mu shrinks toward homogeneous effects.

200
n_bootstrap int
100
alpha float
0.05
random_state int
42

Returns:

Type Description
BCFLongResult
Notes

The hierarchy is handled by iteratively:

  1. Estimate unit random intercepts u_i as the unit-demeaned average of current residuals.
  2. Fit the mu/tau forests at each time slice on Y - u_i.
  3. Update residuals and iterate (2 passes suffice in practice).

Point estimates come from the final pass; uncertainty comes from nonparametric unit-level cluster-bootstrap — the correct resampling unit for panel data.

References

Prevot, Häring, Nichols, Holmes & Ganjgahi (arXiv:2508.08418, 2025). [@prevot2025hierarchical] Hahn, Murray, Carvalho (2020), Bayesian Analysis.

bcf_ordinal

bcf_ordinal(data: DataFrame, *, y: str, treat: str, covariates: Sequence[str], baseline: Any = None, n_trees_mu: int = 200, n_trees_tau: int = 50, n_bootstrap: int = 100, n_folds: int = 5, alpha: float = 0.05, random_state: int = 42) -> BCFOrdinalResult

BCF for an ordered multi-level treatment.

Parameters:

Name Type Description Default
data DataFrame
required
y str

Outcome column.

required
treat str

Integer/ordered-categorical treatment column with values 0, 1, ..., K.

required
covariates sequence of str

Pre-treatment covariate columns.

required
baseline any

Dose level used as the reference k=0. Defaults to the smallest observed value of treat.

None
n_trees_mu int

Forwarded to :func:sp.bcf.bcf for each incremental BCF fit.

200
n_trees_tau int

Forwarded to :func:sp.bcf.bcf for each incremental BCF fit.

200
n_bootstrap int

Forwarded to :func:sp.bcf.bcf for each incremental BCF fit.

200
n_folds int

Forwarded to :func:sp.bcf.bcf for each incremental BCF fit.

200
alpha int

Forwarded to :func:sp.bcf.bcf for each incremental BCF fit.

200
random_state int

Forwarded to :func:sp.bcf.bcf for each incremental BCF fit.

200

Returns:

Type Description
BCFOrdinalResult
Notes

The cumulative decomposition assumes monotone-increasing exposure — i.e. the dose-response is well defined through a single ordering. For unordered multi-valued treatments, use :func:sp.multi_treatment.

Examples:

>>> import statspai as sp
>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(0)
>>> n = 800
>>> X = rng.normal(size=(n, 3))
>>> T = rng.integers(0, 4, size=n)  # 4 dose levels
>>> tau = 0.5 * T + 0.3 * X[:, 0] * T   # heterogeneous dose response
>>> Y = X[:, 0] + tau + rng.normal(0, 0.3, n)
>>> df = pd.DataFrame({'Y': Y, 'T': T, 'X0': X[:, 0], 'X1': X[:, 1], 'X2': X[:, 2]})
>>> res = sp.bcf_ordinal(df, y='Y', treat='T',
...                     covariates=['X0', 'X1', 'X2'],
...                     n_bootstrap=50, random_state=0)
>>> sorted(res.levels) == [0, 1, 2, 3]
True

bcf_factor_exposure

bcf_factor_exposure(data: DataFrame, *, y: str, exposures: Sequence[str], covariates: Sequence[str], n_factors: int = 3, binarize: str = 'median', loadings: Optional[Union[DataFrame, ndarray]] = None, n_trees_mu: int = 200, n_trees_tau: int = 50, n_bootstrap: int = 100, n_folds: int = 5, alpha: float = 0.05, random_state: int = 42) -> BCFFactorExposureResult

BCF on PCA-factor scores of a high-dimensional exposure vector.

Parameters:

Name Type Description Default
data DataFrame
required
y str

Outcome column.

required
exposures sequence of str

High-dimensional exposure columns. Standardised internally.

required
covariates sequence of str

Pre-exposure confounders.

required
n_factors int

Number of PCA factors to keep. Ignored if loadings is given.

3
binarize ('median', 'zero', 'none')

How to turn each factor score into a binary "high vs low" exposure for the BCF step: 'median' — split at the factor median; 'zero'' — split at 0 (post-centering); 'none' — keep the score continuous and pass as a numeric treatment (uses a simple linearised BCF via a continuous-T wrapper; for cleaner continuous-T inference use :func:sp.dose_response).

'median'
loadings DataFrame or ndarray

Pre-computed exposure loadings (exposures × factors). If provided, overrides PCA; useful when the user has a theoretical factor structure (dietary patterns, polygenic scores, etc.).

None
n_trees_mu int

Forwarded to :func:sp.bcf.bcf for each factor's BCF.

200
n_trees_tau int

Forwarded to :func:sp.bcf.bcf for each factor's BCF.

200
n_bootstrap int

Forwarded to :func:sp.bcf.bcf for each factor's BCF.

200
n_folds int

Forwarded to :func:sp.bcf.bcf for each factor's BCF.

200
alpha int

Forwarded to :func:sp.bcf.bcf for each factor's BCF.

200
random_state int

Forwarded to :func:sp.bcf.bcf for each factor's BCF.

200

Returns:

Type Description
BCFFactorExposureResult
Notes

The total mixture ATE is the sum of per-factor ATEs. Variance is aggregated under a local independence assumption across factors — which is exact when factors come from an orthonormal SVD but is only an approximation under user-supplied (non-orthogonal) loadings.

Examples:

>>> import statspai as sp, numpy as np, pandas as pd
>>> rng = np.random.default_rng(0)
>>> n, p = 500, 10
>>> X = rng.normal(size=(n, 3))
>>> Z = rng.normal(size=(n, p))
>>> Y = X[:, 0] + Z.sum(axis=1) * 0.1 + rng.normal(size=n)
>>> cols = {f'x{j}': X[:, j] for j in range(3)}
>>> cols.update({f'z{j}': Z[:, j] for j in range(p)})
>>> cols['Y'] = Y
>>> df = pd.DataFrame(cols)
>>> res = sp.bcf_factor_exposure(
...     df, y='Y',
...     exposures=[f'z{j}' for j in range(p)],
...     covariates=[f'x{j}' for j in range(3)],
...     n_factors=3, n_bootstrap=40, random_state=0,
... )
>>> len(res.factor_loadings.columns) == 3
True