statspai.bcf¶
bcf ¶
Bayesian Causal Forest (BCF) for heterogeneous treatment effects.
Decomposes the outcome into a prognostic function mu(X) and a treatment effect function tau(X), each modeled by BART:
Y_i = mu(X_i) + tau(X_i) * D_i + epsilon_i
This separation allows regularization-induced confounding (RIC) to be mitigated, producing better CATE estimates than standard BART.
References
Hahn, P. R., Murray, J. S., & Carvalho, C. M. (2020). Bayesian Regression Tree Models for Causal Inference: Regularization, Confounding, and Heterogeneous Effects. Bayesian Analysis, 15(3), 965-1056. [@hahn2020bayesian]
BayesianCausalForest ¶
Bayesian Causal Forest estimator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
data
|
DataFrame
|
|
required |
y
|
str
|
|
required |
treat
|
str
|
|
required |
covariates
|
list of str
|
|
required |
n_trees_mu
|
int
|
|
200
|
n_trees_tau
|
int
|
|
50
|
n_bootstrap
|
int
|
|
200
|
n_folds
|
int
|
|
5
|
alpha
|
float
|
|
0.05
|
random_state
|
int
|
|
42
|
BCFLongResult
dataclass
¶
Longitudinal BCF output container.
BCFOrdinalResult
dataclass
¶
Output of :func:bcf_ordinal.
Attributes:
| Name | Type | Description |
|---|---|---|
cate |
DataFrame
|
Per-unit CATE estimates |
cate_se |
DataFrame
|
Bootstrap standard errors with the same shape. |
ate |
Series
|
Aggregate ATE(k) = E_i[tau_k(X_i)]. |
ate_se |
Series
|
Standard error of |
ate_ci |
DataFrame
|
Lower/upper 95% CI per dose level. |
levels |
list
|
Sorted unique dose levels actually observed. |
method |
str
|
|
diagnostics |
dict
|
|
BCFFactorExposureResult
dataclass
¶
Output of :func:bcf_factor_exposure.
bcf_longitudinal ¶
bcf_longitudinal(data: DataFrame, *, outcome: str, treatment: str, unit: str, time: str, covariates: Sequence[str], n_trees_mu: int = 200, n_trees_tau: int = 50, n_bootstrap: int = 100, alpha: float = 0.05, random_state: int = 42) -> BCFLongResult
Longitudinal Bayesian Causal Forest (BCFLong).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
data
|
DataFrame
|
Long-format panel. |
required |
outcome
|
str
|
|
required |
treatment
|
str
|
|
required |
unit
|
str
|
|
required |
time
|
str
|
|
required |
covariates
|
sequence of str
|
|
required |
n_trees_mu
|
int
|
BCF regularisation: |
200
|
n_trees_tau
|
int
|
BCF regularisation: |
200
|
n_bootstrap
|
int
|
|
100
|
alpha
|
float
|
|
0.05
|
random_state
|
int
|
|
42
|
Returns:
| Type | Description |
|---|---|
BCFLongResult
|
|
Notes
The hierarchy is handled by iteratively:
- Estimate unit random intercepts
u_ias the unit-demeaned average of current residuals. - Fit the mu/tau forests at each time slice on
Y - u_i. - Update residuals and iterate (2 passes suffice in practice).
Point estimates come from the final pass; uncertainty comes from nonparametric unit-level cluster-bootstrap — the correct resampling unit for panel data.
References
Prevot, Häring, Nichols, Holmes & Ganjgahi (arXiv:2508.08418, 2025). [@prevot2025hierarchical] Hahn, Murray, Carvalho (2020), Bayesian Analysis.
bcf_ordinal ¶
bcf_ordinal(data: DataFrame, *, y: str, treat: str, covariates: Sequence[str], baseline: Any = None, n_trees_mu: int = 200, n_trees_tau: int = 50, n_bootstrap: int = 100, n_folds: int = 5, alpha: float = 0.05, random_state: int = 42) -> BCFOrdinalResult
BCF for an ordered multi-level treatment.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
data
|
DataFrame
|
|
required |
y
|
str
|
Outcome column. |
required |
treat
|
str
|
Integer/ordered-categorical treatment column with values
|
required |
covariates
|
sequence of str
|
Pre-treatment covariate columns. |
required |
baseline
|
any
|
Dose level used as the reference |
None
|
n_trees_mu
|
int
|
Forwarded to :func: |
200
|
n_trees_tau
|
int
|
Forwarded to :func: |
200
|
n_bootstrap
|
int
|
Forwarded to :func: |
200
|
n_folds
|
int
|
Forwarded to :func: |
200
|
alpha
|
int
|
Forwarded to :func: |
200
|
random_state
|
int
|
Forwarded to :func: |
200
|
Returns:
| Type | Description |
|---|---|
BCFOrdinalResult
|
|
Notes
The cumulative decomposition assumes monotone-increasing exposure
— i.e. the dose-response is well defined through a single ordering.
For unordered multi-valued treatments, use
:func:sp.multi_treatment.
Examples:
>>> import statspai as sp
>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(0)
>>> n = 800
>>> X = rng.normal(size=(n, 3))
>>> T = rng.integers(0, 4, size=n) # 4 dose levels
>>> tau = 0.5 * T + 0.3 * X[:, 0] * T # heterogeneous dose response
>>> Y = X[:, 0] + tau + rng.normal(0, 0.3, n)
>>> df = pd.DataFrame({'Y': Y, 'T': T, 'X0': X[:, 0], 'X1': X[:, 1], 'X2': X[:, 2]})
>>> res = sp.bcf_ordinal(df, y='Y', treat='T',
... covariates=['X0', 'X1', 'X2'],
... n_bootstrap=50, random_state=0)
>>> sorted(res.levels) == [0, 1, 2, 3]
True
bcf_factor_exposure ¶
bcf_factor_exposure(data: DataFrame, *, y: str, exposures: Sequence[str], covariates: Sequence[str], n_factors: int = 3, binarize: str = 'median', loadings: Optional[Union[DataFrame, ndarray]] = None, n_trees_mu: int = 200, n_trees_tau: int = 50, n_bootstrap: int = 100, n_folds: int = 5, alpha: float = 0.05, random_state: int = 42) -> BCFFactorExposureResult
BCF on PCA-factor scores of a high-dimensional exposure vector.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
data
|
DataFrame
|
|
required |
y
|
str
|
Outcome column. |
required |
exposures
|
sequence of str
|
High-dimensional exposure columns. Standardised internally. |
required |
covariates
|
sequence of str
|
Pre-exposure confounders. |
required |
n_factors
|
int
|
Number of PCA factors to keep. Ignored if |
3
|
binarize
|
('median', 'zero', 'none')
|
How to turn each factor score into a binary "high vs low" exposure
for the BCF step:
|
'median'
|
loadings
|
DataFrame or ndarray
|
Pre-computed exposure loadings ( |
None
|
n_trees_mu
|
int
|
Forwarded to :func: |
200
|
n_trees_tau
|
int
|
Forwarded to :func: |
200
|
n_bootstrap
|
int
|
Forwarded to :func: |
200
|
n_folds
|
int
|
Forwarded to :func: |
200
|
alpha
|
int
|
Forwarded to :func: |
200
|
random_state
|
int
|
Forwarded to :func: |
200
|
Returns:
| Type | Description |
|---|---|
BCFFactorExposureResult
|
|
Notes
The total mixture ATE is the sum of per-factor ATEs. Variance is aggregated under a local independence assumption across factors — which is exact when factors come from an orthonormal SVD but is only an approximation under user-supplied (non-orthogonal) loadings.
Examples:
>>> import statspai as sp, numpy as np, pandas as pd
>>> rng = np.random.default_rng(0)
>>> n, p = 500, 10
>>> X = rng.normal(size=(n, 3))
>>> Z = rng.normal(size=(n, p))
>>> Y = X[:, 0] + Z.sum(axis=1) * 0.1 + rng.normal(size=n)
>>> cols = {f'x{j}': X[:, j] for j in range(3)}
>>> cols.update({f'z{j}': Z[:, j] for j in range(p)})
>>> cols['Y'] = Y
>>> df = pd.DataFrame(cols)
>>> res = sp.bcf_factor_exposure(
... df, y='Y',
... exposures=[f'z{j}' for j in range(p)],
... covariates=[f'x{j}' for j in range(3)],
... n_factors=3, n_bootstrap=40, random_state=0,
... )
>>> len(res.factor_loadings.columns) == 3
True