Conformal causal inference — the full family¶
"Your CATE point estimate is a number. Your CATE prediction interval is a promise." Conformal inference is the only framework that gives you finite-sample coverage guarantees on the individual treatment effect interval, with no distributional assumptions on the outcome. StatsPAI ships the 2021 Lei-Candès foundation plus the 2024-2026 frontier: covariate shift, density-based HDR, multi-stage, debiased ML, fairness, continuous treatment, and interference.
If the phrase "95% prediction interval" in your CATE output comes from a bootstrap or a Normal approximation, you don't have a coverage guarantee — you have a conjecture about coverage. Conformal inference replaces the conjecture with a theorem: if calibration and test data are exchangeable, empirical coverage hits the nominal level in finite samples. The ten estimators below each loosen one assumption of Lei-Candès 2021 to handle a real-world wrinkle.
Every function is at top level: sp.conformal_cate,
sp.weighted_conformal_prediction, sp.conformal_counterfactual,
sp.conformal_ite_interval, sp.conformal_density_ite,
sp.conformal_ite_multidp, sp.conformal_debiased_ml,
sp.conformal_fair_ite, sp.conformal_continuous,
sp.conformal_interference.
The two things conformal inference promises¶
For a test point x and a target coverage 1 - α:
- Marginal coverage
P(τ(x) ∈ [L(x), U(x)]) ≥ 1 − α— averaged over the random test point and calibration set, the interval contains the truth at least1 − αof the time. - Finite-sample validity — the guarantee holds for any sample
size
n, not asymptotically.
Conditional coverage (P(τ(x) ∈ [L(x), U(x)] | X=x) ≥ 1 − α) is
not promised and is impossible without further assumptions. When
someone says "CQR gives conditional coverage", they mean better
conditional coverage than vanilla split conformal, not literal
conditional validity.
1. Baseline — sp.conformal_cate (Lei-Candès 2021)¶
Split-conformal prediction intervals for the CATE, using separate
outcome models μ_1(x), μ_0(x) and the combined calibration
quantile q_1 + q_0. Default outcome model is gradient boosting.
r = sp.conformal_cate(
data=df, y="y", treat="d",
covariates=["x1", "x2", "x3"],
alpha=0.05,
calib_fraction=0.25,
random_state=42,
)
cate = r.model_info["cate"] # point estimate per row
cate_lower = r.model_info["cate_lower"] # lower bound of PI
cate_upper = r.model_info["cate_upper"] # upper bound
width = r.model_info["interval_width"] # mean width
When to use it: observational or experimental data, binary treatment, random calibration-test split is plausibly exchangeable, you want one number of coverage per observation. Default.
Reference: Lei & Candès (2021), JRSS-B 83(5).
2. Covariate-shift-corrected counterfactuals — sp.conformal_counterfactual¶
The refinement that makes per-arm conformal work under treatment
selection on observables. The treated and control calibration sets
have different covariate distributions; conformal_counterfactual
re-weights each calibration set to the marginal-X distribution using
the propensity score as the Tibshirani-Barber-Candès-Ramdas (2019)
weight:
r = sp.conformal_counterfactual(
data=df, y="y", treat="d", covariates=["x1", "x2"],
X_test=df[["x1", "x2"]].values,
alpha=0.1,
calib_frac=0.3,
)
r.summary() # per-arm coverage target + mean band widths
r.to_frame() # columns: Y1_lower, Y1_upper, Y0_lower, Y0_upper
When to use it: observational data with non-random treatment
assignment, and you want prediction intervals for the counterfactual
potential outcomes Y(1) | X and Y(0) | X (not just their
difference).
References: Lei & Candès (2021) Theorem 1; Tibshirani et al. (2019), NeurIPS.
3. ITE intervals (nested bound) — sp.conformal_ite_interval¶
The Lei-Candès Eq. 3.4 nested bound for the individual treatment
effect τ(x) = Y(1) − Y(0). Built from two per-arm conformal bands,
each at level α/2, so the combined band has finite-sample coverage
≥ 1 − α by a simple union bound:
r = sp.conformal_ite_interval(
data=df, y="y", treat="d", covariates=["x1", "x2"],
alpha=0.1, calib_frac=0.3,
)
r.summary()
r.to_frame() # columns: tau, tau_lower, tau_upper
When to use it: you want a conservative but finite-sample
valid ITE interval. "Conservative" here means "wider than necessary";
that's the price of not needing distributional assumptions. If the
interval is too wide to be useful, try conformal_density_ite or
conformal_debiased_ml next.
Reference: Lei & Candès (2021), Eq. 3.4.
4. Generic covariate-shift conformal — sp.weighted_conformal_prediction¶
Low-level primitive: split-conformal with arbitrary per-calibration
weights. Useful when you know the density ratio
w(x) = f_test(x) / f_train(x) from some external source:
lower, upper, point = sp.weighted_conformal_prediction(
X_train, y_train,
X_calib, y_calib, X_test,
weights_calib=w, # likelihood-ratio weights
model=my_sklearn_model,
alpha=0.1,
)
When to use it: you're layering conformal on top of a custom
pipeline and need to inject weights. Most users don't need this
directly — conformal_counterfactual calls it internally.
Reference: Tibshirani, Barber, Candès, Ramdas (2019), NeurIPS.
5. Conditional-density ITE — sp.conformal_density_ite (2025)¶
When the counterfactual distribution is heavy-tailed or multimodal,
the mean-based conformal interval can be misleadingly wide and
off-center. conformal_density_ite builds a highest-density region
(HDR) from a KDE of the ITE-residual distribution, giving a sharper
(and potentially non-symmetric) interval:
r = sp.conformal_density_ite(
data=df, y="y", treat="d",
covariates=["x1", "x2"],
alpha=0.1,
bandwidth=None, # Silverman's rule; override with a float
seed=0,
)
print(r.intervals) # (n_test, 2)
print(r.point_estimate) # ITE mean per row
When to use it: the counterfactual distribution is visibly
non-Gaussian. A classic tell: r.intervals[:, 1] - r.intervals[:, 0]
is noticeably narrower than what conformal_ite_interval produces,
while coverage holds on a held-out fold.
Citation: arXiv:2501.14933 (2025).
6. Multi-stage ITE (DTR) — sp.conformal_ite_multidp (2025)¶
Dynamic treatment regimes and multi-stage decisions: at stage k, the
treatment is chosen based on history through stage k−1, and the
final ITE accumulates across stages. Joint coverage uses a Bonferroni
correction α/K per stage:
r = sp.conformal_ite_multidp(
data=df,
y_per_stage=["y1", "y2", "y3"],
treat_per_stage=["d1", "d2", "d3"],
history_per_stage=[
["x_base"], # stage 1 features
["x_base", "y1", "d1"], # stage 2 features
["x_base", "y1", "d1", "y2", "d2"],
],
alpha=0.1,
)
print(r.intervals_per_stage) # K arrays of (n_test, 2)
print(r.cumulative_interval) # sum across stages
When to use it: DTR, sequential RCTs (SMART), or any clinical /
business process where the k-th decision depends on the history of
previous outcomes and decisions. Bonferroni is conservative — if K
is large and you need tighter bands, consider per-stage recalibration
with a holdout test set.
Citation: arXiv:2512.08828 (2025).
7. Debiased ML conformal — sp.conformal_debiased_ml (2026)¶
Uses a cross-fit AIPW score as the non-conformity base rather than
raw outcome residuals. This gives the interval double robustness
— the coverage guarantee is preserved even when either the outcome
model or the propensity model is moderately misspecified, as long as
the product error rate is o_p(n^{-1/2}):
r = sp.conformal_debiased_ml(
data=df, y="y", treat="d",
covariates=["x1", "x2"],
alpha=0.1,
n_folds=5,
seed=0,
)
print(r.intervals)
print(r.point_estimate)
When to use it: observational data, you suspect misspecification
in one of your nuisance models, and you want the extra robustness
from AIPW residualization. The hidden cost is n_folds extra model
fits — fine for n ≤ 100k.
Citation: arXiv:2604.03772 (2026).
8. Counterfactual-fair conformal — sp.conformal_fair_ite (2025)¶
Adds a fairness constraint: coverage must be the same across protected subgroups (race, sex, SES). Computes calibration quantiles per group and reports both per-group and pooled intervals:
r = sp.conformal_fair_ite(
data=df, y="y", treat="d",
covariates=["x1", "x2"], # protected is auto-excluded from regression
protected="race",
alpha=0.1,
)
print(r.group_widths) # dict: group → interval width
print(r.group_coverage_targets) # dict: group → nominal coverage
When to use it: regulated applications (healthcare, credit,
hiring, criminal justice) where differential coverage across
protected groups is a fairness violation on top of a statistical one.
When a group has fewer than 5 calibration points, the estimator
falls back to the conservative maximum quantile across well-covered
groups — the per-group coverage guarantee is preserved, at the cost
of a wider interval for the underrepresented group. If all groups
are small, a warnings.warn is emitted and the per-group guarantee
does not hold.
Citation: arXiv:2510.08724 (2025).
9. Continuous-treatment dose response — sp.conformal_continuous (2024)¶
For continuous treatment, the target is the dose-response curve
E[Y | T=t, X] across a dose grid. Produces split-conformal bands at
each t on the grid:
r = sp.conformal_continuous(
data=train_df, y="y", treatment="dose",
covariates=["x1", "x2"],
test_data=test_df,
dose_grid=np.linspace(0, 1, 21),
alpha=0.1,
)
print(r.predictions) # point + band at each test row's observed dose
print(r.dose_curves) # full curve per test row across the grid
When to use it: dose-response analysis, e.g. drug dosing,
fertilizer application, time-in-app. Returns a point estimate + band
at each (test_row, dose) pair in the grid.
Citation: Schröder et al. (arXiv:2407.03094, 2024). [@schroder2024conformal]
10. Cluster-exchangeable conformal — sp.conformal_interference (2021/2025)¶
When units within a cluster interfere (spillover, networks), unit-level exchangeability fails but cluster-level exchangeability can still hold. Computes a cluster-averaged residual score and does split-conformal at the cluster level:
r = sp.conformal_interference(
data=df, y="y", treatment="d", cluster="village",
covariates=["x1", "x2"],
test_clusters=["v101", "v102", "v103"],
alpha=0.1,
)
print(r.predictions) # cluster, prediction, lo, hi
print(r.cluster_scores) # per-calibration-cluster mean |residual|
When to use it: SUTVA is violated within clusters but clusters
themselves are independent (villages, schools, firms). Requires ≥ 4
non-test clusters — anything less and the calibration quantile is
meaningless.
References: Lei, Sesia & Candès (2021); systematic review at arXiv:2509.21660 (2025) uses this as the recommended default under interference.
Decision guide¶
Point + interval for each unit's CATE, binary treatment
→ sp.conformal_cate
I want Y(1) | X and Y(0) | X bands separately
→ sp.conformal_counterfactual
I want τ(x) = Y(1) - Y(0) ITE interval
→ sp.conformal_ite_interval
---
My counterfactual distribution is heavy-tailed / bimodal
→ sp.conformal_density_ite
My decision problem has K sequential stages
→ sp.conformal_ite_multidp
My nuisance models might be misspecified
→ sp.conformal_debiased_ml
I need equal coverage across protected subgroups
→ sp.conformal_fair_ite
---
Treatment is continuous — I want a dose-response curve
→ sp.conformal_continuous
Units interfere within clusters (SUTVA violated inside clusters)
→ sp.conformal_interference
---
I have a custom pipeline and need weighted split-conformal
→ sp.weighted_conformal_prediction (low-level primitive)
Diagnostics every conformal analysis should report¶
- Empirical coverage on a held-out fold. The theoretical guarantee is marginal, not conditional; always check empirically that coverage matches the nominal level on a fresh test split, not just on the calibration fold.
- Mean and median interval width. The guarantee is on coverage;
width is the cost you paid. If median width ≈ the full range of
Yyou are reporting"anything is possible", which is correct but useless. - Per-group coverage (even without
conformal_fair_ite). Even with marginal coverage nominally valid, subgroup coverage can drift badly. Report coverage conditional on key covariates. - Calibration set size. Marginal coverage requires
n_calib ≥ 1/α ≈ 10forα = 0.1. If your calibration set is smaller (common whenconformal_fair_itestrata thin out), the quantileceil((1-α)(n+1)) / nclips to 1 and you get[-∞, +∞]intervals. Checkr.model_info['n_calib']or the equivalent field. - Propensity overlap. For
conformal_counterfactualandconformal_ite_interval, verifyps ∈ [0.02, 0.98]. The TBCR weight1/g(X)explodes near the boundaries, making the band arbitrarily wide.
How to read disagreement¶
If conformal_cate and conformal_ite_interval disagree on coverage
or width, the usual culprit is that conformal_cate uses q_1 + q_0
(sum of per-arm quantiles) while ite_interval uses Δ_1(x) + Δ_0(x)
(sum of per-arm half-widths after splitting α/2 between arms). The
latter is always at least as conservative as the former. If you
want the tightest finite-sample interval and have a well-specified
outcome model, use conformal_cate. If you want the most
defensible theoretical guarantee, use conformal_ite_interval.
If conformal_cate and conformal_debiased_ml give very different
widths: your outcome model is probably misspecified. The AIPW score
has smaller asymptotic variance, so narrower debiased_ml bands
while coverage holds are a good sign.
Current for StatsPAI ≥ 1.5.0. All functions are registered;
sp.describe_function("conformal_ite_interval") returns the hand-
written schema. For dispatcher-style access see sp.conformal(kind=...).
For Agents¶
Pre-conditions - calibration sample disjoint from training sample (auto-split or user-supplied) - exchangeability between calibration and test distributions (weighted variants for covariate shift) - for CATE / ITE variants: unconfoundedness + overlap on covariates - ≥ 500 calibration observations for reliable finite-sample coverage at alpha ≤ 0.1
Identifying assumptions - Exchangeability of calibration and test points (base case) - For kind='weighted': known or estimable density ratio between calibration and test - For kind='cate' / 'ite': selection-on-observables with correct propensity / outcome model - For kind='interference': cluster-exchangeable exchangeability
Failure modes → recovery
| Symptom | Exception | Remedy | Try next |
|---|---|---|---|
| Calibration and test distributions differ (covariate shift) | statspai.AssumptionViolation |
Use kind='weighted' with estimated density ratios. | |
| Calibration set too small — intervals wide | statspai.DataInsufficient |
Increase calibration sample or raise alpha; coverage gets loose below ~100. | |
| Miscalibrated nuisance (propensity / outcome) for CATE/ITE | statspai.AssumptionWarning |
Use kind='debiased' which orthogonalises via DML-style nuisance handling. |
Alternatives (ranked)
- sp.conformal_cate
- sp.weighted_conformal_prediction
- sp.conformal_counterfactual
Typical minimum N: 500