statspai.surrogate

surrogate

Long-term effects via surrogate indices (sp.surrogate).

Industrial A/B tests typically run for only a few weeks, but the quantities of interest (LTV, retention, clinical outcomes) are realized months or years downstream. The surrogate-index framework lets you extrapolate from short-term surrogates to long-term outcomes by combining an experimental sample (with treatment and surrogates) and an observational sample (with both surrogates and the long-term outcome).

Estimators
  • surrogate_index: Athey, Chetty, Imbens & Kang (2019). Classical single-wave surrogate index.
  • long_term_from_short: Tran, Bibaut & Kallus (arXiv:2311.08527, 2023). Long-term effect of long-term treatments from short-term experiments.
  • proximal_surrogate_index: Imbens, Kallus, Mao & Wang (2025, JRSS-B 87(2); arXiv:2202.07234). Proximal identification when unobserved confounders link surrogate and long-term outcome.

References

Athey, S., Chetty, R., Imbens, G. W., & Kang, H. (2019). "The Surrogate Index: Combining Short-Term Proxies to Estimate Long-Term Treatment Effects More Rapidly and Precisely." NBER Working Paper 26463. [@athey2019surrogate]

Imbens, G., Kallus, N., Mao, X., & Wang, Y. (2025). "Long-term Causal Inference Under Persistent Confounding via Data Combination." Journal of the Royal Statistical Society Series B, 87(2), 362-388. arXiv:2202.07234. [@imbens2025long]

SurrogateResult dataclass

Structured container for surrogate-index estimation artefacts.

surrogate_index

surrogate_index(experimental: DataFrame, observational: DataFrame, *, treatment: str, surrogates: Sequence[str], long_term_outcome: str, covariates: Optional[Sequence[str]] = None, model: Union[str, Any] = 'ols', alpha: float = 0.05, n_boot: int = 0, random_state: Optional[int] = None) -> CausalResult

Athey-Chetty-Imbens-Kang surrogate-index estimator for the long-term ATE.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| experimental | DataFrame | Experimental sample. Must contain treatment and surrogates. long_term_outcome need not be present; that is the whole point. | required |
| observational | DataFrame | Observational / historical sample with surrogates and long_term_outcome. Need not contain treatment. | required |
| treatment | str | Name of the binary treatment column in experimental. | required |
| surrogates | sequence of str | Names of short-term surrogates, present in both samples. | required |
| long_term_outcome | str | Name of the long-term outcome column in observational. | required |
| covariates | sequence of str | Optional pre-treatment covariates appended to the surrogate vector. | None |
| model | 'ols' | How to fit E[Y \| S] on the observational sample. | 'ols' |
| alpha | float | | 0.05 |
| n_boot | int | If > 0, use n_boot paired-bootstrap replications instead of the analytic delta-method variance. | 0 |
| random_state | int | | None |

Returns:

| Type | Description |
|---|---|
| CausalResult | estimand='ATE', method='surrogate_index'. |

Notes

The key identifying assumption is surrogacy: Y ⟂ T | S, X. Conditional on the surrogate(s) and covariates, the treatment has no direct effect on the long-term outcome. Randomizing T does not guarantee surrogacy, so it should be defended explicitly (e.g. with placebo long-term outcomes in a validation sample).
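
The mechanics of the single-wave estimator can be sketched with NumPy alone. This is a minimal illustration under stated assumptions, not the statspai implementation: the data are simulated, the effect sizes `gamma` and `delta` are made up, E[Y | S] is fitted by plain OLS, and surrogacy holds by construction because T reaches Y only through S.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_exp = 5000, 5000
gamma = np.array([2.0, -1.0])      # assumed effect of each surrogate on Y
delta = np.array([0.5, 0.3])       # assumed effect of T on each surrogate

# Observational sample: surrogates S and long-term outcome Y, no treatment.
S_obs = rng.normal(size=(n_obs, 2))
Y_obs = 1.0 + S_obs @ gamma + rng.normal(size=n_obs)

# Experimental sample: randomized T and surrogates S, but no Y yet.
T = rng.integers(0, 2, size=n_exp)
S_exp = rng.normal(size=(n_exp, 2)) + np.outer(T, delta)

# Step 1: fit E[Y | S] by OLS on the observational sample.
X_obs = np.column_stack([np.ones(n_obs), S_obs])
coef, *_ = np.linalg.lstsq(X_obs, Y_obs, rcond=None)

# Step 2: predict the surrogate index for every experimental unit.
index_exp = np.column_stack([np.ones(n_exp), S_exp]) @ coef

# Step 3: the long-term ATE estimate is the difference in mean index by arm.
ate_hat = index_exp[T == 1].mean() - index_exp[T == 0].mean()
ate_true = float(delta @ gamma)    # effect implied by the simulation design
print(f"estimated long-term ATE: {ate_hat:.3f} (simulated truth {ate_true:.3f})")
```

In this simulation the surrogate distribution differs between the two samples, while the relationship between Y and S is the same in both, which is what lets the index fitted on observational data transfer to the experiment.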

References

Athey, S., Chetty, R., Imbens, G. W., & Kang, H. (2019). "The Surrogate Index: Combining Short-Term Proxies to Estimate Long-Term Treatment Effects More Rapidly and Precisely." NBER Working Paper 26463. [@athey2019surrogate]

long_term_from_short

long_term_from_short(experimental: DataFrame, observational: DataFrame, *, treatment: str, surrogates_waves: Sequence[Sequence[str]], long_term_outcome: str, covariates: Optional[Sequence[str]] = None, model: Union[str, Any] = 'ols', alpha: float = 0.05, n_boot: int = 200, random_state: Optional[int] = None) -> CausalResult

Long-term ATE under multi-wave short-term surrogates.

Extends the classical surrogate index by chaining K surrogate waves — successive short-term measurements — so you can estimate the effect of a sustained treatment from a short experiment.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| surrogates_waves | sequence of sequences | [wave_1_cols, wave_2_cols, ..., wave_K_cols], where each wave_k_cols is a list of column names. Each of the K waves is assumed to be measured in both samples. | required |

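
As an illustration of the expected layout, the sketch below builds a two-wave `surrogates_waves` value and checks that every listed column exists in both samples. The column names and the `missing_wave_columns` helper are hypothetical and are not part of statspai.

```python
import pandas as pd

# Hypothetical column names: two week-1 surrogates, one week-4 surrogate.
surrogates_waves = [
    ["clicks_w1", "sessions_w1"],   # wave 1
    ["sessions_w4"],                # wave 2 (K = 2)
]

observational = pd.DataFrame({
    "clicks_w1": [3, 0, 5], "sessions_w1": [2, 1, 4],
    "sessions_w4": [1, 0, 3], "ltv_12m": [40.0, 5.0, 90.0],
})
experimental = pd.DataFrame({
    "treated": [1, 0, 1], "clicks_w1": [4, 1, 2],
    "sessions_w1": [3, 1, 2], "sessions_w4": [2, 0, 1],
})

def missing_wave_columns(df, waves):
    """Return wave columns that are absent from df, preserving wave order."""
    return [col for wave in waves for col in wave if col not in df.columns]

# Both lists are empty when every wave is present in both samples.
print(missing_wave_columns(observational, surrogates_waves))
print(missing_wave_columns(experimental, surrogates_waves))
```

A check like this before calling the estimator makes a missing wave fail early with a readable message rather than deep inside a regression step.
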
Notes

Uses the iterated-expectation estimator (Ghassami et al. 2024, Eq. 3):

f_K(s_K) = E[Y | S_K = s_K]                                       in observational
f_k(s_k) = E[f_{k+1}(S_{k+1}) | S_k = s_k],  k = K-1, ..., 1      in observational
ATE = E[ f_1(S_1) | T=1 ] - E[ f_1(S_1) | T=0 ]                   in experimental

Inference is bootstrap-based because the delta-method variance of the iterated estimator is unwieldy in closed form.
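
The backward recursion and the bootstrap can be sketched as follows. This is a minimal illustration, not the library's code: it assumes K = 2 simulated waves, linear regressions for every f_k, and a percentile bootstrap that resamples each sample independently. The helper names (`ols_fit`, `ols_predict`, `iterated_ate`) are hypothetical.

```python
import numpy as np

def ols_fit(X, y):
    """Least-squares coefficients for y on [1, X]."""
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def ols_predict(coef, X):
    """Fitted values for the design [1, X]."""
    return np.column_stack([np.ones(len(X)), X]) @ coef

def iterated_ate(obs_waves, y_obs, exp_wave1, t_exp):
    """Fit f_K, ..., f_1 on observational data, then contrast f_1(S_1) by arm."""
    target = y_obs
    for k in range(len(obs_waves) - 1, -1, -1):
        coef = ols_fit(obs_waves[k], target)        # f_k: regress target on S_k
        target = ols_predict(coef, obs_waves[k])    # f_k(S_k) is the next target
    f1 = ols_predict(coef, exp_wave1)               # coef now parameterizes f_1
    return f1[t_exp == 1].mean() - f1[t_exp == 0].mean()

rng = np.random.default_rng(1)
n_obs, n_exp = 4000, 4000
s1_obs = rng.normal(size=(n_obs, 1))
s2_obs = 0.8 * s1_obs + rng.normal(size=(n_obs, 1))
y_obs = 1.5 * s2_obs[:, 0] + rng.normal(size=n_obs)
t_exp = rng.integers(0, 2, size=n_exp)
s1_exp = 0.5 * t_exp[:, None] + rng.normal(size=(n_exp, 1))

ate_hat = iterated_ate([s1_obs, s2_obs], y_obs, s1_exp, t_exp)

# Percentile bootstrap: resample rows of each sample with replacement.
boot = []
for _ in range(200):
    io = rng.integers(0, n_obs, size=n_obs)
    ie = rng.integers(0, n_exp, size=n_exp)
    boot.append(iterated_ate([s1_obs[io], s2_obs[io]], y_obs[io],
                             s1_exp[ie], t_exp[ie]))
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])
print(f"ATE estimate {ate_hat:.3f}, 95% bootstrap CI [{ci_low:.3f}, {ci_high:.3f}]")
```

By construction the simulated effect is 0.5 × 0.8 × 1.5 = 0.6, which gives a reference value for the estimate and its interval.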

References

Tran, A., Bibaut, A., & Kallus, N. (2023). "Inferring the Long-Term Causal Effects of Long-Term Treatments from Short-Term Experiments." arXiv:2311.08527.

proximal_surrogate_index

proximal_surrogate_index(experimental: DataFrame, observational: DataFrame, *, treatment: str, surrogates: Sequence[str], proxies: Sequence[str], long_term_outcome: str, covariates: Optional[Sequence[str]] = None, alpha: float = 0.05, n_boot: int = 200, random_state: Optional[int] = None) -> CausalResult

Proximal surrogate index — long-term ATE under unobserved confounding.

Relaxes the surrogacy assumption Y ⟂ T | S, X by allowing an unobserved U that confounds S → Y. Identification uses a proxy W satisfying the two-stage completeness conditions of Imbens, Kallus, Mao & Wang (2025, JRSS-B). In linear-Gaussian form the bridge function h(s, x) solves

E[Y | S, X, W] = W' * α + β * h(S, X)

which we estimate by two-stage least squares with W instrumenting for the unobserved structure.
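
The two-stage least squares step on its own can be illustrated generically. The sketch below is textbook 2SLS on simulated data and is not the paper's bridge-function estimator or the statspai implementation. All variable names are hypothetical, and the simulated `w` is drawn independently of `u`, which is the classical instrumental-variable simplification rather than the proxy conditions described in this section.

```python
import numpy as np

def two_stage_least_squares(y, endog, instruments):
    """2SLS: project endog on [1, instruments], then regress y on [1, fitted]."""
    n = len(y)
    Z = np.column_stack([np.ones(n), instruments])
    fitted = Z @ np.linalg.lstsq(Z, endog, rcond=None)[0]    # first stage
    X_hat = np.column_stack([np.ones(n), fitted])
    return np.linalg.lstsq(X_hat, y, rcond=None)[0]          # second stage

rng = np.random.default_rng(2)
n = 20000
u = rng.normal(size=n)                        # unobserved confounder of S and Y
w = rng.normal(size=n)                        # hypothetical instrument, independent of u
s = 1.0 * w + 1.0 * u + rng.normal(size=n)    # surrogate driven by both w and u
y = 2.0 * s + 3.0 * u + rng.normal(size=n)    # simulated coefficient on s is 2.0

# Naive OLS of y on s absorbs the confounding through u.
ols = np.linalg.lstsq(np.column_stack([np.ones(n), s]), y, rcond=None)[0]
tsls = two_stage_least_squares(y, s, w)
print(f"OLS slope {ols[1]:.2f}, 2SLS slope {tsls[1]:.2f}")
```

In this design the OLS slope is pulled away from the simulated coefficient by the confounder, while the 2SLS slope uses only the variation in `s` that comes from `w`.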

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| proxies | sequence of str | Names of proxy variables W, present in the observational sample only. Proxies must be (a) related to U and (b) excluded from the direct effect on Y (classical IV-style exclusion). | required |

Notes

The linear-Gaussian implementation is faithful to Proposition 3.1 of the paper (the bridge equation) but makes strong parametric assumptions. For nonparametric bridges, use the model hooks in surrogate_index with a kernel/NN estimator and pass a custom 2SLS wrapper.

References

Imbens, G., Kallus, N., Mao, X., & Wang, Y. (2025). "Long-term Causal Inference Under Persistent Confounding via Data Combination." Journal of the Royal Statistical Society Series B, 87(2), 362-388. arXiv:2202.07234. [@imbens2025long]