statspai.surrogate
Long-term effects via surrogate indices (sp.surrogate).
Industrial A/B tests typically run for only a few weeks, but the quantities of interest (LTV, retention, clinical outcomes) materialize months or years later. The surrogate-index framework extrapolates from short-term surrogates to long-term outcomes by combining an experimental sample (treatment and surrogates observed) with an observational sample (surrogates and long-term outcome observed).
Estimators
- `surrogate_index`: Athey, Chetty, Imbens & Kang (2019). Classical single-wave surrogate index.
- `long_term_from_short`: Tran, Bibaut & Kallus (arXiv:2311.08527, 2023). Long-term effect of long-term treatments from short-term experiments.
- `proximal_surrogate_index`: Imbens, Kallus, Mao & Wang (2025, JRSS-B 87(2); arXiv:2202.07234). Proximal identification when unobserved confounders link surrogate and long-term outcome.
References
Athey, S., Chetty, R., Imbens, G. W., & Kang, H. (2019). "The Surrogate Index: Combining Short-Term Proxies to Estimate Long-Term Treatment Effects More Rapidly and Precisely." NBER Working Paper 26463. [@athey2019surrogate]
Imbens, G., Kallus, N., Mao, X., & Wang, Y. (2025). "Long-term Causal Inference Under Persistent Confounding via Data Combination." Journal of the Royal Statistical Society Series B, 87(2), 362-388. arXiv:2202.07234. [@imbens2025long]
SurrogateResult (dataclass)
Structured container for surrogate-index estimation artefacts.
surrogate_index
surrogate_index(experimental: DataFrame, observational: DataFrame, *, treatment: str, surrogates: Sequence[str], long_term_outcome: str, covariates: Optional[Sequence[str]] = None, model: Union[str, Any] = 'ols', alpha: float = 0.05, n_boot: int = 0, random_state: Optional[int] = None) -> CausalResult
Athey-Chetty-Imbens-Kang surrogate-index estimator for the long-term ATE.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `experimental` | `DataFrame` | Experimental sample. Must contain | required |
| `observational` | `DataFrame` | Observational / historical sample with | required |
| `treatment` | `str` | Name of the binary treatment column in | required |
| `surrogates` | sequence of str | Names of short-term surrogates, present in both samples. | required |
| `long_term_outcome` | `str` | Name of the long-term outcome column in | required |
| `covariates` | sequence of str | Optional pre-treatment covariates appended to the surrogate vector. | `None` |
| `model` | `'ols'` | How to fit | `'ols'` |
| `alpha` | `float` | | `0.05` |
| `n_boot` | `int` | If | `0` |
| `random_state` | `int` | | `None` |
Returns:
| Type | Description |
|---|---|
| `CausalResult` | |
Notes
The key identifying assumption is surrogacy: Y ⟂ T | S, X —
conditional on the surrogate(s) and covariates, the treatment has no
direct effect on the long-term outcome. This is strictly stronger than
ignorability and should be defended explicitly (e.g. with placebo
long-term outcomes in a validation sample).
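As a rough illustration of the logic described in these notes, the sketch below fits a surrogate index by OLS on a simulated observational sample and then compares its mean across the arms of a simulated experiment. It uses only NumPy and does not call statspai. The helper names (`fit_ols`, `predict_ols`, `surrogate_index_ate`) and the data-generating process are assumptions made for this example, and covariates are omitted.

```python
import numpy as np


def fit_ols(x, y):
    """Least-squares coefficients for y ~ [1, x]."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict_ols(coef, x):
    """Fitted values from coefficients returned by fit_ols."""
    return np.column_stack([np.ones(len(x)), x]) @ coef


def surrogate_index_ate(s_obs, y_obs, s_exp, t_exp):
    """Hypothetical helper: difference in mean surrogate index across arms."""
    coef = fit_ols(s_obs, y_obs)        # E[Y | S] learned on observational data
    index = predict_ols(coef, s_exp)    # surrogate index imputed in the experiment
    return index[t_exp == 1].mean() - index[t_exp == 0].mean()


rng = np.random.default_rng(0)
n = 20_000

# Observational sample: surrogates and long-term outcome are both observed.
s_obs = rng.normal(size=(n, 2))
y_obs = 1.0 + 2.0 * s_obs[:, 0] - 1.0 * s_obs[:, 1] + rng.normal(size=n)

# Experimental sample: treatment and surrogates only. The treatment shifts
# the first surrogate by 0.5 and affects nothing else (surrogacy by design).
t_exp = rng.integers(0, 2, size=n)
s_exp = rng.normal(size=(n, 2))
s_exp[:, 0] += 0.5 * t_exp

ate_hat = surrogate_index_ate(s_obs, y_obs, s_exp, t_exp)
print(f"estimated long-term ATE: {ate_hat:.3f}")
```

In this simulation the first surrogate has coefficient 2.0 on the long-term outcome and is shifted by 0.5, so the estimate should land near 1.0. Surrogacy holds here by construction; with real data it has to be argued.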
References
Athey, S., Chetty, R., Imbens, G. W., & Kang, H. (2019). "The Surrogate Index: Combining Short-Term Proxies to Estimate Long-Term Treatment Effects More Rapidly and Precisely." NBER Working Paper 26463. [@athey2019surrogate]
long_term_from_short
long_term_from_short(experimental: DataFrame, observational: DataFrame, *, treatment: str, surrogates_waves: Sequence[Sequence[str]], long_term_outcome: str, covariates: Optional[Sequence[str]] = None, model: Union[str, Any] = 'ols', alpha: float = 0.05, n_boot: int = 200, random_state: Optional[int] = None) -> CausalResult
Long-term ATE under multi-wave short-term surrogates.
Extends the classical surrogate index by chaining K surrogate waves — successive short-term measurements — so you can estimate the effect of a sustained treatment from a short experiment.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `surrogates_waves` | sequence of sequences | | required |
Notes
Uses the iterated-expectation estimator (Ghassami et al. 2024, Eq. 3):
f_K(s_K) = E[Y | S_K = s_K] (observational sample)
f_k(s_k) = E[f_{k+1}(S_{k+1}) | S_k = s_k], for k = K-1, ..., 1 (observational sample)
ATE = E[f_1(S_1) | T = 1] - E[f_1(S_1) | T = 0] (experimental sample)
Inference is bootstrap-based because the delta-method variance of the iterated estimator is unwieldy in closed form.
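The backward recursion and the bootstrap step can be sketched with NumPy alone. This is an illustration under stated assumptions, not the statspai implementation: each conditional expectation is fitted by OLS, there are K = 2 simulated waves, both samples are resampled independently, and the helper names (`fit_linear`, `predict_linear`, `iterated_ate`, `bootstrap_ci`) are made up for the example.

```python
import numpy as np


def fit_linear(x, y):
    """Least-squares coefficients for y ~ [1, x]."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict_linear(coef, x):
    return np.column_stack([np.ones(len(x)), x]) @ coef


def iterated_ate(waves_obs, y_obs, s1_exp, t_exp):
    """Regress backwards from wave K to wave 1, then contrast f_1 across arms."""
    target = y_obs
    coef = None
    for s_k in reversed(waves_obs):
        coef = fit_linear(s_k, target)       # f_k fitted on observational data
        target = predict_linear(coef, s_k)   # f_k(S_k) becomes the next target
    f1_exp = predict_linear(coef, s1_exp)    # last fit is f_1; apply it in the experiment
    return f1_exp[t_exp == 1].mean() - f1_exp[t_exp == 0].mean()


def bootstrap_ci(waves_obs, y_obs, s1_exp, t_exp, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap, resampling each sample independently."""
    rng = np.random.default_rng(seed)
    n_obs, n_exp = len(y_obs), len(t_exp)
    draws = np.empty(n_boot)
    for b in range(n_boot):
        i = rng.integers(0, n_obs, size=n_obs)
        j = rng.integers(0, n_exp, size=n_exp)
        draws[b] = iterated_ate([w[i] for w in waves_obs], y_obs[i], s1_exp[j], t_exp[j])
    return np.quantile(draws, [alpha / 2, 1 - alpha / 2])


rng = np.random.default_rng(1)
n = 5_000

# Observational sample: two surrogate waves and the long-term outcome.
s1_obs = rng.normal(size=n)
s2_obs = 0.8 * s1_obs + rng.normal(size=n)
y_obs = 1.5 * s2_obs + rng.normal(size=n)

# Short experiment: only the first wave is observed; treatment shifts it by 1.0.
t_exp = rng.integers(0, 2, size=n)
s1_exp = rng.normal(size=n) + 1.0 * t_exp

ate_hat = iterated_ate([s1_obs, s2_obs], y_obs, s1_exp, t_exp)
ci_low, ci_high = bootstrap_ci([s1_obs, s2_obs], y_obs, s1_exp, t_exp)
print(f"ATE estimate: {ate_hat:.3f}, 95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
```

With these simulated coefficients the implied long-term effect is 1.5 × 0.8 × 1.0 = 1.2, which the estimate should approximate.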
References
Tran, A., Bibaut, A., & Kallus, N. (2023). "Inferring the Long-Term Causal Effects of Long-Term Treatments from Short-Term Experiments." arXiv:2311.08527.
proximal_surrogate_index
proximal_surrogate_index(experimental: DataFrame, observational: DataFrame, *, treatment: str, surrogates: Sequence[str], proxies: Sequence[str], long_term_outcome: str, covariates: Optional[Sequence[str]] = None, alpha: float = 0.05, n_boot: int = 200, random_state: Optional[int] = None) -> CausalResult
Proximal surrogate index — long-term ATE under unobserved confounding.
Relaxes the surrogacy assumption Y ⟂ T | S by allowing an
unobserved U that confounds S → Y. Identification uses a proxy
W satisfying the two-stage completeness conditions of Imbens,
Kallus, Mao & Wang (2025, JRSS-B). In linear-Gaussian form the
bridge function h(s, x) solves
E[Y | S, X, W] = W' * α + β * h(S, X)
which we estimate by two-stage least squares with W instrumenting
for the unobserved structure.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `proxies` | sequence of str | Names of proxy variables | required |
Notes
The linear-Gaussian implementation follows Proposition 3.1 of the paper
(the bridge equation) but makes strong parametric assumptions. For
nonparametric bridges, use the model hooks in `surrogate_index` with a
kernel or neural-network estimator and pass a custom 2SLS wrapper.
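To give a feel for how two-stage least squares can remove bias from an unobserved confounder in a linear-Gaussian setting, here is a generic proximal 2SLS sketch on simulated data. It is not the library's implementation and does not reproduce the paper's estimator. In particular, the second proxy `z` is a hypothetical addition needed by this textbook construction and is not part of the documented signature; all variable names and coefficients are assumptions of the example.

```python
import numpy as np


def lstsq_coef(design, y):
    """Least-squares coefficients for y ~ design."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


rng = np.random.default_rng(0)
n = 20_000

u = rng.normal(size=n)                        # unobserved confounder U
s = u + rng.normal(size=n)                    # surrogate, confounded by U
w = u + rng.normal(size=n)                    # proxy W of U
z = u + rng.normal(size=n)                    # hypothetical second proxy (assumption)
y = 2.0 * s + 1.5 * u + rng.normal(size=n)    # true coefficient on S is 2.0

ones = np.ones(n)

# Naive OLS of Y on S absorbs the confounding through U.
beta_ols = lstsq_coef(np.column_stack([ones, s]), y)[1]

# Stage 1: project the proxy W on (1, S, Z).
stage1 = np.column_stack([ones, s, z])
w_hat = stage1 @ lstsq_coef(stage1, w)

# Stage 2: regress Y on (1, S, W_hat); the S coefficient is the target.
beta_2sls = lstsq_coef(np.column_stack([ones, s, w_hat]), y)[1]

print(f"OLS slope on S:  {beta_ols:.3f}")
print(f"2SLS slope on S: {beta_2sls:.3f}")
```

In this simulation the naive slope is pulled away from 2.0 by the confounder, while the two-stage estimate should sit close to it. The result depends on the linear-Gaussian structure assumed above.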
References
Imbens, G., Kallus, N., Mao, X., & Wang, Y. (2025). "Long-term Causal Inference Under Persistent Confounding via Data Combination." Journal of the Royal Statistical Society Series B, 87(2), 362-388. arXiv:2202.07234. [@imbens2025long]