Skip to content

Proximal Causal Inference

Tchetgen Tchetgen et al. (2020). Identifies the ATE in the presence of an unmeasured confounder using two proxies of that confounder:

  • Z — "treatment-inducing confounding proxy": Z ⊥ Y | (D, U, X).
  • W — "outcome-inducing confounding proxy": W ⊥ D | (U, X).

plus measured covariates X.

       ┌── Z ──► D ──► Y ──┐
       │                   │
       U ───────────────── U     (U unobserved)
       │                   │
       └────── W ──────────┘

sp.proximal(...) — linear 2SLS on the outcome bridge

Assumes a linear outcome bridge :math:h(W, D, X) = \gamma_0 + \gamma_D D + \gamma_W^\top W + \gamma_X^\top X. Under that restriction the bridge equation reduces to a standard 2SLS problem where W is endogenous and Z is its instrument.

# Unobserved U → biased OLS; (Z, W) unblocks identification.
r = sp.proximal(
    df, y='lung_cancer', treat='smoker',
    proxy_z=['occupation'],             # treatment-side proxy (IV for W)
    proxy_w=['shs_exposure'],           # outcome-side proxy (endogenous)
    covariates=['age', 'sex'],          # exogenous controls
    bridge='linear',                    # currently the only option
)

r.estimate                              # γ_D — ATE under the linear bridge
r.model_info['first_stage_F']           # weak-instrument diagnostic (k_w==1 only)
r.model_info['bridge']                  # 'linear' — recorded for audit

Inference options

  • Closed-form SE (default): 2SLS sandwich, homoskedastic. r.model_info['se_method'] == '2sls_sandwich'.
  • Bootstrap SE: pass n_boot=500 to switch. Failures are tracked and warned rather than silently falling back.
r = sp.proximal(df, y='y', treat='d',
                proxy_z=['z'], proxy_w=['w'],
                n_boot=500, seed=0)
r.model_info['se_method']             # 'bootstrap'
r.model_info['n_boot_failed']         # 0 for clean runs

Weak-instrument diagnostics

For a single endogenous proxy (k_w == 1), StatsPAI reports the first-stage F statistic of regressing W on the excluded Z (conditioning on X). A warning fires when F < 10.

if r.model_info['first_stage_F'] is not None:
    print(f"First-stage F: {r.model_info['first_stage_F']:.2f}")

For multiple endogenous proxies (k_w > 1), the correct diagnostic is the Cragg-Donald / Kleibergen-Paap minimum-eigenvalue statistic. That is not yet implemented — first_stage_F is None and a RuntimeWarning explains why. See docs/ROADMAP.md §2.

Class API

from statspai import ProximalCausalInference

obj = ProximalCausalInference(
    y='y', treat='d', proxy_z=['z'], proxy_w=['w'],
    covariates=['x'],
).fit(df)
r = obj.result_            # CausalResult

Non-linear bridges

Kernel (Mastouri et al. 2021) and sieve / RKHS (Deaner 2018) bridges are on the roadmap but not yet shipped. bridge='kernel' and bridge='sieve' currently raise NotImplementedError with a pointer to docs/ROADMAP.md §1.

The scaffold lives in statspai.proximal.p2sls — contributors interested in shipping a non-linear bridge should start there.

References

  • Tchetgen Tchetgen, E.J., Ying, A., Cui, Y., Shi, X. and Miao, W. (2020). An Introduction to Proximal Causal Learning. arXiv:2009.10982.
  • Miao, W., Geng, Z. and Tchetgen Tchetgen, E.J. (2018). Identifying causal effects with proxy variables of an unmeasured confounder. Biometrika 105(4).
  • Cui, Y., Pu, H., Shi, X., Miao, W. and Tchetgen Tchetgen, E.J. (2024). Semiparametric proximal causal inference. JASA.
  • Mastouri, A. et al. (2021). Proximal Causal Learning with Kernels. ICML. (kernel bridge — future work)
  • Deaner, B. (2018). Proxy Controls and Panel Data. arXiv:1810.00283. (sieve bridge — future work)