Double / Debiased Machine Learning¶
statspai.dml implements the double/debiased machine learning (DML)
estimators of Chernozhukov et al. (2018), with cross-fitted nuisance
estimation and Neyman-orthogonal scores.
Four model types¶
import statspai as sp

# Partially Linear Regression (PLR): Y = Dθ + g(X) + U, E[U|D,X]=0
r = sp.dml(df, y='wage', treat='training',
           covariates=['age', 'edu', 'exp'],
           model='plr',
           ml_g='rf', ml_m='rf',  # nuisance learners
           n_folds=5, n_rep=1)
# Interactive Regression Model (IRM): binary treatment, AIPW ATE.
# Propensity scores m̂(X) are auto-clipped to [0.01, 0.99] for AIPW
# stability; the clip counts are reported in r.diagnostics.
r = sp.dml(df, y='y', treat='d_bin', covariates=[...],
           model='irm')
# Partially Linear IV (PLIV) — new in v0.9.3
# Y = Dθ + g(X) + U, D = m(X) + V, Z = r(X) + ε
r = sp.dml(df, y='y', treat='d', covariates=[...],
           instrument='z',
           model='pliv',
           ml_g='rf', ml_m='rf', ml_r='rf')
# Interactive IV Model (IIVM) — new in v0.9.6
# Binary D, binary Z → LATE (compliers) via the Wald ratio of two
# Neyman-orthogonal doubly-robust scores. See Chernozhukov et al.
# (2018) §5.
r = sp.dml(df, y='earnings', treat='college',
           covariates=['age', 'ability', 'parent_edu'],
           model='iivm', instrument='lottery_win')
r.estimand # 'LATE'
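To make the cross-fitting idea behind model='plr' concrete, here is a minimal sketch written directly against scikit-learn. It is not StatsPAI's implementation: the function name plr_crossfit, the simulated data, and the choice of the partialling-out score (residualising both Y and D on X) are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold


def plr_crossfit(y, d, X, n_folds=5, seed=0):
    """Hypothetical helper: cross-fitted partialling-out estimate of theta."""
    y_res = np.zeros_like(y, dtype=float)
    d_res = np.zeros_like(d, dtype=float)
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train, test in folds.split(X):
        # Nuisance learners see only the training folds ...
        ml_l = RandomForestRegressor(n_estimators=50, min_samples_leaf=5,
                                     random_state=seed)
        ml_m = RandomForestRegressor(n_estimators=50, min_samples_leaf=5,
                                     random_state=seed)
        ml_l.fit(X[train], y[train])   # approximates E[Y | X]
        ml_m.fit(X[train], d[train])   # approximates E[D | X]
        # ... and are used to residualise the held-out fold.
        y_res[test] = y[test] - ml_l.predict(X[test])
        d_res[test] = d[test] - ml_m.predict(X[test])
    # Residual-on-residual regression solves the orthogonal moment condition.
    theta = np.sum(d_res * y_res) / np.sum(d_res ** 2)
    # Influence-function standard error.
    psi = (y_res - theta * d_res) * d_res
    jac = np.mean(d_res ** 2)
    se = np.sqrt(np.mean(psi ** 2) / jac ** 2 / len(y))
    return theta, se


# Simulated data with a true effect of 0.5 (assumed for the demo).
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))
d = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(size=n)
y = 0.5 * d + np.cos(X[:, 0]) + X[:, 2] + rng.normal(size=n)

theta_hat, se_hat = plr_crossfit(y, d, X)
print(f"theta_hat={theta_hat:.3f}, se={se_hat:.3f}")
```

Because each observation is residualised by models that never saw it, overfitting in the nuisance learners does not leak into the final regression, which is the property cross-fitting is meant to deliver.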
Multiple excluded instruments
StatsPAI's PLIV and IIVM implementations use a scalar
instrument. For a problem with multiple excluded instruments,
build a scalar first-stage index with
sp.scalar_iv_projection(data, treat, instruments=[...], covariates=[...])
and pass the resulting column name to instrument=. See
docs/ROADMAP.md §2 for the status of native multi-instrument support.
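One way to think about a scalar first-stage index is as the fitted value from a linear first stage. The sketch below shows that idea with plain NumPy; first_stage_index is a hypothetical helper, and nothing here should be read as the actual behaviour of sp.scalar_iv_projection, which may use a different projection.

```python
import numpy as np


def first_stage_index(d, Z, X):
    """Hypothetical helper: OLS fitted values of d on [1, Z, X]."""
    W = np.column_stack([np.ones(len(d)), Z, X])
    coef, *_ = np.linalg.lstsq(W, d, rcond=None)
    return W @ coef


# Simulated treatment driven by two instruments and two covariates.
rng = np.random.default_rng(1)
n = 500
Z = rng.normal(size=(n, 2))
X = rng.normal(size=(n, 2))
d = Z @ np.array([1.0, 0.5]) + X @ np.array([0.3, -0.2]) + rng.normal(size=n)

index = first_stage_index(d, Z, X)  # one column summarising both instruments
```

The resulting vector could be stored as a new DataFrame column and its name passed to instrument=.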
Per-model classes¶
Each model family is also available as a dedicated class for users who want fine-grained control or want to subclass the base infrastructure:
from statspai.dml import DoubleMLPLR, DoubleMLIRM, DoubleMLPLIV, DoubleMLIIVM
r = DoubleMLIIVM(data=df, y='y', treat='d', covariates=[...],
                 instrument='z').fit()
All four share statspai.dml._DoubleMLBase, which handles cross-
fitting, default learner selection (classifier vs regressor per
model), repeated-split aggregation, and CausalResult construction.
Learners¶
Either pass the name of a built-in learner ('rf', 'lasso',
'elastic_net', 'xgb', 'nn') or any sklearn-compatible estimator:
from sklearn.ensemble import GradientBoostingRegressor
r = sp.dml(..., ml_g=GradientBoostingRegressor(n_estimators=200),
           ml_m=GradientBoostingRegressor(n_estimators=200))
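A scikit-learn Pipeline is itself an sklearn-compatible estimator, so preprocessing and the model can be bundled into one learner object. The sketch below only checks the fit/predict/clone contract on simulated data; whether StatsPAI clones the learner per fold is an assumption, not something stated here.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scaling + cross-validated lasso, packaged as a single estimator.
learner = make_pipeline(StandardScaler(), LassoCV(cv=5))

# Simulated regression data (assumed for the demo).
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(size=200)

# clone() returns an unfitted copy, which is how a cross-fitting loop
# would typically obtain a fresh learner for each fold.
fitted = clone(learner).fit(X_demo, y_demo)
preds = fitted.predict(X_demo)
```

An object like learner could then be passed as ml_g= or ml_m= in place of a built-in learner name.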
Cross-fitting¶
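With n_rep > 1, StatsPAI repeats the cross-fitting with different
random fold assignments and reports the median point estimate
together with the Chernozhukov et al. (2018, eq. 3.7) aggregated SE:
$$\widehat{\mathrm{se}} = \sqrt{\operatorname{median}_r\left\{ \mathrm{se}_r^2 + \left(\hat\theta_r - \hat\theta_{\mathrm{med}}\right)^2 \right\}}$$
where θ̂_r and se_r are the estimate and standard error from repetition r
and θ̂_med is the median of the θ̂_r. This accounts for both the
within-rep variance and the between-rep dispersion of the point
estimates. Taking only median(se_r) would under-cover whenever split
randomness moves the point estimate.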
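The aggregation can be sketched in a few lines of NumPy, assuming it takes the median over repetitions of se_r² + (θ̂_r − θ̂_med)² and then the square root. The function name aggregate_reps and the three made-up repetitions are illustrative only.

```python
import numpy as np


def aggregate_reps(thetas, ses):
    """Hypothetical helper: median estimate and median-aggregated SE."""
    thetas = np.asarray(thetas, dtype=float)
    ses = np.asarray(ses, dtype=float)
    theta_med = np.median(thetas)
    # Each repetition contributes its own variance plus its squared
    # distance from the median estimate.
    se_agg = np.sqrt(np.median(ses ** 2 + (thetas - theta_med) ** 2))
    return theta_med, se_agg


# Three repetitions with identical SEs but different point estimates.
theta_med, se_agg = aggregate_reps([0.48, 0.52, 0.55], [0.10, 0.10, 0.10])
print(theta_med, se_agg)
```

In this toy input every se_r equals 0.10, yet the aggregated SE comes out slightly larger because the point estimates disagree across splits.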
Result attributes¶
r.estimate # θ̂ point estimate
r.se # influence-function standard error
r.ci # (lower, upper); level set by sp.dml(..., alpha=0.05)
r.pvalue # two-sided p-value against H0: θ = 0
r.diagnostics # cross-fitting config + overlap diagnostics (dict)
r.summary() # full table with reference
r.cite() # Chernozhukov et al. 2018 BibTeX
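For intuition on how the interval and p-value relate to the estimate and its standard error, the sketch below applies the usual normal approximation. It assumes r.ci and r.pvalue are Wald-type quantities; wald_inference is a hypothetical helper, not part of StatsPAI.

```python
from statistics import NormalDist


def wald_inference(estimate, se, alpha=0.05):
    """Hypothetical helper: normal-approximation CI and two-sided p-value."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (estimate - z_crit * se, estimate + z_crit * se)
    # Two-sided test of H0: theta = 0.
    pvalue = 2 * (1 - NormalDist().cdf(abs(estimate / se)))
    return ci, pvalue


ci, pvalue = wald_inference(0.5, 0.1)
print(ci, pvalue)
```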
References¶
- Chernozhukov, V. et al. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal 21(1).
- Chernozhukov, V. et al. (2018, §4.2). Partially Linear IV and the Neyman-orthogonal score used by model='pliv'.
- Chernozhukov, V. et al. (2018, §5). Interactive IV Model and the ratio-of-scores LATE estimator used by model='iivm'.