Rigorous (data-driven) Lasso — sp.rlasso and the hdm port¶
sp.rlasso is StatsPAI's faithful Python port of R's
hdm package (Chernozhukov,
Hansen & Spindler, The R Journal 8(2), 2016, doi
10.32614/RJ-2016-040). It
implements the rigorous Lasso of Belloni, Chernozhukov & Hansen (2014)
and the optimal-instrument IV of Belloni, Chen, Chernozhukov & Hansen
(2012, Econometrica, doi
10.3982/ECTA9626).
The defining property versus a vanilla sklearn Lasso is the
data-driven, theory-justified penalty — no cross-validation:
with per-coefficient, heteroskedasticity-robust penalty loadings \(\Psi_j\) refined by iteration, and an optional post-Lasso OLS refit on the selected support to remove shrinkage bias.
Why a faithful port (and not a reconstruction)¶
An earlier from-memory reconstruction (iv.bch_post_lasso_iv) used the
asymptotic penalty \(\lambda = 2c\sqrt{2n\log(2p/\alpha)}\) and selected
only instruments. On the canonical BCH eminent-domain application it came
out ~17× off hdm (0.013 vs 0.227). The lesson: aligning a
selection method numerically requires porting the reference algorithm
exactly, not approximating its formulas. sp.rlasso does that, and the
parity tests
pin every surface against hdm 0.3.2:
| Surface | Agreement with hdm |
|---|---|
rlasso coefficients / λ₀ / loadings / residuals |
machine precision (~1e-13) |
rlasso selected support |
exact |
rlasso_effect (partialling-out & double-selection) α, SE |
~1e-14 |
rlasso_iv (all 4 selection regimes, well-conditioned design) |
~1e-14 |
rlasso_iv on eminent domain (rank-deficient controls) |
~1e-9 |
Implementation note. hdm's IV routines invert near-singular control blocks with
MASS::ginv, whose singular-value cutoff is \(\sqrt{\varepsilon}\approx 1.49\times10^{-8}\). numpy'spinvdefaults to a far tighterrcond = 1e-15, which keeps spurious tiny singular values and silently produces a different pseudo-inverse. Matching the MASS tolerance is exactly what restores eminent-domain parity.
hdm ↔ StatsPAI function map¶
The complete public surface of hdm, and where each piece lives in
StatsPAI. Two categories: faithful ports (bit-for-bit against hdm,
no cross-validation, deterministic fixtures) and same-score
equivalents (the high-dimensional treatment-effect estimators, which
StatsPAI exposes through sp.dml's Neyman-orthogonal scores with
cross-fitting (Chernozhukov et al., Econometrics Journal 2018) rather
than hdm's full-sample post-Lasso — same estimand and score family, not
claimed bit-for-bit).
hdm |
StatsPAI | Status |
|---|---|---|
rlasso |
sp.rlasso |
faithful port — coeffs/λ₀/loadings/residuals ~1e-13, support exact |
rlassologit |
sp.rlassologit |
faithful port — support exact, glmnet engine ~1e-6, post=TRUE refit ~1e-9 |
rlassoEffect |
sp.rlasso_effect |
faithful port — α, SE ~1e-14 (partialling-out & double-selection) |
rlassoEffects |
sp.rlasso_effects |
faithful port — ~1e-14 (multi-target) |
rlassoIV, rlassoIVselectX, rlassoIVselectZ |
sp.rlasso_iv(select_X=…, select_Z=…) |
faithful port — ~1e-14 well-conditioned, ~1e-9 rank-deficient |
tsls |
sp.rlasso_iv(select_Z=False, select_X=False) |
robust 2SLS path (no selection) |
lambdaCalculation |
sp.rlasso(penalty=…) (internal) |
homoskedastic & heteroskedastic (X-independent) bit-exact; X-dependent in-distribution only |
rlassoATE |
sp.dml(model='irm', score='ATE', ml_g='rlasso', ml_m='rlassologit') |
same doubly-robust (AIPW) score family; cross-fitted, DoubleML-aligned |
rlassoATET |
sp.dml(model='irm', score='ATTE', …) |
same score family; cross-fitted |
rlassoLATE |
sp.dml(model='iivm', ml_g='rlasso', ml_m='rlassologit') |
same LATE Wald-ratio score family; cross-fitted |
rlassoLATET |
sp.dml(model='iivm', …) |
covered by the IIVM score; a LATET-specific variant is not separately exposed |
rlassologitEffect, rlassologitEffects |
sp.rlassologit_effect / sp.rlassologit_effects |
faithful port — coef & post SE ~1e-14, multi-target ~1e-15 |
print / summary / confint / predict |
.summary() / .conf_int() / .predict() |
result-object methods |
dataset EminentDomain |
parity fixture in tests/reference_parity/ |
used to pin rlasso_iv |
datasets pension, GrowthData, AJR, cps2012, BLP |
not yet bundled as sp.datasets |
on the roadmap (canonical-case reproduction) |
On the treatment-effect family.
hdm::rlassoATE/ATET/LATE/LATETestimate the same program-evaluation parameters assp.dml's IRM and IIVM models; StatsPAI routes them through the cross-fitted DML scores so the high-dimensional and generic-ML paths share one validated implementation, rather than maintaining a second full-sample estimator. Thesp.dmlpath is parity-tested againstDoubleML(Bach et al., JMLR 2022 for Python; Journal of Statistical Software 2024 for R).
The three entry points¶
1. sp.rlasso — rigorous (post-)Lasso¶
import numpy as np, statspai as sp
rng = np.random.default_rng(0)
n, p = 200, 100
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:4] = [3, -2, 1.5, 1]
y = X @ beta + rng.normal(size=n)
fit = sp.rlasso(X, y, post=True) # post-Lasso (default)
fit.n_selected # -> 4 (recovers the support)
fit.beta[:4] # -> ~[3, -2, 1.5, 1]
fit.lambda0, fit.sigma # data-driven penalty, residual scale
fit.predict(X) # fitted / out-of-sample predictions
print(fit.summary())
Tuning is via the penalty and control dicts, mirroring hdm:
sp.rlasso(X, y, penalty={"c": 1.1, "gamma": 0.05,
"homoscedastic": False}) # default: heteroskedastic
sp.rlasso(X, y, post=False) # plain Lasso (c defaults to 0.5)
2. sp.rlasso_effect — effect of d after selecting controls¶
The high-dimensional analogue of a partial regression coefficient —
root-n consistent and asymptotically normal under approximate
sparsity (Belloni–Chernozhukov–Hansen, 2014).
res = sp.rlasso_effect(X, y, d, method="partialling out") # or "double selection"
res.alpha, res.se, res.pvalue, res.conf_int()
3. sp.rlasso_iv — IV with rigorous selection¶
A faithful port of hdm::rlassoIV. Selection can run on the instruments
z, the controls x, both, or neither:
# BCH (2012) optimal-instrument IV — select among many instruments
res = sp.rlasso_iv(y="y", d="d", z=z_cols, x=x_cols, data=df,
select_Z=True, select_X=False)
res.coef, res.se # eminent domain logGDP -> 0.2274, 0.2466
select_Z |
select_X |
method (hdm equivalent) |
|---|---|---|
True |
False |
rlassoIVselectZ — instrument selection (BCH 2012) |
False |
True |
rlassoIVselectX — partial out high-dim controls |
True |
True (default) |
double selection on Z and X |
False |
False |
plain robust 2SLS (tsls) |
As a Double-ML nuisance learner¶
The rigorous Lasso is the theory-correct sparse nuisance learner for
Double/Debiased ML (the DML rate conditions are stated for a plug-in,
not a cross-validated, penalty). It plugs straight into sp.dml:
res = sp.dml(data=df, y="y", treat="d", covariates=x_cols,
model="plr", ml_g="rlasso", ml_m="rlasso", n_folds=5)
sp.RlassoRegressor / sp.RlassoClassifier are scikit-learn-compatible
(get_params / set_params / clone-safe), so they survive the
cross-fitting clone() on every fold. The classifier is a linear
probability propensity (clipped to (ε, 1−ε)) — convenient, but prefer
a calibrated classifier when propensity calibration matters.
This path is pinned against R: with a shared fold partition (pass
fold_indices=), sp.dml(model='plr', ml_g='rlasso', ml_m='rlasso')
reproduces a manual Double-ML PLR estimator whose nuisances are
hdm::rlasso to machine precision (θ̂ and SE; see
test_dml_rlasso_learner_matches_r_doubleml). So the rigorous-Lasso DML
path is validated end to end, not just learner-by-learner.
Relationship to iv.bch_post_lasso_iv¶
iv.bch_post_lasso_iv is the original reconstruction and is deprecated
(it emits a DeprecationWarning); it is kept only for backward
compatibility with its existing numerics. For agreement with hdm — and
for selection on both instruments and controls — use sp.rlasso_iv. The
two are not numerically interchangeable.
What is and isn't ported¶
Ported and parity-tested against hdm: rlasso, rlassoEffect /
rlassoEffects (single and multi-target), rlassoIV (all four selection
regimes), tsls, rlassologit (the logistic rigorous Lasso), and the
data-driven lambdaCalculation for the homoskedastic and heteroskedastic
(X-independent) penalties.
Not yet ported¶
Not yet ported as separate hdm-named wrappers: rlassoLATET is covered
through the IIVM score in sp.dml(...), not exposed as its own
sp.rlasso_latet; exact R RNG parity for
penalty={"X.dependent.lambda": True} is also not claimed because StatsPAI
does not reproduce R's Mersenne-Twister stream. These are documented scope
boundaries rather than silent approximations.
sp.rlassologit — the logistic rigorous Lasso¶
hdm::rlassologit is the binary-outcome analogue of rlasso: its
penalized fit is glmnet(family="binomial", alpha=1, lambda=λ,
standardize=TRUE) at a single data-driven λ. StatsPAI reproduces
glmnet's binomial lasso at that λ directly — an IRLS outer loop, a
weighted coordinate-descent inner loop, the 1/n-scaled deviance
objective, population-variance standardization and glmnet's pmin
probability clamp — rather than substituting scikit-learn's L1 logistic
(whose objective/standardization differ at a fixed λ).
fit = sp.rlassologit(X, y, post=True) # y binary
fit.predict(X, type="response") # probabilities (or "link" = log-odds)
sp.RlassologitClassifier() # genuine logistic propensity for sp.dml
Parity (vs hdm 0.3.2 / glmnet 4.1): the selected support matches
exactly; the glmnet engine's coefficients match to ~1e-6 (glmnet's own
convergence tolerance — no tighter ground truth exists); and post=True
(the default) coefficients/intercept/residuals — coming from an
unpenalized logistic refit on the selected set — match to ~1e-9.
sp.RlassologitClassifier is the principled binary nuisance learner for
Double-ML: sp.dml(model='irm', ml_m='rlassologit') uses a calibrated
logistic propensity (unlike the linear-probability RlassoClassifier).
sp.rlassologit_effect / rlassologit_effects — logistic treatment effects¶
The high-dimensional logistic treatment effect (Belloni–Chernozhukov–Wei
2016 double-selection for GLMs: a √σ²-weighted union of a logistic and a
linear rigorous-Lasso selection, refit by a low-dimensional logit, with a
max-of-two sandwich variance) is now a faithful port of
hdm::rlassologitEffect(s):
res = sp.rlassologit_effect(X, y, d) # binary y, treatment d, controls X
res.alpha, res.se, res.pvalue, res.conf_int()
out = sp.rlassologit_effects(X, y, index=[0, 1]) # each X column as the target
Parity vs hdm 0.3.2 (tests/reference_parity/test_rlassologit_effect_parity.py):
the coefficient and the post-Lasso standard error match to ~1e-14, the
multi-target effects to ~1e-15, and the non-post SE to ~1e-7 (the non-post
LassoShooting tolerance). For a plain binary-treatment causal effect you can
still use sp.dml(model='irm', ml_m='rlassologit').
X-dependent penalty simulation (penalty={"X.dependent.lambda": True})
is implemented but matches hdm only in distribution — R's
Mersenne-Twister stream is not reproduced — so it is not bit-exact. The
default (X-independent) path is.
Reproducing the hdm vignette¶
The three headline worked examples in the hdm vignette reproduce
exactly in StatsPAI — same data, same numbers:
hdm vignette example |
call | StatsPAI = hdm |
|---|---|---|
Growth — conditional convergence (Barro-Lee GrowthData) |
sp.rlasso_effect(X, y, d) |
coef −0.04981 (SE 0.01394) partialling-out; −0.05001 (SE 0.01579) double-selection |
AJR — institutions → GDP, settler-mortality IV (AJR) |
sp.rlasso_iv(y, d, z, x, select_X=True, select_Z=False) |
coef 0.84503 (SE 0.26993) |
cps2012 — gender wage gap (cps2012) |
sp.rlasso_effects(X, y, index=female_cols) |
female coef −0.15492 (SE 0.05016); all 16 female targets match hdm |
import statspai as sp, pandas as pd
# Growth: effect of (log) initial GDP on growth, selecting ~60 controls
g = pd.read_csv("hdm_growth_data.csv")
y, d = g["Outcome"].to_numpy(), g["gdpsh465"].to_numpy()
X = g.drop(columns=["Outcome", "intercept", "gdpsh465"]).to_numpy()
sp.rlasso_effect(X, y, d, method="partialling out").alpha # -> -0.04981
# AJR: expropriation risk on GDP, instrumented by settler mortality
a = pd.read_csv("hdm_ajr_data.csv")
y, d, z = a["GDP"].to_numpy(), a["Exprop"].to_numpy(), a["logMort"].to_numpy()
Xa = a.drop(columns=["GDP", "Exprop", "logMort"]).to_numpy()
sp.rlasso_iv(y, d, z, x=Xa, select_Z=False, select_X=True).coef # -> 0.84503
All three are pinned to hdm in
tests/reference_parity/test_rlasso_vignette_parity.py.
The datasets are public, published economic facts (AJR: Acemoglu,
Johnson & Robinson 2001, AER 91(5); Growth: Barro & Lee; cps2012: U.S.
CPS 2012). Growth and AJR are bundled in full; for cps2012 the committed
fixture is a deterministic 800-row subsample (the full 27 MB expanded
design is not bundled) that pins the female main effect, while the
full-sample number above is reproduced and recorded at generation time.
Regenerating the parity fixtures¶
The hdm reference numbers are deterministic (no cross-validation, no
RNG on the default penalty path), so the fixtures are a hard contract.
Regenerate only when the algorithm changes:
Rscript tests/reference_parity/_generate_rlasso.R # core / effect / IV / logit
Rscript tests/reference_parity/_generate_rlasso_vignette.R # Growth + AJR vignette
pytest tests/reference_parity/test_rlasso_parity.py tests/reference_parity/test_rlasso_vignette_parity.py -q
References¶
- Belloni, A., Chen, D., Chernozhukov, V. & Hansen, C. (2012). Sparse
Models and Methods for Optimal Instruments With an Application to
Eminent Domain. Econometrica, 80(6), 2369–2429.
doi
10.3982/ECTA9626. - Belloni, A., Chernozhukov, V. & Hansen, C. (2014). Inference on Treatment Effects after Selection among High-Dimensional Controls. The Review of Economic Studies, 81(2), 608–650.
- Chernozhukov, V., Hansen, C. & Spindler, M. (2016). hdm:
High-Dimensional Metrics. The R Journal, 8(2), 185–199.
doi
10.32614/RJ-2016-040.