Changelog¶
All notable changes to StatsPAI will be documented in this file.
[Unreleased]¶
Added¶
- R/Stata reference-parity expansion — panel, count/quantile, and SDID
estimators now carry frozen cross-package parity (19 new tests,
tests/reference_parity/124 → 143). Each ships a deterministic data fixture, a_generate_*.Rscript, and a frozen*_R.json: sp.panelvs Rplm— within (FE), Swamy-Arora RE, and between, with classical and cluster-robust SEs (matchesplm::vcovHCHC1/group); coefficients and classical SEs agree to ~1e-5. Closes the gap wherepanel, a core ≥95% estimator, had no reference parity.sp.poisson/sp.nbreg/sp.qreg/sp.tobit/sp.zip_model/sp.zinbvsglm/MASS::glm.nb/quantreg::rq/AER::tobit/pscl::zeroinfl— exact coefficient parity (Poisson/NB/Tobit also exact model SEs and Tobit σ; NBalpha == 1/theta).sp.sdidvs the authors'synthdidR package (Arkhangelsky et al. 2021) onsp.california_prop99()— point estimate agrees to ~1e-6 (−17.8985 packs/capita).- NIST StRD one-way ANOVA certification for
sp.regress(tests/numerical_accuracy/test_nist_strd_anova.py). All 11 NIST ANOVA datasets (SiRstv, SmLs01–09, AtmWtAg) bundled verbatim; a one-way ANOVA is OLS ofy ~ C(group), so these certify the F-statistic / R² / sum-of-squares numerical accuracy. Backed by the new mean-centred OLS fit (see ⚠️ Correctness below),sp.regressreproduces the certified F to machine precision through the average-difficulty family (incl. n=18009); the three highest-difficulty designs (SmLs07/08/09, 9 constant leading digits) reach the irreducible IEEE-754 float64 floor (~7e-5) of their data and are checked at a documented 1e-3 tolerance. All reference R DOIs verified via Crossref. - Tier D analytic special-case test campaign — 24 reference-less estimators
now carry known-truth tests (77 tests across 9
tests/test_tierD_*.pyfiles). Estimators that had no numerical-assertion test (only smoke calls or none) are now anchored to closed-form identities or known-DGP recovery: partial-identification bounds (horowitz_manski,iv_bounds,oster_delta,trimming), power/MDE (power,mde,power_cluster_rct,power_iv),frontdoor,ps_balance,moran_local,effective_f_test,stepwise,feglm,mi_estimate, boundary/multi-score RD aliases (boundary_rd,geographic_rd,multi_score_rd), production functions (levpet,opreg),peer_effects,notch,model_averaging_dml,test_calibration. A read-only classifierscripts/tierd_classify.pygrades every registered function's test evidence (reference / anchored / weak / smoke / untested) and emits the worklist; the zero-guard P1 floor went from 25 → 0 (the lone remainder,blp, fixed in this release — see Fixed). Purely additive: no estimator numerics changed. - Tier B executable replication notebooks
(
Paper-JSS/replication/notebooks/*.ipynb). Self-contained Jupyter notebooks reproduce the headline numbers of Card (1995), ADH (2010) Prop 99, LaLonde/DW NSW, Lee (2008) RD, and Graddy (2006) from the bundled real data; each ends with a drift-guard cell, andtests/test_replication_notebooks.pyexecutes them headless in CI (fails on drift). Generated byscripts/build_replication_notebooks.py; newnotebooksextra (pip install -e ".[notebooks]"). - Agent-native surface hardening (agent-infra work line). A batch of
additive metadata / result-object / MCP-runtime improvements that make the
schema an agent reads (
sp.describe_function) trustworthy. No estimator numerics, signatures, orpaper.md/paper.bibtouched (seedocs/dev/agent_infra_campaign.mdfor the JOSS-isolation contract): EconometricResults.cite(format=...)— citation parity withCausalResult. Previouslysp.bib_for(<regression result>)raised because onlyCausalResultcarried.cite(). The bib key resolves exactly frommodel_info['citation_key'] → model_type → methodagainst the sharedCausalResult._CITATIONS; OLS/logit/probit/poisson return a placeholder rather than a fuzzy/fabricated match (§10 zero-hallucination). Pure addition (tests/test_econometric_results_cite.py).- Opt-in TTL + reason-aware misses for the MCP result cache. New
STATSPAI_MCP_RESULT_CACHE_TTLenv var (default unset = no expiry, prior behaviour byte-identical); expired handles are swept lazily on access / eagerly on insert, and a bounded ledger records why a handle left (ttl/lru/explicit) so the "result expired" hint an agent receives is tailored to the cause. Newevict()/purge_expired()/stats()(tests/test_result_cache_ttl.py). - Docstring parser now reads type-less NumPy param headers (
namethen an indented description, no: type), recovering 65 parameter descriptions across 41 functions (feols,causal_forest,match, …) thatdescribe_functionwas silently dropping. No new dependency; a column-0 barename branch with the false-positive boundary pinned bytests/test_docstring_param_parser.py. - Three forward CI guards (test-only, no runtime change): every MCP
workflow tool has a real dispatch branch
(
tests/test_workflow_tool_dispatch_contract.py); the 42-edgeFunctionSpec.inherits_fromgraph stays free of dangling parents / self-references / cycles (tests/test_inherits_from_integrity.py); and a concrete source annotation never collapses toAnyin auto-generated specs (tests/test_auto_spec_type_resolution.py).
Changed¶
- Repositioned StatsPAI as a "library" (not "platform") across the README,
README_CN,
pyproject.tomldescription, and the GitHub repo description. Documentation/positioning wording only — no API, behaviour, or estimator numerics changed.
Fixed¶
-
⚠️ Functionality fix:
sp.blpwas non-functional on every estimation path. The GMM objective called_gmm_objective(..., maxiter=1000)but the parameter is namedmaxiter_inner, so everysp.blpcall raisedTypeError: _gmm_objective() got an unexpected keyword argument 'maxiter'before producing any output. Renamed the keyword at both call sites (first- and second-stage GMM). A new analytic recovery test (tests/test_tierD_structural_analytic.py::TestBLPAnalytic) recovers the known linear price/characteristic coefficients on a logit DGP with endogenous price and valid cost instruments, and guards the keyword regression directly. This clears the last Tier D zero-guard remainder. -
⚠️ Correctness fix:
sp.granger_causalitytest statistic was wrong by orders of magnitude. The Wald variance was a placeholderV = sigma2 * I(the outcome residual variance as if it were the coefficient covariance), ignoring the design-matrix term(X'X)⁻¹. The F-statistic was therefore too small by a factor of ≈T·Var(regressors), so the test almost never rejected regardless of the true causal structure — on a DGP wherex[t-1]drivesywith coefficient 0.8, the genuine F is ≈326 (p≈1e-16) but the function returned F≈0.36 (p≈0.70,reject=False).VARResultnow carries(X'X)⁻¹andgranger_causalityforms the proper coefficient covarianceσ²_caused·(X'X)⁻¹, so the F now matches the restricted-vs-unrestricted OLS F-test and detects causal direction correctly. Guarded bytests/test_tierD_p2_timeseries_analytic.py. No prior valid result is invalidated (the old statistic was statistically meaningless); no JOSS/JSS table usesgranger_causality. Found by the Tier D campaign (CLAUDE.md §5). -
⚠️ Correctness fix: d-separation (
statspai.dag) was wrong on forks and colliders. The moralisation step in_d_separatedmarried siblings (children of a common parent) instead of co-parents (parents of a common child). As a result conditioning on a common cause failed to block a fork (A ⊥ C | MonM→A, M→CreturnedFalse) and conditioning on a collider failed to open it (A ⊥ C | KonA→K←CreturnedTrue) — the two non-trivial d-separation cases were both backwards. This propagated to everything built on_d_separated:DAG.d_separated,adjustment_sets,backdoor_paths,do_rule1/2/3,do_calculus_apply,swig, anddag_recommend_estimator. Moralisation now connects every pair of a node's parents; the chain/fork/collider truths and back-door adjustment-set finding are correct (e.g.adjustment_sets("X","Y")onW→X, W→Y, X→Ynow returns[{W}]). Guarded bytests/test_tierD_p2_dag_dsep_analytic.py; all 9 dag-touching test files still pass (none had pinned the broken behaviour). No JOSS/JSS table uses these graph routines. Found by the Tier D campaign. -
⚠️ Correctness fix:
sp.evalueHR E-values and confidence-interval E-values (now full REValueparity). Two behaviours that can move a previously-reported number (#21): measure='HR'was always treated as a rare-outcome ratio (OR ≈ RR ≈ HR). It now uses the exact common-outcome conversion(1 − 0.5^√HR)/(1 − 0.5^√(1/HR))by default (rare=False), matchingEValue::evalues.HR. HR-based E-values therefore change for non-rare outcomes; passrare=Trueto recover the old rare approximation. (ORalready converted to RR by default, so OR results are unchanged;RRis unaffected.)-
A confidence-interval E-value is now clamped to exactly 1 when the interval contains the null (or a user-supplied
truereference). The E-value was previously computed from the interval limit regardless, so a non-significant result could report a spurious E-value > 1; it now correctly reports 1 (no unmeasured confounding is needed to explain a result already compatible with the null). The keywordrare_outcomeis renamedrare; the old name still works as aDeprecationWarningalias. Verified at machine precision against REValueacross every measure (tests/r_parity/23_evalue.py, 26 rows, worst relative difference 5.8e-14). JOSS/JSS note: this is the23_evaluerow of the JSS cross-language parity table (Paper-JSS/manuscript/tables/appendix_b_parity.tex); the change increases agreement with the gold-standard R package and the row stays a machine-precision PASS (CI "Numerical reference parity" + "R closed-form parity" jobs green). No JOSS (#10604) figure uses an HR or CI E-value. -
Agent-UX:
describe_functionadvertised stale params / wrong defaults for 29 hand-written specs (two invariant classes, 18 + 11). An agent that reads the schema and callssp.<name>(**kwargs)verbatim was led intoTypeErrors and incorrect defaults. No estimator numerics, signatures, or parity numbers changed — metadata only. - 18 specs advertised param names the function cannot accept (e.g.
metalearnersaidtreatment/method; the signature — and the spec's own example — usetreat/learner). Fixed to match real signatures;tests/test_registry_signature_contract.pynow locks two invariants over all hand-written specs (no phantom params; every required signature param documented). -
11 specs reported a default value the estimator does not use — e.g.
drdidmethod'dr'→'imp'(enum['dr','or','ipw','reg','stdipw']→['imp','trad']),did_bcfn_trees200→50,harvest_didreference'pre'(str)→-1(int),iv_compare/iv_diagdefaults stored as string-reprs-of-tuples → real lists.tests/test_registry_default_contract.pylocks that a pinned default neither contradicts the signature nor falls outside its own enum. (Also surfaced and corrected a latent duplicateharvest_didregistration.) -
⚠️ Reproducibility fix:
sp.match(method='nearest')now resolves exact nearest-neighbor ties by source DataFrame index. The Euclidean/propensity nearest-neighbor path previously relied onargpartition/ incidental row order for equal-distance controls (and target order when matching without replacement), so exact ties on discrete or binary covariates could move the ATT across environments. Equal-distance ties now select lower-index pool units first, with lower-index target units used as the without-replacement fallback. A shuffled-row regression test guards the policy, andtests/test_tierD_lalonde_psm_guard.pyis tightened from the old cross-backend band to the pinned LaLonde PSM ATT (1963.43). Existing results change only when previous data contained exact equal-distance ties. See MIGRATION.md.
Known issues¶
- NIST StRD Linear Least Squares certification for the OLS kernel
(
tests/numerical_accuracy/test_nist_strd_ols.py). All 11 NIST Statistical Reference Datasets for linear regression (Norris, Pontius, NoInt1/2, Filip, Longley, Wampler1–5) are bundled verbatim undertests/numerical_accuracy/_fixtures/nist_strd/and checked against their embedded certified values. Both the low-levelols_fit/OLSEstimatorkernel and the certified standard errors are verified, plus a QR-beats-normal-equations guard on the ill-conditioned designs. Lives in a dedicatedtests/numerical_accuracy/suite (separate from the R/Statareference_parity/fixtures) since it certifies certified-value accuracy rather than cross-language parity.
Changed (⚠️ Correctness)¶
-
sp.regressnow fits in mean-centred (Frisch-Waugh-Lovell) coordinates, fixing coefficient/F/R² loss under large constant offsets. When an intercept is present,OLSEstimator.estimatedemeansyand the non-intercept regressors before the least-squares solve and reconstructs the intercept, so a large constant offset iny(or a regressor) no longer destroys the slope coefficients through catastrophic cancellation. FWL makes this algebraically identical to the previous fit, so well-conditioned designs are unchanged to machine precision (verified: coefficients matchnumpy.linalg.lstsqto ~1e-15; the NIST StRD Linear LS certification and all R/Stata parity suites are unchanged). The effect is only visible on pathological designs: on the NIST StRD ANOVA datasetsSmLs07/08/09(9 constant leading digits, e.g.y ≈ 1.0000000004e12) the certified-F relative error drops from ≈4.8e-4 / 9.9e-3 / 2.3e-2 to the irreducible IEEE-754 float64 floor (~3–7e-5). Previously these were the lone failing NIST ANOVA cases; they now pass at a documented float64-floor tolerance. No user-facing numbers on real data change. -
OLS kernel now solves via QR factorisation instead of the normal equations.
ols_fit(coefficients) andOLSEstimator.estimate(covariance) previously solved(X'X) b = X'yand formedinv(X'X), which squares the condition number of the design matrix. On well-conditioned data nothing observable changes (results match the old path to ~1e-12; the fullreference_parity+external_parityJOSS suites are unaffected). On ill-conditioned designs the new path is dramatically more accurate: on the NIST StRD suite the degree-10 Filippelli polynomial went from 0 correct digits to ~7, Longley from ~7.7 to ~11, and Wampler1 from ~6.1 to ~9.6 correct digits. Any code that fit OLS on near-collinear or high-degree polynomial designs and relied on (silently wrong) coefficients will now get materially different — correct — numbers. See MIGRATION.md. - Exact-fit OLS no longer emits a divide-by-zero
RuntimeWarning. When a regression fits the data exactly (R² == 1, e.g. NIST Wampler1/2), the F-statistic is now reported asinf(matching the NIST-certified "Infinity") withf_pvalue = 0.0, computed without tripping the warning. The reported value is unchanged for every non-exact fit. - Auxiliary OLS in
sp.did(method='wooldridge')and the Romano–Wolf step-down (sp.romano_wolf) hardened to the same QR solve. Both carried private_ols_fithelpers that still formedinv(X'X)— Wooldridge for both the coefficients and the cluster/HC1 sandwich bread, Romano–Wolf for the bread only (its coefficients were already QR). They now reuse the QR factor ((X'X)⁻¹ = R⁻¹R⁻ᵀ). Output is bit-identical to ~1e-15 on well-conditioned designs (verified directly, and the Callaway–Sant'Anna R-parity pin plus the MHT suites are unchanged); the change only adds headroom on near-collinear designs (e.g. saturated group×period dummies). No API change. sp.regressnow fails loudly on a perfectly collinear design instead of returning unidentified garbage. A rank-deficient design (duplicate or proportional regressors, the dummy-variable trap with complementary 0/1 dummies, or a constant non-intercept regressor) previously returned enormous meaningless coefficients (e.g.~1e14) with no warning. It now raisesNumericalInstability, naming the offending columns in.diagnostics. Detection is structural (duplicate/proportional columns, zero-variance regressors), not conditioning-based — a singular-value/rank tolerance loose enough to catch collinearity would also flag legitimately ill-conditioned full-rank designs such as NIST Filippelli (s_min/s_max ~ 6e-16), so those still fit. General exact dependence among 3+ columns that is not a pairwise duplicate/constant is intentionally not auto-detected for the same reason. See MIGRATION.md.sp.logit/sp.probitnow warn on perfect (quasi-complete) separation. When the outcome is perfectly predicted by the linear index the MLE does not exist; Newton–Raphson previously "converged" by its step tolerance to large finite coefficients with no signal. It now emits aConvergenceWarning(the reported coefficients/SEs are stopping-rule artefacts; the hint suggests penalized/Firth logistic regression). Non-separable fits are unaffected.
Fixed¶
-
sp.iv(method='lasso', formula=...)raisedTypeErrorinstead of fitting. The IV dispatcher forwardedformula=verbatim intolasso_iv, which takes nativex_endog/z/x_exoglists and does not accept aformulaargument, so the Patsy-style formula route (sp.iv(method='lasso', formula="y ~ (d ~ z1 + z2) + x", data=df)) failed withlasso_iv() got an unexpected keyword argument 'formula'. The dispatcher now parses the formula intolasso_iv's native parameter names (and accepts the canonicalendog/instruments/exogaliases), so the formula path returns the same estimates as the explicitx_endog=[...],z=[...]calling convention — verified bit-for-bit (atol=0) intests/test_iv_cov_tail.py::test_dispatch_lasso_formula_matches_native. No existing numerics move: the nativelasso_ivpath is unchanged; this only turns a previously-erroring route into a working one. -
decomposition/_kernel_density_atNumPy 1.25DeprecationWarning. The legacy RIF kernel-density helper calledfloat()on the length-1 array returned byscipy.stats.gaussian_kde(...)(point), which NumPy 1.25 deprecates (and a future NumPy will turn into an error). It now indexes the single element first (float(np.asarray(kde(point)).ravel()[0])). The returned value is identical — it is the same elementfloat()already extracted — so no decomposition output changes; the warning is simply gone. -
sp.event_studycrashed on string/extension-dtype time columns under pandas ≥ 3.0. The non-numeric-time branch guarded onnp.issubdtype(col.dtype, np.number), which raisesTypeError: Cannot interpret '<StringDtype...>'when pandas ≥ 3.0 infers a string column asStringDtype(rather thanobject). Switched both checks topd.api.types.is_numeric_dtype(col), whose truth value is identical for every numpy numeric dtype — so numeric-time results are byte-for-byte unchanged (verified against the full event-study suite and thedidreference-parity set); it only repairs the previously-crashing string-time path. Surfaced by the Windows/macOS CI matrix on pandas 3.0 / numpy 2.4. -
sp.lpcmci/sp.dynotearswould crash on string columns under pandas ≥ 3.0. Their default numeric-variable filter used the samenp.issubdtype(col.dtype, np.number)anti-pattern, which raises on aStringDtypecolumn. Switched topd.api.types.is_numeric_dtype— identical numeric-column selection (verified against the causal-discovery suite), now simply skipping string/extension columns instead of crashing. Pre-emptive hardening found by a repo-wide scan after theevent_studyfix above.
Added¶
-
Registry-example bind guard (
tests/test_registry_examples_bind.py). A parametrized test now statically parses every registeredexamplestring and binds the keyword arguments of itssp.<name>(...)call against the real signature, failing if an example references a keyword the function does not accept or does not parse. This locks down the agent copy-paste path — an agent readssp.describe_function(name)and runs the example verbatim — so the registry/example drift fixed below cannot silently return. 373 examples bind green. -
parityoptional extra — opt-in DoubleML reference pin forsp.dml.pip install -e ".[dev,parity]"now installsdoubleml-for-py(the Python DoubleML reference of Bach, Chernozhukov, Kurz & Spindler, JMLR 23(53), 2022), sotests/external_parity/test_dml_python_parity.pyruns instead of silently skipping. Under identical scikit-learn learners and folds,sp.dml(model='plr')reproducesdoubleml-for-pyto machine precision on the seed-42 fixture —|Δ coefficient| = 1.1e-16and|Δ standard error| = 1.4e-17, i.e. one float64 unit in the last place.doublemlremains not a runtime dependency. The measured numbers, software versions, and the divergence discussion are recorded in a new Double Machine Learning Parity section ofdocs/joss_validation_dossier.md, with a one-command reproduce path added todocs/joss_reviewer_guide.md. Verified by installing the extra and running bothtests/external_parity/test_dml_python_parity.pyandtests/reference_parity/test_dml_parity.py(55 DML tests green).
Fixed¶
-
⚠️ Correctness:
sp.drdid(method='trad')returned ~half the true ATT. The traditional doubly-robust DiD branch ofsp.drdid(Sant'Anna & Zhao 2020) normalised each of its four cell terms (treated/control × post/pre) by the full sample sizeninstead of by that cell's weight mass. This multiplied every term by the cell's sample share (~0.25 each on a balanced 2×2), biasing the ATT toward zero by ~50%: on a 2×2 with a true ATT of 2.0 (raw DiD 1.96) the traditional estimator returned ≈1.04. Each term is now normalised by its own weight total, somethod='trad'reduces exactly to the raw 2×2 DiD when no covariates are supplied and recovers the true ATT with covariates. The improved (locally efficient)method='imp'— the default — already normalised correctly and is unchanged, so no default-path or parity/dossier numbers move.sp.drdidnow also raisesValueErroron an unknownmethodinstead of silently treating it as traditional (previously e.g.method='ipw'ran the traditional branch). See MIGRATION.md. -
⚠️ Correctness:
sp.multiway_cluster_vcovundercounted intersection clusters, biasing multiway-cluster-robust standard errors. The Cameron-Gelbach-Miller inclusion-exclusion builds an intersection cluster (the unique combinations of the clustering dimensions); its key was formed by joining the dimensions into a single string with a"\0"separator, but NumPy fixed-width unicode strips the embedded NUL, so e.g.(1, 23)and(12, 3)both collapsed to"123". On a 40×50 crossed-cluster DGP this merged 1733 true intersection clusters into 1639, biasing the two-way SE by ~0.2% (~0.5% at three-way) versus the canonical estimator. The intersection key is now built collision-free vianp.unique(axis=0)on per-dimension integer codes.sp.multiway_cluster_vcovnow reproducessandwich::vcovCLandsp.twoway_clusterto machine precision (two-way exact; three-way rel ~4e-7), pinned by the newtests/r_paritymodule 56 and a direct twoway-vs-multiway regression test. Propagates todid.harvestandpanel.feolsmultiway-clustered SEs.sp.twoway_clusteritself was already correct (distinct collision-free key) and is unchanged. See MIGRATION.md. -
.glance()crashed (OverflowError: cannot convert float infinity to integer) on Cox and parametric-survival (survreg) results. Those estimators deliberately storedf_resid = infto signal a large-sample (normal) reference distribution, butglance()cast the residual degrees of freedom withint()unconditionally. The cast now passes non-finite degrees of freedom through unchanged; finite results keep an integerdf_resid(no change). A crash-hunt across ~48 fitted results confirmed the rest of the §3 unified-result export surface (summary/to_latex/to_markdown/to_word/to_excel/cite/tidy/for_agent/plot) is otherwise clean across 20+ estimator families. Covered bytests/test_glance_survival.py. -
sp.event_studyresults crashed the library's own exporters, plotters and pre-trend tools (canonical-column mismatch).sp.event_studyemitted its coefficient table under the column nameestimate, but the rest of the DID family — and every downstream consumer — keys on the canonicalattcolumn (did._core.EVENT_STUDY_COLUMNS). So the canonical event-study estimator was incompatible with its own tooling:.tidy()(and the.to_markdown()/.to_excel()/.to_word()exporters that delegate to it) raisedTypeError: unsupported operand type(s) for /: 'NoneType' and 'float';.plot()/.event_study_plot()/sp.enhanced_event_study_plotraisedKeyError: 'att'; andsp.honest_did/sp.breakdown_mraisedValueError: missing {'att'}. The event-study table now carries the canonicalattcolumn (withestimateretained as a backward-compatible alias), fixing every consumer at the source. No numerical change. -
sp.pretrends_test/sp.pretrends_summarycrashed (LinAlgError: Singular matrix) on every standardsp.event_studyresult — the same reference-period defect already fixed inpretrends_power: the SE = 0 omitted period made the diagonal VCV singular. It is now dropped before inversion, with a clearValueErroron a genuinely collinear pre-period set. -
sp.diagnose_resultcrashed (TypeError: bad operand type for abs(): 'str') onsp.synthresults. The donor-pool check iterated the synthetic weights, butsp.synthstores them as a['unit', 'weight']DataFrame, so iteration yielded column-name strings. The weights are now coerced to their numeric values regardless of container (DataFrame / Series / dict / array). -
sp.pretrends_powercrashed (LinAlgError: Singular matrix) on every standardsp.event_studyresult. Roth's (2022) pre-trend power calculation inverts the pre-period variance–covariance matrix, but the omitted reference period (relative time −1) is reported with a standard error of exactly zero, so the diagonal VCV was singular andnp.linalg.invraised on the exact workflow shown in the function's own docstring. The reference period (and any other mechanically-normalised, zero-SE period) is now dropped before inversion — it is the baseline, not an estimated coefficient — so the joint pre-trend test runs on the estimated pre-periods only. A full-rankmodel_info['vcv_pre']and a full-lengthdeltaare aligned to the retained periods, and a still-singular VCV now raises a clearValueError(collinear pre-periods) instead of an opaque NumPy error. No output changes for any call that previously succeeded. Covered bytests/test_pretrends_power.py. -
⚠️ Correctness —
sp.structural_breaksup-F p-value used the wrong null distribution. The Chow/sup-F statistic is a supremum of the F statistic over candidate break points, so under H0 it follows the Andrews (1993) sup-F law — notF(k, n-2k). The previous code referred the maximised statistic to the ordinary F CDF, which ignored the maximisation and massively over-rejected: on pure Gaussian white noise at the 5% level the test flagged a spurious structural break in 33–37% of series (measured, n ∈ {100, 200, 400}). The p-value is now computed from the Andrews (1993) limiting null — a q-vector Brownian-bridge functional sampled by a deterministic, cached simulation on a grid tied to the sample size — restoring nominal size (~0.05) while retaining power (1.00 / 0.88 to detect a one-/half-σ mean shift at n=200). The same correct threshold now drives the Bai-Perron sequentialsupF(l+1|l)stopping rule (previously the same naive-F over-detection), somethod='bai-perron'no longer over-segments noise. As a side benefit the Bai-Perron result now populatesf_stats/p_values(one sup-F statistic and Andrews p-value per detected break, chronologically aligned) instead of returningNone. Reference verified via Crossref / Econometric Society / RePEc: Andrews, D.W.K. (1993), Econometrica 61(4), 821-856, doi:10.2307/2951764. SeeMIGRATION.md. -
33 registered
examplestrings were statically broken (agent-UX). Six failed to parse (stray/unmatchedparens, a positional-after-keyword...placeholder, an unclosed call) and 27 passed a keyword the function does not accept — a deterministicTypeError/SyntaxErroron the exact agent copy-paste path. Root causes were two long-standing parameter-name drifts: the Mendelian-randomization family (mr_egger/mr_ivw/mr_raps/mr_presso) used the shortb_exp/b_out/se_exp/se_outnames instead of the implementedbeta_exposure/beta_outcome/se_exposure/se_outcome, and the Bayesian family (bayes_rd/bayes_fuzzy_rd/bayes_mte/bayes_did/bayes_iv) plusmetalearner/tmle/causal_impact/sensemakr/spec_curve/qreg/tobit/heckman/cluster_cross_interference/causal_dqn/pci_mtp/bartik/ffl_decomposedrifted from their signatures.qreg/tobit/heckman/spec_curveadditionally had wrong call shapes (formula passed positionally intodata; flatcontrolswhere a list-of-lists was required) and were rebuilt to runnable form and executed to confirm. Fixed acrossregistry.pyand_baseline_cards.py; the regeneratedschemas/bundle stays in sync. Guarded by the new bind test above. sp.heckman/sp.tobitMCP tool schemas advertised parameters the functions reject (agent-native path). The priorexample-string fix above repaired the copy-paste path, but the other drifted field — theFunctionSpec.paramsthat generate each tool's MCPinput_schema— was left pointing at a never-implemented formula API:heckmanexposed requiredformula/select_formulaandtobitexposedformula/lower/upper, while the implementations takedata, y, x, select, zanddata, y, x, ll, ul. An agent calling either tool exactly as the manifest instructed hit a deterministicTypeError: ... got an unexpected keyword argument 'formula'. Theexample-bind guard above does not cover this because it validates theexamplefield, notparams. Re-pointed bothparamsblocks at the real signatures inregistry.py; the regeneratedschemas/bundle is back in sync and both tools now dispatch (sp.heckmanrecovers the known-truth β=2.05 fixture through the MCP dispatch layer). The estimator numerics were never affected — Python-directsp.heckman(...)/sp.tobit(...)always worked; only the MCP schema was wrong.- ⚠️ Correctness —
sp.stabilized_weights/sp.msmsilently dropped all IPTW adjustment on single-period panels. On a single-period (point- treatment) panel the within-unit lagged-treatment column is all-zero, which made the logistic treatment-model design singular. The internal_logit_probahelper swallowed the resultingLinAlgErrorand silently returned the marginal mean for both the numerator and denominator models, so every stabilized weight collapsed to exactly1.0— turning the marginal structural model into an unweighted, confounded regression with no warning (a CLAUDE.md §7 silent-degradation violation)._logit_probanow drops zero-variance columns before fitting (the numerically correct move: such columns only duplicate the intercept thatadd_constantadds) and, if the treatment model still fails to converge (e.g. perfect separation), emits aRuntimeWarninginstead of silently degrading. On the regression fixture the fixed path reproduces a textbook stabilized-IPTW computation to machine precision (max|Δw| = 1.8e-15); the already-correct multi-period path is unchanged. Verified bytests/test_msm_singleperiod_iptw_regression.py(3 new tests) plus the existingtests/test_msm.py(8 green total). See MIGRATION.md. - CI: scikit-learn 1.9 compatibility —
LassoCV(n_alphas=...)removal. scikit-learn 1.7 deprecated then_alphasargument of the coordinate-descent CV estimators (LassoCV/ElasticNetCV/…) in favour of passing an integer toalphas, and 1.9 removed it outright — constructingLassoCV(n_alphas=20)now raisesTypeError: LassoCV.__init__() got an unexpected keyword argument 'n_alphas'. TheCI/CD Pipelinematrix resolves to scikit-learn 1.9.0, sosp.tmle(method='hal')(HALRegressor) andsp.rd_flex(learner='lasso')both failed at construction (tests/test_hal_tmle.py,tests/test_estimator_provenance_round5.py::TestHalTmleProvenance,tests/test_low_cov_battery.py::test_hal_regressor_predicts_finite). A new version-robust shimstatspai.compat.sklearn.lasso_cv_alphas_kwargs(n)emits{"alphas": n}on scikit-learn >= 1.7 and{"n_alphas": n}on older releases; both call sites now route through it. The number of path alphas (20 for HAL, 50 forrd_flex) is unchanged — no numerical effect. Verified on the local scikit-learn 1.6.1 pin (8 HAL tests + 6rd_flextests green) and by version-logic assertion across 1.6/1.7/1.8/1.9. -
CI:
schemas/functions.jsonno longer drifts by pandas version. The auto-registered schema export stringified parameter annotations viastr(typing.Optional[pandas.DataFrame]), which pandas 3.0 renders aspandas.DataFramewhile pandas < 3.0 renders as the internalpandas.core.frame.DataFrame. Because pandas 3.0 requires Python >= 3.11, the committed bundle matched the CIubuntu x 3.10shard but was flagged stale on every 3.11/3.12/3.13 shard, failingCI/CD Pipeline(tests/test_schema_export.py::test_committed_schemas_dir_is_in_sync)._stringify_annotationnow canonicalises pandas internal module paths to their public form, so the exported bundle is byte-identical across pandas versions. Verified by regenerating under both pandas 2.x and pandas 3.0 and confirming identical output. Agent-facing schema metadata only — no numerical effect. -
CI: Windows shards no longer fail on validation-evidence path separators.
_scan_reference_testsrecorded each parity test as a validation-evidence note viastr(path.relative_to(root)), which emits OS-native separators — so Windows runners wrotetests\reference_parity\...backslash notes. That (a) failedtests/test_jss_validation_api.py::test_certified_validated_symbols_have_attached_evidence_notesbecause the JSS evidence-grade markers are forward-slash, and (b) driftedagent_cards.jsonaway from the POSIX-generated committed bundle, failing the schema-sync guard on every Windows shard. Notes are now built withPath.as_posix()and the test files are iterated in a POSIX-keyed sort, so the evidence notes — and the exported bundle — are byte-identical on Windows and POSIX. Agent-facing metadata only — no numerical effect. -
CI hardening: swept the same OS-portability class across the codebase. A multi-agent audit surfaced the remaining latent siblings of the two bugs above (none yet red, but each a Windows landmine of the identical class):
validation._rel(artifact paths insp.validation_report()) andscripts/stability_audit.pynow build paths withPath.as_posix();tests/test_causal_workflow.pyandtests/test_paper_tables.pynow read the UTF-8 HTML/LaTeX/markdown reports they generate withencoding="utf-8"(cp1252 default would mojibake on Windows — CLAUDE.md §5). Two POSIX-runnable schema-bundle invariants were added (no backslash separators; no internalpandas.core.paths) so both classes are caught on any shard, not only the one whose regeneration diverges. -
Docs:
docs/reference/dml.mddocumented foursp.dmlpatterns that raised on copy-paste.r.coef→r.estimate;r.ci(alpha=0.05)→r.ci(a(lower, upper)tuple — the level is set viasp.dml(..., alpha=));r.influence_function(never exposed onCausalResult) →r.diagnostics/r.pvalue; and an IRM example passing a non-existenttrim=keyword (propensity clipping is automatic at[0.01, 0.99], with the clip counts reported inr.diagnostics). Every documented result attribute and all fourmodel=types now run on the bundled fixture. Docs only — no API or numerical change. -
Docs: corrected an unverified claim in
docs/guides/sp_dml_vs_doubleml.md. The guide asserted that matching propensity-trimming thresholds eliminates the smallsp.dml-vs-doubleml-for-pyIRM gap; the gap (0.0076 absolute, ≈ 0.10 SE) is verified not to move with trimming nor withnormalize_ipw, and stems from internal AIPW score construction. PLR agreement is now stated as the measured machine-precision figure (1.1e-16/1.4e-17) rather than "four decimal places", and an over-broad "PLR / PLIV / IIVM agree to machine precision" line was narrowed to the PLR case that is actually pinned. -
Docs: refreshed live registry-stats drift.
docs/stats.md(per-module table + the measured source/test LOC rows),README.md, andREADME_CN.mdnow matchpython scripts/registry_stats.py(269,043 core LOC / 96,514 test LOC; 1,020 functions across 81 submodules). This restorestests/test_jss_release_manifest.py::test_registry_stats_docs_are_live(scripts/registry_stats.py --checkexits 0). Docs only — no API change.
[1.16.1] — 2026-06-01¶
⚠️ Correctness fix — sp.synth() default restored to canonical classic SCM¶
sp.synth(...)now defaults tomethod='classic'(the Abadie, Diamond & Hainmueller 2010 synthetic control) again, matching the documented default in thesp.synthdocstring (method : str, default 'classic'), theSynth::synth→sp.synthmapping in the migration-from-R guide, and the canonical Prop99 examples in the docs andexamples/synth_prop99.py. The signature default had silently drifted tomethod='augmented'(Augmented SCM, Ben-Michael, Feller & Rothstein 2021), whose ridge correction is designed to allow negative donor weights by extrapolating outside the donor convex hull — surprising for the canonicalsynth()entry point and inconsistent with the package's own documentation. A baresp.synth(...)call again returns convex, non-negative, sum-to-one donor weights. Augmented SCM remains fully available viamethod='augmented'(or'ascm'); every non-default method is unchanged. Verified against the full synth test surface (170 tests) and the RSynthrecovery parity test. Guarded bytests/test_synth.py::TestSyntheticControl::test_weights_non_negative.
⚠️ Correctness fix — synthetic-control weights projected back onto the simplex¶
solve_simplex_weights(the inner W solver shared bysp.synthand the SCM/sdid/augsynth/gsynth family) now projects the SLSQP solution back onto the unit simplex — clipping sub-tolerance negative weights to zero and renormalising to sum 1 — before returning. SLSQP enforces thew_j ≥ 0, Σw = 1constraints only up to its own tolerance, so the rawresult.xcould carry small negative donor weights (observed down to≈ -7.5e-4), violating the non-negativity invariant that the synthetic control estimand and every reference implementation (RSynth,gsynth) rely on. Donor weights change by the solver's sub-tolerance noise; the projection moves the native output toward the reference clean-simplex solution, so reference parity is preserved (verified against the synth parity suite). Guarded bytests/test_synth.py::TestSyntheticControl::test_weights_non_negative.
Fixed — agent schema generation preserves full typing shapes¶
sp.function_schema/ the registry schema generator now keep parametrised typing annotations intact across Python 3.9–3.13.registry._stringify_annotationcheckedhasattr(ann, "__name__")before inspecting typing generics; on Python 3.10+ aliases such asOptional[Dict[str, Any]]expose__name__, so the helper collapsed them to the bare origin name (Optional,Dict) and dropped the inner element types. It now resolvestyping.-prefixed and__origin__-bearing annotations first, so the machine-readable parameter shapes agents consume stay stable and version-independent. No estimator numbers change.
Docs¶
- Reviewer-facing validation docs (
docs/joss_reviewer_guide.md,docs/joss_validation_dossier.md,README.md) refreshed: the focused reviewer follow-up regression command is documented, thetests/test_joss_reviewer_followups.pycompatibility path is restored for the public review thread (delegating totests/test_external_reviewer_followups.py), and the activity/measurement dates are updated to 2026-06-01. Livedocs/stats.mdcounts re-measured against the 1.16.1 source tree (source LOC 269,010).
[1.16.0+source.20260531] — 2026-05-31¶
Correctness fix — RD density native path now ports rddensity defaults¶
sp.rddensity(backend="native")now mirrors the defaultrddensity::rddensityunrestricted triangular-kernel path instead of using a dependency-light Silverman-style pilot bandwidth and ECDF-slope approximation. The native implementation portsrdbwdensitycombination bandwidths, mass-point ECDF handling, and jackknife CJM local-polynomial density inference. On the Lee/RD Senate replica, native StatsPAI now matchesrddensityon the robust p-value atrel = 3.3e-11and on the Stata reference atrel = 8.9e-11;09_rddensitymoves from a T4 bandwidth-selector disclosure to a T2 native reference-parity pass. Side-specific manual bandwidths remain supported, andbackend="r"still delegates to the R package for users who want direct package execution. Guarded bytests/test_rddensity_io.pyand the Track A parity harness.
⚠️ Correctness fix — causal-forest ATE/ATT now doubly-robust (AIPW)¶
CausalForest.average_treatment_effect(and thetarget_sample"all"/"treated"/"control"/"overlap"aggregations) now returns the doubly-robust AIPW influence-function mean — the estimandgrf::average_treatment_effectreports — instead of a plug-in average of the regularisation-shrunk CATE predictions. The plug-in average overshoots the true effect (≈ 15 % on a clean-overlap DGP) and disagreed withgrfby an order of magnitude on overlap-pathological samples. The AIPW score reuses the forest's own cross-fitted nuisancesm̂(X)=Ê[Y|X],ê(X)=Ê[T|X]and the CATEτ̂(X):Γ_i = τ̂ + (T−ê)/(ê(1−ê))·(Y − m̂ − (T−ê)τ̂). Numbers change for any code readingaverage_treatment_effect(...)['estimate']; the SE is now the influence-function SEsd(Γ)/√n. The plug-incf.ate()/cf.att()convenience methods are unchanged and retained, but are no longer the validation estimand. Users who reported a causal-forest ATE fromaverage_treatment_effectshould re-run.- On a clean-overlap DGP (
e(X)∈[0.30,0.70], known ATE = 1) the AIPW ATE recovers the truth within 1.5 SE and agrees withgrfatrel = 0.037(z = 0.69combined SE). For the JSS source snapshot,13_causal_forestis now a T3 combined-Monte-Carlo-error pass: the row is like-for-like AIPW versusgrfand is graded against combined sampling error, not sold as deterministic machine-precision equality. The strictness-tier denominator is12 / 27 / 10 / 2 on the 51 R-joined modules: the forest row shares the methodological/T4 bucket with the remaining documented classical-SCM non-uniqueness gap, but it is the only row in that bucket graded as a T3 combined-Monte-Carlo-error pass. - Guards:
tests/reference_parity/test_causal_forest_aipw_recovery.py(recovery against truth, no R needed) and the tightenedtests/reference_parity/test_grf_parity.py(combined-SE parity vs a committedgrffixture). - Synthetic-control solver certified on identified problems and exact
Synthparity exposed when reviewers need it. Addedtests/reference_parity/test_scm_recovery.pyand the cross-language moduletests/r_parity/52_scm_unique: on a DGP whose synthetic-control weights are uniquely identified (treated unit exactly a convex combination of donors in the pre-period),sp.synth(method="classic")recovers the exact weights and gap (pre-RMSE = 0) and agrees withSynth::synthto 0.7 %. For the ambiguous Basque-data row, the parity harness keeps the native default visible as a documented donor-weight-non-uniqueness/reference-disagreement gap. On the same ADH special-predictor specification, native StatsPAI tracks Statasynthatrel = 4.2e-4while RSynthand Stata differ atrel = 2.3e-2; the methodological-gap ledger now fails if this guard disappears. Users who need exact R numbers can call the optionalsp.synth(method="classic", backend="synth")bridge. The release claim is therefore native-solver certification on identified SCM problems plus an explicit Basque reference-disagreement limitation, not a hidden machine-precision victory on a non-unique original-data example. - Augmented SCM native path now ports
augsynth's centered Ridge+SCM weight convention. The Basque18_augsynthrow now compares the native Python estimator directly withaugsynth::augsynth, not the optional R bridge: ATT relative error is7.9e-06and pre-RMSPE relative error is3.0e-06. - Generalized SCM native path now ports
gsynth/fect's two-way FE factor convention. The Basque19_gsynthrow now compares the native Python estimator directly withgsynth::gsynth, not the optional R bridge: ATT and pre-RMSPE match at machine precision, moving the row out of the T4 factor-convention bucket.backend="augsynth"andbackend="gsynth"remain reference-package migration bridges, not native parity comparators.
Changed — synthetic-DID regularisation aligned to synthdid convention¶
sp.sdidnative unit/time weights now use the \citet{arkhangelsky2021synthetic}synthdidregularisation and Frank-Wolfe weight solver. The native path now mirrorssynthdid:::collapsed.form,synthdid:::sc.weight.fw, the default sparsification step,ζ_ω = (N_tr · T_post)^{1/4} · σ̂, andζ_λ = 10^{-6}·σ̂, whereσ̂is the standard deviation of the control units' pre-treatment first differences. On the California-99 replica, the native ATT now matchessynthdid::synthdid_estimateon identical CSV bytes atrel = 2.6e-15; the row moves from a T4 regularisation-zeta gap to a T2 native reference-parity pass. The SE still uses StatsPAI's deterministic all-control placebo convention whilesynthdid_seuses random placebo replications, so the SE remains compared under a 5% tolerance rather than sold as bitwise equality.
Added — MCP protocol modernization¶
- Protocol version negotiation + bump to
2025-06-18(src/statspai/agent/mcp_server.py) — the server now negotiates its protocol revision with the client (SUPPORTED_PROTOCOL_VERSIONS = ("2025-06-18", "2025-03-26", "2024-11-05")): it echoes the client's requested revision when supported, else offers the latest. Replaces the hard-codedprotocolVersion: "2024-11-05". Fully backward-compatible — a client negotiating2024-11-05ignores the new fields below. - Tool annotations (MCP
2025-03-26) — every advertised tool now carriesannotations: {readOnlyHint: true, openWorldHint: false}. StatsPAI tools read the supplied dataset and compute; they never mutate the input file or external state, so a client can auto-approve calls without a confirmation prompt. A manifest entry may override either hint. - Structured tool output (MCP
2025-06-18) — everytools/callnow returns the result object asstructuredContentalongside the existingtextblock, so agents get typed data without re-parsing serialized JSON. Error envelopes are structured too (branch onerror_kind). Strict-JSON guarantee (noNaN/Infinity) holds for the structured payload as well. Every tool advertises a compactoutputSchema; the full documented result envelope (estimate/std_error/conf_low/conf_high/method/n_obs/coefficients/diagnostics/violations/next_steps/next_calls/citations/result_id/error…) is served once via the newstatspai://schema/resultresource rather than inlined into all ~480 tools — which keeps thetools/listpayload ~1.2 MB smaller (the schema is byte-identical per tool, so inlining it everywhere was 50% duplication).
Added — JSS reproducibility hardening (Track A R/Stata parity)¶
- R reference environment locked + proven reproducible
(
tests/r_parity/) — every committedresults/*_R.jsongolden value now carries an inlineprovenanceblock (R version, platform, BLAS/LAPACK, and the version of each attached/loaded package), emitted by_common.R::.r_provenance. The full 245-package dependency closure is pinned intests/r_parity/renv.lock(exact versions; true GitHub commit SHAs foraugsynth/synthdid), with a human-readable manifest intests/r_parity/R_ENVIRONMENT.md. A newverify_reproduce.pyre-runs each R reference on the committed CSV bytes and diffs every statistic against the golden value at a 1e-9 reproducibility tolerance: 46 of 47 data-driven modules reproduce bit-for-bit under R 4.5.2; the report isresults/REPRODUCIBILITY_REPORT.md. r-parity.ymlCI workflow — re-runs the closed-form / MLE / matching R core (fixest, sandwich, AER, survival, MASS, oaxaca, MatchIt) on every push and fails the build on any drift from the committed golden JSON, so the cross-language closed-form parity is genuinely refreshed in CI rather than only frozen.sp.validation_report(collect_tests=True)— shells out topytest --collect-onlyand returns the authoritative, parametrize-expanded parity test counts (124 reference-parity, 50 external-parity, 12 coverage Monte Carlo on the current source snapshot); a regression test pins those three to the JSS manuscript headline so a parity test added/removed without updating the paper fails CI. Defaultvalidation_report()path is unchanged (fast, metadata-only).- Strictness-tier breakdown in the Track A parity tables
(
tests/r_parity/compare.py) — each module is classified by its registered point-estimate tolerance into machine / iterative / moderate / methodological/T4 tiers (12 / 27 / 10 / 2 on the 51 R-joined modules), shown in the Markdown ledger and the LaTeX appendix caption so a machine-precision match is not flattened together with a deliberately loose stochastic or documented-convention tolerance. - Stata leg brought to the same rigor as R (
tests/stata_parity/) —_common.donow writes an inlineprovenanceblock (engine version, edition, OS) onto every*_Stata.json;verify_reproduce_stata.pyre-runs each.doon the committed CSV bytes and confirms all 44 Stata modules reproduce bit-for-bit (worst rel 0) under Stata 18 MP, including the iterative-optimiser commands (set seed 42+ deterministic solvers);_capture_stata_env.do+_gen_stata_env.pypin the engine and the verbatim*!banner of 17 community ado packages inSTATA_ENVIRONMENT.md(the Stata analogue ofrenv.lock). Reproduction ledger:results/REPRODUCIBILITY_REPORT_STATA.md. - Provenance drift guard
(
tests/test_parity_harness_contract.py) — the normal-CI contract suite now fails the build if any committed*_R.json(r_version + packages) or*_Stata.json(stata_version + edition) loses its provenance block.
Changed — Track A R golden values regenerated under the locked environment¶
- The current R parity ledger covers 51 rendered R-joined modules under
R 4.5.2 with the
renv.lockpackage set so each is self-describing. The material parity-status movement is the added52_scm_uniquecounterpart for classical SCM: an identified synthetic-control DGP now separates native solver correctness from ambiguous real-data weight selection, while the Basque row remains a documented native non-uniqueness gap. This is primarily a reference-fixture and evidence refresh, not a broadstatspaiestimator-output change. - The 44 committed
results/*_Stata.jsonwere likewise refreshed to embed the engineprovenanceblock; their numbers are unchanged (every module reproduces at exactly 0 under the locked Stata 18 MP environment).
Fixed — MCP cold-start bundle drift¶
- Stale schema-bundle guard (
.github/workflows/parity-guards.yml) — the MCP server servestools/list+ resources from the committedschemas/bundle on its cold-start fast path, gated only on a matchingstatspai_version. A registry change within the same version (a refactor between releases) drifted the bundle silently, so the server served a stale tool list. CI now runspython scripts/dump_schemas.py --checkalongside the existingregistry_stats --check, failing the build until the bundle is regenerated. Regenerated the bundle (schemas/+src/statspai/schemas/) to clear the existing drift.
Added — agent-native sprint¶
- Agent-card metadata overlay (
src/statspai/_agent_cards_extra.py) — 89 curated Tier-A cards (assumptions / pre_conditions / failure_modes / alternatives / typical_n_min) for certified + validated estimators that previously had none, applied viaregistry._apply_agent_card_seedswith extend-missing semantics (hand-writtenFunctionSpeccontent always wins). Lifts curated agent-native field coverage roughly threefold. Everyalternativeandexceptionis CI-validated to resolve. - Relational-integrity contract suite (
tests/test_agent_native_contract.py) — guards that every agent-cardalternatives/failure_modes.alternativeresolves to a real function/MCP tool, everyfailure_modes.exceptionis a real class, every advertised MCP tool is executable, every_FOLLOWUP_BY_TOOLnext-call is an advertised tool, and every_CITATIONS_BY_TOOLbib key exists inpaper.bib. - Machine-readable schema bundle (
scripts/dump_schemas.py,src/statspai/_schema_export.py,schemas/) — an import-free, versioned bundle (tools.json/functions.json/agent_cards.json/result.schema.json/index.json) so a non-Python client can discover the full surface offline. Includes a JSON Schema (draft 2020-12) for the agent-facing result payload, contract-tested against realCausalResultandEconometricResultsoutputs.--checkgates drift. - MCP-sampling LLM client (
statspai.causal_llm.sampling_client) —SamplingLLMClient/resolve_llm_client()bridge the MCP server→clientsampling/createMessageround-trip into theLLMClientinterface, sosp.llm_dag_propose(and friends) can reuse the connected agent's own model with no extra API key, falling back to the deterministic heuristic when sampling is unavailable. interpret_resultMCP tool — natural-language explanation of a fitted result from its cached handle. Wiresresolve_llm_client()into the server's tool-dispatch path: when the client advertised sampling it reuses the agent's own model (grounded in the result's own numbers — the model is told not to invent estimates), and degrades to a deterministic structured brief otherwise. Mid-call sampling failures fall back loudly (the error is surfaced insampling_error, never swallowed). Optionalquestion/audienceknobs; exposed as a dataless tool so strict-schema clients dispatch it without adata_path.- Auto-tool citation enrichment —
_enrichment.build_citationsnow falls back to verified citation tokens in a function's registryreferencefield, so hundreds of carded estimators carry citations in their MCP output automatically. Only keys that resolve inpaper.bibare ever surfaced (CLAUDE.md §10 red line holds). - Agent-workflow regression net (
tests/agent_eval/) — an end-to-end transcript test (detect_design → preflight → fit(as_handle) → audit_result → sensitivity_from_result) plus handle-chaining and graceful-failure UX contracts. - Docs —
docs/guides/agent_native_workflow.md, an operational playbook for driving StatsPAI as an agent.
Added — regression-table export¶
- Symmetric single-model export surface on
EconometricResults—sp.regress/sp.ols/sp.ivresults now expose.to_latex(),.to_html(),.to_markdown(),.to_excel(), and.to_word(), closing the asymmetry where these lived only onCausalResult(sosp.did(...).to_latex()worked butsp.regress(...).to_latex()raisedAttributeError). Each method delegates to the canonicalsp.regtablerenderer and forwards everyregtablekeyword (coef_labels/keep/drop/order/stats/se_type/stars/fmt/template/notes…);to_latexaddscaption=/label=, and the string formats accept an optionalpath=. - Agent-native table serialisation (
RegtableResult.to_dict()/.to_json()) — a JSON-safe payload with three layers: metadata, the rendered cell grid (the formatted"2.067***"/"(0.074)"strings), and the numeric truth per model (estimate / SE / t / p / CI / stats / depvar). NaN/Inf coerce tonull.renders=True(or a format list) optionally embeds rendered strings.RegtableResult.save()andregtable(..., filename=...)now recognise the.jsonextension. - Docs —
docs/guides/exporting-regression-tables.md: single- and multi-model export across all six formats, the agent-native payload, journal templates, multi-panel /Collectioncontainers, and a Stata (esttab/estout/outreg2) and R (modelsummary/stargazer/fixest::etable/texreg) cross-reference table. Every code snippet was executed to verify it runs.migration-from-r.mdcorrected:sp.regressreturnsEconometricResults(notCausalResult) and theetablemapping points atsp.regtable. RegtableResult.from_dict()— the inverse ofto_dict(). The payload now carries arender_specblock (fmt/alpha/panel_sizes/add_rows/keep/drop/order/se_label) so a serialised table reconstructs and re-renders byte-identically across text/LaTeX/HTML/Markdown for the common feature set (exoticmulti_se/eform/column_spanners/testsare documented as not surviving the round-trip). Makes the JSON a faithful cache, not just a snapshot.- Journal-grade LaTeX —
to_latex(siunitx=True)decimal-aligns numeric columns withsiunitxv3Scolumns (coefficients align on the decimal point; stars ride along as\textsuperscriptviatable-space-text-post; SE / text cells wrapped so they are not mis-parsed).threeparttable=Truemoves the footnotes into atablenotesblock;siunitx_preamble=Trueemits the required\usepackagehint. Unsupported regimes raiseNotImplementedErrorrather than emit non-compiling LaTeX. The default LaTeX path is byte-identical. Threaded throughEconometricResults.to_latex. sp.coefplot_tikz()—pgfplots/ TikZ coefficient forest plot (the LaTeX-native counterpart tosp.coefplot, whose(fig, ax)already gives PNG/PDF viafig.savefig): one\addplotseries per model with horizontal CI error bars, reversed y-axis, dashed zero line;coef_labels/level/standaloneoptions. Auto-registered.- Export-surface contract (
tests/test_export_surface_contract.py) — 55 parametrized checks assertingsp.regtable(r)consumes registered result-class outputs (EconometricResults/CausalResult/PanelResults/FrontierResult) and round-trips, so a future non-exportable result fails loudly. Guide §7–§8 document coefplots and the table/non-table boundary. Collection.to_dict()/.to_json()andPaperTables.to_dict()/.to_json()— agent-native serialisation extended to the multi-table containers. Each regression-table item/panel reusesRegtableResult.to_dict()(and staysfrom_dict-round-trippable); DataFrame items round-trip throughDataFrame.to_json(NaN → null); text/heading carry their string. Bothto_json()emit strict JSON.
Fixed¶
- Two dangling enrichment citations —
_CITATIONS_BY_TOOLreferenced a mistypeddechaisemartin2020twoway(corrected to the existingdechaisemartin2020two) and acattaneo2015randomizationkey absent frompaper.bib(added, verified via De Gruyter DOI 10.1515/jci-2013-0010 and the rdpackages reference). refs verified via aeaweb.org + RePEc (de Chaisemartin & D'Haultfœuille 2020) and degruyterbrill.com + rdpackages (Cattaneo, Frandsen & Titiunik 2015). - Four dangling agent-card alternatives —
rif_regression(→rif_decomposition),sp.ope_ipw(→sp.ipw), and twofixest_in_rexternal references (→hdfe_ols) pointed at non-existent functions; an agent following them would have hitAttributeError.
[1.16.0] — 2026-05-29¶
⚠️ Correctness fix¶
-
sp.qregPowell sandwich SE was wrong by a factor of √n — every pre-fix p-value, z-statistic, and confidence interval emitted bysp.qregwas unusable. The closed-form Koenker (2005, eq. 3.7) iid kernel sandwich isV = τ(1−τ) / f̂(0)² · (X'X)⁻¹._qreg_sehad an extra factor ofnin the denominator (/ (n * f0**2)), so the reported SE was the correct SE divided by √n — on n = 500 the SE was ~20× too small. The fix removes the spuriousn. After the fix the three-way parity at the median tolerance (tests/r_parity/40_qreg) matchesquantreg::rqwithin 1.4–6.8 % and Stataqregwithin 2.9 %, consistent with the documented kernel-vs-Koenker-Bassett SE method gap. Action: any analysis that previously usedsp.qregSE, z-statistic, p-value, or CI must be re-run; point estimates are unaffected. SeeMIGRATION.md§ sp-qreg-se-fix for the per-call impact and rerun recipe. -
sp.xtabond(Arellano-Bond difference GMM) point estimates AND SEs were wrong — finding #12. The estimator built a flat, fixed set of lagged-level instrument columns (gmm_lags=(2,5)) and then dropped every row missing any of them, which on a short panel discards most of the sample; it also usedW = (Z'Z)⁻¹as the one-step weight. The correct Arellano-Bond estimator uses a block-diagonal GMM instrument matrix (every available deeper lag is a period-specific moment, missing lags filled with 0, no rows dropped) and the one-step weightW = (Σᵢ Zᵢ'H Zᵢ)⁻¹whereHcarries the MA(1) structure of the differenced errors (2 on the diagonal, −1 on the first off-diagonals). On the parity DGP the old code gaveβ_{y₋₁}=0.264 (se 0.224)vs Stata's0.391 (se 0.046)— a 48 % estimate gap and an 80 % SE gap. After the rewrite the one-step robust estimates match Stata'sxtabond y x, lags(1) vce(robust)to machine precision (tests/r_parity/50_xtabond, rel ≈ 1e-15 on both β and SE). The defaultgmm_lagsis now(2, None)(all available deeper lags, matching Stata's default; pass an explicit max to cap). Two-step GMM now applies the Windmeijer (2005) finite-sample SE correction. Action: re-run any analysis that usedsp.xtabond— both point estimates and SEs change. SeeMIGRATION.md§ sp-xtabond-fix. -
sp.xtabond(method='system')/sp.panel(method='system')now raiseNotImplementedErrorinstead of returning an unvalidated (and, after the difference-GMM rewrite, badly distorted) estimate. Proper Blundell-Bond system GMM requires a stacked level equation and its own Stataxtdpdsysparity reference, which is planned for a future release. Action: usemethod='difference'(Arellano-Bond), now validated to machine precision.
Added — Parity coverage expansion (2026-05-28 session)¶
- 15 net-new parity modules (
tests/r_parity/{37–51}_*) coveringsp.ppmlhdfe,sp.drdid,sp.arima,sp.qreg,sp.tobit,sp.nbreg,sp.heckman,sp.mlogit,sp.ologit,sp.clogit,sp.probit,sp.oprobit,sp.xtabond,sp.newey, and a 3-FE PPML variant. The 3-way Track A table (tests/r_parity/results/parity_table_3way.md) now covers 50 R-joined modules versus 36 previously, with a Stata reference for 43 versus 21 (50_xtabondis a Py-Stata-only migration check omitted from the R-joined table). The expansion surfaced the qreg and newey SE fixes above and further P1/P2 findings recorded intests/r_parity/PARITY_SESSION_2026-05-28.md.
Fixed¶
- Cleaned up external-review follow-ups: removed two uncited duplicate BibTeX entries that caused editorialbot DOI suggestions, aligned the AKM shift-share citation key / DOI metadata, and refreshed v1.15.6 wording in reviewer-facing docs and README release callouts.
tools/audit_citations.pynow treats transient HTTP/socket/SSL timeouts as unresolved citation lookups instead of leaking Python tracebacks.tests/r_parity/36_mediation.pyreferencedmodel_info["n_boot"], butsp.mediation's schema renamed this ton_boot_requested/n_boot_successful/n_boot_failed. The parity script crashed before producing JSON; pinned it to the new key.
[1.15.6] — 2026-05-24¶
Changed — Co-authorship, software-journal submission readiness¶
- Added Scott Rozelle as co-author across all package metadata:
pyproject.toml,src/statspai/__init__.py(__author__),CITATION.cff,.zenodo.json,mkdocs.yml, the package citation templates insrc/statspai/_citation.py(BibTeX / APA / plain), and the README BibTeX snippets (English and Chinese). - ⚠️ Downstream-facing rename: unified the package BibTeX key to
wang2026statspai(CLAUDE.md §10lastnameYEARkeywordconvention). Previous keys emitted or documented in earlier versions (wang_statspai_2026,wang_rozelle_statspai_2026,statspai2026software, barestatspai) are removed in favor of a single canonical key. Downstream.texfiles that cite the previous key need a one-line rename to\cite{wang2026statspai}. The impact surface is small — only users who literally copied the previous BibTeX entry into their own.bibare affected; users who regenerate viasp.citation("bibtex")get the new key automatically. sp.citation("bibtex")now emits the unified key and the updated author list.sp.citation("apa")andsp.citation("plain")already reflected the co-author; both surfaces now also carry the 1.15.6 version string.CITATION.cffversion/date-releasedbumped to1.15.6/2026-05-24.
Added — reviewer-facing documentation¶
- reviewer guide — install, smoke test, representative offline
examples, targeted tests, and build check, intended as a short reviewer
path. All five smoke-test API calls are verified against the current
registry (
ivreg,callaway_santanna+aggte,rdrobust,synth,describe_function/function_schema). - validation dossier — project status, registry counts, validation tracks (R-parity / Stata-parity / reference-parity / Monte Carlo coverage / snapshot tests / citation audits), parity anchors, research-use statement (working-paper use; no published peer-reviewed article yet), open-core / commercial-downstream disclosure (StatsPAI Inc. + CoPaper.AI), and reproducible-check commands.
- Both pages added to the MkDocs navigation.
Changed — paper.md (software-journal manuscript)¶
- Repo URL casing corrected to canonical
StatsPAI; added Zenodo archive reference (@wang2026statspai). - Research-impact paragraph rephrased to match the actual current state: StatsPAI is used in working-paper workflows connected to Stanford REAP; no peer-reviewed research article using the package has yet been published.
- AI Usage Disclosure rewritten to spell out exactly what generative AI was used for (code generation, refactoring, test scaffolding, documentation drafting, manuscript copy-editing), to note that exact model identifiers were not retained for all exploratory sessions, and to confirm that generative AI will not be used to produce substantive responses to journal editors or reviewers.
- Acknowledgements split: explicit Author Contributions subsection attributing roles to each author, and an open-core / commercial-downstream disclosure (StatsPAI Inc. is the legal entity; CoPaper.AI is a commercial downstream product that may call the MIT-licensed StatsPAI package; the package itself remains permanently open source under MIT).
paper.bibadds thewang2026statspaisoftware entry pointing at the Zenodo concept DOI.
[1.15.5] — 2026-05-21¶
Added — Agent-card coverage ratchet and baseline enrichment¶
- Added
scripts/agent_card_coverage.py,docs/agent_cards_spec.md, andtests/test_agent_card_coverage.pyto make raw curated agent-card metadata measurable and CI-ratcheted. The committed floor tracks 15 counters across Tier-B, Tier-A, Tier-S, per-field coverage, and certified / validated evidence counts. - Added generated
src/statspai/_baseline_cards.pyplusscripts/gen_baseline_cards.pyto fill empty Tier-B fields from docstrings without overwriting curated registry entries. The baseline pass lifts tags to 100% of the then-current v1.15.5 registry and keeps examples / references limited to mechanically extracted, auditable content. - Added
FunctionSpec.inherits_fromand inherited agent-card rendering for canonical estimator variants. Child specs keep their own descriptions, examples, parameters, references, validation status, and limitations, while sharing parent assumptions, preconditions, failure modes, alternatives, andtypical_n_minwhere appropriate.
Changed — Registry and documentation refresh¶
- Expanded validation evidence seeds for tested long-tail estimators so the agent registry distinguishes stable APIs from functions with explicit unit, regression, parity, or reference-test coverage.
- Refreshed registry count, module statistics, and agent-platform positioning for the then-current v1.15.5 public surface.
- Updated DiD and agent-facing docs to mark
continuous_did(method='cgs')anddid_multiplegt_dynas experimental MVP paths rather than fully paper-parity estimators.
[1.15.4] — 2026-05-18¶
Added — Auto-CJK font fallback on import¶
import statspai as spnow auto-registers a detected CJK font (PingFang SC / Microsoft YaHei / Noto Sans CJK / SimHei / Source Han Sans / …) as a per-glyph fallback in matplotlib'sfont.familylist. Chinese text in plots renders correctly without callingsp.use_chinese()on any system that already has a CJK font installed (default on macOS / Windows / most modern Linux desktops).- Safe-by-design: the user's primary family stays at
font.family[0], so Latin text is rendered with the original primary font (DejaVu Sans / Helvetica / Times New Roman / …) — no visual change for English-only plots.axes.unicode_minusis untouched, so the minus sign on tick labels stays as the proper U+2212 from the Latin primary. - Opt-out: set environment variable
STATSPAI_NO_AUTO_CJK=1before importing statspai. The explicitsp.use_chinese()entry point still exists for users who want a specific font or who need primary-font control (e.g.,sp.use_chinese('serif')). - Mechanism:
font.familybecomes['sans-serif', '<best CJK sans>', '<best CJK serif>'](or whatever family was set). matplotlib 3.6+ per-glyph fallback walks this list, so CJK glyphs missing from the primary fall through to the appended CJK font without affecting Latin glyphs. Empirically required vs. appending tofont.sans-serif, which does not trigger fallback in matplotlib 3.10. - New tests:
tests/test_auto_cjk_fallback.pycovers the 7-point behavior contract (primary preserved, unicode_minus preserved, family-specific lists preserved, Chinese renders without warnings, user override wins, idempotent, env var opt-out).
[1.15.3] — 2026-05-17¶
Fixed — PyPI long-description hero banner¶
- The v1.15.2 PyPI project page rendered the hero banner as a broken
image. Root cause:
README.mdandREADME_CN.mdreferenceddocs/logo/readme-1.pngwith a repo-relative path, which GitHub resolves against the rendered tree but PyPI's long-description renderer has no base URL for and therefore leaves as a 404. Both READMEs now point at the absolute raw GitHub URLhttps://raw.githubusercontent.com/brycewang-stanford/StatsPAI/main/docs/logo/readme-1.pngso the banner loads on PyPI / TestPyPI / Open Source Insights / any other off-GitHub README renderer. - No code changes. All shipped artifacts have the same module hashes as v1.15.2 except for the regenerated long-description metadata baked into the wheel + sdist.
[1.15.2] — 2026-05-17¶
Headline¶
Patch release on top of v1.15.1. No estimator numerical paths change. Three independent hardening tracks land together:
- Agent-native infrastructure —
sp.agent.mcp_serveris now strict- JSON-clean over the wire (noNaN/Infinityliterals reaching Claude Desktop / RFC 8259 parsers), agent schema metadata extraction is more complete, and a newtextextra makessentence-transformersan explicit opt-in for the v1.6causal_textsurface instead of a soft import surprise. sp.replicatedual-track — Card (1995), Abadie-Diamond- Hainmueller (2010), Lalonde (1986) / DW (1999), and Lee (2008) replications are promoted from single-track stubs to full classic + modern recipes that ship with the original public- domain CSVs and pinned golden numbers.- Release packaging — wheel smoke tests now fail loudly on
import error,
py.typedships in the wheel for downstreammypy --strictconsumers, the result_repr_html_path escapes user-controlled cell content (notebook XSS-safety), and theformulaicdependency thatsp.spatialparses formulas with is declared explicitly instead of relying onlinearmodels's transitive resolution.
Added — text optional extra¶
- New
[project.optional-dependencies] text = ["sentence-transformers>=2.2.0"]inpyproject.toml. The v1.6sp.causal_textMVP used to lazy-importsentence_transformersand raise an opaqueImportErroron first call.pip install statspai[text]now wires the dependency explicitly; the lazy import still triggers a clear pointer to the extra when missing.
Added — sp.replicate dual-track guides for four canonical papers¶
- Card (1995) — proximity-to-college IV for returns to schooling.
- Abadie, Diamond & Hainmueller (2010) — California Proposition 99 synthetic-control.
- Lalonde (1986) / Dehejia-Wahba (1999) — NSW + PSID-1 (MatchIt
subset,
n=614) propensity-score matching. - Lee (2008) — US Senate RD (
n=1390) bandwidth-selected jump.
Each entry now ships a classic track (the estimator the paper
used: 2SLS for Card, weighted synthetic-control for Abadie, naive
OLS / adjusted OLS / 1:1 NN PSM for Lalonde, local-linear CCT RD
for Lee) and a modern track (DML PLR + entropy balancing,
bias-corrected robust RD, multi-method synth-compare). Real CSVs
land under src/statspai/datasets/data/
(card_1995.csv, california_prop99.csv, lalonde_matchit.csv,
lee_2008_senate.csv) and sp.datasets.nsw_lalonde /
sp.datasets.lee_2008_senate gain a simulated=False real-data
branch with published-paper benchmarks exposed via df.attrs. The
Lalonde classic track now reproduces DW (1999) Table 3-4 within a
$5 drift tolerance; the Lee CCT track returns Conv 7.414 and
bias-corrected robust 7.507 (SE 1.741, h=17.754), matching the
R rdrobust reference. All 13 BibTeX keys cited across the four
entries are verified in paper.bib.
Fixed — MCP wire format is strict-JSON-clean¶
sp.agent.mcp_serverused to serialise responses withjson.dumps(..., default=_json_default), which does not intercept native Pythonfloat('nan')/float('inf')— those become the non-standard literalsNaN/Infinityin the JSON output, which RFC 8259 parsers (including Claude Desktop'sJSON.parse) reject with errors like"No number after minus sign". Responses now pass through_clean_floats(recursively replacesNaN/±Infinitywithnullacrossdict/list/tuplecontainers) and serialise withallow_nan=False, so the server can never emit a JSON token a strict parser refuses. Covered by 273-line regression suitetests/test_mcp_nan_inf.py.- Agent schema metadata extraction (
sp.function_schema,sp.describe_function) now surfaces more signature detail for registry entries built from auto-introspection. - Stability-tier audit (
scripts/stability_audit.py) accounts for evidence files more precisely; newtests/test_agent_schema.pylocks the schema metadata fields agents rely on.
Fixed — result HTML escaping (notebook XSS-safety)¶
CausalResult._repr_html_(and the surrounding rich-display helpers) now route every user-derived cell throughhtml.escape. Previously, any string column whose contents contained</>/&/"would interpolate raw into the rendered HTML, opening a path for notebook XSS when a result was displayed in Jupyter / VS Code / nbviewer. New regression test:tests/test_results_html_escape.py.
Fixed — release packaging hygiene¶
pyproject.tomlbumps the build requirement tosetuptools>=77.0.0and migrateslicense = {text = "MIT"}to the modern PEP 639license = "MIT"+license-files = ["LICENSE"]pair. Drops the deprecatedLicense :: OSI Approved :: ...classifier path implicitly.MANIFEST.innow includessrc/statspai/py.typedand the sdist test fixtures sopip install --no-binary :all:andmypy --strictboth behave correctly on the published artifacts..github/workflows/build-wheels.ymland.github/workflows/ci-cd.yml: wheel smoke tests now fail the job onImportErrorinstead of swallowing it as a warning. Releases that silently ship a broken wheel are no longer possible from a green CI run.- New explicit dependency
formulaic>=0.6.0independencies(sp.spatial.*parses Wilkinson formulas through it; relying onlinearmodels's transitive resolution broke when downstream users pinned olderlinearmodels).
Docs¶
- Software-journal submission
paper.mdis rewritten for the Scott Rozelle review pass — tighter scope statement, cleaner schema description, explicit AI-use disclosure, 12 May 2026 submission date. Cited bibliography entries inpaper.bibare refreshed to match. - README / README_CN add the hero banner image
(
docs/logo/readme-1.png). - Track-C performance comparison table
(
tests/perf/results/perf_table.tex) switches to\scriptsizewith package-name macros and a direction-aware "Winner" column; log-log figure regenerated to match.
Internal¶
- Perf-benchmark harness factored — new
tests/perf/_common.pyshared utilities;tests/perf/05_feols_jax_bootstrap_bench.pyrewritten on top. - Full-suite validation snapshot refreshed in
test_results_full_suite.md.
[1.15.1] — 2026-05-07¶
Headline¶
Patch release on top of v1.15.0 preparing the public PyPI cut. Existing
estimator defaults are preserved. The only new runtime path is opt-in:
sp.rdrobust(..., bwselect='cct') delegates to the official
rdrobust>=1.3 Python package for bit-equal R rdrobust::rdrobust
replications. The default bwselect='mserd' remains StatsPAI's
internal MSE-optimal recipe.
The release notes and README now also document the negative-binomial
count-regression implementation that is already exposed through
sp.nbreg, sp.xtnbreg, and sp.menbreg. This is documentation /
release-packaging work plus the 1.15.0 → 1.15.1 version bump; it does
not change the negative-binomial numerical path.
Docs — negative-binomial regression implementation note¶
sp.nbregis a log-link MLE for overdispersed non-negative count outcomes. The defaultdispersion="mean"path is NB2,Var(Y|X)=mu + alpha * mu^2;dispersion="constant"switches to NB1,Var(Y|X)=mu * (1 + delta).- The optimizer starts from a Poisson IRLS fit. It then alternates
NB-weighted IRLS updates for the coefficient vector with scalar
profile-likelihood optimization for the dispersion parameter on the
log scale (
alphafor NB2,deltafor NB1). - Inference uses the NB working-weight bread.
robust="robust"/"hc0"/"hc1"select sandwich SEs, andcluster=selects cluster-robust SEs.irr=Truereports incidence-rate ratios by exponentiating coefficients and delta-method SEs. - Offsets and exposure are supported (
offset=supplies a log offset;exposure=is logged internally). Diagnostics include log-likelihood, AIC/BIC, pseudo-R2, fitted values / residuals, and a one-sided likelihood-ratio test of the dispersion path against Poisson using the standard 50:50 chi-bar-square mixture. - Formula fixed effects such as
y ~ x | idare implemented by explicit dummy expansion. That choice is transparent and compatible with the MLE path, but it is intended for moderate-cardinality panels; high-cardinality HDFE count models should use the Poisson/PPML surface (sp.fepois,sp.ppmlhdfe) unless a full dummy NB fit is really intended. sp.xtnbreg(model="fe")wrapssp.nbregwith entity/time fixed effects and defaultscluster=to the entity id.model="pooled"strips the panel fixed-effect part.model="re"dispatches tosp.menbreg, the random-intercept NB2 GLMM, and returns the multilevelMEGLMResult.
Added — R-parity opt-in for sp.rdrobust¶
- New
bwselect='cct'insp.rdrobustdelegates the entire estimation (bandwidth selection + bias-corrected inference) to the officialrdrobust>=1.3Python port (Calonico, Cattaneo & Titiunik 2014). This guarantees bit-equal alignment with Rrdrobust::rdrobustfor users who need exact replication of CCT-2014 published numbers — for example the canonical Lee/CCT Senate case where R returnsConv = 7.4141 / Robust = 7.5065 / h = 17.754. The internalbwselect='mserd'(default) is kept unchanged for backward compatibility — it uses StatsPAI's own MSE-optimal recipe which can drift from R'srdbwselectby up to ~70% on certain datasets (documented intests/orig_parity/results/parity_table_orig.mdrow 52, module05_lee_original). - Install with
pip install statspai[rd-cct](adds the officialrdrobust>=1.3dependency). Callingbwselect='cct'without it raises a clearImportErrorpointing to the install command. - See MIGRATION.md
for guidance on when to switch from
'mserd'to'cct'.
Tests — did::aggte parity lock¶
- Added
TestAggteRParityintests/external_parity/test_published_replications.py. Assertssp.aggte(type='simple')is bit-equal (≤1e-10) with Rdid::aggterecorded intests/orig_parity/results/02_mpdta_original_R.json, andtype='dynamic'matches R's published vignette output to 1e-3. Prevents future refactors from silently drifting away from R. - Added
TestCCTDelegationParityand anImportError-guarded test that pin the newbwselect='cct'delegation to RrdrobustSenate-replication numbers (Conv 7.4141, Robust 7.5065, h=17.754, 1e-3 tolerance).
Internal¶
- Added
[project.optional-dependencies] rd-cct = ["rdrobust>=1.3"]topyproject.toml. sp.datasets.list_datasets()now returns six columns (addedpaper_originalcolumn to honestly distinguish the published paper number from the simulated-replica's actual estimator output).
[1.15.0] — 2026-05-05¶
Docs — sp.dml_panel citation correction¶
- ⚠️ Docs-only correction —
sp.dml_panel(originally shipped in v1.7) was attributed in its docstring, registry entry, README blurb, and CHANGELOG release note to "Semenova & Chernozhukov (2023) Econometrics Journal 26(2), Debiased Machine Learning of Conditional Average Treatment Effects and Other Causal Functions." That citation is fabricated: independent verification via Crossref and the Oxford ECTJ issue TOC confirms no Semenova or Chernozhukov paper appears anywhere in Econometrics Journal 26(2) (May 2023), and the cited title in fact belongs to Semenova & Chernozhukov (2021) ECTJ 24(2) 264-289 (DOI 10.1093/ectj/utaa027) — a paper on CATE / debiased ML for causal functions, unrelated to long-panel PLR with fixed effects. - The estimator's actual reference is Clarke, P. S. & Polselli, A.
(2025). "Double Machine Learning for Static Panel Models with
Fixed Effects." The Econometrics Journal 29(1) 69-86, DOI
10.1093/ectj/utaf011,
arXiv:2312.08174. The paper specifies the within-group / first-
difference transform, block-k-fold cross-fitting that allocates
each unit's full time series to a single fold, and cluster-robust
variance at the unit level — point-for-point match with the
StatsPAI implementation. Companion Stata package:
xtdml. sp.synth(method='cluster')method-citations registry: ClusterSC second-author surname corrected (was a misattribution; now matches the arXiv 2503.21629 author list — Rho, Tang, Bergam, Cummings, Misra).paper.bibwas already correct; the typo only lived insrc/statspai/synth/report.py.- Updated callsites:
paper.bib(newclarke2025doubleentry),src/statspai/dml/panel_dml.py(module docstring + within-transform comment),src/statspai/dml/__init__.py(lazy-export tag),src/statspai/registry.py(FunctionSpec description + reference field),README.md(Long-panel Double-ML row),src/statspai/synth/report.py(ClusterSC author list), and the historical v1.7 entry below (annotated, not silently rewritten). No code logic, numerical path, API signature, or test changed — pure citation correction. - Refs verified via Crossref (DOIs
10.1093/ectj/utaf011and arXiv2503.21629) and OpenAlex.
Docs — v1.14 GPU sprint follow-up¶
- JSS manuscript (
Paper-JSS/manuscript/, gitignored — local working copy) gets a substantive expansion of §5.6 Performance backbone documentingsp.fast.feols_jax,sp.fast.feols_jax_bootstrap(with the score-formulation derivation for wild and wild-cluster variants and the Cameron-Gelbach-Miller 2008 attribution added tojss-bib.bib), the Rustcluster_meatkernel, andsp.iv(absorb=...). §6 originally sketched an accelerator-benchmark table forsp.fast.feols_jax_bootstrap, but the JSS source snapshot now relies only on packaged measurements. The active Track C table is generated from measured CPU benchmarks undertests/perf/, and GPU/JAX timing remains an opt-in engineering benchmark until a dedicated accelerator run is packaged as evidence. - Software-journal bullet 5 in
paper.mdsimplified from a four-paragraph exposition to a single tight paragraph that cross-references the JSS companion paper for full architecture detail. - Note on version numbering: the
pyproject.tomlversion moved from 1.14.0 (GPU sprint cut, commita87d788) to 1.15.0 (RDD polish cut) within a single day. Both[1.14.0]and[1.15.0]entries below remain historically correct for the work each release contained; no retroactive renaming is needed. PyPI publishes1.13.1 → 1.15.0directly —1.14.0was an internal cut that was never released to PyPI and is recorded here for git / CHANGELOG history only.
Headline¶
Five pushes in this cycle. First, an IV-module polish to the post-2022
reporting standard (the sp.iv.iv_diag bundle, see below). Second, a
synthetic-control polish pass: supported synthetic-control estimators in
that release now have a publication-oriented table-export pipeline, the trajectory
and gap plots gain prediction-interval / pre-RMSPE ribbon options
following Cattaneo, Feng and Titiunik (2021, JASA 116, DOI
10.1080/01621459.2021.1979561) and Cattaneo, Feng, Palomba and Titiunik
(2025, JSS 113(1), DOI 10.18637/jss.v113.i01), and the SDID schema
is canonicalised end-to-end so sp.synth_report(method='sdid', ...)
produces a full Markdown / text / LaTeX report rather than a row of
N/As. Seven 2022–2025 SCM citations were added to paper.bib,
each verified independently via Crossref / arXiv (refs verified
via crossref + arxiv). Third, a decomposition-module polish — see
the dedicated section below. Fourth, a ML+causal polish wave (v1.15)
covering DML / meta-learners / causal forests / causal discovery /
policy learning / mediation — see the dedicated section below. Fifth,
an RDD module polish (v1.15) to the 2018–2026 frontier with three
new estimators (sp.rd_flex, sp.rd_bias_aware_fuzzy, sp.rd_discrete),
three reporting helpers (sp.rd_dashboard, sp.rd_compare,
sp.rd_robustness_table), an sp.rdrobust polish pass with the CCT-2018
rho parameter and discrete-RV / weak-first-stage warnings, and a
sp.rdplotdensity upgrade to the Cattaneo-Jansson-Ma (2020) boundary-
adaptive density estimator — see the dedicated section below.
Added — ML+causal polish (v1.15)¶
- DML-OVB sensitivity analysis (
sp.dml_sensitivity,DMLSensitivityResult) implementing the Chernozhukov–Cinelli– Newey–Sharma–Syrgkanis (2022) "Long Story Short" framework (NBER WP 30302; arXiv:2112.13398). Returns the robustness value RV_q (strength of confounder needed to shrink the estimate to zero), the significance-loss value RV_{q,α}, scenario bias bounds for user-specified (cf_y, cf_d), benchmark-covariate comparisons, and aplot()rendering bias contours over the (cf_d, cf_y) grid à la Rsensemakr. Refs verified via NBER + arXiv. - DML diagnostics bundle (
sp.dml_diagnostics,DMLDiagnostics) bundles overlap (propensity histogram for IRM; |D-residual| distribution for PLR), score density (with N(0,σ̂²) overlay and Q-Q plot), residual-balance check (corr(X_k, Ỹ) and corr(X_k, D̃) for each covariate), and an orthogonality-score test in a single 2×2 publication-style panel matching DoubleML's defaults (Bach–Kurz–Chernozhukov–Spindler–Klaassen 2024, JSS 108(3), DOI 10.18637/jss.v108.i03). - Backbone-agnostic CATE evaluation (
sp.cate_eval,CATEEvalResult) computing Yadlowsky–Fleming–Shah–Brunskill– Wager (2025) RATE / AUTOC / Qini with closed-form influence- function SEs for any CATE array (meta-learner, BCF, conformal- CATE, neural-CATE), so the metric is decoupled from the forest backbone. JASA 120(549), DOI 10.1080/01621459.2024.2393466 (arXiv:2111.07966). Verified via Crossref + arXiv. - ⚠️ Correctness fix —
forest.CausalForest.best_linear_projectionis rewritten to use the Semenova–Chernozhukov (2021) AIPW pseudo-outcome Γ_i with HC1 standard errors. The previous implementation regressed the plug-in CATE estimate on X with naïve OLS SEs, which was anti-conservative in finite samples. Econometrics Journal 24(2): 264–289, DOI 10.1093/ectj/utaa027. Users who relied on the prior BLP SEs should re-fit and report the new HC1 numbers. - ⚠️ Correctness fix —
mediation.mediateno longer silently substitutes the point estimate for failed bootstrap replicates (which artificially shrunk SEs). Each failure now triggers up to five retry draws; remaining failures are dropped, and aRuntimeWarningfires if more than 10% of replicates fail. The result'smodel_infoexposesn_boot_requested,n_boot_successful,n_boot_failed, andboot_failure_ratefor audit. SEs estimated under heavy bootstrap failure on prior versions should be regenerated. - OPE namespace deduplication —
sp.policy_learning.OPEResultis now an alias for the canonicalsp.ope.estimators.OPEResult, soisinstance(sp.direct_method(X, A, R, π), sp.OPEResult)is True regardless of which entry point was used. The legacyestimator/n_obsattributes survive as properties on the unified class. - Causal-discovery graph visualization — every result class
(
LiNGAMResult,GESResult,FCIResult,ICPResult,PCMCIResult,LPCMCIResult,DYNOTEARSResult) and the dict- shaped returns fromsp.notearsandsp.pc_algorithm(now promoted to aDAGDictthin subclass) expose a unified.to_networkx()/.to_dot()/.plot()/.edge_list()API. Module-level helperssp.causal_discovery.{to_networkx, to_dot, plot_dag, edge_list, shd}work standalone on any adjacency matrix;shd()follows the Tsamardinos–Brown–Aliferis (2006) Structural Hamming Distance convention. - PolicyTreeResult promotion —
sp.policy_treenow returns aPolicyTreeResult(subclass ofdictfor full back-compat) with influence-function SE on the policy value and a 95% CI from the AIPW scores, plus a Graphviz-styleplot_tree(),summary(),to_latex(),to_excel(), andcite()(Athey & Wager 2021, Econometrica 89(1)). - Mediation sensitivity plot upgrade —
MediateSensitivityResult.plot()now produces a publication-style ACME(ρ) curve with coloured fill for the {ACME>0} / {ACME<0} regions, annotated baseline, and explicit ρ-at-zero (the robustness threshold). CausalResult.to_wordadded alongside the existing.to_latex/.to_excel. Renders a publication-style three-block Word document (estimates / detail / notes) using the AER booktab styling helpers inoutput/_aer_style.py. Coverage: supported estimators that already returnsCausalResult(DML, TMLE, BCF, mediate, conformal_cate, proximal.p2sls, matrix_completion, metalearners, hal_tmle, did, rd, synth) now has uniform LaTeX / Excel / Word export.- DTR + QTE test coverage —
tests/test_dtr.py(10 new tests) andtests/test_qte.py(7 new tests) close two zero-coverage modules flagged in the v1.13 audit. Tests verify (i) Q-learning exactly recovers the optimal terminal-stage rule under a linear blip, (ii) A-learning's terminal contrast aligns with the truth, (iii)qteandqdidrecover the constant-shift / parallel- trends benchmarks of Firpo (2007) and Athey–Imbens (2006) respectively to within 0.30 absolute error at every quantile, and (iv)distributional_te's CDFs are monotone with stochastic dominance in the right direction. tests/test_ml_causal_polish.py(22 new tests) covers all of the above end-to-end (BLP DR-score recovery, mediation bootstrap diagnostics, OPE isinstance, DAG viz,PolicyTreeResultcontract, DML sensitivity / diagnostics,cate_evaldirection,to_wordintegration).- Citation expansion — 4 new bib entries added to
paper.bib, each verified independently via NBER / arXiv / journal site:chernozhukov2022long,semenova2021debiased,yadlowsky2025evaluating,bach2024doubleml.
Added — decomposition module polish (v1.15)¶
- Yu–Elwert (2025) nonparametric causal decomposition
(
sp.yu_elwert_decompose, dispatcher aliasesyu_elwert/cdgd) — splits an observed group disparity into baseline, prevalence, effect, and selection mechanisms. Two estimators: a plug-in version that returns exactly zero residual by algebraic identity, and a doubly-robustmethod="efficient"augmented variant. Cluster-aware bootstrap inference, plot helper (yu_elwert_mechanisms_plot), and a per-component CI table. Aligned conceptually with the Rcdgdpackage. Reference verified via Crossref: doi:10.1214/24-AOAS1990 (Annals of Applied Statistics, 19(1) 821–845). - Unified
DecompResultMixin— every result class insp.decomposition(Oaxaca, Gelbach, RIF, FFL, DFL, Machado–Mata, Melly, CFM, Fairlie, Bauer–Sinning, Yun, Kitagawa, Das Gupta, Subgroup / Source / Shapley inequality, GapClosing, Mediation, Disparity, Yu–Elwert) now exposes the same surface:confint(alpha),cite()/cite("bibtex_keys"),to_dict(),to_json(),to_excel(path)(multi-sheet workbook), andto_word(path)(python-docx report) in addition to the existingsummary()/plot()/to_latex(). - Plot polish —
sp.decomposition.plotsnow ships a unified Material-style palette (DECOMP_PALETTE) and a despined minimal grid viaapply_decomp_style. New helpers:forest_plot(with significance shading),mediation_forest, andyu_elwert_mechanisms_plot. Existing plots (detailed_waterfall,quantile_process_plot,dfl_plot,ffl_waterfall,gap_closing_plot,inequality_subgroup_plot,counterfactual_cdf_plot,rif_heatmap) gain optional 95% CI whiskers / ribbons whenever the result carries SEs. - Wild bootstrap in
decomposition._common.wild_bootstrap_statwith Rademacher and Mammen multipliers and cluster-aware multiplier sharing (Cameron–Gelbach–Miller 2008 style).analytical_ci(point, se, alpha)helper for two-sided normal intervals. - Citation expansion — 15 new bib entries added to
paper.bib, each verified via Crossref:blinder1973wage,oaxaca1973male,neumark1988employers,cotton1988estimation,reimers1983labor,jann2008blinder,gelbach2016covariates,kline2011oaxaca,shorrocks1980class,cowell2007income,riosavila2020rif,kroger2021kitagawa,oaxaca2025meets,yu2025nonparametric,park2024choosing(refs verified via Crossref). - Docs — new family guide
docs/guides/decomposition_family.md, wired into MkDocs nav under the v1.15 entry. Decomposition section inpaper.mdrewritten with inline citation keys. - Tests —
tests/test_decomposition_polish.py(14 new tests) covering the Yu–Elwert algebraic identity, bootstrap inference, dispatcher routing, plot smoke test, the unified mixin (cite / to_dict / to_excel / confint), and the wild bootstrap helper (Rademacher / Mammen / clustered).
Added — synthetic control polish¶
sp.synth_to_latex(obj, ...)— booktabs LaTeX for anyCausalResultfromsp.synth(method=...)or for aSynthComparison(side-by-side multi-method layout). Optional donor-weights panel, configurable significance stars, and a\\hlinefall-back for non-booktabs documents.sp.synth_to_markdown(obj, ...)— pipe-table Markdown counterpart rendering on GitHub, in pandoc, and in static-site generators.sp.synth_to_excel(obj, path, ...)— multi-sheet workbook with Summary / Weights / Diagnostics / per-method Gap sheets. Soft dependency onopenpyxl; raises an actionableModuleNotFoundErrorwith install hint when missing.SynthComparison.to_latex(),.to_markdown(),.to_excel()— thin object-method wrappers over the three exporters above.sp.synthplot(..., pi_band=True)— overlay the prediction-interval / conformal CI ribbon on the synthetic counterfactual when the result carriesperiod_resultswithpi_lower/pi_uppercolumns (sp.scpi,sp.conformal_synth).sp.synthplot(..., pre_band=True)— overlay a \(\\pm 1.96 \\times\)pre-RMSPE noise envelope on the trajectory and gap plots (the convention popularised in Abadie, Diamond and Hainmueller 2010, JASA).- Method-aware citations in
sp.synth_report(): SDID, scpi, conformal, gsynth, augmented, matrix-completion, multi-outcome, cluster, sparse, fdid, BSTS, and penalised SCM each close with their own citation rather than the generic Abadie--Diamond-- Hainmueller (2010) reference. - SDID schema canonicalisation in
sp.synth_report(): the report now backfillsn_pre_periods/n_post_periods/n_donors/pre_treatment_rmse/gap_tablefrom SDID-style keys (T_pre,T_post,n_control,Y_obs) so the Markdown / text / LaTeX output is uniform across SC variants. - Seven new verified bib entries in
paper.bib: Abadie & Cattaneo (2021, JASA 116(536), 1713–1715, DOI 10.1080/01621459.2021.2002600); Abadie & Vives-i-Bastida (2022, arXiv:2203.06279); Cattaneo, Feng, Palomba & Titiunik (2025, JSS 113(1), DOI 10.18637/jss.v113.i01); Liu, Wang & Xu (2024, AJPS 68(1), 160–176, DOI 10.1111/ajps.12723); Qiu, Shi, Miao, Dobriban & Tchetgen Tchetgen (2024, Biometrics 80(2), ujae055, DOI 10.1093/biomtc/ujae055); Clarke, Pailañir, Athey & Imbens (2024, Stata Journal 24(4), 557–598, DOI 10.1177/1536867X241297914); Bottmer, Imbens, Spiess & Warnick (2024, JBES 42(2), 762–773, DOI 10.1080/07350015.2023.2238788). - 17 new tests in
tests/test_synth_exports.pycovering single-result and comparison LaTeX / Markdown / Excel exports, the new plot options, and SDID-canonicalised reports.
Added — IV polish (v1.15)¶
sp.iv.iv_diag(data, y, endog, instruments, exog, ...)— modern IV reporting bundle. Returns anIVDiagResultcontaining:- 2SLS point estimate, analytic + pairs / wild Rademacher bootstrap SEs and CIs (cluster-aware) following Davidson--MacKinnon (2010) and Young (2022, EER 147, 104112);
- Olea--Pflueger (2013, JBES 31(3), 358–369) robust effective F;
- Lee--McCrary--Moreira--Porter (2022, AER 112(10), 3260–3290)
tFadjusted critical value and tF-corrected confidence interval; - Anderson--Rubin (1949) / optional Moreira (2003) CLR / optional Kleibergen (2002) K weak-IV-robust confidence sets;
- Kleibergen--Paap (2006, J. Econometrics 133, 97–126) rk LM and Wald F;
- Conley--Hansen--Rossi (2012, ReStat 94(1), 260–272) plausibly-exogenous LTZ sensitivity CI;
- Blandhol--Bonney--Mogstad--Torgovitsky (NBER WP 29709, latest
revision Jan 2025) and Słoczyński (2024, arXiv:2011.06695)
TSLS-as-LATEnegative-weights caveat — automatically surfaced when covariates are present and the endogenous regressor is binary 0/1; - OLS comparator (informative; not causal) per Young (2022).
IVDiagResultexposes.summary(),.to_frame(),.to_dict(),.to_latex(),.to_excel(),.to_word(), and.plot('diagnostic'|'forest'|'weak_iv'|'first_stage').sp.iv.iv_compare(formula, data, methods=...)— run several k-class / JIVE estimators side-by-side and return a one-row-per- method comparison DataFrame (point, SE, CI, first-stage F). Auto-resolves the endogenous coefficient name across heterogeneous result classes.sp.iv.plot.plot_iv_forest(table, reference=...)— forest plot of estimates and CIs across methods.sp.iv.plot.plot_iv_forest_from_diag(result)— forest plot built directly from anIVDiagResultbundle.sp.iv.plot.plot_weak_iv_ci_overlay(result)— Wald / tF / AR / CLR / K / pairs- and wild-bootstrap / LTZ confidence-set overlay.sp.iv.plot.plot_iv_diagnostics(result)— 2x2 panel: first-stage scatter, AR set, weak-IV-CI overlay, leverage diagnostic (Young 2022 spirit).sp.iv_diag,sp.iv_compare,sp.IVDiagResultre-exported at top level for agent ergonomics; both also wired intosp.list_functionsvia the registry.- New verified
paper.bibentries (DOI / arXiv ID double-checked against Crossref + journal pages, May 2026):keane2024practical,young2022consistency,lal2024much,mikusheva2022inference,borusyak2023nonrandom,borusyak2025practical,kaido2021decentralization,chernozhukov2022automatic,chernozhukov2022rieszn,brinch2017beyond,bennett2023minimax,blandhol2025tsls,sloczynski2024should. Existing entries enriched with verified volume / issue / pages:mikusheva2024weak,lee2022valid,borusyak2022quasi,masten2021salvaging.
Changed — IV polish (v1.15)¶
docs/guides/choosing_iv_estimator.mdadds §10 (sp.iv.iv_compareforest comparison), §11 (sp.iv.iv_diagmodern reporting bundle), and a TL;DR pointer to the new bundle.paper.mdgains a self-contained IV bullet under Methodological coverage documenting the new bundle and recent methodology references.
Notes — IV polish (v1.15)¶
iv_diagis single-endogenous by design; for multi-endogenous specifications continue to usesp.weakrobustplussp.iv.sanderson_windmeijerper regressor.- Existing IV functions (
sp.iv,sp.weakrobust,sp.kleibergen_paap_rk,sp.anderson_rubin_test, etc.) are unchanged —iv_diagis purely additive and does not alter any numeric path. The 18 new tests intests/iv/test_iv_diag.pyall pass; no regressions in the 188 prior IV tests.
Added — RDD polish (v1.15)¶
RDD module polish tracking the 2018–2026 literature. Six
additions close the gap between sp.rd and the canonical R/Stata
rdpackages ecosystem on three fronts — recent methodology,
publication-oriented reporting, and automatic diagnostics:
- Three new estimators corresponding to flagship 2018–2025 papers:
sp.rd_flex— flexible (machine-learning) covariate adjustment via cross-fit residualisation following Noack, Olma & Rothe (2025, arXiv:2107.07942 v5). Built-in learnersboost/forest/ridge/lasso(any sklearn regressor accepted) with K-fold cross-fitting; reduces τ̂ variance whenever covariates predict the outcome and remains consistent under free-of-cutoff continuity of η. Reports out-of-sample R² and variance reduction relative to plainrdrobust.sp.rd_bias_aware_fuzzy— Anderson–Rubin-style bias-aware CI for fuzzy RD following Noack & Rothe (2024, Econometrica 92(3), 687–711, doi:10.3982/ECTA19466). Robust to weak first stages and avoids the power asymmetry of conventional fuzzy 2SLS-style CIs documented by Kaliski, Keane & Neal (2025, NBER 33972). Reports first-stage F and warns on F < 10.-
sp.rd_discrete— honest CIs for RD with a discrete running variable following Kolesár & Rothe (2018, AER 108(8), 2277–2304, doi:10.1257/aer.20160945). Two smoothness classes: bounded second derivative (bsd) and bounded misspecification (bm). Both have provable finite-sample coverage when standard rdrobust asymptotics break down because mass points are sparse. -
Three new reporting helpers for the standard CCT–Cattaneo– Idrobo–Titiunik best-practice workflow:
sp.rd_dashboard— single-figure 4-panel diagnostic (RD plot, CJM-2020 density, covariate balance, bandwidth sensitivity) following the recommendations of Calonico, Cattaneo & Titiunik (2015, JASA 110(512), doi:10.1080/01621459.2015.1017578) and Cattaneo, Idrobo & Titiunik (2024, doi:10.1017/9781009441896).sp.rd_compare— side-by-side estimation across an arbitrary list of methods (rdrobust,honest,randinf,flex, …) on the same data; returns a tidypd.DataFrameready forsp.outreg2/sp.modelsummary.-
sp.rd_robustness_table— sweep over kernel × bandwidth × poly × donut, returning paper-facing specifications withto_latex()/to_excel()for one-shot supplemental tables. -
sp.rdrobustpolish: - New
rhoparameter — Calonico, Cattaneo & Farrell (2018, JASA 113(522)) ratio bandwidthb = h / rho. - Auto warning when running variable has < 30 distinct values,
pointing to
sp.rd_discrete. - Auto warning on fuzzy first-stage F < 10, pointing to
sp.rd_bias_aware_fuzzyand quoting the Kaliski-Keane-Neal (2025) ITT recommendation. Both warnings can be silenced viawarn_mass_points=False/warn_weak_first_stage=False. -
First-stage F now exposed at
result.model_info['first_stage_F']. -
sp.rdplotdensityupgrade: replaced the legacy kernel-sum density estimate with the boundary-adaptive local-polynomial CDF- regression density of Cattaneo, Jansson & Ma (2020, JASA 115(531), 1449–1455). Same signature; better boundary behaviour. -
Dispatcher:
sp.rd(..., method='flex' | 'bias_aware' | 'discrete')with full alias coverage (rd_flex,flexible,ml_adjust,bias_aware_fuzzy,noack_rothe,rd_discrete,kolesar_rothe,discrete_rv, …).
Tests — RDD polish (v1.15)¶
- New
tests/test_rd_polish.pywith 21 checks: estimator parity recovery, dispatcher routing, warning behaviour, dashboard smoke tests. - All 156 RD tests (existing + new) pass on Python 3.13 / macOS.
Citations — RDD polish (v1.15)¶
DOI-verified via Crossref / publisher pages 2026-05-05:
noack2024biasaware, noack2025flexible, kolesar2018inference,
kaliski2025power, cattaneo2024extensions,
cattaneopalomba2025covariates, calonico2015optimal.
[1.14.0] — 2026-05-05¶
Headline¶
GPU-acceleration sprint. Three workloads now opt into accelerator
backends without changing their public API: (1) the neural causal
estimators (sp.deepiv, sp.tarnet, sp.cfrnet, sp.dragonnet,
sp.cevae) route through PyTorch CUDA / MPS via a centralised
device resolver; (2) sp.fast.feols_jax runs the full WLS solve on
JAX / XLA; and (3) sp.fast.feols_jax_bootstrap lifts a JIT-compiled
single-iteration WLS kernel to a jax.vmap batched primitive,
giving a 10–100x speedup over sequential CPU bootstrap on CUDA / TPU
at B ≥ 1000. Four bootstrap variants share the same JAX kernel
infrastructure: pairs (multinomial-weight resampling), cluster
(Cameron–Gelbach–Miller 2008 §III.A), wild (row-level Rademacher),
and wild cluster (Cameron–Gelbach–Miller 2008 §III.B); the wild
variants use the score formulation
β* = β̂ + (X'WX)⁻¹ X'W (η ⊙ û) which is mathematically identical
to refitting on y* = X β̂ + η ⊙ û but needs one mat-vec per
iteration instead of a full QR. A new cluster_meat Rust kernel
in statspai_hdfe (PyO3 + Rayon, parallel over clusters) is wired
behind statspai.core._numba_kernels.cluster_meat with the existing
numba kernel as automatic fallback. sp.iv(absorb=...) is the new
2SLS-with-HDFE entry point: residualises y, exogenous controls,
endogenous regressors, and instruments by one or more FE columns
via the Phase-1 Rust demean kernel before fitting, with the residual
DOF adjusted by Σ(G_k - 1) to charge the absorbed FE rank against
iid / HC1 / CR1 SEs. A new docs/guides/gpu_acceleration.md is the
canonical landing page for the accelerator story; the README and
paper.md link to it and explicitly bound the GPU promise (most
estimators are CPU-only by design — DiD / RD / synth / GMM are
bandwidth-bound or small-K convex programs where a tuned CPU
kernel matches GPU performance).
Added¶
sp.fast.feols_jax— JAX-backed end-to-end OLS / WLS with HDFE. Same formula DSL andFeolsResultreturn type assp.fast.feols; the WLS solve and HC1 sandwich run on the default JAX device. CR1 cluster sandwich delegates to the existingcrve(which itself dispatches to the newcluster_meatRust kernel when built). Defaultdtype="float64"preserves bit-comparable numerics;dtype="float32"available for the GPU fast path.sp.fast.feols_jax_bootstrap— vmap'd bootstrap with four variants (pairs,cluster,wild,wild_cluster).vmap_chunk_sizeparameter for memory control on tight devices. Same-seed → bit-identical reproducibility viajax.randomPRNG. Returns aFeolsBootstrapResultdataclass withcoef,se_boot, percentileci_lower/ci_upper, and the fullboot_betastable for custom CI methods.sp.iv(absorb=...)— 2SLS with HDFE residualisation. Accepts"firm + year"string syntax or["firm", "year"]list. LIML / Fuller / GMM / JIVE raiseNotImplementedError(Phase 3b).STATSPAI_TORCH_DEVICEenvironment variable (cpu/cuda/cuda:N/mps/auto) routes neural causal estimators through the requested device. Defaultcpupreserves existing pinned numerics; explicitcudaraises if the device is unavailable rather than silently falling back. Newsp.fast.torch_device_info()mirrorssp.fast.jax_device_info().statspai_hdfe::cluster_meatRust kernel — Rayon parallel over clusters with thread-local k×k upper-triangle accumulator and elementwise reduction. Bumped the crate version 0.5.0-alpha.1 → 0.7.0-alpha.1. Activation requires a one-timepip install maturin && cd rust/statspai_hdfe && maturin develop --release; Python falls back to the numba kernel transparently when the Rust extension is absent.docs/guides/gpu_acceleration.md— accelerator landing page with activation recipes, a Google Colab quickstart benchmark, and an explicit "what is not GPU-accelerated and why" table.
Changed¶
paper.mdadds a fifth bullet to the Unique features list documenting the accelerator story, and notes the Rust HDFE / cluster-meat kernel in the implementation paragraph.README.mdcomparison-table accelerator row now links to the new GPU guide; the What StatsPAI is — and is not bullet expands to explicitly mentionfeols_jax,feols_jax_bootstrap, and the vmap mechanism.
Internal¶
- New helper
_jax_prep_inputsshares formula-parse + FE-residualise logic betweenfeols_jaxandfeols_jax_bootstrap.feols_jaxitself is unchanged in this release; consolidation into a shared call site is a candidate follow-up. - Rust crate adds
src/cluster.rs(kernel) and acluster_meatPyO3 binding insrc/lib.rs. 3 cargo unit tests cover small-DGP reference parity, k=1 closed form, and empty-input safety.
Verified¶
- 10 PyTorch device-resolver tests
(
tests/test_torch_device_resolver.py); 51 existing neural tests pass without numerical drift on default CPU. - 9
cluster_meatRust parity tests (tests/test_cluster_meat_rust.py) — auto-skip whenstatspai_hdfeis not built. - 13
sp.iv(absorb=)parity tests vs explicit drop-first dummies (tests/test_iv_absorb.py); coefficients agree toatol=1e-9, iid SE tortol=1e-3, cluster SE tortol=1e-2. - 12
feols_jaxparity tests vsfeols(tests/test_jax_feols.py); iid / hc1 / cr1 / weighted / float32 / 6 error-path validations. - 24
feols_jax_bootstraptests (tests/test_jax_feols_bootstrap.py); convergence to HC1 SE for pairs / wild and CR1 SE for cluster / wild_cluster at B=2000 (rtol 10–15%); algebraic identity check that the wild score formulation reproduces the literal "refit on pseudo-y" bootstrap bit-for-bit on a no-FE DGP (atol=1e-9).
[1.13.1] — 2026-05-05¶
Headline¶
Stability tiers, external-validity dossier, and cold-start surgery in a
single release. Every FunctionSpec now carries a stability field
plus per-function limitations, surfaced through
sp.describe_function, sp.help, sp.list_functions(stability=...),
the statspai list CLI, and the LLM-facing sp.function_schema
description; sp.recommend / sp.causal / sp.paper default to
dropping experimental / deprecated entries unless
allow_experimental=True is passed — closing a path where an agent
could silently land on a frontier MVP. Eight high-impact estimators
(aipw, aggte, pretrends_test, sensitivity_rr, mccrary_test,
oster_bounds, wild_cluster_bootstrap, rd_honest) are upgraded
from auto-registered stubs to hand-written specs with full assumption
/ failure-mode / alternative metadata. A weak-instrument preflight
gate in sp.preflight(data, "ivreg", formula=...) raises a structured
warning row when the first-stage F falls below the Staiger–Stock
(1997) or Stock–Yogo (2005) thresholds, and sp.recommend(...
design='iv') adaptively reorders LIML / AR ahead of 2SLS on weak
first stages. A 36-module R parity harness, 21-module Stata parity
harness, 4-dataset original-paper replay (Card 1995, Callaway–Sant'Anna
mpdta, Abadie Basque, LaLonde NSW + PSID-1), Track-C performance
harness (HDFE / CS-DiD / SCM / DML log-log scaling), B=1000
Monte-Carlo coverage run, and a 900-trial CausalAgentBench prompt
suite all ship under tests/r_parity/, tests/stata_parity/,
tests/orig_parity/, tests/perf/, and tests/agent_bench/ with
paired R/Python/Stata drivers, JSON results, and 3-way Markdown +
LaTeX parity tables suitable for direct paper inclusion. A new
sp.validation_report() / sp.coverage_matrix() /
sp.reproduce_jss_tables() meta-API summarises the live registry,
materialises the parity / coverage / agent-bench artifacts as JSON,
and optionally re-runs the harnesses end-to-end so referees can
verify StatsPAI's external-validity claims without leaving Python.
Cold-start surgery in three steps brings sklearn submodules pulled by
import statspai from 245 to 0 (statspai.forest lazy-loaded —
Step 1B; 18 estimator files import sklearn lazily inside function
bodies — Step 1C; HAL TMLE classes drop sklearn class inheritance —
Step 1D), pinned by a new
test_sklearn_budget_ceiling_on_bare_import_statspai contract. The
workflow / paper orchestration layer replaces silent except: pass
paths with WorkflowDegradedWarning + structured degradations
records on the result object, so optional-stage failures surface in
PaperDraft.to_dict() and the rendered Pipeline notes section
instead of disappearing. sp.principal_strat(instrument=...) ships a
proper Angrist-Imbens-Rubin Wald-LATE estimator (the kwarg was
previously stubbed); sp.hal_tmle(variant='projection') keeps its
NotImplementedError but now points at a written-out RFC
(docs/rfc/hal_tmle_projection.md) instead of raising in silence.
Lazy-loading of optional families via __getattr__ keeps import
statspai fast without breaking same-name function/subpackage
collisions (bartik, deepiv, proximal, …) — pinned by a
late-bind / post-import-shadow contract test and a committed
__init__.pyi stub generator so IDE / mypy see lazy-loaded names. A
latent Callaway–Sant'Anna REG inference scaling bug — discovered
because the parity harness flagged it — is fixed in
did/callaway_santanna.py.
Added¶
- Weak-instrument preflight gate in
sp.preflight(data, "ivreg", formula=...). The newfirst_stage_strengthcheck parses the Wilkinson IV formula, runs the first-stage OLS, and emits awarningrow when the partial F-statistic falls below either the Staiger–Stock (1997) rule of thumb (F < 10, "very weak") or the Stock–Yogo (2005) 10% maximum-size critical value of 16.38 for one endog / one instrument (F < 16.38, "weak"). The warning payload includes structuredrecovery_hintspointing atmethod='liml',inference='ar', andsp.anderson_rubin_ci(...)so an LLM agent can branch on the typed envelope without parsing prose. Closes the §5.3 robustness DGP follow-up: Track B'stest_iv_weak_instrument_undercoveragedocuments that 2SLS+HC1 under-covers at the 0.88 level on a pi=0.10 first stage; the preflight now flags this before the user pays for the 2SLS fit. - Adaptive IV ranking in
sp.recommend(... design='iv'). When the live first-stage F is below 10 the recommendation list is reordered: LIML moves to the top, an Anderson–Rubin row is inserted, and the 2SLS row is annotated withvery_weak_iv=Trueplus a rationale that explains the HC1-coverage failure mode. When F is in [10, 16.38) the order stays 2SLS → LIML but the rationales reference the Stock–Yogo threshold. The 2SLS row now also carriesfirst_stage_Fas a numeric field so downstream tooling can consume it without regex. -
Preflight / recommend tests.
tests/test_preflight.pyaddsTestIVFirstStageStrength(6 cases covering strong / weak / borderline / non-IV / missing-columns / JSON-safe payloads).tests/test_smart_workflow.py::TestRecommendadds two cases pinning the 2SLS-first ordering on strong instruments and the LIML-first / AR-included ordering on weak instruments. -
R parity harness — 36 paired R/Python modules. New
tests/r_parity/ships 36 paired scripts (one R, one Python) that replay the same DGPs throughfixest/did/csdid/gsynth/MatchIt/DoubleML/rdrobust/Synth/lme4/plm/frontier/MR-PRESSO/lavaan/mediation/WeightIt/cobalt/mlogit/nlme/ordinal/ … and StatsPAI's matching estimator. Each module emits<id>_R.jsonand<id>_py.json;compare.pyproduces a 3-way parity table (parity_table.md/.tex/parity_table_3way.md/.tex) tightened with a small ID-column + longtable layout for direct paper inclusion. Two parallel tracks — "orig" (canonical-dataset replays) and "perf" (timing under matched DGPs) — share the same compare-tooling. -
Stata parity harness — 21-module StatsPAI ↔ Stata 3-way compare. New
tests/stata_parity/ships 21 paired.do/.pyscripts coveringreghdfe,xtreg,csdid,did_imputation,synth,synth_runner,ivreg2,xtivreg,rdrobust,psmatch2,teffects,xtfrontier,mixed/melogit/mepoisson,bayes,gmm,boottest, plus the Stata→Python translator round-trip. Drivers write Stata results to JSON via thestata-mcpstata_dotool; Python drivers run StatsPAI throughimport statspai as sp;compare_stata.pyjoins on(module, estimator, statistic)and emitsparity_table_stata.md/parity_table_stata.texplus the 3-way StatsPAI ↔ R ↔ Stata table. -
Canonical-dataset original-paper replays. New
tests/orig_parity/adds 4 module pairs that replay each paper's headline number bit-equal to the published value: Card (1995) returns-to-schooling onwooldridge::card; Callaway–Sant'Anna (2021) staggered DiD ondid::mpdta(the package's vendored minimum-wage panel); Abadie–Diamond–Hainmueller Basque onSynth::basque; LaLonde (1986) NSW onMatchIt::lalondeplus a 4b sub-module oncausalsens::lalonde.psid(true Dehejia–Wahba NSW + PSID-1, 2675 obs) wheresp.regress+sp.psmrecover the published −15,205 to relative tolerance 1.5e-05. Drivers + bundled data + JSON results +parity_table_orig.mdare all committed. -
Track-C performance harness — log-log timing. New
tests/perf/adds matched R/Python timing runs for HDFE (fixest::feolsvssp.feols), CS-DiD (did::att_gtvssp.callaway_santanna), SCM (Synth::synthvssp.synth), and DML (DoubleML::DoubleMLPLRvssp.dml) at log-spaced N. Drivers pluscompare_perf.pyproduceperf_table.md/.texand atrack_c_loglog.{pdf,png}log-log scaling figure. -
Coverage Monte Carlo at B=1000. New
tests/coverage_monte_carlo/run_b1000.pymeasures 95% CI coverage for OLS (0.952), 2×2 DiD (0.955), and strong-Z IV (0.962) — all inside the 99% Wilson band [0.935, 0.967] around nominal 0.95. The full slowpytest tests/coverage_monte_carlo/ -m slowsweep at B=1000 also passes 8/8 (753.51 s). Frozen run lives atresults_b1000/coverage_b1000.json; previous fast-track results remain at the default B=200. -
CausalAgentBench scaffolding (mock-mode shipped, API run gated). New
tests/agent_bench/ships a 50-prompt × L1/L2/L3 difficulty × 6 cells × 3 reps = 900-trial agent bench with a deterministic mock-LLM runner (runners/mock_llm.py), a frozen OSF pre-registration protocol (prompts/_protocol.md), and a grader (runners/grader.py) emitting an H1–H5 directional results table. Mock dry-run completes in <1 s and producesresults/headline.md+results/scores.csv+results/trials.jsonl; the production--apiflag is one switch away once the OSF pre-registration and API budget clear. -
sp.validation_report/sp.coverage_matrix/sp.reproduce_jss_tables— JSS-grade validation meta-API. Newsrc/statspai/validation.py(863 LOC, 58-test battery intests/test_jss_validation_api.py) exposes three top-level functions for the paper-submission audit trail:sp.validation_report()summarises the livesp.registryplus the materialised parity / coverage / agent-bench artifacts as a structuredValidationReport(registry.total_functions,evidence.r_parity_modules,evidence.stata_parity_modules,evidence.coverage_b1000,evidence.agent_bench_trials, …) with a one-paragraph.summary()and full JSON.to_dict();sp.coverage_matrix()enumerates every reference-implementation parity claim with its expected tolerance, observed gap, and the driver script that produced the JSON;sp.reproduce_jss_tables()returns aReproductionResultenumerating the exactRscript,python,pytest, andxelatexcommands needed to regenerate every table — and, when called withdry_run=False, executes them in dependency order with timing + return codes recorded per step. The default mode is metadata-only (no R / Stata / LaTeX required), soimport statspai as sp; print(sp.validation_report().summary())works on a stockpip install statspai==1.13.1install. -
8 high-impact estimators upgraded from auto-registered to hand-written FunctionSpec.
aipw,aggte,pretrends_test,sensitivity_rr,mccrary_test,oster_bounds,wild_cluster_bootstrap, andrd_honestnow ship with full agent-native metadata: 2–4 assumptions per spec, 2 failure modes with recovery hints, ranked alternatives, typical_n_min, vetted references with paper.bib bib keys, and full enum-validated ParamSpecs. Previously these were auto-registered with only the first docstring line and inferred parameter types — agents callingsp.describe_function('aipw')could not see the doubly-robust guarantee, the propensity overlap requirement, or the alternatives to fall back on. Hand-written count moves from 203 to 211; auto- registered drops from 768 to 760. (Step H of v1.13 stability roadmap.) -
sp.principal_strat(instrument=...)— encouragement-design AIR / Wald LATE. The previously-stubbedinstrument=parameter now routes to a proper estimator (Angrist-Imbens-Rubin 1996 §4): given binary instrumentZ, treatmentD, post-treatment stratumS, and outcomeY, under randomZ+ monotonicity D(1)>=D(0) + exclusion + SUTVA, the function reports two Wald LATEs among Z-compliers —τ_Yfor the effect ofDon the outcome andτ_Sfor the effect ofDon the post-treatment stratum variable — plus the complier shareπ_C(Z), all with bootstrap SE/CI. ARuntimeWarningis emitted when the first stage degenerates or points in the wrong direction for the supplied instrument coding.method=is ignored on this path because identification comes fromZ, not from the post-treatment stratum decomposition. Thelimitationsentry is rewritten: the only remaining gap on this path is always-survivor SACE under encouragement design (Mealli & Pacini 2013, partial identification). Seven new tests intests/test_principal_strat.py. -
sp.hal_tmle(variant='projection')RFC + sharper error. Rather than ship an unverified port of the Li-Qiu-Wang-vdL (2025) §3.2 Riesz-projection step (the v1.11.x code path was a no-op on the point estimate — see CHANGELOG), v1.13 keeps theNotImplementedErrorand addsdocs/rfc/hal_tmle_projection.mdwith the full implementation roadmap and the parity-test gates that must clear before the variant can be promoted tostable. The runtime exception message now points at the RFC and asks reporters to file an issue with the publication's headline number they'd like to match — so the next maintainer to pick this up has a clear target. Registrylimitationsentry updated with the RFC link. -
Smart layer respects
FunctionSpec.stability.sp.recommend(...),sp.causal(...), andsp.paper(...)now accept anallow_experimental: bool = Falseflag (default agent-safe). WhenFalse, recommendations whose backing function is registered asstability='experimental'(or'deprecated') are dropped from the ranked output and the workflow'swarnings/pipeline_notesrecords what was filtered. PassTrueto include frontier MVPs (e.g.did_multiplegt_dyn,text_treatment_effect). This closes a gap where an LLM agent askingsp.causal(df, ...)for an applied analysis could silently land on a frontier MVP just because the recommender ranked it first. Tests intests/test_smart_stability_gating.py. - Stability reverse-audit script.
scripts/stability_audit.pycross-checks everystability='stable'claim in the registry against parity-test coverage intests/reference_parity/andtests/external_parity/. Splits the catalogue into hand-written vs. auto-registered specs (the latter having been silently classifiedstableby default) and reports the count of unbacked claims in each bucket.--checkmode is CI-friendly and fails when the unbacked-handwritten count exceeds a loose floor (currently 220) — bumping the floor requires editing the script as a deliberate quality signal. Does NOT auto-downgrade; the call to flip a function fromstabletoexperimentalbelongs to a maintainer who has read the code. Tests intests/test_stability_audit.py. The audit fixed a registry bug along the way: auto-registered specs were never tagged_auto=Trueon theFunctionSpecinstance, sodescribe_functionerror hints and the audit itself couldn't distinguish them from hand-written entries; that's now fixed viaobject.__setattr__inside_auto_spec_from_callable. - Runtime consistency tests for
FunctionSpec.limitations. Eachlimitationsentry on aFunctionSpecis now structurally audited bytests/test_limitations_consistency.pyso the registry's parity-grade-with-known-gaps claims cannot drift away from runtime behaviour: every entry must (a) use vetted vocabulary and (b) be classified as either runtime-testable (a curated map calls the function with the unimplemented value and asserts the documented exception) or descriptively-soft (silent fallback / caveat, whitelisted inLIMITATIONS_DESCRIPTIVE_ONLY). Adding a new limitation without classifying it now fails CI. Caught one drift bug in this pass: thecgroup='nevertreated' + panel=Falselimitation was attached towooldridge_did, but only theetwfealias exposes those parameters — moved toetwfeand surfaced the missingcgroupParamSpec to the schema. - Test-coverage battery for the four worst-covered files +
parity-grade smoke battery across
did/synth/rd/iv/tmle/bayes. The v1.12.x audit flagged six causal-family modules at low statement coverage (did14.7%,synth12.9%,rd16.9%,iv18.0%,tmle14.8%,bayes14.1%) with four files entirely unexercised:wooldridge_did.py,did_imputation.py,synth/report.py,workflow/paper.py. Five new test files raise per-file coverage to synth/report.py 4% → 81%, wooldridge_did.py 76% → 93%, did_imputation.py 85% → 99%, workflow/paper.py 66% → 86% and add a 30-test cross-family smoke battery (tests/test_low_cov_battery.py) that exercises every headline estimator's CI/SE/point-estimate contract: tests/test_synth_report.py(25 tests) — full text/markdown/LaTeX SCM report renderer + every sensitivity sub-block + the LaTeX escape table.tests/test_wooldridge_did_branches.py(31 tests) — Bacon + dCDH decomposition, repeated-CS / never-only / xvar dispatch branches, everyetwfevalidation guard, all fouretwfe_emfxaggregations includinginclude_leads=True.tests/test_did_imputation_branches.py(14 tests) — everyValueErrorguard, the controls + horizon event-study path with pre-trend chi-squared test, and the_cluster_se_horizonN_k == 0short-circuit.tests/test_paper_branches.py(31 tests) — every YAML/TeX/MD helper, all fourto_qmdrendering branches (single vs. multi-format, author / bibliography / csl),to_docxfallback whenpython-docxis missing,write()extension dispatch, and the_render_dag_sectiontext + mermaid branches.-
CausalResult.summary()accepts both event-study column conventions. The sharedsummary()previously hard-coded(relative_time, att)and crashed withKeyError: 'relative_time'onwooldridge_did/etwferesults, which carry the(rel_time, estimate)schema instead. The renderer now auto-detects whichever pair is present and silently skips the event-study block when neither is — every existing caller keeps its formatting and the wooldridge family no longer crashes a user's.summary()call. Regression-pinned bytest_wooldridge_did_summary_renders_event_study. -
Stability tiers and per-function
limitations(parity-grade vs. frontier-grade visibility). EveryFunctionSpecnow carries astabilityfield ("stable"/"experimental"/"deprecated", exposed assp.STABILITY_TIERS) and alimitationslist that enumerates partial-implementation gaps inside otherwise stable functions (e.g.hal_tmle(variant='projection'),principal_strat(instrument=...),rdrobust(weights=...)). The fields flow throughsp.describe_function,sp.agent_card,sp.function_schema(description prefix +Known limitations:suffix so LLM tool-callers see the gap before calling),sp.list_functions(stability=...),sp.agent_cards(stability=...), theSTABILITYblock insp.help(), the per-function detail insp.help('<name>'), and a newstatspai list --stability ...CLI flag. This closes a layering gap where users (and agents) could not tell which functions are numerically aligned and signature-locked vs. which are MVP / RFC-tracked frontier work, and where specific unimplemented variants were only discoverable by triggeringNotImplementedErrormid-pipeline. Initial tagging covers the three causal_text /did_multiplegt_dynexperimental entries plus variant-level limitations onhal_tmle,principal_strat,rdrobust,callaway_santanna,wooldridge_did,network_exposure, andcontinuous_did. See the newdocs/guides/stability.mdfor the contract and promotion path.
Changed¶
-
Cold-start: lazy-load
statspai.forest(Step 1B).import statspaipreviously chainedfrom .forest.causal_forest import CausalForest, causal_forestplus three sibling eager imports forforest_inference/multi_arm_forest/iv_forestat module load, transitively pulling ~245sklearn.*submodules intosys.modules(~270 ms cumulative on cold cache) for every session — even ones that never touch heterogeneous-effect forests. The four eager lines are removed; the ten public leaves (CausalForest,causal_forest,calibration_test,test_calibration,rate,honest_variance,multi_arm_forest,MultiArmForestResult,iv_forest,IVForestResult) now resolve via_LAZY_ATTRSkeyed to dotted submodule paths (e.g.forest.causal_forest) and fault in on firstsp.<name>access.forestdoes not collide with a top-level function (nosp.forestcallable export) so the standard lazy path is safe;sp.causal's callable shim and thestatspai.causaldeprecation shim continue to work unchanged. Pinned by three new contracts intests/test_late_bind_contracts.py—import statspaimust not pre-load anystatspai.forest.*submodule (subprocess-isolated to avoidsys.modulespollution that would corrupt downstreamisinstancechecks); each of the 10 forest leaves must resolve to a callable on first access; and a downstreamfrom statspai.forest.causal_forest import CausalForestmust not re-shadowsp.causal_forestto the leaf module via Python's post-import attribute binding. Other sklearn-eager paths (did/overlap_did,metalearners/*,policy_learning/*,synth/cluster, plus ~7 conflict-prone same-name modules pinned eager for the late-bind contract) still pullsklearnon bare import; those are tracked separately for Step 1C and do not block this lazy-forest win. -
Cold-start: drop sklearn class inheritance from HAL estimators (Step 1D).
HALRegressor/HALClassifierintmle/hal_tmle.pypreviously subclassedsklearn.base.BaseEstimatorplus a Mixin, which pulled ~39sklearn.*submodules intosys.modulesat module-load time — the only remaining sklearn footprint after Steps 1B/1C lazy-loadedforestand the 18 estimator files. The inheritance is gratuitous here:super_learner.fitonly needssklearn.base.clone(learner)(which is duck-typed —get_params(deep=False)+cls(**params)reconstruction) plus.fit/.predict/.predict_proba; no code path calls.score(...),is_classifier(...), oris_regressor(...)on the HAL classes. Replaced the inheritance with a minimal_BaseHALproviding theget_params/set_params/__repr__slice thatclone()actually consumes (introspection viainspect.signature(self.__init__), with object identity preserved so sklearn's post-cloneparam1 is param2sanity check passes)._estimator_type = "regressor"/"classifier"class attributes keepsklearn.base.is_regressor/is_classifierreturning True for any future external caller. After Step 1D,import statspaipulls zero sklearn submodules — full 245 → 0 — and thetest_sklearn_budget_ceiling_on_bare_import_statspaicontract is tightened from<= 50to<= 0to pin the floor. 152 tests acrosstest_hal_tmle/test_tmle/test_late_bind_contracts/test_low_cov_battery/test_metalearnerspass cleanly. -
Cold-start: lazy-import sklearn across 18 estimator files (Step 1C). Building on Step 1B, every remaining top-level
from sklearn.X import Yindid/overlap_did.py,metalearners/{auto_cate,metalearners,auto_cate_tuned}.py,policy_learning/{policy_tree,ope}.py,synth/cluster.py,proximal/pci_regression.py,bcf/{bcf,longitudinal}.py,tmle/{tmle,super_learner,ltmle,ltmle_survival}.py,dose_response/gps.py,multi_treatment/multi_ipw.py,mediation/four_way.py, andinterference/orthogonal.pywas moved inside the function bodies that actually use it.BaseEstimatortype annotations were converted to string-literal form underif TYPE_CHECKING:soinspect.signature/ Pyright / mypy still resolve them without forcingsklearn.baseat module load. Several long-standing dead imports were dropped (BaseEstimator/is_classifier/cross_val_predictinmetalearners/metalearners.py;LinearRegressioninproximal/pci_regression.py;BaseEstimator/clone/GradientBoostingClassifierinmulti_treatment/multi_ipw.py; etc.). After Step 1B + 1C,import statspaipulls 39 sklearn submodules instead of 245 — a 5.3× reduction. The 39 aresklearn.baseplus its mandatory deps, pulled bytmle/hal_tmle.pywhoseHALRegressor(BaseEstimator, RegressorMixin)/HALClassifier(BaseEstimator, ClassifierMixin)need sklearn at class-definition time; refactoring that inheritance hierarchy is out of scope. Pinned by a newtest_sklearn_budget_ceiling_on_bare_import_statspaicontract intests/test_late_bind_contracts.py(≤ 50 ceiling, ~39 floor + 11 slack for sklearn-version drift) running in a subprocess so the cold-state measurement does not perturb other tests'sys.modules. 248 tests across the 18 affected modules (metalearners / metalearner_frontiers / auto_cate / auto_cate_tuned / overlap_did / tmle / hal_tmle / proximal / proximal_frontiers / bcf_longitudinal / bcf_ordinal / conformal_bcf_bunching_mc / policy_learning / mediation / mediation_sensitivity / interference_extensions / late_bind_contracts / causal_forest_grf / forest_inference / ope_cevae / ope_extensions / cluster_rct) pass cleanly. -
README lead aligned with the agent-native + parity-validated positioning.
README.mdandREADME_CN.mdboth foreground "first agent-native Python platform" and surface R / Stata parity validation in the lead paragraph (was previously buried in the Task View comparison section further down). -
sp.recommend()now defaults to an agent-safe stability gate: recommendations whose registry entry is markedstability='experimental'orstability='deprecated'are dropped unless the caller passesallow_experimental=True. The filter keeps backward compatibility for unknown custom recommendation entries, records dropped names inRecommendationResult.warnings, and is forwarded throughsp.causal(..., allow_experimental=...)andsp.paper(..., allow_experimental=...)so higher-level workflows cannot silently land on frontier MVP estimators. - Hardened the workflow/paper orchestration layer so optional failures
no longer disappear silently.
sp.causal(...).run(full=True)now records optional-stage failures (compare_estimators,sensitivity_panel,cate) inworkflow.pipeline_notes, andsp.causal(...).report(fmt='markdown')renders those notes in a dedicated section instead of silently dropping the context. sp.paper(...)now constructs its internalCausalWorkflowwithauto_run=Falseand advances stages exactly once. This removes the prior double-execution path where the workflow could fully auto-run beforepaper()manually re-randiagnose/recommend/estimate(and sometimesrobustness) again.PaperDraftnow surfaces orchestration degradations directly in aPipeline notessection and includesdegradationsinto_dict(), so missing DAG/citation/provenance/section-rendering steps are visible in the artifact itself rather than only via warnings.
Fixed¶
-
⚠️ Correctness —
sp.callaway_santanna(method='reg')inference. The Callaway–Sant'Anna outcome-regression (REG) path produced influence-function standard errors that were inconsistent with the IPW / DR variants because the control-regression uncertainty was not propagated and the per-cohort scaling in the influence-function aggregation was off by the cohort-size weighting. Coverage simulations under thempdtaDGP were running ~88% (nominal 95%) onmethod='reg', while'ipw'and'dr'were inside the 99% Wilson band. The fix tightens the REG influence-function scaling and explicitly adds the control-regression contribution; the parity table attests/r_parity/results/parity_table_3way.mdand the coverage frame attests/coverage_monte_carlo/FINDINGS.mdare refreshed accordingly. Regression-pinned bytests/reference_parity/test_did_parity.pyand the newtests/r_parity/04_csdid.pydriver. Re-run any v1.10–v1.13 Callaway–Sant'Anna analyses that usedmethod='reg';'ipw'and'dr'are unchanged. -
isinstance(res, sp.OPEResult)no longer false-negative on results fromsp.ope.*. During the lazy-load refactor of optional families the eager re-export path that used to bindsp.OPEResulttostatspai.ope.estimators.OPEResultwas dropped, sosp.OPEResultsilently resolved to a parallel class defined instatspai.policy_learning.ope— andisinstance(sp.ope.ips(...), sp.OPEResult)flipped fromTrue(v1.12.2) toFalse. The eagerfrom .policy_learning import ... OPEResultis removed sosp.OPEResultfalls through to the lazy_register_lazy("ope", "OPEResult", ...)table, restoring v1.12.2 class identity. Regression-pinned bytests/test_ope_cevae.py::test_ips_close_to_true_value. - Hand-written registry specs for
aggteandprincipal_stratnow exactly match their callable signatures (na_rm,alpha,seed), with a regression test guarding the new v1.13 hand-written upgrades against future signature drift. - The natural-language
sp.paper(data, question, ..., include_robustness=False)path no longer runs or renders the robustness section implicitly viasp.causalauto-run side effects. paper_from_question()now carries its collected degradation records into the returnedPaperDraft, so late provenance/citation/DAG failures remain inspectable after draft construction.- Top-level
statspai.__all__is now de-duplicated in order-preserving fashion, reducing public-surface drift between the import namespace and registry/help tooling. - The top-level function-first API now survives the
sp.ivbootstrap path for same-name families likebartikanddeepiv. The root package eagerly rebinds the 14 function/subpackage collisions (proximal,principal_strat,bartik,bridge,causal_impact,bcf,bunching,deepiv,dose_response,frontier,interference,msm,multi_treatment,tmle) whilestatspai.ivlazy-loads its optionalbartik/deepivre-exports, sosp.bartik(...)/sp.deepiv(...)stay callable instead of degenerating into bare module objects after import order changes. smart.assumptions,smart.brief,smart.identification,smart.sensitivity, andsmart.verifynow lazy-importworkflow._degradationonly inside failure paths. That removes a prematureworkflow/__init__import duringimport statspai, which had reintroduced partially initialized top-level symbols and made the lazy API order-sensitive.- Added a committed
src/statspai/__init__.pyigenerator and pinned it with a regression test so IDE/type-checker visibility tracks the live runtime namespace. The stub generator now skips exported constants during leaf scanning and correctly typesSTABILITY_TIERSasfrozenset[str], avoiding duplicate/conflicting declarations. - Pinned the two binding hazards introduced by the lazy-load refactor
with 21 explicit contracts in
tests/test_late_bind_contracts.py: the five late-bind aliases re-bound by_article_aliases(mediation,policy_tree,dml,matrix_completion,causal_discovery) plus thesp.ivcallable dispatcher must each remain callable rather than degenerating to a module on import re-order; and the 14 function/subpackage collisions (proximal,principal_strat,bridge,bcf,bunching,dose_response,multi_treatment,causal_impact,frontier,interference,tmle,msm,deepiv,bartik) must survive a downstreamfrom statspai.X import Ywithout the auto-bound submodule silently re-shadowing the function. Closes the residual gap left by Codex's lazy-load refactor and Claude Code's same-name eager-rebind follow-up.
[1.12.2] — 2026-05-01¶
Headline¶
ML-routing for the estimand-first DSL (sp.causal_question) plus a
shared robustness battery so sp.paper(...) renders the same audit
section regardless of entry point. The Egami et al. (2024) LLM-label
corrector graduates from binary-only to multi-class with a
bias-corrected bootstrap, and DML's IV variants (sp.dml(model='pliv'),
sp.dml(model='iivm')) now honour sample_weight end-to-end.
Citation metadata fixes the wrong Zenodo DOI shipped under v1.12.1 —
no estimator output changes.
Added¶
sp.llm_annotator_correct(causal_text/llm_annotator.py) — three v1.7-deferred upgrades to the Egami et al. (2024) measurement-error correction for LLM-derived treatment labels. Backward compatible: the binary-T numerical path is unchanged, every existing kwarg keeps its default behaviour, and existing diagnostics retain their keys.- Multi-class treatment. The corrector now auto-detects the class
set from the union of LLM and human labels. For K ≥ 3 the
confusion matrix
M[i, j] = P(T_obs=j | T_true=i), validation- marginalπ[i], and Bayes posteriorQ[i, j] = P(T_true=i | T_obs=j)are assembled; the K×K coefficient transformθ_obs = T θ_trueis inverted to recover per-class corrected contrasts. Headline.estimatereports the smallest non-reference class; full vector ships in.detail(per-class naive/corrected estimate, SE, CI, p-value). Singular / near-singularTraisesIdentificationFailure. - Bias-corrected bootstrap. Optional
bootstrap=Truejointly resamples the full sample (validation rows + unlabeled rows) and re-runs the entire correction pipelinen_bootstraptimes (default 500), reporting Efron-Tibshirani bias-corrected percentile CIs that reflect validation-set sampling uncertainty. New kwargs:bootstrap,n_bootstrap,bootstrap_seed. First-order SE/CI remain available inmodel_info['first_order_se' / '_ci'];bootstrapsub-dict reportsn_valid,n_failed,seed,method,mean,median. - SE inflation factor diagnostic. Both binary and multi-class
paths populate
model_info['se_inflation_factor']— a delta- method multiplier (≥ 1) the user can apply to the first-order SE for an honest accounting of validation-set noise. For binary it is derived analytically from the binomial variances ofp_01andp_10; for multi-class it is a finite-difference Jacobian-based heuristic (usebootstrap=Truefor the rigorous version). - Multi-class diagnostics also expose
confusion_matrix,q_posterior,transform_matrix,condition_number,pi_validation,headline_contrast. sp.causal_question(..., design=...)now accepts the four ML-selection-on-observables tags directly:design='dml' | 'tmle' | 'metalearner' | 'causal_forest'. The planner records the right identification story / assumptions for each, and the dispatcher now routes to the corresponding estimator with targeted validation (e.g. DML covariates required; PLIV / IIVM scalar-instrument guard; causal-forest binary-treatment guard for ATE inference).- New guide:
docs/guides/choosing_ml_causal_estimator.md— decision tree for choosing between DML / TMLE / metalearner / causal_forest, plus a side-by-side comparison of estimands, IV support, and inference. - Shared robustness battery:
workflow/_robustness.py+run_robustness_battery(...). Bothsp.paper(data, question, ...)andsp.paper(CausalQuestion(...))now render the same design-aware robustness section instead of splitting between a thin NL path and a placeholder estimand-first path. - Weighted
sample_weightsupport insp.dml(model='pliv')andsp.dml(model='iivm'). The IV orthogonality moment, residualisation step, and downstream sandwich SE are all weighted consistently (E[w · ψ(W; θ, η)] = 0); unit weights reproduce the unweighted path bit-for-bit. Closes the lastsample_weightgap indml/after the v1.12.0 PLR / interactive audits —sp.dml's four core estimators now all support survey / inverse-probability weights.
Changed¶
sp.causal(...).robustness()now delegates to the shared robustness battery and still preserves backwards compatibility via the legacy flatrobustness_findingsdict; structured per-finding records are additionally available under['_findings'].paper.bib/ docs metadata filled in missing bibliographic details for TMLE / causal forest / meta-learner references and removed a duplicate van der Laan entry so the new ML-estimator guide and the expandedcausal_questiondocstrings resolve cleanly.docs/guides/causal_text_family.mdand the registry card forsp.llm_annotator_correctnow describe the new multi-class, bootstrap, and SE-inflation-factor behaviour rather than the old binary-only path.
Fixed¶
sp.paper(CausalQuestion(...))no longer emits a placeholder Robustness section pointing users back tosp.causal(...); it now runs the same substantive battery as the natural-language paper path.sp.causal_question(..., estimand='CATE')now auto-promotes tometalearneronly when effect modifiers are actually declared. Without covariates it falls back honestly to a scalar ATE path with an explicit warning, soidentify()andestimate()agree.design='causal_forest'now reports the population ATE summary via cross-fit AIPW influence-function inference instead of leaving the planner with a CATE-only story and no principled scalar ATE layer.
[1.12.1] — 2026-04-30¶
Citation metadata polish — no numerical or API changes to any estimator.
Added¶
sp.citation(format=...)— package-level citation helper returning BibTeX (default), APA, plain text, or the rawCITATION.cffcontents. Distinct fromsp.cite(), which formats individual coefficients inline.sp.__citation__exposes the default BibTeX entry as astrfor one-liners.CITATION.cffat the repository root — GitHub renders a "Cite this repository" button from it; bundled in the sdist viaMANIFEST.in.- Zenodo DOI 10.5281/zenodo.19933900
(concept DOI; always resolves to the latest archived release). The
DOI now appears in
sp.citation()output, the README citation block, and a DOI badge alongside the existing review-status badge. .zenodo.jsonso future GitHub Releases mint version-specific DOIs with consistent metadata (creators, keywords, license, related identifiers).
[1.12.0] — 2026-04-30¶
Headline¶
The whole dml/ module got a careful audit. sp.dml / sp.dml_panel
/ sp.dml_model_averaging all stay backwards-compatible at the
call-site level (existing scripts keep working) but several internal
numerical behaviours change — see the ⚠️ Correctness section and
MIGRATION.md.
⚠️ Correctness¶
sp.dml(model='irm')andsp.dml(model='iivm')now useStratifiedKFold(stratified by D and Z respectively) — the oldKFoldcould produce a fold whose subgroup mask was empty, in which case the AIPW score for that fold's test rows was silently filled with zeros (biased point estimate, biased SE). Empty subgroups now raiseIdentificationFailurewith a clear remedy. Estimates may shift slightly on data sets where the oldKFoldhappened to produce extreme folds.sp.dml_panel(binary_treatment=True)is now a deprecated no-op. The previous classifier path fit a propensity on within-demeaned features but raw {0,1} labels — there is no clean interpretation as E[D̃ | X̃] for the result. The estimator now always uses a regressor on D̃ (PLR-with-FE is agnostic to D's type). ADeprecationWarningis emitted, andD ∈ {0,1}is validated when the flag is True.sp.dml_model_averagingnow drops rows with NaN in y / treat / covariates / sample_weight (matching every other DML class); previously NaNs propagated into sklearn fits and could produce NaN estimates undetected by the existingdenom < 1e-12guard.sp.dml_model_averaging: the defaultweight_ruleis now"short_stacking"— Ahrens, Hansen, Schaffer & Wiemann (2025, JAE) eq. 7 — which solves a constrained least squares stacking problem on cross-fitted nuisance predictions and plugs the stacked nuisance into a single PLR moment equation. The previous"inverse_risk"default (heuristic 1/MSE-weighted average of per-candidate θ̂_k) was not in the cited paper and is preserved as a clearly labelled baseline. New"single_best"matches the paper's footnote 8 formulation. Per-nuisance stacking weights are exposed asmodel_info["weights_g"]/weights_m.sp.dml(model='pliv')raisesRuntimeErrorwhen the ML-residualised partial correlation|corr(z̃, d̃)|falls below1e-3(was1e-6, too lenient to catch genuine weak-IV collapse). A newmodel_info["diagnostics"]block reports the partial correlation and an approximate first-stage F.
Added¶
- All four
sp.dml(model=…)variants now accept arandom_state=argument (default 42) controlling fold assignment. Repeated splits userandom_state + repso a single seed fully determines the result. sample_weight=support onsp.dml(model='plr'),sp.dml(model='irm'),sp.dml_panel, andsp.dml_model_averaging(any weight rule). The weighted estimator uses a Z-estimator sandwich variance throughout.sp.dml(model='pliv')andsp.dml(model='iivm')raiseNotImplementedErrorif a non-trivial weight is supplied — the weighted Wald-ratio variance derivation is non-trivial and lands in a follow-up.sample_weightmay be passed as a 1-D array, a pandas Series, or a column name string.- New
model_info["diagnostics"]block on every variant: - PLR: residual scales, partial correlation y_resid·d_resid, within-R² of each nuisance.
- IRM: propensity p01/p99/min/max, n clipped below/above the
[0.01, 0.99]overlap clip, n times the subgroup g̃₁/g̃₀ fit fell back to the subgroup mean. - IIVM: instrument-propensity p01/p99/min/max, clipping counts, subgroup fallbacks for both g(z, X) and r(z, X), and E[ψ_b] (the LATE Wald-ratio denominator — proximity to zero indicates a weak first stage).
- PLIV: first-stage partial correlation, approximate first-stage F, residual scales.
- panel_dml: y/d residual std, within-R², cluster Ω, weighted flag.
sp.dml_panel(sample_weight=…)does a weighted within transform (subtract weighted unit / time means) and reports a weighted Liang-Zeger cluster SE.
Changed¶
- Internal flag rename
_BINARY_TREATMENT→_ML_M_TARGET_BINARYand_BINARY_INSTRUMENT→_ML_R_TARGET_BINARYon the per-model DML classes. The new names describe the nuisance-target shape (the IIVMml_mactually models the instrument propensity, not D). These flags are private (underscore-prefixed); no public API change. paper.bib: filled in the missingvolume/number/pagesfields on@ahrens2025model(40(3):249–269), verified via the Wiley Online Library record and the JAE issue listing.
Internal¶
- Per-rep diagnostics now flow back to
model_info["diagnostics"]via a new_aggregate_diagnosticshelper on_DoubleMLBase. Each subclass populatesself._last_rep_diagnosticsinside_fit_one_rep; the base merges across reps (sum for counts, mean for floats, OR for booleans, concat for lists).
⚠️ Correctness — TMLE module audit pass¶
sp.tmle.SuperLearnerpreviously ran NNLS and post-hoc-normalised weights to sum to 1, which is not the simplex-constrained optimum (rescaling an unconstrained NNLS solution gives the simplex optimum only when the unconstrained sum already equals 1, a measure-zero event). Replaced with a direct SLSQP QP on the simplex; ensemble predictions are now genuinely the convex combination minimising squared loss. Affects every downstream caller —sp.tmle,sp.hal_tmle, and any user code that builds a Super Learner directly. Numerical results will shift slightly on data sets where the old NNLS solution did not happen to be on the simplex.sp.tmle.ltmlecensoring half-implementation: the regime-following indicator now includes& (C_k_obs == 1)so censored units are excluded from the targeting equation rather than continuing to contribute with1/p_c-inflated weights. (sp.tmle.ltmle_survivalwas already correct on this;ltmle.pywas the regression.)sp.tmle.ltmle_survivalinfluence function: previously used-H * (T_k - h_star_regime)summed across intervals as the influence function for both the RMST contrast and the terminal risk difference at K. The proper EIF for :math:E[S^a(t)](Cai & van der Laan 2020) needs the survival-product factor :math:S^a(t)/S^a(j)and the IC for the terminal RD at K is the EIF of :math:S^a(K)alone (NOT the cumulative-across-K RMST IC). Refactored_run_regimeto expose the per-subject sequencesS_seq,h_star_seq,H_seq,T_seq; the call site now computes the RMST and terminal-RD EIFs separately via_eif_rmstand_eif_survival_at_k. SE estimates change — generally smaller for RMST (was conservative), and the terminal-RD SE is now correctly tied to its target functional rather than picking up RMST's cross-time aggregation.sp.hal_tmle(variant='projection')was a no-op in v1.11.x and earlier. The projection variant ran an ad-hoc shrinkage onmodel_info["eps"]after the point estimate had already been computed; the variant flag did not change the estimate. The path now raises :class:NotImplementedErrorhonestly until the proper Riesz-projection step (Li-Qiu-Wang-vdL 2025 §3.2) is ported.sp.hal_tmledocstring previously claimed the basis was "rich enough to approximate any càdlàg function of bounded variation", the property of full HAL (Benkeser & van der Laan 2016). The implementation only builds main-effects indicator basis functions :math:\\mathbb 1\\{x_j \\le a_j\\}— i.e. L1-penalised additive piecewise-constant regression, NOT full HAL. Docstring is corrected; numerical behaviour unchanged.
Fixed — TMLE convergence + overlap diagnostics¶
sp.tmle._fit_epsilonnow emits aUserWarningwhen the Newton iteration on the fluctuation parameter fails to converge inmax_itersteps, instead of silently returning the last value (which yields a non-targeted plug-in). The warning includes the final score magnitude and ε for diagnosis.sp.tmlenow reportsmodel_info['propensity_diagnostics'](min, max, p01, p99, n clipped below/above, clip share) and emits aUserWarningwhen ≥ 5 % of propensities hit thepropensity_boundsclip — same overlap convention assp.metalearner. AIPW scores blow up at e≈0/1, so heavy clipping silently changes the estimand from ATE in the population to ATE on the trimmed sample.sp.tmle.SuperLearner(task='classification')validates that the target is binary (was silently dropping non-{0,1} columns ofpredict_proba); switches toStratifiedKFoldso every fold has both classes;predict()clips to (1e-6, 1-1e-6) for classification (was inconsistent withpredict_probawhich already clipped).
Fixed — TMLE / HAL-TMLE citations (§10 verification pass)¶
paper.bibnow records three previously-uncatalogued HAL-TMLE references with full Crossref/arXiv-verified metadata (added 2026-04-30):@li2025regularized— arXiv:2506.17214, verified via arxiv.org. Earlier inline-cited title inhal_tmle.pywas"Highly Adaptive Lasso Implementations"; the paper's actual title is"Highly Adaptive Lasso Implied Working Models"— fixed in docstring +model_info['citation'].@vanderlaan2023efficient— IJB 19(1):261–289, doi 10.1515/ijb-2019-0092, verified via degruyterbrill.com.@benkeser2016highly— IEEE DSAA 2016, pp. 689–696, doi 10.1109/DSAA.2016.93, verified via Crossref API.tmle.py:_CITATIONS['tmle']now includes thevanderlaan2006targetedreference that the docstring already cites (was missing — docstring promised it via[@vanderlaan2006targeted]but the inline BibTeX registered onlyvanderlaan2007super). Author punctuation / capitalisation aligned topaper.bib.ltmle_survival.pycai2020stepreference reformatted to match paper.bib (year 2020 vs the previous docstring's 2019; the IJB volume's nominal year is 2020).- Dropped the dangling "Qian-van der Laan Section 4" reference from
hal_tmle.pyprojection-variant docstring (the paper was never in References section and the cited Section 4 doesn't exist in any HAL-TMLE paper).
⚠️ Correctness — sp.metalearner unifies ATE / SE via AIPW influence function¶
- ATE for all learners (
learner ∈ {'s','t','x','r','dr'}) is now the mean of the AIPW (DR) pseudo-outcome :math:\varphi_i = \hat\mu_1(X_i) - \hat\mu_0(X_i) + D_i(Y_i-\hat\mu_1(X_i))/\hat e(X_i) - (1-D_i)(Y_i-\hat\mu_0(X_i))/(1-\hat e(X_i)), and SE is :math:\sigma(\varphi)/\sqrt n. AIPW is the semiparametric-efficient estimating function for :math:E[Y(1)-Y(0)](van der Laan & Robins 2003; Kennedy 2023), so the SE is valid for any CATE estimator the user picks vialearner=. - Previously S/T/X/R-Learner used
mean(τ̂(X))for ATE and a re-sampling bootstrap of the fitted CATE values for SE. That bootstrap silently treated τ̂ as fixed and only captured empirical- mean variation — completely missing the dominant component (estimation error in τ̂ itself). Result: SEs were systematically too small and CIs severely under-covered. - DR-Learner: ATE was previously
mean(τ̂(X))from the regularised CATE fit, while SE usedstd(φ)/√nfrom the raw pseudo-outcome — a finite-sample inconsistency that disappears under the newmean(φ)ATE. - New
model_info['se_method'] = 'aipw_influence_function'(was'bootstrap'for S/T/X/R,'influence_function'for DR).model_info['ate_method'] = 'aipw_dr_pseudo_outcome'.n_bootstrapparameter is deprecated and ignored; will be removed in a future minor release. - New
model_info['aipw_diagnostics']block reports clipped-propensity counts and share.UserWarningfires when ≥ 5 % of propensities hit the (0.01, 0.99) overlap clip — overlap is poor and the AIPW score may be biased toward the trimmed sample.
Fixed — Künzel et al. 2019 author hallucination (§10 red line)¶
metalearners.pypreviously listedSeetharam, Liang, Atheyas co-authors of Künzel et al. 2019 PNAS — those are invented names. Correct authors per the canonical record inpaper.bib: Künzel, Sekhon, Bickel, Yu (PNAS 116(10), 4156–4165, doi 10.1073/pnas.1804597116). Both the docstring and the inline BibTeX (used byresult.cite()) now matchpaper.bibbyte-for-byte. Verification path:paper.bib:99← Crossref / doi.org / Google Scholar all confirmKünzel, Sekhon, Bickel, Yu.Kennedy, Edward H(no period) was also out of sync with@kennedy2023towardsinpaper.bib(Edward H.); fixed.
Refactor — PLR variance code: psi → psi_inner / psi_score¶
dml/plr.pypreviously named the inner residual(Y − ĝ − θ̂(D − m̂))aspsi, even though the Neyman-orthogonal score is the product withd_resid. The misnomer made the variance linenp.mean((d_resid * psi)**2)look wrong on a cursory read. Renamed topsi_inner(the residual) andpsi_score(the actual scorepsi_inner * d_resid); math is unchanged. PLIV/IIVM already used the consistentpsi-as-score convention.
[1.11.4] — 2026-04-30¶
Fixed — sp.dml accepts string learner aliases¶
sp.dml(..., ml_g='rf', ml_m='rf')previously crashed withTypeError: Cannot clone object 'rf' (type str): it does not seem to be a scikit-learn estimator …once cross-fitting reachedsklearn.base.clone. The error surfaced in all four DML variants (PLR / IRM / PLIV / IIVM), not just PLR.- New
dml/_learners.pyresolves user-supplied strings into appropriately configured scikit-learn estimators:'rf'/'gbm'/'lasso'/'ridge'/'linear'/'ols'/'logistic'/'xgb'/'lgbm'(case-insensitive, with common synonyms). Classifier variants are selected automatically for the propensity (ml_mundermodel='irm') and instrument (ml_rundermodel='iivm') roles. - Estimator objects (anything exposing
.fit+.get_params) pass through unchanged. Unknown aliases / wrong types now raise an immediate, descriptiveValueError/TypeErrorat construction time rather than the cryptic clone error mid-cross-fit. - Optional dependencies (
xgboost,lightgbm) are imported lazily — not installed → cleanImportErrorwith install hint.
[1.11.3] — 2026-04-30¶
Fixed — output layer graceful degradation restored¶
to_excel/to_word: revert optional-dependency handling fromraise ImportErrorback towarnings.warn() + return, restoring graceful degradation whenopenpyxl/python-docxare absent. (Regression introduced in v1.11.2.)
[1.11.2] — 2026-04-29¶
Internal refactor only — collapses esttab, modelsummary, and
outreg2 to thin facades over the shared regtable engine. No API
changes, no estimator numerics changed.
Changed — output layer facades collapsed into regtable¶
outreg2→ thinregtablefacade; all formatting delegated toregtable.py(sharedFormatOptions/ star formatter / numeric formatter). Oldoutreg2.pyretained as import shim for backward compatibility.modelsummary→ thinregtablefacade; the summary layout logic is nowregtable.FormatOptionsdriven. Oldmodelsummary.pykept as import shim.esttab/EstimateTable→ thinregtablefacade; identical dispatch path asoutreg2andmodelsummary.- The
regtablesnapshot baselines added inc608528ensure any future drift is caught by the test suite.
Tests¶
test_regtable.py(new): 12 snapshot cases covering everyfmtvariant, star placement, confidence-interval style, andreorder/drop/keeppath.
[1.11.1] — 2026-04-29¶
Polish patch for the v1.11 agent surface. Closes the four "留意 / 没做" items from the v1.11 release notes: mcp_server.py 1,475-line bloat, from_stata Tier 3, deeper from_r, and the MCP sampling abstraction layer. No estimator numerics changed.
Added — from_stata Tier 3 (long-tail, ~95% coverage)¶
ppmlhdfe→sp.ppmlhdfe(Correia-Guimarães-Zylkin Poisson-PML with HDFE, multi-FE absorb).mlogit/oprobit→sp.glm(family='multinomial' / 'ordered_probit')with a translation note pointing strict diagnostics atresult.raw_model.xtabond/xtdpdsys→sp.xtabond/sp.xtdpdsys(Arellano- Bond difference / Blundell-Bond system GMM).bunching→sp.bunching(Saez 2010, Kleven-Waseem 2013).boottest→sp.wild_cluster_bootstrap(Roodman-Webb wild-cluster bootstrap; takes a fitted result).mi estimate: <inner>→ translation hint pointing atsp.mi_estimate(Stata's nested grammar isn't auto-parsed).- 33 alias entries / 29 distinct handlers total.
Added — from_r deepening (5 → 11 callables)¶
glmwith smart routing:family=binomial→sp.logit,family=binomial(link="probit")→sp.probit,family=poisson→sp.poisson, otherwisesp.glm.lmer→sp.multilevel;glmer→sp.glmer.plm(formula, data=df, model='within', index=c('id','t'))→sp.panel(method='within', id='id', time='t').matchit(treat ~ x, data=df, method='nearest')→sp.matchwith method-name aliasing (nearest→nn, genetic→genmatch).- R Synth synth() now emits a structured field-mapping note (predictors → predictors, dependent → outcome, unit.variable → unit, time.variable → time, treatment.identifier → treated_unit, time.predictors.prior[max] + 1 → treatment_time).
Added — MCP sampling/createMessage abstraction (opt-in)¶
- New
agent/_sampling.py: request_sampling(messages, max_tokens, ...)— server-to-client LLM request; blocks until response or timeout.set_capability(bool)/get_capability()— client capability advertisement flag.set_writer(callable)/route_response(message)— stdio writer registration + reply matcher.- Wire-up:
_handle_initializereadsparams.capabilities.sampling.handle_requestroutes JSON-RPC replies to pending sampling requests viaroute_response.serve_stdioregisters / clears the writer + capability flag.- Fail-closed:
UnsupportedSamplingErrorraised when no capability is advertised OR no writer is registered, so existing LLM helpers (llm_dag_propose/llm_evalue/llm_sensitivity) keep working via their user-API-key paths until clients (Claude Desktop, Cursor, …) advertisesampling. STATSPAI_MCP_SAMPLING_TIMEOUT_SECONDS(default60) caps every request;SamplingTimeoutErroron overage.
Changed — mcp_server.py split into leaf modules¶
- v1.11.0:
mcp_server.pywas 1,475 lines. - v1.11.1: split into 5 leaf modules:
_errors.py(35 LOC) —RpcError/InvalidParamsError/ResourceNotFoundErrortyped taxonomy._prompts.py(344 LOC) — 10 prompt templates plusSafeDict,handle_prompts_list,handle_prompts_get._resources.py(313 LOC) — catalog text / function detail / handle reads / templates list. Handlers acceptjson_defaultplus error classes via dependency injection (no circular import)._data_loader.py(176 LOC) —load_dataframe/ size cap / LRU cache / remote-URL routing._sampling.py(227 LOC) — see above.mcp_server.pyshrunk to 817 lines (well under the CLAUDE.md §4 ~800-line guideline).- All v1.x private names re-exported via thin import shims so
external code reaching for
_PROMPTS/_load_dataframe/_RpcErroretc. still works. - Test fixture compatibility preserved:
monkeypatch.setattr( agent.tools, '_resolve_fn', …)continues to take effect because the dispatch path looks up_resolve_fnvia the parent package namespace at call time (carried over from the v1.11.0 split).
Tests¶
test_mcp_sampling.py(10 cases, new):- Fail-closed when capability or writer unset.
- Round-trip via mock client thread (success + error envelope).
- Timeout via env-var override.
- serve_stdio integration (capability + writer lifecycle).
- Unsolicited / malformed reply handling.
test_translation.pyextended:- 8 new Stata Tier-3 round-trips + 3 edge cases.
- 10 new R round-trips.
- 61 → 82 cases; coverage assertion checks every distinct handler.
424/424 pass across all agent + MCP + translation + runner + sampling suites.
[1.11.0] — 2026-04-29¶
Agent-native infrastructure follow-up to v1.10. Closes the four follow-up items the v1.10 release notes flagged: tools.py subpackage split, Stata/R command translators, concurrent runner with progress notifications, and tool-call timeouts. No estimator numerics changed; this is the agent-orchestration layer.
Added — from_stata / from_r translators¶
- New
agent/_translation/subpackage exposesfrom_stata(line) → {ok, tool, arguments, python_code, notes, source, input}andfrom_r(line)with the same shape. - 21 distinct Stata handlers / 25 alias entries covering ~85% of real econ workflows:
- Tier 1 (~60% coverage):
regress/reg,xtreg,reghdfe,ivreg2/ivregress,csdid,did_imputation,synth,rdrobust. - Tier 2 (push to ~85%):
probit/logit/poisson/nbreg(shared GLM scaffold),tobit,heckman,rdplot,rddensity,teffects(ipw / nnmatch / psmatch / ra / aipw),margins/marginsplot,contrast,test,xtset/tsset(no-op note). - 5 R handlers:
feols,felm,lm,att_gt/did. fixest'sy ~ x | id^year | (d ~ z) | clusterpipe-form decomposition preserved. - Returns close-match
suggestionsfor unrecognised commands — never silently guesses. - Surfaces
notesfor partial mappings (e.g. Stataifclause →df.query(...)instructions). - Tests:
tests/test_translation.py— 61 cases, every distinct Stata handler covered by ≥1 round-trip.
Added — concurrent runner + progress notifications + timeouts¶
- New
agent/_runner.py: run_with_progress(work, progress_token, timeout, drain)— zero-argwork()runs in a worker thread; main loop drains a thread-safe queue.progress(value, total, message)— tool-side helper. No-op when no channel is registered (safe for in-process tests / directexecute_toolcallers).tool_timeout()— readsSTATSPAI_MCP_TOOL_TIMEOUT_SECONDS(default600;0disables). Hard wall-clock cap;TimeoutErrorsurfaces as-32000with the env-var name embedded._handle_tools_callruns every dispatch through the runner; readsparams._meta.progressTokenper MCP 2024-11-05._make_progress_drainwritesnotifications/progressJSON-RPC messages to the active stdio sink mid-call.serve_stdioregisters / unregisters the stdout sink so in-process tests don't accidentally write to a closed handle.- Threading rather than asyncio for cross-platform reliability —
decision documented in
_runner.py.
Changed — tools.py split into agent/tools/ subpackage¶
- Pre-1.11:
agent/tools.pywas 1,024 lines. - v1.11 layout:
agent/tools/
├── __init__.py # public API + legacy private re-exports
├── _helpers.py # _scalar_or_none / _default_serializer / _identification_serializer
├── _dispatch.py # tool_manifest / execute_tool / _resolve_fn
└── _specs/ # TOOL_REGISTRY split by family
├── _regression.py / _did.py / _iv.py / _rd.py
└── _matching.py / _diag.py / _orchestrate.py
- Public API unchanged (
from statspai.agent.tools import tool_manifest, execute_tool, TOOL_REGISTRY). - Legacy private imports preserved (
from .tools import _default_serializercontinues to resolve). - Test-fixture hook preserved:
monkeypatch.setattr(agent.tools, '_resolve_fn', …)works becauseexecute_toollooks up_resolve_fnvia the parent package namespace at call time. causal/recommendbespoke serializers promoted from inline lambdas to module-level functions (readable in stack traces).
MCP wire-up¶
from_stata/from_rregistered as workflow tools inWORKFLOW_TOOL_NAMES, dataless override list, and surface intools/list.- 393/393 pass across
test_mcp_protocol.py/test_mcp_error_envelope.py/test_mcp_result_handle.py/test_mcp_enrichment.py/test_mcp_image_content.py/test_mcp_pipelines.py/test_mcp_prompts_expanded.py/test_mcp_runner.py(new) /test_translation.py(new) plus the existing agent + registry + help + exceptions suites.
Known follow-ups (not in 1.11)¶
mcp_server.pyis now 1,475 lines — the 10 prompt-template dictionaries account for most of the bloat. Splitting them into_prompts.pyis a half-day mechanical follow-up.- MCP
sampling/createMessageserver-initiated LLM requests (forllm_dag_proposeetc. to reuse the client's auth) deferred pending Claude Desktop / Cursor capability advertisement.
[1.10.0] — 2026-04-29¶
Agent-native / MCP layer overhaul. Closes the chained-workflow gap that
the v1.9 stateless tools couldn't span and turns the sp.agent /
statspai.agent.mcp_server surface into a proper experimentation
workbench. No estimator numerics changed; this is purely the
discovery / orchestration / output layer.
Added¶
- Result handles (
as_handle=True) — everyexecute_tool/tools/callinvocation can now return aresult_id/result_uri(statspai://result/<id>) pointing to the fitted object. Backed by an in-process LRU cache (agent/_result_cache.py, default 32 entries, env overrideSTATSPAI_MCP_RESULT_CACHE_SIZE). Resource read returns the agent-detailto_dictpayload + a provenance block (originating tool + arguments + class name). - Handle-based workflow tools —
audit_result,brief_result,sensitivity_from_result,honest_did_from_resultaccept aresult_idinstead of forcing the LLM to ferry betas/sigma arrays back across turns.honest_did_from_resultauto-extractsbetas/sigma/num_pre_periods/num_post_periodsfrom the cached result (CallawaySantanna, EventStudy, BJS, SA shapes supported via best-effort attribute walk). - First-class workflow primitives —
audit,preflight,detect_design,briefare now hand-curated MCP tools (previously surfaced only via the auto-generated manifest with one-line descriptions). Schemas describe expected columns explicitly. bibtextool — pulls verified BibTeX entries frompaper.bib(single source of truth per CLAUDE.md §10). Unknown keys return empty bodies + close-match suggestions, never fabricated entries. Closes the citation-hallucination loophole at the source.- Composite pipelines —
pipeline_did/pipeline_iv/pipeline_rdrun preflight + estimator + audit + sensitivity + brief in one call, return a markdown narrative + cachedresult_id+ per-stage status.pipeline_rdattaches anrdplotPNG as MCP image content. - Image content blocks —
_handle_tools_callpromotes any_plot_pngbytes returned by a tool to a second{type: "image", mimeType: "image/png"}content block (Claude vision and any MCP image-capable client renders it inline). Newplot_from_resulttool renders the canonical diagnostic plot for a cached result (event-study / rdplot / synth-gap / love-plot / cate-plot / coef-plot — auto-detected by class name). - Output enrichment — every tool return now carries:
next_calls— pre-builttools/callpayloads withresult_idand forwarded base args; agents copy-paste verbatim.citations— verified bib keys (static map; empty list ⇒ intentionally absent, never invent) + BibTeX bodies pulled frompaper.bib.narrative— short markdown digest (method + estimate + CI + N + violations).- Expanded prompt templates —
prompts/listjumps from 3 to 10:audit_did_result(rewired topipeline_did),audit_iv_result,audit_rd_result,design_then_estimate,robustness_followup,paper_render,compare_methods,policy_evaluation,synth_full,decompose_inequality. - Schema injections — every MCP tool now exposes:
data_path(URL-aware:s3://,gs://,https://, plus.dta/.feather/.arrow/.jsonlin addition to the legacy.csv/.parquet/.xlsx/.json).data_columns— column projection for parquet / feather / stata fast partial reads.data_sample_n— deterministic uniform random subsample (seed=0) for fast iteration on large panels.result_id— handle reference for chained calls.as_handle— opt-in result caching.initializereturns a session-levelinstructionsblock describing the recommended workflow (detect_design → preflight → fitas_handle=true→audit_result→*_from_result→bibtex).statspai://result/{id}URI template advertised viaresources/templates/list.
Changed¶
_DATALESS_TOOLSis now registry-derived. A new_dataless_tool_names()helper walks the registry and marks any spec without a requireddataparameter as dataless; the hand-curated_DATALESS_OVERRIDESset covers stub-backed tools the registry can't reach (workflow / handle / bibtex / plot tools). The legacy_DATALESS_TOOLSconstant stays as a backward-compat alias.auto_tool_manifest(max_tools=...)default bumped 250 → 500 and emits aRuntimeWarningwhen more eligible tools exist than the cap admits — silent truncation was hiding registry growth.tool_manifest()no longer silently swallows auto-merge failures. ARuntimeWarningfires before the curated-only fallback so operators / CI log scrapers can detect registry introspection regressions._load_dataframeis LRU-cached by(path, mtime, columns)so repeatedtools/callinvocations on the same file are O(1) after the first load. New 2 GiB default file-size cap (env overrideSTATSPAI_MCP_MAX_DATA_BYTES; set to0to disable).- Bad/missing
data_pathnow surfaces as JSON-RPC-32602(invalid params) rather than the generic-32000. Clients branching on error codes get a cleaner signal. - Traceback exposure on
-32000errors gated bySTATSPAI_MCP_DEBUG=1. Production deployments no longer leak internal paths / class names through the JSON-RPC error envelope by default. _json_defaultcovers every type we've actually seen leak through the agent / MCP wire:np.bool_,np.complexfloating,np.datetime64,np.timedelta64, NaN / Inf →null,pd.Index/Timedelta/Categorical/Interval,set/frozenset,bytes(b64-wrapped),Decimal,pathlib.PurePath,Enum, dataclasses.oaxaca-style estimators with their owndetailparameter shadowed via MCP. The schema'sdetailenum is server-side control (forwarded toresult.to_dict(detail=...)); collisions are resolved by force-overwriting the registry's version. Affected estimators remain reachable via the direct Python API.
Tests¶
- New:
test_mcp_result_handle.py(31 cases) — result-cache LRU, resource read, handle-based workflows,_json_defaulttypes,STATSPAI_MCP_DEBUGgating. - New:
test_mcp_enrichment.py(14 cases) —next_calls/citations/narrativeshape;bibtexround-trip. - New:
test_mcp_image_content.py(4 cases) — PNG promotion to MCP image content block. - New:
test_mcp_pipelines.py(7 cases) — pipeline_did / iv / rd. - New:
test_mcp_prompts_expanded.py(5 cases) — full 10-prompt template surface. - Existing
test_mcp_protocol.pyupdated to use the registry-derived_dataless_tool_names()helper rather than the static override set.
321/321 pass across all tests/test_*agent*.py + test_mcp_*.py +
test_registry.py + test_help.py + test_exceptions.py.
New modules¶
src/statspai/agent/_result_cache.py— bounded LRU cache + entry metadata.src/statspai/agent/auto_dispatch.py— registry-driven dispatch for non-curated tools (filters kwargs againstParamSpec).src/statspai/agent/workflow_tools.py— handle-based + workflow primitive tools (audit_result / brief_result / *_from_result / audit / preflight / detect_design / brief / plot_from_result / bibtex).src/statspai/agent/pipeline_tools.py— pipeline_did / pipeline_iv / pipeline_rd composites.src/statspai/agent/_enrichment.py—next_calls+citations+narrativebuilder.
[Unreleased]¶
Changed — output module PR-B (continuation of v1.11.x cleanup)¶
esttab/EstimateTableResultare now thin facades overregtable.output/estimates.pypreviously housed a ~500-lineEstimateTableclass that re-implemented the full renderer pipeline (text / LaTeX / HTML / Markdown / CSV / DataFrame). PR-B/5c collapses it; theesttab()function now translates Stata-flavoured kwargs and forwards tosp.regtable, andEstimateTableResultbecomes a thin pass-through wrapper around the resultingRegtableResultthat preserves the legacy type identity.- Net code:
output/estimates.py987 → 526 lines (-47%). Helpers used byregression_table/mean_comparison/_inline(_ModelData,_extract_model_data,_ci_bounds,_format_starsre-exports,_latex_escape/_html_escape,_STAT_ALIASES/_STAT_DISPLAY,eststo/estclearglobal store) are kept verbatim. EstimateTableResult.to_csv()is implemented viato_dataframe().to_csv()(regtable does not natively expose CSV; the dataframe path is byte-identical to what the legacy esttab produced).- The four exclusive-output flags
se/t/p/cimap to regtable'sse_type=with priorityci > p > t > se(matches legacy behaviour). -
First call emits
DeprecationWarningpointing tosp.regtable. -
modelsummaryis now a thin facade overregtable. The R-stylemodelsummary()previously shipped a ~700-line renderer pipeline (_build_coef_rows/_to_text/_to_latex/_to_html/_to_excel/_to_word) that re-implemented coefficient extraction, star formatting, three-line table styling, and every export format — duplicating code already maintained bysp.regtable. - Net code:
output/modelsummary.py845 → 378 lines (-55%; remainder is module docstring +coefplotkept verbatim +_extract_coefsforcoefplot). - Rendered output now matches
regtableexactly. The dict form ofstars=is reinterpreted (only threshold values used; symbol overrides dropped — useregtable(notation='symbols')for†/‡/§).se_type='brackets'is no longer a separate render mode (emitsUserWarningand falls back to parens; useshow_ci=Truefor[lo, hi]).se_type='none'likewise keeps the SE row. - First call emits
DeprecationWarningpointing tosp.regtable. -
coefplotis unchanged (independent of the table renderer). -
outreg2is now a thin facade overregtable. The Stata-styleOutReg2class andoutreg2()function previously shipped a bespoke 800-line renderer that re-implemented coefficient extraction, star formatting, three-line table styling and Excel/Word/LaTeX export. Collapsed to ~150 lines that translate Stata-flavoured kwargs and forward tosp.regtable— single point of fix for rendering bugs going forward. - Net code:
outreg2.py804 → 341 lines (-58%). - Rendered output now matches
regtableexactly. Visible label changes:Variablescolumn header → blank (book-tab),R-squared/Adj. R-squared/Observations/F-statistic / Trees→R²/Adj. R²/N/F. LaTeX gains a proper star legend. Bug fixes: spurious& None & NoneLaTeX cell removed; the nonsensical/ Treeslabel that appeared on OLS results is gone. show_se=Falseis no longer supported (regression tables without uncertainty are pseudo-science) — emitsUserWarningand keeps the SE row.- First call emits
DeprecationWarningpointing tosp.regtable(...).to_excel(...). Plan to remove the facade in two minor releases. - See
MIGRATION.mdfor the side-by-side rewrite.
Added — output module PR-B foundation (B-1)¶
tests/test_regtable_snapshots.pysnapshot harness. Locks down the byte-stable rendered output ofsp.regtablefor five representative fixtures (simple OLS / multi-model / custom stats / notes+labels / GLM-logit) across four text formats (text / HTML / LaTeX / Markdown) — 20 snapshots total. Whitespace-normalised so diffs survive editor newline handling but catch real renderer drift. Excel / Word are not snapshotted (binary archives are brittle); coverage there is viatest_paper_tables_export.py. Update withSTATSPAI_UPDATE_SNAPSHOTS=1 pytest tests/test_regtable_snapshots.py.
Added — agent / dispatcher work (other sessions)¶
sp.panel()method=expanded with friendly aliases + HDFE.sp.panelalready supported amethod=table of 10 classical-
dynamic estimators (
fe/re/be/fd/pooled/twoway/mundlak/chamberlain/ab/system). The table is now case-insensitive and accepts intuitive aliases that match what users already write (instead of forcing the two-letter Stata shorthand): -
fe←fixed/fixed_effects/within re←random/random_effectsbe←between/between_effectsfd←first_difference/first_diffpooled←pooled_ols/pols/olstwoway←two_way/two_way_fe/2wayab←arellano_bond/gmm/diff_gmmsystem←blundell_bond/bb/system_gmm
Plus a new method='hdfe' (a.k.a. feols / reghdfe /
absorbed_ols) route that delegates to feols.hdfe_ols for
high-dimensional fixed-effects absorption. When the formula has
no | separator, the dispatcher bolts the entity and
time columns on automatically, so
is equivalent to
This closes the Stata reghdfe / R fixest::feols slot in
the sp.panel namespace without forcing users to switch APIs.
sp.panel_logit / sp.panel_probit / sp.interactive_fe /
sp.panel_unitroot are intentionally NOT in the method=
table — they have a different (data, y, x, id, time)-style
signature and remain accessible as standalone functions.
Regression-guarded by tests/test_panel_dispatcher.py (37 new
tests); 31 existing panel-family tests pass.
-
sp.match()method=expanded to cover the full matching toolkit.sp.matchwas already a function with built-inmethod=for classical algorithms (nearest / stratify / cem / psm / mahalanobis); the table now reaches every matching/weighting estimator instatspai.matchingfrom a single entry point: -
Classical:
nearest(default),stratify/subclass/subclassification,cem/coarsened_exact,psm,mahalanobis. - Weighting:
ebalance/entropy/entropy_balancing(Hainmueller 2012),cbps(Imai-Ratkovic 2014),sbw/stable_balancing(Zubizarreta 2015),overlap/ow/overlap_weights(LMZ 2018). - Genetic:
genmatch/genetic(Diamond-Sekhon 2013). - Optimization-based:
optimal/optimal_match(Rosenbaum 1989),cardinality/cardinality_match(Zubizarreta 2014).
The dispatcher translates treat ↔ treatment and y
↔ outcome for the few estimators that internally use the
alternate names (optimal_match, cardinality_match).
Standalone access (sp.ebalance, sp.cbps, sp.genmatch,
sp.sbw, sp.optimal_match, sp.cardinality_match,
sp.overlap_weights) is unchanged.
The dispatcher refuses to silently swallow nonsense: passing a
classical-matching kwarg (caliper= / replace= /
n_matches= / bias_correction= / ps_poly= /
n_strata= / n_bins=) with method='ebalance' etc.
raises TypeError: does not accept these classical-matching
kwargs.
Regression-guarded by tests/test_match_dispatcher.py (31
new tests); 76 existing matching-family tests still pass.
-
sp.rd()is now callable with a unifiedmethod=table. Same PEP 562 callable-module pattern used forsp.ivin this release: thestatspai.rdsubpackage itself dispatches calls, whilesp.rd.rdrobust/sp.rd.rdplot/sp.rd.rdsummaryand all 35+ existing names continue to resolve. The defaultsp.rd(data, y, x, c)call equalssp.rd.rdrobust(data, y, x, c)(CCT 2014 local polynomial). 18 canonicalmethod=aliases route to: -
Local polynomial:
rdrobust/default/rd/robust/local_poly(CCT 2014). - Honest CIs:
honest/armstrong_kolesar/ak. - Local randomisation:
randinf/random/local_randomization. - Heterogeneous effects:
hte/cate. - ML+RD:
forest/causal_forest,boost/gbm,lasso. - Bayesian HTE:
bayes_hte/bayes. - 2D / boundary RD:
rd2d/2d/boundary. - Multi-cutoff:
rdmc/multi_cutoff. - Multi-score / geographic:
rdms/geographic/multi_score. - Kink (RKD):
rkd/kink. - RD-in-time:
rdit/time. - Extrapolation:
extrapolate,multi_extrapolate. - Spillover/interference:
interference/spillover. - Distributional:
distribution,distributional_design. - External validity:
external_validity.
The dispatcher normalises x ↔ running and c ↔
cutoff for methods that use the alternate names internally
(rd_bayes_hte, rd_interference, rd_distribution,
rd_distributional_design).
Diagnostics-only functions (rdbwselect, rdbwsensitivity,
rdbalance, rdplacebo, rdsummary, rdplotdensity,
rdpower, rdsampsi, rdwinselect, rdsensitivity,
rdrbounds) are intentionally NOT in the method= table —
they are not estimators of treatment effects.
Regression-guarded by tests/test_rd_dispatcher.py (22 new
tests); 78 existing rd-family tests still pass.
Performance¶
import statspaicold-start: ~2,070 ms → ~1,680 ms (-19%).output/outreg2.pynow importsopenpyxllazily inside_export_with_formattinginstead of at module load. Top-levelimport openpyxlwas transitively pullingPIL/Pillowviaopenpyxl.drawing.imageon every session even when the user never touchedoutreg2(4 references in repo vs 163 forregtable). After the fix, no heavy modules (openpyxl/docx/PIL/matplotlib) are eagerly loaded at top level. Also drops unused symbol imports (Border/Side/PatternFill/dataframe_to_rows/write_title).
Changed¶
-
Output module: shared formatter helpers. New
output/_format.pyhouses the canonicalformat_stars/fmt_val/fmt_int/fmt_auto/is_missingimplementations.estimates._format_stars/_fmt_val/_fmt_int/_fmt_autoare now thin re-exports under their legacy underscore names; existingregression_table/_inlineimports are unchanged.outreg2._format_number/_format_pvalueandmodelsummary._format_numdelegate to the canonical helpers. Net effect: ~80 lines of duplicate formatters removed; bug fixes in one place propagate to every backend. -
Output module:
modelsummary._stars_strdead code removed. The original implementation had twoforloops where the first usedfor/elsethat always overwrotebestto''before the second loop ran — making the first loop unreachable. Cleaned to keep only the working logic. Behavior identical for all valid inputs. -
Output module:
MeanComparisonResultextracted.output/regression_table.pyhad grown to 3,335 lines and held two unrelated result classes. MovedMeanComparisonResultand the publicmean_comparison()API tooutput/mean_comparison.py(510 lines).regression_table.pyis now 2,831 lines (-15%). Re-exported fromregression_tablefor back-compat;from statspai.output.regression_table import MeanComparisonResultstill works.sp.list_functions()count unchanged. -
Output module:
__init__.pyreorganised. Imports and__all__are grouped by purpose (regression-table renderers / single-table helpers / multi-table bundles / provenance / bibliography / adapters), with a docstring documenting thatregtableis the canonical regression-table renderer and thatesttab/modelsummary/outreg2are Stata/R compatibility surfaces (full consolidation tracked indocs/rfc/output_pr_b_consolidation.md). No symbols added or removed;sp.list_functions()unchanged at 973. -
Top-level
__init__.pydeduplication. Removed redundantfrom .regression.glm/logit_probit/countimports that re-bound the same names twice (lines 245-247 vs 495-501). No public name was added or removed;sp.glm,sp.logit,sp.poissonetc. resolve identically to the earlier (canonical) binding. Same number of registered functions (973). Net: −5 LOC, −5 redundant import statements, identical behaviour.
Fixed¶
sp.iv()is now callable. Prior to this release, thestatspai.ivsubpackage shadowed the function exposed at line 45 ofstatspai/__init__.py(because Python attaches an imported subpackage to its parent's namespace, and the subpackage load happened after the function bind). The result was that every advertised callsite — registry examples, agent summaries, MCP server docs, replication examples, and the live call insrc/statspai/question/question.py:505— raisedTypeError: 'module' object is not callable. Fixed by installing a tinyModuleTypesubclass with__call__onstatspai.iv(PEP 562-style) and removingivfrom theregression.ivimport line so the subpackage isn't shadowed in reverse. Regression-guarded bytests/test_iv_dispatcher.py::test_sp_iv_is_callable(33 new tests total).
Changed¶
-
Unified IV dispatcher.
sp.iv(formula, data, method=...)now routes 25+ method aliases (case- and dash-insensitive) to 19 canonical estimators across theregression.iv/regression.advanced_iv/iv//deepiv/bartikmodules: -
K-class formula path:
2sls(a.k.a.tsls,iv),liml,fuller,gmm,jive. - Modern JIVE:
jive1,ujive,ijive,rjive. - Many-weak:
jive_mw,many_weak_ar. - Lasso:
lasso,post_lasso(a.k.a.bch). - ML/nonparametric:
kernel,npiv,ivdml,deepiv. - Bayesian:
bayes. - LATE/MTE:
continuous_late,mte,ivmte_bounds. - Quantile IV:
ivqreg. - Plausibly exogenous sensitivity:
plausibly_exog_uci,plausibly_exog_ltz. - Shift-share:
shift_share(a.k.a.bartik).
The dispatcher normalises common alias names (endog →
treat/treatment for kernel-style methods, exog →
covariates for ivdml, singleton instruments=['z'] →
instrument='z' for singular-instrument methods), and refuses
ambiguous combinations with TypeError: Got both 'endog' and
'treat'. Standalone access (sp.iv.kernel_iv,
sp.iv.bayesian_iv, sp.ivreg,
from statspai.regression.iv import iv) is unchanged.
sp.iv.fit(...) remains as an explicit alias for the dispatcher.
Diagnostics functions (anderson_rubin_test, effective_f_test,
kleibergen_paap_rk, sanderson_windmeijer,
conditional_lr_test) are intentionally not in the
method= table — they are not estimators.
[1.9.1] — MCP schema + JSON-RPC error polish¶
Patch release on top of 1.9.0. No estimator numerical paths changed. Two MCP-server fixes surfaced by strict-schema clients (Claude Desktop / Cursor) plus one docs typo.
Fixed¶
-
MCP
tools/listschema — dataless tools no longer requiredata_path. Tools whose underlying StatsPAI function does not consume a DataFrame (currentlyhonest_didandsensitivity) used to be advertised withdata_pathinrequired. Strict- schema MCP clients refused to dispatch the call without a CSV path the estimator never reads.data_pathis still exposed as an optional property for clients that always send it; only therequiredlist is conditional now. New_DATALESS_TOOLS = {"honest_did", "sensitivity"}is the single source of truth insrc/statspai/agent/mcp_server.py— keep in sync withTOOL_REGISTRYinagent/tools.py. -
MCP
tools/calltyped error — missingnamereturns -32602. Previously atools/callrequest without anamefield raised a genericValueError, which the dispatcher surfaced as-32000(server fallback). 1.9.0 already promised typed JSON-RPC errors for invalid params (-32602); this fixes the one path that escaped the audit. Regression-guarded bytest_tools_call_missing_name_returns_invalid_params.
Docs¶
- MIGRATION.md — fixed a typo in the 1.9.0
CausalResult.to_dictbyte-identity note: the no-kwargs default is identical toto_dict(detail="standard"), notcite(detail="standard").
[1.9.0] — Agent-native API surface: 12 modules across 4 phases¶
The 1.9.0 line ships StatsPAI's first deliberately agent-shaped API
surface — 12 new top-level entry points designed for Claude Code /
Cursor / Copilot CLI workflows where the LLM, not a human, is doing
the calling. No estimator numerical paths changed; all
additions are new functions or strictly additive parameters with
"agent" as the default so existing behaviour is byte-identical.
Added — Agent serialization & error envelope (Phase 1)¶
-
CausalResult.to_dict(detail=...)andEconometricResults.to_dict(detail=...)— unified payload control with three documented levels: -
"minimal"(~150 tokens) — bare answer; no diagnostics. "standard"(~250 tokens) — current default; coefficients + scalar diagnostics +detail_headrows. Byte-identical to legacyto_dict()."agent"(~620 tokens) — addsviolations/warnings/next_steps/suggested_functionsso an LLM can plan its next call without another round-trip.
for_agent() is now a thin alias for to_dict(detail="agent");
to_agent_summary() is unchanged but its docstring now points
at to_dict(detail="agent") as the canonical flat form.
execute_toolMCP error envelope — when an estimator raises a structuredStatsPAIErrorsubclass, the MCPtools/callresponse now surfaceserror_kind(e.g."method_incompatibility") plus the fullerror_payloaddict (code/recovery_hint/diagnostics/alternative_functions). Legacyerror/remediationfields preserved.
Added — MCP server polish (Phase 1)¶
statspai-mcpconsole script wired inpyproject.tomlsopip install statspaiexposes it on PATH.statspai://function/{name}per-function resources surfacing the registry's full agent-card (description, signature, assumptions, failure_modes, alternatives, typical_n_min, example). Listed via the newresources/templates/listhandler.statspai://functionsmachine-readable JSON index for one-shot tool discovery.- Typed JSON-RPC errors mapped to canonical MCP codes:
-32002(resource not found),-32602(invalid params),-32000(server fallback). Replaces the previous blanket-32000. notifications/*silenced — Claude Desktop / Cursor sendnotifications/initializedafter the handshake; the server now drops any method whose name starts withnotifications/per the MCP spec, instead of replying with-32601noise on every session.- MCP-level
detailparameter ontools/call— agents pickdetail="minimal" | "standard" | "agent"per call to control token cost. Validation rejects invalid values with-32602.
Added — Workflow primitives (Phases 2-4)¶
-
sp.audit(result)— missing-evidence checklist (the read-only counterpart tosp.assumption_audit): inspects what robustness / sensitivity diagnostics are stored on a fitted result and surfaces which method-family checks are still missing. Returns{checks: [{name, question, status, severity, importance, suggest_function, ...}], summary, coverage}with 18 curated checks across DID/RD/IV/synth/matching/OLS. -
sp.detect_design(data, **hints)— heuristic design identifier: returns{design, confidence, identified, candidates, n_obs, columns}withdesign ∈ {"panel", "rd", "cross_section"}. Symmetric(unit, time)pair dedup; RD confidence capped at 0.30 without explicit hint to avoid noise-data false positives. -
sp.preflight(data, method, **kwargs)— method-specific pre-estimation diagnostics distinct fromsp.check_identification(design-level) andsp.assumption_audit(re-runs tests). Cheap shape / column / treatment-binarity / sample-size checks per method family; returns{verdict: "PASS" | "WARN" | "FAIL", checks, summary, known_method}. -
CausalResult.cite(format=...)andsp.bib_for(result)— multi-format citations:"bibtex"(default, byte-identical to legacycite()),"apa"(parsed prose),"json"(structured{type, key, authors, year, title, journal, volume, number, pages, publisher, fields}). LaTeX-diacritic normalisation ({\\"o}→ö); multi-entry BibTeX strings (e.g.twfe_decompositioncites both Goodman-Bacon 2021 AND de Chaisemartin & D'Haultfœuille 2020) round-trip both authors — zero hallucination per CLAUDE.md §10. -
sp.examples(name)— runnable code snippets for any registered function; 10 hand-curated flagship snippets, falls back toregistry.examplefor the rest. -
sp.session(seed=42)— deterministic-RNG context manager snapshotting Pythonrandomand NumPy's legacy global MT19937 generator; restores prior state on exit even when an exception is raised inside the block. Lazy torch / jax interop — never auto-imports. Documented escape hatch fornp.random.default_rng()(which is not covered — passstate.seedexplicitly). -
result.brief()/sp.brief(result)— one-line dashboard string (~95 chars typical, ≤ 140 hard cap) for multi-result agent loops. -
MCP
prompts/list+prompts/get— three curated workflow prompt templates (audit_did_result/design_then_estimate/robustness_followup) surfaced as prompt buttons in MCP-compliant clients.
Changed¶
-
CausalResult.to_dict/EconometricResults.to_dictnow accept a keyword-onlydetailparameter. Default"standard"preserves the legacy shape exactly.CausalResult'sdetail_headis also keyword-only now (was positional-or- keyword) to close theto_dict("agent")foot-gun. -
CausalResult.cite()now acceptsformat=keyword; zero-arg call still returns BibTeX, byte-identical tocite(format="bibtex").
Tests¶
+422 targeted tests across the agent stack, all passing.
Token-budget assertions pin the size of every detail level so
future changes can't accidentally bloat the LLM tool-result channel.
No numerical changes¶
Existing estimator coefficient / SE / CI / p-value paths are byte- identical to 1.8.0. The 12 new modules are introspection, serialization, prompt-rendering, and RNG-management primitives — they read from existing result state, never recompute it.
[1.8.0] — 2026-04-28¶
Internal-development version covering five sp.regtable rounds, the
Native Rust IRLS for sp.fast.fepois, twelve provenance-rollout
phases, the production-function module, the clubSandwich-equivalent
HTZ Wald, the LLM-DAG closed loop, the synth refactor, the
estimand-first paper appendix, the great_tables / CSL pipeline, and
the export trinity (numerical lineage / replication pack / Quarto).
Subsections below preserve the chronological development order.
sp.regtable Round 4 (event_study_table, vcov= recompute, transpose)¶
Three further additions on top of Rounds 1-3. No numerical
changes to any estimator; the vcov= recompute reuses the
fit-time X + residuals already stored on OLS results.
Added¶
-
sp.event_study_table(result, *, regex=None, label_fmt="t={t}", include_reference=False)— adapter that turns an event-study fit into a regtable input. Two extraction paths: -
CausalResult fast path when
model_info['event_study']holds the canonicalrelative_time/estimate/se/ci_lower/ci_upper/pvalueDataFrame produced by :func:sp.event_study. -
Regex path when raw coefficient names like
"tau_-3","lag_-2","::-1"need to be parsed; the first capture group becomes the relative time. Rows are sorted in event-time order regardless of input ordering. -
vcov=parameter on :func:sp.regtable— recompute SE / t / p / 95% CI at print time without re-fitting. Currently supports OLS-style results that storedata_info['X']anddata_info['residuals']: -
"HC0"— White heteroskedasticity-robust "HC1"/"robust"— Stata'srobust(HC0 × n/(n-k))"HC2"— leverage-weighted"HC3"— leverage-squared (recommended for small samples; Long-Ervin 2000)
Columns whose underlying result lacks the X/residuals fields emit
a UserWarning and retain their fit-time SEs, so a
heterogeneous mix of OLS + non-OLS does not blow up.
transpose=Trueon :func:sp.regtable— rows become models, columns become variables. Single-panel only; multi-panel input ormulti_se=is rejected withNotImplementedErrorto keep the layout pivot semantics tight. Renders in text and HTML.
Tests¶
15 new tests in test_regtable_round4_extensions.py covering all
three features, including HC0/HC1/HC2/HC3 ordering verification
under heteroskedasticity, regex extraction fallback, and pivot
guards on multi-panel / multi_se.
577 targeted tests pass (Rounds 1-4 = 528 + 20 + 14 + 15, plus broad anchors). Zero regression.
2026-04-28 — Native Rust IRLS for sp.fast.fepois + production-function module¶
The headline of v1.8.0 is the 3× wall-clock improvement on the
medium HDFE benchmark: sp.fast.fepois runs at 0.855 s vs the
v1.7.x baseline's 2.61 s, and 1.34× of R fixest::fepois (0.64 s)
on the project's standard medium dataset (n=1M, fe1=100k, fe2=1k).
This closes the long-standing wall-clock gap to fixest to 1.34× —
well under the ≤ 1.5× target set in the v1.8 design spec.
Plus a new structural-estimation module: sp.prod_fn ships four
production-function estimators (Olley-Pakes, Levinsohn-Petrin,
Ackerberg-Caves-Frazer, Wooldridge) + De Loecker-Warzynski markup.
Performance — sp.fast.fepois on medium HDFE benchmark¶
| stage | wall | vs fixest | shipped |
|---|---|---|---|
| v1.7.x baseline (Python np.bincount inside Python IRLS) | 2.61 s | 4.08× | — |
| Phase A (Rust scatter, no cache) | 2.45 s | 3.83× | ✓ |
| Phase B0 (Rust sequential + dispatcher cache) | 1.441 s | 2.25× | ✓ |
| Phase B1 (native Rust IRLS, single PyO3 call) | 0.880 s | 1.37× | ✓ |
| Path A (B1 + Rust separation pre-pass) | 0.855 s | 1.34× | ✓ |
| R fixest::fepois | 0.64 s | 1.00× | — |
The closure was driven by three orthogonal contributions, each
verified with a wall-clock spike before the next was committed
(audited at benchmarks/hdfe/AUDIT.md):
- Phase A primitives:
statspai_hdfe.demean_2d_weightedPyO3 binding, Python_weighted_ap_demeandispatcher with NumPy fallback,weighted_demean_matrix_fortran_inplacecrate-internal Rust API. - Phase B0 algorithmic primitive: sort-by-primary-FE permutation
(
sort_perm::primary_fe_sort_perm) + sequential weighted sweep (weighted_group_sweep_sorted) replaces the L2-cache-miss-bound random-scatter inner loop on G1 = 100k bucket arrays. Plus the module-level FE-only-plan fingerprint cache in the dispatcher (avoids ~1.4 s per fepois of recomputingnp.argsort/searchsorted/ secondary perms across IRLS iters). - Phase B1 native Rust IRLS:
irls.rshostsfepois_loop, the full IRLS state machine (working response, working weight, sort-aware weighted demean, hand-coded SPD Cholesky for the WLS solve, eta clip, step-halving, deviance + convergence).FePoisIRLSWorkspaceholds scratch + Aitken history + sorted indices, allocated once per fepois call and reused across all IRLS iters. Single PyO3 call (fepois_irls) eliminates the 12 round-trips per fepois that Phase B0 still had. - Path A — Rust separation pre-pass: the iterative Poisson-
separation drop (drops rows in FE groups whose total y-sum is zero —
Poisson cannot identify them) was the last meaningful Python-side
O(n log n) overhead inside fepois (np.unique + np.isin per pass).
The Rust port (
separation::separation_mask) replaces it with an O(n × n_iter × K) bincount loop. ~25 ms additional wall reduction on the medium benchmark; closes 1.37× → 1.34× of fixest. Reusable by futurefeglmGLM families.
Numerical correctness — preserved at v1.7.x parity¶
sp.fast.fepoisvspyfixest.fepoiscoef on the medium dataset: unchanged (atol < 1e-13 across IRLS-converged fits).- Cluster-robust SE (
vcov="cr1"): the v1.7.x integration is untouched; commit39c94d0(CR1 recovery from auto-checkpoint) remains the canonical implementation. - The Python NumPy fallback path (when the compiled
statspai_hdfewheel is absent) is bit-for-bit identical to the v1.7.x behavior — verified bytest_fepois_falls_back_when_rust_unavailable.
Added¶
- New
statspai_hdfev0.6.0 PyO3 entry points (Rust crate v0.5.0-alpha.1): demean_2d_weighted— Phase A weighted variant of the K-way AP demean.demean_2d_weighted_sorted— Phase B0 sort-aware variant.fepois_irls— Phase B1 single-call IRLS state machine.-
separation_mask— Path A iterative Poisson separation detector. -
sp.prod_fnunified production-function estimator dispatcher with four named entry points (olley_pakes/opreg,levinsohn_petrin/levpet,ackerberg_caves_frazer/acf,wooldridge_prod) plusmarkup(De Loecker-Warzynski) andProductionResult. Cobb-Douglas default + translog functional form; firm-cluster bootstrap SE; full registry coverage. References: Olley-Pakes (1996), Levinsohn-Petrin (2003), Ackerberg-Caves-Frazer (2015), Wooldridge (2009), De Loecker-Warzynski (2012). 23 dedicated tests. sp.fast.fepoisPython dispatcher with three-tier fallback (native Rust IRLS → sort-aware Rust demean → random-scatter Rust demean → pure NumPy) — no user-facing API change.benchmarks/hdfe/run_fepois_phase_a.py,run_fepois_phase_b0.py,run_fepois_phase_b.py— reproducible wall-clock harnesses with hard merge gates.benchmarks/hdfe/AUDIT.md— Phase A round 1 (gate failure + root-cause), Phase B0 round 1 PASS, Phase B1 round 1 PASS audit trails. The audit pattern (measure-before-commit) is the structural counter-measure that prevented Phase A's "assumption broke" failure from repeating in B0 / B1.
Internal¶
- Rust crate
statspai_hdfebumped 0.2.0-alpha.1 → 0.5.0-alpha.1 across Phase A → Phase B → Path A (4 minor crate version bumps). - Python
__version__instatspai_hdfeextension:0.2.0→0.6.0.
Tests — 192 fast-fepois tests pass (was 187 in v1.7.x) + 23 prod_fn tests¶
- Phase B1 native-vs-Python IRLS parity: coef atol ≤ 1e-10, SE atol ≤ 1e-7
(
test_fepois_native_irls_vs_python_irls_parity). - Path A separation parity: 10 random seeds with synthetic zero-cluster
injection; Rust ↔ NumPy mask agreement element-wise
(
test_separation_rust_matches_python_fallback). - Cluster-SE suite intact (5 tests covering validation / NaN rejection / IID-baseline / closed-form / fixest-parity).
- New 23-test prod_fn suite covering OP / LP / ACF / Wooldridge / markup on synthetic Cobb-Douglas / translog DGPs + edge cases + bootstrap-SE reproducibility.
sp.regtable Round 3 (margins_table, tests= footer, fixef_sizes)¶
Three further additions on top of Round 1 + Round 2. No numerical
changes to any estimator (margins_table is a pure adapter; tests=
formats user-supplied test results; fixef_sizes reads pre-existing
model_info['n_fe_levels']).
Added¶
-
sp.margins_table(model)— adapter that wraps a :func:sp.marginsDataFrame as a duck-typed result with.params/.std_errors/.tvalues/.pvalues/.conf_int_*. Pipes straight into :func:sp.regtable, closing the "estimator → marginal-effects table" gap that previously required users to hand-buildadd_rows. Mirrors the R workflowmodelsummary(avg_slopes(model)). The wrapper z-stat is mapped totvaluesso existingse_type='t'/'p'/'ci'paths render unchanged. -
tests=parameter on :func:sp.regtable— render hypothesis-test rows in the diagnostic strip below the stats block.tests={"Wald F": [(12.34, 0.001), (8.91, 0.003)]}→ "Wald F 12.340 8.910". Each per-model entry can be a(stat, p)tuple, a bare p-value,None, or a pre-formatted string. Stars honour the configurednotationfamily for cross-table consistency. Closes the gap to Stata'sestadd scalar/testintegration where reviewers expect Wald / Sargan / Hansen-J / first-stage F right under the main results. -
fixef_sizes=Trueon :func:sp.regtable— auto-emit "# Firm: 1,234" / "# Year: 30" rows showing distinct levels per fixed effect. Readsmodel_info['n_fe_levels']from each result; currently populated bycount.py(Poisson/NegBin) and the pyfixest adapter. Other estimators silently no-op. Mirrors R fixest'setable(..., fixef_sizes=TRUE).
Tests¶
14 new tests in test_regtable_round3_extensions.py covering
all three features across text / LaTeX / HTML renderers.
562 targeted tests pass (Rounds 1-3 = 528 + 20 + 14, plus broad anchors); zero regression on the 33 output / regression test files exercised.
sp.regtable Round 2 (templates, notation, apply_coef, escape, Word/Excel spanners)¶
Five further additions on top of the Round 1 commit. No numerical changes to any estimator; output-layer only.
Added — Five regtable parameters¶
estimate=/statistic=— flexible cell templates that mirror Rmodelsummary's arguments. Placeholders:{estimate},{stars},{std_error},{t_value},{p_value},{conf_low},{conf_high}. Examples:estimate="{stars}{estimate}"for stars-first.statistic="t={t_value}, p={p_value}"for working-paper cells.-
statistic="[{conf_low}, {conf_high}]"for inline CI without needingse_type="ci"separately. Unknown placeholders raise aKeyErrorat theregtable()call site. -
notation=— alternative significance-marker family."stars"(default) keeps("*", "**", "***");"symbols"swaps to("†", "‡", "§")for AER / JPE contexts where star-shaped markers conflict with footnote symbols; pass a custom 3-tuple for any ladder. The footer "p<0.01, ..." line rebuilds itself to match. -
apply_coef=/apply_coef_deriv=— generaliseeformto any callable.apply_coef=lambda b: 100*bfor a percentage transform;apply_coef=np.logfor log-scale;apply_coef_derivenables delta-method SE rescaling (|f'(b)|·SE). Mutually exclusive witheform— both transform the point estimate, and silently combining them would hide whichever the user listed second. -
escape=False— opt out of auto-escape so users can pass raw LaTeX (e.g."$\\beta_1$") or HTML ("<i>β</i>") as labels. Mirrors RkableExtra::escapeandxtable::print. Cell content (numeric estimates, computed stats) is always safe — it never contains user-controlled metacharacters. -
Word + Excel
column_spannersrendering — closes the format parity gap left in Round 1. Word inserts an extra header row with merged cells across each column block; Excel usesws.merge_cellsand the spanner row sits above the model-label row inside the booktab top-rule region.
Tests¶
20 new tests in test_regtable_round2_extensions.py covering all
five features across text / LaTeX / HTML / Word / Excel renderers.
548 targeted tests pass (Round 1's 528 + 20 new), zero regression.
sp.regtable publication-quality extensions¶
Five additions designed to close the remaining gap between
sp.regtable and Stata esttab / R modelsummary /
R fixest::etable for empirical paper writing. No numerical
changes to any estimator; output-layer only.
Added — Five regtable parameters¶
-
eform— reportexp(b)(odds ratios forlogit/probit, incidence-rate ratios forpoisson, hazard ratios for Cox-style models). SE via delta method (exp(b)·SE(b)); CI bounds via(exp(lo), exp(hi))of the original endpoints; t and p unchanged becauseH_0: b=0is equivalent toH_0: exp(b)=1. Acceptsbool(apply to all) orList[bool](per-model — mix logit OR with OLS coefs in the same table). A footer note transparently flags which columns are exponentiated. Mirrors Stataesttab, eform. -
column_spanners— multi-row header above the model labels. Pass a list of(label, span)tuples whose spans partition all model columns, e.g.[("OLS", 2), ("IV", 2)]. Renders as\multicolumn{n}{c}{label}+\cmidrulein LaTeX,colspanin HTML, repeated bold cells in Markdown, and centered ASCII in text. Mirrors Statamgroups()and Rmodelsummary'sgroup. -
coef_map— single-shot rename + reorder + drop. Pass an ordered dict whose keys are coefficients to keep (in display order) and values are display labels. Variables not in the map are dropped. Mutually exclusive with the legacycoef_labels/keep/drop/orderquartet. Mirrors Rmodelsummary'scoef_map. -
stats=["depvar_mean", "depvar_sd"]— auto rows for the dependent variable's sample mean and standard deviation, populated from the result object'sdata_info['y'](orendog/dep_var) at extraction time. Rows render as "Mean of Y" and "SD of Y". Top-5 economics journals routinely require these so reviewers can sanity-check effect magnitudes against the outcome's scale. Aliases:"ymean"/"ysd". -
consistency_check(defaultTrue) — emit aUserWarningwhen sample sizes differ across columns. Disable viaconsistency_check=Falsewhen the N-mismatch is intentional (IV first stage on a subsample, RD bandwidth restriction). Reviewer red flag silenced by default in v1.7.2, surfaced now.
Tests¶
23 new tests in test_regtable_publication_extensions.py covering
all six format renderers (text / LaTeX / HTML / Markdown) plus the
parameter validation paths. Existing 204 output-area tests
unchanged.
Phase 12: provenance rollout to 66/925 (bounds + randomization + imputation)¶
Continues the v1.7.2 provenance rollout. No numerical changes to any estimator. 5 estimators instrumented spanning bounds / randomization inference / imputation. Coverage 61/925 → 66/925.
Added — Provenance for 5 estimators¶
sp.balke_pearl— Balke-Pearl bounds on ATE under monotonicity.sp.lee_bounds— Lee (2009) trimming bounds for selection.sp.manski_bounds— Manski (1990) worst-case ATE bounds.sp.fisher_exact— Fisher randomization test (permutation).sp.imputation.mice— Multiple Imputation by Chained Equations.
Tests¶
6 new (5 per-estimator + 1 multi-estimator integration). All pass.
Phase 11: provenance rollout to 61/925 (spatial + qte + bootstrap + conformal)¶
Continues the v1.7.2 provenance rollout. No numerical changes to any estimator. 7 estimators instrumented spanning spatial / quantile / distributional / bootstrap / conformal. Coverage 54/925 → 61/925.
Added — Provenance for 7 estimators¶
sp.spatial.spatial_did— spatial-lag DiD with spillover decomposition.sp.spatial.spatial_iv— spatial 2SLS.sp.qte.dist_iv— distributional IV / quantile LATE.sp.qte.beyond_average_late— quantile LATE under fuzzy compliance.sp.qte.qte_hd_panel— high-dim panel QTE via LASSO controls.sp.bootstrap— general-purpose bootstrap inference.sp.conformal_cate— conformal prediction intervals for CATE.
Tests¶
8 new (7 per-estimator + 1 multi-estimator integration). All pass.
Phase 10: provenance rollout to 54/925 (panel + decomp + mediation)¶
Continues the v1.7.2 provenance rollout. No numerical changes to
any estimator. 6 estimators instrumented; sp.panel refactored
into outer wrapper + dispatcher (parallel to Phase 4 sp.synth and
Phase 7 sp.etwfe). Coverage 48/925 → 54/925.
Added — Provenance for 6 estimators¶
sp.panel— multi-method panel dispatcher (FE / RE / BE / FD / pooled / twoway / CRE / GMM). Refactored: outerpanelwrapper captures kwargs + calls_dispatch_panel_impl+ attaches provenance once. Public signature unchanged.sp.causal_impact— Brodersen-Gallusser-Koehler-Remy-Scott (2015) BSTS-style impact.sp.mediate— Imai-Keele-Tingley (2010) mediation.sp.mediate_interventional— VanderWeele-Vansteelandt-Robins (2014) interventional (in)direct effects.sp.bartik— Goldsmith-Pinkham-Sorkin-Swift (2020) shift-share IV.sp.decompose— Oaxaca / FFL / DFL / RIF / gap-closing dispatcher;Provenance.functionsurfaces the dispatched method (e.g."sp.decompose.oaxaca").
Skipped — sp.did top-level dispatcher¶
The sp.did dispatcher delegates to already-instrumented inner
estimators (sp.did.callaway_santanna / sp.did.did_2x2 /
sp.did.aggte / sp.sun_abraham / sp.synth(method='sdid')).
With the established overwrite=False semantics, the inner
record's name (more specific) wins. Wrapping the dispatcher would
add no information.
Tests¶
8 new (6 per-estimator + 1 panel method-choice variant + 1 multi-estimator integration). 111 green across the panel / causal_impact / mediation / decomposition / bartik regression sweep.
Phase 9: provenance rollout to 48/925 (TMLE + forest + DR)¶
Continues the v1.7.2 provenance rollout. No numerical changes to any estimator. 12 ML-causal + classical-identification estimators instrumented. Coverage 36/925 → 48/925.
Added — Provenance for 12 estimators¶
ML-causal (8):
sp.tmle— van der Laan & Rose Targeted MLE (with Super Learner).sp.tmle.ltmle— Longitudinal TMLE for static regime contrasts.sp.tmle.hal_tmle— TMLE with Highly Adaptive Lasso nuisance.sp.causal_forest— GRF causal forest factory.sp.multi_arm_forest— multi-arm causal forest.sp.iv_forest— Athey-Tibshirani-Wager IV causal forest.sp.metalearner— S/T/X/R/DR meta-learner dispatcher.sp.bcf— Hahn-Murray-Carvalho Bayesian Causal Forest.
Classical identification (4):
sp.aipw— Augmented IPW (doubly robust, cross-fit).sp.ipw— Inverse Probability Weighting.sp.g_computation— parametric g-formula.sp.front_door— Pearl front-door adjustment.
Pattern reuse: established Phase 3 idiom — assign to _result,
attach_provenance(overwrite=False), return. The hal_tmle →
tmle cascade is handled correctly: inner sp.tmle record wins,
matching the etwfe → wooldridge_did and lasso_iv → iv
patterns from earlier rounds.
Tests¶
14 new (12 per-estimator + 1 metalearner choice variant + 1 multi-estimator integration). 103 green across the hal_tmle / causal_forest / metalearner / bcf / front_door / g_computation regression sweep.
production function estimators (OP / LP / ACF / Wooldridge + translog + DLW markup)¶
Adds proxy-variable production function estimation — Olley-Pakes,
Levinsohn-Petrin, Ackerberg-Caves-Frazer, Wooldridge — plus
Cobb-Douglas + translog functional forms and the De Loecker-Warzynski
markup. Closes the long-standing gap that forced StatsPAI users to
drop into R prodest or Stata prodest for productivity / TFP /
markup work.
Added¶
sp.prod_fn(method=..., functional_form=...)— unified dispatcher ('op' | 'lp' | 'acf' | 'wrdg','cobb-douglas' | 'translog').sp.olley_pakes(aliassp.opreg) — investment-proxy estimator with strictly-positive-investment filter.sp.levinsohn_petrin(aliassp.levpet) — intermediate-input proxy (avoids OP zero-investment selection).sp.ackerberg_caves_frazer(aliassp.acf) — modern default, corrects the OP/LP labor-coefficient identification problem via lagged-labor instruments.sp.wooldridge_prod— joint stacked-NLS estimator (Cobb-Douglas only; translog raises NotImplementedError; full-GMM Wooldridge on roadmap).sp.markup— De Loecker & Warzynski (2012) firm-time markup μ_it = θ_v · (PQ) / (P_v V) with optional η-correction. Supports both Cobb-Douglas (constant θ_v) and translog (firm-time θ_v_it read from the elasticity panel attached to the result).sp.ProductionResult— unified result class withcoef,tfp,productivity_process,cite(),summary(), plusmodel_info["elasticities"]for translog firm-time elasticities.- Translog functional form: input matrix expanded to linear + 0.5*x_j² + cross-term basis; instrument matrix expanded by the same polynomial; firm-time output elasticities computed from ∂y/∂x_j = β_j + β_jj·x_j + Σ_{k≠j} β_jk·x_k.
- Firm-cluster bootstrap SE (Wooldridge 2009 §4 convention) with convergence filtering on each replicate.
- Multi-start Nelder-Mead in stage 2 over 5 economic-prior starts (the OLS warm start is intentionally avoided — it lands in a spurious basin where the productivity AR overfits ω onto ω_lag at implausible β).
- UserWarning on non-consecutive panel time periods (lag operator would silently treat gaps as 1-period lags otherwise).
- 9 new registry entries (5 canonical + 3 aliases + markup) — total rises to 964 functions.
References (verified via Crossref API on 2026-04-27)¶
- Olley & Pakes (1996) Econometrica 64(6) 1263–1297, DOI 10.2307/2171831
- Levinsohn & Petrin (2003) RES 70(2) 317–341, DOI 10.1111/1467-937X.00246
- Ackerberg, Caves & Frazer (2015) Econometrica 83(6) 2411–2451, DOI 10.3982/ECTA13408
- Wooldridge (2009) Economics Letters 104(3) 112–114, DOI 10.1016/j.econlet.2009.04.026
- De Loecker & Warzynski (2012) AER 102(6) 2437–2471, DOI 10.1257/aer.102.6.2437
Tests¶
tests/test_prod_fn.py— 23 tests:- Synthetic DGP recovery (ACF tight; OP/LP loose per ACF's identification critique; Wooldridge feasible-range)
- Translog: 5-coef structure, dispatcher pass-through, CD-truth nesting (β_ll/β_kk/β_lk near 0), markup with firm-time θ_v_it, Wooldridge-translog raises, unknown functional_form raises
- Dispatcher, aliases, bootstrap SE, markup CD path, edge cases (missing columns, too-few-obs, zero-proxy filter, time-gap warning, registry presence, no-bootstrap diagnostics shape).
Notes¶
- Default
productivity_degree=1(linear AR(1)). Higher degrees can overfit ω given ω_lag in finite samples and flatten the GMM objective surface — see dispatcher docstring. - Translog identification caveat: stage-2 instruments are polynomial transforms of the same raw (k, l_lag) pair, so the moment system can be near-singular when state and lagged-free inputs are highly correlated. Higher-order coefficients have larger finite-sample variance than linear ones — bootstrap SEs recommended.
- Gandhi-Navarro-Rivers (2020) flexible-input identification and full efficient-GMM Wooldridge are roadmap items, not in this release.
Phase 8: provenance rollout to 36/925 (IV + matching + DML)¶
Continues the v1.7.2 provenance rollout from Phases 3-4-7. No
numerical changes to any estimator. 12 instrumentation points added
(15 user-facing functions, since the JIVE family of 4 share a single
_run instrumentation). Coverage 21/925 → 36/925.
Added — Provenance instrumentation for 12 more points¶
IV family (9 user-facing names):
sp.liml— Limited Information Maximum Likelihood / Fuller.sp.jive— legacy single-method JIVE (regression/advanced_iv).sp.lasso_iv— Belloni-Chen-Chernozhukov-Hansen (2012). The pre-existingiv()API drift bug here was also repaired —lasso_ivnow builds a formula string for the formula-onlysp.iv()API and maps the legacyrobust='robust'kwarg to the modernhc1enum.sp.iv.bayesian_iv— Chernozhukov-Hong (2003) Anderson-Rubin posterior with Metropolis-Hastings.sp.iv.jive1/sp.iv.ujive/sp.iv.ijive/sp.iv.rjive— all four flow through the shared_rundispatcher;methodarg discriminates and surfaces inProvenance.function("sp.iv.jive1"/"sp.iv.ujive"/ …). One instrumentation point covers four user-facing names.sp.iv.mte— Brinch-Mogstad-Wiswall (2017) polynomial Marginal Treatment Effect.
Matching family (5):
sp.match— main matching dispatcher (PSM / mahalanobis / CEM / strata / coarsened).sp.optimal_match— Hungarian-algorithm 1:1 with caliper.sp.cardinality_match— Zubizarreta (2014) LP with SMD tolerance.sp.genmatch— Diamond-Sekhon (2013) genetic matching.sp.sbw— Zubizarreta (2015) Stable Balancing Weights.
DML (1):
sp.dml— Chernozhukov et al. (2018) Double ML dispatcher covering plr / irm / pliv / iivm. Single-exit pattern.
Pattern reuse: each follows the established Phase 3 idiom — assign
result to _result, call attach_provenance(overwrite=False),
return. overwrite=False semantics preserve the inner-most record
when an outer wrapper (e.g. lasso_iv calling sp.iv) is also
instrumented.
Fixed — sp.lasso_iv API drift (pre-existing)¶
Independent fix: sp.lasso_iv was calling the legacy iv(y=,
x_endog=, x_exog=, z=) signature which is no longer accepted. Now
builds a Patsy-style formula (y ~ (endog ~ z) + exog) for the
current formula-only sp.iv() API.
Tests¶
16 new tests (12 per-estimator + 4 JIVE variants confirming each
gets the right method-discriminated function name + 1
multi-estimator integration). 155 green across the IV + matching +
DML + provenance regression sweep:
- IV:
test_iv.pyandtest_iv_frontiers.py. - Matching:
test_matching.pyandtest_matching_optimal.py. - DML:
test_dml.py,test_dml_iivm.py,test_dml_panel.py,test_dml_split.py. - Provenance: rounds 1+2+3+4.
Documentation¶
docs/guides/replication_workflow.md scorecard updated to 36/925.
production function estimators¶
Adds proxy-variable production function estimation — Olley-Pakes,
Levinsohn-Petrin, Ackerberg-Caves-Frazer, Wooldridge — plus the
De Loecker-Warzynski markup. Closes the long-standing gap that
forced StatsPAI users to drop into R prodest or Stata prodest
for productivity / TFP / markup work.
Added¶
sp.prod_fn(method=...)— unified Cobb-Douglas dispatcher ('op' | 'lp' | 'acf' | 'wrdg').sp.olley_pakes(aliassp.opreg) — investment-proxy estimator with strictly-positive-investment filter.sp.levinsohn_petrin(aliassp.levpet) — intermediate-input proxy (avoids OP zero-investment selection).sp.ackerberg_caves_frazer(aliassp.acf) — modern default, corrects the OP/LP labor-coefficient identification problem via lagged-labor instruments.sp.wooldridge_prod— joint stacked-NLS estimator (one-step GMM with identity weighting and instruments = regressors; full efficient-GMM variant on the roadmap).sp.markup— De Loecker & Warzynski (2012) firm-time markup μ_it = θ_v · (PQ) / (P_v V) with optional η-correction.sp.ProductionResult— unified result class withcoef,tfp,productivity_process,cite(),summary().- Firm-cluster bootstrap SE (Wooldridge 2009 §4 convention) with convergence filtering on each replicate.
- Multi-start Nelder-Mead in stage 2 to dodge the upward-biased OLS warm start (positive selection of labor on ω).
- UserWarning on non-consecutive panel time periods (lag operator would silently treat gaps as 1-period lags otherwise).
- 9 new registry entries (5 canonical + 3 aliases + markup), bringing total to 964 functions.
References (verified via Crossref API on 2026-04-27)¶
- Olley & Pakes (1996) Econometrica 64(6) 1263–1297, DOI 10.2307/2171831
- Levinsohn & Petrin (2003) RES 70(2) 317–341, DOI 10.1111/1467-937X.00246
- Ackerberg, Caves & Frazer (2015) Econometrica 83(6) 2411–2451, DOI 10.3982/ECTA13408
- Wooldridge (2009) Economics Letters 104(3) 112–114, DOI 10.1016/j.econlet.2009.04.026
- De Loecker & Warzynski (2012) AER 102(6) 2437–2471, DOI 10.1257/aer.102.6.2437
Tests¶
tests/test_prod_fn.py— synthetic DGP recovery (ACF tight, OP/LP loose per ACF's identification critique), dispatcher, aliases, bootstrap SE, markup, edge cases (missing columns, too-few-obs, zero-proxy filter, time-gap warning, registry presence). 18 tests.
Notes¶
- Default
productivity_degree=1(linear AR(1)). Higher degrees can overfit ω given ω_lag in finite samples and flatten the GMM objective surface — see dispatcher docstring. - Translog and Gandhi-Navarro-Rivers (2020) production functions are roadmap items, not in this release.
Phase 7: provenance rollout to 21/925 (DiD long-tail + RD)¶
Continues the v1.7.2 provenance rollout established in Phases 3-4.
No numerical changes to any estimator. 12 more estimators
instrumented; sp.etwfe refactored into wrapper + dispatcher
(parallel to the Phase 4 sp.synth move). Coverage now 21/925.
Added — Provenance instrumentation for 12 more estimators¶
DiD long-tail (10):
sp.cic— Athey-Imbens (2006) Changes-in-Changes.sp.cohort_anchored_event_study— staggered-robust ES (arXiv:2509.01829).sp.design_robust_event_study(Wright 2026, arXiv:2601.18801) — orthogonalised event-study under staggered adoption.sp.gardner_did/sp.did_2stage— Gardner (2021) two-stage.sp.harvest_did— Borusyak-Harmon-Hull-Jaravel-Spiess (2025) harvesting.sp.did_misclassified— staggered DiD with treatment misclassification + anticipation (arXiv:2507.20415).sp.stacked_did— Cengiz-Dube-Lindner-Zipperer (2019) stacked.sp.wooldridge_did— Wooldridge (2021) Extended TWFE.sp.etwfe— refactored into outer wrapper + 4-branch_dispatch_etwfe_implso the (with-xvar / never-only / notyet / repeated-cross-section) routing attaches provenance once on the way out. Same pattern as Phase 4'ssp.synthmove.sp.drdid— Sant'Anna-Zhao (2020) doubly robust DiD.
RD (2):
sp.rd_honest— Armstrong-Kolesar (2018, 2020) honest CIs.sp.rkd— Card-Lee-Pei-Weber (2015) Regression Kink Design.
Each follows the established Phase 3 idiom: assign result to
_result, call attach_provenance(overwrite=False), return.
overwrite=False semantics preserve the inner-most record so
estimand-first / sp.causal / sp.paper wrappers don't clobber
the more-specific call name.
Changed — sp.etwfe refactored into outer wrapper + dispatcher¶
Mirrors Phase 4's sp.synth refactor. The previous etwfe had 4
return sites (one per (panel × cgroup × xvar) branch), which
made naive instrumentation maintenance-hostile. New layout:
_dispatch_etwfe_impl(...)— full dispatcher (formeretwfebody), unchanged logic.etwfe(...)— thin outer wrapper that captures kwargs, calls impl, attaches provenance once before returning.
Public signature is bit-identical; the existing wooldridge / etwfe test sweep passes with zero changes.
Tests¶
14 new tests (12 per-estimator + 1 did_2stage alias check + 1
multi-estimator replication_pack integration). 346 green across
the DiD + RD + paper regression sweep (DiD: 214, paper+remaining:
132). Zero regressions across either family.
Documentation¶
docs/guides/replication_workflow.md scorecard updated to reflect
the new 21/925 coverage. Users running get_provenance(result)
can verify any estimator's status locally.
production function estimators¶
Adds proxy-variable production function estimation — Olley-Pakes,
Levinsohn-Petrin, Ackerberg-Caves-Frazer, Wooldridge — plus the
De Loecker-Warzynski markup. Closes the long-standing gap that
forced StatsPAI users to drop into R prodest or Stata prodest
for productivity / TFP / markup work.
Added¶
sp.prod_fn(method=...)— unified Cobb-Douglas dispatcher ('op' | 'lp' | 'acf' | 'wrdg').sp.olley_pakes(aliassp.opreg) — investment-proxy estimator with strictly-positive-investment filter.sp.levinsohn_petrin(aliassp.levpet) — intermediate-input proxy (avoids OP zero-investment selection).sp.ackerberg_caves_frazer(aliassp.acf) — modern default, corrects the OP/LP labor-coefficient identification problem via lagged-labor instruments.sp.wooldridge_prod— joint stacked-NLS estimator (one-step GMM with identity weighting and instruments = regressors; full efficient-GMM variant on the roadmap).sp.markup— De Loecker & Warzynski (2012) firm-time markup μ_it = θ_v · (PQ) / (P_v V) with optional η-correction.sp.ProductionResult— unified result class withcoef,tfp,productivity_process,cite(),summary().- Firm-cluster bootstrap SE (Wooldridge 2009 §4 convention) with convergence filtering on each replicate.
- Multi-start Nelder-Mead in stage 2 to dodge the upward-biased OLS warm start (positive selection of labor on ω).
- UserWarning on non-consecutive panel time periods (lag operator would silently treat gaps as 1-period lags otherwise).
References (verified via Crossref API on 2026-04-27)¶
- Olley & Pakes (1996) Econometrica 64(6) 1263–1297, DOI 10.2307/2171831
- Levinsohn & Petrin (2003) RES 70(2) 317–341, DOI 10.1111/1467-937X.00246
- Ackerberg, Caves & Frazer (2015) Econometrica 83(6) 2411–2451, DOI 10.3982/ECTA13408
- Wooldridge (2009) Economics Letters 104(3) 112–114, DOI 10.1016/j.econlet.2009.04.026
- De Loecker & Warzynski (2012) AER 102(6) 2437–2471, DOI 10.1257/aer.102.6.2437
Tests¶
tests/test_prod_fn.py— synthetic DGP recovery (ACF tight, OP/LP loose per ACF's identification critique), dispatcher, aliases, bootstrap SE, markup, edge cases (missing columns, too-few-obs, zero-proxy filter, time-gap warning, registry presence). 18 tests.
Notes¶
- Default
productivity_degree=1(linear AR(1)). Higher degrees can overfit ω given ω_lag in finite samples and flatten the GMM objective surface — see dispatcher docstring. - Translog and Gandhi-Navarro-Rivers (2020) production functions are roadmap items, not in this release.
clubSandwich-equivalent HTZ Wald (independent PR)¶
Adds a numerically-equivalent Python implementation of R
clubSandwich::Wald_test(..., test="HTZ") for cluster-robust Wald
tests under CR2 sandwich. Closes the BM-vs-HTZ gap documented in
cluster_dof_wald_bm (which uses the BM 2002 simplified formula
and can drift 50–100% from clubSandwich on multi-restriction tests).
Added¶
sp.fast.cluster_wald_htz()— full HTZ Wald test, returnsWaldTestResult(test, q, eta, F_stat, p_value, Q, R, r, V_R).sp.fast.cluster_dof_wald_htz()— DOF-only helper mirroring thecluster_dof_wald_bmsignature for easy substitution.sp.fast.WaldTestResult— frozen dataclass with.summary()and.to_dict().- Pustejovsky-Tipton 2018 §3.2 moment-matching DOF η computed as
q(q+1) / sum(var_mat)withvar_matderived from cluster-pair contributions toR · V^CR2 · R^Tunder a working covariance Φ = I (OLS+CR2; clubSandwich's default). - Hotelling-T² scaling:
F_stat = (η - q + 1) / (η · q) · Qwithp_value = 1 - F_{q, η-q+1}.cdf(F_stat).
Verification¶
- 3 frozen-fixture parity tests vs R clubSandwich 0.6.2 at
rtol < 1e-8(q ∈ {1, 2, 3}, balanced + unbalanced panels; fixture intests/fixtures/htz_clubsandwich.json, no R required in CI). - 3 live-R parity tests at
rtol < 1e-8(skipifRscriptmissing). - 14 unit tests: validation, invariance (X rescale + cluster relabel +
bread arg path), edge cases (singleton cluster warning, zero
residuals short-circuit,
η ≤ q-1rejection, non-uniform weightsNotImplementedError). - Total: 23/23 tests pass.
Scope (v1)¶
- Standalone — no wiring into
crve/feols/fepois/event_study. That's the next PR. - Working covariance
Φlocked toI(OLS+CR2). Non-uniform weights raiseNotImplementedErrorwith a pointer to v2. - HTZ test variant only; HTA / HTB / KZ / Naive / EDF deferred.
References¶
pustejovsky2018smalladded topaper.bibafter Crossref dual-source verification (DOI10.1080/07350015.2016.1247004; authors / year 2018 / vol 36(4) / pp 672–683 / title — all four elements verified per CLAUDE.md §10).- Implementation derived 1:1 from Pustejovsky-Tipton 2018 §3.2 + clubSandwich source (R Wald_testing / get_P_array / total_variance_mat). No GPL code copied; clubSandwich used only as black-box reference.
Phase 5: LLM-DAG closed loop + layered credential resolver¶
Closes the LLM-DAG closed-loop deferred from Phases 2-4. No numerical changes to any estimator. The export pipeline can now auto-propose a DAG via a real LLM (Anthropic Claude or OpenAI GPT) without requiring users to pre-build one — credential resolution follows the industry-standard layered fallback pattern.
Added — sp.causal_llm.get_llm_client() layered credential resolver¶
Resolution order (first match wins):
- Explicit
client=— already-builtLLMClient, pass through. - Explicit
provider=+api_key=— construct directly. - Environment variable —
ANTHROPIC_API_KEY/OPENAI_API_KEY. When both are set, tie-break to the config file's[llm].provider(or to Anthropic if no config). - Config file
~/.config/statspai/llm.toml(XDG-compliant) — storesproviderandmodelpreferences. Never stores API keys — that's the documented industry-standard split (Anthropic SDK / OpenAI SDK / AWS CLI / kubectl all keep keys in environment variables, never plaintext config). - Interactive prompt — only when
sys.stdin.isatty()ANDallow_interactive=True. Walks user through provider + model selection but never asks for the API key over stdin (security: leaks in shell history, no obvious env-var integration path). - Hard error with concrete remediation: lists the env vars to
set + points at
sp.causal_llm.configure_llm(...)for the provider+model preference part.
Added — sp.causal_llm.configure_llm() preferences setter¶
One-shot setter that persists provider+model to the XDG config file. Useful when a user has both env vars set and wants to pin the choice:
import statspai as sp
sp.causal_llm.configure_llm(provider="openai", model="gpt-4o")
# → ~/.config/statspai/llm.toml gets a [llm] block with the choice.
Added — sp.paper(..., llm='auto', llm_domain=...) auto-DAG hook¶
When the user doesn't pass an explicit dag=, llm='auto' (or
llm='heuristic' for a pinned offline path) triggers
llm_dag_propose against the resolved client + the variable list.
Failures (no API key, network error, malformed JSON) silently fall
back to a no-DAG paper — auto-DAG must never break the rest of the
pipeline. Pass llm_client= to override the resolver entirely.
The proposed DAG is materialised as a statspai.dag.graph.DAG and
attached to the PaperDraft, so all downstream rendering (Quarto
mermaid block, replication_pack lineage, Causal DAG appendix) flows
through the existing Phase 3 plumbing — no new branches.
Added — LLMClient.complete() alias (latent bug fix)¶
llm_dag_propose / llm_dag_validate / llm_dag_constrained
all called client.complete(prompt), but the LLMClient base
class only defined chat(role, prompt) and __call__(prompt).
Any user passing a real openai_client / anthropic_client
into the LLM-DAG functions would have hit AttributeError. Added
complete() as an alias on the base class — both names route
through chat(), so no concrete adapter needs changes.
Public exports¶
sp.causal_llm.get_llm_client,
sp.causal_llm.list_available_providers,
sp.causal_llm.configure_llm,
sp.causal_llm.LLMConfigurationError,
sp.causal_llm.llm_config_path,
sp.causal_llm.load_llm_config,
sp.causal_llm.DEFAULT_LLM_MODELS.
Tests¶
27 new tests (tests/test_llm_resolver.py):
- Config file: XDG path, missing/malformed graceful fallback, save round-trip, header comment warns against putting keys in the file.
- Layered fallback: explicit client → explicit provider → env → config tie-break → no-env-no-tty hard error → no-env-tty-no-keys hard error → env-set skips prompt.
configure_llmround-trip + unknown-provider rejection.LLMClient.complete()alias smoke.sp.paper(llm='auto')integration: no-env falls back to heuristic; explicitllm_client=populates the DAG.
221 green across the new + adjacent paper / lineage / replication_pack / estimator-provenance / bibliography / gt suites.
Phase 4: synth refactor + 5 more estimator provenance hookups¶
Continues the v1.7.2 provenance rollout from Phase 3 (4 estimators
instrumented). This round closes the deferred sp.synth dispatcher
refactor and adds 4 more high-leverage estimators. No numerical
changes to any estimator — total provenance coverage now 9/925.
Changed — sp.synth dispatcher refactored for one-shot provenance¶
The previous v1.7.2 instrumentation deferred sp.synth because its
13 method branches each had their own return X(...) call site —
sprinkling 13 attach_provenance calls would've been
maintenance-hostile. Refactor splits responsibility:
_dispatch_synth_impl(...)— full dispatcher (formersynthbody), unchanged logic.synth(...)— thin outer wrapper that captures kwargs, calls impl, then attaches provenance once before returning.
Public signature is bit-identical; the 145-test synth regression
sweep passes with zero changes. All 13 SCM method variants
(classic / penalized / demeaned / unconstrained /
augmented / sdid / factor / staggered / mc /
discos / multi_outcome / scpi / bayesian / bsts /
penscm / fdid / cluster / sparse / kernel /
kernel_ridge) now flow through the same provenance attach.
Added — Provenance instrumentation for 4 more estimators¶
sp.did.did_imputation— Borusyak-Jaravel-Spiess (2024) imputation.sp.did.aggte— Callaway-Sant'Anna ATT(g, t) aggregation. Capturesupstream_run_idandupstream_functionso downstream consumers can trace the aggregation step back to the producingcallaway_santannacall (sp.replication_pack'slineage.jsonthus gets a chain, not just disconnected runs).sp.did.did_multiplegt— de Chaisemartin-D'Haultfoeuille (2020).sp.rd.rdrobust— Calonico-Cattaneo-Titiunik local-polynomial RD with robust bias correction. Captures kernel / bwselect / fuzzy / donut / weights for the full reproduction recipe.
Each follows the established pattern: assign result to _result,
call attach_provenance with overwrite=False, return. Any
upstream-instrumented estimator (sp.causal_question /
sp.paper / aggte) preserves the inner record.
Tests¶
9 new tests (3 synth + 1 did_imputation + 1 aggte upstream-linkage + 1 did_multiplegt + 2 rdrobust + 1 multi-estimator integration). 166 green across the DiD + RD + new provenance regression sweep (95s wall, 145 of which are synth — the refactor is paid for in test time once and forgotten).
Provenance coverage scorecard¶
| v1.7.2 P3 | v1.7.2 P4 (this) | |
|---|---|---|
| Instrumented | 4/925 | 9/925 |
| Estimator | Status |
|---|---|
sp.regress |
P3 ✓ |
sp.callaway_santanna |
P3 ✓ |
sp.did_2x2 |
P3 ✓ |
statspai.regression.iv.iv |
P3 ✓ |
sp.synth (13 method dispatcher) |
P4 ✓ |
sp.did.did_imputation |
P4 ✓ |
sp.did.aggte (chain-aware) |
P4 ✓ |
sp.did.did_multiplegt |
P4 ✓ |
sp.rd.rdrobust |
P4 ✓ |
Remaining 916 estimators bucket into v1.7.3 sprint themes: DiD long-tail (~20), IV variants (~15), synth sub-modules (already flow through dispatcher), DML / TMLE / metalearners (~50), panel / structural (~80), and the long tail (~750).
Phase 3: estimand-first paper + estimator provenance + DAG appendix¶
Layered on top of the Phase 1+2 export trinity. No numerical changes to any estimator. Three additions, each gated to opt-in call sites to keep blast radius small.
Added — Estimand-first sp.paper(causal_question_obj)¶
The Target-Trial-Protocol-shaped declaration now drives the paper end to end. Two equivalent entry points:
# Method-style:
q = sp.causal_question("trained", "wage", data=df, design="did",
time="year", id="worker_id")
draft = q.paper(fmt="qmd")
draft.write("paper.qmd")
# Function-style dispatch:
draft = sp.paper(q, fmt="qmd")
The builder routes through q.identify() + q.estimate() and
assembles Question / Data / Identification / Estimator / Results /
Robustness / References sections whose contents match the
declaration (not natural-language inference). Unlike the
DataFrame-first sp.paper(df, "natural-language question") path,
this preserves the user's pre-registered estimand, design, and
identification claims verbatim — agents that pre-register get
audit-grade traceability for free.
Underlying estimator's result is exposed on
draft.workflow.result so sp.replication_pack and
draft.to_qmd()'s Reproducibility appendix pick up provenance
automatically.
Added — Estimator-level provenance instrumentation (4 of 5)¶
Top-tier estimators now attach_provenance() to their fit result
with overwrite=False semantics — outer wrappers (sp.causal,
sp.paper) preserve the inner estimator's more-specific record:
sp.regress(regression/ols.py).sp.callaway_santanna(did/callaway_santanna.py).sp.did_2x2(did/did_2x2.py).statspai.regression.iv.iv— unified 2SLS / LIML / GMM / JIVE.
Each captures: function name, key kwargs (formula / estimator / control_group / method / etc.), 12-char SHA-256 of the input frame (column-name + dtype + value sensitive), run uuid, version stamps.
Deferred: sp.synth dispatcher (13 method branches, 13+ return
sites). A dedicated v1.7.3 sprint refactors synth into an inner
_dispatch_synth plus an outer wrapper that attaches provenance
once, instead of sprinkling 13 attach calls.
Added — Causal DAG appendix in PaperDraft¶
Pass dag= to sp.paper(...) (or q.paper(dag=...)) and the
draft gains a Causal DAG section that renders fmt-aware:
- Markdown / TeX: text-art with the variable list, edge list,
back-door paths, adjustment sets, and any latent
_L_*confounders. - Quarto (.qmd): native
{mermaid}graph block (Quarto renders to SVG out of the box) plus the same text fallbacks below it.
Identification-relevant info (back-door paths, adjustment sets, bad
controls) is computed from the DAG via the existing
:class:statspai.dag.graph.DAG API; the LLM-DAG closed loop
(sp.llm_dag_propose / validate / constrained) integrates as a
data source — pass any DAG those return into dag=. The paper
builder doesn't itself call any LLM API; that remains the user's
explicit choice.
Added — Public exports¶
sp.paper_from_question— alternative entry point next to the method-styleq.paper()and the dispatcher insp.paper(q).- DAG-section-related fields on :class:
PaperDraft:dag,dag_treatment,dag_outcome.
Tests¶
35 new tests (14 paper_from_question + 8 estimator_provenance + 13 paper_dag_section). 295 green across the full Phase 1+2+3 + adjacent paper / registry / help / output / workflow surface.
HDFE silent-bug fix + completeness pass¶
Layered on top of the v1.8 RC sp.fast.* HDFE stack. One ⚠️
correctness fix (event_study cluster SE), the rest is additive.
⚠️ Correctness — sp.fast.event_study cluster SE¶
sp.fast.event_study was computing CR1 cluster-robust SEs without
charging the absorbed FE rank against residual degrees of freedom.
The small-sample factor used (n-1)/(n-k_dummies) instead of
(n-1)/(n - k_dummies - Σ(G_k - 1)), so SEs were systematically
too small (~3–5% on a typical balanced panel; up to ~10% on
small/uneven designs). The fix passes extra_df = Σ(G_k - 1) —
matching reghdfe / fixest convention — through the new crve
parameter (see Added below). t-statistics and CIs reported by
sp.fast.event_study will now be slightly wider; users
re-running the same data should expect modest changes in the third
decimal of SE.
Added — sp.fast.feols: native OLS HDFE estimator¶
The linear sister of sp.fast.fepois. Pure-Python orchestration on
top of the Phase 1 Rust demean kernel + Phase 4 inference primitives;
independent of pyfixest. Public API mirrors sp.fast.fepois
(formula DSL, vcov, cluster, weights).
vcov∈{"iid", "hc1", "cr1"}. CR1 is FE-rank-aware via the sameextra_df = Σ(G_k - 1)convention used elsewhere in fast/*.- Weighted OLS path routes through the
_weighted_ap_demeanloop (matches pyfixest weighted feols to ~1e-12). - Coefficient parity vs R
fixest::feols: 4.2e-15 at n=1M / fe1=100k / fe2=1k (machine epsilon). Wall-time 135 ms vs R fixest 106 ms vs pyfixest 210 ms — i.e. 1.55× faster than pyfixest, 1.27× slower than fixest's mature C++. Seebenchmarks/hdfe/run_feols_bench.py. - Full
coef()/se()/vcov()/tidy()/summary()surface; drop-in compatible withsp.fast.etablefor side-by-side regression tables alongsidesp.fast.fepoisresults.
Added — Cluster-robust SE in sp.fast.fepois¶
sp.fast.fepois(vcov="cr1", cluster="<col>") now ships. Score uses
the weighted Poisson form obs_weights · (y - μ) · X̃ with the
WLS bread (X̃' diag(μ) X̃)^{-1}; small-sample factor charges
Σ(G_k - 1) via the new crve parameter. NaN cluster values raise.
Added — extra_df parameter on crve / boottest / boottest_wald¶
Backward-compatible extra_df: int = 0 parameter on all three CR1
callers. Default 0 reproduces the prior behaviour bit-for-bit; HDFE
callers should pass extra_df = Σ(G_k - 1) to get the FE-rank-aware
small-sample factor. Documented in each docstring; rejected if < 0.
Added — Bell-McCaffrey / Imbens-Kolesar Satterthwaite DOF¶
Two new helpers for small-G CR2 inference:
sp.fast.cluster_dof_bm(X, cluster, *, contrast, ...)— single 1-D contrast Satterthwaite DOF, formulaν = (Σ_g λ_g)² / Σ_g λ_g²withλ_g = ‖A_g · W_g · X_g · bread · c‖².sp.fast.cluster_dof_wald_bm(X, cluster, *, R, ...)— q-dim matrix Satterthwaite for joint Wald tests, formulaν_W = (Σ tr(E_g E_g'))² / Σ ‖E_g E_g'‖_F². q=1 collapses to the scalar form bit-for-bit.
Honest convention note in both docstrings: these implement BM 2002
§3 simplified, not clubSandwich's Pustejovsky-Tipton 2018 HTZ /
generalized form. The CR2 variance matches clubSandwich exactly;
the DOF differs by 5–10% on typical panels (1-D contrast) and can
differ 50–100% in the q-dim matrix Satterthwaite. For tightest
small-G inference prefer sp.fast.boottest / sp.fast.boottest_wald.
Changed — sp.fast.fe_interact rejects NaN¶
The 2-way fast path was silently producing collision-prone packed
codes when input columns contained NaN (pd.factorize's -1
sentinel leaking into c0 * n1 + c1). Now fail-fast, matching the
fail-fast convention of sp.fast.demean / sp.fast.fepois /
sp.fast.feols. K-way path also restructured to progressive packing
with periodic re-densification, so deeply-nested FE chains can't
overflow int64.
Changed — Registry walks sp.fast.*¶
sp.list_functions() / sp.describe_function() now surface every
public callable in the sp.fast.* namespace under a fast.<name>
key (e.g. fast.feols, fast.cluster_dof_bm). The top-level
pyfixest-backed sp.feols continues to coexist as a separate
registry entry — no name collision. +27 new registry entries on
the v1.8 stack become Agent-discoverable for the first time.
Documentation¶
src/statspai/fast/jax_backend.py— added a verified-blocked note for Apple Silicon (Metal).jax-metal 0.1.1(latest, Apple- maintained) is incompatible with JAX 0.10.0 at the StableHLO bytecode level; even basic ops fail. Verified empirically on M3. Workaround for users with jax-metal installed:JAX_PLATFORMS=cpu.benchmarks/hdfe/SUMMARY.md— added v1.8.1 follow-up section with OLS bench numbers and full delta vs Phase 8.
Tests¶
tests/test_fast_feols.py— 20 new tests (coef / SE parity vs pyfixest and R fixest; weighted; intercept-only; validation; hand closed-form for OLS).tests/test_fast_inference.py— +14 tests (extra_df backward-compat and direction proofs across crve/boottest/boottest_wald; BM and Wald BM DOF coverage).tests/test_fast_event_study.py— +2 tests (FE-rank pin via math identity; R fixest SE parity within 1%).tests/test_fast_fepois.py— +6 tests (cluster CR1 path + R fixest SE parity).tests/test_fast_within_dsl.py— +3 tests (fe_interactNaN rejection; 2-way no-collision; K-way matches pandas tuple path).tests/test_fast_etable.py— +2 tests (etable × FeolsResult; mixed feols + fepois side-by-side).tests/test_registry_new_modules.py— +25 tests (parametrisedfast.*registry coverage; namespace coexistence with top-level).
Total: pytest tests/test_fast_*.py tests/test_hdfe_native.py
tests/test_registry*.py tests/test_help.py — 267 passed,
2 graceful-skip (was 133 at end of Phase 8).
Phase 2: great_tables + CSL pipeline + paper auto-provenance¶
Layered on top of the export trinity below. No numerical changes to any estimator. Three additions, all opt-in, all stdlib + soft optional deps.
Added — sp.gt(result) great_tables adapter¶
Posit's great_tables is the Python port of R's gt — the
publication-oriented table grammar (cell-level styling, spanners,
footnote marks, themes, multi-target HTML/LaTeX/RTF output). The new
adapter dispatches on input type:
- :class:
RegtableResult→ full-fidelity adapter (title, notes, journal preset → gt theme viaopt_footnote_marksandtab_options(table_font_names=...)). - :class:
PaperTables→ flattens panels into row groups viaGT(groupname_col=...). - :class:
MeanComparisonResult→to_dataframe()round-trip. pandas.DataFrame→ wraps verbatim with optionalrowname_col.- Anything with
to_dataframe()→ duck-typed.
Soft dep — great_tables is not required to import StatsPAI.
sp.is_great_tables_available() reports the dep; calling
sp.gt(...) without it raises a friendly ImportError pointing
to pip install great_tables. All 8 journal presets (AER / QJE /
Econometrica / RestStat / JF / AEJA / JPE / RestUd) apply without
crashing.
Added — sp.csl_url() / sp.write_bib() CSL pipeline¶
Quarto needs a .bib and a .csl to render citations. StatsPAI
captures citations as free-form strings on each estimator's
.cite(); this layer bridges:
- CSL URL registry — short journal names (
"aer"/"econometrica"/"qje"/ etc.) → canonical Zotero/styles URLs.sp.csl_url('aer')returns the URL; we deliberately do not bundle.cslfiles (CC-BY-SA-3.0, incompatible with MIT). Userscurlonce at project setup. - Citation → BibTeX — best-effort regex parse of canonical
"Author Y (YEAR). Title. Journal." form into
@articleentries with stablefirstauthor + year + first-title-wordkeys. Falls back to@miscfor unparseable strings rather than dropping them. sp.write_bib(citations, path)— dedupes by computed key, writes a cleanpaper.bibQuarto can resolve.- Replication pack integration —
replication_packnow writes a real BibTeX file (paper/paper.bib) instead of a free-text dump. - Quarto short names —
draft.to_qmd(csl='aer')now resolves tocsl: "american-economic-association.csl"automatically; pre-existing.cslfilenames pass through untouched.
Added — sp.paper() auto-attaches provenance¶
sp.paper() now calls attach_provenance() on workflow.result
after the estimate stage with overwrite=False: estimators that
wire their own provenance at fit() keep their (more specific)
record; estimators that don't gain workflow-level provenance for
free. Downstream consequences:
replication_packnow picks up provenance from a plaindraft = sp.paper(...)workflow with no further work — itslineage.jsonbecomes non-empty automatically.draft.to_qmd()emits thestatspai:YAML block (version/run_id/data_hash) and theReproducibility {.appendix}body section automatically.
This is the aggregation-point pattern for provenance rollout:
v1.7.3+ instruments individual estimators (sp.feols,
sp.did.callaway_santanna, sp.iv.tsls, sp.rd.rdrobust,
sp.synth, …); the workflow-level hook here is the bridge.
Added — Public sp.* exports¶
gt, is_great_tables_available, csl_url, csl_filename,
list_csl_styles, parse_citation_to_bib, make_bib_key,
citations_to_bib_entries, write_bib.
Tests¶
46 new tests (20 gt adapter + 26 bibliography); 226 passing across the full new + adjacent surface. Fast/Rust HDFE territory still untouched — Phase 2 is fully orthogonal to the parallel work.
Export trinity: numerical lineage + replication pack + Quarto emitter¶
Pure-additive export-layer patch. No numerical changes to any estimator. Closes three concrete gaps between StatsPAI's export stack and the R / Posit publication tooling, and lays the foundation for the v1.7.2+ "agent-native paper" line.
Added — sp.replication_pack() (replication archive)¶
One-liner that bundles an analysis into the layout AEA / AEJ data editors expect:
draft = sp.paper(df, "effect of trained on wage")
sp.replication_pack(draft, "submission.zip",
code="analysis.py", paper_format="qmd")
Produces a zip with MANIFEST.json (versions, timestamp, git SHA,
per-file SHA-256), README.md (replication instructions), data/
(CSV + schema manifest), code/, env/requirements.txt (from
pip freeze, fallback importlib.metadata), paper/ (rendered
draft + paper.bib), and lineage.json (aggregated provenance from
any results carrying _provenance). Tolerant by design — every
sub-step that fails is logged in MANIFEST.json["warnings"] rather
than aborting the archive.
Added — sp.Provenance / sp.attach_provenance() (numerical lineage)¶
A small dataclass attached as result._provenance recording: function
name, summarised params, 12-char SHA-256 of the input frame, run uuid,
StatsPAI/Python versions, ISO-8601 timestamp. Hash is column-name +
dtype + value sensitive. Estimators opt in by calling
attach_provenance(result, function="sp.did.foo", params=..., data=df)
at the end of their fit; backwards-compatible — unrecorded estimators
still work, recorded ones gain free traceability into every downstream
artifact (replication_pack, the Quarto appendix, table footers).
Added — PaperDraft.to_qmd() + sp.paper(fmt='qmd') (Quarto emitter)¶
sp.paper() now produces a .qmd document directly:
draft = sp.paper(df, "effect of trained on wage", fmt="qmd")
draft.write("paper.qmd") # quarto render paper.qmd
YAML frontmatter auto-emits format: {pdf,html,docx},
bibliography: paper.bib when citations exist, optional csl: for
journal styles, and a statspai: block carrying version / run_id
/ data_hash for machine-readable provenance. When the underlying
workflow.result has a _provenance record, a Reproducibility
{.appendix} section is appended automatically. YAML escaping is
robust against quotes / colons / newlines in the question text.
Added — Public sp.* exports¶
Provenance, attach_provenance, get_provenance,
compute_data_hash, format_provenance, lineage_summary,
replication_pack, ReplicationPack. Registry entry for
replication_pack is full agent-native (params, returns, example,
tags, assumptions, failure modes, alternatives).
Tests¶
77 new tests (32 lineage + 18 replication pack + 27 Quarto), 232 passing across new + adjacent paper/registry/help suites. Fast/ Rust HDFE territory untouched — runs independently of this patch.
[1.7.1] — 2026-04-26 — fmt="auto" magnitude-adaptive formatting + unified book-tab xlsx style¶
Pure-additive output-layer patch on top of v1.7.0. No numerical changes to any estimator. Two themes — both close gaps that referees and AER/QJE production editors flag in practice:
sp.regtable(..., fmt="auto")(andsp.modelsummary(..., fmt="auto")) pick decimal precision per-cell so a single table can mix dollar-magnitude coefficients (1,521) with elasticity-magnitude coefficients (0.288) without one side rounding to bare0.- Every
*.xlsxwriter instatspai.outputnow emits the strict academic book-tab three-rule layout (top / mid / bottom) in Times New Roman, via a single new shared modulestatspai.output._excel_style.
Added — fmt="auto" magnitude-adaptive formatting (regtable, modelsummary)¶
sp.regtable(..., fmt="auto") (and sp.modelsummary(..., fmt="auto"))
now picks decimal precision per-value so a single table can mix
dollar-magnitude coefficients (e.g. 1,521) with elasticity-magnitude
coefficients (e.g. 0.288) without one side being rounded to zero.
Bucketing: |x| ≥ 1000 → comma-separated integer; ≥ 100 → integer;
≥ 10 → 1 decimal; ≥ 1 → 2 decimals; < 1 → 3 decimals.
Why this matters. Before this addition, passing a single fixed format
like fmt="%.0f" (sensible for a wage regression where coefficients are
in dollars) would silently round any 0.X-magnitude regressor (e.g.
lagged-earnings persistence in a wages model) to bare 0 while keeping
its significance stars — producing 0*** cells that read as nonsense.
fmt="auto" is the recommended setting for any specification with
mixed-magnitude regressors. The default remains fmt="%.3f" for
backwards compatibility.
Closes the gap with R modelsummary::fmt_significant() and Stata
esttab's %g-family format codes.
Changed — Unified book-tab three-line style across all xlsx exports¶
Every *.xlsx writer in :mod:statspai.output now emits the strict
academic book-tab convention (thick top rule above the column header,
thin mid-rule between header and body, thick bottom rule under the
last data row, Times New Roman throughout, no internal grid lines —
mirrors LaTeX booktabs \toprule / \midrule / \bottomrule
verbatim).
Affected entrypoints:
sp.regtable(...).save("xxx.xlsx")(RegtableResult.to_excel) — upgraded from a two-rule layout (heavy/heavy) to strict three-rule (heavy/thin/heavy).sp.mean_comparison(...).to_excel(...)— was previously a stylelesspandas.DataFrame.to_exceldump; now goes through the shared book-tab renderer.sp.sumstats(..., output="xxx.xlsx")— added Times New Roman, top/ mid/bottom rules, merged panel headers forby=MultiIndex columns. Also adds theby_labelsparameter and auto-maps binary 0/1 group keys toControl/Treatedso academic Table 1 reads correctly out of the box (previously rendered raw0/1as panel headers).sp.modelsummary(..., output="xxx.xlsx")— Calibri → Times New Roman, double-line bottom border → strict\bottomrule.sp.outreg2(..., filename="xxx.xlsx")— replaces the legacy Excel grid layout (four-edge per-cell borders) with the book-tab three-rule convention; drops Calibri for Times New Roman.sp.tab(..., output="xxx.xlsx")— was unstyled; now book-tab compliant, chi-square test row appended as italic note below the table.
sp.paper_tables(...).to_xlsx() and sp.collect(...).save("xxx.xlsx")
were already book-tab compliant via _aer_style.excel_booktab_borders
and are unchanged in this release.
Implementation note. The visual conventions live in a single new
module statspai.output._excel_style (Times constants, write_title
/ write_header / write_body / apply_booktab_borders / write_notes
/ autofit_columns / one-shot render_dataframe_to_xlsx). Future
xlsx writers should call these primitives instead of hand-rolling
borders so the book-tab look stays consistent across all of StatsPAI.
Why this matters. Before this change the xlsx layer was three-way
fractured — regtable / outreg2 shipped two-rule or grid
layouts, sumstats / tab / modelsummary had no rules at
all, and only paper_tables / collection matched the AER/QJE
book-tab standard. Authors copying lalonde_sumstats.xlsx straight
into a manuscript got a styleless dump. Every entrypoint now produces
output a referee would accept verbatim.
[1.7.0] — 2026-04-25 — Phase 2 output overhaul: journal presets, auto-diagnostics, multi-SE, sp.cite(), reproducibility footer¶
This release closes the remaining gaps between StatsPAI's table layer
and R::modelsummary / fixest::etable. Six additive features;
no numerical changes to any estimator. One backwards-compat note
(see "Behavior change" below) — pure OLS without clustering or FE
produces byte-identical output to v1.6.x.
Added — Journal presets via template= on regtable¶
sp.regtable(..., template="qje") now picks up the per-journal SE-row
label (e.g. QJE → "Robust standard errors"), default summary-stat
selection (JF/AEJA add Adj. R²; QJE drops R²), and footer notes from a
single source-of-truth registry at
statspai.output._journals.JOURNALS. Eight presets ship: aer, qje,
econometrica, restat, jf, aeja, jpe, restud. Adding a new
journal is one dict entry — regtable, paper_tables.TEMPLATES, and
the top-level sp.JOURNAL_PRESETS view all light up automatically.
Added — Auto-extracted diagnostic rows (diagnostics="auto" default)¶
regtable now reads model_info / diagnostics on each result and
auto-emits journal-quality rows above the summary-stats block:
- Fixed Effects: Yes/No when any column absorbs FE.
- Cluster SE:
<var>with the variable name when any column clusters. - First-stage F for IV models (Olea-Pflueger preferred, falls back to
per-endog F from
sp.IVRegression). - Hansen J p-value for over-identified IV.
- Pre-trend p-value, Treated groups for DiD methods.
- Bandwidth, Kernel, Polynomial order for RD.
Rows are emitted only when at least one column produces a non-empty
cell, and user-supplied add_rows={...} always overrides on label
collision. Pass diagnostics=False (or "off") to disable.
Added — Multi-SE side-by-side¶
sp.regtable(*models, multi_se={"Cluster SE": [...]}) stacks alternative
SE specs under the primary SE row. Bracket styles cycle [] / {} /
⟨⟩ / «» (the fourth pair is guillemets, not pipes — Markdown-safe).
Footer notes record each label automatically. Works across
text / HTML / LaTeX / Markdown / Excel / Word / DataFrame.
Added — sp.cite() inline coefficient reporter¶
sp.cite(result, "treat") returns "0.234*** (0.041)" for direct
embedding in manuscript prose, Jupyter Markdown cells, and Quarto inline
expressions. Mirrors regtable's star / SE / CI conventions for
cross-table consistency. Modes: output="text"|"latex"|"markdown"|"html",
second_row="se"|"t"|"p"|"ci"|"none". Markdown stars are escaped so
they do not collide with bold delimiters.
Added — Reproducibility metadata footer¶
sp.regtable(..., repro=True) appends Reproducibility: StatsPAI v1.X.Y;
2026-04-25 15:30 as the last footer line. Pass a dict to record more:
repro={"data": df, "seed": 42, "extra": "git@<sha>"} adds
data 50000×12 SHA256:abcd1234ef; seed=42; .... Hashing skips frames
over MAX_HASH_ROWS (1M rows) to keep the call fast.
⚠️ Behavior change — diagnostics="auto" default emits new rows¶
regtable previously rendered only the rows you typed via
add_rows={...}. With the new diagnostics="auto" default, tables for
clustered or fixed-effects models now include a Cluster SE: <var> /
Fixed Effects: Yes row that was previously absent. Pure OLS without
clustering or FE produces byte-identical output to v1.6.x. Workarounds:
- Pass
diagnostics=False(or"off") to restore the old behavior. - Override individual rows by passing
add_rows={"Cluster SE": [...]}.
This is the only behavior change in the release; no numerical paths are affected.
[1.6.6] — 2026-04-24¶
Two parallel sub-releases consolidated under one version: the
journal-grade output-layer overhaul (AER/QJE DOCX, paper_tables
docx/xlsx, sp.collect, regtable.alpha, Quarto cross-refs) and the
HDFE LSMR/LSQR solver paired with the ⚠️ Heckman two-step SE
correctness fix.
Output-layer overhaul: AER/QJE DOCX, paper_tables docx/xlsx, sp.collect, regtable.alpha, Quarto cross-refs¶
This release elevates the export layer to journal-grade output. Five additive changes; no breaking changes, no numerical changes to any estimator. Existing scripts continue to produce identical numbers.
Added — Quarto cross-reference output for sp.regtable¶
sp.regtable(..., quarto_label="main", quarto_caption="Wage equation")
now emits a Quarto-cross-referenceable Markdown table via
result.to_quarto() (or result.to_markdown(quarto=True)). The
canonical : <caption> {#tbl-<label>} block is appended so the
manuscript can reference the table with @tbl-<label>.
- The
tbl-prefix is auto-prepended when missing (quarto_label="main"→{#tbl-main}); already-prefixed labels are not double-prefixed. quarto_captionfalls back totitlewhen omitted; if both are absent a generic default is used and aUserWarningis emitted.output="quarto"/output="qmd"make__str__/print()/result.save("table.qmd")round-trip Quarto output end-to-end.- The leading bold-title line is suppressed in Quarto output to avoid duplicating the caption block.
This closes the last ergonomic gap between StatsPAI's export layer and modern reproducible-paper toolchains (Quarto is the de-facto successor to R Markdown for academic econ workflows).
Added — sp.regtable(..., alpha=...) now controls CI width¶
sp.regtable(..., se_type="ci", alpha=0.10) now displays 90% confidence
intervals (and labels them 90% CI); alpha=0.01 displays 99% CIs, etc.
Previously the alpha parameter was documented but ignored — the
displayed CI was always the model's stored 95% CI.
When alpha=0.05 (default) the bounds come from the result's stored
95% CI for backward-compat (typically t-based with model df). For any
other alpha the bounds are recomputed as b ± crit · se, using the
t-distribution when df_resid is known and the standard normal as a
fallback.
sp.esttab(..., ci=True, alpha=...) mirrors the same behaviour. Both
APIs raise ValueError for alpha outside (0, 1).
Added — AER/QJE book-tab DOCX styling¶
sp.regtable(...).to_word(...), sp.sumstats(..., output="*.docx"),
sp.tab(..., output="*.docx") and sp.mean_comparison(...).to_word(...)
now render in book-tab style matching AER / QJE / Econometrica
conventions:
- heavy top rule above column headers (
sz=12) - thin mid rule below the header (
sz=4) - heavy bottom rule above the notes (
sz=12) - no internal vertical or horizontal borders
- Times New Roman, header bold, notes italic 8pt
The shared helper lives in src/statspai/output/_aer_style.py.
Previous DOCX output used the boxed Table Grid style.
Added — sp.paper_tables(...) DOCX / XLSX export¶
sp.PaperTables gains .to_docx(path) and .to_xlsx(path) methods,
and the sp.paper_tables(...) constructor accepts docx_filename= and
xlsx_filename= kwargs. Multi-panel paper bundles now go to a single
Word document (one panel per page, book-tab styled) or a single
workbook (one sheet per panel) in addition to the existing Markdown
and LaTeX outputs.
Added — sp.collect() / sp.Collection session-level container¶
A new container mirroring Stata 15's collect and R's gt::gtsave
workflow — gather any number of regressions, descriptive statistics,
balance tables, and free-form text in one container, then export the
whole bundle to a single .docx / .xlsx / .tex / .md / .html
file.
import statspai as sp
c = sp.collect("Wage analysis", template="aer")
c.add_regression(m1, m2, m3, name="main", title="Table 1: Wage equation")
c.add_summary(df, vars=["wage", "educ"], name="desc")
c.add_balance(df, treatment="treat", variables=["age", "female"], name="bal")
c.add_text("Standard errors clustered at firm level.")
c.save("appendix.docx") # single Word doc, AER book-tab style
c.save("appendix.xlsx") # single workbook, one sheet per item
Collection exposes fluent add_regression / add_table / add_summary /
add_balance / add_text / add_heading (each returns self), plus
list() / get(name) / remove(name) / clear() for inspection. The
public factory sp.collect() is registered with the StatsPAI registry
and visible via sp.help("collect").
Tests¶
tests/test_regtable_alpha.py(6 tests) —alphacontrols CI label and width;esttabparity; recompute matchesscipy.stats.t.ppfby hand.tests/test_aer_word_style.py(6 tests) — OOXML reverse-checks the three rules, asserts no inner vertical borders, italic notes.tests/test_paper_tables_export.py(5 tests) — multi-panel docx / xlsx round-trip with book-tab borders.tests/test_collection.py(18 tests) — construction, chained adds, duplicate-name guard, all five export formats, registry presence.
Files changed¶
src/statspai/output/estimates.py—_ModelDatagainsdf_residslot;_ci_bounds(model, var, alpha)helper;EstimateTable/esttabacceptalpha.src/statspai/output/regression_table.py—RegtableResultaccepts and usesalpha;to_wordrewritten to use_aer_stylehelpers;MeanComparisonResult.to_wordlikewise.src/statspai/output/sumstats.py—_sumstats_to_worduses_aer_style.src/statspai/output/tab.py—_tab_to_worduses_aer_style.src/statspai/output/paper_tables.py—PaperTables.to_docx/to_xlsxadded;paper_tables()acceptsdocx_filename=/xlsx_filename=.src/statspai/output/_aer_style.py— new, OOXML border manipulation + book-tab typography helpers.src/statspai/output/collection.py— new,Collectionclasscollect()factory.src/statspai/output/__init__.py— exportCollection,CollectionItem,collect.src/statspai/__init__.py— exportCollection,CollectionItem,collect; add to public__all__.src/statspai/registry.py— registercollectundercategory="output".
2026-04-24 — HDFE LSMR/LSQR solver + ⚠️ Heckman SE correctness fix¶
⚠️ Correctness fix — sp.heckman two-step standard errors¶
Affected: sp.heckman(...) — the Heckman (1979) two-step selection
model. Point estimates are unchanged; standard errors, t-statistics,
p-values and confidence intervals change, and model_info['sigma'] /
model_info['rho'] now use the correct Greene (2003) formula.
What was wrong. Before v1.6.6, sp.heckman reported an ad-hoc
HC1-style sandwich that (a) ignored the selection-induced
heteroskedasticity Var(y | X, D=1) = σ²(1 − ρ² δ_i), and (b) treated
the inverse Mills ratio λ̂ as a known regressor, ignoring the
first-stage probit estimation error in γ̂ — the "generated regressor"
problem. The code itself flagged this as
"Heckman SEs are complex; robust is conservative". It was a known
limitation, not a false belief; this release upgrades it from
approximate-conservative to textbook-correct.
The fix. sp.heckman now computes the Heckman (1979) / Greene
(2003, eq. 22-22) / Wooldridge (2010, §19.6) analytical variance:
where δ_i = λ̂_i (λ̂_i + Z_iγ̂) ≥ 0, D_δ = diag(δ_i),
F = X*' D_δ Z, and V̂_γ = (Z' diag(w_i) Z)⁻¹ is the probit
information-based variance of γ̂. Consistent σ̂² is
σ̂² = RSS/n_sel + β̂_λ² · mean(δ_i) (Greene 22-21), replacing the
old naive RSS/(n−k). The probit IRLS helper _probit_fit now also
returns V̂_γ for consumption by the second-stage SE computation.
What you'll see. Heckman SEs will generally be smaller than
before when selection is strong (the heteroskedastic factor
1 − ρ² δ_i ≤ 1 trims the structural-error contribution) and
larger when the exclusion restriction is weak (generated-regressor
uncertainty dominates). Match is to Stata's heckman ..., twostep
output and R's sampleSelection::heckit to the documented formula
precision.
Added — test coverage (Heckman)¶
tests/reference_parity/test_heckman_se_parity.py: three tests pinning β̂ and SE to a hand-computed implementation of the Greene (2003) formula, plus a check thatmodel_info['sigma']/rhoexpose the consistent σ̂² estimator.
Fixed¶
src/statspai/regression/heckman.py::heckman— replace naive HC1 sandwich with the Heckman (1979) two-step analytical variance.src/statspai/regression/heckman.py::_probit_fit— now returns(γ̂, V̂_γ); avoids allocating an n×ndiag(w)via broadcasting.
Added — HDFE LSMR/LSQR solver option (additive, pyreghdfe parity)¶
sp.hdfe_ols/sp.absorb_ols/sp.Absorber/sp.demeannow acceptsolver={"map", "lsmr", "lsqr"}(default"map", unchanged)."lsmr"/"lsqr"build a sparse FE design matrix and delegate the within-projection toscipy.sparse.linalg.lsmr/lsqr. Weighted regression uses the standard √w transformation applied to both the sparse design and the response. No new runtime dependency — scipy is already core.- Covers the feature surface of
pyreghdfe: multi-way FE OLS, robust / multi-way cluster SE, singleton drop, weights, Krylov solvers.pyreghdfecan now be archived withsp.hdfe_olsas a strict replacement (seeMIGRATION.md). - New cross-solver parity tests in
tests/test_hdfe_native.pyverify MAP ≡ LSMR ≡ LSQR toatol=1e-6on two-way FE OLS (with and without weights, with and without clustering). MIGRATION.mdgained a "Migrating frompyreghdfe" section with full API mapping.
Behavior¶
- HDFE default solver remains
"map"— all HDFE numerical output (MAP path) is byte-identical to v1.6.5.
[1.6.5] — 2026-04-24 — ⚠️ Standalone LIML correctness fix (follow-up to v1.6.4)¶
⚠️ Correctness fix — standalone sp.liml / sp.iv.liml¶
Affected: the standalone LIML entry point
sp.liml(...) = sp.iv.liml(...) in statspai.regression.advanced_iv.
This is a separate code path from the 2SLS/LIML/Fuller dispatcher
fixed in v1.6.4 (sp.ivreg(method='liml') went through the correct
_k_class_fit implementation and was fixed in the previous release;
the standalone sp.liml did not).
Not affected: sp.ivreg, sp.iv.iv, sp.iv.fit,
sp.ivreg(method='liml' | 'fuller' | '2sls') — all already correct
as of v1.6.4.
What was wrong. Two independent bugs in the standalone LIML:
- κ (Anderson LIML eigenvalue) used non-symmetric solver: the code
called
np.linalg.eigvals(np.linalg.inv(A) @ B)on a non-symmetric product, which can silently return complex eigenvalues and produces a biased κ. This is the same bug fixed iniv.py::_liml_kappain an earlier release, but the standalone module was an orphan copy that never got the fix. Point estimates β̂ were biased as a result. - Cluster / robust SE meat used raw X: same bug as v1.6.4, just in
a different module. Sandwich meat is now built from the k-class
transformed regressor
AX = (I − κ M_Z) X.
The fix.
- κ now computed via
scipy.linalg.eigh(S_exog, S_full)(generalized symmetric eigenvalue problem), aligned withiv.py::_liml_kappa. Falls back to 2SLS (κ = 1) with a warning if the solver returns an implausible κ < 1. - Cluster / robust SE meat now uses
AX = I_kMz @ X_all, matching the FOCX' (I − κ M_Z) (y − X β) = 0.
What you'll see. Users who called sp.liml(...) directly will see
both β̂ and SE change compared to ≤ v1.6.4. After the fix,
sp.liml(...) and sp.ivreg(..., method='liml') produce byte-identical
output, and both agree with linearmodels.IVLIML on β̂ to machine
precision. Cluster SEs differ from linearmodels.IVLIML by ~0.1–0.2%
due to a convention choice (StatsPAI uses the k-class FOC-derived
meat AX; linearmodels uses the 2SLS-style meat X̂ = P_Z X
regardless of κ). The two are asymptotically equivalent and coincide
at κ = 1 (2SLS).
Added — test coverage¶
tests/reference_parity/test_liml_se_parity.py: four tests — hand-computed projected-meat formula match,sp.limlvssp.ivreg(method='liml')internal consistency (byte-exact), andlinearmodels.IVLIMLparity with documented convention tolerance.
Fixed¶
src/statspai/regression/advanced_iv.py::liml— κ solver aligned toscipy.linalg.eighon the symmetric generalized eigenvalue problem; cluster / robust meat now uses projectedAX = I_kMz @ X_all.
[1.6.4] — 2026-04-24 — ⚠️ IV SE correctness fix¶
⚠️ Correctness fix — IV cluster & robust standard errors¶
Affected: sp.iv, sp.ivreg, sp.iv.fit(method='2sls' | 'liml' | 'fuller')
— any call that passes robust={'hc0','hc1','hc2','hc3'} or cluster=.
Not affected: point estimates β̂ are unchanged; nonrobust (default)
standard errors are unchanged; GMM (method='gmm'), JIVE (method='jive'),
and the JIVE variants (ujive/ijive/rjive) are unchanged (they already
used the correct formula).
What was wrong. The 2SLS / LIML / Fuller k-class sandwich meat was
computed with the unprojected regressor matrix X = [X_exog, X_endog]
instead of the projected X̂ = P_W X. The bread
(X' A X)^{-1} = (X̂' X̂)^{-1} was correct; the bug was in
src/statspai/regression/iv.py::_cluster_cov / ::_robust_cov call
sites which passed X_actual where the parameter (already misleadingly
named X_hat) expected the projected regressor.
This deviated from Cameron & Miller (2015), Stata ivregress,
ivreg2 (Baum–Schaffer–Stillman 2007), and linearmodels. The
magnitude of the error depends on first-stage fit: weaker instruments
→ larger inflation of the reported SE. On the audit DGP (n=1000,
40 clusters, moderate first stage) the reported SE on the endogenous
coefficient was 2.46× too large.
The fix. _k_class_fit now computes AX = A @ X_actual and passes
it to _cluster_cov / _robust_cov. For 2SLS (κ=1) this yields
AX = P_W X = X̂; for LIML/Fuller it is the k-class transformed
regressor X − κ M_W X that the k-class FOC X' A (y − X β) = 0
dictates. Matches linearmodels IV2SLS with debiased=True to
machine precision.
What you'll see. Reported SEs for cluster / HC0 / HC1 / HC2 / HC3 under 2SLS / LIML / Fuller will decrease (or occasionally increase) compared to v1.6.3 and earlier. t-statistics, p-values, and confidence intervals will change accordingly. If you cite SEs from StatsPAI IV in a paper, re-run and update the numbers before submission.
Added — test coverage¶
tests/reference_parity/test_iv_se_parity.py: six tests pinning 2SLS cluster / HC0 / HC1 to both a hand-computed projected-meat formula (Cameron–Miller) and tolinearmodels.IV2SLSwithdebiased=True. Closes the coverage gap that let this bug live in_cluster_cov/_robust_covsince the module's introduction.
Fixed¶
src/statspai/regression/iv.py::_k_class_fit— pass projectedAX = A @ X_actualto the sandwich meat.
[1.6.3] — 2026-04-24 — DiD frontier sprint¶
Additive release focused on closing gaps in the DiD module. No numerical changes to existing estimators — all new work is either new functions, new registry entries, new tests, or docstring truth-up where the existing docstring had overstated paper fidelity.
Added — new DiD estimators¶
sp.lp_did— Local-Projections DiD (Dube, Girardi, Jordà & Taylor 2023). Per-horizon long-difference OLS with time FE and cluster-robust SE; 'not-yet-treated' or 'never-treated' control variants. Paper bib key pending — reference carries[待核验].sp.ddd_heterogeneous— Heterogeneity-robust triple differences (Olden & Møen 2022 / Strezhnev 2023). CS-style cohort-time decomposition of DDD with a placebo subgroup, aggregated via switcher-count weights.[@olden2022triple]verified via Crossref; Strezhnev bib key pending.sp.did_timevarying_covariates— DiD with covariates frozen at baseline (Caetano, Callaway, Payne & Rodrigues 2022 — paper version[待核验]). Avoids the bad-controls bias when treatment affects the covariates. Per-(g, t) OR-DiD on frozen baseline X, aggregated with cohort-size weights.sp.did_multiplegt_dyn— dCDH (2024) intertemporal event-study DiD MVP. Long-difference per-horizon estimator with not-yet- treated / never-treated controls, cluster-bootstrap SE, joint placebo and overall Wald tests. Anchored to[@dechaisemartin2024difference](DOI verified). Not paper- parity — switch-off events and analytical IF variance are flagged[待核验], covered indocs/rfc/multiplegt_dyn.md.sp.continuous_did(method='cgs')— Callaway-Goodman-Bacon- Sant'Anna (2024) ATT(d)/ACRT(d) MVP. 2-period design, OR only, Nadaraya-Watson-style local linear smoother over dose, bootstrap SE. Anchored to[@callaway2024difference]. Full CGS cohort aggregation + DR/IPW + analytical IF are flagged[待核验]and tracked indocs/rfc/continuous_did_cgs.md.
Added — shared infrastructure¶
statspai.did._core— shared DiD primitives: cluster-bootstrap resampling with collision-safe relabelling, canonical event-study DataFrame shape, influence-function → SE plumbing, joint Wald. Used by the new estimators above; existing estimators retain their in-file copies (refactor is a separate pass). 16 unit tests.
Added — docstring truth-up (non-numerical)¶
sp.continuous_diddocstring no longer claims "equivalent to the methods in Callaway, Goodman-Bacon & Sant'Anna (2024)". The heuristic modes ('twfe','att_gt','dose_response') are explicitly labelled as heuristic; the CGS MVP lives atmethod='cgs'. Method label in returned CausalResult for the dose-bin heuristic changed from "Continuous DID (Callaway et al. 2024)" to "Continuous DID (dose-bin heuristic)" with estimand name updated accordingly.sp.did_multiplegtdocstring explicitly notes itsdynamic=Hargument is a pair-rollup extension, not equivalent to the dCDH (2024)did_multiplegt_dynestimator (which is now a separate function,sp.did_multiplegt_dyn).
Added — test coverage¶
tests/test_continuous_did_heuristics.py— 11 tests coveringmethod='att_gt'andmethod='dose_response'paths that previously had zero dedicated tests.tests/test_did_core_primitives.py— 16 unit tests for_core.py.tests/test_lp_did.py— 9 tests forsp.lp_did.tests/test_ddd_heterogeneous.py— 7 tests forsp.ddd_heterogeneous.tests/test_did_timevarying_covariates.py— 6 tests.tests/test_did_multiplegt_dyn.py— 10 tests including method-label MVP warning.tests/test_continuous_did_cgs.py— 8 tests including recovery on linear dose-response DGP.tests/reference_parity/test_did_multiplegt_parity.py— skeleton with R fixture script template; skipped untiltests/reference_parity/fixtures/did_multiplegt/*.jsoncommitted.
Added — registry¶
Rich hand-written FunctionSpec entries with agent-card metadata
(assumptions, failure modes with alternative pointers, pre-conditions,
typical_n_min) for 18 previously auto-registered DiD estimators:
did_2x2, drdid, sun_abraham, did_imputation, wooldridge_did,
etwfe, bacon_decomposition, ddd, cic, stacked_did,
event_study, did_analysis, harvest_did, overlap_weighted_did,
cohort_anchored_event_study, design_robust_event_study,
did_misclassified, did_bcf, plus rich entries for the five new
functions above. One fabricated bib key (roth2023trustworthy)
detected and removed during self-review; replaced with [待核验].
Added — documentation¶
docs/guides/choosing_did_estimator.md§4.5 Frontier estimators section distinguishes shipped vs. partial vs. not-yet-landed work and cross-links all three RFC documents. Makes explicit thatsp.did_multiplegt(dynamic=H)is not the dCDH (2024)_dynestimator.
Fixed — citation hygiene¶
paper.bib:dechaisemartin2022fixedupgraded from the SSRN working-paper stub to the published Econometrics Journal 26(3): C1–C30 (2023) version, DOI10.1093/ectj/utac017. Verified via two independent Crossref queries per CLAUDE.md §10 two-source rule.
[1.6.2] — 2026-04-23 — DiD-frontier registry coverage¶
Patch release. Pure-additive: no numerical behaviour changes. Closes a
registry-coverage gap for two already-shipping DiD estimators that were
callable but invisible to sp.list_functions() / sp.describe_function() /
agent discovery (CLAUDE.md §4). Adds the supporting RFC design documents
under docs/rfc/ so the registry reference / remedy pointers resolve.
Added — registry & agent discoverability¶
sp.continuous_didis now registered. DiD with continuous treatment intensity, exposing three modes: (i) TWFE with dose×post interaction, (ii) dose-quantile group-time ATT vs. the untreated (dose=0) arm with bootstrap SE, (iii) local-linear dose-response of ΔY on baseline dose. Callaway, Goodman-Bacon & Sant'Anna (2024) analytical influence-function inference is on the v1.7 roadmap — seedocs/rfc/continuous_did_cgs.md.sp.did_multiplegtis now registered. de Chaisemartin & D'Haultfœuille (2020) DID_M estimator for treatments that switch on and off (unlike Callaway–Sant'Anna which assumes staggered adoption). Supports placebo lags, dynamic horizons, joint placebo Wald test, and cluster-bootstrap SE. The full dCDH (2024) intertemporal event-study (Statadid_multiplegt_dyn) is on the v1.7 roadmap — seedocs/rfc/multiplegt_dyn.md.docs/rfc/— RFC directory for not-yet-landed design docs. Ships withcontinuous_did_cgs.md,multiplegt_dyn.md,did_roadmap_gap_audit.md, plusREADME.mdand a sprint-prep handoff note (HANDOFF_2026-04-23.md).
Changed¶
- None. No estimator output changes. Existing
sp.continuous_did/sp.did_multiplegtcallers observe identical numerical behaviour.
Fixed¶
- None.
Notes for agents¶
- Both estimators now surface in
sp.list_functions()and expose fullParamSpec/FailureMode/alternativesmetadata throughsp.describe_function(). Registered count rises from 923 to 925.
[1.6.1] — 2026-04-23 — CI/CD green-up¶
Patch release. No user-facing behavior or numerical change — all three
fixes target CI matrix reliability. The hashlib.md5 change is
digest-byte-identical to v1.6.0 (verified by assert); embed_texts /
sp.text_treatment_effect outputs are bit-for-bit unchanged.
Fixed — CI/CD green-up¶
- Bandit security gate —
src/statspai/causal_text/_common.pyhashing call now passesusedforsecurity=Falsetohashlib.md5. The digest is used as a deterministic bucket index for hashed-token embeddings, not a security primitive; the flag tells Bandit B324 (CWE-327) that weak-hash use is intentional. Digest bytes are identical to the prior call — no numerical change toembed_textsorsp.text_treatment_effect. - Windows path-separator parity —
tools/audit_bib_coverage.py::_relnow emits POSIX-style paths viaPath.as_posix(), so thecitations_by_keyreport is identical across Windows and POSIX runners. Fixestest_build_report_records_citation_locationsonwindows-latest. - Windows CLI subprocess —
tests/test_suggest_bibkey_backfills.pynow merges the child-process environment withos.environ(soPATHsurvives) before invoking the tool. WindowsCreateProcesshas no_CS_PATHfallback like POSIXexecvpe, so an empty-env child cannot resolvegit.exe. Fixestest_cli_dry_run_does_not_mutateonwindows-latest.
[1.6.0] — 2026-04-21 — P1 Agent-Native × Frontier + Agent-Native Infrastructure¶
Pure-additive release pushing two competitive axes:
- Agent-native — closed-loop LLM-DAG, end-to-end
sp.paper()pipeline, full registry/agent-card metadata for every new function, typed exception taxonomy (StatsPAIError+ 6 subclasses) withrecovery_hint/diagnostics/alternative_functionspayloads, result-object.violations()/.to_agent_summary()methods, and auto-generated## For Agentsblocks in every flagship guide. - Methodological frontier — five post-2020 Mendelian-randomization
estimators (
mr_lap,mr_clust,grapple,mr_cml,mr_raps), long-panel Double-ML (sp.dml_panel), constrained LLM-assisted PC discovery, and twocausal_textMVPs (text-as-treatment, LLM-annotator measurement-error correction).
Together: one new top-level pipeline (sp.paper), four new LLM-aware
dag/text estimators (sp.llm_dag_constrained, sp.llm_dag_validate,
sp.text_treatment_effect, sp.llm_annotator_correct), constrained
PC discovery (sp.pc_algorithm(forbidden=, required=)), five MR
frontier estimators (sp.mr_lap etc.), one long-panel DML estimator
(sp.dml_panel), 36 populated agent cards (was 0 pre-v1.5.1), and 26
## For Agents blocks across 19 guides.
Added — P1-A: closed-loop LLM-assisted causal discovery¶
sp.llm_dag_constrained— iterate propose → constrained PC → CI-test validate → demote until convergence ormax_iter. Returns per-edgellm_score+ci_pvalue+source(required/forbidden/demoted/ci-test) so every kept edge is justified by both the LLM prior and the data.result.to_dag()round-trips intostatspai.dag.DAGfor downstreamrecommend_estimator().sp.llm_dag_validate— per-edge CI-test audit of any declared DAG; flags edges whose implied conditional independence is consistent with the data (i.e. the edge looks spurious).sp.pc_algorithm(forbidden=, required=)— background-knowledge constraints injected into PC. DefaultNonepreserves the prior contract bit-for-bit. Required edges win over forbidden when both reference the same pair.- 18 new tests (
tests/test_llm_dag_loop.py). - Family guide:
docs/guides/llm_dag_family.md.
Added — P1-C: data → publication-draft pipeline¶
sp.paper(data, question, ...)— orchestrator on top ofsp.causal()that parses a natural-language question, runs the fulldiagnose → recommend → estimate → robustnesspipeline, and assembles a 7-sectionPaperDraft(Question / Data / Identification / Estimator / Results / Robustness / References).PaperDraftwithto_markdown()/to_tex()/to_docx()/write(path)/to_dict()/summary()and aparsed_hintsattribute exposing what the question parser extracted.- Lightweight question parser (
statspai.workflow.paper.parse_question) recognises "effect of X on Y", "Y ~ X", DiD / RD / IV / RCT design hints, "instrument(ing)Z", "discontinuity atc", "running variableX". Explicit kwargs always win. - Per-section failure isolation: a failed estimator stage yields a "Pipeline notes" section rather than crashing the draft.
- 27 new tests (
tests/test_paper_pipeline.py). - Family guide:
docs/guides/paper_pipeline.md.
Added — P1-B: sp.causal_text (experimental MVP)¶
sp.text_treatment_effect— Veitch-Wang-Blei (2020 UAI, MVP) text-as-treatment ATE via embedding-projected OLS with HC1 SEs. Hash embedder default (deterministic, dependency-free); lazysbertoptional viapip install sentence-transformers; custom callable embedder also supported.sp.llm_annotator_correct— Egami-Hinck-Stewart-Wei (2024) measurement-error correction for binary LLM-derived treatments. Hausman-style: estimatep_01/p_10on a hand-validated subset (≥30 rows spanning both classes), divide naive coefficient by1 - p_01 - p_10. First-order SE correction; raisesIdentificationFailurewhen the LLM has no information.- Both methods subclass
CausalResult, surfacestatus: "experimental"inresult.diagnostics, and ship full agent-card metadata (assumptions/pre_conditions/failure_modes/alternatives). - 20 new tests (
tests/test_causal_text.py). - Family guide:
docs/guides/causal_text_family.md.
Added — MR Frontier (src/statspai/mendelian/frontier.py)¶
sp.mr_lap— Sample-overlap-corrected IVW (Burgess, Davies & Thompson 2016 closed-form bias correction; conceptually aligned with the Mounier-Kutalik 2023 MR-Lap). Required inputs:overlap_fractionandoverlap_rho(e.g. from LD-score regression).overlap=0exactly reproduces naive IVW.sp.mr_clust— Clustered Mendelian randomization via finite Gaussian mixture on Wald ratios (Foley, Mason, Kirk & Burgess 2021). EM with SNP-specific measurement SE, optional null cluster at θ=0, BIC-selected K. Returns per-cluster estimates, SNP-to-cluster responsibilities, and the BIC path.sp.grapple— Profile-likelihood MR with joint weak-instrument and balanced-pleiotropy robustness (Wang, Zhao, Bowden, Hemani et al. 2021, single-exposure variant). Jointly MLE over causal β and pleiotropy variance τ² via L-BFGS-B; SE from observed Fisher info.sp.mr_cml— Constrained maximum-likelihood MR with L0-sparse pleiotropy, MR-cML-BIC variant (Xue, Shen & Pan 2021). Block- coordinate descent jointly updates causal β, true exposure effects, and a K-sparse pleiotropy vector; K selected by BIC.sp.mr_raps— Robust Adjusted Profile Score (Zhao, Wang, Hemani, Bowden & Small 2020, Annals of Statistics 48(3)). Profile-likelihood MR with Tukey biweight loss + log-variance adjustment; same structural model as GRAPPLE but resistant to gross pleiotropy outliers. Sandwich SE from M-estimator formula.
Added — v1.7 long-panel DML (src/statspai/dml/panel_dml.py)¶
sp.dml_panel— Long-panel Double/Debiased ML for static panel models with fixed effects (Clarke & Polselli 2025 simplified). Absorbs unit (and optional time) fixed effects via within-transform, cross-fits ML nuisance learners with folds that split units (Liang-Zeger compatible), reports cluster-robust SE at the unit level. PLR moment for continuous or binary treatment; empty-covariate fallback reduces to pure FE-OLS. (Citation corrected post-v1.15: the original v1.7 release note attributed this estimator to a "Semenova-Chernozhukov 2023 Econometrics Journal 26(2)" paper that does not exist; the actual reference is Clarke & Polselli (2025) ECTJ 29(1) 69-86, DOI 10.1093/ectj/utaf011, arXiv:2312.08174. See [Unreleased] for the full audit.)
Added — dispatcher + registry wiring¶
sp.mr(method=...)routesmr_lap | lap | sample_overlap,mr_clust | clust | clustered,grapple | profile_likelihood,mr_cml | cml | constrained_ml,mr_raps | raps | robust_profile_scoreto the new estimators.- All six new functions (5 MR +
dml_panel) registered inregistry.pywith fullParamSpecmetadata, category, tags, and reference.sp.describe_function,sp.function_schema, andsp.agent_cardcover them.
Added — tests¶
tests/test_mr_frontier.py— 41 tests covering correctness, boundary validation, cross-method consistency (mr_lapwithoverlap=0== IVW;mr_cmlwithK=0≈ IVW;mr_clusttwo-cluster DGP;mr_rapsoutlier-robustness vs IVW), dispatcher routing, and registry/schema export.tests/test_dml_panel.py— 13 tests covering recovery under homogeneous treatment, FE-OLS agreement in the no-confounding limit, cluster-SE vs iid SE under AR(1) within-unit correlation, time-FE option, boundary validation, and registry metadata.
Deferred (originally scoped for v1.6)¶
- CAUSE (Morrison et al. 2020) — the full variational-Bayes
implementation is ~5000 LOC in the R reference and cannot be
reference-parity validated in-cycle. Replaced with
mr_cml(same use-case: robust to correlated and uncorrelated pleiotropy). CAUSE will land in a later release once reference-parity infrastructure is in place.
Agent-native infrastructure (foundation for v1.6.0)¶
Every layer now speaks in structured data with recovery hints, not prose — this is the foundation the P1 frontier estimators above build on.
Added — agent-native exception taxonomy (statspai.exceptions)¶
StatsPAIErrorroot +AssumptionViolation/IdentificationFailure/DataInsufficient/ConvergenceFailure/NumericalInstability/MethodIncompatibility, each carryingrecovery_hint, machine-readablediagnostics, and a rankedalternative_functionslist.- Warning counterparts:
StatsPAIWarning/ConvergenceWarning/AssumptionWarningplus a rich-payloadsp.exceptions.warn()helper. - Domain errors subclass
ValueError/RuntimeErrorfor backwards compatibility with existingexceptblocks. No estimator behavior changes — migration of existingValueError/RuntimeErrorcall sites will follow incrementally.
Added — agent-native registry schema¶
FunctionSpecextended withassumptions/pre_conditions/failure_modes/alternatives/typical_n_min(all optional).- New
FailureModedataclass:(symptom, exception, remedy, alternative). - New public accessors
sp.agent_card(name)andsp.agent_cards(category=None)returning the superset offunction_schema()plus the agent-native fields. - Flagship families populated:
sp.regress,sp.iv,sp.did,sp.callaway_santanna,sp.rdrobust,sp.synth(was previously auto-registered only).
Added — agent-native methods on result objects¶
CausalResult.violations()andEconometricResults.violations()— inspect stored diagnostics (pre-trend p-value, first-stage F, McCrary, rhat/ESS/divergences, overlap, SMD) and return flagged items withseverity/recovery_hint/alternatives.CausalResult.to_agent_summary()andEconometricResults.to_agent_summary()— JSON-ready structured payload with point estimate, coefficients, scalar diagnostics, violations, and next-steps. Sits alongside existingsummary()(prose) andtidy()(DataFrame).
Added — guide ## For Agents sections¶
- Auto-rendered from registry cards via
sp.render_agent_block(name)andsp.render_agent_blocks(category=…, names=…). scripts/sync_agent_blocks.pyregenerates in-place between<!-- AGENT-BLOCK-START: <name> --> … <!-- AGENT-BLOCK-END -->markers;--checkexits non-zero on drift (CI-friendly).- Wired into four flagship guides so far:
choosing_did_estimator.md(did + callaway_santanna),choosing_iv_estimator.md(iv),choosing_rd_estimator.md(rdrobust),synth.md(synth). - Test guard
tests/test_agent_blocks_drift.pyfails CI if a doc falls out of sync with the registry.
Tests — agent-native infrastructure¶
tests/test_exceptions.py— hierarchy, payload, raise/catch,warn()helper, top-level exposure.tests/test_agent_schema.py— schema mechanics,agent_card/agent_cardsAPIs,FailureMode, parametrized flagship population.tests/test_agent_result_methods.py—violations()/to_agent_summary()on both result classes, JSON round-trip.tests/test_agent_docs.py— renderer output, pipe escaping, empty / non-empty cases.tests/test_agent_blocks_drift.py— CI guard for doc/registry sync.
Added — agent-native follow-up sprint¶
- Eight more flagship agent cards populated:
sp.dml,sp.causal_forest,sp.metalearner,sp.match,sp.tmle,sp.bayes_dml(extended),sp.bayes_did(new hand-register),sp.bayes_iv(new hand-register). Each carries pre-conditions, identifying assumptions, 3–4 failure modes with recovery hints, ranked alternatives, and a typical minimum-N rule of thumb. - Seven more guide AGENT-BLOCKs (13 total across 11 guides now):
choosing_matching_estimator.md(match),callaway_santanna.md/cs_report.md/mixtape_ch09_did.md(callaway_santanna),honest_did.md/repeated_cross_sections.md(did),synth_experimental.md(synth). sp.recommendnow consumes agent cards: every recommendation getsagent_card/pre_conditions/failure_modes/alternatives/typical_n_minfields merged in from the registry. Whenn_obs < typical_n_min, a dedicated warning lands in the top-levelwarningslist pointing tosp.agent_card(name). Hand-codedassumptions/reason/codeare never overwritten — only empty fields are promoted from the card.- First call-site migrations to the typed taxonomy, with
recovery_hint+diagnostics+alternative_functionsattached: sp.did_2x2treat/time cardinality →MethodIncompatibilitysp.did_analysis(method='cs'/'sa')missingid→MethodIncompatibilitysp.misclassified_didno cohorts / no never-treated →DataInsufficient- IV under-identification (all 3 k-class paths) →
MethodIncompatibility - IV singular k-class matrix →
NumericalInstability sp.bayes_dmlnon-positive DML SE →NumericalInstability- Latent registry bug fixed —
_build_registry()usedif _REGISTRY: returnas its idempotence gate, which silently skipped hand-written specs whenever any caller ranregister()first (e.g. test fixtures). Replaced with a dedicated_BASE_REGISTRY_BUILTsentinel so flagship agent-native fields survive arbitrary registration order. - New tests:
tests/test_recommend_agent_cards.py(5 tests),tests/test_exception_migrations.py(7 tests). All existing registry / help / DID / IV / synth / matching / DML / meta-learner / Bayesian-DID / TMLE / causal-forest / agent-native suites continue to pass.
Added — agent-native round 3 (v1.6 sprint)¶
- Nine more flagship agent cards:
sp.dml_panel(v1.7 long panel DML),sp.proximal(+ bidirectional/fortified PCI alternatives exposed),sp.mr(dispatcher for the full MR family),sp.qdid,sp.qte,sp.dose_response,sp.spillover,sp.multi_treatment,sp.network_exposure.sp.agent_cards()now returns 30 populated entries (was 19 after the prior sprint). - Thirteen more guide
## For Agentsblocks (26 total across 19 guides):proximal_family.md,mendelian_family.md,qte_family.md(qte + qdid),interference_family.md(spillover + network_exposure),harvest_did.md(did + callaway_santanna),causal_text_family.md(text_treatment_effect + llm_annotator_correct),llm_dag_family.md(llm_dag_constrained + llm_dag_validate),paper_pipeline.md(paper). paperspec cleanup —alternativesentries now use bare function names ("causal","recommend") instead of prose strings, so the renderer emitssp.causalrather thansp.sp.causal: ....- Six more call-site exception migrations with recovery hints:
sp.matchnon-binary treatment →MethodIncompatibilitypointing atsp.multi_treatment/sp.dose_responsesp.matchall-same treatment →DataInsufficientsp.ebalance< 2 treated-or-control →DataInsufficientsp.dml(model='irm')non-binary D →MethodIncompatibilitysp.dml(model='irm')constant D →IdentificationFailuresp.conformal_synth/sp.augsynthinsufficient pre/post periods →DataInsufficient- 6 new migration tests added to
tests/test_exception_migrations.py(13 total now). All existing DID / IV / matching / DML / meta-learners / TMLE / synth / Bayesian family suites (363 tests total) continue to pass.
Added — agent-native round 4 (v1.6 closed-loop)¶
- Seven more flagship agent cards:
sp.principal_strat(extended),sp.mediate,sp.bartik,sp.bayes_rd,sp.bayes_fuzzy_rd,sp.bayes_mte,sp.conformal(extended).sp.agent_cards()now returns 36 populated entries (30 → 36). - Two more guide
## For Agentsblocks (28 total across 21 guides):conformal_family.md(conformal),shift_share_political_panel.md(bartik). Drift-check passes. - Six more exception migrations with recovery hints:
sp.gsynth< 3 pre-periods →DataInsufficientpointing atsp.synth/sp.didsp.gsynth< 1 post-period →DataInsufficientsp.sbwnon-binary treatment →MethodIncompatibilitypointing atsp.multi_treatment/sp.dose_responsesp.optimal_matchmissing control arm →DataInsufficientsp.synth_survivalno donor →DataInsufficient- Closed-loop
sp.diagnose_result: the diagnostic battery output now also carries: violations— the structured output ofresult.violations()(already surfaces pre-trend / first-stage F / McCrary / rhat / ESS / divergences / overlap / SMD with severity + recovery_hint),next_steps— the output ofresult.next_steps(print_result=False). The printed version includes a new "Structured violations (agent-native)" section below the family battery so humans and agents see the same triage picture. Backwards compatible: the existingmethod_type/checkskeys are untouched.- 3 new migration tests + 8 new closed-loop tests added to
tests/test_exception_migrations.pyandtests/test_diagnose_result_closed_loop.py. - Self-audit fix: the
rdrobustcard's alternatives list usedrd_donut(not exposed as a top-level function); replaced withrdrbounds. Doc block re-synced; drift-check green.
Final tally (rounds 1 – 4 combined)¶
- 36 populated agent cards covering: regression / IV / DID / RD / synth / matching / DML / meta-learners / TMLE / Bayesian (DID/IV/DML/RD/fuzzy-RD/MTE) / proximal / MR / principal strat / mediation / Bartik / QTE / QDID / dose-response / spillover / multi-treatment / network exposure / conformal / DML panel / paper / causal text / LLM-DAG.
- 28
## For Agentsblocks across 21 guides, rendered bypython scripts/sync_agent_blocks.pywith a CI drift guard. - 19 call-site exception migrations to the typed taxonomy
(
MethodIncompatibility,DataInsufficient,IdentificationFailure,NumericalInstability) across DID / IV / DML / matching / synth / Bayes. All still inherit fromValueError/RuntimeError, so existingexceptblocks work unchanged. - Closed-loop
sp.diagnose_resultbridges fit → violations → next_steps in one call, merging the family battery with the structured agent-native view.
Migration notes¶
This release is purely additive. Existing call sites that catch
ValueError continue to catch AssumptionViolation /
DataInsufficient / MethodIncompatibility /
IdentificationFailure; catching RuntimeError continues to catch
ConvergenceFailure and NumericalInstability. New code in
StatsPAI should prefer the specific subclasses and attach a
recovery_hint so agents can act on failures without parsing
error strings.
[1.5.0] — 2026-04-21 — Interference / Conformal / Mendelian family consolidation¶
Minor release. Three concurrent improvements to the interference,
conformal causal inference, and Mendelian Randomization families:
full-family documentation guides, unified dispatchers matching the
sp.synth / sp.decompose / sp.dml pattern, and a targeted
correctness audit that surfaced and fixed two silent-wrong-numbers
issues.
Added — three new family guides (interference / conformal / MR)¶
docs/guides/interference_family.md— complete walkthrough ofsp.spillover,sp.network_exposure,sp.peer_effects,sp.network_hte,sp.inward_outward_spillover,sp.cluster_matched_pair,sp.cluster_cross_interference,sp.cluster_staggered_rollout,sp.dnc_gnn_did. Decision tree covering partial / network / cluster-RCT designs with the 5 diagnostics every interference analysis should report (exposure balance, identification check for peer_effects, overlap for network_hte, parallel trends for staggered-cluster, sensitivity to exposure function).docs/guides/conformal_family.md— complete walkthrough ofsp.conformal_cate,sp.weighted_conformal_prediction,sp.conformal_counterfactual,sp.conformal_ite_interval,sp.conformal_density_ite,sp.conformal_ite_multidp,sp.conformal_debiased_ml,sp.conformal_fair_ite,sp.conformal_continuous,sp.conformal_interference. Clarifies the distinction between marginal and conditional coverage, with per-tool "when to use it" + how-to-read-disagreement guidance.docs/guides/mendelian_family.md— complete walkthrough of all 17 MR functions (4 point estimators + 6 diagnostics + 3 multi-exposure extensions + instrument-strength F + 2 plots), organised around the IV1 / IV2 / IV3 assumption hierarchy. Ships the 4 sanity checks every MR analysis should report and a worked BMI → T2D example.
Each guide is linked from mkdocs.yml under Guides and surfaces via
sp.search_functions().
Added — unified family dispatchers¶
Three new top-level dispatchers mirroring the style of sp.synth /
sp.decompose / sp.dml:
-
sp.mr(method=..., ...)— single entry point for the 17-function Mendelian Randomization family. Supportsmethod ∈ {"ivw", "egger", "median", "penalized_median", "mode", "simple_mode", "all", "mvmr", "mediation", "bma", "presso", "radial", "leave_one_out", "steiger", "heterogeneity", "pleiotropy_egger", "f_statistic", ...}with aliases. kwargs pass through to the target function.sp.mr_available_methods()lists all aliases. -
sp.conformal(kind=..., ...)— single entry point for the 10-function conformal causal inference family. Supportskind ∈ {"cate", "counterfactual", "ite", "weighted", "density", "multidp", "debiased", "fair", "continuous", "interference", ...}.sp.conformal_available_kinds()lists all aliases. -
sp.interference(design=..., ...)— single entry point for the 9-function interference / spillover family. Supportsdesign ∈ {"partial", "network_exposure", "peer_effects", "network_hte", "inward_outward", "cluster_matched_pair", "cluster_cross", "cluster_staggered", "dnc_gnn", ...}.sp.interference_available_designs()lists all aliases.
All three dispatchers are registered with hand-written schemas so
sp.describe_function("mr") / "conformal" / "interference" return
agent-readable descriptions. 30 new tests in
tests/test_dispatchers_v150.py guarantee the dispatcher path and the
direct-call path produce byte-for-byte identical results.
⚠️ Breaking — sp.mr is now a function, not a module alias¶
Prior to v1.5.0 sp.mr was a reference to the statspai.mendelian
submodule (from . import mendelian as mr), so sp.mr.mr_ivw(...)
worked. v1.5.0 replaces this with the new dispatcher function
sp.mr(method=..., ...).
Migration: code that previously wrote sp.mr.mr_ivw(bx, by, sx, sy)
should use the top-level sp.mr_ivw(bx, by, sx, sy) (already exported
in every prior version) or the new sp.mr("ivw", beta_exposure=bx, ...)
dispatcher. The module is still accessible as sp.mendelian for users
who were doing submodule-level introspection.
Updated references: the only in-repo consumer of the old
sp.mr.mr_ivw form was tests/reference_parity/test_mr_parity.py,
which has been migrated to top-level calls. All external user code
that already uses sp.mr_ivw / sp.mendelian_randomization / etc
continues to work unchanged.
Fixed — silent wrong numbers (correctness audit)¶
-
sp.mr_egger— slope inference used Normal, not t(n−2). The companionsp.mr_pleiotropy_eggercorrectly usedt(n−2)for the Egger intercept p-value, butmr_eggeritself usedstats.norm.cdffor both the slope p-value and the slope CI's critical value. This was anti-conservative at smalln_snps: e.g. forn_snps = 5and a t-stat of 1.5, the Normal-based two-sided p is 0.134 whereas the correct t(3)-based p is 0.231.mendelian_randomization(..., methods=["egger"])inherited the bug through its internal call. The fix switches both the p-value and the CI critical value tot(n−2). Regression guard intests/test_correctness_v150.py::TestMREggerUsesTDistribution. Forn_snps ≥ 100the change is numerically invisible (< 1e-3 in p). -
sp.mr_presso— MC p-value could equal exactly 0. Both the global test p-value and the per-SNP outlier p-values used the rawmean(null >= obs)form, which collapses to0.0when the observed statistic exceeds every simulated null. An MC-estimated p-value cannot be zero — its true lower bound is1 / (B + 1). The fix switches to the standard(k + 1) / (B + 1)convention (matching R'sMR-PRESSOpackage). Downstream effect: reported p-values are now always strictly positive and in[1/(B+1), 1], which prevents log-transforms and sensitivity analyses from silently producing-inf. Regression guard intests/test_correctness_v150.py::TestMRPressoMCPvalueConvention.
Fixed — dead code¶
sp.network_exposure._ht_estimatecontained a dimensionally inconsistentvar = ...expression that was immediately overwritten by the conservative Aronow-Samii Theorem 1 boundvar_as = .... The dead line is removed; the reported SE is unchanged.
Fixed — registry coverage¶
Five previously-exposed-but-unregistered family functions now surface
in sp.list_functions() and have agent-readable schemas via
sp.describe_function():
sp.network_exposure(Aronow-Samii HT)sp.peer_effects(Bramoullé-Djebbari-Fortin 2SLS)sp.weighted_conformal_prediction(TBCR 2019 primitive)sp.conformal_counterfactual(Lei-Candès Theorem 1)sp.conformal_ite_interval(Lei-Candès Eq. 3.4 nested bound)
No other API changes¶
Every other public signature is byte-for-byte identical to v1.4.2.
Existing user code keeps working; upgrades reveal slightly wider Egger
CIs at small n_snps and strictly positive mr_presso p-values.
[1.4.2] — 2026-04-21 — correctness patches + family guides¶
Patch release. No breaking changes; two silent-wrong-numbers bug
fixes in dml_model_averaging and gardner_did, plus three new
family guides (Proximal / QTE / Causal RL) closing the last gaps
between the v3 reference document and the documentation.
Fixed — silent wrong numbers¶
sp.dml_model_averaging— √n SE scaling bug. The cross-candidate variance aggregator treated the sample-mean influence-function outer product asVar(θ̂_avg)directly, missing a final/ n. Net effect: reported SEs were√ntimes too large; on the canonical n=400 DGP the 95% CI width was 4.20 (nominal ≈ 0.21) and empirical coverage was 100% (nominal 95%). After the fix, CI width is 0.21 and coverage is 82% (≈ nominal, with the remaining gap explained by a 4% small-sample bias in the point estimate — a nuisance-tuning issue, not a variance-formula issue). Regression guard added totests/test_dml_model_averaging.py::test_se_on_correct_scale.sp.gardner_did— event-study reference-category contamination. The Stage-2 dummy regression pooled never-treated units and treated units outside the event-study horizon into a single baseline, dragging every event-time coefficient toward the mean of that pool. On a synthetic panel with true τ=2 and strict parallel trends, pre- trends came out ≈ -0.30 (should be 0) and post ≈ +1.72 (should be 2.0). Replaced the Stage-2 regression in event-study mode with direct Borusyak-Jaravel-Spiess-style within-(cohort × relative-time) averaging of the imputed gap. After the fix: pre-trends ≈ +0.01, post ≈ +2.02. Non-event-study path (single ATT) was already correct and is unchanged.
Added — family guides¶
docs/guides/proximal_family.md— complete walkthrough of the Proximal Causal Inference family:sp.proximal,sp.fortified_pci,sp.bidirectional_pci,sp.pci_mtp,sp.double_negative_control,sp.proximal_surrogate_index,sp.select_pci_proxies. Includes a decision tree ("got 1 Z + 1 W / bridges sensitive to spec / unsure which is Z vs W / continuous treatment + shift policy / only have negative controls / want long-term from short-term experiment / have candidate proxies") and the four diagnostics every PCI analysis should report.docs/guides/qte_family.md— the three granularity levels (mean → quantile → whole distribution), with cross-sectional / DiD / IV / panel-with-many-controls decision paths coveringsp.qte,sp.qdid,sp.cic,sp.distributional_te,sp.dist_iv,sp.kan_dlate,sp.beyond_average_late, andsp.qte_hd_panel.docs/guides/causal_rl_family.md— when to use causal RL vs classical causal inference, withsp.causal_bandit,sp.causal_dqn,sp.offline_safe_policy,sp.counterfactual_policy_optimization,sp.structural_mdp,sp.causal_rl_benchmark. Ships the 4 causal-RL-specific sanity checks.
Each guide is linked from mkdocs.yml under Guides and surfaces via
sp.search_functions() since all referenced functions have
hand-written registry specs.
Added — tests + docs hooks (from v1.4.1 cherry-picks now formally shipped)¶
tests/test_bridge_full.py: 10 end-to-end smoke + correctness tests for the sixsp.bridge(kind=...)bridging theorems — dispatches, finite outputs, agreement property on correctly-specified DGPs.docs/guides/bridging_theorems.md: full walkthrough of the six bridges with when-to-use and how-to-read-disagreement.
No API changes¶
Every public signature is byte-for-byte identical to v1.4.1. Existing
user code keeps working; upgrades reveal narrower CIs for
dml_model_averaging and cleaner event-study coefs for gardner_did.
[1.4.1] — 2026-04-21 — v3-frontier sprint 3 (AKM SE + Claude thinking + parity suites + docs)¶
Additive follow-up to v1.4.0. All v1.4.0 APIs remain stable; new functionality is exposed through additive kwargs on existing entry points.
Added — shock-clustered SE for panel shift-share¶
sp.shift_share_political_panel(..., cluster='shock')— new option computes the panel-extended Adão-Kolesár-Morales (2019) variance estimator recommended by Park-Xu (2026) §4.2:
Typically 3× tighter than unit-clustered SEs in settings with 10–100
industries. diagnostics['akm_se'] exposes the value alongside the
chosen cluster type, and diagnostics['cluster'] is now a
human-readable label ("shock (AKM 2019)" when the shock estimator
is active).
[bartik/political.py]
Added — Claude extended-thinking support for Causal MAS¶
sp.causal_llm.anthropic_client(..., thinking_budget=N)— opt into the Claude 4.5 / Opus 4.7 extended-thinking API. The reasoning trace is captured onclient.history[-1]['thinking']for auditability but is NOT included in the public answer parsed bycausal_mas. Compatible with Anthropic'sthinking/redacted_thinkingcontent blocks; both are handled cleanly. Validatesthinking_budget >= 1024and< max_tokenseagerly, so misconfiguration fails loudly before the first API call. [causal_llm/llm_clients.py]
Added — parity + integration test suites¶
tests/reference_parity/test_assimilation_parity.py— 10 checks on the Kalman / particle backends:- static-effect posterior recovery (both backends)
- Kalman ↔ particle agreement on three seeds (point + SD within 15%)
- monotone posterior variance under
process_var = 0 - particle-filter ESS stays above threshold after resampling
- Student-t particle beats Kalman on a contaminated stream
- drift tracking without variance blow-up
-
assimilative_causal(backend=...)matches direct-backend calls -
tests/integration/test_causal_mas_with_fake_llm.py— 11 end-to-end integration tests using the deterministicecho_clientto drive the proposer / critic / domain-expert / synthesiser loop: proposer parsing (newlines + bullets), critic rejection, domain-expert endorsement lifting confidence, transcript auditability, confidence scaling with rounds, role overrides, DAG interop viasp.dag(...), plus three Claude-thinking content-block splitter tests that mock Anthropic responses without requiring theanthropicSDK at test time.
Documentation¶
Two new MkDocs guides, wired into mkdocs.yml nav under
DID & Panel Methods / guides:
docs/guides/shift_share_political_panel.md— full panel-IV recipe including AKM shock-cluster guidance and pretrend workflow.docs/guides/causal_mas.md— multi-agent LLM causal discovery, real-SDK integration, Claude thinking-mode walkthrough, and end-to-end pipe intosp.dag/sp.identify.
Fixed¶
- Integration test used
dag.edges()butDAG.edgesis a list-of- tuples attribute (not a method). Corrected todag.edges.
Backwards compatibility¶
- All v1.4.0 APIs remain stable. The only new surface is additive kwargs:
sp.shift_share_political_panel(cluster='shock')sp.causal_llm.anthropic_client(thinking_budget=N)
[1.4.0] — 2026-04-21 — v3-frontier sprint 2 (extensions + LLM SDK + docs)¶
Follow-up to v1.3.0 covering the four secondary items flagged at the end of Sprint 1.
Added — panel-shift-share extension¶
sp.shift_share_political_panel— multi-period extension ofsp.shift_share_politicalper Park & Xu (2026) §4.2. Handles time-varying shares and time-varying shocks, runs pooled 2SLS with unit / time / two-way fixed effects, and reports a per-period event-study table plus aggregate Rotemberg top-K weights. Recovers τ = 0.30 within 0.003 on a 30 × 4 synthetic panel. [bartik/political.py]
Added — real-LLM adapters for Causal MAS¶
sp.causal_llm.openai_client— adapter over theopenai>=1.0Python SDK; supports custombase_urlfor Azure / vLLM / Ollama.sp.causal_llm.anthropic_client— adapter over theanthropic>=0.30Messages API; defaults toclaude-opus-4-7.sp.causal_llm.echo_client— deterministic scripted-response client for offline unit testing.- All three implement a single-method
LLMClientprotocol and integrate withsp.causal_llm.causal_mas(client=...)via the existingchat(role, prompt)interface. Lazy-imports the SDKs so the core package has zero new runtime dependencies. [causal_llm/llm_clients.py]
Added — particle-filter assimilation backend¶
sp.assimilation.particle_filter— bootstrap-SIR particle filter with systematic resampling (Gordon-Salmond-Smith 1993; Douc-Cappé 2005). Handles non-Gaussian priors, heavy-tailed observation noise, and nonlinear dynamics via pluggableprior_sampler/transition_sampler/observation_log_pdfcallbacks. Agrees with the exact Kalman filter to ~0.003 under Gaussian DGPs.sp.assimilative_causal(..., backend='particle')— the end-to-end wrapper now routes to the particle filter whenbackend='particle'. [assimilation/particle.py]
Documentation¶
Three new MkDocs guides covering the v3-frontier estimators:
docs/guides/synth_experimental.md— Abadie-Zhao inverse-SC workflow.docs/guides/harvest_did.md— Borusyak-Hull-Jaravel harvesting DID.docs/guides/assimilative_ci.md— Nature Comms 2026 streaming CI with both Kalman and particle backends.
All three are wired into mkdocs.yml nav under the DID & Panel
Methods / guides section.
Registry + agent schema¶
- 5 new hand-written
FunctionSpecentries:shift_share_political_panel,particle_filter,openai_client,anthropic_client,echo_client.
Code-quality pass (Sprint 1 audit)¶
- Removed 20 unused imports / shadow variables across the Sprint 1
modules identified by
pyflakes(did/harvest.py,bcf/ordinal.py,bcf/factor_exposure.py,causal_llm/causal_mas.py,bartik/political.py,assimilation/kalman.py,target_trial/report.py).
Fixed¶
tests/external_parity/test_causalml_book.py::test_forest_ate_recovers_average_tauwas flaking onubuntu-latest + Python 3.10because only the data-generating RNG was seeded — the causal forest's bootstrap + honest-split sampling was unseeded, so the ATE estimate varied by ±0.3 between OS / Python matrix entries and the|ATE - 0.5| < 0.3tolerance occasionally failed. Fixed by passingrandom_state=0+n_estimators=300+ bumpingnto 1 500 so the test is fully deterministic across the matrix.
[1.3.0] — 2026-04-21 — v3-frontier sprint (Sprint 1 of the 知识地图 v3 roadmap)¶
Builds on top of the v1.2.0 doc-alignment work by implementing the
eleven highest-leverage frontier methods identified in the 2026-04-20
Causal-Inference Method Family 万字剖析 v3 gap analysis. Every new
public function is wired into the registry + agent schema so it
surfaces through sp.list_functions, sp.describe_function, and
sp.all_schemas for LLM agents.
Added — P0 frontier (4 methods, within-sprint week 1)¶
-
sp.synth_experimental_design— Abadie & Zhao (2025/2026) inverse synthetic controls: picks the bestkcandidate units to treat by minimising the sum of per-unit pre-period SC MSPEs. Produces a ranking table, recommended treatment assignment, and a variance-gain benchmark against random allocation. [synth/experimental_design.py] -
sp.rdrobust(..., bootstrap='rbc', n_boot=999, random_state=...)— Cavaliere, Gonçalves, Nielsen & Zanelli (arXiv:2512.00566, 2025) robust-bias-corrected studentised percentile bootstrap. Empirically delivers CIs ~3–15% shorter than the analytic robust CI without sacrificing coverage. Newmodel_info['rbc_bootstrap']block exposes the CI, p-value, length-ratio, and effective replicate count. -
sp.fairness.evidence_without_injustice— Loi, Di Bello & Cangiotti (arXiv:2510.12822, 2025) counterfactual-fairness test that freezes admissible-evidence features at their factual values and tests whether predictions still change underdo(A = a'). Returns a bootstrap CI, p-value, and per-alternative breakdown. [fairness/evidence_test.py] -
sp.target_trial.to_paper(..., fmt='jama' | 'bmj')— renders a JAMA / BMJ-ready manuscript with all 21 TARGET Statement (JAMA/BMJ 2025-09) items auto-filled where derivable plus(supply text)placeholders elsewhere. Supportsauthors,funding,registration,data_availability,background,limitationskeyword arguments.
Added — P1 frontier (4 methods, within-sprint week 2)¶
-
sp.harvest_did— Abadie, Angrist, Frandsen & Pischke, NBER WP 34550 (2025) Harvesting DID + event-study framework: extracts every valid 2×2 DID comparison from a staggered panel, combines them via inverse-variance weights, and reports event-study + pretrend Wald tests. Uses a not-yet-treated-at-max(t₁, t₂) clean-control filter that correctly handles placebo horizons. [did/harvest.py] -
sp.bcf_ordinal— Zorzetto et al. (2026) BCF for ordered / dose treatments. Chains pairwise binary BCF between consecutive levels to yield cumulative dose-response CATEs with per-level ATEs. [bcf/ordinal.py] -
sp.bcf_factor_exposure— arXiv:2601.16595 (2026) BCF on PCA-factor scores of a high-dimensional exposure vector. SVD or user-supplied loadings compress the exposure toKfactors; one BCF is fit per factor. Returns per-factor ATEs, loadings, scores, and an aggregate mixture-ATE with CI. [bcf/factor_exposure.py] -
sp.causal_llm.causal_mas— arXiv:2509.00987 (2025/09) multi- agent causal discovery framework. Runs proposer / critic / domain-expert / synthesiser agents over several debate rounds with per-edge confidence scores and a full auditable transcript. Offline heuristic backend by default; accepts anychat(role, prompt)/complete(prompt)LLM client. [causal_llm/causal_mas.py] -
sp.shift_share_political— Park & Xu (arXiv:2603.00135, 2026) political-science variant of the Bartik IV. Long-difference 2SLS with AKM shock-cluster SEs, Rotemberg top-K diagnostic, and share-balance F-test against pre-treatment covariates. [bartik/political.py]
Added — P2 frontier + testing (2 methods + 2 test suites)¶
-
sp.assimilation.causal_kalman,sp.assimilation.assimilative_causal— Assimilative Causal Inference (Nature Communications 2026): a Kalman filter over streaming causal-effect estimates. Produces a running posterior with effective-sample-size diagnostics, pluggable dynamics (static or random-walk), and an end-to-end wrapper that runs a user-supplied per-batch estimator. New subpackage [assimilation/]. -
tests/reference_parity/test_mr_parity.py— 7 analytic-truth checks over the MR suite (IVW consistency, Egger intercept under balanced pleiotropy, Egger directional-pleiotropy detection, weighted-median robustness, PRESSO outlier flag, LOO stability, Radial-Wald exact agreement). All 7 pass. -
tests/external_parity/test_causalml_book.py— 7 CausalMLBook (Chernozhukov et al. 2024–2025) canonical-DGP checks: DML-PLR, Causal Forest, T-learner, 2SLS, Callaway–Sant'Anna DID, rdrobust, and rbc-bootstrap vs analytic parity. All 7 pass.
Registry + agent schema¶
- 9 hand-written
FunctionSpecentries for every new public function:synth_experimental_design,evidence_without_injustice,harvest_did,bcf_ordinal,bcf_factor_exposure,causal_mas,shift_share_political,causal_kalman,assimilative_causal. Each entry ships with NumPy-style parameter docs, examples, tags, and paper references for LLM-agent consumption.
Backwards compatibility¶
- All v1.2.x public APIs remain stable. The only changes to existing signatures are additive kwargs:
sp.rdrobust—bootstrap,n_boot,random_statesp.target_trial.to_paper—journal,authors,funding,registration,data_availability,background,limitations
[1.2.0] — 2026-04-21 — Doc-alignment sprint (v3 reference document)¶
Closes the remaining gaps between the Causal-Inference Method Family 万字剖析 v3 (2026-04-20) reference document and the StatsPAI public API. Most v3 frontier methods were already implemented in v1.0.x but lived in sub-packages without top-level exposure or curated registry specs. This release wires them up, adds the eight genuinely missing classical/frontier methods, and upgrades 14 frontier estimators from auto-generated to hand-written registry specifications so that LLM agents see proper parameter docs, examples, references, and tags.
Added — new estimators¶
Staggered DID
sp.gardner_did/sp.did_2stage— Gardner (2021) two-stage DID estimator (the Statadid2sanalogue). Stage-1 fits two-way fixed effects on untreated rows; Stage-2 regresses the residualised outcome on treatment dummies (overall ATT or event study) with cluster-robust SEs. Numerically agrees withdid_imputationto within ~2% on synthetic staggered panels.
DML
sp.dml_model_averaging/sp.model_averaging_dml— Ahrens, Hansen, Kurz, Schaffer & Wiemann (2025, JAE 40(3):381-402) model-averaging DML-PLR. Fits DML under multiple candidate nuisance learners and reports a risk-weighted (or equal/single-best) average θ with a cross-score-covariance-adjusted SE. Default candidate roster: Lasso / Ridge / RandomForest / GradientBoosting.
IV
sp.kernel_iv(top-level alias ofsp.iv.kernel_iv) — Lob et al. (2025, arXiv:2511.21603) kernel IV regression with wild-bootstrap uniform confidence band over the structural functionh*(d).sp.continuous_iv_late(top-level alias) — Zeng et al. (2025, arXiv:2504.03063) LATE on the maximal complier class for continuous instruments via quantile-bin Wald estimator. (Also fixed a summary formatting bug — see below.)
TMLE
sp.hal_tmle+sp.HALRegressor/sp.HALClassifier— TMLE with Highly Adaptive Lasso nuisance learners (Li, Qiu, Wang & van der Laan, 2025, arXiv:2506.17214). Two variants:"delta"(plug HAL into standard TMLE) and"projection"(apply tangent-space shrinkage to the targeting epsilon). Recovers ATE within ~3% on n=400 with rich nuisance.
Synthetic Control
sp.synth_survival— Synthetic Survival Control (Han & Shah, 2025, arXiv:2511.14133). Donor convex combination on the complementary log-log scale matches the treated arm's pre-treatment Kaplan-Meier, then projects forward and reports the survival gap with a placebo-permutation uniform band. Pre-treat fit RMSE typically < 0.01 on synthetic Cox data.
RDD aliases
sp.multi_cutoff_rd(alias forsp.rdmc),sp.geographic_rd(alias forsp.rdms),sp.boundary_rd(alias forsp.rd2d),sp.multi_score_rd(alias forsp.rd_multi_score) — user-friendly aliases mirroring the v3 document terminology.
Added — registry / agent surface¶
- 14 frontier estimators promoted from auto-generated to hand-written
registry specs with curated parameter descriptions, examples, tags,
and references:
gardner_did,dml_model_averaging,kernel_iv,continuous_iv_late,hal_tmle,synth_survival,bridge,causal_dqn,fortified_pci,bidirectional_pci,pci_mtp,cluster_cross_interference,beyond_average_late,conformal_fair_ite. This is whatsp.describe_function(...)andsp.function_schema(...)now return for these names. - Total registered functions: 836 → 860.
__all__repaired so previously-imported-but-not-exported symbols surface insp.list_functions():fci/FCIResult,spatial_did/SpatialDiDResult,spatial_iv,notears,pc_algorithm.
Fixed¶
iv.continuous_late.ContinuousLATEResult.summary— header line was being multiplied 42× by an implicit string-concat ×"=" * 42precedence bug ("title\n" "=" * 42parsed as("title\n" + "=") * 42). Replaced with explicit f-string concatenation.question.CausalQuestion.save— addedTYPE_CHECKINGimport forpathlib.Pathso the stringified return annotation stops trippingflake8 F821in CI.- Added
tabulate>=0.9.0to core dependencies.pandas.to_markdown()dispatches totabulate, which was previously a pandas-optional dep; user-facingsp.causal(...).report('markdown' | 'html')triggered anImportErroron systems (Windows, fresh envs) that didn't happen to transitively installtabulate.
Test coverage¶
35 new test cases across 7 new test modules:
test_gardner_2s.py (7), test_dml_model_averaging.py (5),
test_kernel_iv.py (5), test_continuous_iv_late.py (4),
test_hal_tmle.py (5), test_synth_survival.py (6),
test_rd_aliases.py (3). All pass on Python 3.13.
[1.0.1] - 2026-04-21 — Post-review correctness pass + deferred-item closeout¶
Bugfix release closing every Critical / High / Medium finding from the
independent code-review-expert pass on the v1.0.0 frontier modules,
plus resolution of the two # NEEDS_VERIFICATION items that had been
deferred in v1.0.0.
Fixed — post-review correctness pass¶
Critical (silent wrong numbers)
pcmci.partial_corr_pvalue: Fisher-z SE now uses the effective sample sizesqrt(n - |Z| - 3)instead of the off-by-onesqrt(df - 1). The previous formula systematically missed edges in PCMCI by making partial-correlation p-values too large.cohort_anchored_event_study: theclusterargument was silently dropped — the bootstrap resampled cohort ATTs instead of the user- supplied cluster level. Fixed to resample at the requested cluster and re-compute ATT(c, k) per draw.ltmle_survivaltargeting step: the TMLE one-step update appliedlogit(h_hat_regime)inline instead of using the pre-computedoffsetvariable, leaving the regime-counterfactual hazard untargeted. Reboundoffset_regime = logit(clip(h_hat_regime)).
High (wrong formula / silent tautology)
conformal_density_ite: previously fell back to split-conformal on Gaussian-residual quantiles, with the KDE bandwidth computed but unused. Now builds a proper KDE of the ITE-residual convolution and returns the Hyndman (1996) highest-density region via a shortest- window sweep over sorted smoothed samples.bridge.ewm_cate: Path A and Path B shared the same CATE-plug-in DR score, making the agreement test tautological. Path A now uses the Kitagawa-Tetenov (2018) pure-IPW welfare score so that the two paths have genuinely different failure modes, giving a real bridge.mr_multivariableconditional F-stat (Sanderson-Windmeijer): the partitionss_full - ss_residused raw (uncentred) sum of squares and unweighted OLS. Replaced with centred SS over WLS residuals, matching the MVMR weighting scheme.bcf_longitudinal.average_ate: point estimate and CI were computed on different sampling distributions (per-time-point mean vs. bootstrap quantiles). Headline now uses the bootstrap mean.
Medium
conformal_fair_ite: small protected-group fallback no longer mixes arms (which destroyed per-group coverage). Falls back to the conservative MAX per-group quantile across well-covered groups, or a pooled quantile with an explicit warning when all groups are small.causal_rl.structural_mdp: theA/Bmatrix slices were numerically verified correct, but shape assertions were added so any future refactor that flips the slice semantics fails loudly.causal_llm.llm_dag_propose: user-provideddomainandvariablesare now sanitized (non-printable and newline characters stripped; length capped) before interpolation into the LLM prompt, closing the prompt-injection vector.
Dead-variable cleanup
- Removed stale
bM,fe_cols,avg,rngnames acrossmendelian/multivariable.py,did/design_robust.py,bcf/longitudinal.py, andqte/hd_panel.py.
Changed — deferred-item closeout¶
beyond_average_late: replaced the ad-hoc quantile-range rescaling with an Abadie (2002) κ-weighted complier-CDF construction that inverts the CDF difference on the complier subpopulation only. The result is a proper complier quantile treatment effect.bridge.surrogate_pci: path A (surrogate index) and path B (PCI bridge) now use genuinely different identifying assumptions — path A relies on surrogacy (no direct D→Y path given S), path B relies on proxy completeness (D is a valid IV for itself under the bridge function). The old OLS-on-(D, S, X) construction for path B is replaced with a 2SLS that uses S and X as exogenous controls while leaving D as the treatment of interest.
Tests¶
tests/test_v100_review_fixes.py: 8 pinning regression tests, each corresponding 1:1 to a review finding.- Full-suite regression: 2 515+ tests passing, zero regressions.
[1.0.0+] - 2026-04-21 — v3 frontier sweep (12-module / 38-estimator pass)¶
Round-out pass triggered by the v3 全景图 doc (2026-04-20), filling the
remaining 2025-2026 frontier gaps that Stata / R / EconML / DoWhy /
CausalML still lack. 38 new public estimators across 12 modules,
all routed through sp.* and registered in sp.list_functions().
Added — v3 frontier estimators¶
- DiD frontier (
sp.did_*):did_bcf(Forests for Differences, Wüthrich-Zhu 2025),cohort_anchored_event_study(arXiv 2509.01829),design_robust_event_study(Wright 2026, arXiv 2601.18801),did_misclassified(arXiv 2507.20415). - Conformal frontier (
sp.conformal_*):conformal_density_ite(arXiv 2501.14933),conformal_ite_multidp(arXiv 2512.08828),conformal_debiased_ml(arXiv 2604.03772),conformal_fair_ite(arXiv 2510.08724). - Proximal frontier (
sp.fortified_pci,sp.bidirectional_pci,sp.pci_mtp,sp.select_pci_proxies): doubly-robust, bidirectional, modified-treatment-policies, plus a heuristic proxy selector (arXiv 2506.13152 / 2507.13965 / 2512.12038 / 2512.24413). - Distributional / panel QTE (
sp.dist_iv,sp.kan_dlate,sp.qte_hd_panel,sp.beyond_average_late): full distributional- layer LATE + HD-panel QTE + complier-distribution LATE (arXiv 2502.07641 / 2506.12765 / 2504.00785 / 2509.15594). - RDD frontier (
sp.rd_interference,sp.rd_multi_score,sp.rd_distribution,sp.rd_bayes_hte,sp.rd_distributional_design): five new 2025–2026 supports (arXiv 2410.02727 / 2508.15692 / 2504.03992 / 2504.10652 / 2602.19290). sp.causal_llm(NEW namespace):llm_dag_propose,llm_unobserved_confounders,llm_sensitivity_priors— all with deterministic heuristic backends (no API key required); accept aclientarg for real LLM injection.sp.causal_rl(NEW namespace):causal_dqn(Li-Zhang-Bareinboim confounding-robust Deep Q, arXiv 2510.21110),causal_rl_benchmark(5 benchmarks per Cunha-Liu-French-Mian, arXiv 2512.18135),offline_safe_policy(Chemingui et al., arXiv 2510.22027).- Cluster RCT × interference (
sp.cluster_*,sp.dnc_gnn_did): matched-pair, cross-cluster, staggered-rollout, DNC+GNN+DiD (arXiv 2211.14903 / 2310.18836 / 2502.10939 / 2601.00603). - IV frontier (
sp.iv.kernel_iv,sp.iv.continuous_iv_late,sp.iv.ivdml): kernel IV uniform CI + continuous-instrument maximal-complier LATE + LASSO-efficient instrument × DML (arXiv 2511.21603 / 2504.03063 / 2503.03530). - Meta-learner frontier (
sp.focal_cate,sp.cluster_cate): functional CATE (FOCaL, arXiv 2602.11118) + K-means cluster CATE (arXiv 2409.08773). - Bunching unification (
sp.general_bunching,sp.kink_unified): high-order bias correction (Song 2025, arXiv 2411.03625) + RDD/RKD/Bunching joint estimator (Lu-Wang-Xie 2025).
Tests (v3 sweep)¶
- 55 new smoke tests added under
tests/test_*_frontiers.py,tests/test_causal_llm.py,tests/test_causal_rl.py,tests/test_cluster_rct.py,tests/test_metalearner_frontiers.py,tests/test_bunching_unified.py. All pass; no regressions in the 153 core tests for did / iv / rd / dml / proximal / metalearners.
Registry (v3 sweep)¶
- Total registered functions: 794 → 831 (37 new symbols + 1 result class auto-discovered).
- All 38 surfaced via
sp.list_functions(),sp.help(),sp.function_schema(), and the OpenAI-compatible JSON schema export.
[1.0.0] - 2026-04-21 — Research-frontier capstone: bridging theorems, fairness, surrogates, MVMR, PCMCI, beyond-average QTE¶
StatsPAI 1.0 is the capstone release that integrates three years of
development into one coherent toolkit. On top of the v0.9.17
three-school completion, v1.0 ships the 2025-2026 research-frontier
modules that Stata / R have not yet caught up with, wires every
scaffolded subpackage into the top-level sp.* namespace, and
upgrades the target-trial reporting layer to the JAMA/BMJ 2025
TARGET Statement.
Added — v1.0 research-frontier modules¶
Bridging theorems (sp.bridge) — dual-path doubly-robust
identification. Each theorem pairs two seemingly different estimators
on the same target parameter: if either assumption holds, the
estimate is consistent.
bridge(..., kind="did_sc")— DiD ≡ Synthetic Control (Shi-Athey 2025)bridge(..., kind="ewm_cate")— EWM ≡ CATE → policy (Ferman et al. 2025)bridge(..., kind="cb_ipw")— Covariate balancing ≡ IPW × DR (Zhao-Percival 2025)bridge(..., kind="kink_rdd")— Kink-bunching ≡ RDD (Lu-Wang-Xie 2025)bridge(..., kind="dr_calib")— DR via calibration (Zhang 2025)bridge(..., kind="surrogate_pci")— Long-term surrogate ≡ PCI (Kallus-Mao 2026)BridgeResultreports both path estimates, their agreement test, and the recommended doubly-robust point estimate.
Fairness (sp.fairness) — counterfactual fairness as causal
inference, not pure statistics.
counterfactual_fairness— Kusner et al. (2018) Level-2/3 predictor evaluation on a user-supplied SCM.orthogonal_to_bias— Marchesin & Zhang (2025) residualization pre-processing that removes the component of non-protected features correlated with the protected attribute.demographic_parity,equalized_odds,fairness_audit— statistical fairness metrics + one-shot dashboard.
Long-term surrogates (sp.surrogate) — extrapolate short-term
experiments to long-term outcomes.
surrogate_index— Athey, Chetty, Imbens, Pollmann & Taubinsky (2019).long_term_from_short— Ghassami, Yang, Shpitser, Tchetgen Tchetgen (2024).proximal_surrogate_index— Imbens, Kallus, Mao (2026): proximal identification when unobserved confounders link surrogate and long-term outcome.
Multivariable MR (sp.mendelian extended)
mr_multivariable— MVMR on multiple correlated exposures.mr_mediation— causal-pathway decomposition for two-sample MR.mr_bma— Bayesian Model Averaging for MR with many candidate exposures (Yao et al. 2026 roadmap).
DiD frontiers (sp.did extended)
cohort_anchored_event_study— cohort-robust event-study weights.design_robust_event_study— design-robust dynamic ATT.did_misclassified— treatment-misclassification-robust DiD.did_bcf— Bayesian Causal Forest wrapper for DiD.
Conformal-inference frontiers (sp.conformal_causal extended)
conformal_debiased_ml— debiased-ML-aligned conformal intervals.conformal_density_ite— density-valued ITE conformal bounds.conformal_fair_ite— fairness-constrained ITE conformal.conformal_ite_multidp— multi-stage differentially-private ITE conformal bounds.
Proximal causal frontiers (sp.proximal extended)
bidirectional_pci— two-sided proxy-based causal inference.fortified_pci— variance-fortified PCI.pci_mtp— multiple-testing-corrected PCI.select_pci_proxies— automated proxy-variable selector.
Quantile / distributional-IV frontiers (sp.qte extended)
beyond_average_late— beyond-mean LATE for heterogeneous quantile treatment effects.qte_hd_panel— high-dimensional panel QTE.
RD frontiers (sp.rd extended)
rd_distribution— distribution-valued (functional) RD.rd_multi_score,rd_interference— already shipped.
Time-series causal discovery (sp.causal_discovery extended)
pcmci/lpcmci/dynotears— Peter-Clark-MCI family for observational + latent-confounder time-series DAG discovery.
LTMLE survival + BCF longitudinal (sp.tmle / sp.bcf extended)
ltmle_survival— LTMLE for survival outcomes with time-varying treatments.bcf_longitudinal— BCF for longitudinal panel settings.
Target Trial 2025 upgrade (sp.target_trial extended)
target_checklist(result)+to_paper(..., fmt="target")— render the JAMA/BMJ September-2025 TARGET Statement 21-item reporting checklist as a completed table, with[AUTO]/[TODO]tags for items that can be filled from the protocol + result vs. need author-supplied narrative.
Synthetic control frontier
sequential_sdid— sequential synthetic difference-in-differences.
ML bounds
ml_bounds— partial-identification bounds with ML nuisance estimation.
Added — MCP server + bridge layer¶
sp.agent.mcp_server— Model Context Protocol server scaffold so external LLMs (Claude, GPT-4, local models) can call every registered StatsPAI function via natural-language tool-calling.
Changed¶
statspai/__init__.py: 80+ new names in__all__; v1.0 total registered functions ≈ 729+.- Registry now includes rich FunctionSpec entries for the core new frontier APIs (bridge, fairness, surrogate, mr_multivariable, etc.).
Stability & scope¶
- All 229 tests added in the v0.9.17 + v1.0 window pass.
- Zero regressions in the 2158-test existing suite.
- Three-school completion from v0.9.17 carries forward intact
(
sp.epi,sp.longitudinal,sp.question, unified sensitivity, DAG recommender, preregistration).
Versioning¶
- This is a major release (breaking-change policy starts here). The
public API surface is the set of names in
statspai.__all__as of v1.0.0; anything outside that list remains unstable.
[0.9.17] - 2026-04-21 — Modern-weighting + MC g-formula + weakrobust panel + three-school completion¶
Two-pronged release. First, a surgical pass targeting four of the most-
requested gaps from the v1.0 gap-analysis: a Stata-style unified
weak-IV-robust diagnostic panel, the Zubizarreta (2015) stable-balancing-
weights estimator, the Robins (1986) Monte-Carlo g-formula (complementing
the existing Bang-Robins ICE), and a truly end-to-end sp.causal()
orchestrator. Second, a three-school completion pass mapping the
Econometrics ↔ Epidemiology ↔ ML knowledge-map article onto StatsPAI:
epidemiology primitives, MR full suite, longitudinal dispatcher, DAG-to-
estimator recommender, estimand-first DSL, and a unified sensitivity
dashboard attached to every Result object.
Added¶
-
sp.weakrobust(data, y, endog, instruments, exog)— one-call diagnostic panel that bundles Anderson-Rubin (1949), Moreira (2003) Conditional LR, Kleibergen (2002) K score test, Kleibergen-Paap (2006) rk LM + Wald F, Olea-Pflueger (2013) effective F, and Lee-McCrary-Moreira-Porter (2022) tF critical values.WeakRobustResultexposes.summary(),.to_frame(), and dict-style lookup. This is the Python analogue of Stata 19'sestat weakrobust, unifying functionality scattered acrossivmodel(R),linearmodels(Python), and the Stata user-writtenweakiv/rivtestpackages. -
sp.sbw(data, treat, covariates, y=..., estimand='att')— Stable Balancing Weights (Zubizarreta 2015 JASA). Minimises variance (or KL) of the weights subject to per-covariate SMD balance tolerances solved via SLSQP. Supports ATT / ATC / ATE. Reports an effective sample size and before/after balance table. Complementssp.ebalance(exact balance) andsp.cbps(CBPS). -
sp.gformula_mc(data, treatment_cols, confounder_cols, outcome_col)— Monte-Carlo parametric g-formula (Robins 1986). Fits per-timepoint conditional models for confounders (binary logit / Gaussian OLS) and simulates counterfactual trajectories under user-supplied static or dynamic (callable) treatment strategies. Non-parametric bootstrap CI. Complements the existingsp.gformula.ice(Bang-Robins 2005 ICE). -
Enhanced
sp.causal()workflow — three new stages auto-run afterestimate/robustness: .compare_estimators()— design-aware multi-estimator panel: CS + SA + BJS + Wooldridge for staggered DiD; 2SLS + LIML for IV; OLS + EB + CBPS + SBW + DML-PLR for observational..sensitivity_panel()— E-value + Oster δ* + Rosenbaum Γ in one DataFrame, matching the modern "sensitivity triad" expected by top-5 econ journals..cate()— X-Learner and Causal Forest heterogeneity summary (per-unit CATE mean, SD, q10/q50/q90).- Report output gains sections 4b / 4c / 4d.
- Opt-out via
CausalWorkflow.run(full=False);_extract_effecthelper unifiesCausalResultandEconometricResultsextraction.
Reviewer-identified fixes (v0.9.17 internal review)¶
SBWResult.__init__now forwardsmodel_info+_citation_keyto theCausalResultparent, wiring it into the citation registry.MCGFormulaResult._is_binarynow requires both 0 and 1 levels present — a degenerate column (all-0 or all-1) no longer triggers the logistic Newton-Raphson loop._extract_effectinCausalWorkflownow returns NaN when the treatment column is missing from the fitted params, rather than silently surfacing the intercept coefficient.- SBW docstring clarified: reported SE is conditional-on-weights;
users who need full parameter-uncertainty propagation should
bootstrap
sp.sbwexternally.
Deferred to a separate sprint¶
The original gap analysis also flagged TMLE dynamic regimes + censoring, Conformal counterfactual / weighted variants, PCMCI time-series causal discovery, Partial-ID + ML bounds, and the Agent-MCP server integration. Each is substantial enough to warrant its own focused sprint rather than being shipped half-finished here.
Added — three-school completion (2026-04-21 sub-release)¶
Driven by a cross-reference audit against the article "Causal Inference Knowledge Map — Econometrics, Epidemiology, ML", which pinpointed Layer-4 (What If longitudinal), epidemiology entry-level primitives, Mendelian randomization diagnostic depth, DAG-to-estimator UX, and estimand-first workflow as the remaining gaps vs. Stata / R dominance.
Epidemiology primitives (sp.epi) — NEW subpackage
odds_ratio,relative_risk,risk_difference,attributable_risk(Levin PAF),incidence_rate_ratio(exact Poisson CI via Clopper-Pearson),number_needed_to_treat,prevalence_ratio— Woolf / Fisher-exact / Katz / Wald / Newcombe intervals; Haldane-Anscombe correction for zero cells.mantel_haenszel(OR / RR with Robins-Breslow-Greenland variance),breslow_day_test(homogeneity of OR with Tarone correction).direct_standardize,indirect_standardize— direct-standardized rates + SMR with Garwood exact Poisson CI.bradford_hill— structured 9-viewpoint causal-assessment rubric with prerequisite check (no causality claim without temporality).
Mendelian randomization full suite (sp.mr / sp.mendelian)
mr_heterogeneity— Cochran's Q (IVW) or Rücker's Q' (Egger) + I².mr_pleiotropy_egger— formal MR-Egger intercept test for directional horizontal pleiotropy (Bowden 2015).mr_leave_one_out— per-SNP drop-one IVW sensitivity.mr_steiger— Hemani (2017) directionality test using Fisher-z of per-trait R² contributions.mr_presso— Verbanck (2018) global outlier test + per-SNP outlier detection + distortion test for raw-vs-corrected comparison.mr_radial— Bowden (2018) radial reparameterization + Bonferroni- thresholded outlier flagging.
Target trial emulation — structured report
TargetTrialResult.to_paper(fmt=...)/sp.target_trial.to_paper— render STROBE-compatible Methods + Results block in Markdown / LaTeX / plain-text for direct inclusion in manuscripts. Table structure tracks the JAMA 2022 7-component TTE spec exactly.
Longitudinal causal dispatcher (sp.longitudinal) — NEW subpackage
sp.longitudinal.analyze— unified entry point that auto-routes to IPW (no time-varying confounders) / MSM (dynamic regime with time-varying confounders) / parametric g-formula ICE (static regime) based on data shape and the supplied regime object.sp.longitudinal.contrast— plug-in estimator ofE[Y(regime_a)] - E[Y(regime_b)]with delta-method SE.sp.regime,sp.always_treat,sp.never_treat— dynamic-treatment- regime DSL supporting static sequences, callables, and a safe"if cd4 < 200 then 1 else 0"string DSL. The string DSL is parsed into a whitelisted AST and interpreted by a tiny tree-walker — no dynamic code execution is ever invoked, and disallowed constructs are rejected at regime-construction time.
Estimand-first causal-question DSL (sp.causal_question) — NEW subpackage
sp.causal_question(treatment=, outcome=, estimand=, design=, ...)declares a causal question up front..identify()picks an estimator + lists the identifying assumptions the user must defend;.estimate()runs the analysis;.report()produces a Markdown Methods + Results paragraph.- Auto-design selects IV when instruments are present, RD when running variable + cutoff given, DiD when panel + time, longitudinal when repeated measures, else selection-on-observables.
- Dispatches internally to
sp.regress/sp.aipw/sp.iv/sp.did/sp.rdrobust/sp.synth/sp.longitudinal.analyze/sp.event_study.
DAG → estimator recommender (sp.dag.recommend_estimator)
DAG.recommend_estimator(exposure, outcome)— inspects the declared graph and suggests a StatsPAI estimator with a plain-English identification story. Priority order: backdoor adjustment (OLS / IPW / matching) → IV (heuristic relevance + exclusion check) → frontdoor → not-identifiable (with sensitivity-analysis fallbacks).- Detects mediators on the causal path automatically.
Unified sensitivity dashboard (sp.unified_sensitivity)
result.sensitivity()— method added to bothCausalResultandEconometricResults. Single call runs E-value (always), Oster δ (when R² inputs supplied), Rosenbaum Γ (when a matched structure is exposed), Sensemakr (regression models), and a breakdown-frontier bias estimate.
Changed (three-school completion)¶
__init__.py: 40+ new names exposed at top level includingsp.epi,sp.longitudinal,sp.question,sp.tte/sp.mrshort aliases.
Fixed (three-school completion)¶
- Regime DSL: AST validation moved from evaluate-time to compile-time
so unsafe expressions are rejected immediately at
sp.regime(...)construction, before any history is supplied.
[0.9.16] - 2026-04-20 — v1.0 breadth expansion + Bayesian family polish + Rust Phase-2 CI¶
The largest release since the v1.0 breadth pass. Maps StatsPAI onto the full Mixtape + What If + Elements of Causal Inference curriculum: Hernan-Robins target-trial emulation, Pearl-Bareinboim SCM machinery, modern off-policy / neural-causal estimators, plus three additions that close long-standing gaps in the Bayesian family, plus a CI scaffold for the Rust HDFE spike.
Added (0.9.16) — v1.0 breadth expansion (27+ new modules)¶
Target trial emulation & censoring (sp.target_trial, sp.ipcw)
target_trial_protocol,target_trial_emulate,clone_censor_weight,immortal_time_check— JAMA 2022 7-component TTE framework with explicit eligibility / time-zero / per-protocol contrast support.ipcw— Robins-Finkelstein inverse probability of censoring weights (pooled-logistic or Cox hazard) with stabilization + truncation.
SCM / DAG machinery (sp.dag extended)
identify— Shpitser-Pearl ID algorithm; returns do-free estimand when identifiable, witness hedge(F, F')otherwise.do_rule1 / do_rule2 / do_rule3,do_calculus_apply— mechanized do-calculus with d-separation on mutilated graphsG_{bar X},G_{underline Z}, andG_{bar Z(W)}.swig— Richardson-Robins Single-World Intervention Graphs via node-splitting of intervened variables.SCM— abduction-action-prediction counterfactual runner with rejection sampling fallback for non-Gaussian structural equations.llm_dag— LLM-backed DAG extraction from free-form descriptions.
Causal discovery with latents (sp.causal_discovery)
fci— FCI for PAGs with unobserved confounders (Zhang 2008): skeleton + v-structures + FCI rules R1-R4.icp,nonlinear_icp— Peters-Bühlmann-Meinshausen invariant causal prediction; linear F-test / K-S nonlinear invariance.
Transportability (sp.transport)
transport_weights_fn/transport_generalize— Stuart / Dahabreh density-ratio transport with inverse odds of sampling weighting.identify_transport— Bareinboim-Pearl s-admissibility; enumerates adjustment sets on selection diagrams, returns transport formula.
Off-policy evaluation (sp.ope)
ips,snips,doubly_robust,switch_dr,direct_method,evaluate— Dudik-Langford-Li DR family plus Swaminathan-Joachims SNIPS and Wang-Agarwal-Dudík Switch-DR for bandits / RL.
Deep causal & latent-confounder models (sp.neural_causal)
cevae— Louizos et al. CEVAE with PyTorch path + numpy variational fallback so import never fails.
Longitudinal / G-methods (sp.gformula, sp.tmle, sp.dtr)
gformula_ice_fn— Bang-Robins iterative conditional expectation parametric g-formula; sequential backward regression with recursive strategy plug-in. Supports static / scalar / callable strategies.ltmle— van der Laan-Gruber longitudinal TMLE.q_learning,a_learning,snmm— dynamic treatment regime estimators.
Additional estimators across the stack
- Causal forests:
multi_arm_forest,iv_forest,survival/causal_forest(Cui-Kosorok 2023). - Proximal:
negative_controls,pci_regression(Miao-Shi-Tchetgen). - Interference:
network_exposure(Aronow-Samii 2017),peer_effects. - Dose-response:
vcnet+scigan(Nie-Brunskill-Wager 2021). - Matching:
genmatch(Diamond-Sekhon 2013). - Sensitivity:
rosenbaum_bounds. - Spatial:
spatial_did,spatial_iv(Kelejian-Prucha 1998). - Time series:
its(interrupted time series). - Bounds:
balke_pearl. - Mediation:
four_way_decomposition(VanderWeele 2014).
Registry / agent surface
- 11 hand-written
FunctionSpecentries for the new flagship APIs, each with parameter schemas, tags, and canonical references. sp.list_functions()now reports 664 entries.sp.search_functions("target trial")/"invariance"/"transport"all resolve correctly.
Added (0.9.16) — Bayesian family gap-closing¶
-
bayes_mte(mte_method='bivariate_normal')— full textbook Heckman-Vytlacil trivariate-normal model(U_0, U_1, V) ~ N(0, Σ)withD = 1{Z'π > V}. Identifies the structural gapβ_D = μ_1 - μ_0and the two selection covariancesσ_0V, σ_1Vvia inverse-Mills-ratio corrections in the structural equation, soMTE(v) = β_D + (σ_1V - σ_0V)·vis closed-form linear on V scale. Requiresselection='normal'andfirst_stage='joint';poly_uis overridden to 1 with aUserWarningif the user set something else. Exposesb_mteas a 2-vector Deterministic[β_D, σ_1V - σ_0V]so every downstream code path (mte_curve, ATT/ATU integrator,policy_effect) works unchanged. This is the last missing piece of the Heckman-Vytlacil pipeline thatselection='uniform'/'normal'+mte_method='polynomial'/'hv_latent'started. -
bayes_did(cohort=...)+BayesianDIDResult— when the user supplies acohortcolumn (typically first-treatment period in a staggered design), the scalartauis replaced with a vectortau_cohortof lengthn_cohortsunder the sameNormal(prior_ate)prior. The result carriescohort_summaries: Dict[str, dict]andcohort_labels; the top-level pooled ATT is the treated-size-weighted mean of the per-cohort τ posteriors.result.tidy(terms='per_cohort')returns one row per cohort withterm='cohort:<label>'; explicitterms=['att', 'cohort:2019', ...]selection is supported for modelsummary / gt pipelines. Back-compat: calling withoutcohort=...returns aBayesianDIDResultthat behaves byte- identically to the v0.9.15BayesianCausalResult. -
bayes_iv(per_instrument=True)+BayesianIVResult— on a multi-instrument fit, additionally runs one just-identified Bayesian IV sub-fit perZ_jand stores per-instrument posteriors asinstrument_summaries: Dict[str, dict]. Surface mirrors the DID extension:tidy(terms='per_instrument')emits one row perZwithterm='instrument:<name>'. The top-level pooled LATE remains the joint over-identified fit; per-instrument rows are an add-on diagnostic. Sub-fit priors and sampler controls mirror the pooled fit, so runtime scales roughly(K+1)×. -
.github/workflows/build-wheels.yml— Rust Phase-2 cibuildwheel matrix workflow (macOS arm64 + x86_64, manylinux_2_17 x86_64 + aarch64, musllinux_1_2 x86_64, Windows x86_64) with acheck_rust_presentguard job that makes the workflow a no-op whenrust/statspai_hdfe/Cargo.tomlis absent (the state onmain). The workflow activates automatically onfeat/rust-hdfe/feat/rust-phase2and on PRs touchingrust/**, so the Rust spike's CI lights up the moment the branch is ready — no second PR for CI scaffolding.
Tests (0.9.16)¶
tests/test_bayes_mte_bivariate_normal.py— 7 tests covering API validation (selection + first_stage gates, poly_u override), structural-param presence in posterior, method label contents, and slope recovery on a genuine trivariate-normal DGP at n=800.tests/test_bayes_did_cohort.py— 9 tests covering back-compat (no cohort → single-row tidy identical to v0.9.15), cohort fit populates summaries, multi-row tidy viaper_cohort+ explicit list, unknown-term raises, τ ordering recovered on a two-cohort staggered DGP with heterogeneous true ATTs (2.0 vs 0.5), and cohort weights recorded in model_info.tests/test_bayes_iv_per_instrument.py— 8 tests covering back-compat, per-instrument summary population,per_instrumenttidy, explicit-list tidy, unknown-term raises, error path when asking forper_instrumenttidy without the sub-fit, and each sub-fit's HDI covers the true LATE on a two-Z DGP.
Not in this release¶
- Round-trip testing of the cibuildwheel matrix on real runner
hardware — this must happen on
feat/rust-hdfe, where the crate lives. The workflow onmainis inert by design.
[0.9.15] - 2026-04-20 — Multi-term tidy(terms=[...]) + ATT/ATU prob_positive¶
Completes the broom-pipeline integration of v0.9.13's per-population
ATT/ATU uncertainty. Users can now pd.concat ATE/ATT/ATU rows
across fits in one call.
Added (0.9.15)¶
BayesianMTEResult.tidy(conf_level=None, terms=None)override:terms=None(default) — unchanged, single ATE row.terms='ate' | 'att' | 'atu'— single row of that term.terms=['ate', 'att', 'atu']— multi-row DataFrame.-
Invalid names → clear
ValueError. -
Two new result fields:
att_prob_positive,atu_prob_positive(NaN-defaulted for pre-v0.9.15 snapshot compatibility). Populated by_integrated_effectfrom per-draw ATT/ATU posteriors. -
_integrated_effectreturns 5-tuple(mean, sd, hdi_lower, hdi_upper, prob_positive). Caller unpacks + passes to the result.
Round-B review found 1 HIGH; Round-C fixed¶
-
HIGH-1 — label divergence: default
tidy()emitsterm='ate (integrated mte)'(via parentestimand.lower()), buttidy(terms='ate')emitted the short literal'ate'. Byte- compat broken when a user mixed both call styles insidepd.concat. Fixed —_row('ate')now also usesself.estimand.lower()so both paths produce identical rows. ATT / ATU rows keep their short labels (no parent-default precedent; short is the natural broom shape for new terms). -
Round C: 0 blockers.
Tests (0.9.15)¶
tests/test_bayes_mte_tidy.py(13 tests) — back-compat default schema, single-term paths for all three labels, multi-row order preservation, concat workflow, invalid-term + mixed-valid rejection, NaN prob_positive stub back-compat, prob_positive scalars populated on real fits, default-vs-explicit label byte-parity (Round-C regression).- Bayesian family suite: 101/101 focused tests green.
Design spec (0.9.15)¶
docs/superpowers/specs/2026-04-20-v0915-tidy-multiterm.md
Non-goals (0.9.15)¶
- Multi-term
.tidy()on other Bayesian estimators — DID/RD/IV have no ATT/ATU concept; the primary-estimand row is already what they emit. - Full bivariate-normal HV model.
- Rust Phase 2.
[0.9.14] - 2026-04-20 — Summary rendering completes v0.9.13 spec §3.3¶
Tiny patch release. Completes the "ATT/ATU in summary()" promise
from v0.9.13 spec §3.3 that was not actually wired at ship time
(the six uncertainty fields landed but summary() never printed
them).
Added (0.9.14)¶
-
BayesianMTEResult.summary()override. ExtendsBayesianCausalResult.summarywith aPopulation-integrated effectsblock:ATT: 0.2407 (sd 0.0370, 95% HDI [0.1693, 0.3136]) ATU: 0.2147 (sd 0.0435, 95% HDI [0.1341, 0.2947])
Rendered inside the framing = ruler for visual coherence.
Silently skipped when either SD is NaN (empty subpopulation or
pre-v0.9.13 deserialised result).
Round-B review: no blockers¶
Reviewer confirmed:
1. base.endswith('=' * 70) is exact — parent summary() returns
'\n'.join(lines) with the rule as the final element.
2. Block splicing preserves the closing ruler visually.
3. NaN stub path is safe; fallback branch is defensive.
4. 'ATT:' / 'ATU:' are unique substrings — no collision with
parent output.
5. Pure reader; thread-safe.
Tests (0.9.14)¶
tests/test_bayes_mte_uncertainty.pynow has:test_summary_shows_att_atu_uncertainty— after fit, string contains'ATT:','ATU:','sd ','HDI ['.test_summary_skips_att_atu_when_nan— NaN-SD stub → no'ATT:'/'ATU:'in output.- Full Bayesian suite: 88/88 focused MTE + sibling green in 1:55.
Non-goals (0.9.14)¶
.tidy()multi-row variant with ATE/ATT/ATU as separate rows — queued for v0.9.15+.- Full bivariate-normal HV model.
- Rust Phase 2.
[0.9.13] - 2026-04-20 — ArviZ HDI compat shim + ATT/ATU uncertainty¶
Small-but-load-bearing cleanup release. Closes two items deferred across the v0.9.10 / v0.9.11 / v0.9.12 code reviews.
Added (0.9.13)¶
-
_az_hdi_compat(samples, hdi_prob)instatspai.bayes._base— callsaz.hdi(samples, hdi_prob=...)first, falls back toaz.hdi(samples, prob=...)onTypeError. Routes everyaz.hdi(...)call site in the Bayesian sub-package through one place so the inevitable arviz ≥ 0.18 kwarg rename is a one-line change. Previously identified as time-bomb by v0.9.12 round-C review. -
ATT / ATU uncertainty on
BayesianMTEResult: att_sd,att_hdi_lower,att_hdi_upperatu_sd,atu_hdi_lower,atu_hdi_upper
_integrated_effect now returns (mean, sd, hdi_lower,
hdi_upper) instead of (mean, sd). posterior_sd on the parent
result already covers ATE uncertainty — no redundant ate_sd.
- Appended-at-end field order on
BayesianMTEResult— all six new fields are NaN-defaulted and positioned after the v0.9.12 schema (selection). Serialised results from earlier releases deserialise cleanly.
Round-B code review found no blockers¶
Reviewer confirmed:
1. _az_hdi_compat fallback shape correct for any future arviz
kwarg rename.
2. Dataclass field order verified via live introspection.
3. No __hash__ risk on NaN fields; broom-style .tidy() /
.glance() intentionally do not surface the new SD/HDI fields
(opt-in access).
4. Imports clean in mte.py + hte_iv.py.
5. Empty-population NaN guardrail is defensive-only; unreachable
from bayes_mte because _logit_propensity enforces 2-class
requirement upstream. Test renamed to reflect this honestly.
One MEDIUM item (test-docstring mislabel) fixed inline.
Incident log¶
A mass regex rewrite from az.hdi(...) to _az_hdi_compat(...)
accidentally matched the helper's own body, creating a
_az_hdi_compat → _az_hdi_compat self-recursion. Caught by running
the Bayesian focused suite (would have been a stack-overflow the
moment any Bayesian estimator shipped). Reverted + re-applied
manually in the same session before tests ever ran outside dev.
Tests (0.9.13)¶
tests/test_bayes_hdi_compat.py(4 tests) — forwards on current arviz, falls back on monkey-patched future arviz, returns length-2 array, propagatesTypeErrorwhen both kwargs rejected (no silent success).tests/test_bayes_mte_uncertainty.py(4 tests) — ATT/ATU SD populated + > 0, HDI brackets mean, no redundantate_sd, realistic- DGP both-finite.- Bayesian family suite: 145/145 focused MTE + sibling tests green.
Design spec¶
docs/superpowers/specs/2026-04-20-v0913-hdi-compat-and-att-sd.md
Non-goals (0.9.13)¶
- Full bivariate-normal HV
(U_0, U_1, V) ~ N(0, Σ)— stays queued. - Rust Phase 2.
- Expose ATT/ATU HDI on
.tidy()— today.tidy()describes the primary estimand (ATE); adding a multi-row variant for ATT/ATU is a v0.9.14+ API question.
[0.9.12] - 2026-04-20 — Probit-scale MTE (Heckman selection frame)¶
Adds the third orthogonal axis to sp.bayes_mte: the MTE polynomial
can now be fit on either the uniform scale U_D ∈ [0, 1]
(v0.9.11 default) or the probit / V scale
V = Φ^{-1}(U_D) ∈ ℝ — the conventional Heckman (1979) / HV 2005
frame. All (first_stage, mte_method, selection) combinations fit.
Added (0.9.12)¶
sp.bayes_mte(..., selection='uniform' | 'normal')— new kwarg.'uniform'(default) preserves v0.9.11 behaviour: polynomial inU_D ∈ [0, 1].-
'normal'reinterprets the abscissa asV = Φ^{-1}(U_D)viapt.sqrt(2) * pt.erfinv(2a-1)on the tensor side andscipy.stats.norm.ppfon numpy side. Under strict HV + bivariate- normal,poly_u=1 + selection='normal' + mte_method='hv_latent'exactly recovers the linear Heckman MTE slope. -
mte_curveexposesvcolumn underselection='normal'(empty otherwise) so users can plot on the scale their model was fit on. -
Shared
PROBIT_CLIPconstant instatspai.bayes._base— fit-time, ATT/ATU integrator, andpolicy_effectall read the same clip so the three paths stay numerically consistent.
Empirical recovery on Heckman DGP (true (b_0, b_1) = (0.5, 1.5))¶
| combo | b_0 |
b_1 |
|---|---|---|
| plugin × polynomial × V | -0.73 | 0.82 |
| plugin × hv_latent × V | 0.42 | 1.37 ✓ |
| joint × polynomial × V | -0.73 | 0.81 |
| joint × hv_latent × V | 0.46 | 1.40 ✓ |
Same story as earlier releases: hv_latent recovers truth;
polynomial fits g(v) not MTE(v) and is biased.
Round-B review found 2 BLOCKERS + 2 HIGHs; Round-C fixed all¶
- BLOCKER-1:
_integrated_effect(ATT/ATU) was raisingU_populationto polynomial powers directly, even under'normal'where the posterior is on V scale. Fixed — transforms toΦ^{-1}(U_population)first. - BLOCKER-2:
BayesianMTEResult.policy_effectcomputedu_pow = [u^k ...]instead of[v^k ...]under'normal', silently integrating a V-scale polynomial against u-scale powers. Fixed —BayesianMTEResultnow carries aselectionfield, andpolicy_effecttransforms the grid to V scale when needed. Regression test assertspolicy_effect(policy_weight_ate())matches.ateto 1e-8 under'normal'. - HIGH-1:
mte_curvelacked avcolumn — added. - Round-C follow-up: extracted
PROBIT_CLIP = 1e-6to a shared module constant consumed by bothmte.pyand_base.pyso the three-site fit/summary/policy paths cannot drift.
Tests (0.9.12)¶
tests/test_bayes_mte_selection.py(NEW, 12 tests) — back-compat, method-label, Heckman DGP recovery, all-8-combo orthogonality, input validation,vcolumn presence/absence, ATT/ATU V-scale correctness (Round-C regression),policy_effectV-scale parity with.ate(Round-C regression), uniform-vs-normal non-trivial disagreement.- 78 focused MTE tests green.
Non-goals (0.9.12)¶
- Full bivariate-normal error covariance
(U_0, U_1, V) ~ N(0, Σ)with freeρ_{0V},ρ_{1V}— convergence-intensive MvNormal mixture, queued for 0.9.13+. - Rust Phase 2 — separate branch.
[0.9.11] - 2026-04-20 — Multi-instrument MTE + true CHV-2011 PRTE weights¶
Closes two long-standing API gaps plus an empirical math debt.
Added (0.9.11)¶
sp.bayes_mte(instrument: str | Sequence[str], ...)— MTE now accepts multiple instruments, matchingsp.bayes_iv/sp.bayes_hte_iv. Scalar calls unchanged.sp.policy_weight_observed_prte(propensity_sample, shift)— true Carneiro-Heckman-Vytlacil (2011) PRTE weights from the observed propensity distribution viakde.integrate_box_1d(u-Δ, u) / Δ(CDF difference). Closes the v0.9.9 docstring gap wherepolicy_weight_prtewas flagged stylised.
Round-B review found 2 HIGH + 3 MEDIUM; all fixed¶
- CHV sign bug — my original
(kde(u) - kde(u-Δ))/ΔAND the reviewer's proposed swap were both wrong (both compute derivative of density, not CDF difference). Self-sweep verified CHV-2011 Theorem 1 is a CDF difference. Fixed viaintegrate_box_1d. Empirical: uniform propensity + Δ=0.2 now gives the textbook trapezoid; previously gave a spurious boundary spike. - Unconditional
np.clip(w, 0, None)silently altered the estimand. Dropped — contraction policies now yield signed negative weights, matching CHV convention. gaussian_kdethread safety — forced covariance precomputation inside the builder.model_info['instrument']type varied — dropped the raw key; onlyinstruments(list) +n_instrumentsremain.- Back-compat test uses relative-to-posterior-SD tolerance.
Tests (0.9.11)¶
tests/test_bayes_mte_multi_iv.py(9 tests).tests/test_bayes_mte_policy.py(+7 tests).- 61 focused MTE tests green.
Code review¶
- Round B agent: 5 items. Self-sweep caught one HIGH the agent got wrong. All 5 fixed.
- Round C agent: zero ship-blockers.
Design spec (0.9.11)¶
docs/superpowers/specs/2026-04-20-v0911-multi-iv-mte-observed-prte.md
[0.9.10] - 2026-04-20 — HV-latent MTE (textbook Heckman-Vytlacil via latent U_D)¶
Closes the semantic debt v0.9.9 flagged but did not pay: the
previous releases fitted a polynomial in the propensity p_i
(g(p) = LATE-at-propensity), which coincides with the textbook
MTE only under HV-2005 linear-separable + bivariate-normal errors.
v0.9.10 adds an opt-in fully HV-faithful model that samples a
latent U_D_i ~ Uniform(0, 1) per unit via the truncated-uniform
reparameterisation trick, making the fitted polynomial a genuine
posterior over tau(u) = E[Y_1 - Y_0 | U_D = u].
Added (0.9.10)¶
sp.bayes_mte(..., mte_method='polynomial' | 'hv_latent')— new kwarg, orthogonal to the existingfirst_stagekwarg.'polynomial'(default) — v0.9.9 behaviour; polynomial in propensity.-
'hv_latent'— textbook HV. For each unit, sampleraw_U_i ~ Uniform(0, 1), then transform deterministically:D_i = 1 ⇒ U_D_i = raw_U_i · p_i ∈ [0, p_i] D_i = 0 ⇒ U_D_i = p_i + raw_U_i·(1 - p_i) ∈ [p_i, 1]The polynomial is then evaluated at
U_D_i(notp_i). Structural equation:Y_i = α + β_X' X_i + D_i · τ(U_D_i) + ε_i.
Orthogonal to first_stage: all four
(plugin|joint) × (polynomial|hv_latent) combinations run.
- Memory-warning guard —
hv_latentregisters a shape-(n,) latent stored as(chains, draws, n)in the posterior. The function emits aUserWarningwhenn × draws × chains > 50,000,000(~400 MB at f64), mentioningdraws,chains, andmte_method='polynomial'as mitigations.
HV-augmentation factorisation (documented in docstring)¶
bayes_mte uses the standard Form-2 data-augmentation
factorisation:
p(Y, D, U_D | p, θ) = p(Y | U_D, D, θ) · p(U_D | D, p) · p(D | p)
where the truncated-uniform transform gives p(U_D | D, p) and
pm.Bernoulli(D | p) gives the marginal p(D | p). Both are
needed — dropping the Bernoulli in a counter-factual experiment
made piZ flip sign (true 0.8 → posterior -1.01) and biased the
MTE polynomial to [0.81, 1.25] vs true [2, -2]. This test is
documented in the v0.9.10 round-B code review.
Empirical recovery evidence¶
Decreasing-MTE DGP with truth (b_0, b_1) = (2.0, -2.0):
| combo | b_0 posterior | b_1 posterior | recovers? |
|---|---|---|---|
| plugin × polynomial | 1.73 | -0.43 | biased |
| plugin × hv_latent | 2.03 | -2.13 | ✓ |
| joint × polynomial | 1.73 | -0.44 | biased |
| joint × hv_latent | 2.05 | -2.16 | ✓ |
The polynomial modes are systematically biased on HV DGPs — the honesty caveat v0.9.9 added is empirically validated; hv_latent is the mathematical fix.
Method label¶
polynomial→"Bayesian treatment-effect-at-propensity (...)"(v0.9.9 label retained).hv_latent→"Bayesian HV-latent MTE (...)".
Tests (0.9.10)¶
tests/test_bayes_mte_hv_latent.py(10 tests) — API, recovery of true(b_0, b_1) = (2, -2)on an HV DGP, disagreement with polynomial mode on same DGP, orthogonality withfirst_stage='joint', input validation, memory-warning fires above threshold (unittest.mock), memory-warning stays silent below threshold,policy_effectstill works on hv_latent results.
Code review (two rounds)¶
- Round B (agent) raised 3 HIGH items:
- "Double-counting Bernoulli" — rejected after math + counter- factual. Form-2 factorisation is correct; dropping Bernoulli wildly biased the result. Defended in docstring.
- "Marginal U_D not Uniform(0,1)" — rejected after algebra.
p(U_D|p) = p·U(0,p) + (1-p)·U(p,1) = Uniform(0,1)holds. - "Memory blow-up" — accepted; added
UserWarning. - Round C (agent) on the round-B resolutions: no ship-blockers. One cosmetic nit on the integration notation in the docstring fixed inline.
Design spec¶
docs/superpowers/specs/2026-04-20-v0910-hv-latent-mte.md
Non-goals (0.9.10)¶
- Full bivariate-normal error structure on
(U_0, U_1, U_D)— linear-separable only. Natural 0.9.11+ extension. - Multi-instrument HV MTE.
- GP over
u(still polynomial of orderpoly_u). - Rust Phase 2 — branch work.
Article-surface round-2: namespace fixes + kwarg alignment¶
Completes the API-cleanup thread started by v0.9.9's first alias pass.
The 2026-04-20 survey post advertises sp.matrix_completion,
sp.causal_discovery, sp.mediation, sp.evalue_rr, plus
article-style kwargs on sp.policy_tree / sp.dml — all of which
either resolved to the submodule or rejected the blog-post kwargs
before this round.
Added — article-facing aliases¶
sp.matrix_completion(df, y, d, unit, time)— thin wrapper oversp.mc_panel, renamesd → treat. Shadows the former module binding.sp.causal_discovery(df, method='notears'|'pc'|'ges'|'lingam', variables=None)— dispatcher. Handles each backend's native signature (notears/pc acceptvariables=; ges/lingam do not, so the dispatcher subsets the frame upfront).sp.mediation(df, y, d, m, X)— article wrapper oversp.mediate; shadows the former module binding.sp.evalue_rr(rr, rr_lower=None, rr_upper=None)— risk-ratio convenience for the shape documented in the blog post.sp.policy_treeaccepts eitherd=/treat=,X=/covariates=, anddepth=/max_depth=. Conflicting values raiseTypeError.sp.dmlacceptsmodel_y=/model_d=as aliases forml_g/ml_m, and the same dual-convention naming.
Hardened¶
sp.auto_didnow fails fast withTypeErrorwhengis a non-numeric cohort label (BJS branch silently misbehaves otherwise).AutoDIDResult.__repr__/AutoIVResult.__repr__now return a one-line summary (Jupyter list-of-results display); call.summary()for the full leaderboard.statspai.agent.tools._default_serializeris now scalar-safe (new_scalar_or_nonehelper) — handles Series-valued result fields without crashing JSON serialisation.
Reverted — deliberate non-goal¶
- An experimental addition of
.estimate/.se/.pvalue/.ciproperties toEconometricResultswas reverted when regression testing showed it brokeagent/tools.pyandcausal_workflow.pywhich usehasattr(r, 'estimate')to dispatch between scalarCausalResultand multi-coefEconometricResults. A NOTE incore/results.pydocuments why the aliases are intentionally absent; use.tidy()for cross-estimator code.
Tests (article-surface round-2)¶
tests/test_article_aliases_round2.py adds 25 tests covering all of
the above, including the conflict-detection and backend-signature
branches flagged by the round-2 code review.
[0.9.9] - 2026-04-20 — Joint first-stage MTE + policy-relevant weights + honesty pass¶
Closes v0.9.8's two explicit follow-ons (joint first stage, policy-relevant weights) and ships a semantic correction on the MTE labelling that survived two rounds of code review.
Added (0.9.9)¶
-
sp.bayes_mte(..., first_stage='plugin' | 'joint')— new kwarg.'plugin'(default) preserves v0.9.8 behaviour: logit MLE computes propensity as a fixed constant.'joint'puts the first-stage logit coefficients inside the PyMC graph (pi_intercept,pi_Z, optionalpi_X), modelsD ~ Bernoulli(sigmoid(pi'W)), and evaluates the effect polynomial at the random propensity — so first-stage uncertainty propagates into the returned curve. 2-4× slower than plug-in but honest about identification noise. -
BayesianMTEResult.policy_effect(weight_fn, label, rope=None)(src/statspai/bayes/_base.py) — posterior summary ofint w(u) g(u) du / int w(u) duusing trapezoidal integration on the fit'su_grid. Withpolicy_weight_ate()it is now numerically identical to.ate(both trapezoid on the same grid) — test asserts< 1e-8parity. -
sp.policy_weight_*— four weight-function builders (src/statspai/bayes/policy_weights.py): policy_weight_ate()— uniform weight = 1.policy_weight_subsidy(u_lo, u_hi)— indicator on[u_lo, u_hi].policy_weight_prte(shift)— stylised rectangle around the mean propensity. The docstring leads with "NOT the textbook Carneiro-Heckman-Vytlacil 2011 PRTE" and shows a workedscipy.stats.gaussian_kdesnippet users can adapt for the true CHV PRTE with their observed propensity sample.policy_weight_marginal(u_star, bandwidth)— marginal PRTE at a specific propensity via a narrow band.
Semantic correction (honesty pass)¶
- Labelling fix: v0.9.8's fit was described as the "MTE curve",
but the structural model fits
g(p) = E[Y|D=1,P=p] - E[Y|D=0,P=p]— the treatment-effect-at-propensity function. Under the Heckman-Vytlacil (2005) linear-separable + bivariate-normal assumption,g(p) = MTE(p); under arbitrary heterogeneity,g(p)is a LATE summary at propensityp, not the textbook MTE(u). The module docstring now leads with this caveat and the method label reads"Bayesian treatment-effect-at-propensity"rather than"Bayesian MTE". Function name, result class name, and themte_curvefield are unchanged for API continuity — the "MTE" naming is retained because applied users expect it and search for it.
Performance¶
- Removed
pm.Deterministic('p', ...)from joint mode. Under largen, storing per-unit propensity per draw wasO(chains × draws × n)memory (e.g. 64MB at n=1000, draws=2000, chains=4). Post-hoc ATT/ATU propensity is now recomputed from the posterior means ofpi_intercept/pi_Z/pi_X.
Tests (0.9.9)¶
tests/test_bayes_mte_policy.py(NEW, 14 tests) — builders' input validation (bad bounds rejected, FP-safe grids), joint mode runs + agrees with plug-in on well-specified DGPs, policy_effect contract, trapezoid parity with.ateat 1e-8, top-level export of all four weight builders.
Code review¶
- Round-A (agent) found 4 items: B1 (semantic mislabel), H1 (normalisation inconsistency), H2 (memory blow-up under joint+ADVI), M1 (PRTE-builder naming).
- Round-B (agent) on the fixes confirmed no remaining blockers; one follow-up (test tolerance too loose after the H1 fix) was applied inline before shipping.
Design spec (0.9.9)¶
docs/superpowers/specs/2026-04-20-v099-mte-joint-policy-weights.md
Non-goals (0.9.9)¶
- Fully H-V-faithful joint model (sampling latent
U_Dper unit) — still a future release. Documented as the natural 0.9.10+ extension. - Multi-instrument MTE with per-instrument PRTE weights.
- Gaussian-process surfaces on
u(current release is polynomial). - Rust Phase 2 — branch work.
[0.9.8] - 2026-04-20 — Bayesian Marginal Treatment Effects + Pathfinder / SMC backends¶
Closes the two explicit next-batch items from v0.9.7's non-goals list. Ships the first Bayesian Marginal Treatment Effect estimator in the Python causal-inference stack and extends the sampler dispatch with two new backends.
Added (0.9.8)¶
sp.bayes_mte(data, y, treat, instrument, covariates=None, u_grid=..., poly_u=2, ...)(src/statspai/bayes/mte.py) — Heckman-Vytlacil (2005) Marginal Treatment Effects via PyMC. Returns aBayesianMTEResultwith:.mte_curve— DataFrame on the user-supplied (or default 19-point) grid of propensity-to-be-treated valuesU_D: columnsu, posterior_mean, posterior_sd, hdi_low, hdi_high, prob_positive..ate,.att,.atu— integrated MTE over the population / treated / untreated regions..plot_mte()— quick matplotlib visualisation of the MTE curve with an HDI ribbon.
Uses a plug-in logit first stage (same pragmatic shortcut as
bayes_iv): the Bayesian layer lies over the MTE polynomial
coefficients only. Asymptotically correct under correctly
specified first stage; explicit non-goal is full joint
first-stage-+-MTE posterior (queued for 0.9.9+).
-
inference='pathfinder'— new sampler backend routing to PyMC'spm.fit(method='fullrank_advi'). Captures pairwise covariance between parameters (mean-field ADVI misses this) at similar speed. Placeholder for when PyMC'spmx.fitstabilises; full-rank ADVI is the same spirit. -
inference='smc'— new sampler backend routing to PyMC'spm.sample_smc. Sequential Monte Carlo; slower than NUTS on unimodal posteriors but robust to multi-modal ones where NUTS gets stuck. Unlike ADVI / Pathfinder, SMC returns a multi-chain trace so R-hat stays meaningful. -
BayesianMTEResult— top-level export (sp.BayesianMTEResult). InheritsBayesianCausalResultand addsmte_curve,u_grid,ate,att,atu,.plot_mte(). -
Summary output now recognises the full sampler menu:
- NUTS / SMC: R-hat is meaningful; flagged on > 1.01.
- ADVI / Pathfinder: R-hat is variational and flagged as such with a concrete "use NUTS or SMC for calibrated uncertainty" caveat.
Design spec (0.9.8)¶
docs/superpowers/specs/2026-04-20-v098-bayes-mte-samplers.md
Tests (0.9.8)¶
tests/test_bayes_mte.py(9 tests) — API surface, flat-MTE recovery, monotone-MTE slope recovery, customu_grid,poly_u=1path, covariate plumbing, top-level export, missing-column and non-binary-treat validation.tests/test_bayes_advi.py(+5 tests) — Pathfinder on bayes_iv and bayes_did, SMC on bayes_iv and bayes_did, Pathfinder summary() caveat.
Non-goals (0.9.8, explicit)¶
- Full joint first-stage + MTE posterior (propagating first-stage
uncertainty into
tau(u)). Plug-in propensity is the v0.9.8 choice — correct asymptotically under correctly specified first stage; next release can add a joint model. - Multi-instrument MTE — requires policy-relevant weighting (Carneiro-Heckman-Vytlacil 2011) and is out of scope.
- Non-linear MTE surfaces (GP over
u) — polynomial of orderpoly_uis what this release supports. - Rust Phase 2 — stays on
feat/rust-hdfebranch.
[0.9.7] - 2026-04-20 — Heterogeneous-effect Bayesian IV + ADVI toggle¶
Closes two of the three items queued at v0.9.6's "诚实汇报" list. The third (Bayesian bunching) is explicitly declined — see the "Non-goals" section below.
Added (0.9.7)¶
sp.bayes_hte_iv(data, y, treat, instrument, effect_modifiers, ...)(src/statspai/bayes/hte_iv.py) — Bayesian IV with a linear CATE-by-covariate model. Returns aBayesianHTEIVResultcarrying:- Average LATE (
tau_0, at modifier means) with posterior + HDI. .cate_slopesDataFrame — one row per effect modifier with posterior mean, SD, HDI, andprob_positive..predict_cate(values: dict) -> dict— posterior summary of the CATE at user-specified modifier values.
Model:
D = pi_0 + pi_Z' Z + pi_X' X + v
tau(M) = tau_0 + tau_hte' (M - M_bar)
Y = alpha + tau(M) * D + beta_X' X + rho * v_hat + eps
Control-function formulation keeps NUTS sampling tractable. Multiple instruments + multiple modifiers + exogenous controls all supported.
inference='nuts' | 'advi'parameter on every Bayesian estimator —bayes_did,bayes_rd,bayes_iv,bayes_fuzzy_rd, and the newbayes_hte_iv. Under'advi'the estimator goes throughpm.fit(method='advi')for a 10-50× speedup at the cost of mean-field calibration.rhatis reported asNaNin ADVI mode (no meaning for variational approximations).
A shared _sample_model helper now owns sampling dispatch, so
future backends ('smc', 'pathfinder') plug in trivially.
BayesianHTEIVResult— top-level export (sp.BayesianHTEIVResult). ExtendsBayesianCausalResultwithcate_slopes,effect_modifiers, andpredict_cate(...).
Design spec (0.9.7)¶
docs/superpowers/specs/2026-04-20-v097-bayes-hte-iv-advi.md
Tests (0.9.7)¶
tests/test_bayes_hte_iv.py(8 tests) — API surface, avg-LATE recovery on heterogeneous DGP, slope recovery, null-slope coverage on homogeneous DGP,predict_cateschema, multi-modifier fit, input validation.tests/test_bayes_advi.py(10 tests) — ADVI runs on all five Bayesian estimators, posterior means finite,model_info['inference']reports correctly, invalid inference modes raise across the parametrised five-function set.
Non-goals (0.9.7, explicitly declined)¶
-
Bayesian bunching (
sp.bayes_bunching) — after review we decline. Kleven / Saez / Chetty bunching estimators are structural public-finance models whose identification depends on utility / optimisation parameterisations that don't generalise across kink types, priors on taste heterogeneity that are domain-specific and hard to default well, and model fits only as interpretable as the structural model itself. This defeats the package's "agent-native one-liner" thesis. The frequentistsp.bunchingstays where it is. We revisit only on a concrete user use-case that fits the agent-native workflow. -
MTE / complier-heterogeneity IV — queued for 0.9.8+.
- Extra VI backends beyond ADVI (Pathfinder, SMC) —
_sample_modelis now extensible but the backends stay out of this release. - Rust Phase 2 — on
feat/rust-hdfebranch until the cibuildwheel matrix is green.
[0.9.6] - 2026-04-20 — Bayesian IV + fuzzy RD + per-learner Optuna + Rust branch + g-methods family¶
This release bundles two independent sprints that landed the same day:
Sprint A — Bayesian depth + tuning granularity + Rust branch¶
- Bayesian 口袋深度 — adds
sp.bayes_ivandsp.bayes_fuzzy_rd. - Optuna 粒度 —
sp.auto_cate_tunednow supportstune='nuisance'(v0.9.5 behaviour),tune='per_learner', andtune='both'. - Rust 工作流 —
feat/rust-hdfebranch opened with Cargo crate scaffold;mainstays maturin-free.
Sprint B — G-methods family, Proximal, Principal Stratification¶
Closes a causal-inference-coverage audit against the 2026-04-20 gap
table: ships DML IIVM, g-computation, front-door estimator, MSM,
interventional mediation, plus two new top-level modules
Proximal Causal Inference and Principal Stratification. After
self-review, a second pass re-polished weight-semantics, bootstrap
diagnostics, MC vectorisation, and did a full DML internal refactor
(four per-model files sharing _DoubleMLBase).
Added¶
-
sp.bayes_iv(data, y, treat, instrument, covariates=None, ...)(src/statspai/bayes/iv.py) — Bayesian linear IV via a control-function formulation. First-stage OLS residuals enter the structural equation as an exogeneity correction, so the posterior on the LATE equals 2SLS asymptotically while remaining trivially sampleable in PyMC. Accepts a single instrument or a list. The HDI widens naturally as the instrument gets weaker (no "F < 10" cliff — the posterior prices identification automatically). -
sp.bayes_fuzzy_rd(data, y, treat, running, cutoff, ...)(src/statspai/bayes/fuzzy_rd.py) — Bayesian fuzzy RD via joint ITT-on-Y and ITT-on-D local polynomials with a deterministic ratio for the LATE. Under partial compliance the posterior inherits both noise channels (Wald-ratio posterior); under full compliance it collapses to the sharp RD result. Non-binary uptake is rejected with a clear error.model_inforeportsfirst_stage_mean/first_stage_sdso users can eyeball compliance. -
sp.auto_cate_tuned(..., tune='nuisance' | 'per_learner' | 'both')— newtuneflag toggles between three regimes: -
'nuisance'(default, v0.9.5 behaviour): shared outcome / propensity GBMs tuned against held-out R-loss. 'per_learner': each learner's final-stage CATE model is tuned independently against held-out R-loss; nuisance stays at defaults.model_info['per_learner_params']and['per_learner_r_loss']are populated; the best learner's tuned CATE model is fed toauto_cateas a hint.'both': tune the nuisance first, then per-learner CATE on top of that nuisance.
Also adds n_trials_per_learner (defaults to max(5, n_trials//3))
and per_learner_search_space. Selection-rule text now records
which tuning regime ran.
feat/rust-hdfebranch (pushed, not merged) — Cargo crate scaffold plus PyO3 stub for the eventualgroup_demeankernel.mainstays maturin-free sopip install statspaiis unaffected.
Design spec¶
docs/superpowers/specs/2026-04-20-v096-bayes-iv-fuzzyrd-perlearner.md
Tests¶
tests/test_bayes_iv.py(8 tests) — API, top-level export, strong-IV recovery, weak-IV HDI widens, multi-instrument fit, covariate plumbing, input validation, tidy/glance shape.tests/test_bayes_fuzzy_rd.py(7 tests) — API, recovery under partial compliance, sharp-equivalence under full compliance, bandwidth shrinks sample, first-stage diagnostics reported, non-binary uptake rejected.tests/test_auto_cate_tuned.py(+5 tests) — invalidtunemode rejected,'per_learner'populates params, no nuisance metadata leaks in per_learner mode,'both'mode covers both channels, selection_rule mentions per-learner tuning.
Non-goals (deferred)¶
- Bunching Bayesian estimator (Kleven-style is structural / macro-flavoured; poor fit for the agent-native API). Queue for 0.9.7.
- Heterogeneous-effect Bayesian IV — LATE only in this release.
- VI sampler (ADVI) — NUTS only.
- Rust kernel merged to
main— stays onfeat/rust-hdfeuntil the cibuildwheel matrix is green.
Added (Sprint B)¶
-
sp.dml(..., model='iivm', instrument=Z)(src/statspai/dml/iivm.py) — Interactive IV (binary D, binary Z) DML estimator for LATE. Uses the efficient-influence-function ratio of two doubly-robust scores(ψ_a, ψ_b)with Neyman-orthogonal cross-fitting; SE via delta-method on the ratio. Weak-instrument guard raisesRuntimeErrorwhen|E[ψ_b]| ≈ 0. Class form:sp.DoubleMLIIVM. -
sp.DoubleMLPLR / DoubleMLIRM / DoubleMLPLIV / DoubleMLIIVM(src/statspai/dml/*.py) — each DML model family now lives in its own file with a shared_DoubleMLBaseindml/_base.pythat handles validation, default learners (auto-selecting classifier vs regressor per model), cross-fitting, andCausalResultconstruction. The legacysp.DoubleML(model=...)façade still works. -
sp.g_computation(data, y, treat, covariates, estimand='ATE'|'ATT'|'dose_response', ...)(src/statspai/inference/g_computation.py) — Robins' (1986) parametric g-formula / standardisation estimator. Supports binary treatment (ATE, ATT) and continuous treatment dose-response grids. Default OLS outcome model or any sklearn-compatible learner viaml_Q=. Nonparametric bootstrap SE with NaN-based failure tracking (model_info['n_boot_failed']) — replaces silent point-estimate fallback that would shrink SE. -
sp.front_door(data, y, treat, mediator, covariates=None, mediator_type='auto', integrate_by='marginal'|'conditional', ...)(src/statspai/inference/front_door.py) — Pearl (1995) front-door adjustment estimator. Closed-form sums for binary mediator; Monte Carlo integration over a Gaussian conditional density for continuous mediator. Two identification variants exposed:integrate_by='marginal'(Pearl 95 aggregate formulation) and'conditional'(Fulcher et al. 2020 generalised front-door). Bootstrap SE with NaN-based failure tracking. -
sp.msm(data, y, treat, id, time, time_varying, baseline=None, exposure='cumulative'|'current'|'ever', family='gaussian'|'binomial', trim=0.01, ...)(src/statspai/msm/) — Robins-Hernán-Brumback (2000) Marginal Structural Models via stabilised IPTW. Handles time-varying treatment + time-varying confounders (binary or continuous). Weighted pooled regression of outcome on exposure history with cluster-robust CR1 sandwich at the unit level.sp.stabilized_weights(...)is exposed as a standalone helper for users who want the weights without fitting the outcome model. -
sp.mediate_interventional(data, y, treat, mediator, covariates=None, tv_confounders=None, ...)(src/statspai/mediation/mediate.py) — VanderWeele, Vansteelandt & Robins (2014) interventional (in)direct effects. Identified in the presence of a treatment-induced mediator-outcome confounder (tv_confounders=[...]) where natural (in)direct effects are not. Fully vectorised MC integration (~100× faster than naïve per-observation loop). -
sp.proximal(data, y, treat, proxy_z, proxy_w, covariates=None, n_boot=0, ...)(src/statspai/proximal/) — Proximal Causal Inference (Tchetgen Tchetgen et al. 2020; Miao, Geng & Tchetgen Tchetgen 2018) via linear 2SLS on the outcome bridge function. Handles ATE identification with an unobserved confounder when two proxies (treatment-sideZand outcome-sideW) are available. Reports a first-stage F-stat for the proxy equation and warns when F < 10. Optional nonparametric bootstrap SE vian_boot=. -
sp.principal_strat(data, y, treat, strata, covariates=None, method='monotonicity'|'principal_score', ...)(src/statspai/principal_strat/) — Principal Stratification (Frangakis & Rubin 2002).method='monotonicity'applies the Angrist-Imbens-Rubin compliance decomposition to identify the complier PCE (= LATE) and returns Zhang-Rubin (2003) sharp bounds for the always-survivor SACE.method='principal_score'implements Ding & Lu (2017) principal-score weighting to point-identify always-taker / complier / never-taker PCEs under principal ignorability. Returns a dedicatedPrincipalStratResultwithstrata_proportions,effects,bounds. -
sp.survivor_average_causal_effect(data, y, treat, survival, ...)— friendly wrapper aroundprincipal_strat(method='monotonicity')for the classical truncation-by-death problem. Reports SACE midpoint + endpoint-union confidence interval.
Changed (Sprint B)¶
-
MSM binomial outcome family:
_weighted_logit_clusterreplaced the previousstatsmodels.GLM(freq_weights=w)call (which treats weights as integer replication counts) with a hand-rolled IRLS that uses probability-weight semantics. Matches Cole & Hernán (2008) and Stata'spweightconvention for IPTW. -
Bootstrap failure reporting:
g_computation,mediate_interventional,front_door, andproximalnow leave failed bootstrap replications asNaN, emit aRuntimeWarningwith the failure count and first error message, and recordn_boot_failed/n_boot_success/first_bootstrap_errorinmodel_info. If fewer than two replications succeed, a cleanRuntimeErroris raised rather than silently under-estimating SE. -
mediate_interventionalMC loop: the previousO(n × n_mc)Python comprehension is replaced by a closed-form vectorisation that exploits OLS linearity of the outcome model in the treatment-induced-confounder block (X_tv). The outer expectation over units collapses toβ_tv · mean(X_tv), reducing runtime toO(n_mc + n)and giving a measured ~100× speed-up on the reference configuration (n=800, n_boot=200, n_mc=300 drops from ~4 s to ~0.04 s). -
sp.dmlinternal layout: the 466-line single-classdml/double_ml.pyis split into five files (_base.py+plr.py+irm.py+pliv.py+iivm.py) each owning a single Neyman-orthogonal score and its validation. The publicdml()function andDoubleMLclass are unchanged; new per-model classes are now directly importable. -
sp.front_doorwith covariates and continuous mediator gainedintegrate_by(see Added).
Tests (Sprint B)¶
tests/test_dml_iivm.py(5 tests) — LATE recovery on one-sided-noncompliance DGP, significance, binary-D/binary-Z validation,model_infofields.tests/test_dml_split.py(5 tests) — direct-class API equals dispatcher, legacyDoubleMLfaçade, PLIV rejects multi-instrument list.tests/test_g_computation.py(5 tests) — ATE / ATT / dose-response curves recovered within tolerance, validation errors.tests/test_front_door.py(4 tests) — continuous-M and binary-M ATE recovery on DGP with unobserved confounder, strictly closer to truth than naïve OLS.tests/test_front_door_integrate_by.py(3 tests) — marginal and conditional variants both recover truth, invalid values rejected.tests/test_msm.py(5 tests) — cumulative-exposure slope recovery, stabilised-weight shape / mean,exposure='ever'requires binary treatment, weight diagnostics exposed.tests/test_mediate_interventional.py(4 tests) — IIE + IDE decomposition additivity, total-effect sign, binary-D validation.tests/test_proximal.py(6 tests) — linear-bridge ATE recovery, strictly-better-than-OLS, order-condition check, covariate compatibility, bootstrap SE path, first-stage F reported.tests/test_principal_strat.py(7 tests) — monotonicity LATE- stratum proportions, valid SACE bounds, principal-score method with informative X, input validation, SACE helper.
Notes (Sprint B)¶
- No new required dependency. All additions use NumPy / pandas / scipy / scikit-learn only (statsmodels optional).
- Full new-module suite: 44 new tests pass; the existing 28 DML + mediation regression tests still pass; full collection reports 1960 tests, zero import errors introduced by this sprint.
[0.9.5] - 2026-04-20 — Bayesian causal + Optuna-tuned CATE + Rust spike¶
This release closes three items from the v0.9.4 post-release retrospective (Section 8 "认怂" list):
- Bayesian causal —
sp.bayes_did+sp.bayes_rd(PyMC). - ML CATE調参 —
sp.auto_cate_tuned(Optuna). - Rust HDFE kernel — spec + benchmark harness shipped;
actual Rust crate deferred to 1.0 on a dedicated branch (any
maturinchange topip installis postponed until a full cross-platform wheel matrix is green).
Added¶
-
sp.bayes_did(data, y, treat, post, unit=None, time=None, ...)(src/statspai/bayes/did.py) — Bayesian difference-in-differences via PyMC. 2×2 for no panel indices, hierarchical Gaussian random effects whenunitand/ortimeare supplied. NUTS sampler, configurable priors,rope=(lo, hi)for "practical equivalence" posterior probabilities. Returns aBayesianCausalResultwith posterior mean/median/SD, 95 % HDI,prob_positive,rhat,ess, and the full ArviZInferenceDataon.tracefor downstream plotting. -
sp.bayes_rd(data, y, running, cutoff, bandwidth=None, poly=1, ...)(src/statspai/bayes/rd.py) — Bayesian sharp regression discontinuity with local polynomial (order ≥ 1) and Normal prior on the jump. Bandwidth defaults to0.5 * std(running). -
sp.BayesianCausalResult— sibling ofCausalResultwith broom-style.tidy()/.glance()/.summary()and Bayesian-native fields (hdi_lower,hdi_upper,prob_positive,prob_rope,rhat,ess). Slots into the same agent-nativepd.concat([r.tidy() for r in results])workflow as the frequentist estimators. -
sp.auto_cate_tuned(..., n_trials=25, timeout=None, search_space=None)(src/statspai/metalearners/auto_cate_tuned.py) — Optuna'sTPESamplersearches over the nuisance GBM hyperparameters (outcome and propensity model separately), scoring each trial by shared-nuisance held-out R-loss. Best trial's models are handed tosp.auto_cate; the winner'smodel_info['tuned_params']records the chosen HP and['n_trials']the search budget. Closes the econml "nuisance cross-validation before CATE" ergonomic gap. -
sp.fast.hdfe_bench(n_list, n_groups, repeat, seed, atol)(src/statspai/fast/bench.py) — benchmark harness for HDFE group-demean kernels. Times NumPy, Numba, and (future) Rust paths on the same DGPs and asserts correctness to ≤ 1 × 10⁻¹⁰ vs the NumPy reference. Unavailable backends are recorded, not crashed, so the same harness runs on CI environments that lack Numba and on dev boxes with a future Rust wheel installed. -
Optional install extras:
pip install "statspai[bayes]"pullspymc >= 5+arviz >= 0.15.pip install "statspai[tune]"pullsoptuna >= 3. Coreimport statspaiworks in either's absence; the estimators raise a cleanImportErrorat call time with the install recipe.
Design docs¶
docs/superpowers/specs/2026-04-20-v095-bayes-optuna-rust-spike.md— full spec for this release.docs/superpowers/specs/2026-04-20-v095-rust-hdfe-spike.md— the phased plan for the Rust HDFE port (crate layout, PyO3 FFI surface, cibuildwheel matrix, graceful-degradation contract).
Tests¶
tests/test_bayes_did.py(11 tests) — 2×2 + panel recovery, prob_positive calibration, HDI coverage, input validation, ROPE, tidy/glance shape.tests/test_bayes_rd.py(9 tests) — sharp recovery, null-effect HDI straddles 0, bandwidth shrinks local sample, poly=2 runs, validation errors.tests/test_auto_cate_tuned.py(7 tests) — API,n_trialsrespected, ATE recovery, custom search space honoured, invalid treatment rejected.tests/test_fast_bench.py(5 tests) — harness returnsHDFEBenchResult, dry-run <5 s, Numba/NumPy agree to 1e-10, unavailable paths recorded not crashed, summary string.
Non-goals (explicit)¶
- Variational inference (
pymc.fitADVI) — NUTS only for 0.9.5. - Bayesian fuzzy RD, IV, bunching — deferred to 0.9.6+.
- Rust crate itself — ships on a dedicated branch with a full
cibuildwheelmatrix; addingmaturintopyproject.tomlwithout that matrix would breakpip installfor some users.
[0.9.4] - 2026-04-20 — sp.auto_cate + strict identification¶
This release closes two concrete commitments from the 0.9.3 post-release
retrospective (社媒文档/4.20-升级说明/StatsPAI-0.9.3之后的一周…):
- Section 5 promise: "下一步打算加
strict_mode=True" onsp.check_identification. Delivered asstrict=Trueplus thesp.IdentificationErrorexception. - Section 8 gap: "ML CATE scheduling isn't as good as econml."
Delivered as
sp.auto_cate()— one-line multi-learner race with honest Nie-Wager R-loss scoring and BLP calibration.
Added¶
sp.auto_cate(data, y, treat, covariates, learners=('s','t','x','r','dr'))(src/statspai/metalearners/auto_cate.py, +400 LOC) — races the five meta-learners on shared cross-fitted nuisances, scores each on held-out predictions via the Nie-Wager R-loss, runs the Chernozhukov-Demirer-Duflo-Fernández-Val BLP calibration test on each, and returns anAutoCATEResultwith:.leaderboard— sorted by R-loss, with ATE, SE, CI, BLP β₁/β₂, CATE std/IQR per learner;.best_learner/.best_result— winner selected by lowest held-out Nie-Wager R-loss; BLP β₁/β₂ are reported in the leaderboard as diagnostics, not selection gates (β₁ equals the ATE in units of Y in this parametrization, so there is no natural "β₁ ≈ 1" gate);.results— the full fittedCausalResultfor every learner;.agreement— Pearson-ρ matrix of in-sample CATE vectors across learners (quick sanity check for model dependence);.summary()— a printable leaderboard + agreement table.
A bundled CATE learner race with honest held-out scoring. econml's
multi-metalearner pipeline is not bundled into
a single call; causalml's BaseMetaLearner comparison doesn't run
BLP calibration per learner.
-
sp.check_identification(..., strict=True)raisessp.IdentificationErrorwhen the report's verdict is'BLOCKERS'. The exception carries the complete report on.reportfor post-mortem inspection. Default remainsstrict=False(non-breaking). -
sp.IdentificationError— new exception type, exported at the top level. -
IV first-stage strength check in
sp.check_identification(_check_iv_strength) — computed from a first-stage OLStreatment ~ intercept + covariates + instrument(covariates partialled out before computing the instrument's F, so the reported F matches the Staiger-Stock definition when controls are present). Flags F < 5 asblocker, F < 10 aswarning(Staiger-Stock 1997), F ∈ [10, 30) asinfo. Fires only wheninstrumentis supplied.
Tests¶
tests/test_auto_cate.py(13 tests) — API surface, leaderboard shape, ATE recovery on constant-effect DGP, all-positive ATE on positive DGP, learner subset, invalid learner rejection, selection rule string, agreement matrix,CausalResultdelegation (.tidy(),.glance()), custom model override, summary string, top-levelsp.*availability, heterogeneous-DGP CATE dispersion.tests/test_check_identification.py(+5 tests) —strict=Trueraises on blockers, tolerates warnings, default non-strict behaviour unchanged,sp.IdentificationErrortop-level export, weak-instrument flagged, strong-instrument not flagged.
Design¶
- Published spec at
docs/superpowers/specs/2026-04-20-v094-auto-cate-strict-id-design.md.
Non-goals (deferred to 0.9.5+)¶
- Optuna hyperparameter search inside
auto_cate— for now the user either accepts the boosted-tree defaults or passes pre-tuned estimators viaoutcome_model=/propensity_model=/cate_model=. - Bayesian
sp.bayes_did/sp.bayes_rd— announced as a 0.9.5 preview line. - Rust HDFE inner kernel — remains Section 8's open item.
[0.9.3.post] — 0.9.3 post-release bugfixes (rolled into a later patch)¶
Four user-reported bugs surfaced during the 0.9.3 end-to-end smoke test.
All are fixed on main without a version bump (pending a later patch release).
Fixed¶
-
sp.use_chinese()failed on Linux (plots/themes.py) — the auto-detect candidate list only covered macOS fonts plusNoto Sans CJK SCandWenQuanYi Micro Hei, so a Linux/Docker host withfonts-noto-cjk(which shipsNoto Sans CJK JP/TC/KRby default) orfonts-wqy-zenhei(WenQuanYi Zen Hei) installed got an empty return plus a "no Chinese font" warning. Priority lists are now segmented by platform (macOS → Windows → Linux → cross-platform Source Han), all four Noto CJK regional variants are listed, and a substring fallback (CJK,Han Sans,Han Serif,WenQuanYi,Heiti,Ming) picks up custom/renamed builds. Warning message now includes the exactapt install fonts-noto-cjk fonts-wqy-zenheirecipe. -
sp.regtable(...)printed the table twice in REPL/Jupyter (output/regression_table.py,output/estimates.py) —regtable(),mean_comparison()andesttab()each calledprint(result)internally and then returned the result, which REPL/Jupyter re-displayed via__repr__/_repr_html_. All three internal prints are removed; display now flows through the standard Python display protocol.
Behaviour change: scripts that relied on the auto-print side-effect
must switch to print(sp.regtable(...)). Jupyter and interactive REPLs
are unaffected.
-
sp.regtable(..., output="latex")was silently ignored (output/regression_table.py) — theoutput=parameter previously controlled only the Word/Excel warning branch;__str__always rendered text.RegtableResultandMeanComparisonResultnow store_outputand dispatch in__str__/__repr__through_render(fmt)over{text, latex, tex, html, markdown, md}. Jupyter's_repr_html_still always returns HTML. Invalidoutput=values now raiseValueErrorinstead of falling back silently. -
sp.did()treat=column semantics were easy to mis-specify (did/__init__.py) — for staggered designs the column must hold each unit's first-treatment period (never-treated =0, not1), but users with a pre-existing 0/1treatedcolumn consistently passed it straight through and got nonsense estimates. Docstring now carries an explicit callout and a verified pandas idiom for constructingfirst_treat(.loc[treated==1].groupby('id')['year'].min()+.map+.fillna(0)) that broadcasts correctly to pre-treatment rows.
Added¶
- Documentation clarifies that
regtable(output=...)controlsstr(result)whileregtable(filename=...)dispatches on the file extension — they can diverge, and users should pass matching values. - Input validation on
regtable()/mean_comparison()rejects unknownoutput=values with a helpfulValueErrorlisting valid choices.
Tests¶
tests/test_v093_bugfixes.py — 15 regression tests covering all four bugs
plus the new validation. Full suite: 1655 passed, 4 skipped, 0 regressions.
[0.9.3] - 2026-04-19 — Stochastic Frontier + Multilevel + GLMM + Econometric Trinity¶
Overview. This release bundles four simultaneous deep overhauls plus an author-metadata correction:
- Stochastic Frontier Analysis —
sp.frontier/sp.xtfrontierrewritten to Stata/R-grade, with a critical correctness bug fix. - Multilevel / Mixed-Effects —
sp.multilevelrewritten to lme4/Stata-grade. - GLMM hardening — AGHQ (
nAGQ>1) plus three new families (Gamma, Negative Binomial, Ordinal Logit) and cross-family AIC comparability. - Econometric Trinity — three new P0 pillars: DML-PLIV, Mixed Logit, IV-QR.
- Author attribution corrected to
Biaoyue Wang.
⚠️ Critical correctness fix — sp.frontier carried a latent Jondrow posterior
sign error in all prior versions (0.9.2 and earlier). Efficiency scores were
systematically biased; the normal-exponential path additionally returned NaN
for unit efficiency. Re-run any prior frontier analyses. Detail below.
Stochastic Frontier Analysis Overhaul¶
Release focus: statspai.frontier. The prior implementation was a
270-line single file with one function covering cross-sectional
half-normal / exponential / truncated-normal frontiers, no panel
support, no heteroskedasticity, no inefficiency determinants, and —
critically — a sign error in the Jondrow posterior that silently
produced wrong efficiency scores, plus a wrong ε-coefficient in the
exponential log-likelihood that the old test never exercised. The
module has been rewritten (~1,300 LOC across _core.py, sfa.py,
panel.py, te_tools.py) to match or exceed Stata's
frontier / xtfrontier and R's frontier / sfaR.
Correctness fixes¶
- Jondrow posterior μ*: corrected
signconvention in all three distributions — the old code'sμ* = -sign·ε·σ_u²/σ²has been replaced by the derivation-verifiedμ* = sign·ε·σ_u²/σ²(and the analogous correction for truncated-normal). Efficiency scores from the old implementation were systematically biased; re-run any prior analyses. - Normal-exponential log-density: fixed the ε-coefficient and
Φ argument (the old form was
+ sign·ε/σ_u + log Φ((-sign·ε - σ_v²/σ_u)/σ_v); correct per Greene 2008 eq. 2.39 is- sign·ε/σ_u + log Φ(sign·ε/σ_v - σ_v/σ_u)). The old exponential path never produced efficiency scores (returned NaN) — now returns correct Battese-Coelli scores. - Truncated-normal density: fixed the
centeredoffset in the φ factor from(ε + sign·μ)/σto(ε - sign·μ)/σ. - Monte-Carlo density-integration tests (
∫ f(ε) dε = 1) now guard against regressions for all three distributions.
New cross-sectional sp.frontier¶
- Heteroskedastic inefficiency via
usigma=[...]— parameterisesln σ_u_i = γ_u' [1, w_i](Caudill-Ford-Gropper 1995, Hadri 1999). - Heteroskedastic noise via
vsigma=[...]— parameterisesln σ_v_i = γ_v' [1, r_i](Wang 2002). - Inefficiency determinants via
emean=[...]— the Battese-Coelli (1995) / Kumbhakar-Ghosh-McGuckin (1991) modelμ_i = δ' [1, z_i]fordist='truncated-normal'. - Battese-Coelli (1988) TE:
result.efficiency(method='bc')returnsE[exp(-u)|ε](the Stata default) in addition to the JLMS approximationexp(-E[u|ε])(method='jlms'). - LR test for absence of inefficiency: one-sided mixed χ̄²
(Kodde-Palm 1986) via
result.lr_test_no_inefficiency(). - Bootstrap CI for unit efficiency: parametric-bootstrap bounds
via
result.efficiency_ci(alpha=.05, B=500). - Residual skewness diagnostic stored at
result.diagnostics['residual_skewness']. - Optimiser now has hard bounds on
ln σand guards against σ → 0 / σ → ∞ excursions that previously caused truncated-normal fits to diverge.
New panel sp.xtfrontier¶
- Pitt-Lee (1981) time-invariant (
model='ti'):u_it = u_i, half-normal or truncated-normal. Closed-form group log-likelihood derived from the per-unit integration; unit-level TE stored atresult.diagnostics['efficiency_bc_unit']. - Battese-Coelli (1992) time-varying decay (
model='tvd'):u_it = exp(-η(t - T_i)) · u_iwith η estimated jointly. The obs-level efficiency usesE[exp(-a_it u_i)|e_i]under the posterioru_i ~ N⁺(μ*, σ*²)(MGF form). - Battese-Coelli (1995) inefficiency effects (
model='bc95'):u_it ~ N⁺(z_it' δ, σ_u²)independently; returned with unit-mean efficiency roll-up.
Helpers¶
sp.te_summary(result)— Stata-style descriptive table of TE scores (n, mean, sd, quartiles, share > 0.9, share < 0.5).sp.te_rank(result, with_ci=True)— efficiency ranking with optional bootstrap CIs for benchmarking.
Tests¶
- 33 new tests covering: parameter recovery for all three cross-sectional distributions, cost vs production sign handling, heteroskedastic σ_u / σ_v, BC95 determinants, LR specification tests, TE-score bounds and internal consistency, bootstrap CI structure, Pitt-Lee / BC92 / BC95 panel recovery, and density-integrates-to-1 kernel sanity checks.
Advanced frontier extensions¶
Three frontier extensions shipped after the initial overhaul (commit e876937):
sp.zisf— Zero-Inefficiency SFA mixture (Kumbhakar-Parmeter-Tsionas 2013). Mixture of fully-efficient (u=0, pure noise) and standard composed-error regimes; mixing probabilityp_iparameterised via logit on optionalzprobcovariates. PosteriorP(efficient|ε)exposed indiagnostics['p_efficient_posterior']. Recovery test: true efficient share 0.30 → estimated 0.286 onn=2000.sp.lcsf— 2-class Latent-Class SFA (Orea-Kumbhakar 2004; Greene 2005). Two separate frontiers with their ownβ_kand variance parameters; class-membership logit on optionalz_classcovariates. Direct MLE with perturbed starts to break label symmetry.xtfrontier(..., model='tfe', bias_correct=True)— Dhaene-Jochmans (2015) split-panel jackknife for TFE:β_BC = 2·β_full − (β_first_half + β_second_half)/2. Cuts theO(1/T)incidental-parameters bias. Guards against degenerate halves by skipping σ corrections with an annotation inmodel_info. Verified atT=30,N=25: rawσ_u=0.374→ BCσ_u=0.359(true 0.35).
Productivity helpers¶
Shipped in commit be59260:
sp.malmquist— Färe-Grosskopf-Lindgren-Roos (1994) Malmquist TFP index via period-by-period parametric frontier fits. Returns per- transition decompositionM = EC × TC(efficiency change × technical change). Row-wise identityM == EC·TCverified tortol=1e-8. Cost frontiers supported via reciprocal distance convention. Validated on 3-period DGP with 5%/year intercept growth: mean TC ≈ 1.07–1.09, mean EC ≈ 1.0.sp.translog_design— Cobb-Douglas → Translog design-matrix helper. Appends0.5·log(x_k)²squares andlog(x_k)·log(x_l)interactions; thetranslog_termslist is stored indf.attrsfor one-line feed tofrontier()/xtfrontier(). Toggleable squares and interactions.
Migration¶
- Old:
frontier(df, y='y', x=['x1'])still works (same required args). - New keyword-only args:
usigma,vsigma,emean,te_method,start. - Existing efficiency scores should be recomputed — prior values were systematically biased by the Jondrow sign error.
Multilevel / Mixed-Effects Overhaul¶
Release focus: statspai.multilevel. The previous implementation was a
400-line single file covering only the two-level linear mixed model
with a diagonal random-effect covariance. It has been rewritten as a
proper sub-package (~2,000 LOC across _core.py, lmm.py, glmm.py,
diagnostics.py, comparison.py) with feature parity against
lme4/Stata mixed and additions on top.
New in sp.mixed¶
- Unstructured covariance
Gfor random effects is now the default (cov_type='unstructured', Cholesky-parameterised so the optimiser is unconstrained).diagonalandidentityremain available for nested-model comparisons. - Three-level nested models via
group=['school', 'class']— fits school- and class-level random intercepts jointly (verified to matchstatsmodels.MixedLM(..., re_formula="1", vc_formula={...})to four decimals on the variance components and fixed effects). - BLUP posterior standard errors (
result.ranef(conditional_se= True)) — exposesVar(u|y) = G − GZ'V⁻¹ZG + GZ'V⁻¹X Cov(β̂) X'V⁻¹ZGfor use in caterpillar plots. predict(new_data, include_random=…)— population-marginal and group-conditional predictions, with zeroed-out BLUPs for unseen groups.- Nakagawa-Schielzeth marginal & conditional R² via
result.r_squared(). - AIC / BIC,
wald_test()for linear restrictions,to_markdown()/to_latex()/_repr_html_()/cite(), andplot(kind='caterpillar' | 'residuals').
New functions¶
sp.melogit/sp.mepoisson/sp.meglm— Generalised linear mixed models (binomial logit, Poisson log, Gaussian identity) fitted by Laplace approximation with canonical-link observed information. Supports random intercepts and random slopes,cov_typeas forsp.mixed, binomialtrials=and Poissonoffset=. Results exposeodds_ratios()/incidence_rate_ratios()and apredict(type= 'response'|'linear')method.sp.icc(result)— intra-class correlation with a delta-method (logit-scale) 95% CI.sp.lrtest(restricted, full)— likelihood-ratio test between two nested mixed-model fits with automatic Self-Liang χ̄² boundary correction when variance components are being tested.
Validation¶
- Linear mixed models: fixed effects and variance components agree
with
statsmodels.MixedLMto 4 decimal places on both random- intercept and unstructured random-slope specifications (test_multilevel.py::TestRandomSlopeUnstructured:: test_matches_statsmodels). - Three-level nested: variance components identified jointly and match
the reference implementation to 2 decimal places
(
TestThreeLevelNested::test_separates_variance_components). - GLMM recovery tests on 2,000-observation synthetic panels confirm slope and random-intercept variance within expected sampling ranges.
Behavioural changes¶
- The default
cov_typeforsp.mixedis now'unstructured'(previously effectively diagonal). Passcov_type='diagonal'explicitly for the old behaviour. LR test vs. pooled OLSnow uses the ML-converted likelihood (previously a mix of REML and ML that could produce inconsistent values whenmethod='reml').
Post-review hardening (post oracle + code-reviewer audit)¶
- [BLOCKER fix]
MixedResult.predict(data=None)previously returned predictions in group-iteration order rather than the original row order._GroupBlocknow carries the training row indices andpredict()scatters the output back to the correct positions. Regression test:tests/test_multilevel.py::TestRandomIntercept:: test_predict_is_row_aligned_with_training_frame. - [BLOCKER fix] GLMM inner Newton (
_find_mode) now damps large steps and returns a convergence flag.meglmaggregates per-cluster failures and emits aRuntimeWarningwhen any cluster fails to converge — a previously silent failure mode. - [HIGH fix]
MEGLMResultgainsto_latex()andplot()so it matches the unified StatsPAI result contract. - [HIGH fix]
lrtestnow raisesValueErroron cross-family comparisons and on REML fits whose fixed-effect design differs, preventing invalid LR statistics. Multi-component boundary corrections emit aRuntimeWarningexplaining the conservative upper bound (Stram–Lee 1994 mixture not implemented). - [HIGH fix]
mixed()/meglm()reject non-hashable group values with a descriptiveTypeErrorinstead of producing a silently corrupted BLUP dict. - [MED fix]
icc(result, n_boot>0)raisesNotImplementedErrorinstead of silently returning the delta-method CI.icc()warns whenn_groups < 30(delta-method CI unreliable). - [MED fix] Three-level nested fit emits a warning when any
outer group has only one inner group (class variance then not
identified), and exposes both school and class ICCs via
variance_components['icc(outer)']/icc(outer+inner).
GLMM hardening — AGHQ + Gamma / NegBin / Ordinal¶
Closes the three GLMM gaps flagged in the multilevel self-audit. All
changes are additive (no API breaks); existing meglm / melogit /
mepoisson calls produce numerically identical fits.
Adaptive Gauss-Hermite quadrature (AGHQ) — nAGQ parameter.
Previously meglm only offered the Laplace approximation (nAGQ=1),
which is known to underestimate random-effect variances on small
clusters with binary or other non-Gaussian outcomes. The new nAGQ
argument selects the number of adaptive quadrature points per scalar
random effect:
sp.melogit(df, "y", ["x"], "g", nAGQ=7) # matches Stata intpoints(7)
sp.megamma(df, "y", ["x"], "g", nAGQ=15) # converged-grade quadrature
nAGQ=1 reduces exactly to the Laplace formula (verified to 1e-10).
nAGQ>1 is restricted to single-scalar random-effect models (no random
slopes), matching the same restriction lme4::glmer imposes — full
tensor-product AGHQ over q>1 random effects is deferred because cost
scales as nAGQ^q. AGHQ is wired into all five families
(Gaussian / Binomial / Poisson / Gamma / NegBin) plus meologit.
New families:
sp.megamma— Gamma GLMM with log link and dispersionφestimated by ML, packed aslog φfor unconstrained optimisation. IRLS weight uses Fisher information1/φ(Fisher scoring) for PSD Hessian regardless of fitted means.sp.menbreg— Negative-binomial NB-2 GLMM (Var = μ + α μ²) with log link, dispersionα(aliasfamily='negbin'accepted). Reduces analytically to Poisson asα → 0; verified.sp.meologit— Random-effects ordinal logit (Statameologit, Rordinal::clmm). K−1 thresholds reparameterised asκ_1, log(κ_2−κ_1), ...so strict ordering is enforced unconditionally. ReturnsMEGLMResultwith newthresholdsattribute. SupportsnAGQ>1.
Cross-family AIC comparability. Poisson and Binomial log-
likelihoods now include the full normalisation constants (-log(y!)
for Poisson, log-binomial-coefficient for Binomial). Previously these
constants were dropped, which made mepoisson vs menbreg AIC
comparisons biased by ~Σ log(y!). β and variance estimates are
unchanged; only log_likelihood and aic / bic absolute values
shift — relative comparisons within a family are unaffected.
Tests (multilevel). tests/test_multilevel.py grows from 35 to 53
tests:
TestAGHQ(7 tests) — nAGQ=1↔Laplace identity, AGHQ improves vs Laplace on small clusters, convergence in nAGQ, random-slope rejection.TestMEGamma(3) — truth recovery, dispersion accounting, summary.TestMENegBin(3) — truth recovery, IRR availability, alias resolution.TestMEOLogit(5) — truth recovery, threshold ordering, no intercept, summary, K≥3 enforcement.
Backwards compatibility: all 35 prior multilevel tests pass unchanged.
Synth API-drift fixes (post-0.9.3-initial)¶
SyntheticControl._solve_weightssignature migration — three stale call sites insynth/power.pyandsynth/sensitivity.pymigrated to the new (Y_treated_pre, Y_donors_pre, X_treated, X_donors, run_nested) signature (fixes 8 test failures intests/test_synth_advanced.pyandtests/test_synth_extras.py).- Placebo alignment —
synth/power.pyplacebo builder now followsscm.py:888exactly so LOO ↔ main placebo results stay consistent. - numpy 2.x compatibility —
tests/test_frontier.pyswitchesnp.trapz→np.trapezoid(removed in numpy 2.x).
Econometric Trinity — P0 Pillars (DML-PLIV, Mixed Logit, IV-QR)¶
Three foundational econometric estimators identified as the highest-ROI gaps
vs. Stata, R, and existing Python packages are now first-class sp.* APIs
(~1,170 new LOC, 10 tests in test_econ_trinity.py).
sp.dml(model='pliv', instrument=…)— DML-PLIV (Partially Linear IV). Chernozhukov et al. (2018, §4.2) Neyman-orthogonal score with cross-fitted nuisance functionsg(X)=E[Y|X],m(X)=E[D|X],r(X)=E[Z|X]. Returns the LATE with influence-function-based standard errors. Closes the IV gap in the existingDoubleML(previously only PLM + IRM).sp.mixlogit— Mixed Logit. Random-coefficient multinomial logit via simulated maximum likelihood with Halton quasi-random draws. Supports: fixed + random coefficients, normal / log-normal / triangular mixing distributions, diagonal or full Cholesky covariance, panel (repeated-choice) data, OPG-sandwich robust SEs. Benchmarked against Statamixlogitand Rmlogit.sp.ivqreg— IV Quantile Regression. Chernozhukov-Hansen (2005, 2006, 2008) instrumental-variable quantile regression via inverse-QR profile. Scalar endogenous case uses grid + Brent refinement; multi-dim uses BFGS on theb̂(α)criterion. Multiple quantiles return a tidy DataFrame; single quantile returnsEconometricResults. Optional pairs-bootstrap SEs.
All three reuse _qreg_fit, CausalResult, EconometricResults for API
consistency with the rest of StatsPAI.
Post-self-audit hardening¶
Self-audit + code-reviewer agent surfaced and fixed 4 BLOCKER + 7 HIGH bugs
in the first-cut implementation (see commit 2aa709b). Parameter-recovery
tests now pass against controlled DGPs.
Smart Workflow — Posterior Verification¶
Shipped in commit be59260:
sp.verify/sp.verify_benchmark— posterior verification engine forsp.recommend()outputs. Runs bootstrap stability, placebo pass rate, and subsample agreement, aggregated intoverify_score ∈ [0, 100]. Opt-in viasp.recommend(verify=True); zero overhead when disabled.- Calibration card shows top-method
verify_score85–95 on clean DGPs (RD lower at ≈ 74 due to local-polynomial bootstrap variance). - 18/18 smart tests pass.
Meta — Author Attribution¶
- Author metadata corrected from
Bryce WangtoBiaoyue Wangin:pyproject.toml(authors+maintainers),src/statspai/__init__.py(__author__),README.md/README_CN.md(team line + BibTeX),docs/index.md(BibTeX), andmkdocs.yml(site_author). Software-journal submission (paper.md) was already correct.
[0.9.2] - 2026-04-16¶
Decomposition Analysis — Broad Decomposition Toolkit in Python¶
Release focus: statspai.decomposition. 18 first-class decomposition methods across 13 modules (~6,200 LOC, 54 tests) spanning mean, distributional, inequality, demographic, and causal decomposition. The release consolidated a broad Python API surface for workflows that are often split across Stata commands and R packages; numerical claims remain tied to the method-level tests and validation metadata.
What's in sp.decompose (18 methods, 30 aliases)¶
Mean decomposition
| Function | Method / Paper |
|---|---|
sp.oaxaca(df, ...) |
Blinder-Oaxaca threefold with 5 reference coefficients (Blinder 1973; Oaxaca 1973; Neumark 1988; Cotton 1988; Reimers 1983) |
sp.gelbach(df, ...) |
Sequential orthogonal decomposition of omitted-variable bias (Gelbach 2016, JoLE) |
sp.fairlie(df, ...) |
Nonlinear logit/probit decomposition (Fairlie 1999, 2005) |
sp.bauer_sinning(df, ...) / sp.yun_nonlinear(df, ...) |
Bauer-Sinning (2008) + Yun (2004, 2005) detailed nonlinear |
Distributional decomposition
| Function | Method / Paper |
|---|---|
sp.rifreg(df, ...) / sp.rif_decomposition(...) |
Recentered Influence Function regression + OB (Firpo, Fortin & Lemieux 2009, Econometrica) |
sp.ffl_decompose(df, ...) |
Two-step detailed decomposition (Firpo, Fortin & Lemieux 2018, Econometrics) |
sp.dfl_decompose(df, ...) |
Reweighting counterfactual distributions (DiNardo, Fortin & Lemieux 1996, Econometrica) |
sp.machado_mata(df, ...) |
Simulation-based quantile regression decomposition (Machado & Mata 2005, JAE) |
sp.melly_decompose(df, ...) |
Analytical quantile regression decomposition (Melly 2005, Labour Economics) |
sp.cfm_decompose(df, ...) |
Distribution regression counterfactuals (Chernozhukov, Fernández-Val & Melly 2013, Econometrica) |
Inequality decomposition
| Function | Method / Paper |
|---|---|
sp.subgroup_decompose(df, ...) |
Between/within for Theil T, Theil L, GE(α), Gini (Dagum 1997), Atkinson, CV² (Shorrocks 1984) |
sp.shapley_inequality(df, ...) |
Shorrocks-Shapley allocation of inequality to covariates (Shorrocks 2013, JoEI) |
sp.source_decompose(df, ...) |
Gini source decomposition (Lerman & Yitzhaki 1985, ReStat) |
Demographic standardization
| Function | Method / Paper |
|---|---|
sp.kitagawa_decompose(df, ...) |
Two-factor rate decomposition (Kitagawa 1955, JASA) |
sp.das_gupta(df_a, df_b, ...) |
Multi-factor symmetric decomposition (Das Gupta 1993) |
Causal decomposition
| Function | Method / Paper |
|---|---|
sp.gap_closing(df, method=...) (regression / IPW / AIPW) |
Gap-closing estimator (Lundberg 2021, Sociol. Methods Res.) |
sp.mediation_decompose(df, ...) |
Natural direct/indirect effects (VanderWeele 2014, Epidemiology) |
sp.disparity_decompose(df, ...) |
Causal disparity decomposition (Jackson & VanderWeele 2018, Epidemiology) |
Unified entry point
import statspai as sp
result = sp.decompose(method='ffl', data=df, y='log_wage',
group='female', x=['education', 'experience'],
stat='quantile', tau=0.5)
result.summary(); result.plot(); result.to_latex()
30 aliases supported ('mm' → machado_mata, 'dinardo_fortin_lemieux' → dfl, etc.).
Why this matters¶
- Stata has it scattered across 6+ packages (
oaxaca,ddecompose,cdeco,rifhdreg,mvdcmp,fairlie) with no unified API. - R has
ddecompose,Counterfactual,dineq— three different authors, three different conventions. - Python previously had only one 2018-vintage unmaintained PyPI package (basic Oaxaca).
- StatsPAI 0.9.2: one API, one result-class contract (
.summary()/.plot()/.to_latex()/._repr_html_()), three inference modes (analytical / bootstrap / none), all numpy/scipy/pandas.
Quality bar¶
- 54 tests including cross-method consistency (
test_dfl_ffl_mean_agree,test_mm_melly_cfm_aligned_reference,test_dfl_mm_reference_convention_opposite) and numerical identity checks (FFL four-part sum, weighted Gini RIF E_w[RIF]=G). - Closed-form influence functions for Theil T / Theil L / Atkinson (no O(n²) numerical fallback).
- Weighted O(n log n) Dagum Gini via sorted-ECDF pairwise-MAD identity.
- Logit non-convergence surfaces as RuntimeWarning; bootstrap failure rate >5% warns.
[0.9.1] - 2026-04-16¶
Regression Discontinuity — Broad RD Toolkit¶
Release focus: statspai.rd. 18+ RD estimators, diagnostics, and inference methods across 14 modules (~10,300 LOC). The machinery behind Calonico-Cattaneo-Titiunik (CCT), Cattaneo-Jansson-Ma density tests, Armstrong-Kolesar honest CIs, Cattaneo-Titiunik-Vazquez-Bare local randomization, Cattaneo-Titiunik-Yu boundary (2D) RD, and Angrist-Rokkanen external validity is exposed through one import statspai as sp; validation status is method-specific.
What's in sp.rd (14 modules)¶
Core estimation
| Function | Method / Paper |
|---|---|
sp.rdrobust(df, ...) |
Sharp / Fuzzy / Kink RD with bias-corrected robust inference (Calonico, Cattaneo & Titiunik 2014, Econometrica; 2020, Stata Journal) |
sp.rdrobust(..., covs=...) |
Covariate-adjusted local polynomial (Calonico, Cattaneo, Farrell & Titiunik 2019, ReStat) |
sp.rd2d(df, x1, x2, ...) |
Boundary discontinuity / 2D RD designs (Cattaneo, Titiunik & Yu 2025) |
sp.rkd(df, ...) |
Regression Kink Design (Card, Lee, Pei & Weber 2015, Econometrica) |
sp.rdit(df, time, ...) |
Regression Discontinuity in Time (Hausman & Rapson 2018, Annual Review) |
sp.rdmc(df, cutoffs=[...]) |
Multi-cutoff RD (Cattaneo, Titiunik, Vazquez-Bare & Keele 2016) |
sp.rdms(df, scores=[...]) |
Multi-score RD (Cattaneo, Idrobo & Titiunik 2024) |
Bandwidth selection
| Function | Selector |
|---|---|
sp.rdbwselect(df, bwselect='mserd') |
MSE-optimal (Imbens-Kalyanaraman 2012) |
sp.rdbwselect(..., bwselect='msetwo') |
Two-bandwidth MSE |
sp.rdbwselect(..., bwselect='cerrd'/'cercomb1'/'cercomb2') |
CER-optimal coverage-error-rate (Calonico, Cattaneo & Farrell 2020, Econometrics Journal) |
Inference
| Function | Method |
|---|---|
sp.rd_honest(df, ...) |
Honest CIs with worst-case bias bound (Armstrong & Kolesar 2018, Econometrica; 2020, QE) |
sp.rdrandinf(df, ...) |
Local randomization inference via Fisher exact tests (Cattaneo, Frandsen & Titiunik 2015) |
sp.rdwinselect(df, ...) |
Data-driven window selection for local randomization |
sp.rdsensitivity(df, ...) |
Sensitivity analysis across windows |
sp.rdrbounds(df, ...) |
Rosenbaum sensitivity bounds for hidden selection |
Heterogeneous treatment effects
| Function | Method |
|---|---|
sp.rdhte(df, covs=[...]) |
CATE via fully interacted local linear (Calonico et al. 2025) |
sp.rdbwhte(df, ...) |
HTE-optimal bandwidth |
sp.rd_forest(df, ...) |
Causal forest + RD |
sp.rd_boost(df, ...) |
Gradient boosting + RD |
sp.rd_lasso(df, ...) |
LASSO-assisted RD with covariate selection |
External validity & extrapolation
| Function | Method |
|---|---|
sp.rd_extrapolate(df, ...) |
Away-from-cutoff extrapolation (Angrist & Rokkanen 2015, JASA) |
sp.rd_multi_extrapolate(df, cutoffs=[...]) |
Multi-cutoff extrapolation (Cattaneo, Keele, Titiunik & Vazquez-Bare 2024) |
Diagnostics & visualization
| Function | Purpose |
|---|---|
sp.rdsummary(df, ...) |
Single-call dashboard — rdrobust + density test + bandwidth sensitivity + placebo cutoffs + covariate balance |
sp.rdplot(df, ...) |
IMSE-optimal binned scatter with pointwise CI bands (Calonico, Cattaneo & Titiunik 2015, JASA) |
sp.rddensity(df, ...) |
Cattaneo-Jansson-Ma (2020, JASA) manipulation test |
sp.rdbalance(df, covs=[...]) |
Covariate balance tests at cutoff |
sp.rdplacebo(df, cutoffs=[...]) |
Placebo cutoff tests |
Power analysis
| Function | Purpose |
|---|---|
sp.rdpower(df, effect_sizes=[...]) |
Power curves for RD designs |
sp.rdsampsi(df, target_power=0.8) |
Required sample size |
Refactor — rd/_core.py consolidation¶
A 5-sprint refactor (commit 44f7529) centralized shared low-level primitives that had been duplicated across 9 RD files into a single private module rd/_core.py (191 lines):
_kernel_fn— triangular / epanechnikov / uniform / gaussian (previously 4 duplicate definitions)_kernel_constants/_kernel_mse_constant— MSE-optimal bandwidth constants_local_poly_wls— WLS local polynomial fit with HC1 / cluster-robust variance + optional covariate augmentation_sandwich_variance— HC1 / cluster sandwich for arbitrary design matrices
Net effect: 253 lines of duplicated math consolidated into 191 lines of canonical implementation. 97 RD tests pass with zero regression.
Bug fixes (since 0.9.0)¶
- RDD extrapolation:
_ols_fitsingular matrix fallback (commit 052594a) - 3 critical + 3 high-priority bugs from comprehensive RD code review (commit 6489270)
- Density test: bug in CJM (2020) implementation + DGP helper fixes + validation tests (commit b66f312)
Tests¶
- 97 RD tests + 1 skipped, 0 failed across 5 test files.
Also in 0.9.1¶
synth/_core.py— simplex weight solver consolidated from 6 duplicate implementations (commit a4036a2). Analytic Jacobian now available to all six callers for ~3-5x speedup.decomposition/_common.py— newinfluence_function(y, stat, tau, w)is the canonical 9-stat RIF kernel.rif.rif_valuespublic API expands from 3 to 9 statistics (commits 0789223, 5569fd0).
[0.9.0] - 2026-04-16¶
Synthetic Control — Broad SCM Toolkit¶
Release focus: statspai.synth. 20 SCM methods + 6 inference strategies + analysis workflow (compare / power / sensitivity / reports), all behind the unified sp.synth(method=...) dispatcher. This is an API-breadth statement; exact validation evidence is recorded by each function's validation metadata and the parity artifacts.
Seven new SCM estimators¶
| Method | Reference |
|---|---|
bayesian_synth |
Dirichlet-prior MCMC with full posterior credible intervals (Vives & Martinez 2024) |
bsts_synth / causal_impact |
Bayesian Structural Time Series via Kalman filter/smoother (Brodersen et al. 2015) |
penalized_synth (penscm) |
Pairwise discrepancy penalty (Abadie & L'Hour 2021, JASA) |
fdid |
Forward DID with optimal donor subset selection (Li 2024) |
cluster_synth |
K-means / spectral / hierarchical donor clustering (Rho 2024) |
sparse_synth |
L1 / constrained-LASSO / joint V+W (Amjad, Shah & Shen 2018, JMLR) |
kernel_synth + kernel_ridge_synth |
RKHS / MMD-based nonlinear matching |
Previous methods — classic, penalized, demeaned, unconstrained, augmented, SDID, gsynth, staggered, MC, discos, multi-outcome, scpi — remain with bug fixes (see below).
Research workflow¶
synth_compare(df, ...)— run every method at once, tabular + graphical comparisonsynth_recommend(df, ...)— auto-select best estimator by pre-fit + robustnesssynth_report(result, format='markdown'|'latex'|'text')— one-command structured reportsynth_power(df, effect_sizes=[...])— power-analysis helper for SCM designssynth_mde(df, target_power=0.8)— minimum detectable effectsynth_sensitivity(result)— LOO + time placebos + donor sensitivity + RMSPE filtering- Three canonical datasets shipped:
california_tobacco(),german_reunification(),basque_terrorism()
Critical fixes from comprehensive module review¶
Following a 5-parallel-agent code review (correctness / numerics / API / perf / docs), nine critical review findings were fixed:
- ASCM correction formula —
augsynthnow follows Ben-Michael, Feller & Rothstein (2021) Eq. 3 per-period ridge bias(Y1_pre − Y0'γ) @ β(T0, T1), replacing the scalar mean-residual placeholder._ridge_fitRHS bug also fixed. - Bayesian likelihood scale — covariate rows are now z-scored to the pooled pre-outcome SD before concatenation, preventing scale mismatch from dominating the Gaussian
σ²posterior. - Bayesian MCMC Jacobian — missing
log(σ′/σ)correction for the log-normal random-walk proposal on σ has been added to the MH acceptance ratio. - BSTS Kalman filter — innovation variance floored at
1e-12(preventslog(0)on constant outcome series); RTS smootherinv → solve + pinvfallback on near-singular predicted covariance. - gsynth factor estimation — four
np.linalg.invcalls (loadings + placebo loop) replaced withnp.linalg.lstsq(robust to rank-deficientF'F/L'L). - Dispatcher
**kwargsleakage —augsynthgains**kwargs + placebo=True;sp.synth(method='augmented', placebo=False)no longer raisesTypeError. - Dispatcher
kernel_ridgeplacebo bypass —placebo=now forwarded correctly. - Cross-method API consistency —
sdid()now accepts canonicaloutcome / treated_unit / treatment_time(legacyy / treat_unit / treat_timealiases retained for backwards compatibility). - Documentation accuracy —
synth_comparedocstring reflects 20 methods (was 12);synth()Returns section enumerates allCausalResultfields.
Tests & validation¶
- 144 synth tests passing (new: 12-method cross-method consistency benchmark verifying the benchmarked synth methods recover a known ATT within 1.5 units on a clean DGP).
- Full suite: 1481 passed, 4 skipped, 0 failed (5m42s).
- New guide:
docs/guides/synth.md— complete tutorial covering all 20 methods with a method-choice decision table.
API migration notes¶
sdid(y=, treat_unit=, treat_time=) still works but outcome=, treated_unit=, treatment_time= is preferred for consistency with every other sp.synth.* function. A deprecation of the legacy names is planned for v1.0.
Other Modules¶
Decomposition and Regression Discontinuity modules received significant upgrades in this release cycle (tier-C decomposition expansion to 18 methods + unified sp.decompose(); RD _core.py primitive centralization + bug fixes from code review). These will be highlighted in a dedicated follow-up release note.
[0.8.0] - 2026-04-16¶
Spatial Econometrics + 10-Domain Breadth Upgrade¶
Largest release in StatsPAI history. 60+ new functions across 10 domains.
Spatial Econometrics (NEW — 38 API symbols)¶
From 3 functions / 419 LOC to 38 functions / 3,178 LOC / 69 tests. A unified spatial econometrics API for Python users.
- Weights (L1):
W(sparse CSR),queen_weights,rook_weights,knn_weights,distance_band,kernel_weights,block_weights - ESDA (L2):
moran(global + local),geary,getis_ord_g,getis_ord_local,join_counts,moran_plot,lisa_cluster_map - ML Regression (L3):
sar,sem,sdm,slx,sac— sparse-backed, dual log-det path (exact + Barry-Pace), scales to N=100K - GMM (L3):
sar_gmm,sem_gmm,sarar_gmm— Kelejian-Prucha (1998/1999), heteroskedasticity-robust - Diagnostics:
lm_tests(Anselin 1988 full battery),moran_residuals - Effects:
impacts(LeSage-Pace 2009 direct/indirect/total + simulated SE) - GWR (L4):
gwr,mgwr(Multiscale GWR),gwr_bandwidth(AICc/CV golden-section) - Spatial Panel (L5):
spatial_panel(SAR-FE / SEM-FE / SDM-FE, entity + twoways) - Cross-validated: Columbus rtol<1e-7 vs PySAL spreg 1.9.0; Georgia GWR bit-identical vs mgwr 2.2.1; GMM rtol<1e-4 vs spreg GM_*
Time Series¶
local_projections— Jordà (2005) IRF with Newey-West HACgarch— GARCH(p,q) MLE with multi-step forecastarima— ARIMA/SARIMAX with auto (p,d,q) AICc grid searchbvar— Bayesian VAR with Minnesota (Litterman) prior
Causal Discovery¶
lingam— DirectLiNGAM (Shimizu 2011), bit-identical vs lingam packageges— Greedy Equivalence Search (Chickering 2002)
Matching¶
optimal_match— Hungarian 1:1 matching (min total Mahalanobis distance)cardinality_match— Zubizarreta (2014) LP-based matching with balance constraints
Decomposition & Mediation¶
rifreg— RIF regression (Firpo-Fortin-Lemieux 2009)rif_decomposition— RIF Oaxaca-Blinder for distributional statisticsmediate_sensitivity— Imai-Keele-Yamamoto (2010) ρ-sensitivity
RD & Survey¶
rdpower,rdsampsi— power/sample-size for RD designsrake,linear_calibration— survey calibration (Deville-Särndal 1992)
Survival¶
cox_frailty— Cox with shared gamma frailty (Therneau-Grambsch)aft— Accelerated Failure Time (exponential/Weibull/lognormal/loglogistic)
ML-Causal (GRF)¶
CausalForest.variable_importance(),.best_linear_projection(),.ate(),.att()- Bugfix: honest leaf values now correctly vary per-leaf
Infrastructure¶
- OLS/IV
predict(data, what='confidence'|'prediction')with intervals - Pre-release code review: 3 critical + 2 high-priority bugs fixed
[0.7.1] - 2026-04-15¶
DID-focused polish release. Brings the Wooldridge (2021) ETWFE
implementation to full feature parity with the R etwfe package,
adds a one-call method-robustness workflow, and closes 12 issues
uncovered by an internal code review round. All 27 new / updated
DID tests pass (pytest tests/test_did_summary.py).
Added — ETWFE full parity with R etwfe¶
sp.etwfe()explicit API aligned with Retwfe(McDermott 2023) naming. Thin alias overwooldridge_did()with a full argument- mapping table in the docstring.xvar=covariate heterogeneity (single string or list of names). Adds per-cohort × post ×(x_j − mean(x_j))interactions;detailgainsslope_<x>/slope_<x>_se/slope_<x>_pvaluecolumns. Baseline ATT is reported at the sample means of every covariate.panel=Falserepeated cross-section mode — replaces unit FE with cohort + time dummies (Retwfe(ivar=NULL)equivalent).cgroup='nevertreated'— per-cohort regressions restricted to (cohort g) ∪ (never-treated); cohort-size-weighted aggregation (Retwfe(cgroup='never')equivalent). Default'notyet'preserves prior ETWFE behaviour.sp.etwfe_emfx(result, type=…)— Retwfe::emfx-equivalent four aggregations:'simple','group','event','calendar'.include_leads=Truereturns full event-time output including pre- treatment leads for pre-trend inspection (rel_time = -1is the reference category).
Added — one-call DID method-robustness workflow¶
sp.did_summary()— fits five modern staggered-DID estimators (CS, SA, BJS, ETWFE, Stacked) to the same data and returns a tidy comparison table with per-method (estimate, SE, p, 95 % CI). Mean across methods + cross-method SD flag method-sensitivity of results.include_sensitivity=True— attaches the Rambachan-Roth (2023) breakdownM*to the CS row, giving a three-way robustness readout in a single call.sp.did_summary_plot()— forest plot of per-method estimates with cross-method mean line;sort_by='estimate'supported.sp.did_summary_to_markdown()/_to_latex()— publication- ready exports (GFM tables / booktabs LaTeX with auto-escaped ampersands).sp.did_report(save_to=dir)— one-call bundle that writesdid_summary.txt/.md/.tex/.png/.jsonto a folder.
Fixed — 12 issues from the internal code review¶
Blockers (C-severity):
etwfe(xvar=…)now raises a clearValueErrorwhen the covariate is all-NaN or (near-)constant. Previously returnedn_obs = 0, estimate = 0silently.etwfe(panel=False, cgroup='nevertreated')now raises a crispNotImplementedErrorinstead of silently falling back to'notyet'.did_summarynow validates column names up front (raisesKeyErrorlisting missing columns) and only catches narrow estimator-side exceptions inside the fit loop; user typos incontrols=/cluster=surface as proper errors.did_summaryresults round-trip cleanly through stdlib serialisation (DIDSummaryResult(CausalResult)subclass with a real.summary()method, replacing the prior closure-bound instance attribute).
High-severity:
etwfe_emfx(type='event'/'calendar')now computes SEs via the delta method on the stored event-study vcov instead of the independent-coefficient approximation.model_info['se_method']advertises which path was used.etwfe_emfx(type='group')headlinese/pvalue/ciare now populated (match the underlying fit's overall ATT exactly).- Validation for
did_summary_plot/_to_markdown/_to_latexaligned on a single sentinelmodel_info['_did_summary_marker']. _etwfe_never_onlyno longer leaves a_ft_cachehelper column on the caller's DataFrame.- Slope indexing in
_etwfe_with_xvaris now name-keyed (coef_indexdict); regression test verifies swappingxvar=['x1','x2']vs['x2','x1']produces identical slopes per name. etwfe(panel=False)with rank-deficient designs emits aRuntimeWarningpointing at concrete remedies (previously fell through topinvsilently).
Tests¶
- New test module
tests/test_did_summary.py— 27 cases covering consistency with direct estimator calls, export formats, forest plot rendering,etwfe_emfxround-trips, xvar / panel / cgroup options, the 12 review fixes, and theinclude_leadsmode.
[0.7.0] - 2026-04-14¶
Focused release reaching feature parity with the R did / HonestDiD
packages and the Python csdid / differences packages for staggered
Difference-in-Differences. All core algorithms are reimplemented from
the original papers — no wrappers, no runtime dependencies on upstream
DID packages. Full DiD test suite: 47 → 170+ (including three rounds
of post-implementation audit that surfaced and fixed 9 bugs before
release).
Added — Core estimation¶
sp.aggte(result, type=...)— unified aggregation layer forcallaway_santanna()results. Four aggregation schemes (simple,dynamic,group,calendar) backed by a single weighted- influence-function engine. Callaway & Sant'Anna (2021) Section 4.- Mammen (1993) multiplier bootstrap — IQR-rescaled pointwise
standard errors and simultaneous (uniform / sup-t) confidence
bands over the aggregation dimension. Matches the uniform-band
behaviour of the R
did::aggtefunction. balance_e/min_e/max_e— event-study cohort balancing and window truncation (CS2021 eq. 3.8).anticipation=δparameter oncallaway_santanna()— shifts the base period back by δ periods per CS2021 §3.2.- Repeated cross-sections support via
callaway_santanna(panel=False)— unconditional 2×2 cell-mean DID with observation-level influence functions (CS2021 eq. 2.4, RCS version). Optional covariate residualisation withx=[...]for regression adjustment. All downstream modules (aggte,cs_report,ggdid,honest_did) work on RCS results with no code changes. - dCDH joint inference (
did_multiplegt) —joint_placebo_test(Wald χ² across placebo lags with bootstrap covariance, dCDH 2024 §3.3) andavg_cumulative_effect(mean of dynamic[0..L] with SE preserving cross-horizon covariance, dCDH 2024 §3.4). sp.bjs_pretrend_joint()— cluster-bootstrap joint Wald pre- trend test for BJS imputation results. Upgrades the default sum-of-z² test (which assumes pre-period independence) to a full covariance-aware statistic.
Added — Reporting & visualisation¶
sp.cs_report(data, ...)— one-call report card. Runs the full pipeline (ATT(g,t) → four aggregations with uniform bands → pre-trend Wald → Rambachan–Roth breakdown M* for every post event time) under a single bootstrap seed and pretty-prints the result. Returns a structuredCSReportdataclass.sp.ggdid(result)— plot routine foraggte()output, mirroring Rdid::ggdid. Auto-dispatches on aggregation type; uniform band overlaid on pointwise CI.CSReport.plot()— one-call 2×2 summary figure: event study with uniform band (top-left), θ(g) per-cohort (top-right), θ(t) per-calendar-time (bottom-left), Rambachan–Roth breakdown M* bars (bottom-right).CSReport.to_markdown()— GitHub-flavoured Markdown export with proper integer-column rendering and a configurablefloat_format.CSReport.to_latex()— formatted booktabs fragment wrapped in atablefloat. Zerojinja2dependency (hand-rolled booktabs renderer); auto-escapes LaTeX special characters.CSReport.to_excel()— six-sheet workbook (Summary,Dynamic,Group,Calendar,Breakdown,Meta). Engine autoselect (openpyxl → xlsxwriter) with a clear ImportError when neither is installed.cs_report(..., save_to='prefix')— one-call dump of the full export matrix: writes<prefix>.{txt,md,tex,xlsx,png}in a single invocation, auto-creating missing parent directories. Optional dependencies (openpyxl, matplotlib) are skipped silently so a minimal install still produces text + md + tex.sp.did(..., aggregation='dynamic', n_boot=..., random_state=...)— the top-level dispatcher now forwards CS-style arguments (aggregation,panel,anticipation) and can pipe a CS result straight throughaggte()in a single call.
Changed¶
sun_abraham()inference layer rewritten — replaces the former ad-hoc√(σ²/(total·T))approximation with a Liang–Zeger cluster-robust sandwich(X'X)⁻¹ Σ_c X_c' u_c u_c' X_c (X'X)⁻¹(small-sample adjusted), delta-method IW aggregation SEsw' V_β w, iterative two-way within transformation (correct on unbalanced panels), and optionalcontrol_group='lastcohort'per SA 2021 §6.sp.honest_did()/sp.breakdown_m()made polymorphic — now accept the legacycallaway_santanna()/sun_abraham()format (event study inmodel_info) and the newaggte(type='dynamic')format (event study indetailwith Mammen uniform bands). The idiomatic pipelinecs → aggte → honest_did → breakdown_mnow runs end-to-end with no manual plumbing.- README DiD parity matrix added, comparing StatsPAI against
csdid,differences, and Rdid+HonestDiDacross 15 capabilities.
Fixed (from pre-release audit rounds)¶
- Critical —
aggte(type='dynamic').estimatepreviously averaged pre- and post-treatment event times into the overall ATT, polluting the headline number with placebo signal. Now averages only e ≥ 0, matching Rdid::aggte's print convention. On a typical DGP the bug shifted the reported overall by nearly a factor of 2. - LaTeX escape non-idempotence in
CSReport.to_latex():\→\textbackslash{}followed by{→\{mangled the just-inserted braces. Fixed with a single-passre.sub. cs_report(save_to='~/study/…')did not expand~; fixed viaos.path.expanduser.cs_report(sa_result)/aggte(sa_result)raised crypticKeyError: 'group'; both entry points now detect non-CS input up-front and raise a clearValueError.cs_report(pre_fitted_cs, estimator=…)silently ignored the override; now emits aUserWarninglisting every shadowed arg.sp.did(method='2x2', aggregation='dynamic')silently ignored CS-only arguments; now raises an informativeValueError.bjs_pretrend_jointswallowed all exceptions as "bootstrap failed"; now narrows to expected failure modes and re-raises unexpected errors with context.matplotlib.use('Agg')in_save_report_bundleno longer switches the backend unconditionally (respects Jupyter sessions).
References¶
- Callaway, B. and Sant'Anna, P.H.C. (2021). J. of Econometrics 225(2).
- Sun, L. and Abraham, S. (2021). J. of Econometrics 225(2).
- Mammen, E. (1993). Ann. Statist. 21(1).
- Liang, K.-Y. and Zeger, S.L. (1986). Biometrika 73(1).
- de Chaisemartin, C. and D'Haultfoeuille, X. (2020). AER 110(9).
- de Chaisemartin, C. and D'Haultfoeuille, X. (2024). RESt, forthcoming.
- Rambachan, A. and Roth, J. (2023). Rev. Econ. Studies 90(5).
- Borusyak, K., Jaravel, X. and Spiess, J. (2024). ReStud 91(6).
[0.6.2] - 2026-04-12¶
Added¶
- OLS
predict():result.predict(newdata=)for out-of-sample prediction on OLS results balance_panel(): Utility to keep only units observed in every time period (sp.balance_panel())- Panel
balance=True: Convenience flag insp.panel()to auto-balance before estimation - Analytical weights for DID:
weights=parameter added todid(),ddd(), andevent_study()for population-weighted estimation (Stata[aweight=...]equivalent) - Matching
ps_poly=: Polynomial propensity score specification (ps_poly=2adds interactions/squares, following Cunningham 2021 Ch. 5) - Synth
rmspeplot: Post/pre RMSPE ratio histogram (synthplot(result, type='rmspe')) per Abadie et al. (2010) - Synth placebo gap plot: Full spaghetti placebo gap paths with
rmspe_thresholdfilter (Abadie et al. 2010, Figure 4) - Graddy (2006) replication: Fulton Fish Market IV example added to
sp.replicate()(Mixtape Ch. 7) - Numerical validation tests: early selected Stata/R reference checks
with humanized error messages; current package-wide evidence is reported
through
validation_status, not a blanket validation claim
Fixed¶
outreg2format auto-detection: Correctly infers.xlsx/.csv/.texfrom filename extension- Synth placebo p-value: Now uses RMSPE ratio (√post/√pre) instead of squared ratio, matching Abadie et al. (2010) convention
Improved¶
- DID/DDD/Event Study: Weights propagation through WLS with proper normalization and validation
- Synth placebos: Store full placebo gap trajectories, per-unit RMSPE ratios, and unit labels for richer post-estimation analysis
- Matching tests: Added comprehensive test suite for PSM, Mahalanobis, CEM, and stratification methods
[0.6.1] - 2026-04-07¶
Fixed¶
- Interactive Editor — Theme switching: Themes now fully reset before applying, so switching between themes (e.g. ggplot → academic) correctly updates all visual properties instead of leaking stale settings
- Interactive Editor — Apply button: Fixed Apply button being clipped/hidden on the Layout tab due to panel overflow
- Interactive Editor — Panel layout: Fixed panel content disappearing when using flex layout for bottom-pinned Apply button
- Interactive Editor — Style tab: Fixed Style tab stuck on "Loading" after Theme tab was reordered to first position
- Interactive Editor — Error visibility: Widget callback errors now surface in the status bar instead of being silently swallowed
Improved¶
- Interactive Editor — Auto mode: Clicking Auto now always refreshes the preview, giving immediate visual feedback
- Interactive Editor — Auto/Manual toggle: Compact toggle button moved to panel header with sticky positioning
- Interactive Editor — Apply button: Separated from Auto toggle and placed at panel bottom-right for better UX
- Interactive Editor — Theme tab: Moved to first position for better discoverability
- Interactive Editor — Color pickers: Added visual confirmation feedback on all color changes
- Interactive Editor — Code generation: Auto-generate reproducible code with text selection support in the editor
- Smart recommendations: Enhanced compare and recommend logic
- Registry: Expanded module support in the registry
[0.1.0] - 2024-07-26¶
Added¶
- Core Regression Framework
- OLS (Ordinary Least Squares) regression with formula interface
- Robust standard errors (HC0, HC1, HC2, HC3)
- Clustered standard errors
-
Weighted Least Squares (WLS) support
-
Causal Inference Module
- Causal Forest implementation inspired by Wager & Athey (2018)
- Honest estimation for unbiased treatment effect estimation
- Bootstrap confidence intervals for treatment effects
-
Formula interface:
"outcome ~ treatment | features | controls" -
Output Management (outreg2)
- Excel export functionality similar to Stata's outreg2
- Support for multiple regression models in single output
- Customizable formatting options
-
Professional table layout
-
Unified API Design
- Consistent
reg()function interface - Formula parsing: R/Stata-style syntax
"y ~ x1 + x2" - Type hints throughout the codebase
- Comprehensive documentation
Technical Details¶
- Python 3.8+ support
- Dependencies: numpy, scipy, pandas, scikit-learn, openpyxl
- MIT License
- Comprehensive test suite