Skip to content

Design Rationale & Frequently-Raised Questions

This page answers the questions reviewers most often raise about a package of StatsPAI's scope, with pointers to the concrete evidence in the repository. It complements the JOSS reviewer guide (how to install and smoke-test) and the JOSS validation dossier (the validation evidence itself). Every claim below links to a file you can open.


Scope & quality

With 1,000+ functions across 81 submodules, how can a reviewer trust the breadth?

StatsPAI does not ask reviewers to take 1,020 functions on faith. Three mechanisms make the breadth auditable rather than assertable:

  1. Validation tiering is explicit and machine-readable. Parity-backed functions expose a validation_status so a human or agent can tell certified numerical evidence apart from API-stable breadth. Nothing is silently presented as validated when it is not.
  2. Independent validation tracks, not self-comparison. The numerical core is pinned against external references:
  3. 124 reference-parity checks (tests/reference_parity/) against R,
  4. 54 external-parity checks (tests/external_parity/) against the canonical upstream Python implementations (e.g. DoubleML),
  5. R/Stata cross-language parity (tests/r_parity/, tests/stata_parity/),
  6. Monte-Carlo coverage checks (tests/coverage_monte_carlo/),
  7. original-paper replays (tests/orig_parity/). The archived full-suite report records 5,200 passed on Python 3.9 (test_results_full_suite.md).
  8. Coverage is honest and rising. The CI line-coverage gate was lifted from a 15% placeholder to 60% against a measured 70.5% baseline (historical full-suite coverage XML snapshot). The current CI gate uses the coverage.xml generated inside the same job, while .coverage_campaign/ tracks the dedicated campaign driving the six core estimator modules (did iv rd synth dml panel) to ≥95% line coverage under the full suite — test-only, with real numerical assertions, no mocking of numerical paths.

The honest framing — also stated in the validation dossier — is that certified methods carry parity or analytic evidence, while breadth methods are API-stable and documented; the validation_status field is the contract that keeps the two from being conflated.

Isn't a package this large just thin wrappers around existing tools?

No. A grep of the source shows exactly one wrapper boundary (src/statspai/fixest/wrapper.py, an opt-in pyfixest bridge); the estimators are first-party implementations. StatsPAI depends on the scientific-Python stack (NumPy, SciPy, pandas, statsmodels, scikit-learn, linearmodels) and implements the causal/econometric layer on top of it rather than re-exporting another package's estimators.


Double Machine Learning (sp.dml)

Given that DoubleML already exists, why does sp.dml exist? What is the increment?

sp.dml is not a competing DML engine — and the documentation says so plainly. Its value is integration, not reimplementation of the theory:

  • One import, one result object. The same sp.dml(model=...) dispatcher sits behind import statspai as sp next to DiD, IV, RD, synthetic control, meta-learners, etc., and returns the shared CausalResult (.summary(), .to_latex(), .to_word(), .cite()). A user moving between designs does not switch libraries or result conventions.
  • Agent-native surface. Every estimator, sp.dml included, is discoverable via list_functions() / describe_function() / function_schema(), so the same entry point serves humans and agents.
  • Cross-ecosystem alignment. sp.dml is pinned against both DoubleML ecosystems (R and Python), so the numerical claim is auditable from either side.

The increment is a unified, agent-native, R/Stata-aligned interface over a correctly-implemented orthogonal-score core — not a claim to improve on the DoubleML estimator itself.

Are the DML scores correct (Neyman-orthogonal) and is the cross-fitting faithful?

Yes, and this is verified numerically rather than asserted. Under identical scikit-learn learners and the same cross-fit fold partition with a fixed seed:

Model Agreement with doubleml-for-py
PLR (partially linear) machine precision — |Δ coef| ≈ 1.1×10⁻¹⁶, |Δ SE| ≈ 1.4×10⁻¹⁷
PLIV (partially linear IV) machine precision under shared learners/folds
IRM (AIPW ATE) within ≈ 0.10 SE; residual traced to AIPW score construction (not trimming, not IPW normalization)
IIVM (interactive IV / LATE) analogous orthogonal score; pinned alongside IRM

Machine-precision PLR/PLIV agreement is only possible if both implementations evaluate the same Neyman-orthogonal score on the same cross-fit partition. The full numbers, software versions, and an honest discussion of where and why sp.dml diverges are in docs/guides/sp_dml_vs_doubleml.md; the checks themselves are tests/external_parity/test_dml_python_parity.py (Python) and tests/reference_parity/test_dml_parity.py (R).

To reproduce:

python -m pip install -e ".[dev,parity]"   # parity extra adds doubleml-for-py
python -m pytest tests/external_parity/test_dml_python_parity.py -v

Installation & dependencies

Does the package install cleanly, and are heavy dependencies optional?

Yes. The core runtime declares only the scientific-Python stack; torch, jax, and pymc live in optional extras (neural/deepiv, performance, bayes) and are lazily imported. CI proves this with a bare-install smoke step: it builds the wheel, installs it into a fresh virtual environment with no extras, and runs core estimators (IV on the Card data, staggered DiD on the MPDTA data) — so a regression that eager-imports an optional dependency fails CI rather than a downstream user's pip install statspai (see the bare install — lazy-import guard step in .github/workflows/ci-cd.yml).


Reproducibility

Can a reviewer reproduce the parity claims without installing R and Stata?

Yes. The R/Stata reference outputs are captured once and committed as fixtures (JSON), so the reference-parity suite runs against the cached numbers with only a Python toolchain:

python -m pytest tests/reference_parity/ -q --no-cov

R or Stata is needed only to regenerate a fixture (when a DGP changes), not to run the parity check. Each fixture records the exact upstream software version it was generated with, so the comparison is pinned and documented rather than live and drifting.

Why does the version under review differ from the latest release?

Development continued during review. The reviewed numerical evidence (parity tables, dossier numbers) is pinned to committed fixtures and tests, so it is reproducible at the reviewed commit regardless of later releases; the maintainer coordinates the exact archival version with the editor.


Documentation

Full documentation, including the API reference generated directly from NumPy-style docstrings, is published at https://brycewang-stanford.github.io/StatsPAI/ and builds under mkdocs build --strict as a CI gate.