StatsPAI — Ecosystem & Code Statistics
This page tracks StatsPAI's size and coverage against the broader statistical ecosystem, in the spirit of evidence-based positioning. Numbers marked "measured" are reproducible on a standard install; numbers marked "estimated" are extrapolated from public ecosystem statistics with the reasoning shown.
StatsPAI live numbers last measured: 2026-06-08 on macOS arm64 against the local statspai 1.17.0 source tree (post upstream sync). External ecosystem comparison rows are retained from the 2026-05-03 measurement unless noted otherwise.
The four StatsPAI numbers (function count, submodule count, source LOC, test LOC) are reproducible from a single canonical generator: python scripts/registry_stats.py. The per-module table in §2 is regenerated by python scripts/registry_stats.py --table. CI guards against drift via python scripts/registry_stats.py --check.
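The internals of scripts/registry_stats.py are not shown on this page. As a rough sketch of what a `--check`-style drift guard could look like, assuming it does nothing more than compare documented figures with freshly measured ones, the comparison might be written as follows. The function name `check_drift` and the dictionary keys are hypothetical, not the script's actual interface; the submodule count is left out because this page does not state it as a single number.

```python
def check_drift(documented: dict[str, int], measured: dict[str, int]) -> list[str]:
    """Return one message per metric whose documented value differs from the measured one."""
    return [
        f"{key}: documented {documented[key]:,}, measured {measured.get(key, 0):,}"
        for key in documented
        if documented[key] != measured.get(key)
    ]


# Figures as documented on this page (hypothetical key names).
documented = {"functions": 1_031, "source_loc": 271_580, "test_loc": 134_248}

# Hypothetical fresh measurement in which one new function was registered.
measured = dict(documented, functions=1_032)

for message in check_drift(documented, measured):
    print("DRIFT", message)
```

A CI job built on this idea would exit with a non-zero status whenever the returned list is non-empty, which forces the page to be regenerated before a release.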
1 · Cross-ecosystem lines-of-code
| Ecosystem / Project | Method | Files | Lines of code | Primary focus |
|---|---|---|---|---|
| StatsPAI src/statspai/ | measured | 650 | 271,580 | validation-tiered causal inference |
| StatsPAI tests (tests/) | measured | 774 | 134,248 | — |
| statsmodels 0.14.x | measured | 948 | 381,981 | GLM / time series / general |
| linearmodels | measured | 131 | 36,607 | panel / IV |
| Python causal-inference subtotal (statsmodels + linearmodels) | measured | 1,079 | 418,588 | — |
| Stata 18 — official .ado | measured | 3,767 | 937,543 | command layer above closed kernel |
| Stata 18 — official .mata | measured | 411 | 103,822 | Mata numerical layer |
| Stata official executable code | measured | 4,178 | 1,041,365 | (+ 738,543 lines of .sthlp help text, not counted as code) |
| Stata SSC (third-party) | estimated | ~3,500 pkgs | 2M – 4M | community extensions; local sample (reghdfe + winsor2 + 50 others) = 33,296 LOC |
| R base interpreter (C + R + Fortran) | estimated | — | ~1.5M | language itself |
| R base library (73 recommended pkgs) | measured | 509 | 62,321 | shipped with R on this machine |
| CRAN (~22,000 packages, 2026) | estimated | — | 80M–120M (R-only; >200M incl. C/C++/Fortran) | main R package universe |
| Bioconductor (~2,300 packages) | estimated | — | 30M–50M | bioinformatics |
| R ecosystem total | estimated | — | ≈ 150M+ | — |
How to read this table
LOC is a vanity metric in isolation; what matters is coverage density within a target domain. StatsPAI is deliberately scoped to causal inference and applied econometrics. It is not trying to match R's 150M+ lines, because ~90% of CRAN is bioinformatics, visualization, text mining, and general-purpose ML, all of which are out of scope. The relevant comparison is the coverage matrix in §3 below.
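The two subtotal rows in the table above are plain sums of the measured rows that precede them. A short check, using only figures copied from the table (the dictionary keys are illustrative labels), confirms the arithmetic. Note that the Python subtotal covers statsmodels and linearmodels only and does not include StatsPAI itself.

```python
# (files, lines of code) for each measured row, copied from the table above.
rows = {
    "statsmodels": (948, 381_981),
    "linearmodels": (131, 36_607),
    "stata_ado": (3_767, 937_543),
    "stata_mata": (411, 103_822),
}


def subtotal(*names: str) -> tuple[int, int]:
    """Sum files and lines of code over the named rows."""
    files = sum(rows[name][0] for name in names)
    lines = sum(rows[name][1] for name in names)
    return files, lines


python_sub = subtotal("statsmodels", "linearmodels")
stata_sub = subtotal("stata_ado", "stata_mata")

print(python_sub)  # (1079, 418588)
print(stata_sub)   # (4178, 1041365)
```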
2 · StatsPAI per-module breakdown
Sorted by LOC. This table is generated from the live source tree by python scripts/registry_stats.py --table; it intentionally avoids editorial focus labels so the numbers can be refreshed mechanically before a release.
| Module | LOC | Files | Registered functions (sp.*) |
|---|---|---|---|
| synth | 20,256 | 31 | 53 |
| did | 17,188 | 33 | 59 |
| rd | 13,696 | 25 | 53 |
| regression | 11,782 | 19 | 37 |
| output | 10,891 | 21 | 40 |
| agent | 9,709 | 30 | 0 |
| smart | 8,924 | 17 | 26 |
| decomposition | 7,509 | 18 | 31 |
| iv | 6,599 | 16 | 8 |
| diagnostics | 6,115 | 12 | 22 |
| fast | 5,799 | 13 | 0 |
| plots | 5,176 | 6 | 6 |
| spatial | 5,136 | 29 | 35 |
| core | 5,004 | 10 | 2 |
| inference | 4,949 | 15 | 24 |
| panel | 4,906 | 12 | 17 |
| matching | 4,052 | 9 | 23 |
| frontier | 4,008 | 8 | 12 |
| workflow | 3,890 | 5 | 1 |
| bayes | 3,834 | 10 | 17 |
| multilevel | 3,719 | 8 | 11 |
| dml | 3,482 | 12 | 14 |
| mendelian | 3,410 | 13 | 38 |
| causal_discovery | 3,379 | 11 | 20 |
| dag | 2,924 | 9 | 19 |
| metalearners | 2,916 | 8 | 23 |
| structural | 2,784 | 9 | 12 |
| survival | 2,661 | 6 | 12 |
| neural_causal | 2,651 | 6 | 16 |
| causal_llm | 2,612 | 10 | 11 |
| timeseries | 2,465 | 9 | 18 |
| tmle | 2,351 | 6 | 9 |
| bounds | 2,268 | 5 | 9 |
| robustness | 2,243 | 6 | 11 |
| forest | 2,168 | 5 | 8 |
| conformal_causal | 2,002 | 9 | 19 |
| utils | 1,906 | 8 | 32 |
| epi | 1,860 | 6 | 20 |
| interference | 1,837 | 10 | 18 |
| bartik | 1,733 | 4 | 6 |
| question | 1,552 | 3 | 6 |
| proximal | 1,525 | 8 | 7 |
| postestimation | 1,509 | 4 | 12 |
| qte | 1,488 | 6 | 12 |
| causal_text | 1,330 | 4 | 4 |
| bcf | 1,286 | 5 | 5 |
| bridge | 1,236 | 8 | 2 |
| bunching | 1,206 | 5 | 8 |
| target_trial | 1,151 | 7 | 9 |
| mediation | 1,133 | 4 | 4 |
| power | 1,099 | 3 | 12 |
| datasets | 1,007 | 2 | 0 |
| policy_learning | 981 | 3 | 3 |
| principal_strat | 923 | 2 | 3 |
| gmm | 876 | 3 | 2 |
| dtr | 833 | 5 | 2 |
| fairness | 833 | 3 | 6 |
| longitudinal | 800 | 3 | 7 |
| experimental | 794 | 4 | 9 |
| survey | 789 | 4 | 7 |
| causal_rl | 782 | 5 | 9 |
| selection | 766 | 2 | 3 |
| deepiv | 739 | 2 | 2 |
| gformula | 736 | 3 | 4 |
| surrogate | 718 | 2 | 3 |
| mht | 694 | 2 | 6 |
| assimilation | 681 | 3 | 2 |
| ope | 619 | 3 | 3 |
| msm | 612 | 2 | 3 |
| dose_response | 601 | 3 | 2 |
| transport | 567 | 5 | 8 |
| causal_impact | 509 | 2 | 3 |
| fixest | 504 | 3 | 4 |
| nonparametric | 447 | 3 | 4 |
| imputation | 445 | 2 | 3 |
| matrix_completion | 377 | 2 | 2 |
| compat | 351 | 2 | 0 |
| multi_treatment | 312 | 2 | 2 |
| censoring | 284 | 2 | 2 |
| causal | 101 | 1 | 0 |
| schemas | 0 | 0 | 0 |
| Total | 271,580 | 650 | 1,031 |
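The LOC and Files columns above can be approximated with a few lines of standard-library Python. This is a minimal sketch under stated assumptions, not the actual logic of scripts/registry_stats.py: it treats every top-level directory under the package root as one module, counts newline characters the way `wc -l` does, and ignores the registered-function column entirely. The names `module_stats` and `as_rows` are hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def module_stats(package_root: Path) -> dict[str, tuple[int, int]]:
    """Map each top-level subpackage to (LOC, file count) over its *.py files."""
    loc: dict[str, int] = defaultdict(int)
    files: dict[str, int] = defaultdict(int)
    for path in package_root.rglob("*.py"):
        rel = path.relative_to(package_root)
        if len(rel.parts) < 2:
            continue  # files directly under the root belong to no submodule
        module = rel.parts[0]
        # Count newline characters, which is what `wc -l` reports.
        loc[module] += path.read_bytes().count(b"\n")
        files[module] += 1
    return {module: (loc[module], files[module]) for module in loc}


def as_rows(stats: dict[str, tuple[int, int]]) -> list[str]:
    """Format the stats as table rows sorted by LOC, largest first."""
    ordered = sorted(stats.items(), key=lambda item: item[1][0], reverse=True)
    return [f"| {name} | {n_loc:,} | {n_files} |" for name, (n_loc, n_files) in ordered]


# Example (run from the repository root):
# print("\n".join(as_rows(module_stats(Path("src/statspai")))))
```

A directory with no Python files, such as the zero-line `schemas` row, would not appear in this sketch's output, so the real generator presumably enumerates modules in some other way.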
3 · Causal-inference coverage matrix (full)
Legend: B = broad API coverage within this comparison table; Y = implemented entry points; P = partial, scattered, or single-algorithm support; N = no first-class entry point. These are API-breadth labels, not validation tiers.
"Stata" = official + major SSC packages. "R" = CRAN. "sm+lm" = statsmodels + linearmodels.
| # | Method family | Stata | R | sm+lm | DoubleML | StatsPAI | StatsPAI entry points |
|---|---|---|---|---|---|---|---|
| 1 | DiD — staggered (CS / SA / BJS / dCdH / Gardner / Wooldridge ET) + event-study + honest CIs (Rambachan-Roth) | P | Y | N | N | B | sp.callaway_santanna, sp.sun_abraham, sp.borusyak_jaravel_spiess, sp.dchd, sp.gardner_did, sp.wooldridge_did, sp.honest_did, sp.cs_report |
| 2 | IV — classical (2SLS / LIML / GMM) + modern (Kernel IV / Deep IV / KAN-DeepIV) | Y classical only | Y classical only | P classical (lm) | P | B | sp.ivreg, sp.kernel_iv, sp.deep_iv, sp.kan_deepiv |
| 3 | RD — CCT sharp/fuzzy/kink + 2D / boundary + multi-cutoff + honest CIs + ML-CATE (18+ estimators) | P (rdrobust SSC) | Y (rdrobust) | N | N | B | sp.rdrobust, sp.rd2d, sp.rdhte, sp.rd_forest, sp.rd_boost, sp.rdrandinf, sp.rdpower, sp.rdsummary |
| 4 | Synthetic Control (ADH / ASCM / gsynth / BSTS / Bayesian / PenSCM / FDID — 20 methods + 6 inference strategies) | P (synth SSC) | P (7 pkgs: Synth, gsynth, CausalImpact, MSCMT, …) | N | N | B | sp.synth(method=...), sp.synth_compare, sp.synth_recommend, sp.synth_power, sp.synth_sensitivity |
| 5 | Matching — PS / CEM / optimal / cardinality / one-to-many | P (psmatch2 SSC) | Y (MatchIt, optmatch) | N | N | Y | sp.match, sp.cem, sp.optimal_match, sp.cardinality_match |
| 6 | Double / Debiased ML | N | Y (DoubleML) | N | Y | Y | sp.dml(model=...), sp.dml_model_averaging, sp.kernel_dml |
| 7 | Meta-Learners (S/T/X/R/DR) + Causal Forest / GRF | N | Y (grf, rlearner) | N | N | Y | sp.s_learner, sp.t_learner, sp.x_learner, sp.r_learner, sp.dr_learner, sp.causal_forest |
| 8 | TMLE / HAL-TMLE | N | Y (tmle, hal9001) | N | N | Y | sp.tmle, sp.hal_tmle, sp.ctmle |
| 9 | Neural causal (TARNet / CFRNet / DragonNet / CEVAE) | N | N | N | N | B | sp.tarnet, sp.cfrnet, sp.dragonnet, sp.cevae |
| 10 | Causal discovery (NOTEARS / PC / LiNGAM / GES + deep variants) | N | P (pcalg) | N | N | B | sp.notears, sp.pc_algorithm, sp.lingam, sp.ges |
| 11 | Proximal CI (fortified / bidirectional / MTP / double-negative-control / surrogate) | N | P (pci scattered) | N | N | B | sp.proximal, sp.fortified_pci, sp.bidirectional_pci, sp.pci_mtp, sp.double_negative_control |
| 12 | QTE / distributional TE / CiC / dist-IV / beyond-avg-LATE / HD panel | P (ivqreg) | P (qte, Counterfactual) | N | N | Y | sp.qte, sp.qdid, sp.cic, sp.distributional_te, sp.dist_iv |
| 13 | Mendelian randomization (IVW / Egger / median / mode / MR-PRESSO / MVMR / BMA) | N | Y (MendelianRandomization, TwoSampleMR) | N | N | Y | sp.mr_ivw, sp.mr_egger, sp.mr_median, sp.mr_presso, sp.mvmr, sp.mr_bma |
| 14 | Conformal causal inference (ITE / CATE / density / dose-response / cluster / fair) | N | N | N | N | B | sp.conformal_ite, sp.conformal_cate, sp.conformal_dose_response |
| 15 | Bayesian causal (BCF / ordinal BCF / factor-exposure BCF) | N | P (bcf) | N | N | Y | sp.bcf, sp.bcf_ordinal, sp.bcf_factor_exposure |
| 16 | Spatial econometrics (weights → ESDA → ML/GMM → GWR/MGWR → panel) | N | P (5 pkgs: spdep, spatialreg, sphet, splm, GWmodel) | N | N | B | 38 functions including sp.sem, sp.sar, sp.gwr, sp.mgwr, sp.splm |
| 17 | Policy learning / OPE | N | P (policytree) | N | N | Y | sp.policy_tree, sp.policy_value, sp.doubly_robust_ope |
| 18 | Bunching estimation | P (bunching SSC) | N | N | N | Y | sp.bunching, sp.kink_bunching |
| 19 | Interference / spillover (partial / network / cluster-RCT / HTE) | N | P (interference) | N | N | B | 18 functions including sp.spillover, sp.cluster_rct, sp.hte_interference |
| 20 | Matrix completion for panels | N | P (gsynth) | N | N | Y | sp.matrix_completion |
| 21 | Causal MAS (multi-agent LLM causal discovery — arXiv:2509.00987) | N | N | N | N | B | sp.causal_mas, sp.causal_llm.openai_client, sp.causal_llm.anthropic_client |
| 22 | Publication tables (Word / Excel / LaTeX / HTML / Markdown) | P (outreg2) | P (modelsummary) | P | N | B | Supported result objects expose .to_word() / .to_excel() / .to_latex() / .to_html() |
| 23 | Agent-native tool-calling schemas (function_schema()) | N | N | N | N | B | sp.list_functions(), sp.describe_function(), sp.function_schema(), sp.agent.mcp_server |
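To summarize the matrix above at a glance, the legend labels can be tallied per column. The sketch below transcribes each column by hand as one character per row (rows 1 to 23), keeping only the leading legend letter of each cell, so that "Y classical only" is recorded as "Y". That simplification and the variable names are ours; the labels themselves come from the table.

```python
from collections import Counter

# One character per row of the matrix (rows 1-23), leading legend letter only.
matrix = {
    "Stata":    "PYPPPNNNNNNPNNNNNPNNNPN",
    "R":        "YYYPYYYYNPPPYNPPPNPPNPN",
    "sm+lm":    "NP" + "N" * 19 + "PN",
    "DoubleML": "NPNNNY" + "N" * 17,
    "StatsPAI": "BBBBYYYYBBBYYBYBYYBYBBB",
}

# Guard against transcription slips: every column must cover all 23 rows.
assert all(len(column) == 23 for column in matrix.values())

tally = {name: Counter(column) for name, column in matrix.items()}
for name, counts in tally.items():
    print(name, {label: counts.get(label, 0) for label in "BYPN"})

# Rows with no first-class entry point in Stata, R, and statsmodels + linearmodels alike.
uncovered = [
    row + 1
    for row in range(23)
    if all(matrix[name][row] == "N" for name in ("Stata", "R", "sm+lm"))
]
print(uncovered)  # [9, 14, 21, 23]
```

Counted this way, StatsPAI has 13 B and 10 Y labels, while 41 of the 69 Stata, R, and sm+lm cells are N. These counts describe API breadth only, as the legend notes.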
4 · How to reproduce these numbers
# StatsPAI (Python)
# `cat ... | wc -l` prints one grand total even if find splits the file list into batches.
find src/statspai -name '*.py' -exec cat {} + | wc -l
find tests -name '*.py' -exec cat {} + | wc -l
python3 -c "import statspai as sp; print(len(sp.list_functions()))"
# statsmodels + linearmodels (print the install locations, then count)
python3 -c "import statsmodels, os; print(os.path.dirname(statsmodels.__file__))"
python3 -c "import linearmodels, os; print(os.path.dirname(linearmodels.__file__))"
find "$(python3 -c "import statsmodels,os;print(os.path.dirname(statsmodels.__file__))")" -name '*.py' -exec cat {} + | wc -l
find "$(python3 -c "import linearmodels,os;print(os.path.dirname(linearmodels.__file__))")" -name '*.py' -exec cat {} + | wc -l
# Stata (macOS default install path)
find /Applications/Stata/ado/base -name '*.ado' -exec cat {} + | wc -l
find /Applications/Stata/ado/base -name '*.mata' -exec cat {} + | wc -l
# R (installed library only; base-interpreter source requires R-<version>.tar.gz)
find /Library/Frameworks/R.framework/Resources/library \( -name '*.R' -o -name '*.r' \) -exec cat {} + | wc -l
Ecosystem-wide estimates (SSC / CRAN / Bioconductor totals) are drawn from:
- SSC — Boston College Statistical Software Components archive package listing.
- CRAN — METACRAN and CRAN by topic task views.
- Bioconductor — Bioconductor 3.20 release statistics.
The SSC sample used for extrapolation is the 52-file local install (/Users/brycewang/Library/Application Support/Stata/ado/plus, 33,296 LOC — includes reghdfe, winsor2, and others).
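This page does not spell out how the SSC sample was scaled up, so the following is only one plausible reading: a linear extrapolation from the sample's average size. It assumes the 52 sampled items correspond to 52 packages and that the sample is representative of the roughly 3,500 packages on SSC; both assumptions are ours.

```python
sample_loc = 33_296      # LOC in the local SSC sample (from this page)
sample_packages = 52     # assumption: the 52 sampled items are 52 packages
ssc_packages = 3_500     # approximate SSC package count (from this page)

loc_per_package = sample_loc / sample_packages
naive_total = loc_per_package * ssc_packages

print(f"{loc_per_package:.0f} LOC per package")      # 640 LOC per package
print(f"{naive_total / 1e6:.2f}M LOC extrapolated")  # 2.24M LOC extrapolated
```

The linear estimate lands near the lower end of the 2M – 4M range quoted in §1. How the upper end of that range was chosen is not stated here.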
5 · Why we don't lead with "line-count wins"
Three reasons a naked LOC comparison is misleading for StatsPAI positioning:
- Vanity metric: 272K (the measured 271,580 source lines in §1) vs R's 150M+ tells a reviewer nothing about capability per line. R's CRAN is ~90% out-of-scope (bioinformatics, visualization, text, general ML), so the comparison is apples to oranges.
- Moving target: CRAN and statsmodels grow every month. A headline number in the README rots quarterly unless regenerated by CI.
- Wrong axis: StatsPAI's differentiator is causal-inference depth in one API (see §3). Most cells where we win are N (no first-class entry point) in Stata / R / statsmodels entirely, not "fewer lines".
The coverage matrix in §3 is the honest positioning. LOC in §1 is supporting evidence for "yes, there is real code behind these claims" — not a competitive boast.