
StatsPAI — Ecosystem & Code Statistics

This page tracks StatsPAI's size and coverage against the broader statistical ecosystem, in the spirit of evidence-based positioning. Numbers marked "measured" are reproducible on a standard install; numbers marked "estimated" are extrapolated from public ecosystem statistics with the reasoning shown.

StatsPAI live numbers last measured: 2026-06-08 on macOS arm64 against the local statspai 1.17.0 source tree (post upstream sync). External ecosystem comparison rows are retained from the 2026-05-03 measurement unless noted otherwise.

The four StatsPAI numbers (function count, submodule count, source LOC, test LOC) are reproducible from a single canonical generator: python scripts/registry_stats.py. The per-module table in §2 is regenerated by python scripts/registry_stats.py --table. CI guards against drift via python scripts/registry_stats.py --check.
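A drift guard of this kind can be sketched as follows. This is only an illustration of the idea, not the actual scripts/registry_stats.py: the function name `find_drift` and the dictionary keys are assumptions, and the recorded values are the ones documented on this page.

```python
# Hypothetical sketch of a drift check; not the real scripts/registry_stats.py.

def find_drift(recorded: dict, measured: dict) -> dict:
    """Return {name: (recorded, measured)} for every statistic that differs."""
    return {
        name: (recorded[name], measured.get(name))
        for name in recorded
        if recorded[name] != measured.get(name)
    }

# Values as documented on this page (key names are illustrative).
RECORDED = {
    "functions": 1031,
    "source_files": 650,
    "source_loc": 271_580,
    "test_loc": 134_248,
}

# Stand-in for a fresh measurement: pretend one source line was added.
measured = dict(RECORDED, source_loc=271_581)

drift = find_drift(RECORDED, measured)
for name, (old, new) in drift.items():
    print(f"{name}: documented {old}, measured {new}")
# A CI job would exit non-zero whenever drift is non-empty.
```

Comparing a small dictionary of recorded values against a fresh measurement keeps the check cheap enough to run on every commit.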


1 · Cross-ecosystem lines-of-code

| Ecosystem / Project | Method | Files | Lines of code | Primary focus |
|---|---|---|---|---|
| StatsPAI src/statspai/ | measured | 650 | 271,580 | validation-tiered causal inference |
| StatsPAI tests (tests/) | measured | 774 | 134,248 | |
| statsmodels 0.14.x | measured | 948 | 381,981 | GLM / time series / general |
| linearmodels | measured | 131 | 36,607 | panel / IV |
| Python causal-inference subtotal | | 1,079 | 418,588 | |
| Stata 18, official .ado | measured | 3,767 | 937,543 | command layer above closed kernel |
| Stata 18, official .mata | measured | 411 | 103,822 | Mata numerical layer |
| Stata official executable code | measured | 4,178 | 1,041,365 | (+ 738,543 lines of .sthlp help text, not counted as code) |
| Stata SSC (third-party) | estimated | ~3,500 pkgs | 2M – 4M | community extensions; local sample (reghdfe + winsor2 + 50 others) = 33,296 LOC |
| R base interpreter (C + R + Fortran) | estimated | | ~1.5M | language itself |
| R base library (73 recommended pkgs) | measured | 509 | 62,321 | shipped with R on this machine |
| CRAN (~22,000 packages, 2026) | estimated | | 80M–120M (R-only; >200M incl. C/C++/Fortran) | main R package universe |
| Bioconductor (~2,300 packages) | estimated | | 30M–50M | bioinformatics |
| R ecosystem total | estimated | | ≈ 150M+ | |
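The subtotal rows in the table can be checked with a few lines of arithmetic. The sketch below uses only figures copied from the table; the variable and helper names are illustrative.

```python
# Figures copied from the table above: (files, lines of code).
statsmodels = (948, 381_981)
linearmodels = (131, 36_607)
stata_ado = (3_767, 937_543)
stata_mata = (411, 103_822)

def add(a, b):
    """Element-wise sum of two (files, loc) pairs."""
    return (a[0] + b[0], a[1] + b[1])

python_subtotal = add(statsmodels, linearmodels)
stata_total = add(stata_ado, stata_mata)

print(python_subtotal)  # (1079, 418588)
print(stata_total)      # (4178, 1041365)

# Test-to-source ratio for StatsPAI (tests/ LOC over src/statspai/ LOC).
src_loc, test_loc = 271_580, 134_248
print(f"{test_loc / src_loc:.2f}")  # 0.49
```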

How to read this table

LOC is a vanity metric in isolation; what matters is coverage density within a target domain. StatsPAI is deliberately scoped to causal inference and applied econometrics. It is not trying to match R's 150M+ lines, because ~90% of CRAN is bioinformatics, visualization, text mining, and general-purpose ML, all of which are out of scope. The relevant comparison is the coverage matrix in §3 below.


2 · StatsPAI per-module breakdown

Sorted by LOC. This table is generated from the live source tree by python scripts/registry_stats.py --table; it intentionally avoids editorial focus labels so the numbers can be refreshed mechanically before a release.

Module LOC Files Registered functions (sp.*)
synth 20,256 31 53
did 17,188 33 59
rd 13,696 25 53
regression 11,782 19 37
output 10,891 21 40
agent 9,709 30 0
smart 8,924 17 26
decomposition 7,509 18 31
iv 6,599 16 8
diagnostics 6,115 12 22
fast 5,799 13 0
plots 5,176 6 6
spatial 5,136 29 35
core 5,004 10 2
inference 4,949 15 24
panel 4,906 12 17
matching 4,052 9 23
frontier 4,008 8 12
workflow 3,890 5 1
bayes 3,834 10 17
multilevel 3,719 8 11
dml 3,482 12 14
mendelian 3,410 13 38
causal_discovery 3,379 11 20
dag 2,924 9 19
metalearners 2,916 8 23
structural 2,784 9 12
survival 2,661 6 12
neural_causal 2,651 6 16
causal_llm 2,612 10 11
timeseries 2,465 9 18
tmle 2,351 6 9
bounds 2,268 5 9
robustness 2,243 6 11
forest 2,168 5 8
conformal_causal 2,002 9 19
utils 1,906 8 32
epi 1,860 6 20
interference 1,837 10 18
bartik 1,733 4 6
question 1,552 3 6
proximal 1,525 8 7
postestimation 1,509 4 12
qte 1,488 6 12
causal_text 1,330 4 4
bcf 1,286 5 5
bridge 1,236 8 2
bunching 1,206 5 8
target_trial 1,151 7 9
mediation 1,133 4 4
power 1,099 3 12
datasets 1,007 2 0
policy_learning 981 3 3
principal_strat 923 2 3
gmm 876 3 2
dtr 833 5 2
fairness 833 3 6
longitudinal 800 3 7
experimental 794 4 9
survey 789 4 7
causal_rl 782 5 9
selection 766 2 3
deepiv 739 2 2
gformula 736 3 4
surrogate 718 2 3
mht 694 2 6
assimilation 681 3 2
ope 619 3 3
msm 612 2 3
dose_response 601 3 2
transport 567 5 8
causal_impact 509 2 3
fixest 504 3 4
nonparametric 447 3 4
imputation 445 2 3
matrix_completion 377 2 2
compat 351 2 0
multi_treatment 312 2 2
censoring 284 2 2
causal 101 1 0
schemas 0 0 0
Total 271,580 650 1,031
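Rows in this table have a fixed "module LOC files functions" shape, so they are easy to post-process. The sketch below parses the first three rows and computes their share of total source LOC; `parse_row` is an illustrative helper, not part of StatsPAI.

```python
def parse_row(line: str):
    """Split a 'module LOC files functions' row into typed fields."""
    name, loc, files, funcs = line.split()
    return name, int(loc.replace(",", "")), int(files), int(funcs)

# First three rows and the total LOC, copied from the table above.
rows = [parse_row(r) for r in (
    "synth 20,256 31 53",
    "did 17,188 33 59",
    "rd 13,696 25 53",
)]
total_loc = 271_580

top3_loc = sum(loc for _, loc, _, _ in rows)
print(top3_loc)                       # 51140
print(f"{top3_loc / total_loc:.1%}")  # 18.8%

# LOC per registered function. Modules with 0 registered functions
# (e.g. agent, fast) would need a guard against division by zero.
for name, loc, _, funcs in rows:
    print(name, round(loc / funcs))
```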

3 · Causal-inference coverage matrix (full)

Legend: B = broad API coverage within this comparison table; Y = implemented entry points; P = partial, scattered, or single-algorithm support; N = no first-class entry point. These are API-breadth labels, not validation tiers.

"Stata" = official + major SSC packages. "R" = CRAN. "sm+lm" = statsmodels + linearmodels.

| # | Method family | Stata | R | sm+lm | DoubleML | StatsPAI | StatsPAI entry points |
|---|---|---|---|---|---|---|---|
| 1 | DiD: staggered (CS / SA / BJS / dCdH / Gardner / Wooldridge ET) + event-study + honest CIs (Rambachan-Roth) | P | Y | N | N | B | sp.callaway_santanna, sp.sun_abraham, sp.borusyak_jaravel_spiess, sp.dchd, sp.gardner_did, sp.wooldridge_did, sp.honest_did, sp.cs_report |
| 2 | IV: classical (2SLS / LIML / GMM) + modern (Kernel IV / Deep IV / KAN-DeepIV) | Y classical only | Y classical only | P classical (lm) | P | B | sp.ivreg, sp.kernel_iv, sp.deep_iv, sp.kan_deepiv |
| 3 | RD: CCT sharp/fuzzy/kink + 2D / boundary + multi-cutoff + honest CIs + ML-CATE (18+ estimators) | P (rdrobust SSC) | Y (rdrobust) | N | N | B | sp.rdrobust, sp.rd2d, sp.rdhte, sp.rd_forest, sp.rd_boost, sp.rdrandinf, sp.rdpower, sp.rdsummary |
| 4 | Synthetic Control (ADH / ASCM / gsynth / BSTS / Bayesian / PenSCM / FDID; 20 methods + 6 inference strategies) | P (synth SSC) | P (7 pkgs: Synth, gsynth, CausalImpact, MSCMT, …) | N | N | B | sp.synth(method=...), sp.synth_compare, sp.synth_recommend, sp.synth_power, sp.synth_sensitivity |
| 5 | Matching: PS / CEM / optimal / cardinality / one-to-many | P (psmatch2 SSC) | Y (MatchIt, optmatch) | N | N | Y | sp.match, sp.cem, sp.optimal_match, sp.cardinality_match |
| 6 | Double / Debiased ML | N | Y (DoubleML) | N | Y | Y | sp.dml(model=...), sp.dml_model_averaging, sp.kernel_dml |
| 7 | Meta-Learners (S/T/X/R/DR) + Causal Forest / GRF | N | Y (grf, rlearner) | N | N | Y | sp.s_learner, sp.t_learner, sp.x_learner, sp.r_learner, sp.dr_learner, sp.causal_forest |
| 8 | TMLE / HAL-TMLE | N | Y (tmle, hal9001) | N | N | Y | sp.tmle, sp.hal_tmle, sp.ctmle |
| 9 | Neural causal (TARNet / CFRNet / DragonNet / CEVAE) | N | N | N | N | B | sp.tarnet, sp.cfrnet, sp.dragonnet, sp.cevae |
| 10 | Causal discovery (NOTEARS / PC / LiNGAM / GES + deep variants) | N | P (pcalg) | N | N | B | sp.notears, sp.pc_algorithm, sp.lingam, sp.ges |
| 11 | Proximal CI (fortified / bidirectional / MTP / double-negative-control / surrogate) | N | P (pci scattered) | N | N | B | sp.proximal, sp.fortified_pci, sp.bidirectional_pci, sp.pci_mtp, sp.double_negative_control |
| 12 | QTE / distributional TE / CiC / dist-IV / beyond-avg-LATE / HD panel | P (ivqreg) | P (qte, Counterfactual) | N | N | Y | sp.qte, sp.qdid, sp.cic, sp.distributional_te, sp.dist_iv |
| 13 | Mendelian randomization (IVW / Egger / median / mode / MR-PRESSO / MVMR / BMA) | N | Y (MendelianRandomization, TwoSampleMR) | N | N | Y | sp.mr_ivw, sp.mr_egger, sp.mr_median, sp.mr_presso, sp.mvmr, sp.mr_bma |
| 14 | Conformal causal inference (ITE / CATE / density / dose-response / cluster / fair) | N | N | N | N | B | sp.conformal_ite, sp.conformal_cate, sp.conformal_dose_response |
| 15 | Bayesian causal (BCF / ordinal BCF / factor-exposure BCF) | N | P (bcf) | N | N | Y | sp.bcf, sp.bcf_ordinal, sp.bcf_factor_exposure |
| 16 | Spatial econometrics (weights → ESDA → ML/GMM → GWR/MGWR → panel) | N | P (5 pkgs: spdep, spatialreg, sphet, splm, GWmodel) | N | N | B | 38 functions including sp.sem, sp.sar, sp.gwr, sp.mgwr, sp.splm |
| 17 | Policy learning / OPE | N | P (policytree) | N | N | Y | sp.policy_tree, sp.policy_value, sp.doubly_robust_ope |
| 18 | Bunching estimation | P (bunching SSC) | N | N | N | Y | sp.bunching, sp.kink_bunching |
| 19 | Interference / spillover (partial / network / cluster-RCT / HTE) | N | P (interference) | N | N | B | 18 functions including sp.spillover, sp.cluster_rct, sp.hte_interference |
| 20 | Matrix completion for panels | N | P (gsynth) | N | N | Y | sp.matrix_completion |
| 21 | Causal MAS (multi-agent LLM causal discovery; arXiv:2509.00987) | N | N | N | N | B | sp.causal_mas, sp.causal_llm.openai_client, sp.causal_llm.anthropic_client |
| 22 | Publication tables (Word / Excel / LaTeX / HTML / Markdown) | P (outreg2) | P (modelsummary) | P | N | B | Supported result objects expose .to_word() / .to_excel() / .to_latex() / .to_html() |
| 23 | Agent-native tool-calling schemas (function_schema()) | N | N | N | N | B | sp.list_functions(), sp.describe_function(), sp.function_schema(), sp.agent.mcp_server |
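The matrix can be summarized by tallying the B / Y / P / N labels per column. The sketch below transcribes only the leading label of each cell, dropping qualifiers such as "(rdrobust)" or "classical only"; the encoding and variable names are illustrative.

```python
from collections import Counter

COLUMNS = ("Stata", "R", "sm+lm", "DoubleML", "StatsPAI")

# One string per matrix row (#1 to #23); each character is the leading
# label of the cell, in COLUMNS order.
ROWS = [
    "PYNNB", "YYPPB", "PYNNB", "PPNNB", "PYNNY", "NYNYY", "NYNNY", "NYNNY",
    "NNNNB", "NPNNB", "NPNNB", "PPNNY", "NYNNY", "NNNNB", "NPNNY", "NPNNB",
    "NPNNY", "PNNNY", "NPNNB", "NPNNY", "NNNNB", "PPPNB", "NNNNB",
]

# Label counts per column.
tally = {col: Counter(row[i] for row in ROWS) for i, col in enumerate(COLUMNS)}
for col in COLUMNS:
    print(col, dict(tally[col]))

# Rows with no first-class entry point in any of the four comparison columns.
uncovered = [n for n, row in enumerate(ROWS, start=1) if row[:4] == "NNNN"]
print(uncovered)  # [9, 14, 21, 23]
```

Because the labels describe API breadth rather than validation tiers, the tally says where entry points exist, not how well tested they are.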

4 · How to reproduce these numbers

# StatsPAI (Python)
# `cat ... | wc -l` prints one total even if find splits the file list into several batches.
find src/statspai -name '*.py' -exec cat {} + | wc -l
find tests        -name '*.py' -exec cat {} + | wc -l
python3 -c "import statspai as sp; print(len(sp.list_functions()))"

# statsmodels + linearmodels (print the install directories, then count)
python3 -c "import statsmodels, os; print(os.path.dirname(statsmodels.__file__))"
python3 -c "import linearmodels, os; print(os.path.dirname(linearmodels.__file__))"
find "$(python3 -c "import statsmodels,os;print(os.path.dirname(statsmodels.__file__))")" -name '*.py' -exec cat {} + | wc -l
find "$(python3 -c "import linearmodels,os;print(os.path.dirname(linearmodels.__file__))")" -name '*.py' -exec cat {} + | wc -l

# Stata (macOS default install path)
find /Applications/Stata/ado/base -name '*.ado'  -exec cat {} + | wc -l
find /Applications/Stata/ado/base -name '*.mata' -exec cat {} + | wc -l

# R (installed library only; base-interpreter source requires R-<version>.tar.gz)
find /Library/Frameworks/R.framework/Resources/library \( -name '*.R' -o -name '*.r' \) -exec cat {} + | wc -l
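On systems without `find` and `wc`, the same counts can be approximated in Python. This is a minimal sketch, assuming that "lines of code" means newline characters as reported by `wc -l`; `count_loc` is an illustrative helper, not a StatsPAI function.

```python
from pathlib import Path

def count_loc(root, suffixes=(".py",)):
    """Return (file_count, newline_count) for files under root with a matching suffix.

    Newlines are counted on the raw bytes, which matches what `wc -l` reports.
    """
    files = 0
    lines = 0
    for path in Path(root).rglob("*"):
        # Suffix matching is case-sensitive, so pass (".R", ".r") for R sources.
        if path.is_file() and path.suffix in suffixes:
            files += 1
            lines += path.read_bytes().count(b"\n")
    return files, lines
```

For example, `count_loc("src/statspai")` returns the file and line counts for the source tree, and `count_loc(r_library_dir, suffixes=(".R", ".r"))` covers an installed R library, where `r_library_dir` is whatever path applies on the machine.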

Ecosystem-wide estimates (SSC / CRAN / Bioconductor totals) are drawn from:

The SSC sample used for extrapolation is the 52-file local install (/Users/brycewang/Library/Application Support/Stata/ado/plus, 33,296 LOC — includes reghdfe, winsor2, and others).


5 · Why we don't lead with "line-count wins"

Three reasons a naked LOC comparison is misleading for StatsPAI positioning:

  1. Vanity metric: 272K vs R's 150M+ tells a reviewer nothing about capability per line. R's CRAN is ~90% out-of-scope (bioinformatics, visualization, text, general ML), so the comparison is apples to oranges.
  2. Moving target: CRAN and statsmodels grow every month. A headline number in the README rots quarterly unless regenerated by CI.
  3. Wrong axis: StatsPAI's differentiator is causal-inference depth in one API (see §3). Most cells where we win are marked N in Stata / R / statsmodels entirely, not "fewer lines".

The coverage matrix in §3 is the honest positioning. LOC in §1 is supporting evidence for "yes, there is real code behind these claims" — not a competitive boast.