# StatsPAI — Ecosystem & Code Statistics

This page tracks StatsPAI's size and coverage against the broader statistical ecosystem, in the spirit of evidence-based positioning. Numbers marked "measured" are reproducible on a standard install; numbers marked "estimated" are extrapolated from public ecosystem statistics with the reasoning shown.

StatsPAI live numbers last measured: 2026-07-16 on Linux x86_64 against the local statspai 1.20.0 source tree. External ecosystem comparison rows are retained from the 2026-05-03 measurement unless noted otherwise.

The four StatsPAI numbers (function count, submodule count, source LOC, test LOC) are reproducible from a single canonical generator: `python scripts/registry_stats.py`. The per-module table in §2 is regenerated by `python scripts/registry_stats.py --table`. CI guards against drift via `python scripts/registry_stats.py --check`.
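
A drift guard of the `--check` kind can be as small as comparing the numbers recorded in the docs with freshly measured ones. The sketch below is not the actual `scripts/registry_stats.py`; `find_drift` and both dictionaries are hypothetical names used only for illustration.

```python
from typing import Dict, List


def find_drift(documented: Dict[str, int], measured: Dict[str, int]) -> List[str]:
    """Return one message per metric whose documented and measured values differ."""
    problems = []
    for name in sorted(set(documented) | set(measured)):
        doc_value = documented.get(name)
        live_value = measured.get(name)
        if doc_value != live_value:
            problems.append(f"{name}: documented={doc_value} measured={live_value}")
    return problems


# Hypothetical inputs: the documented side reuses figures from this page,
# the measured side pretends the source tree gained one function.
documented = {"functions": 1144, "source_loc": 347786}
measured = {"functions": 1145, "source_loc": 347786}

drift = find_drift(documented, measured)
for message in drift:
    print(message)
```

A CI job built on this idea would exit non-zero whenever `drift` is non-empty, which forces the page to be regenerated before a release.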

## 1 · Cross-ecosystem lines of code

Ecosystem / Project	Method	Files	Lines of code	Primary focus
StatsPAI `src/statspai/`	measured	702	347,786	validation-tiered causal inference
StatsPAI tests (`tests/`)	measured	1,052	199,714	—
statsmodels 0.14.x	measured	948	381,981	GLM / time series / general
linearmodels	measured	131	36,607	panel / IV
statsmodels + linearmodels subtotal		1,079	418,588
Stata 18 — official `.ado`	measured	3,767	937,543	command layer above closed kernel
Stata 18 — official `.mata`	measured	411	103,822	Mata numerical layer
Stata official executable code	measured	4,178	1,041,365	(+ 738,543 lines of `.sthlp` help text, not counted as code)
Stata SSC (third-party)	estimated	~3,500 pkgs	2M – 4M	community extensions; local sample (reghdfe + winsor2 + 50 others) = 33,296 LOC
R base interpreter (C + R + Fortran)	estimated	—	~1.5M	language itself
R base library (73 recommended pkgs)	measured	509	62,321	shipped with R on this machine
CRAN (~22,000 packages, 2026)	estimated	—	80M–120M (R-only; >200M incl. C/C++/Fortran)	main R package universe
Bioconductor (~2,300 packages)	estimated	—	30M–50M	bioinformatics
R ecosystem total	estimated	—	≈ 150M+
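
The two subtotal rows in the table can be re-derived from the rows above them. The snippet below is a plain arithmetic check on figures copied from the table; it shows that the Python subtotal is statsmodels plus linearmodels (StatsPAI is not included) and that the Stata executable total is `.ado` plus `.mata`. The helper name `add_rows` is illustrative.

```python
# Rows transcribed from the table above as (files, lines of code).
statsmodels = (948, 381_981)
linearmodels = (131, 36_607)
stata_ado = (3_767, 937_543)
stata_mata = (411, 103_822)


def add_rows(*rows):
    """Sum (files, loc) pairs column by column."""
    files = sum(row[0] for row in rows)
    loc = sum(row[1] for row in rows)
    return files, loc


python_subtotal = add_rows(statsmodels, linearmodels)
stata_total = add_rows(stata_ado, stata_mata)

print(python_subtotal)  # (1079, 418588)
print(stata_total)      # (4178, 1041365)
```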

How to read this table

LOC is a vanity metric in isolation; what matters is coverage density within a target domain. StatsPAI is deliberately scoped to causal inference and applied econometrics. It is not trying to match R's 150M+ lines, because roughly 90% of CRAN is bioinformatics, visualization, text mining, and general-purpose ML, all of which are out of scope. The relevant comparison is the coverage matrix in §3 below.

## 2 · StatsPAI per-module breakdown

Sorted by LOC, descending. This table is generated from the live source tree by `python scripts/registry_stats.py --table`; it intentionally avoids editorial focus labels so the numbers can be refreshed mechanically before a release.
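
The grouping behind a per-module table like the one below can be sketched in a few lines. This is an assumption about the general approach, not the code of `scripts/registry_stats.py`: `module_stats` and `sorted_rows` are hypothetical helpers, and the demo runs on a throwaway directory so it works without a StatsPAI checkout.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def module_stats(package_root: Path) -> dict:
    """Map each top-level subpackage to (line count, file count) over its *.py files."""
    stats = defaultdict(lambda: [0, 0])
    for path in package_root.rglob("*.py"):
        relative = path.relative_to(package_root)
        if len(relative.parts) < 2:
            continue  # skip files that sit directly in the package root
        module = relative.parts[0]
        with path.open("rb") as handle:
            stats[module][0] += sum(1 for _ in handle)  # one count per line
        stats[module][1] += 1
    return {name: (loc, files) for name, (loc, files) in stats.items()}


def sorted_rows(stats: dict) -> list:
    """Rows as (module, loc, files), sorted by LOC descending, ties by name."""
    return sorted(
        ((name, loc, files) for name, (loc, files) in stats.items()),
        key=lambda row: (-row[1], row[0]),
    )


# Demo tree: two fake modules plus a root-level file that is ignored.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "did").mkdir()
    (root / "rd").mkdir()
    (root / "__init__.py").write_text("x = 1\n")
    (root / "did" / "a.py").write_text("a = 1\nb = 2\n")
    (root / "did" / "b.py").write_text("c = 3\n")
    (root / "rd" / "a.py").write_text("d = 4\n")
    rows = sorted_rows(module_stats(root))

print(rows)  # [('did', 3, 2), ('rd', 1, 1)]
```

Pointing `module_stats` at `src/statspai` would give per-module figures in the same shape as the table, though the counting rules of the real generator may differ (for example, how root-level files are attributed).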

Module	LOC	Files	Registered functions (`sp.*`)
`did`	23,531	35	64
`synth`	22,372	31	54
`rd`	16,207	25	53
`regression`	15,904	20	37
`smart`	14,739	20	31
`agent`	12,016	31	3
`output`	11,961	21	40
`decomposition`	8,805	18	31
`core`	8,291	12	4
`fast`	7,683	16	0
`panel`	7,652	12	18
`iv`	7,409	16	8
`inference`	7,359	17	24
`diagnostics`	7,127	12	25
`matching`	7,016	11	25
`spatial`	6,852	30	38
`plots`	6,002	7	8
`bayes`	5,168	12	19
`dml`	4,970	12	14
`frontier`	4,801	8	12
`multilevel`	4,421	8	11
`workflow`	4,414	5	3
`mendelian`	4,351	13	38
`causal_discovery`	4,052	11	20
`metalearners`	3,915	8	23
`dag`	3,500	9	23
`network`	3,474	9	33
`timeseries`	3,390	9	20
`survival`	3,303	6	12
`tmle`	3,232	6	11
`neural_causal`	3,220	6	16
`structural`	3,149	9	12
`forest`	3,024	5	8
`causal_llm`	2,972	10	14
`robustness`	2,664	6	11
`crossval`	2,644	7	2
`bounds`	2,577	5	9
`conformal_causal`	2,499	9	19
`rlasso`	2,452	7	10
`interference`	2,443	10	20
`epi`	2,340	6	20
`bartik`	2,212	4	8
`utils`	2,145	8	32
`question`	2,079	3	6
`proximal`	2,029	8	13
`qte`	2,017	6	12
`postestimation`	1,775	4	12
`bcf`	1,644	5	8
`fixest`	1,580	3	4
`datasets`	1,549	3	3
`causal_text`	1,457	4	4
`target_trial`	1,457	7	9
`mediation`	1,455	4	6
`bunching`	1,440	5	8
`fairness`	1,419	3	6
`power`	1,405	3	12
`policy_learning`	1,373	3	7
`bridge`	1,317	8	2
`principal_strat`	1,157	2	3
`experimental`	1,118	4	9
`longitudinal`	1,044	3	7
`causal_rl`	1,041	5	9
`surrogate`	995	2	3
`ope`	965	3	3
`survey`	963	4	7
`gmm`	938	3	2
`dose_response`	932	3	5
`selection`	904	2	3
`dtr`	894	5	2
`gformula`	882	3	4
`deepiv`	801	2	2
`mht`	768	2	6
`transport`	761	5	8
`msm`	716	2	3
`assimilation`	698	3	3
`causal_impact`	652	2	3
`nonparametric`	630	3	4
`imputation`	502	2	3
`matrix_completion`	435	2	2
`compat`	378	2	0
`multi_treatment`	366	2	2
`quasi`	339	2	2
`censoring`	337	2	2
`geolift`	180	2	1
`checks`	152	2	0
`causal`	111	1	0
`schemas`	0	0	0
Total	352,772	705	1,144

## 3 · Causal-inference coverage matrix (full)

Legend: B = broad API coverage within this comparison table; Y = implemented entry points; P = partial, scattered, or single-algorithm support; N = no first-class entry point. These are API-breadth labels, not validation tiers.

"Stata" = official + major SSC packages. "R" = CRAN. "sm+lm" = statsmodels + linearmodels.

#	Method family	Stata	R	sm+lm	DoubleML	StatsPAI	StatsPAI entry points
1	DiD — staggered (CS / SA / BJS / dCdH / Gardner / Wooldridge ET) + event-study + honest CIs (Rambachan-Roth)	P	Y	N	N	B	`sp.callaway_santanna`, `sp.sun_abraham`, `sp.borusyak_jaravel_spiess`, `sp.dchd`, `sp.gardner_did`, `sp.wooldridge_did`, `sp.honest_did`, `sp.cs_report`
2	IV — classical (2SLS / LIML / GMM) + modern (Kernel IV / Deep IV / KAN-DeepIV)	Y classical only	Y classical only	P classical (lm)	P	B	`sp.ivreg`, `sp.kernel_iv`, `sp.deep_iv`, `sp.kan_deepiv`
3	RD — CCT sharp/fuzzy/kink + 2D / boundary + multi-cutoff + honest CIs + ML-CATE (18+ estimators)	P (`rdrobust` SSC)	Y (`rdrobust`)	N	N	B	`sp.rdrobust`, `sp.rd2d`, `sp.rdhte`, `sp.rd_forest`, `sp.rd_boost`, `sp.rdrandinf`, `sp.rdpower`, `sp.rdsummary`
4	Synthetic Control (ADH / ASCM / gsynth / BSTS / Bayesian / PenSCM / FDID — 20 methods + 6 inference strategies)	P (`synth` SSC)	P (7 pkgs: Synth, gsynth, CausalImpact, MSCMT, …)	N	N	B	`sp.synth(method=...)`, `sp.synth_compare`, `sp.synth_recommend`, `sp.synth_power`, `sp.synth_sensitivity`
5	Matching — PS / CEM / optimal / cardinality / one-to-many	P (`psmatch2` SSC)	Y (`MatchIt`, `optmatch`)	N	N	Y	`sp.match`, `sp.cem`, `sp.optimal_match`, `sp.cardinality_match`
6	Double / Debiased ML	N	Y (`DoubleML`)	N	Y	Y	`sp.dml(model=...)`, `sp.dml_model_averaging`, `sp.kernel_dml`
7	Meta-Learners (S/T/X/R/DR) + Causal Forest / GRF	N	Y (`grf`, `rlearner`)	N	N	Y	`sp.s_learner`, `sp.t_learner`, `sp.x_learner`, `sp.r_learner`, `sp.dr_learner`, `sp.causal_forest`
8	TMLE / HAL-TMLE	N	Y (`tmle`, `hal9001`)	N	N	Y	`sp.tmle`, `sp.hal_tmle`, `sp.ctmle`
9	Neural causal (TARNet / CFRNet / DragonNet / CEVAE)	N	N	N	N	B	`sp.tarnet`, `sp.cfrnet`, `sp.dragonnet`, `sp.cevae`
10	Causal discovery (NOTEARS / PC / LiNGAM / GES + deep variants)	N	P (`pcalg`)	N	N	B	`sp.notears`, `sp.pc_algorithm`, `sp.lingam`, `sp.ges`
11	Proximal CI (fortified / bidirectional / MTP / double-negative-control / surrogate)	N	P (`pci` scattered)	N	N	B	`sp.proximal`, `sp.fortified_pci`, `sp.bidirectional_pci`, `sp.pci_mtp`, `sp.double_negative_control`
12	QTE / distributional TE / CiC / dist-IV / beyond-avg-LATE / HD panel	P (`ivqreg`)	P (`qte`, `Counterfactual`)	N	N	Y	`sp.qte`, `sp.qdid`, `sp.cic`, `sp.distributional_te`, `sp.dist_iv`
13	Mendelian randomization (IVW / Egger / median / mode / MR-PRESSO / MVMR / BMA)	N	Y (`MendelianRandomization`, `TwoSampleMR`)	N	N	Y	`sp.mr_ivw`, `sp.mr_egger`, `sp.mr_median`, `sp.mr_presso`, `sp.mvmr`, `sp.mr_bma`
14	Conformal causal inference (ITE / CATE / density / dose-response / cluster / fair)	N	N	N	N	B	`sp.conformal_ite`, `sp.conformal_cate`, `sp.conformal_dose_response`
15	Bayesian causal (BCF / ordinal BCF / factor-exposure BCF)	N	P (`bcf`)	N	N	Y	`sp.bcf`, `sp.bcf_ordinal`, `sp.bcf_factor_exposure`
16	Spatial econometrics (weights → ESDA → ML/GMM → GWR/MGWR → panel)	N	P (5 pkgs: spdep, spatialreg, sphet, splm, GWmodel)	N	N	B	38 functions including `sp.sem`, `sp.sar`, `sp.gwr`, `sp.mgwr`, `sp.splm`
17	Policy learning / OPE	N	P (`policytree`)	N	N	Y	`sp.policy_tree`, `sp.policy_value`, `sp.doubly_robust_ope`
18	Bunching estimation	P (`bunching` SSC)	N	N	N	Y	`sp.bunching`, `sp.kink_bunching`
19	Interference / spillover (partial / network / cluster-RCT / HTE)	N	P (`interference`)	N	N	B	18 functions including `sp.spillover`, `sp.cluster_rct`, `sp.hte_interference`
20	Matrix completion for panels	N	P (`gsynth`)	N	N	Y	`sp.matrix_completion`
21	Causal MAS (multi-agent LLM causal discovery — arXiv:2509.00987)	N	N	N	N	B	`sp.causal_mas`, `sp.causal_llm.openai_client`, `sp.causal_llm.anthropic_client`
22	Publication tables (Word / Excel / LaTeX / HTML / Markdown)	P (`outreg2`)	P (`modelsummary`)	P	N	B	Supported result objects expose `.to_word()` / `.to_excel()` / `.to_latex()` / `.to_html()`
23	Agent-native tool-calling schemas (`function_schema()`)	N	N	N	N	B	`sp.list_functions()`, `sp.describe_function()`, `sp.function_schema()`, `sp.agent.mcp_server`
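
One way to read the matrix above mechanically is to encode each row's legend letters and filter on them. The snippet below uses a hand-copied subset of six rows, not the full matrix, and keeps only the leading legend letter of each cell; `only_statspai` is an illustrative helper name.

```python
# Subset of the matrix: row number -> labels for
# (Stata, R, sm+lm, DoubleML, StatsPAI), copied from the table above.
rows = {
    6: ("N", "Y", "N", "Y", "Y"),   # Double / Debiased ML
    9: ("N", "N", "N", "N", "B"),   # Neural causal
    14: ("N", "N", "N", "N", "B"),  # Conformal causal inference
    18: ("P", "N", "N", "N", "Y"),  # Bunching estimation
    21: ("N", "N", "N", "N", "B"),  # Causal MAS
    23: ("N", "N", "N", "N", "B"),  # Agent-native tool-calling schemas
}


def only_statspai(labels):
    """True when every comparison column is N and StatsPAI is B or Y."""
    *others, statspai = labels
    return statspai in {"B", "Y"} and all(label == "N" for label in others)


exclusive = sorted(number for number, labels in rows.items() if only_statspai(labels))
print(exclusive)  # [9, 14, 21, 23]
```

Because the labels describe API breadth rather than validation tiers, a tally like this counts entry points that exist, not how thoroughly each one has been validated.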

## 4 · How to reproduce these numbers

```bash
# StatsPAI (Python)
find src/statspai -name '*.py' -exec wc -l {} + | tail -1
find tests        -name '*.py' -exec wc -l {} + | tail -1
python3 -c "import statspai as sp; print(len(sp.list_functions()))"

# statsmodels + linearmodels: locate each installed package, then count its .py lines
SM_DIR="$(python3 -c 'import statsmodels, os; print(os.path.dirname(statsmodels.__file__))')"
LM_DIR="$(python3 -c 'import linearmodels, os; print(os.path.dirname(linearmodels.__file__))')"
echo "$SM_DIR"; echo "$LM_DIR"
# Quoting the variables keeps install paths that contain spaces intact
find "$SM_DIR" -name '*.py' -exec wc -l {} + | tail -1
find "$LM_DIR" -name '*.py' -exec wc -l {} + | tail -1

# Stata (macOS default install path); -exec avoids xargs splitting on spaces
find /Applications/Stata/ado/base -name '*.ado'  -exec wc -l {} + | tail -1
find /Applications/Stata/ado/base -name '*.mata' -exec wc -l {} + | tail -1

# R (installed library only; base-interpreter source requires R-<version>.tar.gz)
find /Library/Frameworks/R.framework/Resources/library \( -name '*.R' -o -name '*.r' \) -exec wc -l {} + | tail -1
```

Ecosystem-wide estimates (SSC / CRAN / Bioconductor totals) are drawn from:

SSC — Boston College Statistical Software Components archive package listing.
CRAN — METACRAN and CRAN by topic task views.
Bioconductor — Bioconductor 3.20 release statistics.

The SSC sample used for extrapolation is the local install of 52 packages (`/Users/brycewang/Library/Application Support/Stata/ado/plus`, 33,296 LOC), which includes reghdfe, winsor2, and 50 others.
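
The page does not spell out the extrapolation formula, so the following is only a sketch under one assumption: that the SSC total is the sample's mean LOC per package scaled by the approximate package count from §1, treating the sample as 52 packages (reghdfe, winsor2, and 50 others). Under that assumption the point estimate lands near the lower end of the 2M to 4M range.

```python
sample_packages = 52      # reghdfe + winsor2 + 50 others, per the table in §1
sample_loc = 33_296       # measured LOC of that local sample
archive_packages = 3_500  # approximate SSC package count from §1

# Naive scaling: mean LOC per sampled package times the archive size.
loc_per_package = sample_loc / sample_packages
point_estimate = loc_per_package * archive_packages

print(round(loc_per_package))          # 640
print(round(point_estimate / 1e6, 2))  # 2.24
```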

## 5 · Why we don't lead with "line-count wins"

Three reasons a naked LOC comparison is misleading for StatsPAI positioning:

Vanity metric: roughly 348K source lines (§1) against R's 150M+ tells a reviewer nothing about capability per line. About 90% of CRAN is out of scope (bioinformatics, visualization, text, general ML), so the comparison is apples to oranges.
Moving target: CRAN and statsmodels grow every month. A headline number in the README rots quarterly unless regenerated by CI.
Wrong axis: StatsPAI's differentiator is causal-inference depth in one API (see §3). Most cells where we win are marked N in Stata / R / statsmodels, meaning the method is absent there, not implemented in fewer lines.

The coverage matrix in §3 is the honest positioning. LOC in §1 is supporting evidence for "yes, there is real code behind these claims" — not a competitive boast.