statspai.datasets¶
datasets ¶
Canonical econometrics datasets with documented expected estimates.
This subpackage provides deterministic, reproducible datasets used
throughout the causal-inference literature, consolidated under a
single import path sp.datasets:
    import statspai as sp
    df = sp.datasets.nsw_lalonde()
    df.attrs['expected_experimental_att']
    1794
Each function returns a pd.DataFrame with:
- A fully-deterministic DGP (fixed seed, BSD/MIT redistributable).
- df.attrs containing:
  - 'paper': original paper citation
  - 'expected_*': theoretically-anchored estimates (from the published paper on the ORIGINAL data, not necessarily the simulated replica)
  - 'notes': what to be careful about when using this replica
The simulated replicas are designed to match the structure and
summary statistics of the original datasets. For numerical R /
Stata parity against the original data, see
tests/external_parity/PUBLISHED_REFERENCE_VALUES.md.
Available datasets
DID / panel
mpdta() — Callaway-Sant'Anna teen employment
teen_employment() — alias of mpdta()
RD
lee_2008_senate() — US Senate RD (Lee 2008)
IV
card_1995() — IV returns-to-schooling
angrist_krueger_1991() — quarter-of-birth IV
Matching / SOO
nsw_lalonde() — LaLonde NSW job training (experimental subset)
nsw_dw() — Dehejia-Wahba NSW + PSID comparison
Synthetic control
california_prop99() — ADH tobacco (re-exported from synth)
basque_terrorism() — Abadie-Gardeazabal (re-exported)
german_reunification() — ADH 2015 (re-exported)
Public health / epidemiology (REAL data)
nhefs() — Hernán-Robins What If NHEFS (g-methods
canon: quit-smoking → weight / mortality)
load_nhefs() — alias of nhefs()
mpdta ¶
Simulated replica of the mpdta dataset from R's did package.
The original mpdta is a county-year panel of log teen-employment
(2003-2007) where some counties raise their minimum wage in 2004,
2006, or 2007 (staggered adoption).
Our replica preserves:
- 500 counties × 5 years = 2500 rows
- Three treatment cohorts: 2004, 2006, 2007 + never-treated
- Negative homogeneous ATT ≈ -0.04 log points (matches the published
R did::att_gt aggregated ATT of roughly -0.045 on the original)
- County-level clustering in residuals
Returns:
| Type | Description |
|---|---|
| pd.DataFrame | Columns: countyreal, year, lemp, first_treat, treat. |
Notes
df.attrs['expected_simple_att'] = -0.040 (published R output
on the original data: -0.0454; our replica's target is -0.04).
Because this is a simulated DGP, numerical values will not match
R did::att_gt to high precision on the original data; but they
match the sign, order of magnitude, and aggregation pattern.
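The group-time logic behind that aggregated ATT can be sketched on a toy panel. The block below is a minimal illustration, not StatsPAI's estimator and not the bundled DGP: it builds a noise-free panel with the same column names, computes ATT(g, t) as the change from year g-1 to t for cohort g minus the same change among never-treated counties, and then takes an unweighted average over post-treatment cells. The helper names (`make_toy_panel`, `att_gt`), the use of 0 to code never-treated counties, the 40-county size, and the fixed-effect values are all assumptions made for the example.

```python
from statistics import mean

YEARS = range(2003, 2008)
TRUE_ATT = -0.04  # homogeneous effect assumed for this toy panel


def make_toy_panel():
    """Build a noise-free county-year panel with staggered adoption."""
    rows = []
    cohorts = [0, 2004, 2006, 2007]  # assumption: 0 marks never-treated
    for county in range(40):
        first_treat = cohorts[county % 4]
        county_effect = 5.0 + 0.01 * county
        for year in YEARS:
            treat = int(first_treat != 0 and year >= first_treat)
            lemp = county_effect + 0.02 * (year - 2003) + TRUE_ATT * treat
            rows.append({"countyreal": county, "year": year, "lemp": lemp,
                         "first_treat": first_treat, "treat": treat})
    return rows


def att_gt(rows, g, t):
    """ATT(g, t): outcome change from g-1 to t for cohort g, minus the
    same change among never-treated counties."""
    lemp = {(r["countyreal"], r["year"]): r["lemp"] for r in rows}
    cohort_of = {r["countyreal"]: r["first_treat"] for r in rows}

    def mean_change(cohort):
        return mean(lemp[c, t] - lemp[c, g - 1]
                    for c, ft in cohort_of.items() if ft == cohort)

    return mean_change(g) - mean_change(0)


panel = make_toy_panel()
estimates = {(g, t): att_gt(panel, g, t)
             for g in (2004, 2006, 2007) for t in YEARS if t >= g}
simple_att = mean(estimates.values())  # unweighted average over cells
print(round(simple_att, 4))
```

Because the toy panel has no noise and a homogeneous effect, every cell recovers the assumed effect; on the replica itself, residual noise and county-level clustering mean the estimate only lands near the target.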
References
Callaway, B. & Sant'Anna, P.H.C. (2021). Difference-in-Differences with Multiple Time Periods. Journal of Econometrics 225(2), 200-230. [@callaway2021difference]
card_1995 ¶
Card (1995) NLS Young Men data — simulated replica or real extract.
Card uses proximity to a 4-year college (nearc4) as an
instrument for years of education in a wage equation. Published
OLS and IV point estimates (Card 1995 Table 2):
- OLS: β_educ ≈ 0.075 (col 2)
- IV (nearc4): β_educ ≈ 0.132 (col 5)
IV exceeds OLS — the "Card puzzle". The LATE interpretation is for compliers on the margin of attending college because of proximity.
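A single binary instrument makes the IV estimate a ratio of covariances, which the sketch below computes by hand on a small balanced design. It is an illustration only: the data are invented, the unobserved factor `u` and its negative effect on wages are arbitrary choices made so that OLS falls below IV, and the true return of 0.132 is borrowed from the published IV figure purely as a convenient target. Only `nearc4` is a column name taken from the text above; `educ` and `lwage` are assumed names.

```python
from statistics import mean


def cov(a, b):
    """Population covariance of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    return mean((x - ma) * (y - mb) for x, y in zip(a, b))


TRUE_RETURN = 0.132  # assumed true effect of schooling in this toy design

nearc4, educ, lwage = [], [], []
for z in (0, 1):            # instrument: lives near a 4-year college
    for u in (-1, 1):       # hypothetical unobserved factor, balanced across z
        for _ in range(25):
            years = 12 + z + 2 * u
            nearc4.append(z)
            educ.append(years)
            lwage.append(5.0 + TRUE_RETURN * years - 0.1 * u)

# OLS slope: biased because u moves both schooling and wages.
beta_ols = cov(educ, lwage) / cov(educ, educ)
# IV (Wald) slope: uses only the variation in schooling induced by nearc4.
beta_iv = cov(nearc4, lwage) / cov(nearc4, educ)

print(round(beta_ols, 3), round(beta_iv, 3))
```

The instrument is uncorrelated with `u` by construction, so the IV ratio returns the assumed effect while OLS is pulled away from it.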
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| seed | int | RNG seed for the simulated DGP (ignored when …) | 42 |
| simulated | bool | If True, return a deterministic simulated replica calibrated so StatsPAI estimators recover OLS ≈ 0.11 and IV ≈ 0.142. If False, load the real NLSYM extract bundled in … | True |

Returns:
| Type | Description |
|---|---|
| DataFrame | Simulated columns: … |
References
Card, D. (1995). Using Geographic Variation in College Proximity to Estimate the Return to Schooling. In Christofides et al. (eds.), Aspects of Labour Market Behaviour. [@card1995using]
nsw_lalonde ¶
LaLonde NSW data — simulated replica or real MatchIt extract.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| seed | int | RNG seed for the simulated replica (ignored when …) | 42 |
| simulated | bool | If True, return a deterministic simulated NSW experimental subset (185 + 260 = 445 rows) calibrated so naive OLS recovers the Dehejia-Wahba experimental ATT of about $1,794. If False, load the real … | True |
Notes
The bundled real data is MatchIt::lalonde (n=614), NOT the
larger DW (1999) NSW + PSID-1 sample (n=2,675). On this smaller
subset, naive OLS gives ATT roughly -$635 (less negative than DW
Table 3's headline -$8,498, which uses the full PSID-1). For
the headline naive-bias demonstration, use the simulated
nsw_dw() panel instead.
Simulated replica calibration
nsw_dw ¶
Dehejia-Wahba NSW + PSID-1 non-experimental comparison.
Combines the 185 NSW treated (from the experiment) with 2,490 non-experimental PSID males as the comparison group — the classic observational-vs-experimental benchmark.
A naive OLS on re78 ~ treat (no covariates) yields strongly negative estimates (~-$8,500) because the PSID controls are much better-off on average. With PSM on rich covariates, the estimate should return to the experimental benchmark of ≈ $1,794.
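The mechanism is easy to see on a tiny invented sample. The sketch below is not the bundled data and not a propensity-score matcher: it uses a single hypothetical covariate, `stratum` (standing in for prior earnings), places every treated unit in the low stratum and most controls in the high one, and compares a naive difference in means with an exact-match ATT. The earnings levels and unit counts are assumptions; only `treat`, `re78`, and the $1,794 benchmark come from the text above.

```python
from statistics import mean

EFFECT = 1794  # experimental benchmark used as the assumed true effect
BASE_EARNINGS = {"low": 5000, "high": 20000}  # hypothetical strata

# All treated units are "low"; most comparison units are "high".
units = [(1, "low")] * 10 + [(0, "low")] * 2 + [(0, "high")] * 8
data = [{"treat": t, "stratum": s, "re78": BASE_EARNINGS[s] + EFFECT * t}
        for t, s in units]

treated = [r for r in data if r["treat"] == 1]
controls = [r for r in data if r["treat"] == 0]

# Naive contrast: ignores that controls are better off on average.
naive = mean(r["re78"] for r in treated) - mean(r["re78"] for r in controls)


def matched_control_mean(stratum):
    """Mean outcome of controls sharing the treated unit's stratum."""
    return mean(r["re78"] for r in controls if r["stratum"] == stratum)


# Exact-match ATT: compare each treated unit with like controls only.
att_matched = mean(r["re78"] - matched_control_mean(r["stratum"])
                   for r in treated)

print(f"{naive:.0f} {att_matched:.0f}")  # prints "-10206 1794"
```

Conditioning on the covariate that drives selection flips the sign of the estimate, which is the pattern the nsw_dw() benchmark is meant to demonstrate with richer covariates.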
Returns:
| Type | Description |
|---|---|
| pd.DataFrame | Columns: treat, age, education, black, hispanic, married, nodegree, re74, re75, re78. Treated units (185) are the NSW experimental cohort; controls (2,490) are PSID. |
References
Dehejia, R. & Wahba, S. (1999). Causal Effects in Nonexperimental Studies. JASA 94(448), 1053-1062. [@dehejia1999causal]
lee_2008_senate ¶
Lee (2008) US Senate RD — simulated replica or real extract.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| seed | int | RNG seed for the simulated DGP (ignored when …) | 42 |
| simulated | bool | If True, return a deterministic simulated panel (n=6558, … | True |
Notes
The real-data branch lets you reproduce Lee (2008) Table 1 /
CCT (2014) Table 4 numbers exactly. StatsPAI's
sp.rdrobust(df, y='y', x='x', c=0, kernel='triangular',
bwselect='cct') recovers Conventional ≈ 7.41 and Robust ≈ 7.51
on this dataset (paper headline ≈ 7.99).
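For intuition about what the conventional estimate measures, the sketch below fits a triangular-kernel weighted line on each side of the cutoff and takes the difference of the two intercepts. It is a simplified stand-in, not sp.rdrobust: the bandwidth is fixed by hand rather than selected from the data, there is no bias correction or robust inference, and the data are an invented noise-free example with an assumed jump of 7.5. All function names are hypothetical.

```python
def triangular_weight(distance, h):
    """Triangular kernel: weight falls linearly to zero at |distance| = h."""
    return max(0.0, 1.0 - abs(distance) / h)


def weighted_intercept(xs, ys, ws):
    """Intercept of the weighted least-squares line y = a + b * x."""
    total = sum(ws)
    xbar = sum(w * x for w, x in zip(ws, xs)) / total
    ybar = sum(w * y for w, y in zip(ws, ys)) / total
    sxx = sum(w * (x - xbar) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - xbar) * (y - ybar) for w, x, y in zip(ws, xs, ys))
    return ybar - (sxy / sxx) * xbar


def sharp_rd(xs, ys, cutoff=0.0, h=10.0):
    """Local-linear sharp RD: right intercept minus left intercept at cutoff."""
    left = {"x": [], "y": [], "w": []}
    right = {"x": [], "y": [], "w": []}
    for x, y in zip(xs, ys):
        d = x - cutoff
        w = triangular_weight(d, h)
        if w > 0:
            side = right if d >= 0 else left
            side["x"].append(d)
            side["y"].append(y)
            side["w"].append(w)
    return (weighted_intercept(right["x"], right["y"], right["w"])
            - weighted_intercept(left["x"], left["y"], left["w"]))


TRUE_JUMP = 7.5  # assumed discontinuity for this toy example
xs = [i / 2 for i in range(-40, 41)]  # running variable from -20 to 20
ys = [50 + 0.4 * x + (TRUE_JUMP if x >= 0 else 0.0) for x in xs]
tau = sharp_rd(xs, ys)
print(round(tau, 6))
```

With exactly linear outcomes on both sides the intercept gap equals the assumed jump; on real or simulated election data the answer depends on the bandwidth, which is why a data-driven selector is used in practice.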
Returns:
| Type | Description |
|---|---|
| DataFrame | Simulated columns: … |
References
Lee, D. (2008). Randomized experiments from non-random selection in U.S. House elections. Journal of Econometrics 142, 675-697. [@lee2008randomized]
Calonico, S., Cattaneo, M.D. & Titiunik, R. (2014). Robust nonparametric confidence intervals for regression-discontinuity designs. Econometrica 82(6), 2295-2326. [@calonico2014robust]
angrist_krueger_1991 ¶
Simulated replica of Angrist-Krueger (1991) quarter-of-birth IV.
Classical weak-instrument case. Quarter of birth predicts years of schooling because compulsory-schooling laws tie entry age to calendar date (children born in Q1 are slightly older at school entry, so they reach the legal dropout age with fewer years of schooling). First-stage F is a few dozen on hundreds of thousands of observations; point estimates are unstable on subsets.
Our replica uses n=5,000 (the original is ~329k). Published IV returns-to-schooling on the original: 0.08-0.11 depending on controls and birth cohort.
Returns:
| Type | Description |
|---|---|
| pd.DataFrame | Columns: lwage, educ, q1, q2, q3, q4, year_of_birth. |
References
Angrist, J. & Krueger, A. (1991). Does Compulsory School Attendance Affect Schooling and Earnings? QJE 106(4), 979-1014. [@angrist1991does]
nhefs ¶
NHEFS — the canonical dataset of Hernán & Robins, Causal Inference: What If (2020), bundled as real, public-domain data for exact replication of the book's g-methods examples.
The National Health and Nutrition Examination Survey I (NHANES I)
Epidemiologic Followup Study (NHEFS) follows US adults from a
1971-1975 baseline to a 1982 re-examination. The book uses it
throughout Part II to estimate the average causal effect of
quitting smoking (qsmk) on 10-year weight change
(wt82_71, kg) and on 10-year mortality (death).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| complete_case | bool | If False, return the full NHEFS extract (n=1629, 67 columns). If True, restrict to subjects with a non-missing 1982 weight (… | False |

Returns:
| Type | Description |
|---|---|
| DataFrame | 67 columns. Key modelling variables used in the book: … |
Notes
This is real data (df.attrs['data_source'] == 'real'), unlike
the simulated econometrics replicas in this module. Because the data
are the genuine book extract, StatsPAI reproduces the book's published
numbers — not merely their neighbourhood:
- Crude (unadjusted) mean weight-change difference, quitters vs non-quitters: 2.54 kg (book §12.2; StatsPAI 2.5406).
- IP-weighted average treatment effect (stabilized weights, Chapter 12 MSM): 3.4 kg, 95% CI (2.4, 4.5) (book Program 12.4; StatsPAI sp.ipw 3.48, gold statsmodels MSM 3.44).
- Parametric g-formula / standardization (Chapter 13): 3.5 kg.
- G-estimation of a structural nested mean model (Chapter 14): psi ≈ 3.4.
Strict numerical reproductions of the full chapter programs live in
tests/external_parity/test_whatif_nhefs.py and the public-health
validation notebooks under examples/public_health/.
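To make the gap between the crude and IP-weighted figures concrete, the sketch below applies stabilized inverse-probability weights to a small invented sample with one binary confounder. It is not the NHEFS data, not sp.ipw, and not the book's program: the confounder `L`, the cell counts, and the outcome formula are assumptions, with a true effect of 3.4 chosen only to echo the scale of the numbers above. Only `qsmk` and `wt82_71` are names from the dataset description.

```python
from statistics import mean

EFFECT = 3.4  # assumed true effect of quitting in this toy sample

# (confounder L, treatment qsmk, number of subjects): invented counts
cells = [(0, 0, 8), (0, 1, 2), (1, 0, 4), (1, 1, 6)]
rows = [{"L": lv, "qsmk": a, "wt82_71": 1.0 + 2.0 * lv + EFFECT * a}
        for lv, a, n in cells for _ in range(n)]


def share_treated(subset):
    """Fraction of subjects in subset with qsmk == 1."""
    return sum(r["qsmk"] for r in subset) / len(subset)


p_treated = share_treated(rows)  # marginal P(qsmk = 1)


def stabilized_weight(r):
    """P(A = a) / P(A = a | L) for the subject's own treatment a."""
    ps = share_treated([s for s in rows if s["L"] == r["L"]])
    if r["qsmk"] == 1:
        return p_treated / ps
    return (1 - p_treated) / (1 - ps)


def weighted_mean(arm):
    """Weighted mean outcome among subjects with qsmk == arm."""
    pairs = [(stabilized_weight(r), r["wt82_71"])
             for r in rows if r["qsmk"] == arm]
    return sum(w * y for w, y in pairs) / sum(w for w, _ in pairs)


crude = (mean(r["wt82_71"] for r in rows if r["qsmk"] == 1)
         - mean(r["wt82_71"] for r in rows if r["qsmk"] == 0))
ipw = weighted_mean(1) - weighted_mean(0)
print(round(crude, 3), round(ipw, 3))
```

Here the confounder raises both the chance of quitting and the outcome, so the crude contrast overstates the effect and weighting removes the excess; in NHEFS the direction of confounding differs, which is why the book's adjusted estimate is larger than the crude one.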
Provenance & licence
NHEFS is a US Federal public-use survey (NCHS / NIH) and is therefore
in the public domain as a US Government work. The specific analytic
extract bundled here (n=1629 × 67) is the one distributed by Hernán &
Robins with the book and re-packaged in the MIT-licensed causaldata
package (Huntington-Klein); it is byte-reproducible from
causaldata.nhefs. Redistribution here is consistent with both the
public-domain status of the underlying survey and StatsPAI's policy of
only bundling freely redistributable datasets.
References
Hernán, M.A. & Robins, J.M. (2020). Causal Inference: What If. Boca Raton: Chapman & Hall/CRC. [@hernan2020causal]
basque_terrorism ¶
Basque Country terrorism dataset (simulated).
Returns a balanced panel of GDP per capita (thousands of 1986 USD) for 17 Spanish regions, 1955-1997. The Basque Country is the treated unit; treatment begins in 1970 (onset of ETA terrorism).
The simulated data reproduce the gradual widening of an approximately 10% GDP gap between the Basque Country and its synthetic counterfactual after 1970.
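The idea of a synthetic counterfactual and a post-treatment gap can be sketched with two invented donor regions. This is a deliberately reduced illustration, not the synth implementation: real synthetic control uses many donors and a constrained optimiser over covariates and pre-period outcomes, whereas the block below grid-searches a single weight to match pre-1970 outcomes. The donor trajectories, the true weight of 0.3, and the linear widening of the gap are assumptions.

```python
YEARS = list(range(1955, 1998))
TREATMENT_YEAR = 1970
TRUE_WEIGHT = 0.3  # assumed weight on donor A in the toy counterfactual


def donor_a(year):
    """Hypothetical donor region A: GDP per capita path."""
    return 2.0 + 0.10 * (year - 1955)


def donor_b(year):
    """Hypothetical donor region B: GDP per capita path."""
    return 3.0 + 0.05 * (year - 1955)


def synthetic(year, w):
    """Convex combination of the two donors."""
    return w * donor_a(year) + (1 - w) * donor_b(year)


def treated(year):
    """Treated unit: tracks the counterfactual, then falls behind."""
    years_exposed = max(0, year - TREATMENT_YEAR)
    gap_share = min(0.10, 0.01 * years_exposed)  # widens gradually to 10%
    return synthetic(year, TRUE_WEIGHT) * (1 - gap_share)


def pre_period_sse(w):
    """Squared pre-treatment fit error for a candidate donor weight."""
    return sum((treated(y) - synthetic(y, w)) ** 2
               for y in YEARS if y < TREATMENT_YEAR)


# Pick the weight in [0, 1] that best reproduces the pre-1970 path.
best_w = min((i / 100 for i in range(101)), key=pre_period_sse)
gap_1997 = treated(1997) / synthetic(1997, best_w) - 1
print(best_w, round(gap_1997, 3))
```

The weight is chosen using pre-treatment years only, so any divergence afterwards is attributed to the treatment rather than to the fitting step.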
References
Abadie, A. & Gardeazabal, J. (2003). The Economic Costs of Conflict: A Case Study of the Basque Country. American Economic Review 93(1), 113-132. [@abadie2003economic]
Returns:
| Type | Description |
|---|---|
| DataFrame | Columns: … |
german_reunification ¶
German reunification dataset (simulated).
Returns a balanced panel of GDP per capita for 17 OECD countries, 1960-2003. West Germany is the treated unit; treatment begins in 1990 (reunification).
The simulated trajectories reproduce the key stylised facts: Luxembourg has the highest GDP per capita (~40 000), Portugal the lowest (~10 000), and all countries share a common upward growth trend. Post-1990, West Germany exhibits an approximately 1 500 GDP-per-capita decline relative to its synthetic counterfactual.
References
Abadie, A., Diamond, A. & Hainmueller, J. (2015). Comparative Politics and the Synthetic Control Method. American Journal of Political Science 59(2), 495-510. [@abadie2015comparative]
Returns:
| Type | Description |
|---|---|
| DataFrame | Columns: … |
california_prop99 ¶
California Proposition 99 panel (Abadie-Diamond-Hainmueller 2010).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| simulated | bool | If True, return the simulated covariate-rich replica from … | True |

Returns:
| Type | Description |
|---|---|
| DataFrame | Columns (both branches): … |
References
Abadie, A., Diamond, A. & Hainmueller, J. (2010). Synthetic Control Methods for Comparative Case Studies. Journal of the American Statistical Association 105(490), 493-505. [@abadie2010synthetic]
list_datasets ¶
Return a DataFrame describing all available datasets.
Columns: name, design, n_obs, paper, paper_original, expected_main.
- paper_original is the headline number from the published paper on the ORIGINAL data (what readers expect to see).
- expected_main is what the canonical estimator recovers on this simulated replica (what users will actually observe).

The two differ because the bundled replicas are deterministic DGPs calibrated to the neighbourhood of the published values, not the original data.
For the strict numerical neighbourhood proofs see
tests/external_parity/test_published_replications.py and
tests/external_parity/PUBLISHED_REFERENCE_VALUES.md.
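One way to read "neighbourhood" is as a sign check plus a relative tolerance. The helper below is a hypothetical sketch of such a check, not the logic used in the parity tests: the function name `in_neighbourhood` and the 15% tolerance are assumptions, and the dictionary stands in for one row of the catalogue using the mpdta values quoted earlier on this page.

```python
import math


def in_neighbourhood(estimate, expected, rel_tol=0.15):
    """True when estimate has the same sign as expected and lies within
    rel_tol (relative to the larger magnitude) of it."""
    same_sign = math.copysign(1, estimate) == math.copysign(1, expected)
    return same_sign and math.isclose(estimate, expected, rel_tol=rel_tol)


# Stand-in for one catalogue row (values quoted in the mpdta notes).
catalog_row = {"name": "mpdta", "paper_original": -0.0454,
               "expected_main": -0.040}

# An estimate obtained on the replica is compared with expected_main,
# while paper_original documents what the published analysis reported.
my_estimate = -0.041  # hypothetical user result
print(in_neighbourhood(my_estimate, catalog_row["expected_main"]))
print(in_neighbourhood(0.041, catalog_row["expected_main"]))
```

Comparing against expected_main rather than paper_original avoids flagging a correct estimator as wrong merely because the replica's calibration target differs slightly from the published value.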