statspai.datasets

datasets

Canonical econometrics datasets with documented expected estimates.

This subpackage provides deterministic, reproducible datasets used throughout the causal-inference literature, consolidated under a single import path sp.datasets:

>>> import statspai as sp
>>> df = sp.datasets.nsw_lalonde()
>>> df.attrs['expected_experimental_att']
1794

Each function returns a pd.DataFrame with:

  • A fully deterministic DGP (fixed seed, BSD/MIT redistributable).
  • df.attrs containing:
      • 'paper': original paper citation
      • 'expected_*': theoretically anchored estimates (from the published paper on the ORIGINAL data, not necessarily the simulated replica)
      • 'notes': what to be careful about when using this replica

The simulated replicas are designed to match the structure and summary statistics of the original datasets. For numerical R / Stata parity against the original data, see tests/external_parity/PUBLISHED_REFERENCE_VALUES.md.
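The expected_* convention lends itself to a simple tolerance check. The sketch below uses a plain dict as a stand-in for df.attrs so that it runs without StatsPAI installed. The helper names, the placeholder strings, and the 10% tolerance are illustrative assumptions, not part of the package.

```python
import math

# Stand-in for df.attrs. Only the value 1794 is taken from the text above;
# the two strings are placeholders, not the real metadata entries.
attrs = {
    "paper": "<citation string>",
    "expected_experimental_att": 1794,
    "notes": "<usage caveats>",
}


def expected_values(metadata):
    """Return only the 'expected_*' entries of a metadata dict."""
    return {k: v for k, v in metadata.items() if k.startswith("expected_")}


def near_expected(estimate, expected, rel_tol=0.10):
    """True when the estimate lies within rel_tol (relative) of the expected value."""
    return math.isclose(estimate, expected, rel_tol=rel_tol)


targets = expected_values(attrs)
print(targets)
print(near_expected(1750.0, targets["expected_experimental_att"]))
print(near_expected(-635.0, targets["expected_experimental_att"]))
```

A relative tolerance suits the replicas because they are calibrated to the neighbourhood of the published values rather than to the exact figures.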

Available datasets

DID / panel

  • mpdta(): Callaway-Sant'Anna teen employment
  • teen_employment(): alias of mpdta()

RD

  • lee_2008_senate(): US Senate RD (Lee 2008)

IV

  • card_1995(): IV returns-to-schooling
  • angrist_krueger_1991(): quarter-of-birth IV

Matching / SOO

  • nsw_lalonde(): LaLonde NSW job training (experimental subset)
  • nsw_dw(): Dehejia-Wahba NSW + PSID comparison

Synthetic control

  • california_prop99(): ADH tobacco (re-exported from synth)
  • basque_terrorism(): Abadie-Gardeazabal (re-exported)
  • german_reunification(): ADH 2015 (re-exported)

Public health / epidemiology (REAL data)

  • nhefs(): Hernán-Robins What If NHEFS (g-methods canon: quit-smoking → weight / mortality)
  • load_nhefs(): alias of nhefs()

mpdta

mpdta(seed: int = 42) -> DataFrame

Simulated replica of the mpdta dataset from R's did package.

The original mpdta is a county-year panel of log teen-employment (2003-2007) where some counties raise their minimum wage in 2004, 2006, or 2007 (staggered adoption).

Our replica preserves:

  • 500 counties × 5 years = 2500 rows
  • Three treatment cohorts (2004, 2006, 2007) plus never-treated counties
  • Negative homogeneous ATT ≈ -0.04 log points (matches the published R did::att_gt aggregated ATT of roughly -0.045 on the original)
  • County-level clustering in residuals

Returns:

  • pd.DataFrame with columns: countyreal, year, lemp, first_treat, treat
      • lemp: log teen employment (outcome)
      • first_treat: period of first treatment (0 if never treated)
      • treat: binary on/off indicator (post × treated cohort)

Notes

df.attrs['expected_simple_att'] = -0.040 (published R output on the original data: -0.0454; our replica's target is -0.04).

Because this is a simulated DGP, estimates on the replica will not match the output of R's did::att_gt on the original data to high precision, but they do match its sign, order of magnitude, and aggregation pattern.
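To make the staggered-adoption structure concrete, here is a minimal standard-library sketch of a panel with the same shape (500 counties × 5 years, cohorts 2004 / 2006 / 2007 plus never-treated, homogeneous ATT of -0.04). It is not the generator used by mpdta(): the noise levels, the time trend, and the equal cohort sizes are assumptions chosen for illustration. The estimate is an unweighted average of 2x2 difference-in-differences cells against never-treated counties, which is the building block of the Callaway-Sant'Anna approach rather than a reimplementation of did::att_gt.

```python
import random
from statistics import mean

rng = random.Random(42)
YEARS = range(2003, 2008)
COHORTS = [0, 2004, 2006, 2007]   # 0 = never treated
TRUE_ATT = -0.04                  # homogeneous effect in log points

rows = []
for county in range(500):
    first_treat = COHORTS[county % 4]       # 125 counties per cohort (assumption)
    county_effect = rng.gauss(0.0, 0.5)     # time-invariant county component
    for year in YEARS:
        treat = int(first_treat != 0 and year >= first_treat)
        lemp = (5.0 + county_effect + 0.01 * (year - 2003)
                + TRUE_ATT * treat + rng.gauss(0.0, 0.02))
        rows.append({"countyreal": county, "year": year, "lemp": lemp,
                     "first_treat": first_treat, "treat": treat})

outcome = {(r["countyreal"], r["year"]): r["lemp"] for r in rows}
members = {g: [c for c in range(500) if COHORTS[c % 4] == g] for g in COHORTS}


def mean_change(cohort, year, base_year):
    """Average change in lemp between base_year and year within one cohort."""
    return mean(outcome[c, year] - outcome[c, base_year] for c in members[cohort])


def att_gt(cohort, year):
    """2x2 DiD for one (cohort, year) cell, never-treated counties as controls."""
    base_year = cohort - 1                  # last pre-treatment period
    return mean_change(cohort, year, base_year) - mean_change(0, year, base_year)


post_cells = [(g, t) for g in COHORTS if g != 0 for t in YEARS if t >= g]
simple_att = mean(att_gt(g, t) for g, t in post_cells)
print(len(rows), round(simple_att, 3))
```

Differencing against the last pre-treatment year removes the county component, so the average over post-treatment cells should land close to the -0.04 target.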

References

Callaway, B. & Sant'Anna, P.H.C. (2021). Difference-in-Differences with Multiple Time Periods. Journal of Econometrics 225(2), 200-230. [@callaway2021difference]

card_1995

card_1995(seed: int = 42, simulated: bool = True) -> DataFrame

Card (1995) NLS Young Men data — simulated replica or real extract.

Card uses proximity to a 4-year college (nearc4) as an instrument for years of education in a wage equation. Published OLS and IV point estimates (Card 1995 Table 2):

  • OLS: β_educ ≈ 0.075 (col 2)
  • IV (nearc4): β_educ ≈ 0.132 (col 5)

IV exceeds OLS — the "Card puzzle". The LATE interpretation is for compliers on the margin of attending college because of proximity.
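The mechanics of a single-instrument IV estimate can be sketched without any dependencies. The toy DGP below is an assumption made for illustration: it uses classical measurement error in reported schooling only because that is one textbook mechanism that makes IV exceed OLS, and it makes no claim about which mechanism drives Card's result. The coefficients and the sample size of 20,000 (larger than the real n=3010, to keep the toy stable) are likewise invented.

```python
import random
from statistics import mean


def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


rng = random.Random(0)
N = 20_000
TRUE_RETURN = 0.10            # assumed true return to a year of schooling

nearc4, educ, lwage = [], [], []
for _ in range(N):
    z = 1 if rng.random() < 0.5 else 0                     # lives near a college
    true_school = 12.0 + 1.0 * z + rng.gauss(0.0, 1.5)     # first stage
    reported = true_school + rng.gauss(0.0, 1.5)           # noisy measurement
    wage = 5.0 + TRUE_RETURN * true_school + rng.gauss(0.0, 0.3)
    nearc4.append(z)
    educ.append(reported)
    lwage.append(wage)

# OLS slope of lwage on reported schooling (attenuated by measurement error).
ols = cov(educ, lwage) / cov(educ, educ)
# IV (Wald) slope: reduced form divided by first stage.
iv = cov(nearc4, lwage) / cov(nearc4, educ)
print(round(ols, 3), round(iv, 3))
```

Because the instrument is unrelated to the measurement error, the ratio of the reduced form to the first stage stays close to the assumed return while the OLS slope is pulled toward zero.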

Parameters:

  • seed (int, default 42): RNG seed for the simulated DGP (ignored when simulated=False).
  • simulated (bool, default True): If True, return a deterministic simulated replica calibrated so StatsPAI estimators recover OLS ≈ 0.11 and IV ≈ 0.142. If False, load the real NLSYM extract bundled in statspai/datasets/data/card_1995.csv (n=3010, identical to R's wooldridge::card complete-cases subset on Card's modelling variables). StatsPAI on this real data recovers OLS ≈ 0.0740 (paper 0.075) and IV ≈ 0.1323 (paper 0.132).

Returns:

  • DataFrame. Simulated columns: lwage, educ, exper, expersq, black, south, smsa, nearc4 (n=3010). Real columns: the same plus nearc2 (proximity to a 2-year college).

References

Card, D. (1995). Using Geographic Variation in College Proximity to Estimate the Return to Schooling. In Christofides et al. (eds.), Aspects of Labour Market Behaviour. [@card1995using]

nsw_lalonde

nsw_lalonde(seed: int = 42, simulated: bool = True) -> DataFrame

LaLonde NSW data — simulated replica or real MatchIt extract.

Parameters:

  • seed (int, default 42): RNG seed for the simulated replica (ignored when simulated=False).
  • simulated (bool, default True): If True, return a deterministic simulated NSW experimental subset (185 + 260 = 445 rows) calibrated so naive OLS recovers the Dehejia-Wahba experimental ATT of about $1,794. If False, load the real MatchIt::lalonde extract bundled in statspai/datasets/data/lalonde_matchit.csv: the DW NSW treated cohort (185) plus a 429-unit PSID-1 subset for observational comparisons (n=614 total, with the race factor already split into black and hispanic indicators).

Notes

The bundled real data is MatchIt::lalonde (n=614), NOT the larger DW (1999) NSW + PSID-1 sample (n=2,675). On this smaller subset, naive OLS gives an ATT of roughly -$635, far less negative than DW Table 3's headline -$8,498, which uses the full PSID-1 sample. For the headline naive-bias demonstration, use the simulated nsw_dw() dataset instead.

Simulated replica calibration

nsw_dw

nsw_dw(seed: int = 42) -> DataFrame

Dehejia-Wahba NSW + PSID-1 non-experimental comparison.

Combines the 185 NSW treated (from the experiment) with 2,490 non-experimental PSID males as the comparison group — the classic observational-vs-experimental benchmark.

A naive OLS of re78 on treat (no covariates) yields a strongly negative estimate (about -$8,500) because the PSID controls are much better off on average. With propensity-score matching (PSM) on rich covariates, the estimate should move back toward the experimental benchmark of ≈ $1,794.
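The selection problem can be illustrated with a dependency-free toy. The sketch below is not the nsw_dw() generator and does not use propensity-score matching: it swaps in exact subclassification on a single hypothetical binary covariate (disadvantaged) as the simplest possible adjustment. The group sizes 185 and 2,490 and the $1,794 benchmark come from the text above; the earnings levels, the 10% share of disadvantaged controls, and the deliberately small noise are assumptions, picked so that the naive gap lands near the figure quoted above.

```python
import random
from statistics import mean

rng = random.Random(1)
TRUE_ATT = 1794.0

# Records are (treat, disadvantaged, re78); all earnings levels are toy values.
units = []
for _ in range(185):                         # NSW-style treated: all disadvantaged
    units.append((1, 1, 4500.0 + TRUE_ATT + rng.gauss(0.0, 1000.0)))
for i in range(2490):                        # PSID-style comparison group
    disadvantaged = 1 if i < 249 else 0      # assumed 10% overlap with the treated
    base = 4500.0 if disadvantaged else 16000.0
    units.append((0, disadvantaged, base + rng.gauss(0.0, 1000.0)))


def mean_re78(treat, stratum=None):
    """Mean outcome for one arm, optionally restricted to one stratum."""
    return mean(y for t, d, y in units
                if t == treat and (stratum is None or d == stratum))


# Naive contrast: compares treated units with ALL comparison units.
naive = mean_re78(1) - mean_re78(0)
# Adjusted contrast: compares like with like inside the overlapping stratum.
adjusted = mean_re78(1, stratum=1) - mean_re78(0, stratum=1)
print(round(naive), round(adjusted))
```

The naive contrast is dominated by the better-off comparison units, while restricting to comparable units recovers an estimate near the benchmark.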

Returns:

  • pd.DataFrame with columns: treat, age, education, black, hispanic, married, nodegree, re74, re75, re78. Treated units (185) are the NSW experimental cohort; controls (2,490) are PSID.

References

Dehejia, R. & Wahba, S. (1999). Causal Effects in Nonexperimental Studies. JASA 94(448), 1053-1062. [@dehejia1999causal]

lee_2008_senate

lee_2008_senate(seed: int = 42, simulated: bool = True) -> DataFrame

Lee (2008) US Senate RD — simulated replica or real extract.

Parameters:

  • seed (int, default 42): RNG seed for the simulated DGP (ignored when simulated=False).
  • simulated (bool, default True): If True, return a deterministic simulated panel (n=6558; columns voteshare_next, margin, win) on a 0-1 vote-share scale, calibrated to a 0.08 jump at the cutoff. If False, load the real rdrobust::rdrobust_RDsenate extract (n=1390; columns x and y, where y is the vote share in percentage points, 0-100, and x is the lagged Democratic margin).

Notes

The real-data branch lets you reproduce Lee (2008) Table 1 / CCT (2014) Table 4 numbers exactly. StatsPAI's sp.rdrobust(df, y='y', x='x', c=0, kernel='triangular', bwselect='cct') recovers Conventional ≈ 7.41 and Robust ≈ 7.51 on this dataset (paper headline ≈ 7.99).

Returns:

  • DataFrame. Simulated columns: voteshare_next, margin, win (0-1 scale). Real columns: x, y (running variable; vote share 0-100).
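For intuition about what a sharp RD estimate computes, the sketch below fits a triangular-kernel weighted line on each side of the cutoff and takes the difference of the two fitted values at margin = 0. It uses only the standard library and a toy DGP with the simulated branch's column names and 0.08 jump. The intercept, slope, noise level, and the hand-picked bandwidth of 0.15 are assumptions, and no data-driven bandwidth selection or robust bias correction of the kind rdrobust performs is attempted.

```python
import random

rng = random.Random(7)
N = 6558
TRUE_JUMP = 0.08
BANDWIDTH = 0.15          # fixed by hand (assumption), not a CCT bandwidth

data = []
for _ in range(N):
    margin = rng.uniform(-0.5, 0.5)
    win = int(margin >= 0)
    voteshare_next = (0.45 + 0.30 * margin + TRUE_JUMP * win
                      + rng.gauss(0.0, 0.05))
    data.append((margin, win, voteshare_next))


def boundary_fit(points, h):
    """Triangular-kernel weighted linear fit; returns the fitted value at 0."""
    sw = swx = swy = swxx = swxy = 0.0
    for x, y in points:
        w = max(0.0, 1.0 - abs(x) / h)       # zero weight outside the bandwidth
        sw += w
        swx += w * x
        swy += w * y
        swxx += w * x * x
        swxy += w * x * y
    slope = (sw * swxy - swx * swy) / (sw * swxx - swx ** 2)
    return (swy - slope * swx) / sw          # intercept = fitted value at cutoff


right = boundary_fit([(m, y) for m, w, y in data if w == 1], BANDWIDTH)
left = boundary_fit([(m, y) for m, w, y in data if w == 0], BANDWIDTH)
rd_jump = right - left
print(round(rd_jump, 3))
```

Note the scale difference between the branches: a jump of 0.08 on the simulated 0-1 scale corresponds to 8 on the real data's 0-100 scale.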

References

Lee, D. (2008). Randomized experiments from non-random selection in U.S. House elections. Journal of Econometrics 142, 675-697. [@lee2008randomized]

Calonico, S., Cattaneo, M.D. & Titiunik, R. (2014). Robust nonparametric confidence intervals for regression-discontinuity designs. Econometrica 82(6), 2295-2326. [@calonico2014robust]

angrist_krueger_1991

angrist_krueger_1991(seed: int = 42) -> DataFrame

Simulated replica of Angrist-Krueger (1991) quarter-of-birth IV.

Classical weak-instrument case. Quarter of birth predicts years of schooling because compulsory-schooling laws tie entry age to calendar date: children born in Q1 are slightly older at school entry, so they reach the legal dropout age with fewer years of schooling. First-stage F is a few dozen on several million observations; point estimates are unstable on subsets.

Our replica uses n=5,000 (the original is ~329k). Published IV returns-to-schooling on the original: 0.08-0.11 depending on controls and birth cohort.
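A back-of-the-envelope calculation shows why sample size matters so much here. For a single binary instrument, the expected first-stage F is approximately 1 + n·p·(1−p)·gap²/sd², where p is the share with the instrument switched on, gap is the first-stage difference in schooling, and sd is the residual standard deviation of schooling. The gap of 0.10 years and the sd of 3 years below are illustrative assumptions, not values taken from the paper or from the replica.

```python
def expected_first_stage_f(n, share_z, first_stage_gap, resid_sd):
    """Approximate E[F] = 1 + n * p * (1 - p) * gap**2 / sd**2 for one binary instrument."""
    concentration = (n * share_z * (1.0 - share_z)
                     * first_stage_gap ** 2 / resid_sd ** 2)
    return 1.0 + concentration


GAP = 0.10    # assumed Q1 vs other-quarter schooling gap, in years
SD = 3.0      # assumed residual sd of schooling, in years

f_replica = expected_first_stage_f(5_000, 0.25, GAP, SD)      # replica-sized sample
f_original = expected_first_stage_f(329_000, 0.25, GAP, SD)   # original-sized sample
print(round(f_replica, 2), round(f_original, 2))
```

Under these assumptions the same instrument that yields an F in the tens at the original sample size is close to uninformative at n=5,000, which is why point estimates on the replica can swing widely.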

Returns:

  • pd.DataFrame with columns: lwage, educ, q1, q2, q3, q4, year_of_birth.

References

Angrist, J. & Krueger, A. (1991). Does Compulsory School Attendance Affect Schooling and Earnings? QJE 106(4), 979-1014. [@angrist1991does]

nhefs

nhefs(complete_case: bool = False) -> DataFrame

NHEFS — the canonical dataset of Hernán & Robins, Causal Inference: What If (2020), bundled as real, public-domain data for exact replication of the book's g-methods examples.

The National Health and Nutrition Examination Survey I (NHANES I) Epidemiologic Followup Study (NHEFS) follows US adults from a 1971-1975 baseline to a 1982 re-examination. The book uses it throughout Part II to estimate the average causal effect of quitting smoking (qsmk) on 10-year weight change (wt82_71, kg) and on 10-year mortality (death).

Parameters:

  • complete_case (bool, default False): If False, return the full NHEFS extract (n=1629, 67 columns). If True, restrict to subjects with a non-missing 1982 weight (wt82_71 not null, n=1566), which is the analytic sample used for the weight-change examples in Chapters 12-15 of the book.

Returns:

  • DataFrame with 67 columns. Key modelling variables used in the book:
      • qsmk: quit smoking 1971-1982 (1 = yes; the "treatment")
      • wt82_71: weight change 1971→1982 in kg (continuous outcome)
      • death: died by 1992 (1 = yes; the survival outcome)
      • yrdth, modth: year / month of death (for survival timing)
      • sex, race, age: demographics (sex: 0 male / 1 female; race 0/1)
      • education: 5-level education (1-5)
      • smokeintensity: cigarettes/day at baseline
      • smokeyrs: years smoked at baseline
      • exercise: 3-level exercise (0 much / 1 moderate / 2 little)
      • active: 3-level daily activity (0 / 1 / 2)
      • wt71: baseline weight (kg)

df.attrs records the book citation and the published reference estimates (see Notes).

Notes

This is real data (df.attrs['data_source'] == 'real'), unlike the simulated econometrics replicas in this module. Because the data are the genuine book extract, StatsPAI reproduces the book's published numbers — not merely their neighbourhood:

  • Crude (unadjusted) mean weight-change difference, quitters vs non-quitters: 2.54 kg (book §12.2; StatsPAI 2.5406).
  • IP-weighted average treatment effect (stabilized weights, Chapter 12 MSM): 3.4 kg, 95% CI (2.4, 4.5) (book Program 12.4; StatsPAI sp.ipw 3.48, gold statsmodels MSM 3.44).
  • Parametric g-formula / standardization (Chapter 13): 3.5 kg.
  • G-estimation of a structural nested mean model (Chapter 14): psi ≈ 3.4 kg.

Strict numerical reproductions of the full chapter programs live in tests/external_parity/test_whatif_nhefs.py and the public-health validation notebooks under examples/public_health/.
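For readers new to stabilized IP weights, the idea can be sketched on a toy problem with one binary confounder. The code below does not touch the NHEFS data and does not reproduce any of the book's numbers: the DGP, the true effect of 3.0, and the treatment probabilities are invented for illustration, and the propensities are estimated by cell frequencies rather than by the richer covariate model a real analysis would use.

```python
import random
from statistics import mean

rng = random.Random(3)
TRUE_EFFECT = 3.0
P_TREAT = {0: 0.2, 1: 0.6}        # assumed P(treated | confounder level)

# Records are (confounder, treated, outcome).
data = []
for _ in range(5000):
    conf = 1 if rng.random() < 0.5 else 0
    a = 1 if rng.random() < P_TREAT[conf] else 0
    y = 1.0 + TRUE_EFFECT * a + 2.0 * conf + rng.gauss(0.0, 2.0)
    data.append((conf, a, y))

# Marginal and conditional treatment probabilities, estimated from the sample.
p_a = mean(a for _, a, _ in data)
p_a_given = {level: mean(a for c, a, _ in data if c == level) for level in (0, 1)}


def stabilized_weight(conf, a):
    """SW = P(A = a) / P(A = a | confounder)."""
    numer = p_a if a == 1 else 1.0 - p_a
    denom = p_a_given[conf] if a == 1 else 1.0 - p_a_given[conf]
    return numer / denom


def weighted_mean(arm):
    """Stabilized-weight mean outcome in one treatment arm."""
    pairs = [(stabilized_weight(c, a), y) for c, a, y in data if a == arm]
    return sum(w * y for w, y in pairs) / sum(w for w, _ in pairs)


crude = (mean(y for _, a, y in data if a == 1)
         - mean(y for _, a, y in data if a == 0))
ipw = weighted_mean(1) - weighted_mean(0)
print(round(crude, 2), round(ipw, 2))
```

Weighting rebalances the confounder across arms, so the weighted contrast moves from the confounded crude difference toward the assumed effect.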

Provenance & licence

NHEFS is a US Federal public-use survey (NCHS / NIH) and is therefore in the public domain as a US Government work. The specific analytic extract bundled here (n=1629 × 67) is the one distributed by Hernán & Robins with the book and re-packaged in the MIT-licensed causaldata package (Huntington-Klein); it is byte-reproducible from causaldata.nhefs. Redistribution here is consistent with both the public-domain status of the underlying survey and StatsPAI's policy of only bundling freely redistributable datasets.

References

Hernán, M.A. & Robins, J.M. (2020). Causal Inference: What If. Boca Raton: Chapman & Hall/CRC. [@hernan2020causal]

basque_terrorism

basque_terrorism() -> DataFrame

Basque Country terrorism dataset (simulated).

Returns a balanced panel of GDP per capita (thousands of 1986 USD) for 17 Spanish regions, 1955-1997. The Basque Country is the treated unit; treatment begins in 1970 (onset of ETA terrorism).

The simulated data reproduce the gradual widening of an approximately 10% GDP gap between the Basque Country and its synthetic counterfactual after 1970.

References

Abadie, A. & Gardeazabal, J. (2003). "The Economic Costs of Conflict: A Case Study of the Basque Country." American Economic Review, 93(1), 113-132. [@abadie2003economic]

Returns:

  • DataFrame with columns: region, year, gdppc, treated.

Examples:

>>> import statspai as sp
>>> df = sp.synth.basque_terrorism()
>>> result = sp.synth.synth(df, y='gdppc', unit='region',
...                         time='year', treat_unit='Basque Country',
...                         treat_time=1970)

german_reunification

german_reunification() -> DataFrame

German reunification dataset (simulated).

Returns a balanced panel of GDP per capita for 17 OECD countries, 1960-2003. West Germany is the treated unit; treatment begins in 1990 (reunification).

The simulated trajectories reproduce the key stylised facts: Luxembourg has the highest GDP per capita (~40,000), Portugal the lowest (~10,000), and all countries share a common upward growth trend. Post-1990, West Germany exhibits a decline of approximately 1,500 in GDP per capita relative to its synthetic counterfactual.

References

Abadie, A., Diamond, A. & Hainmueller, J. (2015). "Comparative Politics and the Synthetic Control Method." American Journal of Political Science, 59(2), 495-510. [@abadie2015comparative]

Returns:

  • DataFrame with columns: country, year, gdppc, treated.

Examples:

>>> import statspai as sp
>>> df = sp.synth.german_reunification()
>>> result = sp.synth.synth(df, y='gdppc', unit='country',
...                         time='year', treat_unit='West Germany',
...                         treat_time=1990)

california_prop99

california_prop99(simulated: bool = True) -> DataFrame

California Proposition 99 panel (Abadie-Diamond-Hainmueller 2010).

Parameters:

  • simulated (bool, default True): If True, return the simulated covariate-rich replica from synth.california_tobacco (39 states × 31 years, 1970-2000, ADH-shaped DGP); this is the default for backward compatibility. If False, load the real ADH (2010) panel bundled in statspai/datasets/data/california_prop99.csv (39 states × 31 years, with covariates cigsale, retprice, lnincome, age15to24, beer; identical to tidysynth's smoking dataset). Use this for exact paper replication.

Returns:

  • DataFrame. Columns (both branches): state, year, cigsale, retprice, lnincome, age15to24, beer. The simulated branch additionally provides treated; on the real branch we derive it as (state == 'California') & (year >= 1989).

References

Abadie, A., Diamond, A. & Hainmueller, J. (2010). Synthetic Control Methods for Comparative Case Studies. Journal of the American Statistical Association 105(490), 493-505. [@abadie2010synthetic]

list_datasets

list_datasets() -> DataFrame

Return a DataFrame describing all available datasets.

Columns: name, design, n_obs, paper, paper_original, expected_main.

  • paper_original is the headline number from the published paper on the ORIGINAL data (what readers expect to see).
  • expected_main is what the canonical estimator recovers on this simulated replica (what users will actually observe). The two differ because the bundled replicas are deterministic DGPs calibrated to the neighbourhood of the published values, not the original data.

For the strict numerical neighbourhood proofs see tests/external_parity/test_published_replications.py and tests/external_parity/PUBLISHED_REFERENCE_VALUES.md.