`statspai.datasets`¶

datasets ¶

Canonical econometrics datasets with documented expected estimates.

This subpackage provides deterministic, reproducible datasets used throughout the causal-inference literature, consolidated under a single import path sp.datasets:

>>> import statspai as sp
>>> df = sp.datasets.nsw_lalonde()
>>> df.attrs['expected_experimental_att']
1794

Each function returns a pd.DataFrame with:

A fully-deterministic DGP (fixed seed, BSD/MIT redistributable).
df.attrs containing:
'paper' — original paper citation
'expected_*' — theoretically-anchored estimates (from the published paper on the ORIGINAL data, not necessarily the simulated replica)
'notes' — what to be careful about when using this replica

The simulated replicas are designed to match the structure and summary statistics of the original datasets. For numerical R / Stata parity against the original data, see tests/external_parity/PUBLISHED_REFERENCE_VALUES.md.

Available datasets

DID / panel
mpdta(): Callaway-Sant'Anna teen employment
teen_employment(): alias of mpdta()

RD
lee_2008_senate(): US Senate RD (Lee 2008)

IV
card_1995(): IV returns-to-schooling
angrist_krueger_1991(): quarter-of-birth IV

Matching / SOO
nsw_lalonde(): LaLonde NSW job training (experimental subset)
nsw_dw(): Dehejia-Wahba NSW + PSID comparison

Synthetic control
california_prop99(): ADH tobacco (re-exported from synth)
basque_terrorism(): Abadie-Gardeazabal (re-exported)
german_reunification(): ADH 2015 (re-exported)

Public health / epidemiology (REAL data)
nhefs(): Hernán-Robins What If NHEFS (g-methods canon: quit-smoking → weight / mortality)
load_nhefs(): alias of nhefs()

mpdta ¶

mpdta(seed: int = 42) -> DataFrame

Simulated replica of the mpdta dataset from R's did package.

The original mpdta is a county-year panel of log teen-employment (2003-2007) where some counties raise their minimum wage in 2004, 2006, or 2007 (staggered adoption).

Our replica preserves:

500 counties × 5 years = 2500 rows
Three treatment cohorts (2004, 2006, 2007) plus never-treated
Negative homogeneous ATT ≈ -0.04 log points (matches the published R did::att_gt aggregated ATT of roughly -0.045 on the original)
County-level clustering in residuals

Returns:

Type	Description
`pd.DataFrame`	Columns: `countyreal, year, lemp, first_treat, treat`. `lemp`: log teen employment (outcome); `first_treat`: period of first treatment (0 if never); `treat`: binary on/off indicator (post × treated cohort).

Notes

df.attrs['expected_simple_att'] = -0.040 (published R output on the original data: -0.0454; our replica's target is -0.04).

Because this is a simulated DGP, numerical values will not match R did::att_gt to high precision on the original data; but they match the sign, order of magnitude, and aggregation pattern.
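The coding of `first_treat` and `treat` described above can be sketched in plain Python. This is only an illustration on a hypothetical four-county toy panel: the county ids and the helper name `build_treat` are assumptions and are not part of StatsPAI.

```python
# Toy illustration of the staggered-adoption coding described above.
# County ids and the helper name are hypothetical, not StatsPAI internals.
YEARS = range(2003, 2008)  # 2003..2007, five periods

# first_treat: year of first treatment, 0 for never-treated counties
first_treat_by_county = {101: 2004, 102: 2006, 103: 2007, 104: 0}


def build_treat(first_treat: int, year: int) -> int:
    """Return 1 once a treated cohort has reached its first treated year."""
    return int(first_treat != 0 and year >= first_treat)


panel = [
    {"countyreal": c, "year": y, "first_treat": g, "treat": build_treat(g, y)}
    for c, g in first_treat_by_county.items()
    for y in YEARS
]

# Number of treated periods per county
treated_periods = {
    c: sum(row["treat"] for row in panel if row["countyreal"] == c)
    for c in first_treat_by_county
}
print(treated_periods)
```

The 2004 cohort is treated in four of the five periods, the 2007 cohort in one, and the never-treated county in none, which is the pattern a staggered-adoption estimator has to aggregate over.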

References

Callaway, B. & Sant'Anna, P.H.C. (2021). Difference-in-Differences with Multiple Time Periods. Journal of Econometrics 225(2), 200-230. [@callaway2021difference]

card_1995 ¶

card_1995(seed: int = 42, simulated: bool = True) -> DataFrame

Card (1995) NLS Young Men data — simulated replica or real extract.

Card uses proximity to a 4-year college (nearc4) as an instrument for years of education in a wage equation. Published OLS and IV point estimates (Card 1995 Table 2):

OLS: β_educ ≈ 0.075 (col 2)
IV (nearc4): β_educ ≈ 0.132 (col 5)

IV exceeds OLS — the "Card puzzle". The LATE interpretation is for compliers on the margin of attending college because of proximity.
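As a rough sketch of how an IV estimate can exceed OLS, the toy simulation below uses classical measurement error in schooling, one commonly discussed mechanism alongside the LATE interpretation above. The DGP, its parameters, and the variable names are assumptions for illustration; it is neither Card's data nor the StatsPAI replica.

```python
import random

# Toy DGP (an assumption for illustration, not Card's data or StatsPAI's
# replica): schooling is reported with error, which attenuates OLS, while a
# binary instrument z recovers the true return of 0.10.
rng = random.Random(0)
n = 20_000
z, educ_obs, lwage = [], [], []
for _ in range(n):
    zi = 1 if rng.random() < 0.5 else 0
    educ_true = 12.0 + 1.0 * zi + rng.gauss(0.0, 2.0)
    z.append(zi)
    educ_obs.append(educ_true + rng.gauss(0.0, 1.5))  # noisy report
    lwage.append(1.0 + 0.10 * educ_true + rng.gauss(0.0, 0.3))


def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


beta_ols = cov(educ_obs, lwage) / cov(educ_obs, educ_obs)
beta_iv = cov(z, lwage) / cov(z, educ_obs)  # Wald / just-identified IV
print(f"OLS = {beta_ols:.3f}, IV = {beta_iv:.3f}")
```

In this toy setup the OLS slope is pulled toward zero by the reporting noise, while the ratio of covariances with the instrument is not, so the IV estimate sits above OLS.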

Parameters:

Name	Type	Description	Default
`seed`	`int`	RNG seed for the simulated DGP (ignored when `simulated=False`).	`42`
`simulated`	`bool`	If True, return a deterministic simulated replica calibrated so StatsPAI estimators recover OLS ≈ 0.11 and IV ≈ 0.142. If False, load the real NLSYM extract bundled in `statspai/datasets/data/card_1995.csv` (n=3010, identical to R's `wooldridge::card` complete-cases subset on Card's modelling variables). StatsPAI on this real data recovers OLS ≈ 0.0740 (paper 0.075) and IV ≈ 0.1323 (paper 0.132).	`True`

Returns:

Type	Description
`DataFrame`	Simulated columns: `lwage, educ, exper, expersq, black, south, smsa, nearc4` (n=3010). Real columns: same plus `nearc2` (proximity to 2-year college).

References

Card, D. (1995). Using Geographic Variation in College Proximity to Estimate the Return to Schooling. In Christofides et al. (eds.), Aspects of Labour Market Behaviour. [@card1995using]

nsw_lalonde ¶

nsw_lalonde(seed: int = 42, simulated: bool = True) -> DataFrame

LaLonde NSW data — simulated replica or real MatchIt extract.

Parameters:

Name	Type	Description	Default
`seed`	`int`	RNG seed for the simulated replica (ignored when `simulated=False`).	`42`
`simulated`	`bool`	If True, return a deterministic simulated NSW experimental subset (185 + 260 = 445 rows) calibrated so naive OLS recovers the Dehejia-Wahba experimental ATT of about $1,794. If False, load the real `MatchIt::lalonde` extract bundled in `statspai/datasets/data/lalonde_matchit.csv` — the DW NSW treated cohort (185) plus a 429-unit PSID-1 subset for observational comparisons (n=614 total, with race factor already split into `black` and `hispanic` indicators).	`True`

Notes

The bundled real data is MatchIt::lalonde (n=614), NOT the larger DW (1999) NSW + PSID-1 sample (n=2,675). On this smaller subset, naive OLS gives ATT roughly -$635 (less negative than DW Table 3's headline -$8,498, which uses the full PSID-1). For the headline naive-bias demonstration, use the simulated nsw_dw() panel instead.

Simulated replica calibration

nsw_dw ¶

nsw_dw(seed: int = 42) -> DataFrame

Dehejia-Wahba NSW + PSID-1 non-experimental comparison.

Combines the 185 NSW treated (from the experiment) with 2,490 non-experimental PSID males as the comparison group — the classic observational-vs-experimental benchmark.

A naive OLS on re78 ~ treat (no covariates) yields strongly negative estimates (~-$8,500) because the PSID controls are much better-off on average. With PSM on rich covariates, the estimate should return to the experimental benchmark of ≈ $1,794.
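The sign flip described above can be illustrated with a minimal sketch. It assumes a hypothetical DGP with a single binary confounder and uses exact stratification rather than propensity-score matching; the cell counts, earnings levels, and helper names are made up for illustration and are not the StatsPAI replica.

```python
import random

# Toy selection-on-observables DGP (hypothetical numbers, not the StatsPAI
# replica): one binary confounder, prior high earnings, drives both group
# membership and the outcome. The true effect is fixed at 1794.
rng = random.Random(0)
TRUE_ATT = 1794.0
# (treat, high_earner) -> number of units; totals mirror 185 vs 2,490
cells = {(1, 0): 180, (1, 1): 5, (0, 0): 200, (0, 1): 2290}

rows = []
for (treat, high), count in cells.items():
    for _ in range(count):
        re78 = 5000 + 15000 * high + TRUE_ATT * treat + rng.gauss(0, 500)
        rows.append((treat, high, re78))


def mean_re78(treat, high=None):
    """Mean outcome for one arm, optionally within one confounder stratum."""
    vals = [y for t, h, y in rows if t == treat and (high is None or h == high)]
    return sum(vals) / len(vals)


naive = mean_re78(1) - mean_re78(0)

# Exact matching on the confounder: within-stratum contrasts, weighted by
# the treated share in each stratum (an ATT-style weighting).
n_treated = cells[(1, 0)] + cells[(1, 1)]
matched = sum(
    cells[(1, h)] / n_treated * (mean_re78(1, h) - mean_re78(0, h))
    for h in (0, 1)
)
print(f"naive = {naive:.0f}, matched = {matched:.0f}")
```

The naive contrast is strongly negative because most comparison units are high earners, while the stratified contrast lands near the true effect.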

Returns:

Type	Description
`pd.DataFrame`	Columns: `treat, age, education, black, hispanic, married, nodegree, re74, re75, re78`. Treated units (185) are the NSW experimental cohort; controls (2,490) are PSID.

References

Dehejia, R. & Wahba, S. (1999). Causal Effects in Nonexperimental Studies. JASA 94(448), 1053-1062. [@dehejia1999causal]

lee_2008_senate ¶

lee_2008_senate(seed: int = 42, simulated: bool = True) -> DataFrame

Lee (2008) US Senate RD — simulated replica or real extract.

Parameters:

Name	Type	Description	Default
`seed`	`int`	RNG seed for the simulated DGP (ignored when `simulated=False`).	`42`
`simulated`	`bool`	If True, return a deterministic simulated panel (n=6558, `voteshare_next, margin, win`) on a 0-1 vote-share scale, calibrated to a 0.08 jump at the cutoff. If False, load the real `rdrobust::rdrobust_RDsenate` extract (n=1390, `x, y` where `y` is vote share in percentage points, 0-100, and `x` is the lagged Democratic margin).	`True`

Notes

The real-data branch lets you reproduce Lee (2008) Table 1 / CCT (2014) Table 4 numbers exactly. StatsPAI's sp.rdrobust(df, y='y', x='x', c=0, kernel='triangular', bwselect='cct') recovers Conventional ≈ 7.41 and Robust ≈ 7.51 on this dataset (paper headline ≈ 7.99).
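To make the estimand concrete, here is a minimal sketch of a sharp RD jump estimated with triangular-kernel local linear fits on each side of the cutoff. It assumes a noise-free toy DGP and a fixed, hand-picked bandwidth; it does not implement the data-driven bandwidth selection or the bias correction that `rdrobust`-style estimators perform, and the helper name `boundary_fit` is hypothetical.

```python
# Noise-free toy RD (an assumption for illustration): a 0.08 jump at the
# cutoff on a 0-1 scale, estimated with triangular-kernel local linear fits.
CUTOFF, BANDWIDTH, TRUE_JUMP = 0.0, 0.25, 0.08

margin = [i / 200 - 0.5 for i in range(201)]  # -0.5 .. 0.5
voteshare_next = [0.45 + 0.3 * x + TRUE_JUMP * (x >= CUTOFF) for x in margin]


def boundary_fit(xs, ys):
    """Weighted linear fit; returns the fitted value at the cutoff."""
    w = [max(0.0, 1 - abs(x - CUTOFF) / BANDWIDTH) for x in xs]  # triangular
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx
    return my + slope * (CUTOFF - mx)


left = [(x, y) for x, y in zip(margin, voteshare_next) if x < CUTOFF]
right = [(x, y) for x, y in zip(margin, voteshare_next) if x >= CUTOFF]

# The RD estimate is the gap between the two boundary predictions.
jump = boundary_fit(*zip(*right)) - boundary_fit(*zip(*left))
print(round(jump, 6))
```

Because the toy outcome is exactly linear on each side, the two boundary fits recover the 0.08 jump up to floating-point error; on real data the answer depends on the bandwidth and kernel.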

Returns:

Type	Description
`DataFrame`	Simulated columns: `voteshare_next, margin, win` (0-1 scale). Real columns: `x, y` (running variable; vote share 0-100).

References

Lee, D. (2008). Randomized experiments from non-random selection in U.S. House elections. Journal of Econometrics 142, 675-697. [@lee2008randomized]

Calonico, S., Cattaneo, M.D. & Titiunik, R. (2014). Robust nonparametric confidence intervals for regression-discontinuity designs. Econometrica 82(6), 2295-2326. [@calonico2014robust]

angrist_krueger_1991 ¶

angrist_krueger_1991(seed: int = 42) -> DataFrame

Simulated replica of Angrist-Krueger (1991) quarter-of-birth IV.

Classic weak-instrument case. Quarter of birth predicts years of schooling because compulsory-schooling laws tie school-entry age to calendar date: children born in Q1 are slightly older at entry, so they reach the legal dropout age with fewer years of schooling. The first-stage F is only a few dozen despite the very large original sample, and point estimates are unstable on subsets.

Our replica uses n=5,000 (the original is ~329k). Published IV returns-to-schooling on the original: 0.08-0.11 depending on controls and birth cohort.

Returns:

Type	Description
`pd.DataFrame`	Columns: `lwage, educ, q1, q2, q3, q4, year_of_birth`.

References

Angrist, J. & Krueger, A. (1991). Does Compulsory School Attendance Affect Schooling and Earnings? QJE 106(4), 979-1014. [@angrist1991does]

nhefs ¶

nhefs(complete_case: bool = False) -> DataFrame

NHEFS — the canonical dataset of Hernán & Robins, Causal Inference: What If (2020), bundled as real, public-domain data for exact replication of the book's g-methods examples.

The National Health and Nutrition Examination Survey I (NHANES I) Epidemiologic Followup Study (NHEFS) follows US adults from a 1971-1975 baseline to a 1982 re-examination. The book uses it throughout Part II to estimate the average causal effect of quitting smoking (qsmk) on 10-year weight change (wt82_71, kg) and on 10-year mortality (death).

Parameters:

Name	Type	Description	Default
`complete_case`	`bool`	If False, return the full NHEFS extract (n=1629, 67 columns). If True, restrict to subjects with a non-missing 1982 weight (`wt82_71` not null, n=1566) — the analytic sample used for the weight-change examples in Chapters 12-15 of the book.	`False`

Returns:

Type	Description
`DataFrame`	67 columns. Key modelling variables used in the book are listed below.

qsmk: quit smoking 1971-1982 (1 = yes; the "treatment")
wt82_71: weight change 1971→1982 in kg (continuous outcome)
death: died by 1992 (1 = yes; the survival outcome)
yrdth, modth: year / month of death (for survival timing)
sex, race, age: demographics (sex: 0 male / 1 female; race 0/1)
education: 5-level education (1-5)
smokeintensity: cigarettes/day at baseline
smokeyrs: years smoked at baseline
exercise: 3-level exercise (0 much / 1 moderate / 2 little)
active: 3-level daily activity (0 / 1 / 2)
wt71: baseline weight (kg)

df.attrs records the book citation and the published reference estimates (see Notes).

Notes

This is real data (df.attrs['data_source'] == 'real'), unlike the simulated econometrics replicas in this module. Because the data are the genuine book extract, StatsPAI reproduces the book's published numbers — not merely their neighbourhood:

Crude (unadjusted) mean weight-change difference, quitters vs non-quitters: 2.54 kg (book §12.2; StatsPAI 2.5406).
IP-weighted average treatment effect (stabilized weights, Chapter 12 MSM): 3.4 kg, 95% CI (2.4, 4.5) (book Program 12.4; StatsPAI sp.ipw 3.48, gold statsmodels MSM 3.44).
Parametric g-formula / standardization (Chapter 13): 3.5 kg.
G-estimation of a structural nested mean model (Chapter 14): psi ≈ 3.4.
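The difference between the crude contrast and the IP-weighted estimate listed above can be sketched on a tiny hand-built example. The counts, the outcome rule, and the helper names below are hypothetical and are NOT the NHEFS data; the point is only to show how stabilized weights sw = P(A=a) / P(A=a | L) remove confounding by a measured covariate L.

```python
# Toy counts (hypothetical, NOT the NHEFS data): one binary confounder L,
# binary treatment A, and outcome Y = 2 + 3*A + 4*L, so the true effect is 3.
# Each tuple is (L, A, number_of_subjects).
cells = [(0, 1, 10), (0, 0, 40), (1, 1, 30), (1, 0, 20)]


def outcome(l, a):
    return 2 + 3 * a + 4 * l


n_total = sum(n for _, _, n in cells)
p_treated = sum(n for _, a, n in cells if a == 1) / n_total  # P(A=1)


def p_treated_given(l):
    """P(A=1 | L=l), computed from the cell counts."""
    n_l = sum(n for li, _, n in cells if li == l)
    return sum(n for li, a, n in cells if li == l and a == 1) / n_l


def stabilized_weight(l, a):
    """sw = P(A=a) / P(A=a | L=l)."""
    num = p_treated if a == 1 else 1 - p_treated
    den = p_treated_given(l) if a == 1 else 1 - p_treated_given(l)
    return num / den


def arm_mean(a, weighted):
    """(Weighted) mean outcome in treatment arm a."""
    w = [(stabilized_weight(l, a) if weighted else 1.0) * n
         for l, ai, n in cells if ai == a]
    y = [outcome(l, a) for l, ai, n in cells if ai == a]
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)


crude = arm_mean(1, weighted=False) - arm_mean(0, weighted=False)
ipw = arm_mean(1, weighted=True) - arm_mean(0, weighted=True)
print(round(crude, 3), round(ipw, 3))
```

In this toy population the crude contrast (about 4.67) overstates the true effect of 3 because treated subjects are concentrated in the high-outcome stratum; reweighting balances L across arms and recovers 3.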

Strict numerical reproductions of the full chapter programs live in tests/external_parity/test_whatif_nhefs.py and the public-health validation notebooks under examples/public_health/.

Provenance & licence

NHEFS is a US Federal public-use survey (NCHS / NIH) and is therefore in the public domain as a US Government work. The specific analytic extract bundled here (n=1629 × 67) is the one distributed by Hernán & Robins with the book and re-packaged in the MIT-licensed causaldata package (Huntington-Klein); it is byte-reproducible from causaldata.nhefs. Redistribution here is consistent with both the public-domain status of the underlying survey and StatsPAI's policy of only bundling freely redistributable datasets.

References

Hernán, M.A. & Robins, J.M. (2020). Causal Inference: What If. Boca Raton: Chapman & Hall/CRC. [@hernan2020causal]

basque_terrorism ¶

basque_terrorism() -> DataFrame

Basque Country terrorism dataset (simulated).

Returns a balanced panel of GDP per capita (thousands of 1986 USD) for 17 Spanish regions, 1955-1997. The Basque Country is the treated unit; treatment begins in 1970 (onset of ETA terrorism).

The simulated data reproduce the gradual widening of an approximately 10% GDP gap between the Basque Country and its synthetic counterfactual after 1970.

References

Abadie, A. & Gardeazabal, J. (2003). "The Economic Costs of Conflict: A Case Study of the Basque Country." American Economic Review, 93(1), 113-132. [@abadie2003economic]

Returns:

Type	Description
`DataFrame`	Columns: `region`, `year`, `gdppc`, `treated`.

Examples:

>>> import statspai as sp
>>> df = sp.basque_terrorism()
>>> result = sp.synth(df, outcome='gdppc', unit='region',
...                   time='year', treated_unit='Basque Country',
...                   treatment_time=1970)
>>> bool(result.estimate is not None)
True

german_reunification ¶

german_reunification() -> DataFrame

German reunification dataset (simulated).

Returns a balanced panel of GDP per capita for 17 OECD countries, 1960-2003. West Germany is the treated unit; treatment begins in 1990 (reunification).

The simulated trajectories reproduce the key stylised facts: Luxembourg has the highest GDP per capita (~40 000), Portugal the lowest (~10 000), and all countries share a common upward growth trend. Post-1990, West Germany exhibits an approximately 1 500 GDP-per-capita decline relative to its synthetic counterfactual.

References

Abadie, A., Diamond, A. & Hainmueller, J. (2015). "Comparative Politics and the Synthetic Control Method." American Journal of Political Science, 59(2), 495-510. [@abadie2015comparative]

Returns:

Type	Description
`DataFrame`	Columns: `country`, `year`, `gdppc`, `treated`.

Examples:

>>> import statspai as sp
>>> df = sp.german_reunification()
>>> result = sp.synth(df, outcome='gdppc', unit='country',
...                   time='year', treated_unit='West Germany',
...                   treatment_time=1990)
>>> bool(result.estimate is not None)
True

from_worldbank ¶

from_worldbank(payload: JSONLike, *, wide: bool = False, value_name: str = 'value') -> DataFrame

Normalise a World Bank Indicators API payload to a tidy panel.

Accepts any of the shapes a World Bank MCP / the v2 REST API returns:

the raw [metadata, rows] two-element list,
just the rows list of observation dicts,
a DataFrame already close to tidy.

Each observation dict looks like:

{"indicator": {"id": "NY.GDP.PCAP.KD", "value": "GDP per capita"},
 "country":   {"id": "US", "value": "United States"},
 "countryiso3code": "USA", "date": "2020", "value": 63027.7}

Parameters:

Name	Type	Description	Default
`payload`	`list or dict or DataFrame`		required
`wide`	`bool`	If True, pivot indicators to columns indexed by (iso3, year) — handy when several indicators were fetched and you want one regression frame.	`False`
`value_name`	`str`	Name of the value column in long form.	`"value"`

Returns:

Type	Description
`DataFrame`	Long form columns: `country`, `iso3`, `indicator`, `indicator_id`, `year`, `<value_name>`. `df.attrs['source']` records the provenance. Wide form: one row per (iso3, year), one column per indicator.
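The long and wide shapes described above can be sketched with the standard library alone. The helper names `flatten_rows` and `to_wide`, the placeholder metadata element, and the second observation are assumptions for illustration; whether StatsPAI keys wide columns by indicator id or by indicator name is not stated here, so the id is used as a guess.

```python
from collections import defaultdict

# Hypothetical stand-in for the normalisation step; the helper names and the
# second (made-up) observation are assumptions, not StatsPAI code.
rows = [
    {"indicator": {"id": "NY.GDP.PCAP.KD", "value": "GDP per capita"},
     "country": {"id": "US", "value": "United States"},
     "countryiso3code": "USA", "date": "2020", "value": 63027.7},
    {"indicator": {"id": "NY.GDP.PCAP.KD", "value": "GDP per capita"},
     "country": {"id": "US", "value": "United States"},
     "countryiso3code": "USA", "date": "2021", "value": None},
]


def flatten_rows(payload, value_name="value"):
    """Accept [metadata, rows] or a bare rows list; return tidy dicts."""
    if len(payload) == 2 and isinstance(payload[1], list):
        payload = payload[1]  # drop the metadata element
    return [
        {"country": r["country"]["value"], "iso3": r["countryiso3code"],
         "indicator": r["indicator"]["value"],
         "indicator_id": r["indicator"]["id"],
         "year": int(r["date"]), value_name: r["value"]}
        for r in payload
    ]


def to_wide(tidy, value_name="value"):
    """Pivot to one record per (iso3, year), one key per indicator id."""
    wide = defaultdict(dict)
    for rec in tidy:
        wide[(rec["iso3"], rec["year"])][rec["indicator_id"]] = rec[value_name]
    return dict(wide)


tidy = flatten_rows([{"page": 1}, rows])  # placeholder metadata + rows
wide = to_wide(tidy)
print(tidy[0])
print(wide[("USA", 2020)])
```

Both the two-element `[metadata, rows]` list and the bare rows list go through the same flattening, which is why the long form is a convenient intermediate before pivoting.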

Examples:

>>> rows = [{"indicator": {"id": "NY.GDP", "value": "GDP"},
...          "country": {"id": "US", "value": "United States"},
...          "countryiso3code": "USA", "date": "2020", "value": 1.0}]
>>> import statspai as sp
>>> df = sp.from_worldbank(rows)
>>> list(df.columns)
['country', 'iso3', 'indicator', 'indicator_id', 'year', 'value']
>>> int(df.loc[0, 'year'])
2020

from_fred ¶

from_fred(payload: Union[JSONLike, Mapping[str, JSONLike]], *, series_id: Optional[str] = None) -> DataFrame

Normalise FRED series observations to a tidy time series.

Accepts:

a single series' {"observations": [{"date", "value"}, ...]} dict,
just the observations list,
a mapping {series_id: observations} for several series, merged on date into a wide frame (one column per series).

FRED's missing-value sentinel "." becomes NaN; dates are parsed to datetime64.
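A minimal standard-library sketch of the single-series rules above follows. The helper name `tidy_fred` is hypothetical, and it returns `datetime.date` objects in plain dicts rather than a `datetime64` column, so it approximates the behaviour rather than reproducing the StatsPAI implementation.

```python
import math
from datetime import date


# Hypothetical helper mirroring the rules above; not the StatsPAI source.
def tidy_fred(payload, series_id="value"):
    """Accept {"observations": [...]} or a bare list; return sorted records."""
    obs = payload["observations"] if isinstance(payload, dict) else payload
    records = [
        {"date": date.fromisoformat(o["date"]),
         # FRED marks missing values with the string "."
         series_id: math.nan if o["value"] == "." else float(o["value"])}
        for o in obs
    ]
    return sorted(records, key=lambda r: r["date"])


obs = {"observations": [{"date": "2020-02-01", "value": "."},
                        {"date": "2020-01-01", "value": "1.5"}]}
records = tidy_fred(obs, series_id="cpi")
print(records)
```

Converting the sentinel before any arithmetic matters because FRED delivers values as strings, so a stray "." would otherwise break a numeric cast.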

Parameters:

Name	Type	Description	Default
`payload`	`dict or list or mapping of series_id -> observations`		required
`series_id`	`str`	Column name for the value series when a single series is passed (default `"value"`).	`None`

Returns:

Type	Description
`DataFrame`	Columns `date` + one value column per series, sorted by date.

Examples:

>>> obs = {"observations": [{"date": "2020-01-01", "value": "1.5"},
...                         {"date": "2020-02-01", "value": "."}]}
>>> import statspai as sp
>>> df = sp.from_fred(obs, series_id="cpi")
>>> list(df.columns)
['date', 'cpi']
>>> bool(df['cpi'].isna().iloc[1])
True

from_sdmx ¶

from_sdmx(payload: JSONLike, *, value_name: str = 'value') -> DataFrame

Normalise an SDMX-JSON payload (OECD / Eurostat / IMF) to long form.

SDMX-JSON encodes each observation by integer indices into per-dimension code lists. This expands those indices back into human-readable dimension columns plus a value column — the shape sp.detect_design can read.

Accepts the SDMX-JSON 1.0 structure ({"dataSets": [...], "structure": {"dimensions": {...}}}) or a pre-tidied list of record dicts (returned as a DataFrame unchanged).
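The index expansion can be sketched as below, assuming series-level observations keyed by colon-separated indices as in the SDMX-JSON 1.0 layout. The helper name `expand_sdmx` is hypothetical and the sketch returns plain dicts; it is not the StatsPAI source and ignores attributes and flat (non-series) data sets.

```python
# Hypothetical sketch of the index expansion; not the StatsPAI source.
def expand_sdmx(payload, value_name="value"):
    dims = payload["structure"]["dimensions"]
    series_dims, obs_dims = dims["series"], dims["observation"]
    records = []
    for data_set in payload["dataSets"]:
        for series_key, series in data_set["series"].items():
            # "0:0" -> one integer index per series-level dimension
            s_idx = [int(i) for i in series_key.split(":")]
            base = {d["id"]: d["values"][i]["id"]
                    for d, i in zip(series_dims, s_idx)}
            for obs_key, obs in series["observations"].items():
                o_idx = [int(i) for i in obs_key.split(":")]
                rec = dict(base)
                rec.update({d["id"]: d["values"][i]["id"]
                            for d, i in zip(obs_dims, o_idx)})
                rec[value_name] = obs[0]  # first array element is the value
                records.append(rec)
    return records


payload = {
    "dataSets": [{"series": {"0:0": {"observations": {"0": [3.2]}}}}],
    "structure": {"dimensions": {
        "series": [
            {"id": "LOCATION", "values": [{"id": "USA", "name": "USA"}]},
            {"id": "SUBJECT", "values": [{"id": "UNR", "name": "Unemp"}]}],
        "observation": [
            {"id": "TIME_PERIOD", "values": [{"id": "2020", "name": "2020"}]}],
    }},
}
records = expand_sdmx(payload)
print(records)
```

Each key component is just a position in the matching dimension's code list, so the expansion is a pair of lookups per observation.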

Parameters:

Name	Type	Description	Default
`payload`	`dict or list or DataFrame`		required
`value_name`	`str`		`"value"`

Returns:

Type	Description
`DataFrame`	One column per SDMX dimension (e.g. `LOCATION`, `TIME_PERIOD`) plus `<value_name>`. `df.attrs['source'] == 'sdmx'`.

Examples:

>>> payload = {
...   "dataSets": [{"series": {"0:0": {"observations": {"0": [3.2]}}}}],
...   "structure": {"dimensions": {
...      "series": [
...        {"id": "LOCATION", "values": [{"id": "USA", "name": "USA"}]},
...        {"id": "SUBJECT", "values": [{"id": "UNR", "name": "Unemp"}]}],
...      "observation": [
...        {"id": "TIME_PERIOD", "values": [{"id": "2020", "name": "2020"}]}]
...   }}}
>>> import statspai as sp
>>> df = sp.from_sdmx(payload)
>>> df.loc[0, "LOCATION"], df.loc[0, "TIME_PERIOD"], df.loc[0, "value"]
('USA', '2020', 3.2)

california_prop99 ¶

california_prop99(simulated: bool = True) -> DataFrame

California Proposition 99 panel (Abadie-Diamond-Hainmueller 2010).

Parameters:

Name	Type	Description	Default
`simulated`	`bool`	If True, return the simulated covariate-rich replica from `synth.california_tobacco` (39 states × 31 years, 1970-2000, ADH-shaped DGP). Default for backward compatibility. If False, load the real ADH (2010) panel bundled in `statspai/datasets/data/california_prop99.csv` (39 states × 31 years, with covariates `cigsale, retprice, lnincome, age15to24, beer`; identical to tidysynth's smoking dataset). Use this for exact paper replication.	`True`

Returns:

Type	Description
`DataFrame`	Columns (both branches): `state, year, cigsale, retprice, lnincome, age15to24, beer`. The simulated branch additionally provides `treated`; on the real branch we derive it as `(state == 'California') & (year >= 1989)`.

References

Abadie, A., Diamond, A. & Hainmueller, J. (2010). Synthetic Control Methods for Comparative Case Studies. Journal of the American Statistical Association 105(490), 493-505. [@abadie2010synthetic]

list_datasets ¶

list_datasets() -> DataFrame

Return a DataFrame describing all available datasets.

Columns: name, design, n_obs, paper, paper_original, expected_main.

paper_original is the headline number from the published paper on the ORIGINAL data (what readers expect to see).
expected_main is what the canonical estimator recovers on this simulated replica (what users will actually observe). The two differ because the bundled replicas are deterministic DGPs calibrated to the neighbourhood of the published values, not the original data.

For the strict numerical neighbourhood proofs see tests/external_parity/test_published_replications.py and tests/external_parity/PUBLISHED_REFERENCE_VALUES.md.
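The gap between a published number and a replica target can be tabulated with a short sketch. The values below are copied from the function notes on this page; the row labels and the tuple layout are illustrative assumptions and are not the output of `list_datasets()`.

```python
# Values copied from the function notes on this page; the row labels and the
# tuple layout are illustrative and are not the output of list_datasets().
# Each tuple is (label, published value, replica target).
targets = [
    ("mpdta simple ATT", -0.0454, -0.040),
    ("card_1995 OLS", 0.075, 0.11),
    ("card_1995 IV", 0.132, 0.142),
]

gaps = {
    label: round(replica - published, 4)
    for label, published, replica in targets
}
for label, gap in gaps.items():
    print(f"{label:<18} replica - published = {gap:+.4f}")
```

A check like this makes it explicit which replicas track the published value closely and which are only calibrated to its neighbourhood.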
