Mixtape Chapter 9 — Difference-in-Differences (StatsPAI replication)¶
This guide walks through Cunningham's Causal Inference: The Mixtape Chapter 9 (Difference-in-Differences) using StatsPAI's causal toolbox. It is designed to be the template other chapter replications follow — if you want Chapter 8 (IV) or Chapter 6 (RD), fork this structure and swap the estimators.
Why replicate a textbook? Every serious empirical programme lives or dies on identification assumptions — not code. Walking through Cunningham's Chapter 9 lets you see how StatsPAI maps one-for-one onto the pedagogical language (parallel trends, TWFE, staggered adoption, event studies) rather than the estimator zoo.
Running the code in this page end-to-end takes about 30 seconds.
0. Setup¶
import numpy as np
import pandas as pd
import statspai as sp
# Reproducible throughout the chapter:
RNG_SEED = 2026
Mixtape uses the Kansas and Cheng-Hoekstra castle-doctrine data for its worked examples. Neither is bundled with StatsPAI, so we replicate the pedagogical pattern — heterogeneous, staggered treatment on a panel — using the canonical DGP:
df = sp.dgp_did(
n_units=200, n_periods=10,
staggered=True, n_groups=4,
effect=0.5, heterogeneous=True,
seed=RNG_SEED,
)
df.head()
# columns: unit, time, y, treated, first_treat, group
first_treat is NaN for never-treated units and an integer period
otherwise — the standard Callaway-Sant'Anna convention.
1. The TWFE baseline (and why it's wrong under heterogeneity)¶
Mixtape first presents the classical two-way-fixed-effects estimator:
twfe = sp.panel(
df,
"y ~ treated",
entity="unit", time="time",
method="fe",
cluster="unit",
)
twfe.summary()
Under homogeneous effects this is the canonical DID. Under staggered adoption with heterogeneous effects, Goodman-Bacon (2021) shows the TWFE coefficient is a weighted average of every 2×2 DID comparison including ones where already-treated units serve as the control group — so some weights can be negative.
Run the Bacon decomposition to see it:
bacon = sp.bacon_decomposition(
data=df, y="y", treat="treated",
time="time", id="unit",
)
sp.bacon_plot(bacon)
Every bar labelled "Later vs. earlier (treated)" is the pathological contribution Goodman-Bacon flagged.
2. The modern staggered-DID suite (CS / SA / BJS)¶
Since 2021 the canonical answer is to report three estimators in parallel: Callaway-Sant'Anna (ATT decomposition), Sun-Abraham (weighted event study), and Borusyak-Jaravel-Spiess (imputation). StatsPAI ships a one-liner for the race:
which prints:
auto_did: staggered-DiD method race
============================================================
method estimate std_error ci_lower ci_upper n_obs notes
CS 0.5092 0.0538 0.4037 0.6146 2000 ok
SA 0.4923 0.1091 0.2786 0.7061 2000 ok
BJS 0.5039 0.0240 0.4569 0.5509 2000 ok
selected winner : bjs (rule=median)
All three land within 0.02 of the true effect (0.5). The median-winner
rule (select_by='median') is the default; pass select_by='cs' (etc.)
to force a specific headline estimate.
3. Event-study visualisation (Sun-Abraham)¶
Mixtape emphasises the event-study plot as the primary evidence for parallel pre-trends. StatsPAI:
sa = sp.sun_abraham(
data=df, y="y", g="first_treat", t="time", i="unit",
event_window=(-3, 3),
)
sp.enhanced_event_study_plot(sa)
Pre-period coefficients should cluster around zero; post-period coefficients should show the dynamic path of the treatment.
4. Honesty: Rambachan-Roth (2023) sensitivity¶
What if parallel trends doesn't quite hold? Rambachan & Roth's "Honest DID" machinery answers how much drift the result can absorb before losing significance.
cs = sp.callaway_santanna(
data=df, y="y", g="first_treat", t="time", i="unit",
)
dyn = sp.aggte(cs, type="dynamic")
# Breakdown value: largest allowable pre-trend drift per period
m_star = sp.breakdown_m(dyn, e=0, method="smoothness")
print(f"Breakdown M* = {m_star:.4f}")
# Full sensitivity frame at several candidate M values
sens = sp.honest_did(dyn, e=0, method="smoothness",
m_grid=list(np.linspace(0, 0.5, 11)))
sens.tail()
A large M* means the effect survives substantial violations of
parallel trends — the result is robust. A small M* (or zero!)
means the conclusion is fragile and should be flagged in the paper.
5. Multiverse / specification curve¶
Cunningham's Mixtape ends every DID chapter with a specification check. StatsPAI ships two complementary multiverse tools — pick the one that matches what you want to vary:
5a. Vary the estimator (CS vs SA vs BJS)¶
That's sp.auto_did, which we already used in Section 2. The
leaderboard is itself the multiverse — three identification strategies,
one data set, three point estimates displayed side by side.
5b. Vary the control set — sp.spec_curve¶
# Build a cross-sectional view of the last period (spec_curve takes a
# single-observation-per-unit frame, not the panel) for a covariate
# multiverse on a post-treatment outcome.
last = df[df["time"] == df["time"].max()]
sc = sp.spec_curve(
data=last,
y="y",
x="treated",
controls=[["group"], ["group", "unit"]], # list-of-list: each inner list is one spec
)
sc.plot()
Every combination of a control set × an SE type × an (optional) sample subset becomes one dot on the curve; stable estimates across the curve imply the result is design-driven rather than specification-driven.
Note — the
spec_curveAPI is a covariate / SE multiverse, not an estimator multiverse. If you want to race TWFE against CS against BJS, useauto_didabove; if you want to see how the headline estimate moves when you change the control set, use this section.
6. What comes next¶
- Chapter 6 replication — sharp and fuzzy RD with
sp.rdrobustandsp.rd_honest(McCrary manipulation + robust bias correction). - Chapter 8 replication — IV with
sp.ivreg, weak-IV-robust CIs viasp.anderson_rubin_ci/sp.conditional_lr_ci, and the Lee-McCrary (2022)sp.tF_adjustmentcritical value. - Chapter 10 replication — synthetic control with
sp.synth,sp.sdid, and the full donor-pool diagnostic suite.
Each follows the same pattern: (1) classical baseline → (2) modern race → (3) event-study visualisation → (4) sensitivity → (5) spec curve.
Citation¶
If you use this guide or StatsPAI in academic work, please cite both:
@book{cunningham2021causal,
title={Causal Inference: The Mixtape},
author={Cunningham, Scott},
year={2021},
publisher={Yale University Press},
url={https://mixtape.scunning.com/}
}
@software{wang2026statspai,
title={StatsPAI: The Agent-Native Causal Inference and Econometrics Toolkit for Python},
author={Wang, Biaoyue and Rozelle, Scott},
year={2026},
url={https://github.com/brycewang-stanford/StatsPAI},
version={1.16.0}
}
For Agents¶
Pre-conditions - data is panel or repeated cross-section with a time column - treat column is binary (0/1) for 2x2, or first-treatment-period (int) for staggered - at least one pre-treatment period (≥ 2 periods for 2x2; ≥ 3 recommended for event study) - for staggered designs: id column identifying units across time
Identifying assumptions - Parallel trends: treated and control groups would have followed the same trajectory absent treatment - No anticipation: outcomes in pre-treatment periods are unaffected by future treatment - SUTVA: no spillovers between units - For staggered / heterogeneous effects: use CS or SA — TWFE can produce negative weights (Goodman-Bacon)
Failure modes → recovery
| Symptom | Exception | Remedy | Try next |
|---|---|---|---|
| Pre-trend joint test p < 0.05 (or underpowered at 0.10) | AssumptionViolation |
Use sp.sensitivity_rr (Rambachan & Roth honest CI) or switch to sp.callaway_santanna. | sp.sensitivity_rr |
| Staggered treatment timing with TWFE method | AssumptionWarning |
TWFE can give negative weights; use Callaway-Sant'Anna, Sun-Abraham, or BJS imputation. | sp.callaway_santanna |
| Pre-trend test underpowered (Roth 2022) | AssumptionWarning |
Check sp.pretrends_power — if low, report honest CI via sp.sensitivity_rr. | sp.sensitivity_rr |
| Few clusters at unit level | AssumptionWarning |
Use wild cluster bootstrap (sp.wild_cluster_bootstrap). | sp.wild_cluster_bootstrap |
Alternatives (ranked)
- sp.callaway_santanna
- sp.sun_abraham
- sp.did_imputation
- sp.sdid
- sp.synth
Typical minimum N: 50
For Agents¶
Pre-conditions - panel data with unit × time × outcome - g column is integer: first-treated period or 0 for never-treated - at least one never-treated or late-treated control group - ≥ 2 pre-treatment periods per cohort - data is panel or repeated cross-section with a time column - treat column is binary (0/1) for 2x2, or first-treatment-period (int) for staggered - at least one pre-treatment period (≥ 2 periods for 2x2; ≥ 3 recommended for event study) - for staggered designs: id column identifying units across time
Identifying assumptions - Parallel trends conditional on X (if covariates supplied) - No anticipation (or adjust via anticipation= parameter) - Overlap: positive propensity for each cohort - SUTVA - Parallel trends: treated and control groups would have followed the same trajectory absent treatment - No anticipation: outcomes in pre-treatment periods are unaffected by future treatment - SUTVA: no spillovers between units - For staggered / heterogeneous effects: use CS or SA — TWFE can produce negative weights (Goodman-Bacon)
Failure modes → recovery
| Symptom | Exception | Remedy | Try next |
|---|---|---|---|
| Pre-trend test on aggregated ATT(g,t) rejects | AssumptionViolation |
Use sp.sensitivity_rr for honest CI, or add covariates for conditional parallel trends. | sp.sensitivity_rr |
| Cohort with only one unit — insufficient variation | DataInsufficient |
Aggregate small cohorts or drop; check sp.diagnose_result. | |
| All units treated at the same time (no staggering) | MethodIncompatibility |
Fall back to 2x2 DID via sp.did(method='2x2'). | sp.did |
| Pre-trend joint test p < 0.05 (or underpowered at 0.10) | AssumptionViolation |
Use sp.sensitivity_rr (Rambachan & Roth honest CI) or switch to sp.callaway_santanna. | sp.sensitivity_rr |
| Staggered treatment timing with TWFE method | AssumptionWarning |
TWFE can give negative weights; use Callaway-Sant'Anna, Sun-Abraham, or BJS imputation. | sp.callaway_santanna |
| Pre-trend test underpowered (Roth 2022) | AssumptionWarning |
Check sp.pretrends_power — if low, report honest CI via sp.sensitivity_rr. | sp.sensitivity_rr |
| Few clusters at unit level | AssumptionWarning |
Use wild cluster bootstrap (sp.wild_cluster_bootstrap). | sp.wild_cluster_bootstrap |
Alternatives (ranked)
- sp.sun_abraham
- sp.did_imputation
- sp.sdid
- sp.did
- sp.callaway_santanna
- sp.synth
Typical minimum N: 50