Mixtape Chapter 9 — Difference-in-Differences (StatsPAI replication)¶

This guide walks through Cunningham's Causal Inference: The Mixtape Chapter 9 (Difference-in-Differences) using StatsPAI's causal toolbox. It is designed to be the template other chapter replications follow — if you want Chapter 8 (IV) or Chapter 6 (RD), fork this structure and swap the estimators.

Why replicate a textbook? Every serious empirical programme lives or dies on identification assumptions — not code. Walking through Cunningham's Chapter 9 lets you see how StatsPAI maps one-for-one onto the pedagogical language (parallel trends, TWFE, staggered adoption, event studies) rather than the estimator zoo.

Running the code in this page end-to-end takes about 30 seconds.

0. Setup¶

import numpy as np
import pandas as pd
import statspai as sp

# Reproducible throughout the chapter:
RNG_SEED = 2026

Mixtape uses the Kansas and Cheng-Hoekstra castle-doctrine data for its worked examples. Neither is bundled with StatsPAI, so we replicate the pedagogical pattern — heterogeneous, staggered treatment on a panel — using the canonical DGP:

df = sp.dgp_did(
    n_units=200, n_periods=10,
    staggered=True, n_groups=4,
    effect=0.5, heterogeneous=True,
    seed=RNG_SEED,
)
df.head()
# columns: unit, time, y, treated, first_treat, group

first_treat is NaN for never-treated units and an integer period otherwise — the standard Callaway-Sant'Anna convention.

1. The TWFE baseline (and why it's wrong under heterogeneity)¶

Mixtape first presents the classical two-way-fixed-effects estimator:

twfe = sp.panel(
    df,
    "y ~ treated",
    entity="unit", time="time",
    method="fe",
    cluster="unit",
)
twfe.summary()

Under homogeneous effects this is the canonical DID. Under staggered adoption with heterogeneous effects, Goodman-Bacon (2021) shows the TWFE coefficient is a weighted average of every 2×2 DID comparison including ones where already-treated units serve as the control group — so some weights can be negative.

Run the Bacon decomposition to see it:

bacon = sp.bacon_decomposition(
    data=df, y="y", treat="treated",
    time="time", id="unit",
)
sp.bacon_plot(bacon)

Every bar labelled "Later vs. earlier (treated)" is the pathological contribution Goodman-Bacon flagged.

2. The modern staggered-DID suite (CS / SA / BJS)¶

Since 2021 the canonical answer is to report three estimators in parallel: Callaway-Sant'Anna (ATT decomposition), Sun-Abraham (weighted event study), and Borusyak-Jaravel-Spiess (imputation). StatsPAI ships a one-liner for the race:

race = sp.auto_did(
    df, y="y", g="first_treat", t="time", i="unit",
)
print(race.summary())

which prints:

auto_did: staggered-DiD method race
============================================================
method  estimate  std_error  ci_lower  ci_upper  n_obs notes
    CS    0.5092     0.0538    0.4037    0.6146   2000    ok
    SA    0.4923     0.1091    0.2786    0.7061   2000    ok
   BJS    0.5039     0.0240    0.4569    0.5509   2000    ok

selected winner : bjs (rule=median)

All three land within 0.02 of the true effect (0.5). The median-winner rule (select_by='median') is the default; pass select_by='cs' (etc.) to force a specific headline estimate.

3. Event-study visualisation (Sun-Abraham)¶

Mixtape emphasises the event-study plot as the primary evidence for parallel pre-trends. StatsPAI:

sa = sp.sun_abraham(
    data=df, y="y", g="first_treat", t="time", i="unit",
    event_window=(-3, 3),
)
sp.enhanced_event_study_plot(sa)

Pre-period coefficients should cluster around zero; post-period coefficients should show the dynamic path of the treatment.

4. Honesty: Rambachan-Roth (2023) sensitivity¶

What if parallel trends doesn't quite hold? Rambachan & Roth's "Honest DID" machinery answers how much drift the result can absorb before losing significance.

cs = sp.callaway_santanna(
    data=df, y="y", g="first_treat", t="time", i="unit",
)
dyn = sp.aggte(cs, type="dynamic")

# Breakdown value: largest allowable pre-trend drift per period
m_star = sp.breakdown_m(dyn, e=0, method="smoothness")
print(f"Breakdown M* = {m_star:.4f}")

# Full sensitivity frame at several candidate M values
sens = sp.honest_did(dyn, e=0, method="smoothness",
                     m_grid=list(np.linspace(0, 0.5, 11)))
sens.tail()

A large M* means the effect survives substantial violations of parallel trends — the result is robust. A small M* (or zero!) means the conclusion is fragile and should be flagged in the paper.

5. Multiverse / specification curve¶

Cunningham's Mixtape ends every DID chapter with a specification check. StatsPAI ships two complementary multiverse tools — pick the one that matches what you want to vary:

5a. Vary the estimator (CS vs SA vs BJS)¶

That's sp.auto_did, which we already used in Section 2. The leaderboard is itself the multiverse — three identification strategies, one data set, three point estimates displayed side by side.

5b. Vary the control set — `sp.spec_curve`¶

# Build a cross-sectional view of the last period (spec_curve takes a
# single-observation-per-unit frame, not the panel) for a covariate
# multiverse on a post-treatment outcome.
last = df[df["time"] == df["time"].max()]

sc = sp.spec_curve(
    data=last,
    y="y",
    x="treated",
    controls=[["group"], ["group", "unit"]],  # list-of-list: each inner list is one spec
)
sc.plot()

Every combination of a control set × an SE type × an (optional) sample subset becomes one dot on the curve; stable estimates across the curve imply the result is design-driven rather than specification-driven.

Note — the spec_curve API is a covariate / SE multiverse, not an estimator multiverse. If you want to race TWFE against CS against BJS, use auto_did above; if you want to see how the headline estimate moves when you change the control set, use this section.

6. What comes next¶

Chapter 6 replication — sharp and fuzzy RD with sp.rdrobust and sp.rd_honest (McCrary manipulation + robust bias correction).
Chapter 8 replication — IV with sp.ivreg, weak-IV-robust CIs via sp.anderson_rubin_ci / sp.conditional_lr_ci, and the Lee-McCrary (2022) sp.tF_adjustment critical value.
Chapter 10 replication — synthetic control with sp.synth, sp.sdid, and the full donor-pool diagnostic suite.

Each follows the same pattern: (1) classical baseline → (2) modern race → (3) event-study visualisation → (4) sensitivity → (5) spec curve.

Citation¶

If you use this guide or StatsPAI in academic work, please cite both:

@book{cunningham2021causal,
  title={Causal Inference: The Mixtape},
  author={Cunningham, Scott},
  year={2021},
  publisher={Yale University Press},
  url={https://mixtape.scunning.com/}
}

@software{wang2026statspai,
  title={StatsPAI: The Agent-Native Causal Inference and Econometrics Toolkit for Python},
  author={Wang, Biaoyue and Rozelle, Scott},
  year={2026},
  url={https://github.com/brycewang-stanford/StatsPAI},
  version={1.16.0}
}

For Agents¶

Pre-conditions - data is panel or repeated cross-section with a time column - treat column is binary (0/1) for 2x2, or first-treatment-period (int) for staggered - at least one pre-treatment period (≥ 2 periods for 2x2; ≥ 3 recommended for event study) - for staggered designs: id column identifying units across time

Identifying assumptions - Parallel trends: treated and control groups would have followed the same trajectory absent treatment - No anticipation: outcomes in pre-treatment periods are unaffected by future treatment - SUTVA: no spillovers between units - For staggered / heterogeneous effects: use CS or SA — TWFE can produce negative weights (Goodman-Bacon)

Failure modes → recovery

Symptom	Exception	Remedy	Try next
Pre-trend joint test p < 0.05 (or underpowered at 0.10)	`AssumptionViolation`	Use sp.sensitivity_rr (Rambachan & Roth honest CI) or switch to sp.callaway_santanna.	`sp.sensitivity_rr`
Staggered treatment timing with TWFE method	`AssumptionWarning`	TWFE can give negative weights; use Callaway-Sant'Anna, Sun-Abraham, or BJS imputation.	`sp.callaway_santanna`
Pre-trend test underpowered (Roth 2022)	`AssumptionWarning`	Check sp.pretrends_power — if low, report honest CI via sp.sensitivity_rr.	`sp.sensitivity_rr`
Few clusters at unit level	`AssumptionWarning`	Use wild cluster bootstrap (sp.wild_cluster_bootstrap).	`sp.wild_cluster_bootstrap`

Alternatives (ranked) - sp.callaway_santanna - sp.sun_abraham - sp.did_imputation - sp.sdid - sp.synth

Typical minimum N: 50

For Agents¶

Pre-conditions - panel data with unit × time × outcome - g column is integer: first-treated period or 0 for never-treated - at least one never-treated or late-treated control group - ≥ 2 pre-treatment periods per cohort - data is panel or repeated cross-section with a time column - treat column is binary (0/1) for 2x2, or first-treatment-period (int) for staggered - at least one pre-treatment period (≥ 2 periods for 2x2; ≥ 3 recommended for event study) - for staggered designs: id column identifying units across time

Identifying assumptions - Parallel trends conditional on X (if covariates supplied) - No anticipation (or adjust via anticipation= parameter) - Overlap: positive propensity for each cohort - SUTVA - Parallel trends: treated and control groups would have followed the same trajectory absent treatment - No anticipation: outcomes in pre-treatment periods are unaffected by future treatment - SUTVA: no spillovers between units - For staggered / heterogeneous effects: use CS or SA — TWFE can produce negative weights (Goodman-Bacon)

Failure modes → recovery

Symptom	Exception	Remedy	Try next
Pre-trend test on aggregated ATT(g,t) rejects	`AssumptionViolation`	Use sp.sensitivity_rr for honest CI, or add covariates for conditional parallel trends.	`sp.sensitivity_rr`
Cohort with only one unit — insufficient variation	`DataInsufficient`	Aggregate small cohorts or drop; check sp.diagnose_result.
All units treated at the same time (no staggering)	`MethodIncompatibility`	Fall back to 2x2 DID via sp.did(method='2x2').	`sp.did`
Pre-trend joint test p < 0.05 (or underpowered at 0.10)	`AssumptionViolation`	Use sp.sensitivity_rr (Rambachan & Roth honest CI) or switch to sp.callaway_santanna.	`sp.sensitivity_rr`
Staggered treatment timing with TWFE method	`AssumptionWarning`	TWFE can give negative weights; use Callaway-Sant'Anna, Sun-Abraham, or BJS imputation.	`sp.callaway_santanna`
Pre-trend test underpowered (Roth 2022)	`AssumptionWarning`	Check sp.pretrends_power — if low, report honest CI via sp.sensitivity_rr.	`sp.sensitivity_rr`
Few clusters at unit level	`AssumptionWarning`	Use wild cluster bootstrap (sp.wild_cluster_bootstrap).	`sp.wild_cluster_bootstrap`

Alternatives (ranked) - sp.sun_abraham - sp.did_imputation - sp.sdid - sp.did - sp.callaway_santanna - sp.synth

Typical minimum N: 50

Mixtape Chapter 9 — Difference-in-Differences (StatsPAI replication)¶

0. Setup¶

1. The TWFE baseline (and why it's wrong under heterogeneity)¶

2. The modern staggered-DID suite (CS / SA / BJS)¶

3. Event-study visualisation (Sun-Abraham)¶

4. Honesty: Rambachan-Roth (2023) sensitivity¶

5. Multiverse / specification curve¶

5a. Vary the estimator (CS vs SA vs BJS)¶

5b. Vary the control set — sp.spec_curve¶

6. What comes next¶

Citation¶

For Agents¶

For Agents¶

5b. Vary the control set — `sp.spec_curve`¶