Cookbook — recipes by research question

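Find the method by the question you are actually asking, not by its textbook name. Each recipe is a minimal starting point: those that load a bundled dataset from sp.datasets run as written, while the rest use placeholder column names that you replace with your own. Follow the linked guide or API reference for the full options.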

Let StatsPAI choose

If you are unsure which recipe applies, sp.recommend(df, y=..., treat=...) and sp.detect_design(df) suggest an estimator based on the structure of the data.


"A policy turned on for different units at different times"

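Staggered-adoption difference-in-differences. Two-way fixed effects can be biased here because it uses already-treated units as controls, which goes wrong when treatment effects differ across cohorts or change over time; use a heterogeneity-robust estimator.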

import statspai as sp
df = sp.datasets.mpdta()
r = sp.callaway_santanna(df, y="lemp", g="first_treat", t="year", i="countyreal")
r.summary()

Choosing a DID estimator · Callaway–Sant'Anna guide
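
The building block behind heterogeneity-robust estimators of this kind is a clean two-group, two-period comparison: one treated cohort against never-treated units, measured from the cohort's last pre-treatment period. The following is a minimal plain-Python sketch of that group-time effect on made-up numbers. The data, the `att_gt` helper, and the convention that `first_treat == 0` marks never-treated units are assumptions for illustration, not StatsPAI internals.

```python
from statistics import mean

# Toy panel (hypothetical numbers): (unit, year, first_treat, outcome).
# Assumption: first_treat == 0 marks never-treated units.
rows = [
    ("a", 2003, 2004, 1.0), ("a", 2004, 2004, 1.8), ("a", 2005, 2004, 2.1),
    ("b", 2003, 2004, 1.2), ("b", 2004, 2004, 2.0), ("b", 2005, 2004, 2.3),
    ("c", 2003, 0, 0.9), ("c", 2004, 0, 1.0), ("c", 2005, 0, 1.1),
    ("d", 2003, 0, 1.1), ("d", 2004, 0, 1.2), ("d", 2005, 0, 1.3),
]

def group_mean(rows, cohort, year):
    """Mean outcome of one adoption cohort in one year."""
    return mean(y for _, t, g, y in rows if g == cohort and t == year)

def att_gt(rows, cohort, year):
    """Cohort's change since its last pre-treatment year, minus the
    never-treated group's change over the same window."""
    base = cohort - 1
    treated_change = group_mean(rows, cohort, year) - group_mean(rows, cohort, base)
    control_change = group_mean(rows, 0, year) - group_mean(rows, 0, base)
    return treated_change - control_change

att_2004 = att_gt(rows, 2004, 2004)  # effect in the adoption year
att_2005 = att_gt(rows, 2004, 2005)  # effect one year after adoption
print(round(att_2004, 3), round(att_2005, 3))
```

In this toy panel the effect is 0.7 in the adoption year and 0.9 a year later. A full estimator would also handle covariates, aggregate across cohorts and periods, and report inference, none of which this sketch attempts.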

"One unit got treated and I have many untreated comparison units"

Synthetic control: build a weighted combination of untreated donor units that tracks the treated unit's outcome before treatment, then read the effect off the gap that opens after treatment.

# df: long panel with one row per unit and period; column names and values below are placeholders
r = sp.synth(df, y="outcome", unit="state", time="year",
             treated="California", treat_period=1989)
r.plot()

Synthetic control guide · sp.synth family
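
To make the weighting idea concrete, here is a deliberately small sketch with one treated unit and two donors, where the single weight is found by grid search over the pre-treatment fit. All numbers and helper names are hypothetical, and the grid search stands in for the constrained optimization over many donors that real synthetic-control routines perform.

```python
# Hypothetical pre- and post-treatment outcomes for one treated unit and two donors.
treated_pre, treated_post = [9.5, 10.5, 11.5, 12.5], [12.0, 12.5]
donor_a_pre, donor_a_post = [8.0, 9.0, 10.0, 11.0], [12.0, 13.0]
donor_b_pre, donor_b_post = [14.0, 15.0, 16.0, 17.0], [18.0, 19.0]

def synthetic(w, a, b):
    """Convex combination of two donor series with weight w on donor A."""
    return [w * ai + (1 - w) * bi for ai, bi in zip(a, b)]

def pre_period_mse(w):
    """Mean squared gap between the treated unit and its synthetic twin
    over the pre-treatment periods."""
    synth = synthetic(w, donor_a_pre, donor_b_pre)
    return sum((t - s) ** 2 for t, s in zip(treated_pre, synth)) / len(treated_pre)

# Weights are restricted to [0, 1] and sum to one, as in standard synthetic control.
grid = [i / 100 for i in range(101)]
best_w = min(grid, key=pre_period_mse)

# Post-treatment gap: observed outcome minus the synthetic counterfactual.
counterfactual = synthetic(best_w, donor_a_post, donor_b_post)
gaps = [t - s for t, s in zip(treated_post, counterfactual)]
print(best_w, gaps)
```

The chosen weight reproduces the pre-treatment path exactly in this constructed example, so the post-treatment gaps can be read as the estimated effect. With real data the pre-treatment fit is never perfect, and its quality should be inspected before interpreting the gaps.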

"Treatment is endogenous but I have an instrument"

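Instrumental variables. Check the first stage before trusting the estimate: a weak instrument biases two-stage least squares toward the OLS estimate and makes conventional standard errors unreliable.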

data = sp.datasets.card_1995()
r = sp.ivreg("lwage ~ (educ ~ nearc4) + exper + black + south", data=data)
r.summary()
# weak-instrument-robust reporting bundle:
sp.iv_diag(data, y="lwage", endog="educ", instruments="nearc4")

Choosing an IV estimator · IV reference
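
With a single instrument, the IV estimate is the ratio of the reduced-form covariance to the first-stage covariance, and first-stage strength can be summarized by an F statistic. The sketch below computes both by hand on made-up data; the variable names and numbers are hypothetical, and the F statistic assumes homoskedastic errors, which is a simplification.

```python
from statistics import mean

def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

def first_stage_f(z, x):
    """Homoskedastic F statistic for regressing x on a single instrument z
    (equal to the squared t statistic on z)."""
    n = len(z)
    mz, mx = mean(z), mean(x)
    szz = sum((zi - mz) ** 2 for zi in z)
    slope = sum((zi - mz) * (xi - mx) for zi, xi in zip(z, x)) / szz
    intercept = mx - slope * mz
    ssr = sum((xi - intercept - slope * zi) ** 2 for zi, xi in zip(z, x))
    slope_var = (ssr / (n - 2)) / szz
    return slope ** 2 / slope_var

# Hypothetical data: binary instrument z, endogenous regressor x, outcome y.
z = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
x = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 13.0, 15.0]
y = [5.0, 6.2, 5.4, 6.6, 6.0, 7.2, 6.4, 7.6]

beta_iv = cov(z, y) / cov(z, x)   # reduced form divided by first stage
f_stat = first_stage_f(z, x)
print(round(beta_iv, 3), round(f_stat, 3))
```

In this toy sample the IV estimate is 0.5 and the first-stage F is 4.8, below the common rule-of-thumb threshold of 10, so the instrument would be flagged as weak despite the clean point estimate.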

"Treatment is assigned by a cutoff on a running variable"

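Regression discontinuity. Compare outcomes just above and just below the cutoff; the jump in the outcome at the cutoff is the local treatment effect.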

data = sp.datasets.lee_2008_senate()
r = sp.rdrobust(data["vote_t1"], data["margin"], c=0.0)
r.summary()

Choosing an RD estimator · RD reference
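
The core computation in a sharp RD is a local linear fit on each side of the cutoff, with the effect read off as the difference in fitted values at the cutoff. The sketch below does this with a uniform kernel and a hand-picked bandwidth on constructed data. The helper names, the data, and the bandwidth are assumptions for illustration; production RD estimators typically add kernel weighting, data-driven bandwidth selection, and bias-corrected inference, all of which are omitted here.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def sharp_rd(running, outcome, cutoff=0.0, bandwidth=0.3):
    """Difference between right and left intercepts at the cutoff,
    using only observations within the bandwidth (uniform kernel)."""
    left = [(r - cutoff, y) for r, y in zip(running, outcome)
            if cutoff - bandwidth <= r < cutoff]
    right = [(r - cutoff, y) for r, y in zip(running, outcome)
             if cutoff <= r <= cutoff + bandwidth]
    a_left, _ = fit_line(*zip(*left))
    a_right, _ = fit_line(*zip(*right))
    return a_right - a_left

# Constructed data: a linear trend in the running variable plus a jump of 3 at zero.
running = [-0.45, -0.35, -0.25, -0.15, -0.05, 0.05, 0.15, 0.25, 0.35, 0.45]
outcome = [1 + 2 * r + (3 if r >= 0 else 0) for r in running]

jump = sharp_rd(running, outcome)
print(round(jump, 3))
```

Because the data were built with a jump of 3 and no noise, the sketch recovers it exactly. On real data the estimate is sensitive to the bandwidth, which is why it is worth reporting results for several values.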

"I want the effect for everyone, not just the average (heterogeneity)"

Conditional average treatment effects (CATE) via meta-learners, causal forest, or double ML.

r = sp.dml(df, y="y", treat="d", covariates=["x1", "x2", "x3"], model="irm")
cate = sp.auto_cate(df, y="y", treat="d", covariates=["x1", "x2", "x3"])

Choosing an ML causal estimator
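
One of the simplest meta-learners is the T-learner: fit one outcome model on treated units and another on controls, then take the difference of their predictions at a given covariate value. The sketch below uses a single covariate and linear base models on constructed data. It illustrates the idea only and makes no claim about what sp.auto_cate does internally; the data and helper names are hypothetical, and treatment is assumed to be unconfounded given x.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

# Constructed data: the true effect is 2 + 0.5 * x, so it grows with x.
x_treated, y_treated = [0.0, 1.0, 2.0, 3.0], [3.0, 4.5, 6.0, 7.5]
x_control, y_control = [0.0, 1.0, 2.0, 3.0], [1.0, 2.0, 3.0, 4.0]

# T-learner: separate outcome models for the treated and control arms.
a1, b1 = fit_line(x_treated, y_treated)
a0, b0 = fit_line(x_control, y_control)

def cate(x):
    """Predicted treated outcome minus predicted control outcome at x."""
    return (a1 + b1 * x) - (a0 + b0 * x)

print(cate(0.0), cate(2.0))
```

The estimated effect is 2 at x = 0 and 3 at x = 2, matching the heterogeneity built into the data. With flexible base models and finite samples, the two arms can be regularized differently, which is one reason other meta-learners and forest-based methods exist.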

"I have rich confounders and want a robust observational estimate"

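Double/debiased ML or TMLE. Both are doubly robust (consistent if either the outcome model or the propensity model is correctly specified), and both need overlap: every unit must have a non-negligible probability of being treated and of being untreated.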

r_dml  = sp.dml(df, y="y", treat="d", covariates=[...], model="irm", ml_g="rf", ml_m="rf")
r_tmle = sp.tmle(df, y="y", treat="d", covariates=[...])

sp.dml vs DoubleML
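
The doubly robust property comes from the augmented inverse-probability-weighting (AIPW) score, which combines outcome-model predictions with a propensity-weighted residual correction. The sketch below evaluates that score on made-up numbers. The nuisance predictions are passed in as given values, whereas in practice they would come from cross-fitted ML models; the function name, the clipping threshold, and the data are assumptions for illustration.

```python
def aipw_ate(y, d, mu0, mu1, e, clip=0.05):
    """Average treatment effect from the AIPW score.

    y: outcomes, d: treatment indicators (0/1),
    mu0 / mu1: predicted outcomes under control / treatment,
    e: predicted propensity scores.
    Returns (estimate, standard error).
    """
    # Overlap check: extreme propensities make the weights explode.
    if min(e) < clip or max(e) > 1 - clip:
        raise ValueError("poor overlap: propensity scores too close to 0 or 1")
    scores = [
        m1 - m0 + di * (yi - m1) / ei - (1 - di) * (yi - m0) / (1 - ei)
        for yi, di, m0, m1, ei in zip(y, d, mu0, mu1, e)
    ]
    n = len(scores)
    ate = sum(scores) / n
    variance = sum((s - ate) ** 2 for s in scores) / (n - 1)
    return ate, (variance / n) ** 0.5

# Hypothetical outcomes, treatments, and nuisance predictions.
y = [3.5, 1.0, 4.0, 1.5]
d = [1, 0, 1, 0]
mu0 = [1.0, 1.0, 2.0, 2.0]
mu1 = [3.0, 3.0, 4.0, 4.0]
e = [0.5, 0.5, 0.5, 0.5]

ate, se = aipw_ate(y, d, mu0, mu1, e)
print(round(ate, 3), round(se, 3))
```

The outcome models alone would give an effect of 2.0; the residual correction moves the estimate to 2.5 because the models under-predict one treated unit and over-predict one control. The explicit overlap check is a design choice for the sketch: failing loudly is safer than silently dividing by a propensity near zero.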

"Why is the gap between two groups what it is?" (decomposition)

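Oaxaca–Blinder and RIF (recentered influence function) decompositions split the gap into a part explained by differences in covariates and a part attributed to differences in the returns to those covariates.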

r = sp.decompose(df, y="wage", group="female", covariates=["edu", "exp"],
                 method="oaxaca")

Decomposition family · Decomposition reference
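
A two-fold Oaxaca–Blinder decomposition can be worked by hand with one covariate: fit a separate regression in each group, then split the mean gap into an endowments part and a coefficients part. The sketch below uses group B's coefficients as the reference, which is one of several conventions and an assumption here; the data and helper names are hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

# Constructed data: years of education and wages for two groups.
edu_a, wage_a = [10.0, 12.0, 14.0, 16.0], [12.0, 14.0, 16.0, 18.0]
edu_b, wage_b = [12.0, 14.0, 16.0, 18.0], [11.8, 13.6, 15.4, 17.2]

a_a, b_a = fit_line(edu_a, wage_a)
a_b, b_b = fit_line(edu_b, wage_b)

gap = mean(wage_a) - mean(wage_b)
# Explained: difference in mean education, priced at group B's return.
explained = (mean(edu_a) - mean(edu_b)) * b_b
# Unexplained: difference in intercepts and returns, evaluated at group A's mean education.
unexplained = (a_a - a_b) + mean(edu_a) * (b_a - b_b)
print(round(gap, 3), round(explained, 3), round(unexplained, 3))
```

Here the raw gap of 0.5 hides two opposing forces: group A has less education (explained part of -1.8) but a higher intercept and return (unexplained part of 2.3). Switching the reference coefficients changes the split, so the choice should be stated alongside the results.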

"Match treated and control units on covariates"

Propensity-score matching, optimal matching, or entropy-balancing weights. Whichever you use, check covariate balance after the adjustment.

ps = sp.propensity_score(df, treat="d", covariates=["x1", "x2"])
w  = sp.ebalance(df, treat="d", covariates=["x1", "x2"])   # entropy balancing
sp.love_plot(sp.balance_diagnostics(df, treat="d", covariates=["x1", "x2"]))

Choosing a matching estimator
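
Balance is usually judged by the standardized mean difference (SMD) of each covariate, before and after adjustment, which is the quantity a love plot displays. The sketch below computes it by hand for one covariate. The control weights are hypothetical stand-ins for what a weighting method might return, and the convention of scaling by the unweighted pooled standard deviation is an assumption.

```python
from statistics import mean, variance

def smd(x_treated, x_control, w_control=None):
    """Standardized mean difference between treated and control units.

    The control mean is optionally weighted; the denominator is the
    unweighted pooled standard deviation so that it stays fixed
    before and after adjustment.
    """
    if w_control is None:
        control_mean = mean(x_control)
    else:
        control_mean = (sum(w * x for w, x in zip(w_control, x_control))
                        / sum(w_control))
    pooled_sd = ((variance(x_treated) + variance(x_control)) / 2) ** 0.5
    return (mean(x_treated) - control_mean) / pooled_sd

# Hypothetical covariate values and control weights.
x_treated = [2.0, 3.0, 4.0]
x_control = [1.0, 2.0, 3.0, 4.0]
weights = [1.0, 2.0, 3.0, 4.0]

smd_raw = smd(x_treated, x_control)
smd_weighted = smd(x_treated, x_control, weights)
print(round(smd_raw, 3), round(smd_weighted, 3))
```

The raw SMD is about 0.43, well above the 0.1 level often used as a rough threshold, and the weights bring it to zero. Balance on means does not guarantee balance on the whole distribution, so higher moments are worth checking too.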

"Panel regression with many fixed effects"

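reghdfe-style estimation with high-dimensional fixed effects: the terms after | in the formula are absorbed rather than estimated as dummy coefficients.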

r = sp.hdfe_ols("y ~ x1 + x2 | firm + year", data=df, cluster="firm")

Panel reference
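
Absorbing a fixed effect amounts to the within transformation: subtract each group's mean from the outcome and the regressors, then run OLS on the demeaned data. The sketch below does this for a single fixed effect on constructed data and contrasts it with pooled OLS. The helper names and numbers are hypothetical. With two or more fixed-effect dimensions, one pass of demeaning is generally not enough in unbalanced panels, and dedicated routines iterate instead.

```python
from collections import defaultdict

def demean_by_group(values, groups):
    """Subtract each group's mean from its members (within transformation)."""
    totals, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        totals[g] += v
        counts[g] += 1
    return [v - totals[g] / counts[g] for v, g in zip(values, groups)]

def slope(xs, ys):
    """OLS slope of y on x with an intercept."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

# Constructed data: y = firm effect + 2 * x, with firm effects 10 (A) and 0 (B).
firm = ["A", "A", "A", "B", "B", "B"]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [12.0, 14.0, 16.0, 8.0, 10.0, 12.0]

pooled_slope = slope(x, y)  # ignores firm effects
within_slope = slope(demean_by_group(x, firm), demean_by_group(y, firm))
print(round(pooled_slope, 3), round(within_slope, 3))
```

Pooled OLS returns a negative slope because the firm with larger x has a lower baseline, while the within estimator recovers the true slope of 2. This sign flip is the reason to absorb fixed effects when unit-level baselines correlate with the regressors.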


After any estimate: the agent-native follow-ups

r.summary()        # human-readable
r.to_dict()        # structured payload for agents
sp.audit(r)        # what robustness checks are missing?
r.cite()           # verified BibTeX
r.to_latex(...)    # publication export