Sensitivity analysis: which tool for which design
A point estimate with a small standard error answers exactly one question: given my identifying assumptions, what is the effect? It says nothing about the question every reviewer actually asks: what if the identifying assumptions are wrong — and by how much can they be wrong before the conclusion dies?
Sensitivity analysis answers that second question. StatsPAI ships the
major tools — sp.sensemakr, sp.oster_bounds, sp.evalue,
sp.rosenbaum_bounds, sp.honest_did, sp.sensitivity_rr,
sp.weakrobust, sp.dml_sensitivity, sp.manski_bounds,
sp.lee_bounds — but each one is tied to a specific research design
and a specific threatened assumption. Running the wrong one is not
conservative, it is meaningless: an E-value cannot rescue a DiD with
broken parallel trends, and sp.honest_did says nothing about omitted
variables in a cross-sectional regression. This guide is the decision
tree.
Robustness checks vs. sensitivity analysis
The two are routinely conflated. They are different layers of the same defence (see the robustness workflow guide):
| | Robustness check | Sensitivity analysis |
|---|---|---|
| Question | Does the estimate survive alternative specifications? | How badly can an untestable assumption fail before the conclusion flips? |
| Varies | Things you can observe: covariate sets, bandwidths, subsamples, estimators | Things you cannot observe: unmeasured confounding, parallel-trends violations, selection |
| Output | A set of alternative point estimates | A breakdown value: "the conclusion survives violations up to this magnitude" |
| Tooling | sp.spec_curve, sp.robustness_report, sp.rdbwsensitivity | Everything in this guide |
A spec curve with 40 stable estimates is worthless against unmeasured confounding — every one of those 40 specifications conditions on the same observed covariates. Sensitivity analysis is the only honest answer to "what about the variable you didn't measure?", and its output is not a yes/no verdict but a breakdown magnitude that the reader judges against domain knowledge.
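To see why a stable specification curve offers no protection here, consider a minimal simulation sketch. It uses plain NumPy rather than sp.spec_curve; the data-generating process, the pure-noise covariates x3 to x5, and the helper ols_effect are illustrative assumptions, not part of StatsPAI.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
# u is a confounder that no specification ever observes
x1, x2, x3, x4, x5, u = (rng.normal(size=n) for _ in range(6))
d = 0.5 * x1 + 0.3 * x2 + 0.4 * u + rng.normal(size=n)
y = 1.0 * d + 0.8 * x1 + 0.5 * x2 + 0.6 * u + rng.normal(size=n)  # true effect = 1.0

optional = {"x3": x3, "x4": x4, "x5": x5}  # covariates the "spec curve" toggles


def ols_effect(extra):
    """Coefficient on d from OLS of y on [1, d, x1, x2] plus the extra controls."""
    cols = [np.ones(n), d, x1, x2] + [optional[name] for name in extra]
    coef, *_ = np.linalg.lstsq(np.column_stack(cols), y, rcond=None)
    return coef[1]


# Every subset of the optional covariates is one specification (8 in total)
specs = [s for k in range(len(optional) + 1)
         for s in itertools.combinations(optional, k)]
estimates = {spec: ols_effect(spec) for spec in specs}

for spec, est in estimates.items():
    print(f"{', '.join(spec) or '(baseline)':<12} {est:.3f}")
print("spread across specifications:",
      round(max(estimates.values()) - min(estimates.values()), 4))
```

In this sketch the estimates barely move across specifications, yet all of them sit well above the true effect of 1.0, because every specification omits u in exactly the same way.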
The decision tree
Start from your design, not from the tool:
What is your design, and which assumption is under attack?

1. Selection on observables (regression / matching / weighting / AIPW)
   threat: an unmeasured confounder U
   ├── You fit an OLS-style regression and can name benchmark covariates
   │     -> sp.sensemakr (partial-R^2 robustness value, Cinelli-Hazlett)
   ├── You want "how much would unobservables need to matter relative
   │   to the observables I added?"
   │     -> sp.oster_bounds (coefficient-stability delta*, Oster)
   ├── You want a scale-free single number any referee understands
   │     -> sp.evalue / sp.evalue_from_result (VanderWeele-Ding)
   └── You ran pair-matching (sp.psm / sp.match)
         -> sp.rosenbaum_bounds (hidden-bias Gamma)

2. Difference-in-differences / event study
   threat: parallel trends fails post-treatment
   ├── ALWAYS pair the pre-trend test with its power:
   │     sp.pretrends_test + sp.pretrends_power (Roth)
   └── Honest partial identification under bounded violations:
         sp.honest_did / sp.sensitivity_rr / sp.breakdown_m
         (Rambachan-Roth; see the honest DiD guide)

3. Instrumental variables
   threat: the instrument is weak (relevance, not exclusion)
   └── sp.weakrobust (AR + CLR + K + effective F in one panel)
       sp.anderson_rubin_test / sp.anderson_rubin_ci for AR alone
   note: exclusion-restriction failure is NOT covered here —
         no statistic can test it; argue it, or bound it (sp.iv_bounds)

4. Double / debiased machine learning
   threat: unmeasured confounder survives the flexible nuisances
   └── sp.dml_sensitivity (DML-flavoured omitted-variable bounds,
       Chernozhukov-Cinelli-Newey-Sharma-Syrgkanis)

5. Missing, truncated, or selected outcomes
   threat: who you observe depends on treatment
   ├── Outcome bounded, no assumptions you trust
   │     -> sp.manski_bounds (worst-case; add 'mtr'/'mts' to tighten)
   ├── Outcome observed only conditional on selection (employment,
   │   survival, attrition) and treatment shifts selection one way
   │     -> sp.lee_bounds (trimming bounds)
   └── Outcomes/covariates missing not-at-random
         -> sp.horowitz_manski (worst-case imputation bounds)
Quick reference:
| Design | Threatened assumption | Tool | Breakdown quantity |
|---|---|---|---|
| Regression / SOO | Unconfoundedness | sp.sensemakr | Robustness value RV (partial R²) |
| Regression / SOO | Unconfoundedness | sp.oster_bounds | δ* (relative selection) |
| Any SOO estimate | Unconfoundedness | sp.evalue | E-value (risk-ratio scale) |
| Pair matching | Hidden bias in matching | sp.rosenbaum_bounds | Γ* (odds of treatment) |
| DiD / event study | Parallel trends | sp.honest_did, sp.sensitivity_rr | Breakdown M / M̄ |
| DiD pre-test | Power of pre-test | sp.pretrends_power | Power against given violation |
| IV | Instrument relevance | sp.weakrobust | Effective F, AR/CLR CI |
| DML | Residual confounding | sp.dml_sensitivity | RV_q (confounder partial R² with DML residuals) |
| Bounded outcome | Point identification | sp.manski_bounds | Identified set |
| Selected outcome | Sample-selection ignorability | sp.lee_bounds | Trimming bounds |
The rest of this guide walks each branch with a runnable example and — more importantly — tells you how to read the output.
Branch 1 — Selection on observables
You estimated an effect from observational data by conditioning on covariates (OLS, matching, weighting, AIPW). The untestable assumption is unconfoundedness: no unmeasured variable drives both treatment and outcome. Four tools, four parameterisations of the same threat.
All four examples below share this simulated dataset, which has a
genuine unobserved confounder u baked in:
import numpy as np
import pandas as pd
import statspai as sp
rng = np.random.default_rng(42)
n = 2000
x1 = rng.normal(size=n) # observed confounder
x2 = rng.normal(size=n) # observed confounder
u = rng.normal(size=n) # UNOBSERVED confounder
d = (0.5 * x1 + 0.3 * x2 + 0.4 * u + rng.normal(size=n) > 0).astype(int)
y = 1.0 * d + 0.8 * x1 + 0.5 * x2 + 0.6 * u + rng.normal(size=n)
df = pd.DataFrame({"y": y, "d": d, "x1": x1, "x2": x2})
The regression of y on d, x1, x2 gives a biased estimate of
about 1.32 (truth: 1.00) because u is omitted. The sensitivity tools
cannot detect this bias — nothing can — but they quantify how strong
u would have to be to explain various amounts of it.
1a. Sensemakr (Cinelli-Hazlett)
s = sp.sensemakr(df, y="y", treat="d", controls=["x1", "x2"],
benchmark=["x1"])
print(s["rv_q"]) # 0.400 — robustness value for the point estimate
print(s["rv_qa"]) # 0.374 — robustness value for significance
print(s["partial_r2_yd"]) # 0.211 — partial R^2 of treatment with outcome
print(s["interpretation"])
How to read it. rv_q = 0.40 means: an unobserved confounder
would need to explain more than 40% of the residual variance of both
the treatment and the outcome to drive the estimate all the way to
zero. rv_qa = 0.37 is the (lower) bar for merely wiping out
statistical significance. The benchmark_table entry translates this
into something a referee can judge: it shows the partial R² of an
observed covariate (x1 here) and asks whether a confounder "as
strong as x1" — or 2×, 3× as strong — would breach the robustness
value. If the strongest covariate you measured explains 10% and the RV
is 40%, the confounder story needs a variable four times stronger than
anything you saw; that is an argument, not a proof.
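The robustness values can be reproduced by hand from the partial R² of the treatment with the outcome. The sketch below follows the closed-form expression in Cinelli-Hazlett as I understand it; it is not StatsPAI's implementation. The function name robustness_value, the degrees of freedom (2000 observations minus 4 coefficients), and the normal approximation to the t critical value are all assumptions made for illustration.

```python
import math
from statistics import NormalDist


def robustness_value(partial_r2_yd, dof, q=1.0, alpha=None):
    """Robustness value for reducing the estimate by a fraction q.

    partial_r2_yd: partial R^2 of the treatment with the outcome.
    dof: residual degrees of freedom of the regression (assumed n - k).
    alpha: if given, return the RV for losing significance at that level,
           using a normal approximation to the t critical value.
    """
    f = math.sqrt(partial_r2_yd / (1.0 - partial_r2_yd))  # partial Cohen's f
    f_q = q * f
    if alpha is not None:
        crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
        f_q -= crit / math.sqrt(dof - 1)
    if f_q <= 0:
        return 0.0  # already not distinguishable from the target
    return 0.5 * (math.sqrt(f_q**4 + 4.0 * f_q**2) - f_q**2)


# Inputs taken from the printed output above: partial_r2_yd = 0.211
rv_q = robustness_value(0.211, dof=1996)
rv_qa = robustness_value(0.211, dof=1996, alpha=0.05)
print(round(rv_q, 3), round(rv_qa, 3))  # 0.4 0.374
```

Under these assumptions the hand calculation agrees with the rv_q and rv_qa values printed above, which makes clear that the RV is a deterministic transformation of the treatment's t-statistic and degrees of freedom, not an extra piece of evidence.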
1b. Oster coefficient-stability bounds
Oster's δ asks the question in movement terms: when you added controls, how much did the coefficient move and how much did R² rise? If adding the observables barely moved the coefficient while explaining a lot of variance, the unobservables would have to be disproportionately selected to undo the result.
b = sp.oster_bounds(df, y="y", treat="d", controls=["x1", "x2"],
delta=1.0)
print(b["beta_short"]) # 2.00 — no controls
print(b["beta_long"]) # 1.32 — with controls
print(b["delta_for_zero"]) # 1.56 — delta* needed to zero the effect
print(b["identified_set"]) # (0.47, 1.32) under delta=1, R_max=1.3*R2
print(b["interpretation"])
How to read it. delta_for_zero = 1.56 means unobservable
confounding would need to be 1.56 times as important as the
observable confounding you already controlled for to explain the
entire effect. Oster's heuristic treats |δ*| > 1 as robust — the
observables are usually chosen because they are the most important
confounders, so "unobservables matter even more" is a strong claim.
The identified_set is where the true effect lies if unobservables
are exactly as important as observables (δ = 1) and the hypothetical
full-controls R² is r_max: if 0 falls inside it, the result is
fragile. Note the trap in this example: the set (0.47, 1.32) excludes
zero — "robust" by Oster's standard — yet the truth (1.00) is well
below beta_long. Robust-to-zero is not the same as unbiased.
If you only have published regression output (no microdata), pass the
summary statistics directly: sp.oster_bounds(beta_short=...,
r2_short=..., beta_long=..., r2_long=..., r_max=...).
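For intuition about what the summary statistics do, here is a minimal sketch of the linear approximation to Oster's bias-adjusted coefficient. It is not the exact estimator (which solves a cubic) and not necessarily what sp.oster_bounds computes. The coefficients are the ones printed above; the two R² values are hypothetical placeholders, and both function names are illustrative.

```python
def oster_beta_star(beta_short, r2_short, beta_long, r2_long, r_max, delta=1.0):
    """Approximate bias-adjusted effect: shift beta_long further in the
    direction it already moved, scaled by how much R^2 is left to explain."""
    movement = beta_short - beta_long
    remaining = (r_max - r2_long) / (r2_long - r2_short)
    return beta_long - delta * movement * remaining


def oster_delta_for_zero(beta_short, r2_short, beta_long, r2_long, r_max):
    """Value of delta at which the approximate adjusted effect equals zero."""
    return beta_long * (r2_long - r2_short) / (
        (beta_short - beta_long) * (r_max - r2_long)
    )


# beta_short / beta_long from the example; r2_short / r2_long are HYPOTHETICAL
stats = dict(beta_short=2.00, r2_short=0.30, beta_long=1.32, r2_long=0.40,
             r_max=1.3 * 0.40)

beta_star = oster_beta_star(**stats, delta=1.0)
delta_star = oster_delta_for_zero(**stats)
print(round(beta_star, 3), round(delta_star, 2))  # 0.504 1.62
```

The approximation shows the mechanics: a large coefficient movement combined with a small R² gain pushes the adjusted effect towards zero quickly, while a generous r_max widens the identified set.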
1c. E-value (VanderWeele-Ding)
The E-value is the most portable of the four: it needs only the estimate and its CI, works on risk ratios, odds ratios, hazard ratios, and (after standardisation) linear coefficients, and is defined without any modelling assumption on the confounder (Ding-VanderWeele).
# A linear (OLS) effect: pass the coefficient, its SE, and the outcome SD
ev = sp.evalue(estimate=0.6, se=0.12, sd=2.0, measure="OLS")
print(ev["evalue_estimate"]) # 1.96
print(ev["evalue_ci"]) # 1.64
print(ev["interpretation"])
# Ratio-scale estimates are passed directly
ev2 = sp.evalue(estimate=1.8, ci=(1.2, 2.7), measure="OR", rare=False)
print(ev2["evalue_estimate"], ev2["evalue_ci"]) # 2.02 1.42
How to read it. An E-value of, say, 2.3 means, in plain language: an
unmeasured confounder would have to be associated with both the
treatment and the outcome by a risk ratio of at least 2.3 each — above
and beyond the measured covariates — to fully explain away the
observed effect; anything weaker could shrink it but not erase it.
The companion evalue_ci applies the same logic to the CI limit
nearest the null — the bar for destroying significance rather than the
point estimate. Calibrate against your data: if the strongest measured
covariate has RR ≈ 1.5 with the outcome, a confounder needing RR ≥ 2.3
on both arms is a demanding ask. Rule of thumb from the
robustness workflow guide: E > 2 is
reasonably robust, E > 3 strong, E < 1.5 fragile.
For a fitted CausalResult (from sp.dml, sp.psm, sp.aipw, …)
use sp.evalue_from_result(r) — it standardises the estimate and
converts to the risk-ratio scale for you (shown in Branch 4 below).
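The E-value itself is a one-line formula, so the numbers above can be checked by hand. The sketch below uses the VanderWeele-Ding formula E = RR + sqrt(RR × (RR − 1)) together with their approximate conversions (square root of the odds ratio for a common outcome, exp(0.91 × d) for a standardised mean difference d). The helper names are illustrative and are not StatsPAI functions.

```python
import math


def evalue_rr(rr):
    """E-value for a risk ratio; protective effects are inverted first."""
    rr = 1.0 / rr if rr < 1.0 else rr
    return rr + math.sqrt(rr * (rr - 1.0))


def evalue_interval(lo, hi):
    """E-value for the CI limit nearest the null (1.0 if the CI covers 1)."""
    if lo <= 1.0 <= hi:
        return 1.0
    return evalue_rr(lo if lo > 1.0 else hi)


def or_to_rr(odds_ratio):
    """Approximate RR for a common (non-rare) outcome."""
    return math.sqrt(odds_ratio)


def smd_to_rr(d):
    """Approximate RR from a standardised mean difference."""
    return math.exp(0.91 * d)


# Odds-ratio example from the text: OR = 1.8, CI (1.2, 2.7), outcome not rare
e_or = evalue_rr(or_to_rr(1.8))
e_or_ci = evalue_interval(or_to_rr(1.2), or_to_rr(2.7))

# Linear example from the text: estimate 0.6, SE 0.12, outcome SD 2.0
est, se, sd = 0.6, 0.12, 2.0
e_ols = evalue_rr(smd_to_rr(est / sd))
e_ols_ci = evalue_interval(smd_to_rr((est - 1.96 * se) / sd),
                           smd_to_rr((est + 1.96 * se) / sd))

print(round(e_or, 2), round(e_or_ci, 2))    # 2.02 1.42
print(round(e_ols, 2), round(e_ols_ci, 2))  # 1.96 1.64
```

Because the E-value is a monotone function of the risk ratio, it adds no information beyond the estimate and its interval; its value is in restating them on a scale that can be compared with measured covariates.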
1d. Rosenbaum bounds (matched designs)
If your design is pair matching, the natural sensitivity parameter is Γ: how much could hidden bias multiply the odds of treatment within a matched pair? At Γ = 1 matching is as-good-as-random; Γ = 2 means one unit of an identical-looking pair can have up to twice the odds of treatment of the other because of something you did not match on (Rosenbaum).
import numpy as np
import statspai as sp
rng = np.random.default_rng(7)
n = 200
treated = 0.35 + rng.normal(size=n) # matched-pair treated outcomes
control = rng.normal(size=n) # matched-pair control outcomes
rb = sp.rosenbaum_bounds(treated, control) # default grid: 1.0–3.0, step 0.1
print(rb.summary())
print(rb.gamma_critical) # 1.3
How to read it. The table reports, for each Γ, the worst-case
range of p-values consistent with that much hidden bias.
gamma_critical is the smallest Γ at which the worst-case p-value
exceeds α — that is, the result is already overturned at Γ*, not
beyond it. Here the significant Wilcoxon result survives hidden bias
up to Γ = 1.2 (worst-case p = 0.040) and dies at Γ = 1.3 (worst-case
p = 0.103). Note that the reported Γ* is grid-resolution-dependent —
a coarse grid like gamma_grid=[1.0, 1.5, 2.0] would round the
breakdown up to 1.5 and overstate the robustness — so keep the default
fine grid (or finer) when reporting. The larger Γ* is, the more
hidden bias the finding tolerates; a Γ* barely above 1 means even
mild unmatched heterogeneity could account for the entire result.
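The logic of the bound, and the grid-resolution trap, can be sketched with the simpler sign-test version of Rosenbaum's argument. This is a swapped-in technique for illustration: sp.rosenbaum_bounds is described above as using the Wilcoxon statistic, whereas the sketch only counts positive pair differences. The pair counts (200 pairs, 120 positive) and both function names are hypothetical.

```python
import math


def worst_case_pvalue(n_pairs, n_positive, gamma):
    """Upper bound on the one-sided sign-test p-value under hidden bias gamma.

    With bias at most gamma, the chance that the treated unit has the larger
    outcome in a pair is at most gamma / (1 + gamma) under the null.
    """
    p_plus = gamma / (1.0 + gamma)
    return sum(
        math.comb(n_pairs, k) * p_plus**k * (1.0 - p_plus) ** (n_pairs - k)
        for k in range(n_positive, n_pairs + 1)
    )


def gamma_critical(n_pairs, n_positive, grid, alpha=0.05):
    """Smallest gamma on the grid whose worst-case p-value exceeds alpha."""
    for gamma in grid:
        if worst_case_pvalue(n_pairs, n_positive, gamma) > alpha:
            return gamma
    return None  # conclusion survives the whole grid


n_pairs, n_positive = 200, 120  # HYPOTHETICAL matched-pair summary
fine_grid = [round(1.0 + 0.1 * k, 1) for k in range(21)]  # 1.0, 1.1, ..., 3.0
coarse_grid = [1.0, 1.5, 2.0]

fine_crit = gamma_critical(n_pairs, n_positive, fine_grid)
coarse_crit = gamma_critical(n_pairs, n_positive, coarse_grid)
print("fine grid  :", fine_crit)
print("coarse grid:", coarse_crit)
```

In this sketch the coarse grid can only report a breakdown at one of its own points, so it reports a larger critical Γ than the fine grid does for the same data, which is the overstatement the paragraph above warns about.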
Branch 2 — Difference-in-differences
The threatened assumption is parallel trends after treatment — which no pre-trend test can verify, because the test only sees the pre period. The modern recipe (Rambachan-Roth; Roth) is a three-part package: pre-trend test, the test's power, and honest confidence intervals under bounded violations. Full background lives in the honest DiD guide; here is the sensitivity-analysis view.
import numpy as np
import pandas as pd
import statspai as sp
# Non-staggered panel: 30 units treated at t=5, 30 never treated
rng = np.random.default_rng(0)
rows = []
for i in range(60):
    alpha, g = rng.normal(), (5 if i < 30 else 0)
    for t in range(1, 9):
        treated = int(g != 0 and t >= g)
        rows.append((i, t, g, alpha + 0.3 * t + 1.5 * treated
                     + rng.normal(scale=0.5)))
df = pd.DataFrame(rows, columns=["i", "t", "g", "y"])
r = sp.callaway_santanna(df, y="y", g="g", t="t", i="i")
# (1) Pre-trend test ...
pt = sp.pretrends_test(r)
print(pt["pvalue"], pt["reject"]) # 0.358 False
# (2) ... ALWAYS paired with its power
pw = sp.pretrends_power(r)
print(round(pw["power"], 2)) # 0.15
print(pw["warning"])
# (3) Honest CIs under smoothness violations of magnitude M
hd = sp.honest_did(r, m_grid=[0.0, 0.1, 0.2, 0.5])
print(hd)
# Breakdown M in one number
print(sp.breakdown_m(r)) # 1.269
# Relative-magnitudes version (Mbar): violation post <= Mbar x worst pre
sr = sp.sensitivity_rr(r, Mbar=[0.0, 0.5, 1.0])
print(sr)
How to read it.
- pretrends_test not rejecting (p = 0.36) is necessary, not sufficient. The power calculation is what makes it honest: power 0.15 means that even if a violation as large as the hypothesised slope existed, this test would catch it only 15% of the time — a flat pre-trend plot from an underpowered test is close to uninformative (Roth). Report both numbers, always.
- sp.honest_did re-computes the CI allowing post-treatment trend violations up to magnitude M per period (smoothness flavour). The table's rejects_zero column shows where the conclusion survives. The breakdown M (1.27 here, vs. an ATT of 1.73) is the headline: parallel trends can fail by up to 1.27 outcome units per period before zero enters the CI. Judge that against the pre-period fluctuations you actually saw.
- sp.sensitivity_rr parameterises violations relative to the observed pre-trends instead: M̄ = 1 allows a post-treatment violation as large as the worst pre-treatment one. A breakdown M̄ of 0.5 (as printed here) reads: if parallel trends fails post-treatment by even half of the worst pre-treatment wobble, significance is gone. Breakdown M̄ ≥ 1 is the comfortable zone — the design survives violations as bad as anything visible in the data.
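The relative-magnitudes restriction can be illustrated at the identification level with a few lines of arithmetic. This is a deliberately simplified sketch: it treats the event-study coefficients as known (no sampling error), so it produces identified sets, not the honest confidence intervals that sp.sensitivity_rr reports. The coefficient values and function names are hypothetical.

```python
def rm_identified_sets(pre, post, mbar):
    """Identified sets for post-period effects under relative magnitudes.

    pre:  pre-period event-study coefficients in time order, ending with the
          reference period (normalised to 0).
    post: post-period coefficients in time order.
    mbar: each post-period change in the trend violation is allowed to be at
          most mbar times the largest consecutive pre-period change.
    """
    max_pre = max(abs(b - a) for a, b in zip(pre, pre[1:]))
    sets = []
    for k, beta in enumerate(post):
        # violations accumulate: k + 1 steps separate period k from the reference
        bound = (k + 1) * mbar * max_pre
        sets.append((beta - bound, beta + bound))
    return sets


def rm_breakdown(pre, post):
    """Smallest mbar at which zero enters each period's identified set."""
    max_pre = max(abs(b - a) for a, b in zip(pre, pre[1:]))
    return [abs(beta) / ((k + 1) * max_pre) for k, beta in enumerate(post)]


pre = [0.10, -0.05, 0.0]   # HYPOTHETICAL pre-period coefficients
post = [1.5, 1.6]          # HYPOTHETICAL post-period coefficients

sets = rm_identified_sets(pre, post, mbar=1.0)
breakdown = rm_breakdown(pre, post)
print([(round(lo, 2), round(hi, 2)) for lo, hi in sets])
print([round(b, 2) for b in breakdown])
```

With these made-up numbers the worst pre-period wobble is 0.15, so at M̄ = 1 the first post-period effect is bounded within 0.15 of its estimate and the second within 0.30. Later periods break down at smaller M̄ because violations are allowed to accumulate.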
Branch 3 — Instrumental variables
For IV the sensitivity question that has a statistical answer is weak instruments: when the first stage is weak, 2SLS t-tests over-reject and Wald CIs undercover, sometimes catastrophically (Staiger-Stock; Stock-Yogo). The repair is not a bigger F-statistic but identification-robust inference: Anderson-Rubin, Kleibergen's K, and Moreira's CLR remain valid at any instrument strength.
import numpy as np
import pandas as pd
import statspai as sp
rng = np.random.default_rng(3)
n = 1000
z = rng.normal(size=n)
u = rng.normal(size=n)
d = 0.15 * z + 0.8 * u + rng.normal(size=n) # weak-ish first stage
y = 0.5 * d + 0.8 * u + rng.normal(size=n)
df = pd.DataFrame({"y": y, "d": d, "z": z})
# Full weak-IV-robust panel: AR + CLR + K + effective F in one call
wr = sp.weakrobust(df, y="y", endog="d", instruments=["z"],
clr_simulations=2000)
print(wr.summary())
# Or the Anderson-Rubin test alone
ar = sp.anderson_rubin_test(df, y="y", endog="d", instruments=["z"], h0=0.0)
print(ar["effective_F"]) # 13.8
print(ar["ar_ci"]) # (0.153, 1.304)
print(ar["tF_critical_value"]) # 2.61 — use instead of 1.96
How to read it.
- The effective F (Montiel Olea-Pflueger) of 13.8 lands in the "moderate" zone: above the folk threshold of 10, but below the ~23 needed for the naive t-test to behave. In that zone, report the robust intervals, not the Wald CI.
- The AR 95% CI [0.15, 1.30] is the set of effect values β₀ not rejected by the Anderson-Rubin test. It is wider than the 2SLS Wald CI — that width is the honest price of the weak instrument. With truly weak instruments the AR CI can be unbounded; that is the method telling you the data cannot pin the effect down, not a bug.
- The tF critical value (Lee-McCrary-Moreira-Porter) gives a third option: keep the 2SLS t-statistic but compare it to 2.61 instead of 1.96, with the cutoff adapting to the observed first-stage F.
- What this branch cannot do: test the exclusion restriction. No weak-IV statistic speaks to whether z affects y only through d. That assumption must be argued from design, or relaxed into partial identification via sp.iv_bounds.
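The statement that the AR interval is "the set of β₀ not rejected" can be made concrete by inverting the test on a grid. The sketch below reuses the simulated data from this branch and plain NumPy; it handles only the one-instrument, no-covariate case, uses the chi-square(1) critical value 3.8415 as an approximation, and is not how sp.anderson_rubin_test is implemented. The grid range and the function name are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
z = rng.normal(size=n)
u = rng.normal(size=n)
d = 0.15 * z + 0.8 * u + rng.normal(size=n)   # weak-ish first stage
y = 0.5 * d + 0.8 * u + rng.normal(size=n)


def ar_statistic(beta0):
    """F-statistic for 'z has no effect on y - beta0 * d' (with intercept)."""
    e = y - beta0 * d
    Z = np.column_stack([np.ones(n), z])
    coef, *_ = np.linalg.lstsq(Z, e, rcond=None)
    rss_unrestricted = np.sum((e - Z @ coef) ** 2)
    rss_restricted = np.sum((e - e.mean()) ** 2)
    return (rss_restricted - rss_unrestricted) / (rss_unrestricted / (n - 2))


CRIT = 3.8415                           # approx. 95% critical value, 1 restriction
grid = np.linspace(-3.0, 4.0, 7001)     # candidate effects, step 0.001 (assumed range)
stats = np.array([ar_statistic(b) for b in grid])
accepted = grid[stats <= CRIT]

ar_lo, ar_hi = accepted.min(), accepted.max()
zc = z - z.mean()
beta_iv = (zc @ y) / (zc @ d)           # just-identified IV estimate

print(f"IV estimate: {beta_iv:.3f}")
print(f"AR 95% set on grid: [{ar_lo:.3f}, {ar_hi:.3f}]")
```

If the accepted set touches either end of the grid, the true AR set may be unbounded or disjoint, and reporting only its minimum and maximum would be misleading. In the just-identified case the IV point estimate always lies inside the set, because the AR statistic is exactly zero there.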
Branch 4 — Double / debiased machine learning
DML removes confounding from the covariates you feed it — flexibly,
but only those. The DML-flavoured omitted-variable framework
(Chernozhukov-Cinelli-Newey-Sharma-Syrgkanis) extends the sensemakr
logic to the partially linear / nonparametric setting: the sensitivity
parameters are the partial R² of the latent confounder with the
outcome (cf_y) and with the treatment/Riesz representer (cf_d).
import numpy as np
import pandas as pd
import statspai as sp
rng = np.random.default_rng(1)
n = 1500
X = rng.normal(size=(n, 4))
d = 0.6 * X[:, 0] - 0.4 * X[:, 1] + rng.normal(size=n)
y = 0.8 * d + X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=n)
df = pd.DataFrame(X, columns=["x1", "x2", "x3", "x4"])
df["d"], df["y"] = d, y
r = sp.dml(df, y="y", treat="d", covariates=["x1", "x2", "x3", "x4"],
model="plr")
ds = sp.dml_sensitivity(r, cf_y=0.05, cf_d=0.05,
benchmark_covariates=["x1"])
print(ds.summary())
# E-value works on any CausalResult too
ev = sp.evalue_from_result(r)
print(round(ev["evalue_estimate"], 2)) # 3.52
How to read it. The summary mirrors sensemakr's grammar:
- RV_q = 0.448: a confounder needs partial R² of ~45% with both the outcome and treatment residuals to drag the estimate to zero; RV_qa is the lower bar for losing significance.
- The bias bound row is scenario analysis: if a confounder with cf_y = cf_d = 0.05 existed, the estimate could move at most ±0.067 — the adjusted range [0.72, 0.86] still excludes zero, so the conclusion survives that scenario.
- The benchmark table calibrates the scenario: it computes the cf_y/cf_d a confounder "as strong as x1" would have, given everything else in the model. If your hypothesised confounder is "a second variable like x1", read its row instead of inventing cf_y/cf_d from thin air.
Branch 5 — Missing, truncated, or selected outcomes
When the observability of the outcome depends on treatment — attrition, employment-conditional wages, survival — no reweighting trick restores point identification without untestable assumptions. The honest output is an identified set.
5a. Manski worst-case bounds
For an outcome with known logical bounds (here binary, so [0, 1]), Manski's no-assumption bounds bracket the ATE using only the data plus the bounds themselves.
import numpy as np
import pandas as pd
import statspai as sp
rng = np.random.default_rng(5)
n = 800
d = rng.integers(0, 2, size=n)
y = (rng.uniform(size=n) < 0.4 + 0.2 * d).astype(float)
df = pd.DataFrame({"d": d, "y": y})
mb = sp.manski_bounds(df, y="y", treat="d", y_lower=0.0, y_upper=1.0,
assumption="none", n_bootstrap=100)
print(mb.model_info["lower_bound"], mb.model_info["upper_bound"])
# -0.41 0.59 (width exactly y_upper - y_lower = 1.0)
# Monotone treatment response tightens the lower bound to 0
mb_mtr = sp.manski_bounds(df, y="y", treat="d", y_lower=0.0, y_upper=1.0,
assumption="mtr", n_bootstrap=100)
print(mb_mtr.model_info["lower_bound"], mb_mtr.model_info["upper_bound"])
How to read it. The no-assumption bounds [−0.41, 0.59] always have
width y_upper − y_lower and always contain zero — by construction
they can never sign the effect. That is not a defect; it is the
honest statement of what the data alone say. Each added assumption
('mtr': treatment can't hurt anyone; 'mts': units that take treatment
have weakly higher potential outcomes than units that do not) buys
narrower bounds at the price of a defensible-or-not
behavioural claim. Present the bounds as a ladder — none → MTR → MTS —
so the reader sees exactly which assumption delivers which conclusion.
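The width-equals-range property follows directly from the formula, which a few lines of NumPy make visible. This is a sketch of the textbook no-assumption bounds on the same simulated data, without bootstrap intervals; the function name manski_ate_bounds and its mtr flag are illustrative and are not the sp.manski_bounds API.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 800
d = rng.integers(0, 2, size=n)
y = (rng.uniform(size=n) < 0.4 + 0.2 * d).astype(float)


def manski_ate_bounds(y, d, y_lower, y_upper, mtr=False):
    """Worst-case ATE bounds: fill each unobserved potential outcome
    with the logical minimum or maximum of the outcome."""
    p1 = d.mean()
    p0 = 1.0 - p1
    m1, m0 = y[d == 1].mean(), y[d == 0].mean()
    ey1 = (m1 * p1 + y_lower * p0, m1 * p1 + y_upper * p0)  # bounds on E[Y(1)]
    ey0 = (m0 * p0 + y_lower * p1, m0 * p0 + y_upper * p1)  # bounds on E[Y(0)]
    lower, upper = ey1[0] - ey0[1], ey1[1] - ey0[0]
    if mtr:
        # monotone treatment response: Y(1) >= Y(0) for everyone
        lower = max(lower, 0.0)
    return lower, upper


lo, hi = manski_ate_bounds(y, d, 0.0, 1.0)
lo_mtr, hi_mtr = manski_ate_bounds(y, d, 0.0, 1.0, mtr=True)
print(f"no assumptions: [{lo:.2f}, {hi:.2f}]  width = {hi - lo:.2f}")
print(f"with MTR      : [{lo_mtr:.2f}, {hi_mtr:.2f}]")
```

The treatment shares p1 and p0 sum to one, so the width collapses to y_upper − y_lower whatever the data look like. Only an added assumption such as MTR can move either end.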
5b. Lee trimming bounds
Lee bounds target the canonical "wages only observed for the employed" problem: treatment shifts selection (employment), so comparing observed outcomes mixes the treatment effect with a composition change. The fix trims the excess-selected group at the relevant quantiles.
import numpy as np
import pandas as pd
import statspai as sp
rng = np.random.default_rng(5)
n = 1200
d = rng.integers(0, 2, size=n)
wage = 2.0 + 0.4 * d + rng.normal(scale=0.8, size=n)
employed = (rng.uniform(size=n) < (0.55 + 0.15 * d)).astype(int)
df = pd.DataFrame({"d": d, "employed": employed,
"wage": np.where(employed == 1, wage, np.nan)})
lb = sp.lee_bounds(df, y="wage", treat="d", selection="employed",
n_bootstrap=100)
print(lb.summary())
How to read it. The example reports bounds [0.15, 0.73] for the wage effect, with a trimming fraction of ~22%: treatment raised employment by enough that 22% of the treated-and-employed have no control counterpart, and the bounds come from deleting the top vs. bottom 22% of their wage distribution. The estimand is the effect for always-selected units only (people employed either way) — Lee bounds cannot speak to marginal entrants. The identifying assumption is monotonicity: treatment moves selection in one direction for everyone. If treatment plausibly destroys some jobs while creating others, the bounds are invalid.
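The trimming step is short enough to write out. The following sketch reproduces the mechanics on the same simulated data with plain NumPy: it has no covariates and no bootstrap, and it handles only the case where treatment raises selection. It is an illustration of the procedure described above, not the sp.lee_bounds implementation, and the function name is an assumption.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1200
d = rng.integers(0, 2, size=n)
wage = 2.0 + 0.4 * d + rng.normal(scale=0.8, size=n)
employed = (rng.uniform(size=n) < (0.55 + 0.15 * d)).astype(int)


def lee_trimming_bounds(y, d, s):
    """Lee bounds when treatment weakly raises selection (s = 1 if observed).

    Only outcomes with s == 1 are used, mimicking wages seen for the employed.
    """
    q1, q0 = s[d == 1].mean(), s[d == 0].mean()
    if q1 < q0:
        raise ValueError("sketch assumes treatment raises the selection rate")
    trim = (q1 - q0) / q1                       # share of excess treated units
    y1 = np.sort(y[(d == 1) & (s == 1)])
    y0 = y[(d == 0) & (s == 1)]
    k = int(np.floor(trim * y1.size))           # number of observations to trim
    lower = y1[: y1.size - k].mean() - y0.mean()  # drop the k highest wages
    upper = y1[k:].mean() - y0.mean()             # drop the k lowest wages
    return lower, upper, trim


lo, hi, trim = lee_trimming_bounds(wage, d, employed)
print(f"trimming fraction: {trim:.2f}")
print(f"bounds: [{lo:.2f}, {hi:.2f}]")
```

The two bounds correspond to the two extreme stories about who the extra employed workers are: all high earners (trim the top) or all low earners (trim the bottom). The simulated effect of 0.4 lies between them.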
For outcomes (or covariates) that are missing not-at-random rather
than selection-truncated, sp.horowitz_manski(df, y=..., treatment=...,
covariates=[...], y_lower=..., y_upper=...) returns the
worst-case-imputation analogue (Horowitz-Manski).
One call for the common case
For a quick first pass on any fitted result, every CausalResult and
regression result exposes a .sensitivity() method (alias
sp.unified_sensitivity) that runs whatever applies and collects the
components in one dashboard:
- E-value and a breakdown-bias calculation: always.
- Oster component: when you supply the needed inputs (r2_treated, r2_controlled, beta_uncontrolled).
- Rosenbaum component: when the result exposes matched_pairs outcome arrays.
- sensemakr component: when you pass the raw estimation data (data=, y=, treat=, controls=), since result objects do not carry the data.

Anything it could not run is recorded in dash.notes rather than
silently dropped:
import numpy as np
import pandas as pd
import statspai as sp
rng = np.random.default_rng(42)
n = 2000
x1, x2, u = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
d = (0.5 * x1 + 0.3 * x2 + 0.4 * u + rng.normal(size=n) > 0).astype(int)
y = 1.0 * d + 0.8 * x1 + 0.5 * x2 + 0.6 * u + rng.normal(size=n)
df = pd.DataFrame({"y": y, "d": d, "x1": x1, "x2": x2})
r = sp.regress("y ~ d + x1 + x2", data=df)
dash = r.sensitivity() # = sp.unified_sensitivity(r)
print(dash.e_value_point, dash.e_value_ci) # 1.61 1.40
print(dash.breakdown) # bias needed to flip the conclusion
Treat the dashboard as triage, not as the final deliverable: for the
paper, run the design-specific tool from the decision tree with
explicit benchmarks, and check dash.notes for any component that was
skipped.
What these tools can NOT tell you
Sensitivity analysis is the most honest part of the causal toolkit precisely because its limits are sharp. Do not let a good-looking robustness value write checks the design cannot cash:
- None of these tools detects confounding. The sensitivity parameters (partial R², δ, Γ, M, cf_y/cf_d) are not estimable from the data — that is what makes the confounder unobserved. The tools answer "how strong would it need to be?", never "how strong is it?". The Oster example in Branch 1 shows a result that passes the δ > 1 bar while still being biased by 30%.
- A large breakdown value is an argument, not a proof. "A confounder would need partial R² > 40%" persuades only if you can argue no such variable exists. That argument lives in your institutional knowledge, not in the output.
- Benchmarking assumes unobservables resemble observables. Oster's δ and the sensemakr / DML benchmark tables calibrate against covariates you chose because they were the important ones. A qualitatively different confounder (a policy shock, a genetic factor) is not bound by that calibration.
- Each tool covers exactly one assumption. An E-value of 3 says nothing about SUTVA violations, measurement error in the treatment, selection into the sample, or p-hacking across specifications. sp.honest_did covers parallel trends, not anticipation or spillovers. sp.weakrobust covers relevance, not exclusion.
- Passing a pre-trend test does not establish parallel trends — the post-treatment counterfactual trend is unobservable by definition, and low-power pre-tests pass too easily (Roth). That is why Branch 2 insists on sp.pretrends_power plus honest CIs rather than the test alone.
- Wide Manski bounds are information, not failure. If the no-assumption bounds straddle zero, the data genuinely do not sign the effect without further assumptions; reporting a point estimate instead just hides the assumption doing the work.
- Sensitivity to confounding is not sensitivity to specification. Run the robustness workflow (spec curve, placebo tests, subsample stability) as well as — never instead of — the tools in this guide. The two layers fail independently.
Reporting template
For the sensitivity panel of a paper:
- Name the threatened assumption for your design (the decision tree's branch), and the tool that parameterises it.
- Report the breakdown value next to the point estimate: RV and benchmark comparison (sensemakr), δ* and identified set (Oster), E-value for estimate and CI, Γ*, breakdown M / M̄, effective F with AR CI, or the bounds ladder.
- Calibrate it: compare the breakdown value against an observed quantity (strongest measured covariate, worst pre-trend deviation) so the reader can judge plausibility.
- State the residual exposure explicitly — which assumptions the reported analysis does not defend (see "Each tool covers exactly one assumption" above). Reviewers trust papers that say this unprompted.
Cross-references
- Honest DiD guide — full treatment of sp.honest_did, sp.sensitivity_rr, and event-study workflows.
- Robustness workflow — the three-layer defence; this guide is Layer 3 expanded.
- Choosing a DID estimator and Choosing an IV estimator — pick the estimator before stress-testing it.
References
All citations resolve to verified entries in paper.bib:
sensemakr [@cinelli2020making]; Oster bounds [@oster2019unobservable];
E-value [@vanderweele2017sensitivity; @ding2016sensitivity];
Rosenbaum bounds [@rosenbaum2002observational];
honest DiD [@rambachan2023more]; pre-test power [@roth2022pretest];
weak-IV-robust inference [@anderson1949estimation;
@kleibergen2002pivotal; @moreira2003conditional; @olea2013robust;
@lee2022valid; @staiger1997instrumental; @stock2005testing];
DML sensitivity [@chernozhukov2022long];
Manski bounds [@manski1990nonparametric]; Lee bounds [@lee2009training];
Horowitz-Manski bounds [@horowitz2000nonparametric].