Skip to content

statspai.diagnostics

diagnostics

Diagnostics and sensitivity analysis for StatsPAI.

Provides: - Oster (2019) coefficient stability bounds - McCrary (2008) density discontinuity test for RD manipulation

WeakRobustResult

Container holding the unified weak-IV-robust panel.

to_frame

to_frame() -> DataFrame

Return the Stata-style single-table summary.

KitagawaResult dataclass

Result of the Kitagawa (2015) LATE validity test.

Attributes:

Name Type Description
statistic float

Test statistic (maximum violation of the inequality conditions).

p_value float

Bootstrap p-value.

first_stage float

First-stage effect: P(D=1|Z=1) - P(D=1|Z=0).

n_boot int

Number of bootstrap replications used.

n_obs int

Number of observations.

max_violation_treated float

Maximum violation of the treated-complier CDF condition.

max_violation_untreated float

Maximum violation of the untreated-complier CDF condition.

Methods:

Name Description
summary

Print a formatted summary.

summary

summary() -> str

Return a formatted summary string.

RosenbaumResult dataclass

Result container for :func:rosenbaum_bounds.

oster_bounds

oster_bounds(data: Optional[DataFrame] = None, y: Optional[str] = None, treat: Optional[str] = None, controls: Optional[List[str]] = None, r_max: Optional[float] = None, delta: float = 1.0, beta_short: Optional[float] = None, r2_short: Optional[float] = None, beta_long: Optional[float] = None, r2_long: Optional[float] = None, alpha: float = 0.05) -> Dict[str, Any]

Oster (2019) coefficient stability bounds.

Assesses robustness of a treatment effect estimate to omitted variable bias by computing how much unobservable selection (relative to observable selection) would be needed to explain away the result.

Can be called in two ways:

  1. From data — runs short and long regressions internally::

    result = oster_bounds(data, y='wage', treat='training', controls=['age', 'education'])

  2. From pre-computed statistics — supply β and R² directly::

    result = oster_bounds(beta_short=2.5, r2_short=0.15, beta_long=2.0, r2_long=0.45)

Parameters:

Name Type Description Default
data DataFrame

Input data (required for method 1).

None
y str

Outcome variable.

None
treat str

Treatment variable.

None
controls list of str

Control variables (included in the long but not short regression).

None
r_max float

Maximum R² under full selection. Default: min(1.0, 1.3 × R²_long).

None
delta float

Proportionality assumption: ratio of unobservable-to-observable selection. δ=1 means equal selection.

1.0
beta_short float

Treatment coefficient from short regression (no controls).

None
r2_short float

R² from short regression.

None
beta_long float

Treatment coefficient from long regression (with controls).

None
r2_long float

R² from long regression.

None
alpha float

Significance level for the identified set.

0.05

Returns:

Type Description
dict

Keys: - beta_short, r2_short: short regression estimates - beta_long, r2_long: long regression estimates - r_max: maximum R² used - delta_for_zero: δ* such that adjusted β = 0 (robustness measure) - beta_adjusted: bias-adjusted β under (δ, R_max) - identified_set: [min(β_adjusted, β_long), max(β_adjusted, β_long)] - robust: True if identified set excludes zero - interpretation: human-readable interpretation

Examples:

>>> # From data
>>> result = oster_bounds(df, y='wage', treat='training',
...                       controls=['age', 'education', 'experience'])
>>> print(f"δ* = {result['delta_for_zero']:.2f}")
>>> print(f"Robust: {result['robust']}")
>>> # From statistics (e.g., read from a paper)
>>> result = oster_bounds(beta_short=2.5, r2_short=0.15,
...                       beta_long=2.0, r2_long=0.45)

mccrary_test

mccrary_test(data: DataFrame, x: str, c: float = 0, bw: Optional[float] = None, n_bins: Optional[int] = None, alpha: float = 0.05) -> CausalResult

McCrary (2008) density discontinuity test for RD manipulation.

Tests the null hypothesis that the density of the running variable is continuous at the cutoff. A rejection suggests possible manipulation of the running variable.

Parameters:

Name Type Description Default
data DataFrame
required
x str

Running variable name.

required
c float

RD cutoff.

0
bw float

Bandwidth for local linear density estimation. Default: automatic (2 × Silverman rule).

None
n_bins int

Number of histogram bins per side. Default: ceil(sqrt(n/2)).

None
alpha float

Significance level.

0.05

Returns:

Type Description
CausalResult
  • estimate: log density difference ln(f̂₊) - ln(f̂₋)
  • pvalue: two-sided test for H₀: no discontinuity
  • model_info['density_left']: estimated density from the left
  • model_info['density_right']: estimated density from the right

Examples:

>>> result = mccrary_test(df, x='score', c=0)
>>> print(f"θ̂ = {result.estimate:.3f}, p = {result.pvalue:.4f}")

diagnose

diagnose(data: DataFrame, y: str, x: List[str], print_results: bool = True) -> Dict[str, Any]

Comprehensive regression diagnostics in one call.

Equivalent to running Stata's estat hettest, estat ovtest, estat vif after a regression.

Parameters:

Name Type Description Default
data DataFrame

Data used in the regression.

required
y str

Dependent variable name.

required
x list of str

Independent variable names (excluding constant).

required
print_results bool

Print formatted output.

True

Returns:

Type Description
dict

Keys: 'het_test', 'reset_test', 'vif', each containing test results.

Examples:

>>> diag = sp.diagnose(df, y='wage', x=['education', 'experience'])

het_test

het_test(data: DataFrame, y: str, x: List[str]) -> Dict[str, float]

Breusch-Pagan test for heteroskedasticity.

H0: Homoskedastic errors (constant variance). H1: Variance depends on regressors.

Equivalent to Stata's estat hettest.

Parameters:

Name Type Description Default
data as in ``diagnose()``
required
y as in ``diagnose()``
required
x as in ``diagnose()``
required

Returns:

Type Description
dict

'statistic', 'df', 'pvalue'.

References

Breusch, T.S. and Pagan, A.R. (1979). Econometrica, 47(5), 1287-1294. [@breusch1979simple]

reset_test

reset_test(data: DataFrame, y: str, x: List[str], powers: int = 3) -> Dict[str, float]

Ramsey RESET test for functional form misspecification.

Tests whether nonlinear combinations of fitted values help explain the dependent variable. Rejection suggests the model is missing important nonlinearities.

Equivalent to Stata's estat ovtest.

Parameters:

Name Type Description Default
data as in ``diagnose()``
required
y as in ``diagnose()``
required
x as in ``diagnose()``
required
powers int

Include ŷ², ŷ³, ..., ŷ^powers in the auxiliary regression.

3

Returns:

Type Description
dict

'statistic' (F), 'df1', 'df2', 'pvalue'.

References

Ramsey, J.B. (1969). JRSS-B, 31(2), 350-371. [@ramsey1969tests]

vif

vif(data: DataFrame, x: List[str]) -> DataFrame

Variance Inflation Factors for multicollinearity detection.

VIF > 10 is a common rule of thumb for concerning multicollinearity.

Equivalent to Stata's estat vif.

Parameters:

Name Type Description Default
data DataFrame
required
x list of str

Independent variables.

required

Returns:

Type Description
DataFrame

Columns: 'variable', 'VIF', '1/VIF'.

Notes

VIF_j = 1 / (1 - R²_j) where R²_j is from regressing x_j on all other x variables.

hausman_test

hausman_test(data: DataFrame, y: str, x: List[str], id: str, time: str, alpha: float = 0.05) -> Dict[str, Any]

Hausman test for FE vs RE in panel data.

Equivalent to Stata's hausman fe re.

Parameters:

Name Type Description Default
data DataFrame

Panel data in long format.

required
y str

Dependent variable.

required
x list of str

Independent variables (time-varying).

required
id str

Unit identifier.

required
time str

Time period identifier.

required
alpha float
0.05

Returns:

Type Description
dict

'statistic': Hausman chi² statistic 'df': degrees of freedom 'pvalue': p-value 'recommendation': 'FE' or 'RE' 'beta_fe', 'beta_re': coefficient vectors

Examples:

>>> result = sp.hausman_test(df, y='wage', x=['education', 'experience'],
...                         id='worker', time='year')
>>> print(f"chi2({result['df']}) = {result['statistic']:.2f}, "
...       f"p = {result['pvalue']:.4f}")
>>> print(f"Recommendation: {result['recommendation']}")
Notes

The test statistic is:

.. math:: H = (\hat{\beta}{FE} - \hat{\beta}{RE})' [V(\hat{\beta}{FE}) - V(\hat{\beta}{RE})]^{-1} (\hat{\beta}{FE} - \hat{\beta}{RE})

Under H0, H ~ χ²(k).

  • Reject H0 (p < 0.05) → Use Fixed Effects
  • Fail to reject → Use Random Effects (more efficient)

See Hausman (1978, Econometrica).

anderson_rubin_test

anderson_rubin_test(data: DataFrame, y: str, endog: str, instruments: List[str], exog: Optional[List[str]] = None, h0: float = 0, alpha: float = 0.05, vcov: str = 'HC1') -> Dict[str, Any]

Anderson-Rubin (1949) test — size-correct under weak instruments.

Tests H0: β_endog = h0 and constructs a confidence set by inverting the test over a grid of candidate values. The AR test has correct size regardless of instrument strength and is the recommended inference procedure when F_eff < 10.

Parameters:

Name Type Description Default
data DataFrame
required
y str

Outcome variable.

required
endog str

Endogenous regressor (single).

required
instruments list of str

Excluded instruments.

required
exog list of str

Included exogenous controls.

None
h0 float

Null hypothesis value for the endogenous coefficient.

0
alpha float

Significance level.

0.05
vcov (HC0, HC1, classic)

Variance estimator used for the Olea-Pflueger effective F reported alongside AR.

'HC0'

Returns:

Type Description
dict

'ar_stat' : AR F statistic at h0. 'ar_df' : Degrees of freedom (k_z, n - k_w - k_z). 'ar_pvalue' : P-value of AR test at h0. 'ar_ci' : (low, high) AR confidence set (grid-inverted). 'beta_2sls' : 2SLS point estimate. 'first_stage_F' : Classical first-stage F. 'effective_F' : Olea-Pflueger robust F_eff. 'tF_critical_value' : Lee et al. (2022) adjusted 5 % two-sided critical value. None if alpha != 0.05. 'strength' : Instrument-strength interpretation. 'interpretation' : Human-readable summary.

Examples:

>>> result = sp.anderson_rubin_test(df, y='wage', endog='education',
...                                 instruments=['parent_edu', 'distance'])
>>> print(result['interpretation'])
Notes

Under H0: β = β₀, regress Y - β₀·D on Z and W. The F-test on Z gives the AR statistic. The confidence set is constructed by collecting all β₀ values not rejected at level alpha. If the AR set is unbounded on one or both sides, the returned (low, high) will be ±inf accordingly.

effective_f_test

effective_f_test(data: DataFrame, endog: str, instruments: List[str], exog: Optional[List[str]] = None, vcov: str = 'HC1') -> Dict[str, Any]

Olea-Pflueger (2013) robust effective F statistic for weak instruments.

Computes the heteroskedasticity-robust effective F that is a pre-test for the concentration parameter of the first stage. Under homoskedasticity (vcov='classic') and a single instrument it reduces exactly to the standard first-stage F.

Parameters:

Name Type Description Default
data DataFrame
required
endog str

Endogenous regressor (single endogenous variable).

required
instruments list of str

Excluded instruments.

required
exog list of str

Included exogenous controls (a constant is added automatically).

None
vcov (HC0, HC1, classic)

Variance estimator for the first-stage residuals:

  • 'classic' — homoskedastic; F_eff equals first-stage F.
  • 'HC0' — White heteroskedasticity-robust.
  • 'HC1' — HC0 with small-sample correction n/(n-k).
'HC0'

Returns:

Type Description
dict

'F_eff' : Olea-Pflueger effective F. 'first_stage_F' : Classical first-stage F (for comparison). 'n_instruments' : Number of excluded instruments (k_z). 'n_obs' : Sample size. 'strength' : Interpretation string. 'stock_yogo_10pct' : 23.1 (conventional threshold for <10 % bias).

Notes

The formula (Andrews-Stock-Sun 2019, eq. 4.13) is

.. math::

F_{\text{eff}} = \frac{\hat\pi' (\tilde Z' \tilde Z)\hat\pi}
    {\mathrm{tr}\!\left(\hat\Omega\,(\tilde Z' \tilde Z)^{-1}\right)}

where :math:\tilde Z, \tilde D are residualized after partialling out the exogenous controls, :math:\hat\pi is the first-stage OLS coefficient vector, and :math:\hat\Omega = \sum_i \hat\eta_i^2 \tilde z_i \tilde z_i' is the HC meat.

Under homoskedasticity :math:\hat\Omega \approx \hat\sigma_\eta^2 \tilde Z'\tilde Z, so the trace collapses to :math:k_z \hat \sigma_\eta^2 and :math:F_{\text{eff}} reduces to the first-stage F.

Examples:

>>> import statspai as sp
>>> res = sp.effective_f_test(df, endog='educ', instruments=['qob'])
>>> print(f"F_eff = {res['F_eff']:.2f} ({res['strength']})")

tF_critical_value

tF_critical_value(first_stage_F: float, alpha: float = 0.05) -> float

Lee–McCrary–Moreira–Porter (2022, AER) tF adjusted critical value.

Returns the adjusted two-sided t-ratio critical value c(F) such that |t| > c(F) is a valid 1 - alpha test of the 2SLS coefficient, robust to weak instruments.

Parameters:

Name Type Description Default
first_stage_F float

Observed first-stage F statistic (or Olea-Pflueger F_eff).

required
alpha float

Significance level. Only 0.05 is implemented (the only level for which LMMP publish a complete table).

0.05

Returns:

Type Description
float

Adjusted critical value. Returns inf when F ≤ 3.84 (AR inference should be used instead). Converges to 1.96 as F → ∞.

Raises:

Type Description
ValueError

If alpha is not 0.05.

Notes

The LMMP tF procedure gives exactly correct 5 % size for any first-stage strength. The standard 1.96 critical value over-rejects substantially when F < 104.7 (the value at which c = 1.96).

Examples:

>>> import statspai as sp
>>> # With first-stage F = 10, the adjusted critical value is ~3.16
>>> c = sp.tF_critical_value(10.0)
>>> print(f"Adjusted 5% critical value: {c:.2f}")

weakrobust

weakrobust(data: DataFrame, y: str, endog: str, instruments: List[str], exog: Optional[List[str]] = None, *, h0: float = 0.0, alpha: float = 0.05, vcov: str = 'HC1', include_clr: bool = True, include_k: bool = True, clr_simulations: int = 20000, grid_size: int = 401, random_state: Optional[int] = None) -> WeakRobustResult

Stata-style unified weak-instrument-robust diagnostic panel.

Bundles in a single call:

  1. Classical first-stage F and Olea–Pflueger (2013) effective F — instrument-strength pre-tests.
  2. Kleibergen–Paap (2006) rk LM and Wald F — rank test of the reduced form, heteroskedasticity-robust Cragg-Donald analogue.
  3. Anderson–Rubin (1949) F test of H0: β_endog = h0 and its weak-IV-robust confidence set.
  4. Moreira (2003) Conditional LR (CLR) test and CLR confidence set by grid inversion (uniformly most powerful invariant for a single endogenous regressor).
  5. Kleibergen (2002) K score test and K confidence set.
  6. Lee–McCrary–Moreira–Porter (2022) tF adjusted critical value.

This entry point is the Python analogue of Stata 19's ivregress 2sls; estat weakrobust and weakiv / ivreg2 in Stata 17+, unifying tools that are scattered across ivmodel (R), linearmodels (Python), and the user-written packages weakiv, rivtest (Stata).

Parameters:

Name Type Description Default
data DataFrame
required
y str

Outcome and single endogenous regressor (column names in data).

required
endog str

Outcome and single endogenous regressor (column names in data).

required
instruments list of str

Excluded instruments.

required
exog list of str

Included exogenous controls. An intercept is always added.

None
h0 float

Null value for the endogenous coefficient. All under-H0 tests (AR, CLR, K) are evaluated at β = h0.

0.0
alpha float

Significance level for the robust confidence sets.

0.05
vcov (HC0, HC1, classic)

Used by the Olea–Pflueger effective F.

'HC0'
include_clr bool

Also run the CLR test and invert it for a CLR confidence set.

True
include_k bool

Also run the Kleibergen K score test and K confidence set.

True
clr_simulations int

Monte-Carlo draws for the CLR null distribution.

20 000
grid_size int

Grid resolution used by AR/CLR/K confidence-set inversion.

401
random_state int
None

Returns:

Type Description
WeakRobustResult

Accepts .summary(), .to_frame() and dict-style lookup.

Examples:

>>> panel = sp.weakrobust(df, y='wage', endog='educ',
...                       instruments=['nearc2','nearc4'],
...                       exog=['age','exper'])
>>> print(panel.summary())
>>> panel.to_frame()
References

Anderson–Rubin (1949) Ann. Math. Stat. 20, 46-63. Kleibergen (2002) Econometrica 70, 1781-1803. Moreira (2003) Econometrica 71, 1027-1048. Kleibergen–Paap (2006) J. Econom. 133, 97-126. Olea–Pflueger (2013) JBES 31, 358-369. Lee–McCrary–Moreira–Porter (2022) AER 112, 3260-3290. [@anderson1949estimation]

evalue_from_result

evalue_from_result(result, measure: str = 'SMD', rare: Optional[bool] = None, rare_outcome: Optional[bool] = None, true: Optional[float] = None) -> Dict[str, Any]

Compute an E-value from a StatsPAI CausalResult object.

Parameters:

Name Type Description Default
result CausalResult

Result from any StatsPAI causal estimator exposing a scalar estimate (and ideally se / ci).

required
measure str

How to interpret result.estimate (for ATE/ATT on continuous outcomes 'SMD' is appropriate; pass 'RR' / 'OR' / 'HR' for ratio estimates).

'SMD'
rare Optional[bool]

Passed through to :func:evalue.

None
rare_outcome Optional[bool]

Passed through to :func:evalue.

None
true Optional[bool]

Passed through to :func:evalue.

None

Returns:

Type Description
dict

Same structure as :func:evalue.

evalue_rd

evalue_rd(n11: float, n10: float, n01: float, n00: float, true: float = 0.0, alpha: float = 0.05, grid: float = 0.0001) -> Dict[str, Any]

Exact E-value for a risk difference from a 2x2 table.

Implements the Ding & VanderWeele (2016) risk-difference E-value, a faithful port of EValue::evalues.RD. The exposure must be coded so that the risk difference is non-negative.

Parameters:

Name Type Description Default
n11 float

Exposed cases and exposed non-cases.

required
n10 float

Exposed cases and exposed non-cases.

required
n01 float

Unexposed cases and unexposed non-cases.

required
n00 float

Unexposed cases and unexposed non-cases.

required
true float

Reference risk difference (must be <= the observed RD).

0.0
alpha float

Significance level for the confidence-limit E-value.

0.05
grid float

Step size of the bias-factor grid search for the CI E-value.

1e-4

Returns:

Type Description
dict

evalue_estimate (point E-value), evalue_ci (E-value for the lower confidence limit; 1 if the CI already crosses true), rd (the observed risk difference), bias_factor and measure='RD'.

Examples:

>>> import statspai as sp
>>> round(sp.evalue_rd(200, 150, 100, 250)["evalue_estimate"], 4)
3.4142

bias_factor

bias_factor(rr_eu: float, rr_ud: float) -> float

Confounding bias factor B (Ding & VanderWeele 2016).

Given the maximum risk ratios of an unmeasured confounder U with the exposure (rr_eu) and with the outcome (rr_ud), the factor by which they can bias a risk ratio is

B = (rr_eu * rr_ud) / (rr_eu + rr_ud - 1).

The E-value is the value e such that bias_factor(e, e) equals the observed risk ratio.

diagnose_result

diagnose_result(result, print_results: bool = True, alpha: float = 0.05) -> Dict[str, Any]

One-call diagnostic battery, auto-selected by method type.

Parameters:

Name Type Description Default
result EconometricResults or CausalResult

A fitted result from any StatsPAI estimator.

required
print_results bool

If True, print formatted diagnostics to stdout.

True
alpha float

Significance level for tests.

0.05

Returns:

Type Description
dict

Keys depend on the method. Always includes 'method_type' and 'checks' (list of individual test result dicts).

kitagawa_test

kitagawa_test(data: DataFrame, y: str, treatment: str, instrument: str, n_boot: int = 1000, n_grid: int = 100, seed: Optional[int] = None) -> KitagawaResult

Kitagawa (2015) specification test for the validity of LATE.

Tests whether the data are consistent with the LATE assumptions (independence, exclusion restriction, monotonicity) by checking that the implied complier potential-outcome CDFs are proper distribution functions.

Parameters:

Name Type Description Default
data DataFrame

Input data.

required
y str

Outcome variable name.

required
treatment str

Endogenous binary treatment variable (D).

required
instrument str

Binary instrument variable (Z).

required
n_boot int

Number of bootstrap replications for the p-value.

1000
n_grid int

Number of grid points for evaluating the CDF conditions.

100
seed int

Random seed for reproducibility.

None

Returns:

Type Description
KitagawaResult

Test results including statistic, p-value, and first-stage effect.

Raises:

Type Description
ValueError

If instrument or treatment are not binary, or first stage is zero.

Examples:

>>> import statspai as sp
>>> result = sp.kitagawa_test(
...     data=df,
...     y="outcome",
...     treatment="treated",
...     instrument="assigned",
...     n_boot=1000,
...     seed=42,
... )
>>> result.summary()

rosenbaum_bounds

rosenbaum_bounds(treated: Optional[Sequence[float]] = None, control: Optional[Sequence[float]] = None, *, data: Optional[DataFrame] = None, y: Optional[str] = None, treat: Optional[str] = None, pair_id: Optional[str] = None, method: str = 'wilcoxon', alternative: str = 'greater', gamma_grid: Optional[Sequence[float]] = None, alpha: float = 0.05) -> RosenbaumResult

Compute Rosenbaum bounds on a paired observational study.

You can pass either two parallel arrays (treated, control) of matched outcomes, or a long-format DataFrame with columns y, treat, pair_id.

Parameters:

Name Type Description Default
treated array - like

Outcome in the treated / control unit of each matched pair (same length). Ignored if data is provided.

None
control array - like

Outcome in the treated / control unit of each matched pair (same length). Ignored if data is provided.

None
data DataFrame

Long-format data. Must contain exactly two rows per pair_id, one with treat=1 and one with treat=0.

None
method ('wilcoxon', 'sign')

Wilcoxon signed-rank bound (continuous) or binomial sign test (robust / binary).

"wilcoxon"
alternative ('greater', 'less', 'two-sided')

Direction of the alternative hypothesis for the treatment effect.

"greater"
gamma_grid sequence of float

Gamma values (>= 1) over which to compute bounding p-values. Default: np.arange(1.0, 3.01, 0.1).

None
alpha float

Significance level used to report gamma_critical.

0.05

Returns:

Type Description
RosenbaumResult

gamma_critical is the smallest Gamma in the grid at which the upper-bound p-value exceeds alpha. It is inf if the study is insensitive across the grid, and 1.0 if already sensitive.