Smart Workflow

statspai.smart — estimator recommendation, comparison, assumption auditing, and posterior verification.

sp.recommend

rec = sp.recommend(
    df,
    outcome='earnings',
    treatment='training',
    covariates=['age', 'educ', 'prior_earnings'],
    design='observational',          # 'rct' | 'observational' | 'did' | 'rd' | 'iv' | 'synth'
    verify=True,                     # run posterior verification (v0.9.3)
)
rec.summary()                        # ranked estimators with rationale
rec.recommended_method
rec.plot('verify_radar')             # stability breakdown per method
rec.to_latex()

sp.compare_estimators

Run multiple estimators on the same data and show a coefficient-stability forest:

cmp = sp.compare_estimators(
    df, outcome='y', treatment='d', covariates=[...],
    methods=['ols', 'psm', 'dml', 'aipw', 'tmle', 'causal_forest'],
)
cmp.plot_forest()
cmp.table()

sp.assumption_audit

One-call audit of the most common identification assumptions:

audit = sp.assumption_audit(df, outcome='y', treatment='d', covariates=[...])
audit.overlap                        # propensity score overlap diagnostic
audit.covariate_balance              # Love plot of standardised diffs
audit.placebo_outcomes               # pre-treatment placebos
audit.instrument_strength            # first-stage F if IV specified
audit.parallel_trends                # pre-trend placebo if DID
audit.summary()

sp.verify / sp.verify_benchmark (v0.9.3)

Posterior verification of any sp.recommend() output — aggregates three signals into a verify_score ∈ [0, 100]:

v = sp.verify(
    rec,
    n_boot=500,
    n_subsample=100,
    subsample_frac=0.8,
    n_placebo=20,
)

v.verify_score                       # 0–100 composite
v.components                         # dict: bootstrap / placebo / subsample
v.failures                           # methods that failed verification
v.plot('radar')                      # visual per-method breakdown

Calibration card: top-method verify_score is typically 85–95 on clean DGPs (RD lower at ≈ 74 due to local-polynomial bootstrap variance). sp.verify_benchmark(...) runs verify against synthetic DGPs to calibrate what threshold constitutes "trust it".

Agent-native method-level API

These functions are guide-friendly, but they are also public API calls used by agents and notebook workflows. Their docstrings are exposed here so the Reference navigation includes method-level details.

sp.detect_design(...)

detect_design

Heuristic study-design detection from a raw DataFrame.

sp.detect_design(data, **hints) answers the agent's first question on receiving unfamiliar data: what kind of dataset is this? — cross-section, panel, or something with an obvious RD running variable.

Distinct from siblings:

  • sp.recommend: needs a research question (outcome, treatment) and recommends an estimator; detect_design only inspects shape.
  • sp.check_identification: diagnoses a specific declared design; detect_design decides which design is plausible at all.

The function is intentionally heuristic — it reports a confidence, ranks alternatives, and surfaces every column-role candidate it considered, so an agent can override with hints (unit=... / time=... / running_var=...) when the heuristic is wrong.

detect_design

detect_design(data: DataFrame, *, unit: Optional[str] = None, time: Optional[str] = None, running_var: Optional[str] = None, cutoff: Optional[float] = None) -> Dict[str, Any]

Detect the most plausible study design from a DataFrame.

Heuristic — never definitive. Returns ranked candidates so the agent can override with hints (unit=... / time=... / running_var=... / cutoff=...) when the auto-detection is wrong.

Parameters:

Name Type Description Default
data DataFrame

The dataset to inspect. Must have ≥ 1 row.

required
unit str

Column the caller has already identified as the unit ID. Skips unit-detection and pins this column.

None
time str

Column the caller has already identified as the time dimension.

None
running_var str

Force this numeric column to be evaluated as an RD running variable.

None
cutoff float

RD cutoff value (the heuristic does NOT auto-discover this).

None

Returns:

Type Description
dict

JSON-safe payload with keys:

  • design (str) — "panel" / "rd" / "cross_section"
  • confidence (float in [0, 1])
  • identified (dict[str, str|float]) — column-role assignments for the chosen design
  • candidates (list[dict]) — every alternative considered, each with its own design / confidence / details. Use this to override when the top pick is wrong.
  • n_obs (int) — sample size
  • columns (list[str]) — input column names

Examples:

Panel data:

>>> df = pd.DataFrame({
...     'firm_id': np.repeat(range(50), 10),
...     'year': np.tile(range(2010, 2020), 50),
...     'sales': np.random.randn(500),
... })
>>> sp.detect_design(df)['design']
'panel'

Cross-section:

>>> df = pd.DataFrame({'x': np.random.randn(200),
...                    'y': np.random.randn(200)})
>>> sp.detect_design(df)['design']
'cross_section'

See Also

sp.recommend : Method advisor. Pair it with a declared research question (outcome / treatment) and it recommends an estimator.
sp.check_identification : Design-level diagnostics for an already-declared design.
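
The override workflow described above (inspect confidence, then fall back to candidates) might look like the following sketch. It does not call StatsPAI: the payload is hand-built to match the documented keys, the numbers and the contents of details are made up, and min_confidence is a hypothetical caller-side threshold rather than a library default.

```python
from typing import Any, Dict, Optional

# Hand-built payload shaped like the documented return value; the numbers
# and the contents of "details" are illustrative assumptions.
payload: Dict[str, Any] = {
    "design": "cross_section",
    "confidence": 0.55,
    "identified": {},
    "candidates": [
        {"design": "cross_section", "confidence": 0.55, "details": {}},
        {"design": "panel", "confidence": 0.40,
         "details": {"unit": "firm_id", "time": "year"}},
    ],
    "n_obs": 500,
    "columns": ["firm_id", "year", "sales"],
}


def runner_up(result: Dict[str, Any],
              min_confidence: float = 0.7) -> Optional[Dict[str, Any]]:
    """Return the best alternative candidate when the top pick is weak.

    min_confidence is a hypothetical threshold chosen by the caller.
    """
    if result["confidence"] >= min_confidence:
        return None  # top pick is trusted; nothing to override
    others = [c for c in result["candidates"]
              if c["design"] != result["design"]]
    return max(others, key=lambda c: c["confidence"], default=None)


alt = runner_up(payload)
if alt is not None:
    print(f"low confidence in {payload['design']!r}; consider {alt['design']!r}")
```

An agent that decides the alternative is right would then call detect_design again with the matching hints (for example unit=... and time=...) rather than editing the payload.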

sp.preflight(...)

preflight

Method-specific pre-estimation diagnostics.

sp.preflight(data, method, **kwargs) runs cheap, method-specific shape and content checks BEFORE the agent commits to an expensive estimator call. Different from the neighbours:

  • statspai.smart.check_identification: design-level diagnostics for an already-declared design (DID / RD / IV / observational). Heavier and broader.
  • statspai.smart.assumption_audit: heavyweight; re-runs statistical tests against the data after the model is fit.
  • statspai.smart.audit: read-only checklist of robustness evidence ON a fitted result.

preflight answers: "if I call sp.{method}(data, ...) with these arguments, will it work, and is the data the right shape?" — a quick gate the agent can run first to avoid wasting tokens on bad calls.

Per-method check tables cover the curated agent-tool surface (regress / did / callaway_santanna / rdrobust / ivreg / ebalance); unknown methods get the universal sanity checks only (data is a non-empty DataFrame, sample size sanity).

CheckResult module-attribute

CheckResult = Tuple[str, str, Dict[str, Any]]

(status, message, evidence) — status in {passed, warning, failed}.
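
As a rough illustration of how a list of such tuples could be collapsed into an overall verdict and a count summary, here is a minimal sketch. The "worst status wins" rule is an assumption made for this example; the source does not state how preflight derives its verdict, and aggregate is a hypothetical helper, not part of StatsPAI.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple

CheckResult = Tuple[str, str, Dict[str, Any]]  # (status, message, evidence)


def aggregate(checks: List[CheckResult]) -> Dict[str, Any]:
    """Collapse check tuples into a verdict (assumed rule: worst status wins)."""
    counts = Counter(status for status, _message, _evidence in checks)
    if counts["failed"]:
        verdict = "FAIL"
    elif counts["warning"]:
        verdict = "WARN"
    else:
        verdict = "PASS"
    summary = {key: counts[key] for key in ("passed", "warning", "failed")}
    return {"verdict": verdict, "summary": summary}


# Two illustrative checks; the messages are made up for the example.
checks: List[CheckResult] = [
    ("passed", "column 'y' exists", {"column": "y"}),
    ("warning", "n=4 is below the typical minimum of 50", {"n_obs": 4}),
]
result = aggregate(checks)
print(result["verdict"], result["summary"])
```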

preflight

preflight(data: DataFrame, method: str, **kwargs: Any) -> Dict[str, Any]

Method-specific pre-estimation diagnostics.

Runs cheap, method-aware checks (column existence, data shape, treatment binarity, sample size) BEFORE the agent commits to an expensive estimator call. Use the verdict to decide whether to proceed, fix arguments, or pivot to a different method.

Parameters:

Name Type Description Default
data DataFrame

Same DataFrame the agent plans to pass to sp.{method}(...).

required
method str

Name of the StatsPAI estimator to pre-flight (e.g. "did", "rdrobust", "ivreg", "callaway_santanna", "ebalance"). Unknown methods get only the universal DataFrame-shape sanity checks.

required
**kwargs Any

Estimator arguments — column names (y, treat, time, i, running_var), a Wilkinson formula, a covariates list, etc. Passed through unchanged from what the agent intends to use; preflight doesn't run the estimator.

{}

Returns:

Type Description
dict

JSON-safe payload with keys:

  • method (str) — input method name (lower-cased)
  • verdict (str) — "PASS" / "WARN" / "FAIL"
  • checks (list[dict]) — every check that ran, with name / question / status / message / evidence
  • summary (dict) — count of passed / warning / failed
  • n_obs (int) — sample size
  • known_method (bool) — whether the method has a dedicated check table (False falls back to the universal checks only)

Examples:

>>> df = pd.DataFrame({'y': [1, 2, 3, 4],
...                    'treated': [0, 1, 0, 1],
...                    't': [0, 0, 1, 1]})
>>> sp.preflight(df, 'did', y='y', treat='treated', time='t')['verdict']
'WARN'  # n=4 is below the typical-minimum threshold of 50

See Also

sp.check_identification : Design-level diagnostics for an already-declared design.
sp.assumption_audit : Heavyweight: re-runs statistical tests after fitting.
sp.audit : Read-only checklist of robustness evidence ON a fitted result.

sp.audit(...)

audit

Reviewer-checklist audit of a fitted StatsPAI result.

sp.audit(result) returns the missing-evidence view of a result: which robustness / sensitivity / diagnostic checks a careful reviewer would expect for this estimator family — and which of those have already been run vs. are still missing on the result.

Distinct from neighbouring methods:

  • CausalResult.violations: items already on model_info whose values fail their threshold ("checked but failed").
  • CausalResult.next_steps: recommendations of what to do next, oriented around action (export, robustness, alternative method).
  • statspai.smart.assumption_audit: heavyweight; takes (result, data) and re-runs statistical tests. audit is pure introspection: it never re-runs anything, never touches data, and runs in microseconds.

The agent's mental model: audit answers "what evidence is missing for a reviewer to trust this estimate?"; assumption_audit answers "given the data, do the assumptions actually hold?".

Returns a JSON-safe dict so MCP-mediated agents can branch on the status field of each check without parsing prose.

audit

audit(result: Any) -> Dict[str, Any]

Reviewer-checklist audit of a fitted StatsPAI result.

Returns the missing-evidence view: which robustness / sensitivity / diagnostic checks the estimator family expects, and which of those are present, failed, or absent on the result.

This is read-only: it never re-runs a statistical test, never touches the original DataFrame, and runs in microseconds. Pair it with statspai.smart.assumption_audit (which does re-run tests against the data) when you need both perspectives.

Parameters:

Name Type Description Default
result CausalResult or EconometricResults

Any fitted StatsPAI result with model_info attached.

required

Returns:

Type Description
dict

JSON-safe payload with keys:

  • method (str) — the estimator's name
  • method_family (str) — one of "did" / "rd" / "iv" / "synth" / "matching" / "dml" / "hte" / "regression" / "generic"
  • checks (list[dict]) — every applicable check, each with name / question / status / severity / value / threshold / suggest_function / rationale
  • summary (dict) — count of passed / failed / missing / n_total
  • coverage (float in [0, 1]) — passed / n_total; agents can sort multiple results by reviewer-readiness

Examples:

>>> r = sp.did(df, y='wage', treat='treated', time='post')
>>> audit_card = sp.audit(r)
>>> for c in audit_card['checks']:
...     if c['status'] == 'missing' and c['severity'] == 'high':
...         print(c['suggest_function'])
sp.pretrends_test

See Also

statspai.smart.assumption_audit : Heavyweight counterpart: re-runs statistical tests against the original data and returns pass/fail per assumption.
CausalResult.violations : Items already on model_info whose values fail thresholds ("checked-but-failed" view).
CausalResult.next_steps : Action-oriented recommendations for what to do next.
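
To make the summary and coverage fields concrete, the sketch below recomputes them from a hand-built checks list and extracts the follow-up calls an agent might queue. The check names other than the pre-trends one, the "medium" severity value, and both helper functions are assumptions for illustration; only the key names and the coverage = passed / n_total definition come from the text above.

```python
from typing import Any, Dict, List

# Hand-built checks using the documented keys (abridged). Names such as
# "check_a" are placeholders, not real StatsPAI check names.
checks: List[Dict[str, Any]] = [
    {"name": "pretrends", "status": "missing", "severity": "high",
     "suggest_function": "sp.pretrends_test"},
    {"name": "check_a", "status": "passed", "severity": "medium",
     "suggest_function": None},
    {"name": "check_b", "status": "failed", "severity": "high",
     "suggest_function": None},
    {"name": "check_c", "status": "passed", "severity": "high",
     "suggest_function": None},
]


def summarise(items: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Recompute summary counts and coverage = passed / n_total."""
    summary = {status: sum(c["status"] == status for c in items)
               for status in ("passed", "failed", "missing")}
    summary["n_total"] = len(items)
    coverage = summary["passed"] / len(items) if items else 0.0
    return {"summary": summary, "coverage": coverage}


def missing_high_severity(items: List[Dict[str, Any]]) -> List[str]:
    """Suggested functions for checks that are missing and high severity."""
    return [c["suggest_function"] for c in items
            if c["status"] == "missing" and c["severity"] == "high"
            and c["suggest_function"]]


card = summarise(checks)
todo = missing_high_severity(checks)
print(card["coverage"], todo)
```

Sorting several results by card["coverage"] gives the reviewer-readiness ordering mentioned in the Returns description.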

sp.examples(...)

examples

Runnable code-example surface for StatsPAI agents.

sp.examples(name) is the agent-discoverable entry for "show me how to call this function". Different from neighbouring APIs:

  • sp.describe_function returns the full registry record (params / assumptions / failure_modes etc.), which is useful but verbose.
  • sp.recommend walks DATA + research question → estimator selection.
  • sp.examples answers: "I know I want sp.{name}; show me one short, copy-pasteable Python snippet that exercises it." Agents need this to bootstrap a fresh notebook without reading docs.

Per-method curated snippets cover the flagship surface (regress / did / callaway_santanna / rdrobust / ivreg / ebalance / synth / metalearners). For any other registered function, falls back to the example field stored on the registry entry.

examples

examples(name: str) -> Dict[str, Any]

Return runnable code examples + registry metadata for a function.

Parameters:

Name Type Description Default
name str

Canonical StatsPAI function name (e.g. "did", "regress", "callaway_santanna"). Lower-cased and stripped before lookup.

required

Returns:

Type Description
dict

JSON-safe payload with keys:

  • name (str)
  • category (str | None) — registry category
  • description (str) — one-line summary from the registry
  • signature (str | None) — example call from the registry
  • examples (list[dict]) — curated runnable snippets (each with title / code); empty if no curated snippet exists for this function. The snippets are intentionally short (≤ 20 lines) so an agent can paste them whole into a fresh REPL.
  • pre_conditions (list[str])
  • assumptions (list[str])
  • alternatives (list[str])
  • known_function (bool) — True if registry has the function, False if a fallback record was synthesised

Raises:

Type Description
TypeError

If name is not a string.

Examples:

>>> ex = sp.examples("did")
>>> ex["examples"][0]["title"]
'Classic 2x2 DID'

See Also

sp.describe_function : Full registry record (params / failure_modes / etc.).
sp.list_functions : Discover available function names.
sp.recommend : Method advisor when you don't yet know which function to call.
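
One way an agent might consume this payload is sketched below: prefer the first curated snippet, fall back to the registry signature, and give up when known_function is False. The records are hand-built and abridged to the keys the helper reads; starter_code and the name my_estimator are hypothetical, and the snippet text simply reuses an sp.did call that appears elsewhere on this page.

```python
from typing import Any, Dict, Optional

# Hand-built records with a subset of the documented keys.
curated: Dict[str, Any] = {
    "name": "did",
    "signature": None,
    "examples": [
        {"title": "Classic 2x2 DID",
         "code": "r = sp.did(df, y='y', treat='treated', time='t')"},
    ],
    "known_function": True,
}
fallback: Dict[str, Any] = {
    "name": "my_estimator",  # hypothetical, unregistered name
    "signature": None,
    "examples": [],
    "known_function": False,
}


def starter_code(record: Dict[str, Any]) -> Optional[str]:
    """Pick something pasteable: curated snippet first, then the signature."""
    if not record["known_function"]:
        return None  # synthesised fallback record; nothing reliable to paste
    if record["examples"]:
        return record["examples"][0]["code"]
    return record["signature"]


print(starter_code(curated))
print(starter_code(fallback))
```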

sp.session(...)

session

Deterministic-RNG context manager for reproducible agent loops.

sp.session(seed=42) snapshots the RNG state of Python's random and NumPy's legacy global MT19937 generator (the one backing np.random.randn, np.random.choice, etc.), applies the new seed for the duration of the with block, and restores the prior state on exit. Optional extras (PyTorch, JAX) are handled only when those libraries are already imported; sp.session never imports or installs them itself.

What it does NOT cover

np.random.default_rng() creates a fresh PCG64 generator seeded from OS entropy each time it is called — those generators have no process-global state for sp.session to manipulate. If your code calls rng = np.random.default_rng() inside the block, the draws will be different on every run regardless of the session seed. To get deterministic default_rng draws, pass the seed explicitly:

>>> with sp.session(seed=42) as state:
...     rng = np.random.default_rng(state.seed)  # explicit seed
...     x = rng.normal(size=5)

Threading

Not thread-safe. The snapshot lives in a context-manager local but the target state (Python random and NumPy legacy globals) is process-wide. Two threads that enter sp.session simultaneously will trample each other's snapshots and produce non-deterministic results with no error. For parallel workloads, instantiate np.random.default_rng(seed) per-thread and thread the generator through your call stack instead of relying on sp.session.
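
The per-thread pattern recommended above can be sketched with the standard library alone. This example uses random.Random(seed) as a stand-in for np.random.default_rng(seed) so that it runs without NumPy; the idea is the same: each worker owns a private generator and no process-global RNG state is touched. The seed offsets and function names are illustrative choices.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import List


def worker(seed: int) -> List[float]:
    """Draw from a private generator; the global RNG is never used."""
    rng = random.Random(seed)  # stand-in for np.random.default_rng(seed)
    return [rng.random() for _ in range(3)]


def run(base_seed: int = 42, n_workers: int = 4) -> List[List[float]]:
    """Give each worker its own seed and collect results in submission order."""
    seeds = [base_seed + i for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(worker, seeds))


first = run()
second = run()
print(first == second)
```

Because every generator is local to its worker, the draws do not depend on how the threads are scheduled.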

The point: agents iterate. A bootstrap CI that drifts between calls because different RNG state was active is a debugging nightmare. with sp.session(seed=42): ... makes every call reproducible without polluting the global RNG state outside the block.

Usage

>>> import statspai as sp
>>> import numpy as np

>>> with sp.session(seed=42):
...     a = np.random.randn(3)
...     # any sp.xxx call inside is deterministic

>>> # State outside the block is untouched.
>>> b = np.random.randn(3)  # uses prior global state

session

session(seed: Optional[int] = None, *, torch: bool = True, jax: bool = True, pythonhashseed: bool = False) -> Iterator[Any]

Set every reachable RNG to a known seed for the duration of the with block, then restore the prior state on exit.

Parameters:

Name Type Description Default
seed int

Seed value. None (the default) means "snapshot current state but don't reseed" — useful for opportunistic save / restore around code that you don't want to leak RNG drift.

None
torch bool

Seed PyTorch (CPU + CUDA) when the library is already imported. Never imports torch on its own.

True
jax bool

Yield a fresh JAX PRNGKey(seed) to the caller via the jax_key attribute on the yielded session object, when JAX is already imported. Never imports jax on its own. (JAX has no global state so we can't seed it — agents must thread the key explicitly.)

True
pythonhashseed bool

Set PYTHONHASHSEED for the duration of the block. Most causal-inference numerics don't depend on dict iteration order, but spec-curve enumerators and graph-based DAG search sometimes do. Off by default to avoid surprising downstream callers.

False

Yields:

Type Description
SessionState

Object exposing the seed in use as .seed and (when JAX is imported) a .jax_key PRNGKey. Tests / agents can hold a reference to verify the seed inside the block.

Examples:

>>> import numpy as np, statspai as sp
>>> with sp.session(seed=42):
...     a = np.random.randn(3)
>>> with sp.session(seed=42):
...     b = np.random.randn(3)
>>> bool((a == b).all())
True

Notes

Restoration is best-effort: if a library was lazily imported INSIDE the with block (and thus had no prior state), the exit handler skips restoring it. The intended use is small, deterministic blocks of estimator + bootstrap calls — not long-running session orchestration.
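
The snapshot, seed, restore cycle can be illustrated with a small standard-library context manager. This is a sketch of the general pattern only, not StatsPAI's implementation: it covers Python's random module and nothing else (no NumPy, PyTorch, or JAX handling), and the name seeded is hypothetical.

```python
import random
from contextlib import contextmanager
from typing import Iterator, Optional


@contextmanager
def seeded(seed: Optional[int] = None) -> Iterator[Optional[int]]:
    """Snapshot the global `random` state, optionally reseed, restore on exit."""
    state = random.getstate()
    try:
        if seed is not None:
            random.seed(seed)
        yield seed
    finally:
        random.setstate(state)  # runs even if the block raises


# Record what the global stream would produce next after seed 0.
random.seed(0)
expected_next = random.random()

random.seed(0)
with seeded(42):
    a = random.random()
with seeded(42):
    b = random.random()
after = random.random()  # continues the seed-0 stream, unaffected by the blocks

print(a == b, after == expected_next)
```

Passing seed=None gives the "snapshot but don't reseed" behaviour described for the seed parameter: draws inside the block advance the stream, and the exit handler rewinds it.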

sp.brief(...)

brief

One-line dashboard summaries of fitted StatsPAI results.

sp.brief(result) and result.brief() return a single-line status string under ~120 characters: enough to scan a list of results in an agent-orchestrated workflow without paying the token cost of a full to_dict(detail="agent") payload per item.

Format

[METHOD] estimand=ATT  est=0.412 (se=0.087)  95% CI [0.241, 0.583]  ***  N=2,000  ⚠ pretrend

Columns:

  • [METHOD] — method label (truncated to 24 chars)
  • estimand= — ATT / ATE / LATE / etc.
  • est= — point estimate to 3 sig figs
  • (se=...) — standard error
  • 95% CI [..., ...] — confidence interval at the result's alpha
  • *** / ** / * — significance stars (omitted if p ≥ 0.10)
  • N= — sample size with thousands separator
  • ⚠ ... — first violations() flag at error severity, if any
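
The column layout above can be mimicked with a few f-strings. The sketch below is an approximation for illustration, not the library's formatter: the star cut-offs (0.01 / 0.05 / 0.10) are the conventional ones and are assumed here, the exact spacing may differ from real output, and the p-value in the usage line is made up.

```python
from typing import Optional, Tuple


def stars(pvalue: float) -> str:
    """Conventional significance stars (assumed cut-offs); empty if p >= 0.10."""
    if pvalue < 0.01:
        return "***"
    if pvalue < 0.05:
        return "**"
    if pvalue < 0.10:
        return "*"
    return ""


def brief_line(method: str, estimand: str, est: float, se: float,
               ci: Tuple[float, float], pvalue: float, n_obs: int,
               alpha: float = 0.05, flag: Optional[str] = None) -> str:
    """Assemble a one-line summary in roughly the documented column order."""
    level = round(100 * (1 - alpha))
    parts = [
        f"[{method[:24]}] estimand={estimand}",   # label truncated to 24 chars
        f"est={est:.3g} (se={se:.3g})",           # 3 significant figures
        f"{level}% CI [{ci[0]:.3g}, {ci[1]:.3g}]",
    ]
    if stars(pvalue):
        parts.append(stars(pvalue))
    parts.append(f"N={n_obs:,}")                  # thousands separator
    if flag:
        parts.append(f"⚠ {flag}")
    return "  ".join(parts)


line = brief_line("did_2x2", "ATT", 0.412, 0.087, (0.241, 0.583),
                  pvalue=0.001, n_obs=2000, flag="pretrend")
print(line)
```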

Distinct from siblings:

  • CausalResult.summary: multi-line prose for humans (KB-scale).
  • CausalResult.to_dict (with detail="minimal"): JSON payload of ~300 chars; brief() is ~100 chars and human-scannable, intended for agent dashboards rather than tool-result payloads.

brief

brief(result: Any) -> str

Render a one-line status summary of a fitted result.

Parameters:

Name Type Description Default
result CausalResult or EconometricResults

A fitted result, or any object exposing method / estimate / se / pvalue / ci / n_obs attributes.

required

Returns:

Type Description
str

A single-line status string under ~120 characters. Intended for agent dashboards / multi-result comparisons. JSON-safe (it's just a string).

Examples:

>>> r = sp.did(df, y='y', treat='treated', time='t')
>>> sp.brief(r)
"[did_2x2]   estimand=ATT  est=0.412 (se=0.087)  95% CI [0.241, 0.583]  ***  N=2,000"

See Also

CausalResult.summary : Multi-line prose summary for humans.
CausalResult.to_dict : Full JSON payload at minimal/standard/agent detail levels.

sp.bib_for(...)

bib_for

bib_for(result: Any) -> Dict[str, Any]

Top-level structured citation for a fitted result.

Convenience entry that pairs with result.cite(format="json") so agents that don't have direct access to the result method can pull the structured payload via sp.bib_for(...) instead.

Parameters:

Name Type Description Default
result CausalResult or EconometricResults

Any fitted result object exposing a .cite() method.

required

Returns:

Type Description
dict

Same shape as result.cite(format="json"): {type, key, authors, year, title, journal, volume, number, pages, publisher, fields}.

Examples:

>>> r = sp.did(df, y='y', treat='treated', time='t', post='post')
>>> sp.bib_for(r)['key']
'angrist2009mostly'
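
A structured payload of this shape is easy to turn into a BibTeX entry. The sketch below works on a hand-built dict rather than real sp.bib_for output: the value types (authors as a list of strings, unused fields as None) are assumptions, and to_bibtex is a hypothetical helper. The bibliographic details are those of the book the key above refers to.

```python
from typing import Any, Dict

# Hand-built entry with the documented keys; value types are assumed.
entry: Dict[str, Any] = {
    "type": "book",
    "key": "angrist2009mostly",
    "authors": ["Joshua D. Angrist", "Jörn-Steffen Pischke"],
    "year": 2009,
    "title": "Mostly Harmless Econometrics: An Empiricist's Companion",
    "journal": None,
    "volume": None,
    "number": None,
    "pages": None,
    "publisher": "Princeton University Press",
    "fields": {},
}


def to_bibtex(e: Dict[str, Any]) -> str:
    """Render the payload as a BibTeX entry, skipping empty fields."""
    lines = [f"@{e['type']}{{{e['key']},"]
    lines.append(f"  author = {{{' and '.join(e['authors'])}}},")
    for name in ("title", "year", "journal", "volume", "number",
                 "pages", "publisher"):
        value = e.get(name)
        if value not in (None, ""):
            lines.append(f"  {name} = {{{value}}},")
    lines.append("}")
    return "\n".join(lines)


bibtex = to_bibtex(entry)
print(bibtex)
```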