Smart Workflow¶

statspai.smart — estimator recommendation, comparison, assumption auditing, and posterior verification.

`sp.recommend`¶

rec = sp.recommend(
    df,
    outcome='earnings',
    treatment='training',
    covariates=['age', 'educ', 'prior_earnings'],
    design='observational',          # 'rct' | 'observational' | 'did' | 'rd' | 'iv' | 'synth'
    verify=True,                     # run posterior verification (v0.9.3)
)
rec.summary()                        # ranked estimators with rationale
rec.recommended_method
rec.plot('verify_radar')             # stability breakdown per method
rec.to_latex()

`sp.compare_estimators`¶

Run multiple estimators on the same data and show a coefficient-stability forest:

cmp = sp.compare_estimators(
    df, outcome='y', treatment='d', covariates=[...],
    methods=['ols', 'psm', 'dml', 'aipw', 'tmle', 'causal_forest'],
)
cmp.plot_forest()
cmp.table()

`sp.assumption_audit`¶

One-call audit of the most common identification assumptions:

audit = sp.assumption_audit(df, outcome='y', treatment='d', covariates=[...])
audit.overlap                        # propensity score overlap diagnostic
audit.covariate_balance              # Love plot of standardised diffs
audit.placebo_outcomes               # pre-treatment placebos
audit.instrument_strength            # first-stage F if IV specified
audit.parallel_trends                # pre-trend placebo if DID
audit.summary()

`sp.verify` / `sp.verify_benchmark` (v0.9.3)¶

Posterior verification of any sp.recommend() output — aggregates three signals into a verify_score ∈ [0, 100]:

v = sp.verify(
    rec,
    n_boot=500,
    n_subsample=100,
    subsample_frac=0.8,
    n_placebo=20,
)

v.verify_score                       # 0–100 composite
v.components                         # dict: bootstrap / placebo / subsample
v.failures                           # methods that failed verification
v.plot('radar')                      # visual per-method breakdown

Calibration card: top-method verify_score is typically 85–95 on clean DGPs (RD lower at ≈ 74 due to local-polynomial bootstrap variance). sp.verify_benchmark(...) runs verify against synthetic DGPs to calibrate what threshold constitutes "trust it".

Agent-native method-level API¶

These functions are guide-friendly, but they are also public API calls used by agents and notebook workflows. Their docstrings are exposed here so the Reference navigation includes method-level details.

`sp.detect_design(...)`¶

detect_design ¶

Heuristic study-design detection from a raw DataFrame.

sp.detect_design(data, **hints) answers the agent's first question on receiving unfamiliar data: what kind of dataset is this? — cross-section, panel, or something with an obvious RD running variable.

Distinct from siblings:

`sp.recommend` — needs a research question (outcome, treatment) and recommends an estimator; detect_design only inspects shape.
`sp.check_identification` — diagnoses a specific declared design; detect_design decides which design is plausible at all.

The function is intentionally heuristic — it reports a confidence, ranks alternatives, and surfaces every column-role candidate it considered, so an agent can override with hints (unit=... / time=... / running_var=...) when the heuristic is wrong.

detect_design ¶

detect_design(data: DataFrame, *, unit: Optional[str] = None, time: Optional[str] = None, running_var: Optional[str] = None, cutoff: Optional[float] = None) -> Dict[str, Any]

Detect the most plausible study design from a DataFrame.

Heuristic — never definitive. Returns ranked candidates so the agent can override with hints (unit=... / time=... / running_var=... / cutoff=...) when the auto-detection is wrong.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	The dataset to inspect. Must have ≥ 1 row.	required
`unit`	`str`	Column the caller has already identified as the unit ID. Skips unit-detection and pins this column.	`None`
`time`	`str`	Column the caller has already identified as the time dimension.	`None`
`running_var`	`str`	Force this numeric column to be evaluated as an RD running variable.	`None`
`cutoff`	`float`	RD cutoff value (the heuristic does NOT auto-discover this).	`None`

Returns:

Type	Description
`dict`	JSON-safe payload with keys:

design (str) — "panel" / "rd" / "cross_section"
confidence (float in [0, 1])
identified (dict[str, str|float]) — column-role assignments for the chosen design
candidates (list[dict]) — every alternative considered, each with its own design / confidence / details. Use this to override when the top pick is wrong.
n_obs (int) — sample size
columns (list[str]) — input column names

Examples:

Panel data — a balanced firm × year layout is recognised as a panel:

>>> import statspai as sp
>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(0)
>>> df = pd.DataFrame({
...     'firm_id': np.repeat(range(50), 10),
...     'year': np.tile(range(2010, 2020), 50),
...     'sales': rng.standard_normal(500),
... })
>>> sp.detect_design(df)['design']
'panel'

Cross-section — no (unit, time) structure to exploit:

>>> df = pd.DataFrame({'x': rng.standard_normal(200),
...                    'y': rng.standard_normal(200)})
>>> sp.detect_design(df)['design']
'cross_section'
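
A minimal sketch of how an agent might act on the returned payload, assuming only the keys documented above. The payload literal, the `pick_design` helper, and the 0.6 confidence threshold are illustrative and not part of StatsPAI; the contents of each candidate's `details` are not specified here, so the sketch leaves them empty.

```python
from typing import Any, Dict


def pick_design(payload: Dict[str, Any], min_confidence: float = 0.6) -> Dict[str, Any]:
    """Accept the top pick, but flag it when confidence is below a chosen threshold."""
    # Rank every alternative that differs from the top pick, best first.
    alternatives = sorted(
        (c for c in payload["candidates"] if c["design"] != payload["design"]),
        key=lambda c: c["confidence"],
        reverse=True,
    )
    return {
        "design": payload["design"],
        "needs_hints": payload["confidence"] < min_confidence,
        "runner_up": alternatives[0]["design"] if alternatives else None,
    }


# Hand-written payload shaped like the documented return value (values are made up).
payload = {
    "design": "panel",
    "confidence": 0.55,
    "identified": {"unit": "firm_id", "time": "year"},
    "candidates": [
        {"design": "panel", "confidence": 0.55, "details": {}},
        {"design": "cross_section", "confidence": 0.40, "details": {}},
    ],
    "n_obs": 500,
    "columns": ["firm_id", "year", "sales"],
}

decision = pick_design(payload)
print(decision)
```

When `needs_hints` is true, the natural follow-up is to call `sp.detect_design` again with explicit `unit=` / `time=` / `running_var=` hints rather than trusting the top pick.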

See Also

sp.recommend : Method advisor — pair this with a declared research question (outcome / treatment) and it recommends an estimator.
sp.check_identification : Design-level diagnostics for an already declared design.

`sp.preflight(...)`¶

preflight ¶

Method-specific pre-estimation diagnostics.

sp.preflight(data, method, **kwargs) runs cheap, method-specific shape and content checks BEFORE the agent commits to an expensive estimator call. Different from the neighbours:

`statspai.smart.check_identification` — design-level diagnostics for an already-declared design (DID / RD / IV / observational). Heavier and broader.
`statspai.smart.assumption_audit` — heavyweight: re-runs statistical tests against the data after the model is fit.
`statspai.smart.audit` — read-only checklist of robustness evidence ON a fitted result.

preflight answers: "if I call sp.{method}(data, ...) with these arguments, will it work, and is the data the right shape?" — a quick gate the agent can run first to avoid wasting tokens on bad calls.

Per-method check tables cover the curated agent-tool surface (regress / did / callaway_santanna / rdrobust / ivreg / ebalance); unknown methods get the universal sanity checks only (data is a non-empty DataFrame, sample size sanity).

CheckResult `module-attribute` ¶

CheckResult = Tuple[str, str, Dict[str, Any]]

(status, message, evidence) — status in {passed, warning, failed}.
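
The tuple shape above can be illustrated with a small roll-up. This is a sketch under stated assumptions, not StatsPAI's implementation: the `summarise` helper is hypothetical, and the "worst status wins" rule for deriving a PASS / WARN / FAIL verdict is an assumption about how such tuples would plausibly be aggregated.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple

# Same alias as documented: (status, message, evidence).
CheckResult = Tuple[str, str, Dict[str, Any]]


def summarise(checks: List[CheckResult]) -> Dict[str, Any]:
    """Count statuses and derive an overall verdict (assumed rule: worst status wins)."""
    counts = Counter(status for status, _message, _evidence in checks)
    summary = {s: counts.get(s, 0) for s in ("passed", "warning", "failed")}
    if summary["failed"]:
        verdict = "FAIL"
    elif summary["warning"]:
        verdict = "WARN"
    else:
        verdict = "PASS"
    return {"verdict": verdict, "summary": summary}


# Illustrative check tuples; the messages are made up.
checks: List[CheckResult] = [
    ("passed", "column 'y' exists", {"column": "y"}),
    ("warning", "sample size is small", {"n_obs": 4}),
]

result = summarise(checks)
print(result)
```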

preflight ¶

preflight(data: DataFrame, method: str, **kwargs: Any) -> Dict[str, Any]

Method-specific pre-estimation diagnostics.

Runs cheap, method-aware checks (column existence, data shape, treatment binarity, sample size) BEFORE the agent commits to an expensive estimator call. Use the verdict to decide whether to proceed, fix arguments, or pivot to a different method.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Same DataFrame the agent plans to pass to `sp.{method}(...)`.	required
`method`	`str`	Name of the StatsPAI estimator to pre-flight (e.g. `"did"`, `"rdrobust"`, `"ivreg"`, `"callaway_santanna"`, `"ebalance"`). Unknown methods get only the universal DataFrame-shape sanity checks.	required
`**kwargs`	`Any`	Estimator arguments — column names (`y`, `treat`, `time`, `i`, `running_var`), a Wilkinson `formula`, a `covariates` list, etc. Passed through unchanged from what the agent intends to use; `preflight` doesn't run the estimator.	`{}`

Returns:

Type	Description
`dict`	JSON-safe payload with keys:

method (str) — input method name (lower-cased)
verdict (str) — "PASS" / "WARN" / "FAIL"
checks (list[dict]) — every check that ran, with name / question / status / message / evidence
summary (dict) — count of passed / warning / failed
n_obs (int) — sample size
known_method (bool) — whether the method has a dedicated check table (False falls back to the universal checks only)

Examples:

>>> import statspai as sp
>>> import pandas as pd
>>> df = pd.DataFrame({'y': [1, 2, 3, 4],
...                    'treated': [0, 1, 0, 1],
...                    't': [0, 0, 1, 1]})
>>> # n=4 is below the typical-minimum threshold, so preflight warns:
>>> sp.preflight(df, 'did', y='y', treat='treated', time='t')['verdict']
'WARN'
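
One way an agent could use the verdict to proceed, fix arguments, or pivot is sketched below. The report literal mirrors the documented keys but its values are made up, and the `triage` helper and its action names are hypothetical, not part of StatsPAI.

```python
from typing import Any, Dict, List


def triage(report: Dict[str, Any]) -> Dict[str, Any]:
    """Map a preflight-style report to a next step plus the checks worth reading."""
    # Hypothetical policy: these action names belong to this sketch only.
    actions = {"PASS": "run_estimator", "WARN": "run_with_caveats", "FAIL": "fix_or_pivot"}
    to_review: List[str] = [
        f"{c['name']}: {c['message']}"
        for c in report["checks"]
        if c["status"] != "passed"
    ]
    return {"action": actions[report["verdict"]], "to_review": to_review}


# Hand-written report shaped like the documented return value.
report = {
    "method": "did",
    "verdict": "WARN",
    "checks": [
        {"name": "columns_exist", "question": "Are all columns present?",
         "status": "passed", "message": "all columns found", "evidence": {}},
        {"name": "sample_size", "question": "Is the sample large enough?",
         "status": "warning", "message": "n=4 is very small", "evidence": {"n_obs": 4}},
    ],
    "summary": {"passed": 1, "warning": 1, "failed": 0},
    "n_obs": 4,
    "known_method": True,
}

plan = triage(report)
print(plan)
```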

See Also

sp.check_identification : Design-level diagnostics for an already-declared design.
sp.assumption_audit : Heavyweight: re-runs statistical tests after fitting.
sp.audit : Read-only checklist of robustness evidence ON a fitted result.

`sp.audit(...)`¶

audit ¶

Reviewer-checklist audit of a fitted StatsPAI result.

sp.audit(result) returns the missing-evidence view of a result: which robustness / sensitivity / diagnostic checks a careful reviewer would expect for this estimator family — and which of those have already been run vs. are still missing on the result.

Distinct from neighbouring methods:

`CausalResult.violations` — items already on model_info whose values fail their threshold ("checked but failed").
`CausalResult.next_steps` — recommendations of what to do next, oriented around action (export, robustness, alternative method).
`statspai.smart.assumption_audit` — heavyweight: takes (result, data) and re-runs statistical tests. audit is pure introspection — it never re-runs anything, never touches data, and runs in microseconds.

The agent's mental model: audit answers "what evidence is missing for a reviewer to trust this estimate?"; assumption_audit answers "given the data, do the assumptions actually hold?".

Returns a JSON-safe dict so MCP-mediated agents can branch on the status field of each check without parsing prose.

audit ¶

audit(result: Any, *, treatment: Optional[str] = None) -> Dict[str, Any]

Reviewer-checklist audit of a fitted StatsPAI result.

Returns the missing-evidence view: which robustness / sensitivity / diagnostic checks the estimator family expects, and which of those are present, failed, or absent on the result.

This is read-only: it never re-runs a statistical test, never touches the original DataFrame, and runs in microseconds. Pair it with `statspai.smart.assumption_audit` (which does re-run tests against the data) when you need both perspectives.

Parameters:

Name	Type	Description	Default
`result`	`CausalResult or EconometricResults`	Any fitted StatsPAI result with `model_info` attached.	required
`treatment`	`str`	Name of the treatment variable when `result` is a plain regression used for causal adjustment on a selection-on-observables design. When supplied, the audit additionally asks for overlap / common-support, post-adjustment balance, and omitted-variable sensitivity — the checks a referee demands on an observational design but that a bare OLS would otherwise escape. Has no effect on designs whose family already carries these checks (matching / DML / IV / …). Descriptive regressions (no treatment declared) are never flagged.	`None`

Returns:

Type	Description
`dict`	JSON-safe payload with keys:

method (str) — the estimator's name
method_family (str) — one of "did" / "rd" / "iv" / "synth" / "matching" / "dml" / "hte" / "regression" / "generic"
checks (list[dict]) — every applicable check, each with name / question / status / severity / value / threshold / suggest_function / rationale
summary (dict) — count of passed / failed / missing / n_total
coverage (float in [0, 1]) — passed / n_total; agents can sort multiple results by reviewer-readiness

Examples:

>>> import statspai as sp
>>> import numpy as np
>>> import pandas as pd
>>> rng = np.random.default_rng(5)
>>> rows = []
>>> for i in range(200):
...     tr = 1 if i < 100 else 0
...     for t in (0, 1):
...         y = (1.0 + 0.3 * t + 0.5 * tr + 2.0 * tr * t
...              + rng.normal(scale=0.5))
...         rows.append({'i': i, 't': t, 'treated': tr,
...                      'post': t, 'wage': y})
>>> df = pd.DataFrame(rows)
>>> r = sp.did(df, y='wage', treat='treated', time='t', post='post')
>>> audit_card = sp.audit(r)
>>> for c in audit_card['checks']:
...     if c['status'] == 'missing' and c['importance'] == 'high':
...         print(c['suggest_function'])
sp.pretrends_test
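
The `coverage` field is documented as passed / n_total, which is what lets an agent sort several results by reviewer-readiness. A minimal sketch of that ranking follows; the check names and the two audit cards are made up, the `coverage` helper is hypothetical, and returning 0.0 for an empty checklist is this sketch's own choice.

```python
from typing import Any, Dict, List


def coverage(checks: List[Dict[str, Any]]) -> float:
    """Share of applicable checks whose status is 'passed'."""
    n_total = len(checks)
    passed = sum(1 for c in checks if c["status"] == "passed")
    # Assumption: an empty checklist counts as zero coverage.
    return passed / n_total if n_total else 0.0


# Illustrative checklists keyed by a label the agent chose for each result.
cards = {
    "rd_main": [
        {"name": "density_test", "status": "failed"},
        {"name": "bandwidth_sensitivity", "status": "missing"},
    ],
    "did_main": [
        {"name": "pretrends", "status": "passed"},
        {"name": "placebo", "status": "missing"},
    ],
}

# Most reviewer-ready result first.
ranked = sorted(cards, key=lambda label: coverage(cards[label]), reverse=True)
print(ranked)
```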

See Also

statspai.smart.assumption_audit : Heavyweight counterpart: re-runs statistical tests against the original data and returns pass/fail per assumption.
CausalResult.violations : Items already on model_info whose values fail thresholds ("checked-but-failed" view).
CausalResult.next_steps : Action-oriented recommendations for what to do next.

`sp.examples(...)`¶

examples ¶

Runnable code-example surface for StatsPAI agents.

sp.examples(name) is the agent-discoverable entry for "show me how to call this function". Different from neighbouring APIs:

`sp.describe_function` returns the full registry record (params / assumptions / failure_modes etc.) — useful, but verbose.
`sp.recommend` walks DATA + research question → estimator selection.
`sp.examples` answers: "I know I want sp.{name}; show me one short, copy-pasteable Python snippet that exercises it." — agents need this to bootstrap a fresh notebook without reading docs.

Per-method curated snippets cover the flagship surface (regress / did / callaway_santanna / rdrobust / ivreg / ebalance / synth / metalearners). For any other registered function, falls back to the example field stored on the registry entry.

examples ¶

examples(name: str) -> Dict[str, Any]

Return runnable code examples + registry metadata for a function.

Parameters:

Name	Type	Description	Default
`name`	`str`	Canonical StatsPAI function name (e.g. `"did"`, `"regress"`, `"callaway_santanna"`). Lower-cased and stripped before lookup.	required

Returns:

Type	Description
`dict`	JSON-safe payload with keys:

name (str)
category (str | None) — registry category
description (str) — one-line summary from the registry
signature (str | None) — example call from the registry
examples (list[dict]) — curated runnable snippets (each with title / code); empty if no curated snippet exists for this function. The snippets are intentionally short (≤ 20 lines) so an agent can paste them whole into a fresh REPL.
pre_conditions (list[str])
assumptions (list[str])
alternatives (list[str])
known_function (bool) — True if registry has the function, False if a fallback record was synthesised

Raises:

Type	Description
`TypeError`	If `name` is not a string.

Examples:

>>> ex = sp.examples("did")
>>> ex["examples"][0]["title"]
'Classic 2x2 DID'
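
Because `examples` may be empty when no curated snippet exists, a caller probably wants a fallback. The sketch below assumes only the documented keys; the payload literals and the `first_snippet` helper are illustrative, and falling back to `signature` is one possible policy rather than documented behaviour.

```python
from typing import Any, Dict, Optional


def first_snippet(payload: Dict[str, Any]) -> Optional[str]:
    """Return the first curated snippet, else the registry signature, else None."""
    if payload["examples"]:
        return payload["examples"][0]["code"]
    return payload.get("signature")


# Hand-written payloads shaped like the documented return value.
curated = {
    "name": "did",
    "signature": "sp.did(df, y='y', treat='treated', time='t')",
    "examples": [
        {"title": "Classic 2x2 DID",
         "code": "r = sp.did(df, y='y', treat='treated', time='t')"},
    ],
    "known_function": True,
}
uncurated = {"name": "other", "signature": None, "examples": [], "known_function": False}

print(first_snippet(curated))
print(first_snippet(uncurated))
```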

See Also

sp.describe_function : Full registry record (params / failure_modes / etc.).
sp.list_functions : Discover available function names.
sp.recommend : Method advisor when you don't yet know which function to call.

`sp.session(...)`¶

session ¶

Deterministic-RNG context manager for reproducible agent loops.

sp.session(seed=42) snapshots the RNG state of Python's random and NumPy's legacy global MT19937 generator (the one backing np.random.randn, np.random.choice, etc.), applies the new seed for the duration of the with block, and restores the prior state on exit. Optional extras (PyTorch, JAX) are seeded only when those libraries are already importable — never auto-installed.
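
The snapshot, seed, restore cycle described above can be sketched with the standard library alone. This is not StatsPAI's implementation: the `seeded` helper is hypothetical and covers only Python's `random` module. NumPy's legacy global generator exposes the analogous `np.random.get_state()` / `np.random.set_state()` pair, which a fuller version would presumably wrap in the same way.

```python
import random
from contextlib import contextmanager
from typing import Iterator, Optional


@contextmanager
def seeded(seed: Optional[int] = None) -> Iterator[None]:
    """Snapshot the global `random` state, optionally reseed, restore on exit."""
    saved = random.getstate()
    try:
        if seed is not None:
            random.seed(seed)
        yield
    finally:
        # Runs even if the block raised, so the outer stream is never disturbed.
        random.setstate(saved)


# Record what the outer stream would produce next, then rewind to that point.
random.seed(123)
expected_next = random.random()
random.seed(123)

with seeded(42):
    a = [random.random() for _ in range(3)]
with seeded(42):
    b = [random.random() for _ in range(3)]

after = random.random()
print(a == b, after == expected_next)
```

The two blocks draw identical numbers, and the draw after them continues the outer stream as if the blocks had never run.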

What it does NOT cover

np.random.default_rng() creates a fresh PCG64 generator seeded from OS entropy each time it is called — those generators have no process-global state for sp.session to manipulate. If your code calls rng = np.random.default_rng() inside the block, the draws will be different on every run regardless of the session seed. To get deterministic default_rng draws, pass the seed explicitly:

with sp.session(seed=42) as state:
    rng = np.random.default_rng(state.seed)  # explicit seed
    x = rng.normal(size=5)

Threading

Not thread-safe. The snapshot lives in a context-manager local but the target state (Python random and NumPy legacy globals) is process-wide. Two threads that enter sp.session simultaneously will trample each other's snapshots and produce non-deterministic results with no error. For parallel workloads, instantiate np.random.default_rng(seed) per-thread and thread the generator through your call stack instead of relying on sp.session.
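
The per-thread generator pattern recommended above might look like the following sketch. To stay dependency-free it uses the standard library's `random.Random(seed)` as a stand-in for `np.random.default_rng(seed)`; the `worker` function and the seed list are illustrative.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import List


def worker(seed: int) -> List[float]:
    """Draw from a private generator so no process-wide RNG state is shared."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(3)]


seeds = [1, 2, 3, 4]

# Two independent parallel runs: thread scheduling may differ between them,
# but each worker's draws depend only on its own seed.
with ThreadPoolExecutor(max_workers=4) as pool:
    first = list(pool.map(worker, seeds))
with ThreadPoolExecutor(max_workers=4) as pool:
    second = list(pool.map(worker, seeds))

print(first == second)
```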

The point: agents iterate. A bootstrap CI that drifts between calls because different RNG state was active is a debugging nightmare. with sp.session(seed=42): ... makes every call reproducible without polluting the global RNG state outside the block.

Usage

import statspai as sp
import numpy as np

with sp.session(seed=42):
    a = np.random.randn(3)
    # any sp.xxx call inside is deterministic

# State outside the block is untouched.
b = np.random.randn(3)  # uses prior global state

session ¶

session(seed: Optional[int] = None, *, torch: bool = True, jax: bool = True, pythonhashseed: bool = False) -> Iterator[Any]

Set every reachable RNG to a known seed for the duration of the with block, then restore the prior state on exit.

Parameters:

Name	Type	Description	Default
`seed`	`int`	Seed value. `None` (the default) means "snapshot current state but don't reseed" — useful for opportunistic save / restore around code that you don't want to leak RNG drift.	`None`
`torch`	`bool`	Seed PyTorch (CPU + CUDA) when the library is already imported. Never imports torch on its own.	`True`
`jax`	`bool`	Yield a fresh JAX `PRNGKey(seed)` to the caller via the `jax_key` attribute on the yielded session object, when JAX is already imported. Never imports jax on its own. (JAX has no global state so we can't seed it — agents must thread the key explicitly.)	`True`
`pythonhashseed`	`bool`	Set `PYTHONHASHSEED` for the duration of the block. Most causal-inference numerics don't depend on dict iteration order, but spec-curve enumerators and graph-based DAG search sometimes do. Off by default to avoid surprising downstream callers.	`False`

Yields:

Type	Description
`SessionState`	Object exposing the seed in use as `.seed` and (when JAX is imported) a `.jax_key` PRNGKey. Tests / agents can hold a reference to verify the seed inside the block.

Examples:

>>> import numpy as np, statspai as sp
>>> with sp.session(seed=42):
...     a = np.random.randn(3)
>>> with sp.session(seed=42):
...     b = np.random.randn(3)
>>> bool((a == b).all())
True

Notes

Restoration is best-effort: if a library was lazily imported INSIDE the with block (and thus had no prior state), the exit handler skips restoring it. The intended use is small, deterministic blocks of estimator + bootstrap calls — not long-running session orchestration.

`sp.brief(...)`¶

brief ¶

One-line dashboard summaries of fitted StatsPAI results.

sp.brief(result) and result.brief() return a single-line status string under ~120 characters: enough to scan a list of results in an agent-orchestrated workflow without paying the token cost of a full to_dict(detail="agent") payload per item.

Format

[METHOD] estimand=ATT  est=0.412 (se=0.087)  95% CI [0.241, 0.583]  ***  N=2,000  ⚠ pretrend

Columns:

[METHOD] — method label (truncated to 24 chars)
estimand= — ATT / ATE / LATE / etc.
est= — point estimate to 3 sig figs
(se=...) — standard error
95% CI [..., ...] — confidence interval at the result's alpha
*** / ** / * — significance stars (omitted if p ≥ 0.10)
N= — sample size with thousands separator
⚠ ... — first violations() flag at error severity, if any
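
To make the column layout concrete, here is a sketch of a formatter that produces a line in this shape. It is not StatsPAI's implementation: `format_brief` and `stars` are hypothetical, the 0.01 / 0.05 / 0.10 star cut-offs are the conventional ones and are assumed (the text above only states that stars are omitted for p ≥ 0.10), and the confidence level is hard-coded to 95% for brevity.

```python
from typing import Optional, Tuple


def stars(p: float) -> str:
    """Significance stars under assumed conventional cut-offs."""
    if p < 0.01:
        return "***"
    if p < 0.05:
        return "**"
    if p < 0.10:
        return "*"
    return ""


def format_brief(method: str, estimand: str, est: float, se: float,
                 ci: Tuple[float, float], p: float, n: int,
                 flag: Optional[str] = None) -> str:
    """Assemble the one-line summary column by column."""
    parts = [
        f"[{method[:24]}] estimand={estimand}",   # label truncated to 24 chars
        f"est={est:.3g} (se={se:.3g})",           # 3 significant figures
        f"95% CI [{ci[0]:.3g}, {ci[1]:.3g}]",
    ]
    if stars(p):
        parts.append(stars(p))
    parts.append(f"N={n:,}")                      # thousands separator
    if flag:
        parts.append(f"⚠ {flag}")
    return "  ".join(parts)


line = format_brief("DID", "ATT", 0.412, 0.087, (0.241, 0.583),
                    p=0.0001, n=2000, flag="pretrend")
print(line)
```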

Distinct from siblings:

`CausalResult.summary` — multi-line prose for humans (KB-scale).
`CausalResult.to_dict` (with detail="minimal") — JSON payload ~ 300 chars; brief() is ~ 100 chars and human-scannable, intended for agent dashboards rather than tool-result payloads.

brief ¶

brief(result: Any) -> str

Render a one-line status summary of a fitted result.

Parameters:

Name	Type	Description	Default
`result`	`CausalResult or EconometricResults`	A fitted result, or any object exposing `method` / `estimate` / `se` / `pvalue` / `ci` / `n_obs` attributes.	required

Returns:

Type	Description
`str`	A single-line status string under ~120 characters. Intended for agent dashboards / multi-result comparisons. JSON-safe (it's just a string).

Examples:

>>> import statspai as sp
>>> import numpy as np
>>> import pandas as pd
>>> rng = np.random.default_rng(0)
>>> n = 200
>>> unit = np.repeat(np.arange(n // 2), 2)
>>> t = np.tile([0, 1], n // 2)
>>> treated = np.where(unit < (n // 4), 1, 0)
>>> post = (t == 1).astype(int)
>>> y = 1.0 + 0.8 * post + 1.5 * treated * post + rng.normal(size=n)
>>> df = pd.DataFrame({"y": y, "treated": treated, "t": t, "unit": unit})
>>> r = sp.did(df, y='y', treat='treated', time='t')
>>> line = sp.brief(r)
>>> isinstance(line, str) and "estimand=ATT" in line
True

See Also

CausalResult.summary : Multi-line prose summary for humans.
CausalResult.to_dict : Full JSON payload at minimal/standard/agent detail levels.

`sp.bib_for(...)`¶

bib_for ¶

bib_for(result: Any) -> Dict[str, Any]

Top-level structured citation for a fitted result.

Convenience entry that pairs with result.cite(format="json") so agents that don't have direct access to the result method can pull the structured payload via sp.bib_for(...) instead.

Parameters:

Name	Type	Description	Default
`result`	`CausalResult or EconometricResults`	Any fitted result object exposing a `.cite()` method.	required

Returns:

Type	Description
`dict`	Same shape as `result.cite(format="json")`: `{type, key, authors, year, title, journal, volume, number, pages, publisher, fields}`.

Examples:

>>> import statspai as sp
>>> import numpy as np
>>> import pandas as pd
>>> rng = np.random.default_rng(5)
>>> rows = []
>>> for i in range(200):
...     tr = 1 if i < 100 else 0
...     for t in (0, 1):
...         y = (1.0 + 0.3 * t + 0.5 * tr + 2.0 * tr * t
...              + rng.normal(scale=0.5))
...         rows.append({'i': i, 't': t, 'treated': tr, 'post': t, 'y': y})
>>> df = pd.DataFrame(rows)
>>> r = sp.did(df, y='y', treat='treated', time='t', post='post')
>>> sp.bib_for(r)['key']
'angrist2009mostly'
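
A structured citation like this is easy to render as BibTeX. The sketch below is hypothetical throughout: `to_bibtex` is not a StatsPAI function, `authors` is assumed to be a list of strings, absent fields are assumed to be `None`, and the entry literal is hand-written to match the key in the example above rather than copied from real output.

```python
from typing import Any, Dict


def to_bibtex(entry: Dict[str, Any]) -> str:
    """Render a citation dict as a BibTeX entry, skipping empty fields."""
    lines = [f"@{entry['type']}{{{entry['key']},"]
    lines.append(f"  author = {{{' and '.join(entry['authors'])}}},")
    lines.append(f"  year = {{{entry['year']}}},")
    for name in ("title", "journal", "volume", "number", "pages", "publisher"):
        value = entry.get(name)
        if value:
            lines.append(f"  {name} = {{{value}}},")
    lines.append("}")
    return "\n".join(lines)


# Hand-written entry (assumed shape) for Angrist and Pischke's 2009 book.
entry = {
    "type": "book",
    "key": "angrist2009mostly",
    "authors": ["Angrist, Joshua D.", "Pischke, Jörn-Steffen"],
    "year": 2009,
    "title": "Mostly Harmless Econometrics: An Empiricist's Companion",
    "journal": None,
    "volume": None,
    "number": None,
    "pages": None,
    "publisher": "Princeton University Press",
}

bibtex = to_bibtex(entry)
print(bibtex)
```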
