Smart Workflow
statspai.smart: estimator recommendation, comparison, assumption
auditing, and posterior verification.
sp.recommend
```python
import statspai as sp

# df: a pandas DataFrame holding the columns named below
rec = sp.recommend(
    df,
    outcome='earnings',
    treatment='training',
    covariates=['age', 'educ', 'prior_earnings'],
    design='observational',  # 'rct' | 'observational' | 'did' | 'rd' | 'iv' | 'synth'
    verify=True,             # run posterior verification (v0.9.3)
)
rec.summary()                # ranked estimators with rationale
rec.recommended_method       # the top-ranked estimator
rec.plot('verify_radar')     # stability breakdown per method
rec.to_latex()               # export to LaTeX
```
sp.compare_estimators
Run multiple estimators on the same data and show a coefficient-stability forest plot:
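```python
cmp = sp.compare_estimators(
    df, outcome='y', treatment='d', covariates=[...],  # list your covariate columns
    methods=['ols', 'psm', 'dml', 'aipw', 'tmle', 'causal_forest'],
)
cmp.plot_forest()   # coefficient-stability forest plot
cmp.table()         # estimates side by side
```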
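Why run several estimators at once? When an unadjusted estimator and a covariate-adjusted one disagree sharply, the disagreement is itself evidence of confounding. The sketch below illustrates the idea with plain NumPy on synthetic data; it does not call StatsPAI, and the data-generating process, the helper names `diff_in_means` and `ols_adjusted`, and the true effect of 2.0 are all assumptions made for illustration.

```python
import numpy as np

def diff_in_means(y, d):
    """Unadjusted treated-minus-control difference in mean outcomes."""
    return y[d == 1].mean() - y[d == 0].mean()

def ols_adjusted(y, d, X):
    """Coefficient on d from an OLS regression of y on [1, d, X]."""
    design = np.column_stack([np.ones_like(y), d, X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1]

# Synthetic confounded data: x raises both treatment uptake and the outcome.
rng = np.random.default_rng(0)
n = 2_000
x = rng.normal(size=n)
d = (x + rng.normal(size=n) > 0).astype(float)
y = 2.0 * d + 1.5 * x + rng.normal(size=n)   # true treatment effect = 2.0

estimates = {
    "diff_in_means": diff_in_means(y, d),      # biased upward by confounding
    "ols": ols_adjusted(y, d, x[:, None]),     # adjusts for x
}
for name, est in estimates.items():
    print(f"{name:>14}: {est:.3f}")
```

A forest plot of these two numbers would show them far apart, which is exactly the kind of instability the comparison is meant to surface.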
sp.assumption_audit
One-call audit of the most common identification assumptions:
```python
audit = sp.assumption_audit(df, outcome='y', treatment='d', covariates=[...])
audit.overlap              # propensity-score overlap diagnostic
audit.covariate_balance    # Love plot of standardised mean differences
audit.placebo_outcomes     # pre-treatment placebos
audit.instrument_strength  # first-stage F, if an IV is specified
audit.parallel_trends      # pre-trend placebo, if DID
audit.summary()
```
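The covariate-balance diagnostic rests on standardised mean differences. As a minimal sketch (not StatsPAI's implementation), the difference for one covariate is the treated-minus-control mean gap divided by the pooled standard deviation; the 0.1 cut-off below is a common rule of thumb, assumed here rather than taken from `assumption_audit`.

```python
import numpy as np

def standardized_mean_diff(x, d):
    """Treated-minus-control mean difference divided by the pooled SD."""
    x, d = np.asarray(x, dtype=float), np.asarray(d)
    x_t, x_c = x[d == 1], x[d == 0]
    pooled_sd = np.sqrt((x_t.var(ddof=1) + x_c.var(ddof=1)) / 2.0)
    return (x_t.mean() - x_c.mean()) / pooled_sd

# Toy data: treated units are older on average.
age = np.array([25, 30, 35, 40, 45, 50, 55, 60], dtype=float)
treated = np.array([0, 0, 0, 0, 1, 1, 1, 1])
smd = standardized_mean_diff(age, treated)
imbalanced = abs(smd) > 0.1   # rule-of-thumb threshold (assumed)
print(f"SMD(age) = {smd:.2f}, imbalanced = {imbalanced}")
```

A Love plot is simply these per-covariate values drawn on one axis, usually before and after weighting or matching.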
sp.verify / sp.verify_benchmark (v0.9.3)
Posterior verification of any sp.recommend() output. It aggregates
three signals (bootstrap, placebo, subsample) into a verify_score ∈ [0, 100]:
```python
v = sp.verify(
    rec,                 # output of sp.recommend()
    n_boot=500,          # bootstrap replicates
    n_subsample=100,
    subsample_frac=0.8,
    n_placebo=20,        # placebo runs
)
v.verify_score   # 0–100 composite
v.components     # dict: bootstrap / placebo / subsample
v.failures       # methods that failed verification
v.plot('radar')  # visual per-method breakdown
```
Calibration card: on clean DGPs (data-generating processes) the top
method's verify_score is typically 85–95; RD scores lower (≈ 74)
because of bootstrap variance in the local-polynomial fit.
sp.verify_benchmark(...) runs verify against synthetic DGPs so you can
calibrate which threshold constitutes "trust it".
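The exact formula behind verify_score is not spelled out here, so the following is only a hypothetical sketch of what a "bootstrap" component and a composite could look like: the share of bootstrap replicates whose estimate keeps the full-sample sign, scaled to 0–100, and an equal-weight average of component scores. Both the sign-stability rule and the equal weighting are assumptions for illustration, not StatsPAI's algorithm.

```python
import numpy as np

def diff_in_means(y, d):
    return y[d == 1].mean() - y[d == 0].mean()

def bootstrap_sign_stability(y, d, n_boot=500, seed=0):
    """Percent (0-100) of bootstrap replicates whose estimate has the same
    sign as the full-sample estimate (hypothetical component)."""
    rng = np.random.default_rng(seed)
    full_sign = np.sign(diff_in_means(y, d))
    n, agree = len(y), 0
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample rows with replacement
        yb, db = y[idx], d[idx]
        if db.min() == db.max():           # only one arm drawn: count as a disagreement
            continue
        agree += int(np.sign(diff_in_means(yb, db)) == full_sign)
    return 100.0 * agree / n_boot

def composite_score(components):
    """Hypothetical equal-weight average of component scores in [0, 100]."""
    return sum(components.values()) / len(components)

rng = np.random.default_rng(1)
d = rng.integers(0, 2, size=400)
y = 1.0 * d + rng.normal(size=400)         # clear positive effect
boot = bootstrap_sign_stability(y, d)
# placebo / subsample scores below are placeholders, not computed here
score = composite_score({"bootstrap": boot, "placebo": 90.0, "subsample": 95.0})
print(f"bootstrap={boot:.1f}, composite={score:.1f}")
```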
Agent-native method-level API¶
These functions appear in the workflow guides, but they are also public API calls used by agents and notebook workflows. Their docstrings are reproduced here so that the Reference navigation includes method-level details.
sp.detect_design(...)
Heuristic study-design detection from a raw DataFrame.
sp.detect_design(data, **hints) answers the agent's first question
on receiving unfamiliar data: what kind of dataset is this? A
cross-section, a panel, or something with an obvious RD running variable?
Distinct from siblings:
- `sp.recommend` needs a research question (outcome, treatment) and recommends an estimator; `detect_design` only inspects shape.
- `sp.check_identification` diagnoses a specific declared design; `detect_design` decides which design is plausible at all.
The function is intentionally heuristic: it reports a confidence,
ranks alternatives, and surfaces every column-role candidate it
considered, so an agent can override it with hints (unit=... /
time=... / running_var=...) when the heuristic is wrong.
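The core panel-versus-cross-section question can be illustrated with a small pandas heuristic. This is a hypothetical sketch, not the logic inside sp.detect_design: it calls the data a panel when some non-float column repeats across rows and, together with an integer-valued time column, uniquely identifies each row. It ignores RD running variables, confidence scores, and ranked alternatives; the `unit` / `time` arguments only mimic the idea of caller-supplied hints.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

def guess_design(df, unit=None, time=None):
    """Hypothetical heuristic: 'panel' if a (unit, time) pair is a row key
    while units repeat; otherwise 'cross_section'."""
    units = [unit] if unit else [c for c in df.columns if not is_float_dtype(df[c])]
    times = [time] if time else [c for c in df.columns if is_integer_dtype(df[c])]
    for u in units:
        for t in times:
            if u == t:
                continue
            units_repeat = df[u].nunique() < len(df)            # a unit appears more than once
            is_key = not df.duplicated(subset=[u, t]).any()   # (unit, time) identifies rows
            time_varies = df[t].nunique() > 1
            if units_repeat and is_key and time_varies:
                return {"design": "panel", "unit": u, "time": t}
    return {"design": "cross_section", "unit": None, "time": None}

rng = np.random.default_rng(0)
panel = pd.DataFrame({
    "firm_id": np.repeat(range(50), 10),
    "year": np.tile(range(2010, 2020), 50),
    "sales": rng.normal(size=500),
})
cross = pd.DataFrame({"x": rng.normal(size=200), "y": rng.normal(size=200)})
print(guess_design(panel))   # panel keyed by firm_id / year
print(guess_design(cross))   # no repeated unit column, so cross-section
```

Real data breaks heuristics like this easily (for example, an integer ID that is also unique per row), which is why returning candidates and accepting overrides matters.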
```python
detect_design(
    data: DataFrame,
    *,
    unit: Optional[str] = None,
    time: Optional[str] = None,
    running_var: Optional[str] = None,
    cutoff: Optional[float] = None,
) -> Dict[str, Any]
```
Detect the most plausible study design from a DataFrame.
Heuristic, never definitive. Returns ranked candidates so the
agent can override with hints (unit=... / time=... /
running_var=... / cutoff=...) when the auto-detection is
wrong.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | The dataset to inspect. Must have ≥ 1 row. | required |
| unit | str | Column the caller has already identified as the unit ID. Skips unit detection and pins this column. | None |
| time | str | Column the caller has already identified as the time dimension. | None |
| running_var | str | Force this numeric column to be evaluated as an RD running variable. | None |
| cutoff | float | RD cutoff value (the heuristic does NOT auto-discover this). | None |
Returns:
| Type | Description |
|---|---|
| dict | JSON-safe payload; its keys include `design` (see Examples). |
Examples:
Panel data:
```python
>>> import numpy as np, pandas as pd, statspai as sp
>>> df = pd.DataFrame({
...     'firm_id': np.repeat(range(50), 10),
...     'year': np.tile(range(2010, 2020), 50),
...     'sales': np.random.randn(500),
... })
>>> sp.detect_design(df)['design']
'panel'
```
Cross-section:
```python
>>> df = pd.DataFrame({'x': np.random.randn(200),
...                    'y': np.random.randn(200)})
>>> sp.detect_design(df)['design']
'cross_section'
```
See Also
- `sp.recommend`: Method advisor. Pair it with a declared research question (outcome / treatment) and it recommends an estimator.
- `sp.check_identification`: Design-level diagnostics for an already declared design.
sp.preflight(...)
Method-specific pre-estimation diagnostics.
sp.preflight(data, method, **kwargs) runs cheap, method-specific
shape and content checks BEFORE the agent commits to an expensive
estimator call. Different from the neighbours:
- `statspai.smart.check_identification`: design-level diagnostics for an already-declared design (DID / RD / IV / observational). Heavier and broader.
- `statspai.smart.assumption_audit`: heavyweight; re-runs statistical tests against the data after the model is fit.
- `statspai.smart.audit`: read-only checklist of robustness evidence ON a fitted result.
preflight answers the question "if I call sp.{method}(data, ...) with
these arguments, will it work, and is the data the right shape?" It is
a quick gate the agent can run first to avoid wasting tokens on bad calls.
Per-method check tables cover the curated agent-tool surface (regress / did / callaway_santanna / rdrobust / ivreg / ebalance); unknown methods get only the universal sanity checks (the data is a non-empty DataFrame and the sample size is plausible).
CheckResult (module attribute): a (status, message, evidence) record, where status is one of passed, warning, failed.
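A preflight gate is easy to picture in code. The hypothetical sketch below runs three DID-style checks (columns present, binary treatment, sample size), records each as a (status, message, evidence) triple in the spirit of CheckResult, and rolls them up into a verdict. The minimum of 50 comes from the sp.preflight example on this page; the "PASS" / "FAIL" verdict labels and the roll-up rule are assumptions (only "WARN" appears in the documented example).

```python
import pandas as pd

MIN_N = 50  # typical-minimum sample size quoted in the sp.preflight example

def preflight_sketch(data, y, treat, time):
    """Hypothetical DID-style preflight returning CheckResult-like triples."""
    checks = []  # each entry: (status, message, evidence)
    missing = [c for c in (y, treat, time) if c not in data.columns]
    if missing:
        checks.append(("failed", "required columns not found", {"missing": missing}))
    else:
        levels = sorted(data[treat].dropna().unique().tolist())
        binary = set(levels) <= {0, 1}
        checks.append(("passed" if binary else "failed",
                       "treatment is binary", {"levels": levels}))
    n = len(data)
    checks.append(("passed" if n >= MIN_N else "warning",
                   "sample size", {"n": n, "min": MIN_N}))
    statuses = {status for status, _, _ in checks}
    if "failed" in statuses:
        verdict = "FAIL"   # assumed label
    elif "warning" in statuses:
        verdict = "WARN"
    else:
        verdict = "PASS"   # assumed label
    return {"verdict": verdict, "checks": checks}

df = pd.DataFrame({"y": [1, 2, 3, 4], "treated": [0, 1, 0, 1], "t": [0, 0, 1, 1]})
report = preflight_sketch(df, y="y", treat="treated", time="t")    # n=4 < 50
typo = preflight_sketch(df, y="y", treat="treatment", time="t")    # misspelled column
print(report["verdict"], typo["verdict"])
```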
Runs cheap, method-aware checks (column existence, data shape, treatment binarity, sample size) BEFORE the agent commits to an expensive estimator call. Use the verdict to decide whether to proceed, fix arguments, or pivot to a different method.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | Same DataFrame the agent plans to pass to … | required |
| method | str | Name of the StatsPAI estimator to pre-flight (e.g. …). | required |
| `**kwargs` | Any | Estimator arguments: column names (…). | {} |
Returns:
| Type | Description |
|---|---|
| dict | JSON-safe payload; its keys include `verdict` (see Examples). |
Examples:
```python
>>> import pandas as pd, statspai as sp
>>> df = pd.DataFrame({'y': [1, 2, 3, 4],
...                    'treated': [0, 1, 0, 1],
...                    't': [0, 0, 1, 1]})
>>> # n=4 is below the typical-minimum threshold of 50, so expect a warning
>>> sp.preflight(df, 'did', y='y', treat='treated', time='t')['verdict']
'WARN'
```
See Also
- `sp.check_identification`: Design-level diagnostics for an already-declared design.
- `sp.assumption_audit`: Heavyweight; re-runs statistical tests after fitting.
- `sp.audit`: Read-only checklist of robustness evidence ON a fitted result.
sp.audit(...)
Reviewer-checklist audit of a fitted StatsPAI result.
sp.audit(result) returns the missing-evidence view of a result:
which robustness / sensitivity / diagnostic checks a careful reviewer
would expect for this estimator family, and which of those have
already been run versus are still missing on the result.
Distinct from neighbouring methods:
- `CausalResult.violations`: items already on `model_info` whose values fail their threshold ("checked but failed").
- `CausalResult.next_steps`: recommendations of what to do next, oriented around action (export, robustness, alternative method).
- `statspai.smart.assumption_audit`: heavyweight; takes `(result, data)` and re-runs statistical tests.

`audit` is pure introspection: it never re-runs anything, never touches data, and runs in microseconds.
The agent's mental model: audit answers "what evidence is missing
for a reviewer to trust this estimate?"; assumption_audit answers
"given the data, do the assumptions actually hold?".
Returns a JSON-safe dict so MCP-mediated agents can branch on the
status field of each check without parsing prose.
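Because audit is pure introspection, its core can be pictured as a lookup: compare the checks an estimator family is expected to carry with the evidence already attached to a result. The sketch below is hypothetical; the `EXPECTED` checklist, its check names, and the "present" / "failed" / "missing" statuses are illustrative assumptions, and only `sp.pretrends_test` is taken from the sp.audit example on this page.

```python
# Hypothetical checklist: family -> list of (check, severity, suggested function).
EXPECTED = {
    "did": [
        ("pretrends", "high", "sp.pretrends_test"),
        ("placebo", "medium", None),   # no suggestion recorded in this sketch
    ],
}

def audit_sketch(family, evidence):
    """Classify each expected check using only evidence already on the result.

    `evidence` maps check name -> True (ran and passed) / False (ran and failed).
    Nothing is re-estimated; this is a pure dictionary lookup.
    """
    checks = []
    for name, severity, suggest in EXPECTED.get(family, []):
        if name not in evidence:
            status = "missing"
        elif evidence[name]:
            status = "present"
        else:
            status = "failed"
        checks.append({"check": name, "status": status,
                       "severity": severity, "suggest_function": suggest})
    return {"family": family, "checks": checks}

card = audit_sketch("did", {"placebo": True})
for c in card["checks"]:
    if c["status"] == "missing" and c["severity"] == "high":
        print(c["suggest_function"])   # prints: sp.pretrends_test
```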
Returns the missing-evidence view: which robustness / sensitivity / diagnostic checks the estimator family expects, and which of those are present, failed, or absent on the result.
This is read-only: it never re-runs a statistical test, never
touches the original DataFrame, and runs in microseconds. Pair it
with statspai.smart.assumption_audit (which does re-run
tests against the data) when you need both perspectives.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| result | CausalResult or EconometricResults | Any fitted StatsPAI result with … | required |
Returns:
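| Type | Description |
|---|---|
| dict | JSON-safe payload; its keys include `checks`, a list of per-check dicts with `status`, `severity`, and `suggest_function` fields (see Examples). |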
Examples:
```python
>>> r = sp.did(df, y='wage', treat='treated', time='post')
>>> audit_card = sp.audit(r)
>>> for c in audit_card['checks']:
...     if c['status'] == 'missing' and c['severity'] == 'high':
...         print(c['suggest_function'])
sp.pretrends_test
```
See Also
- `statspai.smart.assumption_audit`: Heavyweight counterpart that re-runs statistical tests against the original data and returns pass/fail per assumption.
- `CausalResult.violations`: Items already on `model_info` whose values fail thresholds ("checked-but-failed" view).
- `CausalResult.next_steps`: Action-oriented recommendations for what to do next.
sp.examples(...)
Runnable code-example surface for StatsPAI agents.
sp.examples(name) is the agent-discoverable entry for "show me how
to call this function". Different from neighbouring APIs:
- `sp.describe_function` returns the full registry record (params / assumptions / failure_modes etc.); useful, but verbose.
- `sp.recommend` walks DATA + research question → estimator selection.
- `sp.examples` answers: "I know I want `sp.{name}`; show me one short, copy-pasteable Python snippet that exercises it." Agents need this to bootstrap a fresh notebook without reading docs.
Per-method curated snippets cover the flagship surface (regress / did
/ callaway_santanna / rdrobust / ivreg / ebalance / synth /
metalearners). Any other registered function falls back to the
example field stored on its registry entry.
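The curated-first, registry-second lookup can be sketched as two dictionary lookups. Both dictionaries and the returned keys are hypothetical stand-ins; the curated DID snippet reuses the call from the sp.audit example on this page, and `my_custom_fn` / `undocumented_fn` are invented registry entries.

```python
# Hypothetical stand-ins for the curated snippets and the function registry.
CURATED = {
    "did": "r = sp.did(df, y='wage', treat='treated', time='post')",
}
REGISTRY = {
    "my_custom_fn": {"example": "my_custom_fn(df)"},   # invented entry with an example
    "undocumented_fn": {},                              # invented entry without one
}

def lookup_example(name):
    """Prefer a curated snippet; otherwise fall back to the registry's example field."""
    if name in CURATED:
        return {"name": name, "source": "curated", "code": CURATED[name]}
    example = REGISTRY.get(name, {}).get("example")
    if example:
        return {"name": name, "source": "registry", "code": example}
    return {"name": name, "source": None, "code": None}

for fn in ("did", "my_custom_fn", "undocumented_fn"):
    print(lookup_example(fn))
```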
Return runnable code examples + registry metadata for a function.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Canonical StatsPAI function name (e.g. …). | required |
Returns:
| Type | Description |
|---|---|
| dict | JSON-safe payload with keys: … |
Raises:
| Type | Description |
|---|---|
| TypeError | If … |
See Also
- `sp.describe_function`: Full registry record (params / failure_modes / etc.).
- `sp.list_functions`: Discover available function names.
- `sp.recommend`: Method advisor when you don't yet know which function to call.
sp.session(...)
Deterministic-RNG context manager for reproducible agent loops.
sp.session(seed=42) snapshots the RNG state of Python's
random and NumPy's legacy global MT19937 generator (the one
backing np.random.randn, np.random.choice, etc.), applies the
new seed for the duration of the with block, and restores the
prior state on exit. Optional extras (PyTorch, JAX) are seeded only
when those libraries are already importable; they are never auto-installed.
What it does NOT cover
np.random.default_rng() creates a fresh PCG64 generator seeded
from OS entropy each time it is called — those generators have no
process-global state for sp.session to manipulate. If your code
calls rng = np.random.default_rng() inside the block, the draws
will be different on every run regardless of the session seed. To
get deterministic default_rng draws, pass the seed explicitly:
```python
import numpy as np
import statspai as sp

with sp.session(seed=42) as state:
    rng = np.random.default_rng(state.seed)  # explicit seed
    x = rng.normal(size=5)
```
Threading
Not thread-safe. The snapshot lives in a context-manager local but
the target state (Python random and NumPy legacy globals) is
process-wide. Two threads that enter sp.session simultaneously
will trample each other's snapshots and produce non-deterministic
results with no error. For parallel workloads, instantiate
np.random.default_rng(seed) per-thread and thread the generator
through your call stack instead of relying on sp.session.
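The per-thread pattern recommended above looks like this in practice: each task builds its own `np.random.default_rng(seed)` and draws only from it, so results are reproducible even though the tasks run concurrently. The bootstrap-of-the-mean task, the data, and the seed list are illustrative assumptions; no StatsPAI call is involved.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

DATA = np.arange(100, dtype=float)

def bootstrap_means(seed, n_boot=200):
    """Bootstrap replicates of the mean, drawn from a generator owned by this task."""
    rng = np.random.default_rng(seed)   # per-task generator, no shared global state
    n = len(DATA)
    return np.array([DATA[rng.integers(0, n, size=n)].mean() for _ in range(n_boot)])

seeds = [1, 2, 3, 4]
with ThreadPoolExecutor(max_workers=4) as pool:
    first = list(pool.map(bootstrap_means, seeds))
with ThreadPoolExecutor(max_workers=4) as pool:
    second = list(pool.map(bootstrap_means, seeds))   # same seeds, same results
```

Scheduling order cannot change the output because no two tasks share a generator, and `pool.map` returns results in input order.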
Why it matters: agents iterate, and a bootstrap CI that drifts between
calls because a different RNG state was active is a debugging nightmare.
Wrapping the calls in with sp.session(seed=42): ... makes each one
reproducible without polluting the global RNG state outside the block.
Usage
```python
import numpy as np
import statspai as sp

with sp.session(seed=42):
    a = np.random.randn(3)
    # any sp.xxx call inside is deterministic

# State outside the block is untouched.
b = np.random.randn(3)  # uses prior global state
```
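The snapshot, seed, restore mechanism can be reproduced with the standard library and NumPy's legacy global-state functions (`random.getstate` / `random.setstate`, `np.random.get_state` / `np.random.set_state`). This is a minimal sketch of the documented behaviour for Python `random` and NumPy only, not sp.session's source; it skips the optional PyTorch / JAX handling and the SessionState object, and the name `seeded` is invented for illustration.

```python
import contextlib
import random

import numpy as np

@contextlib.contextmanager
def seeded(seed):
    """Seed Python's `random` and NumPy's legacy global RNG inside the block,
    then restore whatever state was active before it."""
    saved_py = random.getstate()
    saved_np = np.random.get_state()
    random.seed(seed)
    np.random.seed(seed)
    try:
        yield seed
    finally:
        random.setstate(saved_py)      # restored even if the block raises
        np.random.set_state(saved_np)

# Same seed, same draws.
with seeded(42):
    a = np.random.randn(3)
with seeded(42):
    b = np.random.randn(3)

# The outer stream continues as if the block never ran.
np.random.seed(7)
with seeded(42):
    np.random.randn(100)               # these draws come from the seed-42 state
after = np.random.randn(3)
np.random.seed(7)
expected = np.random.randn(3)
```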
```python
session(
    seed: Optional[int] = None,
    *,
    torch: bool = True,
    jax: bool = True,
    pythonhashseed: bool = False,
) -> Iterator[Any]
```
Set every reachable RNG to a known seed for the duration of the
with block, then restore the prior state on exit.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| seed | int | Seed value. | None |
| torch | bool | Seed PyTorch (CPU + CUDA) when the library is already imported. Never imports torch on its own. | `True` |
| jax | bool | Yield a fresh JAX … | `True` |
| pythonhashseed | bool | Set … | `False` |
Yields:
| Type | Description |
|---|---|
| SessionState | Object exposing the seed in use as `.seed`. |
Examples:
```python
>>> import numpy as np, statspai as sp
>>> with sp.session(seed=42):
...     a = np.random.randn(3)
>>> with sp.session(seed=42):
...     b = np.random.randn(3)
>>> bool((a == b).all())
True
```
Notes
Restoration is best-effort: if a library was lazily imported
INSIDE the with block (and thus had no prior state), the
exit handler skips restoring it. The intended use is small,
deterministic blocks of estimator + bootstrap calls — not
long-running session orchestration.
sp.brief(...)
One-line dashboard summaries of fitted StatsPAI results.
sp.brief(result) and result.brief() return a single-line
status string under ~120 characters: enough to scan a list of
results in an agent-orchestrated workflow without paying the token
cost of a full to_dict(detail="agent") payload per item.
Format
```text
[METHOD] estimand=ATT est=0.412 (se=0.087) 95% CI [0.241, 0.583] *** N=2,000 ⚠ pretrend
```
Columns:
- `[METHOD]`: method label (truncated to 24 chars)
- `estimand=`: ATT / ATE / LATE / etc.
- `est=`: point estimate to 3 significant figures
- `(se=...)`: standard error
- `95% CI [..., ...]`: confidence interval at the result's alpha
- `***` / `**` / `*`: significance stars (omitted if p ≥ 0.10)
- `N=`: sample size with thousands separator
- `⚠ ...`: first `violations()` flag at error severity, if any
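The one-line format is plain string assembly. The sketch below rebuilds the documented example line from its parts; the star thresholds (p < 0.01, 0.05, 0.10) are the conventional ones and are assumed here, as are the helper names `stars` / `brief_line` and the 3-significant-figure formatting of the CI bounds.

```python
def stars(p):
    """Conventional significance stars (thresholds assumed); '' when p >= 0.10."""
    if p < 0.01:
        return "***"
    if p < 0.05:
        return "**"
    if p < 0.10:
        return "*"
    return ""

def brief_line(method, estimand, est, se, ci, p, n, alpha=0.05, flag=None):
    """Assemble a dashboard line in the documented column order."""
    parts = [
        f"[{method[:24]}]",                                   # method label, max 24 chars
        f"estimand={estimand}",
        f"est={est:.3g}",                                     # 3 significant figures
        f"(se={se:.3g})",
        f"{round((1 - alpha) * 100)}% CI [{ci[0]:.3g}, {ci[1]:.3g}]",
        stars(p),                                             # may be '' and is then dropped
        f"N={n:,}",                                           # thousands separator
    ]
    if flag:
        parts.append(f"⚠ {flag}")
    return " ".join(part for part in parts if part)

line = brief_line("did_2x2", "ATT", 0.412, 0.087, (0.241, 0.583), p=0.0001, n=2000)
print(line)
```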
Distinct from siblings:
- `CausalResult.summary`: multi-line prose for humans (KB-scale).
- `CausalResult.to_dict` (with `detail="minimal"`): JSON payload of ~300 chars; `brief()` is ~100 chars and human-scannable, intended for agent dashboards rather than tool-result payloads.
Render a one-line status summary of a fitted result.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| result | CausalResult or EconometricResults (or any object exposing …) | The fitted result to summarise. | required |
Returns:
| Type | Description |
|---|---|
| str | A single-line status string under ~120 characters, intended for agent dashboards and multi-result comparisons. JSON-safe (it is just a string). |
Examples:
```python
>>> r = sp.did(df, y='y', treat='treated', time='t')
>>> sp.brief(r)
'[did_2x2] estimand=ATT est=0.412 (se=0.087) 95% CI [0.241, 0.583] *** N=2,000'
```
See Also
- `CausalResult.summary`: Multi-line prose summary for humans.
- `CausalResult.to_dict`: Full JSON payload at minimal/standard/agent detail levels.
sp.bib_for(...)
Top-level structured citation for a fitted result.
Convenience entry that pairs with result.cite(format="json")
so agents that don't have direct access to the result method can
pull the structured payload via sp.bib_for(...) instead.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| result | CausalResult or EconometricResults | Any fitted result object exposing a … | required |
Returns:
| Type | Description |
|---|---|
| dict | Same shape as … |
Examples: