Agent-native API surface (v1.9.0)

StatsPAI v1.9.0 ships a 12-piece API surface designed for the case where the caller is a language-model agent — Claude Code, Cursor, Copilot CLI, or a custom workflow that uses StatsPAI through the Model Context Protocol. The underlying estimators are unchanged; what's new is everything around them: shape detection, pre-flight checks, structured exceptions, token-budgeted serialization, missing-evidence audits, multi-format citations, deterministic RNG sessions, MCP prompts, and a one-line dashboard view.

This guide is a quickstart for agent authors. Human researchers can use the same APIs; they simply happen to be the right primitives for agents to chain.

Why agent-native?

When an LLM is the caller, three things differ from human use:

The agent can't see the DataFrame. It needs APIs that report structure (panel? RD running variable? cross-section?) without a visualisation step.
Token budget matters per call. A 4 000-character "tidy summary" may be useful in a notebook, but it burns context the agent needs for reasoning. We expose every result at three sizes (minimal / standard / agent) plus an even smaller one-line brief().
Errors should be machine-readable. A free-text "weak instrument F=2.1, try LIML" is great for a human but the agent has to regex-parse it. v1.9.0's exception envelope ships error_kind / recovery_hint / diagnostics / alternative_functions as discrete fields.

Everything below is additive: no estimator's numerical path changed, and default behaviour is byte-identical to v1.8.0.

The 12-piece surface at a glance

import statspai as sp

# Discovery — "what is this data?"
sp.detect_design(df)                       # cross-section / panel / RD
sp.preflight(df, "did", y=..., treat=...)  # cheap pre-estimation check
sp.examples("did")                         # runnable code snippets

# Estimation — unchanged, plus richer envelope
result = sp.did(df, y='y', treat='t', time='post')

# Serialization — pick payload size per call
result.to_dict(detail="minimal")           # ~150 tokens — answer only
result.to_dict(detail="standard")          # ~250 tokens — coefs + diagnostics
result.to_dict(detail="agent")             # ~620 tokens — + violations + next_steps
result.brief()                             # ~95 chars — dashboard view

# Reviewer-grade follow-up
sp.audit(result)                           # what robustness checks are missing?
result.cite(format="apa")                  # APA / BibTeX / JSON citations
sp.bib_for(result)                         # structured citation dict

# Reproducibility
with sp.session(seed=42):
    result_a = sp.did(df, ...)             # deterministic across runs
    result_b = sp.bayes_did(df, ...)

End-to-end agent workflow

Concrete example: an agent receives an unfamiliar CSV and is asked "is there a treatment effect?". Five calls, each with a clear purpose:

import statspai as sp
import pandas as pd

df = pd.read_csv("/path/to/dataset.csv")

# 1. Identify the study design.
design = sp.detect_design(df)
# {'design': 'panel', 'confidence': 1.0,
#  'identified': {'unit': 'firm_id', 'time': 'year'}, ...}

# 2. Pre-flight a candidate estimator before paying for it.
report = sp.preflight(df, 'did',
                      y='sales', treat='treated', time='year')
if report['verdict'] == 'FAIL':
    # The verdict carries structured failure info — agent can
    # decide whether to fix args, switch method, or stop.
    for c in report['checks']:
        if c['status'] == 'failed':
            print(f"  blocked by {c['name']}: {c['message']}")
    raise SystemExit
elif report['verdict'] == 'WARN':
    print("warnings present but proceeding")

# 3. Run the estimator. If it raises a structured StatsPAIError,
#    the MCP layer surfaces error_kind + alternative_functions.
result = sp.did(df, y='sales', treat='treated', time='year')

# 4. One-line dashboard summary for logs / multi-result loops.
print(result.brief())
# [Difference-in-Differences (2x2)]  estimand=ATT  est=0.412
# (se=0.087)  95% CI [0.241, 0.583]  ***  N=2,000

# 5. Reviewer checklist — which robustness checks are still
#    MISSING from the result's evidence base?
audit_card = sp.audit(result)
for c in audit_card['checks']:
    if c['status'] == 'missing' and c['importance'] == 'high':
        print(f"  follow-up: {c['suggest_function']}  ({c['name']})")
# follow-up: sp.pretrends_test  (parallel_trends)
# follow-up: sp.honest_did       (rambachan_roth)

The agent now has enough structured information to plan its next call — no prose parsing, no "did you remember to test parallel trends?" loops.
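Step 2 of the workflow can be factored into a small gate function. The sketch below works on a plain dict shaped like the `report` in the example (a `verdict` string plus a `checks` list with `name`, `status`, and `message`). The helper name `blocking_checks` and the sample report, including its check names and messages, are illustrative assumptions rather than StatsPAI output.

```python
from typing import Any


def blocking_checks(report: dict[str, Any]) -> list[str]:
    """Return one message per failed check when the verdict is FAIL, else []."""
    if report.get("verdict") != "FAIL":
        return []
    return [
        f"{c['name']}: {c['message']}"
        for c in report.get("checks", [])
        if c.get("status") == "failed"
    ]


# Hypothetical report, hand-written to mirror the shape used above.
sample_report = {
    "verdict": "FAIL",
    "checks": [
        {"name": "treat_is_binary", "status": "failed",
         "message": "treatment has 3 unique values"},
        {"name": "outcome_is_numeric", "status": "passed", "message": "ok"},
    ],
}

blockers = blocking_checks(sample_report)
print(blockers)
```

An agent could stop, or switch method, whenever the returned list is non-empty, and proceed otherwise.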

Token-budget control

Every fitted result exposes the same payload at three sizes. Agents choose per call:

level	shape	typical size
`brief()`	one-line string (`[METHOD] estimand= est=… ci ⚠`)	~95 chars
`"minimal"`	dict: method / estimand / estimate / SE / CI / N	~150 tokens
`"standard"`	`"minimal"` + scalar diagnostics + detail rows	~250 tokens
`"agent"`	`"standard"` + violations + next_steps	~620 tokens

result.to_dict()                       # = "standard" (legacy default)
result.to_dict(detail="minimal")       # cheap sub-step
result.to_dict(detail="agent")         # full agent envelope
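One way an agent might choose a level is to compare its remaining context against the typical sizes in the table. The helper below is only a sketch: `pick_detail` and its `reserve` parameter are hypothetical, and the sizes are the approximate figures quoted above, not guarantees.

```python
# Approximate payload sizes (tokens) from the table above, largest first.
TYPICAL_TOKENS = [("agent", 620), ("standard", 250), ("minimal", 150)]


def pick_detail(remaining_tokens: int, reserve: int = 0) -> str:
    """Pick the richest detail level that fits the budget.

    `reserve` is the number of tokens to keep free for the agent's own
    reasoning. Falls back to "minimal" when nothing fits.
    """
    budget = remaining_tokens - reserve
    for level, cost in TYPICAL_TOKENS:
        if cost <= budget:
            return level
    return "minimal"


print(pick_detail(2000))               # agent
print(pick_detail(1000, reserve=600))  # standard
print(pick_detail(100))                # minimal
```

The chosen string would then be passed as the `detail` argument of `result.to_dict(...)`.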

Through MCP, the same control is exposed as a detail argument on every tools/call:

{
  "method": "tools/call",
  "params": {
    "name": "did",
    "arguments": {
      "data_path": "/abs/path.csv",
      "y": "sales", "treat": "treated", "time": "year",
      "detail": "minimal"
    }
  }
}
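The fragment above shows only the method and params. On the wire, a JSON-RPC 2.0 request also carries a `jsonrpc` version string and an `id`. A minimal sketch of building the full message in Python follows; the helper name `build_tool_call` is hypothetical, and the argument names are the ones from the example.

```python
import json
from typing import Any


def build_tool_call(request_id: int, tool: str, detail: str = "standard",
                    **arguments: Any) -> str:
    """Serialize a JSON-RPC 2.0 tools/call request with a `detail` argument."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": {**arguments, "detail": detail}},
    }
    return json.dumps(message)


request = build_tool_call(
    1, "did", detail="minimal",
    data_path="/abs/path.csv", y="sales", treat="treated", time="year",
)
print(request)
```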

`sp.audit(result)` — the missing-evidence view

sp.audit() is intentionally distinct from three neighbours:

function	answers
`result.violations()`	"what evidence is on the result and failing?"
`result.next_steps()`	"what should the user do next to publish this result?"
`sp.assumption_audit(result, data)`	"given the data, do the assumptions actually hold?" (re-runs tests)
`sp.audit(result)`	"what reviewer-grade evidence is still missing from this result?"

audit is read-only and runs in microseconds: it inspects result.model_info for the diagnostics each method family expects, and reports each as passed / failed / missing. Each missing check carries a suggest_function so the agent knows exactly what to call next.

{
    "method": "did_2x2",
    "method_family": "did",
    "checks": [
        {
            "name": "parallel_trends",
            "question": "Are pre-treatment trends statistically parallel?",
            "status": "missing",
            "severity": "warning",
            "importance": "high",
            "suggest_function": "sp.pretrends_test",
            "rationale": "DID identification rests on parallel trends; "
                         "without a pre-trend test the design is "
                         "unfalsifiable.",
            ...
        },
        ...
    ],
    "summary": {"passed": 0, "failed": 0, "missing": 5, "n_total": 5},
    "coverage": 0.0,
}

coverage is passed / n_total — agents can sort multiple results by reviewer-readiness.
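Pulling out the high-importance follow-ups and ranking results by coverage can both be sketched against plain dicts shaped like the card above. The helper names and the two abbreviated sample cards (including their counts) are made up for illustration; only the field names come from the example.

```python
from typing import Any


def follow_ups(card: dict[str, Any]) -> list[str]:
    """Functions suggested by high-importance checks that are still missing."""
    return [
        c["suggest_function"]
        for c in card["checks"]
        if c["status"] == "missing" and c["importance"] == "high"
    ]


def coverage(card: dict[str, Any]) -> float:
    """passed / n_total, recomputed from the summary block."""
    s = card["summary"]
    return s["passed"] / s["n_total"] if s["n_total"] else 0.0


# Hypothetical, abbreviated cards (real cards carry more fields per check).
card_a = {
    "method": "did_2x2",
    "checks": [
        {"name": "parallel_trends", "status": "missing", "importance": "high",
         "suggest_function": "sp.pretrends_test"},
        {"name": "rambachan_roth", "status": "missing", "importance": "high",
         "suggest_function": "sp.honest_did"},
    ],
    "summary": {"passed": 0, "failed": 0, "missing": 2, "n_total": 2},
}
card_b = {
    "method": "did_2x2",
    "checks": [
        {"name": "parallel_trends", "status": "passed", "importance": "high",
         "suggest_function": "sp.pretrends_test"},
        {"name": "rambachan_roth", "status": "missing", "importance": "high",
         "suggest_function": "sp.honest_did"},
    ],
    "summary": {"passed": 1, "failed": 0, "missing": 1, "n_total": 2},
}

print(follow_ups(card_a))
ranked = sorted([card_a, card_b], key=coverage, reverse=True)
print([coverage(c) for c in ranked])
```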

Citations: zero-hallucination, three formats

result.cite(format=...) and sp.bib_for(result) parse the canonical BibTeX entry stored on the result class and reformat it. Bibliographic facts come only from the parsed BibTeX — the formatter never invents authors, years, journals, or publishers (per CLAUDE.md §10).

r = sp.callaway_santanna(df, ...)

r.cite()                              # default — BibTeX
# @article{callaway2021difference, ...}

r.cite(format="apa")
# Callaway, B., & Sant'Anna, P. H. C. (2021). Difference-in-
# differences with multiple time periods. Journal of Econometrics,
# 225(2), 200–230.

sp.bib_for(r)                         # structured dict
# {'type': 'article', 'key': 'callaway2021difference',
#  'authors': [{'last': 'Callaway', 'first': 'Brantly'}, ...],
#  'year': '2021', 'title': '...', 'journal': '...', ...}

Methods that cite multiple papers (e.g. twfe_decomposition cites both Goodman-Bacon 2021 and de Chaisemartin & D'Haultfœuille 2020) round-trip every author — the parser walks every @type{...} head in the source string.
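To illustrate what walking every `@type{...}` head means, here is a minimal regex sketch. It is not StatsPAI's parser: the function name and the citation keys are hypothetical, and the entries are trimmed to author and year.

```python
import re

# Matches the head of each BibTeX entry: "@type{key,".
ENTRY_HEAD = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")


def entry_heads(bibtex: str) -> list[tuple[str, str]]:
    """Return (entry_type, citation_key) for every entry in the string."""
    return [(m.group(1).lower(), m.group(2)) for m in ENTRY_HEAD.finditer(bibtex)]


# Hypothetical keys; fields trimmed to author and year for brevity.
source = """
@article{goodmanbacon2021,
  author = {Goodman-Bacon, Andrew},
  year = {2021}
}
@article{dechaisemartin2020,
  author = {de Chaisemartin, Clément and D'Haultfœuille, Xavier},
  year = {2020}
}
"""

heads = entry_heads(source)
print(heads)
```

A full parser would go on to read the fields of each entry; the point here is only that a multi-entry string yields one head per cited paper.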

`sp.session(seed=42)` — reproducible blocks

A standard frustration: an agent runs sp.bootstrap_ci(result) twice and gets different intervals because the state of Python's random module and NumPy's legacy global RNG drifted between calls. sp.session snapshots both, applies the seed for the duration of the block, and restores the prior state on exit (even when an exception is raised inside):

with sp.session(seed=42):
    boot = sp.bootstrap_ci(result, n_boot=1000)
    perm = sp.permutation_test(result, n_perm=1000)
# state outside the block is byte-identical to before the with
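The snapshot, seed, restore pattern described above can be sketched with the standard library alone. This illustrates the idea for Python's `random` module only and is not StatsPAI's implementation; `seeded_block` is a hypothetical name.

```python
import random
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def seeded_block(seed: int) -> Iterator[int]:
    """Seed `random` inside the block and restore the prior state on exit."""
    saved = random.getstate()      # snapshot the global generator
    random.seed(seed)
    try:
        yield seed
    finally:
        random.setstate(saved)     # runs even if the block raises


before = random.getstate()
with seeded_block(42):
    first = [random.random() for _ in range(3)]
with seeded_block(42):
    second = [random.random() for _ in range(3)]

print(first == second)                 # True: same seed, same draws
print(random.getstate() == before)     # True: outer state untouched
```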

What's covered: Python random and NumPy's legacy global state (np.random.randn, np.random.choice, …). PyTorch and JAX are handled lazily: they are seeded only if already imported, and are never auto-installed.

What's not covered: np.random.default_rng() instances. Those have no process-global state; pass state.seed explicitly if you need them deterministic:

import numpy as np

with sp.session(seed=42) as state:
    rng = np.random.default_rng(state.seed)   # explicit seed
    x = rng.normal(size=5)

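Not thread-safe: sp.session manipulates process-global state. For parallel workloads, give each thread its own np.random.default_rng instance, seeded independently (for example from the children of np.random.SeedSequence(seed).spawn(n_threads)) so that threads do not replay the same stream.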

MCP server: drop-in for Claude Desktop / Cursor

pip install statspai exposes a statspai-mcp console script. Wire it into your MCP-capable client by adding an entry to the client's config:

Claude Desktop (claude_desktop_config.json):

{
  "mcpServers": {
    "statspai": {
      "command": "statspai-mcp"
    }
  }
}

Cursor / generic stdio MCP client:

statspai-mcp     # speaks JSON-RPC 2.0 over stdio

What the server exposes:

tools/list — every registered StatsPAI function as a typed tool with a JSON-Schema input. ~100 tools merged from the hand-curated flagship list and the auto-generated registry.
tools/call — runs the estimator. Accepts data_path (CSV, Parquet, etc. — server-side pd.read_*) plus the estimator's own kwargs plus the detail parameter to control payload size.
resources/list — statspai://catalog (Markdown index) and statspai://functions (JSON [{name, description}]).
resources/templates/list — statspai://function/{name} → per-function rich agent card (description, signature, assumptions, failure_modes, alternatives, typical_n_min, example).
prompts/list / prompts/get — three curated workflow templates (audit_did_result, design_then_estimate, robustness_followup) MCP clients surface as direct action buttons.

When an estimator raises a structured StatsPAIError, the tools/call response carries the full payload alongside legacy fields:

{
  "error": "MethodIncompatibility: treatment has 3 unique values...",
  "error_kind": "method_incompatibility",
  "error_payload": {
    "code": "method_incompatibility",
    "message": "...",
    "recovery_hint": "Use sp.callaway_santanna or sp.multi_treatment.",
    "diagnostics": {"n_unique_values": 3, "expected": 2},
    "alternative_functions": ["sp.callaway_santanna",
                               "sp.multi_treatment"]
  },
  "tool": "did", "arguments": {...}, "remediation": {...}
}

Agents branch on error_kind (typed) instead of regex-parsing error (free text).
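A sketch of that branching, written against a plain dict shaped like the response above. The function `plan_recovery`, its return convention, and the abbreviated sample response are illustrative assumptions; only the field names come from the example.

```python
from typing import Any, Optional


def plan_recovery(response: dict[str, Any]) -> Optional[str]:
    """Return the next function to try, or None if there is nothing to retry.

    Branches on the typed `error_kind` field and never inspects the
    free-text `error` string.
    """
    if "error_kind" not in response:
        return None                      # no structured error to act on
    payload = response.get("error_payload", {})
    alternatives = payload.get("alternative_functions", [])
    if response["error_kind"] == "method_incompatibility" and alternatives:
        return alternatives[0]           # switch to the first suggested method
    return None                          # unhandled kind: stop and report


# Abbreviated copy of the response shown above.
sample = {
    "error": "MethodIncompatibility: treatment has 3 unique values...",
    "error_kind": "method_incompatibility",
    "error_payload": {
        "code": "method_incompatibility",
        "recovery_hint": "Use sp.callaway_santanna or sp.multi_treatment.",
        "diagnostics": {"n_unique_values": 3, "expected": 2},
        "alternative_functions": ["sp.callaway_santanna", "sp.multi_treatment"],
    },
    "tool": "did",
}

next_call = plan_recovery(sample)
print(next_call)
```

Other values of `error_kind` would get their own branches; this sketch handles only the one kind shown in the example.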

Deciding which API to call when

Quick decision tree for agents:

unfamiliar data?                                   → sp.detect_design(df)
known data, want method advice?                    → sp.recommend(df, outcome=…, treatment=…)
chosen method, before fitting?                     → sp.preflight(df, method, **args)
fitting succeeded, want a quick view?              → result.brief()
fitting succeeded, want structured agent payload?  → result.to_dict(detail="agent")
fitting succeeded, want to find missing evidence?  → sp.audit(result)
need to cite the method?                           → result.cite(format="apa")
running multiple estimators, want determinism?     → with sp.session(seed=42): …
need a code snippet?                               → sp.examples(name)