From a data MCP to an estimate — the last mile¶
sp.from_worldbank/sp.from_fred/sp.from_sdmxreshape what a data MCP already fetched into a tidy framesp.detect_design→sp.recommend→sp.cross_validatecan use.
The 2026 workflow¶
The modern empirical loop is data MCP → estimator. An agent pulls a series
from an official data server — the World Bank data360 MCP, fred-mcp-server,
the OECD / Eurostat SDMX MCPs — and then has to estimate something. StatsPAI
deliberately does not re-implement those connectors: they are official,
maintained, and excellent. StatsPAI's job is to be their best consumer.
The friction is shape. Each source returns its own JSON dialect — World Bank's
[metadata, rows], FRED's {"observations": [...]}, the deeply nested
index-encoded SDMX-JSON of OECD/Eurostat. The ingestion helpers absorb that and
hand you a tidy long (or wide) panel.
These functions never touch the network — you pass them a payload the data MCP already returned, so they are deterministic and offline-testable.
World Bank¶
import statspai as sp
# `payload` is whatever the World Bank MCP / REST returned.
panel = sp.from_worldbank(payload) # tidy long: country, iso3,
# indicator, indicator_id,
# year, value
reg_frame = sp.from_worldbank(payload, wide=True) # one column per indicator,
# one row per (iso3, year)
wide=True is the form you want for a cross-country panel regression: pull
several indicators, get one column each, indexed by (iso3, year).
FRED¶
# Single series:
ts = sp.from_fred(observations, series_id="cpi") # date, cpi
# Several series merged on date in one call:
macro = sp.from_fred({"CPI": cpi_obs, "UNRATE": unrate_obs}) # date, CPI, UNRATE
FRED's "." missing-value sentinel becomes NaN; dates are parsed to
datetime64 and the frame is sorted.
OECD / Eurostat (SDMX-JSON)¶
# SDMX encodes each cell by integer indices into per-dimension code lists;
# from_sdmx expands them back to readable dimension columns.
df = sp.from_sdmx(payload) # e.g. LOCATION, SUBJECT, TIME_PERIOD, value
End-to-end recipe¶
A conversational session ties the whole chain together — no browser, no manual reshaping:
import statspai as sp
# 1. (the agent calls the World Bank MCP and gets `payload`)
panel = sp.from_worldbank(payload, wide=True)
# 2. Let StatsPAI read the shape and suggest an estimator.
design = sp.detect_design(panel)
rec = sp.recommend(panel, y="life_expectancy", design=design.design)
# 3. Fit — and cross-validate the headline number across engines (see the
# cross-engine validation guide).
cv = sp.cross_validate(
panel, "feols",
formula="life_expectancy ~ gdp_pc | iso3 + year", treatment="gdp_pc",
engines=["statspai", "pyfixest", "R::fixest"],
)
print(cv.verdict)
That is the picture the econometrics community has been sketching for 2026 — "OECD MCP pulls the data, Jupyter MCP runs the code, you never open a browser" — with StatsPAI supplying the estimation-and-validation half of the loop.
Provenance¶
Every ingested frame stamps df.attrs["source"] ("worldbank" / "fred" /
"sdmx") so downstream tooling and your own audit trail know where the numbers
came from. It also stamps df.attrs["provenance"], a deterministic dictionary
that records the normalizer, payload kind, output shape, row count, columns and
source-specific labels such as World Bank indicator ids, FRED series ids, or
SDMX dimensions.
sp.cross_validate preserves that metadata under
cv.provenance["data"]. In an MCP run that loads a CSV/Parquet/Stata file via
data_path, the MCP response also includes a top-level data_provenance
record with the local file hash or a sanitized remote URL.
Scope, on purpose¶
These are normalisers, not connectors. They will not fetch data, cache it,
or manage API keys — that is the data MCPs' job, and duplicating it would only
rot. If you need live fetching, call the relevant data MCP (or wbgapi /
fredapi / pandasdmx) and pass the result here.