`statspai.causal_text`¶

causal_text ¶

Causal inference with text data — MVP (P1-B, v1.6 experimental).

Two methods to start the family:

`text_treatment_effect`: Veitch-Sridhar-Blei (2020) text-as-treatment via embedding adjustment. Estimates the ATE of a user-supplied treatment indicator while controlling for the confounding variation captured in the text embedding.
`llm_annotator_correct`: Egami-Hinck-Stewart-Wei (2024) measurement-error correction for LLM-derived treatment labels. Given a small validation subset where both LLM and human labels exist, it debiases the downstream regression coefficient via a Hausman-style correction.
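The idea behind embedding adjustment can be sketched with NumPy alone. This is a minimal illustration under stated assumptions, not the library's implementation: `hash_embed` and `adjusted_ate` are hypothetical helpers, the hashing scheme (CRC32 of whitespace tokens) is chosen only for reproducibility, and the adjustment is a plain linear regression rather than the adapted embeddings of Veitch et al. The simulated text drives both treatment uptake and the outcome, so the unadjusted difference in means is biased while the adjusted coefficient is not.

```python
import zlib
import numpy as np

def hash_embed(texts, n_components=8):
    """Hashed bag-of-words: each token increments one of n_components buckets."""
    X = np.zeros((len(texts), n_components))
    for i, doc in enumerate(texts):
        for token in doc.lower().split():
            # crc32 is stable across runs, unlike Python's salted built-in hash()
            X[i, zlib.crc32(token.encode("utf-8")) % n_components] += 1.0
    return X

def adjusted_ate(y, treat, emb):
    """OLS of y on [1, treat, emb]; returns the coefficient on treat."""
    design = np.column_stack([np.ones(len(y)), treat, emb])
    # lstsq copes with collinear embedding columns (minimum-norm solution)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(coef[1])

rng = np.random.default_rng(0)
n = 2000
# Negative reviews are more likely to be treated AND have lower outcomes,
# so the text confounds the treatment-outcome relationship.
negative = rng.random(n) < 0.5
docs = np.where(negative, "late delivery", "good friendly service")
treat = rng.binomial(1, np.where(negative, 0.8, 0.2))
y = 2.0 * treat - 1.5 * negative + rng.normal(size=n)

naive = float(y[treat == 1].mean() - y[treat == 0].mean())
adjusted = adjusted_ate(y, treat, hash_embed(docs))
print(f"naive={naive:.2f}  adjusted={adjusted:.2f}  (simulated ATE is 2.0)")
```

The two document types have different token counts, so their hashed vectors cannot coincide even if individual tokens collide. That keeps the confounder recoverable from the embedding in this toy setting.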

Status

Experimental. Both estimators ship with conservative defaults and validation-set-estimated noise parameters; consume them as a starting point, not a final answer. Future versions will add text-as-confounder (Roberts-Stewart-Nielsen) and text-as-outcome (Egami et al. 2018) methods alongside topic-model integration.

References

Veitch, V., Sridhar, D., & Blei, D. M. (2020). "Adapting text embeddings for causal inference." UAI. arXiv:1905.12741.
Egami, N., Hinck, M., Stewart, B., & Wei, H. (2024). "Using imperfect surrogates for downstream inference: Design-based supervised learning for social science applications of large language models." NeurIPS. arXiv:2306.04746.

TextTreatmentResult ¶

Bases: CausalResult

ATE result for text-as-treatment estimation.

Subclasses `CausalResult`, so it inherits `.tidy()`, `.to_latex()`, `.cite()`, and the agent-native `.to_dict()` when P0's additions are present. Exposes embedding-specific metadata on the instance and inside `model_info['text_diagnostics']` so agents can introspect what happened.

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(0)
>>> n = 200
>>> docs = ['good service' if rng.random() > 0.5 else 'late delivery'
...         for _ in range(n)]
>>> treat = rng.binomial(1, 0.5, size=n)
>>> y = 2.0 * treat + rng.normal(size=n)
>>> df = pd.DataFrame({'review': docs, 'treat': treat, 'y': y})
>>> res = sp.text_treatment_effect(
...     df, text_col='review', outcome='y', treatment='treat',
...     n_components=8, seed=0,
... )
>>> isinstance(res, sp.TextTreatmentResult)
True
>>> res.estimand
'ATE'
>>> res.embedding_dim
8

LLMAnnotatorResult ¶

Bases: CausalResult

Output of `llm_annotator_correct`.

Inherits the agent-native CausalResult API. Adds annotator-specific fields naive_estimate, correction_factor, and annotator_diagnostics (false-positive / false-negative rates, validation-set size, agreement rate, optional confusion matrix and inflation factor) on the instance.
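To make the diagnostics concrete, the sketch below computes agreement, false-positive and false-negative rates, and a confusion matrix from paired labels. It is only an illustration: `annotator_diagnostics` is a hypothetical helper, the dictionary keys are assumptions rather than the keys the library stores, and the binary 0/1 coding with `NaN` for missing human labels is assumed from the parameter description of `llm_annotator_correct`.

```python
import numpy as np

def annotator_diagnostics(llm, human):
    """Summarise LLM-vs-human agreement on rows where a human label exists."""
    llm = np.asarray(llm, dtype=float)
    human = np.asarray(human, dtype=float)
    has_human = ~np.isnan(human)              # NaN marks "no human label"
    t_obs = llm[has_human].astype(int)
    t_true = human[has_human].astype(int)
    # Confusion matrix with rows = human label, columns = LLM label.
    confusion = np.zeros((2, 2), dtype=int)
    np.add.at(confusion, (t_true, t_obs), 1)
    return {
        "n_validation": int(has_human.sum()),
        "agreement": float((t_obs == t_true).mean()),
        "false_positive_rate": float(confusion[0, 1] / confusion[0].sum()),
        "false_negative_rate": float(confusion[1, 0] / confusion[1].sum()),
        "confusion_matrix": confusion,
    }

llm = [1, 1, 0, 0, 1, 0, 1, 0]
human = [1, 0, 0, 0, 1, 1, np.nan, np.nan]
diag = annotator_diagnostics(llm, human)
print(diag["n_validation"], round(diag["agreement"], 3))  # prints: 6 0.667
```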

Examples:

Normally returned by sp.llm_annotator_correct (which calls an LLM annotator); the result object itself is a plain CausalResult subclass and can be inspected directly:

>>> import statspai as sp
>>> res = sp.LLMAnnotatorResult(
...     method='llm_annotator_correct', estimand='ATE',
...     estimate=0.42, se=0.08, pvalue=0.001, ci=(0.26, 0.58),
...     alpha=0.05, n_obs=500, naive_estimate=0.30, naive_se=0.06,
...     correction_factor=1.4,
...     annotator_diagnostics={'agreement': 0.9, 'n_validation': 100},
... )
>>> round(float(res.estimate), 2)
0.42
>>> isinstance(res, sp.CausalResult)
True

text_treatment_effect ¶

text_treatment_effect(data: DataFrame, *, text_col: str, outcome: str, treatment: str, covariates: Optional[List[str]] = None, embedder: Union[str, Callable] = 'hash', n_components: int = 20, seed: int = 0, alpha: float = 0.05) -> TextTreatmentResult

Estimate the ATE of a text-derived treatment via embedding adjustment.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Input data containing the text, outcome, treatment, and any covariate columns.	required
`text_col`	`str`	Column containing the document text (string).	required
`outcome`	`str`	Outcome column (numeric).	required
`treatment`	`str`	Treatment column (binary or continuous). This is the variable whose coefficient is interpreted as the ATE.	required
`covariates`	`list of str`	Additional non-text covariates to include in the adjustment regression.	`None`
`embedder`	`'hash'`, `'sbert'`, or callable	Text embedder: one of the built-in names or, per the signature, a custom callable. See `statspai.causal_text._common.embed_texts`.	`'hash'`
`n_components`	`int`	Embedding dimensionality (for 'hash'); ignored for 'sbert'.	`20`
`seed`	`int`	Random seed, for reproducibility.	`0`
`alpha`	`float`	Significance level; confidence intervals have nominal coverage 1 - alpha.	`0.05`

Returns:

Type	Description
`TextTreatmentResult`

Examples:

>>> import statspai as sp, numpy as np, pandas as pd
>>> rng = np.random.default_rng(0)
>>> df = pd.DataFrame({
...     "text": ["great product love it"] * 15 + ["terrible buggy slow"] * 15,
...     "treatment": [1] * 15 + [0] * 15,
...     "outcome": list(rng.normal(4.0, 0.3, 15)) + list(rng.normal(2.0, 0.3, 15)),
... })
>>> r = sp.text_treatment_effect(df, text_col="text", outcome="outcome",
...                              treatment="treatment", n_components=2)
>>> r.estimate

llm_annotator_correct ¶

llm_annotator_correct(*, annotations_llm: Series, outcome: Series, annotations_human: Optional[Series] = None, covariates: Optional[DataFrame] = None, method: str = 'hausman', bootstrap: bool = False, n_bootstrap: int = 500, bootstrap_seed: Optional[int] = None, alpha: float = 0.05) -> LLMAnnotatorResult

Correct a downstream causal coefficient for LLM annotation noise.

Implements two correction paths sharing the same API:

Binary T: Hausman (1998) `β_corrected = β_obs / (1 - p_01 - p_10)`.
Multi-class T (K ≥ 3): inverse-confusion-matrix correction. The treatment is dummy-encoded with the smallest label as reference; per-class contrasts are recovered via a linear transformation built from the validation-set Bayes posterior Q[i, j] = P(T_true=i | T_obs=j).
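Both paths can be sketched in a few lines of NumPy. The sketch rests on several assumptions and is not the library's internals: `hausman_binary` and `inverse_confusion` are hypothetical names; `p_01` and `p_10` are read as the validation-set false-positive and false-negative rates, which coincides with a posterior-based reading in the balanced, symmetric-noise design simulated here; and the multi-class step assumes non-differential error, so that the mean outcome in observed class j is a Q-weighted mixture of the true class means.

```python
import numpy as np

def hausman_binary(y, t_obs, t_true_val, t_obs_val):
    """Binary path: divide the naive slope by (1 - p_01 - p_10)."""
    naive = y[t_obs == 1].mean() - y[t_obs == 0].mean()  # OLS slope on a dummy
    p_01 = (t_obs_val[t_true_val == 0] == 1).mean()       # false-positive rate
    p_10 = (t_obs_val[t_true_val == 1] == 0).mean()       # false-negative rate
    return float(naive), float(naive / (1.0 - p_01 - p_10))

def inverse_confusion(y, t_obs, t_true_val, t_obs_val, k):
    """Multi-class path: contrasts against class 0 via Q[i, j] = P(true=i | obs=j)."""
    counts = np.zeros((k, k))
    np.add.at(counts, (t_true_val, t_obs_val), 1)
    q = counts / counts.sum(axis=0, keepdims=True)        # each column sums to 1
    m_obs = np.array([y[t_obs == j].mean() for j in range(k)])
    # m_obs[j] = sum_i Q[i, j] * mu[i], i.e. m_obs = Q.T @ mu; solve for mu.
    mu = np.linalg.solve(q.T, m_obs)
    return mu[1:] - mu[0]

rng = np.random.default_rng(0)
n, n_val = 20_000, 2_000                                  # first n_val rows are validated

# Binary treatment with 15% symmetric label flips; simulated effect is 1.0.
t_true = rng.integers(0, 2, size=n)
t_llm = np.where(rng.random(n) < 0.15, 1 - t_true, t_true)
y = 1.0 * t_true + rng.standard_normal(n)
naive, corrected = hausman_binary(y, t_llm, t_true[:n_val], t_llm[:n_val])

# Three classes; 20% of labels are replaced by a uniformly random class.
t3 = rng.integers(0, 3, size=n)
t3_llm = np.where(rng.random(n) < 0.2, rng.integers(0, 3, size=n), t3)
y3 = 1.0 * t3 + rng.standard_normal(n)                    # class means 0, 1, 2
contrasts = inverse_confusion(y3, t3_llm, t3[:n_val], t3_llm[:n_val], k=3)

print(f"binary: naive={naive:.2f} corrected={corrected:.2f}")
print("3-class contrasts vs class 0:", np.round(contrasts, 2))
```

In this simulation the naive binary slope is attenuated toward roughly 0.7 of the simulated effect, and dividing by the estimated `1 - p_01 - p_10` undoes that shrinkage.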

Parameters:

Name	Type	Description	Default
`annotations_llm`	`Series`	LLM-derived annotation for every row. Binary or multi-class (numeric class labels).	required
`outcome`	`Series`	Outcome variable.	required
`annotations_human`	`Series`	Human annotation; `NaN` where unavailable. At least 30 rows with both LLM and human labels are required, and every true class must appear at least once.	`None`
`covariates`	`DataFrame`	Additional control variables for the OLS regression.	`None`
`method`	`'hausman'`	Correction method. Future versions will add full SAR with super learners (Egami et al. 2024 §3).	`'hausman'`
`bootstrap`	`bool`	If True, run a bias-corrected percentile bootstrap that resamples the full sample (validation rows and unlabeled rows jointly) `n_bootstrap` times. The CIs and SE on the result then reflect validation-set sampling uncertainty; the first-order versions are kept in `model_info['first_order_se']` and `model_info['first_order_ci']`.	`False`
`n_bootstrap`	`int`	Bootstrap replicates.	`500`
`bootstrap_seed`	`int`	Seed for the bootstrap RNG (NumPy `default_rng`).	`None`
`alpha`	`float`	Significance level; confidence intervals have nominal coverage 1 - alpha.	`0.05`
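The joint resampling behind the `bootstrap` option can be illustrated as follows. This is a simplified sketch: it uses a plain percentile interval rather than the bias-corrected variant described in the table, the helper names are hypothetical, and the binary correction is the `1 - p_01 - p_10` formula with `p_01` and `p_10` taken as validation-set false-positive and false-negative rates (an assumption). The key point is that each replicate redraws validated and unvalidated rows together, so the misclassification rates are re-estimated every time.

```python
import numpy as np

def corrected_slope(y, t_llm, t_human):
    """Naive dummy-variable slope divided by (1 - p_01 - p_10) from validated rows."""
    naive = y[t_llm == 1].mean() - y[t_llm == 0].mean()
    p_01 = (t_llm[t_human == 0] == 1).mean()   # NaN rows never equal 0 or 1
    p_10 = (t_llm[t_human == 1] == 0).mean()
    return naive / (1.0 - p_01 - p_10)

def percentile_bootstrap(y, t_llm, t_human, n_bootstrap=200, alpha=0.05, seed=0):
    """Plain percentile bootstrap over the full sample (not bias-corrected)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    draws = np.empty(n_bootstrap)
    for b in range(n_bootstrap):
        idx = rng.integers(0, n, size=n)       # validated and unlabeled rows together
        draws[b] = corrected_slope(y[idx], t_llm[idx], t_human[idx])
    lo, hi = np.quantile(draws, [alpha / 2, 1 - alpha / 2])
    return float(draws.std(ddof=1)), (float(lo), float(hi))

rng = np.random.default_rng(1)
n, n_val = 4000, 800
t_true = rng.integers(0, 2, size=n)
t_llm = np.where(rng.random(n) < 0.15, 1 - t_true, t_true)
y = 1.0 * t_true + rng.standard_normal(n)      # simulated effect is 1.0
t_human = np.where(np.arange(n) < n_val, t_true, np.nan)

se, (lo, hi) = percentile_bootstrap(y, t_llm, t_human)
print(f"bootstrap SE={se:.3f}, {100 * (1 - 0.05):.0f}% CI=({lo:.2f}, {hi:.2f})")
```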

Returns:

Type	Description
`LLMAnnotatorResult`

Examples:

>>> import statspai as sp, pandas as pd, numpy as np
>>> n, n_val = 1000, 100
>>> rng = np.random.default_rng(0)
>>> T_true = (rng.random(n) > 0.5).astype(int)
>>> noise = (rng.random(n) < 0.15).astype(int)
>>> T_llm = (T_true ^ noise).astype(int)            # 15% misclass.
>>> y = 1.0 * T_true + rng.standard_normal(n)        # true ATE 1.0
>>> human = pd.Series([T_true[i] if i < n_val else np.nan
...                    for i in range(n)])
>>> r = sp.llm_annotator_correct(
...     annotations_llm=pd.Series(T_llm),
...     annotations_human=human,
...     outcome=pd.Series(y),
... )
>>> isinstance(r, sp.LLMAnnotatorResult)
True
>>> bool(0.7 < r.estimate < 1.3)    # corrected toward the true ATE 1.0
True

References

Egami, Hinck, Stewart & Wei (NeurIPS 2024), arXiv:2306.04746.
Hausman, Abrevaya & Scott-Morton (J. Econometrics 1998).
