
statspai.causal_text

causal_text

Causal inference with text data — MVP (P1-B, v1.6 experimental).

Two methods to start the family:

  • :func:text_treatment_effect (Veitch-Sridhar-Blei 2020): text-as-treatment via embedding adjustment. Estimates the ATE of a user-supplied treatment indicator while controlling for the confounding variation captured in the text embedding.

  • :func:llm_annotator_correct (Egami-Hinck-Stewart-Wei 2024): measurement-error correction for LLM-derived treatment labels. Given a small validation subset where both LLM and human labels exist, it debiases the downstream regression coefficient via a Hausman-style correction.

Status

Experimental. Both estimators ship with conservative defaults and with noise parameters estimated from the validation set; treat their output as a starting point, not a final answer. Future versions will add text-as-confounder (Roberts-Stewart-Nielsen) and text-as-outcome (Egami et al. 2018) methods alongside topic-model integration.

References
  • Veitch, V., Sridhar, D., & Blei, D. M. (2020). "Adapting text embeddings for causal inference." UAI. arXiv:1905.12741.
  • Egami, N., Hinck, M., Stewart, B., & Wei, H. (2024). "Using imperfect surrogates for downstream inference: Design-based supervised learning for social science applications of large language models." NeurIPS. arXiv:2306.04746.

TextTreatmentResult

Bases: CausalResult

ATE result for text-as-treatment estimation.

Subclasses :class:CausalResult so it inherits .tidy(), .to_latex(), .cite(), and the agent-native .to_dict() when P0's additions are present. Exposes embedding-specific metadata on the instance and inside model_info['text_diagnostics'] so agents can introspect what happened.

LLMAnnotatorResult

Bases: CausalResult

Output of :func:llm_annotator_correct.

Inherits the agent-native CausalResult API. Adds annotator-specific fields naive_estimate, correction_factor, and annotator_diagnostics (false-positive / false-negative rates, validation-set size, agreement rate, optional confusion matrix and inflation factor) on the instance.
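The diagnostics listed above can all be derived from the validation subset alone. The following is a minimal sketch for the binary case, assuming 0/1 labels and NaN for rows without a human label; the function name `annotator_diagnostics` and the dictionary keys are hypothetical and are not the library's field names.

```python
import numpy as np
import pandas as pd


def annotator_diagnostics(llm: pd.Series, human: pd.Series) -> dict:
    """Validation-set diagnostics for a binary (0/1) LLM annotator (illustrative)."""
    # Only rows with a human label (non-NaN) form the validation subset.
    validated = human.notna()
    t_llm = llm[validated].astype(int)
    t_true = human[validated].astype(int)

    # Confusion matrix: rows are human labels, columns are LLM labels.
    confusion = pd.crosstab(t_true, t_llm).reindex(
        index=[0, 1], columns=[0, 1], fill_value=0
    )
    n_true_0 = confusion.loc[0].sum()
    n_true_1 = confusion.loc[1].sum()
    return {
        "n_validation": int(validated.sum()),
        "agreement_rate": float((t_llm == t_true).mean()),
        # Share of truly-0 rows that the LLM labelled 1.
        "false_positive_rate": float(confusion.loc[0, 1] / n_true_0),
        # Share of truly-1 rows that the LLM labelled 0.
        "false_negative_rate": float(confusion.loc[1, 0] / n_true_1),
        "confusion_matrix": confusion,
    }


# Ten rows, the last two without a human label.
llm = pd.Series([0, 0, 0, 1, 1, 1, 0, 0, 1, 0])
human = pd.Series([0, 0, 0, 0, 1, 1, 1, 1, np.nan, np.nan])
diag = annotator_diagnostics(llm, human)
```

The rates here condition on the human label; whether the library reports them under this convention is not stated on this page.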

text_treatment_effect

text_treatment_effect(data: DataFrame, *, text_col: str, outcome: str, treatment: str, covariates: Optional[List[str]] = None, embedder: Union[str, Callable] = 'hash', n_components: int = 20, seed: int = 0, alpha: float = 0.05) -> TextTreatmentResult

Estimate the ATE of a text-derived treatment via embedding adjustment.
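To make the idea of embedding adjustment concrete, here is a rough sketch that does not call the library. It assumes a simple token-hashing embedder (CRC32 buckets) and a plain OLS regression of the outcome on the treatment and the embedding columns. Both are stand-ins: the actual 'hash' embedder and the regression specification used by text_treatment_effect are not documented on this page, and the names `hash_embed` and `embedding_adjusted_effect` are hypothetical.

```python
import zlib

import numpy as np


def hash_embed(texts, n_components=20):
    """Count tokens into n_components hash buckets (stand-in embedder)."""
    X = np.zeros((len(texts), n_components))
    for i, doc in enumerate(texts):
        for token in doc.lower().split():
            # crc32 is deterministic across runs, unlike Python's built-in hash().
            bucket = zlib.crc32(token.encode("utf-8")) % n_components
            X[i, bucket] += 1.0
    return X


def embedding_adjusted_effect(y, t, texts, n_components=20):
    """OLS coefficient on t after adjusting for the hashed text embedding."""
    E = hash_embed(texts, n_components)
    design = np.column_stack([np.ones(len(y)), t, E])
    # lstsq tolerates the rank-deficient design (empty or constant buckets).
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(coef[1])


# Simulated data: the wording of the text drives both treatment and outcome.
rng = np.random.default_rng(0)
n = 2000
urgent = rng.random(n) < 0.5
texts = [
    "urgent critical outage now" if u else "routine minor cleanup later"
    for u in urgent
]
t = (rng.random(n) < np.where(urgent, 0.8, 0.2)).astype(float)
y = 1.0 * t + 2.0 * urgent + rng.standard_normal(n)  # true effect of t is 1.0

naive = float(y[t == 1].mean() - y[t == 0].mean())  # confounded by the text
adjusted = embedding_adjusted_effect(y, t, texts)   # text variation controlled
```

In this toy setup the unadjusted difference in means absorbs the effect of the text, while the adjusted coefficient should land near the simulated effect because the embedding separates the two document types.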

Parameters:

Name | Type | Description | Default
data | DataFrame | Input data; must contain the columns named by text_col, outcome, treatment, and covariates. | required
text_col | str | Column containing the document text (string). | required
outcome | str | Outcome column (numeric). | required
treatment | str | Treatment column (binary or continuous). Its coefficient is interpreted as the ATE. | required
covariates | list of str | Additional non-text covariates to include in the adjustment regression. | None
embedder | 'hash', 'sbert', or callable | Text embedder. See :func:statspai.causal_text._common.embed_texts. | 'hash'
n_components | int | Embedding dimensionality (for 'hash'); ignored for 'sbert'. | 20
seed | int | Random seed. | 0
alpha | float | Significance level; confidence intervals have nominal coverage 1 - alpha. | 0.05

Returns:

Type Description
TextTreatmentResult

Examples:

>>> import statspai as sp, pandas as pd
>>> df = pd.DataFrame({
...     'text': ['great product', 'terrible bug', 'okay tool', 'love it'],
...     'treatment': [1, 0, 0, 1],
...     'outcome': [4.5, 1.2, 2.8, 4.7],
... })
>>> r = sp.text_treatment_effect(df, text_col='text', outcome='outcome',
...                              treatment='treatment', n_components=4)
>>> r.estimate

llm_annotator_correct

llm_annotator_correct(*, annotations_llm: Series, outcome: Series, annotations_human: Optional[Series] = None, covariates: Optional[DataFrame] = None, method: str = 'hausman', bootstrap: bool = False, n_bootstrap: int = 500, bootstrap_seed: Optional[int] = None, alpha: float = 0.05) -> LLMAnnotatorResult

Correct a downstream causal coefficient for LLM annotation noise.

Implements two correction paths sharing the same API:

  • Binary T: Hausman (1998) correction, ``β_corrected = β_obs / (1 - p_01 - p_10)``.
  • Multi-class T (K ≥ 3): inverse-confusion-matrix correction. The treatment is dummy-encoded with the smallest label as reference; per-class contrasts are recovered via a linear transformation built from the validation-set Bayes posterior Q[i, j] = P(T_true=i | T_obs=j).
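The two paths can be illustrated with one small sketch that does not call the library. It assumes non-differential error (the outcome depends on the LLM label only through the true label), ignores covariates, and estimates Q on the validated rows. Under that assumption the observed group means satisfy m_obs = Qᵀ μ_true, so solving this system recovers the true group means for any K ≥ 2. Reading p_01 and p_10 as the off-diagonal entries of Q is an assumption made here; the library may define them differently, and the function names are hypothetical.

```python
import numpy as np


def posterior_matrix(t_obs, t_true, labels):
    """Q[i, j] = P(T_true = labels[i] | T_obs = labels[j]) on validated rows."""
    Q = np.zeros((len(labels), len(labels)))
    for j, obs_label in enumerate(labels):
        in_col = t_obs == obs_label
        for i, true_label in enumerate(labels):
            Q[i, j] = np.mean(t_true[in_col] == true_label)
    return Q


def corrected_contrasts(y, t_llm, t_human):
    """Naive and corrected contrasts against the smallest label."""
    y = np.asarray(y, dtype=float)
    t_llm = np.asarray(t_llm)
    t_human = np.asarray(t_human, dtype=float)
    validated = ~np.isnan(t_human)
    labels = np.unique(t_llm)

    Q = posterior_matrix(t_llm[validated], t_human[validated], labels)
    # Mean outcome within each LLM-assigned class (all rows).
    m_obs = np.array([y[t_llm == lab].mean() for lab in labels])
    # m_obs = Q^T mu_true  =>  mu_true = solve(Q^T, m_obs).
    mu_true = np.linalg.solve(Q.T, m_obs)
    return m_obs[1:] - m_obs[0], mu_true[1:] - mu_true[0], Q


# Binary simulation: 15% symmetric misclassification, true effect 1.0.
rng = np.random.default_rng(0)
n, n_val = 20_000, 2_000
t_true = (rng.random(n) > 0.5).astype(int)
flip = rng.random(n) < 0.15
t_llm = np.where(flip, 1 - t_true, t_true)
y = 1.0 * t_true + rng.standard_normal(n)
t_human = np.where(np.arange(n) < n_val, t_true, np.nan)

naive, corrected, Q = corrected_contrasts(y, t_llm, t_human)
# In the binary case the matrix solve reduces to the ratio form.
ratio_form = naive[0] / (1.0 - Q[0, 1] - Q[1, 0])
```

For K = 2 the matrix solve and the ratio β_obs / (1 - p_01 - p_10) give the same number, which is why the two paths can share one API.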

Parameters:

Name | Type | Description | Default
annotations_llm | Series | LLM-derived annotation for every row. Binary or multi-class (numeric class labels). | required
outcome | Series | Outcome variable. | required
annotations_human | Series | Human annotation; NaN where unavailable. At least 30 rows with both LLM and human labels are required, and every true class must appear at least once. | None
covariates | DataFrame | Additional control variables for the OLS regression. | None
method | 'hausman' | Correction method. Future versions will add full SAR with super learners (Egami et al. 2024 §3). | 'hausman'
bootstrap | bool | If True, run a bias-corrected percentile bootstrap that resamples the full sample (validation rows and unlabeled rows jointly) n_bootstrap times. The SE and CIs on the result then reflect validation-set sampling uncertainty; the first-order versions are stored in model_info['first_order_se'] and model_info['first_order_ci']. | False
n_bootstrap | int | Number of bootstrap replicates. | 500
bootstrap_seed | int | Seed for the bootstrap RNG (NumPy default_rng). | None
alpha | float | Significance level; confidence intervals have nominal coverage 1 - alpha. | 0.05

Returns:

Type Description
LLMAnnotatorResult
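The bootstrap option can be pictured with the sketch below, which resamples validated and unlabeled rows jointly and re-estimates the misclassification rates in every replicate. It reads "bias-corrected percentile bootstrap" as Efron's BC interval, which is an assumption, and it uses a compact binary correction as a stand-in estimator; `corrected_estimate` and `bc_bootstrap_ci` are hypothetical names, not library functions.

```python
from statistics import NormalDist

import numpy as np


def corrected_estimate(y, t_llm, t_human):
    """Binary difference in means divided by (1 - p_01 - p_10)."""
    validated = ~np.isnan(t_human)
    t_obs_v, t_true_v = t_llm[validated], t_human[validated]
    p_01 = np.mean(t_true_v[t_obs_v == 1] == 0)  # P(true = 0 | LLM = 1)
    p_10 = np.mean(t_true_v[t_obs_v == 0] == 1)  # P(true = 1 | LLM = 0)
    naive = y[t_llm == 1].mean() - y[t_llm == 0].mean()
    return naive / (1.0 - p_01 - p_10)


def bc_bootstrap_ci(y, t_llm, t_human, n_bootstrap=500, alpha=0.05, seed=None):
    """Point estimate, bootstrap SE, and bias-corrected percentile CI."""
    rng = np.random.default_rng(seed)
    n = len(y)
    point = corrected_estimate(y, t_llm, t_human)
    draws = np.empty(n_bootstrap)
    for b in range(n_bootstrap):
        # Resample whole rows so validated and unlabeled rows move together.
        idx = rng.integers(0, n, size=n)
        draws[b] = corrected_estimate(y[idx], t_llm[idx], t_human[idx])

    norm = NormalDist()
    # Bias-correction constant z0 from the share of draws below the point estimate.
    frac = np.clip(np.mean(draws < point),
                   1.0 / (n_bootstrap + 1), n_bootstrap / (n_bootstrap + 1))
    z0 = norm.inv_cdf(float(frac))
    lo_q = norm.cdf(2 * z0 + norm.inv_cdf(alpha / 2))
    hi_q = norm.cdf(2 * z0 + norm.inv_cdf(1 - alpha / 2))
    lo, hi = np.quantile(draws, [lo_q, hi_q])
    return point, float(draws.std(ddof=1)), (float(lo), float(hi))


# Simulated data: 15% symmetric misclassification, true effect 1.0.
rng = np.random.default_rng(0)
n, n_val = 2000, 400
t_true = (rng.random(n) > 0.5).astype(int)
t_llm = np.where(rng.random(n) < 0.15, 1 - t_true, t_true)
y = 1.0 * t_true + rng.standard_normal(n)
t_human = np.where(np.arange(n) < n_val, t_true, np.nan)

point, se, (ci_lo, ci_hi) = bc_bootstrap_ci(
    y, t_llm, t_human, n_bootstrap=200, seed=1
)
```

Because the rates are recomputed inside each replicate, the spread of the draws includes the uncertainty from the finite validation set, which a standard error that treats the rates as fixed would miss.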

Examples:

>>> import statspai as sp, pandas as pd, numpy as np
>>> n, n_val = 1000, 100
>>> rng = np.random.default_rng(0)
>>> T_true = (rng.random(n) > 0.5).astype(int)
>>> noise = (rng.random(n) < 0.15).astype(int)
>>> T_llm = (T_true ^ noise).astype(int)            # 15% misclass.
>>> y = 1.0 * T_true + rng.standard_normal(n)        # true ATE 1.0
>>> human = pd.Series([T_true[i] if i < n_val else np.nan
...                    for i in range(n)])
>>> r = sp.llm_annotator_correct(
...     annotations_llm=pd.Series(T_llm),
...     annotations_human=human,
...     outcome=pd.Series(y),
... )
>>> r.estimate    # ~1.0 (corrected from naive ~0.7)
References

  • Egami, Hinck, Stewart & Wei (NeurIPS 2024). arXiv:2306.04746.
  • Hausman, Abrevaya & Scott-Morton (J. Econometrics 1998).