statspai.causal_text¶
causal_text ¶
Causal inference with text data — MVP (P1-B, v1.6 experimental).
Two methods start the family:
- `text_treatment_effect`: Veitch-Sridhar-Blei (2020) text-as-treatment via embedding adjustment. Estimates the ATE of a user-supplied treatment indicator while controlling for the confounding variation captured in the text embedding.
- `llm_annotator_correct`: Egami-Hinck-Stewart-Wei (2024) measurement-error correction for LLM-derived treatment labels. Given a small validation subset where both LLM and human labels exist, it debiases the downstream regression coefficient via a Hausman-style correction.
Status
Experimental. Both estimators ship with conservative defaults and validation-set-estimated noise parameters; consume them as a starting point, not a final answer. Future versions will add text-as-confounder (Roberts-Stewart-Nielsen) and text-as-outcome (Egami et al. 2018) methods alongside topic-model integration.
References
- Veitch, V., Sridhar, D., & Blei, D. M. (2019). "Adapting text embeddings for causal inference." UAI. arXiv:1905.12741.
- Egami, N., Hinck, M., Stewart, B., & Wei, H. (2024). "Using imperfect surrogates for downstream inference: Design-based supervised learning for social science applications of large language models." NeurIPS. arXiv:2306.04746.
TextTreatmentResult ¶
Bases: CausalResult
ATE result for text-as-treatment estimation.
Subclasses `CausalResult`, so it inherits `.tidy()`, `.to_latex()`, `.cite()`, and the agent-native `.to_dict()` when P0's additions are present. Exposes embedding-specific metadata on the instance and inside `model_info['text_diagnostics']` so agents can introspect what happened.
LLMAnnotatorResult ¶
Bases: CausalResult
Output of `llm_annotator_correct`.
Inherits the agent-native `CausalResult` API. Adds the annotator-specific fields `naive_estimate`, `correction_factor`, and `annotator_diagnostics` (false-positive / false-negative rates, validation-set size, agreement rate, optional confusion matrix and inflation factor) on the instance.
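To make the diagnostics listed above concrete, here is a minimal sketch of how such quantities could be computed from a validation subset with NumPy. It is not the library's implementation: the function name `annotator_diagnostics_sketch`, the dictionary keys, and the convention that `NaN` marks rows without a human label are all assumptions made for illustration. The false-positive and false-negative rates are taken relative to the human label, which is also an assumption.

```python
import numpy as np

def annotator_diagnostics_sketch(llm, human):
    """Validation-set diagnostics for a binary LLM label vs. a human label.

    Hypothetical helper: rows where `human` is NaN are treated as unlabeled.
    """
    llm = np.asarray(llm, dtype=float)
    human = np.asarray(human, dtype=float)
    labeled = ~np.isnan(human)
    l = llm[labeled].astype(int)
    h = human[labeled].astype(int)

    # 2x2 confusion matrix: rows index the human label, columns the LLM label.
    confusion = np.zeros((2, 2), dtype=int)
    np.add.at(confusion, (h, l), 1)

    return {
        "n_validation": int(labeled.sum()),
        "agreement_rate": float((l == h).mean()),
        # Share of human-negative rows that the LLM labeled positive.
        "false_positive_rate": float(confusion[0, 1] / confusion[0].sum()),
        # Share of human-positive rows that the LLM labeled negative.
        "false_negative_rate": float(confusion[1, 0] / confusion[1].sum()),
        "confusion_matrix": confusion,
    }

diag = annotator_diagnostics_sketch(
    llm=[1, 0, 1, 1, 0, 0, 1, 0],
    human=[1, 0, 0, 1, np.nan, 0, 1, np.nan],
)
print(diag["n_validation"], diag["agreement_rate"])
```

Six of the eight rows carry a human label here, and the LLM disagrees with the human on exactly one of them.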
text_treatment_effect ¶
text_treatment_effect(data: DataFrame, *, text_col: str, outcome: str, treatment: str, covariates: Optional[List[str]] = None, embedder: Union[str, Callable] = 'hash', n_components: int = 20, seed: int = 0, alpha: float = 0.05) -> TextTreatmentResult
Estimate the ATE of a text-derived treatment via embedding adjustment.
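As a rough illustration of what embedding adjustment means, the sketch below hashes each document into a fixed-width count vector and regresses the outcome on the treatment plus that vector. It is a simplified stand-in under stated assumptions, not the estimator shipped in `statspai`: `hash_embed` and `adjusted_ate` are hypothetical names, the hashing scheme is an assumption, and the simulated data are invented for the demonstration.

```python
import hashlib
import numpy as np

def hash_embed(texts, n_components=20):
    """Hypothetical hashing embedder: token counts folded into n_components buckets."""
    X = np.zeros((len(texts), n_components))
    for row, doc in enumerate(texts):
        for token in doc.lower().split():
            # md5 gives a hash that is stable across processes (built-in hash() is salted).
            digest = hashlib.md5(token.encode("utf-8")).digest()
            X[row, int.from_bytes(digest[:4], "little") % n_components] += 1.0
    return X

def adjusted_ate(y, t, embedding):
    """OLS of y on [1, t, embedding]; returns the coefficient on t."""
    design = np.column_stack([np.ones(len(y)), t, embedding])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

# Simulated data: the text reveals a confounder that drives both treatment and outcome.
rng = np.random.default_rng(0)
n = 2000
urgent = rng.random(n) < 0.5
texts = ["urgent outage now" if u else "routine update later" for u in urgent]
t = (rng.random(n) < np.where(urgent, 0.8, 0.2)).astype(float)
y = 1.0 * t + 2.0 * urgent + rng.standard_normal(n)  # true effect of t is 1.0

naive = adjusted_ate(y, t, np.empty((n, 0)))              # no text adjustment
adjusted = adjusted_ate(y, t, hash_embed(texts, n_components=32))
print(f"naive={naive:.2f}  adjusted={adjusted:.2f}")
```

In this toy setup the unadjusted coefficient absorbs the confounder's effect, while conditioning on the embedding brings the estimate back toward the simulated effect. Real text is far less cleanly separable, so the embedding only removes the confounding it actually captures.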
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `DataFrame` | | required |
| `text_col` | `str` | Column containing the document text (string). | required |
| `outcome` | `str` | Outcome column (numeric). | required |
| `treatment` | `str` | Treatment column (binary or continuous). This is the variable whose coefficient is interpreted as the ATE. | required |
| `covariates` | list of `str` | Additional non-text covariates to include in the adjustment regression. | `None` |
| `embedder` | `('hash', 'sbert')` | Text embedder. See :func: | `'hash'` |
| `n_components` | `int` | Embedding dimensionality (for 'hash'); ignored for 'sbert'. | `20` |
| `seed` | `int` | | `0` |
| `alpha` | `float` | CI level (1 - alpha confidence). | `0.05` |
Returns:
| Type | Description |
|---|---|
| `TextTreatmentResult` | |
Examples:
>>> import statspai as sp, pandas as pd
>>> df = pd.DataFrame({
... 'text': ['great product', 'terrible bug', 'okay tool', 'love it'],
... 'treatment': [1, 0, 0, 1],
... 'outcome': [4.5, 1.2, 2.8, 4.7],
... })
>>> r = sp.text_treatment_effect(df, text_col='text', outcome='outcome',
... treatment='treatment', n_components=4)
>>> r.estimate
llm_annotator_correct ¶
llm_annotator_correct(*, annotations_llm: Series, outcome: Series, annotations_human: Optional[Series] = None, covariates: Optional[DataFrame] = None, method: str = 'hausman', bootstrap: bool = False, n_bootstrap: int = 500, bootstrap_seed: Optional[int] = None, alpha: float = 0.05) -> LLMAnnotatorResult
Correct a downstream causal coefficient for LLM annotation noise.
Implements two correction paths sharing the same API:
- Binary T: Hausman et al. (1998), `β_corrected = β_obs / (1 - p_01 - p_10)`.
- Multi-class T (K ≥ 3): inverse-confusion-matrix correction. The treatment is dummy-encoded with the smallest label as reference; per-class contrasts are recovered via a linear transformation built from the validation-set Bayes posterior `Q[i, j] = P(T_true=i | T_obs=j)`.
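The multi-class path can be illustrated with class means instead of a full regression. The sketch below assumes the outcome depends on the observed label only through the true label, so that `E[Y | T_obs=j] = Σ_i Q[i, j] · μ_i`, and it solves that linear system for the true-class means `μ`. This is one reading of the description above, not the library's code: `posterior_matrix` and `corrected_contrasts` are hypothetical names, covariates are omitted, and the simulated data are invented.

```python
import numpy as np

def posterior_matrix(t_true_val, t_obs_val, k):
    """Q[i, j] = P(T_true = i | T_obs = j), estimated on the validation rows."""
    counts = np.zeros((k, k))
    np.add.at(counts, (t_true_val, t_obs_val), 1.0)
    return counts / counts.sum(axis=0, keepdims=True)  # each column sums to 1

def corrected_contrasts(y, t_obs, q):
    """Solve Q^T mu = m for the true-class means, then contrast against class 0."""
    k = q.shape[0]
    m = np.array([y[t_obs == j].mean() for j in range(k)])  # observed-class means
    mu = np.linalg.solve(q.T, m)
    return mu[1:] - mu[0]

rng = np.random.default_rng(0)
n, n_val, k = 30000, 3000, 3
t_true = rng.integers(0, k, size=n)
replace = rng.random(n) < 0.2                     # 20% of labels redrawn at random
t_obs = np.where(replace, rng.integers(0, k, size=n), t_true)
class_means = np.array([0.0, 1.0, 2.0])           # true contrasts vs class 0: 1.0, 2.0
y = class_means[t_true] + rng.standard_normal(n)

# Only the first n_val rows are treated as human-validated.
q = posterior_matrix(t_true[:n_val], t_obs[:n_val], k)
naive = np.array([y[t_obs == j].mean() for j in range(1, k)]) - y[t_obs == 0].mean()
corrected = corrected_contrasts(y, t_obs, q)
print("naive:", naive.round(2), "corrected:", corrected.round(2))
```

The naive contrasts are pulled toward zero by the label noise, and inverting the estimated posterior matrix undoes that shrinkage up to sampling error in `Q`.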
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `annotations_llm` | `Series` | LLM-derived annotation for every row. Binary or multi-class (numeric class labels). | required |
| `outcome` | `Series` | Outcome variable. | required |
| `annotations_human` | `Series` | Human annotation; | `None` |
| `covariates` | `DataFrame` | Additional control variables for the OLS regression. | `None` |
| `method` | `'hausman'` | Correction method. Future versions will add full SAR with super learners (Egami et al. 2024 §3). | `'hausman'` |
| `bootstrap` | `bool` | If True, run a bias-corrected percentile bootstrap that resamples the full sample (validation rows + unlabeled rows jointly) | `False` |
| `n_bootstrap` | `int` | Bootstrap replicates. | `500` |
| `bootstrap_seed` | `int` | Seed for the bootstrap RNG (NumPy | `None` |
| `alpha` | `float` | CI level (1 - alpha confidence). | `0.05` |
Returns:
| Type | Description |
|---|---|
| `LLMAnnotatorResult` | |
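The `bootstrap` option resamples validation and unlabeled rows together, so each replicate re-estimates both the misclassification rates and the naive coefficient. The sketch below shows one way this could look, using Efron's bias-corrected (BC) percentile interval for a binary label with no covariates. Everything here is an assumption for illustration: the function names are hypothetical, `NaN` marks unlabeled rows, and `p_01`, `p_10` are read as misclassification probabilities conditional on the LLM label, which the description above does not spell out.

```python
import numpy as np
from statistics import NormalDist

def corrected_estimate(y, t_llm, t_human):
    """Binary Hausman-style estimate; NaN in t_human marks unlabeled rows."""
    val = ~np.isnan(t_human)
    p_01 = np.mean(t_human[val][t_llm[val] == 1] == 0)  # P(true=0 | llm=1)
    p_10 = np.mean(t_human[val][t_llm[val] == 0] == 1)  # P(true=1 | llm=0)
    # With a binary regressor and an intercept, the OLS slope is a difference in means.
    beta_obs = y[t_llm == 1].mean() - y[t_llm == 0].mean()
    return beta_obs / (1.0 - p_01 - p_10)

def bc_bootstrap_ci(y, t_llm, t_human, n_bootstrap=500, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    point = corrected_estimate(y, t_llm, t_human)
    draws = np.empty(n_bootstrap)
    for b in range(n_bootstrap):
        idx = rng.integers(0, n, size=n)  # validation and unlabeled rows drawn jointly
        draws[b] = corrected_estimate(y[idx], t_llm[idx], t_human[idx])
    # Bias-correction constant z0 from the share of draws below the point estimate.
    share = np.clip(np.mean(draws < point), 1 / n_bootstrap, 1 - 1 / n_bootstrap)
    nd = NormalDist()
    z0 = nd.inv_cdf(float(share))
    lo_q = nd.cdf(2 * z0 + nd.inv_cdf(alpha / 2))
    hi_q = nd.cdf(2 * z0 + nd.inv_cdf(1 - alpha / 2))
    lo, hi = np.quantile(draws, [lo_q, hi_q])
    return point, lo, hi

rng = np.random.default_rng(0)
n, n_val = 4000, 800
t_true = (rng.random(n) > 0.5).astype(int)
t_llm = t_true ^ (rng.random(n) < 0.15).astype(int)      # 15% misclassification
y = 1.0 * t_true + rng.standard_normal(n)                 # simulated effect 1.0
t_human = np.where(np.arange(n) < n_val, t_true, np.nan)  # labels on first n_val rows

point, lo, hi = bc_bootstrap_ci(y, t_llm, t_human, n_bootstrap=200)
print(f"estimate={point:.2f}  95% CI=({lo:.2f}, {hi:.2f})")
```

Resampling all rows jointly lets the interval reflect uncertainty in the validation-set noise estimates as well as in the regression coefficient, which a bootstrap over the unlabeled rows alone would miss.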
Examples:
>>> import statspai as sp, pandas as pd, numpy as np
>>> n, n_val = 1000, 100
>>> rng = np.random.default_rng(0)
>>> T_true = (rng.random(n) > 0.5).astype(int)
>>> noise = (rng.random(n) < 0.15).astype(int)
>>> T_llm = (T_true ^ noise).astype(int) # 15% misclass.
>>> y = 1.0 * T_true + rng.standard_normal(n) # true ATE 1.0
>>> human = pd.Series([T_true[i] if i < n_val else np.nan
... for i in range(n)])
>>> r = sp.llm_annotator_correct(
... annotations_llm=pd.Series(T_llm),
... annotations_human=human,
... outcome=pd.Series(y),
... )
>>> r.estimate # ~1.0 (corrected from naive ~0.7)
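The numbers in the comment above can be checked by hand. The derivation below assumes the outcome depends on the LLM label only through the true label, and it reads `p_01` and `p_10` as misclassification probabilities conditional on the LLM label. That reading is an assumption, since the source does not define them. In this balanced, symmetric design it coincides with the 15% flip rate.

```latex
\begin{aligned}
\mathbb{E}[Y \mid T_{\text{llm}}=1] - \mathbb{E}[Y \mid T_{\text{llm}}=0]
  &= \beta \,\bigl[ P(T=1 \mid T_{\text{llm}}=1) - P(T=1 \mid T_{\text{llm}}=0) \bigr] \\
  &= \beta \,(1 - p_{01} - p_{10}), \\
p_{01} = P(T=0 \mid T_{\text{llm}}=1) &= \frac{0.15 \times 0.5}{0.5} = 0.15, \qquad
p_{10} = P(T=1 \mid T_{\text{llm}}=0) = 0.15, \\
\beta_{\text{obs}} \approx 1.0 \times (1 - 0.15 - 0.15) &= 0.7, \qquad
\beta_{\text{corrected}} = \frac{0.7}{0.7} = 1.0 .
\end{aligned}
```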
References
- Egami, Hinck, Stewart & Wei (NeurIPS 2024). arXiv:2306.04746.
- Hausman, Abrevaya & Scott-Morton (J. Econometrics 1998).