`statspai.causal_rl`¶

Causal Reinforcement Learning (StatsPAI v0.10).

Bridges between RL and causal inference for offline / batch learning scenarios with unobserved confounding.

References

Li, Zhang & Bareinboim (2025), arXiv 2510.21110 — Confounding-Robust Deep RL.
Cunha, Liu, French & Mian (2025), arXiv 2512.18135 — Unifying Causal RL.
Chemingui, Deshwal, Fern, Nguyen-Tang & Doppa (2025), arXiv 2510.22027 — Online Optimization for Offline Safe RL.

CausalDQNResult `dataclass` ¶

Bases: ResultProtocolMixin

Output of confounding-robust Q-learning (`causal_dqn`).

Attributes:

Name	Type	Description
`q_table`	`ndarray`	Learned action-value table, shape `(n_states, n_actions)`.
`policy`	`ndarray`	Greedy action for each state, shape `(n_states,)`.
`gamma_bound`	`float`	Confounding bound used during the Bellman updates.
`n_iter`	`int`	Number of value-iteration sweeps performed.
`final_bellman_error`	`float`	Mean squared temporal-difference error at the last iteration.
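
To make these attributes concrete, the sketch below runs plain tabular Q-iteration on logged (s, a, r, s') transitions and returns a Q-table, a greedy policy, and a mean squared temporal-difference error. It is not the library's implementation: the helper `tabular_q_iteration`, its `discount` parameter, and the per-cell averaging update are assumptions, and the confounding adjustment controlled by `gamma_bound` is deliberately left out.

```python
import numpy as np

def tabular_q_iteration(s, a, r, s_next, n_states, n_actions,
                        discount=0.9, n_iter=50):
    """Hypothetical helper: batch Q-iteration on logged transitions."""
    q = np.zeros((n_states, n_actions))
    counts = np.zeros((n_states, n_actions))
    np.add.at(counts, (s, a), 1.0)              # visits per (state, action)
    for _ in range(n_iter):
        # Bellman target for every logged transition
        target = r + discount * q[s_next].max(axis=1)
        sums = np.zeros_like(q)
        np.add.at(sums, (s, a), target)
        # Average the targets per cell; unvisited cells keep their old value
        q = np.where(counts > 0, sums / np.maximum(counts, 1.0), q)
    # Temporal-difference error of the final table on the logged data
    td = r + discount * q[s_next].max(axis=1) - q[s, a]
    return q, q.argmax(axis=1), float(np.mean(td ** 2))

rng = np.random.default_rng(0)
n = 400
s = rng.integers(0, 3, size=n)
a = rng.integers(0, 2, size=n)
r = (a == s % 2).astype(float) + rng.normal(0, 0.1, size=n)
s_next = rng.integers(0, 3, size=n)

q_table, policy, bellman_error = tabular_q_iteration(s, a, r, s_next, 3, 2)
```

Here `q_table` has shape `(n_states, n_actions)` and `policy` holds one greedy action per state, mirroring the attribute shapes listed above.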

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(0)
>>> n = 400
>>> s = rng.integers(0, 3, size=n)
>>> a = rng.integers(0, 2, size=n)
>>> r = (a == s % 2).astype(float) + rng.normal(0, 0.1, size=n)
>>> s_next = rng.integers(0, 3, size=n)
>>> df = pd.DataFrame({'s': s, 'a': a, 'r': r, 's_next': s_next})
>>> res = sp.causal_dqn(df, state='s', action='a', reward='r',
...                     next_state='s_next', gamma_bound=0.1, n_iter=50)
>>> isinstance(res, sp.CausalDQNResult)
True
>>> res.q_table.shape
(3, 2)
>>> res.gamma_bound
0.1

BanditBenchmarkResult `dataclass` ¶

Bases: ResultProtocolMixin

Output from a causal-RL benchmark run.

Returned by `causal_rl_benchmark`. Holds the generated transition dataset, the optimal policy/value of the underlying causal model, and the name of the recommended off-policy evaluator.

Examples:

>>> import statspai as sp
>>> res = sp.causal_rl_benchmark(
...     name='confounded_bandit', n_episodes=200, seed=0)
>>> type(res).__name__
'BanditBenchmarkResult'
>>> res.suggested_evaluator
'sp.causal_dqn'
>>> res.transitions['action'].isin([0, 1]).all().item()
True

OfflineSafeResult `dataclass` ¶

Bases: ResultProtocolMixin

Output of safe offline policy learning.

Returned by `offline_safe_policy`. Holds the per-state action table (policy), the policy's expected reward and cost, the cost threshold it was constrained against, and whether the realised cost stays under that threshold (feasible).
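
The following sketch shows one way a per-state policy, its expected reward and cost, and a feasibility flag can fit together. It is a one-step illustration under stated assumptions, not the algorithm used by `offline_safe_policy`: the helper `lagrangian_safe_policy`, the multiplier grid, and the omission of discounting are all hypothetical choices.

```python
import numpy as np

def lagrangian_safe_policy(s, a, r, c, n_states, n_actions,
                           cost_threshold=0.5, max_multiplier=10.0,
                           n_grid=101):
    """Hypothetical sketch: raise a Lagrange multiplier until the greedy
    policy on (reward - multiplier * cost) meets the cost threshold."""
    counts = np.zeros((n_states, n_actions))
    r_sum = np.zeros_like(counts)
    c_sum = np.zeros_like(counts)
    np.add.at(counts, (s, a), 1.0)
    np.add.at(r_sum, (s, a), r)
    np.add.at(c_sum, (s, a), c)
    visits = np.maximum(counts, 1.0)            # avoid division by zero
    r_bar, c_bar = r_sum / visits, c_sum / visits
    state_freq = np.bincount(s, minlength=n_states) / len(s)
    states = np.arange(n_states)
    for lam in np.linspace(0.0, max_multiplier, n_grid):
        policy = (r_bar - lam * c_bar).argmax(axis=1)
        expected_reward = float(state_freq @ r_bar[states, policy])
        expected_cost = float(state_freq @ c_bar[states, policy])
        if expected_cost <= cost_threshold:
            break                               # first multiplier that is safe
    feasible = expected_cost <= cost_threshold
    return policy, expected_reward, expected_cost, feasible

rng = np.random.default_rng(0)
n = 600
s = rng.integers(0, 3, n)
a = rng.integers(0, 2, n)
r = 0.2 + 0.8 * a + rng.normal(0, 0.1, n)       # action 1 pays more...
c = 0.1 + 0.7 * a + rng.normal(0, 0.05, n)      # ...but also costs more

policy, exp_reward, exp_cost, feasible = lagrangian_safe_policy(
    s, a, r, c, n_states=3, n_actions=2, cost_threshold=0.5)
```

The flag is computed after the search rather than assumed, so a threshold that no policy on the grid can satisfy would yield `feasible == False`.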

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(0)
>>> n = 600
>>> df = pd.DataFrame({
...     "state": rng.integers(0, 3, n),
...     "action": rng.integers(0, 2, n),
...     "reward": rng.integers(0, 2, n) * 1.0 + rng.normal(0, 0.5, n),
...     "cost": rng.integers(0, 2, n) * 0.3 + rng.normal(0, 0.1, n),
... })
>>> res = sp.offline_safe_policy(df, state="state", action="action",
...                              reward="reward", cost="cost",
...                              cost_threshold=0.5)
>>> isinstance(res, sp.OfflineSafeResult)
True
>>> bool(res.feasible)
True
>>> bool(res.expected_cost <= res.cost_threshold)
True

StructuralMDPResult `dataclass` ¶

counterfactual_rollout ¶

counterfactual_rollout(initial_state: ndarray, policy: Callable[[ndarray], ndarray], horizon: int = 10) -> Dict[str, ndarray]

Roll out the fitted SVAR under a new policy to get a counterfactual (state, action, reward) trajectory.

causal_rl_benchmark ¶

causal_rl_benchmark(name: str = 'confounded_bandit', n_episodes: int = 1000, confounding_strength: float = 0.5, seed: int = 0) -> BanditBenchmarkResult

Generate a synthetic causal-RL benchmark dataset.

Parameters:

Name	Type	Description	Default
`name`	`{'confounded_bandit', 'confounded_dosage', 'confounded_pricing', 'confounded_targeting', 'confounded_routing'}`		`'confounded_bandit'`
`n_episodes`	`int`		`1000`
`confounding_strength`	`float in [0, 1]`	Magnitude of unmeasured confounding U → (action, reward).	`0.5`
`seed`	`int`		`0`
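
To illustrate what confounding of the form U → (action, reward) does to logged data, here is a minimal sketch of a two-arm generator in which a latent binary U shifts both the logging policy and the reward. The function `simulate_confounded_bandit` and its functional form are assumptions for illustration; the generators behind `causal_rl_benchmark` may differ.

```python
import numpy as np

def simulate_confounded_bandit(n_episodes=1000, confounding_strength=0.5,
                               true_effect=1.0, seed=0):
    """Hypothetical generator: latent U raises both P(action=1) and reward."""
    rng = np.random.default_rng(seed)
    u = rng.integers(0, 2, size=n_episodes)     # unobserved confounder
    # The logging policy leans towards arm 1 when U = 1
    p_action = 0.5 + 0.5 * confounding_strength * (2 * u - 1)
    action = (rng.random(n_episodes) < p_action).astype(int)
    reward = (true_effect * action + confounding_strength * u
              + rng.normal(0, 0.1, size=n_episodes))
    return action, reward

def naive_effect(action, reward):
    """Difference in mean reward between the two arms, ignoring U."""
    return reward[action == 1].mean() - reward[action == 0].mean()

action, reward = simulate_confounded_bandit(5000, confounding_strength=0.5)
naive_confounded = naive_effect(action, reward)

action0, reward0 = simulate_confounded_bandit(5000, confounding_strength=0.0)
naive_unconfounded = naive_effect(action0, reward0)
```

In this particular generator the naive contrast is biased upward by roughly `confounding_strength ** 2` in expectation, which is why it only recovers the true effect when the strength is zero.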

Returns:

Type	Description
`BanditBenchmarkResult`

References

[1] Cunha, Liu, French & Mian (2025), arXiv 2512.18135. Unifying Causal RL.

Examples:

Generate a confounded two-arm bandit and inspect the transition table:

>>> import statspai as sp
>>> res = sp.causal_rl_benchmark(
...     name='confounded_bandit', n_episodes=200, seed=0)
>>> res.benchmark
'confounded_bandit'
>>> len(res.transitions)
200
>>> list(res.transitions.columns)
['state', 'action', 'reward', 'next_state']
>>> res.optimal_value
1.5
>>> res.optimal_policy.tolist()
[1]
>>> print(res.summary())

offline_safe_policy ¶

offline_safe_policy(data: DataFrame, state: str, action: str, reward: str, cost: str, cost_threshold: float = 0.5, discount: float = 0.95, n_iter: int = 100, seed: int = 0) -> OfflineSafeResult

Safe offline policy learning with a cost constraint.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	Transition data (s, a, r, cost).	required
`state`	`str`	Name of the state column; must be discrete.	required
`action`	`str`	Name of the action column; must be discrete.	required
`reward`	`str`	Name of the reward column.	required
`cost`	`str`	Name of the cost column.	required
`cost_threshold`	`float`	Max allowed expected cost per step.	`0.5`
`discount`	`float`		`0.95`
`n_iter`	`int`		`100`
`seed`	`int`		`0`

Returns:

Type	Description
`OfflineSafeResult`

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(0)
>>> n = 600
>>> df = pd.DataFrame({
...     "state": rng.integers(0, 3, n),
...     "action": rng.integers(0, 2, n),
...     "reward": rng.integers(0, 2, n) * 1.0 + rng.normal(0, 0.5, n),
...     "cost": rng.integers(0, 2, n) * 0.3 + rng.normal(0, 0.1, n),
... })
>>> res = sp.offline_safe_policy(df, state="state", action="action",
...                              reward="reward", cost="cost",
...                              cost_threshold=0.5)
>>> int(res.policy.shape[0])      # one action per visited state
3
>>> bool(res.feasible)
True

causal_bandit ¶

causal_bandit(arms: Sequence[str], *, reward_fn: Callable[[str, Optional[dict]], float], context: Optional[dict] = None, n_samples: int = 500, rng_seed: int = 0) -> CausalBanditResult

Bareinboim-Forney-Pearl contextual causal bandit.

Given a callable `reward_fn(arm, context)` that samples the potential outcome of an arm under the current context, the function estimates E[Y(a) | context] for each arm by Monte Carlo and returns the argmax.
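
The Monte Carlo step described above can be sketched in a few lines. The helper `monte_carlo_best_arm` is hypothetical and only mirrors the description (average `n_samples` draws per arm, then take the argmax); it makes no claim about how `causal_bandit` handles seeding or the context internally.

```python
import numpy as np

def monte_carlo_best_arm(arms, reward_fn, context=None, n_samples=500):
    """Hypothetical helper: average n_samples draws per arm, return argmax."""
    expected = np.array([
        np.mean([reward_fn(arm, context) for _ in range(n_samples)])
        for arm in arms
    ])
    return int(expected.argmax()), expected

rng = np.random.default_rng(0)
true_means = {"A": 1.0, "B": 0.3, "C": 0.6}

def reward_fn(arm, context):
    # Noisy draw of the potential outcome Y(arm); context is unused here
    return true_means[arm] + rng.normal(0, 0.5)

arms = ["A", "B", "C"]
best_index, expected_rewards = monte_carlo_best_arm(
    arms, reward_fn, n_samples=300)
best_arm = arms[best_index]
```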

Parameters:

Name	Type	Description	Default
`arms`	`sequence of str`	Arm labels.	required
`reward_fn`	`callable`	Stochastic reward sampler. Must accept (arm, context) and return a scalar reward.	required
`context`	`dict`		`None`
`n_samples`	`int`	Monte Carlo draws per arm.	`500`
`rng_seed`	`int`		`0`

Returns:

Type	Description
`CausalBanditResult`

Examples:

>>> import statspai as sp
>>> import numpy as np
>>> rng = np.random.default_rng(0)
>>> true = {"A": 1.0, "B": 0.3, "C": 0.6}
>>> def reward_fn(arm, context):
...     return true[arm] + rng.normal(0, 0.5)
>>> res = sp.causal_bandit(["A", "B", "C"], reward_fn=reward_fn,
...                        n_samples=300, rng_seed=0)
>>> res.arm_labels[res.optimal_arm]
'A'
>>> len(res.expected_rewards)
3

counterfactual_policy_optimization ¶

counterfactual_policy_optimization(data: DataFrame, *, state: str, action: str, reward: str, target_policy: Callable[[float], float], noise_sd: float = 1.0) -> CFPolicyResult

Counterfactual policy evaluation under a linear-Gaussian SCM.

Assumes a one-step SCM

r = alpha * s + beta * a + eps,  eps ~ Normal(0, noise_sd²)

so that fixing s and changing a uniquely determines a new reward via noise inversion.
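
Noise inversion under this SCM follows the usual abduction, action, prediction steps. The sketch below is a NumPy illustration under assumptions: the coefficients are estimated by ordinary least squares, and the variable names (`alpha_hat`, `beta_hat`, `eps_hat`, `r_cf`) are illustrative rather than taken from the library.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
s = rng.normal(0, 1, n)
a = 0.5 * s + rng.normal(0, 1, n)
r = 1.0 * s + 2.0 * a + rng.normal(0, 1, n)

# Abduction: fit (alpha, beta) by least squares, then recover each row's noise
X = np.column_stack([s, a])
coef, *_ = np.linalg.lstsq(X, r, rcond=None)
alpha_hat, beta_hat = coef
eps_hat = r - alpha_hat * s - beta_hat * a

# Action: replace the logged action with the target policy's choice
def target_policy(si):
    return si + 1.0

a_new = target_policy(s)

# Prediction: push the same noise through the SCM under the new action
r_cf = alpha_hat * s + beta_hat * a_new + eps_hat
improvement = r_cf.mean() - r.mean()
```

Because the state and the recovered noise are held fixed, each counterfactual reward differs from the logged one by exactly `beta_hat * (a_new - a)`.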

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	One row per trajectory; must contain numeric `state`, `action`, and `reward` columns.	required
`state`	`str`		required
`action`	`str`		required
`reward`	`str`		required
`target_policy`	`callable(float) -> float`	Proposed policy `a_new = π(s)`.	required
`noise_sd`	`float`		`1.0`

Returns:

Type	Description
`CFPolicyResult`

Examples:

>>> import statspai as sp
>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(0)
>>> n = 300
>>> s = rng.normal(0, 1, n)
>>> a = 0.5 * s + rng.normal(0, 1, n)
>>> r = 1.0 * s + 2.0 * a + rng.normal(0, 1, n)
>>> df = pd.DataFrame({"s": s, "a": a, "r": r})
>>> res = sp.counterfactual_policy_optimization(
...     df, state="s", action="a", reward="r",
...     target_policy=lambda si: si + 1.0)
>>> res.n_trajectories
300
>>> bool(np.isfinite(res.improvement))
True

structural_mdp ¶

structural_mdp(data: DataFrame, *, state_cols: Sequence[str], action_cols: Sequence[str], reward: str, next_state_cols: Optional[Sequence[str]] = None, time: Optional[str] = None, trajectory: Optional[str] = None) -> StructuralMDPResult

Fit a linear SVAR for a Markov decision process.

Estimates:

s_{t+1} = A s_t + B a_t + noise
r_t     = coef_s @ s_t + coef_a @ a_t

from logged tuples. Supports per-trajectory data (trajectory column groups consecutive transitions) or single-stream data with a time column.
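
As a rough illustration of the two equations above, the sketch below fits `A`, `B`, `coef_s`, and `coef_a` by least squares on complete (s, a, r, s') rows and then rolls the fitted model forward under a fixed policy. It is an assumption-laden stand-in, not the library's estimator: the `rollout` helper, its noise-free dynamics, and the `"actions"` and `"rewards"` keys are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
S = rng.normal(0, 1, (n, 2))                    # states s_t
Act = rng.normal(0, 1, (n, 1))                  # actions a_t
A_true = np.array([[0.8, 0.0], [0.0, 0.5]])
B_true = np.array([[0.2], [0.3]])
S_next = S @ A_true.T + Act @ B_true.T + rng.normal(0, 0.1, (n, 2))
R = 1.0 * S[:, 0] + 0.5 * Act[:, 0] + rng.normal(0, 0.1, n)

# Stack regressors [s_t, a_t] and solve both equations by least squares
Z = np.hstack([S, Act])
W, *_ = np.linalg.lstsq(Z, S_next, rcond=None)  # shape (3, 2)
A_hat, B_hat = W[:2].T, W[2:].T                 # shapes (2, 2) and (2, 1)
w_r, *_ = np.linalg.lstsq(Z, R, rcond=None)
coef_s, coef_a = w_r[:2], w_r[2:]

def rollout(initial_state, policy, horizon=10):
    """Hypothetical noise-free rollout of the fitted model under `policy`."""
    states = [np.asarray(initial_state, dtype=float)]
    actions, rewards = [], []
    for _ in range(horizon):
        a_t = np.asarray(policy(states[-1]), dtype=float)
        actions.append(a_t)
        rewards.append(float(coef_s @ states[-1] + coef_a @ a_t))
        states.append(A_hat @ states[-1] + B_hat @ a_t)
    return {"states": np.array(states), "actions": np.array(actions),
            "rewards": np.array(rewards)}

roll = rollout([0.0, 0.0], lambda s: np.array([1.0]), horizon=5)
```

The rollout stores the initial state plus one state per step, so a horizon of 5 gives a `(6, 2)` state array.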

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`		required
`state_cols`	`sequence of str`		required
`action_cols`	`sequence of str`		required
`reward`	`str`		required
`next_state_cols`	`sequence of str`	If present, each row is a complete (s, a, r, s') tuple. If omitted, the function derives `s_{t+1}` by shifting `s` within each trajectory.	`None`
`time`	`str`	Required if `next_state_cols` is None — used to order rows within each trajectory.	`None`
`trajectory`	`str`	Trajectory identifier for multi-episode data.	`None`

Returns:

Type	Description
`StructuralMDPResult`

Examples:

>>> import statspai as sp
>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(0)
>>> n = 200
>>> s1, s2 = rng.normal(0, 1, n), rng.normal(0, 1, n)
>>> a1 = rng.normal(0, 1, n)
>>> df = pd.DataFrame({
...     "s1": s1, "s2": s2, "a1": a1,
...     "ns1": 0.8 * s1 + 0.2 * a1 + rng.normal(0, 0.1, n),
...     "ns2": 0.5 * s2 + 0.3 * a1 + rng.normal(0, 0.1, n),
...     "r": 1.0 * s1 + 0.5 * a1 + rng.normal(0, 0.1, n)})
>>> res = sp.structural_mdp(
...     df, state_cols=["s1", "s2"], action_cols=["a1"],
...     reward="r", next_state_cols=["ns1", "ns2"])
>>> (res.state_dim, res.action_dim)
(2, 1)
>>> res.A.shape
(2, 2)
>>> roll = res.counterfactual_rollout(
...     initial_state=[0.0, 0.0], policy=lambda s: np.array([1.0]), horizon=5)
>>> roll["states"].shape
(6, 2)
