`statspai.regression`¶

regression ¶

Regression module initialization

OLSRegression ¶

Bases: BaseModel

OLS regression model with comprehensive functionality

fit ¶

fit(robust: str = 'nonrobust', cluster: Optional[str] = None, **kwargs: Any) -> EconometricResults

Fit the OLS model

Parameters:

Name	Type	Description	Default
`robust`	`str`	Type of standard errors	`'nonrobust'`
`cluster`	`str`	Variable name for clustering	`None`
`**kwargs`	`Any`	Additional options	`{}`

Returns:

Type	Description
`EconometricResults`	Fitted model results

predict ¶

predict(data: Optional[DataFrame] = None, what: str = 'mean', alpha: float = 0.05, return_df: bool = False) -> ndarray | DataFrame

Generate predictions from the fitted OLS model.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	New data at which to predict. If `None`, returns the in-sample fitted values.	`None`
`what`	`(mean, confidence, prediction)`	`"mean"`: point predictions only (default). `"confidence"`: point prediction plus a `(1-alpha)` confidence interval for the conditional mean `E[y | x]`. `"prediction"`: point prediction plus a `(1-alpha)` prediction interval for a new observation (wider than the confidence interval because the residual variance `sigma^2` is added to the variance of the fitted mean).	`"mean"`
`alpha`	`float`	Significance level for the interval.	`0.05`
`return_df`	`bool`	Return a DataFrame with columns `["yhat", "lower", "upper"]`. Ignored (forces True) when `what != "mean"`.	`False`

Returns:

Type	Description
`ndarray or DataFrame`	Point predictions, optionally with interval columns.
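The difference between the `"confidence"` and `"prediction"` options can be sketched for a one-regressor OLS fit. This is a minimal illustration in plain Python, not the library's implementation; `interval_standard_errors` is a hypothetical helper, and it returns standard errors only (multiplying by a t critical value would give the interval half-widths).

```python
import math

def interval_standard_errors(x, y, x0):
    """Simple OLS y = a + b*x; return (yhat, se_mean, se_pred, sigma2) at x0."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sigma2 = sum(e * e for e in resid) / (n - 2)            # residual variance
    var_mean = sigma2 * (1.0 / n + (x0 - xbar) ** 2 / sxx)  # variance of the fitted mean
    var_pred = var_mean + sigma2                            # a new observation adds sigma^2
    return a + b * x0, math.sqrt(var_mean), math.sqrt(var_pred), sigma2

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
yhat, se_mean, se_pred, sigma2 = interval_standard_errors(x, y, x0=3.0)
```

The prediction standard error always exceeds the standard error of the mean, because `se_pred**2 == se_mean**2 + sigma2`.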

OLSEstimator ¶

Bases: BaseEstimator

Ordinary Least Squares estimator with robust standard errors

estimate ¶

estimate(y: ndarray, X: ndarray, robust: str = 'nonrobust', cluster: Optional[Series] = None, **kwargs: Any) -> Dict[str, Any]

Estimate OLS parameters

Parameters:

Name	Type	Description	Default
`y`	`ndarray`	Dependent variable	required
`X`	`ndarray`	Independent variables (including constant if desired)	required
`robust`	`str`	Type of standard errors ('nonrobust', 'hc0', 'hc1', 'hc2', 'hc3', 'hac')	`'nonrobust'`
`cluster`	`Series`	Cluster variable for clustered standard errors	`None`
`**kwargs`	`Any`	Additional options	`{}`

Returns:

Type	Description
`Dict[str, Any]`	Estimation results
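As a rough guide to what the `robust` options change, the sketch below compares the classical, HC0, and HC1 standard errors of the slope in a one-regressor model. It is plain Python written for illustration under the textbook sandwich formulas; `slope_standard_errors` is a hypothetical helper, not part of `statspai`.

```python
import math

def slope_standard_errors(x, y):
    """Classical, HC0 and HC1 standard errors for the slope of y = a + b*x."""
    n, k = len(x), 2                       # k counts intercept + slope
    xbar = sum(x) / n
    ybar = sum(y) / n
    dx = [xi - xbar for xi in x]
    sxx = sum(d * d for d in dx)
    b = sum(d * (yi - ybar) for d, yi in zip(dx, y)) / sxx
    a = ybar - b * xbar
    e = [yi - a - b * xi for xi, yi in zip(x, y)]
    var_classical = (sum(ei * ei for ei in e) / (n - k)) / sxx
    var_hc0 = sum((d * ei) ** 2 for d, ei in zip(dx, e)) / sxx ** 2  # sandwich
    var_hc1 = var_hc0 * n / (n - k)        # degrees-of-freedom correction
    return {
        "nonrobust": math.sqrt(var_classical),
        "hc0": math.sqrt(var_hc0),
        "hc1": math.sqrt(var_hc1),
    }

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.4, 3.6, 5.9, 5.1]
se = slope_standard_errors(x, y)
```

HC1 differs from HC0 only by the factor `sqrt(n / (n - k))`, so the two converge as the sample grows.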

IVRegression ¶

Bases: BaseModel

Instrumental Variables regression model.

Supports multiple estimation methods via method parameter: '2sls', 'liml', 'fuller', 'gmm', 'jive'.

Parameters:

Name	Type	Description	Default
`formula`	`str`	Formula with IV syntax: `"y ~ (endog ~ z1 + z2) + exog1 + exog2"`	`None`
`data`	`DataFrame`		`None`
`method`	`str`	Estimation method: one of `'2sls'`, `'liml'`, `'fuller'`, `'gmm'`, `'jive'`.	`'2sls'`
`fuller_alpha`	`float`	Fuller constant (only used when method='fuller'). `alpha=1` gives the bias-corrected Fuller estimator; `alpha=4` minimises MSE under normal errors.	`1.0`
`y`	`array-like`	Alternative to formula interface.	`None`
`X_exog`	`array-like`	Alternative to formula interface.	`None`
`X_endog`	`array-like`	Alternative to formula interface.	`None`
`Z`	`array-like`	Alternative to formula interface.	`None`
`var_names`	`array-like`	Alternative to formula interface.	`None`

References

Angrist, J. D., Imbens, G. W. and Rubin, D. B. (1996). Identification of Causal Effects Using Instrumental Variables. Journal of the American Statistical Association. doi:10.1080/01621459.1996.10476902 [@angrist1996identification]

Angrist, J. D. and Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press. [@angrist2009mostly]

Examples:

>>> import statspai as sp
>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(2)
>>> z = rng.normal(size=300)
>>> u = rng.normal(size=300)
>>> x = 0.8 * z + u + rng.normal(size=300)
>>> y = 1.0 + 2.0 * x + u + rng.normal(size=300)
>>> df = pd.DataFrame({"y": y, "x": x, "z": z})
>>> model = sp.IVRegression("y ~ (x ~ z)", data=df, method="2sls")
>>> res = model.fit()
>>> bool(1.5 < float(res.params["x"]) < 2.5)
True

first_stage `property` ¶

first_stage: List[Dict[str, float]]

First-stage diagnostics for each endogenous variable.

sargan_test `property` ¶

sargan_test: Optional[Dict[str, float]]

Sargan/Hansen J overidentification test results.

hausman_test `property` ¶

hausman_test: Dict[str, float]

Durbin-Wu-Hausman endogeneity test results.

fit ¶

fit(robust: Any = 'nonrobust', cluster: Optional[str] = None, **kwargs: Any) -> EconometricResults

Fit the IV model.

Parameters:

Name	Type	Description	Default
`robust`	`str or bool`	Standard-error type. Accepts 'nonrobust' and 'hc0'–'hc3' (case-insensitive), plus the aliases `True` / `'robust'` (≡ HC1, matching Stata) and `'white'` (≡ HC0). Classical and robust SEs match `ivregress 2sls, small` / `..., robust small` (the finite-sample t convention).	`'nonrobust'`
`cluster`	`str`	Variable name for clustering.	`None`

Returns:

Type	Description
`EconometricResults`

predict ¶

predict(data: Optional[DataFrame] = None) -> ndarray

Generate predictions from the fitted IV model.

For a structural-form estimator, the natural forecast of y given new data is X_exog β_exog + X_endog β_endog — i.e. we plug observed values of the endogenous variables through the structural equation. Instruments are not used at prediction time.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	New data at which to predict. Must contain all exogenous and endogenous variables referenced by the model's formula. If `None`, returns in-sample fitted values.	`None`
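The structural-form prediction described above can be illustrated with a tiny just-identified example (one endogenous regressor, one instrument), where 2SLS reduces to the ratio `cov(z, y) / cov(z, x)`. The data are constructed so that the instrument is exactly uncorrelated with the error; all names are hypothetical and the code is a sketch, not the library's estimator.

```python
def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    abar, bbar = mean(a), mean(b)
    return sum((ai - abar) * (bi - bbar) for ai, bi in zip(a, b)) / len(a)

z = [-1.0, -1.0, 1.0, 1.0]           # instrument
u = [-1.0, 1.0, -1.0, 1.0]           # structural error, orthogonal to z
x = [zi + ui for zi, ui in zip(z, u)]               # endogenous: correlated with u
y = [1.0 + 2.0 * xi + ui for xi, ui in zip(x, u)]   # true slope is 2

beta_ols = cov(x, y) / cov(x, x)     # biased by the x-u correlation
beta_iv = cov(z, y) / cov(z, x)      # just-identified IV (equals 2SLS here)
alpha_iv = mean(y) - beta_iv * mean(x)

def predict_structural(x_new):
    """Plug observed endogenous values into the structural equation."""
    return [alpha_iv + beta_iv * xi for xi in x_new]

yhat = predict_structural([0.0, 1.0])
```

Note that `predict_structural` takes only `x`: the instrument `z` is needed to estimate the coefficients but plays no role at prediction time.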

IVEstimator ¶

Bases: BaseEstimator

Two-Stage Least Squares (2SLS) estimator.

Legacy class. Prefer using the iv() function directly.

GLMRegression ¶

Bases: BaseModel

Generalized Linear Model with IRLS estimation.

Parameters:

Name	Type	Description	Default
`formula`	`str`	Model formula (e.g. `"y ~ x1 + x2"`).	`None`
`data`	`DataFrame`	Data frame containing the variables.	`None`
`y`	`ndarray`	Response array (alternative to formula).	`None`
`X`	`ndarray`	Design matrix (alternative to formula).	`None`
`var_names`	`list of str`	Variable names when using `y`/`X` directly.	`None`
`family`	`str`	Distribution family.	`'gaussian'`
`link`	`str or None`	Link function (`None` selects canonical link).	`None`

Examples:

>>> import statspai as sp
>>> import numpy as np
>>> import pandas as pd
>>> rng = np.random.default_rng(0)
>>> n = 300
>>> x1 = rng.normal(size=n)
>>> x2 = rng.normal(size=n)
>>> eta = -0.5 + 1.2 * x1 - 0.8 * x2
>>> y = rng.binomial(1, 1 / (1 + np.exp(-eta)))
>>> df = pd.DataFrame({"y": y, "x1": x1, "x2": x2})
>>> model = sp.GLMRegression(formula="y ~ x1 + x2", data=df,
...                          family="binomial")
>>> results = model.fit()
>>> len(results.params)   # intercept + x1 + x2
3
>>> print(results.summary())

fit ¶

fit(robust: str = 'nonrobust', cluster: Optional[str] = None, weights: Optional[str] = None, offset: Optional[str] = None, exposure: Optional[str] = None, maxiter: int = 100, tol: float = 1e-08, alpha: float = 0.05, **kwargs: Any) -> EconometricResults

Fit the GLM.

Parameters:

Name	Type	Description	Default
`robust`	`str`	Standard-error type.	`'nonrobust'`
`cluster`	`str`	Cluster variable name.	`None`
`weights`	`str`	Weight variable name.	`None`
`offset`	`str`	Offset variable name.	`None`
`exposure`	`str`	Exposure variable name (log added as offset).	`None`
`maxiter`	`int`	Maximum IRLS iterations.	`100`
`tol`	`float`	Convergence tolerance.	`1e-08`
`alpha`	`float`	Significance level for confidence intervals.	`0.05`

Returns:

Type	Description
`EconometricResults`

predict ¶

predict(data: Optional[DataFrame] = None, type: str = 'response', offset: Optional[ndarray] = None) -> ndarray

Generate predictions from the fitted model.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`	New data for prediction. Uses training data if `None`.	`None`
`type`	`str`	`'response'` (mean), `'link'` (linear predictor), or `'variance'` (variance function evaluated at predicted mu).	`'response'`
`offset`	`ndarray`	Offset for new data.	`None`

Returns:

Type	Description
`ndarray`
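The three prediction scales can be illustrated for a binomial family with a logit link. The helper below is hypothetical and only mirrors the meaning of the `type` values listed above; it does not call the library.

```python
import math

def predict_binomial_logit(eta, kind="response"):
    """Map a linear predictor to the requested scale (illustrative helper)."""
    if kind == "link":
        return eta                       # linear predictor x'beta (+ offset)
    mu = 1.0 / (1.0 + math.exp(-eta))    # inverse logit link
    if kind == "response":
        return mu                        # fitted mean
    if kind == "variance":
        return mu * (1.0 - mu)           # binomial variance function V(mu)
    raise ValueError(f"unknown kind: {kind!r}")

scales = {k: predict_binomial_logit(0.0, k)
          for k in ("link", "response", "variance")}
```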

GLMEstimator ¶

Bases: BaseEstimator

Generalized Linear Model estimator using IRLS

Implements Iteratively Reweighted Least Squares for maximum likelihood estimation of GLM parameters. This is the low-level engine that operates on numpy arrays and Family / LinkFunction instances; most users should call `statspai.glm` or `statspai.GLMRegression`, which accept formulas and column names.
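To make the IRLS idea concrete, here is a minimal stdlib-only sketch for a logistic model with an intercept and one regressor. It is an illustration of the generic algorithm (working response, weights, weighted least squares, deviance-based stopping), not the code inside `GLMEstimator`; the function name and data are assumptions.

```python
import math

def irls_logit(x, y, maxiter=100, tol=1e-8):
    """IRLS for logit(mu) = b0 + b1*x; returns (b0, b1, converged)."""
    b0, b1 = 0.0, 0.0
    deviance_old = float("inf")
    for _ in range(maxiter):
        eta = [b0 + b1 * xi for xi in x]
        mu = [1.0 / (1.0 + math.exp(-e)) for e in eta]
        w = [m * (1.0 - m) for m in mu]                       # IRLS weights
        zw = [e + (yi - m) / wi for e, yi, m, wi in zip(eta, y, mu, w)]  # working response
        # Weighted least squares of zw on (1, x): solve the 2x2 normal equations.
        s_w = sum(w)
        s_wx = sum(wi * xi for wi, xi in zip(w, x))
        s_wxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        s_wz = sum(wi * zi for wi, zi in zip(w, zw))
        s_wxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, zw))
        det = s_w * s_wxx - s_wx ** 2
        b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        b1 = (s_w * s_wxz - s_wx * s_wz) / det
        mu = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        deviance = -2.0 * sum(yi * math.log(m) + (1 - yi) * math.log(1.0 - m)
                              for yi, m in zip(y, mu))
        if abs(deviance_old - deviance) < tol:                # converged on deviance
            return b0, b1, True
        deviance_old = deviance
    return b0, b1, False

x = [-2.0, -1.0, 0.0, 1.0, 2.0, -2.0, -1.0, 0.0, 1.0, 2.0]
y = [0, 0, 0, 1, 1, 0, 1, 1, 1, 0]
b0, b1, converged = irls_logit(x, y)
fitted = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
score0 = sum(yi - m for yi, m in zip(y, fitted))
score1 = sum(xi * (yi - m) for xi, yi, m in zip(x, y, fitted))
```

At convergence the score equations hold, so `score0` and `score1` are numerically zero.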

Examples:

>>> import statspai as sp
>>> import numpy as np
>>> from statspai.regression.glm import GLMEstimator, Binomial, LogitLink
>>> rng = np.random.default_rng(0)
>>> n = 300
>>> X = np.column_stack([np.ones(n), rng.normal(size=n),
...                      rng.normal(size=n)])
>>> eta = X @ np.array([-0.5, 1.2, -0.8])
>>> y = rng.binomial(1, 1 / (1 + np.exp(-eta)))
>>> est = GLMEstimator()
>>> res = est.estimate(y, X, family=Binomial(), link=LogitLink())
>>> np.asarray(res["params"]).shape
(3,)
>>> bool(res["converged"])
True

estimate ¶

estimate(y: ndarray, X: ndarray, family: Optional[Family] = None, link: Optional[LinkFunction] = None, robust: str = 'nonrobust', cluster: Optional[Series] = None, weights: Optional[ndarray] = None, offset: Optional[ndarray] = None, maxiter: int = 100, tol: float = 1e-08, alpha_nb: Optional[float] = None, **kwargs: Any) -> Dict[str, Any]

Estimate GLM parameters via IRLS.

Parameters:

Name	Type	Description	Default
`y`	`ndarray`	Response variable (n,).	required
`X`	`ndarray`	Design matrix (n, k) including intercept if desired.	required
`family`	`Family`	Distribution family instance.	`None`
`link`	`LinkFunction`	Link function instance.	`None`
`robust`	`str`	Standard-error type ('nonrobust', 'hc0', 'hc1', 'hc2', 'hc3', 'hac').	`'nonrobust'`
`cluster`	`Series`	Cluster variable.	`None`
`weights`	`ndarray`	Prior / frequency weights.	`None`
`offset`	`ndarray`	Known offset added to the linear predictor.	`None`
`maxiter`	`int`	Maximum IRLS iterations.	`100`
`tol`	`float`	Convergence tolerance on deviance.	`1e-08`
`alpha_nb`	`float`	If family is NB, initial alpha for joint estimation.	`None`

Returns:

Type	Description
`Dict[str, Any]`

regress ¶

regress(formula: str, data: DataFrame, robust: str = 'nonrobust', cluster: Optional[str] = None, weights: Optional[Any] = None, *, vce: Optional[str] = None, conley_lat: Optional[str] = None, conley_lon: Optional[str] = None, conley_cutoff: Optional[float] = None, **kwargs: Any) -> EconometricResults

Convenient function for OLS regression

Parameters:

Name	Type	Description	Default
`formula`	`str`	Regression formula	required
`data`	`DataFrame`	Data containing variables	required
`robust`	`str`	Type of standard errors ('nonrobust', 'hc0'–'hc3', 'hac'; case-insensitive)	`'nonrobust'`
`cluster`	`str`	Variable name for clustering	`None`
`weights`	`str or array-like`	Analytic regression weights (Stata `aweight` semantics). Pass a column name or an array of length `nobs`. Fits WLS: point estimates, classical / robust / clustered SEs and R² match `regress y x [aw=w]`. Weights must be strictly positive and finite; invalid weights raise `ValueError` rather than being silently ignored.	`None`
`**kwargs`	`Any`	Additional options	`{}`

Returns:

Type	Description
`EconometricResults`	Fitted model results
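The weighting behaviour described for `weights` can be sketched with a hand-rolled weighted least squares fit for a single regressor. This is an illustrative stdlib-only version with a hypothetical `wls_simple` helper; it shows two properties stated above: invalid weights are rejected, and analytic weights only matter up to scale for the point estimates.

```python
import math

def wls_simple(x, y, w):
    """Weighted least squares for y = a + b*x with analytic weights w."""
    if any((not math.isfinite(wi)) or wi <= 0 for wi in w):
        raise ValueError("weights must be strictly positive and finite")
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted means
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    b = (sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
         / sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x)))
    return ybar - b * xbar, b

x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 2.5, 2.9, 4.2]
w = [1.0, 2.0, 1.0, 4.0]

a_w, b_w = wls_simple(x, y, w)
a_scaled, b_scaled = wls_simple(x, y, [10.0 * wi for wi in w])  # same fit
a_ols, b_ols = wls_simple(x, y, [1.0] * len(x))                 # equal weights = OLS
```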

Examples:

>>> import statspai as sp
>>> df = sp.cps_wage()
>>> results = sp.regress("log_wage ~ education + experience", data=df)
>>> bool(results.params["education"] > 0)
True

>>> results = sp.regress("log_wage ~ education + experience", data=df,
...                      robust='hc1', cluster='union')
>>> "education" in results.params.index
True

ivreg ¶

ivreg(formula: str, data: DataFrame, robust: str = 'nonrobust', cluster: Optional[str] = None, *, vce: Optional[str] = None, wild_reps: int = 999, wild_weight_type: str = 'rademacher', seed: Optional[int] = None, conley_lat: Optional[str] = None, conley_lon: Optional[str] = None, conley_cutoff: Optional[float] = None, **kwargs: Any) -> EconometricResults

Instrumental variables regression (2SLS).

Deprecated: use `sp.iv(formula, data, method='2sls')` instead. `ivreg` is kept for backward compatibility.

Parameters:

Name	Type	Description	Default
`formula`	`str`	IV formula: `"y ~ (endog ~ z1 + z2) + exog1 + exog2"`	required
`data`	`DataFrame`		required
`robust`	`str`		`'nonrobust'`
`cluster`	`str`		`None`
`vce`	`str`	Set `vce="wild"` (with `cluster=`) to run the WRE wild cluster bootstrap (Davidson-MacKinnon 2010) on the endogenous coefficient — pinned to Stata `boottest` after `ivreg2`. Otherwise `vce` is the canonical alias for `robust`.	`None`
`wild_reps`	`int`	Number of bootstrap replications for the `vce="wild"` path.	`999`
`wild_weight_type`	`str`	Bootstrap weight distribution for the `vce="wild"` path.	`'rademacher'`
`seed`	`int`	Random seed for the `vce="wild"` path.	`None`

Returns:

Type	Description
`EconometricResults`

References

Angrist, J. D., Imbens, G. W. and Rubin, D. B. (1996). Identification of Causal Effects Using Instrumental Variables. Journal of the American Statistical Association. doi:10.1080/01621459.1996.10476902 [@angrist1996identification]

Examples:

>>> import numpy as np, pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(42)
>>> n = 500
>>> z = rng.normal(size=n)
>>> u = rng.normal(size=n)
>>> x = 0.8 * z + u + rng.normal(size=n)        # endogenous regressor
>>> y = 1.5 * x + 2.0 * u + rng.normal(size=n)
>>> df = pd.DataFrame({'y': y, 'x': x, 'z': z})
>>> result = sp.ivreg("y ~ (x ~ z)", data=df)
>>> bool(abs(result.params['x'] - 1.5) < 0.2)  # 2SLS recovers the true effect
True

>>> # Preferred modern entry point:
>>> result = sp.iv("y ~ (x ~ z)", data=df, method='2sls')

qreg ¶

qreg(data: DataFrame, formula: Optional[str] = None, y: Optional[str] = None, x: Optional[List[str]] = None, quantile: float = 0.5, alpha: float = 0.05) -> CausalResult

Quantile regression at a single quantile.

Equivalent to Stata's `qreg y x, quantile(0.5)`.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`		required
`formula`	`str`	Formula like `"y ~ x1 + x2"` (patsy-style).	`None`
`y`	`str`	Outcome variable (alternative to formula).	`None`
`x`	`list of str`	Regressors (alternative to formula).	`None`
`quantile`	`float`	Quantile to estimate (0 < q < 1). 0.5 = median.	`0.5`
`alpha`	`float`		`0.05`

Returns:

Type	Description
`CausalResult`	Coefficients at the specified quantile.

Examples:

>>> import statspai as sp
>>> df = sp.cps_wage()
>>> # Median (0.5) regression of log wage on education and experience
>>> result = sp.qreg(df, y='log_wage', x=['education', 'experience'],
...                  quantile=0.5)
>>> # 90th percentile
>>> result = sp.qreg(df, y='log_wage', x=['education', 'experience'],
...                  quantile=0.9)
>>> bool(0 < result.estimate < 1)
True

Notes

Quantile regression minimizes:

.. math:: \min_\beta \sum_i \rho_\tau(Y_i - X_i'\beta)

where ρ_τ(u) = u(τ - 1(u < 0)) is the check function.

Standard errors are computed using the Powell (1991) sandwich estimator with a kernel density estimate of f(0|X).

See Koenker & Bassett (1978, Econometrica).
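The check function can be tried out directly. The sketch below, in plain Python with hypothetical helper names, minimizes the check loss over a constant (an intercept-only quantile regression) by searching the observed values, which recovers sample quantiles.

```python
def check_loss(u, tau):
    """rho_tau(u) = u * (tau - 1(u < 0))."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def total_loss(y, c, tau):
    return sum(check_loss(yi - c, tau) for yi in y)

y = [1.0, 2.0, 3.0, 4.0, 100.0]

# Intercept-only fit: the minimizer over the observed values is a sample quantile.
median_fit = min(y, key=lambda c: total_loss(y, c, 0.5))
q75_fit = min(y, key=lambda c: total_loss(y, c, 0.75))
```

The outlier at 100 does not move the median fit, which illustrates why quantile regression is robust to extreme outcomes.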

sqreg ¶

sqreg(data: DataFrame, y: str, x: List[str], quantiles: Optional[List[float]] = None, alpha: float = 0.05) -> DataFrame

Simultaneous quantile regression at multiple quantiles.

Equivalent to Stata's `sqreg y x, quantiles(10 25 50 75 90)`.

Parameters:

Name	Type	Description	Default
`data`	`DataFrame`		required
`y`	`str`		required
`x`	`list of str`		required
`quantiles`	`list of float`	Default: [0.1, 0.25, 0.5, 0.75, 0.9].	`None`
`alpha`	`float`		`0.05`

Returns:

Type	Description
`DataFrame`	Rows: variables. Columns: quantiles with coefficients and SEs.

Examples:

>>> import statspai as sp
>>> df = sp.cps_wage()
>>> table = sp.sqreg(df, y='log_wage', x=['education', 'experience'])
>>> bool('Q(0.5)' in table.columns)
True

logit ¶

logit(formula: Optional[str] = None, data: Optional[DataFrame] = None, y: Optional[str] = None, x: Optional[List[str]] = None, robust: str = 'nonrobust', cluster: Optional[str] = None, weights: Optional[str] = None, marginal_effects: Optional[str] = None, odds_ratio: bool = False, maxiter: int = 100, tol: float = 1e-08, alpha: float = 0.05, at_values: Optional[Dict[str, float]] = None) -> EconometricResults

Logit (logistic) regression via maximum likelihood.

Equivalent to Stata's `logit y x1 x2`, or `logistic` (with `odds_ratio=True`).

Parameters:

Name	Type	Description	Default
`formula`	`str`	Formula like `"y ~ x1 + x2"`.	`None`
`data`	`DataFrame`	Data containing the variables.	`None`
`y`	`str`	Dependent variable name (alternative to formula).	`None`
`x`	`list of str`	Regressor names (alternative to formula).	`None`
`robust`	`str`	`'nonrobust'` for MLE SE, `'hc1'` / `'robust'` for sandwich SE.	`'nonrobust'`
`cluster`	`str`	Column name for clustered standard errors.	`None`
`weights`	`str`	Column name for frequency/analytic weights.	`None`
`marginal_effects`	`str`	`'average'` (AME), `'mean'` (MEM), or `'at'` (MER).	`None`
`odds_ratio`	`bool`	Report odds ratios instead of log-odds coefficients.	`False`
`maxiter`	`int`	Maximum Newton-Raphson iterations.	`100`
`tol`	`float`	Convergence tolerance on log-likelihood change.	`1e-8`
`alpha`	`float`	Significance level for confidence intervals.	`0.05`
`at_values`	`dict`	Variable values for `marginal_effects='at'`.	`None`

Returns:

Type	Description
`EconometricResults`	Fitted model with `.summary()`, `.predict()`, diagnostics, etc.
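The three `marginal_effects` options and the `odds_ratio` transform can be illustrated for a single continuous regressor with known coefficients. This is a sketch using the standard logit formulas (`dP/dx = beta * p * (1 - p)`, odds ratio `exp(beta)`); `logit_effects` is a hypothetical helper, and the library's own output layout may differ.

```python
import math

def logistic(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def logit_effects(b0, b1, xs, at=None):
    """Odds ratio plus AME / MEM / MER for a one-regressor logit."""
    probs = [logistic(b0 + b1 * x) for x in xs]
    ame = sum(b1 * p * (1.0 - p) for p in probs) / len(xs)   # average over observations
    p_mean = logistic(b0 + b1 * (sum(xs) / len(xs)))
    out = {
        "odds_ratio": math.exp(b1),
        "average": ame,                                      # AME
        "mean": b1 * p_mean * (1.0 - p_mean),                # MEM: evaluated at x-bar
    }
    if at is not None:
        p_at = logistic(b0 + b1 * at)
        out["at"] = b1 * p_at * (1.0 - p_at)                 # MER: evaluated at a chosen x
    return out

effects = logit_effects(0.0, 1.0, [-1.0, 0.0, 1.0], at=1.0)
```

Here the effect at the mean is `0.25` (the steepest point of the logistic curve), while the average effect is smaller because observations away from zero sit on flatter parts of the curve.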

Examples:

>>> import statspai as sp
>>> df = sp.cps_wage()  # binary `union` outcome
>>> result = sp.logit("union ~ education + experience", data=df)
>>> print(result.summary())

>>> # With odds ratios and robust SE
>>> result = sp.logit("union ~ education + experience", data=df,
...                   robust='hc1', odds_ratio=True)

>>> # Marginal effects at the mean
>>> result = sp.logit("union ~ education + experience", data=df,
...                   marginal_effects='mean')
>>> me = result.model_info['marginal_effects']
>>> bool('dy/dx' in me.columns)
True

probit ¶

probit(formula: Optional[str] = None, data: Optional[DataFrame] = None, y: Optional[str] = None, x: Optional[List[str]] = None, robust: str = 'nonrobust', cluster: Optional[str] = None, weights: Optional[str] = None, marginal_effects: Optional[str] = None, maxiter: int = 100, tol: float = 1e-08, alpha: float = 0.05, at_values: Optional[Dict[str, float]] = None) -> EconometricResults

Probit regression via maximum likelihood.

Equivalent to Stata's `probit y x1 x2`.

Parameters:

Name	Type	Description	Default
`formula`	`str`	Formula like `"y ~ x1 + x2"`.	`None`
`data`	`DataFrame`	Data containing the variables.	`None`
`y`	`str`	Dependent variable name (alternative to formula).	`None`
`x`	`list of str`	Regressor names (alternative to formula).	`None`
`robust`	`str`	`'nonrobust'` for MLE SE, `'hc1'` / `'robust'` for sandwich SE.	`'nonrobust'`
`cluster`	`str`	Column name for clustered standard errors.	`None`
`weights`	`str`	Column name for frequency/analytic weights.	`None`
`marginal_effects`	`str`	`'average'` (AME), `'mean'` (MEM), or `'at'` (MER).	`None`
`maxiter`	`int`	Maximum Newton-Raphson iterations.	`100`
`tol`	`float`	Convergence tolerance on log-likelihood change.	`1e-8`
`alpha`	`float`	Significance level for confidence intervals.	`0.05`
`at_values`	`dict`	Variable values for `marginal_effects='at'`.	`None`

Returns:

Type	Description
`EconometricResults`	Fitted model with `.summary()`, `.predict()`, diagnostics, etc.

Examples:

>>> import statspai as sp
>>> df = sp.cps_wage()  # binary `union` outcome
>>> result = sp.probit("union ~ education + experience", data=df)
>>> print(result.summary())

>>> # Average marginal effects with robust SE
>>> result = sp.probit("union ~ education + experience", data=df,
...                    robust='hc1', marginal_effects='average')
>>> me = result.model_info['marginal_effects']
>>> bool('dy/dx' in me.columns)
True

cloglog ¶

cloglog(formula: Optional[str] = None, data: Optional[DataFrame] = None, y: Optional[str] = None, x: Optional[List[str]] = None, robust: str = 'nonrobust', cluster: Optional[str] = None, weights: Optional[str] = None, marginal_effects: Optional[str] = None, maxiter: int = 100, tol: float = 1e-08, alpha: float = 0.05, at_values: Optional[Dict[str, float]] = None) -> EconometricResults

Complementary log-log regression via maximum likelihood.

Appropriate when P(Y=1) is small (rare events) or when the latent distribution is asymmetric (extreme value type I).

Equivalent to Stata's `cloglog y x1 x2`.
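The asymmetry of the complementary log-log link is easy to see numerically. The sketch below compares the inverse cloglog link, `p = 1 - exp(-exp(eta))`, with the inverse logit link; the function names are illustrative only.

```python
import math

def inv_cloglog(eta):
    """P(Y=1) under the complementary log-log link."""
    return -math.expm1(-math.exp(eta))   # 1 - exp(-exp(eta)), computed stably

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

p_cloglog_0 = inv_cloglog(0.0)   # not 0.5: the curve is asymmetric around eta = 0
p_logit_0 = inv_logit(0.0)       # exactly 0.5
rare = inv_cloglog(-5.0)         # for rare events, p is close to exp(eta)
```

For small probabilities the cloglog model behaves like a log-linear model for `P(Y=1)`, which is one reason it is used for rare events.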

Parameters:

Name	Type	Description	Default
`formula`	`str`	Formula like `"y ~ x1 + x2"`.	`None`
`data`	`DataFrame`	Data containing the variables.	`None`
`y`	`str`	Dependent variable name (alternative to formula).	`None`
`x`	`list of str`	Regressor names (alternative to formula).	`None`
`robust`	`str`	`'nonrobust'` for MLE SE, `'hc1'` / `'robust'` for sandwich SE.	`'nonrobust'`
`cluster`	`str`	Column name for clustered standard errors.	`None`
`weights`	`str`	Column name for frequency/analytic weights.	`None`
`marginal_effects`	`str`	`'average'` (AME), `'mean'` (MEM), or `'at'` (MER).	`None`
`maxiter`	`int`	Maximum Newton-Raphson iterations.	`100`
`tol`	`float`	Convergence tolerance on log-likelihood change.	`1e-8`
`alpha`	`float`	Significance level for confidence intervals.	`0.05`
`at_values`	`dict`	Variable values for `marginal_effects='at'`.	`None`

Returns:

Type	Description
`EconometricResults`	Fitted model with `.summary()`, `.predict()`, diagnostics, etc.

Examples:

>>> import statspai as sp
>>> df = sp.cps_wage()  # binary `union` outcome
>>> result = sp.cloglog("union ~ education + experience", data=df)
>>> print(result.summary())
>>> bool(result.model_info['link'] == 'cloglog')
True

zip_model ¶

zip_model(formula: Optional[str] = None, data: Optional[DataFrame] = None, y: Optional[str] = None, x: Optional[List[str]] = None, inflate: Optional[List[str]] = None, robust: str = 'nonrobust', cluster: Optional[str] = None, maxiter: int = 200, tol: float = 1e-08, alpha: float = 0.05) -> EconometricResults

Zero-Inflated Poisson (ZIP) regression via MLE.

Two-part model:

- Inflate equation: logit model for P(structural zero) = Λ(z'γ)
- Count equation: Poisson model with mean μ = exp(x'β)

Equivalent to Stata's `zip y x, inflate(z)`.

Parameters:

Name	Type	Description	Default
`formula`	`str`	Patsy-style formula for the count equation, e.g. "y ~ x1 + x2".	`None`
`data`	`DataFrame`	Dataset.	`None`
`y`	`str`	Dependent variable name (alternative to formula).	`None`
`x`	`list of str`	Count-equation regressors (alternative to formula).	`None`
`inflate`	`list of str`	Inflation-equation regressors. Default: same as count regressors.	`None`
`robust`	`str`	"nonrobust", "HC0", "HC1", etc.	`"nonrobust"`
`cluster`	`str`	Cluster variable name for clustered standard errors.	`None`
`maxiter`	`int`	Maximum iterations for optimizer.	`200`
`tol`	`float`	Convergence tolerance.	`1e-8`
`alpha`	`float`	Significance level for confidence intervals.	`0.05`

Returns:

Type	Description
`EconometricResults`	Coefficients for both equations, Vuong test, diagnostics.

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(0)
>>> n = 300
>>> age = rng.normal(0, 1, n)
>>> chronic = rng.integers(0, 2, n)
>>> visits = rng.poisson(np.exp(0.5 + 0.3 * age))
>>> visits[rng.random(n) < 0.3] = 0  # excess structural zeros
>>> df = pd.DataFrame({'visits': visits, 'age': age, 'chronic': chronic})
>>> result = sp.zip_model(data=df, y='visits', x=['age'],
...                       inflate=['chronic'])
>>> print(result.summary())
>>> bool(result.model_info['model_type'] == 'zip')
True

Notes

Log-likelihood for ZIP:

.. math:: \ell_i = \begin{cases} \log[\pi_i + (1-\pi_i) e^{-\mu_i}], & y_i = 0 \\ \log(1-\pi_i) + y_i \log\mu_i - \mu_i - \log(y_i!), & y_i > 0 \end{cases}

where π_i = Λ(z_i'γ) and μ_i = exp(x_i'β).

See Lambert (1992, Technometrics).
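The per-observation ZIP log-likelihood above translates directly into code. The following is a minimal stdlib sketch with hypothetical function names; `pi` and `mu` are taken as given rather than computed from covariates.

```python
import math

def zip_logpmf(y, pi, mu):
    """Log-probability of count y under a zero-inflated Poisson."""
    if y == 0:
        # A zero is either structural (prob. pi) or a Poisson zero.
        return math.log(pi + (1.0 - pi) * math.exp(-mu))
    return math.log(1.0 - pi) + y * math.log(mu) - mu - math.lgamma(y + 1)

def zip_loglik(ys, pis, mus):
    return sum(zip_logpmf(y, p, m) for y, p, m in zip(ys, pis, mus))

# Probabilities sum to one (the tail beyond 100 is negligible for mu = 2).
total = sum(math.exp(zip_logpmf(y, 0.3, 2.0)) for y in range(100))
ll = zip_loglik([0, 0, 3], [0.3, 0.3, 0.3], [2.0, 2.0, 2.0])
```

Setting `pi = 0` recovers the ordinary Poisson log-probability.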

zinb ¶

zinb(formula: Optional[str] = None, data: Optional[DataFrame] = None, y: Optional[str] = None, x: Optional[List[str]] = None, inflate: Optional[List[str]] = None, robust: str = 'nonrobust', cluster: Optional[str] = None, maxiter: int = 200, tol: float = 1e-08, alpha: float = 0.05) -> EconometricResults

Zero-Inflated Negative Binomial (ZINB) regression via MLE.

Two-part model:

- Inflate equation: logit for P(structural zero) = Λ(z'γ)
- Count equation: NB2 with mean μ = exp(x'β), Var = μ + α·μ²

Equivalent to Stata's `zinb y x, inflate(z)`.

Parameters:

Name	Type	Description	Default
`formula`	`str`	Patsy-style formula for the count equation.	`None`
`data`	`DataFrame`	Dataset.	`None`
`y`	`str`	Dependent variable name.	`None`
`x`	`list of str`	Count-equation regressors.	`None`
`inflate`	`list of str`	Inflation-equation regressors. Default: same as count regressors.	`None`
`robust`	`str`	Standard error type.	`"nonrobust"`
`cluster`	`str`	Cluster variable name.	`None`
`maxiter`	`int`		`200`
`tol`	`float`		`1e-8`
`alpha`	`float`		`0.05`

Returns:

Type	Description
`EconometricResults`	Coefficients for count, inflate, and dispersion parameter.

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(0)
>>> n = 300
>>> age = rng.normal(0, 1, n)
>>> chronic = rng.integers(0, 2, n)
>>> visits = rng.poisson(np.exp(0.5 + 0.3 * age))
>>> visits[rng.random(n) < 0.3] = 0  # excess structural zeros
>>> df = pd.DataFrame({'visits': visits, 'age': age, 'chronic': chronic})
>>> result = sp.zinb(data=df, y='visits', x=['age'],
...                  inflate=['chronic'])
>>> print(result.summary())
>>> bool(result.model_info['model_type'] == 'zinb')
True

Notes

The NB2 parameterization uses dispersion parameter α so that Var(Y|μ) = μ + α·μ². When α → 0 the model collapses to ZIP.

See Cameron & Trivedi (2013, Ch. 4).
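Both statements in the note can be checked numerically for the count component alone: the NB2 variance is `mu + alpha * mu**2`, and the NB2 probabilities approach the Poisson ones as `alpha` shrinks. The sketch below uses the standard NB2 probability mass function with stdlib functions; the helper names are assumptions.

```python
import math

def nb2_logpmf(y, mu, alpha):
    """NB2 log-probability with mean mu and dispersion alpha (r = 1/alpha)."""
    r = 1.0 / alpha
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            - r * math.log1p(alpha * mu)
            + y * (math.log(alpha * mu) - math.log1p(alpha * mu)))

def poisson_logpmf(y, mu):
    return y * math.log(mu) - mu - math.lgamma(y + 1)

mu, alpha = 2.0, 0.5
probs = [math.exp(nb2_logpmf(y, mu, alpha)) for y in range(400)]
mean = sum(y * p for y, p in enumerate(probs))
var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))   # mu + alpha * mu**2

# With a tiny alpha the NB2 log-probability is close to the Poisson one.
gap = abs(nb2_logpmf(3, mu, 1e-6) - poisson_logpmf(3, mu))
```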

hurdle ¶

hurdle(formula: Optional[str] = None, data: Optional[DataFrame] = None, y: Optional[str] = None, x: Optional[List[str]] = None, count_model: str = 'poisson', robust: str = 'nonrobust', cluster: Optional[str] = None, maxiter: int = 200, tol: float = 1e-08, alpha: float = 0.05) -> EconometricResults

Hurdle (two-part) model for count data.

Part 1 (binary): logit model for P(Y > 0). Part 2 (count): truncated-at-zero Poisson or Negative Binomial for the distribution of Y | Y > 0.

Unlike zero-inflated models, ALL zeros come from the binary process.

Equivalent to R's `pscl::hurdle()`.

Parameters:

Name	Type	Description	Default
`formula`	`str`	Patsy-style formula.	`None`
`data`	`DataFrame`	Dataset.	`None`
`y`	`str`	Dependent variable name.	`None`
`x`	`list of str`	Regressors (used for both hurdle and count parts).	`None`
`count_model`	`str`	Count distribution: "poisson" or "negbin".	`"poisson"`
`robust`	`str`		`"nonrobust"`
`cluster`	`str`		`None`
`maxiter`	`int`		`200`
`tol`	`float`		`1e-8`
`alpha`	`float`		`0.05`

Returns:

Type	Description
`EconometricResults`

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(0)
>>> n = 300
>>> age = rng.normal(0, 1, n)
>>> visits = rng.poisson(np.exp(0.5 + 0.3 * age))
>>> visits[rng.random(n) < 0.3] = 0  # excess zeros below the hurdle
>>> df = pd.DataFrame({'visits': visits, 'age': age})
>>> result = sp.hurdle(data=df, y='visits', x=['age'],
...                    count_model='negbin')
>>> print(result.summary())
>>> bool(result.model_info['model_type'] == 'hurdle')
True

Notes

The hurdle log-likelihood decomposes as:

.. math:: \ell = \sum_{y_i=0} \log(1-p_i) + \sum_{y_i>0} [\log p_i + \log f(y_i|\mu_i) - \log(1 - f(0|\mu_i))]

where p_i = Λ(x_i'δ) is the hurdle probability.

See Mullahy (1986, Journal of Econometrics).
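The decomposition above can be written as a per-observation log-probability. This sketch assumes a Poisson count part and takes the hurdle probability `p = P(Y > 0)` and the mean `mu` as given; `hurdle_poisson_logpmf` is a hypothetical helper.

```python
import math

def hurdle_poisson_logpmf(y, p, mu):
    """Log-probability under a logit hurdle with a zero-truncated Poisson."""
    if y == 0:
        return math.log(1.0 - p)                 # all zeros come from the binary part
    log_f = y * math.log(mu) - mu - math.lgamma(y + 1)   # untruncated Poisson
    log_trunc = math.log(-math.expm1(-mu))       # log(1 - f(0 | mu))
    return math.log(p) + log_f - log_trunc

p, mu = 0.6, 2.0
prob_zero = math.exp(hurdle_poisson_logpmf(0, p, mu))
total = sum(math.exp(hurdle_poisson_logpmf(y, p, mu)) for y in range(100))
```

Unlike the zero-inflated models, `P(Y = 0)` here equals `1 - p` regardless of `mu`.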

poisson ¶

poisson(formula: Optional[str] = None, data: DataFrame = None, y: Optional[str] = None, x: Optional[List[str]] = None, robust: str = 'nonrobust', cluster: Optional[str] = None, weights: Optional[str] = None, offset: Optional[str] = None, exposure: Optional[str] = None, irr: bool = False, maxiter: int = 100, tol: float = 1e-08, alpha: float = 0.05) -> EconometricResults

Poisson regression via MLE (IRLS).

Parameters:

Name	Type	Description	Default
`formula`	`str`	Model formula, e.g. "y ~ x1 + x2".	`None`
`data`	`DataFrame`	Data containing all variables.	`None`
`y`	`str`	Dependent variable name (alternative to formula).	`None`
`x`	`list of str`	Independent variable names (alternative to formula).	`None`
`robust`	`str`	Standard error type: "nonrobust", "robust"/"hc0", "hc1".	`"nonrobust"`
`cluster`	`str`	Variable name for clustered standard errors.	`None`
`weights`	`str`	Frequency/analytic weight variable.	`None`
`offset`	`str`	Offset variable (log of exposure already computed).	`None`
`exposure`	`str`	Exposure variable (will be logged and used as offset).	`None`
`irr`	`bool`	If True, report Incidence Rate Ratios (exp(beta)) instead of raw coefficients.	`False`
`maxiter`	`int`	Maximum IRLS iterations.	`100`
`tol`	`float`	Convergence tolerance.	`1e-8`
`alpha`	`float`	Significance level for confidence intervals.	`0.05`

Returns:

Type	Description
`EconometricResults`	Fitted model with params, standard errors, diagnostics.
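The roles of `exposure` and `irr` can be seen in a case with a closed-form answer: a Poisson model with one binary regressor and an exposure term. For that model the maximum likelihood rates are the group totals divided by total exposure, so no iteration is needed. The data and names below are made up for illustration.

```python
import math

def group_rate(counts, exposure):
    """Poisson MLE of the event rate per unit of exposure within a group."""
    return sum(counts) / sum(exposure)

counts0, exposure0 = [2, 4], [1.0, 2.0]     # group with regressor = 0
counts1, exposure1 = [12, 6], [2.0, 1.0]    # group with regressor = 1

b0 = math.log(group_rate(counts0, exposure0))        # intercept (log baseline rate)
b1 = math.log(group_rate(counts1, exposure1)) - b0   # coefficient on the dummy
irr = math.exp(b1)                                   # incidence rate ratio

# Exposure enters as an offset: mu = exp(b0 + b1 + log(exposure)).
mu1 = [math.exp(b0 + b1 + math.log(e)) for e in exposure1]
```

Passing `offset` instead would mean supplying `log(exposure)` yourself; the fitted means are the same either way.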

Examples:

>>> import statspai as sp
>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(0)
>>> n = 200
>>> math = rng.normal(50, 10, n)
>>> prog = rng.integers(0, 3, n).astype(float)
>>> num_awards = rng.poisson(np.exp(-3.0 + 0.06 * math + 0.2 * prog))
>>> df = pd.DataFrame({'num_awards': num_awards, 'math': math, 'prog': prog})
>>> res = sp.poisson("num_awards ~ math + prog", data=df)
>>> list(res.params.index)
['_cons', 'math', 'prog']
>>> res_irr = sp.poisson("num_awards ~ math + prog", data=df,
...                      robust="robust", irr=True)
>>> bool(res_irr.params['math'] > 0)
True

nbreg ¶

nbreg(formula: Optional[str] = None, data: DataFrame = None, y: Optional[str] = None, x: Optional[List[str]] = None, robust: str = 'nonrobust', cluster: Optional[str] = None, weights: Optional[str] = None, offset: Optional[str] = None, exposure: Optional[str] = None, irr: bool = False, dispersion: str = 'mean', maxiter: int = 100, tol: float = 1e-08, alpha: float = 0.05) -> EconometricResults

Negative binomial regression (NB2 or NB1).

Parameters:

Name	Type	Description	Default
`formula`	`str`	Model formula, e.g. "y ~ x1 + x2".	`None`
`data`	`DataFrame`	Data containing all variables.	`None`
`y`	`str`	Dependent variable name (alternative to formula).	`None`
`x`	`list of str`	Independent variable names (alternative to formula).	`None`
`robust`	`str`	Standard error type: "nonrobust", "robust"/"hc0", "hc1".	`"nonrobust"`
`cluster`	`str`	Variable name for clustered standard errors.	`None`
`weights`	`str`	Weight variable name.	`None`
`offset`	`str`	Offset variable (log of exposure).	`None`
`exposure`	`str`	Exposure variable (will be logged).	`None`
`irr`	`bool`	Report Incidence Rate Ratios.	`False`
`dispersion`	`str`	Dispersion parameterization: `"mean"` (NB2), Var(y) = mu + alpha * mu^2; or `"constant"` (NB1), Var(y) = mu * (1 + delta).	`"mean"`
`maxiter`	`int`	Maximum iterations.	`100`
`tol`	`float`	Convergence tolerance.	`1e-8`
`alpha`	`float`	Significance level.	`0.05`

Returns:

Type	Description
`EconometricResults`

Examples:

>>> import statspai as sp
>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(0)
>>> n = 200
>>> math = rng.normal(50, 10, n)
>>> prog = rng.integers(0, 3, n).astype(float)
>>> days_absent = rng.negative_binomial(2, 0.3, n)
>>> df = pd.DataFrame({'days_absent': days_absent, 'math': math, 'prog': prog})
>>> res = sp.nbreg("days_absent ~ math + prog", data=df, irr=True)
>>> 'math' in res.params.index
True

xtnbreg ¶

xtnbreg(formula: Optional[str] = None, data: DataFrame = None, y: Optional[str] = None, x: Optional[Sequence[str]] = None, entity: Optional[str] = None, time: Optional[str] = None, model: str = 'fe', time_effects: bool = False, robust: str = 'nonrobust', cluster: Optional[str] = None, weights: Optional[str] = None, offset: Optional[str] = None, exposure: Optional[str] = None, irr: bool = False, dispersion: str = 'mean', maxiter: int = 100, tol: float = 1e-08, alpha: float = 0.05) -> Any

Panel negative-binomial regression with Stata-like xtnbreg ergonomics.

`model="fe"` fits an unconditional fixed-effects NB model by adding explicit entity dummies through `nbreg`. This is appropriate for moderate panels and, most importantly, does not silently replace a count model with OLS. `model="re"` dispatches to `sp.menbreg`, the random-intercept NB-2 GLMM.
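
The unconditional fixed-effects approach described above amounts to adding one indicator column per entity (minus a reference level) to the design. A minimal sketch of that step in plain Python follows; `entity_dummies` is a hypothetical helper, and statspai's internal construction may differ, for example in which entity serves as the reference.

```python
def entity_dummies(entities, drop_first=True):
    """Return (labels, rows) of 0/1 indicators, one column per entity.

    Illustrative only: dropping the first sorted level avoids collinearity
    with an intercept; the reference level statspai uses is an assumption.
    """
    levels = sorted(set(entities))
    if drop_first:
        levels = levels[1:]
    rows = [[1 if e == lvl else 0 for lvl in levels] for e in entities]
    return levels, rows


units = ["a", "a", "b", "b", "c", "c"]
labels, D = entity_dummies(units)
print(labels)  # ['b', 'c']
print(D[2])    # [1, 0]  (row for unit "b")
```

Because the number of dummy columns grows with the number of entities, this construction is practical for moderate panels, consistent with the caveat above.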

Parameters:

Name	Type	Description	Default
`formula`	`str`	Count-model formula. For fixed effects you may pass `"y ~ x1 + x2 \| id"` directly, or pass `entity=`.	`None`
`data`	`DataFrame`	Long-format panel data.	`None`
`y`	`optional`	Alternative to `formula`.	`None`
`x`	`optional`	Alternative to `formula`.	`None`
`entity`	`str`	Panel/unit identifier. Required when the formula does not contain a `\| id` fixed-effect part.	`None`
`time`	`str`	Time column. Stored as metadata; included as a fixed effect only when `time_effects=True`.	`None`
`model`	`(fe, re, pooled)`	Fixed-effects, random-effects, or pooled negative binomial.	`"fe"`

Returns:

Type	Description
`EconometricResults or MEGLMResult`	`model="fe"` / `"pooled"` return `EconometricResults`; `model="re"` returns the multilevel `MEGLMResult`.

Examples:

>>> import statspai as sp
>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(0)
>>> rows = []
>>> for uid in range(15):
...     fe = rng.normal(0, 0.3)  # entity effect
...     for t in range(8):
...         x = rng.normal()
...         mu = np.exp(0.4 + 0.5 * x + fe)
...         rate = rng.gamma(shape=2.0, scale=mu / 2.0)  # NB-2 mixture
...         count = rng.poisson(rate)
...         rows.append(dict(unit=uid, year=t, count=count, x=x))
>>> df = pd.DataFrame(rows)
>>> res = sp.xtnbreg("count ~ x", data=df, entity="unit", model="fe")
>>> type(res).__name__
'EconometricResults'
>>> "x" in res.params.index
True

ppmlhdfe ¶

ppmlhdfe(formula: Optional[str] = None, data: DataFrame = None, y: Optional[str] = None, x: Optional[List[str]] = None, absorb: Optional[str] = None, robust: str = 'robust', cluster: Optional[Union[str, List[str], Tuple[str, str]]] = None, weights: Optional[str] = None, separation: bool = True, maxiter: int = 1000, tol: float = 1e-08, alpha: float = 0.05, vce: Optional[str] = None, wild_reps: int = 9999, wild_weight_type: str = 'rademacher', seed: Optional[int] = None, conley_lat: Optional[str] = None, conley_lon: Optional[str] = None, conley_cutoff: Optional[float] = None) -> EconometricResults

Poisson pseudo-maximum likelihood (PPML) with high-dimensional fixed effects.

Implements the Santos Silva & Tenreyro (2006) PPML estimator, the standard approach for gravity models and other trade/economic settings where:

- the dependent variable has zeros;
- log-linearization would be inconsistent under heteroskedasticity;
- high-dimensional fixed effects (origin, destination, year) must be absorbed.

Parameters:

Name	Type	Description	Default
`formula`	`str`	Model formula. Fixed effects can be specified via `\|`: `"trade ~ dist + contig \| origin + destination + year"`	`None`
`data`	`DataFrame`	Data containing all variables.	`None`
`y`	`str`	Dependent variable name (alternative to formula).	`None`
`x`	`list of str`	Independent variable names (alternative to formula).	`None`
`absorb`	`str`	Fixed effects to absorb, e.g. `"origin + destination + year"`. Overrides any FE specification in the formula.	`None`
`robust`	`str`	Default is robust SE (as in Stata's ppmlhdfe). Options: "robust"/"hc0", "hc1", "nonrobust".	`"robust"`
`cluster`	`str or list of str`	Variable name for clustered standard errors (recommended for gravity models, e.g. cluster on country-pair). A pair `cluster=["a", "b"]` requests two-way clustering: Cameron-Gelbach-Miller (2011) inclusion-exclusion with the single `G_min/(G_min-1)` small-sample factor, byte-identical to Stata `ppmlhdfe ..., cluster(a b)`.	`None`
`weights`	`str`	Weight variable name.	`None`
`separation`	`bool`	If True, check for separation (perfect prediction of zeros) and warn. Observations causing separation are not dropped automatically.	`True`
`maxiter`	`int`	Maximum IRLS iterations.	`1000`
`tol`	`float`	Convergence tolerance.	`1e-8`
`alpha`	`float`	Significance level for confidence intervals.	`0.05`
`vce`	`str`	Canonical SE-menu keyword. `"robust"`/`"hc1"`/`"hc0"` alias the `robust=` parameter. `"wild"` (with `cluster=`) runs the boottest-convention score wild cluster bootstrap on the FE-absorbed design — exact at any FE dimensionality (the weighted-FWL reduction of the score numerator is exact) and byte-identical to `sp.fepois(vce="wild")` on low-dimensional FE. `"CR2"`/`"CR3"`/`"jackknife"` (with `cluster=`) compute the clubSandwich glm bias-reduced SEs on the FE-as-dummies design (guarded against high-dimensional FE).	`None`
`wild_reps`	`int`	Replications for `vce="wild"` (enumerates the 2^G grid when `2**G <= wild_reps`).	`9999`
`wild_weight_type`	`str`	Wild weight distribution.	`"rademacher"`
`seed`	`int`	RNG seed for sampled (non-enumerated) wild draws.	`None`
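
The `wild_reps` rule in the table can be sketched as follows: with G clusters there are only 2^G distinct Rademacher sign vectors, so when that count fits within the replication budget the full grid can be enumerated rather than sampled. `rademacher_draws` is a hypothetical helper that illustrates the rule as stated; it is not statspai's implementation.

```python
import itertools
import random


def rademacher_draws(n_clusters, wild_reps=9999, seed=None):
    """Sign vectors (+1/-1 per cluster) for a wild cluster bootstrap (sketch)."""
    if 2 ** n_clusters <= wild_reps:
        # Full enumeration: deterministic, so the seed plays no role.
        return [list(w) for w in itertools.product((-1, 1), repeat=n_clusters)]
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(n_clusters)]
            for _ in range(wild_reps)]


print(len(rademacher_draws(5)))                 # 32 = 2**5, enumerated
print(len(rademacher_draws(20, wild_reps=99)))  # 99 sampled draws
```

This is why `seed` only matters for sampled (non-enumerated) draws: an enumerated grid is the same on every run.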

Returns:

Type	Description
`EconometricResults`

Notes

PPML is consistent under the assumption E[y|x] = exp(x'beta), regardless of the true conditional variance. With robust SE it is a quasi-maximum-likelihood (QMLE) estimator and does not assume Poisson variance.
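
To illustrate the note above: the PPML estimate solves the first-order condition sum_i (y_i - exp(x_i'beta)) x_i = 0, which involves only the conditional mean. The sketch below solves that condition by Newton's method for an intercept and one regressor, in plain Python, on a small made-up sample that includes zeros. The function `ppml_newton` and the data are illustrative assumptions; this is not statspai's IRLS or fixed-effects absorption algorithm.

```python
import math


def ppml_newton(x, y, maxiter=100, tol=1e-10):
    """Solve sum_i (y_i - exp(a + b*x_i)) * (1, x_i) = 0 by Newton's method."""
    a, b = math.log(sum(y) / len(y)), 0.0   # start at the intercept-only fit
    for _ in range(maxiter):
        mu = [math.exp(a + b * xi) for xi in x]
        # Score vector of the Poisson pseudo-likelihood
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for yi, mi, xi in zip(y, mu, x))
        # Negative Hessian: X' diag(mu) X
        h00 = sum(mu)
        h01 = sum(mi * xi for mi, xi in zip(mu, x))
        h11 = sum(mi * xi * xi for mi, xi in zip(mu, x))
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < tol:
            break
    return a, b


x = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
y = [0.0, 1.0, 0.0, 3.0, 4.0, 9.0]   # zeros are fine: no log(y) is taken
a, b = ppml_newton(x, y)
mu = [math.exp(a + b * xi) for xi in x]
score = (sum(yi - mi for yi, mi in zip(y, mu)),
         sum((yi - mi) * xi for yi, mi, xi in zip(y, mu, x)))
print(a, b, score)  # the score components should be numerically zero
```

Nothing in the estimating equation refers to the variance of y, which is why inference relies on robust or clustered standard errors rather than the Poisson variance.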

References

Santos Silva, J.M.C. & Tenreyro, S. (2006). "The Log of Gravity." Review of Economics and Statistics, 88(4), 641-658.

Examples:

>>> import statspai as sp
>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(0)
>>> rows = []
>>> for o in range(6):
...     for d in range(6):
...         if o == d:
...             continue
...         for year in (2000, 2001):
...             dist = rng.uniform(1.0, 5.0)
...             contig = float(rng.integers(0, 2))
...             mu = np.exp(2.0 - 0.6 * np.log(dist) + 0.3 * contig
...                         + 0.1 * o - 0.1 * d)
...             rows.append(dict(trade=rng.poisson(mu), dist=dist,
...                              contig=contig, origin=o, dest=d, year=year,
...                              pair_id=f"{min(o, d)}_{max(o, d)}"))
>>> df = pd.DataFrame(rows)
>>> # Basic gravity model with formula fixed effects
>>> res = sp.ppmlhdfe("trade ~ dist + contig | origin + dest + year",
...                   data=df, cluster="pair_id")
>>> list(res.params.index)
['dist', 'contig']
>>> # With absorb parameter instead of formula FE
>>> res2 = sp.ppmlhdfe("trade ~ dist + contig", data=df,
...                    absorb="origin + dest + year",
...                    cluster="pair_id")
>>> 'dist' in res2.params.index
True

mlogit ¶

mlogit(formula: Optional[str] = None, data: Optional[DataFrame] = None, y: Optional[str] = None, x: Optional[List[str]] = None, base: int = 0, robust: str = 'nonrobust', cluster: Optional[str] = None, rrr: bool = False, maxiter: int = 100, tol: float = 1e-08, alpha: float = 0.05) -> EconometricResults

Multinomial logit for J > 2 unordered categories via MLE.

Equivalent to Stata's `mlogit y x, base(0)` or `mlogit y x, rrr`.

Parameters:

Name	Type	Description	Default
`formula`	`str`	Formula `"y ~ x1 + x2"`.	`None`
`data`	`DataFrame`	Data.	`None`
`y`	`str`	Dependent variable (categorical, integer-coded).	`None`
`x`	`list of str`	Regressors.	`None`
`base`	`int`	Base / reference category (index into sorted unique values).	`0`
`robust`	`str`	`"robust"` / `"HC1"` for Huber-White sandwich SE.	`"nonrobust"`
`cluster`	`str`	Cluster variable for clustered SE.	`None`
`rrr`	`bool`	Report Relative Risk Ratios (exp(beta)) instead of coefficients.	`False`
`maxiter`	`int`		`100`
`tol`	`float`		`1e-8`
`alpha`	`float`		`0.05`

Returns:

Type	Description
`EconometricResults`

Examples:

>>> import numpy as np, pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(0)
>>> n = 300
>>> price = rng.normal(0, 1, n)
>>> income = rng.normal(0, 1, n)
>>> eta1 = 0.5 * price - 0.3 * income
>>> eta2 = -0.4 * price + 0.6 * income
>>> exps = np.column_stack([np.ones(n), np.exp(eta1), np.exp(eta2)])
>>> P = exps / exps.sum(axis=1, keepdims=True)
>>> choice = np.array([rng.choice(3, p=P[i]) for i in range(n)])
>>> df = pd.DataFrame({'choice': choice, 'price': price, 'income': income})
>>> result = sp.mlogit('choice ~ price + income', data=df, base=0)
>>> print(result.summary())
>>> rrr = sp.mlogit(data=df, y='choice', x=['price', 'income'], rrr=True)
>>> bool(rrr.params is not None)
True

Notes

Softmax parameterisation, with one coefficient vector $\beta_j$ for each category $j \neq \text{base}$:

$$P(Y_i = j \mid X_i) = \frac{\exp(X_i' \beta_j)}{\sum_{k} \exp(X_i' \beta_k)}, \quad \beta_{\text{base}} = 0.$$

McFadden pseudo-R^2 = 1 - LL / LL_0.
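
The softmax formula and the pseudo-R^2 above can be checked with a short standard-library sketch. `mlogit_probs` and `mcfadden_r2` are hypothetical helpers, and the linear indices and log-likelihood values are made-up numbers chosen only for illustration.

```python
import math


def mlogit_probs(etas, base=0):
    """Category probabilities from linear indices, with the base index fixed at 0.

    `etas` maps each non-base category j to X_i' beta_j (illustrative helper).
    """
    full = dict(etas)
    full[base] = 0.0                      # beta_base = 0 implies a zero index
    m = max(full.values())                # subtract the max for numerical stability
    expd = {j: math.exp(v - m) for j, v in full.items()}
    total = sum(expd.values())
    return {j: e / total for j, e in sorted(expd.items())}


def mcfadden_r2(ll, ll0):
    """McFadden pseudo-R^2 = 1 - LL / LL_0."""
    return 1.0 - ll / ll0


p = mlogit_probs({1: 0.5, 2: -0.4})
print(p, sum(p.values()))
print(mcfadden_r2(-250.0, -320.0))
```

The ratio `p[1] / p[0]` equals `exp(0.5)`, so exponentiating a coefficient gives the relative risk ratio against the base category, which is the quantity `rrr=True` reports.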

References

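McFadden, D. (1974). "Conditional Logit Analysis of Qualitative Choice Behavior." In P. Zarembka (ed.), Frontiers in Econometrics, 105-142. New York: Academic Press.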

ologit ¶

ologit(formula: Optional[str] = None, data: Optional[DataFrame] = None, y: Optional[str] = None, x: Optional[List[str]] = None, robust: str = 'nonrobust', cluster: Optional[str] = None, maxiter: int = 100, tol: float = 1e-08, alpha: float = 0.05) -> EconometricResults

Ordered logit (proportional odds) model via MLE.

Equivalent to Stata's `ologit y x`.

Parameters:

Name	Type	Description	Default
`formula`	`str`	Formula `"y ~ x1 + x2"`.	`None`
`data`	`DataFrame`		`None`
`y`	`str`	Ordered categorical dependent variable.	`None`
`x`	`list of str`		`None`
`robust`	`str`		`"nonrobust"`
`cluster`	`str`		`None`
`maxiter`	`int`		`100`
`tol`	`float`		`1e-8`
`alpha`	`float`		`0.05`

Returns:

Type	Description
`EconometricResults`	Coefficients (beta) and cutpoints (kappa). `result.predicted_probs` gives per-category probabilities. `result.marginal_effects` gives AME per category. `result.brant_test` gives the Brant parallel-lines test.

Examples:

>>> import numpy as np, pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(1)
>>> n = 300
>>> income = rng.normal(0, 1, n)
>>> age = rng.normal(0, 1, n)
>>> latent = 0.8 * income + 0.4 * age + rng.logistic(0, 1, n)
>>> satisfaction = np.digitize(latent, [-0.5, 0.8])  # ordered {0, 1, 2}
>>> df = pd.DataFrame({'satisfaction': satisfaction,
...                    'income': income, 'age': age})
>>> result = sp.ologit('satisfaction ~ income + age', data=df)
>>> print(result.summary())
>>> bool('_omnibus' in result.brant_test)  # parallel-regression test
True

Notes

$$P(Y \le j \mid X) = \Lambda(\kappa_j - X'\beta)$$

where $\Lambda$ is the logistic CDF. The parallel regression (proportional odds) assumption requires that $\beta$ is the same for each cumulative split.
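
Per-category probabilities follow from differencing the cumulative probabilities above: P(Y = j) = Λ(κ_j - X'β) - Λ(κ_{j-1} - X'β), with the conventions Λ(-∞) = 0 and Λ(+∞) = 1. A minimal sketch, assuming a scalar linear index `xb` and made-up cutpoints; `ordered_probs` is a hypothetical helper and not how `result.predicted_probs` is necessarily computed.

```python
import math


def logistic_cdf(z):
    """Standard logistic CDF."""
    return 1.0 / (1.0 + math.exp(-z))


def ordered_probs(xb, cutpoints, cdf=logistic_cdf):
    """P(Y = j) as differences of cumulative probabilities F(kappa_j - xb)."""
    cum = [cdf(k - xb) for k in cutpoints] + [1.0]   # last category closes at 1
    probs, prev = [], 0.0
    for c in cum:
        probs.append(c - prev)
        prev = c
    return probs


probs = ordered_probs(xb=0.3, cutpoints=[-0.5, 0.8])
print(probs, sum(probs))
```

Raising `xb` shifts mass toward higher categories while the cutpoints stay fixed, which is the proportional-odds structure the Brant test examines.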

References

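McKelvey, R.D. & Zavoina, W. (1975). "A Statistical Model for the Analysis of Ordinal Level Dependent Variables." Journal of Mathematical Sociology, 4(1), 103-120.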

oprobit ¶

oprobit(formula: Optional[str] = None, data: Optional[DataFrame] = None, y: Optional[str] = None, x: Optional[List[str]] = None, robust: str = 'nonrobust', cluster: Optional[str] = None, maxiter: int = 100, tol: float = 1e-08, alpha: float = 0.05) -> EconometricResults

Ordered probit model via MLE.

Equivalent to Stata's `oprobit y x`.

Parameters:

Name	Type	Description	Default
`formula`	`str`	Formula `"y ~ x1 + x2"`.	`None`
`data`	`DataFrame`		`None`
`y`	`str`	Ordered categorical dependent variable.	`None`
`x`	`list of str`		`None`
`robust`	`str`		`"nonrobust"`
`cluster`	`str`		`None`
`maxiter`	`int`		`100`
`tol`	`float`		`1e-8`
`alpha`	`float`		`0.05`

Returns:

Type	Description
`EconometricResults`	Same structure as `ologit` but with probit link.

Examples:

>>> import numpy as np, pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(2)
>>> n = 300
>>> quality = rng.normal(0, 1, n)
>>> price = rng.normal(0, 1, n)
>>> latent = 0.7 * quality - 0.5 * price + rng.normal(0, 1, n)
>>> rating = np.digitize(latent, [-0.4, 0.6])  # ordered {0, 1, 2}
>>> df = pd.DataFrame({'rating': rating, 'quality': quality, 'price': price})
>>> result = sp.oprobit(data=df, y='rating', x=['quality', 'price'])
>>> print(result.summary())
>>> bool(len(result.marginal_effects) > 0)
True

Notes

$$P(Y \le j \mid X) = \Phi(\kappa_j - X'\beta)$$

where $\Phi$ is the standard normal CDF.
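
As a quick numerical check of the formula above, the standard library's `statistics.NormalDist` provides Φ directly. The helper `oprobit_probs`, the index value, and the cutpoints are illustrative assumptions, not statspai internals.

```python
from statistics import NormalDist


def oprobit_probs(xb, cutpoints):
    """P(Y = j) = Phi(kappa_j - xb) - Phi(kappa_{j-1} - xb) for ordered cutpoints."""
    Phi = NormalDist().cdf
    cum = [0.0] + [Phi(k - xb) for k in cutpoints] + [1.0]
    return [hi - lo for lo, hi in zip(cum, cum[1:])]


probs = oprobit_probs(xb=0.2, cutpoints=[-0.4, 0.6])
print(probs, sum(probs))
```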

References

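McKelvey, R.D. & Zavoina, W. (1975). "A Statistical Model for the Analysis of Ordinal Level Dependent Variables." Journal of Mathematical Sociology, 4(1), 103-120.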

clogit ¶

clogit(formula: Optional[str] = None, data: Optional[DataFrame] = None, y: Optional[str] = None, x: Optional[List[str]] = None, group: Optional[str] = None, robust: str = 'nonrobust', cluster: Optional[str] = None, maxiter: int = 100, tol: float = 1e-08, alpha: float = 0.05) -> EconometricResults

McFadden's conditional (fixed-effect) logit for choice data.

Each observation is an alternative within a choice set (group). The dependent variable is 1 for the chosen alternative, 0 otherwise.

Equivalent to Stata's `clogit y x, group(id)`.

Parameters:

Name	Type	Description	Default
`formula`	`str`	Formula `"chosen ~ price + quality"`.	`None`
`data`	`DataFrame`	Long-format data with one row per alternative per choice set.	`None`
`y`	`str`	Binary indicator: 1 = chosen, 0 = not chosen.	`None`
`x`	`list of str`	Alternative-specific (and/or individual-specific interacted with alternative dummies) covariates.	`None`
`group`	`str`	Variable identifying the choice set / decision-maker.	`None`
`robust`	`str`		`"nonrobust"`
`cluster`	`str`		`None`
`maxiter`	`int`		`100`
`tol`	`float`		`1e-8`
`alpha`	`float`		`0.05`

Returns:

Type	Description
`EconometricResults`

Examples:

>>> import numpy as np, pandas as pd
>>> import statspai as sp
>>> rng = np.random.default_rng(2)
>>> rows = []
>>> for case in range(150):
...     price = rng.normal(0, 1, 3)
...     quality = rng.normal(0, 1, 3)
...     util = -0.8 * price + 0.9 * quality + rng.gumbel(0, 1, 3)
...     chosen_alt = int(np.argmax(util))
...     for a in range(3):
...         rows.append({'case_id': case, 'chosen': int(a == chosen_alt),
...                      'price': price[a], 'quality': quality[a]})
>>> df = pd.DataFrame(rows)
>>> result = sp.clogit('chosen ~ price + quality', data=df, group='case_id')
>>> print(result.summary())
>>> bool(result.params is not None)
True

Notes

The conditional log-likelihood for group g:

$$\ell_g = X_{g,\text{chosen}}'\beta - \log\left(\sum_{j \in g} \exp(X_{gj}'\beta)\right)$$

Only alternative-specific variation identifies beta; the group fixed effect is conditioned out (no constant estimated).
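
The claim that the group effect is conditioned out can be verified numerically: adding the same constant to every alternative's index within a choice set leaves ℓ_g unchanged. The sketch below uses made-up utilities for one three-alternative choice set; `clogit_group_loglik` is a hypothetical helper, not the statspai routine.

```python
import math


def clogit_group_loglik(utilities, chosen):
    """ell_g = X_chosen' beta - log(sum_j exp(X_gj' beta)) for one choice set."""
    m = max(utilities)                                  # log-sum-exp stabilisation
    lse = m + math.log(sum(math.exp(u - m) for u in utilities))
    return utilities[chosen] - lse


# Indices -0.8 * price + 0.9 * quality for three alternatives (illustrative values)
u = [-0.8 * 1.2 + 0.9 * 0.3, -0.8 * 0.4 + 0.9 * 1.1, -0.8 * -0.5 + 0.9 * -0.2]
ll = clogit_group_loglik(u, chosen=1)
# A group-specific constant shifts every utility equally and cancels out.
ll_shifted = clogit_group_loglik([v + 5.0 for v in u], chosen=1)
print(ll, ll_shifted)
```

The same cancellation applies to any covariate that is constant within a group, which is why such variables must be interacted with alternative dummies to enter the model.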

References

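McFadden, D. (1974). "Conditional Logit Analysis of Qualitative Choice Behavior." In P. Zarembka (ed.), Frontiers in Econometrics, 105-142. New York: Academic Press.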
