statspai.regression

regression

Regression module initialization

OLSRegression

Bases: BaseModel

OLS regression model with comprehensive functionality

fit

fit(robust: str = 'nonrobust', cluster: Optional[str] = None, **kwargs) -> EconometricResults

Fit the OLS model

Parameters:

Name Type Description Default
robust str

Type of standard errors

'nonrobust'
cluster str

Variable name for clustering

None
**kwargs

Additional options

{}

Returns:

Type Description
EconometricResults

Fitted model results

predict

predict(data: Optional[DataFrame] = None, what: str = 'mean', alpha: float = 0.05, return_df: bool = False) -> ndarray | DataFrame

Generate predictions from the fitted OLS model.

Parameters:

Name Type Description Default
data DataFrame

New data at which to predict. If None, returns the in-sample fitted values.

None
what (mean, confidence, prediction)
  • "mean": point predictions only (default).
  • "confidence": point prediction plus a (1-alpha) confidence interval for the conditional mean E[y | x].
  • "prediction": point prediction plus a (1-alpha) prediction interval for a new observation. It is wider than the confidence interval because the residual variance sigma^2 is added to the variance of the estimated mean.
"mean"
alpha float

Significance level for the interval.

0.05
return_df bool

Return a DataFrame with columns ["yhat", "lower", "upper"]. Ignored (forces True) when what != "mean".

False

Returns:

Type Description
ndarray or DataFrame

Point predictions, optionally with interval columns.
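
The difference between the "confidence" and "prediction" intervals can be illustrated with a minimal sketch for a simple regression with one regressor. This is not the library's implementation; the function name and the toy data are assumptions made for illustration, and only the standard errors (not the critical values) are computed.

```python
import math

def simple_ols_interval_se(x, y, x0):
    """Standard errors of the fitted mean and of a new observation at x0
    for the simple regression y = a + b*x + e (illustrative only)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sigma2 = sum(r * r for r in resid) / (n - 2)             # residual variance
    var_mean = sigma2 * (1.0 / n + (x0 - xbar) ** 2 / sxx)   # variance of the estimated E[y | x0]
    var_pred = var_mean + sigma2                             # add the noise of a new draw
    return a + b * x0, math.sqrt(var_mean), math.sqrt(var_pred), sigma2

# Toy data, hypothetical
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
yhat, se_mean, se_pred, sigma2 = simple_ols_interval_se(x, y, x0=3.5)
```

Both intervals are centred on the same point prediction; the prediction interval uses the larger standard error, so it is always the wider of the two.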

OLSEstimator

Bases: BaseEstimator

Ordinary Least Squares estimator with robust standard errors

estimate

estimate(y: ndarray, X: ndarray, robust: str = 'nonrobust', cluster: Optional[Series] = None, **kwargs) -> Dict[str, Any]

Estimate OLS parameters

Parameters:

Name Type Description Default
y ndarray

Dependent variable

required
X ndarray

Independent variables (including constant if desired)

required
robust str

Type of standard errors ('nonrobust', 'hc0', 'hc1', 'hc2', 'hc3', 'hac')

'nonrobust'
cluster Series

Cluster variable for clustered standard errors

None
**kwargs

Additional options

{}

Returns:

Type Description
Dict[str, Any]

Estimation results

IVRegression

Bases: BaseModel

Instrumental Variables regression model.

Supports multiple estimation methods via method parameter: '2sls', 'liml', 'fuller', 'gmm', 'jive'.

Parameters:

Name Type Description Default
formula str

Formula with IV syntax: "y ~ (endog ~ z1 + z2) + exog1 + exog2"

None
data DataFrame
None
method str

Estimation method.

'2sls'
fuller_alpha float

Fuller constant (only used when method='fuller'). alpha=1 gives the bias-corrected Fuller estimator; alpha=4 minimises MSE under normal errors.

1.0
y array-like

Alternative to formula interface.

None
X_exog array-like

Alternative to formula interface.

None
X_endog array-like

Alternative to formula interface.

None
Z array-like

Alternative to formula interface.

None
var_names array-like

Alternative to formula interface.

None

first_stage property

first_stage: List[Dict[str, float]]

First-stage diagnostics for each endogenous variable.

sargan_test property

sargan_test: Optional[Dict[str, float]]

Sargan/Hansen J overidentification test results.

hausman_test property

hausman_test: Dict[str, float]

Durbin-Wu-Hausman endogeneity test results.

fit

fit(robust: str = 'nonrobust', cluster: Optional[str] = None, **kwargs) -> EconometricResults

Fit the IV model.

Parameters:

Name Type Description Default
robust str

Standard-error type ('nonrobust', 'hc0', 'hc1', 'hc2', 'hc3').

'nonrobust'
cluster str

Variable name for clustering.

None

Returns:

Type Description
EconometricResults

predict

predict(data: Optional[DataFrame] = None) -> ndarray

Generate predictions from the fitted IV model.

For a structural-form estimator, the natural forecast of y given new data is X_exog β_exog + X_endog β_endog — i.e. we plug observed values of the endogenous variables through the structural equation. Instruments are not used at prediction time.

Parameters:

Name Type Description Default
data DataFrame

New data at which to predict. Must contain all exogenous and endogenous variables referenced by the model's formula. If None, returns in-sample fitted values.

None
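
As a hedged illustration of structural-form prediction, the sketch below fits a just-identified IV model (one endogenous regressor, one instrument, a constant) and then predicts by plugging observed endogenous values into the structural equation. The helper names `cov`, `iv_fit`, and `iv_predict` are hypothetical and are not part of the library; the data are a noiseless toy example.

```python
def cov(a, b):
    """Sample covariance (divisor n; the divisor cancels in the IV ratio)."""
    abar, bbar = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - abar) * (bi - bbar) for ai, bi in zip(a, b)) / len(a)

def iv_fit(y, x_endog, z):
    """Just-identified IV: slope = cov(z, y) / cov(z, x), plus a constant."""
    beta = cov(z, y) / cov(z, x_endog)
    const = sum(y) / len(y) - beta * sum(x_endog) / len(x_endog)
    return const, beta

def iv_predict(const, beta, x_endog_new):
    """Plug observed endogenous values into the structural equation.
    The instrument z is not needed at prediction time."""
    return [const + beta * xi for xi in x_endog_new]

z = [0.0, 1.0, 0.0, 1.0, 1.0, 0.0]
x = [1.0, 2.5, 0.5, 3.0, 2.0, 1.5]
y = [1.0 + 2.0 * xi for xi in x]      # noiseless toy outcome with true slope 2
const, beta = iv_fit(y, x, z)
preds = iv_predict(const, beta, [4.0])
```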

IVEstimator

Bases: BaseEstimator

Two-Stage Least Squares (2SLS) estimator.

Legacy class. Prefer using the iv() function directly.

GLMRegression

Bases: BaseModel

Generalized Linear Model with IRLS estimation.

Parameters:

Name Type Description Default
formula str

Model formula (e.g. "y ~ x1 + x2").

None
data DataFrame

Data frame containing the variables.

None
y ndarray

Response array (alternative to formula).

None
X ndarray

Design matrix (alternative to formula).

None
var_names list of str

Variable names when using y/X directly.

None
family str

Distribution family.

'gaussian'
link str or None

Link function (None selects canonical link).

None

fit

fit(robust: str = 'nonrobust', cluster: Optional[str] = None, weights: Optional[str] = None, offset: Optional[str] = None, exposure: Optional[str] = None, maxiter: int = 100, tol: float = 1e-08, alpha: float = 0.05, **kwargs) -> EconometricResults

Fit the GLM.

Parameters:

Name Type Description Default
robust str

Standard-error type.

'nonrobust'
cluster str

Cluster variable name.

None
weights str

Weight variable name.

None
offset str

Offset variable name.

None
exposure str

Exposure variable name (log added as offset).

None
maxiter int

Maximum IRLS iterations.

100
tol float

Convergence tolerance.

1e-08
alpha float

Significance level for confidence intervals.

0.05

Returns:

Type Description
EconometricResults

predict

predict(data: Optional[DataFrame] = None, type: str = 'response', offset: Optional[ndarray] = None) -> ndarray

Generate predictions from the fitted model.

Parameters:

Name Type Description Default
data DataFrame

New data for prediction. Uses training data if None.

None
type str

'response' (mean), 'link' (linear predictor), or 'variance' (variance function evaluated at predicted mu).

'response'
offset ndarray

Offset for new data.

None

Returns:

Type Description
ndarray

GLMEstimator

Bases: BaseEstimator

Generalized Linear Model estimator using IRLS

Implements Iteratively Reweighted Least Squares for maximum likelihood estimation of GLM parameters.
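
To make the IRLS idea concrete, here is a minimal sketch for a Poisson GLM with a log link, one regressor, and an intercept. It is an illustration under simplifying assumptions (no weights, offset, or robust standard errors) and is not the code used by GLMEstimator; the function name `irls_poisson` and the data are hypothetical.

```python
import math

def irls_poisson(x, y, maxiter=100, tol=1e-8):
    """IRLS for a Poisson GLM with log link: log(mu) = b0 + b1 * x."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0    # start at the intercept-only fit
    dev_old = float("inf")
    for _ in range(maxiter):
        eta = [b0 + b1 * xi for xi in x]
        mu = [math.exp(e) for e in eta]
        # Working response and weights for the log link
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        w = mu
        # Weighted least squares of z on (1, x): solve the 2x2 normal equations
        s_w = sum(w)
        s_wx = sum(wi * xi for wi, xi in zip(w, x))
        s_wxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        s_wz = sum(wi * zi for wi, zi in zip(w, z))
        s_wxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = s_w * s_wxx - s_wx ** 2
        b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        b1 = (s_w * s_wxz - s_wx * s_wz) / det
        # Poisson deviance, with the convention y * log(y / mu) = 0 when y == 0
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        dev = 2.0 * sum((yi * math.log(yi / m) if yi > 0 else 0.0) - (yi - m)
                        for yi, m in zip(y, mu))
        if abs(dev - dev_old) < tol:           # convergence on the deviance
            break
        dev_old = dev
    return b0, b1

x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1, 0, 3, 4, 9, 14]
b0, b1 = irls_poisson(x, y)
mu_hat = [math.exp(b0 + b1 * xi) for xi in x]
```

At convergence the fitted means satisfy the likelihood score equations, so the residuals y - mu sum to zero and are orthogonal to x.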

estimate

estimate(y: ndarray, X: ndarray, family: Family, link: LinkFunction, robust: str = 'nonrobust', cluster: Optional[Series] = None, weights: Optional[ndarray] = None, offset: Optional[ndarray] = None, maxiter: int = 100, tol: float = 1e-08, alpha_nb: Optional[float] = None, **kwargs) -> Dict[str, Any]

Estimate GLM parameters via IRLS.

Parameters:

Name Type Description Default
y ndarray

Response variable (n,).

required
X ndarray

Design matrix (n, k) including intercept if desired.

required
family Family

Distribution family instance.

required
link LinkFunction

Link function instance.

required
robust str

Standard-error type ('nonrobust', 'hc0', 'hc1', 'hc2', 'hc3', 'hac').

'nonrobust'
cluster Series

Cluster variable.

None
weights ndarray

Prior / frequency weights.

None
offset ndarray

Known offset added to the linear predictor.

None
maxiter int

Maximum IRLS iterations.

100
tol float

Convergence tolerance on deviance.

1e-08
alpha_nb float

If family is NB, initial alpha for joint estimation.

None

Returns:

Type Description
Dict[str, Any]

regress

regress(formula: str, data: DataFrame, robust: str = 'nonrobust', cluster: Optional[str] = None, **kwargs) -> EconometricResults

Convenience function for OLS regression.

Parameters:

Name Type Description Default
formula str

Regression formula

required
data DataFrame

Data containing variables

required
robust str

Type of standard errors

'nonrobust'
cluster str

Variable name for clustering

None
**kwargs

Additional options

{}

Returns:

Type Description
EconometricResults

Fitted model results

Examples:

>>> results = regress("wage ~ education + experience", data=df)
>>> print(results.summary())
>>> results = regress("wage ~ education + experience", data=df,
...                   robust='hc1', cluster='firm_id')

ivreg

ivreg(formula: str, data: DataFrame, robust: str = 'nonrobust', cluster: Optional[str] = None, **kwargs) -> EconometricResults

Instrumental variables regression (2SLS).

Deprecated: use sp.iv(formula, data, method='2sls') instead. ivreg is kept for backward compatibility.

Parameters:

Name Type Description Default
formula str

IV formula: "y ~ (endog ~ z1 + z2) + exog1 + exog2"

required
data DataFrame
required
robust str
'nonrobust'
cluster str
None

Returns:

Type Description
EconometricResults

qreg

qreg(data: DataFrame, formula: Optional[str] = None, y: Optional[str] = None, x: Optional[List[str]] = None, quantile: float = 0.5, alpha: float = 0.05) -> CausalResult

Quantile regression at a single quantile.

Equivalent to Stata's qreg y x, quantile(0.5).

Parameters:

Name Type Description Default
data DataFrame
required
formula str

Formula like "y ~ x1 + x2" (patsy-style).

None
y str

Outcome variable (alternative to formula).

None
x list of str

Regressors (alternative to formula).

None
quantile float

Quantile to estimate (0 < q < 1). 0.5 = median.

0.5
alpha float
0.05

Returns:

Type Description
CausalResult

Coefficients at the specified quantile.

Examples:

>>> # Median regression
>>> result = sp.qreg(df, y='wage', x=['education', 'experience'],
...                  quantile=0.5)
>>> # 90th percentile
>>> result = sp.qreg(df, y='wage', x=['education', 'experience'],
...                  quantile=0.9)
Notes

Quantile regression minimizes:

.. math:: \min_\beta \sum_i \rho_\tau(Y_i - X_i'\beta)

where ρ_τ(u) = u(τ - 1(u < 0)) is the check function.

Standard errors are computed using the Powell (1991) sandwich estimator with a kernel density estimate of f(0|X).

See Koenker & Bassett (1978, Econometrica).
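
The check function can be illustrated with a short sketch in the simplest case, a model with only a constant. Minimizing the total check loss over a constant recovers a sample quantile. The helper names below are hypothetical, and the brute-force search over sample values stands in for the optimizer a real implementation would use.

```python
def check_loss(u, tau):
    """Check function rho_tau(u) = u * (tau - 1(u < 0))."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def best_constant(y, tau):
    """Sample value minimizing the total check loss (a tau-th sample quantile).
    The minimum over all constants is always attained at a sample point."""
    return min(y, key=lambda c: sum(check_loss(yi - c, tau) for yi in y))

y = [3.0, 1.0, 7.0, 5.0, 9.0]
median = best_constant(y, 0.5)   # tau = 0.5 gives the median
q90 = best_constant(y, 0.9)      # tau = 0.9 weights positive residuals more heavily
```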

sqreg

sqreg(data: DataFrame, y: str, x: List[str], quantiles: Optional[List[float]] = None, alpha: float = 0.05) -> DataFrame

Simultaneous quantile regression at multiple quantiles.

Equivalent to Stata's sqreg y x, quantiles(10 25 50 75 90).

Parameters:

Name Type Description Default
data DataFrame
required
y str
required
x list of str
required
quantiles list of float

Default: [0.1, 0.25, 0.5, 0.75, 0.9].

None
alpha float
0.05

Returns:

Type Description
DataFrame

Rows: variables. Columns: quantiles with coefficients and SEs.

Examples:

>>> table = sp.sqreg(df, y='wage', x=['education', 'experience'])
>>> print(table)

logit

logit(formula: str = None, data: DataFrame = None, y: str = None, x: list = None, robust: str = 'nonrobust', cluster: str = None, weights: str = None, marginal_effects: str = None, odds_ratio: bool = False, maxiter: int = 100, tol: float = 1e-08, alpha: float = 0.05, at_values: dict = None) -> EconometricResults

Logit (logistic) regression via maximum likelihood.

Equivalent to Stata's logit y x1 x2, or to Stata's logistic when odds_ratio=True.

Parameters:

Name Type Description Default
formula str

Formula like "y ~ x1 + x2".

None
data DataFrame

Data containing the variables.

None
y str

Dependent variable name (alternative to formula).

None
x list of str

Regressor names (alternative to formula).

None
robust str

'nonrobust' for MLE SE, 'hc1' / 'robust' for sandwich SE.

'nonrobust'
cluster str

Column name for clustered standard errors.

None
weights str

Column name for frequency/analytic weights.

None
marginal_effects str

'average' (AME), 'mean' (MEM), or 'at' (MER).

None
odds_ratio bool

Report odds ratios instead of log-odds coefficients.

False
maxiter int

Maximum Newton-Raphson iterations.

100
tol float

Convergence tolerance on log-likelihood change.

1e-8
alpha float

Significance level for confidence intervals.

0.05
at_values dict

Variable values for marginal_effects='at'.

None

Returns:

Type Description
EconometricResults

Fitted model with .summary(), .predict(), diagnostics, etc.
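
The marginal_effects options differ in where the derivative of the predicted probability is evaluated. The sketch below shows the usual textbook definitions of the average marginal effect (AME) and the marginal effect at the mean (MEM) for a single continuous regressor; it is not the library's code, and the coefficients and data are hypothetical.

```python
import math

def logistic_pdf(t):
    """Derivative of the logistic CDF: Lambda(t) * (1 - Lambda(t))."""
    p = 1.0 / (1.0 + math.exp(-t))
    return p * (1.0 - p)

def marginal_effects(b0, b1, x):
    """AME and MEM of one continuous regressor in a logit model."""
    # AME: average the derivative over the observed values of x
    ame = sum(b1 * logistic_pdf(b0 + b1 * xi) for xi in x) / len(x)
    # MEM: evaluate the derivative once, at the sample mean of x
    xbar = sum(x) / len(x)
    mem = b1 * logistic_pdf(b0 + b1 * xbar)
    return ame, mem

# Hypothetical coefficients and data, chosen only for illustration
ame, mem = marginal_effects(b0=-1.0, b1=0.8, x=[-2.0, -0.5, 0.0, 1.5, 3.0])
odds_ratio = math.exp(0.8)   # exp(beta), the usual odds-ratio transformation
```

Because the logistic density is at most 1/4, neither effect can exceed b1 / 4 in absolute value. A marginal effect at representative values (MER) would evaluate the same derivative at user-chosen values instead of the mean.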

Examples:

>>> import statspai as sp
>>> result = sp.logit("admit ~ gre + gpa + rank", data=df)
>>> print(result.summary())
>>> # With odds ratios and robust SE
>>> result = sp.logit("admit ~ gre + gpa", data=df,
...                   robust='hc1', odds_ratio=True)
>>> # Marginal effects at the mean
>>> result = sp.logit("y ~ x1 + x2", data=df, marginal_effects='mean')
>>> print(result.model_info['marginal_effects'])

probit

probit(formula: str = None, data: DataFrame = None, y: str = None, x: list = None, robust: str = 'nonrobust', cluster: str = None, weights: str = None, marginal_effects: str = None, maxiter: int = 100, tol: float = 1e-08, alpha: float = 0.05, at_values: dict = None) -> EconometricResults

Probit regression via maximum likelihood.

Equivalent to Stata's probit y x1 x2.

Parameters:

Name Type Description Default
formula str

Formula like "y ~ x1 + x2".

None
data DataFrame

Data containing the variables.

None
y str

Dependent variable name (alternative to formula).

None
x list of str

Regressor names (alternative to formula).

None
robust str

'nonrobust' for MLE SE, 'hc1' / 'robust' for sandwich SE.

'nonrobust'
cluster str

Column name for clustered standard errors.

None
weights str

Column name for frequency/analytic weights.

None
marginal_effects str

'average' (AME), 'mean' (MEM), or 'at' (MER).

None
maxiter int

Maximum Newton-Raphson iterations.

100
tol float

Convergence tolerance on log-likelihood change.

1e-8
alpha float

Significance level for confidence intervals.

0.05
at_values dict

Variable values for marginal_effects='at'.

None

Returns:

Type Description
EconometricResults

Fitted model with .summary(), .predict(), diagnostics, etc.

Examples:

>>> import statspai as sp
>>> result = sp.probit("admit ~ gre + gpa + rank", data=df)
>>> print(result.summary())
>>> # Average marginal effects with clustered SE
>>> result = sp.probit("y ~ x1 + x2", data=df,
...                    cluster='state', marginal_effects='average')

cloglog

cloglog(formula: str = None, data: DataFrame = None, y: str = None, x: list = None, robust: str = 'nonrobust', cluster: str = None, weights: str = None, marginal_effects: str = None, maxiter: int = 100, tol: float = 1e-08, alpha: float = 0.05, at_values: dict = None) -> EconometricResults

Complementary log-log regression via maximum likelihood.

Appropriate when P(Y=1) is small (rare events) or when the latent distribution is asymmetric (extreme value type I).

Equivalent to Stata's cloglog y x1 x2.

Parameters:

Name Type Description Default
formula str

Formula like "y ~ x1 + x2".

None
data DataFrame

Data containing the variables.

None
y str

Dependent variable name (alternative to formula).

None
x list of str

Regressor names (alternative to formula).

None
robust str

'nonrobust' for MLE SE, 'hc1' / 'robust' for sandwich SE.

'nonrobust'
cluster str

Column name for clustered standard errors.

None
weights str

Column name for frequency/analytic weights.

None
marginal_effects str

'average' (AME), 'mean' (MEM), or 'at' (MER).

None
maxiter int

Maximum Newton-Raphson iterations.

100
tol float

Convergence tolerance on log-likelihood change.

1e-8
alpha float

Significance level for confidence intervals.

0.05
at_values dict

Variable values for marginal_effects='at'.

None

Returns:

Type Description
EconometricResults

Fitted model with .summary(), .predict(), diagnostics, etc.

Examples:

>>> import statspai as sp
>>> result = sp.cloglog("default ~ income + balance", data=df)
>>> print(result.summary())

zip_model

zip_model(formula: str = None, data: DataFrame = None, y: str = None, x: list = None, inflate: list = None, robust: str = 'nonrobust', cluster: str = None, maxiter: int = 200, tol: float = 1e-08, alpha: float = 0.05) -> EconometricResults

Zero-Inflated Poisson (ZIP) regression via MLE.

Two-part model:

  • Inflate equation: logit model for P(structural zero) = Λ(z'γ)
  • Count equation: Poisson model with mean μ = exp(x'β)

Equivalent to Stata's zip y x, inflate(z).

Parameters:

Name Type Description Default
formula str

Patsy-style formula for the count equation, e.g. "y ~ x1 + x2".

None
data DataFrame

Dataset.

None
y str

Dependent variable name (alternative to formula).

None
x list of str

Count-equation regressors (alternative to formula).

None
inflate list of str

Inflation-equation regressors. Default: same as count regressors.

None
robust str

"nonrobust", "HC0", "HC1", etc.

"nonrobust"
cluster str

Cluster variable name for clustered standard errors.

None
maxiter int

Maximum iterations for optimizer.

200
tol float

Convergence tolerance.

1e-8
alpha float

Significance level for confidence intervals.

0.05

Returns:

Type Description
EconometricResults

Coefficients for both equations, Vuong test, diagnostics.

Examples:

>>> result = sp.zip_model(data=df, y='doctor_visits', x=['age', 'income'],
...                       inflate=['age', 'chronic'])
>>> print(result.summary())
Notes

Log-likelihood for ZIP:

.. math:: \ell_i = \begin{cases} \log[\pi_i + (1-\pi_i) e^{-\mu_i}] & \text{if } y_i = 0 \\ \log(1-\pi_i) + y_i \log\mu_i - \mu_i - \log(y_i!) & \text{if } y_i > 0 \end{cases}

where π_i = Λ(z_i'γ) and μ_i = exp(x_i'β).

See Lambert (1992, Technometrics).
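
The per-observation ZIP log-likelihood in these notes can be written as a short function. This is a sketch of the formula only, with a hypothetical function name and parameter values; it takes π and μ as given rather than computing them from z'γ and x'β.

```python
import math

def zip_loglik_i(y, pi, mu):
    """Log-likelihood contribution of one observation under ZIP."""
    if y == 0:
        # A zero is either structural (prob. pi) or a Poisson zero (prob. (1 - pi) * exp(-mu))
        return math.log(pi + (1.0 - pi) * math.exp(-mu))
    # Positive counts can only come from the Poisson part; lgamma(y + 1) = log(y!)
    return math.log(1.0 - pi) + y * math.log(mu) - mu - math.lgamma(y + 1)

pi, mu = 0.3, 2.0   # hypothetical structural-zero probability and Poisson mean
total = sum(math.exp(zip_loglik_i(y, pi, mu)) for y in range(60))
p_zero = math.exp(zip_loglik_i(0, pi, mu))
```

Exponentiating the contributions over all counts gives probabilities that sum to one, which is a quick sanity check on the two branches.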

zinb

zinb(formula: str = None, data: DataFrame = None, y: str = None, x: list = None, inflate: list = None, robust: str = 'nonrobust', cluster: str = None, maxiter: int = 200, tol: float = 1e-08, alpha: float = 0.05) -> EconometricResults

Zero-Inflated Negative Binomial (ZINB) regression via MLE.

Two-part model:

  • Inflate equation: logit for P(structural zero) = Λ(z'γ)
  • Count equation: NB2 with mean μ = exp(x'β), Var = μ + α·μ²

Equivalent to Stata's zinb y x, inflate(z).

Parameters:

Name Type Description Default
formula str

Patsy-style formula for the count equation.

None
data DataFrame

Dataset.

None
y str

Dependent variable name.

None
x list of str

Count-equation regressors.

None
inflate list of str

Inflation-equation regressors. Default: same as count regressors.

None
robust str

Standard error type.

"nonrobust"
cluster str

Cluster variable name.

None
maxiter int
200
tol float
1e-8
alpha float
0.05

Returns:

Type Description
EconometricResults

Coefficients for count, inflate, and dispersion parameter.

Examples:

>>> result = sp.zinb(data=df, y='doctor_visits', x=['age', 'income'],
...                  inflate=['age', 'chronic'])
>>> print(result.summary())
Notes

The NB2 parameterization uses dispersion parameter α so that Var(Y|μ) = μ + α·μ². When α → 0 the model collapses to ZIP.

See Cameron & Trivedi (2013, Ch. 4).
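
A small numerical sketch of the NB2 parameterization, using the standard NB2 probability mass function rather than anything taken from the library's source. It checks the variance formula Var(Y|μ) = μ + α·μ² and the Poisson limit as α → 0 for the count part; function names and parameter values are hypothetical.

```python
import math

def nb2_logpmf(y, mu, alpha):
    """Log pmf of the NB2 distribution with mean mu and dispersion alpha."""
    r = 1.0 / alpha
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            - r * math.log1p(alpha * mu)
            + y * (math.log(alpha * mu) - math.log1p(alpha * mu)))

def poisson_logpmf(y, mu):
    """Log pmf of the Poisson distribution with mean mu."""
    return y * math.log(mu) - mu - math.lgamma(y + 1)

mu, alpha = 2.0, 0.5
pmf = [math.exp(nb2_logpmf(y, mu, alpha)) for y in range(400)]
mean = sum(y * p for y, p in enumerate(pmf))
var = sum((y - mean) ** 2 * p for y, p in enumerate(pmf))

# With a tiny alpha the NB2 log pmf is numerically close to the Poisson log pmf
gap = abs(nb2_logpmf(3, mu, 1e-6) - poisson_logpmf(3, mu))
```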

hurdle

hurdle(formula: str = None, data: DataFrame = None, y: str = None, x: list = None, count_model: str = 'poisson', robust: str = 'nonrobust', cluster: str = None, maxiter: int = 200, tol: float = 1e-08, alpha: float = 0.05) -> EconometricResults

Hurdle (two-part) model for count data.

Part 1 (binary): logit model for P(Y > 0). Part 2 (count): truncated-at-zero Poisson or Negative Binomial for the distribution of Y | Y > 0.

Unlike zero-inflated models, ALL zeros come from the binary process.

Equivalent to R's pscl::hurdle().

Parameters:

Name Type Description Default
formula str

Patsy-style formula.

None
data DataFrame

Dataset.

None
y str

Dependent variable name.

None
x list of str

Regressors (used for both hurdle and count parts).

None
count_model str

Count distribution: "poisson" or "negbin".

"poisson"
robust str
"nonrobust"
cluster str
None
maxiter int
200
tol float
1e-8
alpha float
0.05

Returns:

Type Description
EconometricResults

Examples:

>>> result = sp.hurdle(data=df, y='doctor_visits', x=['age', 'income'],
...                    count_model='negbin')
>>> print(result.summary())
Notes

The hurdle log-likelihood decomposes as:

.. math:: \ell = \sum_{y_i=0} \log(1-p_i) + \sum_{y_i>0} [\log p_i + \log f(y_i|\mu_i) - \log(1 - f(0|\mu_i))]

where p_i = Λ(x_i'δ) is the hurdle probability.

See Mullahy (1986, Journal of Econometrics).
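
The decomposition above can be sketched for a Poisson count part, where f(0|μ) = exp(-μ). The function name and parameter values are hypothetical, and p and μ are taken as given rather than computed from covariates.

```python
import math

def hurdle_poisson_loglik_i(y, p, mu):
    """Log-likelihood contribution of one observation in a Poisson hurdle model."""
    if y == 0:
        return math.log(1.0 - p)                      # all zeros come from the binary part
    log_f = y * math.log(mu) - mu - math.lgamma(y + 1)
    # Positive counts: cross the hurdle, then draw from the zero-truncated Poisson
    return math.log(p) + log_f - math.log(1.0 - math.exp(-mu))

p, mu = 0.6, 1.5   # hypothetical hurdle probability and Poisson mean
total = sum(math.exp(hurdle_poisson_loglik_i(y, p, mu)) for y in range(60))
p_zero = math.exp(hurdle_poisson_loglik_i(0, p, mu))
```

Unlike the zero-inflated case, P(Y = 0) here equals 1 - p exactly and does not depend on μ.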

poisson

poisson(formula: str = None, data: DataFrame = None, y: str = None, x: list = None, robust: str = 'nonrobust', cluster: str = None, weights: str = None, offset: str = None, exposure: str = None, irr: bool = False, maxiter: int = 100, tol: float = 1e-08, alpha: float = 0.05) -> EconometricResults

Poisson regression via MLE (IRLS).

Parameters:

Name Type Description Default
formula str

Model formula, e.g. "y ~ x1 + x2".

None
data DataFrame

Data containing all variables.

None
y str

Dependent variable name (alternative to formula).

None
x list of str

Independent variable names (alternative to formula).

None
robust str

Standard error type: "nonrobust", "robust"/"hc0", "hc1".

"nonrobust"
cluster str

Variable name for clustered standard errors.

None
weights str

Frequency/analytic weight variable.

None
offset str

Offset variable (log of exposure already computed).

None
exposure str

Exposure variable (will be logged and used as offset).

None
irr bool

If True, report Incidence Rate Ratios (exp(beta)) instead of raw coefficients.

False
maxiter int

Maximum IRLS iterations.

100
tol float

Convergence tolerance.

1e-8
alpha float

Significance level for confidence intervals.

0.05

Returns:

Type Description
EconometricResults

Fitted model with params, standard errors, diagnostics.

Examples:

>>> import statspai as sp
>>> res = sp.poisson("num_awards ~ math + prog", data=df)
>>> print(res.summary())
>>> res_irr = sp.poisson("num_awards ~ math + prog", data=df,
...                       robust="robust", irr=True)

nbreg

nbreg(formula: str = None, data: DataFrame = None, y: str = None, x: list = None, robust: str = 'nonrobust', cluster: str = None, weights: str = None, offset: str = None, exposure: str = None, irr: bool = False, dispersion: str = 'mean', maxiter: int = 100, tol: float = 1e-08, alpha: float = 0.05) -> EconometricResults

Negative binomial regression (NB2 or NB1).

Parameters:

Name Type Description Default
formula str

Model formula, e.g. "y ~ x1 + x2".

None
data DataFrame

Data containing all variables.

None
y str

Dependent variable name (alternative to formula).

None
x list of str

Independent variable names (alternative to formula).

None
robust str

Standard error type: "nonrobust", "robust"/"hc0", "hc1".

"nonrobust"
cluster str

Variable name for clustered standard errors.

None
weights str

Weight variable name.

None
offset str

Offset variable (log of exposure).

None
exposure str

Exposure variable (will be logged).

None
irr bool

Report Incidence Rate Ratios.

False
dispersion str

Dispersion parameterization:
  • "mean" (NB2): Var(y) = mu + alpha * mu^2
  • "constant" (NB1): Var(y) = mu * (1 + delta)

"mean"
maxiter int

Maximum iterations.

100
tol float

Convergence tolerance.

1e-8
alpha float

Significance level.

0.05

Returns:

Type Description
EconometricResults

Examples:

>>> import statspai as sp
>>> res = sp.nbreg("days_absent ~ math + prog", data=df, irr=True)
>>> print(res.summary())

xtnbreg

xtnbreg(formula: str = None, data: DataFrame = None, y: str = None, x: Sequence[str] = None, entity: str = None, time: str = None, model: str = 'fe', time_effects: bool = False, robust: str = 'nonrobust', cluster: str = None, weights: str = None, offset: str = None, exposure: str = None, irr: bool = False, dispersion: str = 'mean', maxiter: int = 100, tol: float = 1e-08, alpha: float = 0.05) -> Any

Panel negative-binomial regression with Stata-like xtnbreg ergonomics.

model="fe" fits an unconditional fixed-effects NB model by adding explicit entity dummies through nbreg. This is appropriate for moderate panels and, most importantly, does not silently replace a count model with OLS. model="re" dispatches to sp.menbreg, the random-intercept NB2 GLMM.

Parameters:

Name Type Description Default
formula str

Count-model formula. For fixed effects you may pass "y ~ x1 + x2 | id" directly, or supply the entity argument.

None
data DataFrame

Long-format panel data.

None
y optional

Alternative to formula.

None
x optional

Alternative to formula.

None
entity str

Panel/unit identifier. Required when the formula does not contain a | id fixed-effect part.

None
time str

Time column. Stored as metadata; included as a fixed effect only when time_effects=True.

None
model (fe, re, pooled)

Fixed-effects, random-effects, or pooled negative binomial.

"fe"

Returns:

Type Description
EconometricResults or MEGLMResult

model="fe" and model="pooled" return EconometricResults; model="re" returns the multilevel MEGLMResult.

ppmlhdfe

ppmlhdfe(formula: str = None, data: DataFrame = None, y: str = None, x: list = None, absorb: str = None, robust: str = 'robust', cluster: str = None, weights: str = None, separation: bool = True, maxiter: int = 1000, tol: float = 1e-08, alpha: float = 0.05) -> EconometricResults

Poisson Pseudo-Maximum Likelihood (PPML) with high-dimensional fixed effects.

Implements the Santos Silva & Tenreyro (2006) PPML estimator, the standard approach for gravity models and other trade/economic settings where:

  • The dependent variable has zeros
  • Log-linearization would be inconsistent under heteroskedasticity
  • High-dimensional fixed effects (origin, destination, year) must be absorbed

Parameters:

Name Type Description Default
formula str

Model formula. Fixed effects can be specified via |: "trade ~ dist + contig | origin + destination + year"

None
data DataFrame

Data containing all variables.

None
y str

Dependent variable name (alternative to formula).

None
x list of str

Independent variable names (alternative to formula).

None
absorb str

Fixed effects to absorb, e.g. "origin + destination + year". Overrides any FE specification in the formula.

None
robust str

Default is robust SE (as in Stata's ppmlhdfe). Options: "robust"/"hc0", "hc1", "nonrobust".

"robust"
cluster str

Variable name for clustered standard errors (recommended for gravity models, e.g. cluster on country-pair).

None
weights str

Weight variable name.

None
separation bool

If True, check for separation (perfect prediction of zeros) and warn. Observations causing separation are not dropped automatically.

True
maxiter int

Maximum IRLS iterations.

1000
tol float

Convergence tolerance.

1e-8
alpha float

Significance level for confidence intervals.

0.05

Returns:

Type Description
EconometricResults
Notes

PPML is consistent under the assumption E[y|x] = exp(x'beta), regardless of the true conditional variance. With robust SE it is a quasi-maximum-likelihood estimator and does not assume Poisson variance.

References

Santos Silva, J.M.C. & Tenreyro, S. (2006). "The Log of Gravity." Review of Economics and Statistics, 88(4), 641-658.

Examples:

>>> import statspai as sp
>>> # Basic gravity model
>>> res = sp.ppmlhdfe("trade ~ dist + contig | origin + dest + year",
...                    data=df, cluster="pair_id")
>>> print(res.summary())
>>>
>>> # With absorb parameter instead of formula FE
>>> res = sp.ppmlhdfe("trade ~ dist + contig", data=df,
...                    absorb="origin + dest + year",
...                    cluster="pair_id")

mlogit

mlogit(formula: str = None, data: DataFrame = None, y: str = None, x: list = None, base: int = 0, robust: str = 'nonrobust', cluster: str = None, rrr: bool = False, maxiter: int = 100, tol: float = 1e-08, alpha: float = 0.05) -> EconometricResults

Multinomial logit for J > 2 unordered categories via MLE.

Equivalent to Stata's mlogit y x, base(0) or mlogit y x, rrr.

Parameters:

Name Type Description Default
formula str

Formula "y ~ x1 + x2".

None
data DataFrame

Data.

None
y str

Dependent variable (categorical, integer-coded).

None
x list of str

Regressors.

None
base int

Base / reference category (index into sorted unique values).

0
robust str

"robust" / "HC1" for Huber-White sandwich SE.

"nonrobust"
cluster str

Cluster variable for clustered SE.

None
rrr bool

Report Relative Risk Ratios (exp(beta)) instead of coefficients.

False
maxiter int
100
tol float
1e-8
alpha float
0.05

Returns:

Type Description
EconometricResults

Examples:

>>> result = sp.mlogit('choice ~ price + income', data=df, base=0)
>>> print(result.summary())
>>> result = sp.mlogit(data=df, y='choice', x=['price','income'], rrr=True)
Notes

Softmax parameterisation: β_j for each category j != base.

.. math:: P(Y_i = j | X_i) = \frac{\exp(X_i' \beta_j)} {\sum_{k} \exp(X_i' \beta_k)}, \quad \beta_{\text{base}} = 0.

McFadden pseudo-R^2 = 1 - LL / LL_0.
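
The softmax parameterisation can be sketched as follows, with the base category's coefficients fixed at zero. The function name, coefficient values, and regressor values are hypothetical and serve only to illustrate the formula.

```python
import math

def mlogit_probs(x, betas):
    """Category probabilities. `betas` maps each category to its coefficient
    vector; the base category's vector is all zeros."""
    scores = {j: sum(b * xi for b, xi in zip(beta, x)) for j, beta in betas.items()}
    m = max(scores.values())                       # subtract the max for numerical stability
    expo = {j: math.exp(s - m) for j, s in scores.items()}
    denom = sum(expo.values())
    return {j: e / denom for j, e in expo.items()}

x = [1.0, 2.0]                                     # constant and one regressor
betas = {0: [0.0, 0.0], 1: [0.5, -0.2], 2: [-1.0, 0.4]}   # category 0 is the base
probs = mlogit_probs(x, betas)

# Relative risk ratios are exp(beta) for each non-base category
rrr = {j: [math.exp(b) for b in beta] for j, beta in betas.items() if j != 0}
```

The log of the probability ratio between category j and the base equals X'β_j, which is why exp(β_j) is read as a relative risk ratio.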

ologit

ologit(formula: str = None, data: DataFrame = None, y: str = None, x: list = None, robust: str = 'nonrobust', cluster: str = None, maxiter: int = 100, tol: float = 1e-08, alpha: float = 0.05) -> EconometricResults

Ordered logit (proportional odds) model via MLE.

Equivalent to Stata's ologit y x.

Parameters:

Name Type Description Default
formula str

Formula "y ~ x1 + x2".

None
data DataFrame
None
y str

Ordered categorical dependent variable.

None
x list of str
None
robust str
"nonrobust"
cluster str
None
maxiter int
100
tol float
1e-8
alpha float
0.05

Returns:

Type Description
EconometricResults

Coefficients (beta) and cutpoints (kappa). result.predicted_probs gives per-category probabilities. result.marginal_effects gives AME per category. result.brant_test gives the Brant parallel-lines test.

Examples:

>>> result = sp.ologit('satisfaction ~ income + age', data=df)
>>> print(result.summary())
>>> result.brant_test  # parallel regression assumption
Notes

.. math:: P(Y \le j | X) = \Lambda(\kappa_j - X'\beta)

where Λ is the logistic CDF. The parallel regression (proportional odds) assumption requires that β is the same for each cumulative split.
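
Per-category probabilities follow from the cumulative probabilities by differencing. A minimal sketch, assuming a given linear index X'β and sorted cutpoints (both hypothetical values here); it is not the code behind result.predicted_probs.

```python
import math

def ologit_probs(xb, cutpoints):
    """P(Y = j) for j = 0..J-1 from P(Y <= j) = Lambda(kappa_j - xb)."""
    # Cumulative probabilities at each cutpoint, with P(Y <= last category) = 1
    cdf = [1.0 / (1.0 + math.exp(-(k - xb))) for k in cutpoints] + [1.0]
    probs, prev = [], 0.0
    for c in cdf:
        probs.append(c - prev)    # difference of adjacent cumulative probabilities
        prev = c
    return probs

probs = ologit_probs(xb=0.4, cutpoints=[-1.0, 0.5, 2.0])   # four ordered categories
```

Replacing the logistic CDF with the standard normal CDF (for example `statistics.NormalDist().cdf`) gives the ordered probit analogue.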

oprobit

oprobit(formula: str = None, data: DataFrame = None, y: str = None, x: list = None, robust: str = 'nonrobust', cluster: str = None, maxiter: int = 100, tol: float = 1e-08, alpha: float = 0.05) -> EconometricResults

Ordered probit model via MLE.

Equivalent to Stata's oprobit y x.

Parameters:

Name Type Description Default
formula str

Formula "y ~ x1 + x2".

None
data DataFrame
None
y str

Ordered categorical dependent variable.

None
x list of str
None
robust str
"nonrobust"
cluster str
None
maxiter int
100
tol float
1e-8
alpha float
0.05

Returns:

Type Description
EconometricResults

Same structure as ologit but with a probit link.

Examples:

>>> result = sp.oprobit(data=df, y='rating', x=['quality', 'price'])
>>> print(result.summary())
>>> result.marginal_effects
Notes

.. math:: P(Y \le j | X) = \Phi(\kappa_j - X'\beta)

where Φ is the standard normal CDF.

clogit

clogit(formula: str = None, data: DataFrame = None, y: str = None, x: list = None, group: str = None, robust: str = 'nonrobust', cluster: str = None, maxiter: int = 100, tol: float = 1e-08, alpha: float = 0.05) -> EconometricResults

McFadden's conditional (fixed-effect) logit for choice data.

Each observation is an alternative within a choice set (group). The dependent variable is 1 for the chosen alternative, 0 otherwise.

Equivalent to Stata's clogit y x, group(id).

Parameters:

Name Type Description Default
formula str

Formula "chosen ~ price + quality".

None
data DataFrame

Long-format data with one row per alternative per choice set.

None
y str

Binary indicator: 1 = chosen, 0 = not chosen.

None
x list of str

Alternative-specific (and/or individual-specific interacted with alternative dummies) covariates.

None
group str

Variable identifying the choice set / decision-maker.

None
robust str
"nonrobust"
cluster str
None
maxiter int
100
tol float
1e-8
alpha float
0.05

Returns:

Type Description
EconometricResults

Examples:

>>> result = sp.clogit('chosen ~ price + quality', data=df, group='case_id')
>>> print(result.summary())
Notes

The conditional log-likelihood for group g:

.. math:: \ell_g = X_{g,chosen}'\beta - \log\left(\sum_{j \in g} \exp(X_{gj}'\beta)\right)

Only alternative-specific variation identifies beta; the group fixed effect is conditioned out (no constant estimated).
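
The group log-likelihood, and the reason group-constant covariates drop out, can be shown with a short sketch. The function name, covariate values, and coefficients are hypothetical.

```python
import math

def clogit_group_loglik(X, chosen, beta):
    """Conditional log-likelihood of one choice set.
    X: list of covariate rows (one per alternative); chosen: index of the chosen row."""
    scores = [sum(b * v for b, v in zip(beta, row)) for row in X]
    m = max(scores)                                            # stabilise the log-sum-exp
    log_denom = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[chosen] - log_denom

# Three alternatives described by (price, quality); alternative 1 is chosen
X = [[1.0, 3.0], [2.0, 1.0], [0.5, 2.0]]
ll = clogit_group_loglik(X, chosen=1, beta=[-0.5, 0.3])

# Add a covariate that is constant within the group (e.g. the chooser's income):
# it shifts every score equally, so the log-likelihood does not change.
X_inc = [row + [5.0] for row in X]
ll_inc = clogit_group_loglik(X_inc, chosen=1, beta=[-0.5, 0.3, 0.7])

ll_zero = clogit_group_loglik(X, chosen=1, beta=[0.0, 0.0])    # equals -log(3)
```

Because the group-constant term cancels, its coefficient cannot be estimated unless it is interacted with alternative-specific variables.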