statspai.regression
regression
Regression module initialization
OLSRegression
Bases: BaseModel
OLS regression model with comprehensive functionality
fit
fit(robust: str = 'nonrobust', cluster: Optional[str] = None, **kwargs) -> EconometricResults
Fit the OLS model
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| robust | str | Type of standard errors | 'nonrobust' |
| cluster | str | Variable name for clustering | None |
| **kwargs | | Additional options | {} |
Returns:
| Type | Description |
|---|---|
| EconometricResults | Fitted model results |
predict
predict(data: Optional[DataFrame] = None, what: str = 'mean', alpha: float = 0.05, return_df: bool = False) -> ndarray | DataFrame
Generate predictions from the fitted OLS model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | New data at which to predict. If | None |
| what | (mean, confidence, prediction) | | "mean" |
| alpha | float | Significance level for the interval. | 0.05 |
| return_df | bool | Return a DataFrame with columns | False |
Returns:
| Type | Description |
|---|---|
| ndarray or DataFrame | Point predictions, optionally with interval columns. |
OLSEstimator
Bases: BaseEstimator
Ordinary Least Squares estimator with robust standard errors
estimate
estimate(y: ndarray, X: ndarray, robust: str = 'nonrobust', cluster: Optional[Series] = None, **kwargs) -> Dict[str, Any]
Estimate OLS parameters
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y | ndarray | Dependent variable | required |
| X | ndarray | Independent variables (including constant if desired) | required |
| robust | str | Type of standard errors ('nonrobust', 'hc0', 'hc1', 'hc2', 'hc3', 'hac') | 'nonrobust' |
| cluster | Series | Cluster variable for clustered standard errors | None |
| **kwargs | | Additional options | {} |
Returns:
| Type | Description |
|---|---|
| Dict[str, Any] | Estimation results |
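The robust options listed above are named after the heteroskedasticity-consistent (HC) family of covariance estimators. The following is a minimal sketch of what 'nonrobust', 'hc0' and 'hc1' conventionally mean for a single-regressor model. It assumes the textbook definitions and does not reproduce OLSEstimator's internals; the helper name simple_ols_robust is hypothetical.

```python
import math


def simple_ols_robust(x, y):
    """Fit y = a + b*x; return the slope with classical, HC0 and HC1 standard errors."""
    n, k = len(x), 2  # k counts the intercept and the slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

    # Classical ('nonrobust') variance: s^2 / Sxx with s^2 = RSS / (n - k).
    s2 = sum(e * e for e in resid) / (n - k)
    var_classical = s2 / sxx

    # HC0 sandwich for the slope: sum((x_i - x_bar)^2 * e_i^2) / Sxx^2.
    var_hc0 = sum((xi - x_bar) ** 2 * e * e for xi, e in zip(x, resid)) / sxx ** 2

    # HC1 rescales HC0 by the degrees-of-freedom factor n / (n - k).
    var_hc1 = var_hc0 * n / (n - k)

    return {
        "slope": slope,
        "se_nonrobust": math.sqrt(var_classical),
        "se_hc0": math.sqrt(var_hc0),
        "se_hc1": math.sqrt(var_hc1),
    }


# Illustrative data, not taken from the library.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.8, 5.3, 5.9]
fit = simple_ols_robust(x, y)
print(fit)
```

The point estimate is identical across the three options; only the standard error changes.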
IVRegression
Bases: BaseModel
Instrumental Variables regression model.
Supports multiple estimation methods via the method parameter: '2sls', 'liml', 'fuller', 'gmm', 'jive'.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| formula | str | Formula with IV syntax: | None |
| data | DataFrame | | None |
| method | str | Estimation method. | '2sls' |
| fuller_alpha | float | Fuller constant (only used when method='fuller'). | 1.0 |
| y | array-like | Alternative to formula interface. | None |
| X_exog | array-like | Alternative to formula interface. | None |
| X_endog | array-like | Alternative to formula interface. | None |
| Z | array-like | Alternative to formula interface. | None |
| var_names | array-like | Alternative to formula interface. | None |
first_stage (property)
First-stage diagnostics for each endogenous variable.
sargan_test (property)
Sargan/Hansen J overidentification test results.
fit
fit(robust: str = 'nonrobust', cluster: Optional[str] = None, **kwargs) -> EconometricResults
Fit the IV model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| robust | str | Standard-error type ('nonrobust', 'hc0', 'hc1', 'hc2', 'hc3'). | 'nonrobust' |
| cluster | str | Variable name for clustering. | None |
Returns:
| Type | Description |
|---|---|
| EconometricResults | |
predict
Generate predictions from the fitted IV model.
For a structural-form estimator, the natural forecast of y given new data is X_exog β_exog + X_endog β_endog: the observed values of the endogenous variables are plugged into the structural equation. Instruments are not used at prediction time.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | New data at which to predict. Must contain all exogenous and endogenous variables referenced by the model's formula. If | None |
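The structural-form forecast described above is a plain linear combination. This sketch illustrates it with lists of rows and made-up coefficients. The function name structural_predict and all numbers are hypothetical, and the code does not call IVRegression.predict.

```python
def structural_predict(x_exog, x_endog, beta_exog, beta_endog):
    """Return X_exog @ beta_exog + X_endog @ beta_endog, row by row.

    No instruments appear here: prediction only needs the observed
    exogenous and endogenous regressors plus the structural coefficients.
    """
    predictions = []
    for row_exog, row_endog in zip(x_exog, x_endog):
        exog_part = sum(b * v for b, v in zip(beta_exog, row_exog))
        endog_part = sum(b * v for b, v in zip(beta_endog, row_endog))
        predictions.append(exog_part + endog_part)
    return predictions


# Hypothetical coefficients: constant and one exogenous control, one endogenous regressor.
beta_exog = [1.0, 0.5]
beta_endog = [2.0]
new_exog = [[1.0, 2.0], [1.0, 4.0]]
new_endog = [[3.0], [0.5]]
y_hat = structural_predict(new_exog, new_endog, beta_exog, beta_endog)
print(y_hat)  # [8.0, 4.0]
```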
IVEstimator
Bases: BaseEstimator
Two-Stage Least Squares (2SLS) estimator.
Legacy class. Prefer using the iv() function directly.
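As a reminder of what two-stage least squares computes, here is a minimal sketch for one endogenous regressor and one instrument. It uses only the standard library and constructed data. The helper names are hypothetical, and neither IVEstimator nor iv() is called.

```python
def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope


def two_stage_least_squares(y, x_endog, z):
    """2SLS with a single endogenous regressor and a single instrument."""
    # Stage 1: project the endogenous regressor on the instrument.
    a1, b1 = simple_ols(z, x_endog)
    x_hat = [a1 + b1 * zi for zi in z]
    # Stage 2: regress the outcome on the fitted values.
    return simple_ols(x_hat, y)


# Constructed data: u is orthogonal to z in-sample but enters both x and y,
# so OLS of y on x is biased while the instrument recovers the slope of 2.
z = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
u = [0.5, -1.0, 0.5, 0.5, -1.0, 0.5]
x = [zi + ui for zi, ui in zip(z, u)]
y = [1.0 + 2.0 * xi + 3.0 * ui for xi, ui in zip(x, u)]

ols_intercept, ols_slope = simple_ols(x, y)
iv_intercept, iv_slope = two_stage_least_squares(y, x, z)
print("OLS slope:", ols_slope)
print("2SLS slope:", iv_slope)
```

Standard errors from a naive second-stage regression are not valid for 2SLS; the sketch covers point estimates only.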
GLMRegression
Bases: BaseModel
Generalized Linear Model with IRLS estimation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| formula | str | Model formula (e.g. | None |
| data | DataFrame | Data frame containing the variables. | None |
| y | ndarray | Response array (alternative to formula). | None |
| X | ndarray | Design matrix (alternative to formula). | None |
| var_names | list of str | Variable names when using | None |
| family | str | Distribution family. | 'gaussian' |
| link | str or None | Link function ( | None |
fit
fit(robust: str = 'nonrobust', cluster: Optional[str] = None, weights: Optional[str] = None, offset: Optional[str] = None, exposure: Optional[str] = None, maxiter: int = 100, tol: float = 1e-08, alpha: float = 0.05, **kwargs) -> EconometricResults
Fit the GLM.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| robust | str | Standard-error type. | 'nonrobust' |
| cluster | str | Cluster variable name. | None |
| weights | str | Weight variable name. | None |
| offset | str | Offset variable name. | None |
| exposure | str | Exposure variable name (log added as offset). | None |
| maxiter | int | Maximum IRLS iterations. | 100 |
| tol | float | Convergence tolerance. | 1e-08 |
| alpha | float | Significance level for confidence intervals. | 0.05 |
Returns:
| Type | Description |
|---|---|
| EconometricResults | |
predict
predict(data: Optional[DataFrame] = None, type: str = 'response', offset: Optional[ndarray] = None) -> ndarray
Generate predictions from the fitted model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | New data for prediction. Uses training data if | None |
| type | str | | 'response' |
| offset | ndarray | Offset for new data. | None |
Returns:
| Type | Description |
|---|---|
| ndarray | |
GLMEstimator
Bases: BaseEstimator
Generalized Linear Model estimator using IRLS
Implements Iteratively Reweighted Least Squares for maximum likelihood estimation of GLM parameters.
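To make the IRLS idea concrete, the sketch below fits a Poisson model with a log link and a single regressor using only the standard library. It is an illustration under simplifying assumptions (two coefficients, no weights or offset, convergence checked on the deviance), not GLMEstimator's code; the function name irls_poisson and the data are hypothetical.

```python
import math


def irls_poisson(x, y, maxiter=100, tol=1e-8):
    """Fit log E[y] = b0 + b1*x by IRLS; return (b0, b1, iterations used)."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start from the intercept-only fit
    deviance_old = float("inf")
    iteration = 0
    for iteration in range(1, maxiter + 1):
        eta = [b0 + b1 * xi for xi in x]
        mu = [math.exp(e) for e in eta]
        # Working response and weights for the log link with Poisson variance.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        w = mu
        # Weighted least squares of z on (1, x): solve the 2x2 normal equations.
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swz = sum(wi * zi for wi, zi in zip(w, z))
        swxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = sw * swxx - swx ** 2
        b0 = (swxx * swz - swx * swxz) / det
        b1 = (sw * swxz - swx * swz) / det
        # Poisson deviance at the updated coefficients.
        mu_new = [math.exp(b0 + b1 * xi) for xi in x]
        deviance = 2.0 * sum(
            (yi * math.log(yi / m) if yi > 0 else 0.0) - (yi - m)
            for yi, m in zip(y, mu_new)
        )
        if abs(deviance - deviance_old) < tol:
            break
        deviance_old = deviance
    return b0, b1, iteration


x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 2.0, 2.0, 4.0, 6.0, 9.0]
b0, b1, n_iter = irls_poisson(x, y)
print(b0, b1, n_iter)
```

At convergence the fitted means satisfy the Poisson score equations, which is a convenient check on any IRLS implementation.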
estimate
estimate(y: ndarray, X: ndarray, family: Family, link: LinkFunction, robust: str = 'nonrobust', cluster: Optional[Series] = None, weights: Optional[ndarray] = None, offset: Optional[ndarray] = None, maxiter: int = 100, tol: float = 1e-08, alpha_nb: Optional[float] = None, **kwargs) -> Dict[str, Any]
Estimate GLM parameters via IRLS.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y | ndarray | Response variable (n,). | required |
| X | ndarray | Design matrix (n, k) including intercept if desired. | required |
| family | Family | Distribution family instance. | required |
| link | LinkFunction | Link function instance. | required |
| robust | str | Standard-error type ('nonrobust', 'hc0', 'hc1', 'hc2', 'hc3', 'hac'). | 'nonrobust' |
| cluster | Series | Cluster variable. | None |
| weights | ndarray | Prior / frequency weights. | None |
| offset | ndarray | Known offset added to the linear predictor. | None |
| maxiter | int | Maximum IRLS iterations. | 100 |
| tol | float | Convergence tolerance on deviance. | 1e-08 |
| alpha_nb | float | If family is NB, initial alpha for joint estimation. | None |
Returns:
| Type | Description |
|---|---|
| Dict[str, Any] | |
regress
regress(formula: str, data: DataFrame, robust: str = 'nonrobust', cluster: Optional[str] = None, **kwargs) -> EconometricResults
Convenient function for OLS regression
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| formula | str | Regression formula | required |
| data | DataFrame | Data containing variables | required |
| robust | str | Type of standard errors | 'nonrobust' |
| cluster | str | Variable name for clustering | None |
| **kwargs | | Additional options | {} |
Returns:
| Type | Description |
|---|---|
| EconometricResults | Fitted model results |
Examples:
ivreg
ivreg(formula: str, data: DataFrame, robust: str = 'nonrobust', cluster: Optional[str] = None, **kwargs) -> EconometricResults
Instrumental variables regression (2SLS).
Deprecated: use sp.iv(formula, data, method='2sls') instead. ivreg is kept for backward compatibility.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| formula | str | IV formula: | required |
| data | DataFrame | | required |
| robust | str | | 'nonrobust' |
| cluster | str | | None |
Returns:
| Type | Description |
|---|---|
| EconometricResults | |
qreg
qreg(data: DataFrame, formula: Optional[str] = None, y: Optional[str] = None, x: Optional[List[str]] = None, quantile: float = 0.5, alpha: float = 0.05) -> CausalResult
Quantile regression at a single quantile.
Equivalent to Stata's qreg y x, quantile(0.5).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | | required |
| formula | str | Formula like | None |
| y | str | Outcome variable (alternative to formula). | None |
| x | list of str | Regressors (alternative to formula). | None |
| quantile | float | Quantile to estimate (0 < q < 1). 0.5 = median. | 0.5 |
| alpha | float | | 0.05 |
Returns:
| Type | Description |
|---|---|
| CausalResult | Coefficients at the specified quantile. |
Examples:
>>> # Median regression
>>> result = sp.qreg(df, y='wage', x=['education', 'experience'],
... quantile=0.5)
>>> # 90th percentile
>>> result = sp.qreg(df, y='wage', x=['education', 'experience'],
... quantile=0.9)
Notes
Quantile regression minimizes:
$$\min_\beta \sum_i \rho_\tau(Y_i - X_i'\beta)$$
where ρ_τ(u) = u(τ - 1(u < 0)) is the check function.
Standard errors are computed using the Powell (1991) sandwich estimator with a kernel density estimate of f(0|X).
See Koenker & Bassett (1978, Econometrica).
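The check function in the Notes can be evaluated directly. The sketch below implements ρ_τ and shows that, for an intercept-only model at τ = 0.5, the objective is minimized at the sample median even when an outlier is present. The helper names and data are illustrative, and qreg itself is not called.

```python
def check_loss(u, tau):
    """Check function rho_tau(u) = u * (tau - 1[u < 0])."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def quantile_objective(constant, y, tau):
    """Sum of check losses for an intercept-only quantile model."""
    return sum(check_loss(yi - constant, tau) for yi in y)


# One large outlier: the mean moves a lot, the median does not.
y = [1.0, 2.0, 3.0, 4.0, 100.0]

# Search over the observed values; a minimizer always lies at a data point.
best = min(y, key=lambda c: quantile_objective(c, y, 0.5))
print(best)  # 3.0, the sample median
```

For τ = 0.5 the check loss is half the absolute error, which is why median regression is also called least absolute deviations.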
sqreg
sqreg(data: DataFrame, y: str, x: List[str], quantiles: Optional[List[float]] = None, alpha: float = 0.05) -> DataFrame
Simultaneous quantile regression at multiple quantiles.
Equivalent to Stata's sqreg y x, quantiles(10 25 50 75 90).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | DataFrame | | required |
| y | str | | required |
| x | list of str | | required |
| quantiles | list of float | Default: [0.1, 0.25, 0.5, 0.75, 0.9]. | None |
| alpha | float | | 0.05 |
Returns:
| Type | Description |
|---|---|
| DataFrame | Rows: variables. Columns: quantiles with coefficients and SEs. |
Examples:
logit
logit(formula: str = None, data: DataFrame = None, y: str = None, x: list = None, robust: str = 'nonrobust', cluster: str = None, weights: str = None, marginal_effects: str = None, odds_ratio: bool = False, maxiter: int = 100, tol: float = 1e-08, alpha: float = 0.05, at_values: dict = None) -> EconometricResults
Logit (logistic) regression via maximum likelihood.
Equivalent to Stata's logit y x1 x2 or logistic (with odds_ratio=True).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| formula | str | Formula like | None |
| data | DataFrame | Data containing the variables. | None |
| y | str | Dependent variable name (alternative to formula). | None |
| x | list of str | Regressor names (alternative to formula). | None |
| robust | str | | 'nonrobust' |
| cluster | str | Column name for clustered standard errors. | None |
| weights | str | Column name for frequency/analytic weights. | None |
| marginal_effects | str | | None |
| odds_ratio | bool | Report odds ratios instead of log-odds coefficients. | False |
| maxiter | int | Maximum Newton-Raphson iterations. | 100 |
| tol | float | Convergence tolerance on log-likelihood change. | 1e-8 |
| alpha | float | Significance level for confidence intervals. | 0.05 |
| at_values | dict | Variable values for | None |
Returns:
| Type | Description |
|---|---|
| EconometricResults | Fitted model with |
Examples:
>>> import statspai as sp
>>> result = sp.logit("admit ~ gre + gpa + rank", data=df)
>>> print(result.summary())
probit
probit(formula: str = None, data: DataFrame = None, y: str = None, x: list = None, robust: str = 'nonrobust', cluster: str = None, weights: str = None, marginal_effects: str = None, maxiter: int = 100, tol: float = 1e-08, alpha: float = 0.05, at_values: dict = None) -> EconometricResults
Probit regression via maximum likelihood.
Equivalent to Stata's probit y x1 x2.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| formula | str | Formula like | None |
| data | DataFrame | Data containing the variables. | None |
| y | str | Dependent variable name (alternative to formula). | None |
| x | list of str | Regressor names (alternative to formula). | None |
| robust | str | | 'nonrobust' |
| cluster | str | Column name for clustered standard errors. | None |
| weights | str | Column name for frequency/analytic weights. | None |
| marginal_effects | str | | None |
| maxiter | int | Maximum Newton-Raphson iterations. | 100 |
| tol | float | Convergence tolerance on log-likelihood change. | 1e-8 |
| alpha | float | Significance level for confidence intervals. | 0.05 |
| at_values | dict | Variable values for | None |
Returns:
| Type | Description |
|---|---|
| EconometricResults | Fitted model with |
Examples:
cloglog
cloglog(formula: str = None, data: DataFrame = None, y: str = None, x: list = None, robust: str = 'nonrobust', cluster: str = None, weights: str = None, marginal_effects: str = None, maxiter: int = 100, tol: float = 1e-08, alpha: float = 0.05, at_values: dict = None) -> EconometricResults
Complementary log-log regression via maximum likelihood.
Appropriate when P(Y=1) is small (rare events) or when the latent distribution is asymmetric (extreme value type I).
Equivalent to Stata's cloglog y x1 x2.
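The asymmetry mentioned above is visible in the inverse link itself. This sketch compares the standard complementary log-log response function, 1 - exp(-exp(η)), with the logistic one. It assumes the textbook definitions rather than the library's internals, and the helper names are hypothetical.

```python
import math


def inv_cloglog(eta):
    """Inverse complementary log-log link: P(Y=1) = 1 - exp(-exp(eta))."""
    return 1.0 - math.exp(-math.exp(eta))


def inv_logit(eta):
    """Inverse logit link: P(Y=1) = 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + math.exp(-eta))


# The logit curve is symmetric around eta = 0: P(eta) + P(-eta) = 1.
logit_pair = inv_logit(1.0) + inv_logit(-1.0)

# The cloglog curve is not: it approaches 1 faster than it approaches 0.
cloglog_pair = inv_cloglog(1.0) + inv_cloglog(-1.0)

# For rare events the cloglog probability is close to exp(eta).
rare_event_gap = abs(inv_cloglog(-5.0) - math.exp(-5.0))

print(logit_pair, cloglog_pair, rare_event_gap)
```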
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| formula | str | Formula like | None |
| data | DataFrame | Data containing the variables. | None |
| y | str | Dependent variable name (alternative to formula). | None |
| x | list of str | Regressor names (alternative to formula). | None |
| robust | str | | 'nonrobust' |
| cluster | str | Column name for clustered standard errors. | None |
| weights | str | Column name for frequency/analytic weights. | None |
| marginal_effects | str | | None |
| maxiter | int | Maximum Newton-Raphson iterations. | 100 |
| tol | float | Convergence tolerance on log-likelihood change. | 1e-8 |
| alpha | float | Significance level for confidence intervals. | 0.05 |
| at_values | dict | Variable values for | None |
Returns:
| Type | Description |
|---|---|
| EconometricResults | Fitted model with |
Examples:
zip_model
zip_model(formula: str = None, data: DataFrame = None, y: str = None, x: list = None, inflate: list = None, robust: str = 'nonrobust', cluster: str = None, maxiter: int = 200, tol: float = 1e-08, alpha: float = 0.05) -> EconometricResults
Zero-Inflated Poisson (ZIP) regression via MLE.
Two-part model:
- Inflate equation: logit model for P(structural zero) = Λ(z'γ)
- Count equation: Poisson model with mean μ = exp(x'β)
Equivalent to Stata's zip y x, inflate(z).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| formula | str | Patsy-style formula for the count equation, e.g. "y ~ x1 + x2". | None |
| data | DataFrame | Dataset. | None |
| y | str | Dependent variable name (alternative to formula). | None |
| x | list of str | Count-equation regressors (alternative to formula). | None |
| inflate | list of str | Inflation-equation regressors. Default: same as count regressors. | None |
| robust | str | "nonrobust", "HC0", "HC1", etc. | "nonrobust" |
| cluster | str | Cluster variable name for clustered standard errors. | None |
| maxiter | int | Maximum iterations for optimizer. | 200 |
| tol | float | Convergence tolerance. | 1e-8 |
| alpha | float | Significance level for confidence intervals. | 0.05 |
Returns:
| Type | Description |
|---|---|
| EconometricResults | Coefficients for both equations, Vuong test, diagnostics. |
Examples:
>>> result = sp.zip_model(data=df, y='doctor_visits', x=['age', 'income'],
... inflate=['age', 'chronic'])
>>> print(result.summary())
Notes
Log-likelihood for ZIP:
$$
\ell_i =
\begin{cases}
\log\left[\pi_i + (1-\pi_i)\, e^{-\mu_i}\right], & y_i = 0 \\
\log(1-\pi_i) + y_i \log\mu_i - \mu_i - \log(y_i!), & y_i > 0
\end{cases}
$$
where π_i = Λ(z_i'γ) and μ_i = exp(x_i'β).
See Lambert (1992, Technometrics).
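The two branches of the ZIP log-likelihood translate directly into code. This sketch evaluates the log-likelihood for given μ_i and π_i using only the standard library; the function name zip_loglik is hypothetical and the optimizer used by zip_model is not shown.

```python
import math


def zip_loglik(y, mu, pi):
    """Zero-inflated Poisson log-likelihood for counts y, means mu, zero-inflation probs pi."""
    total = 0.0
    for yi, mi, pi_i in zip(y, mu, pi):
        if yi == 0:
            # A zero is either structural (prob pi) or a Poisson zero (prob exp(-mu)).
            total += math.log(pi_i + (1.0 - pi_i) * math.exp(-mi))
        else:
            # Positive counts can only come from the Poisson component.
            total += (math.log(1.0 - pi_i) + yi * math.log(mi)
                      - mi - math.lgamma(yi + 1))
    return total


y = [0, 0, 2, 5]
mu = [1.5, 1.5, 1.5, 1.5]

ll_zip = zip_loglik(y, mu, [0.3] * 4)
ll_no_inflation = zip_loglik(y, mu, [0.0] * 4)  # reduces to a plain Poisson model
print(ll_zip, ll_no_inflation)
```

Setting π_i = 0 for every observation recovers the ordinary Poisson log-likelihood, which is a quick sanity check on the formula.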
zinb
zinb(formula: str = None, data: DataFrame = None, y: str = None, x: list = None, inflate: list = None, robust: str = 'nonrobust', cluster: str = None, maxiter: int = 200, tol: float = 1e-08, alpha: float = 0.05) -> EconometricResults
Zero-Inflated Negative Binomial (ZINB) regression via MLE.
Two-part model:
- Inflate equation: logit for P(structural zero) = Λ(z'γ)
- Count equation: NB2 with mean μ = exp(x'β), Var = μ + α·μ²
Equivalent to Stata's zinb y x, inflate(z).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| formula | str | Patsy-style formula for the count equation. | None |
| data | DataFrame | Dataset. | None |
| y | str | Dependent variable name. | None |
| x | list of str | Count-equation regressors. | None |
| inflate | list of str | Inflation-equation regressors. Default: same as count regressors. | None |
| robust | str | Standard error type. | "nonrobust" |
| cluster | str | Cluster variable name. | None |
| maxiter | int | | 200 |
| tol | float | | 1e-8 |
| alpha | float | | 0.05 |
Returns:
| Type | Description |
|---|---|
| EconometricResults | Coefficients for count, inflate, and dispersion parameter. |
Examples:
>>> result = sp.zinb(data=df, y='doctor_visits', x=['age', 'income'],
... inflate=['age', 'chronic'])
>>> print(result.summary())
Notes
The NB2 parameterization uses dispersion parameter α so that Var(Y|μ) = μ + α·μ². When α → 0 the model collapses to ZIP.
See Cameron & Trivedi (2013, Ch. 4).
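The limiting behaviour as α → 0 can be checked numerically. The sketch below writes the NB2 log-pmf in the usual gamma-function form with r = 1/α and compares it with the Poisson log-pmf. It covers the count component only (no zero inflation), and the helper names are hypothetical.

```python
import math


def nb2_logpmf(y, mu, alpha):
    """NB2 log-pmf with mean mu and dispersion alpha (variance mu + alpha * mu^2)."""
    r = 1.0 / alpha
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))


def poisson_logpmf(y, mu):
    """Poisson log-pmf with mean mu."""
    return y * math.log(mu) - mu - math.lgamma(y + 1)


def nb2_variance(mu, alpha):
    """NB2 variance function."""
    return mu + alpha * mu ** 2


# With a tiny dispersion parameter the NB2 and Poisson log-pmfs nearly coincide.
gap_small_alpha = abs(nb2_logpmf(3, 2.0, 1e-6) - poisson_logpmf(3, 2.0))

# With alpha = 1 the distribution is clearly overdispersed relative to Poisson.
gap_large_alpha = abs(nb2_logpmf(3, 2.0, 1.0) - poisson_logpmf(3, 2.0))

print(gap_small_alpha, gap_large_alpha, nb2_variance(2.0, 1.0))
```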
hurdle
hurdle(formula: str = None, data: DataFrame = None, y: str = None, x: list = None, count_model: str = 'poisson', robust: str = 'nonrobust', cluster: str = None, maxiter: int = 200, tol: float = 1e-08, alpha: float = 0.05) -> EconometricResults
Hurdle (two-part) model for count data.
- Part 1 (binary): logit model for P(Y > 0).
- Part 2 (count): truncated-at-zero Poisson or Negative Binomial for the distribution of Y | Y > 0.
Unlike zero-inflated models, ALL zeros come from the binary process.
Equivalent to R's pscl::hurdle().
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| formula | str | Patsy-style formula. | None |
| data | DataFrame | Dataset. | None |
| y | str | Dependent variable name. | None |
| x | list of str | Regressors (used for both hurdle and count parts). | None |
| count_model | str | Count distribution: "poisson" or "negbin". | "poisson" |
| robust | str | | "nonrobust" |
| cluster | str | | None |
| maxiter | int | | 200 |
| tol | float | | 1e-8 |
| alpha | float | | 0.05 |
Returns:
| Type | Description |
|---|---|
| EconometricResults | |
Examples:
>>> result = sp.hurdle(data=df, y='doctor_visits', x=['age', 'income'],
... count_model='negbin')
>>> print(result.summary())
Notes
The hurdle log-likelihood decomposes as:
$$\ell = \sum_{y_i=0} \log(1-p_i) + \sum_{y_i>0} \left[\log p_i + \log f(y_i \mid \mu_i) - \log\bigl(1 - f(0 \mid \mu_i)\bigr)\right]$$
where p_i = Λ(x_i'δ) is the hurdle probability.
See Mullahy (1986, Journal of Econometrics).
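The decomposition above can be written out for the Poisson case, where f(0|μ) = exp(-μ). This sketch evaluates the hurdle log-likelihood for given p_i and μ_i and confirms that the implied probabilities sum to one. The function name is hypothetical and the estimation routine behind hurdle is not reproduced.

```python
import math


def hurdle_poisson_loglik(y, p, mu):
    """Hurdle log-likelihood with a logit hurdle (prob p of Y > 0) and zero-truncated Poisson counts."""
    total = 0.0
    for yi, p_i, mi in zip(y, p, mu):
        if yi == 0:
            # All zeros come from the binary part.
            total += math.log(1.0 - p_i)
        else:
            log_f = yi * math.log(mi) - mi - math.lgamma(yi + 1)
            log_one_minus_f0 = math.log(1.0 - math.exp(-mi))  # truncation at zero
            total += math.log(p_i) + log_f - log_one_minus_f0
    return total


# Probabilities implied by the model for a single observation with p = 0.7, mu = 2.
probabilities = [
    math.exp(hurdle_poisson_loglik([k], [0.7], [2.0])) for k in range(60)
]
total_probability = sum(probabilities)
print(probabilities[0], total_probability)
```

Because the two parts share no parameters in this likelihood, they can be maximized separately, which is the practical appeal of the hurdle specification.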
poisson
poisson(formula: str = None, data: DataFrame = None, y: str = None, x: list = None, robust: str = 'nonrobust', cluster: str = None, weights: str = None, offset: str = None, exposure: str = None, irr: bool = False, maxiter: int = 100, tol: float = 1e-08, alpha: float = 0.05) -> EconometricResults
Poisson regression via MLE (IRLS).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| formula | str | Model formula, e.g. "y ~ x1 + x2". | None |
| data | DataFrame | Data containing all variables. | None |
| y | str | Dependent variable name (alternative to formula). | None |
| x | list of str | Independent variable names (alternative to formula). | None |
| robust | str | Standard error type: "nonrobust", "robust"/"hc0", "hc1". | "nonrobust" |
| cluster | str | Variable name for clustered standard errors. | None |
| weights | str | Frequency/analytic weight variable. | None |
| offset | str | Offset variable (log of exposure already computed). | None |
| exposure | str | Exposure variable (will be logged and used as offset). | None |
| irr | bool | If True, report Incidence Rate Ratios (exp(beta)) instead of raw coefficients. | False |
| maxiter | int | Maximum IRLS iterations. | 100 |
| tol | float | Convergence tolerance. | 1e-8 |
| alpha | float | Significance level for confidence intervals. | 0.05 |
Returns:
| Type | Description |
|---|---|
| EconometricResults | Fitted model with params, standard errors, diagnostics. |
Examples:
nbreg
nbreg(formula: str = None, data: DataFrame = None, y: str = None, x: list = None, robust: str = 'nonrobust', cluster: str = None, weights: str = None, offset: str = None, exposure: str = None, irr: bool = False, dispersion: str = 'mean', maxiter: int = 100, tol: float = 1e-08, alpha: float = 0.05) -> EconometricResults
Negative binomial regression (NB2 or NB1).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| formula | str | Model formula, e.g. "y ~ x1 + x2". | None |
| data | DataFrame | Data containing all variables. | None |
| y | str | Dependent variable name (alternative to formula). | None |
| x | list of str | Independent variable names (alternative to formula). | None |
| robust | str | Standard error type: "nonrobust", "robust"/"hc0", "hc1". | "nonrobust" |
| cluster | str | Variable name for clustered standard errors. | None |
| weights | str | Weight variable name. | None |
| offset | str | Offset variable (log of exposure). | None |
| exposure | str | Exposure variable (will be logged). | None |
| irr | bool | Report Incidence Rate Ratios. | False |
| dispersion | str | Dispersion parameterization: "mean" (NB2): Var(y) = mu + alpha * mu^2; "constant" (NB1): Var(y) = mu * (1 + delta) | "mean" |
| maxiter | int | Maximum iterations. | 100 |
| tol | float | Convergence tolerance. | 1e-8 |
| alpha | float | Significance level. | 0.05 |
Returns:
| Type | Description |
|---|---|
| EconometricResults | |
Examples:
xtnbreg
xtnbreg(formula: str = None, data: DataFrame = None, y: str = None, x: Sequence[str] = None, entity: str = None, time: str = None, model: str = 'fe', time_effects: bool = False, robust: str = 'nonrobust', cluster: str = None, weights: str = None, offset: str = None, exposure: str = None, irr: bool = False, dispersion: str = 'mean', maxiter: int = 100, tol: float = 1e-08, alpha: float = 0.05) -> Any
Panel negative-binomial regression with Stata-like xtnbreg ergonomics.
model="fe" fits an unconditional fixed-effects NB model by adding explicit entity dummies through nbreg. This is appropriate for moderate panels and, most importantly, does not silently replace a count model with OLS. model="re" dispatches to sp.menbreg, the random-intercept NB-2 GLMM.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| formula | str | Count-model formula. For fixed effects you may pass | None |
| data | DataFrame | Long-format panel data. | None |
| y | optional | Alternative to | None |
| x | optional | Alternative to | None |
| entity | str | Panel/unit identifier. Required when the formula does not contain a | None |
| time | str | Time column. Stored as metadata; included as a fixed effect only when | None |
| model | (fe, re, pooled) | Fixed-effects, random-effects, or pooled negative binomial. | "fe" |
Returns:
| Type | Description |
|---|---|
| EconometricResults or MEGLMResult | |
ppmlhdfe
ppmlhdfe(formula: str = None, data: DataFrame = None, y: str = None, x: list = None, absorb: str = None, robust: str = 'robust', cluster: str = None, weights: str = None, separation: bool = True, maxiter: int = 1000, tol: float = 1e-08, alpha: float = 0.05) -> EconometricResults
Pseudo-Poisson Maximum Likelihood with high-dimensional fixed effects.
Implements the Santos Silva & Tenreyro (2006) PPML estimator, the standard approach for gravity models and other trade/economic settings where:
- The dependent variable has zeros
- Log-linearization would be inconsistent under heteroskedasticity
- High-dimensional fixed effects (origin, destination, year) must be absorbed
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| formula | str | Model formula. Fixed effects can be specified via | None |
| data | DataFrame | Data containing all variables. | None |
| y | str | Dependent variable name (alternative to formula). | None |
| x | list of str | Independent variable names (alternative to formula). | None |
| absorb | str | Fixed effects to absorb, e.g. | None |
| robust | str | Default is robust SE (as in Stata's ppmlhdfe). Options: "robust"/"hc0", "hc1", "nonrobust". | "robust" |
| cluster | str | Variable name for clustered standard errors (recommended for gravity models, e.g. cluster on country-pair). | None |
| weights | str | Weight variable name. | None |
| separation | bool | If True, check for separation (perfect prediction of zeros) and warn. Observations causing separation are not dropped automatically. | True |
| maxiter | int | Maximum IRLS iterations. | 1000 |
| tol | float | Convergence tolerance. | 1e-8 |
| alpha | float | Significance level for confidence intervals. | 0.05 |
Returns:
| Type | Description |
|---|---|
| EconometricResults | |
Notes
PPML is consistent under the assumption E[y|x] = exp(x'beta), regardless of the true conditional variance. With robust SE it is a quasi-MLE estimator and does not assume Poisson variance.
References
Santos Silva, J.M.C. & Tenreyro, S. (2006). "The Log of Gravity." Review of Economics and Statistics, 88(4), 641-658.
Examples:
>>> import statspai as sp
>>> # Basic gravity model
>>> res = sp.ppmlhdfe("trade ~ dist + contig | origin + dest + year",
... data=df, cluster="pair_id")
>>> print(res.summary())
>>>
>>> # With absorb parameter instead of formula FE
>>> res = sp.ppmlhdfe("trade ~ dist + contig", data=df,
... absorb="origin + dest + year",
... cluster="pair_id")
mlogit
mlogit(formula: str = None, data: DataFrame = None, y: str = None, x: list = None, base: int = 0, robust: str = 'nonrobust', cluster: str = None, rrr: bool = False, maxiter: int = 100, tol: float = 1e-08, alpha: float = 0.05) -> EconometricResults
Multinomial logit for J > 2 unordered categories via MLE.
Equivalent to Stata's mlogit y x, base(0) or mlogit y x, rrr.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| formula | str | Formula | None |
| data | DataFrame | Data. | None |
| y | str | Dependent variable (categorical, integer-coded). | None |
| x | list of str | Regressors. | None |
| base | int | Base / reference category (index into sorted unique values). | 0 |
| robust | str | | "nonrobust" |
| cluster | str | Cluster variable for clustered SE. | None |
| rrr | bool | Report Relative Risk Ratios (exp(beta)) instead of coefficients. | False |
| maxiter | int | | 100 |
| tol | float | | 1e-8 |
| alpha | float | | 0.05 |
Returns:
| Type | Description |
|---|---|
| EconometricResults | |
Examples:
>>> result = sp.mlogit('choice ~ price + income', data=df, base=0)
>>> print(result.summary())
>>> result = sp.mlogit(data=df, y='choice', x=['price','income'], rrr=True)
Notes
Softmax parameterisation: β_j for each category j != base.
$$P(Y_i = j \mid X_i) = \frac{\exp(X_i'\beta_j)}{\sum_{k} \exp(X_i'\beta_k)}, \quad \beta_{\text{base}} = 0.$$
McFadden pseudo-R^2 = 1 - LL / LL_0.
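The softmax parameterisation and the pseudo-R² formula can be illustrated in a few lines. The sketch below computes category probabilities for one observation, with the base category's coefficients fixed at zero, and evaluates McFadden's measure from two log-likelihood values. The helper names, coefficients and log-likelihoods are all made up for illustration.

```python
import math


def mlogit_probabilities(x, betas):
    """Softmax probabilities; betas holds one coefficient list per category."""
    scores = [sum(b * v for b, v in zip(beta_j, x)) for beta_j in betas]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mcfadden_r2(ll_model, ll_null):
    """McFadden pseudo-R^2 = 1 - LL / LL_0."""
    return 1.0 - ll_model / ll_null


x_row = [1.0, 0.5]      # constant and one regressor
betas = [
    [0.0, 0.0],         # base category: coefficients normalised to zero
    [0.2, -1.0],
    [-0.3, 0.8],
]
probs = mlogit_probabilities(x_row, betas)
print(probs, sum(probs))
print(mcfadden_r2(-80.0, -100.0))
```

Exponentiating a coefficient gives the relative risk ratio against the base category, which is what the rrr option reports.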
ologit
ologit(formula: str = None, data: DataFrame = None, y: str = None, x: list = None, robust: str = 'nonrobust', cluster: str = None, maxiter: int = 100, tol: float = 1e-08, alpha: float = 0.05) -> EconometricResults
Ordered logit (proportional odds) model via MLE.
Equivalent to Stata's ologit y x.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| formula | str | Formula | None |
| data | DataFrame | | None |
| y | str | Ordered categorical dependent variable. | None |
| x | list of str | | None |
| robust | str | | "nonrobust" |
| cluster | str | | None |
| maxiter | int | | 100 |
| tol | float | | 1e-8 |
| alpha | float | | 0.05 |
Returns:
| Type | Description |
|---|---|
| EconometricResults | Coefficients (beta) and cutpoints (kappa). |
Examples:
>>> result = sp.ologit('satisfaction ~ income + age', data=df)
>>> print(result.summary())
>>> result.brant_test # parallel regression assumption
Notes
$$P(Y \le j \mid X) = \Lambda(\kappa_j - X'\beta)$$
where Λ is the logistic CDF. The parallel regression (proportional odds) assumption requires that β is the same for each cumulative split.
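Category probabilities follow from differencing the cumulative probabilities. The sketch below does this for a given linear index X'β and a sorted list of cutpoints; the function name and the numbers are hypothetical, and ologit's fitted values are not used.

```python
import math


def ologit_category_probs(xb, cutpoints):
    """P(Y = j) for j = 1..J from P(Y <= j) = logistic(kappa_j - xb)."""
    cumulative = [1.0 / (1.0 + math.exp(-(kappa - xb))) for kappa in cutpoints]
    cumulative.append(1.0)  # P(Y <= J) = 1 for the top category
    probs, previous = [], 0.0
    for c in cumulative:
        probs.append(c - previous)
        previous = c
    return probs


# Three cutpoints define four ordered categories.
cutpoints = [-1.0, 0.5, 2.0]
low_index = ologit_category_probs(-1.0, cutpoints)
high_index = ologit_category_probs(1.5, cutpoints)
print(low_index)
print(high_index)
```

Raising X'β shifts probability mass toward higher categories while the cutpoints stay fixed, which is the proportional odds structure in action.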
oprobit
oprobit(formula: str = None, data: DataFrame = None, y: str = None, x: list = None, robust: str = 'nonrobust', cluster: str = None, maxiter: int = 100, tol: float = 1e-08, alpha: float = 0.05) -> EconometricResults
Ordered probit model via MLE.
Equivalent to Stata's oprobit y x.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| formula | str | Formula | None |
| data | DataFrame | | None |
| y | str | Ordered categorical dependent variable. | None |
| x | list of str | | None |
| robust | str | | "nonrobust" |
| cluster | str | | None |
| maxiter | int | | 100 |
| tol | float | | 1e-8 |
| alpha | float | | 0.05 |
Returns:
| Type | Description |
|---|---|
| EconometricResults | Same structure as :func: |
Examples:
>>> result = sp.oprobit(data=df, y='rating', x=['quality', 'price'])
>>> print(result.summary())
>>> result.marginal_effects
Notes
$$P(Y \le j \mid X) = \Phi(\kappa_j - X'\beta)$$
where Φ is the standard normal CDF.
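Differentiating the cumulative formula shows how each category probability responds to the linear index: ∂P(Y = j)/∂(X'β) = φ(κ_{j-1} - X'β) - φ(κ_j - X'β), with φ the standard normal density and the outermost terms equal to zero. The sketch below evaluates this expression. It is a derivation-based illustration; how the marginal_effects attribute in the example is actually computed is not documented on this page, and the helper names are hypothetical.

```python
import math


def norm_cdf(v):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))


def norm_pdf(v):
    """Standard normal density."""
    return math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)


def oprobit_probs(xb, cutpoints):
    """Category probabilities from P(Y <= j) = Phi(kappa_j - xb)."""
    cdf = [0.0] + [norm_cdf(k - xb) for k in cutpoints] + [1.0]
    return [hi - lo for lo, hi in zip(cdf, cdf[1:])]


def oprobit_index_effects(xb, cutpoints):
    """Derivative of each category probability with respect to the index xb."""
    dens = [0.0] + [norm_pdf(k - xb) for k in cutpoints] + [0.0]
    return [lo - hi for lo, hi in zip(dens, dens[1:])]


cutpoints = [-0.5, 1.0]
effects = oprobit_index_effects(0.3, cutpoints)
print(oprobit_probs(0.3, cutpoints))
print(effects)
```

The effects sum to zero across categories because the probabilities always sum to one; the effect on a regressor is the index effect times that regressor's coefficient.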
clogit
clogit(formula: str = None, data: DataFrame = None, y: str = None, x: list = None, group: str = None, robust: str = 'nonrobust', cluster: str = None, maxiter: int = 100, tol: float = 1e-08, alpha: float = 0.05) -> EconometricResults
McFadden's conditional (fixed-effect) logit for choice data.
Each observation is an alternative within a choice set (group). The dependent variable is 1 for the chosen alternative, 0 otherwise.
Equivalent to Stata's clogit y x, group(id).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| formula | str | Formula | None |
| data | DataFrame | Long-format data with one row per alternative per choice set. | None |
| y | str | Binary indicator: 1 = chosen, 0 = not chosen. | None |
| x | list of str | Alternative-specific (and/or individual-specific interacted with alternative dummies) covariates. | None |
| group | str | Variable identifying the choice set / decision-maker. | None |
| robust | str | | "nonrobust" |
| cluster | str | | None |
| maxiter | int | | 100 |
| tol | float | | 1e-8 |
| alpha | float | | 0.05 |
Returns:
| Type | Description |
|---|---|
| EconometricResults | |
Examples:
>>> result = sp.clogit('chosen ~ price + quality', data=df, group='case_id')
>>> print(result.summary())
Notes
The conditional log-likelihood for group g:
$$\ell_g = X_{g,\text{chosen}}'\beta - \log\left(\sum_{j \in g} \exp(X_{gj}'\beta)\right)$$
Only alternative-specific variation identifies beta; the group fixed effect is conditioned out (no constant estimated).
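The identification remark can be verified numerically: adding the same value to a covariate for every alternative in a group leaves the group's conditional log-likelihood unchanged. The sketch below evaluates ℓ_g for one choice set; the function name and data are hypothetical and clogit itself is not called.

```python
import math


def clogit_group_loglik(x_rows, chosen_index, beta):
    """Conditional log-likelihood of one choice set.

    x_rows holds one covariate list per alternative; chosen_index marks the
    alternative that was picked.
    """
    scores = [sum(b * v for b, v in zip(beta, row)) for row in x_rows]
    top = max(scores)  # log-sum-exp with the maximum factored out
    log_denominator = top + math.log(sum(math.exp(s - top) for s in scores))
    return scores[chosen_index] - log_denominator


beta = [-0.5, 1.0]                      # e.g. price and quality coefficients
alternatives = [[2.0, 1.0], [3.0, 0.5], [1.0, 2.0]]
ll_original = clogit_group_loglik(alternatives, 2, beta)

# Shift the first covariate by the same amount for every alternative in the group.
shifted = [[row[0] + 10.0, row[1]] for row in alternatives]
ll_shifted = clogit_group_loglik(shifted, 2, beta)

print(ll_original, ll_shifted)
```

Because the shift cancels between the numerator and the denominator, any variable that is constant within a group, including an intercept, drops out of the likelihood.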