12 Generalized Linear Models

This chapter describes Generalized Linear Models (GLM), a statistical technique for linear modeling. Oracle Data Mining supports GLM for both regression and classification mining functions.

About Generalized Linear Models

Generalized Linear Models (GLM) include and extend the class of linear models described in "Linear Regression".

Linear models make a set of restrictive assumptions, most importantly, that the target (dependent variable y) is normally distributed conditioned on the value of the predictors, with a constant variance regardless of the predicted response value. The advantages of linear models and their restrictions include computational simplicity, an interpretable model form, and the ability to compute certain diagnostic information about the quality of the fit.

Generalized linear models relax these restrictions, which are often violated in practice. For example, binary (yes/no or 0/1) responses do not have the same variance across classes. Furthermore, the sum of terms in a linear model can typically range over very large negative and positive values, whereas for a binary response we would like the prediction to be a probability in the range [0,1].

Generalized linear models accommodate responses that violate the linear model assumptions through two mechanisms: a link function and a variance function. The link function transforms the target range so that it potentially spans -infinity to +infinity, which allows the simple form of linear models to be maintained. The variance function expresses the variance as a function of the predicted response, thereby accommodating responses with non-constant variances (such as binary responses).

Oracle Data Mining includes two of the most popular members of the GLM family of models with their most popular link and variance functions:

  • Linear regression with the identity link and variance function equal to the constant 1 (constant variance over the range of response values). See "Linear Regression".

  • Logistic regression with the logit link and binomial variance function. See "Logistic Regression".
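
The two link and variance pairs listed above can be sketched in a few lines of Python. This is only an illustration of the standard textbook definitions, not Oracle Data Mining code; all function names are hypothetical.

```python
import math

def logit(p):
    """Logit link: maps a probability in (0, 1) onto the whole real line."""
    return math.log(p / (1.0 - p))

def inverse_logit(eta):
    """Inverse link: maps a linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def binomial_variance(mu):
    """Binomial variance function: the variance depends on the predicted mean."""
    return mu * (1.0 - mu)

def identity(mu):
    """Identity link used by linear regression: no transformation."""
    return mu

def constant_variance(mu):
    """Variance function for linear regression: the constant 1."""
    return 1.0

# The linear predictor may be any real number; the probability stays in (0, 1)
# and the variance shrinks as the probability approaches 0 or 1.
for eta in (-5.0, 0.0, 5.0):
    p = inverse_logit(eta)
    print(f"eta={eta:+.1f}  p={p:.4f}  variance={binomial_variance(p):.4f}")
```

The binomial variance is largest at a probability of 0.5 and falls toward zero at the extremes, which is the non-constant variance that a plain linear model cannot express.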

GLM in Oracle Data Mining

GLM is a parametric modeling technique. Parametric models make assumptions about the distribution of the data. When the assumptions are met, parametric models can be more efficient than non-parametric models.

The challenge in developing models of this type involves assessing the extent to which the assumptions are met. For this reason, quality diagnostics are key to developing quality parametric models.

Interpretability and Transparency

Oracle Data Mining GLM models are easy to interpret. Each model build generates many statistics and diagnostics. Transparency is also a key feature: model details describe key characteristics of the coefficients, and global details provide high-level statistics.

Wide Data

Oracle Data Mining GLM is uniquely suited for handling wide data. The algorithm can build and score quality models that use a virtually limitless number of predictors (attributes). The only constraints are those imposed by system resources.

Confidence Bounds

GLM has the ability to predict confidence bounds. In addition to predicting a best estimate and a probability (classification only) for each row, GLM identifies an interval wherein the prediction (regression) or probability (classification) will lie. The width of the interval depends upon the precision of the model and a user-specified confidence level.

The confidence level is a measure of how sure the model is that the true value will lie within a confidence interval computed by the model. A popular choice for confidence level is 95%. For example, a model might predict that an employee's income is $125K, and that you can be 95% sure that it lies between $90K and $160K. Oracle Data Mining supports 95% confidence by default, but that value is configurable.
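
The relationship between confidence level and interval width can be illustrated with a small Python sketch. It assumes a simple normal approximation (estimate plus or minus z times a standard error), which is not necessarily how Oracle Data Mining computes its bounds; the standard error of 17.86 is a hypothetical value chosen to echo the income example.

```python
from statistics import NormalDist

def normal_interval(estimate, std_error, conf_level=0.95):
    """Two-sided interval: estimate +/- z * std_error (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + conf_level / 2.0)
    return estimate - z * std_error, estimate + z * std_error

# Hypothetical numbers (in $K): a 125 estimate with a standard error of 17.86.
low95, high95 = normal_interval(125.0, 17.86, 0.95)
low99, high99 = normal_interval(125.0, 17.86, 0.99)

print(f"95% interval: {low95:.1f} to {high95:.1f}")
print(f"99% interval: {low99:.1f} to {high99:.1f}")
```

Raising the confidence level widens the interval for the same model precision, and a more precise model (smaller standard error) narrows it.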

Note:

Confidence bounds are returned with the coefficient statistics. You can also use the PREDICTION_BOUNDS SQL function to obtain the confidence bounds of a model prediction. See Oracle Database SQL Language Reference.

Ridge Regression

The best regression models are those in which the predictors correlate highly with the target, but there is very little correlation between the predictors themselves. Multicollinearity is the term used to describe multivariate regression with correlated predictors.

Ridge regression is a technique that compensates for multicollinearity. Oracle Data Mining supports ridge regression for both regression and classification mining functions. The algorithm automatically uses ridge if it detects singularity (exact multicollinearity) in the data.

Information about singularity is returned in the global model details. See "Global Model Statistics for Linear Regression" and "Global Model Statistics for Logistic Regression".
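
To see why singularity matters, consider the following Python sketch. It solves the normal equations for two predictors without an intercept and adds a ridge parameter to the diagonal. It illustrates the general ridge technique under simplifying assumptions; the function names, the data, and the ridge value of 0.1 are hypothetical and do not reflect Oracle's internal algorithm.

```python
def gram_and_moment(x1, x2, y):
    """Return the entries of the 2x2 matrix X'X and the vector X'y."""
    a = sum(v * v for v in x1)
    b = sum(u * v for u, v in zip(x1, x2))
    d = sum(v * v for v in x2)
    g1 = sum(u * t for u, t in zip(x1, y))
    g2 = sum(v * t for v, t in zip(x2, y))
    return (a, b, d), (g1, g2)

def solve_ridge(x1, x2, y, ridge=0.0):
    """Solve (X'X + ridge * I) beta = X'y; return None if the system is singular."""
    (a, b, d), (g1, g2) = gram_and_moment(x1, x2, y)
    a, d = a + ridge, d + ridge
    det = a * d - b * b
    if abs(det) < 1e-12:
        return None
    return ((d * g1 - b * g2) / det, (a * g2 - b * g1) / det)

x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2.0 * v for v in x1]        # exact multicollinearity: x2 is a multiple of x1
y = [5.0, 10.0, 15.0, 20.0]       # y = 1*x1 + 2*x2 is one of many exact fits

print(solve_ridge(x1, x2, y))             # singular system, no unique solution
print(solve_ridge(x1, x2, y, ridge=0.1))  # ridge makes the system solvable
```

Without the ridge term the determinant is zero and no unique set of coefficients exists. With a small ridge value, the system has a single stable solution whose predictions remain close to the target.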

Build Settings for Ridge Regression

You can choose to explicitly enable ridge regression by specifying the GLMS_RIDGE_REGRESSION setting. If you explicitly enable ridge, you can use the system-generated ridge parameter or you can supply your own. If ridge is used automatically, the ridge parameter is also calculated automatically.

The build settings for ridge are summarized as follows:

  • GLMS_RIDGE_REGRESSION — Whether or not to override the automatic choice made by the algorithm regarding ridge regression.

  • GLMS_RIDGE_VALUE — The value of the ridge parameter, used only if you specifically enable ridge regression.

  • GLMS_VIF_FOR_RIDGE — Whether or not to produce Variance Inflation Factor (VIF) statistics when ridge is being used for linear regression.

Ridge and Confidence Bounds

Confidence bounds are not supported by models built with ridge regression. See "Confidence Bounds".

Ridge and Variance Inflation Factor for Linear Regression

GLM produces Variance Inflation Factor (VIF) statistics for linear regression models, unless they were built with ridge. You can explicitly request VIF with ridge by specifying the GLMS_VIF_FOR_RIDGE setting. The algorithm will produce VIF with ridge only if enough system resources are available.
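
As background, the conventional definition of the VIF for a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The Python sketch below covers only the two-predictor case, where R² reduces to the squared correlation between the predictors. It assumes this textbook definition; the data and function names are hypothetical.

```python
import math

def pearson_r(u, v):
    """Pearson correlation between two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when there are exactly two of them."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)

mild = vif_two_predictors([1, 2, 3, 4, 5], [3, 1, 5, 2, 4])
strong = vif_two_predictors([1, 2, 3, 4, 5], [1.1, 2.0, 2.9, 4.2, 5.0])

print(f"mildly correlated predictors:   VIF = {mild:.2f}")
print(f"strongly correlated predictors: VIF = {strong:.2f}")
```

A VIF near 1 indicates little correlation between predictors; the value grows without bound as the predictors approach exact multicollinearity.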

Ridge and Data Preparation

When ridge regression is enabled, different data preparation is likely to produce different results in terms of model coefficients and diagnostics. Oracle recommends that you enable Automatic Data Preparation for GLM models, especially when ridge regression is being used. See "Data Preparation for GLM".

Tuning and Diagnostics for GLM

The process of developing a GLM model typically involves a number of model builds. Each build generates many statistics that you can evaluate to determine the quality of your model. Depending on these diagnostics, you may want to try changing the model settings or making other modifications.

Build Settings

You can use build settings to specify:

  • Coefficient confidence — The GLMS_CONF_LEVEL setting indicates the degree of certainty that the true coefficient lies within the confidence bounds computed by the model. The default confidence is 0.95.

  • Row weights — The ODMS_ROW_WEIGHT_COLUMN_NAME setting identifies a column that contains a weighting factor for the rows.

  • Row diagnostics — The GLMS_DIAGNOSTICS_TABLE_NAME setting identifies a table to contain row-level diagnostics.

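Additional build settings are available to:

  • Control the use of ridge regression, as described in "Ridge Regression".

  • Specify the handling of missing values in the training data, as described in "Missing Values".

  • Specify the target value to be used as a reference in a logistic regression model, as described in "Reference Class".

See Oracle Database PL/SQL Packages and Types Reference for details about GLM settings.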

Diagnostics

GLM models generate many metrics to help you evaluate the quality of the model.

Coefficient Statistics

The same set of statistics is returned for both linear and logistic regression, but statistics that do not apply to the mining function are returned as NULL. The coefficient statistics are described in "Coefficient Statistics for Linear Regression" and "Coefficient Statistics for Logistic Regression".

Coefficient statistics are returned by the GET_MODEL_DETAILS_GLM function in DBMS_DATA_MINING.

Global Model Statistics

Separate high-level statistics describing the model as a whole are returned for linear and logistic regression. When ridge regression is enabled, fewer global details are returned (see "Ridge Regression"). The global model statistics are described in "Global Model Statistics for Linear Regression" and "Global Model Statistics for Logistic Regression".

Global statistics are returned by the GET_MODEL_DETAILS_GLOBAL function in DBMS_DATA_MINING.

Row Diagnostics

You can configure GLM models to generate per-row statistics by specifying the name of a diagnostics table in the build setting GLMS_DIAGNOSTICS_TABLE_NAME. The row diagnostics are described in "Row Diagnostics for Linear Regression" and "Row Diagnostics for Logistic Regression".

GLM requires a case ID to generate row diagnostics. If you provide the name of a diagnostics table but the data does not include a case ID column, an exception is raised.

Data Preparation for GLM

Automatic Data Preparation (ADP) implements suitable data transformations for both linear and logistic regression.

Note:

Oracle recommends that you use Automatic Data Preparation with GLM.

Data Preparation for Linear Regression

When ADP is enabled, the build data are standardized using a widely used correlation transformation (Neter et al., 1990). The data are first centered by subtracting the attribute means from the attribute values for each observation. Then the data are scaled by dividing each attribute value in an observation by the square root of the sum of squares per attribute across all observations. This transformation is applied to both numeric and categorical attributes.
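
The centering and scaling steps can be sketched in Python for a single attribute. The sketch assumes that the sum of squares is taken over the centered values, which is the usual form of the correlation transformation; the function name and data are hypothetical.

```python
import math

def correlation_transform(values):
    """Center a column, then scale it to a unit sum of squares."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    norm = math.sqrt(sum(c * c for c in centered))
    if norm == 0.0:
        raise ValueError("a constant column cannot be scaled")
    return [c / norm for c in centered]

age = [25.0, 35.0, 45.0, 55.0]
z = correlation_transform(age)

print([round(v, 4) for v in z])
print("sum:", round(sum(z), 12), " sum of squares:", round(sum(v * v for v in z), 12))
```

After this transformation each attribute has a mean of zero and a sum of squares of one, so the inner product of two transformed attributes equals their correlation coefficient.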

Prior to standardization, categorical attributes are exploded into N-1 columns where N is the attribute cardinality. The most frequent value (mode) is omitted during the explosion transformation. In the case of highest frequency ties, the attribute values are sorted alpha-numerically in ascending order, and the first value on the list is omitted during the explosion. This explosion transformation occurs whether or not ADP is enabled.
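
The explosion rule can be sketched as follows. The sketch uses Python's default string ordering as a stand-in for the alpha-numeric sort, which is an assumption; the function names and data are hypothetical.

```python
from collections import Counter

def reference_level(values):
    """Pick the omitted level: the mode, with ties broken by ascending sort order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)[0]

def explode(values):
    """Return (kept_levels, rows) with one 0/1 indicator column per kept level."""
    omitted = reference_level(values)
    kept = sorted(set(values) - {omitted})
    rows = [[1 if v == level else 0 for level in kept] for v in values]
    return kept, rows

colors = ["red", "blue", "green", "blue", "red"]
kept, rows = explode(colors)

print("omitted level:", reference_level(colors))
print("indicator columns:", kept)
for value, row in zip(colors, rows):
    print(f"{value:>5} -> {row}")
```

Here "blue" and "red" tie for the highest frequency, so the first value in sorted order ("blue") is omitted. An attribute with three distinct values therefore produces two indicator columns.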

In the case of high cardinality categorical attributes, the described transformations (explosion followed by standardization) can increase the build data size because the resulting data representation is dense. To reduce memory, disk space, and processing requirements, an alternative approach is used: for large datasets where the estimated internal dense representation would require more than 1 GB of disk space, categorical attributes are not standardized. Under these circumstances, the VIF statistic should be used with caution.

Reference:

Neter, J., Wasserman, W., and Kutner, M.H., "Applied Linear Statistical Models", Richard D. Irwin, Inc., Burr Ridge, IL, 1990.

Data Preparation for Logistic Regression

Categorical attributes are exploded into N-1 columns where N is the attribute cardinality. The most frequent value (mode) is omitted during the explosion transformation. In the case of highest frequency ties, the attribute values are sorted alpha-numerically in ascending order and the first value on the list is omitted during the explosion. This explosion transformation occurs whether or not ADP is enabled.

When ADP is enabled, numerical attributes are standardized by scaling the attribute values by a measure of attribute variability. This measure of variability is computed as the standard deviation per attribute with respect to the origin (not the mean) (Marquardt, 1980).
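
One way to read "standard deviation with respect to the origin" is the root mean square of the attribute values about zero. The Python sketch below follows that reading; the divisor n (rather than n - 1) and the function names are assumptions, not details confirmed by this chapter.

```python
import math

def origin_scale(values):
    """Variability about the origin: sqrt(sum(x^2) / n), with no centering."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def scale_by_origin(values):
    """Divide each value by the attribute's variability about the origin."""
    s = origin_scale(values)
    return [v / s for v in values]

income = [30.0, 40.0, 50.0]
scaled = scale_by_origin(income)

print("scale factor:", round(origin_scale(income), 4))
print("scaled values:", [round(v, 4) for v in scaled])
```

Because the values are scaled but not centered, a value of zero stays at zero and the signs of the attribute values are preserved.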

Reference:

Marquardt, D.W., "A Critique of Some Ridge Regression Methods: Comment", Journal of the American Statistical Association, Vol. 75, No. 369, 1980, pp. 87-91.

Missing Values

When building or applying a model, Oracle Data Mining automatically replaces missing values of numerical attributes with the mean and missing values of categorical attributes with the mode.
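
The default replacement rule can be sketched in Python, with None standing in for a missing value. This is only an illustration of mean and mode replacement; the function names and data are hypothetical, and the sketch does not show how Oracle Data Mining stores or reuses the replacement values.

```python
from collections import Counter

def impute_numeric(values):
    """Replace missing (None) numeric values with the mean of the present values."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

def impute_categorical(values):
    """Replace missing (None) categorical values with the most frequent value."""
    present = [v for v in values if v is not None]
    mode = Counter(present).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

ages = impute_numeric([1.0, None, 3.0])
grades = impute_categorical(["a", None, "a", "b"])

print(ages)
print(grades)
```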

You can configure a GLM model to override the default treatment of missing values. With the ODMS_MISSING_VALUE_TREATMENT setting, you can cause the algorithm to delete rows in the training data that have missing values instead of replacing them with the mean or the mode. However, when the model is applied, Oracle Data Mining will perform the usual mean/mode missing value replacement. As a result, statistics generated from scoring may not match the statistics generated from building the model.

If you want to delete rows with missing values when scoring the model, you must perform the transformation explicitly. To make build and apply statistics match, you must remove the rows with NULLs from the scoring data before performing the apply operation. You can do this by creating a view.

CREATE VIEW viewname AS
  SELECT * FROM tablename
   WHERE column_name1 IS NOT NULL
     AND column_name2 IS NOT NULL
     AND column_name3 IS NOT NULL .....

Note:

In Oracle Data Mining, missing values in nested data indicate sparsity, not values missing at random.

The value ODMS_MISSING_VALUE_DELETE_ROW is only valid for tables without nested columns. If this value is used with nested data, an exception is raised.

Linear Regression

Linear regression is the GLM regression algorithm supported by Oracle Data Mining. The algorithm assumes no target transformation and constant variance over the range of target values.

Coefficient Statistics for Linear Regression

GLM regression models generate the following coefficient statistics:

  • Linear coefficient estimate

  • Standard error of the coefficient estimate

  • t-value of the coefficient estimate

  • Probability of the t-value

  • Variance Inflation Factor (VIF)

  • Standardized estimate of the coefficient

  • Lower and upper confidence bounds of the coefficient

Global Model Statistics for Linear Regression

GLM regression models generate the following statistics that describe the model as a whole:

  • Model degrees of freedom

  • Model sum of squares

  • Model mean square

  • Model F statistic

  • Model F value probability

  • Error degrees of freedom

  • Error sum of squares

  • Error mean square

  • Corrected total degrees of freedom

  • Corrected total sum of squares

  • Root mean square error

  • Dependent mean

  • Coefficient of variation

  • R-Square

  • Adjusted R-Square

  • Akaike's information criterion

  • Schwarz's Bayesian information criterion

  • Estimated mean square error of the prediction

  • Hocking Sp statistic

  • JP statistic (the final prediction error)

  • Number of parameters (the number of coefficients, including the intercept)

  • Number of rows

  • Whether or not the model converged

  • Whether or not a covariance matrix was computed
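
Several of these statistics follow directly from the sums of squares. The Python sketch below uses the conventional textbook formulas for a least-squares fit with an intercept; it is not taken from Oracle's implementation, and the function name, dictionary keys, and data are hypothetical.

```python
import math

def linear_fit_summary(y, y_hat, n_params):
    """Summary statistics for a least-squares fit that includes an intercept."""
    n = len(y)
    mean_y = sum(y) / n
    sse = sum((a - b) ** 2 for a, b in zip(y, y_hat))   # error sum of squares
    sst = sum((a - mean_y) ** 2 for a in y)             # corrected total sum of squares
    ssm = sst - sse                                     # model sum of squares
    df_model = n_params - 1
    df_error = n - n_params
    mse = sse / df_error                                # error mean square
    rmse = math.sqrt(mse)
    return {
        "model_df": df_model,
        "error_df": df_error,
        "corrected_total_df": n - 1,
        "f_statistic": (ssm / df_model) / mse,
        "rmse": rmse,
        "dependent_mean": mean_y,
        "coeff_of_variation": 100.0 * rmse / mean_y,
        "r_square": 1.0 - sse / sst,
        "adj_r_square": 1.0 - (sse / df_error) / (sst / (n - 1)),
    }

# Targets and least-squares predictions for x = 1..5 (fit: y = 2.2 + 0.6 * x).
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]
summary = linear_fit_summary(y, y_hat, n_params=2)

for name, value in summary.items():
    print(f"{name:>20}: {value:.4f}")
```

The coefficient of variation is expressed here as a percentage of the dependent mean, which is a common convention but an assumption in this sketch.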

Row Diagnostics for Linear Regression

For linear regression, the diagnostics table has the columns described in Table 12-1. All the columns are NUMBER, except the CASE_ID column, which preserves the type from the training data.

Table 12-1 Diagnostics Table for GLM Regression Models

Column                 Description
CASE_ID                Value of the case ID column
TARGET_VALUE           Value of the target column
PREDICTED_VALUE        Value predicted by the model for the target
HAT                    Value of the diagonal element of the hat matrix
RESIDUAL               Measure of error
STD_ERR_RESIDUAL       Standard error of the residual
STUDENTIZED_RESIDUAL   Studentized residual
PRED_RES               Predicted residual
COOKS_D                Cook's D influence statistic
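
For readers who want to see how such per-row quantities relate to each other, the following Python sketch computes them for a regression with a single predictor. It uses the standard textbook definitions (internally studentized residuals, and the predicted residual as the residual divided by one minus the hat value). Whether Oracle Data Mining uses exactly these definitions is an assumption, and the function name and data are hypothetical.

```python
import math

def simple_regression_diagnostics(x, y):
    """Per-row diagnostics for a least-squares line y = intercept + slope * x."""
    n = len(x)
    p = 2  # number of parameters: intercept and slope
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    pred = [intercept + slope * a for a in x]
    resid = [b - f for b, f in zip(y, pred)]
    mse = sum(e * e for e in resid) / (n - p)
    rows = []
    for a, b, f, e in zip(x, y, pred, resid):
        hat = 1.0 / n + (a - mx) ** 2 / sxx      # diagonal of the hat matrix
        se = math.sqrt(mse * (1.0 - hat))        # standard error of the residual
        student = e / se
        rows.append({
            "TARGET_VALUE": b, "PREDICTED_VALUE": f, "HAT": hat, "RESIDUAL": e,
            "STD_ERR_RESIDUAL": se, "STUDENTIZED_RESIDUAL": student,
            "PRED_RES": e / (1.0 - hat),
            "COOKS_D": student ** 2 * hat / (p * (1.0 - hat)),
        })
    return rows

diag = simple_regression_diagnostics([1.0, 2.0, 3.0, 4.0, 5.0],
                                     [2.0, 4.0, 5.0, 4.0, 5.0])
for row in diag:
    print({k: round(v, 3) for k, v in row.items()})
```

Rows at the extremes of the predictor range have the largest hat values, so the same residual produces a larger Cook's D there than in the middle of the range.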


Logistic Regression

Binary logistic regression is the GLM classification algorithm supported by Oracle Data Mining. The algorithm uses the logit link function and the binomial variance function.

Reference Class

You can use the build setting GLMS_REFERENCE_CLASS_NAME to specify the target value to be used as a reference in a binary logistic regression model. Probabilities will be produced for the other (non-reference) class. By default, the algorithm chooses the value with the highest prevalence. If there are ties, the target values are sorted alpha-numerically in ascending order.

Class Weights

You can use the build setting CLAS_WEIGHTS_TABLE_NAME to specify the name of a class weights table. Class weights influence the weighting of target classes during the model build.

Coefficient Statistics for Logistic Regression

GLM classification models generate the following coefficient statistics:

  • Name of the predictor

  • Coefficient estimate

  • Standard error of the coefficient estimate

  • Wald chi-square value of the coefficient estimate

  • Probability of the Wald chi-square value

  • Standardized estimate of the coefficient

  • Lower and upper confidence bounds of the coefficient

  • Exponentiated coefficient

  • Exponentiated coefficient for the upper and lower confidence bounds of the coefficient
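
The relationships among these statistics can be sketched in Python for a single coefficient. The sketch uses the standard definitions: the Wald chi-square is the squared ratio of estimate to standard error, with one degree of freedom, and the bounds use a normal approximation. The estimate of 0.7 and standard error of 0.25 are hypothetical, as are the function name and dictionary keys.

```python
import math
from statistics import NormalDist

def logistic_coefficient_stats(estimate, std_error, conf_level=0.95):
    """Wald test, confidence bounds, and exponentiated values for one coefficient."""
    wald = (estimate / std_error) ** 2
    p_value = math.erfc(math.sqrt(wald / 2.0))   # chi-square upper tail, 1 d.f.
    z = NormalDist().inv_cdf(0.5 + conf_level / 2.0)
    lower, upper = estimate - z * std_error, estimate + z * std_error
    return {
        "wald_chi_square": wald,
        "p_value": p_value,
        "lower": lower,
        "upper": upper,
        "exp_coefficient": math.exp(estimate),
        "exp_lower": math.exp(lower),
        "exp_upper": math.exp(upper),
    }

stats = logistic_coefficient_stats(estimate=0.7, std_error=0.25)
for name, value in stats.items():
    print(f"{name:>16}: {value:.4f}")
```

The exponentiated coefficient is commonly read as an odds ratio: the multiplicative change in the odds of the non-reference class for a one-unit increase in the predictor.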

Global Model Statistics for Logistic Regression

GLM classification models generate the following statistics that describe the model as a whole:

  • Akaike's criterion for the fit of the intercept only model

  • Akaike's criterion for the fit of the intercept and the covariates (predictors) model

  • Schwarz's criterion for the fit of the intercept only model

  • Schwarz's criterion for the fit of the intercept and the covariates (predictors) model

  • -2 log likelihood of the intercept only model

  • -2 log likelihood of the model

  • Likelihood ratio degrees of freedom

  • Likelihood ratio chi-square probability value

  • Pseudo R-square Cox and Snell

  • Pseudo R-square Nagelkerke

  • Dependent mean

  • Percent of correct predictions

  • Percent of incorrect predictions

  • Percent of ties (probability for two cases is the same)

  • Number of parameters (the number of coefficients, including the intercept)

  • Number of rows

  • Whether or not the model converged

  • Whether or not a covariance matrix was computed
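
Many of these statistics derive from the log likelihood of the fitted model and of the intercept-only model. The Python sketch below applies the conventional formulas to a 0/1 target. The predicted probabilities are hypothetical rather than the output of an actual fit, and the function name and dictionary keys are illustrative only.

```python
import math

def logistic_fit_summary(y, p_hat, n_params):
    """Likelihood-based summary for a binary target coded 0/1."""
    n = len(y)
    base = sum(y) / n                 # probability under the intercept-only model

    def neg2_log_lik(probs):
        return -2.0 * sum(
            math.log(q) if t == 1 else math.log(1.0 - q)
            for t, q in zip(y, probs)
        )

    d0 = neg2_log_lik([base] * n)     # intercept-only model
    d1 = neg2_log_lik(p_hat)          # intercept and covariates
    cox_snell = 1.0 - math.exp(-(d0 - d1) / n)
    max_cox_snell = 1.0 - math.exp(-d0 / n)
    return {
        "neg2_log_lik_intercept_only": d0,
        "neg2_log_lik_model": d1,
        "aic_intercept_only": d0 + 2.0,
        "aic_model": d1 + 2.0 * n_params,
        "sc_intercept_only": d0 + math.log(n),
        "sc_model": d1 + n_params * math.log(n),
        "likelihood_ratio_chi_square": d0 - d1,
        "likelihood_ratio_df": n_params - 1,
        "pseudo_r2_cox_snell": cox_snell,
        "pseudo_r2_nagelkerke": cox_snell / max_cox_snell,
    }

y = [0, 0, 1, 1]
p_hat = [0.2, 0.3, 0.7, 0.8]          # hypothetical predicted probabilities
summary = logistic_fit_summary(y, p_hat, n_params=2)

for name, value in summary.items():
    print(f"{name:>30}: {value:.4f}")
```

The Nagelkerke statistic rescales the Cox and Snell statistic by its maximum attainable value, so it can reach 1 for a perfect fit.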

Row Diagnostics for Logistic Regression

For logistic regression, the diagnostics table has the columns described in Table 12-2. All the columns are NUMBER, except the CASE_ID and TARGET_VALUE columns, which preserve the type from the training data.

Table 12-2 Row Diagnostics Table for Logistic Regression

Column              Description
CASE_ID             Value of the case ID column
TARGET_VALUE        Value of the target column
TARGET_VALUE_PROB   Probability associated with the target value
HAT                 Value of the diagonal element of the hat matrix
WORKING_RESIDUAL    Residual with respect to the adjusted dependent variable
PEARSON_RESIDUAL    The raw residual scaled by the estimated standard deviation of the target
DEVIANCE_RESIDUAL   Contribution to the overall goodness of fit of the model
C                   Confidence interval displacement diagnostic
CBAR                Confidence interval displacement diagnostic
DIFDEV              Change in the deviance due to deleting an individual observation
DIFCHISQ            Change in the Pearson chi-square
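
Three of the residuals can be computed from the target and the predicted probability alone, as the Python sketch below shows. It uses the standard definitions for a binary target coded 0/1; whether these match Oracle Data Mining's exact formulas is an assumption. The remaining columns (HAT, C, CBAR, DIFDEV, DIFCHISQ) also depend on the predictor values and are not covered here.

```python
import math

def logistic_row_residuals(y, p):
    """Residuals for one row: y is 0 or 1, p is the predicted probability of 1."""
    raw = y - p
    variance = p * (1.0 - p)          # binomial variance function
    log_lik = math.log(p) if y == 1 else math.log(1.0 - p)
    return {
        "WORKING_RESIDUAL": raw / variance,
        "PEARSON_RESIDUAL": raw / math.sqrt(variance),
        "DEVIANCE_RESIDUAL": math.copysign(math.sqrt(-2.0 * log_lik), raw),
    }

good = logistic_row_residuals(1, 0.8)   # target agrees with the prediction
poor = logistic_row_residuals(0, 0.8)   # target disagrees with the prediction

print({k: round(v, 4) for k, v in good.items()})
print({k: round(v, 4) for k, v in poor.items()})
```

Rows where the target disagrees with a confident prediction produce residuals of larger magnitude, which makes them easy to spot in the diagnostics table.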