5 Building a Model

This chapter explains how to create data mining models and retrieve model details.

Note:

This chapter assumes a basic understanding of mining functions and algorithms, as described in Oracle Data Mining Concepts.

This chapter contains the following topics:

Steps in Building a Model

The general steps involved in creating a data mining model are summarized as follows:

  1. Prepare the data. (See Chapter 3.)

  2. Optionally, specify model settings. (See "Model Settings".)

  3. Create the model. (See "Creating a Model".)

  4. View model details. (See "Model Details".)

  5. Test the model. (See Oracle Data Mining Concepts for information about test metrics for classification and regression.)

  6. Evaluate the model with questions like: How accurate is the model? If there are rules associated with the model, what is their confidence and support? How well does the model address the business question?

Note:

To better understand this process, you can look at the source code of the sample data mining programs provided with Oracle Database. See "Sample Mining Models".

See Also:

Oracle Data Mining Concepts for more information about the process of building a model

Model Settings

Numerous settings are available for configuring data mining models at build time. To specify settings, create a settings table and pass it to the model creation process. A settings table is optional, because all settings have default values.

Model settings are documented in Oracle Database PL/SQL Packages and Types Reference. If you are using the HTML version of Oracle Data Mining Application Developer's Guide, the links in Table 5-1 will take you directly to the documentation for each type of model setting.

Table 5-1 Model Settings

Settings Documentation

Modify the mining function

See Oracle Database PL/SQL Packages and Types Reference

Modify the algorithm

See Oracle Database PL/SQL Packages and Types Reference

Set global model characteristics

See Oracle Database PL/SQL Packages and Types Reference

Enable or disable Automatic Data Preparation

See Oracle Database PL/SQL Packages and Types Reference


Specifying a Settings Table

A settings table must have the columns shown in Table 5-2.

Table 5-2 Settings Table Required Columns

Column Name Data Type

setting_name

VARCHAR2(30)

setting_value

VARCHAR2(4000)


The values inserted into the setting_name column are one or more of several constants defined in the DBMS_DATA_MINING package. Depending on what the setting name denotes, the value for the setting_value column can be a predefined constant or the actual numerical value corresponding to the setting itself. The setting_value column is defined to be VARCHAR2. You can explicitly cast numerical inputs to string using the TO_CHAR() function, or you can rely on the implicit type conversion provided by the Database.

Example 5-1 creates a settings table for an SVM classification model. Since SVM is not the default classifier, the ALGO_NAME setting is used to specify the algorithm. Setting the SVMS_KERNEL_FUNCTION to SVMS_LINEAR causes the model to be built with a linear kernel. If you do not specify the kernel function, the algorithm chooses the kernel based on the number of attributes in the data.

Example 5-1 Create a Settings Table

CREATE TABLE svmc_sh_sample_settings (
  setting_name VARCHAR2(30),
  setting_value VARCHAR2(4000));

BEGIN 
  INSERT INTO svmc_sh_sample_settings (setting_name, setting_value) VALUES
    (dbms_data_mining.algo_name, dbms_data_mining.algo_support_vector_machines);
  INSERT INTO svmc_sh_sample_settings (setting_name, setting_value) VALUES
    (dbms_data_mining.svms_kernel_function, dbms_data_mining.svms_linear);
  COMMIT;
END;
/

Specifying the Algorithm

The ALGO_NAME setting specifies the algorithm for a model. If you wish to use the default algorithm for a particular mining function, or if there is only one algorithm available for the mining function, you do not need to specify the ALGO_NAME setting.

Table 5-3 Data Mining Algorithms

ALGO_NAME Value Algorithm Default? Mining Model Function

ALGO_AI_MDL

Minimum Description Length

attribute importance

ALGO_APRIORI_ASSOCIATION_RULES

Apriori

association

ALGO_DECISION_TREE

Decision Tree

classification

ALGO_GENERALIZED_LINEAR_MODEL

Generalized Linear Model

classification and regression

ALGO_KMEANS

k-Means

yes

clustering

ALGO_NAIVE_BAYES

Naive Bayes

yes

classification

ALGO_NONNEGATIVE_MATRIX_FACTOR

Non-Negative Matrix Factorization

feature extraction

ALGO_O_CLUSTER

O-Cluster

clustering

ALGO_SUPPORT_VECTOR_MACHINES

Support Vector MachineSuppor

yes

default regression algorithm

regression, classification, and anomaly detection (classification with no target)


Specifying Costs

The CLAS_COST_TABLE_NAME setting specifies the name of a cost matrix table to be used in building a Decision Tree model. A cost matrix biases a classification model to minimize costly misclassifications. The cost matrix table must have the columns shown in Table 5-4.

Table 5-4 Cost Matrix Table Required Columns

Column Name Data Type

actual_target_value

CHAR, VARCHAR2, NUMBER, or FLOAT

predicted_target_value

CHAR, VARCHAR2, NUMBER, or FLOAT

cost

NUMBER


Decision Tree is the only algorithm that supports a cost matrix at build time. However, you can create a cost matrix and associate it with any classification model for scoring.

If you want to use costs for scoring, create a table with the columns shown in Table 5-4, and use the DBMS_DATA_MINING.ADD_COST_MATRIX procedure to add the cost matrix table to the model. You can also specify a cost matrix inline when invoking a PREDICTION function.

See Also:

Oracle Data Mining Concepts for information about costs

Specifying Prior Probabilities

The CLAS_PRIORS_TABLE_NAME setting specifies the name of a table of prior probabilities to be used in building a Naive Bayes model. Prior probabilities can be used to offset differences in distribution between the build data and the actual population. The priors table must have the columns shown in Table 5-5.

Table 5-5 Priors Table Required Columns

Column Name Data Type

target_value

CHAR, VARCHAR2, NUMBER, or FLOAT

prior_probability

NUMBER


See Also:

Oracle Data Mining Concepts for information about priors

Specifying Class Weights

The CLAS_WEIGHTS_TABLE_NAME setting specifies the name of a table of class weights to be used to bias a logistic regression (GLM classification) or SVM classification model to favor higher weighted classes. The weights table must have the columns shown in Table 5-6.

Table 5-6 Class Weights Table Required Columns

Column Name Data Type

target_value

CHAR, VARCHAR2, NUMBER, or FLOAT

class_weight

NUMBER


See Also:

Oracle Data Mining Concepts for information about class weights

Model Settings in the Data Dictionary

Information about mining model settings can be obtained from the data dictionary view ALL/USER/DBA_MINING_MODEL_SETTINGS. When used with the ALL prefix, this view returns information about the settings for the models accessible to the current user. When used with the USER prefix, it returns information about the settings for the models in the user's schema. The DBA prefix is only available for DBAs.

The columns of ALL_MINING_MODEL_SETTINGS are described as follows and explained in Table 5-7.

SQL> describe all_mining_model_settings
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 OWNER                                     NOT NULL VARCHAR2(30)
 MODEL_NAME                                NOT NULL VARCHAR2(30)
 SETTING_NAME                              NOT NULL VARCHAR2(30)
 SETTING_VALUE                                      VARCHAR2(4000)
 SETTING_TYPE                                       VARCHAR2(7)

Table 5-7 ALL_MINING_MODEL_SETTINGS

Column Description

owner

Owner of the mining model

model_name

Name of the mining model

setting_name

Name of the setting

setting_value

Value of the setting

setting_type

'INPUT' if the value is specified by a user; 'DEFAULT' if the value is system-generated


The following query lists the settings for the SVM classification model SVMC_SH_CLAS_SAMPLE. The ALGO_NAME, CLAS_PRIORS_TABLE_NAME, and SVMS_KERNEL_FUNCTION settings are user-specified. These settings have been specified in a settings table for the model.

Example 5-2 ALL_MINING_MODEL_SETTINGS

SQL> COLUMN setting_value FORMAT A35
SQL> SELECT setting_name, setting_value, setting_type
            FROM all_mining_model_settings
            WHERE model_name in 'SVMC_SH_CLAS_SAMPLE';
 
SETTING_NAME                   SETTING_VALUE                       SETTING
------------------------------ ----------------------------------- -------
SVMS_ACTIVE_LEARNING           SVMS_AL_ENABLE                      DEFAULT
PREP_AUTO                      OFF                                 DEFAULT
SVMS_COMPLEXITY_FACTOR         0.244212                            DEFAULT
SVMS_KERNEL_FUNCTION           SVMS_LINEAR                         INPUT
CLAS_WEIGHTS_TABLE_NAME        svmc_sh_sample_class_wt             INPUT
SVMS_CONV_TOLERANCE            .001                                DEFAULT
ALGO_NAME                      ALGO_SUPPORT_VECTOR_MACHINES        INPUT
 
7 rows selected.






















SQL> COLUMN setting_value FORMAT A25
SQL> SELECT setting_name, setting_value, setting_type 
                           FROM all_mining_model_settings 
           WHERE model_name in 'SVMC_SH_CLAS_SAMPLE';
 
SETTING_NAME                   SETTING_VALUE                  SETTING_TYPE
------------------------------ -------------------------      ------------
ALGO_NAME                      ALGO_SUPPORT_VECTOR_MACHINES   INPUT
SVMS_ACTIVE_LEARNING           SVMS_AL_ENABLE                 DEFAULT
CLAS_PRIORS_TABLE_NAME         svmc_sh_sample_priors          INPUT
PREP_AUTO                      OFF                            DEFAULT
SVMS_COMPLEXITY_FACTOR         0.244212                       DEFAULT
SVMS_KERNEL_FUNCTION           SVMS_LINEAR                    INPUT
SVMS_CONV_TOLERANCE            .001                           DEFAULT

Note:

Some model settings are determined by the algorithm if not specified in a settings table. You can find the system-generated setting values by querying the ALL_MINING_MODEL_SETTINGS view.

See Also:

Oracle Database PL/SQL Packages and Types Reference for details about model settings

Creating a Model

The CREATE_MODEL procedure in the DBMS_DATA_MINING package creates a mining model with the specified name, mining function, and case table (build data).

DBMS_DATA_MINING.CREATE_MODEL (
      model_name            IN VARCHAR2,
      mining_function       IN VARCHAR2,
      data_table_name       IN VARCHAR2,
      case_id_column_name   IN VARCHAR2,
      target_column_name    IN VARCHAR2 DEFAULT NULL,
      settings_table_name   IN VARCHAR2 DEFAULT NULL,
      data_schema_name      IN VARCHAR2 DEFAULT NULL,
      settings_schema_name  IN VARCHAR2 DEFAULT NULL,
      transform_list        IN DM_TRANSFORMS DEFAULT NULL;)

See Also:

DBMS_DATA_MINING.CREATE_MODEL in Oracle Database PL/SQL Packages and Types Reference

Mining Functions

The mining function is a required argument to the CREATE_MODEL procedure. A data mining function specifies a class of problems that can be modeled and solved.

Data mining functions implement either supervised or unsupervised learning. Supervised learning uses a set of independent attributes to predict the value of a dependent attribute or target. Unsupervised learning does not distinguish between dependent and independent attributes.

Supervised functions are predictive. Unsupervised functions are descriptive.

You can specify any of the values in Table 5-8 for the mining_function parameter to CREATE_MODEL.

Table 5-8 Mining Model Functions

Mining_Function Value Description

ASSOCIATION

Association is a descriptive mining function. An association model identifies relationships and the probability of their occurrence within a data set.

Association models use the Apriori algorithm.

ATTRIBUTE_IMPORTANCE

Attribute Importance is a predictive mining function. An attribute importance model identifies the relative importance of an attribute in predicting a given outcome.

Attribute Importance models use the Minimal Description Length algorithm.

CLASSIFICATION

Classification is a predictive mining function. A classification model uses historical data to predict a categorical target.

Classification models can use: Naive Bayes, Decision Tree, Logistic Regression, or Support Vector Machine algorithms. The default is Naive Bayes.

The classification function can also be used for anomaly detection. In this case, the SVM algorithm with a null target is used (One-Class SVM).

CLUSTERING

Clustering is a descriptive mining function. A clustering model identifies natural groupings within a data set.

Clustering models can use: k-Means or O-Cluster algorithms. The default is k-Means.

FEATURE_EXTRACTION

Feature Extraction is a descriptive mining function. A feature extraction model creates an optimized data set on which to base a model.

Feature extraction models use the Non-Negative Matrix Factorization algorithm.

REGRESSION

Regression is a predictive mining function. A regression model uses historical data to predict a numerical target.

Regression models can use Support Vector Machine or Linear Regression. The default is Support Vector Machine.


See Also:

Oracle Data Mining Concepts for an introduction to mining functions

Transformation List

You can optionally specify a list of transformations to be applied to the build data before it is acted on by the algorithm. You can use the STACK interface in DBMS_DATA_MINING_TRANSFORM to build a list of transformation expressions for different attributes, you can specify a single transformation using the XFORM interface in DBMS_DATA_MINING_TRANSFORM, or you can write your own SQL expressions.

The transformation list argument to CREATE_MODEL interacts with the PREP_AUTO setting, which controls Automatic Data Preparation (ADP):

  • When ADP is on and you specify a transformation list, your transformations are applied with the automatic transformations and embedded in the model.

  • When ADP is off and you specify a transformation list, your transformations are applied and embedded in the model, but no system-generated transformations are performed.

  • When ADP is on and you do not specify a transformation list, the system-generated transformations are applied and embedded in the model.

  • When ADP is off and you do not specify a transformation list, no transformations are embedded in the model; you must separately prepare the data sets you use for building, testing, and scoring the model. This is the pre-release 11 behavior; it is the default behavior in 11g.

See Also:

Oracle Data Mining Concepts for information about Automatic Data Preparation

Model Details

Model details describe model attributes, rules, and other information about the model. You can invoke a GET_MODEL_DETAILS function to retrieve model details. A separate GET_MODEL_DETAILS function exists for each algorithm.

Model details reverse the transformations applied to the attributes, thus enabling the information to be easily understood by a user. You can obtain the transformations embedded in the model by invoking the GET_MODEL_TRANSFORMATIONS function.

Model details, summarized in Table 5-9, support model transparency.

Table 5-9 Model Details

Algorithm Model Details

Apriori (association rules)

Association rules and frequent itemsets

Decision Tree

The full model with its content and rules

Generalized Linear Models

Attribute-level coefficient and statistics from GET_MODEL_DETAILS_GLM and global model information from GET_MODEL_DETAILS_GLOBAL

k-Means

For each cluster: statistics and hierarchy information, centroid, attribute histograms, and rules

MDL (attribute importance)

Ranked importance of each attribute

Naive Bayes

Conditional probabilities and priors

Non-Negative Matrix Factorization

Coefficients

O-Cluster

For each cluster: statistics and hierarchy information, centroid, attribute histograms, and rules

Support Vector Machine

Coefficients for linear models


Mining Model Schema Objects

Mining models are database schema objects. Several system and object privileges, described in "Users and Privileges" in Oracle Data Mining Administrator's Guide, govern data mining activities. Mining models also support SQL AUDIT and SQL COMMENT, as described in "Mining Model Schema Objects" in Oracle Data Mining Administrator's Guide.

Mining Models in the Data Dictionary

Information about mining model objects can be obtained from the data dictionary view ALL/USER/DBA_MINING_MODELS. When used with the ALL prefix, this view returns information about the mining models accessible to the current user. When used with the USER prefix, it returns information about the mining models in the user's schema. The DBA prefix is only available for DBAs.

The columns of ALL_MINING_MODELS are described as follows and explained in Table 5-10.

SQL> describe all_mining_models
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 OWNER                                     NOT NULL VARCHAR2(30)
 MODEL_NAME                                NOT NULL VARCHAR2(30)
 MINING_FUNCTION                                    VARCHAR2(30)
 ALGORITHM                                          VARCHAR2(30)
 CREATION_DATE                             NOT NULL DATE
 BUILD_DURATION                                     NUMBER
 MODEL_SIZE                                         NUMBER
 COMMENTS                                           VARCHAR2(4000)

Table 5-10 ALL_MINING_MODELS

Column Description

owner

Owner of the mining model.

model_name

Name of the mining model.

mining_function

The mining model function. See "Mining Functions".

algorithm

The algorithm used by the mining model. See "Specifying the Algorithm".

creation_date

The date on which the mining model was created.

build_duration

The duration of the mining model build process in seconds.

model_size

The size of the mining model in megabytes.

comments

Results of a SQL COMMENT applied to the mining model.


The query in Example 5-3 returns information about the mining models in the schema DMUSER.

Example 5-3 ALL_MINING_MODELS

SQL> select model_name, mining_function, algorithm, creation_date, build_duration
            FROM all_mining_models WHERE owner IN 'DMUSER';

MODEL_NAME              MINING_FUNCTION         ALGORITHM                  CREATION_DATE BUILD_DURA 
---------------------   ---------------------  --------------------------- ----------------------- 
AI_SH_SAMPLE            ATTRIBUTE_IMPORTANCE   MINIMUM_DESCRIPTION_LENGTH   13-JUN-07         1 
AR_SH_SAMPLE            ASSOCIATION_RULES      APRIORI_ASSOCIATION_RULES    13-JUN-07         5 
DT_SH_CLAS_SAMPLE       CLASSIFICATION         DECISION_TREE                13-JUN-07         4 
KM_SH_CLUS_SAMPLE       CLUSTERING             KMEANS                       13-JUN-07         7 
NB_SH_CLAS_SAMPLE       CLASSIFICATION         NAIVE_BAYES                  13-JUN-07         3 
OC_SH_CLUS_SAMPLE       CLUSTERING             O_CLUSTER                    13-JUN-07        14 
NMF_SH_SAMPLE           FEATURE_EXTRACTION     NONNEGATIVE_MATRIX_FACTOR    13-JUN-07         2 
SVMC_SH_CLAS_SAMPLE     CLASSIFICATION         SUPPORT_VECTOR_MACHINES      13-JUN-07         4 
GLMR_SH_REGR_SAMPLE     REGRESSION             GENERALIZED_LINEAR_MODEL     13-JUN-07         3 
GLMC_SH_CLAS_SAMPLE     CLASSIFICATION         GENERALIZED_LINEAR_MODEL     13-JUN-07         3 
SVMR_SH_REGR_SAMPLE     REGRESSION             SUPPORT_VECTOR_MACHINES      13-JUN-07         7 
SVMO_SH_CLAS_SAMPLE     CLASSIFICATION         SUPPORT_VECTOR_MACHINES      13-JUN-07         3 
T_SVM_CLAS_SAMPLE       CLASSIFICATION         SUPPORT_VECTOR_MACHINES      13-JUN-07         8 
T_NMF_SAMPLE            FEATURE_EXTRACTION     NONNEGATIVE_MATRIX_FACTOR    13-JUN-07         7

Mining Model Privileges

You need the CREATE MINING MODEL privilege to create models in your own schema. You can perform any operation on models that you own. This includes applying the model, adding a cost matrix, renaming the model, and dropping the model.

You can perform specific operations on mining models in other schemas if you have the appropriate system privileges. For example, CREATE ANY MINING MODEL enables you to create models in other schemas. SELECT ANY MINING MODEL enables you to apply models that reside in other schemas. You can add comments to models if you have the COMMENT ANY MINING MODEL privilege.

Sample Mining Models

The models listed in Example 5-3 are created by the Oracle Data Mining sample programs provided with Oracle Database. The sample programs, in PL/SQL and in Java, create mining models that illustrate each of the algorithms supported by Oracle Data Mining.

The sample programs are installed using Oracle Database Companion. Once installed, you can locate them in the rdbms/demo subdirectory under Oracle Home. You can list the sample PL/SQL data mining programs on a Linux system with commands like these.

> cd $ORACLE_HOME/rdbms/demo
> ls dm*.sql

Likewise, you can list the sample Java data mining programs with commands like the following:

> cd $ORACLE_HOME/rdbms/demo
> ls dm*.java

See Also:

Oracle Data Mining Administrator's Guide to learn how to install, configure, and execute the Data Mining sample programs.