6 Scoring and Deployment

This chapter explains how to use data mining models to mine your data.

This chapter contains the following sections:

In-Database Scoring

Scoring, the application of models to new data, is a primary objective of data mining. Once the models have been built, the challenges come in deploying them to obtain the best results, and in maintaining them within a production environment.

In traditional data mining, models are built using specialized software on a remote system and deployed to another system for scoring. This is a cumbersome, error-prone process open to security violations and difficulties in data synchronization.

With Oracle Data Mining, scoring is easy and secure. The scoring engine and the data both reside within the database. Scoring is an extension to the SQL language, so the results of mining can easily be incorporated into applications and reporting systems.

In-database mining provides security, backup and recovery, and high performance. It minimizes the IT effort needed to support data mining initiatives. Using standard database techniques, models can easily be refreshed (re-created) on more recent data and redeployed. The deployment is immediate since the scoring query remains the same; only the underlying model is replaced in the database.

What is Deployment?

Deploying a model means using it within a target environment. Model deployment could be:

  • Scoring data either for batch or real-time results

  • Extracting model details to produce reports. For example: the rules from a Decision Tree model, or the attribute rankings from an Attribute Importance model

  • Extending the business intelligence infrastructure of a data warehouse by incorporating mining results in applications or operational systems

  • Moving a model from the database where it was built to the database where it will be used for scoring (export/import)

See Also:

Oracle Data Mining Administrator's Guide for information about exporting and importing data mining models

Real-Time Scoring

Oracle Data Mining SQL functions enable prediction, clustering, and feature extraction analysis to be easily integrated into live production and operational systems. Because mining results are returned within SQL queries, mining can occur in real time.

With real-time scoring, point-of-sales database transactions can be mined. Predictions and rule sets can be generated to help front-line workers make better analytical decisions. Real-time scoring enables fraud detection, identification of potential liabilities, and recognition of better marketing and selling opportunities.

The query in Example 6-1 uses a Decision Tree model named dt_sh_clas_sample to predict the probability that customer 101488 will use an affinity card. A customer representative could retrieve this information in real time when talking to this customer on the phone. Based on the query result, the representative might offer an extra-value card, since there is a 73% chance that the customer will use a card.

Example 6-1 Real-Time Query with Prediction Probability

SELECT PREDICTION_PROBABILITY(dt_sh_clas_sample, 1 USING *) cust_card_prob
       FROM mining_data_apply_v
       WHERE cust_id = 101488;

CUST_CARD_PROB--------------    .727642276

Prediction

Oracle Data Mining supports six SQL functions that return results from predictive models (classification or regression).

Predictive models produce a target value for each row (case) in the scoring data. Each SQL function returns different information from the scoring results.

See Also:

Oracle Data Mining Concepts for information on classification and regression

Best Prediction

(Classification or regression). For classification, the PREDICTION function returns the target value that is predicted with the highest probability (or lowest cost, if costs are specified). For regression, PREDICTION returns the best predicted target value.

PREDICTION supports costs for classification. See "Cost-Sensitive Decision Making".

See Also:

Oracle Database SQL Language Reference for syntax and an example that uses PREDICTION

Confidence Bounds (GLM only)

(Classification or regression) The PREDICTION_BOUNDS function returns the upper and lower confidence bounds computed by the model.

Confidence is the degree of certainty that the true value (regression) or probability (classification) lies within the bounded interval. The default confidence is .95. Confidence can be specified by the user in the GLMS_CONF_LEVEL setting for the model. You can override the confidence associated with the model by specifying the confidence inline when you invoke the PREDICTION_BOUNDS function.

No confidence bounds are returned if ridge regression is being used by the algorithm.

See Also:

Costs

(Classification only) The PREDICTION_COST function returns the cost associated with the class that is predicted with the lowest cost. If you specify a class, the function returns the cost associated with that class.

Costs are a user-specified biasing mechanism for classification. See "Cost-Sensitive Decision Making".

See Also:

Oracle Database SQL Language Reference for syntax and an example that uses PREDICTION_COST

Rules (Decision Tree only)

(Classification only) The PREDICTION_DETAILS function returns the rule of a Decision Tree model corresponding to the given prediction. A rule is the condition (combination of attribute values) that leads to a specific classification.

Decision Tree rule identifiers are returned as XML. The full rules can be retrieved with the GET_MODEL_DETAILS_XML function.

See Also:

Probability

(Classification only) The PREDICTION_PROBABILITY function returns the probability associated with the best prediction (prediction with the highest probability) or the probability associated with the class that you specify.

See Also:

Oracle Database SQL Language Reference for syntax and an example that uses PREDICTION_PROBABILITY

Per-Class Results

(Classification only) The PREDICTION_SET function returns all the target classes, associated probabilities, and associated costs (if specified) for each scored row. You can specify parameters to restrict the output of the function.

See Also:

Clustering

Oracle Data Mining supports three SQL functions that return results when applying clustering models.

Clustering models assign each row to a cluster with an associated probability. Each SQL function returns different information from the scoring results.

See Also:

Oracle Data Mining Concepts for information on clustering

Cluster Identifier

The CLUSTER_ID function returns the identifier of the cluster predicted with the highest probability.

See Also:

Oracle Database SQL Language Reference for syntax and an example that uses CLUSTER_ID

Probability

The CLUSTER_PROBABILITY function returns the probability associated with the cluster to which cases are most likely to be assigned. If you specify a cluster ID, the function returns the probability associated with that cluster.

See Also:

Oracle Database SQL Language Reference for syntax and an example that uses CLUSTER_PROBABILITY

Per-Cluster Probabilities

The CLUSTER_SET function returns the probability associated with each cluster for each scored row. You can specify parameters to restrict the output of the function.

Feature Extraction

Oracle Data Mining supports three SQL functions that return results from feature extraction models.

Feature extraction models combine the attributes into a set of features that capture important characteristics of the data. The scoring process generates a value of each feature for each row. The value is a number that identifies the match quality of the case to the feature. Each SQL function returns different information from the scoring results.

Feature Identifier

The FEATURE_ID function returns the identifier of the feature with the highest value (match quality) for the data.

See Also:

Oracle Database SQL Language Reference for syntax and an example that uses FEATURE_ID

Match Quality

The FEATURE_VALUE function returns the highest feature value. If you specify a feature ID, the function returns the value of that feature.

See Also:

Oracle Database SQL Language Reference for syntax and an example that uses FEATURE_VALUE

Per-Feature Values

The FEATURE_SET function returns the values associated with each feature for each scored row. You can specify parameters to restrict the output of the function.

See Also:

Oracle Database SQL Language Reference for syntax and an example that uses FEATURE_SET

Save Scoring Results in a Table

If you wish to save the results of a scoring function, you can store them in a table.

This example shows how to save the results of scoring a customer response model.

UPDATE CUST_RESPONSE_APPLY_UPDATE
SET prediction = prediction(CUST_RESPONSE19964_DT using *),
     probability =  prediction_probability(CUST_RESPONSE19964_DT using *)

The table in question has all of the predictors, and has columns to hold the prediction and probability. The assumption is that any necessary transformations are embedded in the model (otherwise the using clause would need to contain them).

Cost-Sensitive Decision Making

Costs are user-specified numbers that bias classification. The algorithm uses positive numbers to penalize more expensive outcomes over less expensive outcomes. Higher numbers indicate higher costs. The algorithm uses negative numbers to favor more beneficial outcomes over less beneficial outcomes. Lower negative numbers indicate higher benefits.

All classification algorithms can use costs for scoring. You can specify the costs in a cost matrix table, or you can specify the costs inline when scoring. The PREDICTION, PREDICTION_COST, and PREDICTION_SET functions all support costs.

A sample cost matrix table is shown in Table 6-1.

Table 6-1 Sample Cost Matrix Table

ACTUAL_TARGET_VALUE PREDICTED_TARGET_VALUE COST

0

0

0

0

1

2

1

0

1

1

1

0


The cost matrix in Table 6-1 specifies costs for a binary target. The matrix indicates that the algorithm should treat a misclassified 0 as twice as costly as a misclassified 1. If the table name is cost_tbl and it is associated with the Naive Bayes model nb_sh_clas_sample, then the following query takes cost_tbl into account when scoring nb_sh_clas_sample. The output will be restricted to those rows where a prediction of 1 is less costly then a prediction of 0.

SELECT cust_gender, COUNT(*) AS cnt, ROUND(AVG(age)) AS avg_age
   FROM mining_data_apply_v
   WHERE PREDICTION (nb_sh_clas_sample COST MODEL
      USING cust_marital_status, education, household_size) = 1
   GROUP BY cust_gender
   ORDER BY cust_gender;

If there is a possibility that the cost matrix table is not present, or that a cost matrix was not specified for the model, you can use the AUTO keyword with COST MODEL so that scoring only uses costs if the cost matrix is available.

SELECT cust_gender, COUNT(*) AS cnt, ROUND(AVG(age)) AS avg_age
   FROM mining_data_apply_v
   WHERE PREDICTION (nb_sh_clas_sample COST MODEL AUTO
      USING cust_marital_status, education, household_size) = 1
   GROUP BY cust_gender
   ORDER BY cust_gender;

You can specify the costs inline when you invoke the scoring function. The inline costs are used for scoring even if a cost matrix table is associated with the model. Here is the same query with the costs specified inline.

SELECT cust_gender, COUNT(*) AS cnt, ROUND(AVG(age)) AS avg_age
   FROM mining_data_apply_v
   WHERE PREDICTION (nb_sh_clas_sample
                     COST (0,1) values ((0, 2),
                                        (1, 0))
                     USING cust_marital_status, education, household_size) = 1
   GROUP BY cust_gender
   ORDER BY cust_gender;

To associate a cost matrix table with a model for scoring, use the ADD_COST_MATRIX procedure in the DBMS_DATA_MINING package. You can retrieve the cost matrix with the GET_COST_MATRIX function. The REMOVE_COST_MATRIX procedure removes the cost matrix. If you want to use a different cost matrix table with a model, first remove the existing one then add the new one.

-- add cost matrix table cost_tbl
-- to model nb_sh_clas_sample
-- for scoring
--
EXEC DBMS_DATA_MINING.ADD_COST_MATRIX('nb_sh_clas_sample', 'cost_tbl');
--
-- replace cost_tbl with cost_tbl_2
--
EXEC DBMS_DATA_MINING.REMOVE_COST_MATRIX('nb_sh_clas_sample', 'cost_tbl');
EXEC DBMS_DATA_MINING.ADD_COST_MATRIX('nb_sh_clas_sample', 'cost_tbl_2');

The Decision Tree algorithm can use costs to bias the model build. If you want to create a Decision Tree model with costs, create a cost matrix table and provide its name in the CLAS_COST_TABLE_NAME setting for the model. If you specify costs when scoring the model, the cost matrix used to create the model will be used. If you want to use a different cost matrix table for scoring, first remove the existing cost matrix table then add the new one.

Batch Apply

Oracle Data Mining supports a batch apply operation that writes the results of scoring directly to a table. The columns in the table are mining function-dependent. The apply operation is accomplished by DBMS_DATA_MINING.APPLY.

APPLY creates an output table with the columns shown in Table 6-2.

Table 6-2 APPLY Output Table

Mining Function Output Columns

classification

CASE_ID

PREDICTION

PROBABILITY

regression

CASE_ID

PREDICTION

anomaly detection (one-class SVM)

CASE_ID

PREDICTION

PROBABILITY

clustering

CASE_ID

CLUSTER_ID

PROBABILITY

feature extraction

CASE_ID

FEATURE_ID

MATCH_QUALITY


Example 6-2 illustrates anomaly detection with APPLY. The query of the APPLY output table returns the ten first customers in the table. Each has a a probability for being typical (1) and a probability for being anomalous (0).

Example 6-2 Anomaly Detection with DBMS_DATA_MINING.APPLY

EXEC dbms_data_mining.apply
        ('SVMO_SH_Clas_sample','svmo_sh_sample_prepared', 
         'cust_id', 'one_class_output'); 

SELECT * from one_class_output where rownum < 11;
 
   CUST_ID PREDICTION PROBABILITY
---------- ---------- -----------
    101798          1  .567389309
    101798          0  .432610691
    102276          1  .564922469
    102276          0  .435077531
    102404          1   .51213544
    102404          0   .48786456
    101891          1  .563474346
    101891          0  .436525654
    102815          0  .500663683
    102815          1  .499336317

Comparing APPLY and SQL Scoring Functions

Whether performed by APPLY or by a SQL scoring function, scoring generates the same mining results. Classification produces a prediction and a probability for each case; clustering produces a cluster ID and a probability for each case, and so on. The difference lies in the way that scoring results are captured and the mechanisms that can be used for retrieving them.

Since APPLY output is stored separately from the scoring data, it must be joined to the scoring data to support queries that include the data attributes being mined (the scored rows). Thus any model that will be used with APPLY must have a case ID.

A case ID is not required for models that will be applied with SQL scoring functions. Likewise, storage and joins are not required, since scoring results are generated and consumed in real time within a SQL query.

The SQL scoring functions offer flexibility. You can invoke the function that returns the specific information you need. You simply reference a model and identify the kind of mining results you want to retrieve.