2 A Tour of the Data Mining APIs

This chapter provides an overview of the PL/SQL, SQL, and Java interfaces to Oracle Data Mining.

Note:

The Oracle Data Mining Java API is deprecated in this release.

Oracle recommends that you not use deprecated features in new applications. Support for deprecated features is for backward compatibility only

This chapter contains the following sections:

Data Mining PL/SQL Packages

The PL/SQL interface to Oracle Data Mining is implemented in three packages:

  • DBMS_DATA_MINING, the primary interface to Oracle Data Mining

  • DBMS_DATA_MINING_TRANSFORM, convenience routines for data transformation

  • DBMS_PREDICTIVE_ANALYTICS, predictive analytics

DBMS_DATA_MINING

The DBMS_DATA_MINING package includes procedures for:

  • Creating, dropping, and renaming mining models

  • Applying a model to new data

  • Describing the model details

  • Creating costs and computing test metrics for a classification model

  • Exporting and importing models

CREATE_MODEL

The CREATE_MODEL procedure creates a mining model. The attributes, transformations, rules, and other information internal to the model are returned by GET_MODEL_DETAILS functions. You can also obtain information about mining models by querying data dictionary views, as described in "Data Mining Data Dictionary Views".

APPLY

The APPLY procedure creates a table with specific columns and populates the columns with mining results. The columns of this table vary based on the particular mining function and algorithm.

Note:

The Data Mining SQL functions, introduced in Oracle Database 10.2, are now generally the preferred method for applying Data Mining models. See "Data Mining SQL Functions".

DBMS_DATA_MINING_TRANSFORM

This package includes routines for transforming the data to make it suitable for mining. Since Oracle Data Mining supports Automatic Data Transformation (ADP), you will not need to use this package unless you want to implement specialized transformations.

You can supplement the ADP-generated transformations with additional transformations that you specify yourself, or you can elect to transform the data yourself instead of using ADP.

The routines in DBMS_DATA_MINING_TRANSFORM are convenience routines to assist you in creating your own transformations. If these routines do not entirely suit your needs, you can write SQL to modify their output, or you can write your own routines.

To specify transformations for a model, pass a transformation list to the DBMS_DATA_MINING.CREATE_MODEL procedure. You can use the STACK procedures in DBMS_DATA_MINING_TRANSFORM to build the transformation list.

Oracle Data Mining embeds automatic transformations and transformations you pass to CREATE_MODEL in the model. The embedded transformations are automatically applied to the data when the model is applied. You do not need to separately transform the test or scoring data.

See Also:

DBMS_PREDICTIVE_ANALYTICS

This package includes routines for predictive analytics, an automated form of data mining. With predictive analytics, you do not need to be aware of model building or scoring. All mining activities are handled internally by the predictive analytics procedure.

Predictive analytics routines prepare the data, build a model, score the model, and return the results of model scoring. Before exiting, they delete the model and supporting objects.

Oracle predictive analytics supports these routines:

  • EXPLAIN ranks attributes in order of influence in explaining a target column.

  • PREDICT predicts the value of a target column based on values in the input data.

  • PROFILE generates rules that describe the cases from the input data.

    See Also:

    DBMS_PREDICTIVE_ANALYTICS in Oracle Database PL/SQL Packages and Types Reference

Data Mining Data Dictionary Views

You can obtain information about mining models from the data dictionary. The data dictionary views for Oracle Data Mining are available for ALL_, USER_, and DBA_ access.

The Data Mining data dictionary views are summarized as follows:

  • ALL_MINING_MODELS returns information about the mining models to which you have access.

    See "Mining Model Schema Objects".

  • ALL_MINING_MODEL_ATTRIBUTES returns information about the attributes of the mining models to which you have access.

    See "About Attributes".

  • ALL_MINING_MODEL_SETTINGS returns information about the settings for the mining models to which you have access.

    See "Model Settings".

Data Mining SQL Functions

The built-in SQL functions for Data Mining implement scoring operations for models that have already been created in the database. They provide the following benefits:

  • Models can be easily deployed within the context of existing SQL applications.

  • Scoring operations take advantage of existing query execution functionality. This provides performance benefits, especially in the case of single row scoring.

  • Scoring results are pipelined, enabling the rows to be returned iteratively as they are produced. Pipelining is an optimization that greatly improves performance. For information about pipelining, see Oracle Database PL/SQL Language Reference.

Note:

SQL functions are built into Oracle Database and are available for use within SQL statements. SQL functions should not be confused with functions defined in PL/SQL packages.

When applied to a given row of scoring data, classification and regression models provide the best predicted value for the target and the associated probability of that value occurring. The SQL functions for prediction are described in Table 2-1.

Table 2-1 SQL Functions for Prediction

Function Description

PREDICTION

Returns the best prediction for the target

PREDICTION_BOUNDS

(GLM only) Returns the upper and lower bounds of the interval wherein the values (linear regression) or probabilities (logistic regression) will lie

PREDICTION_COST

Returns a measure of the cost of incorrect predictions

PREDICTION_DETAILS

Returns the rules of a Decision Tree model

PREDICTION_PROBABILITY

Returns the probability of a given prediction

PREDICTION_SET

Returns the results of a classification model, including the predictions and associated probabilities for each case


Applying a cluster model to a given row of scoring data returns the cluster ID and the probability of that row's membership in the cluster. The SQL functions for clustering are described in Table 2-2.

Table 2-2 SQL Functions for Clustering

Function Description

CLUSTER_ID

Returns the ID of the predicted cluster

CLUSTER_PROBABILITY

Returns the probability of a case belonging to a given cluster

CLUSTER_SET

Returns a list of all possible clusters to which a given case belongs along with the associated probability of inclusion


Applying a feature extraction model involves the mapping of features (sets of attributes) to columns in the scoring data set. The SQL functions for feature extraction are described in Table 2-3.

Table 2-3 SQL Functions for Feature Extraction

Function Description

FEATURE_ID

Returns the ID of the feature with the highest coefficient value

FEATURE_SET

Returns a list of objects containing all possible features along with the associated coefficients

FEATURE_VALUE

Returns the value of a given feature


Data Mining Java API

The Oracle Data Mining Java API is an Oracle implementation of the JDM standard (JSR-73) Java API. It is a thin API developed using the rich in-database functionality of Oracle Data Mining.

Note:

The Oracle Data Mining Java is deprecated in this release.

Oracle recommends that you not use deprecated features in new applications. Support for deprecated features is for backward compatibility only

The Oracle Data Mining Java API implements Oracle specific extensions to provide all the data mining features available in the database. All extensions are designed to be compliant with the JDM standards extension framework. All the mining functions and algorithms available in the database are exposed through the Oracle Data Mining Java API.

Oracle Database 10.2.0.1 introduced the JDM 1.0 standard compliant API that replaced the old Oracle proprietary Java API in the previous releases. Database 10.2.0.2 patch-set release extended the JDM standards support by implementing Oracle Data Mining Java API compatible with JDM 1.1.

In this release, the Oracle Data Mining Java API continues to be compatible with the JDM 1.1 and provides new data mining functionality in the database server as Oracle extensions. In this release new Oracle features include automatic and embedded data preparation, generalized linear models, transformation sequencing and task dependency specifications.

The JDM Standard

JDM is an industry standard Java API for data mining developed under the Java Community Process (JCP). It defines Java interfaces that vendors can implement for their Data Mining Engine (DME). It includes interfaces supporting mining functions such as classification, regression, clustering, attribute importance and association along with specific mining algorithms such as naïve bayes, support vector machines, decision trees, feed forward neural networks, and k-means.

An overview of the Java packages defined by the standards is listed in Table 2-4. For more details, refer to the Java documentation published with the standard at http://www.jcp.org. In the Go to JSR box, type in 73.

Table 2-4 JDM Standard Java Packages

Package Description

javax.datamining

Defines objects supporting all JDM subpackages.

javax.datamining.base

Defines objects supporting many top-level mining objects. Introduced to avoid cyclic package dependencies.

javax.datamining.resource

Defines objects that support connecting to the Data Mining ENgine and executing tasks.

javax.datamining.data

Defines objects supporting logical and physical data, model signature, taxonomy, category set and the generic super class category matrix.

javax.datamining.statistics

Defines objects supporting attribute statistics.

javax.datamining.rules

Defines objects supporting rules and their predicate components.

javax.datamining.task

Defines objects supporting tasks for build, compute statistics, import, and export. Task has an optional subpackage for apply since apply is used mainly for supervised and clustering functions.

javax.datamining.association

Defines objects supporting the build settings and model for association.

javax.datamining.clustering

Defines objects supporting the build settings and model for clustering.

javax.datamining.attributeimportance

Defines objects supporting the build settings and model for attribute importance.

javax.datamining.supervised

Defines objects supporting the build settings and models for supervised learning functions, specifically classification and regression, with corresponding optional packages. It also includes a common test task for the classification and regression functions.

javax.datamining.algorithm

Defines objects supporting the settings that are specific to algorithms. The algorithm package has optional sub packages for different algorithms.

javax.datamining.modeldetail

Defines objects supporting details of various model representation. Model Details has optional sub packages for different model details.


Oracle Extensions to JDM

Oracle extensions are defined to support the functionality that is not part of the JDM standards. This section gives an overview of these extensions.

Oracle extensions have the following major additional features:

  • Feature Extraction function with the Non-negative Matrix Factorization (NMF) algorithm

  • Generalized Linear Model algorithm for regression and classification functions

  • Oracle-proprietary clustering algorithm, Orthogonal Partitioning Clustering (O-CLuster)

  • Oracle-proprietary classification algorithm, Adaptive Bayes Network (ABN) (deprecated)

  • Automated and embedded data transformations

  • Predictive analytics tasks

An overview of the Oracle extensions higher-level Java packages is provided in Table 2-5.

Table 2-5 Oracle Extensions Higher-Level Packages

Package Description

oracle.dmt.jdm.featureextraction

Defines the objects related to the feature extraction function. Feature extraction supports the scoring operation.

oracle.dmt.jdm.algorithm.nmf

Defines the objects related to the Non-negative Matrix Factorization (NMF) algorithm.

oracle.dmt.jdm.algorithm.glm

oracle.dmt.jdm.modeldetail.glm

Defines the objects related to the Generalized Linear Model (GLM) algorithm.

oracle.dmt.jdm.algorithm.ocluster

Defines the objects related to the Orthogonal Clustering (O-Cluster) algorithm.

oracle.dmt.jdm.algorithm.abn

Defines the objects related to the Adaptive Bayes Network (ABN) algorithm (deprecated).

oracle.dmt.jdm.transform

Defines the objects related to the transformations.


Principal Objects in the Oracle Data Mining Java API

In JDM, named objects are objects that can be saved using the save method in the Connection. All the named objects are inherited from the javax.datamining.MiningObject interface. You can choose to persist the named objects either permanently (persistent objects) or only for the lifetime of the Connection object (transient objects).

Table 2-6 lists the JDM named objects supported by Oracle Data Mining.

Table 2-6 JDM Named Objects Supported by Oracle Data Mining

Persistent Objects Transient Objects Unsupported Objects

Model

Apply Settings

Logical Data

Build Settings

Physical Dataset

Taxonomy

Task

   

Cost Matrix

   

Test Metrics

   

Transformation sequence

   

Physical Data Set

Physical data sets refer to the data to be used as input to data mining operations. PhysicalDataSet objects reference specific data identified by a URI. Oracle Data Mining supports a table or view in the same database as a valid physical dataset URI. Syntax of the physical dataset URI is as follows:

Data URI Syntax:

[schemaName.] tableName/viewName

The PhysicalDataSet object can support multiple data representations. Oracle Data Mining supports two types of data representation: single-record case and wide data. (See Chapter 3 for details.) The Oracle Data Mining implementation requires users to specify the case ID column in the physical dataset.

A PhysicalDataSet object is a transient object in the Oracle Data Mining Java API. It is stored in the Connection as an in-memory object.

Build Settings

A BuildSettings object captures the high-level specification input for building a model. The API specifies mining functions: classification, regression, attribute importance, association, clustering, and feature extraction.

Build settings allow a user to specify the type of result desired without having to specify a particular algorithm. Although a build settings object allows for the specification of an algorithm and its settings, if the algorithm settings are omitted, the DME selects an algorithm based on the build settings and possibly characteristics of the data.

Build settings may also be validated for correct parameters using the verify method.

A BuildSettings object is persistent; it is stored as a table with a user-specified name in the user's schema. This settings table is interoperable with the PL/SQL API. Oracle recommends that you not modify the build settings table manually.

Task

The Execute method in the Connection object is used to start an execution of a mining task. Typically, mining operations are done using tables with millions of records, so the execution of operations such as a model build can take a long time.

JDM supports asynchronous execution of mining tasks using DBMS_SCHEDULER in the database. Each mining task is stored as a DBMS_SCHEDULER job object in the user's schema. When the user saves the task object, it creates a job object and sets the object to be in the DISABLED state. When the user executes a task, it enables the job to start execution.

To monitor tasks that are executed asynchronously, the Execute method returns a javax.datamining.ExecutionHandle object. It provides methods such as waitForCompletion and getStatus to retrieve the status details of the task.

Model

A Model object is the result of applying an algorithm to data as specified in a BuildSettings object.

Models can be used in several operations. They can be:

  • Inspected, for example to examine the rules produced from a decision tree or association

  • Tested for accuracy

  • Applied to data for scoring

  • Exported to an external representation such as native format or PMML

  • Imported for use in the DME

When a model is applied to data, it is submitted to the DME for interpretation. A Model object references its BuildSettings object as well as the Task that created it. as.

Test Metrics

A TestMetrics object is the result of testing a supervised model with test data. Based on the type of mining function, different test metrics are computed. For classification models, accuracy, confusion-matrix, lift, and receiver-operating characteristic can be computed to access the model. Similarly for regression models, R-squared and RMS errors can be computed.

Apply Settings

An ApplySettings object allows users to tailor the results of an apply task. It contains a set of ordered items. Output can consist of:

  • Data (key attributes) to be passed through to the output from the input dataset.

  • Values computed from the apply itself: score, probability, and in the case of decision trees, rule identifiers.

  • Multi-class categories for its associated probabilities. For example, in a classification model with target favoriteColor, users could select the specific colors to receive the probability that a given color is favorite.

Each mining function class defines a method to construct a default apply settings object. This simplifies the programmer's effort if only standard output is desired. For example, typical output for a classification apply would include the top prediction and its probability.

Transformation Sequence

A TransformationSequence object represents the sequence of transformations that are to be performed as part of a mining operation. For example, a Support Vector Machine model build involves outlier handling and normalization transformations. In addition to this, there can be new derived attribute creation and business transformations, and so on. Typically these transformations are reused for other model builds and hence applications can save the transformation sequence as a named object in the API.

Transformation sequences can be used either to perform transformations as a separate task or embed them to the modeling process by specifying it as one of the input objects for model building.

See Also:

Chapter 7 for more details about the Java API