20 Text Mining

About Unstructured Data

Data mining algorithms act on numerical and categorical data stored in relational databases or spreadsheets. Numerical data has a type such as INTEGER, DECIMAL, or FLOAT. Categorical data has a type such as CHAR or VARCHAR2.

What if you want to mine data items that are neither numerical nor categorical? Examples include web pages, document libraries, PowerPoint presentations, product specifications, emails, sound files, and digital images. What if you want to mine the information stored in long character strings, such as product descriptions, comment fields in reports, or call center notes?

Data that cannot be meaningfully interpreted as numerical or categorical is considered unstructured for purposes of data mining. It has been estimated that as much as 85% of enterprise data falls into this category. Extracting meaningful information from this unstructured data can be critical to the success of a business.

How Oracle Data Mining Supports Unstructured Data

Unstructured data may be binary objects, such as image or audio files, or text objects, which are language-based. Oracle Data Mining supports text objects.

The case table for Data Mining may include one or more columns of text (see "Mixed Data"), which can be designated as attributes. A text column cannot be used as a target. The case table itself must be a relational table; it cannot be created as a view.

Text must undergo a transformation process before it can be mined. Once the data has been properly transformed, the case table can be used for building, testing, or scoring data mining models. Most Oracle Data Mining algorithms support text. (See "Text Mining Algorithms".)

Mixed Data

Much of today's enterprise information includes both structured and unstructured content related to a given item of interest. Customer account data may include text fields that describe support calls and other interactions with the customer. Insurance claim data may include a claim status description, supporting documents, email correspondence, and other information. It is often essential that analytic applications evaluate the structured information together with the related unstructured information.

Oracle Data Mining offers this capability. You can use Oracle Data Mining to mine data sets that contain regular relational information (numeric and character columns), as well as one or more text columns.

Text Data Types

Oracle Data Mining supports text columns that have any of the data types shown in Table 20-1.

Table 20-1 Data Types for Text Columns

Data Type   Description
---------   -----------
BFILE       Locator to a large binary file stored outside the database
BLOB        Binary large object
CHAR        Fixed length character string
CLOB        Character large object
LONG        Long variable length character string
LONG RAW    Long variable length raw binary data
RAW         Raw binary data
VARCHAR2    Variable length character string
XMLTYPE     XML data


See Also:

Oracle Database SQL Language Reference for information about Oracle data types

Text Mining Algorithms

The Oracle Data Mining algorithms shown in Table 20-2 can be used for text mining.

Table 20-2 Oracle Data Mining Algorithms that Support Text

Algorithm                           Mining Function
---------                           ---------------
Naive Bayes                         Classification
Generalized Linear Models           Classification, Regression
Support Vector Machine              Classification, Regression, Anomaly Detection
k-Means                             Clustering
Non-Negative Matrix Factorization   Feature Extraction
Apriori                             Association Rules
Minimum Description Length          Attribute Importance


Oracle Data Mining supports text with all mining functions. As shown in Table 20-2, at least one algorithm per mining function has text mining capability.

Classification, clustering, and feature extraction have important applications in pure text mining. Other functions, such as regression and anomaly detection, are more suited for mining mixed data (both structured and unstructured).

Text Classification

Text classification is the process of categorizing documents: for example, by subject or author. Most document classification applications use either multi-class classification or multi-target classification.

Multi-Class Document Classification

In multi-class document classification, each document is assigned a probability for each category, and the probabilities add to 1. For example, if the categories are economics, math, and physics, document A might be 20% likely to be economics, 50% likely to be math, and 30% likely to be physics.

This approach to document classification is supported by Oracle Data Mining and by Oracle Text.

Multi-Target Document Classification

In multi-target document classification, each document is assigned, for every category, a probability of being in the category and a probability of not being in it, and these two probabilities add to 1. Given categories economics, math, and physics, document A might be classified as: 30% likely to be economics and 70% likely not to be economics; 65% likely to be math and 35% likely not to be math; 40% likely to be physics and 60% likely not to be physics.

In multi-target document classification, each category is a separate binary target. Each document is scored for each target.

This approach to document classification is supported by Oracle Text but not by Oracle Data Mining. However, you can obtain similar results by building a single binary classification model for each category and then scoring all the models separately in a single SQL scoring query.
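The difference between the two schemes comes down to how the probabilities are normalized. The following Python sketch illustrates this with made-up scores. The function names, the scores, and the use of softmax and logistic normalization are assumptions chosen for illustration; they do not represent Oracle Data Mining or Oracle Text internals.

```python
import math

def multi_class_probs(scores):
    """Normalize per-category scores so the probabilities sum to 1 across categories."""
    # Softmax is one common normalization; it is an assumption here.
    top = max(scores.values())
    exps = {cat: math.exp(s - top) for cat, s in scores.items()}
    total = sum(exps.values())
    return {cat: e / total for cat, e in exps.items()}

def multi_target_probs(scores):
    """Treat each category as a separate binary target: P(in) + P(not in) = 1."""
    result = {}
    for cat, s in scores.items():
        p_in = 1.0 / (1.0 + math.exp(-s))  # logistic function, also an assumption
        result[cat] = {"in": p_in, "not_in": 1.0 - p_in}
    return result

# Hypothetical raw scores for one document.
scores = {"economics": -0.8, "math": 0.6, "physics": -0.4}
mc = multi_class_probs(scores)
mt = multi_target_probs(scores)
```

In `mc` the three probabilities sum to 1 across categories. In `mt` each category carries its own pair of probabilities that sum to 1, which is the shape of result you would get by scoring one binary model per category.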

Document Classification Algorithms

Oracle Data Mining supports three classification algorithms that are well suited to text mining applications. All three can easily process thousands of text features (see "Preparing Text for Mining" for information about text features), and all three are easy to train with small or large amounts of data. The algorithms are:

  • Support Vector Machine (SVM)

  • Generalized Linear Models (GLM)

  • Naive Bayes

Text Clustering

The main applications of clustering in text mining are:

  • Simple clustering. This refers to the creation of clusters of text features (see "Preparing Text for Mining" for information about text features). For example: grouping the hits returned by a search engine.

  • Taxonomy generation. This refers to the generation of hierarchical groupings. For example: a cluster that includes text about car manufacturers is the parent of child clusters that include text about car models.

  • Topic extraction. This refers to the extraction of the most typical features of a group. For example: the most typical characteristics of documents in each document topic.

The Oracle Data Mining enhanced k-Means clustering algorithm, described in Chapter 13, supports text mining.
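As a rough illustration of simple clustering, the following Python sketch runs plain k-means (Lloyd's algorithm) on a tiny, made-up matrix of term counts. It is not the enhanced k-Means algorithm described in Chapter 13; the data, the function names, and the choice of initial centroids are all assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, centroids, max_iterations=20):
    """Plain Lloyd's k-means; returns (labels, centroids)."""
    centroids = [list(c) for c in centroids]
    labels = None
    for _ in range(max_iterations):
        # Assign each point to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical term counts: rows are documents, columns are terms.
docs = [
    [3, 2, 1, 0, 0, 0],
    [2, 3, 2, 0, 0, 0],
    [0, 0, 0, 2, 3, 1],
    [0, 0, 1, 3, 2, 2],
]
labels, centroids = k_means(docs, centroids=[docs[0], docs[1]])
```

The first two documents end up in one cluster and the last two in the other. The largest entries of each final centroid point to the terms most typical of that cluster, which is the idea behind topic extraction.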

Text Feature Extraction

Feature extraction is central to text mining. Feature extraction is used for text transformation at two different stages in the text mining process:

  1. A feature extraction process must be performed on text documents before they can be mined. This preprocessing step transforms text documents into small units of text called features or terms.

  2. The text transformation process generates large numbers (potentially many thousands) of text features from a text document. Oracle Data Mining algorithms treat each feature as a separate attribute. Thus text data may present a huge number of attributes, many of which provide little significant information for training a supervised model or building an unsupervised model.

    Oracle Data Mining supports the Non-Negative Matrix Factorization (NMF) algorithm for feature extraction. You can create an NMF model to consolidate the text attributes derived from the case table and generate a reduced set of more meaningful attributes. The results can be far more effective for use in classification, clustering, or other types of data mining models. See Chapter 16 for information on NMF.
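To illustrate how many term attributes can be consolidated into a few features, the following Python sketch factors a small, made-up document-term matrix using the standard multiplicative update rules for NMF. It is a generic sketch, not the Oracle Data Mining implementation; the matrix, function names, and parameter values are assumptions.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius_error(v, w, h):
    """Squared reconstruction error ||V - WH||^2."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Factor non-negative V (docs x terms) into W (docs x k) and H (k x terms)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h

# Hypothetical term counts: 4 documents, 6 term attributes.
v = [
    [3, 2, 1, 0, 0, 0],
    [2, 3, 2, 0, 0, 0],
    [0, 0, 0, 2, 3, 1],
    [0, 0, 1, 3, 2, 2],
]
w, h = nmf(v, k=2)
```

Each row of `w` describes a document with 2 feature values instead of 6 term attributes, and each row of `h` shows how strongly each term contributes to a feature. A downstream model would be trained on `w` rather than on the original term columns.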

Text Association

Association models can be used to uncover the semantic meaning of words. For example, suppose that the word account co-occurs with words like customer, explanation, churn, story, region, debit, and memo. An association model would produce rules connecting account with these concepts. Inspection of the rules would provide context for account in the document collection. Such associations can improve information retrieval engines.

Oracle Data Mining supports Apriori for association. See Chapter 10 for information on Apriori.
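The kind of rule described above can be sketched with simple co-occurrence counting. The following Python example computes support and confidence for pairs of terms only; it is a simplification and not the Apriori algorithm itself, which also handles larger itemsets. The documents, the function name, and the threshold are assumptions.

```python
from collections import Counter
from itertools import combinations

def pair_rules(documents, min_support=0.5):
    """Return {(antecedent, consequent): (support, confidence)} for term pairs."""
    n = len(documents)
    term_counts = Counter()
    pair_counts = Counter()
    for terms in documents:
        unique = sorted(set(terms))
        term_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of documents containing both terms
        if support < min_support:
            continue
        # Confidence of "a -> b" is P(b | a); likewise for "b -> a".
        rules[(a, b)] = (support, count / term_counts[a])
        rules[(b, a)] = (support, count / term_counts[b])
    return rules

# Hypothetical documents, already reduced to their extracted terms.
docs = [
    ["account", "customer", "debit"],
    ["account", "customer", "memo"],
    ["account", "churn", "customer"],
    ["region", "story"],
]
rules = pair_rules(docs)
```

With this data, only the pair account and customer meets the support threshold, so `rules` contains the two rules linking those terms.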

Text Attribute Importance

Attribute importance can be used to find terms that distinguish the values of a target column. Attribute importance ranks the relative importance of the terms in predicting the target. For example, certain words and phrases might distinguish the writing style of one writer from another.

Oracle Data Mining supports Minimum Description Length (MDL) for attribute importance. See Chapter 14 for information on MDL.

Preparing Text for Mining

Before text can be mined, it must undergo a special preprocessing step known as term extraction, also called feature extraction. This process breaks the text down into units (terms) that can be mined. Text terms may be keywords or other document-derived features.
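As a simplified picture of what term extraction does, the following Python sketch splits a text string into lowercase word terms and drops a few common words. It does not use or reproduce the Oracle Text routines; the function name, the stop-word list, and the sample text are assumptions.

```python
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "is", "was", "on", "for"}

def extract_terms(text):
    """Split text into lowercase word terms, dropping stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

note = "Customer called to dispute a debit on the account; the account was credited."
terms = extract_terms(note)
```

Each key in `terms` is a candidate term that a mining algorithm could treat as a separate attribute.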

The Oracle Data Miner graphical tool performs term extraction transparently when you create or apply a text mining model. You can use a set of Oracle Text table functions to extract terms for text mining with the PL/SQL API, as described in Oracle Data Mining Application Developer's Guide.

See Also:

Oracle Data Mining Administrator's Guide for information about sample term extraction code provided with the Oracle Data Mining sample programs

The term extraction process uses Oracle Text routines to transform a text column into a nested column. Each term is the name of a nested attribute. The value of each nested attribute is a number that uniquely identifies the term. Thus each term derived from the text is used as a separate numerical attribute by the data mining algorithm.

All Oracle Data Mining algorithms that support nested data can be used for text mining. These algorithms are listed in Table 20-2.

Oracle Data Mining and Oracle Text

Oracle Text is a technology included in the base functionality offered by Oracle Database. Oracle Text uses internal components of Oracle Data Mining to provide some data mining capabilities.

Oracle Data Mining is an option of the Enterprise Edition of Oracle Database. To use Oracle Data Mining, you must have a license for the Data Mining option. To use Oracle Text and its data mining capabilities, you do not need to license the Data Mining option.

Oracle Text consists of a set of PL/SQL packages and related data structures that support text query and document classification. Oracle Text routines can be used to:

  • Query document collections, such as web sites and online libraries

  • Query document catalogs, such as author and publisher descriptions

  • Perform document classification and clustering

The primary functional differences between Oracle Data Mining and Oracle Text can be summarized as follows:

  • Oracle Data Mining supports the mining of mixed data, as described in "Mixed Data". Oracle Text mining capabilities support only text; they do not support mixed structured and unstructured data.

  • Oracle Data Mining supports mining more than one text column at once. Oracle Text routines operate on a single column.

  • Oracle Data Mining supports text with all data mining functions. Oracle Text has limited support for data mining. The differences are summarized in Table 20-3.

  • Oracle Data Mining and Oracle Text both support text columns with any of the data types listed in Table 20-1. Oracle Data Mining requires text feature extraction transformation prior to mining, as described in "Preparing Text for Mining". Oracle Text operates on native text; it performs text feature extraction internally.

  • Oracle Data Mining and Oracle Text both support document classification, as described in "Text Classification". Oracle Data Mining supports multi-class classification. Oracle Text supports multi-class and multi-target classification.

Table 20-3 Mining Functions: Oracle Data Mining and Oracle Text

Mining Function       Oracle Data Mining                                              Oracle Text
---------------       ------------------                                              -----------
Anomaly detection     Text or mixed data can be mined using One-Class SVM             No support
Association           Text or mixed data can be mined using Apriori                   No support
Attribute importance  Text or mixed data can be mined using MDL                       No support
Classification        Text or mixed data can be mined using SVM, GLM, or Naive Bayes  Text can be mined using SVM, decision trees, or user-defined rules
Clustering            Text or mixed data can be mined using k-Means                   Text can be mined using k-Means
Feature extraction    Text or mixed data can be mined using NMF                       No support
Regression            Text or mixed data can be mined using SVM or GLM                No support