10 Apriori

This chapter describes Apriori, the algorithm used by Oracle Data Mining for calculating association rules.

This chapter contains the following topics:

About Apriori

An association mining problem can be decomposed into two subproblems:

  • Find all combinations of items in a set of transactions that occur with a specified minimum frequency. These combinations are called frequent itemsets.

  • Calculate rules that express the probable co-occurrence of items within frequent itemsets. (See "Example: Calculating Rules from Frequent Itemsets".)

Apriori calculates the probability of an item being present in a frequent itemset, given that another item or items is present.

Association rule mining is not recommended for finding associations involving rare events in problem domains with a large number of items. Apriori discovers patterns with frequency above the minimum support threshold. Therefore, in order to find associations involving rare events, the algorithm must run with very low minimum support values. However, doing so could potentially explode the number of enumerated itemsets, especially in cases with a large number of items. This could increase the execution time significantly. Classification or anomaly detection may be more suitable for discovering rare events when the data has a high number of attributes.

Association Rules and Frequent Itemsets

The Apriori algorithm calculates rules that express probabilistic relationships between items in frequent itemsets For example, a rule derived from frequent itemsets containing A, B, and C might state that if A and B are included in a transaction, then C is likely to also be included.

An association rule states that an item or group of items implies the presence of another item with some probability. Unlike decision tree rules, which predict a target, association rules simply express correlation.

Antecedent and Consequent

The IF component of an association rule is known as the antecedent. The THEN component is known as the consequent. The antecedent and the consequent are disjoint; they have no items in common.

Oracle Data Mining supports association rules that have one or more items in the antecedent and a single item in the consequent.

Confidence

Rules have an associated , which is the conditional probability that the consequent will occur given the occurrence of the antecedent. The minimum confidence for rules can be specified by the user.

Data Preparation for Apriori

Association models are designed to use transactional data. In transactional data, there is a one-to-many relationship between the case identifier and the values for each case. Each case ID/value pair is specified in a separate record (row).

Native Transactional Data and Star Schemas

Transactional data may be stored in native transactional format, with a non-unique case ID column and a values column, or it may be stored in some other configuration, such as a star schema. If the data is not stored in native transactional format, it must be transformed to a nested column for processing by the Apriori algorithm.

See Also:

"Transactional Data"

Oracle Data Mining Application Developer's Guide for details about transforming transactional data to nested columns

Items and Collections

In transactional data, a collection of items is associated with each case. The collection could theoretically include all possible members of the collection. For example, all products could theoretically be purchased in a single market-basket transaction. However, in actuality, only a tiny subset of all possible items are present in a given transaction; the items in the market-basket represent only a small fraction of the items available for sale in the store.

Sparse Data

Missing items in a collection indicate sparsity. Missing items may be present with a null value, or they may simply be missing.

Nulls in transactional data are assumed to represent values that are known but not present in the transaction. For example, three items out of hundreds of possible items might be purchased in a single transaction. The items that were not purchased are known but not present in the transaction.

Oracle Data Mining assumes sparsity in transactional data. The Apriori algorithm is optimized for processing sparse data.

See Also:

Oracle Data Mining Application Developer's Guide for information about missing value treatment

Note:

Apriori is not affected by Automatic Data Preparation.

Calculating Association Rules

The first step in association analysis is the enumeration of itemsets. An itemset is any combination of two or more items in a transaction.

Itemsets

The maximum number of items in an itemset is user-specified. If the maximum is two, all the item pairs will be counted. If the maximum is greater than two, all the item pairs, all the item triples, and all the item combinations up to the specified maximum will be counted.

Example 10-1 Sample Transactional Data

TRANS_ID   ITEM_ID
---------  -------------------
11         B
11         D
11         E
12         A
12         B
12         C
12         E
13         B
13         C
13         D
13         E

Table 10-1 shows the itemsets derived from the transactions shown in Example 10-1, assuming that maximum number of items in an itemset is set to 3.

Table 10-1 Itemsets

Transaction Itemsets

11

(B,D) (B,E) (D,E) (B,D,E)

12

(A,B) (A,C) (A,E) (B,C) (B,E) (C,E) (A,B,C) (A,B,E) (A,C,E) (B,C,E)

13

(B,C) (B,D) (B,E) (C,D) (C,E) (D,E) (B,C,D) (B,C,E) (B,D,E) (C,D,E)


Frequent Itemsets

Association rules are calculated from itemsets. If rules are generated from all possible itemsets, there may be a very high number of rules and the rules may not be very meaningful. Also, the model may take a long time to build. Typically it is desirable to only generate rules from itemsets that are well-represented in the data. Frequent itemsets are those that occur with a minimum frequency specified by the user.

The minimum frequent itemset support is a user-specified percentage that limits the number of itemsets used for association rules. An itemset must appear in at least this percentage of all the transactions if it is to be used as a basis for rules.

Table 10-2 shows the itemsets from Table 10-1 that are frequent itemsets with support > 66%.

Table 10-2 Frequent Itemsets

Frequent Itemset Transactions Support

(B,C)

2 of 3

67%

(B,D)

2 of 3

67%

(B,E)

3 of 3

100%

(C,E)

2 of 3

67%

(D,E)

2 of 3

67%

(B,C,E)

2 of 3

67%

(B,D,E)

2 of 3

67%


See Also:

Chapter 10, "Apriori" for information about the calculation of association rules

Example: Calculating Rules from Frequent Itemsets

Table 10-4 and Table 10-4 show the itemsets and frequent itemsets that were calculated in Chapter 8. The frequent itemsets are the itemsets that occur with a minimum support of 67%; at least 2 of the 3 transactions must include the itemset.

Table 10-3 Itemsets

Transaction Itemsets

11

(B,D) (B,E) (D,E) (B,D,E)

12

(A,B) (A,C) (A,E) (B,C) (B,E) (C,E) (A,B,C) (A,B,E) (A,C,E) (B,C,E)

13

(B,C) (B,D) (B,E) (C,D) (C,E) (D,E) (B,C,D) (B,C,E) (B,D,E) (C,D,E)


Table 10-4 Frequent Itemsets with Minimum Support 67%

Itemset Transactions Support

(B,C)

12 and 13

67%

(B,D)

11 and 13

67%

(B,E)

11, 12, and 13

100%

(C,E)

12 and 13

67%

(D,E)

11 and 13

67%

(B,C,E)

12 and 13

67%

(B,D,E)

11 and 13

67%


A rule expresses a conditional probability. Confidence in a rule is calculated by dividing the probability of the items occurring together by the probability of the occurrence of the antecedent.

For example, if B (antecedent) is present, what is the chance that C (consequent) will also be present? What is the confidence for the rule "IF B, THEN C"?

As shown in Table 10-3:

  • All 3 transactions include B (3/3 or 100%)

  • Only 2 transactions include both B and C (2/3 or 67%)

  • Therefore, the confidence of the rule "IF B, THEN C" is 67/100 or 67%.

Table 10-5, shows the rules that could be derived from the frequent itemsets in Table 10-4.

Table 10-5 Frequent Itemsets and Rules

Frequent Itemset Rules prob(antecedent and consequent) / prob(antecedent) Confidence

(B,C)


(If B then C)
(If C then B)

67/100
67/67

67%
100%

(B,D)


(If B then D)
(If D then B)

67/100
67/67

67%
100%

(B,E)


(If B then E)
(If E then B)

100/100
100/100

100%
100%

(C,E)


(If C then E)
(If E then C)

67/67
67/100

100%
67%

(D,E)


(If D then E)
I(f E then D)

67/67
67/100

100%
67%

(B,C,E)


(If B and C then E)
(If B and E then C)
(If C and E then B)

67/67
67/100
67/67

100%
67%
100%

(B,D,E)


(If B and D then E)
(If B and E then D)
(If D and E then B)

67/67
67/100
67/67

100%
67%
100%

If the minimum confidence is 70%, ten rules will be generated for these frequent itemsets. If the minimum confidence is 60%, sixteen rules will be generated

Tip:

Increase the minimum confidence if you want to decrease the build time for the model and generate fewer rules.

Evaluating Association Rules

Minimum support and confidence are used to influence the build of an association model. Support and confidence are also the primary metrics for evaluating the quality of the rules generated by the model. Additionally, Oracle Data Mining supports lift for association rules. These statistical measures can be used to rank the rules and hence the usefulness of the predictions.

Support

The support of a rule indicates how frequently the items in the rule occur together. For example, cereal and milk might appear together in 40% of the transactions. If so, the following two rules would each have a support of 40%.

cereal implies milk
milk implies cereal

Support is the ratio of transactions that include all the items in the antecedent and consequent to the number of total transactions.

Support can be expressed in probability notation as follows.

support(A implies B) = P(A, B)

Confidence

The confidence of a rule indicates the probability of both the antecedent and the consequent appearing in the same transaction. Confidence is the conditional probability of the consequent given the antecedent. For example, cereal might appear in 50 transactions; 40 of the 50 might also include milk. The rule confidence would be:

cereal implies milk with 80% confidence

Confidence is the ratio of the rule support to the number of transactions that include the antecedent.

Confidence can be expressed in probability notation as follows.

confidence (A implies B) = P (B/A), which is equal to P(A, B) / P(A)

Lift

Both support and confidence must be used to determine if a rule is valid. However, there are times when both of these measures may be high, and yet still produce a rule that is not useful. For example:

Convenience store customers who buy orange juice also buy milk with 
a 75% confidence. 
The combination of milk and orange juice has a support of 30%.

This at first sounds like an excellent rule, and in most cases, it would be. It has high confidence and high support. However, what if convenience store customers in general buy milk 90% of the time? In that case, orange juice customers are actually less likely to buy milk than customers in general.

A third measure is needed to evaluate the quality of the rule. Lift indicates the strength of a rule over the random co-occurrence of the antecedent and the consequent, given their individual support. It provides information about the improvement, the increase in probability of the consequent given the antecedent. Lift is defined as follows.

(Rule Support) /(Support(Antecedent) * Support(Consequent))

This can also be defined as the confidence of the combination of items divided by the support of the consequent. So in our milk example, assuming that 40% of the customers buy orange juice, the improvement would be:

30% / (40% * 90%)

which is 0.83 – an improvement of less than 1.

Any rule with an improvement of less than 1 does not indicate a real cross-selling opportunity, no matter how high its support and confidence, because it actually offers less ability to predict a purchase than does random chance.

Tip:

Decrease the maximum rule length if you want to decrease the build time for the model and generate simpler rules.

Tip:

Increase the minimum support if you want to decrease the build time for the model and generate fewer rules.