Lecture 10. Frequent Itemset Mining
Date: 2023-04-04
1. Overview
Frequent itemset mining (FIM) is a fundamental method in data mining aimed at identifying frequently occurring patterns, items, or combinations of items in a dataset. The main objective of this method is to discover the inherent regularities in large and complex datasets.
Motivation
- Scalability Challenge: As datasets grow, the sheer number of possible item combinations makes it computationally challenging to identify frequent sets. Traditional methods may not be efficient enough; FIM offers a systematic and scalable approach.
- Decision Making: Discovering patterns helps businesses and researchers make more informed decisions. Knowing which combinations of products are frequently bought together can guide strategies like product placement, marketing, and even new product development.
- Pattern Recognition: In numerous domains, understanding frequent combinations can be crucial. For example, in bioinformatics, identifying common patterns of genes can help in disease diagnosis.
Use Cases
- Market Basket Analysis: This is one of the most famous applications of FIM. Retailers can identify items that are frequently bought together, helping with product placement, promotions, and bundle offers.
- Website Analytics: By analyzing frequently visited combinations of web pages, businesses can enhance user navigation, adjust the layout, or personalize content for a better user experience.
- Medical Diagnosis: Analyzing medical data to find frequent combinations of symptoms can aid in early diagnosis and treatment of diseases.
- Collaborative Filtering: Used in recommendation systems, FIM can identify frequently co-rated items to suggest products, movies, or songs.
- Intrusion Detection: In cybersecurity, FIM can assist in identifying patterns of malicious activity by detecting frequently occurring sequences of system calls.
- Text Mining: Analyzing combinations of words or phrases that frequently appear together can yield insights into the main topics or sentiments of documents.
Definitions
- Items I: I represents the finite set of all items in the dataset (a set of literals). For instance, in a market basket analysis context, this could be all available products in a store.
- Transaction T: A transaction is an individual record or instance in the dataset; it is a subset of I (T ⊆ I). Using the store analogy, a transaction could represent a single customer's purchase, comprising the set of items they bought.
- Dataset D: This is the collection of all transactions. In a store, it would be the record of all purchases made over a certain period.
- Association Rule: An association rule is an implication of the form X ⇒ Y, where X and Y are disjoint itemsets (X ∩ Y = ∅). It suggests that if the items in X are bought, the items in Y are likely to be bought as well.
- Support: Support is a measure of how frequently an itemset appears in the dataset. The support of an itemset X in dataset D is given by support(X) = |{T ∈ D : X ⊆ T}| / |D|, i.e., the fraction of transactions that contain X.
- Confidence: Confidence measures the reliability of an association rule. It is the ratio of the support of the entire itemset to the support of the antecedent (the itemset before the implication arrow). Given a rule X ⇒ Y, the confidence is confidence(X ⇒ Y) = support(X ∪ Y) / support(X).
- Itemset: An itemset is a non-empty set of items. It can contain a single item or more.
- Support Count (σ): The support count σ(X) of an itemset X is the actual number of transactions in which the itemset appears. It is the absolute frequency, whereas support is the relative frequency: support(X) = σ(X) / |D|.
- Superset: Given an itemset X, any itemset that contains all items in X is considered a superset of X. For example, if X is {apple, banana}, then {apple, banana, cherry} is a superset of X.
- Frequent Itemset: An itemset is considered "frequent" if its support meets or exceeds a predetermined threshold (the minimum support). The identification of such itemsets is the primary goal of frequent itemset mining, as these represent patterns or combinations that occur commonly in the dataset.
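To make these definitions concrete, here is a minimal Python sketch, assuming a small hypothetical list of transactions represented as sets, that computes the support count, support, and confidence exactly as defined above (all names and data are illustrative):

```python
# Minimal sketch of the definitions above; the transactions are hypothetical.
transactions = [
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "bread"},
    {"coffee", "sugar"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """support(X) = sigma(X) / |D| (relative frequency)."""
    return support_count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """confidence(X => Y) = support(X u Y) / support(X)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

print(support_count({"bread"}, transactions))         # 3
print(support({"milk", "bread"}, transactions))       # 0.5
print(confidence({"milk"}, {"bread"}, transactions))  # 1.0
```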
2. Association Rules
Introduction
Association rules are a popular method used in data mining to discover relationships between variables in large datasets. These rules indicate how often items are bought together, helping retailers understand purchasing behaviors, for instance.
Interestingness Measures
Interestingness measures help assess the quality and relevance of discovered association rules. The most common measures include:
- Support: Represents how frequently an itemset appears in the dataset.
- Confidence: Gives the probability that an item Y is bought when item X is bought.
- Lift: Measures how much more often X and Y are bought together than expected if they were statistically independent.
Mining Association Rules
Mining association rules is primarily about discovering frequent itemsets and then generating rules from them.
Step 1: Frequent Itemset Generation
Here, we identify those itemsets in the database that meet the minimum support threshold. Algorithms like Apriori and FP-growth are popular for this step. The key idea is to find all combinations of items that have transactional support above the specified threshold.
Step 2: Rule Generation
After identifying the frequent itemsets, the next step is to generate association rules from them while ensuring these rules satisfy the minimum confidence threshold. For each frequent itemset, we generate all possible rules and then test them for the required confidence.
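As a rough sketch of this step, the snippet below enumerates every antecedent/consequent split of a single frequent itemset and keeps the rules that meet a confidence threshold. The `supports` dictionary and its values are assumed for illustration (they roughly mirror the grocery example later in these notes):

```python
from itertools import combinations

def rules_from_itemset(itemset, supports, min_conf):
    """Generate rules X => Y with X u Y = itemset and X, Y non-empty and disjoint,
    keeping only those whose confidence meets min_conf.
    `supports` maps frozensets to their (relative) support."""
    itemset = frozenset(itemset)
    rules = []
    for r in range(1, len(itemset)):
        for antecedent in combinations(itemset, r):
            antecedent = frozenset(antecedent)
            consequent = itemset - antecedent
            conf = supports[itemset] / supports[antecedent]
            if conf >= min_conf:
                rules.append((set(antecedent), set(consequent), conf))
    return rules

# Hypothetical supports for {milk, bread} and its subsets.
supports = {
    frozenset({"milk"}): 0.67,
    frozenset({"bread"}): 0.67,
    frozenset({"milk", "bread"}): 0.50,
}
print(rules_from_itemset({"milk", "bread"}, supports, min_conf=0.7))
```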
Other Discussion Points
Why use support and confidence?
- Support: Allows us to filter out those itemsets that are infrequently purchased, ensuring that the rules we discover are relevant.
- Confidence: Assures the reliability of the association. A high confidence means that the rule found holds true in a significant proportion of cases.
Association rules results should be interpreted with caution
While association rules provide valuable insights, there are pitfalls:
- Correlation vs. Causation: Just because two items are frequently bought together doesn't mean one causes the purchase of the other.
- Rare items: Items that are rare but have a high association might be overlooked if only focusing on high support.
- Redundant rules: Sometimes, rules generated can be redundant, providing the same information.
Further Considerations in Association Rules
Handling Large Datasets
Efficient algorithms and data structures, like the FP-tree in the FP-growth algorithm, help handle large datasets without exhaustive search.
Evaluating the Quality of Rules
Apart from support and confidence, measures like lift, leverage, and conviction can help in evaluating the quality and significance of association rules.
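These measures can all be computed directly from the supports of X, Y, and X ∪ Y. The sketch below is a minimal, illustrative implementation of the standard formulas; the example values correspond to Milk, Bread, and {Milk, Bread} in the grocery example that follows:

```python
def lift(sup_xy, sup_x, sup_y):
    """lift(X => Y) = support(X u Y) / (support(X) * support(Y)).
    > 1: positive association, = 1: independence, < 1: negative association."""
    return sup_xy / (sup_x * sup_y)

def leverage(sup_xy, sup_x, sup_y):
    """leverage(X => Y) = support(X u Y) - support(X) * support(Y)."""
    return sup_xy - sup_x * sup_y

def conviction(sup_xy, sup_x, sup_y):
    """conviction(X => Y) = (1 - support(Y)) / (1 - confidence(X => Y)).
    Equals 1 under independence; grows as the rule is violated less often."""
    conf = sup_xy / sup_x
    if conf == 1.0:
        return float("inf")  # the rule is never violated
    return (1 - sup_y) / (1 - conf)

print(lift(0.5, 2/3, 2/3))        # ~1.125
print(leverage(0.5, 2/3, 2/3))    # ~0.056
print(conviction(0.5, 2/3, 2/3))  # ~1.333
```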
Applications in Modern Industry
While retail is a common application, association rules are also used in bioinformatics, finance, and even areas of computer vision, like understanding patterns in image datasets.
Example: Grocery Store Purchases
Imagine a small grocery store that wants to understand the purchasing habits of its customers over the past month. The store records the following transactions (among many others):
- Milk, Bread, Butter
- Bread, Butter
- Milk, Bread
- Coffee, Sugar
- Milk, Coffee, Sugar
- Milk, Bread
Let's set our minimum support threshold to 50% and confidence to 70%.
Frequent Itemset Generation
Support Calculation:
- Milk: 4/6 ≈ 66.67%
- Bread: 4/6 ≈ 66.67%
- Butter: 2/6 ≈ 33.33%
- Coffee: 2/6 ≈ 33.33%
- Sugar: 2/6 ≈ 33.33%
Pairwise Support Calculation:
- Milk, Bread: 3/6 = 50%
- Milk, Coffee: 1/6 ≈ 16.67%
- Bread, Butter: 2/6 ≈ 33.33%
- Coffee, Sugar: 2/6 ≈ 33.33%
From the pairwise results, only the combination {Milk, Bread} meets our support threshold of 50%.
Rule Generation
From our frequent itemset {Milk, Bread}:
Rule 1: Milk → Bread
- Confidence: P(Bread|Milk) = P(Milk and Bread) / P(Milk) = 50% / 66.67% = 75%
Rule 2: Bread → Milk
- Confidence: P(Milk|Bread) = P(Milk and Bread) / P(Bread) = 50% / 66.67% = 75%
Both these rules meet our confidence threshold of 70%.
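The arithmetic above can be double-checked with a short sketch that recounts the pairwise supports and the two rule confidences from the six transactions listed earlier (the code structure is illustrative only):

```python
from itertools import combinations

# The six transactions from the example above.
transactions = [
    {"Milk", "Bread", "Butter"}, {"Bread", "Butter"}, {"Milk", "Bread"},
    {"Coffee", "Sugar"}, {"Milk", "Coffee", "Sugar"}, {"Milk", "Bread"},
]
n = len(transactions)

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / n

# Pairwise supports for every pair of items that occurs at least once together.
items = sorted(set().union(*transactions))
for pair in combinations(items, 2):
    s = support(set(pair))
    if s > 0:
        print(pair, round(s, 4))

# The two rules generated from the frequent itemset {Milk, Bread}.
print("Milk -> Bread:", support({"Milk", "Bread"}) / support({"Milk"}))   # 0.75
print("Bread -> Milk:", support({"Milk", "Bread"}) / support({"Bread"}))  # 0.75
```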
Interpretation
Based on the transactions:
- 75% of the time, customers who bought Milk also bought Bread.
- Similarly, 75% of the time, customers who purchased Bread also bought Milk.
Recommendations
The store can consider the following actions:
1. Placement: Place milk and bread closer together, or run promotions where buying one gives a discount on the other.
2. Inventory: Ensure stock levels of both milk and bread are maintained, as they are frequently bought together.
3. Advertisements: Highlight recipes or uses that incorporate both milk and bread.
3. Apriori Algorithm
Intuition
The Apriori algorithm is a foundational algorithm in the field of association rule mining. It operates on the basic principle that if an itemset is infrequent, all of its supersets must also be infrequent. This intuition reduces the computational overhead significantly by eliminating the need to count candidate itemsets that contain a known-infrequent subset.
Basic Principles
Apriori Principle: If an itemset is frequent, then all its subsets must also be frequent. Equivalently (by contraposition), if an itemset is found to be infrequent, then all of its supersets are infrequent.
Details
By using the Apriori principle, the algorithm drastically reduces the number of candidates it has to check. This is achieved as follows:
Steps of the Algorithm:
- Initialization: Begin by counting the occurrences of individual items and determining which ones meet the support threshold.
- Larger Itemset Formation: Form larger itemsets only from the items that passed the previous threshold. For instance, if {Milk} and {Bread} are frequent individual items, then check the itemset {Milk, Bread} in the next round.
- Pruning: After generating the new candidate itemsets, prune those that have an infrequent subset, based on the Apriori principle. This is where the significant computational savings come into play.
- Iteration: Continue the process iteratively, increasing the size of the itemsets, until no more frequent itemsets can be found.
- Rule Formation: Once all frequent itemsets have been found, generate association rules from them while ensuring the rules meet the minimum confidence threshold.
Generating Association Rules
After identifying all frequent itemsets:
- For each itemset, generate all possible rules.
- Test the confidence of each rule by dividing the support of the itemset by the support of the antecedent (the item or items on the left side of the rule).
- Keep only those rules that meet the minimum confidence threshold.
Example
Given the grocery store example previously mentioned, where transactions included items like Milk, Bread, and Butter:
Step 1: Count the occurrences of each item.
- Milk: 4
- Bread: 4
- Butter: 2
- Coffee: 2
- Sugar: 2
Step 2: Form pairs.
- {Milk, Bread}, {Milk, Butter}, {Bread, Butter}...
Step 3: Prune candidates.
- If {Milk, Butter} is below the threshold, then {Milk, Bread, Butter} won't be considered in the next round.
Step 4: Continue until no more combinations meet the threshold.
Step 5: Form rules.
- From the frequent itemset {Milk, Bread}, rules like "If Milk, then Bread" and "If Bread, then Milk" can be generated and their confidence tested.
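The level-wise procedure can be sketched compactly in Python. The following is a simplified, illustrative Apriori implementation (brute-force support counting, no optimizations such as hash trees or transaction reduction), run here on the grocery transactions; the function names are placeholders:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Simplified level-wise Apriori: returns {frozenset: support} for all
    frequent itemsets. Transactions are given as a list of sets."""
    n = len(transactions)

    def keep_frequent(candidates):
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}

    # Level 1: frequent individual items.
    frequent = keep_frequent({frozenset([i]) for t in transactions for i in t})
    result, k, prev = dict(frequent), 2, frequent
    while prev:
        # Candidate generation: join (k-1)-itemsets whose union has size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Apriori pruning: drop candidates with any infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        prev = keep_frequent(candidates)
        result.update(prev)
        k += 1
    return result

transactions = [
    {"Milk", "Bread", "Butter"}, {"Bread", "Butter"}, {"Milk", "Bread"},
    {"Coffee", "Sugar"}, {"Milk", "Coffee", "Sugar"}, {"Milk", "Bread"},
]
# Prints the frequent itemsets {Milk}, {Bread}, and {Milk, Bread} with their supports.
print(apriori(transactions, min_support=0.5))
```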
4. Frequent Pattern (FP) Growth Algorithm
Motivation
The Apriori algorithm, despite its efficiency gains over brute force, can still be quite slow due to its iterative nature and the need to repeatedly scan the database. The FP Growth Algorithm offers a solution to this problem by compressing the database into a tree and mining the tree for frequent patterns. This reduces the scans of the database to just two.
FP Growth
FP Growth stands for Frequent Pattern Growth. Instead of generating candidate itemsets explicitly like Apriori, FP Growth represents the database in a compressed form called the FP Tree (Frequent Pattern Tree). This structure retains the itemset association information, and frequent pattern mining is performed on this tree.
FP Tree Representation
An FP Tree is a tree-like data structure that stores both the itemsets and their frequencies. The items in the dataset are organized in a descending order of frequency, and paths in the tree are built as transactions are read.
- Node Composition: Each node in the tree has three fields - item-name, count, and a link to the next node with the same item-name.
- Pathway: Each transaction in the dataset becomes a path in the FP Tree, with nodes incrementing their count if they are encountered again.
Example: Constructing the FP Tree
Step 1: Read the database and calculate the support of each item. Order items in descending frequency.
Step 2: Read transactions from the dataset. For each transaction, create or traverse a path in the FP Tree. This means:
- Start from the root.
- For each item in the transaction, check if it exists as a child node of the current node.
- If it does, increment the count of that node.
- If it doesn't, create a new node with count 1.
Step 3: Link nodes that represent the same item using the node link field.
Step 4: Continue until all transactions are processed and the entire FP Tree is constructed.
Step 5: The tree is now ready for mining.
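A minimal sketch of this construction is shown below. It assumes transactions are Python sets and uses a plain dictionary of node lists as a simplified stand-in for the chained node-links of a real header table; class and function names are illustrative:

```python
from collections import Counter, defaultdict

class FPNode:
    """One FP Tree node: item name, count, parent link, and children."""
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}  # item -> FPNode

def build_fp_tree(transactions, min_support_count):
    # Pass 1: count item frequencies and keep only the frequent items.
    counts = Counter(i for t in transactions for i in t)
    frequent = {i for i, c in counts.items() if c >= min_support_count}

    root = FPNode(None)
    header = defaultdict(list)  # item -> list of nodes (simplified node-links)

    # Pass 2: insert each transaction, items ordered by descending frequency.
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in frequent),
                           key=lambda i: (-counts[i], i)):
            if item not in node.children:
                child = FPNode(item, parent=node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1
    return root, header

transactions = [
    {"Milk", "Bread", "Butter"}, {"Bread", "Butter"}, {"Milk", "Bread"},
    {"Coffee", "Sugar"}, {"Milk", "Coffee", "Sugar"}, {"Milk", "Bread"},
]
root, header = build_fp_tree(transactions, min_support_count=2)
# Summing the counts of each item's nodes recovers the item frequencies.
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})
```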
Properties of an FP Tree
- Frequency Preservation: The paths from the root of the FP Tree to the nodes labelled with an item encode all the itemsets (prefixes) in which that item participates, and the node counts give their frequencies.
- Itemset Association: Nodes in an FP Tree are linked if they represent the same item. This preserves the association of itemsets.
- Compactness: The FP Tree compresses the dataset while preserving the complete frequency information.
Conditional FP Tree
To find all the frequent patterns associated with a particular item (say 'Milk'), a Conditional FP Tree for 'Milk' is constructed. It is built from the prefix paths of the 'Milk' nodes (the conditional pattern base), i.e., from the portions of all transactions that occur together with 'Milk'. Mining this Conditional FP Tree reveals all frequent patterns that include the item 'Milk'.
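To illustrate one level of this idea, the sketch below starts from a hypothetical conditional pattern base for 'Milk' (the prefix paths leading to 'Milk' nodes, consistent with the grocery transactions) and derives the frequent patterns ending in 'Milk'. A full FP Growth implementation would recurse by building a conditional FP Tree from this base:

```python
from collections import Counter

# Hypothetical conditional pattern base for 'Milk': prefix paths that lead to
# 'Milk' nodes in the FP Tree, each weighted by that node's count.
conditional_pattern_base = [
    (["Bread"], 3),  # path root -> Bread -> Milk, count 3
    ([], 1),         # path root -> Milk, count 1
]

def frequent_with(item, pattern_base, min_support_count):
    """One level of FP Growth: count the items inside the conditional pattern base
    and emit every pattern {i, item} whose count meets the threshold."""
    counts = Counter()
    for path, count in pattern_base:
        for i in path:
            counts[i] += count
    return {frozenset([i, item]): c
            for i, c in counts.items() if c >= min_support_count}

print(frequent_with("Milk", conditional_pattern_base, min_support_count=3))
# {frozenset({'Bread', 'Milk'}): 3}
```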
Apriori vs. FP Growth
- Database Scans: Apriori requires multiple passes over the database, while FP Growth requires only two.
- Data Structures: Apriori uses lists of candidate itemsets, while FP Growth uses the FP Tree.
- Computational Overhead: Apriori might generate many candidate sets, leading to high computational costs. FP Growth avoids explicit generation of candidates.
- Memory Usage: While Apriori can be memory-efficient, FP Growth can sometimes be memory-intensive if the FP Tree is large.
- Performance: Generally, FP Growth is faster, especially when the dataset is large, because it avoids candidate generation and repeated database scans.
Both methods have their advantages, but FP Growth is often preferred in scenarios with large datasets due to its efficiency gains over Apriori.
5. Maximal and Closed Frequent Itemsets
Motivation
In frequent itemset mining, one of the primary challenges is the vast number of itemsets that can be generated, especially in large datasets. Reporting all frequent itemsets can be overwhelming and often redundant. To overcome this, more concise and meaningful representations of frequent itemsets are needed, leading to the introduction of Maximal and Closed Frequent Itemsets.
Maximal Frequent Itemsets
A frequent itemset is considered maximal if it is frequent, and no immediate superset of it is frequent. In other words, it is at the "maximum" level of its frequent lineage. It offers a condensed representation of frequent itemsets by only considering those that aren’t contained by any other frequent itemset.
Example: If {Milk, Bread} is frequent but {Milk, Bread, Butter} is not, then {Milk, Bread} is a maximal frequent itemset.
Closed Frequent Itemsets
A frequent itemset is termed closed if it's frequent and there exists no immediate superset that has the same support count. They capture the essence of the data since any itemset not being closed has a superset with the same support, meaning it doesn’t bring additional information regarding frequency.
Example: If {Milk, Bread} and {Milk, Bread, Butter} are both frequent and have the same support count, then {Milk, Bread} is not closed; {Milk, Bread, Butter} is the closed itemset (provided no superset of it has the same support count).
Relationship between Frequent Itemset Representations
- All maximal itemsets are frequent: By definition, maximal itemsets are those that are frequent and have no frequent supersets. Hence, they are a subset of the frequent itemsets.
- All closed itemsets are frequent: Similarly, closed itemsets are a subset of the frequent itemsets because their support is above the threshold.
- All maximal itemsets are closed: A maximal itemset has no frequent superset, so in particular it has no superset with the same support (such a superset would itself be frequent).
- Not all closed itemsets are maximal: While closed itemsets don't have supersets with the same support, they might still have supersets that are frequent (with strictly lower support).
To illustrate, consider a case where {A} is frequent, {A, B} is frequent with the same support as {A}, and {A, B, C} is also frequent but with lower support. Here, {A} is not closed (its superset {A, B} has the same support), {A, B} is closed but not maximal because {A, B, C} is also frequent, and {A, B, C} is both closed and maximal.
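The two definitions translate directly into subset comparisons. The sketch below, using assumed support counts for the {A}, {A, B}, {A, B, C} example above, checks maximality and closedness for each frequent itemset:

```python
# Assumed support counts for the example above: {A} and {A, B} share a count,
# {A, B, C} is frequent with a strictly lower count.
frequent = {
    frozenset("A"): 5,
    frozenset("AB"): 5,
    frozenset("ABC"): 3,
}

def is_maximal(itemset, frequent):
    """Maximal: no frequent proper superset exists."""
    return not any(itemset < other for other in frequent)

def is_closed(itemset, frequent):
    """Closed: no frequent proper superset has the same support count."""
    return not any(itemset < other and frequent[other] == frequent[itemset]
                   for other in frequent)

for itemset in frequent:
    print(set(itemset), "maximal:", is_maximal(itemset, frequent),
          "closed:", is_closed(itemset, frequent))
# {A}:       maximal False, closed False (its superset {A, B} has the same count)
# {A, B}:    maximal False, closed True
# {A, B, C}: maximal True,  closed True
```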
Both maximal and closed frequent itemsets are used to provide concise representations of frequent patterns in the dataset, thereby reducing redundancy and making the patterns more interpretable.
6. Q&A
1. Frequent Itemsets
Q: What are frequent itemsets in data mining? A: Frequent itemsets refer to sets of items that appear together in transactions (or data entries) more often than a specified threshold or minimum support level.
2. Apriori Algorithm
Q: How does the Apriori algorithm determine which itemsets are frequent? A: The Apriori algorithm uses a level-wise, iterative approach. First, it identifies individual frequent items based on a support threshold. Then, in subsequent passes, it extends the itemsets and checks their frequency, pruning itemsets that have infrequent subsets, based on the principle that if an itemset is infrequent, its supersets will also be infrequent.
3. FP Tree
Q: Why is the FP Tree a preferred method for mining frequent itemsets over the Apriori algorithm in some cases? A: The FP Tree compresses the database into a tree-like structure, allowing for faster mining without the need for repeatedly scanning the database. This leads to efficiency gains, especially for larger datasets, as only two scans are needed compared to the multiple scans required by the Apriori algorithm.
4. Maximal Frequent Itemsets
Q: How is a maximal frequent itemset defined? A: A frequent itemset is considered maximal if it is frequent and no immediate superset of it is frequent.
5. Closed Frequent Itemsets
Q: What differentiates a closed frequent itemset from other frequent itemsets? A: A frequent itemset is termed closed if it's frequent and there exists no immediate superset with the same support count. Essentially, any itemset that isn't closed has a superset with the same frequency.
6. Measuring Interestingness
Q: What is the significance of measuring interestingness in association rule mining? A: Measuring interestingness helps determine the relevance or utility of discovered association rules. While many rules can be generated from a dataset, not all of them provide meaningful or actionable insights. Interestingness measures help prioritize and filter rules that are likely to be more valuable or surprising to the user.
7. Apriori Principle
Q: How does the Apriori principle assist in reducing computational overhead? A: The Apriori principle states that if an itemset is infrequent, its supersets will also be infrequent. This allows the algorithm to prune large numbers of potential itemsets without checking their frequency, leading to significant computational savings.
8. FP Tree Structure
Q: What are the key components of an FP Tree node? A: Each node in an FP Tree typically contains the item-name, a count (representing the number of transactions containing the itemset represented by the path to the node), and a link to the next node with the same item-name.
9. Maximal vs. Closed Itemsets
Q: Are all closed itemsets also maximal? A: No, while all maximal itemsets are closed, not all closed itemsets are maximal. A closed itemset might still have supersets that are frequent, even if those supersets have a different support count.
10. Confidence as an Interestingness Measure
Q: How is confidence used as a measure of interestingness in association rule mining? A: Confidence measures the probability of an item Y being purchased when item X is purchased, represented by the rule X -> Y. It's calculated as the support of the itemset {X, Y} divided by the support of item X. A high confidence indicates a strong association between the items in the rule.
11. Support Measure
Q: How is support used as a measure in association rule mining? A: Support measures the frequency of an item or itemset in the database. It's calculated as the number of transactions containing the itemset divided by the total number of transactions. High support indicates that the itemset is common, while low support indicates rarity.
12. Limitations of Apriori
Q: What are some of the primary limitations of the Apriori algorithm? A: One limitation is that it can be computationally expensive due to multiple passes over the dataset. It may also generate many candidate sets, leading to increased memory usage. Lastly, its performance can degrade with large datasets or low minimum support values.
13. Efficiency of FP Growth
Q: Why is FP Growth considered more efficient than Apriori in many scenarios? A: FP Growth reduces the dataset scans to just two, and it compresses the dataset into a tree structure (FP Tree). This means it doesn’t require generation of candidate sets like Apriori, and avoids multiple passes over the database, making it faster and more memory efficient in many scenarios.
14. Leverage and Lift
Q: How do leverage and lift act as measures of interestingness in association rule mining? A: Leverage computes the difference between the observed frequency of both items together and the frequency that would be expected if they were independent. Lift, on the other hand, is the ratio of the observed support to the support expected if the two itemsets were independent. A lift value greater than 1 suggests a positive association between items.
15. Use-cases for Frequent Itemsets
Q: Where are frequent itemsets commonly used in real-world applications? A: Frequent itemset mining is often used in market basket analysis to identify products frequently bought together. This can aid in areas like product placement, promotions, and recommendation systems.
16. FP Tree Depth
Q: How does the depth of the FP Tree affect the efficiency of the FP Growth algorithm? A: A deeper FP Tree might indicate that transactions have many items, which can lead to more complex mining. However, the main efficiency of FP Growth comes from its structure and the fact that it avoids candidate generation. Generally, the width (number of branches) has a more pronounced effect on efficiency than depth.
17. Redundant Rules
Q: Why is it important to filter out redundant association rules? A: Redundant rules don't provide new or actionable insights and can clutter the results. By focusing on non-redundant, meaningful rules, businesses or analysts can make more informed decisions.
18. Conviction as a Measure
Q: How does conviction act as an interestingness measure? A: Conviction compares the probability that X appears without Y if they were independent versus the observed data. A high conviction value indicates that Y is highly dependent on X. If items are independent, the conviction is 1.
19. Difference between Frequent and Infrequent Itemsets
Q: How do frequent and infrequent itemsets differ? A: Frequent itemsets meet or exceed a specified support threshold in the dataset, indicating they occur together often. Infrequent itemsets, on the other hand, occur together less frequently than this threshold.
20. Scalability of Itemset Mining Algorithms
Q: How do the Apriori and FP Growth algorithms scale with increasing dataset sizes? A: Apriori's performance can degrade with large datasets or very low minimum support thresholds because of its iterative nature and multiple database scans. FP Growth, in contrast, often scales better with larger datasets due to its compressed tree structure and reduced number of database scans. However, if the FP Tree becomes very large and complex, it can also become memory-intensive.