Bulletin of Applied Computing and Information Technology
Embedding Multiple Constraints into Apriori Algorithm
Russel Pears & Sangeetha Kutty

Pears, R. & Kutty, S. (2005, May). Embedding Multiple Constraints into Apriori Algorithm. Bulletin of Applied Computing and Information Technology, Vol. 3, Issue 1. ISSN 1176-4120.

ABSTRACT

To improve the efficiency of frequent pattern mining, recent studies have tended to focus on constraint-based approaches, and a number of techniques have been proposed to incorporate constraints directly into mining algorithms. Some constraints, such as average, median and sum, do not exhibit desirable properties such as anti-monotonicity and succinctness, and incorporating them efficiently into mining algorithms was therefore deemed infeasible; such constraints came to be known as tough constraints. This paper proposes a novel approach to combining these tough constraints and embedding them into the popular mining algorithm, Apriori.

Keywords

Pattern mining, Apriori

1. INTRODUCTION

The most widely used algorithm in frequent itemset mining is the Apriori algorithm (Agrawal et al., 1993). It has been applied in highly diverse domains such as web usage mining, bioinformatics, text mining and video mining. In bioinformatics, Apriori has been used to generate rules from genomic data; these rules express the correlation among the sequence, structure and function of proteins (Satou et al., 1997). In web usage mining, Apriori has been used to customise the content of a website based on how it is used by its visitors (Mobasher, Dai, Luo, & Nakagawa, 2001), and to extract sequential patterns of users as they navigate from page to page (Huysmans, Baesens, Mues, & Vanthienen, 2004).

Apriori uses the standard support-confidence framework to identify good rules in two steps. First, the algorithm identifies the itemsets whose frequency of occurrence (support) is above a user-defined threshold, generally referred to as minsup; the resulting itemsets are considered frequent. Second, the algorithm calculates the confidence, or strength, of each rule that results; if the confidence of a generated rule is above a user-specified level, that rule is returned to the user. For example, a rule stating that 90% of the people who buy bread also buy milk would be returned if the minimum confidence level were set to 70%.

The underlying itemsets from which the rules are generated are the focal point of frequent pattern mining, and identifying these frequent itemsets efficiently has been the subject of intense research in recent years, owing to the sheer volume of itemsets that need to be evaluated in a real-world mining scenario. A number of approaches have been proposed to deal with this problem. One such approach is the use of constraints in the mining process. The constraint-based approach has two important advantages over the naïve approach. Firstly, the user is able to narrow the focus to the part of the database in which he or she is interested, thus improving performance (Jia, Pei, & Zhang, 2002). Secondly, the use of constraints increases the level of user involvement, as users are able to incorporate their requirements into the mining process. A simple example of a constraint is to require that the total price of the milk products purchased exceeds 100, which can be represented as sum(MilkProducts.Price) > 100. A number of approaches have been proposed to efficiently sift through the potentially huge number of rules that could be generated.
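To make the two-step support-confidence framework concrete, the following sketch implements a minimal level-wise Apriori pass and confidence-based rule generation in Python. It is an illustrative sketch only: the transaction data, item names and thresholds are assumed for demonstration and are not taken from the paper.

from itertools import combinations

def apriori(transactions, minsup):
    """Minimal level-wise Apriori: returns {frozenset: support count}."""
    items = {i for t in transactions for i in t}
    freq = {frozenset([i]): sum(1 for t in transactions if i in t) for i in items}
    freq = {s: c for s, c in freq.items() if c >= minsup}
    all_freq, k = dict(freq), 2
    while freq:
        prev = list(freq)
        # Candidate generation: join frequent (k-1)-itemsets.
        cands = {a | b for a in prev for b in prev if len(a | b) == k}
        # Anti-monotone (support) pruning: every (k-1)-subset must be frequent.
        cands = {c for c in cands
                 if all(frozenset(s) in freq for s in combinations(c, k - 1))}
        counts = {c: sum(1 for t in transactions if c <= set(t)) for c in cands}
        freq = {c: cnt for c, cnt in counts.items() if cnt >= minsup}
        all_freq.update(freq)
        k += 1
    return all_freq

def confident_rules(freq, minconf):
    """Second step: derive rules X -> Y whose confidence reaches minconf."""
    rules = []
    for s, cnt in freq.items():
        for r in range(1, len(s)):
            for lhs in map(frozenset, combinations(s, r)):
                conf = cnt / freq[lhs]          # support(X u Y) / support(X)
                if conf >= minconf:
                    rules.append((set(lhs), set(s - lhs), conf))
    return rules

# Illustrative data (not from the paper).
tx = [{"bread", "milk"}, {"bread", "milk", "butter"}, {"milk"}, {"bread", "milk"}]
print(confident_rules(apriori(tx, minsup=2), minconf=0.7))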
The early work by Srikant, Vu and Agrawal (1997) on constraint-based frequent itemset mining proposed an algorithm to deal with Boolean constraints, such as the presence or absence of particular items in an itemset. A systematic categorisation of aggregation constraints, namely minimum, maximum, count, sum and average, based on properties such as monotonicity and succinctness, was carried out by Ng, Lakshmanan, Han and Pang (1998). A comprehensive study of frequent pattern mining using succinct constraints was conducted by Lakshmanan, Leung and Ng (2002). Some constraints, namely average, sum and median, do not exhibit desirable properties such as anti-monotonicity and succinctness, and were therefore deemed "tough constraints", as they could not be incorporated into the mining algorithms. Recently, however, several studies on constrained mining have shown how to convert these constraints into a form that enables them to be incorporated into standard algorithms such as Apriori (Jia et al., 2002) and the more recent FIC (Frequent Itemset Constraint) algorithm (Pei, Han, & Lakshmanan, 2004). Yet, due to the inherent properties of each of these constraints, combining them and embedding them into an Apriori-like algorithm was deemed complex (Jia et al., 2002; Ng et al., 1998). Pei et al. (2004) went so far as to assert that embedding such constraints into Apriori would not be feasible. However, we will demonstrate in this paper that we can effectively embed not just one but multiple constraints into the Apriori algorithm.

The rest of this paper is organised as follows. Section 2 introduces the different types of constraints in detail. Section 3 provides an overview of some of the recent work on embedding tough constraints. Section 4 discusses a technique for incorporating a single constraint into Apriori. Section 5 focuses on combining multiple constraints. Section 6 builds on Sections 4 and 5 and introduces our approach to embedding multiple constraints into Apriori. Section 7 discusses some of the limitations of our approach and the open questions that remain, and Section 8 concludes the paper.

2. TYPES OF CONSTRAINTS

Many different types of constraints exist, namely item constraints, length constraints, super-pattern constraints, aggregate constraints, model constraints and regular expression constraints (Pei & Han, 2002). In this paper we focus on aggregate constraints and the methods required to embed them into mining algorithms. Aggregate constraints are classified into three kinds based on the way they are assembled: distributive, such as count, amount, max and min; algebraic, such as average, standard deviation and sum; and holistic, such as median and mode (Jia et al., 2002).

2.1 Constraint Properties

Constraints are also characterised by their suitability for exploitation within the mining process. In this scheme, constraints that exhibit anti-monotone, monotone or succinct properties can be directly embedded into the mining process.

Definition 1 (Anti-monotonicity): A constraint is said to be anti-monotone when, if an itemset S violates it, every superset of S also violates it. For instance, if S represents an itemset and the prices of the items in it are denoted by S.Price, then min(S.Price) > v and max(S.Price) ≤ v are anti-monotone. However, the constraint that the sum of the prices of the items in S exceeds a value v, represented by sum(S.Price) ≥ v, is not anti-monotone. The level of an itemset denotes the number of items it contains; for instance, a level-1 itemset contains exactly one item.
A simple example of an anti-monotone constraint is the support constraint, which forms the basis of the Apriori algorithm and uses minimum support as a threshold for identifying the desired rules: if a subset violates the constraint, then its supersets need not be checked. Anti-monotonicity is therefore a desirable property in Apriori-like algorithms.

Definition 2 (Monotonicity): A constraint is said to be monotone when, if an itemset S satisfies it, every superset of S also satisfies it. For instance, Sum(S.Price) ≥ v is monotone, as is Min(S.Price) ≤ v. However, Sum(S.Price) ≤ v is not monotone.

Definition 3 (Succinctness): A constraint C is said to be succinct when the itemsets that satisfy C can be generated directly, by selections over the set of items, before mining begins. The idea behind succinct constraints is that we can determine whether an itemset S satisfies C on the basis of the individual items alone, without having to scan the database. For instance, Min(S.Price) ≤ v is succinct but Sum(S.Price) ≥ v is not. In the case of Min(S.Price) ≤ v, the satisfying itemsets can be concisely enumerated as those containing at least one item priced at or below v. On the other hand, this cannot be done for Sum(S.Price) ≥ v, because the sum of the item prices depends on the entire combination of items selected and grows as more items are added.
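As a concrete illustration of these properties, the short sketch below evaluates some of the constraints mentioned above on a small, hypothetical price table (the item prices and the threshold v are invented for illustration only).

price = {"a": 10, "b": 25, "c": 40, "d": 55}   # hypothetical item prices
v = 30

# The constraints discussed above, expressed as predicates on an itemset S.
min_gt = lambda S: min(price[i] for i in S) > v                  # anti-monotone
sum_ge = lambda S: sum(price[i] for i in S) >= v                 # monotone
avg_ge = lambda S: sum(price[i] for i in S) / len(S) >= v        # "tough"

# Anti-monotone: once min(S.Price) > v is violated, every superset violates it.
print(min_gt({"a"}), min_gt({"a", "d"}))                         # False, False
# Monotone: once sum(S.Price) >= v is satisfied, every superset satisfies it.
print(sum_ge({"c"}), sum_ge({"c", "a"}))                         # True, True
# avg(S.Price) >= v behaves neither way: it can flip back and forth as items are added.
print(avg_ge({"c"}), avg_ge({"c", "a"}), avg_ge({"c", "d"}))     # True, False, True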
Certain types of constraints, such as Count(S) ≤ v, are not succinct by definition. However, they can be transformed so that they obey the succinct property (Ng et al., 1998). These are referred to as weakly succinct constraints, to distinguish them from succinct constraints, which obey Definition 3 without the need for any transformation. Recent research has highlighted the fact that constraints previously considered tough can nevertheless be embedded into the mining process (Jia et al., 2002; Pei, Han, & Lakshmanan, 2001; Pei et al., 2004). This is achieved by re-ordering the items, by an attribute value such as price, in either descending or ascending order. Constraints that exhibit anti-monotone or monotone properties as a result of this re-ordering are referred to as convertible constraints.

2.2 Taxonomy of Constraints

Figure 1 shows the classification of constraints and their relationships. As indicated in the figure, there is a category of constraints called inconvertible. Certain constraints, such as avg(S) – median(S) = 0, cannot be embedded in the mining process (Pei et al., 2004): this constraint would require an ordering of items under which the median of an itemset equates to its average, and no natural ordering on the items can guarantee this. Table 3 classifies the commonly used constraints according to whether they are anti-monotone, monotone and/or succinct.
Table 3. Types of constraints (* means it depends on the particular constraint)
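The notion of a convertible constraint can also be illustrated with a short sketch: when items are arranged in descending order of price, the running average over any prefix never increases as further (cheaper) items are appended, so once Avg(S.Price) ≥ v is violated by a prefix it stays violated for every extension of that prefix. The prices and threshold below are assumptions made for the illustration.

# Hypothetical prices; items are considered in descending order of price.
price = {"d": 55, "c": 40, "b": 25, "a": 10}
order = sorted(price, key=price.get, reverse=True)   # ['d', 'c', 'b', 'a']
v = 30

total, running_avg = 0, []
for k, item in enumerate(order, start=1):
    total += price[item]
    running_avg.append(total / k)

print(running_avg)
# [55.0, 47.5, 40.0, 32.5] -- non-increasing, so once a prefix average drops
# below v, every longer prefix also fails: Avg(S.Price) >= v becomes
# anti-monotone under this ordering, i.e. it is convertible.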
3. EXISTING APPROACHES TO EMBEDDING TOUGH CONSTRAINTS

Several approaches have been proposed in the literature to embed tough constraints into Apriori-like algorithms. One such approach is to induce weaker constraints that, unlike the tough constraints themselves, exhibit anti-monotone and/or succinct properties (Ng et al., 1998). Thus, for a constraint such as Avg(S.Price) ≤ C, we can induce the weaker constraint Min(S.Price) ≤ C, since any itemset whose average price is at most C must contain at least one item priced at most C. Because the induced constraint has exploitable properties, it can be incorporated into the mining algorithm; however, the itemsets produced after applying the weak constraint still have to be post-checked for satisfaction of the original average constraint (a short sketch of this filter-and-post-check scheme is given at the end of this section). Although this technique reduces the number of itemsets requiring post-checking, owing to the filtering provided by the weak constraint, it does involve checking constraints twice, and where the degree of filtering provided by the weak constraint is low, the effectiveness of the approach is questionable.

Another approach suggested in the literature is the Divide-and-Approximate method (Wang, Jiang, Yu, Dong, & Han, 2003). Although related to the approach of Ng et al. (1998), it has two significant differences: firstly, it is tuple-based, and secondly, it is able to handle constraints irrespective of their properties. In the Divide-and-Approximate method, the focus is on identifying tuple-based aggregates and grouping them with a "group by" clause; in contrast, Ng et al. (1998) take an item-based approach.

Yet another approach proposed recently is based on the idea of capturing a witness (Kifer, Gehrke, Bucila, & White, 2003). A witness is an itemset that determines whether the constraint holds on any other given itemset, and on this basis a witness can be positive or negative. The disadvantage of this approach is that a witness must be identified separately for each constraint, so it is suitable only for constraints for which witnesses can be identified. For simple constraints this may be practical; however, its extensibility to complex constraints is questionable.

The Frequent Itemset Constraint (FIC) algorithm is another attempt at embedding tough constraints into the mining process. This algorithm arranges the item values in either ascending or descending order, according to the requirements of the constraint, so that tough constraints are converted into anti-monotone constraints that can be embedded into the mining algorithm. It then identifies the frequent itemsets and creates a projected database for each of them. For every projection of the database, it scans and identifies the next level of frequent itemsets that satisfy the constraint, and then projects the database for the resulting itemsets. As illustrated in Figure 2, this process iterates until no itemset satisfies the constraint. This approach has several advantages over Apriori.
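Returning to the first of these approaches, the induced-weak-constraint strategy can be sketched as a two-phase filter: a cheap, item-level check applied during mining, followed by a post-check of the original average constraint on the survivors. The sketch below uses the example given earlier (Avg(S.Price) ≤ C with the induced weaker constraint Min(S.Price) ≤ C); the prices and the list of frequent itemsets are hypothetical.

price = {"a": 10, "b": 25, "c": 40, "d": 55}   # hypothetical item prices
C = 30

def weak_check(S):
    # Induced weaker constraint: Min(S.Price) <= C (cheap, item-level test).
    return min(price[i] for i in S) <= C

def post_check(S):
    # Original tough constraint: Avg(S.Price) <= C.
    return sum(price[i] for i in S) / len(S) <= C

# Assume these frequent itemsets came out of a support-only mining run.
frequent = [{"a"}, {"c"}, {"a", "b"}, {"a", "d"}, {"c", "d"}]

survivors = [S for S in frequent if weak_check(S)]   # pruned during mining
answers = [S for S in survivors if post_check(S)]    # post-checked afterwards
print(survivors)   # itemsets containing at least one item priced at or below 30
print(answers)     # only {'a'} and {'a', 'b'} satisfy the average constraint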
In spite of the highlighted advantages of the FIC approach, its efficiency in a real-world context needs to be evaluated carefully. In the following section we discuss a generic approach to incorporating constraints into Apriori efficiently.

4. EMBEDDING CONSTRAINTS INTO APRIORI

In a very recent piece of research, Pei, Han et al. (2004) asserted that it would not be feasible to incorporate tough constraints into an Apriori-like algorithm, as such constraints do not exhibit the anti-monotone property. However, by re-ordering an attribute value such as item price in ascending or descending order, Jia, Pei et al. (2002) demonstrated that it is feasible to incorporate certain types of constraints into Apriori-like algorithms. Consider Table 2 with the items arranged in descending order of price. Under this ordering, the average price of an itemset can only decrease as further (cheaper) items are appended, so a constraint such as Avg(S.Price) ≥ v behaves anti-monotonically: once an itemset violates it, so does every extension of that itemset. Similarly, in the level-2 itemsets (generated for identifying frequent item pairs) illustrated in Figure 3, constraint checking can be waived for any candidate whose level-1 prefix has already violated the converted constraint.
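To illustrate how such a converted constraint might be pushed into a level-wise, Apriori-style search, the sketch below grows candidates as descending-price prefixes and abandons any prefix whose running average falls below the threshold, so no extension of a violating prefix is ever counted. This is a simplified illustration under assumed data (the transactions, prices, minimum support and threshold are invented); it is not a reproduction of the procedure applied to Tables 1 and 2 in the paper.

price = {"d": 55, "c": 40, "b": 25, "a": 10}   # hypothetical prices
tx = [{"d", "c", "a"}, {"d", "b"}, {"c", "b", "a"}, {"d", "c", "b", "a"}]
minsup, v = 2, 30
order = sorted(price, key=price.get, reverse=True)   # descending price
rank = {i: r for r, i in enumerate(order)}

def support(S):
    return sum(1 for t in tx if S <= t)

def avg_price(S):
    return sum(price[i] for i in S) / len(S)

# Level 1: frequent items that satisfy the converted constraint Avg(S.Price) >= v.
level = [(i,) for i in order if support({i}) >= minsup and price[i] >= v]
answers = list(level)
while level:
    nxt = []
    for S in level:
        # Extend only with strictly cheaper items, keeping descending-price prefixes.
        for i in order[rank[S[-1]] + 1:]:
            cand = S + (i,)
            if avg_price(cand) < v:
                break   # converted anti-monotone check: cheaper extensions also fail
            if support(set(cand)) >= minsup:
                nxt.append(cand)
    answers.extend(nxt)
    level = nxt

print(answers)   # frequent itemsets satisfying Avg(S.Price) >= 30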
5. MULTIPLE TOUGH CONSTRAINTS

When dealing with multiple constraints, it is important first to classify them so that we can evolve a strategy for combining them effectively. There are two different classification schemes. The first was introduced in Section 2, where constraints are categorised according to whether or not they exhibit the anti-monotone property. In this respect there are three possible cases: Case 1, where both constraints are anti-monotone; Case 2, where one is anti-monotone and the other is tough; and Case 3, where both are tough constraints. Incorporating multiple tough constraints into the Apriori algorithm is much more challenging than embedding a single tough constraint. A single tough constraint, once the items have been re-ordered, can fully exploit the anti-monotone property, so constraint checking can be waived for extensions of violating itemsets. Multiple constraints, on the other hand, cannot fully exploit the anti-monotone or monotone properties, as each constraint may exhibit different or conflicting properties. We propose to extend the FIC strategy of Pei, Han et al. (2004) to combine multiple constraints.

The second classification scheme is based on the property of selectivity. A constraint is considered sharp when its selectivity is high and blunt when its selectivity is low. Selectivity, denoted by δ, is defined formally as:

δ = (number of frequent itemsets not satisfying C) / (total number of frequent itemsets)

For instance, consider constraint A as Sum(S.Price) ≥ 30 and constraint B as Avg(S.Price) ≥ 30. Since A is satisfied by 13 of the 31 frequent itemsets generated from the transaction tables (Tables 1 and 2), its selectivity is δsum = 18/31 = 0.5806.
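Selectivity can be computed directly once the frequent itemsets are known. The snippet below does so for the two constraints of the running example; because Tables 1 and 2 are not reproduced here, the frequent itemsets and prices are hypothetical stand-ins rather than the paper's actual 31 itemsets.

price = {"a": 10, "b": 25, "c": 40, "d": 55}   # stand-in prices
# Stand-in frequent itemsets (the paper derives 31 of them from Tables 1 and 2).
frequent = [{"a"}, {"b"}, {"c"}, {"d"}, {"a", "b"}, {"a", "c"}, {"c", "d"}, {"a", "c", "d"}]

def selectivity(constraint, itemsets):
    """delta = (# frequent itemsets not satisfying C) / (total # frequent itemsets)."""
    failing = sum(1 for S in itemsets if not constraint(S))
    return failing / len(itemsets)

sum_ge_30 = lambda S: sum(price[i] for i in S) >= 30
avg_ge_30 = lambda S: sum(price[i] for i in S) / len(S) >= 30

print(selectivity(sum_ge_30, frequent))   # lower value -> blunt constraint
print(selectivity(avg_ge_30, frequent))   # higher value -> sharp constraint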
On the other hand, Avg(S.Price) ≥ 30 is satisfied by only 5 of the 31 frequent itemsets. Hence, the selectivity for Avg(S.Price) ≥ 30 is δavg = 26/31 = 0.8387.

Table 4. Selectivity of constraints

Constraint              Itemsets satisfied   Selectivity δ
A: Sum(S.Price) ≥ 30    13 of 31             18/31 = 0.5806 (blunt)
B: Avg(S.Price) ≥ 30    5 of 31              26/31 = 0.8387 (sharp)

Thus, as shown in Table 4, constraint A (Sum(S.Price) ≥ 30) can be considered a blunt constraint, while constraint B (Avg(S.Price) ≥ 30) can be treated as a sharp constraint.

6. EMBEDDING MULTIPLE TOUGH CONSTRAINTS INTO APRIORI

In this section we outline our approach for embedding multiple constraints. Table 5 presents four possible strategies for combining multiple constraints. Consider, for example, the two constraints Sum(S.Price) ≥ 30 ∨ Avg(S.Price) ≥ 30 on the data set used previously in this paper. As is evident from Section 5, Sum(S.Price) ≥ 30 is a blunt constraint and Avg(S.Price) ≥ 30 is a sharp constraint. Following Strategy 1 and using Tables 1 and 2, the level-1, 2 and 3 itemsets for the blunt constraint Sum(S.Price) ≥ 30 and the sharp constraint Avg(S.Price) ≥ 30 were determined as shown in Table 6. It should be noted that the itemsets generated at level-1 remain the same for both constraints. On the other hand, many more level-2 and level-3 itemsets satisfy the blunt constraint. As the constraints are combined using an OR operator, we need to consider all the itemsets generated by either of them; hence we can only filter out itemsets that violate both constraints.
However, multiple constraints can be combined efficiently when they are joined by an AND operator. Using Strategy 2 for the same constraints, but joined by AND instead of OR, we can demonstrate how the combination is handled efficiently. In Table 6, consider the constraint Sum(S.Price) ≥ 30 ∧ Avg(S.Price) ≥ 30; Strategy 2 enables the amount of constraint checking to be reduced because the sharp constraint is applied first. For instance, in the level-2 itemsets generated, any itemset that fails the sharp constraint Avg(S.Price) ≥ 30 can be discarded immediately, without the blunt constraint Sum(S.Price) ≥ 30 ever being checked.
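One simple way to realise this sharp-constraint-first filtering under an AND combination is sketched below; the level-2 candidate itemsets and prices are stand-ins rather than the actual contents of Table 6.

price = {"a": 10, "b": 25, "c": 40, "d": 55}                           # stand-in prices
level2 = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"c", "d"}, {"a", "d"}]  # stand-in candidates

sharp = lambda S: sum(price[i] for i in S) / len(S) >= 30   # Avg(S.Price) >= 30
blunt = lambda S: sum(price[i] for i in S) >= 30            # Sum(S.Price) >= 30

blunt_checks, result = 0, []
for S in level2:
    if not sharp(S):        # sharp constraint first: most candidates are eliminated here
        continue
    blunt_checks += 1       # blunt constraint is only checked on the survivors
    if blunt(S):
        result.append(S)

print(result)         # itemsets satisfying Sum(S.Price) >= 30 AND Avg(S.Price) >= 30
print(blunt_checks)   # fewer blunt-constraint checks than len(level2)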
The next situation deals with a conflict in the item ordering: one constraint requires a certain ordering, say R1, while the other requires a different ordering, say R2. Strategy 3 describes how to deal with such constraints when they are combined using an OR operator.

Table 5. Strategies for embedding multiple constraints into the Apriori algorithm (extended from Pei et al., 2004)
Table 6. Constraint checking on level-1, 2 and 3 itemsets
Finally, Strategy 4 illustrates how to efficiently combine constraints that have conflicting orderings using the logical AND operator. Strategy 4 can be considered efficient because constraint checking for the blunt constraint is performed only on the itemsets that survive the sharp constraint, which significantly reduces the number of itemsets to be considered in the post-filtering stage.

7. DISCUSSION

The approach proposed in this paper differs significantly from both the standard Apriori and the FIC approaches (Pei et al., 2004). The standard Apriori algorithm checks the entire transaction database, whereas the FIC algorithm checks only the frequent itemsets in the projected database. Our approach embeds the constraints into Apriori itself, thereby reducing the volume of constraint checking involved; however, it does require searching the entire database, in contrast to the FIC algorithm, which checks only the projected database. To our knowledge, no research has been undertaken to date to compare the performance of the Apriori and FIC algorithms, so it remains an open question whether the modified Apriori algorithm will outperform FIC. We intend to undertake an experimental study to compare the two approaches. An exploratory study on identifying convertible anti-monotone constraints would also be useful for analysing the extensibility of the approach proposed in this paper. Future research could focus on how to deal with constraints that are equally sharp or equally blunt, that is, where A has approximately the same selectivity as B. A number of further open questions remain in this area.
8. CONCLUSION

In this paper, an approach to combining multiple constraints efficiently and incorporating them into Apriori has been presented. As already discussed, Apriori is one of the most widely used algorithms for determining association rules. Hence, by adopting the proposed approach, existing techniques can be re-used with only the relatively minor changes required to accommodate tough constraints. As demonstrated, this approach can reduce the volume of constraint checking and is therefore particularly valuable for large databases. Moreover, we presented strategies for combining tough constraints efficiently, including the case where the constraints impose conflicting item orderings.

REFERENCES

Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. SIGMOD Record, 22(2), 207-216.

Huysmans, J., Baesens, B., Mues, C., & Vanthienen, J. (2004). Web usage mining with time constrained association rules. Paper presented at the Sixth International Conference on Enterprise Information Systems (ICEIS 2004), Porto, Portugal.

Jia, L., Pei, R.-Q., & Yao, G.-X. (2003). Using constraint technology to mine frequent datasets. Paper presented at the 2003 International Conference on Machine Learning and Cybernetics.

Jia, L., Pei, R.-Q., & Zhang, S.-Q. (2002). Mining frequent itemsets with tough constraints. Paper presented at the 2002 International Conference on Machine Learning and Cybernetics.

Kifer, D., Gehrke, J., Bucila, C., & White, W. (2003). How to quickly find a witness. Paper presented at the Twenty-second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, San Diego, California.

Lakshmanan, L. V. S., Leung, C. K.-S., & Ng, R. T. (2002). Exploiting succinct constraints using FP-trees. ACM SIGKDD Explorations Newsletter, 4(1), 40-49.

Mobasher, B., Dai, H., Luo, T., & Nakagawa, M. (2001). Effective personalization based on association discovery from web usage data. Paper presented at the Third International Workshop on Web Information and Data Management, Atlanta, Georgia, USA.

Ng, R. T., Lakshmanan, L. V. S., Han, J., & Pang, A. (1998). Exploratory mining and pruning optimizations of constrained association rules. Paper presented at the ACM SIGMOD Conference, Seattle.

Pei, J., & Han, J. (2002). Constrained frequent pattern mining: A pattern-growth view. ACM SIGKDD Explorations Newsletter, 4(1), 31-39.

Pei, J., Han, J., & Lakshmanan, L. V. S. (2001). Mining frequent itemsets with convertible constraints. Paper presented at the 17th International Conference on Data Engineering.

Pei, J., Han, J., & Lakshmanan, L. V. S. (2004). Pushing convertible constraints in frequent itemset mining. Data Mining and Knowledge Discovery, 8(3), 227-252.

Satou, K., Shibayama, G., Ono, T., Yamamura, Y., Furuichi, E., Kuhara, S., et al. (1997). Finding association rules on heterogeneous genome data. Paper presented at the Pacific Symposium on Biocomputing '97, Hawaii.

Srikant, R., Vu, Q., & Agrawal, R. (1997). Mining association rules with item constraints. Paper presented at the 3rd International Conference on Knowledge Discovery and Data Mining.

Wang, K., Jiang, Y., Yu, J. X., Dong, G., & Han, J. (2003). Pushing aggregate constraints by divide-and-approximate. Paper presented at the 19th International Conference on Data Engineering.

Copyright © 2005 Russel Pears & Sangeetha Kutty
Copyright © 2005 NACCQ, Krassie Petrova, Michael Verhaart & Christo Potgieter. All rights reserved.