Mining Frequent Itemsets with an Effective Clustering Method

Abstract

Mining frequent itemsets is a key problem in many important data mining applications, such as the discovery of association rules, sequential pattern mining, and network intrusion analysis. However, discovering all frequent itemsets in a large database is quite challenging. Traditional algorithms solve this problem with a bottom-up, breadth-first approach: they generate candidates level by level, using the frequent itemsets found at one level to construct the candidate itemsets for the next. This approach has the drawback of scanning the database multiple times to verify candidate itemsets. When there are many candidate itemsets, the cost of these scans can become very high and mining efficiency suffers. The goal of our research is therefore to develop a more efficient algorithm for mining frequent itemsets.

In this research project, we propose a new algorithm that divides database items into clusters and discovers the frequent itemsets in each cluster by using the support-propagation method. Unlike the traditional algorithms, the support-propagation method examines database items in a top-down manner, and as a result it scans the database only once. However, it may generate too many itemsets while decomposing the transactions from the database. We therefore propose a clustering technique to alleviate the space problem of the support-propagation method. Furthermore, to verify the mining performance of the proposed approach, we will design a simulation platform to compare our top-down approach against the traditional bottom-up algorithms, analyze the experimental data, and summarize the characteristics and applicability of the two approaches as a basis for follow-up research.
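The contrast between the two families of algorithms can be illustrated with a minimal Python sketch. It is not the project's algorithm: `apriori` is a textbook-style level-wise miner, and `single_scan_decompose` is a hypothetical stand-in that only assumes "decomposing the transactions" means enumerating every non-empty subset of each transaction in one pass. The function names, the absolute `min_support` threshold, and the returned counters are all assumptions made for illustration, and the proposed clustering step is not modeled.

```python
from collections import Counter
from itertools import combinations


def apriori(transactions, min_support):
    """Bottom-up, breadth-first mining: one database scan per level.

    Returns (frequent itemsets with their support, number of scans).
    """
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]
    frequent, scans, k = {}, 0, 1
    while candidates:
        scans += 1  # each level re-reads the whole database
        counts = Counter()
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        unions = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = [
            c for c in unions
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent, scans


def single_scan_decompose(transactions, min_support):
    """Illustrative single-scan miner (an assumption, not the proposed method).

    Every transaction is decomposed into all of its non-empty subsets, so the
    database is read once, but the number of stored itemsets can grow
    exponentially with transaction length.
    Returns (frequent itemsets with their support, number of itemsets stored).
    """
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, len(items) + 1):
            for subset in combinations(items, k):
                counts[frozenset(subset)] += 1
    frequent = {s: n for s, n in counts.items() if n >= min_support}
    return frequent, len(counts)


if __name__ == "__main__":
    db = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c", "d"]]
    print(apriori(db, 3))
    print(single_scan_decompose(db, 3))
```

Both functions return the same frequent itemsets on the same input. The level-wise version pays for this with repeated scans, while the single-scan version pays with the number of itemsets it has to keep in memory, which is the space problem the clustering technique is meant to reduce.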
