Mining Sequence Patterns by Using Probabilistic Suffix Tree on Hadoop Platform

Abstract

Sequence pattern mining discovers special, important, and representative features hidden in sequence data. It has attracted considerable attention, especially in bioinformatics and spatio-temporal trajectory mining. Suffix trees, and the Probabilistic Suffix Tree (PST) in particular, are frequently used for this task because they capture the structural characteristics of sequence data at low computational cost, which also makes them suitable for environments with limited computing power or memory.

Recently, with advances in data collection techniques, huge amounts of sequence data have been accumulating rapidly and ubiquitously. However, traditional centralized suffix tree learning algorithms scale poorly and cannot cope with pattern mining on data of this size. In view of this, we propose three distributed and parallel PST building algorithms for the Hadoop platform: CloudPST_Naive, CloudPST, and CloudPST_OneScan. All three build the PST using the Hadoop MapReduce programming model. CloudPST_Naive is an intuitive approach derived from the WordCount example. To overcome the drawbacks of the naive approach, we propose the CloudPST algorithm, which builds a PST progressively and iteratively, several levels per iteration, so that it avoids mining excessive patterns while balancing the overhead of distributed computing. Furthermore, to keep repeated scans of the entire sequence data from degrading overall performance, we propose the CloudPST_OneScan algorithm, which uses a newly designed data structure to hold intermediate statistics, so that each iteration scans the entire sequence data only once to build the PST levels assigned to that iteration.

To compare the proposed algorithms, we conduct extensive experiments. The results show that CloudPST_OneScan outperforms CloudPST_Naive and CloudPST in all respects, and that it offers good efficiency, scalability, and stability.

Table of Contents

Chinese Abstract
Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
  1.1 Research Motivation
  1.2 Research Background
Chapter 2 Related Work
  2.1 Sequence Data Mining
  2.2 Probabilistic Suffix Tree
Chapter 3 Cloud Computing/Hadoop
  3.1 Cloud Computing
  3.2 Hadoop
    3.2.1 HDFS
    3.2.2 MapReduce
Chapter 4 Design of PST Building Algorithms on the Hadoop Platform
  4.1 Parallel Computation of Significant Patterns
  4.2 CloudPST_Naive
  4.3 Design of CloudPST (Cloud Probabilistic Suffix Tree)
    4.3.1 CloudPST Algorithm
  4.4 Design of CloudPST_OneScan
    4.4.1 CloudPST_OneScan Algorithm
Chapter 5 Performance Study
  5.1 Experiment 1: Input Data Size
  5.2 Experiment 2: Number of Mappers
  5.3 Experiment 3: p_min
  5.4 Dynamic Subtree Depth in CloudPST_OneScan
  5.5 Experiment Conclusion
Chapter 6 Conclusion
References
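
As a rough illustration of the WordCount-style counting that the abstract attributes to CloudPST_Naive, the sketch below simulates a single map and reduce pass in plain Python and then derives next-symbol probabilities for sufficiently frequent contexts. It is not the thesis's implementation and does not use Hadoop. The function names, the `max_len` bound, and the way the `p_min` threshold is applied (relative frequency among substrings of the same length) are assumptions made for this example.

```python
from collections import Counter, defaultdict
from itertools import chain


def map_substrings(sequence, max_len):
    """Mapper (hypothetical): emit (substring, 1) for each substring of length 1..max_len."""
    for start in range(len(sequence)):
        for length in range(1, max_len + 1):
            if start + length > len(sequence):
                break
            yield sequence[start:start + length], 1


def reduce_counts(pairs):
    """Reducer (hypothetical): sum the values emitted for each key, as in WordCount."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts


def next_symbol_probs(counts, p_min):
    """Build next-symbol distributions for contexts whose relative frequency is >= p_min.

    Assumption: a context's relative frequency is its count divided by the
    total count of all substrings of the same length.
    """
    totals = Counter()
    for pattern, count in counts.items():
        totals[len(pattern)] += count

    # Group each pattern "context + symbol" under its context.
    followers = defaultdict(dict)
    for pattern, count in counts.items():
        if len(pattern) >= 2:
            followers[pattern[:-1]][pattern[-1]] = count

    model = {}
    for context, nexts in followers.items():
        if counts[context] / totals[len(context)] < p_min:
            continue  # context too rare to keep
        total = sum(nexts.values())
        model[context] = {symbol: c / total for symbol, c in nexts.items()}
    return model


sequences = ["abab", "abb"]
pairs = chain.from_iterable(map_substrings(s, 2) for s in sequences)
counts = reduce_counts(pairs)
model = next_symbol_probs(counts, p_min=0.1)
```

Counting every substring up to `max_len` in one pass is simple, but it also counts many patterns that are later discarded by the threshold. That is the kind of excessive pattern mining the abstract says the iterative CloudPST variants are designed to avoid.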
