Search CORE

239,460 research outputs found

APPLYING THE ATTRIBUTED PREFIX TREE FOR MINING CLOSED SEQUENTIAL PATTERNS

Author: Thiet Pham Thi
Publication venue: 'Publishing House for Science and Technology, Vietnam Academy of Science and Technology'
Publication date: 20/03/2018
Field of study

Mining closed sequential patterns is one of important tasks in data mining. It is proposed to resolve difficult problems in mining sequential pattern such as mining long frequent sequences that contain a combinatorial number of frequent subsequences or using very low support thresholds to mine sequential patterns is usually both time- and memory-consuming. This paper applies the characteristics of closed sequential patterns and sequence extensions into the prefix tree structure to mine closed sequential patterns from the sequence database. The paper uses the parent–child relationship on prefix tree structure and each node on prefix tree is also added fields to determine whether that is a closed sequential pattern or not. Experimental results show that the number of sequential patterns is reduced significantly

Vietnam Academy of Science and Technology: Journals Online

Mining very long sequences with PLWAPLong algorithms

Author: Saeed Kashif
Publication venue: 'University of Windsor Leddy Library'
Publication date: 01/01/2009
Field of study

Sequential pattern mining is the process of finding inter-transaction frequent sequential patterns from a sequential database, where records consist of ordered sets of events (or items), by applying data mining techniques on such sequential databases. Discovering sequential patterns in web server logs is an example application of sequential mining, which is useful for predicting visiting patterns of web users for such purposes as targeted advertisements. Position Coded Pre-order Linked Web Access Pattern (PLWAP) mining algorithm is one of the existing efficient web sequential pattern mining algorithms, which stores the frequently stored sequences of the entire sequential database in a compressed tree form with position coded nodes. However, for very long sequences exceeding thirty two nodes, the number of bits an integer position code can hold, the PLWAP algorithm\u27s performance begins to degrade because it employs linked lists to store conjunctions of long position codes and the linked list traversals slow down the algorithm both during tree construction and mining. PLWAP algorithm also uses each and every node in the frequent 1-item event queue to test for that event inclusion in the suffix tree root set during mining. This is a very expensive operation since except for one node all other nodes that are its ancestors and descendents are not included in the root set. This thesis proposes two new algorithms, i.e. PLWAPLong1 and PLWAPLong2. Both of these new algorithms use a new position code numbering scheme where each node is assigned two numeric variables (startPosition, endPosition) instead of one. Using this scheme we can determine the ancestor node in O (1) operation by comparing the startPosition and endPosition of two nodes. PLWAPLong1 algorithm also proposes transforming the linked list based tree to an equivalent array representation and using binary search to find the immediate descendant in a suffix tree. PLWAPLong2 uses existing linked list based tree. Both PLWAPLong1 and PLWAPLong2 algorithms introduce a new technique called Last Descendant to eliminate the unwanted nodes from ancestor/descendent test when creating the suffix tree root set. Keywords: Data mining, Web Mining, Association Rule Mining, Long Sequences, PLWAP Minin

Classification with Single Constraint Progressive Mining of Sequential Patterns

Author: Saptawati Putri
Sitohang Benhard
Yasmin Regina Yulia
Publication venue: 'Institute of Advanced Engineering and Science'
Publication date: 01/08/2017
Field of study

Classification based on sequential pattern data has become an important topic to explore. One of research has been carried was the Classify-By-Sequence, CBS. CBS classified data based on sequential patterns obtained from AprioriLike sequential pattern mining. Sequential patterns obtained were called CSP, Classifiable Sequential Patterns. CSP was used as classifier rules or features for the classification task. CBS used AprioriLike algorithm to search for sequential patterns. However, AprioriLike algorithm took a long time to search for them. Moreover, not all sequential patterns were important for the user. In order to get the right and meaningful features for classification, user uses a constraint in sequential pattern mining. Constraint is also expected to reduce the number of sequential patterns that are short and less meaningful to the user. Therefore, we developed CBS_CLASS* with Single Constraint Progressive Mining of Sequential Patterns or Single Constraint PISA or PISA*. CBS_Class* with PISA* was proven to classify data in faster time since it only processed lesser number of sequential patterns but still conform to user’s need. The experiment result showed that compared to CBS_CLASS, CBS_Class* reduced the classification execution time by 89.8%. Moreover, the accuracy of the classification process can still be maintained.

ZENODO

Institute of Advanced Engineering and Science

WildSpan: mining structured motifs from protein sequences

Author: Chen Chien-Yu
Hsu Chen-Ming
Liu Baw-Jhiune
Publication venue: BioMed Central
Publication date: 01/01/2011
Field of study

Abstract Background Automatic extraction of motifs from biological sequences is an important research problem in study of molecular biology. For proteins, it is desired to discover sequence motifs containing a large number of wildcard symbols, as the residues associated with functional sites are usually largely separated in sequences. Discovering such patterns is time-consuming because abundant combinations exist when long gaps (a gap consists of one or more successive wildcards) are considered. Mining algorithms often employ constraints to narrow down the search space in order to increase efficiency. However, improper constraint models might degrade the sensitivity and specificity of the motifs discovered by computational methods. We previously proposed a new constraint model to handle large wildcard regions for discovering functional motifs of proteins. The patterns that satisfy the proposed constraint model are called W-patterns. A W-pattern is a structured motif that groups motif symbols into pattern blocks interleaved with large irregular gaps. Considering large gaps reflects the fact that functional residues are not always from a single region of protein sequences, and restricting motif symbols into clusters corresponds to the observation that short motifs are frequently present within protein families. To efficiently discover W-patterns for large-scale sequence annotation and function prediction, this paper first formally introduces the problem to solve and proposes an algorithm named WildSpan (sequential pattern mining across large wildcard regions) that incorporates several pruning strategies to largely reduce the mining cost. Results WildSpan is shown to efficiently find W-patterns containing conserved residues that are far separated in sequences. We conducted experiments with two mining strategies, protein-based and family-based mining, to evaluate the usefulness of W-patterns and performance of WildSpan. The protein-based mining mode of WildSpan is developed for discovering functional regions of a single protein by referring to a set of related sequences (e.g. its homologues). The discovered W-patterns are used to characterize the protein sequence and the results are compared with the conserved positions identified by multiple sequence alignment (MSA). The family-based mining mode of WildSpan is developed for extracting sequence signatures for a group of related proteins (e.g. a protein family) for protein function classification. In this situation, the discovered W-patterns are compared with PROSITE patterns as well as the patterns generated by three existing methods performing the similar task. Finally, analysis on execution time of running WildSpan reveals that the proposed pruning strategy is effective in improving the scalability of the proposed algorithm. Conclusions The mining results conducted in this study reveal that WildSpan is efficient and effective in discovering functional signatures of proteins directly from sequences. The proposed pruning strategy is effective in improving the scalability of WildSpan. It is demonstrated in this study that the W-patterns discovered by WildSpan provides useful information in characterizing protein sequences. The WildSpan executable and open source codes are available on the web (<url>http://biominer.csie.cyu.edu.tw/wildspan</url>).</p

Springer - Publisher Connector

Directory of Open Access Journals