205,876 research outputs found

    Finding Temporal Patterns in Noisy Longitudinal Data: A Study in Diabetic Retinopathy

    Get PDF
    This paper describes an approach to temporal pattern mining using the concept of user defined temporal prototypes to define the nature of the trends of interests. The temporal patterns are defined in terms of sequences of support values associated with identified frequent patterns. The prototypes are defined mathematically so that they can be mapped onto the temporal patterns. The focus for the advocated temporal pattern mining process is a large longitudinal patient database collected as part of a diabetic retinopathy screening programme, The data set is, in itself, also of interest as it is very noisy (in common with other similar medical datasets) and does not feature a clear association between specific time stamps and subsets of the data. The diabetic retinopathy application, the data warehousing and cleaning process, and the frequent pattern mining procedure (together with the application of the prototype concept) are all described in the paper. An evaluation of the frequent pattern mining process is also presented

    Mining substructures in protein data

    Get PDF
    In this paper we consider the 'Prions' database that describes protein instances stored for Human Prion Proteins. The Prions database can be viewed as a database of rooted ordered labeled subtrees. Mining frequent substructures from tree databases is an important task and it has gained a considerable amount of interest in areas such as XML mining, Bioinformatics, Web mining etc. This has given rise to the development of many tree mining algorithms which can aid in structural comparisons, association rule discovery and in general mining of tree structured knowledge representations. Previously we have developed the MB3 tree mining algorithm, which given a minimum support threshold, efficiently discovers all frequent embedded subtrees from a database of rooted ordered labeled subtrees. In this work we apply the algorithm to the Prions database in order to extract the frequently occurring patterns, which in this case are of induced subtree type. Obtaining the set of frequent induced subtrees from the Prions database can potentially reveal some useful knowledge. This aspect will be demonstrated by providing an analysis of the extracted frequent subtrees with respect to discovering interesting protein information. Furthermore, the minimum support threshold can be used as the controlling factor for answering specific queries posed on the Prions dataset. This approach is shown to be a viable technique for mining protein data

    Directed Graph based Distributed Sequential Pattern Mining Using Hadoop MapReduce

    Get PDF
    Usual sequential pattern mining algorithms experiences the scalability problem when trade with very big data sets. In existing systems like PrefixSpan, UDDAG major time is needed to generate projected databases like prefix and suffix projected database from given sequential database. In DSPM (Distributed Sequential Pattern Mining) Directed Graph is introduced to generate prefix and suffix projected database which reduces the execution time for scanning large database. In UDDAG, for each unique id UDDAG is created to find next level sequential patterns. So it requires maximum storage for each UDDAG. In DSPM single directed graph is used to generate projected database and finding patterns. To improve the scanning time and scalability problem we introduce a distributed sequential pattern mining algorithm on Hadoop platform using MapReduce programming model. We use transformed database to reduce scanning time and directed graph to optimize the memory storage. Mapper is used to construct prefix and suffix projected databases for each length-1 frequent item parallel. The Reducer combines all intermediary outcomes to get final sequential patterns. Experiment results are compared against UDDAG, different values of minimum support, different massive data sets and with and without Hadoop platform which improves the scaling and speed performances. Experimental results show that DSPM using Hadoop MapReduce solves the scaling problem as well as storage problem of UDDAG. DOI: 10.17762/ijritcc2321-8169.15020

    A Three-phased Online Association Rule Mining Approach for Diverse Mining Requests

    Get PDF
    In the past, most incremental mining and online mining algorithms considered finding the set of association rules or patterns consistent with the entire set of data inserted so far. Users can not easily obtain the results from their only interested portion of data. For providing ad-hoc, query-driven and online mining supports, we first propose a relation called multidimensional pattern relation to structurally and systematically store the context information and the mining information for later analysis. Each tuple in the relation comes from an inserted dataset in the database. This concept is similar to the construction of a data warehouse for OLAP. However, unlike the summarized information of fact attributes in a data warehouse, the mined patterns in the multidimensional pattern relation can not be directly aggregated to satisfy users’ mining requests. We then develop an online mining approach called Three-phased Online Association Rule Mining (TOARM) based on the proposed multidimensional pattern relation to support online generation of association rules under multidimensional considerations. Experiments for both homogeneous and heterogeneous datasets are made, with results showing the effectiveness of the proposed approach

    Mining High Utility Patterns Over Data Streams

    Get PDF
    Mining useful patterns from sequential data is a challenging topic in data mining. An important task for mining sequential data is sequential pattern mining, which discovers sequences of itemsets that frequently appear in a sequence database. In sequential pattern mining, the selection of sequences is generally based on the frequency/support framework. However, most of the patterns returned by sequential pattern mining may not be informative enough to business people and are not particularly related to a business objective. In view of this, high utility sequential pattern (HUSP) mining has emerged as a novel research topic in data mining recently. The main objective of HUSP mining is to extract valuable and useful sequential patterns from data by considering the utility of a pattern that captures a business objective (e.g., profit, users interest). In HUSP mining, the goal is to find sequences whose utility in the database is no less than a user-specified minimum utility threshold. Nowadays, many applications generate a huge volume of data in the form of data streams. A number of studies have been conducted on mining HUSPs, but they are mainly intended for non-streaming data and thus do not take data stream characteristics into consideration. Mining HUSP from such data poses many challenges. First, it is infeasible to keep all streaming data in the memory due to the high volume of data accumulated over time. Second, mining algorithms need to process the arriving data in real time with one scan of data. Third, depending on the minimum utility threshold value, the number of patterns returned by a HUSP mining algorithm can be large and overwhelms the user. In general, it is hard for the user to determine the value for the threshold. Thus, algorithms that can find the most valuable patterns (i.e., top-k high utility patterns) are more desirable. Mining the most valuable patterns is interesting in both static data and data streams. To address these research limitations and challenges, this dissertation proposes techniques and algorithms for mining high utility sequential patterns over data streams. We work on mining HUSPs over both a long portion of a data stream and a short period of time. We also work on how to efficiently identify the most significant high utility patterns (namely, the top-k high utility patterns) over data streams. In the first part, we explore a fundamental problem that is how the limited memory space can be well utilized to produce high quality HUSPs over the entire data stream. An approximation algorithm, called MAHUSP, is designed which employs memory adaptive mechanisms to use a bounded portion of memory, to efficiently discover HUSPs over the entire data streams. The second part of the dissertation presents a new sliding window-based algorithm to discover recent high utility sequential patterns over data streams. A novel data structure named HUSP-Tree is proposed to maintain the essential information for mining recenT HUSPs. An efficient and single-pass algorithm named HUSP-Stream is proposed to generate recent HUSPs from HUSP-Tree. The third part addresses the problem of top-k high utility pattern mining over data streams. Two novel methods, named T-HUDS and T-HUSP, for finding top-k high utility patterns over a data stream are proposed. T-HUDS discovers top-k high utility itemsets and T-HUSP discovers top-k high utility sequential patterns over a data stream. T-HUDS is based on a compressed tree structure, called HUDS-Tree, that can be used to efficiently find potential top-k high utility itemsets over data streams. T-HUSP incrementally maintains the content of top-k HUSPs in a data stream in a summary data structure, named TKList, and discovers top-k HUSPs efficiently. All of the algorithms are evaluated using both synthetic and real datasets. The performances, including the running time, memory consumption, precision, recall and Fmeasure, are compared. In order to show the effectiveness and efficiency of the proposed methods in reallife applications, the fourth part of this dissertation presents applications of one of the proposed methods (i.e., MAHUSP) to extract meaningful patterns from a real web clickstream dataset and a real biosequence dataset. The utility-based sequential patterns are compared with the patterns in the frequency/support framework. The results show that high utility sequential pattern mining provides meaningful patterns in real-life applications
    • …
    corecore