73 research outputs found

    Sequence Classification Based on Delta-Free Sequential Pattern

    Get PDF
    International audienceSequential pattern mining is one of the most studied and challenging tasks in data mining. However, the extension of well-known methods from many other classical patterns to sequences is not a trivial task. In this paper we study the notion of δ-freeness for sequences. While this notion has extensively been discussed for itemsets, this work is the first to extend it to sequences. We define an efficient algorithm devoted to the extraction of δ-free sequential patterns. Furthermore, we show the advantage of the δ-free sequences and highlight their importance when building sequence classifiers, and we show how they can be used to address the feature selection problem in statistical classifiers, as well as to build symbolic classifiers which optimizes both accuracy and earliness of predictions

    Machine Learning Approaches for Breast Cancer Survivability Prediction

    Get PDF
    Breast cancer is one of the leading causes of cancer death in women. If not diagnosed early, the 5-year survival rate of patients is just about 26\%. Furthermore, patients with similar phenotypes can respond differently to the same therapies, which means the therapies might not work well for some of them. Identifying biomarkers that can help predict a cancer class with high accuracy is at the heart of breast cancer studies because they are targets of the treatments and drug development. Genomics data have been shown to carry useful information for breast cancer diagnosis and prognosis, as well as uncovering the disease’s mechanism. Machine learning methods are powerful tools to find such information. Feature selection methods are often utilized in supervised learning and unsupervised learning tasks to deal with data containing a large number of features in which only a small portion of them are useful to the classification task. On the other hand, analyzing only one type of data, without reference to the existing knowledge about the disease and the therapies, might mislead the findings. Effective data integration approaches are necessary to uncover this complex disease. In this thesis, we apply and develop machine learning methods to identify meaningful biomarkers for breast cancer survivability prediction after a certain treatment. They include applying feature selection methods on gene-expression data to derived gene-signatures, where the initial genes are collected concerning the mechanism of some drugs used breast cancer therapies. We also propose a new feature selection method, named PAFS, and apply it to discover accurate biomarkers. In addition, it has been increasingly supported that, sub-network biomarkers are more robust and accurate than gene biomarkers. We proposed two network-based approaches to identify sub-network biomarkers for breast cancer survivability prediction after a treatment. They integrate gene-expression data with protein-protein interactions during the optimal sub-network searching process and use cancer-related genes and pathways to prioritize the extracted sub-networks. The sub-network search space is usually huge and many proteins interact with thousands of other proteins. Thus, we apply some heuristics to avoid generating and evaluating redundant sub-networks

    Discovering Interesting Patterns and Associations in Data Streams

    Get PDF
    A data stream is a sequence of items that arrive in a timely order. Different from data in traditional static databases, data streams are continuous, unbounded, usually come with high speed, and have a data value distribution that often changes with time (Guha, 2001). As more applications such as web transactions, telephone records, and network flows generate a large number of data streams every day, efficient knowledge discovery of data streams is an active and growing research area in data mining with broad applications. Traditional data mining algorithms are developed to work on a complete static dataset and, thus, cannot be applied directly in data stream applications.One area of data mining research is to mine association relationship in a data set. Most of association mining techniques for data streams can be categorized into two types: those developed based on frequent patterns and those developed based on closed patterns. Due to the number of frequent patterns are often huge and redundant, non-informative patterns are contained in frequent patterns. An alternative way is to develop the association mining approaches for data streaming applications based on closed patterns, which generally represent a small subset of all frequent patterns, but provide complete and condensed information. In these researches, the closed pattern mining is the prerequisite condition for non-redundant and informative association mining.In this dissertation, a sliding window technique for dynamic mining of closed patterns in data streams is proposed, and an approach of mining non-redundant and informative associations based on the discovered closed patterns is developed. The closed pattern and relevant association mining techniques are selected research area in this dissertation. First, the closed patterns for a given collection of data are currently the most compact data knowledge that can provide complete support information for all data patterns.Compared with other techniques, the proposed closed pattern mining technique has potential to largely decrease the number of subsequent combinatorial calculations performed on the data patterns. Second, the memory requirement to store the closed patterns and relevant associations is generally lower than the corresponding frequent patterns and associations. In some data streaming applications, memory usage is an important measurement, because in these applications memory usage is the bottleneck for knowledge discovery. Third, the associations generated for data streams are the knowledge used to identify the relations within the data. The discovered relations can find their wide applications in many data streaming environments.Different from the closed pattern mining techniques on traditional databases, which require multiple scans of the entire database, the proposed technique determines the closed patterns with a single scan. It is an incremental mining process; as the sliding window advances, new data transactions enter and old data transactions exit the window. But instead of regenerating closed patterns from the entire window, the proposed technique updates the old set of closed patterns whenever a new transaction arrives and/or an old transaction leaves the sliding window to obtain the current set of closed patterns. This incremental feature allows the user to get the most recent updated closed patterns without rescanning the entire updated database, which saves not only the computation time, but more importantly, the I/O operating time to load and write data from database to memory. Third, the proposed sliding window technique can handle both the insertion and deletion operations independently, which allows the user to adjust the sliding window size in different application environments. Furthermore, the proposed interesting patterns and association mining framework can handle different users' requests at the same time at their specified support and confidence thresholds, and interested input and output patterns.The research includes both theoretical proofs of correctness for the proposed algorithms and simulation experiments to compare the proposed techniques with those existing in the literature using synthetic and real datasets. The utility of the proposed technique is applied to sensor network databases of a traffic management and an environmental monitoring site for missing data estimation purpose

    Mining iterative generators and representative rules for software specification discovery

    Get PDF
    10.1109/TKDE.2010.24IEEE Transactions on Knowledge and Data Engineering232282-296ITKE
    • …
    corecore