77 research outputs found

    Mining local staircase patterns in noisy data

    Most traditional biclustering algorithms identify biclusters with little or no overlap. In this paper, we introduce the problem of identifying staircases of biclusters. Such staircases may be indicative of causal relationships between columns and cannot easily be identified by existing biclustering algorithms. Our formalization relies on a scoring function based on the Minimum Description Length principle. Furthermore, we propose a first algorithm for identifying staircase biclusters, based on a combination of local search and constraint programming. Experiments show that the approach is promising.
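    The abstract does not give the scoring function itself, but the MDL idea it mentions can be illustrated with a toy two-part code: the cost of describing a bicluster plus the cost of describing the cells that deviate from it. The function name and the one-bit-per-item encoding below are illustrative assumptions, not the paper's actual formalization.

```python
def mdl_score(matrix, rows, cols):
    """Toy two-part MDL score for one bicluster in a 0/1 matrix:
    model cost (row/column membership flags) plus the cost of
    encoding cells that disagree with the all-ones bicluster model.
    Lower scores mean the bicluster compresses the data better."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    # Model cost: one bit per row and per column membership flag.
    model_bits = n_rows + n_cols
    # Error cost: one bit per cell inside the bicluster that is 0,
    # and per cell outside it that is 1 (deviations from the model).
    errors = 0
    for i in range(n_rows):
        for j in range(n_cols):
            inside = i in rows and j in cols
            if matrix[i][j] != (1 if inside else 0):
                errors += 1
    return model_bits + errors

m = [[1, 1, 0],
     [1, 1, 0],
     [0, 0, 0]]
print(mdl_score(m, {0, 1}, {0, 1}))  # perfect 2x2 bicluster: 6 model bits, 0 error bits
```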

    New approaches for clustering high dimensional data

    Clustering is one of the most effective methods for analyzing datasets that contain a large number of objects with numerous attributes. Clustering seeks to identify groups, or clusters, of similar objects. In low dimensional space, the similarity between objects is often evaluated by summing the differences across all of their attributes. High dimensional data, however, may contain irrelevant attributes which mask the existence of clusters. The discovery of groups of objects that are highly similar within some subsets of relevant attributes therefore becomes an important but challenging task. My thesis focuses on various models and algorithms for this task. We first present a flexible clustering model, namely OP-Cluster (Order Preserving Cluster). Under this model, two objects are similar on a subset of attributes if the values of these two objects induce the same relative ordering of these attributes. The OP-Clustering algorithm has proved useful for identifying co-regulated genes in gene expression data. We also propose a semi-supervised approach to discover biologically meaningful OP-Clusters by incorporating existing gene function classifications into the clustering process. This semi-supervised algorithm yields only OP-Clusters that are significantly enriched by genes from specific functional categories. Real datasets are often noisy. We propose a noise-tolerant clustering algorithm, approximate frequent itemsets (AFI), for mining frequently occurring itemsets. Both theoretical and experimental results demonstrate that our AFI mining algorithm has higher recoverability of real clusters than existing itemset mining approaches. Pair-wise dissimilarities are often derived from the original data to reduce the complexity of high dimensional data. Traditional clustering algorithms that take pair-wise dissimilarities as input often generate disjoint clusters. It is well known that the classification model represented by disjoint clusters is inconsistent with many real classifications, such as gene function classifications. We develop a Poclustering algorithm, which generates overlapping clusters from pair-wise dissimilarities. We prove that by allowing overlapping clusters, Poclustering fully preserves the information of any dissimilarity matrix, while traditional partitioning algorithms may cause significant information loss.
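    The order-preserving similarity at the heart of the OP-Cluster model can be sketched in a few lines: two objects are similar on a subset of attributes when sorting those attributes by each object's values yields the same ranking. This is a minimal illustration of the idea only; the function and attribute names are invented, and ties between values are not handled.

```python
def same_relative_order(x, y, attrs):
    """Return True if objects x and y (attribute -> value mappings)
    induce the same relative ordering of the given attributes,
    the similarity notion underlying the OP-Cluster model."""
    rank_x = sorted(attrs, key=lambda a: x[a])
    rank_y = sorted(attrs, key=lambda a: y[a])
    return rank_x == rank_y

# Two "genes" on different expression scales but with the same ordering t1 < t3 < t2.
g1 = {"t1": 0.2, "t2": 0.9, "t3": 0.5}
g2 = {"t1": 1.0, "t2": 7.0, "t3": 3.0}
print(same_relative_order(g1, g2, ["t1", "t2", "t3"]))  # True
```

    Because only the ordering matters, objects with very different magnitudes can still cluster together, which is why the model suits co-regulated genes whose expression rises and falls in the same pattern.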

    When Things Matter: A Data-Centric View of the Internet of Things

    With the recent advances in radio-frequency identification (RFID), low-cost wireless sensor devices, and Web technologies, the Internet of Things (IoT) approach has gained momentum in connecting everyday objects to the Internet and facilitating machine-to-human and machine-to-machine communication with the physical world. While IoT offers the capability to connect and integrate both digital and physical entities, enabling a whole new class of applications and services, several significant challenges need to be addressed before these applications and services can be fully realized. A fundamental challenge centers around managing IoT data, typically produced in dynamic and volatile environments, which is not only extremely large in scale and volume but also noisy and continuous. This article surveys the main techniques and state-of-the-art research efforts in IoT from a data-centric perspective, including data stream processing, data storage models, complex event processing, and searching in IoT. Open research issues for IoT data management are also discussed.

    Mining Top-K Patterns from Binary Datasets in presence of Noise

    New rule induction algorithms with improved noise tolerance and scalability

    As data storage capacities continue to increase due to rapid advances in information technology, there is a growing need for scalable data mining algorithms able to sift through large volumes of data in a short amount of time. Moreover, unlike artificially prepared data, real-world data is inherently imperfect due to the presence of noise. Consequently, there is also a need for robust algorithms capable of handling noise, so that the discovered patterns are reliable and have good predictive performance on future data. This has led to ongoing research in data mining aimed at developing algorithms that are both scalable and robust. The most straightforward approach to scalability is to develop efficient algorithms that can process large datasets in a relatively short time. Efficiency may be achieved by employing suitable rule mining constraints that drastically cut down the search space. The first part of this thesis focuses on the improvement of a state-of-the-art rule induction algorithm, RULES-6, which incorporates certain search space pruning constraints in order to scale to large datasets. However, these constraints are insufficient and have not been fully exploited, resulting in the generation of overly specific rules, which increases both the learning time and the length of the rule set. To address these issues, a new algorithm, RULES-7, is proposed, which uses deep rule mining constraints from association learning. This results in a significant drop in execution time for large datasets while boosting the classification accuracy of the model on future data. A novel comparison heuristic is also proposed for the algorithm, which improves classification accuracy while maintaining the execution time.
    Since an overwhelming majority of induction algorithms are unable to handle the continuous data ubiquitous in the real world, it is also necessary to develop an efficient discretisation procedure whereby continuous attributes can be treated as discrete. By generalizing the raw continuous data, discretisation helps to speed up the induction process and results in a simpler, more intelligible model that is also more accurate on future data. Many preprocessing discretisation techniques have been proposed to date, of which the entropy-based technique is widely accepted as the most accurate. However, the technique is suboptimal for classification because it fails to identify the cut points within the value range of each class for a continuous attribute, which deteriorates its classification accuracy. The second part of this thesis presents a new discretisation technique which uses the entropy-based principle but takes a class-centered approach to discretisation. The proposed technique not only increases the efficiency of rule induction but also improves the classification accuracy of the induced model. Another issue with existing induction algorithms relates to the way covered examples are dealt with when a new rule is formed. To avoid problems such as fragmentation and small disjuncts, the RULES family of algorithms marks covered examples instead of removing them, which tends to increase overlap between rules. The third part of this thesis proposes a new hybrid pruning technique to address this overlap and reduce the rule set size. It also proposes an incremental post-pruning technique designed specifically to handle noisy data. This leads to improved induction performance as well as better classification accuracy.
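    As a point of reference for the discretisation discussion, the baseline entropy-based criterion the thesis builds on can be sketched as follows: choose the cut point on a continuous attribute that minimises the weighted class-label entropy of the resulting binary split. This is a simplified single-cut illustration of the standard criterion, not the thesis's class-centered technique, and all names are invented for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    """Return the midpoint cut on one continuous attribute that
    minimises the weighted entropy of the two resulting partitions
    (the classic entropy-based discretisation criterion)."""
    pairs = sorted(zip(values, labels))
    best = (float("inf"), None)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        best = min(best, (weighted, cut))
    return best[1]

# Classes "a" and "b" separate cleanly between 3.0 and 8.0, so the cut lands at 5.5.
print(best_cut([1.0, 2.0, 3.0, 8.0, 9.0], ["a", "a", "a", "b", "b"]))  # 5.5
```

    The weakness the thesis targets is visible even here: the criterion evaluates cuts globally and may miss boundaries internal to a single class's value range, which is what a class-centered approach sets out to recover.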