    Summarizing data with representative patterns

    University of Technology Sydney. Faculty of Engineering and Information Technology. Advances in technology have made data acquisition and storage unprecedentedly convenient. This has contributed to rapid growth not only in the volume but also in the veracity and variety of data in recent years, which poses new challenges for data mining. For example, uncertain data mining has emerged because of its capability to model the inherent veracity of data, and spatial data mining has attracted much research attention with the spread of location-based services and wearable devices. How to summarize data effectively and efficiently in this setting, a fundamental topic of data mining, remains to be explored. This thesis studies the problem of summarizing data with representative patterns. The objective is to find a set of patterns that is much more concise than the original data but still retains rich information about it, and that may provide valuable insights for further analysis. In light of this idea, we formally formulate the problem and provide effective and efficient solutions in various scenarios. We study the problem of summarizing probabilistic frequent patterns over uncertain data. Probabilistic frequent pattern mining over uncertain data has received much research attention because of the wide applicability of uncertain data. It suffers from generating an exponential number of result patterns, which hinders pattern analysis and calls for a small number of representative patterns that approximate all other patterns. We formally formulate the problem of probabilistic representative frequent pattern (P-RFP) mining, which aims to find the minimal set of patterns with sufficiently high probability to represent all other patterns. The bottleneck turns out to be checking whether a pattern can probabilistically represent another, which involves the computation of a joint probability of the supports of two patterns.
We propose a novel dynamic programming-based approach to address the problem and devise effective optimization strategies to improve computational efficiency. To make P-RFP mining more practical, we introduce a novel approximation of the joint probability, supported by both theoretical and empirical justification. Based on the approximation, we propose an Approximate P-RFP Mining (APM) algorithm, which effectively and efficiently compresses the probabilistic frequent pattern set. The error rate of APM is guaranteed to be very small when the database contains hundreds of transactions, which further affirms that APM is a practical solution for summarizing probabilistic frequent patterns. We address the problem of directly summarizing an uncertain transaction database by formulating it as Minimal Probabilistic Tile Cover Mining, which aims to find a high-quality probabilistic tile set covering an uncertain database with minimal cost. We define the concepts of Probabilistic Price and Probabilistic Price Order to evaluate and compare the quality of tiles, and propose a framework to discover the minimal probabilistic tile cover. The bottleneck is checking whether one tile is better than another according to the Probabilistic Price Order, which involves the computation of a joint probability. We prove that it can be decomposed into independent terms and calculated efficiently. Several optimization techniques are devised to further improve performance. We analyze the problem of summarizing co-locations mined from spatial databases. Co-location pattern mining finds patterns of spatial features whose instances tend to be located together in geographic space. However, the traditional framework of co-location pattern mining produces an exponential number of patterns because of the downward closure property, which makes it difficult for users to understand, assess or apply the huge number of resulting patterns.
To address this issue, we study the problem of mining representative co-location patterns (RCP). We first define a covering relationship between two co-location patterns and then formally formulate the problem of Representative Co-location Pattern mining. To solve the problem of RCP mining, we propose the RCPFast algorithm, which adopts the post-mining framework, and the RCPMS algorithm, which pushes pattern summarization into the co-location mining process.
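
As a rough illustration of the probabilistic frequent pattern setting mentioned in the abstract above, the sketch below computes the probability that a single pattern is frequent in an uncertain database. It is not the thesis's P-RFP or APM algorithm, which deals with the joint probability of two patterns' supports. It assumes, purely for illustration, that each transaction contains the pattern independently with a known probability, and it uses a standard dynamic program over the support count. The names `support_distribution` and `frequent_probability` are hypothetical.

```python
from typing import List


def support_distribution(probs: List[float]) -> List[float]:
    """Return dist where dist[k] = P(support == k).

    Assumption (not from the source): transaction i contains the pattern
    independently with probability probs[i].
    """
    dist = [1.0]  # with zero transactions, the support is 0 with certainty
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)  # transaction does not contain the pattern
            nxt[k + 1] += mass * p      # transaction contains the pattern
        dist = nxt
    return dist


def frequent_probability(probs: List[float], min_sup: int) -> float:
    """Return P(support >= min_sup) under the independence assumption."""
    return sum(support_distribution(probs)[min_sup:])
```

Under this assumption, a pattern would count as probabilistically frequent when `frequent_probability(probs, min_sup)` meets a user-chosen probability threshold. The dynamic program takes time quadratic in the number of transactions.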

    Empirical Risk Minimization for Probabilistic Grammars: Sample Complexity and Hardness of Learning

    Probabilistic grammars are generative statistical models that are useful for compositional and sequential structures. They are used ubiquitously in computational linguistics. We present a framework, reminiscent of structural risk minimization, for empirical risk minimization of probabilistic grammars using the log-loss. We derive sample complexity bounds in this framework that apply both to the supervised setting and the unsupervised setting. By making assumptions about the underlying distribution that are appropriate for natural language scenarios, we are able to derive distribution-dependent sample complexity bounds for probabilistic grammars. We also give simple algorithms for carrying out empirical risk minimization using this framework in both the supervised and unsupervised settings. In the unsupervised case, we show that the problem of minimizing empirical risk is NP-hard. We therefore suggest an approximate algorithm, similar to expectation-maximization, to minimize the empirical risk. Learning from data is central to contemporary computational linguistics. It is common in such learning to estimate a model in a parametric family using the maximum likelihood principle. This principle applies in the supervised case (i.e., using annotate

    A taxonomy for emergency service station location problem

    The emergency service station (ESS) location problem has been widely studied in the literature since the 1970s, and interest in the subject has grown especially since the 1990s. Various models with different objective functions and constraints have been proposed in the academic literature, and efficient solution techniques have been developed to provide good solutions in reasonable times. However, no study systematically classifies the different problem types and the methodologies used to address them. This paper presents a taxonomic framework for the ESS location problem from an operations research perspective. In this framework, we consider the type of emergency, the objective function, constraints, model assumptions, modeling, and solution techniques. We also analyze a variety of papers from the literature in order to demonstrate the effectiveness of the taxonomy and to gain insights into possible research directions.