
    Implications of probabilistic data modeling for rule mining

    Mining association rules is an important technique for discovering meaningful patterns in transaction databases. In the current literature, the properties of algorithms to mine associations are discussed in great detail. In this paper we investigate properties of transaction data sets from a probabilistic point of view. We present a simple probabilistic framework for transaction data and its implementation using the R statistical computing environment. The framework can be used to simulate transaction data when no associations are present. We use such data to explore how well confidence and lift, two popular interest measures used for rule mining, filter noise. Based on the framework we develop the measure hyperlift, and we compare this new measure to lift using simulated data and a real-world grocery database. Series: Research Report Series / Department of Statistics and Mathematic
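To make the idea of transaction data "when no associations are present" concrete, here is a minimal Python sketch rather than the paper's R implementation. It assumes each item occurs independently with a fixed probability and that the number of transactions is fixed; the function names and the item probabilities are hypothetical and chosen only for illustration.

```python
import random


def simulate_transactions(item_probs, n_transactions, seed=0):
    """Draw transactions in which every item occurs independently."""
    rng = random.Random(seed)
    return [
        frozenset(item for item, p in item_probs.items() if rng.random() < p)
        for _ in range(n_transactions)
    ]


def count(transactions, itemset):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)


def confidence_and_lift(transactions, lhs, rhs):
    """Return (confidence, lift) of the rule lhs => rhs, or None if undefined."""
    n = len(transactions)
    c_lhs = count(transactions, lhs)
    c_rhs = count(transactions, rhs)
    c_both = count(transactions, lhs | rhs)
    if c_lhs == 0 or c_rhs == 0:
        return None
    confidence = c_both / c_lhs       # estimate of P(rhs | lhs)
    lift = confidence / (c_rhs / n)   # close to 1 when lhs and rhs are independent
    return confidence, lift


if __name__ == "__main__":
    # Hypothetical item probabilities: two frequent items and two rare ones.
    probs = {"a": 0.5, "b": 0.3, "c": 0.01, "d": 0.01}
    data = simulate_transactions(probs, 10_000)
    print("a => b:", confidence_and_lift(data, frozenset("a"), frozenset("b")))
    print("c => d:", confidence_and_lift(data, frozenset("c"), frozenset("d")))
```

Because the items are generated independently, any lift far from 1 in such data is noise. Rules built from rare items rest on very small counts, so their lift can stray far from 1 by chance alone.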

    New probabilistic interest measures for association rules

    Mining association rules is an important technique for discovering meaningful patterns in transaction databases. Many different measures of interestingness have been proposed for association rules. However, these measures fail to take the probabilistic properties of the mined data into account. In this paper, we start by presenting a simple probabilistic framework for transaction data which can be used to simulate transaction data when no associations are present. We use such data and a real-world database from a grocery outlet to explore the behavior of confidence and lift, two popular interest measures used for rule mining. The results show that confidence is systematically influenced by the frequency of the items in the left-hand side of rules and that lift performs poorly at filtering random noise in transaction data. Based on the probabilistic framework we develop two new interest measures, hyper-lift and hyper-confidence, which can be used to filter or order mined association rules. The new measures show significantly better performance than lift for applications where spurious rules are problematic.
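The abstract does not spell out the two measures. The sketch below assumes the definitions commonly given for them: under independence, the co-occurrence count C_XY of a rule X => Y follows a hypergeometric distribution determined by the number of transactions m and the counts c_X and c_Y. Hyper-confidence is then P(C_XY < c_XY), and hyper-lift is c_XY divided by the delta-quantile of C_XY (delta = 0.99 is an assumed default). All function names are hypothetical.

```python
from math import comb


def hypergeom_pmf(r, m, c_x, c_y):
    """P(C_XY = r) when X and Y occur independently in m transactions."""
    if r < max(0, c_x + c_y - m) or r > min(c_x, c_y):
        return 0.0
    return comb(c_y, r) * comb(m - c_y, c_x - r) / comb(m, c_x)


def hypergeom_quantile(delta, m, c_x, c_y):
    """Smallest r such that P(C_XY <= r) >= delta."""
    upper = min(c_x, c_y)
    cumulative = 0.0
    for r in range(upper + 1):
        cumulative += hypergeom_pmf(r, m, c_x, c_y)
        if cumulative >= delta:
            return r
    return upper  # reached only through floating-point rounding


def hyper_confidence(c_xy, m, c_x, c_y):
    """P(C_XY < c_xy) under the independence model."""
    return sum(hypergeom_pmf(r, m, c_x, c_y) for r in range(c_xy))


def hyper_lift(c_xy, m, c_x, c_y, delta=0.99):
    """Observed co-occurrence count relative to the delta-quantile under independence."""
    q = hypergeom_quantile(delta, m, c_x, c_y)
    if q == 0:
        return float("inf") if c_xy > 0 else float("nan")
    return c_xy / q


if __name__ == "__main__":
    # Hypothetical counts: 10,000 transactions, X in 100 of them, Y in 100, both in 5.
    print(hyper_confidence(5, 10_000, 100, 100))
    print(hyper_lift(5, 10_000, 100, 100))
```

Dividing by a high quantile rather than by the expected count is what makes hyper-lift stricter than lift for rare items: a rule must exceed what chance alone produces in almost all cases before its value rises above 1.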

    Data stream analysis in sliding windows: random sampling and other problems

    In many data stream applications we need to perform some analysis in a "window" or subsequence of contiguous elements, quite often the last M elements seen or the elements seen in the last X time units. For example, we might be interested in obtaining a random sample of the distinct elements seen in the last 10 minutes, or in estimating how many distinct elements have been processed among the last 100,000 processed items. Given the restrictions on processing time and available memory, exact solutions become infeasible, and we seek randomized algorithms which are fast, have low memory requirements and provide probabilistic guarantees. In this project we will implement some of the algorithms available in the literature and conduct extensive experiments to assess their performance and compare their relative merits; we will also develop novel algorithms or variants of existing algorithms and compare them with the state-of-the-art solutions. We will mostly focus on algorithms to obtain random samples, a fundamental task for more complex statistical inference: detecting outliers, finding frequent items, detecting unusual patterns, etc.
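As an illustration of sampling from "the last M elements seen", the following Python sketch uses priority sampling, a known technique from the sliding-window literature. It is not necessarily one of the algorithms the project studies. Every element receives a random priority, and the sample is the element with the smallest priority in the window. The class name is hypothetical, and the sketch samples stream positions, not distinct values.

```python
import random
from collections import deque


class SlidingWindowSampler:
    """Keep one uniform random sample from the last `window` stream elements."""

    def __init__(self, window, seed=None):
        if window <= 0:
            raise ValueError("window must be positive")
        self.window = window
        self.rng = random.Random(seed)
        self.index = 0
        # Entries are (index, priority, item); priorities increase front to back.
        self.candidates = deque()

    def add(self, item):
        priority = self.rng.random()
        # An older element with a larger priority can never become the minimum
        # again, because the new element outlives it in the window.
        while self.candidates and self.candidates[-1][1] >= priority:
            self.candidates.pop()
        self.candidates.append((self.index, priority, item))
        self.index += 1
        # Drop candidates that have fallen out of the window.
        while self.candidates[0][0] < self.index - self.window:
            self.candidates.popleft()

    def sample(self):
        """Return the element with the smallest priority in the current window."""
        if not self.candidates:
            raise LookupError("no elements seen yet")
        return self.candidates[0][2]


if __name__ == "__main__":
    sampler = SlidingWindowSampler(window=1000, seed=7)
    for value in range(100_000):
        sampler.add(value)
    print(sampler.sample())  # some value among the last 1000 elements
```

The priorities are independent and identically distributed, so every element in the window is equally likely to hold the minimum. The deque only stores elements that could still become the minimum, which keeps its expected length logarithmic in the window size.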

    Optimal Elephant Flow Detection

    Monitoring the traffic volumes of elephant flows, including the total byte count per flow, is a fundamental capability for online network measurements. We present an asymptotically optimal algorithm for solving this problem in terms of both space and time complexity. This improves on previous approaches, which can only count the number of packets in constant time. We evaluate our work on real packet traces, demonstrating a speedup of up to 2.5x compared to the best alternative. Comment: Accepted to IEEE INFOCOM 201
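To show what tracking per-flow byte counts in bounded memory can look like, here is a Python sketch of the classic Space-Saving summary extended to weighted (byte-count) updates. It is not the paper's algorithm: this naive version scans all counters on eviction, so an update costs O(capacity) time where the paper claims constant time. The class name, flow identifiers, and packet sizes are hypothetical.

```python
class WeightedSpaceSaving:
    """Approximate per-flow byte counts using at most `capacity` counters."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.counters = {}  # flow id -> estimated byte count (never an underestimate)

    def update(self, flow, nbytes):
        if flow in self.counters:
            self.counters[flow] += nbytes
        elif len(self.counters) < self.capacity:
            self.counters[flow] = nbytes
        else:
            # Evict the smallest counter; the new flow inherits its value, so the
            # estimate may overshoot the true count by at most that value.
            victim = min(self.counters, key=self.counters.get)
            inherited = self.counters.pop(victim)
            self.counters[flow] = inherited + nbytes

    def heavy_flows(self, threshold):
        """Flows whose estimated byte count is at least `threshold`."""
        return {f: c for f, c in self.counters.items() if c >= threshold}


if __name__ == "__main__":
    summary = WeightedSpaceSaving(capacity=3)
    packets = [("A", 1500), ("B", 40), ("A", 1500), ("C", 60),
               ("D", 40), ("A", 1500), ("E", 52)]
    for flow, size in packets:
        summary.update(flow, size)
    print(summary.heavy_flows(threshold=4000))
```

The counters always sum to the total number of bytes seen, so the smallest counter is at most total / capacity. Any flow carrying more than that share of the traffic is therefore guaranteed to be monitored.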