
    Finding Associations and Computing Similarity via Biased Pair Sampling

    This version is superseded by a full version available at http://www.itu.dk/people/pagh/papers/mining-jour.pdf, which contains stronger theoretical results and fixes a mistake in the reporting of experiments. Abstract: Sampling-based methods have previously been proposed for the problem of finding interesting associations in data, even for low-support items. While these methods do not guarantee precise results, they can be vastly more efficient than approaches that rely on exact counting. However, for many similarity measures no such methods have been known. In this paper we show how a wide variety of measures can be supported by a simple biased sampling method. The method also extends to finding high-confidence association rules. We demonstrate theoretically that our method is superior to exact methods when the threshold for "interesting similarity/confidence" is above the average pairwise similarity/confidence, and the average support is not too low. Our method is particularly good when transactions contain many items. We confirm in experiments on standard association mining benchmarks that this gives a significant speedup on real data sets (sometimes much larger than the theoretical guarantees). Reductions in computation time of over an order of magnitude, and significant savings in space, are observed. Comment: This is an extended version of a paper that appeared at the IEEE International Conference on Data Mining, 2009. The conference version is (c) 2009 IEEE.
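
    The abstract does not spell out the sampling scheme, so the following is only a rough Python sketch of the general idea of biased pair sampling, not the paper's algorithm (whose sampling probabilities depend on the chosen similarity measure and the item supports); the function names, the uniform transaction/pair sampling, and the min_count threshold are illustrative assumptions.

        import random
        from collections import Counter

        def sample_cooccurring_pairs(transactions, num_samples, rng=random):
            # Repeatedly pick a random transaction and a random pair of distinct
            # items inside it; pairs that co-occur in many transactions are
            # sampled more often, biasing the sample towards associated pairs.
            counts = Counter()
            usable = [t for t in transactions if len(t) >= 2]
            for _ in range(num_samples):
                t = rng.choice(usable)
                a, b = rng.sample(sorted(t), 2)
                counts[frozenset((a, b))] += 1
            return counts

        def report_frequent_pairs(transactions, num_samples, min_count):
            # Report pairs whose sample count reaches min_count; these are the
            # candidates for "interesting" association/similarity.
            counts = sample_cooccurring_pairs(transactions, num_samples)
            return {tuple(sorted(p)): c for p, c in counts.items() if c >= min_count}

        if __name__ == "__main__":
            data = [{"a", "b", "c"}, {"a", "b"}, {"b", "c", "d"}, {"a", "b", "d"}]
            print(report_frequent_pairs(data, num_samples=2000, min_count=100))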

    An Optimized In-Network Aggregation Scheme for Data Collection in Periodic Sensor Networks

    In-network data aggregation is considered an effective technique for conserving communication energy in wireless sensor networks. It consists in eliminating the inherent redundancy in raw data collected from the sensor nodes. Prior work on data aggregation protocols has focused on redundancy in the measurement data. In this paper, our goal, in addition to reducing measurement redundancy, is to identify near-duplicate nodes that generate similar data sets. We consider a tree-based, bi-level periodic data aggregation approach implemented at the source-node and aggregator levels. We investigate the problem of finding all pairs of nodes generating similar data sets such that the similarity between each pair of sets is above a threshold t. We propose a new frequency filtering approach and several optimizations using set similarity functions to solve this problem. To evaluate the performance of the proposed filtering method, experiments on real sensor data have been conducted. The obtained results show that our approach offers significant data reduction by eliminating in-network redundancy and outperforms existing filtering techniques.
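
    As a minimal illustration of the aggregator-level problem stated above (finding all node pairs whose data sets are similar above a threshold t), the sketch below uses a brute-force Jaccard comparison; the choice of Jaccard as the set-similarity function and the function names are assumptions, and the paper's frequency-filtering optimizations are not reproduced.

        from itertools import combinations

        def jaccard(a, b):
            # Jaccard similarity of two sets: |a & b| / |a | b|.
            if not a and not b:
                return 1.0
            return len(a & b) / len(a | b)

        def similar_node_pairs(node_sets, t):
            # Naive all-pairs check: return every pair of node ids whose data
            # sets have similarity >= t. The paper's frequency filtering is
            # meant to prune most of these comparisons before this step.
            pairs = []
            for (u, su), (v, sv) in combinations(sorted(node_sets.items()), 2):
                if jaccard(su, sv) >= t:
                    pairs.append((u, v))
            return pairs

        if __name__ == "__main__":
            readings = {"n1": {20.1, 20.2, 20.3}, "n2": {20.1, 20.2, 20.4}, "n3": {35.0, 36.1}}
            print(similar_node_pairs(readings, t=0.4))   # [('n1', 'n2')]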

    A Fast Detection of Duplicates Using Progressive Methods

    Any database contains a large amount of data, and as different people use this data, data-quality problems can arise: similar objects are represented in different forms, called 'duplicates', and identifying these duplicates is a major problem. Nowadays, duplicate-detection methods need to process huge datasets in shorter amounts of time while maintaining the quality of the dataset, which is becoming difficult. In existing systems, duplicate-detection methods such as the Sorted Neighborhood Method (SNM) and blocking methods are used to increase the efficiency of finding duplicate records. In this paper, two new progressive duplicate-detection algorithms are used to increase the efficiency of finding duplicate records and to eliminate the identified duplicates when only limited time is available for the duplicate-detection process. These algorithms increase the overall process gain by delivering complete results faster. This paper compares the two progressive algorithms and presents the results.
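
    The progressive sorted-neighborhood idea referenced above can be sketched as follows: sort the records by a key, then compare pairs at rank distance 1 first, then distance 2, and so on, so the most promising pairs are reported earliest. This is only an illustrative reduction of the approach (the published progressive algorithms add partitioning and window adaptation); the function names, the toy key, and the duplicate test are assumptions.

        def progressive_sorted_neighborhood(records, key, is_duplicate, max_window):
            # Sort record indices by the key, then scan neighbors at increasing
            # rank distance; duplicates among near neighbors are yielded first,
            # so the caller can stop early when the time budget runs out.
            order = sorted(range(len(records)), key=lambda i: key(records[i]))
            for dist in range(1, max_window + 1):
                for pos in range(len(order) - dist):
                    i, j = order[pos], order[pos + dist]
                    if is_duplicate(records[i], records[j]):
                        yield records[i], records[j]

        if __name__ == "__main__":
            recs = ["Jon Smith", "John Smith", "Jane Doe", "J. Smith", "Janet Doe"]
            same_person = lambda a, b: a.split()[-1] == b.split()[-1]  # toy duplicate test
            for a, b in progressive_sorted_neighborhood(recs, key=str, is_duplicate=same_person, max_window=3):
                print(a, "<->", b)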

    EFFICIENT DUPLICATE DETECTION USING PROGRESSIVE ALGORITHMS

    Duplicate detection is the process of identifying multiple representations of the same real-world entities. Today, duplicate-detection methods need to process ever larger datasets in ever shorter time, and maintaining the quality of a dataset becomes increasingly difficult. Two novel, progressive duplicate-detection algorithms significantly increase the ability to find duplicates when the execution time is limited: they maximize the gain of the overall process within the time available by reporting most results much earlier than traditional approaches. Extensive experiments show that our progressive algorithms can double the efficiency over time of traditional duplicate detection and significantly improve upon related work.

    A Progressive Technique for Duplicate Detection Evaluating Multiple Data Using Genetic Algorithm with Real World Objects

    In this paper we discuss and analyze progressive duplicate record detection in real-world data, where a database contains two or more redundant representations of the same record. Duplicate detection is the strategy of recognizing all instances of multiple representations of the same real-world items, for example in customer relationship management or data mining. A representative case is customer relationship management, where a company loses money by sending multiple catalogues to the same person, which also lowers customer satisfaction. Another application is data mining, where correct input data is important to build useful reports that form the basis for decisions. In this paper we study the progressive duplication algorithm, with the assistance of MapReduce, to detect duplicate data and delete those duplicate records.
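
    The abstract mentions using MapReduce to recognize duplicates; below is a minimal single-machine sketch of that pattern (a blocking key as the map output key, pairwise comparison inside each group in the reduce step). All names, the blocking key, and the toy match rule are illustrative assumptions rather than the paper's implementation, and the genetic-algorithm component is not shown.

        from collections import defaultdict
        from itertools import combinations

        def map_phase(records, blocking_key):
            # Map step: group records under a blocking key so that likely
            # duplicates end up in the same group.
            groups = defaultdict(list)
            for rec in records:
                groups[blocking_key(rec)].append(rec)
            return groups

        def reduce_phase(groups, is_duplicate):
            # Reduce step: compare records only within each group and collect
            # the pairs judged to be duplicates.
            dupes = []
            for recs in groups.values():
                for a, b in combinations(recs, 2):
                    if is_duplicate(a, b):
                        dupes.append((a, b))
            return dupes

        if __name__ == "__main__":
            customers = ["John Smith", "Jon Smith", "Jane Doe", "J Smith"]
            block = lambda r: r.split()[-1].lower()   # block on last name
            match = lambda a, b: a[0] == b[0]         # toy match rule: same first initial
            print(reduce_phase(map_phase(customers, block), match))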

    TT-Join: Efficient set containment join

    © 2017 IEEE. In this paper, we study the problem of set containment join. Given two collections R and S of records, the set containment join R ⊆ S retrieves all record pairs {(r, s)} ∈ R × S such that r ⊆ s. This problem has been extensively studied in the literature and has many important applications in commercial and scientific fields. Recent research focuses on in-memory set containment join algorithms, and several techniques have been developed following intersection-oriented or union-oriented computing paradigms. Nevertheless, we observe that the two computing paradigms have their limits due to the nature of the intersection and union operators. In particular, the intersection-oriented method relies on the intersection of the relevant inverted lists built on the elements of S. A nice property of the intersection-oriented method is that the join computation is verification-free. However, the number of records explored during the join process may be large because there are multiple replicas for each record in S. On the other hand, the union-oriented method generates a signature for each record in R, and the candidate pairs are obtained by the union of the inverted lists of the relevant signatures. The candidate size of the union-oriented method is usually small because each record contributes only one replica to the index. Unfortunately, the union-oriented method needs to verify the candidate pairs, which may be expensive, especially when the join result size is large. As a matter of fact, the state-of-the-art union-oriented solution is not competitive compared to the intersection-oriented ones. In this paper, we propose a new union-oriented method, namely TT-Join, which not only enhances the advantage of the previous union-oriented methods but also integrates the goodness of intersection-oriented methods by imposing a variant of prefix tree structure. We conduct extensive experiments on 20 real-life datasets, comparing our method with 7 existing methods. The experimental results demonstrate that TT-Join significantly outperforms the existing algorithms on most of the datasets, and can achieve up to two orders of magnitude speedup.
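
    To make the contrast concrete, here is a plain Python sketch of the intersection-oriented paradigm described above: inverted lists are built on the elements of S, and for each r in R the lists of r's elements are intersected, so the surviving ids are exactly the supersets of r and no verification step is needed. This is not TT-Join or any of the optimized algorithms from the paper; the function names and the handling of empty sets are assumptions.

        from collections import defaultdict

        def containment_join_intersection(R, S):
            # Build inverted lists on the elements of S: element -> ids of the
            # sets in S that contain it.
            inverted = defaultdict(set)
            for sid, s in enumerate(S):
                for e in s:
                    inverted[e].add(sid)
            result = []
            for rid, r in enumerate(R):
                if not r:
                    # The empty set is contained in every set of S.
                    result.extend((rid, sid) for sid in range(len(S)))
                    continue
                # Intersect the inverted lists of r's elements; the survivors
                # are exactly the sets s with r ⊆ s, so no verification is needed.
                candidates = None
                for e in r:
                    lst = inverted[e]
                    candidates = set(lst) if candidates is None else candidates & lst
                    if not candidates:
                        break
                result.extend((rid, sid) for sid in sorted(candidates or ()))
            return result

        if __name__ == "__main__":
            R = [{1, 2}, {3}]
            S = [{1, 2, 3}, {2, 3}, {1, 2}]
            print(containment_join_intersection(R, S))   # [(0, 0), (0, 2), (1, 0), (1, 1)]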