
    Sampling time-based sliding windows in bounded space

    Random sampling is an appealing approach to building synopses of large data streams because random samples can be used for a broad spectrum of analytical tasks. Users are often interested in analyzing only the most recent fraction of the data stream in order to avoid outdated results. In this paper, we focus on sampling schemes that sample from a sliding window over a recent time interval; such windows are a popular and highly comprehensible method to model recency. In this setting, the main challenge is to guarantee an upper bound on the space consumption of the sample while at the same time using the allotted space efficiently. The difficulty arises from the fact that the number of items in the window is unknown in advance and may vary significantly over time, so the sampling fraction has to be adjusted dynamically. We consider uniform sampling schemes, which produce every sample of the same size with equal probability, and stratified sampling schemes, in which the window is divided into smaller strata and a uniform sample is maintained per stratum. For uniform sampling, we prove that it is impossible to guarantee a minimum sample size in bounded space. We then introduce a novel sampling scheme called bounded priority sampling (BPS), which requires only bounded space. We derive a lower bound on the expected sample size and show that BPS adapts quickly to changing data rates. For stratified sampling, we propose a merge-based stratification scheme (MBS), which maintains strata of approximately equal size. Compared to naive stratification, MBS has the advantage that the sample is evenly distributed across the window, so that no part of the window is over- or underrepresented. We conclude the paper with a feasibility study of our algorithms on large real-world datasets.
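
    The paper's BPS construction is not reproduced here, but the idea of priority-based sampling over a time-based window can be sketched compactly. The following Python class is our own simplified single-item variant, not the authors' algorithm: each arriving item draws a random priority, and only the highest-priority in-window item plus one later-arrived candidate are kept, so space stays bounded regardless of the arrival rate.

```python
import random

class BoundedPrioritySketch:
    """Simplified single-item priority sampler over a time-based
    sliding window of length `window`. The sample is the in-window
    item with the highest random priority; only the current winner
    and one candidate that arrived after it are stored."""

    def __init__(self, window):
        self.window = window
        self.winner = None      # (timestamp, priority, item)
        self.candidate = None   # best item that arrived after the winner

    def _expire(self, now):
        cutoff = now - self.window
        while self.winner and self.winner[0] <= cutoff:
            # Promote the candidate when the winner leaves the window.
            self.winner, self.candidate = self.candidate, None

    def insert(self, item, now):
        self._expire(now)
        p = random.random()
        if self.winner is None or p > self.winner[1]:
            # A new winner dominates everything before it, so the old
            # candidate can never become the sample again.
            self.winner, self.candidate = (now, p, item), None
        elif self.candidate is None or p > self.candidate[1]:
            self.candidate = (now, p, item)

    def sample(self, now):
        self._expire(now)
        return self.winner[2] if self.winner else None
```

    One way to obtain larger samples under this scheme is to run several independent copies of the structure; the expected-sample-size bound mentioned in the abstract concerns how many such slots hold a valid in-window item at any given time.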

    Deferred Maintenance of Disk-Based Random Samples

    Random sampling is a well-known technique for approximate processing of large datasets. We introduce a set of algorithms for the incremental maintenance of large random samples on secondary storage. We show that the sample maintenance cost can be reduced by refreshing the sample in a deferred manner. We introduce a novel type of log file which follows the intuition that only a “sample” of the operations on the base data has to be considered to maintain a random sample in a statistically correct way. Additionally, we develop a deferred refresh algorithm which updates the sample using fast sequential disk access only and which does not require any main memory. We conducted an extensive set of experiments and found that our algorithms reduce maintenance costs by several orders of magnitude.
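
    The intuition that only a “sample” of the base operations needs to be logged can be illustrated with a small in-memory sketch. This is our simplification, limited to insertions, not the paper's disk-based algorithm: an insertion into a table of current size n is logged only with the reservoir acceptance probability k/n, and a deferred refresh later applies the log in one sequential pass.

```python
import random

class DeferredSampleSketch:
    """Deferred maintenance of a size-k uniform sample, insertions
    only. Logging an insertion with probability k/n and replacing a
    uniformly chosen victim at refresh time reproduces classic
    reservoir sampling, just split into two phases."""

    def __init__(self, k):
        self.k = k
        self.n = 0        # current base-table size
        self.log = []     # deferred operations (the paper's log file
                          # also covers deletions and updates)
        self.sample = []  # materialized sample, e.g. on disk

    def insert(self, item):
        self.n += 1
        # For n <= k the acceptance probability is >= 1, so the
        # first k items are always logged.
        if random.random() < self.k / self.n:
            self.log.append(item)

    def refresh(self):
        # One sequential pass over the log, mimicking the sequential
        # disk access pattern the paper aims for.
        for item in self.log:
            if len(self.sample) < self.k:
                self.sample.append(item)
            else:
                self.sample[random.randrange(self.k)] = item
        self.log.clear()
```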

    Memory-Efficient Frequent-Itemset Mining

    Efficient discovery of frequent itemsets in large datasets is a key component of many data mining tasks. In-core algorithms---which operate entirely in main memory and avoid expensive disk accesses---and in particular the prefix-tree-based algorithm FP-growth are generally among the most efficient of the available algorithms. Unfortunately, their excessive memory requirements render them inapplicable to large datasets with many distinct items and/or itemsets of high cardinality. To overcome this limitation, we propose two novel data structures---the CFP-tree and the CFP-array---which reduce memory consumption by about an order of magnitude. This allows us to process significantly larger datasets in main memory than previously possible. Our data structures are based on structural modifications of the prefix tree that increase compressibility, an optimized physical representation, lightweight compression techniques, and intelligent node ordering and indexing. Experiments with both real-world and synthetic datasets show the effectiveness of our approach.
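
    As a point of reference for what the CFP-tree compresses, here is a plain FP-tree-style prefix tree in Python; this is the baseline structure only, without the paper's structural modifications, physical layout, or compression. Sorting each transaction by descending item frequency is what makes common prefixes shareable.

```python
from collections import Counter

class PrefixTreeNode:
    __slots__ = ("item", "count", "children")

    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}

def build_prefix_tree(transactions, min_support):
    """Two-pass FP-tree-style construction: count item frequencies,
    drop infrequent items, then insert each transaction with its
    items ordered by descending frequency so that prefixes overlap."""
    freq = Counter(i for t in transactions for i in set(t))
    frequent = {i for i, c in freq.items() if c >= min_support}
    root = PrefixTreeNode(None)
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-freq[i], i))
        node = root
        for i in items:
            node = node.children.setdefault(i, PrefixTreeNode(i))
            node.count += 1
    return root
```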

    Hierarchisches gruppenbasiertes Sampling

    In times of growing database sizes, it is crucial to evaluate queries approximately in order to obtain answers quickly. This article introduces several methods for achieving this goal and then focuses on sampling, which quickly yields adequate results from a small subset of the actual data. When database queries contain join or group-by operations, the accuracy of many sampling methods drops sharply; in particular, small groups tend to go undetected. This article is concerned with hierarchical group-based sampling, which combines sampling, grouping, and joins.

    Hierarchical Group-Based Sampling

    Approximate query processing is an adequate technique for reducing response times and system load in cases where approximate results suffice. In the database literature, sampling has been proposed to evaluate queries approximately by using only a subset of the original data. Unfortunately, most of these methods consider either only certain problems arising from the use of samples in databases (e.g., data skew) or only join operations involving multiple relations. We describe how well-known sampling techniques dealing with group-by operations can be combined with foreign-key joins such that the join is computed after the generation of the sample. In detail, we show how senate sampling and small group sampling can be combined efficiently with the idea of join synopses. Additionally, we introduce different algorithms which maintain the sample if the underlying data changes. Finally, we demonstrate the superiority of our method over the naive approach in an extensive set of experiments.
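
    To make the combination concrete, the sketch below is our illustration with invented names and signatures (small group sampling is not shown). It pairs senate sampling, which splits the memory budget evenly across groups and keeps one reservoir per group, with the join-synopsis idea of storing, for every sampled fact row, the dimension row its foreign key references, so the join can be computed on the sample alone.

```python
import random
from collections import defaultdict

def senate_sample_with_join_synopsis(fact_rows, budget,
                                     group_of, fk_of, dimension):
    """Senate sampling: an equal share of the budget per group,
    maintained as one reservoir per group. Join synopsis: each
    sampled fact row is stored with its referenced dimension row."""
    groups = {group_of(r) for r in fact_rows}
    per_group = max(1, budget // max(1, len(groups)))
    reservoirs = defaultdict(list)
    seen = defaultdict(int)
    for row in fact_rows:
        g = group_of(row)
        seen[g] += 1
        res = reservoirs[g]
        if len(res) < per_group:
            res.append(row)
        else:
            # Classic reservoir step within the group's stratum.
            j = random.randrange(seen[g])
            if j < per_group:
                res[j] = row
    # Attach the referenced dimension rows (the join-synopsis part),
    # so foreign-key joins need no access to the base tables.
    return {g: [(r, dimension[fk_of(r)]) for r in res]
            for g, res in reservoirs.items()}
```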

    Linked Bernoulli Synopses: Sampling along Foreign Keys

    Random sampling is a popular technique for providing fast approximate query answers, especially in data warehouse environments. Compared to other types of synopses, random sampling bears the advantage of retaining the dataset’s dimensionality; it also associates probabilistic error bounds with the query results. Most of the available sampling techniques focus on table-level sampling, that is, they produce a sample of only a single database table. Queries that contain joins over multiple tables cannot be answered with such samples because join results on random samples are often small and skewed. In contrast, schema-level sampling techniques support queries containing joins by design. In this paper, we introduce Linked Bernoulli Synopses, a schema-level sampling scheme based upon the well-known Join Synopses. Both schemes rely on the idea of maintaining foreign-key integrity in the synopses; they are therefore suited to processing queries containing arbitrary foreign-key joins. In contrast to Join Synopses, however, Linked Bernoulli Synopses correlate the sampling processes of the different tables in the database so as to minimize the space overhead, without destroying the uniformity of the individual samples. We also discuss how to compute Linked Bernoulli Synopses which maximize the effective sampling fraction for a given memory budget. Computing the optimum solution is often prohibitively expensive, so approximate solutions are needed. We propose a simple heuristic approach which is fast and appears to produce close-to-optimum results in practice. We conclude the paper with an evaluation of our methods on both synthetic and real-world datasets.
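
    The foreign-key-integrity core shared by both schemes is easy to sketch. The following is our illustration with invented table and column names; the correlation trick that distinguishes Linked Bernoulli Synopses from Join Synopses is only hinted at, not implemented: Bernoulli-sample each table, then additionally store every parent row referenced by a sampled child row so that joins on the synopsis never dangle.

```python
import random

def fk_preserving_synopsis(orders, customers, q_orders, q_customers):
    """Bernoulli-samples two tables linked by a foreign key and keeps
    every customer referenced by a sampled order. A customer that is
    both referenced and sampled is stored only once; LBS goes further
    and correlates the two sampling processes so that reference-only
    rows are as rare as possible."""
    u = {c["id"]: random.random() for c in customers}
    order_sample = [o for o in orders if random.random() < q_orders]
    referenced = {o["customer_id"] for o in order_sample}
    customer_sample = [c for c in customers if u[c["id"]] < q_customers]
    # Stored customers: the uniform sample plus reference-only rows.
    stored = {c["id"]: c for c in customers
              if u[c["id"]] < q_customers or c["id"] in referenced}
    return order_sample, customer_sample, stored
```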

    Designing Random Sample Synopses with Outliers

    Random sampling is one of the most widely used means of building synopses of large datasets because random samples can be used for a wide range of analytical tasks. Unfortunately, the quality of the estimates derived from a sample is negatively affected by the presence of 'outliers' in the data. In this paper, we show how to circumvent this shortcoming by constructing outlier-aware sample synopses. Our approach extends the well-known outlier indexing scheme to multiple aggregation columns.
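
    For a single aggregation column, the underlying outlier-indexing idea can be sketched in a few lines; this is our simplification, and the paper's actual contribution, handling several aggregation columns at once, is not shown. The most extreme values are stored exactly, the remainder is sampled uniformly, and the estimate combines an exact and a scaled partial sum.

```python
import random

def outlier_aware_sum_estimate(values, k_outliers, sample_size):
    """SUM estimate from an outlier-aware synopsis: the k_outliers
    values largest in magnitude enter the synopsis exactly; the rest
    is estimated from a uniform sample with the usual scale-up."""
    ranked = sorted(values, key=abs, reverse=True)
    outliers, rest = ranked[:k_outliers], ranked[k_outliers:]
    sample = random.sample(rest, min(sample_size, len(rest)))
    scale = len(rest) / len(sample) if sample else 0.0
    return sum(outliers) + scale * sum(sample)
```

    Because the heavy values no longer pass through the sampling step, the variance of the estimate drops sharply on skewed data.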

    A dip in the reservoir: Maintaining sample synopses of evolving datasets

    Perhaps the most flexible synopsis of a database is a random sample of the data; such samples are widely used to speed up the processing of analytic queries and data-mining tasks, to enhance query optimization, and to facilitate information integration. In this paper, we study methods for incrementally maintaining a uniform random sample of the items in a dataset in the presence of an arbitrary sequence of insertions and deletions. For “stable” datasets whose size remains roughly constant over time, we provide a novel sampling scheme, called “random pairing” (RP), which maintains a bounded-size uniform sample by using newly inserted data items to compensate for previous deletions. The RP algorithm is the first extension of the almost 40-year-old reservoir sampling algorithm to handle deletions. Experiments show that, when dataset-size fluctuations over time are not too extreme, RP is the algorithm of choice with respect to speed and sample-size stability. For “growing” datasets, we consider algorithms for periodically “resizing” a bounded-size random sample upwards. We prove that any such algorithm cannot avoid accessing the base data, and we provide a novel resizing algorithm that minimizes the time needed to increase the sample size.
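
    Random pairing is compact enough to sketch directly; the following Python version is a minimal in-memory rendering without the paper's optimizations. Each insertion is paired with an earlier uncompensated deletion when one exists, and falls back to the classic reservoir step otherwise.

```python
import random

class RandomPairingSampler:
    """Bounded-size uniform sample under insertions and deletions.
    c1 counts uncompensated deletions that hit the sample, c2 those
    that missed it; a new insertion fills one of these holes with
    probability proportional to the respective counter."""

    def __init__(self, k):
        self.k = k
        self.sample = []
        self.n = 0   # current dataset size
        self.c1 = 0  # uncompensated deletions that were in the sample
        self.c2 = 0  # uncompensated deletions that were not

    def insert(self, item):
        self.n += 1
        d = self.c1 + self.c2
        if d == 0:
            # No pending deletions: ordinary reservoir sampling step.
            if len(self.sample) < self.k:
                self.sample.append(item)
            elif random.random() < self.k / self.n:
                self.sample[random.randrange(self.k)] = item
        elif random.random() < self.c1 / d:
            self.sample.append(item)  # pair with an in-sample deletion
            self.c1 -= 1
        else:
            self.c2 -= 1              # pair with an out-of-sample deletion

    def delete(self, item):
        self.n -= 1
        if item in self.sample:       # O(k) scan; fine for a sketch
            self.sample.remove(item)
            self.c1 += 1
        else:
            self.c2 += 1
```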

    Non-uniformity issues and workarounds in bounded-size sampling

    A variety of schemes have been proposed in the literature to speed up query processing and analytics by incrementally maintaining a bounded-size uniform sample from a dataset in the presence of a sequence of insertion, deletion, and update transactions. These algorithms vary according to whether the dataset is an ordinary set or a multiset and whether the transaction sequence consists only of insertions or can include deletions and updates. We report on subtle non-uniformity issues that we found in a number of these prior bounded-size sampling schemes, including some of our own. We provide workarounds that avoid the non-uniformity problem; these workarounds are easy to implement and incur negligible additional cost. We also consider the impact of non-uniformity in practice and describe simple statistical tests that can help detect non-uniformity in new algorithms.
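
    The detection tests themselves are not reproduced in the abstract; one simple test in that spirit is sketched below (our construction, assuming a sampler exposing insert() and a sample attribute, such as the RP sketch above). It runs the algorithm many times over a fixed insertion sequence and compares per-item inclusion counts against the uniform expectation with a chi-square statistic.

```python
from collections import Counter

def uniformity_statistic(make_sampler, items, runs=10000):
    """Runs the sampler `runs` times on the same item sequence and
    returns the chi-square statistic of the per-item inclusion
    counts. Under uniformity every item is included equally often;
    compare the result against a chi-square quantile with
    len(items) - 1 degrees of freedom (e.g. scipy.stats.chi2.ppf)."""
    counts = Counter()
    sample_size = 0
    for _ in range(runs):
        s = make_sampler()
        for x in items:
            s.insert(x)
        counts.update(s.sample)
        sample_size = len(s.sample)
    expected = runs * sample_size / len(items)
    return sum((counts[x] - expected) ** 2 / expected for x in items)

# Example: test the RP sketch with k = 10 on 100 distinct items.
# stat = uniformity_statistic(lambda: RandomPairingSampler(10),
#                             list(range(100)))
```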

    Predictors for rTMS response in chronic tinnitus

    Background: Repetitive transcranial magnetic stimulation (rTMS) has been studied as a treatment option for chronic tinnitus for almost 10 years now. Although most of these studies have demonstrated beneficial effects, treatment results show high interindividual variability, and little is yet known about predictors of treatment response. Methods: Data from 538 patients with chronic tinnitus were analyzed. Patients received either low-frequency rTMS over the left temporal cortex (n = 345; 1 Hz, 110% motor threshold, 2000 stimuli/day) or combined temporal and frontal stimulation (n = 193; 110% motor threshold, 2000 stimuli at 20 Hz over the left dorsolateral prefrontal cortex plus 2000 stimuli at 1 Hz over the temporal cortex). Numerous demographic, clinical, and audiological variables as well as different tinnitus characteristics were analyzed as potential predictors of treatment outcome, which was defined as the change in the tinnitus questionnaire (TQ) score. Results: Both stimulation protocols resulted in a significant decrease of TQ scores; effect sizes were small, however. In the group receiving combined treatment, patients with comorbid temporomandibular complaints benefited more from rTMS than patients without those complaints. In addition, patients with higher TQ scores at baseline had more pronounced TQ reductions than patients with low TQ baseline scores. Also, patients who had already improved from screening to baseline benefited less than patients without initial improvement. Conclusions: The results from this large sample demonstrate that rTMS shows only small but clinically significant effects in the treatment of chronic tinnitus. There are no good demographic or clinical predictors of treatment outcome.