
    Optimizing Sample Design for Approximate Query Processing

    The rapid increase of data volumes makes sampling a crucial component of modern data management systems. Although there is a large body of work on database sampling, the problem of automatically determining the optimal sample for a given query has remained (almost) unaddressed. To tackle this problem, the authors propose a sample advisor based on a novel cost model. The advisor is primarily designed to recommend samples for a few queries specified by an expert; the authors additionally propose two extensions. The first extension enhances the applicability by utilizing recorded workload information and taking memory bounds into account. The second extension increases the effectiveness by merging samples in case of overlapping pieces of sample advice. For both extensions, the authors present exact and heuristic solutions. Within their evaluation, the authors analyze the properties of the cost model and demonstrate the effectiveness and the efficiency of the heuristic solutions with a variety of experiments.

    Linked Bernoulli Synopses: Sampling along Foreign Keys

    Random sampling is a popular technique for providing fast approximate query answers, especially in data warehouse environments. Compared to other types of synopses, random sampling bears the advantage of retaining the dataset’s dimensionality; it also associates probabilistic error bounds with the query results. Most of the available sampling techniques focus on table-level sampling, that is, they produce a sample of only a single database table. Queries that contain joins over multiple tables cannot be answered with such samples because join results on random samples are often small and skewed. By contrast, schema-level sampling techniques support queries containing joins by design. In this paper, we introduce Linked Bernoulli Synopses, a schema-level sampling scheme based upon the well-known Join Synopses. Both schemes rely on the idea of maintaining foreign-key integrity in the synopses; they are therefore suited to process queries containing arbitrary foreign-key joins. In contrast to Join Synopses, however, Linked Bernoulli Synopses correlate the sampling processes of the different tables in the database so as to minimize the space overhead, without destroying the uniformity of the individual samples. We also discuss how to compute Linked Bernoulli Synopses which maximize the effective sampling fraction for a given memory budget. Computing the optimum solution is often prohibitively expensive, so approximate solutions are needed. We propose a simple heuristic approach which is fast and seems to produce close-to-optimum results in practice. We conclude the paper with an evaluation of our methods on both synthetic and real-world datasets.
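
The claim that joins over independent table-level samples come out small can be illustrated with a minimal sketch. It is not the Linked Bernoulli Synopses algorithm; it only contrasts independent Bernoulli samples with a simple foreign-key-preserving sample in the spirit of Join Synopses. The two-table schema, the sizes, and the sampling rate `q` are assumptions made up for the illustration.

```python
import random

def bernoulli_sample(rows, q, rng):
    """Keep each row independently with probability q."""
    return [r for r in rows if rng.random() < q]

def join_count(fact_rows, dim_keys):
    """Count fact rows whose foreign key has a partner in dim_keys."""
    return sum(1 for _, fk in fact_rows if fk in dim_keys)

rng = random.Random(0)

# Hypothetical schema: dim(key) and fact(id, fk -> dim.key).
dim = list(range(1000))
fact = [(i, rng.randrange(1000)) for i in range(20000)]
q = 0.1

# (a) Table-level sampling: both tables are sampled independently,
#     so a sampled fact row finds its partner only with probability q.
fact_s = bernoulli_sample(fact, q, rng)
dim_s = set(bernoulli_sample(dim, q, rng))
independent_join = join_count(fact_s, dim_s)

# (b) Foreign-key-preserving sampling: sample the fact table, then
#     add every dimension row that a sampled fact row references.
linked_dim = {fk for _, fk in fact_s}
linked_join = join_count(fact_s, linked_dim)

print(len(fact_s), independent_join, linked_join)
```

In variant (b) every sampled fact row keeps its join partner, so the join over the synopsis has as many rows as the fact sample itself, while variant (a) retains only roughly a fraction `q` of them.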

    A Sample Advisor for Approximate Query Processing

    The rapid growth of current data warehouse systems makes random sampling a crucial component of modern data management systems. Although there is a large body of work on database sampling, the problem of automatic sample selection has remained (almost) unaddressed. In this paper, we tackle the problem with a sample advisor. We propose a cost model to evaluate a sample for a given query. Based on this, our sample advisor determines the optimal set of samples for a given set of queries specified by an expert. We further propose an extension to utilize recorded workload information. In this case, the sample advisor takes the set of queries and a given memory bound into account when computing its sample advice. Additionally, we consider merging samples in case of overlapping sample advice and present both an exact and a heuristic solution. Within our evaluation, we analyze the properties of the cost model and compare the proposed algorithms. We further demonstrate the effectiveness and the efficiency of the heuristic solutions with a variety of experiments.

    A synopsis based approach for XML fast approximate querying

    In the last few years, XML has spread to many application fields, and today it is used as a format to exchange data on the web and to ensure interoperability among applications. Due to this success, the W3C has proposed a new query language, XQuery [25], specifically designed to query XML data. XQuery is a well-defined but rather complex language [14]. In this work we propose a new approach to overcome the problem of the high computational costs required by aggregate queries over massive XML data collections. In traditional relational warehouses [11] a similar problem is solved by means of fast approximate queries, which use concise data statistics based on histograms or on other statistical techniques. Their most common application is for aggregate queries in modern decision support systems, where large volumes of data need to be queried and quick, interactive responses from the DBMS are required, e.g., to analyze the data in the warehouse in order to get trend information for evaluating marketing strategies. In such applications, users are often more interested in obtaining an approximate answer computed in a short time than an exact one obtained in some minutes or, at worst, hours.
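
The histogram-based approximation mentioned for relational warehouses can be sketched as follows. This is a generic equi-width histogram, not the XML synopsis proposed in the paper; the function names, the bucket count, and the toy data are assumptions chosen for the illustration.

```python
def build_histogram(values, lo, hi, buckets):
    """Summarize values in [lo, hi) as equi-width bucket counts."""
    width = (hi - lo) / buckets
    counts = [0] * buckets
    for v in values:
        # Clamp so that a value equal to hi lands in the last bucket.
        idx = min(int((v - lo) / width), buckets - 1)
        counts[idx] += 1
    return counts, width

def approx_range_count(counts, lo, width, a, b):
    """Estimate how many values fall in [a, b), assuming values are
    spread uniformly inside each bucket."""
    total = 0.0
    for i, c in enumerate(counts):
        b_lo = lo + i * width
        b_hi = b_lo + width
        overlap = max(0.0, min(b, b_hi) - max(a, b_lo))
        total += c * overlap / width
    return total

# Toy data: the integers 0..99, summarized by 10 buckets.
values = list(range(100))
counts, width = build_histogram(values, 0, 100, 10)

estimate = approx_range_count(counts, 0, width, 25, 75)
exact = sum(1 for v in values if 25 <= v < 75)
print(estimate, exact)
```

The aggregate is answered from ten counters instead of a scan over the data. The estimate is only as good as the uniformity assumption inside each bucket, which is why it matches the exact count on this evenly spread toy data but would deviate on skewed data.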

    AQUAGP: Approximate QUery Answering Using Genetic Programming

    Speed, cost, and accuracy are crucial performance parameters when evaluating the quality of a query in any Database Management System (DBMS). For some queries it may be possible to approximate the answer using an approximate query answering algorithm or tool. Also, for certain queries, it may not be critical to determine the exact results so long as the following conditions are true: (a) a high percentage of the relevant data is retrieved correctly, (b) irrelevant or extra data is minimized, and (c) an approximate answer (if available) results in significant savings in terms of the overall query cost and retrieval time. In this paper we describe a novel approach for approximate query answering using the Genetic Programming (GP) paradigm. We develop an evolutionary-computing-based query space exploration framework. Given an input query and the database schema, our framework uses tree-based GP to automatically generate and evaluate approximate query candidates. We highlight and discuss different avenues we explored. We evaluate the success of our experiments based on the speed, the cost, and the accuracy of the results retrieved by the reformulated (GP-generated) queries and present the results on a variety of query types for TPC-benchmark and PKDD-benchmark datasets.