45 research outputs found

    Capturing the Laws of (Data) Nature

    Model fitting is at the core of many scientific and industrial applications. These models encode a wealth of domain knowledge, something a database decidedly lacks. Except for simple cases, databases could not hope to achieve a deeper understanding of the hidden relationships in the data yet. We propose to harvest the statistical models that users fit to the stored data as part of their analysis and use them to advance physical data storage and approximate query answering to unprecedented levels of performance. We motivate our approach with an astronomical use case and discuss its pote

    Exploring Bit-Difference for Approximate KNN Search in High-dimensional Databases

    In this paper, we develop a novel index structure to support efficient approximate k-nearest neighbor (KNN) queries in high-dimensional databases. In high-dimensional spaces, the computational cost of the distance (e.g., Euclidean distance) between two points contributes a dominant portion of the overall query response time for memory processing. To reduce the distance computation, we first propose a structure (BID) that uses BIt-Difference to answer approximate KNN queries. The BID employs one bit to represent each feature of a point's vector, and the number of bit differences is used to prune the more distant points. To handle real datasets, which are typically skewed, we enhance the BID mechanism with clustering, a cluster-adapted bitcoder, and dimensional weights; the enhanced structure is named BID⁺. Extensive experiments show that our proposed method yields significant performance advantages over existing index structures on both real-life and synthetic high-dimensional datasets. Singapore-MIT Alliance (SMA
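
The pruning idea in this abstract can be illustrated with a minimal sketch. It is not the authors' BID structure: the per-dimension mean used as the bit threshold, the fixed candidate count, and the function names `encode`, `bit_difference`, and `approx_knn` are all assumptions made for illustration. Each point is reduced to one bit per dimension, candidates are ranked by the number of differing bits, and exact Euclidean distances are computed only for the surviving candidates.

```python
import math

def encode(point, thresholds):
    """Pack one bit per dimension: 1 if the coordinate is at or above its threshold."""
    code = 0
    for value, threshold in zip(point, thresholds):
        code = (code << 1) | (1 if value >= threshold else 0)
    return code

def bit_difference(a, b):
    """Number of bit positions in which two codes differ (Hamming distance)."""
    return bin(a ^ b).count("1")

def approx_knn(points, query, k, n_candidates):
    """Return indices of approximate k nearest neighbors of `query`."""
    dims = len(query)
    # Assumption: the per-dimension mean is used as the bit threshold.
    thresholds = [sum(p[d] for p in points) / len(points) for d in range(dims)]
    codes = [encode(p, thresholds) for p in points]
    query_code = encode(query, thresholds)
    # Cheap filter: keep only the points whose codes differ least from the query.
    ranked = sorted(range(len(points)),
                    key=lambda i: bit_difference(codes[i], query_code))
    candidates = ranked[:n_candidates]
    # Expensive step: exact Euclidean distance on the candidates only.
    candidates.sort(key=lambda i: math.dist(points[i], query))
    return candidates[:k]

points = [(0, 0), (1, 1), (10, 10), (11, 11)]
print(approx_knn(points, (10.2, 10.1), k=1, n_candidates=2))
```

The result is approximate because a true neighbor can be discarded by the bit filter; a larger `n_candidates` trades speed for recall.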

    Optimizing Sample Design for Approximate Query Processing

    The rapid increase of data volumes makes sampling a crucial component of modern data management systems. Although there is a large body of work on database sampling, the problem of automatically determining the optimal sample for a given query has remained (almost) unaddressed. To tackle this problem, the authors propose a sample advisor based on a novel cost model. Primarily designed for advising samples of a few queries specified by an expert, the authors additionally propose two extensions of the sample advisor. The first extension enhances the applicability by utilizing recorded workload information and taking memory bounds into account. The second extension increases the effectiveness by merging samples in case of overlapping pieces of sample advice. For both extensions, the authors present exact and heuristic solutions. Within their evaluation, the authors analyze the properties of the cost model and demonstrate the effectiveness and the efficiency of the heuristic solutions with a variety of experiments.

    VerdictDB: Universalizing Approximate Query Processing

    Despite 25 years of research in academia, approximate query processing (AQP) has had little industrial adoption. One of the major causes of this slow adoption is the reluctance of traditional vendors to make radical changes to their legacy codebases, and the preoccupation of newer vendors (e.g., SQL-on-Hadoop products) with implementing standard features. Additionally, the few AQP engines that are available are each tied to a specific platform and require users to completely abandon their existing databases, an unrealistic expectation given the infancy of the AQP technology. Therefore, we argue that a universal solution is needed: a database-agnostic approximation engine that will widen the reach of this emerging technology across various platforms. Our proposal, called VerdictDB, uses a middleware architecture that requires no changes to the backend database, and thus, can work with all off-the-shelf engines. Operating at the driver-level, VerdictDB intercepts analytical queries issued to the database and rewrites them into another query that, if executed by any standard relational engine, will yield sufficient information for computing an approximate answer. VerdictDB uses the returned result set to compute an approximate answer and error estimates, which are then passed on to the user or application. However, lack of access to the query execution layer introduces significant challenges in terms of generality, correctness, and efficiency. This paper shows how VerdictDB overcomes these challenges and delivers up to 171× speedup (18.45× on average) for a variety of existing engines, such as Impala, Spark SQL, and Amazon Redshift, while incurring less than 2.6% relative error. VerdictDB is open-sourced under Apache License. Comment: Extended technical report of the paper that appeared in Proceedings of the 2018 International Conference on Management of Data, pp. 1461-1476. ACM, 201
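
The middleware pattern described in this abstract (rewrite an aggregate so that a standard engine returns enough information to estimate the answer and its error) can be sketched generically. This is not VerdictDB's API or its actual rewriting rules: the table names `sales` and `sales_sample`, the Bernoulli sampling scheme, and the helper functions are hypothetical, and SQLite stands in for the backend engine. The rewritten query asks the engine only for `COUNT`, `SUM`, and a sum of squares over the sample; the scaling and the standard error (a Horvitz-Thompson style estimate for Bernoulli sampling) are computed outside the database.

```python
import math
import random
import sqlite3

def build_sample(conn, ratio, seed=0):
    """Materialize a Bernoulli sample of `sales` as `sales_sample` (hypothetical names)."""
    rng = random.Random(seed)
    rows = conn.execute("SELECT price FROM sales").fetchall()
    kept = [row for row in rows if rng.random() < ratio]
    conn.execute("CREATE TABLE sales_sample (price REAL)")
    conn.executemany("INSERT INTO sales_sample VALUES (?)", kept)

def approx_sum(conn, ratio):
    """Approximate SELECT SUM(price) FROM sales using only the sample table."""
    # Rewritten query: plain SQL aggregates that any relational engine can run.
    n, s, s2 = conn.execute(
        "SELECT COUNT(*), SUM(price), SUM(price * price) FROM sales_sample"
    ).fetchone()
    if n == 0:
        return 0.0, 0.0
    estimate = s / ratio                                # scale up by 1 / sampling ratio
    std_error = math.sqrt((1.0 - ratio) * s2) / ratio   # estimated standard error
    return estimate, std_error

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (price REAL)")
conn.executemany("INSERT INTO sales VALUES (?)", [(float(i),) for i in range(1, 101)])
build_sample(conn, ratio=0.5)
estimate, std_error = approx_sum(conn, ratio=0.5)
print(estimate, std_error)  # the exact SUM over the full table is 5050
```

Keeping the estimator outside the engine is what makes such an approach database-agnostic: the backend only needs to support ordinary aggregates.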

    AQUAGP: Approximate QUery Answering Using Genetic Programming

    Speed, cost, and accuracy are crucial performance parameters when evaluating the quality of a query using any Database Management System (DBMS). For some queries it may be possible to approximate the answer using an approximate query answering algorithm or tool. Also, for certain queries, it may not be critical to determine the perfect/exact results so long as the following conditions are true: (a) a high percentage of the relevant data is retrieved correctly, (b) irrelevant or extra data is minimized, and (c) an approximate answer (if available) results in a significant savings in terms of the overall query cost and retrieval time. In this paper we describe a novel approach for approximate query answering using the Genetic Programming (GP) paradigm. We develop an evolutionary-computing-based query space exploration framework. Given an input query and the database schema, our framework uses tree-based GP to automatically generate and evaluate approximate query candidates. We highlight and discuss different avenues we explored. We evaluate the success of our experiments based on the speed, the cost, and the accuracy of the results retrieved by the re-formulated (GP generated) queries and present the results on a variety of query types for TPC-benchmark and PKDD-benchmark datasets.