993 research outputs found

    Deductive Optimization of Relational Data Storage

    Full text link
    Optimizing the physical data storage and retrieval of data are two key database management problems. In this paper, we propose a language that can express a wide range of physical database layouts, going well beyond the row- and column-based methods that are widely used in database management systems. We use deductive synthesis to turn a high-level relational representation of a database query into a highly optimized low-level implementation which operates on a specialized layout of the dataset. We build a compiler for this language and conduct experiments using a popular database benchmark, which shows that the performance of these specialized queries is competitive with a state-of-the-art in memory compiled database system

    Four Lessons in Versatility or How Query Languages Adapt to the Web

    Get PDF
    Exposing not only human-centered information, but machine-processable data on the Web is one of the commonalities of recent Web trends. It has enabled a new kind of applications and businesses where the data is used in ways not foreseen by the data providers. Yet this exposition has fractured the Web into islands of data, each in different Web formats: Some providers choose XML, others RDF, again others JSON or OWL, for their data, even in similar domains. This fracturing stifles innovation as application builders have to cope not only with one Web stack (e.g., XML technology) but with several ones, each of considerable complexity. With Xcerpt we have developed a rule- and pattern based query language that aims to give shield application builders from much of this complexity: In a single query language XML and RDF data can be accessed, processed, combined, and re-published. Though the need for combined access to XML and RDF data has been recognized in previous work (including the W3C’s GRDDL), our approach differs in four main aspects: (1) We provide a single language (rather than two separate or embedded languages), thus minimizing the conceptual overhead of dealing with disparate data formats. (2) Both the declarative (logic-based) and the operational semantics are unified in that they apply for querying XML and RDF in the same way. (3) We show that the resulting query language can be implemented reusing traditional database technology, if desirable. Nevertheless, we also give a unified evaluation approach based on interval labelings of graphs that is at least as fast as existing approaches for tree-shaped XML data, yet provides linear time and space querying also for many RDF graphs. We believe that Web query languages are the right tool for declarative data access in Web applications and that Xcerpt is a significant step towards a more convenient, yet highly efficient data access in a “Web of Data”

    Factorised Representations of Query Results

    Full text link
    Query tractability has been traditionally defined as a function of input database and query sizes, or of both input and output sizes, where the query result is represented as a bag of tuples. In this report, we introduce a framework that allows to investigate tractability beyond this setting. The key insight is that, although the cardinality of a query result can be exponential, its structure can be very regular and thus factorisable into a nested representation whose size is only polynomial in the size of both the input database and query. For a given query result, there may be several equivalent representations, and we quantify the regularity of the result by its readability, which is the minimum over all its representations of the maximum number of occurrences of any tuple in that representation. We give a characterisation of select-project-join queries based on the bounds on readability of their results for any input database. We complement it with an algorithm that can find asymptotically optimal upper bounds and corresponding factorised representations.Comment: 44 pages, 13 figure

    AQUAGP: Approximate QUery Answering Using Genetic Programming

    Get PDF
    Speed, cost, and accuracy are crucial performance parameters while evaluating the quality of a query using any Database Management System (DBMS). For some queries it may be possible to approximate the answer using an approximate query answering algorithm or tool. Also, for certain queries, it may not be critical to determine the perfect/exact results so long as the following conditions are true: (a) a high percentage of the relevant data is retrieved correctly, (b) irrelevant or extra data is minimized, and (c) an approximate answer (if available) results in a significant savings in terms of the overall query cost and retrieval time. In this paper we describe a novel approach for approximate query answering using the Genetic Programming (GP) paradigms. We develop an evolutionary computing based query space exploration framework. Given an input query and the database schema, our framework uses tree-based GP to automatically generate and evaluate approximate query candidates. We highlight and discuss different avenues we explored. We evaluate the success of our experiments based on the speed, the cost, and the accuracy of the results retrieved by the re-formulated (GP generated) queries and present the results on a variety of query types for TPC-benchmark and PKDD-benchmark datasets

    Compile-Time Query Optimization for Big Data Analytics

    Get PDF
    Many emerging programming environments for large-scale data analysis, such as Map-Reduce, Spark, and Flink, provide Scala-based APIs that consist of powerful higher-order operations that ease the development of complex data analysis applications. However, despite the simplicity of these APIs, many programmers prefer to use declarative languages, such as Hive and Spark SQL, to code their distributed applications. Unfortunately, most current data analysis query languages are based on the relational model and cannot effectively capture the rich data types and computations required for complex data analysis applications. Furthermore, these query languages are not well-integrated with the host programming language, as they are based on an incompatible data model. To address these shortcomings, we introduce a new query language for data-intensive scalable computing that is deeply embedded in Scala, called DIQL, and a query optimization framework that optimizes and translates DIQL queries to byte code at compile-time. In contrast to other query languages, our query embedding eliminates impedance mismatch as any Scala code can be seamlessly mixed with SQL-like syntax, without having to add any special declaration. DIQL supports nested collections and hierarchical data and allows query nesting at any place in a query. With DIQL, programmers can express complex data analysis tasks, such as PageRank and matrix factorization, using SQL-like syntax exclusively. The DIQL query optimizer uses algebraic transformations to derive all possible joins in a query, including those hidden across deeply nested queries, thus unnesting nested queries of any form and any number of nesting levels. The optimizer also uses general transformations to push down predicates before joins and to prune unneeded data across operations. DIQL has been implemented on three Big Data platforms, Apache Spark, Apache Flink, and Twitter's Cascading/Scalding, and has been shown to have competitive performance relative to Spark DataFrames and Spark SQL for some complex queries. This paper extends our previous work on embedded data-intensive query languages by describing the complete details of the formal framework and the query translation and optimization processes, and by providing more experimental results that give further evidence of the performance of our system

    Scalable Querying of Nested Data

    Get PDF
    While large-scale distributed data processing platforms have become an attractive target for query processing, these systems are problematic for applications that deal with nested collections. Programmers are forced either to perform non-trivial translations of collection programs or to employ automated flattening procedures, both of which lead to performance problems. These challenges only worsen for nested collections with skewed cardinalities, where both handcrafted rewriting and automated flattening are unable to enforce load balancing across partitions. In this work, we propose a framework that translates a program manipulating nested collections into a set of semantically equivalent shredded queries that can be efficiently evaluated. The framework employs a combination of query compilation techniques, an efficient data representation for nested collections, and automated skew-handling. We provide an extensive experimental evaluation, demonstrating significant improvements provided by the framework in diverse scenarios for nested collection programs

    Optimization of object query languages

    Get PDF
    corecore