
    Compile-Time Query Optimization for Big Data Analytics

    Many emerging programming environments for large-scale data analysis, such as Map-Reduce, Spark, and Flink, provide Scala-based APIs that consist of powerful higher-order operations that ease the development of complex data analysis applications. However, despite the simplicity of these APIs, many programmers prefer to use declarative languages, such as Hive and Spark SQL, to code their distributed applications. Unfortunately, most current data analysis query languages are based on the relational model and cannot effectively capture the rich data types and computations required for complex data analysis applications. Furthermore, these query languages are not well-integrated with the host programming language, as they are based on an incompatible data model. To address these shortcomings, we introduce a new query language for data-intensive scalable computing that is deeply embedded in Scala, called DIQL, and a query optimization framework that optimizes and translates DIQL queries to byte code at compile-time. In contrast to other query languages, our query embedding eliminates impedance mismatch as any Scala code can be seamlessly mixed with SQL-like syntax, without having to add any special declaration. DIQL supports nested collections and hierarchical data and allows query nesting at any place in a query. With DIQL, programmers can express complex data analysis tasks, such as PageRank and matrix factorization, using SQL-like syntax exclusively. The DIQL query optimizer uses algebraic transformations to derive all possible joins in a query, including those hidden across deeply nested queries, thus unnesting nested queries of any form and any number of nesting levels. The optimizer also uses general transformations to push down predicates before joins and to prune unneeded data across operations. 
DIQL has been implemented on three Big Data platforms, Apache Spark, Apache Flink, and Twitter's Cascading/Scalding, and has been shown to have competitive performance relative to Spark DataFrames and Spark SQL for some complex queries. This paper extends our previous work on embedded data-intensive query languages by describing the complete details of the formal framework and the query translation and optimization processes, and by providing more experimental results that give further evidence of the performance of our system.
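The abstract mentions pushing predicates down before joins. The following is a minimal sketch of that general idea only, not DIQL's optimizer or its Scala API: the relations, field layout, and function names are all hypothetical. It shows that filtering one input before a join yields the same rows as filtering the joined result, while the join itself sees fewer tuples.

```python
# Hypothetical toy relations (not from the DIQL paper).
# customers: (customer_id, name, country); orders: (order_id, customer_id, amount)
customers = [(1, "Ann", "US"), (2, "Bo", "DE"), (3, "Cy", "US")]
orders = [(10, 1, 25.0), (11, 2, 40.0), (12, 3, 15.0), (13, 1, 5.0)]


def join(cs, os):
    """Naive nested-loop equi-join on customer_id."""
    return [(c, o) for c in cs for o in os if c[0] == o[1]]


def filter_after_join(cs, os, country):
    """Unoptimized plan: join everything, then apply the predicate."""
    return [(c, o) for (c, o) in join(cs, os) if c[2] == country]


def filter_before_join(cs, os, country):
    """Pushed-down plan: apply the predicate to customers, then join."""
    pushed = [c for c in cs if c[2] == country]
    return join(pushed, os)


late = filter_after_join(customers, orders, "US")
early = filter_before_join(customers, orders, "US")
```

The rewrite is valid here because the predicate refers only to attributes of one join input; a predicate spanning both inputs could not be pushed this way.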

    Pregelix: Big(ger) Graph Analytics on A Dataflow Engine

    There is a growing need for distributed graph processing systems that are capable of gracefully scaling to very large graph datasets. Unfortunately, this challenge has not been easily met due to the intense memory pressure imposed by the process-centric, message-passing designs that many graph processing systems follow. Pregelix is a new open source distributed graph processing system that is based on an iterative dataflow design that is better tuned to handle both in-memory and out-of-core workloads. As such, Pregelix offers improved performance characteristics and scaling properties over current open source systems (e.g., we have seen up to 15x speedup compared to Apache Giraph and up to 35x speedup compared to distributed GraphLab), and makes more effective use of available machine resources to support Big(ger) Graph Analytics.

    AxleDB: A novel programmable query processing platform on FPGA

    With the rise of Big Data, providing high-performance query processing capabilities through the acceleration of database analytics has gained significant attention. Leveraging Field Programmable Gate Array (FPGA) technology, this approach can lead to clear benefits. In this work, we present the design and implementation of AxleDB: an FPGA-based platform that enables fast query processing for database systems by melding novel database-specific accelerators with commercial-off-the-shelf (COTS) storage using modern interfaces, in a novel, unified, and programmable environment. AxleDB can perform a large subset of SQL queries through its set of instructions that can map compute-intensive database operations, such as filter, arithmetic, aggregate, group by, table join, or sort, onto the specialized high-throughput accelerators. To minimize the number of SSD I/O operations required, AxleDB also supports hardware MinMax indexing for databases. We evaluated AxleDB with five decision support queries from the TPC-H benchmark suite and achieved a speedup from 1.8X to 34.2X and energy-efficiency improvements from 2.8X to 62.1X, in comparison to the state-of-the-art DBMS, i.e., PostgreSQL and MonetDB. The research leading to these results has received funding from the European Union Seventh Framework Program (FP7) (under the AXLE project GA number 318633), the Ministry of Economy and Competitiveness of Spain (under contract number TIN2015-65316-p), Turkish Ministry of Development TAM Project (number 2007K120610), and Bogazici University Scientific Projects (number 7060). Peer reviewed. Postprint (author's final draft).
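The abstract says MinMax indexing is used to reduce SSD I/O. As a rough software illustration of the general technique (AxleDB implements it in hardware, and nothing below reflects its actual design), a MinMax index stores the minimum and maximum value of each storage block so that a range scan can skip blocks that cannot contain matches. The block size, function names, and data are assumptions made for this sketch.

```python
def build_minmax_index(values, block_size):
    """Record (start_offset, min, max) for each fixed-size block of values."""
    index = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        index.append((start, min(block), max(block)))
    return index


def range_scan(values, index, block_size, lo, hi):
    """Return values in [lo, hi] and the number of blocks actually read."""
    hits, blocks_read = [], 0
    for start, block_min, block_max in index:
        # Skip the block when its value range cannot overlap [lo, hi].
        if block_max < lo or block_min > hi:
            continue
        blocks_read += 1
        block = values[start:start + block_size]
        hits.extend(v for v in block if lo <= v <= hi)
    return hits, blocks_read


# Hypothetical column of 100 sorted values stored in blocks of 10.
values = list(range(100))
index = build_minmax_index(values, 10)
hits, blocks_read = range_scan(values, index, 10, 42, 47)
```

With this sorted toy column, only the block covering 40 to 49 is read. On unsorted data the block ranges overlap more, so fewer blocks can be skipped.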

    StreamBed: capacity planning for stream processing

    StreamBed is a capacity planning system for stream processing. It predicts, ahead of any production deployment, the resources that a query will require to process an incoming data rate sustainably, and the appropriate configuration of these resources. StreamBed builds a capacity planning model by piloting a series of runs of the target query in a small-scale, controlled testbed. We implement StreamBed for the popular Flink DSP engine. Our evaluation with large-scale queries of the Nexmark benchmark demonstrates that StreamBed can effectively and accurately predict capacity requirements for jobs spanning more than 1,000 cores using a testbed of only 48 cores. Comment: 14 pages, 11 figures. This project has been funded by the Walloon region (Belgium) through the Win2Wal project GEPICIA.

    Flattening an object algebra to provide performance

    Algebraic transformation and optimization techniques have been the method of choice in relational query execution, but applying them in object-oriented (OO) DBMSs is difficult due to the complexity of OO query languages. This paper demonstrates that the problem can be simplified by mapping an OO data model to the binary relational model implemented by Monet, a state-of-the-art database kernel. We present a generic mapping scheme to flatten data models and study the case of a straightforward OO model. We show how flattening enabled us to implement a query algebra using only a very limited set of simple operations. The required primitives and query execution strategies are discussed, and their performance is evaluated on the 1-GByte TPC-D (Transaction-processing Performance Council's Benchmark D), showing that our divide-and-conquer approach yields excellent results.
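To make the idea of flattening an object model into binary relations concrete, here is a small sketch under assumptions of our own: each attribute becomes a two-column table of (object id, value) pairs, and queries are composed from simple per-table operations. The helper names and the sample objects are hypothetical and do not reproduce the paper's mapping scheme or Monet's actual operators.

```python
from collections import defaultdict


def flatten(objects):
    """Map a list of attribute dicts to one binary (oid, value) table per attribute."""
    tables = defaultdict(list)
    for oid, obj in enumerate(objects):
        for attr, value in obj.items():
            tables[attr].append((oid, value))
    return dict(tables)


def select_oids(table, predicate):
    """Return the object ids whose value in this binary table satisfies the predicate."""
    return {oid for oid, value in table if predicate(value)}


def project(table, oids):
    """Return the values of this binary table for the given object ids, in table order."""
    return [value for oid, value in table if oid in oids]


# Hypothetical objects; the query is "names of parts cheaper than 5.0".
parts = [
    {"name": "bolt", "price": 2.0},
    {"name": "gear", "price": 9.5},
    {"name": "nut", "price": 1.0},
]
tables = flatten(parts)
cheap = select_oids(tables["price"], lambda p: p < 5.0)
names = project(tables["name"], cheap)
```

The query touches only the two attribute tables it needs, which is the kind of divide-and-conquer decomposition the abstract alludes to.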

    Advanced analytics through FPGA based query processing and deep reinforcement learning

    Today, vast streams of structured and unstructured data are incorporated into databases, and analytical processes are applied to discover patterns, correlations, trends, and other useful relationships that inform a broad range of decision-making processes. The amount of generated data has grown very large over the years, and conventional database processing methods from previous generations have not been sufficient to provide satisfactory results regarding analytics performance and prediction accuracy metrics. Thus, new methods are needed in a wide array of fields, from computer architecture, storage systems, and network design to statistics and physics. This thesis proposes two methods to address the current challenges and meet the future demands of advanced analytics. First, we present AxleDB, a Field Programmable Gate Array (FPGA) based query processing system which constitutes the frontend of an advanced analytics system. AxleDB melds highly efficient accelerators with memory and storage, and provides a unified programmable environment. AxleDB is capable of offloading complex Structured Query Language (SQL) queries from the host CPU. Experiments have shown that, when running a set of TPC-H queries, AxleDB can perform full queries between 1.8x and 34.2x faster and with 2.8x to 62.1x better energy efficiency than MonetDB and PostgreSQL on a single workstation node. Second, we introduce TauRieL, a novel deep reinforcement learning (DRL) based method for combinatorial problems. The design idea behind combining DRL and combinatorial problems is to apply the prediction capabilities of deep reinforcement learning and to use the universality of combinatorial optimization problems to explore general-purpose predictive methods. TauRieL utilizes an actor-critic inspired DRL architecture that adopts ordinary feedforward nets. Furthermore, TauRieL performs online training, which unifies training and state space exploration. 
The experiments show that TauRieL can generate solutions two orders of magnitude faster and performs within 3% of the accuracy of state-of-the-art DRL on the Traveling Salesman Problem while searching for the shortest tour. We also show that TauRieL can be adapted to the Knapsack combinatorial problem. With a minimal problem-specific modification, TauRieL can outperform a Knapsack-specific greedy heuristic. Postprint (published version).