
    Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources

    Apache Calcite is a foundational software framework that provides query processing, optimization, and query language support to many popular open-source data processing systems such as Apache Hive, Apache Storm, Apache Flink, Druid, and MapD. Calcite's architecture consists of a modular and extensible query optimizer with hundreds of built-in optimization rules, a query processor capable of processing a variety of query languages, an adapter architecture designed for extensibility, and support for heterogeneous data models and stores (relational, semi-structured, streaming, and geospatial). This flexible, embeddable, and extensible architecture is what makes Calcite an attractive choice for adoption in big-data frameworks. It is an active project that continues to introduce support for new types of data sources, query languages, and approaches to query processing and optimization. Comment: SIGMOD'1
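
    To make the embeddable, adapter-based architecture concrete, the following is a minimal sketch (not taken from the paper) of using Calcite as an in-process JDBC driver and exposing a plain Java object graph as a SQL schema through the bundled ReflectiveSchema adapter. It assumes calcite-core is on the classpath; the schema, table, and column names (hr, emps, empid, name) are illustrative.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.Properties;

    import org.apache.calcite.adapter.java.ReflectiveSchema;
    import org.apache.calcite.jdbc.CalciteConnection;
    import org.apache.calcite.schema.SchemaPlus;

    public class CalciteSketch {
      // Illustrative in-memory data exposed as a table via public fields.
      public static class Employee {
        public final int empid;
        public final String name;
        public Employee(int empid, String name) { this.empid = empid; this.name = name; }
      }
      public static class Hr {
        public final Employee[] emps = { new Employee(100, "Alice"), new Employee(200, "Bob") };
      }

      public static void main(String[] args) throws Exception {
        Properties info = new Properties();
        info.setProperty("lex", "JAVA"); // case-sensitive, Java-style identifiers
        try (Connection c = DriverManager.getConnection("jdbc:calcite:", info)) {
          CalciteConnection calcite = c.unwrap(CalciteConnection.class);
          SchemaPlus root = calcite.getRootSchema();
          // The reflective adapter turns the Hr object's array field into the table hr.emps.
          root.add("hr", new ReflectiveSchema(new Hr()));
          try (Statement s = calcite.createStatement();
               ResultSet rs = s.executeQuery("select name from hr.emps where empid > 100")) {
            while (rs.next()) {
              System.out.println(rs.getString(1)); // prints Bob
            }
          }
        }
      }
    }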

    Transitioning From Relational to NoSQL: A Case Study

    Data storage requirements have increased dramatically in recent years due to the explosion in data volumes brought about by the Web 2.0 era. Changing priorities for database system requirements have seen NoSQL databases emerge as an alternative to the relational database systems that have dominated this market for over 40 years. Web-enabled, always-on applications mean that availability of the database system is critically important, as any downtime can translate into unrecoverable financial loss. Cost also matters greatly in an era where credit is difficult to obtain and organizations look to get the most from their IT infrastructure for the least investment. The purpose of this study is to evaluate the current NoSQL market and assess its suitability as an alternative to a relational database. The research looks at a case study of a bulletin board application that uses a relational database for data storage and evaluates how such an application can be converted to use a NoSQL database. The case study is also used to assess the performance attributes of a NoSQL database when implemented on a low-cost hardware platform. The findings will provide insight to those who are considering switching from a relational database system to a NoSQL database system.
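
    As a rough illustration of the kind of conversion the case study describes, the sketch below stores a bulletin-board thread together with its posts as a single denormalized document, so rendering a thread is one lookup rather than a join. It assumes MongoDB as the NoSQL store and the synchronous MongoDB Java driver; the database, collection, and field names (forum, threads, posts) are invented for the example.

    import java.util.Arrays;
    import java.util.Date;

    import org.bson.Document;

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;

    public class BulletinBoardSketch {
      public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
          MongoCollection<Document> threads =
              client.getDatabase("forum").getCollection("threads");

          // One document per thread: the posts a relational schema would keep in a
          // separate table (and fetch with a join) are embedded directly.
          Document thread = new Document("title", "Welcome to the board")
              .append("author", "alice")
              .append("createdAt", new Date())
              .append("posts", Arrays.asList(
                  new Document("author", "bob").append("body", "First reply"),
                  new Document("author", "carol").append("body", "Second reply")));
          threads.insertOne(thread);

          // Rendering the thread page is now a single-document read.
          Document page = threads.find(new Document("title", "Welcome to the board")).first();
          System.out.println(page.toJson());
        }
      }
    }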

    Data Integration over NoSQL Stores Using Access Path Based Mappings

    Due to the large amount of data generated by user interactions on the Web, some companies are currently innovating in the domain of data management by designing their own systems. Many of them are referred to as NoSQL databases, standing for 'Not only SQL'. With their wide adoption, new needs will emerge, and data integration will certainly be one of them. In this paper, we adapt a framework designed for the integration of relational data to a broader context where both NoSQL and relational databases can be integrated. One important extension is the efficient answering of queries expressed over these data sources. The highly denormalized nature of NoSQL databases results in varying performance costs for the several possible translations of a query, so a data integration system targeting NoSQL databases needs to generate an optimized translation for a given query. Our contributions are (i) an access path based mapping solution that takes advantage of the design choices of each data source, (ii) the integration of preferences to handle conflicts between sources, and (iii) a query language that bridges the gap between the SQL query expressed by the user and the query language of the data sources. We also present a prototype implementation, in which the target schema is represented as a set of relations and which enables the integration of two of the most popular NoSQL database models, namely document and column-family stores.
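
    The sketch below is a purely hypothetical illustration of the access-path idea; the interfaces and names are invented for this listing and are not the paper's framework. Each source advertises the access paths it supports for an attribute together with a rough cost, and a mediator translates a selection into the cheapest path that can answer it.

    import java.util.Comparator;
    import java.util.List;
    import java.util.Map;
    import java.util.Optional;

    public class AccessPathMediatorSketch {

      interface AccessPath {
        String attribute();             // attribute this path can look up efficiently
        double estimatedCost();         // e.g. 1.0 for a key/row lookup, 100.0 for a scan
        String translate(String value); // native query for the underlying store
      }

      static AccessPath path(String attribute, double cost, String template) {
        return new AccessPath() {
          public String attribute() { return attribute; }
          public double estimatedCost() { return cost; }
          public String translate(String value) { return template.replace("?", value); }
        };
      }

      // Picks the cheapest access path of any source able to answer "attribute = value".
      static Optional<String> answer(Map<String, List<AccessPath>> sources,
                                     String attribute, String value) {
        return sources.values().stream()
            .flatMap(List::stream)
            .filter(p -> p.attribute().equals(attribute))
            .min(Comparator.comparingDouble(AccessPath::estimatedCost))
            .map(p -> p.translate(value));
      }

      public static void main(String[] args) {
        Map<String, List<AccessPath>> sources = Map.of(
            // Document store: efficient lookup by user id, only a scan for email.
            "docStore", List.of(
                path("userId", 1.0, "db.users.find({_id: \"?\"})"),
                path("email", 100.0, "db.users.find({email: \"?\"})")),
            // Column-family store: row-key access by email.
            "columnStore", List.of(
                path("email", 1.0, "SELECT * FROM users WHERE email = '?'")));

        // The mediator routes the email lookup to the column-family store's row-key path.
        System.out.println(answer(sources, "email", "ada@example.org").orElse("no path"));
      }
    }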

    Scalable RDF compression with MapReduce and HDT

    The use of RDF to publish semantic data has increased notably in recent years. Today's datasets are so large and so interconnected that processing them raises scalability problems. HDT is a compact representation of RDF that aims to minimize space consumption while providing querying capabilities. However, generating HDT from RDF text formats is a task that is costly in time and resources. This work studies the use of MapReduce, a framework for the distributed processing of large amounts of data, for the task of building HDT structures from RDF, and analyzes the improvements obtained, both in resources and in time, compared with building those structures in a single-node process.
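
    As an illustration of how MapReduce can take over part of HDT construction, the sketch below (a deliberately naive example, not the rdfhdt tooling) runs a Hadoop job that extracts and deduplicates the RDF terms of an N-Triples dataset, the kind of pass needed before building the HDT dictionary. Class names and paths are assumptions.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class RdfTermExtractor {

      // Emits the subject, predicate, and object of each N-Triples statement.
      public static class TermMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        private final Text term = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
          // Naive split of "<s> <p> <o> ." -- a real parser must handle literals, blank nodes, etc.
          String line = value.toString().trim();
          if (line.isEmpty() || line.startsWith("#")) return;
          String[] parts = line.split("\\s+", 3);
          if (parts.length < 3) return;
          String object = parts[2].replaceAll("\\s*\\.\\s*$", "");
          for (String t : new String[] { parts[0], parts[1], object }) {
            term.set(t);
            ctx.write(term, NullWritable.get());
          }
        }
      }

      // Deduplicates terms; the sorted, unique output is the raw material for the HDT dictionary.
      public static class DedupReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context ctx)
            throws IOException, InterruptedException {
          ctx.write(key, NullWritable.get());
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "hdt-term-extraction");
        job.setJarByClass(RdfTermExtractor.class);
        job.setMapperClass(TermMapper.class);
        job.setCombinerClass(DedupReducer.class); // local dedup before the shuffle
        job.setReducerClass(DedupReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }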

    Design and Implementation of a NoSQL Concept for an International and Multicenter Clinical Database

    Tinnitus is a very complex symptom with many subtypes, each of which requires a different treatment method. The Tinnitus Database collects the data of tinnitus patients in centers all over the world, with the aim of helping doctors determine the correct subtype of tinnitus a patient suffers from and the best treatment method. It does so by extracting the relevant information from the huge amount of data stored in the database and presenting it to the doctor. The current database is based on MySQL and has two main problems. First, the application needs many joins to assemble the relevant information, which is distributed among different tables; in some cases this causes long response times. The second problem is data validation, which is very important in medical processes, since violations could affect patients' health. For example, only a fixed set of treatment methods exists, so it should not be possible to assign any other treatment method to a patient. Currently, this has to be ensured with additional methods in the application and additional tables in the database. This thesis examines whether different NoSQL technologies could solve these two problems and what other advantages or disadvantages they have compared to relational databases. The purpose of this thesis is to find the database technology that best fits the system.
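
    A hedged sketch of how a document store could address both problems, assuming MongoDB and invented field names and treatment values: visits are embedded in the patient document, which removes the joins, and a server-side JSON Schema validator restricts the treatment field to an allowed set, which replaces the extra validation tables and application code.

    import java.util.Arrays;

    import org.bson.Document;

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoDatabase;
    import com.mongodb.client.model.CreateCollectionOptions;
    import com.mongodb.client.model.Filters;
    import com.mongodb.client.model.ValidationOptions;

    public class PatientStoreSketch {
      public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
          MongoDatabase db = client.getDatabase("tinnitus");

          // Server-side validation: "treatment" may only take one of the allowed values,
          // replacing the extra lookup tables and application-level checks.
          Document schema = new Document("bsonType", "object")
              .append("required", Arrays.asList("patientId", "treatment"))
              .append("properties", new Document()
                  .append("patientId", new Document("bsonType", "int"))
                  .append("treatment", new Document("enum",
                      Arrays.asList("sound therapy", "CBT", "medication"))));
          db.createCollection("patients",
              new CreateCollectionOptions().validationOptions(
                  new ValidationOptions().validator(Filters.jsonSchema(schema))));

          // Embedding visits inside the patient document avoids the joins the
          // relational schema needs to assemble a patient's record.
          Document patient = new Document("patientId", 42)
              .append("treatment", "CBT")
              .append("visits", Arrays.asList(
                  new Document("date", "2015-03-01").append("score", 34),
                  new Document("date", "2015-06-01").append("score", 21)));
          db.getCollection("patients").insertOne(patient);
        }
      }
    }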

    Tuple MapReduce and Pangool: an associated implementation

    This paper presents Tuple MapReduce, a new foundational model that extends MapReduce with the notion of tuples. Tuple MapReduce bridges the gap between the low-level constructs provided by MapReduce and the higher-level needs of programmers, such as compound records, sorting, or joins. The paper also presents Pangool, an open-source framework implementing Tuple MapReduce. Pangool eases the design and implementation of applications based on MapReduce and increases their flexibility while maintaining Hadoop's performance. Additionally, the paper shows pseudo-code for relational joins, rollup, and the PageRank algorithm; a Pangool code example; benchmark results comparing Pangool with existing approaches; reports from users of Pangool in industry; and the description of a distributed database built on Pangool. These results show that Tuple MapReduce can be used as a direct, better-suited replacement for the MapReduce model in current implementations without modifying key system fundamentals.
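
    The following sketch is a hypothetical, simplified rendering of the tuple-centric model described above, intended only to illustrate schema-typed compound records with declarative group-by and sort-by; it is not Pangool's actual API.

    import java.util.Arrays;
    import java.util.List;

    public class TupleModelSketch {

      static final class Schema {
        final List<String> fieldNames;
        Schema(String... fieldNames) { this.fieldNames = Arrays.asList(fieldNames); }
      }

      static final class Tuple {
        final Schema schema;
        final Object[] values;
        Tuple(Schema schema, Object... values) { this.schema = schema; this.values = values; }
        Object get(String field) { return values[schema.fieldNames.indexOf(field)]; }
      }

      public static void main(String[] args) {
        // A compound record for page visits ...
        Schema visits = new Schema("url", "date", "clicks");
        Tuple t = new Tuple(visits, "example.org", "2013-01-01", 7);

        // ... and the job-level declaration the model enables: group by "url", sort each
        // group by "date". Classic MapReduce expresses this with a custom composite
        // WritableComparable, a partitioner, and a grouping comparator.
        List<String> groupBy = List.of("url");
        List<String> sortBy = List.of("url", "date");

        System.out.println("group by " + groupBy + ", sort by " + sortBy
            + "; clicks for " + t.get("url") + " = " + t.get("clicks"));
      }
    }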

    Large Scale Data Management for Enterprise Workloads
