141 research outputs found

    The Family of MapReduce and Large Scale Data Processing Systems

    Full text link
    In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data which has called for a paradigm shift in the computing architecture and large scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program such as issues on data distribution, scheduling and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in several followup works after its introduction. This article provides a comprehensive survey for a family of approaches and mechanisms of large scale data processing mechanisms that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both research and industrial communities. We also cover a set of introduced systems that have been implemented to provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions.Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other author

    Arquiteturas federadas para integração de dados biomédicos

    Get PDF
    Doutoramento Ciências da ComputaçãoThe last decades have been characterized by a continuous adoption of IT solutions in the healthcare sector, which resulted in the proliferation of tremendous amounts of data over heterogeneous systems. Distinct data types are currently generated, manipulated, and stored, in the several institutions where patients are treated. The data sharing and an integrated access to this information will allow extracting relevant knowledge that can lead to better diagnostics and treatments. This thesis proposes new integration models for gathering information and extracting knowledge from multiple and heterogeneous biomedical sources. The scenario complexity led us to split the integration problem according to the data type and to the usage specificity. The first contribution is a cloud-based architecture for exchanging medical imaging services. It offers a simplified registration mechanism for providers and services, promotes remote data access, and facilitates the integration of distributed data sources. Moreover, it is compliant with international standards, ensuring the platform interoperability with current medical imaging devices. The second proposal is a sensor-based architecture for integration of electronic health records. It follows a federated integration model and aims to provide a scalable solution to search and retrieve data from multiple information systems. The last contribution is an open architecture for gathering patient-level data from disperse and heterogeneous databases. All the proposed solutions were deployed and validated in real world use cases.A adoção sucessiva das tecnologias de comunicação e de informação na área da saúde tem permitido um aumento na diversidade e na qualidade dos serviços prestados, mas, ao mesmo tempo, tem gerado uma enorme quantidade de dados, cujo valor científico está ainda por explorar. A partilha e o acesso integrado a esta informação poderá permitir a identificação de novas descobertas que possam conduzir a melhores diagnósticos e a melhores tratamentos clínicos. Esta tese propõe novos modelos de integração e de exploração de dados com vista à extração de conhecimento biomédico a partir de múltiplas fontes de dados. A primeira contribuição é uma arquitetura baseada em nuvem para partilha de serviços de imagem médica. Esta solução oferece um mecanismo de registo simplificado para fornecedores e serviços, permitindo o acesso remoto e facilitando a integração de diferentes fontes de dados. A segunda proposta é uma arquitetura baseada em sensores para integração de registos electrónicos de pacientes. Esta estratégia segue um modelo de integração federado e tem como objetivo fornecer uma solução escalável que permita a pesquisa em múltiplos sistemas de informação. Finalmente, o terceiro contributo é um sistema aberto para disponibilizar dados de pacientes num contexto europeu. Todas as soluções foram implementadas e validadas em cenários reais

    Just-In-Time Data Virtualization: Lightweight Data Management with ViDa

    Get PDF
    As the size of data and its heterogeneity increase, traditional database system architecture becomes an obstacle to data analysis. Integrating and ingesting (loading) data into databases is quickly becoming a bottleneck in face of massive data as well as increasingly heterogeneous data formats. Still, state-of-the-art approaches typically rely on copying and transforming data into one (or few) repositories. Queries, on the other hand, are often ad-hoc and supported by pre-cooked operators which are not adaptive enough to optimize access to data. As data formats and queries increasingly vary, there is a need to depart from the current status quo of static query processing primitives and build dynamic, fully adaptive architectures. We build ViDa, a system which reads data in its raw format and processes queries using adaptive, just-in-time operators. Our key insight is use of virtualization, i.e., abstracting data and manipulating it regardless of its original format, and dynamic generation of operators. ViDa's query engine is generated just-in-time; its caches and its query operators adapt to the current query and the workload, while also treating raw datasets as its native storage structures. Finally, ViDa features a language expressive enough to support heterogeneous data models, and to which existing languages can be translated. Users therefore have the power to choose the language best suited for an analysis

    Graph Processing in Main-Memory Column Stores

    Get PDF
    Evermore, novel and traditional business applications leverage the advantages of a graph data model, such as the offered schema flexibility and an explicit representation of relationships between entities. As a consequence, companies are confronted with the challenge of storing, manipulating, and querying terabytes of graph data for enterprise-critical applications. Although these business applications operate on graph-structured data, they still require direct access to the relational data and typically rely on an RDBMS to keep a single source of truth and access. Existing solutions performing graph operations on business-critical data either use a combination of SQL and application logic or employ a graph data management system. For the first approach, relying solely on SQL results in poor execution performance caused by the functional mismatch between typical graph operations and the relational algebra. To the worse, graph algorithms expose a tremendous variety in structure and functionality caused by their often domain-specific implementations and therefore can be hardly integrated into a database management system other than with custom coding. Since the majority of these enterprise-critical applications exclusively run on relational DBMSs, employing a specialized system for storing and processing graph data is typically not sensible. Besides the maintenance overhead for keeping the systems in sync, combining graph and relational operations is hard to realize as it requires data transfer across system boundaries. A basic ingredient of graph queries and algorithms are traversal operations and are a fundamental component of any database management system that aims at storing, manipulating, and querying graph data. Well-established graph traversal algorithms are standalone implementations relying on optimized data structures. The integration of graph traversals as an operator into a database management system requires a tight integration into the existing database environment and a development of new components, such as a graph topology-aware optimizer and accompanying graph statistics, graph-specific secondary index structures to speedup traversals, and an accompanying graph query language. In this thesis, we introduce and describe GRAPHITE, a hybrid graph-relational data management system. GRAPHITE is a performance-oriented graph data management system as part of an RDBMS allowing to seamlessly combine processing of graph data with relational data in the same system. We propose a columnar storage representation for graph data to leverage the already existing and mature data management and query processing infrastructure of relational database management systems. At the core of GRAPHITE we propose an execution engine solely based on set operations and graph traversals. Our design is driven by the observation that different graph topologies expose different algorithmic requirements to the design of a graph traversal operator. We derive two graph traversal implementations targeting the most common graph topologies and demonstrate how graph-specific statistics can be leveraged to select the optimal physical traversal operator. To accelerate graph traversals, we devise a set of graph-specific, updateable secondary index structures to improve the performance of vertex neighborhood expansion. Finally, we introduce a domain-specific language with an intuitive programming model to extend graph traversals with custom application logic at runtime. We use the LLVM compiler framework to generate efficient code that tightly integrates the user-specified application logic with our highly optimized built-in graph traversal operators. Our experimental evaluation shows that GRAPHITE can outperform native graph management systems by several orders of magnitude while providing all the features of an RDBMS, such as transaction support, backup and recovery, security and user management, effectively providing a promising alternative to specialized graph management systems that lack many of these features and require expensive data replication and maintenance processes

    Graph Pattern Matching on Symmetric Multiprocessor Systems

    Get PDF
    Graph-structured data can be found in nearly every aspect of today's world, be it road networks, social networks or the internet itself. From a processing perspective, finding comprehensive patterns in graph-structured data is a core processing primitive in a variety of applications, such as fraud detection, biological engineering or social graph analytics. On the hardware side, multiprocessor systems, that consist of multiple processors in a single scale-up server, are the next important wave on top of multi-core systems. In particular, symmetric multiprocessor systems (SMP) are characterized by the fact, that each processor has the same architecture, e.g. every processor is a multi-core and all multiprocessors share a common and huge main memory space. Moreover, large SMPs will feature a non-uniform memory access (NUMA), whose impact on the design of efficient data processing concepts should not be neglected. The efficient usage of SMP systems, that still increase in size, is an interesting and ongoing research topic. Current state-of-the-art architectural design principles provide different and in parts disjunct suggestions on which data should be partitioned and or how intra-process communication should be realized. In this thesis, we propose a new synthesis of four of the most well-known principles Shared Everything, Partition Serial Execution, Data Oriented Architecture and Delegation, to create the NORAD architecture, which stands for NUMA-aware DORA with Delegation. We built our research prototype called NeMeSys on top of the NORAD architecture to fully exploit the provided hardware capacities of SMPs for graph pattern matching. Being an in-memory engine, NeMeSys allows for online data ingestion as well as online query generation and processing through a terminal based user interface. Storing a graph on a NUMA system inherently requires data partitioning to cope with the mentioned NUMA effect. Hence, we need to dissect the graph into a disjunct set of partitions, which can then be stored on the individual memory domains. This thesis analyzes the capabilites of the NORAD architecture, to perform scalable graph pattern matching on SMP systems. To increase the systems performance, we further develop, integrate and evaluate suitable optimization techniques. That is, we investigate the influence of the inherent data partitioning, the interplay of messaging with and without sufficient locality information and the actual partition placement on any NUMA socket in the system. To underline the applicability of our approach, we evaluate NeMeSys against synthetic datasets and perform an end-to-end evaluation of the whole system stack on the real world knowledge graph of Wikidata

    The MGX framework for microbial community analysis

    Get PDF
    Jaenicke S. The MGX framework for microbial community analysis. Bielefeld: Universität Bielefeld; 2020

    Reducing End-User Burden in Everyday Data Organization.

    Full text link
    As digital data permeates every aspect of our daily life, more and more end-users are organizing their everyday data electronically. In fact, end-users are already used to managing their personal data such as contact books and calendars in electronic devices. Meanwhile, the desire for organizing more information into the computer is expanding for a broader group of users. For example, a scientist may need to regularly manage a substantial amount of science data on his desktop. However, to organize such everyday data is challenging for these end-users, because they have limited knowledge about data schema, which is key to data management tasks such as database design, data transformation and data integration. While the user is struggling with these schema tasks, various cognitive and operational burdens emerge. First, when designing her data collection, the user has the burden to abstract her mental model of her real-life data into a reasonable schema design. Moreover, when incorporating external data sources, there is a burden to understand the source semantics and a burden to transform the data from those sources into the user's own data collection. Meanwhile, if the user wants to filter the data, she has the burden to understand and specify the selection condition. Finally, when existing sources are update, there is a burden to understand and fuse these updates. This dissertation introduces various approaches to help the end-user reduce these burdens. To ease the design pain, the dissertation proposes a system with a next-generation spreadsheet for the end-user to easily design and evolve her schema. To facilitate incorporation of external data sources, a sample-driven schema mapping approach is introduced so that the user can freely provide sample instances in her own collection and the system will automatically deduce the desired schema mapping from the sources to the collection. In a similar flavor, this dissertation proposes an approach to facilitate the user in specifying selection conditions via example data points she wants to select. Finally, to help the user incorporate source data updates into her data collection, the dissertation proposes a technique to incrementally update the integrated data using previous integration results.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/99778/1/eql_1.pd
    corecore