
    The Family of MapReduce and Large Scale Data Processing Systems

    In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architectures and large-scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables the easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as data distribution, scheduling, and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in follow-up work since its introduction. This article provides a comprehensive survey of a family of approaches and mechanisms for large-scale data processing that have been implemented based on the original idea of the MapReduce framework and are currently gaining momentum in both the research and industrial communities. We also cover a set of systems that have been implemented to provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large-scale data processing systems that resemble some of the ideas of the MapReduce framework but target different purposes and application scenarios. Finally, we discuss some future research directions for implementing the next generation of MapReduce-like solutions.
    Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other authors
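
    To make the programming model concrete, the sketch below shows the map/reduce pattern the survey refers to, as a single-machine word count in Python. The function names and the in-memory shuffle are illustrative only and are not taken from the paper or from any specific framework.

```python
from collections import defaultdict

# Illustrative map function: emit (word, 1) pairs for each word in a line.
def map_fn(line):
    for word in line.split():
        yield word.lower(), 1

# Illustrative reduce function: sum all counts emitted for one key.
def reduce_fn(word, counts):
    return word, sum(counts)

def mapreduce(lines):
    # "Shuffle" phase: group intermediate values by key. Here it is a plain
    # in-memory dictionary; a real framework partitions this step across a cluster.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

if __name__ == "__main__":
    corpus = ["the quick brown fox", "the lazy dog", "the fox"]
    print(mapreduce(corpus))  # {'the': 3, 'quick': 1, 'brown': 1, ...}
```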

    Mario. A system for iterative and interactive processing of biological data

    This thesis addresses challenges in metagenomic data processing on clusters of computers, in particular the need for interactive response times during the development, debugging, and tuning of data processing pipelines. Typical metagenomics pipelines batch-process data and have execution times ranging from hours to months, making configuration and tuning time-consuming and impractical. We have analyzed the data usage of metagenomic pipelines, including a visualization frontend, to develop an approach that uses an online, data-parallel processing model in which changes to the pipeline configuration are quickly reflected in updated pipeline output available to the user. We describe the design and implementation of the Mario system that realizes this approach. Mario is a distributed system, built on top of the HBase storage system, that provides data processing using commonly used bioinformatics applications, interactive tuning, automatic parallelization, and data provenance support. We evaluate Mario and its underlying storage system, HBase, using a benchmark developed to simulate I/O loads that are representative of biological data processing. The results show that Mario adds less than 100 milliseconds to the end-to-end latency of processing one item of data. This low latency, combined with Mario's storage of all intermediate data generated by the processing, enables easy parameter tuning. In addition to improved interactivity, Mario also offers integrated data provenance by storing detailed pipeline configurations associated with the data. The evaluation of Mario demonstrates that it can be used to achieve more interactivity in the configuration of pipelines for processing biological data. We believe that biology researchers can take advantage of this interactivity to perform better parameter tuning, which may lead to more accurate analyses and, ultimately, to new scientific discoveries.
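
    As a rough illustration of the interactivity idea (not Mario's actual implementation, which stores intermediate data in HBase), the following sketch keys each pipeline stage's output by its configuration and input, so that re-running a pipeline after a parameter change only recomputes the affected stages. The stage names and cache layout are hypothetical.

```python
import hashlib
import json

# In-memory stand-in for a store of intermediate results keyed by configuration.
cache = {}

def stage_key(stage_name, config, upstream_key):
    """Derive a cache key from the stage, its parameters, and its input."""
    blob = json.dumps({"stage": stage_name, "config": config, "input": upstream_key},
                      sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

def run_stage(stage_name, func, config, data, upstream_key=""):
    key = stage_key(stage_name, config, upstream_key)
    if key not in cache:               # recompute only when config or input changed
        cache[key] = func(data, **config)
    return cache[key], key

# Two toy stages standing in for bioinformatics tools in a pipeline.
def filter_reads(reads, min_len):
    return [r for r in reads if len(r) >= min_len]

def count_bases(reads, base):
    return sum(r.count(base) for r in reads)

reads = ["ACGTACGT", "ACG", "TTTTACGT"]
filtered, k1 = run_stage("filter", filter_reads, {"min_len": 4}, reads)
total, _ = run_stage("count", count_bases, {"base": "A"}, filtered, upstream_key=k1)
print(total)  # changing {"base": "A"} later recomputes only the counting stage
```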

    ShareCare: a study of databases within Q&A webapp context

    Bachelor's thesis (Treball Final de Grau) in Computer Engineering, Facultat de Matemàtiques, Universitat de Barcelona, 2018. Supervisor: Blasco Jiménez, Guillermo. This work studies the different database families that can be found at work in web applications. The study describes those families, together with an analysis of their features and some of their common uses. This overview of databases is reinforced by a web application, built to exemplify some of the use cases that lead today's applications to use several kinds of databases. The purpose of this work is therefore to show how important it is to know the options that exist in terms of databases, and to point out some facts a developer should bear in mind in order to make a good choice when selecting a database to work with.
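
    As a minimal illustration of the kind of multi-database use case the thesis describes (not code from the project), the sketch below keeps a Q&A app's canonical data in a relational store while a key-value structure absorbs hot, frequently updated view counters; the table layout and names are hypothetical.

```python
import sqlite3

# Canonical Q&A data in a relational store (SQLite in memory for the example).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE questions (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
db.execute("INSERT INTO questions (title, body) VALUES (?, ?)",
           ("How do I index this column?", "Details..."))
db.commit()

# A plain dict standing in for a key-value cache (e.g. an external store such as Redis).
view_cache = {}

def view_question(qid):
    # Cheap counter update on every page view; the relational row holds the content.
    view_cache[qid] = view_cache.get(qid, 0) + 1
    row = db.execute("SELECT title, body FROM questions WHERE id = ?", (qid,)).fetchone()
    return row, view_cache[qid]

print(view_question(1))  # (('How do I index this column?', 'Details...'), 1)
print(view_question(1))  # second view bumps only the cached counter
```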

    Automated Data Collection and Management at Enhanced Lagoons for Wastewater Treatment

    Automated monitoring stations are used to monitor and control wastewater treatment plants. Their ability to monitor at high frequency has become essential for reducing negative impacts on the environment, since wastewater characteristics show high spatial and temporal variability. Although the technology used to build these automated monitoring stations, for example the sensors, has improved considerably over the last few years, the instrumentation is still expensive. In addition, because the instruments are in contact with wastewater, problems such as fouling, clogging, and poor calibration frequently affect the reliability of the continuous on-line measurements. Good maintenance of the instruments is therefore essential, as is validation of the collected data to detect faults and outliers. In the context of this thesis, in collaboration with Bionest®, a methodology has been developed to deal with these problems for two facultative/aerated lagoon case studies in Québec, with the objective of optimizing the maintenance activities, reducing the fraction of unreliable data, and obtaining large, representative data series.
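
    As a small, generic example of the data-validation step mentioned above (not the methodology developed in the thesis), the sketch below flags readings in a high-frequency sensor series that deviate strongly from a rolling median, scaled by the median absolute deviation; the window size and threshold are illustrative.

```python
from statistics import median

def flag_outliers(series, window=9, threshold=3.5):
    """Flag sensor readings that deviate strongly from their local median."""
    flags = []
    half = window // 2
    for i, x in enumerate(series):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        neighbourhood = series[lo:hi]
        med = median(neighbourhood)
        mad = median(abs(v - med) for v in neighbourhood) or 1e-9
        # 0.6745 rescales the MAD so it is comparable to a standard deviation.
        flags.append(abs(x - med) * 0.6745 / mad > threshold)
    return flags

# Example: a dissolved-oxygen-like trace with one fouling spike at index 5.
readings = [6.1, 6.0, 6.2, 6.1, 6.3, 12.8, 6.2, 6.1, 6.0, 6.2]
print(flag_outliers(readings))  # only the spike is flagged True
```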

    PlinyCompute: A Platform for High-Performance, Distributed, Data-Intensive Tool Development

    This paper describes PlinyCompute, a system for the development of high-performance, data-intensive, distributed computing tools and libraries. In the large, PlinyCompute presents the programmer with a very high-level, declarative interface, relying on automatic, relational-database-style optimization to figure out how to stage distributed computations. In the small, however, PlinyCompute presents the capable systems programmer with a persistent object data model and API (the "PC object model") and an associated memory management system that has been designed from the ground up for high-performance, distributed, data-intensive computing. This contrasts with most other Big Data systems, which are constructed on top of the Java Virtual Machine (JVM) and hence must at least partially cede performance-critical concerns such as memory management (including layout and de/allocation) and virtual method/function dispatch to the JVM. This hybrid approach, declarative in the large while trusting the programmer's ability to use the PC object model efficiently in the small, results in a system that is ideal for the development of reusable, data-intensive tools and libraries. Through extensive benchmarking, we show that implementing complex object manipulations and non-trivial, library-style computations on top of PlinyCompute can result in speedups of 2x to more than 50x compared to equivalent implementations on Spark.
    Comment: 48 pages, including references and Appendix
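
    To illustrate the "declarative in the large" idea conceptually (this is not PlinyCompute's actual API, which is C++-based), the sketch below describes a computation as a logical plan and lets a tiny optimizer reorder it, pushing a cheap filter ahead of an expensive map, before execution; all names are hypothetical.

```python
# The user writes a declarative plan; the system decides how to stage it.

def expensive_score(record):
    return sum(ord(c) for c in record["name"])  # stand-in for costly work

def logical_plan(source):
    # Declarative description: an ordered list of (operator, payload) steps.
    return [("scan", source),
            ("map", lambda r: {**r, "score": expensive_score(r)}),
            ("filter", lambda r: r["region"] == "eu")]

def optimize(plan):
    # Toy rule: assume filters only touch source columns, so they may run
    # before any map, reducing how many rows the expensive map sees.
    ops = plan[1:]
    filters = [op for op in ops if op[0] == "filter"]
    others = [op for op in ops if op[0] != "filter"]
    return [plan[0]] + filters + others

def execute(plan):
    _, rows = plan[0]
    for op, fn in plan[1:]:
        rows = list(map(fn, rows)) if op == "map" else [r for r in rows if fn(r)]
    return rows

data = [{"name": "ann", "region": "eu"}, {"name": "bob", "region": "us"}]
print(execute(optimize(logical_plan(data))))  # scores only the "eu" row
```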