63 research outputs found

    The Family of MapReduce and Large Scale Data Processing Systems

    In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architectures and large-scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables the easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as data distribution, scheduling, and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled in many follow-up research efforts since its introduction. This article provides a comprehensive survey of a family of large-scale data processing approaches and mechanisms that build on the original idea of the MapReduce framework and are currently gaining momentum in both the research and industrial communities. We also cover systems that provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large-scale data processing systems that adopt some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some future research directions for implementing the next generation of MapReduce-like solutions.
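
    As a minimal, hedged sketch of the programming model the survey describes (a toy single-process runner, not the framework's distributed implementation), the classic word-count example might look like this:

```python
from collections import defaultdict
from typing import Iterable, Tuple

def wc_map(_doc_id: str, line: str) -> Iterable[Tuple[str, int]]:
    # Map phase: emit (word, 1) for every word in the input line.
    for word in line.split():
        yield (word, 1)

def wc_reduce(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    # Reduce phase: sum the partial counts emitted for one word.
    return (word, sum(counts))

def run_mapreduce(records, map_fn, reduce_fn):
    # Toy in-memory stand-in for the framework: a real MapReduce system
    # partitions records across machines, shuffles by key, and handles
    # scheduling and fault tolerance transparently.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return [reduce_fn(k, vs) for k, vs in groups.items()]

docs = [("doc1", "map reduce map"), ("doc2", "reduce shuffle")]
print(run_mapreduce(docs, wc_map, wc_reduce))
# [('map', 2), ('reduce', 2), ('shuffle', 1)]
```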

    Profiling Large-scale Live Video Streaming and Distributed Applications

    PhD thesis. Today, distributed applications run at data-centre and Internet scales, from intensive data analysis, such as MapReduce, to the dynamic demands of a worldwide audience, such as YouTube. The network is essential to these applications at both scales. To provide adequate support, we must understand the full requirements of the applications, which are revealed by their workloads. In this thesis, we study distributed applications at different scales to enrich this understanding. Large-scale Internet applications have been studied for years, including social networking services (SNS), video on demand (VoD), and content delivery networks (CDN). An emerging type of video broadcasting on the Internet, crowdsourced live video streaming, has garnered attention, allowing platforms such as Twitch to attract over 1 million concurrent users globally. To better understand Twitch, we collected real-time popularity data combined with metadata about the content, and found that the broadcasters, rather than the content, drive its popularity. Unlike YouTube and Netflix, where content can be cached, video streams on Twitch are generated instantly and must be delivered to users immediately to enable real-time interaction. We therefore performed a large-scale measurement of Twitch's content location, revealing the global footprint of its infrastructure and uncovering the dynamic stream hosting and client redirection strategies that help Twitch serve millions of users at scale. We next consider applications that run inside the data centre. Distributed computing applications rely heavily on the network, both for data transmission and for the scheduling of resources and tasks. One successful application, Hadoop, has been widely deployed for Big Data processing, yet little work has been devoted to understanding its network behaviour. We found that Hadoop's behaviour is limited by the hardware resources available and the processing jobs presented to it. After characterising Hadoop traffic on our testbed with a set of benchmark jobs, we built a simulator to reproduce Hadoop's job traffic. With the simulator, users can investigate the connections between Hadoop traffic and network performance without additional hardware cost, and different network components, such as network topologies, queue policies, and transport-layer protocols, can be added to investigate their effect on performance. In this thesis, we extended the knowledge of networking by investigating two widely used applications in the data centre and at Internet scale. We (i) studied the most popular live video streaming platform, Twitch, as a new type of Internet-scale distributed application, revealing that broadcaster factors drive the popularity of such platforms; (ii) discovered the footprint of Twitch's streaming infrastructure and its dynamic stream hosting and client redirection strategies, providing an in-depth example of video streaming delivery at Internet scale; (iii) investigated the traffic generated by a distributed application by characterising Hadoop traffic under various parameters; and (iv) with this knowledge, built a simulation tool so that users can efficiently investigate the performance of different network components under a distributed application. Queen Mary University of London.
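
    The thesis's simulator itself is not described here; purely as an illustration of the idea of reproducing job traffic in software, the hypothetical sketch below replays a Hadoop-style shuffle as flows sharing one bottleneck link under max-min fair sharing and reports per-flow finish times. All names and numbers are assumptions, not the thesis's tool.

```python
def shuffle_finish_times(flow_bytes, link_Bps):
    # Estimate when each shuffle transfer completes if the bottleneck
    # link capacity (bytes/s) is split equally among the active flows.
    remaining = dict(enumerate(flow_bytes))
    now, finished = 0.0, {}
    while remaining:
        share = link_Bps / len(remaining)            # fair share per flow
        fid, least = min(remaining.items(), key=lambda kv: kv[1])
        dt = least / share                           # smallest flow ends first
        now += dt
        for f in remaining:
            remaining[f] -= share * dt
        finished[fid] = now
        del remaining[fid]
    return finished

# Three map->reduce transfers (2 GB, 1 GB, 1 GB) over a shared 1 Gbit/s link.
print(shuffle_finish_times([2e9, 1e9, 1e9], 1e9 / 8))
# e.g. {1: 24.0, 2: 24.0, 0: 32.0} seconds
```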

    Evaluation and optimization of Big Data Processing on High Performance Computing Systems

    Nowadays, Big Data technologies are used by many organizations to extract valuable information from large-scale datasets. As the size of these datasets increases, meeting the huge performance requirements of data processing applications becomes more challenging. This Thesis focuses on evaluating and optimizing these applications by proposing two new tools, namely BDEv and Flame-MR. On the one hand, BDEv enables a thorough assessment of the behavior of widespread Big Data processing frameworks such as Hadoop, Spark and Flink. It manages the configuration and deployment of the frameworks, generating the input datasets and launching the workloads specified by the user. During each workload, it automatically extracts several evaluation metrics that include performance, resource utilization, energy efficiency and microarchitectural behavior. On the other hand, Flame-MR optimizes the performance of existing Hadoop MapReduce applications. Its overall design is based on an event-driven architecture that improves the efficiency of the system resources by pipelining data movements and computation. Moreover, it avoids redundant memory copies present in Hadoop, while also using efficient sort and merge algorithms for data processing. Flame-MR replaces the underlying MapReduce data processing engine transparently, so the source code of existing applications does not need to be modified. The performance benefits provided by Flame-MR have been thoroughly evaluated on cluster and cloud systems using both standard benchmarks and real-world applications, showing reductions in execution time that range from 40% to 90%. This Thesis provides Big Data users with powerful tools to analyze and understand the behavior of data processing frameworks and to reduce the execution time of applications without requiring expert knowledge.
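
    As a loose, language-shifted sketch of the pipelining idea behind Flame-MR's event-driven design (the real tool is a drop-in engine for Hadoop, not this Python toy), the snippet below overlaps simulated data transfers with computation through a bounded queue, so processing can start while later chunks are still in flight:

```python
import queue
import threading
import time

chunks = queue.Queue(maxsize=4)     # bounded buffer between the two stages

def fetch(n_chunks):
    # Stand-in for network/disk I/O: produce chunks while compute runs.
    for i in range(n_chunks):
        time.sleep(0.05)            # simulated transfer latency
        chunks.put(list(range(i, i + 100)))
    chunks.put(None)                # sentinel: no more data

def compute():
    # Consume chunks as they arrive instead of waiting for all of them.
    total = 0
    while (chunk := chunks.get()) is not None:
        total += sum(chunk)         # stand-in for sort/merge work
    print("result:", total)

io = threading.Thread(target=fetch, args=(10,))
io.start()
compute()                           # overlaps with the fetch thread
io.join()
```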

    The Camera in conservation: determining photography’s place in the preservation of wildlife

    This MA by research study reflects on photography's past, current and future role within wildlife conservation, and asks whether the medium remains necessary moving forward. The investigation and analysis that follow seek to show how the photographic medium can be both beneficial and harmful to the preservation of wildlife, and how photographers can best use it in future conservation projects. Several significant aspects of photography and external influences are engaged with in this study. It first investigates the importance of empathy within wildlife conservation and how empathy can be elicited through imagery and photographic methods. It then examines the other side of conservation photography's record, analysing the negative or neutral impacts it can bring, before researching the role that social media plays, and has the potential to play, in conservation, and how photography can adapt to it to maximise its success. It also explores alternative visual media such as moving image, and how photography can best apply successful techniques learned from them to reinterpret how conservation photography is perceived. Finally, drawing on information and research from across the thesis, I have produced a 'guide' to how conservation photography can be shaped to achieve its full potential, drawing upon the previous successes and failures of other conservation attempts and photographers.

    ACOA: application quality of service aware orchestration architecture for the Edge to Cloud continuum.

    The edge-cloud continuum provides an infrastructure of heterogeneous nodes for running distributed applications. The quality of service of these applications is tied to multiple non-functional requirements, such as response time and energy efficiency, which depend heavily on how the application components are deployed onto the infrastructure nodes. In this context, new orchestration architectures are needed to manage both complex infrastructures and the non-functional requirements of applications. To address these challenges, a novel Application-Centric Orchestration Architecture (ACOA) is proposed for the edge-cloud continuum. This architecture leverages multiple schedulers to improve orchestration performance and to provide a customized scheduling algorithm for each application. It proposes an infrastructure model for characterizing nodes and the network, a workload model for defining applications, and a set of interrelated system components needed to manage orchestration tasks. One of these components implements SWIM-NSM, a new network monitoring protocol required for managing application quality of service.
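
    As a hedged sketch of the per-application scheduler idea the abstract describes (all class and field names here are illustrative, not ACOA's actual API), the snippet below models nodes and application components and lets each application supply its own placement policy:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Node:                      # one element of the edge-cloud continuum
    name: str
    cpu_free: float              # available CPU cores
    latency_ms: float            # RTT from the application's users

@dataclass
class Component:                 # one deployable piece of an application
    name: str
    cpu_req: float

Scheduler = Callable[[Component, List[Node]], Node]

def latency_first(comp: Component, nodes: List[Node]) -> Node:
    # Example per-application policy: lowest latency among feasible nodes.
    feasible = [n for n in nodes if n.cpu_free >= comp.cpu_req]
    return min(feasible, key=lambda n: n.latency_ms)

def place(app: List[Component], nodes: List[Node], schedule: Scheduler):
    plan = {}
    for comp in app:
        node = schedule(comp, nodes)
        node.cpu_free -= comp.cpu_req      # reserve the resources
        plan[comp.name] = node.name
    return plan

nodes = [Node("edge-1", 2.0, 5.0), Node("cloud-1", 32.0, 40.0)]
app = [Component("sensor-ingest", 1.0), Component("analytics", 8.0)]
print(place(app, nodes, latency_first))
# {'sensor-ingest': 'edge-1', 'analytics': 'cloud-1'}
```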

    Literature Survey of Big Data

    Mention the topic of big data, and a person is bound to experience information overload. Indeed, it is so complex, with so many terms and details, that people want to run away from it. When used right, big data (BD) helps people access the data they need in real time and helps managers make better decisions. The purpose of this paper is to evaluate methods, procedures, and architectures for the storage and retrieval of all Federal Aviation Administration (FAA) research, engineering, and development (RE&D) data sets, in order to leverage technology innovation and advancement opportunities in the field of BD analytics. The paper also discusses the relevant Executive Orders (EOs), laws, and Office of Management and Budget (OMB) memorandums that were written to address what federal agencies under the OMB's jurisdiction must do to comply with various aspects of BD.

    Improving the end-to-end latency of datacenter applications using coordination across application components

    To handle millions of user requests every second and process hundreds of terabytes of data each day, many organizations have turned to large datacenter-scale computing systems. The applications running in these datacenters consist of a multitude of dependent logical components, or stages, each performing a specific function. These stages are connected to form a directed acyclic graph (DAG), with edges representing input-output dependencies. Each stage can run over tens to thousands of machines, and involves multiple cluster sub-systems such as storage, network and compute. The scale and complexity of these applications can lead to significant delays in their end-to-end latency. However, the organizations running these applications have strict requirements on this latency, as it directly affects their revenue and operational costs. Addressing this problem, the goal of this dissertation is to develop scheduling and resource allocation techniques that optimize the end-to-end latency of datacenter applications. The key idea behind these techniques is to coordinate across application components, allowing cluster resources to be allocated efficiently. In particular, we develop planning algorithms that coordinate the storage and compute sub-systems in datacenters to determine how many resources should be allocated to each stage in an application, and where in the cluster they should be allocated, to meet application requirements (e.g., completion-time goals, minimizing average completion time). To further speed up applications at runtime, we develop several latency reduction techniques: reissuing laggards elsewhere in the cluster, returning partial results, and speeding up laggards by giving them extra resources. We perform a global optimization that coordinates across all the stages in an application DAG and determines which of these techniques works best for each stage, while ensuring that the cost incurred stays within a given end-to-end budget. We use application characteristics to predict and determine how resources should be allocated to different application components to meet the end-to-end latency requirements. We evaluate our techniques on two different kinds of datacenter applications: (a) web services, and (b) data analytics. With large-scale simulations and an implementation in Apache YARN (Hadoop 2.0), we use workloads derived from production traces to show that our techniques can achieve more than 50% reduction in the 99th percentile latency of web services and up to 56% reduction in the median latency of data analytics jobs.
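
    The abstract does not spell out the global optimization itself; as a hedged illustration of choosing one latency-reduction technique per stage under an end-to-end cost budget, the sketch below runs a knapsack-style dynamic program over hypothetical per-stage (latency saved, cost) options:

```python
def pick_techniques(stages, budget):
    # stages: one option list per DAG stage; an option is a tuple
    # (name, latency_saved, cost). Maximize total latency saved while
    # keeping total cost <= budget (knapsack-style dynamic program).
    best = {0: (0, [])}                       # total cost -> (saved, picks)
    for options in stages:
        nxt = {}
        for cost, (saved, picks) in best.items():
            for name, s, c in options:
                new_cost = cost + c
                if new_cost > budget:
                    continue
                cand = (saved + s, picks + [name])
                if new_cost not in nxt or cand[0] > nxt[new_cost][0]:
                    nxt[new_cost] = cand
        best = nxt
    return max(best.values(), key=lambda v: v[0])

# Hypothetical options for a two-stage application.
stage_options = [
    [("none", 0, 0), ("reissue-laggards", 30, 2), ("extra-resources", 45, 5)],
    [("none", 0, 0), ("partial-results", 25, 1)],
]
print(pick_techniques(stage_options, budget=6))
# (70, ['extra-resources', 'partial-results'])
```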

    ASCR/HEP Exascale Requirements Review Report

    This draft report summarizes and details the findings, results, and recommendations derived from the ASCR/HEP Exascale Requirements Review meeting held in June 2015. The main conclusions are as follows. 1) Larger, more capable computing and data facilities are needed to support HEP science goals in all three frontiers: Energy, Intensity, and Cosmic. The expected scale of demand on the 2025 timescale is at least two orders of magnitude greater -- and in some cases more -- than what is currently available. 2) The growth rate of data produced by simulations is overwhelming the current ability of both facilities and researchers to store and analyze it. Additional resources and new techniques for data analysis are urgently needed. 3) Data rates and volumes from HEP experimental facilities are also straining the ability to store and analyze large and complex data volumes. Appropriately configured leadership-class facilities can play a transformational role in enabling scientific discovery from these datasets. 4) A close integration of HPC simulation and data analysis will aid greatly in interpreting results from HEP experiments. Such an integration will minimize data movement and facilitate interdependent workflows. 5) Long-range planning between HEP and ASCR will be required to meet HEP's research needs. To make the best use of ASCR HPC resources, the experimental HEP program needs a) an established long-term plan for access to ASCR computational and data resources, b) the ability to map workflows onto HPC resources, c) the ability for ASCR facilities to accommodate workflows run by collaborations that can have thousands of individual members, d) to transition codes to the next-generation HPC platforms that will be available at ASCR facilities, and e) to build up and train a workforce capable of developing and using simulations and analysis to support HEP scientific research on next-generation systems.

    Big Data Security (Volume 3)

    After a short description of the key concepts of big data, the book explores the secrecy and security threats posed especially by cloud-based data storage. It delivers conceptual frameworks and models, along with case studies of recent technology.