463 research outputs found

    Evaluation and optimization of Big Data Processing on High Performance Computing Systems

    Get PDF
    Programa Oficial de Doutoramento en Investigación en Tecnoloxías da Información. 524V01[Resumo] Hoxe en día, moitas organizacións empregan tecnoloxías Big Data para extraer información de grandes volumes de datos. A medida que o tamaño destes volumes crece, satisfacer as demandas de rendemento das aplicacións de procesamento de datos masivos faise máis difícil. Esta Tese céntrase en avaliar e optimizar estas aplicacións, presentando dúas novas ferramentas chamadas BDEv e Flame-MR. Por unha banda, BDEv analiza o comportamento de frameworks de procesamento Big Data como Hadoop, Spark e Flink, moi populares na actualidade. BDEv xestiona a súa configuración e despregamento, xerando os conxuntos de datos de entrada e executando cargas de traballo previamente elixidas polo usuario. Durante cada execución, BDEv extrae diversas métricas de avaliación que inclúen rendemento, uso de recursos, eficiencia enerxética e comportamento a nivel de microarquitectura. Doutra banda, Flame-MR permite optimizar o rendemento de aplicacións Hadoop MapReduce. En xeral, o seu deseño baséase nunha arquitectura dirixida por eventos capaz de mellorar a eficiencia dos recursos do sistema mediante o solapamento da computación coas comunicacións. Ademais de reducir o número de copias en memoria que presenta Hadoop, emprega algoritmos eficientes para ordenar e mesturar os datos. Flame-MR substitúe o motor de procesamento de datos MapReduce de xeito totalmente transparente, polo que non é necesario modificar o código de aplicacións xa existentes. A mellora de rendemento de Flame-MR foi avaliada de maneira exhaustiva en sistemas clúster e cloud, executando tanto benchmarks estándar coma aplicacións pertencentes a casos de uso reais. Os resultados amosan unha redución de entre un 40% e un 90% do tempo de execución das aplicacións. Esta Tese proporciona aos usuarios e desenvolvedores de Big Data dúas potentes ferramentas para analizar e comprender o comportamento de frameworks de procesamento de datos e reducir o tempo de execución das aplicacións sen necesidade de contar con coñecemento experto para elo.[Resumen] Hoy en día, muchas organizaciones utilizan tecnologías Big Data para extraer información de grandes volúmenes de datos. A medida que el tamaño de estos volúmenes crece, satisfacer las demandas de rendimiento de las aplicaciones de procesamiento de datos masivos se vuelve más difícil. Esta Tesis se centra en evaluar y optimizar estas aplicaciones, presentando dos nuevas herramientas llamadas BDEv y Flame-MR. Por un lado, BDEv analiza el comportamiento de frameworks de procesamiento Big Data como Hadoop, Spark y Flink, muy populares en la actualidad. BDEv gestiona su configuración y despliegue, generando los conjuntos de datos de entrada y ejecutando cargas de trabajo previamente elegidas por el usuario. Durante cada ejecución, BDEv extrae diversas métricas de evaluación que incluyen rendimiento, uso de recursos, eficiencia energética y comportamiento a nivel de microarquitectura. Por otro lado, Flame-MR permite optimizar el rendimiento de aplicaciones Hadoop MapReduce. En general, su diseño se basa en una arquitectura dirigida por eventos capaz de mejorar la eficiencia de los recursos del sistema mediante el solapamiento de la computación con las comunicaciones. Además de reducir el número de copias en memoria que presenta Hadoop, utiliza algoritmos eficientes para ordenar y mezclar los datos. Flame-MR reemplaza el motor de procesamiento de datos MapReduce de manera totalmente transparente, por lo que no se necesita modificar el código de aplicaciones ya existentes. La mejora de rendimiento de Flame-MR ha sido evaluada de manera exhaustiva en sistemas clúster y cloud, ejecutando tanto benchmarks estándar como aplicaciones pertenecientes a casos de uso reales. Los resultados muestran una reducción de entre un 40% y un 90% del tiempo de ejecución de las aplicaciones. Esta Tesis proporciona a los usuarios y desarrolladores de Big Data dos potentes herramientas para analizar y comprender el comportamiento de frameworks de procesamiento de datos y reducir el tiempo de ejecución de las aplicaciones sin necesidad de contar con conocimiento experto para ello.[Abstract] Nowadays, Big Data technologies are used by many organizations to extract valuable information from large-scale datasets. As the size of these datasets increases, meeting the huge performance requirements of data processing applications becomes more challenging. This Thesis focuses on evaluating and optimizing these applications by proposing two new tools, namely BDEv and Flame-MR. On the one hand, BDEv allows to thoroughly assess the behavior of widespread Big Data processing frameworks such as Hadoop, Spark and Flink. It manages the configuration and deployment of the frameworks, generating the input datasets and launching the workloads specified by the user. During each workload, it automatically extracts several evaluation metrics that include performance, resource utilization, energy efficiency and microarchitectural behavior. On the other hand, Flame-MR optimizes the performance of existing Hadoop MapReduce applications. Its overall design is based on an event-driven architecture that improves the efficiency of the system resources by pipelining data movements and computation. Moreover, it avoids redundant memory copies present in Hadoop, while also using efficient sort and merge algorithms for data processing. Flame-MR replaces the underlying MapReduce data processing engine in a transparent way and thus the source code of existing applications does not require to be modified. The performance benefits provided by Flame- MR have been thoroughly evaluated on cluster and cloud systems by using both standard benchmarks and real-world applications, showing reductions in execution time that range from 40% to 90%. This Thesis provides Big Data users with powerful tools to analyze and understand the behavior of data processing frameworks and reduce the execution time of the applications without requiring expert knowledge

    Big Data Analytics on Traditional HPC Infrastructure Using Two-Level Storage

    Full text link
    Data-intensive computing has become one of the major workloads on traditional high-performance computing (HPC) clusters. Currently, deploying data-intensive computing software framework on HPC clusters still faces performance and scalability issues. In this paper, we develop a new two-level storage system by integrating Tachyon, an in-memory file system with OrangeFS, a parallel file system. We model the I/O throughputs of four storage structures: HDFS, OrangeFS, Tachyon and two-level storage. We conduct computational experiments to characterize I/O throughput behavior of two-level storage and compare its performance to that of HDFS and OrangeFS, using TeraSort benchmark. Theoretical models and experimental tests both show that the two-level storage system can increase the aggregate I/O throughputs. This work lays a solid foundation for future work in designing and building HPC systems that can provide a better support on I/O intensive workloads with preserving existing computing resources.Comment: Submitted to SC15, 8 pages, 7 figures, 3 table

    Optimizing Coherence Traffic in Manycore Processors Using Closed-Form Caching/Home Agent Mappings

    Get PDF
    [Abstract] Manycore processors feature a high number of general-purpose cores designed to work in a multithreaded fashion. Recent manycore processors are kept coherent using scalable distributed directories. A paramount example is the Intel Mesh interconnect, which consists of a network-on-chip interconnecting “tiles”, each of which contains computation cores, local caches, and coherence masters. The distributed coherence subsystem must be queried for every out-of-tile access, imposing an overhead on memory latency. This paper studies the physical layout of an Intel Knights Landing processor, with a particular focus on the coherence subsystem, and uncovers the pseudo-random mapping function of physical memory blocks across the pieces of the distributed directory. Leveraging this knowledge, candidate optimizations to improve memory latency through the minimization of coherence traffic are studied. Although these optimizations do improve memory throughput, ultimately this does not translate into performance gains due to inherent overheads stemming from the computational complexity of the mapping functions.Ministerio de Educación; FPU16/00816U.S. National Science Foundation; CCF-1750399Xunta de Galicia and FEDER; ED431G 2019/01Ministerio de Ciencia e Innovación; PID2019-104184RB-I0

    Optimizing iterative data-flow scientific applications using directed cyclic graphs

    Get PDF
    Data-flow programming models have become a popular choice for writing parallel applications as an alternative to traditional work-sharing parallelism. They are better suited to write applications with irregular parallelism that can present load imbalance. However, these programming models suffer from overheads related to task creation, scheduling and dependency management, limiting performance and scalability when tasks become too small. At the same time, many HPC applications implement iterative methods or multi-step simulations that create the same directed acyclic graphs of tasks on each iteration. By giving application programmers a way to express that a specific loop is creating the same task pattern on each iteration, we can create a single task directed acyclic graph (DAG) once and transform it into a cyclic graph. This cyclic graph is then reused for successive iterations, minimizing task creation and dependency management overhead. This paper presents the taskiter, a new construct we propose for the OmpSs-2 and OpenMP programming models, allowing the use of directed cyclic task graphs (DCTG) to minimize runtime overheads. Moreover, we present a simple immediate successor locality-aware heuristic that minimizes task scheduling overhead by bypassing the runtime task scheduler. We evaluate the implementation of the taskiter and the immediate successor heuristic in 8 iterative benchmarks. Using small task granularities, we obtain a geometric mean speedup of 2.56x over the reference OmpSs-2 implementation, and a 3.77x and 5.2x speedup over the LLVM and GCC OpenMP runtimes, respectively.This work was supported in part by the European Union’s Horizon 2020/EuroHPC Research and Innovation Programme (DEEP-SEA) under Grant 955606; in part by the Spanish State Research Agency—Ministry of Science and Innovation, Generalitat de Catalunya, under Project PCI2021121958 and Project 2021-SGR-01007; in part by the Spanish Ministry of Science and Technology under Contract PID2019-107255GB; and in part by Severo Ochoa under Grant CEX2021-001148-S/MCIN/AEI/10.13039/501100011033.Peer ReviewedPostprint (published version

    Many-Task Computing and Blue Waters

    Full text link
    This report discusses many-task computing (MTC) generically and in the context of the proposed Blue Waters systems, which is planned to be the largest NSF-funded supercomputer when it begins production use in 2012. The aim of this report is to inform the BW project about MTC, including understanding aspects of MTC applications that can be used to characterize the domain and understanding the implications of these aspects to middleware and policies. Many MTC applications do not neatly fit the stereotypes of high-performance computing (HPC) or high-throughput computing (HTC) applications. Like HTC applications, by definition MTC applications are structured as graphs of discrete tasks, with explicit input and output dependencies forming the graph edges. However, MTC applications have significant features that distinguish them from typical HTC applications. In particular, different engineering constraints for hardware and software must be met in order to support these applications. HTC applications have traditionally run on platforms such as grids and clusters, through either workflow systems or parallel programming systems. MTC applications, in contrast, will often demand a short time to solution, may be communication intensive or data intensive, and may comprise very short tasks. Therefore, hardware and software for MTC must be engineered to support the additional communication and I/O and must minimize task dispatch overheads. The hardware of large-scale HPC systems, with its high degree of parallelism and support for intensive communication, is well suited for MTC applications. However, HPC systems often lack a dynamic resource-provisioning feature, are not ideal for task communication via the file system, and have an I/O system that is not optimized for MTC-style applications. Hence, additional software support is likely to be required to gain full benefit from the HPC hardware

    Modeling and optimization of high-performance many-core systems for energy-efficient and reliable computing

    Full text link
    Thesis (Ph.D.)--Boston UniversityMany-core systems, ranging from small-scale many-core processors to large-scale high performance computing (HPC) data centers, have become the main trend in computing system design owing to their potential to deliver higher throughput per watt. However, power densities and temperatures increase following the growth in the performance capacity, and bring major challenges in energy efficiency, cooling costs, and reliability. These challenges require a joint assessment of performance, power, and temperature tradeoffs as well as the design of runtime optimization techniques that monitor and manage the interplay among them. This thesis proposes novel modeling and runtime management techniques that evaluate and optimize the performance, energy, and reliability of many-core systems. We first address the energy and thermal challenges in 3D-stacked many-core processors. 3D processors with stacked DRAM have the potential to dramatically improve performance owing to lower memory access latency and higher bandwidth. However, the performance increase may cause 3D systems to exceed the power budgets or create thermal hot spots. In order to provide an accurate analysis and enable the design of efficient management policies, this thesis introduces a simulation framework to jointly analyze performance, power, and temperature for 3D systems. We then propose a runtime optimization policy that maximizes the system performance by characterizing the application behavior and predicting the operating points that satisfy the power and thermal constraints. Our policy reduces the energy-delay product (EDP) by up to 61.9% compared to existing strategies. Performance, cooling energy, and reliability are also critical aspects in HPC data centers. In addition to causing reliability degradation, high temperatures increase the required cooling energy. Communication cost, on the other hand, has a significant impact on system performance in HPC data centers. This thesis proposes a topology-aware technique that maximizes system reliability by selecting between workload clustering and balancing. Our policy improves the system reliability by up to 123.3% compared to existing temperature balancing approaches. We also introduce a job allocation methodology to simultaneously optimize the communication cost and the cooling energy in a data center. Our policy reduces the cooling cost by 40% compared to cooling-aware and performance-aware policies, while achieving comparable performance to performance-aware policy
    corecore