1,689 research outputs found

    A protocol reconfiguration and optimization system for MPI

    Get PDF
    Modern high performance computing (HPC) applications, for example adaptive mesh refinement and multi-physics codes, have dynamic communication characteristics which result in poor performance on current Message Passing Interface (MPI) implementations. The degraded application performance can be attributed to a mismatch between changing application requirements and static communication library functionality. To improve the performance of these applications, MPI libraries should change their protocol functionality in response to changing application requirements, and tailor their functionality to take advantage of hardware capabilities. This dissertation describes Protocol Reconfiguration and Optimization system for MPI (PRO-MPI), a framework for constructing profile-driven reconfigurable MPI libraries; these libraries use past application characteristics (profiles) to dynamically change their functionality to match the changing application requirements. The framework addresses the challenges of designing and implementing the reconfigurable MPI libraries, which include collecting and reasoning about application characteristics to drive the protocol reconfiguration and defining abstractions required for implementing these reconfigurations. Two prototype reconfigurable MPI implementations based on the framework - Open PRO-MPI and Cactus PRO-MPI - are also presented to demonstrate the utility of the framework. To demonstrate the effectiveness of reconfigurable MPI libraries, this dissertation presents experimental results to show the impact of using these libraries on the application performance. The results show that PRO-MPI improves the performance of important HPC applications and benchmarks. They also show that HyperCLaw performance improves by approximately 22% when exact profiles are available, and HyperCLaw performance improves by approximately 18% when only approximate profiles are available

    GSI: a GPU stall inspector to characterize the sources of memory stalls for tightly coupled GPUs

    Get PDF
    In recent years the power wall has prevented the continued scaling of single core performance. This has led to the rise of dark silicon and motivated a move toward parallelism and specialization. As a result, energy-efficient high-throughput GPU cores are increasingly favored for accelerating data-parallel applications. However, the best way to efficiently communicate and synchronize across heterogeneous cores remains an important open research question. Many methods have been proposed to improve the efficiency of heterogeneous memory systems, but current methods for evaluating the performance effects of these innovations are limited in their ability to attribute differences in execution time to sources of latency in the memory system. Performance characterization of tightly coupled CPU-GPU systems is complicated by the high levels of parallelism present in GPU codes. Existing simulation tools provide only coarse-grained metrics which can obscure the underlying memory system interactions that cause performance differences. In this thesis we introduce GPU Stall Inspector (GSI), a method for identifying and visualizing the causes of GPU stalls with a focus on a tightly coupled CPU-GPU memory subsystem. We demonstrate the utility of our approach by evaluating the sources of stalls in several recent architectural innovations for tightly coupled, heterogeneous CPU-GPU systems

    Software-based and regionally-oriented traffic management in Networks-on-Chip

    Get PDF
    Since the introduction of chip-multiprocessor systems, the number of integrated cores has been steady growing and workload applications have been adapted to exploit the increasing parallelism. This changed the importance of efficient on-chip communication significantly and the infrastructure has to keep step with these new requirements. The work at hand makes significant contributions to the state-of-the-art of the latest generation of such solutions, called Networks-on-Chip, to improve the performance, reliability, and flexible management of these on-chip infrastructures

    Analysis, classification and comparison of scheduling techniques for software transactional memories

    Get PDF
    Transactional Memory (TM) is a practical programming paradigm for developing concurrent applications. Performance is a critical factor for TM implementations, and various studies demonstrated that specialised transaction/thread scheduling support is essential for implementing performance-effective TM systems. After one decade of research, this article reviews the wide variety of scheduling techniques proposed for Software Transactional Memories. Based on peculiarities and differences of the adopted scheduling strategies, we propose a classification of the existing techniques, and we discuss the specific characteristics of each technique. Also, we analyse the results of previous evaluation and comparison studies, and we present the results of a new experimental study encompassing techniques based on different scheduling strategies. Finally, we identify potential strengths and weaknesses of the different techniques, as well as the issues that require to be further investigated

    ERASE: Energy Efficient Task Mapping and Resource Management for Work Stealing Runtimes

    Get PDF
    Parallel applications often rely on work stealing schedulers in combination with fine-grained tasking to achieve high performance and scalability. However, reducing the total energy consumption in the context of work stealing runtimes is still challenging, particularly when using asymmetric architectures with different types of CPU cores. A common approach for energy savings involves dynamic voltage and frequency scaling (DVFS) wherein throttling is carried out based on factors like task parallelism, stealing relations, and task criticality. This article makes the following observations: (i) leveraging DVFS on a per-task basis is impractical when using fine-grained tasking and in environments with cluster/chip-level DVFS; (ii) task moldability, wherein a single task can execute on multiple threads/cores via work-sharing, can help to reduce energy consumption; and (iii) mismatch between tasks and assigned resources (i.e., core type and number of cores) can detrimentally impact energy consumption. In this article, we propose EneRgy Aware SchedulEr (ERASE), an intra-application task scheduler on top of work stealing runtimes that aims to reduce the total energy consumption of parallel applications. It achieves energy savings by guiding scheduling decisions based on per-task energy consumption predictions of different resource configurations. In addition, ERASE is capable of adapting to both given static frequency settings and externally controlled DVFS. Overall, ERASE achieves up to 31% energy savings and improves performance by 44% on average, compared to the state-of-the-art DVFS-based schedulers

    RUNTIME METHODS TO IMPROVE ENERGY EFFICIENCY IN SUPERCOMPUTING APPLICATIONS

    Get PDF
    Energy efficiency in supercomputing is critical to limit operating costs and carbon footprints. While the energy efficiency of future supercomputing centers needs to improve at all levels, the energy consumed by the processing units is a large fraction of the total energy consumed by High Performance Computing (HPC) systems. HPC applications use a parallel programming paradigm like the Message Passing Interface (MPI) to coordinate computation and communication among thousands of processors. With dynamically-changing factors both in hardware and software affecting energy usage of processors, there exists a need for power monitoring and regulation at runtime to achieve savings in energy. This dissertation highlights an adaptive runtime framework that enables processors with core-specific power control by dynamically adapting to workload characteristics to reduce power with little or no performance impact. Two opportunities to improve the energy efficiency of processors running MPI applications are identified - computational workload imbalance and waiting on memory. Monitoring of performance and power regulation is performed by the framework transparently within the MPI runtime system, eliminating the need for code changes to MPI applications. The effect of enforcing power limits (capping) on processors is also investigated. Experiments on 32 nodes (1024 cores) show that in presence of workload imbalance, the runtime reduces Central Processing Unit (CPU) frequency on cores not on the critical path, thereby reducing power and hence energy usage without deteriorating performance. Using this runtime, six MPI mini-applications and a full MPI application show an overall 20% decrease in energy use with less than 1% increase in execution time. In addition, the lowering of frequency on non-critical cores reduces run-to-run performance variation and improves performance. For the full application, an average speedup of 11% is seen, while the power is lowered by about 31% for an energy savings of up to 42%. Another experiment on 16 nodes (256 cores) that are power capped also shows performance improvement along with power reduction. Thus, energy optimization can also be a performance optimization. For applications that are limited by memory access times, memory metrics identified facilitate lowering of power by up to 32% without adversely impacting performance.Doctor of Philosoph

    A Survey of Green Networking Research

    Full text link
    Reduction of unnecessary energy consumption is becoming a major concern in wired networking, because of the potential economical benefits and of its expected environmental impact. These issues, usually referred to as "green networking", relate to embedding energy-awareness in the design, in the devices and in the protocols of networks. In this work, we first formulate a more precise definition of the "green" attribute. We furthermore identify a few paradigms that are the key enablers of energy-aware networking research. We then overview the current state of the art and provide a taxonomy of the relevant work, with a special focus on wired networking. At a high level, we identify four branches of green networking research that stem from different observations on the root causes of energy waste, namely (i) Adaptive Link Rate, (ii) Interface proxying, (iii) Energy-aware infrastructures and (iv) Energy-aware applications. In this work, we do not only explore specific proposals pertaining to each of the above branches, but also offer a perspective for research.Comment: Index Terms: Green Networking; Wired Networks; Adaptive Link Rate; Interface Proxying; Energy-aware Infrastructures; Energy-aware Applications. 18 pages, 6 figures, 2 table
    • …
    corecore