213 research outputs found

    Energy Demand Response for High-Performance Computing Systems

    Get PDF
    The growing computational demand of scientific applications has greatly motivated the development of large-scale high-performance computing (HPC) systems in the past decade. To accommodate the increasing demand of applications, HPC systems have been going through dramatic architectural changes (e.g., introduction of many-core and multi-core systems, rapid growth of complex interconnection network for efficient communication between thousands of nodes), as well as significant increase in size (e.g., modern supercomputers consist of hundreds of thousands of nodes). With such changes in architecture and size, the energy consumption by these systems has increased significantly. With the advent of exascale supercomputers in the next few years, power consumption of the HPC systems will surely increase; some systems may even consume hundreds of megawatts of electricity. Demand response programs are designed to help the energy service providers to stabilize the power system by reducing the energy consumption of participating systems during the time periods of high demand power usage or temporary shortage in power supply. This dissertation focuses on developing energy-efficient demand-response models and algorithms to enable HPC system\u27s demand response participation. In the first part, we present interconnection network models for performance prediction of large-scale HPC applications. They are based on interconnected topologies widely used in HPC systems: dragonfly, torus, and fat-tree. Our interconnect models are fully integrated with an implementation of message-passing interface (MPI) that can mimic most of its functions with packet-level accuracy. Extensive experiments show that our integrated models provide good accuracy for predicting the network behavior, while at the same time allowing for good parallel scaling performance. In the second part, we present an energy-efficient demand-response model to reduce HPC systems\u27 energy consumption during demand response periods. We propose HPC job scheduling and resource provisioning schemes to enable HPC system\u27s emergency demand response participation. In the final part, we propose an economic demand-response model to allow both HPC operator and HPC users to jointly reduce HPC system\u27s energy cost. Our proposed model allows the participation of HPC systems in economic demand-response programs through a contract-based rewarding scheme that can incentivize HPC users to participate in demand response

    Energy Awareness and Scheduling in Mobile Devices and High End Computing

    Get PDF
    In the context of the big picture as energy demands rise due to growing economies and growing populations, there will be greater emphasis on sustainable supply, conservation, and efficient usage of this vital resource. Even at a smaller level, the need for minimizing energy consumption continues to be compelling in embedded, mobile, and server systems such as handheld devices, robots, spaceships, laptops, cluster servers, sensors, etc. This is due to the direct impact of constrained energy sources such as battery size and weight, as well as cooling expenses in cluster-based systems to reduce heat dissipation. Energy management therefore plays a paramount role in not only hardware design but also in user-application, middleware and operating system design. At a higher level Datacenters are sprouting everywhere due to the exponential growth of Big Data in every aspect of human life, the buzz word these days is Cloud computing. This dissertation, focuses on techniques, specifically algorithmic ones to scale down energy needs whenever the system performance can be relaxed. We examine the significance and relevance of this research and develop a methodology to study this phenomenon. Specifically, the research will study energy-aware resource reservations algorithms to satisfy both performance needs and energy constraints. Many energy management schemes focus on a single resource that is dedicated to real-time or nonreal-time processing. Unfortunately, in many practical systems the combination of hard and soft real-time periodic tasks, a-periodic real-time tasks, interactive tasks and batch tasks must be supported. Each task may also require access to multiple resources. Therefore, this research will tackle the NP-hard problem of providing timely and simultaneous access to multiple resources by the use of practical abstractions and near optimal heuristics aided by cooperative scheduling. We provide an elegant EAS model which works across the spectrum which uses a run-profile based approach to scheduling. We apply this model to significant applications such as BLAT and Assembly of gene sequences in the Bioinformatics domain. We also provide a simulation for extending this model to cloud computing to answers “what if” scenario questions for consumers and operators of cloud resources to help answers questions of deadlines, single v/s distributed cluster use and impact analysis of energy-index and availability against revenue and ROI

    Multi-Objective Task Scheduling Using Smart MPI-Based Cloud Resources

    Get PDF
    Task Scheduling and Resource Allocation (TSRA) is the key focus of cloud computing. This paper utilizes Smart Message Passing Interface based Approach (SMPIA) and the Roulette Wheel selection method in order to determine the best Alternative Virtual Machine (AVM). To do so, the Virtual MPI Bus (VMPIB) is employed for efficient communication among Virtual Machines (VMs) using SMPIA. In this matter, SMPIA is applied on different resource allocation and task scheduling strategies. MakeSpan (MS) was chosen as an optimization factor and solutions with minimum MS value as the best task mapping performance and reduced cloud consumption. The simulation is conducted using MATLAB. The analysis proves that applying SMPIA reduced the Total Execution Time (TET) of resource allocation, maximum MS time, and increase the Resource Utilization (RU), as compared to non-SMPIA for Greedy, Max-Min, Min-Min algorithms. It is observed that SMPIA can outperform non-SMPIA. The effect of SMPIA is more obvious as change in the MS and the number of cloud workloads increase. Furthermore, regarding the TET and MS of the tasks, the SMPIA can significantly reduce the starvation problem as well as the lack of sufficient resources. In addition, this approach improves the system's performance more than the previous methods, what reflects effectiveness of the proposed approach concerning the Message Passing Interface (MPI) communication time in the network virtualization. The mentioned text mining work was prepared concurrently after practical evaluation

    Partial aggregation for collective communication in distributed memory machines

    Get PDF
    High Performance Computing (HPC) systems interconnect a large number of Processing Elements (PEs) in high-bandwidth networks to simulate complex scientific problems. The increasing scale of HPC systems poses great challenges on algorithm designers. As the average distance between PEs increases, data movement across hierarchical memory subsystems introduces high latency. Minimizing latency is particularly challenging in collective communications, where many PEs may interact in complex communication patterns. Although collective communications can be optimized for network-level parallelism, occasional synchronization delays due to dependencies in the communication pattern degrade application performance. To reduce the performance impact of communication and synchronization costs, parallel algorithms are designed with sophisticated latency hiding techniques. The principle is to interleave computation with asynchronous communication, which increases the overall occupancy of compute cores. However, collective communication primitives abstract parallelism which limits the integration of latency hiding techniques. Approaches to work around these limitations either modify the algorithmic structure of application codes, or replace collective primitives with verbose low-level communication calls. While these approaches give fine-grained control for latency hiding, implementing collective communication algorithms is challenging and requires expertise knowledge about HPC network topologies. A collective communication pattern is commonly described as a Directed Acyclic Graph (DAG) where a set of PEs, represented as vertices, resolve data dependencies through communication along the edges. Our approach improves latency hiding in collective communication through partial aggregation. Based on mathematical rules of binary operations and homomorphism, we expose data parallelism in a respective DAG to overlap computation with communication. The proposed concepts are implemented and evaluated with a subset of collective primitives in the Message Passing Interface (MPI), an established communication standard in scientific computing. An experimental analysis with communication-bound microbenchmarks shows considerable performance benefits for the evaluated collective primitives. A detailed case study with a large-scale distributed sort algorithm demonstrates, how partial aggregation significantly improves performance in data-intensive scenarios. Besides better latency hiding capabilities with collective communication primitives, our approach enables further optimizations of their implementations within MPI libraries. The vast amount of asynchronous programming models, which are actively studied in the HPC community, benefit from partial aggregation in collective communication patterns. Future work can utilize partial aggregation to improve the interaction of MPI collectives with acclerator architectures, and to design more efficient communication algorithms

    A Design Framework for Efficient Distributed Analytics on Structured Big Data

    Get PDF
    Distributed analytics architectures are often comprised of two elements: a compute engine and a storage system. Conventional distributed storage systems usually store data in the form of files or key-value pairs. This abstraction simplifies how the data is accessed and reasoned about by an application developer. However, the separation of compute and storage systems makes it difficult to optimize costly disk and network operations. By design the storage system is isolated from the workload and its performance requirements such as block co-location and replication. Furthermore, optimizing fine-grained data access requests becomes difficult as the storage layer is hidden away behind such abstractions. Using a clean slate approach, this thesis proposes a modular distributed analytics system design which is centered around a unified interface for distributed data objects named the DDO. The interface couples key mechanisms that utilize storage, memory, and compute resources. This coupling makes it ideal to optimize data access requests across all memory hierarchy levels, with respect to the workload and its performance requirements. In addition to the DDO, a complementary DDO controller implementation controls the logical view of DDOs, their replication, and distribution across the cluster. A proof-of-concept implementation shows improvement in mean query time by 3-6x on the TPC-H and TPC-DS benchmarks, and more than an order of magnitude improvement in many cases

    Parallel Processes in HPX: Designing an Infrastructure for Adaptive Resource Management

    Get PDF
    Advancement in cutting edge technologies have enabled better energy efficiency as well as scaling computational power for the latest High Performance Computing(HPC) systems. However, complexity, due to hybrid architectures as well as emerging classes of applications, have shown poor computational scalability using conventional execution models. Thus alternative means of computation, that addresses the bottlenecks in computation, is warranted. More precisely, dynamic adaptive resource management feature, both from systems as well as application\u27s perspective, is essential for better computational scalability and efficiency. This research presents and expands the notion of Parallel Processes as a placeholder for procedure definitions, targeted at one or more synchronous domains, meta data for computation and resource management as well as infrastructure for dynamic policy deployment. In addition to this, the research presents additional guidelines for a framework for resource management in HPX runtime system. Further, this research also lists design principles for scalability of Active Global Address Space (AGAS), a necessary feature for Parallel Processes. Also, to verify the usefulness of Parallel Processes, a preliminary performance evaluation of different task scheduling policies is carried out using two different applications. The applications used are: Unbalanced Tree Search, a reference dynamic graph application, implemented by this research in HPX and MiniGhost, a reference stencil based application using bulk synchronous parallel model. The results show that different scheduling policies provide better performance for different classes of applications; and for the same application class, in certain instances, one policy fared better than the others, while vice versa in other instances, hence supporting the hypothesis of the need of dynamic adaptive resource management infrastructure, for deploying different policies and task granularities, for scalable distributed computing

    Decentralized Scheduling for Many-Task Applications in the Hybrid Cloud

    Get PDF
    While Cloud Computing has transformed how we solve many computing tasks, some scientific and many-task applications are not efficiently executed on cloud resources. Decentralized scheduling, as studied in grid computing, can provide a scalable system to organize cloud resources and schedule a variety of work. By measuring simulations of two algorithms, the fully decentralized Organic Grid, and the partially decentralized Air Traffic Controller from IBM, we establish that decentralization is a workable approach, and that there are bottlenecks that can impact partially centralized algorithms. Through measurements in the cloud, we verify that our simulation approach is sound, and assess the variable performance of cloud resources. We propose a scheduler that measures the capabilities of the resources available to execute a task and distributes work dynamically at run time. Our scheduling algorithm is evaluated experimentally, and we show that performance-aware scheduling in a cloud environment can provide improvements in execution time. This provides a framework by which a variety of parameters can be weighed to make job-specific and context-aware scheduling decisions. Our measurements examine the usefulness of benchmarking as a metric used to measure a node\u27s performance, and drive scheduling. Benchmarking provides an advantage over simple queue-based scheduling on distributed systems whose members vary in actual performance, but the NAS benchmark we use does not always correlate perfectly with actual performance. The utilized hardware is examined, as are enforced performance variations, and we observe changes in performance that result in running on a system in which different workers receive different CPU allocations. As we see that performance metrics are useful near the end of the execution of a large job, we create a new metric from historical data of partially completed work, and use that to drive execution time down further. Interdependent task graph work is introduced and described as a next step in improving cloud scheduling. Realistic task graph problems are defined and a scheduling approach is introduced. This dissertation lays the groundwork to expand the types of problems that can be solved efficiently in the cloud environment

    Middleware for large scale in situ analytics workflows

    Get PDF
    The trend to exascale is causing researchers to rethink the entire computa- tional science stack, as future generation machines will contain both diverse hardware environments and run times that manage them. Additionally, the science applications themselves are stepping away from the traditional bulk-synchronous model and are moving towards a more dynamic and decoupled environment where analysis routines are run in situ alongside the large scale simulations. This thesis presents CoApps, a middleware that allows in situ science analytics applications to operate in a location-flexible manner. Additionally, CoApps explores methods to extract information from, and issue management operations to, lower level run times that are managing the diverse hardware expected to be found on next generation exascale machines. This work leverages experience with several extremely scalable applications in materials and fusion, and has been evaluated on machines ranging from local Linux clusters to the supercomputer Titan.Ph.D

    Integration of a parallel efficiency monitoring tool into an HPC production system

    Get PDF
    This thesis presents the design, implementation, and evaluation of an extension of a library called TALP for tracing useful computation and performance metrics in MPI-based applications (Message Passing Interface). The extension also integrates the tool with a web portal called UserPortal. They are both developed at the Barcelona Supercomputing Center (BSC). The library captures information about communication patterns and computation performed by MPI applications, and makes this information available to users. The extension developed in this project adds the functionality of reading PAPI (Performance Application Programming Interface) counters, allowing users to know the instructions per cycle of their application and identify bottlenecks in their code. UserPortal provides an easy-to-use interface for visualizing and analyzing the captured information and allows users to easily monitor the status of their jobs, such as memory usage, CPU usage, and their evolution over time. The integration of the library with the BSC system involves several stages of design and development, including a software wrapper, a modulefile, scripts for retrieving and processing data, and web development to display the data on the UserPortal. However, users must be educated in performance analysis in order to effectively make a good reading and interpretation of the reported metrics and optimize their codes. A public documentation has been developed as well as a reference on how to use these tools on BSC machines, along with links to educational resources on related topics. Overall, this work provides a valuable tool for developers and researchers working with MPI-based applications, making performance optimization more approachable and efficient.Aquesta tesi presenta el disseny, la implementació i l'avaluació d'una extensió d'una llibreria anomenada TALP per traçar el càlcul útil i les mètriques de rendiment en aplicacions basades en MPI (Message Programming Interface por les seves sigles en anglès). L'extensió també integra l'eina amb un portal web anomenat UserPortal. Tots dos són desenvolupats al Barcelona Supercomputing Center (BSC). La llibreria captura informació sobre patrons de comunicació i càlcul realitzat per aplicacions MPI i posa aquesta informació a disposició dels usuaris. L'extensió desenvolupada en aquest projecte afegeix la funcionalitat de llegir comptadors PAPI (Performance Application Programming Interface), permetent als usuaris saber les instruccions per cicle de la seva aplicació i identificar colls d'ampolla en el seu codi. UserPortal proporciona una interfície fàcil d'utilitzar per visualitzar i analitzar la informació capturada i permet als usuaris monitoritzar fàcilment l'estat dels seus treballs, com ara l'ús de la memòria, l'ús de la CPU i la seva evolució en el temps. La integració de la llibreria amb el sistema del BSC implica diverses etapes de disseny i desenvolupament, incloent un envolupador de programari, un modulefile, scripts per recuperar i processar dades i desenvolupament web per mostrar les dades en el UserPortal. No obstant, els usuaris han de ser educats en l'anàlisi de rendiment per tal de fer una lectura i interpretació efectiva de les mètriques reportades i ser capaços d'optimitzar els seus codis. S'ha desenvolupat una documentació pública així com una referència sobre com utilitzar aquestes eines en les màquines BSC, juntament amb enllaços a recursos educatius sobre temes relacionats. En general, aquest treball proporciona una eina valuosa per als desenvolupadors i investigadors que treballen amb aplicacions basades en MPI, fent que l'optimització del rendiment sigui més accessible i eficient
    corecore