320 research outputs found

    Metascheduling of HPC Jobs in Day-Ahead Electricity Markets

    Full text link
    High performance grid computing is a key enabler of large scale collaborative computational science. With the promise of exascale computing, high performance grid systems are expected to incur electricity bills that grow super-linearly over time. In order to achieve cost effectiveness in these systems, it is essential for the scheduling algorithms to exploit electricity price variations, both in space and time, that are prevalent in the dynamic electricity price markets. In this paper, we present a metascheduling algorithm to optimize the placement of jobs in a compute grid which consumes electricity from the day-ahead wholesale market. We formulate the scheduling problem as a Minimum Cost Maximum Flow problem and leverage queue waiting time and electricity price predictions to accurately estimate the cost of job execution at a system. Using trace based simulation with real and synthetic workload traces, and real electricity price data sets, we demonstrate our approach on two currently operational grids, XSEDE and NorduGrid. Our experimental setup collectively constitute more than 433K processors spread across 58 compute systems in 17 geographically distributed locations. Experiments show that our approach simultaneously optimizes the total electricity cost and the average response time of the grid, without being unfair to users of the local batch systems.Comment: Appears in IEEE Transactions on Parallel and Distributed System

    Exploring Scheduling for On-demand File Systems and Data Management within HPC Environments

    Get PDF

    Exploring Scheduling for On-demand File Systems and Data Management within HPC Environments

    Get PDF

    Routing on the Channel Dependency Graph:: A New Approach to Deadlock-Free, Destination-Based, High-Performance Routing for Lossless Interconnection Networks

    Get PDF
    In the pursuit for ever-increasing compute power, and with Moore's law slowly coming to an end, high-performance computing started to scale-out to larger systems. Alongside the increasing system size, the interconnection network is growing to accommodate and connect tens of thousands of compute nodes. These networks have a large influence on total cost, application performance, energy consumption, and overall system efficiency of the supercomputer. Unfortunately, state-of-the-art routing algorithms, which define the packet paths through the network, do not utilize this important resource efficiently. Topology-aware routing algorithms become increasingly inapplicable, due to irregular topologies, which either are irregular by design, or most often a result of hardware failures. Exchanging faulty network components potentially requires whole system downtime further increasing the cost of the failure. This management approach becomes more and more impractical due to the scale of today's networks and the accompanying steady decrease of the mean time between failures. Alternative methods of operating and maintaining these high-performance interconnects, both in terms of hardware- and software-management, are necessary to mitigate negative effects experienced by scientific applications executed on the supercomputer. However, existing topology-agnostic routing algorithms either suffer from poor load balancing or are not bounded in the number of virtual channels needed to resolve deadlocks in the routing tables. Using the fail-in-place strategy, a well-established method for storage systems to repair only critical component failures, is a feasible solution for current and future HPC interconnects as well as other large-scale installations such as data center networks. Although, an appropriate combination of topology and routing algorithm is required to minimize the throughput degradation for the entire system. This thesis contributes a network simulation toolchain to facilitate the process of finding a suitable combination, either during system design or while it is in operation. On top of this foundation, a key contribution is a novel scheduling-aware routing, which reduces fault-induced throughput degradation while improving overall network utilization. The scheduling-aware routing performs frequent property preserving routing updates to optimize the path balancing for simultaneously running batch jobs. The increased deployment of lossless interconnection networks, in conjunction with fail-in-place modes of operation and topology-agnostic, scheduling-aware routing algorithms, necessitates new solutions to solve the routing-deadlock problem. Therefore, this thesis further advances the state-of-the-art by introducing a novel concept of routing on the channel dependency graph, which allows the design of an universally applicable destination-based routing capable of optimizing the path balancing without exceeding a given number of virtual channels, which are a common hardware limitation. This disruptive innovation enables implicit deadlock-avoidance during path calculation, instead of solving both problems separately as all previous solutions

    Catch Me If You Can: Using Power Analysis to Identify HPC Activity

    Full text link
    Monitoring users on large computing platforms such as high performance computing (HPC) and cloud computing systems is non-trivial. Utilities such as process viewers provide limited insight into what users are running, due to granularity limitation, and other sources of data, such as system call tracing, can impose significant operational overhead. However, despite technical and procedural measures, instances of users abusing valuable HPC resources for personal gains have been documented in the past \cite{hpcbitmine}, and systems that are open to large numbers of loosely-verified users from around the world are at risk of abuse. In this paper, we show how electrical power consumption data from an HPC platform can be used to identify what programs are executed. The intuition is that during execution, programs exhibit various patterns of CPU and memory activity. These patterns are reflected in the power consumption of the system and can be used to identify programs running. We test our approach on an HPC rack at Lawrence Berkeley National Laboratory using a variety of scientific benchmarks. Among other interesting observations, our results show that by monitoring the power consumption of an HPC rack, it is possible to identify if particular programs are running with precision up to and recall of 95\% even in noisy scenarios

    Ad-Hoc File Systems At Extreme Scales

    Get PDF

    LOMARC: Look ahead matchmaking for multi-resource coscheduling.

    Get PDF
    Hyper-Threading (HT) provides a new possibility for job coscheduling without context switch and without the cost for coordinating processes of one parallel job. However, HT achieves high processor throughput at the expense of reducing the performance of the individual process. Since the hardware resources are actually shared between two coscheduled jobs, the resource contention will harm the performance of each job. Most scheduling approaches only focus on the CPU without considering the impact on other resources. In this thesis we present LOMARC, a space-time sharing approach that takes multiple resources, including CPU, I/O, memory and network, into consideration for job coscheduling on HT processors. To improve resource utilization and reduce job response times, LOMARC matches two jobs with complementary resource requirements to coschedule. Our approach partially reorders the waiting job queue by lookahead to increase the possibility of finding a good match. LOMARC also generalizes for standard CPUs, using an adjusted matching scheme and only focusing on hiding I/O latency. In addition, LOMARC incorporates standard scheduling approaches such as priority ordering, aging and backfilling. In our simulation experiment, we use a realistic workload model to provide the convincing results. Our experimental results demonstrate that LOMARC delivers better performance than the standard space sharing approach and the other two job coscheduling approaches for HT processors. The performance gain is mainly due to an increased possibility of coscheduling two complementary jobs by looking ahead on the waiting queue. Source: Masters Abstracts International, Volume: 43-01, page: 0239. Adviser: Angela Sodan. Thesis (M.Sc.)--University of Windsor (Canada), 2004
    • …
    corecore