Metascheduling of HPC Jobs in Day-Ahead Electricity Markets
High performance grid computing is a key enabler of large scale collaborative
computational science. With the promise of exascale computing, high performance
grid systems are expected to incur electricity bills that grow super-linearly
over time. In order to achieve cost effectiveness in these systems, it is
essential for the scheduling algorithms to exploit electricity price
variations, both in space and time, that are prevalent in the dynamic
electricity price markets. In this paper, we present a metascheduling algorithm
to optimize the placement of jobs in a compute grid which consumes electricity
from the day-ahead wholesale market. We formulate the scheduling problem as a
Minimum Cost Maximum Flow problem and leverage queue waiting time and
electricity price predictions to accurately estimate the cost of job execution
at a system. Using trace based simulation with real and synthetic workload
traces, and real electricity price data sets, we demonstrate our approach on
two currently operational grids, XSEDE and NorduGrid. Our experimental setup
collectively constitutes more than 433K processors spread across 58 compute
systems in 17 geographically distributed locations. Experiments show that our
approach simultaneously optimizes the total electricity cost and the average
response time of the grid, without being unfair to users of the local batch
systems. (Appears in IEEE Transactions on Parallel and Distributed Systems.)
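The min-cost max-flow formulation can be sketched as follows. The graph, edge costs, and capacities below are illustrative stand-ins: the paper's actual cost model combines queue-wait and electricity-price predictions, which are not reproduced here.

```python
# Sketch of the min-cost max-flow formulation for job placement: jobs and
# compute systems become graph nodes, and each job->system edge carries a
# cost standing in for predicted queue wait plus predicted electricity price.
# All numbers are invented for illustration.

def min_cost_max_flow(n, edges, s, t):
    """Successive shortest augmenting paths, Bellman-Ford on the residual graph."""
    graph = [[] for _ in range(n)]
    for u, v, cap, cost in edges:
        graph[u].append([v, cap, cost, len(graph[v])])
        graph[v].append([u, 0, -cost, len(graph[u]) - 1])  # residual edge
    flow = total_cost = 0
    while True:
        dist = [float("inf")] * n
        dist[s] = 0
        parent = [None] * n
        for _ in range(n - 1):  # Bellman-Ford tolerates negative residual costs
            for u in range(n):
                if dist[u] == float("inf"):
                    continue
                for i, (v, cap, cost, _) in enumerate(graph[u]):
                    if cap > 0 and dist[u] + cost < dist[v]:
                        dist[v] = dist[u] + cost
                        parent[v] = (u, i)
        if dist[t] == float("inf"):
            return flow, total_cost
        push, v = float("inf"), t
        while v != s:                       # bottleneck capacity on the path
            u, i = parent[v]
            push = min(push, graph[u][i][1])
            v = u
        v = t
        while v != s:                       # augment along the path
            u, i = parent[v]
            graph[u][i][1] -= push
            graph[v][graph[u][i][3]][1] += push
            v = u
        flow += push
        total_cost += push * dist[t]

# Hypothetical instance: source 0, jobs 1-2, systems 3-4, sink 5.
edges = [
    [0, 1, 1, 0], [0, 2, 1, 0],    # source -> jobs (one unit per job)
    [1, 3, 1, 5], [1, 4, 1, 3],    # job A -> systems X, Y (predicted cost)
    [2, 3, 1, 2], [2, 4, 1, 4],    # job B -> systems X, Y
    [3, 5, 1, 0], [4, 5, 1, 0],    # systems -> sink (free slots)
]
flow, cost = min_cost_max_flow(6, edges, 0, 5)  # places both jobs at total cost 5
```

A maximum flow places every job, and the minimum-cost condition simultaneously picks the cheapest job-to-system assignment, mirroring how the metascheduler trades off wait time against electricity price.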
RLQ: Workload Allocation With Reinforcement Learning in Distributed Queues
Distributed workload queues are now widely used due to their significant advantages in terms of decoupling, resilience, and scaling. Task allocation to worker nodes in distributed queue systems is typically simplistic (e.g., Least Recently Used) or uses hand-crafted heuristics that require task-specific information (e.g., task resource demands or expected execution time). When such task information is not available and worker node capabilities are not homogeneous, the existing placement strategies may lead to unnecessarily long execution times and high usage costs. In this work, we formulate the task allocation problem in the Markov Decision Process framework, in which an agent assigns tasks to an available resource and receives a numerical reward signal upon task completion. Our adaptive, learning-based task allocation solution, Reinforcement Learning based Queues (RLQ), is implemented and integrated with the popular Celery task queuing system for Python. We compare RLQ against traditional solutions using both synthetic and real workload traces. On average, using synthetic workloads, RLQ reduces the execution cost by approximately 70%, the execution time by a factor of at least 3×, and the waiting time by almost 7×. Using real traces, we observe an improvement of about 20% in execution cost, around 70% in execution time, and a reduction of approximately 20× in waiting time. We also compare RLQ with a strategy inspired by E-PVM, a state-of-the-art solution used in Google's Borg cluster manager, and show that RLQ outperforms it in five out of six scenarios.
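The MDP framing can be illustrated with a minimal tabular sketch. This is not RLQ itself: the worker speeds, the reward (negative completion time), and the single-state Q-table are all invented for illustration.

```python
import random

# Minimal tabular RL sketch for task-to-worker allocation: the agent picks a
# worker (action), the task completes, and the reward is the negative
# completion time. Worker speeds and hyperparameters are illustrative.
random.seed(0)
WORKER_SPEED = [1.0, 3.0]          # heterogeneous workers: work units per second
EPSILON, ALPHA = 0.1, 0.5          # exploration rate, learning rate

q = [0.0, 0.0]                     # single state: one Q-value per worker

def choose_worker():
    if random.random() < EPSILON:                  # explore occasionally
        return random.randrange(len(q))
    return max(range(len(q)), key=lambda a: q[a])  # otherwise exploit

def run_task(work=6.0):
    a = choose_worker()
    completion_time = work / WORKER_SPEED[a]
    reward = -completion_time                      # faster => higher reward
    q[a] += ALPHA * (reward - q[a])                # bandit-style Q update
    return a

for _ in range(500):
    run_task()
best = max(range(len(q)), key=lambda a: q[a])      # learned preferred worker
```

After a few hundred tasks the agent learns to prefer the faster worker without ever being told the workers' capabilities, which is the property that lets this style of allocator cope with heterogeneous nodes.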
DESIGN AND EVALUATION OF RESOURCE ALLOCATION AND JOB SCHEDULING ALGORITHMS ON COMPUTATIONAL GRIDS
The grid, an infrastructure for resource sharing, has shown its importance in
many scientific applications requiring tremendously high computational power.
Grid computing enables sharing, selection and aggregation of resources for
solving complex and large-scale scientific problems. Grid computing, whose
resources are distributed, heterogeneous and dynamic in nature, introduces a
number of fascinating issues in resource management. Grid scheduling is the
key issue in a grid environment: the system must meet the functional
requirements of heterogeneous domains (user, application, and network), which
are sometimes conflicting in nature. Moreover, the system must satisfy
non-functional requirements like reliability, efficiency, performance,
effective resource utilization, and scalability. Thus, the overall aim of this
research is to introduce new grid scheduling algorithms for resource
allocation as well as for job scheduling, enabling highly efficient and
effective utilization of resources in executing various applications.
The four prime aspects of this work are: firstly, a model of the grid
scheduling problem for a dynamic grid computing environment; secondly, the
development of a new web-based simulator (SyedWSim), which enables grid users
to conduct a statistical analysis of grid workload traces and provides a
realistic basis for experimentation with resource allocation and job
scheduling algorithms on a grid; thirdly, the proposal of a new grid resource
allocation method of optimal computational cost, evaluated against other
allocation methods using synthetic and real workload traces; and finally, the
proposal of new job scheduling algorithms of optimal performance considering
parameters like waiting time, turnaround time, response time, bounded
slowdown, completion time and stretch time. The aim is not only to develop
new algorithms, but also to evaluate them on an experimental computational
grid, using synthetic and real workload traces, alongside the other existing
job scheduling algorithms. Experimental evaluation confirmed that the
proposed grid scheduling algorithms possess a high degree of optimality in
performance, efficiency and scalability.
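The per-job metrics listed above have standard definitions in the workload-trace literature; the sketch below computes them for one hypothetical job record. The 10-second bounded-slowdown threshold is a common convention in that literature, not a value taken from the thesis.

```python
# Illustrative per-job scheduling metrics, as commonly defined over workload
# traces. The threshold and the example job's timestamps are made up.
BSLD_THRESHOLD = 10.0  # seconds; guards bounded slowdown against tiny jobs

def job_metrics(submit, start, finish):
    wait = start - submit                  # queue waiting time
    run = finish - start                   # execution time
    turnaround = finish - submit           # completion (turnaround) time
    stretch = turnaround / run             # slowdown / stretch factor
    bounded_slowdown = max(1.0, turnaround / max(run, BSLD_THRESHOLD))
    return {"wait": wait, "turnaround": turnaround,
            "stretch": stretch, "bounded_slowdown": bounded_slowdown}

# Hypothetical job: submitted at t=0, waits 30 s, runs for 60 s.
m = job_metrics(submit=0.0, start=30.0, finish=90.0)
```

Averaging such per-job values over a whole trace is how scheduling algorithms are typically compared on the parameters the thesis lists.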
Predicting Intermediate Storage Performance for Workflow Applications
Configuring a storage system to better serve an application is a challenging
task complicated by a multidimensional, discrete configuration space and the
high cost of space exploration (e.g., by running the application with different
storage configurations). To enable selecting the best configuration in a
reasonable time, we design an end-to-end performance prediction mechanism that
estimates the turn-around time of an application using a storage system under
a given configuration. This approach focuses on a generic object-based storage
system design, supports exploring the impact of optimizations targeting
workflow applications (e.g., various data placement schemes) in addition to
other, more traditional, configuration knobs (e.g., stripe size or replication
level), and models the system operation at data-chunk and control message
level.
This paper presents our experience to date with designing and using this
prediction mechanism. We evaluate it using micro-benchmarks as well as
synthetic benchmarks mimicking real workflow applications, and a real
application. A preliminary evaluation shows that we are on track to meet our
objectives: the mechanism can scale to model a workflow application run on an
entire cluster while offering an over 200x speedup factor (normalized by
resource) compared to running the actual application, and can achieve, in the
limited number of scenarios we study, a prediction accuracy that enables
identifying the best storage system configuration.
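The configuration-space search that such a predictor enables can be sketched as follows. The stripe-size/replication grid and the toy cost model below are invented stand-ins; the paper's actual predictor operates at the data-chunk and control-message level.

```python
from itertools import product

# Sketch of exhaustive search over a discrete storage configuration space,
# ranking candidates by predicted turn-around time instead of running the
# application under each configuration. The predictor is a toy stand-in.
STRIPE_SIZES = [64, 128, 256, 512]   # KB (illustrative knob values)
REPLICATION = [1, 2, 3]              # replication level

def predict_turnaround(stripe_kb, replicas, data_mb=1024):
    io_time = data_mb / stripe_kb * 0.5   # toy: larger stripes -> fewer requests
    repl_time = replicas * 2.0            # toy: replication adds write work
    return io_time + repl_time

# Pick the configuration with the smallest predicted turn-around time.
best = min(product(STRIPE_SIZES, REPLICATION),
           key=lambda c: predict_turnaround(*c))
```

Because prediction is orders of magnitude cheaper than an actual run, the whole grid can be scored in negligible time, which is exactly what makes selecting the best configuration "in a reasonable time" feasible.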
A Big Data Analyzer for Large Trace Logs
The current generation of Internet-based services is typically hosted in large
data centers that take the form of warehouse-size structures housing tens of
thousands of servers. Continued availability of a modern data center is the
result of a complex orchestration among many internal and external actors
including computing hardware, multiple layers of intricate software, networking
and storage devices, electrical power and cooling plants. During the course of
their operation, many of these components produce large amounts of data in the
form of event and error logs that are essential not only for identifying and
resolving problems but also for improving data center efficiency and
management. Most of these activities would benefit significantly from data
analytics techniques to exploit hidden statistical patterns and correlations
that may be present in the data. The sheer volume of data to be analyzed makes
uncovering these correlations and patterns a challenging task. This paper
presents BiDAl, a prototype Java tool for log-data analysis that incorporates
several Big Data technologies in order to simplify the task of extracting
information from data traces produced by large clusters and server farms. BiDAl
provides the user with several analysis languages (SQL, R and Hadoop MapReduce)
and storage backends (HDFS and SQLite) that can be freely mixed and matched so
that a custom tool for a specific task can be easily constructed. BiDAl has a
modular architecture so that it can be extended with other backends and
analysis languages in the future. In this paper we present the design of BiDAl
and describe our experience using it to analyze publicly-available traces from
Google data clusters, with the goal of building a realistic model of a complex
data center. (26 pages, 10 figures.)
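The SQL-over-SQLite style of analysis that BiDAl supports can be illustrated in plain Python. The table schema and rows below are invented for the example and do not reproduce the Google trace format.

```python
import sqlite3

# Sketch of the SQLite-backend workflow: load trace rows into a table, then
# query aggregate statistics with SQL. Schema and data are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (job_id INTEGER, machine TEXT, cpu REAL)")
conn.executemany("INSERT INTO tasks VALUES (?, ?, ?)", [
    (1, "m1", 0.5), (1, "m2", 0.7), (2, "m1", 0.2),
])

# Average CPU demand per machine, the kind of statistic used to build a
# realistic data-center model from trace logs.
rows = conn.execute(
    "SELECT machine, AVG(cpu) FROM tasks GROUP BY machine ORDER BY machine"
).fetchall()
```

In BiDAl the same query could instead be dispatched to R or Hadoop MapReduce over HDFS; mixing backends per task is the point of its modular design.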