58 research outputs found
A Bag-of-Tasks Scheduler Tolerant to Temporal Failures in Clouds
Cloud platforms have emerged as a prominent environment to execute high
performance computing (HPC) applications providing on-demand resources as well
as scalability. They usually offer different classes of Virtual Machines (VMs)
which ensure different guarantees in terms of availability and volatility,
provisioning the same resource through multiple pricing models. For instance,
in Amazon EC2 cloud, the user pays per hour for on-demand VMs while spot VMs
are unused instances available for lower price. Despite the monetary
advantages, a spot VM can be terminated, stopped, or hibernated by EC2 at any
moment.
Using both hibernation-prone spot VMs (for cost sake) and on-demand VMs, we
propose in this paper a static scheduling for HPC applications which are
composed by independent tasks (bag-of-task) with deadline constraints. However,
if a spot VM hibernates and it does not resume within a time which guarantees
the application's deadline, a temporal failure takes place. Our scheduling,
thus, aims at minimizing monetary costs of bag-of-tasks applications in EC2
cloud, respecting its deadline and avoiding temporal failures. To this end, our
algorithm statically creates two scheduling maps: (i) the first one contains,
for each task, its starting time and on which VM (i.e., an available spot or
on-demand VM with the current lowest price) the task should execute; (ii) the
second one contains, for each task allocated on a VM spot in the first map, its
starting time and on which on-demand VM it should be executed to meet the
application deadline in order to avoid temporal failures. The latter will be
used whenever the hibernation period of a spot VM exceeds a time limit.
Performance results from simulation with task execution traces, configuration
of Amazon EC2 VM classes, and VMs market history confirms the effectiveness of
our scheduling and that it tolerates temporal failures
On reducing the complexity of matrix clocks
Matrix clocks are a generalization of the notion of vector clocks that allows
the local representation of causal precedence to reach into an asynchronous
distributed computation's past with depth , where is an integer.
Maintaining matrix clocks correctly in a system of nodes requires that
everymessage be accompanied by numbers, which reflects an exponential
dependency of the complexity of matrix clocks upon the desired depth . We
introduce a novel type of matrix clock, one that requires only numbers to
be attached to each message while maintaining what for many applications may be
the most significant portion of the information that the original matrix clock
carries. In order to illustrate the new clock's applicability, we demonstrate
its use in the monitoring of certain resource-sharing computations
Multi-FedLS: a Framework for Cross-Silo Federated Learning Applications on Multi-Cloud Environments
Federated Learning (FL) is a distributed Machine Learning (ML) technique that
can benefit from cloud environments while preserving data privacy. We propose
Multi-FedLS, a framework that manages multi-cloud resources, reducing execution
time and financial costs of Cross-Silo Federated Learning applications by using
preemptible VMs, cheaper than on-demand ones but that can be revoked at any
time. Our framework encloses four modules: Pre-Scheduling, Initial Mapping,
Fault Tolerance, and Dynamic Scheduler. This paper extends our previous work
\cite{brum2022sbac} by formally describing the Multi-FedLS resource manager
framework and its modules. Experiments were conducted with three Cross-Silo FL
applications on CloudLab and a proof-of-concept confirms that Multi-FedLS can
be executed on a multi-cloud composed by AWS and GCP, two commercial cloud
providers. Results show that the problem of executing Cross-Silo FL
applications in multi-cloud environments with preemptible VMs can be
efficiently resolved using a mathematical formulation, fault tolerance
techniques, and a simple heuristic to choose a new VM in case of revocation.Comment: In review by Journal of Parallel and Distributed Computin
Evaluating Execution Times and Costs of a Federated Learning Application on different Cloud Providers
National audienceFederated Learning (FL) is a new area of distributed Machine Learning (ML) that emerged to deal with data privacy concerns. In FL, each client has access to a local private dataset. At every round, a client trains the model with its local dataset and sends the weights to a central server. The latter aggregates all client weights and then sends the final weights back to the clients. This approach is attractive in many domains as it allows multiple institutions to collaborate on an ML task without sharing their data. However, most ML models used in FL have millions of weights exchanged in each message. The messages sent between a client and the server can achieve gigabytes of size and are exchanged several times in the whole FL execution. This work presents a preliminary analysis of execution times and costs of a FL application in a multi-cloud scenario. Experiments were conducted considering executions on the Amazon Web Services, Google Cloud Provider, and also in both cloud providers at the same time
A Hibernation Aware Dynamic Scheduler for Cloud Environments
International audienceNowadays, cloud platforms usually offer several types of Virtual Machines (VMs) which have different guarantees in terms of availability and volatility, provisioning the same resource through multiple pricing models. For instance, in the Amazon EC2 cloud, the user pays per hour for on-demand VMs while spot VMs are unused instances available for a lower price. Despite the monetary advantages, a spot VM can be terminated or hibernated by EC2 at any moment. In this work, we propose the Hibernation-Aware Dynamic Scheduler (HADS), to schedule applications composed of independent tasks (bag-of-tasks) with deadline constraints in both hibernation-prone spot VMs (for cost sake) and on-demand VMs. We also consider the problem of temporal failures, that occurs when a spot VM hibernates, and does not resume within a time that guarantees the application's deadline. Our dynamic scheduling approach aims at minimizing the monetary costs of bag-of-tasks applications execution, respecting its deadline even in the presence of hibernation. It is also able to avoid temporal failures, by using task migration and work-stealing techniques. Experimental results with real executions using Amazon EC2 VMs confirm the effectiveness of our scheduling when compared with on-demand VM only based approaches, in terms of monetary costs and execution times. It is also shown that our strategy can tolerate temporal failures
- âŠ