1,866 research outputs found
HPC Cloud for Scientific and Business Applications: Taxonomy, Vision, and Research Challenges
High Performance Computing (HPC) clouds are becoming an alternative to
on-premise clusters for executing scientific applications and business
analytics services. Most research efforts in HPC cloud aim to understand the
cost-benefit of moving resource-intensive applications from on-premise
environments to public cloud platforms. Industry trends show hybrid
environments are the natural path to get the best of the on-premise and cloud
resources---steady (and sensitive) workloads can run on on-premise resources
and peak demand can leverage remote resources in a pay-as-you-go manner.
Nevertheless, there are plenty of questions to be answered in HPC cloud, which
range from how to extract the best performance of an unknown underlying
platform to what services are essential to make its usage easier. Moreover, the
discussion on the right pricing and contractual models to fit small and large
users is relevant for the sustainability of HPC clouds. This paper brings a
survey and taxonomy of efforts in HPC cloud and a vision on what we believe is
ahead of us, including a set of research challenges that, once tackled, can
help advance businesses and scientific discoveries. This becomes particularly
relevant due to the fast increasing wave of new HPC applications coming from
big data and artificial intelligence.Comment: 29 pages, 5 figures, Published in ACM Computing Surveys (CSUR
A Framework for Approximate Optimization of BoT Application Deployment in Hybrid Cloud Environment
We adopt a systematic approach to investigate the efficiency of near-optimal deployment of large-scale CPU-intensive Bag-of-Task applications running on cloud resources with the non-proportional cost to performance ratios. Our analytical solutions perform in both known and unknown running time of the given application. It tries to optimize users' utility by choosing the most desirable tradeoff between the make-span and the total incurred expense. We propose a schema to provide a near-optimal deployment of BoT application regarding users' preferences. Our approach is to provide user with a set of Pareto-optimal solutions, and then she may select one of the possible scheduling points based on her internal utility function. Our framework can cope with uncertainty in the tasks' execution time using two methods, too. First, an estimation method based on a Monte Carlo sampling called AA algorithm is presented. It uses the minimum possible number of sampling to predict the average task running time. Second, assuming that we have access to some code analyzer, code profiling or estimation tools, a hybrid method to evaluate the accuracy of each estimation tool in certain interval times for improving resource allocation decision has been presented. We propose approximate deployment strategies that run on hybrid cloud. In essence, proposed strategies first determine either an estimated or an exact optimal schema based on the information provided from users' side and environmental parameters. Then, we exploit dynamic methods to assign tasks to resources to reach an optimal schema as close as possible by using two methods. A fast yet simple method based on First Fit Decreasing algorithm, and a more complex approach based on the approximation solution of the transformed problem into a subset sum problem. Extensive experiment results conducted on a hybrid cloud platform confirm that our framework can deliver a near optimal solution respecting user's utility function
Recommended from our members
Trust Model for Optimized Cloud Services
Cloud computing with its inherent advantages draws attention for business critical applications, but concurrently expects high level of trust in cloud service providers. Reputation-based trust is emerging as a good choice to model trust of cloud service providers based on available evidence. Many existing reputation based systems either ignore or give less importance to uncertainty linked with the evidence. In this paper, we propose an uncertainty model and define our approach to compute opinion for cloud service providers. Using subjective logic operators along with the computed opinion values, we propose mechanisms to calculate the reputation of cloud service providers. We evaluate and compare our proposed model with existing reputation models
Atlas: Hybrid Cloud Migration Advisor for Interactive Microservices
Hybrid cloud provides an attractive solution to microservices for better
resource elasticity. A subset of application components can be offloaded from
the on-premises cluster to the cloud, where they can readily access additional
resources. However, the selection of this subset is challenging because of the
large number of possible combinations. A poor choice degrades the application
performance, disrupts the critical services, and increases the cost to the
extent of making the use of hybrid cloud unviable. This paper presents Atlas, a
hybrid cloud migration advisor. Atlas uses a data-driven approach to learn how
each user-facing API utilizes different components and their network footprints
to drive the migration decision. It learns to accelerate the discovery of
high-quality migration plans from millions and offers recommendations with
customizable trade-offs among three quality indicators: end-to-end latency of
user-facing APIs representing application performance, service availability,
and cloud hosting costs. Atlas continuously monitors the application even after
the migration for proactive recommendations. Our evaluation shows that Atlas
can achieve 21% better API performance (latency) and 11% cheaper cost with less
service disruption than widely used solutions.Comment: To appear at EuroSys 202
Towards Autonomous and Efficient Machine Learning Systems
Computation-intensive machine learning (ML) applications are becoming some of the most popular workloads running atop cloud infrastructure. While training ML applications, practitioners face the challenge of tuning various system-level parameters, such as the number of training nodes, communication topology during training, instance type, and the number of serving nodes, to meet the SLO requirements for bursty workload during the inference. Similarly, efficient resource utilization is another key challenge in cloud computing. This dissertation proposes high-performing and efficient ML systems to speed up training time and inference tasks while enabling automated and robust system management.To train an ML model in a distributed fashion we focus on strategies to mitigate the resource provisioning overhead and improve the training speed without impacting the model accuracy. More specifically, a system for autonomic and adaptive scheduling is built atop serverless computing that dynamically optimizes deployment and resource scaling for ML training tasks for cost-effectiveness and fast training. Similarly, a dynamic client selection framework is developed to address the stragglers problem caused by resource heterogeneity, data quality, and data quantity in a privacy-preserving Federated Learning (FL) environment without impacting the model accuracy.For serving bursty ML workloads we focus on developing highly scalable and adaptive strategies to serve the dynamically changing workload in a cost-effective manner in an autonomic fashion. We develop a framework that optimizes batching parameters on the fly using a lightweight profiler and an analytical model. We also devise strategies for serving ML workloads of varying sizes, leading to non-deterministic service time in a cost-effective manner. More specifically, we develop an SLO-aware framework that first analyzes the request size variations and workload variation to estimate the number of serving functions and intelligently route requests to multiple serving functions. Finally, resource utilization of burstable instances is optimized to benefit the cloud provider and end-user through a careful orchestration of resources (i.e., CPU, network, and I/O) using an analytical model and lightweight profiling, while complying with a user-defined SLO
Cloud Cost Optimization: A Comprehensive Review of Strategies and Case Studies
Cloud computing has revolutionized the way organizations manage their IT
infrastructure, but it has also introduced new challenges, such as managing
cloud costs. This paper explores various techniques for cloud cost
optimization, including cloud pricing, analysis, and strategies for resource
allocation. Real-world case studies of these techniques are presented, along
with a discussion of their effectiveness and key takeaways. The analysis
conducted in this paper reveals that organizations can achieve significant cost
savings by adopting cloud cost optimization techniques. Additionally, future
research directions are proposed to advance the state of the art in this
important field
- …