147 research outputs found
HPC Cloud for Scientific and Business Applications: Taxonomy, Vision, and Research Challenges
High Performance Computing (HPC) clouds are becoming an alternative to
on-premise clusters for executing scientific applications and business
analytics services. Most research efforts in HPC cloud aim to understand the
cost-benefit of moving resource-intensive applications from on-premise
environments to public cloud platforms. Industry trends show hybrid
environments are the natural path to get the best of the on-premise and cloud
resources---steady (and sensitive) workloads can run on on-premise resources
and peak demand can leverage remote resources in a pay-as-you-go manner.
Nevertheless, there are plenty of questions to be answered in HPC cloud, which
range from how to extract the best performance from an unknown underlying
platform to what services are essential to make its usage easier. Moreover, the
discussion on the right pricing and contractual models to fit small and large
users is relevant for the sustainability of HPC clouds. This paper presents a
survey and taxonomy of efforts in HPC cloud and a vision of what we believe is
ahead of us, including a set of research challenges that, once tackled, can
help advance businesses and scientific discoveries. This becomes particularly
relevant due to the fast-increasing wave of new HPC applications coming from
big data and artificial intelligence.
Comment: 29 pages, 5 figures. Published in ACM Computing Surveys (CSUR).
A Study on Cloud Cost Efficiency by Exploiting Idle Billing Period Fractions
In most of the current commercial Clouds, resources
are billed based on a time interval equal to one hour,
as is the case of virtual machine (VM) instances on Amazon
EC2. Such a time interval is usually long, and yet the user has
to pay for the whole last hour, even if he/she has only used a
fraction of it, contradicting the pay-as-you-go model of Clouds.
In this paper, we analyse the advantages of adopting alternative
scheduling policies that exploit idle last time intervals,
in terms of service cost to Cloud users and operating costs
to Cloud providers. Using a real-life astronomy workflow
application, constrained by user-defined Deadline and Budget
quality of service (QoS) parameters, a set of online state-of-the-art-based
scheduling algorithms try different execution and
resource provisioning plans. Our results show that exploitation
of partially idle last time intervals can reduce the cost of service
to the end user and increase providers' competitiveness by up to
21.6% through energy efficiency improvement and consequent
lowering of operational costs.
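The billing arithmetic behind this abstract can be sketched as follows. This is a minimal illustration, not the paper's scheduling algorithms: it assumes runtimes are given in whole minutes and that every started billing interval is charged in full. The function names and the example price are hypothetical.

```python
import math


def billed_intervals(runtime_minutes, interval_minutes=60):
    """Number of whole billing intervals charged for a run (assumed round-up billing)."""
    return math.ceil(runtime_minutes / interval_minutes)


def idle_fraction(runtime_minutes, interval_minutes=60):
    """Fraction of the last billed interval that is paid for but left unused."""
    used = runtime_minutes % interval_minutes
    if used == 0:
        return 0.0  # the run ends exactly on an interval boundary
    return (interval_minutes - used) / interval_minutes


def cost(runtime_minutes, price_per_interval, interval_minutes=60):
    """Total charge when every started interval is billed in full."""
    return billed_intervals(runtime_minutes, interval_minutes) * price_per_interval


if __name__ == "__main__":
    # A 61-minute run is billed as two hours; most of the second hour is idle.
    print(billed_intervals(61), idle_fraction(61), cost(61, 0.10))
```

The idle fraction is the capacity that a scheduling policy of the kind described above could try to fill with other tasks before the interval expires.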
Executing Large Scale Scientific Workflows in Public Clouds
Scientists in different fields, such as high-energy physics, earth science, and astronomy, are developing large-scale workflow applications. In many use cases, scientists need to run a set of interrelated but independent workflows (i.e., workflow ensembles) for the entire scientific analysis. Because a workflow ensemble usually contains many sub-workflows, each with hundreds or thousands of jobs under precedence constraints, executing such an ensemble raises serious cost concerns even with elastic, pay-as-you-go cloud resources. In this thesis, we develop a set of methods to optimize the execution of large-scale scientific workflows in public clouds under both cost and deadline constraints, using a two-step approach. First, we present a set of methods to optimize the execution of scientific workflows in public clouds, with the Montage astronomical mosaic engine running on Amazon EC2 as an example. Second, we address three main challenges in realizing the benefits of public clouds when executing large-scale workflow ensembles: (1) execution coordination, (2) resource provisioning, and (3) data staging. To this end, we develop a new pulling-based workflow execution system with a profiling-based resource provisioning strategy. Our results show that our system can achieve an 80% speed-up, by removing scheduling overhead, compared to the well-known Pegasus workflow management system when running scientific workflow ensembles. In addition, our evaluation using Montage workflow ensembles on Amazon EC2 clusters of around 1000 cores demonstrates the efficacy of our resource provisioning strategy in terms of cost effectiveness within the deadline.
Scheduling Flexible Demand in Cloud Computing Spot Markets
The rapid standardization and specialization of cloud computing services have led to the development of cloud spot markets on which cloud service providers and customers can trade in near real-time. Frequent changes in demand and supply give rise to spot prices that vary throughout the day. Cloud customers often have temporal flexibility to execute their jobs before a specific deadline. In this paper, the authors apply real options analysis (ROA), which is an established valuation method designed to capture the flexibility of action under uncertainty. They adapt and compare multiple discrete-time approaches that enable cloud customers to quantify and exploit the monetary value of their short-term temporal flexibility. The paper contributes to the field by guaranteeing cloud job execution of variable-time requests in a single cloud spot market, whereas existing multi-market strategies may not fulfill requests when outbid. In a broad simulation of scenarios for the use of Amazon EC2 spot instances, the developed approaches exploit the existing savings potential up to 40 percent – a considerable extent. Moreover, the results demonstrate that ROA, which explicitly considers time-of-day-specific spot price patterns, outperforms traditional option pricing models and expectation optimization.
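The idea of trading temporal flexibility for a lower spot price can be illustrated with a deliberately naive baseline. This is not the real options analysis developed in the paper: it is a simple threshold rule, assumed here for illustration, that waits for a cheap slot but always starts at the latest feasible slot so the job still meets its deadline. The function names, the price series and the threshold are all hypothetical.

```python
def choose_start(prices, duration, threshold):
    """Pick a start slot for a job of `duration` consecutive slots.

    `prices` holds one spot price per slot up to the deadline. The rule only
    looks at the current slot's price (an online decision): start as soon as
    the price is at or below `threshold`, or at the latest feasible slot.
    """
    latest_start = len(prices) - duration
    if latest_start < 0:
        raise ValueError("deadline is too tight for the job duration")
    for t in range(latest_start + 1):
        # Forced start at the last feasible slot guarantees execution.
        if prices[t] <= threshold or t == latest_start:
            return t


def run_cost(prices, start, duration):
    """Total price paid when the job occupies slots start .. start+duration-1."""
    return sum(prices[start:start + duration])


if __name__ == "__main__":
    hourly_prices = [0.30, 0.25, 0.10, 0.12, 0.40]  # made-up values
    start = choose_start(hourly_prices, duration=2, threshold=0.15)
    print(start, run_cost(hourly_prices, start, 2))
```

With a threshold that is never reached, the rule falls back to the latest start and may pay a high price, which is the kind of outcome a valuation of the waiting option is meant to weigh in advance.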
Ad hoc cloud computing
Commercial and private cloud providers offer virtualized resources via a set of co-located
and dedicated hosts that are exclusively reserved for the purpose of offering
a cloud service. While both cloud models appeal to the mass market, there are many
cases where outsourcing to a remote platform or procuring an in-house infrastructure
may not be ideal or even possible.
To offer an attractive alternative, we introduce and develop an ad hoc cloud computing
platform to transform spare resource capacity from an infrastructure owner’s
locally available, but non-exclusive and unreliable infrastructure, into an overlay cloud
platform. The foundation of the ad hoc cloud relies on transferring and instantiating
lightweight virtual machines on-demand upon near-optimal hosts while virtual machine
checkpoints are distributed in a P2P fashion to other members of the ad hoc
cloud. Virtual machines found to be non-operational are restored elsewhere ensuring
the continuity of cloud jobs.
In this thesis we investigate the feasibility, reliability and performance of ad hoc
cloud computing infrastructures. We firstly show that the combination of both volunteer
computing and virtualization is the backbone of the ad hoc cloud. We outline the
process of virtualizing the volunteer system BOINC to create V-BOINC. V-BOINC
distributes virtual machines to volunteer hosts allowing volunteer applications to be
executed in the sandbox environment to solve many of the downfalls of BOINC; this
however also provides the basis for an ad hoc cloud computing platform to be developed.
We detail the challenges of transforming V-BOINC into an ad hoc cloud and outline
the transformational process and integrated extensions. These include a BOINC job
submission system, cloud job and virtual machine restoration schedulers and a periodic
P2P checkpoint distribution component. Furthermore, as current monitoring tools are
unable to cope with the dynamic nature of ad hoc clouds, a dynamic infrastructure
monitoring and management tool called the Cloudlet Control Monitoring System is
developed and presented.
We evaluate each of our individual contributions as well as the reliability, performance
and overheads associated with an ad hoc cloud deployed on a realistically
simulated unreliable infrastructure. We conclude that the ad hoc cloud is not only a
feasible concept but also a viable computational alternative that offers high levels of
reliability and can at least offer reasonable performance, which at times may exceed
the performance of a commercial cloud infrastructure.
Climbing Up Cloud Nine: Performance Enhancement Techniques for Cloud Computing Environments
With the transformation of cloud computing technologies from an attractive trend to a business reality, the need is more pressing than ever for efficient cloud service management tools and techniques. As cloud technologies continue to mature, the service model, resource allocation methodologies, energy efficiency models and general service management schemes are not yet saturated. The burden of making this all tick perfectly falls on cloud providers. Admittedly, revenues from economies of scale and the ability to leverage existing infrastructure and a giant workforce are positives, but operation is far from straightforward from that point. Performance and service delivery will still depend on the providers’ algorithms and policies, which affect all operational areas.
With that in mind, this thesis tackles a set of the more critical challenges faced by cloud providers with the purpose of enhancing cloud service performance and saving on providers’ cost. This is done by exploring innovative resource allocation techniques and developing novel tools and methodologies in the context of cloud resource management, power efficiency, high availability and solution evaluation.
Optimal and suboptimal solutions to the resource allocation problem in cloud data centers from both the computational and the network sides are proposed. Next, a deep dive into the energy efficiency challenge in cloud data centers is presented. Consolidation-based and non-consolidation-based solutions containing a novel dynamic virtual machine idleness prediction technique are proposed and evaluated. An investigation of the problem of simulating cloud environments follows. Available simulation solutions are comprehensively evaluated and a novel design framework for cloud simulators covering multiple variations of the problem is presented. Moreover, the challenge of evaluating the performance of cloud resource management solutions in terms of high availability is addressed. An extensive framework is introduced to design high availability-aware cloud simulators and a prominent cloud simulator (GreenCloud) is extended to implement it. Finally, the evaluation of real cloud application scenarios is demonstrated using the new tool.
The primary argument made in this thesis is that the proposed resource allocation and simulation techniques can serve as a basis for effective solutions that mitigate performance and cost challenges faced by cloud providers pertaining to resource utilization, energy efficiency, and client satisfaction.
Transiency-driven Resource Management for Cloud Computing Platforms
Modern distributed server applications are hosted on enterprise or cloud data centers that provide computing, storage, and networking capabilities to these applications. These applications are built using the implicit assumption that the underlying servers will be stable and normally available, barring occasional faults. In many emerging scenarios, however, data centers and clouds only provide transient, rather than continuous, availability of their servers. Transiency in modern distributed systems arises in many contexts, such as green data centers powered using renewable intermittent sources, and cloud platforms that provide lower-cost transient servers which can be unilaterally revoked by the cloud operator.
Transient computing resources are increasingly important, and existing fault-tolerance and resource management techniques are inadequate for transient servers because applications typically assume continuous resource availability. This thesis presents research in distributed systems design that treats transiency as a first-class design principle. I show that combining transiency-specific fault-tolerance mechanisms with resource management policies that suit application characteristics and requirements can yield significant cost and performance benefits. These mechanisms and policies have been implemented and prototyped as part of software systems, which allow a wide range of applications, such as interactive services and distributed data processing, to be deployed on transient servers, and can reduce cloud computing costs by up to 90%.
This thesis makes contributions to four areas of computer systems research: transiency-specific fault-tolerance, resource allocation, abstractions, and resource reclamation. For reducing the impact of transient server revocations, I develop two fault-tolerance techniques that are tailored to transient server characteristics and application requirements. For interactive applications, I build a derivative cloud platform that masks revocations by transparently moving application-state between servers of different types. Similarly, for distributed data processing applications, I investigate the use of application level periodic checkpointing to reduce the performance impact of server revocations. For managing and reducing the risk of server revocations, I investigate the use of server portfolios that allow transient resource allocation to be tailored to application requirements.
Finally, I investigate how resource providers (such as cloud platforms) can provide transient resource availability without revocation, by looking into alternative resource reclamation techniques. I develop resource deflation, wherein a server's resources are fractionally reclaimed, allowing the application to continue execution albeit with fewer resources. Resource deflation generalizes revocation, and the deflation mechanisms and cluster-wide policies can yield both high cluster utilization and low application performance degradation.