61 research outputs found
Improving Resource Management in Virtualized Data Centers using Application Performance Models
The rapid growth of virtualized data centers and cloud hosting services is making the management of physical resources such as CPU, memory, and I/O bandwidth in data center servers increasingly important. Server management now involves dealing with multiple dissimilar applications with varying Service-Level-Agreements (SLAs) and multiple resource dimensions. The multiplicity and diversity of resources and applications are rendering administrative tasks more complex and challenging. This thesis aimed to develop a framework and techniques that would help substantially reduce data center management complexity.
We specifically addressed two crucial data center operations. First, we precisely estimated the capacity requirements of client virtual machines (VMs) when renting server space in a cloud environment. Second, we proposed a systematic process to efficiently allocate physical resources to hosted VMs in a data center. To realize these dual objectives, accurately capturing the effects of resource allocations on application performance is vital. The benefits of accurate application performance modeling are manifold. Cloud users can size their VMs appropriately and pay only for the resources that they need; service providers can also offer a new charging model based on the VMs' performance instead of their configured sizes. As a result, clients pay exactly for the performance they actually experience, while administrators can maximize their total revenue by utilizing application performance models and SLAs.
This thesis made the following contributions. First, we identified resource control parameters crucial for distributing physical resources and characterizing contention for virtualized applications in a shared hosting environment. Second, we explored several modeling techniques and confirmed the suitability of two machine learning tools, Artificial Neural Networks and Support Vector Machines, to accurately model the performance of virtualized applications. Moreover, we suggested and evaluated modeling optimizations necessary to improve prediction accuracy when using these modeling tools. Third, we presented an approach to optimal VM sizing by employing the performance models we created. Finally, we proposed a revenue-driven resource allocation algorithm which maximizes the SLA-generated revenue for a data center.
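To make the idea of revenue-driven allocation concrete, the following is a minimal sketch and not the algorithm from the thesis. It assumes each VM comes with a function that maps allocated resource units to predicted SLA revenue (a stand-in for a learned performance model combined with an SLA payout), and it uses a simple greedy rule that is only guaranteed to be optimal when those functions have diminishing returns. The names `greedy_allocate`, `web_vm` and `batch_vm` are hypothetical.

```python
from typing import Callable, Dict


def greedy_allocate(
    revenue_models: Dict[str, Callable[[int], float]],
    total_units: int,
) -> Dict[str, int]:
    """Hand out resource units one at a time to the VM whose predicted
    SLA revenue increases the most from one extra unit."""
    allocation = {vm: 0 for vm in revenue_models}
    for _ in range(total_units):
        # Marginal revenue of giving one more unit to each VM.
        gains = {
            vm: model(allocation[vm] + 1) - model(allocation[vm])
            for vm, model in revenue_models.items()
        }
        best_vm = max(gains, key=gains.get)
        if gains[best_vm] <= 0:
            break  # no VM earns more from extra resources
        allocation[best_vm] += 1
    return allocation


# Hypothetical revenue curves; a real system would derive these from
# trained performance models and the SLA terms.
def web_vm(units: int) -> float:
    return 10.0 * min(units, 4)  # revenue saturates at 4 units


def batch_vm(units: int) -> float:
    return 3.0 * min(units, 10)  # lower payout, saturates at 10 units


result = greedy_allocate({"web": web_vm, "batch": batch_vm}, total_units=8)
print(result)
```

In this toy setup the first four units go to the web VM because each is worth more there, and the remaining units go to the batch VM once the web VM's revenue stops improving.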
Models, methods, and tools for developing MMOG backends on commodity clouds
Online multiplayer games have grown to unprecedented scales, attracting millions of players
worldwide. The revenue from this industry has already eclipsed well-established entertainment
industries like music and films and is expected to continue its rapid growth in the future.
Massively Multiplayer Online Games (MMOGs) have also been extensively used in research
studies and education, further motivating the need to improve their development process.
The development of resource-intensive, distributed, real-time applications like MMOG backends
involves a variety of challenges. Past research has primarily focused on the development and
deployment of MMOG backends on dedicated infrastructures such as on-premise data centers
and private clouds, which provide more flexibility but are expensive and hard to set up and
maintain. A limited set of works has also focused on utilizing the Infrastructure-as-a-Service
(IaaS) layer of public clouds to deploy MMOG backends. These clouds can offer various advantages
like a lower barrier to entry, a larger set of resources, etc. but lack resource elasticity,
standardization, and focus on development effort, from which MMOG backends can greatly
benefit.
Meanwhile, other research has also focused on solving various problems related to consistency,
performance, and scalability. Despite major advancements in these areas, there is no standardized
development methodology to facilitate these features and assimilate the development of
MMOG backends on commodity clouds. This thesis is motivated by the results of a systematic
mapping study that identifies a gap in research, evident from the fact that only a handful
of studies have explored the possibility of utilizing serverless environments within commodity
clouds to host these types of backends. These studies are mostly vision papers and do
not provide any novel contributions in terms of methods of development or detailed analyses
of how such systems could be developed. Using the knowledge gathered from this mapping
study, several hypotheses are proposed and a set of technical challenges is identified, guiding
the development of a new methodology.
The peculiarities of MMOG backends have so far constrained their development and deployment
on commodity clouds despite rapid advancements in technology. To explore whether such
environments are viable options, a feasibility study is conducted with a minimalistic MMOG
prototype to evaluate a limited set of public clouds in terms of hosting MMOG backends. Following
encouraging results from this study, this thesis first motivates and then presents
a set of models, methods, and tools with which scalable MMOG backends can be developed
for and deployed on commodity clouds. These are encapsulated into a software development
framework called Athlos which allows software engineers to leverage the proposed development
methodology to rapidly create MMOG backend prototypes that utilize the resources of
these clouds to attain scalable states and runtimes. The proposed approach is based on a dynamic
model which aims to abstract the data requirements and relationships of many types of
MMOGs. Based on this model, several methods are outlined that aim to solve various problems
and challenges related to the development of MMOG backends, mainly in terms of performance
and scalability. Using a modular software architecture, and standardization in common development
areas, the proposed framework aims to improve and expedite the development process
leading to higher-quality MMOG backends and a lower time to market. The models and methods
proposed in this approach can be utilized through various tools during the development
lifecycle.
The proposed development framework is evaluated qualitatively and quantitatively. The thesis
presents three case study MMOG backend prototypes that validate the suitability of the proposed
approach. These case studies also provide a proof of concept and are subsequently used
to further evaluate the framework. The propositions in this thesis are assessed with respect to
the performance, scalability, development effort, and code maintainability of MMOG backends
developed using the Athlos framework, using a variety of methods such as small and large-scale
simulations and more targeted experimental setups. The results of these experiments uncover
useful information about the behavior of MMOG backends. In addition, they provide evidence
that MMOG backends developed using the proposed methodology and hosted on serverless
environments can: (a) support a very high number of simultaneous players under a given latency
threshold, (b) elastically scale both in terms of processing power and memory capacity
and (c) significantly reduce the amount of development effort. The results also show that this
methodology can accelerate the development of high-performance, distributed, real-time applications
like MMOG backends, while also exposing the limitations of Athlos in terms of code
maintainability.
Finally, the thesis provides a reflection on the research objectives, considerations on the hypotheses
and technical challenges, and outlines plans for future work in this domain.
Data-Driven Methods for Data Center Operations Support
During the last decade, cloud technologies have been evolving at
an impressive pace, such that we are now living in a cloud-native
era where developers can leverage an unprecedented landscape
of (possibly managed) services for orchestration, compute, storage,
load-balancing, monitoring, etc. The possibility to have on-demand
access to a diverse set of configurable virtualized resources allows
for building more elastic, flexible and highly-resilient distributed
applications. Behind the scenes, cloud providers sustain the heavy
burden of maintaining the underlying infrastructures, consisting of
large-scale distributed systems, partitioned and replicated among
many geographically dispersed data centers to guarantee scalability,
robustness to failures, high availability and low latency. The larger the
scale, the more cloud providers have to deal with complex interactions
among the various components, such that monitoring, diagnosing and
troubleshooting issues become incredibly daunting tasks.
To keep up with these challenges, development and operations
practices have undergone significant transformations, especially in
terms of improving the automations that make releasing new software,
and responding to unforeseen issues, faster and sustainable at scale.
The resulting paradigm is nowadays referred to as DevOps. However,
while such automations can be very sophisticated, traditional DevOps
practices fundamentally rely on reactive mechanisms that typically
require careful manual tuning and supervision from human experts.
To minimize the risk of outages—and the related costs—it is crucial to
provide DevOps teams with suitable tools that can enable a proactive
approach to data center operations.
This work presents a comprehensive data-driven framework to address
the most relevant problems that can be experienced in large-scale
distributed cloud infrastructures. These environments are indeed characterized
by a very large availability of diverse data, collected at each
level of the stack, such as: time-series (e.g., physical host measurements,
virtual machine or container metrics, networking components
logs, application KPIs); graphs (e.g., network topologies, fault graphs
reporting dependencies among hardware and software components,
performance issues propagation networks); and text (e.g., source code,
system logs, version control system history, code review feedback).
Such data are also typically updated with relatively high frequency,
and subject to distribution drifts caused by continuous configuration
changes to the underlying infrastructure. In such a highly dynamic scenario,
traditional model-driven approaches alone may be inadequate
at capturing the complexity of the interactions among system components. DevOps teams would certainly benefit from having robust
data-driven methods to support their decisions based on historical
information. For instance, effective anomaly detection capabilities may
also help in conducting more precise and efficient root-cause analysis.
Also, leveraging accurate forecasting and intelligent control
strategies would improve resource management.
Given their ability to deal with high-dimensional, complex data,
Deep Learning-based methods are the most straightforward option for
the realization of the aforementioned support tools. On the other hand,
because of their complexity, models of this kind often require huge
processing power, and suitable hardware, to be operated effectively
at scale. These aspects must be carefully addressed when applying
such methods in the context of data center operations. Automated
operations approaches must be dependable and cost-efficient, so as not to
degrade the services they are built to improve.
Cost-effective resource management for distributed computing
Current distributed computing and resource management infrastructures (e.g., Cluster and Grid) suffer
from a wide variety of problems related to resource management, which include scalability bottleneck,
resource allocation delay, limited quality-of-service (QoS) support, and lack of cost-aware and service
level agreement (SLA) mechanisms.
This thesis addresses these issues by presenting a cost-effective resource management solution
which introduces the possibility of managing geographically distributed resources in resource units that
are under the control of a Virtual Authority (VA). A VA is a collection of resources controlled, but not
necessarily owned, by a group of users or an authority representing a group of users. It leverages the
fact that different resources in disparate locations will have varying usage levels. By creating smaller
divisions of resources called VAs, users would be given the opportunity to choose between a variety of
cost models, and each VA could rent resources from resource providers when necessary, or could potentially
rent out its own resources when underloaded. The resource management is simplified since the
user and owner of a resource recognize only the VA because all permissions and charges are associated
directly with the VA. The VA is controlled by a ‘rental’ policy which is supported by a pool of resources
that the system may rent from external resource providers. As far as scheduling is concerned, the VA is
independent from competitors and can instead concentrate on managing its own resources. As a result,
the VA offers scalable resource management with minimal infrastructure and operating costs.
We demonstrate the feasibility of the VA through both a practical implementation of the prototype
system and an illustration of its quantitative advantages through the use of extensive simulations. First,
the VA concept is demonstrated through a practical implementation of the prototype system. Further, we
perform a cost-benefit analysis of current distributed resource infrastructures to demonstrate the potential
cost benefit of such a VA system. We then propose a costing model for evaluating the cost effectiveness
of the VA approach by using an economic approach that captures revenues generated from applications
and expenses incurred from renting resources. Based on our costing methodology, we present rental
policies that can potentially offer effective mechanisms for running distributed and parallel applications
without a heavy upfront investment and without the cost of maintaining idle resources. By using real
workload trace data, we test the effectiveness of our proposed rental approaches.
Finally, we propose an extension to the VA framework that promotes long-term negotiations and
rentals based on service level agreements or long-term contracts. Based on the extended framework,
we present new SLA-aware policies and evaluate them using real workload traces to demonstrate their effectiveness in improving rental decisions.
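The rental idea described above (rent extra resources when demand exceeds what the VA controls, rent out idle resources when underloaded, and judge the policy by revenue minus expenses) can be illustrated with a small sketch. This is only an illustration under stated assumptions, not the costing model or policies from the thesis: the class `RentalPolicy`, its fields, the prices and the three-step demand trace are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RentalPolicy:
    owned_nodes: int       # nodes the VA controls permanently
    rent_in_price: float   # assumed cost per rented node per time step
    rent_out_price: float  # assumed income per lent node per time step
    job_revenue: float     # assumed revenue per node-step of served demand

    def step(self, demand: int) -> float:
        """Return the profit of one time step for a given node demand."""
        rented_in = max(0, demand - self.owned_nodes)   # shortfall is rented
        rented_out = max(0, self.owned_nodes - demand)  # idle nodes are lent
        revenue = demand * self.job_revenue + rented_out * self.rent_out_price
        expenses = rented_in * self.rent_in_price
        return revenue - expenses


def total_profit(policy: RentalPolicy, trace: Iterable[int]) -> float:
    """Sum the per-step profit over a demand trace."""
    return sum(policy.step(demand) for demand in trace)


policy = RentalPolicy(owned_nodes=10, rent_in_price=2.0,
                      rent_out_price=1.0, job_revenue=3.0)
trace = [4, 10, 16]  # underloaded, exactly loaded, overloaded
profit = total_profit(policy, trace)
print(profit)
```

Replacing the toy trace with a real workload trace and comparing several parameter settings would mirror, in spirit, the trace-driven evaluation the abstract mentions.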
Resource discovery for systems of arbitrary scale
Doctorate in Informatics
Large-scale distributed computing technologies such as Cloud, Grid, Cluster
and HPC supercomputers are progressing along with the revolutionary emergence
of many-core designs (e.g. GPU, CPUs on single die, supercomputers
on chip, etc.) and significant advances in networking and interconnect solutions.
In future, computing nodes with thousands of cores may be connected
together to form a single transparent computing unit which hides from applications
the complexity and distributed nature of these many core systems. In
order to efficiently benefit from all the potential resources in such large scale
many-core-enabled computing environments, resource discovery is the vital
building block to maximally exploit the capabilities of all distributed heterogeneous
resources through precisely recognizing and locating those resources
in the system. Efficient and scalable resource discovery is challenging for
such future systems where the resources and the underlying computation and
communication infrastructures are highly-dynamic, highly-hierarchical and
highly-heterogeneous. In this thesis, we investigate the problem of resource
discovery with respect to the general requirements of arbitrary scale future
many-core-enabled computing environments. The main contribution of this
thesis is to propose Hybrid Adaptive Resource Discovery (HARD), a novel
efficient and highly scalable resource-discovery approach which is built upon
a virtual hierarchical overlay based on self-organization and self-adaptation
of processing resources in the system, where the computing resources are
organized into distributed hierarchies according to a proposed hierarchical
multi-layered resource description model. Operationally, at each layer, it
consists of a peer-to-peer architecture of modules that, by interacting with
each other, provide a global view of the resource availability in a large,
dynamic and heterogeneous distributed environment. The proposed resource
discovery model provides the adaptability and flexibility to perform complex
querying by supporting a set of significant querying features (such as
multi-dimensional, range and aggregate querying) while supporting exact
and partial matching, both for static and dynamic object contents. The
simulation shows that HARD can be applied to arbitrary scales of dynamicity,
both in terms of complexity and of scale, positioning this proposal as a
proper architecture for future many-core systems. We also propose
a novel resource management scheme for future systems that can
efficiently utilize distributed resources in a fully decentralized fashion.
Moreover, leveraging discovery components (RR-RPs) enables our resource
management platform to dynamically find and allocate available resources
that guarantee the QoS parameters on demand.
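The query features named above (multi-dimensional and range querying with exact and partial matching) can be sketched as follows. This is a flat, in-memory illustration of the query semantics only; it does not model HARD's hierarchical overlay or its peer-to-peer modules, and the attribute names, node ids and functions `matches` and `discover` are hypothetical.

```python
from typing import Dict, List


def matches(resource: Dict[str, object], query: Dict[str, object]) -> int:
    """Count how many query attributes the resource satisfies."""
    satisfied = 0
    for attr, wanted in query.items():
        value = resource.get(attr)
        if value is None:
            continue
        if isinstance(wanted, tuple):  # range constraint (low, high), inclusive
            low, high = wanted
            if low <= value <= high:
                satisfied += 1
        elif value == wanted:          # exact-value constraint
            satisfied += 1
    return satisfied


def discover(resources: List[Dict[str, object]], query: Dict[str, object],
             partial: bool = False) -> List[str]:
    """Return ids of resources meeting every constraint (exact matching)
    or at least one constraint (partial matching), best matches first."""
    needed = 1 if partial else len(query)
    scored = [(matches(r, query), r["id"]) for r in resources]
    hits = [(score, rid) for score, rid in scored if score >= needed]
    return [rid for score, rid in sorted(hits, key=lambda h: -h[0])]


nodes = [
    {"id": "n1", "cores": 64, "mem_gb": 256, "arch": "x86_64"},
    {"id": "n2", "cores": 8, "mem_gb": 512, "arch": "x86_64"},
    {"id": "n3", "cores": 128, "mem_gb": 64, "arch": "arm64"},
]
query = {"cores": (32, 256), "mem_gb": (128, 1024), "arch": "x86_64"}

exact_hits = discover(nodes, query)
partial_hits = discover(nodes, query, partial=True)
print(exact_hits, partial_hits)
```

Ranking partial matches by the number of satisfied constraints is one simple design choice; a real discovery system would likely weight attributes and distribute the lookup across its overlay.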
Virtual machine scheduling in dedicated computing clusters
Time-critical applications process a continuous stream of input data and have to meet specific timing constraints. A common approach to ensure that such an application satisfies its constraints is over-provisioning: the application is deployed in a dedicated cluster environment with enough processing power to achieve the target performance for every specified data input rate. This approach comes with a drawback: at times of decreased data input rates, the cluster resources are not fully utilized. A typical use case is the HLT-Chain application that processes physics data at runtime of the ALICE experiment at CERN. From a perspective of cost and efficiency, it is desirable to exploit temporarily unused cluster resources. Existing approaches aim for that goal by running additional applications. These approaches, however, a) lack the flexibility to dynamically grant the time-critical application the resources it needs, b) are insufficient for isolating the time-critical application from harmful side effects introduced by additional applications, or c) are not general because application-specific interfaces are used. In this thesis, a software framework is presented that allows unused resources in a dedicated cluster to be exploited without harming a time-critical application. Additional applications are hosted in Virtual Machines (VMs), and unused cluster resources are allocated to these VMs at runtime. In order to avoid resource bottlenecks, the resource usage of VMs is dynamically modified according to the needs of the time-critical application. For this purpose, a number of methods that have not previously been combined are used. On a global level, appropriate VM manipulations like hot migration, suspend/resume and start/stop are determined by an informed search heuristic and applied at runtime. Locally on cluster nodes, a feedback-controlled adaptation of VM resource usage is carried out in a decentralized manner.
Employing this framework makes it possible to increase a cluster’s usage by running additional applications while preventing negative impact on a time-critical application. This capability of the framework is shown for the HLT-Chain application: in an empirical evaluation, the cluster CPU usage is increased from 49% to 79%, additional results are computed, and no negative effects on the HLT-Chain application are observed.
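As an illustration of what a node-local, feedback-controlled adjustment of VM resource usage could look like, here is a minimal proportional-control sketch. It is an assumption-laden example and not the controller from the thesis: the function `adjust_vm_cap`, the headroom target and the gain are hypothetical, and CPU shares are expressed as fractions of one node.

```python
def adjust_vm_cap(current_cap: float, critical_usage: float,
                  target_headroom: float = 0.2, gain: float = 0.5) -> float:
    """One control step: return the new CPU share cap (0..1) for guest VMs.

    critical_usage is the CPU share currently used by the time-critical
    application on this node; target_headroom is the share kept free for it.
    """
    # Share that guest VMs may use while still leaving the headroom free.
    allowed = 1.0 - critical_usage - target_headroom
    error = allowed - current_cap
    # Move a fraction of the way towards the allowed share, then clamp.
    new_cap = current_cap + gain * error
    return min(1.0, max(0.0, new_cap))


# The time-critical load rises to 70% of the node: the VM cap shrinks
# step by step towards the 10% that remains after the headroom.
cap = 0.5
for _ in range(20):
    cap = adjust_vm_cap(cap, critical_usage=0.7)
print(round(cap, 3))
```

A gain below 1 makes the cap approach its target gradually instead of jumping, which trades reaction speed for stability; the right balance depends on how quickly the time-critical load can change.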
A review of the Siyakhula Living Lab’s network solution for Internet in marginalized communities
Changes within Information and Communication Technology (ICT) over the past decade required a review of the network layer component deployed in the Siyakhula Living Lab (SLL), a long-term joint venture between the Telkom Centres of Excellence hosted at University of Fort Hare and Rhodes University in South Africa. The SLL’s overall solution for sustainable internet in poor communities consists of three main components: the computing infrastructure layer, the network layer, and the e-services layer. At the core of the network layer is the concept of the broadband island (BI), a high-speed local area network realized through easy-to-deploy wireless technologies that establish point-to-multipoint connections among schools within a limited geographical area. Schools within the broadband island then become Digital Access Nodes (DANs), with computing infrastructure that provides access to the network. The review reported in this thesis aimed at determining whether the model for the network layer was still able to meet the needs of marginalized communities in South Africa, given the recent changes in ICT. The research work used the living lab methodology, a grassroots, user-driven approach that emphasizes co-creation between the beneficiaries and external entities (researchers, industry partners and the government), to run viability tests on the solution for the network component. The viability tests included lab and field experiments to produce the qualitative and quantitative data needed to propose an updated blueprint. The review found that the network topology used in the SLL’s network, the BI, is still viable, while WiMAX is now outdated. Also, the in-network web cache, Squid, is no longer effective, given the switch to HTTPS and the pervasive presence of advertising. The solution to the first issue is outdoor Wi-Fi, a proven solution easily deployable in grassroots fashion.
The second issue can be mitigated by leveraging Squid’s ‘bumping’ and splicing features; deploying a browser extension to make picture download optional; and using Pi-hole, a DNS sinkhole. Hopefully, the revised solution could become a component of the South African Government’s broadband plan, “SA Connect”.
Thesis (MSc) -- Faculty of Science, Computer Science, 202