35,571 research outputs found
Modeling Conceptual Characteristics of Virtual Machines for CPU Utilization Prediction
Cloud services have grown rapidly in recent years, providing high
flexibility for cloud users to fulfill their computing requirements on demand.
To allocate computing resources in the cloud wisely, it is critically important
for cloud service providers to be aware of the potential utilization of various
resources in the future. This paper focuses on predicting CPU utilization of
virtual machines (VMs) in the cloud. We conduct empirical analysis on Microsoft
Azure's VM workloads and identify important conceptual characteristics of CPU
utilization among VMs, including locality, periodicity and tendency. We propose
a neural network method, named Time-aware Residual Networks (T-ResNet), to
model the observed conceptual characteristics with expanded network depth for
CPU utilization prediction. We conduct extensive experiments to evaluate the
effectiveness of our proposed method, and the results show that T-ResNet
consistently outperforms baseline approaches on various metrics, including
RMSE, MAE, and MAPE.
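The three conceptual characteristics the paper names can be illustrated as three input windows built from a CPU-utilization series; this is a hypothetical sketch of such input construction, not T-ResNet's actual architecture, and the window size and 5-minute sampling period (288 steps per day) are assumptions.

```python
def feature_windows(series, t, w=6, period=288):
    """Build three input windows from a CPU-utilization series at step t.

    Hypothetical illustration of the locality/periodicity/tendency
    characteristics; window size w and the 288-steps-per-day period are
    assumptions, not taken from the paper.
    """
    # Locality: the w most recent readings.
    local = series[t - w:t]
    # Periodicity: readings at the same time of day over the past w days.
    periodic = [series[t - k * period] for k in range(w, 0, -1)]
    # Tendency: daily mean utilization over the past w days (long-term trend).
    trend = [sum(series[t - k * period:t - (k - 1) * period]) / period
             for k in range(w, 0, -1)]
    return local, periodic, trend
```

A model such as T-ResNet could then consume these windows as separate input channels.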
Predictive Performance Modeling for Distributed Computing using Black-Box Monitoring and Machine Learning
In many domains, the previous decade was characterized by increasing data
volumes and growing complexity of computational workloads, creating new demands
for highly data-parallel computing in distributed systems. Effective operation
of these systems is challenging when facing uncertainties about the performance
of jobs and tasks under varying resource configurations, e.g., for scheduling
and resource allocation. We survey predictive performance modeling (PPM)
approaches to estimate performance metrics such as execution duration, required
memory or wait times of future jobs and tasks based on past performance
observations. We focus on non-intrusive methods, i.e., methods that can be
applied to any workload without modification, since the workload is usually a
black-box from the perspective of the systems managing the computational
infrastructure. We classify and compare sources of performance variation,
predicted performance metrics, required training data, use cases, and the
underlying prediction techniques. We conclude by identifying several open
problems and pressing research needs in the field.
Comment: 19 pages, 3 figures, 5 tables
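A minimal, non-intrusive predictor in the spirit of the surveyed approaches estimates a future job's runtime from black-box observations of past jobs with the same configuration; the feature choice (job name and CPU count) is an illustrative assumption.

```python
from collections import defaultdict

class RuntimePredictor:
    """Sketch of non-intrusive performance prediction: estimate a future
    job's runtime from past observations of similar jobs, using only
    black-box monitoring data (no workload instrumentation). The
    (job_name, cpus) feature key is an illustrative assumption."""

    def __init__(self):
        self.history = defaultdict(list)  # (job_name, cpus) -> past runtimes

    def observe(self, job_name, cpus, runtime_s):
        self.history[(job_name, cpus)].append(runtime_s)

    def predict(self, job_name, cpus, default=None):
        runs = self.history.get((job_name, cpus))
        if not runs:
            return default  # no past observations for this configuration
        return sum(runs) / len(runs)
```

Real systems replace the mean with regression or learned models, but the interface — observe past jobs, predict future ones — is the same.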
Machine Learning for Vehicular Networks
The emerging vehicular networks are expected to make everyday vehicular
operation safer, greener, and more efficient, and to pave the way toward
autonomous driving with the advent of the fifth generation (5G) cellular
system. Machine
learning, as a major branch of artificial intelligence, has been recently
applied to wireless networks to provide a data-driven approach to solve
traditionally challenging problems. In this article, we review recent advances
in applying machine learning in vehicular networks and attempt to bring more
attention to this emerging area. After a brief overview of the major concept of
machine learning, we present some application examples of machine learning in
solving problems arising in vehicular networks. We finally discuss and
highlight several open issues that warrant further research.
Comment: Accepted by IEEE Vehicular Technology Magazine
ADARES: Adaptive Resource Management for Virtual Machines
Virtual execution environments allow for consolidation of multiple
applications onto the same physical server, thereby enabling more efficient use
of server resources. However, users often statically configure the resources of
virtual machines through guesswork, resulting in either insufficient resource
allocations that hinder VM performance, or excessive allocations that waste
precious data center resources. In this paper, we first characterize real-world
resource allocation and utilization of VMs through the analysis of an extensive
dataset, consisting of more than 250k VMs from over 3.6k private enterprise
clusters. Our large-scale analysis confirms that VMs are often misconfigured,
either overprovisioned or underprovisioned, and that this problem is pervasive
across a wide range of private clusters. We then propose ADARES, an adaptive
system that dynamically adjusts VM resources using machine learning techniques.
In particular, ADARES leverages the contextual bandits framework to effectively
manage the adaptations. Our system exploits easily collectible data, at the
cluster, node, and VM levels, to make more sensible allocation decisions, and
uses transfer learning to safely explore the configurations space and speed up
training. Our empirical evaluation shows that ADARES can significantly improve
system utilization without sacrificing performance. For instance, when compared
to threshold- and prediction-based baselines, it achieves more predictable
VM-level performance and also reduces the virtual CPUs and memory
provisioned by up to 35% and 60%, respectively, for synthetic workloads on
real clusters.
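A contextual bandit for VM resizing can be sketched as follows, in the spirit of ADARES; the discretized CPU-utilization context, the three resize actions, and the epsilon-greedy rule are assumptions, since the abstract does not specify the algorithm's internals.

```python
import random

class ResizeBandit:
    """Sketch of a contextual bandit that adapts VM resources.
    The context discretization, action set, and epsilon-greedy policy
    are illustrative assumptions, not ADARES's actual design."""

    def __init__(self, actions=("shrink", "keep", "grow"), epsilon=0.1):
        self.actions = actions
        self.epsilon = epsilon
        self.value = {}   # (context, action) -> running mean reward
        self.count = {}

    def _ctx(self, cpu_util):
        # Discretize VM-level CPU utilization into a coarse context.
        return "low" if cpu_util < 0.3 else "high" if cpu_util > 0.8 else "mid"

    def select(self, cpu_util):
        ctx = self._ctx(cpu_util)
        if random.random() < self.epsilon:
            return random.choice(self.actions)  # explore
        # Exploit: pick the action with the best observed reward so far.
        return max(self.actions, key=lambda a: self.value.get((ctx, a), 0.0))

    def update(self, cpu_util, action, reward):
        key = (self._ctx(cpu_util), action)
        n = self.count.get(key, 0) + 1
        self.count[key] = n
        old = self.value.get(key, 0.0)
        self.value[key] = old + (reward - old) / n  # incremental mean
```

The reward would encode the utilization/performance trade-off; the paper's use of transfer learning to warm-start exploration is not shown here.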
Modular Resource Centric Learning for Workflow Performance Prediction
Workflows provide an expressive programming model for fine-grained control of
large-scale applications in distributed computing environments. Accurate
estimates of complex workflow execution metrics on large-scale machines have
several key advantages. Scheduling algorithms that rely on such estimates
degrade as prediction accuracy decreases. This in-progress paper presents a
technique being
developed to improve the accuracy of predicted performance metrics of
large-scale workflows on distributed platforms. The central idea of this work
is to train resource-centric machine learning agents to capture complex
relationships between a set of program instructions and their performance
metrics when executed on a specific resource. This resource-centric view of a
workflow exploits the fact that predicting execution times of sub-modules of a
workflow requires monitoring and modeling of a few dynamic and static features.
We transform the input workflow that is essentially a directed acyclic graph of
actions into a Physical Resource Execution Plan (PREP). This transformation
enables us to model an arbitrarily complex workflow as a set of simpler
programs running on physical nodes. We delegate a machine learning model to
capture performance metrics for each resource type when it executes different
program instructions under varying degrees of resource contention. Our
algorithm takes the prediction metrics from each resource agent and composes
the overall workflow performance metrics by utilizing the structure of the
corresponding Physical Resource Execution Plan.
Comment: This paper was presented at the 6th Workshop on Big Data Analytics:
Challenges, and Opportunities (BDAC) at the 27th IEEE/ACM International
Conference for High Performance Computing, Networking, Storage, and Analysis
(SC 2015).
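The composition step — combining per-resource predictions into a workflow-level estimate over the plan's DAG — can be sketched as a critical-path computation; the critical-path rule is an assumption, since the abstract does not fix the exact composition function.

```python
def compose_makespan(pred, deps):
    """Compose per-action runtime predictions into a workflow estimate.

    `pred` maps each action to its predicted runtime (as a resource-centric
    agent would supply); `deps` maps each action to its prerequisite
    actions. The workflow estimate here is the critical-path length of the
    DAG — an illustrative composition rule, not necessarily the paper's.
    """
    finish = {}

    def t(node):
        # Earliest finish time: own runtime after all prerequisites finish.
        if node not in finish:
            finish[node] = pred[node] + max(
                (t(d) for d in deps.get(node, ())), default=0)
        return finish[node]

    return max(t(n) for n in pred)
```

Other metrics (memory, IO) would compose with different rules, e.g. a maximum over concurrently running actions.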
Dynamic Selection of Virtual Machines for Application Servers in Cloud Environments
Autoscaling is a hallmark of cloud computing as it allows flexible
just-in-time allocation and release of computational resources in response to
dynamic and often unpredictable workloads. This is especially important for web
applications whose workload is time dependent and prone to flash crowds. Most
of them follow the 3-tier architectural pattern, and are divided into
presentation, application/domain and data layers. In this work we focus on the
application layer. Reactive autoscaling policies of the type "Instantiate a new
Virtual Machine (VM) when the average server CPU utilisation reaches X%" have
been used successfully since the dawn of cloud computing. But which VM type
is most suitable for a given application at a given moment remains an open
question. In this work, we propose an approach for dynamic VM type selection.
It uses a combination of online machine learning techniques, works in real time
and adapts to changes in the users' workload patterns, application changes as
well as middleware upgrades and reconfigurations. We have developed a
prototype, which we tested with the CloudStone benchmark deployed on AWS EC2.
Results show that our method quickly adapts to workload changes and reduces the
total cost compared to the industry-standard approach.
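The reactive policy quoted in the abstract can be sketched directly; the 70% scale-out threshold and the scale-in rule are illustrative assumptions (the abstract leaves X% open), and the paper's contribution — learning which VM type to instantiate — sits on top of such a rule.

```python
def reactive_scale(server_utils, threshold=0.7):
    """Reactive autoscaling rule of the kind quoted in the abstract:
    instantiate a new VM when average server CPU utilisation reaches X%.
    `server_utils` holds per-server CPU utilisation in [0, 1]; the 0.7
    threshold and the scale-in branch are illustrative assumptions."""
    avg = sum(server_utils) / len(server_utils)
    if avg >= threshold:
        return "scale_out"   # add a VM (of the dynamically selected type)
    if avg < threshold / 2 and len(server_utils) > 1:
        return "scale_in"    # release an underused VM
    return "no_op"
```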
Application of Machine Learning in Wireless Networks: Key Techniques and Open Issues
As a key technique for enabling artificial intelligence, machine learning
(ML) is capable of solving complex problems without explicit programming.
Motivated by its successful applications to many practical tasks like image
recognition, both industry and the research community have advocated the
application of ML in wireless communication. This paper comprehensively
surveys recent advances in the application of ML to wireless
communication, classified as: resource management in the MAC layer,
networking and mobility management in the network layer, and localization in
the application layer. The applications in resource management further include
power control, spectrum management, backhaul management, cache management,
beamformer design and computation resource management, while ML based
networking focuses on the applications in clustering, base station switching
control, user association, and routing. Moreover, the literature on each
aspect is organized according to the adopted ML techniques. In addition, several
conditions for applying ML to wireless communication are identified to help
readers decide whether to use ML and which kind of ML techniques to use, and
traditional approaches are also summarized and compared with ML-based
approaches in terms of performance, clarifying why the surveyed works adopt
ML. Given the extensiveness of the research
area, challenges and unresolved issues are presented to facilitate future
studies, where ML based network slicing, infrastructure update to support ML
based paradigms, open data sets and platforms for researchers, theoretical
guidance for ML implementation, and so on are discussed.
Comment: 34 pages, 8 figures
Bioinformatics Computational Cluster Batch Task Profiling with Machine Learning for Failure Prediction
Motivation: Traditional computational cluster schedulers are based on
user-supplied run-time requests for memory and CPU, not IO. The run times of
heavily IO-bound tasks, like those seen in many big data and bioinformatics
problems, depend on IO-subsystem scheduling and are problematic for cluster
resource scheduling. The rescheduling of IO-intensive and errant tasks wastes
resources. Understanding the conditions in both successful and failed tasks,
and differentiating between them, could provide knowledge for enhancing
cluster scheduling and intelligent resource optimization.
Results: We analyze a production computational cluster contributing 6.7
thousand CPU hours to research over two years. Through this analysis we
develop a machine learning task-profiling agent for clusters that attempts to
predict failures among identically provisioned tasks.
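A toy profiling-based failure predictor in this spirit learns a decision threshold on a single monitored feature from past identically provisioned runs; the single-feature IO-wait rule is an illustrative assumption, not the paper's model.

```python
def learn_iowait_threshold(profiles):
    """From (io_wait_fraction, failed) pairs of past identically
    provisioned tasks, pick the midpoint between the mean IO-wait of
    successful and failed runs as a decision threshold. A toy stand-in
    for the paper's (unspecified) ML model."""
    ok = [io for io, failed in profiles if not failed]
    bad = [io for io, failed in profiles if failed]
    return (sum(ok) / len(ok) + sum(bad) / len(bad)) / 2

def predict_failure(io_wait, threshold):
    # Flag a running task as likely to fail when IO wait is above threshold.
    return io_wait >= threshold
```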
Performance-Aware Management of Cloud Resources: A Taxonomy and Future Directions
The dynamic nature of the cloud environment has made distributed resource
management a challenge for cloud service providers. The importance of
maintaining the quality of service in accordance with customer expectations as
well as the highly dynamic nature of cloud-hosted applications add new levels
of complexity to the process. Advances to the big data learning approaches have
shifted conventional static capacity planning solutions to complex
performance-aware resource management methods. It is shown that the process of
decision making for resource adjustment is closely related to the behaviour of
the system including the utilization of resources and application components.
Therefore, a continuous monitoring of system attributes and performance metrics
provide the raw data for the analysis of problems affecting the performance of
the application. Data analytic methods such as statistical and machine learning
approaches offer the required concepts, models and tools to dig into the data,
find general rules, patterns and characteristics that define the functionality
of the system. Knowledge obtained from the data analysis process helps to
identify changes in the workloads, faulty components, or problems that can
cause system performance to degrade. A timely reaction to performance
degradations can avoid violations of the service level agreements by performing
proper corrective actions including auto-scaling or other resource adjustment
solutions. In this paper, we investigate the main requirements and limitations
in cloud resource management including a study of the approaches in workload
and anomaly analysis in the context of the performance management in the cloud.
A taxonomy of the works on this problem is presented that identifies the main
approaches in existing research, from data analysis to resource adjustment
techniques.
Applications of Deep Reinforcement Learning in Communications and Networking: A Survey
This paper presents a comprehensive literature review on applications of deep
reinforcement learning in communications and networking. Modern networks, e.g.,
Internet of Things (IoT) and Unmanned Aerial Vehicle (UAV) networks, become
more decentralized and autonomous. In such networks, network entities need to
make decisions locally to maximize network performance under uncertainty in
the network environment. Reinforcement learning has been effectively used to enable
the network entities to obtain the optimal policy including, e.g., decisions or
actions, given their states when the state and action spaces are small.
However, in complex and large-scale networks, the state and action spaces are
usually large, and reinforcement learning may not be able to find the
optimal policy in a reasonable time. Therefore, deep reinforcement learning, a
combination of reinforcement learning with deep learning, has been developed to
overcome the shortcomings. In this survey, we first give a tutorial of deep
reinforcement learning from fundamental concepts to advanced models. Then, we
review deep reinforcement learning approaches proposed to address emerging
issues in communications and networking. The issues include dynamic network
access, data rate control, wireless caching, data offloading, network security,
and connectivity preservation, which are all important to next-generation
networks such as 5G and beyond. Furthermore, we present applications of deep
reinforcement learning for traffic routing, resource sharing, and data
collection. Finally, we highlight important challenges, open issues, and future
research directions of applying deep reinforcement learning.
Comment: 37 pages, 13 figures, 6 tables, 174 reference papers
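The small-state regime that the survey contrasts with deep RL can be sketched with tabular Q-learning; the toy channel-access MDP below is an invented example for illustration, not taken from the survey.

```python
def q_learning(transitions, sweeps=200, alpha=0.5, gamma=0.9):
    """Tabular Q-learning sketch for small state/action spaces.
    `transitions[(state, action)]` gives (next_state, reward) for a
    deterministic MDP. Performs synchronous sweeps over all state-action
    pairs, applying the Q-learning update each time."""
    q = {}
    for _ in range(sweeps):
        for (s, a), (s2, r) in transitions.items():
            # Best Q-value available from the successor state.
            best_next = max((q.get((s2, a2), 0.0)
                             for (s3, a2) in transitions if s3 == s2),
                            default=0.0)
            old = q.get((s, a), 0.0)
            q[(s, a)] = old + alpha * (r + gamma * best_next - old)
    return q
```

On a toy MDP where transmitting on an idle channel yields reward +1 and on a busy channel -1, the learned policy transmits only when idle; deep RL replaces the table with a neural network when the state space grows too large to enumerate.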