
    An Approach for Modeling and Ranking Node-level Stragglers in Cloud Datacenters

    The ability of servers to execute tasks effectively within Cloud datacenters varies due to heterogeneous CPU and memory capacities, resource contention, network configurations and operational age. Unexpectedly slow server nodes (node-level stragglers) cause their assigned tasks to become task-level stragglers, which dramatically impede parallel job execution. However, it is currently unknown how slow nodes correlate with task straggler manifestation. To address this knowledge gap, we propose a method for node performance modeling and ranking in Cloud datacenters based on analyzing parallel job execution tracelog data. Using a production Cloud system as a case study, we demonstrate that node execution performance is driven by temporal changes in node operation rather than by node hardware capacity. We filter different sample sets to evaluate the generality of our framework, and the analysis shows that node task execution performance tends to follow a three-parameter log-logistic distribution. Statistical attributes such as confidence intervals, quantile values and extreme-case probabilities can then be used to rank nodes and identify potential stragglers within the cluster. We exploit a graph-based algorithm to partition server nodes into five levels, identifying 0.83% of nodes as node-level stragglers. Our work lays the foundation for enhanced scheduling algorithms that avoid slow nodes, reduce task straggler occurrence, and improve parallel job performance.
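
    A minimal sketch of the distribution-fitting step described above, assuming synthetic per-node task durations and using SciPy's Fisk (log-logistic) distribution; the node names, sample data and the 95th-percentile ranking criterion are illustrative, not the paper's exact procedure.

```python
# Fit a three-parameter log-logistic distribution to per-node task execution
# times and rank nodes by their tail quantile. Node names and data are
# illustrative assumptions, not values from the paper.
import numpy as np
from scipy.stats import fisk  # Fisk == log-logistic distribution

rng = np.random.default_rng(0)
# Synthetic tracelog: task durations (seconds) per node; "n3" is slow.
node_tasks = {
    "n1": rng.lognormal(3.0, 0.3, 500),
    "n2": rng.lognormal(3.1, 0.3, 500),
    "n3": rng.lognormal(3.8, 0.5, 500),   # straggler candidate
}

scores = {}
for node, durations in node_tasks.items():
    c, loc, scale = fisk.fit(durations)              # 3-parameter fit
    scores[node] = fisk.ppf(0.95, c, loc=loc, scale=scale)  # tail quantile

# Rank nodes by their 95th-percentile execution time; the slowest ranks last.
for node, p95 in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{node}: 95th percentile ~= {p95:.1f}s")
```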

    Scalable and Distributed Resource Management Protocols for Cloud and Big Data Clusters

    Cloud data centers require an operating system to manage resources and satisfy operational requirements and management objectives. The growing popularity of cloud services has created a new spectrum of services with sophisticated workload and resource management requirements. Data centers are also growing through the addition of various types of hardware to accommodate ever-increasing user requests. A large percentage of cloud resources now execute data-intensive applications, which exhibit continuously changing workload fluctuations and need specific resource management. To this end, cluster computing frameworks are shifting towards distributed resource management for better scalability and faster decision making. Such systems benefit from the parallelization of control and are resilient to failures. Throughout this thesis we investigate algorithms, protocols and techniques to address these challenges in large-scale data centers. We introduce a distributed resource management framework that consolidates virtual machines onto as few servers as possible to reduce data center energy consumption and hence decrease the cost to cloud providers. The framework characterizes the workload of virtual machines and can therefore handle the trade-off between energy consumption and customer Service Level Agreements (SLAs) efficiently. The algorithm is highly scalable, requires low maintenance cost under dynamic workloads, and minimizes virtual machine migration costs. We also introduce a scalable and distributed probe-based scheduling algorithm for Big Data analytics frameworks. This algorithm efficiently addresses the job heterogeneity in workloads that has appeared with increasing levels of parallelism in jobs. The algorithm is massively scalable and can significantly reduce average job completion times compared with the state of the art. Finally, we propose a probabilistic fault-tolerance technique as part of the scheduling algorithm.
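
    The probe-based scheduler itself is not detailed in the abstract; the sketch below illustrates a generic probe-based (power-of-d-choices) placement loop of the kind such distributed schedulers build on, with the worker count and probe fan-out chosen purely for illustration.

```python
# Generic probe-based task placement: probe a few random workers and place
# the task on the least loaded one, with no central view of the cluster.
# NUM_WORKERS and PROBES_PER_TASK are illustrative assumptions.
import random

NUM_WORKERS = 100
PROBES_PER_TASK = 2          # d: probes sent per task
queues = [0] * NUM_WORKERS   # outstanding tasks per worker

def schedule(task_id: int) -> int:
    # Probe d workers chosen uniformly at random and pick the shortest queue.
    candidates = random.sample(range(NUM_WORKERS), PROBES_PER_TASK)
    chosen = min(candidates, key=lambda w: queues[w])
    queues[chosen] += 1
    return chosen

for t in range(1000):
    schedule(t)

print("max queue length:", max(queues), "| mean:", sum(queues) / NUM_WORKERS)
```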

    Workload-Aware Database Monitoring and Consolidation

    In most enterprises, databases are deployed on dedicated database servers. Often, these servers are underutilized much of the time. For example, in traces from almost 200 production servers from different organizations, we see an average CPU utilization of less than 4%. This unused capacity can be potentially harnessed to consolidate multiple databases on fewer machines, reducing hardware and operational costs. Virtual machine (VM) technology is one popular way to approach this problem. However, as we demonstrate in this paper, VMs fail to adequately support database consolidation, because databases place a unique and challenging set of demands on hardware resources that are not well suited to the assumptions made by VM-based consolidation. Instead, our system for database consolidation, named Kairos, uses novel techniques to measure the hardware requirements of database workloads, as well as models to predict the combined resource utilization of those workloads. We formalize the consolidation problem as a non-linear optimization program, aiming to minimize the number of servers and balance load, while achieving near-zero performance degradation. We compare Kairos against virtual machines, showing up to a factor of 12× higher throughput on a TPC-C-like benchmark. We also test the effectiveness of our approach on real-world data collected from production servers at Wikia.com, Wikipedia, Second Life, and MIT CSAIL, showing absolute consolidation ratios ranging between 5.5:1 and 17:1.
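
    The paper formalizes consolidation as a non-linear optimization program; as a simplified stand-in, the sketch below packs workloads greedily (first-fit decreasing) under an assumed additive CPU model. The workload demands and the 80% headroom target are invented for illustration and are not Kairos' actual models.

```python
# Greedy first-fit-decreasing consolidation under an assumed additive CPU
# model. All demand values and the capacity target are illustrative.
SERVER_CPU_CAPACITY = 0.80   # keep 20% headroom per server

# Estimated peak CPU demand of each database workload (fraction of a server).
workloads = {"db_a": 0.35, "db_b": 0.20, "db_c": 0.30, "db_d": 0.10, "db_e": 0.25}

servers = []                  # each server: {"used": float, "dbs": [...]}
for name, demand in sorted(workloads.items(), key=lambda kv: -kv[1]):
    for srv in servers:
        if srv["used"] + demand <= SERVER_CPU_CAPACITY:
            srv["used"] += demand
            srv["dbs"].append(name)
            break
    else:
        servers.append({"used": demand, "dbs": [name]})

for i, srv in enumerate(servers):
    print(f"server {i}: {srv['dbs']} (cpu ~= {srv['used']:.2f})")
print("consolidation ratio:", f"{len(workloads)}:{len(servers)}")
```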

    Dynamic service chain composition in virtualised environment

    Network Function Virtualisation (NFV) has contributed to improving the flexibility of network service provisioning and reducing the time to market of new services. NFV leverages virtualisation technology to decouple the software implementation of network appliances from the physical devices on which they run. However, with the emergence of this paradigm, providing data centre applications with adequate network performance becomes challenging. For instance, virtualised environments cause network congestion, decrease throughput and hurt the end-user experience. Moreover, applications usually communicate through multiple sequences of virtual network functions (VNFs), aka service chains, for policy enforcement and for performance and security enhancement, which increases the management complexity at the network level. To address this, existing studies have proposed high-level approaches to VNF chaining and placement that improve service chain performance. They treat VNFs as homogeneous entities regardless of their specific characteristics, overlooking their distinct behaviour under traffic load and how their underlying implementation shapes resource usage. Our research aims to fill this gap by identifying particular patterns in production and widely used VNFs, and by proposing a categorisation that helps reduce the network latency of the chains. Based on experimental evaluation, we classify firewalls, NATs, IDS/IPS and flow monitors into I/O- and CPU-bound functions. The former category is mainly sensitive to the throughput in packets per second, while the performance of the latter is primarily affected by the network bandwidth in bits per second. We then correlate the VNF category with the characteristics of the traversing traffic, which dictates how the service chains should be composed. We propose a heuristic called Natif, for a VNF-Aware VNF insTantIation and traFfic distribution scheme, to reconcile the discrepancy in VNF requirements based on the category they belong to and, eventually, to reduce network latency. We have deployed Natif in an OpenStack-based environment and compared it to a network-aware VNF composition approach. Our results show a decrease in latency of around 188% on average without sacrificing throughput.
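
    The sketch below is a toy illustration of the categorisation idea only, not the Natif heuristic itself: each VNF is labelled I/O- or CPU-bound from assumed sensitivity measurements, and an instance count is derived from whichever traffic dimension dominates for that category. All profiles and scaling constants are made up for illustration.

```python
# Classify VNFs as I/O- or CPU-bound from assumed sensitivity measurements,
# then scale instances with the traffic dimension that matters for each class.
from dataclasses import dataclass

@dataclass
class VNFProfile:
    name: str
    pps_sensitivity: float   # relative throughput drop per extra Mpps (assumed)
    bps_sensitivity: float   # relative throughput drop per extra Gbit/s (assumed)

def categorise(vnf: VNFProfile) -> str:
    return "io-bound" if vnf.pps_sensitivity > vnf.bps_sensitivity else "cpu-bound"

profiles = [
    VNFProfile("firewall", 0.30, 0.05),
    VNFProfile("nat",      0.25, 0.04),
    VNFProfile("ids",      0.08, 0.35),
]

# Naive instantiation rule: I/O-bound VNFs scale with packet rate,
# CPU-bound VNFs scale with bandwidth of the traffic traversing the chain.
traffic = {"mpps": 2.0, "gbps": 8.0}
for vnf in profiles:
    cat = categorise(vnf)
    load = traffic["mpps"] if cat == "io-bound" else traffic["gbps"]
    sensitivity = vnf.pps_sensitivity if cat == "io-bound" else vnf.bps_sensitivity
    instances = max(1, round(load * sensitivity * 2))
    print(f"{vnf.name}: {cat}, instantiate {instances} instance(s)")
```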

    Sensing physical fields: Inverse problems for the diffusion equation and beyond

    Due to significant advances made over the last few decades in the areas of (wireless) networking, communications and microprocessor fabrication, the use of sensor networks to observe physical phenomena is rapidly becoming commonplace. Over this period, many aspects of sensor networks have been explored, yet a thorough understanding of how to analyse and process the vast amounts of sensor data collected remains an open area of research. This work therefore aims to provide theoretical, as well as practical, advances in this area. In particular, we consider the problem of inferring certain underlying properties of the monitored phenomena from our sensor measurements. Within mathematics, this is commonly formulated as an inverse problem, whereas in signal processing it appears as a (multidimensional) sampling and reconstruction problem. Inverse problems are notoriously ill-posed and very demanding to solve, while viewing the problem as the latter also presents several technical challenges: the monitored field is usually non-bandlimited, the sensor placement is typically non-regular, and the space-time dimensions of the field are generally non-homogeneous. Furthermore, although sensor production is a very advanced domain, it is nearly impossible and/or extremely costly to design sensors with no measurement noise. These challenges motivate the need for a stable, noise-robust, yet simple sampling theory for the problem at hand. In our work, we narrow the gap between the domains of inverse problems and modern sampling theory and, in so doing, extend existing results by introducing a framework for solving the inverse source problems for a class of well-known physical phenomena. Examples include the reconstruction of plume sources, thermal monitoring of multi-core processors and acoustic source estimation, to name a few. We assume these phenomena and their sources can be described using partial differential equation (PDE) and parametric source models, respectively. Under this assumption, we obtain a well-posed inverse problem. Initially, we consider a phenomenon governed by the two-dimensional diffusion equation, i.e. 2-D diffusion fields, and assume that we have access to its continuous field measurements. In this setup, we derive novel exact closed-form inverse formulae that solve the inverse diffusion source problem for a class of localized and non-localized source models. In our derivation, we prove that a particular 1-D sequence of so-called generalized measurements of the field is governed by a power-sum series, hence it can be efficiently solved using existing algebraic methods such as Prony's method. Next, we show how to obtain these generalized measurements by using Green's second identity to combine the continuous diffusion field with a family of well-chosen sensing functions. From these new inverse formulae, we develop novel noise-robust centralized and distributed reconstruction methods for diffusion fields. Specifically, we extend these inverse formulae to centralized sensor networks using numerical quadrature; conversely, for distributed networks, we propose a new physics-driven consensus scheme to approximate the generalized measurements through localized interactions between the sensor nodes. Finally, we provide numerical results using both synthetic and real data to validate the proposed algorithms. Given the insights gained, we eventually turn to the more general problem.
That is, the two- and three-dimensional inverse source problems for any linear PDE with constant coefficients. Extending the previous framework, we solve this new class of inverse problems by establishing an otherwise subtle link with modern sampling theory. We achieve this by showing that the desired generalized measurements can be computed by taking linear weighted sums of the sensor measurements. The advantage of this is two-fold. First, we obtain a more flexible framework that permits the use of more general sensing functions; this freedom is important for solving the 3-D problem. Second, and remarkably, we are able to analyse many more physical phenomena beyond diffusion fields. We prove that computing the proper sequence of generalized measurements for any such field, via linear sums, reduces to approximating (a family of) exponentials with translates of a particular prototype function. We show that this prototype function depends on the Green's function of the field, and then derive an explicit formula to evaluate the proper weights. Furthermore, since we now have more freedom in selecting the sensing functions, we discuss how to make the correct choice whilst emphasizing how to retrieve the unknown source parameters from the resulting (multidimensional) Prony-like systems. Based on this new theory, we develop practical, noise-robust sensor network strategies for solving the inverse source problem, and then present numerical simulation results to verify the performance of our proposed schemes.
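
    The following sketch illustrates only the algebraic core mentioned above: recovering the parameters of a power-sum series of generalized measurements with Prony's method. The measurements are synthesised directly rather than computed from a sensed field, and the model order and source values are assumed for illustration.

```python
# Prony's method on a power-sum series m_k = sum_j a_j * z_j**k:
# recover the z_j (which encode source parameters) and the weights a_j.
import numpy as np

# Synthetic "generalized measurements" from two hidden sources (assumed values).
z_true = np.array([0.9 * np.exp(1j * 0.4), 0.7 * np.exp(-1j * 1.1)])
a_true = np.array([2.0, 1.5])
K = 8                                           # need K >= 2 * number of sources
m = np.array([np.sum(a_true * z_true**k) for k in range(K)])

M = 2                                           # model order (number of sources)
# Step 1: annihilating filter h (h[0] = 1) such that sum_i h[i] * m[k-i] = 0.
A = np.array([[m[k - i] for i in range(1, M + 1)] for k in range(M, K)])
b = -m[M:K]
h = np.linalg.lstsq(A, b, rcond=None)[0]
# Step 2: roots of the annihilating polynomial give the z_j.
z_est = np.roots(np.concatenate(([1.0], h)))
# Step 3: solve the Vandermonde system for the weights a_j.
V = np.vander(z_est, K, increasing=True).T      # V[k, j] = z_j**k
a_est = np.linalg.lstsq(V, m, rcond=None)[0]

print("recovered z:", np.sort_complex(z_est))
print("recovered a:", np.round(a_est, 3))
```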

    Energy-aware scheduling in virtualized datacenters

    The reduction of energy consumption in large-scale datacenters is being accomplished through an extensive use of virtualization, which enables the consolidation of multiple workloads on a smaller number of machines. Nevertheless, virtualization also incurs additional overheads (e.g. virtual machine creation and migration) that can influence which consolidated configuration is best, and thus they must be taken into account. In this paper, we present a dynamic job scheduling policy for power-aware resource allocation in a virtualized datacenter. Our policy consolidates workloads from separate machines onto a smaller number of nodes while providing the hardware resources needed to preserve the quality of service of each job. This allows spare servers to be turned off, reducing the overall datacenter power consumption. As a novelty, this policy incorporates all the virtualization overheads in the decision process. In addition, our policy can consider other important parameters for a datacenter, such as reliability or dynamic SLA enforcement, in a synergistic way with power consumption. We evaluate the proposed policy against common policies in a simulated environment that accurately models HPC job execution in a virtualized datacenter, including power consumption modeling, and obtain a power consumption reduction of 15% with respect to typical policies.
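
    As a rough illustration of why virtualization overheads matter in the decision process, the sketch below weighs the energy saved by powering down a freed node against an assumed VM migration cost; the power and timing constants are illustrative and do not come from the paper's model.

```python
# Weigh the energy saved by switching off a freed node against the cost of
# migrating its last VM. All constants are illustrative assumptions.
IDLE_POWER_W = 120.0          # power drawn by a node that could be switched off
MIGRATION_SECONDS = 45.0      # estimated VM migration overhead
MIGRATION_POWER_W = 200.0     # extra power drawn while migrating

def consolidation_gain(remaining_runtime_s: float) -> float:
    """Net energy gain (joules) of migrating the last VM off a node and
    powering the node down for the remainder of the job's runtime."""
    saved = IDLE_POWER_W * max(0.0, remaining_runtime_s - MIGRATION_SECONDS)
    cost = MIGRATION_POWER_W * MIGRATION_SECONDS
    return saved - cost

for runtime in (60, 600, 3600):
    gain = consolidation_gain(runtime)
    decision = "consolidate" if gain > 0 else "leave in place"
    print(f"remaining runtime {runtime:>5}s -> net gain {gain:>9.0f} J -> {decision}")
```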