230,254 research outputs found

    BonFIRE: A multi-cloud test facility for internet of services experimentation

    Get PDF
    BonFIRE offers a Future Internet, multi-site, cloud testbed, targeted at the Internet of Services community, that supports large scale testing of applications, services and systems over multiple, geographically distributed, heterogeneous cloud testbeds. The aim of BonFIRE is to provide an infrastructure that gives experimenters the ability to control and monitor the execution of their experiments to a degree that is not found in traditional cloud facilities. The BonFIRE architecture has been designed to support key functionalities such as: resource management; monitoring of virtual and physical infrastructure metrics; elasticity; single document experiment descriptions; and scheduling. As for January 2012 BonFIRE release 2 is operational, supporting seven pilot experiments. Future releases will enhance the offering, including the interconnecting with networking facilities to provide access to routers, switches and bandwidth-on-demand systems. BonFIRE will be open for general use late 2012

    Integrating Scale Out and Fault Tolerance in Stream Processing using Operator State Management

    Get PDF
    As users of big data applications expect fresh results, we witness a new breed of stream processing systems (SPS) that are designed to scale to large numbers of cloud-hosted machines. Such systems face new challenges: (i) to benefit from the pay-as-you-go model of cloud computing, they must scale out on demand, acquiring additional virtual machines (VMs) and parallelising operators when the workload increases; (ii) failures are common with deployments on hundreds of VMs - systems must be fault-tolerant with fast recovery times, yet low per-machine overheads. An open question is how to achieve these two goals when stream queries include stateful operators, which must be scaled out and recovered without affecting query results. Our key idea is to expose internal operator state explicitly to the SPS through a set of state management primitives. Based on them, we describe an integrated approach for dynamic scale out and recovery of stateful operators. Externalised operator state is checkpointed periodically by the SPS and backed up to upstream VMs. The SPS identifies individual operator bottlenecks and automatically scales them out by allocating new VMs and partitioning the check-pointed state. At any point, failed operators are recovered by restoring checkpointed state on a new VM and replaying unprocessed tuples. We evaluate this approach with the Linear Road Benchmark on the Amazon EC2 cloud platform and show that it can scale automatically to a load factor of L=350 with 50 VMs, while recovering quickly from failures. Copyright © 2013 ACM

    Riding out of the storm: How to deal with the complexity of grid and cloud management

    Get PDF
    Over the last decade, Grid computing paved the way for a new level of large scale distributed systems. This infrastructure made it possible to securely and reliably take advantage of widely separated computational resources that are part of several different organizations. Resources can be incorporated to the Grid, building a theoretical virtual supercomputer. In time, cloud computing emerged as a new type of large scale distributed system, inheriting and expanding the expertise and knowledge that have been obtained so far. Some of the main characteristics of Grids naturally evolved into clouds, others were modified and adapted and others were simply discarded or postponed. Regardless of these technical specifics, both Grids and clouds together can be considered as one of the most important advances in large scale distributed computing of the past ten years; however, this step in distributed computing has came along with a completely new level of complexity. Grid and cloud management mechanisms play a key role, and correct analysis and understanding of the system behavior are needed. Large scale distributed systems must be able to self-manage, incorporating autonomic features capable of controlling and optimizing all resources and services. Traditional distributed computing management mechanisms analyze each resource separately and adjust specific parameters of each one of them. When trying to adapt the same procedures to Grid and cloud computing, the vast complexity of these systems can make this task extremely complicated. But large scale distributed systems complexity could only be a matter of perspective. It could be possible to understand the Grid or cloud behavior as a single entity, instead of a set of resources. This abstraction could provide a different understanding of the system, describing large scale behavior and global events that probably would not be detected analyzing each resource separately. In this work we define a theoretical framework that combines both ideas, multiple resources and single entity, to develop large scale distributed systems management techniques aimed at system performance optimization, increased dependability and Quality of Service (QoS). The resulting synergy could be the key 350 J. Montes et al. to address the most important difficulties of Grid and cloud management

    Self organising cloud cells: a resource efficient network densification strategy

    Get PDF
    Network densification is envisioned as the key enabler for 2020 vision that requires cellular systems to grow in capacity by hundreds of times to cope with unprecedented traffic growth trends being witnessed since advent of broadband on the move. However, increased energy consumption and complex mobility management associated with network densifications remain as the two main challenges to be addressed before further network densification can be exploited on a wide scale. In the wake of these challenges, this paper proposes and evaluates a novel dense network deployment strategy for increasing the capacity of future cellular systems without sacrificing energy efficiency and compromising mobility performance. Our deployment architecture consists of smart small cells, called cloud nodes, which provide data coverage to individual users on a demand bases while taking into account the spatial and temporal dynamics of user mobility and traffic. The decision to activate the cloud nodes, such that certain performance objectives at system level are targeted, is carried out by the overlaying macrocell based on a fuzzy-logic framework. We also compare the proposed architecture with conventional macrocell only deployment and pure microcell-based dense deployment in terms of blocking probability, handover probability and energy efficiency and discuss and quantify the trade-offs therein

    Scalable transactions in the cloud: partitioning revisited

    Get PDF
    Lecture Notes in Computer Science, 6427Cloud computing is becoming one of the most used paradigms to deploy highly available and scalable systems. These systems usually demand the management of huge amounts of data, which cannot be solved with traditional nor replicated database systems as we know them. Recent solutions store data in special key-value structures, in an approach that commonly lacks the consistency provided by transactional guarantees, as it is traded for high scalability and availability. In order to ensure consistent access to the information, the use of transactions is required. However, it is well-known that traditional replication protocols do not scale well for a cloud environment. Here we take a look at current proposals to deploy transactional systems in the cloud and we propose a new system aiming at being a step forward in achieving this goal. We proceed to focus on data partitioning and describe the key role it plays in achieving high scalability.This work has been partially supported by the Spanish Government under grant TIN2009-14460-C03-02 and by the Spanish MEC under grant BES-2007-17362 and by project ReD Resilient Database Clusters (PDTC/EIA-EIA/109044/2008)

    Scalable and Fault-tolerant Stateful Stream Processing.

    Get PDF
    As users of "big data" applications expect fresh results, we witness a new breed of stream processing systems (SPS) that are designed to scale to large numbers of cloud-hosted machines. Such systems face new challenges: (i) to benefit from the "pay-as-you-go" model of cloud computing, they must scale out on demand, acquiring additional virtual machines (VMs) and parallelising operators when the workload increases; (ii) failures are common with deployments on hundreds of VMs—systems must be fault-tolerant with fast recovery times, yet low per-machine overheads. An open question is how to achieve these two goals when stream queries include stateful operators, which must be scaled out and recovered without affecting query results. Our key idea is to expose internal operator state explicitly to the SPS through a set of state management primitives. Based on them, we describe an integrated approach for dynamic scale out and recovery of stateful operators. Externalised operator state is checkpointed periodically by the SPS and backed up to upstream VMs. The SPS identifies individual operator bottlenecks and automatically scales them out by allocating new VMs and partitioning the checkpointed state. At any point, failed operators are recovered by restoring checkpointed state on a new VM and replaying unprocessed tuples. We evaluate this approach with the Linear Road Benchmark on the Amazon EC2 cloud platform and show that it can scale automatically to a load factor of L=350 with 50 VMs, while recovering quickly from failures

    Self-management for large-scale distributed systems

    Get PDF
    Autonomic computing aims at making computing systems self-managing by using autonomic managers in order to reduce obstacles caused by management complexity. This thesis presents results of research on self-management for large-scale distributed systems. This research was motivated by the increasing complexity of computing systems and their management. In the first part, we present our platform, called Niche, for programming self-managing component-based distributed applications. In our work on Niche, we have faced and addressed the following four challenges in achieving self-management in a dynamic environment characterized by volatile resources and high churn: resource discovery, robust and efficient sensing and actuation, management bottleneck, and scale. We present results of our research on addressing the above challenges. Niche implements the autonomic computing architecture, proposed by IBM, in a fully decentralized way. Niche supports a network-transparent view of the system architecture simplifying the design of distributed self-management. Niche provides a concise and expressive API for self-management. The implementation of the platform relies on the scalability and robustness of structured overlay networks. We proceed by presenting a methodology for designing the management part of a distributed self-managing application. We define design steps that include partitioning of management functions and orchestration of multiple autonomic managers. In the second part, we discuss robustness of management and data consistency, which are necessary in a distributed system. Dealing with the effect of churn on management increases the complexity of the management logic and thus makes its development time consuming and error prone. We propose the abstraction of Robust Management Elements, which are able to heal themselves under continuous churn. Our approach is based on replicating a management element using finite state machine replication with a reconfigurable replica set. Our algorithm automates the reconfiguration (migration) of the replica set in order to tolerate continuous churn. For data consistency, we propose a majority-based distributed key-value store supporting multiple consistency levels that is based on a peer-to-peer network. The store enables the tradeoff between high availability and data consistency. Using majority allows avoiding potential drawbacks of a master-based consistency control, namely, a single-point of failure and a potential performance bottleneck. In the third part, we investigate self-management for Cloud-based storage systems with the focus on elasticity control using elements of control theory and machine learning. We have conducted research on a number of different designs of an elasticity controller, including a State-Space feedback controller and a controller that combines feedback and feedforward control. We describe our experience in designing an elasticity controller for a Cloud-based key-value store using state-space model that enables to trade-off performance for cost. We describe the steps in designing an elasticity controller. We continue by presenting the design and evaluation of ElastMan, an elasticity controller for Cloud-based elastic key-value stores that combines feedforward and feedback control

    Softwarization of Large-Scale IoT-based Disasters Management Systems

    Get PDF
    The Internet of Things (IoT) enables objects to interact and cooperate with each other for reaching common objectives. It is very useful in large-scale disaster management systems where humans are likely to fail when they attempt to perform search and rescue operations in high-risk sites. IoT can indeed play a critical role in all phases of large-scale disasters (i.e. preparedness, relief, and recovery). Network softwarization aims at designing, architecting, deploying, and managing network components primarily based on software programmability properties. It relies on key technologies, such as cloud computing, Network Functions Virtualization (NFV), and Software Defined Networking (SDN). The key benefits are agility and cost efficiency. This thesis proposes softwarization approaches to tackle the key challenges related to large-scale IoT based disaster management systems. A first challenge faced by large-scale IoT disaster management systems is the dynamic formation of an optimal coalition of IoT devices for the tasks at hand. Meeting this challenge is critical for cost efficiency. A second challenge is an interoperability. IoT environments remain highly heterogeneous. However, the IoT devices need to interact. Yet another challenge is Quality of Service (QoS). Disaster management applications are known to be very QoS sensitive, especially when it comes to delay. To tackle the first challenge, we propose a cloud-based architecture that enables the formation of efficient coalitions of IoT devices for search and rescue tasks. The proposed architecture enables the publication and discovery of IoT devices belonging to different cloud providers. It also comes with a coalition formation algorithm. For the second challenge, we propose an NFV and SDN based - architecture for on-the-fly IoT gateway provisioning. The gateway functions are provisioned as Virtual Network Functions (VNFs) that are chained on-the-fly in the IoT domain using SDN. When it comes to the third challenge, we rely on fog computing to meet the QoS and propose algorithms that provision IoT applications components in hybrid NFV based - cloud/fogs. Both stationary and mobile fog nodes are considered. In the case of mobile fog nodes, a Tabu Search-based heuristic is proposed. It finds a near-optimal solution and we numerically show that it is faster than the Integer Linear Programming (ILP) solution by several orders of magnitude
    corecore