15,785 research outputs found

Integrating Scale Out and Fault Tolerance in Stream Processing using Operator State Management

Author: Fernandez RC
Kalyvianaki E
Migliavacca M
Pietzuch P
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2013

As users of big data applications expect fresh results, we witness a new breed of stream processing systems (SPS) that are designed to scale to large numbers of cloud-hosted machines. Such systems face new challenges: (i) to benefit from the pay-as-you-go model of cloud computing, they must scale out on demand, acquiring additional virtual machines (VMs) and parallelising operators when the workload increases; (ii) failures are common with deployments on hundreds of VMs, so systems must be fault-tolerant with fast recovery times, yet low per-machine overheads. An open question is how to achieve these two goals when stream queries include stateful operators, which must be scaled out and recovered without affecting query results. Our key idea is to expose internal operator state explicitly to the SPS through a set of state management primitives. Based on them, we describe an integrated approach for dynamic scale out and recovery of stateful operators. Externalised operator state is checkpointed periodically by the SPS and backed up to upstream VMs. The SPS identifies individual operator bottlenecks and automatically scales them out by allocating new VMs and partitioning the checkpointed state. At any point, failed operators are recovered by restoring checkpointed state on a new VM and replaying unprocessed tuples. We evaluate this approach with the Linear Road Benchmark on the Amazon EC2 cloud platform and show that it can scale automatically to a load factor of L=350 with 50 VMs, while recovering quickly from failures. Copyright © 2013 ACM.
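The checkpoint, partition and replay steps described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `CountingOperator` class, the `partition`, `recover` and `crc_route` helpers, the per-key counter used as state, and the sequence-number scheme are all hypothetical names and assumptions chosen only to show how exposing operator state makes scale out and recovery possible.

```python
import copy
import zlib
from dataclasses import dataclass, field


@dataclass
class CountingOperator:
    """Hypothetical stateful operator that counts tuples per key."""
    state: dict = field(default_factory=dict)  # externalised operator state
    last_seq: int = -1                         # last tuple reflected in the state

    def process(self, seq, key):
        self.state[key] = self.state.get(key, 0) + 1
        self.last_seq = seq

    def checkpoint(self):
        # Snapshot that an SPS could back up to an upstream VM.
        return {"state": copy.deepcopy(self.state), "last_seq": self.last_seq}

    @classmethod
    def restore(cls, snapshot):
        return cls(state=copy.deepcopy(snapshot["state"]),
                   last_seq=snapshot["last_seq"])


def crc_route(key, n):
    # Deterministic key-to-partition routing (assumed; any stable hash works).
    return zlib.crc32(key.encode()) % n


def partition(snapshot, n, route):
    """Split one checkpoint into n checkpoints by key, for scale out."""
    parts = [{"state": {}, "last_seq": snapshot["last_seq"]} for _ in range(n)]
    for key, value in snapshot["state"].items():
        parts[route(key, n)]["state"][key] = value
    return parts


def recover(snapshot, upstream_buffer):
    """Restore a checkpoint and replay tuples it does not yet reflect."""
    op = CountingOperator.restore(snapshot)
    for seq, key in upstream_buffer:
        if seq > snapshot["last_seq"]:
            op.process(seq, key)
    return op
```

In this sketch the upstream buffer stands in for the tuples an upstream VM would retain until they are covered by a checkpoint; replaying only tuples with a sequence number above the checkpoint's `last_seq` avoids counting any tuple twice.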

CiteSeerX

City Research Online

Crossref

Spiral - Imperial College Digital Repository

Kent Academic Repository

Internet of robotic things : converging sensing/actuating, hypoconnectivity, artificial intelligence and IoT Platforms

Author: Bacciu D
Bahr R
Bröring A
Cavallo F
Chessa S
Dragone M
Gallicchio C
Micheli A
Saffiotti A
Serrano M
Simoens Pieter
Tragos E
Vermesan O
Publication date: 01/01/2017

The Internet of Things (IoT) concept is evolving rapidly and influencing new developments in various application domains, such as the Internet of Mobile Things (IoMT), Autonomous Internet of Things (A-IoT), Autonomous System of Things (ASoT), Internet of Autonomous Things (IoAT), Internet of Things Clouds (IoT-C) and the Internet of Robotic Things (IoRT), all of which are advancing by using IoT technology. The IoT influence presents new development and deployment challenges in areas such as seamless platform integration, context-based cognitive network integration, new mobile sensor/actuator network paradigms, things identification (addressing and naming in IoT), dynamic things discoverability and many others. The IoRT presents new convergence challenges that need to be addressed, among them the programmability and communication of multiple heterogeneous mobile/autonomous/robotic things so that they can cooperate, along with their coordination, configuration, exchange of information, security, safety and protection. Developments in IoT heterogeneous parallel processing/communication and dynamic systems based on parallelism and concurrency require new ideas for integrating the intelligent “devices”, collaborative robots (COBOTS), into IoT applications. Dynamic maintainability, self-healing, self-repair of resources, changing resource state, (re-)configuration and context-based IoT systems for service implementation and integration with IoT network service composition are of paramount importance when new “cognitive devices” are becoming active participants in IoT applications. This chapter aims to be an overview of the IoRT concept, technologies, architectures and applications and to provide a comprehensive coverage of future challenges, developments and applications.

Ghent University Academic Bibliography

Publikationer från Örebro universitet

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Building an Emulation Environment for Cyber Security Analyses of Complex Networked Systems

Author: Bonomi Silvia
Meacci Davide
Rapone Raniero
Sorella Mara
Tanasache Florin Dragos
Publication date: 23/10/2018

Computer networks are undergoing a phenomenal growth, driven by the rapidly increasing number of nodes constituting the networks. At the same time, the number of security threats on Internet and intranet networks is constantly growing, and the testing and experimentation of cyber defense solutions requires the availability of separate test environments that best emulate the complexity of a real system. Such environments support the deployment and monitoring of complex mission-driven network scenarios, thus enabling the study of cyber defense strategies under real and controllable traffic and attack scenarios. In this paper, we propose a methodology that combines network and security assessment techniques with cloud technologies to build an emulation environment with an adjustable degree of affinity with respect to actual reference networks or planned systems. As a byproduct, starting from a specific case study, we collected a dataset consisting of complete network traces comprising benign and malicious traffic, which is feature-rich and publicly available.

arXiv.org e-Print Archive

Archivio della ricerca- Università di Roma La Sapienza

Online Fault Classification in HPC Systems through Machine Learning

Author: A Gainaru
Alessio Netti
C Engelmann
F Cappello
I Cohen
M Snir
O Tuncer
Z Lan
Publication date: 01/01/2019

As High-Performance Computing (HPC) systems strive towards the exascale goal, studies suggest that they will experience excessive failure rates. For this reason, detecting and classifying faults in HPC systems as they occur and initiating corrective actions before they can transform into failures will be essential for continued operation. In this paper, we propose a fault classification method for HPC systems based on machine learning that has been designed specifically to operate with live streamed data. We cast the problem and its solution within realistic operating constraints of online use. Our results show that almost perfect classification accuracy can be reached for different fault types with low computational overhead and minimal delay. We have based our study on a local dataset, which we make publicly available, that was acquired by injecting faults to an in-house experimental HPC system. Comment: Accepted for publication at the Euro-Par 2019 conference.

arXiv.org e-Print Archive

Crossref

Archivio della Ricerca - Università di Pisa

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Gossip-based service monitoring platform for wireless edge cloud computing

Author: Freitag Fèlix
Girdzijauskas Sarunas
Navarro Moldes Leandro
Silvestre Apolonia Nuno Miguel
Vlassov Vladimir
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2017

Edge cloud computing proposes to support shared services by using the infrastructure at the network's edge. An important problem is the monitoring and management of services across the edge environment. Therefore, dissemination and gathering of data are not straightforward, differing from the classic cloud infrastructure. In this paper, we consider the environment of community networks for edge cloud computing, in which the monitoring of cloud services is required. We propose a monitoring platform to collect near real-time data about the services offered in the community network using a gossip-enabled network. We analyze and apply this gossip-enabled network to perform service discovery and information sharing, enabling data dissemination among the community. We implemented our solution as a prototype and used it for collecting service monitoring data from the real operational community network cloud, as a feasible deployment of our solution. By means of emulation and simulation, we analyze the behavior of the gossip overlay solution in different scenarios and obtain average results regarding information propagation and consistency needs, i.e. in high latency situations, data convergence occurs within minutes. Peer Reviewed. Postprint (author's final draft)
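As a rough illustration of how a gossip-enabled network spreads information, the following Python sketch simulates a generic push-gossip protocol and counts the rounds needed for one update to reach every node. The abstract does not specify the protocol used by the platform, so the function name `gossip_rounds`, the `fanout`, `seed` and `max_rounds` parameters, and the uniform random peer selection are all assumptions made for this example.

```python
import random


def gossip_rounds(num_nodes, fanout=2, seed=0, max_rounds=1000):
    """Count push-gossip rounds until every node holds the update.

    Node 0 starts with the update. In each round, every informed node
    pushes it to `fanout` peers chosen uniformly at random (a node may
    pick itself or an already informed peer, which wastes that push).
    """
    rng = random.Random(seed)       # seeded so that runs are repeatable
    informed = {0}
    rounds = 0
    while len(informed) < num_nodes and rounds < max_rounds:
        newly_informed = set()
        for _node in sorted(informed):
            peers = rng.sample(range(num_nodes), min(fanout, num_nodes))
            newly_informed.update(peers)
        informed |= newly_informed
        rounds += 1
    return rounds


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, "nodes:", gossip_rounds(n), "rounds")
```

Because the informed set can at most grow by a factor of `fanout + 1` per round, the round count grows slowly with network size, which is the usual argument for using gossip in large, loosely managed networks.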

Crossref

UPCommons. Portal del coneixement obert de la UPC

ZENODO

Observing the clouds : a survey and taxonomy of cloud monitoring

Author: Barker Adam David
Ward Jonathan Stuart
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2014

This research was supported by a Royal Society Industry Fellowship and an Amazon Web Services (AWS) grant. Date of Acceptance: 10/12/2014. Monitoring is an important aspect of designing and maintaining large-scale systems. Cloud computing presents a unique set of challenges to monitoring including: on-demand infrastructure, unprecedented scalability, rapid elasticity and performance uncertainty. There is a wide range of monitoring tools originating from cluster and high-performance computing, grid computing and enterprise computing, as well as a series of newer bespoke tools, which have been designed exclusively for cloud monitoring. These tools express a number of common elements and designs, which address the demands of cloud monitoring to various degrees. This paper performs an exhaustive survey of contemporary monitoring tools from which we derive a taxonomy, which examines how effectively existing tools and designs meet the challenges of cloud monitoring. We conclude by examining the socio-technical aspects of monitoring, and investigate the engineering challenges and practices behind implementing monitoring strategies for cloud computing. Publisher PDF. Peer reviewed.

Springer - Publisher Connector

University of St. Andrews - Pure

St Andrews Research Repository