1,234 research outputs found

Monitoring and analysis system for performance troubleshooting in data centers

Author: Wang Chengwei
Publication venue: Georgia Institute of Technology
Publication date: 13/01/2014

It was not long ago. On Christmas Eve 2012, a war of troubleshooting began in Amazon data centers. It started at 12:24 PM with a mistaken deletion of the state data of the Amazon Elastic Load Balancing Service (ELB for short), which went unnoticed at the time. The mistake first caused a local issue in which a small number of ELB service APIs were affected. In about six minutes, it evolved into a critical one in which EC2 customers were significantly affected. One example was Netflix, which was using hundreds of Amazon ELB services and experienced an extensive streaming service outage in which many customers could not watch TV shows or movies on Christmas Eve. It took Amazon engineers 5 hours and 42 minutes to find the root cause, the mistaken deletion, and another 15 hours and 32 minutes to fully recover the ELB service. The war ended at 8:15 AM the next day and brought performance troubleshooting in data centers to the world’s attention. As the Amazon ELB case shows, troubleshooting runtime performance issues is crucial in time-sensitive multi-tier cloud services because of their stringent end-to-end timing requirements, but it is also notoriously difficult and time-consuming. To address the troubleshooting challenge, this dissertation proposes VScope, a flexible monitoring and analysis system for online troubleshooting in data centers. VScope provides primitive operations that data center operators can use to troubleshoot various performance issues. Each operation is essentially a series of monitoring and analysis functions executed on an overlay network. We design a novel software architecture for VScope so that the overlay networks can be generated, executed, and terminated automatically and on demand. On the troubleshooting side, we design novel anomaly detection algorithms and implement them in VScope. By running anomaly detection algorithms in VScope, data center operators are notified when performance anomalies happen.
We also design a graph-based guidance approach, called VFocus, which tracks the interactions among hardware and software components in data centers. VFocus provides primitive operations with which operators can analyze the interactions to find out which components are relevant to the performance issue. VScope’s capabilities and performance are evaluated on a testbed with over 1000 virtual machines (VMs). Experimental results show that the VScope runtime negligibly perturbs system and application performance, and requires mere seconds to deploy monitoring and analytics functions on over 1000 nodes. This demonstrates VScope’s ability to support fast operation and online queries against a comprehensive set of application-level to system/platform-level metrics, and a variety of representative analytics functions. When supporting algorithms with high computational complexity, VScope serves as a ‘thin layer’ that accounts for no more than 5% of their total latency. Further, by using VFocus, VScope can locate problematic VMs that cannot be found via solely application-level monitoring, and in one of the use cases explored in the dissertation, it operates with levels of perturbation of over 400% less than what is seen for brute-force and most sampling-based approaches. We also validate VFocus with real-world data center traces. The experimental results show that VFocus has a troubleshooting accuracy of 83% on average.

Ph.D.

Scholarly Materials And Research @ Georgia Tech

Recommended from our members

FABRIC: A National-Scale Programmable Experimental Network Infrastructure

Authors: Baldin I; Deelman E; Griffioen J; Lehman T; Monga IIS; Nikolich A; Ruth P; Wang KC
Publication venue: eScholarship, University of California
Publication date: 01/11/2019

FABRIC is a unique national research infrastructure to enable cutting-edge and exploratory research at scale in networking, cybersecurity, distributed computing and storage systems, machine learning, and science applications. It is an everywhere-programmable nationwide instrument composed of novel extensible network elements equipped with large amounts of compute and storage, interconnected by high-speed, dedicated optical links. It will connect a number of specialized testbeds for cloud research (NSF Cloud testbeds CloudLab and Chameleon), for research beyond 5G technologies (Platforms for Advanced Wireless Research or PAWR), as well as production high-performance computing facilities and science instruments to create a rich fabric for a wide variety of experimental activities.

eScholarship - University of California

Enabling Interactive Analytics of Secure Data using Cloud Kotta

Authors: Babuji Yadu N.; Chard Kyle; Duede Eamon
Publication date: 28/04/2017

Research, especially in the social sciences and humanities, is increasingly reliant on the application of data science methods to analyze large amounts of (often private) data. Secure data enclaves provide a solution for managing and analyzing private data. However, such enclaves do not readily support discovery science, a form of exploratory or interactive analysis by which researchers execute a range of (sometimes large) analyses in an iterative and collaborative manner. The batch computing model offered by many data enclaves is well suited to executing large compute tasks; however, it is far from ideal for day-to-day discovery science. As researchers must submit jobs to queues and wait for results, the high latencies inherent in queue-based, batch computing systems hinder interactive analysis. In this paper we describe how we have augmented the Cloud Kotta secure data enclave to support collaborative and interactive analysis of sensitive data. Our model uses Jupyter notebooks as a flexible analysis environment and Python language constructs to support the execution of arbitrary functions on private data within this secure framework.

Comment: To appear in Proceedings of Workshop on Scientific Cloud Computing, Washington, DC USA, June 2017 (ScienceCloud 2017), 7 pages

arXiv.org e-Print Archive

Crossref

LCOGT Network Observatory Operations

Authors: Boroson Todd; Burleson Ben; Conway Patrick; de Vera Jon; Elphick Mark; Haworth Brian; Hjelstrom Annie; Pickles Andrew; Rosing Wayne; Saunders Eric; Thomas Doug; Walker Zach; White Gary; Willis Mark
Publication venue: SPIE-Intl Soc Optical Eng
Publication date: 11/07/2014

We describe the operational capabilities of the Las Cumbres Observatory Global Telescope Network. We summarize our hardware and software for maintaining and monitoring network health. We focus on methodologies that use the automated system to monitor the availability of sites, instruments and telescopes, to monitor performance, permit automatic recovery, and provide automatic error reporting. The same jTCS control system is used on telescopes of apertures 0.4m, 0.8m, 1m and 2m, and for multiple instruments on each. We describe our network operational model, including workloads, and illustrate our current tools and operational performance indicators, including telemetry and metrics reporting from on-site reductions. The system was conceived and designed to establish effective, reliable autonomous operations, with automatic monitoring and recovery, minimizing human intervention while maintaining quality. We illustrate how far we have been able to achieve that.

Comment: 13 pages, 9 figures

arXiv.org e-Print Archive

Crossref

Real time truck tracking mass haul android application for construction collaboration cloud

Author: Hassan Muhammad
Publication date: 21/11/2016

Aaltodoc Publication Archive