Monitoring and analysis system for performance troubleshooting in data centers
It was not long ago. On Christmas Eve 2012, a war of troubleshooting began in Amazon data centers. It started at 12:24 PM with a mistaken deletion of the state data of the Amazon Elastic Load Balancing service (ELB for short), which was
not recognized at the time. The mistake first caused a local issue in which a small number of ELB service APIs were affected. In about six minutes, it evolved into a critical issue that significantly affected EC2 customers. For example, Netflix, which was using hundreds of Amazon ELB services, experienced an extensive streaming outage, and many customers could not watch TV shows or movies on Christmas Eve. It took Amazon engineers 5 hours and 42 minutes to find the root cause, the mistaken deletion, and another 15 hours and 32 minutes to fully recover the ELB service. The war ended at 8:15 AM the next day and brought performance
troubleshooting in data centers to the world's attention. As the Amazon ELB case shows, troubleshooting runtime performance issues is crucial in time-sensitive multi-tier cloud services because of their stringent end-to-end timing requirements, but it is also notoriously difficult and time-consuming.
To address the troubleshooting challenge, this dissertation proposes VScope, a flexible monitoring and analysis system for online troubleshooting in data centers.
VScope provides primitive operations that data center operators can use to troubleshoot various performance issues. Each operation is essentially a series of monitoring and analysis functions executed on an overlay network. We design a novel
software architecture for VScope so that the overlay networks can be generated, executed, and terminated automatically and on demand. On the troubleshooting side, we design novel anomaly detection algorithms and implement them in VScope. By
running these algorithms in VScope, data center operators are notified when performance anomalies occur. We also design a graph-based guidance approach, called VFocus, which tracks the interactions among hardware and software components in data centers. VFocus provides primitive operations with which operators can analyze these interactions to find out which components are relevant to the
performance issue.
VScope’s capabilities and performance are evaluated on a testbed with over 1000 virtual machines (VMs). Experimental results show that the VScope runtime negligibly perturbs system and application performance, and requires mere seconds to deploy monitoring and analytics functions on over 1000 nodes. This demonstrates VScope’s ability to support fast operation and online queries against a comprehensive set of metrics, from the application level to the system/platform level, and a variety of representative analytics functions. When supporting algorithms with high computational complexity, VScope serves as a ‘thin layer’ that accounts for no more than 5% of their total latency. Further, by using VFocus, VScope can locate problematic VMs that cannot be found
through application-level monitoring alone, and in one of the use cases explored in the dissertation, it operates with levels of perturbation of over 400% less than what is seen for brute-force and most sampling-based approaches. We also validate VFocus
with real-world data center traces. The experimental results show that VFocus has a troubleshooting accuracy of 83% on average.
Ph.D.
Recommended from our members
FABRIC: A National-Scale Programmable Experimental Network Infrastructure
FABRIC is a unique national research infrastructure to enable cutting-edge and exploratory research at scale in networking, cybersecurity, distributed computing and storage systems, machine learning, and science applications. It is an everywhere-programmable nationwide instrument composed of novel extensible network elements equipped with large amounts of compute and storage, interconnected by high-speed, dedicated optical links. It will connect a number of specialized testbeds for cloud research (NSF Cloud testbeds CloudLab and Chameleon), for research beyond 5G technologies (Platforms for Advanced Wireless Research or PAWR), as well as production high-performance computing facilities and science instruments to create a rich fabric for a wide variety of experimental activities.
Enabling Interactive Analytics of Secure Data using Cloud Kotta
Research, especially in the social sciences and humanities, is increasingly
reliant on the application of data science methods to analyze large amounts of
(often private) data. Secure data enclaves provide a solution for managing and
analyzing private data. However, such enclaves do not readily support discovery
science---a form of exploratory or interactive analysis by which researchers
execute a range of (sometimes large) analyses in an iterative and collaborative
manner. The batch computing model offered by many data enclaves is well suited
to executing large compute tasks; however, it is far from ideal for day-to-day
discovery science. As researchers must submit jobs to queues and wait for
results, the high latencies inherent in queue-based, batch computing systems
hinder interactive analysis. In this paper we describe how we have augmented
the Cloud Kotta secure data enclave to support collaborative and interactive
analysis of sensitive data. Our model uses Jupyter notebooks as a flexible
analysis environment and Python language constructs to support the execution of
arbitrary functions on private data within this secure framework.
Comment: To appear in Proceedings of Workshop on Scientific Cloud Computing,
Washington, DC USA, June 2017 (ScienceCloud 2017), 7 pages
LCOGT Network Observatory Operations
We describe the operational capabilities of the Las Cumbres Observatory
Global Telescope Network. We summarize our hardware and software for
maintaining and monitoring network health. We focus on methodologies to utilize
the automated system to monitor availability of sites, instruments and
telescopes, to monitor performance, permit automatic recovery, and provide
automatic error reporting. The same jTCS control system is used on telescopes
of apertures 0.4m, 0.8m, 1m and 2m, and for multiple instruments on each. We
describe our network operational model, including workloads, and illustrate our
current tools, and operational performance indicators, including telemetry and
metrics reporting from on-site reductions. The system was conceived and
designed to establish effective, reliable autonomous operations, with automatic
monitoring and recovery - minimizing human intervention while maintaining
quality. We illustrate the extent to which we have achieved that.
Comment: 13 pages, 9 figures