922 research outputs found

    Monitoring and analysis system for performance troubleshooting in data centers

    Get PDF
    It was not long ago. On Christmas Eve 2012, a war of troubleshooting began in Amazon data centers. It started at 12:24 PM, with an mistaken deletion of the state data of Amazon Elastic Load Balancing Service (ELB for short), which was not realized at that time. The mistake first led to a local issue that a small number of ELB service APIs were affected. In about six minutes, it evolved into a critical one that EC2 customers were significantly affected. One example was that Netflix, which was using hundreds of Amazon ELB services, was experiencing an extensive streaming service outage when many customers could not watch TV shows or movies on Christmas Eve. It took Amazon engineers 5 hours 42 minutes to find the root cause, the mistaken deletion, and another 15 hours and 32 minutes to fully recover the ELB service. The war ended at 8:15 AM the next day and brought the performance troubleshooting in data centers to world’s attention. As shown in this Amazon ELB case.Troubleshooting runtime performance issues is crucial in time-sensitive multi-tier cloud services because of their stringent end-to-end timing requirements, but it is also notoriously difficult and time consuming. To address the troubleshooting challenge, this dissertation proposes VScope, a flexible monitoring and analysis system for online troubleshooting in data centers. VScope provides primitive operations which data center operators can use to troubleshoot various performance issues. Each operation is essentially a series of monitoring and analysis functions executed on an overlay network. We design a novel software architecture for VScope so that the overlay networks can be generated, executed and terminated automatically, on-demand. From the troubleshooting side, we design novel anomaly detection algorithms and implement them in VScope. By running anomaly detection algorithms in VScope, data center operators are notified when performance anomalies happen. We also design a graph-based guidance approach, called VFocus, which tracks the interactions among hardware and software components in data centers. VFocus provides primitive operations by which operators can analyze the interactions to find out which components are relevant to the performance issue. VScope’s capabilities and performance are evaluated on a testbed with over 1000 virtual machines (VMs). Experimental results show that the VScope runtime negligibly perturbs system and application performance, and requires mere seconds to deploy monitoring and analytics functions on over 1000 nodes. This demonstrates VScope’s ability to support fast operation and online queries against a comprehensive set of application to system/platform level metrics, and a variety of representative analytics functions. When supporting algorithms with high computation complexity, VScope serves as a ‘thin layer’ that occupies no more than 5% of their total latency. Further, by using VFocus, VScope can locate problematic VMs that cannot be found via solely application-level monitoring, and in one of the use cases explored in the dissertation, it operates with levels of perturbation of over 400% less than what is seen for brute-force and most sampling-based approaches. We also validate VFocus with real-world data center traces. The experimental results show that VFocus has troubleshooting accuracy of 83% on average.Ph.D

    AI for IT Operations (AIOps) on Cloud Platforms: Reviews, Opportunities and Challenges

    Full text link
    Artificial Intelligence for IT operations (AIOps) aims to combine the power of AI with the big data generated by IT Operations processes, particularly in cloud infrastructures, to provide actionable insights with the primary goal of maximizing availability. There are a wide variety of problems to address, and multiple use-cases, where AI capabilities can be leveraged to enhance operational efficiency. Here we provide a review of the AIOps vision, trends challenges and opportunities, specifically focusing on the underlying AI techniques. We discuss in depth the key types of data emitted by IT Operations activities, the scale and challenges in analyzing them, and where they can be helpful. We categorize the key AIOps tasks as - incident detection, failure prediction, root cause analysis and automated actions. We discuss the problem formulation for each task, and then present a taxonomy of techniques to solve these problems. We also identify relatively under explored topics, especially those that could significantly benefit from advances in AI literature. We also provide insights into the trends in this field, and what are the key investment opportunities

    Research Week 2014

    Get PDF

    Speech & Hearing Sciences 2013 APR Self-Study & Documents

    Get PDF
    UNM Sppech & Hearing Sciences APR self-study report, review team report, response to review report, and initial action plan for Fall 2013, fulfilling requirements of the Higher Learning Commission

    A survey of denial-of-service and distributed denial of service attacks and defenses in cloud computing

    Get PDF
    Cloud Computing is a computingmodel that allows ubiquitous, convenient and on-demand access to a shared pool of highly configurable resources (e.g., networks, servers, storage, applications and services). Denial-of-Service (DoS) and Distributed Denial-of-Service (DDoS) attacks are serious threats to the Cloud services’ availability due to numerous new vulnerabilities introduced by the nature of the Cloud, such as multi-tenancy and resource sharing. In this paper, new types of DoS and DDoS attacks in Cloud Computing are explored, especially the XML-DoS and HTTP-DoS attacks, and some possible detection and mitigation techniques are examined. This survey also provides an overview of the existing defense solutions and investigates the experiments and metrics that are usually designed and used to evaluate their performance, which is helpful for the future research in the domain

    The Use of Effective Risk Management in Cloud Computing Projects

    Get PDF
    Project management is one of the most important procedures that promote delivery of services. Examples of such projects include cloud computing projects. Despite the potential benefits associated with cloud computing projects, there are a number of security risks that when not properly managed, can always lead into the organization suffering major loses. In the adoption of cloud computing systems, project managers should have secure and well-configured platforms to reduce and control risks associated with cloud computing systems. There is a need for adopting the best risk management tools, techniques, and operations to achieve the desired results in the process of adopting cloud computing systems in project management. Risk management main elements include establishing the context, analysing the risks as well as evaluating and monitoring the risks and the communicating such risks to various stakeholders. There is need for integrated risks management to provide proper management, flexible planning as well as recovery plans in case of problems. The top management should always quantify the risks involved and evaluate its major effects on the firm business operations and activities. The present paper aims to achieve its primary objective and purpose by providing a comprehensive review of previous literature related to the use of effective risk management in cloud computing projects. The paper specifically looks at the use of effective risk management in cloud computing projects

    Air Force Institute of Technology Research Report 2003

    Get PDF
    This report summarizes the research activities of the Air Force Institute of Technology’s Graduate School of Engineering and Management. It describes research interests and faculty expertise; lists student theses/dissertations; identifies research sponsors and contributions; and outlines the procedures for contacting the school. Included in the report are: faculty publications, conference presentations, consultations, and funded research projects. Research was conducted in the areas of Aeronautical and Astronautical Engineering, Electrical Engineering and Electro-Optics, Computer Engineering and Computer Science, Systems and Engineering Management, Operational Sciences, and Engineering Physics
    • …
    corecore