
    Computing at massive scale: Scalability and dependability challenges

    Large-scale Cloud systems and big data analytics frameworks are now widely used for practical services and applications. However, with the increase in data volume, the heterogeneity of workloads and resources, and the dynamic nature of massive user requests, the uncertainties and complexity of resource management and service provisioning increase dramatically, often resulting in poor resource utilization, vulnerable system dependability, and user-perceived performance degradation. In this paper we report our latest understanding of the current and future challenges in this area, and discuss both existing and potential solutions to the problems, especially those concerned with system efficiency, scalability and dependability. We first introduce a data-driven analysis methodology for characterizing resource and workload patterns and tracing performance bottlenecks in a massive-scale distributed computing environment. We then examine and analyze several fundamental challenges and the solutions we are developing to tackle them, including, for example, incremental but decentralized resource scheduling, incremental messaging communication, rapid system failover, and request handling parallelism. We integrate these solutions with our data analysis methodology in order to establish an engineering approach that facilitates the optimization, tuning and verification of massive-scale distributed systems. We aim to develop and offer innovative methods and mechanisms for future computing platforms that will provide strong support for new big data and IoE (Internet of Everything) applications.

    Holistic cloud computing environmental quantification and behavioural analysis

    Cloud computing systems are characterized as large-scale, multi-tenant systems that can dynamically scale computational resources up and down for consumers with diverse Quality-of-Service requirements. In recent years, a number of dependability and resource management approaches have been proposed for Cloud computing datacenters. However, there is still a lack of real-world Cloud datasets that extensively analyse and model Cloud computing characteristics and quantify their effect on system dimensions such as resource utilization, user behavioural patterns and failure characteristics. This results in two research problems: first, without holistic analysis of real-world systems, Cloud characteristics and their dimensions cannot be quantified, leading to inaccurate research assumptions about Cloud system behaviour; second, simulated parameters used in state-of-the-art Cloud mechanisms currently rely on theoretical values that do not accurately represent real Cloud systems, as important parameters such as failure times and energy waste have not been quantified using empirical data. This presents a large gap in practicality and effectiveness between developing and evaluating mechanisms within simulated and real Cloud systems. This thesis presents a comprehensive method and empirical analysis of large-scale production Cloud computing environments in order to quantify system characteristics in terms of consumer submission and resource request patterns, workload behaviour, server utilization and failures. Furthermore, this work identifies areas of operational inefficiency within the system and quantifies the amount of energy waste created by failures. We discover that 4-10% of all server computation is wasted due to Termination Events, and that failures contribute approximately 11% of the total datacenter energy waste. These analyses of empirical data give researchers and Cloud providers an enhanced understanding of real Cloud behaviour, support system assumptions, and provide parameters that can be used to develop and validate the effectiveness of future energy-efficient and dependability mechanisms.
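    The kind of waste quantification described above can be illustrated with a small sketch. Assuming a trace of per-task records carrying a final outcome, CPU time consumed, and average utilization (hypothetical field names), and a simple linear idle-to-peak server power model, the wasted-computation and wasted-energy fractions could be estimated roughly as follows; the thesis's actual methodology and trace schema are far more extensive.

    # Minimal sketch: estimating compute and energy waste from terminated/failed
    # tasks in a cluster trace. Field names and the linear power model are
    # assumptions for illustration, not the thesis's actual schema or model.

    from dataclasses import dataclass

    IDLE_WATTS = 100.0   # assumed server idle power draw
    PEAK_WATTS = 200.0   # assumed server power draw at full utilization

    @dataclass
    class TaskRecord:
        task_id: str
        cpu_core_hours: float    # CPU time consumed before the task ended
        mean_utilization: float  # average utilization while running (0..1)
        outcome: str             # e.g. "FINISH", "KILL", "FAIL", "EVICT"

    def power_watts(utilization: float) -> float:
        """Simple linear power model between idle and peak draw."""
        return IDLE_WATTS + (PEAK_WATTS - IDLE_WATTS) * utilization

    def waste_summary(tasks: list[TaskRecord]) -> dict[str, float]:
        """Fractions of computation and energy attributable to non-successful tasks."""
        total_cpu = sum(t.cpu_core_hours for t in tasks)
        wasted_cpu = sum(t.cpu_core_hours for t in tasks if t.outcome != "FINISH")
        # Energy in watt-hours; only the ratio matters here.
        total_energy = sum(t.cpu_core_hours * power_watts(t.mean_utilization)
                           for t in tasks)
        wasted_energy = sum(t.cpu_core_hours * power_watts(t.mean_utilization)
                            for t in tasks if t.outcome != "FINISH")
        return {
            "wasted_cpu_fraction": wasted_cpu / total_cpu if total_cpu else 0.0,
            "wasted_energy_fraction": wasted_energy / total_energy if total_energy else 0.0,
        }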

    Dependability where the mobile world meets the enterprise world

    As we move toward increasingly larger scales of computing, the complexity of systems and networks has increased manifold, leading to massive failures of cloud providers (Amazon CloudFront, November 2014) and geographically localized outages of cellular services (T-Mobile, June 2014). In this dissertation, we investigate the dependability aspects of two of the most prevalent computing platforms today: smartphones and cloud computing. These two seemingly disparate platforms are part of a cohesive story: they interact to provide end-to-end services that are increasingly delivered over mobile platforms, examples being iCloud, Google Drive and their smartphone counterparts on iPhone and Android. In one of the early works on characterizing failures in dominant mobile OSes, we analyzed the bug repositories of Android and Symbian and found similarities in their failure modes [ISSRE2010]. We also presented a classification of root causes and quantified the impact of the ease of customizing smartphones on system reliability. Our evaluation of Inter-Component Communication in Android [DSN2012] shows an alarming number of exception handling errors, where a phone may be crashed by passing it malformed component invocation messages, even from unprivileged applications. In this work, we also suggest language extensions that can mitigate these problems. Mobile applications today are increasingly used to interact with enterprise-class web services commonly hosted in virtualized environments. Virtualization suffers from imperfect performance isolation, where contention for low-level hardware resources can impact application performance. Through a set of rigorous experiments in a private cloud testbed and in EC2, we show that interference-induced performance degradation is a reality. Our experiments have also shown that optimal configuration settings for web servers change during such phases of interference. Based on this observation, we design and implement the IC2 engine, which can mitigate the effects of interference by reconfiguring web server parameters [MW2014]. We further improve IC2 by incorporating it into a two-level configuration engine, named ICE, for managing web server clusters [ICAC2015]. Our evaluations show that, compared to an interference-agnostic configuration, IC2 can improve the response time of web servers by up to 40%, while ICE can improve response time by up to 94% during phases of interference.
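    The interference-mitigation idea summarized above can be sketched at a high level: watch for response-time degradation that is not explained by a rise in request rate, and switch the web server to an alternative parameter set. The sketch below is a minimal illustration of such a control loop; the thresholds, parameter names, and the apply_config() helper are hypothetical and do not reflect the actual IC2/ICE designs.

    # Illustrative sketch of interference-aware reconfiguration: when latency
    # degrades without a corresponding rise in load, switch the web server to a
    # more conservative parameter set. All values here are assumptions.

    from statistics import median

    BASELINE_CONFIG = {"MaxClients": 256, "KeepAliveTimeout": 5}
    INTERFERENCE_CONFIG = {"MaxClients": 128, "KeepAliveTimeout": 1}

    def apply_config(config: dict) -> None:
        # Placeholder: a real deployment would rewrite the server configuration
        # and trigger a graceful reload.
        print(f"reconfiguring web server with {config}")

    def detect_interference(latencies_ms: list[float], baseline_ms: float,
                            req_rate: float, baseline_rate: float) -> bool:
        """Flag interference if latency rose sharply while load stayed roughly flat."""
        return (median(latencies_ms) > 1.5 * baseline_ms
                and req_rate < 1.1 * baseline_rate)

    def control_loop(window_latencies: list[float], req_rate: float,
                     baseline_ms: float = 20.0, baseline_rate: float = 500.0) -> None:
        if detect_interference(window_latencies, baseline_ms, req_rate, baseline_rate):
            apply_config(INTERFERENCE_CONFIG)
        else:
            apply_config(BASELINE_CONFIG)

    control_loop([45.0, 52.0, 61.0], req_rate=510.0)  # degraded latency, flat load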

    Dependability Evaluation of Middleware Technology for Large-scale Distributed Caching

    Distributed caching systems (e.g., Memcached) are widely used by service providers to satisfy accesses by millions of concurrent clients. Given their large scale, modern distributed systems rely on a middleware layer to manage caching nodes, to make applications easier to develop, and to apply load balancing and replication strategies. In this work, we performed a dependability evaluation of three popular middleware platforms, namely Twemproxy by Twitter, Mcrouter by Facebook, and Dynomite by Netflix, to assess availability and performance under faults, including failures of Memcached nodes and congestion due to unbalanced workloads and network link bandwidth bottlenecks. We point out the different availability and performance trade-offs achieved by the three platforms, and scenarios in which a few faulty components cause cascading failures of the whole distributed system. Comment: 2020 IEEE 31st International Symposium on Software Reliability Engineering (ISSRE 2020).
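    A minimal client-side measurement for this kind of availability-under-faults experiment might look as follows, assuming a caching proxy (e.g., Twemproxy) listening locally and the pymemcache client library; the backend Memcached node would be killed or throttled externally while the loop runs. The address, key space, and timeouts are assumptions, not the paper's actual experimental setup.

    # Sketch: measure client-perceived availability through a caching proxy
    # while faults are injected into the backend. Values are illustrative.

    import time
    from pymemcache.client.base import Client  # pip install pymemcache

    PROXY_ADDR = ("127.0.0.1", 22121)  # assumed proxy listen address

    def measure_availability(duration_s: float = 30.0) -> float:
        """Issue SET/GET pairs for duration_s and return the success ratio.
        Kill or throttle a backend Memcached node externally while this runs."""
        client = Client(PROXY_ADDR, connect_timeout=0.5, timeout=0.5)
        ok, total, i = 0, 0, 0
        deadline = time.time() + duration_s
        while time.time() < deadline:
            key = f"key:{i % 1000}"
            total += 1
            try:
                client.set(key, b"payload")
                if client.get(key) is not None:
                    ok += 1
            except Exception:
                pass  # count as an unavailable request
            i += 1
        return ok / total if total else 0.0

    if __name__ == "__main__":
        print(f"client-perceived availability: {measure_availability():.2%}")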

    Quantitative Timed Analysis of Interactive Markov Chains

    This paper presents new algorithms and accompanying tool support for analyzing interactive Markov chains (IMCs), a stochastic timed 1½-player game in which delays are exponentially distributed. IMCs are compositional and act as a semantic model for engineering formalisms such as AADL and dynamic fault trees. We provide algorithms for determining the extremal expected time of reaching a set of states, and the long-run average of time spent in a set of states. The prototypical tool Imca supports these algorithms as well as the synthesis of ε-optimal piecewise constant timed policies for timed reachability objectives. Two case studies show the feasibility and scalability of the algorithms.
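    For expected-time objectives, the minimal expected time to reach a goal set G in an IMC is commonly characterized as the least fixed point of a Bellman-style equation over Markovian and interactive states; a sketch in standard notation (exit rates E(s), embedded probabilities P(s,s'), enabled actions Act(s)) is given below, though the paper's exact presentation may differ.

    % Sketch of the standard fixed-point characterization of minimal expected
    % reachability time in an IMC; notation is assumed, not quoted from the paper.
    \[
    eT^{\min}(s) =
    \begin{cases}
      0 & \text{if } s \in G,\\[4pt]
      \dfrac{1}{E(s)} + \sum_{s' \in S} P(s,s')\, eT^{\min}(s') & \text{if } s \notin G,\ s \text{ Markovian},\\[8pt]
      \min_{\alpha \in \mathrm{Act}(s)} eT^{\min}\!\big(\mathrm{succ}(s,\alpha)\big) & \text{if } s \notin G,\ s \text{ interactive}.
    \end{cases}
    \]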

    Straggler Root-Cause and Impact Analysis for Massive-scale Virtualized Cloud Datacenters

    The increased complexity and scale of virtualized distributed systems have resulted in the manifestation of emergent phenomena that substantially affect overall system performance. This phenomenon is known as the “Long Tail”, whereby a small proportion of task stragglers significantly impede job completion time. While existing work focuses on straggler detection and mitigation, there is limited work that empirically studies straggler root causes and quantifies their impact upon system operation. Such analysis is critical to ascertain in-depth knowledge of straggler occurrence and to focus development and research efforts on solving the Long Tail challenge. This paper provides an empirical analysis of straggler root causes within virtualized Cloud datacenters; we analyze two large-scale production systems to quantify the frequency and impact of stragglers, and propose a method for conducting root-cause analysis. Results demonstrate that approximately 5% of task stragglers impact 50% of total jobs for batch processes, and that 53% of stragglers occur due to high server resource utilization. We leverage these findings to propose a method for extreme straggler detection through a combination of offline execution-pattern modelling and online analytic agents that monitor tasks at runtime. Experiments show the approach is capable of detecting stragglers less than 11% into their execution lifecycle with 95% accuracy for short-duration jobs.
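    The combination of offline modelling and online monitoring described above can be illustrated with a small sketch: an offline model summarizes historical per-task progress rates, and an online agent flags a running task whose observed rate falls well below that model early in its execution. The thresholds, the notion of "progress", and the flagging rule are illustrative assumptions rather than the paper's actual detection method.

    # Illustrative sketch of early straggler flagging against an offline model
    # of historical per-task progress rates. Values are assumptions.

    from statistics import mean, pstdev

    def build_offline_model(historical_rates: list[float]) -> tuple[float, float]:
        """Summarize historical per-task progress rates (work units / second)."""
        return mean(historical_rates), pstdev(historical_rates)

    def is_potential_straggler(progress: float, elapsed_s: float,
                               model: tuple[float, float],
                               min_elapsed_s: float = 10.0,
                               k_sigma: float = 2.0) -> bool:
        """Flag a task whose observed rate is far below the historical mean."""
        if elapsed_s < min_elapsed_s:
            return False  # too early to judge
        mu, sigma = model
        observed_rate = progress / elapsed_s
        return observed_rate < mu - k_sigma * sigma

    model = build_offline_model([9.8, 10.2, 10.0, 9.9, 10.1])
    # ~4.6 units/s against a ~10 units/s historical mean -> flagged early.
    print(is_potential_straggler(progress=55.0, elapsed_s=12.0, model=model))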

    Developing a distributed electronic health-record store for India

    The DIGHT project is addressing the problem of building a scalable and highly available information store for the Electronic Health Records (EHRs) of the over one billion citizens of India.

    Log-based software monitoring: a systematic mapping study

    Modern software development and operations rely on monitoring to understand how systems behave in production. The data provided by application logs and the runtime environment are essential to detect and diagnose undesired behavior and improve system reliability. However, despite the rich ecosystem around industry-ready log solutions, monitoring complex systems and getting insights from log data remains a challenge. Researchers and practitioners have been actively working to address several challenges related to logs, e.g., how to effectively provide better tooling support for logging decisions to developers, how to effectively process and store log data, and how to extract insights from log data. A holistic view of the research effort on logging practices and automated log analysis is key to providing directions and disseminating the state of the art for technology transfer. In this paper, we study 108 papers (72 research track papers, 24 journal papers, and 12 industry track papers) from different communities (e.g., machine learning, software engineering, and systems) and structure the research field in light of the life-cycle of log data. Our analysis shows that (1) logging is challenging not only in open-source projects but also in industry, (2) machine learning is a promising approach to enable contextual analysis of source code for log recommendation, but further investigation is required to assess the usability of those tools in practice, (3) few studies have approached efficient persistence of log data, and (4) there are open opportunities to analyze application logs and to evaluate state-of-the-art log analysis techniques in a DevOps context.