    Testing Data Transformations in MapReduce Programs

    MapReduce is a parallel data processing paradigm designed to process large volumes of information in data-intensive applications, such as Big Data environments. These applications often draw on different data sources and data formats, so the inputs may contain poor-quality data that causes a failure if the program does not properly handle the variety of input data. The output of these programs is obtained through a series of input transformations that implement the program logic. This paper proposes MRFlow, a testing technique based on data flow test criteria and oriented to analyzing the transformations between input and output in order to detect defects in MapReduce programs. Applied to several MapReduce programs, MRFlow detects several defects.
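    The following minimal Python sketch (not the paper's MRFlow tool; all names are illustrative) shows the kind of input-to-output transformation such a data-flow-oriented test would target, together with a test that feeds the job one malformed record alongside well-formed ones.

```python
# Minimal sketch (not MRFlow): a word-count style transformation and a test
# that exercises the input-to-output data flow, including a poor-quality
# record that the mapper must tolerate.
from collections import defaultdict


def map_record(line):
    """Mapper: emit (word, 1) pairs; skip records it cannot parse."""
    if not isinstance(line, str):        # defensive check for bad input data
        return []
    return [(word.lower(), 1) for word in line.split()]


def reduce_pairs(pairs):
    """Reducer: sum counts per key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


def run_job(records):
    """Drive the map and reduce phases sequentially (a test double for a cluster)."""
    intermediate = []
    for record in records:
        intermediate.extend(map_record(record))
    return reduce_pairs(intermediate)


if __name__ == "__main__":
    # Transformation-focused test: two well-formed lines plus one malformed
    # (non-string) record that must not crash the job.
    output = run_job(["big data", "Big DATA", None])
    assert output == {"big": 2, "data": 2}
    print("transformation test passed:", output)
```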

    An Empirical Study on Quality Issues of Production Big Data Platform

    Abstract: Big Data computing platforms have evolved into multi-tenant services. Their service quality matters because system failures or performance slowdowns can adversely affect business and user experience. To date, there are few studies in the literature on the service quality issues of production Big Data computing platforms. In this paper, we present an empirical study of the service quality issues of Microsoft ProductA, a company-wide multi-tenant Big Data computing platform serving thousands of customers from hundreds of teams. ProductA has a well-defined escalation process (i.e., incident management process), which helps customers report service quality issues on a 24/7 basis. This paper investigates the common symptoms, causes, and mitigations of service quality issues in a Big Data platform. We conduct a comprehensive empirical study of 210 real service quality issues of ProductA. Our major findings include: (1) 21.0% of escalations are caused by hardware faults; (2) 36.2% are caused by system-side defects; (3) 37.2% are due to customer-side faults. We also studied the general diagnosis process and the commonly adopted mitigation solutions. Our study results provide valuable guidance for improving existing development and maintenance practices for production Big Data platforms, and motivate tool support.

    Effective testing for concurrency bugs

    In the current multi-core era, concurrency bugs are a serious threat to software reliability. As hardware becomes more parallel, concurrent programming will become increasingly pervasive. However, correct concurrent programming is known to be extremely challenging for developers and can easily lead to the introduction of concurrency bugs. This dissertation addresses this challenge by proposing novel techniques to help developers expose and detect concurrency bugs. We conducted a bug study to better understand the external and internal effects of real-world concurrency bugs. Our study revealed that a significant fraction of concurrency bugs qualify as semantic or latent bugs, which are two particularly challenging classes of concurrency bugs. Based on the insights from the study, we propose a concurrency bug detector, PIKE, that analyzes the behavior of program executions to infer whether concurrency bugs have been triggered during a concurrent execution. In addition, we present the design of a testing tool, SKI, that allows developers to test operating system kernels for concurrency bugs in a practical manner. SKI bridges the gap between user-mode testing and kernel-mode testing by enabling the systematic exploration of the kernel thread interleaving space. Our evaluation shows that both PIKE and SKI are effective at finding concurrency bugs.
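    To make the idea of systematically exploring an interleaving space concrete, the sketch below (illustrative only, not the PIKE or SKI implementation) enumerates every schedule of two logical threads performing a non-atomic counter increment and flags the schedules that lose an update.

```python
# Minimal sketch of systematic interleaving exploration in user space
# (an illustration of the idea behind tools like SKI, not their implementation).
from itertools import combinations


def interleavings(n_a, n_b):
    """Yield schedules as lists of thread ids ('A'/'B') covering all orderings."""
    slots = n_a + n_b
    for a_positions in combinations(range(slots), n_a):
        schedule = ['B'] * slots
        for pos in a_positions:
            schedule[pos] = 'A'
        yield schedule


def run(schedule):
    """Execute a non-atomic counter increment from two threads under `schedule`."""
    shared = {"counter": 0}
    # Each logical thread performs two steps: load, then store(loaded value + 1).
    state = {"A": {"step": 0, "tmp": 0}, "B": {"step": 0, "tmp": 0}}
    for tid in schedule:
        t = state[tid]
        if t["step"] == 0:                     # load
            t["tmp"] = shared["counter"]
        else:                                  # store
            shared["counter"] = t["tmp"] + 1
        t["step"] += 1
    return shared["counter"]


if __name__ == "__main__":
    total = sum(1 for _ in interleavings(2, 2))
    buggy = [s for s in interleavings(2, 2) if run(s) != 2]
    # The lost-update interleavings are found deterministically, not by chance.
    print(f"{len(buggy)} of {total} schedules lose an update")
```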

    An Analysis of Partial Network Partitioning Failures in Modern Distributed Systems

    We present a comprehensive study of system failures from 12 popular systems caused by a peculiar type of network partitioning fault: partial partitions. Partial partitions isolate a set of nodes from some, but not all, nodes in the cluster. Our study reveals that the studied failures are catastrophic; they lead to data loss, complete system unavailability, or stale and dirty reads. Furthermore, our study reveals that these failures are easy to manifest: they are deterministic, they can be triggered by isolating a single node, and they require no interaction with the system’s clients. We dissected the fault tolerance techniques implemented in eight popular systems. We identified four principled approaches for building a fault tolerance mechanism for partial partitions and identified the shortcomings of the current approaches. The currently implemented fault tolerance techniques are either specific to a particular protocol or implementation, or may lead to a complete cluster shutdown despite the availability of alternative network paths between the nodes. Finally, we present NIFTY, a generic communication layer that leverages the capabilities of modern software-defined networking to monitor and recover the connectivity of the cluster in case of partial network partitions. NIFTY is transparent to the application running on top of it. We built NiftyDB, a database system atop NIFTY. NiftyDB implements a set of optimizations that reduce the network overhead and operation latency in case of partial network partitioning. Our analysis and evaluation show that the proposed approach can effectively mask partial network partitioning faults without incurring additional overhead.
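    The core recovery idea, routing around a broken link through a node that still reaches both sides, can be sketched in a few lines (an illustration of the approach only, not NIFTY's SDN implementation; node names are hypothetical):

```python
# Sketch of the idea behind NIFTY-style recovery: when a partial partition
# leaves two nodes unable to talk directly, forward their traffic through a
# node that still reaches both sides.

def find_detours(nodes, connected):
    """For every disconnected pair, pick a bridge node reachable from both ends.

    `connected` is a symmetric dict: connected[(a, b)] is True if a and b
    can still exchange packets directly.
    """
    detours = {}
    for a in nodes:
        for b in nodes:
            if a >= b or connected[(a, b)]:
                continue
            bridges = [n for n in nodes
                       if n not in (a, b) and connected[(a, n)] and connected[(n, b)]]
            if bridges:
                detours[(a, b)] = bridges[0]   # install forwarding via this node
    return detours


if __name__ == "__main__":
    nodes = ["n1", "n2", "n3"]
    # Partial partition: n1 and n2 lost their link, but both still reach n3.
    links = {("n1", "n2"): False, ("n1", "n3"): True, ("n2", "n3"): True}
    connected = {}
    for (x, y), up in links.items():
        connected[(x, y)] = connected[(y, x)] = up
    print(find_detours(nodes, connected))      # {('n1', 'n2'): 'n3'}
```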

    An Analysis of Network-Partitioning Failures in Cloud Systems

    We present a comprehensive study of 136 system failures attributed to network-partitioning faults from 25 widely used distributed systems. We found that the majority of the failures led to catastrophic effects, such as data loss, reappearance of deleted data, broken locks, and system crashes. The majority of the failures can easily manifest once a network partition occurs: they require little to no client input, can be triggered by isolating a single node, and are deterministic. However, the number of test cases that one must consider is extremely large. Fortunately, we identify ordering, timing, and network fault characteristics that significantly simplify testing. Furthermore, we found that a significant number of the failures are due to design flaws in core system mechanisms. We found that the majority of the failures could have been avoided by design reviews, and could have been discovered by testing with network-partitioning fault injection. We built NEAT, a testing framework that simplifies the coordination of multiple clients and can inject different types of network-partitioning faults. We used NEAT to test seven popular systems and found and reported 32 failures.
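    As an illustration of network-partitioning fault injection (not the NEAT framework itself; the addresses are hypothetical), the sketch below emits standard iptables DROP rules that isolate one node from the rest of the cluster, the minimal trigger the study found sufficient for many failures:

```python
# Illustrative sketch of network-partitioning fault injection: generate
# iptables rules that cut traffic between two groups of nodes. Healing the
# partition would remove the same rules with `iptables -D`.

def partition_rules(group_a, group_b):
    """Return shell commands that drop traffic between the two groups."""
    rules = []
    for src in group_a:
        for dst in group_b:
            rules.append(f"iptables -A INPUT -s {src} -j DROP   # run on {dst}")
            rules.append(f"iptables -A INPUT -s {dst} -j DROP   # run on {src}")
    return rules


if __name__ == "__main__":
    # Hypothetical addresses: isolate one node from the rest of the cluster.
    for rule in partition_rules(["10.0.0.1"], ["10.0.0.2", "10.0.0.3"]):
        print(rule)
```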

    API Failures in Openstack Cloud Environments

    Stories about service outages in cloud environments have been making the headlines recently. In many cases, the reliability of cloud infrastructure Application Programming Interfaces (APIs) was at fault. Hence, understanding the factors affecting the reliability of these APIs is important for improving the availability of cloud services. In this thesis, we investigate API failures in OpenStack, the most popular open source cloud platform to date. We mine the bugs of 25 modules within the 5 most important OpenStack APIs to understand API failures and their characteristics. Our results show that in OpenStack, one third of all API-related changes are due to fixing failures, with 7% of all fixes even changing the API interface, potentially breaking clients. Through a qualitative analysis of 230 sampled API failures, and 71 API failures that impacted third-party applications, we observed that the majority of API-related failures are due to small programming faults. We also observed that small programming faults and configuration faults are the most frequent causes of failures that propagate to third-party applications. We conducted a survey with 38 OpenStack and third-party developers, in which participants were asked about the causes of API failures that propagate to third-party applications. These developers reported that small programming faults, configuration faults, low testing coverage, infrequent code reviews, and a rapid release frequency are the main reasons behind the appearance and propagation of API failures. We explored the possibility of using code style checkers to detect small programming and configuration faults early on, but found that in the majority of cases, these faults cannot be localized using such tools. Fortunately, the subject, message, and stack trace, as well as the reply lag between comments in the failures’ bug reports, provide a good indication of the cause of a failure.

    A Characteristic Study on Failures of Production Distributed Data-Parallel Programs

    Abstract: SCOPE is adopted by thousands of developers from tens of different product teams in Microsoft Bing for daily web-scale data processing, including index building, search ranking, and advertisement display. A SCOPE job is composed of declarative SQL-like queries and imperative C# user-defined functions (UDFs), which are executed as a pipeline by thousands of machines. Tens of thousands of SCOPE jobs are executed on Microsoft clusters per day, and some of them fail after a long execution time, wasting tremendous resources. Reducing SCOPE failures would therefore save significant resources. This paper presents a comprehensive characteristic study of 200 SCOPE failures/fixes and 50 SCOPE failures with debugging statistics from Microsoft Bing, investigating not only the major failure types, failure sources, and fixes, but also current debugging practice. Our major findings include: (1) most of the failures (84.5%) are caused by defects in data processing rather than defects in code logic; (2) table-level failures (22.5%) are mainly caused by programmers’ mistakes and frequent data-schema changes, while row-level failures (62%) are mainly caused by exceptional data; (3) 93% of fixes do not change data processing logic; (4) 8% of failures have a root cause that is not at the failure-exposing stage, making current debugging practice insufficient in this case. Our study results provide valuable guidelines for the future development of data-parallel programs. We believe that these guidelines are not limited to SCOPE, but can also be generalized to other similar data-parallel platforms.
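    The row-level failure class can be illustrated with a small sketch (SCOPE UDFs are written in C#; this Python analogue and its column layout are purely hypothetical): a strict per-row parser aborts the whole stage on one exceptional row, whereas a defensive variant quarantines it to a side output.

```python
# Illustrative sketch of the row-level defect pattern described in the study:
# a per-row transformation that fails on exceptional data unless written
# defensively.

def parse_row_strict(row):
    """Naive UDF: assumes every row has a numeric third column; one bad row
    late in a multi-hour job aborts the whole stage."""
    user, query, clicks = row.split("\t")
    return user, query, int(clicks)


def parse_row_defensive(row, bad_rows):
    """Defensive UDF: quarantine exceptional rows instead of failing the job."""
    parts = row.split("\t")
    if len(parts) != 3 or not parts[2].isdigit():
        bad_rows.append(row)          # route to a side output for later triage
        return None
    user, query, clicks = parts
    return user, query, int(clicks)


if __name__ == "__main__":
    rows = ["u1\tbig data\t3", "u2\tscope\tN/A"]   # second row is exceptional
    # parse_row_strict(rows[1]) would raise ValueError and fail the whole job.
    bad = []
    parsed = [r for r in (parse_row_defensive(row, bad) for row in rows) if r]
    print(parsed)   # [('u1', 'big data', 3)]
    print(bad)      # ['u2\tscope\tN/A']
```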