9 research outputs found

Testing Data Transformations in MapReduce Programs

Author: Morán Barbón Jesús
Riva Álvarez Claudio A. de la
Tuya González Pablo Javier
Publication venue: ACM
Publication date: 01/01/2015

MapReduce is a parallel data processing paradigm oriented toward processing large volumes of information in data-intensive applications, such as Big Data environments. A characteristic of these applications is that they can have different data sources and data formats. For these reasons, the inputs may contain poor-quality data that can cause a failure if the program does not properly handle the variety of input data. The output of these programs is obtained from a number of input transformations that represent the program logic. This paper proposes a testing technique called MRFlow, which is based on data flow test criteria and oriented toward analyzing the transformations between the input and the output in order to detect defects in MapReduce programs. MRFlow is applied over some MapReduce programs and detects several defect

Repositorio Institucional de la Universidad de Oviedo

Testing data transformations in MapReduce programs

Author: Camargo L. C.
Li N.
Owens J. R.
Sneed H. M.
Publication venue: 'Association for Computing Machinery (ACM)'

Crossref

An Empirical Study on Quality Issues of Production Big Data Platform

Publication date: 06/03/2020

Big Data computing platforms have evolved into multi-tenant services. Service quality matters because system failures or performance slowdowns can adversely affect business and user experience. To date, there are few studies in the literature on service quality issues of production Big Data computing platforms. In this paper, we present an empirical study on the service quality issues of Microsoft ProductA, which is a company-wide multi-tenant Big Data computing platform, serving thousands of customers from hundreds of teams. ProductA has a well-defined escalation process (i.e., incident management process), which helps customers report service quality issues on a 24/7 basis. This paper investigates the common symptoms, causes, and mitigations of service quality issues in Big Data platforms. We conduct a comprehensive empirical study on 210 real service quality issues of ProductA. Our major findings include: (1) 21.0% of escalations are caused by hardware faults; (2) 36.2% are caused by system-side defects; (3) 37.2% are due to customer-side faults. We also studied the general diagnosis process and the commonly adopted mitigation solutions. Our study results provide valuable guidance on improving existing development and maintenance practices of production Big Data platforms, and motivate tool support

CiteSeerX

Effective testing for concurrency bugs

Author: Sousa da Fonseca Pedro José
Publication venue: Fakultät 6 - Naturwissenschaftlich-Technische Fakultät I. Fachrichtung 6.2 - Informatik
Publication date: 01/01/2015

In the current multi-core era, concurrency bugs are a serious threat to software reliability. As hardware becomes more parallel, concurrent programming will become increasingly pervasive. However, correct concurrent programming is known to be extremely challenging for developers and can easily lead to the introduction of concurrency bugs. This dissertation addresses this challenge by proposing novel techniques to help developers expose and detect concurrency bugs. We conducted a bug study to better understand the external and internal effects of real-world concurrency bugs. Our study revealed that a significant fraction of concurrency bugs qualify as semantic or latent bugs, which are two particularly challenging classes of concurrency bugs. Based on the insights from the study, we propose a concurrency bug detector, PIKE, that analyzes the behavior of program executions to infer whether concurrency bugs have been triggered during a concurrent execution. In addition, we present the design of a testing tool, SKI, that allows developers to test operating system kernels for concurrency bugs in a practical manner. SKI bridges the gap between user-mode testing and kernel-mode testing by enabling the systematic exploration of the kernel thread interleaving space. Our evaluation shows that both PIKE and SKI are effective at finding concurrency bugs

Universaar

MPG.PuRe

An Analysis of Partial Network Partitioning Failures in Modern Distributed Systems

Author: Alfatafta Mohammed
Publication venue: 'University of Waterloo'
Publication date: 19/12/2019

We present a comprehensive study of system failures from 12 popular systems caused by a peculiar type of network partitioning fault: partial partitions. Partial partitions isolate a set of nodes from some, but not all, nodes in the cluster. Our study reveals that the studied failures are catastrophic; they lead to data loss, complete system unavailability, or stale and dirty reads. Furthermore, our study reveals that these failures manifest easily: they are deterministic and can be triggered by isolating a single node, without any interaction with the system’s clients. We dissected the implemented fault tolerance techniques in eight popular systems. We identified four principled approaches for building a fault tolerance mechanism for partial partitions and identified the shortcomings of the current approaches. The currently implemented fault tolerance techniques are either specific to a particular protocol or implementation, or may lead to a complete cluster shutdown despite the availability of alternative network paths between the nodes. Finally, we present NIFTY, a generic communication layer that leverages the capabilities of modern software-defined networking to monitor and recover the connectivity of the cluster in case of partial network partitions. NIFTY is transparent to the application running on top of it. We built NiftyDB, a database system atop NIFTY. NiftyDB implements a set of optimizations that reduce the network overhead and operation latency in case of partial network partitioning. Our analysis and evaluation show that the proposed approach can effectively mask partial network partitioning faults without incurring additional overheads

University of Waterloo's Institutional Repository

An Analysis of Network-Partitioning Failures in Cloud Systems

Author: Alquraan Ahmed
Publication venue: 'University of Waterloo'
Publication date: 04/12/2018

We present a comprehensive study of 136 system failures attributed to network-partitioning faults from 25 widely used distributed systems. We found that the majority of the failures led to catastrophic effects, such as data loss, reappearance of deleted data, broken locks, and system crashes. The majority of the failures can easily manifest once a network partition occurs: They require little to no client input, can be triggered by isolating a single node, and are deterministic. However, the number of test cases that one must consider is extremely large. Fortunately, we identify ordering, timing, and network fault characteristics that significantly simplify testing. Furthermore, we found that a significant number of the failures are due to design flaws in core system mechanisms. We found that the majority of the failures could have been avoided by design reviews, and could have been discovered by testing with network-partitioning fault injection. We built NEAT, a testing framework that simplifies the coordination of multiple clients and can inject different types of network-partitioning faults. We used NEAT to test seven popular systems and found and reported 32 failures

University of Waterloo's Institutional Repository

API Failures in Openstack Cloud Environments

Author: Musavi Mirkalaei Seyed Pooya
Publication date: 01/08/2017

Stories about service outages in cloud environments have been making the headlines recently. In many cases, the reliability of cloud infrastructure Application Programming Interfaces (APIs) was at fault. Hence, understanding the factors affecting the reliability of these APIs is important to improve the availability of cloud services. In this thesis, we investigate API failures in OpenStack, the most popular open source cloud platform to date. We mine the bugs of 25 modules within the 5 most important OpenStack APIs to understand API failures and their characteristics. Our results show that in OpenStack, one third of all API-related changes are due to fixing failures, with 7% of all fixes even changing the API interface, potentially breaking clients. Through a qualitative analysis of 230 sampled API failures, and 71 API failures that impacted third-party applications, we observed that the majority of API-related failures are due to small programming faults.
We also observed that small programming faults and configuration faults are the most frequent causes of failures that propagate to third-party applications. We conducted a survey with 38 OpenStack and third-party developers, in which participants were asked about the causes of API failures that propagate to third-party applications. These developers reported that small programming faults, configuration faults, low testing coverage, infrequent code reviews, and a rapid release frequency are the main reasons behind the appearance and propagation of API failures. We explored the possibility of using code style checkers to detect small programming and configuration faults early on, but found that in the majority of cases, these faults cannot be localized using such tools. Fortunately, the subject, message, and stack trace, as well as the reply lag between comments included in the failures’ bug reports, provide a good indication of the cause of the failure

PolyPublie

Automated Testing and Debugging for Big Data Analytics

Author: Gulzar Muhammad Ali
Publication venue: eScholarship, University of California
Publication date: 01/01/2020

The prevalence of big data analytics in almost every large-scale software system has generated a substantial push to build data-intensive scalable computing (DISC) frameworks such as Google MapReduce and Apache Spark that can fully harness the power of existing data centers. However, frameworks once used by domain experts are now being leveraged by data scientists, business analysts, and researchers. This shift in user demographics calls for immediate advancements in the development, debugging, and testing practices of big data applications, which are falling behind compared to the DISC framework design and implementation. In practice, big data applications often fail as users are unable to test all behaviors emerging from interleaving dataflow operators, user-defined functions, and framework's code. "Testing based on a random sample" rarely guarantees the reliability and "trial and error" and "print" debugging methods are expensive and time-consuming. Thus, the current practice of developing a big data application must be improved and the tools built to enhance the developer's productivity must adapt to the distinct characteristics of data-intensive scalable computing. By synthesizing ideas from software engineering and database systems, our hypothesis is that we can design effective and scalable testing and debugging algorithms for big data analytics without compromising the performance and efficiency of the underlying DISC framework. To design such techniques, we investigate how we can build interactive and responsive debugging primitives that significantly reduce the debugging time, yet do not pose much performance overhead on big data applications. Furthermore, we investigate how we can leverage data provenance techniques from databases and fault-isolation algorithms from software engineering to pinpoint the minimal subset of failure-inducing inputs efficiently. 
To improve the reliability of big data analytics, we investigate how we can abstract the semantics of dataflow operators and use them in tandem with the semantics of user-defined functions to generate a minimum set of synthetic test inputs capable of revealing more defects than the entire input dataset.
To examine the first hypothesis, we introduce interactive, real-time debugging primitives for big data analytics through innovative and scalable debugging features such as simulated breakpoint, dynamic watchpoint, and crash culprit identification. Second, we design a new automated fault localization approach that combines insights from both the software engineering and database literature to bring delta debugging closer to a reality in the big data applications by leveraging data provenance and by constructing systems optimizations for debugging provenance queries. Lastly, we devise a new symbolic-execution based white-box testing algorithm for big data applications that abstracts the implementation of dataflow operators using logical specifications instead of modeling their implementations and combines them with the semantics of any arbitrary user-defined function. We instantiate the idea of an interactive debugging algorithm as BigDebug, the idea of an automated debugging algorithm as BigSift, and the idea of symbolic execution-based testing as BigTest. Our investigation shows that the interactive debugging primitives can scale to terabytes---our record-level tracing incurs less than 25% overhead on average and provides up to 100% time saving compared to the baseline replay debugger. Second, we observe that by combining data provenance with delta debugging, we can identify the minimum faulty input in just under 30% of the original job execution time. 
Lastly, we verify that by abstracting dataflow operators using logical specifications, we can efficiently generate the most concise test data suitable for local testing while revealing twice as many faults as prior approaches. Our investigations collectively demonstrate that developer productivity can be significantly improved through effective and scalable testing and debugging techniques for big data analytics, without impacting the DISC framework's performance. This dissertation affirms the feasibility of automated debugging and testing techniques for big data analytics---techniques that were previously considered infeasible for large-scale data processing

eScholarship - University of California

A Characteristic Study on Failures of Production Distributed Data-Parallel Programs

Author: Haibo Lin
Haoxiang Lin
Hucheng Zhou
Sihan Li
Tao Xie
Tian Xiao
Wei Lin
Publication date: 10/10/2013

SCOPE is adopted by thousands of developers from tens of different product teams in Microsoft Bing for daily web-scale data processing, including index building, search ranking, and advertisement display. A SCOPE job is composed of declarative SQL-like queries and imperative C# user-defined functions (UDFs), which are executed in a pipeline by thousands of machines. Tens of thousands of SCOPE jobs are executed on Microsoft clusters per day, and some of them fail after a long execution time and thus waste tremendous resources. Reducing SCOPE failures would save significant resources. This paper presents a comprehensive characteristic study on 200 SCOPE failures/fixes and 50 SCOPE failures with debugging statistics from Microsoft Bing, investigating not only major failure types, failure sources, and fixes, but also current debugging practice. Our major findings include: (1) most of the failures (84.5%) are caused by defects in data processing rather than defects in code logic; (2) table-level failures (22.5%) are mainly caused by programmers' mistakes and frequent data-schema changes, while row-level failures (62%) are mainly caused by exceptional data; (3) 93% of fixes do not change data processing logic; (4) 8% of failures have their root cause not at the failure-exposing stage, making current debugging practice insufficient in this case. Our study results provide valuable guidelines for future development of data-parallel programs. We believe that these guidelines are not limited to SCOPE, but can also be generalized to other similar data-parallel platforms.

CiteSeerX

Crossref