
    An Empirical Study on Quality Issues of Production Big Data Platform

    The Big Data computing platform has evolved into a multi-tenant service. Service quality matters because system failures or performance slowdowns can adversely affect business and user experience. To date, there are few studies in the literature on service quality issues of production Big Data computing platforms. In this paper, we present an empirical study of the service quality issues of Microsoft ProductA, a company-wide multi-tenant Big Data computing platform serving thousands of customers from hundreds of teams. ProductA has a well-defined escalation process (i.e., an incident management process), which helps customers report service quality issues on a 24/7 basis. This paper investigates the common symptoms, causes, and mitigations of service quality issues in a Big Data platform. We conduct a comprehensive empirical study on 210 real service quality issues of ProductA. Our major findings include: (1) 21.0% of escalations are caused by hardware faults; (2) 36.2% are caused by system-side defects; (3) 37.2% are due to customer-side faults. We also studied the general diagnosis process and the commonly adopted mitigation solutions. Our study results provide valuable guidance for improving existing development and maintenance practices of production Big Data platforms, and motivate tool support.
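    At its core, the paper's cause breakdown is a tally of escalation records by root-cause category. A minimal sketch of such a tally is below; the records, field names, and labels are hypothetical and do not come from the study's data, though the three category labels mirror the paper's taxonomy:

    ```python
    from collections import Counter

    # Hypothetical escalation records; only the three cause labels
    # (hardware, system defect, customer fault) follow the paper.
    escalations = [
        {"id": 1, "cause": "hardware_fault"},
        {"id": 2, "cause": "system_defect"},
        {"id": 3, "cause": "customer_fault"},
        {"id": 4, "cause": "customer_fault"},
    ]

    # Count escalations per root cause and report each cause's share.
    counts = Counter(e["cause"] for e in escalations)
    total = len(escalations)
    for cause, n in counts.most_common():
        print(f"{cause}: {n} ({100 * n / total:.1f}%)")
    ```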

    Going Big: A Large-Scale Study on What Big Data Developers Ask

    Software developers are increasingly required to write big data code. However, they find big data software development challenging. To help these developers, it is necessary to understand the big data topics they are interested in and how difficult it is to find answers to questions on these topics. In this work, we conduct a large-scale study on Stack Overflow to understand the interests and difficulties of big data developers. To conduct the study, we develop a set of big data tags to extract big data posts from Stack Overflow; use topic modeling to group these posts into big data topics; group similar topics into categories to construct a topic hierarchy; analyze the popularity and difficulty of topics and their correlations; and discuss the implications of our findings for the practice, research, and education of big data software development, investigating how they coincide with the findings of previous work.
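    The pipeline the abstract describes, tag-based extraction followed by topic modeling, can be sketched with off-the-shelf LDA. The posts below are placeholders, and the study's actual tag set, corpus, and modeling choices may differ:

    ```python
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # Hypothetical post bodies; in the study these would be Stack Overflow
    # posts selected via a curated set of big data tags.
    posts = [
        "spark dataframe join runs out of memory on large dataset",
        "how to configure hdfs replication factor in hadoop",
        "kafka consumer group rebalancing keeps timing out",
    ]

    # Bag-of-words representation, then LDA to group posts into topics.
    vectorizer = CountVectorizer(stop_words="english")
    X = vectorizer.fit_transform(posts)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

    # The top-weighted words per topic give each topic a readable label.
    terms = vectorizer.get_feature_names_out()
    for i, weights in enumerate(lda.components_):
        top = [terms[j] for j in weights.argsort()[-5:][::-1]]
        print(f"topic {i}: {', '.join(top)}")
    ```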

    Big data quality framework: a holistic approach to continuous quality management

    Big Data is an essential research area for governments, institutions, and private agencies to support their analytics decisions. Big Data encompasses everything about data: how it is collected, processed, and analyzed to generate value-added, data-driven insights and decisions. Degradation in data quality may have unpredictable consequences, as confidence in the data and its source, and their worthiness, are lost. In the Big Data context, data characteristics such as volume, multiple heterogeneous data sources, and fast data generation increase the risk of quality degradation and require efficient mechanisms to check data worthiness. However, ensuring Big Data Quality (BDQ) is a very costly and time-consuming process, since excessive computing resources are required. Maintaining quality through the Big Data lifecycle requires quality profiling and verification before any processing decision. We propose a BDQ Management Framework for enhancing pre-processing activities while strengthening data control. The proposed framework uses a new concept called the Big Data Quality Profile, which captures the quality outline, requirements, attributes, dimensions, scores, and rules. Using the framework's Big Data profiling and sampling components, a faster and more efficient data quality estimation is performed before and after an intermediate pre-processing phase. The framework's exploratory profiling component plays an initial role in quality profiling; it uses a set of predefined quality metrics to evaluate important data quality dimensions. It generates quality rules by applying various pre-processing activities and their related functions. These rules mainly target the Data Quality Profile and result in quality scores for the selected quality attributes. We discuss the framework implementation and dataflow management across the various quality management processes, and the paper concludes with ongoing work on framework evaluation and deployment to support quality evaluation decisions.
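    The core idea, estimating quality dimensions on a sample before committing to full pre-processing, can be sketched as below. The metric names, sampling rate, and schema are assumptions for illustration, not the framework's actual Data Quality Profile:

    ```python
    import pandas as pd

    def quality_profile(df: pd.DataFrame, sample_frac: float = 0.1) -> dict:
        """Estimate per-column quality scores on a sample instead of the
        full dataset, in the spirit of the framework's profiling and
        sampling components. Metrics and names here are illustrative."""
        sample = df.sample(frac=sample_frac, random_state=0) if len(df) > 10 else df
        profile = {}
        for col in sample.columns:
            profile[col] = {
                # Share of non-missing values in the column.
                "completeness": 1.0 - sample[col].isna().mean(),
                # Share of distinct values among sampled rows.
                "uniqueness": sample[col].nunique() / max(len(sample), 1),
            }
        return profile

    # Toy data: one missing user and one duplicated row.
    df = pd.DataFrame({"user": ["a", "b", None, "a"], "age": [31, 25, 40, 31]})
    print(quality_profile(df, sample_frac=1.0))
    ```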

    An Analysis of Partial Network Partitioning Failures in Modern Distributed Systems

    We present a comprehensive study of system failures from 12 popular systems caused by a peculiar type of network partitioning fault: partial partitions. Partial partitions isolate a set of nodes from some, but not all, nodes in the cluster. Our study reveals that the studied failures are catastrophic; they lead to data loss, complete system unavailability, or stale and dirty reads. Furthermore, our study reveals that these failures are easy to manifest: they are deterministic, can be triggered by isolating a single node, and require no interaction with the system's clients. We dissected the fault tolerance techniques implemented in eight popular systems. We identified four principled approaches for building a fault tolerance mechanism for partial partitions, as well as the shortcomings of the current approaches. The currently implemented fault tolerance techniques are either specific to a particular protocol or implementation, or may lead to a complete cluster shutdown despite the availability of alternative network paths between the nodes. Finally, we present NIFTY, a generic communication layer that leverages the capabilities of modern software-defined networking to monitor and recover the connectivity of the cluster in the case of partial network partitions. NIFTY is transparent to the application running on top of it. We built NiftyDB, a database system atop NIFTY, which implements a set of optimizations that reduce network overhead and operation latency under partial network partitioning. Our analysis and evaluation show that the proposed approach can effectively mask partial network partitioning faults without incurring additional overhead.
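    The key observation behind NIFTY, that a partial partition leaves alternative paths through unaffected nodes, reduces to a reachability check. A toy sketch with made-up node names and connectivity data follows; the real system monitors and reroutes at the software-defined networking layer:

    ```python
    from itertools import combinations

    # Hypothetical pairwise connectivity observed by a monitoring layer:
    # True means the two nodes can reach each other directly.
    nodes = ["n1", "n2", "n3"]
    reachable = {
        ("n1", "n2"): False,  # partial partition: n1 and n2 cannot talk...
        ("n1", "n3"): True,   # ...but both still reach n3.
        ("n2", "n3"): True,
    }

    def link(a: str, b: str) -> bool:
        return reachable.get((a, b), reachable.get((b, a), False))

    # A partition is *partial* if some pair is disconnected while a relay
    # node that reaches both endpoints still exists.
    for a, b in combinations(nodes, 2):
        if not link(a, b):
            relays = [r for r in nodes
                      if r not in (a, b) and link(a, r) and link(b, r)]
            if relays:
                print(f"partial partition {a}<->{b}; reroute via {relays[0]}")
    ```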

    An Analysis of Network-Partitioning Failures in Cloud Systems

    We present a comprehensive study of 136 system failures attributed to network-partitioning faults from 25 widely used distributed systems. We found that the majority of the failures led to catastrophic effects, such as data loss, reappearance of deleted data, broken locks, and system crashes. The majority of the failures can easily manifest once a network partition occurs: they require little to no client input, can be triggered by isolating a single node, and are deterministic. However, the number of test cases that one must consider is extremely large. Fortunately, we identify ordering, timing, and network fault characteristics that significantly simplify testing. Furthermore, we found that a significant number of the failures are due to design flaws in core system mechanisms. We found that the majority of the failures could have been avoided by design reviews, and could have been discovered by testing with network-partitioning fault injection. We built NEAT, a testing framework that simplifies the coordination of multiple clients and can inject different types of network-partitioning faults. We used NEAT to test seven popular systems and found and reported 32 failures.
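    Partition injection of the kind NEAT automates can be approximated with firewall rules. The sketch below uses iptables with placeholder addresses and is an assumption-laden stand-in, not NEAT's actual implementation:

    ```python
    import subprocess

    def isolate_from(peer_ips: list[str]) -> None:
        """Run on one side of the partition: drop all traffic arriving
        from the given peers. Mimics a network-partitioning fault
        injector; requires root privileges."""
        for ip in peer_ips:
            subprocess.run(
                ["iptables", "-A", "INPUT", "-s", ip, "-j", "DROP"],
                check=True,
            )

    def heal(peer_ips: list[str]) -> None:
        """Delete the injected rules to heal the partition."""
        for ip in peer_ips:
            subprocess.run(
                ["iptables", "-D", "INPUT", "-s", ip, "-j", "DROP"],
                check=True,
            )

    # Example with placeholder addresses: isolate a single node from its
    # peers, exercise the system under test, then heal.
    # isolate_from(["10.0.0.2", "10.0.0.3"])
    # ... run the test workload ...
    # heal(["10.0.0.2", "10.0.0.3"])
    ```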

    Towards Automated Software Evolution of Data-Intensive Applications

    Recent years have witnessed an explosion of work on Big Data. Data-intensive applications analyze and produce large volumes of data, typically terabytes to petabytes in size. Many techniques for facilitating data processing are integrated into data-intensive applications. An API is a software interface that allows two applications to communicate with each other. Streaming APIs, which can support parallel processing, are widely used in today's Object-Oriented programming development. In this dissertation, an approach is proposed that automatically suggests whether stream code should run in parallel or sequentially. However, using streams efficiently and properly requires many subtle considerations, so use and misuse patterns for stream code are also cataloged in this dissertation. Modern software, especially highly transactional software systems, generates vast amounts of logging information every day. This flood of information prevents developers from extracting useful information effectively. Log levels can be used to filter run-time information. This dissertation proposes an automated evolution approach for alleviating logging information overload by rejuvenating log levels according to developers' interests. Machine Learning (ML) systems are pervasive in today's software industry. They are typically complex and can process large volumes of data. Due to their complexity, ML systems are prone to classic technical debt issues, but how ML systems evolve is still a puzzling problem. This dissertation introduces ML-specific refactorings and technical debt categories for addressing this problem.
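    The log-level rejuvenation idea can be illustrated with Python's standard logging module. The interest scores, threshold, and logger names below are assumptions for illustration, not the dissertation's actual model:

    ```python
    import logging

    logging.basicConfig(level=logging.DEBUG)

    # Hypothetical "developer interest" scores per logger, e.g. mined
    # from how often each component's logs were consulted while debugging.
    interest = {"app.payments": 0.9, "app.cache": 0.1}

    def rejuvenate_levels(scores: dict[str, float]) -> None:
        """Raise verbosity for loggers developers care about and quiet
        the rest; the 0.5 threshold is an arbitrary illustration."""
        for name, score in scores.items():
            level = logging.DEBUG if score > 0.5 else logging.WARNING
            logging.getLogger(name).setLevel(level)

    rejuvenate_levels(interest)
    logging.getLogger("app.payments").debug("visible: high-interest logger")
    logging.getLogger("app.cache").info("suppressed: low-interest logger")
    ```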