Big Data Testing Techniques: Taxonomy, Challenges and Future Trends
Big Data is transforming many industrial domains by providing decision support
through the analysis of large data volumes. Big Data testing aims to ensure that Big
Data systems run smoothly and error-free while maintaining performance and
data quality. However, because of the diversity and complexity of the data,
testing Big Data is challenging. Though numerous research efforts deal with Big
Data testing, a comprehensive review addressing the testing techniques and
challenges of Big Data is not yet available. Therefore, we have
systematically reviewed the evidence on Big Data testing techniques published in
the period 2010-2021. This paper discusses the testing of data processing by
highlighting the techniques used in every processing phase. Furthermore, we
discuss the challenges and future directions. Our findings show that diverse
functional, non-functional and combined (functional and non-functional) testing
techniques have been used to solve specific problems related to Big Data. At
the same time, most of the testing challenges have been faced during the
MapReduce validation phase. In addition, combinatorial testing is
one of the most frequently applied techniques, often in combination with other techniques (i.e.,
random testing, mutation testing, input space partitioning and equivalence
testing), to find various functional faults in Big Data systems.
Comment: 32 pages
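The combinatorial strategy the survey highlights can be illustrated with a minimal sketch: cross equivalence classes of each input dimension and check a metamorphic invariant on every combination. The system under test, the classes, and the invariant below are all hypothetical examples, not taken from any surveyed paper.

```python
from itertools import product

# Hypothetical system under test: a toy aggregation that should be
# insensitive to batch size (an invariant worth testing combinatorially).
def aggregate(records, batch_size):
    total = 0
    for i in range(0, len(records), batch_size):
        total += sum(records[i:i + batch_size])
    return total

# Equivalence classes for each input dimension (illustrative values):
record_sets = [[], [5], [1, 2, 3, 4]]   # empty / single / many records
batch_sizes = [1, 2, 8]                 # smaller than, equal to, larger than data

# Combinatorial testing: exercise the full cross-product of classes and
# check the invariant (the batched result must equal the plain sum).
failures = []
for records, batch in product(record_sets, batch_sizes):
    if aggregate(records, batch) != sum(records):
        failures.append((records, batch))

print(failures)  # an empty list means no combination exposed a fault
```

In practice, pairwise covering arrays replace the full cross-product when the parameter space is large; the structure of the check stays the same.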
What does fault tolerant Deep Learning need from MPI?
Deep Learning (DL) algorithms have become the de facto Machine Learning (ML)
algorithm for large scale data analysis. DL algorithms are computationally
expensive - even distributed DL implementations which use MPI require days of
training (model learning) time on commonly studied datasets. Long running DL
applications become susceptible to faults - requiring development of a fault
tolerant system infrastructure, in addition to fault tolerant DL algorithms.
This raises an important question: What is needed from MPI for designing
fault tolerant DL implementations? In this paper, we address this problem for
permanent faults. We motivate the need for a fault tolerant MPI specification
by an in-depth consideration of recent innovations in DL algorithms and their
properties, which drive the need for specific fault tolerance features. We
present an in-depth discussion on the suitability of different parallelism
types (model, data and hybrid); a need (or lack thereof) for check-pointing of
any critical data structures; and most importantly, consideration for several
fault tolerance proposals (user-level fault mitigation (ULFM), Reinit) in MPI
and their applicability to fault tolerant DL implementations. We leverage a
distributed memory implementation of Caffe, currently available under the
Machine Learning Toolkit for Extreme Scale (MaTEx). We implement our approaches
by extending MaTEx-Caffe with a ULFM-based implementation. Our evaluation
using the ImageNet dataset and the AlexNet and GoogLeNet neural network topologies
demonstrates the effectiveness of the proposed fault tolerant DL implementation
using OpenMPI-based ULFM.
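The shrink-and-continue recovery pattern discussed above (survivors rebuild the group after a permanent fault and resume from a checkpoint, as in ULFM) can be sketched as a toy single-process simulation. No real MPI or ULFM calls are made; the shards, learning rate, checkpoint interval and failure point are all illustrative.

```python
# Toy simulation of ULFM-style recovery for data-parallel training:
# each "rank" computes a gradient on its shard; on a permanent fault,
# the surviving ranks shrink the group, roll back to the last
# checkpoint, and continue. All values here are illustrative.

def worker_gradient(shard, model):
    # Gradient of 0.5 * sum((model - x)^2), averaged over the shard.
    return sum(model - x for x in shard) / len(shard)

def train(shards, steps, fail_at, fail_rank):
    model, checkpoint = 0.0, 0.0
    alive = list(range(len(shards)))          # surviving "ranks"
    for step in range(steps):
        if step % 5 == 0:
            checkpoint = model                # periodic model checkpoint
        if step == fail_at:
            alive.remove(fail_rank)           # permanent process fault
            model = checkpoint                # survivors roll back, continue
        grads = [worker_gradient(shards[r], model) for r in alive]
        model -= 0.5 * sum(grads) / len(grads)   # averaged update
    return model

shards = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
final = train(shards, steps=50, fail_at=12, fail_rank=2)
print(round(final, 3))  # converges to the surviving shards' mean, 2.5
```

With real ULFM, the shrink step would use `MPIX_Comm_shrink` on the revoked communicator; the point of the sketch is only the control flow: detect, shrink, roll back, resume.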
A big data MapReduce framework for fault diagnosis in cloud-based manufacturing
This research develops a MapReduce framework for automatic pattern recognition based fault diagnosis by solving the data imbalance problem in cloud-based manufacturing (CBM). Fault diagnosis in a CBM system contributes significantly to reducing product testing costs and enhancing manufacturing quality. One of the major challenges facing big data analytics in cloud-based manufacturing is the handling of datasets that are highly imbalanced in nature, since machine learning techniques yield poor classification results when applied to such datasets. The framework proposed in this research uses a hybrid approach to deal with big datasets for smarter decisions. Furthermore, we compare the performance of a radial basis function based Support Vector Machine classifier with standard techniques. Our findings suggest that the most important task in cloud-based manufacturing is to predict the effect of data errors on quality, owing to highly imbalanced unstructured datasets. The proposed framework is an original contribution to the body of literature, where our proposed MapReduce framework has been used for fault detection by managing the data imbalance problem appropriately and relating it to the firm's profit function. The experimental results are validated using a case study of steel plate manufacturing fault diagnosis, with crucial performance metrics such as accuracy, specificity and sensitivity. A comparative study shows that the methods used in the proposed framework outperform the traditional ones.
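The evaluation above relies on accuracy, specificity and sensitivity, which are the standard way to score a classifier on imbalanced fault data (plain accuracy alone is misleading when faults are rare). A minimal sketch of these metrics for a binary faulty-vs-normal labelling follows; the sample labels are invented for illustration.

```python
# Confusion-matrix metrics for binary fault diagnosis (1 = fault).
# The toy labels below are illustrative, not from the paper's case study.

def confusion(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion(y_true, y_pred)
    return {
        "accuracy":    (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),   # recall on the rare fault class
        "specificity": tn / (tn + fp),   # recall on the normal class
    }

# Imbalanced toy sample: 2 faulty plates among 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
m = metrics(y_true, y_pred)
print(m)  # accuracy looks fine (0.8) while sensitivity is only 0.5
```

The example shows why the paper reports all three metrics: on imbalanced data a classifier can score high accuracy while missing half of the actual faults.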
Infrastructure-Aware Functional Testing of MapReduce Programs
2016 IEEE 4th International Conference on Future
Internet of Things and Cloud Workshops (FiCloudW), Vienna, 2016
Programs that process a large volume of data generally run in a distributed and parallel architecture, such as programs implemented in the MapReduce processing model. In these programs, developers can abstract away the infrastructure where the program will run and focus on the functional issues. However, the infrastructure configuration and its state cause different parallel executions of the program, and some could result in functional faults which are hard to reveal. In general, the infrastructure that executes the program is not considered during testing, because the tests usually contain little input data and so parallelization is not necessary. In this paper a testing technique is proposed that generates different infrastructure configurations for a given test input, and the program is then executed under these configurations in order to reveal functional faults. This testing technique is automated by a test engine and applied to a case study. As a result, several infrastructure configurations are automatically generated and executed for a test case, revealing a functional fault that is then fixed by the developer.
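The core idea of this abstract, i.e. that the same test input under different infrastructure configurations can expose order-dependent functional faults, can be sketched without a real cluster by simulating partition orderings. The buggy reducer below is a deliberately invented example, not the paper's test engine.

```python
from itertools import permutations

# Sketch of infrastructure-aware testing: a buggy reducer whose result
# depends on the order in which mapper partitions arrive. Running the
# same logical input under several simulated "infrastructure
# configurations" (partition orderings) reveals the fault.

def buggy_reduce(partitions):
    # BUG: keeps the first value seen per key instead of summing them.
    out = {}
    for part in partitions:
        for key, value in part:
            out.setdefault(key, value)
    return out

# One test input, logically the same data in every configuration:
test_input = [[("a", 1), ("b", 2)], [("a", 3)]]

outputs = {tuple(sorted(buggy_reduce(list(p)).items()))
           for p in permutations(test_input)}

fault_revealed = len(outputs) > 1
print(fault_revealed)  # True: the configurations produced differing outputs
```

A correct, commutative reducer would yield a single output regardless of ordering, so the set of distinct outputs would have size one and no fault would be flagged.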
Towards Ex Vivo Testing of MapReduce Applications
2017 IEEE International Conference on Software Quality, Reliability and Security (QRS), 25-29 July 2017, Prague (Czech Republic)
Big Data programs are those that process data volumes exceeding the capabilities of traditional technologies. Among newly proposed processing models, MapReduce stands out because it allows the analysis of schema-less data in large distributed environments with frequent infrastructure failures. Functional faults in MapReduce are hard to detect in a testing/preproduction environment due to its distributed characteristics. We propose an automatic test framework implementing a novel testing approach called Ex Vivo. The framework employs data from production but executes the tests in a laboratory to avoid side-effects on the application. Faults are detected automatically, without human intervention, by checking whether the same data would generate different outputs under different infrastructure configurations. The framework (MrExist) is validated with a real-world program. MrExist can identify a fault in a few seconds; the program can then be stopped, not only avoiding an incorrect output but also saving the money, time and energy of production resources.
Development of an Application to Run Integration Tests on a Data Pipeline
Internship Report presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Business Analytics
As new technologies continue to emerge, the ever-growing complexity of modern software products and architectures significantly increases the difficulty and development cost of a fully tested application. As agile methodologies enforce faster delivery of new features, a tool that automatically tests the different components of an application has become an essential prerequisite in many software development teams.
In this context, this report describes the development of an application for the automatic execution of integration tests. The application was developed during a data engineering internship at Xing, a well-established German career-oriented professional networking platform. At the time of writing, Xing counts more than 19 million users, most of them from Germany, Austria, and Switzerland.
The internship project was carried out in a team focused on data engineering projects and, following the Kanban methodology, an application was developed to automatically perform integration tests on the different components involved in the creation of a type of update on the platform. The application was also coupled to the tool used by the team in its continuous integration and continuous delivery practices.
This report describes the developed project, which successfully achieved the proposed objectives and delivered, as a final product, an application that will serve as a framework to perform integration tests, in an automated way, on the data pipelines for the creation of updates on the platform.
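An integration test of the kind the report describes checks the contract between adjacent pipeline components rather than a single unit in isolation. The sketch below is hypothetical: the pipeline stage, field names and expected contract are invented for illustration and do not reflect Xing's actual system.

```python
# Minimal sketch of an automated integration test for one data-pipeline
# component. The stage and its contract are hypothetical examples.

def enrich_update(raw):
    # Example stage: normalises a raw "update" event for downstream use.
    return {
        "user_id": int(raw["user_id"]),
        "kind": raw["kind"].lower(),
        "body": raw["body"].strip(),
    }

def run_integration_test():
    # Feed a representative event through the stage and check the
    # contract the next component depends on.
    raw = {"user_id": "42", "kind": "NEW_JOB", "body": "  Hired!  "}
    out = enrich_update(raw)
    assert out == {"user_id": 42, "kind": "new_job", "body": "Hired!"}
    return "passed"

print(run_integration_test())
```

In a CI/CD setup like the one described, such tests would run automatically on each delivery, failing the build when a component breaks the contract its consumers rely on.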
Semantic Support for Log Analysis of Safety-Critical Embedded Systems
Testing is a relevant activity in the development life-cycle of Safety-Critical
Embedded systems. In particular, much effort is spent on the analysis and
classification of test logs from SCADA subsystems, especially when failures
occur. Human expertise is needed to understand the reasons for failures,
to trace back the errors, and to understand which requirements are
affected by errors and which ones will be affected by possible changes in the
system design. Semantic techniques and full-text search are used to support
human experts in the analysis and classification of test logs, in order to
speed up and improve the diagnosis phase. Moreover, retrieval of tests and
requirements that can be related to the current failure is supported, in
order to allow the discovery of available alternatives and solutions for a
better and faster investigation of the problem.
Comment: EDCC-2014, BIG4CIP-2014, Embedded systems, testing, semantic discovery, ontology, big data