Big Data Testing Techniques: Taxonomy, Challenges and Future Trends
Big Data is transforming many industrial domains by providing decision support
through the analysis of large data volumes. Big Data testing aims to ensure that Big
Data systems run smoothly and error-free while maintaining performance and
data quality. However, because of the diversity and complexity of the data,
testing Big Data is challenging. Though numerous research efforts deal with Big
Data testing, a comprehensive review addressing the testing techniques and
challenges of Big Data is not yet available. Therefore, we have
systematically reviewed the evidence on Big Data testing techniques published in
the period 2010-2021. This paper discusses the testing of data processing by
highlighting the techniques used in every processing phase. Furthermore, we
discuss the challenges and future directions. Our findings show that diverse
functional, non-functional and combined (functional and non-functional) testing
techniques have been used to solve specific problems related to Big Data. At
the same time, most of the testing challenges have been faced during the
MapReduce validation phase. In addition, combinatorial testing is
one of the most frequently applied techniques, often in combination with other techniques (i.e.,
random testing, mutation testing, input space partitioning and equivalence
testing), to find various functional faults in Big Data systems.
Comment: 32 pages
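The combinatorial strategy the survey highlights can be illustrated with a minimal sketch: cross equivalence classes of each input dimension and check a metamorphic invariant on every combination. The system under test, the classes, and the invariant below are all hypothetical examples, not taken from any surveyed paper.

```python
from itertools import product

# Hypothetical system under test: a toy aggregation that should be
# insensitive to batch size (an invariant worth testing combinatorially).
def aggregate(records, batch_size):
    total = 0
    for i in range(0, len(records), batch_size):
        total += sum(records[i:i + batch_size])
    return total

# Equivalence classes for each input dimension (illustrative values):
record_sets = [[], [5], [1, 2, 3, 4]]   # empty / single / many records
batch_sizes = [1, 2, 8]                 # smaller than, equal to, larger than data

# Combinatorial testing: exercise the full cross-product of classes and
# check the invariant (the batched result must equal the plain sum).
failures = []
for records, batch in product(record_sets, batch_sizes):
    if aggregate(records, batch) != sum(records):
        failures.append((records, batch))

print(failures)  # an empty list means no combination exposed a fault
```

In practice, pairwise covering arrays replace the full cross-product when the parameter space is large; the structure of the check stays the same.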
What does fault tolerant Deep Learning need from MPI?
Deep Learning (DL) algorithms have become the de facto Machine Learning (ML)
algorithm for large scale data analysis. DL algorithms are computationally
expensive - even distributed DL implementations which use MPI require days of
training (model learning) time on commonly studied datasets. Long running DL
applications become susceptible to faults - requiring development of a fault
tolerant system infrastructure, in addition to fault tolerant DL algorithms.
This raises an important question: What is needed from MPI for designing
fault tolerant DL implementations? In this paper, we address this problem for
permanent faults. We motivate the need for a fault tolerant MPI specification
by an in-depth consideration of recent innovations in DL algorithms and their
properties, which drive the need for specific fault tolerance features. We
present an in-depth discussion on the suitability of different parallelism
types (model, data and hybrid); a need (or lack thereof) for check-pointing of
any critical data structures; and most importantly, consideration for several
fault tolerance proposals (user-level fault mitigation (ULFM), Reinit) in MPI
and their applicability to fault tolerant DL implementations. We leverage a
distributed memory implementation of Caffe, currently available under the
Machine Learning Toolkit for Extreme Scale (MaTEx). We implement our approaches
by extending MaTEx-Caffe with a ULFM-based implementation. Our evaluation
using the ImageNet dataset and the AlexNet and GoogLeNet neural network topologies
demonstrates the effectiveness of the proposed fault tolerant DL implementation
using OpenMPI-based ULFM.
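The shrink-and-continue recovery pattern discussed above (survivors rebuild the group after a permanent fault and resume from a checkpoint, as in ULFM) can be sketched as a toy single-process simulation. No real MPI or ULFM calls are made; the shards, learning rate, checkpoint interval and failure point are all illustrative.

```python
# Toy simulation of ULFM-style recovery for data-parallel training:
# each "rank" computes a gradient on its shard; on a permanent fault,
# the surviving ranks shrink the group, roll back to the last
# checkpoint, and continue. All values here are illustrative.

def worker_gradient(shard, model):
    # Gradient of 0.5 * sum((model - x)^2), averaged over the shard.
    return sum(model - x for x in shard) / len(shard)

def train(shards, steps, fail_at, fail_rank):
    model, checkpoint = 0.0, 0.0
    alive = list(range(len(shards)))          # surviving "ranks"
    for step in range(steps):
        if step % 5 == 0:
            checkpoint = model                # periodic model checkpoint
        if step == fail_at:
            alive.remove(fail_rank)           # permanent process fault
            model = checkpoint                # survivors roll back, continue
        grads = [worker_gradient(shards[r], model) for r in alive]
        model -= 0.5 * sum(grads) / len(grads)   # averaged update
    return model

shards = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
final = train(shards, steps=50, fail_at=12, fail_rank=2)
print(round(final, 3))  # converges to the surviving shards' mean, 2.5
```

With real ULFM, the shrink step would use `MPIX_Comm_shrink` on the revoked communicator; the point of the sketch is only the control flow: detect, shrink, roll back, resume.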
A big data MapReduce framework for fault diagnosis in cloud-based manufacturing
This research develops a MapReduce framework for automatic pattern recognition based fault diagnosis by solving the data imbalance problem in cloud-based manufacturing (CBM). Fault diagnosis in a CBM system contributes significantly to reducing product testing costs and enhancing manufacturing quality. One of the major challenges facing big data analytics in cloud-based manufacturing is the handling of datasets that are highly imbalanced in nature, since machine learning techniques yield poor classification results when applied to such datasets. The framework proposed in this research uses a hybrid approach to deal with big datasets for smarter decisions. Furthermore, we compare the performance of a radial basis function based Support Vector Machine classifier with standard techniques. Our findings suggest that the most important task in cloud-based manufacturing is to predict the effect of data errors on quality, owing to highly imbalanced unstructured datasets. The proposed framework is an original contribution to the body of literature, where our proposed MapReduce framework has been used for fault detection by managing the data imbalance problem appropriately and relating it to the firm's profit function. The experimental results are validated using a case study of steel plate manufacturing fault diagnosis, with crucial performance metrics such as accuracy, specificity and sensitivity. A comparative study shows that the methods used in the proposed framework outperform the traditional ones.
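The evaluation above relies on accuracy, specificity and sensitivity, which are the standard way to score a classifier on imbalanced fault data (plain accuracy alone is misleading when faults are rare). A minimal sketch of these metrics for a binary faulty-vs-normal labelling follows; the sample labels are invented for illustration.

```python
# Confusion-matrix metrics for binary fault diagnosis (1 = fault).
# The toy labels below are illustrative, not from the paper's case study.

def confusion(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion(y_true, y_pred)
    return {
        "accuracy":    (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),   # recall on the rare fault class
        "specificity": tn / (tn + fp),   # recall on the normal class
    }

# Imbalanced toy sample: 2 faulty plates among 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
m = metrics(y_true, y_pred)
print(m)  # accuracy looks fine (0.8) while sensitivity is only 0.5
```

The example shows why the paper reports all three metrics: on imbalanced data a classifier can score high accuracy while missing half of the actual faults.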
Infrastructure-Aware Functional Testing of MapReduce Programs
2016 IEEE 4th International Conference on Future
Internet of Things and Cloud Workshops (FiCloudW), Vienna, 2016
Programs that process a large volume of data generally run in a distributed and parallel architecture, such as programs implemented in the MapReduce processing model. In these programs, developers can abstract away the infrastructure where the program will run and focus on the functional issues. However, the infrastructure configuration and its state cause different parallel executions of the program, and some could result in functional faults which are hard to reveal. In general, the infrastructure that executes the program is not considered during testing, because the tests usually contain little input data and so parallelization is not necessary. In this paper a testing technique is proposed that generates different infrastructure configurations for a given test input, and the program is then executed under these configurations in order to reveal functional faults. This testing technique is automated by a test engine and applied to a case study. As a result, several infrastructure configurations are automatically generated and executed for a test case, revealing a functional fault that is then fixed by the developer.
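The core idea of this abstract, i.e. that the same test input under different infrastructure configurations can expose order-dependent functional faults, can be sketched without a real cluster by simulating partition orderings. The buggy reducer below is a deliberately invented example, not the paper's test engine.

```python
from itertools import permutations

# Sketch of infrastructure-aware testing: a buggy reducer whose result
# depends on the order in which mapper partitions arrive. Running the
# same logical input under several simulated "infrastructure
# configurations" (partition orderings) reveals the fault.

def buggy_reduce(partitions):
    # BUG: keeps the first value seen per key instead of summing them.
    out = {}
    for part in partitions:
        for key, value in part:
            out.setdefault(key, value)
    return out

# One test input, logically the same data in every configuration:
test_input = [[("a", 1), ("b", 2)], [("a", 3)]]

outputs = {tuple(sorted(buggy_reduce(list(p)).items()))
           for p in permutations(test_input)}

fault_revealed = len(outputs) > 1
print(fault_revealed)  # True: the configurations produced differing outputs
```

A correct, commutative reducer would yield a single output regardless of ordering, so the set of distinct outputs would have size one and no fault would be flagged.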
Towards Ex Vivo Testing of MapReduce Applications
2017 IEEE International Conference on Software Quality, Reliability and Security (QRS), 25-29 July 2017, Prague (Czech Republic)
Big Data programs are those that process data volumes exceeding the capabilities of traditional technologies. Among newly proposed processing models, MapReduce stands out because it allows the analysis of schema-less data in large distributed environments with frequent infrastructure failures. Functional faults in MapReduce are hard to detect in a testing/preproduction environment due to its distributed characteristics. We propose an automatic test framework implementing a novel testing approach called Ex Vivo. The framework employs data from production but executes the tests in a laboratory to avoid side-effects on the application. Faults are detected automatically, without human intervention, by checking whether the same data would generate different outputs under different infrastructure configurations. The framework (MrExist) is validated with a real-world program. MrExist can identify a fault in a few seconds; the program can then be stopped, not only avoiding an incorrect output but also saving the money, time and energy of production resources.
Development of an Application to Run Integration Tests on a Data Pipeline
Internship Report presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Business Analytics
As new technologies continue to emerge, the ever-growing complexity of modern software products and architectures significantly increases the difficulty and development cost of a fully tested application. As agile methodologies enforce faster delivery of new features, a tool that automatically tests the different components of an application has become an essential prerequisite in many software development teams.
In this context, this report describes the development of an application for the automatic execution of integration tests. The application was developed during a data engineering internship at Xing, a well-established German career-oriented professional networking platform. At the time of writing, Xing counts more than 19 million users, most of them from Germany, Austria, and Switzerland.
The internship project was carried out in a team focused on data engineering projects and, following the Kanban methodology, an application was developed to automatically perform integration tests on the different components involved in the creation of a type of update on the platform. The application was also coupled to the tool used by the team in its continuous integration and continuous delivery practices.
This report describes the developed project, which successfully achieved the proposed objectives and delivered, as a final product, an application that will serve as a framework to perform integration tests, in an automated way, on the data pipelines for the creation of updates on the platform.
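An integration test of the kind the report describes checks the contract between adjacent pipeline components rather than a single unit in isolation. The sketch below is hypothetical: the pipeline stage, field names and expected contract are invented for illustration and do not reflect Xing's actual system.

```python
# Minimal sketch of an automated integration test for one data-pipeline
# component. The stage and its contract are hypothetical examples.

def enrich_update(raw):
    # Example stage: normalises a raw "update" event for downstream use.
    return {
        "user_id": int(raw["user_id"]),
        "kind": raw["kind"].lower(),
        "body": raw["body"].strip(),
    }

def run_integration_test():
    # Feed a representative event through the stage and check the
    # contract the next component depends on.
    raw = {"user_id": "42", "kind": "NEW_JOB", "body": "  Hired!  "}
    out = enrich_update(raw)
    assert out == {"user_id": 42, "kind": "new_job", "body": "Hired!"}
    return "passed"

print(run_integration_test())
```

In a CI/CD setup like the one described, such tests would run automatically on each delivery, failing the build when a component breaks the contract its consumers rely on.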
Semantic Support for Log Analysis of Safety-Critical Embedded Systems
Testing is a relevant activity in the development life-cycle of Safety-Critical
Embedded systems. In particular, much effort is spent on the analysis and
classification of test logs from SCADA subsystems, especially when failures
occur. Human expertise is needed to understand the reasons for failures,
to trace back the errors, and to understand which requirements are
affected by errors and which ones will be affected by possible changes in the
system design. Semantic techniques and full-text search are used to support
human experts in the analysis and classification of test logs, in order to
speed up and improve the diagnosis phase. Moreover, retrieval of tests and
requirements that can be related to the current failure is supported, in
order to allow the discovery of available alternatives and solutions for a
better and faster investigation of the problem.
Comment: EDCC-2014, BIG4CIP-2014, Embedded systems, testing, semantic discovery, ontology, big data