Data Pipeline Quality: Influencing Factors, Root Causes of Data-related Issues, and Processing Problem Areas for Developers
Data pipelines are an integral part of various modern data-driven systems.
However, despite their importance, they are often unreliable and deliver
poor-quality data. A critical step toward improving this situation is a solid
understanding of the aspects contributing to the quality of data pipelines.
Therefore, this article first introduces a taxonomy of 41 factors that
influence the ability of data pipelines to provide quality data. The taxonomy
is based on a multivocal literature review and validated by eight interviews
with experts from the data engineering domain. Data, infrastructure, life cycle
management, development & deployment, and processing were found to be the main
influencing themes. Second, we investigate the root causes of data-related
issues, their location in data pipelines, and the main topics of data pipeline
processing issues for developers by mining GitHub projects and Stack Overflow
posts. We found data-related issues to be primarily caused by incorrect data
types (33%), mainly occurring in the data cleaning stage of pipelines (35%).
Data integration and ingestion tasks were the topics developers asked about
most often, accounting for nearly half (47%) of all questions. Compatibility
issues were found to be a separate problem area in addition to issues
corresponding to the usual data pipeline processing areas (i.e., data loading,
ingestion, integration, cleaning, and transformation). These findings suggest
that future research efforts should focus on analyzing compatibility and data
type issues in more depth and assisting developers in data integration and
ingestion tasks. The proposed taxonomy is valuable to practitioners in the
context of quality assurance activities and fosters future research into data
pipeline quality.

Comment: To be published by The Journal of Systems & Software.
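The most common root cause reported above, incorrect data types surfacing in the cleaning stage, can be illustrated with a minimal sketch. All names and data here are hypothetical, not taken from the study's mined projects:

```python
# Raw records often arrive with numeric fields encoded as strings,
# which a cleaning step must coerce or reject explicitly.
raw_records = [
    {"sensor_id": "s1", "temperature": "21.5"},
    {"sensor_id": "s2", "temperature": "N/A"},  # malformed value
    {"sensor_id": "s3", "temperature": 19.0},
]

def clean(records):
    """Coerce temperature to float, dropping records that cannot be parsed."""
    cleaned = []
    for rec in records:
        try:
            value = float(rec["temperature"])  # raises for "N/A"
        except (TypeError, ValueError):
            continue  # a naive pipeline would crash here instead
        cleaned.append({**rec, "temperature": value})
    return cleaned

print(clean(raw_records))  # the "N/A" record is dropped
```

A pipeline without such explicit type handling propagates the string values downstream, where they typically fail much later and are harder to diagnose.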
An approach for assessing industrial IoT data sources to determine their data trustworthiness
Trustworthy data in the Industrial Internet of Things are paramount to ensure correct strategic decision-making and accurate actions on the shop floor. However, the enormous amount of industrial data generated by a variety of sources (e.g. machines and sensors) is often of poor quality (e.g. unreliable sensor readings). Research suggests that certain characteristics of data sources (e.g. battery power supply and wireless communication) contribute to this poor data quality. Nonetheless, to date, much of the research on data trustworthiness has focused only on data values to determine trustworthiness. Consequently, we propose to pay more attention to the characteristics of data sources in the context of data trustworthiness. Thus, this article presents an approach for assessing Industrial Internet of Things data sources to determine their data trustworthiness. The approach is based on a meta-model decomposing data sources into data stores (e.g. databases) and providers (e.g. sensors). Furthermore, the approach provides a quality model comprising quality-related characteristics of data stores to determine their data trustworthiness. Moreover, a catalogue containing properties of data providers is presented to infer the trustworthiness of their provided data. An industrial case study revealed a moderate correlation between the data source assessments of the proposed approach and the assessments of experts.
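The meta-model's decomposition of a data source into stores and providers could be sketched as follows. All class names, attributes, and the toy scoring heuristic are illustrative assumptions, not the authors' actual model:

```python
from dataclasses import dataclass, field

@dataclass
class DataProvider:
    name: str
    battery_powered: bool  # characteristics the abstract cites as
    wireless: bool         # contributors to poor data quality

    def trust_penalty(self) -> float:
        # Toy heuristic: each risk-bearing characteristic reduces trust.
        return 0.2 * self.battery_powered + 0.2 * self.wireless

@dataclass
class DataStore:
    name: str
    quality_score: float  # 0..1, from a quality-model assessment

@dataclass
class DataSource:
    stores: list = field(default_factory=list)
    providers: list = field(default_factory=list)

    def trustworthiness(self) -> float:
        # Weakest store bounds the score; risky providers subtract from it.
        store_part = min((s.quality_score for s in self.stores), default=1.0)
        penalty = sum(p.trust_penalty() for p in self.providers)
        return max(0.0, store_part - penalty)

src = DataSource(
    stores=[DataStore("plc-db", 0.9)],
    providers=[DataProvider("vibration-sensor", battery_powered=True, wireless=True)],
)
print(round(src.trustworthiness(), 2))
```

The point of the structure, mirroring the abstract, is that trustworthiness is inferred from source characteristics rather than from the data values themselves.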
Integrating software quality models into risk-based testing
Risk-based testing is a frequently used testing approach that utilizes identified risks of a software system to provide decision support in all phases of the testing process. Risk assessment, a core activity of every risk-based testing process, is often done in an ad hoc, manual way. Software quality assessments, based on quality models, already describe the product-related risks of a whole software product and provide objective and automation-supported assessments. But so far, quality models have not been applied to risk assessment and risk-based testing in a systematic way. This article tries to fill this gap and investigates how the information and data of a quality assessment based on the open quality model QuaMoCo can be integrated into risk-based testing. We first present two generic approaches showing how quality assessments based on quality models can be integrated into risk-based testing and then provide the concrete integration on the basis of the open quality model QuaMoCo. Based on five open source products, a case study is performed. Results of the case study show that a risk-based testing strategy outperforms a lines-of-code-based testing strategy with regard to the number of defects detected. Moreover, a significant positive relationship between the risk coefficient and the associated number of defects was found.
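The core idea of risk-based test prioritization can be sketched briefly. This is a generic illustration, not QuaMoCo's actual assessment procedure; component names, probabilities, and the product-form risk coefficient are assumptions:

```python
# Rank components by a risk coefficient (here: defect probability, as a
# quality assessment might estimate it, times impact) and assign test
# effort in that order instead of by lines of code.
components = {
    # name: (defect_probability_from_quality_model, impact)
    "payment": (0.7, 0.9),
    "logging": (0.4, 0.2),
    "auth":    (0.5, 0.8),
}

def risk_coefficient(prob: float, impact: float) -> float:
    return prob * impact

ranked = sorted(components,
                key=lambda c: risk_coefficient(*components[c]),
                reverse=True)
print(ranked)  # highest-risk component is tested first
```

Under this scheme, the positive relationship the study reports between risk coefficient and defect count is exactly what makes the risk-based ordering outperform a size-based one.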