Data Pipeline Quality: Influencing Factors, Root Causes of Data-related Issues, and Processing Problem Areas for Developers
Data pipelines are an integral part of various modern data-driven systems.
However, despite their importance, they are often unreliable and deliver
poor-quality data. A critical step toward improving this situation is a solid
understanding of the aspects contributing to the quality of data pipelines.
Therefore, this article first introduces a taxonomy of 41 factors that
influence the ability of data pipelines to provide quality data. The taxonomy
is based on a multivocal literature review and validated by eight interviews
with experts from the data engineering domain. Data, infrastructure, life cycle
management, development & deployment, and processing were found to be the main
influencing themes. Second, we investigate the root causes of data-related
issues, their location in data pipelines, and the main topics of data pipeline
processing issues for developers by mining GitHub projects and Stack Overflow
posts. We found data-related issues to be primarily caused by incorrect data
types (33%), mainly occurring in the data cleaning stage of pipelines (35%).
Data integration and ingestion tasks were the topics developers asked about
most, accounting for nearly half (47%) of all questions. Compatibility
issues were found to be a separate problem area in addition to issues
corresponding to the usual data pipeline processing areas (i.e., data loading,
ingestion, integration, cleaning, and transformation). These findings suggest
that future research efforts should focus on analyzing compatibility and data
type issues in more depth and assisting developers in data integration and
ingestion tasks. The proposed taxonomy is valuable to practitioners in the
context of quality assurance activities and fosters future research into data
pipeline quality.
Comment: To be published by The Journal of Systems & Software.
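As a hypothetical illustration of the dominant root cause reported above (incorrect data types, surfacing in the cleaning stage), consider a pandas-based cleaning step; the column and values are invented for illustration and are not drawn from the article's dataset.

```python
# Hypothetical sketch (not from the article): a typical incorrect-data-type
# failure handled in the cleaning stage of a pandas-based pipeline.
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    # "amount" often arrives as strings ("1,200", "N/A"), so numeric
    # operations on the raw column fail or silently misbehave.
    df = df.copy()
    df["amount"] = pd.to_numeric(
        df["amount"].str.replace(",", "", regex=False),
        errors="coerce",  # invalid entries become NaN instead of raising
    )
    return df.dropna(subset=["amount"])

raw = pd.DataFrame({"amount": ["1,200", "350", "N/A"]})
print(clean(raw))  # amount parsed as float64; the "N/A" row is dropped
```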
Physical Plan Instrumentation in Databases: Mechanisms and Applications
Database management systems (DBMSs) are designed to compile SQL queries into physical plans that, when executed, produce the results of those queries. Building on this functionality, an ever-increasing number of application domains (e.g., provenance management, online query optimization, physical database design, interactive data profiling, monitoring, and interactive data visualization) seek to operate on how queries are executed by the DBMS, for purposes ranging from debugging and data explanation to optimization and monitoring. Unfortunately, DBMSs provide little, if any, support to facilitate the development of this class of important application domains. As a result, database application developers and database system architects either rewrite the database internals in ad hoc ways; work around the SQL interface, where possible, with inevitable performance penalties; or even build new databases from scratch simply to express and optimize their domain-specific application logic over how queries are executed.
To address this problem in a principled manner, this dissertation introduces a prototype DBMS, Smoke, that exposes instrumentation mechanisms in the form of a framework that allows external applications to manipulate physical plans. Intuitively, a physical plan is the underlying representation that a DBMS uses to encode how a SQL query will be executed, and providing instrumentation mechanisms at this representation level lets applications express and optimize their logic over how queries are executed.
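To make the notion of physical-plan instrumentation concrete, here is a minimal sketch of the general idea: a plan as an operator tree, with an instrumentation operator spliced in to observe tuples as they flow. The operator classes and callback interface are invented for illustration; the abstract does not describe Smoke's actual API.

```python
# A minimal sketch of the idea, not Smoke's actual interface: physical plans
# as operator trees, with an instrumentation operator spliced in to observe
# tuples mid-plan (e.g., to capture lineage) without changing results.
from typing import Callable, Iterable, Iterator, Tuple

Row = Tuple  # a tuple of column values

class Scan:
    def __init__(self, rows: Iterable[Row]):
        self.rows = rows
    def execute(self) -> Iterator[Row]:
        yield from self.rows

class Filter:
    def __init__(self, child, predicate: Callable[[Row], bool]):
        self.child, self.predicate = child, predicate
    def execute(self) -> Iterator[Row]:
        for row in self.child.execute():
            if self.predicate(row):
                yield row

class Instrument:
    """Wraps any operator and invokes a callback per emitted tuple."""
    def __init__(self, child, on_tuple: Callable[[Row], None]):
        self.child, self.on_tuple = child, on_tuple
    def execute(self) -> Iterator[Row]:
        for row in self.child.execute():
            self.on_tuple(row)  # e.g., record lineage, update a profile
            yield row

# Splice instrumentation between the Filter and its consumer:
lineage = []
plan = Instrument(
    Filter(Scan([(1, "a"), (2, "b"), (3, "c")]), lambda r: r[0] > 1),
    lineage.append,
)
print(list(plan.execute()))  # [(2, 'b'), (3, 'c')]
print(lineage)               # the same tuples, observed mid-plan
```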
With such an instrumentation-enabled DBMS in place, we then consider how to express and optimize applications whose logic depends on how queries are executed. To best demonstrate the expressive and optimization power of instrumentation-enabled DBMSs, we express and optimize applications across several important domains, including provenance management, interactive data visualization, interactive data profiling, physical database design, online query optimization, and query discovery. Expressivity-wise, we show that Smoke can express known techniques, introduce novel semantics on top of known techniques, and introduce new techniques across domains. Performance-wise, we show, case by case, that Smoke is on par with or up to several orders of magnitude faster than state-of-the-art imperative and declarative implementations of important applications across domains.
As such, we believe our contributions provide evidence for, and form the basis of, a class of instrumentation-enabled DBMSs whose goal is to express and optimize applications across important domains with core logic over how queries are executed by DBMSs.
Fine-Grained Provenance And Applications To Data Analytics Computation
Data provenance tools seek to facilitate reproducible data science and auditable data analyses by capturing the analytics steps used in generating data analysis results. However, analysts must choose among workflow provenance systems, which allow arbitrary code but track provenance only at the granularity of files; provenance APIs, which provide tuple-level provenance but incur overhead in all computations; and database provenance tools, which track tuple-level provenance through relational operators and support optimization, but cover only a limited subset of data science tasks. None of these solutions is well suited for tracing errors introduced during common ETL, record alignment, and matching tasks over data types such as strings and images.
Additionally, we need a provenance archival layer to store and manage the tracked fine-grained provenance, enabling future sophisticated reasoning about why individual output results appear or fail to appear. For reproducibility and auditing, the provenance archival system should be tamper-resistant. At the same time, provenance collected over time, or within a single query computation, tends to be partially repeated (i.e., the same operation applied to the same input records in an intermediate computation step). Hence, we desire efficient provenance storage that compresses repeated results.
We address these challenges with novel formalisms and algorithms, implemented in the PROVision system, for reconstructing fine-grained provenance for a broad class of ETL-style workflows. We extend database-style provenance techniques to capture equivalences, support optimizations, and enable lazy evaluation. We develop solutions for storing fine-grained provenance in relational storage systems while both compressing it and protecting it via cryptographic hashes. We experimentally validate our proposed solutions using both scientific and OLAP workloads.
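The abstract names two storage requirements: tamper resistance and compression of repeated provenance. The sketch below illustrates how cryptographic hashing can serve both at once; it is a hypothetical illustration under my own assumptions, not PROVision's actual design, and the class and method names are invented.

```python
# A rough sketch, not PROVision's design: provenance records are
# content-addressed by SHA-256, so repeated (operation, inputs) entries
# deduplicate to one stored record, and each record references its inputs'
# digests, so tampering with an upstream record breaks downstream verification.
import hashlib
import json

class ProvenanceStore:
    def __init__(self):
        self.records = {}  # digest -> record

    def _digest(self, entry: dict) -> str:
        return hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()
        ).hexdigest()

    def record(self, operation: str, input_digests: list, payload) -> str:
        entry = {
            "op": operation,
            "inputs": sorted(input_digests),  # digests of upstream records
            "payload": payload,
        }
        digest = self._digest(entry)
        # Content addressing: identical repeated provenance is stored once.
        self.records.setdefault(digest, entry)
        return digest

    def verify(self, digest: str) -> bool:
        # Recompute hashes transitively; a mismatch or a missing input
        # means some record in the chain was tampered with.
        entry = self.records.get(digest)
        if entry is None:
            return False
        return self._digest(entry) == digest and all(
            self.verify(d) for d in entry["inputs"]
        )

store = ProvenanceStore()
src = store.record("ingest", [], {"file": "sales.csv"})
out = store.record("clean", [src], {"dropped_rows": 3})
print(store.verify(out))  # True; mutating any stored record flips this
```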