3 research outputs found

    Data Pipeline Management in Practice: Challenges and Opportunities

    Data pipelines involve a complex chain of interconnected activities that starts with a data source and ends in a data sink. Data pipelines are important for data-driven organizations because a pipeline can process data in multiple formats from distributed data sources with minimal human intervention, accelerate data life-cycle activities, and enhance productivity in data-driven enterprises. However, implementing data pipelines raises both challenges and opportunities, and practical industry experience is seldom reported. The findings of this study are derived from a qualitative multiple-case study, including interviews with representatives of three companies. The challenges include data quality issues, infrastructure maintenance problems, and organizational barriers. On the other hand, data pipelines are implemented to enable traceability and fault tolerance and to reduce human error by maximizing automation, thereby producing high-quality data. Based on multiple-case study research covering five use cases from three case companies, this paper identifies the key challenges and benefits associated with the implementation and use of data pipelines.
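    As a concrete illustration of the source-to-sink chain and the automation, fault-tolerance, and traceability benefits described above, here is a minimal Python sketch. All names (run_pipeline, parse, validate) are hypothetical and invented for illustration; they are not drawn from the paper.

    # A pipeline is a chain of stages from a data source to a data sink.
    # Fault tolerance: a bad record is logged and skipped rather than
    # halting the pipeline; logging gives per-record traceability.
    import logging
    from typing import Any, Callable, Iterable

    logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
    log = logging.getLogger("pipeline")

    def run_pipeline(source: Iterable[Any],
                     stages: list[Callable[[Any], Any]],
                     sink: Callable[[Any], None]) -> None:
        for i, record in enumerate(source):
            try:
                for stage in stages:
                    record = stage(record)   # each stage transforms the record
                sink(record)
            except Exception as exc:         # isolate failures to one record
                log.error("record %d dropped: %s", i, exc)
            else:
                log.info("record %d delivered", i)

    # Example stages: parse raw rows, then enforce a data-quality rule.
    def parse(row: str) -> dict:
        name, value = row.split(",")
        return {"name": name.strip(), "value": float(value)}

    def validate(rec: dict) -> dict:
        if rec["value"] < 0:
            raise ValueError("negative value")
        return rec

    results: list[dict] = []
    run_pipeline(["a, 1.5", "b, -2", "c, 3.0"], [parse, validate], results.append)
    print(results)  # the invalid row "b, -2" was logged and dropped

    Each stage stays independent and reusable, which matches the abstract's point that pipelines minimize human intervention while keeping failures visible.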

    Error propagation


    Beyond Myopic Inference in Big Data Pipelines

    Big Data Pipelines decompose complex analyses of large data sets into a series of simpler tasks, with independently tuned components for each task. This modular setup allows re-use of components across several different pipelines. However, the interaction of independently tuned pipeline components yields poor end-to-end performance, as errors introduced by one component cascade through the whole pipeline, affecting overall accuracy. We propose a novel model for reasoning across components of Big Data Pipelines in a probabilistically well-founded manner. Our key idea is to view the interaction of components as dependencies on an underlying graphical model. Different message passing schemes on this graphical model provide various inference algorithms to trade off end-to-end performance and computational cost. We instantiate our framework with an efficient beam search algorithm and demonstrate its efficiency on two Big Data Pipelines: parsing and relation extraction.
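    The abstract names beam search as the instantiation of its cross-component inference, so the toy sketch below shows that general idea: instead of each component committing to its locally best output, the pipeline carries the top-k scored hypotheses through every stage. The components, scores, and beam width here are invented for illustration and are not the authors' implementation.

    # Beam search across pipeline components: keep the k best joint
    # hypotheses (by summed log-score) rather than each stage's 1-best.
    import heapq
    from typing import Callable

    Candidate = tuple[float, str]                 # (log-score, output)
    Component = Callable[[str], list[Candidate]]  # input -> scored candidates

    def beam_search(inp: str, components: list[Component], k: int) -> list[Candidate]:
        beam: list[Candidate] = [(0.0, inp)]
        for comp in components:
            expanded = [(score + s, out)          # joint score accumulates
                        for score, hyp in beam
                        for s, out in comp(hyp)]
            beam = heapq.nlargest(k, expanded)    # prune to the k best
        return beam

    # Toy "parser" and "relation extractor" components.
    def parser(sent: str) -> list[Candidate]:
        return [(-0.1, sent + "|parseA"), (-0.3, sent + "|parseB")]

    def extractor(parse: str) -> list[Candidate]:
        # The locally worse parse yields the better end-to-end result here.
        s = -0.05 if parse.endswith("parseB") else -0.4
        return [(s, parse + "|rel")]

    print(beam_search("s", [parser, extractor], k=2))
    # Greedy 1-best commits to parseA and scores -0.5 end-to-end;
    # the beam recovers parseB|rel at -0.35.

    The example makes the cascading-error point concrete: a greedy pipeline locks in the parser's 1-best choice and loses the globally better hypothesis, while a beam of width 2 recovers it at the cost of extra computation.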