Data Ingestion for the Connected World

Cansu Aslantas; Jiang Du; John Meehan; Nesime Tatbul; Stan Zdonik

Publication date: 24 April 2020

Abstract

In this paper, we argue that in many "Big Data" applications, getting data into the system correctly and at scale via traditional ETL (Extract, Transform, and Load) processes is a fundamental roadblock to performing timely analytics or making real-time decisions. The best way to address this problem is to build a new architecture for ETL that takes advantage of the push-based nature of a stream processing system. We discuss the requirements for a streaming ETL engine and describe a generic architecture that satisfies those requirements. We also describe our implementation of streaming ETL using a scalable messaging system (Apache Kafka), a transactional stream processing system (S-Store), and a distributed polystore (Intel's BigDAWG), and we propose a new time-series database optimized to handle ingestion internally.

Available Versions

CiteSeerX

oai:CiteSeerX.psu:10.1.1.1072....

Last time updated on 07/12/2020