A Grammar for Reproducible and Painless Extract-Transform-Load Operations on Medium Data

Baumer, Benjamin S.

Authors: Benjamin S. Baumer
Publication date: 23 May 2018

Abstract

Many interesting data sets available on the Internet are of medium size: too big to fit into a personal computer's memory, but not so large that they won't fit comfortably on its hard disk. In the coming years, data sets of this magnitude will inform vital research in a wide array of application domains. However, due to a variety of constraints, they are cumbersome to ingest, wrangle, analyze, and share in a reproducible fashion. These obstructions hamper thorough peer review and thus disrupt the forward progress of science. We propose a predictable and pipeable framework for R (the state-of-the-art statistical computing environment) that leverages SQL (the venerable database architecture and query language) to make reproducible research on medium data a painless reality.

Comment: 30 pages, plus supplementary material
