Many interesting data sets available on the Internet are of a medium
size---too big to fit into a personal computer's memory, but not so large that
they won't fit comfortably on its hard disk. In the coming years, data sets of
this magnitude will inform vital research in a wide array of application
domains. However, due to a variety of constraints they are cumbersome to
ingest, wrangle, analyze, and share in a reproducible fashion. These
obstructions hamper thorough peer-review and thus disrupt the forward progress
of science. We propose a predictable and pipeable framework for R (the
state-of-the-art statistical computing environment) that leverages SQL (the
venerable database architecture and query language) to make reproducible
research on medium data a painless reality.Comment: 30 pages, plus supplementary material