thesis
Unlocking Large-Scale Genomics
Abstract
The dramatic progress in DNA sequencing technology over the last decade,
with the revolutionary introduction of next-generation sequencing, has brought
with it both opportunities and difficulties. Indeed, the opportunity to study the
genomes of any species at an unprecedented level of detail has been accompanied
by two difficulties: scaling analysis to handle the tremendous data
generation rates of the sequencing machinery, and scaling operational procedures
to handle the increasing sample sizes in ever larger sequencing studies.
This dissertation presents work that strives to address both these problems.
The first contribution, inspired by the success of data-driven industry, is the
Seal suite of tools which harnesses the scalability of the Hadoop framework
to accelerate the analysis of sequencing data and keep up with the sustained
throughput of the sequencing machines. The second contribution, addressing
the second problem, is a system developed to automate the standard
analysis procedures at a typical sequencing center. Additional work is presented
to make the first two contributions compatible with each other, so as to
provide a complete solution for a sequencing operation and to simplify their
use. Finally, the work presented here has been integrated into the production
operations at the CRS4 Sequencing Lab, helping it scale its operation while
reducing personnel requirements.