thesis

Unlocking Large-Scale Genomics

Abstract

The dramatic progress in DNA sequencing technology over the last decade, with the revolutionary introduction of next-generation sequencing, has brought with it opportunities and difficulties. Indeed, the opportunity to study the genomes of any species at an unprecedented level of detail has come accompanied by the difficulty in scaling analysis to handle the tremendous data generation rates of the sequencing machinery and scaling operational procedures to handle the increasing sample sizes in ever larger sequencing studies. This dissertation presents work that strives to address both these problems. The first contribution, inspired by the success of data-driven industry, is the Seal suite of tools which harnesses the scalability of the Hadoop framework to accelerate the analysis of sequencing data and keep up with the sustained throughput of the sequencing machines. The second contribution, addressing the second problem, is a system is developed to automate the standard analysis procedures at a typical sequencing center. Additional work is presented to make the first two contributions compatible with each other, as to provide a complete solution for a sequencing operation and to simplify their use. Finally, the work presented here has been integrated into the production operations at the CRS4 Sequencing Lab, helping it scale its operation while reducing personnel requirements

    Similar works