
The accessibility and scalability of gene family analysis

Abstract

Gene family detection gives us a better understanding of how different genomes are related. At UNH, we have a pipeline that computes these families using a variety of methods. However, the pipeline is inefficient and performs poorly on large numbers of genomes. It is composed of many Perl scripts, which are complex to use and require the data to be organized in a specific way at each step. As a result, all users of the pipeline must undergo training to understand each step and the intricacies of each script. The goal of my thesis is two-fold. First, I have optimized the scripts used in determining the gene families. This allows users to run gene family analysis on any number of genomes without using excessive amounts of memory. Second, I have created a web interface for the pipeline. Each user is given an account that they can use to create pipeline projects. Within a project, users simply upload their data and create the jobs they wish to run, and the web interface takes care of the details: the server structures their data in the correct form, and the pipeline scripts are run automatically. The results are produced in an easy-to-understand format and can be downloaded by the users. We have also created a machine image containing the interface and all the tools needed to run the pipeline, and have made it publicly available on the Amazon Elastic Compute Cloud.
