An Analysis of Public Phenotype/Genotype Data with Arvados

Abstract

Gaining the credentials needed to analyze sensitive data can be difficult for a researcher, and especially for a student. Genomic data in particular are potentially identifiable, so individuals are often unwilling to make them available to bioinformaticians. The Harvard Personal Genome Project and the 1000 Genomes Project curate the genomes of volunteers who are willing to share them with biomedical researchers to aid the future of biology and genetics. Curoverse develops Arvados, an open-source data analysis tool that runs complex analyses on large datasets across a cluster of computers through “pipelines” written in Common Workflow Language. A team at the Università Degli Studi Di Padova in Italy developed a tool named “BOOGIE” [BOOGIE: Predicting Blood Groups from High Throughput Sequencing Data, Giollo, M. et al.], which analyzes a genome to predict a blood type and which its authors report to be 94% accurate. The goal of this project was to use Arvados to run BOOGIE on genomes available from the Personal Genome Project and the 1000 Genomes Project and to compare the results to ethnicity data provided in genomic surveys, ultimately determining whether the predictions match readily available ethnicity and blood type information. A pipeline incorporating BOOGIE through a Docker image was written in Arvados to analyze the datasets. In under 10 hours, the pipeline ran BOOGIE on all 606 available genomes: 173 from the Personal Genome Project and 433 from the 1000 Genomes Project. After the results were downloaded from Arvados and compared with the survey data provided by the Personal Genome Project using a Python script, BOOGIE was rated at 86.67% accuracy, having correctly predicted 39 of 45 blood types from the Personal Genome Project.
The survey data associated a blood type and an ethnicity with each genome analyzed, and these data were used to compare blood type with ethnicity across participants. The Personal Genome Project and the 1000 Genomes Project make genomic data accessible and easily available for everyone to use. Arvados records the work performed and simplifies such analyses by using Docker images and pipelines. In addition, Arvados supports the analysis of massive datasets containing gigabytes to petabytes of information, aiming to provide an efficient, common solution for data management across many platforms.
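
The comparison of BOOGIE predictions with survey responses can be illustrated with a short Python sketch. This is not the script used in the project. The function names, the participant IDs, and the assumption that both data sources are available as dictionaries keyed by participant ID are all hypothetical, chosen only to show how a figure such as 39 of 45 (86.67%) would be computed.

```python
def normalize(blood_type):
    """Canonicalize a blood type string, e.g. 'a +' becomes 'A+'."""
    return blood_type.replace(" ", "").upper()


def score_predictions(predicted, surveyed):
    """Compare predicted and self-reported blood types.

    Both arguments are dicts mapping a participant ID to a blood type
    string (an assumed format, not the project's actual one). Only
    participants present in both dicts are scored.
    Returns (correct, compared, accuracy_percent).
    """
    shared = sorted(set(predicted) & set(surveyed))
    correct = sum(
        normalize(predicted[pid]) == normalize(surveyed[pid]) for pid in shared
    )
    accuracy = 100.0 * correct / len(shared) if shared else 0.0
    return correct, len(shared), accuracy


if __name__ == "__main__":
    # Made-up participant IDs and values, for illustration only.
    predicted = {"p001": "A+", "p002": "O-", "p003": "B+"}
    surveyed = {"p001": "a +", "p002": "O+", "p004": "AB+"}
    correct, compared, accuracy = score_predictions(predicted, surveyed)
    print(f"{correct}/{compared} correct ({accuracy:.2f}%)")
```

Scoring only participants who appear in both sources reflects the situation described above, where 606 genomes were processed but only 45 Personal Genome Project participants had a blood type available for comparison.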

    Last time updated on 05/01/2018