Unraveling the genomic landscape of ancestry and disease with gene expression data

Abstract

The field of disease genomics aims to better understand the molecular mechanisms that underlie a disease and allow it to propagate. Additionally, it is well known that diseases disproportionately affect different populations around the world. This disproportionality is the result of many genetic and socioenvironmental factors that influence any given individual. Everyone’s genome can be thought of as a unique genomic landscape made up of many SNPs and indels that they share with others of their ancestry. Understanding this genomic landscape and how it affects disease prognosis and response to treatment is the goal of personalized medicine. Thanks to the many studies carried out to better understand COVID-19, cancer, and other diseases, hundreds of terabytes of RNA-Seq data are available to the public. However, much of this data does not report a study participant’s ancestry, and when it does, the label is often vague (Black, White, Asian) and left to the discretion of the researcher or the study participant, which allows for human error. This dissertation introduces a tool, RNA-Seq inferred ancestry and disease (RIAD), which can infer ancestry for the five superpopulations (African, East Asian, South Asian, European, and American) with a high degree of accuracy. Furthermore, RIAD can call germline and somatic mutations using solely RNA-Seq data and can also infer ancestry from genomic data. In addition to unraveling the complex genomic landscapes of individuals, this dissertation presents statistical methods for better identifying cancer driver mutations in the overwhelming presence of passenger mutations that have no effect on the cancer. Lastly, the SARS-CoV-2 orphan gene, ORF10, is analyzed using state-of-the-art 3D protein structure prediction software, and ORF10 variants are correlated with clinical severity using over 210K ORF10 sequences from a clinical dataset.
