The field of disease genomics aims to better understand the molecular mechanisms that
underlie a disease and allow it to propagate. It is also well known that diseases
disproportionately affect different populations around the world. This disparity is the
result of many genetic and socioenvironmental factors that influence any given individual.
Each individual’s genome can be thought of as a unique genomic landscape made up of many SNPs
and indels that are shared with others of the same ancestry. Understanding this genomic landscape and how it
affects disease prognosis and response to treatment is the goal of personalized medicine.
Thanks to the many studies carried out to better understand COVID-19, cancer, and other
diseases, hundreds of terabytes of RNA-Seq data are publicly available. However, much of
this data does not report a study participant’s ancestry, and when ancestry is reported, it is
often vague (e.g., Black, White, Asian) and left to the discretion of the researcher or the study
participant, which allows for the possibility of human error. This dissertation introduces a tool,
RNA-Seq inferred ancestry and disease (RIAD), which can infer ancestry for the five
superpopulations (African, East Asian, South Asian, European, and American) to a high degree of
accuracy. Furthermore, RIAD can call germline and somatic mutations using solely
RNA-Seq data and can also infer ancestry from genomic data.
In addition to unraveling the complex genomic landscapes of individuals, this dissertation
presents statistical methods for better identifying cancer driver mutations in the overwhelming
presence of passenger mutations that have no effect on the cancer. Lastly, the SARS-CoV-2
orphan gene ORF10 is analyzed using state-of-the-art 3D protein structure prediction software,
and ORF10 variants are correlated with clinical severity using over 210,000 ORF10 sequences
from a clinical dataset.