238 research outputs found
Doctor of Philosophy
Advances in technology have produced efficient and powerful scientific instruments for measuring biological phenomena. In particular, modern microscopes and next-generation sequencing machines produce data at such a rate that manual analysis is no longer practical or feasible for meaningful scientific inquiries. Thus, there is a great need for computational strategies to organize and analyze huge amounts of data produced by biological experiments. My work presents computational strategies and software solutions for application in image analysis, human variant prioritization, and metagenomics. The information content of images can be leveraged to answer an extremely broad spectrum of questions ranging from inquiries about basic biological processes to highly specific, application-driven inquiries like the efficacy of a pharmaceutical drug. Modern microscopes can produce images at a rate at which rigorous manual analysis is impossible. I have created software pipelines that automate image analysis in two specific application domains. In addition, I discuss general image analysis strategies that can be applied to a wide variety of problems. There are tens of millions of known human genetic variants. Prioritizing human variants based on how likely they are to cause disease is of huge importance because of the potential impact on human health. Current variant prioritization methods are limited by their scope, efficiency, and accuracy. I present a variant prioritization method, the VAAST variant prioritizer, which is superior in its scope, efficiency, and accuracy to existing variant prioritization methods. The rise of next-generation sequencing enables huge quantities of sequence to be generated in a short period of time. No field of study has been affected by rapid sequencing more than metagenomics.
Metagenomics, the genomic analysis of a population of microorganisms, has important implications for pathogen detection because metagenomics enables the culture-free detection of microorganisms. I have created Taxonomer, a comprehensive metagenomics pipeline that enables the real-time analysis of read datasets derived from environmental samples.
Patterns of adaptive and purifying selection in the genomes of phocid seals
Modern genomic sequencing technologies provide the opportunity to address long-standing questions in molecular evolution with empirical data. In this dissertation, I combine this new technology with advances in statistical population genetics to describe how deleterious mutations and adaptive evolution have shaped the genomic evolution of phocid seals. In Chapter 1, I model historical demographic processes using whole genome sequences of eight seal taxa: the Hawaiian monk seal, the Mediterranean monk seal, the northern elephant seal, the southern elephant seal, the Weddell seal, the grey seal, the Baltic ringed seal, and the Saimaa ringed seal. Through this, I establish that the endangered monk seal species have long-term small population sizes, as do grey seals. On the other hand, the elephant seals, Weddell seal, and ringed seals had much larger populations in the distant past. Notably, the most recent glaciation (c. 12,000-120,000 years ago) appeared to have a dramatic effect on phocid populations throughout the world. With this knowledge of historical population sizes, I test a fundamental premise of molecular evolution: that the rate of mutation accumulation will be higher in smaller populations due to less efficient purifying selection. I show that there is not a higher substitution rate or overall rate of mutation accumulation in the long-term small populations of monk seals compared to other seal species. On the contrary, overall rates of mutation accumulation appear to be lower in monk seals and grey seals, both of which show smaller long-term population sizes compared to the other species. This suggests that the distribution of fitness effects may differ across seal species in a way that depends on population size and history.
In Chapter 2, I use population genomic data and a newly developed statistical model to detect positive selection in the protein coding genes of phocid seals (monk seals, elephant seals, Weddell seals, grey seals, and ringed seals). In addition, I use a phylogenetic framework to detect parallel evolution across multiple lineages of seals, relating to traits such as polar adaptations, hypoxia tolerance during long dives, and mating behavior. I develop a new bioinformatic tool to process raw BAM files and transform them into usable input for MASS-PRF, a tool to detect selection from polymorphism and divergence data. Through these analyses, I identify thousands of genes that show positive selection across multiple seal lineages. Genes associated with immune function, sperm competition, and blubber composition show positive selection in all lineages, highlighting how complex and important these traits are in seals. In the deep-diving elephant seals, the list of positively selected genes was enriched for genes relating to cardiac muscle development and function, providing important insight into how adaptive protein evolution has helped these seals survive sustained bradycardia during dives that last over an hour. Weddell seals, on the other hand, showed enrichment for genes relating to neuronal development, which may relate to molecular adaptations that allow their neurons to survive hypoxic conditions during long dives. Because MASS-PRF allows for site-specific tests of selection, I am able to show that parallel evolution in the same genes across lineages may or may not involve positive selection at the same genic site. In Chapter 3, I use the population genomic data from Chapter 2 to model the distribution of fitness effects (DFE) of segregating alleles in each population. Due to sample size issues, only parameters for the Hawaiian monk seal were confidently estimated.
Using the site frequency spectrum of synonymous sites, I show that the Hawaiian monk seal has had a long-term effective population size below 5000, in agreement with the results from Chapter 1. In addition, I show that after the arrival of humans in Hawaii, the monk seal experienced a 95% decline in effective population size, in line with the current census size of fewer than 1500 individuals. Conditioning the model on the Hawaiian monk seal demographic parameters, I am able to estimate the shape of the DFE in Hawaiian monk seals using the site frequency spectrum of nonsynonymous sites. I estimate a DFE for the Hawaiian monk seal that is nearly identical to the one estimated in humans. This DFE, however, is different from the one estimated for mouse, with the seal and human DFEs having a higher proportion of more strongly deleterious alleles. This pattern cannot be explained by phylogenetic relatedness or differences in phenotypic complexity, but instead is likely related to differences in effective population size. I discuss how the geometric model of evolution predicts such a shift in DFE in response to the epistatic effect of fixed deleterious mutations in smaller populations.
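For readers unfamiliar with the site frequency spectrum mentioned in this abstract, the following is a minimal illustrative sketch, not the dissertation's pipeline. It assumes a hypothetical input: one derived-allele count per site, taken from a sample of `n_chromosomes` chromosomes. The function names are invented for this example.

```python
def site_frequency_spectrum(derived_counts, n_chromosomes):
    """Unfolded SFS: entry k-1 is the number of segregating sites
    at which the derived allele is seen exactly k times (1 <= k <= n-1)."""
    sfs = [0] * (n_chromosomes - 1)
    for k in derived_counts:
        # Sites with 0 or n derived copies are not segregating in the sample.
        if 0 < k < n_chromosomes:
            sfs[k - 1] += 1
    return sfs


def fold_spectrum(sfs, n_chromosomes):
    """Folded SFS: counts by minor-allele count, for use when the
    ancestral state of each site is unknown."""
    folded = []
    for j in range(1, n_chromosomes // 2 + 1):
        other = n_chromosomes - j
        if other == j:
            folded.append(sfs[j - 1])
        else:
            folded.append(sfs[j - 1] + sfs[other - 1])
    return folded


# Hypothetical toy data: seven sites sampled from 4 chromosomes.
counts = [1, 1, 2, 3, 0, 4, 3]
unfolded = site_frequency_spectrum(counts, 4)
folded = fold_spectrum(unfolded, 4)
```

Demographic and DFE inference methods typically compare an observed spectrum like this one, computed separately for synonymous and nonsynonymous sites, with the spectrum expected under a candidate model.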
Analyses of non-coding somatic drivers in 2,658 cancer whole genomes.
The discovery of drivers of cancer has traditionally focused on protein-coding genes [1-4]. Here we present analyses of driver point mutations and structural variants in non-coding regions across 2,658 genomes from the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium [5] of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA). For point mutations, we developed a statistically rigorous strategy for combining significance levels from multiple methods of driver discovery that overcomes the limitations of individual methods. For structural variants, we present two methods of driver discovery, and identify regions that are significantly affected by recurrent breakpoints and recurrent somatic juxtapositions. Our analyses confirm previously reported drivers [6,7], raise doubts about others and identify novel candidates, including point mutations in the 5' region of TP53, in the 3' untranslated regions of NFKBIZ and TOB1, focal deletions in BRD4 and rearrangements in the loci of AKR1C genes. We show that although point mutations and structural variants that drive cancer are less frequent in non-coding genes and regulatory sequences than in protein-coding genes, additional examples of these drivers will be found as more cancer genomes become available.
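The abstract does not say how significance levels from multiple methods are combined. As a generic illustration of the idea only, the sketch below uses Fisher's method, a textbook approach that assumes the p-values are independent. It is a simplified stand-in, not the strategy developed in the paper; outputs of different driver-discovery methods run on the same data are generally correlated, which Fisher's method does not account for. The function name is invented for this example.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic -2 * sum(ln p_i) follows a chi-square distribution with
    2k degrees of freedom under the null. For even degrees of freedom the
    survival function has a closed form, so no external library is needed.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # equals statistic / 2
    # P(chi2_{2k} > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Hypothetical p-values for one genomic element from three methods.
combined = fisher_combine([0.01, 0.04, 0.20])
```

With a single input the function returns that p-value unchanged, which is a quick sanity check on the closed-form survival function.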
Possible A2E Mutagenic Effects on RPE Mitochondrial DNA from Innovative RNA-Seq Bioinformatics Pipeline
Mitochondria are subject to continuous oxidative stress stimuli that, over time, can impair their genome and lead to several pathologies, such as retinal degenerations. Our main purpose was the identification of mtDNA variants that might be induced by intense oxidative stress caused by N-retinylidene-N-retinylethanolamine (A2E), together with molecular pathways involving the genes carrying them, possibly linked to retinal degeneration. We performed a variant analysis comparison between transcriptome profiles of human retinal pigment epithelial (RPE) cells exposed to A2E and untreated ones, hypothesizing that A2E might act as a mutagenic compound towards mtDNA. To optimize analysis, we proposed an integrated approach that involved the complementary use of the most recent algorithms applied to mtDNA data, characterized by a mixed output coming from several tools and databases. An increased number of variants emerged following treatment. Variants mainly occurred within mtDNA coding sequences, corresponding with either the polypeptide-encoding genes or the RNA. Time-dependent impairments involved all oxidative phosphorylation complexes, suggesting serious damage to adenosine triphosphate (ATP) biosynthesis that can result in cell death. The obtained results could be incorporated into clinical diagnostic settings, as they are hypothesized to modulate the phenotypic expression of mtDNA pathogenic variants, drastically improving the field of precision molecular medicine.
Population Genomics of Polistes Wasps
The molecular mechanisms influencing the evolution of social behaviour in insects are of great interest and have been the focus of many recent studies. Chapter one of this thesis reviews several major hypotheses regarding the evolution of sociality. Chapter two outlines the methodological steps taken to generate a high-quality population genomic data set for primitively eusocial paper wasps in the genus Polistes. The third chapter of the thesis uses the dataset generated in chapter two to estimate patterns of natural selection on the Polistes genome, and to evaluate the importance of novel and caste-biased genes on the fitness of this primitively eusocial species.
New Insights Into Mitochondrial DNA Reconstruction and Variant Detection in Ancient Samples.
Ancient DNA (aDNA) studies are frequently focused on the analysis of the mitochondrial DNA (mtDNA), which is much more abundant than the nuclear genome and hence can be better retrieved from ancient remains. However, postmortem DNA damage and contamination make the data analysis difficult because of DNA fragmentation and nucleotide alterations. In this regard, the assessment of the heteroplasmic fraction in ancient mtDNA has always been considered an unachievable goal due to the complexity in distinguishing true endogenous variants from artifacts. We implemented and applied a computational pipeline for mtDNA analysis to a dataset of 30 ancient human samples from an Iron Age necropolis in Polizzello (Sicily, Italy). The pipeline includes several modules from well-established tools for aDNA analysis and a recently released variant caller, which was specifically conceived for mtDNA, applied for the first time to aDNA data. Through a fine-tuned filtering on variant allele sequencing features, we were able to accurately reconstruct nearly complete (>88%) mtDNA genomes for almost all the analyzed samples (27 out of 30), depending on the degree of preservation and the sequencing throughput, and to get a reliable set of variants allowing haplogroup prediction. Additionally, we provide guidelines to deal with possible artifact sources, including nuclear mitochondrial sequence (NumtS) contamination, an often-neglected issue in ancient mtDNA surveys. Potential heteroplasmy levels were also estimated, although most variants were likely homoplasmic, and validated by data simulations, proving that new sequencing technologies and software are sensitive enough to detect partially mutated sites in ancient genomes and discriminate true variants from artifacts. A thorough functional annotation of detected and filtered mtDNA variants was also performed for a comprehensive evaluation of these ancient samples.
On the Origin of Phenotypic Variation: Novel Technologies to Dissect Molecular Determinants of Phenotype
This thesis describes the conception, design, and development of novel computational tools, theoretical models, and experimental techniques applied to the dissection of molecular factors underlying phenotypic variation. The first part of my work is focused on finding rare genetic variants in pooled DNA samples, leading to the development of a novel set of algorithms, SNPseeker and SPLINTER, applied to next-generation sequencing data. The second part of my work describes the creation of a reporter system for DNA methylation for the purpose of dissecting the genetic contribution of tissue-specific patterns of DNA methylation across the genome. Finally, the last part of my work is focused on understanding the basis of stochastic variation in gene expression, with a focus on modeling and dissecting the relationship between single-cell protein variance and mean at a genome-wide scale.
Integrating Human Population Genetics And Genomics To Elucidate The Etiology Of Brain Disorders
Brain disorders present a significant burden on affected individuals, their families and society at large. Existing diagnostic tests suffer from a lack of genetic biomarkers, particularly for substance use disorders, such as alcohol dependence (AD). Numerous studies have demonstrated that AD has a genetic heritability of 40-60%. The existing genetics literature of AD has primarily focused on linkage analyses in small family cohorts and more recently on genome-wide association analyses (GWAS) in large case-control cohorts, fueled by rapid advances in next generation sequencing (NGS). Numerous AD-associated genomic variations are present at a common frequency in the general population, making these variants of public health significance. However, known AD-associated variants explain only a fraction of the expected heritability. In this dissertation, we demonstrate that systems biology applications that integrate evolutionary genomics, rare variants and structural variation can dissect the genetic architecture of AD and elucidate its heritability.
We identified several complex human diseases, including AD and other brain disorders, as potential targets of natural selection forces in diverse world populations. Further evidence of natural selection forces affecting AD was revealed when we identified an association between eye color, a trait under strong selection, and AD. These findings provide strong support for conducting GWAS on brain disorder phenotypes. However, with the ever-increasing abundance of rare genomic variants and large cohorts of multi-ethnic samples, population stratification becomes a serious confounding factor for GWAS. To address this problem, we designed a novel approach to identify ancestry informative single nucleotide polymorphisms (SNPs) for population stratification adjustment in association analyses. Furthermore, to leverage untyped variants from genotyping arrays – particularly rare variants – for GWAS and meta-analysis through rapid imputation, we designed a tool that converts genotype definitions across various array platforms.
To further elucidate the genetic heritability of brain disorders, we designed approaches aimed at identifying Copy Number Variations (CNVs) and viral insertions into the human genome. We conducted the first CNV-based whole genome meta-analysis for AD. We also designed an integrated approach to estimate the sensitivity of NGS-based methods of viral insertion detection. For the first time in the literature, we identified herpesvirus in NGS data from an Alzheimer’s disease brain sample.
The work in this dissertation represents a three-faceted advance in our understanding of brain disease etiology: 1) evolutionary genomic insights, 2) novel resources and tools to leverage rare variants, and 3) the discovery of disease-associated structural genomic aberrations. Our findings have broad implications for the genetics of complex human disease and hold promise for delivering clinically useful knowledge and resources.
Development of Integrated Machine Learning and Data Science Approaches for the Prediction of Cancer Mutation and Autonomous Drug Discovery of Anti-Cancer Therapeutic Agents
Few technological ideas have captivated the minds of biochemical researchers to the degree that machine learning (ML) and artificial intelligence (AI) have. Over the last few years, advances in the ML field have driven the design of new computational systems that improve with experience and are able to model increasingly complex chemical and biological phenomena. In this dissertation, we capitalize on these achievements and use machine learning to study drug receptor sites and design drugs to target these sites. First, we analyze the significance of various single nucleotide variations and assess their rate of contribution to cancer. Following that, we use a portfolio of machine learning and data science approaches to design new drugs to target protein kinase inhibitors. We show that these techniques exhibit strong promise in aiding cancer research and drug discovery.
Interpretation of Mutations, Expression, Copy Number in Somatic Breast Cancer: Implications for Metastasis and Chemotherapy
Breast cancer (BC) patient management has been transformed over the last two decades due to the development and application of genome-wide technologies. The vast amounts of data generated by these assays, however, create new challenges for accurate and comprehensive analysis and interpretation. This thesis describes novel methods for fluorescence in-situ hybridization (FISH), array comparative genomic hybridization (aCGH), and next generation DNA- and RNA-sequencing, to improve upon current approaches used for these technologies. An ab initio algorithm was implemented to identify genomic intervals of single copy and highly divergent repetitive sequences that were applied to FISH and aCGH probe design. FISH probes with higher resolution than commercially available reagents were developed and validated on metaphase chromosomes. An aCGH microarray was developed that had improved reproducibility compared to the standard Agilent 44K array, which was achieved by placing oligonucleotide probes distant from conserved repetitive sequences.
Splicing mutations are currently underrepresented in genome-wide sequencing analyses, and there are limited methods to validate genome-wide mutation predictions. This thesis describes Veridical, a program developed to statistically validate aberrant splicing caused by a predicted mutation. Splicing mutation analysis was performed on a large subset of BC patients previously analyzed by The Cancer Genome Atlas. This analysis revealed an elevated number of splicing mutations in genes involved in NCAM pathways in basal-like and HER2-enriched lymph node positive tumours. Genome-wide technologies were leveraged further to develop chemosensitivity models that predict BC response to paclitaxel and gemcitabine. A type of machine learning, called support vector machines (SVM), was used to create predictive models from small sets of genes biologically relevant to drug disposition or resistance. The SVM models generated were able to predict sensitivity in two groups of independent patient data.
High variability between individuals requires more accurate and higher resolution genomic data. However, the data themselves are insufficient; also needed are more insightful analytical methods to fully exploit these data. This dissertation presents both improvements in data quality and accuracy as well as analytical procedures, with the aim of detecting and interpreting critical genomic abnormalities that are hallmarks of BC subtypes, metastasis and therapy response.
- …