2,851 research outputs found

    A Scientific Workflow System For Genomic Data Analysis

    Scientific workflows have become increasingly popular as a new computing paradigm for scientists to design and execute complex and distributed scientific processes to enable and accelerate many scientific discoveries. Although several scientific workflow management systems (SWFMSs) have been developed, there is a great need for an integrated scientific workflow system that enables the design and execution of higher-level scientific workflows, which integrate heterogeneous scientific workflows enacted by existing SWFMSs. On the one hand, science is becoming increasingly collaborative today, requiring an integrated solution that combines the features and capabilities of different SWFMSs, which are typically developed and optimized towards one single discipline. On the other hand, such an integrated environment can immediately leverage existing and emerging techniques and strengths of various SWFMSs and their supported execution environments, such as Cluster, Grid, and Cloud. The main contributions of this dissertation are: 1) We propose a scientific workflow system, called GENOMEFLOW, to design, develop, and execute higher-level scientific workflows, whose workflow tasks are themselves scientific workflows enacted by existing SWFMSs; 2) We propose a workflow scheduling algorithm, called GSA, to enable the parallel execution of such heterogeneous scientific workflows in their native heterogeneous environments; and 3) We implemented GENOMEFLOW towards the life science community and developed several GENOMEFLOW scientific workflows to demonstrate the capabilities of our system for genome data analysis applications.

    Islands of linkage in an ocean of pervasive recombination reveals two-speed evolution of human cytomegalovirus genomes

    Human cytomegalovirus (HCMV) infects most of the population worldwide, persisting throughout the host's life in a latent state with periodic episodes of reactivation. While typically asymptomatic, HCMV can cause fatal disease among congenitally infected infants and immunocompromised patients. These clinical issues are compounded by the emergence of antiviral resistance and the absence of an effective vaccine, the development of which is likely complicated by the numerous immune evasins encoded by HCMV to counter the host's adaptive immune responses, a feature that facilitates frequent super-infections. Understanding the evolutionary dynamics of HCMV is essential for the development of effective new drugs and vaccines. By comparing viral genomes from uncultivated or low-passaged clinical samples of diverse origins, we observe evidence of frequent homologous recombination events, both recent and ancient, and no structure of HCMV genetic diversity at the whole-genome scale. Analysis of individual gene-scale loci reveals a striking dichotomy: while most of the genome is highly conserved, recombines essentially freely and has evolved under purifying selection, 21 genes display extreme diversity, structured into distinct genotypes that do not recombine with each other. Most of these hyper-variable genes encode glycoproteins involved in cell entry or escape of host immunity. Evidence that half of them have diverged through episodes of intense positive selection suggests that rapid evolution of hyper-variable loci is likely driven by interactions with host immunity. It appears that this process is enabled by recombination unlinking hyper-variable loci from strongly constrained neighboring sites. 
It is conceivable that viral mechanisms facilitating super-infection have evolved to promote recombination between diverged genotypes, allowing the virus to continuously diversify at key loci to escape immune detection, while maintaining a genome optimally adapted to its asymptomatic infectious lifecycle.

    Population genetics of identity by descent

    Recent improvements in high-throughput genotyping and sequencing technologies have afforded the collection of massive, genome-wide datasets of DNA information from hundreds of thousands of individuals. These datasets, in turn, provide unprecedented opportunities to reconstruct the history of human populations and detect genotype-phenotype association. Recently developed computational methods can identify long-range chromosomal segments that are identical across samples, and have been transmitted from common ancestors that lived tens to hundreds of generations in the past. These segments reveal genealogical relationships that are typically unknown to the carrying individuals. In this work, we demonstrate that such identical-by-descent (IBD) segments are informative about a number of relevant population genetics features: they enable the inference of details about past population size fluctuations, migration events, and they carry the genomic signature of natural selection. We derive a mathematical model, based on coalescent theory, that allows for a quantitative description of IBD sharing across purportedly unrelated individuals, and develop inference procedures for the reconstruction of recent demographic events, where classical methodologies are statistically underpowered. We analyze IBD sharing in several contemporary human populations, including representative communities of the Jewish Diaspora, Kenyan Maasai samples, and individuals from several Dutch provinces, in all cases retrieving evidence of fine-scale demographic events from recent history. Finally, we expand the presented model to describe distributions for those sites in IBD shared segments that harbor mutation events, showing how these may be used for the inference of mutation rates in humans and other species. Comment: Ph.D. thesis
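The idea of long chromosomal segments that are identical across two samples can be illustrated with a deliberately naive sketch. It assumes two phased haplotypes given as equal-length sequences of allele codes and reports maximal runs of identical alleles of at least `min_len` sites. The function name and the site-count threshold are hypothetical. Identity by state over a run of sites is only a crude proxy for identity by descent, and real IBD detection methods also handle genotyping error, phasing error, and genetic map distance, none of which is modeled here.

```python
def shared_segments(hap_a, hap_b, min_len):
    """Return (start, end) index pairs, end exclusive, of maximal runs of
    sites where the two haplotypes carry identical alleles.

    Only runs spanning at least `min_len` sites are kept. This measures
    identity by state, used here as a rough stand-in for IBD.
    """
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same sites")
    segments = []
    start = None  # index where the current matching run began, if any
    for i, (a, b) in enumerate(zip(hap_a, hap_b)):
        if a == b:
            if start is None:
                start = i
        else:
            # A mismatch closes the current run.
            if start is not None and i - start >= min_len:
                segments.append((start, i))
            start = None
    # Close a run that extends to the last site.
    if start is not None and len(hap_a) - start >= min_len:
        segments.append((start, len(hap_a)))
    return segments


# Two toy haplotypes that differ at sites 2 and 7.
print(shared_segments("0101100110", "0111100010", min_len=3))
```

Lowering `min_len` admits shorter runs, which in real data are increasingly likely to match by chance rather than through recent shared ancestry.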

    Mapping the Landscape of Mutation Rate Heterogeneity in the Human Genome: Approaches and Applications

    All heritable genetic variation is ultimately the result of mutations that have occurred in the past. Understanding the processes which determine the rate and spectra of new mutations is therefore fundamentally important in efforts to characterize the genetic basis of heritable disease, infer the timing and extent of past demographic events (e.g., population expansion, migration), or identify signals of natural selection. This dissertation aims to describe patterns of mutation rate heterogeneity in detail, identify factors contributing to this heterogeneity, and develop methods and tools to harness such knowledge for more effective and efficient analysis of whole-genome sequencing data. In Chapters 2 and 3, we catalog granular patterns of germline mutation rate heterogeneity throughout the human genome by analyzing extremely rare variants ascertained from large-scale whole-genome sequencing datasets. In Chapter 2, we describe how mutation rates are influenced by local sequence context and various features of the genomic landscape (e.g., histone marks, recombination rate, replication timing), providing detailed insight into the determinants of single-nucleotide mutation rate variation. We show that these estimates reflect genuine patterns of variation among de novo mutations, with broad potential for improving our understanding of the biology of underlying mutation processes and the consequences for human health and evolution. These estimated rates are publicly available at http://mutation.sph.umich.edu/. In Chapter 3, we introduce a novel statistical model to elucidate the variation in rate and spectra of multinucleotide mutations throughout the genome. We catalog two major classes of multinucleotide mutations: those resulting from error-prone translesion synthesis, and those resulting from repair of double-strand breaks. 
In addition, we identify specific hotspots for these unique mutation classes and describe the genomic features associated with their spatial variation. We show how these multinucleotide mutation processes, along with sample demography and mutation rate heterogeneity, contribute to the overall patterns of clustered variation throughout the genome, promoting a more holistic approach to interpreting the source of these patterns. In Chapter 4, we develop Helmsman, a computationally efficient software tool to infer mutational signatures in large samples of cancer genomes. By incorporating parallelization routines and efficient programming techniques, Helmsman performs this task up to 300 times faster and with a memory footprint 100 times smaller than existing mutation signature analysis software. Moreover, Helmsman is the only such program capable of directly analyzing arbitrarily large datasets. The Helmsman software can be accessed at https://github.com/carjed/helmsman. Finally, in Chapter 5, we present a new method for quality control in large-scale whole-genome sequencing datasets, using a combination of dimensionality reduction algorithms and unsupervised anomaly detection techniques. Just as the mutation spectrum can be used to infer the presence of underlying mechanisms, we show that the spectrum of rare variation is a powerful and informative indicator of sample sequencing quality. Analyzing three large-scale datasets, we demonstrate that our method is capable of identifying samples affected by a variety of technical artifacts that would otherwise go undetected by standard ad hoc filtering criteria. We have implemented this method in a software package, Doomsayer, available at https://github.com/carjed/doomsayer. PhD, Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/147537/1/jedidiah_1.pd
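As an illustration of what a mutation spectrum with local sequence context looks like in practice, the sketch below tallies single-nucleotide variants into the conventional 96 trinucleotide categories used in mutational signature analysis (six pyrimidine-referenced substitution types, each with four possible 5' and four possible 3' neighbors). It is not taken from Helmsman or Doomsayer. The function names and the input format, a list of `(three-base reference context, alternate base)` pairs, are assumptions made for this example.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def revcomp(seq):
    """Reverse complement of a DNA string of A/C/G/T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def classify_snv(context, alt):
    """Label an SNV by its trinucleotide context, e.g. 'A[C>T]G'.

    `context` is the 3-base reference sequence centered on the mutated
    site and `alt` is the alternate base. Variants whose reference base
    is a purine (A or G) are folded onto the opposite strand so that the
    reported reference base is always C or T.
    """
    context = context.upper()
    alt = alt.upper()
    if len(context) != 3:
        raise ValueError("context must be exactly three bases")
    ref = context[1]
    if ref == alt:
        raise ValueError("alternate base equals reference base")
    if ref in "AG":
        context = revcomp(context)
        alt = COMPLEMENT[alt]
        ref = context[1]
    return f"{context[0]}[{ref}>{alt}]{context[2]}"


def mutation_spectrum(variants):
    """Count variants per trinucleotide category."""
    return Counter(classify_snv(context, alt) for context, alt in variants)


# The first two toy variants are the same event seen from opposite strands.
spectrum = mutation_spectrum([("ACG", "T"), ("CGT", "A"), ("TTA", "C")])
print(spectrum)
```

Per-sample count vectors of this kind are the usual input to signature inference, and they could equally be built from rare variants as a per-sample quality summary.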

    Population Structure and Cryptic Relatedness in Genetic Association Studies

    We review the problem of confounding in genetic association studies, which arises principally because of population structure and cryptic relatedness. Many treatments of the problem consider only a simple "island" model of population structure. We take a broader approach, which views population structure and cryptic relatedness as different aspects of a single confounder: the unobserved pedigree defining the (often distant) relationships among the study subjects. Kinship is therefore a central concept, and we review methods of defining and estimating kinship coefficients, both pedigree-based and marker-based. In this unified framework we review solutions to the problem of population structure, including family-based study designs, genomic control, structured association, regression control, principal components adjustment and linear mixed models. The last solution makes the most explicit use of the kinships among the study subjects, and has an established role in the analysis of animal and plant breeding studies. Recent computational developments mean that analyses of human genetic association data are beginning to benefit from its powerful tests for association, which protect against population structure and cryptic kinship, as well as intermediate levels of confounding by the pedigree. Comment: Published at http://dx.doi.org/10.1214/09-STS307 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org
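To make the notion of a marker-based kinship estimate concrete, here is a minimal sketch of one commonly used moment estimator: the average over markers of the product of two individuals' centered allele counts, scaled by 4p(1-p), where p is the allele frequency. The review may define its estimators differently. The function names, the use of sample allele frequencies, and the skipping of monomorphic markers are all assumptions of this example.

```python
def allele_frequencies(genotypes):
    """Sample frequency of the counted allele at each marker.

    `genotypes` is a list of individuals, each a list of allele counts
    (0, 1 or 2) over the same markers.
    """
    n = len(genotypes)
    n_markers = len(genotypes[0])
    return [sum(ind[j] for ind in genotypes) / (2 * n) for j in range(n_markers)]


def kinship_matrix(genotypes):
    """Marker-based kinship estimates for every pair of individuals.

    Entry [a][b] averages (x_a - 2p)(x_b - 2p) / (4p(1 - p)) over the
    polymorphic markers. Estimates can be negative for pairs that share
    fewer alleles than expected under the sample frequencies.
    """
    freqs = allele_frequencies(genotypes)
    # Monomorphic markers carry no information and would divide by zero.
    used = [j for j, p in enumerate(freqs) if 0.0 < p < 1.0]
    if not used:
        raise ValueError("no polymorphic markers")
    n = len(genotypes)
    kin = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a, n):
            total = 0.0
            for j in used:
                p = freqs[j]
                total += ((genotypes[a][j] - 2 * p) * (genotypes[b][j] - 2 * p)
                          / (4 * p * (1 - p)))
            kin[a][b] = kin[b][a] = total / len(used)
    return kin


# Individuals 0 and 1 have identical genotypes; individual 2 is the opposite.
toy = [[0, 1, 2], [0, 1, 2], [2, 1, 0]]
for row in kinship_matrix(toy):
    print([round(value, 3) for value in row])
```

A matrix like this is the kind of input a linear mixed model uses to account for relatedness among study subjects.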

    Methods and Algorithms for Inference Problems in Population Genetics

    Inference of population history is a central problem of population genetics. The advent of large genetic data brings us not only opportunities to develop more accurate methods for inference problems, but also computational challenges. Thus, we aim to develop accurate methods and fast algorithms for problems in population genetics. Inference of admixture proportions is a classical statistical problem. We particularly focus on the problem of ancestry inference for ancestors. Standard methods implicitly assume that both parents of an individual have the same admixture fraction. However, this is rarely the case in real data. We develop a Hidden Markov Model (HMM) framework for estimating the admixture proportions of the immediate ancestors of an individual, i.e. a type of appropriation of an individual's admixture proportions into further subsets of ancestral proportions in the ancestors. Based on a genealogical model for admixture tracts, we develop an efficient algorithm for computing the sampling probability of the genome from a single individual, as a function of the admixture proportions of the ancestors of this individual. We show that the distribution and lengths of admixture tracts in a genome contain information about the admixture proportions of the ancestors of an individual. This allows us to perform probabilistic inference of admixture proportions of ancestors only using the genome of an extant individual. To better understand populations, we further study the species delimitation problem. It is a problem of determining the boundary between population and species. We propose a classification-based method to assign a set of populations to a number of species. Our new method uses summary statistics generated from genetic data to classify pairwise populations as either 'same species' or 'different species'. We show that machine learning can be used for species delimitation and scaled for large genomic data. 
It can also outperform Bayesian approaches, especially when gene flow is involved in the evolutionary process.

    The discovery of novel recessive genetic disorders in dairy cattle : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Animal Science at AL Rae Centre of Genetics and Breeding, Massey University, Palmerston North, New Zealand

    The selection of desirable characteristics in livestock has resulted in the transmission of advantageous genetic variants for generations. The advent of artificial insemination has accelerated the propagation of these advantageous genetic variants and led to tremendous advances in animal productivity. However, this intensive selection has led to the rapid uptake of deleterious alleles as well. Recently, a recessive mutation in the GALNT2 gene was found to dramatically impair growth and production traits in dairy cattle, causing small calf syndrome. The research presented here seeks to further investigate the presence and impact of recessive mutations in dairy cattle. A primary aim of genetics is to identify causal variants and understand how they act to manipulate a phenotype. As datasets have expanded, larger analyses are now possible and statistical methods to discover causal mutations have become commonplace. One such method, the genome-wide association study (GWAS), presents considerable exploratory utility in identifying quantitative trait loci (QTL) and causal mutations. GWAS have predominantly focused on identifying additive genetic effects, assuming that each allele at a locus acts independently of the other, whereas non-additive effects including dominant, recessive, and epistatic effects have been neglected. Here, we developed a single-locus non-additive GWAS model intended for the detection of dominant and recessive genetic mechanisms. We applied our non-additive GWAS model to growth, developmental, and lactation phenotypes in dairy cattle. We identified several candidate causal mutations that are associated with moderate to large deleterious recessive disorders of animal welfare and production. These mutations included premature-stop (MUS81, ITGAL, LRCH4, RBM34), splice disrupting (FGD4, GALNT2), and missense (PLCD4, MTRF1, DPF2, DOCK8, SLC25A4, KIAA0556, IL4R) variants, and these occur at surprisingly high frequencies in cattle. 
We further investigated these candidates for anatomical, molecular, and metabolic phenotypes to understand how these disorders might manifest. In some cases, these mutations were analogous to disorder-causing mutations in other species; these included Coffin-Siris syndrome (DPF2); Charcot Marie Tooth disease (FGD4); a congenital disorder of glycosylation (GALNT2); hyper Immunoglobulin-E syndrome (DOCK8); Joubert syndrome (KIAA0556); and mitochondrial disease (SLC25A4). These discoveries demonstrate that deleterious recessive mutations exist in dairy cattle at remarkably high frequencies and that we are able to detect these disorders through modern genotyping and phenotyping capabilities. These are important findings that can be used to improve the health and productivity of dairy cattle in New Zealand and internationally.
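The difference between additive and non-additive single-locus models comes down to how genotypes are coded before testing. The sketch below shows the standard additive, dominant, and recessive codings and a plain least-squares effect estimate for each. It is not the model developed in the thesis, whose details are not given here. The function names are hypothetical, and covariates, relatedness, and significance testing are all left out.

```python
def encode(genotype, model):
    """Code an alternate-allele count (0, 1 or 2) under a genetic model."""
    if genotype not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1 or 2")
    if model == "additive":
        return genotype                    # each copy adds the same effect
    if model == "dominant":
        return 1 if genotype >= 1 else 0   # one copy is enough
    if model == "recessive":
        return 1 if genotype == 2 else 0   # both copies are required
    raise ValueError(f"unknown model: {model}")


def slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("coded genotype has no variation")
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx


def effect_estimate(genotypes, phenotypes, model):
    """Estimated phenotypic effect per unit of the coded genotype."""
    return slope([encode(g, model) for g in genotypes], phenotypes)


# Toy data: only homozygous-alternate animals show a reduced phenotype.
genotypes = [0, 0, 1, 1, 2, 2]
phenotypes = [10.0, 10.0, 10.0, 10.0, 4.0, 4.0]
for model in ("additive", "dominant", "recessive"):
    print(model, effect_estimate(genotypes, phenotypes, model))
```

On these toy data the recessive coding recovers the full difference between affected and unaffected animals, while the additive coding spreads it across genotype classes and reports a smaller per-allele effect.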

    Phylogenetics and association analyses illustrate substantial cryptic diversity of a newly isolated collection of Cenococcum geophilum

    The ectomycorrhizal fungus Cenococcum geophilum is distributed worldwide across multiple climates and soil types and is known to positively associate with a multitude of plant genera, possibly contributing to plant ability to tolerate inorganic contaminants in a soil environment. New C. geophilum isolates are easily cultured from soils in a laboratory setting, making this an ideal candidate for a model species with which to study multiple plant-fungal effects across a collection of novel isolates. However, C. geophilum is also genetically complex and, at 178Mbp, features one of the largest fungal genomes, necessitating the use of the novel restriction-associated DNA sequencing (RADseq) technique which produces robust de novo single nucleotide polymorphism (SNP) detection. A preliminary investigation into the phylogenetic relationship of >200 new C. geophilum isolates from the United States Pacific Northwest (PNW) region using glyceraldehyde-3-phosphate dehydrogenase (GAPDH) strongly resolved (>80%) 15 cryptic clades. An investigation of the worldwide C. geophilum collection using GAPDH resolved >30 cryptic clades. In both collections, at least two cryptic clades incorporated extreme spatial diversity in the form of cross-regional, cross-country, and international strains. Phylogenetic analyses of 171 PNW isolates conducted using RADseq strongly supported (>80%) the 15 PNW clades using the de novo dataset assembly at a per-site depth of at least 10%. Direct comparison of the PNW ITS and GAPDH gene regions indicated strong evidence of sexual recombination and additional analyses confirmed high levels of incongruency between the two genes. However, when these same analyses were conducted on the RADseq de novo dataset, no strong evidence of recombination was detected across the PNW collection, suggesting this collection represents a hybridized clonal population with rare localized sexual recombination. 
An association analysis study linked heavy metal resistance of 56 C. geophilum isolates to significant associations detected from the de novo and reference-based RADseq assemblies, finding that the de novo assembly provided a more robust association dataset linked to a series of metabolic and ion-binding protein coding regions as well as two proteins which may be directly involved in resistance to cadmium within isolates from two PNW sites.

    Phylogenomic and population genomic insights on the evolutionary history of coffee leaf rust within the rust fungi

    Doctoral thesis, Biology and Ecology of Global Change (Genome Biology and Evolution), Universidade de Lisboa, Faculdade de Ciências, 2018. Fungi are currently responsible for more than 30% of the emerging diseases worldwide and rust fungi (Pucciniales, Basidiomycota) are one of the most destructive groups of plant pathogens. In this thesis, two genomic approaches were pursued to further our knowledge on these pathogenic fungi at the macro-evolutionary level, using phylogenomics, and micro-evolutionary level, using population genomics. At the macro-evolutionary level, a phylogenomics pipeline was developed with the aim of investigating the role of positive selection on the origin of the rusts, particularly related to their obligate biotrophic life-style and pathogenicity. With up to 30% of the ca. 1000 screened genes showing a signal of positive selection, these results revealed a pervasive role of natural selection on the origin of this fungal group, with an enrichment of functional classes involved in nutrient uptake and secondary metabolites. Furthermore, positive selection was detected on conserved amino acid sites revealing an unexpected but potentially important role of natural selection on codon usage preferences. At the micro-evolutionary level, the focus was shifted to the coffee rust, Hemileia vastatrix, which is the causal agent of leaf rust disease and the main threat to Arabica coffee production worldwide. Using RAD sequencing to produce thousands of informative SNPs for a broad and unique sampling of this species, the aim was to investigate its evolutionary history and translate population genomic insights into recommendations for disease control. The results of this work overturned most of the preconceptions about the pathogen by revealing that instead of a single unstructured and large population, H. vastatrix is most likely a complex of cryptic species with marked host specialization. 
Moreover, genomic signatures of hybridization and introgression occurring between these lineages were uncovered, raising the possibility that virulence factors may be quickly exchanged. The most recent “domesticated” lineage infects exclusively the most important coffee species and SNP linkage analyses revealed the presence of recombination among isolates that were previously thought to be clonal. Altogether, these results considerably raise the evolutionary potential of this pathogen to overcome disease control measures in coffee crops. To undertake most of the tasks in this project, a new computational application called TriFusion was developed to streamline the gathering, processing and visualization of big genomic data.