27 research outputs found

    How to infer relative fitness from a sample of genomic sequences

    Mounting evidence suggests that natural populations can harbor extensive fitness diversity with numerous genomic loci under selection. It is also known that genealogical trees for populations under selection are quantifiably different from those expected under neutral evolution and described statistically by Kingman's coalescent. While differences in the statistical structure of genealogies have long been used as a test for the presence of selection, the full extent of the information that they contain has not been exploited. Here we demonstrate that the shape of the reconstructed genealogical tree for a moderately large number of random genomic samples taken from a fitness-diverse, but otherwise unstructured, asexual population can be used to predict the relative fitness of individuals within the sample. To achieve this we define a heuristic algorithm, which we test in silico using simulations of a Wright-Fisher model for a realistic range of mutation rates and selection strengths. Our inferred fitness ranking is based on a linear discriminator which identifies rapidly coalescing lineages in the reconstructed tree. Inferred fitness ranking correlates strongly with actual fitness, with a genome in the top 10% ranked being in the top 20% fittest with a false discovery rate of 0.1-0.3, depending on the mutation/selection parameters. The ranking also makes it possible to predict the genotypes that future populations inherit from the present one. While the inference accuracy increases monotonically with sample size, samples of 200 nearly saturate the performance. We propose that our approach can be used for inferring the relative fitness of genomes obtained in single-cell sequencing of tumors and in monitoring viral outbreaks.

    SOPRA: Scaffolding algorithm for paired reads via statistical optimization

    Background: High throughput sequencing (HTS) platforms produce gigabases of short read (<100 bp) data per run. While these short reads are adequate for resequencing applications, de novo assembly of moderate size genomes from such reads remains a significant challenge. These limitations could be partially overcome by utilizing mate pair technology, which provides pairs of short reads separated by a known distance along the genome. Results: We have developed SOPRA, a tool designed to exploit the mate pair/paired-end information for assembly of short reads. The main focus of the algorithm is selecting a sufficiently large subset of simultaneously satisfiable mate pair constraints to achieve a balance between the size and the quality of the output scaffolds. Scaffold assembly is presented as an optimization problem for variables associated with vertices and with edges of the contig connectivity graph. Vertices of this graph are individual contigs, with edges drawn between contigs connected by mate pairs. Similar graph problems have been invoked in the context of shotgun sequencing and scaffold building for the previous generation of sequencing projects. However, given the error-prone nature of HTS data and the fundamental limitations from the shortness of the reads, the ad hoc greedy algorithms used in the earlier studies are likely to lead to poor quality results in the current context. SOPRA circumvents this problem by treating all the constraints on an equal footing when solving the optimization problem, with the solution itself indicating the problematic constraints (chimeric/repetitive contigs, etc.) to be removed. The process of solving and removing constraints is iterated until one reaches a core set of consistent constraints. For SOLiD sequencer data, SOPRA uses a dynamic programming approach to robustly translate the color-space assembly to base-space. For assessing the quality of an assembly, we report the no-match/mismatch error rate as well as the rates of various rearrangement errors. Conclusions: Applying SOPRA to real data from bacterial genomes, we were able to assemble contigs into scaffolds of significant length (N50 up to 200 Kb) with very few errors introduced in the process. In general, the methodology presented here will allow better scaffold assemblies of any type of mate pair sequencing data.

    Polygenicity and Epistasis Underlie Fitness-Proximal Traits in the Caenorhabditis elegans Multiparental Experimental Evolution (CeMEE) Panel

    The deposited article is a preprint version that has not been submitted for peer review. This article version was provided by bioRxiv and is the preprint first posted online Mar. 26, 2017. This publication has no Creative Commons license associated. The deposited article version includes the supplementary materials within the PDF.
    Understanding the genetic basis of complex traits remains a major challenge in biology. Polygenicity, phenotypic plasticity and epistasis contribute to phenotypic variance in ways that are rarely clear. This uncertainty can be problematic for estimating heritability, for predicting individual phenotypes from genomic data, and for parameterizing models of phenotypic evolution. Here we report an advanced recombinant inbred line (RIL) quantitative trait locus (QTL) mapping panel for the hermaphroditic nematode Caenorhabditis elegans, the C. elegans multiparental experimental evolution (CeMEE) panel. The CeMEE panel, comprising 507 RILs at present, was created by hybridization of 16 wild isolates, experimental evolution for 140-190 generations, and inbreeding by selfing for 13-16 generations. The panel contains 22% of single nucleotide polymorphisms known to segregate in natural populations, and complements existing C. elegans mapping resources by providing fine resolution and high nucleotide diversity across >95% of the genome. We apply it to study the genetic basis of two fitness components, fertility and hermaphrodite body size at time of reproduction, with high broad sense heritability in the CeMEE. While simulations show we should detect common alleles with additive effects as small as 5%, at gene-level resolution, the genetic architectures of these traits do not feature such alleles. We instead find that a significant fraction of trait variance, approaching 40% for fertility, can be explained by sign epistasis with main effects below the detection limit. In congruence, phenotype prediction from genomic similarity, while generally poor (r² < 10%), requires modeling epistasis for optimal accuracy, with most variance attributed to the rapidly evolving chromosome arms.
    Funding: National Science Foundation grant (PHY-1125915); National Institutes of Health grants (R25-GM-067110, R01-GM-089972, R01-GM-121828); Gordon and Betty Moore Foundation grant (2919.01); Human Frontiers Science Program (RGP0045/2010); European Research Council grant (FP7/2007-2013/243285); Agence Nationale de la Recherche grant (ANR-14-ACHN-0032-01).

    Shape, Size, and Robustness: Feasible Regions in the Parameter Space of Biochemical Networks

    The concept of robustness of regulatory networks has received much attention in the last decade. One measure of robustness has been associated with the volume of the feasible region, namely, the region in the parameter space in which the system is functional. In this paper, we show that, in addition to volume, the geometry of this region has important consequences for the robustness and the fragility of a network. We develop an approximation within which we can algebraically specify the feasible region. We analyze the segment polarity gene network to illustrate our approach. The study of random walks in the parameter space and how they exit the feasible region provides us with a rich perspective on the different modes of failure of this network model. In particular, we found that, between two alternative ways of activating Wingless, one is more robust than the other. Our method provides a more complete measure of robustness to parameter variation. As a general modeling strategy, our approach is an interesting alternative to Boolean representation of biochemical networks.

    Tools from statistical physics for systems biology and for genomics

    My graduate studies involved three broad classes of problems, each of which is presented in a different chapter of this thesis. The first two parts of my work were related to studying the dynamics of biochemical networks. I studied a mean-field/stochastic model of epigenetic chromatin silencing in yeast. The model gives rise to different dynamical behaviors possible within the same molecular model and provides qualitative predictions that are being investigated experimentally. In another part of my work, I studied a model of the segment polarity network in Drosophila and analyzed the parameter space of the system. I particularly studied the relation between the geometry of parameter space and the robustness of the network. I will show that, in addition to the volume, the geometry of this region has important consequences for the robustness and the fragility of a network. A major part of my PhD work involved applications of high-throughput sequencing technologies for extracting information at the genomic level. I present SOPRA, a new algorithm for exploiting the mate pair information for assembly of short reads. I have successfully applied SOPRA to real data and was able to assemble scaffolds of significant length with very few errors introduced in the process.
    Ph.D. Includes bibliographical references. Includes vita. By Adel Dayaria

    Schematic presentation of various scenarios for the SIR silencing system.

    (A) A highly cooperative binding scheme for the inhibitor drug to Sir2p can cause the activity of the latter to change drastically as a result of a small change in the drug concentration. (B) The state of each nucleosome is independent, but the transcriptional readout depends on the histone modification status of multiple (neighboring) nucleosomes. (C) Polymerization/Oozing/Railroad model: the feedback from the neighboring silenced regions leads to the spreading of the silenced domain. (D) Silencing is a consequence of subnuclear localization of loci and the presence of a higher concentration of Sir proteins in the subnuclear region. (E) Thanks to DNA looping, the strong positive feedback can come from many nucleosomes being in close proximity.