47 research outputs found

    Fast Statistical Alignment

    We describe a new program for the alignment of multiple biological sequences that is both statistically motivated and fast enough for problem sizes that arise in practice. Our Fast Statistical Alignment program is based on pair hidden Markov models which approximate an insertion/deletion process on a tree and uses a sequence annealing algorithm to combine the posterior probabilities estimated from these models into a multiple alignment. FSA uses its explicit statistical model to produce multiple alignments which are accompanied by estimates of the alignment accuracy and uncertainty for every column and character of the alignment—previously available only with alignment programs which use computationally-expensive Markov Chain Monte Carlo approaches—yet can align thousands of long sequences. Moreover, FSA utilizes an unsupervised query-specific learning procedure for parameter estimation which leads to improved accuracy on benchmark reference alignments in comparison to existing programs. The centroid alignment approach taken by FSA, in combination with its learning procedure, drastically reduces the amount of false-positive alignment on biological data in comparison to that given by other methods. The FSA program and a companion visualization tool for exploring uncertainty in alignments can be used via a web interface at http://orangutan.math.berkeley.edu/fsa/, and the source code is available at http://fsa.sourceforge.net/

    Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs

    Background: A standard procedure in many areas of bioinformatics is to use a single multiple sequence alignment (MSA) as the basis for various types of analysis. However, downstream results may be highly sensitive to the alignment used, and neglecting the uncertainty in the alignment can lead to significant bias in the resulting inference. In recent years, a number of approaches have been developed for probabilistic sampling of alignments, rather than simply generating a single optimum. However, this type of probabilistic information is currently not widely used in the context of downstream inference, since most existing algorithms are set up to make use of a single alignment. Results: In this work we present a framework for representing a set of sampled alignments as a directed acyclic graph (DAG) whose nodes are alignment columns; each path through this DAG then represents a valid alignment. Since the probabilities of individual columns can be estimated from empirical frequencies, this approach enables sample-based estimation of posterior alignment probabilities. Moreover, due to conditional independencies between columns, the graph structure encodes a much larger set of alignments than the original set of sampled MSAs, such that the effective sample size is greatly increased. Conclusions: The alignment DAG provides a natural way to represent a distribution in the space of MSAs, and allows for existing algorithms to be efficiently scaled up to operate on large sets of alignments. As an example, we show how this can be used to compute marginal probabilities for tree topologies, averaging over a very large number of MSAs. This framework can also be used to generate a statistically meaningful summary alignment; example applications show that this summary alignment is consistently more accurate than the majority of the alignment samples, leading to improvements in downstream tree inference.
Implementations of the methods described in this article are available at http://statalign.github.io/WeaveAlign
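
The DAG idea in this abstract can be illustrated with a small sketch. This is not the WeaveAlign implementation: the column encoding (each cell records whether it is a residue or a gap, plus how many residues of that sequence were consumed before it), the function names, and the toy alignments are all assumptions made for illustration. Edges are taken from adjacencies seen in the samples, and column probabilities are plain empirical frequencies.

```python
from collections import defaultdict

START, END = "START", "END"  # hypothetical sentinel nodes


def columns(rows):
    """Turn gapped rows (equal-length strings, '-' = gap) into hashable columns.

    Each cell is ("res", i) for the i-th residue of that sequence, or
    ("gap", k) for a gap placed after k residues have been consumed.
    Recording k makes every column carry its full position, so any path
    built from observed adjacencies is a consistent alignment.
    """
    consumed = [0] * len(rows)
    cols = []
    for chars in zip(*rows):
        col = []
        for k, ch in enumerate(chars):
            if ch == "-":
                col.append(("gap", consumed[k]))
            else:
                col.append(("res", consumed[k]))
                consumed[k] += 1
        cols.append(tuple(col))
    return cols


def build_alignment_dag(samples):
    """Build edges between adjacent columns and estimate column frequencies."""
    edges = defaultdict(set)
    counts = defaultdict(int)
    for rows in samples:
        cols = columns(rows)
        path = [START, *cols, END]
        for a, b in zip(path, path[1:]):
            edges[a].add(b)
        for col in set(cols):
            counts[col] += 1
    probs = {col: c / len(samples) for col, c in counts.items()}
    return edges, probs


def count_paths(edges, node=START, memo=None):
    """Count START-to-END paths, i.e. alignments encoded by the DAG."""
    if memo is None:
        memo = {}
    if node == END:
        return 1
    if node not in memo:
        memo[node] = sum(count_paths(edges, nxt, memo) for nxt in edges[node])
    return memo[node]


# Two toy sampled alignments of the same pair of sequences (made-up data).
samples = [
    ["ACG", "ACG"],
    ["A-CG-", "-AC-G"],
]
edges, probs = build_alignment_dag(samples)
n_paths = count_paths(edges)
shared = (("res", 1), ("res", 1))  # the middle column, present in both samples
```

In this toy case the two samples share only the middle column, so the graph encodes every combination of a left half with a right half: more alignments than were sampled, which is the effect the abstract attributes to conditional independence between columns.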

    Probabilistic Phylogenetic Inference with Insertions and Deletions

    A fundamental task in sequence analysis is to calculate the probability of a multiple alignment given a phylogenetic tree relating the sequences and an evolutionary model describing how sequences change over time. However, the most widely used phylogenetic models only account for residue substitution events. We describe a probabilistic model of a multiple sequence alignment that accounts for insertion and deletion events in addition to substitutions, given a phylogenetic tree, using a rate matrix augmented by the gap character. Starting from a continuous Markov process, we construct a non-reversible generative (birth–death) evolutionary model for insertions and deletions. The model assumes that insertion and deletion events occur one residue at a time. We apply this model to phylogenetic tree inference by extending the program dnaml in phylip. Using standard benchmarking methods on simulated data and a new “concordance test” benchmark on real ribosomal RNA alignments, we show that the extended program dnamlε improves accuracy relative to the usual approach of ignoring gaps, while retaining the computational efficiency of the Felsenstein peeling algorithm

    A codon substitution model that incorporates the effect of the GC contents, the gene density and the density of CpG islands of human chromosomes

    Background: Developing a model for codon substitutions is essential for the analyses of protein sequences. Recent studies on the mutation rates in the non-coding regions have shown that CpG mutation rates in the human genome are negatively correlated to the local GC content and to the densities of functional elements. This study aimed at understanding the effect of genomic features, namely, GC content, gene density, and frequency of CpG islands, on the rates of codon substitution in human chromosomes. Results: Codon substitution rates of CpG to TpG mutations, TpG to CpG mutations, and non-CpG transitions and transversions in humans were estimated by comparing the coding regions of thousands of human and chimpanzee genes and inferring their ancestral sequences by using macaque genes as the outgroup. Since the genomic features depend on one another, partial regression coefficients of these features were obtained. Conclusion: The substitution rates of codons depend on gene densities of the chromosomes. Transcription-associated mutation is one such pressure. On the basis of these results, a model of codon substitutions that incorporates the effect of genomic features on codon substitution in human chromosomes was developed.

    Relationship between amino acid composition and gene expression in the mouse genome

    Background: Codon bias is a phenomenon that refers to the differences in the frequencies of synonymous codons among different genes. In many organisms, natural selection is considered to be a cause of codon bias because codon usage in highly expressed genes is biased toward optimal codons. Methods have previously been developed to predict the expression level of genes from their nucleotide sequences, which is based on the observation that synonymous codon usage shows an overall bias toward a few codons called major codons. However, the relationship between codon bias and gene expression level, as proposed by the translation-selection model, is less evident in mammals. Findings: We investigated the correlations between the expression levels of 1,182 mouse genes and amino acid composition, as well as between gene expression and codon preference. We found that a weak but significant correlation exists between gene expression levels and amino acid composition in mouse. In total, less than 10% of variation of expression levels is explained by amino acid components. We found the effect of codon preference on gene expression was weaker than the effect of amino acid composition, because no significant correlations were observed with respect to codon preference. Conclusion: These results suggest that it is difficult to predict expression level from amino acid components or from codon bias in mouse.

    Whole-chromosome hitchhiking driven by a male-killing endosymbiont.

    Neo-sex chromosomes are found in many taxa, but the forces driving their emergence and spread are poorly understood. The female-specific neo-W chromosome of the African monarch (or queen) butterfly Danaus chrysippus presents an intriguing case study because it is restricted to a single 'contact zone' population, involves a putative colour patterning supergene, and co-occurs with infection by the male-killing endosymbiont Spiroplasma. We investigated the origin and evolution of this system using whole genome sequencing. We first identify the 'BC supergene', a broad region of suppressed recombination across nearly half a chromosome, which links two colour patterning loci. Association analysis suggests that the genes yellow and arrow in this region control the forewing colour pattern differences between D. chrysippus subspecies. We then show that the same chromosome has recently formed a neo-W that has spread through the contact zone within approximately 2,200 years. We also assembled the genome of the male-killing Spiroplasma, and find that it shows perfect genealogical congruence with the neo-W, suggesting that the neo-W has hitchhiked to high frequency as the male-killer has spread through the population. The complete absence of female crossing-over in the Lepidoptera causes whole-chromosome hitchhiking of a single neo-W haplotype, carrying a single allele of the BC supergene and dragging multiple non-synonymous mutations to high frequency. This has created a population of infected females that all carry the same recessive colour patterning allele, making the phenotypes of each successive generation highly dependent on uninfected male immigrants. Our findings show how hitchhiking can occur between the physically unlinked genomes of host and endosymbiont, with dramatic consequences

    Accelerated Evolution of the Prdm9 Speciation Gene across Diverse Metazoan Taxa

    The onset of prezygotic and postzygotic barriers to gene flow between populations is a hallmark of speciation. One of the earliest postzygotic isolating barriers to arise between incipient species is the sterility of the heterogametic sex in interspecies hybrids. Four genes that underlie hybrid sterility have been identified in animals: Odysseus, JYalpha, and Overdrive in Drosophila and Prdm9 (Meisetz) in mice. Mouse Prdm9 encodes a protein with a KRAB motif, a histone methyltransferase domain and several zinc fingers. The difference of a single zinc finger distinguishes Prdm9 alleles that cause hybrid sterility from those that do not. We find that concerted evolution and positive selection have rapidly altered the number and sequence of Prdm9 zinc fingers across 13 rodent genomes. The patterns of positive selection in Prdm9 zinc fingers imply that rapid evolution has acted on the interface between the Prdm9 protein and the DNA sequences to which it binds. Similar patterns are apparent for Prdm9 zinc fingers for diverse metazoans, including primates. Indeed, allelic variation at the DNA-binding positions of human PRDM9 zinc fingers shows significant association with decreased risk of infertility. Prdm9 thus plays a role in determining male sterility both between species (mouse) and within species (human). The recurrent episodes of positive selection acting on Prdm9 suggest that the DNA sequences to which it binds must also be evolving rapidly. Our findings do not identify the nature of the underlying DNA sequences, but argue against the proposed role of Prdm9 as an essential transcription factor in mouse meiosis. We propose a hypothetical model in which incompatibilities between Prdm9-binding specificity and satellite DNAs provide the molecular basis for Prdm9-mediated hybrid sterility. We suggest that Prdm9 should be investigated as a candidate gene in other instances of hybrid sterility in metazoans.

    Sequencing and de novo assembly of 150 genomes from Denmark as a population reference

    Hundreds of thousands of human genomes are now being sequenced to characterize genetic variation and use this information to augment association mapping studies of complex disorders and other phenotypic traits. Genetic variation is identified mainly by mapping short reads to the reference genome or by performing local assembly. However, these approaches are biased against discovery of structural variants and variation in the more complex parts of the genome. Hence, large-scale de novo assembly is needed. Here we show that it is possible to construct excellent de novo assemblies from high-coverage sequencing with mate-pair libraries extending up to 20 kilobases. We report de novo assemblies of 150 individuals (50 trios) from the GenomeDenmark project. The quality of these assemblies is similar to those obtained using the more expensive long-read technology. We use the assemblies to identify a rich set of structural variants including many novel insertions and demonstrate how this variant catalogue enables further deciphering of known association mapping signals. We leverage the assemblies to provide 100 completely resolved major histocompatibility complex haplotypes and to resolve major parts of the Y chromosome. Our study provides a regional reference genome that we expect will improve the power of future association mapping studies and hence pave the way for precision medicine initiatives, which now are being launched in many countries including Denmark

    Contributions of protein-coding and regulatory change to adaptive molecular evolution in murid rodents

    The contribution of regulatory versus protein change to adaptive evolution has long been controversial. In principle, the rate and strength of adaptation within functional genetic elements can be quantified on the basis of an excess of nucleotide substitutions between species compared to the neutral expectation or from effects of recent substitutions on nucleotide diversity at linked sites. Here, we infer the nature of selective forces acting in proteins, their UTRs and conserved noncoding elements (CNEs) using genome-wide patterns of diversity in wild house mice and divergence to related species. By applying an extension of the McDonald-Kreitman test, we infer that adaptive substitutions are widespread in protein-coding genes, UTRs and CNEs, and we estimate that there are at least four times as many adaptive substitutions in CNEs and UTRs as in proteins. We observe pronounced reductions in mean diversity around nonsynonymous sites (whether or not they have experienced a recent substitution). This can be explained by selection on multiple, linked CNEs and exons. We also observe substantial dips in mean diversity (after controlling for divergence) around protein-coding exons and CNEs, which can also be explained by the combined effects of many linked exons and CNEs. A model of background selection (BGS) can adequately explain the reduction in mean diversity observed around CNEs. However, BGS fails to explain the wide reductions in mean diversity surrounding exons (encompassing ~100 Kb, on average), implying that there is a substantial role for adaptation within exons or closely linked sites. The wide dips in diversity around exons, which are hard to explain by BGS, suggest that the fitness effects of adaptive amino acid substitutions could be substantially larger than substitutions in CNEs. We conclude that although there appear to be many more adaptive noncoding changes, substitutions in proteins may dominate phenotypic evolution
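
For orientation, the logic behind a McDonald-Kreitman style comparison can be sketched as below. The abstract applies an extension of the test whose details are not given here, so this shows only the basic point estimate of the proportion of adaptive substitutions, alpha = 1 - (D_neu * P_sel) / (D_sel * P_neu). The function name, parameter names, and counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def mk_alpha(d_sel, d_neu, p_sel, p_neu):
    """Basic McDonald-Kreitman estimate of the adaptive fraction of substitutions.

    d_sel, d_neu: between-species differences at the putatively selected
                  class of sites (e.g. nonsynonymous, UTR, or CNE) and at
                  the neutral reference class (e.g. synonymous).
    p_sel, p_neu: within-species polymorphisms at the same two classes.

    Under neutrality d_sel/d_neu is expected to match p_sel/p_neu; an excess
    of divergence at the selected class gives alpha > 0.
    """
    if d_sel == 0 or p_neu == 0:
        raise ValueError("d_sel and p_neu must be non-zero")
    return 1.0 - (d_neu * p_sel) / (d_sel * p_neu)


# Made-up counts for illustration only, not values from the study.
alpha = mk_alpha(d_sel=40, d_neu=100, p_sel=20, p_neu=100)
# Expected number of adaptive substitutions at the selected class.
n_adaptive = alpha * 40
```

With these made-up counts, half of the 40 substitutions at the selected class would be attributed to adaptation. Comparing such counts across site classes is the kind of contrast the abstract draws between CNEs, UTRs, and proteins, although the published analysis uses a more elaborate method.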