
    Accurate reconstruction of insertion-deletion histories by statistical phylogenetics

    The Multiple Sequence Alignment (MSA) is a computational abstraction that represents a partial summary either of indel history, or of structural similarity. Taking the former view (indel history), it is possible to use formal automata theory to generalize the phylogenetic likelihood framework for finite substitution models (Dayhoff's probability matrices and Felsenstein's pruning algorithm) to arbitrary-length sequences. In this paper, we report results of a simulation-based benchmark of several methods for reconstruction of indel history. The methods tested include a relatively new algorithm for statistical marginalization of MSAs that sums over a stochastically-sampled ensemble of the most probable evolutionary histories. For mammalian evolutionary parameters on several different trees, the single most likely history sampled by our algorithm appears less biased than histories reconstructed by other MSA methods. The algorithm can also be used for alignment-free inference, where the MSA is explicitly summed out of the analysis. As an illustration of our method, we discuss reconstruction of the evolutionary histories of human protein-coding genes. Comment: 28 pages, 15 figures. arXiv admin note: text overlap with arXiv:1103.434

    Who Watches the Watchmen? An Appraisal of Benchmarks for Multiple Sequence Alignment

    Multiple sequence alignment (MSA) is a fundamental and ubiquitous technique in bioinformatics used to infer related residues among biological sequences. Thus alignment accuracy is crucial to a vast range of analyses, often in ways difficult to assess in those analyses. To compare the performance of different aligners and help detect systematic errors in alignments, a number of benchmarking strategies have been pursued. Here we present an overview of the main strategies--based on simulation, consistency, protein structure, and phylogeny--and discuss their different advantages and associated risks. We outline a set of desirable characteristics for effective benchmarking, and evaluate each strategy in light of them. We conclude that there is currently no universally applicable means of benchmarking MSA, and that developers and users of alignment tools should base their choice of benchmark on the context of application--with a keen awareness of the assumptions underlying each benchmarking strategy. Comment: Revie

    Regulatory motif discovery using a population clustering evolutionary algorithm

    This paper describes a novel evolutionary algorithm for regulatory motif discovery in DNA promoter sequences. The algorithm uses data clustering to logically distribute the evolving population across the search space. Mating then takes place within local regions of the population, promoting overall solution diversity and encouraging discovery of multiple solutions. Experiments using synthetic data sets have demonstrated the algorithm's capacity to find position frequency matrix models of known regulatory motifs in relatively long promoter sequences. These experiments have also shown the algorithm's ability to maintain diversity during search and discover multiple motifs within a single population. The utility of the algorithm for discovering motifs in real biological data is demonstrated by its ability to find meaningful motifs within muscle-specific regulatory sequences

    An efficient visualization tool for the analysis of protein mutation matrices

    BACKGROUND It is useful to develop a tool that would effectively describe protein mutation matrices specifically geared towards the identification of mutations that produce either wanted or unwanted effects, such as an increase or decrease in affinity, or a predisposition towards misfolding. Here, we describe a tool where such mutations are efficiently identified, categorized and visualized. To categorize the mutations, amino acids in a mutation matrix are arranged according to one of three sets of physicochemical characteristics, namely hydrophilicity, size and polarizability, and charge and polarity. The magnitude and frequencies of mutations for an alignment are subsequently described using color information and scaling factors. RESULTS To illustrate the capabilities of our approach, the technique is used to visualize and to compare mutation patterns in evolving sequences with diametrically opposite characteristics. Results show the emergence of distinct patterns not immediately discernible from the raw matrices. CONCLUSION Our technique enables effective categorization and visualization of mutations by using specifically-arranged mutation matrices. This tool has a number of possible applications in protein engineering, notably in simplifying the identification of mutations and/or mutation trends that are associated with specific engineered protein characteristics and behavior

    Elucidating the phylodynamics of endemic rabies virus in eastern Africa using whole-genome sequencing

    Many of the pathogens perceived to pose the greatest risk to humans are viral zoonoses, responsible for a range of emerging and endemic infectious diseases. Phylogeography is a useful tool to understand the processes that give rise to spatial patterns and drive dynamics in virus populations. Increasingly, whole-genome information is being used to uncover these patterns, but the limits of phylogenetic resolution that can be achieved with this are unclear. Here, whole-genome variation was used to uncover fine-scale population structure in endemic canine rabies virus circulating in Tanzania. This is the first whole-genome population study of rabies virus and the first comprehensive phylogenetic analysis of rabies virus in East Africa, providing important insights into rabies transmission in an endemic system. In addition, sub-continental scale patterns of population structure were identified using partial gene data and used to determine population structure at larger spatial scales in Africa. While rabies virus has a defined spatial structure at large scales, increasingly frequent levels of admixture were observed at regional and local levels. Discrete phylogeographic analysis revealed long-distance dispersal within Tanzania, which could be attributed to human-mediated movement, and we found evidence of multiple persistent, co-circulating lineages at a very local scale in a single district, despite on-going mass dog vaccination campaigns. This may reflect the wider endemic circulation of these lineages over several decades alongside increased admixture due to human-mediated introductions. These data indicate that successful rabies control in Tanzania could be established at a national level, since most dispersal appears to be restricted within the confines of country borders but some coordination with neighbouring countries may be required to limit transboundary movements. 
Evidence of complex patterns of rabies circulation within Tanzania necessitates the use of whole-genome sequencing to delineate finer-scale population structure that can guide interventions, such as the spatial scale and design of dog vaccination campaigns and dog movement controls to achieve and maintain freedom from disease

    Bacterial microevolution and the Pangenome

    The comparison of multiple genome sequences sampled from a bacterial population reveals considerable diversity in both the core and the accessory parts of the pangenome. This diversity can be analysed in terms of microevolutionary events that took place since the genomes shared a common ancestor, especially deletion, duplication, and recombination. We review the basic modelling ingredients used implicitly or explicitly when performing such a pangenome analysis. In particular, we describe a basic neutral phylogenetic framework of bacterial pangenome microevolution, which is not incompatible with evaluating the role of natural selection. We survey the different ways in which pangenome data is summarised in order to be included in microevolutionary models, as well as the main methodological approaches that have been proposed to reconstruct pangenome microevolutionary history

    Expansion of the BioCyc collection of pathway/genome databases to 160 genomes

    The BioCyc database collection is a set of 160 pathway/genome databases (PGDBs) for most eukaryotic and prokaryotic species whose genomes have been completely sequenced to date. Each PGDB in the BioCyc collection describes the genome and predicted metabolic network of a single organism, inferred from the MetaCyc database, which is a reference source on metabolic pathways from multiple organisms. In addition, each bacterial PGDB includes predicted operons for the corresponding species. The BioCyc collection provides a unique resource for computational systems biology, namely global and comparative analyses of genomes and metabolic networks, and a supplement to the BioCyc resource of curated PGDBs. The Omics viewer available through the BioCyc website allows scientists to visualize combinations of gene expression, proteomics and metabolomics data on the metabolic maps of these organisms. This paper discusses the computational methodology by which the BioCyc collection has been expanded, and presents an aggregate analysis of the collection that includes the range of number of pathways present in these organisms, and the most frequently observed pathways. We seek scientists to adopt and curate individual PGDBs within the BioCyc collection. Only by harnessing the expertise of many scientists can we hope to produce biological databases that accurately reflect the depth and breadth of knowledge that the biomedical research community is producing

    Differential Functional Constraints Cause Strain-Level Endemism in Polynucleobacter Populations.

    The adaptation of bacterial lineages to local environmental conditions creates the potential for broader genotypic diversity within a species, which can enable a species to dominate across ecological gradients because of niche flexibility. The genus Polynucleobacter maintains both free-living and symbiotic ecotypes and maintains an apparently ubiquitous distribution in freshwater ecosystems. Subspecies-level resolution supplemented with metagenome-derived genotype analysis revealed that differential functional constraints, not geographic distance, produce and maintain strain-level genetic conservation in Polynucleobacter populations across three geographically proximal riverine environments. Genes associated with cofactor biosynthesis and one-carbon metabolism showed habitat specificity, and protein-coding genes of unknown function and membrane transport proteins were under positive selection across each habitat. Characterized by different median ratios of nonsynonymous to synonymous evolutionary changes (dN/dS ratios) and a limited but statistically significant negative correlation between the dN/dS ratio and codon usage bias between habitats, the free-living and core genotypes were observed to be evolving under strong purifying selection pressure. Highlighting the potential role of genetic adaptation to the local environment, the two-component system protein-coding genes were highly stable (dN/dS ratio, < 0.03). These results suggest that despite the impact of the habitat on genetic diversity, and hence niche partition, strong environmental selection pressure maintains a conserved core genome for Polynucleobacter populations. IMPORTANCE Understanding the biological factors influencing habitat-wide genetic endemism is important for explaining observed biogeographic patterns. 
Polynucleobacter is a genus of bacteria that seems to have found a way to colonize myriad freshwater ecosystems and by doing so has become one of the most abundant bacteria in these environments. We sequenced metagenomes from locations across the Chicago River system and assembled Polynucleobacter genomes from different sites and compared how the nucleotide composition, gene codon usage, and the ratio of synonymous (codes for the same amino acid) to nonsynonymous (codes for a different amino acid) mutations varied across these population genomes at each site. The environmental pressures at each site drove purifying selection for functional traits that maintained a streamlined core genome across the Chicago River Polynucleobacter population while allowing for site-specific genomic adaptation. These adaptations enable Polynucleobacter to become dominant across different riverine environmental gradients