13 research outputs found

    PICS-Ord: unlimited coding of ambiguous regions by pairwise identity and cost scores ordination

    Abstract Background We present a novel method to encode ambiguously aligned regions in fixed multiple sequence alignments by 'Pairwise Identity and Cost Scores Ordination' (PICS-Ord). The method works via ordination of sequence identity or cost score matrices by means of Principal Coordinates Analysis (PCoA). After identification of ambiguous regions, the method computes pairwise distances as sequence identities or cost scores, ordinates the resulting distance matrix by means of PCoA, and encodes the principal coordinates as ordered integers. Three biological and 100 simulated datasets were used to assess the performance of the new method. Results Including ambiguous regions coded by means of PICS-Ord increased topological accuracy, resolution, and bootstrap support in real biological and simulated datasets compared to the alternative of excluding such regions from the analysis a priori. In terms of accuracy, PICS-Ord performs as well as or better than previously available methods of ambiguous region coding (e.g., INAASE), with the advantages of a practically unlimited alignment size, increased analytical speed, and the possibility of analyzing PICS-Ord scores together with DNA data in a partitioned maximum likelihood model. Conclusions Advantages of PICS-Ord over step matrix-based ambiguous region coding with INAASE include a practically unlimited number of OTUs and seamless integration of PICS-Ord codes into phylogenetic datasets, as well as the increased speed of phylogenetic analysis. Contrary to word- and frequency-based methods, PICS-Ord maintains the advantage of pairwise sequence alignment to derive distances, and the method is flexible with respect to the calculation of distance scores. In addition to distance and maximum parsimony, PICS-Ord codes can be analyzed in a Bayesian or maximum likelihood framework.
RAxML (version 7.2.6 or higher, which was developed for this study) allows up to 32-state ordered or unordered characters. A GTR, MK, or ORDERED model can be applied to analyse the PICS-Ord codes partition, with GTR performing slightly better than MK and ORDERED. Availability An implementation of the PICS-Ord algorithm is available from http://scit.us/projects/ngila/wiki/PICS-Ord. It requires both the statistical software R (http://www.r-project.org) and the alignment software Ngila (http://scit.us/projects/ngila).
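
The three steps named in this abstract (pairwise distances, PCoA ordination, integer coding) can be illustrated with a minimal pure-Python sketch. It is not the authors' R/Ngila implementation: the function names, the use of a simple proportion-of-differences distance, power iteration for the eigenvectors, and equal-width binning into integer states are all assumptions made for illustration.

```python
import math

def p_distances(seqs):
    """Proportion of differing sites for each pair of equal-length strings."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            diff = sum(a != b for a, b in zip(seqs[i], seqs[j]))
            d[i][j] = d[j][i] = diff / len(seqs[i])
    return d

def double_centre(dist):
    """Gower transformation B = -1/2 * J D^2 J of a symmetric distance matrix."""
    n = len(dist)
    sq = [[x * x for x in row] for row in dist]
    mean = [sum(row) / n for row in sq]  # row means equal column means
    grand = sum(mean) / n
    return [[-0.5 * (sq[i][j] - mean[i] - mean[j] + grand) for j in range(n)]
            for i in range(n)]

def pcoa(dist, n_axes=1, iters=1000):
    """Leading principal coordinates via power iteration with deflation."""
    b = double_centre(dist)
    n = len(b)
    axes = []
    for k in range(n_axes):
        v = [math.sin(i + k + 1.0) for i in range(n)]  # deterministic start
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(x * y for x, y in zip(v, bv)) / sum(x * x for x in v)
        if lam <= 1e-12:  # no positive variance left to extract
            break
        axes.append([x * math.sqrt(lam) for x in v])
        for i in range(n):  # deflate so the next pass finds the next axis
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return axes

def encode_ordered(axis, n_states=32):
    """Bin one coordinate axis into ordered states 0..n_states-1 (assumed rule)."""
    lo, hi = min(axis), max(axis)
    span = (hi - lo) or 1.0
    return [min(int((x - lo) / span * n_states), n_states - 1) for x in axis]
```

Calling `encode_ordered(pcoa(p_distances(region))[0])` on the sequences of one ambiguous region would yield one ordered character per sequence; the default of 32 states mirrors the RAxML limit mentioned in the abstract.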

    Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs

    Background A standard procedure in many areas of bioinformatics is to use a single multiple sequence alignment (MSA) as the basis for various types of analysis. However, downstream results may be highly sensitive to the alignment used, and neglecting the uncertainty in the alignment can lead to significant bias in the resulting inference. In recent years, a number of approaches have been developed for probabilistic sampling of alignments, rather than simply generating a single optimum. However, this type of probabilistic information is currently not widely used in the context of downstream inference, since most existing algorithms are set up to make use of a single alignment. Results In this work we present a framework for representing a set of sampled alignments as a directed acyclic graph (DAG) whose nodes are alignment columns; each path through this DAG then represents a valid alignment. Since the probabilities of individual columns can be estimated from empirical frequencies, this approach enables sample-based estimation of posterior alignment probabilities. Moreover, due to conditional independencies between columns, the graph structure encodes a much larger set of alignments than the original set of sampled MSAs, such that the effective sample size is greatly increased. Conclusions The alignment DAG provides a natural way to represent a distribution in the space of MSAs, and allows for existing algorithms to be efficiently scaled up to operate on large sets of alignments. As an example, we show how this can be used to compute marginal probabilities for tree topologies, averaging over a very large number of MSAs. This framework can also be used to generate a statistically meaningful summary alignment; example applications show that this summary alignment is consistently more accurate than the majority of the alignment samples, leading to improvements in downstream tree inference. 
Implementations of the methods described in this article are available at http://statalign.github.io/WeaveAlign
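
The claim that the graph encodes more alignments than were sampled can be made concrete with a small sketch. This is not the WeaveAlign code: the column key (residues consumed so far plus a gap flag per sequence), the rule that edges join columns seen adjacent in some sample, and all names are assumptions chosen for illustration.

```python
from collections import defaultdict

START, END = "START", "END"

def columns(rows):
    """Turn gapped alignment rows into context-aware column keys.

    Each key records, per sequence, how many residues were consumed before
    the column and whether the column holds a residue (True) or a gap.
    """
    consumed = [0] * len(rows)
    cols = []
    for chars in zip(*rows):
        cols.append(tuple((consumed[i], ch != "-") for i, ch in enumerate(chars)))
        for i, ch in enumerate(chars):
            if ch != "-":
                consumed[i] += 1
    return cols

def build_dag(samples):
    """Link columns that are adjacent in at least one sampled alignment."""
    edges = defaultdict(set)
    for rows in samples:
        path = [START] + columns(rows) + [END]
        for a, b in zip(path, path[1:]):
            edges[a].add(b)
    return edges

def count_paths(edges):
    """Count START-to-END paths, i.e. alignments encoded by the DAG."""
    memo = {END: 1}

    def visit(node):
        if node not in memo:
            memo[node] = sum(visit(nxt) for nxt in edges[node])
        return memo[node]

    return visit(START)

# Two sampled alignments of the same pair of sequences that share only
# the middle column (B aligned with B).
samples = [
    ["ABC", "ABC"],
    ["A-BC-", "-AB-C"],
]
n_alignments = count_paths(build_dag(samples))
```

Because the two samples meet at the shared middle column, either prefix can be combined with either suffix, so the DAG built from two samples contains four paths.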

    Large-Scale Phylogenetic Analysis of Emerging Infectious Diseases

    Microorganisms that cause infectious diseases present critical issues of national security, public health, and economic welfare. For example, in recent years, highly pathogenic strains of avian influenza have emerged in Asia, spread through Eastern Europe, and threaten to become pandemic. As demonstrated by the coordinated response to Severe Acute Respiratory Syndrome (SARS) and influenza, agents of infectious disease are being addressed via large-scale genomic sequencing. The goal of genomic sequencing projects is to rapidly put large amounts of data in the public domain to accelerate research on disease surveillance, treatment, and prevention. However, our ability to derive information from large comparative genomic datasets lags far behind acquisition. Here we review the computational challenges of comparative genomic analyses, specifically sequence alignment and reconstruction of phylogenetic trees. We present novel analytical results from two important infectious diseases, Severe Acute Respiratory Syndrome (SARS) and influenza. SARS and influenza have similarities and important differences, both as biological and as comparative genomic analysis problems. Influenza viruses (Orthomyxoviridae) are RNA-based. Current evidence indicates that influenza viruses originate in aquatic birds from wild populations. Influenza has been studied for decades via well-coordinated international efforts. These efforts center on surveillance via antibody characterization of the hemagglutinin (HA) and neuraminidase (N) proteins of the circulating strains to inform vaccine design. However, we still do not have a clear understanding of: 1) various transmission pathways, including the role of intermediate hosts such as swine and domestic birds, and 2) the key mutation and genomic recombination events that underlie periodic pandemics of influenza. In the past 30 years, sequence data from HA and N loci has become an important data type. 
In the past year, full genomic data has become prominent. These data present exciting opportunities to address unanswered questions in influenza pandemics. SARS is caused by a previously unrecognized lineage of coronavirus, SARS-CoV, which like influenza has an RNA-based genome. Although SARS-CoV is widely believed to have originated in animals, there remains disagreement over the candidate animal source that led to the original outbreak of SARS. In contrast to the long history of the study of influenza, SARS was only recognized in late 2002, and the virus that causes SARS has been documented primarily by genomic sequencing. In the past, most studies of influenza were performed on a limited number of isolates and genes suited to a particular problem. Major goals in science today are to understand emerging diseases in broad geographic, environmental, societal, biological, and genomic contexts. Synthesizing diverse information brought together by various researchers is important to find out what can be done to prevent future outbreaks {JON03}. Thus comprehensive means to organize and analyze large amounts of diverse information are critical. For example, the relationships of isolates and patterns of genomic change observed in large datasets might not be consistent with hypotheses formed on partial data. Moreover, when researchers rely on partial datasets, they restrict the range of possible discoveries. Phylogenetics is well suited to the complex task of understanding emerging infectious disease. Phylogenetic analyses can test many hypotheses by comparing diverse isolates collected from various hosts, environments, and points in time and organizing these data into various evolutionary scenarios. The products of a phylogenetic analysis are a graphical tree of ancestor-descendant relationships and an inferred summary of mutations, recombination events, host shifts, and geographic and temporal spread of the viruses. However, this synthesis comes at a price. 
The cost of computation of phylogenetic analysis expands combinatorially as the number of isolates considered increases. Thus, large datasets like those currently produced are commonly considered intractable. We address this problem with synergistic development of heuristic tree search strategies and parallel computing. Affiliation: Janies, D. Ohio State University; United States. Affiliation: Pol, Diego. Ohio State University; United States. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentin
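
To give a sense of the combinatorial growth this abstract refers to, the short sketch below counts the distinct unrooted, fully bifurcating tree topologies for a given number of isolates, using the standard double-factorial formula (2n - 5)!!. The function name is a hypothetical choice for illustration, and the abstract itself does not state which measure of cost the authors have in mind.

```python
def unrooted_tree_count(n_taxa):
    """Number of unrooted binary tree topologies on n_taxa labelled tips.

    Uses (2n - 5)!! = 1 * 3 * 5 * ... * (2n - 5) for n >= 3.
    """
    if n_taxa < 3:
        return 1  # a single trivial topology
    count = 1
    for odd in range(3, 2 * n_taxa - 4, 2):
        count *= odd
    return count

# Tree space grows far faster than the number of isolates.
growth = {n: unrooted_tree_count(n) for n in (4, 5, 10, 20)}
```

Ten isolates already allow 2,027,025 topologies, which is why exhaustive search gives way to heuristic tree search and parallel computing on datasets of the size described here.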

    Solving the master equation for Indels

    Abstract Background Despite the long-anticipated possibility of putting sequence alignment on the same footing as statistical phylogenetics, theorists have struggled to develop time-dependent evolutionary models for indels that are as tractable as the analogous models for substitution events. Main text This paper discusses progress in the area of insertion-deletion models, in view of recent work by Ezawa (BMC Bioinformatics 17:304, 2016); (BMC Bioinformatics 17:397, 2016); (BMC Bioinformatics 17:457, 2016) on the calculation of time-dependent gap length distributions in pairwise alignments, and current approaches for extending these approaches from ancestor-descendant pairs to phylogenetic trees. Conclusions While approximations that use finite-state machines (Pair HMMs and transducers) currently represent the most practical approach to problems such as sequence alignment and phylogeny, more rigorous approaches that work directly with the matrix exponential of the underlying continuous-time Markov chain also show promise, especially in view of recent advances