1,016 research outputs found

    Computing galled networks from real data

    Get PDF
    Motivation: Developing methods for computing phylogenetic networks from biological data is an important problem posed by molecular evolution and much work is currently being undertaken in this area. Although promising approaches exist, there are no tools available that biologists could easily and routinely use to compute rooted phylogenetic networks on real datasets containing tens or hundreds of taxa. Biologists are interested in clades, i.e. groups of monophyletic taxa, and these are usually represented by clusters in a rooted phylogenetic tree. The problem of computing an optimal rooted phylogenetic network from a set of clusters, is hard, in general. Indeed, even the problem of just determining whether a given network contains a given cluster is hard. Hence, some researchers have focused on topologically restricted classes of networks, such as galled trees and level-k networks, that are more tractable, but have the practical draw-back that a given set of clusters will usually not possess such a representation

    Coupling geometry on binary bipartite networks: hypotheses testing on pattern geometry and nestedness

    Full text link
    Upon a matrix representation of a binary bipartite network, via the permutation invariance, a coupling geometry is computed to approximate the minimum energy macrostate of a network's system. Such a macrostate is supposed to constitute the intrinsic structures of the system, so that the coupling geometry should be taken as information contents, or even the nonparametric minimum sufficient statistics of the network data. Then pertinent null and alternative hypotheses, such as nestedness, are to be formulated according to the macrostate. That is, any efficient testing statistic needs to be a function of this coupling geometry. These conceptual architectures and mechanisms are by and large still missing in community ecology literature, and rendered misconceptions prevalent in this research area. Here the algorithmically computed coupling geometry is shown consisting of deterministic multiscale block patterns, which are framed by two marginal ultrametric trees on row and column axes, and stochastic uniform randomness within each block found on the finest scale. Functionally a series of increasingly larger ensembles of matrix mimicries is derived by conforming to the multiscale block configurations. Here matrix mimicking is meant to be subject to constraints of row and column sums sequences. Based on such a series of ensembles, a profile of distributions becomes a natural device for checking the validity of testing statistics or structural indexes. An energy based index is used for testing whether network data indeed contains structural geometry. A new version block-based nestedness index is also proposed. Its validity is checked and compared with the existing ones. A computing paradigm, called Data Mechanics, and its application on one real data network are illustrated throughout the developments and discussions in this paper

    Synthesizing species trees from gene trees using the parameterized and graph-theoretic approaches

    Get PDF
    Gene trees describe how parts of the species have evolved over time, and it is assumed that gene trees have evolved along the branches of the species tree. However, some of gene trees are often discordant with the corresponding species tree due to the complicated evolution history of genes. To overcome this obstacle, median problems have emerged as a major tool for synthesizing species trees by reconciling discordance in a given collection of gene trees. Given a collection of gene trees and a cost function, the median problem seeks a tree, called median tree, that minimizes the overall cost to the gene trees. Median tree problems are typically NP-hard, and there is an increased interest in making such median tree problems available for large-scale species tree construction. In this thesis work, we first show that the gene duplication median tree problem satisfied the weaker version of the Pareto property and propose a parameterized algorithm to solve the gene duplication median tree problem. Second, we design two efficient methods to handle the issues of applying the parameterized algorithm to unrooted gene trees which are sampled from the different species. Third, we introduce the graph-theoretic formulation of the Robinson-Foulds median tree problem and a new tree edit operation. Fourth, we propose a new metric between two phylogenetic trees and examine the statistical properties of the metric. Finally, we propose a new clustering criteria in a bipartite network and propose a new NP-hard problem and its ILP formulation

    Focus: A Graph Approach for Data-Mining and Domain-Specific Assembly of Next Generation Sequencing Data

    Get PDF
    Next Generation Sequencing (NGS) has emerged as a key technology leading to revolutionary breakthroughs in numerous biomedical research areas. These technologies produce millions to billions of short DNA reads that represent a small fraction of the original target DNA sequence. These short reads contain little information individually but are produced at a high coverage of the original sequence such that many reads overlap. Overlap relationships allow for the reads to be linearly ordered and merged by computational programs called assemblers into long stretches of contiguous sequence called contigs that can be used for research applications. Although the assembly of the reads produced by NGS remains a difficult task, it is the process of extracting useful knowledge from these relatively short sequences that has become one of the most exciting and challenging problems in Bioinformatics. The assembly of short reads is an aggregative process where critical information is lost as reads are merged into contigs. In addition, the assembly process is treated as a black box, with generic assembler tools that do not adapt to input data set characteristics. Finally, as NGS data throughput continues to increase, there is an increasing need for smart parallel assembler implementations. In this dissertation, a new assembly approach called Focus is proposed. Unlike previous assemblers, Focus relies on a novel hybrid graph constructed from multiple graphs at different levels of granularity to represent the assembly problem, facilitating information capture and dynamic adjustment to input data set characteristics. This work is composed of four specific aims: 1) The implementation of a robust assembly and analysis tool built on the hybrid graph platform 2) The development and application of graph mining to extract biologically relevant features in NGS data sets 3) The integration of domain specific knowledge to improve the assembly and analysis process. 4) The construction of smart parallel computing approaches, including the application of energy-aware computing for NGS assembly and knowledge integration to improve algorithm performance. In conclusion, this dissertation presents a complete parallel assembler called Focus that is capable of extracting biologically relevant features directly from its hybrid assembly graph

    A class of phylogenetic networks reconstructable from ancestral profiles

    Get PDF
    Rooted phylogenetic networks provide an explicit representation of the evolutionary history of a set XX of sampled species. In contrast to phylogenetic trees which show only speciation events, networks can also accommodate reticulate processes (for example, hybrid evolution, endosymbiosis, and lateral gene transfer). A major goal in systematic biology is to infer evolutionary relationships, and while phylogenetic trees can be uniquely determined from various simple combinatorial data on XX, for networks the reconstruction question is much more subtle. Here we ask when can a network be uniquely reconstructed from its `ancestral profile' (the number of paths from each ancestral vertex to each element in XX). We show that reconstruction holds (even within the class of all networks) for a class of networks we call `orchard networks', and we provide a polynomial-time algorithm for reconstructing any orchard network from its ancestral profile. Our approach relies on establishing a structural theorem for orchard networks, which also provides for a fast (polynomial-time) algorithm to test if any given network is of orchard type. Since the class of orchard networks includes tree-sibling tree-consistent networks and tree-child networks, our result generalise reconstruction results from 2008 and 2009. Orchard networks allow for an unbounded number kk of reticulation vertices, in contrast to tree-sibling tree-consistent networks and tree-child networks for which kk is at most 2X42|X|-4 and X1|X|-1, respectively.Comment: 21 pages, 5 figure

    Algorithms for weighted multidimensional search and perfect phylogeny

    Get PDF
    This dissertation is a collection of papers from two independent areas: convex optimization problems in R[superscript]d and the construction of evolutionary trees;The paper on convex optimization problems in R[superscript]d gives improved algorithms for solving the Lagrangian duals of problems that have both of the following properties. First, in absence of the bad constraints, the problems can be solved in strongly polynomial time by combinatorial algorithms. Second, the number of bad constraints is fixed. As part of our solution to these problems, we extend Cole\u27s circuit simulation approach and develop a weighted version of Megiddo\u27s multidimensional search technique;The papers on evolutionary tree construction deal with the perfect phylogeny problem, where species are specified by a set of characters and each character can occur in a species in one of a fixed number of states. This problem is known to be NP-complete. The dissertation contains the following results on the perfect phylogeny problem: (1) A linear time algorithm when all the characters have two states. (2) A polynomial time algorithm when the number of character states is fixed. (3) A polynomial time algorithm when the number of characters is fixed

    Computational Methods for Assessment and Prediction of Viral Evolutionary and Epidemiological Dynamics

    Get PDF
    The ability to comprehend the dynamics of viruses’ transmission and their evolution, even to a limited extent, can significantly enhance our capacity to predict and control the spread of infectious diseases. An example of such significance is COVID-19 caused by the severe acute respiratory syndrome Coronavirus 2 (SARS-CoV-2). In this dissertation, I am proposing computational models that present more precise and comprehensive approaches in viral outbreak investigations and epidemiology, providing invaluable insights into the transmission dynamics, and potential inter- ventions of infectious diseases by facilitating the timely detection of viral variants. The first model is a mathematical framework based on population dynamics for the calculation of a numerical measure of the fitness of SARS-CoV-2 subtypes. The second model I propose here is a transmissibility estimation method based on a Bayesian approach to calculate the most likely fitness landscape for SARS-CoV-2 using a generalized logistic sub-epidemic model. Using the proposed model I estimate the epistatic interaction networks of spike protein in SARS-CoV-2. Based on the community structure of these epistatic networks, I propose a computational framework that predicts emerging haplotypes of SARS-CoV-2 with altered transmissibility. The last method proposed in this dissertation is a maximum likelihood framework that integrates phylogenetic and random graph models to accurately infer transmission networks without requiring case-specific data

    Statistical Algorithms and Bioinformatics Tools Development for Computational Analysis of High-throughput Transcriptomic Data

    Get PDF
    Next-Generation Sequencing technologies allow for a substantial increase in the amount of data available for various biological studies. In order to effectively and efficiently analyze this data, computational approaches combining mathematics, statistics, computer science, and biology are implemented. Even with the substantial efforts devoted to development of these approaches, numerous issues and pitfalls remain. One of these issues is mapping uncertainty, in which read alignment results are biased due to the inherent difficulties associated with accurately aligning RNA-Sequencing reads. GeneQC is an alignment quality control tool that provides insight into the severity of mapping uncertainty in each annotated gene from alignment results. GeneQC used feature extraction to identify three levels of information for each gene and implements elastic net regularization and mixture model fitting to provide insight in the severity of mapping uncertainty and the quality of read alignment. In combination with GeneQC, the Ambiguous Reads Mapping (ARM) algorithm works to re-align ambiguous reads through the integration of motif prediction from metabolic pathways to establish coregulatory gene modules for re-alignment using a negative binomial distribution-based probabilistic approach. These two tools work in tandem to address the issue of mapping uncertainty and provide more accurate read alignments, and thus more accurate expression estimates. Also presented in this dissertation are two approaches to interpreting the expression estimates. The first is IRIS-EDA, an integrated shiny web server that combines numerous analyses to investigate gene expression data generated from RNASequencing data. The second is ViDGER, an R/Bioconductor package that quickly generates high-quality visualizations of differential gene expression results to assist users in comprehensive interpretations of their differential gene expression results, which is a non-trivial task. These four presented tools cover a variety of aspects of modern RNASeq analyses and aim to address bottlenecks related to algorithmic and computational issues, as well as more efficient and effective implementation methods
    corecore