Computing galled networks from real data
Motivation: Developing methods for computing phylogenetic networks from biological data is an important problem posed by molecular evolution, and much work is currently being undertaken in this area. Although promising approaches exist, there are no tools available that biologists could easily and routinely use to compute rooted phylogenetic networks on real datasets containing tens or hundreds of taxa. Biologists are interested in clades, i.e. groups of monophyletic taxa, and these are usually represented by clusters in a rooted phylogenetic tree. The problem of computing an optimal rooted phylogenetic network from a set of clusters is hard in general. Indeed, even the problem of just determining whether a given network contains a given cluster is hard. Hence, some researchers have focused on topologically restricted classes of networks, such as galled trees and level-k networks, that are more tractable but have the practical drawback that a given set of clusters will usually not possess such a representation.
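To make the notion of clusters concrete, the following minimal sketch checks which pairs of clusters could not appear together in a single rooted tree. It relies on the standard fact that two clusters fit on one rooted tree only if they are disjoint or nested. The function names `compatible` and `incompatible_pairs` and the example taxa are illustrative assumptions, not part of the paper or of any tool it describes.

```python
from itertools import combinations

def compatible(a, b):
    """Two clusters are compatible if they are disjoint or one contains the other."""
    a, b = frozenset(a), frozenset(b)
    return a.isdisjoint(b) or a <= b or b <= a

def incompatible_pairs(clusters):
    """Return every pair of clusters that cannot sit together on one rooted tree."""
    return [(a, b) for a, b in combinations(clusters, 2) if not compatible(a, b)]

# Hypothetical example: {a, b} and {b, c} overlap without nesting.
clusters = [{"a", "b"}, {"b", "c"}, {"a", "b", "c"}, {"d"}]
conflicts = incompatible_pairs(clusters)
print(conflicts)
```

A non-empty result means no rooted tree displays all the clusters, which is the situation in which a network representation becomes relevant.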
Coupling geometry on binary bipartite networks: hypotheses testing on pattern geometry and nestedness
Upon a matrix representation of a binary bipartite network, via the
permutation invariance, a coupling geometry is computed to approximate the
minimum energy macrostate of a network's system. Such a macrostate is supposed
to constitute the intrinsic structures of the system, so that the coupling
geometry should be taken as information contents, or even the nonparametric
minimum sufficient statistics of the network data. Then pertinent null and
alternative hypotheses, such as nestedness, are to be formulated according to
the macrostate. That is, any efficient testing statistic needs to be a function
of this coupling geometry. These conceptual architectures and mechanisms are by
and large still missing from the community ecology literature, and their absence
has left misconceptions prevalent in this research area. Here the algorithmically
computed coupling geometry is shown to consist of deterministic multiscale
block patterns, which are framed by two marginal ultrametric trees on row and
column axes, and stochastic uniform randomness within each block found on the
finest scale. Functionally a series of increasingly larger ensembles of matrix
mimicries is derived by conforming to the multiscale block configurations. Here
matrix mimicking is meant to be subject to the constraints of the row and column
sum sequences. Based on such a series of ensembles, a profile of distributions
becomes a natural device for checking the validity of testing statistics or
structural indexes. An energy based index is used for testing whether network
data indeed contains structural geometry. A new block-based version of the
nestedness index is also proposed. Its validity is checked and compared with
that of the existing ones. A computing paradigm, called Data Mechanics, and its
application to one real network data set are illustrated throughout the
developments and discussions in this paper.
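The constraint that a mimicked matrix keeps the observed row and column sums can be illustrated with the classical checkerboard swap. This is only a sketch of that generic null-model technique, not the block-conforming ensemble construction described in the abstract. The function name `checkerboard_swap`, its parameters, and the example matrix are assumptions made for illustration.

```python
import random

def checkerboard_swap(matrix, n_attempts=1000, seed=0):
    """Randomize a 0/1 matrix while preserving every row sum and column sum.

    Assumes the matrix has at least two rows and two columns.
    """
    rng = random.Random(seed)
    m = [row[:] for row in matrix]  # work on a copy
    n_rows, n_cols = len(m), len(m[0])
    for _ in range(n_attempts):
        r1, r2 = rng.sample(range(n_rows), 2)
        c1, c2 = rng.sample(range(n_cols), 2)
        # A 2x2 checkerboard is [[1, 0], [0, 1]] or [[0, 1], [1, 0]].
        is_checkerboard = (
            m[r1][c1] == m[r2][c2]
            and m[r1][c2] == m[r2][c1]
            and m[r1][c1] != m[r1][c2]
        )
        if is_checkerboard:
            # Flipping the checkerboard leaves all marginal sums unchanged.
            m[r1][c1], m[r1][c2] = m[r1][c2], m[r1][c1]
            m[r2][c1], m[r2][c2] = m[r2][c2], m[r2][c1]
    return m

# Hypothetical binary bipartite network (rows and columns are the two node sets).
observed = [[1, 1, 0], [1, 0, 1], [0, 1, 0]]
mimic = checkerboard_swap(observed)
print(mimic)
```

Repeating the call with different seeds gives an ensemble against which a structural index computed on the observed matrix can be compared.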
Synthesizing species trees from gene trees using the parameterized and graph-theoretic approaches
Gene trees describe how parts of the species have evolved over time, and it is assumed that gene trees have evolved along the branches of the species tree. However, some gene trees are often discordant with the corresponding species tree due to the complicated evolutionary history of genes. To overcome this obstacle, median problems have emerged as a major tool for synthesizing species trees by reconciling discordance in a given collection of gene trees. Given a collection of gene trees and a cost function, the median problem seeks a tree, called a median tree, that minimizes the overall cost to the gene trees. Median tree problems are typically NP-hard, and there is an increased interest in making such median tree problems available for large-scale species tree construction.
In this thesis work, we first show that the gene duplication median tree problem satisfies a weaker version of the Pareto property and propose a parameterized algorithm to solve the gene duplication median tree problem. Second, we design two efficient methods to handle the issues of applying the parameterized algorithm to unrooted gene trees that are sampled from different species. Third, we introduce a graph-theoretic formulation of the Robinson-Foulds median tree problem and a new tree edit operation. Fourth, we propose a new metric between two phylogenetic trees and examine the statistical properties of the metric. Finally, we propose a new clustering criterion in a bipartite network and propose a new NP-hard problem and its ILP formulation.
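As an illustration of the median tree objective, the sketch below scores one candidate species tree against a collection of gene trees using a rooted, cluster-based Robinson-Foulds distance as the cost function. It only evaluates a given candidate and does not search for the optimum, which is the NP-hard part. The nested-tuple tree encoding and the names `clusters`, `rf_distance`, and `median_cost` are assumptions for this example and are not taken from the thesis.

```python
def clusters(tree):
    """Return the non-trivial clusters of a rooted tree given as nested tuples of leaf names."""
    found = set()

    def leaves(node):
        if isinstance(node, str):  # a leaf
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    all_leaves = leaves(tree)
    found.discard(all_leaves)  # the root cluster is shared by all trees on the same taxa
    return found

def rf_distance(t1, t2):
    """Rooted Robinson-Foulds distance: number of clusters found in exactly one tree."""
    return len(clusters(t1) ^ clusters(t2))

def median_cost(candidate, gene_trees):
    """Overall cost of a candidate tree: sum of its distances to all gene trees."""
    return sum(rf_distance(candidate, g) for g in gene_trees)

# Hypothetical gene trees on the taxa a, b, c, d.
gene_trees = [
    (("a", "b"), ("c", "d")),
    ((("a", "b"), "c"), "d"),
    (("a", "c"), ("b", "d")),
]
candidate = (("a", "b"), ("c", "d"))
print(median_cost(candidate, gene_trees))
```

A median tree is any candidate that minimizes `median_cost` over all trees on the same taxa; other cost functions, such as the gene duplication cost, would replace `rf_distance`.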
Focus: A Graph Approach for Data-Mining and Domain-Specific Assembly of Next Generation Sequencing Data
Next Generation Sequencing (NGS) has emerged as a key technology leading to revolutionary breakthroughs in numerous biomedical research areas. These technologies produce millions to billions of short DNA reads that represent a small fraction of the original target DNA sequence. These short reads contain little information individually but are produced at a high coverage of the original sequence such that many reads overlap. Overlap relationships allow for the reads to be linearly ordered and merged by computational programs called assemblers into long stretches of contiguous sequence called contigs that can be used for research applications. Although the assembly of the reads produced by NGS remains a difficult task, it is the process of extracting useful knowledge from these relatively short sequences that has become one of the most exciting and challenging problems in Bioinformatics.
The assembly of short reads is an aggregative process where critical information is lost as reads are merged into contigs. In addition, the assembly process is treated as a black box, with generic assembler tools that do not adapt to input data set characteristics. Finally, as NGS data throughput continues to increase, there is an increasing need for smart parallel assembler implementations. In this dissertation, a new assembly approach called Focus is proposed. Unlike previous assemblers, Focus relies on a novel hybrid graph constructed from multiple graphs at different levels of granularity to represent the assembly problem, facilitating information capture and dynamic adjustment to input data set characteristics. This work is composed of four specific aims: 1) the implementation of a robust assembly and analysis tool built on the hybrid graph platform; 2) the development and application of graph mining to extract biologically relevant features in NGS data sets; 3) the integration of domain-specific knowledge to improve the assembly and analysis process; and 4) the construction of smart parallel computing approaches, including the application of energy-aware computing for NGS assembly and knowledge integration to improve algorithm performance.
In conclusion, this dissertation presents a complete parallel assembler called Focus that is capable of extracting biologically relevant features directly from its hybrid assembly graph.
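The overlap relationships that assemblers exploit can be sketched with a naive suffix-prefix comparison between two reads. This is a generic illustration only and does not reflect the hybrid graph used by Focus. The function names `overlap` and `merge`, the `min_len` threshold, and the example reads are assumptions.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of read a that equals a prefix of read b.

    Returns 0 when no overlap of at least min_len characters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def merge(a, b, min_len=3):
    """Merge two overlapping reads into one contig, or return None if they do not overlap."""
    k = overlap(a, b, min_len)
    return a + b[k:] if k else None

# Hypothetical reads sharing the three-base overlap "TTG".
print(overlap("ACGTTG", "TTGCA"))
print(merge("ACGTTG", "TTGCA"))
```

Real assemblers index reads rather than comparing every pair, since the pairwise comparison shown here scales quadratically with the number of reads.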
A class of phylogenetic networks reconstructable from ancestral profiles
Rooted phylogenetic networks provide an explicit representation of the
evolutionary history of a set of sampled species. In contrast to
phylogenetic trees which show only speciation events, networks can also
accommodate reticulate processes (for example, hybrid evolution, endosymbiosis,
and lateral gene transfer). A major goal in systematic biology is to infer
evolutionary relationships, and while phylogenetic trees can be uniquely
determined from various simple combinatorial data on the species set, for networks the
reconstruction question is much more subtle. Here we ask when a network can be
uniquely reconstructed from its `ancestral profile' (the number of paths from
each ancestral vertex to each element of the species set). We show that reconstruction
holds (even within the class of all networks) for a class of networks we call
`orchard networks', and we provide a polynomial-time algorithm for
reconstructing any orchard network from its ancestral profile. Our approach
relies on establishing a structural theorem for orchard networks, which also
provides for a fast (polynomial-time) algorithm to test if any given network is
of orchard type. Since the class of orchard networks includes tree-sibling
tree-consistent networks and tree-child networks, our results generalise
reconstruction results from 2008 and 2009. Orchard networks allow for an
unbounded number of reticulation vertices, in contrast to tree-sibling
tree-consistent networks and tree-child networks for which is at most
and , respectively. Comment: 21 pages, 5 figures
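The ancestral profile described above, i.e. the number of directed paths from each internal vertex to each leaf, can be computed on a small rooted network as sketched below. This shows only the forward computation of the profile, not the reconstruction algorithm of the paper. The dictionary encoding, the function name `ancestral_profile`, the vertex labels, and the choice to record zero counts explicitly are assumptions.

```python
from functools import lru_cache

def ancestral_profile(children):
    """Count directed paths from every internal vertex to every leaf of a rooted DAG.

    children maps each vertex to the list of its children; leaves map to [].
    """
    leaves = sorted(v for v, kids in children.items() if not kids)

    @lru_cache(maxsize=None)
    def paths(v, leaf):
        if v == leaf:
            return 1
        # A leaf other than the target has no children, so the sum is 0.
        return sum(paths(c, leaf) for c in children[v])

    return {
        v: {x: paths(v, x) for x in leaves}
        for v, kids in children.items()
        if kids
    }

# Hypothetical network: h is a reticulation vertex with parents u and v.
network = {
    "r": ["u", "v"],
    "u": ["a", "h"],
    "v": ["h", "b"],
    "h": ["c"],
    "a": [], "b": [], "c": [],
}
profile = ancestral_profile(network)
print(profile["r"])
```

In this example the root reaches leaf c along two distinct paths, one through each parent of the reticulation, which is the kind of information a tree alone cannot encode.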
Algorithms for weighted multidimensional search and perfect phylogeny
This dissertation is a collection of papers from two independent areas: convex optimization problems in R^d and the construction of evolutionary trees.
The paper on convex optimization problems in R^d gives improved algorithms for solving the Lagrangian duals of problems that have both of the following properties. First, in the absence of the bad constraints, the problems can be solved in strongly polynomial time by combinatorial algorithms. Second, the number of bad constraints is fixed. As part of our solution to these problems, we extend Cole's circuit simulation approach and develop a weighted version of Megiddo's multidimensional search technique.
The papers on evolutionary tree construction deal with the perfect phylogeny problem, where species are specified by a set of characters and each character can occur in a species in one of a fixed number of states. This problem is known to be NP-complete. The dissertation contains the following results on the perfect phylogeny problem: (1) A linear time algorithm when all the characters have two states. (2) A polynomial time algorithm when the number of character states is fixed. (3) A polynomial time algorithm when the number of characters is fixed.
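For the two-state case, the existence of a perfect phylogeny can be illustrated with the classical four-gamete test: a binary character matrix admits an (unrooted) perfect phylogeny exactly when no pair of characters exhibits all four state combinations 00, 01, 10 and 11. The sketch below is a simple pairwise check that is quadratic in the number of characters. It is not the linear time algorithm of the dissertation, and the function names and example matrices are assumptions.

```python
from itertools import combinations

def four_gamete_conflicts(matrix):
    """Return the pairs of character (column) indices that show all four gametes.

    matrix is a list of rows, one per species, each a list of 0/1 character states.
    """
    n_chars = len(matrix[0])
    conflicts = []
    for i, j in combinations(range(n_chars), 2):
        gametes = {(row[i], row[j]) for row in matrix}
        if len(gametes) == 4:
            conflicts.append((i, j))
    return conflicts

def has_perfect_phylogeny(matrix):
    """A binary matrix has a perfect phylogeny iff no character pair conflicts."""
    return not four_gamete_conflicts(matrix)

# Hypothetical data: four species (rows) scored for binary characters (columns).
good = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 0, 0]]
bad = [[0, 0], [0, 1], [1, 0], [1, 1]]
print(has_perfect_phylogeny(good), has_perfect_phylogeny(bad))
```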
Computational Methods for Assessment and Prediction of Viral Evolutionary and Epidemiological Dynamics
The ability to comprehend the dynamics of viral transmission and evolution, even to a limited extent, can significantly enhance our capacity to predict and control the spread of infectious diseases. An example of such significance is COVID-19, caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). In this dissertation, I propose computational models that offer more precise and comprehensive approaches to viral outbreak investigation and epidemiology, providing insights into the transmission dynamics of, and potential interventions against, infectious diseases by facilitating the timely detection of viral variants. The first model is a mathematical framework based on population dynamics for the calculation of a numerical measure of the fitness of SARS-CoV-2 subtypes. The second model I propose here is a transmissibility estimation method based on a Bayesian approach to calculate the most likely fitness landscape for SARS-CoV-2 using a generalized logistic sub-epidemic model. Using the proposed model, I estimate the epistatic interaction networks of the spike protein in SARS-CoV-2. Based on the community structure of these epistatic networks, I propose a computational framework that predicts emerging haplotypes of SARS-CoV-2 with altered transmissibility. The last method proposed in this dissertation is a maximum likelihood framework that integrates phylogenetic and random graph models to accurately infer transmission networks without requiring case-specific data.
Statistical Algorithms and Bioinformatics Tools Development for Computational Analysis of High-throughput Transcriptomic Data
Next-Generation Sequencing technologies allow for a substantial increase in the amount of data available for various biological studies. In order to effectively and efficiently analyze this data, computational approaches combining mathematics, statistics, computer science, and biology are implemented. Even with the substantial efforts devoted to development of these approaches, numerous issues and pitfalls remain. One of these issues is mapping uncertainty, in which read alignment results are biased due to the inherent difficulties associated with accurately aligning RNA-Sequencing reads. GeneQC is an alignment quality control tool that provides insight into the severity of mapping uncertainty in each annotated gene from alignment results. GeneQC uses feature extraction to identify three levels of information for each gene and implements elastic net regularization and mixture model fitting to provide insight into the severity of mapping uncertainty and the quality of read alignment. In combination with GeneQC, the Ambiguous Reads Mapping (ARM) algorithm works to re-align ambiguous reads through the integration of motif prediction from metabolic pathways to establish coregulatory gene modules for re-alignment using a negative binomial distribution-based probabilistic approach. These two tools work in tandem to address the issue of mapping uncertainty and provide more accurate read alignments, and thus more accurate expression estimates. Also presented in this dissertation are two approaches to interpreting the expression estimates. The first is IRIS-EDA, an integrated Shiny web server that combines numerous analyses to investigate gene expression data generated from RNA-Sequencing data. The second is ViDGER, an R/Bioconductor package that quickly generates high-quality visualizations of differential gene expression results to assist users in comprehensively interpreting those results, which is a non-trivial task.
These four presented tools cover a variety of aspects of modern RNA-Seq analyses and aim to address bottlenecks related to algorithmic and computational issues, as well as more efficient and effective implementation methods.