9 research outputs found

    2006-05-03 UNM NEWS MINUTE


    Fast Algorithms for Large-Scale Phylogenetic Reconstruction

    One of the most fundamental computational problems in biology is that of inferring the evolutionary histories of groups of species from sequence data. Such evolutionary histories, known as phylogenies, are usually represented as binary trees whose leaves represent extant species and whose internal nodes represent their shared ancestors. As the amount of sequence data available to biologists increases, very fast phylogenetic reconstruction algorithms are becoming necessary. Large sequence alignments can now contain hundreds of thousands of sequences, making traditional methods, such as Neighbor Joining, computationally prohibitive. To address this problem, we have developed three novel fast phylogenetic algorithms. The first, QTree, is a quartet-based heuristic that runs in O(n log n) time. It is based on a theoretical algorithm that reconstructs the correct tree, with high probability, assuming every quartet is inferred correctly with constant probability. The core of our algorithm is a balanced search tree structure that enables us to locate an edge in the tree in O(log n) time. Our algorithm is several times faster than current methods, while its accuracy approaches that of Neighbor Joining. The second algorithm, LSHTree, is the first sub-quadratic time algorithm with theoretical performance guarantees under a Markov model of sequence evolution. It runs in O(n^{1+γ(g)} log^2 n) time, where g is an upper bound on the mutation rate along any branch in the phylogeny and γ is an increasing function satisfying γ(g) < 1 for all g. For phylogenies with very short branches, its running time is close to linear. In experiments, our prototype implementation was more accurate than current fast algorithms, while being comparably fast. In the final part of this thesis, we apply the algorithmic framework behind LSHTree to the problem of placing large numbers of short sequence reads onto a fixed phylogenetic tree. Our initial results in this area are promising, but many challenges remain.
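    To make the quartet-guided search concrete, here is a minimal Python sketch of placing a new leaf by walking down a rooted tree with quartet queries, each quartet inferred from a precomputed distance matrix via the four-point condition. The Node class and helper names are hypothetical, and the sketch omits the paper's balanced search tree, so the O(log n) placement cost holds only when the tree happens to stay balanced.

        # Hypothetical sketch of quartet-guided leaf placement; not the actual
        # QTree data structure. D is a precomputed distance matrix (a dict of
        # dicts or a 2-D array) indexed by leaf labels.

        def quartet(D, a, b, c, d):
            """Infer which of b, c, d groups with a (four-point condition)."""
            pairings = {b: D[a][b] + D[c][d],   # topology ab|cd
                        c: D[a][c] + D[b][d],   # topology ac|bd
                        d: D[a][d] + D[b][c]}   # topology ad|bc
            return min(pairings, key=pairings.get)

        class Node:
            def __init__(self, leaf=None, left=None, right=None):
                self.leaf, self.left, self.right = leaf, left, right

        def some_leaf(node):
            return node.leaf if node.leaf is not None else some_leaf(node.left)

        def place(node, q, outside, D):
            """Insert leaf q below `node`; `outside` is any leaf not below it."""
            if node.leaf is not None:                  # reached a pendant edge
                return Node(left=node, right=Node(leaf=q))
            x, y = some_leaf(node.left), some_leaf(node.right)
            g = quartet(D, q, x, y, outside)
            if g == x:                                 # q joins the left subtree
                node.left = place(node.left, q, y, D)
            elif g == y:                               # q joins the right subtree
                node.right = place(node.right, q, x, D)
            else:                                      # q branches off above `node`
                return Node(left=node, right=Node(leaf=q))
            return node

    Starting from a tree on three taxa and placing the remaining leaves one at a time yields a full topology with O(log n) quartet queries per leaf on balanced trees, which is where the O(n log n) total in the abstract comes from.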

    Graph Algorithms and Applications

    Much real-world data has inherent structure or connectivity: typical examples include biological data, communication-network data, and image data. Graphs provide a natural way to represent and analyze such data and the relationships within it. Unfortunately, the associated algorithms often suffer from high computational complexity, since some of these problems are NP-hard. In recent years, therefore, many graph models and optimization algorithms have been proposed to strike a better balance between efficacy and efficiency. This book collects papers reporting recent achievements regarding graph models, algorithms, and applications to real-world problems, with some focus on optimization and computational complexity.

    Subject Index Volumes 1–200


    Neural Probabilistic Methods for Event Sequence Modeling

    This thesis focuses on modeling event sequences, namely, sequences of discrete events in continuous time. We build a family of generative probabilistic models that is able to reason about what events will happen in the future and when, given the history of previous events. Under our models, each event—as it happens—is allowed to update the future intensities of multiple event types, and the intensity of each event type—as nothing happens—is allowed to evolve with time along a trajectory. We use neural networks to allow the “updates” and “trajectories” to be complex and realistic. In the purely neural version of our model, all future event intensities are conditioned on the hidden state of a continuous-time LSTM, which has consumed every past event as it happened. To exploit domain-specific knowledge of how an event might only affect a few—but not all—future event intensities, we propose to introduce domain-specific structure into the model. We design a modeling language by which a domain expert can write down the rules of a temporal deductive database. The database tracks facts over time; the rules deduce facts from other facts and from past events. Each fact has a time-varying state, computed by a neural network whose topology is determined by the fact’s provenance, including its experience of the past events that have contributed to deducing it. The possible event types at any time are given by special facts, whose intensities are neurally modeled alongside their states. We develop efficient methods for training our models and doing inference with them. Applying the general principle of noise-contrastive estimation, we work out a stochastic training objective that is less expensive to optimize than the log-likelihood, which people typically maximize for parameter estimation. As in the discrete-time case that inspired us, the parameters that maximize our objective will provably maximize the log-likelihood as well. For the scenarios where we are given incomplete sequences, we propose particle smoothing—a form of sequential importance sampling—to impute the missing events. This thesis includes extensive experiments, demonstrating the effectiveness of our models and algorithms. On many synthetic and real-world datasets, on held-out sequences, we show empirically: (1) our purely neural model achieves competitive likelihood and predictive accuracy; (2) our neural-symbolic model improves prediction by encoding appropriate domain knowledge in the architecture; (3) for models to achieve the same level of log-likelihood, our noise-contrastive estimation needs considerably fewer function evaluations and less wall-clock time than maximum likelihood estimation; (4) our particle smoothing method is effective at inferring the ground-truth unobserved events. This thesis also discusses a few future research directions, including embedding our models within a reinforcement learner to discover causal structure and learn an intervention policy.
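    The decay-and-update mechanics described above can be illustrated with a much simpler classical model. The following numpy sketch implements an exponential-decay (Hawkes-style) intensity, in which each event bumps the intensities of all event types and each intensity relaxes toward its baseline as nothing happens; it illustrates only that mechanism, not the thesis's continuous-time LSTM model, and all parameter values are made up.

        # Illustrative Hawkes-style intensities: each event excites all types,
        # and the excitation decays exponentially between events. Not the
        # thesis's neural model; parameters are arbitrary.
        import numpy as np

        K = 3                             # number of event types
        mu = np.full(K, 0.2)              # baseline intensity of each type
        A = np.array([[0.3, 0.0, 0.1],    # A[j, k]: jump in type j's intensity
                      [0.0, 0.2, 0.0],    #          when an event of type k occurs
                      [0.1, 0.1, 0.3]])
        beta = 1.0                        # decay rate of the excitation

        def intensities(t, history):
            """lambda(t) given past (time, type) events, all strictly before t."""
            lam = mu.copy()
            for s, k in history:
                lam += A[:, k] * np.exp(-beta * (t - s))
            return lam

        history = [(0.5, 0), (1.2, 2)]
        print(intensities(2.0, history))  # per-type intensities at t = 2.0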

    Efficiently Computing the Robinson-Foulds Metric

    The Robinson-Foulds (RF) metric is the measure most widely used in comparing phylogenetic trees; it can be computed in linear time using Day’s algorithm. When faced with the need to compare large numbers of large trees, however, even linear time becomes prohibitive. We present a randomized approximation scheme that provides, in sublinear time and with high probability, a (1+ε) approximation of the true RF metric. Our approach is to use a sublinear-space embedding of the trees, combined with an application of the Johnson-Lindenstrauss lemma to approximate vector norms very rapidly. We complement our algorithm by presenting an efficient embedding procedure, thereby resolving an open issue from the preliminary version of this paper. We have also improved the performance of Day’s (exact) algorithm in practice by using techniques discovered while implementing our approximation scheme. Indeed, we give a unified framework for edge-based tree algorithms in which implementation tradeoffs are clear. Finally, we present detailed experimental results illustrating the precision and running-time tradeoffs as well as demonstrating the speed of our approach. Our new implementation, FastRF, is available as an open-source tool for phylogenetic analysis.
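    For reference, the quantity being approximated has a compact exact formulation (quadratic here, not Day's linear-time algorithm): identify each internal edge of a tree with the set of leaves on one side, and take the size of the symmetric difference of the two trees' edge sets. A minimal Python sketch, with trees written as nested tuples and assumed rooted for simplicity:

        # Minimal exact RF distance via sets of clades; illustrative only, not
        # Day's linear-time algorithm. Trees are nested tuples of leaf labels.

        def _leaves(tree):
            if isinstance(tree, tuple):
                for child in tree:
                    yield from _leaves(child)
            else:
                yield tree

        def clades(tree, out=None):
            """Collect the leaf set below every internal node (the root's full
            leaf set is included but cancels when two trees are compared)."""
            if out is None:
                out = set()
            if isinstance(tree, tuple):
                out.add(frozenset(_leaves(tree)))
                for child in tree:
                    clades(child, out)
            return out

        def rf_distance(t1, t2):
            return len(clades(t1) ^ clades(t2))

        t1 = ((("a", "b"), "c"), ("d", "e"))
        t2 = ((("a", "c"), "b"), ("d", "e"))
        print(rf_distance(t1, t2))   # 2: clades {a,b} and {a,c} differ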

    A sublinear-time randomized approximation scheme for the Robinson-Foulds metric

    The Robinson-Foulds (RF) metric is the measure most widely used in comparing phylogenetic trees; it can be computed in linear time using Day’s algorithm. When faced with the need to compare large numbers of large trees, however, even linear time becomes prohibitive. We present a randomized approximation scheme that provides, with high probability, a (1+ε) approximation of the true RF metric for all pairs of trees in a given collection. Our approach is to use a sublinear-space embedding of the trees, combined with an application of the Johnson-Lindenstrauss lemma to approximate vector norms very rapidly. We discuss the consequences of various parameter choices (in the embedding and in the approximation requirements). We also implemented our algorithm as a Java class that can easily be combined with popular packages such as Mesquite; in consequence, we present experimental results illustrating the precision and running-time tradeoffs as well as demonstrating the speed of our approach.
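    A hedged sketch of the embedding idea: the RF distance between two trees equals the squared Euclidean distance between the 0/1 indicator vectors of their bipartition sets, and a Johnson-Lindenstrauss random projection approximately preserves that distance in low dimension. The feature-hashing step, the dimensions D and k, and all function names below are illustrative assumptions, not the paper's actual embedding or its Java implementation.

        # Illustrative JL approximation of RF distances in numpy; the hashed
        # coordinate space (D), sketch size (k), and hashing scheme are all
        # assumptions made for this sketch.
        import numpy as np

        D, k = 4096, 128
        rng = np.random.default_rng(0)
        R = rng.standard_normal((k, D)) / np.sqrt(k)    # shared JL projection

        def embed(bipartitions):
            """Map a tree's bipartition set to a k-dimensional sketch."""
            u = np.zeros(D)
            for b in bipartitions:        # b: frozenset of leaves on one side
                u[hash(b) % D] = 1.0      # feature-hash each bipartition
            return R @ u

        def approx_rf(s1, s2):
            """Estimate the RF distance from two sketches."""
            return float(np.sum((s1 - s2) ** 2))

        b1 = {frozenset("ab"), frozenset("de")}
        b2 = {frozenset("ac"), frozenset("de")}
        print(approx_rf(embed(b1), embed(b2)))   # concentrates near the exact 2

    Each tree is touched once to build its sketch, after which any pair can be compared in O(k) time, which is the source of the savings when comparing all pairs in a large collection.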