1,641 research outputs found

    Pattern-based phylogenetic distance estimation and tree reconstruction

    Get PDF
    We have developed an alignment-free method that calculates phylogenetic distances using a maximum likelihood approach for a model of sequence change on patterns that are discovered in unaligned sequences. To evaluate the phylogenetic accuracy of our method, and to conduct a comprehensive comparison of existing alignment-free methods (freely available as Python package decaf+py at http://www.bioinformatics.org.au), we have created a dataset of reference trees covering a wide range of phylogenetic distances. Amino acid sequences were evolved along the trees and input to the tested methods; from their calculated distances we infered trees whose topologies we compared to the reference trees. We find our pattern-based method statistically superior to all other tested alignment-free methods on this dataset. We also demonstrate the general advantage of alignment-free methods over an approach based on automated alignments when sequences violate the assumption of collinearity. Similarly, we compare methods on empirical data from an existing alignment benchmark set that we used to derive reference distances and trees. Our pattern-based approach yields distances that show a linear relationship to reference distances over a substantially longer range than other alignment-free methods. The pattern-based approach outperforms alignment-free methods and its phylogenetic accuracy is statistically indistinguishable from alignment-based distances.Comment: 21 pages, 3 figures, 2 table

    Some studies on protein structure alignment algorithms

    Get PDF
    The alignment of two protein structures is a fundamental problem in structural bioinformatics.Their structural similarity carries with it the connotation of similar functional behavior that couldbe exploited in various applications. A plethora of algorithms, including one by us, is a testamentto the importance of the problem. In this thesis, we propose a novel approach to measure theeectiveness of a sample of four such algorithms, DALI, TM-align, CE and EDAlignsse, for de-tecting structural similarities among proteins. The underlying premise is that structural proximityshould translate into spatial proximity. To verify this, we carried out extensive experiments withve dierent datasets, each consisting of proteins from two to six dierent families.In further addition to our work, we have focused on the area of computational methods foraligning multiple protein structures. This problem is known for its np-complete nature. Therefore,there are many ways to come up with a solution which can be better than the existing ones or atleast as good as them. Such a solution is presented here in this thesis. We have used a heuristicalgorithm which is the Progressive Multiple Alignment approach, to have the multiple sequencealignment. We used the root mean square deviation (RMSD) as a measure of alignment quality andreported this measure for a large and varied number of alignments. We also compared the executiontimes of our algorithm with the well-known algorithm MUSTANG for all the tested alignments

    Lambek vs. Lambek: Functorial Vector Space Semantics and String Diagrams for Lambek Calculus

    Full text link
    The Distributional Compositional Categorical (DisCoCat) model is a mathematical framework that provides compositional semantics for meanings of natural language sentences. It consists of a computational procedure for constructing meanings of sentences, given their grammatical structure in terms of compositional type-logic, and given the empirically derived meanings of their words. For the particular case that the meaning of words is modelled within a distributional vector space model, its experimental predictions, derived from real large scale data, have outperformed other empirically validated methods that could build vectors for a full sentence. This success can be attributed to a conceptually motivated mathematical underpinning, by integrating qualitative compositional type-logic and quantitative modelling of meaning within a category-theoretic mathematical framework. The type-logic used in the DisCoCat model is Lambek's pregroup grammar. Pregroup types form a posetal compact closed category, which can be passed, in a functorial manner, on to the compact closed structure of vector spaces, linear maps and tensor product. The diagrammatic versions of the equational reasoning in compact closed categories can be interpreted as the flow of word meanings within sentences. Pregroups simplify Lambek's previous type-logic, the Lambek calculus, which has been extensively used to formalise and reason about various linguistic phenomena. The apparent reliance of the DisCoCat on pregroups has been seen as a shortcoming. This paper addresses this concern, by pointing out that one may as well realise a functorial passage from the original type-logic of Lambek, a monoidal bi-closed category, to vector spaces, or to any other model of meaning organised within a monoidal bi-closed category. The corresponding string diagram calculus, due to Baez and Stay, now depicts the flow of word meanings.Comment: 29 pages, pending publication in Annals of Pure and Applied Logi

    Fine-grained Expressivity of Graph Neural Networks

    Full text link
    Numerous recent works have analyzed the expressive power of message-passing graph neural networks (MPNNs), primarily utilizing combinatorial techniques such as the 11-dimensional Weisfeiler-Leman test (11-WL) for the graph isomorphism problem. However, the graph isomorphism objective is inherently binary, not giving insights into the degree of similarity between two given graphs. This work resolves this issue by considering continuous extensions of both 11-WL and MPNNs to graphons. Concretely, we show that the continuous variant of 11-WL delivers an accurate topological characterization of the expressive power of MPNNs on graphons, revealing which graphs these networks can distinguish and the level of difficulty in separating them. We identify the finest topology where MPNNs separate points and prove a universal approximation theorem. Consequently, we provide a theoretical framework for graph and graphon similarity combining various topological variants of classical characterizations of the 11-WL. In particular, we characterize the expressive power of MPNNs in terms of the tree distance, which is a graph distance based on the concepts of fractional isomorphisms, and substructure counts via tree homomorphisms, showing that these concepts have the same expressive power as the 11-WL and MPNNs on graphons. Empirically, we validate our theoretical findings by showing that randomly initialized MPNNs, without training, exhibit competitive performance compared to their trained counterparts. Moreover, we evaluate different MPNN architectures based on their ability to preserve graph distances, highlighting the significance of our continuous 11-WL test in understanding MPNNs' expressivity

    On the Fine-Grained Complexity of One-Dimensional Dynamic Programming

    Get PDF
    In this paper, we investigate the complexity of one-dimensional dynamic programming, or more specifically, of the Least-Weight Subsequence (LWS) problem: Given a sequence of n data items together with weights for every pair of the items, the task is to determine a subsequence S minimizing the total weight of the pairs adjacent in S. A large number of natural problems can be formulated as LWS problems, yielding obvious O(n^2)-time solutions. In many interesting instances, the O(n^2)-many weights can be succinctly represented. Yet except for near-linear time algorithms for some specific special cases, little is known about when an LWS instantiation admits a subquadratic-time algorithm and when it does not. In particular, no lower bounds for LWS instantiations have been known before. In an attempt to remedy this situation, we provide a general approach to study the fine-grained complexity of succinct instantiations of the LWS problem: Given an LWS instantiation we identify a highly parallel core problem that is subquadratically equivalent. This provides either an explanation for the apparent hardness of the problem or an avenue to find improved algorithms as the case may be. More specifically, we prove subquadratic equivalences between the following pairs (an LWS instantiation and the corresponding core problem) of problems: a low-rank version of LWS and minimum inner product, finding the longest chain of nested boxes and vector domination, and a coin change problem which is closely related to the knapsack problem and (min,+)-convolution. Using these equivalences and known SETH-hardness results for some of the core problems, we deduce tight conditional lower bounds for the corresponding LWS instantiations. We also establish the (min,+)-convolution-hardness of the knapsack problem. Furthermore, we revisit some of the LWS instantiations which are known to be solvable in near-linear time and explain their easiness in terms of the easiness of the corresponding core problems
    corecore