Pattern-based phylogenetic distance estimation and tree reconstruction
We have developed an alignment-free method that calculates phylogenetic
distances using a maximum likelihood approach for a model of sequence change on
patterns that are discovered in unaligned sequences. To evaluate the
phylogenetic accuracy of our method, and to conduct a comprehensive comparison
of existing alignment-free methods (freely available as Python package decaf+py
at http://www.bioinformatics.org.au), we have created a dataset of reference
trees covering a wide range of phylogenetic distances. Amino acid sequences
were evolved along the trees and input to the tested methods; from their
calculated distances we inferred trees whose topologies we compared to the
reference trees.
We find our pattern-based method statistically superior to all other tested
alignment-free methods on this dataset. We also demonstrate the general
advantage of alignment-free methods over an approach based on automated
alignments when sequences violate the assumption of collinearity. Similarly, we
compare methods on empirical data from an existing alignment benchmark set that
we used to derive reference distances and trees. Our pattern-based approach
yields distances that show a linear relationship to reference distances over a
substantially longer range than other alignment-free methods. The pattern-based
approach outperforms alignment-free methods and its phylogenetic accuracy is
statistically indistinguishable from alignment-based distances.
Comment: 21 pages, 3 figures, 2 tables
Some studies on protein structure alignment algorithms
The alignment of two protein structures is a fundamental problem in structural bioinformatics. Their structural similarity carries with it the connotation of similar functional behavior that could be exploited in various applications. A plethora of algorithms, including one by us, is a testament to the importance of the problem. In this thesis, we propose a novel approach to measure the effectiveness of a sample of four such algorithms, DALI, TM-align, CE and EDAlignsse, for detecting structural similarities among proteins. The underlying premise is that structural proximity should translate into spatial proximity. To verify this, we carried out extensive experiments with five different datasets, each consisting of proteins from two to six different families.
In addition, we have focused on computational methods for aligning multiple protein structures. This problem is known to be NP-complete. Therefore, there are many ways to come up with a solution which can be better than the existing ones or at least as good as them. Such a solution is presented in this thesis. We used a heuristic algorithm, the Progressive Multiple Alignment approach, to obtain the multiple sequence alignment. We used the root mean square deviation (RMSD) as a measure of alignment quality and reported this measure for a large and varied number of alignments. We also compared the execution times of our algorithm with the well-known algorithm MUSTANG for all the tested alignments.
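As a rough illustration of the RMSD measure mentioned in the abstract above, the following minimal sketch computes the root mean square deviation between two sets of paired 3D coordinates. It assumes the structures are already superposed and that residues are paired one-to-one; the function name `rmsd` and the tuple-based input format are hypothetical and not taken from the thesis.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between paired (x, y, z) points.

    Assumption: both structures are already superposed and the i-th point
    of coords_a is aligned to the i-th point of coords_b.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared_sum = 0.0
    for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b):
        squared_sum += (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
    return math.sqrt(squared_sum / len(coords_a))
```

A lower value indicates that the aligned atoms lie closer together; the sketch does not perform the superposition step that a real structure aligner would carry out first.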
Lambek vs. Lambek: Functorial Vector Space Semantics and String Diagrams for Lambek Calculus
The Distributional Compositional Categorical (DisCoCat) model is a
mathematical framework that provides compositional semantics for meanings of
natural language sentences. It consists of a computational procedure for
constructing meanings of sentences, given their grammatical structure in terms
of compositional type-logic, and given the empirically derived meanings of
their words. For the particular case that the meaning of words is modelled
within a distributional vector space model, its experimental predictions,
derived from real large scale data, have outperformed other empirically
validated methods that could build vectors for a full sentence. This success
can be attributed to a conceptually motivated mathematical underpinning, by
integrating qualitative compositional type-logic and quantitative modelling of
meaning within a category-theoretic mathematical framework.
The type-logic used in the DisCoCat model is Lambek's pregroup grammar.
Pregroup types form a posetal compact closed category, which can be passed, in
a functorial manner, on to the compact closed structure of vector spaces,
linear maps and tensor product. The diagrammatic versions of the equational
reasoning in compact closed categories can be interpreted as the flow of word
meanings within sentences. Pregroups simplify Lambek's previous type-logic, the
Lambek calculus, which has been extensively used to formalise and reason about
various linguistic phenomena. The apparent reliance of the DisCoCat on
pregroups has been seen as a shortcoming. This paper addresses this concern, by
pointing out that one may as well realise a functorial passage from the
original type-logic of Lambek, a monoidal bi-closed category, to vector spaces,
or to any other model of meaning organised within a monoidal bi-closed
category. The corresponding string diagram calculus, due to Baez and Stay, now
depicts the flow of word meanings.
Comment: 29 pages, pending publication in Annals of Pure and Applied Logic
Fine-grained Expressivity of Graph Neural Networks
Numerous recent works have analyzed the expressive power of message-passing
graph neural networks (MPNNs), primarily utilizing combinatorial techniques
such as the 1-dimensional Weisfeiler-Leman test (1-WL) for the graph
isomorphism problem. However, the graph isomorphism objective is inherently
binary, not giving insights into the degree of similarity between two given
graphs. This work resolves this issue by considering continuous extensions of
both 1-WL and MPNNs to graphons. Concretely, we show that the continuous
variant of 1-WL delivers an accurate topological characterization of the
expressive power of MPNNs on graphons, revealing which graphs these networks
can distinguish and the level of difficulty in separating them. We identify the
finest topology where MPNNs separate points and prove a universal approximation
theorem. Consequently, we provide a theoretical framework for graph and graphon
similarity combining various topological variants of classical
characterizations of the 1-WL. In particular, we characterize the expressive
power of MPNNs in terms of the tree distance, which is a graph distance based
on the concepts of fractional isomorphisms, and substructure counts via tree
homomorphisms, showing that these concepts have the same expressive power as
the 1-WL and MPNNs on graphons. Empirically, we validate our theoretical
findings by showing that randomly initialized MPNNs, without training, exhibit
competitive performance compared to their trained counterparts. Moreover, we
evaluate different MPNN architectures based on their ability to preserve graph
distances, highlighting the significance of our continuous 1-WL test in
understanding MPNNs' expressivity.
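For readers unfamiliar with the classical, discrete 1-dimensional Weisfeiler-Leman test that the abstract above builds on, here is a minimal sketch of color refinement on ordinary finite graphs. It is not the continuous graphon variant studied in the paper. The function names and the adjacency-dictionary input format are hypothetical choices made for illustration.

```python
from collections import Counter


def refine_colors(adjacency):
    """Run color refinement until the partition of nodes stops splitting.

    adjacency: dict mapping each node to an iterable of its neighbors.
    """
    colors = {v: 0 for v in adjacency}
    for _ in range(len(adjacency)):
        # A node's signature is its own color plus the multiset of neighbor colors.
        signatures = {
            v: (colors[v], tuple(sorted(colors[u] for u in adjacency[v])))
            for v in adjacency
        }
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        new_colors = {v: palette[signatures[v]] for v in adjacency}
        # Refinement only splits classes, so an unchanged count means stability.
        if len(set(new_colors.values())) == len(set(colors.values())):
            return new_colors
        colors = new_colors
    return colors


def wl_distinguishes(adj_g, adj_h):
    """Return True if 1-WL assigns the two graphs different color histograms."""
    # Refine on the disjoint union so that colors are comparable across graphs.
    union = {("g", v): [("g", u) for u in nbrs] for v, nbrs in adj_g.items()}
    union.update({("h", v): [("h", u) for u in nbrs] for v, nbrs in adj_h.items()})
    colors = refine_colors(union)
    hist_g = Counter(c for (side, _), c in colors.items() if side == "g")
    hist_h = Counter(c for (side, _), c in colors.items() if side == "h")
    return hist_g != hist_h
```

The answer is binary, which is the limitation the abstract points to: a 6-cycle and two disjoint triangles receive identical histograms, and the test says nothing about how similar two distinguishable graphs are.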
On the Fine-Grained Complexity of One-Dimensional Dynamic Programming
In this paper, we investigate the complexity of one-dimensional dynamic programming, or more specifically, of the Least-Weight Subsequence (LWS) problem: Given a sequence of n data items together with weights for every pair of the items, the task is to determine a subsequence S minimizing the total weight of the pairs adjacent in S. A large number of natural problems can be formulated as LWS problems, yielding obvious O(n^2)-time solutions.
In many interesting instances, the O(n^2)-many weights can be succinctly represented. Yet except for near-linear time algorithms for some specific special cases, little is known about when an LWS instantiation admits a subquadratic-time algorithm and when it does not. In particular, no lower bounds for LWS instantiations have been known before. In an attempt to remedy this situation, we provide a general approach to study the fine-grained complexity of succinct instantiations of the LWS problem: Given an LWS instantiation we identify a highly parallel core problem that is subquadratically equivalent. This provides either an explanation for the apparent hardness of the problem or an avenue to find improved algorithms as the case may be.
More specifically, we prove subquadratic equivalences between the following pairs (an LWS instantiation and the corresponding core problem) of problems: a low-rank version of LWS and minimum inner product, finding the longest chain of nested boxes and vector domination, and a coin change problem which is closely related to the knapsack problem and (min,+)-convolution. Using these equivalences and known SETH-hardness results for some of the core problems, we deduce tight conditional lower bounds for the corresponding LWS instantiations. We also establish the (min,+)-convolution-hardness of the knapsack problem. Furthermore, we revisit some of the LWS instantiations which are known to be solvable in near-linear time and explain their easiness in terms of the easiness of the corresponding core problems.
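The obvious O(n^2)-time solution for LWS mentioned in the abstract above can be sketched as a one-dimensional dynamic program. This sketch assumes the common formulation in which items are indexed 0..n, the subsequence must start at item 0 and end at item n, and the weights are supplied as a callable `weight(i, j)` for i < j. These conventions and all names are illustrative assumptions, not details taken from the paper.

```python
def least_weight_subsequence(n, weight):
    """Quadratic-time DP: best[j] = min over i < j of best[i] + weight(i, j).

    Returns (minimum total weight, list of chosen indices from 0 to n).
    """
    best = [0] + [float("inf")] * n
    prev = [None] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            candidate = best[i] + weight(i, j)
            if candidate < best[j]:
                best[j] = candidate
                prev[j] = i
    # Walk the predecessor links back from n to recover the subsequence.
    path = [n]
    while prev[path[-1]] is not None:
        path.append(prev[path[-1]])
    path.reverse()
    return best[n], path
```

With a weight such as `(j - i) ** 2`, many short steps are cheapest, whereas a constant weight favors the single jump from 0 to n. The double loop evaluates every pair once, which is the quadratic baseline the paper's subquadratic equivalences are measured against.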