Edit Distance: Sketching, Streaming and Document Exchange
We show that in the document exchange problem, where Alice holds x ∈ {0,1}^n and Bob holds y ∈ {0,1}^n, Alice can send Bob a message of
size O(K(log^2 K + log n)) bits such that Bob can recover x using the
message and his input y if the edit distance between x and y is no more
than K, and output "error" otherwise. Both the encoding and decoding can be
done in time Õ(n + poly(K)). This result significantly
improves the previous communication bounds under polynomial encoding/decoding
time. We also show that in the referee model, where Alice and Bob hold x and
y respectively, they can compute sketches of x and y of sizes
poly(K log n) bits (the encoding), and send them to the referee, who can
then compute the edit distance between x and y together with all the edit
operations if the edit distance is no more than K, and output "error"
otherwise (the decoding). To the best of our knowledge, this is the first
result for sketching edit distance using poly(K log n) bits.
Moreover, the encoding phase of our sketching algorithm can be performed by
scanning the input string in one pass. Thus our sketching algorithm also
implies the first streaming algorithm for computing edit distance and all the
edits exactly using poly(K log n) bits of space.
Comment: Full version of an article to be presented at the 57th Annual IEEE
Symposium on Foundations of Computer Science (FOCS 2016)
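The output contract described in this abstract (return the edit distance together with all edit operations when the distance is within a threshold, and "error" otherwise) can be illustrated with a plain dynamic program. This is only a sketch of the input/output behaviour: it reads both strings in full and uses quadratic time and space, so it is not the sketching or streaming algorithm of the paper. The function name `edits_within` and the threshold name `K` are assumptions made for this example.

```python
def edits_within(x: str, y: str, K: int):
    """Return (distance, ops) if the edit distance of x and y is <= K, else "error".

    Hypothetical helper: a textbook O(len(x) * len(y)) dynamic program,
    NOT the small-space sketch from the abstract.
    """
    n, m = len(x), len(y)
    # dp[i][j] = edit distance between the prefixes x[:i] and y[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete x[i-1]
                           dp[i][j - 1] + 1,         # insert y[j-1]
                           dp[i - 1][j - 1] + cost)  # match or substitute
    if dp[n][m] > K:
        return "error"

    # Walk back from the bottom-right corner to list the edit operations.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and x[i - 1] == y[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            ops.append(("substitute", i - 1, y[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("delete", i - 1, x[i - 1]))
            i -= 1
        else:
            ops.append(("insert", i, y[j - 1]))
            j -= 1
    ops.reverse()
    return dp[n][m], ops


if __name__ == "__main__":
    print(edits_within("kitten", "sitting", 3))
    print(edits_within("kitten", "sitting", 2))
```

Positions in the returned operations are indices into `x`; the number of operations always equals the reported distance.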
Improved Algorithms for Approximate String Matching (Extended Abstract)
The problem of approximate string matching is important in many different
areas such as computational biology, text processing and pattern recognition. A
great effort has been made to design efficient algorithms addressing several
variants of the problem, including comparison of two strings, approximate
pattern identification in a string or calculation of the longest common
subsequence that two strings share.
We designed an output-sensitive algorithm solving the edit distance problem
between two strings of lengths n and m respectively in time
O((s-|n-m|)min(m,n,s)+m+n) and linear space, where s is the edit distance
between the two strings. This worst-case time bound sets the quadratic factor
of the algorithm independent of the longest string length and improves existing
theoretical bounds for this problem. The implementation of our algorithm also
excels in practice, especially in cases where the two strings compared differ
significantly in length. Source code of our algorithm is available at
http://www.cs.miami.edu/~dimitris/edit_distance
Comment: 10 pages
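To give a feel for what "output sensitive" means here, the sketch below uses a well-known related technique: a banded dynamic program with threshold doubling, in the style of Ukkonen. It is not the authors' algorithm and does not achieve their bound. It only shows how the number of evaluated cells can depend on the distance s rather than on the full n × m table. The names `banded_distance` and `edit_distance` are assumptions for this example.

```python
def banded_distance(a: str, b: str, t: int):
    """Return the edit distance if it is <= t, else None.

    Only cells with |i - j| <= t are evaluated. For readability each row is
    still allocated at full length; a tuned version would store the band only.
    """
    n, m = len(a), len(b)
    if abs(n - m) > t:
        return None  # the length difference alone already exceeds t
    INF = t + 1      # any value above t means "too far"
    prev = [j if j <= t else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        lo, hi = max(1, i - t), min(m, i + t)
        cur = [INF] * (m + 1)
        if i <= t:
            cur[0] = i
        for j in range(lo, hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost, INF)
        prev = cur
    return prev[m] if prev[m] <= t else None


def edit_distance(a: str, b: str) -> int:
    """Double the threshold until the banded computation succeeds."""
    t = max(1, abs(len(a) - len(b)))
    while True:
        d = banded_distance(a, b, t)
        if d is not None:
            return d
        t *= 2


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
```

Starting the threshold at the length difference is a natural choice, since no alignment can cost less than |n - m|.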
Cross-Recurrence Quantification Analysis of Categorical and Continuous Time Series: an R package
This paper describes the R package crqa to perform cross-recurrence
quantification analysis of two time series of either a categorical or
continuous nature. Streams of behavioral information, from eye movements to
linguistic elements, unfold over time. When two people interact, such as in
conversation, they often adapt to each other, leading these behavioral levels
to exhibit recurrent states. In dialogue, for example, interlocutors adapt to
each other by exchanging interactive cues: smiles, nods, gestures, choice of
words, and so on. In order for us to capture closely the goings-on of dynamic
interaction, and uncover the extent of coupling between two individuals, we
need to quantify how much recurrence is taking place at these levels. Methods
available in crqa would allow researchers in cognitive science to pose such
questions as how much are two people recurrent at some level of analysis, what
is the characteristic lag time for one person to maximally match another, or
whether one person is leading another. First, we set the theoretical ground to
understand the difference between 'correlation' and 'co-visitation' when
comparing two time series, using an aggregative or cross-recurrence approach.
Then, we describe more formally the principles of cross-recurrence, and show
with the current package how to carry out analyses applying them. We end the
paper by comparing the computational efficiency and the consistency of results
of the crqa R package with the benchmark MATLAB toolbox crptoolbox. We show
perfect comparability between the two libraries on both levels.
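The questions posed in this abstract (how recurrent two series are, and at which lag one best matches the other) can be made concrete for the categorical case with a small sketch. It does not use or reproduce the crqa package API; the function names `cross_recurrence`, `recurrence_rate` and `diagonal_profile` are assumptions for this example, and two states count as recurrent simply when they are equal.

```python
def cross_recurrence(a, b):
    """R[i][j] = 1 when series a at time i shows the same category as b at time j."""
    return [[1 if ai == bj else 0 for bj in b] for ai in a]


def recurrence_rate(R):
    """Fraction of recurrent points in the whole cross-recurrence matrix."""
    total = sum(len(row) for row in R)
    return sum(map(sum, R)) / total if total else 0.0


def diagonal_profile(R, max_lag):
    """Recurrence rate along each diagonal j - i = lag, for |lag| <= max_lag.

    A peak at a positive lag means the second series shows the same states
    as the first, but later in time.
    """
    n = len(R)
    m = len(R[0]) if R else 0
    profile = {}
    for lag in range(-max_lag, max_lag + 1):
        cells = [R[i][i + lag] for i in range(n) if 0 <= i + lag < m]
        profile[lag] = sum(cells) / len(cells) if cells else 0.0
    return profile


if __name__ == "__main__":
    first = list("abcdef")
    second = list("fabcde")  # the first series delayed by one step
    R = cross_recurrence(first, second)
    profile = diagonal_profile(R, max_lag=2)
    print(recurrence_rate(R), max(profile, key=profile.get))
```

In the usage example the second series repeats the first one step later, so the diagonal profile peaks at lag 1.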
Bayesian models and algorithms for protein beta-sheet prediction
Prediction of the three-dimensional structure greatly benefits from the information related to secondary structure, solvent accessibility, and non-local contacts that stabilize a protein's structure. Prediction of such components is vital to our understanding of the structure and function of a protein. In this paper, we address the problem of beta-sheet prediction. We introduce a Bayesian approach for proteins with six or fewer beta-strands, in which we model the conformational features in a probabilistic framework. To select the optimum architecture, we analyze the space of possible conformations by efficient heuristics. Furthermore, we employ an algorithm that finds the optimum pairwise alignment between beta-strands using dynamic programming. Allowing any number of gaps in an alignment enables us to model beta-bulges more effectively. Though our main focus is proteins with six or fewer beta-strands, we are also able to perform predictions for proteins with more than six beta-strands by combining the predictions of BetaPro with the gapped alignment algorithm. We evaluated the accuracy of our method and BetaPro. We performed a 10-fold cross validation experiment on the BetaSheet916 set and we obtained significant improvements in the prediction accuracy.
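The gapped pairwise alignment by dynamic programming mentioned in this abstract can be sketched with a generic global alignment. The scoring values below (match, mismatch, gap) are placeholder assumptions, not the probabilistic scores of the paper, and the function name `align` is hypothetical; the point is only to show how allowing gaps lets one residue stay unpaired, as in a beta-bulge.

```python
def align(s: str, t: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Global alignment of s and t with linear gap costs.

    Returns (score, aligned_s, aligned_t), where '-' marks a gap.
    The scoring parameters are illustrative placeholders.
    """
    n, m = len(s), len(t)
    # score[i][j] = best score for aligning s[:i] with t[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,  # pair two residues
                              score[i - 1][j] + gap,       # gap in t
                              score[i][j - 1] + gap)       # gap in s
    # Trace back one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = (match if s[i - 1] == t[j - 1] else mismatch) if i > 0 and j > 0 else None
        if pair is not None and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(s[i - 1]); bottom.append(t[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(s[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(align("VKVT", "VKT"))
```

In the usage example the shorter strand receives a single gap, so one residue of the longer strand is left without a partner.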
Simultaneous identification of specifically interacting paralogs and inter-protein contacts by Direct-Coupling Analysis
Understanding protein-protein interactions is central to our understanding of
almost all complex biological processes. Computational tools exploiting rapidly
growing genomic databases to characterize protein-protein interactions are
urgently needed. Such methods should connect multiple scales from evolutionarily
conserved interactions between families of homologous proteins, over the
identification of specifically interacting proteins in the case of multiple
paralogs inside a species, down to the prediction of residues being in physical
contact across interaction interfaces. Statistical inference methods detecting
residue-residue coevolution have recently triggered considerable progress in
using sequence data for quaternary protein structure prediction; they require,
however, large joint alignments of homologous protein pairs known to interact.
The generation of such alignments is a complex computational task on its own;
application of coevolutionary modeling has in turn been restricted to proteins
without paralogs, or to bacterial systems with the corresponding coding genes
being co-localized in operons. Here we show that the Direct-Coupling Analysis
of residue coevolution can be extended to connect the different scales, and
simultaneously to match interacting paralogs, to identify inter-protein
residue-residue contacts and to discriminate interacting from noninteracting
families in a multiprotein system. Our results extend the potential
applications of coevolutionary analysis far beyond cases treatable so far.
Comment: Main Text 19 pages, Supp. Inf. 16 pages