79,589 research outputs found
Linear-time Computation of Minimal Absent Words Using Suffix Array
An absent word of a word y of length n is a word that does not occur in y. It
is a minimal absent word if all its proper factors occur in y. Minimal absent
words have been computed in genomes of organisms from all domains of life;
their computation provides a fast alternative for measuring approximation in
sequence comparison. There exists an O(n)-time and O(n)-space algorithm for
computing all minimal absent words on a fixed-sized alphabet based on the
construction of suffix automata (Crochemore et al., 1998). No implementation of
this algorithm is publicly available. There also exists an O(n^2)-time and
O(n)-space algorithm for the same problem based on the construction of suffix
arrays (Pinho et al., 2009). An implementation of this algorithm was also
provided by the authors and is currently the fastest available. In this
article, we bridge this unpleasant gap by presenting an O(n)-time and
O(n)-space algorithm for computing all minimal absent words based on the
construction of suffix arrays. Experimental results using real and synthetic
data show that the respective implementation outperforms the one by Pinho et
al
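
As an illustration of the definition in this abstract, the following is a minimal brute-force sketch, not the linear-time suffix-array algorithm the article presents. It relies on the observation that a word is a minimal absent word exactly when it is absent while both its longest proper prefix and its longest proper suffix occur. The function names and the explicit `alphabet` parameter are assumptions made for this example.

```python
def factors(y):
    """Return the set of all factors (substrings) of y, including the empty word."""
    return {y[i:j] for i in range(len(y) + 1) for j in range(i, len(y) + 1)}


def minimal_absent_words(y, alphabet):
    """Brute-force enumeration of the minimal absent words of y.

    A candidate x = w + b is reported when x does not occur in y, its longest
    proper prefix w occurs, and its longest proper suffix x[1:] occurs.
    Every proper factor of x lies inside one of those two, so all proper
    factors occur. This takes far more than linear time and is for
    illustration only.
    """
    occurring = factors(y)
    result = set()
    for w in occurring:          # w occurs in y (possibly the empty word)
        for b in alphabet:
            x = w + b
            if x not in occurring and x[1:] in occurring:
                result.add(x)
    return sorted(result)


if __name__ == "__main__":
    print(minimal_absent_words("abaab", "ab"))
```

For the word abaab over the alphabet {a, b}, this sketch reports aaa, aaba, bab and bb.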
Spanish generation from Spanish Sign Language using a phrase-based translation system
This paper describes the development of a Spoken Spanish generator from Spanish Sign Language (LSE, Lengua de Signos Española) in a specific domain: the renewal of Identity Document and Driver's license. The system is composed of three modules. The first one is an interface where a deaf person can specify a sign sequence in sign-writing. The second one is a language translator for converting the sign sequence into a word sequence. Finally, the last module is a text to speech converter. Also, the paper describes the generation of a parallel corpus for the system development composed of more than 4,000 Spanish sentences and their LSE translations in the application domain. The paper is focused on the translation module that uses a statistical strategy with a phrase-based translation model, and this paper analyses the effect of the alignment configuration used during the process of word based translation model generation. Finally, the best configuration gives a 3.90% mWER and a 0.9645 BLEU
A framework for space-efficient string kernels
String kernels are typically used to compare genome-scale sequences whose
length makes alignment impractical, yet their computation is based on data
structures that are either space-inefficient, or incur large slowdowns. We show
that a number of exact string kernels, like the k-mer kernel, the substrings
kernels, a number of length-weighted kernels, the minimal absent words kernel,
and kernels with Markovian corrections, can all be computed in time and
in bits of space in addition to the input, using just a
data structure on the Burrows-Wheeler transform of the
input strings, which takes time per element in its output. The same
bounds hold for a number of measures of compositional complexity based on
multiple values of k, like the k-mer profile and the k-th order empirical
entropy, and for calibrating the value of k using the data
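
To make the quantities named in this abstract concrete, here is a naive sketch of a k-mer profile, an unnormalized k-mer kernel, and the k-th order empirical entropy. It uses plain hash tables rather than the space-efficient Burrows-Wheeler machinery the paper describes. The function names are hypothetical, and treating the kernel as a plain inner product of count vectors is an assumption of this example.

```python
from collections import Counter, defaultdict
from math import log2


def kmer_profile(s, k):
    """Count every length-k substring of s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def kmer_kernel(s, t, k):
    """Inner product of the two k-mer count vectors (no normalization)."""
    ps, pt = kmer_profile(s, k), kmer_profile(t, k)
    return sum(count * pt[w] for w, count in ps.items())


def empirical_entropy(s, k):
    """k-th order empirical entropy of s, in bits per character.

    For each length-k context w, take the characters that follow w in s,
    compute their zeroth-order entropy, weight it by how many there are,
    and divide the total by len(s).
    """
    n = len(s)
    if n == 0:
        return 0.0
    followers = defaultdict(Counter)
    for i in range(n - k):
        followers[s[i:i + k]][s[i + k]] += 1
    total = 0.0
    for ctx in followers.values():
        n_w = sum(ctx.values())
        total -= sum(c * log2(c / n_w) for c in ctx.values())
    return total / n


if __name__ == "__main__":
    print(kmer_profile("abab", 2))
    print(kmer_kernel("abab", "ab", 2))
    print(empirical_entropy("abab", 0), empirical_entropy("abab", 1))
```

This version stores every distinct k-mer explicitly, so its memory use grows with the number of distinct k-mers; it only illustrates what is being measured.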
Automatic alignment of hieroglyphs and transliteration
Automatic alignment has important applications in philology, facilitating study of texts on the basis of electronic resources produced by different scholars. A simple technique is presented to realise such alignment for Ancient Egyptian hieroglyphic texts and transliteration. Preliminary experiments with the technique are reported, and plans for future work are discussed.
Google matrix analysis of DNA sequences
For DNA sequences of various species we construct the Google matrix G of
Markov transitions between nearby words composed of several letters. The
statistical distribution of matrix elements of this matrix is shown to be
described by a power law with the exponent being close to those of outgoing
links in such scale-free networks as the World Wide Web (WWW). At the same time
the sum of ingoing matrix elements is characterized by the exponent being
significantly larger than those typical for WWW networks. This results in a
slow algebraic decay of the PageRank probability determined by the distribution
of ingoing elements. The spectrum of G is characterized by a large gap leading
to a rapid relaxation process on the DNA sequence networks. We introduce the
PageRank proximity correlator between different species which determines their
statistical similarity from the view point of Markov chains. The properties of
other eigenstates of the Google matrix are also discussed. Our results
establish scale-free features of DNA sequence networks showing their
similarities and distinctions with the WWW and linguistic networks. Comment: latex, 11 fig
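
The construction outlined in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' procedure: the sequence is cut into consecutive non-overlapping words of `m` letters, transitions are counted between neighbouring words, columns without outgoing transitions are replaced by a uniform column, and the damping factor `alpha = 0.85` is the conventional PageRank choice. All names and parameters are hypothetical.

```python
from itertools import product


def google_matrix(seq, m=1, alpha=0.85, alphabet="ACGT"):
    """Build a column-stochastic Google matrix over all words of m letters."""
    words = ["".join(p) for p in product(alphabet, repeat=m)]
    index = {w: i for i, w in enumerate(words)}
    n = len(words)
    counts = [[0] * n for _ in range(n)]
    # Cut the sequence into consecutive words and count word-to-word transitions.
    chunks = [seq[i:i + m] for i in range(0, len(seq) - m + 1, m)]
    for a, b in zip(chunks, chunks[1:]):
        if a in index and b in index:      # skip words with unknown letters
            counts[index[b]][index[a]] += 1
    G = [[0.0] * n for _ in range(n)]
    for j in range(n):
        col_sum = sum(counts[i][j] for i in range(n))
        for i in range(n):
            # Dangling columns (no outgoing transitions) become uniform.
            s_ij = counts[i][j] / col_sum if col_sum else 1.0 / n
            G[i][j] = alpha * s_ij + (1.0 - alpha) / n
    return words, G


def pagerank(G, iterations=200):
    """Approximate the leading eigenvector of G by power iteration."""
    n = len(G)
    p = [1.0 / n] * n
    for _ in range(iterations):
        p = [sum(G[i][j] * p[j] for j in range(n)) for i in range(n)]
    return p


if __name__ == "__main__":
    words, G = google_matrix("ACGTACGTAACCGGTT", m=1)
    for w, value in zip(words, pagerank(G)):
        print(w, round(value, 4))
```

Each column of `G` sums to one, so the PageRank vector stays a probability distribution throughout the iteration.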
Cooperative "folding transition" in the sequence space facilitates function-driven evolution of protein families
In the protein sequence space, natural proteins form clusters of families
which are characterized by their unique native folds whereas the great majority
of random polypeptides are neither clustered nor foldable to unique structures.
Since a given polypeptide can be either foldable or unfoldable, a kind of
"folding transition" is expected at the boundary of a protein family in the
sequence space. By Monte Carlo simulations of a statistical mechanical model of
protein sequence alignment that coherently incorporates both short-range and
long-range interactions as well as variable-length insertions to reproduce the
statistics of the multiple sequence alignment of a given protein family, we
demonstrate the existence of such transition between natural-like sequences and
random sequences in the sequence subspaces for 15 domain families of various
folds. The transition was found to be highly cooperative and two-state-like.
Furthermore, enforcing or suppressing consensus residues on a few of the
well-conserved sites enhanced or diminished, respectively, the natural-like
pattern formation over the entire sequence. In most families, the key sites
included ligand binding sites. These results suggest some selective pressure on
the key residues, such as ligand binding activity, may cooperatively facilitate
the emergence of a protein family during evolution. From a more practical
aspect, the present results highlight an essential role of long-range effects
in precisely defining protein families, which are absent in conventional
sequence models. Comment: 13 pages, 7 figures, 2 tables (a new subsection added
Alignment-free Genomic Analysis via a Big Data Spark Platform
Motivation: Alignment-free distance and similarity functions (AF functions,
for short) are a well-established alternative to pairwise and multiple sequence
alignments for many genomic, metagenomic and epigenomic tasks. Due to
data-intensive applications, the computation of AF functions is a Big Data
problem, with recent literature indicating that the development of fast and
scalable algorithms computing AF functions is a high-priority task. Somewhat
surprisingly, despite the increasing popularity of Big Data technologies in
Computational Biology, the development of a Big Data platform for those tasks
has not been pursued, possibly due to its complexity. Results: We fill this
important gap by introducing FADE, the first extensible, efficient and scalable
Spark platform for alignment-free genomic analysis. It natively supports
eighteen of the best-performing AF functions coming out of a recent hallmark
benchmarking study. FADE's development and potential impact comprise novel
aspects of interest, namely: (a) a considerable effort on distributed
algorithms, the most tangible result being a much faster execution time of
reference methods like MASH and FSWM; (b) a software design that makes FADE
user-friendly and easily extendable by Spark non-specialists; (c) its ability
to support data- and compute-intensive tasks. In this regard, we provide a novel
and much needed analysis of how informative and robust AF functions are, in
terms of the statistical significance of their output. Our findings naturally
extend the ones of the highly regarded benchmarking study, since the functions
that can really be used are reduced to a handful of the eighteen included in
FADE