Space-efficient detection of unusual words
Detecting all the strings that occur in a text more frequently or less
frequently than expected according to an IID or a Markov model is a basic
problem in string mining, yet current algorithms are based on data structures
that are either space-inefficient or incur large slowdowns, and current
implementations cannot scale to genomes or metagenomes in practice. In this
paper we engineer an algorithm based on the suffix tree of a string to use just
a small data structure built on the Burrows-Wheeler transform, and a stack of
bits, where is the length of the string and
is the size of the alphabet. The size of the stack is except for very
large values of . We further improve the algorithm by removing its time
dependency on , by reporting only a subset of the maximal repeats and
of the minimal rare words of the string, and by detecting and scoring candidate
under-represented strings that do not occur in the string. Our
algorithms are practical and work directly on the BWT, thus they can be
immediately applied to a number of existing datasets that are available in this
form, returning this string mining problem to a manageable scale.
Comment: arXiv admin note: text overlap with arXiv:1502.0637
A framework for space-efficient string kernels
String kernels are typically used to compare genome-scale sequences whose
length makes alignment impractical, yet their computation is based on data
structures that are either space-inefficient, or incur large slowdowns. We show
that a number of exact string kernels, like the k-mer kernel, the substrings
kernels, a number of length-weighted kernels, the minimal absent words kernel,
and kernels with Markovian corrections, can all be computed in time and
in bits of space in addition to the input, using just a
data structure on the Burrows-Wheeler transform of the
input strings, which takes time per element in its output. The same
bounds hold for a number of measures of compositional complexity based on
multiple values of k, like the k-mer profile and the k-th order empirical
entropy, and for calibrating the value of k using the data.
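As a point of reference for the k-mer profile and k-mer kernel named in this abstract, the following is a minimal sketch that assumes the kernel is the unnormalized inner product of the two k-mer count vectors. It uses plain hash tables and is not the space-efficient BWT-based method of the paper; the function names `kmer_profile` and `kmer_kernel` are illustrative only.

```python
from collections import Counter


def kmer_profile(s: str, k: int) -> Counter:
    """Count every length-k substring of s (empty if k > len(s))."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def kmer_kernel(s: str, t: str, k: int) -> int:
    """Inner product of the k-mer count vectors of s and t.

    Assumption: unnormalized variant; a cosine-normalized kernel
    would divide by the norms of the two profiles.
    """
    profile_s = kmer_profile(s, k)
    profile_t = kmer_profile(t, k)
    # Counter returns 0 for k-mers absent from t, so only shared k-mers contribute.
    return sum(count * profile_t[kmer] for kmer, count in profile_s.items())


if __name__ == "__main__":
    print(kmer_kernel("ACGT", "ACGA", 2))
```

This naive version stores every distinct k-mer explicitly, which is exactly the kind of space cost the abstract says the BWT-based framework avoids.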
Guessing probability distributions from small samples
We propose a new method for calculating statistical properties, such as
the entropy, of unknown generators of symbolic sequences. The probability
distribution of the elements of a population can be approximated by
the frequencies of a sample provided the sample is long enough so that
each element occurs many times. Our method yields an approximation if this
precondition does not hold. For a given we recalculate the Zipf--ordered
probability distribution by optimization of the parameters of a guessed
distribution. We demonstrate that our method yields reliable results.
Comment: 10 pages, uuencoded compressed PostScrip
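For context, the baseline that this abstract contrasts itself with, approximating probabilities by sample frequencies, can be sketched as below. This is the naive plug-in estimate only, not the optimization-based method the abstract proposes; the function names are illustrative.

```python
import math
from collections import Counter


def empirical_distribution(sample):
    """Relative frequency of each symbol in a non-empty sample."""
    counts = Counter(sample)
    total = sum(counts.values())
    return {symbol: c / total for symbol, c in counts.items()}


def plugin_entropy(sample):
    """Shannon entropy in bits of the empirical distribution.

    Assumption: frequencies stand in for the true probabilities, which is
    only reasonable when each symbol occurs many times in the sample.
    """
    return -sum(p * math.log2(p) for p in empirical_distribution(sample).values())


if __name__ == "__main__":
    print(plugin_entropy("aabb"))
```

On short samples this estimate is known to be biased, since rare symbols are under-sampled or missing; that is the regime the abstract targets.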
An output-sensitive algorithm for the minimization of 2-dimensional String Covers
String covers are a powerful tool for analyzing the quasi-periodicity of
1-dimensional data and find applications in automata theory, computational
biology, coding and the analysis of transactional data. A cover of a
string is a string whose occurrences together contain every letter of the
covered string. String covers have been generalized in many ways, leading to
k-covers, -covers, and approximate covers, and have been
studied in different contexts such as indeterminate strings.
In this paper we generalize string covers to the context of 2-dimensional
data, such as images. We show how they can be used for the extraction of
textures from images and identification of primitive cells in lattice data.
This has interesting applications in image compression, procedural terrain
generation and crystallography.
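The 1-dimensional definition of a cover given in this abstract can be checked directly. The sketch below covers only that 1-dimensional case, not the 2-dimensional generalization or the output-sensitive algorithm of the paper, and the function name `is_cover` is illustrative.

```python
def is_cover(candidate: str, text: str) -> bool:
    """Return True if every position of text lies inside some occurrence of candidate."""
    m = len(candidate)
    if m == 0 or m > len(text):
        return False
    covered_up_to = 0  # positions [0, covered_up_to) are covered so far
    start = text.find(candidate)
    while start != -1:
        if start > covered_up_to:
            # Occurrences are scanned left to right, so this gap can never be filled.
            return False
        covered_up_to = start + m
        start = text.find(candidate, start + 1)  # allow overlapping occurrences
    return covered_up_to == len(text)


if __name__ == "__main__":
    print(is_cover("aba", "abaababa"))
```

Here "aba" covers "abaababa" through overlapping occurrences at positions 0, 3 and 5, while "ab" does not cover "abaab" because position 2 lies in no occurrence.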
On Quasiperiodic Morphisms
Weakly and strongly quasiperiodic morphisms are tools introduced to study
quasiperiodic words. Formally they map respectively at least one or any
non-quasiperiodic word to a quasiperiodic word. Considering them both on finite
and infinite words, we get four families of morphisms between which we study
relations. We provide algorithms to decide whether a morphism is strongly
quasiperiodic on finite words or on infinite words.
Comment: 12 page
A Coverage Criterion for Spaced Seeds and its Applications to Support Vector Machine String Kernels and k-Mer Distances
Spaced seeds have been recently shown to not only detect more alignments, but
also to give a more accurate measure of phylogenetic distances (Boden et al.,
2013, Horwege et al., 2014, Leimeister et al., 2014), and to provide a lower
misclassification rate when used with Support Vector Machines (SVMs) (Onodera
and Shibuya, 2013). We confirm these two results by independent experiments,
and propose in this article to use a coverage criterion (Benson and Mak, 2008,
Martin, 2013, Martin and Noé, 2014) to measure the seed efficiency in both
cases in order to design better seed patterns. We show first how this coverage
criterion can be directly measured by a full automaton-based approach. We then
illustrate how this criterion performs when compared with two other criteria
frequently used, namely the single-hit and multiple-hit criteria, through
correlation coefficients with the correct classification/the true distance.
Finally, for alignment-free distances, we propose an extension by adopting the
coverage criterion, show how it performs, and indicate how it can be
efficiently computed.
Comment: http://online.liebertpub.com/doi/abs/10.1089/cmb.2014.017
Scheduling Jobs in Flowshops with the Introduction of Additional Machines in the Future
This is the author's peer-reviewed final manuscript, as accepted by the publisher. The published article is copyrighted by Elsevier and can be found at: http://www.journals.elsevier.com/expert-systems-with-applications/.
The problem of scheduling jobs to minimize total weighted tardiness in flowshops,
with the possibility of evolving into hybrid flowshops in the future, is investigated in
this paper. As this research is guided by a real problem in industry, the flowshop
considered has considerable flexibility, which stimulated the development of an
innovative methodology for this research. Each stage of the flowshop currently has
one or several identical machines. However, the manufacturing company is planning
to introduce additional machines with different capabilities in different stages in the
near future. Thus, the algorithm proposed and developed for the problem is not only
capable of solving the current flow line configuration but also the potential new
configurations that may result in the future. A meta-heuristic search algorithm based
on Tabu search is developed to solve this NP-hard, industry-guided problem. Six
different initial solution finding mechanisms are proposed. A carefully planned
nested split-plot design is performed to test the significance of different factors and
their impact on the performance of the different algorithms. To the best of our
knowledge, this research is the first of its kind that attempts to solve an industry-guided
problem with the concern for future developments.
Palindromic Decompositions with Gaps and Errors
Identifying palindromes in sequences has been an interesting line of research
in combinatorics on words and also in computational biology, after the
discovery of the relation of palindromes in the DNA sequence with the HIV
virus. Efficient algorithms for the factorization of sequences into palindromes
and maximal palindromes have been devised in recent years. We extend these
studies by allowing gaps in decompositions and errors in palindromes, and also
imposing a lower bound to the length of acceptable palindromes.
We first present an algorithm for obtaining a palindromic decomposition of a
string of length n with the minimal total gap length in time O(n log n * g) and
space O(n g), where g is the number of allowed gaps in the decomposition. We
then consider a decomposition of the string in maximal δ-palindromes (i.e.
palindromes with δ errors under the edit or Hamming distance) and g
allowed gaps. We present an algorithm to obtain such a decomposition with the
minimal total gap length in time O(n (g + δ)) and space O(n g).
Comment: accepted to CSR 201
Role of the initial conditions on the enhancement of the escape time in static and fluctuating potentials
We present a study of the noise driven escape of an overdamped Brownian
particle moving in a cubic potential profile with a metastable state. We
analyze the role of the initial conditions of the particle on the enhancement
of the average escape time as a function of the noise intensity for fixed and
fluctuating potentials. We observe the noise enhanced stability effect for all
the initial unstable states investigated. For a fixed potential we find a
peculiar initial condition which separates the set of the initial
unstable states in two regions: those which give rise to divergences from those
which show nonmonotonic behavior of the average escape time. For fluctuating
potential at this particular initial condition and for low noise intensity we
find large fluctuations of the average escape time.
Comment: 8 pages, 6 figures. Appeared in Physica A (2003