Search CORE

328 research outputs found

Space-efficient detection of unusual words

Author: A Apostolico
A Apostolico
CAR Hoare
D Belazzougui
D Belazzougui
J Herold
J Lin
M Crochemore
S Chairungsee
Publication venue
Publication date: 01/01/2015
Field of study

Detecting all the strings that occur in a text more frequently or less frequently than expected according to an IID or a Markov model is a basic problem in string mining, yet current algorithms are based on data structures that are either space-inefficient or incur large slowdowns, and current implementations cannot scale to genomes or metagenomes in practice. In this paper we engineer an algorithm based on the suffix tree of a string to use just a small data structure built on the Burrows-Wheeler transform, and a stack of

O(\sigma^2\log^2 n)

bits, where

n

is the length of the string and

\sigma

is the size of the alphabet. The size of the stack is

o(n)

except for very large values of

\sigma

. We further improve the algorithm by removing its time dependency on

\sigma

, by reporting only a subset of the maximal repeats and of the minimal rare words of the string, and by detecting and scoring candidate under-represented strings that

\textit{do not occur}

in the string. Our algorithms are practical and work directly on the BWT, thus they can be immediately applied to a number of existing datasets that are available in this form, returning this string mining problem to a manageable scale.Comment: arXiv admin note: text overlap with arXiv:1502.0637

arXiv.org e-Print Archive

Crossref

MPG.PuRe

A framework for space-efficient string kernels

Author: A Apostolico
A Apostolico
AJ Smola
AM İleri
B Chor
D Belazzougui
G Reinert
GE Sims
J Herold
J Qi
J Shawe-Taylor
M Crochemore
R Chikhi
S Chairungsee
Publication venue
Publication date: 23/02/2015
Field of study

String kernels are typically used to compare genome-scale sequences whose length makes alignment impractical, yet their computation is based on data structures that are either space-inefficient, or incur large slowdowns. We show that a number of exact string kernels, like the

k

-mer kernel, the substrings kernels, a number of length-weighted kernels, the minimal absent words kernel, and kernels with Markovian corrections, can all be computed in

O(nd)

time and in

o(n)

bits of space in addition to the input, using just a

\mathtt{rangeDistinct}

data structure on the Burrows-Wheeler transform of the input strings, which takes

O(d)

time per element in its output. The same bounds hold for a number of measures of compositional complexity based on multiple value of

k

, like the

k

-mer profile and the

k

-th order empirical entropy, and for calibrating the value of

k

using the data

arXiv.org e-Print Archive

Crossref

Guessing probability distributions from small samples

Author: A. Apostolico
A. Schmitt
B. McMillan
Donald E. Knuth
H. Herzel
Helge Rosé
Thorsten Pöschel
W. Ebeling
W. H. Press
Werner Ebeling
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/1995
Field of study

We propose a new method for the calculation of the statistical properties, as e.g. the entropy, of unknown generators of symbolic sequences. The probability distribution

p(k)

of the elements

k

of a population can be approximated by the frequencies

f(k)

of a sample provided the sample is long enough so that each element

k

occurs many times. Our method yields an approximation if this precondition does not hold. For a given

f(k)

we recalculate the Zipf--ordered probability distribution by optimization of the parameters of a guessed distribution. We demonstrate that our method yields reliable results.Comment: 10 pages, uuencoded compressed PostScrip

arXiv.org e-Print Archive

CiteSeerX

Crossref

An output-sensitive algorithm for the minimization of 2-dimensional String Covers

Author: A Apostolico
A Apostolico
A Bacciotti
A Katok
A Muchnik
A Tychonoff
A Wlodawer
AK Brodzik
AV Aho
DE Knuth
J Kopf
JR Searle
K Perlin
L Bursill
R Middlestead
RS Bird
S Havlin
WA Sethares
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 02/05/2019
Field of study

String covers are a powerful tool for analyzing the quasi-periodicity of 1-dimensional data and find applications in automata theory, computational biology, coding and the analysis of transactional data. A \emph{cover} of a string

T

is a string

C

for which every letter of

T

lies within some occurrence of

C

. String covers have been generalized in many ways, leading to \emph{k-covers}, \emph{

\lambda

-covers}, \emph{approximate covers} and were studied in different contexts such as \emph{indeterminate strings}. In this paper we generalize string covers to the context of 2-dimensional data, such as images. We show how they can be used for the extraction of textures from images and identification of primitive cells in lattice data. This has interesting applications in image compression, procedural terrain generation and crystallography

arXiv.org e-Print Archive

Crossref

On Quasiperiodic Morphisms

Author: A. Apostolico
A. Glen
C.S. Iliopoulos
F. Levé
F. Levé
G. Richomme
G.S. Brodal
N.J. Fine
R. Groult
R.C. Lyndon
S. Marcus
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013
Field of study

Weakly and strongly quasiperiodic morphisms are tools introduced to study quasiperiodic words. Formally they map respectively at least one or any non-quasiperiodic word to a quasiperiodic word. Considering them both on finite and infinite words, we get four families of morphisms between which we study relations. We provide algorithms to decide whether a morphism is strongly quasiperiodic on finite words or on infinite words.Comment: 12 page

arXiv.org e-Print Archive

CiteSeerX

Crossref

HAL Descartes

Portail HAL Um (Université de Montpellier)

HAL: Hyper Article en Ligne

A Coverage Criterion for Spaced Seeds and its Applications to Support Vector Machine String Kernels and k-Mer Distances

Author: Laurent Noé
Donald E.K. Martin
Apostolico A.
Bassino F.
Boden M.
Břinda K.
Burkhardt S.
Egidi L.
Gambin A.
Leslie C.S.
Martin D.E.K.
Martin D.E.K.
Régnier M.
Simon I.
Zhou L.
Publication venue: 'Mary Ann Liebert Inc'
Publication date: 01/01/2010
Field of study

Spaced seeds have been recently shown to not only detect more alignments, but also to give a more accurate measure of phylogenetic distances (Boden et al., 2013, Horwege et al., 2014, Leimeister et al., 2014), and to provide a lower misclassification rate when used with Support Vector Machines (SVMs) (On-odera and Shibuya, 2013), We confirm by independent experiments these two results, and propose in this article to use a coverage criterion (Benson and Mak, 2008, Martin, 2013, Martin and No{\'e}, 2014), to measure the seed efficiency in both cases in order to design better seed patterns. We show first how this coverage criterion can be directly measured by a full automaton-based approach. We then illustrate how this criterion performs when compared with two other criteria frequently used, namely the single-hit and multiple-hit criteria, through correlation coefficients with the correct classification/the true distance. At the end, for alignment-free distances, we propose an extension by adopting the coverage criterion, show how it performs, and indicate how it can be efficiently computed.Comment: http://online.liebertpub.com/doi/abs/10.1089/cmb.2014.017

arXiv.org e-Print Archive

HAL - Lille 3

CiteSeerX

Crossref

INRIA a CCSD electronic archive server

Copenhagen University Research Information System

Parameterized searching with mismatches for run-length encoded strings (extended abstract)

Author: A. Amir
A. Apostolico
B.S. Baker
B.S. Baker
M.L. Fredman
R.K. Ahuja
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010
Field of study

Crossref

Repository of the Academy's Library

Scheduling Jobs in Flowshops with the Introduction of Additional Machines in the Future

Author: A. Apostolico
B. Smyth
C.J. Colbourn
D. Gusfield
E. Ukkonen
E.M. McCreight
G. Manzini
G. Manzini
J. Fischer
J. Fischer
M.A. Bender
M.I. Abouelhoda
S. Burkhardt
S.J. Puglisi
T. Kasai
U. Manber
Publication venue: Elsevier
Publication date: 01/01/2008
Field of study

This is the author's peer-reviewed final manuscript, as accepted by the publisher. The published article is copyrighted by Elsevier and can be found at: http://www.journals.elsevier.com/expert-systems-with-applications/.The problem of scheduling jobs to minimize total weighted tardiness in flowshops,\ud with the possibility of evolving into hybrid flowshops in the future, is investigated in\ud this paper. As this research is guided by a real problem in industry, the flowshop\ud considered has considerable flexibility, which stimulated the development of an\ud innovative methodology for this research. Each stage of the flowshop currently has\ud one or several identical machines. However, the manufacturing company is planning\ud to introduce additional machines with different capabilities in different stages in the\ud near future. Thus, the algorithm proposed and developed for the problem is not only\ud capable of solving the current flow line configuration but also the potential new\ud configurations that may result in the future. A meta-heuristic search algorithm based\ud on Tabu search is developed to solve this NP-hard, industry-guided problem. Six\ud different initial solution finding mechanisms are proposed. A carefully planned\ud nested split-plot design is performed to test the significance of different factors and\ud their impact on the performance of the different algorithms. To the best of our\ud knowledge, this research is the first of its kind that attempts to solve an industry-guided\ud problem with the concern for future developments

CiteSeerX

Crossref

ScholarsArchive@OSU

Research Repository RMIT University

Palindromic Decompositions with Gaps and Errors

Author: A Apostolico
A Frid
D Breslauer
D Gusfield
D Kosolobov
DE Knuth
G Fici
G Manacher
M Crochemore
M Crochemore
M Rubinchik
R Kolpakov
S Gupta
T I
X Droubay
X Droubay
Y Fujishige
Z Galil
Publication venue
Publication date: 01/01/2017
Field of study

Identifying palindromes in sequences has been an interesting line of research in combinatorics on words and also in computational biology, after the discovery of the relation of palindromes in the DNA sequence with the HIV virus. Efficient algorithms for the factorization of sequences into palindromes and maximal palindromes have been devised in recent years. We extend these studies by allowing gaps in decompositions and errors in palindromes, and also imposing a lower bound to the length of acceptable palindromes. We first present an algorithm for obtaining a palindromic decomposition of a string of length n with the minimal total gap length in time O(n log n * g) and space O(n g), where g is the number of allowed gaps in the decomposition. We then consider a decomposition of the string in maximal \delta-palindromes (i.e. palindromes with \delta errors under the edit or Hamming distance) and g allowed gaps. We present an algorithm to obtain such a decomposition with the minimal total gap length in time O(n (g + \delta)) and space O(n g).Comment: accepted to CSR 201

arXiv.org e-Print Archive

Crossref

REIN UW Institutional Repository of the University of Warsaw

Role of the initial conditions on the enhancement of the escape time in static and fluctuating potentials

Author: A Fiasconaro
Agudov
Apostolico
Augdov
B Spagnolo
D Valenti
Dan
Dayan
Doering
Gammaitoni
Hirsch
Hänggi
Kramers
Mantegna
Mantegna
Mantegna
Mantegna
Marchi
Mielke
Talkner
Wackerbauer
Publication venue: 'Elsevier BV'
Publication date: 31/10/2003
Field of study

We present a study of the noise driven escape of an overdamped Brownian particle moving in a cubic potential profile with a metastable state. We analyze the role of the initial conditions of the particle on the enhancement of the average escape time as a function of the noise intensity for fixed and fluctuating potentials. We observe the noise enhanced stability effect for all the initial unstable states investigated. For a fixed potential we find a peculiar initial condition

x_c

which separates the set of the initial unstable states in two regions: those which give rise to divergences from those which show nonmonotonic behavior of the average escape time. For fluctuating potential at this particular initial condition and for low noise intensity we find large fluctuations of the average escape time.Comment: 8 pages, 6 figures. Appeared in Physica A (2003

arXiv.org e-Print Archive

Crossref