7 research outputs found

Space-efficient detection of unusual words

Author: A Apostolico
CAR Hoare
D Belazzougui
J Herold
J Lin
M Crochemore
S Chairungsee
Publication date: 01/01/2015

Detecting all the strings that occur in a text more frequently or less frequently than expected according to an IID or a Markov model is a basic problem in string mining, yet current algorithms are based on data structures that are either space-inefficient or incur large slowdowns, and current implementations cannot scale to genomes or metagenomes in practice. In this paper we engineer an algorithm based on the suffix tree of a string to use just a small data structure built on the Burrows-Wheeler transform, and a stack of $O(\sigma^2\log^2 n)$ bits, where $n$ is the length of the string and $\sigma$ is the size of the alphabet. The size of the stack is $o(n)$ except for very large values of $\sigma$. We further improve the algorithm by removing its time dependency on $\sigma$, by reporting only a subset of the maximal repeats and of the minimal rare words of the string, and by detecting and scoring candidate under-represented strings that do not occur in the string. Our algorithms are practical and work directly on the BWT, thus they can be immediately applied to a number of existing datasets that are available in this form, returning this string mining problem to a manageable scale.

Comment: arXiv admin note: text overlap with arXiv:1502.0637
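To make the scoring idea in the abstract above concrete, here is a minimal sketch of rating substrings against an IID model. It is not the BWT-based algorithm of the paper: it counts substrings with a hash table, uses a plain binomial z-score (an assumption; the paper may use other scores), and only rates words that actually occur. The function name `iid_zscores` is hypothetical.

```python
from collections import Counter
from math import sqrt

def iid_zscores(text, k):
    """Score every length-k substring that occurs in `text` against an IID model."""
    n = len(text)
    positions = n - k + 1
    if positions <= 0:
        return {}
    # Symbol probabilities are estimated from the text itself (an assumption).
    prob = {c: cnt / n for c, cnt in Counter(text).items()}
    observed = Counter(text[i:i + k] for i in range(positions))
    scores = {}
    for word, count in observed.items():
        p = 1.0
        for c in word:
            p *= prob[c]
        expected = positions * p
        # Binomial approximation of the variance; it ignores self-overlaps of the word.
        variance = positions * p * (1.0 - p)
        scores[word] = (count - expected) / sqrt(variance) if variance > 0 else 0.0
    return scores
```

A positive score suggests an over-represented word and a negative score an under-represented one. Scoring words that do not occur at all, as the abstract describes, would need a separate candidate-generation step that this sketch leaves out.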

arXiv.org e-Print Archive

Crossref

MPG.PuRe

Optimal Computation of Avoided Words

Author: A Akalin
C Acquisti
C Barton
D Belazzougui
DB Searls
F Mignosi
I Rusinov
M Crochemore
P Gawrychowski
RN Mantegna
V Brendel
Publication date: 29/04/2016

The deviation of the observed frequency of a word $w$ from its expected frequency in a given sequence $x$ is used to determine whether or not the word is avoided. This concept is particularly useful in DNA linguistic analysis. The value of the standard deviation of $w$, denoted by $std(w)$, effectively characterises the extent of a word by its edge contrast in the context in which it occurs. A word $w$ of length $k>2$ is a $\rho$-avoided word in $x$ if $std(w) \leq \rho$, for a given threshold $\rho < 0$. Notice that such a word may be completely absent from $x$. Hence computing all such words naïvely can be a very time-consuming procedure, in particular for large $k$. In this article, we propose an $O(n)$-time and $O(n)$-space algorithm to compute all $\rho$-avoided words of length $k$ in a given sequence $x$ of length $n$ over a fixed-sized alphabet. We also present a time-optimal $O(\sigma n)$-time and $O(\sigma n)$-space algorithm to compute all $\rho$-avoided words (of any length) in a sequence of length $n$ over an alphabet of size $\sigma$. Furthermore, we provide a tight asymptotic upper bound for the number of $\rho$-avoided words and the expected length of the longest one. We make available an open-source implementation of our algorithm. Experimental results, using both real and synthetic data, show the efficiency of our implementation.
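The definition above can be illustrated with a naive sketch. The abstract does not spell out the formulas, so the code assumes the model usually attributed to Brendel et al.: the expected frequency is E(w) = f(prefix) * f(suffix) / f(infix), and std(w) = (f(w) - E(w)) / max(sqrt(E(w)), 1), where prefix, suffix and infix are w without its last letter, its first letter, and both. This is a hash-table enumeration for one fixed k, not the linear-time algorithm of the paper, and the function name `avoided_words` is hypothetical.

```python
from collections import Counter
from math import sqrt

def avoided_words(x, k, rho):
    """Return {w: std(w)} for all length-k words with std(w) <= rho (assumed model)."""
    if k <= 2 or rho >= 0:
        raise ValueError("expects k > 2 and rho < 0")

    def freq(m):
        # Occurrence counts (overlaps included) of all length-m factors of x.
        return Counter(x[i:i + m] for i in range(len(x) - m + 1))

    f_k, f_k1, f_k2 = freq(k), freq(k - 1), freq(k - 2)
    alphabet = sorted(set(x))
    result = {}
    # An avoided word needs E(w) > 0, so its longest proper prefix and suffix
    # must both occur in x; w itself may be absent.
    for prefix in f_k1:
        for c in alphabet:
            w = prefix + c
            suffix, infix = w[1:], w[1:-1]
            if f_k1[suffix] == 0:  # Counter returns 0 for missing keys
                continue
            expected = f_k1[prefix] * f_k1[suffix] / f_k2[infix]
            std = (f_k[w] - expected) / max(sqrt(expected), 1.0)
            if std <= rho:
                result[w] = std
    return result
```

For example, in the sequence formed by repeating `ABDBC` four times, `AB` and `BC` each occur four times while `ABC` never does, so `avoided_words("ABDBC" * 4, 3, -1.0)` reports `ABC` (and, by the same argument, `DBD`) as avoided.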

arXiv.org e-Print Archive

Crossref

King's Research Portal

Optimal Computation of Overabundant Words

Author: Almirantis Yannis
Charalampopoulos Panagiotis
Gao Jia
Iliopoulos Costas S.
Mohamed Manal
Pissis Solon P.
Polychronopoulos Dimitris
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 17th International Workshop on Algorithms in Bioinformatics (WABI 2017)
Publication date: 01/01/2017

The observed frequency of the longest proper prefix, the longest proper suffix, and the longest infix of a word w in a given sequence x can be used for classifying w as avoided or overabundant. The definitions used for the expectation and deviation of w in this statistical model were described and biologically justified by Brendel et al. (J Biomol Struct Dyn 1986). We have very recently introduced a time-optimal algorithm for computing all avoided words of a given sequence over an integer alphabet (Algorithms Mol Biol 2017). In this article, we extend this study by presenting an O(n)-time and O(n)-space algorithm for computing all overabundant words in a sequence x of length n over an integer alphabet. Our main result is based on a new non-trivial combinatorial property of the suffix tree T of x: the number of distinct factors of x whose longest infix is the label of an explicit node of T is no more than 3n-4. We further show that the presented algorithm is time-optimal by proving that O(n) is a tight upper bound for the number of overabundant words. Finally, we present experimental results, using both synthetic and real data, which justify the effectiveness and efficiency of our approach in practical terms

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

On avoided words, absent words, and their application to biological sequence analysis

Publication venue: BioMed Central

Springer - Publisher Connector

R-enum: Enumeration of Characteristic Substrings in BWT-runs Bounded Space

Author: Nishimoto Takaaki
Tabei Yasuo
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)
Publication date: 01/01/2021

Enumerating characteristic substrings (e.g., maximal repeats, minimal unique substrings, and minimal absent words) in a given string has been an important research topic because there are a wide variety of applications in various areas such as string processing and computational biology. Although several enumeration algorithms for characteristic substrings have been proposed, they are not space-efficient in that their space-usage is proportional to the length of an input string. Recently, the run-length encoded Burrows-Wheeler transform (RLBWT) has attracted increased attention in string processing, and various algorithms for the RLBWT have been developed. Developing enumeration algorithms for characteristic substrings with the RLBWT, however, remains a challenge. In this paper, we present r-enum (RLBWT-based enumeration), the first enumeration algorithm for characteristic substrings based on RLBWT. R-enum runs in O(n log log (n/r)) time and with O(r log n) bits of working space for string length n and number r of runs in RLBWT. Here, r is expected to be significantly smaller than n for highly repetitive strings (i.e., strings with many repetitions). Experiments using a benchmark dataset of highly repetitive strings show that the results of r-enum are more space-efficient than the previous results. In addition, we demonstrate the applicability of r-enum to a huge string by performing experiments on a 300-gigabyte string of 100 human genomes
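As a small illustration of why the number of runs r matters, the sketch below builds a BWT by sorting rotations and run-length encodes it. It is a quadratic toy construction, not r-enum and not how an RLBWT is built in practice; the `$` sentinel and both function names are assumptions made for this example.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform via sorted rotations (quadratic; illustration only)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def run_length_encode(s):
    """Collapse s into a list of (symbol, run length) pairs."""
    runs = []
    for c in s:
        if runs and runs[-1][0] == c:
            runs[-1] = (c, runs[-1][1] + 1)
        else:
            runs.append((c, 1))
    return runs
```

On a highly repetitive input such as `"ab" * 8`, the transform collapses into three runs even though the string (with sentinel) has 17 symbols, which is the kind of gap between r and n that the abstract refers to.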

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Tilatiivis toteutus tiedon tiivistämiseen osamerkkijonoja luettelemalla [A succinct implementation of data compression by substring enumeration]

Author: Hankalahti Aleksi
Publication venue: Helsingfors universitet
Publication date: 01/01/2020

In lossless data compression, a compressed representation is created from the given data that takes as little space as possible relative to the original data. It must be possible to restore an identical copy of the original data from the compressed representation. This thesis deals with a lossless compression method that examines the data to be compressed, i.e. a string or text, as a whole rather than, for example, a small part at a time. The method transmits to the decompressor the numbers of occurrences of substrings in the text. The substrings are processed in a predetermined order from shortest to longest, so that both parties can attach each occurrence count to the correct substring. Some occurrence counts may be zero, indicating that the substring does not occur in the text. Compression is achieved by observing that previously transmitted substrings constrain what the longer strings can be. Some of the occurrence counts can then be left untransmitted, or transmitted using less space. The substrings whose occurrence counts must be transmitted are characterised using the notion of maximality. Finding maximal substrings and computing the occurrence counts of substrings from the bare text is slow. The text must therefore be stored in a data structure that supports the required operations and makes them fast. Such data structures take more space than the bare text. Because the compression method under study processes the entire text as a whole, the efficiency of memory usage becomes especially important. In the thesis, the compression method is implemented using a succinct data structure called the bidirectional BWT index. Succinct data structures take only slightly more space than the data stored in them. Nevertheless, they implement operations on the stored data efficiently. Experiments performed on the implementation show that memory usage remains reasonable, which makes it possible to compress larger amounts of data as well.
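The way earlier counts constrain later ones can be sketched as follows. The abstract gives no formulas, so this example assumes a binary alphabet and a circular text, where every occurrence of a string extends by exactly one symbol on each side; under those assumptions the count of `0w0` is bounded by counts of shorter strings. The function names are hypothetical, and this is not the thesis implementation, which uses a bidirectional BWT index.

```python
def circular_count(text, w):
    """Occurrences of w in text read as a circular string (assumes len(w) <= len(text))."""
    n = len(text)
    # Append a prefix so that matches wrapping around the end are found.
    doubled = text + text[:max(len(w) - 1, 0)]
    return sum(1 for i in range(n) if doubled.startswith(w, i))

def count_bounds(text, w):
    """Range that C(0w0) must fall in, given the counts of 0w, w0 and w1."""
    c_0w = circular_count(text, "0" + w)
    c_w0 = circular_count(text, w + "0")
    c_w1 = circular_count(text, w + "1")
    # C(0w0) <= C(0w) and C(0w0) <= C(w0); also C(0w1) = C(0w) - C(0w0) <= C(w1).
    low = max(0, c_0w - c_w1)
    high = min(c_0w, c_w0)
    return low, high
```

When `low == high` the decoder can infer the count on its own, so nothing has to be sent for that substring; a narrow range means the count can be encoded in fewer bits.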

Helsingin yliopiston digitaalinen arkisto