Fast Scalable Construction of (Minimal Perfect Hash) Functions
Recent advances in random linear systems on finite fields have paved the way
for the construction of constant-time data structures representing static
functions and minimal perfect hash functions that use less space than
existing techniques. The main obstruction to any practical application of
these results is the cubic-time Gaussian elimination required to solve these
linear systems: although they can be made very small, the computation is still
too slow to be feasible.
In this paper we describe in detail a number of heuristics and programming
techniques to speed up the resolution of these systems by several orders of
magnitude, making the overall construction competitive with the standard and
widely used MWHC technique, which is based on hypergraph peeling. In
particular, we introduce broadword programming techniques for fast equation
manipulation and a lazy Gaussian elimination algorithm. We also describe a
number of technical improvements to the data structure which further reduce
space usage and improve lookup speed.
Our implementation of these techniques yields a minimal perfect hash function
data structure occupying 2.24 bits per element, compared to 2.68 for MWHC-based
ones, and a static function data structure which reduces the multiplicative
overhead from 1.23 to 1.03.
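A minimal sketch of the word-packed approach (my illustration, not the authors' implementation): each equation over GF(2) is stored as a bit set, so adding one equation to another is a single XOR, which is the essence of broadword equation manipulation.

```python
# Sketch of word-packed Gaussian elimination over GF(2) (illustrative,
# not the paper's code). Each equation is a Python int used as a bit set,
# so adding one equation to another is a single XOR -- the broadword idea.

def solve_gf2(rows, consts, n_vars):
    """Solve the system rows[i] . x = consts[i] over GF(2).

    rows[i]  : int whose bit j is the coefficient of variable x_j
    consts[i]: right-hand side bit of equation i
    Returns one solution as a list of 0/1, or None if inconsistent.
    """
    rows, consts = list(rows), list(consts)
    pivots = []                              # (pivot variable, row index)
    for i in range(len(rows)):
        for var, r in pivots:                # eliminate known pivots
            if (rows[i] >> var) & 1:
                rows[i] ^= rows[r]           # one XOR adds a whole equation
                consts[i] ^= consts[r]
        if rows[i] == 0:
            if consts[i]:
                return None                  # 0 = 1: inconsistent system
            continue                         # redundant equation
        pivots.append((rows[i].bit_length() - 1, i))
    x = [0] * n_vars                         # free variables default to 0
    for var, r in reversed(pivots):          # back-substitution
        acc, rest = consts[r], rows[r] & ~(1 << var)
        while rest:
            low = rest & -rest               # isolate lowest set bit
            acc ^= x[low.bit_length() - 1]
            rest ^= low
        x[var] = acc
    return x

# x0 ^ x1 = 1, x1 ^ x2 = 1, x0 ^ x2 = 0  ->  e.g. x = [0, 1, 0]
print(solve_gf2([0b011, 0b110, 0b101], [1, 1, 0], 3))
```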
Space-efficient detection of unusual words
Detecting all the strings that occur in a text more frequently or less
frequently than expected according to an IID or a Markov model is a basic
problem in string mining, yet current algorithms are based on data structures
that are either space-inefficient or incur large slowdowns, and current
implementations cannot scale to genomes or metagenomes in practice. In this
paper we engineer an algorithm based on the suffix tree of a string to use just
a small data structure built on the Burrows-Wheeler transform, and a stack of
O(σ² log² n) bits, where n is the length of the string and σ is the size of
the alphabet. The size of the stack is o(n) except for very large values
of σ. We further improve the algorithm by removing its time dependency
on σ, by reporting only a subset of the maximal repeats and of the minimal
rare words of the string, and by detecting and scoring candidate
under-represented strings that do not occur in the string. Our
algorithms are practical and work directly on the BWT, thus they can be
immediately applied to a number of existing datasets that are available in this
form, returning this string mining problem to a manageable scale.
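For intuition about what "more or less frequent than expected" means, here is a toy scorer under an IID model (an illustration only; the paper scores maximal repeats and minimal rare words directly on the BWT in small space). The z-score below is one standard choice.

```python
import math

# Toy over-/under-representation scoring under an IID model (illustration
# only; the paper's BWT-based algorithm avoids enumerating substrings of
# the text explicitly, which this sketch does not attempt).

def occurrences(text, w):
    """Count possibly-overlapping occurrences of w in text."""
    count, start = 0, 0
    while (i := text.find(w, start)) >= 0:
        count, start = count + 1, i + 1
    return count

def zscore_iid(text, w):
    """(observed - expected) / std for w, with characters drawn IID from
    the empirical symbol distribution of text; one standard score choice."""
    n, m = len(text), len(w)
    p = 1.0
    for c in w:
        p *= text.count(c) / n            # empirical symbol probability
    slots = n - m + 1                     # candidate starting positions
    expected = slots * p
    std = math.sqrt(slots * p * (1 - p))  # binomial approximation
    return (occurrences(text, w) - expected) / std if std > 0 else 0.0

print(zscore_iid("ACGT" * 100, "AAAA"))   # under-represented: z < 0
```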
Random input helps searching predecessors
A data structure problem consists of the finite sets: D of data, Q of queries, A of query answers, associated with a function f : D × Q → A. The data structure of file X is "static" ("dynamic") if we "do not" ("do") require quick updates as X changes. An important goal is to compactly encode a file X ∈ D, such that for each query y ∈ Q, the answer f(X, y) requires the minimum time to compute. This goal is trivial if the size of the data structure may be large, since it was shown that, for the most important queries in the literature, f(X, y) can then be computed in O(1) time for each query y ∈ Q. Hence, this goal becomes interesting to study as a trade-off between the "storage space" and the "query time", both measured as functions of the file size n = |X|. The ideal solution would be to use linear O(n) = O(|X|) space, while retaining a constant O(1) query time. However, if f(X, y) computes the static predecessor search (find the largest x ∈ X such that x ≤ y), then Ajtai [Ajt88] proved a negative result: using just n^O(1) = |X|^O(1) data space, it is not possible to evaluate f(X, y) in O(1) time for every y ∈ Q. The proof exhibited a bad distribution of data D, such that there exists a "difficult" query y* ∈ Q for which f(X, y*) requires ω(1) time. Essentially, [Ajt88] is an existential result, resolving the worst-case scenario. But [Ajt88] left open the question: do we typically, that is, with high probability (w.h.p.), encounter such "difficult" queries y ∈ Q, when assuming reasonable distributions with respect to (w.r.t.) queries and data? Below we make reasonable assumptions w.r.t. the distribution of the queries y ∈ Q, as well as w.r.t. the distribution of the data X ∈ D. In two interesting scenarios studied in the literature, we resolve the typical (w.h.p.) query time.
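A concrete example of random input helping predecessor search (my illustration, not the paper's construction) is interpolation search: on keys drawn uniformly at random it answers predecessor queries in O(log log n) expected time, versus O(log n) for binary search.

```python
# Interpolation search for the predecessor of y in a sorted array
# (illustrative sketch, not the paper's data structure). On uniformly
# random keys the expected number of probes is O(log log n).

def predecessor(xs, y):
    """Largest x in sorted list xs with x <= y, or None if no such x."""
    if not xs or y < xs[0]:
        return None
    lo, hi = 0, len(xs) - 1   # invariant: xs[lo] <= y, predecessor in xs[lo..hi]
    while hi - lo > 1:
        if xs[hi] <= y:
            return xs[hi]
        # xs[hi] > y: probe where y would sit if keys were uniformly spread.
        mid = lo + (y - xs[lo]) * (hi - lo) // (xs[hi] - xs[lo])
        mid = min(max(mid, lo + 1), hi - 1)   # stay strictly between lo and hi
        if xs[mid] <= y:
            lo = mid          # xs[mid] <= y keeps the invariant
        else:
            hi = mid - 1      # xs[mid] > y: predecessor is left of mid
    return xs[hi] if xs[hi] <= y else xs[lo]

xs = [2, 3, 5, 7, 11, 13, 17, 19]
print(predecessor(xs, 10))   # -> 7
```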
A Faster Implementation of Online Run-Length Burrows-Wheeler Transform
Run-length encoding of Burrows-Wheeler transformed strings, resulting in the
run-length BWT (RLBWT), is a powerful tool for processing highly repetitive
strings. We propose a new algorithm for online RLBWT working in run-compressed
space, which runs in O(n log r) time and O(r log n) bits of space, where n
is the length of the input string S received so far and r is the number of
runs in the BWT of the reversed S. We improve the state-of-the-art algorithm for
online RLBWT in terms of empirical construction time. Adopting the dynamic list
for maintaining a total order, we can replace rank queries in a dynamic wavelet
tree on a run-length compressed string by the direct comparison of labels in a
dynamic list. The empirical results for various benchmarks show the efficiency
of our algorithm, especially for highly repetitive strings.
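A minimal static sketch of the RLBWT representation (illustration only; the paper maintains the structure online and answers rank queries via label comparisons in a dynamic ordered list instead of a dynamic wavelet tree):

```python
from itertools import groupby

# Minimal static RLBWT sketch: store the BWT as (character, run length)
# pairs and answer rank by an O(r)-time scan over the runs. Illustrative
# only; the paper's dynamic structure is far more sophisticated.

class RLBWT:
    def __init__(self, bwt):
        self.runs = [(c, len(list(g))) for c, g in groupby(bwt)]

    def rank(self, c, i):
        """Occurrences of c in bwt[0:i]."""
        count, pos = 0, 0
        for ch, length in self.runs:
            if pos >= i:
                break
            if ch == c:
                count += min(length, i - pos)
            pos += length
        return count

rl = RLBWT("annb$aa")        # BWT of "banana$"
print(rl.runs)               # [('a', 1), ('n', 2), ('b', 1), ('$', 1), ('a', 2)]
print(rl.rank("a", 7))       # -> 3
```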
Minimal Forbidden Factors of Circular Words
Minimal forbidden factors are a useful tool for investigating properties of
words and languages. Two factorial languages are distinct if and only if they
have different (antifactorial) sets of minimal forbidden factors. There exist
algorithms for computing the minimal forbidden factors of a word, as well as of
a regular factorial language. Conversely, Crochemore et al. [IPL, 1998] gave an
algorithm that, given the trie recognizing a finite antifactorial language M,
computes a DFA recognizing the language whose set of minimal forbidden factors
is M. In the same paper, they showed that the obtained DFA is minimal if the
input trie recognizes the minimal forbidden factors of a single word. We
generalize this result to the case of a circular word. We discuss several
combinatorial properties of the minimal forbidden factors of a circular word.
As a byproduct, we obtain a formal definition of the factor automaton of a
circular word. Finally, we investigate the case of minimal forbidden factors of
the circular Fibonacci words.
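The definition is easy to state in code: w is a minimal forbidden factor of s if w does not occur in s but both of its maximal proper factors do. A brute-force sketch follows (illustrative only; the algorithms mentioned above run in linear time via tries and automata):

```python
# Brute-force minimal forbidden factors (minimal absent words). Quadratic
# space and worse in time, purely to pin down the definition.

def minimal_forbidden_factors(s, alphabet=None):
    alphabet = sorted(set(s)) if alphabet is None else alphabet
    factors = {s[i:j] for i in range(len(s) + 1)
                      for j in range(i, len(s) + 1)}
    mff = set()
    # w = a + u + b is minimal forbidden iff w is absent from s while both
    # a + u and u + b occur in s (u may be empty).
    for u in factors:
        for a in alphabet:
            for b in alphabet:
                w = a + u + b
                if w not in factors and a + u in factors and u + b in factors:
                    mff.add(w)
    return sorted(mff)

print(minimal_forbidden_factors("abaab"))   # ['aaa', 'aaba', 'bab', 'bb']
```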
RLZAP: Relative Lempel-Ziv with Adaptive Pointers
Relative Lempel-Ziv (RLZ) is a popular algorithm for compressing databases of
genomes from individuals of the same species when fast random access is
desired. With Kuruppu et al.'s (SPIRE 2010) original implementation, a
reference genome is selected and then the other genomes are greedily parsed
into phrases exactly matching substrings of the reference. Deorowicz and
Grabowski (Bioinformatics, 2011) pointed out that letting each phrase end with
a mismatch character usually gives better compression because many of the
differences between individuals' genomes are single-nucleotide substitutions.
Ferrada et al. (SPIRE 2014) then pointed out that also using relative pointers
and run-length compressing them usually gives even better compression. In this
paper we generalize Ferrada et al.'s idea to also handle short insertions,
deletions and multi-character substitutions. We show experimentally that our
generalization achieves better compression than Ferrada et al.'s implementation,
with comparable random-access times.
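To ground the terminology, here is a toy greedy RLZ parser with mismatch characters in the style of Deorowicz and Grabowski (a sketch of the scheme RLZAP builds on, not the RLZAP implementation itself; the pointers here are absolute, whereas RLZAP makes them relative and adaptive):

```python
# Toy greedy RLZ parse with mismatch characters. Each phrase is
# (ref_pos, length, mismatch_char): the longest prefix of the remaining
# input occurring in the reference, plus one literal character.

def rlz_parse(reference, target):
    phrases, i = [], 0
    while i < len(target):
        length, pos = 0, -1
        while i + length < len(target):
            p = reference.find(target[i:i + length + 1])
            if p < 0:
                break
            pos, length = p, length + 1
        mismatch = target[i + length] if i + length < len(target) else ""
        phrases.append((pos, length, mismatch))
        i += length + 1
    return phrases

def rlz_decode(reference, phrases):
    return "".join(reference[pos:pos + length] + mismatch
                   for pos, length, mismatch in phrases)

ref = "ACGTACGTGG"
tgt = "ACGTTCGTGG"      # one single-nucleotide substitution vs. ref
p = rlz_parse(ref, tgt)
assert rlz_decode(ref, p) == tgt
print(p)                # [(0, 4, 'T'), (5, 5, '')]
```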
Annotating Relationships between Multiple Mixed-media Digital Objects by Extending Annotea
Annotea provides an annotation protocol to support collaborative Semantic Web-based annotation of digital resources accessible through the Web. It provides a model whereby a user may attach supplementary information to a resource or part of a resource in the form of either a simple textual comment, a hyperlink to another web page, a local file, or a semantic tag extracted from a formal ontology and controlled vocabulary. Hence, annotations can be used to attach subjective notes, comments, rankings, queries or tags to enable semantic reasoning across web resources. More recently, tabbed browsers and specific annotation tools allow users to view several resources (e.g., images, video, audio, text, HTML, PDF) simultaneously in order to carry out side-by-side comparisons. In such scenarios, users frequently want to be able to create and annotate a link or relationship between two or more objects or between segments within those objects. For example, a user might want to create a link between a scene in an original film and the corresponding scene in a remake and attach an annotation to that link. Based on past experiences gained from implementing Annotea within different communities in order to enable knowledge capture, this paper describes and compares alternative ways in which the Annotea Schema may be extended for the purpose of annotating links between multiple resources (or segments of resources). It concludes by identifying and recommending an optimum approach which will enhance the power, flexibility and applicability of Annotea in many domains.
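To illustrate the kind of extension discussed, here is a sketch in RDF (via rdflib) of an Annotea-style annotation attached to a link between two film scenes. The Annotea namespace is real; the ex:Link vocabulary and all example URIs are hypothetical illustrations, not the schema extension the paper recommends.

```python
from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import RDF, DC

A = Namespace("http://www.w3.org/2000/10/annotation-ns#")  # real Annotea ns
EX = Namespace("http://example.org/ext#")                  # hypothetical

g = Graph()
ann = URIRef("http://example.org/annotation/1")
link = URIRef("http://example.org/link/1")

# Hypothetical link object relating a scene in the original film to the
# corresponding scene in the remake.
g.add((link, RDF.type, EX.Link))
g.add((link, EX.source, URIRef("http://example.org/film/original#scene12")))
g.add((link, EX.target, URIRef("http://example.org/film/remake#scene9")))

# Standard Annotea annotation attached to the link rather than to a page.
g.add((ann, RDF.type, A.Annotation))
g.add((ann, A.annotates, link))
g.add((ann, A.body, Literal("The remake restages this scene shot for shot.")))
g.add((ann, DC.creator, Literal("example user")))

print(g.serialize(format="turtle"))
```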
A framework for space-efficient string kernels
String kernels are typically used to compare genome-scale sequences whose
length makes alignment impractical, yet their computation is based on data
structures that are either space-inefficient, or incur large slowdowns. We show
that a number of exact string kernels, like the k-mer kernel, the substring
kernels, a number of length-weighted kernels, the minimal absent words kernel,
and kernels with Markovian corrections, can all be computed in O(nd) time and
in o(n) bits of space in addition to the input, using just a rangeDistinct
data structure on the Burrows-Wheeler transform of the
input strings, which takes O(d) time per element in its output. The same
bounds hold for a number of measures of compositional complexity based on
multiple values of k, like the k-mer profile and the k-th order empirical
entropy, and for calibrating the value of k using the data.
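As a baseline definition (a naive sketch, not the space-efficient framework of the paper), the k-mer kernel is the cosine-normalized inner product of k-mer count vectors:

```python
from collections import Counter
import math

# Naive k-mer kernel: cosine similarity of k-mer count vectors. This
# hash-map version uses O(n) words of space and only pins down the
# definition; the paper computes such kernels in small space on the BWT.

def kmer_counts(s, k):
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def kmer_kernel(s, t, k):
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    dot = sum(cs[w] * ct[w] for w in cs.keys() & ct.keys())
    norm = math.sqrt(sum(v * v for v in cs.values()) *
                     sum(v * v for v in ct.values()))
    return dot / norm if norm else 0.0

print(kmer_kernel("ACGTACGT", "ACGTTCGT", 3))   # near 1 for similar strings
```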
Composite repetition-aware data structures
In highly repetitive strings, like collections of genomes from the same
species, distinct measures of repetition all grow sublinearly in the length of
the text, and indexes targeted to such strings typically depend only on one of
these measures. We describe two data structures whose size depends on multiple
measures of repetition at once, and that provide competitive tradeoffs between
the time for counting and reporting all the exact occurrences of a pattern, and
the space taken by the structure. The key component of our constructions is the
run-length encoded BWT (RLBWT), which takes space proportional to the number of
BWT runs: rather than augmenting RLBWT with suffix array samples, we combine it
with data structures from LZ77 indexes, which take space proportional to the
number of LZ77 factors, and with the compact directed acyclic word graph
(CDAWG), which takes space proportional to the number of extensions of maximal
repeats. The combination of CDAWG and RLBWT also enables a new representation
of the suffix tree, whose size depends again on the number of extensions of
maximal repeats, and that is powerful enough to support matching statistics and
constant-space traversal.
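To make the measures concrete, here is a naive sketch computing two of them for a short string: r, the number of runs in the BWT, and z, the number of LZ77 factors (illustration only; repetition-aware indexes never compute them this way):

```python
# Naive computation of two repetition measures: r (BWT runs) and
# z (LZ77 factors). Quadratic time, purely for illustration.

def bwt(s):
    s += "$"                                  # unique terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def bwt_runs(s):
    b = bwt(s)
    return 1 + sum(1 for i in range(1, len(b)) if b[i] != b[i - 1])

def lz77_factors(s):
    z, i = 0, 0
    while i < len(s):
        # Longest prefix of s[i:] fully contained in s[:i] (self-referential
        # overlapping sources omitted for simplicity).
        length = 0
        while i + length < len(s) and s[:i].find(s[i:i + length + 1]) >= 0:
            length += 1
        i += max(length, 1)                   # literal character if no match
        z += 1
    return z

s = "abaababaabaababa"
print(bwt_runs(s), lz77_factors(s))
```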