Fast Scalable Construction of (Minimal Perfect Hash) Functions
Recent advances in random linear systems on finite fields have paved the way
for the construction of constant-time data structures representing static
functions and minimal perfect hash functions that use less space than
existing techniques. The main obstruction to any practical application of
these results is the cubic-time Gaussian elimination required to solve these
linear systems: although the systems can be made very small, the computation is
still too slow to be feasible.
In this paper we describe in detail a number of heuristics and programming
techniques to speed up the resolution of these systems by several orders of
magnitude, making the overall construction competitive with the standard and
widely used MWHC technique, which is based on hypergraph peeling. In
particular, we introduce broadword programming techniques for fast equation
manipulation and a lazy Gaussian elimination algorithm. We also describe a
number of technical improvements to the data structure which further reduce
space usage and improve lookup speed.
Our implementation of these techniques yields a minimal perfect hash function
data structure occupying 2.24 bits per element, compared to 2.68 for MWHC-based
ones, and a static function data structure which reduces the multiplicative
overhead from 1.23 to 1.03.
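The broadword idea is easy to sketch: if each equation over GF(2) is stored as a bitmask of coefficients plus a constant bit, adding one equation to another is a single XOR rather than a coefficient-by-coefficient loop. Below is a minimal sketch of this representation using Python integers as bit vectors; it is plain Gaussian elimination, not the paper's lazy variant, and all names are ours.

```python
def solve_gf2(equations, num_vars):
    """Gaussian elimination over GF(2) with equations packed into integers.

    Each equation is a pair (mask, rhs): bit i of `mask` is the coefficient
    of variable i, `rhs` is the constant term. Adding two equations is a
    single XOR on the mask and on the rhs (the broadword idea).
    Returns one solution as a list of bits, or None if inconsistent.
    """
    rows = list(equations)
    pivot_of = {}                       # variable index -> pivot row index
    for i in range(len(rows)):
        mask, rhs = rows[i]
        # Reduce against existing pivots, in creation order: one XOR each.
        for var, j in pivot_of.items():
            if (mask >> var) & 1:
                pm, pb = rows[j]
                mask ^= pm
                rhs ^= pb
        if mask == 0:
            if rhs:                     # 0 = 1: inconsistent system
                return None
            continue
        var = mask.bit_length() - 1     # highest remaining bit becomes pivot
        rows[i] = (mask, rhs)
        pivot_of[var] = i
    # Back-substitute in reverse creation order: each pivot row contains no
    # earlier pivot variables, so every other variable in it is fixed already.
    x = [0] * num_vars                  # free variables default to 0
    for var, j in reversed(list(pivot_of.items())):
        mask, rhs = rows[j]
        acc = rhs
        for v in range(num_vars):
            if v != var and (mask >> v) & 1:
                acc ^= x[v]
        x[var] = acc
    return x

# x0 ^ x1 = 1, x1 ^ x2 = 0, x0 ^ x2 = 1  ->  [0, 1, 1]
print(solve_gf2([(0b011, 1), (0b110, 0), (0b101, 1)], 3))
```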
Space-efficient detection of unusual words
Detecting all the strings that occur in a text more frequently or less
frequently than expected according to an IID or a Markov model is a basic
problem in string mining, yet current algorithms are based on data structures
that are either space-inefficient or incur large slowdowns, and current
implementations cannot scale to genomes or metagenomes in practice. In this
paper we engineer an algorithm based on the suffix tree of a string to use just
a small data structure built on the Burrows-Wheeler transform, and a stack of
O(σ^2 log^2 n) bits, where n is the length of the string and σ is the size of
the alphabet. The size of the stack is o(n) except for very large values of σ.
We further improve the algorithm by removing its time dependency on σ, by
reporting only a subset of the maximal repeats and of the minimal rare words of
the string, and by detecting and scoring candidate under-represented strings
that do not occur in the string. Our
algorithms are practical and work directly on the BWT, thus they can be
immediately applied to a number of existing datasets that are available in this
form, returning this string mining problem to a manageable scale.
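As context for what "more or less frequently than expected" means: under an IID model the expected number of occurrences of a word w in a text of length n is (n - |w| + 1) · ∏ p(w[i]), and a word is flagged when its observed count deviates strongly from this. The sketch below shows only this scoring idea, not the paper's BWT-based algorithm; all names are ours.

```python
from collections import Counter
import math

def iid_score(text, word):
    """Z-like score of `word` under an IID model estimated from `text`.

    Expected count of w in a text of length n is (n - |w| + 1) * prod p(w[i]);
    the deviation of the observed count is normalised by the standard
    deviation of a binomial approximation.
    """
    n, k = len(text), len(word)
    freq = Counter(text)
    p = math.prod(freq[c] / n for c in word)   # IID occurrence probability
    trials = n - k + 1
    expected = trials * p
    observed = sum(text.startswith(word, i) for i in range(trials))
    sd = math.sqrt(trials * p * (1 - p))       # binomial approximation
    return (observed - expected) / sd if sd > 0 else 0.0

text = "ACGTACGTACGTAAAA"
for w in ("ACGT", "AAA", "CGC"):
    print(w, round(iid_score(text, w), 2))
```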
Random input helps searching predecessors
A data structure problem consists of three finite sets: D of data, Q of queries, and A of query answers, associated with a function f: D × Q → A. The data structure for a file X is "static" ("dynamic") if we "do not" ("do") require quick updates as X changes. An important goal is to compactly encode a file X ∈ D, such that for each query y ∈ Q, the answer f(X, y) can be computed in the minimum time. This goal is trivial if the allowed storage space is large, since it was shown that for each query y ∈ Q, f(X, y) can then be computed in O(1) time for the most important queries in the literature. Hence, this goal becomes interesting to study as a trade-off between the "storage space" and the "query time", both measured as functions of the file size n = |X|. The ideal solution would be to use linear O(n) = O(|X|) space, while retaining a constant O(1) query time. However, if f(X, y) computes the static predecessor search (find the largest x ∈ X such that x ≤ y), then Ajtai [Ajt88] proved a negative result: using just n^O(1) = |X|^O(1) data space, it is not possible to evaluate f(X, y) in O(1) time for all y ∈ Q. The proof exhibited a bad distribution of the data D, such that there exists a "difficult" query y* ∈ Q for which f(X, y*) requires ω(1) time. Essentially, [Ajt88] is an existential result resolving the worst-case scenario. But [Ajt88] left open the question: do we typically, that is, with high probability (w.h.p.), encounter such "difficult" queries y ∈ Q, when assuming reasonable distributions with respect to (w.r.t.) queries and data? Below we make reasonable assumptions w.r.t. the distribution of the queries y ∈ Q, as well as w.r.t. the distribution of the data X ∈ D. In two interesting scenarios studied in the literature, we resolve the typical (w.h.p.) query time.
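For concreteness, a static predecessor query over a sorted array takes O(log n) time by binary search; the question above is when this can typically drop further under distributional assumptions. A minimal sketch of the baseline (our own illustration, not the paper's construction):

```python
import bisect

def predecessor(sorted_xs, y):
    """Largest x in sorted_xs with x <= y, or None if no such x exists.

    Binary search gives O(log n) static predecessor queries; achieving O(1)
    with polynomial space is impossible in the worst case [Ajt88].
    """
    i = bisect.bisect_right(sorted_xs, y)
    return sorted_xs[i - 1] if i > 0 else None

X = sorted([3, 7, 19, 42, 64])
print(predecessor(X, 20))   # -> 19
print(predecessor(X, 2))    # -> None
```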
Computing the antiperiod(s) of a string
A string S[1, n] is a power (or repetition, or tandem repeat) of order k and period n/k if it can be decomposed into k consecutive identical blocks of length n/k. Powers and periods are fundamental structures in the study of strings, and algorithms to compute them efficiently have been widely studied. Recently, Fici et al. (Proc. ICALP 2016) introduced an antipower of order k as a string composed of k pairwise distinct blocks of the same length, n/k, called the antiperiod. An arbitrary string has antiperiod t if it is a prefix of an antipower with antiperiod t. In this paper, we describe an efficient algorithm for computing the smallest antiperiod of a string S of length n in O(n) time. We also describe an algorithm to compute all the antiperiods of S that runs in O(n log n) time.
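A naive check follows directly from the definition: split S into blocks of length t; all complete blocks must be pairwise distinct, and a trailing partial block (if any) must admit an extension to a full block distinct from all of them. The sketch below makes this concrete; it is our own quadratic-time reading of the definition, for illustration only, and far slower than the paper's algorithms.

```python
def is_antiperiod(s, t, sigma=4):
    """Check whether t is an antiperiod of s over an alphabet of size sigma.

    s has antiperiod t iff s is a prefix of an antipower with block length t:
    complete blocks are pairwise distinct, and a trailing partial block can
    be extended to a full block different from all complete ones.
    """
    blocks = [s[i:i + t] for i in range(0, len(s), t)]
    full = [b for b in blocks if len(b) == t]
    if len(set(full)) != len(full):            # repeated complete block
        return False
    if blocks and len(blocks[-1]) < t:         # trailing partial block p
        p = blocks[-1]
        clashes = sum(b.startswith(p) for b in full)
        # extensions of p to length t: sigma ** (t - len(p)) candidates
        return sigma ** (t - len(p)) > clashes
    return True

def smallest_antiperiod(s, sigma=4):
    return next(t for t in range(1, len(s) + 1) if is_antiperiod(s, t, sigma))

print(smallest_antiperiod("ACGTAC"))   # -> 3 (blocks ACG and TAC are distinct)
```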
A Faster Implementation of Online Run-Length Burrows-Wheeler Transform
Run-length encoding of Burrows-Wheeler transformed strings, yielding the
Run-Length BWT (RLBWT), is a powerful tool for processing highly repetitive
strings. We propose a new algorithm for online RLBWT working in run-compressed
space, which runs in O(n log r) time and O(r log n) bits of space, where n is
the length of the input string S received so far and r is the number of runs
in the BWT of the reversed S. We improve on the state-of-the-art algorithm for
online RLBWT in terms of empirical construction time. By adopting a dynamic
list for maintaining a total order, we can replace rank queries on a dynamic
wavelet tree over a run-length compressed string with direct comparisons of
labels in the dynamic list. Empirical results on various benchmarks show the
efficiency of our algorithm, especially for highly repetitive strings.
Comment: In Proc. IWOCA201
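To make the object concrete: the BWT permutes a string so that equal characters cluster into long runs when the input is repetitive, and the RLBWT stores only the r (character, run length) pairs instead of all n symbols. A minimal offline sketch via rotation sorting, with our own naming; it is nothing like the paper's online algorithm.

```python
from itertools import groupby

def bwt(s, sentinel="$"):
    """Burrows-Wheeler transform via naive rotation sorting (O(n^2 log n))."""
    s += sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def rlbwt(s):
    """Run-Length BWT: the BWT stored as (character, run length) pairs."""
    return [(c, len(list(g))) for c, g in groupby(bwt(s))]

s = "banana" * 4
runs = rlbwt(s)
print(runs)
print(f"n = {len(s) + 1}, r = {len(runs)} runs")  # r << n for repetitive s
```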
Optimal Computation of Avoided Words
The deviation of the observed frequency of a word w from its expected
frequency in a given sequence x is used to determine whether or not the word
is avoided. This concept is particularly useful in DNA linguistic analysis. The
value of the standard deviation of w, denoted by std(w), effectively
characterises the extent of a word by its edge contrast in the context in which
it occurs. A word w of length k > 2 is a ρ-avoided word in x if
std(w) ≤ ρ, for a given threshold ρ < 0. Notice that such a word
may be completely absent from x. Hence computing all such words naïvely
can be a very time-consuming procedure, in particular for large k. In this
article, we propose an O(n)-time and O(n)-space algorithm to compute all
ρ-avoided words of length k in a given sequence x of length n over a
fixed-sized alphabet. We also present a time-optimal O(σn)-time and
O(σn)-space algorithm to compute all ρ-avoided words (of any
length) in a sequence of length n over an alphabet of size σ.
Furthermore, we provide a tight asymptotic upper bound for the number of
ρ-avoided words and the expected length of the longest one. We make
available an open-source implementation of our algorithm. Experimental results,
using both real and synthetic data, show the efficiency of our implementation.
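The expected frequency in this line of work is typically estimated with a maximal-order Markov model: E(w) = f(w[1..k-1]) · f(w[2..k]) / f(w[2..k-1]), where f denotes observed counts. The sketch below scores every occurring word of a fixed length this way; the normalisation we use (√E(w), floored at 1) is one common choice, and, like all names here, an assumption of ours rather than the paper's exact definition.

```python
from collections import Counter
from math import sqrt

def count_factors(x, k):
    """Observed counts of all length-k factors of x."""
    return Counter(x[i:i + k] for i in range(len(x) - k + 1))

def avoided_words(x, k, rho):
    """Naive O(n*k) scoring of all length-k words occurring in x (k > 2).

    Expected count under a maximal-order Markov model:
        E(w) = f(w[:-1]) * f(w[1:]) / f(w[1:-1])
    A word is reported as rho-avoided when std(w) <= rho (rho < 0).
    Truly absent words (f(w) = 0 but E(w) > 0) are not enumerated here.
    """
    fk, fk1, fk2 = (count_factors(x, j) for j in (k, k - 1, k - 2))
    out = []
    for w, obs in fk.items():
        exp = fk1[w[:-1]] * fk1[w[1:]] / fk2[w[1:-1]]
        std = (obs - exp) / max(sqrt(exp), 1.0)
        if std <= rho:
            out.append((w, round(std, 3)))
    return out

print(avoided_words("CTAGTAGTAGTAGTAA", 3, rho=-0.1))
```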
Capacity Upper Bounds for Deletion-Type Channels
We develop a systematic approach, based on convex programming and real
analysis, for obtaining upper bounds on the capacity of the binary deletion
channel and, more generally, channels with i.i.d. insertions and deletions.
Other than the classical deletion channel, we give special attention to the
Poisson-repeat channel introduced by Mitzenmacher and Drinea (IEEE Transactions
on Information Theory, 2006). Our framework can be applied to obtain capacity
upper bounds for any repetition distribution (the deletion and Poisson-repeat
channels corresponding to the special cases of Bernoulli and Poisson
distributions). Our techniques essentially reduce the task of proving capacity
upper bounds to maximizing a univariate, real-valued, and often concave
function over a bounded interval. We show the following:
1. The capacity of the binary deletion channel with deletion probability d
is at most (1 - d) log φ for d ≥ 1/2 and, assuming the capacity
function is convex, is at most 1 - d log(4/φ) for d < 1/2, where
φ = (1 + √5)/2 is the golden ratio. This is the first nontrivial
capacity upper bound for any value of d outside the limiting case d → 0
that is fully explicit and proved without computer assistance.
2. We derive the first set of capacity upper bounds for the Poisson-repeat
channel.
3. We derive several novel upper bounds on the capacity of the deletion
channel. All upper bounds are maximums of efficiently computable, and concave,
univariate real functions over a bounded domain. In turn, we upper bound these
functions in terms of explicit elementary and standard special functions, whose
maximums can be found even more efficiently (and sometimes, analytically, for
example for d = 1/2).
Along the way, we develop several new techniques of potentially independent
interest in information theory, probability, and mathematical analysis.
Comment: Minor edits. In Proceedings of the 50th Annual ACM SIGACT Symposium
on the Theory of Computing (STOC), 2018.
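The reduction's computational core, maximizing a concave univariate function over a bounded interval, is cheap: concavity lets ternary search converge without derivatives. A small generic sketch, our own illustration and not tied to the paper's specific functions:

```python
import math

def maximize_concave(f, lo, hi, eps=1e-12):
    """Maximize a concave function f on [lo, hi] by ternary search.

    Concavity guarantees a single peak, so discarding a third of the
    interval per iteration converges to the maximizer.
    """
    while hi - lo > eps:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    x = (lo + hi) / 2
    return x, f(x)

# Example: binary entropy H(p) = -p log2 p - (1-p) log2 (1-p),
# concave on (0, 1) with its maximum H(1/2) = 1.
H = lambda p: -p * math.log2(p) - (1 - p) * math.log2(1 - p)
print(maximize_concave(H, 1e-9, 1 - 1e-9))  # ~ (0.5, 1.0)
```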
Annotating Relationships between Multiple Mixed-media Digital Objects by Extending Annotea
Annotea provides an annotation protocol to support collaborative Semantic Web-based annotation of digital resources accessible through the Web. It provides a model whereby a user may attach supplementary information to a resource or part of a resource in the form of: either a simple textual comment; a hyperlink to another web page; a local file; or a semantic tag extracted from a formal ontology and controlled vocabulary. Hence, annotations can be used to attach subjective notes, comments, rankings, queries or tags to enable semantic reasoning across web resources. More recently, tabbed browsers and specific annotation tools allow users to view several resources (e.g., images, video, audio, text, HTML, PDF) simultaneously in order to carry out side-by-side comparisons. In such scenarios, users frequently want to be able to create and annotate a link or relationship between two or more objects, or between segments within those objects. For example, a user might want to create a link between a scene in an original film and the corresponding scene in a remake, and attach an annotation to that link. Based on past experience gained from implementing Annotea within different communities in order to enable knowledge capture, this paper describes and compares alternative ways in which the Annotea Schema may be extended for the purpose of annotating links between multiple resources (or segments of resources). It concludes by identifying and recommending an optimal approach which will enhance the power, flexibility and applicability of Annotea in many domains.
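To ground the data model: an Annotea annotation is RDF that types a node as an Annotation and points it at the resource it annotates and at a body. Below is a hedged sketch of the kind of extension the paper discusses, built with rdflib; the relationship property and all example URIs are invented for illustration and are not the paper's recommended schema.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

ANNO = Namespace("http://www.w3.org/2000/10/annotation-ns#")
EX = Namespace("http://example.org/")          # hypothetical extension vocabulary

g = Graph()
g.bind("a", ANNO)
g.bind("ex", EX)

ann = URIRef("http://example.org/annotations/1")
scene_original = URIRef("http://example.org/films/original#scene12")
scene_remake = URIRef("http://example.org/films/remake#scene9")

g.add((ann, RDF.type, ANNO.Annotation))
# Standard Annotea: the annotation annotates a resource and carries a body.
g.add((ann, ANNO.annotates, scene_original))
g.add((ann, ANNO.body, Literal("The remake restages this scene shot-for-shot.")))
# Invented extension property: relate the annotated segment to a second resource.
g.add((ann, EX.relatedSegment, scene_remake))

print(g.serialize(format="turtle"))
```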
Minimal Forbidden Factors of Circular Words
Minimal forbidden factors are a useful tool for investigating properties of
words and languages. Two factorial languages are distinct if and only if they
have different (antifactorial) sets of minimal forbidden factors. There exist
algorithms for computing the minimal forbidden factors of a word, as well as of
a regular factorial language. Conversely, Crochemore et al. [IPL, 1998] gave an
algorithm that, given the trie recognizing a finite antifactorial language M,
computes a DFA recognizing the language whose set of minimal forbidden factors
is M. In the same paper, they showed that the obtained DFA is minimal if the
input trie recognizes the minimal forbidden factors of a single word. We
generalize this result to the case of a circular word. We discuss several
combinatorial properties of the minimal forbidden factors of a circular word.
As a byproduct, we obtain a formal definition of the factor automaton of a
circular word. Finally, we investigate the case of minimal forbidden factors of
the circular Fibonacci words.
Comment: To appear in Theoretical Computer Science.
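By definition, a word w = aub (with a, b single letters) is a minimal forbidden factor of x when w does not occur in x but both au and ub do. A naive enumeration follows immediately from this characterisation; the sketch below is our own illustration for short examples, not the Crochemore et al. automaton construction.

```python
def factors(x):
    """All factors (substrings) of x, including the empty word."""
    return {x[i:j] for i in range(len(x) + 1) for j in range(i, len(x) + 1)}

def minimal_forbidden_factors(x):
    """Naive minimal forbidden factors of x: words aub (a, b letters)
    such that au and ub occur in x but aub does not."""
    F = factors(x)
    alphabet = sorted(set(x))
    return sorted(
        a + u + b
        for u in F
        for a in alphabet
        for b in alphabet
        if a + u in F and u + b in F and a + u + b not in F
    )

# For a circular word, one convention is to apply the same characterisation
# to the factors of xx of length at most |x|.
print(minimal_forbidden_factors("abaab"))  # -> ['aaa', 'aaba', 'bab', 'bb']
```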