Search CORE

7 research outputs found

Document retrieval hacks

Author: Puglisi Simon J.
Zhukova Bella
Publication venue: Schloss Dagstuhl - Leibniz-Zentrum für Informatik GmbH, Dagstuhl Publishing
Publication date: 01/01/2021

Publisher Copyright: © Simon J. Puglisi and Bella Zhukova; licensed under Creative Commons License CC-BY 4.0. 19th International Symposium on Experimental Algorithms (SEA 2021).

Given a collection of strings, document listing refers to the problem of finding all the strings (or documents) where a given query string (or pattern) appears. Index data structures that support efficient document listing for string collections have been the focus of intense research in the last decade, with dozens of papers published describing exotic and elegant compressed data structures. The problem is now quite well understood in theory and many of the solutions have been implemented and evaluated experimentally. A particular recent focus has been on highly repetitive document collections, which have become prevalent in many areas (such as version control systems and genomics, to name just two very different sources). The aim of this paper is to describe simple and efficient document listing algorithms that can be used in combination with more sophisticated techniques, or as baselines against which the performance of new document listing indexes can be measured. Our approaches are based on simple combinations of scanning and hashing, which we show to combine very well with dictionary compression to achieve small space usage. Our experiments show these methods to be often much faster and less space consuming than the best specialized indexes for the problem.

Peer reviewed
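To make the document listing problem concrete, the sketch below combines a hash table with plain scanning. It is only an illustration of the problem statement, not the algorithms from the paper: the k-gram filter, the function names `build_kgram_index` and `list_documents`, and the parameter `k` are all assumptions introduced here.

```python
from collections import defaultdict


def build_kgram_index(docs, k):
    """Hash every length-k substring to the set of documents containing it.

    Hypothetical helper: the k-gram filter is an illustrative choice,
    not the method described in the abstract.
    """
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for i in range(len(text) - k + 1):
            index[text[i:i + k]].add(doc_id)
    return index


def list_documents(docs, index, k, pattern):
    """Return the sorted ids of all documents in which pattern occurs."""
    if len(pattern) < k:
        # Pattern too short for the hash filter: scan every document.
        candidates = range(len(docs))
    else:
        # Hash lookup narrows the candidates; scanning then verifies them.
        candidates = index.get(pattern[:k], ())
    return sorted(d for d in candidates if pattern in docs[d])


docs = ["banana", "bandana", "cabana", "apple"]
index = build_kgram_index(docs, 2)
print(list_documents(docs, index, 2, "ana"))
```

The hash lookup can only rule documents out; the final `pattern in docs[d]` scan is what guarantees that every reported document really contains the pattern.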

Dagstuhl Research Online Publication Server

Helsingin yliopiston digitaalinen arkisto

Tight Upper and Lower Bounds on Suffix Tree Breadth

Author: Badkobeh Golnaz
Gawrychowski Pawel
Kärkkäinen Juha
Puglisi Simon
Zhukova Bella
Publication venue
Publication date: 01/01/2021

The suffix tree, the compacted trie of all the suffixes of a string, is the most important and widely-used data structure in string processing. We consider a natural combinatorial question about suffix trees: for a string $S$ of length $n$, how many nodes $\nu_S(d)$ can there be at (string) depth $d$ in its suffix tree? We prove $\nu(n, d) = \max_{S \in \Sigma^n} \nu_S(d)$ is $O((n/d) \log(n/d))$, and show that this bound is asymptotically tight, describing strings for which $\nu_S(d)$ is $\Omega((n/d) \log(n/d))$.

© 2020 Elsevier B.V. All rights reserved.

Peer reviewed
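The quantity being bounded can be computed by brute force for small strings, without building a suffix tree. The sketch below rests on assumptions that are not stated in the abstract: the string is terminated by a unique sentinel `$`, both internal nodes and leaves are counted, and the function name `nodes_at_depth` is hypothetical. Under those assumptions, a length-d substring is an internal node exactly when it is followed by at least two distinct symbols, and a leaf exactly when it is a suffix of the terminated string.

```python
from collections import defaultdict


def nodes_at_depth(s, d, sentinel="$"):
    """Count suffix tree nodes of s + sentinel at string depth d (brute force)."""
    t = s + sentinel
    if d == 0:
        return 1  # the root is the only node at depth 0
    followers = defaultdict(set)
    for i in range(len(t) - d + 1):
        w = t[i:i + d]
        # None marks "w ends the string", i.e. w is a suffix of t.
        nxt = t[i + d] if i + d < len(t) else None
        followers[w].add(nxt)
    count = 0
    for nxt in followers.values():
        if None in nxt:
            count += 1  # leaf: a suffix ending in the unique sentinel
        elif len(nxt) >= 2:
            count += 1  # internal node: right-branching substring
    return count


print([nodes_at_depth("abab", d) for d in range(6)])
```

For "abab" the per-depth counts sum to 8, matching the 5 leaves, 2 internal nodes and root of the suffix tree of "abab$".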

Goldsmiths Research Online

Helsingin yliopiston digitaalinen arkisto

On Suffix Tree Breadth

Author: Badkobeh Golnaz
Kärkkäinen Juha
Puglisi Simon
Zhukova Bella
Publication venue: Springer Science and Business Media LLC
Publication date: 06/09/2017

The suffix tree, the compacted trie of all the suffixes of a string, is the most important and widely-used data structure in string processing. We consider a natural combinatorial question about suffix trees: for a string $S$ of length $n$, how many nodes $\nu_S(d)$ can there be at (string) depth $d$ in its suffix tree? We prove $\nu(n, d) = \max_{S \in \Sigma^n} \nu_S(d)$ is $O((n/d) \log n)$, and show that this bound is almost tight, describing strings for which $\nu_S(d)$ is $\Omega((n/d) \log(n/d))$.

Goldsmiths Research Online

Crossref

Smaller RLZ-Compressed Suffix Arrays

Author: Puglisi Simon J.
Zhukova Bella
Publication venue: IEEE
Publication date: 01/01/2021

Recently it was shown (Puglisi and Zhukova, Proc. SPIRE, 2020) that the suffix array (SA) data structure can be effectively compressed with relative Lempel-Ziv (RLZ) dictionary compression in such a way that arbitrary subarrays can be rapidly decompressed, thus facilitating compressed indexing. In this paper we describe optimizations to RLZ-compressed SAs, including generation of more effective dictionaries and compact encodings of index components, both of which reduce index size without adversely affecting subarray access speeds relative to other compressed indexes. Our experimental analysis also elucidates the relationship between subarray size and per element access time.

Peer reviewed
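As background on the general idea of relative Lempel-Ziv parsing, the sketch below greedily factorizes a sequence into phrases copied from a fixed dictionary (reference) sequence. It is a naive quadratic-time illustration only: how the paper preprocesses the suffix array, builds its dictionaries, or encodes phrases is not described in the abstract, and the names `rlz_parse` and `rlz_decode` and the literal encoding `(symbol, 0)` are assumptions made here.

```python
def rlz_parse(reference, seq):
    """Greedily parse seq into (position, length) copies from reference.

    A symbol that does not occur in the reference is emitted as the
    literal phrase (symbol, 0). Illustrative encoding, chosen here.
    """
    phrases = []
    i = 0
    while i < len(seq):
        best_pos, best_len = -1, 0
        for p in range(len(reference)):
            length = 0
            while (i + length < len(seq) and p + length < len(reference)
                   and reference[p + length] == seq[i + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = p, length
        if best_len == 0:
            phrases.append((seq[i], 0))  # literal symbol
            i += 1
        else:
            phrases.append((best_pos, best_len))
            i += best_len
    return phrases


def rlz_decode(reference, phrases):
    """Rebuild the original sequence from the phrases."""
    out = []
    for a, length in phrases:
        if length == 0:
            out.append(a)  # literal
        else:
            out.extend(reference[a:a + length])
    return out


reference = [1, 2, 3, 4]
seq = [1, 2, 3, 9, 2, 3, 4]
phrases = rlz_parse(reference, seq)
print(phrases)
print(rlz_decode(reference, phrases) == seq)
```

Because every phrase points into the fixed reference rather than into earlier output, any phrase can be decoded on its own, which is the property that makes fast access to subarrays plausible.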

Helsingin yliopiston digitaalinen arkisto

On Elias-Fano for Rank Queries in FM-Indexes

Author: Ma Danyang
Puglisi Simon J.
Raman Rajeev
Zhukova Bella
Publication venue: IEEE Computer Society
Publication date: 01/01/2021

We describe methods to support fast rank queries on the Burrows-Wheeler transform (BWT) string S of an input string T on alphabet $\Sigma$, in order to support pattern counting queries. Our starting point is an approach previously adopted by several authors, which is to represent S as $|\Sigma|$ bitvectors, where the bitvector for symbol c has a 1 at position i if and only if S[i] = c, with the bitvectors stored in Elias-Fano (EF) encodings, to enable binary rank queries. We first show that the clustering of symbols induced by the BWT makes standard implementations of EF unattractive. We then engineer several improvements to EF that go some way to alleviating this problem, and go on to describe two new EF-inspired bitvectors that have superior practical performance.

Peer reviewed
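The per-symbol bitvector layout and its use for pattern counting can be sketched as follows. This is a simplified stand-in under stated assumptions: each bitvector is kept as a plain sorted list of the positions of its 1s with binary search for rank, not as an actual Elias-Fano encoding, and the BWT is built naively from sorted rotations. All function names are hypothetical, and the text is assumed not to contain the sentinel `$` or any symbol that sorts before it.

```python
from bisect import bisect_left


def bwt(text, sentinel="$"):
    """Naive BWT via sorted rotations (fine for tiny inputs only)."""
    t = text + sentinel
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(rot[-1] for rot in rotations)


def build_positions(s):
    """One 'bitvector' per symbol c, stored as the sorted positions i with s[i] == c."""
    positions = {}
    for i, c in enumerate(s):
        positions.setdefault(c, []).append(i)
    return positions


def rank(positions, c, i):
    """Number of occurrences of c in s[0:i]."""
    return bisect_left(positions.get(c, []), i)


def count_occurrences(s, positions, pattern):
    """Count occurrences of pattern in the original text by backward search."""
    # C[c] = number of symbols in s that are strictly smaller than c.
    C, total = {}, 0
    for c in sorted(positions):
        C[c] = total
        total += len(positions[c])
    lo, hi = 0, len(s)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + rank(positions, c, lo)
        hi = C[c] + rank(positions, c, hi)
        if lo >= hi:
            return 0
    return hi - lo


s = bwt("banana")
positions = build_positions(s)
print(s, count_occurrences(s, positions, "ana"))
```

Each step of the backward search issues two rank queries on a single symbol's bitvector, which is why the speed of binary rank dominates counting time.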

Helsingin yliopiston digitaalinen arkisto

Document retrieval hacks

Author: Puglisi Simon J.
Zhukova Bella
Publication venue: Schloss Dagstuhl - Leibniz-Zentrum für Informatik GmbH, Dagstuhl Publishing
Publication date: 01/06/2021

Helsingin yliopiston digitaalinen arkisto

Hepatoprotective properties of taurine during carbon tetrachloride intoxication

Crossref