
94 research outputs found

Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching

Author: Grossi Roberto
Vitter Jeffrey Scott
Publication venue: 'Society for Industrial & Applied Mathematics (SIAM)'
Publication date: 16/03/2011

AMS subject classifications: 68W05, 68Q25, 68P05, 68P10, 68P30. DOI: 10.1137/S0097539702402354

The proliferation of online text, such as found on the World Wide Web and in online databases, motivates the need for space-efficient text indexing methods that support fast string searching. We model this scenario as follows: Consider a text T consisting of n symbols drawn from a fixed alphabet Σ. The text T can be represented in n lg |Σ| bits by encoding each symbol with lg |Σ| bits. The goal is to support fast online queries for searching any string pattern P of m symbols, with T being fully scanned only once, namely, when the index is created at preprocessing time. The text indexing schemes published in the literature are greedy in terms of space usage: they require Ω(n lg n) additional bits of space in the worst case. For example, in the standard unit cost RAM, suffix trees and suffix arrays need Ω(n) memory words, each of Ω(lg n) bits. These indexes are larger than the text itself by a multiplicative factor of Ω(lg_{|Σ|} n), which is significant when Σ is of constant size, such as in ASCII or Unicode. On the other hand, these indexes support fast searching, either in O(m lg |Σ|) time or in O(m + lg n) time, plus an output-sensitive cost O(occ) for listing the occ pattern occurrences. We present a new text index that is based upon compressed representations of suffix arrays and suffix trees. It achieves a fast O(m / lg_{|Σ|} n + lg^ε_{|Σ|} n) search time in the worst case, for any constant 0 < ε ≤ 1, using at most (ε^{-1} + O(1)) n lg |Σ| bits of storage. Our result thus presents for the first time an efficient index whose size is provably linear in the size of the text in the worst case, and for many scenarios, the space is actually sublinear in practice. As a concrete example, the compressed suffix array for a typical 100 MB ASCII file can require 30–40 MB or less, while the raw suffix array requires 500 MB. Our theoretical bounds improve both time and space of previous indexing schemes. Listing the pattern occurrences introduces a sublogarithmic slowdown factor in the output-sensitive cost, giving O(occ lg^ε_{|Σ|} n) time as a result. When the patterns are sufficiently long, we can use auxiliary data structures in O(n lg |Σ|) bits to obtain a total search bound of O(m / lg_{|Σ|} n + occ) time, which is optimal.
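For orientation, the uncompressed baseline that the abstract compares against can be sketched in a few lines. This is a plain suffix array with binary search, not the compressed index from the paper. The function names `build_suffix_array` and `find_occurrences` are illustrative assumptions, and the construction is a naive sort chosen for brevity.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in sorted order.

    Naive construction by sorting the suffixes themselves; adequate for a
    small illustration, not for large inputs.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return all start positions of pattern in text, in increasing order.

    Uses two binary searches over the suffix array, comparing the pattern
    with the first len(pattern) symbols of each suffix.
    """
    m = len(pattern)

    # First suffix whose m-symbol prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First suffix whose m-symbol prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])
```

The array holds n integers, which is the Ω(n lg n)-bit overhead the abstract describes; the compressed structures in the paper aim to avoid exactly that cost.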

KU ScholarWorks

A Method for Indexing Text Fragments to Organize Semantic Search over a Document Database

Author: Парамонов А. И.
Publication venue: Belarusian State Technological University
Publication date: 01/01/2022

The paper addresses the problem of indexing documents to support semantic search over a document database, with ranking of results and matching of documents.

Belarusian State University of Informatics and Radioelectronics Repository

Combined Data Structure for Previous- and Next-Smaller-Values

Author: Fischer Johannes
Publication date: 02/02/2011

Let A be a static array storing n elements from a totally ordered set. We present a data structure of optimal size at most $n\log_2(3+2\sqrt{2})+o(n)$ bits that allows us to answer the following queries on A in constant time, without accessing A: (1) previous smaller value queries, where given an index i, we wish to find the first index to the left of i where A is strictly smaller than at i, and (2) next smaller value queries, which search to the right of i. As an additional bonus, our data structure also allows us to answer a third kind of query: given indices i < j, find the position of the minimum in A[i..j]. Our data structure has direct consequences for the space-efficient storage of suffix trees. Comment: to appear in Theoretical Computer Science
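To make the two query types concrete, here is a minimal sketch that precomputes all previous and next smaller values with a stack in linear time. It is not the succinct structure from the paper: it stores one integer per element and reads A directly. The function names and the sentinel conventions (-1 for "no smaller value to the left", len(a) for "none to the right") are assumptions made for this illustration.

```python
def previous_smaller_values(a):
    """For each i, return the nearest index j < i with a[j] < a[i], or -1."""
    psv = [-1] * len(a)
    stack = []  # indices whose values are strictly increasing
    for i, x in enumerate(a):
        # Discard candidates that are not strictly smaller than a[i].
        while stack and a[stack[-1]] >= x:
            stack.pop()
        psv[i] = stack[-1] if stack else -1
        stack.append(i)
    return psv


def next_smaller_values(a):
    """For each i, return the nearest index j > i with a[j] < a[i], or len(a)."""
    n = len(a)
    nsv = [n] * n
    stack = []
    for i in range(n - 1, -1, -1):
        while stack and a[stack[-1]] >= a[i]:
            stack.pop()
        nsv[i] = stack[-1] if stack else n
        stack.append(i)
    return nsv
```

Each index is pushed and popped at most once, so both passes run in O(n) time. The result in the abstract is about answering the same queries in constant time from roughly 2.54n bits, which this sketch does not attempt.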

arXiv.org e-Print Archive

Elsevier - Publisher Connector

Entropy-scaling search of massive biological data

Author: Berger Bonnie
Daniels Noah M.
Danko David Christian
Yu Y. William
Publication venue: 'Elsevier BV'
Publication date: 01/06/2015

Many datasets exhibit a well-defined structure that can be exploited to design faster search tools, but it is not always clear when such acceleration is possible. Here, we introduce a framework for similarity search based on characterizing a dataset's entropy and fractal dimension. We prove that searching scales in time with metric entropy (number of covering hyperspheres), if the fractal dimension of the dataset is low, and scales in space with the sum of metric entropy and information-theoretic entropy (randomness of the data). Using these ideas, we present accelerated versions of standard tools, with no loss in specificity and little loss in sensitivity, for use in three domains: high-throughput drug screening (Ammolite, 150x speedup), metagenomics (MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search (esFragBag, 10x speedup of FragBag). Our framework can be used to achieve "compressive omics," and the general theory can be readily applied to data science problems outside of biology. Comment: Including supplement: 41 pages, 6 figures, 4 tables, 1 bo
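The "covering hyperspheres" idea can be illustrated with a two-stage range search: a coarse pass over cluster centers, then a fine pass inside the clusters that survive. The sketch below is an assumption-laden toy, not the authors' tools. It uses Euclidean distance on tuples, a greedy clustering, and the hypothetical names `build_clusters` and `range_search`.

```python
import math


def build_clusters(points, cluster_radius):
    """Greedily cover points with balls of the given radius.

    Returns a list of (center, members) pairs; each point joins the first
    existing center within cluster_radius, or becomes a new center.
    """
    clusters = []
    for p in points:
        for center, members in clusters:
            if math.dist(center, p) <= cluster_radius:
                members.append(p)
                break
        else:
            clusters.append((p, [p]))
    return clusters


def range_search(clusters, cluster_radius, query, radius):
    """Return every clustered point within radius of query."""
    hits = []
    for center, members in clusters:
        # Coarse step: by the triangle inequality, a cluster can only hold
        # a hit if its center lies within radius + cluster_radius.
        if math.dist(center, query) <= radius + cluster_radius:
            # Fine step: check the members of the surviving cluster.
            hits.extend(p for p in members if math.dist(p, query) <= radius)
    return hits
```

Because the coarse filter never discards a cluster that could contain a hit, this toy returns the same points as a brute-force scan. The savings come from skipping whole clusters, which is why the number of covering balls matters for search time.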

arXiv.org e-Print Archive

Elsevier - Publisher Connector

DSpace@MIT

Crossref

PubMed Central

The Rightmost Equal-Cost Position Problem

Author: Crochemore Maxime
Langiu Alessio
Mignosi Filippo
Publication date: 21/11/2012

LZ77-based compression schemes compress the input text by replacing factors in the text with an encoded reference to a previous occurrence formed by the couple (length, offset). For a given factor, the smaller the offset, the smaller the resulting compression ratio. This is optimally achieved by using the rightmost occurrence of a factor in the previous text. Given a cost function, for instance the minimum number of bits used to represent an integer, we define the Rightmost Equal-Cost Position (REP) problem as the problem of finding one of the occurrences of a factor whose cost is equal to the cost of the rightmost one. We present the Multi-Layer Suffix Tree data structure that, for a text of length n, at any time i, provides REP(LPF) in constant time, where LPF is the longest previous factor, i.e. the greedy phrase, a reference to the list of REP({set of prefixes of LPF}) in constant time, and REP(p) in time O(|p| log log n) for any given pattern p.
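The REP definition can be checked on small inputs with a brute-force sketch. It assumes the example cost function named in the abstract, the number of bits needed to write the offset, and it scans the text directly instead of using the Multi-Layer Suffix Tree. The function names are hypothetical, and `length` is assumed to be at least 1.

```python
def offset_cost(offset):
    """Cost of an offset: the minimum number of bits needed to write it."""
    return offset.bit_length()


def rightmost_occurrence(text, i, length):
    """Return the largest j < i with text[j:j+length] == text[i:i+length], or -1.

    The earlier occurrence may overlap position i, as LZ77 permits.
    """
    factor = text[i:i + length]
    # Limiting the search window to end at i + length - 1 forces j <= i - 1.
    return text.rfind(factor, 0, i + length - 1)


def equal_cost_positions(text, i, length):
    """Return all earlier occurrences whose offset costs as much as the rightmost one."""
    factor = text[i:i + length]
    rightmost = rightmost_occurrence(text, i, length)
    if rightmost < 0:
        return []
    target = offset_cost(i - rightmost)
    return [
        j for j in range(i)
        if text.startswith(factor, j) and offset_cost(i - j) == target
    ]
```

Any position returned by `equal_cost_positions` is a valid answer to the REP problem for that factor, since it encodes in as many bits as the rightmost occurrence.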

arXiv.org e-Print Archive

Crossref

King's Research Portal

Archivio istituzionale della ricerca - Università di Palermo