Accurate Cardinality Estimation of Co-occurring Words Using Suffix Trees (Extended Version)
Estimating the cost of a query plan is one of the hardest problems in query optimization. This includes cardinality estimates of string search patterns, of multi-word strings like phrases or text snippets in particular. At first sight, suffix trees address this problem. To curb the memory usage of a suffix tree, one often prunes the tree to a certain depth. But this pruning method "takes away" more information from long strings than from short ones. This problem is particularly severe with sets of long strings, the setting studied here. In this article, we propose pruning techniques suited to this setting. Our approaches remove characters with low information value. The variants determine a character's information value in different ways, e.g., by using conditional entropy with respect to previous characters in the string. Our experiments show that, in contrast to the well-known pruned suffix tree, our technique provides significantly better estimations when the tree size is reduced by 60% or less. Due to the redundancy of natural language, our pruning techniques yield hardly any error for tree-size reductions of up to 50%.
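As a rough illustration of the "conditional entropy with respect to previous characters" mentioned in this abstract, the following sketch estimates the first-order conditional entropy H(X_i | X_{i-1}) from character-pair counts. It is an assumption-laden toy: the function name `conditional_entropy` is hypothetical, only one preceding character is used as context, and the abstract does not state how the article's variants actually compute or apply the value.

```python
from collections import Counter
from math import log2


def conditional_entropy(strings):
    """Estimate H(current char | previous char) in bits from a set of strings.

    Assumption: a first-order model (one character of context). The article's
    own definition may use longer contexts or a different estimator.
    """
    pair_counts = Counter()   # counts of (previous, current) character pairs
    prev_counts = Counter()   # counts of 'previous' characters that have a successor
    for s in strings:
        for prev, cur in zip(s, s[1:]):
            pair_counts[(prev, cur)] += 1
            prev_counts[prev] += 1

    total = sum(pair_counts.values())
    if total == 0:
        return 0.0

    entropy = 0.0
    for (prev, cur), count in pair_counts.items():
        joint = count / total                 # p(prev, cur)
        cond = count / prev_counts[prev]      # p(cur | prev)
        entropy -= joint * log2(cond)
    return entropy
```

A value near zero means the next character is almost fully determined by the previous one, which is the kind of redundancy a pruning scheme could exploit.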
Large-Scale Pattern Search Using Reduced-Space On-Disk Suffix Arrays
The suffix array is an efficient data structure for in-memory pattern search. Suffix arrays can also be used for external-memory pattern search, via two-level structures that use an internal index to identify the correct block of suffix pointers. In this paper we describe a new two-level suffix array-based index structure that requires significantly less disk space than previous approaches. Key to the saving is the use of disk blocks that are based on prefixes rather than the more usual uniform-sampling approach, allowing reductions between blocks and subparts of other blocks. We also describe a new in-memory structure – the condensed BWT – and show that it allows common patterns to be resolved without access to the text. Experiments using 64 GB of English web text on a computer with 4 GB of main memory demonstrate the speed and versatility of the new approach. For this data the index is around one-third the size of previous two-level mechanisms; and the memory footprint of as little as 1% of the text size means that queries can be processed more quickly than is possible with a compact FM-INDEX.
Index Terms: String search, pattern matching, suffix array, Burrows-Wheeler transform, succinct data structure, disk-based algorithm, experimental evaluation.
Parallel text index construction
The focus of this dissertation is the parallel construction of text indices. Text indices provide additional information about a text that allows queries to be answered faster. Full-text indices, for example, are used to efficiently answer phrase queries, i.e., whether and where a phrase occurs in a text. The research in this dissertation is focused on, but not limited to, parallel construction algorithms for text indices in both shared and distributed memory.
In the first part, we look at wavelet trees: a compact index that generalizes rank and select queries from binary alphabets to alphabets of arbitrary size. In the second part of this dissertation, we consider the suffix array, one of the most researched text indices. The suffix array of a text contains the starting positions of the text's lexicographically sorted suffixes, i.e., we want to sort all its suffixes. The suffix array is often extended by the longest common prefix array (LCP array), which contains the length of the longest common prefix of each pair of lexicographically consecutive suffixes. Finally, we use the distributed suffix arrays (and LCP arrays) to compute distributed Patricia tries. This allows us to answer different phrase queries more efficiently than using only the suffix array.
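To make the definitions in this abstract concrete, here is a minimal sequential sketch of a suffix array, its LCP array, and a phrase query answered by binary search over the suffix array. The function names are hypothetical, the construction is deliberately naive, and nothing here reflects the parallel or distributed algorithms the dissertation is actually about.

```python
def suffix_array(text):
    """Starting positions of all suffixes of text, in lexicographic order.

    Naive construction by sorting the suffixes directly; fine for short
    examples, far too slow for large texts.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """lcp[k] = length of the longest common prefix of the suffixes
    starting at sa[k-1] and sa[k]; lcp[0] is defined as 0."""
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = sa[k - 1], sa[k]
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        lcp[k] = n
    return lcp


def find_occurrences(text, sa, pattern):
    """All start positions of pattern in text, via two binary searches
    that delimit the range of suffixes beginning with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])
```

For the text `"banana"`, the suffix array is `[5, 3, 1, 0, 4, 2]`, and the query `"ana"` resolves to the positions 1 and 3.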
Scalable String and Suffix Sorting: Algorithms, Techniques, and Tools
This dissertation focuses on two fundamental sorting problems: string sorting
and suffix sorting. The first part considers parallel string sorting on
shared-memory multi-core machines, the second part external memory suffix
sorting using the induced sorting principle, and the third part distributed
external memory suffix sorting with a new distributed algorithmic big data
framework named Thrill.
Comment: 396 pages, dissertation, Karlsruher Instituts für Technologie
(2018). arXiv admin note: text overlap with arXiv:1101.3448 by other author
Statistical and repetition-based compressed data structures
[Abstract]
In this thesis we present several practical compressed data structures that address
open problems related to statistically-compressible and highly repetitive databases.
In the first part, we focus on statistical-based compressed data structures,
targeting the problem of managing large alphabets. This problem arises when
typical sequence-based compression is used as a basis for compressed data structures
representing more general structures like grids and graphs. Concretely, (a) we
provide space-efficient solutions to represent prefix-free codes when the alphabet
is large; (b) we also present a new wavelet-tree based data structure to solve rank
and select queries that obtains zero-order compression and outperforms previous
wavelet tree implementations on large alphabets.
In the second part of this thesis, we focus on highly repetitive datasets. We
present (c) a very space-efficient grammar-based compressed data structure to solve
rank and select in these scenarios; (d) the first LZ77-space bounded compressed
data structure that solves rank and select queries in O(1) time and is in practice
almost as fast as statistically-compressed structures; and (e) the first practical
version of grammar-compressed tree topologies, obtaining unprecedented results in
the representation of repetitive trees.
Additionally, we apply our new solutions to several problems of interest: point
grids, inverted indexes, self-indexes, XPath systems, and compressed suffix trees of
highly repetitive inputs, displaying various space-time tradeoffs of interest.
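The rank and select queries that this thesis abstract refers to can be illustrated with a small pointer-based wavelet tree. This is only a sketch of the query logic under simplifying assumptions: the class name `WaveletTree` is hypothetical, plain Python lists of prefix counts stand in for succinct bitvectors, and select is done by binary search over rank, so none of the space or time bounds claimed in the thesis apply to it.

```python
class WaveletTree:
    """Uncompressed wavelet tree over a sequence with an arbitrary alphabet."""

    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        self.length = len(seq)
        self.left = self.right = None
        if len(self.alphabet) <= 1:
            return  # leaf: every symbol stored here is the same
        mid = len(self.alphabet) // 2
        self.left_syms = set(self.alphabet[:mid])
        # ones[i] = how many of seq[:i] are routed to the right child
        self.ones = [0]
        for c in seq:
            self.ones.append(self.ones[-1] + (c not in self.left_syms))
        self.left = WaveletTree(
            [c for c in seq if c in self.left_syms], self.alphabet[:mid])
        self.right = WaveletTree(
            [c for c in seq if c not in self.left_syms], self.alphabet[mid:])

    def rank(self, symbol, i):
        """Number of occurrences of symbol in the first i positions."""
        if symbol not in self.alphabet:
            return 0
        node = self
        while node.left is not None:
            if symbol in node.left_syms:
                i -= node.ones[i]      # positions that went left
                node = node.left
            else:
                i = node.ones[i]       # positions that went right
                node = node.right
        return i

    def select(self, symbol, k):
        """Index of the k-th occurrence of symbol (k >= 1), or -1 if none."""
        if k < 1 or self.rank(symbol, self.length) < k:
            return -1
        lo, hi = 0, self.length - 1
        while lo < hi:  # smallest index whose inclusive prefix has rank >= k
            mid = (lo + hi) // 2
            if self.rank(symbol, mid + 1) >= k:
                hi = mid
            else:
                lo = mid + 1
        return lo
```

Each level of the tree halves the alphabet, so a rank query touches about log2 of the alphabet size nodes, which is the property that lets binary rank generalize to larger alphabets.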