Compressed materialised views of semi-structured data
Query performance issues over semi-structured data have led to the emergence of materialised XML views as a means of restricting the data structure processed by a query. However, preserving the conventional representation of such views remains a significant limiting factor, especially on mobile devices, where processing power, memory usage and bandwidth are limited. To explore the concept of a compressed materialised view, we extend our earlier work on structural XML compression to produce a combination of structural summarisation and data compression techniques. These techniques provide a basis for efficiently dealing with both structural queries and value-based predicates. We evaluate the effectiveness of such a scheme, presenting results and performance measures that show the advantages of using such structures.
Compressed Text Indexes: From Theory to Practice!
A compressed full-text self-index represents a text in a compressed form and
still answers queries efficiently. This technology represents a breakthrough
over the text indexing techniques of the previous decade, whose indexes
required several times the size of the text. Although it is relatively new,
this technology has matured up to a point where theoretical research is giving
way to practical developments. Nonetheless, this requires significant
programming skills, a deep engineering effort, and a strong algorithmic
background to dig into the research results. To date only isolated
implementations and focused comparisons of compressed indexes have been
reported, and they lacked a common API, which prevented their re-use or
deployment within other applications.
The goal of this paper is to fill this gap. First, we present the existing
implementations of compressed indexes from a practitioner's point of view.
Second, we introduce the Pizza&Chili site, which offers tuned implementations
and a standardized API for the most successful compressed full-text
self-indexes, together with effective testbeds and scripts for their automatic
validation and test. Third, we show the results of our extensive experiments on
these codes with the aim of demonstrating the practical relevance of this novel
and exciting technology.
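As a rough illustration of what "answering queries on a compressed self-index" means, the sketch below counts pattern occurrences by backward search over a Burrows-Wheeler transform, the mechanism behind FM-index style self-indexes. It is a toy under stated assumptions and not the Pizza&Chili API: the function names `bwt` and `count` are hypothetical, the transform is built by sorting rotations, and the occurrence counts use a linear scan where a real index would use compressed rank structures.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (quadratic; demo only).

    Assumes the sentinel character "\0" does not occur in `text`.
    """
    s = text + "\0"  # unique sentinel, smaller than every other character
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count(bwt_text: str, pattern: str) -> int:
    """Count occurrences of `pattern` in the original text by backward search."""
    # C[c] = number of characters in the text that are strictly smaller than c.
    frequencies = {}
    for ch in bwt_text:
        frequencies[ch] = frequencies.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(frequencies):
        C[ch] = total
        total += frequencies[ch]

    def occ(c: str, i: int) -> int:
        # Occurrences of c in bwt_text[:i]; a real index answers this with rank.
        return bwt_text.count(c, 0, i)

    lo, hi = 0, len(bwt_text)  # current range of matching sorted rotations
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

For example, `count(bwt("abracadabra"), "abra")` scans the pattern right to left and narrows the range of matching rotations one character at a time, without ever consulting the original text.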
On optimally partitioning a text to improve its compression
In this paper we investigate the problem of partitioning an input string T in
such a way that compressing individually its parts via a base-compressor C gets
a compressed output that is shorter than applying C over the entire T at once.
This problem was introduced in the context of table compression, and then
further elaborated and extended to strings and trees. Unfortunately, the
literature offers poor solutions: namely, we know either a cubic-time algorithm
for computing the optimal partition based on dynamic programming, or a few
heuristics that do not guarantee any bounds on the efficacy of their computed
partition, or algorithms that are efficient but work in some specific scenarios
(such as the Burrows-Wheeler Transform) and achieve compression performance
that might be worse than the optimal-partitioning by a $\Omega(\sqrt{\log n})$
factor. Therefore, computing efficiently the optimal solution is still open. In
this paper we provide the first algorithm which is guaranteed to compute in
$O(n \log_{1+\epsilon} n)$ time a partition of T whose compressed output is
guaranteed to be no more than $(1+\epsilon)$-worse than the optimal one, where
$\epsilon$ may be any positive constant.
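The dynamic-programming baseline mentioned in the abstract can be sketched as follows. This is the slow exact approach, not the paper's $(1+\epsilon)$-approximation algorithm. It assumes `zlib` as the base compressor C, ignores the cost of storing part boundaries, and uses the hypothetical names `cost` and `optimal_partition`. It makes a quadratic number of compressor calls, each linear in the part length, hence the cubic running time.

```python
import zlib


def cost(data: bytes) -> int:
    """Compressed size of `data` under the base compressor (zlib assumed here)."""
    return len(zlib.compress(data, 9))


def optimal_partition(text: bytes):
    """Return (total_size, parts) minimising the sum of per-part compressed sizes."""
    n = len(text)
    best = [0] * (n + 1)  # best[i] = minimal total cost for the prefix text[:i]
    cut = [0] * (n + 1)   # cut[i]  = start of the last part in that optimum
    for i in range(1, n + 1):
        best_i, cut_i = None, 0
        for j in range(i):  # try every possible last part text[j:i]
            candidate = best[j] + cost(text[j:i])
            if best_i is None or candidate < best_i:
                best_i, cut_i = candidate, j
        best[i], cut[i] = best_i, cut_i
    # Walk the cut points backwards to recover the parts.
    parts, i = [], n
    while i > 0:
        parts.append(text[cut[i]:i])
        i = cut[i]
    parts.reverse()
    return best[n], parts
```

Because the single-part partition is always one of the candidates, the returned total can never exceed `cost(text)`. With a compressor that has a fixed per-stream overhead, as zlib does, short inputs will often come back as a single part.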
Generating Concise and Readable Summaries of XML Documents
XML has become the de-facto standard for data representation and exchange,
resulting in large scale repositories and warehouses of XML data. In order for
users to understand and explore these large collections, a summarized, bird's
eye view of the available data is a necessity. In this paper, we are interested
in semantic XML document summaries which present the "important" information
available in an XML document to the user. In the best case, such a summary is a
concise replacement for the original document itself. At the other extreme, it
should at least help the user make an informed choice as to the relevance of
the document to his needs. In this paper, we address the two main issues which
arise in producing such meaningful and concise summaries: i) which tags or text
units are important and should be included in the summary, ii) how to generate
summaries of different sizes. We conduct user
studies with different real-life datasets and show that our methods are useful
and effective in practice.
Fast and Tiny Structural Self-Indexes for XML
XML document markup is highly repetitive and therefore well compressible
using dictionary-based methods such as DAGs or grammars. In the context of
selectivity estimation, grammar-compressed trees were used before as synopses
for structural XPath queries. Here a fully-fledged index over such grammars is
presented. The index allows arbitrary tree algorithms to be executed with a
slow-down that is comparable to the space improvement. More interestingly,
certain algorithms execute much faster over the index (because no decompression
occurs). E.g., for structural XPath count queries, evaluating over the index is
faster than previous XPath implementations, often by two orders of magnitude.
The index also allows XML results (including texts) to be serialized faster than
previous systems, by a factor of ca. 2-3. This is due to efficient copy
handling of grammar repetitions, and because materialization is totally
avoided. In order to compare with twig join implementations, we implemented a
materializer which writes out pre-order numbers of result nodes, and show its
competitiveness. Comment: 13 pages
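To illustrate why repetitive markup compresses well with DAGs, and why count queries can run without decompression, here is a minimal sketch. It builds a minimal DAG by sharing identical element subtrees and then counts elements with a given tag by visiting each DAG node once. It is not the paper's grammar-based index. The names `build_dag` and `count_tag` are hypothetical, and text content and attributes are ignored for simplicity.

```python
import xml.etree.ElementTree as ET


def build_dag(xml_text: str):
    """Return (root_id, nodes), where nodes[i] = (tag, tuple_of_child_ids).

    Identical subtrees (same tag, same sequence of child subtrees) share one id.
    """
    index = {}  # (tag, child ids) -> node id
    nodes = []

    def visit(elem):
        key = (elem.tag, tuple(visit(child) for child in elem))
        if key not in index:
            index[key] = len(nodes)
            nodes.append(key)
        return index[key]

    root_id = visit(ET.fromstring(xml_text))
    return root_id, nodes


def count_tag(root_id, nodes, tag: str) -> int:
    """Count elements named `tag` in the unfolded tree, one pass over the DAG."""
    memo = [0] * len(nodes)
    # Children are always created before their parent, so ids are topological.
    for i, (node_tag, children) in enumerate(nodes):
        memo[i] = (node_tag == tag) + sum(memo[c] for c in children)
    return memo[root_id]
```

On a document with many structurally identical records, the number of DAG nodes stays small while `count_tag` still reports counts for the full tree, which is the effect the abstract attributes to evaluating directly over the compressed form.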
Fast in-memory XPath search using compressed indexes
A large fraction of an XML document typically consists of text data. The XPath query language allows text search via the equal, contains, and starts-with predicates. Such predicates can be efficiently implemented using a compressed self-index of the document's text nodes. Most queries, however, contain some parts querying the text of the document, plus some parts querying the tree structure. It is therefore a challenge to choose an appropriate evaluation order for a given query, one that optimally leverages the execution speeds of the text and tree indexes. Here the SXSI system is introduced. It stores the tree structure of an XML document using a bit array of opening and closing brackets plus a sequence of labels, and stores the text nodes of the document using a global compressed self-index. On top of these indexes sits an XPath query engine that is based on tree automata. The engine uses fast counting queries of the text index in order to dynamically determine whether to evaluate top-down or bottom-up with respect to the tree structure. The resulting system has several advantages over existing systems: (1) on pure tree queries (without text search) such as the XPathMark queries, the SXSI system performs on par with or better than the fastest known systems MonetDB and Qizx, (2) on queries that use text search, SXSI outperforms the existing systems by 1-3 orders of magnitude (depending on the size of the result set), and (3) with respect to memory consumption, SXSI outperforms all other systems for counting-only queries. Peer reviewed.
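The tree encoding described above, a bit array of opening and closing brackets plus a sequence of labels, can be sketched in a few lines. This is an illustrative toy and not the SXSI implementation. The function names are hypothetical, trees are given as `(label, [children])` tuples, and the matching-bracket and rank operations use linear scans where a succinct structure would answer them in constant time.

```python
def encode(tree):
    """Encode (label, [children]) as (bits, labels): 1 = open, 0 = close."""
    bits, labels = [], []

    def visit(node):
        label, children = node
        bits.append(1)
        labels.append(label)  # labels are stored in preorder
        for child in children:
            visit(child)
        bits.append(0)

    visit(tree)
    return bits, labels


def find_close(bits, i):
    """Position of the closing bracket that matches the opening bracket at i."""
    depth = 0
    for j in range(i, len(bits)):
        depth += 1 if bits[j] else -1
        if depth == 0:
            return j
    raise ValueError("unbalanced bracket sequence")


def preorder(bits, i):
    """Preorder number of the node opened at i (rank of opening brackets)."""
    return sum(bits[: i + 1]) - 1


def subtree_size(bits, i):
    """Number of nodes in the subtree rooted at the node opened at i."""
    return (find_close(bits, i) - i + 1) // 2


def first_child(bits, i):
    """Opening position of the first child, or None for a leaf."""
    return i + 1 if bits[i + 1] else None


def next_sibling(bits, i):
    """Opening position of the next sibling, or None if there is none."""
    j = find_close(bits, i) + 1
    return j if j < len(bits) and bits[j] else None
```

A node is identified by the position of its opening bracket, and `labels[preorder(bits, i)]` recovers its label, so navigation needs only the bit array and the label sequence.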
XBWT Tricks
The eXtended Burrows-Wheeler Transform (XBWT) is a
data transformation introduced in [Ferragina et al., FOCS 2005] to compactly
represent a labeled tree and simultaneously support navigation
and path-search operations over its label structure.
A natural application of the XBWT is to store a dictionary of strings.
A recent extensive experimental study [Martínez-Prieto et al., Information
Systems, 2016] shows that, among the available string dictionary
implementations, the XBWT is attractive because of its good tradeoff
between small space usage, speed, and support for substring searches.
In this paper we further investigate the use of the XBWT for storing a
string dictionary. Our first contribution is to show how to add suffix links
(aka failure links) to an XBWT string dictionary. For an XBWT dictionary
with n internal nodes our suffix links can be traversed in constant time
and only take 2n + o(n) bits of space.
Our second contribution is a set of practical construction algorithms for the
XBWT, including the additional data structure supporting the traversal
of suffix links. Our algorithms build on the many well-engineered
algorithms for Suffix Array and BWT construction and offer different
tradeoffs between running time and working space.
Space-efficient representation of semi-structured document formats using succinct data structures
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2021. 2. Srinivasa Rao Satti.
Large volumes of data are generated from a plethora of sources. Most of this data is stored in files without a fixed schema, which makes semi-structured document formats a suitable way to maintain them. Several such formats, such as XML (eXtensible Markup Language), JSON (JavaScript Object Notation), and YAML (YAML Ain't Markup Language), have been proposed to preserve the hierarchy in the original corpora of data. Several data models that structure the gathered data, including RDF (Resource Description Framework), depend on semi-structured document formats to be serialized and transferred for future processing.
Since semi-structured document formats favour readability and are verbose, they need additional space to organize and maintain a document. Although general-purpose compression schemes are widely used to compact such documents, applying them hinders later handling of the corpora because the internal structure is lost.
Succinct data structures are widely studied in theory: they answer queries while the encoded data occupies space close to the information-theoretic lower bound. Bit vectors and trees are notable examples. Nevertheless, there have been few attempts to apply succinct data structures to represent semi-structured documents in a space-efficient manner.
In this dissertation we propose a unified, space-efficient representation of various semi-structured document formats. The core features of this representation are its compactness and queryability, both derived from the rich functionality of succinct data structures. The compact representation combines (a) bit indexed arrays, (b) succinct ordinal trees, and (c) compression techniques. We implement this representation in practice, and show by experiments that it decreases disk usage by up to 60% while occupying 90% less RAM. We also support processing a document in parts, so that larger corpora can be processed even in a constrained environment.
In parallel to establishing the aforementioned compact semi-structured document representation, we also improve some existing compression schemes in this dissertation. We first suggest a way to encode an array of integers that is not necessarily sorted. This compaction scheme improves upon existing universal code systems with the help of a succinct bit vector structure. We show that our suggested algorithm reduces space usage by up to 44% while consuming 15% less time than the original code system, and that it additionally supports random access to elements of the encoded array.
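The abstract does not spell out the encoding, so the following is only a guess at the general idea and not the dissertation's algorithm. It stores each integer in its minimal binary form and keeps a parallel marker bit vector with a 1 at the first bit of every element, so that a select query on the markers gives random access. The class name `MarkedIntArray` and its methods are hypothetical, bit strings stand in for packed bits, and select is a linear scan where a succinct bit vector would answer it in constant time with o(n) extra bits.

```python
class MarkedIntArray:
    """Variable-length integers plus a marker bit vector for random access."""

    def __init__(self, values):
        payload, marks = [], []
        for value in values:
            if value < 0:
                raise ValueError("only non-negative integers are supported")
            bits = format(value, "b")  # minimal binary form; "0" for zero
            payload.append(bits)
            marks.append("1" + "0" * (len(bits) - 1))  # 1 marks an element start
        self.payload = "".join(payload)
        self.marks = "".join(marks)
        self.n = len(values)

    def _select1(self, k: int) -> int:
        """Position of the k-th 1 (0-based) in the marker bit vector."""
        pos = -1
        for _ in range(k + 1):
            pos = self.marks.index("1", pos + 1)
        return pos

    def __getitem__(self, k: int) -> int:
        if not 0 <= k < self.n:
            raise IndexError(k)
        start = self._select1(k)
        end = self._select1(k + 1) if k + 1 < self.n else len(self.payload)
        return int(self.payload[start:end], 2)

    def bits(self) -> int:
        """Total bits used by payload and markers (excluding select support)."""
        return len(self.payload) + len(self.marks)
```

Unlike a self-delimiting universal code, which has to be decoded from the front to reach the k-th element, this layout locates any element with two select queries. The order of the values plays no role, which matches the "not necessarily sorted" setting.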
We also improve the SBH bitmap index compression algorithm. The main strength of this scheme is its use of an intermediate super-bucket during operations, which gives better performance when querying through a combination of compressed bitmap indexes. Inspired by the splits done during the intermediate process of the SBH algorithm, we give an improved compression mechanism supporting parallelism that can be used on both CPUs and GPUs. We show by experiments that the CPU parallel processing optimization reduces compression and decompression times by up to 38% on a 4-core machine without modifying the compressed bitmap form. For GPUs, the new algorithm gives 48% faster query processing time in the experiments, compared to the previously existing bitmap index compression schemes.
Chapter 1 Introduction
1.1 Contribution
1.2 Organization
Chapter 2 Background
2.1 Model of Computation
2.2 Succinct Data Structures
Chapter 3 Space-efficient Representation of Integer Arrays
3.1 Introduction
3.2 Preliminaries
3.2.1 Universal Code System
3.2.2 Bit Vector
3.3 Algorithm Description
3.3.1 Main Principle
3.3.2 Optimization in the Implementation
3.4 Experimental Results
Chapter 4 Space-efficient Parallel Compressed Bitmap Index Processing
4.1 Introduction
4.2 Related Work
4.2.1 Byte-aligned Bitmap Code (BBC)
4.2.2 Word-Aligned Hybrid (WAH)
4.2.3 WAH-derived Algorithms
4.2.4 GPU-based WAH Algorithms
4.2.5 Super Byte-aligned Hybrid (SBH)
4.3 Parallelizing SBH
4.3.1 CPU Parallelism
4.3.2 GPU Parallelism
4.4 Experimental Results
4.4.1 Plain Version
4.4.2 Parallelized Version
4.4.3 Summary
Chapter 5 Space-efficient Representation of Semi-structured Document Formats
5.1 Preliminaries
5.1.1 Semi-structured Document Formats
5.1.2 Resource Description Framework
5.1.3 Succinct Ordinal Tree Representations
5.1.4 String Compression Schemes
5.2 Representation
5.2.1 Bit String Indexed Array
5.2.2 Main Structure
5.2.3 Single Document as a Collection of Chunks
5.2.4 Supporting Queries
5.3 Experimental Results
5.3.1 Datasets
5.3.2 Construction Time
5.3.3 RAM Usage during Construction
5.3.4 Disk Usage and Serialization Time
5.3.5 Chunk Division
5.3.6 String Compression
5.3.7 Query Time
Chapter 6 Conclusion
Bibliography
Abstract (in Korean)
Acknowledgements
Wheeler graphs: A framework for BWT-based data structures
The famous Burrows–Wheeler Transform (BWT) was originally defined for a single string but variations have been developed for sets of strings, labeled trees, de Bruijn graphs, etc. In this paper we propose a framework that includes many of these variations and that we hope will simplify the search for more.
We first define Wheeler graphs and show they have a property we call path coherence. We show that if the state diagram of a finite-state automaton is a Wheeler graph then, by its path coherence, we can order the nodes such that, for any string, the nodes reachable from the initial state or states by processing that string are consecutive. This means that even if the automaton is non-deterministic, we can still store it compactly and process strings with it quickly.
We then rederive several variations of the BWT by designing straightforward finite-state automata for the relevant problems and showing that their state diagrams are Wheeler graphs.
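The abstract does not restate the definition, so the sketch below relies on the standard one: a node ordering is a Wheeler order if nodes with in-degree 0 come first and, for any two edges (u, v) labeled a and (u', v') labeled a', a < a' implies v < v', and a = a' with u < u' implies v <= v'. The code checks a given ordering by brute force and shows the "consecutive nodes" property on a small non-deterministic automaton. The names `is_wheeler_order` and `reach` are hypothetical, and nodes are assumed to be numbered 0..n-1 in the candidate order.

```python
def is_wheeler_order(num_nodes, edges):
    """Check whether the numbering 0..num_nodes-1 is a Wheeler order.

    `edges` is a list of (source, target, label) triples.
    """
    has_incoming = [False] * num_nodes
    for _, target, _ in edges:
        has_incoming[target] = True
    # Nodes with in-degree 0 must precede all nodes with positive in-degree.
    seen_incoming = False
    for node in range(num_nodes):
        if has_incoming[node]:
            seen_incoming = True
        elif seen_incoming:
            return False
    # Pairwise edge conditions (quadratic check, fine for small graphs).
    for u, v, a in edges:
        for u2, v2, a2 in edges:
            if a < a2 and not v < v2:
                return False
            if a == a2 and u < u2 and not v <= v2:
                return False
    return True


def reach(edges, sources, string):
    """Sorted list of nodes reachable from `sources` by reading `string`."""
    current = set(sources)
    for ch in string:
        current = {v for (u, v, a) in edges if u in current and a == ch}
    return sorted(current)
```

For the automaton with edges `(0, 1, "a")`, `(0, 2, "a")`, `(1, 3, "b")`, `(2, 3, "b")`, the check succeeds and `reach(edges, {0}, "a")` yields the consecutive nodes 1 and 2, so the reachable set can be stored as an interval even though the automaton is non-deterministic.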