301 research outputs found
Sorting improves word-aligned bitmap indexes
Bitmap indexes must be compressed to reduce input/output costs and minimize
CPU usage. To accelerate logical operations (AND, OR, XOR) over bitmaps, we use
techniques based on run-length encoding (RLE), such as Word-Aligned Hybrid
(WAH) compression. These techniques are sensitive to the order of the rows: a
simple lexicographical sort can divide the index size by 9 and make indexes
several times faster. We investigate row-reordering heuristics. Simply
permuting the columns of the table can increase the sorting efficiency by 40%.
Secondary contributions include efficient algorithms to construct and aggregate
bitmaps. The effect of word length is also reviewed by constructing 16-bit,
32-bit and 64-bit indexes. Using 64-bit CPUs, we find that 64-bit indexes are
slightly faster than 32-bit indexes despite being nearly twice as large.
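As a toy illustration of the idea in this abstract (not code from the paper): the compressed size under an RLE-based scheme such as WAH grows with the number of runs of identical values per column, and a lexicographic sort of the rows shrinks that count.

```python
from itertools import groupby

def count_runs(column):
    """Runs of identical values in a column; RLE output size grows with this."""
    return sum(1 for _ in groupby(column))

# Toy two-column table: a bitmap index stores one bitmap per distinct value,
# and run-length coders such as WAH benefit when identical rows are adjacent.
table = [("b", 1), ("a", 0), ("b", 0), ("a", 1), ("b", 1), ("a", 0)]

runs_unsorted = sum(count_runs([row[c] for row in table]) for c in range(2))
runs_sorted = sum(count_runs([row[c] for row in sorted(table)]) for c in range(2))
assert runs_sorted < runs_unsorted  # lexicographic sorting creates longer runs
print(runs_unsorted, runs_sorted)   # 10 6
```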
Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes
Bitmap indexes must be compressed to reduce input/output costs and minimize
CPU usage. To accelerate logical operations (AND, OR, XOR) over bitmaps, we use
techniques based on run-length encoding (RLE), such as Word-Aligned Hybrid
(WAH) compression. These techniques are sensitive to the order of the rows: a
simple lexicographical sort can divide the index size by 9 and make indexes
several times faster. We investigate reordering heuristics based on computed
attribute-value histograms. Simply permuting the columns of the table based on
these histograms can increase the sorting efficiency by 40%.
Comment: To appear in proceedings of DOLAP 200
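A minimal sketch of the column-permutation idea, using per-column distinct-value counts (a histogram-derived statistic) as the ordering key; the function name, the descending choice, and the exact key are assumptions for illustration, not the paper's precise heuristic.

```python
def reorder_columns(table, descending=True):
    """Permute columns by number of distinct values before sorting the rows.

    A plain cardinality heuristic standing in for histogram-based
    reordering (illustrative sketch, not the paper's exact method).
    """
    ncols = len(table[0])
    cardinality = [len({row[c] for row in table}) for c in range(ncols)]
    order = sorted(range(ncols), key=lambda c: cardinality[c], reverse=descending)
    return [tuple(row[c] for c in order) for row in table]

# The low-cardinality flag column comes first in the input;
# the heuristic moves the higher-cardinality year column ahead of it.
table = [("x", 2000), ("x", 2001), ("y", 2000), ("x", 2002)]
print(reorder_columns(table))
```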
Tri de la table de faits et compression des index bitmaps avec alignement sur les mots (Sorting the fact table and word-aligned bitmap index compression)
Bitmap indexes are frequently used to index multidimensional data. They rely
mostly on sequential input/output. Bitmaps can be compressed to reduce
input/output costs and minimize CPU usage. The most efficient compression
techniques are based on run-length encoding (RLE), such as Word-Aligned Hybrid
(WAH) compression. This type of compression accelerates logical operations
(AND, OR) over the bitmaps. However, run-length encoding is sensitive to the
order of the facts. Thus, we propose to sort the fact tables. We review
lexicographic, Gray-code, and block-wise sorting. We found that a lexicographic
sort improves compression, sometimes halving the index size, and makes indexes
several times faster. While sorting takes time, this is partially
offset by the fact that it is faster to index a sorted table. Column order is
significant: it is generally preferable to put the columns having more distinct
values at the beginning. A block-wise sort is much less efficient than a full
sort. Moreover, we found that Gray-code sorting is not better than
lexicographic sorting when using word-aligned compression.
Comment: to appear at BDA'0
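A toy comparison of the two orderings discussed above, counting raw runs over all columns of a fully enumerated binary table. On this raw-run metric Gray-code ordering is known to win; the abstract's point is that word-aligned compression erodes that advantage in practice, which this sketch does not model.

```python
from itertools import groupby, product

def runs(table):
    """Total runs of identical values, summed over all columns."""
    return sum(sum(1 for _ in groupby(col)) for col in zip(*table))

def gray_key(bits):
    """Rank of a bit tuple in reflected Gray-code order: the prefix-XOR of
    the tuple's bits is the binary index whose Gray code equals the tuple."""
    b, key = 0, 0
    for g in bits:
        b ^= g
        key = (key << 1) | b
    return key

rows = list(product([0, 1], repeat=3))  # every 3-bit row exactly once
lex = sorted(rows)
gray = sorted(rows, key=gray_key)
print(runs(lex), runs(gray))  # 14 10
```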
Better bitmap performance with Roaring bitmaps
Bitmap indexes are commonly used in databases and search engines. By
exploiting bit-level parallelism, they can significantly accelerate queries.
However, they can use much memory, and thus we might prefer compressed bitmap
indexes. Following Oracle's lead, bitmaps are often compressed using run-length
encoding (RLE). Building on prior work, we introduce the Roaring compressed
bitmap format: it uses packed arrays for compression instead of RLE. We compare
it to two high-performance RLE-based bitmap encoding techniques: WAH (Word
Aligned Hybrid compression scheme) and Concise (Compressed `n' Composable
Integer Set). On synthetic and real data, we find that Roaring bitmaps (1)
often compress significantly better (e.g., 2 times) and (2) are faster than the
compressed alternatives (up to 900 times faster for intersections). Our results
challenge the view that RLE-based bitmap compression is best.
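A heavily simplified sketch of the packed-array idea behind Roaring: 32-bit integers are partitioned by their high 16 bits, and each chunk holds either a sorted array of low bits or, once it grows dense, a 2^16-bit bitmap. Real Roaring also has run containers and tuned operations; this toy shows only the container switch.

```python
def roaring_add(containers, x):
    """Insert a 32-bit integer into a toy Roaring-style structure:
    partition by the high 16 bits, keep a sorted array per chunk, and
    upgrade the chunk to a bitmap once the array exceeds 4096 entries."""
    hi, lo = x >> 16, x & 0xFFFF
    c = containers.setdefault(hi, [])
    if isinstance(c, list):
        if lo not in c:
            c.append(lo)
            c.sort()
        if len(c) > 4096:                # dense chunk: switch representation
            bitmap = bytearray(8192)     # 2^16 bits = 8 KiB per dense chunk
            for v in c:
                bitmap[v >> 3] |= 1 << (v & 7)
            containers[hi] = bitmap
    else:
        c[lo >> 3] |= 1 << (lo & 7)

containers = {}
for x in (5, 70000, 5):                  # 70000 falls in the second chunk
    roaring_add(containers, x)
print(sorted(containers))                # [0, 1]
```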
Reordering Rows for Better Compression: Beyond the Lexicographic Order
Sorting database tables before compressing them improves the compression
rate. Can we do better than the lexicographical order? For minimizing the
number of runs in a run-length encoding compression scheme, the best approaches
to row-ordering are derived from traveling salesman heuristics, although there
is a significant trade-off between running time and compression. A new
heuristic, Multiple Lists, which is a variant on Nearest Neighbor that trades
off compression for a major running-time speedup, is a good option for very
large tables. However, for some compression schemes, it is more important to
generate long runs rather than few runs. For this case, another novel
heuristic, Vortex, is promising. We find that we can improve run-length
encoding by up to a factor of 3, whereas we can improve prefix coding by up to 80%:
these gains are on top of the gains due to lexicographically sorting the table.
We prove that the new row reordering is optimal (within 10%) at minimizing the
runs of identical values within columns, in a few cases.
Comment: to appear in ACM TOD
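A sketch of the plain Nearest Neighbor baseline this abstract builds on (Multiple Lists and Vortex, the paper's own heuristics, are not shown): greedily chain rows so consecutive rows differ in as few columns as possible, since each avoided difference saves one run boundary under RLE.

```python
def hamming(r, s):
    """Number of columns in which two rows differ."""
    return sum(a != b for a, b in zip(r, s))

def nearest_neighbor_order(rows):
    """Greedy traveling-salesman-style reordering: repeatedly append the
    remaining row that differs from the current one in the fewest columns."""
    remaining = list(rows)
    out = [remaining.pop(0)]
    while remaining:
        nxt = min(remaining, key=lambda r: hamming(out[-1], r))
        remaining.remove(nxt)
        out.append(nxt)
    return out

rows = [(1, 0, 0), (0, 1, 1), (1, 0, 1), (0, 1, 0)]
print(nearest_neighbor_order(rows))
```

Note the quadratic cost of the greedy scan, which is exactly the running-time/compression trade-off the abstract mentions for very large tables.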
Space-efficient Representation of Semi-structured Document Formats Using Succinct Data Structures
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2021. 2. Srinivasa Rao Satti.
Enormous volumes of data are generated from a plethora of sources. Much of this data is stored in files without a fixed schema, which makes such files well suited to semi-structured document formats. Formats such as XML (eXtensible Markup Language), JSON (JavaScript Object Notation), and YAML (YAML Ain't Markup Language) were proposed to preserve the hierarchy of the original corpora of data. Several data models structuring the gathered data, including RDF (Resource Description Framework), depend on semi-structured document formats for serialization and transfer for future processing.
Since semi-structured document formats emphasize readability at the cost of verbosity, redundant space is required to organize and maintain a document. Although general-purpose compression schemes are widely used to compact such documents, applying those algorithms hinders future handling of the corpora because the internal structure is lost.
The area of succinct data structures is widely investigated in theory: the encoded data occupy space close to the information-theoretic lower bound while still answering queries. Bit vectors and trees are the most notable succinct data structures. Nevertheless, there have been few attempts to apply succinct data structures to represent semi-structured documents in a space-efficient manner.
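The bit vectors mentioned here support two core queries, rank and select. A minimal sketch of the idea (one absolute rank sample per block; real succinct designs add a second sampling level to reach o(n) extra bits, and sample select answers as well):

```python
class BitVector:
    """Toy rank/select bit vector with one rank sample per 8-bit block."""
    BLOCK = 8

    def __init__(self, bits):
        self.bits = bits
        self.samples = []            # number of 1s strictly before each block
        acc = 0
        for i, b in enumerate(bits):
            if i % self.BLOCK == 0:
                self.samples.append(acc)
            acc += b

    def rank1(self, i):
        """Number of 1s in bits[:i]: jump to the block sample, then scan."""
        block = i // self.BLOCK
        return self.samples[block] + sum(self.bits[block * self.BLOCK:i])

    def select1(self, k):
        """Position of the k-th 1 (1-indexed), by linear scan in this toy."""
        for i, b in enumerate(self.bits):
            k -= b
            if k == 0 and b:
                return i
        raise ValueError("fewer than k ones")

bv = BitVector([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
print(bv.rank1(5), bv.select1(4))  # 3 6
```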
In this dissertation we propose a unified, space-efficient representation of various semi-structured document formats. Its core properties are compactness and queryability, derived from the rich operations of succinct data structures. The representation combines (a) bit-indexed arrays, (b) succinct ordinal trees, and (c) compression techniques. We implement this representation in practice and show by experiments that it decreases disk usage by up to 60% while occupying 90% less RAM. We also support processing a document in a partial manner, so that large corpora of big data can be processed even in constrained environments.
In parallel with establishing the aforementioned compact semi-structured document representation, this dissertation also reinforces some existing compression schemes. We first propose a scheme for encoding an array of integers that is not necessarily sorted. This compaction scheme improves upon existing universal code systems with the aid of a succinct bit vector. We show that the suggested algorithm reduces space usage by up to 44% while consuming 15% less time than the original code system, and additionally supports random access to elements of the encoded array.
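To make the universal-code idea concrete, here is an illustrative sketch using Elias gamma codes with recorded codeword start positions; the dissertation's actual scheme differs (it uses a succinct bit vector rather than an explicit position list, and its own code system), so every name below is an assumption for illustration.

```python
def gamma_encode(n):
    """Elias gamma code of a positive integer: (len-1) zeros, then n in binary."""
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

def encode_array(values):
    """Concatenate gamma codes and record each codeword's start position;
    a succinct select structure over these marks would give random access."""
    stream, starts, pos = [], [], 0
    for v in values:
        starts.append(pos)
        code = gamma_encode(v)
        stream.append(code)
        pos += len(code)
    return "".join(stream), starts

def access(stream, starts, i):
    """Decode the i-th value directly from its marked start position."""
    p = starts[i]
    zeros = 0
    while stream[p + zeros] == "0":
        zeros += 1
    return int(stream[p + zeros:p + 2 * zeros + 1], 2)

stream, starts = encode_array([9, 1, 4])
print(access(stream, starts, 0), access(stream, starts, 2))  # 9 4
```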
We also reinforce the SBH bitmap index compression algorithm. The main strength of this scheme is its use of an intermediate super-bucket during operations, which improves query performance when combining compressed bitmap indexes. Inspired by the splits performed in the intermediate stage of the SBH algorithm, we present an improved compression mechanism supporting parallelism on both CPUs and GPUs. Experiments show that the CPU parallel-processing optimization reduces compression and decompression times by up to 38% on a 4-core machine without modifying the compressed bitmap form. On GPUs, the new algorithm delivers 48% faster query processing than previously existing bitmap index compression schemes.
Chapter 1 Introduction 1
1.1 Contribution 3
1.2 Organization 5
Chapter 2 Background 6
2.1 Model of Computation 6
2.2 Succinct Data Structures 7
Chapter 3 Space-efficient Representation of Integer Arrays 9
3.1 Introduction 9
3.2 Preliminaries 10
3.2.1 Universal Code System 10
3.2.2 Bit Vector 13
3.3 Algorithm Description 13
3.3.1 Main Principle 14
3.3.2 Optimization in the Implementation 16
3.4 Experimental Results 16
Chapter 4 Space-efficient Parallel Compressed Bitmap Index Processing 19
4.1 Introduction 19
4.2 Related Work 23
4.2.1 Byte-aligned Bitmap Code (BBC) 24
4.2.2 Word-Aligned Hybrid (WAH) 27
4.2.3 WAH-derived Algorithms 28
4.2.4 GPU-based WAH Algorithms 31
4.2.5 Super Byte-aligned Hybrid (SBH) 33
4.3 Parallelizing SBH 38
4.3.1 CPU Parallelism 38
4.3.2 GPU Parallelism 39
4.4 Experimental Results 40
4.4.1 Plain Version 41
4.4.2 Parallelized Version 46
4.4.3 Summary 49
Chapter 5 Space-efficient Representation of Semi-structured Document Formats 50
5.1 Preliminaries 50
5.1.1 Semi-structured Document Formats 50
5.1.2 Resource Description Framework 57
5.1.3 Succinct Ordinal Tree Representations 60
5.1.4 String Compression Schemes 64
5.2 Representation 66
5.2.1 Bit String Indexed Array 67
5.2.2 Main Structure 68
5.2.3 Single Document as a Collection of Chunks 72
5.2.4 Supporting Queries 73
5.3 Experimental Results 75
5.3.1 Datasets 76
5.3.2 Construction Time 78
5.3.3 RAM Usage during Construction 80
5.3.4 Disk Usage and Serialization Time 83
5.3.5 Chunk Division 83
5.3.6 String Compression 88
5.3.7 Query Time 89
Chapter 6 Conclusion 94
Bibliography 96
Abstract (in Korean) 109
Acknowledgements 111
- …