A simple online competitive adaptation of Lempel-Ziv compression with efficient random access support
We present a simple adaptation of the Lempel-Ziv '78 (LZ78) compression
scheme ({\em IEEE Transactions on Information Theory, 1978}) that supports
efficient random access to the input string. Namely, given query access to the
compressed string, it is possible to efficiently recover any symbol of the
input string. The compression algorithm is given as input a parameter \eps
>0, and with very high probability increases the length of the compressed
string by at most a factor of (1+\eps). The access time is O(\log n +
1/\eps^2) in expectation, and O(\log n/\eps^2) with high probability. The
scheme relies on sparse transitive-closure spanners. Any (consecutive)
substring of the input string can be retrieved at an additional additive cost
in the running time of the length of the substring. We also formally establish
the necessity of modifying LZ78 so as to allow efficient random access.
Specifically, we construct a family of strings for which
queries to the LZ78-compressed string are required in order to recover a single
symbol in the input string. The main benefit of the proposed scheme is that it
preserves the online nature and simplicity of LZ78, and that for {\em every}
input string, the length of the compressed string is only a small factor larger
than that obtained by running LZ78.
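To make the random-access difficulty concrete, here is a minimal sketch of plain, unmodified LZ78 (not the adapted scheme from the abstract). The function names `lz78_compress`, `phrase_lengths`, and `access` are illustrative assumptions. Recovering one symbol requires locating its phrase and then following a chain of parent pointers, which is the cost the abstract's scheme is designed to bound.

```python
def lz78_compress(text):
    """Return LZ78 phrases as (parent_phrase_index, symbol); index 0 is the empty phrase."""
    dictionary = {"": 0}
    phrases = []
    current = ""
    for ch in text:
        if current + ch in dictionary:
            current += ch  # keep extending the longest known phrase
        else:
            phrases.append((dictionary[current], ch))
            dictionary[current + ch] = len(dictionary)
            current = ""
    if current:
        # The input ended inside a known phrase; emit it via its own prefix.
        phrases.append((dictionary[current[:-1]], current[-1]))
    return phrases


def phrase_lengths(phrases):
    """lengths[k] is the length of phrase k (phrase 0 is empty)."""
    lengths = [0]
    for parent, _ in phrases:
        lengths.append(lengths[parent] + 1)
    return lengths


def access(phrases, i):
    """Recover symbol i of the original text without full decompression."""
    lengths = phrase_lengths(phrases)
    start = 0
    for k in range(1, len(phrases) + 1):
        if i < start + lengths[k]:
            offset = i - start                # position inside phrase k
            steps = lengths[k] - 1 - offset   # parent pointers to follow
            node = k
            for _ in range(steps):
                node = phrases[node - 1][0]
            return phrases[node - 1][1]
        start += lengths[k]
    raise IndexError(i)
```

In this sketch both the scan for the phrase and the pointer walk are linear in the worst case; a phrase-start index would remove the scan, but the pointer chain remains.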
NEEXP is Contained in MIP*
We study multiprover interactive proof systems. The power of classical multiprover interactive proof systems, in which the provers do not share entanglement, was characterized in a famous work by Babai, Fortnow, and Lund (Computational Complexity 1991), whose main result was the equality MIP = NEXP. The power of quantum multiprover interactive proof systems, in which the provers are allowed to share entanglement, has proven to be much more difficult to characterize. The best known lower bound on MIP* is NEXP ⊆ MIP*, due to Ito and Vidick (FOCS 2012). As for upper bounds, MIP* could be as large as RE, the class of recursively enumerable languages.
The main result of this work is the inclusion NEEXP = NTIME[2^(2^poly(n))] ⊆ MIP*. This is an exponential improvement over the prior lower bound and shows that proof systems with entangled provers are at least exponentially more powerful than classical provers. In our protocol the verifier delegates a classical, exponentially large MIP protocol for NEEXP to two entangled provers: the provers obtain their exponentially large questions by measuring their shared state, and use a classical PCP to certify the correctness of their exponentially long answers. For the soundness of our protocol, it is crucial that each player should not only sample its own question correctly but also avoid performing measurements that would reveal the other player's sampled question. We ensure this by commanding the players to perform a complementary measurement, relying on the Heisenberg uncertainty principle to prevent the forbidden measurements from being performed.
Space-efficient Representation of Semi-structured Document Formats Using Succinct Data Structures
Dissertation (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2021. 2. Srinivasa Rao Satti.
Large volumes of data are generated from a plethora of sources. Most of the data stored as files do not follow a fixed schema, so the files are well suited to semi-structured document formats. A number of such formats, including XML (eXtensible Markup Language), JSON (JavaScript Object Notation), and YAML (YAML Ain't Markup Language), have been proposed to preserve the hierarchy of the original data. Several data models that structure the gathered data, including RDF (Resource Description Framework), depend on semi-structured document formats to be serialized and transferred for later processing.
Since semi-structured document formats favor readability and verbosity, extra space is needed to organize and maintain the documents. General-purpose compression schemes are widely used to compact the documents, but applying those algorithms hinders later handling of the data because the internal structure is lost.
Succinct data structures are widely studied in theory: they answer queries while the encoded data occupy space close to the information-theoretic lower bound. Bit vectors and trees are the best-known succinct data structures. Nevertheless, there have been few attempts to apply succinct data structures to represent semi-structured documents in a space-efficient manner.
In this dissertation we propose a unified, space-efficient representation of various semi-structured document formats. The core strengths of this representation are its compactness and queryability, both derived from the rich functionality of succinct data structures. The compact representation combines (a) bit string indexed arrays, (b) succinct ordinal trees, and (c) compression techniques. We implement this representation and show experimentally that it decreases disk usage by up to 60% while occupying 90% less RAM. We also support processing a document in chunks, which allows larger corpora to be handled even in constrained environments.
Alongside the compact semi-structured document representation, this dissertation also proposes and strengthens several compression schemes. We first propose a way to encode an array of integers that is not necessarily sorted. This scheme improves upon existing universal code systems with the aid of a succinct bit vector structure. We show that the proposed algorithm reduces space usage by up to 44% while taking 15% less time than the original code system, and that it additionally supports random access to elements of the encoded array.
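One way to picture how a bit vector can add random access to a universal code is sketched below. This is an illustrative assumption, not the dissertation's algorithm: it uses Elias gamma as the universal code, a marker bit vector with a 1 at the start of each codeword, and a linear-scan `select1` standing in for a succinct select structure. All function names are hypothetical.

```python
def elias_gamma(n):
    """Elias gamma code of a positive integer, as a string of '0'/'1'."""
    if n < 1:
        raise ValueError("Elias gamma encodes positive integers only")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def encode(values):
    """Concatenate the codewords and mark where each one starts."""
    codes = [elias_gamma(v) for v in values]
    stream = "".join(codes)
    markers = ["0"] * len(stream)
    pos = 0
    for code in codes:
        markers[pos] = "1"  # a 1 marks the first bit of a codeword
        pos += len(code)
    return stream, "".join(markers)


def select1(markers, i):
    """Position of the i-th (0-based) set bit.

    A linear scan is used here for clarity; a succinct bit vector
    would answer the same query with a small auxiliary index.
    """
    seen = -1
    for pos, bit in enumerate(markers):
        if bit == "1":
            seen += 1
            if seen == i:
                return pos
    raise IndexError(i)


def access(stream, markers, i):
    """Decode only the i-th integer of the encoded array."""
    start = select1(markers, i)
    zeros = 0
    while stream[start + zeros] == "0":
        zeros += 1
    return int(stream[start + zeros:start + 2 * zeros + 1], 2)
```

The array does not need to be sorted, since each element is coded independently and located through the marker vector.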
We also improve the SBH bitmap index compression algorithm. The main strength of this scheme is its use of an intermediate super-bucket during operations, which gives better performance when querying through a combination of compressed bitmap indexes. Inspired by the splits performed during the intermediate stage of the SBH algorithm, we give an improved compression mechanism supporting parallelism that can be used on both CPUs and GPUs. We show by experiments that the CPU parallel optimization reduces compression and decompression times by up to 38% on a 4-core machine without modifying the compressed bitmap form. For GPUs, the new algorithm gives 48% faster query processing in the experiments, compared to previously existing bitmap index compression schemes.
Chapter 1 Introduction
1.1 Contribution
1.2 Organization
Chapter 2 Background
2.1 Model of Computation
2.2 Succinct Data Structures
Chapter 3 Space-efficient Representation of Integer Arrays
3.1 Introduction
3.2 Preliminaries
3.2.1 Universal Code System
3.2.2 Bit Vector
3.3 Algorithm Description
3.3.1 Main Principle
3.3.2 Optimization in the Implementation
3.4 Experimental Results
Chapter 4 Space-efficient Parallel Compressed Bitmap Index Processing
4.1 Introduction
4.2 Related Work
4.2.1 Byte-aligned Bitmap Code (BBC)
4.2.2 Word-Aligned Hybrid (WAH)
4.2.3 WAH-derived Algorithms
4.2.4 GPU-based WAH Algorithms
4.2.5 Super Byte-aligned Hybrid (SBH)
4.3 Parallelizing SBH
4.3.1 CPU Parallelism
4.3.2 GPU Parallelism
4.4 Experimental Results
4.4.1 Plain Version
4.4.2 Parallelized Version
4.4.3 Summary
Chapter 5 Space-efficient Representation of Semi-structured Document Formats
5.1 Preliminaries
5.1.1 Semi-structured Document Formats
5.1.2 Resource Description Framework
5.1.3 Succinct Ordinal Tree Representations
5.1.4 String Compression Schemes
5.2 Representation
5.2.1 Bit String Indexed Array
5.2.2 Main Structure
5.2.3 Single Document as a Collection of Chunks
5.2.4 Supporting Queries
5.3 Experimental Results
5.3.1 Datasets
5.3.2 Construction Time
5.3.3 RAM Usage during Construction
5.3.4 Disk Usage and Serialization Time
5.3.5 Chunk Division
5.3.6 String Compression
5.3.7 Query Time
Chapter 6 Conclusion
Bibliography
Abstract (in Korean)
Acknowledgements
LeCo: Lightweight Compression via Learning Serial Correlations
Lightweight data compression is a key technique that allows column stores to
exhibit superior performance for analytical queries. Despite a comprehensive
study on dictionary-based encodings to approach Shannon's entropy, few prior
works have systematically exploited the serial correlation in a column for
compression. In this paper, we propose LeCo (i.e., Learned Compression), a
framework that uses machine learning to remove the serial redundancy in a value
sequence automatically to achieve an outstanding compression ratio and
decompression performance simultaneously. LeCo presents a general approach to
this end, making existing (ad-hoc) algorithms such as Frame-of-Reference (FOR),
Delta Encoding, and Run-Length Encoding (RLE) special cases under our
framework. Our microbenchmark with three synthetic and six real-world data sets
shows that a prototype of LeCo achieves a Pareto improvement on both
compression ratio and random access speed over the existing solutions. When
integrating LeCo into widely-used applications, we observe up to a 3.9x speedup
in filter-scanning a Parquet file and a 16% increase in RocksDB's throughput.
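The idea of treating Frame-of-Reference as a special case of a model-plus-residual encoding can be sketched as follows. This is a simplified illustration under stated assumptions, not LeCo's implementation: the "model" is a two-point line through the first and last value (in integer arithmetic) rather than a learned regressor, residuals are kept in a Python list rather than bit-packed, and the names `encode`, `predict`, and `access` are hypothetical. The input is assumed to be a non-empty list of integers.

```python
def predict(block, i):
    """Model prediction for position i, using integer arithmetic only."""
    return block["intercept"] + (block["num"] * i) // block["den"]


def encode(values, constant_model=False):
    """Encode a non-empty integer sequence as a line plus small residuals.

    With constant_model=True the slope is forced to zero, and the scheme
    reduces to Frame-of-Reference: each value is stored as an offset from
    the minimum using a fixed bit width.
    """
    n = len(values)
    block = {
        "intercept": values[0],
        "num": 0 if constant_model else values[-1] - values[0],
        "den": max(n - 1, 1),
    }
    residuals = [v - predict(block, i) for i, v in enumerate(values)]
    block["base"] = min(residuals)                    # shift residuals to be >= 0
    block["deltas"] = [r - block["base"] for r in residuals]
    block["width"] = max(block["deltas"]).bit_length()  # bits per stored delta
    return block


def access(block, i):
    """Random access: one model evaluation plus one stored delta."""
    return predict(block, i) + block["base"] + block["deltas"][i]
```

On a sequence with a steady trend, such as `[1000, 1003, 1005, 1009, 1012, 1014]`, the line absorbs the trend and the per-value width drops from 4 bits (constant model) to 1 bit in this sketch.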
A Space-Optimal Grammar Compression
A grammar compression is a context-free grammar (CFG) deriving a single string deterministically. For an input string of length N over an alphabet of size sigma, the smallest CFG is O(log N)-approximable in the offline setting and O(log N log^* N)-approximable in the online setting. In addition, an information-theoretic lower bound for representing a CFG in Chomsky normal form of n variables is log (n!/n^sigma) + n + o(n) bits. Although there is an online grammar compression algorithm that directly computes the succinct encoding of its output CFG with an O(log N log^* N) approximation guarantee, the problem of optimizing its working space has remained open. We propose a fully-online algorithm whose working space in bits is asymptotically equal to the lower bound and whose compression time is O(N log log n). In addition, we propose several techniques to boost grammar compression and show their efficiency by computational experiments.
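As a small illustration of the definition only (a CFG in Chomsky normal form that derives exactly one string), the sketch below expands such a grammar and reads a single symbol from it without expanding. It does not implement the online algorithm or the succinct encoding described in the abstract; the rule names and helper functions are hypothetical.

```python
def expand(symbol, rules):
    """Derive the full string from a symbol; terminals are symbols without a rule."""
    if symbol not in rules:
        return symbol
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


def derived_lengths(rules):
    """Length of the string derived by each variable, computed once with memoization."""
    lengths = {}

    def length(symbol):
        if symbol not in rules:
            return 1  # a terminal derives one character
        if symbol not in lengths:
            left, right = rules[symbol]
            lengths[symbol] = length(left) + length(right)
        return lengths[symbol]

    for variable in rules:
        length(variable)
    return lengths


def access(symbol, i, rules, lengths):
    """Return character i of the derived string by descending the parse tree."""
    while symbol in rules:
        left, right = rules[symbol]
        left_len = lengths.get(left, 1)
        if i < left_len:
            symbol = left
        else:
            i -= left_len
            symbol = right
    return symbol


# Four rules derive the 10-character string "ababababab".
rules = {
    "X1": ("a", "b"),
    "X2": ("X1", "X1"),
    "X3": ("X2", "X2"),
    "X4": ("X3", "X1"),
}
```

Each rule has exactly two symbols on its right-hand side and each variable has exactly one rule, so the derivation is deterministic and the grammar can be much smaller than the string it derives.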
GraCT: A Grammar based Compressed representation of Trajectories
We present a compressed data structure to store free trajectories of moving
objects (ships over the sea, for example) allowing spatio-temporal queries. Our
method, GraCT, uses a k^2-tree to store the absolute positions of all objects
at regular time intervals (snapshots), whereas the positions between snapshots
are represented as logs of relative movements compressed with Re-Pair. Our
experimental evaluation shows important savings in space and time with respect
to a fair baseline.
Comment: This research has received funding from the European Union's Horizon
2020 research and innovation programme under the Marie Skłodowska-Curie
Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094