Efficient and Compact Representations of Some Non-canonical Prefix-Free Codes
The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-46049-9_5
[Abstract] For many kinds of prefix-free codes there are efficient and compact alternatives to the traditional tree-based representation. Since these put the codes into canonical form, however, they can only be used when we can choose the order in which codewords are assigned to characters. In this paper we first show how, given a probability distribution over an alphabet of σ characters, we can store a nearly optimal alphabetic prefix-free code in o(σ) bits such that we can encode and decode any character in constant time. We then consider a kind of code introduced recently to reduce the space usage of wavelet matrices (Claude, Navarro, and Ordóñez, Information Systems, 2015). They showed how to build an optimal prefix-free code such that the codewords' lengths are non-decreasing when they are arranged such that their reverses are in lexicographic order. We show how to store such a code in O(σ log L + 2^(εL)) bits, where L is the maximum codeword length and ε is any positive constant, such that we can encode and decode any character in constant time under reasonable assumptions. Otherwise, we can always encode and decode a codeword of ℓ bits in time O(ℓ) using O(σ log L) bits of space.
Funding: Ministerio de Economía, Industria y Competitividad (TIN2013-47090-C3-3-P; TIN2015-69951-R; ITC-20151305; ITC-20151247); Xunta de Galicia (GRC2013/053); Chile, Núcleo Milenio Información y Coordinación en Redes (ICM/FIC.P10-024F); COST (IC1302); Academy of Finland (268324; 25034)
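To make the canonical form mentioned above concrete, here is a minimal Python sketch (ours, not the paper's o(σ)-bit structure) of how canonical codewords can be assigned from the codeword lengths alone, with no explicit code tree:

```python
def canonical_codes(lengths):
    """Assign canonical codewords, as (code, length) pairs, to symbols
    0..n-1, given each symbol's codeword length in symbol order."""
    # Canonical form: process symbols sorted by (length, symbol id).
    order = sorted(range(len(lengths)), key=lambda s: (lengths[s], s))
    codes = [None] * len(lengths)
    code, prev_len = 0, 0
    for s in order:
        code <<= (lengths[s] - prev_len)  # extend the code when length grows
        codes[s] = (code, lengths[s])
        code += 1                          # next codeword of the same length
        prev_len = lengths[s]
    return codes

# Example: lengths 1, 2, 3, 3 yield the prefix-free code 0, 10, 110, 111.
print(canonical_codes([1, 2, 3, 3]))  # [(0, 1), (2, 2), (6, 3), (7, 3)]
```

Because the codewords are determined by the lengths alone, an encoder or decoder only needs one length per symbol plus a few per-length counters, which is exactly the property that the compact representations discussed above exploit.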
New Algorithms and Lower Bounds for Sequential-Access Data Compression
This thesis concerns sequential-access data compression, i.e., compression by algorithms
that read the input one or more times from beginning to end. In one chapter we
consider adaptive prefix coding, for which we must read the input character by
character, outputting each character's self-delimiting codeword before reading
the next one. We show how to encode and decode each character in constant
worst-case time while producing an encoding whose length is worst-case optimal.
In another chapter we consider one-pass compression with memory bounded in
terms of the alphabet size and context length, and prove a nearly tight
tradeoff between the amount of memory we can use and the quality of the
compression we can achieve. In a third chapter we consider compression in the
read/write streams model, which allows a number of passes and an amount of memory both
polylogarithmic in the size of the input. We first show how to achieve
universal compression using only one pass over one stream. We then show that
one stream is not sufficient for achieving good grammar-based compression.
Finally, we show that two streams are necessary and sufficient for achieving
entropy-only bounds.
Comment: draft of PhD thesis
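As a rough illustration of adaptive prefix coding (a toy sketch of ours, not the thesis's constant-worst-case-time scheme), one can rank symbols by their running frequency and emit each symbol's current rank with a self-delimiting code such as Elias gamma, so the decoder always knows where one codeword ends and the next begins:

```python
def elias_gamma(n):
    """Elias gamma code for n >= 1: (len(bin)-1) zeros, then the binary
    form of n. Self-delimiting, so codewords need no separators."""
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

def encode(text):
    """Toy adaptive coder: emit each symbol's current frequency rank.
    (A real coder would also transmit the raw symbol on its first
    occurrence via an escape mechanism; we omit that here.)"""
    counts = {}
    ranking = []  # symbols in decreasing order of count
    out = []
    for ch in text:
        if ch not in counts:
            ranking.append(ch)
            counts[ch] = 0
        r = ranking.index(ch) + 1          # current 1-based rank
        out.append(elias_gamma(r))
        counts[ch] += 1
        # Stable sort keeps earlier symbols first among ties.
        ranking.sort(key=lambda s: -counts[s])
    return "".join(out)
```

Frequent symbols drift toward rank 1 and get short codewords, which is the adaptive effect; the thesis's contribution is achieving this with constant worst-case time per character and worst-case-optimal encoding length, which this naive sketch does not.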
Postings List Compression with Run-length and Zombit Encodings
An inverted index is a core data structure underlying many systems, such as search engines and databases.
It stores a mapping from terms, numbers, etc. to their locations (a document, a set of documents, a database table, etc.) and allows efficient full-text searches over the indexed data.
The list of locations for a term in an inverted index is usually called a postings list.
In real-life applications, inverted indices can grow huge.
Therefore an efficient representation is needed, but at the same time, efficient queries must be supported.
This thesis explores ways to represent postings lists efficiently while allowing efficient nextGEQ queries on the underlying set.
Efficient nextGEQ queries (next greater or equal) are needed to implement an inverted index.
First we convert the postings lists into one bitvector, which concatenates each postings list's characteristic bitvector.
Representing an integer set efficiently then reduces to representing this bitvector efficiently, and the bitvector is expected to have long runs of 0s and 1s.
Run-length encoding of bitvectors has recently led to promising results.
Therefore, in this thesis we experiment with two encoding methods (Top-k Hybrid coder and RLZ) that encode postings lists via run-length encodings of the bitvector.
We also investigate another new bitvector compression method (the Zombit-vector), which encodes bitvectors by exploiting redundancies in runs of 0s and 1s.
We compare all encodings to the current state-of-the-art Partitioned Elias-Fano (PEF) coding.
In our experiments, all of the encodings compressed more effectively than PEF.
The Zombit-vector's nextGEQ queries were slightly faster than PEF's, which makes it attractive for bitvectors that have long runs of 0s and 1s.
More work is needed on the Top-k Hybrid coder and RLZ before their nextGEQ performance can be compared to the Zombit-vector and PEF.
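The pipeline described above can be sketched as follows (a minimal illustration with our own function names, not the thesis's encoders): build the characteristic bitvector of a postings list, store it as runs, and answer nextGEQ by walking the runs. Real implementations replace the linear scan with sampled or indexed structures.

```python
def to_runs(postings, universe):
    """Characteristic bitvector of `postings` over [0, universe),
    stored run-length encoded as (bit, run_length) pairs."""
    bits = [0] * universe
    for p in postings:
        bits[p] = 1
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1] = (b, runs[-1][1] + 1)
        else:
            runs.append((b, 1))
    return runs

def next_geq(runs, x):
    """nextGEQ(x): smallest set element >= x, or None if there is none."""
    pos = 0
    for bit, length in runs:
        if bit == 1 and pos + length > x:
            return max(pos, x)  # first 1-bit at or after x in this run
        pos += length
    return None

runs = to_runs([2, 3, 4, 9], 12)
# runs == [(0, 2), (1, 3), (0, 4), (1, 1), (0, 2)]
print(next_geq(runs, 5))  # 9
```

With long runs, the run list is far shorter than the bitvector, which is the redundancy that run-length, Zombit, and related encodings exploit; the engineering challenge the thesis addresses is supporting nextGEQ quickly on top of the compressed form.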
The Many Qualities of a New Directly Accessible Compression Scheme
We present a new variable-length computation-friendly encoding scheme, named
SFDC (Succinct Format with Direct aCcessibility), that supports direct and fast
accessibility to any element of the compressed sequence and achieves
compression ratios often higher than those offered by other solutions in the
literature. The SFDC scheme provides a flexible and simple representation
geared towards either practical efficiency or compression ratios, as required.
For a text of length over an alphabet of size and a fixed
parameter, the access time of the proposed encoding is proportional
to the length of the character's codeword, plus an expected
overhead, where
is the -th number of the Fibonacci sequence. Overall, it uses
bits, where is the length of the encoded string.
Experimental results show that the performance of our scheme is, in some
respects, comparable with that of DACs and Wavelet Trees, which are
among the most efficient schemes. In addition, our scheme is configured as a
\emph{computation-friendly compression} scheme, as it offers several features
that make it very effective in text processing tasks. In the string matching
problem, which we take as a case study, we experimentally show that the new
scheme yields results up to 29 times faster than standard
string-matching techniques on plain texts.
Comment: 33 pages
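For readers unfamiliar with direct access to variable-length codes, a common baseline (an illustrative sketch of ours, not SFDC itself) samples the bit offset of every k-th codeword, so accessing element i decodes at most k codewords from the nearest sample:

```python
class SampledVarLen:
    """Variable-length bit sequence with sampled offsets for direct access."""

    def __init__(self, codewords, k=4):
        # codewords: list of bit strings, one per element, in order.
        self.bits = "".join(codewords)
        self.k = k
        self.samples = []  # bit offset of every k-th codeword
        off = 0
        for i, cw in enumerate(codewords):
            if i % k == 0:
                self.samples.append(off)
            off += len(cw)

    def access(self, i, decode_one):
        """Return element i; decode_one(bits, off) -> (value, new_off)."""
        off = self.samples[i // self.k]
        for _ in range(i % self.k):        # skip at most k-1 codewords
            _, off = decode_one(self.bits, off)
        value, _ = decode_one(self.bits, off)
        return value

def unary_decode(bits, off):
    """Decode one unary codeword: n zeros then a 1 encodes the value n."""
    j = bits.index("1", off)
    return (j - off, j + 1)

enc = SampledVarLen(["0001", "1", "001", "000001", "01"], k=2)
print(enc.access(3, unary_decode))  # 5
```

The tradeoff is classic: larger k means fewer sampled offsets (less space) but more sequential decoding per access; schemes like DACs and, per the abstract, SFDC restructure the codewords themselves so access cost depends on the codeword's own length rather than on a sampling parameter.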