The Many Qualities of a New Directly Accessible Compression Scheme

Abstract

We present a new variable-length, computation-friendly encoding scheme, named SFDC (Succinct Format with Direct aCcessibility), that supports fast direct access to any element of the compressed sequence and achieves compression ratios often higher than those offered by other solutions in the literature. The SFDC scheme provides a simple, flexible representation that can be geared towards either practical efficiency or compression ratio, as required. For a text of length $n$ over an alphabet of size $\sigma$ and a fixed parameter $\lambda$, the access time of the proposed encoding is proportional to the length of the character's code-word, plus an expected $\mathcal{O}((F_{\sigma - \lambda + 3} - 3)/F_{\sigma+1})$ overhead, where $F_j$ is the $j$-th number of the Fibonacci sequence. Overall, it uses $N + \mathcal{O}\big(n\,(\lambda - (F_{\sigma+3}-3)/F_{\sigma+1})\big) = N + \mathcal{O}(n)$ bits, where $N$ is the length of the encoded string. Experimental results show that the performance of our scheme is, in some respects, comparable with that of DACs and Wavelet Trees, which are among the most efficient schemes. In addition, our scheme qualifies as a computation-friendly compression scheme, as it has several features that make it very effective in text processing tasks. In the string matching problem, which we take as a case study, we show experimentally that the new scheme enables results that are up to 29 times faster than standard string-matching techniques on plain texts.

Comment: 33 pages
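To give a feel for the size of the two Fibonacci terms in the bounds above, the following minimal Python sketch evaluates the expressions inside the $\mathcal{O}(\cdot)$ notation for given $\sigma$ and $\lambda$. It is only an illustration: the function names are hypothetical, the indexing $F_1 = F_2 = 1$ is an assumption (the abstract does not state the convention), and the constants hidden by the asymptotic notation are ignored.

```python
def fib(j: int) -> int:
    """Return the j-th Fibonacci number, assuming F_0 = 0 and F_1 = F_2 = 1.

    The indexing convention is an assumption; the abstract does not fix it.
    """
    a, b = 0, 1  # F_0, F_1
    for _ in range(j):
        a, b = b, a + b
    return a


def access_overhead_term(sigma: int, lam: int) -> float:
    """Evaluate (F_{sigma - lam + 3} - 3) / F_{sigma + 1}.

    This is the expression inside the O(.) of the expected access overhead.
    """
    return (fib(sigma - lam + 3) - 3) / fib(sigma + 1)


def space_overhead_term(sigma: int, lam: int) -> float:
    """Evaluate lam - (F_{sigma + 3} - 3) / F_{sigma + 1}.

    This is the per-character expression inside the O(.) of the space bound.
    """
    return lam - (fib(sigma + 3) - 3) / fib(sigma + 1)


if __name__ == "__main__":
    # Example parameters chosen for illustration only.
    sigma, lam = 256, 5
    print("access overhead term:", access_overhead_term(sigma, lam))
    print("space overhead term per character:", space_overhead_term(sigma, lam))
```

Because $F_{\sigma+3}/F_{\sigma+1}$ approaches $\varphi^2 \approx 2.618$ as $\sigma$ grows (with $\varphi$ the golden ratio), the space term is close to $\lambda - 2.618$ for large alphabets under the assumed indexing, which is consistent with the stated $N + \mathcal{O}(n)$ total for fixed $\lambda$.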
