Domain Specific Hierarchical Huffman Encoding
In this paper, we revisit the classical data compression problem for domain
specific texts. It is well known that the classical Huffman algorithm is optimal
among prefix codes and that it compresses at the character level.
Since many data transfers are domain specific, for example, downloads of
lecture notes, web blogs, etc., it is natural to think of data compression in
larger units (i.e., at the word level rather than the character level). Our framework
employs a two-level compression scheme in which the first level identifies
frequent patterns in the text using classical frequent pattern algorithms. The
identified patterns are replaced with special strings, and to achieve a better
compression ratio, each special string is kept shorter than
the pattern it replaces. After this transformation, we apply the classical
Huffman compression algorithm to the resulting text. In
short, the first level compresses at the word level and the second
level at the character level. Interestingly, this two-level compression
technique for domain specific text outperforms the classical Huffman technique. To
support our claim, we present both theoretical and simulation results
for domain specific texts.
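
The two-level idea can be illustrated with a minimal Python sketch. It is not the paper's implementation: plain word counting stands in for the frequent pattern algorithms, the special strings are assumed to be single characters from the Unicode private use area that do not occur in the input, and the cost of transmitting the word-to-symbol dictionary is ignored. All function names and the min_count parameter are hypothetical.

```python
import heapq
from collections import Counter


def huffman_code(text):
    """Level 2: build a Huffman table {character: bitstring} for text."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tie_breaker, partial code table); the
    # tie_breaker keeps the dicts from ever being compared.
    heap = [(w, i, {ch: ""}) for i, (ch, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in t1.items()}
        merged.update({ch: "1" + code for ch, code in t2.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def encoded_bits(text):
    """Number of bits needed to Huffman-encode text (code table excluded)."""
    table = huffman_code(text)
    return sum(len(table[ch]) for ch in text)


def substitute_frequent_words(text, min_count=2):
    """Level 1 (simplified): replace frequent words with shorter symbols.

    Assumption: word counting stands in for a frequent pattern miner, and
    the private use area symbols do not appear in the input text.
    """
    words = text.split(" ")
    mapping = {}
    next_symbol = 0xE000  # start of the Unicode private use area
    for word, count in Counter(words).most_common():
        # Only replace words longer than the one-character symbol.
        if count >= min_count and len(word) > 1:
            mapping[word] = chr(next_symbol)
            next_symbol += 1
    return " ".join(mapping.get(w, w) for w in words), mapping


def restore(transformed, mapping):
    """Invert level 1 by expanding each special symbol back to its word."""
    inverse = {symbol: word for word, symbol in mapping.items()}
    return "".join(inverse.get(ch, ch) for ch in transformed)


if __name__ == "__main__":
    sample = " ".join(["huffman coding"] * 20)
    level1, mapping = substitute_frequent_words(sample)
    print("character-level Huffman only:", encoded_bits(sample), "bits")
    print("word level, then Huffman:    ", encoded_bits(level1), "bits")
    print("lossless:", restore(level1, mapping) == sample)
```

On highly repetitive input such as the sample above, the second figure is smaller because the substituted text has both fewer characters and a smaller alphabet. A fair comparison on real text would also have to count the bits spent on the dictionary and on the Huffman table.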