An empirical study on Chinese text compression: from character-based to word-based approach.
Authors
Kwok-Shing Cheng
Publication date
1 January 1997
Publisher
Abstract
By Kwok-Shing Cheng.
Thesis (M.Phil.)--Chinese University of Hong Kong, 1997.
Includes bibliographical references (leaves 114-120).

Abstract
Acknowledgement
Chapter 1 --- Introduction
  Chapter 1.1 --- Importance of Text Compression
  Chapter 1.2 --- Motivation of this Research
  Chapter 1.3 --- Characteristics of Chinese
    Chapter 1.3.1 --- Huge size of character set
    Chapter 1.3.2 --- Lack of word segmentation
    Chapter 1.3.3 --- Rich semantics
  Chapter 1.4 --- Different Coding Schemes for Chinese
    Chapter 1.4.1 --- Big5 Code
    Chapter 1.4.2 --- GB (Guo Biao) Code
    Chapter 1.4.3 --- HZ (Hanzi) Code
    Chapter 1.4.4 --- Unicode Code
  Chapter 1.5 --- Modeling and Coding for Chinese Text
  Chapter 1.6 --- Static and Adaptive Modeling
  Chapter 1.7 --- One-Pass and Two-Pass Modeling
  Chapter 1.8 --- Ordering of models
  Chapter 1.9 --- Two Sets of Benchmark Files and the Platform
  Chapter 1.10 --- Outline of the Thesis
Chapter 2 --- A Survey of Chinese Text Compression
  Chapter 2.1 --- Entropy for Chinese Text
  Chapter 2.2 --- Weakness of Traditional Compression Algorithms on Chinese Text
  Chapter 2.3 --- Statistical Class Algorithms for Compressing Chinese
    Chapter 2.3.1 --- Huffman coding scheme
    Chapter 2.3.2 --- Arithmetic Coding Scheme
    Chapter 2.3.3 --- Restricted Variable Length Coding Scheme
  Chapter 2.4 --- Dictionary-based Class Algorithms for Compressing Chinese
  Chapter 2.5 --- Experiments and Results
  Chapter 2.6 --- Chapter Summary
Chapter 3 --- Indicator Dependent Huffman Coding Scheme
  Chapter 3.1 --- Chinese Character Identification Routine
  Chapter 3.2 --- Reduction of Header Size
  Chapter 3.3 --- Semi-adaptive IDC for Chinese Text
    Chapter 3.3.1 --- Theoretical Analysis of Partition Technique for Compression
    Chapter 3.3.2 --- Experiments and Results of the Semi-adaptive IDC
  Chapter 3.4 --- Adaptive IDC for Chinese Text
    Chapter 3.4.1 --- Experiments and Results of the Adaptive IDC
  Chapter 3.5 --- Chapter Summary
Chapter 4 --- Cascading LZ Algorithms with Huffman Coding Schemes
  Chapter 4.1 --- Variations of Huffman Coding Scheme
    Chapter 4.1.1 --- Analysis of EPDC and PDC
    Chapter 4.1.2 --- Analysis of PDC, 16Huff and IDC
    Chapter 4.1.3 --- Time and Memory Consumption
  Chapter 4.2 --- Cascading LZSS with PDC, 16Huff and IDC
    Chapter 4.2.1 --- Experimental Results
  Chapter 4.3 --- Cascading LZW with PDC, 16Huff and IDC
    Chapter 4.3.1 --- Experimental Results
  Chapter 4.4 --- Chapter Summary
Chapter 5 --- Applying Compression Algorithms to Word-segmented Chinese Text
  Chapter 5.1 --- Background of word-based compression algorithms
  Chapter 5.2 --- Terminology and Benchmark Files for Word Segmentation Model
  Chapter 5.3 --- Word Segmentation Model
  Chapter 5.4 --- Chinese Entropy from Byte to Word
  Chapter 5.5 --- The Generalized Compression and Decompression Model for Word-segmented Chinese text
  Chapter 5.6 --- Applying Huffman Coding Scheme to Word-segmented Chinese text
  Chapter 5.7 --- Applying WLZSSHUF to Word-segmented Chinese text
  Chapter 5.8 --- Applying WLZWHUF to Word-segmented Chinese text
  Chapter 5.9 --- Match Ratio and Compression Ratio
  Chapter 5.10 --- Chapter Summary
Chapter 6 --- Concluding Remarks
  Chapter 6.1 --- Conclusions
  Chapter 6.2 --- Contributions
  Chapter 6.3 --- Future Directions
    Chapter 6.3.1 --- Integrate Decremental Coding Scheme with IDC
    Chapter 6.3.2 --- Re-order the Character Sequences in the Sliding Window of LZSS
    Chapter 6.3.3 --- Multiple Huffman Trees for Word-based Compression
Bibliography