Data compression, the reduction in size of the physical representation
of data being stored or transmitted, has long been of interest both as a research topic and as a practical technique. Different methods are used
for encoding different classes of data files. The purpose of this research
is to compress a class of highly redundant data files whose contents are
partially described by a context-free grammar (i.e. text files containing
computer programs).
An encoding technique is developed for the removal of structural
dependency due to the context-free structure of such files. The technique
depends on a type of LR parsing method called LALR(k) (lookahead LR).
The encoder also pays particular attention to the encoding of editing
characters, comments, names and constants.
The encoded data maintains the exact information content of the
original data. Hence, a decoding technique (depending on the same
parsing method) is developed to recover the original information from
its compressed representation.
The technique is demonstrated by compressing Pascal programs. An
optimal coding scheme (based on Huffman codes) is used to encode the
parsing alternatives in each parsing state. The decoder uses these codes
during the decoding phase. Huffman codes, based on the probabilities
of the symbols concerned, are also used when coding editing characters,
comments, names and constants. The sizes of the parsing tables (and
subsequently the encoding tables) were considerably reduced by splitting
them into a number of sub-tables.
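
As an illustration of the per-state coding idea, the following is a minimal
sketch, not the encoder developed in this work. It builds one Huffman code
table per parsing state from counts of how often each parsing alternative is
taken. The state names, alternative names and counts are hypothetical, and
the choice to spend zero bits in a state with a single alternative is an
assumption of the sketch.

```python
import heapq
from itertools import count


def huffman_code(freqs):
    """Return {symbol: bitstring} for a dict mapping symbol -> positive weight."""
    if len(freqs) == 1:
        # Assumption: a state with only one alternative needs no bits.
        return {next(iter(freqs)): ""}
    tiebreak = count()  # unique counter so the heap never compares dicts
    heap = [(w, next(tiebreak), {sym: ""}) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        w0, _, c0 = heapq.heappop(heap)
        w1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c0.items()}
        merged.update({s: "1" + b for s, b in c1.items()})
        heapq.heappush(heap, (w0 + w1, next(tiebreak), merged))
    return heap[0][2]


# Hypothetical counts of the alternatives chosen in two parser states.
state_counts = {
    "statement": {"assignment": 50, "if": 20, "while": 15, "call": 10, "case": 5},
    "sign": {"none": 1},
}

# One code table per state; the decoder would rebuild or share the same tables.
tables = {state: huffman_code(c) for state, c in state_counts.items()}


def encode(decisions):
    """Concatenate the codes for a sequence of (state, alternative) pairs."""
    return "".join(tables[state][alt] for state, alt in decisions)


bits = encode([("statement", "assignment"), ("sign", "none"), ("statement", "if")])
```

Because each state has its own table, frequent alternatives in a state get
short codes regardless of how rare they are elsewhere in the grammar.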
The minimum and the average code lengths of an average program are
derived from two different matrices. These matrices are constructed
from a probabilistic grammar and the language generated by this grammar.
Finally, various comparisons are made with a related encoding method by
using a simple context-free language.