An Approximation to the Greedy Algorithm for Differential Compression of Very Large Files
Abstract
We present a new differential compression algorithm that combines the hash-value and suffix-array techniques of previous work. Differential compression refers to encoding a file (a version file) as a set of changes with respect to another file (a reference file). Previous differential compression algorithms can be shown empirically to run in linear time, but they have a notable drawback: they do not find the best match for every offset of the version file. Our algorithm finds the best match for every offset of the version file, with respect to a certain granularity (or block size) and above a certain length threshold. It has two variations, depending on how the block size is chosen. If we keep the block size fixed, we show that the compression performance of our algorithm is similar to that of the greedy algorithm, without the greedy algorithm's expensive space and time requirements. If we vary the block size linearly with the reference file size, we show that our algorithm can run in linear time and constant space to compress very large files. Our algorithm combines hashing sections of the files to obtain footprints with the use of suffix arrays to find the longest match. We also show empirically that our algorithm compresses better than xdelta [7], vcdiff [3], and the work of Ajtai et al. [1] in most cases, and runs faster than vcdiff and the work of Ajtai et al.
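To make the block-footprint idea concrete, the following is a minimal, hedged sketch (not the authors' algorithm): it hashes every length-`b` block of the reference file into a footprint table, then scans the version file, looking up each block's footprint and extending the longest candidate match byte by byte. The function names (`encode`), the default granularity `b = 16`, and the use of simple byte-by-byte extension in place of the paper's suffix-array longest-match step are illustrative assumptions.

```python
# Sketch of block-footprint differential compression (illustrative only, not the
# paper's algorithm): footprints of reference blocks at granularity b, COPY/ADD
# commands for the version file, and simple match extension instead of suffix arrays.

from collections import defaultdict

def encode(reference: bytes, version: bytes, b: int = 16):
    """Encode `version` as COPY/ADD commands against `reference`."""
    # Footprint table: hash of each length-b reference block -> offsets where it occurs.
    footprints = defaultdict(list)
    for r in range(0, len(reference) - b + 1):
        footprints[hash(reference[r:r + b])].append(r)

    delta, v, literal_start = [], 0, 0
    while v + b <= len(version):
        candidates = footprints.get(hash(version[v:v + b]), [])
        best_off, best_len = -1, 0
        for r in candidates:
            # Verify the candidate block actually matches, then extend as far as possible.
            length = 0
            while (r + length < len(reference) and v + length < len(version)
                   and reference[r + length] == version[v + length]):
                length += 1
            if length > best_len:
                best_off, best_len = r, length
        if best_len >= b:
            if literal_start < v:                       # flush pending literal bytes
                delta.append(("ADD", version[literal_start:v]))
            delta.append(("COPY", best_off, best_len))  # copy a run from the reference
            v += best_len
            literal_start = v
        else:
            v += 1                                      # no usable match; keep scanning
    if literal_start < len(version):
        delta.append(("ADD", version[literal_start:]))
    return delta

if __name__ == "__main__":
    ref = b"the quick brown fox jumps over the lazy dog" * 4
    ver = ref[:60] + b" INSERTED TEXT " + ref[60:]
    print(encode(ref, ver))
```

In this sketch the granularity `b` plays the role of the block size discussed above: keeping it fixed mimics the greedy-like variant, while growing it with the reference file size bounds the footprint table, which is the intuition behind the constant-space variant.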