On Compressing Collections of Substring Samples

Abstract

Publisher Copyright: © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Given a string X = X[1..n] of length n, and integers m and s, such that n > m ≥ 2s > 0, we consider the problem of compressing the string S formed by concatenating the substrings of X of length m starting at positions i ≡ 1 (mod s). In particular, we provide an upper bound of (2n − m)/s + 2z + (m − s) on the size of the Lempel-Ziv (LZ77) parsing of S, where z is the size of the parsing of X. We also show that a related bound holds regardless of the order in which the substrings are concatenated in the formation of S. If X is viewed as a genome sequence, the above substring sampling process corresponds to an idealized model of short-read DNA sequencing.

Peer reviewed
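
To make the sampling process concrete, the following is a minimal Python sketch, not the authors' implementation. The function names are hypothetical, and the parser assumes one common LZ77 variant (greedy, with self-overlapping copies allowed and a single new character as the fallback phrase); the paper's exact definition of the parsing may differ, so phrase counts computed here are illustrative only.

```python
def sample_substrings(x, m, s):
    """Length-m substrings of x starting at 1-based positions i = 1 (mod s).

    In 0-based terms these are the start offsets 0, s, 2s, ... for which a
    full window of length m still fits inside x.
    """
    return [x[i:i + m] for i in range(0, len(x) - m + 1, s)]


def lz77_size(t):
    """Number of phrases in a naive greedy LZ77 parse of t.

    Assumed variant: each phrase is the longest prefix of the remaining
    text that also starts at an earlier position (overlap allowed), or a
    single new character if no such prefix exists. Quadratic or worse in
    time, intended for small examples only.
    """
    n = len(t)
    phrases = 0
    i = 0
    while i < n:
        longest = 0
        for j in range(i):  # candidate earlier start positions
            k = 0
            while i + k < n and t[j + k] == t[i + k]:
                k += 1
            longest = max(longest, k)
        i += max(longest, 1)  # copy phrase, or one literal character
        phrases += 1
    return phrases


def upper_bound(n, m, s, z):
    """The bound stated in the abstract: (2n - m)/s + 2z + (m - s)."""
    return (2 * n - m) / s + 2 * z + (m - s)


if __name__ == "__main__":
    x = "abcdefgh"          # toy input; n = 8
    m, s = 4, 2             # satisfies n > m >= 2s > 0
    big_s = "".join(sample_substrings(x, m, s))
    print(big_s, lz77_size(big_s), upper_bound(len(x), m, s, lz77_size(x)))
```

Concatenating the samples in left-to-right order, as done in the usage lines, corresponds to the first bound in the abstract; shuffling the list before joining would model the order-independent setting.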
