Consider two random strings having the same length and generated by an iid
sequence taking its values uniformly in a fixed finite alphabet. Artificially
place a long constant block into one of the strings, where a constant block is
a contiguous substring consisting only of one type of symbol. The long block
replaces a segment of equal size, and its length is smaller than the length of
the strings but larger than the square root of that length. We show that for sufficiently
long strings the optimal alignment corresponding to a Longest Common
Subsequence (LCS) treats the inserted block very differently depending on the
size of the alphabet. For two-letter alphabets, the long constant block gets
mainly aligned with the same symbol from the other string, while for three or
more letters the opposite is true and the block gets mainly aligned with gaps.
We further provide simulation results on the proportion of gaps in blocks of
various lengths. In our simulations, the blocks are "regular blocks" in an iid
sequence, and are not artificially inserted. Nonetheless, we observe for these
natural blocks a phenomenon similar to the one shown in the case of
artificially inserted blocks: with two letters, the long blocks get aligned
with a smaller proportion of gaps; for three or more letters, the opposite is
true.
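
As an illustration only, a gap-proportion experiment of this kind can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the simulation code behind the reported results: the function names (`lcs_matched_positions`, `insert_block`, `gap_proportion`, `simulate`), the placement of the block in the middle of the string, and the tie-breaking rule in the traceback are all hypothetical choices. An LCS alignment is generally not unique, so the traceback below recovers just one optimal alignment, and the quadratic-time dynamic program limits it to short strings, far from the asymptotic regime.

```python
import random


def lcs_matched_positions(x, y):
    """Return the indices of x that are matched in ONE optimal LCS alignment."""
    n, m = len(x), len(y)
    # table[i][j] holds the LCS length of the suffixes x[i:] and y[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if x[i] == y[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Forward traceback; ties are broken by skipping a symbol of x first
    # (an arbitrary, assumed convention).
    matched = set()
    i = j = 0
    while i < n and j < m:
        if x[i] == y[j]:
            matched.add(i)
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return matched


def insert_block(x, start, length, symbol=0):
    """Replace x[start:start+length] by a constant block of the same size."""
    return x[:start] + [symbol] * length + x[start + length:]


def gap_proportion(matched, start, length):
    """Fraction of the block's symbols that are aligned with gaps."""
    aligned = sum(1 for i in range(start, start + length) if i in matched)
    return 1 - aligned / length


def simulate(n, k, block_len, seed=0):
    """Gap proportion of an inserted block for one pair of random strings."""
    rng = random.Random(seed)
    x = [rng.randrange(k) for _ in range(n)]  # iid uniform on k letters
    y = [rng.randrange(k) for _ in range(n)]
    start = (n - block_len) // 2  # assumed placement: middle of x
    x = insert_block(x, start, block_len)
    matched = lcs_matched_positions(x, y)
    return gap_proportion(matched, start, block_len)


if __name__ == "__main__":
    for k in (2, 3, 4):
        print(k, simulate(n=400, k=k, block_len=60, seed=1))
```

Averaging `simulate` over many seeds for each alphabet size `k` gives an empirical gap proportion; for naturally occurring blocks, the same `gap_proportion` function can be applied to each maximal run of a single symbol instead of an inserted block.
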
It thus appears that the microscopic natures of two-letter optimal alignments
and three-letter optimal alignments are entirely different from each other.

Comment: To appear: Journal of Statistical Physics