Minimal absent words (MAWs) of a genomic sequence are words that are absent
from the sequence but whose proper subwords are all present in it.
The characteristic distribution of genomic MAWs as a function of their length
has been observed to be qualitatively similar for all living organisms, the
bulk being rather short, and only relatively few being long. It has been an
open issue whether the reason behind this phenomenon is statistical or reflects
a biological mechanism, and what biological information is contained in absent
words. In this work we demonstrate that the bulk can be described by a
probabilistic model of sampling words from random sequences, while the tail of
long MAWs is of biological origin. We introduce the novel concept of MAW cores,
which are the sequences present in the genome that are closest to
a given MAW. We show that in bacteria and yeast the cores of the longest MAWs,
which exist in two or more copies, are located in highly conserved regions, the
most prominent example being ribosomal RNAs (rRNAs). We also show that while
the distribution of the cores of long MAWs is roughly uniform over these
genomes on a coarse-grained level, on a more detailed level it is strongly
enhanced in 3' untranslated regions (UTRs) and, to a lesser extent, also in 5'
UTRs. This indicates that MAWs and associated MAW cores correspond to
fine-tuned evolutionary relationships, and suggests that they can be more widely
used as markers for genomic complexity.

Comment: Supplemental Information to the paper is available as ancillary fil
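
To make the definition of a minimal absent word concrete, the following is a minimal brute-force sketch in Python. It is an illustration only, not the method used in the paper: the function name `minimal_absent_words`, the `max_len` cutoff, and the default alphabet are assumptions made here, only a single strand is scanned (no reverse complement), and words of length 1 are not reported. It relies on the fact that a word of length at least 2 is a MAW exactly when it is absent while both its longest proper prefix and its longest proper suffix are present.

```python
def minimal_absent_words(seq, max_len, alphabet="ACGT"):
    """Brute-force sketch: return the MAWs of seq with length 2..max_len.

    Assumptions (not from the paper): single strand only, fixed alphabet,
    and an explicit upper bound max_len on the word length.
    """
    def kmers(k):
        # All distinct words of length k that occur in seq.
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    maws = []
    shorter = kmers(1)  # present words of length k - 1
    for k in range(2, max_len + 1):
        current = kmers(k)  # present words of length k
        for prefix in sorted(shorter):
            for letter in alphabet:
                word = prefix + letter
                # The prefix word[:-1] is present by construction, so the word
                # is a MAW if it is absent and its suffix word[1:] is present.
                if word not in current and word[1:] in shorter:
                    maws.append(word)
        shorter = current
    return maws


if __name__ == "__main__":
    for word in minimal_absent_words("ACGTA", max_len=4):
        print(len(word), word)
```

For the toy string "ACGTA", the only MAW of length 3 found by this sketch is "TAC": it does not occur in the string, while both "TA" and "AC" do. The approach enumerates candidate words explicitly and is therefore only practical for short sequences and small `max_len`; whole-genome analyses would need an index-based method instead.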