DNA word analysis based on the distribution of the distances between symmetric words
We address the problem of discovering pairs of symmetric genomic words (i.e., words and the
corresponding reversed complements) occurring at distances that are overrepresented. For this
purpose, we developed new procedures to identify symmetric word pairs with an uncommon empirical
distance distribution and with clusters of overrepresented short distances. We speculate that patterns
of overrepresentation of short distances between symmetric word pairs may allow the occurrence of
non-standard DNA conformations, such as hairpin/cruciform structures. We focused on the human
genome, and analysed both the complete genome and a version with known repetitive sequences
masked out. We reported several well-defined features in the distributions of distances, which can be
classified into three different profiles, showing enrichment in distinct distance ranges. We analysed in
greater detail certain pairs of symmetric words of length seven, found by our procedure, characterised
by the surprising fact that they occur at single distances more frequently than expecte
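
To make the notion of a distance distribution between symmetric words concrete, here is a minimal Python sketch. It is not the authors' procedure. It assumes that the distance for a pair is measured from each occurrence of a word to the nearest following occurrence of its reversed complement, counted in start positions; the abstract does not state the exact definition. The function names (reverse_complement, find_positions, symmetric_distance_counts) are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right
from collections import Counter

# Translation table for the four DNA bases (assumes an upper-case ACGT alphabet).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Return the reversed complement of a DNA word, e.g. AAC -> GTT."""
    return word.translate(_COMPLEMENT)[::-1]


def find_positions(sequence, word):
    """Return the start positions of all (possibly overlapping) occurrences."""
    positions = []
    start = sequence.find(word)
    while start != -1:
        positions.append(start)
        start = sequence.find(word, start + 1)
    return positions


def symmetric_distance_counts(sequence, word):
    """Count distances from each occurrence of `word` to the nearest
    following occurrence of its reversed complement.

    Assumption: distance = difference between start positions. This is an
    illustrative definition, not necessarily the one used in the paper.
    """
    partner_positions = find_positions(sequence, reverse_complement(word))
    counts = Counter()
    for pos in find_positions(sequence, word):
        # Index of the first partner occurrence strictly after `pos`.
        i = bisect_right(partner_positions, pos)
        if i < len(partner_positions):
            counts[partner_positions[i] - pos] += 1
    return counts
```

Calling symmetric_distance_counts(sequence, word) returns a Counter that maps each observed distance to its frequency. Comparing such counts against an expected distribution, which this sketch does not attempt, would be the step that flags overrepresented distances.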