Functional data appear in several domains of science, for example in biomedical,
meteorological or engineering studies. A functional observation can exhibit atypical
behaviour over a short or a large part of the domain, and this may be due to
magnitude or to shape features. Over the last ten years, many outlier detection
methods have been proposed. In this work we use the functional data framework to
investigate the existence of DNA words with outlying distance distributions, which
may be related to biological motifs.
A DNA word is a sequence defined over the genome alphabet {A, C, G, T}. Distances between successive occurrences of the same word define the inter-word distance
distribution, which can be interpreted as a discrete function. Each word length is associated
with a functional dataset formed by 4^k distance distributions, where k is the word
length. As the word length
increases, the diversity of observed patterns in the functional dataset grows, and so does
the number of distributions displaying strong frequency peaks. We propose a two-step procedure to detect words with an outlying pattern of distances: first, the functions are clustered according to their global trend; then, an
outlier detection method is applied within each cluster. Each distribution trend is
obtained by data smoothing, which attenuates the peaks of some distributions, and similarities
between smoothed data are explored through hierarchical complete linkage clustering. The dissimilarity between functions is evaluated using the Euclidean distance
or the Generalized Minimum distance [1], which considers the dependence between
domain points. The resulting dendrograms are then cut, leading to a partition of the
distance distributions. For the second step, we use the Directional Outlyingness measure, which assigns a robust measure of outlyingness to each domain point and is the
building block of a graphical tool for visualization of the centrality of the curves [2].
We focus on the human genome and words of length ≤ 7. Results are compared
with those obtained by applying only the second step of the procedure [3].
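
As an illustration of the inter-word distance distribution described above, the following minimal Python sketch computes it for a single word on a toy sequence. It is not the implementation used in this work. The function names, the toy sequence, the choice to count overlapping occurrences, and the convention of measuring distances between start positions are all assumptions made for the example.

```python
from collections import Counter


def word_positions(sequence, word):
    """Return start positions of all occurrences of `word` in `sequence`.

    Assumption: overlapping occurrences are counted.
    """
    positions = []
    start = sequence.find(word)
    while start != -1:
        positions.append(start)
        start = sequence.find(word, start + 1)
    return positions


def inter_word_distance_distribution(sequence, word):
    """Return relative frequencies of distances between successive occurrences.

    Assumption: a distance is the difference between the start positions
    of two consecutive occurrences of the same word.
    """
    positions = word_positions(sequence, word)
    distances = [b - a for a, b in zip(positions, positions[1:])]
    counts = Counter(distances)
    total = len(distances)
    if total == 0:
        # Fewer than two occurrences: no distance can be observed.
        return {}
    return {d: counts[d] / total for d in sorted(counts)}


# Hypothetical toy sequence, for illustration only.
toy_sequence = "ACGACGTTACGA"
distribution = inter_word_distance_distribution(toy_sequence, "ACG")
```

Repeating this computation for every word of a given length k would yield the 4^k discrete functions that form one functional dataset. The smoothing, clustering and outlier detection steps would then operate on those functions.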