3 research outputs found
Statistical analysis of the DNA sequence of human chromosome 22
We study statistical patterns in the DNA sequence of human chromosome 22, the first completely sequenced human chromosome. We find that (i) the 33.4 x 10(6) nucleotide long human chromosome exhibits long-range power-law correlations over more than four orders of magnitude, (ii) the entropies H-n of the frequency distribution of oligonucleotides of length n (n-mers) grow sublinearly with increasing n, indicating the presence of higher-order correlations for all of the studied lengths 1 less than or equal to n less than or equal to 10, and (iii) the generalized entropies H-n(q) of n-mers decrease monotonically with increasing q and the decay of H-n(q) with q becomes steeper with increasing n less than or equal to 10, indicating that the frequency distribution of oligonucleotides becomes increasingly nonuniform as the length n increases. We investigate to what degree known biological features may explain the observed statistical patterns. We find that (iv) the presence of interspersed repeats may cause the sublinear increase of H-n with n, and that (v) the presence of monomeric tandem repeats as well as the suppression of CG dinucleotides may cause the observed decay of H-n(q) with q
Entropy estimates of small data sets
Estimating entropies from limited data series is known to be a non-trivial
task. Naive estimations are plagued with both systematic (bias) and statistical
errors. Here, we present a new 'balanced estimator' for entropy functionals
Shannon, R\'enyi and Tsallis) specially devised to provide a compromise between
low bias and small statistical errors, for short data series. This new
estimator out-performs other currently available ones when the data sets are
small and the probabilities of the possible outputs of the random variable are
not close to zero. Otherwise, other well-known estimators remain a better
choice. The potential range of applicability of this estimator is quite broad
specially for biological and digital data series.Comment: 11 pages, 2 figure
Repeats and correlations in human DNA sequences
We study the nucleotide-nucleotide mutual information function I(k) of the DNA sequences of the three completely sequenced human chromosomes 20, 21, and 22. We find in each human chromosome (i) the absence of the k=3 base pair (bp) sequence periodicity characteristic for protein coding regions, (ii) the absence of the k=10-11 bp sequence periodicity characteristic for both protein secondary structure and DNA bendability, and (iii) the presence of significant statistical dependencies at about k=135 bp and at about k=165 bp. We investigate to which degree the density and composition of interspersed repeats might explain these observed statistical patterns in all three human chromosomes. We use simple stochastic models to substitute known interspersed repeats and find by numerical studies that (iv) the presence of interspersed repeats dominates short-range correlations as measured by I(k) on the scale of several hundred base pairs in human chromosomes 20, 21, and 22. On the other hand, we find that (v) interspersed repeats contribute only weakly to long-range correlations due to the clustering of highly abundant Alu repeats