    The spectral analysis of nonstationary categorical time series using local spectral envelope

    Most classical methods of spectral analysis assume that the time series is stationary. However, many time series in practical problems show nonstationary behavior. The data in some fields are large, with a variance and spectrum that change over time. We are sometimes interested in the cyclic behavior of categorical-valued time series such as EEG sleep-state data or DNA sequences. The general method is to scale the data, that is, to assign numerical values to the categories, and then use the periodogram to find the cyclic behavior. But numerous scalings are possible. If we arbitrarily assign numerical values to the categories and proceed with a spectral analysis, the results will depend on the particular assignment. We would like to find all possible scalings that bring out all of the interesting features in the data. Many approaches to spectral analysis have been proposed to overcome these problems. Our goal is to develop a statistical methodology for analyzing nonstationary categorical time series in the frequency domain. In this dissertation, the spectral envelope methodology is introduced for the spectral analysis of categorical time series. It provides a general framework for the spectral analysis of categorical time series and summarizes information from the spectrum matrix. To apply this method to nonstationary processes, I used TBAS (Tree-Based Adaptive Segmentation) and the local spectral envelope based on a piecewise stationary process. In this dissertation, TBAS using a distance function based on the Kullback-Leibler divergence was proposed to find the best segmentation
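
    The dependence on the scaling described in this abstract can be illustrated with a small sketch. It is not the dissertation's spectral envelope method: the toy sequence, the two scalings, and the helper names `periodogram` and `scale` are all assumptions made for illustration. The same categorical sequence is scaled in two ways, and the periodogram at frequency 1/4 is large under one scaling and essentially zero under the other.

```python
import numpy as np

def periodogram(x):
    """Periodogram of a mean-centered real series at frequencies k/n, k = 0..n//2."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return np.abs(np.fft.rfft(x)) ** 2 / len(x)

def scale(seq, scaling):
    """Map each category to the numerical value chosen for it (hypothetical helper)."""
    return np.array([scaling[c] for c in seq], dtype=float)

# Toy categorical series with an exact period of 4 (an assumption for illustration).
seq = "ACGT" * 32  # n = 128

# Two arbitrary scalings of the same four categories.
scaling_a = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
scaling_b = {"A": 1.0, "C": 0.0, "G": 1.0, "T": 0.0}

p_a = periodogram(scale(seq, scaling_a))
p_b = periodogram(scale(seq, scaling_b))

k_quarter = len(seq) // 4  # index of frequency 1/4
k_half = len(seq) // 2     # index of frequency 1/2

# Scaling A keeps the period-4 cycle visible at frequency 1/4.
# Scaling B turns the series into 1,0,1,0,... so only frequency 1/2 carries power.
print(p_a[k_quarter], p_b[k_quarter], p_b[k_half])
```

    The spectral envelope avoids this arbitrariness by considering, at each frequency, the scaling that maximizes the power relative to the total variance; the sketch only shows why a single fixed scaling can hide a cycle.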

    In the search for the low-complexity sequences in prokaryotic and eukaryotic genomes: how to derive a coherent picture from global and local entropy measures

    We investigate a possible way to connect the presence of Low-Complexity Sequences (LCS) in DNA genomes with the nonstationary properties of base correlations. Under the hypothesis that these variations signal a change in DNA function, we use a new technique, called the Non-Stationarity Entropic Index (NSEI) method, and we show that it is an efficient way to detect functional changes with respect to a random baseline. The remarkable aspect is that NSEI requires no training data or fitting parameters, the only arbitrariness being the choice of a marker in the sequence. We make this choice on the basis of biological information about LCS distributions in genomes. We show that there exists a correlation between changes in the amount of LCS and the ratio of long- to short-range correlation

    Impact of Tandem Repeats on the Scaling of Nucleotide Sequences

    Techniques such as detrended fluctuation analysis (DFA) and its extensions have been widely used to determine the nature of scaling in nucleotide sequences. In this brief communication we show that tandem repeats, which are ubiquitous in nucleotide sequences, can prevent reliable estimation of possible long-range correlations. Therefore, it is important to investigate the presence of tandem repeats prior to scaling exponent estimation. Comment: 14 pages, 3 figures
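
    For readers unfamiliar with DFA, a minimal first-order sketch is given here. It is an assumption-laden illustration, not the authors' code: the function name `dfa_exponent`, the window sizes, and the random ±1 input (standing in for a numerically mapped nucleotide sequence) are all hypothetical, and the effect of tandem repeats is not reproduced. For an uncorrelated series the estimated exponent should be near 0.5.

```python
import numpy as np

def dfa_exponent(x, window_sizes):
    """Estimate the DFA-1 scaling exponent of a numeric series."""
    x = np.asarray(x, dtype=float)
    profile = np.cumsum(x - x.mean())  # integrated, mean-centered series
    fluctuations = []
    for s in window_sizes:
        n_windows = len(profile) // s
        # Non-overlapping windows of length s, one window per column.
        segments = profile[: n_windows * s].reshape(n_windows, s).T
        t = np.arange(s)
        # Fit a straight line to every window at once; coeffs has shape (2, n_windows).
        coeffs = np.polyfit(t, segments, 1)
        trend = np.outer(t, coeffs[0]) + coeffs[1]
        residuals = segments - trend
        # Root-mean-square fluctuation around the local linear trends.
        fluctuations.append(np.sqrt(np.mean(residuals ** 2)))
    # The scaling exponent is the slope of log F(s) against log s.
    slope, _ = np.polyfit(np.log(window_sizes), np.log(fluctuations), 1)
    return slope

# Uncorrelated +/-1 steps as a stand-in for a mapped nucleotide sequence.
rng = np.random.default_rng(0)
steps = rng.choice([-1.0, 1.0], size=4096)
alpha = dfa_exponent(steps, [16, 32, 64, 128, 256])
print(alpha)
```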

    Time-dependent ARMA modeling of genomic sequences

    Background: Over the past decade, many investigators have used sophisticated time series tools for the analysis of genomic sequences. Specifically, the correlation of the nucleotide chain has been studied by examining the properties of the power spectrum. The main limitation of the power spectrum is that it is restricted to stationary time series. However, it has been observed over the past decade that genomic sequences exhibit non-stationary statistical behavior. Standard statistical tests have been used to verify that genomic sequences are indeed not stationary. More recent analysis of genomic data has relied on time-varying power spectral methods to capture the statistical characteristics of genomic sequences. Techniques such as the evolutionary spectrum and evolutionary periodogram have been successful in extracting the time-varying correlation structure. The main difficulty in using time-varying spectral methods is that they are extremely unstable. Large deviations in the correlation structure result from very minor perturbations in the genomic data and experimental procedure. A fundamentally new approach is needed in order to provide a stable platform for the non-stationary statistical analysis of genomic sequences. Results: In this paper, we propose to model non-stationary genomic sequences by a time-dependent autoregressive moving average (TD-ARMA) process. The model is based on a classical ARMA process whose coefficients are allowed to vary with time. A series expansion of the time-varying coefficients is used to form a generalized Yule-Walker-type system of equations. A recursive least-squares algorithm is subsequently used to estimate the time-dependent coefficients of the model. The estimated non-stationary parameters are used as a basis for statistical inference and biophysical interpretation of genomic data. In particular, we rely on the TD-ARMA model of genomic sequences to investigate the statistical properties of coding and non-coding regions in the nucleotide chain and to differentiate between them. Specifically, we define a quantitative measure of randomness to assess how far a process deviates from white noise. Our simulation results on various gene sequences show that both the coding and non-coding regions are non-random. However, coding sequences are "whiter" than non-coding sequences, as attested by a higher index of randomness. Conclusion: We demonstrate that the proposed TD-ARMA model can be used to provide a stable time series tool for the analysis of non-stationary genomic sequences. The estimated time-varying coefficients are used to define an index of randomness, in order to assess the statistical correlations in coding and non-coding DNA sequences. It turns out that the statistical differences between coding and non-coding sequences are more subtle than previously thought using stationary analysis tools: both coding and non-coding sequences exhibit statistical correlations, with the coding regions being "whiter" than the non-coding regions. These results corroborate the evolutionary periodogram analysis of genomic sequences and overturn the conclusion of stationary analysis that coding DNA behaves like a random sequence.

    Statistical properties of DNA sequences revisited: the role of inverse bilateral symmetry in bacterial chromosomes

    Herein it is shown that, in order to study the statistical properties of DNA sequences in bacterial chromosomes, it suffices to consider only one half of the chromosome, because it is similar to the corresponding complementary sequence in the other half. This is due to the inverse bilateral symmetry of bacterial chromosomes. Contrary to the classical result that DNA coding regions of bacterial genomes are purely uncorrelated random sequences, here it is shown, via a renormalization group approach, that DNA random fluctuations of single bases are modulated by log-periodic variations. Distance series of triplets display long-range correlations in each half of the intact chromosome and in intronless protein-coding sequences, or both long-range correlations and log-periodic modulations along the whole chromosome. Hence scaling analyses of distance series of DNA sequences have to consider the functional units of bacterial chromosomes. Comment: 27 pages, 9 figures

    Biased gene conversion and GC-content evolution in the coding sequences of reptiles and vertebrates.

    Mammalian and avian genomes are characterized by a substantial spatial heterogeneity of GC-content, which is often interpreted as reflecting the effect of local GC-biased gene conversion (gBGC), a meiotic repair bias that favors G and C over A and T alleles in high-recombining genomic regions. Surprisingly, the first fully sequenced nonavian sauropsid (i.e., reptile), the green anole Anolis carolinensis, revealed a highly homogeneous genomic GC-content landscape, suggesting the possibility that gBGC might not be at work in this lineage. Here, we analyze GC-content evolution at third-codon positions (GC3) in 44 vertebrate species, including eight newly sequenced transcriptomes, with a specific focus on nonavian sauropsids. We report that reptiles, including the green anole, have a genome-wide distribution of GC3 similar to that of mammals and birds, and we infer that strong GC3 heterogeneity was already present in the tetrapod ancestor. We further show that the dynamics of coding sequence GC-content are largely governed by karyotypic features in vertebrates, notably in the green anole, in agreement with the gBGC hypothesis. The discrepancy between third-codon positions and noncoding DNA regarding GC-content dynamics in the green anole could not be explained by the activity of transposable elements or selection on codon usage. This analysis highlights the unique value of third-codon positions as an insertion/deletion-free marker of nucleotide substitution biases that ultimately affect the evolution of proteins
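
    The GC3 statistic used in this abstract is the fraction of G or C bases at third-codon positions. A minimal sketch of that calculation follows; the function name `gc3`, the handling of trailing incomplete codons, and the example sequence are assumptions for illustration and are not taken from the study's pipeline.

```python
def gc3(cds):
    """Fraction of G/C bases at third-codon positions of an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing incomplete codon
    third_positions = cds[2:usable:3]
    if not third_positions:
        raise ValueError("sequence contains no complete codon")
    gc_count = sum(base in "GC" for base in third_positions)
    return gc_count / len(third_positions)

# Hypothetical three-codon sequence: ATG, GCC, AAA have third bases G, C, A.
example_gc3 = gc3("ATGGCCAAA")
print(example_gc3)
```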

    The complete mitochondrial genome of Flustra foliacea (Ectoprocta, Cheilostomata) - compositional bias affects phylogenetic analyses of lophotrochozoan relationships

    Background: The phylogenetic relationships of the lophophorate lineages (ectoprocts, brachiopods and phoronids) within Lophotrochozoa are still controversial. We sequenced an additional mitochondrial genome of the most species-rich lophophorate lineage, the ectoprocts. Although it is known that there are large differences in the nucleotide composition of mitochondrial sequences of different lineages, as well as in the amino acid composition of the encoded proteins, this bias is often not considered in phylogenetic analyses. We applied several approaches for reducing compositional bias and saturation in the phylogenetic analyses of the mitochondrial sequences. Results: The complete mitochondrial genome (16,089 bp) of Flustra foliacea (Ectoprocta, Gymnolaemata, Cheilostomata) was sequenced. All protein-encoding, rRNA and tRNA genes are transcribed from the same strand. Flustra shares long intergenic sequences with the cheilostomate ectoproct Bugula, which might be a synapomorphy of these taxa. Further synapomorphies might be the loss of the DHU arm of the tRNA L(UUR), the loss of the DHU arm of the tRNA S(UCN) and the unique anticodon sequence GAG of the tRNA L(CUN). The gene order of the mitochondrial genome of Flustra differs strongly from that of the other known ectoprocts. Phylogenetic analyses of mitochondrial nucleotide and amino acid data sets show that the lophophorate lineages are more closely related to trochozoan phyla than to deuterostomes or ecdysozoans, confirming the Lophotrochozoa hypothesis. Furthermore, they support the monophyly of Cheilostomata and Ectoprocta. However, the relationships of the lophophorate lineages within Lophotrochozoa differ strongly depending on the data set and the method used. Different approaches for reducing heterogeneity in nucleotide and amino acid data sets and saturation did not result in a more robust resolution of lophotrochozoan relationships. Conclusion: The contradictory and usually weakly supported phylogenetic reconstructions of the relationships among lophotrochozoan phyla based on mitochondrial sequences indicate that these alone do not contain enough information for a robust resolution of the relationships of the lophotrochozoan phyla. The mitochondrial gene order is also not useful for inferring their phylogenetic relationships, because it is highly variable in ectoprocts, brachiopods and some other lophotrochozoan phyla. However, our study revealed several rare genomic changes, such as the evolution of long intergenic sequences and changes in the structure of tRNAs, which may be helpful for reconstructing ectoproct phylogeny.

    The taming of an impossible child: a standardized all-in approach to the phylogeny of Hymenoptera using public database sequences

    Background: An enormous amount of molecular sequence data has accumulated over the past several years and is still growing exponentially with the use of faster and cheaper sequencing techniques. There is high and widespread interest in using these data for phylogenetic analyses. However, the amount of data that one can retrieve from public sequence repositories is virtually impossible to tame without dedicated software that automates processes. Here we present a novel bioinformatics pipeline for downloading, formatting, filtering and analyzing public sequence data deposited in GenBank. It combines some well-established programs with numerous newly developed software tools (available at http://software.zfmk.de/). Results: We used the bioinformatics pipeline to investigate the phylogeny of the megadiverse insect order Hymenoptera (sawflies, bees, wasps and ants) by retrieving and processing more than 120,000 sequences and by selecting subsets under the criteria of compositional homogeneity and defined levels of density and overlap. Tree reconstruction was done with a partitioned maximum likelihood analysis from a supermatrix with more than 80,000 sites and more than 1,100 species. In the inferred tree, consistent with previous studies, "Symphyta" is paraphyletic. Within Apocrita, our analysis suggests a topology of Stephanoidea + (Ichneumonoidea + (Proctotrupomorpha + (Evanioidea + Aculeata))). Despite the huge amount of data, we identified several persistent problems in the Hymenoptera tree. Data coverage is still extremely low, and additional data have to be collected to reliably infer the phylogeny of Hymenoptera. Conclusions: While we applied our bioinformatics pipeline to Hymenoptera, we designed the approach to be as general as possible. With this pipeline, it is possible to produce phylogenetic trees for any taxonomic group and to monitor new data and tree robustness in a taxon of interest. It therefore has great potential to meet the challenges of the phylogenomic era and to deepen our understanding of the tree of life.