589 research outputs found

Distinguish Coding And Noncoding Sequences In A Complete Genome Using Fourier Transform

Author: Anh Vo
Yu Zuguo
Zhou Li-Qian
Zhou Yu
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2007

A Fourier transform method is proposed to distinguish coding and non-coding sequences in a complete genome, based on a number sequence representation of the DNA sequence proposed in our previous paper (Zhou et al., J. Theor. Biol. 2005) and the imperfect periodicity of 3 in protein coding sequences. The three parameters $P_x(\bar{S})(1)$, $P_x(\bar{S})(1/3)$ and $P_x(\bar{S})(1/36)$ in the Fourier transform of the number sequence representation of DNA sequences are selected to form a three-dimensional parameter space. Each DNA sequence is then represented by a point in this space. The points corresponding to coding and non-coding sequences in the complete genome of prokaryotes are seen to be divided into different regions. If the point $(P_x(\bar{S})(1), P_x(\bar{S})(1/3), P_x(\bar{S})(1/36))$ for a DNA sequence is situated in the region corresponding to coding sequences, the sequence is classified as a coding sequence; otherwise, the sequence is classified as a non-coding one. Fisher's discriminant algorithm is used to study the discriminant accuracy. The average discriminant accuracies pc, pnc, qc and qnc of all 51 prokaryotes obtained by the present method reach 81.02%, 92.27%, 80.77% and 92.24% respectively.
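
The period-3 signal that this abstract relies on can be illustrated with a minimal sketch. It does not reproduce the number sequence representation of Zhou et al. (2005), which the abstract does not spell out; as a labeled stand-in it uses plain binary indicator sequences for each base. The function names `indicator`, `power_at` and `period3_power` are hypothetical, chosen only for this example.

```python
import cmath


def indicator(seq, base):
    """Binary indicator sequence: 1.0 where seq has `base`, else 0.0 (stand-in mapping)."""
    return [1.0 if ch == base else 0.0 for ch in seq]


def power_at(signal, f):
    """Normalised spectral power |sum_k x_k exp(-2*pi*i*f*k)|^2 / N at frequency f (cycles per base)."""
    n = len(signal)
    total = sum(x * cmath.exp(-2j * cmath.pi * f * k) for k, x in enumerate(signal))
    return abs(total) ** 2 / n


def period3_power(seq):
    """Sum of the power at f = 1/3 over the four indicator sequences."""
    return sum(power_at(indicator(seq, b), 1.0 / 3.0) for b in "ACGT")


if __name__ == "__main__":
    print(period3_power("ATG" * 40))   # strongly period-3 toy sequence
    print(period3_power("ACGT" * 30))  # period-4 toy sequence, no period-3 component
```

A classifier in the spirit of the abstract would combine a feature like this with the values at the other two frequencies and then apply a linear discriminant; that step is left out here because the abstract gives no details of how it was fitted.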

Queensland University of Technology ePrints Archive

Genomics and proteomics: a signal processor's tour

Author: Vaidyanathan P. P.
Publication date: 01/12/2004

The theory and methods of signal processing are becoming increasingly important in molecular biology. Digital filtering techniques, transform domain methods, and Markov models have played important roles in gene identification, biological sequence analysis, and alignment. This paper contains a brief review of molecular biology, followed by a review of the applications of signal processing theory. This includes the problem of gene finding using digital filtering, and the use of transform domain methods in the study of protein binding spots. The relatively new topic of noncoding genes, and the associated problem of identifying ncRNA buried in DNA sequences, are also described. This includes a discussion of hidden Markov models and context free grammars. Several new directions in genomic signal processing are briefly outlined at the end.

CiteSeerX

Caltech Authors

Correlation property of length sequences based on global structure of complete genome

Author: A. Arneodo
A. K. Mohanty
A. L. Goldberger
A. Provata
B. Lewin
B.-L. Hao
Bin Wang
C. A. Chatzidimitriou-Dreismann
C. K. Peng
C. M. Fraser
E. Simoen
F. N. H. Robinson
H. E. Stanley
H. Herzel
J. Maddox
L. Luo
M. de Sousa Vieira
N. Iwabe
P. Allegrini
R. D. Remington
R. H. Shumway
R. M. Dunki
R. Voss
S. Karlin
S. Nee
S. V. Buldyrev
V. V. Anh
V. V. Prabhu
W. Li
Z.-G. Yu
Zu-Guo Yu
Publication venue: 'American Physical Society (APS)'
Publication date: 01/01/2000

This paper considers three kinds of length sequences of the complete genome. Detrended fluctuation analysis, spectral analysis, and the mean distance spanned within time $L$ are used to discuss the correlation property of these sequences. The values of the exponents from these methods of these three kinds of length sequences of bacteria indicate that long-range correlations exist in most of these sequences. The correlations have a rich variety of behaviours, including the presence of anti-correlations. Furthermore, using the exponent $\gamma$, it is found that these correlations are all linear ($\gamma=1.0\pm 0.03$). It is also found that these sequences exhibit $1/f$ noise in some interval of frequency ($f>1$). The length of this interval of frequency depends on the length of the sequence. The shape of the periodogram in $f>1$ exhibits some periodicity. The period seems to depend on the length and the complexity of the length sequence. Comment: RevTex, 9 pages with 5 figures and 3 tables. Phys. Rev. E Jan. 1, 2001 (to appear)
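
Detrended fluctuation analysis, one of the three methods named in this abstract, can be sketched as follows. This is a generic first-order DFA written under assumptions: the input `series` stands for any numeric sequence (for instance a sequence of segment lengths), the window sizes in `scales` are arbitrary defaults, and the names `linear_fit` and `dfa_exponent` are hypothetical. It is not taken from the paper.

```python
import math
import random


def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def dfa_exponent(series, scales=(8, 16, 32, 64, 128)):
    """Estimate the DFA scaling exponent from the log-log slope of F(n) against n."""
    mean = sum(series) / len(series)
    # Integrated, mean-subtracted profile of the series.
    profile, acc = [], 0.0
    for x in series:
        acc += x - mean
        profile.append(acc)
    log_n, log_f = [], []
    for n in scales:
        t = list(range(n))
        squared, count = 0.0, 0
        # Non-overlapping windows; remove a linear trend inside each one.
        for start in range(0, len(profile) - n + 1, n):
            window = profile[start:start + n]
            slope, intercept = linear_fit(t, window)
            squared += sum((y - (slope * i + intercept)) ** 2 for i, y in zip(t, window))
            count += n
        log_n.append(math.log(n))
        log_f.append(0.5 * math.log(squared / count))  # log of the RMS fluctuation
    alpha, _ = linear_fit(log_n, log_f)
    return alpha


if __name__ == "__main__":
    random.seed(0)
    white = [random.gauss(0.0, 1.0) for _ in range(4096)]
    print(dfa_exponent(white))
```

Uncorrelated noise is expected to give an exponent near 0.5, and values away from 0.5 point to correlations or anti-correlations; the precise relation to the exponents reported in the paper is not stated in the abstract.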

arXiv.org e-Print Archive

CiteSeerX

Crossref

CERN Document Server

Measure representation and multifractal analysis of complete genomes

Author: A. Arneodo
A. Provata
A.K. Mohanty
B. Lewin
Bai-lin Hao
Bai-Lin Hao
C.A. Chatzidimitriou-Dreismann
C.K. Peng
C.L. Berthelsen
C.M. Fraser
D. Katzen
D. Vollhardt
E. Canessa
E. Pennisi
F. N. H. Robinson
H. Herzel
H.E. Stanley
H.J. Jeffrey
J. Lee
J. Maddox
Ka-Sing Lau
Liaofu Luo
Maria de Sousa Vieira
N. Goldman
N. Iwabe
P. Allegrini
P. Grassberger
R. H. Shumway
R. Pastor-Satorras
R. Voss
S. Karlin
S. Nee
S.V. Buldyrev
T. Bohr
T. Halsey
V.V. Anh
V.V. Prabhu
Vo Anh
W. Li
Zu-Guo Yu
Publication venue: 'American Physical Society (APS)'
Publication date: 01/01/2001

This paper introduces the notion of measure representation of DNA sequences. Spectral analysis and multifractal analysis are then performed on the measure representations of a large number of complete genomes. The main aim of this paper is to discuss the multifractal property of the measure representation and the classification of bacteria. From the measure representations and the values of the $D_{q}$ spectra and related $C_{q}$ curves, it is concluded that these complete genomes are not random sequences. In fact, the spectral analyses performed indicate that these measure representations, considered as time series, exhibit strong long-range correlation. For substrings with length K=8, the $D_{q}$ spectra of all organisms studied are multifractal-like and sufficiently smooth for the $C_{q}$ curves to be meaningful. The $C_{q}$ curves of all bacteria resemble a classical phase transition at a critical point. But the 'analogous' phase transitions of chromosomes of non-bacteria organisms are different. Apart from Chromosome 1 of C. elegans, they exhibit the shape of a double-peaked specific heat function. Comment: 12 pages with 9 figures and 1 table
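
As a rough illustration of the two quantities in this abstract, the sketch below builds a measure from the relative frequencies of length-K substrings and computes a single-scale estimate of the generalized dimension $D_q$ with box size $4^{-K}$. The abstract does not give the estimation procedure, so the single-scale formula is an assumption made for brevity (a regression over several box sizes would be the more usual choice), and the names `measure_representation` and `generalized_dimension` are hypothetical.

```python
import math
from collections import Counter


def measure_representation(seq, k):
    """Relative frequency of every length-k substring over {A, C, G, T} that occurs in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Drop substrings containing ambiguous symbols such as N.
    counts = {s: c for s, c in counts.items() if set(s) <= set("ACGT")}
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}


def generalized_dimension(measure, k, q):
    """Single-scale estimate D_q = ln(sum p^q) / ((q - 1) ln eps), with eps = 4**-k and q != 1."""
    if q == 1:
        raise ValueError("q = 1 needs the information-dimension limit, not handled here")
    partition_sum = sum(p ** q for p in measure.values())
    return math.log(partition_sum) / ((q - 1) * math.log(4.0 ** -k))


if __name__ == "__main__":
    mu = measure_representation("ACGT" * 10, 1)
    print(mu)
    print(generalized_dimension(mu, 1, 2))
```

For a measure spread evenly over all substrings the estimate equals 1 for every q, and for a measure concentrated on a single substring it equals 0, which gives two simple sanity checks.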

arXiv.org e-Print Archive

Crossref

Queensland University of Technology ePrints Archive

Human Promoter Prediction Using DNA Numerical Representation

Author: Arniker Swarna Bai
Publication venue: 'University of Windsor Leddy Library'
Publication date: 01/01/2010

With the emergence of genomic signal processing, numerical representation techniques for the DNA alphabet set {A, G, C, T} play a key role in applying digital signal processing and machine learning techniques to the processing and analysis of DNA sequences. The choice of the numerical representation of a DNA sequence affects how well the biological properties can be reflected in the numerical domain for the detection and identification of the characteristics of special regions of interest within the DNA sequence. This dissertation presents a comprehensive study of various DNA numerical and graphical representation methods and their applications in processing and analyzing long DNA sequences. Discussions on the relative merits and demerits of the various methods, experimental results and possible future developments have also been included. Another area of research focus is promoter prediction in human (Homo sapiens) DNA sequences with a neural network based multi-classifier system using DNA numerical representation methods. In spite of the recent development of several computational methods for human promoter prediction, there is a need for performance improvement. In particular, the high false positive rate of the feature-based approaches decreases the prediction reliability and leads to erroneous results in gene annotation. To improve the prediction accuracy and reliability, DigiPromPred, a numerical representation based promoter prediction system, is proposed to characterize DNA alphabets in different regions of a DNA sequence. The DigiPromPred system is found to be able to predict promoters with a sensitivity of 90.8% while reducing the false prediction rate for non-promoter sequences with a specificity of 90.4%. The comparative study with state-of-the-art promoter prediction systems for human chromosome 22 shows that our proposed system maintains a good balance between prediction accuracy and reliability.
To reduce the system architecture and computational complexity compared to the existing system, a simple feed forward neural network classifier known as SDigiPromPred is proposed. The SDigiPromPred system is found to be able to predict promoters with a sensitivity of 87%, 87% and 99% while reducing the false prediction rate for non-promoter sequences with a specificity of 92%, 94% and 99% for Human, Drosophila, and Arabidopsis sequences respectively, with reconfigurable capability compared to the existing system.

Scholarship at UWindsor

Ab initio gene identification: prokaryote genome annotation with GeneScan and GLIMMER

Author: Aggarwal Gautam
Ramaswamy Ramakrishna
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2002

We compare the annotation of three complete genomes using the ab initio methods of gene identification GeneScan and GLIMMER. The annotation given in GenBank, the standard against which these are compared, has been made using GeneMark. We find a number of novel genes which are predicted by both methods used here, as well as a number of genes that are predicted by GeneMark, but are not identified by either of the nonconsensus methods that we have used. The three organisms studied here are all prokaryotic species with fairly compact genomes. The Fourier measure forms the basis for an efficient non-consensus method for gene prediction, and the algorithm GeneScan exploits this measure. We have benchmarked this program as well as GLIMMER using 3 complete prokaryotic genomes. An effort has also been made to study the limitations of these techniques for complete genome analysis. GeneScan and GLIMMER are of comparable accuracy insofar as gene identification is concerned, with sensitivities and specificities typically greater than 0.9. The number of false predictions (both positive and negative) is higher for GeneScan as compared to GLIMMER, but in a significant number of cases, similar results are provided by the two techniques. This suggests that there could be some as-yet unidentified additional genes in these three genomes, and also that some of the putative identifications made hitherto might require re-evaluation. All these cases are discussed in detail.

CiteSeerX

Hierarchical structure of cascade of primary and secondary periodicities in Fourier power spectrum of alphoid higher order repeats

Author: A Arneodo
A Puente de la
A Som
A Weiss
AK Brodzik
AL Jorgensen
AM Lynn
AR Fuentes
B Borštnik
B Haubold
BD Silverman
BR Kim
C Lee
C Tyler-Smith
C Yin
CA Chatzidimitriou-Dreismann
CC Yin
CK Peng
D Anastassiou
D Holste
D Kotlar
D Larhammar
D Sharma
DC Benson
DD Mauresan
DG Arques
E Coward
E Pizzi
EA Cleever
EN Trifonov
EPC Rocha
EV Korotkov
G Bernardi
G Dodin
GI Kutuzova
H Herzel
HE Stanley
I Dunham
IA Alexandrov
Ivan Basar
J Felsenstein
J Gao
J Jin
J Widom
JH Jackson
JM Gutierez
JS Waye
JW Fickett
KHA Cho
L Du
L Manuelidis
LQ Zhou
LY Romanova
M Rosandić
M Sousa Vieira de
Marija Rosandić
Matko Glunčić
MK Rudd
MQ Zhang
MY Azbel
N Bouayanaya
N Nagai
Nenad Pavin
Nils Paar
P Bernaola-Galvan
PE Warburton
PG Pop
PP Vaidyanathan
PV O'Neil
R Gupta
R Hall
R Ramakrishna
R Wevrick
R Zhang
RF Voss
S Guharay
S Karlin
S Nee
S Tiwari
SA Aghili
SV Buldyrev
T Haaf
TR Gregory
TT Tran
V Afreixo
V Paar
VA Emanuele
Vladimir Paar
VP Turutina
VR Chechetkin
VV Lobzin
VV Pradbu
W Lee
W Li
YX Tian
Z-G Yu
Publication venue: BioMed Central
Publication date: 01/01/2008

Abstract Background: Identification of approximate tandem repeats is an important task of broad significance and remains a challenging problem in computational genomics. Often there is no single best approach to periodicity detection, and a combination of different methods may improve the prediction accuracy. The discrete Fourier transform (DFT) has been extensively used to study primary periodicities in DNA sequences. Here we investigate the application of the DFT method to identify and study alphoid higher order repeats. Results: We used a method based on the DFT, with mapping of the symbolic into a numerical sequence, to identify and study alphoid higher order repeats (HOR). For HORs the power spectrum shows an equidistant frequency pattern, with a characteristic two-level hierarchical organization as the signature of HOR. Our case study was the 16-mer HOR tandem in AC017075.8 from human chromosome 7. A very long array of equidistant peaks at multiple frequencies (more than a thousand higher harmonics) is based on the fundamental frequency of the 16-mer HOR. A pronounced subset of equidistant peaks is based on multiples of the fundamental HOR frequency (multiplication factor n for an n-mer) and higher harmonics. In general, an n-mer HOR pattern contains equidistant secondary periodicity peaks, having a pronounced subset of equidistant primary periodicity peaks. This hierarchical pattern as a signature for HOR detection is robust with respect to monomer insertions and deletions, random sequence insertions, etc. For a monomeric alphoid sequence only primary periodicity peaks are present. The $1/f^{\beta}$ noise and the periodicity-three pattern are missing from power spectra in alphoid regions, in accordance with expectations. Conclusion: DFT provides a robust detection method for higher order periodicity.
An easily recognizable HOR power spectrum is characterized by a hierarchical two-level equidistant pattern: higher harmonics of the fundamental HOR frequency (secondary periodicity) and a subset of pronounced peaks corresponding to constituent monomers (primary periodicity). The number of lower frequency peaks (secondary periodicity) below the frequency of the first primary periodicity peak reveals the size of the n-mer HOR, i.e., the number n of monomers contained in the consensus HOR.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

MPG.PuRe

Structural fingerprints of transcription factor binding site regions

Author: Christopher Hunter
Collins
Eleanor J. Gardiner
Fickett
Kampa
Packer
Peter Willett
Tiwari
Waterston
Publication venue: 'MDPI AG'
Publication date: 01/01/2009

Fourier transforms are a powerful tool in the prediction of DNA sequence properties, such as the presence/absence of codons. We have previously compiled a database of the structural properties of all 32,896 unique DNA octamers. In this work we apply Fourier techniques to the analysis of the structural properties of human chromosomes 21 and 22 and also to three sets of transcription factor binding sites within these chromosomes. We find that, for a given structural property, the structural property power spectra of chromosomes 21 and 22 are strikingly similar. We find common peaks in their power spectra for both Sp1 and p53 transcription factor binding sites. We use the power spectra as a structural fingerprint and perform similarity searching in order to find transcription factor binding site regions. This approach provides a new strategy for searching the genome data for information. Although it is difficult to understand the relationship between specific functional properties and the set of structural parameters in our database, our structural fingerprints nevertheless provide a useful tool for searching for function information in sequence data. The power spectrum fingerprints provide a simple, fast method for comparing a set of functional sequences, in this case transcription factor binding site regions, with the sequences of whole chromosomes. On its own, the power spectrum fingerprint does not find all transcription factor binding sites in a chromosome, but the results presented here show that in combination with other approaches, this technique will improve the chances of identifying functional sequences hidden in genomic data

CiteSeerX

Crossref

Directory of Open Access Journals

White Rose Research Online

Genetic Algorithms for the Imitation of Genomic Styles in Protein Backtranslation

Author: Moreira Andres
Publication date: 05/04/2003

Several technological applications require the translation of a protein into a nucleic acid that codes for it ("backtranslation"). The degeneracy of the genetic code makes this translation ambiguous; moreover, not every translation is equally viable. The common answer to this problem is the imitation of the codon usage of the target species. Here we discuss several other features of coding sequences ("coding statistics") that are relevant for the "genomic style" of different species. A genetic algorithm is then used to obtain backtranslations that mimic these styles, by minimizing the difference in the coding statistics. Possible improvements and applications are discussed. Comment: 17 pages, 13 figures. Submitted to Theor. Comp. Science
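
The idea of a genetic algorithm that searches among synonymous codons can be sketched in a few lines. Everything specific here is an assumption made for illustration: the codon table is a partial excerpt of the standard genetic code covering four amino acids only, the single "coding statistic" is GC content (the abstract does not list the statistics actually used), and the names `gc_content`, `cost` and `backtranslate` as well as the population settings are hypothetical.

```python
import random

# Partial excerpt of the standard genetic code (four amino acids only, for illustration).
CODONS = {
    "M": ["ATG"],
    "K": ["AAA", "AAG"],
    "F": ["TTT", "TTC"],
    "G": ["GGT", "GGC", "GGA", "GGG"],
}


def gc_content(codons):
    """Fraction of G and C bases in the concatenated codons."""
    dna = "".join(codons)
    return (dna.count("G") + dna.count("C")) / len(dna)


def cost(codons, target_gc):
    """Stand-in objective: distance between the candidate's GC content and the target."""
    return abs(gc_content(codons) - target_gc)


def backtranslate(protein, target_gc, pop_size=30, generations=60, seed=0):
    """Evolve a codon choice per residue; the protein needs at least two residues."""
    rng = random.Random(seed)
    population = [[rng.choice(CODONS[aa]) for aa in protein] for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the better half unchanged (elitism), refill with mutated offspring.
        population.sort(key=lambda ind: cost(ind, target_gc))
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(protein))       # one-point crossover
            child = a[:cut] + b[cut:]
            pos = rng.randrange(len(protein))          # point mutation to a synonymous codon
            child[pos] = rng.choice(CODONS[protein[pos]])
            children.append(child)
        population = parents + children
    best = min(population, key=lambda ind: cost(ind, target_gc))
    return "".join(best)


if __name__ == "__main__":
    print(backtranslate("MKFGGKFG", target_gc=0.5))
```

Because crossover and mutation only ever swap a codon for a synonym of the same residue, every candidate still encodes the input protein; only the stylistic statistic changes across generations.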

arXiv.org e-Print Archive

CiteSeerX

Elsevier - Publisher Connector

CERN Document Server