21 research outputs found

    Identification of exonic regions in DNA sequences using cross-correlation and noise suppression by discrete wavelet transform

    Abstract. Background: The identification of protein-coding regions (exons) in DNA sequences using signal processing techniques is an important component of bioinformatics and biological signal processing. In this paper, a new method is presented for the identification of exonic regions in DNA sequences. This method is based on the cross-correlation technique, which can identify periodic regions in DNA sequences. Results: The method reduces the dependency of identification accuracy on window length. The proposed algorithm is applied to different eukaryotic datasets and the output results are compared with those of other established methods. The proposed method increased the accuracy of exon detection by 4% to 41% relative to the most common digital signal processing methods for exon prediction. Conclusions: We demonstrated that periodic signals can be estimated using cross-correlation. In addition, the discrete wavelet transform (DWT) can minimise noise while maintaining the signal. The proposed algorithm, which combines cross-correlation and DWT, significantly increases the accuracy of exonic region identification.

    Distâncias inter-simbólicas no ADN (Inter-symbol distances in DNA)

    Master's in Electronics and Telecommunications Engineering. The present work aimed to study the contribution of inter-symbol distances to DNA segmentation. To this end, the segmentation of genomic sequences into coding and non-coding regions, and into CpG islands and non-CpG islands, was studied. Inter-trinucleotide distances were studied in the context of identifying coding regions, and inter-dinucleotide distances in the context of identifying CpG islands. Based on these distances, the performance of an algorithm for discriminating coding from non-coding regions was analyzed, with the results showing that there is still room for improvement, and an algorithm for identifying CpG islands was developed, which achieved high correct-classification rates.

    Segmentation of DNA into coding and noncoding regions based on inter-STOP symbols distances

    In this study we set out to explore the potential of inter-genomic-symbol distances for finding the coding regions in DNA sequences. We use the distance between STOP symbols in the DNA sequence and a chi-square statistic to evaluate the nonhomogeneity of the three possible reading frames. The results of this exploratory study suggest that the inter-STOP symbol distance has a strong ability to discriminate coding regions.
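
The idea in this abstract can be illustrated with a minimal Python sketch. The abstract does not specify how the distances and the chi-square statistic are combined, so everything here is an assumption: the function names are hypothetical, distances are measured in codons, and the statistic simply tests whether STOP codons are spread evenly over the three reading frames of one strand.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def stop_positions(seq, frame):
    """Codon indices of STOP codons when seq is read in the given frame (0, 1 or 2)."""
    seq = seq.upper()
    return [
        (i - frame) // 3
        for i in range(frame, len(seq) - 2, 3)
        if seq[i:i + 3] in STOP_CODONS
    ]

def inter_stop_distances(seq, frame):
    """Distances, in codons, between consecutive STOP codons in one frame."""
    pos = stop_positions(seq, frame)
    return [b - a for a, b in zip(pos, pos[1:])]

def frame_chi_square(seq):
    """Chi-square statistic of STOP counts per frame against a uniform expectation.

    Assumption: a large value hints that one frame is depleted of STOPs,
    as would be expected for the frame carrying an open reading frame.
    """
    counts = [len(stop_positions(seq, f)) for f in range(3)]
    total = sum(counts)
    if total == 0:
        return 0.0
    expected = total / 3
    return sum((c - expected) ** 2 / expected for c in counts)
```

In a sliding-window setting, one would presumably compute such a statistic per window and compare it against a threshold; that step is left out because the abstract gives no details.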

    Human Promoter Prediction Using DNA Numerical Representation

    With the emergence of genomic signal processing, numerical representation techniques for the DNA alphabet {A, G, C, T} play a key role in applying digital signal processing and machine learning techniques to the processing and analysis of DNA sequences. The choice of numerical representation affects how well the biological properties of a DNA sequence are reflected in the numerical domain for the detection and identification of special regions of interest within the sequence. This dissertation presents a comprehensive study of various DNA numerical and graphical representation methods and their applications in processing and analyzing long DNA sequences. Discussions of the relative merits and demerits of the various methods, experimental results, and possible future developments are also included. Another research focus is promoter prediction in human (Homo sapiens) DNA sequences with a neural-network-based multi-classifier system using DNA numerical representation methods. In spite of the recent development of several computational methods for human promoter prediction, there is a need for performance improvement. In particular, the high false positive rate of feature-based approaches decreases prediction reliability and leads to erroneous results in gene annotation. To improve prediction accuracy and reliability, DigiPromPred, a numerical-representation-based promoter prediction system, is proposed to characterize DNA alphabets in different regions of a DNA sequence. The DigiPromPred system is found to predict promoters with a sensitivity of 90.8% while reducing the false prediction rate for non-promoter sequences, with a specificity of 90.4%. A comparative study with state-of-the-art promoter prediction systems on human chromosome 22 shows that the proposed system maintains a good balance between prediction accuracy and reliability.
    To reduce the system architecture and computational complexity relative to the existing system, a simple feed-forward neural network classifier known as SDigiPromPred is proposed. The SDigiPromPred system is found to predict promoters with sensitivities of 87%, 87%, and 99% while reducing the false prediction rate for non-promoter sequences, with specificities of 92%, 94%, and 99% for human, Drosophila, and Arabidopsis sequences respectively, and with reconfigurable capability compared to the existing system.

    Sinais simbólicos e aplicações em genómica (Symbolic signals and applications in genomics)

    Doctorate in Electrical Engineering. This dissertation addresses the problem of processing sequences of symbols, and has the specific aim of contributing to the analysis and modeling of DNA sequences. This work was partly motivated by the problem of automatic gene location. Another motivation was the compression of genetic sequences, both for the purpose of reducing the required storage and for determining good DNA models. With the aim of improving one of the techniques frequently used in automatic gene location, the main methodologies of spectral analysis of symbolic sequences are compared.
    The validity of applying spectral analysis methods to symbolic sequences is discussed and a new method based on the symbolic autocorrelation function is presented. One feature that is often used in gene identification is the size of the Fourier coefficient that reflects periodicity of period three. A fast algorithm for the calculation of several Fourier coefficients, and in particular the period-three coefficient, was developed based on symbol counters. Some properties associated with the size of some spectral coefficients and with spectral redundancy are stated and analyzed. Finally, a technique based on a three-state model was developed to compress genetic sequences. In protein-coding regions this technique leads in general to better results than the state-of-the-art DNA compression techniques.
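
The link between symbol counters and the period-three Fourier coefficient can be sketched as follows. This is not the dissertation's algorithm, whose details are not given in the abstract. It only uses the standard identity that the DFT coefficient of a symbol's indicator sequence at frequency 1/3 depends solely on how often the symbol occurs at positions 0, 1 and 2 modulo 3. The function name `period3_power` is hypothetical.

```python
import cmath

def period3_power(seq):
    """Sum over A, C, G, T of |U_b|^2, where U_b is the DFT coefficient of the
    indicator sequence of symbol b at frequency 1/3.

    Only three counters per symbol are needed: the number of occurrences of b
    at positions congruent to 0, 1 and 2 modulo 3.
    """
    counts = {b: [0, 0, 0] for b in "ACGT"}
    for n, b in enumerate(seq.upper()):
        if b in counts:  # symbols other than A, C, G, T are ignored
            counts[b][n % 3] += 1
    w = cmath.exp(-2j * cmath.pi / 3)  # primitive cube root of unity
    return sum(
        abs(sum(c[r] * w ** r for r in range(3))) ** 2
        for c in counts.values()
    )
```

A sequence with perfectly uniform codon-position usage gives a value near zero, while a strongly position-biased sequence such as a repeated codon gives a large value. Because only the counters change when a window slides, the coefficient can be updated in constant time per step.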

    Segmentation of DNA into Coding and Noncoding Regions Based on Recursive Entropic Segmentation and Stop-Codon Statistics

    Heterogeneous DNA sequences can be partitioned into homogeneous domains that are comprised of the four nucleotides A, C, G, and T and the stop-codons. We recursively apply a new entropic segmentation method to DNA sequences, using Jensen-Shannon and Jensen-Rényi divergences, in order to find the borders between coding and noncoding DNA regions. We have chosen 12- and 18-symbol alphabets that capture (i) the differential nucleotide composition in codons, and (ii) the differential stop-codon composition along all three phases in both strands of the DNA. The new segmentation method is based on the Jensen-Rényi divergence measure, nucleotide statistics, and stop-codon statistics in both DNA strands. The recursive segmentation process requires no prior training on known datasets. For three entire bacterial genomes, we find that the use of nucleotide composition, stop-codon composition, and Jensen-Rényi divergence improves the accuracy of finding the borders between coding and noncoding regions in DNA sequences.
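
A single step of entropic segmentation can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the method of the paper: it uses the plain Jensen-Shannon divergence over the raw symbols of the sequence instead of the 12- and 18-symbol alphabets and the Jensen-Rényi divergence, and the function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(counts, n):
    """Shannon entropy (bits) of the symbol frequencies in counts, n symbols in total."""
    return -sum((c / n) * log2(c / n) for c in counts.values() if c > 0)

def js_divergence(left, right):
    """Length-weighted Jensen-Shannon divergence between two subsequences."""
    n_l, n_r = len(left), len(right)
    n = n_l + n_r
    c_l, c_r = Counter(left), Counter(right)
    h_all = entropy(c_l + c_r, n)
    return h_all - (n_l / n) * entropy(c_l, n_l) - (n_r / n) * entropy(c_r, n_r)

def best_split(seq):
    """Return (position, divergence) of the cut that maximises the divergence.

    Recomputes the counts for every cut (quadratic time) to keep the sketch
    short; seq must contain at least two symbols.
    """
    best = max(range(1, len(seq)), key=lambda i: js_divergence(seq[:i], seq[i:]))
    return best, js_divergence(seq[:best], seq[best:])
```

A recursive segmentation would apply `best_split` again to each side while the divergence stays above a significance threshold; the choice of that threshold is not covered here.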

    Bioinformatics analyses of genomic imprinting

    In the present thesis, bioinformatics analyses of genomic DNA sequences identified a number of features that distinguish imprinted genes from normal, biallelically expressed genes. Despite species-specific differences, which particularly complicate the identification of functional CpG islands, imprinted genes of human and mouse are enriched in intronic CpG islands and tandem repeats. Together with conserved LINE-1 repeats, these might be involved in the establishment of the allele-specific marks in the germ line. Also striking in comparison to non-imprinted genes are the enrichment of CpG-rich motifs and a decreased estimated deamination ratio in conserved sequences, which hints at unanticipated effects of differential methylation. Genome-wide analyses showed that highly conserved elements in exons of imprinted genes are less conserved and shorter than those of normal genes. Maternally expressed genes and the proteins encoded by them are more divergent between rodents and other mammals, whereas paternally expressed genes are conserved above average between mouse and rat. The associated opposite patterns of selection suggest that imprinted genes played a role in the evolution of early rodents. The existence of conserved paralogs with similar functions may have facilitated divergence.