
8 research outputs found

Multiple Methods for Genome Filtering

Author: Selman Alma Husagic
Publication venue: International University of Sarajevo
Publication date: 24/11/2013

Filters are fast algorithms that preprocess DNA sequences to reduce the time and complexity of approximate motif search. Many filtering methods exist. This paper classifies filtering algorithms by approach (numerical analysis or digital signal processing) and briefly reviews both classes. It also reflects on the filters currently used in popular software for genomic processing.

Inquiry (E-Journal - Faculty of Business and Administration, International University of Sarajevo)

Crossref

Localizing triplet periodicity in DNA and cDNA sequences

Author: AA Tsonis
AWC Liew
D Anastassiou
DL Black
G Gutierrez
I Daubechies
J Epps
J Sanchez
J Tuqan
JK Pickrell
JP Mena-Chalco
K Okamura
Lincoln D Stein
Liya Wang
M Stanke
M Yan
R Lewis
S Tiwari
TP George
WG Fairbrother
WJ Kent
YT Chan
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: The protein-coding regions (coding exons) of a DNA sequence exhibit a triplet periodicity (TP) due to the fact that coding exons contain a series of three-nucleotide codons that encode specific amino acid residues. Such periodicity is usually not observed in introns and intergenic regions. If a DNA sequence is divided into small segments and a Fourier Transform is applied to each segment, a strong peak at frequency 1/3 is typically observed in the Fourier spectrum of coding segments, but not in non-coding regions. This property has been used to identify the locations of protein-coding genes in unannotated sequence. The method is fast and requires no training. However, the need to compute the Fourier Transform across a segment (window) of arbitrary size affects the accuracy with which one can localize TP boundaries. Here, we report a technique that provides higher-resolution identification of these boundaries, and use the technique to explore the biological correlates of TP regions in the genome of the model organism C. elegans.
Results: Using both simulated TP signals and the real C. elegans sequence F56F11 as an example, we demonstrate that (1) the Modified Wavelet Transform (MWT) can better define the boundary of a TP region than the conventional Short Time Fourier Transform (STFT); (2) the scale parameter (a) of the MWT determines the precision of TP boundary localization: bigger values of a give sharper TP boundaries but result in a lower signal-to-noise ratio; (3) RNA splicing sites have weaker TP signals than coding regions; (4) TP signals in coding regions can be destroyed or recovered by frame-shift mutations; (5) 6 bp periodicities in introns and intergenic regions can generate false positive signals, which can be removed with 6 bp MWT.
Conclusions: MWT can provide more precise TP boundaries than STFT, and the boundaries can be further refined by a bigger-scale MWT. Subtraction of 6 bp periodicity signals reduces the number of false positives. Experimentally introduced frame-shift mutations help recover TP signals that have been lost through possible ancient frame-shifts. More importantly, the TP signal has the potential to be used to detect the splice junctions in fully spliced mRNA sequence.
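
The windowed Fourier baseline described in the abstract above (a peak at frequency 1/3 in coding segments) can be sketched as follows. This is only an illustration of the conventional sliding-window approach, not the paper's Modified Wavelet Transform; the function names, the window and step sizes, and the synthetic test sequence are all assumptions made for the example.

```python
import cmath

def period3_power(segment, bases="ACGT"):
    """Spectral power at frequency 1/3, summed over the four binary
    indicator sequences (one per base) and normalised by segment length."""
    total = 0.0
    for base in bases:
        # DFT coefficient at frequency 1/3 of the indicator sequence for `base`
        coeff = sum(cmath.exp(-2j * cmath.pi * i / 3)
                    for i, ch in enumerate(segment) if ch == base)
        total += abs(coeff) ** 2
    return total / len(segment)

def sliding_period3(seq, window=150, step=3):
    """Return (start, power) pairs for each window along the sequence."""
    return [(start, period3_power(seq[start:start + window]))
            for start in range(0, len(seq) - window + 1, step)]

# Synthetic sequence (hypothetical data): a perfectly 3-periodic block
# flanked by period-4 filler that carries almost no power at frequency 1/3.
flank = "ACGT" * 75          # 300 bases, no triplet periodicity
middle = "ATG" * 100         # 300 bases, strong triplet periodicity
seq = flank + middle + flank

profile = sliding_period3(seq)
best_start, best_power = max(profile, key=lambda item: item[1])
print(best_start, round(best_power, 2))
```

Because every window that straddles a boundary mixes periodic and non-periodic bases, the power rises and falls gradually across roughly one window length, which is the localization limit the abstract attributes to window-based Fourier analysis.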

Crossref

Cold Spring Harbor Laboratory Institutional Repository

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Visualization of the protein-coding regions with a self adaptive spectral rotation approach

Author: Akhtar
Akhtar
Anastassiou
Anastassiou
Azad
Bennetzen
Berthelsen
Bo Chen
Borodovsky
Burge
Cao
Cebrat
Chang
Claverie
Do
Dodin
Dodin
Fickett
Fickett
Fickett
Frenkel
Frenkel
Gao
Haimovich
Henderson
Jiang
Kotlar
Li
Masoom
Olson
Orlov
Peng
Ping Ji
Ré
Salzberg
Staden
Stanke
Te Boekhorst
Tiwari
Tuqan
Tuqan
Voss
Yan
Yin
Zhang
Zhang
Publication venue: Oxford University Press
Publication date: 01/01/2011

Identifying protein-coding regions in DNA sequences is an active issue in computational biology. In this study, we present a self adaptive spectral rotation (SASR) approach, which visualizes coding regions in DNA sequences based on the Triplet Periodicity property, without any preceding training process. It is intended to support rough prediction of coding regions when the extra information needed to train other outstanding methods is not available. In this approach, at each position in the DNA sequence, a Fourier spectrum is calculated from the posterior subsequence. From these spectra, a random walk in the complex plane is generated as the SASR's graphic output. Applications of the SASR to real DNA data show that patterns in the graphic output reveal the locations of the coding regions and the frame shifts between them: arcs indicate coding regions, stable points indicate non-coding regions, and the shapes of corners reveal frame shifts. Tests on a genomic data set from Saccharomyces cerevisiae reveal that the graphic patterns for coding and non-coding regions differ to a great extent, so that the coding regions can be visually distinguished. Meanwhile, a time cost test shows that the SASR can be easily implemented with a computational complexity of O(N).

The Hong Kong Polytechnic University Pao Yue-kong Library

Crossref

PolyU Institutional Repository

PubMed Central

Visualization of the protein-coding regions with a self adaptive spectral rotation approach

Author: Akhtar
Akhtar
Anastassiou
Anastassiou
Azad
Bennetzen
Berthelsen
Bo Chen
Borodovsky
Burge
Cao
Cebrat
Chang
Claverie
Do
Dodin
Dodin
Fickett
Fickett
Fickett
Frenkel
Frenkel
Gao
Haimovich
Henderson
Jiang
Kotlar
Li
Masoom
Olson
Orlov
Peng
Ping Ji
Ré
Salzberg
Staden
Stanke
Te Boekhorst
Tiwari
Tuqan
Tuqan
Voss
Yan
Yin
Zhang
Zhang
Publication venue: Oxford University Press
Publication date: 01/01/2011

The Hong Kong Polytechnic University Pao Yue-kong Library

Crossref

PolyU Institutional Repository

PubMed Central

Novel methodologies for spectral classification of exon and intron sequences

Author: Benjamin Y M Kwan
Hon Keung Kwan
Jennifer Y Y Kwan
Publication venue: Springer Nature
Publication date: 01/01/2012

Springer - Publisher Connector

Mapping Equivalence for Symbolic Sequences: Theory and Applications

Author: Schonfeld Dan
Wang Liming
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 10/06/2009

Processing of symbolic sequences by mapping the symbolic data into numerical signals is commonly used in various applications. It is a particularly popular approach in genomic and proteomic sequence analysis. Numerous mappings of symbolic sequences have been proposed for various applications. It is unclear, however, whether the results of processing symbolic data are an artifact of the numerical mapping or an inherent property of the symbolic data. This issue has long been ignored in the engineering and scientific literature. It is possible that many of the results obtained in symbolic signal processing are a byproduct of the mapping and do not shed any light on the underlying properties embedded in the data. Moreover, in many applications, conflicting conclusions may arise from the choice of the mapping used for numerical representation of symbolic data. In this paper, we present a novel framework for the analysis of the equivalence of the mappings used for numerical representation of symbolic data. We present strong and weak equivalence properties and rely on signal correlation to characterize equivalent mappings. We derive theoretical results which establish conditions for consistency among numerical mappings of symbolic data. Furthermore, we introduce an abstract mapping model for symbolic sequences and extend the notion of equivalence to an algebraic framework. Finally, we illustrate our theoretical results by application to DNA sequence analysis.
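
The mapping dependence that motivates this abstract can be illustrated with a small sketch. It does not implement the paper's equivalence framework; it only shows, under assumed example mappings with hypothetical names, that the same DNA string can show a strong, a weaker, or no period-3 component depending on how symbols are turned into numbers.

```python
import cmath

def dft_at_one_third(values):
    """DFT coefficient of a numeric sequence at frequency 1/3."""
    return sum(v * cmath.exp(-2j * cmath.pi * i / 3)
               for i, v in enumerate(values))

def period3_amplitude(seq, mapping):
    """Magnitude of the period-3 component after applying a symbol mapping."""
    return abs(dft_at_one_third([mapping[ch] for ch in seq]))

# Three illustrative mappings (assumptions for this example only).
MAPPINGS = {
    "integer": {"A": 0, "C": 1, "G": 2, "T": 3},      # arbitrary integer codes
    "strong_weak": {"A": 0, "T": 0, "C": 1, "G": 1},  # weak vs strong pairing
    "t_indicator": {"A": 0, "C": 0, "G": 0, "T": 1},  # binary indicator for T
}

seq = "ACG" * 50  # synthetic, perfectly 3-periodic sequence with no T

amplitudes = {name: period3_amplitude(seq, mapping)
              for name, mapping in MAPPINGS.items()}
for name, value in amplitudes.items():
    print(name, round(value, 3))
```

The sequence is identical in all three cases, yet the T indicator sees no periodicity at all, which is the kind of mapping-induced discrepancy the abstract warns about.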

arXiv.org e-Print Archive

Crossref

Analysis of Genomic and Proteomic Sequences using DSP Techniques

Author: Kakumani Raja Sekhar
Publication date: 12/03/2013

Analysis of biological sequences by detecting hidden periodicities and symbolic patterns has been an active area of research for a couple of decades. The hidden periodic components and patterns help locate biologically relevant motifs such as protein coding regions (exons), CpG islands (CGIs) and hot-spots that characterize various biological functions. The discrete nature of biological sequences has prompted many researchers to use digital signal processing (DSP) techniques for their analysis. After mapping the biological sequences to numerical sequences, various DSP techniques using digital filters, wavelets, neural networks, filter banks, etc. have been developed to detect the hidden periodicities and recurring patterns in these sequences. This thesis attempts to develop effective DSP-based techniques to solve some of the important problems in biological sequence analysis. Specifically, DSP techniques such as statistically optimal null filters (SONFs), matched filters and neural-network-based algorithms are developed for the analysis of deoxyribonucleic acid (DNA), ribonucleic acid (RNA) and protein sequences. In the first part of this study, DNA sequences are investigated in order to identify the locations of CGIs and protein coding regions, i.e., exons. SONFs, which are known for their ability to efficiently estimate short-duration signals embedded in noise by combining the maximum signal-to-noise ratio and the least squares optimization criteria, are utilized to solve these problems. Basis sequences characterizing CGIs and exons are formulated for use in the SONF technique. In the second part of this study, RNA sequences are analyzed to predict their secondary structures. For this purpose, matched filters based on 2-dimensional convolution are developed to identify the locations of stem and loop patterns in the RNA secondary structure.
The knowledge of the stem and loop patterns thus obtained is then used to predict the presence of pseudoknots, leading to the determination of the entire RNA secondary structure. Finally, in the third part of this thesis, protein sequences are analyzed to solve the problems of predicting protein secondary structure and identifying the locations of hot-spots. For predicting the protein secondary structure, a two-stage neural network scheme is developed, whereas for predicting the locations of hot-spots, an SONF-based approach is proposed. Hot-spots in proteins exhibit a characteristic frequency corresponding to their biological function. A basis function is formulated based on this characteristic frequency to be used in SONFs to detect the locations of hot-spots belonging to the corresponding functional group. Extensive experiments are performed throughout the thesis to demonstrate the effectiveness and validity of the various schemes and techniques developed in this investigation. The performance of the proposed techniques is compared with that of previously reported techniques for the analysis of biological sequences. For this purpose, the results obtained are validated using databases with known annotations. It is shown that the proposed schemes result in performance superior to that of some of the existing techniques.

Concordia University Research Repository

Using signal processing, evolutionary computation, and machine learning to identify transposable elements in genomes

Author: Ashlock Wendy Cole
Publication date: 23/06/2016

About half of the human genome consists of transposable elements (TE's), sequences that have many copies of themselves distributed throughout the genome. All genomes, from bacterial to human, contain TE's. TE's affect genome function by either creating proteins directly or affecting genome regulation. They serve as molecular fossils, giving clues to the evolutionary history of the organism. TE's are often challenging to identify because they are fragmentary or heavily mutated. In this thesis, novel features for the detection and study of TE's are developed. These features are of two types. The first type are statistical features based on the Fourier transform used to assess reading frame use. These features measure how different the reading frame use is from that of a random sequence, which reading frames the sequence is using, and the proportion of use of the active reading frames. The second type of feature, called side effect machine (SEM) features, are generated by finite state machines augmented with counters that track the number of times the state is visited. These counters then become features of the sequence. The number of possible SEM features is super-exponential in the number of states. New methods for selecting useful feature subsets that incorporate a genetic algorithm and a novel clustering method are introduced. The features produced reveal structural characteristics of the sequences of potential interest to biologists. A detailed analysis of the genetic algorithm, its fitness functions, and its fitness landscapes is performed. The features are used, together with features used in existing exon finding algorithms, to build classifiers that distinguish TE's from other genomic sequences in humans, fruit flies, and ciliates. The classifiers achieve high accuracy (> 85%) on a variety of TE classification problems. The classifiers are used to scan large genomes for TE's. 
In addition, the features are used to describe the TE's in the newly sequenced ciliate Tetrahymena thermophila, providing information that biologists can use in forming experimentally testable hypotheses about the role of these TE's and the mechanisms that govern them.

YorkSpace