TITER: predicting translation initiation sites by deep learning.

Author: Hu Hailin
Jiang Tao
Zeng Jianyang
Zhang Lei
Zhang Sai
Publication venue: eScholarship, University of California
Publication date: 01/07/2017

Motivation: Translation initiation is a key step in the regulation of gene expression. In addition to the annotated translation initiation sites (TISs), translation may also start at multiple alternative TISs (including both AUG and non-AUG codons), which makes it challenging to predict TISs and study the underlying regulatory mechanisms. Meanwhile, the advent of several high-throughput sequencing techniques for profiling initiating ribosomes at single-nucleotide resolution, e.g. GTI-seq and QTI-seq, provides abundant data for systematically studying the general principles of translation initiation and for developing computational methods for TIS identification. Methods: We have developed a deep learning-based framework, named TITER, for accurately predicting TISs on a genome-wide scale based on QTI-seq data. TITER extracts the sequence features of translation initiation from the surrounding sequence contexts of TISs using a hybrid neural network and further integrates the prior preference of TIS codon composition into a unified prediction framework. Results: Extensive tests demonstrated that TITER can greatly outperform the state-of-the-art prediction methods in identifying TISs. In addition, TITER was able to identify important sequence signatures for individual types of TIS codons, including a Kozak-sequence-like motif for the AUG start codon. Furthermore, the TITER prediction score can be related to the strength of translation initiation in various biological scenarios, including the repressive effect of upstream open reading frames on gene expression and the mutational effects influencing translation initiation efficiency. Availability and implementation: TITER is available as open-source software and can be downloaded from https://github.com/zhangsaithu/titer. Contact: [email protected] or [email protected]. Supplementary information: Supplementary data are available at Bioinformatics online.
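As a rough illustration of the input side of such a predictor, the sketch below enumerates candidate start codons in a transcript and extracts the flanking sequence context for each one. It is not taken from the TITER code base: the function names, the flank size, and the choice to treat codons within one mismatch of AUG as the "non-AUG" candidates are all assumptions made for this example.

```python
from itertools import product

BASES = "ACGU"
START = "AUG"


def candidate_codons(max_mismatch=1):
    """AUG plus every codon within `max_mismatch` substitutions of it.

    Assumption: the non-AUG candidates are the near-cognate codons.
    """
    return {
        "".join(codon)
        for codon in product(BASES, repeat=3)
        if sum(a != b for a, b in zip(codon, START)) <= max_mismatch
    }


def one_hot(seq):
    """Encode an RNA string as a list of 4-element indicator rows (A, C, G, U)."""
    return [[1 if base == b else 0 for b in BASES] for base in seq]


def candidate_sites(transcript, flank=3):
    """Return (position, codon, context window) for each candidate start codon.

    Only sites with a full `flank` of sequence on both sides are reported.
    """
    codons = candidate_codons()
    sites = []
    for i in range(flank, len(transcript) - 3 - flank + 1):
        codon = transcript[i:i + 3]
        if codon in codons:
            sites.append((i, codon, transcript[i - flank:i + 3 + flank]))
    return sites


if __name__ == "__main__":
    # Toy transcript with a Kozak-like context around a single AUG.
    for pos, codon, window in candidate_sites("GCCACCAUGGCG"):
        print(pos, codon, window, len(one_hot(window)))
```

The one-hot rows for each window are the kind of fixed-size input a neural network could consume; how TITER actually encodes and weights its inputs is described in the paper, not here.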

eScholarship - University of California

Translation initiation site prediction on a genomic scale : beauty in simplicity

Author: Borodovsky
Delcher
Fickett
Hatzigeorgiou
Kozak
Li
Liu
Nishikawa
Pedersen
Salamov
Salzberg
Sven Degroeve
Thomas Abeel
Tiwari
Wang
Yvan Saeys
Yves Van de Peer
Zeng
Zien
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2007

Motivation: The correct identification of translation initiation sites (TIS) remains a challenging problem for computational methods. Furthermore, the lion's share of these computational techniques focuses on the identification of TIS in transcript data. However, in the gene prediction context the identification of TIS occurs on the genomic level, which makes things even harder because at the genome level many more pseudo-TIS occur, resulting in models that produce a higher number of false positive predictions. Results: In this article, we evaluate the performance of several 'simple' TIS recognition methods at the genomic level, and compare them to state-of-the-art models for TIS prediction in transcript data. We conclude that the simple methods largely outperform the complex ones at the genomic scale, and we propose a new model for TIS recognition at the genome level that combines the strengths of these simple models. The new model obtains a false positive rate of 0.125 at a sensitivity of 0.80 on a well annotated human chromosome (chromosome 21). Detailed analyses show that the model is useful, both on its own and in a simple gene prediction setting.
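The metric quoted above, a false positive rate at a fixed sensitivity, can be computed as in the following sketch. The function name and the toy scores are invented for illustration and are not data from the paper; the sketch assumes higher scores mean "more likely a true TIS".

```python
import math


def fpr_at_sensitivity(pos_scores, neg_scores, sensitivity=0.80):
    """Pick the score threshold that recovers at least `sensitivity` of the
    true sites, then report the fraction of pseudo-sites scoring at or above it.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    ranked = sorted(pos_scores, reverse=True)
    # Number of true sites that must be called positive (at least one).
    k = max(1, math.ceil(sensitivity * len(ranked)))
    threshold = ranked[k - 1]
    false_positives = sum(score >= threshold for score in neg_scores)
    return threshold, false_positives / len(neg_scores)


if __name__ == "__main__":
    # Toy scores only; these are not results from any published model.
    true_sites = [0.9, 0.8, 0.7, 0.6, 0.2]
    pseudo_sites = [0.65, 0.61, 0.5, 0.4, 0.3, 0.1, 0.05, 0.2]
    print(fpr_at_sensitivity(true_sites, pseudo_sites))
```

Fixing the sensitivity first makes models comparable at the genomic scale, where pseudo-TIS vastly outnumber true sites and raw accuracy is uninformative.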

Crossref

Ghent University Academic Bibliography

MetWAMer: eukaryotic translation initiation site prediction

Author: A Delcher
A Hatzigeorgiou
A Nadershahi
A Pedersen
A Prats
A Rakotondrafara
A Sachs
A Salamov
A Zien
C Bishop
C Iseli
C Lottaz
C Mathé
D Abramczyk
D Cavener
E Birney
G Crooks
G Gremme
G Li
G Stormo
H Li
H Liu
J Allen
J Crow
L Balvay
L Xing
M de Hoon
M Hirosawa
M Kozak
M Medveczky
M Sparks
M Stanke
M Tech
Michael E Sparks
Q Dong
S Altschul
S Hebsgaard
S Russell
S Salzberg
T Berardini
T Mitchell
T Nishikawa
T Preiss
T Schiex
T Schneider
T Sing
V Brendel
Volker Brendel
Y Saeys
Y Wang
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: Translation initiation site (TIS) identification is an important aspect of the gene annotation process, requisite for the accurate delineation of protein sequences from transcript data. We have developed the MetWAMer package for TIS prediction in eukaryotic open reading frames of non-viral origin. MetWAMer can be used as a stand-alone, third-party tool for post-processing gene structure annotations generated by external computational programs and/or pipelines, or directly integrated into gene structure prediction software implementations. Results: MetWAMer currently implements five distinct methods for TIS prediction, the most accurate of which is a routine that combines weighted, signal-based translation initiation site scores and the contrast in coding potential of sequences flanking TISs using a perceptron. Also, our program implements clustering capabilities through use of the k-medoids algorithm, thereby enabling cluster-specific TIS parameter utilization. In practice, our static weight array matrix-based indexing method for parameter set lookup can be used with good results in data sets exhibiting moderate levels of 5'-complete coverage. Conclusion: We demonstrate that improvements in statistically-based models for TIS prediction can be achieved by taking the class of each potential start-methionine into account pending certain testing conditions, and that our perceptron-based model is suitable for the TIS identification task. MetWAMer represents a well-documented, extensible, and freely available software system that can be readily re-trained for differing target applications and/or extended with existing and novel TIS prediction methods, to support further research efforts in this area.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Improvement in the prediction of the translation initiation site through balancing methods, inclusion of acquired knowledge and addition of features to sequences of mRNA

Author: de Souza Teixeira Felipe Carvalho
Nobre Cristiane Neri
Ortega José Miguel
Silva Lívia Márcia
Zárate Luis Enrique
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: The accurate prediction of translation initiation in mRNA sequences is an important task in genome annotation. However, obtaining an accurate prediction is not always simple; the task can be modeled as a classification problem between positive sequences (protein codifiers) and negative sequences (non-codifiers). The problem is highly imbalanced because each mRNA molecule has a unique translation initiation site and various other sites that are not initiators. Therefore, this study approaches the problem from the perspective of class balancing, and we present an undersampling balancing method, M-clus, which is based on clustering. The method also adds features to sequences and improves the performance of the classifier through the inclusion of knowledge obtained by the model, called InAKnow. Results: Through this methodology, the measures of performance used (accuracy, sensitivity, specificity and adjusted accuracy) are greater than 93% for the Mus musculus and Rattus norvegicus organisms, and varied between 72.97% and 97.43% for the other organisms evaluated: Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens, Nasonia vitripennis. The precision increases significantly by 39% and 22.9% for Mus musculus and Rattus norvegicus, respectively, when the knowledge obtained by the model is included. For the other organisms, the precision increases by between 37.10% and 59.49%. The inclusion of certain features during training, for example, the presence of ATG in the upstream region of the translation initiation site, improves the rate of sensitivity by approximately 7%. Using the M-clus balancing method generates a significant increase in the rate of sensitivity from 51.39% to 91.55% (Mus musculus) and from 47.45% to 88.09% (Rattus norvegicus). Conclusions: The results indicate that the methodology proposed in this work is adequate for TIS prediction, particularly when using the concept of acquired knowledge, which increased the accuracy in all databases evaluated.
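To make the idea of clustering-based undersampling concrete, here is a minimal sketch of one generic variant: cluster the majority (negative) class with k-means and keep only the member nearest each centroid. This is not the M-clus algorithm itself, whose details are not given in the abstract; the function names, the deterministic initialization, and the toy feature vectors are all assumptions.

```python
def dist2(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=20):
    """Plain k-means with a deterministic start (first k points as centroids)."""
    centroids = [list(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, clusters


def undersample_by_clustering(negatives, n_keep):
    """Reduce the negative class to at most `n_keep` representatives,
    one per non-empty cluster (the member closest to its centroid)."""
    centroids, clusters = kmeans(negatives, n_keep)
    return [
        min(members, key=lambda p: dist2(p, centroid))
        for centroid, members in zip(centroids, clusters)
        if members
    ]


if __name__ == "__main__":
    # Toy 2-D feature vectors for non-initiator sites (illustrative only).
    negatives = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(undersample_by_clustering(negatives, n_keep=2))
```

Choosing one representative per cluster shrinks the negative class while still covering its distinct regions of feature space, which is the general motivation for preferring clustering over random undersampling.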

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Transductive learning as an alternative to translation initiation site identification

Author: A Zien
B Luukkonen
C Cortes
CC Chang
Cristiane Neri Nobre
Cristiano Lacerda Nunes Pinto
GD Stormo
H Li
H Liu
KD Pruitt
LM Silva
Luis Enrique Zárate
M Kozak
M Matsumoto
NV Chawla
PSG Chain
RA Jia Zeng
S Nakagawa
Publication venue: 'Springer Science and Business Media LLC'
Publication date

Crossref

A novel kernel based approach to arbitrary length symbolic data with application to type 2 diabetes risk

Author: Nwegbu N.
Tirunagari S.
Windridge D.
Publication venue: Nature Publishing Group
Publication date: 01/01/2022

Predictive modeling of clinical data is fraught with challenges arising from the manner in which events are recorded. Patients typically fall ill at irregular intervals and experience dissimilar intervention trajectories. This results in irregularly sampled, uneven-length data, which poses a problem for standard multivariate tools. The alternative of feature extraction into equal-length vectors via methods like Bag-of-Words (BoW) potentially discards useful information. We propose an approach based on a kernel framework in which data is maintained in its native form: discrete sequences of symbols. Kernel functions derived from the edit distance between pairs of sequences may then be utilized in conjunction with support vector machines to classify the data. Our method is evaluated on the task of identifying patients likely to develop type 2 diabetes following an earlier episode of elevated blood pressure of 130/80 mmHg. Kernels combined via multi kernel learning (MKL) achieved an F1-score of 0.96, outperforming SVM (0.63), logistic regression (0.63), Long Short Term Memory (0.61) and Multi-Layer Perceptron (0.54) applied to a BoW representation of the data. With MKL, we achieved an F1-score of 0.97 on an external dataset. The proposed approach is consequently able to overcome limitations associated with feature-based classification in the context of clinical data.
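A kernel derived from edit distance can be sketched as follows. The abstract does not give the exact kernel form, so the exponential mapping k(a, b) = exp(-gamma * d(a, b)), the `gamma` parameter, and the event symbols in the example are assumptions chosen for illustration. Note that kernels built this way are not guaranteed to be positive semidefinite in general, so a practical pipeline may need to check or correct the Gram matrix before passing it to an SVM.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance between two sequences of hashable symbols."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute (free if symbols match)
            ))
        prev = curr
    return prev[-1]


def edit_kernel(a, b, gamma=0.5):
    """Similarity that decays exponentially with edit distance (assumed form)."""
    return math.exp(-gamma * edit_distance(a, b))


def gram_matrix(sequences, gamma=0.5):
    """Pairwise kernel values, usable as a precomputed kernel for an SVM."""
    return [[edit_kernel(s, t, gamma) for t in sequences] for s in sequences]


if __name__ == "__main__":
    # Hypothetical event codes; sequences may have different lengths.
    patients = [
        ["bp_high", "visit", "rx"],
        ["bp_high", "rx"],
        ["visit"],
    ]
    for row in gram_matrix(patients):
        print([round(value, 3) for value in row])
```

Because the distance is computed on the raw symbol sequences, no padding or fixed-length feature extraction is needed, which is the property the abstract contrasts with the BoW representation.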

Middlesex University Research Repository

Representative transcript sets for evaluating a translational initiation sites predictor

Author: A Hatzigeorgiou
A Kanapin
A Muller
A Nadershahi
A Pedersen
C Burge
D Wackerly
Douglas J Demetrick
G Cagney
G Hu
G Omenn
H Li
J Cai
J Kyte
J Zeng
Jia Zeng
K Rudd
M Kozak
M Pruess
N Mulder
P Nielsen
R Schwartz
R Shi
Reda Alhajj
S Altschul
S Salzberg
S Wu
W Majoros
Y Nozaki
Y Saeys
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: Translational initiation site (TIS) prediction is a very important and actively studied topic in bioinformatics. In order to complete a comparative analysis, it is desirable to have several benchmark data sets which can be used to test the effectiveness of different algorithms. An ideal benchmark data set should be reliable, representative and readily available. Preferably, proteins encoded by members of the data set should also be representative of the protein population actually expressed in cellular specimens. Results: In this paper, we report a general algorithm for constructing a reliable sequence collection that only includes mRNA sequences whose corresponding protein products present an average profile of the general protein population of a given organism, with respect to three major structural parameters. Four representative transcript collections, each derived from a model organism, have been obtained following the algorithm we propose. Evaluation of these data sets shows that they are reasonable representations of the spectrum of proteins obtained from cellular proteomic studies. Six state-of-the-art predictors have been used to test the usefulness of the proposed construction algorithm. A comparative study reporting the predictors' performance on our data sets as well as on three other existing benchmark collections has demonstrated the merits of our data sets as benchmark testing collections. Conclusion: The proposed data set construction algorithm has been shown to be a general and widely applicable scheme. Our comparison with published proteomic studies has shown that the expression of our data set of transcripts generates a polypeptide population that is representative of that obtained from evaluation of biological specimens. Our data set thus represents "real world" transcripts that will allow more accurate evaluation of algorithms dedicated to identification of TISs, as well as other translational regulatory motifs within mRNA sequences. The proposed algorithm aims at compiling a redundancy-free data set by removing redundant copies of homologous proteins. The existence of such data sets may be useful for conducting statistical analyses of protein sequence-structure relations. At the current stage, our approach's focus is to obtain an "average" protein data set for any particular organism without introducing much selection bias. However, with the three major protein structural parameters deeply integrated into the scheme, it would be a trivial task to extend the current method to obtain a more selective protein data set, which may facilitate the study of some particular protein structure.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

PRISM: University of Calgary Digital Repository