33 research outputs found

    Translation initiation site prediction on a genomic scale : beauty in simplicity

    Get PDF
    Motivation: The correct identification of translation initiation sites (TIS) remains a challenging problem for computational methods that automatically try to solve this problem. Furthermore, the lion's share of these computational techniques focuses on the identification of TIS in transcript data. However, in the gene prediction context the identification of TIS occurs on the genomic level, which makes things even harder because at the genome level many more pseudo-TIS occur, resulting in models that achieve a higher number of false positive predictions. Results: In this article, we evaluate the performance of several 'simple' TIS recognition methods at the genomic level, and compare them to state-of-the-art models for TIS prediction in transcript data. We conclude that the simple methods largely outperform the complex ones at the genomic scale, and we propose a new model for TIS recognition at the genome level that combines the strengths of these simple models. The new model obtains a false positive rate of 0.125 at a sensitivity of 0.80 on a well annotated human chromosome ( chromosome 21). Detailed analyses show that the model is useful, both on its own and in a simple gene prediction setting

    MetWAMer: eukaryotic translation initiation site prediction

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Translation initiation site (TIS) identification is an important aspect of the gene annotation process, requisite for the accurate delineation of protein sequences from transcript data. We have developed the MetWAMer package for TIS prediction in eukaryotic open reading frames of non-viral origin. MetWAMer can be used as a stand-alone, third-party tool for post-processing gene structure annotations generated by external computational programs and/or pipelines, or directly integrated into gene structure prediction software implementations.</p> <p>Results</p> <p>MetWAMer currently implements five distinct methods for TIS prediction, the most accurate of which is a routine that combines weighted, signal-based translation initiation site scores and the contrast in coding potential of sequences flanking TISs using a perceptron. Also, our program implements clustering capabilities through use of the <it>k</it>-medoids algorithm, thereby enabling cluster-specific TIS parameter utilization. In practice, our static weight array matrix-based indexing method for parameter set lookup can be used with good results in data sets exhibiting moderate levels of 5'-complete coverage.</p> <p>Conclusion</p> <p>We demonstrate that improvements in statistically-based models for TIS prediction can be achieved by taking the class of each potential start-methionine into account pending certain testing conditions, and that our perceptron-based model is suitable for the TIS identification task. MetWAMer represents a well-documented, extensible, and freely available software system that can be readily re-trained for differing target applications and/or extended with existing and novel TIS prediction methods, to support further research efforts in this area.</p

    Improvement in the prediction of the translation initiation site through balancing methods, inclusion of acquired knowledge and addition of features to sequences of mRNA

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>The accurate prediction of the initiation of translation in sequences of mRNA is an important activity for genome annotation. However, obtaining an accurate prediction is not always a simple task and can be modeled as a problem of classification between positive sequences (protein codifiers) and negative sequences (non-codifiers). The problem is highly imbalanced because each molecule of mRNA has a unique translation initiation site and various others that are not initiators. Therefore, this study focuses on the problem from the perspective of balancing classes and we present an undersampling balancing method, M-clus, which is based on clustering. The method also adds features to sequences and improves the performance of the classifier through the inclusion of knowledge obtained by the model, called InAKnow.</p> <p>Results</p> <p>Through this methodology, the measures of performance used (accuracy, sensitivity, specificity and adjusted accuracy) are greater than 93% for the <it>Mus musculus</it> and <it>Rattus norvegicus</it> organisms, and varied between 72.97% and 97.43% for the other organisms evaluated: <it>Arabidopsis thaliana</it>, <it>Caenorhabditis elegans</it>, <it>Drosophila melanogaster</it>, <it>Homo sapiens</it>, <it>Nasonia vitripennis</it>. The precision increases significantly by 39% and 22.9% for <it>Mus musculus</it> and <it>Rattus norvegicus</it>, respectively, when the knowledge obtained by the model is included. For the other organisms, the precision increases by between 37.10% and 59.49%. The inclusion of certain features during training, for example, the presence of ATG in the upstream region of the Translation Initiation Site, improves the rate of sensitivity by approximately 7%. Using the M-Clus balancing method generates a significant increase in the rate of sensitivity from 51.39% to 91.55% (<it>Mus musculus</it>) and from 47.45% to 88.09% (<it>Rattus norvegicus</it>).</p> <p>Conclusions</p> <p>In order to solve the problem of TIS prediction, the results indicate that the methodology proposed in this work is adequate, particularly when using the concept of acquired knowledge which increased the accuracy in all databases evaluated.</p

    A novel kernel based approach to arbitrary length symbolic data with application to type 2 diabetes risk

    Get PDF
    Predictive modeling of clinical data is fraught with challenges arising from the manner in which events are recorded. Patients typically fall ill at irregular intervals and experience dissimilar intervention trajectories. This results in irregularly sampled and uneven length data which poses a problem for standard multivariate tools. The alternative of feature extraction into equal-length vectors via methods like Bag-of-Words (BoW) potentially discards useful information. We propose an approach based on a kernel framework in which data is maintained in its native form: discrete sequences of symbols. Kernel functions derived from the edit distance between pairs of sequences may then be utilized in conjunction with support vector machines to classify the data. Our method is evaluated in the context of the prediction task of determining patients likely to develop type 2 diabetes following an earlier episode of elevated blood pressure of 130/80 mmHg. Kernels combined via multi kernel learning achieved an F1-score of 0.96, outperforming classification with SVM 0.63, logistic regression 0.63, Long Short Term Memory 0.61 and Multi-Layer Perceptron 0.54 applied to a BoW representation of the data. We achieved an F1-score of 0.97 on MKL on external dataset. The proposed approach is consequently able to overcome limitations associated with feature-based classification in the context of clinical data

    Representative transcript sets for evaluating a translational initiation sites predictor

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Translational initiation site (TIS) prediction is a very important and actively studied topic in bioinformatics. In order to complete a comparative analysis, it is desirable to have several benchmark data sets which can be used to test the effectiveness of different algorithms. An ideal benchmark data set should be reliable, representative and readily available. Preferably, proteins encoded by members of the data set should also be representative of the protein population actually expressed in cellular specimens.</p> <p>Results</p> <p>In this paper, we report a general algorithm for constructing a reliable sequence collection that only includes mRNA sequences whose corresponding protein products present an average profile of the general protein population of a given organism, with respect to three major structural parameters. Four representative transcript collections, each derived from a model organism, have been obtained following the algorithm we propose. Evaluation of these data sets shows that they are reasonable representations of the spectrum of proteins obtained from cellular proteomic studies. Six state-of-the-art predictors have been used to test the usefulness of the construction algorithm that we proposed. Comparative study which reports the predictors' performance on our data set as well as three other existing benchmark collections has demonstrated the actual merits of our data sets as benchmark testing collections.</p> <p>Conclusion</p> <p>The proposed data set construction algorithm has demonstrated its property of being a general and widely applicable scheme. Our comparison with published proteomic studies has shown that the expression of our data set of transcripts generates a polypeptide population that is representative of that obtained from evaluation of biological specimens. Our data set thus represents "real world" transcripts that will allow more accurate evaluation of algorithms dedicated to identification of TISs, as well as other translational regulatory motifs within mRNA sequences. The algorithm proposed by us aims at compiling a redundancy-free data set by removing redundant copies of homologous proteins. The existence of such data sets may be useful for conducting statistical analyses of protein sequence-structure relations. At the current stage, our approach's focus is to obtain an "average" protein data set for any particular organism without posing much selection bias. However, with the three major protein structural parameters deeply integrated into the scheme, it would be a trivial task to extend the current method for obtaining a more selective protein data set, which may facilitate the study of some particular protein structure.</p
    corecore