6 research outputs found

    An unsupervised classification scheme for improving predictions of prokaryotic TIS

    Get PDF
    BACKGROUND: Although it is not difficult for state-of-the-art gene finders to identify coding regions in prokaryotic genomes, exact prediction of the corresponding translation initiation sites (TIS) is still a challenging problem. Recently a number of post-processing tools have been proposed for improving the annotation of prokaryotic TIS. However, inherent difficulties of these approaches arise from the considerable variation of TIS characteristics across different species. Therefore prior assumptions about the properties of prokaryotic gene starts may cause suboptimal predictions for newly sequenced genomes with TIS signals differing from those of well-investigated genomes. RESULTS: We introduce a clustering algorithm for completely unsupervised scoring of potential TIS, based on positionally smoothed probability matrices. The algorithm requires an initial gene prediction and the genomic sequence of the organism to perform the reannotation. As compared with other methods for improving predictions of gene starts in bacterial genomes, our approach is not based on any specific assumptions about prokaryotic TIS. Despite the generality of the underlying algorithm, the prediction rate of our method is competitive on experimentally verified test data from E. coli and B. subtilis. Regarding genomes with high G+C content, in contrast to some previously proposed methods, our algorithm also provides good performance on P. aeruginosa, B. pseudomallei and R. solanacearum. CONCLUSION: On reliable test data we showed that our method provides good results in post-processing the predictions of the widely-used program GLIMMER. The underlying clustering algorithm is robust with respect to variations in the initial TIS annotation and does not require specific assumptions about prokaryotic gene starts. These features are particularly useful on genomes with high G+C content. The algorithm has been implemented in the tool »TICO«(TIs COrrector) which is publicly available from our web site

    TICO: a tool for postprocessing the predictions of prokaryotic translation initiation sites

    Get PDF
    Exact localization of the translation initiation sites (TIS) in prokaryotic genomes is difficult to achieve using conventional gene finders. We recently introduced the program TICO for postprocessing TIS predictions based on a completely unsupervised learning algorithm. The program can be utilized through our web interface at and it is also freely available as a commandline version for Linux and Windows. The latest version of our program provides a tool for visualization of the resulting TIS model. Although the underlying method is not based on any specific assumptions about characteristic sequence features of prokaryotic TIS the prediction rates of our tool are competitive on experimentally verified test data

    Computational evaluation of TIS annotation for prokaryotic genomes

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Accurate annotation of translation initiation sites (TISs) is essential for understanding the translation initiation mechanism. However, the reliability of TIS annotation in widely used databases such as RefSeq is uncertain due to the lack of experimental benchmarks.</p> <p>Results</p> <p>Based on a homogeneity assumption that gene translation-related signals are uniformly distributed across a genome, we have established a computational method for a large-scale quantitative assessment of the reliability of TIS annotations for any prokaryotic genome. The method consists of modeling a positional weight matrix (PWM) of aligned sequences around predicted TISs in terms of a linear combination of three elementary PWMs, one for true TIS and the two others for false TISs. The three elementary PWMs are obtained using a reference set with highly reliable TIS predictions. A generalized least square estimator determines the weighting of the true TIS in the observed PWM, from which the accuracy of the prediction is derived. The validity of the method and the extent of the limitation of the assumptions are explicitly addressed by testing on experimentally verified TISs with variable accuracy of the reference sets. The method is applied to estimate the accuracy of TIS annotations that are provided on public databases such as RefSeq and ProTISA and by programs such as EasyGene, GeneMarkS, Glimmer 3 and TiCo. It is shown that RefSeq's TIS prediction is significantly less accurate than two recent predictors, Tico and ProTISA. With convincing proofs, we show two general preferential biases in the RefSeq annotation, <it>i.e</it>. over-annotating the longest open reading frame (LORF) and under-annotating ATG start codon. Finally, we have established a new TIS database, SupTISA, based on the best prediction of all the predictors; SupTISA has achieved an average accuracy of 92% over all 532 complete genomes.</p> <p>Conclusion</p> <p>Large-scale computational evaluation of TIS annotation has been achieved. A new TIS database much better than RefSeq has been constructed, and it provides a valuable resource for further TIS studies.</p

    Gene prediction in metagenomic fragments: A large scale machine learning approach

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Metagenomics is an approach to the characterization of microbial genomes via the direct isolation of genomic sequences from the environment without prior cultivation. The amount of metagenomic sequence data is growing fast while computational methods for metagenome analysis are still in their infancy. In contrast to genomic sequences of single species, which can usually be assembled and analyzed by many available methods, a large proportion of metagenome data remains as unassembled anonymous sequencing reads. One of the aims of all metagenomic sequencing projects is the identification of novel genes. Short length, for example, Sanger sequencing yields on average 700 bp fragments, and unknown phylogenetic origin of most fragments require approaches to gene prediction that are different from the currently available methods for genomes of single species. In particular, the large size of metagenomic samples requires fast and accurate methods with small numbers of false positive predictions.</p> <p>Results</p> <p>We introduce a novel gene prediction algorithm for metagenomic fragments based on a two-stage machine learning approach. In the first stage, we use linear discriminants for monocodon usage, dicodon usage and translation initiation sites to extract features from DNA sequences. In the second stage, an artificial neural network combines these features with open reading frame length and fragment GC-content to compute the probability that this open reading frame encodes a protein. This probability is used for the classification and scoring of gene candidates. With large scale training, our method provides fast single fragment predictions with good sensitivity and specificity on artificially fragmented genomic DNA. Additionally, this method is able to predict translation initiation sites accurately and distinguishes complete from incomplete genes with high reliability.</p> <p>Conclusion</p> <p>Large scale machine learning methods are well-suited for gene prediction in metagenomic DNA fragments. In particular, the combination of linear discriminants and neural networks is promising and should be considered for integration into metagenomic analysis pipelines. The data sets can be downloaded from the URL provided (see Availability and requirements section).</p

    MetWAMer: eukaryotic translation initiation site prediction

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Translation initiation site (TIS) identification is an important aspect of the gene annotation process, requisite for the accurate delineation of protein sequences from transcript data. We have developed the MetWAMer package for TIS prediction in eukaryotic open reading frames of non-viral origin. MetWAMer can be used as a stand-alone, third-party tool for post-processing gene structure annotations generated by external computational programs and/or pipelines, or directly integrated into gene structure prediction software implementations.</p> <p>Results</p> <p>MetWAMer currently implements five distinct methods for TIS prediction, the most accurate of which is a routine that combines weighted, signal-based translation initiation site scores and the contrast in coding potential of sequences flanking TISs using a perceptron. Also, our program implements clustering capabilities through use of the <it>k</it>-medoids algorithm, thereby enabling cluster-specific TIS parameter utilization. In practice, our static weight array matrix-based indexing method for parameter set lookup can be used with good results in data sets exhibiting moderate levels of 5'-complete coverage.</p> <p>Conclusion</p> <p>We demonstrate that improvements in statistically-based models for TIS prediction can be achieved by taking the class of each potential start-methionine into account pending certain testing conditions, and that our perceptron-based model is suitable for the TIS identification task. MetWAMer represents a well-documented, extensible, and freely available software system that can be readily re-trained for differing target applications and/or extended with existing and novel TIS prediction methods, to support further research efforts in this area.</p

    Computational annotation of eukaryotic gene structures: algorithms development and software systems

    Get PDF
    An important foundation for the advancement of both basic and applied biological science is correct annotation of protein-coding gene repertoires in model organisms. Accurate automated annotation of eukaryotic gene structures remains a challenging, open-ended and critical problem for modern computational biology.;The use of extrinsic (homology) information has been shown as a quite successful strategy for this task, though it is not a perfect solution, for a variety of reasons. More recently, gene prediction methods leveraging information present in syntenic genomic sequences have become favorable, though these too, have limitations.;Identifying genes by inspection of genomic sequence alone thoroughly tests our theoretical understanding of the gene recognition process as it occurs in vivo, and where we encounter failure, excellent opportunities for meaningful research are revealed.;Therefore, the continued development of methods not reliant on homology information---the so-called ab initio gene prediction methods---should help to more rapidly achieve a comprehensive understanding of gene content in our model organisms, at least.;This thesis explores the development of novel algorithms in an attempt to advance the current state-of-the-art in gene prediction, with particular emphasis on ab initio approaches.;The work has been conducted with an eye towards contributing open source, well-documented, and extensible software systems implementing the methods, and to generate novel biological knowledge with respect to plant taxa, in particular
    corecore