9 research outputs found

    Translation initiation site prediction on a genomic scale: beauty in simplicity

    Motivation: The correct identification of translation initiation sites (TIS) remains a challenging problem for automated computational methods. Furthermore, the lion's share of these computational techniques focuses on the identification of TIS in transcript data. However, in the gene prediction context the identification of TIS occurs on the genomic level, which makes things even harder because at the genome level many more pseudo-TIS occur, so models produce a higher number of false positive predictions. Results: In this article, we evaluate the performance of several 'simple' TIS recognition methods at the genomic level, and compare them to state-of-the-art models for TIS prediction in transcript data. We conclude that the simple methods largely outperform the complex ones at the genomic scale, and we propose a new model for TIS recognition at the genome level that combines the strengths of these simple models. The new model obtains a false positive rate of 0.125 at a sensitivity of 0.80 on a well-annotated human chromosome (chromosome 21). Detailed analyses show that the model is useful, both on its own and in a simple gene prediction setting.
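The abstract above reports a false positive rate at a fixed sensitivity. As a minimal sketch of how such a figure can be computed from classifier scores, the following assumes that higher scores mean "more TIS-like"; the function name `fpr_at_sensitivity` and all scores are hypothetical and are not taken from the article.

```python
import math


def fpr_at_sensitivity(pos_scores, neg_scores, sensitivity=0.80):
    """Return (false positive rate, threshold) at the target sensitivity.

    pos_scores: scores of true TIS; neg_scores: scores of pseudo-TIS.
    Assumption: a candidate is predicted as TIS when score >= threshold.
    """
    # Number of true sites that must be recovered to reach the target.
    # The small tolerance guards against floating-point round-up.
    needed = max(1, math.ceil(sensitivity * len(pos_scores) - 1e-9))
    # The threshold is the score of the last true site we need to keep.
    threshold = sorted(pos_scores, reverse=True)[needed - 1]
    false_pos = sum(1 for s in neg_scores if s >= threshold)
    return false_pos / len(neg_scores), threshold


# Hypothetical scores, for illustration only.
true_sites = [0.9, 0.8, 0.7, 0.6, 0.2]
pseudo_sites = [0.95, 0.5, 0.4, 0.3, 0.1, 0.65, 0.05, 0.0]
print(fpr_at_sensitivity(true_sites, pseudo_sites, 0.80))
```

On genomic data the pseudo-TIS list is far longer than the true-site list, so even a small false positive rate can mean many false predictions in absolute terms.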

    MetWAMer: eukaryotic translation initiation site prediction

    Background: Translation initiation site (TIS) identification is an important aspect of the gene annotation process, requisite for the accurate delineation of protein sequences from transcript data. We have developed the MetWAMer package for TIS prediction in eukaryotic open reading frames of non-viral origin. MetWAMer can be used as a stand-alone, third-party tool for post-processing gene structure annotations generated by external computational programs and/or pipelines, or directly integrated into gene structure prediction software implementations. Results: MetWAMer currently implements five distinct methods for TIS prediction, the most accurate of which is a routine that combines weighted, signal-based translation initiation site scores and the contrast in coding potential of sequences flanking TISs using a perceptron. Also, our program implements clustering capabilities through use of the k-medoids algorithm, thereby enabling cluster-specific TIS parameter utilization. In practice, our static weight array matrix-based indexing method for parameter set lookup can be used with good results in data sets exhibiting moderate levels of 5'-complete coverage. Conclusion: We demonstrate that improvements in statistically-based models for TIS prediction can be achieved by taking the class of each potential start-methionine into account pending certain testing conditions, and that our perceptron-based model is suitable for the TIS identification task. MetWAMer represents a well-documented, extensible, and freely available software system that can be readily re-trained for differing target applications and/or extended with existing and novel TIS prediction methods, to support further research efforts in this area.
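The abstract above describes combining a signal-based score with a coding-potential contrast using a perceptron. The sketch below shows the classic perceptron learning rule on two such features; it is not MetWAMer's implementation, and the function names, feature values and hyperparameters are hypothetical.

```python
def train_perceptron(samples, labels, epochs=20, lr=0.1):
    """Train a two-input perceptron with the classic error-driven update.

    samples: list of (signal_score, coding_contrast) pairs (hypothetical features).
    labels: 1 for a true TIS, 0 for a pseudo-TIS.
    """
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(samples, labels):
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = y - pred  # 0 when correct, +1 or -1 when wrong
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b


def predict(w, b, x):
    """Classify one (signal_score, coding_contrast) pair."""
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0


# Hypothetical, linearly separable toy data.
samples = [(2.0, 1.5), (1.8, 1.0), (-1.0, -0.5), (-1.5, -1.2)]
labels = [1, 1, 0, 0]
w, b = train_perceptron(samples, labels)
print([predict(w, b, x) for x in samples])
```

A single perceptron only separates classes that are linearly separable in the chosen features, which is one reason the choice of input scores matters more than the learning rule itself.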

    Improvement in the prediction of the translation initiation site through balancing methods, inclusion of acquired knowledge and addition of features to sequences of mRNA

    Background: The accurate prediction of the initiation of translation in sequences of mRNA is an important activity for genome annotation. However, obtaining an accurate prediction is not always a simple task and can be modeled as a problem of classification between positive sequences (protein codifiers) and negative sequences (non-codifiers). The problem is highly imbalanced because each molecule of mRNA has a unique translation initiation site and various others that are not initiators. Therefore, this study focuses on the problem from the perspective of balancing classes and we present an undersampling balancing method, M-clus, which is based on clustering. The method also adds features to sequences and improves the performance of the classifier through the inclusion of knowledge obtained by the model, called InAKnow. Results: Through this methodology, the measures of performance used (accuracy, sensitivity, specificity and adjusted accuracy) are greater than 93% for the Mus musculus and Rattus norvegicus organisms, and varied between 72.97% and 97.43% for the other organisms evaluated: Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens, Nasonia vitripennis. The precision increases significantly by 39% and 22.9% for Mus musculus and Rattus norvegicus, respectively, when the knowledge obtained by the model is included. For the other organisms, the precision increases by between 37.10% and 59.49%. The inclusion of certain features during training, for example, the presence of ATG in the upstream region of the Translation Initiation Site, improves the rate of sensitivity by approximately 7%. Using the M-clus balancing method generates a significant increase in the rate of sensitivity from 51.39% to 91.55% (Mus musculus) and from 47.45% to 88.09% (Rattus norvegicus). Conclusions: In order to solve the problem of TIS prediction, the results indicate that the methodology proposed in this work is adequate, particularly when using the concept of acquired knowledge which increased the accuracy in all databases evaluated.
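The measures named in the abstract above can all be derived from a confusion matrix. The sketch below is illustrative only: the counts are hypothetical, and "adjusted accuracy" is assumed here to be the mean of sensitivity and specificity, which the abstract itself does not define.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common measures from confusion-matrix counts.

    tp/fn: true TIS predicted as TIS / as non-TIS.
    tn/fp: non-TIS predicted as non-TIS / as TIS.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": tp / (tp + fp),
        # Assumption: adjusted accuracy = mean of sensitivity and specificity.
        "adjusted_accuracy": (sensitivity + specificity) / 2,
    }


# Hypothetical imbalanced data set: 100 true TIS against 1000 non-TIS.
for name, value in classification_metrics(tp=90, fp=100, tn=900, fn=10).items():
    print(f"{name}: {value:.4f}")
```

With these counts, accuracy, sensitivity and specificity are all 0.90 while precision stays below 0.50, which shows why precision is reported separately when negatives greatly outnumber positives.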

    Knowledge Discovery with Bayesian Networks

    Ph.D. (Doctor of Philosophy)

    Computational annotation of eukaryotic gene structures: algorithms development and software systems

    An important foundation for the advancement of both basic and applied biological science is correct annotation of protein-coding gene repertoires in model organisms. Accurate automated annotation of eukaryotic gene structures remains a challenging, open-ended and critical problem for modern computational biology. The use of extrinsic (homology) information has been shown to be a quite successful strategy for this task, though it is not a perfect solution, for a variety of reasons. More recently, gene prediction methods leveraging information present in syntenic genomic sequences have become favorable, though these, too, have limitations. Identifying genes by inspection of genomic sequence alone thoroughly tests our theoretical understanding of the gene recognition process as it occurs in vivo, and where we encounter failure, excellent opportunities for meaningful research are revealed. Therefore, the continued development of methods not reliant on homology information (the so-called ab initio gene prediction methods) should help to more rapidly achieve a comprehensive understanding of gene content in our model organisms, at least. This thesis explores the development of novel algorithms in an attempt to advance the current state-of-the-art in gene prediction, with particular emphasis on ab initio approaches. The work has been conducted with an eye towards contributing open source, well-documented, and extensible software systems implementing the methods, and to generate novel biological knowledge with respect to plant taxa, in particular.

    Genome annotation and evolution of chemosensory receptors in spider mites

    Understanding the evolution of species and speciation, the mechanism producing the diversity of life on Earth, has always fascinated scientists. In recent years, advances in next generation sequencing techniques, together with the development of data analysis software tools, have allowed us to sequence and analyze the genomes of many species and reconstruct their evolutionary history. We can detect the evolutionary changes of a group of species or of different populations of a single species. In this thesis, we perform studies on three spider mite genomes, Tetranychus urticae, Tetranychus evansi and Tetranychus lintearius. The spider mites belong to the Chelicerata, the second largest group of arthropods after insects. While many insect genomes have already been sequenced and analyzed, Tetranychus urticae represents the first complete chelicerate genome. This thesis is organized into five chapters. The introductory Chapter 1 provides an overview of the explosion of genome sequences driven by the fast development of next generation sequencing techniques, describes genome annotation information, methods and pipelines to give biological meaning to these genomes, and explains the importance of genome-based research for the evolution of arthropod-plant interactions. In addition, a short overview of the chemosensory receptors is provided, since in the thesis we have particularly studied the annotation and evolution of this gene family in three different spider mites. Chapter 2 provides the results of annotation and analysis of the Tetranychus urticae genome (London strain). T. urticae represents one of the most polyphagous arthropod herbivores, feeding on more than 1,100 plant species including species known to produce toxic compounds. We have annotated the T. urticae genome with support of RNA-seq data and made it publicly available to the research community. The T. urticae genome sequence reveals herbivorous pest adaptations with strong signatures of polyphagy and detoxification in gene families associated with feeding on different hosts and in new gene families acquired by lateral gene transfer. Moreover, how this pest responds to a changing host environment is shown by deep transcriptome analysis of T. urticae feeding on different plants. Thus, the T. urticae genome sequence opens up new avenues for understanding the evolution of arthropods as well as the fundamentals of plant–herbivore interactions. The next two chapters (Chapter 3 and Chapter 4) present studies on the annotation and evolution of chemosensory receptors (CRs) in three different spider mites. Chemosensory receptors help animals to detect certain chemical components in their environment to find food, to locate shelter, mates and offspring, and to avoid danger. In Chapter 3, starting from Daphnia and insect chemosensory receptors, we describe mining the T. urticae genome for putative chemosensory receptors, including the ones related to insect gustatory receptors (GRs), the ionotropic receptors (IRs) and the epithelial Na+ channels (ENaCs). T. urticae has a huge repertoire of GRs, many more than the total number of GRs and odorant receptors (ORs) found to date in any other arthropod. As in Daphnia pulex, we observed a complete lack of ORs in T. urticae. This is consistent with the hypothesis that ORs are an insect-specific class of GR-related chemosensory receptors. Furthermore, we compare chemosensory receptor genes among three strains (London, Montpellier, and EtoxR). We find that GR genes that are intact in some T. urticae populations appear to be inactivated in other populations. Next, in Chapter 4, we describe the annotation of GR genes in T. evansi and T. lintearius, and the evolutionary analysis of this gene family in the three spider mites. We identify many GR gene expansions in the polyphagous T. urticae, a few gene expansions and many gene losses in the oligophagous T. evansi, and no gene expansion but also many gene losses in the monophagous T. lintearius. Finally, general remarks are discussed in Chapter 5.

    Neutrophil count prediction in childhood cancer patients receiving 6-mercaptopurine chemotherapy treatment

    Acute Lymphoblastic Leukaemia (ALL) is a common form of blood cancer, usually affecting children under 15 years of age. Chemotherapy treatment for ALL is delivered in three phases, viz. induction (to achieve initial remission), intensification (to kill the majority of abnormal cells), and finally, maintenance. The maintenance phase involves oral administration of the chemotherapy drug 6-Mercaptopurine (6-MP) in varying doses to destroy any remaining abnormal cells and prevent recurrence. A key side effect of the treatment is a reduction in neutrophil counts that can result in a condition known as neutropenia, i.e. a weakened immune system. This carries a risk of secondary infection and has been linked to 60% of ALL fatalities. Current practice aims to control neutrophil counts by varying 6-MP dosages on a weekly basis based on blood counts. However, its success is varied. This thesis proposes a number of intelligent prediction methods to predict neutrophil counts one week ahead more accurately, using blood count data and the corresponding 6-MP dosing regimens. Firstly, a well-known and robust neural network (Nonlinear Autoregressive Exogenous) is applied to blood count data to provide an initial assessment of the feasibility of such an approach. A comparative analysis of a series of more complex algorithms is then considered for more advanced, in-depth analysis, viz. Multi-Layer Perceptron (MLP) and Support Vector Machines (SVM). Both methods are shown to have a prediction accuracy of around 60% on the first sample period, with the MLP also having a prediction accuracy of more than 60% in the second sample period in seven out of ten blood data points (there were ten time-series blood data predictions). However, in comparison the accuracy of SVM is relatively low. Finally, an incremental learning-based approach is proposed to increase the accuracy of the system and provide a realistic framework for real-time implementation. The accuracy is shown to improve considerably as more data is added, and the predicted neutrophil data is shown to follow the trend of the actual neutrophil counts.

    Genome sequences analysis using neural networks

    Thesis (M.Sc. (Computer Science))--Prince of Songkla University, 255

    Translation initiation sites prediction with mixture Gaussian models in human cDNA sequences

    DOI: 10.1109/TKDE.2005.133. IEEE Transactions on Knowledge and Data Engineering, 17(8), 1152-1160 (ITKE)