
    Predictive modeling of plant messenger RNA polyadenylation sites

    BACKGROUND: One of the essential processing events during pre-mRNA maturation is the post-transcriptional addition of a polyadenine [poly(A)] tail. The 3'-end poly(A) tract protects mRNA from unregulated degradation and signals the integrity of the mRNA through recognition by the mRNA export and translation machinery. The position of a poly(A) site is predetermined by signals in the pre-mRNA sequence that are recognized by a complex of polyadenylation factors. These signals are generally tripartite sequence patterns around the cleavage site that serves as the future poly(A) site. In plants, there is little sequence conservation among these signal elements, which makes it difficult to develop an accurate algorithm to predict the poly(A) site of a given gene. We attempted to solve this problem. RESULTS: Based on our current working model and the nucleotide distribution profile of the poly(A) signals and of the regions around poly(A) sites in Arabidopsis, we devised a Generalized Hidden Markov Model-based algorithm to predict potential poly(A) sites. The high specificity and sensitivity of the algorithm were demonstrated by testing several datasets; at the best combinations, both reach 97%. The accuracy of the program, called poly(A) site sleuth or PASS, was demonstrated by the prediction of many validated poly(A) sites. PASS also predicted the changes in poly(A) site efficiency in poly(A) signal mutants that were constructed and characterized by traditional genetic experiments. The efficacy of PASS was further demonstrated by predicting poly(A) sites within long genomic sequences. CONCLUSION: Based on the features of plant poly(A) signals, a computational model was built to effectively predict the poly(A) sites in Arabidopsis genes. The algorithm will be useful in gene annotation because a poly(A) site signifies the end of the transcript. This algorithm can also be used to predict alternative poly(A) sites in known genes, and will be useful in the design of transgenes for crop genetic engineering by predicting and eliminating undesirable poly(A) sites.
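
The sensitivity and specificity figures quoted in this abstract can be illustrated with a minimal sketch. It assumes the standard definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)); the abstract does not state which definitions the authors used, and the function name and label encoding below are hypothetical.

```python
def sensitivity_specificity(true_labels, predicted_labels):
    """Return (sensitivity, specificity) for binary labels.

    Assumed encoding (not from the source): 1 = poly(A) site, 0 = non-site.
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == 1 and pred == 1:
            tp += 1  # true site, predicted as site
        elif truth == 1 and pred == 0:
            fn += 1  # true site, missed
        elif truth == 0 and pred == 0:
            tn += 1  # non-site, correctly rejected
        else:
            fp += 1  # non-site, predicted as site
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity
```

Because the two measures trade off against each other as the score threshold moves, a statement such as "both reach 97%" refers to one particular operating point.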

    Isolation of Alcohol Dehydrogenase

    Alcohol dehydrogenase (Adh) is a versatile enzyme involved in many biochemical pathways in plants, such as those of germination and stress tolerance. Sago palm is a plant of great importance to the state of Sarawak: it is one of the state's most important revenue-generating crops and has the advantage of withstanding various biotic and abiotic stresses such as heat, pathogens, and waterlogging. Here we report the isolation of sago palm Adh cDNA and its putative promoter region using rapid amplification of cDNA ends (RACE) and genomic walking. The isolated cDNA was characterized and determined to be 1464 bp long, encoding 380 amino acids. BLAST analysis showed that the Adh is similar to the Adh1 group, with 91% and 85% homology with Elaeis guineensis and Washingtonia robusta, respectively. The putative basal msAdh1 regulatory region was further determined to contain the promoter signals of TATA and AGGA boxes, and analysis of the predicted amino acid sequence showed several Adh-specific motifs, such as the two zinc-binding domains and the motifs for binding the adenosine ribose of the coenzyme and the alcohol substrate. A phylogenetic tree constructed from the predicted amino acid sequence showed clear separation of the sago palm Adh from bacterial Adh and its clustering within the plant Adh group.

    Identification of Plant Messenger RNA Polyadenylation Sites Using Length-Variable Second Order Markov Model

    In this paper we adopted a length-variable second-order Markov model to identify plant messenger RNA poly(A) sites, providing a general method that relies only on experimental sequences. The model achieves up to 92% sensitivity and 79% specificity. This method is particularly suitable for predicting poly(A) sites for which prior biological knowledge is lacking and whose signals are poorly conserved, as well as for identifying alternative poly(A) sites in different genetic regions. Compared with other algorithms, such as the generalized hidden Markov model, which requires the signal distributions, and AdaBoost, which requires the construction of signal features around the sites, our model is more versatile.
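
To make the idea of a second-order Markov model concrete, here is a minimal sketch of a plain, fixed-window second-order Markov chain classifier. It is not the authors' length-variable model, whose details the abstract does not give. Each nucleotide is conditioned on the two preceding nucleotides, one model is trained on poly(A)-site sequences and one on background sequences, and a sequence is scored by the log-odds between the two. All function names, the pseudocount, and the assumption of uppercase A/C/G/T input are illustrative.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: sequences contain only uppercase A, C, G, T


def train_second_order(sequences, pseudocount=1.0):
    """Estimate P(x_i | x_{i-2} x_{i-1}) with additive smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(2, len(seq)):
            counts[seq[i - 2:i]][seq[i]] += 1
    model = {}
    for a in ALPHABET:
        for b in ALPHABET:
            ctx = a + b
            total = sum(counts[ctx][n] for n in ALPHABET) + pseudocount * len(ALPHABET)
            model[ctx] = {n: (counts[ctx][n] + pseudocount) / total for n in ALPHABET}
    return model


def log_likelihood(seq, model):
    """Sum of log transition probabilities over the sequence."""
    return sum(math.log(model[seq[i - 2:i]][seq[i]]) for i in range(2, len(seq)))


def log_odds(seq, site_model, background_model):
    """Positive values favour the poly(A)-site model over the background."""
    return log_likelihood(seq, site_model) - log_likelihood(seq, background_model)
```

A sequence would be called a candidate poly(A) site when its log-odds score exceeds a threshold chosen on held-out data; moving that threshold trades sensitivity against specificity.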

    RECOGNITION OF POLYADENYLATION SITES FROM ARABIDOPSIS GENOMIC SEQUENCES


    Riboswitches as hormone receptors: hypothetical cytokinin-binding riboswitches in Arabidopsis thaliana

    Background: Riboswitches are mRNA elements that change conformation when bound to small molecules. They are known to be key regulators of biosynthetic pathways in both prokaryotes and eukaryotes. Presentation of the hypothesis: The hypothesis presented here is that riboswitches function as receptors in hormone perception. We propose that riboswitches initiate or integrate signaling cascades upon binding to classic signaling molecules. The molecular interactions for ligand binding and gene expression control would be the same as for biosynthetic pathways, but the context and the cadre of ligands to consider are dramatically different. The hypothesis arose from the observation that a compound used to identify adenine-binding RNA sequences is chemically similar to the classic plant hormone, or growth regulator, cytokinin. A general tenet of the hypothesis is that riboswitch-binding metabolites can be used to make predictions about chemically related signaling molecules. In fact, all cell-permeable signaling compounds can be considered as potential riboswitch ligands. The hypothesis is plausible, as demonstrated by a cursory review of the transcriptome and genome of the model plant Arabidopsis thaliana for transcripts that (i) contain an adenine aptamer motif and (ii) are also predicted to be cytokinin-regulated. Here, one gene, CRK10 (for Cysteine-rich Receptor-like Kinase 10, At4g23180), contains an adenine aptamer-related sequence and is down-regulated by cytokinin approximately three-fold in public gene expression data. To illustrate the hypothesis, implications of cytokinin binding to the CRK10 mRNA are discussed. Testing the hypothesis: At the broadest level, screening various cell-permeable signaling molecules against random RNA libraries and comparing hits to sequence and gene expression databases could determine how broadly the hypothesis applies. Specific cases, such as CRK10 presented here, will require experimental validation of direct ligand binding, altered RNA conformation, and effect on gene expression. Each case will be different depending on the signaling pathway and the physiology involved. Implications of the hypothesis: This would be a very direct signal perception mechanism for regulating gene expression, rivaling animal steroid hormone receptors, which are frequently ligand-dependent transcription initiation factors. Riboswitch-regulated responses could occur by modulating target RNA stability, translatability, and alternative splicing, all of which are known expression platforms used in riboswitches. The specific illustration presented, CRK10, implies a new mechanism for the perception of cytokinin, a classic plant hormone. Experimental support for the hypothesis would add breadth to the growing list of important functions attributed to riboswitches. Reviewers: This article was reviewed by Anthony Poole, Rob Knight, and Mikhail Gelfand.

    Genome level analysis of rice mRNA 3′-end processing signals and alternative polyadenylation

    The position of a poly(A) site of eukaryotic mRNA is determined by sequence signals in pre-mRNA and a group of polyadenylation factors. To reveal rice poly(A) signals at a genome level, we constructed a dataset of 55 742 authenticated poly(A) sites and characterized the poly(A) signals. This resulted in identifying the typical tripartite cis-elements, including FUE, NUE and CE, as previously observed in Arabidopsis. The average size of the 3′-UTR was 289 nucleotides. When mapped to the genome, however, 15% of these poly(A) sites were found to be located in the currently annotated intergenic regions. Moreover, an extensive alternative polyadenylation profile was evident where 50% of the genes analyzed had more than one unique poly(A) site (excluding microheterogeneity sites), and 13% had four or more poly(A) sites. About 4% of the analyzed genes possessed alternative poly(A) sites at their introns, 5′-UTRs, or protein coding regions. The authenticity of these alternative poly(A) sites was partially confirmed using MPSS data. Analysis of nucleotide profile and signal patterns indicated that there may be a different set of poly(A) signals for those poly(A) sites found in the coding regions. Based on the features of rice poly(A) signals, an updated algorithm termed PASS-Rice was designed to predict poly(A) sites

    Genome-wide characterization of intergenic polyadenylation sites redefines gene spaces in Arabidopsis thaliana

    Background: Messenger RNA polyadenylation is an essential step for the maturation of most eukaryotic mRNAs. Accurate determination of poly(A) sites helps define the 3’-ends of genes, which is important for genome annotation and gene function research. Genomic studies have revealed the presence of poly(A) sites in intergenic regions, which may be attributed to 3’-UTR extensions and novel transcript units. However, there has been no systematic evaluation of intergenic poly(A) sites in plants. Results: Approximately 16,000 intergenic poly(A) site clusters (IPACs) in Arabidopsis thaliana were discovered and evaluated at the whole-genome level. Based on the distributions of distance from IPACs to nearby sense and antisense genes, these IPACs were classified into three categories. About 70% of them were from previously unannotated 3’-UTR extensions to known genes, which would extend 6985 transcripts of the TAIR10 genome annotation beyond their 3’-ends, with a mean extension of 134 nucleotides. A further 1317 IPACs originated from novel intergenic transcripts, 37 of which were likely to be associated with protein-coding transcripts. Another 2957 IPACs corresponded to antisense transcripts for genes on the reverse strand, which might affect 2265 protein-coding genes and 39 non-protein-coding genes, including long non-coding RNA genes. The remaining IPACs could have originated from transcriptional read-through or gene mis-annotations. Conclusions: The identified IPACs corresponding to novel transcripts, 3’-UTR extensions, and antisense transcription should be incorporated into the current Arabidopsis genome annotation. Comprehensive characterization of IPACs from this study provides insights into alternative polyadenylation and antisense transcription in plants. Funding support was provided in part by the US National Science Foundation (No. 1541737 to QQL) and the Hundred Talent Plans of Fujian Province and Xiamen City (to QQL). This project was also funded by the National Natural Science Foundation of China (Nos. 61201358 and 61174161), the Natural Science Foundation of Fujian Province of China (No. 2012J01154), the Specialized Research Fund for the Doctoral Program of Higher Education of China (Nos. 20120121120038 and 20130121130004), and the Fundamental Research Funds for the Central Universities in China (Xiamen University: Nos. 2013121025, 201412G009, and 2014X0234).

    Deep learning methods for mining genomic sequence patterns

    With the growing availability of large-scale genomic datasets and advanced computational techniques, more and more data-driven computational methods have been developed to analyze genomic data and help solve incompletely understood biological problems. Among them, deep learning methods have been proposed to automatically learn and recognize the functional activity of DNA sequences from genomics data. Techniques for efficiently mining genomic sequence patterns will help to improve our understanding of gene regulation, and thus accelerate our progress toward using personal genomes in medicine. This dissertation focuses on the development of deep learning methods for mining genomic sequences. First, we compare the performance of deep learning models and traditional machine learning methods in recognizing various genomic sequence patterns. Through extensive experiments on both simulated data and real genomic sequence data, we demonstrate that an appropriate deep learning model can generally be built to recognize various genomic sequence patterns successfully. Next, we develop deep learning methods to help solve two specific biological problems: (1) inference of the polyadenylation code and (2) tRNA gene detection and functional prediction. Polyadenylation is a pervasive mechanism used by eukaryotes for regulating mRNA transcription, localization, and translation efficiency. Polyadenylation signals in plants are particularly noisy and challenging to decipher. A deep convolutional neural network approach, DeepPolyA, is proposed to predict poly(A) sites from genomic sequences of the plant Arabidopsis thaliana. It employs various deep neural network architectures and demonstrates its superiority in comparison with competing methods, including classical machine learning algorithms and several popular deep learning models. Transfer RNAs (tRNAs) represent a highly complex class of genes and play a central role in protein translation. tRNAscan-SE remains the de facto tool for identifying tRNA genes encoded in genomes. Despite its popularity and success, tRNAscan-SE is still not powerful enough to separate tRNAs from pseudo-tRNAs, and a significant number of false positives can be output as a result. To address this issue, tRNA-DL, a hybrid approach combining a convolutional neural network and a recurrent neural network, is proposed. It is shown that the proposed method can substantially reduce the false positive rate of the state-of-the-art tRNA prediction tool tRNAscan-SE. Coupled with tRNAscan-SE, tRNA-DL can serve as a useful complementary tool for tRNA annotation. Taken together, the experiments and applications demonstrate the superiority of deep learning in automatic feature generation for characterizing genomic sequence patterns.

    Identification of Polyadenylation Sites within Arabidopsis Thaliana

    Machine Learning (ML) is a field of artificial intelligence focused on the design and implementation of algorithms that enable the creation of models for clustering, classification, prediction, ranking, and similar inference tasks based on information contained in data. Many ML algorithms have been successfully utilized in a variety of applications. The problem addressed in this thesis is from the field of bioinformatics and deals with the recognition of polyadenylation (poly(A)) sites in the genomic sequence of the plant Arabidopsis thaliana. During RNA processing, a tail consisting of a number of consecutive adenine (A) nucleotides is added to the terminal nucleotide of the 3’-untranslated region (3’UTR) of the primary RNA. The process in which these A nucleotides are added is called polyadenylation. The location in the genomic DNA sequence that corresponds to the start of the terminal A nucleotides (i.e., to the end of the 3’UTR) is known as a poly(A) site. Recognition of poly(A) sites in DNA sequence is important for better gene annotation and understanding of gene regulation. In this study, we built an artificial neural network (ANN) for the recognition of poly(A) sites in the Arabidopsis thaliana genome. Our study demonstrates that this model achieves improved accuracy compared to the existing predictive models for this purpose. The key factor contributing to the enhanced predictive performance of our ANN model is the distinguishing set of features used in creating the model. These features include a number of relevant physico-chemical characteristics, such as dinucleotide thermodynamic characteristics and electron-ion interaction potential, as well as many statistical properties of the DNA sequences from the region surrounding the poly(A) site, such as nucleotide and polynucleotide properties and common motifs. Our ANN model was compared in performance with several other ML models, as well as with the PAC tool, which was developed specifically for poly(A) site recognition in Arabidopsis thaliana and rice. The comparative analysis shows that our model outperforms the others available, achieving 93% accuracy on average.
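
As an illustration of the "statistical properties" feature family mentioned in this abstract, the sketch below computes normalized k-mer (here dinucleotide) frequencies for a sequence window around a candidate site. It is an assumed, simplified example and not the thesis's actual feature set; the physico-chemical features (for example, thermodynamic values and electron-ion interaction potential) are not shown, and the function name is hypothetical.

```python
from itertools import product


def kmer_frequencies(seq, k=2, alphabet="ACGT"):
    """Return the relative frequency of every k-mer, in lexicographic order.

    Windows containing characters outside `alphabet` (e.g. N) are skipped.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
            windows += 1
    # Normalize so the vector is comparable across windows of different length.
    return [counts[m] / windows if windows else 0.0 for m in kmers]
```

For k = 2 this yields a 16-element vector per sequence window, which could be concatenated with other features and passed to a classifier such as a neural network.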