37 research outputs found

    Identification of Polyadenylation Sites within Arabidopsis Thaliana

    Machine Learning (ML) is a field of artificial intelligence focused on the design and implementation of algorithms that enable the creation of models for clustering, classification, prediction, ranking and similar inference tasks based on information contained in data. Many ML algorithms have been successfully utilized in a variety of applications. The problem addressed in this thesis is from the field of bioinformatics and deals with the recognition of polyadenylation (poly(A)) sites in the genomic sequence of the plant Arabidopsis thaliana. During RNA processing, a tail consisting of a number of consecutive adenine (A) nucleotides is added to the terminal nucleotide of the 3’-untranslated region (3’UTR) of the primary RNA. The process in which these A nucleotides are added is called polyadenylation. The location in the genomic DNA sequence that corresponds to the start of the terminal A nucleotides (i.e. to the end of the 3’UTR) is known as a poly(A) site. Recognition of poly(A) sites in DNA sequence is important for better gene annotation and understanding of gene regulation. In this study, we built an artificial neural network (ANN) for the recognition of poly(A) sites in the Arabidopsis thaliana genome. Our study demonstrates that this model achieves improved accuracy compared to the existing predictive models for this purpose. The key factor contributing to the enhanced predictive performance of our ANN model is the distinguishing set of features used in the creation of the model. These features include a number of relevant physico-chemical characteristics, such as dinucleotide thermodynamic characteristics, electron-ion interaction potential, etc., but also many statistical properties of the DNA sequences from the region surrounding the poly(A) site, such as nucleotide and polynucleotide properties, common motifs, etc.
Our ANN model was compared in performance with several other ML models, as well as with the PAC tool, which was developed specifically for poly(A) site recognition in Arabidopsis thaliana and rice. The comparative analysis shows that our model outperforms the other available models and achieves 93% accuracy on average.
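
To make the feature description above more concrete, here is a minimal Python sketch of how statistical and physico-chemical features might be computed from the region around a candidate poly(A) site. It is an illustration under stated assumptions, not the thesis's actual feature set: the function names, the window size, and the choice of mono- and dinucleotide frequencies plus mean EIIP are all hypothetical, and the EIIP values are the ones commonly tabulated in the literature rather than values taken from the thesis.

```python
from collections import Counter
from itertools import product

# Electron-ion interaction potential (EIIP) per nucleotide, as commonly
# tabulated in the literature. Assumed here; not taken from the thesis.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}


def kmer_frequencies(seq, k):
    """Relative frequency of every k-mer over the alphabet ACGT.

    k-mers containing other symbols (e.g. N) count toward the total
    but are not reported.
    """
    seq = seq.upper()
    total = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(max(total, 0)))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return {kmer: (counts[kmer] / total if total > 0 else 0.0) for kmer in kmers}


def mean_eiip(seq):
    """Average EIIP over the recognised nucleotides of a sequence."""
    values = [EIIP[base] for base in seq.upper() if base in EIIP]
    return sum(values) / len(values) if values else 0.0


def window_features(seq, site, flank=10):
    """Feature vector for the window [site - flank, site + flank).

    Hypothetical layout: 4 mononucleotide frequencies, 16 dinucleotide
    frequencies, then the mean EIIP of the window (21 values in total).
    """
    window = seq[max(0, site - flank):site + flank]
    mono = kmer_frequencies(window, 1)
    di = kmer_frequencies(window, 2)
    features = [mono[k] for k in sorted(mono)]
    features += [di[k] for k in sorted(di)]
    features.append(mean_eiip(window))
    return features
```

A vector like this, computed for known poly(A) sites and for background positions, is the kind of input a classifier such as an ANN could be trained on. The actual model described in the abstract uses a richer feature set, including thermodynamic characteristics and common motifs.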

    INDIGO - INtegrated Data Warehouse of MIcrobial GenOmes with Examples from the Red Sea Extremophiles.

    Background: Next-generation sequencing technologies have substantially increased the throughput of microbial genome sequencing. To functionally annotate newly sequenced microbial genomes, a variety of experimental and computational methods are used. Integration of information from different sources is a powerful approach to enhance such annotation. Functional analysis of microbial genomes, necessary for downstream experiments, crucially depends on this annotation but it is hampered by the current lack of suitable information integration and exploration systems for microbial genomes. Results: We developed a data warehouse system (INDIGO) that enables the integration of annotations for exploration and analysis of newly sequenced microbial genomes. INDIGO offers an opportunity to construct complex queries and combine annotations from multiple sources starting from genomic sequence to protein domain, gene ontology and pathway levels. This data warehouse is aimed at being populated with information from genomes of pure cultures and uncultured single cells of Red Sea bacteria and Archaea. Currently, INDIGO contains information from Salinisphaera shabanensis, Haloplasma contractile, and Halorhabdus tiamatea - extremophiles isolated from deep-sea anoxic brine lakes of the Red Sea. We provide examples of utilizing the system to gain new insights into specific aspects of the unique lifestyle and adaptations of these organisms to extreme environments. Conclusions: We developed a data warehouse system, INDIGO, which enables comprehensive integration of information from various resources to be used for annotation, exploration and analysis of microbial genomes. It will be regularly updated and extended with new genomes. It is intended to serve as a resource dedicated to the Red Sea microbes. In addition, through INDIGO, we provide our Automatic Annotation of Microbial Genomes (AAMG) pipeline.
The INDIGO web server is freely available at http://www.cbrc.kaust.edu.sa/indigo. IA and AAK were supported from the KAUST CBRC Base Fund of VBB. WBa and VBB were supported from the KAUST Base Funds of VBB. US was supported by the KAUST Base Fund of US. This study was partly supported by the Saudi Economic and Development Company (SEDCO) Research Excellence award to US and VBB. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

    Untranslated parts of genes interpreted: making heads or tails of high-throughput transcriptomic data via computational methods

    The fate of eukaryotic transcripts is closely linked to their untranslated regions, which are determined by where transcription starts and ends on a genomic locus. The extent of alternative transcription start and alternative poly-adenylation has been revealed by sequencing methods focused on the ends of transcripts, but these methods are not yet widely adopted by the community. In this review we highlight the importance of defining the untranslated parts of transcripts and suggest that computational methods applied to standard high-throughput technologies are a useful alternative to the expertise-demanding 5’ and 3’ sequencing. We present a number of computational approaches for the discovery and quantification of alternative transcription start and poly-adenylation events, focusing on technical challenges and arguing for the need to include better normalization of the data and more appropriate statistical models of the expected variation in the signal.

    Omni-PolyA: a method and tool for accurate recognition of Poly(A) signals in human genomic DNA

    Background: Polyadenylation is a critical stage of RNA processing during the formation of mature mRNA, and is present in most of the known eukaryote protein-coding transcripts and many long non-coding RNAs. The correct identification of poly(A) signals (PAS) not only helps to elucidate the 3′-end genomic boundaries of a transcribed DNA region and gene regulatory mechanisms but also gives insight into the multiple transcript isoforms resulting from alternative PAS. Although progress has been made in the in-silico prediction of genomic signals, the recognition of PAS in DNA genomic sequences remains a challenge. Results: In this study, we analyzed human genomic DNA sequences for the 12 most common PAS variants. Our analysis has identified a set of features that helps in the recognition of true PAS, which may be involved in the regulation of the polyadenylation process. The proposed features, in combination with a recognition model, resulted in a novel method and tool, Omni-PolyA. Omni-PolyA combines several machine learning techniques such as different classifiers in a tree-like decision structure and genetic algorithms for deriving a robust classification model. We performed a comparison between results obtained by state-of-the-art methods, deep neural networks, and Omni-PolyA. Results show that Omni-PolyA significantly reduced the average classification error rate by 35.37% in the prediction of the 12 considered PAS variants relative to the state-of-the-art results. Conclusions: The results of our study demonstrate that Omni-PolyA is currently the most accurate model for the prediction of PAS in human and can serve as a useful complement to other PAS recognition methods. Omni-PolyA is publicly available as an online tool accessible at www.cbrc.kaust.edu.sa/omnipolya/
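
The 35.37% figure above is a reduction relative to the state-of-the-art error rate, not a drop in percentage points. The short Python sketch below shows how such a relative reduction is computed. The function name and the example error rates are hypothetical and chosen only for illustration. They are not results from the paper, and the abstract does not say how the per-variant values were averaged.

```python
def relative_error_reduction(baseline_error, new_error):
    """Percentage by which new_error is lower than baseline_error.

    Both arguments are classification error rates expressed as
    fractions in [0, 1]; the baseline must be strictly positive.
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error * 100.0


# Hypothetical numbers: a baseline error of 20% lowered to 13%
# is a 7-point absolute drop but roughly a 35% relative reduction.
example = relative_error_reduction(0.20, 0.13)
```

Reading the claim this way matters when comparing tools: a large relative reduction can correspond to a small absolute change when the baseline error is already low.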

    DeepGSR: An optimized deep-learning structure for the recognition of genomic signals and regions

    Recognition of different genomic signals and regions (GSRs) in the DNA is helpful in gaining knowledge to understand genome organization and gene regulation as well as gene function. Accurate recognition of GSRs enables better genome and gene annotation. Although many methods have been developed to recognize GSRs, their purely computational identification remains challenging. Moreover, various GSRs usually require a specialized set of features for developing robust recognition models. Recently, deep-learning (DL) methods have been shown to generate more accurate prediction models than the ‘shallow’ methods without the need to develop specialized features for the problems in question. Here, we explore the potential use of DL for the recognition of GSRs. We developed DeepGSR, an optimized DL architecture for the prediction of different types of GSRs. The performance of the DeepGSR structure is evaluated on the recognition of polyadenylation signals (PAS) and translation initiation sites (TIS) of different organisms: human, mouse, bovine and fruit fly. The results show that DeepGSR outperformed the state-of-the-art methods, reducing the classification error rate of the PAS and TIS prediction in the human genome by up to 29% and 86%, respectively. Moreover, the cross-organism and genome-wide analyses we performed confirmed the robustness of DeepGSR and provided new insights into the conservation of examined GSRs across species.

README: DeepGSR: An optimized deep-learning structure for the recognition of genomic signals and regions.
Version 1.1 13/Dec/2017

WHAT IS IT?
-----------
DeepGSR is a deep-learning model that can be used for the recognition of genomic signals and regions in eukaryotic DNA. It has been applied to polyadenylation signals (PAS) and translation initiation sites (TIS). It takes DNA sequences in FASTA format as input; you can process the data using the provided code.

COMMAND LINE VERSION
--------------------
Here we include the source code of DeepGSR, written in Python using the Keras library with the Theano backend.

INSTALLATION
------------
DeepGSR can run on any Linux platform. To run DeepGSR, install scikit-learn (http://scikit-learn.org/), Keras (https://keras.io/) and, if you want faster processing on GPUs, CUDA. The data used in the paper are in the (Data) folder. DeepGSR can be used in two ways: testing with pre-trained models, or training new models. Each has its own folder.

Open a new terminal, then go to the directory that contains the Python code. For example:
    cd Testing/
or
    cd Training/DeepGSR-2DCNN

To list the command line options, run:
    python CNN_Testing.py -h
or
    python 2DCNN.py -h

EXAMPLE:
--------
Note: all required data is included in this package.

Train DeepGSR on human genome for PAS recognition:
    python 2DCNN.py --inputFile ../../Data/Human/PAS_processed/hs_mixAATAAA_polyA.txt --DataName human_AATAAA --FileName human_AATAAA

Train DeepGSR on human genome for TIS recognition:
    python 2DCNN.py --inputFile ../../Data/Human/TIS_processed/hs_mixATG_TIS.txt --DataName human_ATG --FileName human_ATG

Test DeepGSR on mouse genome using human trained model for PAS recognition:
    python CNN_Testing.py --inputFile ../../Data/Mouse/PAS_processed/mm_mixAATAAA_polyA.txt --inputModel ../human_AATAAA_Model.h5 --DataName mouse_AATAAA --FileName mouse_human_AATAAA

Test DeepGSR on mouse genome using human trained model for TIS recognition:
    python CNN_Testing.py --inputFile ../../Data/Mouse/TIS_processed/mm_mixATG_TIS.txt --inputModel ../human_ATG_Model.h5 --DataName mouse_ATG --FileName mouse_human_ATG

CONTACTS
--------
If you want to report bugs or have general queries email to: [email protected]
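
The README above says DeepGSR takes FASTA-format DNA sequences as input. As a rough illustration of what preparing such input for a convolutional network can look like, the Python sketch below parses FASTA records and one-hot encodes each sequence. This is an assumption-laden sketch, not DeepGSR's own preprocessing code: the function names are hypothetical, and one-hot encoding is simply a common choice for DNA input to deep-learning models. The source does not state which encoding DeepGSR uses.

```python
# Hypothetical one-hot mapping; any symbol outside ACGT (e.g. N)
# is encoded as an all-zero row.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def one_hot(seq):
    """Encode a DNA sequence as a list of 4-element rows (A, C, G, T)."""
    return [list(ONE_HOT.get(base, [0, 0, 0, 0])) for base in seq.upper()]


def encode_fasta(lines):
    """Map each FASTA header to the one-hot matrix of its sequence."""
    return {header: one_hot(seq) for header, seq in read_fasta(lines)}
```

To use it on a file, pass an open file handle, for example `encode_fasta(open("sequences.fa"))`, where `sequences.fa` is a placeholder file name. The resulting length-by-4 matrices can then be stacked into an array for a network that expects fixed-length windows.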

    Additional file 2: of BEACON: automated tool for Bacterial GEnome Annotation ComparisON

    Contains the source code of BEACON in C++ for command line use along with makefile, a ReadMe file and the license text. (TGZ 42 kb)

    Additional file 3: Table S3. of Omni-PolyA: a method and tool for accurate recognition of Poly(A) signals in human genomic DNA

    False positive and false negative rates comparison between DPS, DNN, and Omni-PolyA derived by using different feature sets. (PDF 105 kb)

    Additional file 4: Figure S1. of Omni-PolyA: a method and tool for accurate recognition of Poly(A) signals in human genomic DNA

    Nucleotide distribution for PAS variants in the PAS-weak category. These plots show the frequency of nucleotides for true PAS sequences in the 10 variants from the PAS-weak category. (PDF 1696 kb)

    Additional file 7: Table S4. of Omni-PolyA: a method and tool for accurate recognition of Poly(A) signals in human genomic DNA

    DPS, HMM_SVM and DNN model parameters. Parameters were determined from the validation set. (PDF 87 kb)