814 research outputs found

    Improvement in the prediction of the translation initiation site through balancing methods, inclusion of acquired knowledge and addition of features to sequences of mRNA

    Background: The accurate prediction of the initiation of translation in sequences of mRNA is an important activity for genome annotation. However, obtaining an accurate prediction is not always a simple task and can be modeled as a problem of classification between positive sequences (protein codifiers) and negative sequences (non-codifiers). The problem is highly imbalanced because each molecule of mRNA has a unique translation initiation site and various others that are not initiators. Therefore, this study focuses on the problem from the perspective of balancing classes and we present an undersampling balancing method, M-clus, which is based on clustering. The method also adds features to sequences and improves the performance of the classifier through the inclusion of knowledge obtained by the model, called InAKnow.

    Results: Through this methodology, the measures of performance used (accuracy, sensitivity, specificity and adjusted accuracy) are greater than 93% for the Mus musculus and Rattus norvegicus organisms, and varied between 72.97% and 97.43% for the other organisms evaluated: Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens, Nasonia vitripennis. The precision increases significantly by 39% and 22.9% for Mus musculus and Rattus norvegicus, respectively, when the knowledge obtained by the model is included. For the other organisms, the precision increases by between 37.10% and 59.49%. The inclusion of certain features during training, for example, the presence of ATG in the upstream region of the Translation Initiation Site, improves the rate of sensitivity by approximately 7%. Using the M-clus balancing method generates a significant increase in the rate of sensitivity from 51.39% to 91.55% (Mus musculus) and from 47.45% to 88.09% (Rattus norvegicus).

    Conclusions: In order to solve the problem of TIS prediction, the results indicate that the methodology proposed in this work is adequate, particularly when using the concept of acquired knowledge, which increased the accuracy in all databases evaluated.
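
The abstract above describes M-clus only as an undersampling method based on clustering, so the following is a generic illustration of that idea and not the authors' algorithm. It is a minimal sketch that assumes numeric feature vectors, uses a plain k-means, and keeps one representative of the majority (negative) class per cluster; the names `sq_dist`, `kmeans` and `undersample_majority` are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on lists of floats; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct input points (assumed choice).
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        for i, p in enumerate(points):
            assign[i] = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
        # Move each centroid to the mean of its members (skip empty clusters).
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign


def undersample_majority(majority, n_keep, seed=0):
    """Keep at most n_keep majority samples: the member nearest each centroid."""
    centroids, assign = kmeans(majority, n_keep, seed=seed)
    kept = []
    for c, centroid in enumerate(centroids):
        members = [p for p, a in zip(majority, assign) if a == c]
        if members:
            kept.append(min(members, key=lambda p: sq_dist(p, centroid)))
    return kept


# Toy data: 40 negative samples reduced to roughly the size of 4 positives.
negatives = [[float(i), float(i % 5)] for i in range(40)]
balanced_negatives = undersample_majority(negatives, n_keep=4)
```

Setting `n_keep` to the number of positive samples gives an approximately balanced training set while still covering the different regions of the negative class.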

    Investigation of the length distributions of coding and noncoding sequences in relation to gene architecture, function, and expression

    The last 20 years have seen the birth of bioinformatics, defined as the combination of mathematics, biology, and computational approaches. This discipline has led to the era of ontology, extensive databases (including sequences, structures, expression profiles, and genomes), and database cross-referencing (Ouzounis, 2012). Before this discipline emerged, scientists referenced atlas books, such as Margaret Dayhoff’s protein sequence collection (Strasser, 2010), which required long hours of letter counting. Through the development of sequencing technology over the past forty years, a tremendous amount of genomic sequencing data has been collected. As the volume of such data surges, so do the challenges of data organisation, accessibility and interpretation, with interpretation being the most challenging (Ouzounis, 2012).

    A Machine Learning Model for Discovery of Protein Isoforms as Biomarkers

    Prostate cancer is the most common cancer in men. One in eight Canadian men will be diagnosed with prostate cancer in their lifetime. The accurate detection of the disease’s subtypes is critical for providing adequate therapy; hence, it is critical for increasing both survival rates and quality of life. Next generation sequencing can be beneficial when studying cancer. This technology generates a large amount of data that can be used to extract information about biomarkers. This thesis proposes a model that discovers protein isoforms for different stages of prostate cancer progression. A tool has been developed that utilizes RNA-Seq data to infer open reading frames (ORFs) corresponding to transcripts. These ORFs are used as features for classification. A quantification measurement, Adaptive Fragments Per Kilobase of transcript per Million mapped reads (AFPKM), is proposed to compute the expression level for ORFs. The new measurement considers the actual length of the ORF and the length of the transcript. Using these ORFs and the new expression measure, several classifiers were built using different machine learning techniques. This enabled the identification of some protein isoforms related to prostate cancer progression. The biomarkers have had a great impact on the discrimination of prostate cancer stages and are worth further investigation.
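
The abstract does not say how the thesis tool infers ORFs, so the following is only a generic sketch of what inferring ORFs from a transcript sequence can look like. It assumes a DNA-alphabet transcript, scans only the three forward reading frames, and reports each stretch from an ATG to the first in-frame stop codon; `find_orfs` and `min_codons` are hypothetical names.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    start is the index of the A in ATG, end is exclusive and includes the
    stop codon, and min_codons counts codons before the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


orfs = find_orfs("CCATGAAATAGCC")
```

Each reported ORF could then serve as one feature, with its length (`end - start`) available to a length-aware expression measure.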

    Path to Facilitate the Prediction of Functional Amino Acid Substitutions in Red Blood Cell Disorders – A Computational Approach

    A major area of effort in current genomics is to distinguish mutations that are functionally neutral from those that contribute to disease. Single Nucleotide Polymorphisms (SNPs) that cause amino acid substitutions currently account for approximately half of the known gene lesions responsible for human inherited diseases. As a result, the prediction of non-synonymous SNPs (nsSNPs) that affect protein functions and relate to disease is an important task. In this study, we performed a comprehensive analysis of deleterious SNPs at both the functional and structural level in the genes associated with red blood cell metabolism disorders using bioinformatics tools. We analyzed the variants in the Glucose-6-phosphate dehydrogenase (G6PD) gene and in the genes for isoforms of Pyruvate Kinase (PKLR & PKM2), which are responsible for major red blood cell disorders. Deleterious nsSNPs were categorized using empirical rule-based and support vector machine-based methods to predict the impact on protein functions. Furthermore, we modeled mutant proteins and compared them with the native protein to evaluate protein structure stability. We argue here that bioinformatics tools can play an important role in addressing the complexity of the underlying genetic basis of red blood cell disorders. Based on our investigation, we report potential candidate SNPs for future studies in human red blood cell disorders. The current study also demonstrates the presence of other deleterious mutations and is consistent with in vivo experimental studies. Our approach presents the application of computational tools in understanding functional variation from the perspective of structure, expression, evolution and phenotype.

    Alternative translation initiation unraveled by N-terminomics and ribosome profiling


    Identification of regulatory RNA elements based on structural conservation

    The posttranscriptional regulation of gene expression determines the amount of a protein produced from a specific mRNA. All stages of the mRNA life cycle are posttranscriptionally regulated. The untranslated regions (UTRs) of mRNA play an important role in this process. UTRs encode cis-regulatory elements that interact with trans-acting factors such as RNA binding proteins or non-coding RNAs. These interactions are based on the specific recognition of sequence or structured motifs. Similar to the conservation of linear sequences, the conservation of secondary structures can be indicative of a functional cis-regulatory element. In the first part of my doctoral thesis, I identified structurally conserved regulatory elements in 3’UTRs of mRNAs. For this, I performed reporter gene assays with bioinformatically predicted structurally conserved RNA elements and discovered a regulatory element in the 3’UTR of the UCP3 (uncoupling protein 3) mRNA. UCP3 is a protein of the inner mitochondrial membrane and is associated with the development of type 2 diabetes mellitus (T2DM). Through sequence and structural analysis, I discovered that the element has an active conformation that consists of two short RNA stem-loops. In further experiments, I was able to confirm that the presence of both RNA stem-loops is necessary for efficient regulation. Furthermore, I showed that the reduction of reporter gene expression is caused by a reduction of the mRNA half-life. The prediction of conserved RNA structures thus provides a powerful tool for the de novo identification of cis-regulatory elements. In the second part of my doctoral thesis, I characterized the regulatory element from the 3’UTR of UCP3 in detail. First, I performed RNA affinity purification to identify proteins specifically associated with the UCP3 element by mass spectrometry. This allowed me to show that the proteins Roquin-1 and Roquin-2 bind to the RNA stem-loops of the UCP3 element. 
Furthermore, I showed that endogenous UCP3 is regulated by Roquin. Roquin proteins bind to constitutive (CDEs) and alternative (ADEs) decay elements and induce the rapid degradation of mRNAs of genes that play an important role in the immune response. Binding studies with the Roquin-1 ROQ domain showed that Roquin binds with significantly higher affinity to the UCP3 element when both CDEs are present. Neither CDE in the UCP3 element corresponds to the previously suggested CDE consensus. Based on a detailed mutational analysis, I revised the CDE consensus. With my data, more than 160 new CDE- and 19 new ADE-coding mRNAs could be identified. Furthermore, I confirmed new CDE/ADE-containing mRNAs as targets of Roquin. Interestingly, I was able to show that not only the expression of CDE-encoding mRNAs, but also regulation by Roquin is cell type dependent. In conclusion, I have extended Roquin’s role in the posttranscriptional regulation of gene expression and suggest that its role is not limited to the regulation of the immune response.

    Investigating prokaryotic transcriptomes and the impact of crosstalk between noncoding RNA and messenger RNA interactions

    Prokaryotes have a complex noncoding RNA (ncRNA) based regulatory system, resembling that of eukaryotes. Recent transcriptomics studies also point out the abundance of highly expressed uncharacterized RNAs in archaea and bacteria. However, despite the recent advances indicating the prevalence of ncRNAs in prokaryotes, it is still unknown to what extent these uncharacterized transcripts are functional. Therefore, we have proposed a phylogeny-informed approach to design new RNA sequencing (RNA-seq) experiments, which increases the information harnessed from transcriptome data for ncRNA detection. Many regulatory ncRNAs engage in RNA-RNA interactions, where RNA molecules bind to form a duplex. Predicting the true targets of an RNA enables successful functional characterization, and these targets can be estimated by bioinformatics methods. However, the algorithms developed to date are imperfect and it is an open question as to which ones perform well and whether these can be improved upon. Towards this goal we performed a computational benchmark study to find reliable algorithms for RNA-RNA interaction prediction. We found that energy-based methods, which include the accessibility of interaction regions, are currently the most accurate. Many ncRNAs, including housekeeping ncRNA genes, are highly expressed. The abundances of interacting RNA molecules enable RNA-RNA duplex formation. In chapter IV we explore the impact of high-abundance RNAs on protein expression due to crosstalk RNA-RNA interactions between mRNAs and ncRNAs. With extensive RNA-RNA interaction predictions we reveal that RNA avoidance is an evolutionarily conserved phenomenon among prokaryotes, which means that core mRNAs have evolved to avoid crosstalk interactions with abundant ncRNAs. Our predictions also reveal that RNA avoidance may influence protein expression. To test this, we investigated the stability of interactions between mRNAs and core ncRNAs. 
These predictions show that RNA avoidance influences the final protein abundances. In conclusion, the primary aims of this study are to investigate the prokaryotic transcriptome for novel ncRNA genes and examine the effects of crosstalk RNA interactions. We present a method to increase the information gained from prokaryotic transcriptomes for ncRNA identification. We also present the most comprehensive benchmark of RNA-RNA interaction prediction algorithms to date. Lastly, we introduce and test an ‘RNA avoidance hypothesis’ that shows the influence of crosstalk RNA interactions on protein expression in bacteria.