114 research outputs found

    Identification of pre-microRNAs by characterizing their sequence order evolution information and secondary structure graphs

    © 2018 The Author(s). Background: Distinguishing pre-microRNAs (precursor microRNAs) from length-similar pseudo pre-microRNAs can reveal more about the regulatory mechanism of RNA biological processes. Machine learning techniques have been widely applied to this challenging problem. However, most of them focus mainly on the secondary structure information of pre-microRNAs, while ignoring sequence-order information and sequence evolution information. Results: We use new features for the machine learning algorithms to improve classification performance by characterizing both sequence order evolution information and secondary structure graphs. We developed three steps to extract these features of pre-microRNAs. We first extract features from PSI-BLAST profiles and Hilbert-Huang transforms, which contain rich sequence evolution information and sequence-order information respectively. We then obtain properties of small molecular networks of pre-microRNAs, which contain refined secondary structure information. These structural features are carefully generated so that they can depict both global and local characteristics of pre-microRNAs. In total, our feature space covers 591 features. The maximum relevance and minimum redundancy (mRMR) feature selection method is adopted before a support vector machine (SVM) is applied as our classifier. The constructed classification model is named MicroRNA-NHPred. The performance of MicroRNA-NHPred is high and stable, and better than that of state-of-the-art methods, achieving an accuracy of up to 94.83% on the same benchmark datasets. Conclusions: The high prediction accuracy achieved by our proposed method is attributed to the design of a comprehensive feature set on the sequences and secondary structures, which is capable of characterizing the sequence evolution information and sequence-order information, as well as the global and local information of pre-microRNA secondary structures.
MicroRNA-NHPred is a valuable method for pre-microRNA identification. The source code of our method can be downloaded from https://github.com/myl446/MicroRNA-NHPred
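
The mRMR selection step mentioned in this abstract can be illustrated with a minimal sketch. It assumes discrete feature values and the common "difference" form of the criterion (relevance minus mean redundancy); the abstract does not state which variant or discretization the authors used, and the names `mutual_information` and `mrmr_select` are hypothetical. The SVM training step is omitted.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (bits) between two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )

def mrmr_select(features, labels, k):
    """Greedy mRMR (difference form, an assumption of this sketch).

    features maps a feature name to its list of discrete values, one per sample.
    Returns up to k feature names in the order they were selected.
    """
    selected = []
    remaining = list(features)
    relevance = {f: mutual_information(features[f], labels) for f in remaining}

    def score(f):
        # Relevance to the labels minus mean redundancy with chosen features.
        if not selected:
            return relevance[f]
        redundancy = sum(
            mutual_information(features[f], features[s]) for s in selected
        ) / len(selected)
        return relevance[f] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

In a toy case where one feature duplicates another, the duplicate is penalized for redundancy and an independent but equally relevant feature is chosen instead.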

    End-to-end learning framework for circular RNA classification from other long non-coding RNAs using multi-modal deep learning.

    Over the past two decades, a circular form of RNA (circular RNA) produced by the splicing mechanism has become the focus of scientific studies due to its major role as a microRNA (miR) activity modulator and its association with various diseases including cancer. Therefore, the detection of circular RNAs is vital for continued comprehension of their biogenesis and purpose. Prediction of circular RNA can be achieved by first distinguishing non-coding RNAs from protein coding gene transcripts, then separating short and long non-coding RNAs (lncRNAs), and finally predicting circular RNAs from other lncRNAs. However, available tools to distinguish circular RNAs from other lncRNAs have only reached 80% accuracy due to the difficulty of classifying circular RNAs from other lncRNAs. Therefore, the availability of a faster, more accurate machine learning method for the identification of circular RNAs, which will take into account the specific features of circular RNA, is essential in the development of systematic annotation. Here we present an End-to-End multimodal deep learning framework, our tool, to classify circular RNA from other lncRNA. It fuses an RCM descriptor, an ACNN-BLSTM sequence descriptor, and a conservation descriptor into high level abstraction descriptors, where the shared representations across different modalities are integrated. The experiments show that our tool is not only faster than existing tools but also surpasses them by an over 12% increase in accuracy. Another interesting result found from analysis of an ACNN-BLSTM sequence descriptor is that circular RNA sequences share the characteristics of the coding sequence

    The Evolution and Mechanics of Translational Control in Plants

    The expression of numerous plant mRNAs is attenuated by RNA sequence elements located in the 5' and 3' untranslated regions (UTRs). For example, in plants and many higher eukaryotes, roughly 35% of genes encode mRNAs that contain one or more upstream open reading frames (uORFs) in the 5' UTR. For this dissertation I have analyzed the pattern of conservation of such mRNA sequence elements. In the first set of studies, I have taken a comparative transcriptomics approach to address which RNA sequence elements are conserved between various families of angiosperm plants. Such conservation indicates an element's fundamental importance to plant biology, points to pathways for which it is most vital, and suggests the mechanism by which it acts. Conserved motifs were detected in 3% of genes. These include di-purine repeat motifs, uORF-associated motifs, putative binding sites for PUMILIO-like RNA binding proteins, small RNA targets, and a wide range of other sequence motifs. Due to the scanning process that precedes translation initiation, uORFs are often translated, thereby repressing initiation at the mRNA's main ORF. As one might predict, I found a clear bias against the AUG start codon within the 5' untranslated region (5' UTR) among all plants examined. Further supporting this finding, comparative analysis indicates that, for ~42% of genes, AUGs and their resultant uORFs reduce carrier fitness. Interestingly, for at least 5% of genes, uORFs are not only tolerated, but enriched. The remaining uORFs appear to be neutral. Because of their tangible impact on plant biology, it is critical to differentiate how uORFs affect translation and how, in many cases, their inhibitory effects are neutralized. In pursuit of this aim, I developed a computational model of the initiation process that uses five parameters to account for uORF presence.
In vivo translation efficiency data from uORF-containing reporter constructs were used to estimate the model's parameters in wild type Arabidopsis. In addition, the model was applied to identify salient defects associated with a mutation in the subunit h of eukaryotic initiation factor 3 (eIF3h). The model indicates that eIF3h, by supporting re-initiation during uORF elongation, facilitates uORF tolerance
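
As a rough illustration of what counting uORFs in a 5' UTR involves, the sketch below scans for AUG start codons and follows each one in frame to the first stop codon. It is not the five-parameter initiation model described in this abstract; the function name `find_uorfs` and the choice to skip AUGs without an in-frame stop inside the UTR are assumptions made for the example.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}

def find_uorfs(utr5):
    """Return (start, end) index pairs of uORFs fully contained in a 5' UTR.

    start is the index of the A in AUG; end is one past the stop codon.
    AUGs with no in-frame stop codon inside the UTR are skipped.
    DNA input is accepted and converted to RNA letters.
    """
    seq = utr5.upper().replace("T", "U")
    uorfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "AUG":
            continue
        # Walk codon by codon in the reading frame set by this AUG.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                uorfs.append((start, pos + 3))
                break
    return uorfs
```

For instance, `find_uorfs("GGAUGAAAUAAGG")` reports a single uORF spanning the AUG at index 2 through the in-frame UAA.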

    Application of Software Engineering Principles to Synthetic Biology and Emerging Regulatory Concerns

    As the science of synthetic biology matures, engineers have begun to deliver real-world applications which are the beginning of what could radically transform our lives. Recent progress indicates synthetic biology will produce transformative breakthroughs. Examples include: 1) synthesizing chemicals for medicines which are expensive and difficult to produce; 2) producing protein alternatives; 3) altering genomes to combat deadly diseases; 4) killing antibiotic-resistant pathogens; and 5) speeding up vaccine production. Although synthetic biology promises great benefits, many stakeholders have expressed concerns over safety and security risks from creating biological behavior never seen before in nature. As with any emerging technology, there is the risk of malicious use known as the dual-use problem. The technology is becoming democratized and de-skilled, and people in do-it-yourself communities can tinker with genetic code, similar to how programming has become prevalent through the ease of using macros in spreadsheets. While easy to program, it may be non-trivial to validate novel biological behavior. Nevertheless, we must be able to certify synthetically engineered organisms behave as expected, and be confident they will not harm natural life or the environment. Synthetic biology is an interdisciplinary engineering domain, and interdisciplinary problems require interdisciplinary solutions. Using an interdisciplinary approach, this dissertation lays foundations for verifying, validating, and certifying safety and security of synthetic biology applications through traditional software engineering concepts about safety, security, and reliability of systems. These techniques can help stakeholders navigate what is currently a confusing regulatory process. 
The contributions of this dissertation are: 1) creation of domain-specific patterns to help synthetic biologists develop assurance cases using evidence and arguments to validate safety and security of designs; 2) application of software product lines and feature models to the modular DNA parts of synthetic biology commonly known as BioBricks, making it easier to find safety features during design; 3) a technique for analyzing DNA sequence motifs to help characterize proteins as toxins or non-toxins; 4) a legal investigation regarding what makes regulating synthetic biology challenging; and 5) a repeatable workflow for leveraging safety and security artifacts to develop assurance cases for synthetic biology systems. Advisers: Myra B. Cohen and Brittany A. Dunca

    Modeling the combined effect of RNA-binding proteins and microRNAs in post-transcriptional regulation

    Recent studies show that RNA-binding proteins (RBPs) and microRNAs (miRNAs) function in coordination with each other to control post-transcriptional regulation (PTR). Despite this, the majority of research to date has focused on the regulatory effect of individual RBPs or miRNAs. Here, we mapped both RBP and miRNA binding sites on human 3′UTRs and utilized this collection to better understand PTR. We show that the transcripts that lack competition for HuR binding are destabilized more after HuR depletion. We also confirm this finding for PUM1(2) by measuring genome-wide expression changes following the knockdown of PUM1(2) in HEK293 cells. Next, to find potential cooperative interactions, we identified the pairs of factors whose sites co-localize more often than expected by random chance. Upon examining these results for PUM1(2), we found that transcripts where the sites of PUM1(2) and its interacting miRNA form a stem-loop are more stabilized upon PUM1(2) depletion. Finally, using dinucleotide frequency and counts of regulatory sites as features in a regression model, we achieved an AU-ROC of 0.86 in predicting mRNA half-life in BEAS-2B cells. Altogether, our results suggest that future studies of PTR must consider the combined effects of RBPs and miRNAs, as well as their interactions.
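
The dinucleotide-frequency features mentioned in this abstract can be sketched as follows. This covers only that one feature family, not the regulatory-site counts or the regression model itself; the name `dinucleotide_frequencies` and the decision to ignore pairs containing non-ACGU characters are assumptions of the example.

```python
from itertools import product

# Fixed feature order: AA, AC, AG, AU, CA, ..., UU (16 entries).
DINUCLEOTIDES = ["".join(p) for p in product("ACGU", repeat=2)]

def dinucleotide_frequencies(seq):
    """Return 16 overlapping-dinucleotide frequencies for an RNA or DNA sequence.

    Pairs containing characters other than A, C, G, U (after T -> U) are ignored.
    """
    seq = seq.upper().replace("T", "U")
    counts = dict.fromkeys(DINUCLEOTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DINUCLEOTIDES)
    return [counts[d] / total for d in DINUCLEOTIDES]
```

Each 3′UTR then maps to a fixed-length vector that could be concatenated with site counts before being passed to a regression model of the reader's choice.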

    Both maintenance and avoidance of RNA-binding protein interactions constrain coding region evolution

    Alignment, Clustering and Extraction of Structured Motifs in DNA Promoter Sequences

    A simple motif is a short DNA sequence found in the promoter region and believed to act as a binding site for a transcription factor protein. A structured motif is a sequence of simple motifs (boxes) separated by short sequences (gaps). Biologists theorize that the presence of these motifs plays a key role in gene expression regulation. Discovering these patterns is an important step towards understanding protein-gene and gene-gene interactions and thus facilitates the building of accurate gene regulatory network models. DNA sequence motif extraction is an important problem in bioinformatics. Many studies have proposed algorithms to solve the problem instance of simple motif extraction. Only in the past decade has the more complex structured motif extraction problem been examined by researchers. The problem is inherently challenging as structured motif patterns are segmented into several boxes separated by variable size gaps for each instance. These boxes may not be exact copies, but may have multiple mismatched positions. The challenge is exacerbated by the lack of resources for real datasets covering a wide range of possible cases. Also, incomplete annotation of real data leads to the discovery of unknown motifs that may be regarded as false positives. Furthermore, current algorithms demand an unreasonable amount of prior knowledge to successfully extract the target pattern. The contributions of this research are four new algorithms. First, SMGenerate generates simulated datasets of implanted motifs that cover a wide range of biologically possible cases. Second, SMAlign aligns a pair of structured motifs optimally and efficiently given their gap constraints. Third, SMCluster produces a multiple alignment of structured motifs through hierarchical clustering using SMAlign's affinity score.
Finally, SMExtract extracts structured motifs from a set of sequences by using SMCluster to construct the target pattern from the top reported two-box patterns (fragments), extracted using an existing algorithm (Exmotif) and a two-box template. The main advantage of SMExtract is its efficiency in extracting longer degenerate patterns while requiring less prior knowledge about the pattern to be extracted than current algorithms
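
The definition of a structured motif given in this abstract (boxes separated by variable-size gaps, with mismatches allowed in each box) can be made concrete with a small search routine. This is a brute-force illustration of the definition only, not SMAlign, SMCluster, SMExtract, or Exmotif; the function names and the per-box mismatch limit are assumptions of the sketch.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def find_two_box_motif(seq, box1, box2, min_gap, max_gap, max_mismatches=0):
    """Return (box1_start, gap_length) for each occurrence of box1-gap-box2.

    Each box may differ from its pattern in at most max_mismatches positions,
    and the gap between the boxes must be between min_gap and max_gap long.
    """
    hits = []
    n1, n2 = len(box1), len(box2)
    for i in range(len(seq) - n1 + 1):
        if hamming(seq[i:i + n1], box1) > max_mismatches:
            continue
        for gap in range(min_gap, max_gap + 1):
            j = i + n1 + gap
            if j + n2 > len(seq):
                break  # longer gaps would run past the end of the sequence
            if hamming(seq[j:j + n2], box2) <= max_mismatches:
                hits.append((i, gap))
    return hits
```

For example, searching `"AATTGACAGGGGTATAATCC"` for boxes `"TTGACA"` and `"TATAAT"` with a gap of 3 to 5 finds one occurrence starting at index 2 with a gap of 4. The cost grows with sequence length times gap range, which is acceptable for illustration but not a substitute for the dedicated algorithms the abstract describes.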

    Improved Pre-miRNAs Identification Through Mutual Information of Pre-miRNA Sequences and Structures

    Playing critical roles as post-transcriptional regulators, microRNAs (miRNAs) are a family of short non-coding RNAs that are derived from longer transcripts called precursor miRNAs (pre-miRNAs). Experimental methods to identify pre-miRNAs are expensive and time-consuming, which creates the need for computational alternatives. In recent years, the accuracy of computational methods to predict pre-miRNAs has increased significantly. However, there are still several drawbacks. First, these methods usually only consider base frequencies or sequence information while ignoring the information between bases. Second, feature extraction methods based on secondary structures usually only consider the global characteristics while ignoring the mutual influence of the local structures. Third, methods integrating high-dimensional feature information are computationally inefficient. In this study, we have proposed a novel mutual information-based feature representation algorithm for pre-miRNA sequences and secondary structures, which is capable of capturing the interactions between sequence bases and local features of the RNA secondary structure. In addition, the feature space is smaller than that of most popular methods, which makes our method computationally more efficient than its competitors. Finally, we applied these features to train a support vector machine model to predict pre-miRNAs and compared the results with other popular predictors. As a result, our method outperforms others based on both 5-fold cross-validation and the Jackknife test
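
To give a sense of what "information between bases" can mean, the sketch below computes the textbook mutual information between each base and its right-hand neighbour within a single sequence. The abstract does not specify the authors' feature representation, so this is an assumed, simplified stand-in; the function name `adjacent_base_mutual_information` is hypothetical.

```python
import math
from collections import Counter

def adjacent_base_mutual_information(seq):
    """Mutual information (bits) between a base and its right-hand neighbour.

    Probabilities are estimated from the overlapping adjacent pairs in seq.
    Returns 0.0 for sequences shorter than two bases.
    """
    pairs = [(seq[i], seq[i + 1]) for i in range(len(seq) - 1)]
    if not pairs:
        return 0.0
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts.
        mi += (c / n) * math.log2(c * n / (left[a] * right[b]))
    return mi
```

A homopolymer such as `"AAAAAA"` scores 0 bits because the neighbour carries no extra information, while a strictly alternating sequence such as `"ACACACA"` scores 1 bit because each base fully determines the next.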