
    Machine learning tools for mRNA isoform function prediction

    This dissertation focuses on improving mRNA isoform characterization in terms of functional networks, function prediction and tissue specificity. There are three major challenges in solving these problems. The first is the unavailability of mRNA isoform level functional data, which are required to develop machine learning tools; the data that are available, even at the gene level, do not cover all genes, which further complicates the matter. The second challenge is the lack of information about tissue specificity in functional databases such as Gene Ontology, Kyoto Encyclopedia of Genes and Genomes and UniProt. The third challenge is the lack of mRNA isoform level “ground truth” functional annotation data. The scope of this dissertation includes using mRNA isoform and protein sequences, high-throughput RNA-sequencing data and functional annotations at the gene level to develop computational methods for predicting the functions of alternatively spliced mRNA isoforms in mouse. To address these challenges, this dissertation develops and describes two computational tools. The first is a supervised machine learning framework for predicting tissue-specific mRNA isoform functional networks. Tissue-spEcific mrNa iSoform functIonal Networks (TENSION) uses annotations of genes that produce a single mRNA, together with gene annotations tagged with “NOT”, to create high-quality mRNA isoform level functional data. We use these data to train random forest models that predict mRNA isoform functional networks. By using a leave-one-tissue-out approach and incorporating tissue-specific mRNA isoform level predictors along with those obtained from mRNA isoform and protein sequences, we have developed mRNA isoform level functional networks for 17 mouse tissues. We identify about 10.6 million tissue-specific functional mRNA isoform interactions and demonstrate the ability of our networks to reveal tissue-specific functional differences between isoforms of the same gene. We validate our models and predictions with a series of tests, including 10-fold stratified cross-validation, comparison with a published method and validation against literature datasets. As a result, we have also generated a high-quality mRNA isoform level functional dataset that can be used for benchmarking future methods. Next, we describe the mRNA Function Recommendation System (mFRecSys), a recommendation system for making tissue-specific function recommendations for mRNA isoforms. In mFRecSys, we treat mRNA isoforms as “users” and Gene Ontology biological process terms as “items”. By using explicit contexts for mRNA isoforms, Gene Ontology biological process terms and tissue-specific mRNA isoform expression, mFRecSys is able to make tissue-specific mRNA isoform function recommendations. This work emphasizes the significance of incorporating diverse biological context to develop better machine learning tools for biology. It also highlights the use of simplified supervised learning methods for biological network prediction. The machine learning models and recommendation systems developed as part of this work also draw attention to the power of simple mRNA isoform sequence-based predictors to improve mRNA isoform function prediction. The methods developed have potential practical applications, for instance as predictive models for distinguishing the functions of different mRNA isoforms of the same gene or for identifying tissue-specific functions of mRNA isoforms.
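    The leave-one-tissue-out evaluation mentioned in this abstract can be illustrated with a minimal sketch. This is not the TENSION code: the `samples` layout, the tissue names and the function name are hypothetical, and no model is trained here. The sketch only shows how each tissue is held out in turn while the remaining tissues form the training set.

```python
from collections import defaultdict


def leave_one_tissue_out(samples):
    """Yield (held_out_tissue, train_indices, test_indices) per tissue.

    `samples` is a list of (tissue_label, features) pairs. This layout is
    an assumption made for the illustration, not the dissertation's format.
    """
    by_tissue = defaultdict(list)
    for idx, (tissue, _features) in enumerate(samples):
        by_tissue[tissue].append(idx)

    # Hold out one tissue at a time; train on all the others.
    for tissue in sorted(by_tissue):
        test_idx = by_tissue[tissue]
        train_idx = sorted(
            i for other, idxs in by_tissue.items() if other != tissue for i in idxs
        )
        yield tissue, train_idx, test_idx


# Toy data: four samples from three hypothetical tissues.
samples = [("liver", [0.1]), ("heart", [0.4]), ("liver", [0.3]), ("brain", [0.9])]
splits = list(leave_one_tissue_out(samples))
for tissue, train_idx, test_idx in splits:
    print(tissue, train_idx, test_idx)
```

    In practice a classifier such as a random forest would be fitted on the training indices of each split and scored on the held-out tissue, so that performance reflects generalization to a tissue the model has not seen.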

    Transcriptome-based Gene Networks for Systems-level Analysis of Plant Gene Functions

    Present-day genomic technologies are evolving at an unprecedented rate, allowing interrogation of cellular activities with increasing breadth and depth. However, we know very little about how the genome functions and what the identified genes do. The lack of functional annotations of genes greatly limits the post-analytical interpretation of new high-throughput genomic datasets. For plant biologists, the problem is much more severe. Less than 50% of all the identified genes in the model plant Arabidopsis thaliana, and only about 20% of all genes in the crop model Oryza sativa, have some aspect of their function assigned. Therefore, there is an urgent need to develop innovative methods to predict and expand on the currently available functional annotations of plant genes. With open access catching the ‘pulse’ of modern-day molecular research, integrating the copious amount of available transcriptome datasets allows rapid prediction of gene functions in specific biological contexts, which provides added evidence over traditional homology-based functional inference. The main goal of this dissertation was to develop data analysis strategies and tools broadly applicable in systems biology research. Two user-friendly interactive web applications are presented. The Rice Regulatory Network (RRN) captures an abiotic-stress conditioned gene regulatory network designed to facilitate the identification of transcription factor targets during induction of various environmental stresses. The Arabidopsis Seed Active Network (SANe) is a transcriptional regulatory network that encapsulates various aspects of seed formation, including embryogenesis, endosperm development and seed-coat formation. Further, an edge-set enrichment analysis algorithm is proposed that uses network density as a parameter to estimate the gain or loss in correlation of pathways between two conditionally independent coexpression networks.
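    The abstract does not spell out the edge-set enrichment algorithm, so the following is only a rough sketch of the quantity it names: the density of a pathway's subnetwork, compared between two coexpression networks. The gene names, edge lists and the function name are hypothetical, and no significance test is included.

```python
from itertools import combinations


def pathway_density(edges, pathway_genes):
    """Fraction of possible gene pairs in the pathway that are connected.

    `edges` is an iterable of undirected (gene_a, gene_b) pairs from a
    coexpression network; density is observed edges / possible edges.
    """
    genes = sorted(set(pathway_genes))
    n = len(genes)
    if n < 2:
        return 0.0
    edge_set = {frozenset(edge) for edge in edges}
    present = sum(
        1 for a, b in combinations(genes, 2) if frozenset((a, b)) in edge_set
    )
    possible = n * (n - 1) // 2
    return present / possible


# Two toy networks (hypothetical conditions) and one hypothetical pathway.
control = [("g1", "g2"), ("g2", "g3"), ("g3", "g4")]
stress = [("g1", "g2"), ("g1", "g3"), ("g2", "g3"), ("g3", "g4")]
pathway = ["g1", "g2", "g3"]

# Positive delta: the pathway is more densely coexpressed under stress.
delta = pathway_density(stress, pathway) - pathway_density(control, pathway)
print(round(delta, 3))
```

    A positive difference would suggest a gain in correlation for the pathway in the second network and a negative one a loss; a real analysis would also need a null model to judge whether the difference is larger than expected by chance.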

    GOexpress: an R/Bioconductor package for the identification and visualisation of robust gene ontology signatures through supervised learning of gene expression data

    Background: Identification of gene expression profiles that differentiate experimental groups is critical for discovery and analysis of key molecular pathways and also for selection of robust diagnostic or prognostic biomarkers. While integration of differential expression statistics has been used to refine gene set enrichment analyses, such approaches are typically limited to single gene lists resulting from simple two-group comparisons or time-series analyses. In contrast, functional class scoring and machine learning approaches provide powerful alternative methods to leverage molecular measurements for pathway analyses, and to compare continuous and multi-level categorical factors. Results: We introduce GOexpress, a software package for scoring and summarising the capacity of gene ontology features to simultaneously classify samples from multiple experimental groups. GOexpress integrates normalised gene expression data (e.g., from microarray and RNA-seq experiments) and phenotypic information of individual samples with gene ontology annotations to derive a ranking of genes and gene ontology terms using a supervised learning approach. The default random forest algorithm allows interactions between all experimental factors, and competitive scoring of expressed genes to evaluate their relative importance in classifying predefined groups of samples. Conclusions: GOexpress enables rapid identification and visualisation of ontology-related gene panels that robustly classify groups of samples and supports both categorical (e.g., infection status, treatment) and continuous (e.g., time-series, drug concentrations) experimental factors. 
The use of standard Bioconductor extension packages and publicly available gene ontology annotations facilitates straightforward integration of GOexpress within existing computational biology pipelines.
    Department of Agriculture, Food and the Marine; European Commission - Seventh Framework Programme (FP7); Science Foundation Ireland; University College Dublin

    Gene Regulatory Network Analysis and Web-based Application Development

    Microarray data are a valuable source for gene regulatory network analysis. Using earthworm microarray data analysis as an example, this dissertation demonstrates that a bioinformatics-guided reverse engineering approach can be applied to time-series data to uncover the underlying molecular mechanism. My network reconstruction results reinforce previous findings that certain neurotransmitter pathways are the target of two chemicals, carbaryl and RDX. This study also concludes that perturbations to these pathways by sublethal concentrations of these two chemicals were temporary, and that earthworms were capable of fully recovering. Moreover, differential networks (DNs) analysis indicates that many pathways other than those related to synaptic and neuronal activities were altered during the exposure phase. A novel DNs approach is developed in this dissertation to connect pathway perturbation with toxicity threshold setting from Live Cell Array (LCA) data. Findings from this proof-of-concept study suggest that this DNs approach has great potential to provide a novel and sensitive tool for threshold setting in chemical risk assessment. In addition, a web-based tool, “Web-BLOM”, was developed for the reconstruction of gene regulatory networks from time-series gene expression profiles, including microarray and LCA data. This tool consists of several modular components: a database, the gene network reconstruction model and a user interface. The Bayesian Learning and Optimization Model (BLOM), originally implemented in MATLAB, was adopted by Web-BLOM to provide online reconstruction of large-scale gene regulatory networks. Compared to other network reconstruction models, BLOM can infer larger networks with comparable accuracy, identify hub genes and is much more computationally efficient.

    A comprehensive compendium of Arabidopsis RNA-seq data

    2020 Spring. Includes bibliographical references. In the last fifteen years, the amount of publicly available genomic sequencing data has doubled every few months. Analyzing large collections of RNA-seq datasets can provide insights that are not available when analyzing data from single experiments. There are barriers to such analyses: combining processed data is challenging because varying processing methods make it difficult to compare data across studies; combining data in raw form is challenging because of the resources needed to process the data. Multiple RNA-seq compendiums, which are curated sets of RNA-seq data that have been pre-processed in a uniform fashion, exist; however, there is no such resource in plants. We created a comprehensive compendium for Arabidopsis thaliana using a pipeline based on Snakemake. We downloaded over 80 Arabidopsis studies from the Sequence Read Archive. Through a strict set of criteria, we chose 35 studies containing a total of 700 biological replicates, with a focus on the response of different Arabidopsis tissues to a variety of stresses. In order to make the studies comparable, we hand-curated the metadata, then pre-processed and analyzed each sample using our pipeline. We performed exploratory analysis on the samples in our compendium, using PCA and t-SNE, for quality control and to identify biologically distinct subgroups. We discuss the differences between these two methods and show that the data separate primarily by tissue type and, to a lesser extent, by the type of stress. We identified treatment conditions for each study and generated three lists: differentially expressed genes, differentially expressed introns, and genes that were differentially expressed under multiple conditions. We then visually analyzed these groups, looking for overarching patterns within the data, and found around a thousand genes that participate in stress response across tissues and stresses.

    Polymorphism identification and improved genome annotation of Brassica rapa through Deep RNA sequencing.

    The mapping and functional analysis of quantitative traits in Brassica rapa can be greatly improved with the availability of physically positioned, gene-based genetic markers and accurate genome annotation. In this study, deep transcriptome RNA sequencing (RNA-Seq) of Brassica rapa was undertaken with two objectives: SNP detection and improved transcriptome annotation. We performed SNP detection on two varieties that are parents of a mapping population to aid in the development of a marker system for this population and the subsequent development of a high-resolution genetic map. An improved Brassica rapa transcriptome was constructed to detect novel transcripts and to improve the current genome annotation. This is useful for accurate estimation of mRNA abundance and detection of expression QTL (eQTLs) in mapping populations. Deep RNA-Seq of two Brassica rapa genotypes, R500 (var. trilocularis, Yellow Sarson) and IMB211 (a rapid cycling variety), using eight different tissues (root, internode, leaf, petiole, apical meristem, floral meristem, silique, and seedling) grown across three different environments (growth chamber, greenhouse and field) and under two different treatments (simulated sun and simulated shade) generated 2.3 billion high-quality Illumina reads. A total of 330,995 SNPs were identified in transcribed regions between the two genotypes, with an average frequency of one SNP in every 200 bases. The transcriptome reassembled from the deep RNA-Seq data identified 44,239 protein-coding genes. Compared with current gene models of B. rapa, we detected 3537 novel transcripts, structural modifications in 23,754 gene models, and changes in 3655 annotated proteins. Gaps in the current genome assembly of B. rapa are highlighted by our identification of 780 unmapped transcripts. All the SNPs, annotations, and predicted transcripts can be viewed at http://phytonetworks.ucdavis.edu/

    Analysis of G-quadruplexes as environmental sensors: Novel statistical models and computational algorithms enable interpretation of complex gene expression patterns for maize under salt stress conditions

    The occurrence of G-quadruplex (G4) structures in both genic and non-genic sequences has been well documented. However, even in genic regions the biological functions of these motifs remain poorly understood, though their potential to act in a regulatory fashion has been hypothesized. With the recent development of next-generation sequencing technology, we have accumulated genomic and transcriptomic sequences from various species and tissues. Coupled with pattern recognition software that can identify putative G4 sequences, the time is right for tackling the question of whether and how G4s are involved in regulating gene expression. Previous studies suggested that G4 conformation can depend on cation type and concentration, along with differences in G4 motif patterns (e.g., number of consecutive guanines). It has also been shown that G4 function may be associated with the motif's location relative to a given gene’s structural elements (transcription start site [TSS], exon/intron boundaries, etc.). My project focused on the expression of G4-containing genes from maize tissues under various abiotic stress conditions, including salt stress, which would be likely to change physiological cation concentrations. I quantified, compared, and visualized expression of G4-containing gene groups by developing and applying novel computational algorithms and statistical models. These methods were packaged into a software program I released on a web server called C-REx (http://c-rex.dill-picl.org/). I found that under salt stress conditions, transcription factors (TFs) with a G4 on the anti-sense strand upstream of the TSS are 455% more likely to be up-regulated than non-G4 genes. Likewise, transcription factors with a G4 on the anti-sense strand just downstream of the TSS are 259% more likely to be up-regulated. In addition, among G4 transcription factors that are up-regulated, heat shock factors are significantly enriched. On the other hand, under salt stress conditions non-TF genes with a G4 on the anti-sense strand upstream of the TSS are 157% more likely to be down-regulated, and those with a G4 on the anti-sense strand downstream of the TSS are 124% more likely to be down-regulated. Through G4 sequence feature analysis, we found that the length of G-runs was significantly associated with whether genes were switched ‘on’ or ‘off’ in salt stress conditions. The shortest G-runs were associated with G4 motifs in TF genes that were switched ‘on’, and the longest G-runs were associated with G4s in non-TF genes that were switched ‘off’. These findings suggest that salt stress resilience could potentially be improved in maize by selecting for natural gene variants with specific G4 constitutions or by introducing specific G4 motifs of varying lengths into TF and non-TF genes involved in the response to salt stress.
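    As an illustration of the kind of pattern recognition the abstract refers to, the sketch below scans both strands of a sequence for putative G4 motifs. It uses a commonly used pattern (four runs of at least three guanines separated by loops of 1 to 7 bases), which is an assumption here: the abstract does not state which pattern the author's software or C-REx uses. Function names and the toy sequence are hypothetical.

```python
import re

# Putative G4 motif: four G-runs (>= 3 G) separated by loops of 1-7 bases.
# This widely used pattern is an assumption, not taken from the abstract.
G4_PATTERN = re.compile(r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}")

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_g4(seq):
    """Return (start, end, strand) for non-overlapping putative G4 motifs.

    Coordinates are 0-based, end-exclusive, and always given on the input
    (plus) strand, even for motifs found on the minus strand.
    """
    seq = seq.upper()
    length = len(seq)
    hits = [(m.start(), m.end(), "+") for m in G4_PATTERN.finditer(seq)]
    for m in G4_PATTERN.finditer(reverse_complement(seq)):
        # Map minus-strand match coordinates back onto the plus strand.
        hits.append((length - m.end(), length - m.start(), "-"))
    return sorted(hits)


toy = "TTGGGAGGGTGGGCGGGTT"
print(find_g4(toy))
print(find_g4(reverse_complement(toy)))
```

    Because `finditer` reports non-overlapping matches only, overlapping candidate motifs are not enumerated; relating hits to the TSS or to the anti-sense strand of a gene would additionally require gene coordinates and strand annotations that are not part of this sketch.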

    High throughput approaches reveal splicing of primary microRNA transcripts and tissue specific expression of mature microRNAs in Vitis vinifera

    Background: MicroRNAs are short (~21 base) single-stranded RNAs that, in plants, are generally coded by specific genes and cleaved specifically from hairpin precursors. MicroRNAs are critical for the regulation of multiple developmental, stress-related and other physiological processes in plants. The recent annotation of the genome of the grapevine (Vitis vinifera L.) allowed the identification of many putative conserved microRNA precursors, grouped into multiple gene families. Results: Here we use oligonucleotide arrays to provide the first indication that many of these microRNAs show differential expression patterns between tissues and during the maturation of fruit in the grapevine. Furthermore, we demonstrate that whole transcriptome sequencing and deep sequencing of small RNA fractions can be used both to identify which microRNA precursors are expressed in different tissues and to estimate genomic coordinates and patterns of splicing and alternative splicing for many primary miRNA transcripts. Conclusion: Our results show that many microRNAs are differentially expressed in different tissues and during fruit maturation in the grapevine. Furthermore, the demonstration that whole transcriptome sequencing can be used to identify candidate splicing events and approximate primary microRNA transcript coordinates represents a significant step towards the large-scale elucidation of mechanisms regulating the expression of microRNAs at the transcriptional and post-transcriptional levels.

    Predicting splicing regulation with learning methods

    Alternative splicing is an important post-transcriptional process that increases the diversity of proteins in different tissues and developmental stages, and its dysregulation is often associated with disease. Large-scale RNA-seq experiments and bioinformatic approaches have already found evidence of splice site selection and of interactions among cis-regulatory elements and trans-acting factors. However, in most cases the underlying mechanisms are still incompletely understood. Therefore, there is a great need to accurately map and quantify gene splice variants, identify differences in splicing between conditions and computationally reveal splicing regulation. In this dissertation, we investigate these challenges and propose novel computational methods to address them. I will highlight my Ph.D. work on alternative splicing and present machine learning and statistical methods for extracting gene and alternative splicing features from large collections of RNA-seq data, determining statistically significant differences in expression and splicing measurements between conditions, and predicting the splicing regulation exerted by cis-regulatory sequence elements and trans-acting factors.

    Response to Persistent ER Stress in Plants: a Multiphasic Process that Transitions Cells from Prosurvival Activities to Cell Death

    The unfolded protein response (UPR) is a highly conserved response that protects plants from adverse environmental conditions. The UPR is elicited by endoplasmic reticulum (ER) stress, in which unfolded and misfolded proteins accumulate within the ER. Here, we induced the UPR in maize (Zea mays) seedlings to characterize the molecular events that occur over time during persistent ER stress. We found that a multiphasic program of gene expression was interwoven among other cellular events, including the induction of autophagy. One of the earliest phases involved the degradation by regulated IRE1-dependent RNA degradation (RIDD) of RNA transcripts derived from a family of peroxidase genes. RIDD resulted from the activation of the promiscuous ribonuclease activity of ZmIRE1 that attacks the mRNAs of secreted proteins. This was followed by an upsurge in expression of the canonical UPR genes indirectly driven by ZmIRE1 due to its splicing of Zmbzip60 mRNA to make an active transcription factor that directly upregulates many of the UPR genes. At the peak of UPR gene expression, a global wave of RNA processing led to the production of many aberrant UPR gene transcripts, likely tempering the ER stress response. During later stages of ER stress, ZmIRE1's activity declined as did the expression of survival-modulating genes, Bax inhibitor1 and Bcl-2-associated athanogene7, amidst a rising tide of cell death. Thus, in response to persistent ER stress, maize seedlings embark on a course of gene expression and cellular events progressing from adaptive responses to cell death.