268 research outputs found

    High-Throughput Polygenic Biomarker Discovery Using Condition-Specific Gene Coexpression Networks

    Biomarkers can be described as molecular signatures that are associated with a trait or disease. RNA expression data facilitates discovery of biomarkers underlying complex phenotypes because it can capture dynamic biochemical processes that are regulated in tissue-specific and time-specific manners. Gene Coexpression Network (GCN) analysis is a method that uses RNA expression data to identify binary gene relationships across experimental conditions. Using a novel GCN construction algorithm, Knowledge Independent Network Construction (KINC), I provide evidence for novel polygenic biomarkers in both plant and animal use cases. Kidney cancer comprises several distinct subtypes that demonstrate unique histological and molecular signatures. Using KINC, I have identified gene correlations that are specific to clear cell renal cell carcinoma (ccRCC), the most common form of kidney cancer. ccRCC is associated with two common mutation profiles that respond differently to targeted therapy. By identifying GCN edges that are specific to patients with each of these two mutation profiles, I discovered unique genes with similar biological function, suggesting a role for T cell exhaustion in the development of ccRCC. Medicago truncatula is a legume that is capable of atmospheric nitrogen fixation through a symbiotic relationship between plant and rhizobium that results in root nodulation. This process is governed by complex gene expression patterns that are dynamically regulated across tissues over the course of rhizobial infection. Using de novo RNA sequencing data generated from the root maturation zone at five distinct time points, I identified hundreds of genes that were differentially expressed between control and inoculated plants at specific time points. To discover genes that were co-regulated during this experiment, I constructed a GCN using the KINC software.
By combining GCN clustering analysis with differentially expressed genes, I present evidence for novel root nodulation biomarkers. These biomarkers suggest that temporal regulation of pathogen-response-related genes is an important process in nodulation. Large-scale GCN analysis requires computational resources and stable data-processing pipelines. Supercomputers such as Clemson University’s Palmetto Cluster provide data storage and processing resources that enable terabyte-scale experiments. However, with the wealth of public sequencing data available for mining, petabyte-scale experiments are required to provide novel insights across the tree of life. I discuss computational challenges that I have encountered in large-scale RNA expression data mining, and present two workflows, OSG-GEM and OSG-KINC, that enable researchers to access geographically distributed computing resources to handle petabyte-scale experiments.

    Sequencing and Analysis of the Diel Transcriptome of Botryococcus braunii

    Microalgae are widely viewed as a potential source of renewable biofuels. Microalgae are highly productive and can be cultured in recycled water on marginal or non-agricultural land. Despite their advantages, the industrial-scale deployment of microalgae faces numerous challenges, including relatively little knowledge of the algae themselves and the comparatively expensive infrastructure required for culture. The green microalga Botryococcus braunii is particularly interesting because it synthesizes long-chain (C30-C40) hydrocarbons that can be converted to liquid fuel by hydrogenation and catalytic cracking. Moreover, B. braunii is the major fossil present in the Ordovician oil shales and kerogen deposits. Although studied since the 1970s, very little is known regarding critical aspects of B. braunii, notably its molecular biology. In higher plants, molecular clocks have been well defined and transcript profiling has revealed a sophisticated network of circadian scheduling of metabolic processes. Characterization of temporal controls over hydrocarbon synthesis is therefore important for optimizing biofuel production from B. braunii. In this project, B. braunii (Race B, strain Guadeloupe) were cultured in a 12-hour photoperiod and either maintained in that regime or transferred to constant light. Algae were sampled every 4 hours during a 28-hour time course, and mRNA was extracted. mRNA was reverse-transcribed to cDNA and sequenced using a paired-end protocol on an Illumina HiSeq 2000 platform. Over 2 billion sequence reads of 100 bp were generated and assembled de novo into a complete transcriptome for B. braunii. The transcriptome was comprehensively annotated using global and targeted protocols, and differential expression and co-expression analyses were performed. Metabolic pathway analysis confirmed the presence and photoperiodic regulation of the MEP/DOXP Terpenoid Backbone synthesis pathway.
Targeted annotation and expression analysis revealed two predicted B. braunii circadian clock components, which were incorporated into a B. braunii circadian clock model. In non-hierarchical cluster analysis, contigs of the B. braunii transcriptome clustered under four distinct patterns of diel expression. Networks of co- and anti-expressed contigs were elucidated by hierarchical clustering. These results demonstrate the exquisite control over metabolism in B. braunii. Such knowledge is essential for the industrial applications of B. braunii, either directly or through the engineering of selected B. braunii genes or molecular pathways into alternative chassis. Biotechnology and Biological Sciences Research Council; Plymouth Marine Laboratory

    Quantification of transgene expression in GSH AAVS1 with a novel CRISPR/Cas9-based approach reveals high transcriptional variation

    Genomic safe harbors (GSH) are defined as sites in the host genome that allow stable expression of inserted transgenes while having no adverse effects on the host cell, making them ideal for use in basic research and therapeutic applications. Silencing and fluctuations in transgene expression would be highly undesirable effects. We have previously shown that transgene expression in Jurkat T cells is not silenced for up to 160 days after CRISPR-Cas9-mediated insertion of reporter genes into the adeno-associated virus site 1 (AAVS1), a commonly used GSH. Here, we studied fluctuations in transgene expression upon targeted insertion into the GSH AAVS1. We have developed an efficient method to generate and validate highly complex barcoded plasmid libraries to study transgene expression at the single-cell level. Its applicability is demonstrated by inserting the barcoded transgene Cerulean into the AAVS1 locus in Jurkat T cells via CRISPR-Cas9 technology, followed by next-generation sequencing of the transcribed barcodes. We observed large transcriptional variations over two logs for transgene expression in the GSH AAVS1. This barcoded transgene insertion model is a powerful tool to investigate fluctuations in transgene expression at any GSH site.

    Genomewide analysis of gene expression in Vitis vinifera ssp.

    Interpretation, Stratification and Validation of Sequence Variants Affecting mRNA Splicing in Complete Human Genome Sequences

    The Shannon Human Splicing Pipeline software has been developed to analyze variants on a genome scale. Evidence is provided that this software predicts variants affecting mRNA splicing. Variants are examined through information-based analysis, and the context of novel mutations as well as common and rare SNPs with splicing effects is displayed. Potential natural and cryptic mRNA splicing variants are identified, and inactivating mutations are distinguished from leaky mutations. Mutations and rare SNPs were predicted in genomes of three cancer cell lines (U2OS, U251 and A431), supported by expression analyses. After filtering, tractable numbers of potentially deleterious variants are predicted by the software, suitable for further laboratory investigation. In these cell lines, novel functional variants comprised 6–17 inactivating mutations, 1–5 leaky mutations and 6–13 cryptic splicing mutations. Predicted effects were validated by RNA-seq data of the three cell lines, and expression microarray analysis of SNPs in HapMap cell lines.

    Hypoxic and viral contributions to the etiopathogenesis of schizophrenia: a whole transcriptome analysis

    Schizophrenia is a mental illness with a complex and as yet unclear etiology. It is highly heritable and has a strong polygenic character; however, studies examining the genetics of schizophrenia have not sufficiently explained all variability in its prevalence. Environmental causes are theorized to make a nontrivial contribution to the pathoetiology of schizophrenia, including interactions with genetic components, but these mechanisms remain unclear. Analyzing schizophrenia dysfunction using transcriptomic approaches is a paradigm still in its infancy, and fewer studies still have examined non-neurological contributions to schizophrenia pathology with next-generation sequencing technologies. This pilot study uses several tools to probe changes in gene expression and isoform prevalence, and to detect the presence of viral genomes that may contribute to schizophrenia pathoetiology. Findings of interest include a robust genetic response associated with hypoxia and downstream changes in gene expression that may have direct consequences on schizophrenia symptomatology, and the presence of viral transcripts suggesting an active viral infection in a schizophrenic patient. While these findings are not definitive proof that these events are directly correlated with schizophrenia pathoetiology, they suggest intriguing directions to pursue in next-generation sequencing research to clarify this complex disorder.

    Development of an integrated omics in silico workflow and its application for studying bacteria-phage interactions in a model microbial community

    Microbial communities are ubiquitous and dynamic systems that inhabit a multitude of environments. They underpin natural as well as biotechnological processes, and are also implicated in human health. The elucidation and understanding of these structurally and functionally complex microbial systems using a broad spectrum of toolkits ranging from in situ sampling, high-throughput data generation ("omics"), bioinformatic analyses and computational modelling to laboratory experiments is the aim of the emerging discipline of Eco-Systems Biology. Integrated workflows which allow the systematic investigation of microbial consortia are being developed. However, in silico methods for analysing multi-omic data sets are so far typically lab-specific, applied ad hoc, limited in terms of their reproducibility by different research groups and suboptimal in the amount of data actually being exploited. To address these limitations, the present work initially focused on the development of the Integrated Meta-omic Pipeline (IMP), a large-scale reference-independent bioinformatic analysis pipeline for the integrated analysis of coupled metagenomic and metatranscriptomic data. IMP is an elaborate pipeline that incorporates robust read preprocessing, iterative co-assembly, analyses of microbial community structure and function, automated binning as well as genomic signature-based visualizations. The IMP-based data integration strategy greatly enhances overall data usage, output volume and quality as demonstrated using relevant use-cases. Finally, IMP is encapsulated within a user-friendly implementation using Python while relying on Docker for reproducibility. The IMP pipeline was then applied to a longitudinal multi-omic dataset derived from a model microbial community from an activated sludge biological wastewater treatment plant with the explicit aim of following bacteria-phage interaction dynamics using information from the CRISPR-Cas system.
This work provides a multi-omic perspective of community-level CRISPR dynamics, namely changes in CRISPR repeat and spacer complements over time, demonstrating that these are heterogeneous, dynamic and transcribed genomic regions. Population-level analysis of two lipid-accumulating bacterial species associated with 158 putative bacteriophage sequences enabled the observation of phage-host population dynamics. Several putatively identified bacteriophages were found to occur at much higher abundances than other phages, and these specific peaks usually did not overlap with those of other putative phages. In addition, there were several RNA-based CRISPR targets that were found to occur in high abundances. In summary, the present work describes the development of a new bioinformatic pipeline for the analysis of coupled metagenomic and metatranscriptomic datasets derived from microbial communities and its application to a study focused on the dynamics of bacteria-virus interactions. Finally, this work demonstrates the power of integrated multi-omic investigation of microbial consortia towards the conversion of high-throughput next-generation sequencing data into new insights.