13 research outputs found

    Gene network understanding and analysis

    A gene regulatory network (GRN) is a collection of regulators that interact with each other in the cell to govern the expression levels of mRNA and proteins. These regulators can be DNA, RNA, proteins, or their complexes. Transcriptional gene regulation is an important mechanism whose in-depth study can lead to various practical applications and to a greater understanding of how organisms control their cellular behavior. Among the most widely studied organisms in gene regulatory network research are Mycobacterium tuberculosis and Corynebacterium glutamicum ATCC 13032. Gene co-expression networks are of biological interest because co-expressed genes are controlled by the same transcriptional regulatory programs, and because they allow the functionality of genes to be studied at a system level. Correlation networks are increasingly being used in research applications, especially in the field of bioinformatics. They facilitate network-based gene screening methods that can be used to identify biomarkers or therapeutic targets. Computational methods used for the development of network models, as well as for the analysis of their functionality, have proved to be valuable resources

    Comparative Evaluation of Statistical Dependence Measures

    Measuring and testing dependence between random variables is of great importance in many scientific fields. In the case of linearly correlated variables, Pearson’s correlation coefficient is a commonly used measure of the correlation strength. In the case of nonlinear correlation, several innovative measures have been proposed, such as distance-based correlation, rank-based correlations, and information theory-based correlation. This thesis focuses on the statistical comparison of several important correlations, including Spearman’s correlation, mutual information, maximal information coefficient, biweight midcorrelation, distance correlation, and copula correlation, under various simulation settings such as correlative patterns and the level of random noise. Furthermore, we apply those correlations with the overall best performance to a real genomic data set, to study the co-expression between genes in serous ovarian cancer
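
To make the contrast between linear and rank-based measures in the abstract above concrete, here is a minimal sketch in standard-library Python. It is not taken from the thesis: the toy data (an exponential relationship) and the helper names `pearson`, `ranks` and `spearman` are assumptions chosen for illustration. The relationship is perfectly monotone but nonlinear, so Spearman's correlation should be essentially 1 while Pearson's coefficient is noticeably lower.

```python
import math


def pearson(x, y):
    """Pearson's product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's correlation: Pearson's correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy data: a noise-free, monotone, nonlinear relationship.
x = [i / 10 for i in range(1, 31)]
y = [math.exp(v) for v in x]

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(f"Pearson:  {r_pearson:.3f}")
print(f"Spearman: {r_spearman:.3f}")
```

The other measures named in the abstract (mutual information, maximal information coefficient, biweight midcorrelation, distance correlation, copula correlation) are not sketched here.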

    Is My Network Module Preserved and Reproducible?

    In many applications, one is interested in determining which of the properties of a network module change across conditions. For example, to validate the existence of a module, it is desirable to show that it is reproducible (or preserved) in an independent test network. Here we study several types of network preservation statistics that do not require a module assignment in the test network. We distinguish network preservation statistics by the type of the underlying network. Some preservation statistics are defined for a general network (defined by an adjacency matrix) while others are only defined for a correlation network (constructed on the basis of pairwise correlations between numeric variables). Our applications show that the correlation structure facilitates the definition of particularly powerful module preservation statistics. We illustrate that evaluating module preservation is in general different from evaluating cluster preservation. We find that it is advantageous to aggregate multiple preservation statistics into summary preservation statistics. We illustrate the use of these methods in six gene co-expression network applications including 1) preservation of cholesterol biosynthesis pathway in mouse tissues, 2) comparison of human and chimpanzee brain networks, 3) preservation of selected KEGG pathways between human and chimpanzee brain networks, 4) sex differences in human cortical networks, 5) sex differences in mouse liver networks. While we find no evidence for sex specific modules in human cortical networks, we find that several human cortical modules are less preserved in chimpanzees. In particular, apoptosis genes are differentially co-expressed between humans and chimpanzees. Our simulation studies and applications show that module preservation statistics are useful for studying differences between the modular structure of networks. 
Data, R software and accompanying tutorials can be downloaded from the following webpage: http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/ModulePreservation
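
As a rough illustration of what a preservation statistic on a general network (one defined only by an adjacency matrix) can look like, the sketch below computes two simple quantities for a module defined in a reference network: the mean adjacency among the module's nodes in each network (a density-type measure) and the correlation of intramodular connectivity between the reference and test networks (a connectivity-type measure). This is an assumption-laden toy in Python, not the authors' R software; the function names and the four-node example networks are hypothetical, and the summary statistics described in the abstract are not reproduced.

```python
import math
from itertools import combinations


def mean_adjacency(adj, module):
    """Mean adjacency over all distinct pairs of module nodes (density)."""
    pairs = list(combinations(module, 2))
    return sum(adj[i][j] for i, j in pairs) / len(pairs)


def intramodular_connectivity(adj, module):
    """For each module node, the sum of its adjacencies to the other module nodes."""
    return [sum(adj[i][j] for j in module if j != i) for i in module]


def pearson(x, y):
    """Pearson's correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def connectivity_preservation(ref_adj, test_adj, module):
    """Correlation of intramodular connectivity between the two networks."""
    return pearson(
        intramodular_connectivity(ref_adj, module),
        intramodular_connectivity(test_adj, module),
    )


# Hypothetical toy networks with four nodes; the module is nodes 0, 1 and 2.
ref_adj = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.4, 0.2],
    [0.8, 0.4, 1.0, 0.1],
    [0.1, 0.2, 0.1, 1.0],
]
# In the test network the within-module adjacencies are exactly halved.
test_adj = [
    [1.0, 0.45, 0.4, 0.3],
    [0.45, 1.0, 0.2, 0.1],
    [0.4, 0.2, 1.0, 0.2],
    [0.3, 0.1, 0.2, 1.0],
]
module = [0, 1, 2]

density_ref = mean_adjacency(ref_adj, module)
density_test = mean_adjacency(test_adj, module)
kim_cor = connectivity_preservation(ref_adj, test_adj, module)
print(density_ref, density_test, kim_cor)
```

Note that no module assignment is needed in the test network: only the reference module's node list and the test adjacency matrix are used. In this toy case the density drops while the connectivity pattern is unchanged, which shows why the two kinds of statistic can disagree.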

    Machine learning and network embedding methods for gene co-expression networks

    High-throughput technologies such as DNA microarrays and RNA-seq are used to measure the expression levels of large numbers of genes simultaneously. To support the extraction of biological knowledge, individual gene expression levels are transformed into Gene Co-expression Networks (GCNs). GCNs are analyzed to discover gene modules. GCN construction and analysis has been a well-studied topic for nearly two decades. While new types of sequencing and the corresponding data are now available, the software package WGCNA and its most recent variants are still widely used, contributing to biological discovery. The discovery of biologically significant modules of genes from raw expression data is a non-typical unsupervised problem; while there are no training data to drive the computational discovery of modules, the biological significance of the discovered modules can be evaluated with the widely used module enrichment metric, measuring the statistical significance of the occurrence of Gene Ontology terms within the computed modules. WGCNA and other related methods are entirely heuristic and they do not leverage the aforementioned non-typical nature of the underlying unsupervised problem. The main contribution of this thesis is SGCP, a novel Self-Training Gene Clustering Pipeline for discovering modules of genes from raw expression data. SGCP almost entirely replaces the steps followed by existing methods, based on recent progress in mathematically justified unsupervised clustering algorithms. It also introduces a conceptually novel self-training step that leverages Gene Ontology information to modify and improve the set of modules computed by the unsupervised algorithm. SGCP is tested on a rich set of DNA microarrays and RNA-seq benchmarks, coming from various organisms. These tests show that SGCP greatly outperforms all previous methods, resulting in highly enriched modules. 
Furthermore, these modules are often quite dissimilar from those computed by previous methods, suggesting the possibility that SGCP can indeed become an auxiliary tool for extracting biological knowledge. To this end, SGCP is implemented as an easy-to-use R package that is made available on Bioconductor
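
The abstract does not say which test underlies the module enrichment metric. A common choice for asking whether a Gene Ontology term is over-represented in a module is the one-sided hypergeometric test, and the sketch below assumes that choice. It is not SGCP code; the function name `enrichment_p_value` and the example counts are hypothetical.

```python
from math import comb


def enrichment_p_value(universe, annotated, module_size, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    universe    : total number of genes considered
    annotated   : genes in the universe carrying the GO term
    module_size : genes in the module
    overlap     : module genes carrying the GO term
    """
    upper = min(annotated, module_size)
    total = comb(universe, module_size)
    # math.comb(n, k) is 0 when k > n, so impossible terms vanish on their own.
    tail = sum(
        comb(annotated, k) * comb(universe - annotated, module_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20 genes in total, 5 annotated with the term,
# a module of 5 genes of which 3 are annotated.
p = enrichment_p_value(universe=20, annotated=5, module_size=5, overlap=3)
print(f"p = {p:.4f}")
```

In practice such p-values are computed for many terms and modules, so a multiple-testing correction would normally follow; that step is omitted here.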

    Integrative genetic and network approaches to identify key regulators of cardiac fibrosis

    Excessive fibrogenic response is a pathological hallmark of chronic complex diseases, including cardiovascular disease. To date, very few gene targets for cardiac fibrosis that led to effective treatments have been identified in humans. In this thesis I study and dissect the genetic component underlying cardiac fibrosis. This study integrates histomorphometric measurements of fibrosis in the rat left ventricle (LV) with gene expression (RNA-Seq from LV) and genetic data in a panel of recombinant inbred (RI) rat strains (n=30). In addition, I integrated RNA-seq LV and genetic data in humans (n=187, healthy and dilated cardiomyopathy (DCM) patients), as well as DCM genome-wide association studies (GWAS) data. I started by carrying out an unbiased co-expression network analysis in the rat heart. The reconstructed cardiac transcriptional modules were associated with quantitative levels of fibrosis. Co-expression networks were also independently built in the heart of DCM patients and by using the rat data, co-expression networks associated with fibrosis, conserved across rats and humans and not present in control human heart were prioritised. In the prioritised networks, I also analysed their cardiac cell type specificity, differential expression after TGFβ induction, potential driving transcription factors and conservation in other fibrotic diseases by analysing human data collected from other organs. Furthermore, I aimed to identify common genetic regulators of the networks (also called master genetic regulators) by using Bayesian multivariate regression approaches. Finally, I integrated GWAS data in DCM (n=2,287) to dissect the genetic basis of DCM. This systems genetics study evidences that there are transcriptional processes involved in the human cardiac fibrogenic response that are conserved across rats and humans, some of them also underlying DCM aetiology. 
In an attempt to suggest new gene targets for cardiac fibrosis, I also identified the WWP2 gene as a novel trans-acting genetic regulator of cardiac fibrosis.

    Assessing and accounting for correlation in RNA-seq data analysis

    RNA-sequencing (RNA-seq) technology is a high-throughput next-generation sequencing procedure. It allows researchers to measure gene transcript abundance at a lower cost and with a higher resolution. Advances in RNA-seq technology promoted new methodological development in several branches of quantitative analysis for RNA-seq data. In this dissertation, we focus on several topics related to RNA-seq data analysis. This dissertation is comprised of three papers on the analysis of RNA-seq data. We first introduce a method for detecting differentially expressed genes across different experimental conditions with correlated RNA-seq data. We fit a general linear model to the transformed read counts of each gene and assume the error vector has a block-diagonal correlation matrix with unstructured blocks that account for within-gene correlations. In order to stabilize parameter estimation with limited replicates, we shrink the residual maximum likelihood estimator of correlation parameters toward a mean-correlation locally-weighted scatterplot smoothing curve. The shrinkage weights are determined by using a hierarchical model and then estimated via parametric bootstrap. Due to the information sharing across genes in parameter estimation, the null distribution of test statistic is unknown and mathematically intractable. Thus, we approximate the null test distribution through a parametric bootstrap strategy. Next, we focus on correlation estimation between genes. Gene co-expression correlation estimation is a fundamental step in gene co-expression network construction. The correlation estimates could also be used as inputs of topological statistics which help analyze gene functions. We propose a new strategy for co-expression correlation definition and estimation. We introduce a motivating dataset with two factors and a split-plot experimental design. We define two types of co-expression correlations that originate from two different sources. 
We apply a linear mixed model to each gene pair. The correlations within random effects and random errors are used to represent the two types of correlations. Finally, we consider a basic topic in quantitative RNA-seq analysis, gene filtering. It is essential to remove genes with extremely low read counts before further analysis to avoid numerical problems and to obtain more stable estimates. Most differential expression and gene network analysis tools have embedded gene filtering functions. In general, these functions rely on a user-defined hard threshold for gene selection and fail to make full use of gene features, such as gene length and GC content level. Several studies have shown that gene features have a significant impact on RNA-sequencing efficiency and thus should be considered in subsequent analysis. We propose to fit a model involving a two-component mixture of Gaussian distributions to the transformed read counts for each sample and assume all parameters are functions of GC content. We adopt a modified semiparametric expectation-maximization algorithm for parameter estimation. We perform a series of simulation studies and show that, in many cases, the proposed methods improve upon existing methods and are more robust
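
The mixture-based filtering idea can be sketched in a heavily simplified form. The code below fits a plain two-component Gaussian mixture by ordinary EM to simulated values standing in for transformed read counts, then keeps the genes whose posterior probability of belonging to the higher-mean component exceeds 0.5. It is an assumption-based toy: the parameters here are constants, not functions of GC content, the algorithm is standard EM and not the modified semiparametric version described in the abstract, and the simulated data, the function names and the 0.5 cut-off are all hypothetical.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def fit_two_gaussians(values, n_iter=200):
    """Plain EM for a 1-D two-component Gaussian mixture.

    Returns (weights, means, sds, responsibilities); component 0 is
    initialised with the lower mean.
    """
    lo, hi = min(values), max(values)
    means = [lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)]
    sds = [(hi - lo) / 4, (hi - lo) / 4]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each value.
        resp = []
        for x in values:
            p = [weights[k] * normal_pdf(x, means[k], sds[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p] if total > 0 else [0.5, 0.5])
        # M-step: responsibility-weighted updates of the parameters.
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-12)
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            sds[k] = max(math.sqrt(var), 1e-6)
    return weights, means, sds, resp


# Simulated stand-in for transformed read counts of one sample:
# 300 "low-count" genes and 700 "expressed" genes.
random.seed(0)
values = [random.gauss(0.5, 0.4) for _ in range(300)]
values += [random.gauss(5.0, 1.0) for _ in range(700)]

weights, means, sds, resp = fit_two_gaussians(values)
# Keep genes that more likely belong to the higher-mean component.
keep = [r[1] > 0.5 for r in resp]
print("means:", [round(m, 2) for m in means])
print("genes kept:", sum(keep), "of", len(values))
```

Compared with a fixed hard threshold, the cut-off here is implied by the fitted components and so adapts to each sample.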

    Investigating the Transcriptome Signature of Depression: Employing Co-expression Network, Candidate Pathways and Machine Learning Approaches

    Depression is the leading cause of disability worldwide and is one of the major contributors to the overall global burden of disease. Despite significant advances in elucidating the neurobiology of depression in recent years, the molecular factors involved in the pathophysiology of depression remain poorly understood. Chapter 1: An overview of Major Depressive Disorder (MDD) from epidemiological and clinical perspectives with a summary of the current knowledge of the underlying biology is provided. A review of the major pathophysiological hypotheses of MDD highlights a need for a more comprehensive approach that allows studying complex molecular interactions involved in depression. Chapter 2: Transcriptome signature of depression was examined using the measure of replication at individual gene level across different tissues and cell types in both brain and periphery. Fifty-seven replicated genes were reported as differentially expressed in the brain and 21 in peripheral tissues. In-silico functional characterisation of these genes was provided, implicating shared pathways in a comorbid phenotype of depression and cardiovascular disease. Chapter 3: The molecular basis of MDD using co-expression network analysis was investigated. The Weighted Gene Co-expression Network Analysis (WGCNA) allowed for studying complex interactions between individual genes influencing biological pathways in MDD. Utilising the Sydney Memory and Aging Study (sMAS) and the Older Australian Twin Study (OATS) as discovery and replication cohorts respectively, it was found that the eigengenes of four clusters containing over 3,000 highly co-regulated genes are involved in 13 immune- and pathogen-related pathways and associated with recurrent MDD. However, the findings were not replicated on an independent cohort at the network level. Chapter 4: Using a machine learning (ML) approach, a predictive model was built to identify the genome-wide gene expression markers of recurrent MDD. 
Fuzzy Forests (FF) is a novel ML algorithm, which works in conjunction with WGCNA and was designed to reduce the bias seen in feature selection caused by the presence of correlated transcripts in transcriptome data. FF correctly classified 63% of recurrently depressed individuals in test data using the single top predictive feature (TFRC, which encodes the transferrin receptor). This suggests that TFRC can represent a putative marker for recurrent MDD. Chapter 5: Following the findings on immune-related pathways being associated with recurrent MDD in the elderly (Chapter 3), the role of these pathways in recurrent MDD was examined at individual gene levels in an independent cohort (OATS). To target the immune pathways, all known genes (KEGG) involved in these 13 pathways were selected and a differential expression analysis was conducted on 1,302 candidates between individuals with recurrent MDD and those without. We found that CD14 was significantly downregulated in recurrent MDD (FDR < 5%). Considering the key role of CD14 in facilitating the innate immune response, we suggest that CD14 can potentially serve as a peripheral marker of immune dysregulation in recurrent MDD. Chapter 6: A discussion of the findings is provided and future directions are outlined, with a particular focus on how co-expression network and machine learning approaches can enhance the clinical translation of molecular findings. Thesis (Ph.D.) -- University of Adelaide, Adelaide Medical School, 201

    Non-coding RNAs in ovine immunity: Identification of unannotated genes and functional analyses of high throughput genomic data

    210 p. Non-coding RNAs (ncRNAs) are involved in several biological processes in mammals, including the immune system response to pathogens and vaccines. The annotation and functional characterization of two of the main classes of ncRNAs, microRNAs (miRNAs) and long non-coding RNAs (lncRNAs), is more advanced in humans than in livestock species, and thus, there is limited knowledge about the function of these transcripts. The main objective of this work was the identification of ovine non-coding genes, specifically miRNA and lncRNA genes, that are involved in the innate and adaptive immune responses induced by vaccines, vaccine components and pathogen infections. To this end, high-throughput transcriptome sequencing datasets produced for this work, together with publicly available datasets, were analysed with bioinformatic tools and workflows in order to identify unannotated non-coding genes, profile their expression in different tissues and perform evolutionary conservation analyses. More than 12000 unannotated ovine lncRNAs and 1000 ovine miRNAs were identified in the different analyses, with varying levels of sequence conservation. Differential expression analyses between unstimulated samples and samples stimulated with pathogen infection or vaccination resulted in hundreds of lncRNAs and miRNAs with changed expression. Gene co-expression analyses revealed immune gene-enriched clusters associated with immune system activation. These genes make up a prioritized set of potential candidates for deeper experimental analyses. Taken together, these results should help complete the sheep non-coding gene catalogue, and most importantly, they give evidence of immune state-specific ncRNA expression patterns in a livestock species

    SYSTEM GENETIC ANALYSIS OF MECHANISMS UNDERLYING EXCESSIVE ALCOHOL CONSUMPTION

    Increased alcohol consumption over time is one of the characteristic symptoms of Alcohol Use Disorder (AUD). The molecular mechanisms underlying this escalation in intake are still the subject of study. However, the mesocortical and mesolimbic dopamine pathways, and the extended amygdala, because of their involvement in reward and reinforcement, are believed to play key roles in these behavioral changes. Multiple gene expression studies have shown that alcohol affects the expression of thousands of genes in the brain. The studies discussed in this document use the systems biology technique of co-expression network analysis to attempt to find patterns within genome-wide expression data from two animal models of chronic, high-dose ethanol exposure. These analyses have identified time-dependent and brain-region-specific patterns of expression in C57Bl/6J mice after multiple exposures to intoxicating doses of ethanol and withdrawal. Specifically, they have identified the PFC and HPC as showing long-term ethanol regulation, and identified Let-7 family miRNAs as potential gene expression regulators of chronic ethanol response. Network analysis also indicates that neurotransmitter release and neuroimmune response are highly correlated with ethanol intake in chronically exposed mice. Examining gene expression response to chronic ethanol exposure across a variable genetic background revealed that, although gene expression response may show conserved patterns, underlying differences in gene expression influenced by genetic background may be what truly underlies voluntary ethanol consumption. Finally, combined network analysis of gene expression in the prefrontal cortex (PFC) of mice and macaques following prolonged ethanol exposure demonstrated that neurotransmission, myelination, transcription, cellular respiration, and, possibly, neurovasculature are affected by chronic ethanol across species. 
Taken together, these studies generate several new hypotheses and areas of future research into druggable targets for AUD

    Development and exploitation of GeneFriends: An online database for gene and transcript co-expression analysis

    Although many diseases have been well characterized at the molecular level, the underlying mechanisms often remain unclear. This may be attributed to the large number of genes for which it remains unknown in which biological processes and diseases they play a role. Genes involved in the same biological processes and diseases are often co-expressed, and this information can be used to predict the biological process in which a poorly annotated gene likely plays its primary role. With this purpose, we constructed a co-expression network from a large number of microarray and RNA-seq samples. We conclude that co-expression analysis can be used to postulate the functions of both coding and non-coding genes. Additionally, it can be used to predict diseases they likely play an important role in. It is also shown that gene-function predictions based on a co-expression network that is constructed on a transcript rather than gene level can differentiate between different functions of transcripts originating from the same gene. We have created an online resource, GeneFriends, the first online resource that utilizes a co-expression network constructed from RNA-seq data, also allowing users to query for co-expression at the transcript rather than gene level. This allows researchers to identify and prioritize novel candidate genes and transcripts involved in biological processes and complex diseases. This is a valuable resource to the research community as supported by usage of GeneFriends in a number of independent publications. GeneFriends is available online at: http://GeneFriends.org/. To validate the ability of our tool to identify genes that are relevant to diseases, we tested GeneFriends by conducting a co-expression analysis with seed lists for aging, cancer, and mitochondrial complex I disease. We identified several candidate genes that have previously been predicted as relevant targets for each of these diseases. 
Some of the identified genes were already being tested in clinical trials, supporting the effectiveness of this approach. Furthermore, two of the novel candidates of unknown function that were identified by GeneFriends as co-expressed with cancer genes were selected for experimental validation. Knock-down of the human homologs (C1ORF112 and C12ORF48) of these two candidate genes in HeLa cells slowed proliferation, suggesting that these genes indeed play a role in cancer growth. Co-expression analyses often lead to large lists of gene-disease associations without a clear indication of which genes are most relevant for follow-up studies. To select such relevant genes, those that are important nodes in a co-expression network are often identified under the notion that these are of higher biological relevance than the others. To validate whether this method selects the most relevant genes for aging, we conducted a co-expression analysis on a rat thymus dataset and identified transcription factors that are important network nodes. Whilst literature supports that some of these transcription factors may be important regulators of the aging process, this method can also miss some of the most interesting intervention targets. Lastly, in a rat brain aging RNA-seq dataset, generated in our lab, we tested whether we could identify co-expression modules for which the expression correlates with aging and investigated whether we could identify dietary interventions that potentially affected this correlation. Although modules were identified that correlated with aging, no significant effect of the dietary interventions for any of these modules was detected. Additionally, this dataset contained detailed information about the expression of microRNAs in addition to the whole transcriptome data. This was utilized to investigate whether the expression of microRNAs and that of their targets are negatively correlated, which we did not observe