331 research outputs found

    Utilizing gene co-expression networks for comparative transcriptomic analyses

    Get PDF
    The development of high-throughput technologies such as microarray and next-generation RNA sequencing (RNA-seq) has generated numerous transcriptomic data that can be used for comparative transcriptomics studies. Transcriptomes obtained from different species can reveal differentially expressed genes that underlie species-specific traits. It also has the potential to identify genes that have conserved gene expression patterns. However, differential expression alone does not provide information about how the genes relate to each other in terms of gene expression or if groups of genes are correlated in similar ways across species, tissues, etc. This makes gene expression networks, such as co-expression networks, valuable in terms of finding similarities or differences between genes based on their relationships with other genes. The desired outcome of this research was to develop methods for comparative transcriptomics, specifically for comparing gene co-expression networks (GCNs), either within or between any set of organisms. These networks represent genes as nodes in the network, and pairs of genes may be connected by an edge representing the strength of the relationship between the pairs. We begin with a review of currently utilized techniques available that can be used or adapted to compare gene co-expression networks. We also work to systematically determine the appropriate number of samples needed to construct reproducible gene co-expression networks for comparison purposes. In order to systematically compare these replicate networks, software to visualize the relationship between replicate networks was created to determine when the consistency of the networks begins to plateau and if this is affected by factors such as tissue type and sample size. Finally, we developed a tool called Juxtapose that utilizes gene embedding to functionally interpret the commonalities and differences between a given set of co-expression networks constructed using transcriptome datasets from various organisms. A set of transcriptome datasets were utilized from publicly available sources as well as from collaborators. GTEx and Gene Expression Omnibus (GEO) RNA-seq datasets were used for the evaluation of the techniques proposed in this research. Skeletal cell datasets of closely related species and more evolutionarily distant organisms were also analyzed to investigate the evolutionary relationships of several skeletal cell types. We found evidence that data characteristics such as tissue origin, as well as the method used to construct gene co-expression networks, can substantially impact the number of samples required to generate reproducible networks. In particular, if a threshold is used to construct a gene co-expression network for downstream analyses, the number of samples used to construct the networks is an important consideration as many samples may be required to generate networks that have a reproducible edge order when sorted by edge weight. We also demonstrated the capabilities of our proposed method for comparing GCNs, Juxtapose, showing that it is capable of consistently matching up genes in identical networks, and it also reflects the similarity between different networks using cosine distance as a measure of gene similarity. Finally, we applied our proposed method to skeletal cell networks and find evidence of conserved gene relationships within skeletal GCNs from the same species and identify modules of genes with similar embeddings across species that are enriched for biological processes involved in cartilage and osteoblast development. Furthermore, smaller sub-networks of genes reflect the phylogenetic relationships of the species analyzed using our gene embedding strategy to compare the GCNs. This research has produced methodologies and tools that can be used for evolutionary studies and generalizable to scenarios other than cross-species comparisons, including co-expression network comparisons across tissues or conditions within the same species

    Development of Biclustering Techniques for Gene Expression Data Modeling and Mining

    Get PDF
    The next-generation sequencing technologies can generate large-scale biological data with higher resolution, better accuracy, and lower technical variation than the arraybased counterparts. RNA sequencing (RNA-Seq) can generate genome-scale gene expression data in biological samples at a given moment, facilitating a better understanding of cell functions at genetic and cellular levels. The abundance of gene expression datasets provides an opportunity to identify genes with similar expression patterns across multiple conditions, i.e., co-expression gene modules (CEMs). Genomescale identification of CEMs can be modeled and solved by biclustering, a twodimensional data mining technique that allows clustering of rows and columns in a gene expression matrix, simultaneously. Compared with traditional clustering that targets global patterns, biclustering can predict local patterns. This unique feature makes biclustering very useful when applied to big gene expression data since genes that participate in a cellular process are only active in specific conditions, thus are usually coexpressed under a subset of all conditions. The combination of biclustering and large-scale gene expression data holds promising potential for condition-specific functional pathway/network analysis. However, existing biclustering tools do not have satisfied performance on high-resolution RNA-Seq data, majorly due to the lack of (i) a consideration of high sparsity of RNA-Seq data, especially for scRNA-Seq data, and (ii) an understanding of the underlying transcriptional regulation signals of the observed gene expression values. QUBIC2, a novel biclustering algorithm, is designed for large-scale bulk RNA-Seq and single-cell RNA-seq (scRNA-Seq) data analysis. Critical novelties of the algorithm include (i) used a truncated model to handle the unreliable quantification of genes with low or moderate expression; (ii) adopted the Gaussian mixture distribution and an information-divergency objective function to capture shared transcriptional regulation signals among a set of genes; (iii) utilized a Dual strategy to expand the core biclusters, aiming to save dropouts from the background; and (iv) developed a statistical framework to evaluate the significances of all the identified biclusters. Method validation on comprehensive data sets suggests that QUBIC2 had superior performance in functional modules detection and cell type classification. The applications of temporal and spatial data demonstrated that QUBIC2 could derive meaningful biological information from scRNA-Seq data. Also presented in this dissertation is QUBICR. This R package is characterized by an 82% average improved efficiency compared to the source C code of QUBIC. It provides a set of comprehensive functions to facilitate biclustering-based biological studies, including the discretization of expression data, query-based biclustering, bicluster expanding, biclusters comparison, heatmap visualization of any identified biclusters, and co-expression networks elucidation. In the end, a systematical summary is provided regarding the primary applications of biclustering for biological data and more advanced applications for biomedical data. It will assist researchers to effectively analyze their big data and generate valuable biological knowledge and novel insights with higher efficiency

    Inferring interaction networks from transcriptomic data: methods and applications

    Full text link
    Transcriptomic data is a treasure-trove in modern molecular biology, as it offers a comprehensive viewpoint into the intricate nuances of gene expression dynamics underlying biological systems. This genetic information must be utilised to infer biomolecular interaction networks that can provide insights into the complex regulatory mechanisms underpinning the dynamic cellular processes. Gene regulatory networks and protein-protein interaction networks are two major classes of such networks. This chapter thoroughly investigates the wide range of methodologies used for distilling insightful revelations from transcriptomic data that include association based methods (based on correlation among expression vectors), probabilistic models (using Bayesian and Gaussian models), and interologous methods. We reviewed different approaches for evaluating the significance of interactions based on the network topology and biological functions of the interacting molecules, and discuss various strategies for the identification of functional modules. The chapter concludes with highlighting network based techniques of prioritising key genes, outlining the centrality based, diffusion based and subgraph based methods. The chapter provides a meticulous framework for investigating transcriptomic data to uncover assembly of complex molecular networks for their adaptable analyses across a broad spectrum of biological domains.Comment: 48 pages, 3 figure

    Integrating biclustering techniques with de novo gene regulatory network discovery using RNA-seq from skeletal tissues

    Get PDF
    In order to improve upon stem cell therapy for osteoarthritis, it is necessary to understand the molecular and cellular processes behind bone development and the differences from cartilage formation. To further elucidate these processes would provide a means to analyze the relatedness of bone and cartilage tissue by determining genes that are expressed and regulated for stem cells to differentiate into skeletal tissues. It would also contribute to the classification of differences in normal skeletogenesis and degenerative conditions involving these tissues. The three predominant skeletal tissues of interest are bone, immature cartilage and mature cartilage. Analysis of the transcriptome of these skeletal tissues using RNA-seq technology was performed using differential expression, clustering and biclustering algorithms, to detect similarly expressed genes, which provides evidence for genes potentially interacting together to produce a particular phenotype. Identifying key regulators in the gene regulatory networks (GRNs) driving cartilage and bone development and the differences in the GRNs they drive will facilitate a means to make comparisons between the tissues at the transcriptomic level. Due to a small number of available samples for gene expression data in bone, immature and mature cartilage, it is necessary to determine how the number of samples influences the ability to make accurate GRN predictions. Machine learning techniques for GRN prediction that can incorporate multiple data types have not been well evaluated for complex organisms, nor has RNA-seq data been used often for evaluating these methods. Therefore, techniques identified to work well with microarray data were applied to RNA-seq data from mouse embryonic stem cells, where more samples are available for evaluation compared to the skeletal tissue RNA-seq samples. The RNA-seq data was combined with ChIP-seq data to determine if the machine learning methods outperform simple, correlation-based methods that have been evaluated using RNA-seq data alone. Two of the best performing GRN prediction algorithms from previous large-scale evaluations, which are incapable of incorporating data beyond expression data, were used as a baseline to determine if the addition of multiple data types could help reduce the number of gene expression samples. It was also necessary to identify a biclustering algorithm that could identify potentially biologically relevant modules. Publicly available ChIP-seq and RNA-seq samples from embryonic stem cells were used to measure the performance and consistency of each method, as there was a well-established network in mouse embryonic stem cells to compare results. The methods were then compared to cMonkey2, a biclustering method used in conjunction with ChIP-seq for two important transcription factors in the embryonic stem cell network. This was done to determine if any of these GRN prediction methods could potentially use the small number of skeletal tissue samples available to determine transcription factors orchestrating the expression of other genes driving cartilage and bone formation. Using the embryonic stem cell RNA-seq samples, it was found that sample size, if above 10, does not have a significant impact on the number of true positives in the top predicted interactions. Random forest methods outperform correlation-based methods when using RNA-seq, with area under ROC (AUROC) for evaluation, but the number of true positive interactions predicted when compared to a literature network were similar when using a strict cut-off. Using a limited set of ChIP-seq data was found to not improve the confidence in the transcription factor interactions and had no obvious affect on biclustering results. Correlation-based methods are likely the safest option when based on consistency of the results over multiple runs, but there is still the challenge of determining an appropriate cut-off to the predictions. To predict the skeletal tissue GRNs, cMonkey was used as an initial feature selection method to identify important genes in skeletal tissues and compared with other biclustering methods that do not use ChIP-seq. The predicted skeletal tissue GRNs will be utilized in future analyses of skeletal tissues, focussing on the evolutionary relationship between the GRNs driving skeletal tissue development

    GeRNet: A Gene Regulatory Network Tool

    Get PDF
    Gene regulatory networks (GRNs) are crucial in every process of life since they govern the majority of the molecular processes. Therefore, the task of assembling these networks is highly important. In particular, the so called model-free ap-proaches have an advantage modeling the complexities of dynamic molecular networks, since most of the gene networks are hard to be mapped with accuracy by any other mathematical model. A highly abstract model-free approach, called rule-based approach, offers several advantages performing data-driven analysis; such as the requirement of the least amount of data. They also have an important ability to perform inferences: its simplicity allows the inference of large size mod-els with a higher speed of analysis. However, regarding these techniques, the re-construction of the relational structure of the network is partial, hence incomplete, for an effective biological analysis. This situation motivated us to explore the possibility of hybridizing with other approaches, such as biclustering techniques. This led to incorporate a biclustering tool that finds new relations between these nodes of the GRN. In this work we present a new software, called GeRNeT that integrates the algorithms of GRNCOP2 and BiHEA along a set of tools for interactive visualization, statistical analysis and ontological enrichment of the resulting GRNs. In this regard, results associated with Alzheimer disease datasets are pre-sented that show the usefulness of integrating both bioinformatics tools.Fil: Dussaut, Julieta Sol. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Bahía Blanca. Instituto de Ciencias e Ingeniería de la Computación. Universidad Nacional del Sur. Departamento de Ciencias e Ingeniería de la Computación. Instituto de Ciencias e Ingeniería de la Computación; ArgentinaFil: Gallo, Cristian Andrés. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Bahía Blanca. Instituto de Ciencias e Ingeniería de la Computación. Universidad Nacional del Sur. Departamento de Ciencias e Ingeniería de la Computación. Instituto de Ciencias e Ingeniería de la Computación; ArgentinaFil: Cravero, Fiorella. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Bahía Blanca. Planta Piloto de Ingeniería Química. Universidad Nacional del Sur. Planta Piloto de Ingeniería Química; ArgentinaFil: Martínez, María Jimena. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Bahía Blanca. Instituto de Ciencias e Ingeniería de la Computación. Universidad Nacional del Sur. Departamento de Ciencias e Ingeniería de la Computación. Instituto de Ciencias e Ingeniería de la Computación; ArgentinaFil: Carballido, Jessica Andrea. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Bahía Blanca. Instituto de Ciencias e Ingeniería de la Computación. Universidad Nacional del Sur. Departamento de Ciencias e Ingeniería de la Computación. Instituto de Ciencias e Ingeniería de la Computación; ArgentinaFil: Ponzoni, Ignacio. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Bahía Blanca. Instituto de Ciencias e Ingeniería de la Computación. Universidad Nacional del Sur. Departamento de Ciencias e Ingeniería de la Computación. Instituto de Ciencias e Ingeniería de la Computación; Argentin

    DNA Microarray Data Analysis: A New Survey on Biclustering

    Get PDF
    There are subsets of genes that have similar behavior under subsets of conditions, so we say that they coexpress, but behave independently under other subsets of conditions. Discovering such coexpressions can be helpful to uncover genomic knowledge such as gene networks or gene interactions. That is why, it is of utmost importance to make a simultaneous clustering of genes and conditions to identify clusters of genes that are coexpressed under clusters of conditions. This type of clustering is called biclustering.Biclustering is an NP-hard problem. Consequently, heuristic algorithms are typically used to approximate this problem by finding suboptimal solutions. In this paper, we make a new survey on biclustering of gene expression data, also called microarray data
    corecore