2,524 research outputs found

    Development of Biclustering Techniques for Gene Expression Data Modeling and Mining

    The next-generation sequencing technologies can generate large-scale biological data with higher resolution, better accuracy, and lower technical variation than their array-based counterparts. RNA sequencing (RNA-Seq) can generate genome-scale gene expression data in biological samples at a given moment, facilitating a better understanding of cell functions at genetic and cellular levels. The abundance of gene expression datasets provides an opportunity to identify genes with similar expression patterns across multiple conditions, i.e., co-expression gene modules (CEMs). Genome-scale identification of CEMs can be modeled and solved by biclustering, a two-dimensional data mining technique that clusters the rows and columns of a gene expression matrix simultaneously. Compared with traditional clustering, which targets global patterns, biclustering can predict local patterns. This feature makes biclustering very useful when applied to big gene expression data, since genes that participate in a cellular process are active only in specific conditions and thus are usually co-expressed under a subset of all conditions. The combination of biclustering and large-scale gene expression data holds promising potential for condition-specific functional pathway/network analysis. However, existing biclustering tools do not perform satisfactorily on high-resolution RNA-Seq data, mainly due to the lack of (i) a consideration of the high sparsity of RNA-Seq data, especially scRNA-Seq data, and (ii) an understanding of the underlying transcriptional regulation signals of the observed gene expression values. QUBIC2, a novel biclustering algorithm, is designed for large-scale bulk RNA-Seq and single-cell RNA-Seq (scRNA-Seq) data analysis. 
Critical novelties of the algorithm are that it (i) uses a truncated model to handle the unreliable quantification of genes with low or moderate expression; (ii) adopts the Gaussian mixture distribution and an information-divergence objective function to capture shared transcriptional regulation signals among a set of genes; (iii) utilizes a dual strategy to expand the core biclusters, aiming to save dropouts from the background; and (iv) provides a statistical framework to evaluate the significance of all identified biclusters. Method validation on comprehensive data sets suggests that QUBIC2 has superior performance in functional module detection and cell type classification. Applications to temporal and spatial data demonstrated that QUBIC2 could derive meaningful biological information from scRNA-Seq data. Also presented in this dissertation is QUBICR. This R package is characterized by an average efficiency improvement of 82% over the source C code of QUBIC. It provides a comprehensive set of functions to facilitate biclustering-based biological studies, including the discretization of expression data, query-based biclustering, bicluster expansion, bicluster comparison, heatmap visualization of any identified biclusters, and co-expression network elucidation. Finally, a systematic summary is provided of the primary applications of biclustering to biological data and of more advanced applications to biomedical data. It will assist researchers in analyzing their big data effectively and generating valuable biological knowledge and novel insights with higher efficiency.
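
To make the idea of a bicluster concrete, here is a minimal sketch in Python. It is not QUBIC2: it assumes the expression matrix has already been discretized to 0/1 flags, and it swaps in a simple greedy seed-expansion heuristic. The function name `grow_bicluster`, the `min_cols` parameter, and the toy matrix are all hypothetical and exist only for illustration.

```python
def grow_bicluster(binary, seed_row, min_cols=2):
    """Greedily grow an all-ones bicluster from one seed gene (row).

    binary: list of rows, each a list of 0/1 flags
            (1 = the gene is "on" in that condition).
    Returns (rows, cols): genes and conditions forming an all-ones submatrix.
    Illustrative heuristic only; NOT the QUBIC2 algorithm.
    """
    # Start with every condition in which the seed gene is "on".
    cols = {j for j, flag in enumerate(binary[seed_row]) if flag}
    rows = [seed_row]
    # Visit the other genes, most overlap with the seed's conditions first.
    candidates = sorted(
        (i for i in range(len(binary)) if i != seed_row),
        key=lambda i: -sum(binary[i][j] for j in cols),
    )
    for i in candidates:
        shared = {j for j in cols if binary[i][j]}
        # Accept the gene only if enough conditions survive the intersection.
        if len(shared) >= min_cols:
            rows.append(i)
            cols = shared
    return sorted(rows), sorted(cols)


# Hypothetical toy data: 4 genes x 5 conditions.
expression_on = [
    [1, 1, 1, 0, 0],  # gene 0
    [1, 1, 1, 0, 1],  # gene 1
    [0, 1, 1, 1, 0],  # gene 2
    [0, 0, 0, 1, 1],  # gene 3
]
rows, cols = grow_bicluster(expression_on, seed_row=0)
print(rows, cols)
```

The result is a local pattern: genes 0 to 2 agree only on a subset of the conditions, which a clustering over all columns would not isolate.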

    High-Throughput Polygenic Biomarker Discovery Using Condition-Specific Gene Coexpression Networks

    Biomarkers can be described as molecular signatures that are associated with a trait or disease. RNA expression data facilitates discovery of biomarkers underlying complex phenotypes because it can capture dynamic biochemical processes that are regulated in tissue-specific and time-specific manners. Gene Coexpression Network (GCN) analysis is a method that utilizes RNA expression data to identify binary gene relationships across experimental conditions. Using a novel GCN construction algorithm, Knowledge Independent Network Construction (KINC), I provide evidence for novel polygenic biomarkers in both plant and animal use cases. Kidney cancer is comprised of several distinct subtypes that demonstrate unique histological and molecular signatures. Using KINC, I have identified gene correlations that are specific to clear cell renal cell carcinoma (ccRCC), the most common form of kidney cancer. ccRCC is associated with two common mutation profiles that respond differently to targeted therapy. By identifying GCN edges that are specific to patients with each of these two mutation profiles, I discovered unique genes with similar biological function, suggesting a role for T cell exhaustion in the development of ccRCC. Medicago truncatula is a legume that is capable of atmospheric nitrogen fixation through a symbiotic relationship between plant and rhizobium that results in root nodulation. This process is governed by complex gene expression patterns that are dynamically regulated across tissues over the course of rhizobial infection. Using de novo RNA sequencing data generated from the root maturation zone at five distinct time points, I identified hundreds of genes that were differentially expressed between control and inoculated plants at specific time points. To discover genes that were co-regulated during this experiment, I constructed a GCN using the KINC software. 
By combining GCN clustering analysis with differentially expressed genes, I present evidence for novel root nodulation biomarkers. These biomarkers suggest that temporal regulation of pathogen response related genes is an important process in nodulation. Large-scale GCN analysis requires computational resources and stable data-processing pipelines. Supercomputers such as Clemson University’s Palmetto Cluster provide data storage and processing resources that enable terabyte-scale experiments. However, with the wealth of public sequencing data available for mining, petabyte-scale experiments are required to provide novel insights across the tree of life. I discuss computational challenges that I have encountered in large-scale RNA expression data mining, and present two workflows, OSG-GEM and OSG-KINC, that enable researchers to access geographically distributed computing resources to handle petabyte-scale experiments.
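
As a rough illustration of what "binary gene relationships" in a co-expression network look like, the sketch below builds edges from pairwise Pearson correlation with a fixed cutoff. This is a generic baseline and not the KINC algorithm; the function names, the 0.9 threshold, and the expression values are assumptions made up for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    # A constant profile has no defined correlation; treat it as 0.
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def coexpression_edges(expr, threshold=0.9):
    """Return (gene1, gene2, r) for every gene pair with |r| >= threshold."""
    edges = []
    for g1, g2 in combinations(sorted(expr), 2):
        r = pearson(expr[g1], expr[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, round(r, 3)))
    return edges


# Hypothetical expression values across four samples.
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 5.9, 8.0],
    "geneC": [4.0, 1.0, 3.5, 2.0],
}
edges = coexpression_edges(expr)
print(edges)
```

Only the geneA/geneB pair passes the cutoff here. A condition-specific method would additionally ask in which samples a correlation holds, which this baseline does not attempt.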

    Statistical Algorithms and Bioinformatics Tools Development for Computational Analysis of High-throughput Transcriptomic Data

    Next-Generation Sequencing technologies allow for a substantial increase in the amount of data available for various biological studies. In order to effectively and efficiently analyze this data, computational approaches combining mathematics, statistics, computer science, and biology are implemented. Even with the substantial efforts devoted to development of these approaches, numerous issues and pitfalls remain. One of these issues is mapping uncertainty, in which read alignment results are biased due to the inherent difficulties associated with accurately aligning RNA-Sequencing reads. GeneQC is an alignment quality control tool that provides insight into the severity of mapping uncertainty in each annotated gene from alignment results. GeneQC uses feature extraction to identify three levels of information for each gene and implements elastic net regularization and mixture model fitting to provide insight into the severity of mapping uncertainty and the quality of read alignment. In combination with GeneQC, the Ambiguous Reads Mapping (ARM) algorithm works to re-align ambiguous reads through the integration of motif prediction from metabolic pathways to establish coregulatory gene modules for re-alignment using a negative binomial distribution-based probabilistic approach. These two tools work in tandem to address the issue of mapping uncertainty and provide more accurate read alignments, and thus more accurate expression estimates. Also presented in this dissertation are two approaches to interpreting the expression estimates. The first is IRIS-EDA, an integrated Shiny web server that combines numerous analyses to investigate gene expression data generated from RNA-Sequencing data. The second is ViDGER, an R/Bioconductor package that quickly generates high-quality visualizations of differential gene expression results to assist users in comprehensively interpreting those results, which is a non-trivial task. 
These four tools cover a variety of aspects of modern RNA-Seq analyses and aim to address bottlenecks related to algorithmic and computational issues, as well as to provide more efficient and effective implementation methods.

    Swine blood transcriptomics: Application and advancement

    Improving swine feed efficiency (FE) by selection for low residual feed intake (RFI) is of practical interest. However, whether selection for low RFI compromises a pig’s immune response is not clear. In addition, current RFI-based selection for improving feed efficiency is expensive and time-consuming. Seeking alternative tools to facilitate selection, such as predictive biomarkers for RFI, is of great interest. The objectives of this thesis are as follows: (1) to investigate whether selection for low RFI compromises a pig’s immune response; (2) to develop candidate biomarkers applicable at an early growth stage for predicting RFI at a late growth stage; (3) to improve the annotation of the porcine blood transcriptome. In Chapter 2, pigs of two lines divergently selected for RFI were injected with lipopolysaccharide (LPS). Transcriptomes of peripheral blood at baseline and at multiple time points post injection were profiled by RNA-seq. LPS injection induced a systemic inflammatory response in both RFI lines. However, no significant differences were detected in the dynamics of body temperature, blood cell count and cytokine levels during the time course. Only a very small number of differentially expressed genes (DEGs) were detected between the lines over all time points, though ~50% of blood genes were differentially expressed post LPS injection compared to baseline for each line. The two lines were largely similar in most biological pathways and processes studied. Minor differences included a slightly lower level of inflammatory response in the low- versus high-RFI animals. Cross-species comparison showed that humans and pigs responded to LPS stimulation similarly at both the gene and pathway levels, though pigs are more tolerant to LPS than humans. In Chapter 3, post-weaning blood transcriptomic differences between the two lines were studied by RNA-seq. 
DEGs between the lines significantly overlapped gene sets associated with human diseases, such as eating disorders, hyperphagia and mitochondrial disease. Genes functioning in the mitochondrion and proteasome had lower expression, and genes functioning in signaling had higher expression, in the low-RFI group relative to the high-RFI group. Expression levels of five genes differentially expressed between the two groups were significantly associated with individual animals’ RFI values. These five genes are candidate biomarkers for predicting RFI. Given the limitations of the current annotation of the porcine reference genome, a high-quality annotated transcriptome of porcine peripheral blood was built in the last study via a hybrid assembly strategy, using a large amount of blood RNA-seq data from the studies mentioned above and from public databases. Taken together, this work provides evidence that selection for low RFI did not significantly compromise pigs’ immune response to systemic inflammation, offers a few candidate biomarkers for predicting RFI to facilitate RFI-based selection, and significantly advances the structural and functional annotation of the porcine blood transcriptome.

    Genome-wide Transcriptome Analysis of Cotton (Gossypium hirsutum L.) to Identify Genes in Response to Aspergillus flavus Infection, and Development of RNA-Seq Data Analysis Pipeline

    Aflatoxins are toxic and potent carcinogenic metabolites produced by Aspergillus flavus and A. parasiticus. Aflatoxins can contaminate cottonseed under conducive environmental conditions. Much success has been achieved by the application of atoxigenic strains of A. flavus for controlling aflatoxin contamination in cotton, peanut and maize. Development of aflatoxin-resistant cultivars overexpressing resistance-associated genes and/or knocking down aflatoxin biosynthesis of A. flavus could be an effective strategy for controlling aflatoxin contamination in cotton. In this study, differentially expressed genes (DEGs) were identified in response to infection with both toxigenic and atoxigenic strains of A. flavus in pericarp and seed of cotton through genome-wide transcriptome profiling. The genes involved in antifungal response, oxidative burst, transcription factors, defense signaling pathways and stress response were highly differentially expressed in pericarp and seed tissues in response to A. flavus infection. The cell-wall modifying genes and genes involved in the production of antimicrobial substances were more active in pericarp than in seed. Genes involved in defense response in cotton were highly induced in pericarp. The DEGs will serve as a source for identifying biomarkers for breeding and potential candidate genes for transgenic manipulation, and will help in understanding the complex plant-fungal interaction in future downstream research. The increasing volume of sequence data generated as the cost of RNA sequencing (RNA-Seq) rapidly decreases necessitates the development of software pipeline(s) that can analyze massive amounts of RNA-Seq data efficiently. In the present study, a comprehensive and flexible Standalone RNA-Seq Analysis Pipeline (SRAP), implemented with a parallel programming approach, was developed, which can analyze the transcriptome for any genome. 
SRAP consists of high-level modules, including sequence read filtering, mapping to a reference genome (or transcriptome), sequence assembly, gene expression analysis and variant discovery, along with low-level modules for other common NGS utilities. The high-level modules, unlike the low-level modules, require intense computation in terms of memory and processor. SRAP is built with in-house Python scripts, parallel computing and open-source bioinformatics tools. It can be executed in batch and/or individual mode for single or multiple sample files. SRAP generates RNA-Seq data analysis output files with statistical summaries and graphic visualizations.

    Utilizing gene co-expression networks for comparative transcriptomic analyses

    The development of high-throughput technologies such as microarray and next-generation RNA sequencing (RNA-seq) has generated abundant transcriptomic data that can be used for comparative transcriptomics studies. Transcriptomes obtained from different species can reveal differentially expressed genes that underlie species-specific traits. They also have the potential to identify genes that have conserved gene expression patterns. However, differential expression alone does not provide information about how the genes relate to each other in terms of gene expression or whether groups of genes are correlated in similar ways across species, tissues, etc. This makes gene expression networks, such as co-expression networks, valuable for finding similarities or differences between genes based on their relationships with other genes. The desired outcome of this research was to develop methods for comparative transcriptomics, specifically for comparing gene co-expression networks (GCNs), either within or between any set of organisms. These networks represent genes as nodes, and pairs of genes may be connected by an edge representing the strength of the relationship between the pair. We begin with a review of currently available techniques that can be used or adapted to compare gene co-expression networks. We also work to systematically determine the appropriate number of samples needed to construct reproducible gene co-expression networks for comparison purposes. In order to systematically compare these replicate networks, software to visualize the relationship between replicate networks was created to determine when the consistency of the networks begins to plateau and whether this is affected by factors such as tissue type and sample size. 
Finally, we developed a tool called Juxtapose that utilizes gene embedding to functionally interpret the commonalities and differences between a given set of co-expression networks constructed using transcriptome datasets from various organisms. A set of transcriptome datasets was utilized from publicly available sources as well as from collaborators. GTEx and Gene Expression Omnibus (GEO) RNA-seq datasets were used for the evaluation of the techniques proposed in this research. Skeletal cell datasets of closely related species and more evolutionarily distant organisms were also analyzed to investigate the evolutionary relationships of several skeletal cell types. We found evidence that data characteristics such as tissue origin, as well as the method used to construct gene co-expression networks, can substantially impact the number of samples required to generate reproducible networks. In particular, if a threshold is used to construct a gene co-expression network for downstream analyses, the number of samples used to construct the networks is an important consideration, as many samples may be required to generate networks that have a reproducible edge order when sorted by edge weight. We also demonstrated the capabilities of our proposed method for comparing GCNs, Juxtapose, showing that it is capable of consistently matching up genes in identical networks and that it reflects the similarity between different networks using cosine distance as a measure of gene similarity. Finally, we applied our proposed method to skeletal cell networks and found evidence of conserved gene relationships within skeletal GCNs from the same species, and identified modules of genes with similar embeddings across species that are enriched for biological processes involved in cartilage and osteoblast development. Furthermore, smaller sub-networks of genes reflect the phylogenetic relationships of the species analyzed using our gene embedding strategy to compare the GCNs. 
This research has produced methodologies and tools that can be used for evolutionary studies and are generalizable to scenarios other than cross-species comparisons, including co-expression network comparisons across tissues or conditions within the same species.
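
The abstract above names cosine distance as the measure of gene similarity between embeddings. The following is a minimal sketch of that measure only, not of Juxtapose itself; the function name and the two-dimensional embedding vectors are hypothetical.

```python
from math import sqrt


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical embeddings of the same gene learned from two networks.
embedding_net1 = [1.0, 2.0]
embedding_net2 = [2.0, 4.0]   # same direction, different magnitude
embedding_other = [-2.0, 1.0]  # orthogonal to embedding_net1

same_direction = cosine_distance(embedding_net1, embedding_net2)
orthogonal = cosine_distance(embedding_net1, embedding_other)
print(same_direction, orthogonal)
```

Because the measure depends only on direction, embeddings that differ in scale but point the same way are treated as near-identical, while orthogonal embeddings sit at distance 1.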