    Sensitivity And Specificity Of Gene Set Analysis

    High-throughput technologies are widely used for understanding biological processes. Gene set analysis is a well-established computational approach for providing a concise biological interpretation of high-throughput gene expression data. Gene set analysis utilizes the available knowledge about the groups of genes involved in cellular processes or functions. Large collections of such groups of genes, referred to as gene set databases, are available through online repositories to facilitate gene set analysis. There are a large number of gene set analysis methods available, and current recommendations and guidelines about the method of choice for a given experiment are often inconsistent and contradictory. It has also been reported that some gene set analysis methods suffer from a lack of specificity. Furthermore, the sheer size of gene set databases makes it difficult to study these databases and their effect on gene set analysis. In this thesis, we propose quantitative approaches for the study of reproducibility, sensitivity, and specificity of gene set analysis methods; characterize gene set databases; and offer guidelines for choosing an appropriate gene set database for a given experiment. We review commonly used gene set analysis methods; classify these methods based on their components; describe the underlying requirements and assumptions for each class; suggest the appropriate method to be used for a given experiment; and explain the challenges and pitfalls in interpreting results for each class of methods. We propose a methodology and use it for evaluating the effect of sample size on the results of thirteen gene set analysis methods utilizing real datasets. Further, to investigate the effect of method choice on the results of gene set analysis, we develop a quantitative approach and use it to evaluate ten commonly used gene set analysis methods. 
We also quantify and visualize gene set overlap and study its effect on the specificity of over-representation analysis. We propose Silver, a quantitative framework for simulating gene expression datasets and evaluating gene set analysis methods without relying on the oversimplifying assumptions commonly made in such evaluations. Finally, we propose a systematic approach to selecting appropriate gene set databases for conducting gene set analysis for a given experiment. Using this approach, we highlight the drawbacks of meta-databases such as MSigDB, a well-established gene set database made by extracting gene sets from several sources including GO, KEGG, Reactome, and BioCarta. Our findings suggest that the results of most gene set analysis methods are not reproducible for small sample sizes. In addition, the results of gene set analysis vary significantly depending on the method used, with little to no commonality between the 20 most significant results. We show that there is a significant negative correlation between gene set overlap and the specificity of over-representation analysis. This suggests that gene set overlap should be taken into account when developing and evaluating gene set analysis methods. We show that the datasets synthesized using Silver preserve complex gene-gene correlations and the distribution of expression values. Using Silver provides unbiased insight into how gene set analysis methods behave when applied to real datasets and real gene set databases. Our quantitative study of several well-established gene set databases reveals that commonly used gene set databases fall short in representing some phenotypes. The methodologies proposed and the results achieved in this research reveal the main challenges facing gene set analysis. We identify key factors that contribute to the lack of specificity and reproducibility of gene set analysis methods, establishing the direction for future research. 
The quantitative methodologies proposed in this thesis also facilitate the design and development of gene set analysis methods and gene set databases, and benefit a wide range of researchers who use high-throughput technologies.
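As a rough illustration of two quantities the abstract refers to, the sketch below computes an over-representation p-value with the upper tail of the hypergeometric distribution (a common choice for over-representation analysis) and measures overlap between two gene sets with the Jaccard index. This is not the thesis's implementation: the function names `ora_p_value` and `jaccard`, the choice of the Jaccard index as the overlap measure, and the toy gene identifiers are all assumptions made for this example.

```python
from math import comb


def ora_p_value(gene_set, de_genes, background):
    """Upper-tail hypergeometric p-value for over-representation.

    Probability of seeing at least as many differentially expressed
    genes in `gene_set` as observed, if the differentially expressed
    genes were drawn at random from `background`.
    """
    background = set(background)
    gene_set = set(gene_set) & background
    de_genes = set(de_genes) & background

    n_background = len(background)        # population size
    n_in_set = len(gene_set)              # "successes" in the population
    n_de = len(de_genes)                  # number of draws
    n_hits = len(gene_set & de_genes)     # observed successes

    total = comb(n_background, n_de)
    tail = sum(
        comb(n_in_set, i) * comb(n_background - n_in_set, n_de - i)
        for i in range(n_hits, min(n_in_set, n_de) + 1)
    )
    return tail / total


def jaccard(set_a, set_b):
    """Jaccard index: shared genes divided by all genes in either set."""
    set_a, set_b = set(set_a), set(set_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Hypothetical toy data: 20 background genes, two overlapping gene sets.
background = {f"g{i}" for i in range(20)}
set_a = {"g0", "g1", "g2", "g3", "g4"}
set_b = {"g3", "g4", "g5", "g6", "g7"}
de_genes = {"g0", "g1", "g2", "g3"}

p_a = ora_p_value(set_a, de_genes, background)
p_b = ora_p_value(set_b, de_genes, background)
overlap = jaccard(set_a, set_b)
```

In this toy case `set_b` shares genes with `set_a`, so part of whatever signal `set_b` shows is inherited from the overlap. That is the kind of effect on specificity the abstract describes, although the thesis may quantify overlap differently.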

    Utilizing gene co-expression networks for comparative transcriptomic analyses

    The development of high-throughput technologies such as microarrays and next-generation RNA sequencing (RNA-seq) has generated large volumes of transcriptomic data that can be used for comparative transcriptomics studies. Transcriptomes obtained from different species can reveal differentially expressed genes that underlie species-specific traits. They also have the potential to reveal genes with conserved expression patterns. However, differential expression alone does not provide information about how genes relate to each other in terms of expression, or whether groups of genes are correlated in similar ways across species, tissues, and so on. This makes gene expression networks, such as co-expression networks, valuable for finding similarities or differences between genes based on their relationships with other genes. The goal of this research was to develop methods for comparative transcriptomics, specifically for comparing gene co-expression networks (GCNs), either within or between any set of organisms. These networks represent genes as nodes, and a pair of genes may be connected by an edge representing the strength of the relationship between them. We begin with a review of current techniques that can be used or adapted to compare gene co-expression networks. We also systematically determine the number of samples needed to construct reproducible gene co-expression networks for comparison purposes. To compare these replicate networks systematically, we created software that visualizes the relationship between replicate networks, in order to determine when the consistency of the networks begins to plateau and whether this is affected by factors such as tissue type and sample size. 
Finally, we developed a tool called Juxtapose that utilizes gene embedding to functionally interpret the commonalities and differences between a given set of co-expression networks constructed using transcriptome datasets from various organisms. Transcriptome datasets were obtained from publicly available sources as well as from collaborators. GTEx and Gene Expression Omnibus (GEO) RNA-seq datasets were used to evaluate the techniques proposed in this research. Skeletal cell datasets from closely related species and from more evolutionarily distant organisms were also analyzed to investigate the evolutionary relationships of several skeletal cell types. We found evidence that data characteristics such as tissue origin, as well as the method used to construct gene co-expression networks, can substantially impact the number of samples required to generate reproducible networks. In particular, if a threshold is used to construct a gene co-expression network for downstream analyses, the number of samples is an important consideration, as many samples may be required to generate networks that have a reproducible edge order when sorted by edge weight. We also demonstrated the capabilities of our proposed method for comparing GCNs, Juxtapose, showing that it consistently matches up genes in identical networks and that it reflects the similarity between different networks using cosine distance as a measure of gene similarity. Finally, we applied our proposed method to skeletal cell networks and found evidence of conserved gene relationships within skeletal GCNs from the same species. We also identified modules of genes with similar embeddings across species that are enriched for biological processes involved in cartilage and osteoblast development. Furthermore, when the GCNs are compared using our gene embedding strategy, smaller sub-networks of genes reflect the phylogenetic relationships of the species analyzed. 
This research has produced methodologies and tools that can be used for evolutionary studies and that generalize to scenarios other than cross-species comparisons, including co-expression network comparisons across tissues or conditions within the same species.
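The abstract mentions cosine distance between gene embeddings as the measure used to compare genes across networks. The sketch below shows that calculation and a naive nearest-neighbour matching of genes between two embedding tables. It is only an illustration under stated assumptions, not the Juxtapose implementation: the function names `cosine_distance` and `match_genes`, the gene labels, and the three-dimensional vectors are all hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def match_genes(embeddings_a, embeddings_b):
    """Map each gene in network A to its closest gene in network B.

    Both arguments are dicts of gene name -> embedding vector.
    """
    return {
        gene: min(
            embeddings_b,
            key=lambda other: cosine_distance(vector, embeddings_b[other]),
        )
        for gene, vector in embeddings_a.items()
    }


# Hypothetical embeddings for one toy network.
network = {
    "geneA": [1.0, 0.0, 0.2],
    "geneB": [0.0, 1.0, 0.1],
    "geneC": [0.9, 0.1, 0.3],
}

# Matching a network against an identical copy should pair each gene
# with itself, which mirrors the sanity check the abstract describes.
matches = match_genes(network, dict(network))
```

A real comparison would also need the embeddings of the two networks to live in a shared space. How that alignment is achieved is outside the scope of this sketch.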