
    Improved Performance of Gene Set Analysis on Genome-Wide Transcriptomics Data When Using Gene Activity State Estimates

    Gene set analysis continues to be a popular and powerful approach to evaluating genome-wide transcriptomics data. These methods require a priori grouping of genes into biologically meaningful sets, followed by downstream analyses at the set (instead of gene) level. Gene set analysis methods have been shown to yield more powerful statistical conclusions than single-gene analyses, owing both to reduced multiple testing penalties and to potentially larger observed effects from aggregating effects across the genes in a set. Traditionally, gene set analysis methods have been applied directly to normalized, log-transformed transcriptomics data. Recently, efforts have been made to transform transcriptomics data to scales that yield more biologically interpretable results. For example, recently proposed models transform log-transformed transcriptomics data to a confidence metric (ranging between 0 and 100%) that a gene is active (roughly speaking, that the gene product is part of an active cellular mechanism). In this manuscript, we demonstrate, on both real and simulated transcriptomics data, that tests for differential expression between sets of genes are typically more powerful when using gene activity state estimates as opposed to log-transformed gene expression data. Our analysis suggests further exploration of techniques to transform transcriptomics data to meaningful quantities for improved downstream inference.
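To make the idea of an activity confidence metric concrete, here is a minimal sketch that maps a log-expression value to a value between 0 and 1 using a two-component Gaussian mixture. It is an illustration only, not the model used in the manuscript: the function names, the component means and standard deviations, and the 50% prior are all hypothetical values chosen for the example.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def activity_confidence(log_expr, inactive=(2.0, 1.5), active=(8.0, 1.5),
                        prior_active=0.5):
    """Posterior probability that a gene is active, via Bayes' rule.

    `inactive` and `active` are (mean, sd) pairs for the two mixture
    components. All parameter values here are assumptions for illustration;
    in practice they would be estimated from the data.
    """
    weight_active = prior_active * normal_pdf(log_expr, *active)
    weight_inactive = (1.0 - prior_active) * normal_pdf(log_expr, *inactive)
    return weight_active / (weight_active + weight_inactive)


# Hypothetical log-expression values for four genes.
log_expression = [1.5, 4.0, 5.0, 9.0]
confidences = [activity_confidence(x) for x in log_expression]

for x, c in zip(log_expression, confidences):
    print(f"log-expression {x:4.1f} -> confidence active {100 * c:6.2f}%")
```

Under these assumed parameters, values near the inactive component map to confidences near 0%, values near the active component map to confidences near 100%, and the midpoint between the two components maps to 50%. A gene set test would then be run on these confidences rather than on the raw log-expression values.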

    Improvements to Bayesian Gene Activity State Estimation from Genome-Wide Transcriptomics Data

    An important task in many biological applications is to estimate or classify gene activity states (active or inactive) based on genome-wide transcriptomics data. Recently, we proposed a Bayesian method, titled MultiMM, which performed better than existing methods on both simulated and real gene expression data, confirming well-known biological results and yielding better agreement with fluxomics data. Despite these promising results, MultiMM has several limitations. First, MultiMM leverages co-regulatory models to improve activity state estimates, but information about co-regulation is incorporated in a manner that assumes that networks are known with certainty. Second, MultiMM assumes that genes that change states in the dataset can be distinguished with certainty from those that remain in one state. Third, the model can be sensitive to extreme measures (outliers) of gene expression. In this manuscript, we propose a modified Bayesian approach that addresses these three limitations by improving outlier handling and by explicitly modeling network and other uncertainty, yielding improved gene activity state estimates when compared to MultiMM.

    Is a gene-centric human proteome project the best way for proteomics to serve biology?

    With the recent developments in proteomic technologies, a complete human proteome project (HPP) appears feasible for the first time. However, there is still debate as to how it should be designed and what it should encompass. In "proteomics speak", the debate revolves around the central question of whether a gene-centric or a protein-centric proteomics approach is the most appropriate way forward. In this paper, we try to shed light on what these definitions mean, how large-scale proteomics such as an HPP can fit into the larger omics chorus, and what we can reasonably expect from an HPP in the way it has been proposed so far.

    Doctor of Philosophy

    Despite advances in therapies, next-generation sequencing, and our knowledge of the disease, breast cancer claims hundreds of thousands of lives around the world every year. Because the disease is heterogeneous, the available therapy options work for only a fraction of the population, and it is still overwhelmingly challenging to match a patient with the appropriate available therapy for the optimal outcome. This dissertation work focuses on using biomedical informatics approaches to develop pathway-based biomarkers that predict personalized drug response in breast cancer, and to assess the feasibility of integrating such biomarkers into current electronic health records to better implement genomics-based personalized medicine. The uncontrolled proliferation in breast cancer is frequently driven by the HER2/PI3K/AKT/mTOR pathway. In this pathway, the AKT node plays an important role in controlling signal transduction. In normal breast cells, proliferation is tightly maintained at a stable rate via AKT. In cancer, however, the balance is disrupted by amplification of upstream growth factor receptors (GFR) such as HER2 and IGF1R and/or by deleterious mutations in PTEN and PIK3CA. Overexpression of AKT leads to increased proliferation and decreased apoptosis and autophagy, which promotes cancer. These known amplifications and the mutation status associated with disease progression are often used as biomarkers for selecting targeted therapies. However, known or unknown downstream mutations and activations in the pathways, as well as crosstalk between the pathways, can make the targeted therapies ineffective. For example, one third of HER2-amplified breast cancer patients do not respond to HER2-targeting therapies such as trastuzumab, possibly due to downstream PTEN loss-of-function mutations or PIK3CA mutations.
To identify pathway aberration with better sensitivity and specificity, I first developed gene-expression-based pathway biomarkers that can identify the deregulation status of the pathway in the sample of interest. Second, I developed drug response prediction models based primarily on pathway activity, breast cancer subtype, proteomics, and mutation data. Third, I assessed the feasibility of including gene expression or transcriptomics data in current electronic health records so that such biomarkers can be implemented in routine clinical care.

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.

    Efficient gene set analysis of high-throughput data : From omics to pathway architecture of health and disease

    Background: A wide range of diseases, normal variations in physiology, and the development of different species are caused by alterations in gene regulation. The study of gene expression is thus crucial for understanding both normal physiology and disease mechanisms. High-throughput measurement technologies allow the profiling of tens of thousands of genes simultaneously. However, the high volume of data thus generated poses methodological challenges in inferring biological consequences from gene expression changes. Traditional gene-wise analysis of high-dimensional data is overwhelming, prone to noise, and unintuitive. The analysis of sets of genes (gene set analysis, GSA) addresses the problem by boosting statistical power and biological interpretability. Despite more than a decade of research on gene set analysis, there are still serious limitations in the existing methods. Aims of the study: The objectives of this study were: (1) development of an efficient p-value estimation method for GSA; (2) development of an advanced permutation method for GSA of multi-group gene expression data with fewer replicates; and (3) implementation of the developed methods for the identification of novel smoking-induced epigenetic signatures at the biological pathway level. Materials and methods: The first study involved the assessment of four different statistical null models for modeling the distribution of gene set scores calculated with the Gene Set Z-score (GSZ) function from permuted gene expression data. A new GSA method, modified GSZ (mGSZ), based on GSZ and the best-performing distribution model, was developed. mGSZ was evaluated by comparing its results with seven other popular GSA methods using four different publicly available gene expression datasets. 
The second study involved the evaluation of six different permutation schemes for GSA of multi-group (more than two groups) datasets based on the identification of reference gene sets generated using a novel data splitting approach. A new GSA method based on a modification of mGSZ (mGSZm) was developed by implementing the best permutation method for the analysis of multi-group data with fewer than six replicates per group. mGSZm was evaluated by contrasting its performance with seven other state-of-the-art GSA methods suitable for multi-group data. The evaluation was based on three different publicly available multi-group datasets. The third study involved an implementation of mGSZ for GSA of genome-wide DNA methylation data from the Cardiovascular Risk in Young Finns study (YFS) cohort with gene sets downloaded from the Molecular Signature Database (MSigDB). Methylation measurements were done on whole-blood samples from a subset of 192 individuals from the 2011 follow-up study using Illumina Infinium HumanMethylation450 BeadChips. Results: Overall, efficient and robust GSA methods were developed (studies I-II) and implemented (study III). In study I, the results demonstrated a clear advantage of asymptotic p-value estimation over empirical methods. mGSZ, a GSA method based on asymptotic p-values, requires fewer permutations, which speeds up the analysis process. mGSZ outperformed state-of-the-art methods based on three different evaluations with three different datasets. In study II, results from a novel evaluation approach with two different datasets suggested that the proposed advanced permutation method outperformed the naive permutation method in GSA of multi-group data with fewer than six replicates. Evaluation of mGSZm, a GSA method equipped with the advanced permutation method and asymptotic …
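The contrast between empirical and asymptotic p-value estimation from permuted scores can be sketched as follows. This is a generic illustration, not the mGSZ implementation: the null scores are simulated, the function names are hypothetical, and a normal distribution is assumed for the fitted null model, whereas the abstract does not say which of the four candidate models was selected.

```python
import math
import random
import statistics


def empirical_p(observed, null_scores):
    """Permutation p-value: share of null scores at least as large as observed.

    The +1 terms keep the estimate away from zero, so the smallest reachable
    value is 1 / (number of permutations + 1).
    """
    exceed = sum(1 for s in null_scores if s >= observed)
    return (exceed + 1) / (len(null_scores) + 1)


def fitted_normal_p(observed, null_scores):
    """Upper-tail p-value from a normal distribution fitted to the null scores.

    Assumption for illustration: the permutation null is roughly normal.
    """
    mean = statistics.fmean(null_scores)
    sd = statistics.stdev(null_scores)
    z = (observed - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Simulated stand-in for gene set scores from 200 label permutations.
rng = random.Random(0)
null_scores = [rng.gauss(0.0, 1.0) for _ in range(200)]
observed_score = 6.0  # hypothetical score for the real (unpermuted) labels

p_emp = empirical_p(observed_score, null_scores)
p_fit = fitted_normal_p(observed_score, null_scores)
print(f"empirical p-value:     {p_emp:.3g}")
print(f"fitted-normal p-value: {p_fit:.3g}")
```

With 200 permutations the empirical estimate cannot fall below 1/201, so resolving very small p-values empirically would need many more permutations. A fitted null distribution can extrapolate into the tail from the same 200 scores, which is the kind of saving the abstract attributes to asymptotic estimation, at the cost of depending on the distributional assumption.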

    Statistical Algorithms and Bioinformatics Tools Development for Computational Analysis of High-throughput Transcriptomic Data

    Next-Generation Sequencing technologies allow for a substantial increase in the amount of data available for various biological studies. In order to effectively and efficiently analyze this data, computational approaches combining mathematics, statistics, computer science, and biology are implemented. Even with the substantial efforts devoted to the development of these approaches, numerous issues and pitfalls remain. One of these issues is mapping uncertainty, in which read alignment results are biased due to the inherent difficulties associated with accurately aligning RNA-Sequencing reads. GeneQC is an alignment quality control tool that provides insight into the severity of mapping uncertainty in each annotated gene from alignment results. GeneQC uses feature extraction to identify three levels of information for each gene and implements elastic net regularization and mixture model fitting to provide insight into the severity of mapping uncertainty and the quality of read alignment. In combination with GeneQC, the Ambiguous Reads Mapping (ARM) algorithm works to re-align ambiguous reads through the integration of motif prediction from metabolic pathways to establish coregulatory gene modules for re-alignment using a negative binomial distribution-based probabilistic approach. These two tools work in tandem to address the issue of mapping uncertainty and provide more accurate read alignments, and thus more accurate expression estimates. Also presented in this dissertation are two approaches to interpreting the expression estimates. The first is IRIS-EDA, an integrated Shiny web server that combines numerous analyses to investigate gene expression data generated from RNA-Sequencing data. The second is ViDGER, an R/Bioconductor package that quickly generates high-quality visualizations of differential gene expression results to assist users in the non-trivial task of interpreting those results comprehensively. 
These four tools cover a variety of aspects of modern RNA-Seq analyses and aim to address bottlenecks related to algorithmic and computational issues, as well as to provide more efficient and effective implementation methods.

    Structured data abstractions and interpretable latent representations for single-cell multimodal genomics

    Single-cell multimodal genomics involves simultaneous measurement of multiple types of molecular data, such as gene expression, epigenetic marks and protein abundance, in individual cells. This allows for a comprehensive and nuanced understanding of the molecular basis of cellular identity and function. The large volume of data generated by single-cell multimodal genomics experiments requires specialised methods and tools for handling, storing, and analysing it. This work provides contributions on multiple levels. First, it introduces a single-cell multimodal data standard, MuData, designed to facilitate the handling, storage and exchange of multimodal data. MuData provides interfaces that enable transparent access to multimodal annotations as well as data from individual modalities. This data structure has formed the foundation for the multimodal integration framework, which enables complex and composable workflows that can be naturally integrated with existing omics-specific analysis approaches. Joint analysis of multimodal data can be performed using integration methods. In order to enable integration of single-cell data, an improved multi-omics factor analysis model (MOFA+) has been designed and implemented, building on the canonical dimensionality reduction approach for multi-omics integration. Inferring latent factors that explain variation across multiple modalities of the data, MOFA+ enables the modelling of latent factors with cell group-specific patterns of activity. The MOFA+ model has been implemented as part of the respective multi-omics integration framework, and its utility has been extended by software solutions that facilitate interactive model exploration and interpretation. The newly improved model for multi-omics integration of single cells has been applied to the study of gene expression signatures upon targeted gene activation. 
In a dataset featuring targeted activation of candidate regulators of zygotic genome activation (ZGA), a crucial transcriptional event in early embryonic development, modelling expression of both coding and non-coding loci with MOFA+ made it possible to rank genes by their potency to activate a ZGA-like transcriptional response. With the identification of Patz1, Dppa2 and Smarca5 as potent inducers of ZGA-like transcription in mouse embryonic stem cells, these findings have contributed to the understanding of the molecular mechanisms behind ZGA and laid the foundation for future research on ZGA in vivo. In summary, this work's contributions include the development of data handling and integration methods as well as new biological insights that arose from applying these methods to studying gene expression regulation in early development. This highlights how single-cell multimodal genomics can help generate valuable insights into complex biological systems.