
    STATISTICAL LEARNING METHODS FOR UNCOVERING GENE REGULATION MECHANISMS

    Gene regulation is a complex process that controls gene product levels through factors such as transcription factors, epigenetic modifications, RNA, and proteins (Mack and Nachman, 2017). This mechanism is pivotal in biological processes, and its disruption can lead to disease. Understanding it is crucial for gene therapy. This proposal aims to develop innovative statistical techniques for unraveling gene regulation, focusing on cis-regulatory elements (CREs). Our first project studies allelic expression (AE) to detect genes influenced by local genetic variation in CREs. We introduce airpart, a model for allelic imbalance (AI) analysis in single-cell and temporal datasets. airpart features (i) a Generalized Fused Lasso with Binomial likelihood to partition cells by AI signal, ensuring interpretability; and (ii) a hierarchical Bayesian model for hypothesis testing of AI presence within each cell state and of differential AI (DAI) across cell states. Simulation and real data analyses show that airpart accurately detects cell type partitions, reduces RMSE in allelic ratio estimates, and outperforms existing methods. Our second project concerns enrichment analysis, which assesses whether gene sets represent biological functions, pathways, or processes. To generate null distributions for such tests, we introduce bootRanges, fast functions that produce block bootstrapped genomic ranges. We demonstrate that conventional shuffling or permutation methods often yield overly narrow null distributions of the test statistic, inflating statistical significance. Block bootstrap, however, preserves local genomic correlations and provides reliable null distributions. Real data analyses show its applicability across various test statistics. In our third project, we aim to link CREs to genes using multi-omics time series data. We predict enhancer-promoter pairs from candidate pairs by analyzing correlations between enhancer activity and gene expression over time. We propose GPlag, a Gaussian process-based model that flexibly handles time-lagged and irregular time series. Predictions are validated using high-throughput chromosome conformation capture (Hi-C) and expression quantitative trait loci (eQTL) datasets. Advancing our understanding of gene regulation mechanisms and developing new statistical tools contribute to gene therapy and genetic control research.
    Doctor of Philosophy
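The contrast between shuffling and block bootstrapping described in the abstract above can be illustrated with a minimal Python sketch. This is not the bootRanges implementation or its API (bootRanges is an R/Bioconductor function operating on genomic ranges); the function names `block_bootstrap`, `shuffle`, and `n_switches`, the toy binary track, and the block length are all assumptions made for illustration.

```python
import numpy as np

def block_bootstrap(values, block_len, rng):
    """Resample a 1-D track by concatenating randomly chosen contiguous blocks."""
    values = np.asarray(values)
    n = len(values)
    if not 1 <= block_len <= n:
        raise ValueError("block_len must be between 1 and len(values)")
    n_blocks = -(-n // block_len)  # ceiling division
    # Draw block start positions with replacement.
    starts = rng.integers(0, n - block_len + 1, size=n_blocks)
    blocks = [values[s:s + block_len] for s in starts]
    # Trim so the resampled track has the original length.
    return np.concatenate(blocks)[:n]

def shuffle(values, rng):
    """Permute positions independently, destroying all local correlation."""
    return rng.permutation(values)

def n_switches(x):
    """Count adjacent positions whose values differ (a crude clustering measure)."""
    return int(np.sum(x[1:] != x[:-1]))

rng = np.random.default_rng(0)
# Hypothetical toy track: alternating runs of 20 zeros and 20 ones (length 1000),
# standing in for spatially clustered genomic features.
track = np.tile(np.repeat([0, 1], 20), 25)

boot = block_bootstrap(track, block_len=100, rng=rng)
perm = shuffle(track, rng)

print("switches in original:       ", n_switches(track))
print("switches in block bootstrap:", n_switches(boot))
print("switches in shuffle:        ", n_switches(perm))
```

In this toy setting the block-bootstrapped track keeps the run structure of the original inside each block, whereas the shuffled track scatters the features; a null distribution built from shuffled tracks would therefore ignore the clustering that the real data exhibit.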

    On the Identifiability and Interpretability of Gaussian Process Models

    In this paper, we critically examine the prevalent practice of using additive mixtures of Mat\'ern kernels in single-output Gaussian process (GP) models and explore the properties of multiplicative mixtures of Mat\'ern kernels for multi-output GP models. For the single-output case, we derive a series of theoretical results showing that the smoothness of a mixture of Mat\'ern kernels is determined by the least smooth component and that a GP with such a kernel is effectively equivalent to the least smooth kernel component. Furthermore, we demonstrate that none of the mixing weights or parameters within individual kernel components are identifiable. We then turn our attention to multi-output GP models and analyze the identifiability of the covariance matrix $A$ in the multiplicative kernel $K(x,y) = A K_0(x,y)$, where $K_0$ is a standard single-output kernel such as Mat\'ern. We show that $A$ is identifiable up to a multiplicative constant, suggesting that multiplicative mixtures are well suited for multi-output tasks. Our findings are supported by extensive simulations and real applications for both single- and multi-output settings. This work provides insight into kernel selection and interpretation for GP models, emphasizing the importance of choosing appropriate kernel structures for different tasks.
    Comment: 37th Conference on Neural Information Processing Systems (NeurIPS 2023)
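A short worked identity may help show why the scale of $A$ cannot be recovered in the multiplicative kernel. This is an illustration only, not the paper's proof; the constant $c$, the correlation function $\rho$, and the assumption that $K_0$ carries its own free variance parameter $\sigma^2$ are introduced here.

\[
K(x,y) \;=\; A\,K_0(x,y) \;=\; (cA)\,\bigl(c^{-1}K_0(x,y)\bigr) \qquad \text{for any } c > 0 .
\]

If $K_0(x,y) = \sigma^2 \rho(x,y)$, the parameter pairs $(A, \sigma^2)$ and $(cA, \sigma^2/c)$ give the same covariance and therefore the same Gaussian likelihood, so the data cannot distinguish them. Fixing the scale of $K_0$ (for example $\sigma^2 = 1$) removes this particular ambiguity.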

    Gaussian Processes for Time Series with Lead-Lag Effects with applications to biology data

    Investigating the relationship, particularly the lead-lag effect, between time series is a common question across various disciplines, especially when uncovering biological processes. However, analyzing time series presents several challenges. Firstly, for technical reasons, observations are often not made at uniform intervals. Secondly, some lead-lag effects are transient, necessitating time-lag estimation based on a limited number of time points. Thirdly, external factors also impact these time series, requiring a similarity metric to assess the lead-lag relationship. To counter these issues, we introduce a model grounded in the Gaussian process, affording the flexibility to estimate lead-lag effects for irregular time series. In addition, our method outputs dissimilarity scores, thereby broadening its applications to include tasks such as ranking or clustering multiple pairwise time series by the strength of their lead-lag effects in the presence of external factors. Crucially, we offer a series of theoretical proofs to substantiate the validity of our proposed kernels and the identifiability of kernel parameters. Our model demonstrates advances in various simulations and real-world applications, particularly in the study of dynamic chromatin interactions, compared to other leading methods.
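As a rough illustration of lag estimation on irregularly sampled series, the following Python sketch shifts the second series by each candidate lag, pools both series under one Gaussian process, and keeps the lag with the highest log marginal likelihood. This is a simplified stand-in, not the kernel or estimator proposed in the paper: the squared-exponential kernel, the fixed hyperparameters, the grid search, the simulated sine signal, and all function names are assumptions made for this example.

```python
import numpy as np

def rbf(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel between two 1-D arrays of time points."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

def log_marginal_likelihood(t, y, noise=0.05, length_scale=1.0):
    """Log marginal likelihood of a zero-mean GP with fixed hyperparameters."""
    K = rbf(t, t, length_scale) + noise**2 * np.eye(len(t))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.log(np.diag(L)).sum()
            - 0.5 * len(t) * np.log(2 * np.pi))

def estimate_lag(t1, y1, t2, y2, candidate_lags):
    """Grid search: shift series 2 back by each lag and score the pooled data."""
    scores = []
    for lag in candidate_lags:
        t = np.concatenate([t1, t2 - lag])
        y = np.concatenate([y1, y2])
        scores.append(log_marginal_likelihood(t, y))
    scores = np.array(scores)
    return float(candidate_lags[int(np.argmax(scores))]), scores

def signal(t):
    """Hypothetical latent signal shared by both series."""
    return np.sin(t)

rng = np.random.default_rng(1)
true_lag = 1.5
# Irregular observation times for each series.
t1 = np.sort(rng.uniform(0, 10, 15))
t2 = np.sort(rng.uniform(0, 10, 15))
y1 = signal(t1) + 0.05 * rng.normal(size=15)
y2 = signal(t2 - true_lag) + 0.05 * rng.normal(size=15)  # series 2 lags series 1

lags = np.linspace(0.0, 3.0, 31)
best_lag, scores = estimate_lag(t1, y1, t2, y2, lags)
print("estimated lag:", best_lag)
```

Because the two series are never required to share time points, the approach works directly on irregular sampling grids. A fuller treatment would also estimate the kernel hyperparameters and, as the abstract describes, a dissimilarity score between the series.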

    The tidyomics ecosystem: Enhancing omic data analyses

    The growth of omic data presents evolving challenges in data manipulation, analysis, and integration. Addressing these challenges, Bioconductor provides an extensive community-driven biological data analysis platform. Meanwhile, tidy R programming offers a revolutionary standard for data organisation and manipulation. Here, we present the tidyomics software ecosystem, bridging Bioconductor to the tidy R paradigm. This ecosystem aims to streamline omic analysis, ease learning, and encourage cross-disciplinary collaborations. We demonstrate the effectiveness of tidyomics by analysing 7.5 million peripheral blood mononuclear cells from the Human Cell Atlas, spanning six data frameworks and ten analysis tools.
    Competing Interest Statement: R.G. has received consulting income from Takeda and Sanofi, and declares ownership in Ozette Technologies. M.K. is an employee of and declares ownership in Achilles Therapeutics. The remaining authors declare no competing interests.

    Airpart: interpretable statistical models for analyzing allelic imbalance in single-cell datasets.

    MOTIVATION: Allelic expression analysis aids in the detection of cis-regulatory mechanisms of genetic variation, which produce allelic imbalance (AI) in heterozygotes. Measuring AI in bulk data lacking time or spatial resolution has the limitation that cell-type-specific (CTS), spatial- or time-dependent AI signals may be dampened or not detected. RESULTS: We introduce a statistical method, airpart, for identifying differential CTS AI from single-cell RNA-sequencing data, or dynamic AI from other spatially or time-resolved datasets. airpart outputs discrete partitions of data, pointing to groups of genes and cells under common mechanisms of cis-genetic regulation. To account for low counts in single-cell data, our method uses a Generalized Fused Lasso with Binomial likelihood for partitioning groups of cells by AI signal, and a hierarchical Bayesian model for AI statistical inference. In simulation, airpart accurately detected partitions of cell types by their AI and had lower Root Mean Square Error (RMSE) of allelic ratio estimates than existing methods. In real data, airpart identified differential allelic imbalance patterns across cell states and could be used to define trends of AI signal over spatial or time axes. AVAILABILITY AND IMPLEMENTATION: The airpart package is available as an R/Bioconductor package at https://bioconductor.org/packages/airpart. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.