6 research outputs found
STATISTICAL LEARNING METHODS FOR UNCOVERING GENE REGULATION MECHANISMS
Gene regulation is a complex process controlling gene product levels through factors like transcription factors, epigenetic modifications, RNA, and proteins (Mack and Nachman, 2017). This mechanism is pivotal in biological processes, and disruptions can lead to diseases. Understanding it is crucial for gene therapy. This proposal aims to develop innovative statistical techniques for unraveling gene regulation, focusing on cis-regulatory elements (CREs). Our first project studies allelic expression (AE) to detect genes influenced by local CRE genetic variations. We introduce airpart, a model for allelic imbalance (AI) analysis in single-cell and temporal datasets. airpart features (i) a Generalized Fused Lasso with Binomial likelihood to partition cells by AI signal, ensuring interpretability; and (ii) a hierarchical Bayesian model for hypothesis testing of AI presence within each cell state and differential AI (DAI) across cell states. Simulation and real data analyses show airpart’s accuracy in detecting cell type partitions, reducing RMSE in allelic ratio estimates, and outperforming existing methods. Our second project concerns enrichment analysis, which assesses whether gene sets represent biological functions, pathways, or processes. To generate null hypotheses for such tests, we introduce bootRanges, fast functions producing block bootstrapped genomic ranges. We demonstrate that conventional shuffling or permutation methods often yield overly narrow null test statistic distributions, inflating statistical significance. Block bootstrap, however, preserves local genomic correlations and provides reliable null distributions. Real data analyses show its applicability across various test statistics. In our third project, we aim to link CREs to genes using multi-omics time series data. We predict enhancer-promoter pairs from candidate pairs by analyzing enhancer activity-gene expression correlations over time.
We propose GPlag, a Gaussian process-based model known for its flexibility with time-lagged and irregular time series. Predictions are validated using high-throughput chromosome conformation capture (Hi-C) and expression quantitative trait loci (eQTL) datasets. Advancing our understanding of gene regulation mechanisms and developing new statistical tools contribute to gene therapy and genetic control research.
Doctor of Philosophy
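The contrast between shuffling and block bootstrapping described in the abstract above can be sketched on a toy example. This is a minimal illustration under stated assumptions, not the bootRanges implementation: the genome is reduced to a binned 0/1 track, and the names `block_bootstrap`, `adjacent_agreement`, and `track` are hypothetical.

```python
import random

def block_bootstrap(values, block_len, rng):
    """Resample `values` by concatenating randomly chosen contiguous blocks."""
    n = len(values)
    if not 1 <= block_len <= n:
        raise ValueError("block_len must be between 1 and len(values)")
    out = []
    while len(out) < n:
        start = rng.randrange(n - block_len + 1)  # block start, drawn with replacement
        out.extend(values[start:start + block_len])
    return out[:n]  # trim the last block so the length matches the input

def adjacent_agreement(values):
    """Fraction of neighbouring positions that carry the same value."""
    pairs = list(zip(values, values[1:]))
    return sum(a == b for a, b in pairs) / len(pairs)

# Toy binned track (an assumption): features occur in runs of 10 bins,
# so neighbouring bins are strongly correlated.
track = ([1] * 10 + [0] * 10) * 10

rng = random.Random(0)
shuffled = track[:]
rng.shuffle(shuffled)                                # conventional shuffling
boot = block_bootstrap(track, block_len=20, rng=rng)  # block bootstrap

print("original :", adjacent_agreement(track))
print("shuffled :", adjacent_agreement(shuffled))
print("block    :", adjacent_agreement(boot))
```

Shuffling destroys the run structure, whereas the block-resampled track keeps most neighbouring bins in agreement, which is the property the abstract credits for more reliable null distributions.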
On the Identifiability and Interpretability of Gaussian Process Models
In this paper, we critically examine the prevalent practice of using additive
mixtures of Matérn kernels in single-output Gaussian process (GP) models and
explore the properties of multiplicative mixtures of Matérn kernels for
multi-output GP models. For the single-output case, we derive a series of
theoretical results showing that the smoothness of a mixture of Matérn
kernels is determined by the least smooth component and that a GP with such a
kernel is effectively equivalent to the least smooth kernel component.
Furthermore, we demonstrate that none of the mixing weights or parameters
within individual kernel components are identifiable. We then turn our
attention to multi-output GP models and analyze the identifiability of the
covariance matrix $A$ in the multiplicative kernel $K(x, y) = A K_0(x, y)$, where
$K_0$ is a standard single-output kernel such as Matérn. We show that $A$ is
identifiable up to a multiplicative constant, suggesting that multiplicative
mixtures are well suited for multi-output tasks. Our findings are supported by
extensive simulations and real applications for both single- and multi-output
settings. This work provides insight into kernel selection and interpretation
for GP models, emphasizing the importance of choosing appropriate kernel
structures for different tasks.
Comment: 37th Conference on Neural Information Processing Systems (NeurIPS 2023)
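One way to see why a multi-output covariance matrix can be recovered only up to a multiplicative constant is sketched below. The notation is assumed here rather than taken from the paper: `A` is the output covariance matrix and `k0` a scalar Matérn kernel with its own variance parameter. Multiplying `A` by `c` and dividing the kernel variance by `c` leaves every covariance entry unchanged, so data cannot distinguish the two parameterisations. The function names are hypothetical.

```python
import math

def matern32(x, y, variance=1.0, lengthscale=1.0):
    """Matérn kernel with smoothness nu = 3/2."""
    r = math.sqrt(3.0) * abs(x - y) / lengthscale
    return variance * (1.0 + r) * math.exp(-r)

def multi_output_cov(A, k0, xs):
    """Separable multi-output covariance: entry [(i, a), (j, b)] = A[i][j] * k0(x_a, x_b)."""
    m, n = len(A), len(xs)
    return [[A[i][j] * k0(xs[a], xs[b]) for j in range(m) for b in range(n)]
            for i in range(m) for a in range(n)]

xs = [0.0, 0.4, 1.3]            # input locations (arbitrary)
A = [[1.0, 0.6], [0.6, 2.0]]    # output covariance (arbitrary, positive definite)
c = 5.0                         # any positive constant
A_scaled = [[c * v for v in row] for row in A]

cov1 = multi_output_cov(A, lambda x, y: matern32(x, y, variance=1.0), xs)
cov2 = multi_output_cov(A_scaled, lambda x, y: matern32(x, y, variance=1.0 / c), xs)

# Largest entrywise difference between the two covariance matrices.
max_diff = max(abs(u - v) for r1, r2 in zip(cov1, cov2) for u, v in zip(r1, r2))
print(max_diff)
```

The two covariance matrices agree to floating-point precision, so only the matrix up to scale (for example, the correlation structure between outputs) is interpretable.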
Gaussian Processes for Time Series with Lead-Lag Effects with applications to biology data
Investigating the relationship, particularly the lead-lag effect, between
time series is a common question across various disciplines, especially when
uncovering biological processes. However, analyzing time series presents several
challenges. Firstly, due to technical reasons, the time points at which
observations are made are not at uniform intervals. Secondly, some lead-lag
effects are transient, necessitating time-lag estimation based on a limited
number of time points. Thirdly, external factors also impact these time series,
requiring a similarity metric to assess the lead-lag relationship. To counter
these issues, we introduce a model grounded in the Gaussian process, affording
the flexibility to estimate lead-lag effects for irregular time series. In
addition, our method outputs dissimilarity scores, thereby broadening its
applications to include tasks such as ranking or clustering multiple pair-wise
time series when considering their strength of lead-lag effects with external
factors. Crucially, we offer a series of theoretical proofs to substantiate the
validity of our proposed kernels and the identifiability of kernel parameters.
Our model demonstrates advances in various simulations and real-world
applications, particularly in the study of dynamic chromatin interactions,
compared to other leading methods.
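The idea of estimating a time lag from irregularly sampled series with a Gaussian process can be sketched as follows. This is a simplified stand-in and not the kernel proposed in the paper: it assumes the second series is an exact time-shifted copy of the first, uses a squared-exponential kernel with fixed hyperparameters, and picks the lag by grid search over the Gaussian log marginal likelihood. All function names and the toy data are hypothetical.

```python
import math

def rbf(s, t, lengthscale=1.0):
    """Squared-exponential kernel (an assumption; the paper defines its own kernels)."""
    return math.exp(-0.5 * ((s - t) / lengthscale) ** 2)

def cholesky(M):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(M)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = M[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def log_marginal_likelihood(times, values, noise=1e-4):
    """Zero-mean GP log marginal likelihood with kernel `rbf` plus noise variance."""
    n = len(times)
    K = [[rbf(times[i], times[j]) + (noise if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    L = cholesky(K)
    z = []  # solve L z = y by forward substitution, so y^T K^{-1} y = z^T z
    for i in range(n):
        z.append((values[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    quad = sum(v * v for v in z)
    logdet = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    return -0.5 * (quad + logdet + n * math.log(2.0 * math.pi))

def estimate_lag(t1, y1, t2, y2, candidate_lags):
    """Return the candidate lag under which both series best fit one shared GP."""
    def score(lag):
        # Under `lag`, series 2 at time t is the shared function evaluated at t - lag.
        return log_marginal_likelihood(t1 + [t - lag for t in t2], y1 + y2)
    return max(candidate_lags, key=score)

# Toy data: irregular time points, series 2 lags series 1 by one time unit.
true_lag = 1.0
t1 = [0.0, 0.7, 1.1, 2.0, 2.6, 3.9, 4.3, 5.0]
t2 = [0.3, 1.4, 1.9, 2.8, 3.5, 4.1, 5.2, 5.8]
y1 = [math.sin(t) for t in t1]
y2 = [math.sin(t - true_lag) for t in t2]

lags = [0.25 * k for k in range(9)]  # candidate lags 0.0, 0.25, ..., 2.0
best = estimate_lag(t1, y1, t2, y2, lags)
print(best)
```

Because the model works directly on the observation times, no resampling onto a regular grid is needed. A fuller treatment would also learn the kernel hyperparameters and allow the two series to differ by more than a shift.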
The tidyomics ecosystem: Enhancing omic data analyses
The growth of omic data presents evolving challenges in data manipulation, analysis, and integration. Addressing these challenges, Bioconductor provides an extensive community-driven biological data analysis platform. Meanwhile, tidy R programming offers a revolutionary standard for data organisation and manipulation. Here, we present the tidyomics software ecosystem, bridging Bioconductor to the tidy R paradigm. This ecosystem aims to streamline omic analysis, ease learning, and encourage cross-disciplinary collaborations. We demonstrate the effectiveness of tidyomics by analysing 7.5 million peripheral blood mononuclear cells from the Human Cell Atlas, spanning six data frameworks and ten analysis tools. Competing Interest Statement: R.G. has received consulting income from Takeda and Sanofi, and declares ownership in Ozette Technologies. M.K. is an employee of and declares ownership in Achilles Therapeutics. The remaining authors declare no competing interests.
Airpart: interpretable statistical models for analyzing allelic imbalance in single-cell datasets.
MOTIVATION: Allelic expression analysis aids in detection of cis-regulatory mechanisms of genetic variation, which produce allelic imbalance (AI) in heterozygotes. Measuring AI in bulk data lacking time or spatial resolution has the limitation that cell-type-specific (CTS), spatial- or time-dependent AI signals may be dampened or not detected.
RESULTS: We introduce a statistical method airpart for identifying differential CTS AI from single-cell RNA-sequencing data, or dynamics AI from other spatially or time-resolved datasets. airpart outputs discrete partitions of data, pointing to groups of genes and cells under common mechanisms of cis-genetic regulation. In order to account for low counts in single-cell data, our method uses a Generalized Fused Lasso with Binomial likelihood for partitioning groups of cells by AI signal, and a hierarchical Bayesian model for AI statistical inference. In simulation, airpart accurately detected partitions of cell types by their AI and had lower Root Mean Square Error (RMSE) of allelic ratio estimates than existing methods. In real data, airpart identified differential allelic imbalance patterns across cell states and could be used to define trends of AI signal over spatial or time axes.
AVAILABILITY AND IMPLEMENTATION: The airpart package is available as an R/Bioconductor package at https://bioconductor.org/packages/airpart.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
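As a rough illustration of the partitioning step described in the RESULTS above, a generalized fused lasso with a binomial likelihood can be written in the following generic form. The notation is assumed here and is not taken from the airpart paper: $a_i$ is the number of reads from one allele in cell $i$, $n_i$ the total allelic read count, $c(i)$ the cell type of cell $i$, $\beta_k$ the logit allelic ratio of cell type $k$, and $\lambda \ge 0$ the penalty weight.

```latex
\hat{\beta} \;=\; \arg\min_{\beta}\;
  -\sum_{i=1}^{N} \Big[\, a_i \log p_i + (n_i - a_i) \log (1 - p_i) \Big]
  \;+\; \lambda \sum_{j < k} \lvert \beta_j - \beta_k \rvert,
\qquad \operatorname{logit}(p_i) = \beta_{c(i)} .
```

Cell types whose coefficients are fused to a common value by the penalty fall into the same partition, and larger values of $\lambda$ give fewer, coarser groups.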
Author Correction: Community-wide hackathons to identify central themes in single-cell multi-omics
Medicine, Faculty of; Non UBC; Pathology and Laboratory Medicine, Department of; Reviewed; Faculty; Researcher