6 research outputs found

STATISTICAL LEARNING METHODS FOR UNCOVERING GENE REGULATION MECHANISMS

Author: Mu Wancen
Publication venue: University of North Carolina at Chapel Hill Graduate School
Publication date: 01/01/2024

Gene regulation is a complex process that controls gene product levels through factors such as transcription factors, epigenetic modifications, RNA, and proteins (Mack and Nachman, 2017). This mechanism is pivotal in biological processes, and its disruption can lead to disease. Understanding it is crucial for gene therapy. This proposal aims to develop statistical techniques for unraveling gene regulation, focusing on cis-regulatory elements (CREs). Our first project studies allelic expression (AE) to detect genes influenced by local genetic variation in CREs. We introduce airpart, a model for allelic imbalance (AI) analysis in single-cell and temporal datasets. airpart features (i) a Generalized Fused Lasso with Binomial likelihood to partition cells by AI signal, ensuring interpretability; and (ii) a hierarchical Bayesian model for hypothesis testing of AI presence within each cell state and of differential AI (DAI) across cell states. Simulation and real data analyses show airpart’s accuracy in detecting cell type partitions, reducing RMSE in allelic ratio estimates, and outperforming existing methods. Our second project concerns enrichment analysis, which assesses whether gene sets represent biological functions, pathways, or processes. To generate null hypotheses for such tests, we introduce bootRanges, fast functions producing block bootstrapped genomic ranges. We demonstrate that conventional shuffling or permutation methods often yield overly narrow null distributions of the test statistic, inflating statistical significance. Block bootstrap, however, preserves local genomic correlations and provides reliable null distributions. Real data analyses show its applicability across various test statistics. In our third project, we aim to link CREs to genes using multi-omics time series data. We predict enhancer-promoter pairs from candidate pairs by analyzing correlations between enhancer activity and gene expression over time. We propose GPlag, a Gaussian process-based model that accommodates time-lagged and irregularly sampled time series. Predictions are validated using high-throughput chromosome conformation capture (Hi-C) and expression quantitative trait loci (eQTL) datasets. Advancing our understanding of gene regulation mechanisms and developing new statistical tools contribute to gene therapy and genetic control research.
Doctor of Philosophy
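The contrast between shuffling and block bootstrapping described in this abstract can be illustrated with a minimal Python sketch. It is not the bootRanges implementation (bootRanges is an R/Bioconductor tool that operates on genomic ranges); the one-dimensional autocorrelated signal, the helper names `block_bootstrap` and `lag1_autocorr`, and the block length of 50 are all assumptions made for illustration.

```python
import numpy as np

def block_bootstrap(x, block_len, rng):
    """Resample x by concatenating randomly chosen contiguous blocks."""
    n = len(x)
    n_blocks = -(-n // block_len)  # ceiling division
    # Draw block start positions with replacement.
    starts = rng.integers(0, n - block_len + 1, size=n_blocks)
    pieces = [x[s:s + block_len] for s in starts]
    return np.concatenate(pieces)[:n]

def lag1_autocorr(x):
    """Lag-1 autocorrelation, used here as a proxy for local correlation."""
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))

# Toy stand-in for a locally correlated genomic feature track:
# an AR(1) series with strong dependence between neighbours.
rng = np.random.default_rng(0)
n, phi = 2000, 0.9
signal = np.empty(n)
signal[0] = rng.normal()
for i in range(1, n):
    signal[i] = phi * signal[i - 1] + rng.normal()

shuffled = rng.permutation(signal)                 # destroys local structure
resampled = block_bootstrap(signal, block_len=50, rng=rng)  # keeps it within blocks

print("original  :", round(lag1_autocorr(signal), 2))
print("shuffled  :", round(lag1_autocorr(shuffled), 2))
print("block boot:", round(lag1_autocorr(resampled), 2))
```

In this toy setting the shuffled series loses almost all neighbour-to-neighbour correlation while the block-bootstrapped series retains most of it, which is the property the abstract credits for producing null distributions that are not overly narrow.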

Carolina Digital Repository

On the Identifiability and Interpretability of Gaussian Process Models

Author: Chen Jiawen
Li Didong
Li Yun
Mu Wancen
Publication date: 25/10/2023

In this paper, we critically examine the prevalent practice of using additive mixtures of Matérn kernels in single-output Gaussian process (GP) models and explore the properties of multiplicative mixtures of Matérn kernels for multi-output GP models. For the single-output case, we derive a series of theoretical results showing that the smoothness of a mixture of Matérn kernels is determined by the least smooth component and that a GP with such a kernel is effectively equivalent to the least smooth kernel component. Furthermore, we demonstrate that none of the mixing weights or parameters within individual kernel components are identifiable. We then turn our attention to multi-output GP models and analyze the identifiability of the covariance matrix $A$ in the multiplicative kernel $K(x,y) = AK_0(x,y)$, where $K_0$ is a standard single-output kernel such as Matérn. We show that $A$ is identifiable up to a multiplicative constant, suggesting that multiplicative mixtures are well suited for multi-output tasks. Our findings are supported by extensive simulations and real applications for both single- and multi-output settings. This work provides insight into kernel selection and interpretation for GP models, emphasizing the importance of choosing appropriate kernel structures for different tasks.
Comment: 37th Conference on Neural Information Processing Systems (NeurIPS 2023)
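One way to see why a multiplicative constant cannot be pinned down in a kernel of the form K(x,y) = A K_0(x,y) is the following Python sketch. It shows only the elementary direction: scaling A by c and the variance of K_0 by 1/c leaves the joint covariance unchanged. The Matérn 3/2 form, the helper name `matern32`, the example matrix `A`, and the Kronecker-product layout of the multi-output covariance are assumptions chosen for illustration, not the paper's construction or proof.

```python
import numpy as np

def matern32(x, y, variance=1.0, lengthscale=1.0):
    """Matern kernel with smoothness 3/2 on one-dimensional inputs."""
    r = np.abs(x[:, None] - y[None, :]) / lengthscale
    return variance * (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)

x = np.linspace(0.0, 1.0, 5)            # shared input locations
A = np.array([[2.0, 0.6],
              [0.6, 1.0]])              # hypothetical 2x2 output covariance
c = 4.0                                 # arbitrary positive constant

# Joint covariance of two outputs observed at the same inputs.
cov_a = np.kron(A, matern32(x, x, variance=1.0))
# Move the constant from the base kernel into A.
cov_b = np.kron(c * A, matern32(x, x, variance=1.0 / c))

same = np.allclose(cov_a, cov_b)
print("identical joint covariance:", same)
```

Because the two parameterisations give the same covariance, no amount of data can distinguish A from cA unless the scale of K_0 is fixed by convention.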

arXiv.org e-Print Archive

Gaussian Processes for Time Series with Lead-Lag Effects with applications to biology data

Author: Davis Eric S.
Li Didong
Love Michael I.
Mu Wancen
Phanstiel Douglas
Reed Kathleen
Publication date: 14/01/2024

Investigating the relationship between time series, particularly the lead-lag effect, is a common question across various disciplines, especially when uncovering biological processes. However, analyzing time series presents several challenges. Firstly, for technical reasons, observations are often not made at uniform intervals. Secondly, some lead-lag effects are transient, necessitating time-lag estimation based on a limited number of time points. Thirdly, external factors also impact these time series, requiring a similarity metric to assess the lead-lag relationship. To counter these issues, we introduce a model grounded in the Gaussian process, affording the flexibility to estimate lead-lag effects for irregular time series. In addition, our method outputs dissimilarity scores, thereby broadening its applications to include tasks such as ranking or clustering multiple pairwise time series by the strength of their lead-lag effects in the presence of external factors. Crucially, we offer a series of theoretical proofs to substantiate the validity of our proposed kernels and the identifiability of kernel parameters. Our model demonstrates advances over other leading methods in various simulations and real-world applications, particularly in the study of dynamic chromatin interactions.
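The idea of estimating a time lag between irregularly sampled series with a Gaussian process can be sketched in Python as follows. This is not the kernel or estimator proposed in the paper: it is a simplified stand-in that shifts the second series by each candidate lag, pools both series under a single squared-exponential GP with fixed hyperparameters, and keeps the lag with the highest log marginal likelihood. The function names, the fixed lengthscale and noise variance, the sinusoidal test signal, and the candidate grid are all assumptions.

```python
import numpy as np

def rbf(t1, t2, lengthscale=1.0):
    """Squared-exponential kernel on one-dimensional time points."""
    d = t1[:, None] - t2[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def log_marginal_likelihood(t, y, noise_var=0.01):
    """Log marginal likelihood of a zero-mean GP with fixed hyperparameters."""
    K = rbf(t, t) + noise_var * np.eye(len(t))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(K, y)
    # -0.5 * log|K| equals -sum(log(diag(L))) for the Cholesky factor L.
    return float(-0.5 * y @ alpha - np.log(np.diag(L)).sum()
                 - 0.5 * len(t) * np.log(2.0 * np.pi))

def estimate_lag(t_a, y_a, t_b, y_b, candidate_lags):
    """Pick the shift of series b that best aligns it with series a."""
    scores = []
    for lag in candidate_lags:
        t = np.concatenate([t_a, t_b - lag])   # undo the candidate lag
        y = np.concatenate([y_a, y_b])
        scores.append(log_marginal_likelihood(t, y))
    return float(candidate_lags[int(np.argmax(scores))])

# Two series observed at different, irregular time points;
# series b follows series a with a true lag of 1.5 time units.
rng = np.random.default_rng(1)
true_lag = 1.5
t_a = np.sort(rng.uniform(0.0, 10.0, 25))
t_b = np.sort(rng.uniform(0.0, 10.0, 25))
y_a = np.sin(t_a) + rng.normal(0.0, 0.1, 25)
y_b = np.sin(t_b - true_lag) + rng.normal(0.0, 0.1, 25)

lag_hat = estimate_lag(t_a, y_a, t_b, y_b, np.linspace(0.0, 3.0, 61))
print("estimated lag:", lag_hat)
```

Because the GP is defined on continuous time, the two series never need to share a sampling grid, which is the property the abstract highlights for irregular observations. The candidate grid is kept shorter than the period of the test signal so that the best-scoring lag is unambiguous.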

arXiv.org e-Print Archive

The tidyomics ecosystem: Enhancing omic data analyses

The growth of omic data presents evolving challenges in data manipulation, analysis, and integration. Addressing these challenges, Bioconductor [1] provides an extensive community-driven biological data analysis platform. Meanwhile, tidy R programming [2] offers a revolutionary standard for data organisation and manipulation. Here, we present the tidyomics software ecosystem, bridging Bioconductor to the tidy R paradigm. This ecosystem aims to streamline omic analysis, ease learning, and encourage cross-disciplinary collaborations. We demonstrate the effectiveness of tidyomics by analysing 7.5 million peripheral blood mononuclear cells from the Human Cell Atlas [3], spanning six data frameworks and ten analysis tools.

Competing Interest Statement: R.G. has received consulting income from Takeda and Sanofi, and declares ownership in Ozette Technologies. M.K. is an employee of and declares ownership in Achilles Therapeutics. The remaining authors declare no competing interests.

Queensland DAF eResearch Archive

Airpart: interpretable statistical models for analyzing allelic imbalance in single-cell datasets.

Author: Choi Kwangbom
Love Michael I
Mu Wancen
Patro Rob
Sarkar Hirak
Srivastava Avi
Publication venue: Oxford University Press (OUP)
Publication date: 06/04/2022

MOTIVATION: Allelic expression analysis aids in the detection of cis-regulatory mechanisms of genetic variation, which produce allelic imbalance (AI) in heterozygotes. Measuring AI in bulk data lacking time or spatial resolution has the limitation that cell-type-specific (CTS), spatial- or time-dependent AI signals may be dampened or not detected. RESULTS: We introduce a statistical method, airpart, for identifying differential CTS AI from single-cell RNA-sequencing data, or dynamic AI from other spatially or time-resolved datasets. airpart outputs discrete partitions of data, pointing to groups of genes and cells under common mechanisms of cis-genetic regulation. To account for low counts in single-cell data, our method uses a Generalized Fused Lasso with Binomial likelihood for partitioning groups of cells by AI signal, and a hierarchical Bayesian model for AI statistical inference. In simulation, airpart accurately detected partitions of cell types by their AI and had lower Root Mean Square Error (RMSE) of allelic ratio estimates than existing methods. In real data, airpart identified differential allelic imbalance patterns across cell states and could be used to define trends of AI signal over spatial or time axes. AVAILABILITY AND IMPLEMENTATION: The airpart package is available as an R/Bioconductor package at https://bioconductor.org/packages/airpart. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

The Jackson Laboratory: The Mouseion at the JAXlibrary

PubMed Central

Author Correction: Community-wide hackathons to identify central themes in single-cell multi-omics

Author: Abadi Al J.
Argelaguet Ricard
Arora Arshi
Cao Kim-Anh L.
Carey Vincent J.
Coullomb Alexis
Culhane Aedin C.
Davis-Marcisak Emily F.
Deshpande Atul
Dries Ruben
Feng Yuzhou
Fertig Elana
Greene Casey S.
Holmes Susan
Hsu Lauren
Jeganathan Pratheepa
Loth Melanie
Love Michael I.
Meng Chen
Mu Wancen
Pancaldi Vera
Righelli Dario
Ritchie Matthew E.
Sankaran Kris
Singh Amrit
Sodicoff Joshua S.
Stein-O’Brien Genevieve L.
Subramanian Ayshwarya
Welch Joshua D.
You Yue
Yuan Guo-Cheng
Publication venue: BioMed Central
Publication date: 01/08/2021

Medicine, Faculty of; Non UBC; Pathology and Laboratory Medicine, Department of; Reviewed; Faculty; Researcher

Directory of Open Access Journals

University of British Columbia: cIRcle - UBC's Information Repository

University of Melbourne Institutional Repository

Deep Blue Documents