172 research outputs found
Recommended from our members
Statistical methods for the integrative analysis of single-cell multi-omics data
Single-cell profiling techniques have provided an unprecedented opportunity to study cellular heterogeneity at the molecular level. This represents a remarkable advance over traditional bulk sequencing methods, particularly to study lineage diversification and cell fate commitment events in heterogeneous biological processes. While the large majority of single-cell studies are focused on quantifying RNA expression, transcriptomic readouts provide only a single dimension of cellular heterogeneity. Recently, technological advances have enabled multiple biological layers to be probed in parallel one cell at a time, unveiling a powerful approach for investigating multiple dimensions of cellular heterogeneity. However, the increasing availability of multi-modal data sets needs to be accompanied by the development of suitable integrative strategies to fully exploit the data generated. In this thesis I worked in collaboration with different research groups to introduce innovative experimental and computational strategies for the integrative study of multi-omics at single-cell resolution.
The first contribution is the development of scNMT-seq, a protocol for the simultaneous profiling of RNA expression, DNA methylation and chromatin accessibility in single cells. I demonstrate how this assay provides a powerful approach for investigating regulatory relationships between the epigenome and the transcriptome within individual cells.
The second contribution is Multi-Omics Factor Analysis (MOFA), a statistical framework for the unsupervised integration of multi-omics data sets. MOFA is a Bayesian latent variable model that can be viewed as a statistically rigorous generalization of Principal Component Analysis to multi-omics data. The method provides a principled approach to retrieve, in an unsupervised manner, the underlying sources of sample heterogeneity while at the same time disentangling which axes of heterogeneity are shared across multiple modalities and which are specific to individual data modalities.
The third contribution is the generation of a comprehensive molecular roadmap of mouse gastrulation at single-cell resolution. We employed scNMT-seq to simultaneously profile RNA expression, DNA methylation and chromatin accessibility for hundreds of cells, spanning multiple time points from the exit from pluripotency to primary germ layer specification. Using MOFA, and other tools, I performed an integrative analysis of the multi-modal measurements, revealing novel insights into the role of the epigenome in regulating this key developmental process.
The fourth contribution is an extended formulation of the MOFA model tailored to the analysis of large-scale single-cell data with complex experimental designs. I extended the model to incorporate a flexible regularisation that enables the joint analysis of multiple omics as well as multiple sample groups (batches and/or experimental conditions). In addition, I implemented a GPU-accelerated stochastic variational inference framework, thus enabling the scalable analysis of potentially millions of samples
Enhancing Missing Data Imputation of Non-stationary Signals with Harmonic Decomposition
Dealing with time series with missing values, including those afflicted by
low quality or over-saturation, presents a significant signal processing
challenge. The task of recovering these missing values, known as imputation,
has led to the development of several algorithms. However, we have observed
that the efficacy of these algorithms tends to diminish when the time series
exhibit non-stationary oscillatory behavior. In this paper, we introduce a
novel algorithm, coined Harmonic Level Interpolation (HaLI), which enhances the
performance of existing imputation algorithms for oscillatory time series.
After running any chosen imputation algorithm, HaLI leverages the harmonic
decomposition based on the adaptive nonharmonic model of the initial imputation
to improve the imputation accuracy for oscillatory time series. Experimental
assessments conducted on synthetic and real signals consistently highlight that
HaLI enhances the performance of existing imputation algorithms. The algorithm
is made publicly available as a readily employable Matlab code for other
researchers to use
Novel statistical approaches for missing values in truncated high-dimensional metabolomics data with a detection threshold.
Despite considerable advances in high throughput technology over the last decade, new challenges have emerged related to the analysis, interpretation, and integration of high-dimensional data. The arrival of omics datasets has contributed to the rapid improvement of systems biology, which seeks the understanding of complex biological systems. Metabolomics is an emerging omics field, where mass spectrometry technologies generate high dimensional datasets. As advances in this area are progressing, the need for better analysis methods to provide correct and adequate results are required. While in other omics sectors such as genomics or proteomics there has and continues to be critical understanding and concern in developing appropriate methods to handle missing values, handling of missing values in metabolomics has been an undervalued step. Missing data are a common issue in all types of medical research and handling missing data has always been a challenge. Since many downstream analyses such as classification methods, clustering methods, and dimension reduction methods require complete datasets, imputation of missing data is a critical and crucial step. The standard approach used is to remove features with one or more missing values or to substitute them with a value such as mean or half minimum substitution. One of the major issues from the missing data in metabolomics is due to a limit of detection, and thus sophisticated methods are needed to incorporate different origins of missingness. This dissertation contributes to the knowledge of missing value imputation methods with three separate but related research projects. The first project consists of a novel missing value imputation method based on a modification of the k nearest neighbor method which accounts for truncation at the minimum value/limit of detection. The approach assumes that the data follows a truncated normal distribution with the truncation point at the detection limit. The aim of the second project arises from the limitation in the first project. While the novel approach is useful, estimation of the truncated mean and standard deviation is problematic in small sample sizes (N \u3c 10). In this project, we develop a Bayesian model for imputing missing values with small sample sizes. The Bayesian paradigm has generally been utilized in the omics field as it exploits the data accessible from related components to acquire data to stabilize parameter estimation. The third project is based on the motivation to determine the impact of missing value imputation on down-stream analyses and whether ranking of imputation methods correlates well with the biological implications of the imputation
Recommended from our members
Statistical Methods for the Analysis of Contextual Gene Expression Data
Technological advances have enabled profiling gene expression variability, both at the RNA and the protein level, with ever increasing throughput. In addition, miniaturisation has enabled quantifying gene expression from small volumes of the input material and most recently at the level of single cells. Increasingly these technologies also preserve context information, such as assaying tissues with high spatial resolution. A second example of contextual information is multi-omics protocols, for example to assay gene expression and DNA methylation from the same cells or samples.
Although such contextual gene expression datasets are increasingly available for both popu- lation and single-cell variation studies, methods for their analysis are not established. In this thesis, we propose two modelling approaches for the analysis of gene expression variation in specific biological contexts.
The first contribution of this thesis is a statistical method for analysing single cell expression data in a spatial context. Our method identifies the sources of gene expression variability by decomposing it into different components, each attributable to a different source. These sources include aspects of spatial variation such as cell-cell interactions. In applications to data across different technologies, we show that cell-cell interactions are indeed a major determinant of the expression level of specific genes with a relevant link to their function.
The second contribution is a latent variable model for the unsupervised analysis of gene expression data, while accounting for structured prior knowledge on experimental context. The proposed method enables the joint analysis of gene expression data and other omics data profiled in the same samples, and the model can be used to account for the grouping structure of samples, e.g. samples from individuals with different clinical covariates or from distinct experimental batches. Our model constitutes a principled framework to compare the molecular identities of these distinct groups.Fellowship from the EMBL international PhD progra
Bayesian Model-based Methods for the Analysis of DNA Microarrays with Survival, Genetic, and Sequence Data
DNA microarrays measure the expression of thousands of genes or DNA fragments simultaneously in which probes have specific complementary hybridization. Gene expression or microarray data analysis problems have a prominent role in the biostatistics, biological sciences, and clinical medicine. The first paper proposes a method for finding associations between the survival time of the subjects and the gene expression of tumor microarrays. Measurement error is known to bias the estimates for survival regression coefficients, and this method minimizes bias. The latent variable model is shown to detect associations between potentially important genes and survival in a breast cancer dataset that conventional models did not detect, and the method is demonstrated to have robustness to misspecification with simulated data. The second paper considers the Expression Quantitative Trait Loci (eQTL) detection problem. An eQTL is a genetic locus that influences gene expression, and the major challenges with this type of data are multiple testing and computational issues. The proposed method extends the Mixture Over Marker (MOM) model to include a structured prior probability that accounts for the transcript location. The new technique exploits the fact that genetic markers are more likely to influence transcripts that share the same location on the genome. The third paper improves the analysis of Chromatin (Ch)-Immunoprecipitation (IP) (ChIP) microarray data. ChIP-chip data analysis estimates the motif of specific Transcription Factor Binding Sites (TFBSs) by comparing the IP DNA sample that is enriched for the TFBS and a control sample of general genomic DNA. The probes on the ChIP-chip array are uniformly spaced on the genome, and the probes that have relatively high intensity in the IP sample will have corresponding sequences that are likely to contain the TFBS motif. Present analytical methods use the array data to discover peaks or regions of IP enrichment then analyze the sequences of these peaks in a separate procedure to discover the motif. The proposed model will integrate enrichment peak finding and motif discovery through a Hidden Markov Model (HMM). Performance comparisons are made between the proposed HMM and the previously developed methods
Non-linear dimensionality reduction of signaling networks
<p>Abstract</p> <p>Background</p> <p>Systems wide modeling and analysis of signaling networks is essential for understanding complex cellular behaviors, such as the biphasic responses to different combinations of cytokines and growth factors. For example, tumor necrosis factor (TNF) can act as a proapoptotic or prosurvival factor depending on its concentration, the current state of signaling network and the presence of other cytokines. To understand combinatorial regulation in such systems, new computational approaches are required that can take into account non-linear interactions in signaling networks and provide tools for clustering, visualization and predictive modeling.</p> <p>Results</p> <p>Here we extended and applied an unsupervised non-linear dimensionality reduction approach, Isomap, to find clusters of similar treatment conditions in two cell signaling networks: (I) apoptosis signaling network in human epithelial cancer cells treated with different combinations of TNF, epidermal growth factor (EGF) and insulin and (II) combination of signal transduction pathways stimulated by 21 different ligands based on AfCS double ligand screen data. For the analysis of the apoptosis signaling network we used the Cytokine compendium dataset where activity and concentration of 19 intracellular signaling molecules were measured to characterise apoptotic response to TNF, EGF and insulin. By projecting the original 19-dimensional space of intracellular signals into a low-dimensional space, Isomap was able to reconstruct clusters corresponding to different cytokine treatments that were identified with graph-based clustering. In comparison, Principal Component Analysis (PCA) and Partial Least Squares – Discriminant analysis (PLS-DA) were unable to find biologically meaningful clusters. We also showed that by using Isomap components for supervised classification with k-nearest neighbor (k-NN) and quadratic discriminant analysis (QDA), apoptosis intensity can be predicted for different combinations of TNF, EGF and insulin. Prediction accuracy was highest when early activation time points in the apoptosis signaling network were used to predict apoptosis rates at later time points. Extended Isomap also outperformed PCA on the AfCS double ligand screen data. Isomap identified more functionally coherent clusters than PCA and captured more information in the first two-components. The Isomap projection performs slightly worse when more signaling networks are analyzed; suggesting that the mapping function between cues and responses becomes increasingly non-linear when large signaling pathways are considered.</p> <p>Conclusion</p> <p>We developed and applied extended Isomap approach for the analysis of cell signaling networks. Potential biological applications of this method include characterization, visualization and clustering of different treatment conditions (i.e. low and high doses of TNF) in terms of changes in intracellular signaling they induce.</p
- …