21 research outputs found

    Single-cell Analysis from the perspective of how to Interact, Identify and Integrate cells

    Single-cell technologies have emerged as powerful tools to analyze complex tissues at the single-cell resolution, resolving the cellular heterogeneity within a tissue through the discovery of different cell populations. Over the past decade, single-cell technologies have greatly developed allowing the profiling of various molecular features including genomics, transcriptomics and proteomics. These high-throughput technologies produce datasets containing thousands to millions of cells in a single experiment. These large high-dimensional datasets impose several challenges to the data analysis. These challenges can be divided into three categories: interaction, identification and integration. Interaction refers to the visual exploration and interactive analysis of the data, identification refers to the definition of the identity of each single-cell, while integration deals with the combination of different molecular information from different datasets. In this thesis, we introduced several computational methods, addressing these three challenges, to eventually improve the analysis of single-cell data. Regarding the interaction, we focused on developing scalable methods that can analyze datasets having millions of cells and thousands of features within workable time frames. We improved the scalability of both clustering and visualization of single-cell data by summarizing the data using a hierarchical representation. To improve the identification of cells, we make use of the large number of annotated datasets available nowadays, and identify cell populations present in a single-cell dataset using classification methods instead of clustering the data. These classification methods can be trained using the previously annotated datasets. We benchmarked a large number of different classification methods and based on this analysis propose to use simple linear classifiers since they have better performance and scale better to larger datasets. 
We applied this linear classification to single-cell mass cytometry data to automatically identify cell populations when comparing two cohorts of colorectal cancer patients. To integrate single-cell multi-omics data, we focused on extending the number of measured features to overcome current technological limitations. For single-cell mass cytometry, we integrated different panels measured from the same biological sample, resulting in an extended number of protein markers per cell. Downstream analysis of this data revealed new cell subpopulations showing a more fine-grained cellular heterogeneity. We also extended spatial single-cell transcriptomic data by integrating it with scRNA-seq data that lacks the spatial localization of the cells. Our proposed integration generates whole-transcriptome spatial data, which makes it possible to predict (in silico) spatial expression patterns of genes that are not originally measured in the spatial data. Taken together, this thesis presents several computational methods that aid and improve single-cell data analysis, increasing our insight into molecular heterogeneity.
    Pattern Recognition and Bioinformatic

    An in-depth comparison of linear and non-linear joint embedding methods for bulk and single-cell multi-omics

    Multi-omic analyses are necessary to understand the complex biological processes taking place at the tissue and cell level, but also to make reliable predictions about, for example, disease outcome. Several linear methods exist that create a joint embedding using paired information per sample, but recently there has been a rise in the popularity of neural architectures that embed paired -omics into the same non-linear manifold. This work describes a head-to-head comparison of linear and non-linear joint embedding methods using both bulk and single-cell multi-modal datasets. We found that non-linear methods have a clear advantage over linear ones for missing modality imputation. Performance comparisons in the downstream tasks of survival analysis for bulk tumor data and cell type classification for single-cell data lead to the following insights. First, concatenating the principal components of each modality is a competitive baseline and hard to beat if all modalities are available at test time. However, if only one modality is available at test time, training a predictive model on the joint space of that modality can lead to performance improvements with respect to just using the unimodal principal components. Second, -omic profiles imputed by neural joint embedding methods are realistic enough to be used by a classifier trained on real data with limited performance drops. Taken together, our comparisons give hints as to which joint embedding to use for which downstream task. Overall, product-of-experts performed well in most tasks and was reasonably fast, while early integration (concatenation) of modalities did quite poorly.
    Pattern Recognition and Bioinformatic

    scMoC: single-cell multi-omics clustering

    Motivation: Single-cell multi-omics assays simultaneously measure different molecular features from the same cell. A key question is how to benefit from the complementary data available and perform cross-modal clustering of cells. Results: We propose Single-Cell Multi-omics Clustering (scMoC), an approach to identify cell clusters from data with co-measurements of scRNA-seq and scATAC-seq from the same cell. We overcome the high sparsity of the scATAC-seq data by using an imputation strategy that exploits the less-sparse scRNA-seq data available from the same cell. Subsequently, scMoC identifies clusters of cells by merging clusterings derived from both data domains individually. We tested scMoC on datasets generated using different protocols with variable data sparsity levels. We show that scMoC (i) is able to generate informative scATAC-seq data due to its RNA-guided imputation strategy and (ii) results in integrated clusters based on both RNA and ATAC information that are biologically meaningful either from the RNA or from the ATAC perspective. Availability and implementation: The data used in this manuscript are publicly available, and we refer to the original manuscripts for their description and availability. For convenience, sci-CAR data is available at NCBI GEO under accession number GSE117089. SNARE-seq data is available at NCBI GEO under accession number GSE126074. The 10X multiome data is available at the following link: https://www.10xgenomics.com/resources/datasets/pbmc-from-a-healthy-donor-no-cell-sorting-3-k-1-standard-2-0-0.
    Pattern Recognition and Bioinformatic
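The abstract above does not state how the RNA-derived and ATAC-derived clusterings are merged. As a rough illustration only, the sketch below uses one simple rule that is an assumption, not the scMoC rule: an RNA cluster is split by ATAC label only where the resulting sub-group is large enough. The function name `merge_clusterings` and the parameter `min_cells` are hypothetical.

```python
from collections import Counter


def merge_clusterings(rna_labels, atac_labels, min_cells=2):
    """Combine two per-cell clusterings into a single labelling.

    Assumed rule (not taken from the paper): a cell gets the label
    "<rna>.<atac>" when at least `min_cells` cells share that pair;
    otherwise it keeps its plain RNA label.
    """
    if len(rna_labels) != len(atac_labels):
        raise ValueError("both clusterings must label the same cells")
    # Count how many cells fall into each (RNA cluster, ATAC cluster) pair.
    pair_sizes = Counter(zip(rna_labels, atac_labels))
    merged = []
    for rna, atac in zip(rna_labels, atac_labels):
        if pair_sizes[(rna, atac)] >= min_cells:
            merged.append(f"{rna}.{atac}")
        else:
            merged.append(str(rna))
    return merged


# Toy example: seven cells, two RNA clusters, two ATAC clusters.
rna = ["A", "A", "A", "A", "B", "B", "B"]
atac = [0, 0, 1, 1, 0, 0, 1]
merged = merge_clusterings(rna, atac)
```

In the toy input, the single cell in RNA cluster B with ATAC label 1 is too small a group to form its own cluster, so it keeps the label B.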

    SpaGE: Spatial Gene Enhancement using scRNA-seq

    Single-cell technologies are emerging fast due to their ability to unravel the heterogeneity of biological systems. While scRNA-seq is a powerful tool that measures whole-transcriptome expression of single cells, it lacks their spatial localization. Novel spatial transcriptomics methods do retain cells' spatial information, but some methods can only measure tens to hundreds of transcripts. To resolve this discrepancy, we developed SpaGE, a method that integrates spatial and scRNA-seq datasets to predict whole-transcriptome expression in its spatial configuration. Using five dataset pairs, SpaGE outperformed previously published methods and showed scalability to large datasets. Moreover, SpaGE predicted new spatial gene patterns that are confirmed independently using in situ hybridization data from the Allen Mouse Brain Atlas.
    Pattern Recognition and Bioinformatic

    scTopoGAN: unsupervised manifold alignment of single-cell data

    Motivation: Single-cell technologies allow deep characterization of different molecular aspects of cells. Integrating these modalities provides a comprehensive view of cellular identity. Current integration methods rely on overlapping features or cells to link datasets measuring different modalities, limiting their application to experiments where different molecular layers are profiled in different subsets of cells. Results: We present scTopoGAN, a method for unsupervised manifold alignment of single-cell datasets with non-overlapping cells or features. We use topological autoencoders (topoAE) to obtain latent representations of each modality separately. A topology-guided Generative Adversarial Network then aligns these latent representations into a common space. We show that scTopoGAN outperforms state-of-the-art manifold alignment methods in completely unsupervised settings. Interestingly, the topoAE for individual modalities also preserved the original structure of the data in the low-dimensional representations better than other manifold projection methods. Taken together, we show that the concept of topology preservation might be a powerful tool to align multiple single-modality datasets, unleashing the potential of multi-omic interpretations of cells.
    Pattern Recognition and Bioinformatic

    Predicting Cell Populations in Single Cell Mass Cytometry Data

    Mass cytometry by time-of-flight (CyTOF) is a valuable technology for high-dimensional analysis at the single cell level. Identification of different cell populations is an important task during the data analysis. Many clustering tools can perform this task, which is essential to identify “new” cell populations in explorative experiments. However, relying on clustering is laborious since it often involves manual annotation, which significantly limits the reproducibility of identifying cell-populations across different samples. The latter is particularly important in studies comparing different conditions, for example in cohort studies. Learning cell populations from an annotated set of cells solves these problems. However, currently available methods for automatic cell population identification are either complex, dependent on prior biological knowledge about the populations during the learning process, or can only identify canonical cell populations. We propose to use a linear discriminant analysis (LDA) classifier to automatically identify cell populations in CyTOF data. LDA outperforms two state-of-the-art algorithms on four benchmark datasets. Compared to more complex classifiers, LDA has substantial advantages with respect to the interpretable performance, reproducibility, and scalability to larger datasets with deeper annotations. We apply LDA to a dataset of ~3.5 million cells representing 57 cell populations in the Human Mucosal Immune System. LDA has high performance on abundant cell populations as well as the majority of rare cell populations, and provides accurate estimates of cell population frequencies. Further incorporating a rejection option, based on the estimated posterior probabilities, allows LDA to identify previously unknown (new) cell populations that were not encountered during training. 
Altogether, reproducible prediction of cell population compositions using LDA opens up possibilities to analyze large cohort studies based on CyTOF data.
    Pattern Recognition and Bioinformatics
    Comp Graphics & Visualisatio
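The rejection option described above can be sketched as a threshold on the largest posterior probability. The sketch assumes the posteriors are already available, for example from the `predict_proba` method of scikit-learn's `LinearDiscriminantAnalysis`. The function name `predict_with_rejection`, the `"unknown"` label and the threshold of 0.7 are illustrative assumptions, not values from the paper.

```python
def predict_with_rejection(posteriors, labels, threshold=0.7):
    """Assign each cell its most probable population, or "unknown".

    posteriors: one row per cell, one posterior probability per population.
    labels:     population names, in the same order as the row entries.
    threshold:  assumed cut-off; cells whose best posterior falls below it
                are rejected instead of being forced into a known class.
    """
    predictions = []
    for row in posteriors:
        # Index of the population with the highest posterior probability.
        best = max(range(len(row)), key=row.__getitem__)
        if row[best] >= threshold:
            predictions.append(labels[best])
        else:
            predictions.append("unknown")
    return predictions


# Toy posteriors for three cells over three populations.
posteriors = [
    [0.95, 0.03, 0.02],  # confident call
    [0.40, 0.35, 0.25],  # ambiguous, gets rejected
    [0.10, 0.85, 0.05],  # confident call
]
labels = ["T cell", "B cell", "monocyte"]
calls = predict_with_rejection(posteriors, labels)
```

Cells labelled `"unknown"` are candidates for populations that were absent from the training data. A lower threshold rejects fewer cells but risks more misassignments.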

    Benchmarking variational AutoEncoders on cancer transcriptomics data

    Deep generative models, such as variational autoencoders (VAE), have gained increasing attention in computational biology due to their ability to capture complex data manifolds which subsequently can be used to achieve better performance in downstream tasks, such as cancer type prediction or subtyping of cancer. However, these models are difficult to train due to the large number of hyperparameters that need to be tuned. To get a better understanding of the importance of the different hyperparameters, we examined six different VAE models when trained on TCGA transcriptomics data and evaluated on the downstream tasks of cluster agreement with cancer subtypes and survival analysis. We studied the effect of the latent space dimensionality, learning rate, optimizer, initialization and activation function on the quality of subsequent downstream tasks on the TCGA samples. We found β-TCVAE and DIP-VAE to have a good performance, on average, despite being more sensitive to hyperparameters selection. Based on these experiments, we derived recommendations for selecting the different hyperparameters settings. To ensure generalization, we tested all hyperparameter configurations on the GTEx dataset. We found a significant correlation (ρ = 0.7) between the hyperparameter effects on clustering performance in the TCGA and GTEx datasets. This highlights the robustness and generalizability of our recommendations. In addition, we examined whether the learned latent spaces capture biologically relevant information. Hereto, we measured the correlation and mutual information of the different representations with various data characteristics such as gender, age, days to metastasis, immune infiltration, and mutation signatures. 
We found that for all models the latent factors, in general, do not uniquely correlate with one of the data characteristics nor capture separable information in the latent factors, even for models specifically designed for disentanglement.
    Pattern Recognition and Bioinformatic

    Spondyloarthritis mass cytometry immuno-monitoring: a proof of concept study in the tight-control and treat-to target TiCoSpA trial

    Objective: Mass cytometry (MC) immunoprofiling allows high-parameter phenotyping of immune cells. We set out to investigate the potential of MC immuno-monitoring of axial spondyloarthritis (axSpA) patients enrolled in the Tight Control SpondyloArthritis (TiCoSpA) trial. Methods: Fresh, longitudinal PBMC samples (baseline, 24, and 48 weeks) from 9 early, untreated axSpA patients and 7 HLA-B27+ controls were analyzed using a 35-marker panel. Data were subjected to HSNE dimension reduction and Gaussian mean shift clustering (Cytosplore), followed by Cytofast analysis. A linear discriminant analysis (LDA) classifier, based on the initial HSNE clustering, was applied to the week 24 and 48 samples. Results: Unsupervised analysis yielded a clear separation of baseline patients and controls, including a significant difference in 9 T cell, B cell, and monocyte clusters (cl), indicating disrupted immune homeostasis. A decrease in disease activity (ASDAS score; median 1.7, range 0.6–3.2) from baseline to week 48 matched significant changes over time in five clusters: cl10 CD4 Tnai cells median 4.7 to 0.02%, cl37 CD4 Tem cells median 0.13 to 8.28%, cl8 CD4 Tcm cells median 3.2 to 0.02%, cl39 B cells median 0.12 to 2.56%, and cl5 CD38+ B cells median 2.52 to 0.64% (all p<0.05). Conclusions: Our results showed that a decrease in disease activity in axSpA coincided with normalization of peripheral T- and B-cell frequency abnormalities. This proof-of-concept study shows the value of MC immuno-monitoring in clinical trials and longitudinal studies in axSpA. MC immunophenotyping on a larger, multi-center scale is likely to provide crucial new insights into the effect of anti-inflammatory treatment and thereby the pathogenesis of inflammatory rheumatic diseases.
    Key Points
    • Longitudinal immuno-monitoring of axSpA patients through mass cytometry indicates that normalization of immune cell compartments coincides with a decrease in disease activity.
    • Our proof-of-concept study confirms the value of immuno-monitoring utilizing mass cytometry.
    Pattern Recognition and Bioinformatic

    SpaceWalker enables interactive gradient exploration for spatial transcriptomics data

    In spatial transcriptomics (ST) data, biologically relevant features such as tissue compartments or cell-state transitions are reflected by gene expression gradients. Here, we present SpaceWalker, a visual analytics tool for exploring the local gradient structure of 2D and 3D ST data. The user can be guided by the local intrinsic dimensionality of the high-dimensional data to define seed locations, from which a flood-fill algorithm identifies transcriptomically similar cells on the fly, based on the high-dimensional data topology. In several use cases, we demonstrate that the spatial projection of these flooded cells highlights tissue architectural features and that interactive retrieval of gene expression gradients in the spatial and transcriptomic domains confirms known biology. We also show that SpaceWalker generalizes to several different ST protocols and scales well to large, multi-slice, 3D whole-brain ST data while maintaining real-time interaction performance.
    Pattern Recognition and Bioinformatics
    Computer Graphics and Visualisatio
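A flood fill from a seed cell can be sketched as a breadth-first traversal. The sketch assumes the high-dimensional data topology is represented as a k-nearest-neighbour graph in expression space, which the abstract does not specify. The function name `flood_fill`, the `max_steps` parameter and the toy graph are hypothetical and are not SpaceWalker's actual interface.

```python
from collections import deque


def flood_fill(neighbors, seed, max_steps):
    """Breadth-first flood from `seed` over a neighbour graph.

    neighbors: dict mapping each cell id to the ids of its nearest
               neighbours in expression space (assumed kNN graph).
    max_steps: maximum number of hops the flood may travel.
    Returns a dict mapping each reached cell to its hop distance.
    """
    depth = {seed: 0}
    queue = deque([seed])
    while queue:
        cell = queue.popleft()
        if depth[cell] == max_steps:
            continue  # the flood front stops expanding here
        for nxt in neighbors[cell]:
            if nxt not in depth:
                depth[nxt] = depth[cell] + 1
                queue.append(nxt)
    return depth


# Toy graph of six cells laid out roughly as a chain.
knn = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 4], 4: [3, 5], 5: [4]}
flooded = flood_fill(knn, seed=0, max_steps=2)
```

The hop distance gives an ordering of the flooded cells from the seed outward. Plotting that ordering at the cells' spatial coordinates is one way such a flood could be projected back onto the tissue.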

    CyTOFmerge: integrating mass cytometry data across multiple panels

    Motivation: High-dimensional mass cytometry (CyTOF) allows the simultaneous measurement of multiple cellular markers at single-cell level, providing a comprehensive view of cell compositions. However, the power of CyTOF to explore the full heterogeneity of a biological sample at the single-cell level is currently limited by the number of markers measured simultaneously on a single panel. Results: To extend the number of markers per cell, we propose an in silico method to integrate CyTOF datasets measured using multiple panels that share a set of markers. Additionally, we present an approach to select the most informative markers from an existing CyTOF dataset to be used as a shared marker set between panels. We demonstrate the feasibility of our methods by evaluating the quality of clustering and neighborhood preservation of the integrated dataset, on two public CyTOF datasets. We illustrate that by computationally extending the number of markers we can further untangle the heterogeneity of mass cytometry data, including rare cell-population detection.
    Pattern Recognition and Bioinformatics
    Comp Graphics & Visualisatio
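One way two panels that share markers could be combined is nearest-neighbour imputation in the shared-marker space. The sketch below shows that idea under stated assumptions: the abstract does not describe the integration procedure, so the neighbour averaging, the function name `impute_missing_markers`, the choice of `k` and the marker values are all illustrative.

```python
import math


def impute_missing_markers(panel_a, panel_b, shared, k=2):
    """Extend panel A cells with the markers measured only on panel B.

    panel_a, panel_b: lists of cells, each a dict of marker -> intensity.
    shared:           markers measured on both panels.
    For every panel A cell, the k nearest panel B cells are found using
    Euclidean distance on the shared markers, and their panel-B-only
    markers are averaged (an assumed imputation rule).
    """
    b_only = [m for m in panel_b[0] if m not in shared]
    merged = []
    for cell in panel_a:
        def dist(other):
            return math.sqrt(sum((cell[m] - other[m]) ** 2 for m in shared))

        nearest = sorted(panel_b, key=dist)[:k]
        combined = dict(cell)  # keep the cell's own measurements
        for marker in b_only:
            combined[marker] = sum(n[marker] for n in nearest) / len(nearest)
        merged.append(combined)
    return merged


# Toy data: CD3 and CD19 are shared; CD4 is only on panel A, CD8 only on B.
shared = ["CD3", "CD19"]
panel_a = [{"CD3": 5.0, "CD19": 0.0, "CD4": 4.0}]
panel_b = [
    {"CD3": 5.0, "CD19": 0.0, "CD8": 1.0},
    {"CD3": 4.0, "CD19": 0.0, "CD8": 3.0},
    {"CD3": 0.0, "CD19": 5.0, "CD8": 0.0},
]
merged = impute_missing_markers(panel_a, panel_b, shared)
```

The quality of such an imputation depends on how well the shared markers separate the cell populations, which is why the choice of shared marker set matters.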