Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.
K-Nearest-Neighbors Induced Topological PCA for scRNA Sequence Data Analysis
Single-cell RNA sequencing (scRNA-seq) is widely used to reveal heterogeneity
in cells, which has given us insights into cell-cell communication, cell
differentiation, and differential gene expression. However, analyzing scRNA-seq
data is a challenge due to sparsity and the large number of genes involved.
Therefore, dimensionality reduction and feature selection are important for
removing spurious signals and enhancing downstream analysis. Traditional PCA, a
main workhorse in dimensionality reduction, lacks the ability to capture
geometrical structure information embedded in the data, and previous graph
Laplacian regularizations are limited to analysis at a single scale.
We propose a topological Principal Components Analysis (tPCA) method by the
combination of the persistent Laplacian (PL) technique and L2,1 norm
regularization to address multiscale and multiclass heterogeneity issues in
data. We further introduce a k-Nearest-Neighbor (kNN) persistent Laplacian
technique to improve the robustness of our persistent Laplacian method. The
proposed kNN-PL is a new algebraic topology technique that addresses many
limitations of traditional persistent homology. Rather than inducing
filtration by varying a distance threshold, we introduce kNN-tPCA,
where filtrations are achieved by varying the number of neighbors in a kNN
network at each step, and find that this framework has significant implications
for hyper-parameter tuning. We validate the efficacy of our proposed tPCA and
kNN-tPCA methods on 11 diverse benchmark scRNA-seq datasets, and showcase that
our methods outperform other unsupervised PCA enhancements from the literature,
as well as the popular Uniform Manifold Approximation and Projection (UMAP),
t-Distributed Stochastic Neighbor Embedding (tSNE), and Projection Non-Negative
Matrix Factorization (NMF) methods by significant margins.
Comment: 28 pages, 11 figures
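To make the kNN-induced filtration idea concrete, here is a minimal sketch (function names and parameters are illustrative, not from the paper): instead of sweeping a distance threshold, we sweep the number of neighbors k and inspect the graph-Laplacian spectrum at each step; the near-zero eigenvalues count connected components.

```python
import numpy as np

def knn_adjacency(X, k):
    """Symmetric adjacency matrix of a k-nearest-neighbor graph (Euclidean)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # a point is not its own neighbor
    nbrs = np.argsort(d, axis=1)[:, :k]
    A = np.zeros((len(X), len(X)))
    rows = np.repeat(np.arange(len(X)), k)
    A[rows, nbrs.ravel()] = 1.0
    return np.maximum(A, A.T)              # union symmetrization

def laplacian_spectra_over_knn_filtration(X, ks):
    """Graph-Laplacian eigenvalues as k grows: the kNN analogue of
    sweeping a distance threshold in a Rips-type filtration."""
    spectra = []
    for k in ks:
        A = knn_adjacency(X, k)
        L = np.diag(A.sum(axis=1)) - A
        spectra.append(np.sort(np.linalg.eigvalsh(L)))
    return spectra

# toy data: two well-separated clusters of 10 points each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (10, 2)), rng.normal(5.0, 0.1, (10, 2))])
spectra = laplacian_spectra_over_knn_filtration(X, ks=[2, 5, 9])
# near-zero eigenvalues count connected components; with small k the two
# clusters stay disconnected, so at least two zeros appear
n_components = int((np.abs(spectra[0]) < 1e-8).sum())
print(n_components)
```

Because adding neighbors only adds edges, the Laplacian grows in the positive-semidefinite order, so each eigenvalue is non-decreasing along the filtration; tracking how the small eigenvalues persist across k is the spectral analogue of a persistence diagram.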
Deep Learning in Single-Cell Analysis
Single-cell technologies are revolutionizing the entire field of biology. The
large volumes of data generated by single-cell technologies are
high-dimensional, sparse, heterogeneous, and have complicated dependency
structures, making analyses using conventional machine learning approaches
challenging and impractical. In tackling these challenges, deep learning often
demonstrates superior performance compared to traditional machine learning
methods. In this work, we give a comprehensive survey on deep learning in
single-cell analysis. We first introduce background on single-cell technologies
and their development, as well as fundamental concepts of deep learning
including the most popular deep architectures. We present an overview of the
single-cell analytic pipeline pursued in research applications while noting
divergences due to data sources or specific applications. We then review seven
popular tasks spanning different stages of the single-cell analysis
pipeline, including multimodal integration, imputation, clustering, spatial
domain identification, cell-type deconvolution, cell segmentation, and
cell-type annotation. Under each task, we describe the most recent developments
in classical and deep learning methods and discuss their advantages and
disadvantages. Deep learning tools and benchmark datasets are also summarized
for each task. Finally, we discuss the future directions and the most recent
challenges. This survey will serve as a reference for biologists and computer
scientists, encouraging collaborations.
Comment: 77 pages, 11 figures, 15 tables, deep learning, single-cell analysis
STGIC: a graph and image convolution-based method for spatial transcriptomic clustering
Spatial transcriptomic (ST) clustering uses spatial and transcriptional
information to group spatially coherent, transcriptionally similar spots into
the same spatial domain. Graph convolution networks (GCN) and graph attention
networks (GAT), fed with an adjacency matrix derived from spatial coordinates
and a feature matrix derived from transcription profiles, are often used to solve the
problem. Our proposed method STGIC (spatial transcriptomic clustering with
graph and image convolution) utilizes an adaptive graph convolution (AGC) to
get high-quality pseudo-labels and then applies a dilated convolution
framework (DCF) to a virtual image constructed from the gene expression and
spatial coordinates of spots. The dilation rates and kernel sizes are set
appropriately, and kernel weight updates are constrained by the spatial
distance from each element's position to the kernel center, so that feature
extraction for each spot is better guided by spatial distance to neighboring
spots. Self-supervision realized by KL-divergence,
spatial continuity loss, and cross-entropy calculated over spots with
high-confidence pseudo-labels make up the training objective of DCF. STGIC attains
state-of-the-art (SOTA) clustering performance on the benchmark dataset of
human dorsolateral prefrontal cortex (DLPFC). It is also capable of depicting
fine structures of other tissues from other species, as well as guiding the
identification of marker genes. In addition, STGIC is extendable to Stereo-seq
data with high spatial resolution.
Comment: Major revision has been made to generate the current version as
follows: 1. Writing style has been thoroughly changed. 2. Four more datasets
have been added. 3. Contrastive learning has been removed since it doesn't
make a significant difference to the performance. 4. Two more authors are added.
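The abstract does not spell out the exact form of the KL-divergence self-supervision; a common construction for clustering networks of this kind is a Student's-t soft assignment sharpened into an auxiliary target distribution, as in deep embedded clustering. A sketch under that assumption (all names illustrative):

```python
import numpy as np

def soft_assignments(Z, centers, alpha=1.0):
    """Student's t soft assignments q_ij of embedded spots to cluster centers."""
    d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened auxiliary target p_ij; training minimizes KL(p || q),
    pulling embeddings toward their high-confidence clusters."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def kl(p, q):
    """KL divergence between row-stochastic assignment matrices."""
    return float((p * np.log(p / q)).sum())

rng = np.random.default_rng(1)
Z = rng.normal(size=(6, 4))            # embeddings of 6 spots
centers = rng.normal(size=(3, 4))      # 3 cluster centers
q = soft_assignments(Z, centers)
p = target_distribution(q)
print(round(kl(p, q), 4))
```

In a full pipeline this KL term would be combined with the spatial-continuity and pseudo-label cross-entropy losses the abstract mentions; only the self-supervision piece is sketched here.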
GSAE: an autoencoder with embedded gene-set nodes for genomics functional characterization
Bioinformatics tools have been developed to interpret gene expression data at
the gene set level, and these gene-set-based analyses improve biologists'
ability to discover the functional relevance of their experimental design.
However, while gene sets are elucidated individually, associations between
gene sets are rarely taken into consideration. Deep learning, an emerging
machine learning technique in computational biology, can be used to generate
unbiased combinations of gene sets, and to determine the biological relevance
and analysis consistency of these combined gene sets by leveraging large
genomic data sets. In this study,
we proposed a gene superset autoencoder (GSAE), a multi-layer autoencoder model
with the incorporation of a priori defined gene sets that retain the crucial
biological features in the latent layer. We introduced the concept of the gene
superset, an unbiased combination of gene sets with weights trained by the
autoencoder, where each node in the latent layer is a superset. Trained with
genomic data from TCGA and evaluated with their accompanying clinical
parameters, we showed the ability of gene supersets to discriminate tumor subtypes
and their prognostic capability. We further demonstrated the biological
relevance of the top component gene sets in the significant supersets. Using
the autoencoder model with gene supersets at its latent layer, we demonstrated
that gene supersets retain sufficient biological information with respect to
tumor subtypes and clinical prognostic significance. Supersets also provide
high reproducibility in survival analysis and accurate prediction of cancer
subtypes.
Comment: Presented at the International Conference on Intelligent Biology and
Medicine (ICIBM 2018) at Los Angeles, CA, USA and published in BMC Systems
Biology 2018, 12(Suppl 8):14
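The core architectural idea, constraining the first encoder layer so that each latent "superset" node aggregates only the genes of one a-priori-defined gene set, can be sketched as a masked linear layer (a hypothetical minimal version, not the authors' implementation):

```python
import numpy as np

def gene_set_encoder_layer(X, membership, W, b):
    """First encoder layer with weights masked by gene-set membership:
    each latent node only receives input from genes in its own set."""
    return np.tanh(X @ (W * membership) + b)

rng = np.random.default_rng(0)
n_genes, n_sets, n_samples = 8, 2, 4
# binary genes-by-gene-sets membership: genes 0-3 in set 0, genes 4-7 in set 1
membership = np.zeros((n_genes, n_sets))
membership[:4, 0] = 1.0
membership[4:, 1] = 1.0
W = rng.normal(size=(n_genes, n_sets))   # trainable weights (here random)
b = np.zeros(n_sets)
X = rng.normal(size=(n_samples, n_genes))
H = gene_set_encoder_layer(X, membership, W, b)

# perturbing a gene outside a set leaves that set's node unchanged
X2 = X.copy()
X2[:, 7] += 10.0                         # gene 7 belongs only to set 1
H2 = gene_set_encoder_layer(X2, membership, W, b)
print(np.allclose(H[:, 0], H2[:, 0]))    # prints True
```

The mask is what keeps each latent node biologically interpretable: its activation depends only on its own gene set, while subsequent dense layers are free to combine supersets.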
ManiNetCluster: a novel manifold learning approach to reveal the functional links between gene networks.
BACKGROUND: The coordination of genomic functions is a critical and complex process across biological systems such as phenotypes or states (e.g., time, disease, organism, environmental perturbation). Understanding how the complexity of genomic function relates to these states remains a challenge. To address this, we have developed a novel computational method, ManiNetCluster, which simultaneously aligns and clusters gene networks (e.g., co-expression) to systematically reveal the links of genomic function between different conditions. Specifically, ManiNetCluster employs manifold learning to uncover and match local and non-linear structures among networks, and identifies cross-network functional links. RESULTS: We demonstrated that ManiNetCluster better aligns the orthologous genes from their developmental expression profiles across model organisms than state-of-the-art methods (p-value < 2.2×10⁻¹⁶). This indicates the potential non-linear interactions of evolutionarily conserved genes across species in development. Furthermore, we applied ManiNetCluster to time-series transcriptome data measured in the green alga Chlamydomonas reinhardtii to discover the genomic functions linking various metabolic processes between the light and dark periods of a diurnally cycling culture. We identified a number of genes putatively regulating processes across each lighting regime. CONCLUSIONS: ManiNetCluster provides a novel computational tool to uncover the genes linking various functions from different networks, providing new insight into how gene functions coordinate across different conditions. ManiNetCluster is publicly available as an R package at https://github.com/daifengwanglab/ManiNetCluster
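As a rough illustration of the manifold-alignment idea ManiNetCluster builds on, a joint spectral embedding of two networks coupled through a correspondence matrix, consider this sketch (invented names; the actual method is considerably more involved):

```python
import numpy as np

def joint_laplacian(W1, W2, C, mu=1.0):
    """Laplacian of the joint graph: two within-network similarity blocks
    coupled by a cross-network correspondence matrix C (e.g. orthologs)."""
    W = np.block([[W1, mu * C], [mu * C.T, W2]])
    return np.diag(W.sum(axis=1)) - W

def align_and_embed(W1, W2, C, dim=2, mu=1.0):
    """Joint spectral embedding: genes from both networks land in one
    latent space where corresponding genes are pulled close together."""
    L = joint_laplacian(W1, W2, C, mu)
    vals, vecs = np.linalg.eigh(L)
    Y = vecs[:, 1:dim + 1]                # skip the constant eigenvector
    n1 = W1.shape[0]
    return Y[:n1], Y[n1:]

# toy example: two identical 4-gene ring networks, matched one-to-one
ring = np.array([[0, 1, 0, 1],
                 [1, 0, 1, 0],
                 [0, 1, 0, 1],
                 [1, 0, 1, 0]], dtype=float)
Y1, Y2 = align_and_embed(ring, ring, np.eye(4), dim=2, mu=2.0)
# corresponding genes should coincide in the shared space
print(np.linalg.norm(Y1 - Y2))
```

Clustering the joint embedding then yields modules that may span both networks, which is the cross-condition functional-link idea the abstract describes.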
Novel Techniques for Single-cell RNA Sequencing Data Imputation and Clustering
Advances in single-cell technologies have shifted genomics research from the analysis of bulk tissues toward a comprehensive characterization of individual cells. These cutting-edge approaches enable the in-depth analysis of individual cells, unveiling the remarkable heterogeneity and complexity of cellular systems. By unraveling the unique signatures and functions of distinct cell types, single-cell technologies have not only deepened our understanding of fundamental biological processes but also unlocked new avenues for disease diagnostics and therapeutic interventions.

The applications of single-cell technologies extend beyond basic research, with significant implications for precision medicine, drug discovery, and regenerative medicine. By capturing the cellular heterogeneity within tumors, these methods have shed light on the mechanisms of tumor evolution, metastasis, and therapy resistance. Additionally, they have facilitated the identification of rare cell populations with specialized functions, such as stem cells and tissue-resident immune cells, which hold great promise for cell-based therapies.

However, one of the major challenges in analyzing scRNA-seq data is the prevalence of dropouts: instances where gene expression is not detected despite being present in the cell. Dropouts arise from technical limitations and can introduce excessive noise into the data, obscuring the true biological signals. As a result, imputation methods are used to estimate missing values and reduce the impact of dropouts on downstream analyses. Furthermore, the high dimensionality of scRNA-seq data presents additional challenges in effectively partitioning cell populations. Robust computational approaches are therefore required to overcome these challenges and extract meaningful biological insights from single-cell data.

Numerous imputation and clustering methods have been developed specifically to address the unique challenges of scRNA-seq data analysis. These methods aim to reduce the impact of dropouts and high dimensionality, allowing for accurate cell population partitioning and the discovery of meaningful biological insights. While these methods have unquestionably advanced the field of single-cell transcriptomics, they are not without limitations. Some may be computationally intensive, resulting in scalability issues with large datasets, whereas others may introduce biases or overfit the data, potentially affecting the accuracy of subsequent analyses. Furthermore, the performance of these methods can vary with a dataset's complexity and heterogeneity. Ongoing research is therefore required to improve existing methodologies and create new algorithms that address these limitations while retaining robustness and accuracy in scRNA-seq data analysis.

In this work, we propose three imputation approaches that combine statistical and deep learning frameworks. We robustly reconstruct the gene expression matrix, effectively mitigating dropout effects and reducing noise, which enhances the recovery of true biological signals from scRNA-seq data and better leverages the transcriptomic profiles of single cells. In addition, we introduce a clustering method that exploits scRNA-seq data to identify cellular subpopulations. Our method employs a combination of dimensionality reduction and network fusion algorithms to generate a cell similarity graph. This approach accounts for both local and global structure within the data, enabling the discovery of rare and previously unidentified cell populations.

We assess the imputation and clustering methods through rigorous benchmarking on simulated datasets and more than 30 real scRNA-seq datasets against existing state-of-the-art techniques. We show that the imputed data generated by our methods enhance the quality of downstream analyses, and we demonstrate that our clustering algorithm accurately identifies cell populations and is capable of analyzing big datasets.

In conclusion, this thesis proposes alternative approaches to advance the current state of scRNA-seq data analysis by developing innovative imputation and clustering methods that enable a more comprehensive and accurate characterization of cellular subpopulations. These advancements potentially have broad applicability in diverse research fields, including developmental biology, immunology, and oncology, where understanding cellular heterogeneity is crucial.
Unsupervised Machine Learning Algorithms to Characterize Single-Cell Heterogeneity and Perturbation Response
Recent advances in microfluidic technologies facilitate the measurement of gene expression, DNA accessibility, protein content, or genomic mutations at unprecedented scale. The challenges imposed by the scale of these datasets are further exacerbated by non-linearity in molecular effects, complex interdependencies between features, and a lack of understanding of both data-generating processes and sources of technical and biological noise. As a result, analysis of modern single-cell data requires the development of specialized computational tools. One solution to these problems is the use of manifold learning, a sub-field of unsupervised machine learning that seeks to model data geometry using the simplifying assumption that the underlying system is continuous and locally Euclidean. In this dissertation, I show how manifold learning is naturally suited for single-cell analysis and introduce three related algorithms for characterization of single-cell heterogeneity and perturbation response. I first describe Vertex Frequency Clustering, an algorithm that identifies groups of cells with similar responses to an experimental perturbation by analyzing the spectral representation of condition labels expressed as signals over a cell similarity graph. Next, I introduce MELD, an algorithm that expands on these ideas to estimate the density of each experimental sample over the graph to quantify the effect of an experimental perturbation at single-cell resolution. Finally, I describe a neural network for archetypal analysis that represents the data as continuously distributed between a set of extrema. Each of these algorithms is demonstrated on a combination of real and synthetic datasets and benchmarked against state-of-the-art algorithms.
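The graph-signal idea behind estimating a sample's density over a cell graph can be illustrated by low-pass filtering per-sample indicator signals with the graph heat kernel. The following is a simplified sketch of that general idea, not the MELD implementation (names and parameters are illustrative):

```python
import numpy as np

def sample_likelihood_over_graph(A, sample_labels, t=1.0):
    """Smooth per-sample indicator signals with the heat kernel e^{-tL}
    of a cell-similarity graph, then normalize per cell to get each
    sample's relative likelihood at every cell."""
    L = np.diag(A.sum(axis=1)) - A                   # combinatorial Laplacian
    vals, vecs = np.linalg.eigh(L)
    H = vecs @ np.diag(np.exp(-t * vals)) @ vecs.T   # heat kernel (low-pass)
    onehot = np.eye(sample_labels.max() + 1)[sample_labels]
    smoothed = H @ onehot                            # cells x samples
    return smoothed / smoothed.sum(axis=1, keepdims=True)

# path graph of 6 cells; first half from control (0), second from treated (1)
A = np.diag(np.ones(5), 1) + np.diag(np.ones(5), -1)
labels = np.array([0, 0, 0, 1, 1, 1])
lik = sample_likelihood_over_graph(A, labels, t=1.0)
print(np.round(lik[:, 1], 2))   # treated likelihood is higher at the treated end
```

The heat kernel acts as a low-pass graph filter, so the hard condition labels become a smooth per-cell score that varies continuously along the manifold, which is the single-cell-resolution perturbation readout described above.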
Processing, visualising and reconstructing network models from single-cell data.
New single-cell technologies readily permit gene expression profiling of thousands of cells at single-cell resolution. In this review, we will discuss methods for visualisation and interpretation of single-cell gene expression data, and the computational analysis needed to go from raw data to predictive executable models of gene regulatory network function. We will focus primarily on single-cell real-time quantitative PCR and RNA-sequencing data, but much of what we cover will also be relevant to other platforms, such as the mass cytometry technology for high-dimensional single-cell proteomics. S.W. is supported by a Microsoft Research PhD Scholarship. This is the author accepted manuscript; the final version is available from Nature Publishing Group via http://dx.doi.org/10.1038/icb.2015.10