60 research outputs found

    Data Visualization, Dimensionality Reduction, and Data Alignment via Manifold Learning

    The high dimensionality of modern data introduces significant challenges for descriptive and exploratory data analysis. These challenges gave rise to extensive work on dimensionality reduction and manifold learning, aiming to provide low-dimensional representations that preserve or uncover intrinsic patterns and structures in the data. In this thesis, we extend the manifold learning literature with two new methods: DIG (Dynamical Information Geometry) and GRAE (Geometry Regularized Autoencoders). DIG finds low-dimensional representations of high-frequency multivariate time series and is especially suited for visualization. GRAE is a general framework that splices the well-established machinery of kernel manifold learning methods, which recover a sensible geometry, into the parametric structure of autoencoders. Manifold learning is also useful for studying data collected from different measurement instruments, conditions, or protocols applied to the same underlying system; in such cases the data arrives in a multi-domain representation. The last two chapters of this thesis present two new methods that align multi-domain data by leveraging its geometric structure alongside limited common information. First, we present DTA (Diffusion Transport Alignment), a semi-supervised manifold alignment method that exploits prior one-to-one correspondence knowledge between distinct data views to find an aligned common representation. Finally, we introduce MALI (Manifold Alignment with Label Information), which drops the assumption of prior one-to-one correspondences, since in many scenarios such information cannot be provided, either due to the nature of the experimental design or because obtaining it is extremely costly. Instead, MALI needs only side information in the form of discrete labels/classes present in both domains.
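The idea behind GRAE, anchoring an autoencoder's latent space to an embedding produced by a kernel manifold learning method, can be written as a two-term objective. The following is a minimal sketch, not GRAE's published API: the function name, the mean-squared form of both terms, and the `lam` trade-off weight are all illustrative assumptions.

```python
import numpy as np

def grae_loss(X, X_hat, Z, Z_target, lam=0.1):
    """Sketch of a geometry-regularized autoencoder objective.

    X, X_hat:  input data and its reconstruction by the decoder.
    Z:         latent codes produced by the encoder.
    Z_target:  a precomputed manifold-learning embedding that the
               latent space is pulled toward (assumed given here).
    lam:       trade-off between reconstruction and geometry terms.
    """
    reconstruction = np.mean((X - X_hat) ** 2)   # autoencoder term
    geometry = np.mean((Z - Z_target) ** 2)      # geometric regularizer
    return reconstruction + lam * geometry

# toy check: perfect reconstruction, latent codes offset from the target
X = np.zeros((4, 3))
Z = np.ones((4, 2))
Z_target = np.zeros((4, 2))
loss = grae_loss(X, X, Z, Z_target, lam=0.5)  # 0 + 0.5 * 1.0 = 0.5
```

In a real training loop the two terms would be minimized jointly over the encoder and decoder weights; the sketch only evaluates the objective for fixed arrays.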

    Understanding cellular differentiation by modelling of single-cell gene expression data

    Over the course of the last decade, single-cell RNA sequencing (scRNA-seq) has revolutionized the study of cellular heterogeneity: one experiment routinely covers the expression of thousands of genes in tens or hundreds of thousands of cells. By quantifying differences between single-cell transcriptomes, it is possible to reconstruct the process that gives rise to different cell fates from a progenitor population and to access trajectories of gene expression over developmental time. Tree reconstruction algorithms must deal with high levels of noise, the high dimensionality of gene expression space, and strong non-linear dependencies between genes. In this thesis we address three aspects of working with scRNA-seq data: (1) lineage tree reconstruction, where we propose MERLoT, a novel trajectory inference method; (2) method comparison, where we propose PROSSTT, a novel algorithm that simulates scRNA-seq count data of complex differentiation trajectories; and (3) noise modelling, where we propose a novel probabilistic description of count data, a statistically motivated local averaging strategy, and an adaptation of the cross-validation approach for evaluating gene expression imputation strategies. While statistical modelling of the data was our primary motivation, due to time constraints we did not manage to fully realize our plans for it. Increasingly complex processes such as whole-organism development are being studied by single-cell transcriptomics, producing large amounts of data. Methods for trajectory inference must therefore efficiently reconstruct a priori unknown lineage trees with many cell fates. We propose MERLoT, a method that reconstructs trees in sub-quadratic time by utilizing a local averaging strategy, and thus scales very well to large datasets. MERLoT compares favorably to the state of the art, both on real data and on a large synthetic benchmark.
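The local averaging idea can be illustrated with plain k-nearest-neighbour smoothing. This is a deliberate simplification: MERLoT's actual procedure and the bookkeeping behind its sub-quadratic scaling are not reproduced here, and the function name is illustrative.

```python
import numpy as np

def knn_smooth(X, k=3):
    """Average each cell's profile over its k nearest cells (itself included).

    X: cells x genes expression matrix. A brute-force O(n^2) sketch of
    local averaging; it damps count noise at the cost of some bias.
    """
    # pairwise squared Euclidean distances between cells (rows of X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    nearest = np.argsort(d2, axis=1)[:, :k]   # indices of the k closest cells
    return X[nearest].mean(axis=1)            # average their profiles

# toy data: two identical cells and one distant cell
X = np.array([[0.0], [0.0], [10.0]])
S = knn_smooth(X, k=2)  # the distant cell is pulled toward its neighbour
```

Choosing k is exactly the bias-variance trade-off discussed below: larger k averages away more noise but mixes in less appropriate neighbours.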
The absence of data with known complex underlying topologies makes it challenging to quantitatively compare tree reconstruction methods. PROSSTT is a novel algorithm that simulates count data from complex differentiation processes, facilitating such comparisons. We created the largest synthetic dataset to date, and the first to contain simulations with up to 12 cell fates. Additionally, PROSSTT can learn simulation parameters from reconstructed lineage trees and produce cells with expression profiles similar to the real data. Quantifying similarity between single-cell transcriptomes is crucial for clustering scRNA-seq profiles into cell types or inferring developmental trajectories, and appropriate statistical modelling of the data should improve such similarity calculations. We propose a Gaussian mixture of negative binomial distributions where gene expression variance depends on the square of the average expression. The model hyperparameters can be learned via the hybrid Monte Carlo algorithm, and a good initialization of the average expression and variance parameters can be obtained by trajectory inference. One way to limit noise in the data is local averaging, using the nearest neighbours of each cell to recover the expression of non-captured mRNA. Our proposal, nearest-neighbour smoothing with optimal bias-variance trade-off, improves the k-nearest-neighbours approach by reducing the contribution of inappropriate neighbours. We also propose a way to assess the quality of gene expression imputation: after reconstructing a trajectory with imputed data, each cell can be projected onto the trajectory using non-overlapping subsets of genes, and the robustness of these assignments over multiple partitions of the genes is a novel estimator of imputation performance. Finally, I was involved in the planning and initial stages of a mouse ovary cell atlas as part of a collaboration.
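A mean-variance relation in which variance grows with the square of the mean maps directly onto the standard negative binomial parameterization. The sketch below assumes the common quadratic form var = mu + a * mu**2 (the dispersion symbol `a` and function name are illustrative, not notation from the thesis).

```python
import numpy as np

def nb_params(mu, a):
    """Convert a quadratic mean-variance relation to NB (n, p) parameters.

    Assumes var = mu + a * mu**2, so the NB size parameter is
    n = mu**2 / (var - mu) = 1 / a and the success probability is
    p = n / (n + mu).
    """
    var = mu + a * mu ** 2
    n = mu ** 2 / (var - mu)   # equals 1 / a
    p = n / (n + mu)
    return n, p

n, p = nb_params(mu=10.0, a=0.25)
# sanity check: the NB mean n * (1 - p) / p recovers mu
recovered_mu = n * (1 - p) / p
```

Counts with this dispersion could then be sampled with `np.random.negative_binomial(n, p)`, which uses the same (n, p) convention.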

    Graph Priors, Optimal Transport, and Deep Learning in Biomedical Discovery

    Recent advances in biomedical data collection allow massive datasets measuring thousands of features in thousands to millions of individual cells. This data has the potential to advance our understanding of biological mechanisms at a previously impossible resolution. However, there are few methods for understanding data of this scale and type. While neural networks have made tremendous progress on supervised learning problems, much work remains to make them useful for discovery in data whose supervision is more difficult to represent. The flexibility and expressiveness of neural networks can be a hindrance in these less supervised settings, as is the case when extracting knowledge from biomedical data. One type of prior knowledge that is common in biological data comes in the form of geometric constraints. In this thesis, we aim to leverage this geometric knowledge to create scalable and interpretable models of such data. Encoding geometric priors into neural network and graph models allows us to characterize the models' solutions in terms of graph signal processing and optimal transport, and these links help us understand and interpret this data type. We divide this work into three sections. The first borrows concepts from graph signal processing to construct more interpretable and performant neural networks by constraining and structuring the architecture. The second borrows from the theory of optimal transport to perform anomaly detection and trajectory inference efficiently and with theoretical guarantees. The third examines how to compare distributions over an underlying manifold, which can be used to understand how different perturbations or conditions relate; for this we design an efficient approximation of optimal transport based on diffusion over a joint cell graph. Together, these works utilize our prior understanding of the data geometry to create more useful models of the data. We apply these methods to molecular graphs, images, single-cell sequencing, and health record data.
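The diffusion-on-a-joint-graph idea for comparing conditions can be caricatured in a few lines: place each condition's cells as a probability vector on a shared graph, diffuse both, and compare. This is an illustrative single-scale proxy under assumed names, not the multiscale optimal-transport approximation the thesis develops.

```python
import numpy as np

def diffusion_compare(P, mu, nu, t=4):
    """Distance between two distributions on a graph after t diffusion steps.

    P:      row-stochastic transition matrix of the joint cell graph.
    mu, nu: probability vectors placing each condition's mass on its cells.
    Diffusing both before taking an L1 difference makes the comparison
    sensitive to graph geometry rather than raw node overlap.
    """
    Pt = np.linalg.matrix_power(P, t)
    return np.abs(mu @ Pt - nu @ Pt).sum()

# toy 3-node chain with a lazy random walk (avoids bipartite parity artifacts)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
W = A / A.sum(axis=1, keepdims=True)
P = 0.5 * (np.eye(3) + W)
mu = np.array([1.0, 0.0, 0.0])  # condition 1 concentrated on node 0
nu = np.array([0.0, 0.0, 1.0])  # condition 2 concentrated on node 2
d = diffusion_compare(P, mu, nu)
```

Running the comparison at several values of t and combining the results is what gives diffusion-based transport approximations their multiscale character.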

    An Improved UMAP Method for Fast and Accurate Dimensionality Reduction

    Master's thesis -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, August 2021. 고형권. One effective way of understanding the characteristics of high-dimensional data is to embed it in a low-dimensional space. Among the many existing dimensionality reduction algorithms, Uniform Manifold Approximation and Projection (UMAP) has gained the most attention because of its fast and stable projection results. However, it is still too slow to be adopted in an interactive visual analytics system, as it takes a few minutes to embed even a toy dataset (e.g., MNIST). Moreover, UMAP is vulnerable to different configurations of hyperparameters, especially the initialization method and the number of epochs, which can introduce serious bias when mining insights from the embedding result. To achieve responsiveness, we propose a progressive algorithm for UMAP, called Progressive UMAP, which supports the exploration of datasets by updating the embedding with batches of points through progressive computation. Next, to guarantee a less biased and more robust embedding, we present a novel dimensionality reduction algorithm called Uniform Manifold Approximation with Two-phase Optimization (UMATO). We find that the vulnerability comes from the approximation of the cross-entropy loss function. UMATO instead takes a two-phase optimization approach: global optimization to obtain the overall skeleton of the data, followed by local optimization to identify the regional characteristics of each local area. In experiments with one synthetic and three real-world datasets, UMATO outperformed widely used baseline algorithms such as PCA, t-SNE, UMAP, topological autoencoders, and Anchor t-SNE in terms of global quality metrics and 2D projection results. We further examine a case study of UMATO on real-world biological data and an extension to multi-phase optimization.
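UMAP's layout optimization targets a fuzzy-set cross-entropy between edge strengths in the input space and in the embedding, and it is the sampled approximation of this loss that the abstract identifies as the source of instability. A minimal sketch of the exact (unsampled) loss follows; the function name is illustrative.

```python
import numpy as np

def fuzzy_set_cross_entropy(p, q, eps=1e-12):
    """Cross-entropy between fuzzy edge memberships p (input space)
    and q (embedding space), summed over edges.

    The p * log(p/q) term attracts embedded points along strong edges;
    the (1 - p) term repels them along weak ones. UMAP approximates the
    repulsive term by negative sampling rather than summing it exactly.
    """
    p = np.clip(p, eps, 1.0 - eps)
    q = np.clip(q, eps, 1.0 - eps)
    return np.sum(p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q)))

p = np.array([0.9, 0.1])
loss_matched = fuzzy_set_cross_entropy(p, p)          # identical graphs
loss_mismatched = fuzzy_set_cross_entropy(p, p[::-1])  # strengths swapped
```

The loss is zero only when the embedding's edge strengths match the input graph's, which is why approximating it coarsely can bias the final layout.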
Our work makes original contributions to the field of dimensionality reduction, as well as to progressive visual analytics. Lastly, the thesis discusses future research directions for improving the proposed algorithms.
Contents:
CHAPTER 1 Introduction: Motivation; Research Questions and Approaches (Progressive Algorithm for UMAP; Less Biased and Robust Dimensionality Reduction Algorithm); Contributions; Thesis Overview
CHAPTER 2 Background: UMAP: Graph Construction; Layout Optimization
CHAPTER 3 Progressive UMAP: A Progressive Algorithm for UMAP: Introduction; Related Work (Progressive Visual Analytics); Progressive UMAP (Computing N_i; Computing ρ_i and σ_i; Layout Initialization; Layout Optimization); Evaluation and Discussion; Summary
CHAPTER 4 UMATO: A Less Biased and Robust Dimensionality Reduction Algorithm Based on UMAP: Introduction; Related Work (Dimensionality Reduction; Hubs, Landmarks, and Anchors); The Meaning of Using Different Loss Functions in Dimensionality Reduction (t-SNE); UMATO (Points Classification; Global Optimization; Local Optimization; Outliers Arrangement); Experiments (Quantitative and Qualitative Evaluation of UMATO Compared to Six Baseline Algorithms; Case Study: UMATO on Real-world Biological Data); Discussion; Summary
CHAPTER 5 Discussion: Lessons Learned; Limitations
CHAPTER 6 Conclusion