60 research outputs found

    Data Visualization, Dimensionality Reduction, and Data Alignment via Manifold Learning

    The high dimensionality of modern data introduces significant challenges for descriptive and exploratory data analysis. These challenges gave rise to extensive work on dimensionality reduction and manifold learning, aiming to provide low-dimensional representations that preserve or uncover intrinsic patterns and structures in the data. In this thesis, we extend the manifold learning literature with two new methods: DIG (Dynamical Information Geometry) and GRAE (Geometry Regularized Autoencoders). DIG finds low-dimensional representations of high-frequency multivariate time series and is especially suited for visualization. GRAE is a general framework that splices the well-established machinery of kernel manifold learning methods, which recover a sensible geometry, into the parametric structure of autoencoders. Manifold learning is also useful for studying data collected from different measurement instruments, conditions, or protocols applied to the same underlying system; in such cases the data arrives in a multi-domain representation. The last two chapters of this thesis present two new methods that align multi-domain data by leveraging its geometric structure alongside limited common information. First, we present DTA (Diffusion Transport Alignment), a semi-supervised manifold alignment method that exploits prior one-to-one correspondence knowledge between distinct data views to find an aligned common representation. Finally, we introduce MALI (Manifold Alignment with Label Information), which drops the assumption of prior one-to-one correspondences, since in many scenarios such information cannot be provided, either due to the nature of the experimental design or because obtaining it is extremely costly. Instead, MALI needs only side information in the form of discrete labels/classes present in both domains.
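The idea behind GRAE, anchoring an autoencoder's latent space to an embedding produced by a kernel manifold learning method, can be written as a two-term objective. The following is a minimal sketch, not GRAE's published API: the function name, the mean-squared form of both terms, and the `lam` trade-off weight are all illustrative assumptions.

```python
import numpy as np

def grae_loss(X, X_hat, Z, Z_target, lam=0.1):
    """Sketch of a geometry-regularized autoencoder objective.

    X, X_hat:  input data and its reconstruction by the decoder.
    Z:         latent codes produced by the encoder.
    Z_target:  a precomputed manifold-learning embedding that the
               latent space is pulled toward (assumed given here).
    lam:       trade-off between reconstruction and geometry terms.
    """
    reconstruction = np.mean((X - X_hat) ** 2)   # autoencoder term
    geometry = np.mean((Z - Z_target) ** 2)      # geometric regularizer
    return reconstruction + lam * geometry

# toy check: perfect reconstruction, latent codes offset from the target
X = np.zeros((4, 3))
Z = np.ones((4, 2))
Z_target = np.zeros((4, 2))
loss = grae_loss(X, X, Z, Z_target, lam=0.5)  # 0 + 0.5 * 1.0 = 0.5
```

In a real training loop the two terms would be minimized jointly over the encoder and decoder weights; the sketch only evaluates the objective for fixed arrays.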

    Understanding cellular differentiation by modelling of single-cell gene expression data

    Over the course of the last decade, single-cell RNA sequencing (scRNA-seq) has revolutionized the study of cellular heterogeneity: one experiment routinely covers the expression of thousands of genes in tens or hundreds of thousands of cells. By quantifying differences between single-cell transcriptomes, it is possible to reconstruct the process that gives rise to different cell fates from a progenitor population and to access trajectories of gene expression over developmental time. Tree reconstruction algorithms must deal with high levels of noise, the high dimensionality of gene expression space, and strong non-linear dependencies between genes. In this thesis we address three aspects of working with scRNA-seq data: (1) lineage tree reconstruction, where we propose MERLoT, a novel trajectory inference method; (2) method comparison, where we propose PROSSTT, a novel algorithm that simulates scRNA-seq count data of complex differentiation trajectories; and (3) noise modelling, where we propose a novel probabilistic description of count data, a statistically motivated local averaging strategy, and an adaptation of the cross-validation approach for evaluating gene expression imputation strategies. While statistical modelling of the data was our primary motivation, due to time constraints we did not manage to fully realize our plans for it. Increasingly complex processes such as whole-organism development are being studied by single-cell transcriptomics, producing large amounts of data. Methods for trajectory inference must therefore efficiently reconstruct a priori unknown lineage trees with many cell fates. We propose MERLoT, a method that reconstructs trees in sub-quadratic time by utilizing a local averaging strategy, and thus scales very well to large datasets. MERLoT compares favorably to the state of the art, both on real data and on a large synthetic benchmark.
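The local averaging idea can be illustrated with plain k-nearest-neighbour smoothing. This is a deliberate simplification: MERLoT's actual procedure and the bookkeeping behind its sub-quadratic scaling are not reproduced here, and the function name is illustrative.

```python
import numpy as np

def knn_smooth(X, k=3):
    """Average each cell's profile over its k nearest cells (itself included).

    X: cells x genes expression matrix. A brute-force O(n^2) sketch of
    local averaging; it damps count noise at the cost of some bias.
    """
    # pairwise squared Euclidean distances between cells (rows of X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    nearest = np.argsort(d2, axis=1)[:, :k]   # indices of the k closest cells
    return X[nearest].mean(axis=1)            # average their profiles

# toy data: two identical cells and one distant cell
X = np.array([[0.0], [0.0], [10.0]])
S = knn_smooth(X, k=2)  # the distant cell is pulled toward its neighbour
```

Choosing k is exactly the bias-variance trade-off discussed below: larger k averages away more noise but mixes in less appropriate neighbours.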
The absence of data with known complex underlying topologies makes it challenging to quantitatively compare tree reconstruction methods. PROSSTT is a novel algorithm that simulates count data from complex differentiation processes, facilitating such comparisons. We created the largest synthetic dataset to date, and the first to contain simulations with up to 12 cell fates. Additionally, PROSSTT can learn simulation parameters from reconstructed lineage trees and produce cells with expression profiles similar to the real data. Quantifying similarity between single-cell transcriptomes is crucial for clustering scRNA-seq profiles into cell types or inferring developmental trajectories, and appropriate statistical modelling of the data should improve such similarity calculations. We propose a Gaussian mixture of negative binomial distributions where gene expression variance depends on the square of the average expression. The model hyperparameters can be learned via the hybrid Monte Carlo algorithm, and a good initialization of the average expression and variance parameters can be obtained by trajectory inference. One way to limit noise in the data is local averaging, using the nearest neighbours of each cell to recover the expression of non-captured mRNA. Our proposal, nearest-neighbour smoothing with optimal bias-variance trade-off, improves the k-nearest-neighbours approach by reducing the contribution of inappropriate neighbours. We also propose a way to assess the quality of gene expression imputation: after reconstructing a trajectory with imputed data, each cell can be projected onto the trajectory using non-overlapping subsets of genes, and the robustness of these assignments over multiple partitions of the genes is a novel estimator of imputation performance. Finally, I was involved in the planning and initial stages of a mouse ovary cell atlas as part of a collaboration.
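A mean-variance relation in which variance grows with the square of the mean maps directly onto the standard negative binomial parameterization. The sketch below assumes the common quadratic form var = mu + a * mu**2 (the dispersion symbol `a` and function name are illustrative, not notation from the thesis).

```python
import numpy as np

def nb_params(mu, a):
    """Convert a quadratic mean-variance relation to NB (n, p) parameters.

    Assumes var = mu + a * mu**2, so the NB size parameter is
    n = mu**2 / (var - mu) = 1 / a and the success probability is
    p = n / (n + mu).
    """
    var = mu + a * mu ** 2
    n = mu ** 2 / (var - mu)   # equals 1 / a
    p = n / (n + mu)
    return n, p

n, p = nb_params(mu=10.0, a=0.25)
# sanity check: the NB mean n * (1 - p) / p recovers mu
recovered_mu = n * (1 - p) / p
```

Counts with this dispersion could then be sampled with `np.random.negative_binomial(n, p)`, which uses the same (n, p) convention.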

    Graph Priors, Optimal Transport, and Deep Learning in Biomedical Discovery

    Recent advances in biomedical data collection allow massive datasets measuring thousands of features in thousands to millions of individual cells. This data has the potential to advance our understanding of biological mechanisms at a previously impossible resolution. However, there are few methods for understanding data of this scale and type. While neural networks have made tremendous progress on supervised learning problems, much work remains to make them useful for discovery in data whose supervision is more difficult to represent. The flexibility and expressiveness of neural networks can be a hindrance in these less supervised settings, as is the case when extracting knowledge from biomedical data. One type of prior knowledge that is common in biological data comes in the form of geometric constraints. In this thesis, we aim to leverage this geometric knowledge to create scalable and interpretable models of such data. Encoding geometric priors into neural network and graph models allows us to characterize the models' solutions in terms of graph signal processing and optimal transport, and these links help us understand and interpret this data type. We divide this work into three sections. The first borrows concepts from graph signal processing to construct more interpretable and performant neural networks by constraining and structuring the architecture. The second borrows from the theory of optimal transport to perform anomaly detection and trajectory inference efficiently and with theoretical guarantees. The third examines how to compare distributions over an underlying manifold, which can be used to understand how different perturbations or conditions relate; for this we design an efficient approximation of optimal transport based on diffusion over a joint cell graph. Together, these works utilize our prior understanding of the data geometry to create more useful models of the data. We apply these methods to molecular graphs, images, single-cell sequencing, and health record data.
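The diffusion-on-a-joint-graph idea for comparing conditions can be caricatured in a few lines: place each condition's cells as a probability vector on a shared graph, diffuse both, and compare. This is an illustrative single-scale proxy under assumed names, not the multiscale optimal-transport approximation the thesis develops.

```python
import numpy as np

def diffusion_compare(P, mu, nu, t=4):
    """Distance between two distributions on a graph after t diffusion steps.

    P:      row-stochastic transition matrix of the joint cell graph.
    mu, nu: probability vectors placing each condition's mass on its cells.
    Diffusing both before taking an L1 difference makes the comparison
    sensitive to graph geometry rather than raw node overlap.
    """
    Pt = np.linalg.matrix_power(P, t)
    return np.abs(mu @ Pt - nu @ Pt).sum()

# toy 3-node chain with a lazy random walk (avoids bipartite parity artifacts)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
W = A / A.sum(axis=1, keepdims=True)
P = 0.5 * (np.eye(3) + W)
mu = np.array([1.0, 0.0, 0.0])  # condition 1 concentrated on node 0
nu = np.array([0.0, 0.0, 1.0])  # condition 2 concentrated on node 2
d = diffusion_compare(P, mu, nu)
```

Running the comparison at several values of t and combining the results is what gives diffusion-based transport approximations their multiscale character.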

    An Improved UMAP Method for Fast and Accurate Dimensionality Reduction

    Master's thesis -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, August 2021. 고형권. One effective way of understanding the characteristics of high-dimensional data is to embed it in a low-dimensional space. Among the many existing dimensionality reduction algorithms, Uniform Manifold Approximation and Projection (UMAP) has gained the most attention because of its fast and stable projection results. However, it is still too slow to be adopted in an interactive visual analytics system, as it takes a few minutes to embed even a toy dataset (e.g., MNIST). Moreover, UMAP is vulnerable to different configurations of hyperparameters, especially the initialization method and the number of epochs, which can introduce serious bias when mining insights from the embedding result. To achieve responsiveness, we propose a progressive algorithm for UMAP, called Progressive UMAP, which supports the exploration of datasets by updating the embedding with batches of points through progressive computation. Next, to guarantee a less biased and more robust embedding, we present a novel dimensionality reduction algorithm called Uniform Manifold Approximation with Two-phase Optimization (UMATO). We find that the vulnerability comes from the approximation of the cross-entropy loss function. UMATO instead takes a two-phase optimization approach: global optimization to obtain the overall skeleton of the data, followed by local optimization to identify the regional characteristics of each local area. In experiments with one synthetic and three real-world datasets, UMATO outperformed widely used baseline algorithms such as PCA, t-SNE, UMAP, topological autoencoders, and Anchor t-SNE in terms of global quality metrics and 2D projection results. We further examine a case study of UMATO on real-world biological data and an extension to multi-phase optimization.
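UMAP's layout optimization targets a fuzzy-set cross-entropy between edge strengths in the input space and in the embedding, and it is the sampled approximation of this loss that the abstract identifies as the source of instability. A minimal sketch of the exact (unsampled) loss follows; the function name is illustrative.

```python
import numpy as np

def fuzzy_set_cross_entropy(p, q, eps=1e-12):
    """Cross-entropy between fuzzy edge memberships p (input space)
    and q (embedding space), summed over edges.

    The p * log(p/q) term attracts embedded points along strong edges;
    the (1 - p) term repels them along weak ones. UMAP approximates the
    repulsive term by negative sampling rather than summing it exactly.
    """
    p = np.clip(p, eps, 1.0 - eps)
    q = np.clip(q, eps, 1.0 - eps)
    return np.sum(p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q)))

p = np.array([0.9, 0.1])
loss_matched = fuzzy_set_cross_entropy(p, p)          # identical graphs
loss_mismatched = fuzzy_set_cross_entropy(p, p[::-1])  # strengths swapped
```

The loss is zero only when the embedding's edge strengths match the input graph's, which is why approximating it coarsely can bias the final layout.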
Our work makes original contributions to the field of dimensionality reduction, as well as to progressive visual analytics. Lastly, the thesis discusses future research directions for improving the proposed algorithms.
Contents:
CHAPTER 1 Introduction: Motivation; Research Questions and Approaches (Progressive Algorithm for UMAP; Less Biased and Robust Dimensionality Reduction Algorithm); Contributions; Thesis Overview
CHAPTER 2 Background: UMAP: Graph Construction; Layout Optimization
CHAPTER 3 Progressive UMAP: A Progressive Algorithm for UMAP: Introduction; Related Work (Progressive Visual Analytics); Progressive UMAP (Computing N_i; Computing ρ_i and σ_i; Layout Initialization; Layout Optimization); Evaluation and Discussion; Summary
CHAPTER 4 UMATO: A Less Biased and Robust Dimensionality Reduction Algorithm Based on UMAP: Introduction; Related Work (Dimensionality Reduction; Hubs, Landmarks, and Anchors); The Meaning of Using Different Loss Functions in Dimensionality Reduction (t-SNE); UMATO (Points Classification; Global Optimization; Local Optimization; Outliers Arrangement); Experiments (Quantitative and Qualitative Evaluation of UMATO Compared to Six Baseline Algorithms; Case Study: UMATO on Real-world Biological Data); Discussion; Summary
CHAPTER 5 Discussion: Lessons Learned; Limitations
CHAPTER 6 Conclusion