9,269 research outputs found

    viSNE fine-tuning enables better resolution of cell populations

    t-Distributed Stochastic Neighbor Embedding (t-SNE or viSNE) is a dimensionality reduction algorithm that allows visualization of complex high-dimensional cytometry data as a two-dimensional distribution or "map". These maps can be interrogated by human-guided or automated techniques to categorize single-cell data into relevant biological populations and otherwise visualize important differences between samples. The method has been extensively adopted and reported in the literature to be superior to traditional biaxial gating. The analyst must carefully choose the parameters of a t-SNE computation, as incorrectly chosen parameters can create artifacts that make the resulting map difficult or impossible to interpret. The correct choice of algorithm parameters is complicated by the lack of an agreed-upon quantitative framework for assessing the quality of algorithm results; gauging result quality currently relies on subjective visual evaluation by an experienced t-SNE user. To overcome these limitations, we used the Cytobank viSNE engine for all t-SNE analyses and employed 18-parameter flow cytometry data as well as 32-parameter mass cytometry data of varying numbers of events to optimize t-SNE parameters such as the total number of iterations and perplexity. We also investigated the utility of Kullback-Leibler (KL) divergence as a metric of map quality, and of SPADE clustering as an indirect measure of multidimensional data integrity when flattened into t-SNE coordinates. We established that the number of t-SNE optimization steps ('iteration number') must be scaled with the total number of data points (events) in the set, suggesting that a number of existing software solutions produce unclear t-SNE maps of flow and mass cytometry data because of built-in restrictions on user control. We also evaluated lower-level parameters within the t-SNE code that control the 'early exaggeration' stage originally introduced into the t-SNE algorithm for better map optimization. These parameters are not available as part of the standard algorithm interface, but we found that they can be tuned to produce high-quality results in shorter periods of time, avoiding unnecessary increases in both analysis duration and computation cost. Our approach therefore allows the t-SNE analysis to be fine-tuned to ensure both optimal resolution of the low-dimensional maps and a more faithful representation of high-parameter cytometry data.
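
    The Cytobank viSNE engine itself is not scripted here; as a minimal sketch of the parameters discussed above (perplexity, iteration count, early exaggeration) and of KL divergence as a quality readout, scikit-learn's TSNE exposes equivalent controls. The data, the iteration-scaling rule, and the parameter values below are illustrative assumptions, and the max_iter argument is named n_iter in older scikit-learn releases.

        import numpy as np
        from sklearn.manifold import TSNE

        # Hypothetical cytometry matrix: rows are events, columns are markers
        # (e.g., 50,000 events x 18 fluorescence parameters after preprocessing).
        events = np.random.rand(50_000, 18).astype(np.float32)

        # Scale the iteration count with the number of events, as the abstract
        # argues; the factor used here is illustrative, not from the paper.
        iterations = max(1_000, events.shape[0] // 25)

        tsne = TSNE(
            n_components=2,
            perplexity=30,            # neighborhood size; tune per dataset
            early_exaggeration=12.0,  # the 'early exaggeration' factor discussed above
            max_iter=iterations,      # called n_iter in older scikit-learn versions
            init="pca",
            random_state=0,
        )
        embedding = tsne.fit_transform(events)

        # KL divergence of the final embedding: lower values generally indicate
        # a more faithful map, the metric the authors evaluate.
        print(f"final KL divergence: {tsne.kl_divergence_:.3f}")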

    Theoretical Foundations of t-SNE for Visualizing High-Dimensional Clustered Data

    This paper investigates the theoretical foundations of the t-distributed stochastic neighbor embedding (t-SNE) algorithm, a popular nonlinear dimension reduction and data visualization method. A novel theoretical framework for the analysis of t-SNE based on the gradient descent approach is presented. For the early exaggeration stage of t-SNE, we show its asymptotic equivalence to power iterations based on the underlying graph Laplacian, characterize its limiting behavior, and uncover its deep connection to Laplacian spectral clustering and to fundamental principles including early stopping as implicit regularization. The results explain the intrinsic mechanism and the empirical benefits of such a computational strategy. For the embedding stage of t-SNE, we characterize the kinematics of the low-dimensional map throughout the iterations, and identify an amplification phase, featuring intercluster repulsion and the expansive behavior of the low-dimensional map, and a stabilization phase. The general theory explains the fast convergence rate and the exceptional empirical performance of t-SNE for visualizing clustered data, brings forth interpretations of the t-SNE visualizations, and provides theoretical guidance for applying t-SNE and selecting its tuning parameters in various applications. Comment: Accepted by the Journal of Machine Learning Research.
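
    For reference, the standard t-SNE objective and gradient that this analysis builds on (van der Maaten and Hinton, 2008), written in LaTeX; during early exaggeration the input affinities $p_{ij}$ are multiplied by a factor $\alpha > 1$, the stage the paper relates to power iterations on the graph Laplacian of $P$:

        C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}},
        \qquad
        q_{ij} = \frac{(1 + \lVert y_i - y_j \rVert^2)^{-1}}{\sum_{k \neq l} (1 + \lVert y_k - y_l \rVert^2)^{-1}},
        \qquad
        \frac{\partial C}{\partial y_i} = 4 \sum_{j \neq i} (p_{ij} - q_{ij}) \, (1 + \lVert y_i - y_j \rVert^2)^{-1} \, (y_i - y_j).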

    A Tool for Interactive Data Visualization: Application to Over 10,000 Brain Imaging and Phantom MRI Data Sets

    In this paper we propose a web-based approach for quick visualization of big data from brain magnetic resonance imaging (MRI) scans using a combination of an automated image capture and processing system, nonlinear embedding, and interactive data visualization tools. We draw upon thousands of MRI scans captured via the COllaborative Imaging and Neuroinformatics Suite (COINS). We then interface the output of several analysis pipelines based on structural and functional data to a t-distributed stochastic neighbor embedding (t-SNE) algorithm, which reduces each scan in the input data set to two dimensions while preserving the local structure of the data. Finally, we interactively display the output of this approach via a web page based on the Data-Driven Documents (D3) JavaScript library. Two distinct approaches were used to visualize the data. In the first approach, we computed multiple quality control (QC) values from pre-processed data, which were used as inputs to the t-SNE algorithm. This approach helps in assessing the quality of each data set relative to others. In the second case, computed variables of interest (e.g., brain volume or voxel values from segmented gray matter images) were used as inputs to the t-SNE algorithm. This approach helps in identifying interesting patterns in the data sets. We demonstrate these approaches using multiple examples from over 10,000 data sets, including (1) quality control measures calculated from phantom data over time, (2) quality control data from human functional MRI data across various studies, scanners, and sites, and (3) volumetric and density measures from human structural MRI data across various studies, scanners, and sites. Results from (1) and (2) show the potential of our approach to combine t-SNE data reduction with interactive color coding of variables of interest to quickly identify visually distinct clusters of data (e.g., data sets with poor QC, or clustering of data by site). Results from (3) demonstrate interesting patterns of gray matter and volume, and evaluate how they map onto variables including scanners, age, and gender. In sum, the proposed approach allows researchers to rapidly identify and extract meaningful information from big data sets. Such tools are becoming increasingly important as datasets grow larger.
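
    A minimal sketch of the first workflow described above (QC values as t-SNE inputs), assuming a per-scan table of QC metrics; this is not the COINS pipeline, and the file name, column names, and parameters are placeholders. It embeds the metrics with scikit-learn's t-SNE and writes a JSON file that a D3-based page could read.

        import json
        import pandas as pd
        from sklearn.manifold import TSNE
        from sklearn.preprocessing import StandardScaler

        # Hypothetical QC table: one row per scan, metric columns such as SNR or motion.
        qc = pd.read_csv("qc_metrics.csv")
        features = StandardScaler().fit_transform(
            qc.drop(columns=["scan_id", "site"]).to_numpy()
        )

        # Reduce to two dimensions while preserving local neighborhood structure.
        xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)

        # Export points with color-coding variables for an interactive D3 scatter plot.
        points = [
            {"id": str(s), "site": str(site), "x": float(x), "y": float(y)}
            for s, site, (x, y) in zip(qc["scan_id"], qc["site"], xy)
        ]
        with open("tsne_points.json", "w") as fh:
            json.dump(points, fh)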

    Approximated and User Steerable tSNE for Progressive Visual Analytics

    Progressive Visual Analytics aims at improving the interactivity of existing analytics techniques by means of visualization of, as well as interaction with, intermediate results. One key method for data analysis is dimensionality reduction, for example, to produce 2D embeddings that can be visualized and analyzed efficiently. t-Distributed Stochastic Neighbor Embedding (tSNE) is a technique well suited to the visualization of high-dimensional data. tSNE can create meaningful intermediate results but suffers from a slow initialization that constrains its application in Progressive Visual Analytics. We introduce a controllable tSNE approximation (A-tSNE), which trades off speed and accuracy to enable interactive data exploration. We offer real-time visualization techniques, including a density-based solution and a Magic Lens to inspect the degree of approximation. With this feedback, the user can decide on local refinements and steer the approximation level during the analysis. We demonstrate our technique with several datasets, in a real-world research scenario, and for the real-time analysis of high-dimensional streams, illustrating its effectiveness for interactive data analysis.
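
    The A-tSNE implementation is not reproduced here; as a rough illustration of trading accuracy for speed in an off-the-shelf library, scikit-learn's Barnes-Hut t-SNE exposes an angle parameter where larger values give a coarser but faster approximation of the repulsive forces. This is a different approximation mechanism from A-tSNE's refinable neighborhoods, and the data below is a placeholder.

        import numpy as np
        from sklearn.manifold import TSNE

        X = np.random.rand(20_000, 50).astype(np.float32)  # placeholder high-dimensional data

        # Coarse pass: a large angle means a stronger Barnes-Hut approximation and
        # faster feedback, at the cost of embedding accuracy.
        coarse = TSNE(n_components=2, method="barnes_hut", angle=0.8,
                      perplexity=30, random_state=0).fit_transform(X)

        # Refined pass: a small angle means more accurate gradients but a longer run time.
        refined = TSNE(n_components=2, method="barnes_hut", angle=0.2,
                       perplexity=30, random_state=0).fit_transform(X)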

    Predictive Liability Models and Visualizations of High Dimensional Retail Employee Data

    Employee theft and dishonesty are major contributors to loss in the retail industry. Retailers have reported the need for more automated analytic tools to assess the liability of their employees. In this work, we train and optimize several machine learning models for regression prediction and analysis on this data, which will help retailers identify and manage risky employees. Since the data we use is very high dimensional, we use feature selection techniques to identify the factors that contribute most to an employee's assessed risk. We also use dimension reduction and data embedding techniques to present this dataset in an easy-to-interpret format.
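
    The employee data and models above are not public; as a generic sketch of the two steps named (feature selection to find the most contributing factors, then dimension reduction and embedding for presentation), assuming a tabular feature matrix X and a continuous risk score y, both synthetic placeholders:

        import numpy as np
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.feature_selection import SelectFromModel
        from sklearn.manifold import TSNE

        rng = np.random.default_rng(0)
        X = rng.random((5_000, 300))                        # placeholder employee features
        y = 2.0 * X[:, 0] + rng.normal(0.0, 0.1, 5_000)     # placeholder assessed risk score

        # Feature selection: keep the features the regression model finds most predictive.
        selector = SelectFromModel(RandomForestRegressor(n_estimators=100, random_state=0))
        X_selected = selector.fit_transform(X, y)

        # Dimension reduction / embedding of the retained features for visualization.
        embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_selected)
        print(X_selected.shape, embedding.shape)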