Visualizing Bags of Vectors
The motivation of this work is two-fold: (a) to compare two different
modes of visualizing data that exists in a bag-of-vectors format, and (b) to propose
a theoretical model that supports a new mode of visualizing data. Visualizing
high-dimensional data can be achieved using Minimum Volume Embedding, but the
data has to exist in a format suitable for computing similarities while
preserving local distances. This paper compares the visualizations produced by two
methods of representing data and also proposes a new method, providing sample
visualizations for that method.
Dynamical projections for the visualization of PDFSense data
A recent paper on visualizing the sensitivity of hadronic experiments to
nucleon structure [1] introduces the tool PDFSense which defines measures to
allow the user to judge the sensitivity of PDF fits to a given experiment. The
sensitivity is characterized by high-dimensional data residuals that are
visualized in a 3-d subspace of the first 10 principal components or using
t-SNE [2]. We show how a tour, a dynamic visualisation of high dimensional
data, can extend this tool beyond 3-d relationships. This approach enables
resolving structure orthogonal to the 2-d viewing plane used so far, and hence
a finer-tuned assessment of the sensitivity.
Comment: Format of the animations changed for easier viewing
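The idea of a tour, viewing high-dimensional data through a smoothly changing sequence of 2-d projections, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function names `random_frame` and `tour_frames` are hypothetical, the data is synthetic 10-dimensional noise standing in for principal-component scores, and frames are blended linearly and re-orthonormalized with QR rather than interpolated along geodesics as a full tour implementation would do.

```python
import numpy as np

def random_frame(p, d, rng):
    """Return a p x d matrix with orthonormal columns (a random projection frame)."""
    q, _ = np.linalg.qr(rng.standard_normal((p, d)))
    return q

def tour_frames(p, d=2, n_targets=3, steps=10, seed=0):
    """Yield a sequence of projection frames moving between random target frames.

    Simplification: frames are blended linearly and re-orthonormalized,
    which is not the geodesic interpolation used by real tour software.
    """
    rng = np.random.default_rng(seed)
    current = random_frame(p, d, rng)
    for _ in range(n_targets):
        target = random_frame(p, d, rng)
        for t in np.linspace(0.0, 1.0, steps, endpoint=False):
            blend = (1.0 - t) * current + t * target
            q, _ = np.linalg.qr(blend)  # restore orthonormal columns
            yield q
        current = target

# Synthetic stand-in for 100 observations described by 10 components.
X = np.random.default_rng(1).standard_normal((100, 10))

# Each view is one 2-d "frame of the animation".
views = [X @ frame for frame in tour_frames(p=10)]
```

Plotting the entries of `views` in sequence gives an animation in which structure hidden in one fixed projection may become visible in another.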
Theoretical Foundations of t-SNE for Visualizing High-Dimensional Clustered Data
This paper investigates the theoretical foundations of the t-distributed
stochastic neighbor embedding (t-SNE) algorithm, a popular nonlinear dimension
reduction and data visualization method. A novel theoretical framework for the
analysis of t-SNE based on the gradient descent approach is presented. For the
early exaggeration stage of t-SNE, we show its asymptotic equivalence to power
iterations based on the underlying graph Laplacian, characterize its limiting
behavior, and uncover its deep connection to Laplacian spectral clustering, and
fundamental principles including early stopping as implicit regularization. The
results explain the intrinsic mechanism and the empirical benefits of such a
computational strategy. For the embedding stage of t-SNE, we characterize the
kinematics of the low-dimensional map throughout the iterations, and identify
an amplification phase, featuring the intercluster repulsion and the expansive
behavior of the low-dimensional map, and a stabilization phase. The general
theory explains the fast convergence rate and the exceptional empirical
performance of t-SNE for visualizing clustered data, brings forth
interpretations of the t-SNE visualizations, and provides theoretical guidance
for applying t-SNE and selecting its tuning parameters in various applications.
Comment: Accepted by Journal of Machine Learning Research
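The connection between power iterations on a graph Laplacian and clustering can be illustrated with a small sketch. This is not the paper's update rule or a t-SNE implementation: the Gaussian affinity, the step size, the synthetic two-cluster data, and the helper names (`gaussian_affinity`, `laplacian_power_iterations`, `within_spread`) are all assumptions chosen for illustration. It only shows that repeatedly applying I - cL to a random embedding contracts points within each well-separated cluster toward a common location.

```python
import numpy as np

def gaussian_affinity(X, sigma=1.0):
    """Symmetric Gaussian affinity matrix with zero diagonal (an assumed choice)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    return P

def laplacian_power_iterations(P, Y0, n_iter=200):
    """Repeatedly apply (I - step * L) to Y0, where L = D - P is the graph Laplacian."""
    deg = P.sum(axis=1)
    L = np.diag(deg) - P
    step = 1.0 / (2.0 * deg.max())  # keeps eigenvalues of I - step * L in [0, 1]
    M = np.eye(len(P)) - step * L
    Y = Y0.copy()
    for _ in range(n_iter):
        Y = M @ Y
    return Y

def within_spread(Y, labels):
    """Largest per-coordinate standard deviation inside any cluster."""
    return max(Y[labels == k].std(axis=0).max() for k in np.unique(labels))

rng = np.random.default_rng(0)
# Two well-separated synthetic clusters in the input space.
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)), rng.normal(10.0, 0.5, (20, 2))])
labels = np.repeat([0, 1], 20)

Y0 = rng.standard_normal((40, 2))  # random initial low-dimensional map
Y = laplacian_power_iterations(gaussian_affinity(X), Y0)

print("within-cluster spread before:", within_spread(Y0, labels))
print("within-cluster spread after: ", within_spread(Y, labels))
```

Stopping after a finite number of iterations matters in this toy setting too: components of the map that vary within a cluster decay quickly, while the cluster-level structure decays far more slowly.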
Superheat: An R package for creating beautiful and extendable heatmaps for visualizing complex data
The technological advancements of the modern era have enabled the collection
of huge amounts of data in science and beyond. Extracting useful information
from such massive datasets is an ongoing challenge as traditional data
visualization tools typically do not scale well in high-dimensional settings.
An existing visualization technique that is particularly well suited to
visualizing large datasets is the heatmap. Although heatmaps are extremely
popular in fields such as bioinformatics for visualizing large gene expression
datasets, they remain a severely underutilized visualization tool in modern
data analysis. In this paper we introduce superheat, a new R package that
provides an extremely flexible and customizable platform for visualizing large
datasets using extendable heatmaps. Superheat enhances the traditional heatmap
by providing a platform to visualize a wide range of data types simultaneously,
adding to the heatmap a response variable as a scatterplot, model results as
boxplots, correlation information as barplots, text information, and more.
Superheat allows the user to explore their data to greater depths and to take
advantage of the heterogeneity present in the data to inform analysis
decisions. The goal of this paper is two-fold: (1) to demonstrate the potential
of the heatmap as a default visualization method for a wide range of data types
using reproducible examples, and (2) to highlight the customizability and ease
of implementation of the superheat package in R for creating beautiful and
extendable heatmaps. The capabilities and fundamental applicability of the
superheat package will be explored via three case studies, each based on
publicly available data sources and accompanied by a file outlining the
step-by-step analytic pipeline (with code).
Comment: 26 pages, 10 figures