Heavy-tailed kernels reveal a finer cluster structure in t-SNE visualisations
T-distributed stochastic neighbour embedding (t-SNE) is a widely used data
visualisation technique. It differs from its predecessor SNE by the
low-dimensional similarity kernel: the Gaussian kernel was replaced by the
heavy-tailed Cauchy kernel, solving the "crowding problem" of SNE. Here, we
develop an efficient implementation of t-SNE for a t-distribution kernel with
an arbitrary degree of freedom ν, with ν → ∞ corresponding to SNE
and ν = 1 corresponding to the standard t-SNE. Using theoretical analysis and
toy examples, we show that ν < 1 can further reduce the crowding problem and
reveal finer cluster structure that is invisible in standard t-SNE. We further
demonstrate the striking effect of heavier-tailed kernels on large real-life
data sets such as MNIST, single-cell RNA-sequencing data, and the HathiTrust
library. We use domain knowledge to confirm that the revealed clusters are
meaningful. Overall, we argue that modifying the tail heaviness of the t-SNE
kernel can yield additional insight into the cluster structure of the data.
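
To make the role of the degree of freedom concrete, the following is a minimal sketch of a Student-t similarity kernel evaluated at a low-dimensional distance d. The function name `t_kernel` and the exact form (1 + d²/ν)^(−(ν+1)/2) are assumptions taken from the textbook t-distribution density, not from the paper, whose own parametrisation and scaling may differ. The sketch only illustrates that a smaller degree of freedom gives a heavier tail.

```python
import math


def t_kernel(d, nu):
    """Unnormalised Student-t similarity for distance d (assumed textbook form).

    nu = 1 gives the Cauchy kernel 1 / (1 + d^2) of standard t-SNE;
    nu = math.inf gives the Gaussian limit exp(-d^2 / 2) used by SNE.
    """
    if math.isinf(nu):
        return math.exp(-d * d / 2.0)
    return (1.0 + d * d / nu) ** (-(nu + 1.0) / 2.0)


if __name__ == "__main__":
    # Similarity assigned to a pair of points three units apart,
    # from a heavier-than-Cauchy tail to the Gaussian limit.
    for nu in (0.5, 1.0, 10.0, math.inf):
        print(f"nu={nu}: k(3) = {t_kernel(3.0, nu):.4f}")
```

At a fixed distance the kernel value grows as the degree of freedom shrinks, so distant pairs keep more similarity in the embedding.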
Stochastic Neighbor Embedding with Gaussian and Student-t Distributions: Tutorial and Survey
Stochastic Neighbor Embedding (SNE) is a manifold learning and dimensionality
reduction method with a probabilistic approach. In SNE, every point is considered
to be a neighbor of every other point with some probability, and the embedding
tries to preserve this probability. SNE uses a Gaussian distribution for the
probability in both the input and embedding spaces. t-SNE, by contrast, uses a
Gaussian distribution in the input space and a Student-t distribution in the
embedding space. In this tutorial and survey paper, we explain SNE,
symmetric SNE, t-SNE (or Cauchy-SNE), and t-SNE with general degrees of
freedom. We also cover the out-of-sample extension and acceleration for these
methods. Some simulations to visualize the embeddings are also provided.
Comment: To appear as a part of an upcoming academic book on dimensionality
reduction and manifold learning
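
As an illustration of the neighbor probabilities described in this abstract, the sketch below computes Gaussian conditional probabilities in the input space by normalising each point's kernel weights over all other points. The function name `sne_conditional_probabilities` and the single shared bandwidth `sigma` are assumptions made for brevity. SNE normally selects a separate bandwidth for each point from a target perplexity, which is omitted here.

```python
import math


def sne_conditional_probabilities(points, sigma=1.0):
    """Return P where P[i][j] is the probability that point i picks j as its neighbor.

    Uses a Gaussian kernel with one fixed bandwidth (a simplifying assumption);
    a point is never its own neighbor, so the diagonal is zero.
    """
    n = len(points)
    probabilities = []
    for i in range(n):
        weights = []
        for j in range(n):
            if i == j:
                weights.append(0.0)
                continue
            sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            weights.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
        total = sum(weights)
        # Normalise so that each row is a probability distribution.
        probabilities.append([w / total for w in weights])
    return probabilities


if __name__ == "__main__":
    toy_points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
    for row in sne_conditional_probabilities(toy_points):
        print([round(p, 4) for p in row])
```

Each row sums to one, and closer points receive higher probability. The embedding step, not shown, would then search for low-dimensional coordinates whose own neighbor probabilities match these.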
Manifold Learning in Atomistic Simulations: A Conceptual Review
Analyzing large volumes of high-dimensional data requires dimensionality
reduction: finding meaningful low-dimensional structures hidden in their
high-dimensional observations. Such practice is needed in atomistic simulations
of complex systems where even thousands of degrees of freedom are sampled. An
abundance of such data makes gaining insight into a specific physical problem
strenuous. Our primary aim in this review is to focus on unsupervised machine
learning methods that can be used on simulation data to find a low-dimensional
manifold providing a collective and informative characterization of the studied
process. Such manifolds can be used for sampling long-timescale processes and
free-energy estimation. We describe methods that can work on datasets from
standard and enhanced sampling atomistic simulations. Unlike recent reviews on
manifold learning for atomistic simulations, we consider only methods that
construct low-dimensional manifolds based on Markov transition probabilities
between high-dimensional samples. We discuss these techniques from a conceptual
point of view, including their underlying theoretical frameworks and possible
limitations.