Spatio-Temporal Surrogates for Interaction of a Jet with High Explosives: Part II -- Clustering Extremely High-Dimensional Grid-Based Data
Building an accurate surrogate model for the spatio-temporal outputs of a
computer simulation is a challenging task. A simple approach to improve the
accuracy of the surrogate is to cluster the outputs based on similarity and
build a separate surrogate model for each cluster. This clustering is
relatively straightforward when the output at each time step is of moderate
size. However, when the spatial domain is represented by a large number of grid
points, numbering in the millions, the clustering of the data becomes more
challenging. In this report, we consider output data from simulations of a jet
interacting with high explosives. These data are available on spatial domains
of different sizes, at grid points that vary in their spatial coordinates, and
in a format that distributes the output across multiple files at each time step
of the simulation. We first describe how we bring these data into a consistent
format prior to clustering. Borrowing the idea of random projections from data
mining, we reduce the dimension of our data by a factor of a thousand, making it
possible to use the iterative k-means method for clustering. We show how we can
use the randomness of both the random projections and the choice of initial
centroids in k-means clustering to determine the number of clusters in our
data set. Our approach makes clustering of extremely high-dimensional data
tractable, generating meaningful cluster assignments for our problem, despite
the approximation introduced in the random projections.
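As a rough illustration of the idea in the abstract above, the following Python sketch projects high-dimensional rows with a Gaussian random matrix and then runs a plain Lloyd-style k-means on the reduced data. The function names (`random_project`, `kmeans`, `pair_agreement`), the choice of a Gaussian projection matrix, and the synthetic data are assumptions made for illustration; they are not taken from the report. `pair_agreement` is one hypothetical way to compare cluster assignments obtained from different random projections and initial centroids.

```python
import numpy as np


def random_project(X, target_dim, rng):
    """Project the rows of X to target_dim dimensions with a Gaussian matrix.

    Scaling by 1/sqrt(target_dim) approximately preserves pairwise distances
    (assumption: a dense Gaussian projection; other constructions exist).
    """
    d = X.shape[1]
    R = rng.standard_normal((d, target_dim)) / np.sqrt(target_dim)
    return X @ R


def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd's k-means with randomly chosen initial centroids."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster is empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels


def pair_agreement(a, b):
    """Fraction of point pairs on which two labelings agree (Rand index)."""
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    iu = np.triu_indices(len(a), k=1)
    return float((same_a[iu] == same_b[iu]).mean())
```

Under these assumptions, one could repeat `random_project` followed by `kmeans` with different seeds for several candidate values of `k` and prefer the value for which `pair_agreement` between runs stays high. This is only a stability heuristic in the spirit of the abstract, not the report's actual procedure.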
Innovation Pursuit: A New Approach to Subspace Clustering
In subspace clustering, a group of data points belonging to a union of
subspaces are assigned membership to their respective subspaces. This paper
presents a new approach dubbed Innovation Pursuit (iPursuit) to the problem of
subspace clustering using a new geometrical idea whereby subspaces are
identified based on their relative novelties. We present two frameworks in
which the idea of innovation pursuit is used to distinguish the subspaces.
Underlying the first framework is an iterative method that finds the subspaces
consecutively by solving a series of simple linear optimization problems, each
searching for a direction of innovation in the span of the data potentially
orthogonal to all subspaces except for the one to be identified in one step of
the algorithm. A detailed mathematical analysis is provided establishing
sufficient conditions for iPursuit to correctly cluster the data. The proposed
approach can provably yield exact clustering even when the subspaces have
significant intersections. It is shown that the complexity of the iterative
approach scales only linearly in the number of data points and subspaces, and
quadratically in the dimension of the subspaces. The second framework
integrates iPursuit with spectral clustering to yield a new variant of
spectral-clustering-based algorithms. The numerical simulations with both real
and synthetic data demonstrate that iPursuit can often outperform the
state-of-the-art subspace clustering algorithms, more so for subspaces with
significant intersections, and that it significantly improves the
state-of-the-art result for subspace-segmentation-based face clustering.
Structural Variability from Noisy Tomographic Projections
In cryo-electron microscopy, the 3D electric potentials of an ensemble of
molecules are projected along arbitrary viewing directions to yield noisy 2D
images. The volume maps representing these potentials typically exhibit a great
deal of structural variability, which is described by their 3D covariance
matrix. Typically, this covariance matrix is approximately low-rank and can be
used to cluster the volumes or estimate the intrinsic geometry of the
conformation space. We formulate the estimation of this covariance matrix as a
linear inverse problem, yielding a consistent least-squares estimator. For
$n$ images of size $N$-by-$N$ pixels, we propose an algorithm for calculating this
covariance estimator with computational complexity
$\mathcal{O}(nN^4 + \sqrt{\kappa} N^6 \log N)$, where the condition number
$\kappa$ is empirically in the range $10$--$200$. Its efficiency relies on the
observation that the normal equations are equivalent to a deconvolution problem
in 6D. This is then solved by the conjugate gradient method with an appropriate
circulant preconditioner. The result is the first computationally efficient
algorithm for consistent estimation of 3D covariance from noisy projections. It
also compares favorably in runtime with respect to previously proposed
non-consistent estimators. Motivated by the recent success of eigenvalue
shrinkage procedures for high-dimensional covariance matrices, we introduce a
shrinkage procedure that improves accuracy at lower signal-to-noise ratios. We
evaluate our methods on simulated datasets and achieve classification results
comparable to state-of-the-art methods in shorter running time. We also present
results on clustering volumes in an experimental dataset, illustrating the
power of the proposed algorithm for practical determination of structural
variability.
Comment: 52 pages, 11 figures
Recovering the Optimal Solution by Dual Random Projection
Random projection has been widely used in data classification. It maps
high-dimensional data into a low-dimensional subspace in order to reduce the
computational cost of solving the related optimization problem. While previous
studies have focused on analyzing the classification performance of random
projection, in this work we consider the recovery problem, i.e., how to
accurately recover the optimal solution to the original optimization problem in
the high-dimensional space based on the solution learned from the subspace
spanned by random projections. We present a simple algorithm, termed Dual
Random Projection, that uses the dual solution of the low-dimensional
optimization problem to recover the optimal solution to the original problem.
Our theoretical analysis shows that, with high probability, the proposed
algorithm is able to accurately recover the optimal solution to the original
problem, provided that the data matrix is of low rank or can be well
approximated by a low-rank matrix.
Comment: The 26th Annual Conference on Learning Theory (COLT 2013)
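The recovery idea in the abstract above can be sketched on a toy problem. The Python example below uses ridge regression rather than a classification loss, because its dual solution has a closed form; that substitution, the function names (`ridge_dual`, `demo`), and all parameter values are assumptions for illustration, not the paper's setup. It compares a naive recovery, which maps the low-dimensional primal solution back through the projection matrix, with a dual-based recovery, which plugs the low-dimensional dual solution into the original primal-dual relation `w = X.T @ a`.

```python
import numpy as np


def ridge_dual(K, y, lam):
    """Dual solution a = (K + lam*I)^{-1} y of ridge regression,
    where K is the Gram matrix of the (possibly projected) data."""
    return np.linalg.solve(K + lam * np.eye(len(y)), y)


def demo(n=100, d=2000, rank=3, m=400, lam=1.0, seed=0):
    """Return relative errors (naive, dual) of two recovery strategies."""
    rng = np.random.default_rng(seed)

    # Low-rank data matrix X (n samples, d features) and targets y.
    X = rng.standard_normal((n, rank)) @ rng.standard_normal((rank, d))
    y = X @ rng.standard_normal(d) / np.sqrt(d)

    # Reference: exact high-dimensional ridge solution via its dual.
    w_star = X.T @ ridge_dual(X @ X.T, y, lam)

    # Random projection to m dimensions and the low-dimensional problem.
    R = rng.standard_normal((d, m)) / np.sqrt(m)
    X_low = X @ R
    a_low = ridge_dual(X_low @ X_low.T, y, lam)

    # Naive recovery: lift the low-dimensional primal solution z through R.
    z = X_low.T @ a_low
    w_naive = R @ z

    # Dual-based recovery: reuse the low-dimensional dual solution with
    # the original data matrix.
    w_dual = X.T @ a_low

    def rel_err(w):
        return np.linalg.norm(w - w_star) / np.linalg.norm(w_star)

    return rel_err(w_naive), rel_err(w_dual)
```

Calling `demo()` returns the two relative errors. With low-rank data, as in the abstract's assumption, the dual-based recovery would be expected to land much closer to the high-dimensional solution than the naive lift.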