Parallel HOP: A Scalable Halo Finder for Massive Cosmological Data Sets
Modern N-body cosmological simulations contain billions of dark
matter particles. These simulations require hundreds to thousands of gigabytes
of memory, and employ hundreds to tens of thousands of processing cores on many
compute nodes. In order to study the distribution of dark matter in a
cosmological simulation, the dark matter halos must be identified using a halo
finder, which establishes the halo membership of every particle in the
simulation. The resources required for halo finding are similar to the
requirements for the simulation itself. In particular, simulations have become
too large for commonly employed halo finders, and the computational work of
identifying halos must now be spread across multiple nodes and cores. Here we
present a scalable, parallel halo finding method called
Parallel HOP for large-scale cosmological simulation data. Based on the halo
finder HOP, it utilizes MPI and domain decomposition to distribute the halo
finding workload across multiple compute nodes, enabling analysis of much
larger datasets than is possible with the strictly serial or previous parallel
implementations of HOP. We provide a reference implementation of this method as
a part of the toolkit yt, an analysis toolkit for Adaptive Mesh Refinement
(AMR) data that includes complementary analysis modules. Additionally, we
discuss a suite of benchmarks that demonstrate that this method scales well up
to several hundred tasks and to datasets containing billions of particles. The
Parallel HOP method and our implementation can be readily applied to any kind
of N-body simulation data and is therefore widely applicable.
Comment: 29 pages, 11 figures, 2 tables
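The serial HOP rule at the heart of the method can be sketched as follows. This is a toy NumPy version with brute-force neighbor search and no MPI domain decomposition; `hop_groups` and its parameters are illustrative names, not from the paper.

```python
import numpy as np

def hop_groups(pos, n_neighbors=8):
    """Toy HOP sketch: estimate a density per particle, have each
    particle hop to the densest of its nearest neighbors, and follow
    the hop chains to local density maxima, which define the groups."""
    n = len(pos)
    # brute-force pairwise distances (fine for a small illustrative N)
    d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    # crude density estimate: inverse distance to the k-th neighbor
    dens = 1.0 / (np.sort(d, axis=1)[:, n_neighbors] + 1e-12)
    # k nearest neighbors of each particle (index 0 is the particle itself)
    knn = np.argsort(d, axis=1)[:, :n_neighbors + 1]
    # each particle points at the densest of its near neighbors
    target = np.array([knn[i][np.argmax(dens[knn[i]])] for i in range(n)])
    # follow hop chains to a fixed point (a local density maximum)
    group = target.copy()
    for _ in range(n):
        nxt = target[group]
        if np.array_equal(nxt, group):
            break
        group = nxt
    return group
```

In the paper's method this workload is distributed with MPI by decomposing the simulation volume into padded subdomains, so that each task runs the hop step on its local particles.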
Multilevel Weighted Support Vector Machine for Classification on Healthcare Data with Missing Values
This work is motivated by the needs of predictive analytics on healthcare
data as represented by Electronic Medical Records. Such data is invariably
problematic: noisy, with missing entries, with imbalance in classes of
interests, leading to serious bias in predictive modeling. Since standard data
mining methods often produce poor performance measures, we argue for
development of specialized techniques of data-preprocessing and classification.
In this paper, we propose a new method to simultaneously classify large
datasets and reduce the effects of missing values. It is based on a multilevel
framework of the cost-sensitive SVM and the expectation maximization imputation
method for missing values, which relies on iterated regression analyses. We
compare classification results of multilevel SVM-based algorithms on public
benchmark datasets with imbalanced classes and missing values as well as real
data in health applications, and show that our multilevel SVM-based method
produces faster, more accurate, and more robust classification results.
Comment: arXiv admin note: substantial text overlap with arXiv:1503.0625
Simultaneous Spectral-Spatial Feature Selection and Extraction for Hyperspectral Images
In hyperspectral remote sensing data mining, it is important to take into
account both spectral and spatial information, such as the spectral
signature, texture features, and morphological properties, to improve
performance, e.g., image classification accuracy. From a feature
representation point of view, a natural approach to this situation is to
concatenate the spectral and spatial features into a single, high-dimensional
vector and then apply a dimension reduction technique directly to that
concatenated vector before feeding it into the subsequent classifier. However,
multiple features from different domains have different physical meanings and
statistical properties, so such concatenation does not efficiently exploit the
complementary properties among the features, which would otherwise boost
feature discriminability. Furthermore, the transformed results of the
concatenated vector are difficult to interpret. Consequently, finding a
physically meaningful, consensus low-dimensional representation of the
original multiple features remains a challenging task. To address these
issues, we propose a novel feature learning framework, a simultaneous
spectral-spatial feature selection and extraction algorithm, for
spectral-spatial feature representation and classification of hyperspectral
images. Specifically, the proposed method learns a latent low-dimensional
subspace by projecting the spectral-spatial features into a common feature
space, where the complementary information is effectively exploited and,
simultaneously, only the most significant original features are transformed.
Encouraging experimental results on three publicly available hyperspectral
remote sensing datasets confirm that our proposed method is effective and
efficient.
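The selection-plus-extraction idea can be illustrated with a toy NumPy sketch: fit a PCA-style projection on the concatenated features, score each original feature by the norm of its row in the projection matrix, and keep only the top-scoring features in the latent representation. This is NOT the authors' algorithm (which learns the subspace and the selection jointly); `select_and_extract` and its parameters are illustrative.

```python
import numpy as np

def select_and_extract(spectral, spatial, n_components=2, n_keep=5):
    """Toy coupling of extraction and selection: project concatenated
    spectral-spatial features onto principal directions, then retain
    only the original features with the largest projection-row norms."""
    X = np.hstack([spectral, spatial])
    Xc = X - X.mean(axis=0)
    # principal directions via SVD of the centered data matrix
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:n_components].T                   # (features, components)
    scores = np.linalg.norm(W, axis=1)        # per-feature importance
    keep = np.argsort(scores)[::-1][:n_keep]  # most significant features
    Z = Xc[:, keep] @ W[keep]                 # project with kept rows only
    return Z, keep
```

Because only a subset of the original features enters the projection, the latent representation stays interpretable in terms of the original spectral and spatial measurements, which is the motivation the abstract gives.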
Efficient Large-scale Approximate Nearest Neighbor Search on the GPU
We present a new approach for efficient approximate nearest neighbor (ANN)
search in high dimensional spaces, extending the idea of Product Quantization.
We propose a two-level product and vector quantization tree that reduces the
number of vector comparisons required during tree traversal. Our approach also
includes a novel highly parallelizable re-ranking method for candidate vectors
by efficiently reusing already computed intermediate values. Due to its small
memory footprint during traversal, the method lends itself to an efficient,
parallel GPU implementation. This Product Quantization Tree (PQT) approach
significantly outperforms recent state-of-the-art methods for high-dimensional
nearest neighbor queries on standard reference datasets. Ours is the first work
to demonstrate GPU performance superior to CPU performance on high-dimensional,
large-scale ANN problems in time-critical real-world applications, such as
loop closing in videos.
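The product-quantization building blocks underneath this approach, codebook training, encoding, and asymmetric-distance search, can be sketched in NumPy as follows. The paper's two-level tree, re-ranking step, and GPU parallelization are not reproduced; all function names here are illustrative.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Minimal Lloyd's k-means for building PQ codebooks."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        assign = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (assign == j).any():
                C[j] = X[assign == j].mean(axis=0)
    return C

def pq_train(X, m, k):
    """Split vectors into m subvectors; learn a k-word codebook per split."""
    return [kmeans(s, k) for s in np.split(X, m, axis=1)]

def pq_encode(X, codebooks):
    """Encode each vector as m codeword indices, one per subspace."""
    subs = np.split(X, len(codebooks), axis=1)
    return np.stack([np.argmin(((s[:, None] - C[None]) ** 2).sum(-1), axis=1)
                     for s, C in zip(subs, codebooks)], axis=1)

def pq_search(q, codes, codebooks):
    """Asymmetric distance computation: build per-subspace lookup tables
    for the query, sum table entries over each database code, and return
    database indices sorted by approximate distance."""
    qs = np.split(q, len(codebooks))
    tables = [((C - s) ** 2).sum(-1) for s, C in zip(qs, codebooks)]
    dist = sum(t[codes[:, i]] for i, t in enumerate(tables))
    return np.argsort(dist)
```

The lookup-table structure is what makes the method GPU-friendly: the per-query tables are small, and the distance sums over database codes are embarrassingly parallel.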