2,419 research outputs found
A KD-Tree-Based Nearest Neighbor Search for Large Quantities of Data
[[abstract]]The discovery of nearest neighbors, without training in advance, has many applications, such as the formation of mosaic images, image matching, image retrieval and image stitching. When the quantity of data is huge and the number of dimensions is high, the efficient identification of a nearest neighbor (NN) is very important. This study proposes a variation of the KD-tree - the arbitrary KD-tree (KDA) - which is constructed without the need to evaluate variances. Multiple KDAs can be constructed efficiently and possess independent tree structures, when the amount of data is large. Upon testing, using extended synthetic databases and real-world SIFT data, this study concludes that the KDA method increases computational efficiency and produces satisfactory accuracy, when solving NN problems.[[notice]]補正完畢[[journaltype]]國外[[incitationindex]]SCI[[ispeerreviewed]]Y[[booktype]]電子版[[countrycodes]]KO
Tree-Independent Dual-Tree Algorithms
Dual-tree algorithms are a widely used class of branch-and-bound algorithms.
Unfortunately, developing dual-tree algorithms for use with different trees and
problems is often complex and burdensome. We introduce a four-part logical
split: the tree, the traversal, the point-to-point base case, and the pruning
rule. We provide a meta-algorithm which allows development of dual-tree
algorithms in a tree-independent manner and easy extension to entirely new
types of trees. Representations are provided for five common algorithms; for
k-nearest neighbor search, this leads to a novel, tighter pruning bound. The
meta-algorithm also allows straightforward extensions to massively parallel
settings.Comment: accepted in ICML 201
K-nearest Neighbor Search by Random Projection Forests
K-nearest neighbor (kNN) search has wide applications in many areas,
including data mining, machine learning, statistics and many applied domains.
Inspired by the success of ensemble methods and the flexibility of tree-based
methodology, we propose random projection forests (rpForests), for kNN search.
rpForests finds kNNs by aggregating results from an ensemble of random
projection trees with each constructed recursively through a series of
carefully chosen random projections. rpForests achieves a remarkable accuracy
in terms of fast decay in the missing rate of kNNs and that of discrepancy in
the kNN distances. rpForests has a very low computational complexity. The
ensemble nature of rpForests makes it easily run in parallel on multicore or
clustered computers; the running time is expected to be nearly inversely
proportional to the number of cores or machines. We give theoretical insights
by showing the exponential decay of the probability that neighboring points
would be separated by ensemble random projection trees when the ensemble size
increases. Our theory can be used to refine the choice of random projections in
the growth of trees, and experiments show that the effect is remarkable.Comment: 15 pages, 4 figures, 2018 IEEE Big Data Conferenc
Parallel HOP: A Scalable Halo Finder for Massive Cosmological Data Sets
Modern N-body cosmological simulations contain billions () of dark
matter particles. These simulations require hundreds to thousands of gigabytes
of memory, and employ hundreds to tens of thousands of processing cores on many
compute nodes. In order to study the distribution of dark matter in a
cosmological simulation, the dark matter halos must be identified using a halo
finder, which establishes the halo membership of every particle in the
simulation. The resources required for halo finding are similar to the
requirements for the simulation itself. In particular, simulations have become
too extensive to use commonly-employed halo finders, such that the
computational requirements to identify halos must now be spread across multiple
nodes and cores. Here we present a scalable-parallel halo finding method called
Parallel HOP for large-scale cosmological simulation data. Based on the halo
finder HOP, it utilizes MPI and domain decomposition to distribute the halo
finding workload across multiple compute nodes, enabling analysis of much
larger datasets than is possible with the strictly serial or previous parallel
implementations of HOP. We provide a reference implementation of this method as
a part of the toolkit yt, an analysis toolkit for Adaptive Mesh Refinement
(AMR) data that includes complementary analysis modules. Additionally, we
discuss a suite of benchmarks that demonstrate that this method scales well up
to several hundred tasks and datasets in excess of particles. The
Parallel HOP method and our implementation can be readily applied to any kind
of N-body simulation data and is therefore widely applicable.Comment: 29 pages, 11 figures, 2 table
Data Management and Mining in Astrophysical Databases
We analyse the issues involved in the management and mining of astrophysical
data. The traditional approach to data management in the astrophysical field is
not able to keep up with the increasing size of the data gathered by modern
detectors. An essential role in the astrophysical research will be assumed by
automatic tools for information extraction from large datasets, i.e. data
mining techniques, such as clustering and classification algorithms. This asks
for an approach to data management based on data warehousing, emphasizing the
efficiency and simplicity of data access; efficiency is obtained using
multidimensional access methods and simplicity is achieved by properly handling
metadata. Clustering and classification techniques, on large datasets, pose
additional requirements: computational and memory scalability with respect to
the data size, interpretability and objectivity of clustering or classification
results. In this study we address some possible solutions.Comment: 10 pages, Late
Conditional independence testing based on a nearest-neighbor estimator of conditional mutual information
Conditional independence testing is a fundamental problem underlying causal
discovery and a particularly challenging task in the presence of nonlinear and
high-dimensional dependencies. Here a fully non-parametric test for continuous
data based on conditional mutual information combined with a local permutation
scheme is presented. Through a nearest neighbor approach, the test efficiently
adapts also to non-smooth distributions due to strongly nonlinear dependencies.
Numerical experiments demonstrate that the test reliably simulates the null
distribution even for small sample sizes and with high-dimensional conditioning
sets. The test is better calibrated than kernel-based tests utilizing an
analytical approximation of the null distribution, especially for non-smooth
densities, and reaches the same or higher power levels. Combining the local
permutation scheme with the kernel tests leads to better calibration, but
suffers in power. For smaller sample sizes and lower dimensions, the test is
faster than random fourier feature-based kernel tests if the permutation scheme
is (embarrassingly) parallelized, but the runtime increases more sharply with
sample size and dimensionality. Thus, more theoretical research to analytically
approximate the null distribution and speed up the estimation for larger sample
sizes is desirable.Comment: 17 pages, 12 figures, 1 tabl
- …