6,524 research outputs found
A System for Induction of Oblique Decision Trees
This article describes a new system for induction of oblique decision trees.
This system, OC1, combines deterministic hill-climbing with two forms of
randomization to find a good oblique split (in the form of a hyperplane) at
each node of a decision tree. Oblique decision tree methods are tuned
especially for domains in which the attributes are numeric, although they can
be adapted to symbolic or mixed symbolic/numeric attributes. We present
extensive empirical studies, using both real and artificial data, that analyze
OC1's ability to construct oblique trees that are smaller and more accurate
than their axis-parallel counterparts. We also examine the benefits of
randomization for the construction of oblique decision trees.Comment: See http://www.jair.org/ for an online appendix and other files
accompanying this articl
How to Find More Supernovae with Less Work: Object Classification Techniques for Difference Imaging
We present the results of applying new object classification techniques to
difference images in the context of the Nearby Supernova Factory supernova
search. Most current supernova searches subtract reference images from new
images, identify objects in these difference images, and apply simple threshold
cuts on parameters such as statistical significance, shape, and motion to
reject objects such as cosmic rays, asteroids, and subtraction artifacts.
Although most static objects subtract cleanly, even a very low false positive
detection rate can lead to hundreds of non-supernova candidates which must be
vetted by human inspection before triggering additional followup. In comparison
to simple threshold cuts, more sophisticated methods such as Boosted Decision
Trees, Random Forests, and Support Vector Machines provide dramatically better
object discrimination. At the Nearby Supernova Factory, we reduced the number
of non-supernova candidates by a factor of 10 while increasing our supernova
identification efficiency. Methods such as these will be crucial for
maintaining a reasonable false positive rate in the automated transient alert
pipelines of upcoming projects such as PanSTARRS and LSST.Comment: 25 pages; 6 figures; submitted to Ap
Angle Tree: Nearest Neighbor Search in High Dimensions with Low Intrinsic Dimensionality
We propose an extension of tree-based space-partitioning indexing structures
for data with low intrinsic dimensionality embedded in a high dimensional
space. We call this extension an Angle Tree. Our extension can be applied to
both classical kd-trees as well as the more recent rp-trees. The key idea of
our approach is to store the angle (the "dihedral angle") between the data
region (which is a low dimensional manifold) and the random hyperplane that
splits the region (the "splitter"). We show that the dihedral angle can be used
to obtain a tight lower bound on the distance between the query point and any
point on the opposite side of the splitter. This in turn can be used to
efficiently prune the search space. We introduce a novel randomized strategy to
efficiently calculate the dihedral angle with a high degree of accuracy.
Experiments and analysis on real and synthetic data sets shows that the Angle
Tree is the most efficient known indexing structure for nearest neighbor
queries in terms of preprocessing and space usage while achieving high accuracy
and fast search time.Comment: To be submitted to IEEE Transactions on Pattern Analysis and Machine
Intelligenc
Analysis of approximate nearest neighbor searching with clustered point sets
We present an empirical analysis of data structures for approximate nearest
neighbor searching. We compare the well-known optimized kd-tree splitting
method against two alternative splitting methods. The first, called the
sliding-midpoint method, which attempts to balance the goals of producing
subdivision cells of bounded aspect ratio, while not producing any empty cells.
The second, called the minimum-ambiguity method is a query-based approach. In
addition to the data points, it is also given a training set of query points
for preprocessing. It employs a simple greedy algorithm to select the splitting
plane that minimizes the average amount of ambiguity in the choice of the
nearest neighbor for the training points. We provide an empirical analysis
comparing these two methods against the optimized kd-tree construction for a
number of synthetically generated data and query sets. We demonstrate that for
clustered data and query sets, these algorithms can provide significant
improvements over the standard kd-tree construction for approximate nearest
neighbor searching.Comment: 20 pages, 8 figures. Presented at ALENEX '99, Baltimore, MD, Jan
15-16, 199
High-dimensional approximate nearest neighbor: k-d Generalized Randomized Forests
We propose a new data-structure, the generalized randomized kd forest, or
kgeraf, for approximate nearest neighbor searching in high dimensions. In
particular, we introduce new randomization techniques to specify a set of
independently constructed trees where search is performed simultaneously, hence
increasing accuracy. We omit backtracking, and we optimize distance
computations, thus accelerating queries. We release public domain software
geraf and we compare it to existing implementations of state-of-the-art methods
including BBD-trees, Locality Sensitive Hashing, randomized kd forests, and
product quantization. Experimental results indicate that our method would be
the method of choice in dimensions around 1,000, and probably up to 10,000, and
pointsets of cardinality up to a few hundred thousands or even one million;
this range of inputs is encountered in many critical applications today. For
instance, we handle a real dataset of images represented in 960
dimensions with a query time of less than sec on average and 90\% responses
being true nearest neighbors
- …