The vast amounts of data collected in various domains pose great challenges
to modern data exploration and analysis. To find "interesting" objects in large
databases, users typically define a query using positive and negative example
objects and train a classification model to identify the objects of interest in
the entire data catalog. However, this approach requires scanning all the data
to apply the classification model to each instance in the catalog, which makes
it prohibitively expensive for large-scale databases that serve many users and
queries interactively. In this work, we propose a novel
framework for such search-by-classification scenarios that allows users to
interactively search for target objects by specifying queries through a small
set of positive and negative examples. Unlike previous approaches, our
framework can rapidly answer such queries at low cost without scanning the
entire database. Our framework is based on an index-aware construction scheme
for decision trees and random forests that transforms the inference phase of
these classification models into a set of range queries, which in turn can be
efficiently executed by leveraging multidimensional indexing structures. Our
experiments show that queries over large data catalogs with hundreds of
millions of objects can be processed in a few seconds using a single server,
compared to the hours needed by classical scanning-based approaches.
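
To illustrate the central idea of turning tree inference into range queries, the following is a minimal sketch and not the framework's actual implementation. It assumes a small hand-built decision tree (the `TREE` dictionary, its thresholds, and all function names are hypothetical). Each leaf that predicts the positive class corresponds to an axis-aligned box in feature space, so the set of objects classified as positive equals the union of the results of one range query per positive leaf. The `range_query` function here is a linear filter that only stands in for a lookup in a multidimensional index; it does not show the index-aware construction scheme itself.

```python
import math

# Hypothetical hand-built tree: an internal node tests x[feature] <= threshold
# and goes left if the test holds, right otherwise. Leaves carry a class label.
TREE = {
    "feature": 0, "threshold": 0.5,
    "left": {"label": 0},
    "right": {
        "feature": 1, "threshold": 0.3,
        "left": {"label": 1},
        "right": {"label": 0},
    },
}


def predict(node, x):
    """Classical per-object inference: walk from the root to a leaf."""
    while "label" not in node:
        go_left = x[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


def positive_boxes(node, n_dims, box=None):
    """Collect one box per positive leaf.

    A box is a list of (low, high) pairs, one per dimension, describing the
    half-open interval low < x <= high that the path to the leaf imposes.
    """
    if box is None:
        box = [(-math.inf, math.inf)] * n_dims
    if "label" in node:
        return [box] if node["label"] == 1 else []
    f, t = node["feature"], node["threshold"]
    low, high = box[f]
    left_box = list(box)
    left_box[f] = (low, min(high, t))    # left branch: x[f] <= t
    right_box = list(box)
    right_box[f] = (max(low, t), high)   # right branch: x[f] > t
    return (positive_boxes(node["left"], n_dims, left_box)
            + positive_boxes(node["right"], n_dims, right_box))


def range_query(points, box):
    """Stand-in for an index lookup (assumption: a real system would query a
    multidimensional index here instead of filtering every point)."""
    return [i for i, p in enumerate(points)
            if all(low < v <= high for v, (low, high) in zip(p, box))]


def search(points, tree, n_dims):
    """Answer the classification query as a union of range queries."""
    hits = set()
    for box in positive_boxes(tree, n_dims):
        hits.update(range_query(points, box))
    return sorted(hits)
```

In this sketch, `search` returns exactly the indices for which `predict` yields the positive label, because the boxes of the leaves partition the feature space. For a random forest, one plausible extension would be to collect boxes per tree and combine the per-tree results by voting; that step is omitted here.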