Link-Prediction Enhanced Consensus Clustering for Complex Networks
Many real networks that are inferred or collected from data are incomplete
due to missing edges. Missing edges can be inherent to the dataset (Facebook
friend links will never be complete) or the result of sampling (one may only
have access to a portion of the data). The consequence is that downstream
analyses that consume the network will often yield less accurate results than
if the edges were complete. Community detection algorithms, in particular,
often suffer when critical intra-community edges are missing. We propose a
novel consensus clustering algorithm to enhance community detection on
incomplete networks. Our framework applies existing community detection
algorithms to networks imputed by our link-prediction-based algorithm, then
merges their multiple outputs into a final consensus output. On average, our
method boosts the performance of existing algorithms by 7% on artificial data
and by 17% on ego networks collected from Facebook.
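The abstract outlines a three-step pipeline: impute likely missing edges with a link predictor, run a base community detection algorithm on each imputed network, and merge the runs into a consensus partition. The sketch below illustrates that flow; the choice of link predictor (Jaccard coefficient), base detector (Louvain), and the majority-vote merge are illustrative assumptions, not the paper's exact components.

```python
# Minimal sketch of a link-prediction-enhanced consensus clustering pipeline.
# The predictor, detector, and merge rule here are assumptions for illustration.
import networkx as nx

def impute_edges(G, top_k=50):
    """Add the top_k non-edges ranked by Jaccard similarity (stand-in predictor)."""
    scored = nx.jaccard_coefficient(G)  # scores every non-edge of G
    best = sorted(scored, key=lambda t: t[2], reverse=True)[:top_k]
    H = G.copy()
    H.add_edges_from((u, v) for u, v, _ in best)
    return H

def consensus_clusters(G, runs=10, top_k=50):
    """Cluster several imputed copies of G, then merge by co-membership voting."""
    labelings = []
    for seed in range(runs):
        H = impute_edges(G, top_k=top_k)
        communities = nx.community.louvain_communities(H, seed=seed)
        labelings.append({v: i for i, com in enumerate(communities) for v in com})
    # Consensus: connect two nodes if a majority of runs co-assigned them,
    # then read the final communities off the connected components.
    D = nx.Graph()
    D.add_nodes_from(G)
    nodes = list(G)
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if sum(l[u] == l[v] for l in labelings) > runs / 2:
                D.add_edge(u, v)
    return list(nx.connected_components(D))
```

Note the quadratic co-membership vote is the simplest possible merge; it is fine for small ego networks but would need a sparser consensus step at scale.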
Physical Representation-based Predicate Optimization for a Visual Analytics Database
Querying the content of images, video, and other non-textual data sources
requires expensive content extraction methods. Modern extraction techniques are
based on deep convolutional neural networks (CNNs) and can classify objects
within images with astounding accuracy. Unfortunately, these methods are slow:
processing a single image can take about 10 milliseconds on modern GPU-based
hardware. As massive video libraries become ubiquitous, running a content-based
query over millions of video frames is prohibitive.
One promising approach to reduce the runtime cost of queries of visual
content is to use a hierarchical model, such as a cascade, where simple cases
are handled by an inexpensive classifier. Prior work has sought to design
cascades that optimize the computational cost of inference by, for example,
using smaller CNNs. However, we observe that there are critical factors besides
the inference time that dramatically impact the overall query time. Notably, by
treating the physical representation of the input image as part of our query
optimization---that is, by including image transforms, such as resolution
scaling or color-depth reduction, within the cascade---we can optimize data
handling costs and enable drastically more efficient classifier cascades.
In this paper, we propose Tahoma, which generates and evaluates many
potential classifier cascades that jointly optimize the CNN architecture and
input data representation. Our experiments on a subset of ImageNet show that
Tahoma's input transformations speed up cascades by up to 35 times. We also
find up to a 98x speedup over the ResNet50 classifier with no loss in accuracy,
and a 280x speedup if some accuracy is sacrificed.
Comment: Camera-ready version of the paper submitted to ICDE 2019, in
Proceedings of the 35th IEEE International Conference on Data Engineering
(ICDE 2019).
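To make the "physical representation as part of the cascade" idea concrete, here is a minimal sketch in which each stage pairs an image transform (resolution scaling, color-depth reduction) with a cheap classifier and an early-exit threshold. The stage models, sizes, and thresholds are placeholders; Tahoma's contribution is searching over many such (transform, model, threshold) plans, which this sketch does not attempt.

```python
# Sketch of a classifier cascade whose stages transform the input's physical
# representation before classifying. Models and thresholds are hypothetical.
from PIL import Image

def to_representation(img, size, grayscale=False):
    """Apply a stage's physical transform: resolution scaling and color depth."""
    out = img.resize((size, size))
    return out.convert("L") if grayscale else out

def cascade_predict(img, stages, fallback):
    """Run stages in order; early-exit as soon as a stage is confident enough."""
    for transform_args, model, threshold in stages:
        x = to_representation(img, **transform_args)
        label, confidence = model(x)   # cheap classifier on the cheap input
        if confidence >= threshold:
            return label               # skip all remaining, costlier stages
    return fallback(img)               # expensive full-resolution reference CNN

# Hypothetical plan: a tiny grayscale stage, a mid-resolution color stage,
# then the reference model (e.g., ResNet50) on the original frame:
#   stages = [({"size": 32, "grayscale": True},  cheap_model_a, 0.95),
#             ({"size": 96, "grayscale": False}, cheap_model_b, 0.90)]
#   label = cascade_predict(frame, stages, fallback=resnet50_predict)
```

The design point the paper stresses is that the transforms cut data-handling cost (decode, transfer, memory) as well as inference cost, which is why they belong inside the query optimizer's search space.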
SeeSaw: Interactive Ad-hoc Search Over Image Databases
As image datasets become ubiquitous, the problem of ad-hoc searches over
image data is increasingly important. Many high-level data tasks in machine
learning, such as constructing datasets for training and testing object
detectors, imply finding ad-hoc objects or scenes within large image datasets
as a key sub-problem. New foundational visual-semantic embeddings trained on
massive web datasets such as Contrastive Language-Image Pre-Training (CLIP) can
help users start searches on their own data, but we find there is a long tail
of queries where these models fall short in practice. SeeSaw is a system for
interactive ad-hoc searches on image datasets that integrates state-of-the-art
embeddings like CLIP with user feedback in the form of box annotations to help
users quickly locate images of interest in their data even in the long tail of
harder queries. One key challenge for SeeSaw is that, in practice, many
sensible approaches to incorporating feedback into future results, including
state-of-the-art active-learning algorithms, can worsen results compared to
introducing no feedback, partly due to CLIP's high average performance.
Therefore, SeeSaw includes several algorithms that empirically result in larger
and also more consistent improvements. We compare SeeSaw's accuracy to both
using CLIP alone and to a state-of-the-art active-learning baseline and find
SeeSaw consistently helps improve results for users across four datasets and
more than a thousand queries. SeeSaw increases Average Precision (AP) on search
tasks by an average of .08 on a wide benchmark (from a base of .72), and by
.27 on a subset of more difficult queries where CLIP alone performs poorly.
Comment: SIGMOD 2024 camera-ready.
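As a rough illustration of the interactive loop described above, the sketch below scores images by cosine similarity against a CLIP query vector and folds box-annotation feedback back into the query with a Rocchio-style update. The update rule is a generic stand-in, not SeeSaw's algorithm; as the abstract notes, naive feedback schemes of this kind can actually underperform plain CLIP, which is the problem SeeSaw's algorithms address.

```python
# Generic embedding-search-with-feedback loop; the Rocchio update is an
# assumption for illustration, not SeeSaw's actual feedback algorithm.
import numpy as np

def search(query_vec, image_embs, k=20):
    """Return indices of the top-k images by cosine similarity to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    E = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    return np.argsort(-(E @ q))[:k]

def update_query(query_vec, pos_embs, neg_embs, alpha=0.5, beta=0.2):
    """Move the query toward embeddings of user-marked positive boxes
    and away from marked negatives, then renormalize."""
    q = query_vec.copy()
    if len(pos_embs):
        q = q + alpha * np.mean(pos_embs, axis=0)
    if len(neg_embs):
        q = q - beta * np.mean(neg_embs, axis=0)
    return q / np.linalg.norm(q)
```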
Runtime Support for Human-in-the-Loop Feature Engineering Systems
A machine learning system is only as good as its feature
- …