Ambient Sound Provides Supervision for Visual Learning
The sound of crashing waves, the roar of fast-moving cars -- sound conveys
important information about the objects in our surroundings. In this work, we
show that ambient sounds can be used as a supervisory signal for learning
visual models. To demonstrate this, we train a convolutional neural network to
predict a statistical summary of the sound associated with a video frame. We
show that, through this process, the network learns a representation that
conveys information about objects and scenes. We evaluate this representation
on several recognition tasks, finding that its performance is comparable to
that of other state-of-the-art unsupervised learning methods. Finally, we show
through visualizations that the network learns units that are selective to
objects that are often associated with characteristic sounds.
Comment: ECCV 201
Image retrieval with hierarchical matching pursuit
A novel representation of images for image retrieval is introduced in this
paper, by using a new type of feature with remarkable discriminative power.
Despite the multi-scale nature of objects, most existing models perform feature
extraction on a fixed scale, which will inevitably degrade the performance of
the whole system. Motivated by this, we introduce a hierarchical sparse coding
architecture for image retrieval to explore multi-scale cues. Sparse codes
extracted on lower layers are transmitted to higher layers recursively. With
this mechanism, cues from different scales are fused. Experiments on the
Holidays dataset show that the proposed method achieves excellent retrieval
performance with a small code length.
Comment: 5 pages, 6 figures, conferenc
Salient object subitizing
We study the problem of salient object subitizing, i.e., predicting the existence and the number of salient objects in an image using holistic cues. This task is inspired by the ability of people to quickly and accurately identify the number of items within the subitizing range (1–4). To this end, we present a salient object subitizing image dataset of about 14K everyday images, which are annotated using an online crowdsourcing marketplace. We show that using an end-to-end trained convolutional neural network (CNN) model, we achieve prediction accuracy comparable to human performance in identifying images with zero or one salient object. For images with multiple salient objects, our model also provides significantly better than chance performance without requiring any localization process. Moreover, we propose a method to improve the training of the CNN subitizing model by leveraging synthetic images. In experiments, we demonstrate the accuracy and generalizability of our CNN subitizing model and its applications in salient object detection and image retrieval.
This research was supported in part by US NSF Grants 0910908 and 1029430, and gifts from Adobe and NVIDIA. (0910908 - US NSF; 1029430 - US NSF)
https://arxiv.org/abs/1607.07525
https://arxiv.org/pdf/1607.07525.pdf
Accepted manuscrip
- …