Visual Search at Pinterest
We demonstrate that, with the availability of distributed computation
platforms such as Amazon Web Services and open-source tools, it is possible for
a small engineering team to build, launch, and maintain a cost-effective,
large-scale visual search system with widely available tools. We also
demonstrate, through a comprehensive set of live experiments at Pinterest, that
content recommendation powered by visual search improves user engagement. By
sharing our implementation details and the lessons learned from launching a
commercial visual search engine from scratch, we hope visual search will be more
widely incorporated into today's commercial applications.
Comment: in Proceedings of the 21st ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, 2015
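The abstract describes the system at a high level rather than its algorithms, but the standard core of such a system is: embed each image with a CNN, index the embeddings, and answer queries by nearest-neighbor search. Below is a minimal, self-contained sketch of that loop; the `embed` function is a hypothetical stand-in for a real CNN feature extractor, and none of this is Pinterest's actual code.

```python
# Minimal visual-search sketch (not Pinterest's actual pipeline):
# embed images, index the embeddings, answer queries by cosine
# nearest-neighbor search over the index.
import numpy as np

def embed(images: np.ndarray) -> np.ndarray:
    """Stand-in for a CNN feature extractor; returns L2-normalized vectors."""
    feats = images.reshape(len(images), -1).astype(np.float32)  # placeholder features
    return feats / np.linalg.norm(feats, axis=1, keepdims=True)

def build_index(catalog_images: np.ndarray) -> np.ndarray:
    return embed(catalog_images)

def search(index: np.ndarray, query_image: np.ndarray, k: int = 5) -> np.ndarray:
    q = embed(query_image[None])[0]
    scores = index @ q                  # cosine similarity (vectors are unit norm)
    return np.argsort(-scores)[:k]      # indices of the k most similar items

# toy usage: 100 random 8x8 grayscale "images"
catalog = np.random.rand(100, 8, 8)
index = build_index(catalog)
print(search(index, catalog[3]))        # item 3 should rank itself first
```

At production scale an exact dot-product scan would be replaced by an approximate nearest-neighbor index, which is the kind of widely available tooling the abstract alludes to.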
Bilinear CNNs for Fine-grained Visual Recognition
We present a simple and effective architecture for fine-grained visual
recognition called Bilinear Convolutional Neural Networks (B-CNNs). These
networks represent an image as a pooled outer product of features derived from
two CNNs and capture localized feature interactions in a translationally
invariant manner. B-CNNs belong to the class of orderless texture
representations, but unlike prior work they can be trained in an end-to-end
manner. Our most accurate model obtains 84.1%, 79.4%, 86.9%, and 91.3% per-image
accuracy on the Caltech-UCSD birds [67], NABirds [64], FGVC aircraft [42], and
Stanford cars [33] datasets, respectively, and runs at 30 frames per second on
an NVIDIA Titan X GPU. We then present a systematic analysis of these networks
and show that (1) the bilinear features are highly redundant and can be reduced
in size by an order of magnitude without significant loss in accuracy, (2) the
features are also effective for other image classification tasks such as texture
and scene recognition, and (3) the networks can be trained from scratch on the
ImageNet dataset, offering consistent improvements over the baseline
architecture. Finally, we present visualizations of these models on various
datasets using top activations of neural units and gradient-based inversion
techniques. The source code for the complete system is available at
http://vis-www.cs.umass.edu/bcnn
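The pooled outer product the abstract describes is straightforward to write down. The sketch below shows the common B-CNN recipe of outer-product pooling followed by signed square-root and L2 normalization; the normalization details are illustrative, and the authors' released code should be treated as authoritative.

```python
# Sketch of bilinear pooling: represent an image as the pooled outer
# product of two CNN feature maps, then signed-sqrt and L2 normalize.
import numpy as np

def bilinear_pool(feat_a: np.ndarray, feat_b: np.ndarray) -> np.ndarray:
    """feat_a: (c1, h*w) and feat_b: (c2, h*w) feature maps from two CNNs."""
    n = feat_a.shape[1]
    bilinear = feat_a @ feat_b.T / n            # (c1, c2) outer products, averaged
    vec = bilinear.ravel()
    vec = np.sign(vec) * np.sqrt(np.abs(vec))   # signed square-root normalization
    return vec / (np.linalg.norm(vec) + 1e-12)  # L2 normalization

# toy usage: two 64-channel feature maps over a 7x7 spatial grid
fa = np.random.rand(64, 49)
fb = np.random.rand(64, 49)
print(bilinear_pool(fa, fb).shape)              # (4096,)
```

Because the sum over spatial locations discards where each interaction occurred, the representation is orderless, which is exactly what makes it translationally invariant.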
A review of EO image information mining
We analyze the state of the art of content-based retrieval in Earth
observation image archives, focusing on complete systems showing promise for
operational implementation. The different paradigms at the basis of the main
system families are introduced. The approaches taken are analyzed, focusing in
particular on the phases after primitive feature extraction. The solutions
envisaged for the issues related to feature simplification and synthesis,
indexing, and semantic labeling are reviewed. The methodologies for query
specification and execution are analyzed.
Automatic Attribute Discovery with Neural Activations
How can a machine learn to recognize visual attributes emerging out of online
community without a definitive supervised dataset? This paper proposes an
automatic approach to discover and analyze visual attributes from a noisy
collection of image-text data on the Web. Our approach is based on the
relationship between attributes and neural activations in the deep network. We
characterize the visual property of the attribute word as a divergence within
weakly-annotated set of images. We show that the neural activations are useful
for discovering and learning a classifier that well agrees with human
perception from the noisy real-world Web data. The empirical study suggests the
layered structure of the deep neural networks also gives us insights into the
perceptual depth of the given word. Finally, we demonstrate that we can utilize
highly-activating neurons for finding semantically relevant regions.Comment: ECCV 201
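One way to make the "divergence within a weakly annotated set" concrete: compare the distribution of a neuron's activations on images whose noisy Web text contains the candidate word against images whose text does not. The sketch below uses a symmetric KL divergence over activation histograms; the paper's exact measure and network layer may differ, so treat this purely as an illustration of the idea.

```python
# Illustrative attribute-visualness score: divergence of a neuron's
# activation distribution between weakly tagged and untagged images.
import numpy as np

def activation_divergence(acts_with: np.ndarray, acts_without: np.ndarray,
                          bins: int = 32) -> float:
    lo = min(acts_with.min(), acts_without.min())
    hi = max(acts_with.max(), acts_without.max())
    p, _ = np.histogram(acts_with, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(acts_without, bins=bins, range=(lo, hi), density=True)
    p = p + 1e-8; q = q + 1e-8                  # smooth to avoid log(0)
    p /= p.sum(); q /= q.sum()
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * (kl(p, q) + kl(q, p))          # symmetric KL divergence

# A visually grounded word should separate the two distributions more.
with_word = np.random.normal(2.0, 1.0, 1000)    # activations on tagged images
without = np.random.normal(0.0, 1.0, 1000)      # activations on the rest
print(activation_divergence(with_word, without))
```

Words with high divergence are plausibly visual attributes; words with near-zero divergence are likely non-visual vocabulary in the noisy text.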
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
In this paper, we explore neural network models that learn to associate
segments of spoken audio captions with the semantically relevant portions of
natural images that they refer to. We demonstrate that these audio-visual
associative localizations emerge from network-internal representations learned
as a by-product of training to perform an image-audio retrieval task. Our
models operate directly on the image pixels and speech waveform, and do not
rely on any conventional supervision in the form of labels, segmentations, or
alignments between the modalities during training. We perform analysis using
the Places 205 and ADE20k datasets, demonstrating that our models implicitly
learn semantically coupled object and word detectors.
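A common mechanism behind this kind of audio-visual localization is a "matchmap" of similarities between image-region features and audio-frame features. The sketch below is a hedged illustration of that idea; the shapes, pooling choice, and function names are assumptions, not the paper's exact architecture.

```python
# Illustrative audio-visual matchmap: dot products between every image
# region and every audio frame localize what co-occurs with what.
import numpy as np

def matchmap(image_feats: np.ndarray, audio_feats: np.ndarray) -> np.ndarray:
    """image_feats: (h, w, d) conv feature map; audio_feats: (t, d) frames."""
    return np.einsum('hwd,td->hwt', image_feats, audio_feats)

def retrieval_score(m: np.ndarray) -> float:
    """Max over image regions, mean over time: one common pooling choice."""
    return float(m.max(axis=(0, 1)).mean())

img = np.random.rand(14, 14, 512)
aud = np.random.rand(100, 512)
m = matchmap(img, aud)
print(m.shape, retrieval_score(m))   # (14, 14, 100) and a scalar score
```

Training a retrieval loss on the pooled score is what lets region-word correspondences emerge without any labels, segmentations, or alignments.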
Order-Free RNN with Visual Attention for Multi-Label Classification
In this paper, we propose jointly learned attention and recurrent neural
network (RNN) models for multi-label classification. While approaches based on
either model exist (e.g., for the task of image captioning), training such
network architectures typically requires pre-defined label sequences. For
multi-label classification, it would be desirable to have a robust inference
process, so that prediction errors do not propagate and degrade performance.
Our proposed model uniquely integrates attention and Long Short-Term Memory
(LSTM) models, which not only addresses the above problem but also allows one
to identify visual objects of interest with varying sizes without prior
knowledge of a particular label ordering. More importantly, label co-occurrence
information can be jointly exploited by our LSTM model. Finally, by adapting
the technique of beam search, prediction of multiple labels can be efficiently
achieved by our proposed network model.
Comment: Accepted at the 32nd AAAI Conference on Artificial Intelligence (AAAI-18)
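To make the beam-search step concrete, here is a minimal sketch of beam search over label sequences. The `step_scores` function is a hypothetical stand-in for the paper's attention-LSTM decoder; everything else is generic beam-search machinery, not the authors' implementation.

```python
# Minimal beam search over label sequences: keep the top-B partial
# sequences, expanding each with the best next labels per step.
import numpy as np

def step_scores(chosen: tuple, n_labels: int) -> np.ndarray:
    """Placeholder decoder: random log-scores, chosen labels masked out."""
    rng = np.random.default_rng(hash(chosen) % (2**32))
    s = rng.standard_normal(n_labels)
    s[list(chosen)] = -np.inf            # forbid repeating a label
    return s

def beam_search(n_labels: int, n_steps: int, beam: int = 3):
    hyps = [((), 0.0)]                   # (label sequence, cumulative score)
    for _ in range(n_steps):
        cand = []
        for seq, score in hyps:
            s = step_scores(seq, n_labels)
            for lbl in np.argsort(-s)[:beam]:   # top labels per hypothesis
                cand.append((seq + (int(lbl),), score + s[lbl]))
        hyps = sorted(cand, key=lambda x: -x[1])[:beam]
    return hyps[0]

print(beam_search(n_labels=20, n_steps=4))
```

Because the search scores sets of labels rather than one fixed sequence, a poor early prediction can be displaced by a better-scoring hypothesis instead of propagating, which is the robustness the abstract argues for.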
Image Matters: Scalable Detection of Offensive and Non-Compliant Content / Logo in Product Images
In e-commerce, product content, especially product images, has a significant
influence on a customer's journey from product discovery to evaluation and,
finally, the purchase decision. Since many e-commerce retailers sell items from
third-party marketplace sellers in addition to their own, the content published
by both internal and external content creators needs to be monitored and
enriched wherever possible. Despite guidelines and warnings, product listings
that contain offensive and non-compliant images continue to enter catalogs.
Offensive and non-compliant content can include a wide range of objects, logos,
and banners conveying violent, sexually explicit, racist, or promotional
messages. Such images can severely damage the customer experience, lead to
legal issues, and erode the company brand. In this paper, we present a computer
vision driven offensive and non-compliant image detection system for extremely
large image datasets. This paper delves into the unique challenges of applying
deep learning to real-world product image data from the retail world. We
demonstrate how we resolve a number of technical challenges, such as a lack of
training data, severe class imbalance, and fine-grained class definitions,
using a number of practical yet unique technical strategies. Our system
combines state-of-the-art image classification and object detection techniques
with budgeted crowdsourcing to develop a solution customized for a massive,
diverse, and constantly evolving product catalog.
Comment: 10 pages
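The abstract names severe class imbalance as one challenge but does not state the remedy used. A standard tactic for this setting is to reweight the loss inversely to class frequency; the sketch below shows that generic technique only, and should not be read as the paper's actual recipe.

```python
# Generic inverse-frequency loss weighting for imbalanced classification
# (a common remedy; not necessarily the one this paper uses).
import numpy as np

def class_weights(labels: np.ndarray, n_classes: int) -> np.ndarray:
    counts = np.bincount(labels, minlength=n_classes).astype(np.float64)
    return counts.sum() / (n_classes * np.maximum(counts, 1))  # rare classes weigh more

def weighted_cross_entropy(probs: np.ndarray, labels: np.ndarray,
                           w: np.ndarray) -> float:
    """probs: (n, c) predicted class probabilities; labels: (n,) int targets."""
    eps = 1e-12
    per_example = -np.log(probs[np.arange(len(labels)), labels] + eps)
    return float((w[labels] * per_example).mean())

# toy usage: 1000 examples where ~1% belong to the offending class
labels = (np.random.rand(1000) < 0.01).astype(int)
w = class_weights(labels, 2)
probs = np.full((1000, 2), 0.5)
print(w, weighted_cross_entropy(probs, labels, w))
```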
Hierarchical Spatial Sum-Product Networks for Action Recognition in Still Images
Recognizing actions from still images has been widely studied in recent years.
In this paper, we model an action class as a flexible number of spatial
configurations of body parts by proposing a new spatial SPN (Sum-Product
Network). First, we discover a set of parts in image collections via
unsupervised learning. Then, our new spatial SPN is applied to model the
spatial relationships and the high-order correlations of parts. To learn robust
networks, we further develop a hierarchical spatial SPN method, which models
pairwise spatial relationships between parts inside sub-images and models the
correlation of sub-images via extra layers of SPN. Our method is shown to be
effective on two benchmark datasets.
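For readers unfamiliar with SPNs: a sum-product network alternates product nodes (independent factors, e.g., co-occurring parts) and sum nodes (weighted mixtures, e.g., alternative part configurations), and is evaluated bottom-up. The toy evaluator below illustrates only that machinery; the paper's spatial structure and learning procedure are much richer.

```python
# Toy sum-product network evaluator in log space: product nodes add
# children's log-probabilities, sum nodes take weighted mixtures.
import numpy as np

def eval_spn(node, leaf_logps: dict) -> float:
    kind = node[0]
    if kind == 'leaf':
        return leaf_logps[node[1]]                  # log P(evidence at leaf)
    if kind == 'product':                           # independent children
        return sum(eval_spn(c, leaf_logps) for c in node[1])
    if kind == 'sum':                               # weighted mixture of children
        ws, children = node[1], node[2]
        vals = [np.log(w) + eval_spn(c, leaf_logps) for w, c in zip(ws, children)]
        return float(np.logaddexp.reduce(vals))
    raise ValueError(kind)

# toy SPN: a mixture over two alternative part-pair configurations
spn = ('sum', [0.7, 0.3],
       [('product', [('leaf', 'head'), ('leaf', 'torso')]),
        ('product', [('leaf', 'arm'), ('leaf', 'leg')])])
logps = {'head': -0.5, 'torso': -1.0, 'arm': -2.0, 'leg': -0.3}
print(eval_spn(spn, logps))
```

The "flexible number of spatial configurations" in the abstract corresponds to sum nodes mixing over alternative part layouts.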
Shared Predictive Cross-Modal Deep Quantization
With the explosive growth of data volume and the ever-increasing diversity of
data modalities, cross-modal similarity search, which conducts nearest-neighbor
search across different modalities, has been attracting increasing interest.
This paper presents a deep compact code learning solution for efficient
cross-modal similarity search. Many recent studies have shown that
quantization-based approaches generally perform better than hashing-based
approaches on single-modal similarity search. In this paper, we propose a deep
quantization approach, among the early attempts to leverage deep neural
networks for quantization-based cross-modal similarity search. Our approach,
dubbed shared predictive deep quantization (SPDQ), explicitly formulates a
shared subspace across the different modalities and two private subspaces for
the individual modalities. Representations in the shared and private subspaces
are learned simultaneously by embedding them into a reproducing kernel Hilbert
space, where the mean embeddings of the different modality distributions can be
explicitly compared. In addition, a quantizer is learned in the shared subspace
to produce semantics-preserving compact codes with the help of label alignment.
Thanks to this novel network architecture, in cooperation with supervised
quantization training, SPDQ preserves intra-modal and inter-modal similarities
as much as possible while greatly reducing quantization error. Experiments on
two popular benchmarks corroborate that our approach outperforms
state-of-the-art methods.
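Comparing mean embeddings of two distributions in an RKHS is exactly what the (squared) maximum mean discrepancy computes. The sketch below shows a biased MMD^2 estimate with an RBF kernel as an illustration of that building block; it is not SPDQ's full training objective, and the kernel and bandwidth are assumptions.

```python
# Squared maximum mean discrepancy (MMD^2) with an RBF kernel: the
# distance between two distributions' mean embeddings in an RKHS.
import numpy as np

def rbf(x: np.ndarray, y: np.ndarray, gamma: float = 0.5) -> np.ndarray:
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)  # pairwise squared dists
    return np.exp(-gamma * d2)

def mmd2(x: np.ndarray, y: np.ndarray) -> float:
    """x: (n, d) codes from one modality; y: (m, d) codes from the other."""
    return float(rbf(x, x).mean() + rbf(y, y).mean() - 2 * rbf(x, y).mean())

img_codes = np.random.normal(0.0, 1, (200, 16))
txt_codes = np.random.normal(0.5, 1, (200, 16))
print(mmd2(img_codes, txt_codes))        # noticeably > 0: distributions differ
```

Minimizing such a term over the shared-subspace representations pushes the two modalities' distributions together, which is what makes codes from different modalities directly comparable.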
Answering Visual-Relational Queries in Web-Extracted Knowledge Graphs
A visual-relational knowledge graph (KG) is a multi-relational graph whose
entities are associated with images. We explore novel machine learning
approaches for answering visual-relational queries in web-extracted knowledge
graphs. To this end, we have created ImageGraph, a KG with 1,330 relation
types, 14,870 entities, and 829,931 images crawled from the web. With
visual-relational KGs such as ImageGraph, one can introduce novel probabilistic
query types in which images are treated as first-class citizens. Both the
prediction of relations between unseen images and multi-relational image
retrieval can be expressed with specific families of visual-relational queries.
We introduce novel combinations of convolutional networks and knowledge graph
embedding methods to answer such queries. We also explore a zero-shot learning
scenario in which an image of an entirely new entity is linked via multiple
relations to entities of an existing KG. The resulting multi-relational
grounding of unseen entity images into a knowledge graph serves as a semantic
entity representation. We conduct experiments demonstrating that the proposed
methods can answer these visual-relational queries efficiently and accurately.
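One plausible shape for "convolutional networks combined with KG embedding methods" is to embed images into the entity space with a CNN and score (head, relation, tail) triples with a standard KG scoring function. The sketch below uses DistMult-style scoring over CNN embeddings as an illustration; the `cnn_embed` projection is a hypothetical stand-in, and the paper's actual architectures may differ.

```python
# Illustrative CNN + KG-embedding combination: score (head image,
# relation, tail image) triples DistMult-style in a shared space.
import numpy as np

def cnn_embed(image: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Stand-in for a CNN: flatten and project into the KG embedding space."""
    v = proj @ image.ravel()
    return v / np.linalg.norm(v)

def score_triple(head_img, relation_vec, tail_img, proj) -> float:
    h = cnn_embed(head_img, proj)
    t = cnn_embed(tail_img, proj)
    return float(np.sum(h * relation_vec * t))   # DistMult trilinear score

d, pix = 64, 32 * 32
proj = np.random.normal(0, 1, (d, pix))          # shared image-to-KG projection
rel = np.random.normal(0, 1, d)                  # embedding of one relation type
img_a, img_b = np.random.rand(32, 32), np.random.rand(32, 32)
print(score_triple(img_a, rel, img_b, proj))
```

Ranking relations by this score between two unseen images answers relation-prediction queries, and ranking candidate images for a fixed head and relation answers multi-relational image retrieval.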