Recurrent Attention Models for Depth-Based Person Identification
We present an attention-based model that reasons on human body shape and
motion dynamics to identify individuals in the absence of RGB information,
hence in the dark. Our approach leverages unique 4D spatio-temporal signatures
to address the identification problem across days. Formulated as a
reinforcement learning task, our model is based on a combination of
convolutional and recurrent neural networks with the goal of identifying small,
discriminative regions indicative of human identity. We demonstrate that our
model produces state-of-the-art results on several published datasets given
only depth images. We further study the robustness of our model towards
viewpoint, appearance, and volumetric changes. Finally, we share insights
gleaned from interpretable 2D, 3D, and 4D visualizations of our model's
spatio-temporal attention.

Comment: Computer Vision and Pattern Recognition (CVPR) 201
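The glimpse mechanism at the heart of such a model can be sketched roughly as follows. This is a toy illustration, not the paper's implementation: the function names are hypothetical, the weights are placeholders rather than learned parameters, and in the actual model the attended locations are emitted by a policy trained with reinforcement learning.

```python
import numpy as np

def extract_glimpse(depth_frame, center, size=8):
    """Crop a small square patch (a "glimpse") around an attended location.

    depth_frame: (H, W) depth image; center: (row, col) attended location.
    Zero-pads so glimpses near the border keep a fixed size.
    """
    half = size // 2
    padded = np.pad(depth_frame, half, mode="constant")
    r, c = center[0] + half, center[1] + half
    return padded[r - half:r + half, c - half:c + half]

def recurrent_attention_step(hidden, glimpse, W_h, W_g):
    """One recurrent update: fold the new glimpse into the hidden state.

    In the paper the glimpse would first pass through a CNN; here the raw
    patch is used directly to keep the sketch self-contained.
    """
    return np.tanh(hidden @ W_h + glimpse.ravel() @ W_g)
```

Iterating this loop over the frames of a depth sequence accumulates a hidden state that summarizes the small, discriminative regions the model attends to.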
A visual embedding for the unsupervised extraction of abstract semantics
Vector-space word representations obtained from neural network models have been shown to enable semantic operations based on vector arithmetic. In this paper, we explore the existence of similar information in vector representations of images. For that purpose we define a methodology to obtain large, sparse vector representations of image classes, and generate vectors through the state-of-the-art deep learning architecture GoogLeNet for 20K images obtained from ImageNet. We first evaluate the resultant vector-space semantics through its correlation with WordNet distances, and find vector distances to be strongly correlated with linguistic semantics. We then explore the location of images within the vector space, finding elements close in WordNet to be clustered together, regardless of significant visual variance (e.g., 118 dog types). More surprisingly, we find that the space separates complex classes (e.g., living things) without supervision or prior knowledge. Afterwards, we consider vector arithmetic. Although we are unable to obtain meaningful results in this regard, we discuss the various problems we encountered and how we propose to solve them. Finally, we discuss the impact of our research for cognitive systems, focusing on the role of the architecture being used.

This work is partially supported by the Joint Study Agreement no. W156463 under the IBM/BSC Deep Learning Center agreement, by the Spanish Government through Programa Severo Ochoa (SEV-2015-0493), by the Spanish Ministry of Science and Technology through the TIN2015-65316-P project, by the Generalitat de Catalunya (contract 2014-SGR-1051), and by the Core Research for Evolutional Science and Technology (CREST) program of the Japan Science and Technology Agency (JST).
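The correlation analysis described above can be sketched in a few lines. Everything here is a toy stand-in: the random matrix plays the role of the large, sparse GoogLeNet-derived class vectors, and the placeholder array stands in for real WordNet distances, which would be supplied in the same pair order.

```python
import numpy as np

# Toy stand-ins for the paper's data: rows are class embeddings.
rng = np.random.default_rng(0)
emb = rng.random((5, 16))

def cosine_dist(a, b):
    """Cosine distance between two embedding vectors."""
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Embedding-space distance for every unordered class pair (i < j).
pairs = [(i, j) for i in range(5) for j in range(i + 1, 5)]
emb_d = np.array([cosine_dist(emb[i], emb[j]) for i, j in pairs])

# With genuine WordNet distances in the same order, this correlation
# quantifies how well the visual embedding tracks linguistic semantics.
wn_d = rng.random(len(pairs))  # placeholder values, NOT real WordNet data
corr = np.corrcoef(emb_d, wn_d)[0, 1]
```

A strong positive correlation here would mirror the paper's finding that visually derived vector distances align with linguistic semantic distances.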
Few-shot Learning with Multi-scale Self-supervision
Learning concepts from a limited number of data points is a challenging task
usually addressed by so-called one- or few-shot learning. Recently, an
application of second-order pooling in few-shot learning demonstrated its
superior performance due to the aggregation step handling varying image
resolutions without the need to modify CNNs to fit specific image sizes,
yet capturing highly descriptive co-occurrences. However, using a single
resolution per image (even if the resolution varies across a dataset) is
suboptimal as the importance of image contents varies across the coarse-to-fine
levels depending on the object and its class label, e.g., generic objects and
scenes rely on their global appearance while fine-grained objects rely more on
their localized texture patterns. Multi-scale representations are popular in
image deblurring, super-resolution and image recognition but they have not been
investigated in few-shot learning due to its relational nature complicating the
use of standard techniques. In this paper, we propose a novel multi-scale
relation network based on the properties of second-order pooling to estimate
image relations in the few-shot setting. To optimize the model, we leverage a scale
selector to re-weight scale-wise representations based on their second-order
features. Furthermore, we propose to apply self-supervised scale prediction.
Specifically, we leverage an extra discriminator to predict the scale labels
and the scale discrepancy between pairs of images. Our model achieves
state-of-the-art results on standard few-shot learning datasets.
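The two properties the abstract leans on, resolution-independent second-order pooling and a softmax scale selector, can be sketched as follows. The function names and the raw scale scores are illustrative; in the paper the scores come from a learned selector operating on the second-order features.

```python
import numpy as np

def second_order_pool(feature_map):
    """Aggregate a (H, W, C) CNN feature map into a fixed (C, C) matrix.

    The averaged outer product of local descriptors is independent of H
    and W, so images of any resolution (or scale) yield a descriptor of
    the same size -- the property that makes multi-scale few-shot
    relation estimation tractable.
    """
    h, w, c = feature_map.shape
    f = feature_map.reshape(h * w, c)
    return f.T @ f / (h * w)

def scale_selector(pooled_per_scale, scores):
    """Re-weight per-scale second-order descriptors with softmax weights.

    pooled_per_scale: list of (C, C) matrices, one per image scale.
    scores: raw per-scale scores (toy values here; learned in the paper).
    """
    a = np.exp(scores - scores.max())
    a /= a.sum()
    return sum(ai * p for ai, p in zip(a, pooled_per_scale))
```

Because every scale maps to the same (C, C) shape, the weighted combination is well defined regardless of how many scales or resolutions are used.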
Detecting Visual Relationships with Deep Relational Networks
Relationships among objects play a crucial role in image understanding.
Despite the great success of deep learning techniques in recognizing individual
objects, reasoning about the relationships among objects remains a challenging
task. Previous methods often treat this as a classification problem,
considering each type of relationship (e.g. "ride") or each distinct visual
phrase (e.g. "person-ride-horse") as a category. Such approaches are faced with
significant difficulties caused by the high diversity of visual appearance for
each kind of relationship or the large number of distinct visual phrases. We
propose an integrated framework to tackle this problem. At the heart of this
framework is the Deep Relational Network, a novel formulation designed
specifically for exploiting the statistical dependencies between objects and
their relationships. On two large datasets, the proposed method achieves
substantial improvement over the state-of-the-art.

Comment: To appear in CVPR 2017 as an oral paper
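One way to picture "exploiting the statistical dependencies between objects and their relationships" is a refinement step that biases appearance-based relation scores by how compatible each relation is with the predicted subject and object categories. This is a hand-rolled toy sketch, not the Deep Relational Network's actual unrolled inference; the compatibility matrices here are placeholders for learned dependency structure.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def refine_relation(rel_scores, subj_probs, obj_probs, C_s, C_o):
    """Bias raw relation scores by subject/object compatibility.

    rel_scores: (R,) appearance-based relation logits.
    subj_probs: (S,) subject-class distribution; obj_probs: (O,) object.
    C_s: (S, R) and C_o: (O, R) compatibility matrices (toy stand-ins for
    the statistical dependencies a relational model would learn).
    """
    return softmax(rel_scores + subj_probs @ C_s + obj_probs @ C_o)
```

For example, a "person" subject and a "horse" object would push probability mass toward relations like "ride" even when the appearance evidence alone is ambiguous.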
Looking Beyond Appearances: Synthetic Training Data for Deep CNNs in Re-identification
Re-identification is generally carried out by encoding the appearance of a
subject in terms of outfit, suggesting scenarios where people do not change
their attire. In this paper we overcome this restriction, by proposing a
framework based on a deep convolutional neural network, SOMAnet, that
additionally models other discriminative aspects, namely, structural attributes
of the human figure (e.g. height, obesity, gender). Our method is unique in
many respects. First, SOMAnet is based on the Inception architecture, departing
from the usual siamese framework. This spares expensive data preparation
(pairing images across cameras) and allows the understanding of what the
network learned. Second, and most notably, the training data consists of a
synthetic 100K instance dataset, SOMAset, created by photorealistic human body
generation software. Synthetic data represents a good compromise between
realistic imagery, usually not required in re-identification since surveillance
cameras capture low-resolution silhouettes, and complete control of the
samples, which is useful in order to customize the data w.r.t. the surveillance
scenario at hand, e.g. ethnicity. SOMAnet, trained on SOMAset and fine-tuned on
recent re-identification benchmarks, outperforms all competitors, matching
subjects even with different apparel. The combination of synthetic data with
Inception architectures opens up new research avenues in re-identification.

Comment: 14 pages
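Once a classification network like SOMAnet is trained, a common way to match subjects at test time (the paper's exact evaluation protocol may differ) is to take a penultimate-layer embedding for each image and rank the gallery by cosine similarity to the query:

```python
import numpy as np

def rank_gallery(query_emb, gallery_embs):
    """Rank gallery entries by cosine similarity to a query embedding.

    query_emb: (D,) embedding of the probe image.
    gallery_embs: (N, D) embeddings of the gallery images.
    Returns gallery indices ordered from best match to worst.
    """
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    return np.argsort(-(g @ q))
```

Because the embedding also encodes structural attributes (height, build, gender), matching can succeed even when the subject has changed apparel between camera views.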