4,754 research outputs found
Understanding deep features with computer-generated imagery
We introduce an approach for analyzing the variation of features generated by
convolutional neural networks (CNNs) with respect to scene factors that occur
in natural images. Such factors may include object style, 3D viewpoint, color,
and scene lighting configuration. Our approach analyzes CNN feature responses
corresponding to different scene factors by controlling for them via rendering
using a large database of 3D CAD models. The rendered images are presented to a
trained CNN and responses for different layers are studied with respect to the
input scene factors. We perform a decomposition of the responses based on
knowledge of the input scene factors and analyze the resulting components. In
particular, we quantify their relative importance in the CNN responses and
visualize them using principal component analysis. We show qualitative and
quantitative results of our study on three CNNs trained on large image
datasets: AlexNet, Places, and Oxford VGG. We observe important differences
across the networks and CNN layers for different scene factors and object
categories. Finally, we demonstrate that our analysis based on
computer-generated imagery translates to the network representation of natural
images
Unsupervised learning of clutter-resistant visual representations from natural videos
Populations of neurons in inferotemporal cortex (IT) maintain an explicit
code for object identity that also tolerates transformations of object
appearance e.g., position, scale, viewing angle [1, 2, 3]. Though the learning
rules are not known, recent results [4, 5, 6] suggest the operation of an
unsupervised temporal-association-based method e.g., Foldiak's trace rule [7].
Such methods exploit the temporal continuity of the visual world by assuming
that visual experience over short timescales will tend to have invariant
identity content. Thus, by associating representations of frames from nearby
times, a representation that tolerates whatever transformations occurred in the
video may be achieved. Many previous studies verified that such rules can work
in simple situations without background clutter, but the presence of visual
clutter has remained problematic for this approach. Here we show that temporal
association based on large class-specific filters (templates) avoids the
problem of clutter. Our system learns in an unsupervised way from natural
videos gathered from the internet, and is able to perform a difficult
unconstrained face recognition task on natural images: Labeled Faces in the
Wild [8]
- …