31,350 research outputs found
Is Robustness To Transformations Driven by Invariant Neural Representations?
Deep Convolutional Neural Networks (DCNNs) have demonstrated impressive
robustness to recognize objects under transformations (e.g. blur or noise) when
these transformations are included in the training set. A hypothesis to explain
such robustness is that DCNNs develop invariant neural representations that
remain unaltered when the image is transformed. Yet, to what extent this
hypothesis holds true is an outstanding question, as including transformations
in the training set could lead to properties different from invariance, e.g.
parts of the network could be specialized to recognize either transformed or
non-transformed images. In this paper, we analyze the conditions under which
invariance emerges. To do so, we leverage that invariant representations
facilitate robustness to transformations for object categories that are not
seen transformed during training. Our results with state-of-the-art DCNNs
indicate that invariant representations strengthen as the number of transformed
categories in the training set is increased. This is much more prominent with
local transformations such as blurring and high-pass filtering, compared to
geometric transformations such as rotation and thinning, that entail changes in
the spatial arrangement of the object. Our results contribute to a better
understanding of invariant representations in deep learning, and the conditions
under which invariance spontaneously emerges
A Deep Representation for Invariance And Music Classification
Representations in the auditory cortex might be based on mechanisms similar
to the visual ventral stream; modules for building invariance to
transformations and multiple layers for compositionality and selectivity. In
this paper we propose the use of such computational modules for extracting
invariant and discriminative audio representations. Building on a theory of
invariance in hierarchical architectures, we propose a novel, mid-level
representation for acoustical signals, using the empirical distributions of
projections on a set of templates and their transformations. Under the
assumption that, by construction, this dictionary of templates is composed from
similar classes, and samples the orbit of variance-inducing signal
transformations (such as shift and scale), the resulting signature is
theoretically guaranteed to be unique, invariant to transformations and stable
to deformations. Modules of projection and pooling can then constitute layers
of deep networks, for learning composite representations. We present the main
theoretical and computational aspects of a framework for unsupervised learning
of invariant audio representations, empirically evaluated on music genre
classification.Comment: 5 pages, CBMM Memo No. 002, (to appear) IEEE 2014 International
Conference on Acoustics, Speech, and Signal Processing (ICASSP 2014
On Invariance and Selectivity in Representation Learning
We discuss data representation which can be learned automatically from data,
are invariant to transformations, and at the same time selective, in the sense
that two points have the same representation only if they are one the
transformation of the other. The mathematical results here sharpen some of the
key claims of i-theory -- a recent theory of feedforward processing in sensory
cortex
Invariance of visual operations at the level of receptive fields
Receptive field profiles registered by cell recordings have shown that
mammalian vision has developed receptive fields tuned to different sizes and
orientations in the image domain as well as to different image velocities in
space-time. This article presents a theoretical model by which families of
idealized receptive field profiles can be derived mathematically from a small
set of basic assumptions that correspond to structural properties of the
environment. The article also presents a theory for how basic invariance
properties to variations in scale, viewing direction and relative motion can be
obtained from the output of such receptive fields, using complementary
selection mechanisms that operate over the output of families of receptive
fields tuned to different parameters. Thereby, the theory shows how basic
invariance properties of a visual system can be obtained already at the level
of receptive fields, and we can explain the different shapes of receptive field
profiles found in biological vision from a requirement that the visual system
should be invariant to the natural types of image transformations that occur in
its environment.Comment: 40 pages, 17 figure
Learning An Invariant Speech Representation
Recognition of speech, and in particular the ability to generalize and learn
from small sets of labelled examples like humans do, depends on an appropriate
representation of the acoustic input. We formulate the problem of finding
robust speech features for supervised learning with small sample complexity as
a problem of learning representations of the signal that are maximally
invariant to intraclass transformations and deformations. We propose an
extension of a theory for unsupervised learning of invariant visual
representations to the auditory domain and empirically evaluate its validity
for voiced speech sound classification. Our version of the theory requires the
memory-based, unsupervised storage of acoustic templates -- such as specific
phones or words -- together with all the transformations of each that normally
occur. A quasi-invariant representation for a speech segment can be obtained by
projecting it to each template orbit, i.e., the set of transformed signals, and
computing the associated one-dimensional empirical probability distributions.
The computations can be performed by modules of filtering and pooling, and
extended to hierarchical architectures. In this paper, we apply a single-layer,
multicomponent representation for phonemes and demonstrate improved accuracy
and decreased sample complexity for vowel classification compared to standard
spectral, cepstral and perceptual features.Comment: CBMM Memo No. 022, 5 pages, 2 figure
View-tolerant face recognition and Hebbian learning imply mirror-symmetric neural tuning to head orientation
The primate brain contains a hierarchy of visual areas, dubbed the ventral
stream, which rapidly computes object representations that are both specific
for object identity and relatively robust against identity-preserving
transformations like depth-rotations. Current computational models of object
recognition, including recent deep learning networks, generate these properties
through a hierarchy of alternating selectivity-increasing filtering and
tolerance-increasing pooling operations, similar to simple-complex cells
operations. While simulations of these models recapitulate the ventral stream's
progression from early view-specific to late view-tolerant representations,
they fail to generate the most salient property of the intermediate
representation for faces found in the brain: mirror-symmetric tuning of the
neural population to head orientation. Here we prove that a class of
hierarchical architectures and a broad set of biologically plausible learning
rules can provide approximate invariance at the top level of the network. While
most of the learning rules do not yield mirror-symmetry in the mid-level
representations, we characterize a specific biologically-plausible Hebb-type
learning rule that is guaranteed to generate mirror-symmetric tuning to faces
tuning at intermediate levels of the architecture
Visual Representations: Defining Properties and Deep Approximations
Visual representations are defined in terms of minimal sufficient statistics
of visual data, for a class of tasks, that are also invariant to nuisance
variability. Minimal sufficiency guarantees that we can store a representation
in lieu of raw data with smallest complexity and no performance loss on the
task at hand. Invariance guarantees that the statistic is constant with respect
to uninformative transformations of the data. We derive analytical expressions
for such representations and show they are related to feature descriptors
commonly used in computer vision, as well as to convolutional neural networks.
This link highlights the assumptions and approximations tacitly assumed by
these methods and explains empirical practices such as clamping, pooling and
joint normalization.Comment: UCLA CSD TR140023, Nov. 12, 2014, revised April 13, 2015, November
13, 2015, February 28, 201
- …