The computational magic of the ventral stream
I argue that the sample complexity of (biological, feedforward) object recognition is mostly due to geometric image transformations and conjecture that a main goal of the ventral stream – V1, V2, V4 and IT – is to learn-and-discount image transformations.

In the first part of the paper I describe a class of simple and biologically plausible memory-based modules that learn transformations from unsupervised visual experience. The main theorems show that these modules provide (for every object) a signature which is invariant to local affine transformations and approximately invariant for other transformations. I also prove that, in a broad class of hierarchical architectures, signatures remain invariant from layer to layer. The identification of these memory-based modules with complex (and simple) cells in visual areas leads to a theory of invariant recognition for the ventral stream.
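
As a rough illustration of an invariant signature, the sketch below pools the responses of an input to every stored shift of a few templates. It is a toy 1-D analogue, not the construction in the paper: cyclic shifts stand in for translations, and the names `shift`, `dot` and `signature`, the integer test data and the choice of max and mean as pooling statistics are all assumptions made for illustration.

```python
import random

def shift(v, k):
    """Cyclically shift the list v by k positions (a 1-D stand-in for translation)."""
    k %= len(v)
    return v[-k:] + v[:-k]

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def signature(image, templates):
    """For each template, pool |<image, g t>| over every cyclic shift g of the template."""
    sig = []
    for t in templates:
        responses = [abs(dot(image, shift(t, k))) for k in range(len(t))]
        # Max and mean are two simple pooling statistics; both ignore ordering,
        # so permuting the responses leaves them unchanged.
        sig.append((max(responses), sum(responses) / len(responses)))
    return sig

# Hypothetical integer-valued data, so that the comparison below is exact.
rng = random.Random(0)
n = 16
templates = [[rng.randint(-5, 5) for _ in range(n)] for _ in range(3)]
image = [rng.randint(-5, 5) for _ in range(n)]

original = signature(image, templates)
shifted = signature(shift(image, 5), templates)
print(original == shifted)
```

Shifting the input only permutes the set of responses to the shifted templates, which is why any order-independent pooling gives the same signature. The sketch says nothing about selectivity or about transformations that do not form a group.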

In the second part, I outline a theory about hierarchical architectures that can learn invariance to transformations. I show that the memory complexity of learning affine transformations is drastically reduced in a hierarchical architecture that factorizes transformations in terms of the subgroup of translations and the subgroups of rotations and scalings. I then show how translations are automatically selected as the only learnable transformations during development by enforcing small apertures – e.g. small receptive fields – in the first layer.
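
One way to get a feel for the memory argument is a simple count. The sketch below is an assumed simplification, not the paper's derivation: it supposes that memorizing every combined transformation costs the product of the per-subgroup sample counts, while a factorized architecture costs roughly their sum. The function names and the sample counts are hypothetical.

```python
def joint_storage(n_translations, n_rotations, n_scales):
    """Templates stored if every combined transformation is memorized directly."""
    return n_translations * n_rotations * n_scales

def factorized_storage(n_translations, n_rotations, n_scales):
    """Templates stored if each subgroup is handled by its own stage."""
    return n_translations + n_rotations + n_scales

# Hypothetical sampling densities, chosen only for illustration.
counts = dict(n_translations=100, n_rotations=36, n_scales=10)
print(joint_storage(**counts), factorized_storage(**counts))  # 36000 146
```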

In a third part I show that the transformations represented in each area can be optimized in terms of storage and robustness, as a consequence determining the tuning of the neurons in the area, rather independently (under normal conditions) of the statistics of natural images. I describe a model of learning that can be proved to have this property, linking in an elegant way the spectral properties of the signatures with the tuning of receptive fields in different areas. A surprising implication of these theoretical results is that the computational goals and some of the tuning properties of cells in the ventral stream may follow from symmetry properties (in the sense of physics) of the visual world through a process of unsupervised correlational learning, based on Hebbian synapses. In particular, simple and complex cells do not directly care about oriented bars: their tuning is a side effect of their role in translation invariance. Across the whole ventral stream the preferred features reported for neurons in different areas are only a symptom of the invariances computed and represented.
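
The link between Hebbian learning and spectral properties can be illustrated with Oja's rule, a standard normalized Hebbian update whose weight vector tends toward the leading eigenvector of the input covariance. Whether this matches the paper's learning model in detail is an assumption; the data distribution, learning rate and function name below are hypothetical.

```python
import math
import random

def oja_train(samples, lr=0.005, seed=0):
    """Train one linear unit with Oja's rule: w += lr * y * (x - y * w)."""
    rng = random.Random(seed)
    w = [rng.uniform(-1, 1) for _ in range(len(samples[0]))]
    for x in samples:
        y = sum(wi * xi for wi, xi in zip(w, x))  # unit output
        # Hebbian term y * x plus a decay term that keeps |w| bounded.
        w = [wi + lr * y * (xi - y * wi) for wi, xi in zip(w, x)]
    return w

# Hypothetical 2-D inputs with most of their variance along the direction u.
rng = random.Random(1)
u = (1 / math.sqrt(2), 1 / math.sqrt(2))    # high-variance direction
v = (1 / math.sqrt(2), -1 / math.sqrt(2))   # low-variance direction
samples = []
for _ in range(5000):
    a, b = rng.gauss(0, 2.0), rng.gauss(0, 0.3)
    samples.append((a * u[0] + b * v[0], a * u[1] + b * v[1]))

w = oja_train(samples)
# Absolute cosine between the learned weights and u (the sign of w is arbitrary).
alignment = abs(w[0] * u[0] + w[1] * u[1]) / math.hypot(*w)
print(round(alignment, 3))
```

In this toy setting the learned tuning reflects the covariance structure of what the unit sees, which is the sense in which the spectrum of the inputs shapes the receptive field.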

The results of each of the three parts stand on their own independently of each other. Together this theory-in-fieri makes several broad predictions, some of which are:

-invariance to small transformations in early areas (e.g. translations in V1) may underlie stability of visual perception (suggested by Stu Geman);

-each cell’s tuning properties are shaped by visual experience of image transformations during developmental and adult plasticity;

-simple cells are likely to be the same population as complex cells, arising from different convergence of the Hebbian learning rule. The inputs to complex “complex” cells are dendritic branches with simple cell properties;

-class-specific transformations are learned and represented at the top of the ventral stream hierarchy; thus class-specific modules such as faces, places and possibly body areas should exist in IT;

-the types of transformations that are learned from visual experience depend on the size of the receptive fields and thus on the area (layer in the models) – assuming that the size increases with layers;

-the mix of transformations learned in each area influences the tuning properties of the cells – oriented bars in V1+V2, radial and spiral patterns in V4 up to class specific tuning in AIT (e.g. face tuned cells);

-features must be discriminative and invariant: invariance to transformations is the primary determinant of the tuning of cortical neurons rather than statistics of natural images.

The theory is broadly consistent with the current version of HMAX. It explains it and extends it in terms of unsupervised learning, a broader class of transformation invariance and higher level modules. The goal of this paper is to sketch a comprehensive theory with little regard for mathematical niceties. If the theory turns out to be useful there will be scope for deep mathematics, ranging from group representation tools to wavelet theory to dynamics of learning.
The Levels of Understanding framework, revised
I discuss the "levels of understanding" framework described in Marr's Vision and propose a revised and updated version of it to capture the changes in computation and neuroscience over the last 30 years.
The Computational Magic of the Ventral Stream: Towards a Theory
I conjecture that the sample complexity of object recognition is mostly due to geometric image transformations and that a main goal of the ventral stream – V1, V2, V4 and IT – is to learn-and-discount image transformations. The most surprising implication of the theory emerging from these assumptions is that the computational goals and detailed properties of cells in the ventral stream follow from symmetry properties of the visual world through a process of unsupervised correlational learning.

From the assumption of a hierarchy of areas with receptive fields of increasing size the theory predicts that the size of the receptive fields determines which transformations are learned during development and then factored out during normal processing; that the transformation represented in each area determines the tuning of the neurons in the area, independently of the statistics of natural images; and that class-specific transformations are learned and represented at the top of the ventral stream hierarchy.

Some of the main predictions of this theory-in-fieri are:
1. the types of transformations that are learned from visual experience depend on the aperture size (measured in terms of wavelength) and thus on the area (layer in the models) – assuming that the aperture size increases with layers;
2. the mix of transformations learned determines the properties of the receptive fields – oriented bars in V1+V2, radial and spiral patterns in V4 up to class specific tuning in AIT (e.g. face tuned cells);
3. invariance to small translations in V1 may underlie stability of visual perception;
4. class-specific modules – such as faces, places and possibly body areas – should exist in IT to process images of object classes.
Integrating vision modules with coupled MRFs
A. I. Laboratory Working Papers are produced for internal circulation and contain proteins, lipids, cholesterol, polysorbate-80, and other compounds unsuitable for external exposure. It is not intended that material in this paper be applied externally; it is intended for internal consumption only. Serving suggestion: add taco sauce (not included). I outline a project for integrating several early visual modalities based on coupled Markov Random Field models of the physical processes underlying image formation, such as depth, albedo and orientation of surfaces. The key ideas are:
a) to use as input data estimates of the various processes and their discontinuities, computed by several different algorithms.
b) to implement with MRFs the physical and geometrical constraints of local "continuity" of the processes and of their discontinuities. Processes are coupled to each other: the most common form of coupling is a veto (one process vetoing another), as in the case of discontinuities and the associated continuous field.
MIT Artificial Intelligence Laboratory
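
The veto coupling between a discontinuity process and its continuous field can be sketched as an energy function in the style of a standard line-process (weak-membrane) MRF. This is an assumed textbook form, not the formulation of the working paper; the function name, the weights `lam` and `alpha`, and the 1-D data are hypothetical.

```python
def energy(field, data, lines, lam=1.0, alpha=0.5):
    """Energy of a 1-D field coupled to a binary line (discontinuity) process.

    lines[i] = 1 marks a discontinuity between sites i and i + 1 and vetoes
    the smoothness penalty there, at a fixed cost alpha per discontinuity.
    """
    data_term = sum((f - d) ** 2 for f, d in zip(field, data))
    smoothness = sum(
        lam * (field[i + 1] - field[i]) ** 2 * (1 - lines[i])
        for i in range(len(field) - 1)
    )
    line_penalty = alpha * sum(lines)
    return data_term + smoothness + line_penalty

# Hypothetical step edge, for example a depth estimate across an occluding boundary.
data = [0, 0, 0, 5, 5, 5]
no_edge = [0, 0, 0, 0, 0]
edge = [0, 0, 1, 0, 0]  # discontinuity placed at the step

print(energy(data, data, no_edge), energy(data, data, edge))  # 25.0 0.5
```

Switching on the line variable at the step removes the large smoothness cost there, so the configuration with a discontinuity has the lower energy.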
Werner Reichardt: the man and his scientific legacy
Excerpts from a talk given by Tomaso Poggio in Tübingen at the opening of the Werner Reichardt Centrum für Integrative Neurowissenschaften, December 8, 2008.
Computational role of eccentricity dependent cortical magnification
We develop a sampling extension of M-theory focused on invariance to scale
and translation. Quite surprisingly, the theory predicts an architecture of
early vision with increasing receptive field sizes and a high resolution fovea
-- in agreement with data about the cortical magnification factor, V1 and the
retina. From the slope of the inverse of the magnification factor, M-theory
predicts a cortical "fovea" in V1 in the order of by basic units at
each receptive field size -- corresponding to a foveola of size around
minutes of arc at the highest resolution, degrees at the lowest
resolution. It also predicts uniform scale invariance over a fixed range of
scales independently of eccentricity, while translation invariance should
depend linearly on spatial frequency. Bouma's law of crowding follows in the
theory as an effect of cortical area-by-cortical area pooling; the Bouma
constant is the value expected if the signature responsible for recognition in
the crowding experiments originates in V2. From a broader perspective, the
emerging picture suggests that visual recognition under natural conditions
takes place by composing information from a set of fixations, with each
fixation providing recognition from a space-scale image fragment -- that is an
image patch represented at a set of increasing sizes and decreasing
resolutions.
On Invariance and Selectivity in Representation Learning
We discuss data representations that can be learned automatically from data,
are invariant to transformations, and at the same time selective, in the sense
that two points have the same representation only if one is a transformation
of the other. The mathematical results here sharpen some of the
key claims of i-theory -- a recent theory of feedforward processing in sensory
cortex.