Sparse Coding on Stereo Video for Object Detection
Deep Convolutional Neural Networks (DCNN) require millions of labeled
training examples for image classification and object detection tasks, which
restricts these models to domains where such datasets are available. In this
paper, we explore the use of unsupervised sparse coding applied to stereo-video
data to help alleviate the need for large amounts of labeled data. We show that
replacing a typical supervised convolutional layer with an unsupervised
sparse-coding layer within a DCNN allows for better performance on a car
detection task when only a limited number of labeled training examples is
available. Furthermore, the network that incorporates sparse coding allows for
more consistent performance over varying initializations and ordering of
training examples when compared to a fully supervised DCNN. Finally, we compare
activations between the unsupervised sparse-coding layer and the supervised
convolutional layer, and show that the sparse representation exhibits an
encoding that is depth selective, whereas encodings from the convolutional
layer do not exhibit such selectivity. These results indicate promise for using
unsupervised sparse-coding approaches in real-world computer vision tasks in
domains with limited labeled training data.
Integrating Flexible Normalization into Mid-Level Representations of Deep Convolutional Neural Networks
Deep convolutional neural networks (CNNs) are becoming increasingly popular
models to predict neural responses in visual cortex. However, contextual
effects, which are prevalent in neural processing and in perception, are not
explicitly handled by current CNNs, including those used for neural prediction.
In primary visual cortex, neural responses are modulated by stimuli spatially
surrounding the classical receptive field in rich ways. These effects have been
modeled with divisive normalization approaches, including flexible models,
where spatial normalization is recruited only to the degree responses from
center and surround locations are deemed statistically dependent. We propose a
flexible normalization model applied to mid-level representations of deep CNNs
as a tractable way to study contextual normalization mechanisms in mid-level
cortical areas. This approach captures non-trivial spatial dependencies among
mid-level features in CNNs, such as those present in textures and other visual
stimuli, that arise from the geometric tiling of high-order features. We expect
that the proposed approach can make predictions about when spatial
normalization might be recruited in mid-level cortical areas. We also expect
this approach to be useful as part of the CNN toolkit, therefore going beyond
more restrictive fixed forms of normalization.
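
The gating idea in this abstract, where surround responses enter the normalization pool only to the degree that center and surround are judged dependent, can be illustrated with a minimal sketch. This is a simplified stand-in, not the model proposed in the paper: the function name, the `p_dependent` gating probability, and the constant `sigma` are all assumptions introduced for illustration.

```python
import math


def flexible_normalization(center, surround, p_dependent, sigma=1.0):
    """Divide a center response by a pool gated by p_dependent (hypothetical).

    p_dependent = 1.0: surround energy fully enters the pool
                       (ordinary divisive normalization).
    p_dependent = 0.0: surround is ignored and the center is normalized
                       by its own energy only.
    """
    if not 0.0 <= p_dependent <= 1.0:
        raise ValueError("p_dependent must lie in [0, 1]")
    center_energy = center ** 2
    surround_energy = sum(s ** 2 for s in surround)
    # Surround contributes in proportion to the inferred dependence.
    pool = sigma ** 2 + center_energy + p_dependent * surround_energy
    return center / math.sqrt(pool)
```

Under these assumptions, calling `flexible_normalization(2.0, [2.0, 2.0], p)` with `p` near 1 yields a smaller output than with `p` near 0, which mirrors surround suppression being recruited only when the surround is deemed relevant.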
Learning Mid-Level Auditory Codes from Natural Sound Statistics
Interaction with the world requires an organism to transform sensory signals into representations in which behaviorally meaningful properties of the environment are made explicit. These representations are derived through cascades of neuronal processing stages in which neurons at each stage recode the output of preceding stages. Explanations of sensory coding may thus involve understanding how low-level patterns are combined into more complex structures. Although models exist in the visual domain to explain how mid-level features such as junctions and curves might be derived from oriented filters in early visual cortex, little is known about analogous grouping principles for mid-level auditory representations. We propose a hierarchical generative model of natural sounds that learns combinations of spectrotemporal features from natural stimulus statistics. In the first layer the model forms a sparse convolutional code of spectrograms using a dictionary of learned spectrotemporal kernels. To generalize from specific kernel activation patterns, the second layer encodes patterns of time-varying magnitude of multiple first-layer coefficients. Because second-layer features are sensitive to combinations of spectrotemporal features, the representation they support encodes more complex acoustic patterns than the first layer. When trained on corpora of speech and environmental sounds, some second-layer units learned to group spectrotemporal features that occur together in natural sounds. Others instantiate opponency between dissimilar sets of spectrotemporal features. Such groupings might be instantiated by neurons in the auditory cortex, providing a hypothesis for mid-level neuronal computation. This work was supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216.
Toward a Biologically Plausible Model of LGN-V1 Pathways Based on Efficient Coding
Increasing evidence supports the hypothesis that the visual system employs a sparse code to represent visual stimuli, where information is encoded in an efficient way by a small population of cells that respond to sensory input at a given time. This includes simple cells in primary visual cortex (V1), which are defined by their linear spatial integration of visual stimuli. Various models of sparse coding have been proposed to explain physiological phenomena observed in simple cells. However, these models have usually made the simplifying assumption that inputs to simple cells already incorporate linear spatial summation. This overlooks the fact that these inputs are known to have strong non-linearities such as the separation of ON and OFF pathways, or the separation of excitatory and inhibitory neurons. Consequently, these models ignore a range of important experimental phenomena that are related to the emergence of linear spatial summation from non-linear inputs, such as segregation of ON and OFF sub-regions of simple cell receptive fields, the push-pull effect of excitation and inhibition, and phase-reversed cortico-thalamic feedback. Here, we demonstrate that a two-layer model of the visual pathway from the lateral geniculate nucleus to V1 that incorporates these biological constraints on the neural circuits and is based on sparse coding can account for the emergence of these experimental phenomena, diverse shapes of receptive fields and contrast invariance of orientation tuning of simple cells when the model is trained on natural images. The model suggests that sparse coding can be implemented by the V1 simple cells using neural circuits with a simple biologically plausible architecture.
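
How linear spatial summation can emerge from rectified ON and OFF inputs can be shown with a toy push-pull arrangement. The sketch below is only an illustration of that general principle, not the two-layer model from the paper: the function names and the idea of feeding a mean-subtracted contrast signal are assumptions made here.

```python
def split_on_off(signal):
    """Split a signed contrast signal into non-negative ON and OFF channels."""
    on = [max(v, 0.0) for v in signal]    # responds to increments
    off = [max(-v, 0.0) for v in signal]  # responds to decrements
    return on, off


def push_pull_response(weights, signal):
    """Combine rectified ON/OFF inputs with mirrored-sign weights (hypothetical).

    Where a weight is positive, the ON input excites and the OFF input
    inhibits; where it is negative, the roles swap. Summing both pathways
    recovers the linear response sum(w * x) even though each input channel
    is individually non-linear.
    """
    on, off = split_on_off(signal)
    return sum(w * (p - q) for w, p, q in zip(weights, on, off))
```

For any `weights` and `signal` of equal length, `push_pull_response(weights, signal)` matches the plain dot product of the two, which is the sense in which the rectification is undone by the push-pull wiring.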
Visual Nonclassical Receptive Field Effects Emerge from Sparse Coding in a Dynamical System
Extensive electrophysiology studies have shown that many V1 simple cells have nonlinear response properties to stimuli within their classical receptive field (CRF) and receive contextual influence from stimuli outside the CRF modulating the cell's response. Models seeking to explain these non-classical receptive field (nCRF) effects in terms of circuit mechanisms, input-output descriptions, or individual visual tasks provide limited insight into the functional significance of these response properties, because they do not connect the full range of nCRF effects to optimal sensory coding strategies. The (population) sparse coding hypothesis conjectures an optimal sensory coding approach where a neural population uses as few active units as possible to represent a stimulus. We demonstrate that a wide variety of nCRF effects are emergent properties of a single sparse coding model implemented in a neurally plausible network structure (requiring no parameter tuning to produce different effects). Specifically, we replicate a wide variety of nCRF electrophysiology experiments (e.g., end-stopping, surround suppression, contrast invariance of orientation tuning, and cross-orientation suppression) on a dynamical system implementing sparse coding, showing that this model produces individual units that reproduce the canonical nCRF effects. Furthermore, when the population diversity of an nCRF effect has also been reported in the literature, we show that this model produces many of the same population characteristics. These results show that the sparse coding hypothesis, when coupled with a biophysically plausible implementation, can provide a unified high-level functional interpretation to many response properties that have generally been viewed through distinct mechanistic or phenomenological models.
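
One well-known dynamical system that performs sparse coding is the locally competitive algorithm (LCA), in which units integrate a feedforward drive and inhibit one another in proportion to the overlap of their receptive fields. Whether the paper's network matches this sketch in detail is an assumption; the function names, parameter values, and Euler integration below are illustrative choices only.

```python
def soft_threshold(u, lam):
    """Map an internal state to an output: zero unless |u| exceeds lam."""
    if u > lam:
        return u - lam
    if u < -lam:
        return u + lam
    return 0.0


def dot(p, q):
    return sum(pi * qi for pi, qi in zip(p, q))


def lca(dictionary, stimulus, lam=0.1, step=0.1, n_steps=500):
    """Euler-integrate LCA dynamics and return the sparse coefficients.

    dictionary: list of unit-norm atoms (each a list of floats).
    Dynamics per unit i: du_i/dt = drive_i - u_i - sum_{j != i} G_ij * a_j,
    with a_j = soft_threshold(u_j, lam) and G the atom overlap matrix.
    """
    n = len(dictionary)
    drive = [dot(atom, stimulus) for atom in dictionary]
    gram = [[dot(a, b) for b in dictionary] for a in dictionary]
    u = [0.0] * n
    for _ in range(n_steps):
        a = [soft_threshold(ui, lam) for ui in u]
        # Active units suppress neighbours with overlapping atoms.
        inhibition = [
            sum(gram[i][j] * a[j] for j in range(n) if j != i)
            for i in range(n)
        ]
        u = [ui + step * (drive[i] - ui - inhibition[i])
             for i, ui in enumerate(u)]
    return [soft_threshold(ui, lam) for ui in u]
```

As a usage example, `lca([[1.0, 0.0], [0.6, 0.8]], [1.0, 0.0])` presents a stimulus identical to the first atom; the second, overlapping unit is driven at first but ends silent once the first unit's inhibition takes hold, leaving a single active unit to represent the input.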