Neuroscientists classify neurons into different types that perform similar
computations at different locations in the visual field. Traditional methods
for neural system identification do not capitalize on this separation of 'what'
and 'where'. Learning deep convolutional feature spaces that are shared among
many neurons provides an exciting path forward, but the architectural design
needs to account for data limitations: While new experimental techniques enable
recordings from thousands of neurons, experimental time is limited so that one
can sample only a small fraction of each neuron's response space. Here, we show
that a major bottleneck for fitting convolutional neural networks (CNNs) to
neural data is the estimation of the individual receptive field locations, a
problem that has been scratched only at the surface thus far. We propose a CNN
architecture with a sparse readout layer factorizing the spatial (where) and
feature (what) dimensions. Our network scales well to thousands of neurons and
short recordings and can be trained end-to-end. We evaluate this architecture
on ground-truth data to explore the challenges and limitations of CNN-based
system identification. Moreover, we show that our network model outperforms
current state-of-the art system identification models of mouse primary visual
cortex.Comment: NIPS 201