Lifelong Learning of Spatiotemporal Representations with Dual-Memory Recurrent Self-Organization
Artificial autonomous agents and robots interacting in complex environments
are required to continually acquire and fine-tune knowledge over sustained
periods of time. The ability to learn from continuous streams of information is
referred to as lifelong learning and represents a long-standing challenge for
neural network models due to catastrophic forgetting. Computational models of
lifelong learning typically alleviate catastrophic forgetting in experimental
scenarios with given datasets of static images and limited complexity, thereby
differing significantly from the conditions artificial agents are exposed to.
In more natural settings, sequential information may become progressively
available over time and access to previous experience may be restricted. In
this paper, we propose a dual-memory self-organizing architecture for lifelong
learning scenarios. The architecture comprises two growing recurrent networks
with the complementary tasks of learning object instances (episodic memory) and
categories (semantic memory). Both growing networks can expand in response to
novel sensory experience: the episodic memory learns fine-grained
spatiotemporal representations of object instances in an unsupervised fashion
while the semantic memory uses task-relevant signals to regulate structural
plasticity levels and develop more compact representations from episodic
experience. For the consolidation of knowledge in the absence of external
sensory input, the episodic memory periodically replays trajectories of neural
reactivations. We evaluate the proposed model on the CORe50 benchmark dataset
for continuous object recognition, showing that we significantly outperform
current methods of lifelong learning in three different incremental learning
scenarios.
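The replay-based consolidation described above can be illustrated with a minimal, hypothetical sketch in Python/NumPy: an episodic memory that grows a new prototype unit for sufficiently novel inputs, records trajectories of winning units, and periodically replays them as pseudo-samples for a semantic learner. All names, thresholds, and learning rates below are illustrative assumptions, not the authors' implementation.

    import numpy as np

    class EpisodicMemory:
        """Toy growing prototype memory that records activation trajectories."""
        def __init__(self, dim, novelty_threshold=1.5):
            self.prototypes = [np.random.rand(dim)]   # start from a single unit
            self.threshold = novelty_threshold
            self.trajectories = []                    # stored sequences of winner indices

        def observe(self, sequence):
            winners = []
            for x in sequence:
                d = [np.linalg.norm(x - p) for p in self.prototypes]
                best = int(np.argmin(d))
                if d[best] > self.threshold:          # novel input -> grow a new unit
                    self.prototypes.append(x.copy())
                    best = len(self.prototypes) - 1
                else:                                 # familiar input -> adapt the winner
                    self.prototypes[best] += 0.1 * (x - self.prototypes[best])
                winners.append(best)
            self.trajectories.append(winners)

        def replay(self):
            """Yield stored prototype trajectories as pseudo-samples for consolidation."""
            for winners in self.trajectories:
                yield [self.prototypes[i] for i in winners]

    # Consolidation without external sensory input: a semantic learner would be
    # trained on these replayed neural reactivations (placeholder loop below).
    memory = EpisodicMemory(dim=8)
    memory.observe([np.random.rand(8) for _ in range(5)])
    for pseudo_sequence in memory.replay():
        pass  # feed pseudo_sequence to the semantic memory here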
Multimodal Language Analysis with Recurrent Multistage Fusion
Computational modeling of human multimodal language is an emerging research
area in natural language processing spanning the language, visual and acoustic
modalities. Comprehending multimodal language requires modeling not only the
interactions within each modality (intra-modal interactions) but more
importantly the interactions between modalities (cross-modal interactions). In
this paper, we propose the Recurrent Multistage Fusion Network (RMFN) which
decomposes the fusion problem into multiple stages, each of them focused on a
subset of multimodal signals for specialized, effective fusion. Cross-modal
interactions are modeled using this multistage fusion approach which builds
upon intermediate representations of previous stages. Temporal and intra-modal
interactions are modeled by integrating our proposed fusion approach with a
system of recurrent neural networks. The RMFN displays state-of-the-art
performance in modeling human multimodal language across three public datasets
relating to multimodal sentiment analysis, emotion recognition, and speaker
traits recognition. We provide visualizations to show that each stage of fusion
focuses on a different subset of multimodal signals, learning increasingly
discriminative multimodal representations.
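As a rough, hypothetical illustration of the multistage idea (not the RMFN architecture itself, which couples the fusion process with recurrent networks and end-to-end learned parameters), the sketch below refines an intermediate fusion vector over several stages, each stage highlighting a different subset of the concatenated multimodal signals; the attention weights here are random stand-ins for learned ones.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def multistage_fusion(language, visual, acoustic, num_stages=3, seed=0):
        """Refine an intermediate fusion vector over several stages, each stage
        highlighting a different subset of the concatenated multimodal signals."""
        rng = np.random.default_rng(seed)
        signals = np.concatenate([language, visual, acoustic])
        fusion = np.zeros_like(signals)
        for _ in range(num_stages):
            # Random stand-in for learned attention parameters.
            W = rng.standard_normal((signals.size, signals.size + fusion.size))
            attention = softmax(W @ np.concatenate([signals, fusion]))
            highlighted = attention * signals          # stage-specific signal subset
            fusion = 0.5 * fusion + 0.5 * highlighted  # build on the previous stage
        return fusion

    fused = multistage_fusion(np.random.rand(4), np.random.rand(4), np.random.rand(4))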
Brain-inspired self-organization with cellular neuromorphic computing for multimodal unsupervised learning
Cortical plasticity is one of the main features underlying our ability to
learn and adapt in our environment. Indeed, the cerebral cortex self-organizes
through structural and synaptic plasticity mechanisms that very likely underpin
a remarkable characteristic of human brain development: multimodal association.
Despite the diversity of sensory modalities, such as sight, sound, and touch,
the brain arrives at the same concepts (convergence). Moreover, biological observations show that one
modality can activate the internal representation of another modality when both
are correlated (divergence). In this work, we propose the Reentrant
Self-Organizing Map (ReSOM), a brain-inspired neural system based on the
reentry theory using Self-Organizing Maps and Hebbian-like learning. We propose
and compare different computational methods for unsupervised learning and
inference, then quantify the gain of the ReSOM in a multimodal classification
task. The divergence mechanism is used to label one modality based on the
other, while the convergence mechanism is used to improve the overall accuracy
of the system. We perform our experiments on a constructed written/spoken
digits database and a DVS/EMG hand gestures database. The proposed model is
implemented on a cellular neuromorphic architecture that enables distributed
computing with local connectivity. We show the gain of the so-called hardware
plasticity induced by the ReSOM, where the system's topology is not fixed by
the user but learned through the system's experience via self-organization.
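A minimal sketch of the convergence/divergence idea, assuming two already-trained toy maps with Gaussian unit activities; all sizes, samples, and learning rates are illustrative and do not reproduce the ReSOM implementation.

    import numpy as np

    def som_activation(som, x):
        """Gaussian activity of every map unit for one input sample."""
        d = np.linalg.norm(som - x, axis=1)
        return np.exp(-d ** 2)

    rng = np.random.default_rng(0)
    som_a = rng.random((16, 8))    # toy map for modality A (e.g. written digits)
    som_b = rng.random((16, 5))    # toy map for modality B (e.g. spoken digits)
    hebb = np.zeros((16, 16))      # lateral Hebbian weights between the two maps

    # Hebbian-like learning from co-occurring multimodal samples.
    for _ in range(100):
        xa, xb = rng.random(8), rng.random(5)      # stand-in for one paired sample
        hebb += 0.01 * np.outer(som_activation(som_a, xa), som_activation(som_b, xb))

    # Divergence: a map-A input alone activates map B's internal representation.
    predicted_b = som_activation(som_a, rng.random(8)) @ hebb

    # Convergence: combine evidence from both modalities (here by weighting map-B
    # activity with the activity propagated from map A) before reading out a label.
    combined = (som_activation(som_a, xa) @ hebb) * som_activation(som_b, xb)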
Emergence of multimodal action representations from neural network self-organization
The integration of multisensory information plays a crucial role in autonomous robotics, where robust and meaningful representations of the environment must be formed. In this work, we investigate how robust multimodal representations can naturally develop in a self-organizing manner from co-occurring multisensory inputs. We propose a hierarchical architecture with growing self-organizing neural networks for learning human actions from audiovisual inputs. The hierarchical processing of visual inputs yields progressively specialized neurons encoding latent spatiotemporal dynamics of the input, consistent with neurophysiological evidence for increasingly large temporal receptive windows in the human cortex. Associative links that bind unimodal representations are incrementally learned by a semi-supervised algorithm with bidirectional connectivity. Multimodal representations of actions are obtained using the co-activation of action features from video sequences and labels from automatic speech recognition. Experimental results on a dataset of 10 full-body actions show that our system achieves state-of-the-art classification performance without requiring manual segmentation of training samples, and that congruent visual representations can be retrieved from recognized speech in the absence of visual stimuli. Together, these results show that our hierarchical neural architecture accounts for the development of robust multimodal representations from dynamic audiovisual inputs.
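The co-activation-based binding of visual representations and speech-derived labels can be sketched, in highly simplified form, as an associative weight matrix that is strengthened whenever a visual prototype and a recognized word occur together; the function names, sizes, and learning rate below are hypothetical and not the paper's algorithm.

    import numpy as np

    num_prototypes, num_labels = 20, 10
    assoc = np.zeros((num_prototypes, num_labels))   # bidirectional associative links

    def bind(prototype_id, label_id, lr=1.0):
        """Strengthen the link between a co-active visual unit and a spoken label."""
        assoc[prototype_id, label_id] += lr

    def label_from_vision(prototype_id):
        """Action label retrieved from a visual representation."""
        return int(np.argmax(assoc[prototype_id]))

    def vision_from_label(label_id):
        """Congruent visual prototype retrieved from a recognized word
        (the 'no visual stimulus' case mentioned in the abstract)."""
        return int(np.argmax(assoc[:, label_id]))

    bind(prototype_id=3, label_id=7)     # co-activation during training
    assert label_from_vision(3) == 7 and vision_from_label(7) == 3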