3 research outputs found
Directional Source Separation for Robust Speech Recognition on Smart Glasses
Modern smart glasses leverage advanced audio sensing and machine learning
technologies to offer real-time transcribing and captioning services,
considerably enriching human experiences in daily communications. However, such
systems frequently encounter environmental noise, which degrades speech
recognition and speaker change detection. To
improve voice quality, this work investigates directional source separation
using a multi-microphone array. We first explore multiple beamformers to
assist source separation modeling by strengthening the directional properties
of speech signals. In addition to relying on predetermined beamformers, we
investigate neural beamforming in multi-channel source separation,
demonstrating that automatically learning directional characteristics
effectively improves separation quality. We further compare ASR performance on
the separated outputs against the noisy inputs. Our results show that directional source
separation benefits ASR for the wearer but not for the conversation partner.
Lastly, we perform the joint training of the directional source separation and
ASR model, achieving the best overall ASR performance.
Comment: Submitted to ICASSP 202
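The abstract above contrasts predetermined beamformers with learned neural beamforming. As a minimal illustration of the predetermined case, here is a sketch of a frequency-domain delay-and-sum beamformer; the function name, array geometry, and parameters are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def delay_and_sum(signals, mic_positions, direction, fs, c=343.0):
    """Steer a microphone array toward `direction` by delay-and-sum.

    signals:       (num_mics, num_samples) time-domain recordings
    mic_positions: (num_mics, 3) microphone coordinates in metres
    direction:     unit vector pointing from the array toward the source
    fs:            sample rate in Hz
    c:             speed of sound in m/s
    """
    num_mics, num_samples = signals.shape
    # Time-of-flight difference of each mic relative to the array origin.
    delays = mic_positions @ direction / c          # seconds
    delays -= delays.min()                          # make all delays non-negative
    freqs = np.fft.rfftfreq(num_samples, d=1.0 / fs)
    spectra = np.fft.rfft(signals, axis=1)
    # Compensate each channel's delay with a phase shift, then average,
    # so signals from the target direction add coherently.
    steered = spectra * np.exp(2j * np.pi * freqs * delays[:, None])
    return np.fft.irfft(steered.mean(axis=0), n=num_samples)
```

A neural beamformer would replace the fixed phase weights with channel weights predicted by a network, which is the "automatic learning of directional characteristics" the abstract refers to.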
Brain-inspired self-organization with cellular neuromorphic computing for multimodal unsupervised learning
Cortical plasticity is one of the main features that enable us to learn and
adapt to our environment. Indeed, the cerebral cortex self-organizes through
structural and synaptic plasticity mechanisms that very likely underlie a
remarkable characteristic of human brain development: multimodal association.
Despite the diversity of sensory modalities, such as sight, sound, and touch,
the brain arrives at the same concepts (convergence). Moreover, biological
observations show that one
modality can activate the internal representation of another modality when both
are correlated (divergence). In this work, we propose the Reentrant
Self-Organizing Map (ReSOM), a brain-inspired neural system based on the
reentry theory using Self-Organizing Maps and Hebbian-like learning. We propose
and compare different computational methods for unsupervised learning and
inference, then quantify the gain of the ReSOM in a multimodal classification
task. The divergence mechanism is used to label one modality based on the
other, while the convergence mechanism is used to improve the overall accuracy
of the system. We perform our experiments on a constructed written/spoken
digits database and a DVS/EMG hand gestures database. The proposed model is
implemented on a cellular neuromorphic architecture that enables distributed
computing with local connectivity. We show the gain of the so-called hardware
plasticity induced by the ReSOM, where the system's topology is not fixed by
the user but learned through the system's experience via self-organization.
Comment: Preprint
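To make the convergence/divergence idea above concrete, here is a heavily simplified sketch of Hebbian-like co-activation learning between two maps, one per modality. The toy prototypes, learning rate, and function names are illustrative assumptions; the actual ReSOM uses trained Self-Organizing Maps on a cellular neuromorphic substrate.

```python
import numpy as np

rng = np.random.default_rng(0)

def som_activity(prototypes, x):
    """Gaussian activity of each map unit given input x (toy stand-in for a SOM)."""
    d = np.linalg.norm(prototypes - x, axis=1)
    return np.exp(-d ** 2)

# Hypothetical setup: two small maps (e.g. visual and auditory), represented
# here by random prototype vectors purely for illustration.
n_units, dim = 16, 8
map_a = rng.normal(size=(n_units, dim))
map_b = rng.normal(size=(n_units, dim))
W = np.zeros((n_units, n_units))   # reentrant Hebbian weights, map A -> map B

# Hebbian-like learning on paired samples: strengthen co-active unit pairs.
lr = 0.1
for _ in range(100):
    x_a, x_b = rng.normal(size=dim), rng.normal(size=dim)
    W += lr * np.outer(som_activity(map_a, x_a), som_activity(map_b, x_b))

def diverge(x_a):
    """Divergence: activate map B's internal representation from a map-A input alone."""
    return som_activity(map_a, x_a) @ W
```

Convergence would work analogously, combining both maps' activities for a joint decision; labelling one modality from the other (divergence) reduces to reading off the most active unit of the partner map.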
Attention-driven Multi-sensor Selection
Recent encoder-decoder models for sequence-to-sequence mapping show that integrating temporal and spatial attention mechanisms into neural networks considerably improves performance. The use of attention for sensor selection in multi-sensor setups, and the benefit of such an attention mechanism, is far less studied. This work reports on a sensor transformation attention network (STAN) that embeds a sensory attention mechanism to dynamically weigh and combine individual input sensors based on their task-relevant information. We demonstrate the correlation of the attentional signal with the changing noise level of each sensor on the audio-visual GRID dataset with synthetic noise, and on CHiME-4, a multi-microphone real-world noisy dataset. In addition, we demonstrate that the STAN model handles sensor removal and addition without retraining and is invariant to channel order. Compared to a two-sensor model that weighs both sensors equally, the equivalent STAN model has a relative parameter increase of only 0.09%, but reduces the relative character error rate (CER) by up to 19.1% on the CHiME-4 dataset. The attentional signal identifies the lower-SNR sensor with up to 94.2% accuracy.
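The per-sensor weighting described above can be sketched as a shared scoring layer followed by a softmax across sensors, which also explains the channel-order invariance and the ability to add or remove sensors without retraining. The function names and the linear scorer are assumptions for illustration, not the published STAN architecture.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_merge(features, score_w):
    """Weigh and merge per-sensor features with a shared attention scorer.

    features: (num_sensors, time, dim) transformed sensor features
    score_w:  (dim,) parameters of a hypothetical linear scoring layer,
              shared across sensors (hence order-invariant)
    """
    scores = features @ score_w                           # (num_sensors, time)
    weights = softmax(scores, axis=0)                     # compete across sensors per frame
    merged = (weights[..., None] * features).sum(axis=0)  # (time, dim)
    return merged, weights
```

Because the same scorer is applied to every sensor and the softmax runs over however many sensors are present, permuting, dropping, or adding a sensor changes only the competition, not the parameters.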