3 research outputs found

    Directional Source Separation for Robust Speech Recognition on Smart Glasses

    Modern smart glasses leverage advanced audio sensing and machine learning technologies to offer real-time transcribing and captioning services, considerably enriching human experiences in daily communications. However, such systems frequently encounter challenges related to environmental noise, resulting in degraded speech recognition and speaker change detection. To improve voice quality, this work investigates directional source separation using the multi-microphone array. We first explore multiple beamformers to assist source separation modeling by strengthening the directional properties of speech signals. In addition to relying on predetermined beamformers, we investigate neural beamforming in multi-channel source separation, demonstrating that automatically learning directional characteristics effectively improves separation quality. We further compare ASR performance on the separated outputs with ASR performance on the noisy inputs. Our results show that directional source separation benefits ASR for the wearer but not for the conversation partner. Lastly, we perform joint training of the directional source separation and ASR models, achieving the best overall ASR performance. Comment: Submitted to ICASSP 202
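The abstract does not say which predetermined beamformers were used. As a rough illustration of how a fixed beamformer strengthens one direction, the sketch below shows delay-and-sum, the simplest such method, chosen here as an assumption. The function name `delay_and_sum` and the integer per-microphone delays are hypothetical and are not taken from the paper.

```python
def delay_and_sum(channels, delays):
    """Align each channel by its integer sample delay, then average.

    channels: list of equal-length sample lists, one per microphone.
    delays:   assumed arrival delay (in samples, non-negative) of the
              target direction at each microphone.
    """
    if len(channels) != len(delays):
        raise ValueError("need one delay per channel")
    n = len(channels[0])
    out = [0.0] * n
    for signal, d in zip(channels, delays):
        for t in range(n):
            src = t + d  # advance the channel to undo its arrival delay
            if src < n:
                out[t] += signal[src]
    return [v / len(channels) for v in out]


# Toy input: one pulse reaches mic 0 at sample 2 and mic 1 at sample 3.
mic0 = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
mic1 = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]

aligned = delay_and_sum([mic0, mic1], delays=[0, 1])     # steered at the source
misaligned = delay_and_sum([mic0, mic1], delays=[0, 0])  # steered elsewhere
```

With the correct delays the two pulses add coherently and keep full amplitude. With the wrong delays the pulse is smeared over two samples at half amplitude, which is the directional gain that a separation model can then exploit.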

    Brain-inspired self-organization with cellular neuromorphic computing for multimodal unsupervised learning

    Cortical plasticity is one of the main features that enable our ability to learn and adapt in our environment. Indeed, the cerebral cortex self-organizes through structural and synaptic plasticity mechanisms that are very likely at the basis of an extremely interesting characteristic of human brain development: multimodal association. In spite of the diversity of the sensory modalities, like sight, sound and touch, the brain arrives at the same concepts (convergence). Moreover, biological observations show that one modality can activate the internal representation of another modality when both are correlated (divergence). In this work, we propose the Reentrant Self-Organizing Map (ReSOM), a brain-inspired neural system based on the reentry theory using Self-Organizing Maps and Hebbian-like learning. We propose and compare different computational methods for unsupervised learning and inference, then quantify the gain of the ReSOM in a multimodal classification task. The divergence mechanism is used to label one modality based on the other, while the convergence mechanism is used to improve the overall accuracy of the system. We perform our experiments on a constructed written/spoken digits database and a DVS/EMG hand gestures database. The proposed model is implemented on a cellular neuromorphic architecture that enables distributed computing with local connectivity. We show the gain of the so-called hardware plasticity induced by the ReSOM, where the system's topology is not fixed by the user but learned through the system's experience via self-organization. Comment: Preprint
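The two building blocks named in the abstract, Self-Organizing Maps and Hebbian-like learning, can be sketched in a few lines. This is a minimal illustration using the standard Kohonen update on a 1-D map and a simplified co-activation rule. It is not the ReSOM's actual learning rule, and the function names, learning rates, and the `lateral` weight matrix are assumptions made for the example.

```python
import math


def best_matching_unit(weights, x):
    """Index of the neuron whose weight vector is closest to input x."""
    dists = [sum((w_i - x_i) ** 2 for w_i, x_i in zip(w, x)) for w in weights]
    return dists.index(min(dists))


def som_step(weights, x, lr=0.5, sigma=1.0):
    """One Kohonen update: pull the winner and its map neighbours toward x."""
    bmu = best_matching_unit(weights, x)
    for j, w in enumerate(weights):
        # Gaussian neighbourhood over the distance to the winner on the map
        h = math.exp(-((j - bmu) ** 2) / (2 * sigma ** 2))
        weights[j] = [w_i + lr * h * (x_i - w_i) for w_i, x_i in zip(w, x)]
    return bmu


def hebbian_step(lateral, bmu_a, bmu_b, lr=0.1):
    """Hebbian-like rule: strengthen the link between co-active winners."""
    lateral[bmu_a][bmu_b] += lr


# One map per modality (3 neurons each, 1-D inputs) plus cross-map links.
map_a = [[0.0], [1.0], [2.0]]
map_b = [[0.0], [1.0], [2.0]]
lateral = [[0.0] * 3 for _ in range(3)]

winner_a = som_step(map_a, [1.2])
winner_b = som_step(map_b, [1.9])
hebbian_step(lateral, winner_a, winner_b)
```

After many correlated input pairs, a strong entry in `lateral` lets the winner in one map point to the likely winner in the other, which is the intuition behind labeling one modality from the other.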

    Attention-driven Multi-sensor Selection

    Recent encoder-decoder models for sequence-to-sequence mapping show that integrating both temporal and spatial attention mechanisms into neural networks considerably improves network performance. The use of attention for sensor selection in multi-sensor setups and the benefit of such an attention mechanism are less studied. This work reports on a sensor transformation attention network (STAN) that embeds a sensory attention mechanism to dynamically weigh and combine individual input sensors based on their task-relevant information. We demonstrate the correlation of the attentional signal to changing noise levels of each sensor on the audio-visual GRID dataset and synthetic noise; and on CHiME-4, a multi-microphone real-world noisy dataset. In addition, we demonstrate that the STAN model is able to deal with sensor removal and addition without retraining, and is invariant to channel order. Compared to a two-sensor model that weighs both sensors equally, the equivalent STAN model has a relative parameter increase of only 0.09%, but reduces the relative character error rate (CER) by up to 19.1% on the CHiME-4 dataset. The attentional signal helps to identify a lower SNR sensor with up to 94.2% accuracy.
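The idea of weighing and combining sensors by attention can be illustrated with a softmax over per-sensor scores followed by a weighted sum of the sensor features. This is a minimal sketch under stated assumptions: in the abstract's model the scores come from learned attention modules, whereas here they are given by hand, and the names `softmax` and `fuse` are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw per-sensor scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(features, scores):
    """Attention-weighted sum of per-sensor feature vectors."""
    weights = softmax(scores)
    dim = len(features[0])
    fused = [sum(w * f[k] for w, f in zip(weights, features)) for k in range(dim)]
    return fused, weights


# Two sensors; the first is assumed cleaner, so it gets the higher score.
features = [[1.0, 0.0], [0.0, 1.0]]
scores = [2.0, 0.0]

fused, weights = fuse(features, scores)

# Swapping the sensor order (features and scores together) changes nothing.
fused_swapped, _ = fuse(features[::-1], scores[::-1])

# Equal scores reproduce the equal-weighting baseline.
_, equal_weights = fuse(features, [0.0, 0.0])
```

Because the weights are normalized over whichever sensors are present, the same function works when a sensor is added or removed, and the fused result does not depend on the order of the sensors.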