A cross-talk robust multichannel VAD model for multiparty agent interactions trained using synthetic re-recordings
In this work, we propose a novel cross-talk rejection framework for a
multi-channel multi-talker setup for a live multiparty interactive show. Our
far-field audio setup is required to be hands-free during live interaction and
comprises four adjacent talkers with directional microphones in the same space.
Such setups often introduce heavy cross-talk between channels, resulting in
reduced automatic speech recognition (ASR) and natural language understanding
(NLU) performance. To address this problem, we propose a voice activity detection
(VAD) model for all talkers using multichannel information, which is then used
to filter audio for downstream tasks. We adopt a synthetic training data
generation approach through playback and re-recording for such scenarios,
simulating challenging speech overlap conditions. We train our models on this
synthetic data and demonstrate that our approach outperforms single-channel VAD
models and an energy-based multi-channel VAD algorithm in various acoustic
environments. In addition to VAD results, we also present multiparty ASR
evaluation results to highlight the impact of using our VAD model for filtering
audio in downstream tasks by significantly reducing the insertion error.

Comment: Accepted for presentation at the Hands-free Speech Communication and Microphone Arrays workshop (HSCMA 2024).
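The energy-based multi-channel baseline that the proposed model is compared against can be sketched as follows: a channel is considered active in a frame only if its energy is close to that of the loudest channel, which rejects cross-talk picked up at lower level by neighbouring directional microphones. This is a minimal illustration, not the paper's implementation; the frame length and margin below are illustrative assumptions.

```python
import math

def frame_energies(channels, frame_len=160):
    """Per-frame log energy (dB) for each channel.

    channels: list of equal-length sample lists, one per talker microphone.
    Returns one list of frame energies per channel.
    """
    energies = []
    for ch in channels:
        frames = [ch[i:i + frame_len]
                  for i in range(0, len(ch) - frame_len + 1, frame_len)]
        energies.append([10 * math.log10(sum(s * s for s in f) / frame_len + 1e-12)
                         for f in frames])
    return energies

def energy_based_vad(channels, frame_len=160, margin_db=6.0):
    """Naive cross-talk rejection: per frame, a channel is flagged active
    only if its energy is within margin_db of the loudest channel."""
    energies = frame_energies(channels, frame_len)
    decisions = []
    for t in range(len(energies[0])):
        loudest = max(e[t] for e in energies)
        decisions.append([e[t] >= loudest - margin_db for e in energies])
    return decisions
```

A neural multichannel VAD can instead learn such cross-channel comparisons from data, which is what the synthetic re-recordings are used to train.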
Joint model-based recognition and localization of overlapped acoustic events using a set of distributed small microphone arrays
In the analysis of acoustic scenes, often the occurring sounds have to be
detected in time, recognized, and localized in space. Usually, each of these
tasks is done separately. In this paper, a model-based approach to jointly
carry them out for the case of multiple simultaneous sources is presented and
tested. The recognized event classes and their respective room positions are
obtained with a single system that maximizes the combination of a large set of
scores, each one resulting from a different acoustic event model and a
different beamformer output signal, which comes from one of several
arbitrarily-located small microphone arrays. Using a two-step method, experimental work is reported for a specific scenario consisting of meeting-room acoustic
events, either isolated or overlapped with speech. Tests carried
out with two datasets show the advantage of the proposed approach with respect
to some usual techniques, and that the inclusion of estimated priors brings a
further performance improvement.

Comment: Computational acoustic scene analysis, microphone array signal processing, acoustic event detection
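The joint decision can be pictured as a search over (event class, position) hypotheses: each hypothesis combines an acoustic-model score with the beamformer output steered to that position, and optional class priors shift the result. The sketch below is a hypothetical placeholder for the paper's scoring system; the score table and prior values are illustrative assumptions.

```python
def joint_recognize_localize(scores, class_priors=None):
    """Pick the (event class, position) pair with the highest combined score.

    scores[c][p]: hypothetical log-likelihood of acoustic event model c
    evaluated on the beamformer output steered to candidate position p.
    class_priors: optional per-class log priors added at every position.
    Returns (class index, position index) of the best joint hypothesis.
    """
    best = None
    for c, per_position in enumerate(scores):
        prior = class_priors[c] if class_priors else 0.0
        for p, s in enumerate(per_position):
            total = s + prior
            if best is None or total > best[0]:
                best = (total, c, p)
    return best[1], best[2]
```

Because class and position are chosen jointly rather than in separate stages, a position whose beamformer output matches an event model well can override a slightly louder but poorly matching direction.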
Acoustic echo and noise canceller for personal hands-free video IP phone
This paper presents the implementation and evaluation of a proposed acoustic echo and noise canceller (AENC) for videotelephony-enabled personal hands-free Internet protocol (IP) phones. This canceller has the following features: noise-robust performance, low processing delay, and low computational complexity. The AENC employs an adaptive digital filter (ADF) and noise reduction (NR) methods that can effectively eliminate undesired acoustic echo and background noise included in a microphone signal even in a noisy environment. The ADF method uses a step-size control approach according to the level of disturbance such as background noise; it can minimize the effect of disturbance in a noisy environment. The NR method estimates the noise level under the assumption that the noise amplitude spectrum is constant over a short period, which does not hold for the amplitude spectrum of speech. In addition, this paper presents a method for decreasing the computational complexity of the ADF process without increasing the processing delay, making the processing suitable for real-time implementation. The experimental results demonstrate that the proposed AENC suppresses echo and noise sufficiently in a noisy environment, thus resulting in natural-sounding speech.
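The core of an ADF-based echo canceller can be sketched with a plain normalized LMS (NLMS) update: the filter estimates the echo path from the far-end (loudspeaker) signal and subtracts the echo estimate from the microphone signal. This is a minimal sketch only; the paper's disturbance-dependent step-size control and NR stage are not reproduced, and base_mu and taps below are illustrative assumptions.

```python
def nlms_echo_cancel(far_end, mic, taps=32, base_mu=0.5, eps=1e-8):
    """Single-channel NLMS adaptive filter for acoustic echo cancellation.

    far_end: loudspeaker (reference) samples.
    mic:     microphone samples containing the echo.
    base_mu: fixed step size, standing in for disturbance-dependent control.
    Returns the residual (echo-suppressed) signal.
    """
    w = [0.0] * taps            # echo-path estimate
    buf = [0.0] * taps          # recent far-end samples, newest first
    out = []
    for n in range(len(mic)):
        buf = [far_end[n]] + buf[:-1]
        echo_est = sum(wi * xi for wi, xi in zip(w, buf))
        e = mic[n] - echo_est   # residual sent back to the far end
        norm = sum(x * x for x in buf) + eps
        w = [wi + base_mu * e * xi / norm for wi, xi in zip(w, buf)]
        out.append(e)
    return out
```

In a real canceller the step size would be reduced when near-end speech or background noise is detected, which is the role of the step-size control the paper describes.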