Detection of activity and position of speakers by using deep neural networks and acoustic data augmentation
The task of Speaker LOCalization (SLOC) has been the focus of numerous works, in which SLOC is performed on pure speech data and therefore requires an oracle Voice Activity Detection (VAD) algorithm. This ideal working condition is not met in real-world scenarios, where deployed VADs do commit errors. This work addresses the issue with an extensive analysis of the relationship between several data-driven VAD and SLOC models, and finally proposes a reliable joint framework for VAD and SLOC. The effectiveness of the approach is assessed in a multi-room scenario close to a real-world environment. To the best of the authors' knowledge, only one other contribution proposes a unified framework for VAD and SLOC in this scenario, and that solution does not rely on data-driven approaches.
This work extends the authors' previous research on the VAD and SLOC tasks by proposing numerous advancements to the original neural network architectures. In detail, four different models based on convolutional neural networks (CNNs) are tested in order to highlight the advantages of the introduced novelties, and two different CNN models are studied for SLOC. Furthermore, the training of the data-driven models is improved through a specific data augmentation technique, in which the room impulse responses (RIRs) of two virtual rooms are generated from knowledge of the room size, reverberation time, and microphone and source placement. Finally, the only other framework for simultaneous detection and localization in a multi-room scenario is taken into account for a fair comparison with the proposed method.
As a result, the proposed method is more accurate than the baseline framework, and remarkable improvements are observed in particular when the data augmentation technique is applied to both the VAD and SLOC tasks.
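As a concrete illustration of the RIR-based augmentation described above, the following sketch generates the impulse responses of a virtual room from its size, target reverberation time, and microphone/source placement, and convolves them with a clean signal. It assumes the pyroomacoustics simulator and uses illustrative room parameters; the paper does not specify a particular tool or these exact values.

```python
# Hedged sketch: RIR-based data augmentation for VAD/SLOC training.
import numpy as np
import pyroomacoustics as pra

fs = 16000
room_dim = [6.0, 5.0, 3.0]   # assumed room size in metres
rt60 = 0.5                   # assumed target reverberation time in seconds

# Derive wall absorption and image-source order from the target RT60 (Sabine's formula).
e_absorption, max_order = pra.inverse_sabine(rt60, room_dim)
room = pra.ShoeBox(room_dim, fs=fs,
                   materials=pra.Material(e_absorption), max_order=max_order)

# Assumed microphone pair and source placement inside the virtual room.
room.add_microphone_array(np.c_[[2.0, 2.5, 1.5], [2.1, 2.5, 1.5]])
dry_speech = np.random.randn(fs)     # stand-in for one second of clean speech
room.add_source([4.0, 3.0, 1.7], signal=dry_speech)

# Simulation convolves the source signal with the generated RIRs,
# producing reverberant multi-channel data for augmented training.
room.simulate()
augmented = room.mic_array.signals   # shape: (n_mics, n_samples)
```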
Multi-View Networks For Multi-Channel Audio Classification
In this paper we introduce the idea of multi-view networks for sound
classification with multiple sensors. We show how one can build a multi-channel
sound recognition model trained on a fixed number of channels, and deploy it to
scenarios with an arbitrary (and potentially dynamically changing) number of input
channels without observing degradation in performance. We demonstrate that at
inference time this model can safely be given all available channels, as it
can ignore noisy information and leverage new information better than standard
baseline approaches. The model is evaluated in both an anechoic environment and
in rooms generated by a room acoustics simulator. We demonstrate that this
model can generalize to unseen numbers of channels as well as unseen room
geometries.
Comment: 5 pages, 7 figures, Accepted to ICASSP 201
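A minimal sketch of the channel-count-agnostic idea described above: a per-channel encoder shared across channels, followed by pooling over the channel axis, so the same weights handle any number of input channels. The layer sizes, the max-pooling choice, and the feature dimensionality are assumptions for illustration, not the paper's exact multi-view architecture.

```python
# Hedged sketch of a classifier that accepts a variable number of channels.
import torch
import torch.nn as nn

class MultiChannelClassifier(nn.Module):
    def __init__(self, n_features=64, n_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(          # shared across channels
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
        )
        self.classifier = nn.Linear(128, n_classes)

    def forward(self, x):
        # x: (batch, n_channels, n_features); n_channels may vary between calls.
        h = self.encoder(x)                    # encode each channel independently
        pooled = h.max(dim=1).values           # pool over the channel axis
        return self.classifier(pooled)

# The same trained weights accept 2, 5, or 8 channels without retraining.
model = MultiChannelClassifier()
logits = model(torch.randn(3, 5, 64))          # batch of 3 clips, 5 channels each
```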
Realistic multi-microphone data simulation for distant speech recognition
The availability of realistic simulated corpora is of key importance for the
future progress of distant speech recognition technology. The reliability,
flexibility and low computational cost of a data simulation process may
ultimately allow researchers to train, tune and test different techniques in a
variety of acoustic scenarios, avoiding the laborious effort of directly
recording real data from the targeted environment.
In the last decade, several simulated corpora have been released to the
research community, including the data-sets distributed in the context of
projects and international challenges, such as CHiME and REVERB. These efforts
were extremely useful for deriving baselines and common evaluation frameworks for
comparison purposes. At the same time, in many cases they highlighted the need
for better coherence between real and simulated conditions.
In this paper, we examine this issue and we describe our approach to the
generation of realistic corpora in a domestic context. Experimental validation,
conducted in a multi-microphone scenario, shows that a comparable performance
trend can be observed with both real and simulated data across different
recognition frameworks, acoustic models, as well as multi-microphone processing
techniques.
Comment: Proc. of Interspeech 201
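A minimal sketch of the kind of data simulation discussed above: a clean close-talk utterance is convolved with a room impulse response and mixed with environmental noise at a chosen signal-to-noise ratio to approximate a distant-microphone recording. The helper name, the placeholder signals, and the 10 dB target below are illustrative assumptions rather than the corpus generation procedure used in the paper.

```python
# Hedged sketch of contaminating clean speech with an RIR and noise.
import numpy as np
from scipy.signal import fftconvolve

def contaminate(clean, rir, noise, snr_db=10.0):
    # Reverberate the clean utterance with the room impulse response.
    reverberant = fftconvolve(clean, rir)[: len(clean)]
    # Scale the noise to reach the requested signal-to-noise ratio.
    sig_pow = np.mean(reverberant ** 2)
    noise_pow = np.mean(noise[: len(reverberant)] ** 2) + 1e-12
    gain = np.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
    return reverberant + gain * noise[: len(reverberant)]

# Usage with placeholder arrays standing in for loaded WAV data.
fs = 16000
clean = np.random.randn(3 * fs)
rir = np.exp(-np.linspace(0, 8, fs // 2)) * np.random.randn(fs // 2)
noise = np.random.randn(3 * fs)
distant = contaminate(clean, rir, noise, snr_db=10.0)
```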
Multiclass audio segmentation based on recurrent neural networks for broadcast domain data
This paper presents a new approach based on recurrent neural networks (RNN) to the multiclass audio segmentation task, whose goal is to classify an audio signal as speech, music, noise, or a combination of these. The proposed system uses bidirectional long short-term memory (BLSTM) networks to model temporal dependencies in the signal. The RNN is complemented by a resegmentation module, gaining long-term stability by means of the tied-state concept in hidden Markov models. We explore different neural architectures, introducing temporal pooling layers to reduce the neural network output sampling rate. Our findings show that removing redundant temporal information is beneficial for the segmentation system, yielding a relative improvement close to 5%. Furthermore, this solution does not increase the number of parameters of the model and reduces the number of operations per second, allowing our system to achieve a real-time factor below 0.04 when running on CPU and below 0.03 when running on GPU. This new architecture, combined with a data-agnostic data augmentation technique called mixup, allows our system to achieve competitive results on both the Albayzín 2010 and 2012 evaluation datasets, with relative improvements of 19.72% and 5.35% compared to the best results found in the literature for these databases.
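The mixup augmentation mentioned above can be sketched in a few lines: pairs of training examples and their one-hot labels are blended with a Beta-distributed weight. The alpha value and tensor shapes below are assumptions for illustration; the paper's exact configuration may differ.

```python
# Hedged sketch of mixup augmentation on a batch of audio features.
import numpy as np

def mixup_batch(features, one_hot_labels, alpha=0.2, rng=np.random.default_rng()):
    lam = rng.beta(alpha, alpha)               # mixing coefficient in [0, 1]
    perm = rng.permutation(len(features))      # random pairing within the batch
    mixed_x = lam * features + (1 - lam) * features[perm]
    mixed_y = lam * one_hot_labels + (1 - lam) * one_hot_labels[perm]
    return mixed_x, mixed_y

# Example: a batch of 8 spectrogram excerpts with 4 segmentation classes.
x = np.random.randn(8, 100, 64)                # (batch, frames, mel bins)
y = np.eye(4)[np.random.randint(0, 4, size=8)] # one-hot class labels
x_mix, y_mix = mixup_batch(x, y)
```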
Whose Emotion Matters? Speaking Activity Localisation without Prior Knowledge
The task of emotion recognition in conversations (ERC) benefits from the
availability of multiple modalities, as provided, for example, in the
video-based Multimodal EmotionLines Dataset (MELD). However, only a few
research approaches use both acoustic and visual information from the MELD
videos. There are two reasons for this: First, label-to-video alignments in
MELD are noisy, making those videos an unreliable source of emotional speech
data. Second, conversations can involve several people in the same scene, which
requires the localisation of the utterance source. In this paper, we introduce
MELD with Fixed Audiovisual Information via Realignment (MELD-FAIR): using
recent active speaker detection and automatic speech recognition models, we are
able to realign the videos of MELD and capture the facial expressions of
speakers in 96.92% of the utterances provided in MELD. Experiments with a
self-supervised voice recognition model indicate that the realigned MELD-FAIR
videos more closely match the transcribed utterances given in the MELD dataset.
Finally, we devise a model for emotion recognition in conversations trained on
the realigned MELD-FAIR videos, which outperforms state-of-the-art models for
ERC based on vision alone. This indicates that localising the source of
speaking activities is indeed effective for extracting facial expressions from
the uttering speakers and that faces provide more informative visual cues than
the visual features state-of-the-art models have been using so far. The
MELD-FAIR realignment data, together with the code for the realignment procedure
and for the emotion recognition model, are available at
https://github.com/knowledgetechnologyuhh/MELD-FAIR.
Comment: 17 pages, 8 figures, 7 tables, Published in Neurocomputin
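A hedged sketch of the transcript-driven realignment idea: each candidate audio window is transcribed by an ASR model and compared against the MELD reference utterance, and the best-matching window is retained. The candidate-window generation, the `asr` callable, and the difflib similarity measure are illustrative assumptions, not the authors' actual MELD-FAIR pipeline.

```python
# Hedged sketch of picking the video window whose ASR output best matches
# the reference transcript of an utterance.
from difflib import SequenceMatcher

def best_alignment(reference_text, candidate_windows, asr):
    """candidate_windows: list of audio segments; asr: callable audio -> text."""
    scored = []
    for window in candidate_windows:
        hypothesis = asr(window)
        similarity = SequenceMatcher(None, reference_text.lower(),
                                     hypothesis.lower()).ratio()
        scored.append((similarity, window))
    # Return the highest-scoring (similarity, window) pair.
    return max(scored, key=lambda item: item[0])
```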