End-to-end Audiovisual Speech Activity Detection with Bimodal Recurrent Neural Models
Speech activity detection (SAD) plays an important role in current speech
processing systems, including automatic speech recognition (ASR). SAD is
particularly difficult in environments with acoustic noise. A practical
solution is to incorporate visual information, increasing the robustness of the
SAD approach. An audiovisual system has the advantage of being robust to
different speech modes (e.g., whisper speech) or background noise. Recent
advances in audiovisual speech processing using deep learning have opened
opportunities to capture in a principled way the temporal relationships between
acoustic and visual features. This study explores this idea, proposing a
\emph{bimodal recurrent neural network} (BRNN) framework for SAD. The approach
models the temporal dynamics of the sequential audiovisual data, improving the
accuracy and robustness of the proposed SAD system. Instead of estimating
hand-crafted features, the study investigates an end-to-end training approach,
where acoustic and visual features are directly learned from the raw data
during training. The experimental evaluation considers a large audiovisual
corpus with over 60.8 hours of recordings, collected from 105 speakers. The
results demonstrate that the proposed framework leads to absolute improvements
of up to 1.2% in practical scenarios over an audio-only voice activity detection
(VAD) baseline implemented with a deep neural network (DNN). The proposed
approach achieves a 92.7% F1-score when evaluated using the sensors of a
portable tablet in a noisy acoustic environment, which is only 1.0% lower than
the performance obtained under ideal conditions (e.g., clean speech captured
with a high-definition camera and a close-talking microphone).
Comment: Submitted to Speech Communication
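The bimodal fusion idea can be sketched in simplified form. The following is not the paper's architecture; it is a minimal NumPy illustration, assuming an Elman-style recurrent encoder per modality whose hidden states are concatenated before a per-frame sigmoid speech/non-speech decision (all dimensions and weight initializations are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def elman_rnn(x, W_in, W_h, b):
    """Run a simple Elman RNN over a (T, D) sequence; return (T, H) hidden states."""
    T = x.shape[0]
    H = W_h.shape[0]
    h = np.zeros(H)
    out = np.empty((T, H))
    for t in range(T):
        h = np.tanh(x[t] @ W_in + h @ W_h + b)
        out[t] = h
    return out

def bimodal_sad(audio, visual, params):
    """Frame-level speech probability from fused audio/visual recurrent encodings."""
    h_a = elman_rnn(audio, *params["audio"])
    h_v = elman_rnn(visual, *params["visual"])
    fused = np.concatenate([h_a, h_v], axis=1)          # (T, 2H) joint representation
    logits = fused @ params["W_out"] + params["b_out"]  # (T,) per-frame score
    return 1.0 / (1.0 + np.exp(-logits))                # sigmoid -> speech probability

# Toy dimensions: 40-dim audio frames, 20-dim visual frames, hidden size 16.
D_A, D_V, H, T = 40, 20, 16, 50
params = {
    "audio":  (rng.normal(0, 0.1, (D_A, H)), rng.normal(0, 0.1, (H, H)), np.zeros(H)),
    "visual": (rng.normal(0, 0.1, (D_V, H)), rng.normal(0, 0.1, (H, H)), np.zeros(H)),
    "W_out": rng.normal(0, 0.1, 2 * H),
    "b_out": 0.0,
}
probs = bimodal_sad(rng.normal(size=(T, D_A)), rng.normal(size=(T, D_V)), params)
print(probs.shape)
```

In the paper's end-to-end setting, the hand-crafted frame features used here would instead be replaced by representations learned directly from the raw audio and video during training.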
Deep Room Recognition Using Inaudible Echos
Recent years have seen the increasing need of location awareness by mobile
applications. This paper presents a room-level indoor localization approach
based on a room's measured echoes in response to a two-millisecond single-tone
inaudible chirp emitted by a smartphone's loudspeaker. Unlike other
acoustics-based room recognition systems that record full-spectrum audio for up
to ten seconds, our approach records audio in a narrow inaudible band for 0.1
seconds only to preserve the user's privacy. However, the short-time and
narrowband audio signal carries limited information about the room's
characteristics, presenting challenges to accurate room recognition. This paper
applies deep learning to effectively capture the subtle fingerprints in the
rooms' acoustic responses. Our extensive experiments show that a two-layer
convolutional neural network fed with the spectrogram of the inaudible echoes
achieves the best performance, compared with alternative designs using other raw
data formats and deep models. Based on this result, we design a RoomRecognize
cloud service and its mobile client library that enable mobile application
developers to readily implement room recognition functionality without
relying on any existing infrastructure or add-on hardware.
Extensive evaluation shows that RoomRecognize achieves 99.7%, 97.7%, 99%, and
89% accuracy in differentiating 22 and 50 residential/office rooms, 19 spots in
a quiet museum, and 15 spots in a crowded museum, respectively. Compared with
the state-of-the-art approaches based on support vector machine, RoomRecognize
significantly improves the Pareto frontier of recognition accuracy versus
robustness against interfering sounds (e.g., ambient music).
Comment: 29 pages
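The feature-extraction front end described above can be sketched as follows. This is a simplified stand-in, not the paper's pipeline: the sample rate, tone frequency, band limits, and synthetic "echo" signal are all assumptions, and a real system would feed the resulting narrowband spectrogram to the CNN classifier rather than print its shape:

```python
import numpy as np

FS = 48_000          # sample rate in Hz; assumed, not stated in the abstract
TONE_HZ = 20_000     # inaudible single-tone frequency; illustrative choice
REC_SECONDS = 0.1    # per the abstract, only 0.1 s of audio is recorded

def spectrogram(x, n_fft=512, hop=128):
    """Magnitude spectrogram via a Hann-windowed short-time Fourier transform."""
    window = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * window
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.asarray(frames), axis=1)).T  # (freq, time)

# Synthetic stand-in for a recorded response: the tone with an exponentially
# decaying envelope, instead of a real microphone capture of the room's echo.
t = np.arange(int(FS * REC_SECONDS)) / FS
echo = np.exp(-40 * t) * np.sin(2 * np.pi * TONE_HZ * t)

spec = spectrogram(echo)
freqs = np.fft.rfftfreq(512, d=1 / FS)
band = (freqs > 18_000) & (freqs < 22_000)    # narrow inaudible band around the tone
features = np.log1p(spec[band])               # log-magnitude input for a CNN
print(features.shape)
```

Keeping only the bins around the tone mirrors the privacy motivation in the abstract: the discarded audible band is never retained as a feature.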
Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e., audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.
Comment: 15 pages, 2 PDF figures
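Log-mel spectra, named above as a dominant feature representation across these application areas, can be computed with a short self-contained sketch. The FFT size, hop length, and filter count below are common choices but are assumptions, not values from the article:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel(signal, fs=16_000, n_fft=400, hop=160, n_mels=40):
    """Log-mel spectrogram: STFT power spectrum mapped through triangular mel filters."""
    window = np.hanning(n_fft)
    frames = np.asarray([signal[i:i + n_fft] * window
                         for i in range(0, len(signal) - n_fft + 1, hop)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2          # (time, n_fft//2 + 1)

    # Triangular filterbank with center frequencies equally spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)

    return np.log(power @ fbank.T + 1e-10)                    # (time, n_mels)

# One second of a 440 Hz tone as a stand-in for real audio.
t = np.arange(16_000) / 16_000
feats = log_mel(np.sin(2 * np.pi * 440.0 * t))
print(feats.shape)
```

A raw-waveform front end, the other representation highlighted in the review, would skip this stage entirely and let the first convolutional layers learn an analogous filterbank from the samples themselves.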