End-to-end Audiovisual Speech Activity Detection with Bimodal Recurrent Neural Models
Speech activity detection (SAD) plays an important role in current speech
processing systems, including automatic speech recognition (ASR). SAD is
particularly difficult in environments with acoustic noise. A practical
solution is to incorporate visual information, increasing the robustness of the
SAD approach. An audiovisual system has the advantage of being robust to
different speech modes (e.g., whisper speech) or background noise. Recent
advances in audiovisual speech processing using deep learning have opened
opportunities to capture in a principled way the temporal relationships between
acoustic and visual features. This study explores this idea by proposing a
bimodal recurrent neural network (BRNN) framework for SAD. The approach
models the temporal dynamics of the sequential audiovisual data, improving the
accuracy and robustness of the proposed SAD system. Instead of estimating
hand-crafted features, the study investigates an end-to-end training approach,
where acoustic and visual features are directly learned from the raw data
during training. The experimental evaluation considers a large audiovisual
corpus with over 60.8 hours of recordings, collected from 105 speakers. The
results demonstrate that the proposed framework yields absolute improvements
of up to 1.2% under practical scenarios over an audio-only VAD baseline
implemented with a deep neural network (DNN). The proposed approach achieves
a 92.7% F1-score when evaluated using the sensors of a portable tablet
in a noisy acoustic environment, which is only 1.0% lower than the performance
obtained under ideal conditions (e.g., clean speech obtained with a
high-definition camera and a close-talking microphone). Comment: Submitted to Speech Communicatio
Speech Recognition
Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation of speech signals and the methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems, and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes.
Probabilistic modelling and inference of human behaviour from mobile phone time series
With an estimated 4.1 billion subscribers around the world, the mobile phone offers a unique
opportunity to sense and understand human behaviour from location, co-presence and communication
data. While the benefit of modelling this unprecedented amount of data is widely
recognised, a number of challenges impede the development of accurate behaviour models. In
this thesis, we identify and address two modelling problems and show that their consideration
improves the accuracy of behaviour inference.
We first examine the modelling of long-range dependencies in human behaviour. Human behaviour
models only take into account short-range dependencies in mobile phone time series.
Using information theory, we quantify long-range dependencies in mobile phone time series for
the first time, demonstrate that they exhibit periodic oscillations and introduce novel tools to
analyse them. We further show that considering what the user did 24 hours earlier improves
accuracy when predicting user behaviour five hours or longer in advance.
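As an illustration of how a long-range dependency might be quantified, the sketch below computes the mutual information between a symbolic time series and a copy of itself shifted by a given lag. The abstract does not name the information-theoretic measure used in the thesis, so mutual information is an assumption here, and the function name, the hourly "home"/"work" labels and the toy data are all hypothetical.

```python
from collections import Counter
from math import log2


def lagged_mutual_information(series, lag):
    """Mutual information (in bits) between series[t] and series[t + lag]."""
    if lag <= 0 or lag >= len(series):
        raise ValueError("lag must be between 1 and len(series) - 1")
    # Pair each observation with the one `lag` steps later.
    pairs = list(zip(series[:-lag], series[lag:]))
    n = len(pairs)
    joint = Counter(pairs)
    past = Counter(x for x, _ in pairs)
    future = Counter(y for _, y in pairs)
    # Sum p(x, y) * log2(p(x, y) / (p(x) * p(y))) over observed pairs.
    return sum(
        (c / n) * log2(c * n / (past[x] * future[y]))
        for (x, y), c in joint.items()
    )


# Hypothetical toy data: one location label per hour for 30 identical days.
day = ["home"] * 8 + ["work"] * 9 + ["home"] * 7
series = day * 30

mi_by_lag = {lag: lagged_mutual_information(series, lag) for lag in (6, 12, 24, 48)}
```

On this artificial, perfectly periodic series the value at a 24-hour lag is larger than at 6 or 12 hours and recurs at 48 hours, which is the kind of periodic oscillation the paragraph above describes. Real mobile phone traces would be noisier and would need far more data per lag for a reliable estimate.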
The second problem that we address is the modelling of temporal variations in human behaviour.
The time spent by a user on an activity varies from one day to the next. In order to
recognise behaviour patterns despite temporal variations, we establish a methodological connection
between human behaviour modelling and biological sequence alignment. This connection
allows us to compare, cluster and model behaviour sequences and introduce novel features for
behaviour recognition which improve its accuracy.
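To make the connection to biological sequence alignment concrete, the sketch below scores two behaviour sequences with Needleman-Wunsch global alignment, a standard algorithm from that field. The abstract does not say which alignment method the thesis uses, so this choice, the scoring parameters and the one-letter activity codes (H for home, W for work, one per time slot) are assumptions made for illustration only.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score between two label sequences."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            # Best of: align x with y, gap in b, gap in a.
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def positional_score(a, b, match=1, mismatch=-1):
    """Slot-by-slot comparison with no gaps, for equal-length sequences."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))


# Hypothetical days: the same routine, with the second day shifted by one slot.
day_a = "HHWWWWHH"
day_b = "HHHWWWWH"

aligned = alignment_score(day_a, day_b)
rigid = positional_score(day_a, day_b)
```

Because the alignment may insert gaps, the shifted day scores higher under `alignment_score` than under the rigid slot-by-slot comparison, which is how alignment tolerates day-to-day variation in the time spent on an activity. Pairwise scores of this kind could then feed a clustering step, although how the thesis does this is not stated in the abstract.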
The experiments presented in this thesis have been conducted on the largest publicly available
mobile phone dataset labelled in an unsupervised fashion and are entirely repeatable. Furthermore,
our techniques only require cellular data which can easily be recorded by today's mobile
phones and could benefit a wide range of applications including life logging, health monitoring,
customer profiling and large-scale surveillance.