    End-to-end Audiovisual Speech Activity Detection with Bimodal Recurrent Neural Models

    Speech activity detection (SAD) plays an important role in current speech processing systems, including automatic speech recognition (ASR). SAD is particularly difficult in environments with acoustic noise. A practical solution is to incorporate visual information, increasing the robustness of the SAD approach. An audiovisual system has the advantage of being robust to different speech modes (e.g., whisper speech) or background noise. Recent advances in audiovisual speech processing using deep learning have opened opportunities to capture in a principled way the temporal relationships between acoustic and visual features. This study explores this idea by proposing a bimodal recurrent neural network (BRNN) framework for SAD. The approach models the temporal dynamics of the sequential audiovisual data, improving the accuracy and robustness of the proposed SAD system. Instead of estimating hand-crafted features, the study investigates an end-to-end training approach, where acoustic and visual features are directly learned from the raw data during training. The experimental evaluation considers a large audiovisual corpus with over 60.8 hours of recordings, collected from 105 speakers. The results demonstrate that the proposed framework leads to absolute improvements of up to 1.2% under practical scenarios over an audio-only VAD baseline implemented with a deep neural network (DNN). The proposed approach achieves a 92.7% F1-score when evaluated using the sensors from a portable tablet in a noisy acoustic environment, which is only 1.0% lower than the performance obtained under ideal conditions (e.g., clean speech obtained with a high definition camera and a close-talking microphone). Comment: Submitted to Speech Communication.

    Speech Recognition

    Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation of speech signals and the methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition: speaker identification and tracking, prosody modeling in emotion-detection systems, and other applications that are able to operate in real-world environments, such as mobile communication services and smart homes.

    Efficient speaker recognition for mobile devices

    Probabilistic modelling and inference of human behaviour from mobile phone time series

    With an estimated 4.1 billion subscribers around the world, the mobile phone offers a unique opportunity to sense and understand human behaviour from location, co-presence and communication data. While the benefit of modelling this unprecedented amount of data is widely recognised, a number of challenges impede the development of accurate behaviour models. In this thesis, we identify and address two modelling problems and show that their consideration improves the accuracy of behaviour inference. We first examine the modelling of long-range dependencies in human behaviour. Existing human behaviour models only take into account short-range dependencies in mobile phone time series. Using information theory, we quantify long-range dependencies in mobile phone time series for the first time, demonstrate that they exhibit periodic oscillations and introduce novel tools to analyse them. We further show that considering what the user did 24 hours earlier improves accuracy when predicting user behaviour five hours or longer in advance. The second problem that we address is the modelling of temporal variations in human behaviour. The time spent by a user on an activity varies from one day to the next. In order to recognise behaviour patterns despite temporal variations, we establish a methodological connection between human behaviour modelling and biological sequence alignment. This connection allows us to compare, cluster and model behaviour sequences and to introduce novel features for behaviour recognition which improve its accuracy. The experiments presented in this thesis have been conducted on the largest publicly available mobile phone dataset labelled in an unsupervised fashion and are entirely repeatable. Furthermore, our techniques only require cellular data which can easily be recorded by today's mobile phones and could benefit a wide range of applications including life logging, health monitoring, customer profiling and large-scale surveillance.
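
The abstract above does not say which information-theoretic measure the thesis uses. As an illustration only, one common way to quantify a dependency at a given time lag is the mutual information between a symbolic time series and a copy of itself shifted by that lag. The sketch below assumes that measure; the function name `lagged_mutual_information` and the toy hourly labels are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def lagged_mutual_information(series, lag):
    """Mutual information (in bits) between series[t] and series[t + lag].

    Illustrative plug-in estimate from empirical frequencies; not the
    thesis's method, which the abstract does not specify.
    """
    if lag <= 0 or lag >= len(series):
        raise ValueError("lag must be between 1 and len(series) - 1")
    # Pair each observation with the one `lag` steps later.
    pairs = list(zip(series[:-lag], series[lag:]))
    n = len(pairs)
    joint = Counter(pairs)
    past = Counter(x for x, _ in pairs)
    future = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) )
        mi += (count / n) * math.log2(count * n / (past[x] * future[y]))
    return mi


# Hypothetical data: one location label per hour, repeating every 24 hours.
day = ["home"] * 8 + ["work"] * 9 + ["gym"] * 2 + ["home"] * 5
hourly_labels = day * 14  # two weeks of identical days

mi_24h = lagged_mutual_information(hourly_labels, 24)
mi_5h = lagged_mutual_information(hourly_labels, 5)
print(f"lag 24h: {mi_24h:.3f} bits, lag 5h: {mi_5h:.3f} bits")
```

On this perfectly periodic toy series the 24-hour lag carries more information than the 5-hour lag, which is the kind of periodic structure the abstract describes. Real mobile phone data would be noisier, and a plug-in estimate like this one is biased for short series.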