6 research outputs found

    Transforming unstructured voice and text data into insight for paramedic emergency service using recurrent and convolutional neural networks

    Full text link
    Paramedics often have to make lifesaving decisions within a limited time in an ambulance. They sometimes ask the doctor for additional medical instructions, during which valuable time passes for the patient. This study aims to automatically fuse voice and text data to provide tailored situational awareness information to paramedics. To train and test speech recognition models, we built a bidirectional deep recurrent neural network (long short-term memory (LSTM)). We then used convolutional neural networks on top of custom-trained word vectors for sentence-level classification tasks. Each sentence is automatically categorized into one of four classes: patient status, medical history, treatment plan, and medication reminder. Subsequently, incident reports were automatically generated to extract keywords and assist paramedics and physicians in making decisions. The proposed system was found to provide timely medication notifications based on unstructured voice and text data, which is not currently possible in paramedic emergencies. In addition, the automatic incident report generation provided by the proposed system improves the routine but error-prone documentation tasks of paramedics and doctors, helping them focus on patient care.
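
    The abstract describes the sentence classifier only at a high level. Below is a minimal, hypothetical sketch of a convolutional classifier over word vectors for the four categories named above, assuming pretrained embeddings are loaded separately; the class labels, dimensions, and hyperparameters are illustrative assumptions, not the authors' configuration.

        # Hypothetical sketch: CNN over word vectors for four-way sentence
        # classification, in the spirit of the approach described above.
        # Embedding size, filter widths, and labels are illustrative assumptions.
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        CLASSES = ["patient_status", "medical_history", "treatment_plan", "medication_reminder"]

        class SentenceCNN(nn.Module):
            def __init__(self, vocab_size, embed_dim=300, num_filters=100,
                         kernel_sizes=(3, 4, 5), num_classes=len(CLASSES)):
                super().__init__()
                # In practice the weights would be initialised from custom-trained word vectors.
                self.embedding = nn.Embedding(vocab_size, embed_dim)
                self.convs = nn.ModuleList(
                    [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes]
                )
                self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

            def forward(self, token_ids):          # token_ids: (batch, seq_len)
                x = self.embedding(token_ids)      # (batch, seq_len, embed_dim)
                x = x.transpose(1, 2)              # (batch, embed_dim, seq_len)
                # Convolve, apply ReLU, then max-pool over time for each filter width.
                pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
                return self.fc(torch.cat(pooled, dim=1))   # unnormalised class scores

        model = SentenceCNN(vocab_size=20000)
        logits = model(torch.randint(0, 20000, (8, 40)))   # 8 sentences, 40 tokens each
        print(logits.shape)                                # torch.Size([8, 4])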

    Improving Lip-reading Performance for Robust Audiovisual Speech Recognition using DNNs

    No full text
    This paper presents preliminary experiments using the Kaldi toolkit to investigate audiovisual speech recognition (AVSR) in noisy environments using deep neural networks (DNNs). In particular, we use a single-speaker, large-vocabulary, continuous audiovisual speech corpus to compare the performance of visual-only, audio-only and audiovisual speech recognition. The models trained using the Kaldi toolkit are compared with the performance of models trained using conventional hidden Markov models (HMMs). In addition, we compare the performance of a speech recognizer both with and without visual features over nine different SNR levels of babble noise, ranging from 20 dB down to -20 dB. The results show that the DNN outperforms conventional HMMs in all experimental conditions, especially for the lip-reading-only system, which achieves a gain of 37.19% accuracy (84.67% absolute word accuracy). Moreover, the DNN provides an effective improvement of 10 dB and 12 dB SNR for the unimodal and bimodal speech recognition systems, respectively. However, integrating the visual features using simple feature fusion is only effective at SNRs of 5 dB and above; below this, the degradation in accuracy of the audiovisual system is similar to that of the audio-only recognizer. Index Terms: lip-reading, speech reading, audiovisual speech recognition
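
    "Simple feature fusion" is named but not specified in the abstract; the sketch below illustrates one common reading of it, frame-level concatenation of audio and visual feature vectors before the acoustic model. The feature dimensions and synthetic data are assumptions; the paper's actual front end and models are built with the Kaldi toolkit.

        # Illustrative sketch of frame-level feature fusion for AVSR.
        # Dimensions and random data are placeholders, not the paper's setup.
        import numpy as np

        def fuse_features(audio_feats: np.ndarray, visual_feats: np.ndarray) -> np.ndarray:
            """Concatenate per-frame audio and visual features.

            audio_feats:  (num_frames, audio_dim), e.g. MFCCs
            visual_feats: (num_frames, visual_dim), e.g. lip-region features,
                          assumed already interpolated to the audio frame rate.
            """
            assert audio_feats.shape[0] == visual_feats.shape[0], "streams must be frame-aligned"
            return np.hstack([audio_feats, visual_feats])

        # Example: 100 frames of 13-dim MFCCs fused with 30-dim visual features.
        audio = np.random.randn(100, 13)
        visual = np.random.randn(100, 30)
        print(fuse_features(audio, visual).shape)   # (100, 43)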

    Visual speech recognition: from traditional to deep learning frameworks

    Get PDF
    Speech is the most natural means of communication for humans. Therefore, since the beginning of computing it has been a goal to interact with machines via speech. While there have been gradual improvements in this field over the decades, and recent drastic progress means more and more commercial software that allows voice commands is available, there are still many ways in which it can be improved. One way to do this is with visual speech information, more specifically, the visible articulations of the mouth. Based on the information contained in these articulations, visual speech recognition (VSR) transcribes an utterance from a video sequence. It thus helps extend speech recognition from audio-only to other scenarios, such as silent or whispered speech (e.g. in cybersecurity), mouthings in sign language, as an additional modality in noisy audio scenarios for audio-visual automatic speech recognition, to better understand speech production and disorders, or by itself for human-machine interaction and as a transcription method. In this thesis, we present and compare different ways to build systems for VSR: we start with the traditional hidden Markov models that have been used in the field for decades, especially in combination with handcrafted features. These are compared to models that take into account recent developments in the fields of computer vision and speech recognition through deep learning. While their superior performance is confirmed, certain limitations of these systems with respect to computing power are also discussed. This thesis also addresses multi-view processing and fusion, which is an important topic for many current applications, since a single camera view often cannot provide enough flexibility with speakers moving in front of the camera. Technology companies are willing to integrate more cameras into their products, such as cars and mobile devices, due to lower hardware costs for both cameras and processing units, as well as the availability of higher processing power and high-performance algorithms. Multi-camera and multi-view solutions are thus becoming more common, and algorithms can benefit from taking them into account. In this work we propose several methods of fusing the views of multiple cameras to improve the overall results. We show that both relying on deep learning-based approaches for feature extraction and sequence modelling and taking into account the complementary information contained in several views improve performance considerably. To further improve the results, it would be necessary to move from data recorded in a lab environment to multi-view data in realistic scenarios. Furthermore, the findings and models could be transferred to other domains, such as audio-visual speech recognition or the study of speech production and disorders.
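
    The thesis explores fusion at several levels; as one concrete illustration, the sketch below shows a simple decision-level variant, a weighted average of per-view class posteriors. The weights, class counts, and values are illustrative assumptions rather than results or methods from the thesis.

        # Hypothetical sketch: decision-level fusion of per-view posteriors.
        import numpy as np

        def fuse_view_posteriors(view_posteriors, view_weights=None):
            """Combine per-view class posteriors of shape (num_views, num_classes)."""
            posteriors = np.asarray(view_posteriors, dtype=float)
            if view_weights is None:
                # Default to equal weighting of all camera views.
                view_weights = np.ones(posteriors.shape[0]) / posteriors.shape[0]
            fused = np.average(posteriors, axis=0, weights=view_weights)
            return fused / fused.sum()   # renormalise to a valid distribution

        frontal = [0.6, 0.3, 0.1]   # posteriors from a frontal camera (made up)
        profile = [0.4, 0.4, 0.2]   # posteriors from a profile camera (made up)
        print(fuse_view_posteriors([frontal, profile], view_weights=[0.7, 0.3]))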

    Reconstruction of intelligible audio speech from visual speech information

    Get PDF
    The aim of the work conducted in this thesis is to reconstruct audio speech signals using information which can be extracted solely from a visual stream of a speaker's face, with applications in surveillance scenarios and silent speech interfaces. Visual speech is limited to that which can be seen of the mouth, lips, teeth, and tongue, and these visual articulators convey considerably less information than the audio domain, which makes the task difficult. Accordingly, the emphasis is on the reconstruction of intelligible speech, with less regard given to quality. A speech production model is used to reconstruct audio speech, and methods are presented in this work for generating or estimating the necessary parameters for the model. Three approaches are explored for producing spectral-envelope estimates from visual features, as this parameter provides the greatest contribution to speech intelligibility. The first approach uses regression to perform the visual-to-audio mapping; two further approaches are then explored using vector quantisation techniques and classification models, with long-range temporal information incorporated at the feature and model level. Excitation information, namely fundamental frequency and aperiodicity, is generated using artificial methods and joint-feature clustering approaches. Evaluations are first performed using mean squared error analyses and objective measures of speech intelligibility to refine the various system configurations, and then subjective listening tests are conducted to determine the word-level accuracy of reconstructed speech, giving real intelligibility scores. The best performing visual-to-audio domain mapping approach, using a clustering-and-classification framework with feature-level temporal encoding, achieves audio-only intelligibility scores of 77% and audiovisual intelligibility scores of 84% on the GRID dataset. Furthermore, the methods are applied to a larger and more continuous dataset, with less favourable results, but with the belief that extensions to the work presented will yield a further increase in intelligibility.
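
    Of the three spectral-envelope approaches mentioned, the regression mapping is the simplest to illustrate; the sketch below fits a linear least-squares map from visual features to envelope coefficients. The dimensions and synthetic data are assumptions, and the thesis's vector quantisation and clustering-and-classification frameworks are not shown.

        # Hypothetical sketch: linear regression from visual features to
        # spectral-envelope parameters, the simplest visual-to-audio baseline.
        import numpy as np

        rng = np.random.default_rng(0)
        num_frames, visual_dim, envelope_dim = 500, 40, 25

        # Stand-ins for parallel training data: visual features (e.g. lip-region
        # descriptors) and the corresponding spectral-envelope coefficients.
        V = rng.standard_normal((num_frames, visual_dim))
        S = rng.standard_normal((num_frames, envelope_dim))

        # Fit S ≈ [V, 1] W by least squares, including a bias column.
        V_aug = np.hstack([V, np.ones((num_frames, 1))])
        W, *_ = np.linalg.lstsq(V_aug, S, rcond=None)

        def predict_envelope(visual_frame: np.ndarray) -> np.ndarray:
            """Map one visual feature vector to an estimated spectral envelope."""
            return np.hstack([visual_frame, 1.0]) @ W

        print(predict_envelope(V[0]).shape)   # (25,)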