33 research outputs found

    Computer lipreading via hybrid deep neural network hidden Markov models

    Get PDF
    Constructing a viable lipreading system is a challenge because it is claimed that only 30% of information of speech production is visible on the lips. Nevertheless, in small vocabulary tasks, there have been several reports of high accuracies. However, investigation of larger vocabulary tasks is rare. This work examines constructing a large vocabulary lipreading system using an approach based-on Deep Neural Network Hidden Markov Models (DNN-HMMs). We present the historical development of computer lipreading technology and the state-ofthe-art results in small and large vocabulary tasks. In preliminary experiments, we evaluate the performance of lipreading and audiovisual speech recognition in small vocabulary data sets. We then concentrate on the improvement of lipreading systems in a more substantial vocabulary size with a multi-speaker data set. We tackle the problem of lipreading an unseen speaker. We investigate the effect of employing several stepstopre-processvisualfeatures. Moreover, weexaminethecontributionoflanguage modelling in a lipreading system where we use longer n-grams to recognise visual speech. Our lipreading system is constructed on the 6000-word vocabulary TCDTIMIT audiovisual speech corpus. The results show that visual-only speech recognition can definitely reach about 60% word accuracy on large vocabularies. We actually achieved a mean of 59.42% measured via three-fold cross-validation on the speaker independent setting of the TCD-TIMIT corpus using Deep autoencoder features and DNN-HMM models. This is the best word accuracy of a lipreading system in a large vocabulary task reported on the TCD-TIMIT corpus. In the final part of the thesis, we examine how the DNN-HMM model improves lipreading performance. We also give an insight into lipreading by providing a feature visualisation. Finally, we present an analysis of lipreading results and suggestions for future development

    Combining Residual Networks with LSTMs for Lipreading

    Full text link
    We propose an end-to-end deep learning architecture for word-level visual speech recognition. The system is a combination of spatiotemporal convolutional, residual and bidirectional Long Short-Term Memory networks. We train and evaluate it on the Lipreading In-The-Wild benchmark, a challenging database of 500-size target-words consisting of 1.28sec video excerpts from BBC TV broadcasts. The proposed network attains word accuracy equal to 83.0, yielding 6.8 absolute improvement over the current state-of-the-art, without using information about word boundaries during training or testing.Comment: Submitted to Interspeech 201

    Speaker-Independent Classification of Phonetic Segments from Raw Ultrasound in Child Speech

    Get PDF
    Ultrasound tongue imaging (UTI) provides a convenient way to visualize the vocal tract during speech production. UTI is increasingly being used for speech therapy, making it important to develop automatic methods to assist various time-consuming manual tasks currently performed by speech therapists. A key challenge is to generalize the automatic processing of ultrasound tongue images to previously unseen speakers. In this work, we investigate the classification of phonetic segments (tongue shapes) from raw ultrasound recordings under several training scenarios: speaker-dependent, multi-speaker, speaker-independent, and speaker-adapted. We observe that models underperform when applied to data from speakers not seen at training time. However, when provided with minimal additional speaker information, such as the mean ultrasound frame, the models generalize better to unseen speakers.Comment: 5 pages, 4 figures, published in ICASSP2019 (IEEE International Conference on Acoustics, Speech and Signal Processing, 2019

    Viseme-based Lip-Reading using Deep Learning

    Get PDF
    Research in Automated Lip Reading is an incredibly rich discipline with so many facets that have been the subject of investigation including audio-visual data, feature extraction, classification networks and classification schemas. The most advanced and up-to-date lip-reading systems can predict entire sentences with thousands of different words and the majority of them use ASCII characters as the classification schema. The classification performance of such systems however has been insufficient and the need to cover an ever expanding range of vocabulary using as few classes as possible is challenge. The work in this thesis contributes to the area concerning classification schemas by proposing an automated lip reading model that predicts sentences using visemes as a classification schema. This is an alternative schema to using ASCII characters, which is the conventional class system used to predict sentences. This thesis provides a review of the current trends in deep learning- based automated lip reading and analyses a gap in the research endeavours of automated lip-reading by contributing towards work done in the region of classification schema. A whole new line of research is opened up whereby an alternative way to do lip-reading is explored and in doing so, lip-reading performance results for predicting s entences from a benchmark dataset are attained which improve upon the current state-of-the-art. In this thesis, a neural network-based lip reading system is proposed. The system is lexicon-free and uses purely visual cues. With only a limited number of visemes as classes to recognise, the system is designed to lip read sentences covering a wide range of vocabulary and to recognise words that may not be included in system training. The lip-reading system predicts sentences as a two-stage procedure with visemes being recognised as the first stage and words being classified as the second stage. This is such that the second-stage has to both overcome the one-to-many mapping problem posed in lip-reading where one set of visemes can map to several words, and the problem of visemes being confused or misclassified to begin with. To develop the proposed lip-reading system, a number of tasks have been performed in this thesis. These include the classification of continuous sequences of visemes; and the proposal of viseme-to-word conversion models that are both effective in their conversion performance of predicting words, and robust to the possibility of viseme confusion or misclassification. The initial system reported has been testified on the challenging BBC Lip Reading Sentences 2 (LRS2) benchmark dataset attaining a word accuracy rate of 64.6%. Compared with the state-of-the-art works in lip reading sentences reported at the time, the system had achieved a significantly improved performance. The lip reading system is further improved upon by using a language model that has been demonstrated to be effective at discriminating between homopheme words and being robust to incorrectly classified visemes. An improved performance in predicting spoken sentences from the LRS2 dataset is yielded with an attained word accuracy rate of 79.6% which is still better than another lip-reading system trained and evaluated on the the same dataset that attained a word accuracy rate 77.4% and it is to the best of our knowledge the next best observed result attained on LRS2

    Silent versus modal multi-speaker speech recognition from ultrasound and video

    Get PDF
    We investigate multi-speaker speech recognition from ultrasound images of the tongue and video images of the lips. We train our systems on imaging data from modal speech, and evaluate on matched test sets of two speaking modes: silent and modal speech. We observe that silent speech recognition from imaging data underperforms compared to modal speech recognition, likely due to a speaking-mode mismatch between training and testing. We improve silent speech recognition performance using techniques that address the domain mismatch, such as fMLLR and unsupervised model adaptation. We also analyse the properties of silent and modal speech in terms of utterance duration and the size of the articulatory space. To estimate the articulatory space, we compute the convex hull of tongue splines, extracted from ultrasound tongue images. Overall, we observe that the duration of silent speech is longer than that of modal speech, and that silent speech covers a smaller articulatory space than modal speech. Although these two properties are statistically significant across speaking modes, they do not directly correlate with word error rates from speech recognition.Comment: 5 pages, 5 figures, Submitted to Interspeech 202

    Incorporating Ultrasound Tongue Images for Audio-Visual Speech Enhancement

    Full text link
    Audio-visual speech enhancement (AV-SE) aims to enhance degraded speech along with extra visual information such as lip videos, and has been shown to be more effective than audio-only speech enhancement. This paper proposes the incorporation of ultrasound tongue images to improve the performance of lip-based AV-SE systems further. To address the challenge of acquiring ultrasound tongue images during inference, we first propose to employ knowledge distillation during training to investigate the feasibility of leveraging tongue-related information without directly inputting ultrasound tongue images. Specifically, we guide an audio-lip speech enhancement student model to learn from a pre-trained audio-lip-tongue speech enhancement teacher model, thus transferring tongue-related knowledge. To better model the alignment between the lip and tongue modalities, we further propose the introduction of a lip-tongue key-value memory network into the AV-SE model. This network enables the retrieval of tongue features based on readily available lip features, thereby assisting the subsequent speech enhancement task. Experimental results demonstrate that both methods significantly improve the quality and intelligibility of the enhanced speech compared to traditional lip-based AV-SE baselines. Moreover, both proposed methods exhibit strong generalization performance on unseen speakers and in the presence of unseen noises. Furthermore, phone error rate (PER) analysis of automatic speech recognition (ASR) reveals that while all phonemes benefit from introducing ultrasound tongue images, palatal and velar consonants benefit most.Comment: Submmited to IEEE/ACM Transactions on Audio, Speech and Language Processing. arXiv admin note: text overlap with arXiv:2305.1493

    Audio-Visual Automatic Speech Recognition Using PZM, MFCC and Statistical Analysis

    Get PDF
    Audio-Visual Automatic Speech Recognition (AV-ASR) has become the most promising research area when the audio signal gets corrupted by noise. The main objective of this paper is to select the important and discriminative audio and visual speech features to recognize audio-visual speech. This paper proposes Pseudo Zernike Moment (PZM) and feature selection method for audio-visual speech recognition. Visual information is captured from the lip contour and computes the moments for lip reading. We have extracted 19th order of Mel Frequency Cepstral Coefficients (MFCC) as speech features from audio. Since all the 19 speech features are not equally important, therefore, feature selection algorithms are used to select the most efficient features. The various statistical algorithm such as Analysis of Variance (ANOVA), Kruskal-wallis, and Friedman test are employed to analyze the significance of features along with Incremental Feature Selection (IFS) technique. Statistical analysis is used to analyze the statistical significance of the speech features and after that IFS is used to select the speech feature subset. Furthermore, multiclass Support Vector Machine (SVM), Artificial Neural Network (ANN) and Naive Bayes (NB) machine learning techniques are used to recognize the speech for both the audio and visual modalities. Based on the recognition rate combined decision is taken from the two individual recognition systems. This paper compares the result achieved by the proposed model and the existing model for both audio and visual speech recognition. Zernike Moment (ZM) is compared with PZM and shows that our proposed model using PZM extracts better discriminative features for visual speech recognition. This study also proves that audio feature selection using statistical analysis outperforms methods without any feature selection technique

    Automatic Speech Recognition for Documenting Endangered First Nations Languages

    Get PDF
    Automatic speech recognition (ASR) for low-resource languages is an active field of research. Over the past years with the advent of deep learning, impressive achievements have been reported using minimal resources. As many of the world’s languages are getting extinct every year, with every dying language we lose intellect, culture, values, and tradition which generally pass down for long generations. Linguists throughout the world have already initiated many projects on language documentation to preserve such endangered languages. Automatic speech recognition is a solution to accelerate the documentation process reducing the annotation time for field linguists as well as the overall cost of the project. A traditional speech recognizer is trained on thousands of hours of acoustic data and a phonetic dictionary that includes all words from the language. End-to-End ASR systems have shown dramatic improvement for major languages. Especially, recent advancement in self-supervised representation learning which takes advantage of large corpora of untranscribed speech data has become the state-of-the-art for speech recognition technology. However, for resource-constrained languages, the technology is not tested in depth. In this thesis, we explore both traditional methods of ASR and state-of-the-art end-to-end systems for modeling a critically endangered Athabascan language known as Upper Tanana. In our first approach, we investigate traditional models with a comparative study on feature selection and a performance comparison with deep hybrid models. With limited resources at our disposal, we build a working ASR system based on a grapheme-to-phoneme (G2P) phonetic dictionary. The acoustic model can also be used as a separate forced alignment tool for the automatic alignment of training data. The results show that the GMM-HMM methods outperform deep hybrid models in low-resource acoustic modeling. In our second approach, we propose using Domain-adapted Cross-lingual Speech Recognition (DA-XLSR) for an ASR system, developed over the wav2vec 2.0 framework that utilizes pretrained transformer models leveraging cross lingual data for building an acoustic representation. The proposed system uses a multistage transfer learning process in order to fine tune the final model. To supplement the limited data, we compile a data augmentation strategy combining six augmentation techniques. The speech model uses Connectionist Temporal Classification (CTC) for an alignment free training and does not require any pronunciation dictionary or language model. Experiments from the second approach demonstrate that it can outperform the best traditional or end-to-end models in terms of word error rate (WER) and produce a powerful utterance level transcription. On top of that, the augmentation strategy is tested on several end-to-end models, and it provides a consistent improvement in performance. While the best proposed model can currently reduce the WER significantly, it may still require further research to completely replace the need for human transcribers
    corecore