
    Lipreading with Long Short-Term Memory

    Full text link
    Lipreading, i.e. speech recognition from visual-only recordings of a speaker's face, can be achieved with a processing pipeline based solely on neural networks, yielding significantly better accuracy than conventional methods. Feed-forward and recurrent neural network layers (namely Long Short-Term Memory; LSTM) are stacked to form a single structure which is trained by back-propagating error gradients through all the layers. The performance of such a stacked network was experimentally evaluated and compared to a standard Support Vector Machine classifier using conventional computer vision features (Eigenlips and Histograms of Oriented Gradients). The evaluation was performed on data from 19 speakers of the publicly available GRID corpus. With 51 different words to classify, we report a best word accuracy on held-out evaluation speakers of 79.6% using the end-to-end neural network-based solution (an 11.6% improvement over the best feature-based solution evaluated). Comment: Accepted for publication at ICASSP 201
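
    A minimal PyTorch sketch of the kind of stacked network this abstract describes: frame-wise feed-forward layers followed by an LSTM, trained end-to-end to label one of the 51 GRID words. The layer sizes, the 40x40 mouth-crop input, and classifying from the last LSTM output are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class LipreadingLSTM(nn.Module):
    """Feed-forward frame encoder + LSTM word classifier (illustrative sizes)."""
    def __init__(self, frame_dim=40 * 40, hidden=128, lstm_hidden=128, n_words=51):
        super().__init__()
        # Frame-wise feed-forward layers applied to every video frame.
        self.frontend = nn.Sequential(
            nn.Linear(frame_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Recurrent layer that models the frame sequence.
        self.lstm = nn.LSTM(hidden, lstm_hidden, batch_first=True)
        self.classifier = nn.Linear(lstm_hidden, n_words)

    def forward(self, frames):               # frames: (batch, time, frame_dim)
        x = self.frontend(frames)            # (batch, time, hidden)
        out, _ = self.lstm(x)                # (batch, time, lstm_hidden)
        return self.classifier(out[:, -1])   # classify from the last time step

# Toy forward/backward pass: gradients flow through all stacked layers.
model = LipreadingLSTM()
frames = torch.randn(8, 25, 40 * 40)         # 8 clips, 25 frames, 40x40 mouth crops
labels = torch.randint(0, 51, (8,))
loss = nn.CrossEntropyLoss()(model(frames), labels)
loss.backward()
```

    Because all layers sit in one graph, the single backward pass trains the whole stack end-to-end, which is what the abstract contrasts with the SVM-on-handcrafted-features baseline.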

    Lip2AudSpec: Speech reconstruction from silent lip movements video

    Full text link
    In this study, we propose a deep neural network for reconstructing intelligible speech from silent lip-movement videos. We use the auditory spectrogram as the spectral representation of speech, together with its corresponding sound generation method, resulting in more natural-sounding reconstructed speech. Our proposed network consists of an autoencoder that extracts bottleneck features from the auditory spectrogram, which are then used as the target for our main lip-reading network comprising CNN, LSTM and fully connected layers. Our experiments show that the autoencoder is able to reconstruct the original auditory spectrogram with a 98% correlation and also improves the quality of the speech reconstructed by the main lip-reading network. Our model, trained jointly on different speakers, is able to extract individual speaker characteristics and gives promising results for reconstructing intelligible speech with superior word recognition accuracy.
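
    The two-stage design can be sketched as follows (a non-authoritative PyTorch illustration): an autoencoder compresses the auditory spectrogram into bottleneck codes, and a CNN+LSTM+fully-connected lip-reading network is trained to regress those codes from mouth frames. All dimensions, the 48x48 crops, and the one-to-one time alignment of video and spectrogram frames are assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

class SpectrogramAutoencoder(nn.Module):
    """Compresses an auditory spectrogram into a bottleneck code and back."""
    def __init__(self, spec_dim=128, bottleneck=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(spec_dim, 64), nn.ReLU(),
                                     nn.Linear(64, bottleneck))
        self.decoder = nn.Sequential(nn.Linear(bottleneck, 64), nn.ReLU(),
                                     nn.Linear(64, spec_dim))

    def forward(self, spec):                      # spec: (batch, time, spec_dim)
        code = self.encoder(spec)
        return self.decoder(code), code

class LipToBottleneck(nn.Module):
    """CNN + LSTM + fully connected layers mapping mouth frames to bottleneck codes."""
    def __init__(self, bottleneck=32):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(8), nn.Flatten())  # -> 16*8*8
        self.lstm = nn.LSTM(16 * 8 * 8, 128, batch_first=True)
        self.fc = nn.Linear(128, bottleneck)

    def forward(self, frames):                    # frames: (batch, time, 1, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)
        return self.fc(out)                       # one code per video frame

# Stage 1: train the autoencoder on spectrograms; stage 2: regress video -> codes.
spec = torch.randn(4, 25, 128)                    # 4 clips, 25 spectrogram frames
frames = torch.randn(4, 25, 1, 48, 48)            # matching 48x48 mouth crops (assumed aligned)
ae = SpectrogramAutoencoder()
recon, codes = ae(spec)
pred_codes = LipToBottleneck()(frames)
loss = nn.MSELoss()(recon, spec) + nn.MSELoss()(pred_codes, codes.detach())
```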

    Vid2speech: Speech Reconstruction from Silent Video

    Full text link
    Speechreading is a notoriously difficult task for humans to perform. In this paper we present an end-to-end model based on a convolutional neural network (CNN) for generating an intelligible acoustic speech signal from silent video frames of a speaking person. The proposed CNN generates sound features for each frame based on its neighboring frames. Waveforms are then synthesized from the learned speech features to produce intelligible speech. We show that by leveraging the automatic feature learning capabilities of a CNN, we can obtain state-of-the-art word intelligibility on the GRID dataset, and show promising results for learning out-of-vocabulary (OOV) words. Comment: Accepted for publication at ICASSP 201
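
    One way to read the per-frame prediction scheme is a CNN that consumes a short window of neighbouring frames and regresses an acoustic feature vector for the centre frame. The PyTorch sketch below assumes a 5-frame window and a 32-dimensional feature vector, both hypothetical, and leaves waveform synthesis (e.g. Griffin-Lim on a spectrogram representation) out.

```python
import torch
import torch.nn as nn

K = 5              # neighbouring frames per prediction window (assumed)
ACOUSTIC_DIM = 32  # acoustic feature vector predicted per centre frame (assumed)

# CNN over a stack of K grayscale mouth frames; the K frames are treated as
# input channels, and the output is one acoustic feature vector.
cnn = nn.Sequential(
    nn.Conv2d(K, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, ACOUSTIC_DIM),
)

def predict_sequence(frames):
    """frames: (time, H, W). Slide a K-frame window and predict one feature
    vector per window (i.e. per centre frame)."""
    windows = frames.unfold(0, K, 1)           # (time-K+1, H, W, K)
    windows = windows.permute(0, 3, 1, 2)      # (time-K+1, K, H, W)
    return cnn(windows)                        # (time-K+1, ACOUSTIC_DIM)

features = predict_sequence(torch.randn(75, 64, 64))
# A waveform would then be synthesised from `features` (not shown here).
print(features.shape)                          # torch.Size([71, 32])
```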

    Visual units and confusion modelling for automatic lip-reading

    Get PDF
    Automatic lip-reading (ALR) is a challenging task because the visual speech signal is known to be missing some important information, such as voicing. We propose an approach to ALR that acknowledges that this information is missing but assumes that it is substituted or deleted in a systematic way that can be modelled. We describe a system that learns such a model and then incorporates it into decoding, which is realised as a cascade of weighted finite-state transducers. Our results show a small but statistically significant improvement in recognition accuracy. We also investigate the issue of suitable visual units for ALR, and show that visemes are sub-optimal, not because they introduce lexical ambiguity, but because the reduction in modelling units entailed by their use reduces accuracy.
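
    The confusion-modelling idea can be illustrated without the full WFST machinery: score each lexicon entry against the recognised visual-unit sequence under a substitution model and pick the best. The Python sketch below uses a hand-written confusion table, a toy lexicon, and substitutions only (no insertions or deletions), so it is a deliberate simplification of the learned transducer cascade the paper actually builds.

```python
import math

# Hypothetical confusion model: P(observed visual unit | intended unit).
# In the paper this role is played by a learned transducer in a WFST cascade;
# here it is reduced to a hand-written, roughly normalised table.
CONFUSION = {
    ("p", "p"): 0.5, ("p", "b"): 0.3, ("p", "m"): 0.2,   # /p b m/ look alike
    ("b", "b"): 0.5, ("b", "p"): 0.3, ("b", "m"): 0.2,
    ("m", "m"): 0.5, ("m", "p"): 0.25, ("m", "b"): 0.25,
    ("a", "a"): 0.9, ("a", "e"): 0.1,
    ("e", "e"): 0.9, ("e", "a"): 0.1,
    ("t", "t"): 0.6, ("t", "d"): 0.4,
    ("d", "d"): 0.6, ("d", "t"): 0.4,
}

def neg_log_likelihood(observed, intended):
    """Score an observed unit sequence against an intended pronunciation,
    assuming unit-by-unit systematic substitution only."""
    if len(observed) != len(intended):
        return math.inf
    score = 0.0
    for o, i in zip(observed, intended):
        p = CONFUSION.get((o, i), 1e-6)   # unmodelled confusions get a small floor
        score += -math.log(p)
    return score

# Toy lexicon of pronunciations (sequences of units).
LEXICON = {"bat": ["b", "a", "t"], "mad": ["m", "a", "d"], "pet": ["p", "e", "t"]}

observed = ["p", "a", "t"]                # what the visual front end recognised
best = min(LEXICON, key=lambda w: neg_log_likelihood(observed, LEXICON[w]))
print(best)   # "bat": under these weights /p/ is most likely a misread /b/
```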

    Automatic Visual Speech Recognition

    Get PDF
    Intelligent Systems; Electrical Engineering, Mathematics and Computer Science

    Lip-Reading with Visual Form Classification using Residual Networks and Bidirectional Gated Recurrent Units

    Get PDF
    Lip-reading is a method that focuses on the observation and interpretation of lip movements to understand spoken language. Previous studies have concentrated exclusively on a single variant of residual networks (ResNets). This study primarily aimed to conduct a comparative analysis of several types of ResNets. It additionally reports metrics for the word structures in the GRID dataset (verbs, colors, prepositions, letters, and numerals), an aspect not investigated in previous studies. The proposed approach encompasses several stages: pre-processing, which involves face detection and mouth localization, feature extraction, and classification. The feature-extraction architecture comprises a 3-dimensional convolutional neural network (3D-CNN) integrated with ResNets. Temporal sequences are handled in the classification phase by a bidirectional gated recurrent unit (Bi-GRU) model. The experimental results demonstrated a character error rate (CER) of 14.09% and a word error rate (WER) of 28.51%. The combination of 3D-CNN ResNet-34 and Bi-GRU outperformed ResNet-18 and ResNet-50; greater network depth was not consistently associated with better lip-reading performance, although the additional trained parameters offer certain benefits. Moreover, the model distinguished the different word structures more accurately than human professionals. DOI: 10.28991/HIJ-2023-04-02-010
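
    The reported figures are standard edit-distance metrics: WER is the word-level Levenshtein distance between reference and hypothesis divided by the reference length, and CER is the same computation at character level. A generic Python implementation (the GRID-style sentence below is a made-up example, not from the paper):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of tokens or characters."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[m][n]

def wer(ref_sentence, hyp_sentence):
    ref, hyp = ref_sentence.split(), hyp_sentence.split()
    return edit_distance(ref, hyp) / max(len(ref), 1)

def cer(ref_sentence, hyp_sentence):
    return edit_distance(list(ref_sentence), list(hyp_sentence)) / max(len(ref_sentence), 1)

# GRID-style command sentence with one substituted word in the hypothesis.
ref = "place red at f two now"
hyp = "place red by f two now"
print(f"WER = {wer(ref, hyp):.2%}, CER = {cer(ref, hyp):.2%}")
```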

    Lip-Listening: Mixing Senses to Understand Lips using Cross Modality Knowledge Distillation for Word-Based Models

    Full text link
    In this work, we propose a technique to transfer speech recognition capabilities from audio speech recognition systems to visual speech recognizers, where our goal is to utilize audio data during lipreading model training. Impressive progress in the domain of speech recognition has been exhibited by audio and audio-visual systems. Nevertheless, there is still much to be explored with regard to visual speech recognition systems due to the visual ambiguity of some phonemes. At the same time, the development of visual speech recognition models is crucial given the instability of audio models. The main contributions of this work are i) building on recent state-of-the-art word-based lipreading models by integrating sequence-level and frame-level Knowledge Distillation (KD) into their systems; ii) leveraging audio data during the training of visual models, which has not been done in prior word-based work; iii) proposing Gaussian-shaped averaging in frame-level KD as an efficient technique that aids the model in distilling knowledge at the sequence-model encoder. This work proposes a novel and competitive architecture for lip-reading, as we demonstrate a noticeable improvement in performance, setting a new benchmark of 88.64% on the LRW dataset. Comment: arXiv admin note: text overlap with arXiv:2108.0354
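
    A hedged reading of frame-level KD with Gaussian-shaped averaging, sketched in PyTorch: each visual-encoder frame is pulled towards a Gaussian-weighted average of audio-teacher frames centred on its alignment position. The linear alignment, kernel width, feature sizes, and MSE objective below are all assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def gaussian_targets(teacher_feats, n_video_frames, sigma=1.0):
    """Build one distillation target per video frame as a Gaussian-weighted
    average of the (typically more numerous) audio teacher frames.
    teacher_feats: (T_audio, D). Returns (n_video_frames, D)."""
    t_audio = teacher_feats.shape[0]
    centres = torch.linspace(0, t_audio - 1, n_video_frames)  # assumed linear alignment
    positions = torch.arange(t_audio, dtype=torch.float32)
    # (n_video_frames, T_audio) Gaussian weights, normalised per video frame.
    w = torch.exp(-0.5 * ((positions[None, :] - centres[:, None]) / sigma) ** 2)
    w = w / w.sum(dim=1, keepdim=True)
    return w @ teacher_feats

def frame_level_kd_loss(student_feats, teacher_feats, sigma=1.0):
    """student_feats: (T_video, D) from the visual encoder;
    teacher_feats: (T_audio, D) from a pretrained audio encoder."""
    targets = gaussian_targets(teacher_feats, student_feats.shape[0], sigma)
    return F.mse_loss(student_feats, targets)

# Toy example: 25 video frames distilling from 100 audio frames.
student = torch.randn(25, 256, requires_grad=True)
teacher = torch.randn(100, 256)
loss = frame_level_kd_loss(student, teacher)
loss.backward()
```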