Lipreading with Long Short-Term Memory
Lipreading, i.e. speech recognition from visual-only recordings of a
speaker's face, can be achieved with a processing pipeline based solely on
neural networks, yielding significantly better accuracy than conventional
methods. Feed-forward and recurrent neural network layers (namely Long
Short-Term Memory; LSTM) are stacked to form a single structure which is
trained by back-propagating error gradients through all the layers. The
performance of such a stacked network was experimentally evaluated and compared
to a standard Support Vector Machine classifier using conventional computer
vision features (Eigenlips and Histograms of Oriented Gradients). The
evaluation was performed on data from 19 speakers of the publicly available
GRID corpus. With 51 different words to classify, we report a best word
accuracy on held-out evaluation speakers of 79.6% using the end-to-end neural
network-based solution (11.6% improvement over the best feature-based solution
evaluated).
Comment: Accepted for publication at ICASSP 201
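As a rough illustration of the stacked architecture this abstract describes, here is a minimal PyTorch sketch of feed-forward layers feeding an LSTM and a 51-way word classifier, trained end-to-end by back-propagating error gradients through all layers. All layer sizes and names are illustrative assumptions, not the authors' exact topology.

```python
import torch
import torch.nn as nn

# Sketch: feed-forward front end for per-frame features, an LSTM over the
# frame sequence, and a softmax over the 51 GRID word classes.
class LipreadingNet(nn.Module):
    def __init__(self, input_dim=40 * 40, hidden_dim=128, num_words=51):
        super().__init__()
        self.frontend = nn.Sequential(          # feed-forward feature layers
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        self.lstm = nn.LSTM(128, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_words)

    def forward(self, frames):                  # frames: (batch, time, input_dim)
        b, t, d = frames.shape
        feats = self.frontend(frames.reshape(b * t, d)).reshape(b, t, -1)
        outputs, _ = self.lstm(feats)
        return self.classifier(outputs[:, -1])  # classify from the last time step

# Gradients flow through classifier, LSTM, and front end jointly:
model = LipreadingNet()
logits = model(torch.randn(2, 25, 1600))        # 2 clips of 25 mouth-region frames
loss = nn.CrossEntropyLoss()(logits, torch.tensor([3, 17]))
loss.backward()
```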
Improving computer lipreading via DNN sequence discriminative training techniques
Although there have been some promising results in computer lipreading, there has been a paucity of data on which to train automatic systems. However, the recent emergence of the TCD-TIMIT corpus, with around 6000 words, 59 speakers and seven hours of recorded audio-visual speech, allows the deployment of more recent techniques from audio speech recognition, such as Deep Neural Networks (DNNs) and sequence discriminative training. In this paper we combine the DNN with a Hidden Markov Model (HMM) in the so-called hybrid DNN-HMM configuration, which we train using a variety of sequence discriminative training methods; decoding then uses a weighted finite state transducer. The conclusion is that the DNN offers a very substantial improvement over a conventional classifier which uses a Gaussian Mixture Model (GMM) to model the densities, even when the GMM is optimised with Speaker Adaptive Training. Sequence discriminative training offers further gains, depending on the precise variety employed, on the order of 10% in word accuracy. Taken together, these results imply that lipreading is moving from something of rather esoteric interest to becoming a practical reality in the foreseeable future.
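For intuition, a minimal sketch of the hybrid DNN-HMM scoring step: the DNN's per-frame state posteriors are divided by the state priors to obtain scaled likelihoods that replace the GMM densities during HMM decoding. The toy posteriors, priors, and dimensions below are assumptions for illustration only.

```python
import numpy as np

# In the hybrid configuration the DNN emits P(state | frame); dividing by
# the state prior P(state) gives a quantity proportional to P(frame | state),
# which stands in for the GMM density in the HMM.
def scaled_log_likelihoods(log_posteriors, log_priors):
    # log P(frame|state) = log P(state|frame) - log P(state) + const.
    return log_posteriors - log_priors

num_states, num_frames = 5, 3
posteriors = np.random.dirichlet(np.ones(num_states), size=num_frames)  # toy DNN output
priors = np.full(num_states, 1.0 / num_states)                          # toy state priors

loglikes = scaled_log_likelihoods(np.log(posteriors), np.log(priors))
# These frame-level scores would then feed HMM decoding, composed with a
# weighted finite state transducer for the lexicon and language model.
print(loglikes.shape)  # (3, 5): frames x states
```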
Multimodal Based Audio-Visual Speech Recognition for Hard-of-Hearing: State of the Art Techniques and Challenges
Multimodal Integration (MI) is the study of merging the knowledge acquired by the nervous system through sensory modalities such as speech, vision, touch, and gesture. The applications of MI span Audio-Visual Speech Recognition (AVSR), Sign Language Recognition (SLR), Emotion Recognition (ER), Biometric Applications (BMA), Affect Recognition (AR), Multimedia Retrieval (MR), etc. Fused modality pairs such as hand gesture and face, or lip and hand position, are the sensory combinations most commonly used in the development of hearing-impaired multimodal systems. This paper provides an overview of the multimodal systems in the literature aimed at hearing-impaired studies, and also discusses some studies on hearing-impaired acoustic analysis. It is observed that far fewer algorithms have been developed for hearing-impaired AVSR than for normal-hearing AVSR; the study of audio-visual speech recognition systems for the hearing impaired is therefore in high demand, particularly for people trying to communicate in natively spoken languages. This paper also highlights the state-of-the-art techniques in AVSR and the challenges researchers face in developing AVSR systems.
Hearing Lips: Improving Lip Reading by Distilling Speech Recognizers
Lip reading has witnessed unparalleled development in recent years thanks to deep learning and the availability of large-scale datasets. Despite the encouraging results achieved, the performance of lip reading unfortunately remains inferior to that of its counterpart, speech recognition, owing to the ambiguous nature of lip actuations, which makes it challenging to extract discriminant features from lip movement videos. In this paper, we propose a new method, termed Lip by Speech (LIBS), whose goal is to strengthen lip reading by learning from speech recognizers. The rationale behind our approach is that features extracted from speech recognizers may provide complementary and discriminant clues, which are hard to obtain from the subtle movements of the lips, and consequently facilitate the training of lip readers. This is achieved, specifically, by distilling multi-granularity knowledge from speech recognizers to lip readers. To conduct this cross-modal knowledge distillation, we utilize an efficacious alignment scheme to handle the inconsistent lengths of the audios and videos, as well as an innovative filtering strategy to refine the speech recognizer's predictions. The proposed method achieves new state-of-the-art performance on the CMLR and LRS2 datasets, outperforming the baseline by margins of 7.66% and 2.75% in character error rate, respectively.
Comment: AAAI 202
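A minimal sketch of what such cross-modal, multi-granularity distillation could look like: a frozen speech-recognizer teacher supervises the lip-reader student's features at both sequence and frame level. The linear-interpolation alignment and plain MSE losses below are simplifying assumptions standing in for the paper's alignment and filtering schemes.

```python
import torch
import torch.nn.functional as F

# Distill teacher (audio-rate) features into student (video-rate) features
# at two granularities: whole-sequence and per-frame.
def distill_loss(student_feats, teacher_feats, w_seq=1.0, w_frame=1.0):
    # student_feats: (T_video, D); teacher_feats: (T_audio, D).
    # Align the audio-rate teacher to the video length by interpolation
    # (an assumption; the paper uses its own alignment scheme).
    aligned = F.interpolate(teacher_feats.t().unsqueeze(0),
                            size=student_feats.size(0),
                            mode="linear", align_corners=False)[0].t()
    seq_loss = F.mse_loss(student_feats.mean(dim=0), aligned.mean(dim=0))
    frame_loss = F.mse_loss(student_feats, aligned)
    return w_seq * seq_loss + w_frame * frame_loss

student = torch.randn(75, 256, requires_grad=True)   # lip-reader features, 25 fps
teacher = torch.randn(300, 256)                      # speech features, 100 fps
distill_loss(student, teacher).backward()            # gradients reach the student
```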
Audio-visual speech processing system for Polish applicable to human-computer interaction
This paper describes an audio-visual speech recognition (AVSR) system for the Polish language and a set of performance tests under various acoustic conditions. We first present the overall structure of AVSR systems, with three main areas: audio feature extraction, visual feature extraction and, subsequently, audio-visual speech integration. We present MFCC features for the audio stream with the standard HMM modelling technique, then describe appearance- and shape-based visual features. Subsequently we present two feature integration techniques: feature concatenation and model fusion. We also discuss the results of a set of experiments conducted to select the best system setup for Polish under noisy audio conditions. The experiments simulate human-computer interaction in a computer-control scenario with voice commands in difficult audio environments. With an Active Appearance Model (AAM) and a multistream Hidden Markov Model (HMM), the system's accuracy improves, reducing the Word Error Rate by more than 30% compared to audio-only speech recognition when the Signal-to-Noise Ratio drops to 0 dB.
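As a concrete sketch of the first of the two integration techniques, feature concatenation simply stacks frame-aligned audio and visual features into one observation vector for a single HMM. The dimensions below (13 MFCCs, 20 AAM parameters) are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

# Feature concatenation: fuse per-frame audio and visual features into one
# observation stream for a single HMM.
def concatenate_features(mfcc, visual):
    # mfcc: (T, 13), visual: (T, 20) -> fused: (T, 33)
    assert mfcc.shape[0] == visual.shape[0], "streams must be frame-aligned"
    return np.hstack([mfcc, visual])

T = 100  # frames
fused = concatenate_features(np.random.randn(T, 13), np.random.randn(T, 20))
print(fused.shape)  # (100, 33)

# Model (multistream) fusion, by contrast, keeps separate audio and visual
# streams and combines their log-likelihoods with stream weights, roughly
#   log b(o_t) = w_a * log b_a(o_t_audio) + w_v * log b_v(o_t_visual),
# which allows w_a to be reduced as the acoustic SNR drops.
```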
RobustL2S: Speaker-Specific Lip-to-Speech Synthesis exploiting Self-Supervised Representations
Significant progress has been made in speaker-dependent Lip-to-Speech
synthesis, which aims to generate speech from silent videos of talking faces.
Current state-of-the-art approaches primarily employ non-autoregressive
sequence-to-sequence architectures to directly predict mel-spectrograms or
audio waveforms from lip representations. We hypothesize that the direct
mel-prediction hampers training/model efficiency due to the entanglement of
speech content with ambient information and speaker characteristics. To this
end, we propose RobustL2S, a modularized framework for Lip-to-Speech synthesis.
First, a non-autoregressive sequence-to-sequence model maps self-supervised
visual features to a representation of disentangled speech content. A vocoder
then converts the speech features into raw waveforms. Extensive evaluations
confirm the effectiveness of our setup, achieving state-of-the-art performance
on the unconstrained Lip2Wav dataset and the constrained GRID and TCD-TIMIT
datasets. Speech samples from RobustL2S can be found at
https://neha-sherin.github.io/RobustL2S
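A minimal sketch of the modular two-stage design the abstract outlines: a non-autoregressive encoder maps self-supervised visual features to speech-content features, and a separate vocoder maps those to a waveform. The module shapes, Transformer settings, and toy linear "vocoder" are assumptions for illustration, not the RobustL2S implementation.

```python
import torch
import torch.nn as nn

# Stage 1: non-autoregressive seq2seq from visual features to content features.
class ContentPredictor(nn.Module):
    def __init__(self, vis_dim=512, content_dim=256):
        super().__init__()
        self.proj = nn.Linear(vis_dim, content_dim)
        layer = nn.TransformerEncoderLayer(content_dim, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, vis_feats):                # (batch, T, vis_dim)
        return self.encoder(self.proj(vis_feats))

# Stage 2: vocoder from content features to a raw waveform (a neural vocoder
# in practice; a linear upsampler here as a stand-in).
class ToyVocoder(nn.Module):
    def __init__(self, content_dim=256, hop=320):
        super().__init__()
        self.out = nn.Linear(content_dim, hop)    # hop samples per content frame

    def forward(self, content):                   # (batch, T, content_dim)
        return self.out(content).flatten(1)       # (batch, T * hop)

video_feats = torch.randn(1, 75, 512)             # e.g. 3 s of video at 25 fps
wave = ToyVocoder()(ContentPredictor()(video_feats))
print(wave.shape)                                 # torch.Size([1, 24000])
```

Keeping the content predictor and the vocoder as separate modules is what lets the first stage target disentangled speech content rather than a full mel-spectrogram, which is the efficiency argument the abstract makes.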