85 research outputs found

Lipreading with Long Short-Term Memory

Author: Koutník Jan
Schmidhuber Jürgen
Wand Michael
Publication date: 29/01/2016

Lipreading, i.e. speech recognition from visual-only recordings of a speaker's face, can be achieved with a processing pipeline based solely on neural networks, yielding significantly better accuracy than conventional methods. Feed-forward and recurrent neural network layers (namely Long Short-Term Memory; LSTM) are stacked to form a single structure which is trained by back-propagating error gradients through all the layers. The performance of such a stacked network was experimentally evaluated and compared to a standard Support Vector Machine classifier using conventional computer vision features (Eigenlips and Histograms of Oriented Gradients). The evaluation was performed on data from 19 speakers of the publicly available GRID corpus. With 51 different words to classify, we report a best word accuracy on held-out evaluation speakers of 79.6% using the end-to-end neural network-based solution (11.6% improvement over the best feature-based solution evaluated). Comment: Accepted for publication at ICASSP 201

arXiv.org e-Print Archive

Crossref

Audio-Visual Speaker Identification using the CUAVE Database

Author: Dean David
Lucey Patrick
Sridharan Subramanian
Publication venue: AVSP '05
Publication date: 01/01/2005

The freely available nature of the CUAVE database allows it to provide a valuable platform to form benchmarks and compare research. This paper shows that the CUAVE database can successfully be used to test speaker identification systems, with performance comparable to existing systems implemented on other databases. Additionally, this research shows that the optimal configuration for decision fusion of an audio-visual speaker identification system relies heavily on the video modality in all but clean speech conditions.
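The abstract does not say which decision-fusion rule the authors used. As a hedged illustration only, the sketch below shows one common scheme, a weighted sum of per-speaker scores from the two modalities. The function names, the `audio_weight` parameter, and the score values are all assumptions made for this example, not details from the paper.

```python
def fuse_scores(audio_scores, video_scores, audio_weight):
    """Weighted-sum late fusion of per-speaker scores (e.g. log-likelihoods).

    audio_weight = 1.0 trusts audio only; 0.0 trusts video only.
    This is an illustrative scheme, not the paper's confirmed method.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    # Only speakers scored by both modalities can be fused.
    speakers = audio_scores.keys() & video_scores.keys()
    return {
        s: audio_weight * audio_scores[s] + (1.0 - audio_weight) * video_scores[s]
        for s in speakers
    }


def identify(audio_scores, video_scores, audio_weight):
    """Return the speaker with the highest fused score."""
    fused = fuse_scores(audio_scores, video_scores, audio_weight)
    return max(fused, key=fused.get)


# Made-up scores: noisy audio slightly prefers s2, video clearly prefers s1.
audio = {"s1": -5.0, "s2": -4.0}
video = {"s1": -1.0, "s2": -6.0}
print(identify(audio, video, audio_weight=0.2))  # video-dominated fusion
print(identify(audio, video, audio_weight=1.0))  # audio-only decision
```

Under these made-up scores, lowering `audio_weight` lets the video modality override an unreliable audio decision, which is the kind of behaviour the abstract reports for noisy speech.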

CiteSeerX

Queensland University of Technology ePrints Archive

Visual Speech Recognition

Author: Ahmad B. A. Hassanat
Publication venue: IntechOpen
Publication date: 21/06/2011

Lip reading is used to understand or interpret speech without hearing it, a technique especially mastered by people with hearing difficulties. The ability to lip read enables a person with a hearing impairment to communicate with others and to engage in social activities, which otherwise would be difficult. Recent advances in the fields of computer vision, pattern recognition, and signal processing have led to a growing interest in automating this challenging task of lip reading. Indeed, automating the human ability to lip read, a process referred to as visual speech recognition (VSR) (or sometimes speech reading), could open the door for other novel related applications. VSR has received a great deal of attention in the last decade for its potential use in applications such as human-computer interaction (HCI), audio-visual speech recognition (AVSR), speaker recognition, talking heads, sign language recognition and video surveillance. Its main aim is to recognise spoken word(s) by using only the visual signal that is produced during speech. Hence, VSR deals with the visual domain of speech and involves image processing, artificial intelligence, object detection, pattern recognition, statistical modelling, etc. Comment: Speech and Language Technologies (Book), Prof. Ivo Ipsic (Ed.), ISBN: 978-953-307-322-4, InTech (2011)

IntechOpen

arXiv.org e-Print Archive

Crossref

Comparing phonemes and visemes with DNN-based lipreading

Author: Bear Y.
Harvey Richard
Thangthai Kwanchiva
Publication venue: BMVA Press
Publication date: 01/01/2017

There is debate over whether phoneme or viseme units are the most effective for a lipreading system. Some studies use phoneme units even though phonemes describe unique short sounds; other studies tried to improve lipreading accuracy by focusing on visemes with varying results. We compare the performance of a lipreading system by modeling visual speech using either 13 viseme or 38 phoneme units. We report the accuracy of our system at both word and unit levels. The evaluation task is large vocabulary continuous speech using the TCD-TIMIT corpus. We complete our visual speech modeling via hybrid DNN-HMMs and our visual speech decoder is a Weighted Finite-State Transducer (WFST). We use DCT and Eigenlips as a representation of the mouth ROI image. The phoneme lipreading system word accuracy outperforms the viseme-based system word accuracy. However, the phoneme system achieved lower accuracy at the unit level, which shows the importance of the dictionary for decoding classification outputs into words.
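A small sketch may help show why the dictionary matters when units are collapsed from phonemes to visemes. The mapping below is a hypothetical many-to-one table with invented labels `V1` to `V3`. It is not the paper's 13-viseme set, and the toy lexicon is likewise an assumption made for illustration.

```python
from collections import defaultdict

# Hypothetical many-to-one phoneme-to-viseme map (labels are illustrative).
PHONEME_TO_VISEME = {
    "p": "V1", "b": "V1", "m": "V1",   # bilabials look alike on the lips
    "ae": "V2",
    "t": "V3", "d": "V3", "n": "V3",
}

# Toy pronunciation dictionary (assumed for this example).
LEXICON = {
    "pat": ["p", "ae", "t"],
    "bat": ["b", "ae", "t"],
    "mat": ["m", "ae", "t"],
    "bad": ["b", "ae", "d"],
}


def to_visemes(phonemes):
    """Map a phoneme sequence to its viseme sequence."""
    return tuple(PHONEME_TO_VISEME[p] for p in phonemes)


def homophene_groups(lexicon):
    """Group words whose pronunciations collapse to the same viseme sequence."""
    groups = defaultdict(list)
    for word, phonemes in lexicon.items():
        groups[to_visemes(phonemes)].append(word)
    return {seq: sorted(words) for seq, words in groups.items()}


for seq, words in homophene_groups(LEXICON).items():
    print(seq, words)
```

In this toy setup all four words share one viseme sequence, so a decoder that recognises every viseme correctly still cannot choose the word without help from the dictionary and language model.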

UEL Research Repository at University of East London

Computer lipreading via hybrid deep neural network hidden Markov models

Author: Thangthai Kwanchiva
Publication date: 01/11/2018

Constructing a viable lipreading system is a challenge because it is claimed that only 30% of information of speech production is visible on the lips. Nevertheless, in small vocabulary tasks, there have been several reports of high accuracies. However, investigation of larger vocabulary tasks is rare. This work examines constructing a large vocabulary lipreading system using an approach based on Deep Neural Network Hidden Markov Models (DNN-HMMs). We present the historical development of computer lipreading technology and the state-of-the-art results in small and large vocabulary tasks. In preliminary experiments, we evaluate the performance of lipreading and audiovisual speech recognition in small vocabulary data sets. We then concentrate on the improvement of lipreading systems in a more substantial vocabulary size with a multi-speaker data set. We tackle the problem of lipreading an unseen speaker. We investigate the effect of employing several steps to pre-process visual features. Moreover, we examine the contribution of language modelling in a lipreading system where we use longer n-grams to recognise visual speech. Our lipreading system is constructed on the 6000-word vocabulary TCD-TIMIT audiovisual speech corpus. The results show that visual-only speech recognition can definitely reach about 60% word accuracy on large vocabularies. We actually achieved a mean of 59.42% measured via three-fold cross-validation on the speaker independent setting of the TCD-TIMIT corpus using deep autoencoder features and DNN-HMM models. This is the best word accuracy of a lipreading system in a large vocabulary task reported on the TCD-TIMIT corpus. In the final part of the thesis, we examine how the DNN-HMM model improves lipreading performance. We also give an insight into lipreading by providing a feature visualisation. Finally, we present an analysis of lipreading results and suggestions for future development.
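The abstract reports word accuracy but does not state how it was scored. A commonly used definition is (N - S - D - I) / N, where N is the number of reference words and S, D, I are substitutions, deletions, and insertions from a minimum edit-distance alignment. The sketch below assumes that definition; the function names and example sentences are invented for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (S + D + I at minimum)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_accuracy(reference, hypothesis):
    """Word accuracy (N - S - D - I) / N; can be negative with many insertions."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return (len(ref_words) - edit_distance(ref_words, hyp_words)) / len(ref_words)


# Invented example: one of six reference words is dropped by the recogniser.
print(word_accuracy("bin blue at f two now", "bin blue at two now"))
```

A cross-validated figure such as the reported three-fold mean would then be the average of this quantity over the folds, assuming each fold is scored the same way.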

University of East Anglia digital repository

Lip region feature extraction analysis by means of stochastic variability modeling

Author: Cárdenas Peña David Augusto
Publication date: 01/01/2011

In this thesis, different lip-region characterization techniques used to model lip dynamics were analyzed. To carry out the analysis, a database of video sequences of the pronunciation of the Spanish alphabet was built and used to train a visual speech recognition system with several feature extraction methodologies. The aim of the experiment is to evaluate the ability of each feature set to model lip movement accurately. Appearance-based, shape-based, and spatiotemporal feature extraction methodologies were tested. The reported results identify the spatiotemporal features as the best descriptors, among those evaluated, of visual speech dynamics. Maestrí

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Universidad Nacional De Colombia - Repositorio Institucional UN