39 research outputs found

    Viseme-based Lip-Reading using Deep Learning

    Research in automated lip reading is a rich discipline with many facets that have been the subject of investigation, including audio-visual data, feature extraction, classification networks and classification schemas. The most advanced lip-reading systems can predict entire sentences spanning thousands of different words, and the majority of them use ASCII characters as the classification schema. The classification performance of such systems, however, has been insufficient, and covering an ever-expanding vocabulary with as few classes as possible remains a challenge. The work in this thesis contributes to the area of classification schemas by proposing an automated lip-reading model that predicts sentences using visemes as the classification schema, an alternative to the conventional ASCII-character class system. The thesis reviews current trends in deep learning-based automated lip reading and addresses a gap in the research by contributing work on classification schemas. A new line of research is opened up in which an alternative way to do lip-reading is explored, and in doing so, lip-reading results for predicting sentences from a benchmark dataset are attained that improve upon the then state-of-the-art. The proposed neural network-based lip-reading system is lexicon-free and uses purely visual cues. With only a limited number of visemes as classes to recognise, the system is designed to lip-read sentences covering a wide range of vocabulary and to recognise words that may not be included in system training. The system predicts sentences as a two-stage procedure, with visemes being recognised in the first stage and words being classified in the second.
The second stage must therefore overcome both the one-to-many mapping problem posed in lip-reading, where one set of visemes can map to several words, and the problem of visemes being confused or misclassified in the first place. To develop the proposed lip-reading system, a number of tasks have been performed in this thesis. These include the classification of continuous sequences of visemes, and the proposal of viseme-to-word conversion models that are both effective at predicting words and robust to viseme confusion or misclassification. The initial system was evaluated on the challenging BBC Lip Reading Sentences 2 (LRS2) benchmark dataset, attaining a word accuracy rate of 64.6%; compared with the state-of-the-art work in lip reading sentences reported at the time, the system achieved a significantly improved performance. The lip-reading system is further improved by using a language model demonstrated to be effective at discriminating between homopheme words and robust to incorrectly classified visemes, yielding an improved word accuracy rate of 79.6% on the LRS2 dataset. This is better than the 77.4% word accuracy rate of another lip-reading system trained and evaluated on the same dataset, and it is, to the best of our knowledge, the next best observed result attained on LRS2.
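The one-to-many mapping problem described above can be sketched in a few lines. This is an illustrative toy example, not the thesis code: the viseme labels and the tiny lexicon below are invented, and stand in for the first-stage output that the second-stage word classifier must disambiguate.

```python
# Toy illustration of the one-to-many viseme-to-word mapping problem.
# Viseme strings and the lexicon are invented for demonstration.
from collections import defaultdict

# Hypothetical mapping: visually similar phonemes collapse to the same
# viseme class, so several words can share one viseme sequence.
lexicon = {
    "time": "T-AH-M",   # toy viseme labels, not a real viseme alphabet
    "some": "T-AH-M",   # maps to the same visemes as "time"
    "dime": "T-AH-M",
    "ship": "CH-IH-P",
}

# Invert the lexicon: stage 1 (viseme recognition) produces a viseme
# sequence; stage 2 (word classification) must choose among candidates.
viseme_to_words = defaultdict(list)
for word, visemes in lexicon.items():
    viseme_to_words[visemes].append(word)

candidates = viseme_to_words["T-AH-M"]
print(sorted(candidates))  # ['dime', 'some', 'time']
```

A second-stage model (e.g. a language model over sentence context) then scores the candidate words to resolve the ambiguity.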

    Recurrent Neural Networks for Decoding Lip Read Speech

    The success of automated lip reading has been constrained by the inability to distinguish between homopheme words: words that have different characters, and are intrinsically different, yet produce the same lip movements (e.g. "time" and "some"). Different phonemes (units of sound) can often produce exactly the same viseme, the visual equivalent of a phoneme. Through the use of a Long Short-Term Memory network with word embeddings, we can distinguish between homopheme words, i.e. words that produce identical lip movements. The neural network architecture achieved a character accuracy rate of 77.1% and a word accuracy rate of 72.2%.
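The core idea of disambiguating homophemes through word embeddings can be sketched as follows. This is a minimal illustration under assumed values, not the paper's architecture: the 2-D "embeddings" and the "context" vector (which in the real system would come from a trained LSTM over the sentence) are invented for demonstration.

```python
# Toy sketch of context-based homopheme disambiguation: pick the candidate
# word whose embedding is most similar to a context vector. All vectors
# below are invented; a real system would use trained embeddings and an
# LSTM-produced context representation.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical 2-D embeddings for two homopheme candidates.
embeddings = {
    "time": (0.9, 0.1),
    "some": (0.1, 0.9),
}

# Assumed context vector for a sentence like "what ... is it?",
# which should lie closer to "time" than to "some".
context = (0.8, 0.2)

best = max(embeddings, key=lambda w: cosine(context, embeddings[w]))
print(best)  # time
```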

    Contour Mapping for Speaker-Independent Lip Reading System

    In this paper, we demonstrate how an existing deep learning architecture for automatically lip-reading individuals can be adapted so that it becomes speaker independent, and by doing so, improved accuracies can be achieved on a variety of different speakers. The architecture is multi-layered, consisting of a convolutional neural network, but by applying an initial edge detection-based stage to pre-process the image inputs so that only the contours remain, the architecture can be made less speaker-dependent. The neural network architecture achieves good accuracy rates when trained and tested on some of the same speakers in the "overlapped speakers" phase of the simulations, where word error rates of just 1.3% and 0.4% are achieved on two individual speakers, along with character error rates of 0.6% and 0.3%. The "unseen speakers" phase does not achieve accuracies as good, with higher word error rates of 20.6% and 17.0% on the two speakers, and character error rates of 11.5% and 8.3%. The variation in size and colour of different people's lips results in different outputs at the convolution layer of a convolutional neural network, as the output depends on the pixel intensities of the red, green and blue channels of an input image, so a convolutional neural network will naturally favour observations of the individual on whom the network was trained. This paper proposes an initial "contour mapping stage" that makes all inputs uniform so that the system can be speaker independent. Keywords: Lip Reading, Speech Recognition, Deep Learning, Facial Landmarks, Convolutional Neural Networks, Recurrent Neural Networks, Edge Detection, Contour Mapping.
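An edge detection-based pre-processing stage of the kind the paper motivates can be sketched with a plain Sobel operator: it discards colour and absolute intensity, keeping only contours. This is a generic illustration on a toy grayscale patch; the paper's actual "contour mapping stage" may use a different detector and parameters.

```python
# Minimal Sobel edge detector on a 2-D list of grayscale values.
# Illustrative only: the paper's contour mapping stage may differ.

def sobel_edges(img, threshold=2.0):
    """Return a binary edge map the same size as img (borders left 0)."""
    h, w = len(img), len(img[0])
    gx_k = [(-1, 0, 1), (-2, 0, 2), (-1, 0, 1)]   # horizontal gradient kernel
    gy_k = [(-1, -2, -1), (0, 0, 0), (1, 2, 1)]   # vertical gradient kernel
    edges = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    v = img[y + dy][x + dx]
                    gx += gx_k[dy + 1][dx + 1] * v
                    gy += gy_k[dy + 1][dx + 1] * v
            if (gx * gx + gy * gy) ** 0.5 > threshold:
                edges[y][x] = 1
    return edges

# Toy 5x5 patch: dark left half, bright right half -> one vertical edge.
patch = [[0, 0, 1, 1, 1] for _ in range(5)]
edge_map = sobel_edges(patch)
print(edge_map[2])  # [0, 1, 1, 0, 0]
```

Feeding such binary contour maps to the network removes the per-speaker colour and intensity variation that the convolution layer would otherwise latch onto.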

    A numerical study of dynamic capillary pressure effect for supercritical carbon dioxide-water flow in porous domain

    This is the accepted version of the following article: DAS, D.B. ... et al., 2014. A numerical study of dynamic capillary pressure effect for supercritical carbon dioxide-water flow in porous domain. AIChE Journal, 60 (12), pp. 4266-4278, which has been published in final form at http://dx.doi.org/10.1002/aic.14577
    Numerical simulations for core-scale capillary pressure (Pc)–saturation (S) relationships have been conducted for a supercritical carbon dioxide-water system at temperatures between 35°C and 65°C at a domain pressure of 15 MPa, as typically expected during geological sequestration of CO2. As the Pc-S relationships depend on both S and the time derivative of saturation (∂S/∂t), yielding what is known as the 'dynamic capillary pressure effect' or simply 'dynamic effect', this work specifically attempts to determine the significance of these effects for supercritical CO2-water flow in terms of a coefficient, namely the dynamic coefficient (τ). The coefficient establishes the speed at which capillary equilibrium for supercritical CO2-water flow is reached. The simulations in this work involved the solution of the extended version of Darcy's law, which represents the momentum balance for the individual fluid phases in the system, the continuity equation for fluid mass balance, as well as additional correlations for determining the capillary pressure as a function of saturation and the physical properties of the fluids as functions of temperature. The simulations were carried out for 3D cylindrical porous domains measuring 10 cm in diameter and 12 cm in height. τ was determined by measuring the slope of a best-fit straight line plotted between (i) the difference between the dynamic and equilibrium capillary pressures (Pc,dyn – Pc,equ) and (ii) the time derivative of saturation (∂S/∂t), both at the same saturation value. The results show rising trends for τ as the saturation values reduce, with noticeable impacts of temperature at 50% saturation of the aqueous phase.
This means that the time to attain capillary equilibrium for the CO2-water system increases as the saturation decreases. From a practical point of view, it implies that the time to capillary equilibrium during geological sequestration of CO2 is an important factor and should be accounted for when simulating the flow processes, e.g., to determine the CO2 storage capacity of a geological aquifer. In this task, one would require both a fundamental understanding of the dynamic capillary pressure effects for supercritical CO2-water flow and the τ values themselves. These issues are addressed in this article.
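The slope-fitting step used to extract τ can be sketched with ordinary least squares: plot (Pc,dyn − Pc,equ) against ∂S/∂t at a fixed saturation and take τ as minus the fitted slope, following the relation Pc,dyn − Pc,equ = −τ ∂S/∂t. The data below are synthetic, generated from an assumed τ of 5×10⁴ Pa·s, and are not results from the paper's simulations.

```python
# Extracting tau as the slope of a best-fit line through points of
# (Pc,dyn - Pc,equ) versus dS/dt. Synthetic data under an assumed tau.

def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

TAU_TRUE = 5.0e4                                   # Pa·s, assumed value
ds_dt = [-2.0e-4, -1.5e-4, -1.0e-4, -0.5e-4]       # dS/dt in 1/s (drainage)
d_pc = [-TAU_TRUE * r for r in ds_dt]              # Pc,dyn - Pc,equ in Pa

# Pc,dyn - Pc,equ = -tau * dS/dt, so tau is minus the fitted slope.
tau = -least_squares_slope(ds_dt, d_pc)
print(tau)  # 50000.0
```

With noisy simulation output, the same fit would recover τ approximately rather than exactly, at each saturation value of interest.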

    Deep Learning-based Automated Lip-Reading: A Survey

    A survey on automated lip-reading approaches is presented in this paper, with the main focus on deep learning-related methodologies, which have proven to be more fruitful for both feature extraction and classification. The survey also compares all the different components that make up automated lip-reading systems, including the audio-visual databases, feature extraction, classification networks and classification schemas. The main contributions and unique insights of this survey are: 1) a comparison of Convolutional Neural Networks with other neural network architectures for feature extraction; 2) a critical review of the advantages of Attention-Transformers and Temporal Convolutional Networks over Recurrent Neural Networks for classification; 3) a comparison of different classification schemas used for lip-reading, including ASCII characters, phonemes and visemes; and 4) a review of the most up-to-date lip-reading systems up until early 2021.

    Decoder-Encoder LSTM for Lip Reading

    The success of automated lip reading has been constrained by the inability to distinguish between homopheme words: words that have different characters, and are intrinsically different, yet produce the same lip movements (e.g. "time" and "some"). Different phonemes (units of sound) can often produce exactly the same viseme, the visual equivalent of a phoneme. Through the use of a Long Short-Term Memory network with word embeddings, we can distinguish between homopheme words, i.e. words that produce identical lip movements. The neural network architecture achieved a character accuracy rate of 77.1% and a word accuracy rate of 72.2%.