
    Lip-Reading with Visual Form Classification using Residual Networks and Bidirectional Gated Recurrent Units

    Lip-reading is a method that focuses on the observation and interpretation of lip movements to understand spoken language. Previous studies have concentrated exclusively on a single variant of residual networks (ResNets). This study's primary aim was a comparative analysis of several ResNet types. It additionally reports metrics for the word structures in the GRID dataset, encompassing verbs, colors, prepositions, letters, and numerals, a component not previously investigated in other studies. The proposed approach comprises several stages: pre-processing, which involves face detection and mouth localization; feature extraction; and classification. The feature-extraction architecture combines a 3-dimensional convolutional neural network (3D-CNN) with ResNets, and temporal sequences in the classification phase are handled by a bidirectional gated recurrent unit (Bi-GRU) model. The experimental results showed a character error rate (CER) of 14.09% and a word error rate (WER) of 28.51%. The combination of 3D-CNN ResNet-34 and Bi-GRU outperformed ResNet-18 and ResNet-50; greater network depth did not consistently improve lip-reading performance, although the additional trained parameters offer certain benefits. The model also achieved higher precision than human professionals in distinguishing the diverse word structures. DOI: 10.28991/HIJ-2023-04-02-010
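
    As an illustration of the architecture described in this abstract, the following is a minimal PyTorch sketch of a 3D-CNN front-end feeding a per-frame ResNet-34 trunk and a Bi-GRU sequence model. The layer sizes, vocabulary size, and CTC-style output head are assumptions made for the sketch, not the paper's exact configuration.

```python
# Minimal sketch of a 3D-CNN + ResNet-34 + Bi-GRU lip-reading model.
# Hyperparameters (kernel sizes, hidden width, vocab_size) are
# illustrative assumptions, not the paper's reported configuration.
# Requires torch and torchvision.
import torch
import torch.nn as nn
import torchvision.models as models

class LipReadingNet(nn.Module):
    def __init__(self, vocab_size=28, hidden=256):
        super().__init__()
        # 3D convolution captures short-range spatiotemporal motion.
        self.front3d = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7),
                      stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2),
                         padding=(0, 1, 1)),
        )
        # Per-frame 2D ResNet-34 trunk; the first conv is replaced to
        # accept the 64-channel output of the 3D front-end.
        resnet = models.resnet34(weights=None)
        resnet.conv1 = nn.Conv2d(64, 64, kernel_size=7, stride=2,
                                 padding=3, bias=False)
        resnet.fc = nn.Identity()  # keep the 512-d frame features
        self.trunk2d = resnet
        # Bi-GRU models the temporal sequence of frame features.
        self.bigru = nn.GRU(512, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, vocab_size)  # e.g. CTC logits

    def forward(self, x):                      # x: (B, 1, T, H, W)
        x = self.front3d(x)                    # (B, 64, T, H', W')
        b, c, t, h, w = x.shape
        x = x.transpose(1, 2).reshape(b * t, c, h, w)
        x = self.trunk2d(x).reshape(b, t, -1)  # (B, T, 512)
        x, _ = self.bigru(x)                   # (B, T, 2*hidden)
        return self.head(x)                    # per-frame logits

# Example: a batch of 2 grayscale mouth-crop clips, 30 frames of 96x96.
logits = LipReadingNet()(torch.randn(2, 1, 30, 96, 96))
print(logits.shape)  # torch.Size([2, 30, 28])
```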

    A new visual speech modelling approach for visual speech recognition

    In this paper we propose a new learning-based representation, referred to as the Visual Speech Unit (VSU), for visual speech recognition (VSR). The VSU concept extends the standard viseme model currently applied in VSR by including not only the data associated with the visemes but also the transitory information between consecutive visemes. The developed speech recognition system consists of several computational stages: (a) lips segmentation, (b) construction of Expectation-Maximization Principal Component Analysis (EM-PCA) manifolds from the input video image, (c) registration between the VSU models and the EM-PCA data constructed from the input image sequence, and (d) recognition of the VSUs using a standard Hidden Markov Model (HMM) classification scheme. We were particularly interested in evaluating the classification accuracy obtained with the new VSU models compared with that attained with standard (MPEG-4) viseme models. The experimental results indicate a 90% recognition rate when the system was applied to the identification of 60 classes of VSUs, while the recognition rate for the standard set of MPEG-4 visemes was only 52%.
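
    The recognition stages (b) and (d) can be sketched as below: sequences are projected onto a low-dimensional manifold and each class is modelled by its own HMM, with classification by maximum log-likelihood. Plain sklearn PCA stands in for the paper's EM-PCA, hmmlearn's GaussianHMM for its HMM scheme, and the data and hyperparameters are toy assumptions.

```python
# Hedged sketch of a manifold + HMM recognition pipeline: project
# frame sequences with PCA (substituting for the paper's EM-PCA) and
# classify with one Gaussian HMM per VSU class. All data below is
# fabricated toy data; shapes and hyperparameters are assumptions.
# Requires numpy, scikit-learn, and hmmlearn.
import numpy as np
from sklearn.decomposition import PCA
from hmmlearn import hmm

rng = np.random.default_rng(0)

# Toy stand-in data: 20 training sequences per class, each a sequence
# of 30 flattened lip-region feature vectors of dimension 64.
def toy_sequences(offset, n_seq=20, t=30, d=64):
    return [offset + rng.normal(size=(t, d)) for _ in range(n_seq)]

classes = {"vsu_a": toy_sequences(0.0), "vsu_b": toy_sequences(2.0)}

# Fit the low-dimensional manifold on all training frames.
all_frames = np.vstack([s for seqs in classes.values() for s in seqs])
pca = PCA(n_components=8).fit(all_frames)

# Train one HMM per VSU class on the projected sequences.
models = {}
for label, seqs in classes.items():
    proj = [pca.transform(s) for s in seqs]
    X, lengths = np.vstack(proj), [len(p) for p in proj]
    m = hmm.GaussianHMM(n_components=3, covariance_type="diag",
                        n_iter=50, random_state=0)
    m.fit(X, lengths)
    models[label] = m

# Recognition: score an unseen sequence under every class model and
# pick the class with the highest log-likelihood.
test = pca.transform(toy_sequences(2.0, n_seq=1)[0])
scores = {label: m.score(test) for label, m in models.items()}
print(max(scores, key=scores.get))  # expected: vsu_b
```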

    Improving Phoneme to Viseme Mapping for Indonesian Language

    Lip synchronization in animation can be automated through a phoneme-to-viseme map. Because the complexity of the facial muscles makes mouth shapes vary greatly, phoneme-to-viseme mapping poses persistent challenges, one of which is the allophone vowel problem: the resemblance between allophones leads many researchers to cluster them into a single class. This paper examines whether vowel allophones should be treated as distinct variables in the phoneme-to-viseme map. The proposed method pre-processes vowel allophones by extracting formant frequency features and comparing them with a t-test to determine whether their differences are significant. The pre-processing results are then used as reference data when building the phoneme-to-viseme maps. The research was conducted on maps and allophones of the Indonesian language. The resulting maps were compared with other maps using the HMM method, evaluated by word correctness and accuracy. The results show that viseme mapping preceded by allophonic pre-processing performs more accurately than the other maps.
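
    The pre-processing step described here can be illustrated with a short sketch: compare the formant frequencies of two allophones with a two-sample t-test and merge them into one viseme class only if the difference is not significant. The formant values below are fabricated toy data, not measurements of Indonesian vowels, and the significance threshold is an assumption.

```python
# Illustrative sketch of allophone pre-processing via a two-sample
# t-test on formant frequencies. Toy data, not real measurements.
# Requires numpy and scipy.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)

# Toy F2 measurements (Hz) for two allophones of the same vowel.
allophone_1_f2 = rng.normal(1900, 80, size=40)
allophone_2_f2 = rng.normal(1930, 80, size=40)

t_stat, p_value = ttest_ind(allophone_1_f2, allophone_2_f2)

ALPHA = 0.05  # assumed significance level
if p_value < ALPHA:
    print(f"p={p_value:.3f}: keep distinct classes in the viseme map")
else:
    print(f"p={p_value:.3f}: merge allophones into one viseme class")
```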

    Audio-Visual Speech Recognition using Red Exclusion and Neural Networks


    Lip Synchrony of Rounded and Protruded Vowels and Diphthongs in the Lithuanian-Dubbed Animated Film “Cloudy with a Chance of Meatballs 2”

    In this article, the problems of dubbing, especially those related to lip synchrony as one of the most challenging aspects of audiovisual translation, are scrutinised. Contrary to the traditional focus on bilabials and open vowels, the object of this research is lip synchrony of rounded and protruded vowels and diphthongs, since lip rounding is a visibly marked feature that cannot be neglected, especially in close-ups. The study aims to determine the inaccuracies in lip synchrony of this phonemic group in the animated feature film Cloudy with a Chance of Meatballs 2 dubbed from English into Lithuanian. Qualitative and quantitative analysis is carried out using a comparative method, with a methodology based on the theoretical insights and assumptions of Frederic Chaume (2004, 2006, 2012), Richard Barsam & Dave Monahan (2010), and Indrė Koverienė (2015). The findings demonstrate the main issues of lip synchrony a translator might face while adapting audiovisual material for the target-language audience, and provide insights into the quality of the overall translation of the chosen film.