92 research outputs found

    Visual Speech Recognition with Lightweight Psychologically Motivated Gabor Features

    Extraction of relevant lip features is of continuing interest in the visual speech domain. Using end-to-end feature extraction can produce good results, but at the cost of the results being difficult for humans to comprehend and relate to. We present a new, lightweight feature extraction approach, motivated by human-centric glimpse based psychological research into facial barcodes, and demonstrate that these simple, easy to extract 3D geometric features (produced using Gabor based image patches), can successfully be used for speech recognition with LSTM based machine learning. This approach can successfully extract low dimensionality lip parameters with a minimum of processing. One key difference between using these Gabor-based features and using other features such as traditional DCT, or the current fashion for CNN features is that these are human-centric features that can be visualised and analysed by humans. This means that it is easier to explain and visualise the results. They can also be used for reliable speech recognition, as demonstrated using the Grid corpus. Results for overlapping speakers using our lightweight system gave a recognition rate of over 82%, which compares well to less explainable features in the literature.
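As a rough illustration of what a Gabor-based image patch involves, the sketch below builds a standard 2D Gabor kernel (a Gaussian envelope multiplied by a cosine carrier) and computes its response on a same-sized patch. This is a generic textbook formulation, not the authors' feature extraction pipeline; the function names `gabor_kernel` and `gabor_response` and all parameter values are assumptions chosen for illustration.

```python
import numpy as np


def gabor_kernel(size=21, wavelength=8.0, theta=0.0, sigma=4.0, gamma=0.5, psi=0.0):
    """Real part of a 2D Gabor filter (hypothetical helper, illustrative defaults)."""
    half = size // 2
    # y varies along rows, x along columns.
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # Rotate coordinates so the carrier wave runs along direction theta.
    x_r = x * np.cos(theta) + y * np.sin(theta)
    y_r = -x * np.sin(theta) + y * np.cos(theta)
    # Gaussian envelope, stretched along y_r by the aspect ratio gamma.
    envelope = np.exp(-(x_r ** 2 + (gamma * y_r) ** 2) / (2.0 * sigma ** 2))
    # Cosine carrier with the given wavelength and phase offset.
    carrier = np.cos(2.0 * np.pi * x_r / wavelength + psi)
    return envelope * carrier


def gabor_response(patch, kernel):
    """Filter response at the patch centre: the sum of the elementwise product."""
    return float(np.sum(patch * kernel))


# A patch whose intensity varies along x with the kernel's wavelength.
xs = np.arange(-10, 11, dtype=float)
patch = np.tile(np.cos(2.0 * np.pi * xs / 8.0), (21, 1))
r_0 = gabor_response(patch, gabor_kernel(theta=0.0))
r_90 = gabor_response(patch, gabor_kernel(theta=np.pi / 2))
```

In this toy setup the kernel aligned with the stripe pattern (`theta=0.0`) responds far more strongly than the one rotated by 90 degrees, which is the kind of orientation-selective, human-inspectable signal the abstract alludes to.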

    Gabor-based audiovisual fusion for Mandarin Chinese speech recognition

    Audiovisual Speech Recognition (AVSR) is a popular research topic, and incorporating visual features into speech recognition systems has been found to deliver good results. In recent years, end-to-end Convolutional Neural Network (CNN) based deep learning has been widely utilized. However, these models often require big data and can be time consuming to train. A lot of speech research also tends to focus on English language datasets. In this paper, we propose a lightweight, optimized and automated speech recognition system using Gabor based feature extraction, combined with our Audiovisual Mandarin Chinese (AVMC) corpus. This combines Mel-frequency Cepstral Coefficients (MFCCs) + CNN_Bidirectional Long Short-term Memory (BiLSTM)_Attention (CLA) model for Audio Speech Recognition, and low dimension Gabor visual features + CLA model for Visual Speech Recognition. As we are focusing on Chinese language recognition, we individually analyse initials, finals, and tones, as part of pinyin speech production. The proposed low dimensionality system achieves 88.6%, 87.5% and 93.6% accuracy for tones, initials and finals respectively at character level, and 84.8% for pinyin at word level.
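To make the initial/final/tone decomposition concrete, here is a minimal sketch that splits a tone-numbered pinyin syllable into those three parts. It is not taken from the paper: the helper name `split_pinyin`, the use of trailing tone digits, and labelling a missing digit as tone 5 (neutral) are all assumptions for illustration. Syllables written with y or w are treated here as having no initial.

```python
# The 21 standard pinyin initials; two-letter initials come first so that
# "zh", "ch" and "sh" are matched before "z", "c" and "s".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s")


def split_pinyin(syllable):
    """Split e.g. 'zhong1' into (initial, final, tone). Hypothetical helper."""
    syllable = syllable.strip().lower()
    if syllable and syllable[-1].isdigit():
        tone = int(syllable[-1])
        body = syllable[:-1]
    else:
        tone = 5  # assumption: no tone digit means neutral tone, labelled 5
        body = syllable
    for initial in INITIALS:
        if body.startswith(initial):
            return initial, body[len(initial):], tone
    # Zero-initial syllable such as "an" or "er".
    return "", body, tone


print(split_pinyin("zhong1"))  # ('zh', 'ong', 1)
print(split_pinyin("an4"))     # ('', 'an', 4)
```

Treating the three parts as separate classification targets is what lets accuracy be reported per component, as in the figures quoted above.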

    Lip2Speech : lightweight multi-speaker speech reconstruction with Gabor features

    In environments characterised by noise or the absence of audio signals, visual cues, notably facial and lip movements, serve as valuable substitutes for missing or corrupted speech signals. In these scenarios, speech reconstruction can potentially generate speech from visual data. Recent advancements in this domain have predominantly relied on end-to-end deep learning models, like Convolutional Neural Networks (CNN) or Generative Adversarial Networks (GAN). However, these models are encumbered by their intricate and opaque architectures, coupled with their lack of speaker independence. Consequently, achieving multi-speaker speech reconstruction without supplementary information is challenging. This research introduces an innovative Gabor-based speech reconstruction system tailored for lightweight and efficient multi-speaker speech restoration. Using our Gabor feature extraction technique, we propose two novel models: GaborCNN2Speech and GaborFea2Speech. These models employ a rapid Gabor feature extraction method to derive low-dimensional mouth region features, encompassing filtered Gabor mouth images and low-dimensional Gabor features as visual inputs. An encoded spectrogram serves as the audio target, and a Long Short-Term Memory (LSTM)-based model is harnessed to generate coherent speech output. Through comprehensive experiments conducted on the GRID corpus, our proposed Gabor-based models have showcased superior performance in sentence and vocabulary reconstruction when compared to traditional end-to-end CNN models. These models stand out for their lightweight design and rapid processing capabilities. Notably, the GaborFea2Speech model presented in this study achieves robust multi-speaker speech reconstruction without necessitating supplementary information, thereby marking a significant milestone in the field of speech reconstruction.

    Face pose estimation in monocular images

    People use the orientation of their faces to convey rich, interpersonal information. For example, a person will direct their face to indicate who the intended target of the conversation is. Similarly, in a conversation, face orientation is a non-verbal cue to the listener about when to switch roles and start speaking, and a nod indicates that a person understands, or agrees with, what is being said. Furthermore, face pose estimation plays an important role in human-computer interaction, virtual reality applications, human behaviour analysis, pose-independent face recognition, driver's vigilance assessment, gaze estimation, etc. Robust face recognition has been a focus of research in the computer vision community for more than two decades. Although substantial research has been done and numerous methods have been proposed for face recognition, there remain challenges in this field. One of these is face recognition under varying poses, and that is why face pose estimation is still an important research area. In computer vision, face pose estimation is the process of inferring the face orientation from digital imagery. It requires a series of image processing steps to transform a pixel-based representation of a human face into a high-level concept of direction. An ideal face pose estimator should be invariant to a variety of image-changing factors such as camera distortion, lighting condition, skin colour, projective geometry, facial hair, facial expressions, presence of accessories like glasses and hats, etc. Face pose estimation has been a focus of research for about two decades and numerous research contributions have been presented in this field.
Face pose estimation techniques in the literature still have some shortcomings and limitations in terms of accuracy, applicability to monocular images, being autonomous, identity and lighting variations, image resolution variations, range of face motion, computational expense, presence of facial hair, presence of accessories like glasses and hats, etc. These shortcomings of existing face pose estimation techniques motivated the research work presented in this thesis. The main focus of this research is to design and develop novel face pose estimation algorithms that improve automatic face pose estimation in terms of processing time, computational expense, and invariance to different conditions.

    Audio-Visual Biometrics and Forgery


    Face pose estimation in monocular images

    EThOS - Electronic Theses Online Service (GB)

    3D face recognition using photometric stereo

    Automatic face recognition has been an active research area for the last four decades. This thesis explores innovative bio-inspired concepts aimed at improved face recognition using surface normals. New directions in salient data representation are explored using data captured via a photometric stereo method from the University of the West of England’s “Photoface” device. Accuracy assessments demonstrate the advantage of the capture format and the synergy offered by near infrared light sources in achieving more accurate results than under conventional visible light. Two 3D face databases have been created as part of the thesis – the publicly available Photoface database which contains 3187 images of 453 subjects and the 3DE-VISIR dataset which contains 363 images of 115 people with different expressions captured simultaneously under near infrared and visible light. The Photoface database is believed to be the first to capture naturalistic 3D face models. Subsets of these databases are then used to show the results of experiments inspired by the human visual system. Experimental results show that optimal recognition rates are achieved using surprisingly low resolution of only 10x10 pixels on surface normal data, which corresponds to the spatial frequency range of optimal human performance. Motivated by the observed increase in recognition speed and accuracy that occurs in humans when faces are caricatured, novel interpretations of caricaturing using outlying data and pixel locations with high variance show that performance remains disproportionately high when up to 90% of the data has been discarded. These direct methods of dimensionality reduction have useful implications for the storage and processing requirements for commercial face recognition systems.
The novel variance approach is extended to recognise positive expressions with 90% accuracy, which has useful implications for human-computer interaction as well as for ensuring that a subject has the correct expression prior to recognition. Furthermore, the subject recognition rate is improved by removing those pixels which encode expression. Finally, preliminary work into feature detection on surface normals by extending Haar-like features is presented, which is also shown to be useful for correcting the pose of the head as part of a fully operational device. The system operates with an accuracy of 98.65% at a false acceptance rate of only 0.01 on front facing heads with neutral expressions. The work has shown how new avenues of enquiry inspired by our observation of the human visual system can offer useful advantages towards achieving more robust autonomous computer-based facial recognition.
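The idea of keeping only high-variance pixel locations and discarding up to 90% of the data can be sketched as follows. This is one plausible reading under stated assumptions, not the thesis's actual procedure: the helper names `top_variance_mask` and `reduce_images`, the 10% keep fraction, and the random stand-in data (in place of real 10x10 surface normal maps) are all illustrative.

```python
import numpy as np


def top_variance_mask(images, keep_fraction=0.1):
    """Boolean (h, w) mask of the pixel locations with the highest variance.

    images: array of shape (n_samples, h, w). Hypothetical helper.
    """
    variance = images.var(axis=0)  # per-pixel variance across samples
    n_keep = max(1, int(round(keep_fraction * variance.size)))
    # Indices of the n_keep largest variances (order within them is arbitrary).
    flat_idx = np.argpartition(variance.ravel(), -n_keep)[-n_keep:]
    mask = np.zeros(variance.size, dtype=bool)
    mask[flat_idx] = True
    return mask.reshape(variance.shape)


def reduce_images(images, mask):
    """Keep only the masked pixels, giving shape (n_samples, n_kept)."""
    return images[:, mask]


# Stand-in data: 20 random "images" at the 10x10 resolution named above.
rng = np.random.default_rng(0)
images = rng.normal(size=(20, 10, 10))
mask = top_variance_mask(images, keep_fraction=0.1)
features = reduce_images(images, mask)
print(features.shape)  # (20, 10)
```

Each image shrinks from 100 values to 10, which is where the storage and processing savings mentioned in the abstract would come from; whether recognition accuracy survives such a cut depends on the data and is not something this sketch demonstrates.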