
    Visual Passwords Using Automatic Lip Reading

    This paper presents a visual passwords system to increase security. The system relies on recognizing the speaker from the visual speech signal alone. The proposed scheme works in two stages: a password-setting stage and a verification stage. In the setting stage, the system asks the user to utter a selected password; a video recording of the user's face is captured and processed by a word-based visual speech recognition (VSR) system, which extracts a sequence of feature vectors. In the verification stage, the same procedure is executed and the extracted features are compared with the stored visual password. The scheme has been evaluated on a video database of 20 speakers (10 female and 10 male), and on a second video database of 15 additional male speakers, across different experiment sets. The evaluation demonstrated the system's feasibility, with an average error rate between 7.63% and 20.51% in the worst tested scenario; the approach therefore has potential to be practical when combined with conventional authentication methods such as usernames and passwords.
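
    As a rough illustration of the verification stage, the sketch below compares a probe feature sequence against the enrolled visual password using dynamic time warping (DTW) and an acceptance threshold. The paper does not specify the matching algorithm, so DTW, the threshold value, and all names here are illustrative assumptions:

    import numpy as np

    def dtw_distance(a, b):
        """Length-normalised DTW cost between two (frames x features) sequences."""
        n, m = len(a), len(b)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = np.linalg.norm(a[i - 1] - b[j - 1])   # frame-level distance
                cost[i, j] = d + min(cost[i - 1, j],      # deletion
                                     cost[i, j - 1],      # insertion
                                     cost[i - 1, j - 1])  # match
        return cost[n, m] / (n + m)

    def verify(stored, probe, threshold=1.0):
        # Accept the claimed identity only if the probe utterance is close
        # enough to the enrolled visual password (threshold is hypothetical).
        return dtw_distance(stored, probe) < threshold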

    Combining Multiple Views for Visual Speech Recognition

    Visual speech recognition is a challenging research problem with a particular practical application of aiding audio speech recognition in noisy scenarios. Multiple camera setups can be beneficial for visual speech recognition systems in terms of improved performance and robustness. In this paper, we explore this aspect and provide a comprehensive study on combining multiple views for visual speech recognition. The thorough analysis covers fusion of all possible view angle combinations both at feature level and decision level. The employed visual speech recognition system in this study extracts features through a PCA-based convolutional neural network, followed by an LSTM network. Finally, these features are processed in a tandem system, being fed into a GMM-HMM scheme. The decision fusion acts after this point by combining the Viterbi path log-likelihoods. The results show that the complementary information contained in recordings from different view angles improves the results significantly. For example, the sentence correctness on the test set is increased from 76% for the highest-performing single view (30°) to up to 83% when combining this view with the frontal and 60° view angles.
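
    As a toy illustration of the decision-level fusion described here, the sketch below combines per-view Viterbi path log-likelihoods over a small hypothesis set; the uniform view weights and the score values are invented for the example:

    import numpy as np

    def fuse_decisions(loglikes_per_view, weights=None):
        """loglikes_per_view: view name -> array of per-hypothesis log-likelihoods.
        Returns the index of the winning hypothesis after log-domain fusion."""
        views = list(loglikes_per_view)
        if weights is None:
            weights = {v: 1.0 / len(views) for v in views}  # uniform by default
        fused = sum(weights[v] * loglikes_per_view[v] for v in views)
        return int(np.argmax(fused))

    # e.g. fusing the 30°, frontal and 60° views over three candidate sentences
    scores = {"30deg":   np.array([-210.4, -198.7, -205.2]),
              "frontal": np.array([-215.1, -202.3, -199.8]),
              "60deg":   np.array([-208.9, -201.5, -207.0])}
    best = fuse_decisions(scores)  # index of the jointly most likely sentence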

    Emotion Detector

    The face plays a significant role in social communication; it is a 'window' onto human personality, emotions and thoughts. The verbal component contributes only about 7% of a message, the vocal component about 38%, and facial expression about 55%. For this reason the face is a subject of study in many areas of science, such as psychology, behavioral science, medicine and, finally, computer science. In computer science, much effort has been put into automating the process of face detection and segmentation, and several approaches addressing the problem of facial feature extraction have been proposed. The main issue is to provide an appropriate face representation that remains robust to the diversity of facial appearances. The objective of this report is to outline the problem of facial expression recognition, which is a great challenge in the area of computer vision. The advantages of a fully automatic system for facial action analysis are a constant motivation for exploring this field of science and will be discussed in this thesis.

    Visual speech recognition and utterance segmentation based on mouth movement

    This paper presents a vision-based approach to recognizing speech without evaluating acoustic signals. The proposed technique combines motion features and support vector machines (SVMs) to classify utterances. Segmentation of utterances is important in a visual speech recognition system, so this research proposes a video segmentation method to detect the start and end frames of isolated utterances in an image sequence: frames corresponding to 'speaking' and 'silence' phases are identified from mouth movement information. The experimental results demonstrate that the proposed visual speech recognition technique yields high accuracy in a phoneme classification task. Potential applications of such a system include human-computer interfaces (HCI) for mobility-impaired users, lip-reading mobile phones, in-vehicle systems, and improved speech-based computer control in noisy environments.
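
    A simplified sketch of the segmentation idea: frames are labelled 'speaking' or 'silence' by thresholding a per-frame mouth-movement measure, and contiguous speaking runs give the start and end frames of each utterance. The movement measure, threshold and minimum run length below are illustrative assumptions, not the paper's exact procedure:

    import numpy as np

    def segment_utterances(movement, threshold=0.2, min_len=5):
        """movement: per-frame mouth-motion magnitude (e.g. mean inter-frame
        pixel difference over the mouth region). Returns (start, end) frame
        pairs; runs shorter than min_len are discarded as noise."""
        speaking = movement > threshold                  # frame-wise labels
        segments, start = [], None
        for i, s in enumerate(speaking):
            if s and start is None:
                start = i                                # utterance begins
            elif not s and start is not None:
                if i - start >= min_len:
                    segments.append((start, i - 1))      # utterance ends
                start = None
        if start is not None and len(speaking) - start >= min_len:
            segments.append((start, len(speaking) - 1))
        return segments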

    Human-computer interaction in ubiquitous computing environments

    Purpose – The purpose of this paper is to explore characteristics of human-computer interaction when the human body and its movements become input for interaction and interface control in pervasive computing settings. Design/methodology/approach – The paper quantifies the performance of human movement based on Fitts's Law and discusses some of the human factors and technical considerations that arise in trying to use human body movements as an input medium. Findings – The paper finds that new interaction technologies utilising human movements may provide more flexible, naturalistic interfaces and support the ubiquitous or pervasive computing paradigm. Practical implications – In pervasive computing environments the challenge is to create intuitive and user-friendly interfaces. Application domains that may utilise human body movements as input are surveyed, and the paper addresses issues such as culture, privacy, security and ethics raised by body-movement-based interaction styles. Originality/value – The paper describes the utilisation of human body movements as input for interaction and interface control in pervasive computing settings.
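
    Fitts's Law, which the paper uses to quantify movement performance, predicts movement time from target distance and width. The sketch below uses the common Shannon formulation; the intercept and slope constants are illustrative, not values reported in the paper:

    import math

    def fitts_movement_time(distance, width, a=0.1, b=0.15):
        """Predicted movement time in seconds: MT = a + b * log2(D/W + 1).
        a and b are empirically fitted device/user constants (assumed here)."""
        index_of_difficulty = math.log2(distance / width + 1)  # in bits
        return a + b * index_of_difficulty

    # e.g. reaching a 5 cm target 40 cm away: ID = log2(9) ~ 3.17 bits
    print(fitts_movement_time(40.0, 5.0))  # ~ 0.58 s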

    Comparison of Conventional Hybrid and CTC/Attention Decoders for Continuous Visual Speech Recognition

    Thanks to the rise of deep learning and the availability of large-scale audio-visual databases, recent advances have been achieved in Visual Speech Recognition (VSR). Similar to other speech processing tasks, these end-to-end VSR systems are usually based on encoder-decoder architectures. While encoders are somewhat general, multiple decoding approaches have been explored, such as the conventional hybrid model based on Deep Neural Networks combined with Hidden Markov Models (DNN-HMM) or the Connectionist Temporal Classification (CTC) paradigm. However, there are languages and tasks for which data is scarce, and in this situation there is no clear comparison between the different types of decoders. Therefore, we focused our study on how the conventional DNN-HMM decoder and its state-of-the-art CTC/Attention counterpart behave depending on the amount of data used for their estimation. We also analyzed to what extent our visual speech features were able to adapt to scenarios for which they were not explicitly trained, considering either a similar dataset or one collected for a different language. Results showed that the conventional paradigm reached recognition rates that surpass those of the CTC/Attention model in data-scarcity scenarios, along with reduced training time and fewer parameters.
    Comment: Accepted at the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024).
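
    As a minimal PyTorch sketch of the CTC objective discussed above (one of the two decoding paradigms compared), the toy example below computes the loss on randomly initialised per-frame posteriors; the shapes and sizes are placeholders and the encoder producing them is omitted:

    import torch
    import torch.nn as nn

    T, N, C = 50, 4, 30  # frames, batch size, output classes (index 0 = blank)
    logits = torch.randn(T, N, C, requires_grad=True)      # stand-in encoder output
    log_probs = logits.log_softmax(dim=2)                  # CTC expects log-probs
    targets = torch.randint(1, C, (N, 12))                 # label sequences
    input_lengths = torch.full((N,), T, dtype=torch.long)
    target_lengths = torch.full((N,), 12, dtype=torch.long)

    ctc = nn.CTCLoss(blank=0)  # marginalises over all alignments to the frames
    loss = ctc(log_probs, targets, input_lengths, target_lengths)
    loss.backward()            # gradients flow back into the encoder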

    Visual Speech Recognition using Histogram of Oriented Displacements

    Lip reading is the recognition of spoken words from visual information about the lips. Automating this process with computer algorithms has been of considerable interest in the computer vision and speech recognition communities. In this thesis, we develop a novel method that describes visual features using fixed-length descriptors called Histograms of Oriented Displacements, to which we apply support vector machines for the recognition of spoken words. Using this method on the CUAVE database, we achieve a recognition rate of 81%.
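
    A rough sketch of one plausible reading of the descriptor: displacement vectors between consecutive frames (e.g. of tracked lip landmarks) are binned by orientation into a fixed-length, magnitude-weighted histogram, which an SVM then classifies. The bin count, weighting and landmark input are assumptions; the thesis may differ in detail:

    import numpy as np
    from sklearn.svm import SVC

    def hod_descriptor(points, bins=8):
        """points: (frames, landmarks, 2) lip-landmark trajectories.
        Returns a fixed-length orientation histogram of frame-to-frame motion."""
        disp = np.diff(points, axis=0).reshape(-1, 2)      # per-step displacements
        angles = np.arctan2(disp[:, 1], disp[:, 0])        # orientations in (-pi, pi]
        mags = np.linalg.norm(disp, axis=1)                # weight bins by magnitude
        hist, _ = np.histogram(angles, bins=bins, range=(-np.pi, np.pi), weights=mags)
        return hist / (hist.sum() + 1e-8)                  # normalise to unit mass

    # toy training: one descriptor per utterance video, two word classes
    X = np.stack([hod_descriptor(np.random.rand(30, 20, 2)) for _ in range(10)])
    y = np.arange(10) % 2
    clf = SVC(kernel="linear").fit(X, y)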

    Learning Multi-Boosted HMMs for Lip-Password Based Speaker Verification

    • …