
    Visual Passwords Using Automatic Lip Reading

    This paper presents a visual password system to increase security. The system relies on recognizing the speaker from the visual speech signal alone. The proposed scheme works in two stages: setting the visual password and verification. In the setting stage, the system asks the user to utter a selected password; a video recording of the user's face is captured and processed by a word-based visual speech recognition (VSR) system, which extracts a sequence of feature vectors. In the verification stage, the same procedure is executed and the extracted features are compared with the stored visual password. The proposed scheme has been evaluated on a video database of 20 speakers (10 female and 10 male), plus a second video database of 15 additional male speakers, across different experiment sets. The evaluation demonstrated the system's feasibility, with an average error rate between 7.63% and 20.51% in the worst tested scenario; the approach therefore has potential as a practical method when combined with other conventional authentication mechanisms such as usernames and passwords.
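
    The abstract does not detail how the stored and newly extracted feature-vector sequences are matched. Purely as a hypothetical sketch of the verification stage, the snippet below compares two variable-length sequences with dynamic time warping and accepts when the normalised distance falls under a tuned threshold; the function names and the threshold value are assumptions of this illustration, not the paper's published method.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two feature sequences of
    shape (T, D). A common choice for variable-length utterance matching;
    not necessarily the matching rule used in the paper."""
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m] / (n + m)  # normalise by a bound on path length

def verify(enrolled, attempt, threshold=1.5):
    """Accept the attempt if its distance to the stored visual
    password is below the threshold (illustrative value only)."""
    return dtw_distance(enrolled, attempt) < threshold

# Toy usage with random 32-dimensional feature vectors:
enrolled = np.random.randn(40, 32)
attempt = np.random.randn(38, 32)
print(verify(enrolled, attempt))
```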

    Combining Multiple Views for Visual Speech Recognition

    Visual speech recognition is a challenging research problem with a particular practical application of aiding audio speech recognition in noisy scenarios. Multiple-camera setups can benefit visual speech recognition systems in terms of improved performance and robustness. In this paper, we explore this aspect and provide a comprehensive study on combining multiple views for visual speech recognition. The analysis covers fusion of all possible view-angle combinations at both the feature level and the decision level. The visual speech recognition system employed in this study extracts features through a PCA-based convolutional neural network, followed by an LSTM network. Finally, these features are processed in a tandem system, being fed into a GMM-HMM scheme. The decision fusion acts after this point by combining the Viterbi path log-likelihoods. The results show that the complementary information contained in recordings from different view angles improves the results significantly. For example, the sentence correctness on the test set increases from 76% for the highest-performing single view (30°) to up to 83% when combining this view with the frontal and 60° view angles.
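
    The paper states that decision fusion combines the Viterbi path log-likelihoods of the per-view recognisers. A minimal sketch of that idea follows, assuming each view's recogniser returns a log-likelihood per sentence hypothesis; the equal-weight combination and all names are illustrative assumptions rather than the paper's exact scheme.

```python
import math

def fuse_decisions(view_loglikes, weights=None):
    """Decision-level fusion: view_loglikes maps each view angle to a
    dict {hypothesis: Viterbi path log-likelihood}. Hypotheses are
    re-scored by a weighted sum of per-view log-likelihoods and the
    best-scoring hypothesis is returned."""
    views = list(view_loglikes)
    if weights is None:                       # default: equal weights
        weights = {v: 1.0 / len(views) for v in views}
    hypotheses = set().union(*(view_loglikes[v] for v in views))
    fused = {
        h: sum(weights[v] * view_loglikes[v].get(h, -math.inf) for v in views)
        for h in hypotheses
    }
    return max(fused, key=fused.get)

# Toy example with frontal, 30° and 60° views scoring two hypotheses:
scores = {
    "frontal": {"bin blue at f two now": -310.2, "bin blue at f nine now": -312.8},
    "30":      {"bin blue at f two now": -305.9, "bin blue at f nine now": -306.4},
    "60":      {"bin blue at f two now": -330.1, "bin blue at f nine now": -327.5},
}
print(fuse_decisions(scores))  # -> "bin blue at f two now"
```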

    Recognition of Indonesian Dynamic Visemes Using a Convolutional Neural Network

    There has been very little research on automatic lip reading in the Indonesian language, especially research based on dynamic visemes. To improve the accuracy of a recognition process for certain problems, choosing suitable classifiers or combining several methods may be required. This study classifies five dynamic visemes of the Indonesian language using a convolutional neural network (CNN) and compares the results with a multilayer perceptron (MLP). Parameters expected to improve recognition accuracy were varied to obtain the best result. The data consists of videos of 28 subjects pronouncing everyday Indonesian words, recorded in frontal view. The best result is 96.44% validation accuracy, achieved with the CNN classifier using three convolution layers.
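
    The abstract specifies a CNN with three convolution layers classifying five dynamic visemes. The sketch below shows one plausible Keras realisation of such a network; the input size, filter counts, kernel sizes, and training settings are assumptions, since the paper's exact configuration is not given here.

```python
from tensorflow.keras import layers, models

def build_viseme_cnn(input_shape=(64, 64, 1), num_classes=5):
    """Three-convolution-layer CNN for dynamic viseme classification.
    Hyperparameters are illustrative, not the authors' reported setup."""
    model = models.Sequential([
        layers.Input(shape=input_shape),           # grayscale mouth-region crop
        layers.Conv2D(32, 3, activation="relu"),   # convolution layer 1
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),   # convolution layer 2
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, activation="relu"),  # convolution layer 3
        layers.GlobalAveragePooling2D(),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_viseme_cnn()
model.summary()
```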

    A Command Recognition System Using Lip Reading

    Thesis: 110 pages, 28 figures, 7 tables, 2 appendices, 25 sources. The purpose of the work is to develop an application that reads commands spoken by a person from a camera in real time and displays these commands on the screen, or performs the actions that correspond to them. The study analyzes deep learning methods and applies several mathematical approaches. Results of the work: a real-time application was developed; an algorithm for detecting pauses between different phrases was developed; a user interface for the application was implemented. The application is intended for areas where audio processing cannot be applied.
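
    The thesis reports an algorithm for detecting pauses between phrases, but the abstract gives no details. Purely as a hypothetical sketch, one simple approach segments the stream wherever a per-frame lip-motion measure stays below a threshold for long enough; the signal, thresholds, and function below are assumptions of this illustration, not the thesis's algorithm.

```python
def segment_phrases(mouth_openness, motion_thresh=0.05, min_pause_frames=15):
    """Split a per-frame mouth-openness signal into (start, end) phrase
    segments. A run of at least `min_pause_frames` low-motion frames is
    treated as a pause between phrases. End indices are exclusive."""
    segments, start, still = [], None, 0
    for t, v in enumerate(mouth_openness):
        if v > motion_thresh:                 # lips moving: inside a phrase
            if start is None:
                start = t
            still = 0
        elif start is not None:
            still += 1
            if still >= min_pause_frames:     # pause long enough: close phrase
                segments.append((start, t - still + 1))
                start, still = None, 0
    if start is not None:                     # phrase still open at stream end
        segments.append((start, len(mouth_openness)))
    return segments

# Toy signal: two spoken phrases separated by a 20-frame pause.
signal = [0.0] * 10 + [0.3] * 30 + [0.0] * 20 + [0.25] * 25 + [0.0] * 20
print(segment_phrases(signal))  # -> [(10, 40), (60, 85)]
```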

    Visual speech recognition: from traditional to deep learning frameworks

    Speech is the most natural means of communication for humans, so interacting with machines via speech has been a goal since the beginning of computing. While the field has improved gradually over the decades, and recent rapid progress has brought more and more commercial software that accepts voice commands, there are still many ways in which it can be improved. One of them is visual speech information, more specifically the visible articulations of the mouth. Based on the information contained in these articulations, visual speech recognition (VSR) transcribes an utterance from a video sequence. It thus extends speech recognition from audio-only settings to scenarios such as silent or whispered speech (e.g. in cybersecurity) and mouthings in sign language; it can serve as an additional modality for audio-visual automatic speech recognition in noisy conditions, help in understanding speech production and disorders, or stand alone as a transcription method and a means of human-machine interaction.

    In this thesis, we present and compare different ways to build VSR systems. We start with the traditional hidden Markov models that have been used in the field for decades, especially in combination with handcrafted features. These are compared to models that take into account recent deep learning developments in computer vision and speech recognition. While the superior performance of the latter is confirmed, certain limitations with respect to the computing power these systems require are also discussed.

    The thesis also addresses multi-view processing and fusion, an important topic for many current applications, since a single camera view often cannot provide enough flexibility when speakers move in front of the camera. Thanks to lower hardware costs for cameras and processing units, together with higher processing power and high-performance algorithms, technology companies are integrating more cameras into products such as cars and mobile devices. Multi-camera and multi-view solutions are thus becoming more common, and algorithms can benefit from taking them into account. In this work we propose several methods of fusing the views of multiple cameras to improve the overall results. We show that both relying on deep-learning-based approaches for feature extraction and sequence modelling and exploiting the complementary information contained in several views improve performance considerably. To improve the results further, it would be necessary to move from data recorded in a lab environment to multi-view data from realistic scenarios. The findings and models could also be transferred to other domains such as audio-visual speech recognition or the study of speech production and disorders.
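
    As a complement to the decision fusion sketched earlier, the simplest form of the feature-level fusion this thesis studies is frame-wise concatenation of synchronised per-view feature streams before sequence modelling. The sketch below assumes frame-aligned views and omits the dimensionality reduction some systems apply after concatenation; it illustrates the general technique, not the thesis's specific pipeline.

```python
import numpy as np

def fuse_features(view_features):
    """Feature-level fusion: concatenate per-view feature vectors frame by
    frame. Each element of view_features is an array of shape (T, D);
    streams are truncated to the shortest view to stay frame-aligned."""
    T = min(f.shape[0] for f in view_features)
    return np.concatenate([f[:T] for f in view_features], axis=1)

# Toy example: frontal and 30° streams with 40-D features per frame.
frontal = np.random.randn(120, 40)
side30 = np.random.randn(118, 40)
fused = fuse_features([frontal, side30])
print(fused.shape)  # (118, 80)
```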