7 research outputs found

    Lip Reading with Hahn Convolutional Neural Networks

    Get PDF
    Lipreading, or visual speech recognition, is the process of decoding speech from a speaker's mouth movements. It is used by people with hearing impairment, to understand patients affected by laryngeal cancer or vocal cord paralysis, and in noisy environments. In this paper we aim to develop a visual-only speech recognition system based solely on video. Our main targeted application is in the medical field, for the assistance of laryngectomized persons. To that end, we propose the Hahn Convolutional Neural Network (HCNN), a novel architecture that uses Hahn moments as the first layer of a convolutional neural network (CNN). We show that HCNN helps in reducing the dimensionality of video frames and in shortening training time. The HCNN model is trained to classify letters, digits, or words given as video frames. We evaluated the proposed method on three datasets, AVLetters, OuluVS2, and BBC LRW, and we show that it achieves significant results in comparison with other works in the literature.
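
    The distinctive idea here is a fixed, non-trainable first layer that projects each frame onto a set of orthogonal moments before any learned convolution. The sketch below illustrates that pattern only; it substitutes a 2-D DCT basis as a stand-in for the paper's Hahn moments (both are orthogonal transforms whose low-order coefficients compress the image), and the layer sizes and class count are illustrative assumptions, not the authors' architecture.

```python
# Illustrative sketch only: the paper uses Hahn moments; a 2-D DCT basis
# stands in here as the fixed orthogonal transform. Sizes are assumptions.
import numpy as np
import torch
import torch.nn as nn
from scipy.fft import dctn

class FixedMomentLayer(nn.Module):
    """Non-trainable first layer: keep the first k x k coefficients of an
    orthogonal 2-D transform, cutting dimensionality before learning."""
    def __init__(self, k=32):
        super().__init__()
        self.k = k

    def forward(self, x):               # x: (batch, 1, H, W) with H, W >= k
        # no gradient needs to flow through this layer, so numpy is fine
        out = np.stack([dctn(img[0], norm="ortho")[: self.k, : self.k]
                        for img in x.detach().cpu().numpy()])
        return torch.from_numpy(out).unsqueeze(1).float()

class MomentCNN(nn.Module):
    """Moment projection followed by a small learned CNN classifier."""
    def __init__(self, num_classes=26):  # e.g. 26 letters, as in AVLetters
        super().__init__()
        self.moments = FixedMomentLayer(k=32)
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(), nn.Linear(16 * 16 * 16, num_classes))

    def forward(self, x):
        return self.net(self.moments(x))
```

    Whatever the basis, the training-time gain comes from the same place: the learned layers only ever see a k x k coefficient block instead of the full frame.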

    ViLiDEx: A Lip Extraction Algorithm for Lip Reading

    Get PDF
    Technology is evolving at an immense speed, and with it computer vision and machine learning are growing fast; many real-time applications now run without human interaction because of them. In this paper, we use computer vision and machine learning for lip feature extraction for the Gujarati language. For this task we created the GVLetters dataset for Gujarati alphabets, recording videos of 24 speakers for the 33 alphabets of the Gujarati language. The face landmark algorithm from dlib is used to derive ViLiDEx (Vibhavari's algorithm for Lip Detection and Extraction). ViLiDEx is applied to 24 speakers and 5 alphabets from each class (Guttural, Palatal, Retroflex, Dental and Labial). The algorithm counts the total number of frames for each speaker, keeps 20/25 frames as the dataset, and removes the extra frames: depending on the number of frames, frames whose indices are divisible by prime numbers are chosen for removal, as sketched below.
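
    The abstract only outlines the frame-trimming rule, so the following is one plausible reading, not the paper's code: drop frames whose 1-based index is divisible by an ascending prime until the clip reaches a target length. The target count, indexing convention, and prime order are all assumptions.

```python
def trim_frames(frames, target=20):
    """Drop frames whose 1-based index is divisible by an ascending
    prime until only `target` frames remain (never overshooting)."""
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23):
        excess = len(frames) - target
        if excess <= 0:
            break
        # 1-based indices divisible by the current prime
        divisible = [i for i in range(1, len(frames) + 1) if i % p == 0]
        to_drop = set(divisible[:excess])   # drop at most `excess` frames
        frames = [f for i, f in enumerate(frames, start=1)
                  if i not in to_drop]
    return frames

print(len(trim_frames(list(range(25)))))    # 25 -> 20 frames
```

    The appeal of a divisibility rule is that the dropped frames are spread across the clip rather than clustered at either end.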

    A Systematic Study and Empirical Analysis of Lip Reading Models using Traditional and Deep Learning Algorithms

    Get PDF
    Despite the many applications for analyzing and recreating audio through existing lip movement recognition, researchers have shown interest in developing automatic lip-reading systems to achieve increased performance. Modelling of the framework has played a major role in the advance of sequential frameworks. In recent years there has been a lot of interest in Deep Neural Networks (DNNs), with breakthrough results in various domains including image classification, speech recognition and natural language processing. DNNs are used to represent complex functions and play a vital role in Automatic Lip Reading (ALR) systems. This paper focuses on traditional pixel, shape and mixed feature extractions and their improved technologies for lip reading recognition. It highlights the most important techniques and the progression of end-to-end deep learning architectures that evolved over the past decade. The investigation surveys the audio-visual databases used to analyze and train such systems, in terms of the most common words, the number of speakers, the size, the language and the time duration. ALR systems are also compared with their old-style counterparts. A statistical analysis is performed to recognize characters or numerals and words or sentences in English and to compare their performances.

    Using Neural Networks for Handwritten Digit Recognition

    Get PDF
    In the presented work, a Hopfield neural network was constructed to recognize the handwritten digit patterns contained in the MNIST database. A separate Hopfield network was built for each of the ten digits. The centers of clusters built with a Kohonen neural network were taken as the objects to be "memorized". Two methods were proposed for a supporting step in the Hopfield network; these methods were analyzed, the error of each was calculated, and the pros and cons of their use were identified. Clustering of the handwritten digits from the MNIST training sample is performed with a Kohonen neural network, selecting the optimal number of clusters (not exceeding 50) for each digit and using the Euclidean norm as the metric. The network is trained by a serial algorithm on the CPU and by a parallel algorithm on the GPU using CUDA technology. Graphs of the time spent training the network for each digit are given, together with a comparison of serial and parallel training times; the average speedup from training with CUDA is found to be almost 17-fold. The digits from the MNIST test sample are used to evaluate the clustering accuracy: the percentage of test vectors assigned to the correct cluster exceeds 90% for every digit. The F-measure is calculated for each digit; the best values are obtained for 0 and 1 (F-measure 0.974), whereas the worst is obtained for the digit 9 (F-measure 0.903). The introduction briefly describes the content of the work, the research currently available, and the relevance of this work, followed by a statement of the problem and the technologies used. The first chapter covers the theoretical aspects and describes how each stage of the work is solved. The second chapter contains a description of the implementation and the results obtained, including the parallelization of the Kohonen network's learning algorithm. In the third chapter the software is tested; the results are the recognition response of each neural network (the stored image most similar to the input image) and the total recognition percentage of each network.
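
    The pipeline combines two classical building blocks: a Kohonen (winner-take-all) layer that produces cluster centers, and a Hopfield network that stores those centers as attractors and recalls the nearest one at recognition time. A minimal sketch of both update rules follows; the cluster count, learning rate, and schedule are illustrative assumptions, not the thesis's settings.

```python
# Minimal sketch of the Kohonen-then-Hopfield pipeline described above.
import numpy as np

def kohonen_centers(X, n_clusters=50, lr=0.5, epochs=10, seed=0):
    """Winner-take-all Kohonen layer: pull the closest center toward
    each sample, using the Euclidean metric as in the abstract."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_clusters, replace=False)].astype(float)
    for epoch in range(epochs):
        eta = lr * (1 - epoch / epochs)        # decaying learning rate
        for x in X:
            w = np.argmin(((centers - x) ** 2).sum(axis=1))  # winner
            centers[w] += eta * (x - centers[w])
    return centers

def hopfield_store(patterns):
    """Hebbian weights that 'memorize' bipolar (+/-1) patterns."""
    W = sum(np.outer(p, p) for p in patterns) / len(patterns)
    np.fill_diagonal(W, 0)                     # no self-connections
    return W

def hopfield_recall(W, x, steps=20):
    """Synchronous recall: iterate until the state settles on an attractor."""
    for _ in range(steps):
        x_new = np.sign(W @ x)
        x_new[x_new == 0] = 1
        if np.array_equal(x_new, x):
            break
        x = x_new
    return x
```

    In this setup a test digit would be binarized to +/-1, recalled through each digit's Hopfield network, and assigned to whichever network settles on the closest stored center.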

    Deep Learning applied to Visual Speech Recognition

    Get PDF
    Visual Speech Recognition (VSR), or Automatic Lip-Reading (ALR), the artificial process used to infer visemes, words, or sentences from video inputs, is efficient yet far from being a day-to-day tool. With the evolution of deep learning models and the proliferation of databases (DBs), vocabularies increase in quality and quantity. Large DBs feed end-to-end deep learning (DL) models that extract speech solely from visual recognition of the speaker's lip movements. However, producing large DBs requires resources unavailable to the majority of ALR researchers, impairing larger-scale evolution. This dissertation contributes to the development of ALR by diversifying the training data on which DL depends. This includes producing a new DB, in the Portuguese language, capable of state-of-the-art (SOTA) performance. As DL only shows SOTA performance when trained on a large DB, whose resources are beyond the scope of this dissertation, a knowledge-leveraging method emerges as a necessary subsequent objective. A large DB and a SOTA model are selected and used as templates, from which a smaller DB (LusaPt) is created, comprising 100 phrases by 10 speakers uttering 50 typical Portuguese digits and words, recorded and processed with day-to-day equipment. After being pre-trained on the SOTA DB, the new model is fine-tuned on the new DB; a sketch of this recipe follows. For LusaPt's validation, the performance of the new model and the SOTA model are compared. Results reveal that, if the same video is recurrently submitted to the same model, the same prediction is obtained. Tests also show a clear increase in word recognition rate (WRR), from 0% when inferring with the SOTA model with no further training on the new DB, to over 95% when inferring with the new model. Besides showing a "powerful belief" of the SOTA model in its predictions, this work also validates the new DB and its creation methodology. It reinforces that the transfer learning process is efficient in learning a new language, and therefore new words. Another contribution is to demonstrate that, with day-to-day equipment and limited human resources, it is possible to enrich the DB corpora and, ultimately, to positively impact the performance and future of Automatic Lip-Reading.
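
    The transfer-learning recipe itself (pre-train on the large SOTA DB, swap the head, fine-tune on the small new DB) is standard enough to sketch. The sketch below is hedged: `LipReadingNet`'s `.head` attribute, the 50-word vocabulary, and all hyperparameters are placeholder assumptions, not artifacts of the dissertation.

```python
# Hedged sketch of the pre-train/fine-tune recipe; the model's `.head`
# Linear layer, the 50-word head, and hyperparameters are assumptions.
import torch
import torch.nn as nn

def finetune(model: nn.Module, loader, num_new_classes=50,
             epochs=30, lr=1e-4, device="cpu"):
    # 1) replace the classification head to match the new vocabulary
    in_feats = model.head.in_features
    model.head = nn.Linear(in_feats, num_new_classes)
    # 2) freeze the pre-trained visual front-end; train only the new head
    for p in model.parameters():
        p.requires_grad = False
    for p in model.head.parameters():
        p.requires_grad = True
    model.to(device).train()
    opt = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for clips, labels in loader:           # clips: video tensors
            opt.zero_grad()
            loss = loss_fn(model(clips.to(device)), labels.to(device))
            loss.backward()
            opt.step()
    return model
```

    Freezing the front-end is what makes a 10-speaker, day-to-day-equipment dataset sufficient: only the small new head has to be learned from scratch.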

    Viseme-based Lip-Reading using Deep Learning

    Get PDF
    Research in automated lip reading is an incredibly rich discipline, with many facets having been the subject of investigation, including audio-visual data, feature extraction, classification networks and classification schemas. The most advanced lip-reading systems can predict entire sentences with thousands of different words, and the majority of them use ASCII characters as the classification schema. The classification performance of such systems, however, has been insufficient, and covering an ever-expanding vocabulary with as few classes as possible is a challenge. The work in this thesis contributes to the area of classification schemas by proposing an automated lip-reading model that predicts sentences using visemes as the classification schema, an alternative to the conventional ASCII-character class system. The thesis reviews current trends in deep learning-based automated lip reading and addresses a gap in the research by contributing to work on classification schemas, opening up a new line of research in which an alternative way to do lip reading is explored; in doing so, lip-reading results for predicting sentences from a benchmark dataset are attained that improve upon the state of the art. The proposed neural-network-based lip-reading system is lexicon-free and uses purely visual cues. With only a limited number of visemes as classes to recognise, the system is designed to lip-read sentences covering a wide range of vocabulary and to recognise words that may not be included in system training. The system predicts sentences as a two-stage procedure, with visemes recognised in the first stage and words classified in the second. The second stage must overcome both the one-to-many mapping problem posed in lip reading, where one set of visemes can map to several words, and the problem of visemes being confused or misclassified to begin with; a toy illustration of this mapping follows. To develop the proposed system, a number of tasks have been performed, including the classification of continuous sequences of visemes, and the proposal of viseme-to-word conversion models that are both effective at predicting words and robust to viseme confusion or misclassification. The initial system was tested on the challenging BBC Lip Reading Sentences 2 (LRS2) benchmark dataset, attaining a word accuracy rate of 64.6%, a significant improvement over the state-of-the-art lip-reading works reported at the time. The system is further improved with a language model demonstrated to be effective at discriminating between homopheme words and robust to incorrectly classified visemes, yielding a word accuracy rate of 79.6% on the LRS2 dataset. This is better than another lip-reading system trained and evaluated on the same dataset, which attained a word accuracy rate of 77.4%, and is, to the best of our knowledge, the next best result observed on LRS2.
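
    The one-to-many problem the thesis tackles is easy to see in miniature: several words (homophemes) share the same viseme string, so a viseme recognizer alone cannot pick between them and a language model must break the tie. The viseme coding, word list, and probabilities below are made up for illustration, not taken from the thesis's lexicon.

```python
# Toy illustration of the one-to-many viseme -> word problem; all data
# below is invented for the example.
from collections import defaultdict

# /p/, /b/, /m/ share one lip shape, so these words collide visually
viseme_lexicon = {
    "pat": "p-ah-t",
    "bat": "p-ah-t",
    "mat": "p-ah-t",
    "fan": "f-ah-n",
    "van": "f-ah-n",
}
unigram_lm = {"pat": 0.1, "bat": 0.3, "mat": 0.5, "fan": 0.6, "van": 0.4}

# invert the lexicon: viseme string -> candidate words
candidates = defaultdict(list)
for word, vis in viseme_lexicon.items():
    candidates[vis].append(word)

def decode(viseme_string):
    """Pick the candidate word the language model scores highest."""
    words = candidates.get(viseme_string, [])
    return max(words, key=unigram_lm.get, default=None)

print(decode("p-ah-t"))   # -> 'mat' (highest LM score among homophemes)
```

    A real second stage scores candidates in sentence context rather than with unigrams, which is also what lets it recover from visemes the first stage misclassified.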
