7 research outputs found
Lip Reading with Hahn Convolutional Neural Networks
Lipreading, or visual speech recognition, is the process of decoding speech from a speaker's mouth movements. It is used by people with hearing impairment, to understand patients affected by laryngeal cancer or vocal cord paralysis, and in noisy environments. In this paper we aim to develop a visual-only speech recognition system based only on video. Our main targeted application is in the medical field, for the assistance of laryngectomized persons. To that end, we propose the Hahn Convolutional Neural Network (HCNN), a novel architecture that uses Hahn moments as the first layer of a convolutional neural network (CNN). We show that HCNN helps reduce the dimensionality of video images and saves training time. The HCNN model is trained to classify letters, digits, or words given as video images. We evaluated the proposed method on three datasets, AVLetters, OuluVS2 and BBC LRW, and show that it achieves significant results in comparison with other works in the literature
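The idea of a fixed orthogonal-moment first layer can be sketched in a few lines of numpy. The sketch below uses a DCT-II basis as a stand-in for the Hahn-polynomial basis of the paper (both are fixed orthonormal bases); the helper names and the 8x8 truncation are illustrative assumptions, not details from the paper.

```python
import numpy as np

def dct_basis(n):
    """Orthonormal DCT-II basis matrix (n x n); a stand-in here for the
    Hahn-polynomial basis used in the paper."""
    k = np.arange(n)[:, None]
    x = np.arange(n)[None, :]
    B = np.cos(np.pi * k * (2 * x + 1) / (2 * n))
    B[0] /= np.sqrt(2)
    return B * np.sqrt(2.0 / n)

def moment_features(image, order):
    """Project an image onto the basis along each axis and keep only the
    low-order block of moment coefficients (the dimensionality reduction)."""
    h, w = image.shape
    Bh, Bw = dct_basis(h), dct_basis(w)
    M = Bh @ image @ Bw.T        # full moment matrix
    return M[:order, :order]     # compact descriptor fed to trainable layers

# a 64x64 "frame" reduced to 8x8 = 64 coefficients before any trainable layer
frame = np.random.default_rng(0).random((64, 64))
feat = moment_features(frame, order=8)
assert feat.shape == (8, 8)
```

Because the first layer is a fixed linear projection, no gradients need to be computed for it, which is one way to read the paper's claim of reduced training time.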
ViLiDEx- A Lip Extraction Algorithm for Lip Reading
Technology is evolving at an immense speed every day, and computer vision and machine learning are growing fast along with it; many real-time applications run without human interaction thanks to these fields. In this paper, we use computer vision and machine learning for lip feature extraction for the Gujarati language. For this task we have created the dataset GVLetters for the Gujarati alphabet, taking videos of 24 speakers for the 33 letters of Gujarati. The face landmark algorithm from dlib is used to derive ViLiDEx (Vibhavari's algorithm for Lip Detection and Extraction). ViLiDEx is applied to 24 speakers and 5 letters from each class (Guttural, Palatal, Retroflex, Dental and Labial). The algorithm calculates the total number of frames for each speaker, keeps 20/25 frames as the dataset, and removes the extra frames: depending on the number of frames, frame numbers divisible by prime numbers are chosen for removal
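One reading of the prime-divisibility pruning rule can be sketched as follows. The function name and the exact drop order are assumptions for illustration; the paper's precise selection rule may differ.

```python
def prune_frames(n_frames, target):
    """Reduce a clip to `target` frames by dropping frames whose index is
    divisible by successive primes (one interpretation of the ViLiDEx rule)."""
    primes = [2, 3, 5, 7, 11, 13]
    keep = list(range(n_frames))
    for p in primes:
        if len(keep) <= target:
            break
        # candidates divisible by p, but never drop below the target count
        drop = [i for i in keep if i % p == 0][: len(keep) - target]
        keep = [i for i in keep if i not in drop]
    return keep[:target]

print(len(prune_frames(37, 20)))  # 20
```

Pruning by divisibility rather than by taking a contiguous window keeps the retained frames spread across the whole utterance.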
A Systematic Study and Empirical Analysis of Lip Reading Models using Traditional and Deep Learning Algorithms
Despite the many applications of analyzing and recreating audio through existing lip movement recognition, researchers have shown interest in developing automatic lip-reading systems to achieve increased performance, and the modelling of the framework has played a major role in advancing sequential frameworks. In recent years there has been much interest in Deep Neural Networks (DNNs), with breakthrough results in various domains including image classification, speech recognition and natural language processing. DNNs are used to represent complex functions and play a vital role in Automatic Lip Reading (ALR) systems. This paper focuses on traditional pixel, shape and mixed feature extraction and the improved technologies built on them for lip-reading recognition. It highlights the most important techniques and the progression of end-to-end deep learning architectures over the past decade. The investigation surveys the audio-visual databases used for analyzing and training such systems, covering the most common words, the number of speakers, and the size, language and duration of each corpus. The ALR systems developed are also compared with their old-style counterparts, and a statistical analysis is performed on recognizing characters or numerals and words or sentences in English, comparing their performances
Using neural networks for handwritten digit recognition
In the presented work, a Hopfield neural network was constructed for recognizing the handwritten digit patterns contained in the MNIST database. Ten Hopfield neural networks were built, one for each digit. The centres of clusters built using a Kohonen neural network were taken as the objects to be "memorized". Two methods were proposed as a supporting step in the Hopfield neural network; an analysis of these methods was carried out, the error was calculated for each method, and the pros and cons of their use were identified. Clustering of the handwritten digits from the MNIST training sample is performed using a Kohonen neural network; the optimal number of clusters (not exceeding 50) is selected for each digit, with the Euclidean norm as the metric. The network is trained by a serial algorithm on the CPU and by a parallel algorithm on the GPU using CUDA technology. Graphs of the time spent training the neural network for each digit are given, and the serial and parallel training times are compared: the average speed-up of training the neural network with CUDA technology is almost 17-fold. The digits from the MNIST test sample are used to evaluate the clustering accuracy; the percentage of test vectors assigned to the correct cluster is more than 90% for every digit. The F-measure for each digit is calculated.
The best values of the F-measure are obtained for the digits 0 and 1 (0.974), whereas the worst value is obtained for the digit 9 (0.903). The introduction briefly describes the content of the work, the research currently available, and the relevance of this work. This is followed by a statement of the problem and of the technologies used to produce this work. The first chapter describes the theoretical aspects and how each stage of the work is solved. The second chapter contains a description of the program and the results obtained, including the parallelization of the learning algorithm of the Kohonen neural network. In the third chapter the software is tested. The results are the recognition response of each neural network (the stored image most similar to the image submitted as input), together with the overall recognition percentage for each neural network
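The storage-and-recall scheme described above can be illustrated with a minimal Hebbian Hopfield network. This is a generic textbook sketch, not the paper's implementation: the stored patterns stand in for the Kohonen cluster centres, and toy 8-bit patterns replace MNIST images.

```python
import numpy as np

def train_hopfield(patterns):
    """Hebbian weight matrix for bipolar (+1/-1) patterns; in the paper the
    stored patterns would be the Kohonen cluster centres."""
    n = patterns.shape[1]
    W = patterns.T @ patterns / n
    np.fill_diagonal(W, 0.0)       # no self-connections
    return W

def recall(W, x, steps=10):
    """Synchronous updates until a fixed point (or `steps` iterations)."""
    for _ in range(steps):
        nxt = np.sign(W @ x)
        nxt[nxt == 0] = 1
        if np.array_equal(nxt, x):
            break
        x = nxt
    return x

# two orthogonal toy patterns; a noisy copy of the first converges back to it
p = np.array([[1, -1, 1, -1, 1, -1, 1, -1],
              [1, 1, 1, 1, -1, -1, -1, -1]], dtype=float)
W = train_hopfield(p)
noisy = p[0].copy(); noisy[0] *= -1            # flip one bit
print(np.array_equal(recall(W, noisy), p[0]))  # True
```

The recognition response mentioned in the abstract is exactly this behaviour: the network settles on the stored pattern most similar to the input.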
Deep Learning applied to Visual Speech Recognition
Visual Speech Recognition (VSR) or Automatic Lip-Reading (ALR), the artificial process used to infer visemes, words, or sentences from video inputs, is efficient yet far from being a day-to-day tool. With the evolution of deep learning models and the proliferation of databases (DBs), vocabularies increase in quality and quantity. Large DBs feed end-to-end deep learning (DL) models that extract speech solely from the visual recognition of the speaker's lip movements. However, producing a large DB requires large resources, unavailable to the majority of ALR researchers, which impairs larger-scale evolution.
This dissertation contributes to the development of ALR by diversifying the training data on which DL depends. This includes producing a new DB, in the Portuguese language, capable of state-of-the-art (SOTA) performance. As DL only shows SOTA performance when trained on a large DB, whose production is beyond the scope of this dissertation, a knowledge-leveraging method emerges as a necessary subsequent objective.
A large DB and a SOTA model are selected and used as templates, from which a smaller DB (LusaPt) is created, comprising 100 phrases by 10 speakers uttering 50 typical Portuguese digits and words, recorded and processed with day-to-day equipment. After pre-training on the SOTA DB, the new model is fine-tuned on the new DB. For LusaPt's validation, the performances of the new model and of the SOTA model are compared.
Results reveal that, if the same video is recurrently submitted to the same model, the same prediction is obtained. Tests also show a clear increase in the word recognition rate (WRR), from 0% when inferring with the SOTA model without further training on the new DB, to over 95% when inferring with the new model.
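The WRR reported above can be computed in several ways; a simple position-wise definition is sketched below. This is an illustrative assumption, since the dissertation may use an edit-distance-based variant.

```python
def word_recognition_rate(ref, hyp):
    """Fraction of reference words predicted correctly at the same position;
    a simple position-wise WRR (the dissertation's exact metric may differ)."""
    if not ref:
        return 0.0
    hits = sum(r == h for r, h in zip(ref, hyp))
    return hits / len(ref)

print(word_recognition_rate(["dois", "casa"], ["dois", "mesa"]))  # 0.5
```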
Besides showing a "powerful belief" of the SOTA model in its predictions, this work also validates the new DB and its creation methodology. It reinforces that the transfer-learning process is efficient at learning a new language, and therefore new words. Another contribution is to demonstrate that, with day-to-day equipment and limited human resources, it is possible to enrich the DB corpora and, ultimately, to positively impact the performance and future of Automatic Lip-Reading
Viseme-based Lip-Reading using Deep Learning
Research in Automated Lip Reading is an incredibly rich discipline with many facets that have been the subject of investigation, including audio-visual data, feature extraction, classification networks and classification schemas. The most advanced and up-to-date lip-reading systems can predict entire sentences with thousands of different words, and the majority of them use ASCII characters as the classification schema. The classification performance of such systems, however, has been insufficient, and the need to cover an ever-expanding vocabulary using as few classes as possible is a challenge.
The work in this thesis contributes to the area concerning classification schemas by proposing an automated lip reading model that predicts sentences using visemes as a classification schema.
This is an alternative schema to ASCII characters, the conventional class system used to predict sentences. This thesis provides a review of the current trends in deep learning-based automated lip reading and addresses a gap in automated lip-reading research by contributing to work on classification schemas. A whole new line of research is opened up whereby an alternative way to do lip-reading is explored and, in doing so, lip-reading performance results for predicting sentences from a benchmark dataset are attained which improve upon the current state of the art.
In this thesis, a neural network-based lip reading system is proposed. The system is lexicon-free and uses purely visual cues. With only a limited number of visemes as classes to recognise, the system is designed to lip read sentences covering a wide range of vocabulary and to recognise words that may not be included in system training. The lip-reading system predicts sentences as a two-stage procedure with visemes being recognised as the first stage and words being classified as the second stage. This is such that the second-stage has to both overcome the one-to-many mapping problem posed in lip-reading where one set of visemes can map to several words, and the problem of visemes being confused or misclassified to begin with.
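The one-to-many mapping problem of the second stage can be made concrete with a toy example. The lexicon, frequency table and tie-breaking rule below are hypothetical stand-ins for the thesis's learned viseme-to-word conversion models, shown only to illustrate why disambiguation is needed.

```python
# Hypothetical viseme lexicon: several words can share one viseme string
# (homophemes, e.g. p/b/m map to the same lip shape), so a simple frequency
# prior breaks the tie in this toy second stage.
LEXICON = {
    "p-a-t": ["pat", "bat", "mat"],
    "t-e-n": ["ten", "den"],
}
FREQ = {"pat": 3, "bat": 9, "mat": 2, "ten": 7, "den": 1}

def visemes_to_word(viseme_seq):
    """Stage two of the pipeline: map a recognised viseme string to the most
    probable word among its homopheme candidates."""
    candidates = LEXICON.get(viseme_seq, [])
    return max(candidates, key=FREQ.get) if candidates else None

print(visemes_to_word("p-a-t"))  # bat
```

A real conversion model must also tolerate visemes that were misrecognised in stage one, which is why the thesis later adds a language model robust to incorrectly classified visemes.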
To develop the proposed lip-reading system, a number of tasks have been performed in this thesis. These include the classification of continuous sequences of visemes, and the proposal of viseme-to-word conversion models that are both effective in their conversion performance in predicting words and robust to possible viseme confusion or misclassification. The initial system reported has been tested on the challenging BBC Lip Reading Sentences 2
(LRS2) benchmark dataset, attaining a word accuracy rate of 64.6%. Compared with the state-of-the-art sentence lip-reading works reported at the time, the system achieved a significantly improved performance.
The lip-reading system is further improved by using a language model demonstrated to be effective at discriminating between homopheme words and robust to incorrectly classified visemes. An improved performance in predicting spoken sentences from the LRS2 dataset is obtained, with a word accuracy rate of 79.6%. This is still better than another lip-reading system trained and evaluated on the same dataset, which attained a word accuracy rate of 77.4%, and it is, to the best of our knowledge, the next best result observed on LRS2