27 research outputs found

    Lipreading with Long Short-Term Memory

    Lipreading, i.e. speech recognition from visual-only recordings of a speaker's face, can be achieved with a processing pipeline based solely on neural networks, yielding significantly better accuracy than conventional methods. Feed-forward and recurrent neural network layers (namely Long Short-Term Memory; LSTM) are stacked to form a single structure which is trained by back-propagating error gradients through all the layers. The performance of such a stacked network was experimentally evaluated and compared to a standard Support Vector Machine classifier using conventional computer vision features (Eigenlips and Histograms of Oriented Gradients). The evaluation was performed on data from 19 speakers of the publicly available GRID corpus. With 51 different words to classify, we report a best word accuracy on held-out evaluation speakers of 79.6% using the end-to-end neural network-based solution (11.6% improvement over the best feature-based solution evaluated).
    Comment: Accepted for publication at ICASSP 201

    Developing digital signal clustering method using local binary pattern histogram

    In this paper we present a new approach to processing a digital signal in order to create a feature array that can be used as a signature to retrieve the signal. Each digital signal is associated with a local binary pattern (LBP) histogram, which is calculated with the LBP operator; k-means clustering is then used to generate the required features for each digital signal. The proposed method was implemented and tested, and the experimental results were analyzed. The results showed the flexibility and accuracy of the proposed method. Although different parameters of the digital signal were changed during implementation, the results showed the robustness of the proposed method.
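The abstract does not specify how the LBP operator is applied to a one-dimensional signal, so the following is only a minimal sketch under stated assumptions: each sample is compared with `radius` neighbours on either side, a neighbour at least as large as the centre contributes a 1 bit, and the resulting codes are counted into a normalized histogram. The function names `lbp_codes` and `lbp_histogram` and the `radius` parameter are illustrative, not taken from the paper, and the k-means step is omitted.

```python
def lbp_codes(signal, radius=2):
    """Return one LBP code per interior sample of a 1-D signal.

    Assumption: neighbours are the `radius` samples on each side of the
    centre; a neighbour >= centre sets the corresponding bit.
    """
    signal = list(signal)
    codes = []
    for i in range(radius, len(signal) - radius):
        center = signal[i]
        # Left neighbours first, then right neighbours.
        neighbours = signal[i - radius:i] + signal[i + 1:i + radius + 1]
        code = 0
        for bit, value in enumerate(neighbours):
            if value >= center:
                code |= 1 << bit
        codes.append(code)
    return codes


def lbp_histogram(signal, radius=2):
    """Normalized histogram over the 2**(2*radius) possible LBP codes."""
    bins = [0] * (1 << (2 * radius))
    for code in lbp_codes(signal, radius):
        bins[code] += 1
    total = sum(bins)
    # Leave the all-zero histogram untouched if the signal is too short.
    return [count / total for count in bins] if total else bins
```

With `radius=1`, a strictly increasing signal produces the same code at every interior sample, so all of the histogram mass falls into a single bin. This shows how the histogram summarizes local shape rather than absolute amplitude.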

    Harnessing AI for Speech Reconstruction using Multi-view Silent Video Feed

    Speechreading, or lipreading, is the technique of understanding and extracting phonetic features from a speaker's visual features, such as the movement of the lips, face, teeth and tongue. It has a wide range of multimedia applications, such as surveillance, Internet telephony, and aids for people with hearing impairments. However, most of the work in speechreading has been limited to text generation from silent videos. Recently, research has started venturing into generating (audio) speech from silent video sequences, but there have been no developments thus far in dealing with divergent views and poses of a speaker. Thus, although multiple camera feeds of a user's speech may be available, these feeds have not been used to deal with the different poses. To this end, this paper presents the world's first ever multi-view speech reading and reconstruction system. This work extends the boundaries of multimedia research by putting forth a model which leverages silent video feeds from multiple cameras recording the same subject to generate intelligent speech for a speaker. Initial results confirm the usefulness of exploiting multiple camera views in building an efficient speech reading and reconstruction system. The paper further shows the optimal placement of cameras that leads to the maximum intelligibility of speech. Finally, it lays out various innovative applications for the proposed system, focusing on its potential impact not just in the security arena but in many other multimedia analytics problems.
    Comment: 2018 ACM Multimedia Conference (MM '18), October 22--26, 2018, Seoul, Republic of Kore

    Lip Reading with Hahn Convolutional Neural Networks moments

    Lipreading, or visual speech recognition, is the process of decoding speech from a speaker's mouth movements. It is used to assist people with hearing impairment, to understand patients affected by laryngeal cancer or vocal cord paralysis, and in noisy environments. In this paper we aim to develop a visual-only speech recognition system based only on video. Our main targeted application is in the medical field, for the assistance of laryngectomized persons. To that end, we propose the Hahn Convolutional Neural Network (HCNN), a novel architecture that uses Hahn moments as the first layer of a convolutional neural network (CNN). We show that HCNN helps in reducing the dimensionality of video images and in reducing training time. The HCNN model is trained to classify letters, digits or words given as video images. We evaluated the proposed method on three datasets, AVLetters, OuluVS2 and BBC LRW, and we show that it achieves significant results in comparison with other works in the literature.

    Applications of Face Analysis and Modeling in Media Production

    Facial expressions play an important role in day-to-day communication as well as in media production. This article surveys automatic facial analysis and modeling methods using computer vision techniques and their applications for media production. The authors give a brief overview of the psychology of face perception and then describe some of the applications of computer vision and pattern recognition to face recognition in media production. This article also covers the automatic generation of face models, which are used in movie and TV productions for special effects in order to manipulate people's faces or combine real actors with computer graphics.

    Touchless Typing using Head Movement-based Gestures

    Physical contact-based typing interfaces are not suitable for people with upper limb disabilities such as quadriplegia. This paper therefore proposes a touchless typing interface that makes use of an on-screen QWERTY keyboard and a front-facing smartphone camera mounted on a stand. The keys of the keyboard are grouped into nine color-coded clusters. Users point to the letters that they want to type just by moving their head. The head movements of the users are recorded by the camera. The recorded gestures are then translated into a cluster sequence. The translation module is implemented using CNN-RNN, Conv3D, and a modified GRU-based model that uses pre-trained embeddings rich in head pose features. The performance of these models was evaluated under four different scenarios on a dataset of 2234 video sequences collected from 22 users. The modified GRU-based model outperforms the standard CNN-RNN and Conv3D models in three of the four scenarios. The results are encouraging and suggest promising directions for future research.
    Comment: *The two lead authors contributed equally. The dataset and code are available upon request. Please contact the last autho
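To illustrate what a "cluster sequence" means here, the sketch below maps a typed word to the sequence of key clusters a user would point at. The abstract does not give the actual assignment of keys to the nine clusters, so the grouping in `CLUSTERS` is purely hypothetical, as are the names `KEY_TO_CLUSTER` and `word_to_cluster_sequence`.

```python
# HYPOTHETICAL grouping of the 26 letter keys into nine clusters.
# The paper's real (color-coded) layout is not stated in the abstract.
CLUSTERS = ["qwe", "rty", "uiop", "asd", "fgh", "jkl", "zxc", "vbn", "m"]

# Map every key to its 1-based cluster index.
KEY_TO_CLUSTER = {
    key: index
    for index, keys in enumerate(CLUSTERS, start=1)
    for key in keys
}


def word_to_cluster_sequence(word):
    """Return the cluster index for each letter of `word`."""
    return [KEY_TO_CLUSTER[ch] for ch in word.lower()]


print(word_to_cluster_sequence("hello"))  # [5, 1, 6, 6, 3]
```

Because several keys share one cluster, different words can map to the same sequence (under this grouping, "we" and "qq" both give `[1, 1]`), so a sequence of this kind identifies clusters rather than individual letters.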

    Multimodal Based Audio-Visual Speech Recognition for Hard-of-Hearing: State of the Art Techniques and Challenges

    Multimodal Integration (MI) is the study of merging the knowledge acquired by the nervous system using sensory modalities such as speech, vision, touch, and gesture. The applications of MI span the areas of Audio-Visual Speech Recognition (AVSR), Sign Language Recognition (SLR), Emotion Recognition (ER), Bio Metrics Applications (BMA), Affect Recognition (AR), Multimedia Retrieval (MR), etc. Fusions of modalities such as hand gestures with facial cues, or lip movements with hand position, are the sensory modalities mainly used for the development of multimodal systems for the hearing impaired. This paper gives an overview of the multimodal systems available in the literature on hearing-impaired studies. It also discusses some of the studies related to hearing-impaired acoustic analysis. It is observed that very few algorithms have been developed for hearing-impaired AVSR as compared to normal hearing. Thus, the study of audio-visual speech recognition systems for the hearing impaired is in high demand for people who are trying to communicate in natively spoken languages. This paper also highlights the state-of-the-art techniques in AVSR and the challenges faced by researchers in the development of AVSR systems.