9 research outputs found

    Lips tracking identification of a correct pronunciation of Quranic alphabets for tajweed teaching and learning

    Mastering the recitation of the Holy Quran is an obligation among Muslims. It is an important task for fulfilling other Ibadat such as prayer, pilgrimage, and zikr. However, the traditional way of teaching Quran recitation is difficult because of the extensive training time and effort required from both teacher and learner. In fact, learning the correct pronunciation of the Quranic letters, or alphabets, is the first step in mastering Tajweed (rules and guidance) in Quranic recitation. The pronunciation of each Arabic alphabet is based on its point of articulation and the characteristics of that particular alphabet. In this paper, we implement a lip identification technique on video signals acquired from expert reciters to extract lip-movement data for the correct pronunciation of the Quranic alphabets. The lip-movement data extracted from the experts helps in categorizing the alphabets into five groups and in deciding the final shape of the lips. The technique was then tested on a novice reciter, and the recordings were compared for similarity verification between the novice and the professional reciter. The system is able to extract the lip movement of an arbitrary user, draw the displacement graph, and compare it with the expert's pronunciation. If the user mispronounces an alphabet, the error is shown and ways for improvement are suggested. More subjects with different backgrounds will be tested in the near future with feedback instructions, and machine learning techniques will be implemented at a later stage for a real-time learning application.
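
    The comparison pipeline this abstract describes (track the lips frame by frame, plot a displacement curve, and match it against an expert's curve) can be illustrated with a short sketch. The sketch below is a hypothetical illustration, not the authors' implementation: it assumes the standard dlib 68-point landmark model, uses the vertical lip opening as the displacement signal, and the file names and the 0.2 decision threshold are made up for the example.

        # Hypothetical sketch of expert-vs-learner lip-displacement comparison.
        import cv2
        import dlib
        import numpy as np

        detector = dlib.get_frontal_face_detector()
        # Assumes the standard 68-point facial landmark model is available on disk.
        predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

        def lip_opening_curve(video_path):
            """Vertical distance between upper and lower lip centre, per frame."""
            cap = cv2.VideoCapture(video_path)
            curve = []
            while True:
                ok, frame = cap.read()
                if not ok:
                    break
                gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
                faces = detector(gray)
                if not faces:
                    continue
                pts = predictor(gray, faces[0])
                top, bottom = pts.part(51), pts.part(57)  # upper / lower lip centres
                curve.append(abs(bottom.y - top.y))
            cap.release()
            return np.asarray(curve, dtype=float)

        def similarity(expert, learner, n=100):
            """Resample both curves to n points and return a normalised distance."""
            t = np.linspace(0, 1, n)
            e = np.interp(t, np.linspace(0, 1, len(expert)), expert)
            l = np.interp(t, np.linspace(0, 1, len(learner)), learner)
            e, l = e / (e.max() + 1e-9), l / (l.max() + 1e-9)
            return float(np.linalg.norm(e - l) / np.sqrt(n))

        # score = similarity(lip_opening_curve("expert_ba.mp4"),
        #                    lip_opening_curve("learner_ba.mp4"))
        # print("needs improvement" if score > 0.2 else "acceptable")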

    Harnessing AI for Speech Reconstruction using Multi-view Silent Video Feed

    Speechreading, or lipreading, is the technique of understanding speech and extracting phonetic features from a speaker's visual cues, such as the movement of the lips, face, teeth, and tongue. It has a wide range of multimedia applications, for example in surveillance, Internet telephony, and as an aid to people with hearing impairments. However, most work in speechreading has been limited to generating text from silent videos. Recently, research has started venturing into generating (audio) speech from silent video sequences, but there have been no developments thus far in dealing with divergent views and poses of a speaker. Thus, although multiple camera feeds of a speaker may be available, they have not been used to deal with the different poses. To this end, this paper presents the world's first multi-view speechreading and reconstruction system. This work pushes the boundaries of multimedia research by putting forth a model which leverages silent video feeds from multiple cameras recording the same subject to generate intelligible speech for a speaker. Initial results confirm the usefulness of exploiting multiple camera views in building an efficient speechreading and reconstruction system. The work further identifies the camera placement that leads to the maximum intelligibility of speech, and it lays out various innovative applications for the proposed system, focusing on its potential impact not just in the security arena but in many other multimedia analytics problems. Comment: 2018 ACM Multimedia Conference (MM '18), October 22-26, 2018, Seoul, Republic of Korea.
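
    As a rough illustration of the kind of multi-view fusion the abstract describes, the sketch below encodes each camera view with a shared frame encoder, averages the per-view embeddings, and decodes the fused sequence into acoustic frames. This is a hypothetical PyTorch sketch, not the paper's architecture; the layer sizes, the 80-dimensional mel-spectrogram target, and the averaging fusion are illustrative assumptions.

        # Hypothetical multi-view video-to-speech sketch (not the paper's model).
        import torch
        import torch.nn as nn

        class MultiViewSpeechNet(nn.Module):
            def __init__(self, feat_dim=256, audio_dim=80):
                super().__init__()
                self.frame_encoder = nn.Sequential(          # shared across views
                    nn.Conv2d(1, 32, 5, stride=2, padding=2), nn.ReLU(),
                    nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(4), nn.Flatten(),
                    nn.Linear(64 * 16, feat_dim),
                )
                self.decoder = nn.GRU(feat_dim, feat_dim, batch_first=True)
                self.to_audio = nn.Linear(feat_dim, audio_dim)  # acoustic frames

            def forward(self, views):
                # views: (batch, n_views, time, 1, H, W) grayscale mouth crops
                b, v, t, c, h, w = views.shape
                x = self.frame_encoder(views.reshape(b * v * t, c, h, w))
                x = x.reshape(b, v, t, -1).mean(dim=1)       # fuse views by averaging
                x, _ = self.decoder(x)
                return self.to_audio(x)                      # predicted acoustic frames

        # Usage sketch: three 64x64 mouth-region views, 75 frames each.
        # mel = MultiViewSpeechNet()(torch.randn(2, 3, 75, 1, 64, 64))  # -> (2, 75, 80)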

    Audio-Visual Speech Recognition Using Lip Information Extracted from Side-Face Images

    This paper proposes an audio-visual speech recognition method using lip information extracted from side-face images as an attempt to increase noise robustness in mobile environments. Our proposed method assumes that lip images can be captured using a small camera installed in a handset. Two different kinds of lip features, lip-contour geometric features and lip-motion velocity features, are used individually or jointly, in combination with audio features. Phoneme HMMs modeling the audio and visual features are built based on the multistream HMM technique. Experiments conducted using Japanese connected digit speech contaminated with white noise in various SNR conditions show the effectiveness of the proposed method. Recognition accuracy is improved by using the visual information in all SNR conditions. These visual features were confirmed to be effective even when the audio HMM was adapted to noise by the MLLR method.
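
    The multistream HMM combination mentioned above boils down to weighting the per-state emission log-likelihoods of the audio and visual streams. The snippet below is a minimal sketch of that idea, not the paper's code; the stream-weight value is an illustrative assumption.

        # Sketch of stream-weighted emission scores in a multistream HMM.
        import numpy as np

        def multistream_loglik(log_b_audio, log_b_visual, audio_weight=0.7):
            """Combine per-state emission log-likelihoods from two streams.

            log_b_audio, log_b_visual: arrays of shape (n_frames, n_states).
            audio_weight: stream exponent; lowering it as the SNR drops lets the
            lip features carry more of the decision.
            """
            w_a, w_v = audio_weight, 1.0 - audio_weight
            return w_a * log_b_audio + w_v * log_b_visual

        # A Viterbi decoder over the phoneme HMM states would then use the combined
        # log-likelihoods exactly as it would a single-stream emission score.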

    Sistema de clasificación y exposición de características faciales SICECAF (a facial feature classification and display system)

    Nowadays, technologies related to automatic speech recognition have developed exponentially. Research in this field has improved human-machine interaction, yielding new kinds of applications related to communication. Although the capabilities of speech recognizers have increased in recent years, they still have important shortcomings. Among the most common are noise in the transmission channel and the ambiguities of language, which lead to a considerable lack of accuracy. Solving these problems requires improving the performance of such systems, both the capabilities of the audio devices and the recognition algorithms, by taking into account the visual cues present in speech. This report presents a facial recognition system intended to increase the performance of current recognizers, built by combining different methods for the visualization and discrimination of facial regions. (Ingeniería Técnica en Informática de Gestión)
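
    As a minimal sketch in the spirit of the system described (detect a face and then discriminate sub-regions within it), the following uses the Haar cascades bundled with OpenCV. It is not the SICECAF implementation; the cascade choices, the lower-third heuristic for the mouth region, and the input file name are assumptions for the example.

        # Hypothetical facial-zone discrimination sketch with OpenCV Haar cascades.
        import cv2

        face_cascade = cv2.CascadeClassifier(
            cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
        eye_cascade = cv2.CascadeClassifier(
            cv2.data.haarcascades + "haarcascade_eye.xml")

        def facial_zones(image_path):
            img = cv2.imread(image_path)
            gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
            zones = []
            for (x, y, w, h) in face_cascade.detectMultiScale(gray, 1.3, 5):
                zones.append(("face", (x, y, w, h)))
                roi = gray[y:y + h, x:x + w]
                for (ex, ey, ew, eh) in eye_cascade.detectMultiScale(roi):
                    zones.append(("eye", (x + ex, y + ey, ew, eh)))
                # lower third of the face box as a coarse mouth region
                zones.append(("mouth_region", (x, y + 2 * h // 3, w, h // 3)))
            return zones

        # print(facial_zones("speaker.png"))  # hypothetical input image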

    Robust visual speech recognition using optical flow analysis and rotation invariant features

    The focus of this thesis is to develop computer vision algorithms for a visual speech recognition system that identifies visemes. The majority of existing speech recognition systems are based on audio-visual signals, have been developed for speech enhancement, and are prone to acoustic noise. Considering this problem, the aim of this research is to investigate and develop a visual-only speech recognition system suitable for noisy environments. Potential applications of such a system include lip-reading mobile phones, human-computer interfaces (HCI) for mobility-impaired users, robotics, surveillance, improvement of speech-based computer control in noisy environments, and the rehabilitation of persons who have undergone laryngectomy surgery.

    In the literature, several models and algorithms are available for visual feature extraction. These features are extracted from static mouth images and characterized as appearance- and shape-based features. However, these methods rarely incorporate the time-dependent information of mouth dynamics. This dissertation presents two optical-flow-based approaches to visual feature extraction, which capture the mouth motion in an image sequence. The motivation for using motion features is that human lip-reading perception is concerned with the temporal dynamics of mouth motion.

    The first approach is based on extracting features from the vertical component of the optical flow. The vertical component is decomposed into multiple non-overlapping fixed-scale blocks, and statistical features of each block are computed for successive video frames of an utterance. To overcome the issue of large variation in the speed of speech, each utterance is normalized using a simple linear interpolation method. In the second approach, four directional motion templates based on optical flow, the directional motion history images (DMHIs), are developed, each representing the consolidated motion information of an utterance in one of four directions (up, down, left, and right). This approach is an evolution of a view-based approach known as the motion history image (MHI). One of the main issues with the MHI method is its motion overwriting problem caused by self-occlusion; DMHIs resolve this overwriting issue. Two types of image descriptors, Zernike moments and Hu moments, are used to represent each image of the DMHIs.

    A support vector machine (SVM) classifier was used to classify the features obtained from the optical flow vertical component, and from the Zernike and Hu moments, separately. For the identification of visemes, a multiclass SVM approach was employed. A video speech corpus of seven subjects was used to evaluate the efficiency of the proposed lip-reading methods. The experimental results demonstrate the promising performance of the optical-flow-based mouth movement representations. A performance comparison between DMHI and MHI based on Zernike moments shows that the DMHI technique outperforms the MHI technique. The thesis also proposes a video-based ad hoc temporal segmentation method for isolated utterances, used to detect the start and end frames of an utterance in an image sequence. The technique is based on a pair-wise pixel comparison method. Its efficiency was tested on the available data set, which has short pauses between utterances.
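
    A hedged sketch of the first approach described above (block-wise statistics of the vertical optical-flow component, time-normalized by linear interpolation, fed to a multiclass SVM) might look as follows. It is not the thesis code; the 4x4 block grid, the 30-frame target length, and the SVM settings are illustrative assumptions.

        # Sketch: vertical optical-flow block statistics + multiclass SVM for visemes.
        import cv2
        import numpy as np
        from sklearn.svm import SVC

        def vertical_flow_features(frames, grid=4):
            """frames: equal-size grayscale mouth images; returns (n_frames-1, 2*grid*grid)."""
            feats = []
            for prev, nxt in zip(frames[:-1], frames[1:]):
                flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                                    0.5, 3, 15, 3, 5, 1.2, 0)
                vy = flow[..., 1]                      # vertical flow component
                h, w = vy.shape
                bh, bw = h // grid, w // grid
                row = []
                for i in range(grid):
                    for j in range(grid):
                        block = vy[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
                        row += [block.mean(), block.std()]  # per-block statistics
                feats.append(row)
            return np.asarray(feats)

        def normalise_time(feats, target_frames=30):
            """Linear interpolation over time so fast and slow utterances align."""
            t_old = np.linspace(0, 1, feats.shape[0])
            t_new = np.linspace(0, 1, target_frames)
            cols = [np.interp(t_new, t_old, feats[:, k]) for k in range(feats.shape[1])]
            return np.stack(cols, axis=1).ravel()

        # Training sketch: X holds one fixed-length vector per utterance, y the viseme label.
        # clf = SVC(kernel="rbf", decision_function_shape="ovo").fit(X, y)
        # predicted_viseme = clf.predict(X_test)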

    Audio-Visual Speech Recognition Using New Lip Features Extracted from Side-Face Images

    This paper proposes new visual features for audio-visual speech recognition using lip information extracted from side-face images. In order to increase the noise robustness of speech recognition, we have previously proposed an audio-visual speech recognition method using speaker lip information extracted from side-face images taken by a small camera installed in a mobile device. Our previous method used only lip-movement information, measured by optical-flow analysis, as a visual feature. However, since lip-shape information is also obviously important, this paper attempts to combine lip-shape information with lip-movement information to improve audio-visual speech recognition performance. A combination of the angle between the upper and lower lips (lip angle) and its derivative is extracted as the lip-shape features. The effectiveness of the lip-angle features has been evaluated under various SNR conditions. The proposed features improved recognition accuracies in all SNR conditions in comparison with audio-only recognition results. The best improvement, 8.0% absolute, was obtained at the 5 dB SNR condition. Combining the lip-angle features with our previous features extracted by optical-flow analysis yielded further improvement. These visual features were confirmed to be effective even when the audio HMM used in our method was adapted to noise by the MLLR method.
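
    A minimal sketch of the lip-angle feature described above is given below, under the assumption that per-frame 2-D positions of the mouth corner, an upper-lip point, and a lower-lip point have already been tracked in the side-face images. It is not the authors' implementation; the point names and the use of a simple numerical gradient for the derivative are assumptions.

        # Sketch: lip-angle feature and its derivative from tracked side-face lip points.
        import numpy as np

        def lip_angle(corner, upper, lower):
            """Angle (radians) at the mouth corner between upper- and lower-lip vectors."""
            u = np.asarray(upper, dtype=float) - np.asarray(corner, dtype=float)
            l = np.asarray(lower, dtype=float) - np.asarray(corner, dtype=float)
            cos = np.dot(u, l) / (np.linalg.norm(u) * np.linalg.norm(l) + 1e-9)
            return float(np.arccos(np.clip(cos, -1.0, 1.0)))

        def lip_angle_features(corners, uppers, lowers):
            """Per-frame lip angle plus its first derivative, stacked column-wise."""
            angles = np.array([lip_angle(c, u, l)
                               for c, u, l in zip(corners, uppers, lowers)])
            deltas = np.gradient(angles)          # simple frame-to-frame derivative
            return np.column_stack([angles, deltas])

        # These two columns would be appended to the per-frame audio feature vector
        # before training the multistream phoneme HMMs.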