
    Auditory-Visual Integration during the Perception of Spoken Arabic

    This thesis aimed to investigate the effect of visual speech cues on auditory-visual integration during speech perception in Arabic. Four experiments were conducted, two of which were cross-linguistic studies using Arabic and English listeners. To compare the influence of visual speech in Arabic and English listeners, Chapter 3 investigated the use of the visual components of auditory-visual stimuli in native versus non-native speech using the McGurk effect. The experiment suggested that Arabic listeners' speech perception was influenced by the visual components of speech to a lesser degree than English listeners'. Furthermore, auditory and visual assimilation was observed for non-native speech cues. Additionally, when the visual cue was an emphatic phoneme, the Arabic listeners incorporated the emphatic visual cue in their McGurk response. Chapter 4 investigated whether the lower McGurk response rate found in Arabic listeners in Chapter 3 was due to a bottom-up mechanism of visual processing speed. Using auditory-visual temporally asynchronous conditions, Chapter 4 concluded that the difference in McGurk response percentage was not due to such a bottom-up mechanism. This raised the question of whether the difference in auditory-visual integration of speech could be due to more ambiguous visual cues in Arabic than in English. To explore this question it was first necessary to identify the visemes of Arabic. Chapter 5 identified 13 viseme categories in Arabic; some emphatic visemes were visually distinct from their non-emphatic counterparts, and the guttural viseme category contained a greater number of phonemes than in English. Chapter 6 evaluated the influence of visual speech across the 13 Arabic viseme categories, as measured by the McGurk effect. It was concluded that the predictive power of visual cues and the contrast between visual and auditory speech components lead to an increase in the McGurk response percentage in Arabic.

    Visual Speech Recognition

    In recent years, visual speech recognition has received more attention from researchers than in the past. The lack of work on visual processing for Arabic vocabulary recognition motivated the research in this field. Audio speech recognition is concerned with the acoustic characteristics of the signal, but there are many situations in which the audio signal is weak or absent; this is discussed in Chapter 2. The visual recognition process focuses on features extracted from video of the speaker, which are then classified using several techniques. The most important feature to be extracted is motion: by segmenting the motion of the speaker's lips, an algorithm can process it in such a way as to recognise the spoken word. Motion segmentation is not the only problem facing the recognition process, however; segmenting the lips themselves is an earlier step, so a new approach to lip segmentation is proposed in this thesis. Because the motion feature sometimes needs another feature to support recognition of the spoken word, the thesis also proposes a new algorithm that performs motion segmentation using the Abstract Difference Image computed from an image series, supported by correlation-based registration of the images, to recognise ten words in the Arabic language (the digits "one" to "ten"). The algorithm describes the Abstract Difference Image with the Hu invariant set of features and applies three different recognition methods to recognise the words. The CLAHE filtering technique is used to handle lighting problems. The algorithm, based on extracting difference details from a series of images to recognise the word, achieved an overall result of 55.8%, an adequate result for integration into an audio-visual system.
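
    As a rough illustration of the pipeline described above, the following is a minimal sketch, not the thesis's exact algorithm: it assumes a pre-cropped, grayscale lip-region frame sequence, omits the correlation-based registration step, and the function names, CLAHE settings and accumulated-difference formulation are illustrative choices rather than the author's.

        # Sketch: motion summary of an utterance via an accumulated difference image,
        # lighting normalisation with CLAHE, and Hu moment invariants as descriptors.
        # Assumes `frames` is a list of uint8 grayscale lip-region images.
        import cv2
        import numpy as np

        def accumulated_difference_image(frames):
            """Sum absolute frame-to-frame differences over a grayscale sequence."""
            acc = np.zeros_like(frames[0], dtype=np.float32)
            for prev, curr in zip(frames, frames[1:]):
                acc += cv2.absdiff(curr, prev).astype(np.float32)
            return cv2.normalize(acc, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

        def describe_word(frames):
            clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
            frames = [clahe.apply(f) for f in frames]      # reduce lighting variation
            adi = accumulated_difference_image(frames)      # motion summary of the utterance
            moments = cv2.moments(adi)
            return cv2.HuMoments(moments).flatten()         # seven shape descriptors

    A classifier (for example nearest-neighbour over these descriptors) would then pick one of the ten Arabic digit words; the thesis compares three recognition methods at this stage.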

    Decoding visemes: improving machine lip-reading

    Abstract: This thesis is about improving machine lip-reading, that is, the classification of speech from only the visual cues of a speaker. Machine lip-reading is a niche research problem in both speech processing and computer vision. Current challenges for machine lip-reading fall into two groups: the content of the video, such as the rate at which a person is speaking; and the parameters of the video recording, for example the video resolution. We begin our work with a literature review to understand how current technology restricts machine lip-reading recognition, and conduct an experiment into resolution effects. We show that high-definition video is not needed to successfully lip-read with a computer. The term "viseme" is used in machine lip-reading to represent a visual cue or gesture which corresponds to a subgroup of phonemes that are indistinguishable in the visual speech signal. Whilst a viseme is yet to be formally defined, we use the common working definition 'a viseme is a group of phonemes with identical appearance on the lips'. A phoneme is the smallest acoustic unit a human can utter. Because there can be several phonemes per viseme, mapping between the units creates a many-to-one relationship. Many mappings have been presented, and we conduct an experiment to determine which mapping produces the most accurate classification. Our results show Lee's [82] is best. Lee's classification also outperforms machine lip-reading systems which use the popular Fisher [48] phoneme-to-viseme map. Further to this, we propose three methods of deriving speaker-dependent phoneme-to-viseme maps and compare our new approaches to Lee's. Our results show the sensitivity of phoneme clustering, and we use this new knowledge for our first suggested augmentation to the conventional lip-reading system. Speaker independence in machine lip-reading classification is another unsolved obstacle. It has been observed, in the visual domain, that classifiers need training on the test subject to achieve the best classification; thus machine lip-reading is highly dependent upon the speaker. Speaker independence is the opposite of this, or in other words, the classification of a speaker not present in the classifier's training data. We investigate the dependence of phoneme-to-viseme maps between speakers. Our results show there is not high variability between visual cues, but there is high variability in trajectory between the visual cues of an individual speaker with the same ground truth. This implies a dependency upon the number of visemes within each set for each individual. Finally, we investigate how many visemes is the optimum number within a set. We show that the phoneme-to-viseme maps in the literature rarely have enough visemes, and that the optimal number, which varies by speaker, ranges from 11 to 35. The last difficulty we address is decoding from visemes back to phonemes and into words. Traditionally this is completed using a language model, whose unit is either the same as the classifier's (e.g. visemes or phonemes) or words. In a novel approach we use these optimum-range viseme sets within hierarchical training of phoneme-labelled classifiers. This new method of classifier training demonstrates a significant increase in classification with a word language network.
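
    To make the many-to-one relationship concrete, here is a minimal sketch of a phoneme-to-viseme map as a data structure. The entries are illustrative only: they are not Lee's [82] or Fisher's [48] mappings, just a few bilabial, labiodental and alveolar examples showing the map, its inversion, and how a phoneme transcription collapses into a viseme transcription.

        # Sketch: many-to-one phoneme-to-viseme map, its one-to-many inverse,
        # and conversion of a phoneme string into visemes. Entries are invented.
        PHONEME_TO_VISEME = {
            "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
            "f": "V_labiodental", "v": "V_labiodental",
            "t": "V_alveolar", "d": "V_alveolar", "n": "V_alveolar",
        }

        def invert(p2v):
            """Group phonemes by viseme: the one-to-many inverse of the map above."""
            v2p = {}
            for phoneme, viseme in p2v.items():
                v2p.setdefault(viseme, []).append(phoneme)
            return v2p

        def transcribe(phonemes, p2v):
            """Convert a phoneme transcription into a viseme transcription."""
            return [p2v[p] for p in phonemes]

        print(invert(PHONEME_TO_VISEME))                 # {'V_bilabial': ['p', 'b', 'm'], ...}
        print(transcribe(["b", "f", "t"], PHONEME_TO_VISEME))

    The inversion makes the ambiguity explicit: a viseme classifier output must later be expanded back to its phoneme candidates, which is exactly the decoding problem the thesis addresses with a language model.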

    Modelling talking human faces

    This thesis investigates a number of new approaches to visual speech synthesis using data-driven methods to implement a talking face. The main contributions of this thesis are the following. The accuracy of the shared Gaussian process latent variable model (SGPLVM), built using active appearance model (AAM) and relative spectral transform-perceptual linear prediction (RASTA-PLP) features, is improved by employing a more accurate AAM. This is the first study to report that using a more accurate AAM improves the accuracy of the SGPLVM. Objective evaluation via reconstruction error is performed to compare the proposed approach against previously existing methods. In addition, it is shown experimentally that the accuracy of the AAM can be improved by using a larger number of landmarks and/or a larger number of samples in the training data. The second research contribution is a new method for visual speech synthesis utilising a fully Bayesian method, namely manifold relevance determination (MRD), for modelling dynamical systems through probabilistic non-linear dimensionality reduction. This is the first time MRD has been used in the context of generating talking faces from an input speech signal. The expressive power of this model lies in its ability to consider non-linear mappings between audio and visual features within a Bayesian approach. An efficient latent space is learnt using a fully Bayesian latent representation relying on a conditional non-linear independence framework. In the SGPLVM the structure of the latent space cannot be automatically estimated because a maximum likelihood formulation is used; in contrast, the Bayesian approach allows automatic determination of the dimensionality of the latent spaces. The proposed method compares favourably against several other state-of-the-art methods for visual speech generation, as shown in quantitative and qualitative evaluation on two different datasets. Finally, the possibility of incremental learning of the AAM for inclusion in the proposed MRD approach for visual speech generation is investigated. The quantitative results demonstrate that using MRD in conjunction with incremental AAMs produces only slightly less accurate results than using batch methods. These results support training this kind of model on computers with limited resources, for example in mobile computing. Overall, this thesis proposes several improvements to the current state of the art in generating talking faces from a speech signal, leading to perceptually more convincing results.
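
    The core idea, predicting visual (AAM-style) features from audio (RASTA-PLP-style) features through a shared latent space, can be sketched with a simple linear stand-in. The following is illustrative only: the thesis uses SGPLVM and Bayesian MRD, which are non-linear probabilistic models, whereas this sketch uses scikit-learn's linear CCA, and all feature dimensions and the synthetic data are invented for the example.

        # Sketch: learn a shared audio-visual subspace and predict visual features
        # from audio features. Linear CCA stands in for the non-linear SGPLVM/MRD.
        import numpy as np
        from sklearn.cross_decomposition import CCA

        rng = np.random.default_rng(0)
        n_frames, audio_dim, visual_dim, latent_dim = 500, 13, 20, 6

        audio = rng.normal(size=(n_frames, audio_dim))    # placeholder acoustic features
        visual = audio @ rng.normal(size=(audio_dim, visual_dim)) \
            + 0.1 * rng.normal(size=(n_frames, visual_dim))  # placeholder visual features

        cca = CCA(n_components=latent_dim)
        cca.fit(audio, visual)                            # learn the shared subspace

        predicted_visual = cca.predict(audio)             # audio -> visual prediction
        reconstruction_error = np.mean((predicted_visual - visual) ** 2)
        print(f"mean squared reconstruction error: {reconstruction_error:.4f}")

    The reconstruction error computed here mirrors the objective evaluation used in the thesis to compare models; the Bayesian MRD additionally infers the latent dimensionality automatically rather than fixing it in advance as this sketch does.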

    Facial Modelling and animation trends in the new millennium : a survey

    M.Sc. (Computer Science). Facial modelling and animation is considered one of the most challenging areas in the animation world. Since Parke and Waters's (1996) comprehensive book, no major work encompassing the entire field of facial animation has been published. This thesis covers Parke and Waters's work, while also providing a survey of the developments in the field since 1996. The thesis describes, analyses, and compares (where applicable) the existing techniques and practices used to produce facial animation. Where applicable, related techniques are grouped in the same chapter and described in chronological fashion, outlining their differences as well as their advantages and disadvantages. The thesis concludes with exploratory work towards a talking head for Northern Sotho: facial animation and lip synchronisation of a fragment of Northern Sotho is performed using software tools primarily designed for English.

    Visual Speech Enhancement and its Application in Speech Perception Training

    This thesis investigates methods for visual speech enhancement to support auditory and audiovisual speech perception. Normal-hearing non-native listeners receiving cochlear implant (CI) simulated speech are used as 'proxy' listeners for CI users, a proposed user group who could benefit from such enhancement methods in speech perception training. Both CI users and non-native listeners share similarities with regard to audiovisual speech perception, including increased sensitivity to visual speech cues. Two enhancement methods are proposed: (i) an appearance-based method, which modifies the appearance of a talker's lips using colour and luminance blending to apply a 'lipstick effect' that increases the saliency of mouth shapes; and (ii) a kinematics-based method, which amplifies the kinematics of the talker's mouth to create the effect of more pronounced speech (an 'exaggeration effect'). The application used to test the enhancements is speech perception training, or audiovisual training, which can be used to improve listening skills. An audiovisual training framework is presented which structures the evaluation of the effectiveness of these methods, and it is used in two studies. The first study, which evaluates the effectiveness of the lipstick effect, found a significant improvement in audiovisual and auditory perception. The second study, which evaluates the effectiveness of the exaggeration effect, found improvement in the audiovisual perception of a number of phoneme classes; no evidence was found of improvements in subsequent auditory perception, as audiovisual recalibration to visually exaggerated speech may have impeded learning when used in the audiovisual training. The thesis also investigates an example of kinematics-based enhancement observed in Lombard speech, by studying the behaviour of visual Lombard phonemes in different contexts. Owing to the lack of suitable datasets for this analysis, the thesis presents a novel audiovisual Lombard speech dataset recorded at high SNR, which offers two fixed head-pose, synchronised views of each talker.
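
    A minimal sketch of an appearance-style 'lipstick effect' is given below. It assumes a binary lip mask is already available (the thesis works from talker video; how the mask is obtained, the colour chosen and the blending weight are all illustrative assumptions, not the author's exact colour and luminance blending procedure).

        # Sketch: blend a saturated colour into the lip region of a BGR video frame
        # to raise the visual saliency of mouth shapes. `lip_mask` is assumed given.
        import cv2
        import numpy as np

        def lipstick_effect(frame_bgr, lip_mask, colour=(40, 40, 200), alpha=0.4):
            """Blend `colour` into the masked lip pixels with weight `alpha`."""
            overlay = frame_bgr.copy()
            overlay[lip_mask > 0] = colour                       # paint the lip pixels
            blended = cv2.addWeighted(overlay, alpha, frame_bgr, 1.0 - alpha, 0)
            out = frame_bgr.copy()
            out[lip_mask > 0] = blended[lip_mask > 0]            # modify only the lip region
            return out

    Applying this per frame before presenting the training stimuli is the kind of appearance manipulation the first study evaluates; the kinematics-based exaggeration effect instead rescales the motion of tracked mouth landmarks rather than their colour.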

    Visual Speech Synthesis using Dynamic Visemes and Deep Learning Architectures

    The aim of this work is to improve the naturalness of visual speech synthesis produced automatically from a linguistic input, relative to existing methods. Firstly, and most importantly, we investigate the most suitable speech units for visual speech synthesis. We propose the use of dynamic visemes instead of phonemes or static visemes and find that dynamic visemes generate better visual speech than either phoneme or static viseme units; moreover, the best performance is obtained by a combined phoneme-dynamic viseme system. Secondly, we examine the most appropriate model among the hidden Markov model (HMM) and several deep learning models, including feedforward and recurrent structures with one-to-one, many-to-one and many-to-many architectures. Results suggest that frame-by-frame synthesis with the deep learning approaches outperforms state-based synthesis with the HMM approaches, and that an encoder-decoder many-to-many architecture is better than the one-to-one and many-to-one architectures. Thirdly, we explore the importance of contextual features that include information at varying linguistic levels, from the frame level up to the utterance level. We find that frame-level information is the most valuable feature, as it avoids discontinuities in the visual feature sequence and produces a smooth and realistic animation output. Fourthly, we find that the two most common objective measures, correlation and root mean square error, are not able to indicate the realism and naturalness of human-perceived quality. We introduce an alternative objective measure and show that the global variance is a better indicator of human perception of quality. Finally, we propose a novel method to convert a given text input and phoneme transcription into a dynamic viseme transcription when a reference dynamic viseme sequence is not available. Subjective preference tests confirm that our proposed method is able to produce animations that are statistically indistinguishable from animations produced using reference data.
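
    The global-variance comparison mentioned above can be sketched in a few lines. This is a hedged illustration of the general idea, not the thesis's exact formulation: it computes the per-dimension variance of visual feature trajectories and compares synthesised output against ground truth, on invented placeholder data, to show how over-smoothed synthesis shows up as reduced global variance.

        # Sketch: global variance (per-dimension variance over time) of visual
        # feature trajectories, compared between natural and synthesised speech.
        import numpy as np

        def global_variance(features):
            """Variance of each feature dimension across all frames of a T x D array."""
            return np.var(features, axis=0)

        rng = np.random.default_rng(1)
        natural = rng.normal(scale=1.0, size=(300, 20))   # placeholder ground-truth features
        synthesised = 0.6 * natural                        # over-smoothed synthetic output

        gv_ratio = global_variance(synthesised) / global_variance(natural)
        print("mean GV ratio (1.0 = natural variability):", gv_ratio.mean().round(3))

    A ratio well below 1.0 indicates the kind of damped, unnaturally smooth animation that correlation and RMSE alone fail to penalise, which is why global variance tracks perceived quality more closely.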

    Audio-visual training effect on L2 perception and production of English /θ/-/s/ and /ð/-/z/ by Mandarin speakers

    PhD Thesis. Research on L2 speech perception and production indicates that adult language learners are able to acquire L2 speech sounds that they initially have difficulty with (Best, 1994). Moreover, use of the audiovisual modality, which provides language learners with articulatory information for speech sounds, has been shown to be effective in L2 speech perception training (Hazan et al., 2005). Since auditory and visual skills are integrated with each other in speech perception, audiovisual perception training may enhance language learners' auditory perception of L2 speech sounds (Bernstein, Auer Jr, Eberhardt, and Jiang, 2013). However, little research has been conducted on L1 Mandarin learners of English. Based on these findings, this study investigated whether audiovisual perception training can improve learners' auditory perception and production of L2 speech sounds. A pilot study was performed on 42 L1-Mandarin learners of English (L1 dialect: Chongqing Mandarin (CQd)) in which their perception and production of English consonants was tested. According to the results, 29 of the subjects had difficulty with the perception and production of /θ/-/s/ and /ð/-/z/. These 29 subjects were therefore selected as the experimental group to attend a 9-session audiovisual perception training programme, in which identification tasks for the minimal pairs /θ/-/s/ and /ð/-/z/ were conducted. The subjects' perception and production performance was tested before, during and at the end of the training with an AXB task and a read-aloud task. In view of the threat to internal validity arising from a repeated testing effect, a control group was tested with the same AXB task and at the same intervals as the experimental group. The results show that the experimental group's perception and production accuracy improved substantially during and by the end of the training programme. Indeed, whilst the control group also showed perception improvement between the pre-test and post-test, their degree of improvement was significantly lower than that of the experimental group. These results therefore confirm the value of the audiovisual modality in L2 speech perception training.

    Deaf students and spoken languages

    This monograph aims to enable the teacher of spoken languages to understand the world of the deaf student and the basic techniques for entering that world, and thus to fulfil their mission. To achieve this, the anatomy and physiology of the ear are examined following Peter Alberti's (1995) book, in order to understand what deafness is. The work also considers the role of hearing in the formation of thought, and therefore how deafness influences the constitution of a parallel culture within our society. The bases for this study are the earliest definitions of thought, which were not found treated in greater depth by authors later than John Locke (1690). The means of communication of deaf people are likewise analysed, with their own language, Sign Language, and its specific features. This analysis draws both on the foundations of linguistics according to Ferdinand de Saussure (1916) and on the first studies of this language, from the Abbé de l'Épée (1776) to William Stokoe (1960) with his linguistic study of the usual communication method of non-hearing people. To move on to the pedagogical applications, which concern what to teach and by what means, the psychology of the deaf student is considered, together with their affinity with or rejection of the oral world and the solutions currently available for the education of deaf students.