
    Auditory-Visual Integration during the Perception of Spoken Arabic

    This thesis investigated the effect of visual speech cues on auditory-visual integration during speech perception in Arabic. Four experiments were conducted, two of which were cross-linguistic studies using Arabic and English listeners. To compare the influence of visual speech in Arabic and English listeners, Chapter 3 investigated the use of the visual components of auditory-visual stimuli in native versus non-native speech using the McGurk effect. The experiment suggested that Arabic listeners’ speech perception was influenced by the visual components of speech to a lesser degree than English listeners’. Furthermore, auditory and visual assimilation was observed for non-native speech cues. Additionally, when the visual cue was an emphatic phoneme, the Arabic listeners incorporated the emphatic visual cue into their McGurk response. Chapter 4 investigated whether the lower McGurk response rate in Arabic listeners found in Chapter 3 was due to a bottom-up mechanism of visual processing speed. Using auditory-visual temporally asynchronous conditions, Chapter 4 concluded that the differences in McGurk response percentage were not due to such a bottom-up mechanism. This led to the question of whether the difference in auditory-visual integration of speech could be due to more ambiguous visual cues in Arabic compared to English. To explore this question, it was first necessary to identify visemes in Arabic. Chapter 5 identified 13 viseme categories in Arabic; some emphatic visemes were visually distinct from their non-emphatic counterparts, and a greater number of phonemes fell within the guttural viseme category than in English. Chapter 6 evaluated the influence of visual speech across the 13 Arabic viseme categories, as measured by the McGurk effect. It was concluded that the predictive power of the visual cues and the contrast between the visual and auditory speech components lead to an increase in the McGurk response percentage in Arabic.
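    The analyses summarised above come down to tabulating, per listener group and stimulus condition, the percentage of trials on which the fused (McGurk) percept was reported. A minimal sketch of that tabulation follows; the column names and example rows are hypothetical placeholders, not the thesis's data.

```python
# Minimal sketch: McGurk response percentages per listener group and
# audio-visual asynchrony condition. Column names and data are
# hypothetical examples, not the thesis's dataset.
import pandas as pd

trials = pd.DataFrame({
    "listener_group": ["Arabic", "Arabic", "English", "English"],
    "soa_ms":         [0, 200, 0, 200],           # audio-visual asynchrony
    "response":       ["fusion", "auditory", "fusion", "fusion"],
})

# A trial counts as a McGurk response when the fused percept was reported.
trials["mcgurk"] = trials["response"].eq("fusion")

pct = (trials.groupby(["listener_group", "soa_ms"])["mcgurk"]
             .mean()
             .mul(100)
             .rename("mcgurk_pct"))
print(pct)
```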

    Augmented Reality Talking Heads as a Support for Speech Perception and Production


    Cross-format integration between spoken number words and Arabic digits

    Spoken number words and Arabic digits are the most commonly used numerical symbols. We often transcode numerals from one format to the other, so after years of usage the correspondence between them should become over-learned and automatic. It has been shown that an integration effect usually arises when a stimulus pairing is over-learned, and this is often reflected in the mismatch negativity (MMN). The current thesis conducted two behavioural experiments (Chapter 2) and three EEG experiments (Chapters 3-5) to systematically investigate the cross-modal correspondence, i.e., the integration, between spoken number words and Arabic digits in adult participants. In the behavioural experiments, a clear distance effect was found in an audiovisual matching task. This suggests that an amodal, shared magnitude representation is activated for cross-modal numerals during a matching judgement. Moreover, the distance effect was modulated by stimulus onset asynchrony (SOA): it became smaller as the SOA increased. This resembles the pattern of a typical integration effect, because integration usually occurs when cross-modal stimuli are temporally close. However, a disadvantage of a behavioural task is that the RTs can be influenced by response selection or response execution. Hence, in my EEG experiments I used an oddball paradigm in which no responses to the cross-modal numerals were required. The results of the three EEG experiments showed that an early integration effect between spoken number words and Arabic digits exists in the mismatch negativity (MMN). This is the first result to show the presence of a cross-format integration between spoken number words and Arabic digits. However, the integration effect was also modulated by distance as well as by SOA, which may suggest that the cross-modal correspondence between audiovisual numerals is more complicated than that between other kinds of audiovisual stimuli, such as letters and speech sounds.
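    The MMN effect referred to above is conventionally quantified as a deviant-minus-standard difference wave averaged over a post-stimulus window at fronto-central channels. The sketch below illustrates that computation with plain NumPy on simulated epochs; the array shapes, sampling rate, channel index and 150-250 ms window are illustrative assumptions, not the thesis's parameters.

```python
# Minimal sketch: deviant-minus-standard difference wave and mean MMN
# amplitude. Shapes, sampling rate, and the analysis window are assumptions
# for illustration only.
import numpy as np

sfreq = 500.0                                    # Hz (assumed)
n_trials, n_channels, n_samples = 120, 32, 400   # 0.8 s epochs (assumed)

# Simulated single-trial EEG epochs, time-locked to stimulus onset.
rng = np.random.default_rng(0)
standard_epochs = rng.normal(size=(n_trials, n_channels, n_samples))
deviant_epochs = rng.normal(size=(n_trials, n_channels, n_samples))

# Average across trials to obtain the ERPs, then subtract.
difference_wave = deviant_epochs.mean(axis=0) - standard_epochs.mean(axis=0)

# Mean amplitude in a 150-250 ms post-stimulus window at a fronto-central
# channel (index 10 stands in for e.g. Fz).
times = np.arange(n_samples) / sfreq
window = (times >= 0.150) & (times <= 0.250)
mmn_amplitude = difference_wave[10, window].mean()
print(f"MMN mean amplitude: {mmn_amplitude:.3f} (arbitrary units)")
```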

    Chinese Tones: Can You Listen With Your Eyes? The Influence of Visual Information on Auditory Perception of Chinese Tones

    Considering that more than half of the languages spoken in the world (60%-70%) are so-called tone languages (Yip, 2002), and that tones are notoriously difficult for Westerners to learn, this dissertation focused on the perception of Mandarin Chinese tones by tone-naïve listeners. Moreover, speech perception has been shown to be more than just an auditory phenomenon, especially when the speaker’s face is visible. The aim of this dissertation was therefore also to study the value of visual information (over and above that of acoustic information) in Mandarin tone perception for tone-naïve perceivers, in combination with contextual factors (such as speaking style) and individual factors (such as musical background). Accordingly, this dissertation assesses the relative strength of acoustic and visual information in tone perception and tone classification. In the first two empirical and exploratory studies, in Chapters 2 and 3, we set out to investigate to what extent tone-naïve perceivers are able to identify Mandarin Chinese tones in isolated words, whether they can benefit from seeing the speaker’s face, and what the contribution is of a hyperarticulated speaking style and/or their own musical experience. In Chapter 2 we investigated the effect of visual cues (comparing audio-only with audio-visual presentations) and speaking style (comparing a natural speaking style with a teaching speaking style) on the perception of Mandarin tones by tone-naïve listeners, looking both at the relative strength of these two factors and at their possible interactions; Chapter 3 was concerned with the effects of the participants’ musicality (combined with modality) on Mandarin tone perception. In both studies, a Mandarin Chinese tone identification experiment was conducted: native speakers of a non-tonal language were asked to distinguish Mandarin Chinese tones based on audio-only or audio-visual materials. To include variation, the experimental stimuli were recorded with four different speakers in imagined natural and teaching speaking scenarios. The proportion of correct responses (and average reaction times) of the participants were reported. The tone identification experiment in Chapter 2 showed that the video conditions (audio-visual natural and audio-visual teaching) resulted in overall higher accuracy in tone perception than the auditory-only conditions (audio-only natural and audio-only teaching), but no better performance was observed in the audio-visual conditions in terms of reaction time. Teaching style turned out to make no difference to the speed or accuracy of Mandarin tone perception (as compared to a natural speaking style). In Chapter 3 we presented the same experimental materials and procedure, but now with musicians and non-musicians as participants. The Goldsmiths Musical Sophistication Index (Gold-MSI) was used to assess the musical aptitude of the participants. The data showed that, overall, musicians outperformed non-musicians in the tone identification task in both auditory-visual and auditory-only conditions. Both groups identified tones more accurately in the auditory-visual conditions than in the auditory-only conditions. 
These results provided further evidence for the view that the availability of visual cues along with auditory information is useful for people who have no knowledge of Mandarin Chinese tones when they need to learn to identify them. Of all the musical skills measured by the Gold-MSI, the amount of musical training was the only predictor with an impact on the accuracy of Mandarin tone perception. These findings suggest that learning to perceive Mandarin tones benefits from musical expertise, and that visual information can facilitate Mandarin tone identification, but mainly for tone-naïve non-musicians. In addition, performance differed by tone: musicality improved accuracy for every tone, and some tones were easier to identify than others; in particular, tone 3 (a low-falling-rising tone) proved to be the easiest to identify, while tone 4 (a high-falling tone) was the most difficult for all participants. The results of the first two experiments, presented in Chapters 2 and 3, showed that adding visual cues to clear auditory information facilitated tone identification for tone-naïve perceivers (accuracy was significantly higher in the audio-visual conditions than in the auditory-only conditions). This visual facilitation was unaffected by (hyperarticulated) speaking style or by the participants’ musical skill. Moreover, variation in speakers and tones affected how accurately tone-naïve perceivers identified Mandarin tones. In Chapter 4, we compared the relative contribution of auditory and visual information during Mandarin Chinese tone perception. More specifically, we aimed to answer two questions: first, whether there is audio-visual integration at the tone level (i.e., perceptual fusion between auditory and visual information); and second, how visual information affects tone perception for native and non-native (tone-naïve) speakers. To do this, we constructed congruent (e.g., an auditory tone 1 paired with a visual tone 1, written as AxVx) and incongruent (e.g., an auditory tone 1 paired with a visual tone 2, written as AxVy) auditory-visual materials and presented them to native speakers of Mandarin Chinese and to speakers of non-tonal languages. Accuracy, defined as the percentage of correct identifications of a tone based on its auditory realization, was reported. We found that visual information did not significantly contribute to tone identification for native speakers of Mandarin Chinese: when there is a discrepancy between visual cues and acoustic information, participants (native and tone-naïve alike) tend to rely more on the auditory input than on the visual cues. Unlike the native speakers of Mandarin Chinese, tone-naïve participants were significantly influenced by visual information during auditory-visual integration, and they identified tones more accurately in congruent than in incongruent stimuli. In line with our previous work, the tone confusion matrix showed that tone identification varies across individual tones, with tone 3 (the low-dipping tone) being the easiest to identify and tone 4 (the high-falling tone) the most difficult. 
The results did not show evidence of auditory-visual integration among native participants, whereas visual information was helpful for tone-naïve participants. However, even for this group, visual information only marginally increased accuracy in the tone identification task, and this increase depended on the tone in question. Chapter 5 also zooms in on the relative strength of auditory and visual information for tone-naïve perceivers, but from the perspective of tone classification. In this chapter, we studied the acoustic and visual features of the tones produced by native speakers of Mandarin Chinese. Computational models based on acoustic features, visual features and combined acoustic-visual features were constructed to automatically classify Mandarin tones. Moreover, by studying both production and perception, this study examined what perceivers pick up (perception) from what a speaker does (production, facial expression). More specifically, this chapter set out to answer: (1) which acoustic and visual features of tones produced by native speakers can be used to automatically classify Mandarin tones; (2) whether the features used in tone production are similar to or different from the ones that have cue value for tone-naïve perceivers when they categorize tones; and (3) whether and how visual information (i.e., facial expression and facial pose) contributes to the classification of Mandarin tones over and above the information provided by the acoustic signal. To address these questions, the stimuli that had been recorded (and described in Chapter 2) and the response data that had been collected (and reported on in Chapter 3) were used. Basic acoustic and visual features were extracted and, based on them, Random Forest classification was used to identify the most important acoustic and visual features for classifying the tones (a minimal sketch of this kind of classification follows this abstract). The classifiers were trained on produced-tone classification (given a set of auditory and visual features, predict the produced tone) and on perceived/responded-tone classification (given a set of features, predict the tone as identified by the participant). The results showed that acoustic features outperformed visual features for tone classification, both for the produced and for the perceived tone. However, tone-naïve perceivers did resort to visual information in certain cases (when they gave wrong responses). So, visual information does not seem to play a significant role in native speakers’ tone production, but tone-naïve perceivers do sometimes consider visual information in their tone identification. These findings provided additional evidence that auditory information is more important than visual information in Mandarin tone perception and tone classification. Notably, visual features contributed to the participants’ erroneous performance, suggesting that visual information actually misled tone-naïve perceivers in their tone identification task. To some extent, this is consistent with our claim that visual cues do influence tone perception. In addition, the ranking of the auditory and visual features in tone perception showed that the factor perceiver (i.e., the participant) accounted for the largest amount of variance explained in the responses of our tone-naïve participants, indicating the importance of individual differences in tone perception. 
To sum up, perceivers who do not have tone in their language background tend to make use of visual cues from the speaker’s face when perceiving unknown tones (Mandarin Chinese in this dissertation), in addition to the auditory information they clearly also use. However, auditory cues remain the primary source they rely on. A consistent finding across the studies is that variation between tones, speakers and participants affects the accuracy of tone identification for tone-naïve perceivers.
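    The Random Forest analysis described in the abstract above (acoustic and visual features in, tone label out, with a feature-importance ranking) can be illustrated with the following minimal scikit-learn sketch; the feature names and the synthetic data are placeholders, not the dissertation's feature set.

```python
# Minimal sketch of Random Forest tone classification from combined
# acoustic and visual features, with a feature-importance ranking.
# Feature names and the synthetic data are illustrative placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
feature_names = ["f0_mean", "f0_slope", "duration", "intensity",   # acoustic
                 "eyebrow_raise", "head_pitch", "lip_aperture"]     # visual
X = rng.normal(size=(400, len(feature_names)))    # one row per tone token
y = rng.integers(1, 5, size=400)                  # tone labels 1-4

clf = RandomForestClassifier(n_estimators=300, random_state=0)
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())

# Fit on all data and rank features by importance.
clf.fit(X, y)
ranking = sorted(zip(clf.feature_importances_, feature_names), reverse=True)
for importance, name in ranking:
    print(f"{name:15s} {importance:.3f}")
```

    With random placeholder data the importances are of course uninformative; the point is only the shape of the pipeline (train the classifier, then inspect feature_importances_ to rank acoustic against visual cues).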

    Training Children to Perceive Non-native Lexical Tones: Tone Language Background, Bilingualism, and Auditory-Visual Information

    This study investigates the role of language background and bilingual status in the perception of foreign lexical tones. Eight groups of participants, consisting of 6- and 8-year-old children from one of four language background (tone or non-tone) × bilingual status (monolingual or bilingual) combinations (Thai monolingual, English monolingual, English-Thai bilingual, and English-Arabic bilingual), were trained to perceive the four Mandarin lexical tones. Half the children in each of these eight groups were given auditory-only (AO) training and half auditory-visual (AV) training. In each group, Mandarin tone identification was tested before and after (pre- and post-) training with both an auditory-only test (ao-test) and an auditory-visual test (av-test). The effect of training on Mandarin tone identification was minimal for 6-year-olds. On the other hand, 8-year-olds, particularly those with tone language experience, showed greater pre- to post-training improvement, and this was best indexed by ao-test trials. Bilingual vs. monolingual background did not facilitate overall improvement due to training, but it did modulate the efficacy of the training mode: for bilinguals, both AO and AV training, and especially AO, resulted in performance gains, whereas for monolinguals training was most effective with AV stimuli. Again, this effect was best indexed by ao-test trials. These results suggest that tone language experience, be it monolingual or bilingual, is a strong predictor of learning unfamiliar tones; that monolinguals learn best from AV training trials and bilinguals from AO training trials; and that there is no metalinguistic advantage due to bilingualism in learning to perceive lexical tones.

    Lexical and audiovisual bases of perceptual adaptation in speech


    Cultural determinants of perception


    Acoustic Approaches to Gender and Accent Identification

    There has been considerable research on the problems of speaker and language recognition from samples of speech. A less researched problem is that of accent recognition. Although this is a similar problem to language identification, different accents of a language exhibit more fine-grained differences between classes than languages do. This presents a tougher problem for traditional classification techniques. In this thesis, we propose and evaluate a number of techniques for gender and accent classification. These techniques are novel modifications and extensions to state-of-the-art algorithms, and they result in enhanced performance on gender and accent recognition. The first part of the thesis focuses on the problem of gender identification and presents a technique that gives improved performance in situations where training and test conditions are mismatched. The bulk of the thesis is concerned with the application of the i-Vector technique to accent identification; this is the most successful approach to acoustic classification to have emerged in recent years. We show that it is possible to achieve high-accuracy accent identification without reliance on transcriptions and without utilising phoneme recognition algorithms. The thesis describes various stages in the development of i-Vector-based accent classification that improve on the standard approaches usually applied for speaker or language identification, which are insufficient here. We demonstrate that very good accent identification performance is possible with acoustic methods by considering different i-Vector projections, front-end parameters, i-Vector configuration parameters, and an optimised fusion of the resulting i-Vector classifiers obtainable from the same data. We claim to have achieved the best accent identification performance on the test corpus for acoustic methods, with an identification rate of up to 90%. This performance is even better than that of previously reported acoustic-phonotactic systems on the same corpus, and is very close to the performance obtained via transcription-based accent identification. Finally, we demonstrate that utilising our techniques for speech recognition purposes leads to considerably lower word error rates. Keywords: Accent Identification, Gender Identification, Speaker Identification, Gaussian Mixture Model, Support Vector Machine, i-Vector, Factor Analysis, Feature Extraction, British English, Prosody, Speech Recognition
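    The abstract describes an i-Vector back end built from projections, classifiers and a fusion of their scores. The sketch below illustrates one such back end under stated assumptions (an LDA projection, linear and RBF SVMs, and posterior averaging as the fusion rule); the i-vectors are random placeholders, since the front end (UBM and total-variability training) is outside the scope of this sketch.

```python
# Minimal sketch of an i-vector back end for accent identification:
# project i-vectors with LDA, classify with SVMs, and fuse the scores of
# two classifiers by averaging their class posteriors. The i-vectors here
# are random placeholders, not real extractions.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_utts, ivec_dim, n_accents = 600, 400, 14
X = rng.normal(size=(n_utts, ivec_dim))          # placeholder i-vectors
y = rng.integers(0, n_accents, size=n_utts)      # accent labels

split = 500
clf_a = make_pipeline(StandardScaler(), LinearDiscriminantAnalysis(),
                      SVC(kernel="linear", probability=True))
clf_b = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))

for clf in (clf_a, clf_b):
    clf.fit(X[:split], y[:split])

# Score-level fusion: average the class posteriors of both classifiers.
fused = (clf_a.predict_proba(X[split:]) + clf_b.predict_proba(X[split:])) / 2
pred = clf_a.classes_[fused.argmax(axis=1)]
print(f"Fused identification rate: {(pred == y[split:]).mean():.2%}")
```

    In practice the fusion would be weighted and the weights tuned on held-out data rather than fixed at equal averaging as in this sketch.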

    Language Factors Modulate Audiovisual Speech Perception. A Developmental Perspective

    [eng] In most natural situations, adults look at the eyes of faces in search of social information (Yarbus, 1967). However, when the auditory information becomes unclear (e.g. speech-in-noise), they switch their attention towards the mouth of a talking face and rely on the redundant audiovisual cues to help them process the speech signal (Barenholtz, Mavica, & Lewkowicz, 2016; Buchan, Paré, & Munhall, 2007; Lansing & McConkie, 2003; Vatikiotis-Bateson, Eigsti, Yano, & Munhall, 1998). Likewise, young infants are sensitive to the correspondence between acoustic and visual speech (Bahrick & Lickliter, 2012), and they also rely on the talker’s mouth during the second half of the first year of life, putatively to help them acquire language by the time they start babbling (Lewkowicz & Hansen-Tift, 2012), and to aid language differentiation in the case of bilingual infants (Pons, Bosch & Lewkowicz, 2015). The current set of studies provides a detailed examination of the contribution of audiovisual (AV) speech cues to speech processing at different stages of language development, through the analysis of selective attention patterns when processing speech from talking faces (a rough sketch of one way such looking-time patterns can be quantified follows this abstract). To do so, I compared different linguistic experience factors (i.e. types of bilingualism – distance between the bilinguals’ two languages –, language familiarity and language proficiency) that modulate audiovisual speech perception in first language acquisition during infancy (Studies 1 and 2), early childhood (Studies 3 and 4), and in second language (L2) learning during adulthood (Studies 5, 6 and 7). The findings of the present work demonstrate that (1) perceiving speech audiovisually hampers close-language bilingual infants’ ability to discriminate their languages, that (2) 15-month-old and 5-year-old close-language bilinguals rely more on the mouth cues of a talking face than do their distant-language bilingual peers, that (3) children’s attention to the mouth follows a clear temporal pattern: it is maximal at the beginning of the presentation and diminishes gradually as speech continues, and that (4) adults also rely more on the mouth speech cues when they perceive fluent non-native vs. native speech, regardless of their L2 expertise. All in all, these studies shed new light on the field of audiovisual speech perception and language processing by showing that selective attention to a talker’s eyes and mouth is a dynamic, information-seeking process, which is largely modulated by perceivers’ early linguistic experience and by task demands. These results suggest that selectively attending to the redundant speech cues of a talker’s mouth at the adequate moment enhances speech perception and is crucial for normal language development and speech processing, not only in infancy – during first language acquisition – but also at more advanced language stages in childhood, as well as in L2 learning during adulthood. Ultimately, they confirm that mouth reliance is greater in close bilingual environments, where the presence of two related languages increases the necessity for disambiguation and for keeping the language systems separate.
    [cat] Selectively attending to a talker’s mouth helps us benefit from audiovisual information and better process the speech signal when the auditory signal becomes unclear. Likewise, infants also attend to the mouth during the second half of the first year of life, which helps them in the acquisition of their language(s). This thesis examines the contribution of the audiovisual signal to speech processing, through analyses of selective attention to a talking face. Different linguistic factors (types of bilingualism, language familiarity and language proficiency) that modulate the audiovisual perception of speech are compared in language acquisition during early infancy (Studies 1 and 2), in school-age children (Studies 3 and 4), and in second language learning during adulthood (Studies 5, 6 and 7). The results show that (1) audiovisual speech perception hinders the ability of bilingual infants to discriminate their close languages, that (2) 15-month-old and 5-year-old close-language bilinguals pay more attention to the audiovisual cues of the mouth than distant-language bilinguals, that (3) children’s attention to the talker’s mouth is maximal at the beginning and decreases gradually as speech continues, and that (4) adults also rely more on the audiovisual cues of the mouth when they perceive a non-native language (L2), regardless of their proficiency in it. These studies show that selective attention to a talker’s face is a dynamic, information-seeking process, modulated by early linguistic experience and by the demands of communicative situations. These results suggest that attending to the audiovisual cues of the mouth at the right moments is crucial for normal language development, both during early infancy and at more advanced stages of language, as well as in the learning of second languages. Finally, these results confirm that the strategy of relying on the audiovisual cues of the mouth is used to a greater extent in close bilingual environments, where the presence of two related languages increases the need for disambiguation.
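    One common way to quantify the looking-time patterns mentioned in the abstract above is the proportion of total looking time (PTLT) to the mouth, computed per time bin of a trial, which exposes the temporal pattern of mouth attention. The sketch below uses hypothetical fixation data and column names; it is not the thesis's analysis pipeline.

```python
# Minimal sketch: proportion of total looking time (PTLT) to the mouth per
# 1-second bin of the trial. Column names and data are hypothetical.
import pandas as pd

fixations = pd.DataFrame({
    "time_s":   [0.2, 0.8, 1.4, 2.1, 2.9, 3.5],   # fixation onset (s)
    "duration": [0.3, 0.4, 0.5, 0.3, 0.4, 0.2],   # fixation duration (s)
    "aoi":      ["mouth", "mouth", "eyes", "mouth", "eyes", "eyes"],
})

# Assign each fixation to a 1-second bin, sum looking time per AOI per bin,
# then divide mouth looking time by total looking time in that bin.
fixations["bin"] = fixations["time_s"].floordiv(1).astype(int)
looking = (fixations.groupby(["bin", "aoi"])["duration"]
                    .sum()
                    .unstack(fill_value=0))
ptlt_mouth = looking["mouth"] / looking.sum(axis=1)
print(ptlt_mouth)
```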

    Visual Speech Recognition

    In recent years, visual speech recognition has received more attention from researchers than in the past. Because of the lack of visual processing work on the recognition of Arabic vocabulary, we began research in this field. Audio speech recognition is concerned with the acoustic characteristics of the signal, but there are many situations in which the audio signal is weak or absent; this point is addressed in Chapter 2. The visual recognition process focuses on features extracted from video of the speaker, and these features are then classified using several techniques. The most important feature to be extracted is motion. By segmenting the motion of the speaker's lips, an algorithm can manipulate it in such a way as to recognize the word that was said. However, motion segmentation is not the only problem facing the speech recognition process: segmenting the lips themselves is an earlier step, so before segmenting lip motion the lips must be segmented first, and a new approach for lip segmentation is proposed in this thesis. Sometimes the motion feature needs another feature to support recognition of the spoken word, so this thesis also proposes a new algorithm that performs motion segmentation using the Abstract Difference Image computed from an image series, supported by correlation for registering the images in the series, to recognize ten Arabic words: the numbers from “one” to “ten”. The algorithm uses the Hu invariant moments as features to describe the Abstract Difference Image, and uses three different recognition methods to recognize the words. The CLAHE method is used by our algorithm as a filtering technique to handle lighting problems. Our algorithm, which is based on extracting difference details from a series of images to recognize the word, achieved an overall result of 55.8%, an adequate result for integration into an audio-visual system.
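    A rough sketch of the described pipeline (CLAHE filtering, an accumulated absolute-difference image over the grayscale frame sequence, and Hu invariant moments as the descriptor) is given below using OpenCV; the video path and parameter values are placeholders, and the correlation-based registration step is omitted for brevity.

```python
# Minimal sketch of the described pipeline: CLAHE filtering, an accumulated
# absolute-difference image over a grayscale frame sequence, and Hu moments
# as the feature vector. File path and parameters are placeholders, and the
# correlation-based image registration step is omitted.
import cv2
import numpy as np

clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))

def difference_image_features(video_path):
    cap = cv2.VideoCapture(video_path)
    prev, acc = None, None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = clahe.apply(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
        if prev is not None:
            diff = cv2.absdiff(gray, prev).astype(np.float32)
            acc = diff if acc is None else acc + diff
        prev = gray
    cap.release()
    # Hu invariant moments of the accumulated difference image summarise the
    # overall motion pattern of the utterance.
    hu = cv2.HuMoments(cv2.moments(acc)).flatten()
    # Log-scale the moments, as is conventional, to compress their range.
    return -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)

features = difference_image_features("word_one.avi")  # placeholder path
print(features)
```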