
    Effect of being seen on the production of visible speech cues. A pilot study on Lombard speech

    Speech produced in noise (or Lombard speech) is characterized by increased vocal effort, but also by amplified lip gestures. The current study examines whether this enhancement of visible speech cues may be sought by the speaker, even unconsciously, in order to improve his visual intelligibility. One subject played an interactive game in a quiet situation and then in 85 dB of cocktail-party noise, for three conditions of interaction: without interaction, in face-to-face interaction, and in a situation of audio interaction only. The audio signal was recorded simultaneously with articulatory movements, using 3D electromagnetic articulography. The results showed that acoustic modifications of speech in noise were greater when the interlocutor could not see the speaker. Furthermore, tongue movements that are hardly visible were not particularly amplified in noise. Lip movements that are very visible were not more enhanced in noise when the interlocutors could see each other. In fact, they were more enhanced in the situation of audio interaction only. These results support the idea that this speaker did not make use of the visual channel to improve his intelligibility, and that his hyperarticulation was just an indirect correlate of increased vocal effort.

    The listening talker: A review of human and algorithmic context-induced modifications of speech

    Speech output technology is finding widespread application, including in scenarios where intelligibility might be compromised - at least for some listeners - by adverse conditions. Unlike most current algorithms, talkers continually adapt their speech patterns as a response to the immediate context of spoken communication, where the type of interlocutor and the environment are the dominant situational factors influencing speech production. Observations of talker behaviour can motivate the design of more robust speech output algorithms. Starting with a listener-oriented categorisation of possible goals for speech modification, this review article summarises the extensive set of behavioural findings related to human speech modification, identifies which factors appear to be beneficial, and goes on to examine previous computational attempts to improve intelligibility in noise. The review concludes by tabulating 46 speech modifications, many of which have yet to be perceptually or algorithmically evaluated. Consequently, the review provides a roadmap for future work in improving the robustness of speech output.

    Audio-Visual Speech Enhancement Based on Deep Learning


    The Lombard intelligibility benefit of native and non-native speech for native and non-native listeners

    Speech produced in noise (Lombard speech) is more intelligible than speech produced in quiet (plain speech). Previous research on the Lombard intelligibility benefit focused almost entirely on how native speakers produce and perceive Lombard speech. In this study, we investigate the size of the Lombard intelligibility benefit of both native (American-English) and non-native (native Dutch) English for native and non-native listeners (Dutch and Spanish). We used a glimpsing metric to measure the energetic masking potential of speech, which predicted that both native and non-native Lombard speech could withstand greater amounts of masking to a similar extent, compared to plain speech. In an intelligibility experiment, native English, Spanish, and Dutch listeners listened to the same words, mixed with noise. While the non-native listeners appeared to benefit more from Lombard speech than the native listeners did, each listener group experienced a similar benefit for native and non-native Lombard speech. Energetic masking, as captured by the glimpsing metric, only accounted for part of the Lombard benefit, indicating that the Lombard intelligibility benefit does not only result from a shift in spectral distribution. Despite subtle native language influences on non-native Lombard speech, both native and non-native speech provide a Lombard benefit.

    Acoustic and visual adaptations in speech produced to counter adverse listening conditions

    This study investigated whether communication modality affects talkers’ speech adaptation to an interlocutor exposed to background noise. It was predicted that adaptations to lip gestures would be greater and acoustic ones reduced when communicating face-to-face. We video recorded 14 Australian-English talkers (Talker A) speaking in a face-to-face or auditory only setting with their interlocutors who were either in quiet or noise. Focusing on keyword productions, acoustic-phonetic adaptations were examined via measures of vowel intensity, pitch, keyword duration, vowel F1/F2 space and VOT, and visual adaptations via measures of vowel interlip area. The interlocutor's adverse listening conditions led Talker A to reduce speech rate, increase pitch and expand vowel space. These adaptations were not significantly reduced in the face-to-face setting although there was a trend for a smaller degree of vowel space expansion than in the auditory only setting. Visible lip gestures were more enhanced overall in the face-to-face setting, but also increased in the auditory only setting when countering the effects of noise. This study therefore showed only small effects of communication modality on speech adaptations.

    Visual Speech Enhancement and its Application in Speech Perception Training

    This thesis investigates methods for visual speech enhancement to support auditory and audiovisual speech perception. Normal-hearing non-native listeners receiving cochlear implant (CI) simulated speech are used as ‘proxy’ listeners for CI users, a proposed user group who could benefit from such enhancement methods in speech perception training. Both CI users and non-native listeners share similarities with regard to audiovisual speech perception, including increased sensitivity to visual speech cues. Two enhancement methods are proposed: (i) an appearance based method, which modifies the appearance of a talker’s lips using colour and luminance blending to apply a ‘lipstick effect’ to increase the saliency of mouth shapes; and (ii) a kinematics based method, which amplifies the kinematics of the talker’s mouth to create the effect of more pronounced speech (an ‘exaggeration effect’). The application that is used to test the enhancements is speech perception training, or audiovisual training, which can be used to improve listening skills. An audiovisual training framework is presented which structures the evaluation of the effectiveness of these methods. It is used in two studies. The first study, which evaluates the effectiveness of the lipstick effect, found a significant improvement in audiovisual and auditory perception. The second study, which evaluates the effectiveness of the exaggeration effect, found improvement in the audiovisual perception of a number of phoneme classes; no evidence was found of improvements in the subsequent auditory perception, as audiovisual recalibration to visually exaggerated speech may have impeded learning when used in the audiovisual training. The thesis also investigates an example of kinematics based enhancement which is observed in Lombard speech, by studying the behaviour of visual Lombard phonemes in different contexts. Due to the lack of suitable datasets for this analysis, the thesis presents a novel audiovisual Lombard speech dataset recorded under high SNR, which offers two fixed head-pose, synchronised views of each talker in the dataset.

    The influence of channel and source degradations on intelligibility and physiological measurements of effort

    Despite the fact that everyday listening is compromised by acoustic degradations, individuals show a remarkable ability to understand degraded speech. However, recent trends in speech perception research emphasise the cognitive load imposed by degraded speech on both normal-hearing and hearing-impaired listeners. The perception of degraded speech is often studied through channel degradations such as background noise. However, source degradations determined by talkers’ acoustic-phonetic characteristics have been studied to a lesser extent, especially in the context of listening effort models. Similarly, little attention has been given to speaking effort, i.e., effort experienced by talkers when producing speech under channel degradations. This thesis aims to provide a holistic understanding of communication effort, i.e., taking into account both listener and talker factors. Three pupillometry studies are presented. In the first study, speech was recorded for 16 Southern British English speakers and presented to normal-hearing listeners in quiet and in combination with three degradations: noise-vocoding, masking and time-compression. Results showed that acoustic-phonetic talker characteristics predicted intelligibility of degraded speech, but not listening effort, as likely indexed by pupil dilation. In the second study, older hearing-impaired listeners were presented fast time-compressed speech under simulated room acoustics. Intelligibility was kept at high levels. Results showed that both fast speech and reverberant speech were associated with higher listening effort, as suggested by pupillometry. Discrepancies between pupillometry and perceived effort ratings suggest that both methods should be employed in speech perception research to pinpoint processing effort. While findings from the first two studies support models of degraded speech perception, emphasising the relevance of source degradations, they also have methodological implications for pupillometry paradigms. In the third study, pupillometry was combined with a speech production task, aiming to establish an equivalent to listening effort for talkers: speaking effort. Normal-hearing participants were asked to read and produce speech in quiet or in the presence of different types of masking: stationary and modulated speech-shaped noise, and competing-talker masking. Results indicated that while talkers acoustically enhance their speech more under stationary masking, larger pupil dilation associated with competing-speaker masking reflected higher speaking effort. Results from all three studies are discussed in conjunction with models of degraded speech perception and production. Listening effort models are revisited to incorporate pupillometry results from speech production paradigms. Given the new approach of investigating source factors using pupillometry, methodological issues are discussed as well. The main insight provided by this thesis, i.e., the feasibility of applying pupillometry to situations involving listener and talker factors, is suggested to guide future research employing naturalistic conversations.

    The tongue and lips in Lombard speech: A pilot study of vowel-space expansion

    We investigate some ways in which speech production alters to make speech sounds more intelligible to a listener. This single speaker pilot study uses ultrasound tongue imaging and videos of lips to investigate the underlying articulatory processes used to distinguish six different monophthongal vowels in Scottish English in a consistent b__p frame. Lombard speech was elicited in an interactive feedback task with a neutral condition and a condition where the listener's hearing was masked by speech babble. As a baseline, the acoustic formant space was measured, which showed Lombard effects of F1 lowering for all vowels except /i/ and an increase in intensity. In articulation, we found that in the low and back vowel targets, the hyper-articulated version has extra lowering. However, for high front vowels /i/ and /e/, the hyper-articulated version has slight blade lowering and dorsal retraction in association with raising into the palate. The vowel // has very little change, but seems to fit in the high front set. Lip protrusion and spreading are enhanced, appropriately. Despite the frame being identical in each word, qualitatively the speaker enhanced the /b/ but not the /p/, supporting models in which a CV unit is planned holistically in speech production.

    The contribution of the visual modality to speech intelligibility in native and non-native speakers of English

    Since the pandemic, masks have hindered speech intelligibility by obscuring visual cues, especially in foreign-accented speech. This study compares the effectiveness of visual cues provided by French L1 non-native English speakers and a native English speaker. It investigates vowels, using audio and audio-visual stimuli and native English perceivers (n = 24). Results show significant audio-visual benefits for both groups. However, the visual benefit varies across vowel feature and speaker group, with a negative impact observed for the French speaker group for central vowels /ʌ/ and /ə/. These findings highlight the influence of language-specific gestures on L2 production, including French lip-rounding.