
    The influence of channel and source degradations on intelligibility and physiological measurements of effort

    Although everyday listening is compromised by acoustic degradations, individuals show a remarkable ability to understand degraded speech. However, recent trends in speech perception research emphasise the cognitive load imposed by degraded speech on both normal-hearing and hearing-impaired listeners. The perception of degraded speech is often studied through channel degradations such as background noise, whereas source degradations determined by talkers' acoustic-phonetic characteristics have been studied to a lesser extent, especially in the context of listening effort models. Similarly, little attention has been given to speaking effort, i.e., the effort experienced by talkers when producing speech under channel degradations. This thesis aims to provide a holistic understanding of communication effort, i.e., one taking into account both listener and talker factors. Three pupillometry studies are presented. In the first study, speech was recorded from 16 Southern British English speakers and presented to normal-hearing listeners in quiet and in combination with three degradations: noise-vocoding, masking, and time-compression. Results showed that acoustic-phonetic talker characteristics predicted the intelligibility of degraded speech, but not listening effort, as indexed by pupil dilation. In the second study, older hearing-impaired listeners were presented with fast time-compressed speech under simulated room acoustics. Intelligibility was kept at high levels. Results showed that both fast speech and reverberant speech were associated with higher listening effort, as suggested by pupillometry. Discrepancies between pupillometry and perceived effort ratings suggest that both methods should be employed in speech perception research to pinpoint processing effort. While findings from the first two studies support models of degraded speech perception, emphasising the relevance of source degradations, they also have methodological implications for pupillometry paradigms. In the third study, pupillometry was combined with a speech production task, aiming to establish an equivalent to listening effort for talkers: speaking effort. Normal-hearing participants were asked to read and produce speech in quiet or in the presence of different types of masking: stationary and modulated speech-shaped noise, and competing-talker masking. Results indicated that while talkers acoustically enhance their speech more under stationary masking, the larger pupil dilation associated with competing-talker masking reflected higher speaking effort. Results from all three studies are discussed in conjunction with models of degraded speech perception and production. Listening effort models are revisited to incorporate pupillometry results from speech production paradigms. Given the new approach of investigating source factors using pupillometry, methodological issues are discussed as well. The main insight provided by this thesis, i.e., the feasibility of applying pupillometry to situations involving both listener and talker factors, is suggested to guide future research employing naturalistic conversations.
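
    As background for the measure used throughout this thesis: task-evoked pupil dilation is commonly quantified by subtracting a pre-stimulus baseline and summarising the post-onset curve. The sketch below illustrates that general idea; the function name, window lengths, and summary statistics are illustrative assumptions, not the thesis's actual analysis pipeline.

```python
import numpy as np

def pupil_dilation_metrics(trace, fs, stim_onset_s, baseline_s=1.0):
    """Baseline-correct a pupil trace and summarise task-evoked dilation.

    trace        : 1-D array of pupil diameter samples (mm or arbitrary units)
    fs           : eye-tracker sampling rate in Hz
    stim_onset_s : stimulus onset relative to the start of the trace, in s
    baseline_s   : length of the pre-stimulus baseline window, in s
    """
    onset = int(stim_onset_s * fs)
    baseline = trace[max(0, onset - int(baseline_s * fs)):onset]
    corrected = trace - np.nanmean(baseline)   # subtractive baseline correction
    post = corrected[onset:]                   # post-onset dilation curve
    return {
        "peak_dilation": np.nanmax(post),
        "mean_dilation": np.nanmean(post),
        "peak_latency_s": float(np.nanargmax(post)) / fs,
    }
```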

    The listening talker: A review of human and algorithmic context-induced modifications of speech

    Speech output technology is finding widespread application, including in scenarios where intelligibility might be compromised - at least for some listeners - by adverse conditions. Unlike most current algorithms, talkers continually adapt their speech patterns in response to the immediate context of spoken communication, where the type of interlocutor and the environment are the dominant situational factors influencing speech production. Observations of talker behaviour can motivate the design of more robust speech output algorithms. Starting with a listener-oriented categorisation of possible goals for speech modification, this review article summarises the extensive set of behavioural findings related to human speech modification, identifies which factors appear to be beneficial, and goes on to examine previous computational attempts to improve intelligibility in noise. The review concludes by tabulating 46 speech modifications, many of which have yet to be perceptually or algorithmically evaluated. Consequently, the review provides a roadmap for future work in improving the robustness of speech output.
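
    To make the notion of a speech modification concrete, here is a minimal sketch of one member of the reviewed family: a spectral-tilt (pre-emphasis) change applied at fixed overall level, so that any intelligibility gain cannot be a simple loudness gain. The coefficient and function name are illustrative assumptions, not a specific modification evaluated in the review.

```python
import numpy as np

def preemphasis_fixed_rms(x, alpha=0.7):
    """Flatten spectral tilt without raising overall level.

    A crude stand-in for Lombard-like spectral-tilt modifications:
    first-order pre-emphasis y[n] = x[n] - alpha * x[n-1], rescaled to
    the input RMS so modified and unmodified speech are energy-matched
    when compared in noise.
    """
    y = np.append(x[0], x[1:] - alpha * x[:-1])   # first-order pre-emphasis
    rms_in = np.sqrt(np.mean(x ** 2))
    rms_out = np.sqrt(np.mean(y ** 2))
    return y * (rms_in / (rms_out + 1e-12))       # restore original RMS
```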

    Talker identification is not improved by lexical access in the absence of familiar phonology

    Listeners identify talkers more accurately when they are familiar with both the sounds and words of the language being spoken. It is unknown whether lexical information alone can facilitate talker identification in the absence of familiar phonology. To dissociate the roles of familiar words and phonology, we developed English-Mandarin "hybrid" sentences, spoken in Mandarin, which can be convincingly coerced to sound like English when presented with corresponding subtitles (e.g., "wei4 gou3 chi1 kao3 li2 zhi1" becomes "we go to college"). Across two experiments, listeners learned to identify talkers in three conditions: the listeners' native language (English), an unfamiliar foreign language (Mandarin), and a foreign language paired with subtitles that primed native-language lexical access (subtitled Mandarin). In Experiment 1, listeners underwent a single session of talker identity training; in Experiment 2, listeners completed three days of training. Talkers in a foreign language were identified no better when native-language lexical representations were primed (subtitled Mandarin) than from foreign-language speech alone, regardless of whether listeners had received one or three days of talker identity training. These results suggest that the facilitatory effect of lexical access on talker identification depends on the availability of familiar phonological forms.

    Blind estimation of reverberation time in classrooms and hospital wards

    This paper investigates blind Reverberation Time (RT) estimation in occupied classrooms and hospital wards. Measurements are usually made while these spaces are unoccupied for logistical reasons; however, occupancy can have a significant impact on the rate of reverberant decay. Recent work has developed a Maximum Likelihood Estimation (MLE) method which utilises only passively recorded speech and music signals, enabling measurements to be made while the room is in use. In this paper the MLE method is applied to recordings made in classrooms during lessons. Classroom occupancy levels differ for each lesson; therefore, a model is developed using blind estimates to predict the RT for any occupancy level to within ±0.07 s for the mid-frequency octave bands. The model is also able to predict the effective room and per-person absorption area. Ambient sound recordings were also carried out in a number of rooms in two hospitals over the course of a week. Hospital measurements are more challenging, as free reverberant decays occur more rarely than in schools and the acoustic conditions may be non-stationary. However, by recording over a period of a week, estimates can be obtained to within ±0.07 s. These estimates are representative of the times when the room contains the highest acoustic absorption, in other words, when curtains are drawn, many visitors are present, or a window is open.
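
    For intuition about how an RT can be estimated from a detected free decay, the sketch below uses a simplified linear-regression stand-in for the paper's MLE estimator: fit a line to the frame-wise log energy of the decay and extrapolate to -60 dB. The names and the fit range are assumptions, not the paper's implementation.

```python
import numpy as np

def rt60_from_decay(energy_db, frame_rate, fit_range=(-5.0, -25.0)):
    """Estimate RT60 from a single free-decay segment by linear regression.

    energy_db  : frame-wise energy of the decay, in dB re. its maximum
    frame_rate : frames per second
    fit_range  : portion of the decay curve used for the line fit
    """
    t = np.arange(len(energy_db)) / frame_rate
    mask = (energy_db <= fit_range[0]) & (energy_db >= fit_range[1])
    slope, _ = np.polyfit(t[mask], energy_db[mask], 1)  # dB per second
    return -60.0 / slope          # time for the level to fall by 60 dB
```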

    Exploiting correlogram structure for robust speech recognition with multiple speech sources

    This paper addresses the problem of separating and recognising speech in a monaural acoustic mixture in the presence of competing speech sources. The proposed system treats sound source separation and speech recognition as tightly coupled processes. In the first stage, sound source separation is performed in the correlogram domain. For periodic sounds, the correlogram exhibits symmetric tree-like structures whose stems are located at delays corresponding to multiples of the pitch period. These pitch-related structures are exploited in the study to group spectral components at each time frame. Local pitch estimates are then computed for each spectral group and are used to form simultaneous pitch tracks for temporal integration. These processes segregate a spectral representation of the acoustic mixture into several time-frequency regions such that the energy in each region is likely to have originated from a single periodic sound source. The identified time-frequency regions, together with the spectral representation, are passed to a 'speech fragment decoder', which uses 'missing data' techniques with clean speech models to simultaneously search for the acoustic evidence that best matches model sequences. The paper presents evaluations based on artificially mixed simultaneous speech utterances. A coherence-measuring experiment is first reported, which quantifies the consistency of the identified fragments with a single source. The system is then evaluated in a speech recognition task and compared to a conventional fragment generation approach. Results show that the proposed system produces more coherent fragments across different conditions, which results in significantly better recognition accuracy.
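
    A minimal sketch of the correlogram itself, assuming the mixture has already been passed through an auditory (e.g., gammatone) filterbank: a short-time autocorrelation is computed in every frequency channel, and peaks at common lags across channels mark the pitch periods on which the tree-like structures are built. Parameter choices and names are illustrative, not the paper's implementation.

```python
import numpy as np

def correlogram(channels, frame_len, hop, max_lag):
    """Short-time autocorrelation in each auditory channel.

    channels : (n_channels, n_samples) filterbank output for the mixture
    Returns  : (n_frames, n_channels, max_lag) array; summing over the
               channel axis gives a summary correlogram whose peaks fall
               at multiples of each source's pitch period.
    """
    n_ch, n = channels.shape
    starts = np.arange(0, n - frame_len - max_lag, hop)
    acg = np.zeros((len(starts), n_ch, max_lag))
    for f, s in enumerate(starts):
        frame = channels[:, s:s + frame_len]
        for lag in range(max_lag):
            shifted = channels[:, s + lag:s + lag + frame_len]
            acg[f, :, lag] = np.sum(frame * shifted, axis=1)
    return acg
```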

    Masking release due to linguistic and phonetic dissimilarity between the target and masker speech

    Purpose: To investigate masking release for speech maskers for linguistically and phonetically close (English and Dutch) and distant (English and Mandarin) language pairs. Method: Thirty-two monolingual speakers of English with normal audiometric thresholds participated in the study. Data are reported for an English sentence recognition task with English, Dutch, and Mandarin competing speech maskers (Experiment 1), and with noise maskers (Experiment 2) matched either to the long-term average speech spectra or to the temporal modulations of the speech maskers from Experiment 1. Results: Listener performance increased as the target-to-masker linguistic distance increased (English-in-English < English-in-Dutch < English-in-Mandarin). Conclusion: Spectral differences between maskers can account for some, but not all, of the variation in performance between maskers; however, temporal differences did not seem to play a significant role.
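
    The spectrally matched noise maskers of Experiment 2 can be approximated as follows: white noise is shaped by the long-term magnitude spectrum of a speech masker and matched in level. This is a minimal sketch under those assumptions, not the authors' stimulus-generation code.

```python
import numpy as np

def speech_shaped_noise(masker, seed=None):
    """Noise with the long-term average spectrum of a speech masker.

    White noise is filtered by the magnitude spectrum of the speech
    signal and then scaled to the same RMS, yielding a spectrally
    matched but temporally stationary masker.
    """
    rng = np.random.default_rng(seed)
    spectrum = np.abs(np.fft.rfft(masker))
    noise = rng.standard_normal(len(masker))
    shaped = np.fft.irfft(np.fft.rfft(noise) * spectrum, n=len(masker))
    return shaped * np.sqrt(np.mean(masker ** 2) / np.mean(shaped ** 2))
```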

    Temporal contrast effects in human speech perception are immune to selective attention

    Two fundamental properties of perception are selective attention and perceptual contrast, but how these two processes interact remains unknown. Does an attended stimulus history exert a larger contrastive influence on the perception of a following target than unattended stimuli? Dutch listeners categorized target sounds with a reduced prefix "ge-" marking tense (e.g., ambiguous between gegaan-gaan "gone-go"). In the 'single talker' Experiments 1-2, participants perceived the reduced syllable (reporting gegaan) when the target was heard after a fast sentence, but not after a slow sentence (reporting gaan). In the 'selective attention' Experiments 3-5, participants listened to two simultaneous sentences from two different talkers, followed by the same target sounds, with instructions to attend to only one of the two talkers. Critically, the speech rates of attended and unattended talkers were found to influence target perception equally - even when participants could watch the attended talker speak. In fact, target perception in the 'selective attention' Experiments 3-5 did not differ from that of participants who were explicitly instructed to divide their attention equally across the two talkers (Experiment 6). This suggests that contrast effects of speech rate are immune to selective attention, operating largely prior to attentional stream segregation in the auditory processing hierarchy.
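
    The rate manipulation central to these experiments can be sketched as pitch-preserving time stretching. The snippet below uses librosa's phase-vocoder stretch as a stand-in; the abstract does not state which algorithm the experiments actually used, and the file name is hypothetical.

```python
import librosa

def rate_manipulated_context(path, rate):
    """Time-stretch a context sentence without changing its pitch.

    rate > 1 speeds the sentence up (fast context); rate < 1 slows it
    down. Uses librosa's phase-vocoder time stretch.
    """
    y, sr = librosa.load(path, sr=None)
    return librosa.effects.time_stretch(y, rate=rate), sr

# e.g. a fast and a slow version of the same context sentence
# fast, sr = rate_manipulated_context("context.wav", rate=1.5)
# slow, sr = rate_manipulated_context("context.wav", rate=0.67)
```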

    Spatial Release From Masking in Children: Effects of Simulated Unilateral Hearing Loss

    The purpose of this study was twofold: 1) to determine the effect of an acute simulated unilateral hearing loss on children's spatial release from masking in two-talker speech and speech-shaped noise, and 2) to develop a procedure to be used in future studies that will assess spatial release from masking in children who have permanent unilateral hearing loss. There were three main predictions. First, spatial release from masking was expected to be larger in two-talker speech than in speech-shaped noise. Second, simulated unilateral hearing loss was expected to worsen performance in all listening conditions, but particularly in the spatially separated two-talker speech masker. Third, spatial release from masking was expected to be smaller for children than for adults in the two-talker masker.
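
    Spatial release from masking itself is a simple difference measure, computed from speech reception thresholds (SRTs) in co-located and spatially separated conditions, as in the minimal sketch below; the function and variable names are illustrative.

```python
def spatial_release_from_masking(srt_colocated_db, srt_separated_db):
    """Spatial release from masking (SRM) in dB.

    SRM is the improvement in speech reception threshold (SRT) when the
    masker is moved away from the target; larger positive values mean
    more spatial benefit.
    """
    return srt_colocated_db - srt_separated_db

# e.g. an SRT of -2 dB co-located vs -8 dB separated -> 6 dB of release
print(spatial_release_from_masking(-2.0, -8.0))  # 6.0
```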