The listening talker: A review of human and algorithmic context-induced modifications of speech
Speech output technology is finding widespread application, including in scenarios where intelligibility may be compromised, at least for some listeners, by adverse conditions. Unlike most current algorithms, talkers continually adapt their speech patterns in response to the immediate context of spoken communication, in which the type of interlocutor and the environment are the dominant situational factors influencing speech production. Observations of talker behaviour can motivate the design of more robust speech output algorithms. Starting with a listener-oriented categorisation of possible goals for speech modification, this review article summarises the extensive set of behavioural findings on human speech modification, identifies which factors appear to be beneficial, and goes on to examine previous computational attempts to improve intelligibility in noise. The review concludes by tabulating 46 speech modifications, many of which have yet to be perceptually or algorithmically evaluated. The review thus provides a roadmap for future work on improving the robustness of speech output.
Developmental and cultural factors of audiovisual speech perception in noise
The aim of this project is two-fold: 1) to investigate developmental differences in intelligibility gains from visual cues in speech perception in noise, and 2) to examine how different types of maskers modulate visual enhancement across age groups. A secondary aim is to investigate whether bilingualism differentially modulates audiovisual integration during speech-in-noise tasks. To that end, child and adult, monolingual and bilingual participants completed speech-perception-in-noise tasks with three within-subject variables: (1) masker type: pink noise or two-talker babble; (2) modality: audio-only (AO) or audiovisual (AV); and (3) signal-to-noise ratio (SNR): 0, -4, -8, -12, and -16 dB. The findings revealed that, although both children and adults benefited from visual cues in speech-in-noise tasks, adults showed greater benefit at lower SNRs. Moreover, although monolingual and bilingual children performed comparably across all conditions, monolingual adults outperformed simultaneous bilingual adults. These results may indicate that the divergent use of visual cues in speech perception between bilingual and monolingual speakers emerges later in development.
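The SNR manipulation above amounts to a stimulus-mixing step: the masker is rescaled so that the speech-to-masker level difference equals the target SNR before the two are summed. A minimal sketch of that step is shown below; the helper name and its details are illustrative assumptions, since the abstract does not describe the actual stimulus pipeline.

```python
import numpy as np

def mix_at_snr(speech, masker, snr_db):
    """Scale the masker so the speech-to-masker ratio equals snr_db, then mix.
    Hypothetical helper; illustrates the SNR conditions (0 to -16 dB), not the
    study's actual stimulus-generation code."""
    rms = lambda x: np.sqrt(np.mean(x ** 2))
    # Required masker RMS for the target SNR: snr_db = 20*log10(rms_speech / rms_masker)
    target_masker_rms = rms(speech) / (10 ** (snr_db / 20))
    scaled_masker = masker * (target_masker_rms / rms(masker))
    return speech + scaled_masker
```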
Navigating the bilingual cocktail party: Interference from background speakers in listeners with varying L1/L2 proficiency
Cocktail party environments require listeners to tune in to a target voice while ignoring surrounding speakers (maskers), which can present unique challenges for bilingual listeners. Our study recruited English-French bilinguals to listen to a male target speaking French or English, masked by two female voices speaking French, English, or Tamil, or by speech-shaped noise. Listeners performed better with first-language (L1) than second-language (L2) targets, and relative L1/L2 proficiency acted like a categorical rather than a continuous variable with respect to speech reception threshold (SRT) averaged over maskers. Further, listeners struggled most with L1 maskers and least with Tamil maskers. The results suggest that balanced bilinguals have a slight disadvantage with L1 targets but compensate with a larger advantage with L2 targets, compared to unbalanced bilinguals. This positive net result supports the idea that being a balanced bilingual is helpful in speech-on-speech perception tasks in environments that offer substantial exposure to the L2.
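Speech reception thresholds like those compared above are commonly estimated with an adaptive track that converges on the SNR yielding about 50% correct responses. The one-up/one-down sketch below illustrates the general idea only; the abstract does not state which procedure this study used, and the function name and parameters are assumptions.

```python
import numpy as np

def staircase_srt(respond, start_db=0.0, step_db=2.0, n_trials=30):
    """One-up/one-down adaptive track converging on the SNR for ~50% correct,
    a common way to estimate a speech reception threshold (illustrative, not
    necessarily this study's procedure). `respond(snr_db)` returns True on a
    correct trial."""
    snr = start_db
    track = []
    for _ in range(n_trials):
        track.append(snr)
        # Correct response -> make the task harder; incorrect -> easier.
        snr = snr - step_db if respond(snr) else snr + step_db
    # Crude SRT estimate: mean SNR over the second half of the track.
    return float(np.mean(track[n_trials // 2:]))
```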
The effects of adverse conditions on speech recognition by non-native listeners: Electrophysiological and behavioural evidence
This thesis investigated speech recognition by native (L1) and non-native (L2) listeners (i.e., native English and Korean speakers) in diverse adverse conditions using electroencephalography (EEG) and behavioural measures. Study 1 investigated speech recognition in noise for read and casually produced, spontaneous speech using behavioural measures. The results showed that the detrimental effect of casual speech was greater for L2 than L1 listeners, demonstrating real-life L2 speech recognition problems caused by casual speech. Intelligibility also decreased when the accents of the talker and listener did not match, for casual as well as read speech. Study 2 set out to develop EEG methods to measure L2 speech processing difficulties for natural, continuous speech. This study thus examined neural entrainment to the amplitude envelope of speech (i.e., slow amplitude fluctuations in speech) while subjects listened to their L1, their L2, and a language they did not understand. The results demonstrated that neural entrainment to the speech envelope is not modulated by whether listeners understand the language, contrary to previously reported positive relationships between speech entrainment and intelligibility. Study 3 investigated speech processing in a two-talker situation using measures of neural entrainment and the N400, combined with a behavioural speech recognition task. L2 listeners had greater entrainment to target talkers than did L1 listeners, likely because their difficulty with L2 speech comprehension caused them to focus greater attention on the speech signal. L2 listeners also showed a greater degree of lexical processing (i.e., a larger N400) for highly predictable words than did native listeners, while native listeners showed greater lexical processing when listening to foreign-accented speech. The results suggest that the increased listening effort experienced by L2 listeners during speech recognition modulates their auditory and lexical processing.
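Entrainment analyses of the kind described in Study 2 require an estimate of the slow amplitude envelope of the speech signal. A minimal numpy-only sketch (full-wave rectification followed by moving-average smoothing) is given below; the window length is an illustrative assumption, and the thesis may well have used a different envelope-extraction method such as Hilbert-based filtering.

```python
import numpy as np

def amplitude_envelope(signal, fs, win_ms=50.0):
    """Crude speech-envelope estimate: full-wave rectification followed by a
    moving-average smoother that keeps only slow amplitude fluctuations.
    Parameters are illustrative, not the thesis's."""
    win = max(1, int(fs * win_ms / 1000))
    kernel = np.ones(win) / win
    return np.convolve(np.abs(signal), kernel, mode="same")
```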
Exploiting correlogram structure for robust speech recognition with multiple speech sources
This paper addresses the problem of separating and recognising speech in a monaural acoustic mixture in the presence of competing speech sources. The proposed system treats sound source separation and speech recognition as tightly coupled processes. In the first stage, sound source separation is performed in the correlogram domain. For periodic sounds, the correlogram exhibits symmetric tree-like structures whose stems are located on the delays that correspond to multiples of the pitch period. These pitch-related structures are exploited in the study to group spectral components at each time frame. Local pitch estimates are then computed for each spectral group and used to form simultaneous pitch tracks for temporal integration. These processes segregate a spectral representation of the acoustic mixture into several time-frequency regions such that the energy in each region is likely to have originated from a single periodic sound source. The identified time-frequency regions, together with the spectral representation, are passed to a 'speech fragment decoder', which applies 'missing data' techniques with clean speech models to simultaneously search for the acoustic evidence that best matches the model sequences. The paper presents evaluations based on artificially mixed simultaneous speech utterances. A coherence-measuring experiment is first reported, which quantifies the consistency of the identified fragments with a single source. The system is then evaluated in a speech recognition task and compared to a conventional fragment generation approach. Results show that the proposed system produces more coherent fragments across different conditions, which results in significantly better recognition accuracy.
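The correlogram at the heart of the first stage is, per frequency channel, a short-time autocorrelation of the channel signal: peaks at lags equal to multiples of the pitch period produce the tree-like structures described above. The single-channel sketch below illustrates that principle only; the paper operates on the output of a cochlear filterbank, and the pitch-range bounds here are assumptions.

```python
import numpy as np

def frame_autocorrelation(frame, max_lag):
    """One row of a correlogram: normalised autocorrelation of a frame over
    lags 0..max_lag. In the paper this is computed per cochlear channel; here
    a single wideband frame illustrates the pitch-related peaks."""
    frame = frame - frame.mean()
    ac = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                   for k in range(max_lag + 1)])
    return ac / ac[0]

def pitch_lag(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate the pitch period as the autocorrelation peak within a
    plausible pitch range (illustrative bounds, not the paper's)."""
    max_lag = int(fs / fmin)
    ac = frame_autocorrelation(frame, max_lag)
    lo = int(fs / fmax)
    return lo + int(np.argmax(ac[lo:]))
```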
Cognitive performance in open-plan office acoustic simulations: Effects of room acoustics and semantics but not spatial separation of sound sources
The irrelevant sound effect (ISE) characterizes short-term memory performance impairment during irrelevant sounds relative to quiet. Irrelevant sound presentation in most laboratory-based ISE studies has been too limited to represent complex scenarios such as open-plan offices (OPOs), and few studies have considered serial recall of heard information. This paper investigates the ISE using an auditory-verbal serial recall task, wherein performance was evaluated for factors relevant to simulating OPO acoustics: the irrelevant sounds, including the semanticity of speech; reproduction methods over headphones; and room acoustics. Results (Experiments 1 and 2) show that the ISE was exhibited in most conditions with anechoic (irrelevant) nonspeech sounds with or without speech, but the effect was substantially larger with meaningful speech than with foreign speech, suggesting a semantic effect. Performance differences between diotic and binaural reproductions were not statistically robust, suggesting a limited role for spatial separation of sources. In Experiment 3, a statistically robust ISE was exhibited for binaural room acoustic conditions with mid-frequency reverberation times T30 = 0.4, 0.8, and 1.1 s, suggesting cognitive impairment regardless of sound absorption representative of OPOs. Performance differences between the T30 = 0.4 s condition and the T30 = 0.8 and 1.1 s conditions were statistically robust. This emphasizes the benefits for cognitive performance of increased sound absorption, reinforcing extant room acoustic design recommendations. Performance differences between T30 = 0.8 s and 1.1 s were not statistically robust. Collectively, these results suggest that certain findings from ISE studies with idiosyncratic acoustics may not translate well to complex OPO acoustic environments.
Intelligibility model optimisation approaches for speech pre-enhancement
The goal of improving the intelligibility of broadcast speech is being met by a new direction in speech enhancement: near-end intelligibility enhancement. In contrast to the conventional speech enhancement approach, which processes the corrupted speech at the receiver side of the communication chain, the near-end intelligibility enhancement approach pre-processes the clean speech at the transmitter side, i.e., before it is played into the environmental noise. In this work, we describe an optimisation-based approach to near-end intelligibility enhancement that uses models of speech intelligibility to improve the intelligibility of speech in noise.
This thesis first presents a survey of speech intelligibility models and how adverse acoustic conditions affect the intelligibility of speech. The purpose of this survey is to identify models that we can adopt in the design of the pre-enhancement system. Then, we investigate the strategies humans use to increase speech intelligibility in noise, and relate these human strategies to existing algorithms for near-end intelligibility enhancement. A closed-loop feedback approach to near-end intelligibility enhancement is then introduced. In this framework, speech modifications are guided by a model of intelligibility. For the closed-loop system to work, we develop a simple spectral modification strategy that modifies the first few coefficients of an auditory cepstral representation so as to maximise an intelligibility measure. We experiment with two contrasting measures of objective intelligibility. The first, as a baseline, is an audibility measure named 'glimpse proportion', computed as the proportion of the spectro-temporal representation of the speech signal that is free from masking.
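The glimpse proportion can be sketched directly from its definition: the fraction of spectro-temporal cells whose local SNR exceeds a threshold. In the sketch below, the power-spectrogram inputs and the 3 dB criterion follow common usage in the glimpsing literature and are assumptions, not necessarily the thesis's exact formulation.

```python
import numpy as np

def glimpse_proportion(speech_spec, noise_spec, threshold_db=3.0):
    """Glimpse proportion: fraction of spectro-temporal cells where speech
    energy exceeds noise energy by a local SNR threshold. Inputs are power
    spectrograms (freq x time); the 3 dB criterion is a common choice."""
    local_snr_db = 10 * np.log10(speech_spec / noise_spec)
    return float(np.mean(local_snr_db > threshold_db))
```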
We then propose a discriminative intelligibility model, building on the principles of missing data speech recognition, to model the likelihood of specific phonetic confusions that may occur when speech is presented in noise. The discriminative intelligibility measure is computed using a statistical model of speech from the speaker that is to be enhanced.
Interim results showed that, unlike the glimpse-proportion-based system, the discriminative system did not improve intelligibility. We investigated the reason and found that the discriminative system was unable to target phonetic confusions with a fixed spectral shaping. To address this, we introduce a time-varying spectral modification. We also propose to perform the optimisation on a segment-by-segment basis, which yields a solution robust to fluctuating noise. We further combine our system with a noise-independent enhancement technique, dynamic range compression. We found a significant improvement in the non-stationary noise condition, but no significant differences from the state-of-the-art system (spectral shaping and dynamic range compression) were found in the stationary noise condition.
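The dynamic range compression combined with spectral shaping here can be sketched as a static gain curve applied to a short-time level estimate: segments above a threshold are attenuated, reducing the level variation of the speech. All parameters below (threshold, ratio, window) are illustrative assumptions, not those of the evaluated system.

```python
import numpy as np

def dynamic_range_compression(x, fs, threshold_db=-20.0, ratio=4.0, win_ms=10.0):
    """Simple static-curve compressor: estimate a short-time level and
    attenuate samples whose level exceeds the threshold. Illustrative
    parameters, not the system evaluated in the thesis."""
    win = max(1, int(fs * win_ms / 1000))
    level = np.convolve(x ** 2, np.ones(win) / win, mode="same")
    level_db = 10 * np.log10(level + 1e-12)
    over_db = np.maximum(level_db - threshold_db, 0.0)
    # Above threshold, output level rises at 1/ratio the input rate.
    gain_db = -over_db * (1.0 - 1.0 / ratio)
    return x * 10 ** (gain_db / 20)
```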