2,703 research outputs found

    The listening talker: A review of human and algorithmic context-induced modifications of speech

    Get PDF
    International audienceSpeech output technology is finding widespread application, including in scenarios where intelligibility might be compromised - at least for some listeners - by adverse conditions. Unlike most current algorithms, talkers continually adapt their speech patterns as a response to the immediate context of spoken communication, where the type of interlocutor and the environment are the dominant situational factors influencing speech production. Observations of talker behaviour can motivate the design of more robust speech output algorithms. Starting with a listener-oriented categorisation of possible goals for speech modification, this review article summarises the extensive set of behavioural findings related to human speech modification, identifies which factors appear to be beneficial, and goes on to examine previous computational attempts to improve intelligibility in noise. The review concludes by tabulating 46 speech modifications, many of which have yet to be perceptually or algorithmically evaluated. Consequently, the review provides a roadmap for future work in improving the robustness of speech output

    The impact of speech type on listening effort and intelligibility for native and non-native listeners

    Get PDF
    Listeners are routinely exposed to many different types of speech, including artificially-enhanced and synthetic speech, styles which deviate to a greater or lesser extent from naturally-spoken exemplars. While the impact of differing speech types on intelligibility is well-studied, it is less clear how such types affect cognitive processing demands, and in particular whether those speech forms with the greatest intelligibility in noise have a commensurately lower listening effort. The current study measured intelligibility, self-reported listening effort, and a pupillometry-based measure of cognitive load for four distinct types of speech: (i) plain i.e. natural unmodified speech; (ii) Lombard speech, a naturally-enhanced form which occurs when speaking in the presence of noise; (iii) artificially-enhanced speech which involves spectral shaping and dynamic range compression; and (iv) speech synthesized from text. In the first experiment a cohort of 26 native listeners responded to the four speech types in three levels of speech-shaped noise. In a second experiment, 31 non-native listeners underwent the same procedure at more favorable signal-to-noise ratios, chosen since second language listening in noise has a more detrimental effect on intelligibility than listening in a first language. For both native and non-native listeners, artificially-enhanced speech was the most intelligible and led to the lowest subjective effort ratings, while the reverse was true for synthetic speech. However, pupil data suggested that Lombard speech elicited the lowest processing demands overall. These outcomes indicate that the relationship between intelligibility and cognitive processing demands is not a simple inverse, but is mediated by speech type. The findings of the current study motivate the search for speech modification algorithms that are optimized for both intelligibility and listening effort.</p

    Master of Arts

    Get PDF
    thesisOne way talkers can increase intelligibility is by producing clear speech. Though clear speech, as opposed to conversational speech (ConvS), generally increases intelligibility (known as the clear speech intelligibility benefit), not all talkers exhibit the same degree of benefit. Ferguson showed that while intelligibility increased across talkers for clear speech, when looking at individual talkers, the benefit ranged from -12.1 -33.3%. While most talkers were more intelligible during clear speech, some talkers actually became less intelligible. To explain individual differences like these, most researchers have explored acoustic, temporal, and syntactic factors. The current study probes three additional factors, ones relating to talker background: talker experience communicating with nonnative (L2) speakers, talkers' attitudes toward nonnatives, and talker experience as an L2 speaker. Twenty L2 English listeners transcribed sentences from 20 L1 English speakers as they were produced in ConvS and nonnative directed speech (NNDS; a type of clear speech). Intelligibility scores for ConvS and NNDS were compared to measure individual differences in intelligibility and to calculate the clear speech benefit for each talker. Scores were compared with the talkers' answers on a questionnaire to determine whether the variables affected the talkers' intelligibility. Results of the transcription task showed greater overall intelligibility for NNDS than ConvS; however, this was not the case for all talkers. Additionally, talkers varied widely in the benefit they provided the L2 listeners. When comparing results to the questionnaire, only talker experience as an L2 speaker was shown to affect intelligibility for L2 listeners

    Investigating supra-intelligibility aspects of speech

    Get PDF
    158 p.Synthetic and recorded speech form a great part of oureveryday listening experience, and much of our exposure tothese forms of speech occurs in potentially noisy settings such as on public transport, in the classroom or workplace, while driving, and in our homes. Optimising speech output to ensure that salient information is both correctly and effortlessly received is a main concern for the designers of applications that make use of the speech modality. Most of the focus in adapting speech output to challenging listening conditions has been on intelligibility, and specifically on enhancing intelligibility by modifying speech prior to presentation. However, the quality of the generated speech is not always satisfying for the recipient, which might lead to fatigue, or reluctance in using this communication modality. Consequently, a sole focus on intelligibility enhancement provides an incomplete picture of a listener¿s experience since the effect of modified or synthetic speech on other characteristics risks being ignored. These concerns motivate the study of 'supra-intelligibility' factors such as the additional cognitive demand that modified speech may well impose upon listeners, as well as quality, naturalness, distortion and pleasantness. This thesis reports on an investigation into two supra-intelligibility factors: listening effort and listener preferences. Differences in listening effort across four speech types (plain natural, Lombard, algorithmically-enhanced, and synthetic speech) were measured using existing methods, including pupillometry, subjective judgements, and intelligibility scores. To explore the effects of speech features on listener preferences, a new tool, SpeechAdjuster, was developed. SpeechAdjuster allows the manipulation of virtually any aspect of speech and supports the joint elicitation of listener preferences and intelligibility measures. The tool reverses the roles of listener and experimenter by allowing listeners direct control of speech characteristics in real-time. Several experiments to explore the effects of speech properties on listening preferences and intelligibility using SpeechAdjuster were conducted. Participants were permitted to change a speech feature during an open-ended adjustment phase, followed by a test phase in which they identified speech presented with the feature value selected at the end of the adjustment phase. Experiments with native normal-hearing listeners measured the consequences of allowing listeners to change speech rate, fundamental frequency, and other features which led to spectral energy redistribution. Speech stimuli were presented in both quiet and masked conditions. Results revealed that listeners prefer feature modifications similar to those observed in naturally modified speech in noise (Lombard speech). Further, Lombard speech required the least listening effort compared to either plain natural, algorithmically-enhanced, or synthetic speech. For stationary noise, as noise level increased listeners chose slower speech rates and flatter tilts compared to the original speech. Only the choice of fundamental frequency was not consistent with that observed in Lombard speech. It is possible that features such as fundamental frequency that talkers naturally modify are by-products of the speech type (e.g. hyperarticulated speech) and might not be advantageous for the listener.Findings suggest that listener preferences provide information about the processing of speech over and above that measured by intelligibility. One of the listeners¿ concerns was to maximise intelligibility. In noise, listeners preferred the feature values for which more information survived masking, choosing speech rates that led to a contrast with the modulation rate of the masker, or modifications that led to a shift of spectral energy concentration to higher frequencies compared to those of the masker. For all features being modified by listeners, preferences were evident even when intelligibility was at or close to ceiling levels. Such preferences might result from a desire to reduce the cognitive effort of understanding speech, or from a desire to reproduce the sound of typical speech features experienced in real-world noisy conditions, or to optimise the quality of the modified signal. Investigation of supra-intelligibility aspects of speech promises to improve the quality of speech enhancement algorithms, bringing with it the potential of reducing the effort of understanding artificially-modified or generated forms of speech

    Plain-to-clear speech video conversion for enhanced intelligibility

    Get PDF
    Clearly articulated speech, relative to plain-style speech, has been shown to improve intelligibility. We examine if visible speech cues in video only can be systematically modified to enhance clear-speech visual features and improve intelligibility. We extract clear-speech visual features of English words varying in vowels produced by multiple male and female talkers. Via a frame-by-frame image-warping based video generation method with a controllable parameter (displacement factor), we apply the extracted clear-speech visual features to videos of plain speech to synthesize clear speech videos. We evaluate the generated videos using a robust, state of the art AI Lip Reader as well as human intelligibility testing. The contributions of this study are: (1) we successfully extract relevant visual cues for video modifications across speech styles, and have achieved enhanced intelligibility for AI; (2) this work suggests that universal talker-independent clear-speech features may be utilized to modify any talker’s visual speech style; (3) we introduce “displacement factor” as a way of systematically scaling the magnitude of displacement modifications between speech styles; and (4) the high definition generated videos make them ideal candidates for human-centric intelligibility and perceptual training studies

    Acoustics and Perception of Clear Fricatives

    Get PDF
    Everyday observation indicates that speakers can naturally and spontaneously adopt a speaking style that allows them to be understood more easily when confronted with difficult communicative situations. Previous studies have demonstrated that the resulting speaking style, known as clear speech, is more intelligible than casual, conversational speech for a variety of listener populations. However, few studies have examined the acoustic properties of clearly produced fricatives in detail. In addition, it is unknown whether clear speech improves the intelligibility of fricative consonants, or how its effects on fricative perception might differ depending on listener population. Since fricatives are the cause of a large number of recognition errors both for normal-hearing listeners in adverse conditions and for hearing-impaired listeners, it is of interest to explore these issues in detail focusing on fricatives. The current study attempts to characterize the type and magnitude of adaptations in the clear production of English fricatives and determine whether clear speech enhances fricative intelligibility for normal-hearing listeners and listeners with simulated impairment. In an acoustic experiment (Experiment I), ten female and ten male talkers produced nonsense syllables containing the fricatives /f, &thetas;, s, [special characters omitted], v, δ, z, and [y]/ in VCV contexts, in both a conversational style and a clear style that was elicited by means of simulated recognition errors in feedback received from an interactive computer program. Acoustic measurements were taken for spectral, amplitudinal, and temporal properties known to influence fricative recognition. Results illustrate that (1) there were consistent overall clear speech effects, several of which (consonant duration, spectral peak location, spectral moments) were consistent with previous findings and a few (notably consonant-to-vowel intensity ratio) which were not, (2) 'contrastive' differences related to acoustic inventory and eliciting prompts were observed in key comparisons, and (3) talkers differed widely in the types and magnitude of acoustic modifications. Two perception experiments using these same productions as stimuli (Experiments II and III) were conducted to address three major questions: (1) whether clearly produced fricatives are more intelligible than conversational fricatives, (2) what specific acoustic modifications are related to clear speech intelligibility advantages, and (3) how sloping, recruiting hearing impairment interacts with clear speech strategies. Both perception experiments used an adaptive procedure to estimate the signal to (multi-talker babble) noise ratio (SNR) threshold at which minimal pair fricative categorizations could be made with 75% accuracy. Data from fourteen normal-hearing listeners (Experiment II) and fourteen listeners with simulated sloping elevated thresholds and loudness recruitment (Experiment III) indicate that clear fricatives were more intelligible overall for both listener groups. However, for listeners with simulated hearing impairment, a reliable clear speech intelligibility advantage was not found for non-sibilant pairs. Correlation analyses comparing acoustic and perceptual style-related differences across the 20 speakers encountered in the experiments indicated that a shift of energy concentration toward higher frequency regions and greater source strength was a primary contributor to the "clear fricative effect" for normal-hearing listeners but not for listeners with simulated loss, for whom information in higher frequency regions was less audible
    corecore