4,721 research outputs found

    Effect of temporal fine structure on speech intelligibility modeling

    Get PDF
    Temporal fine structure (TFS) carries important information for the speech perception of hearing-impaired listeners and for the design of novel prosthetic hearing devices. This study assessed the performance of present intelligibility indices for predicting the intelligibility of speech containing different amount of TFS information. Speech intelligibility data was collected from vocoded and wideband Mandarin sentences containing little/partial and intact TFS information, respectively, and was then subjected to the correlation analysis with existing intelligibility indices. It was found that, though performing well in predicting the intelligibility of vocoded or wideband speech separately, present intelligibility indices were not highly correlated with the intelligibility scores when a general function was used to map all intelligibility measures to intelligibility scores. Analysis further showed that the intelligibility prediction power could be significantly improved when multiple condition-dependent functions were used for mapping intelligibility measures to intelligibility scores. © 2013 IEEE.published_or_final_versio

    Effects of noise suppression and envelope dynamic range compression on the intelligibility of vocoded sentences for a tonal language

    Get PDF
    Vocoder simulation studies have suggested that the carrier signal type employed affects the intelligibility of vocoded speech. The present work further assessed how carrier signal type interacts with additional signal processing, namely, single-channel noise suppression and envelope dynamic range compression, in determining the intelligibility of vocoder simulations. In Experiment 1, Mandarin sentences that had been corrupted by speech spectrum-shaped noise (SSN) or two-talker babble (2TB) were processed by one of four single-channel noise-suppression algorithms before undergoing tone-vocoded (TV) or noise-vocoded (NV) processing. In Experiment 2, dynamic ranges of multiband envelope waveforms were compressed by scaling of the mean-removed envelope waveforms with a compression factor before undergoing TV or NV processing. TV Mandarin sentences yielded higher intelligibility scores with normal-hearing (NH) listeners than did noise-vocoded sentences. The intelligibility advantage of noise-suppressed vocoded speech depended on the masker type (SSN vs 2TB). NV speech was more negatively influenced by envelope dynamic range compression than was TV speech. These findings suggest that an interactional effect exists between the carrier signal type employed in the vocoding process and envelope distortion caused by signal processing

    The listening talker: A review of human and algorithmic context-induced modifications of speech

    Get PDF
    International audienceSpeech output technology is finding widespread application, including in scenarios where intelligibility might be compromised - at least for some listeners - by adverse conditions. Unlike most current algorithms, talkers continually adapt their speech patterns as a response to the immediate context of spoken communication, where the type of interlocutor and the environment are the dominant situational factors influencing speech production. Observations of talker behaviour can motivate the design of more robust speech output algorithms. Starting with a listener-oriented categorisation of possible goals for speech modification, this review article summarises the extensive set of behavioural findings related to human speech modification, identifies which factors appear to be beneficial, and goes on to examine previous computational attempts to improve intelligibility in noise. The review concludes by tabulating 46 speech modifications, many of which have yet to be perceptually or algorithmically evaluated. Consequently, the review provides a roadmap for future work in improving the robustness of speech output

    Enhanced amplitude modulations contribute to the Lombard intelligibility benefit: Evidence from the Nijmegen Corpus of Lombard Speech

    No full text
    Speakers adjust their voice when talking in noise, which is known as Lombard speech. These acoustic adjustments facilitate speech comprehension in noise relative to plain speech (i.e., speech produced in quiet). However, exactly which characteristics of Lombard speech drive this intelligibility benefit in noise remains unclear. This study assessed the contribution of enhanced amplitude modulations to the Lombard speech intelligibility benefit by demonstrating that (1) native speakers of Dutch in the Nijmegen Corpus of Lombard Speech (NiCLS) produce more pronounced amplitude modulations in noise vs. in quiet; (2) more enhanced amplitude modulations correlate positively with intelligibility in a speech-in-noise perception experiment; (3) transplanting the amplitude modulations from Lombard speech onto plain speech leads to an intelligibility improvement, suggesting that enhanced amplitude modulations in Lombard speech contribute towards intelligibility in noise. Results are discussed in light of recent neurobiological models of speech perception with reference to neural oscillators phase-locking to the amplitude modulations in speech, guiding the processing of speech

    A computer model of auditory efferent suppression: Implications for the recognition of speech in noise

    Get PDF
    The neural mechanisms underlying the ability of human listeners to recognize speech in the presence of background noise are still imperfectly understood. However, there is mounting evidence that the medial olivocochlear system plays an important role, via efferents that exert a suppressive effect on the response of the basilar membrane. The current paper presents a computer modeling study that investigates the possible role of this activity on speech intelligibility in noise. A model of auditory efferent processing [ Ferry, R. T., and Meddis, R. (2007). J. Acoust. Soc. Am. 122, 3519?3526 ] is used to provide acoustic features for a statistical automatic speech recognition system, thus allowing the effects of efferent activity on speech intelligibility to be quantified. Performance of the ?basic? model (without efferent activity) on a connected digit recognition task is good when the speech is uncorrupted by noise but falls when noise is present. However, recognition performance is much improved when efferent activity is applied. Furthermore, optimal performance is obtained when the amount of efferent activity is proportional to the noise level. The results obtained are consistent with the suggestion that efferent suppression causes a ?release from adaptation? in the auditory-nerve response to noisy speech, which enhances its intelligibility

    Consonant identification using temporal fine structure and recovered envelope cues

    Get PDF
    The contribution of recovered envelopes (RENVs) to the utilization of temporal-fine structure (TFS) speech cues was examined in normal-hearing listeners. Consonant identification experiments used speech stimuli processed to present TFS or RENV cues. Experiment 1 examined the effects of exposure and presentation order using 16-band TFS speech and 40-band RENV speech recovered from 16-band TFS speech. Prior exposure to TFS speech aided in the reception of RENV speech. Performance on the two conditions was similar (∼50%-correct) for experienced listeners as was the pattern of consonant confusions. Experiment 2 examined the effect of varying the number of RENV bands recovered from 16-band TFS speech. Mean identification scores decreased as the number of RENV bands decreased from 40 to 8 and were only slightly above chance levels for 16 and 8 bands. Experiment 3 examined the effect of varying the number of bands in the TFS speech from which 40-band RENV speech was constructed. Performance fell from 85%- to 31%-correct as the number of TFS bands increased from 1 to 32. Overall, these results suggest that the interpretation of previous studies that have used TFS speech may have been confounded with the presence of RENVs.National Institutes of Health (U.S.) (Grant R01 DC00117)National Institutes of Health (U.S.) (Grant R43 DC013006

    Collaborative Deep Learning for Speech Enhancement: A Run-Time Model Selection Method Using Autoencoders

    Full text link
    We show that a Modular Neural Network (MNN) can combine various speech enhancement modules, each of which is a Deep Neural Network (DNN) specialized on a particular enhancement job. Differently from an ordinary ensemble technique that averages variations in models, the propose MNN selects the best module for the unseen test signal to produce a greedy ensemble. We see this as Collaborative Deep Learning (CDL), because it can reuse various already-trained DNN models without any further refining. In the proposed MNN selecting the best module during run time is challenging. To this end, we employ a speech AutoEncoder (AE) as an arbitrator, whose input and output are trained to be as similar as possible if its input is clean speech. Therefore, the AE can gauge the quality of the module-specific denoised result by seeing its AE reconstruction error, e.g. low error means that the module output is similar to clean speech. We propose an MNN structure with various modules that are specialized on dealing with a specific noise type, gender, and input Signal-to-Noise Ratio (SNR) value, and empirically prove that it almost always works better than an arbitrarily chosen DNN module and sometimes as good as an oracle result

    Spectrogram inversion and potential applications for hearing research

    Get PDF
    corecore