52 research outputs found

    Identifying, Evaluating and Applying Importance Maps for Speech

    Like many machine learning systems, speech models often perform well on data from the same domain as their training data, but performance suffers on out-of-domain data. With a fast-growing number of applications of speech models in healthcare, education, automotive, automation, and other fields, it is essential to ensure that speech models generalize to out-of-domain data, especially to the noisy environments of real-world scenarios. Human listeners, in contrast, are quite robust to noisy environments. A thorough understanding of the differences between human listeners and speech models is therefore urgently required to enhance speech model performance in noise. These differences presumably exist because the speech model does not use the same information as humans to recognize speech. A possible solution is to encourage the speech model to attend to the same time-frequency regions as human listeners, which may improve its generalization in noise. We define the time-frequency regions that humans or machines focus on to recognize speech as importance maps (IMs). In this research, we first investigate how to identify speech importance maps. Second, we compare human and machine importance maps to understand how they differ and how the speech model can learn from humans to improve its performance in noise. Third, we develop the structured saliency benchmark (SSBM), a metric for evaluating IMs. Finally, we propose a new application of IMs as data augmentation for speech models, enhancing their performance and enabling them to generalize better to out-of-domain noise. Overall, our work demonstrates that importance maps can improve speech models and achieve out-of-domain generalization to different noise environments. In future work, we will extend this approach to large-scale speech models, deploy different methods to identify IMs and use them to augment speech data (such as methods based on human responses), and extend the technique to computer vision tasks such as image recognition, predicting importance maps for images and using them to improve model performance on out-of-domain data.
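    As a rough illustration of the augmentation idea, the sketch below perturbs only the time-frequency bins that an importance map marks as unimportant, so training nudges the model toward the regions human listeners rely on. The function name, array shapes, and noise model here are illustrative assumptions, not the pipeline actually used in the thesis.

```python
# Minimal sketch of IM-guided data augmentation (hypothetical names and
# shapes; the thesis's actual augmentation pipeline may differ).
import numpy as np

def im_augment(spectrogram, importance_map, noise_level=0.5, rng=None):
    """Add noise only where the importance map marks bins as unimportant.

    spectrogram    -- (freq, time) magnitude spectrogram
    importance_map -- (freq, time) array in [0, 1]; 1 = important
    """
    rng = rng or np.random.default_rng()
    noise = rng.normal(0.0, noise_level, spectrogram.shape)
    # Scale noise by (1 - importance): important regions stay clean,
    # unimportant regions are perturbed, encouraging the model to rely
    # on the time-frequency regions humans use.
    return spectrogram + (1.0 - importance_map) * noise

# Example with random stand-in data:
spec = np.abs(np.random.default_rng(0).normal(size=(128, 200)))
im = np.random.default_rng(1).random((128, 200))
augmented = im_augment(spec, im)
```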

    The use of acoustic cues in phonetic perception: Effects of spectral degradation, limited bandwidth and background noise

    Hearing impairment, cochlear implantation, background noise, and other auditory degradations result in the loss or distortion of sound information thought to be critical to speech perception. In many cases listeners can still identify speech sounds despite these degradations, but our understanding of how this is accomplished is incomplete. The experiments presented here tested the hypothesis that listeners utilize acoustic-phonetic cues differently when one or more cues are degraded by hearing impairment or simulated hearing impairment. Results supported this hypothesis for various listening conditions that are directly relevant to clinical populations. Analysis included mixed-effects logistic modeling of the contributions of individual acoustic cues for various contrasts. Listeners with cochlear implants (CIs), and normal-hearing (NH) listeners in CI simulations, showed increased use of acoustic cues in the temporal domain and decreased use of cues in the spectral domain for the tense/lax vowel contrast and the word-final fricative voicing contrast. For the word-initial stop voicing contrast, NH listeners made less use of voice-onset time and greater use of voice pitch in conditions that simulated high-frequency hearing impairment and/or masking noise; the influence of these cues was further modulated by consonant place of articulation. A pair of experiments measured phonetic context effects for the "s/sh" contrast, replicating previously observed effects for NH listeners and generalizing them to CI listeners, despite known deficiencies in spectral resolution for CI listeners. For NH listeners in CI simulations, these context effects were absent or negligible. Audio-visual delivery of this experiment revealed an enhanced influence of visual lip-rounding cues for CI listeners and for NH listeners in CI simulations. Additionally, CI listeners demonstrated that visual cues to gender influence phonetic perception in a manner consistent with gender-related voice acoustics. Together these results suggest that listeners accommodate challenging listening situations by capitalizing on the natural (multimodal) covariance in speech signals. They also imply that there are differences in speech perception between NH listeners and listeners with hearing impairment that would be overlooked by traditional word recognition or consonant confusion matrix analyses.
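    The mixed-effects logistic analysis can be pictured along the following lines; this is a minimal sketch using the Bayesian mixed GLM in statsmodels, with simulated data and hypothetical column names (listener, spectral_cue, temporal_cue, response) standing in for the dissertation's variables.

```python
# Hedged sketch of mixed-effects logistic modeling of cue use.
import numpy as np
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# Simulated stand-in for the trial data: one row per trial,
# response is the binary phoneme identification.
rng = np.random.default_rng(0)
n = 400
trials = pd.DataFrame({
    "listener": rng.integers(0, 20, n).astype(str),  # 20 listeners
    "spectral_cue": rng.normal(size=n),              # e.g. a formant cue
    "temporal_cue": rng.normal(size=n),              # e.g. a duration cue
})
logit = 1.5 * trials.spectral_cue + 0.5 * trials.temporal_cue
trials["response"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Random listener intercepts; the fixed-effect coefficients index how
# strongly each acoustic cue contributes to the binary phoneme decision.
model = BinomialBayesMixedGLM.from_formula(
    "response ~ spectral_cue + temporal_cue",
    {"listener": "0 + C(listener)"},
    trials,
)
fit = model.fit_vb()  # variational Bayes estimation
print(fit.summary())
```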

    The use of the domestic dog (Canis familiaris) as a comparative model for speech perception

    Animals have long been used as comparative models for adult human speech perception, but few animal models have been used to explore developmental questions about speech perception. This dissertation encourages the use of domestic dogs as a behavioral model for speech perception processes. Specifically, dog models are suggested for questions about 1) the role and function of the underlying processes responsible for different aspects of speech perception, and 2) the effect of language experience on speech perception processes. Chapters 2, 3, and 4 examined the contributions of auditory, attentional, and linguistic processing skills to infants’ difficulties understanding speech in noise. It is not known why infants have more difficulty than adults perceiving speech in noise, especially single-talker noise. Understanding speech in noise relies on infants’ auditory, attentional, and linguistic processes, and it is methodologically difficult to isolate each system’s contribution when testing infants. To tease these systems apart, I compared dogs’ name recognition in nine-talker and single-talker background noise to that of infants. These studies suggest that attentional processes play a large role in infants’ difficulties understanding speech in noise. Chapter 5 explored the reasons behind infants’ shift from a preference for vowel information (vowel bias) to consonant information (consonant bias) in word identification. This shift may occur as a result of language exposure, or of attaining a particular lexicon size and structure. To better understand the linguistic exposure necessary for consonant bias development, I tested dogs, who have long-term linguistic exposure but a minimal vocabulary. Dogs demonstrated a vowel bias rather than a consonant bias, suggesting that a small lexicon and regular linguistic exposure, together with mature auditory processing, do not lead to the emergence of a consonant bias. Overall, these chapters suggest that dog models can be useful for broad questions about the systems underlying speech perception and about the role of language exposure in the development of certain speech perception processes. However, the studies faced limitations due to a lack of knowledge about dogs’ underlying cognitive systems and linguistic exposure. More fundamental research is necessary to characterize dogs’ linguistic exposure and to understand their auditory, attentional, and linguistic processes before more specific comparative research questions can be asked.
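    For concreteness, mixing a recorded name with multi-talker babble at a controlled signal-to-noise ratio might look like the sketch below; the function and variable names are illustrative, and the dissertation's actual stimulus-preparation details are not given in the abstract.

```python
# Minimal sketch of mixing a target recording with babble at a chosen SNR.
import numpy as np

def mix_at_snr(target, babble, snr_db):
    """Return target + babble, with babble scaled to the requested SNR.

    Assumes babble is at least as long as target; both are 1-D arrays.
    """
    babble = babble[:len(target)]          # trim babble to target length
    p_target = np.mean(target ** 2)
    p_babble = np.mean(babble ** 2)
    # Choose scale so that 10*log10(p_target / (scale**2 * p_babble)) == snr_db.
    scale = np.sqrt(p_target / (p_babble * 10 ** (snr_db / 10)))
    return target + scale * babble

# Nine-talker babble can be approximated by averaging nine talker tracks:
# babble9 = sum(tracks) / 9
```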

    The impact of spectrally asynchronous delay on the intelligibility of conversational speech

    Conversationally spoken speech is rife with rapidly changing and complex acoustic cues that listeners hear, process, and encode into meaning. For many hearing-impaired listeners, a hearing aid is necessary to hear these spectral and temporal acoustic cues of speech. For listeners with mild-to-moderate high-frequency sensorineural hearing loss, open-fit digital signal processing (DSP) hearing aids are the most common amplification option. Open-fit DSP hearing aids introduce a spectrally asynchronous delay into the acoustic signal: audible low-frequency information passes to the eardrum unimpeded, while the aid delivers amplified high-frequency sound whose onset is delayed relative to the natural pathway of sound. These spectrally asynchronous delays may disrupt the natural acoustic pattern of speech. The primary goal of this study was to measure the effect of spectrally asynchronous delay on the intelligibility of conversational speech for normal-hearing and hearing-impaired listeners. A group of normal-hearing listeners (n = 25) and a group of listeners with mild-to-moderate high-frequency sensorineural hearing loss (n = 25) participated in this study. The acoustic stimuli were 200 conversationally spoken recordings of the low-predictability sentences from the revised Speech Perception in Noise test (R-SPIN). These 200 sentences were modified to control for audibility in the hearing-impaired group and so that the acoustic energy above 2 kHz was delayed by 0 ms (control), 4 ms, 8 ms, or 32 ms relative to the low-frequency energy. The data were analyzed to determine the effect of each of the four delay conditions on the intelligibility of the final key word of each sentence. Normal-hearing listeners were minimally affected by the asynchronous delay. The hearing-impaired listeners, however, were deleteriously affected by increasing amounts of spectrally asynchronous delay. Although the hearing-impaired listeners performed well overall in their perception of conversationally spoken speech in quiet, the intelligibility of conversationally spoken sentences decreased significantly when the delay was 4 ms or greater. Hearing aid manufacturers therefore need to restrict the delay introduced by DSP so that it does not distort the acoustic patterns of conversational speech.
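    The stimulus manipulation described above can be sketched as follows: split the signal at 2 kHz, delay the high band, and recombine. The filter order and zero-phase filtering are assumptions, as the study's exact processing chain is not given in the abstract.

```python
# Hedged sketch of imposing a spectrally asynchronous delay above 2 kHz.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def asynchronous_delay(x, fs, delay_ms, split_hz=2000.0):
    """Delay the band above split_hz by delay_ms relative to the low band."""
    sos_lo = butter(4, split_hz, btype="lowpass", fs=fs, output="sos")
    sos_hi = butter(4, split_hz, btype="highpass", fs=fs, output="sos")
    low = sosfiltfilt(sos_lo, x)   # zero-phase: adds no group delay itself
    high = sosfiltfilt(sos_hi, x)
    shift = int(round(fs * delay_ms / 1000.0))
    delayed_high = np.concatenate([np.zeros(shift), high])[:len(x)]
    return low + delayed_high

# The four conditions, e.g.:
# stimuli = [asynchronous_delay(x, fs, d) for d in (0, 4, 8, 32)]
```

    Zero-phase filtering keeps the two bands time-aligned before the imposed delay, so the manipulation isolates the onset asynchrony itself.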

    Spoken English discrimination (SED) training with multilingual Malaysians: effect of adaptive staircase procedure and background babble in high variability phonetic training.

    High variability phonetic training (HVPT) has been shown to improve non-native speakers’ perceptual performance in discriminating difficult second-language phonemic contrasts (Bradlow, Akahane-Yamada, Pisoni, & Tohkura, 1999; Bradlow, Pisoni, Akahane-Yamada, & Tohkura, 1997; Lively, Logan, & Pisoni, 1993; Lively, Pisoni, Yamada, Tohkura, & Yamada, 1994; Logan, Lively, & Pisoni, 1991). This perceptual learning generalizes to novel words (Wang & Munro, 2004), to novel speakers (Nishi & Kewley-Port, 2007; Richie & Kewley-Port, 2008), and even to speech production (Bradlow et al., 1997). However, the rigidity of laboratory training settings has limited its application to real-life situations. This thesis examined the effectiveness of a new phonetic training program, Spoken English Discrimination (SED) training. SED training is a computerized individual training program designed to improve non-native speakers’ bottom-up perceptual sensitivity for discriminating difficult second-language (L2) phonemic contrasts. It combines several key training features: 1) natural spoken stimuli, 2) highly variable stimuli spoken by multiple speakers, 3) multi-talker babble as background noise, and 4) an adaptive staircase procedure that individualizes the level of background babble. The first experiment investigated the potential benefits of different versions of the SED training program. The effects of stimulus variability (single speaker vs. multiple speakers) and background babble design (constant vs. adaptive staircase) were examined using the English voiceless-voiced plosive /t/-/d/ phonemic contrast as the training material. No post-test improvement was found in identification accuracy on the /t/-/d/ contrast, but identification improved on the untrained English /ε/-/æ/ phonemic contrast. The effectiveness of SED training was re-examined in Chapter 3 using the English /ε/-/æ/ phonemic contrast as the training material. Three experiments compared SED training paradigms with the background babble implemented either at a constant level (Constant SED) or via the adaptive staircase procedure (Adaptive Staircase SED), and measured the longevity of the training effects. The Adaptive Staircase SED proved the more effective paradigm: it generated greater training benefits, and its effect generalized better to the untrained /t/-/d/ phonemic contrast. Training effects from both SED paradigms were retained six months after the last training session. Before examining whether SED training leads to improvements in speech production, Chapter 4 investigated the phonetic perception patterns of L1 Mandarin Malaysian speakers, L1 Malaysian English speakers, and native British English speakers. The production intelligibility of the L1 Mandarin speakers was also evaluated by the L1 Malaysian English speakers and the native British English speakers. Single-category assimilation was observed in both L1 Mandarin and L1 Malaysian English speakers, whereby the /ε/ and /æ/ phonetic sounds were assimilated to a single /æ/ category (Best, McRoberts, & Goodell, 2001). While the British English speakers showed ceiling performance for all phonetic categories involved, the L1 Malaysian English speakers had difficulty identifying the British English /ε/ phoneme, and the L1 Mandarin speakers had difficulty identifying the word-final /d/, /ε/, and /æ/ phonemes.
    Consistent with their perceptual performance, the L1 Mandarin speakers also had difficulty producing distinct word-final /d/, /ε/, and /æ/ phonemes. Two experiments in Chapter 5 examined whether the effects of SED training generalize to speech production. The results showed that, depending on whether production was judged by L1 Malaysian English speakers or by native British English speakers, different SED paradigms appeared more effective in inducing production improvement; only the production intelligibility of the /æ/ phoneme improved as a result of SED training. Collectively, the seven experiments in this thesis showed that SED training was effective in improving Malaysian speakers’ perception and production of difficult English phonemic contrasts. Further research should examine the efficacy of SED training across different training materials and with speakers from different language backgrounds.
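    The adaptive staircase that individualizes the babble level can be illustrated with a generic transformed up-down rule. The sketch below uses a 2-down/1-up rule on the signal-to-babble ratio; the step size and SNR limits are assumptions rather than the thesis's actual parameters.

```python
# Illustrative 2-down/1-up staircase controlling the babble level.
class Staircase:
    def __init__(self, snr_db=10.0, step_db=2.0, snr_min=-10.0, snr_max=20.0):
        self.snr_db = snr_db
        self.step_db = step_db
        self.snr_min, self.snr_max = snr_min, snr_max
        self._correct_run = 0

    def update(self, correct):
        """Score one trial and return the SNR for the next trial."""
        if correct:
            self._correct_run += 1
            if self._correct_run == 2:       # 2-down: more babble, harder
                self.snr_db -= self.step_db
                self._correct_run = 0
        else:
            self._correct_run = 0
            self.snr_db += self.step_db      # 1-up: less babble, easier
        # Keep the level within the presentable range.
        self.snr_db = min(max(self.snr_db, self.snr_min), self.snr_max)
        return self.snr_db
```

    With a 2-down/1-up rule the babble level converges near the SNR yielding roughly 70.7% correct, holding each trainee at an individualized, moderately difficult level.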

    Language and Communication
