
    Likelihood-Maximizing-Based Multiband Spectral Subtraction for Robust Speech Recognition

    Automatic speech recognition performance degrades significantly when speech is affected by environmental noise. Nowadays, the major challenge is to achieve good robustness in adverse noisy conditions so that automatic speech recognizers can be used in real situations. Spectral subtraction (SS) is a well-known and effective approach; it was originally designed for improving the quality of the speech signal as judged by human listeners. SS techniques usually improve the quality and intelligibility of the speech signal, whereas speech recognition systems need compensation techniques to reduce the mismatch between noisy speech features and acoustic models trained on clean speech. Nevertheless, a correlation can be expected between speech quality improvement and the increase in recognition accuracy. This paper proposes a novel approach for solving this problem by considering SS and the speech recognizer not as two independent entities cascaded together, but rather as two interconnected components of a single system, sharing the common goal of improved speech recognition accuracy. This approach incorporates important information from the statistical models of the recognition engine as feedback for tuning SS parameters. By using this architecture, we overcome the drawbacks of previously proposed methods and achieve better recognition accuracy. Experimental evaluations show that the proposed method can achieve significant improvement of recognition rates across a wide range of signal-to-noise ratios.

    Evaluation of the sparse coding shrinkage noise reduction algorithm for the hearing impaired

    Although there are numerous single-channel noise reduction strategies to improve speech perception in a noisy environment, most of them improve only speech quality, not speech intelligibility, for normal hearing (NH) or hearing impaired (HI) listeners. Currently, the only exceptions that improve speech intelligibility are those that require a priori statistics of speech or noise. Most of the noise reduction algorithms in hearing aids are adopted directly from the algorithms for NH listeners without taking into account the hearing loss factors of HI listeners. HI listeners suffer more in speech intelligibility than NH listeners in the same noisy environment. Further study of monaural noise reduction algorithms for HI listeners is required. The motivation is to adapt a model-based approach in contrast to the conventional Wiener filtering approach. The model-based algorithm called sparse coding shrinkage (SCS) was proposed to extract key speech information from noisy speech. The SCS algorithm was evaluated by comparison with another state-of-the-art Wiener filtering approach through speech intelligibility and quality tests using 9 NH and 9 HI listeners. The SCS algorithm matched the performance of the Wiener filtering algorithm in speech intelligibility and speech quality. Both algorithms showed some intelligibility improvements for HI listeners but none for NH listeners. The algorithms improved speech quality for both HI and NH listeners. Additionally, a physiologically inspired hearing loss simulation (HLS) model was developed to characterize hearing loss factors and simulate hearing loss consequences. A methodology was proposed to evaluate signal processing strategies for HI listeners with the proposed HLS model and NH subjects. The corresponding experiment was performed by asking NH subjects to listen to unprocessed/enhanced speech with the HLS model. Some of the effects of the algorithms seen in HI listeners are reproduced, at least qualitatively, by using the HLS model with NH listeners. Conclusions: The model-based algorithm SCS is promising for improving performance in stationary noise, although no clear difference was seen in the performance of SCS and a competitive Wiener filtering algorithm. Fluctuating noise is more difficult to reduce than stationary noise. Noise reduction algorithms may perform better at higher input signal-to-noise ratios (SNRs), where HI listeners can benefit but where NH listeners already reach ceiling performance. The proposed HLS model can save time and cost when evaluating noise reduction algorithms for HI listeners.

    Perceptual techniques in audio quality assessment


    Perceptual compensation for reverberation in human listeners and machines

    This thesis explores compensation for reverberation in human listeners and machines. Late reverberation is typically understood as a distortion which degrades intelligibility. Recent research, however, shows that late reverberation is not always detrimental to human speech perception. At times, prolonged exposure to reverberation can provide a helpful acoustic context which improves identification of reverberant speech sounds. The physiology underpinning our robustness to reverberation has not yet been elucidated, but is speculated in this thesis to include efferent processes which have previously been shown to improve discrimination of noisy speech. These efferent pathways descend from higher auditory centres, effectively recalibrating the encoding of sound in the cochlea. Moreover, this thesis proposes that efferent-inspired computational models based on psychoacoustic principles may also improve performance for machine listening systems in reverberant environments. A candidate model for perceptual compensation for reverberation is proposed in which efferent suppression derives from the level of reverberation detected in the simulated auditory nerve response. The model simulates human performance in a phoneme-continuum identification task under a range of reverberant conditions, where a synthetically controlled test-word and its surrounding context phrase are independently reverberated. Addressing questions which arose from the model, a series of perceptual experiments used naturally spoken speech materials to investigate aspects of the psychoacoustic mechanism underpinning compensation. These experiments demonstrate a monaural compensation mechanism that is influenced by both the preceding context (which need not be intelligible speech) and by the test-word itself, and which depends on the time-direction of reverberation. Compensation was shown to act rapidly (within a second or so), indicating a monaural mechanism that is likely to be effective in everyday listening. Finally, the implications of these findings for the future development of computational models of auditory perception are considered.

    Studies on noise robust automatic speech recognition

    Noise in everyday acoustic environments such as cars, traffic environments, and cafeterias remains one of the main challenges in automatic speech recognition (ASR). As a research theme, it has received wide attention in conferences and scientific journals focused on speech technology. This article collection reviews both the classic and novel approaches suggested for noise robust ASR. The articles are literature reviews written for the spring 2009 seminar course on noise robust automatic speech recognition (course code T-61.6060) held at TKK.

    When the Differences in Frequency Domain are Compensated: Understanding and Defeating Modulated Replay Attacks on Automatic Speech Recognition

    Automatic speech recognition (ASR) systems have been widely deployed in modern smart devices to provide convenient and diverse voice-controlled services. Since ASR systems are vulnerable to audio replay attacks that can spoof and mislead ASR systems, a number of defense systems have been proposed to identify replayed audio signals based on the speakers' unique acoustic features in the frequency domain. In this paper, we uncover a new type of replay attack called modulated replay attack, which can bypass the existing frequency-domain-based defense systems. The basic idea is to compensate for the frequency distortion of a given electronic speaker using an inverse filter that is customized to the speaker's transform characteristics. Our experiments on real smart devices confirm that modulated replay attacks can successfully escape the existing detection mechanisms that rely on identifying suspicious features in the frequency domain. To defeat modulated replay attacks, we design and implement a countermeasure named DualGuard. We discover and formally prove that no matter how the replay audio signals are modulated, the replay attacks will either leave ringing artifacts in the time domain or cause spectrum distortion in the frequency domain. Therefore, by jointly checking suspicious features in both frequency and time domains, DualGuard can successfully detect various replay attacks, including the modulated replay attacks. We implement a prototype of DualGuard on a popular voice interactive platform, ReSpeaker Core v2. The experimental results show DualGuard can achieve 98% accuracy in detecting modulated replay attacks. Comment: 17 pages, 24 figures. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security (CCS '20).

    A perceptual sound space for auditory displays based on sung-vowel synthesis

    When designing displays for the human senses, perceptual spaces are of great importance to give intuitive access to physical attributes. Similar to how perceptual spaces based on hue, saturation, and lightness were constructed for visual color, research has explored perceptual spaces for sounds of a given timbral family based on timbre, brightness, and pitch. To promote an embodied approach to the design of auditory displays, we introduce the Vowel-Type-Pitch (VTP) space, a cylindrical sound space based on human sung vowels, whose timbres can be synthesized by the composition of acoustic formants and can be categorically labeled. Vowels are arranged along the circular dimension, while voice type and pitch of the vowel correspond to the remaining two axes of the cylindrical VTP space. The decoupling and perceptual effectiveness of the three dimensions of the VTP space are tested through a vowel labeling experiment, whose results are visualized as maps on circular slices of the VTP cylinder. We discuss implications for the design of auditory and multi-sensory displays that account for human perceptual capabilities.