
    Studies on noise robust automatic speech recognition

    Noise in everyday acoustic environments such as cars, traffic, and cafeterias remains one of the main challenges in automatic speech recognition (ASR). As a research theme, it has received wide attention in conferences and scientific journals focused on speech technology. This article collection reviews both classic and novel approaches to noise robust ASR. The articles are literature reviews written for the spring 2009 seminar course on noise robust automatic speech recognition (course code T-61.6060) held at TKK.

    Concatenative Synthesis for Novel Timbral Creation

    Modern-day musicians rely on a variety of instruments for musical expression. Tones produced by electronic instruments have become almost as commonplace as those produced by traditional ones, as evidenced by the plethora of artists who compose and perform with nothing more than a personal computer. This desire to embrace technical innovation as a means to augment performance art has created a budding field in computer science that explores the creation and manipulation of sound for artistic purposes. One facet of this new frontier is timbral creation: the development of new sounds with unique characteristics that can be wielded by the musician as a virtual instrument. This thesis presents Timcat, a software system that creates novel timbres from prerecorded audio. Various techniques for extracting timbral features from short audio clips, or grains, are evaluated for use in timbral feature spaces. Clustering is performed on the feature vectors in these spaces, and the resulting groupings are recombined using concatenative synthesis to form new instrument patches. The results reveal that interesting timbres can be created using features extracted by both newly developed and existing signal-analysis techniques, many of which are common in other fields but seldom applied to music audio signals. Several of the features also show high accuracy for instrument separation in randomly mixed tracks. Survey results demonstrate positive feedback on the timbres created by Timcat from electronic music composers, musicians, and music lovers alike.
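    The pipeline this abstract describes (per-grain feature extraction, clustering, concatenative recombination) can be sketched roughly as below. This is a minimal illustration, not Timcat's actual implementation: the two-value feature vector, the plain k-means grouping, and the crossfade length are all assumptions.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means used here to group grain feature vectors."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def concatenate_cluster(grains, labels, cluster, fade=32):
    """Join all grains assigned to one cluster, with a short linear
    crossfade between consecutive grains to avoid clicks."""
    out = np.zeros(0)
    for g in (grains[i] for i in np.where(labels == cluster)[0]):
        if len(out) >= fade:
            g = g.copy()
            g[:fade] *= np.linspace(0.0, 1.0, fade)   # fade in new grain
            out[-fade:] *= np.linspace(1.0, 0.0, fade)  # fade out tail
            out[-fade:] += g[:fade]
            out = np.concatenate([out, g[fade:]])
        else:
            out = np.concatenate([out, g])
    return out
```

    The concatenated signal for one cluster is then a candidate "instrument patch"; a real system would extract far richer timbral features than the toy two-dimensional vectors assumed here.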

    Likelihood-Maximizing-Based Multiband Spectral Subtraction for Robust Speech Recognition

    Automatic speech recognition performance degrades significantly when speech is corrupted by environmental noise. The major challenge today is to achieve good robustness in adverse noisy conditions so that automatic speech recognizers can be used in real situations. Spectral subtraction (SS) is a well-known and effective approach; it was originally designed to improve the quality of the speech signal as judged by human listeners. SS techniques usually improve the quality and intelligibility of the speech signal, whereas speech recognition systems need compensation techniques that reduce the mismatch between noisy speech features and the clean-trained acoustic model. Nevertheless, a correlation can be expected between speech-quality improvement and gains in recognition accuracy. This paper proposes a novel approach that treats SS and the speech recognizer not as two independent entities cascaded together, but as two interconnected components of a single system sharing the common goal of improved speech recognition accuracy. The architecture feeds important information from the statistical models of the recognition engine back into the tuning of the SS parameters. Using this architecture, we overcome the drawbacks of previously proposed methods and achieve better recognition accuracy. Experimental evaluations show that the proposed method achieves significant improvements in recognition rates across a wide range of signal-to-noise ratios.
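    The basic SS step, and the kind of recognizer-in-the-loop parameter tuning the paper argues for, might be sketched as follows. This is a generic sketch, not the paper's multiband algorithm: `score_fn` is a hypothetical stand-in for the recognizer's likelihood, and the over-subtraction grid is an assumption.

```python
import numpy as np

def spectral_subtraction(stft, noise_mag, alpha=2.0, beta=0.01):
    """Subtract a scaled noise-magnitude estimate from each frame of a
    complex STFT (freq x frames), floor at beta times the noisy
    magnitude, and reuse the noisy phase."""
    mag = np.abs(stft)
    cleaned = np.maximum(mag - alpha * noise_mag, beta * mag)
    return cleaned * np.exp(1j * np.angle(stft))

def tune_alpha(stft, noise_mag, score_fn, grid=(0.5, 1.0, 2.0, 4.0)):
    """Pick the over-subtraction factor that maximizes the recognizer's
    score -- a stand-in for the likelihood-maximizing criterion, which
    couples SS tuning to the acoustic model instead of to audio quality."""
    return max(grid, key=lambda a: score_fn(
        spectral_subtraction(stft, noise_mag, alpha=a)))
```

    The point of the second function is architectural: the SS front end is no longer tuned in isolation, but by whatever objective the back-end recognizer reports.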

    Speech Detection Using Gammatone Features And One-class Support Vector Machine

    A network gateway is a mechanism that provides protocol translation and/or validation of network traffic using the metadata contained in network packets. For media applications such as Voice-over-IP, the portion of the packets containing speech data cannot be verified and can provide a means of maliciously transporting code or sensitive data undetected. One solution to this problem is Voice Activity Detection (VAD). Many VADs rely on time-domain features and simple thresholds for efficient speech detection; however, these features say little about the signal being passed. More sophisticated methods employ machine learning algorithms, but they train on specific noises intended for a target environment. It must be possible to validate speech under a variety of unknown conditions, and to differentiate between speech and non-speech data embedded within the packets. A real-time speech detection method is proposed that relies only on a clean-speech model for detection. Using Gammatone filter-bank processing, the cepstrum and several frequency-domain features are used to train a one-class support vector machine, which provides a clean-speech model irrespective of environmental noise. A Wiener filter improves operation in harsh noise environments. Greater than 90% detection accuracy is achieved for clean speech, with approximately 70% accuracy at SNRs as low as 5 dB.
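    The core idea of training only on clean speech can be sketched as follows, assuming scikit-learn is available. The simple FFT band energies here are a stand-in for the thesis's Gammatone filter-bank and cepstral features, not its actual feature set.

```python
import numpy as np
from sklearn.svm import OneClassSVM

def band_features(frames):
    """Per-frame log band energies -- a simplified stand-in for the
    Gammatone/cepstral features described in the abstract."""
    spec = np.abs(np.fft.rfft(frames, axis=1))
    bands = np.array_split(spec, 8, axis=1)
    return np.log(np.stack([b.mean(axis=1) for b in bands], axis=1) + 1e-8)

def train_clean_speech_model(clean_frames, nu=0.1):
    """Fit a one-class SVM on clean speech only, so detection needs no
    noise-specific training data: frames outside the learned support
    are flagged as non-speech (predict() returns -1)."""
    feats = band_features(clean_frames)
    return OneClassSVM(nu=nu, kernel="rbf", gamma="scale").fit(feats)
```

    Because the model describes only what clean speech looks like, anything else, whether environmental noise or a non-speech payload smuggled into the media stream, falls outside the learned support and is rejected.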

    Robust speech recognition under noisy environments.

    Lee Siu Wa. Thesis (M.Phil.), Chinese University of Hong Kong, 2004. Includes bibliographical references (leaves 116-121). Abstracts in English and Chinese. Table of contents:
    Chapter 1. Introduction
        1.1 An Overview on Automatic Speech Recognition
        1.2 Thesis Outline
    Chapter 2. Baseline Speech Recognition System
        2.1 Baseline Speech Recognition Framework
        2.2 Acoustic Feature Extraction
            2.2.1 Speech Production and Source-Filter Model
            2.2.2 Review of Feature Representations
            2.2.3 Mel-frequency Cepstral Coefficients
            2.2.4 Energy and Dynamic Features
        2.3 Back-end Decoder
        2.4 English Digit String Corpus (AURORA2)
        2.5 Baseline Recognition Experiment
    Chapter 3. A Simple Recognition Framework with Model Selection
        3.1 Mismatch between Training and Testing Conditions
        3.2 Matched Training and Testing Conditions
            3.2.1 Noise Type-Matching
            3.2.2 SNR-Matching
            3.2.3 Noise Type and SNR-Matching
        3.3 Recognition Framework with Model Selection
    Chapter 4. Noise Spectral Estimation
        4.1 Introduction to Statistical Estimation Methods
            4.1.1 Conventional Estimation Methods
            4.1.2 Histogram Technique
        4.2 Quantile-based Noise Estimation (QBNE)
            4.2.1 Overview of Quantile-based Noise Estimation (QBNE)
            4.2.2 Time-Frequency Quantile-based Noise Estimation (T-F QBNE)
            4.2.3 Mainlobe-Resilient Time-Frequency Quantile-based Noise Estimation (M-R T-F QBNE)
        4.3 Estimation Performance Analysis
        4.4 Recognition Experiment with Model Selection
    Chapter 5. Feature Compensation: Algorithm and Experiment
        5.1 Feature Deviation from Clean Speech
            5.1.1 Deviation in MFCC Features
            5.1.2 Implications for Feature Compensation
        5.2 Overview of Conventional Compensation Methods
        5.3 Feature Compensation by In-phase Feature Induction
            5.3.1 Motivation
            5.3.2 Methodology
        5.4 Compensation Framework for Magnitude Spectrum and Segmental Energy
        5.5 Recognition Experiments
    Chapter 6. Conclusions
        6.1 Summary and Discussions
        6.2 Future Directions
    Bibliography
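    The quantile-based noise estimation (QBNE) idea that this thesis builds on can be sketched in a few lines. This is the plain per-bin form, not the thesis's time-frequency or mainlobe-resilient variants; the median quantile and the synthetic test signal are illustrative assumptions.

```python
import numpy as np

def qbne(power_spec, q=0.5):
    """Quantile-based noise estimation: for each frequency bin, take the
    q-quantile of the power trajectory over time as the noise estimate.
    Because speech is sparse in time and frequency, a low-to-middle
    quantile tracks the noise floor in each bin without needing an
    explicit speech/pause detector. power_spec is (freq x frames)."""
    return np.quantile(power_spec, q, axis=1)
```

    The appeal of the method is that it keeps working when speech is present in most frames, where conventional minimum-statistics or pause-based estimators would be biased.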

    Improving the Speech Intelligibility By Cochlear Implant Users

    In this thesis, we focus on improving the intelligibility of speech for cochlear implant (CI) users. As an auditory prosthetic device, a CI can restore hearing sensation for most patients with profound hearing loss in both ears in a quiet background. However, CI users still have serious problems understanding speech in noisy and reverberant environments. Bandwidth limitation, missing temporal fine structure, and reduced spectral resolution due to the limited number of electrodes further raise the difficulty of hearing in noisy conditions for CI users, regardless of the type of noise. To mitigate these difficulties for CI listeners, we investigate several contributing factors, such as the effect of low harmonics on tone identification in natural and vocoded speech, the contribution of matched envelope dynamic range to binaural benefits, and the contribution of low-frequency harmonics to tone identification in quiet and in six-talker babble. These results reveal several promising methods for improving speech intelligibility for CI patients. In addition, we investigate the benefits of voice conversion for improving speech intelligibility for CI users, motivated by an earlier study showing that familiarity with a talker’s voice can improve understanding of a conversation. Research has shown that when adults are familiar with someone’s voice, they can more accurately – and even more quickly – process and understand what the person is saying. This effect, known as the “familiar talker advantage,” motivated us to examine its benefit for CI patients using voice conversion. We propose a new method based on multi-channel voice conversion to improve the intelligibility of the transformed speech for CI patients.