Using Broad Phonetic Group Experts for Improved Speech Recognition
In phoneme recognition experiments, it was found that approximately 75% of misclassified frames were assigned labels within the same broad phonetic group (BPG). While the phoneme can be described as the smallest distinguishable unit of speech, phonemes within a BPG share very similar characteristics and are easily confused. However, different BPGs, such as vowels and stops, possess very different spectral and temporal characteristics. In order to accommodate the full range of phonemes, acoustic models of speech recognition systems calculate input features from all frequencies over a large temporal context window. A new phoneme classifier is proposed consisting of a modular arrangement of experts, with one expert assigned to each BPG and focused on discriminating between phonemes within that BPG. Due to the different temporal and spectral structure of each BPG, novel feature sets are extracted using mutual information to select a relevant time-frequency (TF) feature set for each expert. To construct a phone recognition system, the output of each expert is combined with a baseline classifier under the guidance of a separate BPG detector. In phoneme recognition experiments using the TIMIT continuous speech corpus, the proposed architecture afforded significant error rate reductions of up to 5% relative.
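The mutual-information feature selection mentioned in this abstract can be illustrated with a minimal sketch. It assumes features have already been quantised to discrete values and ranks them by their mutual information with the phoneme labels. The function names, the discretisation, and the top-k rule are illustrative assumptions, not the authors' procedure.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


def select_features(features, labels, k):
    """Return the names of the k features with the highest MI against labels.

    `features` maps a hypothetical feature name (for example one
    time-frequency cell) to its list of quantised values, one per frame.
    """
    scores = {name: mutual_information(vals, labels)
              for name, vals in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data: f1 determines the label exactly, f2 is independent of it.
labels = ["a", "a", "b", "b"]
features = {"f1": [0, 0, 1, 1], "f2": [0, 1, 0, 1]}
best = select_features(features, labels, 1)
```

In a per-expert setting, one would presumably run such a ranking separately on the frames belonging to each BPG, so that each expert receives its own feature subset.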
Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments
Eliminating the negative effect of non-stationary environmental noise is a
long-standing research topic for automatic speech recognition that still
remains an important challenge. Data-driven supervised approaches, including
ones based on deep neural networks, have recently emerged as potential
alternatives to traditional unsupervised approaches and, with sufficient
training, can alleviate the shortcomings of the unsupervised methods in various
real-life acoustic environments. In this light, we review recently developed,
representative deep learning approaches for tackling non-stationary additive
and convolutional degradation of speech with the aim of providing guidelines
for those involved in the development of environmentally robust speech
recognition systems. We separately discuss single- and multi-channel techniques
developed for the front-end and back-end of speech recognition systems, as well
as joint front-end and back-end training frameworks.
Analysis of very low quality speech for mask-based enhancement
The complexity of the speech enhancement problem has motivated many different solutions. However, most techniques address situations in which the target speech is fully intelligible and the background noise energy is low in comparison with that of the speech. Thus, while current enhancement algorithms can improve the perceived quality, the intelligibility of the speech is not increased significantly and may even be reduced.
Recent research shows that intelligibility of very noisy speech can be improved by the use of a binary mask, in which a binary weight is applied to each time-frequency bin of the input spectrogram. There are several alternative goals for the binary mask estimator, based either on the Signal-to-Noise Ratio (SNR) of each time-frequency bin or on the speech signal characteristics alone. Our approach to the binary mask estimation problem aims to preserve the important speech cues independently of the noise present by identifying time-frequency regions that contain significant speech energy.
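As a minimal sketch of the SNR-based goal described above, the following computes an oracle binary mask from separately known speech and noise power spectrograms and applies it to a noisy magnitude spectrogram. The function names, the nested-list representation, and the 0 dB default threshold are assumptions made for illustration. This is not the speech-cue-based estimator the abstract goes on to propose.

```python
import math


def snr_binary_mask(speech_power, noise_power, threshold_db=0.0):
    """Return a 0/1 mask that is 1 where the local SNR exceeds threshold_db.

    Both inputs are 2-D lists of non-negative power values with the same
    shape. In practice the clean speech and noise are not available
    separately, so this is an oracle target rather than an estimator.
    """
    mask = []
    for s_row, n_row in zip(speech_power, noise_power):
        row = []
        for s, n in zip(s_row, n_row):
            if s <= 0.0:
                row.append(0)          # no speech energy in this bin
            elif n <= 0.0:
                row.append(1)          # speech present, no noise energy
            else:
                snr_db = 10.0 * math.log10(s / n)
                row.append(1 if snr_db > threshold_db else 0)
        mask.append(row)
    return mask


def apply_mask(noisy_magnitude, mask):
    """Multiply each time-frequency bin of the spectrogram by its mask weight."""
    return [[m * w for m, w in zip(m_row, w_row)]
            for m_row, w_row in zip(noisy_magnitude, mask)]


# Toy 2x2 example.
speech = [[4.0, 1.0], [0.0, 9.0]]
noise = [[1.0, 4.0], [1.0, 0.0]]
mask = snr_binary_mask(speech, noise)
enhanced = apply_mask([[2.0, 3.0], [1.0, 3.0]], mask)
```

Raising or lowering `threshold_db` trades off how much noise-dominated energy is retained against how many weak speech bins are discarded.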
The speech power spectrum varies greatly for different types of speech sound. The energy of voiced speech sounds is concentrated in the harmonics of the fundamental frequency while that of unvoiced sounds is, in contrast, distributed across a broad range of frequencies. To identify the presence of speech energy in a noisy speech signal we have therefore developed two detection algorithms. The first is a robust algorithm that identifies voiced speech segments and estimates their fundamental frequency. The second detects the presence of sibilants and estimates their energy distribution. In addition, we have developed a robust algorithm to estimate the active level of the speech. The outputs of these algorithms are combined with other features estimated from the noisy speech to form the input to a classifier which estimates a mask that accurately reflects the time-frequency distribution of speech energy even at low SNR levels. We evaluate a mask-based speech enhancer on a range of speech and noise signals and demonstrate a consistent increase in an objective intelligibility measure with respect to noisy speech.
Mask-based enhancement of very noisy speech
When speech is contaminated by high levels of additive noise, both its perceptual quality and its intelligibility are reduced. Studies show that conventional approaches to speech enhancement are able to improve quality but not intelligibility. However, in recent years, algorithms that estimate a time-frequency mask from noisy speech using a supervised machine learning approach and then apply this mask to the noisy speech have been shown to be capable of improving intelligibility.
The most direct way of measuring intelligibility is to carry out listening tests with human test subjects. However, in situations where listening tests are impractical and where some additional uncertainty in the results is permissible, for example during the development phase of a speech enhancer, intrusive intelligibility metrics can provide an alternative to listening tests. This thesis begins by outlining a new intrusive intelligibility metric, WSTOI, that is a development of the existing STOI metric. WSTOI improves STOI by weighting the intelligibility contributions of different time-frequency regions with an estimate of their intelligibility content. The prediction accuracies of WSTOI and STOI are compared for a range of noises and noise suppression algorithms and it is found that WSTOI outperforms STOI in all tested conditions.
The thesis then investigates the best choice of mask-estimation algorithm, target mask, and method of applying the estimated mask. A new target mask, the HSWOBM, is proposed that optimises a modified version of WSTOI with a higher frequency resolution. The HSWOBM is optimised for a stochastic noise signal to encourage a mask estimator trained on the HSWOBM to generalise better to unseen noise conditions. A high frequency resolution version of WSTOI is optimised as this gives improvements in predicted quality compared with optimising WSTOI. Of the tested approaches to target mask estimation, the best-performing approach uses a feed-forward neural network with a loss function based on WSTOI. The best-performing feature set is based on the gains produced by a classical speech enhancer and an estimate of the local voiced-speech-plus-noise to noise ratio in different time-frequency regions, which is obtained with the aid of a pitch estimator.
When the estimated target mask is applied in the conventional way, by multiplying the speech by the mask in the time-frequency domain, it can result in speech with very poor perceptual quality. The final chapter of this thesis therefore investigates alternative approaches to applying the estimated mask to the noisy speech, in order to improve both intelligibility and quality. An approach is developed that uses the mask to supply prior information about the speech presence probability to a classical speech enhancer that minimises the expected squared error in the log spectral amplitudes. The proposed end-to-end enhancer outperforms existing algorithms in terms of predicted quality and intelligibility for most noise types.