Reconstructing intelligible audio speech from visual speech features
This work describes an investigation into the feasibility of producing intelligible audio speech from only visual speech features. The proposed method aims to estimate a spectral envelope from visual features, which is then combined with an artificial excitation signal and used within a model of speech production to reconstruct an audio signal. Different combinations of audio and visual features are considered, along with both a statistical method of estimation and a deep neural network. The intelligibility of the reconstructed audio speech is measured by human listeners and then compared to the intelligibility of the video signal alone and of the video combined with the reconstructed audio.
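The reconstruction step described above (a spectral envelope combined with an artificial excitation inside a model of speech production) can be illustrated with a minimal source-filter sketch. This is not the authors' implementation: the abstract does not say how the envelope is parameterised, so the all-pole coefficients `a`, the impulse-train excitation, the single 500 Hz resonance and the helper names `impulse_train` and `all_pole_filter` are all assumptions made for illustration.

```python
import numpy as np

def impulse_train(n_samples, f0_hz, fs_hz):
    """Artificial voiced excitation: one unit pulse per pitch period."""
    period = int(round(fs_hz / f0_hz))
    excitation = np.zeros(n_samples)
    excitation[::period] = 1.0
    return excitation

def all_pole_filter(excitation, a):
    """Apply the filter 1 / (1 + a[0] z^-1 + a[1] z^-2 + ...) sample by sample."""
    out = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k in range(1, len(a) + 1):
            if n - k >= 0:
                acc -= a[k - 1] * out[n - k]
        out[n] = acc
    return out

fs_hz = 8000
# Hypothetical envelope: a single stable resonance near 500 Hz (pole radius 0.95).
# In the described system the envelope would instead be estimated from visual features.
radius, centre_hz = 0.95, 500.0
a = np.array([-2.0 * radius * np.cos(2.0 * np.pi * centre_hz / fs_hz), radius ** 2])

excitation = impulse_train(800, f0_hz=100.0, fs_hz=fs_hz)
speech = all_pole_filter(excitation, a)
```

The excitation carries the pitch and the filter carries the spectral shape, so only the filter coefficients need to come from the estimator.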
Estimating acoustic speech features in low signal-to-noise ratios using a statistical framework
Accurate estimation of acoustic speech features from noisy speech and from different speakers is an ongoing problem in speech processing. Many methods have been proposed to estimate acoustic features, but errors increase as signal-to-noise ratios fall. This work proposes a robust statistical framework to estimate an acoustic speech vector (comprising voicing, fundamental frequency and spectral envelope) from an intermediate feature that is extracted from a noisy time-domain speech signal. The initial approach is accurate in clean conditions but deteriorates in noise and with changing speaker. Adaptation methods are then developed to adjust the acoustic models to the noise conditions and speaker. Evaluations are carried out in stationary and nonstationary noises and at SNRs from -5 dB to clean conditions. Comparison with conventional methods of estimating fundamental frequency, voicing and spectral envelope reveals the proposed framework to have the lowest errors in all conditions tested.
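Evaluating at a given SNR requires test signals mixed at that ratio. The sketch below shows one common way to scale a noise signal so that the mixture reaches a target SNR; it assumes the SNR is defined over the whole signal as a ratio of mean powers, which the abstract does not specify. The function names `mix_at_snr` and `measured_snr_db` and the synthetic signals are hypothetical.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power equals the target SNR."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    gain = np.sqrt(target_noise_power / noise_power)
    return speech + gain * noise

def measured_snr_db(speech, noisy):
    """SNR in dB, treating (noisy - speech) as the noise component."""
    residual = noisy - speech
    return 10.0 * np.log10(np.mean(speech ** 2) / np.mean(residual ** 2))

rng = np.random.default_rng(0)
clean = np.sin(2.0 * np.pi * 100.0 * np.arange(8000) / 8000.0)  # stand-in for speech
noise = rng.standard_normal(8000)                               # stand-in for stationary noise
noisy = mix_at_snr(clean, noise, snr_db=-5.0)
achieved = measured_snr_db(clean, noisy)
```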
Model-Based Speech Enhancement
Abstract
A method of speech enhancement is developed that reconstructs clean speech from
a set of acoustic features using a harmonic plus noise model of speech. This is a significant
departure from traditional filtering-based methods of speech enhancement.
A major challenge with this approach is to estimate accurately the acoustic features
(voicing, fundamental frequency, spectral envelope and phase) from noisy speech.
This is achieved using maximum a posteriori (MAP) estimation methods that operate
on the noisy speech. In each case a prior model of the relationship between the
noisy speech features and the estimated acoustic feature is required. These models
are approximated using speaker-independent GMMs of the clean speech features
that are adapted to speaker-dependent models using MAP adaptation and to the noise
conditions using the Unscented Transform.
Objective results are presented to optimise the proposed system and a set of subjective
tests compare the approach with traditional enhancement methods. Three-way
listening tests examining signal quality, background noise intrusiveness and
overall quality show the proposed system to be highly robust to noise, performing
significantly better than conventional methods of enhancement in terms of background
noise intrusiveness. However, the proposed method is shown to reduce signal
quality, with overall quality measured to be roughly equivalent to that of the Wiener
filter.
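The Unscented Transform mentioned in the abstract propagates a Gaussian through a nonlinear function using a small set of sigma points. The sketch below is a basic textbook form, not the thesis implementation. The scaling parameter `kappa`, the example log-domain combination of speech and noise power, and all numerical values are assumptions chosen for illustration.

```python
import numpy as np

def unscented_transform(mean, cov, func, kappa=1.0):
    """Approximate the mean and covariance of func(x) for x ~ N(mean, cov)."""
    n = mean.shape[0]
    # Columns of the scaled Cholesky factor give the sigma-point offsets.
    offsets = np.linalg.cholesky((n + kappa) * cov)
    points = [mean]
    for i in range(n):
        points.append(mean + offsets[:, i])
        points.append(mean - offsets[:, i])
    weights = np.full(2 * n + 1, 1.0 / (2.0 * (n + kappa)))
    weights[0] = kappa / (n + kappa)
    mapped = np.array([func(p) for p in points])
    out_mean = weights @ mapped
    centred = mapped - out_mean
    out_cov = (weights[:, None] * centred).T @ centred
    return out_mean, out_cov

# Assumed example: clean-speech and noise log-power in one frequency channel,
# combined additively in the linear power domain.
def log_add(v):
    return np.array([np.log(np.exp(v[0]) + np.exp(v[1]))])

prior_mean = np.array([2.0, 0.5])   # hypothetical [speech, noise] log-power means
prior_cov = np.diag([1.0, 0.25])    # hypothetical variances
noisy_mean, noisy_cov = unscented_transform(prior_mean, prior_cov, log_add)
```

For a linear function the transform reproduces the exact mean and covariance, which gives a convenient sanity check before applying it to a nonlinear mapping.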