Identify, locate and separate: Audio-visual object extraction in large video collections using weak supervision
We tackle the problem of audiovisual scene analysis for weakly-labeled data.
To this end, we build upon our previous audiovisual representation learning
framework to perform object classification in noisy acoustic environments and
integrate audio source enhancement capability. This is made possible by a novel
use of non-negative matrix factorization for the audio modality. Our approach
is founded on the multiple instance learning paradigm. Its effectiveness is
established through experiments over a challenging dataset of music instrument
performance videos. We also show encouraging visual object localization
results.
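As a rough illustration of the non-negative matrix factorization mentioned in this abstract, the sketch below factorizes a toy magnitude "spectrogram" V into non-negative spectral templates W and activations H, using the standard Lee-Seung multiplicative updates for the squared Euclidean error. This is generic textbook NMF, not the paper's method: the matrix values, the rank, and every function name are assumptions made for the example.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - W H."""
    approx = matmul(w, h)
    return sum((x - y) ** 2 for vr, ar in zip(v, approx) for x, y in zip(vr, ar))


def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Factorize non-negative V (freq x time) as W (freq x rank) times H (rank x time)."""
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n_rows)]
    h = [[rng.random() + eps for _ in range(n_cols)] for _ in range(rank)]
    for _ in range(n_iter):
        # Activation update: H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_cols)]
             for i in range(rank)]
        # Template update: W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n_rows)]
    return w, h


# Hypothetical 4-bin x 5-frame magnitude spectrogram built from two "sources":
# bins 0-1 belong to one source, bins 2-3 to the other.
toy_spectrogram = [
    [1.0, 0.0, 2.0, 1.0, 0.0],
    [2.0, 0.0, 4.0, 2.0, 0.0],
    [0.0, 3.0, 0.0, 3.0, 3.0],
    [0.0, 1.0, 0.0, 1.0, 1.0],
]

templates, activations = nmf(toy_spectrogram, rank=2)
print("reconstruction error:", frobenius_error(toy_spectrogram, templates, activations))
```

Because the updates only multiply by non-negative ratios, W and H stay non-negative throughout, which is what lets each column of W be read as a spectral template and each row of H as its activation over time.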
Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments
Eliminating the negative effect of non-stationary environmental noise is a
long-standing research topic for automatic speech recognition that still
remains an important challenge. Data-driven supervised approaches, including
ones based on deep neural networks, have recently emerged as potential
alternatives to traditional unsupervised approaches and with sufficient
training, can alleviate the shortcomings of the unsupervised methods in various
real-life acoustic environments. In this light, we review recently developed,
representative deep learning approaches for tackling non-stationary additive
and convolutional degradation of speech with the aim of providing guidelines
for those involved in the development of environmentally robust speech
recognition systems. We separately discuss single- and multi-channel techniques
developed for the front-end and back-end of speech recognition systems, as well
as joint front-end and back-end training frameworks.
Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e. audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.
Comment: 15 pages, 2 pdf figures
Robust Estimation of Non-Stationary Noise Power Spectrum for Speech Enhancement
We propose a novel method for noise power spectrum estimation in speech enhancement. This method, called extended-DATE (E-DATE), extends the d-dimensional amplitude trimmed estimator (DATE), originally introduced for additive white Gaussian noise power spectrum estimation, to the more challenging scenario of non-stationary noise. The key idea is that, in each frequency bin and within a sufficiently short time period, the instantaneous noise power spectrum can be considered approximately constant and estimated as the variance of a complex Gaussian noise process possibly observed in the presence of the signal of interest. The proposed method relies on the fact that the Short-Time Fourier Transform (STFT) of noisy speech signals is sparse, in the sense that transformed speech signals can be represented by a relatively small number of coefficients with large amplitudes in the time-frequency domain. The E-DATE estimator is robust in that it does not require prior information about the signal probability distribution except for the weak-sparseness property. In comparison to other state-of-the-art methods, the E-DATE is found to require the smallest number of parameters (only two). The performance of the proposed estimator has been evaluated in combination with noise reduction and compared to alternative methods. This evaluation involves objective as well as pseudo-subjective criteria.
End-to-End Multi-Task Denoising for joint SDR and PESQ Optimization
Supervised learning based on deep neural networks has recently achieved
substantial improvements in speech enhancement. Denoising networks learn a
mapping from noisy speech either directly to clean speech or to a spectrum mask,
which is the ratio between the clean and noisy spectra. In either case, the network is optimized
by minimizing mean square error (MSE) between ground-truth labels and
time-domain or spectrum output. However, existing schemes suffer from one of two
critical issues: spectrum mismatch or metric mismatch. Spectrum mismatch is the
well-known issue that a spectrum modified after the short-time Fourier
transform (STFT) cannot, in general, be fully recovered after the inverse
short-time Fourier transform (ISTFT). Metric mismatch means that the
conventional MSE metric is suboptimal for maximizing the target metrics,
signal-to-distortion ratio (SDR) and perceptual evaluation of speech quality
(PESQ). This paper presents a new end-to-end denoising framework with the goal
of joint SDR and PESQ optimization. First, the network optimization is
performed on the time-domain signals after ISTFT to avoid spectrum mismatch.
Second, two loss functions which have improved correlations with SDR and PESQ
metrics are proposed to minimize metric mismatch. The experimental results
show that the proposed denoising scheme significantly improves both SDR and
PESQ performance over existing methods.
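The metric mismatch described in this abstract can be made concrete with a small sketch. It assumes the plain energy-ratio definition of SDR, 10 log10(||s||^2 / ||s - s_hat||^2), which may differ from the exact SDR variant used in the paper; the signals, sizes, and function names are hypothetical. Two enhanced signals with the same MSE end up 20 dB apart in SDR, because SDR is normalized by the energy of the clean signal and MSE is not.

```python
import math


def mse(reference, estimate):
    """Mean square error between a clean reference and an estimate."""
    return sum((s - e) ** 2 for s, e in zip(reference, estimate)) / len(reference)


def sdr_db(reference, estimate, eps=1e-12):
    """Energy-ratio SDR in dB (an assumed, simplified definition)."""
    signal_power = sum(s * s for s in reference)
    error_power = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    return 10.0 * math.log10((signal_power + eps) / (error_power + eps))


# Hypothetical clean signals: the same sinusoid at two loudness levels.
n = 160
loud = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
quiet = [0.1 * s for s in loud]

# The same residual distortion is left on both "enhanced" outputs.
residual = [0.05 * math.cos(2 * math.pi * 17 * t / n) for t in range(n)]
est_loud = [s + d for s, d in zip(loud, residual)]
est_quiet = [s + d for s, d in zip(quiet, residual)]

print("MSE loud / quiet:", mse(loud, est_loud), mse(quiet, est_quiet))
print("SDR loud / quiet (dB):", sdr_db(loud, est_loud), sdr_db(quiet, est_quiet))
```

An MSE loss treats both outputs as equally good, while SDR penalizes the quiet utterance far more heavily, which is one way to see why a loss that correlates better with the target metric could help.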