Deep speech inpainting of time-frequency masks

Author: Beckmann Pierre
Cernak Milos
Kegler Mikolaj
Publication venue: 'International Speech Communication Association'
Publication date: 29/08/2020

Transient loud intrusions, often occurring in noisy environments, can completely overpower the speech signal and lead to an inevitable loss of information. While existing algorithms for noise suppression can yield impressive results, their efficacy remains limited for very low signal-to-noise ratios or when parts of the signal are missing. To address these limitations, here we propose an end-to-end framework for speech inpainting: the context-based retrieval of missing or severely distorted parts of the time-frequency representation of speech. The framework is based on a convolutional U-Net trained via deep feature losses, obtained using speechVGG, a deep speech feature extractor pre-trained on an auxiliary word classification task. Our evaluation results demonstrate that the proposed framework can recover large portions of a missing or distorted time-frequency representation of speech, up to 400 ms in duration and 3.2 kHz in bandwidth. In particular, our approach substantially increased the STOI and PESQ objective metrics of the initially corrupted speech samples. Notably, training the framework with deep feature losses led to the best results compared with conventional approaches. Comment: Accepted to InterSpeech202
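To make the "400 ms by 3.2 kHz" gap concrete, the sketch below expresses it as a binary time-frequency mask over an STFT magnitude grid. The STFT settings (16 kHz sampling, 512-point FFT, 128-sample hop) and all function names are assumptions for illustration, not taken from the abstract; the U-Net and the speechVGG feature loss are not shown.

```python
def hz_to_bins(bandwidth_hz, sample_rate, n_fft):
    # Each STFT frequency bin spans sample_rate / n_fft Hz.
    return int(round(bandwidth_hz * n_fft / sample_rate))

def ms_to_frames(duration_ms, sample_rate, hop):
    # Each STFT frame advances by `hop` samples.
    return int(round(duration_ms * sample_rate / (1000.0 * hop)))

def make_gap_mask(n_freq, n_frames, f_start, f_len, t_start, t_len):
    """Binary mask[f][t]: 1 = observed bin, 0 = missing bin to be inpainted."""
    mask = [[1] * n_frames for _ in range(n_freq)]
    for f in range(f_start, min(f_start + f_len, n_freq)):
        for t in range(t_start, min(t_start + t_len, n_frames)):
            mask[f][t] = 0
    return mask

def apply_mask(spec, mask):
    # Zero out the masked region of a magnitude spectrogram with the same shape.
    return [[s * m for s, m in zip(srow, mrow)] for srow, mrow in zip(spec, mask)]
```

With these assumed settings, 3.2 kHz maps to 102 frequency bins (31.25 Hz each) and 400 ms to 50 frames (8 ms hop), so the gap removes 5,100 of the spectrogram's bins; the masked spectrogram is what an inpainting network would receive as input.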

arXiv.org e-Print Archive

Crossref

Perceptually Motivated Wavelet Packet Transform for Bioacoustic Signal Enhancement

Author: Cohen I.
Deller J. R.
Fu Q.
Jidong Tao
Michael T. Johnson
Osiejuk T. S.
Seyfarth R. M.
Shao Y.
Yao Ren
Publication venue: e-Publications@Marquette
Publication date: 01/07/2008

A significant and often unavoidable problem in bioacoustic signal processing is the presence of background noise due to an adverse recording environment. This paper proposes a new bioacoustic signal enhancement technique that can be used on a wide range of species. The technique is based on a perceptually scaled wavelet packet decomposition using a species-specific Greenwood scale function. Spectral estimation techniques, similar to those used for human speech enhancement, are used to estimate the clean-signal wavelet coefficients under an additive noise model. The new approach is compared with several other techniques, including basic bandpass filtering as well as classical speech enhancement methods such as spectral subtraction, Wiener filtering, and Ephraim–Malah filtering. Vocalizations recorded from several species are used for evaluation, including the ortolan bunting (Emberiza hortulana), rhesus monkey (Macaca mulatta), and humpback whale (Megaptera novaeangliae), with both additive white Gaussian noise and environmental recording noise added across a range of signal-to-noise ratios (SNRs). Results, measured by both SNR and segmental SNR of the enhanced waveforms, indicate that the proposed method outperforms the other approaches for a wide range of noise conditions.
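The Greenwood function has the form f(x) = A(10^(a x) - k), mapping a normalized position x in [0, 1] to frequency. The sketch below fits A and a to a species' frequency range and returns band edges spaced equally on that warped scale, which is the kind of frequency layout a perceptually scaled decomposition could follow. The constant k = 0.88 (the value commonly quoted for humans) and the example range are assumptions; the wavelet packet tree itself is not shown.

```python
import math

def greenwood_params(f_min, f_max, k=0.88):
    # Fit f(x) = A * (10**(a*x) - k) so that f(0) = f_min and f(1) = f_max.
    A = f_min / (1.0 - k)
    a = math.log10(f_max / A + k)
    return A, a

def greenwood_band_edges(f_min, f_max, n_bands, k=0.88):
    """Band edges (Hz) equally spaced on the normalized Greenwood scale."""
    A, a = greenwood_params(f_min, f_max, k)
    return [A * (10 ** (a * i / n_bands) - k) for i in range(n_bands + 1)]
```

Because the mapping is exponential in x, the resulting bands are narrow at low frequencies and widen toward f_max; changing f_min and f_max per species is what makes the scale species-specific.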

epublications@Marquette

Crossref

Kalman tracking of linear predictor and harmonic noise models for noisy speech enhancement

Author: Ben Milner
Boll
Chen
Deller
Ephraim
Ephraim
Ephraim
Ephraim
Esfandiar Zavarehei
Friedman
Griffin
Hansen
Ioannis Andrianakis
Jonathan Darch
Kalman
Lim
Lim
Paul White
Qin Yan
Rentzos
Saeed Vaseghi
Sameti
Secrest
Seltzer
Stylianou
Stylianou
Tucker
Turunen
Vaseghi
Weber
Yan
Publication venue: 'Elsevier BV'
Publication date: 01/01/2008

This paper presents a speech enhancement method based on the tracking and denoising of the formants of a linear prediction (LP) model of the spectral envelope of speech and of the parameters of a harmonic noise model (HNM) of its excitation. The main advantages of tracking and denoising the prominent energy contours of speech are the efficient use of the spectral and temporal structures of successive speech frames and the mitigation of the processing artefacts known as ‘musical noise’ or ‘musical tones’.

The formant-tracking linear prediction (FTLP) model estimation consists of three stages: (a) speech pre-cleaning based on spectral amplitude estimation, (b) formant tracking across successive speech frames using the Viterbi method, and (c) Kalman filtering of the formant trajectories across successive speech frames.

The HNM parameters of the excitation signal comprise the voiced/unvoiced decision, the fundamental frequency, the harmonics’ amplitudes, and the variance of the noise component of the excitation. A frequency-domain pitch extraction method is proposed that searches for the peak signal-to-noise ratios (SNRs) at the harmonics. Several pitch candidates are calculated for each speech frame, and an estimate of the pitch trajectory across successive frames is obtained using a Viterbi decoder. The trajectories of the noisy excitation harmonics across successive speech frames are modeled and denoised using Kalman filters.

The proposed method deconstructs noisy speech, denoises its model parameters, and then reconstitutes speech from its cleaned parts. Experimental evaluations show the performance gains of the formant tracking, pitch extraction, and noise reduction stages.
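As a rough illustration of stage (c), the sketch below smooths one noisy trajectory (for example, a formant frequency per frame) with a scalar Kalman filter. The random-walk state model, the variances, and the function name are assumptions chosen for clarity; the paper's actual state-space models for formants and harmonics are not specified in the abstract.

```python
def kalman_track(observations, process_var, meas_var):
    """Scalar Kalman filter with a random-walk state model (illustrative only)."""
    x = observations[0]   # initial state: the first measurement
    p = meas_var          # initial uncertainty: one measurement's worth
    estimates = []
    for z in observations:
        # Predict: the trajectory is assumed to drift as a random walk,
        # so the state carries over and its uncertainty grows.
        p += process_var
        # Update: blend prediction and measurement by the Kalman gain.
        gain = p / (p + meas_var)
        x += gain * (z - x)
        p *= (1.0 - gain)
        estimates.append(x)
    return estimates
```

The ratio process_var / meas_var sets the trade-off: a small ratio gives smooth but sluggish trajectories, a large one follows the measurements closely but passes more noise.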

Crossref

Southampton (e-Prints Soton)

University of East Anglia digital repository

Deep Denoising for Hearing Aid Applications

Author: Aubreville Marc
Ehrensperger Kai
Graf Benjamin
Maier Andreas
Puder Henning
Rosenkranz Tobias
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 03/05/2018

Reduction of unwanted environmental noise is an important feature of today's hearing aids (HA), which is why noise reduction is included in almost every commercially available device. The majority of these algorithms, however, are restricted to the reduction of stationary noise. In this work, we propose a denoising approach based on a fully connected deep neural network with three hidden layers that predicts a Wiener filtering gain from an asymmetric input context, enabling real-time applications with strict constraints on signal delay. The approach employs a hearing-instrument-grade filter bank and complies with typical hearing aid demands, such as low latency and online processing. It can also be integrated well with other algorithms in an existing HA signal processing chain. On a database of real-world noise signals, we show that our algorithm outperforms a state-of-the-art baseline approach in both objective metrics and subjective tests. Comment: submitted to IWAENC 201
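The two ingredients named above can be sketched independently of the network: a per-band Wiener gain G = S / (S + N), which is a common training target when clean speech and noise powers are known, and an asymmetric context window that stacks many past frames but only a few future frames to keep latency low. The gain floor, the edge clamping, and the context sizes are assumptions; the paper's exact target definition and context lengths are not given in the abstract.

```python
def wiener_gain(speech_power, noise_power, floor=0.0):
    """Per-band Wiener gain G = S / (S + N), optionally limited from below."""
    gains = []
    for s, n in zip(speech_power, noise_power):
        g = s / (s + n) if (s + n) > 0 else 1.0  # no power: nothing to suppress
        gains.append(max(g, floor))
    return gains

def apply_gain(noisy_bands, gains):
    # Scale each filter-bank band of the noisy signal by its gain.
    return [x * g for x, g in zip(noisy_bands, gains)]

def stack_context(frames, t, past, future):
    """Concatenate frames t-past .. t+future (clamped at the edges) as network input.

    Choosing future << past gives an asymmetric context: only `future` frames
    of look-ahead are needed, which bounds the added algorithmic delay.
    """
    out = []
    for i in range(t - past, t + future + 1):
        j = min(max(i, 0), len(frames) - 1)
        out.extend(frames[j])
    return out
```

A network trained on stacked features would output one gain per band, which is then applied to the current frame with apply_gain.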

arXiv.org e-Print Archive

Crossref

Acoustic echo and noise canceller for personal hands-free video IP phone

Author: Akira Nakagawa
Masahiro Fukui
Suehiro Shimauchi
Yoichi Haneda
Yusuke Hioka
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/11/2016

This paper presents the implementation and evaluation of a proposed acoustic echo and noise canceller (AENC) for videotelephony-enabled personal hands-free Internet protocol (IP) phones. The canceller has the following features: noise-robust performance, low processing delay, and low computational complexity. The AENC employs adaptive digital filter (ADF) and noise reduction (NR) methods that can effectively eliminate undesired acoustic echo and background noise contained in the microphone signal, even in a noisy environment. The ADF method controls its step size according to the level of disturbance, such as background noise, which minimizes the effect of the disturbance in a noisy environment. The NR method estimates the noise level under the assumption that the noise amplitude spectrum is constant over a short period, an assumption that does not hold for the amplitude spectrum of speech. In addition, this paper presents a method for decreasing the computational complexity of the ADF process without increasing the processing delay, making the processing suitable for real-time implementation. The experimental results demonstrate that the proposed AENC sufficiently suppresses echo and noise in a noisy environment, resulting in natural-sounding speech.
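A normalized least-mean-squares (NLMS) filter is a standard choice of adaptive filter for echo cancellation, and the sketch below pairs it with a step-size control hook. The NLMS choice, the step-size rule in controlled_step (shrinking the step as noise power grows relative to the far-end power), and the simulated echo path are all hypothetical; the abstract does not state which adaptation algorithm or control law the AENC uses.

```python
import random

def controlled_step(mu_max, noise_power, ref_power):
    # Hypothetical step-size rule: shrink the step as the background-noise
    # power grows relative to the far-end (reference) power.
    return mu_max * ref_power / (ref_power + noise_power + 1e-12)

def nlms_echo_canceller(far_end, mic, taps, mu_max=0.5, noise_power=0.0, eps=1e-8):
    """NLMS adaptive filter; returns (final weights, residual signal)."""
    w = [0.0] * taps
    buf = [0.0] * taps  # buf[k] holds far_end[n - k]
    residual = []
    for x, d in zip(far_end, mic):
        buf = [x] + buf[:-1]
        echo_est = sum(wk * bk for wk, bk in zip(w, buf))
        e = d - echo_est  # echo-cancelled microphone sample
        energy = sum(bk * bk for bk in buf)
        mu = controlled_step(mu_max, noise_power, energy / taps)
        scale = mu * e / (energy + eps)
        w = [wk + scale * bk for wk, bk in zip(w, buf)]
        residual.append(e)
    return w, residual

def simulate_echo(far_end, echo_path):
    # Microphone signal = far-end signal convolved with a (hypothetical) echo path.
    return [sum(h * far_end[n - k] for k, h in enumerate(echo_path) if n - k >= 0)
            for n in range(len(far_end))]

if __name__ == "__main__":
    rng = random.Random(0)
    far = [rng.gauss(0.0, 1.0) for _ in range(3000)]
    mic = simulate_echo(far, [0.5, -0.3, 0.1, 0.05])
    w, res = nlms_echo_canceller(far, mic, taps=4)
    print([round(v, 3) for v in w])
```

In this noise-free demo the weights converge toward the simulated echo path; with noise_power > 0 the smaller step slows adaptation but makes it less sensitive to the disturbance.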

Creative Repository of Electro-Communications

DNN-Based Multi-Frame MVDR Filtering for Single-Microphone Speech Enhancement

Author: Doclo Simon
Fischer Dörte
Tammen Marvin
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 21/05/2019

Multi-frame approaches for single-microphone speech enhancement, e.g., the multi-frame minimum-variance-distortionless-response (MVDR) filter, are able to exploit speech correlations across neighboring time frames. In contrast to single-frame approaches such as the Wiener gain, multi-frame approaches have been shown to achieve substantial noise reduction with hardly any speech distortion, provided that accurate estimates of the correlation matrices, and especially of the speech interframe correlation (IFC) vector, are available. Typical estimation procedures for the correlation matrices and the speech IFC vector require an estimate of the speech presence probability (SPP) in each time-frequency bin. In this paper, we propose to use a bi-directional long short-term memory deep neural network (DNN) to estimate a speech mask and a noise mask for each time-frequency bin, from which two different SPP estimates are derived. To achieve robust performance, the DNN is trained for various noise types and signal-to-noise ratios. Experimental results show that the multi-frame MVDR filter in combination with the proposed data-driven SPP estimator yields increased speech quality compared to a state-of-the-art model-based estimator.
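For reference, the multi-frame MVDR filter is usually written as w = Φ_in⁻¹ γ / (γᴴ Φ_in⁻¹ γ), where Φ_in is the correlation matrix of the undesired signal across the stacked frames and γ is the speech IFC vector. The sketch below computes w for small matrices given directly as inputs and checks the distortionless property wᴴγ = 1; the estimation of Φ_in and γ (the part the DNN-based SPP feeds into) is not shown, and the helper names are hypothetical.

```python
def solve(matrix, vec):
    """Solve matrix @ z = vec by Gaussian elimination with partial pivoting."""
    n = len(vec)
    a = [list(row) + [vec[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    z = [0] * n
    for r in range(n - 1, -1, -1):
        acc = a[r][n] - sum(a[r][c] * z[c] for c in range(r + 1, n))
        z[r] = acc / a[r][r]
    return z

def mfmvdr_weights(phi_in, gamma):
    """w = Phi_in^{-1} gamma / (gamma^H Phi_in^{-1} gamma); works for real or complex."""
    z = solve(phi_in, gamma)
    denom = sum(g.conjugate() * zi for g, zi in zip(gamma, z))
    return [zi / denom for zi in z]

def filter_output(w, frames):
    # Enhanced estimate y = w^H x for one stacked vector x of noisy STFT coefficients.
    return sum(wi.conjugate() * xi for wi, xi in zip(w, frames))
```

Applying filter_output to γ itself returns 1, which is the distortionless constraint: a speech component that matches the IFC vector passes through unchanged while the undesired-signal power is minimized.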

arXiv.org e-Print Archive

Turbo-detected unequal protection audio and speech transceivers using serially concatenated convolutional codes, trellis coded modulation and space-time trellis coding

Author: Hanzo L.
Ng S.X.
Othman N.S.
Publication venue
Publication date: 01/01/2005

The MPEG-4 TwinVQ audio codec and the AMR-WB speech codec are investigated in the context of a jointly optimised turbo transceiver capable of providing unequal error protection. The proposed transceiver consists of serially concatenated Space-Time Trellis Coding (STTC), Trellis Coded Modulation (TCM), and two different-rate Non-Systematic Convolutional codes (NSCs) used for unequal error protection. A benchmark scheme combining STTC and a single-class protection NSC is used for comparison with the proposed scheme. The audio and speech performance of both schemes is evaluated when communicating over uncorrelated Rayleigh fading channels. An $E_b/N_0$ value of about 2.5 (3.5) dB is required for near-unimpaired audio (speech) transmission, which is about 3.07 (4.2) dB from the capacity of the system.
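The "distance from capacity" figures quoted for this transceiver are differences in dB, so the capacity-limited E_b/N_0 implied for each case is simply the required value minus the gap: about 2.5 - 3.07 = -0.57 dB for audio and 3.5 - 4.2 = -0.7 dB for speech. The small helpers below (hypothetical names) spell out that arithmetic and the dB-to-linear conversion; they restate the quoted numbers and add no new results.

```python
def implied_limit_db(required_db, gap_db):
    # A scheme needing `required_db` that sits `gap_db` above capacity implies
    # a capacity-limited Eb/N0 of required_db - gap_db (all values in dB).
    return required_db - gap_db

def db_to_linear(db):
    # Convert a power ratio from decibels to a linear ratio.
    return 10 ** (db / 10.0)
```

For example, 2.5 dB corresponds to a linear E_b/N_0 of roughly 1.78, that is, about 78% more energy per bit than the noise spectral density.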

Southampton (e-Prints Soton)

A Generative Product-of-Filters Model of Audio

Author: Hoffman Matthew D.
Liang Dawen
Mysore Gautham J.
Publication venue
Publication date: 25/11/2014

We propose the product-of-filters (PoF) model, a generative model that decomposes audio spectra as sparse linear combinations of "filters" in the log-spectral domain. PoF makes assumptions similar to those of the classic homomorphic filtering approach to signal processing, but replaces hand-designed decompositions built from basic signal processing operations with a learned decomposition based on statistical inference. This paper formulates the PoF model and derives a mean-field method for posterior inference and a variational EM algorithm to estimate the model's free parameters. We demonstrate PoF's potential for audio processing on a bandwidth expansion task, and show that PoF can serve as an effective unsupervised feature extractor for a speaker identification task. Comment: ICLR 2014 conference-track submission. Added link to the source code
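The name "product of filters" follows from the log-domain formulation: if a log spectrum is a weighted sum of log-filters, exponentiating it gives a product of filters, one factor per active filter. The sketch below shows only that forward composition with made-up filter values and activations; the learned filters, the sparsity prior, and the mean-field/variational EM inference described in the abstract are not shown.

```python
import math

def pof_log_spectrum(filters, activations):
    """log|X(f)| modeled as sum_l a_l * U_l(f): a linear combination of log-filters."""
    n_freq = len(filters[0])
    return [sum(a * u[f] for a, u in zip(activations, filters)) for f in range(n_freq)]

def pof_spectrum(filters, activations):
    # Exponentiating the sum of log-filters yields a product of filters:
    # exp(sum_l a_l U_l(f)) = prod_l exp(U_l(f)) ** a_l.
    return [math.exp(v) for v in pof_log_spectrum(filters, activations)]
```

A zero activation removes a filter from the product entirely (its factor becomes 1), which is why sparse activations correspond to explaining each spectrum with only a few filters.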

arXiv.org e-Print Archive

CiteSeerX