
    End-to-end Source Separation with Adaptive Front-Ends

    Source separation and other audio applications have traditionally relied on the use of short-time Fourier transforms as a front-end frequency domain representation step. The unavailability of a neural network equivalent to forward and inverse transforms hinders the implementation of end-to-end learning systems for these applications. We present an auto-encoder neural network that can act as an equivalent to short-time front-end transforms. We demonstrate the ability of the network to learn optimal, real-valued basis functions directly from the raw waveform of a signal and further show how it can be used as an adaptive front-end for supervised source separation. In terms of separation performance, these transforms significantly outperform their Fourier counterparts. Finally, we also propose a novel source-to-distortion ratio based cost function for end-to-end source separation. Comment: 4 figures, 4 pages.
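    As a rough illustration of the adaptive front-end idea described above (not the authors' code), the following PyTorch sketch uses a strided 1-D convolution as the learned forward transform and a transposed convolution as its inverse; the number of basis functions, window length, and hop size are illustrative assumptions.

        import torch
        import torch.nn as nn

        class AdaptiveFrontEnd(nn.Module):
            """Learned analysis/synthesis transform operating on raw waveforms.

            The convolution kernels play the role of real-valued basis functions
            (a stand-in for the forward short-time transform) and the transposed
            convolution plays the role of the inverse transform.
            """
            def __init__(self, n_bases=1024, win=1024, hop=16):  # assumed sizes
                super().__init__()
                self.encoder = nn.Conv1d(1, n_bases, kernel_size=win, stride=hop, bias=False)
                self.decoder = nn.ConvTranspose1d(n_bases, 1, kernel_size=win, stride=hop, bias=False)
                self.act = nn.ReLU()  # keeps the latent representation non-negative

            def forward(self, wav):                   # wav: (batch, 1, samples)
                latent = self.act(self.encoder(wav))  # adaptive front-end features
                return self.decoder(latent), latent   # reconstruction + features

        # Usage: auto-encode one second of audio at 16 kHz.
        x = torch.randn(4, 1, 16000)
        x_hat, feats = AdaptiveFrontEnd()(x)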

    Visual Speech Enhancement

    When video is shot in a noisy environment, the voice of a speaker seen in the video can be enhanced using the visible mouth movements, reducing background noise. While most existing methods use audio-only inputs, improved performance is obtained with our visual speech enhancement, based on an audio-visual neural network. We include in the training data videos to which we added the voice of the target speaker as background noise. Since the audio input is not sufficient to separate the voice of a speaker from his own voice, the trained model better exploits the visual input and generalizes well to different noise types. The proposed model outperforms prior audio-visual methods on two public lipreading datasets. It is also the first to be demonstrated on a dataset not designed for lipreading, such as the weekly addresses of Barack Obama. Comment: Accepted to Interspeech 2018. Supplementary video: https://www.youtube.com/watch?v=nyYarDGpcY
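    A minimal sketch of the training-data construction hinted at above, assuming raw NumPy waveforms; the SNR parameter and function name are illustrative, not taken from the paper.

        import numpy as np

        def make_training_mixture(clean, distractor, snr_db=0.0):
            """Mix a clean target utterance with a distractor signal at a given SNR.

            Following the abstract, the distractor can be another utterance by the
            *same* speaker, so the audio alone cannot identify the target and the
            network is pushed to exploit the visual stream.
            """
            n = min(len(clean), len(distractor))
            clean, distractor = clean[:n], distractor[:n]
            # Scale the distractor to reach the requested signal-to-noise ratio.
            gain = np.sqrt((clean ** 2).mean() / ((distractor ** 2).mean() + 1e-8))
            gain /= 10 ** (snr_db / 20)
            return clean + gain * distractor, clean  # network input, training target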

    Deep Learning for Audio Signal Processing

    Given the recent surge in developments of deep learning, this article provides a review of the state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side by side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross-fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, as well as more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, i.e., audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified. Comment: 15 pages, 2 PDF figures.
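    Since log-mel spectra are named above as a dominant feature representation, here is a short sketch of how such features are commonly computed with librosa; the sample rate, FFT size, hop length, and number of mel bands are typical values, not prescribed by the article.

        import numpy as np
        import librosa

        def log_mel(wav, sr=16000, n_fft=1024, hop=256, n_mels=64):
            """Compute a log-mel spectrogram from a mono waveform."""
            mel = librosa.feature.melspectrogram(
                y=wav, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=n_mels)
            return np.log(mel + 1e-6)  # log compression stabilises the dynamic range

        # Usage with one second of random noise as a stand-in signal.
        features = log_mel(np.random.randn(16000).astype(np.float32))
        print(features.shape)  # (n_mels, frames)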

    Signal reconstruction by means of Embedding, Clustering and AutoEncoder Ensembles

    We study the denoising and reconstruction of corrupted signals by means of AutoEncoder ensembles. In order to guarantee the experts' diversity in the ensemble, we apply, prior to learning, a dimensionality reduction pass (to map the examples into a suitable Euclidean space) and a partitional clustering pass: each cluster is then used to train a distinct AutoEncoder. We study the approach on an audio-file benchmark: the original signals are artificially corrupted by Doppler effect and reverb. The results support the effectiveness of the approach compared with one based on a single AutoEncoder. The processing pipeline using Local Linear Embedding, k-means, and then k Convolutional Denoising AutoEncoders reduces the reconstruction error by 35% w.r.t. the baseline approach.
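    The pipeline above (embedding, clustering, per-cluster denoisers) can be sketched with scikit-learn as follows; frame-based inputs, the number of clusters, and the use of MLPRegressor in place of Convolutional Denoising AutoEncoders are simplifying assumptions.

        import numpy as np
        from sklearn.manifold import LocallyLinearEmbedding
        from sklearn.cluster import KMeans
        from sklearn.neural_network import MLPRegressor

        def train_ensemble(noisy, clean, k=4, n_components=8):
            """noisy, clean: (n_examples, frame_len) arrays of corrupted/original frames."""
            # 1) Map examples into a low-dimensional Euclidean space.
            embedded = LocallyLinearEmbedding(n_components=n_components).fit_transform(noisy)
            # 2) Partition the embedded examples to encourage experts' diversity.
            km = KMeans(n_clusters=k, n_init=10).fit(embedded)
            # 3) Train one denoising regressor per cluster (the "experts").
            experts = []
            for c in range(k):
                idx = km.labels_ == c
                expert = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500)
                expert.fit(noisy[idx], clean[idx])  # learn noisy -> clean mapping
                experts.append(expert)
            return km, experts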

    Raw Multi-Channel Audio Source Separation using Multi-Resolution Convolutional Auto-Encoders

    Supervised multi-channel audio source separation requires extracting useful spectral, temporal, and spatial features from the mixed signals. The success of many existing systems is therefore largely dependent on the choice of features used for training. In this work, we introduce a novel multi-channel, multi-resolution convolutional auto-encoder neural network that works on raw time-domain signals to determine appropriate multi-resolution features for separating the singing voice from stereo music. Our experimental results show that the proposed method can achieve multi-channel audio source separation without the need for hand-crafted features or any pre- or post-processing.
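    A minimal sketch of the encoder half of such a multi-resolution architecture, assuming PyTorch; the kernel sizes, hop, and filter counts are illustrative, and the decoder (transposed convolutions back to per-source stereo waveforms) is omitted for brevity.

        import torch
        import torch.nn as nn

        class MultiResEncoder(nn.Module):
            """Parallel 1-D convolutions with different kernel sizes extract
            features at several time resolutions from raw stereo audio."""
            def __init__(self, channels=2, n_filters=64, kernels=(256, 1024, 4096), hop=256):
                super().__init__()
                self.branches = nn.ModuleList([
                    nn.Sequential(
                        nn.Conv1d(channels, n_filters, k, stride=hop, padding=k // 2),
                        nn.ReLU())
                    for k in kernels])

            def forward(self, wav):                      # wav: (batch, 2, samples)
                outs = [b(wav) for b in self.branches]
                frames = min(o.shape[-1] for o in outs)  # align branch lengths
                return torch.cat([o[..., :frames] for o in outs], dim=1)

        # Usage: encode two seconds of stereo audio at 44.1 kHz.
        features = MultiResEncoder()(torch.randn(1, 2, 88200))
        print(features.shape)  # (1, 3 * n_filters, frames)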