328 research outputs found
Informed Source Separation using Iterative Reconstruction
This paper presents a technique for Informed Source Separation (ISS) of a
single channel mixture, based on the Multiple Input Spectrogram Inversion
method. The reconstruction of the source signals is iterative, alternating
between a time-frequency consistency enforcement and a re-mixing constraint. A
dual-resolution technique is also proposed for sharper reconstruction of
transients. The two algorithms are compared to a state-of-the-art
Wiener-based ISS technique, on a database of fourteen monophonic mixtures, with
standard source separation objective measures. Experimental results show that
the proposed algorithms outperform both this reference technique and the oracle
Wiener filter by up to 3 dB in distortion, at the cost of significantly
heavier computation.
Comment: submitted to the IEEE Transactions on Audio, Speech and Language Processing
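The alternation described above can be sketched as follows. This is a minimal, illustrative reading of a MISI-style loop, not the authors' exact algorithm or its dual-resolution variant; it assumes oracle magnitude spectrograms for each source and initializes every source with the mixture phase:

```python
import numpy as np
from scipy.signal import stft, istft

def misi(mixture, mag_specs, n_iter=50, nperseg=1024):
    """MISI-style sketch: alternate a time-frequency consistency step
    (inverse STFT with the known magnitudes) with a re-mixing step
    that redistributes the mixing residual across the sources."""
    n_src = len(mag_specs)
    _, _, X = stft(mixture, nperseg=nperseg)
    phases = [np.angle(X) for _ in range(n_src)]   # mixture-phase init
    estimates = [np.zeros_like(mixture) for _ in range(n_src)]
    for _ in range(n_iter):
        # consistency: back to the time domain with the oracle magnitudes
        for k in range(n_src):
            _, x_k = istft(mag_specs[k] * np.exp(1j * phases[k]),
                           nperseg=nperseg)
            x_k = x_k[: len(mixture)]
            if len(x_k) < len(mixture):
                x_k = np.pad(x_k, (0, len(mixture) - len(x_k)))
            estimates[k] = x_k
        # re-mixing constraint: split the residual equally among sources
        err = (mixture - np.sum(estimates, axis=0)) / n_src
        for k in range(n_src):
            _, _, E = stft(estimates[k] + err, nperseg=nperseg)
            phases[k] = np.angle(E)
    return estimates
```

The magnitudes stay fixed throughout; only the phases are updated, which is what makes the scheme "informed".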
On Single-Channel Speech Enhancement and On Non-Linear Modulation-Domain Kalman Filtering
This report focuses on algorithms that perform single-channel speech
enhancement. The author of this report uses modulation-domain Kalman filtering
algorithms for speech enhancement, i.e. noise suppression and dereverberation,
in [1], [2], [3], [4] and [5]. Modulation-domain Kalman filtering can be
applied to both noise and late reverberation suppression, and in [2], [1], [3]
and [4], various model-based speech enhancement algorithms that perform
modulation-domain Kalman filtering are designed, implemented and tested. The
model-based enhancement algorithm in [2] estimates and tracks the speech phase.
The short-time-Fourier-transform-based enhancement algorithm in [5] uses the
active speech level estimator presented in [6]. This report describes how the
different algorithms perform speech enhancement and is addressed to researchers
interested in monaural speech enhancement. The algorithms are composed of
different processing blocks and techniques [7]; understanding the
implementation choices made during system design is important because it
provides insights that can assist the development of new algorithms.
Index Terms - Speech enhancement, dereverberation, denoising, Kalman filter,
minimum mean squared error estimation.
Comment: 13 pages
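As a rough illustration of the modulation-domain idea, the sketch below runs a scalar Kalman filter along the amplitude trajectory of a single frequency bin. The random-walk state model and the `q`, `r` values are illustrative assumptions, not the speech models of [1]-[5]:

```python
import numpy as np

def modulation_kalman(noisy_amp, a=1.0, q=0.01, r=0.1):
    """Scalar Kalman filter over one bin's spectral-amplitude
    trajectory. State model: s[t] = a*s[t-1] + w (a=1 gives a
    random-walk prior); the noisy amplitude is the observation.
    q is the process variance, r the observation variance."""
    s_est = noisy_amp[0]          # state estimate
    p = 1.0                       # state error variance
    out = np.empty_like(noisy_amp)
    for t, y in enumerate(noisy_amp):
        # predict
        s_pred = a * s_est
        p_pred = a * a * p + q
        # update with the noisy amplitude observation
        k = p_pred / (p_pred + r)
        s_est = s_pred + k * (y - s_pred)
        p = (1.0 - k) * p_pred
        out[t] = s_est
    return out
```

A full modulation-domain enhancer would apply such a filter to every frequency bin of the short-time spectrum, typically with speech-trained AR models in the state equation.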
Fourier Phase Retrieval: Uniqueness and Algorithms
The problem of recovering a signal from its phaseless Fourier transform
measurements, called Fourier phase retrieval, arises in many applications in
engineering and science. Fourier phase retrieval poses fundamental theoretical
and algorithmic challenges. In general, there is no unique mapping between a
one-dimensional signal and its Fourier magnitude and therefore the problem is
ill-posed. Additionally, while almost all multidimensional signals are uniquely
mapped to their Fourier magnitude, the performance of existing algorithms is
generally not well-understood. In this chapter we survey methods to guarantee
uniqueness in Fourier phase retrieval. We then present different algorithmic
approaches to retrieve the signal in practice. We conclude by outlining some of
the main open questions in this field.
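One classical algorithmic approach in this literature is alternating projections (error reduction). Below is a minimal 1-D sketch, assuming a known support and nonnegativity as the signal-domain constraints that mitigate the ill-posedness noted above:

```python
import numpy as np

def error_reduction(fourier_mag, support, n_iter=200, seed=0):
    """Classical error-reduction for Fourier phase retrieval:
    alternate between imposing the measured Fourier magnitude and
    imposing a known support (plus nonnegativity) in the signal
    domain, starting from a random guess."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(fourier_mag.shape)
    for _ in range(n_iter):
        X = np.fft.fft(x)
        # Fourier-domain projection: keep phase, impose measured magnitude
        X = fourier_mag * np.exp(1j * np.angle(X))
        x = np.real(np.fft.ifft(X))
        # signal-domain projection: support and nonnegativity
        x = np.where(support, np.maximum(x, 0.0), 0.0)
    return x
```

The residual magnitude error is non-increasing over iterations, but the algorithm can stagnate in local minima, which motivates the more elaborate methods surveyed in the chapter.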
Audio Spectrogram Representations for Processing with Convolutional Neural Networks
One of the decisions that arise when designing a neural network for any
application is how the data should be represented in order to be presented to,
and possibly generated by, a neural network. For audio, the choice is less
obvious than it seems to be for visual images, and a variety of representations
have been used for different applications including the raw digitized sample
stream, hand-crafted features, machine discovered features, MFCCs and variants
that include deltas, and a variety of spectral representations. This paper
reviews some of these representations and issues that arise, focusing
particularly on spectrograms for generating audio using neural networks for
style transfer.
Comment: Proceedings of the First International Conference on Deep Learning and Music, Anchorage, US, May 2017 (arXiv:1706.08675v1 [cs.NE])
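One of the spectral representations discussed above is the log-magnitude spectrogram. A minimal sketch using SciPy's STFT (the window length and floor value are illustrative choices):

```python
import numpy as np
from scipy.signal import stft

def log_mag_spectrogram(audio, fs=16000, nperseg=512, eps=1e-8):
    """Log-magnitude spectrogram of an audio signal, computed from a
    Hann-windowed STFT; eps avoids log of zero."""
    f, t, Z = stft(audio, fs=fs, nperseg=nperseg)
    return f, t, 20.0 * np.log10(np.abs(Z) + eps)
```

Because the phase is discarded, a network generating such spectrograms needs a phase-reconstruction step (e.g. Griffin-Lim) to return to audio, which is exactly the issue the paper examines for style transfer.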
Enhancement and Recognition of Reverberant and Noisy Speech by Extending Its Coherence
Most speech enhancement algorithms make use of the short-time Fourier
transform (STFT), which is a simple and flexible time-frequency decomposition
that estimates the short-time spectrum of a signal. However, the duration of
short STFT frames is inherently limited by the nonstationarity of speech
signals. The main contribution of this paper is a demonstration of speech
enhancement and automatic speech recognition in the presence of reverberation
and noise by extending the length of analysis windows. We accomplish this
extension by performing enhancement in the short-time fan-chirp transform
(STFChT) domain, an overcomplete time-frequency representation that is coherent
with speech signals over longer analysis window durations than the STFT. This
extended coherence is gained by using a linear model of fundamental frequency
variation of voiced speech signals. Our approach centers around using a
single-channel minimum mean-square error log-spectral amplitude (MMSE-LSA)
estimator proposed by Habets, which scales coefficients in a time-frequency
domain to suppress noise and reverberation. In the case of multiple
microphones, we preprocess the data with either a minimum variance
distortionless response (MVDR) beamformer, or a delay-and-sum beamformer (DSB).
We evaluate our algorithm on both speech enhancement and recognition tasks for
the REVERB challenge dataset. Compared to the same processing done in the STFT
domain, our approach achieves significant improvement in terms of objective
enhancement metrics (including PESQ---the ITU-T standard measurement for speech
quality). In terms of automatic speech recognition (ASR) performance as
measured by word error rate (WER), our experiments indicate that the STFT with
a long window is more effective for ASR.
Comment: 22 pages
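The MMSE-LSA gain at the core of the approach can be written compactly. The sketch below is the standard Ephraim-Malah log-spectral amplitude gain rule, not Habets' full estimator (which additionally models late reverberation in the noise term):

```python
import numpy as np
from scipy.special import exp1

def mmse_lsa_gain(xi, gamma):
    """Ephraim-Malah MMSE log-spectral amplitude gain.
    xi    : a priori SNR per time-frequency bin
    gamma : a posteriori SNR per time-frequency bin
    Returns the multiplicative gain applied to each noisy STFT (or
    STFChT) coefficient; exp1 is the exponential integral E1."""
    v = xi / (1.0 + xi) * gamma
    return xi / (1.0 + xi) * np.exp(0.5 * exp1(v))
```

Applying this gain in the STFChT domain rather than the STFT domain is the paper's key change: the longer coherent analysis windows give the estimator more reliable SNR statistics for voiced speech.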
Digital Signal Processing Group
Contains an introduction and reports on nineteen research projects.
U.S. Navy - Office of Naval Research (Contract N00014-77-C-0266); U.S. Navy - Office of Naval Research (Contract N00014-81-K-0742); National Science Foundation (Grant ECS80-07102); Bell Laboratories Fellowship; Amoco Foundation Fellowship; U.S. Navy - Office of Naval Research (Contract N00014-77-C-0196); Schlumberger-Doll Research Center Fellowship; Toshiba Company Fellowship; Vinton Hayes Fellowship; Hertz Foundation Fellowship
Deep Griffin-Lim Iteration
This paper presents a novel method for reconstructing phase from a given
amplitude spectrogram alone, combining a signal-processing-based approach with
a deep neural network (DNN). To retrieve a time-domain signal from its amplitude
spectrogram, the corresponding phase is required. One of the popular phase
reconstruction methods is the Griffin-Lim algorithm (GLA), which is based on
the redundancy of the short-time Fourier transform. However, GLA often involves
many iterations and produces low-quality signals owing to the lack of prior
knowledge of the target signal. In order to address these issues, in this
study, we propose an architecture which stacks a sub-block including two
GLA-inspired fixed layers and a DNN. The number of stacked sub-blocks is
adjustable, so performance can be traded off against computational load to meet
application requirements. The effectiveness of the proposed method is
investigated by reconstructing phases from amplitude spectrograms of speech.
Comment: 5 pages, to appear in IEEE ICASSP 2019 (Paper Code: AASP-L3.1, Session: Source Separation and Speech Enhancement I)
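The Griffin-Lim baseline that the stacked sub-blocks are inspired by can be sketched as a plain projection loop (this is the classical GLA, not the proposed DNN architecture):

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=100, nperseg=1024, seed=0):
    """Griffin-Lim algorithm: exploit the redundancy of the STFT by
    repeatedly enforcing time-frequency consistency while keeping
    the given amplitude spectrogram, starting from random phase."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        _, x = istft(mag * phase, nperseg=nperseg)   # to time domain
        _, _, X = stft(x, nperseg=nperseg)           # back to T-F domain
        phase = np.exp(1j * np.angle(X))             # keep the phase only
    _, x = istft(mag * phase, nperseg=nperseg)
    return x
```

Each iteration is exactly the kind of fixed (parameter-free) layer the paper stacks between DNN sub-blocks.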
Time Domain Neural Audio Style Transfer
A recently published method for audio style transfer has shown how to extend
the process of image style transfer to audio. This method synthesizes audio
"content" and "style" independently using the magnitudes of a short time
Fourier transform, shallow convolutional networks with randomly initialized
filters, and iterative phase reconstruction with Griffin-Lim. In this work, we
explore whether it is possible to directly optimize a time domain audio signal,
removing the process of phase reconstruction and opening up possibilities for
real-time applications and higher quality syntheses. We explore a variety of
style transfer processes on neural networks that operate directly on time
domain audio signals and demonstrate one such network capable of audio
stylization.
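The Gram-matrix style term that such time-domain optimization would minimize can be sketched as follows. The random-filter feature extractor mirrors the randomly initialized shallow networks described above; all sizes and the helper names are illustrative:

```python
import numpy as np

def random_conv_features(x, n_filters=16, ksize=64, seed=0):
    """Features from one shallow 1-D conv layer with random,
    untrained filters, followed by ReLU (an assumption matching the
    random-filter networks described above)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((n_filters, ksize)) / np.sqrt(ksize)
    feats = np.stack([np.convolve(x, wk, mode="valid") for wk in w])
    return np.maximum(feats, 0.0)

def style_loss(x, y):
    """Squared distance between the Gram matrices of the two
    signals' features -- the 'style' term a time-domain optimizer
    would minimize directly, with no phase-reconstruction step."""
    fx, fy = random_conv_features(x), random_conv_features(y)
    gx = fx @ fx.T / fx.shape[1]
    gy = fy @ fy.T / fy.shape[1]
    return np.sum((gx - gy) ** 2)
```

Optimizing the raw samples of `x` against this loss (plus a content term) is what removes the Griffin-Lim step from the pipeline.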
Examining the Mapping Functions of Denoising Autoencoders in Singing Voice Separation
The goal of this work is to investigate what singing voice separation
approaches based on neural networks learn from the data. We examine the mapping
functions of neural networks based on the denoising autoencoder (DAE) model
that are conditioned on the mixture magnitude spectra. To approximate the
mapping functions, we propose an algorithm inspired by knowledge distillation,
denoted the neural couplings algorithm (NCA). The NCA yields a
matrix that expresses the mapping of the mixture to the target source magnitude
information. Using the NCA, we examine the mapping functions of three
fundamental DAE-based models in music source separation: one with a
single-layer encoder and decoder, one with a multi-layer encoder and a
single-layer decoder, and one using skip-filtering connections (SF) with
single-layer encoding and
decoding. We first train these models with realistic data to estimate the
singing voice magnitude spectra from the corresponding mixture. We then use the
optimized models and test spectral data as input to the NCA. Our experimental
findings show that approaches based on the DAE model learn scalar filtering
operators, exhibiting a predominant diagonal structure in their corresponding
mapping functions, limiting the exploitation of inter-frequency structure of
music data. In contrast, skip-filtering connections are shown to assist the DAE
model in learning filtering operators that exploit richer inter-frequency
structures.
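The idea of expressing a trained network's mixture-to-target mapping as a single matrix can be illustrated with the numerical Jacobian of a toy single-layer DAE. This is only a stand-in for the NCA, not the algorithm itself, and `diagonal_dominance` is a hypothetical helper for quantifying the diagonal structure reported above:

```python
import numpy as np

def dae_forward(x, w_enc, w_dec):
    """Toy single-layer DAE: linear encoder, ReLU, linear decoder."""
    return w_dec @ np.maximum(w_enc @ x, 0.0)

def mapping_matrix(x, w_enc, w_dec, eps=1e-4):
    """Central-difference Jacobian of the DAE around input x -- a
    matrix expressing the local mixture-to-target mapping, in the
    spirit of the neural couplings matrix."""
    n = len(x)
    J = np.zeros((n, n))
    for i in range(n):
        d = np.zeros(n)
        d[i] = eps
        J[:, i] = (dae_forward(x + d, w_enc, w_dec) -
                   dae_forward(x - d, w_enc, w_dec)) / (2.0 * eps)
    return J

def diagonal_dominance(J):
    """Fraction of the mapping's energy on the main diagonal; values
    near 1 indicate the scalar-filtering behaviour reported for
    plain DAE models."""
    return np.sum(np.diag(J) ** 2) / np.sum(J ** 2)
```

A mapping with dominance near 1 rescales each frequency bin independently, which is exactly the limitation the skip-filtering connections are shown to relax.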
Sparsity Based Formulations For Dereverberation
Thesis (M.Sc.) -- İstanbul Technical University, Institute of Science and Technology, 2016
Acoustic signals recorded in settings such as concerts, meetings and conferences are affected by reverberation and noise introduced by the recording environment. Estimating the clean source signals from the observations is referred to as the dereverberation problem. The reverberation effects can be modelled in the time domain as a filter, called the room impulse response (RIR). When the RIR is known, the problem becomes the non-blind dereverberation problem, which is the case considered throughout this thesis. Since the RIR depends strongly on the source and observation positions, estimating it for every point in space is very difficult; the RIRs in the experiments are therefore either applied synthetically or obtained from recordings made in the observation environment. Chapter 5 takes a different view of this situation, considering the case where the RIR is only partially known and a single filter can be defined for the observation environment. In this thesis the non-blind dereverberation problem is posed using convex penalty functions within a convex minimization procedure, and the resulting problems are solved with iterative methods. The sparse nature of the time-frequency spectrum is exploited throughout, with the Short Time Fourier Transform used to map the time-domain signal to a time-frequency spectrum.
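A minimal sketch of one such convex formulation, assuming the room impulse response `h` is known: l1-regularized deconvolution solved with ISTA. The penalty, step size and parameter values here are illustrative; the thesis's exact formulations and solvers may differ:

```python
import numpy as np

def ista_dereverb(y, h, lam=0.01, n_iter=500):
    """Non-blind dereverberation as l1-regularized deconvolution:
    minimize 0.5*||y - h*x||^2 + lam*||x||_1 with ISTA, where * is
    full convolution with the known room impulse response h."""
    n = len(y) - len(h) + 1              # length of the source estimate
    x = np.zeros(n)
    # step size 1/L with L >= ||H||^2; ||h||_1 bounds the operator norm
    t = 1.0 / (np.sum(np.abs(h)) ** 2)
    for _ in range(n_iter):
        r = np.convolve(x, h) - y                     # residual
        grad = np.correlate(r, h, mode="valid")       # adjoint of conv
        x = x - t * grad                              # gradient step
        x = np.sign(x) * np.maximum(np.abs(x) - t * lam, 0.0)  # shrink
    return x
```

The soft-threshold step is where the sparsity prior on the (here, time-domain) source acts; the thesis applies the same idea to sparse time-frequency spectra via the STFT.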