328 research outputs found
Informed Source Separation using Iterative Reconstruction
This paper presents a technique for Informed Source Separation (ISS) of a
single channel mixture, based on the Multiple Input Spectrogram Inversion
method. The reconstruction of the source signals is iterative, alternating
between a time-frequency consistency enforcement and a re-mixing constraint. A
dual-resolution technique is also proposed for sharper reconstruction of
transients. The two algorithms are compared to a state-of-the-art
Wiener-based ISS technique, on a database of fourteen monophonic mixtures, with
standard source separation objective measures. Experimental results show that
the proposed algorithms outperform both this reference technique and the oracle
Wiener filter by up to 3 dB in distortion, at the cost of significantly
heavier computation.
Comment: submitted to the IEEE Transactions on Audio, Speech and Language Processing
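The alternation described above can be sketched as follows. This is a minimal, illustrative reading of a MISI-style loop, not the authors' exact algorithm or its dual-resolution variant; it assumes oracle magnitude spectrograms for each source and initializes every source with the mixture phase:

```python
import numpy as np
from scipy.signal import stft, istft

def misi(mixture, mag_specs, n_iter=50, nperseg=1024):
    """MISI-style sketch: alternate a time-frequency consistency step
    (inverse STFT with the known magnitudes) with a re-mixing step
    that redistributes the mixing residual across the sources."""
    n_src = len(mag_specs)
    _, _, X = stft(mixture, nperseg=nperseg)
    phases = [np.angle(X) for _ in range(n_src)]   # mixture-phase init
    estimates = [np.zeros_like(mixture) for _ in range(n_src)]
    for _ in range(n_iter):
        # consistency: back to the time domain with the oracle magnitudes
        for k in range(n_src):
            _, x_k = istft(mag_specs[k] * np.exp(1j * phases[k]),
                           nperseg=nperseg)
            x_k = x_k[: len(mixture)]
            if len(x_k) < len(mixture):
                x_k = np.pad(x_k, (0, len(mixture) - len(x_k)))
            estimates[k] = x_k
        # re-mixing constraint: split the residual equally among sources
        err = (mixture - np.sum(estimates, axis=0)) / n_src
        for k in range(n_src):
            _, _, E = stft(estimates[k] + err, nperseg=nperseg)
            phases[k] = np.angle(E)
    return estimates
```

The magnitudes stay fixed throughout; only the phases are updated, which is what makes the scheme "informed".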
On Single-Channel Speech Enhancement and On Non-Linear Modulation-Domain Kalman Filtering
This report focuses on algorithms that perform single-channel speech
enhancement. The author of this report uses modulation-domain Kalman filtering
algorithms for speech enhancement, i.e. noise suppression and dereverberation,
in [1], [2], [3], [4] and [5]. Modulation-domain Kalman filtering can be
applied to both noise and late reverberation suppression, and in [2], [1], [3]
and [4], various model-based speech enhancement algorithms that perform
modulation-domain Kalman filtering are designed, implemented and tested. The
model-based enhancement algorithm in [2] estimates and tracks the speech phase.
The short-time-Fourier-transform-based enhancement algorithm in [5] uses the
active speech level estimator presented in [6]. This report describes how the
different algorithms perform speech enhancement and is addressed to researchers
interested in monaural speech enhancement. The algorithms are composed of
different processing blocks and techniques [7]; understanding the
implementation choices made during system design is important because it
provides insights that can assist the development of new algorithms.
Index Terms - Speech enhancement, dereverberation, denoising, Kalman filter,
minimum mean squared error estimation.
Comment: 13 pages
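As a rough illustration of the modulation-domain idea, the sketch below runs a scalar Kalman filter along the amplitude trajectory of a single frequency bin. The random-walk state model and the `q`, `r` values are illustrative assumptions, not the speech models of [1]-[5]:

```python
import numpy as np

def modulation_kalman(noisy_amp, a=1.0, q=0.01, r=0.1):
    """Scalar Kalman filter over one bin's spectral-amplitude
    trajectory. State model: s[t] = a*s[t-1] + w (a=1 gives a
    random-walk prior); the noisy amplitude is the observation.
    q is the process variance, r the observation variance."""
    s_est = noisy_amp[0]          # state estimate
    p = 1.0                       # state error variance
    out = np.empty_like(noisy_amp)
    for t, y in enumerate(noisy_amp):
        # predict
        s_pred = a * s_est
        p_pred = a * a * p + q
        # update with the noisy amplitude observation
        k = p_pred / (p_pred + r)
        s_est = s_pred + k * (y - s_pred)
        p = (1.0 - k) * p_pred
        out[t] = s_est
    return out
```

A full modulation-domain enhancer would apply such a filter to every frequency bin of the short-time spectrum, typically with speech-trained AR models in the state equation.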
Fourier Phase Retrieval: Uniqueness and Algorithms
The problem of recovering a signal from its phaseless Fourier transform
measurements, called Fourier phase retrieval, arises in many applications in
engineering and science. Fourier phase retrieval poses fundamental theoretical
and algorithmic challenges. In general, there is no unique mapping between a
one-dimensional signal and its Fourier magnitude and therefore the problem is
ill-posed. Additionally, while almost all multidimensional signals are uniquely
mapped to their Fourier magnitude, the performance of existing algorithms is
generally not well-understood. In this chapter we survey methods to guarantee
uniqueness in Fourier phase retrieval. We then present different algorithmic
approaches to retrieve the signal in practice. We conclude by outlining some of
the main open questions in this field.
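One classical algorithmic approach in this literature is alternating projections (error reduction). Below is a minimal 1-D sketch, assuming a known support and nonnegativity as the signal-domain constraints that mitigate the ill-posedness noted above:

```python
import numpy as np

def error_reduction(fourier_mag, support, n_iter=200, seed=0):
    """Classical error-reduction for Fourier phase retrieval:
    alternate between imposing the measured Fourier magnitude and
    imposing a known support (plus nonnegativity) in the signal
    domain, starting from a random guess."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(fourier_mag.shape)
    for _ in range(n_iter):
        X = np.fft.fft(x)
        # Fourier-domain projection: keep phase, impose measured magnitude
        X = fourier_mag * np.exp(1j * np.angle(X))
        x = np.real(np.fft.ifft(X))
        # signal-domain projection: support and nonnegativity
        x = np.where(support, np.maximum(x, 0.0), 0.0)
    return x
```

The residual magnitude error is non-increasing over iterations, but the algorithm can stagnate in local minima, which motivates the more elaborate methods surveyed in the chapter.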
Audio Spectrogram Representations for Processing with Convolutional Neural Networks
One of the decisions that arise when designing a neural network for any
application is how the data should be represented in order to be presented to,
and possibly generated by, a neural network. For audio, the choice is less
obvious than it seems to be for visual images, and a variety of representations
have been used for different applications including the raw digitized sample
stream, hand-crafted features, machine discovered features, MFCCs and variants
that include deltas, and a variety of spectral representations. This paper
reviews some of these representations and issues that arise, focusing
particularly on spectrograms for generating audio using neural networks for
style transfer.
Comment: Proceedings of the First International Conference on Deep Learning and Music, Anchorage, US, May 2017 (arXiv:1706.08675v1 [cs.NE])
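One of the spectral representations discussed above is the log-magnitude spectrogram. A minimal sketch using SciPy's STFT (the window length and floor value are illustrative choices):

```python
import numpy as np
from scipy.signal import stft

def log_mag_spectrogram(audio, fs=16000, nperseg=512, eps=1e-8):
    """Log-magnitude spectrogram of an audio signal, computed from a
    Hann-windowed STFT; eps avoids log of zero."""
    f, t, Z = stft(audio, fs=fs, nperseg=nperseg)
    return f, t, 20.0 * np.log10(np.abs(Z) + eps)
```

Because the phase is discarded, a network generating such spectrograms needs a phase-reconstruction step (e.g. Griffin-Lim) to return to audio, which is exactly the issue the paper examines for style transfer.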
Enhancement and Recognition of Reverberant and Noisy Speech by Extending Its Coherence
Most speech enhancement algorithms make use of the short-time Fourier
transform (STFT), which is a simple and flexible time-frequency decomposition
that estimates the short-time spectrum of a signal. However, the duration of
short STFT frames is inherently limited by the nonstationarity of speech
signals. The main contribution of this paper is a demonstration of speech
enhancement and automatic speech recognition in the presence of reverberation
and noise by extending the length of analysis windows. We accomplish this
extension by performing enhancement in the short-time fan-chirp transform
(STFChT) domain, an overcomplete time-frequency representation that is coherent
with speech signals over longer analysis window durations than the STFT. This
extended coherence is gained by using a linear model of fundamental frequency
variation of voiced speech signals. Our approach centers around using a
single-channel minimum mean-square error log-spectral amplitude (MMSE-LSA)
estimator proposed by Habets, which scales coefficients in a time-frequency
domain to suppress noise and reverberation. In the case of multiple
microphones, we preprocess the data with either a minimum variance
distortionless response (MVDR) beamformer, or a delay-and-sum beamformer (DSB).
We evaluate our algorithm on both speech enhancement and recognition tasks for
the REVERB challenge dataset. Compared to the same processing done in the STFT
domain, our approach achieves significant improvement in terms of objective
enhancement metrics (including PESQ---the ITU-T standard measurement for speech
quality). In terms of automatic speech recognition (ASR) performance as
measured by word error rate (WER), our experiments indicate that the STFT with
a long window is more effective for ASR.
Comment: 22 pages
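The MMSE-LSA gain at the core of the approach can be written compactly. The sketch below is the standard Ephraim-Malah log-spectral amplitude gain rule, not Habets' full estimator (which additionally models late reverberation in the noise term):

```python
import numpy as np
from scipy.special import exp1

def mmse_lsa_gain(xi, gamma):
    """Ephraim-Malah MMSE log-spectral amplitude gain.
    xi    : a priori SNR per time-frequency bin
    gamma : a posteriori SNR per time-frequency bin
    Returns the multiplicative gain applied to each noisy STFT (or
    STFChT) coefficient; exp1 is the exponential integral E1."""
    v = xi / (1.0 + xi) * gamma
    return xi / (1.0 + xi) * np.exp(0.5 * exp1(v))
```

Applying this gain in the STFChT domain rather than the STFT domain is the paper's key change: the longer coherent analysis windows give the estimator more reliable SNR statistics for voiced speech.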
Digital Signal Processing Group
Contains an introduction and reports on nineteen research projects.
U.S. Navy - Office of Naval Research (Contract N00014-77-C-0266); U.S. Navy - Office of Naval Research (Contract N00014-81-K-0742); National Science Foundation (Grant ECS80-07102); Bell Laboratories Fellowship; Amoco Foundation Fellowship; U.S. Navy - Office of Naval Research (Contract N00014-77-C-0196); Schlumberger-Doll Research Center Fellowship; Toshiba Company Fellowship; Vinton Hayes Fellowship; Hertz Foundation Fellowship
Deep Griffin-Lim Iteration
This paper presents a novel method for reconstructing phase from a given
amplitude spectrogram alone, combining a signal-processing-based approach with
a deep neural network (DNN). To retrieve a time-domain signal from its amplitude
spectrogram, the corresponding phase is required. One of the popular phase
reconstruction methods is the Griffin-Lim algorithm (GLA), which is based on
the redundancy of the short-time Fourier transform. However, GLA often involves
many iterations and produces low-quality signals owing to the lack of prior
knowledge of the target signal. In order to address these issues, in this
study, we propose an architecture which stacks a sub-block including two
GLA-inspired fixed layers and a DNN. The number of stacked sub-blocks is
adjustable, so performance can be traded off against computational load to meet
application requirements. The effectiveness of the proposed method is
investigated by reconstructing phases from amplitude spectrograms of speech.
Comment: 5 pages, to appear in IEEE ICASSP 2019 (Paper Code: AASP-L3.1, Session: Source Separation and Speech Enhancement I)
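The Griffin-Lim baseline that the stacked sub-blocks are inspired by can be sketched as a plain projection loop (this is the classical GLA, not the proposed DNN architecture):

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=100, nperseg=1024, seed=0):
    """Griffin-Lim algorithm: exploit the redundancy of the STFT by
    repeatedly enforcing time-frequency consistency while keeping
    the given amplitude spectrogram, starting from random phase."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        _, x = istft(mag * phase, nperseg=nperseg)   # to time domain
        _, _, X = stft(x, nperseg=nperseg)           # back to T-F domain
        phase = np.exp(1j * np.angle(X))             # keep the phase only
    _, x = istft(mag * phase, nperseg=nperseg)
    return x
```

Each iteration is exactly the kind of fixed (parameter-free) layer the paper stacks between DNN sub-blocks.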
Time Domain Neural Audio Style Transfer
A recently published method for audio style transfer has shown how to extend
the process of image style transfer to audio. This method synthesizes audio
"content" and "style" independently using the magnitudes of a short time
Fourier transform, shallow convolutional networks with randomly initialized
filters, and iterative phase reconstruction with Griffin-Lim. In this work, we
explore whether it is possible to directly optimize a time domain audio signal,
removing the process of phase reconstruction and opening up possibilities for
real-time applications and higher quality syntheses. We explore a variety of
style transfer processes on neural networks that operate directly on time
domain audio signals and demonstrate one such network capable of audio
stylization.
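The Gram-matrix style term that such time-domain optimization would minimize can be sketched as follows. The random-filter feature extractor mirrors the randomly initialized shallow networks described above; all sizes and the helper names are illustrative:

```python
import numpy as np

def random_conv_features(x, n_filters=16, ksize=64, seed=0):
    """Features from one shallow 1-D conv layer with random,
    untrained filters, followed by ReLU (an assumption matching the
    random-filter networks described above)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((n_filters, ksize)) / np.sqrt(ksize)
    feats = np.stack([np.convolve(x, wk, mode="valid") for wk in w])
    return np.maximum(feats, 0.0)

def style_loss(x, y):
    """Squared distance between the Gram matrices of the two
    signals' features -- the 'style' term a time-domain optimizer
    would minimize directly, with no phase-reconstruction step."""
    fx, fy = random_conv_features(x), random_conv_features(y)
    gx = fx @ fx.T / fx.shape[1]
    gy = fy @ fy.T / fy.shape[1]
    return np.sum((gx - gy) ** 2)
```

Optimizing the raw samples of `x` against this loss (plus a content term) is what removes the Griffin-Lim step from the pipeline.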
Examining the Mapping Functions of Denoising Autoencoders in Singing Voice Separation
The goal of this work is to investigate what singing voice separation
approaches based on neural networks learn from the data. We examine the mapping
functions of neural networks based on the denoising autoencoder (DAE) model
that are conditioned on the mixture magnitude spectra. To approximate the
mapping functions, we propose an algorithm inspired by knowledge distillation,
denoted the neural couplings algorithm (NCA). The NCA yields a
matrix that expresses the mapping of the mixture to the target source magnitude
information. Using the NCA, we examine the mapping functions of three
fundamental DAE-based models in music source separation: one with a
single-layer encoder and decoder, one with a multi-layer encoder and a
single-layer decoder, and one using skip-filtering connections (SF) with
single-layer encoding and
decoding. We first train these models with realistic data to estimate the
singing voice magnitude spectra from the corresponding mixture. We then use the
optimized models and test spectral data as input to the NCA. Our experimental
findings show that approaches based on the DAE model learn scalar filtering
operators, exhibiting a predominant diagonal structure in their corresponding
mapping functions, limiting the exploitation of inter-frequency structure of
music data. In contrast, skip-filtering connections are shown to assist the DAE
model in learning filtering operators that exploit richer inter-frequency
structures.
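The idea of expressing a trained network's mixture-to-target mapping as a single matrix can be illustrated with the numerical Jacobian of a toy single-layer DAE. This is only a stand-in for the NCA, not the algorithm itself, and `diagonal_dominance` is a hypothetical helper for quantifying the diagonal structure reported above:

```python
import numpy as np

def dae_forward(x, w_enc, w_dec):
    """Toy single-layer DAE: linear encoder, ReLU, linear decoder."""
    return w_dec @ np.maximum(w_enc @ x, 0.0)

def mapping_matrix(x, w_enc, w_dec, eps=1e-4):
    """Central-difference Jacobian of the DAE around input x -- a
    matrix expressing the local mixture-to-target mapping, in the
    spirit of the neural couplings matrix."""
    n = len(x)
    J = np.zeros((n, n))
    for i in range(n):
        d = np.zeros(n)
        d[i] = eps
        J[:, i] = (dae_forward(x + d, w_enc, w_dec) -
                   dae_forward(x - d, w_enc, w_dec)) / (2.0 * eps)
    return J

def diagonal_dominance(J):
    """Fraction of the mapping's energy on the main diagonal; values
    near 1 indicate the scalar-filtering behaviour reported for
    plain DAE models."""
    return np.sum(np.diag(J) ** 2) / np.sum(J ** 2)
```

A mapping with dominance near 1 rescales each frequency bin independently, which is exactly the limitation the skip-filtering connections are shown to relax.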
Sparsity Based Formulations For Dereverberation
Thesis (M.Sc.) -- İstanbul Technical University, Institute of Science and Technology, 2016
Acoustic signals recorded in settings such as concerts, meetings and conferences are affected by reverberation and noise introduced by the recording environment. Estimating the clean source signals from the observations is referred to as the dereverberation problem. The reverberation effects can be modelled in the time domain as a filter, called the room impulse response (RIR). When the RIR is known, the problem becomes the non-blind dereverberation problem, which is the case considered throughout this thesis. Since the RIR depends strongly on the source and observation positions, estimating it for every point in space is very difficult; the RIRs in the experiments are therefore either applied synthetically or obtained from recordings made in the observation environment. Chapter 5 takes a different view of this situation, considering the case where the RIR is only partially known and a single filter can be defined for the observation environment. In this thesis the non-blind dereverberation problem is posed using convex penalty functions within a convex minimization procedure, and the resulting problems are solved with iterative methods. The sparse nature of the time-frequency spectrum is exploited throughout, with the Short Time Fourier Transform used to map the time-domain signal to a time-frequency spectrum.
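A minimal sketch of one such convex formulation, assuming the room impulse response `h` is known: l1-regularized deconvolution solved with ISTA. The penalty, step size and parameter values here are illustrative; the thesis's exact formulations and solvers may differ:

```python
import numpy as np

def ista_dereverb(y, h, lam=0.01, n_iter=500):
    """Non-blind dereverberation as l1-regularized deconvolution:
    minimize 0.5*||y - h*x||^2 + lam*||x||_1 with ISTA, where * is
    full convolution with the known room impulse response h."""
    n = len(y) - len(h) + 1              # length of the source estimate
    x = np.zeros(n)
    # step size 1/L with L >= ||H||^2; ||h||_1 bounds the operator norm
    t = 1.0 / (np.sum(np.abs(h)) ** 2)
    for _ in range(n_iter):
        r = np.convolve(x, h) - y                     # residual
        grad = np.correlate(r, h, mode="valid")       # adjoint of conv
        x = x - t * grad                              # gradient step
        x = np.sign(x) * np.maximum(np.abs(x) - t * lam, 0.0)  # shrink
    return x
```

The soft-threshold step is where the sparsity prior on the (here, time-domain) source acts; the thesis applies the same idea to sparse time-frequency spectra via the STFT.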