Trainable Adaptive Window Switching for Speech Enhancement
This study proposes a trainable adaptive window switching (AWS) method and
applies it to a deep neural network (DNN) for speech enhancement in the modified
discrete cosine transform domain. Time-frequency (T-F) mask processing in the
short-time Fourier transform (STFT)-domain is a typical speech enhancement
method. To recover the target signal precisely, DNN-based short-time frequency
transforms have recently been investigated and used instead of the STFT.
However, since such a fixed-resolution short-time frequency transform method
has a T-F resolution problem based on the uncertainty principle, not only the
short-time frequency transform but also the length of the windowing function
should be optimized. To overcome this problem, we incorporate AWS into the
speech enhancement procedure, and the windowing function of each time-frame is
manipulated using a DNN depending on the input signal. We confirmed that the
proposed method achieved a higher signal-to-distortion ratio than conventional
speech enhancement methods in fixed-resolution frequency domains.
Comment: accepted to the 44th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2019)
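The core idea of choosing a window length per frame can be sketched with a toy stand-in for the selector. The heuristic below (sub-block energy variance as a crude transient detector, with hypothetical candidate lengths 128 and 512) is purely illustrative; the paper learns this decision with a DNN rather than a hand-crafted rule:

```python
import numpy as np

def pick_window_length(frame, candidates=(128, 512)):
    """Toy stand-in for a learned window selector: pick a short window
    for transient frames (good time resolution) and a long window for
    stationary frames (good frequency resolution). Illustrative only."""
    sub = frame.reshape(4, -1)            # split the frame into 4 sub-blocks
    energies = (sub ** 2).mean(axis=1)    # per-sub-block energy
    # large spread of sub-block energies -> transient -> short window
    is_transient = energies.std() > 0.5 * energies.mean()
    return candidates[0] if is_transient else candidates[1]

# A stationary sinusoid keeps the long window; an isolated click
# triggers the short one.
stationary = np.sin(2 * np.pi * 0.05 * np.arange(512))
transient = np.zeros(512)
transient[200:210] = 5.0
print(pick_window_length(stationary), pick_window_length(transient))  # 512 128
```

This captures the resolution trade-off the abstract refers to: no single fixed window length is optimal for both the tonal and the impulsive frames above.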
Invertible DNN-based nonlinear time-frequency transform for speech enhancement
We propose an end-to-end speech enhancement method with a trainable
time-frequency (T-F) transform based on an invertible deep neural network (DNN).
Recent progress in speech enhancement has been driven by DNNs. An
ordinary DNN-based speech enhancement system applies a T-F transform, typically the
short-time Fourier transform (STFT), and estimates a T-F mask using a DNN. On the
other hand, some methods have considered end-to-end networks which directly
estimate the enhanced signals without T-F transform. While end-to-end methods
have shown promising results, they are black boxes and hard to understand.
Therefore, some end-to-end methods use a DNN to learn a linear T-F transform,
which is much easier to interpret. However, the learned transform may lack
properties that are important for ordinary signal processing. In this paper,
perfect reconstruction is considered as such an important property of the T-F
transform. An invertible nonlinear T-F transform is constructed from DNNs and
learned from data so that the obtained transform is a perfect-reconstruction filterbank.
Comment: To appear in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2020)
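Why invertibility guarantees perfect reconstruction can be illustrated with an additive coupling layer, a standard invertible building block from flow-based networks. This is a sketch under that assumption, not the paper's actual architecture: because only one half of the input is shifted by a function of the other half, the forward map can be undone exactly, no matter how complicated the nonlinearity is.

```python
import numpy as np

def f(x):
    # Arbitrary nonlinearity standing in for a small DNN.
    return np.tanh(2.0 * x)

def forward(signal):
    # Split the input and shift one half by a function of the other.
    a, b = np.split(signal, 2)
    return np.concatenate([a, b + f(a)])

def inverse(coeffs):
    # The shift is known from the untouched half, so it subtracts out exactly.
    a, b_shifted = np.split(coeffs, 2)
    return np.concatenate([a, b_shifted - f(a)])

x = np.random.default_rng(1).standard_normal(8)
assert np.allclose(inverse(forward(x)), x)  # exact (perfect) reconstruction
```

The reconstruction holds exactly by construction, which is the "perfect reconstruction" property the abstract requires of the learned filterbank.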
Real-time speech enhancement using equilibriated RNN
We propose a speech enhancement method using a causal deep neural
network (DNN) for real-time applications. DNNs have been widely used for
estimating a time-frequency (T-F) mask that enhances a speech signal. One
popular DNN structure for this task is the recurrent neural network (RNN), owing to its
capability of effectively modelling time-sequential data such as speech. In
particular, the long short-term memory (LSTM) is often used to alleviate the
vanishing/exploding gradient problem which makes the training of an RNN
difficult. However, an LSTM increases the number of parameters as the price
of mitigating this training difficulty, which requires more computational
resources. For real-time speech enhancement, it is preferable to use a smaller
network without sacrificing performance. In this paper, we propose to use the
equilibriated recurrent neural network (ERNN) to avoid the
vanishing/exploding gradient problem without increasing the number of
parameters. The proposed structure is causal, requiring only
information from the past, so that it can be applied in real time. Compared to
uni- and bi-directional LSTM networks, the proposed method achieved similar
performance with far fewer parameters.
Comment: To appear in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2020)
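The causal, frame-by-frame mask estimation described above can be sketched with a plain vanilla RNN cell in NumPy. The ERNN cell itself is different, so the model below is an illustrative placeholder; what it does show is the causality constraint: each mask depends only on the current and past frames, so frames can be processed as they arrive.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class CausalMaskRNN:
    """Toy causal recurrent mask estimator: maps each magnitude
    spectrogram frame to a (0,1)-valued T-F mask using only past
    and current frames (no look-ahead). Placeholder for the ERNN."""
    def __init__(self, n_freq, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        s = 0.1
        self.W_in = s * rng.standard_normal((n_hidden, n_freq))
        self.W_h = s * rng.standard_normal((n_hidden, n_hidden))
        self.W_out = s * rng.standard_normal((n_freq, n_hidden))
        self.h = np.zeros(n_hidden)

    def step(self, frame):
        # One frame in, one mask out: suitable for streaming use.
        self.h = np.tanh(self.W_in @ frame + self.W_h @ self.h)
        return sigmoid(self.W_out @ self.h)

rnn = CausalMaskRNN(n_freq=4, n_hidden=8)
spec = np.abs(np.random.default_rng(2).standard_normal((5, 4)))  # 5 frames
masks = np.stack([rnn.step(fr) for fr in spec])   # frame-by-frame
enhanced = masks * spec                            # elementwise T-F masking
```

In a real system the enhanced T-F coefficients would then be inverted back to a waveform; the point here is only that the loop consumes the spectrogram one frame at a time.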