Trainable Adaptive Window Switching for Speech Enhancement
This study proposes a trainable adaptive window switching (AWS) method and
applies it to a deep neural network (DNN) for speech enhancement in the modified
discrete cosine transform domain. Time-frequency (T-F) mask processing in the
short-time Fourier transform (STFT)-domain is a typical speech enhancement
method. To recover the target signal precisely, DNN-based short-time frequency
transforms have recently been investigated and used instead of the STFT.
However, since such a fixed-resolution short-time frequency transform method
has a T-F resolution problem based on the uncertainty principle, not only the
short-time frequency transform but also the length of the windowing function
should be optimized. To overcome this problem, we incorporate AWS into the
speech enhancement procedure, and the windowing function of each time-frame is
manipulated using a DNN depending on the input signal. We confirmed that the
proposed method achieved a higher signal-to-distortion ratio than conventional
speech enhancement methods in fixed-resolution frequency domains.
Comment: accepted to the 44th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2019)
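The core idea of choosing a window length per frame can be sketched with a toy stand-in for the selector. The heuristic below (sub-block energy variance as a crude transient detector, with hypothetical candidate lengths 128 and 512) is purely illustrative; the paper learns this decision with a DNN rather than a hand-crafted rule:

```python
import numpy as np

def pick_window_length(frame, candidates=(128, 512)):
    """Toy stand-in for a learned window selector: pick a short window
    for transient frames (good time resolution) and a long window for
    stationary frames (good frequency resolution). Illustrative only."""
    sub = frame.reshape(4, -1)            # split the frame into 4 sub-blocks
    energies = (sub ** 2).mean(axis=1)    # per-sub-block energy
    # large spread of sub-block energies -> transient -> short window
    is_transient = energies.std() > 0.5 * energies.mean()
    return candidates[0] if is_transient else candidates[1]

# A stationary sinusoid keeps the long window; an isolated click
# triggers the short one.
stationary = np.sin(2 * np.pi * 0.05 * np.arange(512))
transient = np.zeros(512)
transient[200:210] = 5.0
print(pick_window_length(stationary), pick_window_length(transient))  # 512 128
```

This captures the resolution trade-off the abstract refers to: no single fixed window length is optimal for both the tonal and the impulsive frames above.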
Invertible DNN-based nonlinear time-frequency transform for speech enhancement
We propose an end-to-end speech enhancement method with a trainable
time-frequency (T-F) transform based on an invertible deep neural network (DNN).
Recent progress in speech enhancement has been driven by DNNs. An
ordinary DNN-based speech enhancement system applies a T-F transform, typically the
short-time Fourier transform (STFT), and estimates a T-F mask using a DNN. On the
other hand, some methods have considered end-to-end networks which directly
estimate the enhanced signals without T-F transform. While end-to-end methods
have shown promising results, they are black boxes and hard to understand.
Therefore, some end-to-end methods use a DNN to learn a linear T-F transform,
which is much easier to interpret. However, the learned transform may lack
properties that are important for ordinary signal processing. In this paper,
perfect reconstruction is considered as such an important property of the T-F
transform. An invertible nonlinear T-F transform is constructed from DNNs and
learned from data so that the obtained transform is a perfect-reconstruction filterbank.
Comment: To appear in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2020)
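Why invertibility guarantees perfect reconstruction can be illustrated with an additive coupling layer, a standard invertible building block from flow-based networks. This is a sketch under that assumption, not the paper's actual architecture: because only one half of the input is shifted by a function of the other half, the forward map can be undone exactly, no matter how complicated the nonlinearity is.

```python
import numpy as np

def f(x):
    # Arbitrary nonlinearity standing in for a small DNN.
    return np.tanh(2.0 * x)

def forward(signal):
    # Split the input and shift one half by a function of the other.
    a, b = np.split(signal, 2)
    return np.concatenate([a, b + f(a)])

def inverse(coeffs):
    # The shift is known from the untouched half, so it subtracts out exactly.
    a, b_shifted = np.split(coeffs, 2)
    return np.concatenate([a, b_shifted - f(a)])

x = np.random.default_rng(1).standard_normal(8)
assert np.allclose(inverse(forward(x)), x)  # exact (perfect) reconstruction
```

The reconstruction holds exactly by construction, which is the "perfect reconstruction" property the abstract requires of the learned filterbank.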
Real-time speech enhancement using equilibriated RNN
We propose a speech enhancement method using a causal deep neural
network (DNN) for real-time applications. DNNs have been widely used for
estimating a time-frequency (T-F) mask that enhances a speech signal. One
popular DNN structure for this task is the recurrent neural network (RNN), owing to its
capability of effectively modelling time-sequential data such as speech. In
particular, the long short-term memory (LSTM) is often used to alleviate the
vanishing/exploding gradient problem which makes the training of an RNN
difficult. However, an LSTM increases the number of parameters as the price
of mitigating this training difficulty, which requires more computational
resources. For real-time speech enhancement, it is preferable to use a smaller
network without sacrificing performance. In this paper, we propose to use the
equilibriated recurrent neural network (ERNN) to avoid the
vanishing/exploding gradient problem without increasing the number of
parameters. The proposed structure is causal, requiring only
information from the past, so that it can be applied in real time. Compared to
uni- and bi-directional LSTM networks, the proposed method achieved similar
performance with far fewer parameters.
Comment: To appear in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2020)
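The causal, frame-by-frame mask estimation described above can be sketched with a plain vanilla RNN cell in NumPy. The ERNN cell itself is different, so the model below is an illustrative placeholder; what it does show is the causality constraint: each mask depends only on the current and past frames, so frames can be processed as they arrive.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class CausalMaskRNN:
    """Toy causal recurrent mask estimator: maps each magnitude
    spectrogram frame to a (0,1)-valued T-F mask using only past
    and current frames (no look-ahead). Placeholder for the ERNN."""
    def __init__(self, n_freq, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        s = 0.1
        self.W_in = s * rng.standard_normal((n_hidden, n_freq))
        self.W_h = s * rng.standard_normal((n_hidden, n_hidden))
        self.W_out = s * rng.standard_normal((n_freq, n_hidden))
        self.h = np.zeros(n_hidden)

    def step(self, frame):
        # One frame in, one mask out: suitable for streaming use.
        self.h = np.tanh(self.W_in @ frame + self.W_h @ self.h)
        return sigmoid(self.W_out @ self.h)

rnn = CausalMaskRNN(n_freq=4, n_hidden=8)
spec = np.abs(np.random.default_rng(2).standard_normal((5, 4)))  # 5 frames
masks = np.stack([rnn.step(fr) for fr in spec])   # frame-by-frame
enhanced = masks * spec                            # elementwise T-F masking
```

In a real system the enhanced T-F coefficients would then be inverted back to a waveform; the point here is only that the loop consumes the spectrogram one frame at a time.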