TasNet: time-domain audio separation network for real-time, single-channel speech separation
Robust speech processing in multi-talker environments requires effective
speech separation. Recent deep learning systems have made significant progress
toward solving this problem, yet it remains challenging, particularly in
real-time, short-latency applications. Most methods construct a mask for each
source in a time-frequency representation of the mixture signal, which is not
necessarily an optimal representation for speech separation. In addition,
time-frequency decomposition introduces inherent problems such as
phase/magnitude decoupling and the long time window required to achieve
sufficient frequency resolution. We propose the Time-domain Audio Separation
Network (TasNet) to overcome these limitations. We directly model the signal in
the time domain using an encoder-decoder framework and perform source
separation on nonnegative encoder outputs. This removes the frequency
decomposition step and reduces the separation problem to the estimation of
source masks on encoder outputs, which are then synthesized by the decoder. Our system
outperforms the current state-of-the-art causal and noncausal speech separation
algorithms, reduces the computational cost of speech separation, and
significantly reduces the minimum required latency of the output. This makes
TasNet suitable for applications where low-power, real-time implementation is
desirable, such as in hearable and telecommunication devices.
Comment: Camera-ready version for ICASSP 2018, Calgary, Canada
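The encoder-mask-decoder idea described in the abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the trained TasNet model: in the real system the encoder/decoder bases and the masks are learned (the masks by an LSTM), whereas here they are random stand-ins chosen only to show the data flow.

```python
import numpy as np

# Toy sketch of TasNet's time-domain pipeline: segment the mixture,
# compute nonnegative encoder outputs against a basis, mask them per
# source, and synthesize waveforms with a decoder basis. Bases and
# masks are random stand-ins; in TasNet they are learned.
rng = np.random.default_rng(0)

L, N = 40, 64                                # segment length, basis size
mixture = rng.standard_normal(400)           # time-domain mixture signal
segments = mixture.reshape(-1, L)            # non-overlapping segments

B_enc = rng.standard_normal((L, N))          # encoder basis (learned in practice)
B_dec = rng.standard_normal((N, L))          # decoder basis (learned in practice)

weights = np.maximum(segments @ B_enc, 0.0)  # nonnegative encoder outputs
masks = rng.uniform(size=(2, 1, N))
masks /= masks.sum(axis=0)                   # two source masks summing to one

# masked encoder outputs are synthesized back to waveforms by the decoder
sources = [((m * weights) @ B_dec).reshape(-1) for m in masks]
```

Because the masks sum to one, the two estimated sources sum exactly to the decoder's reconstruction of the unmasked encoder outputs; no STFT is computed anywhere, which is the point of the time-domain formulation.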
Online Speaker Separation Using Deep Clustering
In this thesis, a low-latency variant of the speaker-independent deep clustering method is
proposed for speaker separation. Compared to the offline deep clustering separation
system, bidirectional long short-term memory networks (BLSTMs) are replaced with
unidirectional long short-term memory networks (LSTMs). The reason is that data has
to be fed to a BLSTM network in both the forward and backward directions, and the
final outputs depend on both directions, which makes online processing impossible.
Also, the 32 ms synthesis window is replaced with an 8 ms window to suit low-latency
applications such as hearing aids, since the algorithmic latency depends on the
length of the synthesis window. Furthermore, the beginning of the audio mixture, here
referred to as the buffer, is used to obtain the cluster centers of the constituent
speakers in the mixture, serving as initialization. Those centers are then used to
assign clusters for the rest of the mixture, achieving speaker separation with a
latency of 8 ms. The algorithm is evaluated on the Wall Street Journal corpus (WSJ0).
Changing the networks from BLSTM to LSTM while keeping the same window
length degrades the separation performance, measured by signal-to-distortion ratio
(SDR), by 1.0 dB, which implies that future information is important for the
separation. To investigate the effect of window length with the same network
structure (LSTM), the window length is changed from 32 ms to 8 ms, causing another
1.1 dB drop in SDR. For the low-latency deep clustering speaker separation system,
different buffer durations are studied. It is observed that the separation
performance initially increases with the buffer length, but beyond a buffer length
of 0.3 s it remains steady. Compared to the offline deep clustering separation
system, a degradation of 2.8 dB in SDR is observed for the online system.
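The buffer-based initialization described above can be sketched as follows. This is a simplified stand-in, assuming synthetic Gaussian embeddings in place of the LSTM's learned time-frequency embeddings and plain two-cluster k-means (Lloyd iterations) for the clustering step; dimensions and durations are arbitrary.

```python
import numpy as np

# Sketch of the buffer-based online clustering step: estimate cluster
# centers from the first part of the mixture (the buffer), then assign
# each later time-frequency embedding to its nearest center. The
# embeddings are synthetic stand-ins for the network's outputs.
rng = np.random.default_rng(1)

D, n_buf = 20, 300                           # embedding dim, buffer frames
buffer_emb = np.concatenate([                # two artificial speaker clusters
    rng.normal(+1.0, 0.5, size=(n_buf // 2, D)),
    rng.normal(-1.0, 0.5, size=(n_buf // 2, D)),
])

# a few Lloyd (k-means) iterations on the buffer to find the two centers,
# initialized from two buffer points for simplicity
centers = np.stack([buffer_emb[0], buffer_emb[-1]])
for _ in range(10):
    d2 = ((buffer_emb[:, None] - centers) ** 2).sum(-1)
    assign = d2.argmin(axis=1)
    centers = np.stack([buffer_emb[assign == k].mean(0) for k in (0, 1)])

# online phase: label each incoming embedding by its nearest center,
# which yields the binary separation mask with only the synthesis-window
# latency, since no further clustering is needed
stream_emb = rng.normal(+1.0, 0.5, size=(1000, D))   # frames from speaker 1
labels = ((stream_emb[:, None] - centers) ** 2).sum(-1).argmin(axis=1)
```

The design choice this illustrates: all the expensive clustering happens once on the buffer, so each subsequent frame costs only a nearest-center lookup, which is what makes the 8 ms latency attainable.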
Online Monaural Speech Enhancement Using Delayed Subband LSTM
This paper proposes a delayed subband LSTM network for online monaural
(single-channel) speech enhancement. The proposed method is developed in the
short-time Fourier transform (STFT) domain. Online processing requires
frame-by-frame signal reception and processing. A paramount feature of the
proposed method is that the same LSTM is used across frequencies, which
drastically reduces the number of network parameters, the amount of training
data and the computational burden. Training is performed in a subband manner:
the input consists of one frequency, together with a few context frequencies.
The network learns a speech-to-noise discriminative function relying on the
signal stationarity and on the local spectral pattern, based on which it
predicts a clean-speech mask at each frequency. To exploit future information,
i.e. look-ahead, we propose an output-delayed subband architecture, which
allows the unidirectional forward network to process a few future frames in
addition to the current frame. We use the proposed method to participate
in the DNS real-time speech enhancement challenge. Experiments with the DNS
dataset show that the proposed method achieves better scores on the performance
measures than the DNS baseline method, which learns the full-band spectra using a
gated recurrent unit network.
Comment: Paper submitted to Interspeech 2020
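The subband input layout and the output delay can be sketched as follows. This is an illustrative data-preparation sketch, assuming arbitrary sizes (257 frequency bins, 2 context frequencies on each side, a 2-frame delay) that are stand-ins rather than the paper's exact configuration; the network itself is omitted and its outputs are random placeholders.

```python
import numpy as np

# Sketch of the subband scheme: for each frequency k, the input at frame
# t stacks the STFT magnitude at k with a few neighboring (context)
# frequencies, so one shared LSTM can be applied at every frequency.
# The prediction is delayed by d frames, giving the unidirectional
# network d frames of look-ahead.
rng = np.random.default_rng(2)

T, F, ctx, d = 50, 257, 2, 2                 # frames, freq bins, context, delay
spec = np.abs(rng.standard_normal((T, F)))   # stand-in STFT magnitudes

padded = np.pad(spec, ((0, 0), (ctx, ctx)), mode="edge")
# one sequence per frequency, each frame carrying 2*ctx + 1 values
subband_inputs = np.stack([padded[:, k:k + 2 * ctx + 1] for k in range(F)])

# output delay: the mask emitted at frame t is the estimate for frame
# t - d, so frames t-d+1 .. t act as look-ahead
masks = rng.uniform(size=(F, T))             # stand-in network outputs
aligned = masks[:, d:]                       # estimates for frames 0 .. T-d-1
```

Sharing one LSTM across all F frequencies is what shrinks the parameter count: the model size is independent of the number of frequency bins, at the price of running the network once per subband.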