
    Deep Neural Network Based Low-Latency Speech Separation with Asymmetric Analysis-Synthesis Window Pair

    Time-frequency masking or spectrum prediction computed via short symmetric windows is commonly used in low-latency deep neural network (DNN) based source separation. In this paper, we propose the use of an asymmetric analysis-synthesis window pair, which allows training with targets of better frequency resolution while retaining the low latency at inference required for real-time speech enhancement or assisted-hearing applications. To assess our approach across model types and datasets, we evaluate it with both a speaker-independent deep clustering (DC) model and a speaker-dependent mask inference (MI) model. We report an improvement in separation performance of up to 1.5 dB in terms of source-to-distortion ratio (SDR) while maintaining an algorithmic latency of 8 ms.
    Comment: Accepted to EUSIPCO-202
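    The idea above can be sketched as follows: the analysis window stays long (good frequency resolution for the network's targets), while the synthesis window is confined to the tail of the frame, so only the most recent samples are needed before a frame can be emitted. This is a toy construction in numpy, not the paper's actual window design; the window shapes, lengths, and the 16 kHz rate are illustrative assumptions.

    ```python
    import numpy as np

    SR = 16000
    ANALYSIS_LEN = 512    # 32 ms analysis window (assumed size)
    SYNTHESIS_LEN = 128   # 8 ms synthesis window -> sets the algorithmic latency

    def asymmetric_windows(n_analysis, n_synthesis):
        """Toy asymmetric pair: a long analysis window whose falling tail is
        matched by a short synthesis window, so only the last n_synthesis
        samples of each frame are synthesized."""
        # Analysis: slow rise over the long head, fast fall over the short tail.
        n_head = n_analysis - n_synthesis // 2
        head = np.hanning(2 * n_head)[:n_head]
        tail = np.hanning(n_synthesis)[n_synthesis // 2:]
        analysis = np.concatenate([head, tail])
        # Synthesis: zero everywhere except the final n_synthesis samples.
        synthesis = np.zeros(n_analysis)
        synthesis[-n_synthesis:] = np.hanning(n_synthesis)
        return analysis, synthesis

    ana, syn = asymmetric_windows(ANALYSIS_LEN, SYNTHESIS_LEN)
    latency_ms = 1000 * SYNTHESIS_LEN / SR   # latency follows the synthesis window
    ```

    With these assumed sizes the analysis frame still spans 32 ms of context, but the algorithmic latency is only 8 ms, because the overlap-add output depends solely on the short synthesis window.
    
    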

    TasNet: time-domain audio separation network for real-time, single-channel speech separation

    Robust speech processing in multi-talker environments requires effective speech separation. Recent deep learning systems have made significant progress toward solving this problem, yet it remains challenging, particularly in real-time, short-latency applications. Most methods attempt to construct a mask for each source in the time-frequency representation of the mixture signal, which is not necessarily an optimal representation for speech separation. In addition, time-frequency decomposition introduces inherent problems such as the decoupling of phase and magnitude, and the long time window required to achieve sufficient frequency resolution. We propose the Time-domain Audio Separation Network (TasNet) to overcome these limitations. We directly model the signal in the time domain using an encoder-decoder framework and perform source separation on nonnegative encoder outputs. This removes the frequency decomposition step and reduces the separation problem to the estimation of source masks on the encoder outputs, which are then synthesized by the decoder. Our system outperforms current state-of-the-art causal and non-causal speech separation algorithms, reduces the computational cost of speech separation, and significantly reduces the minimum required latency of the output. This makes TasNet suitable for applications where a low-power, real-time implementation is desirable, such as hearable and telecommunication devices.
    Comment: Camera-ready version for ICASSP 2018, Calgary, Canada
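    The encoder-mask-decoder pipeline described above can be illustrated with a minimal numpy sketch. This is not the trained TasNet network: the frame length, number of basis signals, and the random stand-in bases and mask logits are all illustrative assumptions; only the shape of the computation (nonnegative encoding, per-source masking, decoding by a weighted sum of basis signals) follows the abstract.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    L, N = 40, 64   # frame length and number of basis signals (assumed sizes)
    enc_basis = rng.standard_normal((N, L))   # stand-in for the learned encoder basis
    dec_basis = rng.standard_normal((N, L))   # stand-in for the learned decoder basis

    def encode(frames):
        """Nonnegative mixture weights: ReLU of frame/basis inner products."""
        return np.maximum(frames @ enc_basis.T, 0.0)   # (T, N)

    def separate(weights, masks):
        """Apply per-source masks (summing to 1 across sources) to encoder outputs."""
        return weights[None] * masks                   # (S, T, N)

    def decode(src_weights):
        """Synthesize each source as a weighted sum of decoder basis signals."""
        return src_weights @ dec_basis                 # (S, T, L)

    # Two-source toy run on random mixture frames with softmax-normalized masks.
    frames = rng.standard_normal((10, L))              # (T, L) windowed mixture frames
    w = encode(frames)
    logits = rng.standard_normal((2, 10, N))
    masks = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
    sources = decode(separate(w, masks))
    ```

    Because the masks sum to one across sources, the encoder representation of the mixture is exactly partitioned between the two estimated sources before decoding.
    
    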

    Online Speaker Separation Using Deep Clustering

    In this thesis, a low-latency variant of the speaker-independent deep clustering method is proposed for speaker separation. Compared to the offline deep clustering separation system, bidirectional long short-term memory networks (BLSTMs) are replaced with unidirectional long short-term memory networks (LSTMs): a BLSTM must process the data in both the forward and backward directions, and its final outputs depend on both, which makes online processing impossible. Additionally, the 32 ms synthesis window is replaced with an 8 ms window to suit low-latency applications such as hearing aids, since the algorithmic latency depends on the length of the synthesis window. Furthermore, the beginning of the audio mixture, referred to here as the buffer, is used to obtain the cluster centers of the constituent speakers in the mixture, serving as initialization. Those centers are then used to assign clusters for the rest of the mixture, achieving speaker separation with a latency of 8 ms. The algorithm is evaluated on the Wall Street Journal corpus (WSJ0). Changing the networks from BLSTM to LSTM while keeping the same window length degrades the separation performance, measured by signal-to-distortion ratio (SDR), by 1.0 dB, which implies that future information is important for separation. Investigating the effect of window length with the same network structure (LSTM), changing the window length from 32 ms to 8 ms costs another 1.1 dB in SDR. For the low-latency deep clustering speaker separation system, different buffer durations are studied: separation performance initially improves as the buffer grows, but beyond a buffer length of 0.3 s it remains steady. Compared to the offline deep clustering separation system, a degradation of 2.8 dB in SDR is observed for the online system.
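    The buffer-based initialization described above can be sketched as follows: cluster centers are estimated once from the embeddings of the buffer, and every subsequent frame is assigned to the nearest fixed center, so no look-ahead beyond the frame itself is needed. This is a minimal numpy sketch of the clustering step only, not the thesis system; the embedding dimension, the synthetic well-separated embeddings, and the farthest-point k-means initialization are illustrative assumptions.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    D, K = 20, 2   # embedding dimension and number of speakers (assumed)

    def kmeans(emb, k, iters=20):
        """Plain k-means on the buffer embeddings to get speaker centers.
        Farthest-point initialization, then Lloyd iterations."""
        centers = [emb[0]]
        for _ in range(1, k):
            d = np.min([((emb - c) ** 2).sum(-1) for c in centers], axis=0)
            centers.append(emb[d.argmax()])
        centers = np.array(centers)
        for _ in range(iters):
            assign = ((emb[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
            for j in range(k):
                if np.any(assign == j):
                    centers[j] = emb[assign == j].mean(axis=0)
        return centers

    def online_masks(frame_emb, centers):
        """Assign each embedding of the current frame to its nearest (fixed)
        center; the one-hot assignments act as binary separation masks."""
        d = ((frame_emb[:, None] - centers[None]) ** 2).sum(-1)   # (bins, K)
        return np.eye(len(centers))[d.argmin(axis=1)]             # (bins, K)

    # The buffer (e.g. the first 0.3 s of the mixture) initializes the centers...
    buffer_emb = np.concatenate([rng.normal(-2, 0.5, (50, D)),
                                 rng.normal(2, 0.5, (50, D))])
    centers = kmeans(buffer_emb, K)
    # ...then each incoming frame is masked using the fixed centers.
    masks = online_masks(rng.normal(-2, 0.5, (30, D)), centers)
    ```

    Keeping the centers fixed after the buffer is what makes the per-frame step cheap and causal; the trade-off, as the abstract reports, is a performance gap relative to the offline system that re-clusters over the full utterance.
    
    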