    On End-to-end Multi-channel Time Domain Speech Separation in Reverberant Environments

    This paper introduces a new method for multi-channel time-domain speech separation in reverberant environments. A fully-convolutional neural network is used to separate speech directly from multiple microphone recordings, without conventional spatial feature extraction. To reduce the influence of reverberation on the spatial cues, a dereverberation pre-processing method is applied, further improving separation performance. A spatialized version of the wsj0-2mix dataset was simulated to evaluate the proposed system, and both the source separation quality and the speech recognition performance of the separated signals were evaluated objectively. Experiments show that the proposed fully-convolutional network improves the source separation metric and the word error rate (WER) by more than 13% and 50% relative, respectively, over a reference system with conventional features. Applying dereverberation as pre-processing further reduces the WER by 29% relative when using an acoustic model trained on clean and reverberated data. Comment: Presented at IEEE ICASSP 202
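    The abstract describes a learned, fully-convolutional front end that replaces handcrafted spatial features. The sketch below (PyTorch) illustrates the general shape of such a multi-channel time-domain separator: a learned encoder over the stacked microphone waveforms, a TCN-style dilated-convolution separator, a mask per source, and a transposed-convolution decoder. All layer sizes, the mask head, and the class itself are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch (PyTorch, NOT the authors' code) of a fully-convolutional
# multi-channel time-domain separator: the encoder ingests all microphone
# channels at once, so spatial cues are learned implicitly rather than
# extracted as handcrafted features. Layer sizes are illustrative.
import torch
import torch.nn as nn

class MultiChannelSeparator(nn.Module):
    def __init__(self, num_mics=4, num_sources=2, enc_dim=256, kernel=16):
        super().__init__()
        self.num_sources = num_sources
        self.enc_dim = enc_dim
        # Encoder: learned filterbank over the raw multi-channel waveform.
        self.encoder = nn.Conv1d(num_mics, enc_dim, kernel, stride=kernel // 2)
        # Separator: a stack of dilated 1-D conv blocks (TCN-style).
        self.separator = nn.Sequential(*[
            nn.Sequential(
                nn.Conv1d(enc_dim, enc_dim, 3, padding=2 ** d, dilation=2 ** d),
                nn.PReLU(),
                nn.GroupNorm(1, enc_dim),
            )
            for d in range(6)
        ])
        # Mask head: one multiplicative mask per source over the encoder output.
        self.mask = nn.Conv1d(enc_dim, enc_dim * num_sources, 1)
        # Decoder: transposed conv back to single-channel waveforms.
        self.decoder = nn.ConvTranspose1d(enc_dim, 1, kernel, stride=kernel // 2)

    def forward(self, x):                     # x: (batch, num_mics, samples)
        feats = torch.relu(self.encoder(x))   # (batch, enc_dim, frames)
        masks = torch.sigmoid(self.mask(self.separator(feats)))
        masks = masks.view(x.size(0), self.num_sources, self.enc_dim, -1)
        # Apply each source mask, then decode back to the time domain.
        out = [self.decoder(feats * masks[:, s]) for s in range(self.num_sources)]
        return torch.cat(out, dim=1)          # (batch, num_sources, samples)

model = MultiChannelSeparator()
mixture = torch.randn(1, 4, 32000)            # 2 s of 16 kHz audio, 4 mics
separated = model(mixture)
print(separated.shape)                        # torch.Size([1, 2, 32000])
```

    In the pipeline the paper evaluates, a dereverberation step would be applied to the multi-channel waveform before it reaches the encoder; the sketch omits that stage.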

    IRAWNET: A Method for Transcribing Indonesian Classical Music Notes Directly from Multichannel Raw Audio

    A challenging task in developing real-time Automatic Music Transcription (AMT) methods is leveraging multichannel raw audio directly, without handcrafted signal transformation and feature extraction steps. The crucial problems are that raw audio contains only an amplitude value at each timestamp, and that the left- and right-channel signals have different amplitude intensities and onset times. This study addresses these issues by proposing IRawNet, a method with fused feature layers that merge the differing amplitudes from multichannel raw audio. IRawNet aims to transcribe Indonesian classical music notes and was validated on a Gamelan music dataset, whose class imbalance was overcome with the Synthetic Minority Oversampling Technique (SMOTE). Under various experimental scenarios, the performance effects of oversampled data, hyperparameter tuning, and fused feature layers are analyzed, and the proposed method is compared with a Temporal Convolutional Network (TCN), Deep WaveNet, and a monochannel IRawNet. The results show that the proposed method achieves superior results on almost all performance metrics, with 0.871 accuracy, 0.988 AUC, 0.927 precision, 0.896 recall, and 0.896 F1 score.
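    As a rough illustration of the two ingredients named in the abstract, the sketch below (PyTorch plus imbalanced-learn) pairs per-channel convolutional branches, whose outputs are concatenated in a fused feature layer, with SMOTE oversampling of flattened training frames. The branch sizes, class count, and frame length are assumptions; the paper's exact IRawNet architecture is not reproduced here.

```python
# A minimal sketch (hypothetical layer sizes) of a fused feature layer in the
# spirit of IRawNet: each raw-audio channel gets its own 1-D conv branch, and
# the branch outputs are concatenated so the network can reconcile differing
# amplitudes and onset times between the left and right channels.
import torch
import torch.nn as nn
from imblearn.over_sampling import SMOTE

class FusedRawNet(nn.Module):
    def __init__(self, num_classes=10, frame_len=4096):
        super().__init__()
        def branch():
            # Per-channel feature extractor operating on raw samples.
            return nn.Sequential(
                nn.Conv1d(1, 32, 64, stride=16), nn.ReLU(),
                nn.Conv1d(32, 64, 8, stride=4), nn.ReLU(),
                nn.AdaptiveAvgPool1d(16),
            )
        self.left, self.right = branch(), branch()
        # Fused feature layer: merge both branches before classification.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(2 * 64 * 16, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):                    # x: (batch, 2, frame_len)
        l = self.left(x[:, 0:1])             # (batch, 64, 16)
        r = self.right(x[:, 1:2])            # (batch, 64, 16)
        return self.head(torch.cat([l, r], dim=1))

# Class-imbalance handling: oversample minority note classes on flattened
# frames before training (SMOTE operates on 2-D feature matrices). The data
# here is a toy stand-in, not the Gamelan dataset.
X = torch.randn(200, 2 * 4096).numpy()       # toy frames, flattened
y = [0] * 150 + [1] * 50                     # imbalanced toy labels
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
```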