165 research outputs found

    Speech Dereverberation Based on Integrated Deep and Ensemble Learning Algorithm

    Reverberation, which is generally caused by sound reflections from walls, ceilings, and floors, can severely degrade the performance of acoustic applications. Because it arises from a complicated combination of attenuation and time-delay effects, reverberation is difficult to characterize, and effectively recovering anechoic speech signals from reverberant ones remains a challenging task. In the present study, we propose a novel integrated deep and ensemble learning algorithm (IDEA) for speech dereverberation. The IDEA consists of offline and online phases. In the offline phase, we train multiple dereverberation models, each aiming to precisely dereverberate speech signals in a particular acoustic environment; a unified fusion function is then estimated to integrate the information from the multiple dereverberation models. In the online phase, an input utterance is first processed by each of the dereverberation models, and the outputs of all models are then integrated by the fusion function to generate the final anechoic signal. We evaluated the IDEA on designed acoustic environments, including both matched and mismatched conditions between the training and testing data. Experimental results confirm that the proposed IDEA outperforms a single deep-neural-network-based dereverberation model with the same model architecture and training data.
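    The offline/online split described in this abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the per-environment models are small random-feature regressors rather than deep networks, the fusion function is a plain least-squares map, the "reverberation" is a synthetic linear mixing, and the names `RandomFeatureModel`, `train_offline`, and `enhance_online` are hypothetical. It is not the authors' implementation.

```python
import numpy as np

class RandomFeatureModel:
    """Toy stand-in for one dereverberation network (assumption, not the paper's model)."""

    def __init__(self, dim, hidden, rng):
        self.proj = rng.standard_normal((dim, hidden))  # fixed random hidden layer
        self.readout = None

    def _features(self, x):
        return np.tanh(x @ self.proj)

    def fit(self, x, y):
        # Least-squares readout from hidden features to clean targets.
        self.readout, *_ = np.linalg.lstsq(self._features(x), y, rcond=None)
        return self

    def predict(self, x):
        return self._features(x) @ self.readout

def stack_outputs(models, rev):
    """Concatenate the outputs of all models along the feature axis."""
    return np.hstack([m.predict(rev) for m in models])

def train_offline(env_data, rng, hidden=16):
    """Offline phase: one model per environment, then one fusion function."""
    dim = env_data[0][0].shape[1]
    models = [RandomFeatureModel(dim, hidden, rng).fit(rev, clean)
              for rev, clean in env_data]
    rev_all = np.vstack([rev for rev, _ in env_data])
    clean_all = np.vstack([clean for _, clean in env_data])
    # Fusion function (assumed linear here), fitted on the pooled data.
    fusion, *_ = np.linalg.lstsq(stack_outputs(models, rev_all), clean_all, rcond=None)
    return models, fusion

def enhance_online(rev, models, fusion):
    """Online phase: run every model, then fuse their outputs."""
    return stack_outputs(models, rev) @ fusion

rng = np.random.default_rng(0)

def make_env(n=200, dim=4):
    # Synthetic environment: clean features distorted by a random linear mixing.
    clean = rng.standard_normal((n, dim))
    mix = np.eye(dim) + 0.3 * rng.standard_normal((dim, dim))
    rev = clean @ mix + 0.05 * rng.standard_normal((n, dim))
    return rev, clean

env_data = [make_env() for _ in range(3)]
models, fusion = train_offline(env_data, rng)

rev_all = np.vstack([rev for rev, _ in env_data])
clean_all = np.vstack([clean for _, clean in env_data])
fused = enhance_online(rev_all, models, fusion)
fused_mse = np.mean((fused - clean_all) ** 2)
single_mse = [np.mean((m.predict(rev_all) - clean_all) ** 2) for m in models]
```

    On the pooled training data, the least-squares fusion cannot do worse than any single model, because selecting one model's output is itself a valid fusion. Whether the same holds on unseen environments is an empirical question that this sketch does not answer.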

    Noise Types Adaptation for Speech Enhancement with Recurrent Neural Network

    Speech enhancement is a critical part of automatic speech recognition systems. With the recent development of deep-learning-based techniques, speech enhancement systems trained with neural networks have significantly improved performance. While many of the latest speech enhancement systems are effective at maximizing the perceptual quality of noisy signals, they show drawbacks when the test signals contain noise types never seen during training: performance on signals with unseen noise is relatively poor compared with signals with seen noise. Such dissimilarity between training and testing conditions can cause a serious performance decline in a deep learning task. In this work, a new method is proposed to address this noise-type mismatch problem. The framework has three parts: the autoencoder, the gradient reversal layers, and the recurrent neural networks. The proposed framework can weaken the influence of noise type when handling arbitrary noisy signals. This work shows that the new method outperforms the baseline models in unseen-noise situations.
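    A gradient reversal layer, as commonly used in domain-adversarial training, acts as the identity in the forward pass and flips the sign of the gradient in the backward pass. The minimal sketch below shows only that mechanism in NumPy. The class name `GradientReversal`, the scaling factor `lam`, and the idea that the reversed gradient comes from a noise-type classifier are assumptions; the abstract does not say how the layers are wired into the framework.

```python
import numpy as np

class GradientReversal:
    """Identity forward; gradient multiplied by -lam on the way back (hypothetical helper)."""

    def __init__(self, lam=1.0):
        self.lam = lam  # strength of the reversal (assumed hyperparameter)

    def forward(self, x):
        # Features pass through unchanged.
        return x

    def backward(self, grad_output):
        # The upstream encoder receives the negated, scaled gradient, so it is
        # pushed to make the downstream classifier's task harder.
        return -self.lam * grad_output

grl = GradientReversal(lam=0.5)
features = np.array([0.2, -1.0, 3.0])
out = grl.forward(features)

# Gradient arriving from an assumed noise-type classifier head.
grad_from_classifier = np.array([1.0, -2.0, 0.5])
grad_to_encoder = grl.backward(grad_from_classifier)
```

    In an autodiff framework the same behavior is usually implemented as a custom operation with a hand-written backward rule; the explicit `forward`/`backward` pair here only makes the sign flip visible.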

    Speech Separation based on Contrastive Learning and Deep Modularization

    Current state-of-the-art tools for monaural speech separation rely on supervised learning. This means that they must deal with the permutation problem and are affected by a mismatch between the number of speakers used in training and in inference. Moreover, their performance relies heavily on the availability of high-quality labelled data. These problems can be effectively addressed by employing a fully unsupervised technique for speech separation. In this paper, we use contrastive learning to establish frame representations and then use the learned representations in a downstream deep modularization task. Concretely, we demonstrate experimentally that, in speech separation, different frames of a speaker can be viewed as augmentations of a given hidden standard frame of that speaker. The frames of a speaker contain enough overlapping prosodic information, which is key in speech separation. Based on this, we implement self-supervised learning to minimize the distance between frames belonging to a given speaker. The learned representations are used in a downstream deep modularization task to cluster frames based on speaker identity. Evaluation of the developed technique on WSJ0-2mix and WSJ0-3mix shows that it attains SI-SNRi and SDRi of 20.8 and 21.0, respectively, on WSJ0-2mix, and SI-SNRi and SDRi of 20.7 and 20.7, respectively, on WSJ0-3mix. Its greatest strength is that its performance does not degrade significantly as the number of speakers increases.
    Comment: arXiv admin note: substantial text overlap with arXiv:2212.0036
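    For readers unfamiliar with the SI-SNRi figure reported above, the following sketch computes scale-invariant SNR and its improvement over the unprocessed mixture, using the commonly cited definition. The signals are synthetic and the function names `si_snr` and `si_snr_improvement` are illustrative; this is not the evaluation code used in the paper.

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB (commonly used definition)."""
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to isolate the target component.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    noise = estimate - projection
    return 10 * np.log10((np.dot(projection, projection) + eps)
                         / (np.dot(noise, noise) + eps))

def si_snr_improvement(estimate, mixture, target):
    """SI-SNRi: gain of the separated estimate over the raw mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)

rng = np.random.default_rng(0)
t = np.arange(8000) / 8000.0
target = np.sin(2 * np.pi * 440 * t)        # synthetic "speaker"
interferer = rng.standard_normal(t.shape)   # synthetic interference
mixture = target + interferer
estimate = target + 0.1 * interferer        # pretend separator output

improvement = si_snr_improvement(estimate, mixture, target)
```

    Because the interference amplitude in `estimate` is one tenth of that in `mixture`, the improvement here comes out close to 20 dB. Rescaling `estimate` by any nonzero constant leaves `si_snr` essentially unchanged, which is the scale invariance the metric is named for.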