
    CTC-Segmentation of Large Corpora for German End-to-end Speech Recognition

    Recent end-to-end Automatic Speech Recognition (ASR) systems have demonstrated the ability to outperform conventional hybrid DNN/HMM ASR. Aside from architectural improvements, those models grew in terms of depth, parameters and model capacity. However, these models also require more training data to achieve comparable performance. In this work, we combine freely available corpora for German speech recognition, including yet unlabeled speech data, into a large dataset of over 1700 h of speech data. For data preparation, we propose a two-stage approach that uses an ASR model pre-trained with Connectionist Temporal Classification (CTC) to bootstrap more training data from unsegmented or unlabeled training data. Utterances are then extracted from label probabilities obtained from the network trained with CTC to determine segment alignments. With this training data, we trained a hybrid CTC/attention Transformer model that achieves 12.8% WER on the Tuda-DE test set, surpassing the previous baseline of 14.4% of conventional hybrid DNN/HMM ASR. (Comment: Published at SPECOM 202)
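The alignment step described above (extracting utterance segments from CTC label probabilities) can be illustrated with a simplified sketch. This is not the full CTC-segmentation algorithm from the paper; it is a minimal monotonic dynamic-programming alignment over frame-wise log-probabilities, with the `ctc_segment` function name and the stay/advance formulation being illustrative assumptions:

```python
import numpy as np

def ctc_segment(log_probs, labels):
    """Simplified sketch of aligning a known label sequence to audio frames
    (NOT the full CTC-segmentation algorithm): at each frame the path either
    stays on the current label or advances to the next one, maximizing the
    total log-probability.

    log_probs: (T, V) frame-wise label log-probabilities from a CTC model.
    labels: target label-id sequence of length L.
    Returns per-label (start_frame, end_frame) segments.
    """
    T, _ = log_probs.shape
    L = len(labels)
    NEG = -1e9
    dp = np.full((T, L), NEG)
    back = np.zeros((T, L), dtype=int)  # 0 = stayed on label, 1 = advanced
    dp[0, 0] = log_probs[0, labels[0]]
    for t in range(1, T):
        for j in range(L):
            stay = dp[t - 1, j]
            adv = dp[t - 1, j - 1] if j > 0 else NEG
            if adv > stay:
                dp[t, j] = adv + log_probs[t, labels[j]]
                back[t, j] = 1
            else:
                dp[t, j] = stay + log_probs[t, labels[j]]
    # Backtrack the path ending on the final label to read off boundaries.
    segs = [[T - 1, T - 1] for _ in range(L)]
    j = L - 1
    for t in range(T - 1, 0, -1):
        if back[t, j]:
            segs[j][0] = t
            j -= 1
            segs[j][1] = t - 1
    segs[0][0] = 0
    return [tuple(s) for s in segs]
```

The real method additionally handles the CTC blank label and derives per-segment confidence scores to discard poorly aligned utterances; this sketch only shows the core monotonic alignment idea.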

    Lightweight End-to-End Speech Enhancement Generative Adversarial Network Using Sinc Convolutions

    Generative adversarial networks (GANs) have recently garnered significant attention for their use in speech enhancement tasks, in which they generally process and reconstruct speech waveforms directly. Existing GANs for speech enhancement rely solely on the convolution operation, which may not accurately characterize the local information of speech signals—particularly high-frequency components. Sinc convolution has been proposed to allow the GAN to learn more meaningful filters in the input layer, and has achieved remarkable success in several speech signal processing tasks. Nevertheless, Sinc convolution for speech enhancement is still an under-explored research direction. This paper proposes Sinc–SEGAN, a novel generative adversarial architecture for speech enhancement, which usefully merges two powerful paradigms: Sinc convolution and the speech enhancement GAN (SEGAN). There are two highlights of the proposed system. First, it works in an end-to-end manner, overcoming the distortion caused by imperfect phase estimation. Second, the system derives a customized filter bank, tuned for the desired application compactly and efficiently. We empirically study the influence of different configurations of Sinc convolution, including the placement of the Sinc convolution layer, length of input signals, number of Sinc filters, and kernel size of Sinc convolution. Moreover, we employ a set of data augmentation techniques in the time domain, which further improve the system performance and its generalization abilities. Compared to competitive baseline systems, Sinc–SEGAN overtakes all of them with drastically reduced system parameters, demonstrating its effectiveness for practical usage, e.g., hearing aid design and cochlear implants. Additionally, data augmentation methods further boost Sinc–SEGAN performance across classic objective evaluation criteria for speech enhancement.
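The "customized filter bank" idea rests on the Sinc-convolution parametrization (as in SincNet): each kernel is a windowed band-pass sinc whose only free parameters are its two cutoff frequencies, rather than arbitrary learned weights. A minimal numpy sketch, where the function name and the final normalization are illustrative assumptions:

```python
import numpy as np

def sinc_filter(f1, f2, kernel_size, fs=16000):
    """Band-pass FIR kernel parametrized only by its cutoff frequencies
    f1 < f2 (Hz) -- the core idea of Sinc convolution: the learnable
    parameters are cutoffs, not free-form convolution weights."""
    n = np.arange(kernel_size) - (kernel_size - 1) / 2

    def low_pass(fc):
        # Ideal low-pass impulse response with cutoff fc (Hz).
        return 2 * fc / fs * np.sinc(2 * fc / fs * n)

    h = low_pass(f2) - low_pass(f1)   # band-pass = difference of low-passes
    h *= np.hamming(kernel_size)      # window to reduce spectral ripple
    return h / np.sum(np.abs(h))      # illustrative normalization choice
```

In a trainable layer, `f1` and `f2` would be learned per filter by gradient descent, so the input layer converges to a compact, task-tuned filter bank.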

    Light-Weight Self-Attention Augmented Generative Adversarial Networks for Speech Enhancement

    Generative adversarial networks (GANs) have shown their superiority for speech enhancement. Nevertheless, most previous attempts had convolutional layers as the backbone, which may obscure long-range dependencies across an input sequence due to the convolution operator’s local receptive field. One popular solution is substituting recurrent neural networks (RNNs) for convolutional neural networks, but RNNs are computationally inefficient because their temporal iterations cannot be parallelized. To circumvent this limitation, we propose an end-to-end system for speech enhancement by applying the self-attention mechanism to GANs. We aim to achieve a system that is flexible in modeling both long-range and local interactions and can be computationally efficient at the same time. Our work is implemented in three phases: firstly, we apply the stand-alone self-attention layer in speech enhancement GANs. Secondly, we employ locality modeling on the stand-alone self-attention layer. Lastly, we investigate the functionality of the self-attention augmented convolutional speech enhancement GANs. Systematic experiment results indicate that, equipped with the stand-alone self-attention layer, the system outperforms baseline systems across classic evaluation criteria with up to 95% fewer parameters. Moreover, locality modeling can be a parameter-free approach for further performance improvement, and self-attention augmentation also overtakes all baseline systems with acceptably increased parameters.
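The contrast drawn above between a convolution's local receptive field and self-attention's global one can be sketched in a few lines. This is a generic single-head self-attention layer over a 1-D feature sequence, not the paper's specific architecture; the function name and projection matrices are illustrative assumptions:

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head self-attention over a sequence x of shape (T, d).
    Every output position mixes information from ALL T positions, unlike
    a convolution, whose output depends only on a fixed local window."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])        # (T, T) pairwise affinities
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)       # softmax over positions
    return attn @ v
```

Because the (T, T) score matrix is one matrix product, all positions are processed in parallel, avoiding the sequential bottleneck of RNNs; "locality modeling" then amounts to restricting or biasing these scores toward nearby positions.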