An Efficient and Robust Multi-Stream Framework for End-to-End Speech Recognition
In voice-enabled domestic or meeting environments, distributed microphone arrays aim to transcribe distant-speech interaction into text with high accuracy.
However, when dynamic corruption by noise, reverberation, or human movement is present, there is no guarantee that any single microphone array (stream) remains constantly informative. In such cases, an appropriate strategy to dynamically fuse streams is necessary.
The multi-stream paradigm in Automatic Speech Recognition (ASR) considers scenarios where parallel streams carry diverse or complementary task-related knowledge. Such streams may be microphone arrays, frequency bands, or different modalities. A robust stream fusion is therefore crucial to emphasize informative streams over corrupted ones, especially under unseen conditions. This thesis focuses on improving the performance and robustness of speech recognition in multi-stream scenarios.
With the increasing use of Deep Neural Networks (DNNs) in ASR, End-to-End (E2E) approaches, which directly transcribe human speech into text, have received greater attention. In this thesis, a multi-stream framework is presented based on the joint Connectionist Temporal Classification/ATTention (CTC/ATT) E2E model, where parallel streams are represented by separate encoders. On top of the regular attention networks, a secondary stream-fusion network is used to steer the decoder toward the most informative streams.
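As a rough sketch of how such a stream-fusion network might look, the following PyTorch snippet weights per-stream context vectors with a secondary attention before they reach the decoder; the module structure and dimensions are illustrative assumptions, not the thesis implementation:

```python
import torch
import torch.nn as nn

class StreamFusion(nn.Module):
    """Hypothetical sketch of hierarchical stream attention: a secondary
    attention network weights per-stream context vectors."""
    def __init__(self, ctx_dim, dec_dim):
        super().__init__()
        self.score = nn.Linear(ctx_dim + dec_dim, 1)  # energy per stream

    def forward(self, stream_contexts, dec_state):
        # stream_contexts: (batch, n_streams, ctx_dim) per-stream contexts
        # dec_state: (batch, dec_dim) current decoder state
        n = stream_contexts.size(1)
        q = dec_state.unsqueeze(1).expand(-1, n, -1)
        e = self.score(torch.cat([stream_contexts, q], dim=-1))  # (B, n, 1)
        w = torch.softmax(e, dim=1)           # stream-level attention weights
        fused = (w * stream_contexts).sum(1)  # (B, ctx_dim) fused context
        return fused, w.squeeze(-1)

# usage: fuse contexts from two microphone-array encoders
fusion = StreamFusion(ctx_dim=256, dec_dim=320)
ctx = torch.randn(4, 2, 256)
state = torch.randn(4, 320)
fused, weights = fusion(ctx, state)
```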
The MEM-Array model aims at improving far-field ASR robustness using microphone arrays, each handled by a separate encoder. Since an increasing number of streams (encoders) requires substantial memory and massive amounts of parallel data, a practical two-stage training strategy is designed to address these issues. Furthermore, a two-stage augmentation scheme is presented to improve the robustness of the multi-stream model. In MEM-Res, two heterogeneous encoders with different architectures, temporal resolutions, and separate CTC networks work in parallel to extract complementary information from the same acoustics. Compared with the best single-stream performance, both models achieve substantial improvements, outperforming alternative fusion strategies.
While the proposed framework optimizes information in multi-stream scenarios, this thesis also studies Performance Monitoring (PM) measures to predict whether the recognition results of an E2E model are reliable without ground-truth knowledge. Four PM techniques are investigated, suggesting that PM measures based on attention distributions and decoder posteriors are well correlated with true performance.
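One family of PM measures above operates on attention distributions. A minimal sketch of such a proxy, assuming access to the per-step attention weights, is the average alignment entropy, since peaky (low-entropy) alignments tend to accompany reliable hypotheses; the thresholding and exact measure here are assumptions:

```python
import torch

def attention_entropy(att):
    """Average entropy of the attention distributions as a hypothetical
    performance-monitoring proxy (lower entropy = peakier alignment).
    att: (dec_steps, enc_steps), each row sums to 1."""
    eps = 1e-8
    ent = -(att * (att + eps).log()).sum(dim=-1)  # entropy per decoder step
    return ent.mean().item()

att = torch.softmax(torch.randn(12, 50), dim=-1)
print(attention_entropy(att))
```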
Two-pass Decoding and Cross-adaptation Based System Combination of End-to-end Conformer and Hybrid TDNN ASR Systems
Fundamental modelling differences between hybrid and end-to-end (E2E) automatic speech recognition (ASR) systems create large diversity and complementarity among them. This paper investigates multi-pass rescoring and cross-adaptation based system combination approaches for hybrid TDNN and Conformer E2E ASR systems. In multi-pass rescoring, a state-of-the-art hybrid LF-MMI trained CNN-TDNN system featuring speed perturbation, SpecAugment, and Bayesian learning hidden unit contributions (LHUC) speaker adaptation was used to produce initial N-best outputs before being rescored by the speaker-adapted Conformer system using a 2-way cross-system score interpolation. In cross adaptation, the hybrid CNN-TDNN system was adapted to the 1-best output of the Conformer system or vice versa. Experiments on the 300-hour Switchboard corpus suggest that the combined systems derived using either of the two system combination approaches outperformed the individual systems. The best combined system, obtained using multi-pass rescoring, produced statistically significant word error rate (WER) reductions of 2.5% to 3.9% absolute (22.5% to 28.9% relative) over the stand-alone Conformer system on the NIST Hub5'00, Rt03 and Rt02 evaluation data.
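As an illustration of 2-way cross-system score interpolation over an N-best list, here is a minimal Python sketch; the interpolation weight and scorer interface are assumptions for illustration, not the paper's code:

```python
from typing import Callable, List, Tuple

def rescore_nbest(nbest: List[Tuple[str, float]],
                  second_score: Callable[[str], float],
                  lam: float = 0.5) -> str:
    """Hypothetical sketch of 2-way cross-system score interpolation:
    first-pass N-best entries (hypothesis, log score) are rescored with
    a second system's log score; lam is a tuned interpolation weight."""
    rescored = [(hyp, lam * s1 + (1.0 - lam) * second_score(hyp))
                for hyp, s1 in nbest]
    return max(rescored, key=lambda x: x[1])[0]  # 1-best after combination

# usage with a toy stand-in for the second system's scorer
best = rescore_nbest([("hello world", -12.3), ("hollow world", -13.1)],
                     second_score=lambda h: -0.5 * len(h))
print(best)
```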
Effective attention-based sequence-to-sequence modelling for automatic speech recognition
With sufficient training data, attentional encoder-decoder models have given outstanding ASR results. In such models, the encoder encodes the input sequence into a sequence of hidden representations. The attention mechanism generates a soft alignment between the encoder hidden states and the decoder hidden states. The decoder produces the current output by considering the alignment and the previous outputs.
However, attentional encoder-decoder models were originally designed for machine translation tasks, where the input and output sequences are relatively short and the alignments between them are flexible. For ASR tasks, the input sequences are notably long. Further, acoustic frames (or their hidden representations) can typically be aligned with output units in a left-to-right order, and compared to the length of the entire utterance, the duration of each output unit is usually small. Conventional encoder-decoder models have difficulties modelling long sequences, and the attention mechanism does not guarantee monotonic left-to-right alignments.
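For reference, the soft alignment described above can be sketched in a few lines of PyTorch; the dot-product scoring and the shapes are illustrative assumptions:

```python
import torch

enc = torch.randn(100, 256)   # encoder hidden states (T_enc, d)
dec = torch.randn(256)        # current decoder hidden state (d,)

scores = enc @ dec                    # alignment energies, one per frame
alpha = torch.softmax(scores, dim=0)  # soft alignment over encoder states
context = alpha @ enc                 # (d,) context vector for this step
```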
In this thesis, we study attention-based sequence-to-sequence ASR models and address the aforementioned issues. We investigate recurrent neural network (RNN) encoder-decoder models and self-attention encoder-decoder models. For RNN encoder-decoder models, we develop a dynamic subsampling RNN (dsRNN) encoder to shorten the lengths of the input sequences. The dsRNN learns to skip redundant frames. Furthermore, the skip ratio may vary at different stages of training, allowing the encoder to learn the most relevant information for each epoch. Thus, the dsRNN alleviates the difficulties of encoding long sequences. We also propose a fully trainable windowed attention mechanism, in which both the window shift and window length are learned by the model. Our windowed method forces the attention mechanism to attend to inputs within small sliding windows in a strict left-to-right order. The proposed dsRNN and windowed attention give significant performance gains over traditional encoder-decoder ASR models.
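A minimal sketch of the windowed attention idea follows; in the thesis the window shift and length are learned by the model, whereas here they are fixed as an assumption for illustration:

```python
import torch

def windowed_attention(enc, dec_state, center, win_len=8):
    """Sketch of windowed attention: restrict the soft alignment to a
    small window around `center`, enforcing locality. The thesis learns
    the window shift and length; here they are fixed for illustration."""
    lo = max(0, center - win_len // 2)
    hi = min(enc.size(0), lo + win_len)
    window = enc[lo:hi]                              # (win, d)
    alpha = torch.softmax(window @ dec_state, dim=0) # weights inside window
    return alpha @ window                            # context from window only

enc = torch.randn(100, 256)
ctx = windowed_attention(enc, torch.randn(256), center=40)
```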
We next study self-attention encoder-decoder models. For RNN encoder-decoder models, we have shown that restricting the attention to small windows is beneficial. However, self-attention encodes input sequences by comparing each element of the sequence with all other elements of the sequence. Therefore, we investigate whether the global view of self-attention is necessary for ASR. We note that the range of the learned context increases from the lower to the upper self-attention layers, and suggest that the upper encoder layers may have seen sufficient contextual information without the need for self-attention. This would imply that the upper self-attention layers can be replaced with feed-forward layers (we can view the feed-forward layers as strictly local left-to-right self-attention). In practice, we observe that replacing upper encoder self-attention layers with feed-forward layers does not impact the performance. We also observe that there are individual attention heads that only attend to local information, and thus the self-attention mechanism is redundant for these attention heads. Based on these observations, we propose randomly removing attention heads during training but keeping all heads at test time. The proposed method achieves state-of-the-art ASR results on benchmark datasets from different ASR scenarios.
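The head-removal idea can be sketched as a dropout-style mask over whole attention heads; the drop probability and the dropout-like rescaling below are assumptions, not the thesis's exact recipe:

```python
import torch

def drop_heads(head_outputs, p=0.3, training=True):
    """Sketch of random head removal: zero out whole attention heads
    during training, keep all heads at test time. Rescaling by the
    keep-probability (as in standard dropout) is one reasonable choice.
    head_outputs: (batch, n_heads, seq_len, d_head)."""
    if not training or p == 0.0:
        return head_outputs
    b, h = head_outputs.shape[:2]
    keep = (torch.rand(b, h, 1, 1, device=head_outputs.device) > p).float()
    return head_outputs * keep / (1.0 - p)

out = drop_heads(torch.randn(4, 8, 50, 64))
```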
Finally, we investigate top-down level-wise training of sequence-to-sequence ASR models. We find that when training sequence-to-sequence ASR models on noisy data, the use of upper layers trained on clean data forces the lower layers to learn noise-invariant features, since the features which fit the clean-trained upper layers are more general. We further show that within the same dataset, conventional joint training makes the upper layers quickly overfit. Therefore, we propose to freeze the upper layers and retrain the lower layers. The proposed method is a general training strategy; we use it not only to train ASR models but also to train other neural networks in other domains. The proposed training method yields consistent performance gains across different tasks (e.g., language modelling, image classification).
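In a framework such as PyTorch, the freeze-and-retrain step reduces to disabling gradients for the upper layers; the toy model and the layer split below are purely illustrative:

```python
import torch.nn as nn
import torch.optim as optim

# Sketch of the freeze-and-retrain idea: after joint training, freeze
# the upper layers and update only the lower ones. The model and the
# choice of which layers count as "upper" are illustrative.
model = nn.Sequential(
    nn.Linear(80, 256), nn.ReLU(),   # "lower" layers: keep training
    nn.Linear(256, 256), nn.ReLU(),  # "upper" layers: frozen below
)
for p in model[2].parameters():
    p.requires_grad = False          # freeze the upper linear layer

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = optim.SGD(trainable, lr=0.1)  # optimizes lower layers only
```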
In summary, we propose methods which enable attention-based sequence-to-sequence ASR systems to better model sequential data, and demonstrate the benefits of training neural networks in a top-down cascade manner.
Single Channel Auditory Source Separation with Neural Network
Although distinguishing different sounds in a noisy environment is a relatively easy task for humans, source separation has long been extremely difficult in audio signal processing. The problem is challenging for three reasons: the large variety of sound types, the abundance of mixing conditions, and the unclear mechanism by which sources are distinguished, especially for similar sounds.
In recent years, neural network based methods have achieved impressive success on various problems, including speech enhancement, where the task is to separate clean speech from a noisy mixture. However, current deep learning based source separators do not perform well on real recorded noisy speech and, more importantly, are not applicable in more general source separation scenarios such as overlapped speech.
In this thesis, we first propose extensions to the current mask learning network for speech enhancement, to fix the scale mismatch problem that usually occurs in real recorded audio. We solve this problem by adding two restoration layers to the existing mask learning network. We also propose a residual learning architecture for speech enhancement, further improving the network's generalization under different recording conditions. We evaluate the proposed speech enhancement models on CHiME-3 data. Without retraining the acoustic model, the best bidirectional LSTM with residual connections yields a 25.13% relative WER reduction on real data and 34.03% on simulated data.
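The basic mask learning scheme that these extensions build on can be sketched as follows; this is a toy BLSTM mask estimator with illustrative shapes, and the thesis's restoration layers and residual connections are omitted:

```python
import torch
import torch.nn as nn

class MaskNet(nn.Module):
    """Toy sketch of mask learning for enhancement: a BLSTM predicts a
    [0, 1] time-frequency mask applied to the mixture magnitudes."""
    def __init__(self, n_freq=257, hidden=300):
        super().__init__()
        self.blstm = nn.LSTM(n_freq, hidden, batch_first=True,
                             bidirectional=True)
        self.mask = nn.Linear(2 * hidden, n_freq)

    def forward(self, mix_mag):              # (B, T, F) mixture magnitudes
        h, _ = self.blstm(mix_mag)
        m = torch.sigmoid(self.mask(h))      # predicted mask in [0, 1]
        return m * mix_mag                   # enhanced magnitudes

net = MaskNet()
enhanced = net(torch.rand(2, 120, 257))
```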
Then we propose a novel neural network based model called “deep clustering” for more general source separation tasks. We train a deep network to assign contrastive embedding vectors to each time-frequency region of the spectrogram, in order to implicitly predict the segmentation labels of the target spectrogram from the input mixtures. This yields a deep network-based analogue to spectral clustering, in that the embeddings form a low-rank pairwise affinity matrix that approximates the ideal affinity matrix, while enabling much faster performance. At test time, the clustering step “decodes” the segmentation implicit in the embeddings by optimizing K-means with respect to the unknown assignments. Experiments on single-channel mixtures from multiple speakers show that a speaker-independent model trained on two- and three-speaker mixtures can improve signal quality for mixtures of held-out speakers by over 10 dB on average.
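The deep clustering objective ||VV^T - YY^T||_F^2 admits a memory-efficient expansion that never forms the full time-frequency affinity matrix; a sketch with illustrative shapes:

```python
import torch

def deep_clustering_loss(V, Y):
    """Deep clustering objective ||V V^T - Y Y^T||_F^2, expanded as
    ||V^T V||_F^2 - 2 ||V^T Y||_F^2 + ||Y^T Y||_F^2 so no N-by-N
    affinity matrix is ever materialized.
    V: (N, D) unit-norm embeddings for N time-frequency bins.
    Y: (N, C) one-hot source assignments."""
    vtv = V.t() @ V   # (D, D)
    vty = V.t() @ Y   # (D, C)
    yty = Y.t() @ Y   # (C, C)
    return (vtv ** 2).sum() - 2 * (vty ** 2).sum() + (yty ** 2).sum()

V = torch.nn.functional.normalize(torch.randn(1000, 20), dim=1)
Y = torch.nn.functional.one_hot(torch.randint(0, 2, (1000,)), 2).float()
loss = deep_clustering_loss(V, Y)
```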
We then propose an extension of deep clustering named the “deep attractor” network, which allows the system to perform efficient end-to-end training. In the proposed model, attractor points for each source are first created from the acoustic signals: they pull together the time-frequency bins corresponding to each source by finding the centroids of the sources in the embedding space, and are subsequently used to determine the similarity of each bin in the mixture to each source. The network is then trained to minimize the reconstruction error of each source by optimizing the embeddings. We show that this framework achieves even better results.
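A sketch of the attractor computation, assuming oracle assignments at training time as described above; shapes and the softmax mask derivation are illustrative choices:

```python
import torch

def attractor_masks(V, Y):
    """Sketch of the deep attractor idea: attractors are the centroids
    of each source's embeddings; masks follow from the similarity of
    every time-frequency bin to each attractor.
    V: (N, D) embeddings; Y: (N, C) oracle assignments (training time)."""
    attractors = (Y.t() @ V) / (Y.sum(0).unsqueeze(1) + 1e-8)  # (C, D)
    sim = V @ attractors.t()          # (N, C) bin-to-attractor similarity
    return torch.softmax(sim, dim=1)  # soft masks, one column per source

V = torch.randn(1000, 20)
Y = torch.nn.functional.one_hot(torch.randint(0, 2, (1000,)), 2).float()
masks = attractor_masks(V, Y)
```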
Lastly, we introduce two applications of the proposed models: singing voice separation and a smart hearing aid device. For the former, a multi-task architecture is proposed which combines deep clustering with a classification-based network, achieving a new state-of-the-art separation result: the signal-to-noise ratio was improved by 11.1 dB on music and 7.9 dB on singing voice. For the smart hearing aid device, we combine neural decoding with the separation network. The system first decodes the user’s attention, which is then used to guide the separator toward the target source. Both objective and subjective studies show that the proposed system can accurately decode attention and significantly improve the user experience.