Sequentially Sampled Chunk Conformer for Streaming End-to-End ASR
This paper presents an in-depth study on a Sequentially Sampled Chunk
Conformer, SSC-Conformer, for streaming End-to-End (E2E) ASR. The SSC-Conformer
first demonstrates the significant performance gains from using the
sequentially sampled chunk-wise multi-head self-attention (SSC-MHSA) in the
Conformer encoder by allowing efficient cross-chunk interactions while keeping
linear complexities. Furthermore, it explores taking advantage of chunked
convolution to make use of the chunk-wise future context and integrates with
causal convolution in the convolution layers to further reduce CER. We verify
the proposed SSC-Conformer on the AISHELL-1 benchmark and experimental results
show that state-of-the-art performance for streaming E2E ASR is achieved with
CER 5.33% without LM rescoring. Owing to its linear complexity, the
SSC-Conformer can also be trained with large batch sizes and run inference more efficiently.
Comment: This paper has been submitted to ICASSP 202
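The causal convolution mentioned in the abstract above can be illustrated with a minimal sketch. This is not the SSC-Conformer implementation; the function name `causal_conv1d` and the pure-Python formulation are assumptions made for illustration only. The key property is that each output frame depends only on the current and past input frames, which is what makes the layer usable in streaming.

```python
def causal_conv1d(x, kernel):
    """1-D causal convolution (illustrative sketch, hypothetical helper).

    The input is left-padded with len(kernel) - 1 zeros, so output frame t
    is computed only from input frames t - (k - 1) .. t, never future ones.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(x))
    ]


# Changing a later frame leaves earlier outputs untouched.
y_a = causal_conv1d([1.0, 2.0, 3.0], [1.0, 1.0])
y_b = causal_conv1d([1.0, 2.0, 9.0], [1.0, 1.0])
```

Here `y_a[:2]` and `y_b[:2]` are identical because only the last input frame differs. A chunked convolution, by contrast, would also let a frame see later frames inside its own chunk.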
Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments
In real-world applications, users often require both translations and
transcriptions of speech to enhance their comprehension, particularly in
streaming scenarios where incremental generation is necessary. This paper
introduces a streaming Transformer-Transducer that jointly generates automatic
speech recognition (ASR) and speech translation (ST) outputs using a single
decoder. To produce ASR and ST content effectively with minimal latency, we
propose a joint token-level serialized output training method that interleaves
source and target words by leveraging an off-the-shelf textual aligner.
Experiments in monolingual (it-en) and multilingual ({de,es,it}-en) settings
demonstrate that our approach achieves the best quality-latency balance. With
an average ASR latency of 1s and ST latency of 1.3s, our model matches or even
improves on the output quality of separate ASR and ST models, yielding an
average improvement of 1.1 WER and 0.4 BLEU in the multilingual case.
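The idea of interleaving source and target words from a word alignment can be sketched as follows. This is a simplified illustration, not the paper's exact serialization scheme: the function `interleave`, the `("asr", word)` / `("st", word)` tagging, and the rule "emit a target word once the last source word it is aligned to has been emitted" are all assumptions.

```python
def interleave(src_words, tgt_words, alignment):
    """Interleave transcript and translation words (illustrative sketch).

    alignment is a list of (source_index, target_index) pairs, such as a
    textual aligner might produce. Target words keep their original order.
    """
    # Latest source position each target word is aligned to (-1 if unaligned).
    anchor = [-1] * len(tgt_words)
    for s, t in alignment:
        anchor[t] = max(anchor[t], s)

    # Make anchors non-decreasing so target order is preserved.
    prev = 0
    for j in range(len(anchor)):
        prev = max(prev, anchor[j])
        anchor[j] = prev

    out, j = [], 0
    for i, word in enumerate(src_words):
        out.append(("asr", word))
        # Emit every target word whose source anchor has now been seen.
        while j < len(tgt_words) and anchor[j] <= i:
            out.append(("st", tgt_words[j]))
            j += 1
    out.extend(("st", w) for w in tgt_words[j:])  # only if src_words is empty
    return out


stream = interleave(
    ["il", "gatto", "nero"],
    ["the", "black", "cat"],
    [(0, 0), (1, 2), (2, 1)],
)
```

In this toy example "black" must wait for "nero", so "cat" is delayed as well. Reordering between languages is one reason ST latency can exceed ASR latency.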
EM-Network: Oracle Guided Self-distillation for Sequence Learning
We introduce EM-Network, a novel self-distillation approach that effectively
leverages target information for supervised sequence-to-sequence (seq2seq)
learning. In contrast to conventional methods, it is trained with oracle
guidance, which is derived from the target sequence. Since the oracle guidance
compactly represents the target-side context that can assist the sequence model
in solving the task, the EM-Network achieves a better prediction compared to
using only the source input. To allow the sequence model to inherit the
promising capability of the EM-Network, we propose a new self-distillation
strategy, where the original sequence model can benefit from the knowledge of
the EM-Network in a one-stage manner. We conduct comprehensive experiments on
two types of seq2seq models: connectionist temporal classification (CTC) for
speech recognition and attention-based encoder-decoder (AED) for machine
translation. Experimental results demonstrate that the EM-Network significantly
advances the current state-of-the-art approaches, improving over the best prior
work on speech recognition and establishing state-of-the-art performance on
WMT'14 and IWSLT'14.
Comment: ICML 202
Automatic Speech Recognition for Documenting Endangered First Nations Languages
Automatic speech recognition (ASR) for low-resource languages is an active field of research. In recent years, with the advent of deep learning, impressive results have been reported using minimal resources. Many of the world’s languages go extinct every year, and with every dying language we lose the intellect, culture, values, and traditions that have been passed down over generations. Linguists throughout the world have already initiated many language documentation projects to preserve such endangered languages. Automatic speech recognition can accelerate the documentation process, reducing both the annotation time for field linguists and the overall cost of the project. A traditional speech recognizer is trained on thousands of hours of acoustic data and a phonetic dictionary that includes all words from the language. End-to-end ASR systems have shown dramatic improvement for major languages. In particular, recent advances in self-supervised representation learning, which takes advantage of large corpora of untranscribed speech data, have become the state of the art for speech recognition technology. However, for resource-constrained languages, the technology has not been tested in depth. In this thesis, we explore both traditional ASR methods and state-of-the-art end-to-end systems for modeling a critically endangered Athabascan language known as Upper Tanana. In our first approach, we investigate traditional models with a comparative study on feature selection and a performance comparison with deep hybrid models. With limited resources at our disposal, we build a working ASR system based on a grapheme-to-phoneme (G2P) phonetic dictionary. The acoustic model can also be used as a separate forced alignment tool for the automatic alignment of training data. The results show that the GMM-HMM methods outperform deep hybrid models in low-resource acoustic modeling.
In our second approach, we propose Domain-adapted Cross-lingual Speech Recognition (DA-XLSR), an ASR system built on the wav2vec 2.0 framework, which uses pretrained transformer models that leverage cross-lingual data to build an acoustic representation. The proposed system uses a multistage transfer learning process to fine-tune the final model. To supplement the limited data, we devise a data augmentation strategy combining six augmentation techniques. The speech model uses Connectionist Temporal Classification (CTC) for alignment-free training and does not require a pronunciation dictionary or language model. Experiments with the second approach demonstrate that it can outperform the best traditional or end-to-end models in terms of word error rate (WER) and produce a powerful utterance-level transcription. In addition, the augmentation strategy is tested on several end-to-end models and provides a consistent improvement in performance. While the best proposed model can currently reduce the WER significantly, further research may still be required before it can completely replace human transcribers.
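To illustrate why CTC output needs no frame-level alignment or pronunciation dictionary, here is a minimal sketch of the standard CTC collapsing rule applied to per-frame labels: merge consecutive repeats, then drop blanks. The function name `ctc_collapse` and the `"_"` blank symbol are assumptions for this example. It shows greedy post-processing only, not the thesis's model or training procedure.

```python
from itertools import groupby


def ctc_collapse(frame_labels, blank="_"):
    """Collapse per-frame CTC labels into an output sequence (sketch).

    Consecutive duplicates are merged first, then blank symbols are removed,
    so a blank between two identical labels keeps them as separate symbols.
    """
    return [label for label, _ in groupby(frame_labels) if label != blank]


# Ten frames of best-path labels collapse to a five-character word.
decoded = "".join(ctc_collapse(list("hh_e_ll_lo")))
```

The blank between the two runs of `l` is what allows the double letter to survive the merge step.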