21 research outputs found
Hierarchical recurrent neural network for story segmentation using fusion of lexical and acoustic features
Decoder-only Architecture for Speech Recognition with CTC Prompts and Text Data Augmentation
Collecting audio-text pairs is expensive; however, it is much easier to
access text-only data. Unless using shallow fusion, end-to-end automatic speech
recognition (ASR) models require architecture modifications or additional
training schemes to use text-only data. Inspired by recent advances in
decoder-only language models (LMs) such as GPT-3 and PaLM, which have been
adopted for speech-processing tasks, we propose using a decoder-only architecture for ASR
with simple text augmentation. To provide audio information, encoder features
compressed by CTC prediction are used as prompts for the decoder, which can be
regarded as refining CTC prediction using the decoder-only model. Because the
decoder architecture is the same as an autoregressive LM, it is simple to
enhance the model by leveraging external text data with LM training. An
experimental comparison using LibriSpeech and Switchboard shows that our
proposed models with text augmentation training reduced word error rates relative to
ordinary CTC by 0.3% and 1.4% on the LibriSpeech test-clean and test-other sets,
respectively, and by 2.9% and 5.0% on Switchboard and CallHome. The proposed model
had a computational-efficiency advantage over conventional encoder-decoder ASR models
with a similar parameter setup, and outperformed them in the LibriSpeech 100h and
Switchboard training scenarios.
Comment: Submitted to ICASSP202
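As a rough illustration of the prompting scheme described above, the sketch below compresses encoder frames with a greedy CTC prediction and feeds the surviving frames to an autoregressive decoder as a prefix. It is a minimal, single-utterance sketch under assumed interfaces (the `encoder` and `decoder_lm` modules and their dimensions are hypothetical), not the authors' implementation.

```python
# Minimal single-utterance sketch (assumed interfaces, not the paper's code):
# CTC-compressed encoder frames serve as the prompt of a decoder-only LM.
import torch
import torch.nn as nn

class CTCPromptedDecoder(nn.Module):
    def __init__(self, encoder, decoder_lm, d_model, vocab_size, blank_id=0):
        super().__init__()
        self.encoder = encoder          # any speech encoder (hypothetical module)
        self.decoder_lm = decoder_lm    # autoregressive decoder-only LM (hypothetical)
        self.ctc_head = nn.Linear(d_model, vocab_size)
        self.blank_id = blank_id

    def compress_by_ctc(self, enc_out):
        # Greedy CTC path: keep one encoder frame per emitted token
        # (drop blanks and consecutive repeats) to form a short prompt.
        pred = self.ctc_head(enc_out).argmax(dim=-1)   # (T,)
        keep = pred != self.blank_id
        keep[1:] &= pred[1:] != pred[:-1]
        return enc_out[keep]                           # (U, d_model), U << T

    def forward(self, speech, text_tokens):
        enc_out = self.encoder(speech)                 # (T, d_model)
        prompt = self.compress_by_ctc(enc_out)         # acoustic prompt
        # The decoder treats the prompt as a prefix and autoregressively
        # refines the CTC hypothesis into the final transcript.
        return self.decoder_lm(prompt, text_tokens)
```

Because the decoder is an ordinary autoregressive LM once the prompt is fixed, the same decoder can also be trained on external text-only data with plain LM training, which is the text augmentation the abstract refers to.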
Integrating Pretrained ASR and LM to Perform Sequence Generation for Spoken Language Understanding
There has been increased interest in integrating pretrained automatic speech
recognition (ASR) and language models (LMs) into the spoken language understanding
(SLU) framework. However, prior methods often struggle with a vocabulary mismatch
between the pretrained models, and the LM cannot be utilized directly because the
SLU task diverges from its NLU formulation. In this study, we propose a three-pass end-to-end (E2E) SLU system
that effectively integrates ASR and LM subnetworks into the SLU formulation for
sequence generation tasks. In the first pass, our architecture predicts ASR
transcripts using the ASR subnetwork. This is followed by the LM subnetwork,
which makes an initial SLU prediction. Finally, in the third pass, the
deliberation subnetwork conditions on representations from the ASR and LM
subnetworks to make the final prediction. Our proposed three-pass SLU system
shows improved performance over cascaded and E2E SLU models on two benchmark
SLU datasets, SLURP and SLUE, especially on acoustically challenging
utterances.
Comment: Accepted at INTERSPEECH 202
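The control flow of the three passes can be pictured with a short sketch; the subnetwork interfaces and return values below are assumptions for illustration, not the released model.

```python
# Rough sketch of the three-pass flow (assumed interfaces, for illustration only).
import torch.nn as nn

class ThreePassSLU(nn.Module):
    def __init__(self, asr_subnet, lm_subnet, deliberation_decoder):
        super().__init__()
        self.asr = asr_subnet            # pretrained ASR subnetwork (hypothetical)
        self.lm = lm_subnet              # pretrained LM subnetwork (hypothetical)
        self.deliberation = deliberation_decoder

    def forward(self, speech):
        # Pass 1: predict the ASR transcript from speech.
        transcript, asr_states = self.asr(speech)
        # Pass 2: the LM makes an initial SLU prediction from the transcript.
        slu_draft, lm_states = self.lm(transcript)
        # Pass 3: the deliberation decoder conditions on both subnetworks'
        # representations to generate the final semantic parse.
        return self.deliberation(slu_draft, asr_states, lm_states)
```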
The Pipeline System of ASR and NLU with MLM-based Data Augmentation toward STOP Low-resource Challenge
This paper describes our system for the low-resource domain adaptation track
(Track 3) of the Spoken Language Understanding Grand Challenge, which is part of
the ICASSP Signal Processing Grand Challenge 2023. In this track, we adopt a
pipeline approach of ASR and NLU. For ASR, we fine-tune Whisper for each domain
with upsampling. For NLU, we fine-tune BART on all the Track 3 data and then on
the low-resource domain data. We apply masked LM (MLM)-based data augmentation,
in which some of the input tokens and the corresponding target labels are replaced
using the MLM. We also apply a retrieval-based approach, in which the model input
is augmented with similar training samples. As a result, we achieved exact match
(EM) accuracies of 63.3/75.0 (average: 69.15) for the reminder/weather domains and
won first place in the challenge.
Comment: To appear at ICASSP202
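A minimal sketch of the MLM-based augmentation idea follows, assuming an off-the-shelf fill-mask model ("roberta-base") and a 15% replacement rate; both are illustrative choices, and in the actual system the same replacement would also have to be propagated to the corresponding target labels.

```python
# Illustrative sketch of MLM-based augmentation (model choice and masking rate
# are assumptions): randomly mask input words and let a masked LM propose substitutes.
import random
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-base")

def mlm_augment(words, mask_prob=0.15):
    out = list(words)
    for i in range(len(words)):
        if random.random() < mask_prob:
            masked = " ".join(out[:i] + [fill_mask.tokenizer.mask_token] + out[i + 1:])
            # Take the top-scoring replacement for the masked position.
            out[i] = fill_mask(masked)[0]["token_str"].strip()
            # In the paper's setting, any target label aligned with this
            # token would be replaced consistently as well.
    return out

# Example: mlm_augment("remind me to water the plants".split())
```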
Tensor decomposition for minimization of E2E SLU model toward on-device processing
Spoken Language Understanding (SLU) is a critical speech recognition
application and is often deployed on edge devices. Consequently, on-device
processing plays a significant role in the practical implementation of SLU.
This paper focuses on the end-to-end (E2E) SLU model due to its small latency
property, unlike a cascade system, and aims to minimize the computational cost.
We reduce the model size by applying tensor decomposition to the Conformer and
E-Branchformer architectures used in our E2E SLU models. We propose applying
singular value decomposition to the linear layers and Tucker decomposition to the
convolution layers. We also compare CANDECOMP/PARAFAC (CP) decomposition and
Tensor-Train decomposition with Tucker decomposition. Since the E2E model is
represented by a single neural network, our tensor decomposition can flexibly
control the number of parameters without changing feature dimensions. On the
STOP dataset, we achieved 70.9% exact match accuracy under the tight constraint
of only 15 million parameters.
Comment: Accepted by INTERSPEECH 202
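For the linear-layer part of the scheme, the idea can be sketched as a truncated SVD that replaces one linear layer with two low-rank ones; the rank below is an assumed hyperparameter, and the convolution layers would use Tucker (or CP / Tensor-Train) decomposition instead.

```python
# Sketch of low-rank SVD compression of a linear layer (rank is an assumed
# hyperparameter; conv layers would use a tensor decomposition such as Tucker).
import torch
import torch.nn as nn

def svd_compress_linear(linear: nn.Linear, rank: int) -> nn.Sequential:
    W = linear.weight.data                        # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]                  # (out_features, rank)
    V_r = Vh[:rank, :]                            # (rank, in_features)

    # Replace y = W x + b with y = U_r (V_r x) + b: two thin layers whose
    # parameter count is rank * (in_features + out_features) instead of
    # in_features * out_features, leaving feature dimensions unchanged.
    down = nn.Linear(linear.in_features, rank, bias=False)
    up = nn.Linear(rank, linear.out_features, bias=linear.bias is not None)
    down.weight.data.copy_(V_r)
    up.weight.data.copy_(U_r)
    if linear.bias is not None:
        up.bias.data.copy_(linear.bias.data)
    return nn.Sequential(down, up)
```

Choosing the rank per layer is what lets the parameter count be controlled flexibly without changing the surrounding feature dimensions, as the abstract notes.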