6 research outputs found
SpeechBERT: An Audio-and-text Jointly Learned Language Model for End-to-end Spoken Question Answering
While various end-to-end models for spoken language understanding tasks have
been explored recently, this paper is probably the first known attempt to
challenge the very difficult task of end-to-end spoken question answering
(SQA). Learning from the very successful BERT model for various text processing
tasks, here we propose an audio-and-text jointly learned SpeechBERT model. On datasets whose answer spans contain ASR errors, this model outperformed the conventional approach of cascading ASR with a text question answering (TQA) model, because the end-to-end model was shown to extract information directly from the audio before ASR errors are introduced. Ensembling the proposed end-to-end model with the cascade architecture achieved even better performance. In addition to its potential for end-to-end SQA, SpeechBERT can also be considered for many other spoken language understanding tasks, just as BERT has been for many text processing tasks. Comment: Interspeech 202
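The abstract does not describe the ensembling procedure, so the following is only a rough sketch under assumed details: per-position start/end span scores from the cascade (ASR followed by TQA) model and from the end-to-end SpeechBERT-style model are averaged, and the highest-scoring span is returned. The function name, score format, and weighting are all illustrative.

```python
# Hypothetical sketch of ensembling cascade and end-to-end span scores; not the
# authors' exact method. Assumes both models score the same token positions.
import numpy as np

def ensemble_spans(cascade_start, cascade_end, e2e_start, e2e_end,
                   weight=0.5, max_answer_len=30):
    """Average per-position start/end scores from two models; return best (start, end)."""
    start = weight * np.asarray(cascade_start) + (1.0 - weight) * np.asarray(e2e_start)
    end = weight * np.asarray(cascade_end) + (1.0 - weight) * np.asarray(e2e_end)
    best_span, best_score = (0, 0), float("-inf")
    for i in range(len(start)):
        for j in range(i, min(i + max_answer_len, len(end))):
            if start[i] + end[j] > best_score:
                best_span, best_score = (i, j), start[i] + end[j]
    return best_span

# Toy example: positions (2, 3) win after averaging the two models' scores.
print(ensemble_spans([0.1, 0.2, 2.0, 0.1], [0.1, 0.1, 0.3, 1.5],
                     [0.0, 0.1, 1.8, 0.2], [0.2, 0.0, 0.1, 1.9]))
```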
Adding Connectionist Temporal Summarization into Conformer to Improve Its Decoder Efficiency For Speech Recognition
The Conformer model is an excellent architecture for speech recognition
modeling that effectively utilizes the hybrid losses of connectionist temporal
classification (CTC) and attention to train model parameters. To improve the
decoding efficiency of Conformer, we propose a novel connectionist temporal
summarization (CTS) method that reduces the number of encoder-generated acoustic frames fed to the attention decoder, thus reducing the number of operations. However, to achieve such decoding improvements, we
must fine-tune model parameters, as cross-attention observations are changed
and thus require corresponding refinements. Our final experiments show that,
with a beam width of 4, the decoding budget can be reduced by up to 20% on LibriSpeech and by 11% on FluentSpeech data, without losing ASR accuracy. An accuracy improvement is even found on the LibriSpeech "test-other" set: the word error rate (WER) is reduced by 6% relative at a beam width of 1 and by 3% relative at a beam width of 4. Comment: Submitted to INTERSPEECH 2022 (5 pages, 2 figures)
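The abstract does not spell out how the summarization is computed, so the sketch below is only one plausible reading, under the assumption that per-frame CTC posteriors guide the reduction: runs of blank-dominated encoder frames are merged into single summary vectors, so the attention decoder attends over a much shorter sequence.

```python
# Plausible sketch of a CTC-guided frame summarization (an assumption, not the
# paper's exact CTS algorithm): blank-dominated frames are merged into one vector.
import torch

def summarize_frames(encoder_out, ctc_probs, blank_id=0, blank_thresh=0.95):
    """encoder_out: (T, D) encoder states; ctc_probs: (T, V) per-frame CTC posteriors.
    Returns a shorter (T', D) sequence for the attention decoder."""
    kept, blank_run = [], []
    for t in range(encoder_out.size(0)):
        if ctc_probs[t, blank_id] > blank_thresh:
            blank_run.append(encoder_out[t])            # hold blank-dominated frame
        else:
            if blank_run:                               # flush the run as one summary frame
                kept.append(torch.stack(blank_run).mean(dim=0))
                blank_run = []
            kept.append(encoder_out[t])                 # keep informative frame as-is
    if blank_run:
        kept.append(torch.stack(blank_run).mean(dim=0))
    return torch.stack(kept)
```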
Leveraging Acoustic and Linguistic Embeddings from Pretrained speech and language Models for Intent Classification
Intent classification is a task in spoken language understanding. An intent
classification system is usually implemented as a pipeline process, with a
speech recognition module followed by text processing that classifies the
intents. There are also studies of end-to-end systems that take acoustic features as input and classify intents directly. Such systems do not take advantage of relevant linguistic information and suffer from limited training data. In this work, we propose a novel intent classification framework that employs acoustic features extracted from a pretrained speech recognition system and linguistic features learned from a pretrained language model. We use a knowledge distillation technique to map the acoustic embeddings towards the linguistic embeddings, and we fuse the two kinds of embeddings through a cross-attention approach to classify intents. With the proposed method, we achieve 90.86% and 99.07% accuracy on the ATIS and Fluent Speech corpora, respectively.
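As a hedged sketch of the two ideas described above (not the authors' code), the snippet below shows a distillation loss that pulls acoustic embeddings toward pretrained linguistic embeddings, and a cross-attention module that fuses the two streams before an intent classifier; dimensions, pooling, and attention direction are assumptions.

```python
# Illustrative sketch only: distillation of acoustic embeddings toward linguistic
# ones, plus cross-attention fusion for intent classification. Sizes are assumed.
import torch
import torch.nn as nn

class FusionIntentClassifier(nn.Module):
    def __init__(self, dim=768, num_intents=31, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_intents)

    def forward(self, acoustic_emb, linguistic_emb):
        # acoustic queries attend over linguistic keys/values (one possible direction)
        fused, _ = self.cross_attn(acoustic_emb, linguistic_emb, linguistic_emb)
        return self.classifier(fused.mean(dim=1))       # pool over time, then classify

def distillation_loss(acoustic_emb, linguistic_emb):
    # pull acoustic embeddings toward the (frozen) linguistic embeddings;
    # assumes the two sequences have already been length-aligned
    return nn.functional.mse_loss(acoustic_emb, linguistic_emb.detach())
```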
Speak or Chat with Me: End-to-End Spoken Language Understanding System with Flexible Inputs
A major focus of recent research in spoken language understanding (SLU) has
been on the end-to-end approach where a single model can predict intents
directly from speech inputs without intermediate transcripts. However, this
approach presents some challenges. First, since speech can be considered as
personally identifiable information, in some cases only automatic speech
recognition (ASR) transcripts are accessible. Second, intent-labeled speech
data is scarce. To address the first challenge, we propose a novel system that
can predict intents from flexible types of inputs: speech, ASR transcripts, or
both. We demonstrate strong performance for either modality separately, and when both speech and ASR transcripts are available, system combination achieves better results than either input modality alone. To address the
second challenge, we leverage a semantically robust pre-trained BERT model and
adopt a cross-modal system that co-trains text embeddings and acoustic
embeddings in a shared latent space. We further enhance this system by
utilizing an acoustic module pre-trained on LibriSpeech and domain-adapting the
text module on our target datasets. Our experiments show significant advantages
for these pre-training and fine-tuning strategies, resulting in a system that
achieves competitive intent-classification performance on Snips SLU and Fluent
Speech Commands datasets. Comment: Accepted to Interspeech 202
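A minimal sketch, under assumed dimensions and losses, of the cross-modal idea: both encoders project into a shared latent space, an alignment loss ties the two views of the same utterance together, and a single intent head accepts speech, ASR transcripts, or both at inference time.

```python
# Minimal sketch of a shared-latent-space SLU head with flexible inputs; module
# sizes and the MSE alignment loss are assumptions, not the paper's exact design.
import torch
import torch.nn as nn

class SharedSpaceSLU(nn.Module):
    def __init__(self, text_dim=768, audio_dim=512, latent_dim=256, num_intents=31):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, latent_dim)
        self.audio_proj = nn.Linear(audio_dim, latent_dim)
        self.intent_head = nn.Linear(latent_dim, num_intents)

    def forward(self, text_emb=None, audio_emb=None):
        # flexible inputs: use whichever modality is present, or average both
        views = []
        if text_emb is not None:
            views.append(self.text_proj(text_emb))
        if audio_emb is not None:
            views.append(self.audio_proj(audio_emb))
        assert views, "at least one modality is required"
        latent = torch.stack(views).mean(dim=0)
        logits = self.intent_head(latent)
        # alignment loss is defined only when both views of the utterance are present
        align = nn.functional.mse_loss(views[0], views[1]) if len(views) == 2 else None
        return logits, align
```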
Large-scale Transfer Learning for Low-resource Spoken Language Understanding
End-to-end Spoken Language Understanding (SLU) models are made increasingly large and complex to achieve state-of-the-art accuracy. However, the increased complexity of a model also introduces a high risk of over-fitting, which is a major challenge in SLU tasks because of the limited available data. In this paper, we propose an attention-based SLU model together with three encoder enhancement strategies to overcome the data-sparsity challenge. The first strategy takes a transfer-learning approach to improve the feature extraction capability of the encoder: the encoder is pre-trained on a large quantity of annotated automatic speech recognition (ASR) data using the standard Transformer architecture, and the SLU model is then fine-tuned with a small amount of target labelled data. The second strategy adopts multi-task learning: the SLU model integrates the speech recognition model by sharing the same underlying encoder, thereby improving robustness and generalization ability. The third strategy, inspired by the Component Fusion (CF) idea, involves a Bidirectional Encoder Representations from Transformers (BERT) model and aims to boost the capability of the decoder with an auxiliary network; it hence reduces the risk of over-fitting and indirectly augments the ability of the underlying encoder. Experiments on the FluentAI dataset show that the cross-language transfer learning and multi-task strategies improve over the baseline by up to 4.52% and 3.89%, respectively. Comment: will be presented in INTERSPEECH 202
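The shared-encoder multi-task strategy described above can be pictured with the rough sketch below: an ASR branch and an SLU branch share one encoder, and their losses are interpolated so the encoder benefits from both supervisions. The architecture, sizes, and loss weighting are illustrative assumptions, not the paper's configuration.

```python
# Rough sketch of a shared-encoder multi-task setup (ASR CTC head + intent head);
# all sizes and the 0.3/0.7 loss weighting are assumptions for illustration.
import torch
import torch.nn as nn

class MultiTaskSLU(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, vocab_size=1000, num_intents=31):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=3, batch_first=True)
        self.asr_head = nn.Linear(hidden, vocab_size)    # per-frame CTC over subwords
        self.slu_head = nn.Linear(hidden, num_intents)   # intent from pooled states

    def forward(self, feats):
        enc, _ = self.encoder(feats)                     # (B, T, hidden)
        return self.asr_head(enc), self.slu_head(enc.mean(dim=1))

def multitask_loss(asr_logits, intent_logits, ctc_targets, input_lens, target_lens,
                   intent_labels, alpha=0.3):
    # interpolate the ASR (CTC) and SLU (cross-entropy) objectives
    ctc = nn.functional.ctc_loss(asr_logits.log_softmax(-1).transpose(0, 1),
                                 ctc_targets, input_lens, target_lens)
    ce = nn.functional.cross_entropy(intent_logits, intent_labels)
    return alpha * ctc + (1.0 - alpha) * ce
```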
End-to-End Spoken Language Understanding Without Full Transcripts
An essential component of spoken language understanding (SLU) is slot
filling: representing the meaning of a spoken utterance using semantic entity
labels. In this paper, we develop end-to-end (E2E) spoken language
understanding systems that directly convert speech input to semantic entities
and investigate if these E2E SLU models can be trained solely on semantic
entity annotations without word-for-word transcripts. Training such models is
very useful as they can drastically reduce the cost of data collection. We
created two types of such speech-to-entities models, a CTC model and an
attention-based encoder-decoder model, by adapting models trained originally
for speech recognition. Given that our experiments involve speech input, these
systems need to recognize both the entity label and words representing the
entity value correctly. For our speech-to-entities experiments on the ATIS
corpus, both the CTC and attention models showed impressive ability to skip
non-entity words: there was little degradation when trained on just entities
versus full transcripts. We also explored the scenario where the entities are
in an order not necessarily related to their spoken order in the utterance. With its ability to re-order, the attention model did remarkably well, with only about a 2% degradation in speech-to-bag-of-entities F1 score. Comment: 5 pages, to be published in Interspeech 202
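To make the training setup concrete, here is a small, hypothetical sketch of the data-preparation idea: the target sequence keeps only entity labels and the words that realize them, rather than the full transcript. The label tokens and formatting are illustrative, not the paper's exact scheme.

```python
# Hypothetical target construction for entity-only training (illustrative format).
def entities_to_target(entities):
    """entities: list of (label, value) pairs for one utterance."""
    tokens = []
    for label, value in entities:
        tokens.append(f"<{label}>")    # entity label token
        tokens.extend(value.split())   # words realizing the entity value
    return " ".join(tokens)

# Example: "<fromloc.city_name> boston <toloc.city_name> denver"
print(entities_to_target([("fromloc.city_name", "boston"),
                          ("toloc.city_name", "denver")]))
```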