Joint Learning of Correlated Sequence Labelling Tasks Using Bidirectional Recurrent Neural Networks
The stream of words produced by Automatic Speech Recognition (ASR) systems is
typically devoid of punctuation and formatting. Most natural language
processing applications expect segmented and well-formatted text as input,
which ASR output does not provide. This paper proposes a novel technique of
jointly modeling multiple correlated tasks such as punctuation and
capitalization using bidirectional recurrent neural networks, which leads to
improved performance for each of these tasks. This method could be extended to
joint modeling of any other correlated sequence labeling tasks.
Comment: Accepted in Interspeech 201
Alzheimer's Dementia Recognition Using Acoustic, Lexical, Disfluency and Speech Pause Features Robust to Noisy Inputs
We present two multimodal fusion-based deep learning models that consume
ASR-transcribed speech and acoustic data simultaneously to classify whether a
speaker in a structured diagnostic task has Alzheimer's Disease and to what
degree, evaluating on the ADReSSo Challenge 2021 data. Our best model, a BiLSTM
with highway layers using words, word probabilities, disfluency features, pause
information, and a variety of acoustic features, achieves an accuracy of 84%
and an RMSE of 4.26 when predicting MMSE cognitive scores. While predicting
cognitive decline is more challenging, our models show improvement using the
multimodal approach and word probabilities, disfluency and pause information
over word-only models. We show considerable gains for AD classification using
multimodal fusion and gating, which can effectively deal with noisy inputs from
acoustic features and ASR hypotheses.
Comment: INTERSPEECH 2021. arXiv admin note: substantial text overlap with
arXiv:2106.0966
Best of Both Worlds: Making High Accuracy Non-incremental Transformer-based Disfluency Detection Incremental
While Transformer-based text classifiers pre-trained on large volumes of text have yielded significant improvements on a wide range of computational linguistics tasks, their implementations have been unsuitable for live incremental processing thus far, operating only on the level of complete sentence inputs. We address the challenge of introducing methods for word-by-word left-to-right incremental processing to Transformers such as BERT, models without an intrinsic sense of linear order. We modify the training method and live decoding of non-incremental models to detect speech disfluencies with minimum latency and without pre-segmentation of dialogue acts. We experiment with several decoding methods to predict the rightward context of the word currently being processed using a GPT-2 language model and apply a BERT-based disfluency detector to sequences, including predicted words. We show our method of incrementalising Transformers maintains most of their high non-incremental performance while operating strictly incrementally. We also evaluate our models’ incremental performance to establish the trade-off between incremental performance and final performance, using different prediction strategies. We apply our system to incremental speech recognition results as they arrive into a live system and achieve state-of-the-art results in this setting