A novel multimodal dynamic fusion network for disfluency detection in spoken utterances
Disfluency, though originating in human spoken utterances, is primarily studied as a unimodal, text-based Natural Language Processing (NLP) task. In this paper, we propose a novel multimodal architecture for disfluency detection from individual utterances, based on early fusion and self-attention-based multimodal interaction between the text and acoustic modalities. Our architecture leverages a multimodal dynamic fusion network that adds minimal parameters over an existing text encoder commonly used in prior art, in order to capture the prosodic and acoustic cues hidden in speech. Through experiments, we show that our proposed model achieves state-of-the-art results on the widely used English Switchboard corpus for disfluency detection, outperforming prior unimodal and multimodal systems in the literature by a significant margin. In addition, we present a thorough qualitative analysis and show that, unlike text-only systems, which suffer from spurious correlations in the data, our system overcomes this problem through additional cues from speech signals. We make all our code publicly available on GitHub.

Comment: Submitted to ICASSP 2023. arXiv admin note: text overlap with arXiv:2203.1679
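The early-fusion, self-attention-based interaction described in the abstract can be sketched in a much-simplified form. Everything below is an illustrative stand-in, not the authors' released model: the shapes, the random weights, the single attention head, and the projection of acoustic frames into the text embedding space are all assumptions made for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, d_k):
    # Single-head scaled dot-product self-attention over the fused sequence,
    # so text positions can attend to acoustic frames and vice versa.
    Wq, Wk, Wv = (rng.standard_normal((x.shape[-1], d_k)) * 0.1 for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(d_k))
    return attn @ v

# Hypothetical inputs: 10 text-token embeddings (768-d, e.g. from a text
# encoder) and 20 acoustic frames (128-d, e.g. from a speech front-end).
text = rng.standard_normal((10, 768))
acoustic = rng.standard_normal((20, 128))

# Project acoustic frames into the text embedding space, then early-fuse
# by concatenating the two modalities along the sequence axis.
W_proj = rng.standard_normal((128, 768)) * 0.05
fused = np.concatenate([text, acoustic @ W_proj], axis=0)  # (30, 768)

# Cross-modal self-attention over the fused sequence, then mean-pooling
# to get a single utterance-level representation.
h = self_attention(fused, d_k=64)   # (30, 64)
utterance = h.mean(axis=0)          # (64,)

# Tiny linear head for a binary fluent/disfluent decision.
W_out = rng.standard_normal((64, 2)) * 0.1
probs = softmax(utterance @ W_out)  # two class probabilities summing to 1
```

The paper's dynamic fusion network adds only minimal parameters on top of the text encoder; here that corresponds to `W_proj`, the attention weights, and `W_out`, while the text embeddings themselves are assumed to come from an existing pretrained encoder.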
Exploration of A Self-Supervised Speech Model: A Study on Emotional Corpora
Self-supervised speech models have advanced rapidly over the past few years and have proven useful for various downstream tasks. Some recent work has started to examine the characteristics of these models, yet many questions remain unaddressed. In this work, we conduct a study on emotional corpora to explore a popular self-supervised model, wav2vec 2.0. Through a set of quantitative analyses, we demonstrate that: 1) wav2vec 2.0 appears to discard paralinguistic information that is less useful for word recognition; 2) for emotion recognition, representations from the middle layer alone perform as well as those derived from layer averaging, while the final layer yields the worst performance in some cases; and 3) current self-supervised models may not be the optimal solution for downstream tasks that rely on non-lexical features. Our work provides novel findings that will aid future research in this area and a theoretical basis for the use of existing models.

Comment: Accepted to SLT 202
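The three feature-extraction strategies compared in the abstract (a single middle layer, an average over all layers, and the final layer) can be sketched as follows. The hidden states here are random stand-ins with hypothetical wav2vec 2.0-like shapes (13 layers, 50 frames, 768 dimensions); in practice they would come from a pretrained model's per-layer outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for per-layer hidden states of a wav2vec 2.0-style model:
# (num_layers, time_frames, hidden_dim). Real values would come from
# running the pretrained model with all hidden states returned.
num_layers, frames, dim = 13, 50, 768
hidden_states = rng.standard_normal((num_layers, frames, dim))

# Strategy 1: mean-pool a single middle layer over time
# (reported to work as well as layer averaging for emotion recognition).
middle = hidden_states[num_layers // 2].mean(axis=0)   # (768,)

# Strategy 2: average across all layers, then mean-pool over time.
averaged = hidden_states.mean(axis=0).mean(axis=0)     # (768,)

# Strategy 3: final layer only (reported weakest in some cases,
# consistent with it discarding paralinguistic information).
final = hidden_states[-1].mean(axis=0)                 # (768,)
```

Each strategy yields a fixed-size utterance embedding that a downstream emotion classifier could consume; the comparison in the paper amounts to training the same classifier on each of these three feature types.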