What Do Self-Supervised Speech Models Know About Words?
Many self-supervised speech models (S3Ms) have been introduced over the last
few years, improving performance and data efficiency on various speech tasks.
However, these empirical successes alone do not give a complete picture of what
is learned during pre-training. Recent work has begun analyzing how S3Ms encode
certain properties, such as phonetic and speaker information, but we still lack
a proper understanding of knowledge encoded at the word level and beyond. In
this work, we use lightweight analysis methods to study segment-level
linguistic properties -- word identity, boundaries, pronunciation, syntactic
features, and semantic features -- encoded in S3Ms. We present a comparative
study of layer-wise representations from ten S3Ms and find that (i) the
frame-level representations within each word segment are not all equally
informative, and (ii) the pre-training objective and model size heavily
influence the accessibility and distribution of linguistic information across
layers. We also find that on several tasks -- word discrimination, word
segmentation, and semantic sentence similarity -- S3Ms trained with visual
grounding outperform their speech-only counterparts. Finally, our task-based
analyses demonstrate improved performance on word segmentation and acoustic
word discrimination while using simpler methods than prior work.
Comment: Pre-MIT Press publication version.
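As a rough illustration of this kind of lightweight, layer-wise analysis, the sketch below mean-pools frame-level representations within a word segment from a HuggingFace HuBERT checkpoint and compares two segments by cosine similarity. The checkpoint name, audio file, and word boundary times are assumptions for illustration; the paper's actual probes and ten-model comparison are not reproduced here.

```python
# Minimal sketch (not the paper's exact pipeline): pool frame-level S3M
# representations within a word segment and compare segments at one layer.
# Assumes the HuggingFace checkpoint "facebook/hubert-base-ls960", a mono
# 16 kHz file "utterance.wav", and hypothetical word boundary times.
import torch
import torchaudio
from transformers import AutoFeatureExtractor, AutoModel

extractor = AutoFeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
model = AutoModel.from_pretrained("facebook/hubert-base-ls960").eval()

wav, sr = torchaudio.load("utterance.wav")   # assumed mono, 16 kHz
inputs = extractor(wav.squeeze().numpy(), sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)
layers = out.hidden_states                    # tuple: (num_layers + 1) x (1, T, D)

FRAME_SEC = 0.02                              # HuBERT base emits ~50 frames per second

def word_embedding(layer_idx, start_s, end_s):
    """Mean-pool the frames of one layer that fall inside a word segment."""
    s = int(start_s / FRAME_SEC)
    e = max(int(end_s / FRAME_SEC), s + 1)
    return layers[layer_idx][0, s:e].mean(dim=0)

# Toy acoustic word discrimination: two segments count as the "same word"
# if their pooled embeddings are close in cosine similarity.
emb_a = word_embedding(layer_idx=9, start_s=0.31, end_s=0.62)   # boundaries are assumed
emb_b = word_embedding(layer_idx=9, start_s=1.05, end_s=1.40)
score = torch.nn.functional.cosine_similarity(emb_a, emb_b, dim=0)
print(f"layer 9 similarity: {score.item():.3f}")
```

Repeating the comparison across `layer_idx` values gives a simple view of how word-level information is distributed over layers, which is the kind of layer-wise comparison the abstract describes.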
SLUE: New Benchmark Tasks for Spoken Language Understanding Evaluation on Natural Speech
Progress in speech processing has been facilitated by shared datasets and
benchmarks. Historically these have focused on automatic speech recognition
(ASR), speaker identification, or other lower-level tasks. Interest has been
growing in higher-level spoken language understanding tasks, including using
end-to-end models, but there are fewer annotated datasets for such tasks. At
the same time, recent work shows the possibility of pre-training generic
representations and then fine-tuning for several tasks using relatively little
labeled data. We propose to create a suite of benchmark tasks for Spoken
Language Understanding Evaluation (SLUE) consisting of limited-size labeled
training sets and corresponding evaluation sets. This resource would allow the
research community to track progress, evaluate pre-trained representations for
higher-level tasks, and study open questions such as the utility of pipeline
versus end-to-end approaches. We present the first phase of the SLUE benchmark
suite, consisting of named entity recognition, sentiment analysis, and ASR on
the corresponding datasets. We focus on naturally produced (not read or
synthesized) speech, and freely available datasets. We provide new
transcriptions and annotations on subsets of the VoxCeleb and VoxPopuli
datasets, evaluation metrics and results for baseline models, and an
open-source toolkit to reproduce the baselines and evaluate new models.
Comment: Updated preprint (sentiment annotation on the test set was updated).
Toolkit link: https://github.com/asappresearch/slue-toolkit
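The sketch below illustrates the general recipe SLUE is meant to evaluate: a frozen pre-trained speech encoder plus a small classifier trained on a limited labeled set, here for sentiment. It is not the slue-toolkit baseline; the encoder checkpoint, the three-way label set, and the `labeled_set` iterable are assumptions made for illustration.

```python
# Minimal sketch of "pre-trained representation + small labeled set":
# freeze a self-supervised encoder and train a light sentiment classifier.
# Not the slue-toolkit baseline; checkpoint and label set are assumptions.
import torch
import torch.nn as nn
from transformers import AutoFeatureExtractor, AutoModel

encoder_name = "facebook/wav2vec2-base"        # assumed checkpoint
extractor = AutoFeatureExtractor.from_pretrained(encoder_name)
encoder = AutoModel.from_pretrained(encoder_name).eval()
for p in encoder.parameters():                 # keep the pre-trained encoder frozen
    p.requires_grad = False

classifier = nn.Linear(encoder.config.hidden_size, 3)   # assumed: negative / neutral / positive
optimizer = torch.optim.AdamW(classifier.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def utterance_embedding(waveform, sr=16000):
    """Mean-pool the encoder's last layer over time: one vector per clip."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state    # (1, T, D)
    return hidden.mean(dim=1)                           # (1, D)

def train_epoch(labeled_set):
    """One pass over a small labeled set of (1-D float waveform, int label) pairs."""
    for waveform, label in labeled_set:
        logits = classifier(utterance_embedding(waveform))
        loss = loss_fn(logits, torch.tensor([label]))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Only the small classifier is trained here; comparing this kind of lightweight head against full fine-tuning or a pipeline (ASR followed by a text model) is one of the open questions the benchmark is designed to study.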