2,079 research outputs found
NoRefER: a Referenceless Quality Metric for Automatic Speech Recognition via Semi-Supervised Language Model Fine-Tuning with Contrastive Learning
This paper introduces NoRefER, a novel referenceless quality metric for
automatic speech recognition (ASR) systems. Traditional reference-based metrics
for evaluating ASR systems require costly ground-truth transcripts. NoRefER
overcomes this limitation by fine-tuning a multilingual language model for
pair-wise ranking ASR hypotheses using contrastive learning with Siamese
network architecture. The self-supervised NoRefER exploits the known quality
relationships between hypotheses from multiple compression levels of an ASR for
learning to rank intra-sample hypotheses by quality, which is essential for
model comparisons. The semi-supervised version also uses a referenced dataset
to improve its inter-sample quality ranking, which is crucial for selecting
potentially erroneous samples. The results indicate that NoRefER correlates
highly with reference-based metrics and their intra-sample ranks, indicating a
high potential for referenceless ASR evaluation or a/b testing
A Reference-less Quality Metric for Automatic Speech Recognition via Contrastive-Learning of a Multi-Language Model with Self-Supervision
The common standard for quality evaluation of automatic speech recognition
(ASR) systems is reference-based metrics such as the Word Error Rate (WER),
computed using manual ground-truth transcriptions that are time-consuming and
expensive to obtain. This work proposes a multi-language referenceless quality
metric, which allows comparing the performance of different ASR models on a
speech dataset without ground truth transcriptions. To estimate the quality of
ASR hypotheses, a pre-trained language model (LM) is fine-tuned with
contrastive learning in a self-supervised learning manner. In experiments
conducted on several unseen test datasets consisting of outputs from top
commercial ASR engines in various languages, the proposed referenceless metric
obtains a much higher correlation with WER scores and their ranks than the
perplexity metric from the state-of-art multi-lingual LM in all experiments,
and also reduces WER by more than when used for ensembling hypotheses.
The fine-tuned model and experiments are made available for the
reproducibility: https://github.com/aixplain/NoRefERComment: arXiv admin note: substantial text overlap with arXiv:2306.1257
Linguistically Motivated Sign Language Segmentation
Sign language segmentation is a crucial task in sign language processing
systems. It enables downstream tasks such as sign recognition, transcription,
and machine translation. In this work, we consider two kinds of segmentation:
segmentation into individual signs and segmentation into phrases, larger units
comprising several signs. We propose a novel approach to jointly model these
two tasks.
Our method is motivated by linguistic cues observed in sign language corpora.
We replace the predominant IO tagging scheme with BIO tagging to account for
continuous signing. Given that prosody plays a significant role in phrase
boundaries, we explore the use of optical flow features. We also provide an
extensive analysis of hand shapes and 3D hand normalization.
We find that introducing BIO tagging is necessary to model sign boundaries.
Explicitly encoding prosody by optical flow improves segmentation in shallow
models, but its contribution is negligible in deeper models. Careful tuning of
the decoding algorithm atop the models further improves the segmentation
quality.
We demonstrate that our final models generalize to out-of-domain video
content in a different signed language, even under a zero-shot setting. We
observe that including optical flow and 3D hand normalization enhances the
robustness of the model in this context.Comment: Accepted at EMNLP 2023 (Findings
Deep learning assessment of syllable affiliation of intervocalic consonants
In English, a sentence like “He made out our intentions.” could be misperceived as “He may doubt our intentions.” because the coda /d/ sounds like it has become the onset of the next syllable. The nature and occurrence condition of this resyllabification phenomenon are unclear, however. Previous empirical studies mainly relied on listener judgment, limited acoustic evidence, such as voice onset time, or average formant values to determine the occurrence of resyllabification. This study tested the hypothesis that resyllabification is a coarticulatory reorganisation that realigns the coda consonant with the vowel of the next syllable. Deep learning in conjunction with dynamic time warping (DTW) was used to assess syllable affiliation of intervocalic consonants. The results suggest that convolutional neural network- and recurrent neural network-based models can detect cases of resyllabification using Mel-frequency spectrograms. DTW analysis shows that neural network inferred resyllabified sequences are acoustically more similar to their onset counterparts than their canonical productions. A binary classifier further suggests that, similar to the genuine onsets, the inferred resyllabified coda consonants are coarticulated with the following vowel. These results are interpreted with an account of resyllabification as a speech-rate-dependent coarticulatory reorganisation mechanism in speech
Sign language translation from instructional videos
The advances in automatic sign language translation (SLT) to spoken languages have been mostly benchmarked with datasets of limited size and restricted domains. Our work advances the state of the art by providing the first baseline results on How2Sign, a large and broad dataset. We train a Transformer over I3D video features, using the reduced BLEU as a reference metric for validation, instead of the widely used BLEU score. We report a result of 8.03 on the BLEU score, and publish the first open-source implementation of its kind to promote further advances.This research was partially supported by research grant Adavoice PID2019-107579RB-I00 / AEI / 10.13039/501100011033, research grants PRE2020-094223, PID2021-126248OB-I00 and PID2019-107255GB-C21 and by Generalitat de Catalunya (AGAUR) under grant agreement 2021-SGR-00478.Peer ReviewedPostprint (published version
- …