Best of Both Worlds: Making High Accuracy Non-incremental Transformer-based Disfluency Detection Incremental
While Transformer-based text classifiers pre-trained on large volumes of text have yielded significant improvements on a wide range of computational linguistics tasks, their implementations have been unsuitable for live incremental processing thus far, operating only on the level of complete sentence inputs. We address the challenge of introducing methods for word-by-word left-to-right incremental processing to Transformers such as BERT, models without an intrinsic sense of linear order. We modify the training method and live decoding of non-incremental models to detect speech disfluencies with minimum latency and without pre-segmentation of dialogue acts. We experiment with several decoding methods to predict the rightward context of the word currently being processed using a GPT-2 language model and apply a BERT-based disfluency detector to sequences, including predicted words. We show our method of incrementalising Transformers maintains most of their high non-incremental performance while operating strictly incrementally. We also evaluate our models’ incremental performance to establish the trade-off between incremental performance and final performance, using different prediction strategies. We apply our system to incremental speech recognition results as they arrive into a live system and achieve state-of-the-art results in this setting.
Automatically Neutralizing Subjective Bias in Text
Texts like news, encyclopedias, and some social media strive for objectivity.
Yet bias in the form of inappropriate subjectivity - introducing attitudes via
framing, presupposing truth, and casting doubt - remains ubiquitous. This kind
of bias erodes our collective trust and fuels social conflict. To address this
issue, we introduce a novel testbed for natural language generation:
automatically bringing inappropriately subjective text into a neutral point of
view ("neutralizing" biased text). We also offer the first parallel corpus of
biased language. The corpus contains 180,000 sentence pairs and originates from
Wikipedia edits that removed various framings, presuppositions, and attitudes
from biased sentences. Last, we propose two strong encoder-decoder baselines
for the task. A straightforward yet opaque CONCURRENT system uses a BERT
encoder to identify subjective words as part of the generation process. An
interpretable and controllable MODULAR algorithm separates these steps, using
(1) a BERT-based classifier to identify problematic words and (2) a novel join
embedding through which the classifier can edit the hidden states of the
encoder. Large-scale human evaluation across four domains (encyclopedias, news
headlines, books, and political speeches) suggests that these algorithms are a
first step towards the automatic identification and reduction of bias.
Comment: To appear at AAAI 202
Weakly-supervised forced alignment of disfluent speech using phoneme-level modeling
The study of speech disorders can benefit greatly from time-aligned data.
However, audio-text mismatches in disfluent speech cause rapid performance
degradation for modern speech aligners, hindering the use of automatic
approaches. In this work, we propose a simple and effective modification of
alignment graph construction of CTC-based models using Weighted Finite State
Transducers. The proposed weakly-supervised approach alleviates the need for
verbatim transcription of speech disfluencies for forced alignment. During the
graph construction, we allow the modeling of common speech disfluencies, i.e.
repetitions and omissions. Further, we show that by assessing the degree of
audio-text mismatch through the use of Oracle Error Rate, our method can be
effectively used in the wild. Our evaluation on a corrupted version of the
TIMIT test set and the UCLASS dataset shows significant improvements,
particularly for recall, achieving a 23-25% relative improvement over our
baselines.
Comment: Interspeech 202
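The idea of an alignment graph that tolerates repetitions and omissions can be illustrated with a small dynamic program. This is only a sketch under stated assumptions: it works on plain symbol sequences rather than CTC posteriors or Weighted Finite State Transducers, and the function name `align_cost`, the penalty parameters, and the example symbols are hypothetical and not taken from the paper.

```python
import math

def align_cost(reference, spoken, rep_cost=1.0, skip_cost=1.0):
    """Minimum cost of explaining `spoken` by `reference` when reference
    symbols may be repeated (repetition) or skipped (omission).
    Returns math.inf if no such alignment exists."""
    n, m = len(spoken), len(reference)
    # dp[i][j]: best cost after consuming i spoken symbols and
    # accounting for the first j reference symbols.
    dp = [[math.inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            cost = dp[i][j]
            if cost == math.inf:
                continue
            # Omission arc: skip reference[j] without consuming speech.
            if j < m:
                dp[i][j + 1] = min(dp[i][j + 1], cost + skip_cost)
            # Regular arc: spoken symbol matches the next reference symbol.
            if i < n and j < m and spoken[i] == reference[j]:
                dp[i + 1][j + 1] = min(dp[i + 1][j + 1], cost)
            # Repetition arc: spoken symbol repeats the previous reference symbol.
            if i < n and j > 0 and spoken[i] == reference[j - 1]:
                dp[i + 1][j] = min(dp[i + 1][j], cost + rep_cost)
    return dp[n][m]

# Hypothetical example: "the" is repeated and "cat" is omitted.
reference = ["the", "cat", "sat"]
spoken = ["the", "the", "sat"]
cost = align_cost(reference, spoken)
```

In this toy setting a fluent reading costs nothing, each repetition or omission adds a penalty, and speech that cannot be explained by the reference at all gets an infinite cost.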
Mispronunciation Detection in Children's Reading of Sentences
This work proposes an approach to automatically parse children’s reading of sentences by detecting word pronunciations and extra content, and to classify words as correctly or incorrectly pronounced. This approach can be directly helpful for automatic assessment of reading level or for automatic reading tutors, where a correct reading must be identified. We propose a first segmentation stage to locate candidate word pronunciations based on allowing repetitions and false starts of a word’s syllables. A decoding grammar based solely on syllables allows silence to appear during a word pronunciation. At a second stage, word candidates are classified as mispronounced or not. The feature that best classifies mispronunciations is found to be the log-likelihood ratio between a free phone loop and a word spotting model in the very close vicinity of the candidate segmentation. Additional features are combined in multi-feature models to further improve classification, including: normalizations of the log-likelihood ratio, derivations from phone likelihoods, and Levenshtein distances between the correct pronunciation and recognized phonemes through two phoneme recognition approaches. Results show that most extra events were detected (close to 2% word error rate achieved) and that using automatic segmentation for mispronunciation classification approaches the performance of manual segmentation. Although the log-likelihood ratio from a spotting approach is already a good metric to classify word pronunciations, the combination of additional features provides a relative reduction of the miss rate of 18% (from 34.03% to 27.79% using manual segmentation and from 35.58% to 29.35% using automatic segmentation, at constant 5% false alarm rate).
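Two quantities in this abstract are easy to make concrete: the Levenshtein distance between a canonical pronunciation and a recognized phoneme sequence, and the relative reduction of the miss rate. The following is a minimal sketch; the phoneme sequences are invented for illustration, and the helper names `levenshtein` and `relative_reduction` are assumptions, not the authors' code. Only the four miss-rate figures come from the abstract.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences, with unit cost for each
    insertion, deletion, and substitution."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]

def relative_reduction(before, after):
    """Relative reduction of a rate, as a fraction of the starting value."""
    return (before - after) / before

# Hypothetical phoneme sequences (not from the paper).
canonical = ["k", "ae", "t"]
recognized = ["k", "ah", "t", "s"]
distance = levenshtein(canonical, recognized)

# Miss rates reported in the abstract, in percent.
manual = relative_reduction(34.03, 27.79)
automatic = relative_reduction(35.58, 29.35)
```

Here `distance` counts one substitution and one insertion, and both `manual` and `automatic` come out at roughly 0.18, consistent with the 18% relative reduction quoted above.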
Long-Form Speech Translation through Segmentation with Finite-State Decoding Constraints on Large Language Models
One challenge in speech translation is that plenty of spoken content is
long-form, but short units are necessary for obtaining high-quality
translations. To address this mismatch, we adapt large language models (LLMs)
to split long ASR transcripts into segments that can be independently
translated so as to maximize the overall translation quality. We overcome the
tendency of hallucination in LLMs by incorporating finite-state constraints
during decoding; these eliminate invalid outputs without requiring additional
training. We discover that LLMs are adaptable to transcripts containing ASR
errors through prompt-tuning or fine-tuning. Relative to a state-of-the-art
automatic punctuation baseline, our best LLM improves the average BLEU by 2.9
points for English-German, English-Spanish, and English-Arabic TED talk
translation in 9 test sets, just by improving segmentation.
Comment: accepted to the Findings of EMNLP 2023. arXiv admin note: text overlap with arXiv:2212.0989
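A finite-state constraint on decoding can be sketched as follows. The abstract does not spell out the exact constraint, so this sketch assumes one plausible form: the output must copy the input words in order, and the only other symbol allowed is a segment boundary. The marker `<seg>`, the functions `allowed_next` and `constrained_segment`, and the scorer `toy_score` (a stand-in for a language model) are all hypothetical.

```python
BOUNDARY = "<seg>"  # hypothetical segment-boundary marker

def allowed_next(words, pos, last_was_boundary):
    """Tokens the finite-state constraint permits in the current state."""
    if pos == len(words):
        return []  # every input word has been copied, so decoding stops
    options = [words[pos]]  # copying the next input word is always valid
    if pos > 0 and not last_was_boundary:
        options.append(BOUNDARY)  # no leading or doubled boundaries
    return options

def constrained_segment(words, score):
    """Greedy decoding that only considers tokens the constraint allows."""
    output, pos, last_was_boundary = [], 0, False
    while True:
        options = allowed_next(words, pos, last_was_boundary)
        if not options:
            break
        choice = max(options, key=lambda tok: score(output, tok))
        output.append(choice)
        if choice == BOUNDARY:
            last_was_boundary = True
        else:
            pos += 1
            last_was_boundary = False
    return output

def toy_score(prefix, token):
    """Toy stand-in for a model score: favors a boundary after "you"."""
    if token == BOUNDARY:
        return 1.0 if prefix and prefix[-1] == "you" else -1.0
    return 0.0

words = ["thank", "you", "it", "is", "great", "today"]
segmented = constrained_segment(words, toy_score)
```

Because the scorer is only ever consulted on allowed tokens, the output cannot drop, reorder, or invent words, whatever the scorer prefers. No additional training is involved in this sketch; the constraint acts purely at decoding time.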