Disfluency Detection using a Noisy Channel Model and a Deep Neural Language Model
This paper presents a model for disfluency detection in spontaneous speech
transcripts called LSTM Noisy Channel Model. The model uses a Noisy Channel
Model (NCM) to generate n-best candidate disfluency analyses and a Long
Short-Term Memory (LSTM) language model to score the underlying fluent
sentences of each analysis. The LSTM language model scores, along with other
features, are used in a MaxEnt reranker to identify the most plausible
analysis. We show that using an LSTM language model in the reranking process of
a noisy channel disfluency model improves the state of the art in disfluency
detection.
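The reranking idea in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: a tiny add-one-smoothed bigram model stands in for the LSTM language model, a plain weighted sum stands in for the MaxEnt reranker, and the names `train_bigram_lm`, `fluent_part`, and `rerank` are hypothetical. Each candidate analysis tags every word as fluent (`"F"`) or edited (`"E"`); the language model scores only the words left after the edited ones are removed.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Add-one-smoothed bigram LM (toy stand-in for the LSTM language model)."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            unigrams[prev] += 1
            bigrams[(prev, cur)] += 1
    v = len(vocab) + 1  # one extra slot for unseen words

    def log_prob(sent):
        tokens = ["<s>"] + sent + ["</s>"]
        return sum(
            math.log((bigrams[(p, c)] + 1) / (unigrams[p] + v))
            for p, c in zip(tokens, tokens[1:])
        )

    return log_prob


def fluent_part(words, tags):
    """Keep only the words tagged fluent ("F"); drop edited ("E") words."""
    return [w for w, t in zip(words, tags) if t == "F"]


def rerank(words, candidates, lm_log_prob, lm_weight=1.0):
    """Pick the candidate tagging with the best combined score.

    candidates: list of (tags, channel_score) pairs, where channel_score is
    an assumed log-score from the noisy channel model.
    """
    def score(candidate):
        tags, channel_score = candidate
        return channel_score + lm_weight * lm_log_prob(fluent_part(words, tags))

    return max(candidates, key=score)[0]
```

A real system would add further features and learn the combination weights rather than fixing `lm_weight` by hand.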
Disfluency Detection using Auto-Correlational Neural Networks
In recent years, the natural language processing community has moved away
from task-specific feature engineering, i.e., researchers discovering ad-hoc
feature representations for various tasks, in favor of general-purpose methods
that learn the input representation by themselves. However, state-of-the-art
approaches to disfluency detection in spontaneous speech transcripts currently
still depend on an array of hand-crafted features, and other representations
derived from the output of pre-existing systems such as language models or
dependency parsers. As an alternative, this paper proposes a simple yet
effective model for automatic disfluency detection, called an
auto-correlational neural network (ACNN). The model uses a convolutional neural
network (CNN) and augments it with a new auto-correlation operator at the
lowest layer that can capture the kinds of "rough copy" dependencies that are
characteristic of repair disfluencies in speech. In experiments, the ACNN model
outperforms the baseline CNN on a disfluency detection task with a 5% increase
in F-score, which brings it close to the previous best result on this task.
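To make the "rough copy" intuition concrete, here is a minimal sketch under strong simplifying assumptions. It is not the paper's auto-correlation operator, which works on learned representations inside a CNN; it only compares word strings within a fixed window. The function names and the `window` parameter are hypothetical.

```python
def self_similarity(words, window=4):
    """Binary matrix: sim[i][j] = 1 when words i and j match within the window."""
    n = len(words)
    sim = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, min(n, i + window + 1)):
            if words[i] == words[j]:
                sim[i][j] = sim[j][i] = 1
    return sim


def repeated_later(words, window=4):
    """Per-word flag: does this word reappear within `window` following words?"""
    sim = self_similarity(words, window)
    n = len(words)
    return [any(sim[i][j] for j in range(i + 1, n)) for i in range(n)]
```

For an utterance such as "a flight to boston uh to denver", the first "to" is flagged because it recurs three words later, which is the kind of cue a repair detector can exploit.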
Increase Apparent Public Speaking Fluency By Speech Augmentation
Fluent and confident speech is desirable to every speaker. However, delivering
speech professionally requires a great deal of experience and practice. In this
paper, we propose a speech stream manipulation system which can help
non-professional speakers to produce fluent, professional-like speech content,
in turn contributing towards better listener engagement and comprehension. We
propose to achieve this task by manipulating the disfluencies in human speech,
such as the sounds 'uh' and 'um', filler words, and awkwardly long silences.
Given any unrehearsed speech, we segment and silence the filled pauses and
adjust the duration of the imposed silences, as well as other long
('disfluent') pauses, using a predictive model learned from a professional
speech dataset. Finally, we output an audio stream in which the speaker sounds
more fluent, confident, and practiced compared to the original recording.
According to our quantitative evaluation, we significantly increase the fluency
of speech by reducing the rate of pauses and fillers.
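The pause-manipulation step can be sketched on a symbolic segment list. This is a sketch under stated assumptions, not the paper's system: segments are hypothetical `(kind, duration_in_seconds)` pairs, and a fixed cap `max_pause` replaces the predictive model that the abstract says is learned from professional speech.

```python
def smooth_pauses(segments, max_pause=0.5):
    """Silence fillers, merge adjacent silences, and cap long pauses.

    segments: list of (kind, duration) with kind in {"speech", "filler", "pause"}.
    max_pause: assumed fixed cap in seconds (the paper predicts durations instead).
    """
    merged = []
    for kind, duration in segments:
        if kind == "filler":
            kind = "pause"  # a silenced filler becomes a pause
        if kind == "pause" and merged and merged[-1][0] == "pause":
            merged[-1] = ("pause", merged[-1][1] + duration)
        else:
            merged.append((kind, duration))
    return [(k, min(d, max_pause) if k == "pause" else d) for k, d in merged]
```

Speech segments pass through unchanged; only silences, including silenced fillers, are shortened.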
Robust cross-domain disfluency detection with pattern match networks
In this paper we introduce a novel pattern match neural network architecture
that uses neighbor similarity scores as features, eliminating the need for
feature engineering in a disfluency detection task. We evaluate the approach in
disfluency detection for four different speech genres, showing that the
approach is as effective as hand-engineered pattern match features when used on
in-domain data and achieves superior performance in cross-domain scenarios.
Comment: This paper was submitted to EMNLP 2018 and was rejected. Our EMNLP
submission is posted here to establish concurrency with "Disfluency Detection
using Auto-Correlational Neural Networks" by P. Lou, P. Anderson, M. Johnson
which was submitted to EMNLP at the same time.
Improving Disfluency Detection by Self-Training a Self-Attentive Model
Self-attentive neural syntactic parsers using contextualized word embeddings
(e.g. ELMo or BERT) currently produce state-of-the-art results in joint parsing
and disfluency detection in speech transcripts. Since the contextualized word
embeddings are pre-trained on a large amount of unlabeled data, using
additional unlabeled data to train a neural model might seem redundant.
However, we show that self-training - a semi-supervised technique for
incorporating unlabeled data - sets a new state-of-the-art for the
self-attentive parser on disfluency detection, demonstrating that self-training
provides benefits orthogonal to the pre-trained contextualized word
representations. We also show that ensembling self-trained parsers provides
further gains for disfluency detection.
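Self-training in general can be sketched as a loop that trains a model, labels unlabeled data with it, and retrains on the enlarged set. The sketch below is a generic illustration, not the paper's parser setup: the confidence threshold is an assumed, commonly used variant, and `train_centroids` is a hypothetical toy classifier on one-dimensional inputs used only to make the loop runnable.

```python
def self_train(labeled, unlabeled, train_fn, threshold=0.9, rounds=3):
    """Generic self-training loop.

    labeled: list of (x, y); unlabeled: list of x.
    train_fn(data) must return predict(x) -> (label, confidence).
    """
    data, pool = list(labeled), list(unlabeled)
    predict = train_fn(data)
    for _ in range(rounds):
        confident, rest = [], []
        for x in pool:
            label, conf = predict(x)
            (confident if conf >= threshold else rest).append((x, label))
        if not confident:
            break  # nothing new to learn from
        data.extend(confident)          # adopt the model's own confident labels
        pool = [x for x, _ in rest]
        predict = train_fn(data)        # retrain on gold plus pseudo-labels
    return predict


def train_centroids(data):
    """Toy nearest-centroid classifier on floats (illustrative assumption)."""
    sums = {}
    for x, y in data:
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    centroids = {y: total / count for y, (total, count) in sums.items()}

    def predict(x):
        ranked = sorted(centroids, key=lambda y: abs(x - centroids[y]))
        if len(ranked) == 1:
            return ranked[0], 1.0
        d0 = abs(x - centroids[ranked[0]])
        d1 = abs(x - centroids[ranked[1]])
        conf = d1 / (d0 + d1) if d0 + d1 > 0 else 0.5
        return ranked[0], conf

    return predict
```

The ensembling mentioned in the abstract would combine several such self-trained models; that step is not shown here.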
Neural Constituency Parsing of Speech Transcripts
This paper studies the performance of a neural self-attentive parser on
transcribed speech. Speech presents parsing challenges that do not appear in
written text, such as the lack of punctuation and the presence of speech
disfluencies (including filled pauses, repetitions, corrections, etc.).
Disfluencies are especially problematic for conventional syntactic parsers,
which typically fail to find any EDITED disfluency nodes at all. This motivated
the development of special disfluency detection systems, and special mechanisms
added to parsers specifically to handle disfluencies. However, we show here
that neural parsers can find EDITED disfluency nodes, and the best neural
parsers find them with an accuracy surpassing that of specialized disfluency
detection systems, thus making these specialized mechanisms unnecessary. This
paper also investigates a modified loss function that puts more weight on
EDITED nodes. It also describes tree-transformations that simplify the
disfluency detection task by providing alternative encodings of disfluencies
and syntactic information.
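The idea of a loss that puts more weight on EDITED nodes can be illustrated with a weighted negative log-likelihood. This is only an analogy under stated assumptions: the abstract does not say which loss the parser uses, and the function name, the per-node probability dictionaries, and the default `edited_weight` are all hypothetical.

```python
import math


def weighted_nll(gold_labels, predicted_probs, edited_weight=2.0):
    """Negative log-likelihood with extra weight on nodes labeled EDITED.

    gold_labels: list of gold node labels, e.g. ["NP", "EDITED"].
    predicted_probs: list of dicts mapping label -> predicted probability.
    """
    total = 0.0
    for label, probs in zip(gold_labels, predicted_probs):
        weight = edited_weight if label == "EDITED" else 1.0
        total += -weight * math.log(probs[label])
    return total
```

With `edited_weight` above 1, a mistake on an EDITED node costs more than the same mistake on any other node, which pushes training toward better disfluency detection.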
Dolphin: A Spoken Language Proficiency Assessment System for Elementary Education
Spoken language proficiency is critically important for children's growth and
personal development. Due to the limited and imbalanced educational resources
in China, elementary students rarely have the chance to improve their oral
language skills in classes. Verbal fluency tasks (VFTs) were invented to let
the students practice their spoken language proficiency after school. VFTs are
simple but concrete math-related questions that ask students not only to report
answers but also to speak out their entire thinking process. In spite of the great
success of VFTs, they bring a heavy grading burden to elementary teachers. To
alleviate this problem, we develop Dolphin, a spoken language proficiency
assessment system for Chinese elementary education. Dolphin is able to
automatically evaluate both phonological fluency and semantic relevance of
students' VFT answers. We conduct a wide range of offline and online
experiments to demonstrate the effectiveness of Dolphin. In our offline
experiments, we show that Dolphin improves both phonological fluency and
semantic relevance evaluation performance when compared to state-of-the-art
baselines on real-world educational data sets. In our online A/B experiments,
we test Dolphin with 183 teachers from 2 major cities (Hangzhou and Xi'an) in
China for 10 weeks and the results show that VFT assignment grading coverage
is improved by 22%.
Comment: Proceedings of The Web Conference 2020 (WWW '20
Controllable Time-Delay Transformer for Real-Time Punctuation Prediction and Disfluency Detection
With the increased applications of automatic speech recognition (ASR) in
recent years, it is essential to automatically insert punctuation marks and
remove disfluencies in transcripts, to improve the readability of the
transcripts as well as the performance of subsequent applications, such as
machine translation, dialogue systems, and so forth. In this paper, we propose
a Controllable Time-delay Transformer (CT-Transformer) model that jointly
completes the punctuation prediction and disfluency detection tasks in real
time. The CT-Transformer model facilitates freezing partial outputs with
controllable time delay to fulfill the real-time constraints in partial
decoding required by subsequent applications. We further propose a fast
decoding strategy to minimize latency while maintaining competitive
performance. Experimental results on the IWSLT2011 benchmark dataset and an
in-house Chinese annotated dataset demonstrate that the proposed approach
outperforms the previous state-of-the-art models on F-scores and achieves a
competitive inference speed.
Comment: 4 pages, 2 figures, accepted by ICASSP 202
Parsing Speech: A Neural Approach to Integrating Lexical and Acoustic-Prosodic Information
In conversational speech, the acoustic signal provides cues that help
listeners disambiguate difficult parses. For automatically parsing spoken
utterances, we introduce a model that integrates transcribed text and
acoustic-prosodic features using a convolutional neural network over energy and
pitch trajectories coupled with an attention-based recurrent neural network
that accepts text and prosodic features. We find that different types of
acoustic-prosodic features are individually helpful, and together give
statistically significant improvements in parse and disfluency detection F1
scores over a strong text-only baseline. For this study with known sentence
boundaries, error analyses show that the main benefit of acoustic-prosodic
features is in sentences with disfluencies, attachment decisions are most
improved, and transcription errors obscure gains from prosody.
Comment: Accepted in NAACL HLT 201
Disfl-QA: A Benchmark Dataset for Understanding Disfluencies in Question Answering
Disfluencies are an under-studied topic in NLP, even though they are ubiquitous
in human conversation. This is largely due to the lack of datasets containing
disfluencies. In this paper, we present a new challenge question answering
dataset, Disfl-QA, a derivative of SQuAD, where humans introduce contextual
disfluencies in previously fluent questions. Disfl-QA contains a variety of
challenging disfluencies that require a more comprehensive understanding of the
text than what was necessary in prior datasets. Experiments show that the
performance of existing state-of-the-art question answering models degrades
significantly when tested on Disfl-QA in a zero-shot setting. We show that data
augmentation methods partially recover the loss in performance and also
demonstrate the efficacy of using gold data for fine-tuning. We argue that we
need large-scale disfluency datasets in order for NLP models to be robust to
them. The dataset is publicly available at:
https://github.com/google-research-datasets/disfl-qa.
Comment: Findings of ACL 202