Multi-Task Self-Supervised Learning for Disfluency Detection
Most existing approaches to disfluency detection heavily rely on
human-annotated data, which is expensive to obtain in practice. To tackle the
training data bottleneck, we investigate methods for combining multiple
self-supervised tasks, i.e., supervised tasks whose training data can be collected
without manual labeling. First, we construct large-scale pseudo training data
by randomly adding or deleting words from unlabeled news data, and propose two
self-supervised pre-training tasks: (i) a tagging task to detect the added noisy
words, and (ii) a sentence-classification task to distinguish original sentences
from grammatically incorrect ones. We then combine these two tasks to jointly
train a network. The pre-trained network is then fine-tuned using
human-annotated disfluency detection training data. Experimental results on the
commonly used English Switchboard test set show that our approach can achieve
competitive performance compared to the previous systems (trained using the
full dataset) by using less than 1% (1000 sentences) of the training data. Our
method trained on the full dataset significantly outperforms previous methods,
reducing the error rate by 21% on English Switchboard.
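The pseudo-data construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name, noise probabilities, and substitute vocabulary are all assumptions. Each randomly inserted word receives tag 1 (for the tagging task), and the corrupted sentence as a whole would be labeled "grammatically incorrect" for the sentence-classification task.

```python
import random

def make_pseudo_example(tokens, p_add=0.1, p_del=0.1, vocab=None, rng=None):
    """Corrupt a clean sentence by randomly adding and deleting words.

    Returns (noisy_tokens, tags), where tags[i] == 1 marks a randomly
    inserted token that the tagging task should detect. Deleted words
    simply vanish from the output, leaving an ungrammatical sentence.
    """
    rng = rng or random.Random(0)
    vocab = vocab or tokens  # fall back to in-sentence words as insertions
    noisy, tags = [], []
    for tok in tokens:
        if rng.random() < p_add:       # insert a random noisy word before tok
            noisy.append(rng.choice(vocab))
            tags.append(1)
        if rng.random() < p_del:       # delete tok entirely
            continue
        noisy.append(tok)
        tags.append(0)
    return noisy, tags
```

Because only unlabeled text and a random-number generator are needed, such pseudo data can be generated at news-corpus scale before any human-annotated disfluency data is touched.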
Culture Clubs: Processing Speech by Deriving and Exploiting Linguistic Subcultures
Spoken language understanding systems are error-prone for several reasons, including individual speech variability. This is manifested in many ways, among which are differences in pronunciation, lexical inventory, grammar and disfluencies. There is, however, a lot of evidence pointing to stable language usage within subgroups of a language population. We call these subgroups linguistic subcultures.
Two broad problems are defined, and a survey of the work in this space is performed: linguistic subculture detection, commonly performed via Language Identification, Accent Identification or Dialect Identification approaches; and speech and language processing tasks whose performance may improve by modeling each linguistic subculture separately.
The data used in the experiments are drawn from four corpora: Accents of the British Isles (ABI), Intonational Variation in English (IViE), the NIST Language Recognition Evaluation Plan (LRE15) and Switchboard. The speakers in the corpora come from different parts of the United Kingdom and the United States and were provided different stimuli. From the speech samples, two feature sets are used in the experiments.
A number of experiments to determine linguistic subcultures are conducted. The experiments cover several approaches, including traditional machine learning methods shown to be effective for similar tasks in the past, each with multiple feature sets. State-of-the-art deep learning approaches are also applied to this problem.
Two large automatic speech recognition (ASR) experiments are performed against all three corpora: one monolithic experiment for all the speakers in each corpus, and another with the speakers grouped according to their identified linguistic subcultures.
For the discourse markers labeled in the Switchboard corpus, there are some interesting trends when examined through the lens of the speakers in their linguistic subcultures.
Two large dialogue acts experiments are performed against the labeled portion of the Switchboard corpus: one monocultural (or monolithic) experiment for all the speakers, and another with the speakers grouped according to their identified linguistic subcultures.
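The monolithic-versus-grouped experimental design used in both the ASR and dialogue acts experiments can be sketched as follows. This is a hypothetical scaffold, not the thesis code: the function names and the (subculture, features, label) example format are assumptions, and `train_fn` stands in for whatever model-training routine a given task uses.

```python
from collections import defaultdict

def train_per_subculture(examples, train_fn):
    """Train one model per linguistic subculture instead of a single
    monolithic model.

    `examples` is a list of (subculture_id, features, label) triples;
    `train_fn` maps a list of (features, label) pairs to a trained model.
    Returns a dict from subculture_id to that subculture's model.
    """
    groups = defaultdict(list)
    for subculture, feats, label in examples:
        groups[subculture].append((feats, label))
    return {sub: train_fn(data) for sub, data in groups.items()}

def train_monolithic(examples, train_fn):
    """Baseline: pool all speakers into one model, ignoring subcultures."""
    return train_fn([(feats, label) for _, feats, label in examples])
```

Comparing the per-subculture models against the monolithic baseline on held-out data per group is what reveals whether modeling each subculture separately yields the performance gains the thesis investigates.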
We conclude by discussing applications of this work, the changing landscape of natural language processing, and suggestions for future research.