An evaluation of automatic speech recognition in the Spanish version of Windows 7: effects of language variety, speaking style and gender
This study evaluates the Spanish version of the automatic speech recognizer
embedded in what is currently one of the most widespread operating systems,
Microsoft’s Windows 7. Emphasis is placed upon the effects of gender, language
variety and speaking style on system performance. Two groups of subjects were
included in the tests: one composed of 20 speakers of a Peninsular variety
(Spanish as spoken in Catalonia) and the other of 20 speakers of a Latin
American variety (Spanish as spoken in Buenos Aires), with 10 female and 10 male
speakers in each group. The test set consisted of three tasks aimed at
evaluating command recognition as well as automatic dictation. These tasks were
carried out in one-to-one meetings with each of the selected subjects.
Results revealed higher error rates for the group of Latin American speakers
than for the Peninsular speakers. Word error rate (WER) in the dictation tasks
was 28.2% for the former group and 23.1% for the latter. In the command task,
88% of commands were correctly recognized for the Peninsular group, whereas the
group from Buenos Aires obtained a recognition rate of 82.5%. With respect to
speaking style, the system performed worse for speech exhibiting a higher degree
of spontaneity and informality (WER = 30.7%) than for semi-scripted speech on
relatively formal topics (WER = 22.8%). In contrast, results for the speech of
men and women showed only slight differences which in general did not prove
significant: 86.5% of commands were correctly recognized for male speakers,
compared to 84% for female speakers, and WER in the dictation tasks was 24.9%
for the former group and 26.6% for the latter.
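WER, the metric reported above, is the word-level edit distance (substitutions, deletions and insertions) between the recognizer's hypothesis and the reference transcript, divided by the reference length. A minimal sketch of the standard computation (an illustrative helper, not code from the study):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words (Levenshtein).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)
```

A WER of 28.2% thus means that, on average, roughly 28 word-level edits were needed per 100 reference words to turn the system output into the correct transcript.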
Crowd-supervised training of spoken language systems
Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012. Includes bibliographical references (p. 155-166).

Spoken language systems are often deployed with static speech recognizers. Only rarely are parameters in the underlying language, lexical, or acoustic models updated on the fly. In the few instances where parameters are learned in an online fashion, developers traditionally resort to unsupervised training techniques, which are known to be inferior to their supervised counterparts. These realities make the development of spoken language interfaces a difficult and somewhat ad hoc engineering task, since models for each new domain must be built from scratch or adapted from a previous domain.

This thesis explores an alternative approach that makes use of human computation to provide crowd-supervised training for spoken language systems. We explore human-in-the-loop algorithms that leverage the collective intelligence of crowds of non-expert individuals to provide valuable training data at very low cost for actively deployed spoken language systems. We also show that in some domains the crowd can be incentivized to provide training data for free, as a byproduct of interacting with the system itself. Through the automation of crowdsourcing tasks, we construct and demonstrate organic spoken language systems that grow and improve without the aid of an expert.

Techniques that rely on collecting data remotely from non-expert users, however, are subject to the problem of noise. This noise can sometimes be heard in audio collected from poor microphones or muddled acoustic environments. Alternatively, noise can take the form of corrupt data from a worker trying to game the system: for example, a paid worker tasked with transcribing audio may leave transcripts blank in hopes of receiving a speedy payment.
We develop strategies to mitigate the effects of noise in crowd-collected data and analyze their efficacy. This research spans a number of application domains of widely deployed spoken language interfaces, but maintains the common thread of improving the speech recognizer's underlying models with crowd-supervised training algorithms. We experiment with three central components of a speech recognizer: the language model, the lexicon, and the acoustic model. For each component, we demonstrate the utility of a crowd-supervised training framework. For the language model and lexicon, we explicitly show that this framework can be used hands-free, in two organic spoken language systems.

by Ian C. McGraw. Ph.D.
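One common way to mitigate noise of this kind is redundancy: collect several transcripts per utterance and keep only an answer that a majority of workers agree on. This is an illustrative sketch of that general idea, not the specific strategy developed in the thesis; the function name and agreement threshold are assumptions:

```python
from collections import Counter

def majority_transcript(transcripts, min_agreement=0.5):
    """Return the most common transcript if it reaches the agreement
    threshold, otherwise None (flag the utterance for re-collection)."""
    # Light normalization so trivial case/whitespace differences still agree.
    normalized = [" ".join(t.lower().split()) for t in transcripts if t.strip()]
    if not normalized:
        return None  # every worker left the transcript blank
    text, count = Counter(normalized).most_common(1)[0]
    return text if count / len(transcripts) >= min_agreement else None
```

Blank submissions from workers gaming the system are discarded before voting, and utterances with no clear majority are returned as unresolved rather than admitted into the training data.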
Culture Clubs: Processing Speech by Deriving and Exploiting Linguistic Subcultures
Spoken language understanding systems are error-prone for several reasons, including individual speech variability. This is manifested in many ways, among which are differences in pronunciation, lexical inventory, grammar and disfluencies. There is, however, a lot of evidence pointing to stable language usage within subgroups of a language population. We call these subgroups linguistic subcultures.
Two broad problems are defined and the work in this space is surveyed: linguistic subculture detection, commonly performed via Language Identification, Accent Identification or Dialect Identification approaches; and speech and language processing tasks which may see performance gains by modeling each linguistic subculture separately.
The data used in the experiments are drawn from four corpora: Accents of the British Isles (ABI), Intonational Variation in English (IViE), the NIST Language Recognition Evaluation Plan (LRE15) and Switchboard. The speakers in the corpora come from different parts of the United Kingdom and the United States and were provided different stimuli. From the speech samples, two feature sets are used in the experiments.
A number of experiments to determine linguistic subcultures are conducted. The experiments cover several approaches, including the use of traditional machine learning approaches shown to be effective for similar tasks in the past, each with multiple feature sets. State-of-the-art deep learning approaches are also applied to this problem.
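As a concrete illustration of the traditional family of approaches, subculture (dialect) detection can be cast as supervised classification over per-speaker feature vectors. The sketch below uses a simple nearest-centroid classifier; it is a hypothetical example of the technique in general, not one of the models used in the dissertation:

```python
import math

def train_centroids(samples):
    """samples: list of (feature_vector, dialect_label) pairs.
    Returns the per-label mean feature vector (the centroid)."""
    sums, counts = {}, {}
    for vec, label in samples:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {lbl: [v / counts[lbl] for v in acc] for lbl, acc in sums.items()}

def classify(centroids, vec):
    """Assign vec to the dialect whose centroid is nearest (Euclidean)."""
    return min(centroids, key=lambda lbl: math.dist(centroids[lbl], vec))
```

In practice the feature vectors would be acoustic or lexical features extracted from the speech samples, and the stronger traditional models (e.g. discriminative classifiers) as well as the deep learning approaches mentioned above would replace the centroid rule.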
Two large automatic speech recognition (ASR) experiments are performed against all three corpora: one monolithic experiment for all the speakers in each corpus, and another for the speakers grouped according to their identified linguistic subcultures.
For the discourse markers labeled in the Switchboard corpus, there are some interesting trends when examined through the lens of the speakers in their linguistic subcultures.
Two large dialogue act experiments are performed against the labeled portion of the Switchboard corpus: one monocultural (or monolithic) experiment for all the speakers in the corpus, and another for the speakers grouped according to their identified linguistic subcultures.
We conclude by discussing applications of this work, the changing landscape of natural language processing and suggestions for future research.
Essential Speech and Language Technology for Dutch: Results by the STEVIN-programme
Computational Linguistics; Germanic Languages; Artificial Intelligence (incl. Robotics); Computing Methodologies