176 research outputs found
Crowdsourcing correction of speech recognition captioning errors
In this paper, we describe a tool that facilitates crowdsourcing correction of speech recognition captioning errors to provide a sustainable method of making videos accessible to people who find it difficult to understand speech through hearing alone
Concurrent collaborative captioning
Captioned text transcriptions of the spoken word can benefit hearing impaired people, non native speakers, anyone if no audio is available (e.g. watching TV at an airport) and also anyone who needs to review recordings of what has been said (e.g. at lectures, presentations, meetings etc.) In this paper, a tool is described that facilitates concurrent collaborative captioning by correction of speech recognition errors to provide a sustainable method of making videos accessible to people who find it difficult to understand speech through hearing alone. The tool stores all the edits of all the users and uses a matching algorithm to compare users’ edits to check if they are in agreement
Evaluating the Usability of Automatically Generated Captions for People who are Deaf or Hard of Hearing
The accuracy of Automated Speech Recognition (ASR) technology has improved,
but it is still imperfect in many settings. Researchers who evaluate ASR
performance often focus on improving the Word Error Rate (WER) metric, but WER
has been found to have little correlation with human-subject performance on
many applications. We propose a new captioning-focused evaluation metric that
better predicts the impact of ASR recognition errors on the usability of
automatically generated captions for people who are Deaf or Hard of Hearing
(DHH). Through a user study with 30 DHH users, we compared our new metric with
the traditional WER metric on a caption usability evaluation task. In a
side-by-side comparison of pairs of ASR text output (with identical WER), the
texts preferred by our new metric were preferred by DHH participants. Further,
our metric had significantly higher correlation with DHH participants'
subjective scores on the usability of a caption, as compared to the correlation
between WER metric and participant subjective scores. This new metric could be
used to select ASR systems for captioning applications, and it may be a better
metric for ASR researchers to consider when optimizing ASR systems.Comment: 10 pages, 8 figures, published in ACM SIGACCESS Conference on
Computers and Accessibility (ASSETS '17
An introduction to crowdsourcing for language and multimedia technology research
Language and multimedia technology research often relies on
large manually constructed datasets for training or evaluation of algorithms and systems. Constructing these datasets is often expensive with significant challenges in terms of recruitment of personnel to carry out the work. Crowdsourcing methods using scalable pools of workers available on-demand offers a flexible means of rapid low-cost construction of many of these datasets to support existing research requirements and potentially promote new research initiatives that would otherwise not be possible
Crowdsourcing Accessibility: Human-Powered Access Technologies
People with disabilities have always engaged the people around them in order to circumvent inaccessible situations, allowing them to live more independently and get things done in their everyday lives. Increasing connectivity is allowing this approach to be extended to wherever and whenever it is needed. Technology can leverage this human work force to accomplish tasks beyond the capabilities of computers, increasing how accessible the world is for people with disabilities. This article outlines the growth of online human support, outlines a number of projects in this space, and presents a set of challenges and opportunities for this work going forward
Effect of Speech Recognition Errors on Text Understandability for People who are Deaf or Hard of Hearing
Recent advancements in the accuracy of Automated Speech Recognition (ASR) technologies have made them a potential candidate for the task of captioning. However, the presence of errors in the output may present challenges in their use in a fully automatic system. In this research, we are looking more closely into the impact of different inaccurate transcriptions from the ASR system on the understandability of captions for Deaf or Hard-of-Hearing (DHH) individuals. Through a user study with 30 DHH users, we studied the effect of the presence of an error in a text on its understandability for DHH users. We also investigated different prediction models to capture this relation accurately. Among other models, our random forest based model provided the best mean accuracy of 62.04% on the task. Further, we plan to improve this model with more data and use it to advance our investigation on ASR technologies to improve ASR based captioning for DHH users
Linguistic Variation and Anomalies in Comparisons of Human and Machine-Generated Image Captions
Describing the content of a visual image is a fundamental ability of human vision and language systems. Over the past several years, researchers have published on major improvements on image captioning, largely due to the development of deep learning systems trained on large data sets of images and human-written captions. However, these systems have major limitations, and their development has been narrowly focused on improving scores on relatively simple “bag-of-words” metrics. Very little work has examined the overall complex patterns of the language produced by image-captioning systems and how it compares to captions written by humans. In this paper, we closely examine patterns in machine-generated captions and characterize how conventional metrics are inconsistent at penalizing them for nonhuman-like erroneous output. We also hypothesize that the complexity of a visual scene should be reflected in the linguistic variety of the captions and, in testing this hypothesis, we find that human-generated captions have a dramatically greater degree of lexical, syntactic, and semantic variation. These results have important implications for the design of performance metrics, gauging what deep learning captioning systems really understand in images, and the importance of the task of image captioning for cognitive systems researc
Referenceless Quality Estimation for Natural Language Generation
Traditional automatic evaluation measures for natural language generation
(NLG) use costly human-authored references to estimate the quality of a system
output. In this paper, we propose a referenceless quality estimation (QE)
approach based on recurrent neural networks, which predicts a quality score for
a NLG system output by comparing it to the source meaning representation only.
Our method outperforms traditional metrics and a constant baseline in most
respects; we also show that synthetic data helps to increase correlation
results by 21% compared to the base system. Our results are comparable to
results obtained in similar QE tasks despite the more challenging setting.Comment: Accepted as a regular paper to 1st Workshop on Learning to Generate
Natural Language (LGNL), Sydney, 10 August 201
- …