Concurrent collaborative captioning
Captioned text transcriptions of the spoken word can benefit hearing-impaired people, non-native speakers, anyone when no audio is available (e.g. watching TV at an airport), and anyone who needs to review recordings of what has been said (e.g. at lectures, presentations, or meetings). This paper describes a tool that facilitates concurrent collaborative captioning through correction of speech recognition errors, providing a sustainable method of making videos accessible to people who find it difficult to understand speech through hearing alone. The tool stores every user's edits and uses a matching algorithm to compare users' edits and check whether they are in agreement.
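The abstract does not specify the matching algorithm, but the core idea of checking whether users' corrections agree can be sketched as a simple vote over the stored edits. The function below is a hypothetical illustration, not the tool's actual method; the threshold parameter is an assumption.

```python
from collections import Counter

def agreed_correction(edits, min_agreement=2):
    """Return the correction proposed by at least `min_agreement`
    users for one caption segment, or None if users disagree.

    `edits` is a list of corrected strings, one per user. Matching
    here is simple normalized string equality; a real matcher would
    likely be more tolerant (e.g. of punctuation differences).
    """
    counts = Counter(edit.strip().lower() for edit in edits)
    text, votes = counts.most_common(1)[0]
    return text if votes >= min_agreement else None

# Three users correct the same ASR segment; two agree, so their
# version is accepted as the consensus caption.
print(agreed_correction(
    ["speech recognition", "speech recognition", "speech cognition"]))
```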
The relationship of word error rate to document ranking
This paper describes two experiments that examine the relationship between the Word Error Rate (WER) of spoken documents returned by a spoken document retrieval system and their ranking. Previous work has demonstrated that recognition errors do not significantly affect retrieval effectiveness, but whether they adversely affect relevance judgement remains unclear. A user-based experiment measuring the ability to judge relevance from the recognised text presented in a retrieved result list was conducted. The results indicated that users were capable of judging relevance accurately despite transcription errors. This led to an examination of the relationship of WER in retrieved audio documents to their rank position when retrieved for a particular query. Here it was shown that WER was somewhat lower for top-ranked documents than for documents retrieved further down the ranking, indicating a possible explanation for the success of the user experiment.
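For reference, WER is the word-level edit distance between a reference transcript and the recognised text, divided by the number of reference words. A minimal sketch:

```python
def wer(reference, hypothesis):
    """Word Error Rate: (substitutions + insertions + deletions)
    divided by the number of reference words, computed via the
    standard dynamic-programming edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / len(ref)

# Two substitutions out of six reference words: WER = 2/6
print(wer("the cat sat on the mat", "the cat sat in the hat"))
```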
Automatic measurement of propositional idea density from part-of-speech tagging
The original publication is available at www.springerlink.com. The Computerized Propositional Idea Density Rater (CPIDR, pronounced “spider”) is a computer program that determines the propositional idea density (P-density) of an English text automatically on the basis of part-of-speech tags. The key idea is that propositions correspond roughly to verbs, adjectives, adverbs, prepositions, and conjunctions. After tagging the parts of speech using MontyLingua (Liu, 2004), CPIDR applies numerous rules to adjust the count, such as combining auxiliary verbs with the main verb. A “speech mode” is provided in which CPIDR rejects repetitions and a wider range of fillers. CPIDR is a user-friendly Windows .NET application distributed as open-source freeware under the GPL. Tested against human raters, it agrees with the consensus of two human raters better than a team of five raters agree with each other [r(80) = .97 vs. r(10) = .82, respectively].
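The key idea above (propositions ≈ verbs, adjectives, adverbs, prepositions, and conjunctions) can be sketched as a tag-counting function. This is a bare illustration of the counting principle only; it omits CPIDR's adjustment rules (auxiliary combining, speech mode, etc.) and takes pre-tagged input rather than calling MontyLingua.

```python
# Penn Treebank tag prefixes roughly marking propositions:
# verbs (VB*), adjectives (JJ*), adverbs (RB*), prepositions (IN),
# coordinating conjunctions (CC).
PROP_TAGS = ("VB", "JJ", "RB", "IN", "CC")

def p_density(tagged_words):
    """Propositional idea density: propositions per word, given
    (word, Penn-Treebank-tag) pairs from any POS tagger."""
    if not tagged_words:
        return 0.0
    props = sum(1 for _, tag in tagged_words if tag.startswith(PROP_TAGS))
    return props / len(tagged_words)

tagged = [("The", "DT"), ("quick", "JJ"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("dog", "NN")]
# 3 propositions (quick, jumps, over) over 7 words
print(p_density(tagged))
```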
Automated Speech Recognition for Captioned Telephone Conversations
Internet Protocol Captioned Telephone Service is a service for people with hearing loss that allows them to communicate effectively: a human Communications Assistant transcribes the call, and equipment displays the transcription in near real time. The current state of the art in ASR is considered with regard to automating such a service. Recent results on standard tests are examined, and appropriate metrics for ASR performance in captioning are discussed. Possible paths for developing fully automated telephone captioning services are examined and the effort involved is evaluated.
Evaluating the Usability of Automatically Generated Captions for People who are Deaf or Hard of Hearing
The accuracy of Automated Speech Recognition (ASR) technology has improved,
but it is still imperfect in many settings. Researchers who evaluate ASR
performance often focus on improving the Word Error Rate (WER) metric, but WER
has been found to have little correlation with human-subject performance on
many applications. We propose a new captioning-focused evaluation metric that
better predicts the impact of ASR recognition errors on the usability of
automatically generated captions for people who are Deaf or Hard of Hearing
(DHH). Through a user study with 30 DHH users, we compared our new metric with
the traditional WER metric on a caption usability evaluation task. In a
side-by-side comparison of pairs of ASR text output (with identical WER), the
texts preferred by our new metric were preferred by DHH participants. Further,
our metric had significantly higher correlation with DHH participants'
subjective scores on the usability of a caption, as compared to the correlation
between WER metric and participant subjective scores. This new metric could be
used to select ASR systems for captioning applications, and it may be a better
metric for ASR researchers to consider when optimizing ASR systems.
Comment: 10 pages, 8 figures, published in ACM SIGACCESS Conference on Computers and Accessibility (ASSETS '17)
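The abstract does not give the metric's formula, but its premise (identical WER, different usability) implies that errors should not all count equally. The sketch below illustrates that general idea with an importance-weighted error rate; the pre-aligned input and the filler-based importance function are both hypothetical simplifications, not the published metric.

```python
def weighted_error_rate(ref_words, hyp_words, importance):
    """Sketch of an importance-weighted caption error metric: an
    error on a high-importance word costs more than one on a filler.
    Assumes the two word sequences are already aligned position by
    position (a simplification) and that the caller supplies an
    `importance` function mapping a word to a weight.
    """
    total = sum(importance(w) for w in ref_words)
    errors = sum(importance(r)
                 for r, h in zip(ref_words, hyp_words) if r != h)
    return errors / total if total else 0.0

# Hypothetical importance: content words weigh 1.0, fillers 0.1.
FILLERS = {"the", "a", "an", "um", "uh"}
imp = lambda w: 0.1 if w.lower() in FILLERS else 1.0

ref = ["the", "doctor", "said", "take", "the", "medicine"]
hyp = ["a",   "doctor", "said", "make", "the", "medicine"]
# Plain WER would be 2/6; the weighted score penalizes the
# "take"/"make" error far more than the "the"/"a" error.
print(weighted_error_rate(ref, hyp, imp))
```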
TRScore: A Novel GPT-based Readability Scorer for ASR Segmentation and Punctuation model evaluation and selection
Punctuation and Segmentation are key to readability in Automatic Speech
Recognition (ASR), often evaluated using F1 scores that require high-quality
human transcripts and do not reflect readability well. Human evaluation is
expensive, time-consuming, and suffers from large inter-observer variability,
especially in conversational speech devoid of strict grammatical structures.
Large pre-trained models capture a notion of grammatical structure. We present
TRScore, a novel readability measure using the GPT model to evaluate different
segmentation and punctuation systems. We validate our approach with human
experts. Additionally, our approach enables quantitative assessment of text
post-processing techniques such as capitalization, inverse text normalization
(ITN), and disfluency on overall readability, which traditional word error rate
(WER) and slot error rate (SER) metrics fail to capture. TRScore is strongly
correlated to traditional F1 and human readability scores, with Pearson's
correlation coefficients of 0.67 and 0.98, respectively. It also eliminates the
need for human transcriptions for model selection.
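The validation reported above rests on Pearson's correlation between metric scores and human judgments. As a reference point for how such a check is computed, here is a minimal sketch; the score lists are made-up illustrative data, not values from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: a metric's scores vs. human readability ratings
# for five transcripts. A value near 1.0 indicates the metric tracks
# human judgment closely.
metric = [0.2, 0.4, 0.5, 0.7, 0.9]
human  = [1.0, 2.0, 2.5, 4.0, 4.5]
print(round(pearson(metric, human), 3))
```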