Spoken content retrieval: A survey of techniques and technologies
Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR, encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight into how these fields are integrated to support research and development, thus addressing the core challenges of SCR.
Phoneme-based Video Indexing Using Phonetic Disparity Search
This dissertation presents and evaluates an approach to the video indexing problem by investigating a categorization method that transcribes audio content through Automatic Speech Recognition (ASR) combined with Dynamic Contextualization (DC), Phonetic Disparity Search (PDS) and Metaphone indexation. The suggested approach applies genome pattern matching algorithms with computational summarization to build a database infrastructure that provides an indexed summary of the original audio content. PDS complements the contextual phoneme indexing approach by optimizing topic seek performance and accuracy in large video content structures. A prototype was established to translate news broadcast video into text and phonemes automatically by using ASR utterance conversions. Each phonetic utterance extraction was then categorized, converted to Metaphones, and stored in a repository with contextual topical information attached and indexed for posterior search analysis. Following the original design strategy, a custom parallel interface was built to measure the capabilities of dissimilar phonetic queries and provide an interface for result analysis. The postulated solution provides evidence of superior topic matching when compared to traditional word and phoneme search methods. Experimental results demonstrate that PDS can be 3.7% better than the same phoneme query; Metaphone search proved to be 154.6% better than the same phoneme seek and 68.1% better than the equivalent word search.
Improving Searchability of Automatically Transcribed Lectures Through Dynamic Language Modelling
Recording university lectures through lecture capture systems is increasingly common. However, a single continuous audio recording is often unhelpful for users, who may wish to navigate quickly to a particular part of a lecture, or locate a specific lecture within a set of recordings. A transcript of the recording can enable faster navigation and searching. Automatic speech recognition (ASR) technologies may be used to create automated transcripts, to avoid the significant time and cost involved in manual transcription. Low accuracy of ASR-generated transcripts may however limit their usefulness. In particular, ASR systems optimized for general speech recognition may not recognize the many technical or discipline-specific words occurring in university lectures. To improve the usefulness of ASR transcripts for the purposes of information retrieval (search) and navigating within recordings, the lexicon and language model used by the ASR engine may be dynamically adapted for the topic of each lecture. A prototype is presented which uses the English Wikipedia as a semantically dense, large language corpus to generate a custom lexicon and language model for each lecture from a small set of keywords. Two strategies for extracting a topic-specific subset of Wikipedia articles are investigated: a naïve crawler which follows all article links from a set of seed articles produced by a Wikipedia search from the initial keywords, and a refinement which follows only links to articles sufficiently similar to the parent article. Pair-wise article similarity is computed from a pre-computed vector space model of Wikipedia article term scores generated using latent semantic indexing. The CMU Sphinx4 ASR engine is used to generate transcripts from thirteen recorded lectures from Open Yale Courses, using the English HUB4 language model as a reference and the two topic-specific language models generated for each lecture from Wikipedia.
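The two crawling strategies described in this abstract can be illustrated with a small sketch. It is not the author's implementation: the link graph, the two-dimensional article vectors (standing in for latent semantic indexing scores), the 0.8 threshold, the depth limit and all function names are assumptions made for illustration only.

```python
import math
from collections import deque

# Hypothetical toy data: a tiny link graph and hand-made article vectors.
# In the abstract the vectors come from latent semantic indexing of Wikipedia.
LINKS = {
    "Thermodynamics": ["Entropy", "Steam engine", "France"],
    "Entropy": ["Information theory"],
    "France": ["Paris"],
}
VECTORS = {
    "Thermodynamics": (0.9, 0.1),
    "Entropy": (0.8, 0.2),
    "Steam engine": (0.7, 0.3),
    "France": (0.1, 0.9),
    "Information theory": (0.6, 0.4),
    "Paris": (0.05, 0.95),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def crawl(seeds, links, vectors, threshold=None, max_depth=2):
    """Breadth-first crawl from the seed articles.

    With threshold=None every link is followed (the naive strategy).
    Otherwise a link is followed only if the child article is at least
    `threshold` similar to its parent (the refined strategy).
    """
    visited = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        article, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for child in links.get(article, []):
            if child in visited:
                continue
            if threshold is not None:
                if cosine(vectors[article], vectors[child]) < threshold:
                    continue  # child is off-topic relative to its parent
            visited.add(child)
            queue.append((child, depth + 1))
    return visited


naive = crawl(["Thermodynamics"], LINKS, VECTORS)
refined = crawl(["Thermodynamics"], LINKS, VECTORS, threshold=0.8)
print(sorted(naive))
print(sorted(refined))
```

On this toy graph the naive crawl collects all six articles, while the refined crawl drops "France" and, as a consequence, never reaches "Paris".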
Automatic transcription and phonetic labelling of dyslexic children's reading in Bahasa Melayu
Automatic speech recognition (ASR) is potentially helpful for children who suffer from dyslexia. However, the highly phonetically similar errors in dyslexic children's reading affect the accuracy of ASR. This study therefore aims to evaluate whether ASR achieves acceptable accuracy when it uses automatic transcription and phonetic labelling of dyslexic children's reading in Bahasa Melayu (BM). Three objectives were set: first, to produce manual transcription and phonetic labelling; second, to construct automatic transcription and phonetic labelling using forced alignment; and third, to compare the accuracy obtained with automatic transcription and phonetic labelling against that obtained with manual transcription and phonetic labelling. The methods used include manual speech labelling and segmentation, forced alignment, and Hidden Markov Model (HMM) and Artificial Neural Network (ANN) training; ASR accuracy was measured using Word Error Rate (WER) and False Alarm Rate (FAR). A total of 585 speech files were used for manual transcription, forced alignment and the training experiment. The ASR engine using automatic transcription and phonetic labelling obtained an optimum accuracy of 76.04%, with a WER as low as 23.96% and a FAR of 17.9%. These results are almost identical to those of the ASR engine using manual transcription, namely 76.26% accuracy, a WER as low as 23.97% and a FAR of 17.9%. In conclusion, the accuracy of automatic transcription and phonetic labelling is acceptable for helping dyslexic children learn using ASR in BM.
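For readers unfamiliar with the WER metric cited in this abstract, the following is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words. The function name and the example sentences are illustrative assumptions and are not taken from the study. The first reported pair of figures (76.04% accuracy, 23.96% WER) sums to 100%, which is consistent with accuracy being reported as 1 − WER.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one of four reference words is dropped.
wer = word_error_rate("saya suka makan nasi", "saya suka nasi")
print(f"WER = {wer:.2%}, word accuracy = {1 - wer:.2%}")
```

Because insertions are counted, WER can exceed 100% when the hypothesis is much longer than the reference, so accuracy computed as 1 − WER can become negative in such cases.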