7 research outputs found

    Spoken language processing: piecing together the puzzle

    Attempting to understand the fundamental mechanisms underlying spoken language processing, whether it is viewed as behaviour exhibited by human beings or as a faculty simulated by machines, is one of the greatest scientific challenges of our age. Despite tremendous achievements over the past 50 or so years, there is still a long way to go before we reach a comprehensive explanation of human spoken language behaviour and can create a technology with performance approaching or exceeding that of a human being. It is argued that progress is hampered by the fragmentation of the field across many different disciplines, coupled with a failure to create an integrated view of the fundamental mechanisms that underpin one organism's ability to communicate with another. This paper weaves together accounts from a wide variety of different disciplines concerned with the behaviour of living systems - many of them outside the normal realms of spoken language - and compiles them into a new model: PRESENCE (PREdictive SENsorimotor Control and Emulation). It is hoped that the results of this research will provide a sufficient glimpse into the future to give breath to a new generation of research into spoken language processing by mind or machine.

    Evaluation of preprocessors for neural network speaker verification

    Spoken content retrieval: A survey of techniques and technologies

    Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR, encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight into how these fields are integrated to support research and development, thus addressing the core challenges of SCR.
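
    A minimal sketch of the ASR-plus-IR combination the survey describes: spoken documents are first turned into text by an ASR system (represented here by ready-made transcript strings), then indexed and ranked with standard IR machinery. Everything below - the function names, the TF-IDF weighting, the toy data - is illustrative and not taken from the survey.

        import math
        from collections import Counter

        def tf_idf_index(transcripts):
            """Build per-document TF-IDF vectors over whitespace tokens."""
            docs = [Counter(t.lower().split()) for t in transcripts]
            n = len(docs)
            df = Counter(term for d in docs for term in d)
            idf = {t: math.log(n / df[t]) for t in df}
            return [{t: c * idf[t] for t, c in d.items()} for d in docs], idf

        def search(query, vectors, idf):
            """Rank transcript indices by cosine similarity to the query."""
            qv = {t: c * idf.get(t, 0.0) for t, c in Counter(query.lower().split()).items()}
            qnorm = math.sqrt(sum(w * w for w in qv.values())) or 1.0
            def cosine(v):
                dot = sum(w * v.get(t, 0.0) for t, w in qv.items())
                vnorm = math.sqrt(sum(w * w for w in v.values())) or 1.0
                return dot / (vnorm * qnorm)
            return sorted(range(len(vectors)), key=lambda i: cosine(vectors[i]), reverse=True)

        # Toy spoken "collection": two ASR transcripts.
        transcripts = ["the lecture covers hidden markov models",
                       "today we discuss information retrieval and indexing"]
        vectors, idf = tf_idf_index(transcripts)
        print(search("markov models", vectors, idf))  # -> [0, 1]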

    Multi-tape finite-state transducer for asynchronous multi-stream pattern recognition with application to speech

    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2006. Includes bibliographical references (p. 119-127). In this thesis, we have focused on improving the acoustic modeling of speech recognition systems to increase the overall recognition performance. We formulate a novel multi-stream speech recognition framework using multi-tape finite-state transducers (FSTs). The multi-dimensional input labels of the multi-tape FST transitions specify the acoustic models to be used for the individual feature streams. An additional auxiliary field is used to model the degree of asynchrony among the feature streams. The individual feature streams can be linear sequences, such as fixed-frame-rate features in traditional hidden Markov model (HMM) systems, or directed acyclic graphs, such as segment features in segment-based systems. In a single-tape mode, this multi-stream framework also unifies the frame-based HMM and the segment-based approach. Systems using the multi-stream speech recognition framework were evaluated on an audio-only and an audio-visual speech recognition task. On the Wall Street Journal speech recognition task, the multi-stream framework combined a traditional frame-based HMM with segment-based landmark features. The system achieved a word error rate (WER) of 8.0%, improving on both the 8.8% WER of the baseline HMM-only system and the 10.4% WER of the landmark-only system. On the AV-TIMIT audio-visual speech recognition task, the multi-stream framework combined a landmark model, a segment model, and a visual HMM. The system achieved a WER of 0.9%, which also improved on the baseline systems. These results demonstrate the feasibility and versatility of the multi-stream speech recognition framework. By Han Shu, Ph.D.
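
    The transition structure at the heart of this framework lends itself to a small sketch. The dataclass and toy traversal below illustrate the idea only - one acoustic-model label per stream on each transition, plus an auxiliary field bounding inter-stream asynchrony - and are not the thesis implementation; every name in them is invented for the example.

        from dataclasses import dataclass

        @dataclass(frozen=True)
        class MultiTapeArc:
            src: int               # source state
            dst: int               # destination state
            labels: tuple          # one label per stream; None = stream does not advance
            max_asynchrony: int    # auxiliary field: allowed inter-stream time skew
            weight: float = 0.0

        def accepts(arcs, start, finals, streams):
            """Greedy toy traversal: consume one label per advancing stream."""
            state, pos = start, [0] * len(streams)
            while state not in finals:
                for a in (x for x in arcs if x.src == state):
                    ok = all(l is None or (pos[i] < len(streams[i]) and streams[i][pos[i]] == l)
                             for i, l in enumerate(a.labels))
                    if ok and max(pos) - min(pos) <= a.max_asynchrony:
                        pos = [p + (0 if l is None else 1) for p, l in zip(pos, a.labels)]
                        state = a.dst
                        break
                else:
                    return False
            return all(p == len(s) for p, s in zip(pos, streams))

        # Two streams; the second lags by one arc before catching up.
        arcs = [MultiTapeArc(0, 1, ("a", None), 1), MultiTapeArc(1, 2, ("b", "a"), 1),
                MultiTapeArc(2, 3, (None, "b"), 1)]
        print(accepts(arcs, 0, {3}, [["a", "b"], ["a", "b"]]))  # -> True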

    Improving searchability of automatically transcribed lectures through dynamic language modelling

    Recording university lectures through lecture capture systems is increasingly common. However, a single continuous audio recording is often unhelpful for users, who may wish to navigate quickly to a particular part of a lecture, or locate a specific lecture within a set of recordings. A transcript of the recording can enable faster navigation and searching. Automatic speech recognition (ASR) technologies may be used to create automated transcripts, to avoid the significant time and cost involved in manual transcription. Low accuracy of ASR-generated transcripts may however limit their usefulness. In particular, ASR systems optimized for general speech recognition may not recognize the many technical or discipline-specific words occurring in university lectures. To improve the usefulness of ASR transcripts for the purposes of information retrieval (search) and navigating within recordings, the lexicon and language model used by the ASR engine may be dynamically adapted for the topic of each lecture. A prototype is presented which uses the English Wikipedia as a semantically dense, large language corpus to generate a custom lexicon and language model for each lecture from a small set of keywords. Two strategies for extracting a topic-specific subset of Wikipedia articles are investigated: a naïve crawler which follows all article links from a set of seed articles produced by a Wikipedia search from the initial keywords, and a refinement which follows only links to articles sufficiently similar to the parent article. Pair-wise article similarity is computed from a pre-computed vector space model of Wikipedia article term scores generated using latent semantic indexing. The CMU Sphinx4 ASR engine is used to generate transcripts from thirteen recorded lectures from Open Yale Courses, using the English HUB4 language model as a reference and the two topic-specific language models generated for each lecture from Wikipedia. Three standard metrics – Perplexity, Word Error Rate and Word Correct Rate – are used to evaluate the extent to which the adapted language models improve the searchability of the resulting transcripts, and in particular improve the recognition of specialist words. Ranked Word Correct Rate is proposed as a new metric better aligned with the goals of improving transcript searchability and specialist word recognition. Analysis of recognition performance shows that the language models derived using the similarity-based Wikipedia crawler outperform models created using the naïve crawler, and that transcripts using similarity-based language models have better perplexity and Ranked Word Correct Rate scores than those created using the HUB4 language model, but worse Word Error Rates. It is concluded that English Wikipedia may successfully be used as a language resource for unsupervised topic adaptation of language models to improve recognition performance for better searchability of lecture recording transcripts, although possibly at the expense of other attributes such as readability.
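
    The similarity-gated crawl is the part of this pipeline that fits in a compact sketch. In the code below, get_links(title) and lsi_vector(title) are assumed stand-ins for a Wikipedia link extractor and the pre-computed LSI article vectors the abstract mentions; the threshold and limit values are likewise arbitrary placeholders.

        from collections import deque
        import numpy as np

        def cosine(u, v):
            denom = np.linalg.norm(u) * np.linalg.norm(v)
            return float(u @ v / denom) if denom else 0.0

        def similarity_crawl(seeds, get_links, lsi_vector, threshold=0.5, limit=1000):
            """Breadth-first crawl that follows a link only when the target
            article is sufficiently similar, in LSI space, to its parent."""
            collected, frontier = set(seeds), deque(seeds)
            while frontier and len(collected) < limit:
                parent = frontier.popleft()
                pv = lsi_vector(parent)
                for child in get_links(parent):
                    if child not in collected and cosine(pv, lsi_vector(child)) >= threshold:
                        collected.add(child)
                        frontier.append(child)
            return collected  # articles from which the lexicon and LM are then built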

    Speech Recognition

    Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation of speech signals and methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other applications that build on the output of automatic speech recognition: speaker identification and tracking, prosody modeling in emotion-detection systems, and speech processing systems able to operate in real-world environments such as mobile communication services and smart homes.
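
    Of the component techniques this blurb lists, the search step is the easiest to show in a few lines. The sketch below is a generic Viterbi recursion over per-frame acoustic log-likelihoods, offered only as a minimal illustration of "searching the hypothesis space"; real recognizers add pruning, a pronunciation lexicon, and a language model on top, and the toy numbers are invented.

        import numpy as np

        def viterbi(log_obs, log_trans, log_init):
            """Best state path. log_obs: (T, S) frame log-likelihoods;
            log_trans: (S, S) transition log-probs; log_init: (S,)."""
            T, S = log_obs.shape
            delta = log_init + log_obs[0]        # best score ending in each state
            back = np.zeros((T, S), dtype=int)   # backpointers
            for t in range(1, T):
                scores = delta[:, None] + log_trans   # scores[i, j]: from i to j
                back[t] = scores.argmax(axis=0)
                delta = scores.max(axis=0) + log_obs[t]
            path = [int(delta.argmax())]
            for t in range(T - 1, 0, -1):        # trace back the best path
                path.append(int(back[t, path[-1]]))
            return path[::-1]

        # Two states, three frames: the observations pull the path from 0 to 1.
        log_obs = np.log(np.array([[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]))
        log_trans = np.log(np.array([[0.7, 0.3], [0.3, 0.7]]))
        print(viterbi(log_obs, log_trans, np.log(np.array([0.5, 0.5]))))  # -> [0, 0, 1]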

    Predicting the performance of a speech recognition task

    Yau Pui Yuk. Thesis (M.Phil.)--Chinese University of Hong Kong, 2002. Includes bibliographical references (leaves 147-152). Abstracts in English and Chinese. Contents:
    Chapter 1: Introduction
        1.1 Overview
        1.2 Speech Recognition
            1.2.1 How Speech Recognition Works
            1.2.2 Types of Speech Recognition Tasks
            1.2.3 Variabilities in Speech - a Challenge for Speech Recognition
        1.3 Performance Prediction of Speech Recognition Task
        1.4 Thesis Goals
        1.5 Thesis Organization
    Chapter 2: Background
        2.1 The Acoustic-phonetic Approach
            2.1.1 Prediction based on the Degree of Mismatch
            2.1.2 Prediction based on Acoustic Similarity
            2.1.3 Prediction based on Between-Word Distance
        2.2 The Lexical Approach
            2.2.1 Perplexity
            2.2.2 SMR-perplexity
        2.3 The Combined Acoustic-phonetic and Lexical Approach
            2.3.1 Speech Decoder Entropy (SDE)
            2.3.2 Ideal Speech Decoding Difficulty (ISDD)
        2.4 Chapter Summary
    Chapter 3: Components for Predicting the Performance of Speech Recognition Task
        3.1 Components of Speech Recognizer
        3.2 Word Similarity Measure
            3.2.1 Universal Phoneme Symbol (UPS)
            3.2.2 Definition of Phonetic Distance
            3.2.3 Definition of Word Pair Phonetic Distance
            3.2.4 Definition of Word Similarity Measure
        3.3 Word Occurrence Measure
        3.4 Chapter Summary
    Chapter 4: Formulation of Recognition Error Predictive Index (REPI)
        4.1 Formulation of Recognition Error Predictive Index (REPI)
        4.2 Characteristics of Recognition Error Predictive Index (REPI)
            4.2.1 Weakness of Ideal Speech Decoding Difficulty (ISDD)
            4.2.2 Advantages of Recognition Error Predictive Index (REPI)
        4.3 Chapter Summary
    Chapter 5: Experimental Design and Setup
        5.1 Objectives
        5.2 Experiments Preparation
            5.2.1 Speech Corpus and Speech Recognizers
            5.2.2 Speech Recognition Tasks
            5.2.3 Evaluation Criterion
        5.3 Experiment Categories and their Setup
            5.3.1 Experiment Category 1 - Investigating and comparing the overall prediction performance of the two predictive indices
            5.3.2 Experiment Category 2 - Comparing the applicability of the word similarity measures of the two predictive indices on predicting the recognition performance
            5.3.3 Experiment Category 3 - Comparing the applicability of the formulation method of the two predictive indices on predicting the recognition performance
            5.3.4 Experiment Category 4 - Comparing the performance of different phonetic distance definitions
        5.4 Chapter Summary
    Chapter 6: Experimental Results and Analysis
        6.1 Experimental Results and Analysis (Experiment Categories 1-4)
        6.2 Experimental Summary
        6.3 Chapter Summary
    Chapter 7: Conclusions
        7.1 Contributions
        7.2 Future Directions
    Bibliography
    Appendix A: Table of Universal Phoneme Symbol
    Appendix B: Vocabulary Lists
    Appendix C: Experimental Results of Two-words Speech Recognition Tasks
    Appendix D: Experimental Results of Three-words Speech Recognition Tasks
    Appendix E: Significance Testing
        E.1 Procedures of Significance Testing
        E.2 Results of the Significance Testing (Experiment Categories 1-4)
    Appendix F: Linear Regression Models
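
    The thesis builds its word similarity measure on a purpose-defined phonetic distance (Chapter 3 of the contents above). As a rough stand-in for that idea, the sketch below scores word-pair confusability with plain Levenshtein distance over phoneme sequences, normalized to a 0-1 similarity; the phoneme transcriptions and the normalization are assumptions for illustration, not the thesis definitions.

        def edit_distance(a, b):
            """Levenshtein distance between two phoneme sequences."""
            prev = list(range(len(b) + 1))
            for i, x in enumerate(a, 1):
                cur = [i]
                for j, y in enumerate(b, 1):
                    cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (x != y)))
                prev = cur
            return prev[-1]

        def word_similarity(phones_a, phones_b):
            """1.0 for identical pronunciations, approaching 0.0 for disjoint ones."""
            d = edit_distance(phones_a, phones_b)
            return 1.0 - d / max(len(phones_a), len(phones_b))

        # "two" /t uw/ vs. "to" /t ax/: one substitution over length 2 -> 0.5
        print(word_similarity(["t", "uw"], ["t", "ax"]))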