Parsing Speech: A Neural Approach to Integrating Lexical and Acoustic-Prosodic Information

Author: Bansal Mohit
Gimpel Kevin
Livescu Karen
Ostendorf Mari
Toshniwal Shubham
Tran Trang
Publication date: 01/01/2018

In conversational speech, the acoustic signal provides cues that help listeners disambiguate difficult parses. For automatically parsing spoken utterances, we introduce a model that integrates transcribed text and acoustic-prosodic features using a convolutional neural network over energy and pitch trajectories coupled with an attention-based recurrent neural network that accepts text and prosodic features. We find that different types of acoustic-prosodic features are individually helpful, and together give statistically significant improvements in parse and disfluency detection F1 scores over a strong text-only baseline. For this study with known sentence boundaries, error analyses show that the main benefit of acoustic-prosodic features is in sentences with disfluencies, attachment decisions are most improved, and transcription errors obscure gains from prosody.
Comment: Accepted in NAACL HLT 201

arXiv.org e-Print Archive

Crossref

Tied Probabilistic Linear Discriminant Analysis for Speech Recognition

Author: Lu Liang
Renals Steve
Publication date: 30/11/2014

Edinburgh Research Explorer

SVitchboard 1: Small vocabulary tasks from Switchboard 1

Author: Bartels Chris
Bilmes Jeff
King Simon
Publication date: 01/01/2005

We present a conversational telephone speech data set designed to support research on novel acoustic models. Small vocabulary tasks from 10 words up to 500 words are defined using subsets of the Switchboard-1 corpus; each task has a completely closed vocabulary (an OOV rate of 0%). We justify the need for these tasks, describe the algorithm for selecting them from a large corpus, give a statistical analysis of the data and present baseline whole-word hidden Markov model recognition results. The goal of the paper is to define a common data set and to encourage other researchers to use it.

Edinburgh Research Archive

Edinburgh Research Explorer

Effects of Duration, Locality, and Surprisal in Speech Disfluency Prediction in English Spontaneous Speech

Author: Agarwal Sumeet
Dammalapati Samvit
Rajkumar Rajakrishnan
Publication venue: ScholarWorks@UMass Amherst
Publication date: 01/01/2021

This study examines the role of two influential theories of language processing, Surprisal Theory and Dependency Locality Theory (DLT), in predicting disfluencies (fillers and reparandums) in the Switchboard corpus of English conversational speech. Using Generalized Linear Mixed Models for this task, we incorporate syntactic factors (DLT-inspired costs and syntactic surprisal) in addition to lexical surprisal and duration, thus going beyond the local lexical frequency and predictability used in previous work on modelling word durations in Switchboard speech. Our results indicate that compared to fluent words, words preceding disfluencies tend to have lower lexical surprisal (hence higher activation levels) and lower syntactic complexity (low DLT costs and low syntactic surprisal except for reparandums). Disfluencies tend to occur before upcoming difficulties, i.e., high lexical surprisal words (low activation levels) with high syntactic complexity (high DLT costs and high syntactic surprisal). Further, we see that reparandums behave almost similarly to disfluent fillers with differences possibly arising due to effects being present in the word choice of the reparandum, i.e., in the disfluency itself rather than surrounding it. Moreover, words preceding disfluencies tend to be function words and have longer durations compared to their fluent counterparts, and word duration is a very effective predictor of disfluencies. Overall, speakers may be leveraging the differences in access between content and function words during planning as part of a mechanism to adapt for disfluencies while coordinating between planning and articulation.

ScholarWorks@UMass Amherst

Manual transcription of conversational speech at the articulatory feature level

Author: Bezman Ari
Borges Nash
Chi Xuemin
Frankel Joe
King Simon
Lavoie Lisa
Livescu Karen
Magimai-Doss Mathew
Yung Lisa
Çetin Ozgur
Publication date: 01/01/2007

Although much is known about how speech is produced, and research into speech production has resulted in measured articulatory data, feature systems of different kinds and numerous models, speech production knowledge is almost totally ignored in current mainstream approaches to automatic speech recognition. Representations of speech production allow simple explanations for many phenomena observed in speech which cannot be easily analyzed from either acoustic signal or phonetic transcription alone. In this article, we provide a survey of a growing body of work in which such representations are used to improve automatic speech recognition.

CiteSeerX

Crossref

Edinburgh Research Archive

Edinburgh Research Explorer

Articulatory feature-based methods for acoustic and audio-visual speech recognition: Summary from the 2006 JHU Summer Workshop.

Author: Bartels Chris
Bezman Ari
Borges Nash
Dawson-Haggerty Stephen
Frankel Joe
Hasegawa-Johnson Mark
Kantor Arthur
King Simon
Lal Partha
Livescu Karen
Magimai-Doss Mathew
Saenko Kate
Woods Bronwyn
Yung Lisa
Çetin Ozgur
Publication date: 01/01/2007

We report on investigations, conducted at the 2006 Johns Hopkins Workshop, into the use of articulatory features (AFs) for observation and pronunciation models in speech recognition. In the area of observation modeling, we use the outputs of AF classifiers both directly, in an extension of hybrid HMM/neural network models, and as part of the observation vector, an extension of the tandem approach. In the area of pronunciation modeling, we investigate a model having multiple streams of AF states with soft synchrony constraints, for both audio-only and audio-visual recognition. The models are implemented as dynamic Bayesian networks, and tested on tasks from the Small-Vocabulary Switchboard (SVitchboard) corpus and the CUAVE audio-visual digits corpus. Finally, we analyze AF classification and forced alignment using a newly collected set of feature-level manual transcriptions.

CiteSeerX

Crossref

Edinburgh Research Archive

Edinburgh Research Explorer