    ARTICULATORY INFORMATION FOR ROBUST SPEECH RECOGNITION

    Current Automatic Speech Recognition (ASR) systems fail to perform nearly as well as human listeners because they lack robustness against speech variability and noise contamination. The goal of this dissertation is to investigate these critical robustness issues, put forth different ways to address them, and finally present an ASR architecture based upon these robustness criteria. Acoustic variations adversely affect the performance of current phone-based ASR systems, in which speech is modeled as `beads-on-a-string', where the beads are the individual phone units. While phone units are distinct in the cognitive domain, they vary in the physical domain, and their variation arises from a combination of factors including speech style and speaking rate; this phenomenon is commonly known as `coarticulation'. Traditional ASR systems address such coarticulatory variations by using contextualized phone units such as triphones. Articulatory phonology accounts for coarticulatory variations by modeling speech as a constellation of constricting actions known as articulatory gestures. In such a framework, speech variations such as coarticulation and lenition are accounted for by gestural overlap in time and gestural reduction in space. To realize a gesture-based ASR system, articulatory gestures have to be inferred from the acoustic signal. At the initial stage of this research, a study was performed using synthetically generated speech to obtain a proof of concept that articulatory gestures can indeed be recognized from the speech signal. It was observed that having vocal tract constriction trajectories (TVs) as an intermediate representation facilitated the gesture recognition task. Presently, no natural speech database contains articulatory gesture annotations; hence, an automated iterative time-warping architecture is proposed that can annotate any natural speech database with articulatory gestures and TVs. Two natural speech databases, X-ray microbeam and Aurora-2, were annotated; the former was used to train a TV estimator and the latter was used to train a Dynamic Bayesian Network (DBN) based ASR architecture. The DBN architecture used two sets of observations: (a) acoustic features in the form of mel-frequency cepstral coefficients (MFCCs) and (b) TVs estimated from the acoustic speech signal. In this setup the articulatory gestures were modeled as hidden random variables, eliminating the need for explicit gesture recognition. Word recognition results using the DBN architecture indicate that articulatory representations not only help to account for coarticulatory variations but also significantly improve the noise robustness of the ASR system.
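
    The dissertation abstract gives no code, but the annotation step can be illustrated with a small, hypothetical sketch in Python: a synthetic utterance whose gestural boundaries are known is time-aligned to a natural utterance by dynamic time warping (DTW) over acoustic feature frames, and the known boundary frames are carried across the warping path. The feature dimensionality, frame indices, and function names below are illustrative assumptions, not the architecture described in the thesis (which iterates this alignment).

        import numpy as np

        def dtw_path(A, B):
            """Plain DTW over two (frames x dims) feature matrices; returns the warping path."""
            n, m = len(A), len(B)
            cost = np.full((n + 1, m + 1), np.inf)
            cost[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    d = np.linalg.norm(A[i - 1] - B[j - 1])
                    cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
            path, i, j = [], n, m            # backtrack from the end to recover the alignment
            while i > 0 and j > 0:
                path.append((i - 1, j - 1))
                step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
                if step == 0:
                    i, j = i - 1, j - 1
                elif step == 1:
                    i -= 1
                else:
                    j -= 1
            return path[::-1]

        def map_gesture_frames(path, synth_frames):
            """Map gesture boundary frames annotated on the synthetic utterance onto natural frames."""
            synth_to_nat = {}
            for si, ni in path:
                synth_to_nat.setdefault(si, ni)   # first natural frame matched to each synthetic frame
            return [synth_to_nat.get(f) for f in synth_frames]

        # Toy usage with random MFCC-like features (13 coefficients per frame assumed).
        rng = np.random.default_rng(0)
        synth_feats = rng.normal(size=(120, 13))
        nat_feats = rng.normal(size=(150, 13))
        path = dtw_path(synth_feats, nat_feats)
        print(map_gesture_frames(path, [10, 45, 90]))   # natural-frame indices of three boundaries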

    Articulatory Copy Synthesis Based on the Speech Synthesizer VocalTractLab

    Articulatory copy synthesis (ACS), a subarea of speech inversion, refers to the reproduction of natural utterances in terms of both the physiological articulatory processes and their corresponding acoustic results. This thesis proposes two novel methods for the ACS of human speech using the articulatory speech synthesizer VocalTractLab (VTL) to address or mitigate the existing problems of speech inversion, such as non-unique mapping, acoustic variation among different speakers, and the time-consuming nature of the process. The first method involved finding appropriate VTL gestural scores for given natural utterances using a genetic algorithm. It consisted of two steps: gestural score initialization and optimization. In the first step, gestural scores were initialized from the given acoustic signals using speech recognition, grapheme-to-phoneme (G2P) conversion, and a VTL rule-based method for converting phoneme sequences to gestural scores. In the second step, the initial gestural scores were optimized by a genetic algorithm via an analysis-by-synthesis (ABS) procedure that minimized the cosine distance between the acoustic features of the synthetic and natural utterances. The articulatory parameters were also regularized during the optimization process to restrict them to reasonable values. The second method was based on long short-term memory (LSTM) and convolutional neural networks, which were responsible for capturing the temporal dependence and the spatial structure of the acoustic features, respectively. Neural network regression models were trained that used acoustic features as inputs and produced articulatory trajectories as outputs. In addition, to cover as much of the articulatory and acoustic space as possible, the training samples were augmented by manipulating the phonation type, speaking effort, and vocal tract length of the synthetic utterances. Furthermore, two regularization methods were proposed: one based on the smoothness loss of articulatory trajectories and another based on the acoustic loss between original and predicted acoustic features. The best-performing genetic algorithm and convolutional LSTM systems (evaluated in terms of the difference between the estimated and reference VTL articulatory parameters) obtained average correlation coefficients of 0.985 and 0.983 for speaker-dependent utterances, respectively, and their reproduced speech achieved recognition accuracies of 86.25% and 64.69% for speaker-independent utterances of German words, respectively. When applied to German sentence utterances, as well as English and Mandarin Chinese word utterances, the neural-network-based ACS systems achieved recognition accuracies of 73.88%, 52.92%, and 52.41%, respectively. The results showed that both methods reproduced not only the articulatory processes but also the acoustic signals of the reference utterances. Moreover, the regularization methods led to more physiologically plausible articulatory processes and to estimated articulatory trajectories that better matched the articulation preferred by VTL, thus reproducing more natural and intelligible speech. This study also found that convolutional layers, when used in conjunction with batch normalization layers, automatically learned more distinctive features from log power spectrograms. Furthermore, the neural-network-based ACS systems trained on German data could be generalized to the utterances of other languages.
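
    As a rough illustration of the second method's general shape, the sketch below implements a convolutional-plus-LSTM regression from acoustic features to articulatory trajectories with a smoothness penalty, written in PyTorch. The layer sizes, feature dimensions, and the weight lambda_smooth are assumptions made for the example; the thesis's actual network configuration and loss weighting are not reproduced here.

        import torch
        import torch.nn as nn

        class ConvLSTMRegressor(nn.Module):
            def __init__(self, n_acoustic=40, n_articulatory=20, hidden=128):
                super().__init__()
                # Conv1d + BatchNorm capture local spectral structure across time.
                self.conv = nn.Sequential(
                    nn.Conv1d(n_acoustic, 64, kernel_size=5, padding=2),
                    nn.BatchNorm1d(64),
                    nn.ReLU(),
                )
                # A bidirectional LSTM captures longer-range temporal dependence.
                self.lstm = nn.LSTM(64, hidden, batch_first=True, bidirectional=True)
                self.out = nn.Linear(2 * hidden, n_articulatory)

            def forward(self, x):                    # x: (batch, time, n_acoustic)
                h = self.conv(x.transpose(1, 2))     # -> (batch, 64, time)
                h, _ = self.lstm(h.transpose(1, 2))  # -> (batch, time, 2 * hidden)
                return self.out(h)                   # -> (batch, time, n_articulatory)

        def smoothness_loss(pred):
            """Penalize large frame-to-frame jumps in the predicted articulatory trajectories."""
            return ((pred[:, 1:] - pred[:, :-1]) ** 2).mean()

        # Toy batch: 8 utterances, 200 frames, 40 acoustic and 20 articulatory dimensions.
        model = ConvLSTMRegressor()
        acoustic = torch.randn(8, 200, 40)
        target = torch.randn(8, 200, 20)
        pred = model(acoustic)
        lambda_smooth = 0.1                          # assumed regularization weight
        loss = nn.functional.mse_loss(pred, target) + lambda_smooth * smoothness_loss(pred)
        loss.backward()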

    Generating gestural timing from EMA data using articulatory resynthesis

    As part of ongoing work to integrate an articulatory synthesizer into a modular TTS platform, a method is presented that allows gestural timings to be generated automatically from EMA data. Further work is outlined that will adapt the vocal tract model and phoneset to English using new articulatory data and will use statistical trajectory models.
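
    The abstract does not detail the extraction step, but one common way to derive gestural timing from EMA sensor trajectories is to locate landmarks on the tangential-velocity profile; the hypothetical sketch below marks a gesture's onset and offset where the sensor speed first and last exceeds a fraction of its peak. The 20% threshold, sampling rate, and toy trajectory are assumptions for illustration only, not the paper's method.

        import numpy as np

        def gesture_landmarks(xy, fs=200.0, threshold=0.2):
            """xy: (frames, 2) EMA sensor positions in mm; fs: sampling rate in Hz."""
            vel = np.gradient(xy, axis=0) * fs                      # mm/s per dimension
            speed = np.linalg.norm(vel, axis=1)                     # tangential velocity
            above = speed >= threshold * speed.max()
            onset = int(np.argmax(above))                           # first frame above threshold
            offset = len(above) - 1 - int(np.argmax(above[::-1]))   # last frame above threshold
            return onset / fs, offset / fs                          # times in seconds

        # Toy closing-opening movement of a tongue-tip sensor.
        t = np.linspace(0.0, 0.3, 60)
        xy = np.stack([np.zeros_like(t), 5.0 * np.sin(np.pi * t / 0.3)], axis=1)
        print(gesture_landmarks(xy))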

    Dance-the-Music: an educational platform for the modeling, recognition and audiovisual monitoring of dance steps using spatiotemporal motion templates

    In this article, a computational platform is presented, entitled “Dance-the-Music”, that can be used in a dance educational context to explore and learn the basics of dance steps. By introducing a method based on spatiotemporal motion templates, the platform makes it possible to train basic step models from sequentially repeated dance figures performed by a dance teacher. Movements are captured with an optical motion capture system. The teacher's models can be visualized from a first-person perspective to instruct students how to perform the specific dance steps in the correct manner. Moreover, recognition algorithms based on a template-matching method can determine the quality of a student's performance in real time by means of multimodal monitoring techniques. The results of an evaluation study suggest that Dance-the-Music is effective in helping dance students master the basics of dance figures.
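
    To make the template-matching idea concrete, here is a minimal sketch under assumed data conventions (each recording is a frames x joints x 3 array of motion-capture coordinates); it is not the Dance-the-Music implementation. Teacher repetitions of one step are time-normalized and averaged into a template, and a student's performance is labeled with the nearest template.

        import numpy as np

        def resample(track, n=50):
            """Time-normalize a motion track to n frames by linear interpolation."""
            idx = np.linspace(0, len(track) - 1, n)
            lo, hi = np.floor(idx).astype(int), np.ceil(idx).astype(int)
            w = (idx - lo)[:, None, None]
            return (1 - w) * track[lo] + w * track[hi]

        def build_template(repetitions):
            """Average several time-normalized repetitions of one dance step by the teacher."""
            return np.mean([resample(r) for r in repetitions], axis=0)

        def classify(performance, templates):
            """Return the step label whose template is closest to the student's performance."""
            p = resample(performance)
            scores = {label: np.linalg.norm(p - tpl) for label, tpl in templates.items()}
            return min(scores, key=scores.get)

        # Toy data: three teacher repetitions of one step, 15 joints tracked in 3D.
        rng = np.random.default_rng(1)
        teacher = {"basic_step": [rng.normal(size=(f, 15, 3)) for f in (48, 52, 50)]}
        templates = {label: build_template(reps) for label, reps in teacher.items()}
        student = rng.normal(size=(55, 15, 3))
        print(classify(student, templates))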

    Speech and language therapy for aphasia following stroke

    Background
    Aphasia is an acquired language impairment following brain damage that affects some or all language modalities: expression and understanding of speech, reading, and writing. Approximately one third of people who have a stroke experience aphasia.
    Objectives
    To assess the effects of speech and language therapy (SLT) for aphasia following stroke.
    Search methods
    We searched the Cochrane Stroke Group Trials Register (last searched 9 September 2015), CENTRAL (2015, Issue 5) and other Cochrane Library databases (CDSR, DARE, HTA, to 22 September 2015), MEDLINE (1946 to September 2015), EMBASE (1980 to September 2015), CINAHL (1982 to September 2015), AMED (1985 to September 2015), LLBA (1973 to September 2015), and SpeechBITE (2008 to September 2015). We also searched major trials registers for ongoing trials, including ClinicalTrials.gov (to 21 September 2015), the Stroke Trials Registry (to 21 September 2015), Current Controlled Trials (to 22 September 2015), and WHO ICTRP (to 22 September 2015). In an effort to identify further published, unpublished, and ongoing trials, we also handsearched the International Journal of Language and Communication Disorders (1969 to 2005) and reference lists of relevant articles, and we contacted academic institutions and other researchers. There were no language restrictions.
    Selection criteria
    Randomised controlled trials (RCTs) comparing SLT (a formal intervention that aims to improve language and communication abilities, activity and participation) versus no SLT; social support or stimulation (an intervention that provides social support and communication stimulation but does not include targeted therapeutic interventions); or another SLT intervention (differing in duration, intensity, frequency, intervention methodology or theoretical approach).
    Data collection and analysis
    We independently extracted the data and assessed the quality of included trials. We sought missing data from investigators.
    Main results
    We included 57 RCTs (74 randomised comparisons) involving 3002 participants in this review (some appearing in more than one comparison). Twenty-seven randomised comparisons (1620 participants) assessed SLT versus no SLT; SLT resulted in clinically and statistically significant benefits to patients' functional communication (standardised mean difference (SMD) 0.28, 95% confidence interval (CI) 0.06 to 0.49, P = 0.01), reading, writing, and expressive language, but (based on smaller numbers) benefits were not evident at follow-up. Nine randomised comparisons (447 participants) assessed SLT versus social support and stimulation; meta-analyses found no evidence of a difference in functional communication, but more participants withdrew from social support interventions than from SLT. Thirty-eight randomised comparisons (1242 participants) assessed two approaches to SLT. Functional communication was significantly better in people with aphasia who received therapy at a high intensity, high dose, or over a long duration compared with those who received therapy at a lower intensity, lower dose, or over a shorter period of time. The benefits of a high intensity or a high dose of SLT were confounded by a significantly higher dropout rate in these intervention groups. Generally, trials randomised small numbers of participants across a range of characteristics (age, time since stroke, and severity profiles), interventions, and outcomes.
    Authors' conclusions
    Our review provides evidence of the effectiveness of SLT for people with aphasia following stroke in terms of improved functional communication, reading, writing, and expressive language compared with no therapy. There is some indication that therapy at high intensity, high dose, or over a longer period may be beneficial. However, high-intensity and high-dose interventions may not be acceptable to all.
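
    For readers unfamiliar with the effect measure reported above, the sketch below shows how a standardised mean difference (SMD) and its 95% confidence interval can be computed for a single trial and pooled across trials with fixed-effect inverse-variance weighting. The trial numbers are invented for illustration; they are not data from the review, and the review's own meta-analytic models may differ.

        import math

        def smd(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
            """Standardised mean difference (Cohen's d) with its approximate variance."""
            pooled_sd = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / (n_t + n_c - 2))
            d = (mean_t - mean_c) / pooled_sd
            var = (n_t + n_c) / (n_t * n_c) + d**2 / (2 * (n_t + n_c))
            return d, var

        def pool(effects):
            """Fixed-effect inverse-variance pooled estimate with a 95% confidence interval."""
            weights = [1.0 / var for _, var in effects]
            est = sum(w * d for (d, _), w in zip(effects, weights)) / sum(weights)
            se = math.sqrt(1.0 / sum(weights))
            return est, (est - 1.96 * se, est + 1.96 * se)

        trials = [smd(32.0, 10.0, 40, 29.0, 11.0, 38),   # invented functional-communication scores
                  smd(55.0, 14.0, 60, 51.0, 15.0, 62)]
        print(pool(trials))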

    Synchronization of Speech and Gesture: Evidence for Interaction in Action

    Peer-reviewed postprint.

    A systematic investigation of gesture kinematics in evolving manual languages in the lab

    Silent gestures consist of complex multi-articulatory movements but are now primarily studied through categorical coding of the referential gesture content. The relation of categorical linguistic content to continuous kinematics is therefore poorly understood. Here, we reanalyzed the video data from a gestural evolution experiment (Motamedi, Schouwstra, Smith, Culbertson, & Kirby, 2019), which showed increases in the systematicity of gesture content over time. We applied computer vision techniques to quantify the kinematics of the original data. Our kinematic analyses demonstrated that gestures become more efficient and less complex in their kinematics over generations of learners. We further detect the systematicity of gesture form at the level of the gesture kinematic interrelations, which directly scales with the systematicity obtained from semantic coding of the gestures. Thus, from continuous kinematics alone, we can tap into linguistic aspects that were previously only approachable through categorical coding of meaning. Finally, going beyond issues of systematicity, we show how unique gesture kinematic dialects emerged over generations as isolated chains of participants gradually diverged over iterations from other chains. We thereby conclude that gestures can come to embody the linguistic system at the level of interrelationships between communicative tokens, which should calibrate our theories about form and linguistic content.
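
    The kind of kinematic quantification described can be illustrated with a hypothetical sketch: given 2D wrist keypoints per video frame (e.g., from a pose estimator), compute simple efficiency and complexity measures such as path length, peak speed, and the number of submovements (velocity peaks). The frame rate, peak threshold, and toy trajectory are assumptions; the study's actual feature set and pipeline are not reproduced here.

        import numpy as np

        def kinematic_features(keypoints, fps=25.0):
            """keypoints: (frames, 2) wrist positions in normalized image coordinates."""
            step = np.diff(keypoints, axis=0)
            speed = np.linalg.norm(step * fps, axis=1)
            path_length = float(np.sum(np.linalg.norm(step, axis=1)))
            # Submovements: local maxima of the speed profile above a small threshold.
            peaks = [i for i in range(1, len(speed) - 1)
                     if speed[i] > speed[i - 1] and speed[i] > speed[i + 1] and speed[i] > 0.05]
            return {"path_length": path_length,
                    "peak_speed": float(speed.max()),
                    "n_submovements": len(peaks)}

        # Toy trajectory: a random walk standing in for a tracked wrist.
        rng = np.random.default_rng(2)
        wrist = np.cumsum(rng.normal(scale=0.01, size=(100, 2)), axis=0)
        print(kinematic_features(wrist))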