Synthetic Speaking Children -- Why We Need Them and How to Make Them
Contemporary Human Computer Interaction (HCI) research relies primarily on
neural network models for machine vision and speech understanding of a system
user. Such models require extensively annotated training datasets for optimal
performance and when building interfaces for users from a vulnerable population
such as young children, GDPR introduces significant complexities in data
collection, management, and processing. Motivated by the training needs of an
Edge AI smart toy platform, this research explores the latest advances in
generative neural technologies and provides a working proof of concept of a
controllable data generation pipeline for speech-driven facial training data at
scale. In this context, we demonstrate how StyleGAN2 can be finetuned to create
a gender balanced dataset of children's faces. This dataset includes a variety
of controllable factors such as facial expressions, age variations, facial
poses, and even speech-driven animations with realistic lip synchronization. By
combining generative text-to-speech models for child voice synthesis and a 3D
landmark-based talking-heads pipeline, we can generate highly realistic,
entirely synthetic, talking child video clips. These video clips can provide
valuable, and controllable, synthetic training data for neural network models,
bridging the gap when real data is scarce or restricted due to privacy
regulations.
Comment: Presented at SpeD 2
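The controllable factors the abstract lists (gender, expression, age, pose) imply a manifest-building step before any images are synthesized. As a hedged sketch, the Python below samples a gender-balanced list of generation specs; the factor names and value ranges are illustrative assumptions, not the paper's actual StyleGAN2 conditioning interface.

```python
import itertools
import random

# Illustrative controllable factors; not the authors' actual parameterization.
GENDERS = ["female", "male"]
EXPRESSIONS = ["neutral", "happy", "surprised"]
AGES = [4, 6, 8]        # target apparent age, years
POSES = [-30, 0, 30]    # head yaw, degrees

def balanced_manifest(n_per_gender, seed=0):
    """Sample a gender-balanced list of generation specs for the pipeline."""
    rng = random.Random(seed)
    combos = list(itertools.product(EXPRESSIONS, AGES, POSES))
    manifest = []
    for gender in GENDERS:
        for _ in range(n_per_gender):
            expression, age, pose = rng.choice(combos)
            manifest.append({"gender": gender, "expression": expression,
                             "age": age, "pose_yaw": pose})
    return manifest

specs = balanced_manifest(100)
counts = {g: sum(s["gender"] == g for s in specs) for g in GENDERS}
print(counts)  # {'female': 100, 'male': 100}
```

Each spec would then drive one synthesis call (StyleGAN2 image, child-voice TTS, talking-head animation); balancing at the manifest level is one simple way to obtain the gender balance the abstract describes.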
Accurate synthesis of Dysarthric Speech for ASR data augmentation
Dysarthria is a motor speech disorder often characterized by reduced speech
intelligibility through slow, uncoordinated control of speech production
muscles. Automatic Speech Recognition (ASR) systems can help dysarthric talkers
communicate more effectively. However, robust dysarthria-specific ASR requires
a significant amount of training speech, which is not readily available for
dysarthric talkers. This paper presents a new dysarthric speech synthesis
method for the purpose of ASR training data augmentation. Differences in
prosodic and acoustic characteristics of dysarthric spontaneous speech at
varying severity levels are important components for dysarthric speech
modeling, synthesis, and augmentation. For dysarthric speech synthesis, a
modified neural multi-talker TTS is implemented by adding a dysarthria severity
level coefficient and a pause insertion model to synthesize dysarthric speech
for varying severity levels. To evaluate the effectiveness of the synthesized
speech as ASR training data, dysarthria-specific speech recognition was used.
Results show that a DNN-HMM model trained on additional synthetic dysarthric
speech achieves a WER improvement of 12.2% compared to the baseline, and that
the addition of the severity level and pause insertion controls decreases WER by
6.5%, showing the effectiveness of adding these parameters. Overall results on
the TORGO database demonstrate that using dysarthric synthetic speech to
increase the amount of dysarthric-patterned speech for training has significant
impact on the dysarthric ASR systems. In addition, we have conducted a
subjective evaluation to assess the dysarthric-ness and similarity of the
synthesized speech. This evaluation shows that the perceived dysarthric-ness of
synthesized speech is similar to that of true dysarthric speech, especially for
higher levels of dysarthria.
Comment: arXiv admin note: text overlap with arXiv:2201.1157
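The two controls added to the multi-talker TTS can be pictured with a toy sketch: a scalar severity coefficient that slows speech by stretching durations, and a pause-insertion step whose probability rises with severity. The linear mappings below are illustrative assumptions; in the paper both controls are learned inside the neural TTS.

```python
import random

def stretch_durations(durations, severity):
    """Toy severity control: higher severity -> slower speech.
    The linear stretch factor is an illustrative assumption."""
    return [d * (1.0 + severity) for d in durations]

def insert_pauses(words, severity, seed=0):
    """Toy pause-insertion model: the probability of an inter-word
    pause grows with severity (severity assumed in [0, 1])."""
    rng = random.Random(seed)
    p_pause = min(0.8, 0.1 + 0.6 * severity)
    out = []
    for i, w in enumerate(words):
        out.append(w)
        if i < len(words) - 1 and rng.random() < p_pause:
            out.append("<pause>")
    return out

words = "the quick brown fox jumps over the lazy dog".split()
mild = insert_pauses(words, severity=0.1)
severe = insert_pauses(words, severity=0.9)
```

With the same random seed, raising the severity can only add pauses, so the control is monotone in this sketch; the learned model in the paper would instead capture where dysarthric talkers actually pause.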
SYNTHESIZING DYSARTHRIC SPEECH USING MULTI-SPEAKER TTS FOR DYSARTHRIC SPEECH RECOGNITION
Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility through slow, uncoordinated control of speech production muscles. Automatic Speech Recognition (ASR) systems may help dysarthric talkers communicate more effectively. However, robust dysarthria-specific ASR requires a significant amount of training speech, which is not readily available for dysarthric talkers.
In this dissertation, we investigate dysarthric speech augmentation and synthesis methods. To better understand differences in prosodic and acoustic characteristics of dysarthric spontaneous speech at varying severity levels, a comparative study between typical and dysarthric speech was conducted. These characteristics are important components for dysarthric speech modeling, synthesis, and augmentation. For augmentation, prosodic transformation and time-feature masking have been proposed. For dysarthric speech synthesis, this dissertation introduces a modified neural multi-talker TTS, adding a dysarthria severity level coefficient and a pause insertion model to synthesize dysarthric speech at varying severity levels. In addition, we have extended this work by using a label propagation technique to create more meaningful control variables, such as a continuous Respiration, Laryngeal and Tongue (RLT) parameter, even for datasets that only provide discrete dysarthria severity level information. This approach increases the controllability of the system, allowing us to generate dysarthric speech with a broader range of characteristics.
To evaluate their effectiveness for synthesizing training data, dysarthria-specific speech recognition was used. Results show that a DNN-HMM model trained on additional synthetic dysarthric speech achieves a WER improvement of 12.2% compared to the baseline, and that the addition of the severity level and pause insertion controls decreases WER by 6.5%, showing the effectiveness of adding these parameters. Overall results on the TORGO database demonstrate that using dysarthric synthetic speech to increase the amount of dysarthric-patterned speech for training has a significant impact on dysarthric ASR systems.
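The reported 12.2% reads as a relative WER reduction. As a reference point, the sketch below shows a standard Levenshtein-based WER and the relative-improvement computation; the relative interpretation and the baseline/augmented numbers in the example are assumptions for illustration, not figures from the work.

```python
def wer(ref, hyp):
    """Word error rate: Levenshtein distance over reference word count."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / len(r)

def relative_improvement(baseline_wer, augmented_wer):
    """Relative WER reduction, in percent."""
    return (baseline_wer - augmented_wer) / baseline_wer * 100.0

# Hypothetical numbers: a drop from 50.0% to 43.9% WER is a 12.2% reduction.
print(round(relative_improvement(50.0, 43.9), 1))  # 12.2
```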
Neurotrophin-3 regulates ribbon synapse density in the cochlea and induces synapse regeneration after acoustic trauma
Neurotrophin-3 (Ntf3) and brain-derived neurotrophic factor (Bdnf) are critical for sensory neuron survival and establishment of neuronal projections to sensory epithelia in the embryonic inner ear, but their postnatal functions remain poorly understood. Using cell-specific inducible gene recombination in mice, we found that, in the postnatal inner ear, Bdnf and Ntf3 are required for the formation and maintenance of hair cell ribbon synapses in the vestibular and cochlear epithelia, respectively. We also show that supporting cells in these epithelia are the key endogenous source of the neurotrophins. Using a new hair cell CreERT line with mosaic expression, we also found that Ntf3's effect on cochlear synaptogenesis is highly localized. Moreover, supporting cell-derived Ntf3, but not Bdnf, promoted recovery of cochlear function and ribbon synapse regeneration after acoustic trauma. These results indicate that glial-derived neurotrophins play critical roles in inner ear synapse density and synaptic regeneration after injury.
DOI: http://dx.doi.org/10.7554/eLife.03564.00
Concordance of MEG and fMRI patterns in adolescents during verb generation
In this study, we focused on a direct comparison between the spatial distributions of activation detected by functional magnetic resonance imaging (fMRI) and localization of sources detected by magnetoencephalography (MEG) during identical language tasks. We examined the spatial concordance between MEG and fMRI results in 16 adolescents performing a three-phase verb generation task that involves repeating the auditorily presented concrete noun and generating verbs either overtly or covertly in response to the auditorily presented noun. MEG analysis was completed using a synthetic aperture magnetometry (SAM) technique, while the fMRI data were analyzed using the general linear model approach with random effects. To quantify the agreement between the two modalities, we implemented a voxel-wise concordance correlation coefficient (CCC) and identified the left inferior frontal gyrus and the bilateral motor cortex as regions with high CCC values. At the group level, MEG and fMRI data showed spatial convergence in the left inferior frontal gyrus for covert or overt generation versus overt repetition, and in the bilateral motor cortex for overt generation versus covert generation. These findings demonstrate the utility of the CCC as a quantitative measure of spatial convergence between two neuroimaging techniques.
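The voxel-wise agreement measure is Lin's concordance correlation coefficient, which penalizes both poor correlation and systematic offset between the two modalities. A plain-Python version of the standard formula (not the authors' implementation) is:

```python
import statistics

def ccc(x, y):
    """Lin's concordance correlation coefficient between two sequences:
    2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2),
    using population (biased) variance and covariance."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    vx, vy = statistics.pvariance(x), statistics.pvariance(y)
    cov = statistics.fmean((a - mx) * (b - my) for a, b in zip(x, y))
    return 2 * cov / (vx + vy + (mx - my) ** 2)

print(ccc([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0: perfect concordance
```

Unlike Pearson's r, the CCC drops below 1 when one map is a shifted or rescaled copy of the other, which is the desired behavior when asking whether MEG and fMRI agree in absolute spatial terms.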
Auditory and speech processing in specific language impairment (SLI) and dyslexia
This thesis investigates auditory and speech processing in Specific Language
Impairment (SLI) and dyslexia. One influential theory postulates that both
SLI and dyslexia stem from a similar underlying sensory deficit that impacts
speech perception and phonological development leading to oral language and literacy
deficits. Previous studies, however, have shown that these underlying sensory deficits
exist in only a subgroup of language impaired individuals, and the exact nature of these
deficits is still largely unknown.
The present thesis investigates three aspects of the auditory-phonetic interface
in young adults with SLI and dyslexia: 1) the weighting of acoustic cues to the
phonetic voicing contrast; 2) the preattentive and attentive discrimination of
speech and non-linguistic stimuli; and 3) the formation of auditory memory
traces for speech and non-linguistic stimuli. The thesis examines both
individual and group-level data of
auditory and speech processing and their relationship with higher-level language
measures. The groups of people with SLI and dyslexia who participated were aged
between 14 and 25, and their performance was compared to a group of controls matched
on chronological age, IQ, gender and handedness.
Investigations revealed a complex pattern of behaviour. The results showed that
individuals with SLI or dyslexia are not poor at discriminating sounds (whether speech
or non-speech). However, in all experiments, there was more variation and more outliers
in the SLI group, indicating that auditory deficits may occur in a small subgroup of the
SLI population. Moreover, investigations of the exact nature of the input-processing
deficit revealed that some individuals with SLI have less categorical representations for
speech sounds and that they weight the acoustic cues to phonemic identity differently
from controls and dyslexics.
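The cue-weighting question can be made concrete with a small modeling sketch: fit a logistic regression to binary voicing responses and compare the fitted cue weights. The cues (VOT, onset F0), stimulus ranges, and simulated listener below are illustrative assumptions, not data from the thesis.

```python
import math
import random

# Synthetic listener data: binary /b/-/p/ responses driven mostly by VOT,
# weakly by onset F0. All values and coefficients are illustrative only.
rng = random.Random(0)
data = []
for _ in range(300):
    vot = rng.uniform(0.0, 60.0)     # voice onset time, ms
    f0 = rng.uniform(180.0, 260.0)   # onset F0, Hz
    x1 = (vot - 30.0) / 30.0         # normalized cues
    x2 = (f0 - 220.0) / 40.0
    z = 4.0 * x1 + 0.8 * x2          # simulated "true" cue weights
    y = 1 if rng.random() < 1.0 / (1.0 + math.exp(-z)) else 0
    data.append((x1, x2, y))

# Full-batch gradient descent on the logistic loss.
w1 = w2 = b = 0.0
lr = 0.5
for _ in range(2000):
    g1 = g2 = gb = 0.0
    for x1, x2, y in data:
        p = 1.0 / (1.0 + math.exp(-(w1 * x1 + w2 * x2 + b)))
        err = p - y
        g1 += err * x1
        g2 += err * x2
        gb += err
    n = len(data)
    w1 -= lr * g1 / n
    w2 -= lr * g2 / n
    b -= lr * gb / n

print(abs(w1) > abs(w2))  # the fitted VOT weight dominates, as built in
```

Comparing fitted weights across groups (e.g. controls versus SLI) is one standard way to quantify the "weight the acoustic cues differently" finding described above.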
Sequencing in SLA: Phonological Memory, Chunking and Points of Order.
Peer Reviewed
https://deepblue.lib.umich.edu/bitstream/2027.42/139863/1/Ellis1996Chunking.pd
Towards a Vygotskyan cognitive robotics: the role of language as a cognitive tool
Cognitive Robotics can be defined as the study of cognitive phenomena by their modeling in physical artifacts such as robots. This is a very lively and fascinating field which has already given fundamental contributions to our understanding of natural cognition. Nonetheless, robotics has to date addressed mainly very basic, low-level cognitive phenomena like sensory-motor coordination, perception, and navigation, and it is not clear how the current approach might scale up to explain high-level human cognition. In this paper we argue that a promising way to do that is to merge current ideas and methods of 'embodied cognition' with the Russian tradition of theoretical psychology which views language not only as a communication system but also as a cognitive tool, that is, by developing a Vygotskyan Cognitive Robotics. We substantiate this idea by discussing several domains in which language can improve basic cognitive abilities and permit the development of high-level cognition: learning, categorization, abstraction, memory, voluntary control, and mental life.