
    Synthetic Speaking Children -- Why We Need Them and How to Make Them

    Contemporary Human-Computer Interaction (HCI) research relies primarily on neural network models for machine vision and speech understanding of a system user. Such models require extensively annotated training datasets for optimal performance, and when building interfaces for users from a vulnerable population such as young children, GDPR introduces significant complexities in data collection, management, and processing. Motivated by the training needs of an Edge AI smart toy platform, this research explores the latest advances in generative neural technologies and provides a working proof of concept of a controllable data generation pipeline for speech-driven facial training data at scale. In this context, we demonstrate how StyleGAN2 can be finetuned to create a gender-balanced dataset of children's faces. This dataset includes a variety of controllable factors such as facial expressions, age variations, facial poses, and even speech-driven animations with realistic lip synchronization. By combining generative text-to-speech models for child voice synthesis and a 3D landmark-based talking-head pipeline, we can generate highly realistic, entirely synthetic, talking child video clips. These video clips can provide valuable, and controllable, synthetic training data for neural network models, bridging the gap when real data is scarce or restricted due to privacy regulations. Comment: Presented at SpeD 2
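    The abstract does not spell out how the controllable factors are exposed, so the following is only a minimal sketch of one common approach with StyleGAN2-style generators: shifting a latent code along a learned semantic direction and applying the truncation trick. The average latent, the attribute direction, and the dimensions are assumptions, stood in by random vectors so the sketch runs on its own.

```python
# Hedged sketch (not the authors' pipeline): controllable attribute editing in a
# StyleGAN2-style latent space. The fine-tuned generator, its average latent, and
# the learned "age" direction are assumptions; random vectors stand in for them.
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 512

w_avg = rng.normal(size=LATENT_DIM)            # stand-in for the generator's average latent
age_direction = rng.normal(size=LATENT_DIM)    # hypothetical learned "age" direction
age_direction /= np.linalg.norm(age_direction)

def truncate(w, psi=0.7):
    """Truncation trick: pull latents toward the average, trading diversity for fidelity."""
    return w_avg + psi * (w - w_avg)

def edit(w, direction, strength):
    """Shift latents along a semantic direction to control one factor (age, pose, expression)."""
    return w + strength * direction

# Sample a small batch of identities and sweep the hypothetical age control.
w_batch = truncate(rng.normal(size=(4, LATENT_DIM)))
age_sweep = [edit(w_batch, age_direction, s) for s in (-2.0, 0.0, 2.0)]
print([w.shape for w in age_sweep])   # three (4, 512) arrays, one per age setting
```

    In the paper's pipeline these edited latents would be decoded by the fine-tuned generator and then animated by the landmark-based talking-head stage; only the latent manipulation is illustrated here.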

    Accurate synthesis of Dysarthric Speech for ASR data augmentation

    Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility through slow, uncoordinated control of speech production muscles. Automatic Speech Recognition (ASR) systems can help dysarthric talkers communicate more effectively. However, robust dysarthria-specific ASR requires a significant amount of training speech, which is not readily available for dysarthric talkers. This paper presents a new dysarthric speech synthesis method for the purpose of ASR training data augmentation. Differences in prosodic and acoustic characteristics of dysarthric spontaneous speech at varying severity levels are important components for dysarthric speech modeling, synthesis, and augmentation. For dysarthric speech synthesis, a modified neural multi-talker TTS is implemented by adding a dysarthria severity level coefficient and a pause insertion model to synthesize dysarthric speech for varying severity levels. To evaluate the effectiveness of the synthesized speech as ASR training data, dysarthria-specific speech recognition was used. Results show that a DNN-HMM model trained on additional synthetic dysarthric speech achieves a WER improvement of 12.2% compared to the baseline, and that the addition of the severity level and pause insertion controls decreases WER by 6.5%, showing the effectiveness of adding these parameters. Overall results on the TORGO database demonstrate that using dysarthric synthetic speech to increase the amount of dysarthric-patterned training speech has a significant impact on dysarthric ASR systems. In addition, we conducted a subjective evaluation of the dysarthric-ness and similarity of the synthesized speech. This evaluation shows that the perceived dysarthric-ness of the synthesized speech is similar to that of true dysarthric speech, especially for higher levels of dysarthria. Comment: arXiv admin note: text overlap with arXiv:2201.1157
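    For reference, the WER figures above are word error rates: word-level edit distance between hypothesis and reference, divided by the reference length. The minimal sketch below, using hypothetical transcripts rather than TORGO data, shows how such a relative improvement is computed.

```python
# Minimal sketch: word error rate (WER) via word-level edit distance.
# The transcripts below are hypothetical; no TORGO data is reproduced here.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)  # substitution, deletion, insertion
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

baseline = wer("the quick brown fox jumps", "the quick brown box")       # 2 errors / 5 words = 0.40
augmented = wer("the quick brown fox jumps", "the quick brown fox jump")  # 1 error  / 5 words = 0.20
print(f"relative WER improvement: {100 * (baseline - augmented) / baseline:.1f}%")
```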

    SYNTHESIZING DYSARTHRIC SPEECH USING MULTI-SPEAKER TTS FOR DYSARTHRIC SPEECH RECOGNITION

    Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility through slow, uncoordinated control of speech production muscles. Automatic Speech Recognition (ASR) systems may help dysarthric talkers communicate more effectively. However, robust dysarthria-specific ASR requires a significant amount of training speech, which is not readily available for dysarthric talkers. In this dissertation, we investigate dysarthric speech augmentation and synthesis methods. To better understand differences in prosodic and acoustic characteristics of dysarthric spontaneous speech at varying severity levels, a comparative study between typical and dysarthric speech was conducted. These characteristics are important components for dysarthric speech modeling, synthesis, and augmentation. For augmentation, prosodic transformation and time-feature masking have been proposed. For dysarthric speech synthesis, this dissertation has introduced a modified neural multi-talker TTS by adding a dysarthria severity level coefficient and a pause insertion model to synthesize dysarthric speech for varying severity levels. In addition, we have extended this work by using a label propagation technique to create more meaningful control variables, such as a continuous Respiration, Laryngeal and Tongue (RLT) parameter, even for datasets that only provide discrete dysarthria severity level information. This approach increases the controllability of the system, so we are able to generate dysarthric speech over a broader range of characteristics. To evaluate the effectiveness of the synthesized speech as training data, dysarthria-specific speech recognition was used. Results show that a DNN-HMM model trained on additional synthetic dysarthric speech achieves a WER improvement of 12.2% compared to the baseline, and that the addition of the severity level and pause insertion controls decreases WER by 6.5%, showing the effectiveness of adding these parameters. Overall results on the TORGO database demonstrate that using dysarthric synthetic speech to increase the amount of dysarthric-patterned training speech has a significant impact on dysarthric ASR systems.
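    The abstract does not describe how the severity coefficient enters the model, so the sketch below shows one plausible conditioning scheme as an assumption: the scalar severity (or a continuous RLT-style control) is projected to a small vector and concatenated with the speaker embedding before conditioning the TTS decoder. All module names and dimensions are hypothetical.

```python
# Hedged sketch (assumed architecture, not the dissertation's exact model):
# injecting a scalar dysarthria-severity control into a multi-speaker TTS by
# projecting it and concatenating it with the speaker embedding.
import torch
import torch.nn as nn

class SeverityConditioning(nn.Module):
    def __init__(self, n_speakers=10, spk_dim=64, ctrl_dim=16):
        super().__init__()
        self.speaker_emb = nn.Embedding(n_speakers, spk_dim)
        self.severity_proj = nn.Linear(1, ctrl_dim)   # scalar severity -> control vector

    def forward(self, speaker_id, severity):
        # severity in [0, 1]: 0 = typical speech, 1 = most severe (assumed convention)
        spk = self.speaker_emb(speaker_id)
        ctrl = torch.tanh(self.severity_proj(severity.unsqueeze(-1)))
        return torch.cat([spk, ctrl], dim=-1)   # fed to the TTS decoder as its conditioning input

cond = SeverityConditioning()
out = cond(torch.tensor([3]), torch.tensor([0.75]))
print(out.shape)   # torch.Size([1, 80])
```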

    Neurotrophin-3 regulates ribbon synapse density in the cochlea and induces synapse regeneration after acoustic trauma

    Neurotrophin-3 (Ntf3) and brain-derived neurotrophic factor (Bdnf) are critical for sensory neuron survival and the establishment of neuronal projections to sensory epithelia in the embryonic inner ear, but their postnatal functions remain poorly understood. Using cell-specific inducible gene recombination in mice, we found that, in the postnatal inner ear, Bdnf and Ntf3 are required for the formation and maintenance of hair cell ribbon synapses in the vestibular and cochlear epithelia, respectively. We also show that supporting cells in these epithelia are the key endogenous source of the neurotrophins. Using a new hair cell CreERT line with mosaic expression, we also found that Ntf3's effect on cochlear synaptogenesis is highly localized. Moreover, supporting cell-derived Ntf3, but not Bdnf, promoted recovery of cochlear function and ribbon synapse regeneration after acoustic trauma. These results indicate that glial-derived neurotrophins play critical roles in inner ear synapse density and synaptic regeneration after injury. DOI: http://dx.doi.org/10.7554/eLife.03564.00

    Concordance of MEG and fMRI patterns in adolescents during verb generation

    In this study we focused on a direct comparison between the spatial distributions of activation detected by functional magnetic resonance imaging (fMRI) and the localization of sources detected by magnetoencephalography (MEG) during identical language tasks. We examined the spatial concordance between MEG and fMRI results in 16 adolescents performing a three-phase verb generation task that involves repeating an auditorily presented concrete noun and generating verbs either overtly or covertly in response to the auditorily presented noun. MEG analysis was completed using a synthetic aperture magnetometry (SAM) technique, while the fMRI data were analyzed using a random-effects general linear model approach. To quantify the agreement between the two modalities, we implemented a voxel-wise concordance correlation coefficient (CCC) and identified the left inferior frontal gyrus and the bilateral motor cortex as regions with high CCC values. At the group level, MEG and fMRI data showed spatial convergence in the left inferior frontal gyrus for covert or overt generation versus overt repetition, and in the bilateral motor cortex for overt generation versus covert generation. These findings demonstrate the utility of the CCC as a quantitative measure of spatial convergence between two neuroimaging techniques.
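    The concordance correlation coefficient here is Lin's statistic, which penalizes both poor correlation and any mean or scale offset between the two maps. A minimal sketch follows; the study applies it voxel-wise across subjects, whereas this illustration computes a single CCC over two randomly generated stand-in maps.

```python
# Minimal sketch: Lin's concordance correlation coefficient (CCC) between two
# activation maps, the measure used to quantify MEG/fMRI spatial agreement.
# The maps below are random stand-ins, not the study's data.
import numpy as np

def ccc(x: np.ndarray, y: np.ndarray) -> float:
    x, y = np.ravel(x), np.ravel(y)
    mx, my = x.mean(), y.mean()
    cov = np.mean((x - mx) * (y - my))
    # 2*cov / (var_x + var_y + (mean_x - mean_y)^2): 1 only for perfect agreement
    return 2 * cov / (x.var() + y.var() + (mx - my) ** 2)

rng = np.random.default_rng(1)
meg_map = rng.normal(size=(32, 32, 32))
fmri_map = 0.6 * meg_map + 0.4 * rng.normal(size=(32, 32, 32))  # partly concordant stand-in
print(f"CCC = {ccc(meg_map, fmri_map):.3f}")
```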

    Auditory and speech processing in specific language impairment (SLI) and dyslexia

    This thesis investigates auditory and speech processing in Specific Language Impairment (SLI) and dyslexia. One influential theory postulates that both SLI and dyslexia stem from a similar underlying sensory deficit that impacts speech perception and phonological development, leading to oral language and literacy deficits. Previous studies, however, have shown that these underlying sensory deficits exist in only a subgroup of language-impaired individuals, and the exact nature of these deficits is still largely unknown. The present thesis investigates three aspects of the auditory-phonetic interface: 1) the weighting of acoustic cues to a phonetic voicing contrast, 2) the preattentive and attentive discrimination of speech and non-linguistic stimuli, and 3) the formation of auditory memory traces for speech and non-linguistic stimuli in young adults with SLI and dyslexia. The thesis examines both individual- and group-level data on auditory and speech processing and their relationship with higher-level language measures. The participants with SLI and dyslexia were aged between 14 and 25, and their performance was compared to a group of controls matched on chronological age, IQ, gender and handedness. The investigations revealed a complex pattern of behaviour. The results showed that individuals with SLI or dyslexia are not poor at discriminating sounds (whether speech or non-speech). However, in all experiments there was more variation and there were more outliers in the SLI group, indicating that auditory deficits may occur in a small subgroup of the SLI population. Moreover, investigations of the exact nature of the input-processing deficit revealed that some individuals with SLI have less categorical representations for speech sounds and that they weight the acoustic cues to phonemic identity differently from controls and dyslexics.
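    The thesis does not state how cue weights were estimated, so the sketch below illustrates one standard approach as an assumption: fit a logistic regression of categorization responses on standardized cue values (for example voice onset time and onset F0) and compare the coefficients. The listener responses here are simulated, not the thesis's data.

```python
# Hedged illustration (not the thesis's analysis): estimating acoustic cue weights
# for a voicing contrast by logistic regression of categorization responses on
# standardized cue values (VOT and onset F0). Responses are simulated.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 400
vot = rng.uniform(0, 60, n)          # voice onset time, ms
f0 = rng.uniform(180, 260, n)        # onset F0, Hz
# Simulated listener: relies mostly on VOT, a little on F0.
logit = 0.15 * (vot - 30) + 0.02 * (f0 - 220)
resp = rng.random(n) < 1 / (1 + np.exp(-logit))   # True = "voiceless" response

X = np.column_stack([(vot - vot.mean()) / vot.std(), (f0 - f0.mean()) / f0.std()])
model = LogisticRegression().fit(X, resp)
w_vot, w_f0 = model.coef_[0]
print(f"relative cue weights  VOT: {abs(w_vot):.2f}  F0: {abs(w_f0):.2f}")
```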

    Sequencing in SLA: Phonological Memory, Chunking and Points of Order.

    Peer Reviewed. https://deepblue.lib.umich.edu/bitstream/2027.42/139863/1/Ellis1996Chunking.pd

    Towards a Vygotskyan cognitive robotics: the role of language as a cognitive tool

    Cognitive Robotics can be defined as the study of cognitive phenomena by modeling them in physical artifacts such as robots. This is a very lively and fascinating field which has already made fundamental contributions to our understanding of natural cognition. Nonetheless, robotics has to date addressed mainly very basic, low-level cognitive phenomena such as sensory-motor coordination, perception, and navigation, and it is not clear how the current approach might scale up to explain high-level human cognition. In this paper we argue that a promising way to do so is to merge current ideas and methods of 'embodied cognition' with the Russian tradition of theoretical psychology, which views language not only as a communication system but also as a cognitive tool, that is, by developing a Vygotskyan Cognitive Robotics. We substantiate this idea by discussing several domains in which language can improve basic cognitive abilities and permit the development of high-level cognition: learning, categorization, abstraction, memory, voluntary control, and mental life.