705 research outputs found

    Improving fairness for spoken language understanding in atypical speech with Text-to-Speech

    Spoken language understanding (SLU) systems often exhibit suboptimal performance in processing atypical speech, typically caused by neurological conditions and motor impairments. Recent advancements in Text-to-Speech (TTS) synthesis-based augmentation for fairer SLU have struggled to accurately capture the unique vocal characteristics of atypical speakers, largely due to insufficient data. To address this issue, we present a novel data augmentation method for atypical speakers by finetuning a TTS model, called Aty-TTS. Aty-TTS models speaker and atypical characteristics via knowledge transfer from a voice conversion model. Then, we use the augmented data to train SLU models adapted to atypical speech. To train these data augmentation models and evaluate the resulting SLU systems, we have collected a new atypical speech dataset containing intent annotation. Both objective and subjective assessments validate that Aty-TTS is capable of generating high-quality atypical speech. Furthermore, it serves as an effective data augmentation strategy, contributing to fairer SLU systems that can better accommodate individuals with atypical speech patterns. Comment: Accepted at SyntheticData4ML 2023 Ora

    Atypical Speech Development and Working Memory Recall Ability

    Background: Alan Baddeley and Graham Hitch introduced working memory processes as a three-component model including the visuo-spatial sketch pad, the central executive, and the phonological loop. The phonological loop has been theorized to play a role in speech and language acquisition and production. Aims: This study aimed to explore the potential relationship between school-aged children with atypically developing speech and their working memory ability as compared to their typically developing age-matched peers. Methods: Participants aged 5;0-5;11 and 8;0-8;11 were separated into two groups based on articulation test scores as well as any documented developmental challenges. The participants completed standardized testing as well as two experimental cognitive working memory tasks. Results: There was no significant difference in the 5;0-5;11 age group in their cognitive working memory task score average. However, there was a significant difference in the 8;0-8;11 age group’s cognitive working memory task score average. Discussion: There is potentially a connection between phonological ability and the phonological sub vocal rehearsal system that is reflected by the scores in the control group and the atypically developing group. The phonological sub vocal rehearsal system houses the sub system for orthographic information processing. This model would align appropriately with the notion that the participants who do not have speech within functional limits would also perform more poorly on working memory tasks that do not allow them to rely on orthographic information

    The Application of Echo State Networks to Atypical Speech Recognition

    Automatic speech recognition (ASR) techniques have improved extensively over the past few years with the rise of new deep learning architectures. Recent sequence-to-sequence models have been shown to have high accuracy by utilizing the attention mechanism, which evaluates and learns the magnitude of element relationships in sequences. Despite being highly accurate, commercial ASR models have a weakness when it comes to accessibility. Current commercial deep learning ASR models have difficulty evaluating and transcribing speech for individuals with unique vocal features, such as those with dysarthria or heavy accents, as well as deaf and hard-of-hearing individuals. Current methodologies for processing vocal data revolve around convolutional feature extraction layers, dulling the sequential nature of the data. Alternatively, reservoir computing has gained popularity for the ability to translate input data to changing network states, which preserves the overall feature complexity of the input. Echo state networks (ESN), a type of reservoir computing mechanism employing a random recurrent neural network, have shown promise in a number of time series classification tasks. This work explores the integration of ESNs into deep learning ASR models. The Listen, Attend and Spell, and Transformer models were utilized as a baseline. A novel approach that used the echo state network as a feature extractor was explored and evaluated using the two models as baseline architectures. The models were trained on 960 hours of LibriSpeech audio data and tuned on various atypical speech data, including the Torgo dysarthric speech dataset and University of Memphis SPAL dataset. The ESN-based Echo, Listen, Attend, and Spell model produced more accurate transcriptions when evaluating on the LibriSpeech test set compared to the ESN-based Transformer. The baseline transformer model achieved a 43.4% word error rate on the Torgo test set after full network tuning.
    A prototype ASR system was developed to utilize both the developed model as well as commercial smart assistant language models. The system operates on a Raspberry Pi 4 using the Assistant Relay framework
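
The abstract above describes using an echo state network as a feature extractor: a fixed, random recurrent network maps each input frame to an evolving reservoir state, and those states are passed on to a trained model. The following is only a minimal sketch of that general idea, not the thesis's implementation. The leaky-integrator tanh update, the row-sum scaling, and all names (`make_reservoir`, `reservoir_states`, `leak`, `scale`) are assumptions chosen for illustration.

```python
import math
import random


def make_reservoir(n_inputs, n_units, scale=0.9, seed=0):
    """Build fixed random input and recurrent weights (they are never trained)."""
    rng = random.Random(seed)
    w_in = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_units)]
    w = [[rng.uniform(-1, 1) for _ in range(n_units)] for _ in range(n_units)]
    # Assumed stability heuristic: rescale so the largest absolute row sum
    # equals `scale` (< 1), which also bounds the spectral radius by `scale`.
    max_row_sum = max(sum(abs(v) for v in row) for row in w)
    w = [[v * scale / max_row_sum for v in row] for row in w]
    return w_in, w


def reservoir_states(frames, w_in, w, leak=0.3):
    """Run a sequence of feature frames through the reservoir.

    Returns one state vector per input frame; a downstream model
    (hypothetically, an attention-based ASR encoder) would consume these.
    """
    state = [0.0] * len(w)
    states = []
    for frame in frames:
        pre = [
            sum(wi * x for wi, x in zip(w_in[i], frame))
            + sum(wr * s for wr, s in zip(w[i], state))
            for i in range(len(w))
        ]
        # Leaky integration: blend the previous state with the new activation.
        state = [(1 - leak) * s + leak * math.tanh(p) for s, p in zip(state, pre)]
        states.append(state)
    return states


# Toy usage: five 3-dimensional frames, an 8-unit reservoir.
toy_frames = [[0.1, 0.2, 0.3]] * 5
toy_w_in, toy_w = make_reservoir(n_inputs=3, n_units=8)
toy_states = reservoir_states(toy_frames, toy_w_in, toy_w)
```

Because only the downstream model is trained, the reservoir adds sequential memory without adding trainable parameters, which is the usual motivation for reservoir computing.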

    Hemispheric speech lateralisation in the developing brain is related to motor praxis ability

    Commonly displayed functional asymmetries such as hand dominance and hemispheric speech lateralisation are well researched in adults. However, there is debate about when such functions become lateralised in the typically developing brain. This study examined whether patterns of speech laterality and hand dominance were related and whether they varied with age in typically developing children. 148 children aged 3-10 years performed an electronic pegboard task to determine hand dominance; a subset of 38 of these children also underwent functional Transcranial Doppler (fTCD) imaging to derive a lateralisation index (LI) for hemispheric activation during speech production using an animation description paradigm. There was no main effect of age in the speech laterality scores; however, younger children showed a greater difference in performance between their hands on the motor task. Furthermore, this between-hand performance difference significantly interacted with direction of speech laterality, with a smaller between-hand difference relating to increased left hemisphere activation. These data show that both handedness and speech lateralisation appear relatively determined by age 3, but that atypical cerebral lateralisation is linked to greater performance differences in hand skill, irrespective of age. Results are discussed in terms of the common neural systems underpinning handedness and speech lateralisation

    Latent Phrase Matching for Dysarthric Speech

    Many consumer speech recognition systems are not tuned for people with speech disabilities, resulting in poor recognition and user experience, especially for severe speech differences. Recent studies have emphasized interest in personalized speech models from people with atypical speech patterns. We propose a query-by-example-based personalized phrase recognition system that is trained using small amounts of speech, is language agnostic, does not assume a traditional pronunciation lexicon, and generalizes well across speech difference severities. On an internal dataset collected from 32 people with dysarthria, this approach works regardless of severity and shows a 60% improvement in recall relative to a commercial speech recognition system. On the public EasyCall dataset of dysarthric speech, our approach improves accuracy by 30.5%. Performance degrades as the number of phrases increases, but consistently outperforms ASR systems when trained with 50 unique phrases

    Nonword repetition and phonological awareness skills in preschoolers with and without speech sound disorders

    The aim of the current study was to investigate the relationships between phonological awareness (PA) skills, types of speech sound errors, and nonword repetition skills. Ten preschoolers with typically developing speech (TD) and ten preschoolers with speech sound disorder (SSD), aged 4;0 (years; months) to 6;6 participated in the study. Eligible participants did not present with neurological, cognitive, or developmental disabilities such as cleft palate or autism spectrum disorder. We calculated the correlation between PA skills and nonword repetition performance of the children. In addition, a regression model was used to evaluate the degree to which phonological awareness skills could be predicted by the types of speech errors produced by the participants (typical speech errors, atypical speech errors, and distortions). Nonword repetition was significantly correlated with performance on the PA test, such that in general, participants who obtained poorer nonword repetition scores were found to have poorer PA skills. With regard to error types and PA skills, atypical errors predicted 12.5% of the variance in PA skills among TD participants. However, in children with SSD, atypical errors did not contribute significant and unique variance to PA skills after controlling for age and nonverbal IQ. These data suggest that PA skills cannot be inferred solely through the use of other measurements such as the SRT or speech sound errors produced

    Electrolaryngeal Speech Intelligibility Enhancement Through Robust Linguistic Encoders

    We propose a novel framework for electrolaryngeal speech intelligibility enhancement through the use of robust linguistic encoders. Pretraining and fine-tuning approaches have proven to work well in this task, but in most cases, various mismatches, such as the speech type mismatch (electrolaryngeal vs. typical) or a speaker mismatch between the datasets used in each stage, can deteriorate the conversion performance of this framework. To resolve this issue, we propose a linguistic encoder robust enough to project both EL and typical speech in the same latent space, while still being able to extract accurate linguistic information, creating a unified representation to reduce the speech type mismatch. Furthermore, we introduce HuBERT output features to the proposed framework for reducing the speaker mismatch, making it possible to effectively use a large-scale parallel dataset during pretraining. We show that compared to the conventional framework using mel-spectrogram input and output features, using the proposed framework enables the model to synthesize more intelligible and natural-sounding speech, as shown by a significant 16% improvement in character error rate and 0.83 improvement in naturalness score. Comment: Accepted to ICASSP 2024. Demo page: lesterphillip.github.io/icassp2024_el_si

    Adaptation and validation of intelligibility in context scale as a screening tool for Hong Kong preschoolers

    Intelligibility in Context Scale (ICS) is a parent report questionnaire developed based on the Environmental and Personal domain of the International Classification of Functioning, Disability and Health: Children and Youth Version (ICF-CY) (World Health Organization, 2007) for assessing children's speech intelligibility (McLeod, Harrison, & McCormack, 2012). This study aimed to adapt ICS into Chinese, namely, ICS-C, and examine the psychometric properties of the adapted version when applied to Cantonese-speaking children. A secondary objective was to identify speech measures ICS-C was sensitive to. A total of 72 Cantonese-speaking preschoolers with (N = 39) and without speech sound disorders (SSD) (N = 33) were recruited. Native Cantonese-speaking parents completed ICS-C independently. Results demonstrated good internal consistency and test-retest reliability of ICS-C. Correlations with speech performance, and significant difference in ICS-C mean scores between the two groups supported validity of ICS-C. The optimal cutoff was estimated using Receiver Operating Characteristic (ROC) curve analysis, giving a sensitivity of .70 and specificity of .59. ICS-C mean scores showed positive correlation with PICC and negative correlation with frequency of atypical errors, both moderate in strength. Given the satisfactory psychometric properties of ICS-C, it can be a valuable clinical tool for screening of SSD in preschoolers. Published or final version. Speech and Hearing Sciences. Bachelor of Science in Speech and Hearing Science
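
The abstract above reports a screening cutoff chosen by ROC analysis, with its sensitivity and specificity. As a rough illustration of what that involves, the sketch below computes both quantities for a given cutoff and then scans candidate cutoffs. It rests on several assumptions not stated in the abstract: that lower scores indicate lower intelligibility (so a child is flagged when the score is at or below the cutoff), and that the "optimal" cutoff maximises Youden's J (sensitivity + specificity - 1). The function names and the toy scores are invented for illustration and are not data from the study.

```python
def sensitivity_specificity(scores, has_condition, cutoff):
    """Return (sensitivity, specificity) when flagging scores <= cutoff."""
    tp = fn = tn = fp = 0
    for score, positive in zip(scores, has_condition):
        flagged = score <= cutoff
        if positive and flagged:
            tp += 1  # child with the condition, correctly flagged
        elif positive:
            fn += 1  # child with the condition, missed
        elif flagged:
            fp += 1  # child without the condition, wrongly flagged
        else:
            tn += 1  # child without the condition, correctly passed
    return tp / (tp + fn), tn / (tn + fp)


def best_cutoff(scores, has_condition):
    """Pick the observed score that maximises Youden's J (an assumed criterion)."""
    def youden_j(cutoff):
        sens, spec = sensitivity_specificity(scores, has_condition, cutoff)
        return sens + spec - 1
    return max(sorted(set(scores)), key=youden_j)


# Invented toy data: mean questionnaire scores and whether each child has SSD.
toy_scores = [3.0, 3.5, 4.0, 4.2, 4.5, 4.8, 5.0, 5.0]
toy_has_ssd = [True, True, False, True, True, False, False, False]

toy_cutoff = best_cutoff(toy_scores, toy_has_ssd)
toy_sens, toy_spec = sensitivity_specificity(toy_scores, toy_has_ssd, toy_cutoff)
```

Other criteria for choosing a cutoff exist (for example, fixing a minimum sensitivity for a screening tool), so the choice of Youden's J here is only one plausible option.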