
    Automatic Screening of Childhood Speech Sound Disorders and Detection of Associated Pronunciation Errors

    Speech disorders in children can affect their fluency and intelligibility. Delay in their diagnosis and treatment increases the risk of social impairment and learning disabilities. With the significant shortage of Speech and Language Pathologists (SLPs), there is increasing interest in Computer-Aided Speech Therapy tools with automatic detection and diagnosis capabilities. However, the scarcity and unreliable annotation of disordered child speech corpora, together with the high acoustic variation in child speech, have impeded the development of reliable automatic detection and diagnosis of childhood speech sound disorders. This thesis therefore investigates two types of detection systems that can be achieved with minimal dependence on annotated mispronounced speech data. First, a novel approach was proposed that adopts paralinguistic features representing the prosodic, spectral, and voice-quality characteristics of speech to perform segment- and subject-level classification of Typically Developing (TD) and Speech Sound Disordered (SSD) child speech using a binary Support Vector Machine (SVM) classifier. As paralinguistic features are both language- and content-independent, they can be extracted from an unannotated speech signal. Second, a novel Mispronunciation Detection and Diagnosis (MDD) approach was introduced to detect the pronunciation errors caused by SSDs and provide low-level diagnostic information that can be used to construct formative feedback and a detailed diagnostic report. Unlike existing MDD methods, where detection and diagnosis are performed at the phoneme level, the proposed method achieves MDD at the speech attribute level, namely the manners and places of articulation. The speech attribute features describe the articulators involved in making a speech sound and their interactions, allowing a low-level description of the pronunciation error to be provided. Two novel methods to model speech attributes are further proposed in this thesis: a frame-based (phoneme-alignment) method leveraging the Multi-Task Learning (MTL) criterion and training a separate model for each attribute, and an alignment-free, jointly learnt method based on the Connectionist Temporal Classification (CTC) sequence-to-sequence criterion. The proposed techniques have been evaluated using standard and publicly accessible adult and child speech corpora, while the MDD method has been validated using L2 speech corpora.
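
    A minimal sketch of the first approach may help make it concrete: a binary SVM trained on precomputed paralinguistic feature vectors to separate TD from SSD speech segments. This is an illustration under stated assumptions, not the thesis's implementation; the feature matrix is assumed to have been extracted beforehand (e.g. with a paralinguistic toolkit such as openSMILE), and the arrays below are placeholders.

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder inputs: one row per speech segment. In practice the features
# would be prosodic, spectral, and voice-quality descriptors extracted from
# the (unannotated) audio, e.g. with a toolkit such as openSMILE.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 88))   # segment x feature matrix
labels = rng.integers(0, 2, size=200)   # 0 = TD, 1 = SSD

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=0)

# Standardise the features, then fit an RBF-kernel binary SVM.
classifier = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
classifier.fit(X_train, y_train)

print(classification_report(y_test, classifier.predict(X_test),
                            target_names=["TD", "SSD"]))
```

    Subject-level decisions could then be obtained by aggregating segment-level predictions, for example by majority vote over a speaker's segments.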

    Tonal Alignment and Segmental Timing in English-speaking Children

    Tonal alignment has been shown to be sensitive to segmental timing. This suggests that development of the former may be influenced by the latter. The developmental literature reports that English-speaking children do not attain adult-like competence in segmental timing until after age 6. While this suggests that the ability for alignment may be mastered after this age, the possibility remains speculative due to a paucity of data. Accordingly, the present study sought to determine whether 7- and 8-year-old English-speaking children exhibit adult-like alignment and segmental timing in their speech. Seven children (ages 7 and 8) and ten adults (ages 19 to 24) repeated pre-recorded sentences, and their productions were analyzed acoustically. The children showed adult-like performance on three out of four measures of alignment and performed comparably with adults on all measures of segmental timing. These results suggest that English-speaking children's ability for alignment may reach adult levels after mastery of segmental timing.
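
    Tonal alignment is typically quantified as the time of an F0 turning point relative to a segmental landmark. The sketch below computes one such measure, the delay of the F0 peak from the onset of the accented vowel; it is an illustrative sketch rather than the study's analysis procedure, and the function and variable names are hypothetical.

```python
import numpy as np

def peak_alignment(f0_times, f0_values, vowel_onset):
    """Delay (seconds) of the F0 peak relative to the accented vowel onset.

    f0_times / f0_values: sampled pitch track (unvoiced frames removed);
    vowel_onset: accented vowel onset time from a segmental annotation.
    A positive value means the peak falls after the vowel onset.
    """
    f0_times = np.asarray(f0_times)
    f0_values = np.asarray(f0_values)
    peak_time = f0_times[np.argmax(f0_values)]
    return peak_time - vowel_onset

# Illustrative call with made-up values (times in seconds):
times = np.linspace(0.0, 0.5, 51)
f0 = 180 + 40 * np.exp(-((times - 0.22) ** 2) / 0.005)  # peak near 0.22 s
print(peak_alignment(times, f0, vowel_onset=0.15))       # ~0.07 s
```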

    ULTRAX2020: Ultrasound Technology for Optimising the Treatment of Speech Disorders: Clinicians' Resource Manual

    Ultrasound Visual Biofeedback (U-VBF) uses medical ultrasound to image the tongue in real time during speech. Clinicians can use this information both to assess speech disorders and as a biofeedback tool to guide children in producing correct speech. Ultrasound images of the tongue are thought to be relatively intuitive to interpret; however, there is no easy way of using ultrasound to diagnose speech disorders, despite its potential to identify imperceptible errors that are diagnostically important. This manual describes how to use ultrasound for the assessment and treatment of speech sound disorders in children. It is designed to be used in combination with Articulate Instruments Ltd.'s Sonospeech software by clinical partners of the Ultrax2020 project. However, the basic principles and resources contained within this document will be of use to anyone interested in using ultrasound in the speech therapy clinic.

    The Status of Coronals in Standard American English: An Optimality-Theoretic Account

    Coronals are very special sound segments. There is abundant evidence from various fields of phonetics that clearly establishes coronals as a class of consonants appropriate for phonological analysis. The set of coronals is stable across varieties of English, unlike other consonant types, e.g. labials and dorsals, which are subject to a greater or lesser degree of variation. Coronals exhibit stability in inventories cross-linguistically, yet they simultaneously display flexibility in alternations, i.e. assimilation, deletion, epenthesis, and dissimilation, when this is required by the contradictory forces of perception and production. The two main, opposing types of alternation that coronals in Standard American English (SAE) participate in are examined: weakening phenomena, i.e. assimilation and deletion, and strengthening phenomena, i.e. epenthesis and dissimilation. Coronals are notorious for their contradictory behavior, especially in alternations. This behavior can be accounted for within a phonetically grounded Optimality-Theoretic (OT) framework that unites the phonetic and phonological aspects of alternations. The various sets of inherently conflicting FAITHFULNESS and MARKEDNESS constraints needed for an OT analysis of SAE alternations are introduced.
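
    For readers unfamiliar with the formalism, the schematic tableau below illustrates how a ranked MARKEDNESS constraint can force a coronal to assimilate (a weakening alternation), at the cost of violating a lower-ranked FAITHFULNESS constraint. The input, candidates, and ranking are a standard textbook illustration, not taken from the thesis.

```latex
% Schematic OT tableau: coronal nasal place assimilation in "ten pens".
% Ranking: Agree(place) >> Ident(place); "*!" marks a fatal violation,
% and the arrow marks the winning candidate.
\begin{tabular}{|l|c|c|}
\hline
/ten penz/                  & \textsc{Agree}(place) & \textsc{Ident}(place) \\
\hline
a. [ten penz]               & *!                    &                       \\
\hline
b. $\rightarrow$ [tem penz] &                       & *                     \\
\hline
\end{tabular}
```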

    Exploring Speech Technologies for Language Learning

    The teaching of the pronunciation of any foreign language must encompass both segmental and suprasegmental aspects of speech. In computational terms, these two levels of language-learning activity can be decomposed at least into phonemic aspects, which include the correct pronunciation of single phonemes and the co-articulation of phonemes into higher phonological units, and prosodic aspects, which include:

    - the correct position of stress at word level;
    - the alternation of stressed and unstressed syllables in terms of compensation and vowel reduction;
    - the correct position of sentence accent;
    - the generation of adequate rhythm from the interleaving of stress, accent, and phonological rules;
    - the generation of an adequate intonational pattern for each utterance, related to its communicative functions.

    As appears from the above, for a student to communicate intelligibly and as close as possible to native-speaker pronunciation, prosody is very important [3]. We also assume that incorrect prosody may prevent communication from taking place, which can be regarded as a strong motivation for making the teaching of prosody an integral part of any language course. From our point of view, it is much more important to stress the achievement of successful communication as the main objective of a second-language learner than the overcoming of what has been termed "foreign accent", which can be deemed a secondary goal. In any case, the two goals are certainly not coincident, even though they may overlap in some cases. We will discuss these matters in the following sections; all prosodic questions related to "rhythm" will be discussed in the first section of this chapter.

    In [4] the author argues in favour of prosodic aids, in particular because a wrong placement of word stress may impair the listener's understanding of the word being pronounced. He also argues in favour of acquiring correct timing of phonological units to overcome the impression of "foreign accent" which may ensue from an incorrect distribution of stressed vs. unstressed stretches of linguistic units such as syllables or metric feet. Timing is not to be confused with speaking rate, which need not be increased forcefully to give an impression of good fluency: trying to increase speaking rate may result in lower intelligibility. The question of "foreign accent" is also discussed at length in (Jilka, 1999). This work is particularly relevant to the intonational features of second-language learners, which we will address in the second section of this chapter. Correcting the Intonational Foreign Accent (henceforth IFA) is an important component of a Prosodic Module for self-learning activities, as the categorical aspects of the intonation of the two languages in contact, L1 and L2, are far apart and thus neatly distinguishable. The choice of the two languages in contact is determined mainly by the fact that the prosodic distance between English and Italian is maximal, according to (Ramus and Mehler, 1999; Ramus et al., 1999).
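
    Since the prosodic distance between English and Italian is grounded here in the rhythm metrics of Ramus and colleagues, a short sketch of two of those metrics may be useful: %V, the proportion of utterance duration that is vocalic, and ΔC, the standard deviation of consonantal interval durations. The interval durations below are placeholder values, not data from the chapter.

```python
import numpy as np

def rhythm_metrics(vocalic, consonantal):
    """%V and delta-C rhythm metrics (Ramus et al., 1999).

    vocalic / consonantal: durations (seconds) of the vocalic and
    consonantal intervals of an utterance, from a segmental annotation.
    """
    vocalic = np.asarray(vocalic)
    consonantal = np.asarray(consonantal)
    percent_v = 100 * vocalic.sum() / (vocalic.sum() + consonantal.sum())
    delta_c = consonantal.std()   # spread of consonantal interval durations
    return percent_v, delta_c

# Placeholder interval durations for a single utterance:
pv, dc = rhythm_metrics(vocalic=[0.08, 0.12, 0.10, 0.09],
                        consonantal=[0.06, 0.11, 0.05, 0.09, 0.07])
print(f"%V = {pv:.1f}, deltaC = {dc * 1000:.1f} ms")
```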

    Visualising articulation: real-time ultrasound visual biofeedback and visual articulatory models and their use in treating speech sound disorders associated with submucous cleft palate

    Background: Ultrasound Tongue Imaging (UTI) is growing increasingly popular for assessing and treating Speech Sound Disorders (SSDs) and has more recently been used to qualitatively investigate compensatory articulations in speakers with cleft palate (CP). However, its therapeutic application for speakers with CP remains to be tested. A different set of developments, Visual Articulatory Models (VAMs), provide an offline dynamic model with context for lingual patterns; unlike UTI, however, they do not provide real-time biofeedback. Commercially available VAMs, such as Speech Trainer 3D, are available on iDevices, yet their clinical application also remains to be tested. Aims: This thesis aims to test the diagnostic use of ultrasound and to investigate the effectiveness of both UTI and VAMs for the treatment of SSDs associated with submucous cleft palate (SMCP). Method: Using a single-subject multiple-baseline design, two males with repaired SMCP, Andrew (aged 9;2) and Craig (aged 6;2), received six assessment sessions and two blocks of therapy following a motor-based therapy approach, using VAMs and UTI. Three methods were used to measure therapy outcomes. Firstly, percent target consonants correct scores, derived from phonetic transcriptions, provide outcomes comparable to those used in typical practice. Secondly, a perceptual evaluation by multiple phonetically trained listeners, using a two-alternative multiple forced-choice design to measure listener agreement, provides a more objective measure. Thirdly, articulatory analysis, using qualitative and quantitative measures, provides an additional perspective able to reveal covert errors. Results and Conclusions: There was overall improvement in the speech of both speakers, with a greater rate of change in therapy block one (VAMs), and with listener agreement in the perceptual evaluation. Articulatory analysis supplemented the phonetic transcriptions, detecting covert articulations and covert contrast as well as supporting the improvements in auditory outcome scores. Both VAMs and UTI show promise as clinical tools for the treatment of SSDs associated with CP.
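
    The first outcome measure is a percent target consonants correct score. A minimal sketch of how such a score might be computed, assuming target and produced consonants have already been aligned position by position from the phonetic transcriptions (the function and inputs are illustrative, not the thesis's scoring code):

```python
def percent_target_consonants_correct(targets, productions):
    """Percentage of target consonants realised correctly.

    targets / productions: aligned lists of target consonants and the
    consonants actually produced (from phonetic transcription), with one
    entry per target position.
    """
    if len(targets) != len(productions):
        raise ValueError("transcriptions must be aligned position by position")
    correct = sum(t == p for t, p in zip(targets, productions))
    return 100 * correct / len(targets)

# Illustrative call: target /s/ realised as [t] in the second position.
print(percent_target_consonants_correct(["k", "s", "t"], ["k", "t", "t"]))
```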

    Vowel acquisition in a multidialectal environment: A five-year longitudinal case study

    What happens when a child is exposed to multiple phonological systems while acquiring language? How do they resolve contradictory patterns in the accents around them in their own developing speech production? Do they acquire the accent of the local community, their parents' accent, or something in between? This thesis examines the acquisition of a subset of vowels in a child growing up in a multidialectal environment. The child's realisations of vowels in the lexical sets STRUT, FOOT, START, PALM, and BATH are analysed between the ages of 2;01 and 6;11. Previous research has shown that while a child's accent is usually heavily influenced by their peers, having parents from outside the local area can prevent complete acquisition of the local accent. Local cultural values, whether or not a parent's accent has more prestigious elements than the local one, a child's personality, and the complexity of the relationship between the home and local phonological systems have all been implicated in whether or not a child fully acquires a local accent. In the child studied here, a shift from the vowels used at home to local variants always happened, in the first instance, at the level of the articulatory feature rather than at the phonemic level, and vowels belonging to different lexical sets were acquired at different rates. This thesis demonstrates that acquisition of these vowels takes many years, as combinations of articulatory features stabilise. Moreover, even once a local variant has apparently been acquired, the variety of language spoken at home can leave a phonetic legacy in a child's accent. Naturalistic data collection, combined with impressionistic and acoustic analysis over a long and sustained collection period, reveals patterns in this child's phonological acquisition in a level of detail not seen in any previous research.
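
    Acoustic analysis of vowel quality typically rests on the first two formants. The sketch below shows one way realisations might be tracked by lexical set over time, assuming a table of formant measurements already exists; the column names and values are hypothetical, not data from the thesis.

```python
import pandas as pd

# Hypothetical measurements: one row per vowel token, with age recoded
# from years;months to decimal years, and F1/F2 in Hz.
tokens = pd.DataFrame({
    "lexical_set": ["STRUT", "STRUT", "FOOT", "BATH", "BATH"],
    "age_years":   [2.1, 4.5, 4.5, 2.1, 6.9],
    "f1_hz":       [780, 700, 460, 850, 800],
    "f2_hz":       [1350, 1250, 1100, 1500, 1300],
})

# Mean formants per lexical set and age: movement of these means over
# time would indicate a shift between home and local variants.
summary = tokens.groupby(["lexical_set", "age_years"])[["f1_hz", "f2_hz"]].mean()
print(summary)
```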

    Apraxia World: Deploying a Mobile Game and Automatic Speech Recognition for Independent Child Speech Therapy

    Children with speech sound disorders typically improve pronunciation quality by undergoing speech therapy, which must be delivered frequently and with high intensity to be effective. As such, clinic sessions are supplemented with home practice, often under caregiver supervision. However, traditional home practice can grow boring for children due to monotony, and practice frequency is limited by caregiver availability, making it difficult for some children to reach therapy dosage. To address these issues, this dissertation presents a novel speech therapy game to increase engagement, and explores automatic pronunciation evaluation techniques to afford children independent practice.

    The therapy game, called Apraxia World, delivers customizable, repetition-based speech therapy while children play through platformer-style levels using typical on-screen tablet controls; children complete in-game speech exercises to collect assets required to progress through the levels. Additionally, Apraxia World provides pronunciation feedback according to an automated pronunciation evaluation system running locally on the tablet. Apraxia World offers two advantages over current commercial and research speech therapy games: first, the game provides extended gameplay to support long therapy treatments; second, it affords some practice independence via automatic pronunciation evaluation, allowing caregivers to lightly supervise instead of directly administer the practice. Pilot testing indicated that children enjoyed the game-based therapy much more than traditional practice and that the exercises did not interfere with gameplay. During a longitudinal study, children made clinically significant pronunciation improvements while playing Apraxia World at home. Furthermore, children remained engaged in the game-based therapy over the two-month testing period, and some even wanted to continue playing post-study.

    The second part of the dissertation explores word- and phoneme-level pronunciation verification for child speech therapy applications. Word-level pronunciation verification is accomplished using a child-specific template-matching framework, in which an utterance is compared against correctly and incorrectly pronounced examples of the word. This framework identified mispronounced words better than both a standard automated baseline and co-located caregivers. Phoneme-level mispronunciation detection is investigated using a technique from the second-language learning literature: training phoneme-specific classifiers with phonetic posterior features. This method also outperformed the standard baseline and, more significantly, identified mispronunciations better than student clinicians.
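
    Template matching for word-level verification is commonly scored with dynamic time warping (DTW) over acoustic features. The sketch below compares an utterance's MFCCs against correct and incorrect templates and classifies by the nearest one; it is a generic illustration of the template-matching idea, not the dissertation's exact system, and the file paths are hypothetical.

```python
import librosa

def mfcc(path, sr=16000, n_mfcc=13):
    """MFCC matrix (n_mfcc x frames) for one recording."""
    audio, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)

def dtw_cost(a, b):
    """Total DTW alignment cost between two MFCC matrices."""
    acc_cost, _ = librosa.sequence.dtw(X=a, Y=b, metric="euclidean")
    return acc_cost[-1, -1]

def verify(utterance_path, correct_paths, incorrect_paths):
    """True if the utterance lies closer to a correct template than to
    any incorrect one."""
    utt = mfcc(utterance_path)
    best_correct = min(dtw_cost(utt, mfcc(p)) for p in correct_paths)
    best_incorrect = min(dtw_cost(utt, mfcc(p)) for p in incorrect_paths)
    return best_correct < best_incorrect

# Illustrative call with hypothetical recordings of the word "rabbit":
# verify("child_rabbit.wav", ["rabbit_ok_1.wav"], ["rabbit_err_1.wav"])
```

    One refinement worth noting: raw DTW costs grow with utterance length, so in practice the cost is often normalised by the length of the warping path before templates are compared.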

    Personalising synthetic voices for individuals with severe speech impairment.

    Speech technology can help individuals with speech disorders to interact more easily. Many individuals with severe speech impairment, due to conditions such as Parkinson's disease or motor neurone disease, use voice output communication aids (VOCAs), which have synthesised or pre-recorded voice output. This voice output effectively becomes the voice of the individual and should therefore represent the user accurately. Currently available speech-synthesis personalisation techniques require a large amount of input data, which is difficult for individuals with severe speech impairment to produce. These techniques also do not provide a solution for individuals whose voices have already begun to show the effects of dysarthria. This thesis shows that Hidden Markov Model (HMM)-based speech synthesis is a promising approach to 'voice banking' for individuals both before their condition causes deterioration of speech and once deterioration has begun. Data input requirements for building personalised voices with this technique are investigated using human listener judgements. The results show that 100 sentences is the minimum required to build a voice that differs significantly from an average voice model and shows some resemblance to the target speaker, although this amount depends on the speaker and the average model used. A neural network trained on extracted acoustic features revealed that spectral features had the most influence in predicting human listener judgements of the similarity of synthesised speech to a target speaker; prediction accuracy improves significantly if other acoustic features are introduced and combined non-linearly. These results informed the reconstruction of personalised synthetic voices for speakers whose voices had begun to show the effects of their conditions. Using HMM-based synthesis, personalised synthetic voices were built from dysarthric speech that showed similarity to the target speakers without recreating the impairment in the synthesised output.
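
    The sketch below illustrates the kind of analysis described: a small neural network predicting listener similarity judgements from acoustic features, compared against a linear baseline to show the benefit of combining features non-linearly. The data are synthetic placeholders and the model is not the thesis's network.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Placeholder data: rows are synthesised-voice stimuli; columns are
# acoustic features (spectral, F0, duration, ...); y stands in for the
# mean listener similarity judgement per stimulus, here constructed with
# a non-linear feature interaction.
X = rng.normal(size=(120, 10))
y = np.tanh(X[:, 0] * X[:, 1]) + 0.1 * rng.normal(size=120)

linear = LinearRegression()
mlp = MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000, random_state=0)

# If features interact non-linearly, the MLP should predict the
# judgements better than the linear baseline (R^2 via cross-validation).
print("linear R^2:", cross_val_score(linear, X, y, cv=5).mean())
print("MLP R^2:   ", cross_val_score(mlp, X, y, cv=5).mean())
```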