421 research outputs found
Automatic Screening of Childhood Speech Sound Disorders and Detection of Associated Pronunciation Errors
Speech disorders in children can affect their fluency and intelligibility. Delay in their diagnosis and treatment increases the risk of social impairment and learning disabilities. With the significant shortage of Speech and Language Pathologists (SLPs), there is an increasing interest in Computer-Aided Speech Therapy tools with automatic detection and diagnosis capability.
However, the scarcity and unreliable annotation of disordered child speech corpora, along with the high acoustic variation in child speech data, have impeded the development of reliable automatic detection and diagnosis of childhood speech sound disorders. This thesis therefore investigates two types of detection systems that can be achieved with minimal dependency on annotated mispronounced speech data.
First, a novel approach that adopts paralinguistic features which represent the prosodic, spectral, and voice quality characteristics of the speech was proposed to perform segment- and subject-level classification of Typically Developing (TD) and Speech Sound Disordered (SSD) child speech using a binary Support Vector Machine (SVM) classifier. As paralinguistic features are both language- and content-independent, they can be extracted from an unannotated speech signal.
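The segment-to-subject classification pipeline described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the feature extraction step is assumed to have happened elsewhere, and random vectors stand in for real paralinguistic descriptors (e.g., an openSMILE-style feature set); all names and dimensions here are hypothetical.

```python
# Sketch of binary TD/SSD classification from paralinguistic feature
# vectors (prosodic, spectral, and voice-quality descriptors).
# Random vectors stand in for real extracted features.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_feats = 88                                  # e.g., an eGeMAPS-sized set
X_td = rng.normal(0.0, 1.0, (40, n_feats))    # typically developing
X_ssd = rng.normal(0.5, 1.0, (40, n_feats))   # speech sound disordered
X = np.vstack([X_td, X_ssd])
y = np.array([0] * 40 + [1] * 40)             # 0 = TD, 1 = SSD

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X, y)

# Segment-level decisions can be pooled into a subject-level decision,
# e.g., by majority vote over one subject's segments.
segment_preds = clf.predict(X_ssd[:5])
subject_pred = int(segment_preds.mean() >= 0.5)
```

The pooling step is one plausible way to move from segment-level to subject-level classification; other aggregation rules (e.g., averaging decision scores) would fit the same framework.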
Second, a novel Mispronunciation Detection and Diagnosis (MDD) approach was introduced to detect the pronunciation errors made due to SSDs and provide low-level diagnostic information that can be used in constructing formative feedback and a detailed diagnostic report. Unlike existing MDD methods, where detection and diagnosis are performed at the phoneme level, the proposed method achieves MDD at the speech attribute level, namely the manners and places of articulation. Speech attribute features describe the involved articulators and their interactions when making a speech sound, allowing a low-level description of the pronunciation error to be provided. Two novel methods to model speech attributes are further proposed in this thesis: a frame-based (phoneme-alignment) method that leverages the Multi-Task Learning (MTL) criterion and trains a separate model for each attribute, and an alignment-free, jointly learnt method based on the Connectionist Temporal Classification (CTC) sequence-to-sequence criterion.
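To make the alignment-free idea concrete: a CTC-trained attribute recogniser emits one label per frame over the attribute inventory plus a blank symbol, and the output sequence is recovered without any phoneme alignment by collapsing repeats and dropping blanks. The sketch below shows only this greedy decoding step; the network producing the per-frame labels is assumed, and the attribute labels are illustrative.

```python
# Minimal sketch of greedy CTC decoding for a speech-attribute model:
# collapse runs of identical frame labels, then remove blank symbols,
# yielding an attribute sequence with no explicit time alignment.
BLANK = "<b>"

def ctc_greedy_collapse(frame_labels):
    """Collapse repeated frame labels, then drop blanks."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != BLANK:
            out.append(lab)
        prev = lab
    return out

# Per-frame argmax output of a hypothetical manner-of-articulation model:
frames = [BLANK, "stop", "stop", BLANK, "fricative", "fricative",
          "fricative", BLANK, BLANK, "nasal"]
print(ctc_greedy_collapse(frames))   # → ['stop', 'fricative', 'nasal']
```

In practice a beam-search decoder over the full posterior lattice would replace the argmax-plus-collapse shown here, but the blank-and-collapse mechanics are the same.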
The proposed techniques have been evaluated using standard, publicly accessible adult and child speech corpora, and the MDD method has been validated using L2 speech corpora.
Tonal Alignment and Segmental Timing in English-speaking Children
Tonal alignment has been shown to be sensitive to segmental timing, suggesting that development of the former may be influenced by the latter. The developmental literature reports that English-speaking children do not attain adult-like competence in segmental timing until after age 6. While this suggests that the ability for alignment may be mastered after this age, the possibility remains speculative due to a paucity of data. Accordingly, the present study sought to determine whether 7- and 8-year-old English-speaking children exhibit adult-like alignment and segmental timing in their speech. Seven children (ages 7 and 8) and 10 adults (ages 19 to 24) repeated pre-recorded sentences, and their productions were analyzed acoustically. The children showed adult-like performance on three out of four measures of alignment and performed comparably with adults on all measures of segmental timing. These results suggest that English-speaking children's ability for alignment may reach adult levels after mastery of segmental timing.
ULTRAX2020: Ultrasound Technology for Optimising the Treatment of Speech Disorders: Clinicians' Resource Manual
Ultrasound Visual Biofeedback (U-VBF) uses medical ultrasound to image the tongue in real time during speech. Clinicians can use this information both to assess speech disorders and as a biofeedback tool to guide children in producing correct speech. Ultrasound images of the tongue are thought to be relatively intuitive to interpret; however, there is no easy way of using ultrasound to diagnose speech disorders, despite its potential to identify imperceptible errors which are diagnostically important. This manual describes how to use ultrasound for the assessment and treatment of speech sound disorders in children. It is designed to be used in combination with Articulate Instruments Ltd.'s Sonospeech software by clinical partners of the Ultrax2020 project. However, the basic principles and resources contained within this document will be of use to anyone interested in using ultrasound in the speech therapy clinic.
The Status of Coronals in Standard American English: An Optimality-Theoretic Account
Coronals are very special sound segments. There is abundant evidence from various fields of phonetics which clearly establishes coronals as a class of consonants appropriate for phonological analysis. The set of coronals is stable across varieties of English, unlike other consonant types, e.g. labials and dorsals, which are subject to a greater or lesser degree of variation. Coronals exhibit stability in inventories cross-linguistically, but they simultaneously display flexibility in alternations, i.e. assimilation, deletion, epenthesis, and dissimilation, when it is required by the contradictory forces of perception and production. The two main, opposing types of alternation that coronals in SAE participate in are examined. These are weakening phenomena, i.e. assimilation and deletion, and strengthening phenomena, i.e. epenthesis and dissimilation. Coronals are notorious for their contradictory behavior, especially in alternations. This type of behavior can be accounted for within a phonetically grounded OT framework that unites both phonetic and phonological aspects of alternations. Various sets of inherently conflicting FAITHFULNESS and MARKEDNESS constraints that are needed for an OT analysis of SAE alternations are introduced.
Exploring Speech Technologies for Language Learning
The teaching of the pronunciation of any foreign language must encompass both segmental and suprasegmental aspects of speech. In computational terms, the two levels of language learning activities can be decomposed at least into phonemic aspects, which include the correct pronunciation of single phonemes and the co-articulation of phonemes into higher phonological units, and prosodic aspects, which include:

- the correct position of stress at word level;
- the alternation of stressed and unstressed syllables in terms of compensation and vowel reduction;
- the correct position of sentence accent;
- the generation of adequate rhythm from the interleaving of stress, accent, and phonological rules;
- the generation of an adequate intonational pattern for each utterance, related to its communicative functions.

As appears from the above, for a student to communicate intelligibly and as closely as possible to a native speaker's pronunciation, prosody is very important [3]. We also assume that incorrect prosody may prevent communication from taking place, and this may be regarded as a strong motivation for making the teaching of prosody an integral part of any language course. From our point of view, it is much more important to stress the achievement of successful communication as the main objective of a second-language learner than the overcoming of what has been termed "foreign accent", which can be deemed a secondary goal. In any case, the two goals are certainly not coincident, even though they may overlap in some cases. We discuss these matters in the following sections.

All prosodic questions related to "rhythm" are discussed in the first section of this chapter. In [4] the author argues in favour of prosodic aids, in particular because a wrong placement of word stress may impair the listener's understanding of the word being pronounced. He also argues in favour of acquiring correct timing of phonological units to overcome the impression of "foreign accent" which may ensue from an incorrect distribution of stressed vs. unstressed stretches of linguistic units such as syllables or metrical feet. Timing is not to be confused with speaking rate, which need not be increased forcefully to give the impression of good fluency: trying to increase speaking rate may result in lower intelligibility. The question of "foreign accent" is also discussed at length in (Jilka, 1999). This work is particularly relevant as regards the intonational features of a learner of a second language, which we address in the second section of this chapter. Correcting the Intonational Foreign Accent (henceforth IFA) is an important component of a Prosodic Module for self-learning activities, as categorical aspects of the intonation of the two languages in contact, L1 and L2, are far apart and thus neatly distinguishable. The choice of the two languages in contact is determined mainly by the fact that the distance in prosodic terms between English and Italian is maximal, according to (Ramus and Mehler, 1999; Ramus et al., 1999).
Visualising articulation: real-time ultrasound visual biofeedback and visual articulatory models and their use in treating speech sound disorders associated with submucous cleft palate
Background: Ultrasound Tongue Imaging (UTI) is growing increasingly popular for assessing and treating Speech Sound Disorders (SSDs) and has more recently been used to qualitatively investigate compensatory articulations in speakers with cleft palate (CP). However, its therapeutic application for speakers with CP remains to be tested. A different set of developments, Visual Articulatory Models (VAMs), provide an offline dynamic model with context for lingual patterns. However, unlike UTI, they do not provide real-time biofeedback. Commercially available VAMs, such as Speech Trainer 3D, are available on iDevices, yet their clinical application remains to be tested.

Aims: This thesis aims to test the diagnostic use of ultrasound and to investigate the effectiveness of both UTI and VAMs for the treatment of SSDs associated with submucous cleft palate (SMCP).

Method: Using a single-subject multiple-baseline design, two males with repaired SMCP, Andrew (aged 9;2) and Craig (aged 6;2), received six assessment sessions and two blocks of therapy, following a motor-based therapy approach, using VAMs and UTI. Three methods were used to measure therapy outcomes. Firstly, percent target consonant correct scores, derived from phonetic transcriptions, provide outcomes comparable to those used in typical practice. Secondly, a perceptual evaluation by multiple phonetically trained listeners, using a two-alternative forced-choice design to measure listener agreement, provides a more objective measure. Thirdly, articulatory analysis, using qualitative and quantitative measures, provides an additional perspective able to reveal covert errors.

Results and Conclusions: There was overall improvement in the speech of both speakers, with a greater rate of change in therapy block one (VAMs) and greater listener agreement in the perceptual evaluation. Articulatory analysis supplemented phonetic transcriptions, detected covert articulations and covert contrast, and supported the improvements in auditory outcome scores. Both VAMs and UTI show promise as clinical tools for the treatment of SSDs associated with CP.
Vowel acquisition in a multidialectal environment: A five-year longitudinal case study
What happens when a child is exposed to multiple phonological systems while they are acquiring language? How do they resolve contradictory patterns in the accents around them in their own developing speech production? Do they acquire the accent of the local community, their parents' accent, or something in between? This thesis examines the acquisition of a subset of vowels in a child growing up in a multidialectal environment. The child's realisations of vowels in the lexical sets STRUT, FOOT, START, PALM and BATH are analysed between the ages of 2;01 and 6;11. Previous research has shown that while a child's accent is usually heavily influenced by their peers, having parents from outside the local area can prevent complete acquisition of an accent. Local cultural values, whether or not a parent's accent has more prestigious elements than the local one, a child's personality, and the complexity of the relationship between the home and local phonological systems have all been implicated in whether or not a child fully acquires a local accent. In the child studied here, a shift from the vowels used at home to local variants always happened at the level of articulatory feature, rather than at phonemic level, in the first instance, and vowels belonging to different lexical sets were acquired at different rates. This thesis demonstrates that acquisition of these vowels takes many years, as combinations of articulatory features stabilise. Moreover, even once a local variant has apparently been acquired, the variety of language spoken at home can leave a phonetic legacy in a child's accent. Naturalistic data collection combined with impressionistic and acoustic analysis, in conjunction with a long and sustained data collection period, reveals patterns in this child's phonological acquisition not seen in this detail in any previous research.
Apraxia World: Deploying a Mobile Game and Automatic Speech Recognition for Independent Child Speech Therapy
Children with speech sound disorders typically improve pronunciation quality by undergoing speech therapy, which must be delivered frequently and with high intensity to be effective. As such, clinic sessions are supplemented with home practice, often under caregiver supervision. However, traditional home practice can grow boring for children due to monotony. Furthermore, practice frequency is limited by caregiver availability, making it difficult for some children to reach therapy dosage. To address these issues, this dissertation presents a novel speech therapy game to increase engagement, and explores automatic pronunciation evaluation techniques to afford children independent practice.
The therapy game, called Apraxia World, delivers customizable, repetition-based speech therapy while children play through platformer-style levels using typical on-screen tablet controls; children complete in-game speech exercises to collect assets required to progress through the levels. Additionally, Apraxia World provides pronunciation feedback according to an automated pronunciation evaluation system running locally on the tablet. Apraxia World offers two advantages over current commercial and research speech therapy games; first, the game provides extended gameplay to support long therapy treatments; second, it affords some therapy practice independence via automatic pronunciation evaluation, allowing caregivers to lightly supervise instead of directly administer the practice. Pilot testing indicated that children enjoyed the game-based therapy much more than traditional practice and that the exercises did not interfere with gameplay. During a longitudinal study, children made clinically-significant pronunciation improvements while playing Apraxia World at home. Furthermore, children remained engaged in the game-based therapy over the two-month testing period and some even wanted to continue playing post-study.
The second part of the dissertation explores word- and phoneme-level pronunciation verification for child speech therapy applications. Word-level pronunciation verification is accomplished using a child-specific template-matching framework, where an utterance is compared against correctly and incorrectly pronounced examples of the word. This framework identified mispronounced words better than both a standard automated baseline and co-located caregivers. Phoneme-level mispronunciation detection is investigated using a technique from the second-language learning literature: training phoneme-specific classifiers with phonetic posterior features. This method also outperformed the standard baseline and, more significantly, identified mispronunciations better than student clinicians.
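A template-matching verifier of the kind described above can be sketched with dynamic time warping (DTW). This is an illustrative reconstruction of the general technique, not the dissertation's system: random sequences stand in for real acoustic feature frames (e.g., MFCCs), and the template counts and dimensions are hypothetical.

```python
# Sketch of child-specific template matching for word-level
# pronunciation verification: an utterance's feature sequence is
# DTW-compared against correctly and incorrectly pronounced templates
# of the same word, then labelled by the closer template set.
import numpy as np

def dtw_distance(a, b):
    """Classic DTW between two (frames x dims) feature sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)          # length-normalised distance

rng = np.random.default_rng(1)
# Stand-ins for 13-dim MFCC sequences of one target word:
correct_templates = [rng.normal(0.0, 1.0, (30, 13)) for _ in range(3)]
incorrect_templates = [rng.normal(2.0, 1.0, (30, 13)) for _ in range(3)]
utterance = rng.normal(0.0, 1.0, (28, 13))   # resembles correct speech

d_correct = min(dtw_distance(utterance, t) for t in correct_templates)
d_incorrect = min(dtw_distance(utterance, t) for t in incorrect_templates)
verdict = "correct" if d_correct < d_incorrect else "mispronounced"
print(verdict)   # → correct
```

Including incorrectly pronounced templates, rather than thresholding distance to correct templates alone, is what makes the decision child-specific: the typical error pattern for that child and word is part of the model.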
Personalising synthetic voices for individuals with severe speech impairment.
Speech technology can help individuals with speech disorders to interact more easily. Many individuals with severe speech impairment, due to conditions such as Parkinson's disease or motor neurone disease, use voice output communication aids (VOCAs), which have synthesised or pre-recorded voice output. This voice output effectively becomes the voice of the individual and should therefore represent the user accurately.
Currently available personalisation of speech synthesis techniques require a large amount of data input, which is difficult to produce for individuals with severe speech impairment. These techniques also do not provide a solution for those individuals whose voices have begun to show the effects of dysarthria.
The thesis shows that Hidden Markov Model (HMM)-based speech synthesis is a promising approach for 'voice banking' for individuals both before their condition causes deterioration of the speech and once deterioration has begun. Data input requirements for building personalised voices with this technique are investigated using human listener judgement evaluation. The results show that 100 sentences is the minimum required to build a voice that is significantly different from an average voice model and shows some resemblance to the target speaker; this amount depends on the speaker and the average model used.
A neural network analysis trained on extracted acoustic features revealed that spectral features had the most influence for predicting human listener judgements of similarity of synthesised speech to a target speaker. Accuracy of prediction significantly improves if other acoustic features are introduced and combined non-linearly.
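The feature-influence analysis described here can be illustrated with a small non-linear regressor probed by permutation importance. This is a generic sketch of the analysis pattern, not the thesis's model: the data are synthetic, the "spectral" block is constructed to carry the signal, and all names and sizes are assumptions.

```python
# Sketch: fit a small neural regressor of listener similarity scores
# from acoustic features, then use permutation importance to compare
# the influence of a "spectral" feature block against other features.
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)
n = 300
spectral = rng.normal(size=(n, 4))   # e.g., spectral-envelope summaries
prosodic = rng.normal(size=(n, 4))   # e.g., f0 / energy summaries
X = np.hstack([spectral, prosodic])
# Synthetic similarity score depends non-linearly on spectral features only:
y = np.tanh(spectral.sum(axis=1)) + 0.1 * rng.normal(size=n)

model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000,
                     random_state=0).fit(X, y)
imp = permutation_importance(model, X, y, n_repeats=5, random_state=0)
spectral_imp = imp.importances_mean[:4].sum()   # first 4 columns
prosodic_imp = imp.importances_mean[4:].sum()   # last 4 columns
print(spectral_imp > prosodic_imp)
```

Shuffling a feature column and measuring the drop in the model's score is one standard, model-agnostic way to rank feature influence; the thesis's finding that spectral features dominate would correspond here to the spectral block's importance exceeding the rest.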
These results were used to inform the reconstruction of personalised synthetic voices for speakers whose voices had begun to show the effects of their conditions. Using HMM-based synthesis, personalised synthetic voices were built from dysarthric speech, showing similarity to target speakers without recreating the impairment in the synthesised speech output.
- …