Automatic detection of accent and lexical pronunciation errors in spontaneous non-native English speech
Detecting individual pronunciation errors and diagnosing pronunciation error tendencies in a language learner based on their speech are important components of computer-aided language learning (CALL). The tasks of error detection and error tendency diagnosis become particularly challenging when the speech in question is spontaneous, and especially so given the inconsistency of human annotation of pronunciation errors. This paper presents an approach to these tasks by distinguishing between lexical errors, wherein the speaker does not know how a particular word is pronounced, and accent errors, wherein the candidate's speech exhibits consistent patterns of phone substitution, deletion and insertion. Three annotated corpora of non-native English speech by speakers of multiple L1s are analysed, the consistency of human annotation is investigated, and a method is presented for detecting individual accent and lexical errors and diagnosing accent error tendencies at the speaker level.
Articulatory Copy Synthesis Based on the Speech Synthesizer VocalTractLab
Articulatory copy synthesis (ACS), a subarea of speech inversion, refers to the reproduction of natural utterances, capturing both the physiological articulatory processes and their corresponding acoustic results. This thesis proposes two novel methods for the ACS of human speech using the articulatory speech synthesizer VocalTractLab (VTL), to address or mitigate existing problems of speech inversion such as non-unique mapping, acoustic variation among different speakers, and the time-consuming nature of the process.
The first method involved finding appropriate VTL gestural scores for given natural utterances using a genetic algorithm. It consisted of two steps: gestural score initialization and optimization. In the first step, gestural scores were initialized using the given acoustic signals with speech recognition, grapheme-to-phoneme (G2P), and a VTL rule-based method for converting phoneme sequences to gestural scores. In the second step, the initial gestural scores were optimized by a genetic algorithm via an analysis-by-synthesis (ABS) procedure that sought to minimize the cosine distance between the acoustic features of the synthetic and natural utterances. The articulatory parameters were also regularized during the optimization process to restrict them to reasonable values.
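The analysis-by-synthesis loop described above can be sketched as follows. Everything here is illustrative: `toy_synthesize` is a hypothetical stand-in for VTL synthesis plus feature extraction, and the population size, mutation scale, and parameter bounds are arbitrary choices rather than the thesis's settings.

```python
import math
import random

def cosine_distance(a, b):
    """1 - cosine similarity between two acoustic feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def toy_synthesize(params):
    """Hypothetical stand-in for VTL synthesis + feature extraction."""
    return [math.sin(p) + 2.0 for p in params]

def optimize(target_feats, n_params, pop_size=30, generations=200, seed=0):
    """Genetic algorithm minimizing cosine distance between synthetic and
    target features, with parameters clamped to a plausible range."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-1, 1) for _ in range(n_params)]
           for _ in range(pop_size)]

    def fitness(ind):
        return cosine_distance(toy_synthesize(ind), target_feats)

    for _ in range(generations):
        pop.sort(key=fitness)                  # best individuals first
        parents = pop[:pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_params)
            child = a[:cut] + b[cut:]          # one-point crossover
            i = rng.randrange(n_params)
            mutated = child[i] + rng.gauss(0.0, 0.1)
            child[i] = max(-1.0, min(1.0, mutated))  # regularize to bounds
            children.append(child)
        pop = parents + children
    best = min(pop, key=fitness)
    return best, fitness(best)
```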
The second method was based on long short-term memory (LSTM) and convolutional neural networks, which were responsible for capturing the temporal dependence and the spatial structure of the acoustic features, respectively. Neural network regression models were trained that used acoustic features as inputs and produced articulatory trajectories as outputs. In addition, to cover as much of the articulatory and acoustic space as possible, the training samples were augmented by manipulating the phonation type, speaking effort, and vocal tract length of the synthetic utterances. Furthermore, two regularization methods were proposed: one based on the smoothness loss of articulatory trajectories and another based on the acoustic loss between original and predicted acoustic features.
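The trajectory-smoothness regularizer can be illustrated with one common formulation, the mean squared second-order time difference, which penalizes jerky articulatory movement; the thesis's exact loss definition may differ.

```python
def smoothness_loss(traj):
    """Mean squared second-order difference of a trajectory over time.

    traj: list of frames, each a list of articulatory parameter values.
    A straight-line (constant-velocity) trajectory scores 0; abrupt
    direction changes are penalized quadratically.
    """
    if len(traj) < 3:
        return 0.0
    total, count = 0.0, 0
    for t in range(1, len(traj) - 1):
        for prev, cur, nxt in zip(traj[t - 1], traj[t], traj[t + 1]):
            accel = nxt - 2.0 * cur + prev   # discrete second derivative
            total += accel * accel
            count += 1
    return total / count
```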
The best-performing genetic algorithm and convolutional LSTM systems (evaluated in terms of the difference between the estimated and reference VTL articulatory parameters) obtained average correlation coefficients of 0.985 and 0.983 for speaker-dependent utterances, respectively, and their reproduced speech achieved recognition accuracies of 86.25% and 64.69% for speaker-independent utterances of German words, respectively. When applied to German sentence utterances, as well as English and Mandarin Chinese word utterances, the neural-network-based ACS systems achieved recognition accuracies of 73.88%, 52.92%, and 52.41%, respectively. The results showed that both methods reproduced not only the articulatory processes but also the acoustic signals of the reference utterances. Moreover, the regularization methods led to more physiologically plausible articulatory processes and brought the estimated articulatory trajectories closer to those preferred by VTL, thus reproducing more natural and intelligible speech. This study also found that the convolutional layers, when used in conjunction with batch normalization layers, automatically learned more distinctive features from log power spectrograms. Furthermore, the neural-network-based ACS systems trained on German data could be generalized to utterances of other languages.
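The correlation-based evaluation above reduces to computing Pearson's r between each estimated and reference articulatory parameter trajectory; a minimal sketch:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between an estimated and a
    reference parameter trajectory (equal-length sequences)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

The reported system-level scores would then be averages of this quantity over parameters and utterances.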
Deep Learning for Automatic Assessment and Feedback of Spoken English
Growing global demand for learning a second language (L2), particularly English, has led to considerable interest in automatic spoken language assessment, whether for use in computer-assisted language learning (CALL) tools or for grading candidates for formal qualifications. This thesis presents research conducted into the automatic assessment of spontaneous non-native English speech, with a view to providing meaningful feedback to learners. One of the challenges in automatic spoken language assessment is giving candidates feedback on particular aspects, or views, of their spoken language proficiency, in addition to the overall holistic score normally provided. Another is detecting pronunciation and other types of errors at the word or utterance level and feeding them back to the learner in a useful way.
It is usually difficult to obtain accurate training data with separate scores for different
views and, as examiners are often trained to give holistic grades, single-view scores can
suffer issues of consistency. Conversely, holistic scores are available for various standard
assessment tasks such as Linguaskill. An investigation is thus conducted into whether
assessment scores linked to particular views of the speaker’s ability can be obtained from
systems trained using only holistic scores.
End-to-end neural systems are designed with structures and forms of input tuned to single views, specifically pronunciation, rhythm, intonation and text. By training each system on large quantities of candidate data, it should be possible to extract individual-view information. The relationships between the predictions of each system are evaluated to examine whether they are, in fact, extracting different information about the speaker. Three methods
of combining the systems to predict holistic score are investigated, namely averaging their
predictions and concatenating and attending over their intermediate representations. The
combined graders are compared to each other and to baseline approaches.
The tasks of error detection and error tendency diagnosis become particularly challenging
when the speech in question is spontaneous, and especially so given the inconsistency of
human annotation of pronunciation errors. An approach to these tasks is
presented by distinguishing between lexical errors, wherein the speaker does not know how a
particular word is pronounced, and accent errors, wherein the candidate’s speech exhibits
consistent patterns of phone substitution, deletion and insertion. Three annotated corpora
of non-native English speech by speakers of multiple L1s are analysed, the consistency of
human annotation investigated and a method presented for detecting individual accent and
lexical errors and diagnosing accent error tendencies at the speaker level.
Exploring the use of Technology for Assessment and Intensive Treatment of Childhood Apraxia of Speech
Given the rapid advances in technology over the past decade, this thesis examines the potential for automatic speech recognition (ASR) technology to expedite the process of objective analysis of speech, particularly for lexical stress patterns in childhood apraxia of speech (CAS). This dissertation also investigates the potential for mobile technology to bridge the gap between current service delivery models in Australia and best-practice treatment intensity for CAS. To address these two broad aims, this thesis describes three main projects. The first is a systematic literature review summarising the development, implementation and accuracy of automatic speech analysis tools when applied to evaluation and modification of children's speech production skills. Guided by the results of the systematic review, the second project presents data on the accuracy and clinical utility of a custom-designed lexical stress classification tool, designed as part of a multi-component speech analysis system for a mobile therapy application, Tabby Talks, for use with children with CAS. The third project is a randomised control trial exploring the effect of different types of feedback on response to intervention for children with CAS. The intervention was designed to specifically explore the feasibility and effectiveness of using an app equipped with ASR technology to provide feedback on speech production accuracy during home practice sessions, simulating the common service delivery model in Australia. The thesis concludes with a discussion of future directions for technology-based speech assessment and intensive speech production practice, guidelines for future development of therapy tools that include more game-based practice activities, and the contexts in which children can be transferred from predominantly clinician-delivered augmented feedback to ASR-delivered right/wrong feedback while continuing to make optimal gains in acquisition and retention of speech production targets.
Automatic Screening of Childhood Speech Sound Disorders and Detection of Associated Pronunciation Errors
Speech disorders in children can affect their fluency and intelligibility. Delay in their diagnosis and treatment increases the risk of social impairment and learning disabilities. With the significant shortage of Speech and Language Pathologists (SLPs), there is an increasing interest in Computer-Aided Speech Therapy tools with automatic detection and diagnosis capability.
However, the scarcity and unreliable annotation of disordered child speech corpora, along with the high acoustic variability of child speech, have impeded the development of reliable automatic detection and diagnosis of childhood speech sound disorders. Therefore, this thesis investigates two types of detection systems that can be achieved with minimum dependency on annotated mispronounced speech data.
First, a novel approach that adopts paralinguistic features which represent the prosodic, spectral, and voice quality characteristics of the speech was proposed to perform segment- and subject-level classification of Typically Developing (TD) and Speech Sound Disordered (SSD) child speech using a binary Support Vector Machine (SVM) classifier. As paralinguistic features are both language- and content-independent, they can be extracted from an unannotated speech signal.
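As an illustration of the classification setup (not the thesis's actual pipeline, which would use extracted paralinguistic feature sets and, presumably, a standard SVM toolkit), a minimal linear SVM trained with Pegasos-style subgradient descent on utterance-level feature vectors might look like:

```python
import random

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Minimal linear SVM via Pegasos-style subgradient descent.

    X: list of feature vectors (e.g. utterance-level paralinguistic
       functionals); y: labels in {-1, +1} (e.g. TD vs SSD).
    Returns the learned weight vector.
    """
    rng = random.Random(seed)
    dim = len(X[0])
    w = [0.0] * dim
    t = 0
    for _ in range(epochs):
        order = list(range(len(X)))
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)              # decaying step size
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            if margin < 1.0:                   # hinge loss is active
                w = [(1 - eta * lam) * wj + eta * y[i] * xj
                     for wj, xj in zip(w, X[i])]
            else:                              # only regularize
                w = [(1 - eta * lam) * wj for wj in w]
    return w

def predict(w, x):
    """Sign of the decision function."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1
```

Segment- and subject-level decisions could then be obtained by applying `predict` per segment and, say, majority-voting per speaker.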
Second, a novel Mispronunciation Detection and Diagnosis (MDD) approach was introduced to detect the pronunciation errors made due to SSDs and provide low-level diagnostic information that can be used in constructing formative feedback and a detailed diagnostic report. Unlike existing MDD methods, where detection and diagnosis are performed at the phoneme level, the proposed method achieves MDD at the speech attribute level, namely the manners and places of articulation. The speech attribute features describe the involved articulators and their interactions when making a speech sound, allowing a low-level description of the pronunciation error to be provided. Two novel methods to model speech attributes are further proposed in this thesis: a frame-based (phoneme-alignment) method leveraging the Multi-Task Learning (MTL) criterion and training a separate model for each attribute, and an alignment-free jointly learnt method based on the Connectionist Temporal Classification (CTC) sequence-to-sequence criterion.
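The idea of diagnosing at the speech attribute level can be sketched with a toy phoneme-to-attribute table; the phoneme subset, attribute labels, and `diagnose` helper below are hypothetical illustrations, not the thesis's models.

```python
# Hypothetical phoneme -> (manner, place) table for a few English
# consonants; a real system would cover the full phone inventory.
ATTRIBUTES = {
    "p":  ("stop", "bilabial"),
    "b":  ("stop", "bilabial"),
    "t":  ("stop", "alveolar"),
    "s":  ("fricative", "alveolar"),
    "th": ("fricative", "dental"),
    "w":  ("approximant", "bilabial"),
    "r":  ("approximant", "alveolar"),
}

def diagnose(canonical, recognized):
    """Compare canonical vs recognized phone sequences (assumed aligned)
    and report which speech attributes changed in each substitution."""
    report = []
    for ref, hyp in zip(canonical, recognized):
        if ref == hyp:
            continue
        ref_m, ref_p = ATTRIBUTES[ref]
        hyp_m, hyp_p = ATTRIBUTES[hyp]
        detail = []
        if ref_m != hyp_m:
            detail.append(f"manner {ref_m} -> {hyp_m}")
        if ref_p != hyp_p:
            detail.append(f"place {ref_p} -> {hyp_p}")
        report.append((ref, hyp, "; ".join(detail)))
    return report
```

For the common child substitution /r/ → /w/ (gliding), this reports a place change while the manner is preserved, which is exactly the kind of low-level description usable in formative feedback.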
The proposed techniques have been evaluated using standard and publicly accessible adult and child speech corpora, while the MDD method has been validated using L2 speech corpora.
Language, perception and production in profoundly deaf children
Prelingually profoundly deaf children usually experience problems with language learning (Webster, 1986; Campbell, Burden & Wright, 1992). The acquisition of written language would be no problem for them if normal development of reading and writing were not dependent on spoken language (Pattison, 1986). However, such children cannot be viewed as a homogeneous group since some, the minority, do develop good linguistic skills.
Group studies have identified several factors relating to language skills:
hearing loss and level of loss, I.Q., intelligibility, lip-reading, use of
phonology and memory capacity (Furth, 1966; Conrad, 1979; Trybus &
Karchmer, 1977; Jensema, 1975; Baddeley, Papagno & Vallar, 1988;
Baddeley & Wilson, 1988; Hanson, 1989; Lake, 1980; Daneman &
Carpenter, 1980). These various factors appear to be interrelated, with
phonological awareness being implicated in most. So to understand
behaviour, measures of all these factors must be obtained. The present
study aimed to achieve this whilst investigating the prediction that
performance success may be due to better use of phonological
information.
Because linguistic success for the deaf child is exceptional, a case study
approach was taken to avoid obscuring subtle differences in
performance. Subjects were screened to meet 6 research criteria:
profound prelingual deafness, no other known handicap, English the
first language in the home, at least average non-verbal IQ, reading age 7-9 years, and inter-subject dissimilarities between chronological and reading age discrepancies. Case histories were obtained from school records and home interviews. Six subjects with diverse linguistic skills were selected, four of whom undertook all tests.
Phonological awareness and development were assessed across several
variables: immediate memory span, intelligibility, spelling, rhyme
judgement, speech discrimination and production. There was
considerable inter-subject performance difference. One boy's speech
production was singled out for a more detailed analysis. Useful aided hearing and consistent contrastive speech appear to be implicated in
other English language skills.
It was concluded that for phonological awareness to develop, the deaf
child must receive useful inputs from as many media as possible (e.g.,
vision, audition, articulation, sign and orthography). When input is
biased toward the more reliable modalities of audition and
articulation, there is a greater possibility of a robust and useful
phonology being derived and thus better access to the English language.
Artificial Intelligence for Multimedia Signal Processing
Artificial intelligence technologies are also being actively applied to broadcasting and multimedia processing. A great deal of research has been conducted in a wide variety of fields, such as content creation, transmission, and security, and in the past two to three years attempts have been made to improve compression efficiency for image, video, speech, and other data in areas related to MPEG media processing technology. Additionally, technologies such as media creation, processing, editing, and scenario creation are very important areas of research in multimedia processing and engineering. This book contains a collection of topics spanning advanced computational intelligence algorithms and technologies for emerging multimedia signal processing: computer vision, speech/sound/text processing, and content analysis/information mining.
Lexical segmentation and word recognition in fluent aphasia
The current thesis reports a psycholinguistic study of lexical segmentation and word recognition in fluent aphasia. When listening to normal running speech, we must identify individual words from a continuous stream before we can extract a linguistic message from it. Normal listeners are able to resolve the segmentation problem without any noticeable difficulty. In this thesis I consider how fluent aphasic listeners perform the process of lexical segmentation and whether any of their impaired comprehension of spoken language has its provenance in a failure to segment speech normally. The investigation comprised a series of 5 experiments which examined the processing of both explicit acoustic and prosodic cues to word juncture and features which affect listeners' segmentation of the speech stream implicitly, through inter-lexical competition of potential word matches. The data collected show that lexical segmentation of continuous speech is compromised in fluent aphasia. Word hypotheses do not always accrue appropriate activational information from all of the available sources within the time frame in which the segmentation problem is normally resolved. The fluent aphasic performance, although quantitatively impaired compared to normal, reflects an underlying normal competence; their processing seldom displays a qualitatively different profile from normal. They are able to engage frequency, morphological structure, and imageability as modulators of activation. Word class, a feature found to be influential in the normal resolution of segmentation, is not used by the fluent aphasics studied. In those cases of occasional failure to adequately resolve segmentation by automatic frequency-mediated activation, fluent aphasics invoke the metalinguistic influence of the real-world plausibility of alternative parses.
Re-examining Phonological and Lexical Correlates of Second Language Comprehensibility: The Role of Rater Experience
Few researchers and teachers would disagree that some linguistic aspects
of second language (L2) speech are more crucial than others for successful
communication. Underlying this idea is the assumption that communicative
success can be broadly defined in terms of speakers’ ability to convey the
intended meaning to the interlocutor, which is frequently captured through
a listener-based rating of comprehensibility or ease of understanding (e.g.
Derwing & Munro, 2009; Levis, 2005). Previous research has shown that
communicative success – for example, as defined through comprehensible L2
speech – depends on several linguistic dimensions of L2 output, including its
segmental and suprasegmental pronunciation, fluency-based characteristics,
lexical and grammatical content, as well as discourse structure (e.g. Field,
2005; Hahn, 2004; Kang et al., 2010; Trofimovich & Isaacs, 2012). Our chief
objective in the current study was to explore the L2 comprehensibility construct from a language assessment perspective (e.g. Isaacs & Thomson, 2013),
by targeting rater experience as a possible source of variance influencing the
degree to which raters use various characteristics of speech in judging L2
comprehensibility. In keeping with this objective, we asked the following
question: What is the extent to which linguistic aspects of L2 speech contributing to comprehensibility ratings depend on raters' experience?