A Hierarchical Context-aware Modeling Approach for Multi-aspect and Multi-granular Pronunciation Assessment
Automatic Pronunciation Assessment (APA) plays a vital role in
Computer-assisted Pronunciation Training (CAPT) when evaluating a second
language (L2) learner's speaking proficiency. However, an apparent downside of
most de facto methods is that they parallelize the modeling process throughout
different speech granularities without accounting for the hierarchical and
local contextual relationships among them. In light of this, a novel
hierarchical approach is proposed in this paper for multi-aspect and
multi-granular APA. Specifically, we first introduce the notion of sup-phonemes
to explore more subtle semantic traits of L2 speakers. Second, a depth-wise
separable convolution layer is exploited to better encapsulate the local
context cues at the sub-word level. Finally, we use a score-restraint attention
pooling mechanism to predict the sentence-level scores and optimize the
component models with a multitask learning (MTL) framework. Extensive
experiments carried out on a publicly-available benchmark dataset, viz.
speechocean762, demonstrate the efficacy of our approach in relation to some
cutting-edge baselines.
Comment: Accepted to Interspeech 202
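The depth-wise separable convolution mentioned in this abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function name, argument layout, zero padding, and odd kernel length are all assumptions made for illustration. The idea shown is the general one: each channel is first filtered along time by its own kernel (depthwise step), and a 1x1 convolution then mixes the channels at every time step (pointwise step).

```python
def depthwise_separable_conv1d(x, depthwise_kernels, pointwise_weights):
    """Hypothetical helper, for illustration only.

    x: list of input channels, each a list of T floats.
    depthwise_kernels: one kernel per input channel (odd length assumed).
    pointwise_weights: one weight vector (length = number of input
        channels) per output channel.
    """
    channels = len(x)
    steps = len(x[0])

    # Depthwise step: every channel is filtered by its own kernel,
    # with zero padding so the output keeps length T.
    depthwise_out = []
    for c in range(channels):
        kernel = depthwise_kernels[c]
        half = len(kernel) // 2
        row = []
        for t in range(steps):
            acc = 0.0
            for k, w in enumerate(kernel):
                idx = t + k - half
                if 0 <= idx < steps:
                    acc += w * x[c][idx]
            row.append(acc)
        depthwise_out.append(row)

    # Pointwise step: a 1x1 convolution mixes channels at each time step.
    return [
        [sum(w[c] * depthwise_out[c][t] for c in range(channels))
         for t in range(steps)]
        for w in pointwise_weights
    ]


features = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # 2 channels, 3 time steps
kernels = [[0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]    # one kernel per channel
mixing = [[1.0, 1.0]]                           # 1 output channel
output = depthwise_separable_conv1d(features, kernels, mixing)
```

Splitting the convolution this way uses one small kernel per channel plus a channel-mixing matrix, which typically needs far fewer parameters than a full convolution with the same receptive field.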
Phonological Level wav2vec2-based Mispronunciation Detection and Diagnosis Method
The automatic identification and analysis of pronunciation errors, known as
Mispronunciation Detection and Diagnosis (MDD), plays a crucial role in Computer
Aided Pronunciation Learning (CAPL) tools such as Second-Language (L2) learning
or speech therapy applications. Existing MDD methods relying on analysing
phonemes can only detect categorical errors of phonemes that have an adequate
amount of training data to be modelled. With the unpredictable nature of the
pronunciation errors of non-native or disordered speakers and the scarcity of
training datasets, it is unfeasible to model all types of mispronunciations.
Moreover, phoneme-level MDD approaches have a limited ability to provide
detailed diagnostic information about the error made. In this paper, we propose
a low-level MDD approach based on the detection of speech attribute features.
Speech attribute features break down phoneme production into elementary
components that are directly related to the articulatory system, leading to more
formative feedback to the learner. We further propose a multi-label variant of
the Connectionist Temporal Classification (CTC) approach to jointly model the
non-mutually exclusive speech attributes using a single model. The pre-trained
wav2vec2 model was employed as a core model for the speech attribute detector.
The proposed method was applied to L2 speech corpora collected from English
learners with different native languages. Compared with traditional
phoneme-level MDD, the proposed speech attribute MDD method achieved
significantly lower False Acceptance Rate (FAR), False Rejection Rate (FRR),
and Diagnostic Error Rate (DER) across all speech attributes.
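The three metrics named in this abstract can be made concrete with a small sketch. The abstract does not define them, so the definitions below are an assumption based on a convention commonly used in MDD work: FAR is the share of mispronunciations accepted as correct, FRR is the share of correct pronunciations flagged as errors, and DER is the share of detected mispronunciations given the wrong diagnosis. The function name and the (canonical, spoken, predicted) triple format are hypothetical.

```python
def mdd_error_rates(triples):
    """Hypothetical helper, for illustration only.

    triples: iterable of (canonical, spoken, predicted) labels, where
    `canonical` is the expected label, `spoken` is what the speaker
    actually produced, and `predicted` is the system output.
    """
    ta = fr = fa = cd = de = 0
    for canonical, spoken, predicted in triples:
        if spoken == canonical:          # correct pronunciation
            if predicted == canonical:
                ta += 1                  # true acceptance
            else:
                fr += 1                  # false rejection
        else:                            # mispronunciation
            if predicted == canonical:
                fa += 1                  # false acceptance
            elif predicted == spoken:
                cd += 1                  # detected, correct diagnosis
            else:
                de += 1                  # detected, wrong diagnosis

    tr = cd + de                         # true rejections

    def ratio(numerator, denominator):
        # Return 0.0 when a category is empty (an assumed convention).
        return numerator / denominator if denominator else 0.0

    return {
        "FAR": ratio(fa, fa + tr),
        "FRR": ratio(fr, ta + fr),
        "DER": ratio(de, tr),
    }


example = [
    ("a", "a", "a"),  # true acceptance
    ("a", "a", "b"),  # false rejection
    ("a", "b", "a"),  # false acceptance
    ("a", "b", "b"),  # correct diagnosis
    ("a", "b", "c"),  # diagnostic error
    ("a", "a", "a"),  # true acceptance
]
rates = mdd_error_rates(example)
```

The same counting applies whether the labels are phonemes or individual speech attributes; only the label inventory changes.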
- …