5 research outputs found
Speaker- and Age-Invariant Training for Child Acoustic Modeling Using Adversarial Multi-Task Learning
One of the major challenges in acoustic modelling of child speech is the
rapid change in children's articulators as they grow, their differing
growth rates, and the consequent high variability within the same age
group. These high acoustic variations along with the scarcity of child speech
corpora have impeded the development of a reliable speech recognition system
for children. In this paper, a speaker- and age-invariant training approach
based on adversarial multi-task learning is proposed. The system consists of
one shared generator network that learns to generate speaker- and age-invariant
features connected to three discrimination networks, for phoneme, age, and
speaker. The generator network is trained to minimize the
phoneme-discrimination loss and maximize the speaker- and age-discrimination
losses in an adversarial multi-task learning fashion. The generator network is
a Time Delay Neural Network (TDNN) architecture while the three discriminators
are feed-forward networks. The system was applied to the OGI speech corpora and
achieved a 13% reduction in the WER of the ASR.
Comment: Submitted to ICASSP202
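One common way to implement the min-max objective described above is a gradient reversal layer, as used in domain-adversarial training: the discriminators minimize their own losses while the generator receives reversed gradients from the speaker and age heads. The sketch below is illustrative only; the layer sizes, class counts, reversal weight, and the per-utterance mean pooling (standing in for frame-level phoneme classification) are all assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) gradients on backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

class AdversarialAM(nn.Module):
    def __init__(self, n_feats=40, hidden=64, n_phones=40,
                 n_speakers=50, n_ages=4, lam=0.5):
        super().__init__()
        self.lam = lam
        # TDNN-style generator: dilated 1-D convolutions over time
        self.generator = nn.Sequential(
            nn.Conv1d(n_feats, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2, padding=2),
            nn.ReLU(),
        )
        # three feed-forward discriminators on the shared features
        self.phone_head = nn.Linear(hidden, n_phones)
        self.speaker_head = nn.Linear(hidden, n_speakers)
        self.age_head = nn.Linear(hidden, n_ages)

    def forward(self, x):                      # x: (batch, feats, time)
        h = self.generator(x).mean(dim=2)      # pool over time -> (batch, hidden)
        rev = GradReverse.apply(h, self.lam)   # reversed grads for adversarial heads
        return self.phone_head(h), self.speaker_head(rev), self.age_head(rev)

model = AdversarialAM()
x = torch.randn(8, 40, 100)                    # 8 utterances, 40-dim feats, 100 frames
phone, spk, age = model(x)
```

Training then simply sums the three cross-entropy losses; the reversal layer makes the generator's update oppose the speaker and age discriminators without needing alternating optimization.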
CUCHILD: A Large-Scale Cantonese Corpus of Child Speech for Phonology and Articulation Assessment
This paper describes the design and development of CUCHILD, a large-scale
Cantonese corpus of child speech. The corpus contains spoken words collected
from 1,986 child speakers aged 3 to 6 years. The speech materials
include 130 words of 1 to 4 syllables in length. The speakers cover both
typically developing (TD) children and children with speech disorders. The
intended use of the corpus is to support scientific and clinical research, as
well as technology development related to child speech assessment. The design
of the corpus, including word selection, participant recruitment, the data
acquisition process, and data pre-processing, is described in detail. The
results of acoustical analysis are presented to illustrate the properties of
child speech. Potential applications of the corpus in automatic speech
recognition, phonological error detection and speaker diarization are also
discussed.
Comment: Accepted to INTERSPEECH 2020, Shanghai, China
Improvement of Automatic Speech Recognition Skills of Linguistics Students Through Using Ukrainian-English and Ukrainian-German Subtitles in Publicistic Movies
Growing global attention to foreign language study encourages the development and improvement of language-teaching systems in higher education institutions. Such a system takes into account, and promptly responds to, the demands of today's multicultural society, and it begins with the transformation and modernization of the higher education system, including the introduction of innovative technologies for studying English and German oriented toward the demands of the modern world labor market. All this determined the relevance of the research. This article aims to establish ways for students to acquire automatic speech recognition skills through subtitling Ukrainian-English and Ukrainian-German publicistic movies and series. A new audio and video language corpus with an automatic subtitling mechanism was developed and first assessed at the Admiral Makarov National University of Shipbuilding to improve linguistics students' recognition and understanding of oral speech. The skills and abilities that improved during work with the educational movie corpus have been identified.
Automatic Screening of Childhood Speech Sound Disorders and Detection of Associated Pronunciation Errors
Speech disorders in children can affect their fluency and intelligibility. Delay in their diagnosis and treatment increases the risk of social impairment and learning disabilities. With the significant shortage of Speech and Language Pathologists (SLPs), there is an increasing interest in Computer-Aided Speech Therapy tools with automatic detection and diagnosis capability.
However, the scarcity and unreliable annotation of disordered child speech corpora, along with the high acoustic variation in child speech data, have impeded the development of reliable automatic detection and diagnosis of childhood speech sound disorders. This thesis therefore investigates two types of detection systems that can be built with minimal dependence on annotated mispronounced speech data.
First, a novel approach that adopts paralinguistic features which represent the prosodic, spectral, and voice quality characteristics of the speech was proposed to perform segment- and subject-level classification of Typically Developing (TD) and Speech Sound Disordered (SSD) child speech using a binary Support Vector Machine (SVM) classifier. As paralinguistic features are both language- and content-independent, they can be extracted from an unannotated speech signal.
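The segment- and subject-level classification pipeline can be illustrated with scikit-learn. The synthetic features below merely stand in for real paralinguistic descriptors (for example, openSMILE-style functionals), and the dimensions, subject counts, and majority-vote aggregation rule are assumptions for the sketch, not the thesis's exact setup.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for paralinguistic feature vectors:
# 6 subjects x 20 speech segments each, 32-dim features.
n_subjects, n_segs, dim = 6, 20, 32
labels = np.array([0, 0, 0, 1, 1, 1])          # 0 = TD, 1 = SSD (assumed split)
X, y, subj = [], [], []
for s, lab in enumerate(labels):
    # give the two classes slightly different feature means
    X.append(rng.normal(loc=lab * 1.0, scale=1.0, size=(n_segs, dim)))
    y.extend([lab] * n_segs)
    subj.extend([s] * n_segs)
X = np.vstack(X)
y = np.array(y)
subj = np.array(subj)

# Segment-level binary SVM classifier
clf = SVC(kernel="rbf").fit(X, y)
seg_pred = clf.predict(X)

# Subject-level decision: majority vote over that subject's segments
subj_pred = np.array([
    int(seg_pred[subj == s].mean() >= 0.5) for s in range(n_subjects)
])
```

Because the paralinguistic features are content-independent, the same pipeline applies to unannotated recordings: only the subject-level TD/SSD label is needed for training.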
Second, a novel Mispronunciation Detection and Diagnosis (MDD) approach was introduced to detect the pronunciation errors made due to SSDs and provide low-level diagnostic information that can be used in constructing formative feedback and a detailed diagnostic report. Unlike existing MDD methods, where detection and diagnosis are performed at the phoneme level, the proposed method achieves MDD at the speech attribute level, namely the manners and places of articulation. The speech attribute features describe the involved articulators and their interactions when making a speech sound, allowing a low-level description of the pronunciation error to be provided. Two novel methods to model speech attributes are further proposed in this thesis: a frame-based (phoneme-alignment) method leveraging the Multi-Task Learning (MTL) criterion and training a separate model for each attribute, and an alignment-free jointly learnt method based on the Connectionist Temporal Classification (CTC) sequence-to-sequence criterion.
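The alignment-free variant can be sketched as a small CTC-trained attribute recognizer. Everything here is an illustrative stand-in, assuming a toy manner-of-articulation inventory; the BiLSTM encoder, feature dimensions, and label set are not the thesis's actual architecture.

```python
import torch
import torch.nn as nn

# Assumed toy inventory of manner-of-articulation attributes; index 0 is the CTC blank.
ATTRS = ["<blank>", "stop", "fricative", "nasal", "glide", "vowel"]

class AttributeCTC(nn.Module):
    """Alignment-free speech-attribute recognizer trained with CTC (a sketch)."""
    def __init__(self, n_feats=40, hidden=64, n_out=len(ATTRS)):
        super().__init__()
        self.rnn = nn.LSTM(n_feats, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_out)

    def forward(self, x):                        # x: (batch, time, feats)
        h, _ = self.rnn(x)
        return self.proj(h).log_softmax(dim=-1)  # (batch, time, n_out)

model = AttributeCTC()
ctc = nn.CTCLoss(blank=0)

x = torch.randn(2, 50, 40)                       # 2 utterances, 50 frames each
targets = torch.tensor([1, 5, 2, 1, 5, 3])       # concatenated attribute ids
target_lens = torch.tensor([3, 3])               # attribute sequence lengths
input_lens = torch.full((2,), 50)                # frame counts per utterance

log_probs = model(x).transpose(0, 1)             # CTCLoss expects (time, batch, out)
loss = ctc(log_probs, targets, input_lens, target_lens)
```

No frame-level alignment is required: CTC marginalizes over all monotonic alignments between the frame sequence and the attribute sequence, which is what makes the jointly learnt method alignment-free.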
The proposed techniques have been evaluated using standard, publicly accessible adult and child speech corpora, while the MDD method has been validated using L2 speech corpora.