177 research outputs found
Towards Zero-shot Learning for Automatic Phonemic Transcription
Automatic phonemic transcription tools are useful for low-resource language
documentation. However, due to the lack of training sets, only a tiny fraction
of languages have phonemic transcription tools. Fortunately, multilingual
acoustic modeling provides a solution given limited audio training data. A more
challenging problem is to build phonemic transcribers for languages with zero
training data. The difficulty of this task is that phoneme inventories often
differ between the training languages and the target language, making it
infeasible to recognize unseen phonemes. In this work, we address this problem
by adopting the idea of zero-shot learning. Our model is able to recognize
unseen phonemes in the target language without any training data. In our model,
we decompose phonemes into corresponding articulatory attributes such as vowel
and consonant. Instead of predicting phonemes directly, we first predict
distributions over articulatory attributes, and then compute phoneme
distributions with a customized acoustic model. We evaluate our model by
training it using 13 languages and testing it using 7 unseen languages. We find
that it achieves 7.7% better phoneme error rate on average over a standard
multilingual model.Comment: AAAI 202
The phonetics and phonology of retroflexes : Fonetiek en fonologie van retroflexen (met een samenvatting in het Nederlands)
At the outset of this dissertation one might pose the question why retroflex consonants should still be of interest for phonetics and for phonological theory since ample work on this segmental class already exists. Bhat (1973) conducted a quite extensive study on retroflexion that treated the geographical spread of this class, some phonological processes its members can undergo, and the phonetic motivation for these processes. Furthermore, several phonological representations of retroflexes have been proposed in the framework of Feature Geometry, as in work by Sagey (1986), Pulleyblank (1989), Gnanadesikan (1993), and Clements (2001). Most recently, Steriade (1995, 2001) has discussed the perceptual cues of retroflexes and has argued that the distribution of these cues can account for the phonotactic restrictions on retroflexes and their assimilatory behaviour. Purely phonetically oriented studies such as Dixit (1990) and Simonsen, Moen & Cowen (2000) have shown the large articulatory variation that can be found for retroflexes and hint at the insufficiency of existing definitions
Phonological Level wav2vec2-based Mispronunciation Detection and Diagnosis Method
The automatic identification and analysis of pronunciation errors, known as
Mispronunciation Detection and Diagnosis (MDD) plays a crucial role in Computer
Aided Pronunciation Learning (CAPL) tools such as Second-Language (L2) learning
or speech therapy applications. Existing MDD methods relying on analysing
phonemes can only detect categorical errors of phonemes that have an adequate
amount of training data to be modelled. With the unpredictable nature of the
pronunciation errors of non-native or disordered speakers and the scarcity of
training datasets, it is unfeasible to model all types of mispronunciations.
Moreover, phoneme-level MDD approaches have a limited ability to provide
detailed diagnostic information about the error made. In this paper, we propose
a low-level MDD approach based on the detection of speech attribute features.
Speech attribute features break down phoneme production into elementary
components that are directly related to the articulatory system leading to more
formative feedback to the learner. We further propose a multi-label variant of
the Connectionist Temporal Classification (CTC) approach to jointly model the
non-mutually exclusive speech attributes using a single model. The pre-trained
wav2vec2 model was employed as a core model for the speech attribute detector.
The proposed method was applied to L2 speech corpora collected from English
learners from different native languages. The proposed speech attribute MDD
method was further compared to the traditional phoneme-level MDD and achieved a
significantly lower False Acceptance Rate (FAR), False Rejection Rate (FRR),
and Diagnostic Error Rate (DER) over all speech attributes compared to the
phoneme-level equivalent
Automatic Screening of Childhood Speech Sound Disorders and Detection of Associated Pronunciation Errors
Speech disorders in children can affect their fluency and intelligibility. Delay in their diagnosis and treatment increases the risk of social impairment and learning disabilities. With the significant shortage of Speech and Language Pathologists (SLPs), there is an increasing interest in Computer-Aided Speech Therapy tools with automatic detection and diagnosis capability.
However, the scarcity and unreliable annotation of disordered child speech corpora along with the high acoustic variations in the child speech data has impeded the development of reliable automatic detection and diagnosis of childhood speech sound disorders. Therefore, this thesis investigates two types of detection systems that can be achieved with minimum dependency on annotated mispronounced speech data.
First, a novel approach that adopts paralinguistic features which represent the prosodic, spectral, and voice quality characteristics of the speech was proposed to perform segment- and subject-level classification of Typically Developing (TD) and Speech Sound Disordered (SSD) child speech using a binary Support Vector Machine (SVM) classifier. As paralinguistic features are both language- and content-independent, they can be extracted from an unannotated speech signal.
Second, a novel Mispronunciation Detection and Diagnosis (MDD) approach was introduced to detect the pronunciation errors made due to SSDs and provide low-level diagnostic information that can be used in constructing formative feedback and a detailed diagnostic report. Unlike existing MDD methods where detection and diagnosis are performed at the phoneme level, the proposed method achieved MDD at the speech attribute level, namely the manners and places of articulations. The speech attribute features describe the involved articulators and their interactions when making a speech sound allowing a low-level description of the pronunciation error to be provided. Two novel methods to model speech attributes are further proposed in this thesis, a frame-based (phoneme-alignment) method leveraging the Multi-Task Learning (MTL) criterion and training a separate model for each attribute, and an alignment-free jointly-learnt method based on the Connectionist Temporal Classification (CTC) sequence to sequence criterion.
The proposed techniques have been evaluated using standard and publicly accessible adult and child speech corpora, while the MDD method has been validated using L2 speech corpora
The phonetics and phonology of retroflexes
This dissertation investigates the phonetic realization and phonological behaviour of the class of retroflexes, i.e. sounds that are articulated with the tongue tip or the underside of the tongue tip against the postalveolar or palatal region. On the basis of four articulatory properties, a new definition of retroflexes is proposed. These properties are apicality, posteriority, sublingual cavity, and retraction; the latter is shown to imply that retroflexes are incompatible with secondary palatalization.
The phonetic section gives an overview of the factors responsible for the large articulatory variation of retroflexes and discusses putative counterexamples of palatalized retroflexes. In addition, it describes the acoustic realization of retroflexes and proposes the common characteristic of a low third fomant.
The phonological section discusses processes involving retroflexes from a large number of typologically diverse languages. These processes are shown to be grounded in the similar articulatory and acoustic properties of the retroflex class. Furthermore, this section gives a phonological analysis of the processes involving retroflexes in an Optimally Theoretic framework with underlying perceptual representations, based on Boersma s Functional Phonology. Evidence is presented for the non-universality of the retroflex class, and for the non-necessity of innate phonological features.
This study is of interest to phonologists and phoneticians, especially to those working on the phonetics-phonology interface
Dealing with linguistic mismatches for automatic speech recognition
Recent breakthroughs in automatic speech recognition (ASR) have resulted in a word error rate (WER) on par with human transcribers on the English Switchboard benchmark. However, dealing with linguistic mismatches between the training and testing data is still a significant challenge that remains unsolved. Under the monolingual environment, it is well-known that the performance of ASR systems degrades significantly when presented with the speech from speakers with different accents, dialects, and speaking styles than those encountered during system training. Under the multi-lingual environment, ASR systems trained on a source language achieve even worse performance when tested on another target language because of mismatches in terms of the number of phonemes, lexical ambiguity, and power of phonotactic constraints provided by phone-level n-grams.
In order to address the issues of linguistic mismatches for current ASR systems, my dissertation investigates both knowledge-gnostic and knowledge-agnostic solutions. In the first part, classic theories relevant to acoustics and articulatory phonetics that present capability of being transferred across a dialect continuum from local dialects to another standardized language are re-visited. Experiments demonstrate the potentials that acoustic correlates in the vicinity of landmarks could help to build a bridge for dealing with mismatches across difference local or global varieties in a dialect continuum. In the second part, we design an end-to-end acoustic modeling approach based on connectionist temporal classification loss and propose to link the training of acoustics and accent altogether in a manner similar to the learning process in human speech perception. This joint model not only performed well on ASR with multiple accents but also boosted accuracies of accent identification task in comparison to separately-trained models
Loan Phonology
For many different reasons, speakers borrow words from other languages to fill gaps in their own lexical inventory. The past ten years have been characterized by a great interest among phonologists in the issue of how the nativization of loanwords occurs. The general feeling is that loanword nativization provides a direct window for observing how acoustic cues are categorized in terms of the distinctive features relevant to the L1 phonological system as well as for studying L1 phonological processes in action and thus to the true synchronic phonology of L1. The collection of essays presented in this volume provides an overview of the complex issues phonologists face when investigating this phenomenon and, more generally, the ways in which unfamiliar sounds and sound sequences are adapted to converge with the native language’s sound pattern. This book is of interest to theoretical phonologists as well as to linguists interested in language contact phenomena
- …