320 research outputs found
Analyzing analytical methods: The case of phonology in neural models of spoken language
Despite the fast development of analysis techniques for NLP and speech
processing systems, few systematic studies have been conducted to compare the
strengths and weaknesses of each method. As a step in this direction we study
the case of representations of phonology in neural network models of spoken
language. We use two commonly applied analytical techniques, diagnostic
classifiers and representational similarity analysis, to quantify to what
extent neural activation patterns encode phonemes and phoneme sequences. We
manipulate two factors that can affect the outcome of analysis. First, we
investigate the role of learning by comparing neural activations extracted from
trained versus randomly-initialized models. Second, we examine the temporal
scope of the activations by probing both local activations corresponding to a
few milliseconds of the speech signal, and global activations pooled over the
whole utterance. We conclude that reporting analysis results with randomly
initialized models is crucial, and that global-scope methods tend to yield more
consistent results; we recommend their use as a complement to local-scope
diagnostic methods.
Comment: ACL 202
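To make the global-scope comparison concrete, the following is a minimal sketch of representational similarity analysis over utterance-level vectors, applied once to a trained model and once to a randomly initialized one. The choice of cosine similarity, Pearson correlation, and all function names are assumptions for illustration; the abstract does not specify the authors' exact setup.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("similarities are constant; correlation is undefined")
    return cov / math.sqrt(vx * vy)


def pairwise_similarities(reps):
    """Similarity for every unordered pair of stimuli (upper triangle)."""
    return [cosine(reps[i], reps[j]) for i, j in combinations(range(len(reps)), 2)]


def rsa_score(reps_a, reps_b):
    """Correlate the similarity structure of two representation spaces.

    reps_a and reps_b hold one vector per stimulus, in the same stimulus
    order; the vector dimensionality may differ between the two spaces.
    """
    if len(reps_a) != len(reps_b):
        raise ValueError("both spaces must cover the same stimuli")
    return pearson(pairwise_similarities(reps_a), pairwise_similarities(reps_b))


# Hypothetical toy data: one pooled vector per utterance.
reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
trained = [[2.0, 0.0], [0.0, 3.0], [5.0, 5.0], [4.0, -4.0]]
random_init = [[1.0, 1.0], [1.0, 0.9], [-1.0, 0.2], [0.3, 1.0]]

trained_score = rsa_score(reference, trained)
random_score = rsa_score(reference, random_init)
```

Reporting `trained_score` next to `random_score` follows the abstract's recommendation: a score is only informative relative to what an untrained network with the same architecture already yields.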
Acoustic Modelling for Under-Resourced Languages
Automatic speech recognition systems have so far been developed only for very few languages out of the 4,000-7,000 existing ones.
In this thesis we examine methods to rapidly create acoustic models in new, possibly under-resourced languages, in a time- and cost-effective manner. For this we examine the use of multilingual models, the application of articulatory features across languages, and the automatic discovery of word-like units in unwritten languages.
Syntax, morphology, and phonology in text-to-speech systems
The paper is concerned with the integration of linguistic information in text-to-speech systems. Research in synthesis proper is at a stage where the need for systematic integration of comprehensive linguistic information in such systems is making itself felt more than ever. A surface structure parsing system is presented whose main virtue is that it permits linguists to express syntactic as well as lexical and morphological regularities and irregularities of a language in a simple and easy-to-learn formalism. Most aspects of the system are seen in the light of Danish and, sporadically, English and Finnish surface structure.
VoxArabica: A Robust Dialect-Aware Arabic Speech Recognition System
Arabic is a complex language with many varieties and dialects spoken by over
450 million people around the world. Due to the linguistic diversity and
variations, it is challenging to build a robust and generalized ASR system for
Arabic. In this work, we address this gap by developing and demoing a system,
dubbed VoxArabica, for dialect identification (DID) as well as automatic speech
recognition (ASR) of Arabic. We train a wide range of models such as HuBERT
(DID), Whisper, and XLS-R (ASR) in a supervised setting for Arabic DID and ASR
tasks. Our DID models are trained to identify 17 different dialects in addition
to MSA. We finetune our ASR models on MSA, Egyptian, Moroccan, and mixed data.
Additionally, for the remaining dialects in ASR, we provide the option to
choose various models such as Whisper and MMS in a zero-shot setting. We
integrate these models into a single web interface with diverse features such
as audio recording, file upload, model selection, and the option to raise flags
for incorrect outputs. Overall, we believe VoxArabica will be useful for a wide
range of audiences concerned with Arabic research. Our system is currently
running at https://cdce-206-12-100-168.ngrok.io/.
Comment: Accepted at ArabicNLP conference co-located with EMNLP'23. First
three authors contributed equally.
MISPRONUNCIATION DETECTION AND DIAGNOSIS IN MANDARIN ACCENTED ENGLISH SPEECH
This work presents the development, implementation, and evaluation of a Mispronunciation Detection and Diagnosis (MDD) system, with application to pronunciation evaluation of Mandarin-accented English speech. A comprehensive detection and diagnosis of errors in the Electromagnetic Articulography corpus of Mandarin-Accented English (EMA-MAE) was performed by using the expert phonetic transcripts and an Automatic Speech Recognition (ASR) system. Articulatory features derived from the parallel kinematic data available in the EMA-MAE corpus were used to identify the most significant articulatory error patterns seen in L2 speakers during common mispronunciations. Using both acoustic and articulatory information, an ASR-based MDD system was built and evaluated across different feature combinations and Deep Neural Network (DNN) architectures. The MDD system captured mispronunciation errors with a detection accuracy of 82.4%, a diagnostic accuracy of 75.8%, and a false rejection rate of 17.2%. The results demonstrate the advantage of using articulatory features in revealing the significant contributors of mispronunciation as well as improving the performance of MDD systems.
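For readers unfamiliar with the three reported figures, here is a minimal sketch of how detection accuracy, diagnostic accuracy, and false rejection rate are commonly computed from phone-level outcome counts. The definitions follow a widely used MDD evaluation hierarchy and are an assumption here; the abstract does not state the exact formulas used. The class name, field names, and counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MddCounts:
    """Phone-level outcome counts (hypothetical container)."""
    true_accept: int        # correct pronunciation, accepted as correct
    false_reject: int       # correct pronunciation, flagged as an error
    false_accept: int       # mispronunciation, accepted as correct
    correct_diagnosis: int  # mispronunciation detected, error type right
    diagnosis_error: int    # mispronunciation detected, error type wrong


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def mdd_metrics(c):
    """Compute the three summary metrics under the assumed definitions."""
    true_reject = c.correct_diagnosis + c.diagnosis_error
    total = c.true_accept + c.false_reject + c.false_accept + true_reject
    return {
        # Share of all phones whose accept/reject decision was right.
        "detection_accuracy": _ratio(c.true_accept + true_reject, total),
        # Share of detected mispronunciations with the right error type.
        "diagnostic_accuracy": _ratio(c.correct_diagnosis, true_reject),
        # Share of correctly pronounced phones wrongly flagged as errors.
        "false_rejection_rate": _ratio(
            c.false_reject, c.true_accept + c.false_reject
        ),
    }


# Made-up counts for illustration only; not taken from the EMA-MAE corpus.
example = MddCounts(true_accept=80, false_reject=20, false_accept=10,
                    correct_diagnosis=30, diagnosis_error=10)
metrics = mdd_metrics(example)
```

Under these definitions the false rejection rate is normalized by correctly pronounced phones only, so it is not the complement of detection accuracy.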
Examining skilled reading processes
Skilled reading often occurs with little effort. However, when basic reading processes are analyzed in detail, the illusion of simplicity is removed. The present research focuses on the proficiency with which a skilled reader can successfully access lexical (i.e., whole-word) and sublexical (i.e., sub-word) levels of orthographic and phonological knowledge. In particular, I will address questions pertaining to: (1) the nature of the connections between sub-processes of basic visual word recognition, (2) the degree to which context affects whole-word versus sub-word processing, and (3) whether there are neuroanatomical correlates that correspond to the sub-processes of basic visual word recognition. The findings presented in this set of experiments support: (1) facilitation-dominant connections from orthography to phonology, (2) context-related whole-word and sub-word processing, and (3) lexical and sublexical neuroanatomical correlates of basic reading processes. The findings are discussed with respect to the issue of whether there is a single processing route from orthography to phonology or two.
Wave to Syntax: Probing spoken language models for syntax
Understanding which information is encoded in deep models of spoken and
written language has been the focus of much research in recent years, as it is
crucial for debugging and improving these architectures. Most previous work has
focused on probing for speaker characteristics, acoustic and phonological
information in models of spoken language, and for syntactic information in
models of written language. Here we focus on the encoding of syntax in several
self-supervised and visually grounded models of spoken language. We employ two
complementary probing methods, combined with baselines and reference
representations to quantify the degree to which syntactic structure is encoded
in the activations of the target models. We show that syntax is captured most
prominently in the middle layers of the networks, and more explicitly within
models with more parameters.
Comment: Accepted to Interspeech 202