179 research outputs found
A model of sonority based on pitch intelligibility
Synopsis:
Sonority is a central notion in phonetics and phonology and it is essential for generalizations related to syllabic organization. However, to date there is no clear consensus on the phonetic basis of sonority, neither in perception nor in production. The widely used Sonority Sequencing Principle (SSP) represents the speech signal as a sequence of discrete units, where phonological processes are modeled as symbol manipulating rules that lack a temporal dimension and are devoid of inherent links to perceptual, motoric or cognitive processes. The current work aims to change this by outlining a novel approach for the extraction of continuous entities from acoustic space in order to model dynamic aspects of phonological perception. It is used here to advance a functional understanding of sonority as a universal aspect of prosody that requires pitch-bearing syllables as the building blocks of speech.
This book argues that sonority is best understood as a measurement of pitch intelligibility in perception, which is closely linked to periodic energy in acoustics. It presents a novel principle for sonority-based determinations of well-formedness – the Nucleus Attraction Principle (NAP). Two complementary NAP models independently account for symbolic and continuous representations and they mostly outperform SSP-based models, demonstrated here with experimental perception studies and with a corpus study of Modern Hebrew nouns.
This work also includes a description of ProPer (Prosodic Analysis with Periodic Energy). The ProPer toolbox further exploits the proposal that periodic energy reflects sonority in order to cover major topics in prosodic research, such as prominence, intonation and speech rate. The book is finally concluded with brief discussions on selected topics: (i) the phonotactic division of labor with respect to /s/-stop clusters; (ii) the debate about the universality of sonority; and (iii) the fate of the classic phonetics–phonology dichotomy as it relates to continuity and dynamics in phonology
Recommended from our members
Deep Learning for Automatic Assessment and Feedback of Spoken English
Growing global demand for learning a second language (L2), particularly English, has led to
considerable interest in automatic spoken language assessment, whether for use in computerassisted language learning (CALL) tools or for grading candidates for formal qualifications.
This thesis presents research conducted into the automatic assessment of spontaneous nonnative English speech, with a view to be able to provide meaningful feedback to learners. One
of the challenges in automatic spoken language assessment is giving candidates feedback on
particular aspects, or views, of their spoken language proficiency, in addition to the overall
holistic score normally provided. Another is detecting pronunciation and other types of errors
at the word or utterance level and feeding them back to the learner in a useful way.
It is usually difficult to obtain accurate training data with separate scores for different
views and, as examiners are often trained to give holistic grades, single-view scores can
suffer issues of consistency. Conversely, holistic scores are available for various standard
assessment tasks such as Linguaskill. An investigation is thus conducted into whether
assessment scores linked to particular views of the speaker’s ability can be obtained from
systems trained using only holistic scores.
End-to-end neural systems are designed with structures and forms of input tuned to single
views, specifically each of pronunciation, rhythm, intonation and text. By training each
system on large quantities of candidate data, individual-view information should be possible
to extract. The relationships between the predictions of each system are evaluated to examine
whether they are, in fact, extracting different information about the speaker. Three methods
of combining the systems to predict holistic score are investigated, namely averaging their
predictions and concatenating and attending over their intermediate representations. The
combined graders are compared to each other and to baseline approaches.
The tasks of error detection and error tendency diagnosis become particularly challenging
when the speech in question is spontaneous and particularly given the challenges posed by
the inconsistency of human annotation of pronunciation errors. An approach to these tasks is
presented by distinguishing between lexical errors, wherein the speaker does not know how a
particular word is pronounced, and accent errors, wherein the candidate’s speech exhibits
consistent patterns of phone substitution, deletion and insertion. Three annotated corpora
x
of non-native English speech by speakers of multiple L1s are analysed, the consistency of
human annotation investigated and a method presented for detecting individual accent and
lexical errors and diagnosing accent error tendencies at the speaker level
Deep audio-visual speech recognition
Decades of research in acoustic speech recognition have led to systems that we use in our everyday life. However, even the most advanced speech recognition systems fail in the presence of noise. The degraded performance can be compensated by introducing visual speech information. However, Visual Speech Recognition (VSR) in naturalistic conditions is very challenging, in part due to the lack of architectures and annotations.
This thesis contributes towards the problem of Audio-Visual Speech Recognition (AVSR) from different aspects. Firstly, we develop AVSR models for isolated words. In contrast to previous state-of-the-art methods that consists of a two-step approach, feature extraction and recognition, we present an End-to-End (E2E) approach inside a deep neural network, and this has led to a significant improvement in audio-only, visual-only and audio-visual experiments. We further replace Bi-directional Gated Recurrent Unit (BGRU) with Temporal Convolutional Networks (TCN) to greatly simplify the training procedure.
Secondly, we extend our AVSR model for continuous speech by presenting a hybrid Connectionist Temporal Classification (CTC)/Attention model, that can be trained in an end-to-end manner. We then propose the addition of prediction-based auxiliary tasks to a VSR model and highlight the importance of hyper-parameter optimisation and appropriate data augmentations.
Next, we present a self-supervised framework, Learning visual speech Representations from Audio via self-supervision (LiRA). Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech, and find that this pre-trained model can be leveraged towards word-level and sentence-level lip-reading.
We also investigate the Lombard effect influence in an end-to-end AVSR system, which is the first work using end-to-end deep architectures and presents results on unseen speakers. We show that even if a relatively small amount of Lombard speech is added to the training set then the performance in a real scenario, where noisy Lombard speech is present, can be significantly improved.
Lastly, we propose a detection method against adversarial examples in an AVSR system, where the strong correlation between audio and visual streams is leveraged. The synchronisation confidence score is leveraged as a proxy for audio-visual correlation and based on it, we can detect adversarial attacks. We apply recent adversarial attacks on two AVSR models and the experimental results demonstrate that the proposed approach is an effective way for detecting such attacks.Open Acces
Models and Analysis of Vocal Emissions for Biomedical Applications
The International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications (MAVEBA) came into being in 1999 from the particularly felt need of sharing know-how, objectives and results between areas that until then seemed quite distinct such as bioengineering, medicine and singing. MAVEBA deals with all aspects concerning the study of the human voice with applications ranging from the neonate to the adult and elderly. Over the years the initial issues have grown and spread also in other aspects of research such as occupational voice disorders, neurology, rehabilitation, image and video analysis. MAVEBA takes place every two years always in Firenze, Italy. This edition celebrates twenty years of uninterrupted and succesfully research in the field of voice analysis
Exploraciones fonolingĂĽĂsticas: V Jornadas Internacionales de FonĂ©tica y FonologĂa y I Jornadas Nacionales de FonĂ©tica y Discurso
PublishedLos estudios relativos a la fonĂ©tica y fonologĂa de las lenguas se hallan en una significativa etapa de crecimiento. En todo el mundo y desde diferentes rutas lingĂĽĂsticas, se llevan a cabo investigaciones que dan cuenta de la manera en que las melodĂas y sonidos de la lengua forman parte integral del discurso y de la significaciĂłn. Nuestro paĂs cuenta con numerosos investigadores abocados a esta área y relacionados con investigadores de prestigiosas universidades extranjeras. En esta arena, en el año 2017, tuvimos el honor de ser anfitriones de decenas de estudiosos sobre el tema y de cientos de jĂłvenes ávidos de informarse sobre un campo de estudio tan sutil como apasionante. Este libro es fruto de una ardua tarea que muestra de quĂ© manera la fonĂ©tica y la fonologĂa son aspectos ineludibles en la investigaciĂłn y la enseñanza de las ciencias del lenguaje
Developing Sparse Representations for Anchor-Based Voice Conversion
Voice conversion is the task of transforming speech from one speaker to sound as if it was produced by another speaker, changing the identity while retaining the linguistic content. There are many methods for performing voice conversion, but oftentimes these methods have onerous training requirements or fail in instances where one speaker has a nonnative accent. To address these issues, this dissertation presents and evaluates a novel “anchor-based” representation of speech that separates speaker content from speaker identity by modeling how speakers form English phonemes.
We call the proposed method Sparse, Anchor-Based Representation of Speech (SABR), and explore methods for optimizing the parameters of this model in native-to-native and native-to-nonnative voice conversion contexts. We begin the dissertation by demonstrating how sparse coding in combination with a compact, phoneme-based dictionary can be used to separate speaker identity from content in objective and subjective tests. The formulation of the representation then presents several research questions. First, we propose a method for improving the synthesis quality by using the sparse coding residual in combination with a frequency warping algorithm to convert the residual from the source to target speaker’s space, and add it to the target speaker’s estimated spectrum. Experimentally, we find that synthesis quality is significantly improved via this transform. Second, we propose and evaluate two methods for selecting and optimizing SABR anchors in native-to-native and native-to-nonnative voice conversion. We find that synthesis quality is significantly improved by the proposed methods, especially in native-to- nonnative voice conversion over baseline algorithms. In a detailed analysis of the algorithms, we find they focus on phonemes that are difficult for nonnative speakers of English or naturally have multiple acoustic states. Following this, we examine methods for adding in temporal constraints to SABR via the Fused Lasso. The proposed method significantly reduces the inter-frame variance in the sparse codes over other methods that incorporate temporal features into sparse coding representations.
Finally, in a case study, we examine the use of the SABR methods and optimizations in the context of a computer aided pronunciation training system for building “Golden Speakers”, or ideal models for nonnative speakers of a second language to learn correct pronunciation. Under the hypothesis that the optimal “Golden Speaker” was the learner’s voice, synthesized with a native accent, we used SABR to build voice models for nonnative speakers and evaluated the resulting synthesis in terms of quality, identity, and accentedness. We found that even when deployed in the field, the SABR method generated synthesis with low accentedness and similar acoustic identity to the target speaker, validating the use of the method for building “golden speakers”
Recognition and cortical haemodynamics of vocal emotions-an fNIRS perspective
Normal-hearing listeners rely heavily on variations in the fundamental frequency (F0) of speech to identify vocal emotions. Without reliable F0 cues, as is the case for cochlear implant users, listeners’ ability to extract emotional meaning from speech is reduced. This thesis describes the development of an objective measure of vocal emotion recognition. The program of three experiments investigates: 1) NH listeners’ abilities to use F0, intensity, and speech-rate cues to recognise emotions; 2) cortical activity associated with individual vocal emotions assessed using functional near-infrared spectroscopy (fNIRS); 3) cortical activity evoked by vocal emotions in natural speech and in speech with uninformative F0 using fNIRS
Research Developments in World Englishes
This book is available as open access through the Bloomsbury Open Access programme and is available on www.bloomsburycollections.com. It is funded by the University of Klagenfurt, Austria. Discussing key issues of current relevance and setting the tone for future research in world Englishes, this book provides new perspectives on the diverse realities of Englishes around the world. Written by an international team of established and renowned scholars, it is the inaugural volume in the new series Bloomsbury Advances in World Englishes, dedicated to advancing research in the field. Chapters discuss important topics in contemporary world Englishes research, including de-colonial approaches, emerging varieties in post-protectorates and international uses as communicative events to highlight the globalizing aspect of English as a semiotic code. The book also expands on cultural conceptualizations to investigate the connections between Englishes and localized cultural knowledge and ongoing changes and attitudes towards local forms in multilingual settings. Closing with an examination of how world Englishes and the use of English as a lingua franca could influence the future teaching of Englishes, Research Developments in World Englishes presents a detailed picture of contemporary research approaches and points the way towards exciting future directions
- …