Dyslexia Impairs Speech Recognition but Can Spare Phonological Competence
Dyslexia is associated with numerous deficits in speech processing. Accordingly, a large literature asserts that dyslexics manifest a phonological deficit. Few studies, however, have assessed the phonological grammar of dyslexics, and none has distinguished a phonological deficit from a phonetic impairment. Here, we show that these two sources can be dissociated. Three experiments demonstrate that the group of adult dyslexics studied here is impaired in phonetic discrimination (e.g., ba vs. pa), and their deficit compromises even the basic ability to identify acoustic stimuli as human speech. Remarkably, the ability of these individuals to generalize grammatical phonological rules is intact. Like typical readers, these Hebrew-speaking dyslexics identified ill-formed AAB stems (e.g., titug) as less wordlike than well-formed ABB controls (e.g., gitut), and both groups automatically extended this rule to nonspeech stimuli, irrespective of reading ability. The contrast between the phonetic and phonological capacities of these individuals demonstrates that the algebraic engine that generates phonological patterns is distinct from the phonetic interface that implements them. While dyslexia compromises the phonetic system, certain core aspects of the phonological grammar can be spared.
Exploiting Phoneme Similarities in Hybrid HMM-ANN Keyword Spotting
We propose a technique for generating alternative models for keywords in a hybrid hidden Markov model - artificial neural network (HMM-ANN) keyword spotting paradigm. Given a base pronunciation for a keyword from the lookup dictionary, our algorithm generates a new model for the keyword that takes into account the systematic errors made by the neural network while avoiding models that can be confused with other words in the language. The new keyword model improves the keyword detection rate while minimally increasing the number of false alarms.
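To make the technique concrete, here is a minimal sketch of how alternative keyword models could be derived from a phoneme confusion matrix and a lookup dictionary. The confusion values, toy lexicon, and min_prob threshold are invented for illustration and are not taken from the paper.

```python
import itertools

# Hypothetical phoneme confusion matrix P(decoded | true), estimated from
# the ANN's systematic errors on a development set (values invented).
CONFUSION = {
    "b": {"b": 0.85, "p": 0.15},
    "a": {"a": 0.95, "e": 0.05},
    "t": {"t": 0.90, "d": 0.10},
}

# Toy lookup dictionary: keyword -> base pronunciation.
LEXICON = {"bat": ["b", "a", "t"], "pad": ["p", "a", "d"]}

def alternative_models(keyword, min_prob=0.1):
    """Expand the base pronunciation with frequently confused phones,
    dropping variants that collide with other words in the lexicon."""
    base = LEXICON[keyword]
    options = [
        [p for p, prob in CONFUSION.get(ph, {ph: 1.0}).items()
         if prob >= min_prob] or [ph]
        for ph in base
    ]
    variants = {tuple(v) for v in itertools.product(*options)}
    other_words = {tuple(pron) for w, pron in LEXICON.items() if w != keyword}
    return sorted(variants - other_words)

print(alternative_models("bat"))
# [('b', 'a', 'd'), ('b', 'a', 't'), ('p', 'a', 't')] -- the variant 'pad'
# is excluded because it coincides with another lexicon entry.
```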
Acoustic-phonetic decoding of speech: problems and solutions
Acoustic-phonetic decoding constitutes a major step in the process of continuous speech recognition. This paper first reviews the difficulties of the problem, together with the main methods proposed so far to solve it. We then concentrate on the different complementary approaches that have been investigated by our group: an expert system based on spectrogram reading, recognition by phonetic triphones, a connectionist model based on the cortical column unit, and stochastic recognition without segmentation.
Subphonemic Sensitivity in Low Literacy Adults
The link between phonological abilities and reading skills has been well established in both typical and atypical language development. However, the nature of the phonological deficits in poor readers remains a debated topic. While poor readers have mostly been assumed to have underspecified or “fuzzy” phonological representations (Tallal et al., 1998), the opposite alternative, over-specified phonological representations, has also been hypothesized (Serniclaes, 2006). To examine the two phonological hypotheses, the current study used the eye-tracking paradigm of Dahan et al. (2001) to investigate sensitivity to subphonemic information in young adults with a wide range of reading abilities. Our findings suggested a trend of higher sensitivity to subphonemic information in lower-ability readers, consistent with the over-specification hypothesis. In addition, our sample, drawn from a lower range of socio-economic status, highlighted the need to take environmental factors into consideration for theoretical and practical purposes in reading acquisition.
Phonetic aware techniques for Speaker Verification
The goal of this thesis is to improve current state-of-the-art techniques in speaker verification (SV), typically based on "identity-vectors" (i-vectors) and deep neural networks (DNN), by exploiting diverse (phonetic) information extracted using various techniques such as automatic speech recognition (ASR). Different speakers span different subspaces within a universal acoustic space, usually modelled by a "universal background model". The speaker-specific subspace depends on the speaker's voice characteristics, but also on the verbalised text of a speaker. In current state-of-the-art SV systems, i-vectors are extracted by applying a factor analysis technique to obtain a low-dimensional speaker-specific representation. Furthermore, DNN outputs are also employed in a conventional i-vector framework to model the phonetic information embedded in the speech signal. This thesis proposes various techniques to exploit the phonetic knowledge of speech to further enrich speaker characteristics.
More specifically, the techniques proposed in this thesis are applied to various SV tasks, namely text-independent and text-dependent SV. For the text-independent SV task, several ASR systems are developed and applied to compute phonetic posterior probabilities, which are subsequently exploited to enhance the speaker-specific information included in i-vectors. These approaches are then extended to the text-dependent SV task, exploiting temporal information in a principled way, i.e., by using dynamic time warping applied to speaker-informative vectors.
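A minimal sketch of the alignment step, assuming per-frame speaker-informative vectors and a cosine frame cost; both the cost choice and the toy sequences are illustrative, not the thesis's exact setup.

```python
import numpy as np

def dtw_distance(X, Y):
    """Dynamic time warping between two (T, D) vector sequences."""
    def cost(x, y):  # cosine distance between two frames
        return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost(X[i - 1], Y[j - 1]) + min(
                D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)  # length-normalised alignment cost

enroll = np.random.randn(50, 64)  # enrolment utterance: 50 frames, 64-dim
test = np.random.randn(60, 64)    # test utterance: 60 frames, 64-dim
print(dtw_distance(enroll, test))
```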
Finally, as opposed to training the DNN with phonetic information, the DNN is trained in an end-to-end fashion to directly discriminate between speakers. The baseline end-to-end SV approach consists of mapping a variable-length speech segment to a fixed-dimensional speaker vector by estimating the mean of the hidden representations in the DNN structure. We improve upon this technique by computing a distance function between two utterances which takes into account common phonetic units. The whole network is optimized by employing a triplet-loss objective function.
The proposed approaches are evaluated on commonly used datasets such as NIST SRE 2010
and RSR2015. Significant improvements are observed over the baseline systems on both the
text-dependent and text-independent SV tasks by applying phonetic knowledge.
Improving the Generalizability of Speech Emotion Recognition: Methods for Handling Data and Label Variability
Emotion is an essential component in our interaction with others. It transmits information that helps us interpret the content of what others say. Therefore, detecting emotion from speech is an important step towards enabling machine understanding of human behaviors and intentions. Researchers have demonstrated the potential of emotion recognition in areas such as interactive systems in smart homes and mobile devices, computer games, and computational medical assistants. However, emotion communication is variable: individuals may express emotion in a manner that is uniquely their own; different speech content and environments may shape how emotion is expressed and recorded; individuals may perceive emotional messages differently. Practically, this variability is reflected in both the audio-visual data and the labels used to create speech emotion recognition (SER) systems. SER systems must be robust and generalizable to handle the variability effectively.
The focus of this dissertation is on the development of speech emotion recognition systems that handle variability in emotion communications. We break the dissertation into three parts, according to the type of variability we address: (I) in the data, (II) in the labels, and (III) in both the data and the labels.
Part I: The first part of this dissertation focuses on handling variability present in the data. We approximate variations in environmental properties and expression styles by the corpus and the gender of the speakers. We find that training on multiple corpora and controlling for the variability in gender and corpus using multi-task learning result in more generalizable models, compared to traditional single-task models that do not take corpus and gender variability into account. Another source of variability present in the recordings used in SER is the phonetic modulation of acoustics. At the same time, phonemes also provide information about the emotion expressed in speech content. We discover that we can make more accurate predictions of emotion by explicitly considering both roles of phonemes.
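A minimal sketch of the multi-task setup described above, assuming a shared encoder with separate emotion, corpus, and gender heads; all dimensions, auxiliary-task weights, and labels are invented for illustration.

```python
import torch
import torch.nn as nn

class MultiTaskSER(nn.Module):
    """Shared encoder with emotion (main) and corpus/gender (auxiliary) heads."""
    def __init__(self, n_feats=40, n_emotions=4, n_corpora=3):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(n_feats, 128), nn.ReLU())
        self.emotion = nn.Linear(128, n_emotions)
        self.corpus = nn.Linear(128, n_corpora)
        self.gender = nn.Linear(128, 2)

    def forward(self, x):
        h = self.shared(x)
        return self.emotion(h), self.corpus(h), self.gender(h)

model = MultiTaskSER()
x = torch.randn(8, 40)  # a batch of utterance-level acoustic features
emo, corp, gen = model(x)
ce = nn.CrossEntropyLoss()
loss = (ce(emo, torch.randint(4, (8,)))           # main task: emotion
        + 0.3 * ce(corp, torch.randint(3, (8,)))  # auxiliary: corpus
        + 0.3 * ce(gen, torch.randint(2, (8,))))  # auxiliary: gender
loss.backward()  # the shared encoder accounts for all three factors
```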
Part II: The second part of this dissertation addresses variability present in emotion labels, including the differences between emotion expression and perception, and the variations in emotion perception. We discover that it is beneficial to jointly model both the perception of others and how one perceives one’s own expression, compared to focusing on either one. Further, we show that the variability in emotion perception is a modelable signal and can be captured using probability distributions that describe how groups of evaluators perceive emotional messages.
Part III: The last part of this dissertation presents methods that handle variability in both data and labels. We reduce the data variability due to non-emotional factors using deep metric learning and model the variability in emotion perception using soft labels. We propose a family of loss functions and show that, by pairing examples that potentially vary in expression style and lexical content while preserving the real-valued emotional similarity between them, we develop systems that generalize better across datasets and are more robust to over-training.
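As a rough illustration of the pairing idea, the sketch below trains embedding similarity to preserve the real-valued emotional similarity between two utterances' soft labels; the total-variation similarity and all tensors are assumptions, not the dissertation's exact loss family.

```python
import torch
import torch.nn.functional as F

def soft_label_similarity(p, q):
    """Similarity between two evaluator label distributions
    (1 minus total variation distance)."""
    return 1.0 - 0.5 * (p - q).abs().sum()

def pairing_loss(emb_a, emb_b, label_a, label_b):
    """Make the embedding similarity of a pair match the real-valued
    emotional similarity of their soft labels."""
    emb_sim = F.cosine_similarity(emb_a, emb_b, dim=0)
    return (emb_sim - soft_label_similarity(label_a, label_b)) ** 2

emb_a = torch.randn(64, requires_grad=True)  # embeddings of two utterances
emb_b = torch.randn(64, requires_grad=True)  # differing in style/content
label_a = torch.tensor([0.6, 0.3, 0.1])  # evaluator votes: happy/neutral/sad
label_b = torch.tensor([0.5, 0.4, 0.1])
pairing_loss(emb_a, emb_b, label_a, label_b).backward()
```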
These works demonstrate the importance of considering data and label variability in the creation of robust and generalizable emotion recognition systems. We conclude this dissertation with the following future directions: (1) the development of real-time SER systems; (2) the personalization of general SER systems.
A syllable-based investigation of coarticulation
Coarticulation has long been investigated in Speech Sciences and Linguistics (Kühnert & Nolan, 1999). This thesis explores coarticulation through a syllable-based model (Y. Xu, 2020). First, it is hypothesised that the consonant and vowel are synchronised at the syllable
onset for the sake of reducing temporal degrees of freedom, and such synchronisation
is the essence of coarticulation. Previous efforts in the examination of CV alignment
mainly report onset asynchrony (Gao, 2009; Shaw & Chen, 2019). The first study of this
thesis tested the synchrony hypothesis using articulatory and acoustic data in Mandarin.
Departing from conventional approaches, a minimal triplet paradigm was applied, in
which the CV onsets were determined through the consonant and vowel minimal pairs,
respectively. Both articulatory and acoustical results showed that CV articulation started
in close temporal proximity, supporting the synchrony hypothesis. The second study
extended the research to English and syllables with cluster onsets. By using acoustic data
in conjunction with Deep Learning, supporting evidence was found for co-onset, which
is in contrast to the widely reported c-center effect (Byrd, 1995). Second, the thesis investigated a mechanism that can maximise synchrony: Dimension Specific Sequential Target Approximation (DSSTA), which is highly relevant to what is commonly known as coarticulation resistance (Recasens & Espinosa, 2009). Evidence from the first two studies shows that, when conflicts arise from the articulation requirements of the consonant and vowel, the CV gestures can be fulfilled by the same articulator on separate dimensions simultaneously.
Last but not least, the final study tested the hypothesis that resyllabification is the result of coarticulation asymmetry between onset and coda consonants. It was found that neural-network-based models could infer the syllable affiliation of consonants, and that the inferred resyllabified codas had a coarticulatory structure similar to that of canonical onset consonants. In conclusion, this thesis found that many coarticulation-related phenomena, including local vowel-to-vowel anticipatory coarticulation, coarticulation resistance, and resyllabification, stem from the articulatory mechanism of the syllable.
Speech Self-Supervised Representations Benchmarking: a Case for Larger Probing Heads
Self-supervised learning (SSL) leverages large datasets of unlabeled speech
to reach impressive performance with reduced amounts of annotated data. The
large number of proposed approaches has fostered the emergence of comprehensive
benchmarks that evaluate their performance on a set of downstream tasks
exploring various aspects of the speech signal. However, while the number of
considered tasks has been growing, most proposals rely upon a single downstream
architecture that maps the frozen SSL representations to the task labels. This
study examines how benchmarking results are affected by changes in the probing
head architecture. Interestingly, we found that altering the structure of the downstream architecture leads to significant fluctuations in the performance ranking of the evaluated models. Departing from common practice in speech SSL benchmarking, we evaluate larger-capacity probing heads and show their impact on performance, inference costs, generalization, and multi-level feature exploitation.
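For concreteness, here is a hedged sketch of the variable under study: the same frozen SSL features probed with a linear head versus a larger-capacity MLP head. Feature dimension, class count, and head sizes are placeholders.

```python
import torch
import torch.nn as nn

ssl_features = torch.randn(32, 768)  # frozen SSL representations (no grad)
n_classes = 10

linear_probe = nn.Linear(768, n_classes)  # the common single-layer choice
mlp_probe = nn.Sequential(                # a larger-capacity probing head
    nn.Linear(768, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, n_classes),
)

# Only the probe's parameters are trained; the SSL encoder stays frozen,
# so benchmark rankings can shift with the probe's capacity alone.
for probe in (linear_probe, mlp_probe):
    logits = probe(ssl_features)
    loss = nn.functional.cross_entropy(logits, torch.randint(n_classes, (32,)))
    loss.backward()
```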
How Linguistic Chickens Help Spot Spoken-Eggs: Phonological Constraints on Speech Identification
It has long been known that the identification of aural stimuli as speech is context-dependent (Remez et al., 1981). Here, we demonstrate that the discrimination of speech stimuli from their non-speech transforms is further modulated by their linguistic structure. We gauge the effect of phonological structure on discrimination across different manifestations of well-formedness in two distinct languages. One case examines the restrictions on English syllables (e.g., the well-formed melif vs. ill-formed mlif); another investigates the constraints on Hebrew stems by comparing ill-formed AAB stems (e.g., TiTuG) with well-formed ABB and ABC controls (e.g., GiTuT, MiGuS). In both cases, non-speech stimuli that conform to well-formed structures are harder to discriminate from speech than stimuli that conform to ill-formed structures. Auxiliary experiments rule out alternative acoustic explanations for this phenomenon. In English, we show that acoustic manipulations that mimic the mlif–melif contrast do not impair the classification of non-speech stimuli whose structure is well-formed (i.e., disyllables with phonetically short vs. long tonic vowels). Similarly, non-speech stimuli that are ill-formed in Hebrew present no difficulties to English speakers. Thus, non-speech stimuli are harder to classify only when they are well-formed in the participants’ native language. We conclude that the classification of non-speech stimuli is modulated by their linguistic structure: inputs that support well-formed outputs are more readily classified as speech.