Speaker Recognition using Supra-segmental Level Excitation Information
Speaker-specific information present in the excitation signal is mostly viewed at the sub-segmental, segmental and supra-segmental levels. In this work, the supra-segmental level information is explored for recognizing speakers. An earlier study has shown that the combined use of pitch and epoch strength vectors provides useful supra-segmental information. However, the speaker recognition accuracy achieved by supra-segmental level features is relatively poor compared to the source information at other levels. It may be that the modulation information present at the supra-segmental level of the excitation signal is not manifested properly in the pitch and epoch strength vectors. We propose a method to model the supra-segmental level modulation information from residual mel frequency cepstral coefficient (R-MFCC) trajectories. The evidence from R-MFCC trajectories combined with pitch and epoch strength vectors is proposed to represent supra-segmental information. Experimental results show that, compared to pitch and epoch strength vectors, the proposed approach provides relatively improved performance. Further, the proposed supra-segmental level information is relatively more complementary to the information at other levels.
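The trajectory modeling described above can be sketched roughly as follows. This is a minimal numpy illustration, not the authors' exact method: it assumes the residual MFCCs have already been extracted as a frames-by-coefficients matrix (the name `r_mfcc` and the choice of five DCT coefficients are hypothetical), and it summarizes each coefficient's temporal trajectory with a few low-order DCT coefficients, which capture slow modulation across the supra-segmental span.

```python
import numpy as np

def trajectory_features(r_mfcc, n_dct=5):
    """Summarize the temporal trajectory of each residual-MFCC
    coefficient with its first n_dct DCT-II coefficients, capturing
    the slow modulation over the supra-segmental span.

    r_mfcc : array of shape (n_frames, n_coeffs)
    returns: flat feature vector of length n_dct * n_coeffs
    """
    n_frames, n_coeffs = r_mfcc.shape
    # DCT-II basis over the time (frame) axis
    n = np.arange(n_frames)
    basis = np.cos(np.pi * np.outer(np.arange(n_dct), 2 * n + 1)
                   / (2 * n_frames))          # (n_dct, n_frames)
    feats = basis @ r_mfcc                    # (n_dct, n_coeffs)
    return feats.flatten()
```

A constant trajectory yields only a nonzero 0th-order term, so the higher-order coefficients isolate genuine temporal movement of each cepstral coefficient.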
Gender dependent word-level emotion detection using global spectral speech features
In this study, global spectral features extracted at the word and sentence levels are studied for speech emotion recognition. MFCCs (Mel Frequency Cepstral Coefficients) were used as the spectral information for recognition. Global spectral features representing gross statistics, such as the mean of the MFCCs, are used. This study also examines words at different positions (initial, middle and end) in a sentence separately. Word-level feature extraction is used to analyze the emotion recognition performance of words at different positions. Word boundaries are manually identified. Gender dependent and independent models are also studied to analyze the impact of gender on emotion recognition performance. Berlin's Emo-DB (Emotional Database) was used as the emotional speech dataset. The performance of different classifiers has also been studied; the classifiers include NN (Neural Network), KNN (K-Nearest Neighbor) and LDA (Linear Discriminant Analysis). The anger and neutral emotions were studied. Results showed that using all 13 MFCC coefficients provides better classification results than other combinations of MFCC coefficients for the mentioned emotions. Words at initial and ending positions provide more emotion-specific information than words at the middle position. Gender dependent models are more efficient than gender independent models; moreover, the female model is more efficient than the male model, and females exhibit emotions more clearly than males. In general, NN performs the worst compared to KNN and LDA in classifying anger and neutral. LDA performs better than KNN by almost 15% for the gender independent model and by almost 25% for the gender dependent model.
Suprasegmental representations for the modeling of fundamental frequency in statistical parametric speech synthesis
Statistical parametric speech synthesis (SPSS) has seen improvements over recent years, especially in terms of intelligibility. Synthetic speech is often clear and understandable, but it can also be bland and monotonous. Proper generation of natural speech prosody is still a largely unsolved problem. This is relevant especially in the context of expressive audiobook speech synthesis, where speech is expected to be fluid and captivating.
In general, prosody can be seen as a layer that is superimposed on the segmental (phone) sequence. Listeners can perceive the same melody or rhythm in different utterances, and the same segmental sequence can be uttered with a different prosodic layer to convey a different message. For this reason, prosody is commonly accepted to be inherently suprasegmental. It is governed by longer units within the utterance (e.g. syllables, words, phrases) and beyond the utterance (e.g. discourse). However, common techniques for the modeling of speech prosody - and speech in general - operate mainly on very short intervals, either at the state or frame level, in both hidden Markov model (HMM) and deep neural network (DNN) based speech synthesis.
This thesis presents contributions supporting the claim that stronger representations of suprasegmental variation are essential for the natural generation of fundamental frequency for statistical parametric speech synthesis. We conceptualize the problem by dividing it into three sub-problems: (1) representations of acoustic signals, (2) representations of linguistic contexts, and (3) the mapping of one representation to another. The contributions of this thesis provide novel methods and insights relating to these three sub-problems.
In terms of sub-problem 1, we propose a multi-level representation of f0 using the continuous wavelet transform and the discrete cosine transform, as well as a wavelet-based decomposition strategy that is linguistically and perceptually motivated. In terms of sub-problem 2, we investigate additional linguistic features such as text-derived word embeddings and syllable bag-of-phones, and we propose a novel method for learning word vector representations based on acoustic counts. Finally, considering sub-problem 3, insights are given regarding hierarchical models such as parallel and cascaded deep neural networks.
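As an illustration of the DCT side of such a multi-level representation (a simplified sketch, not the thesis's exact decomposition; the band boundaries here are arbitrary assumptions), an interpolated log-f0 contour can be split into coarse-to-fine temporal levels by partitioning its DCT spectrum, with the lowest coefficients carrying the phrase-level trend and higher ones more local, word- or syllable-scale movement:

```python
import numpy as np
from scipy.fft import dct, idct

def f0_levels(f0, bands=((0, 4), (4, 16), (16, 64))):
    """Decompose an f0 contour into temporal levels by partitioning
    its DCT-II spectrum into coefficient bands and inverting each
    band separately.

    f0    : 1-D array (interpolated, e.g. log-f0)
    bands : (lo, hi) coefficient ranges, coarse to fine
    returns: list of contours, one per band, summing to f0 when
             the bands tile the whole spectrum
    """
    c = dct(f0, norm='ortho')
    levels = []
    for lo, hi in bands:
        band = np.zeros_like(c)
        band[lo:min(hi, len(c))] = c[lo:min(hi, len(c))]
        levels.append(idct(band, norm='ortho'))
    return levels
```

Because the bands tile the spectrum, the levels sum back to the original contour, which makes the decomposition convenient as a set of separate prediction targets that can be recombined at synthesis time.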
Cross-language speech perception in context: advantages for recent language learners and variation across language-specific acoustic cues
This dissertation explores the relationship between language experience and sensitivity to language-specific segmental cues by comparing cross-language speech perception in monolingual English listeners and Spanish-English bilinguals. The three studies in this project use a novel language categorization task to test language-segment associations in listeners’ first and second languages. Listener sensitivity is compared at two stages of development and across a variety of language backgrounds. These studies provide a more complete analysis of listeners’ language-specific phonological categories than offered in previous work by using word-length stimuli to evaluate segments in phonological contexts and by testing speech perception in listeners’ first language as well as their second language. The inclusion of bilingual children also allows connections to be drawn between previous work on infants’ perception of segments and the sensitivities of bilingual adults. In three experiments, participants categorized nonce words containing different classes of English- and Spanish-specific sounds as sounding more English-like or Spanish-like; target segments were either a phonemic cue, a cue for which there is no analogous sound in the other language, or a phonetic cue, a cue for which English and Spanish share the category but for which each language varies in its phonetic implementation. The results reveal a largely consistent categorization pattern across target segments. Listeners from all groups succeeded and struggled with the same subsets of language-specific segments. The same pattern of results held in a task where more time was given to make categorization decisions. Interestingly, for some segments the late bilinguals were significantly more accurate than monolingual and early bilingual listeners, and this was the case for the English phonemic cues. 
There were few differences in the sensitivity of monolinguals and early bilinguals to language-specific cues, suggesting that the early bilinguals' exposure to Spanish did not fundamentally change their representations of English phonology, but neither did their proficiency in Spanish give them an advantage over monolinguals. The comparison of adult listeners with children indicates that the Spanish-speaking children who grow to be early bilingual adults categorize segments more accurately than monolinguals, a pattern that is neutralized in the adult results. These findings suggest that variation in listener sensitivity to language-specific cues is largely driven by inherent differences in the salience of the segments themselves. Listener language experience modulates the salience of some of these sounds, and these differences in cross-language speech perception may reflect how recently a language was learned and under what circumstances.
Representation and variation in substance-free phonology: A case study in Celtic
This thesis presents a comprehensive analysis of the phonological patterns of two varieties of Brythonic Celtic in the framework of substance-free phonology. I argue that cross-linguistic variation in sound patterns does not derive solely from differences in grammars (implemented as Optimality Theoretic constraint rankings). Instead, I adopt the substance-free framework, based on the principle of modularity and autonomy of the phonological component, to account for cross-linguistic phonological and phonetic variation. Phonological representations in substance-free phonology are built up without regard to the physical implementation of phonological units, on the basis of the system of contrasts and patterns of alternation. Although this insight is not new when couched in terms of a language-specific assignment of a set of universal phonological features, I argue that the mapping between phonology and phonetics is also not universal and deterministic, and reject the universality of the feature set. Instead, I argue for a rich interface between phonology and phonetics.
Based on this understanding of the nature of variation, I provide a holistic analysis of the sound systems of two closely related languages: Pembrokeshire Welsh and Bothoa Breton. I propose an account in terms of a rich representational theory. Among other proposals, I defend the need for surface ternary contrasts, which I propose to implement using feature geometry. I also show that the substance-free approach, which decouples phonological representation from phonetic realization, strikes the correct balance between innatist and emergentist approaches to phonological markedness; I demonstrate this by way of an extensive case study of laryngeal phonology, which leads to a reinterpretation of the approach known as 'laryngeal realism'. I also argue that the phonological component of grammar should allow constraints with prima facie undesirable factorial consequences, if such constraints are needed to account for functionally unmotivated sound patterns, and I discuss the consequences of this approach for the substance-free nature of phonological computation.
An Overview of Indian Spoken Language Recognition from Machine Learning Perspective
Automatic spoken language identification (LID) is a very important research field in the era of multilingual voice-command-based human-computer interaction (HCI). A front-end LID module helps to improve the performance of many speech-based applications in the multilingual scenario. India is a populous country with diverse cultures and languages. The majority of the Indian population needs to use their respective native languages for verbal interaction with machines. Therefore, the development of efficient Indian spoken language recognition systems is useful for adopting smart technologies in every section of Indian society. The field of Indian LID has started gaining momentum in the last two decades, mainly due to the development of several standard multilingual speech corpora for the Indian languages. Even though significant research progress has already been made in this field, to the best of our knowledge, there are not many attempts to review it analytically and collectively. In this work, we have conducted one of the very first attempts to present a comprehensive review of the Indian spoken language recognition research field. An in-depth analysis is presented to emphasize the unique challenges of low resources and mutual influences for developing LID systems in the Indian context. Several essential aspects of Indian LID research are discussed, such as a detailed description of the available speech corpora, the major research contributions, from the earlier attempts based on statistical modeling to the recent approaches based on different neural network architectures, and the future research trends. This review will help any active researcher or research enthusiast from related fields assess the state of present Indian LID research.
Stress recognition from speech signal
This doctoral thesis is focused on the development of algorithms for psychological stress detection in the speech signal. Its novelty lies in two different analyses of the speech signal: the analysis of vowel polygons and the analysis of glottal pulses. The experiments performed show that both of these fundamental analyses can be used for psychological stress detection in speech. The best results were achieved with the Closing-To-Opening phase ratio feature under the Top-To-Bottom criterion in the amplitude-domain analysis of glottal pulses, in combination with a properly chosen classifier; stress detection based on this analysis can be regarded as language and phoneme independent, reaching up to 95% accuracy in some cases. All experiments were performed on a newly developed Czech database of real stress, and some experiments were also performed on the English stress database SUSAS. The variety of potentially effective approaches to stress recognition in speech suggests that very high recognition accuracy could be reached by combining them, or by using them to detect other speaker states, which has to be further tested and verified on appropriate databases.