
    Methods for speaking style conversion from normal speech to high vocal effort speech

    This thesis deals with vocal-effort-focused speaking style conversion (SSC). Specifically, we studied two topics on the conversion of normal speech to high vocal effort. The first topic involves the conversion of normal speech to shouted speech. We employed this conversion in a speaker recognition system with a vocal effort mismatch between test and enrollment utterances (shouted speech vs. normal speech). The mismatch causes a degradation of the system's speaker identification performance. As a solution, we proposed an SSC system that included a novel spectral mapping, used alongside a statistical mapping technique, to transform the mel-frequency spectral energies of normal speech enrollment utterances towards their counterparts in shouted speech. We evaluated the proposed solution by comparing speaker identification rates for a state-of-the-art i-vector-based speaker recognition system, with and without applying SSC to the enrollment utterances. Our results showed that applying the proposed SSC pre-processing to the enrollment data considerably improves the speaker identification rates. The second topic involves normal-to-Lombard speech conversion. We proposed a vocoder-based parametric SSC system to perform the conversion. This system first extracts speech features using the vocoder. Next, a mapping technique, robust to data scarcity, maps the features. Finally, the vocoder synthesizes the mapped features into speech. For comparison, we used two vocoders in the conversion system: a glottal vocoder and the widely used STRAIGHT. We assessed the converted speech from the two vocoder cases with two subjective listening tests that measured similarity to Lombard speech and naturalness. The similarity test showed that, for both vocoders, our proposed SSC system was able to convert normal speech to Lombard speech. The naturalness test showed that the converted samples using the glottal vocoder were clearly more natural than those obtained with STRAIGHT.
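
    The three-stage pipeline described above (vocoder analysis, feature mapping, vocoder synthesis) can be made concrete with a short sketch. The Python below is illustrative only: it assumes librosa for mel analysis, the least-squares affine map is a simplified stand-in for the statistical mapping technique of the thesis, the parallel normal/high-effort frames are assumed to be already time-aligned, and the synthesis stage is omitted.

    ```python
    # Sketch of the feature-extraction and feature-mapping stages of a
    # parametric SSC system. Illustrative stand-in, not the thesis code.
    import numpy as np
    import librosa

    def extract_mel_energies(wav, sr, n_mels=40):
        """Stage 1: log mel-frequency spectral energies, frames as columns."""
        S = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels)
        return np.log(S + 1e-10)

    def fit_affine_mapping(X_normal, X_effort):
        """Learn a least-squares affine map from normal-effort frames to
        high-effort frames. Both inputs are (n_mels, n_frames) arrays of
        time-aligned parallel data. (The thesis uses a statistical mapping;
        this affine map is a simplified placeholder.)"""
        A = np.vstack([X_normal, np.ones((1, X_normal.shape[1]))])  # bias row
        W, *_ = np.linalg.lstsq(A.T, X_effort.T, rcond=None)
        return W.T  # (n_mels, n_mels + 1)

    def apply_mapping(W, X):
        """Stage 2: move normal-speech features towards high vocal effort.
        Stage 3 (vocoder synthesis from the mapped features) is omitted."""
        A = np.vstack([X, np.ones((1, X.shape[1]))])
        return W @ A
    ```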

    The nature of internal representation in the internal lexicon

    The first two experiments reported were concerned with the fact and growth of visual and acoustic representations of simple words in the Mental Lexicon. Using a learning paradigm, it was established that some form of visual and acoustic representation is formed within three exposures and that these forms of a word are also a basis for lexical organization. Five experiments, employing different techniques, were aimed at testing the psychological reality of the morphemic structure of prefixed words. It was established that the morphemic structure of some of these words is represented; that the identity of some prefixes is represented; and that some non-specific knowledge concerning the relationship between orthographic and prefix structure is also represented. Finally, the spelling errors of 11-year-old children were analysed. This analysis revealed that acoustic, visual (more properly graphemic), and morphemic information, as well as some knowledge of phonotactic rules and statistical regularities, are represented in the Internal Lexicon. It is concluded that the contents of the Internal Lexicon are both redundant and heterogeneous.

    A Likelihood-Ratio Based Forensic Voice Comparison in Standard Thai

    This research uses a likelihood ratio (LR) framework to assess the discriminatory power of a range of acoustic parameters extracted from speech samples produced by male speakers of Standard Thai. The thesis aims to answer two main questions: 1) to what extent the tested linguistic-phonetic segments of Standard Thai perform in forensic voice comparison (FVC); and 2) how such linguistic-phonetic segments can be profitably combined through logistic regression using the FoCal Toolkit (Brümmer, 2007). The segments focused on in this study are the four consonants /s, tɕʰ, n, m/ and the two diphthongs [ɔi, ai]. First, using the alveolar fricative /s/, two different sets of features were compared in terms of their performance in FVC. The first comprised the spectrum-based distributional features of four spectral moments, namely mean, variance, skew and kurtosis; the second consisted of the coefficients of the Discrete Cosine Transform (DCT) applied to a spectrum. As the DCT coefficients were found to perform better, they were subsequently used to model the spectra of the remaining consonants. The consonant spectrum was extracted at the center point of the /s, tɕʰ, n, m/ consonants with a Hamming window of 31.25 ms. For the diphthongs [ɔi] - [nɔi L] and [ai] - [mai HL], cubic polynomials fitted to the F2 and F1-F3 formant trajectories were tested separately. Quadratic polynomials fitted to the tonal F0 contours of [ɔi] - [nɔi L] and [ai] - [mai HL] were tested as well. Long-term F0 distribution (LTF0) was also trialled. The results show the promising discriminatory power of the Standard Thai acoustic features and segments tested in this thesis. The main findings are as follows.
    1. The fricative /s/ performed better with the DCT coefficients (Cllr = 0.70) than with the spectral moments (Cllr = 0.92).
    2. The nasals /n, m/ (Cllr = 0.47) performed better than the affricate /tɕʰ/ (Cllr = 0.54) and the fricative /s/ (Cllr = 0.70) when their DCT coefficients were parameterized.
    3. F1-F3 trajectories (Cllr = 0.42 and Cllr = 0.49) outperformed the F2 trajectory (Cllr = 0.69 and Cllr = 0.67) for both diphthongs [ɔi] and [ai].
    4. F1-F3 trajectories of the diphthong [ɔi] (Cllr = 0.42) outperformed those of [ai] (Cllr = 0.49).
    5. Tonal F0 (Cllr = 0.52) outperformed LTF0 (Cllr = 0.74).
    6. Overall, better results were obtained when the DCT coefficients of /n/ - [na: HL] and /n/ - [nɔi L] were fused (Cllr = 0.40, with the largest consistent-with-fact SSLog10LR = 2.53).
    In light of the findings, we can conclude that Standard Thai is generally amenable to FVC, especially when linguistic-phonetic segments are combined; it is recommended that the latter procedure be followed when dealing with forensically realistic casework.
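
    To make the two compared /s/ parameterisations and the evaluation metric concrete, here is a minimal Python sketch (illustrative, not the thesis code): spectral moments and DCT coefficients computed from one spectral slice, plus the log-LR cost Cllr quoted above, where lower values indicate better-calibrated discrimination.

    ```python
    # Sketch of the two /s/ feature sets compared above (spectral moments
    # vs. DCT coefficients) and of the Cllr metric. Illustrative only.
    import numpy as np
    from scipy.fft import dct

    def spectral_moments(mag_spectrum, freqs):
        """Mean, variance, skew and kurtosis of a magnitude spectrum,
        treating the normalised spectrum as a distribution over frequency."""
        p = mag_spectrum / mag_spectrum.sum()
        mean = np.sum(freqs * p)
        var = np.sum((freqs - mean) ** 2 * p)
        skew = np.sum((freqs - mean) ** 3 * p) / var ** 1.5
        kurt = np.sum((freqs - mean) ** 4 * p) / var ** 2
        return np.array([mean, var, skew, kurt])

    def dct_features(log_mag_spectrum, n_coefs=8):
        """First few DCT coefficients of the log-magnitude spectrum."""
        return dct(log_mag_spectrum, norm="ortho")[:n_coefs]

    def cllr(lr_same, lr_diff):
        """Log-likelihood-ratio cost: penalises same-speaker LRs below 1
        and different-speaker LRs above 1; lower is better."""
        return 0.5 * (np.mean(np.log2(1 + 1 / lr_same))
                      + np.mean(np.log2(1 + lr_diff)))
    ```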

    Automated Semantic Understanding of Human Emotions in Writing and Speech

    Affective Human Computer Interaction (A-HCI) will be critical for the success of new technologies that will be prevalent in the 21st century. If cell phones and the internet are any indication, there will be continued rapid development of automated assistive systems that help humans live better, more productive lives. These will not be just passive systems such as cell phones, but active assistive systems like robot aides used in hospitals, homes, entertainment rooms, offices, and other work environments. Such systems will need to be able to properly deduce human emotional state before they determine how best to interact with people. This dissertation explores and extends the body of knowledge related to Affective HCI. New semantic methodologies are developed and studied for reliable and accurate detection of human emotional states and magnitudes in written and spoken language, and for mapping emotional states and magnitudes to 3-D facial expression outputs. The automatic detection of affect in language is based on natural language processing and machine learning approaches. Two affect corpora were developed to perform this analysis. Emotion classification is performed at the sentence level using a step-wise approach which incorporates sentiment flow and sentiment composition features. For emotion magnitude estimation, a regression model was developed to predict the evolving emotional magnitude of actors. Emotional magnitudes at any point during a story or conversation are determined by 1) the previous emotional state magnitude; 2) new text and speech inputs that might act upon that state; and 3) information about the context the actors are in. Acoustic features are also used to capture additional information from the speech signal. Evaluation of the automatic understanding of affect is performed by testing the model on a testing subset of the newly extended corpus. To visualize actor emotions as perceived by the system, a methodology was also developed to map predicted emotion class magnitudes to 3-D facial parameters using vertex-level mesh morphing. The developed sentence-level emotion state detection approach achieved classification accuracies as high as 71% for the neutral vs. emotion classification task on a test corpus of children's stories. After class re-sampling, the step-wise classification methodology achieved accuracies in the 56% to 84% range for each emotion class and polarity on a test subset of a medical drama corpus. For emotion magnitude prediction, the developed recurrent (prior-state feedback) regression model using both text-based and acoustic-based features achieved correlation coefficients in the range of 0.69 to 0.80. This prediction function was modeled using a non-linear approach based on Support Vector Regression (SVR) and performed better than other approaches based on Linear Regression or Artificial Neural Networks.
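
    The recurrent (prior-state feedback) regression model lends itself to a short sketch. The Python below assumes scikit-learn's SVR; the text, acoustic and context features are stubbed as a ready-made matrix, and all names are illustrative rather than taken from the dissertation.

    ```python
    # Sketch of prior-state-feedback SVR for emotion magnitude prediction.
    # Illustrative stand-in for the dissertation's recurrent regression model.
    import numpy as np
    from sklearn.svm import SVR

    def train_recurrent_svr(features, magnitudes):
        """features: (n_steps, n_feats) combined text+acoustic features;
        magnitudes: (n_steps,) annotated emotional magnitudes.
        The true previous magnitude is appended as an input feature
        (teacher forcing) during training."""
        prev = np.concatenate([[0.0], magnitudes[:-1]])  # neutral start state
        X = np.column_stack([features, prev])
        return SVR(kernel="rbf").fit(X, magnitudes)      # non-linear SVR

    def predict_sequence(model, features):
        """At test time the model's own previous prediction is fed back,
        so the estimated emotional state evolves step by step."""
        prev, out = 0.0, []
        for f in features:
            y = model.predict(np.append(f, prev).reshape(1, -1))[0]
            out.append(y)
            prev = y
        return np.array(out)
    ```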

    Glossolalia

    A thesis submitted in fulfilment of the requirements for the degree of Doctor of Medicine in the Department of Psychiatry and Mental Hygiene, University of the Witwatersrand. The introduction to the problem covers three main sections, namely Biblical, Historical and Psychological. Various tests are then named and described. This is followed by the section dealing with the test results and the statistical methods used, and finally by a summary and conclusion. The summary and conclusion are of necessity brief, and cannot be expected to cover the whole field. Special mention must here be made of G. B. Cutten, whose book Speaking with Tongues is considered by the present writer to be the most scholarly and extensive account of the historical aspect of glossolalia yet systematised. Portions relevant to the present investigation were either quoted in toto or epitomised because the writer felt they could not be improved upon. An intimate knowledge of the historical instances quoted by Cutten is essential to a complete understanding of the scope of the present work. The only originality claimed lies in the selection and presentation of the material to be found in Cutten's invaluable work. Extensive references and/or quotations have also been taken from the works of A. Schweitzer (The Mysticism of Paul the Apostle) and E. B. Tylor (Primitive Culture), both of whom are regarded as leaders of thought in their respective fields.

    Subjective evaluation and electroacoustic theoretical validation of a new approach to audio upmixing

    Audio signal processing systems for converting two-channel (stereo) recordings to four or five channels are increasingly relevant. These audio upmixers can be used with conventional stereo sound recordings and reproduced with multichannel home theatre or automotive loudspeaker audio systems to create a more engaging and natural-sounding listening experience. This dissertation discusses existing approaches to audio upmixing for recordings of musical performances and presents specific design criteria for a system to enhance spatial sound quality. A new upmixing system is proposed and evaluated according to these criteria, and a theoretical model for its behavior is validated using empirical measurements.

    The new system removes short-term correlated components from two electronic audio signals using a pair of adaptive filters, updated according to a frequency-domain implementation of the normalized least-mean-square (NLMS) algorithm. The major difference between the new system and all extant audio upmixers is that unsupervised time-alignment of the input signals (typically by up to +/-10 ms) as a function of frequency (typically using a 1024-band equalizer) is accomplished by the non-minimum-phase adaptive filter. Two new signals are created from the weighted difference of the inputs and are then radiated from two loudspeakers behind the listener. According to the consensus in the literature on the effect of interaural correlation on auditory image formation, the self-orthogonalizing properties of the algorithm ensure minimal distortion of the frontal source imagery and natural-sounding, enveloping reverberance (ambiance) imagery.

    Performance evaluation of the new upmix system was accomplished in two ways: firstly, using empirical electroacoustic measurements, which validate a theoretical model of the system; and secondly, with formal listening tests, which investigated auditory spatial imagery with a graphical mapping tool and a preference experiment. Both the electroacoustic and subjective methods investigated system performance with a variety of test stimuli for solo musical performances reproduced using a loudspeaker in an orchestral concert hall and recorded using different microphone techniques.

    The objective and subjective evaluations, combined with a comparative study of two commercial systems, demonstrate that the proposed system provides a new, computationally practical, high-sound-quality solution to upmixing.
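
    The core decorrelation step, predicting one channel from the other with a frequency-domain NLMS adaptive filter and keeping the residual as the ambiance signal, can be sketched as follows. This Python is a simplified, single-direction illustration under stated assumptions (non-overlapping blocks, no overlap-save partitioning, so circular-convolution effects are tolerated); it is not the dissertation's implementation.

    ```python
    # Per-block frequency-domain NLMS sketch: the residual of an adaptive
    # prediction of channel d from channel x approximates their uncorrelated
    # (ambiance) component. Simplified illustration, not the thesis system.
    import numpy as np

    def fd_nlms_residual(x, d, block=1024, mu=0.5, eps=1e-8):
        """x, d: equal-length mono channels. Returns e = d - (W applied to x),
        with per-bin weights W updated by normalised LMS on each block."""
        W = np.zeros(block, dtype=complex)  # one complex weight per frequency bin
        out = np.zeros(len(d))
        for i in range(0, len(x) - block + 1, block):
            X = np.fft.fft(x[i:i + block])
            D = np.fft.fft(d[i:i + block])
            E = D - W * X                    # error spectrum = ambiance estimate
            W += mu * np.conj(X) * E / (np.abs(X) ** 2 + eps)  # NLMS update
            out[i:i + block] = np.real(np.fft.ifft(E))
        return out
    ```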

    Voice and presence

    How can the voice be understood as a theme for philosophical inquiry? Using mostly phenomenological strategies and approaches, this thesis attempts to map a transdisciplinary constellation of nexus points where the voice emerges as an expressive manifestation of presence and living processes. Simultaneously situated, bodily, and transgressive in the context of the notion of acoustic territories, the ambiguity of the voice and its potential as phenomenon, concept, and tangible resonance of subjectivity are explored via an inquiry informed by Ancient Philosophy, Sound Studies, Acoustics, Phenomenology and Artistic Research.