1,130 research outputs found

    Finding the Most Uniform Changes in Vowel Polygon Caused by Psychological Stress

    Vowel polygons, specifically their geometric parameters, are used as the criterion for distinguishing a speaker's normal state from speech produced under real psychological stress. All results were obtained experimentally with purpose-built vowel polygon analysis software applied to the ExamStress database. Six methods based on cross-correlation of different features were ranked by their coefficient of variation, and for each individual vowel polygon an efficiency coefficient was calculated to mark the most significant and uniform differences between stressed and normal speech. The best method for observing these differences was the mean of the cross-correlation values obtained for the difference-area value paired with the vector length and angle parameters. Overall, the best stress-detection results are achieved by the /i/-/o/-/u/ and /a/-/i/-/o/ vowel triangles in formant planes that combine the fifth formant F5 with the other formants.
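
    As a rough, self-contained illustration of the polygon idea (not the paper's software, and with invented formant values rather than ExamStress data), one could compare the area of a vowel triangle in a formant plane between neutral and stressed recordings:

        # Illustrative sketch only: compares the area of a vowel triangle in a
        # formant plane between neutral and stressed speech. The formant values
        # below are invented placeholders, not data from the ExamStress corpus.
        import numpy as np

        def triangle_area(points: np.ndarray) -> float:
            """Shoelace formula for the area of a triangle given 3 (x, y) vertices."""
            (x1, y1), (x2, y2), (x3, y3) = points
            return 0.5 * abs(x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2))

        # /i/-/o/-/u/ vertices in an (F1, F2) plane, in Hz (placeholder values)
        neutral = np.array([[300.0, 2300.0], [500.0, 900.0], [320.0, 800.0]])
        stressed = np.array([[340.0, 2150.0], [540.0, 980.0], [360.0, 870.0]])

        a_n, a_s = triangle_area(neutral), triangle_area(stressed)
        print(f"area change under stress: {100.0 * (a_s - a_n) / a_n:+.1f}%")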

    The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing

    Work on voice sciences over recent decades has led to a proliferation of acoustic parameters that are used quite selectively and are not always extracted in a similar fashion. With many independent teams working in different research areas, shared standards become an essential safeguard to ensure compliance with state-of-the-art methods, allowing appropriate comparison of results across studies and potential integration and combination of extraction and recognition systems. In this paper we propose a basic standard acoustic parameter set for various areas of automatic voice analysis, such as paralinguistic or clinical speech analysis. In contrast to a large brute-force parameter set, we present a minimalistic set of voice parameters here. These were selected based on a) their potential to index affective physiological changes in voice production, b) their proven value in previous studies as well as their automatic extractability, and c) their theoretical significance. The set is intended to provide a common baseline for evaluation of future research and eliminate differences caused by varying parameter sets or even different implementations of the same parameters. Our implementation is publicly available with the openSMILE toolkit. Comparative evaluations of the proposed feature set and large baseline feature sets of INTERSPEECH challenges show a high performance of the proposed set in relation to its size.
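
    The parameter set's reference implementation ships with openSMILE; a minimal sketch of extracting the GeMAPS functionals through the audEERING opensmile Python wrapper might look as follows (the enum names and the file path reflect assumptions about a recent wrapper release):

        # Minimal sketch of extracting the GeMAPS functionals with the openSMILE
        # Python wrapper (pip install opensmile). The file name is a placeholder.
        import opensmile

        smile = opensmile.Smile(
            feature_set=opensmile.FeatureSet.GeMAPSv01b,      # minimalistic GeMAPS set
            feature_level=opensmile.FeatureLevel.Functionals,
        )
        features = smile.process_file("speech_sample.wav")    # pandas DataFrame, one row
        print(features.shape)  # should be 62 functionals for GeMAPS v01b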

    The use of spectral information in the development of novel techniques for speech-based cognitive load classification

    The cognitive load of a user refers to the amount of mental demand imposed on the user when performing a particular task. Estimating the cognitive load (CL) level of users is necessary in order to adjust the imposed workload accordingly and so improve task performance. Current speech-based CL classification systems are not adequate for commercial use because of their low performance, particularly in noisy environments. This thesis proposes several techniques to improve the performance of speech-based cognitive load classification in both clean and noisy conditions. The thesis analyses and demonstrates the effectiveness of speech features such as spectral centroid frequency (SCF) and spectral centroid amplitude (SCA) for CL classification. Sub-systems based on SCF and SCA features were developed and fused with a traditional Mel-frequency cepstral coefficient (MFCC) based system, producing 8.9% and 31.5% relative error rate reductions respectively when compared to the MFCC-based system alone. The Stroop test corpus was used in these experiments. An investigation into the spectral distribution of cognitive load information across subbands shows that significantly more information is carried in the low-frequency subbands than in the high-frequency subbands. Two methods are proposed to exploit this finding. The first, a multi-band approach, uses a weighting scheme to emphasise speech features in the low-frequency subbands; its cognitive load classification accuracy is shown to be higher than that of a non-weighted system. The second is to design an effective filterbank based on the spectral distribution of cognitive load information, using the Kullback-Leibler distance measure. The designed filterbank consistently provides higher classification accuracies than existing filterbanks such as the mel, Bark, and equivalent rectangular bandwidth scales. Finally, a discrete cosine transform based speech enhancement technique is proposed to increase the robustness of the CL classification system, and is found to be more suitable than the other methods investigated. It provides a 3.0% average relative error rate reduction over the seven noise types and five SNR levels used, and a maximum 7.5% relative error rate reduction for F16 noise (from the NOISEX-92 database) at 20 dB SNR.
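
    As a simplified sketch of the SCF and SCA features (a plausible reading of the definitions, not the thesis implementation; the SCA variant here is an amplitude-weighted analogue):

        # Rough sketch: per-frame spectral centroid frequency (SCF) and spectral
        # centroid amplitude (SCA) computed from a magnitude spectrogram.
        import numpy as np
        from scipy.signal import stft

        def spectral_centroids(x, fs, nperseg=512):
            f, _, Z = stft(x, fs=fs, nperseg=nperseg)
            mag = np.abs(Z)                            # (freq bins, frames)
            weights = mag / (mag.sum(axis=0) + 1e-12)  # normalise per frame
            scf = (f[:, None] * weights).sum(axis=0)   # centroid frequency per frame
            sca = (mag * weights).sum(axis=0)          # centroid amplitude per frame
            return scf, sca

        fs = 16000
        x = np.random.randn(fs)  # stand-in for one second of speech
        scf, sca = spectral_centroids(x, fs)
        print(scf.mean(), sca.mean())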

    Computer Models for Musical Instrument Identification

    A particular aspect of the perception of sound concerns what is commonly termed texture or timbre. From a perceptual perspective, timbre is what allows us to distinguish sounds that have similar pitch and loudness. Indeed, most people are able to discern a piano tone from a violin tone, or to distinguish different voices or singers. This thesis deals with timbre modelling. Specifically, the formant theory of timbre is the main theme throughout. This theory states that acoustic musical instrument sounds can be characterised by their formant structures. Following this principle, the central point of our approach is to propose a computer implementation for building musical instrument identification and classification systems. Although the main thrust of this thesis is to propose a coherent and unified approach to the musical instrument identification problem, it is oriented towards the development of algorithms that can be used in Music Information Retrieval (MIR) frameworks. Drawing on research in speech processing, a complete supervised system taking into account both physical and perceptual aspects of timbre is described. The approach is composed of three distinct processing layers. Parametric models that allow us to represent signals through mid-level physical and perceptual representations are considered. Next, the use of the Line Spectrum Frequencies as spectral envelope and formant descriptors is emphasised. Finally, the use of generative and discriminative techniques for building instrument and database models is investigated. Our system is evaluated under realistic recording conditions using databases of isolated notes and melodic phrases.
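
    As a hedged sketch of the middle processing layer, the standard LPC-to-LSF conversion can be written as follows (librosa supplies the LPC fit; this mirrors the textbook construction rather than the thesis code):

        # Estimate LPC coefficients, then convert them to Line Spectrum
        # Frequencies (LSFs) as spectral envelope/formant descriptors.
        import numpy as np
        import librosa

        def lpc_to_lsf(a: np.ndarray) -> np.ndarray:
            """Convert LPC polynomial a = [1, a1, ..., ap] to LSFs in radians."""
            a_ext = np.concatenate([a, [0.0]])
            p_poly = a_ext + a_ext[::-1]   # symmetric polynomial P(z)
            q_poly = a_ext - a_ext[::-1]   # antisymmetric polynomial Q(z)
            roots = np.concatenate([np.roots(p_poly), np.roots(q_poly)])
            angles = np.angle(roots)
            return np.sort(angles[(angles > 0) & (angles < np.pi)])

        # librosa.ex() downloads a bundled example clip on first use
        y, sr = librosa.load(librosa.ex("trumpet"), duration=1.0)
        a = librosa.lpc(y, order=12)       # 12th-order LPC fit
        lsf_hz = lpc_to_lsf(a) * sr / (2 * np.pi)
        print(np.round(lsf_hz))            # envelope description in Hz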

    Multivariate pattern analysis of input and output representations of speech

    Repeating a word or nonword requires a speaker to map auditory representations of incoming sounds onto learned speech items, maintain those items in short-term memory, interface that representation with the motor output system, and articulate the target sounds. This dissertation seeks to clarify the nature and neuroanatomical localization of speech sound representations in perception and production through multivariate analysis of neuroimaging data. The major portion of this dissertation describes two experiments using functional magnetic resonance imaging (fMRI) to measure responses to the perception and overt production of syllables, and multivariate pattern analysis to localize brain areas containing the associated phonological/phonetic information. The first experiment used a delayed repetition task to permit response estimation for auditory syllable presentation (input) and overt production (output) in individual trials. In input responses, clusters sensitive to vowel identity were found in left inferior frontal sulcus (IFs), while clusters responsive to syllable identity were found in left ventral premotor cortex and left mid superior temporal sulcus (STs). Output-linked responses revealed clusters of vowel information bilaterally in mid/posterior STs. The second experiment was designed to dissociate the phonological content of the auditory stimulus and the vocal target. Subjects were visually presented with two (non)word syllables simultaneously, then aurally presented with one of the syllables. A visual cue informed subjects either to repeat the heard syllable (repeat trials) or to produce the unheard, visually presented syllable (change trials). Results suggest both IFs and STs represent heard syllables; on change trials, representations in frontal areas, but not STs, are updated to reflect the vocal target. Vowel identity covaries with formant frequencies, inviting the question of whether lower-level auditory representations can support vowel classification in fMRI. The final portion of this work describes a simulation study, in which artificial fMRI datasets were constructed to mimic the overall design of Experiment 1, with voxels assumed to contain either discrete (categorical) or analog (frequency-based) vowel representations. The accuracy of classification models was characterised by type of representation and the density and strength of responsive voxels. It was shown that classification is more sensitive to sparse, discrete representations than to dense, analog representations.
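
    A toy sketch of the multivariate pattern analysis idea, with synthetic data standing in for single-trial fMRI response estimates (none of the names below come from the dissertation):

        # Classify vowel identity from voxel response patterns with a linear
        # SVM and cross-validation; synthetic data replaces real fMRI estimates.
        import numpy as np
        from sklearn.svm import SVC
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(0)
        n_trials, n_voxels = 120, 200
        labels = rng.integers(0, 3, size=n_trials)         # 3 vowel classes
        patterns = rng.normal(size=(n_trials, n_voxels))
        patterns[:, :10] += labels[:, None] * 0.5          # weak informative voxels

        scores = cross_val_score(SVC(kernel="linear"), patterns, labels, cv=5)
        print(f"decoding accuracy: {scores.mean():.2f} (chance = 0.33)")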

    Perception of acoustically complex phonological features in vowels is reflected in the induced brain-magnetic activity

    A central issue in speech recognition is which basic units of speech are extracted by the auditory system and used for lexical access. One suggestion is that complex acoustic-phonetic information is mapped onto abstract phonological representations of speech and that a finite set of phonological features is used to guide speech perception. Previous studies analyzing the N1m component of the auditory evoked field have shown that this holds for the acoustically simple feature place of articulation. Brain-magnetic correlates indexing the extraction of acoustically more complex features, such as lip rounding (ROUND) in vowels, have not yet been unraveled. The present study uses magnetoencephalography (MEG) to describe the spatiotemporal neural dynamics underlying the extraction of phonological features. We examined the induced electromagnetic brain response to German vowels and found the event-related desynchronization in the upper beta band to be prolonged for those vowels that exhibit the lip-rounding feature (ROUND). It was the presence of that feature, rather than circumscribed single acoustic parameters such as formant frequencies, that explained the differences between the experimental conditions. We conclude that the prolonged event-related desynchronization in the upper beta band correlates with the computational effort required to extract acoustically complex phonological features from the speech signal. The results provide an additional biomagnetic parameter for studying mechanisms of speech perception.
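
    A minimal sketch of the underlying ERD measure (band-limited power change relative to a pre-stimulus baseline), with synthetic single-channel data standing in for the MEG recordings:

        # Event-related desynchronization (ERD): band-pass filter, Hilbert
        # envelope, power change relative to a pre-stimulus baseline.
        import numpy as np
        from scipy.signal import butter, filtfilt, hilbert

        fs = 1000
        t = np.arange(-0.5, 1.5, 1 / fs)              # stimulus onset at t = 0
        x = np.random.randn(t.size)                   # placeholder signal

        b, a = butter(4, [20, 30], btype="bandpass", fs=fs)  # upper beta band
        power = np.abs(hilbert(filtfilt(b, a, x))) ** 2

        baseline = power[t < 0].mean()
        erd = 100.0 * (power - baseline) / baseline   # % change vs. baseline
        print(erd[(t > 0.1) & (t < 0.5)].mean())      # mean ERD post-stimulus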

    Theories of developmental dyslexia: Insights from a multiple case study of dyslexic adults

    A multiple case study was conducted in order to assess three leading theories of developmental dyslexia: the phonological, the magnocellular (auditory and visual) and the cerebellar theories. Sixteen dyslexic and 16 control university students were administered a full battery of psychometric, phonological, auditory, visual and cerebellar tests. Individual data reveal that all 16 dyslexics suffer from a phonological deficit, 10 from an auditory deficit, 4 from a motor deficit, and 2 from a visual magnocellular deficit. Results suggest that a phonological deficit can appear in the absence of any other sensory or motor disorder and is sufficient to cause a literacy impairment, as demonstrated by 5 of the dyslexics. Auditory disorders, when present, aggravate the phonological deficit, hence the literacy impairment. However, auditory deficits cannot be characterised simply as rapid auditory processing problems, as would be predicted by the magnocellular theory. Nor are they restricted to speech. Contrary to the cerebellar theory, we find little support for the notion that motor impairments, when found, have a cerebellar origin or reflect an automaticity deficit. Overall, the present data support the phonological theory of dyslexia, while acknowledging the presence of additional sensory and motor disorders in certain individuals.

    Models and Analysis of Vocal Emissions for Biomedical Applications

    The International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications (MAVEBA) came into being in 1999 out of a strongly felt need to share know-how, objectives and results across areas that until then had seemed quite distinct, such as bioengineering, medicine and singing. MAVEBA deals with all aspects of the study of the human voice, with applications ranging from the newborn to the adult and elderly. Over the years, the initial topics have grown and spread into other fields of research, such as occupational voice disorders, neurology, rehabilitation, and image and video analysis. MAVEBA takes place every two years in Firenze, Italy. This edition celebrates twenty-two years of uninterrupted and successful research in the field of voice analysis.

    Synthetic voice design and implementation

    The limitations of speech output technology emphasise the need for exploratory psychological research to maximise the effectiveness of speech as a display medium in human-computer interaction. Stage 1 of this study reviewed speech implementation research, focusing on general issues for tasks, users and environments. An analysis of design issues was conducted, related to the differing methodologies for synthesised and digitised message production. A selection of ergonomic guidelines was developed to enhance effective speech interface design. Stage 2 addressed the negative reactions of users to synthetic speech in spite of elegant dialogue structure and appropriate functional assignment. Synthetic speech interfaces have been consistently rejected by their users in a wide variety of application domains because of their poor quality. Indeed, the literature repeatedly emphasises quality as the most important contributor to implementation acceptance. In order to investigate this, a converging operations approach was adopted. This consisted of a series of five experiments (and associated pilot studies) which homed in on the specific characteristics of synthetic speech that determine listeners' varying perceptions of its qualities, and how these might be manipulated to improve its aesthetics. A flexible and reliable ratings interface was designed to display DECtalk speech variations and record listeners' perceptions. In experiment one, 40 participants used this to evaluate synthetic speech variations on a wide range of perceptual scales. Factor analysis revealed two main factors: "listenability", accounting for 44.7% of the variance and correlating with the DECtalk "smoothness" parameter at .57 (p<0.005) and with "richness" at .53 (p<0.005); and "assurance", accounting for 12.6% of the variance and correlating with "average pitch" at .42 (p<0.005) and "head size" at .42 (p<0.005). Complementary experiments were then required in order to address appropriate voice design for enhanced listenability and assurance perceptions. With a standard male voice set, 20 participants rated enhanced smoothness and attenuated richness as contributing significantly to speech listenability (p<0.001). Experiment three, using a female voice set, yielded comparable results, suggesting that further refinements of the technique were necessary in order to develop an effective methodology for speech quality optimisation. At this stage it became essential to focus directly on the parameter modifications associated with the aesthetically pleasing characteristics of synthetic speech. If a reliable technique could be developed to enhance perceived speech quality, then synthesis systems based on the commonly used DECtalk model might realise some of their considerable yet unfulfilled potential. In experiment four, 20 subjects rated a wide range of voices modified across the two main parameters associated with perceived listenability: smoothness and richness. The results clearly revealed a linear relationship between enhanced smoothness and attenuated richness and significant improvements in perceived listenability (p<0.001 in both cases). Planned comparisons were conducted between the different levels of the parameters and revealed significant listenability enhancements as smoothness was increased, and a similar pattern as richness decreased. Statistical analysis also revealed a significant interaction between the two parameters (p<0.001), and a more comprehensive picture was constructed.
    In order to expand the focus and enhance the generality of the research, it was then necessary to assess the effects of synthetic speech modifications whilst subjects were undertaking a more realistic task. Passively rating the voices independently of processing for meaning is arguably an artificial task which rarely, if ever, would occur in 'real-world' settings. In order to investigate perceived quality in a more realistic task scenario, experiment five introduced two levels of information-processing load. The purpose of this experiment was firstly to see if a comprehension load modified the pattern of listenability enhancements, and secondly to see if that pattern differed between high and low load. Techniques for introducing cognitive load were investigated, and comprehension load was selected as the most appropriate method in this case. A pilot study distinguished two levels of comprehension load from a set of 150 true/false sentences, and these were recorded across the full range of parameter modifications. Twenty subjects then rated the voices using the established listenability scales as before, while also performing the additional task of processing each spoken stimulus for meaning and determining the authenticity of the statements. Results indicated that listenability enhancements did indeed occur at both levels of processing, although at the higher level variations in the pattern occurred. A significant difference was revealed between optimal parameter modifications for conditions of high and low cognitive load (p<0.05). The results showed that subjects perceived the synthetic voices in the high cognitive load condition to be significantly less listenable than the same voices in the low cognitive load condition. The analysis also revealed that this effect was independent of the number of errors made. This result may be of general value, because conclusions drawn from these findings are independent of any particular parameter modifications that may be exclusively available to DECtalk users. Overall, the study presents a detailed analysis of the research domain combined with a systematic experimental programme of synthetic speech quality assessment. The experiments reported establish a reliable and replicable procedure for optimising the aesthetically pleasing characteristics of DECtalk speech, but the implications of the research extend beyond the boundaries of any particular synthesiser. Results from the experimental programme lead to a number of conclusions, the most salient being that the synthetic speech designer must not only overcome the general rejection of synthetic voices, rooted in their poor quality, through sophisticated customisation of synthetic voice parameters, but must also take into account the cognitive load of the task being undertaken. The interaction between cognitive load and optimal synthesis settings requires direct consideration if synthetic speech systems are to realise and maximise their potential in human-computer interaction.
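
    As a hedged sketch of the Stage 2 analysis style (synthetic ratings, not the experimental data), a two-factor solution can be recovered from a listeners-by-scales rating matrix:

        # Factor analysis of rating scales to recover a small number of
        # perceptual dimensions (e.g. "listenability"); data are simulated.
        import numpy as np
        from sklearn.decomposition import FactorAnalysis

        rng = np.random.default_rng(1)
        n_listeners, n_scales = 40, 12
        latent = rng.normal(size=(n_listeners, 2))        # two hidden factors
        loadings = rng.normal(size=(2, n_scales))
        ratings = latent @ loadings + 0.3 * rng.normal(size=(n_listeners, n_scales))

        fa = FactorAnalysis(n_components=2).fit(ratings)
        load_mass = fa.components_ ** 2                   # squared loadings per factor
        print(load_mass.sum(axis=1) / load_mass.sum())    # rough share of structure per factor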

    Iconicity in Language and Speech

    This dissertation is concerned with the major theme of iconicity and its prevalence on different linguistic levels. Iconicity refers to a resemblance between the linguistic form and the meaning of a referent (cf. Perniss and Vigliocco, 2014). Just as a sculpture resembles an object or a model, so can the sound or shape of words resemble the thing they refer to. Previous theoretical approaches emphasize that the arbitrariness of the linguistic sign is one of the main features of human language, and that iconicity, while it may have played a role in language evolution, is negligible in contemporary language. In contrast, the main aim of this thesis is to explore the potential and the importance of iconicity in present-day language. The individual chapters of the dissertation can be viewed as separate parts that, taken together, reveal the comprehensive spectrum of iconicity. Starting from the language-evolution debate, the individual chapters address iconicity on different linguistic levels. I present experimental evidence on sound symbolism, using the example of German Pokémon names, on iconic prosody, and on iconic words, the so-called ideophones. The results of the individual investigations point to the widespread use of iconicity in contemporary German. Moreover, this dissertation deciphers the communicative potential of iconicity as a force that not only enabled the emergence of language, but also persists after millennia, unfolding again and again and encountering us every day in speech, writing, and gestures.