174 research outputs found

    The Perception of Emotion from Acoustic Cues in Natural Speech

    Knowledge of human perception of emotional speech is imperative for the development of emotion in speech recognition systems and emotional speech synthesis. Owing to the fact that there is a growing trend towards research on spontaneous, real-life data, the aim of the present thesis is to examine human perception of emotion in naturalistic speech. Although there are many available emotional speech corpora, most contain simulated expressions. Therefore, there remains a compelling need to obtain naturalistic speech corpora that are appropriate and freely available for research. In that regard, our initial aim was to acquire suitable naturalistic material and examine its emotional content based on listener perceptions. A web-based listening tool was developed to accumulate ratings based on large-scale listening groups. The emotional content present in the speech material was demonstrated by performing perception tests on conveyed levels of Activation and Evaluation. As a result, labels were determined that signified the emotional content, and thus contribute to the construction of a naturalistic emotional speech corpus. In line with the literature, the ratings obtained from the perception tests suggested that Evaluation (or hedonic valence) is not identified as reliably as Activation is. Emotional valence can be conveyed through both semantic and prosodic information, for which the meaning of one may serve to facilitate, modify, or conflict with the meaning of the other—particularly with naturalistic speech. The subsequent experiments aimed to investigate this concept by comparing ratings from perception tests of non-verbal speech with verbal speech. The method used to render non-verbal speech was low-pass filtering, and for this, suitable filtering conditions were determined by carrying out preliminary perception tests. The results suggested that nonverbal naturalistic speech provides sufficiently discernible levels of Activation and Evaluation. 
It appears that the perception of Activation and Evaluation is affected by low-pass filtering, but that the effect is relatively small. Moreover, the results suggest that there is a similar trend in agreement levels between verbal and non-verbal speech. It remains difficult to determine unique acoustic patterns for the hedonic valence of emotion, which may be due to inadequate labels or an incorrect selection of acoustic parameters. This study has implications for the labelling of emotional speech data and the determination of salient acoustic correlates of emotion.

    Does speech prosody matter in health communication? Evidence from native and non-native English speaking medical students in a simulated clinical interaction

    The impact of the UK’s multilingual and multicultural society can be seen in its healthcare services and has contributed towards shaping communication skills training as a core part of the UK undergraduate medical curriculum. NHS complaints statistics involving perceived staff attitudes have remained high, despite extensive communication skills training. Furthermore, foreign doctors have received a higher proportion of complaints than UK doctors. Finally, how linguistic and social factors shape the conveyance and perception of attitudes related to professionalism in medical communication remains poorly understood. The ultimate aim of this study was to ascertain whether speech prosody contributes to the perception of professionalism in medical communication. The research questions addressed the role of speech prosody in conveying professional attitudes in medical communication, the prosodic differences between native and non-native English speaking medical students in a simulated clinical interaction, and the influence of prosodic features on listeners’ perceptions of professional attitudes. A set of acoustic parameters representing the speech prosody of native and non-native medical students in the simulated clinical setting was analysed. A perceptual experiment was then carried out to investigate the factors affecting perceived professionalism in extracts of the analysed simulated clinical interaction. The examined acoustic parameters were found to be sensitive to English language background and to the task within the simulated consultation. Interestingly, the attitudinal information associated with some of these acoustic parameters was perceived by listeners and was reflected in higher professional scale scores in the perceptual experiment, even after adjusting for English language background. Training level and consultation task also emerged as factors affecting professional scale scores.
Initial findings have confirmed that speech prosody contributes to the perception of professionalism in medical communication. Incorporating how messages are delivered to patients into current models of communication skills training may have positive outcomes.

    The development of a new rating scale for the perceptual assessment of tracheoesophageal voice quality outcome following total laryngectomy

    PhD Thesis. Perceptual assessment of voice in people with surgical voice restoration (SVR) is essential to evaluate surgical and other interventions aimed at delivering optimal voice quality. Currently there are no tools to measure this that do not have issues of validity and reliability. This work describes the development and trialling of investigatory versions of three scales to address this situation: a) the Sunderland Tracheoesophageal Perceptual Scale (SToPS) for professional raters, b) the Naïve Rater Scale for non-specialist raters and c) the Patient and Carer Scale. In the final testing of the pilot version, 55 speakers using tracheoesophageal voice were evaluated by twelve Speech and Language Therapists (SLTs) and ten Ear, Nose and Throat (ENT) surgeons, divided into those experienced and those not experienced at assessing voice. Ten naïve raters assessed the voice stimuli within a test-retest design. Forty tracheoesophageal speakers and thirty-seven carers attended an interview to rate their own or their relative’s voice. Inter-rater agreement was then calculated between SLT, ENT, naïve, patient and carer groups with weighted kappa coefficients. Strength of agreement values (Landis and Koch 1977) were compared across profession and expertise. Expert SLTs achieved “good” agreement for nine of fourteen parameters. Naïve judges attained “good” levels of inter- and intra-rater agreement for the parameters Overall Grade and Social Acceptability. The greatest inter-group consensus was for patients and carers, with “good” agreement for Intelligibility, Volume and Wetness. The only other “good” agreement was between naïve/ENT and naïve/SLT groups for Overall Grade.
The scales are ready for clinical use with the proviso that future work will determine whether it is possible to enhance agreement so that less experienced judges can achieve “good” levels of agreement for more parameters, and will examine which perceptual parameters might be more prominent or vital for outcomes for different groups.
City Hospitals Sunderland NHS Foundation Trust
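The abstract above reports inter-rater agreement as weighted kappa coefficients but does not state the weighting scheme. As a rough illustration only, the sketch below computes a linear-weighted Cohen's kappa for two raters. The linear weights, the integer coding of the rating scale (0 to n_categories - 1), and the function name `weighted_kappa` are all assumptions made for this example, not details taken from the thesis.

```python
def weighted_kappa(rater_a, rater_b, n_categories):
    """Linear-weighted Cohen's kappa for two raters.

    Ratings are assumed to be ordinal codes 0 .. n_categories - 1.
    The linear weighting is an assumption; the source does not name a scheme.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    if n_categories < 2:
        raise ValueError("need at least two rating categories")

    n = len(rater_a)
    k = n_categories

    # Joint proportions: observed[i][j] = share of items rated i by A and j by B.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1.0 / n

    # Marginal proportions for each rater.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    # Weighted observed and chance-expected disagreement.
    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)
            observed_disagreement += weight * observed[i][j]
            expected_disagreement += weight * row[i] * col[j]

    if expected_disagreement == 0:
        raise ValueError("kappa is undefined when both raters use one identical category")
    return 1.0 - observed_disagreement / expected_disagreement
```

For example, `weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], 4)` describes two raters in complete agreement. With linear weights, a disagreement of one scale step is penalised less than a disagreement of three steps, which is the usual motivation for weighting on ordinal scales.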

    A linguistic approach to pitch range modelling

    Pitch range is currently characterised in a number of different ways across research disciplines and is often treated as a simple measurement. Pitch range has been defined as the difference between minimum and maximum F0 (Cosmides 1983). This measure alone conveys no information about the distribution of F0 values within that range. Similarly, the mean and standard deviation do not adequately capture important differences in the pitch range of different speakers (Ladd et al. 1985). Ladd (1996) describes pitch range using two partially independent dimensions of variation: overall level and span. This idea has been further developed by Shriberg et al. (1996), in a study based on a large corpus of Dutch speech. Given this two-parameter model, it is possible to predict target F0 values for when speakers raise their voices from F0 values at corresponding locations in speech produced normally. This thesis reports on three studies of pitch range variation across speakers. The experiments examine the relation between a two-dimensional model of pitch range, based on pitch level and pitch span, and the perception of various speaker characteristics. The key to our measure of pitch range is that it is based on average data taken from clearly defined linguistic targets in speech. These targets included sentence-initial peaks, accent peaks, post-accent valleys and sentence-final lows. The results show that a pitch range model based on linguistic dimensions of variation better captures variation in listeners' judgements than the well-established measures based on the long-term distributional properties of speakers' F0, such as 4 standard deviations around the mean, 95th-5th percentile and 90th-10th percentile.
Most importantly, this thesis shows that pitch range can and should be treated as the same entity across research disciplines (extralinguistic, paralinguistic and linguistic), rather than having multiple definitions depending on the particular interest of each discipline, as is currently the case.
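The long-term distributional measures named in the abstract above (4 standard deviations around the mean, 95th-5th percentile, 90th-10th percentile) can be sketched as follows. This is a minimal illustration, assuming the input is a list of voiced F0 samples in Hz, that "4 standard deviations around the mean" means the width of the interval mean ± 2 SD, and that percentiles use linear interpolation. The function names are hypothetical, and the thesis may have computed these measures differently.

```python
import math
import statistics


def percentile(values, p):
    """Linear-interpolation percentile (p in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * p / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def ltd_span_measures(f0_hz):
    """Long-term distributional pitch measures for voiced F0 samples in Hz.

    Assumption: the 4-SD measure is the width of mean +/- 2 SD.
    """
    if len(f0_hz) < 2:
        raise ValueError("need at least two F0 samples")
    sd = statistics.stdev(f0_hz)
    return {
        "mean_hz": statistics.mean(f0_hz),  # a simple level measure
        "four_sd_hz": 4 * sd,
        "p95_p5_hz": percentile(f0_hz, 95) - percentile(f0_hz, 5),
        "p90_p10_hz": percentile(f0_hz, 90) - percentile(f0_hz, 10),
    }
```

These measures summarise the whole F0 distribution and ignore where in the utterance each value occurs, which is the contrast the abstract draws with measures taken at linguistic targets such as accent peaks and sentence-final lows.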

    Models and analysis of vocal emissions for biomedical applications: 5th International Workshop: December 13-15, 2007, Firenze, Italy

    The MAVEBA Workshop is held on a biennial basis, and its proceedings collect the scientific papers presented as both oral and poster contributions during the conference. The main subjects are the development of theoretical and mechanical models as an aid to the study of the main phonatory dysfunctions, and biomedical engineering methods for the analysis of voice signals and images as a support to clinical diagnosis and classification of vocal pathologies. The Workshop is sponsored by Ente Cassa Risparmio di Firenze, COST Action 2103, the Biomedical Signal Processing and Control journal (Elsevier), and the IEEE Biomedical Engineering Soc. Special issues of international journals have been, and will be, published, collecting selected papers from the conference.

    Prosodic Font : the space between the spoken and the written

    Thesis (S.M.)--Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 1998. "August 1998." Includes bibliographical references (leaves 131-133). By Tara Michelle Graber Rosenberger. S.M.

    The Pitch Range of Italians and Americans. A Comparative Study

    Linguistic experiments have investigated the nature of F0 span and level in cross-linguistic comparisons. However, only a few studies have focused on developing a generally agreed methodology that may provide a unifying approach to the analysis of pitch range (Ladd, 1996; Patterson and Ladd, 1999; Daly and Warren, 2001; Bishop and Keating, 2010; Mennen et al. 2012). Pitch variation is used in different languages to convey different linguistic and paralinguistic meanings that may range from the expression of sentence modality to the marking of emotional and attitudinal nuances (Grice and Baumann, 2007). A number of factors have to be taken into consideration when determining the existence of measurable and reliable differences in pitch values. Daly and Warren (2001) demonstrated the importance of independent variables such as language, age, body size, speaker sex (female vs. male), socio-cultural background, regional accent, speech task (read sentences vs. spontaneous dialogues), sentence type (questions vs. statements) and measurement scale (Hertz, semitones, ERB etc.). Consistent with the model proposed by Mennen et al. (2012), my analysis of pitch range is based on the investigation of LTD (long-term distributional) and linguistic measures. LTD measures deal with the F0 distribution within a speaker’s contour (e.g. F0 minimum, F0 maximum, F0 mean, F0 median, standard deviation, F0 span), while linguistic measures are linked to specific targets within the contour, such as peaks and valleys (e.g. high and low landmarks), and preserve the temporal sequences of pitch contours. This investigation analyzed the characteristics of pitch range production and perception in English sentences uttered by Americans and Italians.
Four experiments were conducted to examine different phenomena: i) the contrast between measures of F0 level and span in utterances produced by Americans and Italians (experiments 1-2); ii) the contrast between the pitch range produced by males and females in L1 and L2 (experiment 1); iii) the F0 patterns in different sentence types, that is, yes-no questions, wh-questions, and exclamations (experiment 2); iv) listeners’ evaluations of pitch span in terms of ±interesting, ±excited, ±credible, ±friendly ratings of different sentence types (experiments 3-4); v) the correlation between pitch span of the sentences and the evaluations given by American and Italian listeners (experiment 3); vi) the listeners’ evaluations of pitch span values in manipulated stimuli, whose F0 span was re-synthesized under three conditions: narrow span, original span, and wide span (experiment 4); vii) the different evaluations given to the sentences by male and female listeners. The results of this investigation supported the following generalizations. First, pitch span more than level was found to be a cue for non-nativeness, because L2 speakers of English used a narrower span, compared to the native norm. What is more, the experimental data in the production studies indicated that the mode of sentences was better captured by F0 span than level. Second, the Italian learners of English were influenced by their L1 and transferred L1 pitch range variation into their L2. The English sentences produced by the Italians had overall higher pitch levels and narrower pitch span than those produced by the Americans. In addition, the Italians used overall higher pitch levels when speaking Italian and lower levels when speaking English. Conversely, their pitch span was generally higher in English and lower in Italian. 
When comparing productions in English, the Italian females used higher F0 levels than the American females; conversely, the Italian males showed slightly lower F0 levels than the American males. Third, there was a systematic relation between pitch span values and the listeners’ evaluations of the sentences. The two groups of listeners (the Americans and the Italians) rated the stimuli with larger pitch span as more interesting, exciting and credible than the stimuli with narrower pitch span. Thus, the listeners relied on the perceived pitch span to differentiate among the stimuli. Fourth, both the American and the Italian speakers were considered more friendly when the pitch span of their sentences was widened (wide span manipulation) and less friendly when the pitch span was narrowed (narrow span manipulation). This happened in all the stimuli regardless of the native language of the speakers (American vs. Italian).

    Models and analysis of vocal emissions for biomedical applications

    This book of Proceedings collects the papers presented at the 3rd International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications, MAVEBA 2003, held 10-12 December 2003, Firenze, Italy. The workshop is organised every two years and aims to stimulate contacts between specialists active in research and industrial developments in the area of voice analysis for biomedical applications. The scope of the Workshop includes all aspects of voice modelling and analysis, ranging from fundamental research to all kinds of biomedical applications and related established and advanced technologies.


    The present thesis is a corpus-based Interpreting study consisting of a proposal for a gestaltic subjective evaluation of quality in television broadcast simultaneous interpreting. The main objective of the research was to build and test a model of quality assessment based on the gestaltic perception both of speech and of the sound-image perceived through the audiovisual medium. The model of gestaltic perception adopted is the one formed by voice-syllable-prosody-sense-context-(linguistic) knowledge of the world, proposed in “Il volto fonico delle parole” (Albano Leoni 2009), which is a re-elaborated version of the model based on melody-rhythm-words-sentences, proposed by Karl Bühler in his “Theory of Language” (1934). A thematic corpus was built consisting of 2 Italian and 2 Spanish (Spain and United States) interpretations of the 2012 US Presidential Debates: the corpus ORenesit (Obama-Romney English español italiano) is included in the reference corpus CorIT (Italian Television Interpreting Corpus). The assessment model was tested in a questionnaire-based pilot survey including 3 video excerpts from the Italian interpretations of the 2008 Third Presidential Debate (Obama vs. McCain), since the corpus ORenesit had not yet been completed.
One of the 3 video excerpts was modified for experimental purposes: the interpreter’s voice was replaced with the voice of a professional actor and dubber, who imitated the original interpretation in a studio while reading the transcript and listening to the speaker. This choice was made to fulfil two needs, mainly related to the ecological validity of the experiment: i) to test the effect of a telegenic voice; and ii) to use a natural and personal expression of the subject. The questionnaire was built on categories extracted from “La vive voix” (Fónagy 1983) and “L’Audio-Vision” (Chion 1990). The data obtained were treated statistically. Results of the qualitative and quantitative research seem to confirm a gestaltic perception of interpreting speech received through audio-vision and formed by the following components: sound-image; syllable-melody(-voice-personality); words-sentences. Results seem to raise doubts about the effectiveness of the quantitative approach to the analysis of speech perception.

    Social affective variations in Brazilian Portuguese: a perceptual and acoustic analysis

