What makes business speakers sound charismatic? A contrastive acoustic-melodic analysis of Steve Jobs and Mark Zuckerberg
Phonetic research on the prosodic sources of perceived charisma has taken a big step towards making a speaker’s tone-of-voice a tangible, quantifiable, and trainable matter. However, the tone-of-voice includes a complex bundle of acoustic features, and many parameters have not yet been examined. Moreover, all previous studies focused on political or religious leaders and left aside the large field of managers and CEOs in the world of business. These are the two research gaps addressed in the present study. An acoustic analysis of about 1,350 prosodic phrases from keynotes given by a more charismatic CEO (Steve Jobs) and a less charismatic CEO (Mark Zuckerberg) suggests that the same tone-of-voice settings that make political or religious leaders sound more charismatic also work for business speakers. In addition, results point to further charisma-relevant acoustic parameters related to rhythm, emphasis, pausing, and voice quality, as well as to audience type as a significant context factor. The findings are discussed with respect to implications for future perception-oriented studies and perspectives for a computer-based measurement, assessment, and training of a charismatic tone of voice.
Identifying Speaker State from Multimodal Cues
Automatic identification of speaker state is essential for spoken language understanding, with broad potential in various real-world applications. However, most existing work has focused on recognizing a limited set of emotional states using cues from a single modality. This thesis describes my research that addresses these limitations and challenges associated with speaker state identification by studying a wide range of speaker states, including emotion and sentiment, humor, and charisma, using features from speech, text, and visual modalities.
The first part of this thesis focuses on emotion and sentiment recognition in speech. Emotion and sentiment recognition is one of the most studied topics in speaker state identification and has gained increasing attention in speech research recently, with extensive emotional speech models and datasets published every year. However, most work focuses only on recognizing a set of discrete emotions in high-resource languages such as English, while in real-life conversations, emotion is changing continuously and exists in all spoken languages. To address the mismatch, we propose a deep neural network model to recognize continuous emotion by combining inputs from raw waveform signals and spectrograms. Experimental results on two datasets show that the proposed model achieves state-of-the-art results by exploiting both waveforms and spectrograms as input. Due to the higher number of existing textual sentiment models than speech models in low-resource languages, we also propose a method to bootstrap sentiment labels from text transcripts and use these labels to train a sentiment classifier in speech. Utilizing the speaker state information shared across modalities, we extend speech sentiment recognition from high-resource languages to low-resource languages. Moreover, using the natural verse-level alignment in the audio Bibles across different languages, we also explore cross-lingual and cross-modality sentiment transfer.
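The label-bootstrapping idea in the paragraph above can be illustrated with a minimal sketch. It is not the thesis's implementation: the lexicon-based `toy_text_model`, the `min_confidence` filter, and all names and data are hypothetical stand-ins, chosen only to show how labels predicted from transcripts might be attached to the matching audio for later training of a speech classifier.

```python
from typing import Callable, List, Tuple

# A text model maps a transcript to (label, confidence). Hypothetical interface.
TextModel = Callable[[str], Tuple[str, float]]


def toy_text_model(transcript: str) -> Tuple[str, float]:
    """Hypothetical stand-in for a real text sentiment model (tiny lexicon)."""
    positive = {"good", "great", "love"}
    negative = {"bad", "awful", "hate"}
    words = transcript.lower().split()
    pos = sum(w in positive for w in words)
    neg = sum(w in negative for w in words)
    if pos == neg:
        return ("neutral", 0.5)
    label = "positive" if pos > neg else "negative"
    return (label, max(pos, neg) / (pos + neg))


def bootstrap_labels(
    utterances: List[Tuple[str, str]],
    text_model: TextModel,
    min_confidence: float = 0.8,  # assumed filter, not taken from the source
) -> List[Tuple[str, str]]:
    """Return (audio_id, label) pairs for transcripts the text model labels confidently."""
    labeled = []
    for audio_id, transcript in utterances:
        label, confidence = text_model(transcript)
        if confidence >= min_confidence:
            labeled.append((audio_id, label))
    return labeled


# Hypothetical (audio_id, transcript) pairs.
utterances = [
    ("a1", "I love this great show"),
    ("a2", "that was bad but also good"),
    ("a3", "awful awful day"),
]
speech_training_labels = bootstrap_labels(utterances, toy_text_model)
```

In this sketch the resulting pairs would serve as training targets for a classifier that sees only the audio; the ambiguous transcript `a2` is dropped rather than given a noisy label.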
In the second part of the thesis, we focus on recognizing humor, whose expression is related to emotion and sentiment but has very different characteristics. Unlike emotion and sentiment, which can be identified by crowdsourced annotators, humorous expressions are highly individualistic and culture-specific, making it hard to obtain reliable labels. This results in a lack of data annotated for humor, and thus we propose two different methods to automatically and reliably label humor. First, we develop a framework for generating humor labels on videos, by learning from extensive user-generated comments. We collect and analyze 100 videos, building multimodal humor detection models using speech, text, and visual features, which achieve an F1-score of 0.76. In addition to humorous videos, we also develop another framework for generating humor labels on social media posts, by learning from user reactions to Facebook posts. We collect 785K posts with humor and non-humor scores and build models to detect humor with performance comparable to human labelers.
The third part of the thesis focuses on charisma, a commonly found but less studied speaker state with unique challenges -- the definition of charisma varies a lot among perceivers, and the perception of charisma also varies with speakers' and perceivers' different demographic backgrounds. To better understand charisma, we conduct the first gender-balanced study of charismatic speech, including speakers and raters from diverse backgrounds. We collect personality and demographic information from the raters, as well as their own speech, and examine individual differences in the perception and production of charismatic speech. We also extend the work to politicians' speech by collecting speaker trait ratings on representative speech segments of politicians and studying how the genre, gender, and the rater's political stance influence the charisma ratings of the segments.
Automatic Dialect and Accent Recognition and its Application to Speech Recognition
A fundamental challenge for current research on speech science and technology is understanding and modeling individual variation in spoken language. Individuals have their own speaking styles, depending on many factors, such as their dialect and accent as well as their socioeconomic background. These individual differences typically introduce modeling difficulties for large-scale speaker-independent systems designed to process input from any variant of a given language. This dissertation focuses on automatically identifying the dialect or accent of a speaker given a sample of their speech, and demonstrates how such a technology can be employed to improve Automatic Speech Recognition (ASR). In this thesis, we describe a variety of approaches that make use of multiple streams of information in the acoustic signal to build a system that recognizes the regional dialect and accent of a speaker. In particular, we examine frame-based acoustic, phonetic, and phonotactic features, as well as high-level prosodic features, comparing generative and discriminative modeling techniques. We first analyze the effectiveness of approaches to language identification that have been successfully employed by that community, applying them here to dialect identification. We next show how we can improve upon these techniques. Finally, we introduce several novel modeling approaches -- Discriminative Phonotactics and kernel-based methods. We test our best performing approach on four broad Arabic dialects, ten Arabic sub-dialects, American English vs. Indian English accents, American English Southern vs. Non-Southern, American dialects at the state level plus Canada, and three Portuguese dialects. Our experiments demonstrate that our novel approach, which relies on the hypothesis that certain phones are realized differently across dialects, achieves new state-of-the-art performance on most dialect recognition tasks. 
This approach achieves an Equal Error Rate (EER) of 4% for four broad Arabic dialects, an EER of 6.3% for American vs. Indian English accents, 14.6% for American English Southern vs. Non-Southern dialects, and 7.9% for three Portuguese dialects. Our framework can also be used to automatically extract linguistic knowledge, specifically the context-dependent phonetic cues that may distinguish one dialect from another. We illustrate the efficacy of our approach by demonstrating the correlation of our results with geographical proximity of the various dialects. As a final measure of the utility of our studies, we also show that it is possible to improve ASR. Employing our dialect identification system prior to ASR to identify the Levantine Arabic dialect in mixed speech of a variety of dialects allows us to optimize the engine's language model and use Levantine-specific acoustic models where appropriate. This procedure improves the Word Error Rate (WER) for Levantine by 4.6% absolute; 9.3% relative. In addition, we demonstrate in this thesis that, using a linguistically-motivated pronunciation modeling approach, we can improve the WER of a state-of-the-art ASR system by 2.2% absolute and 11.5% relative on Modern Standard Arabic.
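For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch, not the thesis's evaluation code. `equal_error_rate` approximates the EER by sweeping a decision threshold over hypothetical trial scores and taking the point where false acceptance and false rejection are closest. `implied_baseline_wer` uses only the identity relative gain = absolute gain / baseline WER; the baselines it returns are derived here from the rounded figures in the abstract and are not stated in the source.

```python
from typing import Sequence


def equal_error_rate(target_scores: Sequence[float],
                     nontarget_scores: Sequence[float]) -> float:
    """Approximate EER: error rate at the threshold where FAR and FRR are closest."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False acceptance: non-target trials scoring at or above the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        # False rejection: target trials scoring below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def implied_baseline_wer(absolute_gain: float, relative_gain: float) -> float:
    """Baseline WER (in % points) implied by an absolute and a relative reduction."""
    return absolute_gain / relative_gain


# Hypothetical detection scores, for illustration only.
toy_eer = equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.6, 0.4, 0.2, 0.1])

# Baselines implied by the abstract's rounded figures (approximate).
levantine_baseline = implied_baseline_wer(4.6, 0.093)  # roughly 49.5% WER
msa_baseline = implied_baseline_wer(2.2, 0.115)        # roughly 19.1% WER
```

The arithmetic shows why the smaller absolute gain on Modern Standard Arabic corresponds to the larger relative gain: the implied starting WER is much lower.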
Text-to-Speech Synthesis Using Found Data for Low-Resource Languages
Text-to-speech synthesis is a key component of interactive, speech-based systems. Typically, building a high-quality voice requires collecting dozens of hours of speech from a single professional speaker in an anechoic chamber with a high-quality microphone. There are about 7,000 languages spoken in the world, and most do not enjoy the speech research attention historically paid to such languages as English, Spanish, Mandarin, and Japanese. Speakers of these so-called "low-resource languages" therefore do not equally benefit from these technological advances. While it takes a great deal of time and resources to collect a traditional text-to-speech corpus for a given language, we may instead be able to make use of various sources of "found" data which may be available. In particular, sources such as radio broadcast news and ASR corpora are available for many languages. While this kind of data does not exactly match what one would collect for a more standard TTS corpus, it may nevertheless contain parts which are usable for producing natural and intelligible parametric TTS voices.
In the first part of this thesis, we examine various types of found speech data in comparison with data collected for TTS, in terms of a variety of acoustic and prosodic features. We find that radio broadcast news in particular is a good match. Audiobooks may also be a good match despite their largely more expressive style, and certain speakers in conversational and read ASR corpora also resemble TTS speakers in their manner of speaking and thus their data may be usable for training TTS voices.
In the rest of the thesis, we conduct a variety of experiments in training voices on non-traditional sources of data, such as ASR data, radio broadcast news, and audiobooks. We aim to discover which methods produce the most intelligible and natural-sounding voices, focusing on three main approaches:
1) Training data subset selection. In noisy, heterogeneous data sources, we may wish to locate subsets of the data that are well-suited for building voices, based on acoustic and prosodic features that are known to correspond with TTS-style speech, while excluding utterances that introduce noise or other artifacts. We find that choosing subsets of speakers for training data can result in voices that are more intelligible.
2) Augmenting the frontend feature set with new features. In cleaner sources of found data, we may wish to train voices on all of the data, but we may get improvements in naturalness by including acoustic and prosodic features at the frontend and synthesizing in a manner that better matches the TTS style. We find that this approach is promising for creating more natural-sounding voices, regardless of the underlying acoustic model.
3) Adaptation. Another way to make use of high-quality data while also including informative acoustic and prosodic features is to adapt to subsets, rather than to select and train only on subsets. We also experiment with training on mixed high- and low-quality data, and adapting towards the high-quality set, which produces more intelligible voices than training on either type of data by itself.
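The first approach above, training data subset selection, can be sketched as follows. This is an illustrative assumption rather than the thesis's procedure: the feature names (`speaking_rate`, `f0_std`), the reference statistics, the speaker values, and the `max_z` cutoff are all hypothetical. The sketch keeps only speakers whose per-speaker feature averages lie close to reference statistics measured on TTS-style speech.

```python
from typing import Dict, List, Tuple

# Reference statistics per feature: (mean, standard deviation). Hypothetical values.
Reference = Dict[str, Tuple[float, float]]


def z_distance(features: Dict[str, float], reference: Reference) -> float:
    """Largest absolute z-score of a speaker's features against the reference."""
    return max(abs(features[name] - mu) / sigma
               for name, (mu, sigma) in reference.items())


def select_speakers(speakers: Dict[str, Dict[str, float]],
                    reference: Reference,
                    max_z: float = 1.5) -> List[str]:
    """Keep speakers whose every feature lies within max_z deviations of the reference."""
    return sorted(name for name, feats in speakers.items()
                  if z_distance(feats, reference) <= max_z)


# Hypothetical TTS-style reference: syllables per second and f0 variability in Hz.
tts_reference = {"speaking_rate": (4.0, 0.5), "f0_std": (30.0, 10.0)}

# Hypothetical per-speaker averages from a found-data corpus.
found_speakers = {
    "spk1": {"speaking_rate": 4.2, "f0_std": 28.0},
    "spk2": {"speaking_rate": 6.0, "f0_std": 35.0},  # much faster than the reference
    "spk3": {"speaking_rate": 3.5, "f0_std": 44.0},
}

selected = select_speakers(found_speakers, tts_reference)
```

Selecting at the speaker level rather than the utterance level mirrors the finding reported in item 1; a stricter `max_z` trades training data quantity for a closer match to TTS-style speech.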
We hope that our findings may serve as guidelines for anyone wishing to build their own TTS voice using non-traditional sources of found data.
The prosodic design of Modern Standard Arabic political monologues
The aim of this study is to describe and understand the prosodic design of Modern Standard Arabic (MSA) political monologues. To work towards this aim, we compare two political monologues produced by the same speaker with a broadcast news reading produced by a news announcer. Through comparison of political monologues and broadcast news reading, we highlight linguistic strategies which could be used in any genre of speech, and also what we argue to be persuasive strategies which contribute to the political work of persuasion. We rely on a combination of prosodic, syntactic, and discourse (semantic) evidence to account for linguistic strategies, and on a similar combination of prosodic, syntactic, and discourse (semantics and pragmatics) evidence to account for persuasive strategies, but our primary contribution is highlighting the use of prosody as a persuasive political strategy.
A further contribution of this work to the field of knowledge is the elaboration of a set of fine-grained prosodic, syntactic, and discourse structures proposed for broadcast MSA monologues. The prosodic, syntactic, and discourse structures are first labelled independently according to a set of criteria (set out in Chapter 4, Methods). Then, we triangulate the results of labelling the prosodic, syntactic, and discourse structures independently, in Chapters 5-6 leading up to Chapter 7, where the major contribution of this work is highlighted, that is, the use of prosody as a persuasive strategy. The main argument in this work is structured in this gradual way because of the way the process of segmentation is carried out on all three data samples. The process of segmentation starts with identification of abstract forms, and then associates functions with these abstract forms based on detailed explanations of specific linguistic phenomena drawn from the process of triangulation. Therefore, the methodology implemented for broadcast MSA, which can also serve as a methodology for analysing MSA political monologues, is an integral and essential part of the main argument in this thesis.
Phonological adaptation of English loanwords into Qassimi Arabic: an optimality-theoretic account
PhD Thesis
Within the field of loanword phonology, this study enhances our understanding of the role played
by the contrastive features of the borrowing language in shaping the segmental adaptation patterns
of loanwords from the source language. This has been achieved by performing a theoretical
analysis of the segmental adaptation patterns of English loanwords into Qassimi Arabic, a dialect
spoken in the region of Qassim in central Saudi Arabia, using an Optimality-Theoretic framework.
The central argument of this study assumes that fully-specified English outputs serve as the
inputs to QA. Then, the native grammar of QA allows only the
phonological features of inputs to surface that are contrastive in QA. Thus, redundant or
non-contrastive phonological features in QA are eliminated from the outputs. The evidence behind the
argument that the contrastive features of QA segments play a main role in the adaptation process
emerges from adapting the English segments that are non-native in QA. For instance, English lax
vowels /ɪ/, /ʊ/, /æ/ are adapted as their tense counterparts in QA [i], [u] and [a]. I have argued that
the reason for this adaptation lies in the fact that the feature [ATR] is not a contrastive feature
within the QA vowel inventory. Therefore, dispensing with the value of the input feature [-ATR]
culminates in the tense vowels appearing at the surface level.
To identify the contrastive features of QA phonological inventory, I rely on the Contrastive
Hierarchy Theory proposed by Dresher (2009). This theory suggests that phonological features
should be ordered hierarchically to obtain only the contrastive features of any phonological
inventory. This is achieved by dividing any inventory into subsets of features until each segment
is distinguished contrastively from all others. Therefore, the features of QA segments are built
initially into a contrastive hierarchy model. Within this hierarchy, features are created and ordered
according to one or more of the following motivations: Activity, Minimality and Universality.
Finally, the contrastive hierarchy of QA segment inventory is converted into OT constraints. The
ranking of these constraints is sufficient to account for the evaluations of the segmental adaptation
patterns of loanwords from English into QA. For instance, based on the contrastive hierarchy of
QA, /b/ is contrastively specified as [-sonorant, +labial, -continuant]. In the adaptation of English
consonants, the English input segment /p/ is mapped consistently to [b] in QA. In this case, the
contrastive hierarchy of QA consonant inventory contains the co-occurrence constraints *[αVoice,
+labial] and *[αCoronal, +labial], which filter the input features if the input is fully-specified
[-sonorant, +labial, -coronal, -continuant, -voiced, …], and permit only the contrastive features
[-sonorant, +labial, -continuant] to surface.
Qassim University in Saudi Arabia
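The contrastive-hierarchy procedure described in the abstract above can be sketched in a few lines. This is a toy illustration under loud assumptions: the seven-segment inventory, its full feature specifications, and the feature ordering are hypothetical and are not the QA inventory or hierarchy from the thesis, and the `adapt` function is a simple feature filter rather than an Optimality-Theoretic evaluation. The sketch divides the inventory by ordered features until every segment is distinguished, then maps a fully-specified input onto the segment whose contrastive features it matches.

```python
from typing import Dict, List, Optional

Spec = Dict[str, str]  # feature name -> "+" or "-"


def successive_division(inventory: Dict[str, Spec], order: List[str]) -> Dict[str, Spec]:
    """Assign each segment only the features that are contrastive under the given ordering."""
    contrastive: Dict[str, Spec] = {seg: {} for seg in inventory}

    def divide(segments: List[str], features: List[str]) -> None:
        if len(segments) <= 1 or not features:
            return
        feature, rest = features[0], features[1:]
        plus = [s for s in segments if inventory[s][feature] == "+"]
        minus = [s for s in segments if inventory[s][feature] == "-"]
        if plus and minus:
            # The feature splits this subset, so it is contrastive here.
            for s in plus:
                contrastive[s][feature] = "+"
            for s in minus:
                contrastive[s][feature] = "-"
            divide(plus, rest)
            divide(minus, rest)
        else:
            # No contrast in this subset: the feature stays unspecified.
            divide(segments, rest)

    divide(list(inventory), order)
    return contrastive


def adapt(input_spec: Spec, contrastive: Dict[str, Spec]) -> Optional[str]:
    """Return the native segment whose contrastive features all match the input."""
    for segment, spec in contrastive.items():
        if all(input_spec[f] == v for f, v in spec.items()):
            return segment
    return None


# Hypothetical toy inventory without /p/ (NOT the thesis's QA inventory).
TOY_INVENTORY = {
    "b": {"sonorant": "-", "labial": "+", "continuant": "-", "voice": "+"},
    "f": {"sonorant": "-", "labial": "+", "continuant": "+", "voice": "-"},
    "m": {"sonorant": "+", "labial": "+", "continuant": "-", "voice": "+"},
    "t": {"sonorant": "-", "labial": "-", "continuant": "-", "voice": "-"},
    "d": {"sonorant": "-", "labial": "-", "continuant": "-", "voice": "+"},
    "s": {"sonorant": "-", "labial": "-", "continuant": "+", "voice": "-"},
    "n": {"sonorant": "+", "labial": "-", "continuant": "-", "voice": "+"},
}
TOY_ORDER = ["sonorant", "labial", "continuant", "voice"]  # assumed ordering

toy_hierarchy = successive_division(TOY_INVENTORY, TOY_ORDER)

# Fully-specified English /p/: its [-voice] value is not contrastive among labials here.
english_p = {"sonorant": "-", "labial": "+", "continuant": "-", "voice": "-"}
adapted = adapt(english_p, toy_hierarchy)
```

In this toy hierarchy /b/ comes out as [-sonorant, +labial, -continuant], with no voicing specification, so the voiceless input /p/ falls together with it; voicing remains contrastive only for the coronal stops /t/ and /d/.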
Linguistic practice on contemporary Jordanian radio: publics and participation
Contemporary studies of media Arabic often pass over issues of media
form and the broader relevance of language use. The present thesis
addresses these issues directly by examining the language used in Jordanian
non-government radio programmes. It examines recordings and transcriptions
of a range of programme genres – primarily, morning talk shows and “service
programmes” (barāmiž ḳadamātiyya), and Islamic advice programmes, both of
which feature significant audience input via call-ins. The data are examined
through an interpretive form of discourse analysis, drawing on linguistic
anthropological theory that analyses language as a form of performance,
through comparison of radio programmes as ‘units of interaction.’ This is
supported by sociolinguistic data obtained from the recordings, including
phoneme frequency analysis, in addition to the author’s experience of 6
months of fieldwork in Jordan in 2014-15. The analysis focuses on four major
themes: (1) the influence of media context, specifically the sonic exclusivity
and temporal evanescence of radio, on language use, as well as the impact of
digital media; (2) the indexicality of certain locally salient sociolinguistic
variables, and the use to which they are put in radio talk; (3) the role of
language in constructing the identity, or persona, of broadcasters; and (4) the
role of language in constructing and validating authoritative discourse, in
particular that of Islamic texts and scripture in religious programming.
Through its analysis of these themes, using selected recording excerpts
as demonstrative case studies, this thesis shows that specific strategies of
Arabic use in the radio setting crucially affect both the publics – the addressed
audiences – of radio talk and the frameworks of participation in this talk
– how and to what extent broadcasters and members of the public can
participate in mediated discourse. The results demonstrate the unique value
of an interpretive study of linguistic performance for highlighting broader social
issues, including the inclusion and exclusion of particular segments of the
society through linguistic strategies – Jordanians versus non-Jordanians,
Ammanis versus non-Ammanis, and pious Muslims versus non-believers; and
the use of language to reassert, or occasionally challenge, dominant
ideologies and discourses, such as those of gender, nationalism, and religion.
This study thus contributes an examination of contemporary Jordanian non-government
radio language in its social and political context – something which
has not been attempted before, and which provides important insights
regarding both the nature of contemporary Arabic media language and its
broader social and cultural import.
La voix charismatique : aspects psychologiques et caractéristiques acoustiques.
This dissertation analyzes the charismatic voice in the context of political leadership. It is shown that the speaker-leader uses his/her voice based on two functions. The primary function is biological and consists of manipulating changes in fundamental frequency in order to be recognized as the leader of the group. The secondary function is learned and dependent upon the language spoken and the culture that one belongs to, and consists of changing voice quality in order to convey different traits and types of charisma. These functions are employed in order to persuade an audience and achieve certain goals. The phenomenon of charisma is first addressed through social-cognitive theory that distinguishes charisma of the mind (the leader's thought, actions, and vision expressed through written and spoken language) from charisma of the body (all non-verbal behaviors used for expressing one's message, affects, and emotions). Certain adjectives were established through empirical research to describe positive and negative traits in French, Italian, and Brazilian Portuguese speech. The tool MASCharP (Multi-dimensional Adjective-based Scale of Charisma Perception) was then developed in order to evaluate the charismatic traits of an individual's perceptible behavior. The study then establishes an acoustic and perceptual description of the charismatic voice. Speech range profiles are created for French, Italian, and Brazilian male leaders in order to represent the leaders' vocal extension in different communication contexts (formal vs. informal). The voice profiles demonstrate how the leaders adopt a particular vocal strategy related to the communication context as well as the leaders' persuasive strategy. These results show cross-language and cross-cultural similarities in leaders' vocal behavior. The following experimental phase demonstrates the influence of voice quality on the perception of different types and attributes of charismatic leadership. 
The speaker-leader uses his vocal production to be recognized as the leader of a group. This is true for all formal communication contexts wherein the leader must express his leadership and has a persuasive goal to achieve. If he wants to subjugate group members and hopes to appear as a dominant or threatening leader, the leader uses a low fundamental frequency associated with phonatory types such as creaky voice. If he wants to be perceived as a sincere, calm, and reassuring leader, he uses a higher fundamental frequency associated with his modal voice, avoiding phonatory types such as harsh voice. This is the primary function of the charismatic voice. Lastly, this study shows that, in political discourse, the traits of a charismatic leader are filtered by the language and cultural context of the interaction. The secondary function of the charismatic voice is therefore addressed: the use of one's voice for conveying different types of charisma, as characterized by varying attributes, is filtered through the language and culture that favor certain charismatic vocal behaviors which serve as prototypes that correspond to the audience's inherent expectations.
Ideological and cultural constraints in audiovisual translation: dubbing The Simpsons into Arabic: an approach to raise awareness and understanding of practioners involved in the dubbing and subtitling industries
Although Audiovisual Translation has received considerable attention in recent years, evidence suggests that there is a paucity of empirical research carried out on the topic of ideological and cultural constraints in audiovisual translation from English into Arabic. This is despite the fact that subtitling and dubbing Western animation into Arabic has been on the increase ever since television sets entered Arab homes, which is why several authority figures are calling for tighter control and moral screening of what is aired on television sets, in particular that which is aimed at children. This study aims to add some understanding of the problems facing practitioners in the dubbing/subtitling industry, such as the reasons for their alleged reality distortion and how these problems are dealt with by the dubbing agencies. This is achieved by exploring the extent to which ideological and cultural norms, as well as other agents, shape the outcome of dubbed English animations/films when rendered into Arabic by manipulation, subversion and/or appropriation. Fifty-two dubbed episodes of The Simpsons were selected for this study. The Simpsons was chosen due to its universal appeal and influence. It addresses many sensitive issues, such as sex, drugs, religion, politics, racial and gender stereotypes, with a bluntness and boldness rarely seen before, and goes beyond passive entertainment and school education. Therefore, it is looked at with suspicion and vigilance in the Arab World. The methodological approach adopted for this study is primarily qualitative, which is proven to provide the kind of expert understanding this study aims to achieve, as Denzin and Lincoln (1994) attest. 
Because this research springs from the conviction that the issues involved constitute a complex phenomenon, and because the aim is to uncover what could be learnt about intrinsic and extrinsic conditions, it is important to adopt Toury's (1980, 1995) descriptive translation studies paradigm as well as a critical discourse analysis strategy. This paradigm enables researchers to describe, explore, analyse, and interpret the views of the participants, and bring forth the representational properties of the screen discourse as a vehicle for ideological and cultural power transfer. The contrastive analysis of the English and Arabic versions of The Simpsons yielded interesting results; it established that the translation process is marred by many intrinsic and extrinsic factors, either exercised by the translator or imposed upon him. Ideological and socio-cultural factors are the chief culprits in the case of translating The Simpsons into Arabic.
Cultural Dynamics in a Globalized World
The book contains essays on current issues in arts and humanities in which peoples and cultures compete as well as collaborate in globalizing the world while maintaining their uniqueness as viewed from cross- and inter-disciplinary perspectives. The book covers areas such as literature, cultural studies, archaeology, philosophy, history, language studies, information and literacy studies, and area studies. Asia and the Pacific are the particular regions that the conference focuses on as they have become new centers of knowledge production in arts and humanities and, in the future, seem to be able to grow significantly as a major contributor of culture, science and arts to the globalized world. The book will help shed light on what arts and humanities scholars in Asia and the Pacific have done in terms of research and knowledge development, as well as the new frontiers of research that have been explored and are opening up, which can connect the two regions with the rest of the globe.