    Parsing Speech: A Neural Approach to Integrating Lexical and Acoustic-Prosodic Information

    In conversational speech, the acoustic signal provides cues that help listeners disambiguate difficult parses. For automatically parsing spoken utterances, we introduce a model that integrates transcribed text and acoustic-prosodic features using a convolutional neural network over energy and pitch trajectories coupled with an attention-based recurrent neural network that accepts text and prosodic features. We find that different types of acoustic-prosodic features are individually helpful, and together give statistically significant improvements in parse and disfluency detection F1 scores over a strong text-only baseline. For this study with known sentence boundaries, error analyses show that the main benefit of acoustic-prosodic features is in sentences with disfluencies, that attachment decisions are most improved, and that transcription errors obscure gains from prosody. Comment: Accepted in NAACL HLT 201
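The idea of convolving over energy and pitch trajectories and passing the result to a recurrent network alongside the text can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `conv1d_maxpool`, the frame counts, the filter sizes, the random weights, and the concatenation with a word embedding are all assumptions chosen for illustration.

```python
import numpy as np

def conv1d_maxpool(traj, filters):
    """Convolve filters over a (frames x channels) trajectory, then max-pool over time.

    traj:    array of shape (T, C), e.g. C = 2 for [energy, pitch] per frame.
    filters: array of shape (K, W, C), K filters each spanning W frames.
    Returns a fixed-length vector of shape (K,), whatever the value of T.
    """
    T, C = traj.shape
    K, W, Cf = filters.shape
    assert C == Cf and T >= W
    # All length-W windows of the trajectory (valid convolution, stride 1).
    windows = np.stack([traj[t:t + W] for t in range(T - W + 1)])  # (T-W+1, W, C)
    acts = np.einsum("nwc,kwc->nk", windows, filters)              # (T-W+1, K)
    acts = np.maximum(acts, 0.0)                                    # ReLU
    return acts.max(axis=0)                                         # max over time

rng = np.random.default_rng(0)
word_frames = rng.normal(size=(40, 2))  # hypothetical: 40 frames of [energy, pitch]
filters = rng.normal(size=(8, 5, 2))    # hypothetical: 8 untrained filters, 5 frames wide
prosody_vec = conv1d_maxpool(word_frames, filters)

word_embedding = rng.normal(size=16)    # hypothetical text embedding for the same word
# Assumed way of combining text and prosody before a recurrent encoder.
rnn_input = np.concatenate([word_embedding, prosody_vec])
```

Max-pooling over time is used here so that words of different durations yield vectors of the same length; in a trained model the filters would be learned jointly with the parser.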

    Acoustic-prosodic entrainment in structural metadata events

    This paper presents an acoustic-prosodic analysis of entrainment in a Portuguese map-task corpus. Our aim is to analyze how turn-by-turn entrainment varies with distinct structural metadata events: types of sentence-like units (SU) in consecutive turns (e.g. interrogatives followed by declaratives, or both declaratives), and the presence of discourse markers, affirmative cue words, and disfluencies at the beginning of turns. Entrainment at turn exchanges may be observed in terms of pitch, energy, duration, and voice quality. Regarding SU types, question-answer turns show the strongest similarity, and declarative-interrogative pairs show the least entrainment, as expected. Moreover, in question-answer pairs, there is stronger evidence of entrainment with Yes/No and Tag questions than with Wh- questions. In fact, these subtypes are coded in distinctive prosodic ways (moreover, the first subtype has no associated lexical-syntactic cues in Portuguese, only prosodic ones). As for turn-initial structures, entrainment is stronger when the second turn begins with an affirmative cue word; less strong with ambiguous structures (such as ‘OK’), emphatic affirmative answers, and negative answers; and scarce with disfluencies and discourse markers. The different degrees of local entrainment may be related to the informative structure of distinct structural metadata events.

    Cross-domain analysis of discourse markers in European Portuguese

    This paper presents an analysis of discourse markers in two spontaneous speech corpora for European Portuguese - university lectures and map-task dialogues - and also in a collection of tweets, with the aim of contributing to their categorization, which is scarcely available for European Portuguese. Our results show that the selection of discourse markers is domain- and speaker-dependent. We also found that the most frequent discourse markers are similar in all three corpora, although the tweets contain discourse markers not found in the other two corpora. In this multidisciplinary study, comprising both a linguistic perspective and a computational approach, discourse markers are also automatically discriminated from other structural metadata events, namely sentence-like units and disfluencies. Our results show that discourse markers and disfluencies tend to co-occur in the dialogue corpus, but have a complementary distribution in the university lectures. We used three acoustic-prosodic feature sets and machine learning to automatically distinguish between discourse markers, disfluencies and sentence-like units. Our in-domain experiments achieved an accuracy of about 87% in university lectures and 84% in dialogues, in line with our previous results. The eGeMAPS features, commonly used for other paralinguistic tasks, performed well on our data, especially considering the small size of the feature set. Our results suggest that turn-initial discourse markers are usually easier to classify than disfluencies, a result also previously reported in the literature. We conducted a cross-domain evaluation to assess the robustness of the models across domains. The results achieved are about 11%-12% lower, but we conclude that data from one domain can still be used to classify the same events in the other. Overall, despite the complexity of this task, these are very encouraging state-of-the-art results.
Ultimately, using exclusively acoustic-prosodic cues, discourse markers can be discriminated fairly well from disfluencies and SUs. In order to better understand the contribution of each feature, we have also reported the impact of the features in both the dialogues and the university lectures. Pitch features, namely pitch slopes, are the most relevant ones for the distinction between discourse markers and disfluencies. These features are in line with the wide pitch range of discourse markers, in a continuum from a very compressed pitch range to a very wide one, expressed by totally deaccented material or H+L* L* contours, with upstep H tones.
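Since pitch slopes are reported as the most relevant features, a small sketch of one common way to compute such a slope may help. The abstract does not say how the slopes were extracted, so everything here is an assumption: the function name `pitch_slope`, the 10 ms frame step, the 100 Hz reference, the convention that unvoiced frames are coded as 0, and the use of a least-squares line fit over semitone values.

```python
import numpy as np

def pitch_slope(f0_hz, frame_step_s=0.01, ref_hz=100.0):
    """Least-squares slope of an f0 contour, in semitones per second.

    f0_hz: per-frame f0 values in Hz; frames with f0 <= 0 are treated as
           unvoiced and ignored (an assumed convention).
    Returns NaN when fewer than two voiced frames are available.
    """
    f0 = np.asarray(f0_hz, dtype=float)
    voiced = f0 > 0
    if voiced.sum() < 2:
        return float("nan")
    times = np.arange(len(f0)) * frame_step_s
    # Convert Hz to semitones relative to ref_hz so slopes are comparable across speakers.
    semitones = 12.0 * np.log2(f0[voiced] / ref_hz)
    slope, _intercept = np.polyfit(times[voiced], semitones, 1)
    return float(slope)
```

A contour that rises by one octave over one second gives a slope of 12 semitones per second under these assumptions; a flat contour gives 0. Such a value could be computed per word or per turn-initial segment and passed to a classifier together with other acoustic-prosodic features.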

    What makes business speakers sound charismatic? A contrastive acoustic-melodic analysis of Steve Jobs and Mark Zuckerberg

    Phonetic research on the prosodic sources of perceived charisma has taken a big step towards making a speaker’s tone-of-voice a tangible, quantifiable, and trainable matter. However, the tone-of-voice includes a complex bundle of acoustic features, and a lot of parameters have not even been looked at so far. Moreover, all previous studies focused on political or religious leaders and left aside the large field of managers and CEOs in the world of business. These are the two research gaps addressed in the present study. An acoustic analysis of about 1,350 prosodic phrases from keynotes given by a more charismatic CEO (Steve Jobs) and a less charismatic CEO (Mark Zuckerberg) suggests that the same tone-of-voice settings that make political or religious leaders sound more charismatic also work for business speakers. In addition, results point to further charisma-relevant acoustic parameters related to rhythm, emphasis, pausing, and voice quality - as well as to audience type as a significant context factor. The findings are discussed with respect to implications for future perception-oriented studies and perspectives for a computer-based measurement, assessment, and training of a charismatic tone of voice.

    Sentence topics in Italian: an analysis of the CHROME corpus

    The present study deals with the phonetic description of sentence topics in Italian tourist guides’ speech. Topical coherence characterizes the communicative strategies that human experts adopt when delivering contents to the visitors of cultural sites. Topical progression, which ensures temporal, spatial, and referential continuity, is frequently expressed by sentence topics as well. The corpus examined therefore offers an opportunity to analyze how these entities are realized. The relevant literature generally supports the idea of a topic accent, and a rising-falling (or “hat”) contour is described as the most frequent for the unmarked topic in Italian utterance structures, but other realizations are also possible. The hypothesis that we want to test in this work is whether this variability is due to specific syntactic and pragmatic factors. Hence, we investigate the phonetic realization of sentence topics as a function of syntactic features (structure, function and weight) and textual-pragmatic features (discourse role considering ±aboutness, ±contrastiveness, ±givenness). Specifically, tonal events, i.e., accents and boundaries, phonetic phrasing, and disfluency phenomena were investigated. Results show that both syntactic and pragmatic factors play a role in the phonetic realization of topics, though they act at different levels. In particular, disfluencies are found to be affected by syntactic weight and givenness, while tonal events seem to depend mainly on the discourse role.