    Prosodic Event Recognition using Convolutional Neural Networks with Context Information

    This paper demonstrates the potential of convolutional neural networks (CNNs) for detecting and classifying prosodic events on words, specifically pitch accents and phrase boundary tones, from frame-based acoustic features. Typical approaches use not only feature representations of the word in question but also its surrounding context. We show that adding position features indicating the current word benefits the CNN. In addition, this paper discusses the generalization from a speaker-dependent modelling approach to a speaker-independent setup. The proposed method is simple and efficient and yields strong results in both speaker-dependent and speaker-independent cases. Comment: Interspeech 2017, 4 pages, 1 figure
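
    The position features mentioned above can be pictured as an extra indicator column on the frame-based input. The sketch below is an illustration only: the function name `add_position_feature`, the 0/1 encoding, and the frame layout are assumptions, not details taken from the paper.

```python
from typing import List, Sequence


def add_position_feature(
    frames: Sequence[Sequence[float]], word_start: int, word_end: int
) -> List[List[float]]:
    """Append an indicator to every frame of a word-plus-context window.

    frames     -- acoustic feature vectors, one per frame (hypothetical layout)
    word_start -- index of the first frame of the current word
    word_end   -- index one past the last frame of the current word

    The indicator is 1.0 for frames inside the current word and 0.0 for
    frames that belong to the surrounding context.
    """
    return [
        list(frame) + [1.0 if word_start <= i < word_end else 0.0]
        for i, frame in enumerate(frames)
    ]


# Five frames with two acoustic features each; the current word spans frames 1-2.
window = [[0.1, 0.2] for _ in range(5)]
marked = add_position_feature(window, word_start=1, word_end=3)
print([row[-1] for row in marked])
```

    The resulting matrix (frames by features plus one) could then be passed to a CNN, so that the network can tell the word in question apart from its neighbours.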

    Exploiting Contextual Information for Prosodic Event Detection Using Auto-Context

    Prosody and prosodic boundaries carry significant linguistic and paralinguistic information and are important aspects of speech. In the field of prosodic event detection, many local acoustic features have been investigated; however, contextual information has not yet been thoroughly exploited. The most difficult aspect of this lies in learning long-distance contextual dependencies effectively and efficiently. To address this problem, we introduce the use of an algorithm called auto-context. In this algorithm, a classifier is first trained on a set of local acoustic features, after which the generated probabilities are used along with the local features as contextual information to train new classifiers. By iteratively using updated probabilities as the contextual information, the algorithm can accurately model contextual dependencies and improve classification ability. The advantages of this method include its flexible structure and its ability to capture contextual relationships. When using the auto-context algorithm with a support vector machine, we can improve the detection accuracy by about 3% and the F-score by more than 7% on both two-way and four-way pitch accent detection in combination with the acoustic context. For boundary detection, the accuracy improvement is about 1% and the F-score improvement reaches 12%. The new algorithm outperforms conditional random fields, especially on boundary detection in terms of F-score. It also outperforms an n-gram language model on the task of pitch accent detection.
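
    The iterative scheme described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: a nearest-centroid classifier stands in for the support vector machine, the first round starts from uniform probabilities (so it is effectively trained on local features only), and all function names and the `window` parameter are assumptions.

```python
import math


def train_centroids(X, y):
    """Stand-in classifier (not an SVM): one mean feature vector per class."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in sums[c]] for c in sums}


def predict_proba(centroids, x):
    """Softmax over negative squared distances to each class centroid."""
    classes = sorted(centroids)
    scores = [-sum((a - b) ** 2 for a, b in zip(x, centroids[c])) for c in classes]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def add_context(X, probs, window):
    """Append the class probabilities of neighbouring words to each local vector."""
    n_classes = len(probs[0])
    uniform = [1.0 / n_classes] * n_classes  # padding beyond the utterance edges
    rows = []
    for i, x in enumerate(X):
        row = list(x)
        for offset in range(-window, window + 1):
            j = i + offset
            row.extend(probs[j] if 0 <= j < len(X) else uniform)
        rows.append(row)
    return rows


def auto_context(X, y, iterations=3, window=2):
    """Train one classifier per round, feeding each round's probabilities to the next."""
    n_classes = len(set(y))
    probs = [[1.0 / n_classes] * n_classes for _ in X]  # uninformative start
    models = []
    for _ in range(iterations):
        feats = add_context(X, probs, window)
        model = train_centroids(feats, y)
        probs = [predict_proba(model, f) for f in feats]
        models.append(model)
    return models, probs


# Toy sequence of words with one local feature each; labels alternate 0/1.
X = [[0.1], [0.9], [0.2], [0.8], [0.15], [0.85]]
y = [0, 1, 0, 1, 0, 1]
models, probs = auto_context(X, y, iterations=3, window=1)
print(len(models), [round(p[1], 2) for p in probs])
```

    At prediction time the stored models would be applied in the same order, each one consuming the probabilities produced by its predecessor. The toy data only exercises the mechanics and says nothing about the accuracy gains reported in the abstract.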

    Structuring information through gesture and intonation

    Face-to-face communication is multimodal. In unscripted spoken discourse we can observe the interaction of several “semiotic layers”, modalities of information such as syntax, discourse structure, gesture, and intonation. We explore the role of gesture and intonation in structuring and aligning information in spoken discourse through a study of the co-occurrence of pitch accents and gestural apices. Metaphorical spatialization through gesture also plays a role in conveying the contextual relationships between the speaker, the government and other external forces in a naturally occurring political speech setting.

    Annotation graphs as a framework for multidimensional linguistic data analysis

    In recent work we have presented a formal framework for linguistic annotation based on labeled acyclic digraphs. These 'annotation graphs' offer a simple yet powerful method for representing complex annotation structures incorporating hierarchy and overlap. Here, we motivate and illustrate our approach using discourse-level annotations of text and speech data drawn from the CALLHOME, COCONUT, MUC-7, DAMSL and TRAINS annotation schemes. With the help of domain specialists, we have constructed a hybrid multi-level annotation for a fragment of the Boston University Radio Speech Corpus which includes the following levels: segment, word, breath, ToBI, Tilt, Treebank, coreference and named entity. We show how annotation graphs can represent hybrid multi-level structures which derive from a diverse set of file formats. We also show how the approach facilitates substantive comparison of multiple annotations of a single signal based on different theoretical models. The discussion shows how annotation graphs open the door to wide-ranging integration of tools, formats and corpora. Comment: 10 pages, 10 figures, Towards Standards and Tools for Discourse Tagging, Proceedings of the Workshop, pp. 1-10. Association for Computational Linguistics
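
    A labeled acyclic digraph of this kind can be sketched as a small data structure. The class and field names below (`Arc`, `AnnotationGraph`, `tier`, `label`, optional node times) are illustrative assumptions rather than the framework's actual definitions; the example shows overlap (a tone arc parallel to a word arc) and hierarchy (a phrase arc spanning two word arcs).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Arc:
    src: int    # start node id
    dst: int    # end node id
    tier: str   # annotation level, e.g. "word" or "tone" (hypothetical names)
    label: str  # the annotation itself


@dataclass
class AnnotationGraph:
    # node id -> time offset in seconds, or None for an unanchored node
    times: Dict[int, Optional[float]] = field(default_factory=dict)
    arcs: List[Arc] = field(default_factory=list)

    def add_node(self, node_id: int, time: Optional[float] = None) -> None:
        self.times[node_id] = time

    def add_arc(self, src: int, dst: int, tier: str, label: str) -> None:
        if src not in self.times or dst not in self.times:
            raise KeyError("both end points must be existing nodes")
        self.arcs.append(Arc(src, dst, tier, label))

    def tier(self, name: str) -> List[Arc]:
        """Return all arcs belonging to one annotation level."""
        return [a for a in self.arcs if a.tier == name]

    def is_acyclic(self) -> bool:
        """Kahn's algorithm: every node is removable iff there is no cycle."""
        indegree = {n: 0 for n in self.times}
        for a in self.arcs:
            indegree[a.dst] += 1
        ready = [n for n, d in indegree.items() if d == 0]
        removed = 0
        while ready:
            node = ready.pop()
            removed += 1
            for a in self.arcs:
                if a.src == node:
                    indegree[a.dst] -= 1
                    if indegree[a.dst] == 0:
                        ready.append(a.dst)
        return removed == len(self.times)


g = AnnotationGraph()
for node_id, t in [(0, 0.00), (1, 0.21), (2, 0.58)]:
    g.add_node(node_id, t)
g.add_arc(0, 1, "word", "the")
g.add_arc(1, 2, "word", "cat")
g.add_arc(1, 2, "tone", "H*")      # overlaps the word arc for "cat"
g.add_arc(0, 2, "phrase", "NP")    # spans both word arcs
print(g.is_acyclic(), [a.label for a in g.tier("word")])
```

    Keeping every level as arcs over shared nodes is what lets annotations from different schemes be compared on the same signal.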

    Tagging Prosody and Discourse Structure in Elicited Spontaneous Speech

    This paper motivates and describes the annotation and analysis of prosody and discourse structure for several large spoken language corpora. The annotation schemas are of two types: tags for prosody and intonation, and tags for several aspects of discourse structure. The choice of the particular tagging schema in each domain is based in large part on the insights they provide in corpus-based studies of the relationship between discourse structure and the accenting of referring expressions in American English. We first describe these results and show that the same models account for the accenting of pronouns in an extended passage from one of the Speech Warehouse hotel-booking dialogues. We then turn to corpora described in Venditti [Ven00], which adapts the same models to Tokyo Japanese. Japanese is interesting to compare to English, because accent is lexically specified and so cannot mark discourse focus in the same way. Analyses of these corpora show that local pitch range expansion serves the analogous focusing function in Japanese. The paper concludes with a section describing several outstanding questions in the annotation of Japanese intonation which corpus studies can help to resolve. Work reported in this paper was supported in part by a grant from the Ohio State University Office of Research, to Mary E. Beckman and co-principal investigators on the OSU Speech Warehouse project, and by an Ohio State University Presidential Fellowship to Jennifer J. Venditti.

    Consistency of prosodic transcriptions: labelling experiments with trained and untrained transcribers

    Using the ToBI transcription to record the intonation of Slovene

    The paper presents ToBI, a transcription method for prosodic annotation. ToBI is an acronym for Tones and Break Indices, which originally denoted a system developed in the 1990s for annotating intonation and prosody in a database of spoken Mainstream American English (MAE). A MAE_ToBI transcription originally consisted of six parts: the audio recording of the utterance, the fundamental frequency contour, and four parallel tiers for the tone sequence, the orthographic transcription, the break indices between words, and additional observations. The core of the transcription, i.e. of the phonological analysis of the intonation pattern, is the tone tier, where tonal variation is transcribed with labels for high and low tones; a tone can appear as a pitch accent, a phrase accent, or a boundary tone. Due to its simplicity and flexibility, the system soon came to be used for the prosodic annotation of other varieties of English and of many other languages, as well as in various non-linguistic fields, leading to the creation of many new ToBI systems adapted to individual languages and dialects. The author is the first to use this method for Slovene, more precisely for the intonational transcription and analysis of a corpus of spontaneous speech from Slovene Istria, in order to investigate whether the ToBI system is useful for the annotation of Slovene and its regional variants.
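
    The six-part layout of a MAE_ToBI transcription described above can be pictured as a simple record. The sketch below is an assumption-laden illustration: the class name, field names, time-stamped tuple layout, and the example values are hypothetical, and `tone_kinds` relies only on the usual ToBI diacritics (`*` for pitch accents, `-` for phrase accents, `%` for boundary tones).

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class ToBITranscription:
    audio_path: str                          # recording of the utterance
    f0_contour: List[Tuple[float, float]]    # (time in s, F0 in Hz)
    tones: List[Tuple[float, str]] = field(default_factory=list)   # tone tier
    words: List[Tuple[float, str]] = field(default_factory=list)   # orthographic tier
    breaks: List[Tuple[float, int]] = field(default_factory=list)  # break-index tier
    misc: List[Tuple[float, str]] = field(default_factory=list)    # other observations


def tone_kinds(label: str) -> Set[str]:
    """Classify a tone-tier label by its diacritics; combined labels give several kinds."""
    kinds = set()
    if "*" in label:
        kinds.add("pitch accent")
    if "-" in label:
        kinds.add("phrase accent")
    if "%" in label:
        kinds.add("boundary tone")
    return kinds


# Hypothetical one-word utterance; file name and measurements are made up.
t = ToBITranscription(
    audio_path="utterance.wav",
    f0_contour=[(0.10, 180.0), (0.25, 220.0), (0.40, 150.0)],
    tones=[(0.25, "H*"), (0.40, "L-L%")],
    words=[(0.40, "yes")],
    breaks=[(0.40, 4)],
)
for time, label in t.tones:
    print(time, label, sorted(tone_kinds(label)))
```

    Keeping the four tiers as separate time-stamped lists mirrors the parallel-tier design and leaves room for language-specific label inventories.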