    Multi-Tier Annotations in the Verbmobil Corpus

    In very large and diverse scientific projects, where groups as different as linguists and engineers work with different intentions on the same signal data or its orthographic transcript and annotate new, valuable information, it is not easy to build a homogeneous corpus. We describe how this can be achieved, considering that some of these annotations have not been updated properly, or are based on erroneous or deliberately changed versions of the base transcription. We used an algorithm similar to dynamic programming to detect differences between the transcription on which an annotation depends and the reference transcription for the whole corpus. These differences are automatically mapped onto a set of repair operations on the transcriptions, such as splitting compound words and merging neighbouring words. On the basis of these operations the correction process in the annotation is carried out. Whether a correction can be carried out automatically or has to be fixed manually depends on the type of the annotation as well as on the position and the nature of the difference. Finally, we present an investigation in which we exploit the multi-tier annotations of the Verbmobil corpus to find out how breathing is correlated with prosodic-syntactic boundaries and dialog acts.
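    The mapping from transcription differences to repair operations can be sketched in a few lines. This is our own illustration, not the authors' code: it uses Python's difflib in place of their dynamic-programming aligner, and the operation names ('split', 'merge', 'replace') are hypothetical.

```python
# Illustration of the difference-to-repair-operation mapping, using
# Python's difflib in place of the paper's dynamic-programming aligner;
# the operation names are hypothetical.
from difflib import SequenceMatcher

def repair_ops(annotated, reference):
    """Map differences between two tokenized transcriptions onto repair
    operations such as splitting compounds and merging neighbours."""
    ops = []
    sm = SequenceMatcher(a=annotated, b=reference, autojunk=False)
    for tag, a1, a2, b1, b2 in sm.get_opcodes():
        if tag == 'equal':
            continue
        old, new = annotated[a1:a2], reference[b1:b2]
        if len(old) == 1 and ''.join(new) == old[0]:
            ops.append(('split', a1, new))            # compound split apart
        elif len(new) == 1 and ''.join(old) == new[0]:
            ops.append(('merge', (a1, a2), new[0]))   # neighbours merged
        else:
            ops.append(('replace', (a1, a2), new))    # needs manual review
    return ops

print(repair_ops(['the', 'databasesystem', 'runs'],
                 ['the', 'database', 'system', 'runs']))
# → [('split', 1, ['database', 'system'])]
```

    In the paper, whether an operation can then be applied to an annotation automatically depends on the annotation type and on where the difference falls; here the 'replace' case stands in for those manual fixes.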

    MAUS Goes Iterative

    In this paper we describe further developments of the MAUS system and announce a freeware software package that may be downloaded from the 'Bavarian Archive for Speech Signals' (BAS) web site. The quality of the MAUS output can be considerably improved by using an iterative technique. In this mode MAUS calculates a first pass through all the target speech material using the standard speaker-independent acoustic models of the target language. The segmented and labelled speech data are then used to re-estimate the acoustic models, and the MAUS procedure is applied to the speech data again using these speaker-dependent models. The last two steps are repeated iteratively until the segmentation converges. The paper describes the general algorithm, the German benchmark for evaluating the method, as well as some experiments on German target speakers.
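    The iterative scheme lends itself to a compact control-flow sketch. The functions `forced_align` and `reestimate_models` below are hypothetical stand-ins for the MAUS alignment pass and the acoustic-model re-estimation step; only the loop structure mirrors the description above.

```python
# Control-flow sketch of the iterative mode described above.
# `forced_align` and `reestimate_models` are hypothetical stand-ins for
# the MAUS alignment pass and the acoustic-model re-estimation step.
def iterative_maus(speech, models, forced_align, reestimate_models, max_iter=10):
    # Pass 1: segment with the standard speaker-independent models.
    segmentation = forced_align(speech, models)
    for _ in range(max_iter):
        # Re-estimate speaker-dependent models from the current segmentation.
        models = reestimate_models(speech, segmentation)
        new_segmentation = forced_align(speech, models)
        if new_segmentation == segmentation:  # converged: boundaries unchanged
            break
        segmentation = new_segmentation
    return segmentation
```

    The convergence test here simply compares segmentations for equality; a real system might instead stop when boundary movements fall below a threshold.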

    The Lexicon Graph Model: a generic model for multimodal lexicon development

    Trippel T. The Lexicon Graph Model: a generic model for multimodal lexicon development. Bielefeld (Germany): Bielefeld University; 2006. The Lexicon Graph Model provides a model and framework for lexicons that can be corpus-based and contain multimodal information. The focus is on the lexicon-theory perspective, looking at the underlying data structures that are part of existing lexicons and corpora. The term lexicon in linguistics and artificial intelligence is used in different ways, covering traditional print dictionaries in book form, CD-ROM editions and web-based versions of the same, but also computerized resources of similar structure to be used by applications; these applications range from systems for human-machine communication to spell checkers. In this work, the term lexicon is used as the most generic term covering all lexical applications. Existing formalisms and approaches in lexicon development exhibit various problems with lexicons, for example combining different existing lexical resources into one, disambiguating ambiguities on different lexical levels, representing other modalities in a lexicon, and selecting the lexical key for a lexicon entry. The Lexicon Graph Model presupposes that lexicons can differ in content but share a fundamentally similar structure, making it possible to combine different kinds of lexicons in a unification process that is free of duplicates, resulting in a declarative lexicon. The underlying model is a graph, the Lexicon Graph, which is modelled similarly to the Annotation Graphs described by Bird and Libermann and can therefore be processed in similar ways. The investigation of the lexicon formalism proceeds in four steps: the analysis and description of existing lexicons, the introduction of the Lexicon Graph Model as a generic representation for lexicons, the implementation and testing of the formalism in different contexts, and an evaluation of the formalism. It is shown that Annotation Graphs and Lexicon Graphs are indeed related, not only in their formalism, and which standards have to be applied to annotations for them to be usable for lexicon development.
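    The central unification idea, lexicons as graphs that can be merged free of duplicates, can be illustrated in toy form. The node and edge encoding below is our own simplification, not the thesis' formalism; set union is what makes the merge duplicate-free.

```python
# Toy illustration of lexicon unification: each lexicon is a labelled
# graph, and set union gives a duplicate-free merge. The node and edge
# encoding here is our simplification, not the thesis' formalism.
def merge_lexicon_graphs(*graphs):
    nodes, edges = set(), set()
    for g in graphs:
        nodes |= g['nodes']
        edges |= g['edges']   # (source, relation, target) triples
    return {'nodes': nodes, 'edges': edges}

orthography = {'nodes': {'Haus', 'haus-NOUN'},
               'edges': {('Haus', 'lemma-of', 'haus-NOUN')}}
pronunciation = {'nodes': {'Haus', 'haUs'},
                 'edges': {('Haus', 'pronounced-as', 'haUs')}}
merged = merge_lexicon_graphs(orthography, pronunciation)
print(sorted(merged['nodes']))  # the shared node 'Haus' appears only once
```

    Because the merged lexicon is just a set of labelled arcs, it stays declarative: different key choices for lexicon entries amount to different traversals of the same graph.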

    Annotation of negotiation processes in joint-action dialogues

    Situated dialogic corpora are invaluable resources for understanding the complex relationship between language, perception, and action as they are based on naturalistic dialogue situations in which the interactants are given shared goals to be accomplished in the real world. In such situations, verbal interactions are intertwined with actions, and shared goals can only be achieved via dynamic negotiation processes based on common ground constructed from discourse history as well as the interactants' knowledge about the status of actions. In this paper, we propose four major dimensions of collaborative tasks that affect the negotiation processes among interactants, and, hence, the structure of the dialogue. Based on a review of available dialogue corpora and annotation manuals, we show that existing annotation schemes so far do not adequately account for the complex dialogue processes in situated task-based scenarios. We illustrate the effects of specific features of a scenario using annotated samples of dialogue taken from the literature as well as our own corpora, and end with a brief discussion of the challenges ahead.

    Privacy Guarantees for De-identifying Text Transformations

    Machine Learning approaches to Natural Language Processing tasks benefit from a comprehensive collection of real-life user data. At the same time, there is a clear need to protect the privacy of the users whose data is collected and processed. For text collections, such as transcripts of voice interactions or patient records, replacing sensitive parts with benign alternatives can provide de-identification. However, how much privacy is actually guaranteed by such text transformations, and are the resulting texts still useful for machine learning? In this paper, we derive formal privacy guarantees for general text transformation-based de-identification methods on the basis of Differential Privacy. We also measure the effect that different ways of masking private information in dialog transcripts have on a subsequent machine learning task. To this end, we formulate different masking strategies and compare their privacy-utility trade-offs. In particular, we compare a simple redact approach with more sophisticated word-by-word replacement using deep learning models on multiple natural language understanding tasks such as named entity recognition, intent detection, and dialog act classification. We find that only word-by-word replacement is robust against performance drops in various tasks.
    Comment: Proceedings of INTERSPEECH 202
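    The two masking strategies being compared can be contrasted in toy form. The entity labels and the surrogate table below are illustrative assumptions; a real pipeline would obtain them from a named-entity recognizer and, for replacement, from a surrogate generator such as the deep learning models the paper evaluates.

```python
# Toy contrast of the two masking strategies discussed above: redaction
# replaces every sensitive token with one placeholder, while word-by-word
# replacement substitutes a same-category surrogate. The entity labels
# and the surrogate table are illustrative assumptions.
def redact(tokens, entities):
    return [tok if i not in entities else '[REDACTED]'
            for i, tok in enumerate(tokens)]

def replace_word_by_word(tokens, entities, surrogates):
    return [tok if i not in entities else surrogates[entities[i]]
            for i, tok in enumerate(tokens)]

tokens = ['call', 'Alice', 'on', 'Monday']
entities = {1: 'PERSON', 3: 'DATE'}               # token index -> entity type
surrogates = {'PERSON': 'Bob', 'DATE': 'Friday'}  # same-category stand-ins
print(redact(tokens, entities))                   # ['call', '[REDACTED]', 'on', '[REDACTED]']
print(replace_word_by_word(tokens, entities, surrogates))  # ['call', 'Bob', 'on', 'Friday']
```

    The replaced version keeps the token categories intact, which is plausibly why downstream tasks such as intent detection suffer less than under blanket redaction.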

    The DBOX Corpus Collection of Spoken Human-Human and Human-Machine Dialogues

    This paper describes the data collection and annotation carried out within the DBOX project (Eureka project number E! 7152). This project aims to develop interactive games based on spoken natural-language human-computer dialogues in three European languages: English, German and French. We collect the DBOX data continuously. We start with human-human Wizard-of-Oz experiments in order to model natural human dialogue behaviour, to better understand the phenomena of human interaction and to predict interlocutors' actions, and then replace the human Wizard by an increasingly advanced dialogue system, using evaluation data for system improvement. The designed dialogue system relies on a Question-Answering (QA) approach while showing truly interactive gaming behaviour, e.g., by providing feedback, managing turns and contact, and producing social signals and acts (encouraging vs. downplaying, polite vs. rude, positive vs. negative attitude towards players or their actions, etc.). The DBOX dialogue corpus has required substantial investment, and we expect it to have a great impact on the rest of the project. The DBOX project consortium will continue to maintain the corpus and to take an interest in its growth, e.g., expanding to other languages. The resulting corpus will be publicly released.

    Discourse parsing for multi-party chat dialogues

    In this paper we present the first, to the best of our knowledge, discourse parser for multi-party chat dialogues. Discourse in multi-party dialogues differs dramatically from monologues: threaded conversations are commonplace, which makes predicting the discourse structure a compelling problem. Moreover, because our data come from chats, syntactic and lexical information is of little use, since people take great liberties in expressing themselves lexically and syntactically. We use the dependency parsing paradigm, as has been done in the past (Muller et al., 2012; Li et al., 2014): we learn local probability distributions and then use MST decoding. We achieve 0.680 F1 on unlabelled structures and 0.516 F1 on fully labelled structures, which is better than many state-of-the-art systems for monologues, despite the inherent difficulties of multi-party chat dialogues.
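    The decoding step, finding the maximum spanning tree over local attachment scores, is standard Chu-Liu/Edmonds. The sketch below is a generic implementation under our own score format (a nested dict `score[head][dependent]`), not the authors' code; in their setting the scores would come from the learned local probability distributions.

```python
# Generic Chu-Liu/Edmonds maximum-spanning-tree decoder over attachment
# scores; score[h][d] is the score (e.g. a log-probability) of making
# node d a dependent of head h. The score format is our assumption.
def _find_cycle(parent):
    # Return the nodes of a cycle in the dependent -> head map, or None.
    done = set()
    for start in parent:
        if start in done:
            continue
        path, node = [], start
        while node in parent and node not in done and node not in path:
            path.append(node)
            node = parent[node]
        if node in path:
            return path[path.index(node):]
        done.update(path)
    return None

def mst_decode(score, root=0):
    nodes = set(score)
    # Greedy step: best incoming arc for every non-root node.
    best = {d: max(nodes - {d}, key=lambda h: score[h][d])
            for d in nodes - {root}}
    cycle = _find_cycle(best)
    if cycle is None:
        return best
    # Contract the cycle into a single fresh node c.
    C = set(cycle)
    c = ('contracted', frozenset(C))
    enter, leave, new_score = {}, {}, {c: {}}
    for h in nodes - C:
        new_score[h] = {}
        # Arc h -> c: best way to break into the cycle at some d in C.
        d_in = max(C, key=lambda d: score[h][d] - score[best[d]][d])
        enter[h] = d_in
        new_score[h][c] = score[h][d_in] - score[best[d_in]][d_in]
        # Arc c -> h: best head inside the cycle (the root needs no head).
        if h != root:
            leave[h] = max(C, key=lambda hc: score[hc][h])
            new_score[c][h] = score[leave[h]][h]
        # Arcs among the remaining original nodes are unchanged.
        for d in nodes - C - {h, root}:
            new_score[h][d] = score[h][d]
    sub = mst_decode(new_score, root)
    # Expand: keep all cycle arcs except the one into the break point.
    heads = {d: (leave[d] if h == c else h) for d, h in sub.items() if d != c}
    d_break = enter[sub[c]]
    for d in C:
        heads[d] = best[d] if d != d_break else sub[c]
    return heads
```

    Greedily picking the best head per node can create cycles (the case the contraction handles), which is exactly why dependency parsers decode with an MST algorithm rather than per-node argmax.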