813 research outputs found

    Towards error annotation in a learner corpus of Portuguese

    Get PDF
    In this article, we present COPLE2, a new corpus of Portuguese that encompasses written and spoken data produced by foreign learners of Portuguese as a foreign or second language (FL/L2). Following the trend towards learner corpus research applied to less commonly taught languages, it is our aim to enhance the learning data of Portuguese L2. These data may be useful not only for educational purposes (design of learning materials, curricula, etc.) but also for the development of NLP tools to support students in their learning process. The corpus is available online using TEITOK environment, a web-based framework for corpus treatment that provides several built-in NLP tools and a rich set of functionalities (multiple orthographic transcription layers, lemmatization and POS, normalization of the tokens, error annotation) to automatically process and annotate texts in xml format. A CQP-based search interface allows searching the corpus for different fields, such as words, lemmas, POS tags or error tags. We will describe the work in progress regarding the constitution and linguistic annotation of this corpus, particularly focusing on error annotation.info:eu-repo/semantics/publishedVersio

    Linguistic annotation in/for corpus linguistics

    Get PDF
    This article surveys linguistic annotation in corpora and corpus linguistics. We first define the concept of 'corpus ' as a radial category and then, in Section 2, discuss a variety of kinds of information for which corpora are annotated and that are exploited in contemporary corpus linguistics. Section 3 then exemplifies many current formats of annotation with an eye to highlighting both the diversity of formats currently available and the emergence of XML annotation as, for now, the most widespread form of annotation. Section 4 summarizes and concludes with desiderata for future developments.

    Compiling and annotating a learner corpus for a morphologically rich language: CzeSL, a corpus of non-native Czech

    Get PDF
    Learner corpora, linguistic collections documenting a language as used by learners, provide an important empirical foundation for language acquisition research and teaching practice. This book presents CzeSL, a corpus of non-native Czech, against the background of theoretical and practical issues in the current learner corpus research. Languages with rich morphology and relatively free word order, including Czech, are particularly challenging for the analysis of learner language. The authors address both the complexity of learner error annotation, describing three complementary annotation schemes, and the complexity of description of non-native Czech in terms of standard linguistic categories. The book discusses in detail practical aspects of the corpus creation: the process of collection and annotation itself, the supporting tools, the resulting data, their formats and search platforms. The chapter on use cases exemplifies the usefulness of learner corpora for teaching, language acquisition research, and computational linguistics. Any researcher developing learner corpora will surely appreciate the concluding chapter listing lessons learned and pitfalls to avoid

    The Lexicon Graph Model : a generic model for multimodal lexicon development

    Get PDF
    Trippel T. The Lexicon Graph Model : a generic model for multimodal lexicon development. Bielefeld (Germany): Bielefeld University; 2006.Das Lexicon Graph Model stellt ein Modell für Lexika dar, die korpusbasiert sein können und multimodale Informationen enthalten. Hierbei wird die Perspektive der Lexikontheorie eingenommen, wobei die zugrundeliegenden Datenstrukturen sowohl vom Lexikon als auch von Annotationen betrachtet werden. Letztere fallen dadurch in das Blickfeld, weil sie als Grundlage für die Erstellung von Lexika gesehen werden. Der Begriff des Lexikons bezieht sich hier sowohl auf den Bereich des Wörterbuchs als auch der in elektronischen Applikationen integrierten Lexikondatenbanken. Die existierenden Formalismen und Ansätze der Lexikonentwicklung zeigen verschiedene Probleme im Zusammenhang mit Lexika auf, etwa die Zusammenfassung von existierenden Lexika zu einem, die Disambiguierung von Mehrdeutigkeiten im Lexikon auf verschiedenen lexikalischen Ebenen, die Repräsentation von anderen Modalitäten im Lexikon, die Selektion des lexikalischen Schlüsselbegriffs für Lexikonartikel, etc. Der vorliegende Ansatz geht davon aus, dass sich Lexika zwar in ihrem Inhalt, nicht aber in einer grundlegenden Struktur unterscheiden, so dass verschiedenartige Lexika im Rahmen eines Unifikationsprozesses dublettenfrei miteinander verbunden werden können. Hieraus resultieren deklarative Lexika. Für Lexika können diese Graphen mit dem Lexikongraph-Modell wie hier dargestellt modelliert werden. Dabei sind Lexikongraphen analog den von Bird und Libermann beschriebenen Annotationsgraphen gesehen und können daher auch ähnlich verarbeitet werden. Die Untersuchung des Lexikonformalismus beruht auf vier Schritten. Zunächst werden existierende Lexika analysiert und beschrieben. Danach wird mit dem Lexikongraph-Modell eine generische Darstellung von Lexika vorgestellt, die auch implementiert und getestet wird. Basierend auf diesem Formalismus wird die Beziehung zu Annotationsgraphen hergestellt, wobei auch beschrieben wird, welche Maßstäbe an angemessene Annotationen für die Verwendung zur Lexikonentwicklung angelegt werden müssen.The Lexicon Graph Model provides a model and framework for lexicons that can be corpus based and contain multimodal information. The focus is more from the lexicon theory perspective, looking at the underlying data structures that are part of existing lexicons and corpora. The term lexicon in linguistics and artificial intelligence is used in different ways, including traditional print dictionaries in book form, CD-ROM editions, Web based versions of the same, but also computerized resources of similar structures to be used by applications. These applications cover systems for human-machine communication as well as spell checkers. The term lexicon in this work is used as the most generic term covering all lexical applications. Existing formalisms in lexicon development show different problems with lexicons, for example combining different kinds of lexical resources, disambiguation on different lexical levels, the representation of different modalities in a lexicon. The Lexicon Graph Model presupposes that lexicons can have different structures but have fundamentally a similar structure, making it possible to combine lexicons in a unification process, resulting in a declarative lexicon. The underlying model is a graph, the Lexicon Graph, which is modeled similar to Annotation Graphs as described by Bird and Libermann. The investigation of the lexicon formalism contains four steps, that is the analysis of existing lexicons, the introduction of the Lexicon Graph Model as a generic representation for lexicons, the implementation of the formalism in different contexts and an evaluation of the formalism. It is shown that Annotation Graphs and Lexicon Graphs are indeed related not only in their formalism and it is shown, what standards have to be applied to annotations to be usable for lexicon development

    Semi-automation of gesture annotation by machine learning and human collaboration

    Get PDF
    none6siGesture and multimodal communication researchers typically annotate video data manually, even though this can be a very time-consuming task. In the present work, a method to detect gestures is proposed as a fundamental step towards a semi-automatic gesture annotation tool. The proposed method can be applied to RGB videos and requires annotations of part of a video as input. The technique deploys a pose estimation method and active learning. In the experiment, it is shown that if about 27% of the video is annotated, the remaining parts of the video can be annotated automatically with an F-score of at least 0.85. Users can run this tool with a small number of annotations first. If the predicted annotations for the remainder of the video are not satisfactory, users can add further annotations and run the tool again. The code has been released so that other researchers and practitioners can use the results of this research. This tool has been confirmed to work in conjunction with ELAN.openIenaga, Naoto; Cravotta, Alice; Terayama, Kei; Scotney, Bryan W.; Saito, Hideo; Busà, M. GraziaIenaga, Naoto; Cravotta, Alice; Terayama, Kei; Scotney, Bryan W.; Saito, Hideo; Busà, M. Grazi

    Production Methods

    Get PDF

    Essential Speech and Language Technology for Dutch: Results by the STEVIN-programme

    Get PDF
    Computational Linguistics; Germanic Languages; Artificial Intelligence (incl. Robotics); Computing Methodologie

    Extending automatic transcripts in a unified data representation towards a prosodic-based metadata annotation and evaluation

    Get PDF
    This paper describes a framework that extends automatic speech transcripts in order to accommodate relevant information coming from manual transcripts, the speech signal itself, and other resources, like lexica. The proposed framework automatically collects, relates, computes, and stores all relevant information together in a self-contained data source, making it possible to easily provide a wide range of interconnected information suitable for speech analysis, training, and evaluating a number of automatic speech processing tasks. The main goal of this framework is to integrate different linguistic and paralinguistic layers of knowledge for a more complete view of their representation and interactions in several domains and languages. The processing chain is composed of two main stages, where the first consists of integrating the relevant manual annotations in the speech recognition data, and the second consists of further enriching the previous output in order to accommodate prosodic information. The described framework has been used for the identification and analysis of structural metadata in automatic speech transcripts. Initially put to use for automatic detection of punctuation marks and for capitalization recovery from speech data, it has also been recently used for studying the characterization of disfluencies in speech. It was already applied to several domains of Portuguese corpora, and also to English and Spanish Broadcast News corpora
    corecore