
    Guidelines for annotating the LUNA corpus with frame information

    This document defines the annotation workflow for adding frame information to the LUNA corpus of conversational speech. In particular, it details both the corpus pre-processing steps and the annotation process itself, giving guidance on how to choose frame and frame element labels. In addition, 20 new domain-specific and language-specific frames are described. To our knowledge, this is the first attempt to adapt the frame paradigm to dialogs and, at the same time, to define new frames and frame elements for the specific domain of software/hardware assistance. The technical report is structured as follows: Section 2 gives an overview of the FrameNet project, while Section 3 introduces the LUNA project and the annotation framework for the Italian dialogs. Section 4 details the annotation workflow, including the format preparation of the dialog files and the annotation strategy. In Section 5 we discuss the main issues in annotating frame information in dialogs and describe how the standard annotation procedure was changed to address them. Finally, the 20 newly introduced frames are reported in Section 6.
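
    A minimal sketch of how a frame-annotated dialog turn could be represented is given below; the class names, fields and the example frame are illustrative assumptions and do not reproduce the actual LUNA annotation format.

        # Illustrative representation of a frame annotation over a dialog turn;
        # field names and the example frame are assumptions, not the LUNA format.
        from dataclasses import dataclass, field
        from typing import List, Tuple

        @dataclass
        class FrameElement:
            name: str                # e.g. "Object"
            span: Tuple[int, int]    # token offsets within the turn

        @dataclass
        class FrameAnnotation:
            frame: str               # frame label, e.g. "Being_operational"
            target: Tuple[int, int]  # span of the frame-evoking element
            elements: List[FrameElement] = field(default_factory=list)

        @dataclass
        class DialogTurn:
            speaker: str
            tokens: List[str]
            frames: List[FrameAnnotation] = field(default_factory=list)

        # Example: a caller turn from a hypothetical hardware-assistance dialog
        turn = DialogTurn(
            speaker="caller",
            tokens=["the", "printer", "does", "not", "work"],
            frames=[FrameAnnotation(
                frame="Being_operational",
                target=(4, 5),
                elements=[FrameElement("Object", (0, 2))],
            )],
        )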

    Studying Directionality in Simultaneous Interpreting through an Electronic Corpus: EPIC (European Parliament Interpreting Corpus)

    Parallel corpora have long been awaited in simultaneous interpreting studies in order to validate existing theories and models. The present paper illustrates the development of the European Parliament Interpreting Corpus (EPIC), an open, parallel, multilingual (English, Italian and Spanish), POS-tagged corpus of European Parliament source speeches and simultaneously interpreted target speeches. The aim of the project is to study recurrent lexical patterns and morphosyntactic structures across all the possible language combinations and directions, and to verify empirically whether different strategies can be detected when interpreting from a Germanic language into a Romance one and vice versa, or between two Romance languages. EPIC is freely available online for the research community to use and contribute to.
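
    The sketch below illustrates the kind of directional query such a corpus supports: counting how a source-language POS pattern is realised in each target language. The record layout is assumed for illustration and is not EPIC's actual schema.

        # Count how an English ADJ+NOUN pattern is realised per interpreting
        # direction; the aligned-unit layout below is an assumed toy format.
        from collections import Counter

        aligned_units = [
            ("en->it",
             [("economic", "ADJ"), ("growth", "NOUN")],
             [("crescita", "NOUN"), ("economica", "ADJ")]),
            ("en->es",
             [("economic", "ADJ"), ("growth", "NOUN")],
             [("crecimiento", "NOUN"), ("económico", "ADJ")]),
        ]

        def pos_pattern(tagged):
            return " ".join(tag for _, tag in tagged)

        realisations = Counter(
            (direction, pos_pattern(target))
            for direction, source, target in aligned_units
            if pos_pattern(source) == "ADJ NOUN"
        )
        print(realisations)  # Counter({('en->it', 'NOUN ADJ'): 1, ('en->es', 'NOUN ADJ'): 1})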

    UmobiTalk: Ubiquitous Mobile Speech Based Learning Language Translator for Sesotho Language

    The need to conserve under-resourced languages is becoming more urgent as some of them face extinction; natural language processing can be used to redress this. Currently, most initiatives around language processing technologies focus on Western languages such as English and French, for which resources are already available. Sesotho is one of the under-resourced Bantu languages; it is mostly spoken in the Free State province of South Africa and in Lesotho. Like other parts of South Africa, the Free State has experienced a high number of migrants and non-Sesotho speakers from neighboring provinces and countries; such people face serious language barriers, especially in informal settlements where everyone tends to speak only Sesotho. Non-Sesotho speakers here refers to racial groups such as Xhosas, Zulus, Coloureds, Whites and more, for whom Sesotho is not the native language. As a solution, we developed a parallel corpus with English as the source language and Sesotho as the target language, and packaged it in UmobiTalk, a ubiquitous mobile speech-based learning translator. UmobiTalk is a mobile tool that helps English speakers learn Sesotho. The development of this tool was based on the combination of automatic speech recognition, machine translation and speech synthesis.
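
    The sketch below shows how the three components chain into a speech-to-speech pipeline; the function names and signatures are placeholders assumed for illustration, not the UmobiTalk API.

        # English speech in, Sesotho speech out: ASR -> MT -> TTS.
        # Each component is a stub standing in for an actual engine.

        def recognize_speech(audio: bytes) -> str:
            """Automatic speech recognition: English audio -> English text."""
            raise NotImplementedError("plug in an ASR engine here")

        def translate(text: str, source: str = "en", target: str = "st") -> str:
            """Machine translation trained on the English-Sesotho parallel corpus."""
            raise NotImplementedError("plug in an MT engine here")

        def synthesize(text: str, language: str = "st") -> bytes:
            """Speech synthesis: Sesotho text -> Sesotho audio."""
            raise NotImplementedError("plug in a TTS engine here")

        def english_to_sesotho_speech(audio: bytes) -> bytes:
            """Chain the three components."""
            english_text = recognize_speech(audio)
            sesotho_text = translate(english_text, source="en", target="st")
            return synthesize(sesotho_text, language="st")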

    Annotating patient clinical records with syntactic chunks and named entities: the Harvey corpus

    The free-text notes typed by physicians during patient consultations contain valuable information for the study of disease and treatment. These notes are difficult for existing natural language analysis tools to process, since they are highly telegraphic (omitting many words) and contain many spelling mistakes, inconsistencies in punctuation, and non-standard word order. To support information extraction and classification tasks over such text, we describe a de-identified corpus of free-text notes, a shallow syntactic and named entity annotation scheme for this kind of text, and an approach to training domain specialists with no linguistic background to annotate the text. Finally, we present a statistical chunking system for such clinical text with a stable learning rate and good accuracy, indicating that the manual annotation is consistent and that the annotation scheme is tractable for machine learning.
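
    The sketch below shows the BIO-style encoding commonly used for shallow syntactic chunks and how it decodes into labelled spans; the tag inventory and the telegraphic example are illustrative, not the Harvey corpus scheme itself.

        # Decode (token, BIO tag) pairs into labelled chunk spans.
        def bio_to_chunks(tokens, tags):
            chunks, current = [], None
            for token, tag in zip(tokens, tags):
                if tag.startswith("B-"):
                    if current:
                        chunks.append(current)
                    current = (tag[2:], [token])
                elif tag.startswith("I-") and current and current[0] == tag[2:]:
                    current[1].append(token)
                else:  # "O" or an inconsistent continuation
                    if current:
                        chunks.append(current)
                    current = None
            if current:
                chunks.append(current)
            return chunks

        tokens = ["pt", "c/o", "chest", "pain", "since", "tues"]
        tags   = ["B-NP", "O", "B-NP", "I-NP", "B-PP", "B-NP"]
        print(bio_to_chunks(tokens, tags))
        # [('NP', ['pt']), ('NP', ['chest', 'pain']), ('PP', ['since']), ('NP', ['tues'])]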

    Steps for Creating two Persian Specialized Corpora

    Currently, most linguistic studies benefit from valid linguistic data available in corpora, and compiling corpora is a common practice in linguistic research. The present study introduces two specialized corpora in Persian; a specialized corpus is used to study a particular type of language or language variety. To build such corpora, a set of texts was first compiled based on pre-established sampling criteria (including the mode, type, domain, language or language variety, and date of the texts). The corpora are specialized because they include technical terms in information processing and management, librarianship, linguistics, computational linguistics, thesaurus building, management, policy-making, natural language processing, information technology, information retrieval, ontology and other related interdisciplinary domains. After the data and metadata were compiled, the texts were preprocessed (normalized and tokenized) and annotated (automatic POS tagging); finally, the tags were manually checked. Each corpus includes more than four million words. Since few specialized corpora have been built for Persian, these corpora are valuable resources for researchers interested in studying linguistic variation in Persian interdisciplinary texts. https://dorl.net/dor/20.1001.1.20088302.2022.20.4.14.
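
    The sketch below illustrates the kind of normalization and tokenization step described above; the rules are deliberately simplified assumptions, not the project's actual pipeline.

        # Toy normalization (unify Arabic vs. Persian codepoints, collapse
        # whitespace) followed by word/punctuation tokenization.
        import re

        ARABIC_TO_PERSIAN = {"ي": "ی", "ك": "ک"}

        def normalize(text: str) -> str:
            for arabic, persian in ARABIC_TO_PERSIAN.items():
                text = text.replace(arabic, persian)
            return re.sub(r"\s+", " ", text).strip()

        def tokenize(text: str) -> list:
            # words as runs of word characters, punctuation as single tokens
            return re.findall(r"\w+|[^\w\s]", text)

        sample = "پردازش  زبان طبيعي."
        print(tokenize(normalize(sample)))  # ['پردازش', 'زبان', 'طبیعی', '.']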

    Natural language processing

    Beginning with the basic issues of NLP, this chapter aims to chart the major research activities in this area since the last ARIST chapter in 1996 (Haas, 1996), including: (i) natural language text processing systems, such as text summarization, information extraction and information retrieval, including domain-specific applications; (ii) natural language interfaces; (iii) NLP in the context of the WWW and digital libraries; and (iv) evaluation of NLP systems.

    Computer Assisted Language Learning Based on Corpora and Natural Language Processing: The Experience of Project CANDLE

    This paper describes Project CANDLE, an ongoing three-year project that uses various corpora and NLP technologies to construct an online English learning environment for learners in Taiwan. This report focuses on the interim results obtained in the first eighteen months. First, an English-Chinese parallel corpus, Sinorama, was used as the main course material for reading, writing, and culture-based learning courses. Second, an online bilingual concordancer, TotalRecall, and a collocation reference tool, TANGO, were developed based on Sinorama and other corpora. Third, many online lessons, including extensive reading, verb-noun collocations, and vocabulary, were designed to be used alone or together with TotalRecall and TANGO. Fourth, an online collocation check program, MUST, was developed to detect V-N miscollocations in students' writing and suggest adequate collocates, based on the hypothesis of L1 interference and on the BNC and the bilingual Sinorama corpus. Other computational scaffoldings are under development. It is hoped that this project will help intermediate learners in Taiwan enhance their English proficiency through effective pedagogical approaches and versatile language reference tools.
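
    The sketch below illustrates the general idea behind a V-N miscollocation check: look up the learner's verb-noun pair in corpus-derived counts and, if it is rare or unattested, suggest verbs that frequently occur with the same noun. The counts and threshold are invented for illustration; this is not the MUST algorithm.

        from collections import defaultdict

        # (verb, noun) -> corpus frequency, e.g. extracted from the BNC
        vn_counts = {
            ("make", "mistake"): 1250,
            ("commit", "mistake"): 40,
            ("do", "mistake"): 3,      # a typical L1-interference miscollocation
            ("take", "photo"): 800,
        }

        by_noun = defaultdict(list)
        for (verb, noun), freq in vn_counts.items():
            by_noun[noun].append((freq, verb))

        def check_collocation(verb, noun, threshold=10):
            freq = vn_counts.get((verb, noun), 0)
            if freq >= threshold:
                return "ok", []
            suggestions = [v for f, v in sorted(by_noun[noun], reverse=True) if f >= threshold]
            return "suspect", suggestions

        print(check_collocation("do", "mistake"))  # ('suspect', ['make', 'commit'])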