242 research outputs found

    The CoMeRe corpus for French: structuring and annotating heterogeneous CMC genres

    Final version for the Special Issue of JLCL (Journal of Language Technology and Computational Linguistics, http://jlcl.org/): BUILDING AND ANNOTATING CORPORA OF COMPUTER-MEDIATED DISCOURSE: Issues and Challenges at the Interface of Corpus and Computational Linguistics (ed. by Michael Beißwenger, Nelleke Oostdijk, Angelika Storrer & Henk van den Heuvel). The CoMeRe project aims to build a kernel corpus of different Computer-Mediated Communication (CMC) genres with interactions in French as the main language, by assembling interactions stemming from networks such as the Internet or telecommunication, as well as mono- and multimodal, synchronous and asynchronous communications. Corpora are assembled in a standard format, that of the TEI (Text Encoding Initiative). This implies extending, through a European endeavor, the TEI model of text, in order to encompass the richest and most complex CMC genres. This paper presents the Interaction Space model. We explain how this model has been encoded within the TEI corpus header and body. The model is then instantiated through the first four corpora we have processed: three corpora where interactions occurred in single-modality environments (text chat or SMS systems) and a fourth corpus where text chat, email and forum modalities were used simultaneously. The CoMeRe project has two main research perspectives: Discourse Analysis, only alluded to in this paper, and the linguistic study of idiolects occurring in different CMC genres. As NLP algorithms are an indispensable prerequisite for such research, we present our motivations for applying an automatic annotation process to the CoMeRe corpora. Our wish to guarantee generic annotations meant we did not consider any processing beyond morphosyntactic labelling, but prioritized the automatic annotation of any freely variant elements within the corpora. We then turn to decisions made concerning which annotations to make for which units and describe the processing pipeline for adding these. All CoMeRe corpora are verified through a staged quality control process designed to allow corpora to move from one project phase to the next. Public release of the CoMeRe corpora is a short-term goal: corpora will be integrated into the forthcoming French National Reference Corpus and disseminated through the national linguistic infrastructure ORTOLANG. We therefore highlight issues and decisions made concerning the OpenData perspective.

    Using CBR for Portuguese question generation

    In this paper, we propose a new architecture for Question Generation for the Portuguese language. This architecture aims at the automatic generation of questions to be used later, for instance, in automatic question answering by means of predictive question generation. Our approach combines a case-based reasoning system and a module for question generation. The question generation module uses manually built rules, and the case-based reasoning engine selects which of them should be applied. This is accomplished by comparing the part-of-speech tag sequences of the answer and the sentence. An identical tag sequence on sentences and answers usually implies a similar sequence on the corresponding questions. We discuss the details of this architecture, how it performs and the results obtained so far.
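
As an illustration of the retrieval step described in this abstract, the following minimal sketch picks a question rule by comparing part-of-speech tag sequences against stored cases. It is not the authors' implementation: the abstract does not say how tag sequences are compared, so `difflib.SequenceMatcher` is used as a stand-in similarity measure, and the case base, the tag sequences and the rule names (`who_subject`, `when_adjunct`, `where_adjunct`) are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical case base: each case stores the POS tag sequence of a sentence,
# the POS tag sequence of its answer span, and the name of the hand-written
# question rule that worked for that case. All values are invented.
CASE_BASE = [
    {"sentence_tags": ["PROPN", "VERB", "DET", "NOUN"],
     "answer_tags": ["PROPN"],
     "rule": "who_subject"},
    {"sentence_tags": ["PROPN", "VERB", "ADP", "NUM"],
     "answer_tags": ["ADP", "NUM"],
     "rule": "when_adjunct"},
    {"sentence_tags": ["DET", "NOUN", "VERB", "ADP", "PROPN"],
     "answer_tags": ["ADP", "PROPN"],
     "rule": "where_adjunct"},
]


def tag_similarity(tags_a, tags_b):
    """Similarity in [0, 1] between two tag sequences (assumed measure)."""
    return SequenceMatcher(None, tags_a, tags_b).ratio()


def select_rule(sentence_tags, answer_tags, case_base=CASE_BASE):
    """Return (rule name, score) of the most similar stored case."""
    def score(case):
        # Average the sentence-level and answer-level similarities.
        return (tag_similarity(sentence_tags, case["sentence_tags"])
                + tag_similarity(answer_tags, case["answer_tags"])) / 2

    best = max(case_base, key=score)
    return best["rule"], score(best)


if __name__ == "__main__":
    rule, similarity = select_rule(["PROPN", "VERB", "DET", "NOUN"], ["PROPN"])
    print(rule, similarity)
```

Averaging the two similarities is one simple choice; a real system could weight the answer sequence more heavily or require an exact match before reusing a rule.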

    The Manifesto Corpus: a new resource for research on political parties and quantitative text analysis

    This article presents a digital, open-access, multilingual, annotated corpus of electoral programs. It complements the recent methodological innovations in (semi-)computerized content analysis by providing a large, standardized text corpus for the political science community. The corpus is based on the collection of the Manifesto Project, which comprises (at the time of writing) the largest hand-annotated text corpus of electoral programs available. Since 2009 the project’s costly and time-intensive procedure of collecting and coding documents has been fully digitized. As a result, it now provides more than 1800 machine-readable documents from 40 different countries. Six hundred of these documents contain content-analyzed annotations at the level of single (quasi-)sentences, which correspond to the Manifesto Project coding scheme. Additionally, the corpus will continually be extended by incorporating new elections and digitizing older documents. The database also provides meta-information for each document (e.g., party, election, language) that allows it to be referenced back to the Manifesto Dataset. The corpus is stored in a standardized format in an online database, and an API and R package (manifestoR) guarantee easy access.

    Analysis and Design of Computational News Angles

    A key skill for a journalist is the ability to assess the newsworthiness of an event or situation. To this end, journalists often rely on news angles, conceptual criteria that are used both i) to assess whether something is newsworthy and ii) to shape the structure of the resulting news item. As journalism becomes increasingly computer-supported, and more and more sources of potentially newsworthy data become available in real time, it makes sense to try to equip journalistic software tools with operational versions of news angles, so that, when searching this vast data space, these tools can both effectively identify the events most relevant to the target audience and link them to appropriate news angles. In this paper we analyse the notion of news angle and, in particular, we i) introduce a formal framework and data schema for representing news angles and related concepts and ii) carry out a preliminary analysis and characterization of a number of commonly used news angles, both in terms of our formal model and in terms of the computational reasoning capabilities that are needed to apply them effectively to real-world scenarios. This study provides a stepping stone towards our ultimate goal of realizing a solution capable of exploiting a library of news angles to identify potentially newsworthy events in a large journalistic data space.

    Qlusty: Quick and dirty generation of event videos from written media coverage

    Qlusty automatically generates videos describing the coverage of the same event by different news outlets. Through four modules it identifies events, de-duplicates notes, ranks according to coverage, and queries for images to generate an overview video. In this manuscript we present our preliminary models, including quantitative evaluations of the former two and a qualitative analysis of the latter two. The results show the potential for achieving our main aim: contributing to breaking the information bubble, so common in the current news landscape.