
    TermEval 2020 : shared task on automatic term extraction using the Annotated Corpora for term Extraction Research (ACTER) dataset

    The TermEval 2020 shared task provided a platform for researchers to work on automatic term extraction (ATE) with the same dataset: the Annotated Corpora for Term Extraction Research (ACTER). The dataset covers three languages (English, French, and Dutch) and four domains, of which the domain of heart failure was kept as a held-out test set on which final F1-scores were calculated. The aim was to provide a large, transparent, qualitatively annotated, and diverse dataset to the ATE research community, with the goal of promoting comparative research and thus identifying strengths and weaknesses of various state-of-the-art methodologies. The results show considerable variation between systems and illustrate how some methodologies reach higher precision or recall, how different systems extract different types of terms, and how some are exceptionally good at finding rare terms or are less affected by term length. The current contribution offers an overview of the shared task with a comparative evaluation, which complements the individual papers by all participants.

    Building a Sentiment Corpus of Tweets in Brazilian Portuguese

    The large amount of data available in social media, forums and websites motivates research in several areas of Natural Language Processing, such as sentiment analysis. The popularity of the area, due to its subjective and semantic characteristics, motivates research on novel methods and approaches for classification. Hence, there is a high demand for datasets in different domains and different languages. This paper introduces TweetSentBR, a sentiment corpus for Brazilian Portuguese with 15,000 manually annotated sentences in the TV show domain. The sentences were labeled in three classes (positive, neutral and negative) by seven annotators, following literature guidelines for ensuring the reliability of the annotation. We also ran baseline experiments on polarity classification using three machine learning methods, reaching an F-measure of 80.99% and an accuracy of 82.06% in binary classification, and an F-measure of 59.85% and an accuracy of 64.62% in three-point classification. Comment: Accepted for publication in the 11th International Conference on Language Resources and Evaluation (LREC 2018).

    Dictionary writing system (DWS) plus corpus query package (CQP): the case of TshwaneLex

    In this article the integrated corpus query functionality of the dictionary compilation software TshwaneLex is analysed. Attention is given to the handling of both raw corpus data and annotated corpus data. With regard to the latter it is shown how, with a minimum of human effort, machine learning techniques can be employed to obtain part-of-speech tagged corpora that can be used for lexicographic purposes. All points are illustrated with data drawn from English and Northern Sotho. The tools and techniques themselves, however, are language-independent, and as such the encouraging outcomes of this study are far-reaching.

    From treebank resources to LFG F-structures

    We present two methods for automatically annotating treebank resources with functional structures. Both methods define systematic patterns of correspondence between partial PS configurations and functional structures. These are applied to PS rules extracted from treebanks, or directly to constraint set encodings of treebank PS trees.