TermEval 2020: shared task on automatic term extraction using the Annotated Corpora for Term Extraction Research (ACTER) dataset
The TermEval 2020 shared task provided a platform for researchers to work on automatic term extraction (ATE) with the same dataset: the Annotated Corpora for Term Extraction Research (ACTER). The dataset covers three languages (English, French, and Dutch) and four domains, of which the domain of heart failure was kept as a held-out test set on which final F1-scores were calculated. The aim was to provide a large, transparent, qualitatively annotated, and diverse dataset to the ATE research community, with the goal of promoting comparative research and thus identifying strengths and weaknesses of various state-of-the-art methodologies. The results show a lot of variation between different systems and illustrate how some methodologies reach higher precision or recall, how different systems extract different types of terms, and how some are exceptionally good at finding rare terms or are less impacted by term length. The current contribution offers an overview of the shared task with a comparative evaluation, which complements the individual papers by all participants.
Building a Sentiment Corpus of Tweets in Brazilian Portuguese
The large amount of data available in social media, forums and websites
motivates research in several areas of Natural Language Processing, such as
sentiment analysis. The popularity of the area due to its subjective and
semantic characteristics motivates research on novel methods and approaches for
classification. Hence, there is a high demand for datasets on different domains
and different languages. This paper introduces TweetSentBR, a sentiment corpus
for Brazilian Portuguese comprising 15,000 manually annotated sentences in the TV show
domain. The sentences were labeled in three classes (positive, neutral and
negative) by seven annotators, following literature guidelines for ensuring
reliability of the annotation. We also ran baseline experiments on polarity
classification using three machine learning methods, reaching 80.99%
F-Measure and 82.06% accuracy in binary classification, and 59.85% F-Measure
and 64.62% accuracy in three-point classification.
Comment: Accepted for publication in the 11th International Conference on Language
Resources and Evaluation (LREC 2018).
Dictionary writing system (DWS) plus corpus query package (CQP): the case of TshwaneLex
In this article the integrated corpus query functionality of the dictionary compilation software TshwaneLex is analysed. Attention is given to the handling of both raw corpus data and annotated corpus data. With regard to the latter it is shown how, with a minimum of human effort, machine learning techniques can be employed to obtain part-of-speech tagged corpora that can be used for lexicographic purposes. All points are illustrated with data drawn from English and Northern Sotho. The tools and techniques themselves, however, are language-independent, and as such the encouraging outcomes of this study are far-reaching.
From treebank resources to LFG F-structures
We present two methods for automatically annotating treebank resources with functional structures. Both methods define systematic patterns of correspondence between partial PS configurations and functional structures. These are applied to PS rules extracted from treebanks, or directly to constraint set encodings of treebank PS trees.