ALEXSIS-PT: A New Resource for Portuguese Lexical Simplification
Lexical simplification (LS) is the task of automatically replacing complex
words with simpler ones, making texts more accessible to various target
populations (e.g. individuals with low literacy, individuals with learning
disabilities, second language learners). To train and test models, LS systems
usually require corpora that feature complex words in context along with their
candidate substitutions. To continue improving the performance of LS systems, we
introduce ALEXSIS-PT, a novel multi-candidate dataset for Brazilian Portuguese
LS containing 9,605 candidate substitutions for 387 complex words. ALEXSIS-PT
has been compiled following the ALEXSIS protocol for Spanish, opening exciting
new avenues for cross-lingual models. ALEXSIS-PT is the first LS
multi-candidate dataset that contains Brazilian newspaper articles. We
evaluated four models for substitute generation on this dataset, namely
mDistilBERT, mBERT, XLM-R, and BERTimbau. BERTimbau achieved the highest
performance across all evaluation metrics.
Engage students in news writing
The evolution of technology affects how information is produced and consumed by users. However, as the volume of content available on online news platforms grows, misinformation and less credible content grow with it. In this context, the present research aims to develop a technological ecosystem to promote students’ writing ability. The system will help students search for credible content to create school newspapers. This article presents the architecture of a news writing tool for the Portuguese language.
This paper aims to introduce a constructive approach that presents the system architecture that will support the development of a news creation tool.
Estimating the Entropy of Linguistic Distributions
Shannon entropy is often a quantity of interest to linguists studying the
communicative capacity of human language. However, entropy must typically be
estimated from observed data because researchers do not have access to the
underlying probability distribution that gives rise to these data. While
entropy estimation is a well-studied problem in other fields, there is not yet
a comprehensive exploration of the efficacy of entropy estimators for use with
linguistic data. In this work, we fill this void, studying the empirical
effectiveness of different entropy estimators for linguistic distributions. In
a replication of two recent information-theoretic linguistic studies, we find
evidence that the reported effect size is over-estimated due to over-reliance
on poor entropy estimators. Finally, we end our paper with concrete
recommendations for entropy estimation depending on distribution type and data
availability. Comment: 21 pages (5 pages main text). 4 figures. Accepted to ACL 202
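To illustrate why the choice of estimator matters, here is a minimal sketch comparing two standard entropy estimators on a small sample. The abstract does not name the estimators it studies, so the plug-in (maximum-likelihood) estimator and the Miller-Madow correction are assumptions chosen purely for illustration, as are the function names and the toy token sample.

```python
import math
from collections import Counter


def plugin_entropy(samples):
    """Plug-in (maximum-likelihood) entropy estimate, in nats."""
    counts = Counter(samples)
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def miller_madow_entropy(samples):
    """Plug-in estimate plus the Miller-Madow bias correction (K - 1) / (2N),
    where K is the number of observed types and N the sample size."""
    counts = Counter(samples)
    n = sum(counts.values())
    k = len(counts)
    return plugin_entropy(samples) + (k - 1) / (2 * n)


# Hypothetical toy sample of word tokens (not data from the paper).
tokens = "a b a c a b d a".split()
h_mle = plugin_entropy(tokens)
h_mm = miller_madow_entropy(tokens)
print(f"plug-in: {h_mle:.4f} nats, Miller-Madow: {h_mm:.4f} nats")
```

The plug-in estimator tends to underestimate entropy on small samples, so the corrected value is never smaller; with only a handful of tokens the gap between the two can be substantial.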
Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark
We introduce Universal NER (UNER), an open, community-driven project to
develop gold-standard NER benchmarks in many languages. The overarching goal of
UNER is to provide high-quality, cross-lingually consistent annotations to
facilitate and standardize multilingual NER research. UNER v1 contains 18
datasets annotated with named entities in a cross-lingually consistent schema
across 12 diverse languages. In this paper, we detail the dataset creation and
composition of UNER; we also provide initial modeling baselines on both
in-language and cross-lingual learning settings. We release the data, code, and
fitted models to the public.
Massive Choice, Ample Tasks (MaChAmp): A Toolkit for Multi-task Learning in NLP
Transfer learning, particularly approaches that combine multi-task learning
with pre-trained contextualized embeddings and fine-tuning, has advanced the
field of Natural Language Processing tremendously in recent years. In this
paper we present MaChAmp, a toolkit for easy fine-tuning of contextualized
embeddings in multi-task settings. The benefits of MaChAmp are its flexible
configuration options and its support for a variety of natural language
processing tasks in a uniform toolkit, from text classification and sequence
labeling to dependency parsing, masked language modeling, and text generation. Comment: https://machamp-nlp.github.io