14 research outputs found

ALEXSIS-PT: A New Resource for Portuguese Lexical Simplification

Authors: North Kai; Ranasinghe Tharindu; Zampieri Marcos
Publication date: 19/09/2022

Lexical simplification (LS) is the task of automatically replacing complex words with easier ones, making texts more accessible to various target populations (e.g. individuals with low literacy, individuals with learning disabilities, second language learners). To train and test models, LS systems usually require corpora that feature complex words in context along with their candidate substitutions. To continue improving the performance of LS systems, we introduce ALEXSIS-PT, a novel multi-candidate dataset for Brazilian Portuguese LS containing 9,605 candidate substitutions for 387 complex words. ALEXSIS-PT has been compiled following the ALEXSIS protocol for Spanish, opening exciting new avenues for cross-lingual models. ALEXSIS-PT is the first LS multi-candidate dataset that contains Brazilian newspaper articles. We evaluated four models for substitute generation on this dataset, namely mDistilBERT, mBERT, XLM-R, and BERTimbau. BERTimbau achieved the highest performance across all evaluation metrics.

arXiv.org e-Print Archive

Aston Publications Explorer

Engage students in news writing

Authors: Carvalho Diogo Miguel; Júlio Luísa; Lima Inês Rito; Silva Telmo; Ulisses Alexandre
Publication venue: Elsevier
Publication date: 01/01/2023

The evolution of technology affects how information is produced and consumed by users. However, as the volume of content available on most online news platforms grows, misinformation and less credible content increase alongside it. In this context, the present research aims to develop a technological ecosystem to promote students’ writing ability. The system will help students search for credible content to create school newspapers. Accordingly, this article presents the architecture of a news writing tool for the Portuguese language. It introduces a constructive approach to the system architecture that will support the development of a news creation tool.

Repositório Institucional da Universidade de Aveiro

Estimating the Entropy of Linguistic Distributions

Authors: Arora Aryaman; Cotterell Ryan; Meister Clara
Publication date: 01/01/2022

Shannon entropy is often a quantity of interest to linguists studying the communicative capacity of human language. However, entropy must typically be estimated from observed data because researchers do not have access to the underlying probability distribution that gives rise to these data. While entropy estimation is a well-studied problem in other fields, there is not yet a comprehensive exploration of the efficacy of entropy estimators for use with linguistic data. In this work, we fill this void, studying the empirical effectiveness of different entropy estimators for linguistic distributions. In a replication of two recent information-theoretic linguistic studies, we find evidence that the reported effect size is over-estimated due to over-reliance on poor entropy estimators. Finally, we end our paper with concrete recommendations for entropy estimation depending on distribution type and data availability. Comment: 21 pages (5 pages main text). 4 figures. Accepted to ACL 202

arXiv.org e-Print Archive

Repository for Publications and Research Data

Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark

Authors: Blevins Terra; Gonen Hila; Imperial Joseph Marvin; Karlsson Börje F.; Lin Peiqin; Liu Shuheng; Ljubešić Nikola; Mayhew Stephen; Miranda LJ; Pinter Yuval; Plank Barbara; Riabi Arij; Šuppa Marek
Publication date: 15/11/2023

We introduce Universal NER (UNER), an open, community-driven project to develop gold-standard NER benchmarks in many languages. The overarching goal of UNER is to provide high-quality, cross-lingually consistent annotations to facilitate and standardize multilingual NER research. UNER v1 contains 18 datasets annotated with named entities in a cross-lingually consistent schema across 12 diverse languages. In this paper, we detail the dataset creation and composition of UNER; we also provide initial modeling baselines on both in-language and cross-lingual learning settings. We release the data, code, and fitted models to the public.

arXiv.org e-Print Archive

Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark

Authors: Blevins Terra; Gonen Hila; Imperial Joseph Marvin; Karlsson Börje F.; Lin Peiqin; Liu Shuheng; Ljubešić Nikola; Mayhew Stephen; Miranda LJ; Pinter Yuval; Plank Barbara; Riabi Arij; Šuppa Marek
Publication venue: Center for Open Science
Publication date: 15/11/2023

OPUS

From the world to word order: Deriving biases in noun phrase order from statistical properties of the world

Authors: Culbertson Jennifer; Kirby Simon; Schouwstra Marieke
Publication venue: Project Muse
Publication date: 01/09/2020

Edinburgh Research Explorer

Massive Choice, Ample Tasks (MaChAmp): A Toolkit for Multi-task Learning in NLP

Authors: Plank Barbara; Ramponi Alan; Sharaf Ibrahim; van der Goot Rob; Üstün Ahmet
Publication venue: Association for Computational Linguistics (ACL)
Publication date: 01/01/2021

Transfer learning, particularly approaches that combine multi-task learning with pre-trained contextualized embeddings and fine-tuning, has advanced the field of Natural Language Processing tremendously in recent years. In this paper we present MaChAmp, a toolkit for easy fine-tuning of contextualized embeddings in multi-task settings. The benefits of MaChAmp are its flexible configuration options and its support for a variety of natural language processing tasks in a uniform toolkit, from text classification and sequence labeling to dependency parsing, masked language modeling, and text generation. Comment: https://machamp-nlp.github.io

arXiv.org e-Print Archive

Archivio della ricerca - Fondazione Bruno Kessler

The IT University of Copenhagen's Repository

Semantic Role Labeling in Portuguese: Improving the State of the Art with Transfer Learning and BERT-based Models

Author: Ana Sofia Medeiros Oliveira
Publication date: 09/11/2020

Repositório Aberto da Universidade do Porto