39,155 research outputs found
A Lightweight Regression Method to Infer Psycholinguistic Properties for Brazilian Portuguese
Psycholinguistic properties of words have been used in various approaches to
Natural Language Processing tasks, such as text simplification and readability
assessment. Most of these properties are subjective, involving costly and
time-consuming surveys to be gathered. Recent approaches use the limited
datasets of psycholinguistic properties to extend them automatically to large
lexicons. However, some of the resources used by such approaches are not
available to most languages. This study presents a method to infer
psycholinguistic properties for Brazilian Portuguese (BP) using regressors
built with a light set of features usually available for less resourced
languages: word length, frequency lists, lexical databases composed of school
dictionaries and word embedding models. The correlations between the properties
inferred are close to those obtained by related works. The resulting resource
contains 26,874 words in BP annotated with concreteness, age of acquisition,
imageability and subjective frequency.Comment: Paper accepted for TSD201
Authorship attribution in portuguese using character N-grams
For the Authorship Attribution (AA) task, character n-grams are considered among the best predictive features. In the English language, it has also been shown that some types of character n-grams perform better than others. This paper tackles the AA task in Portuguese by examining the performance of different types of character n-grams, and various combinations of them. The paper also experiments with different feature representations and machine-learning algorithms. Moreover, the paper demonstrates that the performance of the character n-gram approach can be improved by fine-tuning the feature set and by appropriately selecting the length and type of character n-grams. This relatively simple and language-independent approach to the AA task outperforms both a bag-of-words baseline and other approaches, using the same corpus.Mexican Government (Conacyt) [240844, 20161958]; Mexican Government (SIP-IPN) [20171813, 20171344, 20172008]; Mexican Government (SNI); Mexican Government (COFAA-IPN)
A Multilingual Study of Compressive Cross-Language Text Summarization
Cross-Language Text Summarization (CLTS) generates summaries in a language
different from the language of the source documents. Recent methods use
information from both languages to generate summaries with the most informative
sentences. However, these methods have performance that can vary according to
languages, which can reduce the quality of summaries. In this paper, we propose
a compressive framework to generate cross-language summaries. In order to
analyze performance and especially stability, we tested our system and
extractive baselines on a dataset available in four languages (English, French,
Portuguese, and Spanish) to generate English and French summaries. An automatic
evaluation showed that our method outperformed extractive state-of-art CLTS
methods with better and more stable ROUGE scores for all languages
Cross-lingual RST Discourse Parsing
Discourse parsing is an integral part of understanding information flow and
argumentative structure in documents. Most previous research has focused on
inducing and evaluating models from the English RST Discourse Treebank.
However, discourse treebanks for other languages exist, including Spanish,
German, Basque, Dutch and Brazilian Portuguese. The treebanks share the same
underlying linguistic theory, but differ slightly in the way documents are
annotated. In this paper, we present (a) a new discourse parser which is
simpler, yet competitive (significantly better on 2/3 metrics) to state of the
art for English, (b) a harmonization of discourse treebanks across languages,
enabling us to present (c) what to the best of our knowledge are the first
experiments on cross-lingual discourse parsing.Comment: To be published in EACL 2017, 13 page
- …