1,064 research outputs found
Authorship attribution in Portuguese using character n-grams
For the Authorship Attribution (AA) task, character n-grams are considered among the best predictive features. In the English language, it has also been shown that some types of character n-grams perform better than others. This paper tackles the AA task in Portuguese by examining the performance of different types of character n-grams, and various combinations of them. The paper also experiments with different feature representations and machine-learning algorithms. Moreover, the paper demonstrates that the performance of the character n-gram approach can be improved by fine-tuning the feature set and by appropriately selecting the length and type of character n-grams. This relatively simple and language-independent approach to the AA task outperforms both a bag-of-words baseline and other approaches evaluated on the same corpus.
Mexican Government (Conacyt) [240844, 20161958]; Mexican Government (SIP-IPN) [20171813, 20171344, 20172008]; Mexican Government (SNI); Mexican Government (COFAA-IPN)
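As a rough illustration of the feature type discussed in this abstract, the sketch below counts overlapping character n-grams in a string. It is a minimal example only, not the paper's feature-extraction pipeline: the function name `char_ngrams` and the sample sentence are assumptions introduced here, and the n-gram types, representations and classifiers studied in the paper are not reproduced.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count every overlapping character n-gram of length n in text.

    Hypothetical helper for illustration; spaces and punctuation are
    kept as ordinary characters.
    """
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


# Hypothetical sample sentence, used only to show the call.
sample = "o rato roeu a roupa"
trigrams = char_ngrams(sample, 3)
print(trigrams.most_common(5))
```

Counts of this kind are typically turned into a per-document feature vector before being passed to a classifier; the choice of n is one of the parameters the abstract says needs tuning.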
A filter for frequent exotic words in Portuguese
The graphical forms (tokens) that make up the words of a text are often
ambiguous, and a single form can frequently correspond to different inflections
of two or more distinct lexical entries. Some of these forms correspond
to ‘exotic’ words, that is, words that are infrequent or have even fallen out of use.
The aim of this study is to determine, from the CETEMPúblico corpus, the
most frequent ambiguous forms of exotic words in Portuguese, with a view to
building a filter that, during the lexical analysis stage, eliminates the ‘noise’ caused
by these exotic forms and thereby reduces the formal ambiguity of
texts, simplifying the later stages of their automatic processing.
Proverbs in textbooks for teaching Portuguese as a non-native language
Proverbs display a wide variety of structures and can serve diverse communicative purposes. Owing to their cultural and linguistic richness, they also lend themselves to multiple teaching objectives, notably in the teaching of Portuguese as a Non-Native Language (Português Língua Não Materna, PLNM). In this work, we investigate how proverbs are actually used in PLNM textbooks, using tools and resources from
natural language processing (NLP). The results are compared with observations already made on a corpus of Portuguese textbooks for native speakers.
Mapping, filtering and measuring impact of ambiguous words in Portuguese
This paper deals with ambiguous simple words of Portuguese. The Portuguese dictionary of simple inflected words (DELAF) contains 936,215 entries, corresponding to 889,986 different inflected forms. From it, one can obtain the full list of ambiguous inflected forms (43,126), that is, word forms belonging to different categories and/or lemmas: capital,A/N/N (capital). We may consider A/N/N an ambiguity class. There are 137 ambiguity classes. Each ambiguity class has a certain level of ambiguity (Amb), which corresponds to the number of lexical entries associated with each ambiguous form (again, for class A/N/N, Amb=3). Based on this information, it is possible to map how ambiguity affects the lexicon. Using the frequency information associated with the list of tokens of a large corpus (the CETEMPÚBLICO corpus, with 200 million words), it is possible to calculate how ambiguity affects real texts. Combining the two types of information, it is possible to devise and evaluate different strategies to reduce lexical ambiguity.
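The notions of ambiguity class and ambiguity level (Amb) described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions: the toy entries, the lemma identifiers and the function name `ambiguity_classes` are hypothetical, and the tuple layout is not the actual DELAF entry format.

```python
from collections import defaultdict

# Hypothetical toy entries (form, lemma id, category); NOT the real DELAF format.
entries = [
    ("capital", "capital#1", "A"),
    ("capital", "capital#2", "N"),
    ("capital", "capital#3", "N"),
    ("casa", "casa", "N"),
    ("casa", "casar", "V"),
    ("mesa", "mesa", "N"),
]


def ambiguity_classes(entries):
    """Map each ambiguous form to (ambiguity class label, Amb).

    Amb is the number of lexical entries sharing the form; forms with a
    single entry are unambiguous and are left out.
    """
    by_form = defaultdict(list)
    for form, _lemma, category in entries:
        by_form[form].append(category)
    return {
        form: ("/".join(sorted(cats)), len(cats))
        for form, cats in by_form.items()
        if len(cats) > 1
    }


classes = ambiguity_classes(entries)
print(classes["capital"])
```

Weighting each ambiguous form by its corpus frequency, as the abstract describes for CETEMPÚBLICO, would be a further step on top of this mapping and is not shown here.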
Portuguese proverbs: types and variants
Drawing on the methodology and previous results of Rassi et al. (2014) on the automatic
identification of Brazilian Portuguese proverbs, this paper reports on an extension of that
experiment, but now focused on the identification of the European Portuguese proverbs
and their variants. Based on a large collection of over 56 thousand Portuguese proverbs
and their variants, a database of proverb types was specifically built for natural language
processing, along with the finite-state tools that allow for the identification of these strings
in texts. Our aim is to make these linguistic resources and language processing tools
publicly available, which will undoubtedly be deemed useful assets to other paremiological
studies.
Estimating lexical availability of European Portuguese proverbs
This paper relates data on lexical availability with data on textual frequency
of proverbs in European Portuguese. Each data source should provide
different perspectives on the use of proverbs in the language. This should allow
an empirically well-motivated selection of proverbs aiming at the development
of NLP resources, specifically for applications for learning Portuguese as a Foreign
Language and for the diagnosis/therapy of speech impairments/disabilities.
A large database (over 114,000 proverbs and their variants) was independently
classified by two annotators, according to intuitively estimated lexical availability.
Next, a random, stratified sample was selected and lexical availability was
then confirmed with an online survey. Frequency data was gathered from two
web browsers and a large, publicly available corpus of journalistic texts.
Results from the survey, the web and the corpus by and large confirm the initial
intuitive classification and a core of commonly used proverbs was defined.
Let's play with proverbs? NLP tools and resources for iCALL applications around proverbs for PFL
Proverbs are an important form of cultural expression of a society and are related to various areas of
knowledge and human experience (González Rey, 2002). While linguistic elements in widespread
use, proverbs are very rich structures both from a cultural and from a linguistic point of view and can
therefore contribute significantly to the teaching of languages, both native and foreign (Council of
Europe, 2001). However, though there are extensive collections of Portuguese proverbs with tens of
thousands of forms and their variants (Reis, in preparation), their automatic identification in texts is quite
difficult, given their formal variation, both lexical and syntactic (Chacoto, 1994). Nevertheless, using
real examples, where proverbs are used in a natural or spontaneous discourse context, is a more natural
way to learn and teach the complex conditions and communicative situations that determine the
use and meaning of these expressions. On the other hand, frequency indices associated with proverbs
and their variants would allow one to select the most common expressions. These are precisely the
most interesting forms from the point of view of their teaching/learning and could serve as a basis for
the construction of educational games, particularly for learning Portuguese autonomously as a foreign
language (PFL) assisted by computer. To make this possible, it is necessary, first of all, to be able
to recognize the occurrence of proverbs in the texts (Rassi et al. 2014), including the instances where
these expressions are presented in a truncated or creatively modified form, for example, to better suit
the communicative situation or to produce new and more expressive meanings. In this paper, we present
an on-going project, which aims at the automatic identification of proverbs in texts. In this interdisciplinary
study, we combine natural language processing tools with questionnaire construction
techniques for teaching purposes (Hoshino and Nakagawa 2005, Correia et al. 2010). This is illustrated
here with different sets of formats that can be built based on the knowledge of the form and
variation of proverbs, as well as their frequency in corpora.
Vocatives in Portuguese: Identification and Processing
This paper describes the most salient linguistic aspects of vocative constructions in Portuguese, with special reference to its European variety. Next, the paper presents the strategy followed for implementing this linguistic knowledge in a computational grammar of Portuguese, developed for the natural language processing chain STRING and using the XIP rule-based parser. Very precise and detailed linguistic descriptions can be implemented in this way.
ENERGY CONSUMPTION AND HEATING COSTS IN GREENHOUSE FLOWER AND VEGETABLE PRODUCTION
The aim is to determine year-round energy consumption and heating costs in the
production of flowers and vegetables in heated plastic
greenhouses located in several production areas for forced crops. In the
first year, Portugal and rose production were considered. Energy
consumption and heating costs with diesel or natural gas were calculated
for two combinations of minimum night/day air temperatures in modern
plastic greenhouses. The study is being extended to tomato production,
covering Portugal and Spain.