7 research outputs found
Unsupervised compositionality prediction of nominal compounds
Nominal compounds such as red wine and nut case display a continuum of compositionality, with varying contributions from the components of the compound to its semantics. This article proposes a framework for compound compositionality prediction using distributional semantic models, evaluating to what extent they capture idiomaticity compared to human judgments. For evaluation, we introduce data sets containing human judgments in three languages: English, French, and Portuguese. The results obtained reveal a high agreement between the models and human predictions, suggesting that they are able to incorporate information about idiomaticity. We also present an in-depth evaluation of various factors that can affect prediction, such as model and corpus parameters and compositionality operations. General crosslingual analyses reveal the impact of morphological variation and corpus size on the ability of the model to predict compositionality, and of a uniform combination of the components for best results.
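As a rough illustration of the kind of compositionality operation the abstract refers to, the sketch below scores a compound by the cosine similarity between its own vector and a weighted combination of its component vectors. This is a minimal sketch under stated assumptions, not the article's implementation: the function names, the `beta` weight, and the toy vectors are hypothetical, and `beta=0.5` stands in for the uniform combination mentioned above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def normalize(v):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(a * a for a in v))
    if norm == 0.0:
        return list(v)
    return [a / norm for a in v]


def compositionality(compound_vec, head_vec, modifier_vec, beta=0.5):
    """Hypothetical score: cosine between the compound vector and a weighted
    sum of the normalized component vectors. beta=0.5 weights both equally."""
    head = normalize(head_vec)
    modifier = normalize(modifier_vec)
    combined = [beta * h + (1.0 - beta) * m for h, m in zip(head, modifier)]
    return cosine(compound_vec, combined)


# Toy vectors (made up for illustration, not taken from any corpus).
compositional_score = compositionality([1.0, 1.0], [1.0, 0.0], [0.0, 1.0])
idiomatic_score = compositionality([0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

A compound whose vector lies close to the combined component vector gets a score near 1, while one whose vector is unrelated to its components gets a score near 0.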
PLATCOL, Multilingual Collocations Dictionary Platform: the case of Chinese
The aim of this contribution is to offer some observations on the processing of collocations extracted from Chinese, and to discuss the problems we have encountered when working with this language in the Multilingual Collocations Dictionary Platform (PLATCOL). PLATCOL will include collocations in English, Portuguese, Spanish, French, and Chinese (Orenha-Ottaiano et al. 2021) and is part of the project A phraseographical methodology and model for an Online Corpus-Based Multilingual Collocations Dictionary Platform (FAPESP Process 2020/01783-2). The platform follows a unified methodology for obtaining the data that will populate the entries. This methodology works reasonably well for the other languages, although it requires a supervised correction and validation phase. For Chinese it entails additional effort: for example, discrepancies in part-of-speech assignment can reduce the effectiveness of the method when extracting candidates.
Probing for idiomaticity in vector space models
Contextualised word representation models have been successfully used for capturing different word usages and they may be an attractive alternative for representing idiomaticity in language. In this paper, we propose probing measures to assess if some of the expected linguistic properties of noun compounds, especially those related to idiomatic meanings, and their dependence on context and sensitivity to lexical choice, are readily available in some standard and widely used representations. For that, we constructed the Noun Compound Senses Dataset, which contains noun compounds and their paraphrases, in context-neutral and context-informative naturalistic sentences, in two languages: English and Portuguese. Results obtained using four types of probing measures with models like ELMo, BERT and some of its variants indicate that idiomaticity is not yet accurately represented by contextualised models.
Revista da Associação Portuguesa de Linguística
No. 10 (2023) of the Revista da Associação Portuguesa de Linguística
Closed loop automation through intelligence artificial
Nowadays, given the complexity and size of networks, there is a need to automate the processes applied to them, especially fault correction. Telecommunications operators' networks record alarm events from their devices every day. Because this equipment comes from different vendors or belongs to different operators, the fault diagnostics use different nomenclatures to refer to the same cause of failure. In this work, a model was therefore developed that measures similarity relations between the terms that appear in fault diagnostics, so that alarms can be mapped to a single unified alarm model. Initially, a database of real fault diagnostics was processed to train word embedding models, such as Word2Vec and FastText, which convert words into numeric vectors. To evaluate the models, a data set was generated from a word captcha, which domain experts used to identify pairs of similar terms; the similarities measured from their answers were taken as the expected values. However, the word embedding models proved unable to find this type of relationship. A layer of machine learning models was therefore added, which received the vectors of the pairs defined in the data set and had to predict a similarity as close as possible to the expected one. A simple neural network with 128-dimensional vectors generated by Word2Vec with a CBOW architecture achieved the best results, with Pearson and Spearman correlation coefficients of 0.95 and 0.90, respectively. The CNN with vectors of the same dimension, but with a skip-gram architecture in Word2Vec, reached Pearson and Spearman correlations of only about 0.23. The features combined with the LSTM yielded correlations close to zero, except for the 384-dimensional vectors generated by Word2Vec with a CBOW architecture, which reached 0.62 for Pearson's correlation coefficient and 0.55 for Spearman's. Although the CNN and LSTM are much more complex networks, the data set is not large enough for them to learn a good function for measuring the similarity between words in the specific vocabulary of networks and software. Master's in Informatics Engineering (Mestrado em Engenharia Informática).
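For readers unfamiliar with the two evaluation metrics reported above, the following sketch shows how Pearson and Spearman correlation coefficients between predicted and expected similarities could be computed. It is an illustrative, dependency-free sketch rather than the thesis code: the function names and the example similarity values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical expected (expert) and predicted (model) similarities for term pairs.
expected = [0.9, 0.1, 0.5, 0.7]
predicted = [0.8, 0.2, 0.4, 0.9]
pearson_value = pearson(expected, predicted)
spearman_value = spearman(expected, predicted)
```

Pearson measures how linearly the predictions track the expected values, whereas Spearman only compares their rank order, which is why the two coefficients can differ for the same model.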
Identification of multiword expressions in the brWaC
Although corpus size is a well-known factor that affects the performance of many NLP tasks, for many languages large freely available corpora are still scarce. In this paper we describe one effort to build a very large corpus for Brazilian Portuguese, the brWaC, generated following the Web as Corpus Kool Yinitiative (WaCky). To indirectly assess the quality of the resulting corpus, we examined the impact of corpus origin on a specific task, the identification of Multiword Expressions with association measures, against a standard corpus. Focusing on nominal compounds, the expressions obtained from each corpus are of comparable quality and indicate that corpus origin has no impact on this task.
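The abstract does not say which association measures were used. As one common example, the sketch below computes pointwise mutual information (PMI) for adjacent word pairs from raw counts. It is an assumption-laden illustration rather than the paper's pipeline: the function name and the toy sentence are hypothetical, and a real system would add frequency thresholds and part-of-speech filtering.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Pointwise mutual information, in bits, for every adjacent word pair."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total_unigrams = len(tokens)
    total_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (first, second), count in bigrams.items():
        p_pair = count / total_bigrams
        p_first = unigrams[first] / total_unigrams
        p_second = unigrams[second] / total_unigrams
        # PMI compares the observed pair probability with the probability
        # expected if the two words occurred independently.
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return scores


# Toy token sequence (made up for illustration).
toy_tokens = "red wine is red wine not white wine".split()
toy_scores = pmi_scores(toy_tokens)
```

Candidate pairs would then be ranked by score, with higher values indicating words that co-occur more often than chance would predict.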