7 research outputs found

    Unsupervised compositionality prediction of nominal compounds

    Nominal compounds such as red wine and nut case display a continuum of compositionality, with varying contributions from the components of the compound to its semantics. This article proposes a framework for compound compositionality prediction using distributional semantic models, evaluating to what extent they capture idiomaticity compared to human judgments. For evaluation, we introduce data sets containing human judgments in three languages: English, French, and Portuguese. The results obtained reveal a high agreement between the models and human predictions, suggesting that they are able to incorporate information about idiomaticity. We also present an in-depth evaluation of various factors that can affect prediction, such as model and corpus parameters and compositionality operations. General crosslingual analyses reveal the impact of morphological variation and corpus size in the ability of the model to predict compositionality, and of a uniform combination of the components for best results
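A minimal sketch of the kind of prediction the abstract describes, assuming the compound and its components each have a distributional vector, and that compositionality is scored as the cosine between the compound vector and a uniform (equal-weight) sum of the normalized component vectors. The function names and the three-dimensional toy vectors are hypothetical illustrations, not the article's models or data.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(a * a for a in vec))
    return [a / norm for a in vec] if norm else list(vec)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def compositionality_score(compound_vec, first_vec, second_vec):
    """Cosine between the compound vector and the uniform sum of its parts.

    Assumption: both components get equal weight after normalization.
    """
    combined = [a + b for a, b in zip(normalize(first_vec), normalize(second_vec))]
    return cosine(compound_vec, combined)


# Hypothetical toy vectors, chosen only to illustrate the contrast.
toy_vectors = {
    "red": [1.0, 0.0, 0.0],
    "wine": [0.0, 1.0, 0.0],
    "red_wine": [0.6, 0.8, 0.0],   # close to its components
    "nut": [1.0, 0.0, 0.0],
    "case": [0.0, 1.0, 0.0],
    "nut_case": [0.1, 0.1, 1.0],   # far from its components
}

red_wine_score = compositionality_score(
    toy_vectors["red_wine"], toy_vectors["red"], toy_vectors["wine"]
)
nut_case_score = compositionality_score(
    toy_vectors["nut_case"], toy_vectors["nut"], toy_vectors["case"]
)
```

With these toy vectors the more compositional compound receives the higher score; in a real evaluation such scores would be compared against the human judgments.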

    PLATCOL, Multilingual Platform of Collocation Dictionaries: the case of Chinese

    The aim of this contribution is to offer some observations on the processing of collocations extracted from Chinese and to discuss the problems we have encountered while working with this language in the Multilingual Platform of Collocation Dictionaries (PLATCOL). PLATCOL will include collocations in English, Portuguese, Spanish, French, and Chinese (Orenha-Ottaiano et al. 2021) and is part of the project A phraseographical methodology and model for an Online Corpus-Based Multilingual Collocations Dictionary Platform (FAPESP Process 2020/01783-2). The platform follows a unified methodology for obtaining the data that will populate the entries. This methodology works reasonably well for the other languages, although it requires a supervised correction and validation phase, but it demands additional effort for Chinese, where, for example, discrepancies in part-of-speech assignment can reduce the effectiveness of the method when extracting candidates.

    Probing for idiomaticity in vector space models

    Contextualised word representation models have been successfully used for capturing different word usages and they may be an attractive alternative for representing idiomaticity in language. In this paper, we propose probing measures to assess if some of the expected linguistic properties of noun compounds, especially those related to idiomatic meanings, and their dependence on context and sensitivity to lexical choice, are readily available in some standard and widely used representations. For that, we constructed the Noun Compound Senses Dataset, which contains noun compounds and their paraphrases, in context neutral and context informative naturalistic sentences, in two languages: English and Portuguese. Results obtained using four types of probing measures with models like ELMo, BERT and some of its variants, indicate that idiomaticity is not yet accurately represented by contextualised models

    Revista da Associação Portuguesa de Linguística

    No. 10 (2023) of the Revista da Associação Portuguesa de Linguística

    Closed loop automation through artificial intelligence

    Currently, there is a need to automate processes applied to networks because of their high complexity and size. Telecommunications operators' networks record alarm events occurring in their devices every day. Because this equipment comes from different vendors or operators, it generates fault diagnostics that use distinct nomenclatures to refer to the same cause of failure. In this work, therefore, a model was developed that measures similarity relations between the terms appearing in fault diagnostics, so that alarms can be mapped to a single alarm model. Initially, a database of real fault diagnostics was processed in order to train word embedding models, such as Word2Vec and FastText, to convert words into numeric vectors. To evaluate the models, a dataset was generated from a word captcha, which was used by domain specialists to find pairs of similar terms. From their responses it was possible to measure the respective similarities, which were taken as the expected values. However, the word embedding models proved unable to find this type of relation. A layer of machine learning models was therefore added, which received the vectors of the pairs defined in the dataset and had to predict the similarity closest to the expected one. A simple neural network with 128-dimensional vectors generated by the Word2Vec model with a CBOW architecture obtained the best results, with Pearson and Spearman correlation coefficients of 0.95 and 0.90, respectively. The CNN with vectors of the same dimension, but with a skip-gram architecture in Word2Vec, obtained only 0.22 Pearson and 0.23 Spearman correlation. Combining the generated features with the LSTM yielded correlation values close to zero, except with the 384-dimensional vectors generated by Word2Vec with a CBOW architecture, which achieved a Pearson correlation coefficient of 0.62 and a Spearman coefficient of 0.55. Although the CNN and LSTM are much more complex networks, the dataset is not large enough for networks of this type to find a good function that measures the similarity between words in the specific vocabulary of networks and software.

    Nowadays, given the complexity and size of networks, there is a need for process automation, especially for malfunction correction. Every day there are many failures in the devices and, because the devices come from different vendors or belong to distinct telecommunications operators, alarm diagnostics use different vocabularies to refer to the same cause of failure. Thus, in this work, a model was developed that finds similarity relations between these terms so that the alarms can be mapped to a single alarm model. Initially, a database of real fault diagnostics was processed to train word embedding models, such as Word2Vec and FastText, to convert the words into numeric vectors. Evaluating the models requires a minimal amount of data, hence the creation of a captcha system to collect pairs of similar terms and measure the similarity between newly acquired terms. However, word embedding models are not capable of finding this type of relationship. Therefore, a layer of machine learning models was added, which received the vectors of the pairs defined in the database and had to predict the value closest to the expected similarity. The simple neural network achieved the best results, whereas for the CNN and LSTM, although they are much more complex networks, the database is not large enough to achieve good results. Specifically, a neural network with 128-dimensional vectors generated by the Word2Vec model with a CBOW architecture achieved the best results, with final Pearson and Spearman correlation coefficients of 0.95 and 0.90, respectively. The CNN with vectors of the same dimension, but with a skip-gram architecture in Word2Vec, reached only 0.23 Pearson and 0.23 Spearman correlation. The features combined with the LSTM achieved low values, except for the 384-dimensional vectors generated by Word2Vec with a CBOW architecture, with a Pearson correlation coefficient of 0.62 and a Spearman coefficient of 0.55.

    Master's in Informatics Engineering
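The evaluation above compares predicted similarities with expected ones using Pearson and Spearman correlation. The following is a minimal standard-library sketch of those two coefficients; the function names and the sample similarity lists are hypothetical and are not taken from the thesis.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical expected (expert-derived) and predicted similarities.
expected = [0.1, 0.4, 0.5, 0.9]
predicted = [0.2, 0.3, 0.6, 0.8]

pearson_value = pearson(expected, predicted)
spearman_value = spearman(expected, predicted)
```

Pearson measures linear agreement between the raw values, while Spearman only checks whether the two lists order the term pairs in the same way, which is why both are usually reported together.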

    Identification of multiword expressions in the brWaC

    Although corpus size is a well-known factor that affects the performance of many NLP tasks, large freely available corpora are still scarce for many languages. In this paper we describe one effort to build a very large corpus for Brazilian Portuguese, the brWaC, generated following the Web as Corpus Kool Yinitiative. To indirectly assess the quality of the resulting corpus, we examined the impact of corpus origin on a specific task, the identification of Multiword Expressions with association measures, against a standard corpus. Focusing on nominal compounds, the expressions obtained from each corpus are of comparable quality and indicate that corpus origin has no impact on this task.
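As an illustration of what identifying multiword expressions with an association measure can look like, here is a minimal sketch that ranks adjacent word pairs by pointwise mutual information (PMI). PMI is chosen here only as a common example of an association measure; the abstract does not say which measures were used, and the function name, the `min_count` threshold, and the toy token list are all hypothetical.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=2):
    """Return PMI (log base 2) for each adjacent word pair seen at least min_count times."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    total_unigrams = len(tokens)
    total_bigrams = max(len(tokens) - 1, 1)

    scores = {}
    for (first, second), count in bigram_counts.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / total_bigrams
        p_first = unigram_counts[first] / total_unigrams
        p_second = unigram_counts[second] / total_unigrams
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return scores


# Hypothetical toy corpus; a real run would use a tokenized corpus such as the brWaC.
toy_tokens = (
    "the hot dog was good the hot dog was cold "
    "the soup was good the soup was cold"
).split()

scores = pmi_scores(toy_tokens)
ranked_candidates = sorted(scores, key=scores.get, reverse=True)
```

Pairs whose words occur together more often than their individual frequencies predict rise to the top of `ranked_candidates`; comparing such ranked lists across corpora is one way to test whether corpus origin matters.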