
    Language classification from bilingual word embedding graphs

    We study the role of the second language in bilingual word embeddings in monolingual semantic evaluation tasks. We find both strongly and weakly positive correlations between downstream task performance and the similarity of the second language to the target language. Additionally, we show how bilingual word embeddings can be employed for the task of semantic language classification, and that joint semantic spaces vary in meaningful ways across second languages. Our results support the hypothesis that semantic language similarity is influenced by both structural similarity and geography/contact. (Comment: To be published at Coling 2016.)
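    The correlation analysis the abstract describes can be sketched roughly as follows: score each second language by the average cosine similarity of translation pairs in the joint bilingual space, then correlate those scores with downstream performance across second languages. This is a minimal illustration with made-up vectors and scores; the paper's actual embeddings, tasks and languages are not reproduced here.

        import numpy as np
        from scipy.stats import spearmanr

        def cosine(u, v):
            return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

        def language_similarity(emb, pairs):
            # Mean cosine similarity of translation pairs in one joint space.
            return float(np.mean([cosine(emb[a], emb[b]) for a, b in pairs]))

        # Toy joint spaces for three hypothetical second languages.
        rng = np.random.default_rng(0)
        spaces = []
        for _ in range(3):
            emb = {w: rng.normal(size=50) for w in ("dog", "Hund", "cat", "Katze")}
            spaces.append((emb, [("dog", "Hund"), ("cat", "Katze")]))

        task_scores = [0.71, 0.64, 0.58]  # hypothetical downstream scores per language
        lang_sims = [language_similarity(e, p) for e, p in spaces]
        rho, _ = spearmanr(lang_sims, task_scores)
        print(f"Spearman correlation: {rho:.3f}")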

    Measuring associational thinking through word embeddings

    The development of a model to quantify semantic similarity and relatedness between words has been the major focus of many studies in various fields, e.g. psychology, linguistics, and natural language processing. Unlike the measures proposed by most previous research, this article aims to automatically estimate the strength of word associations, whether or not the words are semantically related. We demonstrate that the performance of the model depends not only on the combination of independently constructed word embeddings (namely, corpus- and network-based embeddings) but also on the way these word vectors interact. The research concludes that the weighted average of the cosine-similarity coefficients derived from independent word embeddings in a double vector space tends to yield high correlations with human judgements. Moreover, we demonstrate that evaluating word associations through a measure that relies not only on the rank ordering of word pairs but also on the strength of associations can reveal findings that go unnoticed by traditional measures such as Spearman's and Pearson's correlation coefficients.
    Financial support for this research has been provided by the Spanish Ministry of Science, Innovation and Universities [grant number RTC 2017-6389-5], the Spanish "Agencia Estatal de Investigación" [grant number PID2020-112827GB-I00 / AEI / 10.13039/501100011033], and the European Union's Horizon 2020 research and innovation program [grant number 101017861: project SMARTLAGOON]. Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.
    Periñán-Pascual, C. (2022). Measuring associational thinking through word embeddings. Artificial Intelligence Review, 55(3), 2065-2102. https://doi.org/10.1007/s10462-021-10056-6
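    The core scoring scheme the abstract points to, a weighted average of cosine similarities computed in two independent spaces, can be sketched as below. The toy vectors and the weighting parameter alpha are placeholders, not the paper's embeddings or tuned values.

        import numpy as np

        def cos(u, v):
            return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

        def association_strength(w1, w2, corpus_emb, network_emb, alpha=0.5):
            # Weighted average of the two cosine coefficients; alpha is a
            # hypothetical weighting parameter.
            return (alpha * cos(corpus_emb[w1], corpus_emb[w2])
                    + (1 - alpha) * cos(network_emb[w1], network_emb[w2]))

        # Toy vectors standing in for corpus- and network-based embeddings.
        rng = np.random.default_rng(1)
        corpus_emb = {w: rng.normal(size=100) for w in ("coffee", "cup")}
        network_emb = {w: rng.normal(size=100) for w in ("coffee", "cup")}
        print(association_strength("coffee", "cup", corpus_emb, network_emb))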

    On the Effect of Semantically Enriched Context Models on Software Modularization

    Many of the existing approaches for program comprehension rely on the linguistic information found in source code, such as identifier names and comments. Semantic clustering is one such technique for modularization of a system that relies on the informal semantics of the program, encoded in the vocabulary used in the source code. Treating the source code as a collection of tokens loses the semantic information embedded within the identifiers. We try to overcome this problem by introducing context models for source code identifiers to obtain a semantic kernel, which can be used both for deriving the topics that run through the system and for clustering them. In the first model, we abstract an identifier to its type representation and build on this notion of context to construct a contextual vector representation of the source code. The second notion of context is defined based on the flow of data between identifiers: a module is represented as a dependency graph whose nodes correspond to identifiers and whose edges represent the data dependencies between pairs of identifiers. We have applied our approach to 10 medium-sized open source Java projects, and show that by introducing contexts for identifiers, the quality of the modularization of the software systems is improved. Both context models give results that are superior to the plain vector representation of documents. In some cases, the authoritativeness of decompositions is improved by 67%. Furthermore, a more detailed evaluation of our approach on jEdit, an open source editor, demonstrates that topics inferred through topic analysis on the contextual representations are more meaningful than those obtained from the plain representation of the documents. The proposed context model for source code identifiers paves the way for building tools that support developers in program comprehension tasks such as application and domain concept location, software modularization and topic analysis.
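    The second context model, a module as a dependency graph over identifiers, can be illustrated with a small sketch. The identifiers and edges below are invented, and the paper's clustering step over this structure is omitted.

        import networkx as nx

        g = nx.DiGraph()
        g.add_edges_from([
            ("rawInput", "parsedRecord"),    # parsedRecord derives from rawInput
            ("parsedRecord", "recordCache"),
            ("recordCache", "reportWriter"),
            ("parsedRecord", "reportWriter"),
        ])

        # One reading of an identifier's "context": the identifiers it
        # exchanges data with, i.e. its graph neighbourhood.
        for node in g.nodes:
            context = sorted(set(g.predecessors(node)) | set(g.successors(node)))
            print(node, "->", context)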

    Evaluation of taxonomic and neural embedding methods for calculating semantic similarity

    Modelling semantic similarity plays a fundamental role in lexical semantic applications. A natural way of calculating semantic similarity is to access handcrafted semantic networks, but similarity can also be predicted in a distributional vector space. Similarity calculation remains a challenging task, even with the latest breakthroughs in deep neural language models. We first examined popular methodologies for measuring taxonomic similarity, including edge-counting, which solely employs semantic relations in a taxonomy, as well as more complex methods that estimate concept specificity. We further extrapolated three weighting factors in modelling taxonomic similarity. To study the distinct mechanisms of taxonomic and distributional similarity measures, we ran head-to-head comparisons of each measure against human similarity judgements from the perspectives of word frequency, polysemy degree and similarity intensity. Our findings suggest that, without fine-tuning the uniform distance, taxonomic similarity measures can depend on the shortest path length as a prime factor in predicting semantic similarity; in contrast to distributional semantics, edge-counting is free from sense distribution bias and can measure word similarity both literally and metaphorically; and the synergy of retrofitting neural embeddings with concept relations in similarity prediction may indicate a new trend in leveraging knowledge bases for transfer learning. A large gap still exists in computing semantic similarity across different ranges of word frequency, polysemy degree and similarity intensity.
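    Edge-counting, in its simplest form, scores a word pair by the shortest path between their senses in the taxonomy. A minimal sketch using NLTK's WordNet interface follows; it assumes the WordNet corpus has been downloaded and is not necessarily the exact variant evaluated in the paper.

        from nltk.corpus import wordnet as wn

        def path_sim(word1, word2):
            # Edge-counting: 1 / (1 + shortest path length), maximised over senses.
            best = 0.0
            for s1 in wn.synsets(word1):
                for s2 in wn.synsets(word2):
                    sim = s1.path_similarity(s2)
                    if sim is not None and sim > best:
                        best = sim
            return best

        print(path_sim("car", "automobile"))  # shared synset, so 1.0
        print(path_sim("car", "banana"))      # longer taxonomy path, much smaller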

    A question-answering machine learning system for FAQs

    With the increase in usage of and dependence on the internet for gathering information, it is now essential to retrieve information efficiently according to users' needs. Question Answering (QA) systems aim to fulfil this need by providing the most relevant answer to a user's query expressed in natural language text or speech. Virtual assistants like Apple Siri and automated FAQ systems have become very popular, and with this the push to develop efficient, advanced and convenient QA systems is reaching new limits. Within the field of QA systems, this thesis addresses the problem of finding the FAQ question that is most similar to a user's query, for which measuring the semantic similarity between the question bank and natural language text is the foremost step. The work explores unsupervised approaches to measuring semantic similarity for a closed-domain QA system. To meet this objective, modern sentence representation techniques, such as BERT and FLAIR GloVe, are coupled with various similarity measures (cosine, Euclidean and Manhattan distance) to identify the best model. The developed models were tested on three FAQ datasets and the SemEval 2015 dataset for English; the best results were obtained by coupling BERT embeddings with the Euclidean distance measure, with a performance of 85.956% on a FAQ dataset. The model was also tested for Portuguese with the dataset of SNS24, the Portuguese health support phone line.
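    The best-performing setup the thesis reports, BERT sentence embeddings compared by Euclidean distance, can be sketched as below. The sentence-transformers model name and the FAQ entries are placeholders, not the thesis's exact configuration.

        import numpy as np
        from sentence_transformers import SentenceTransformer

        model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed stand-in for BERT
        faq = [
            "How do I reset my password?",
            "How can I contact support?",
            "What are the opening hours?",
        ]
        faq_vecs = model.encode(faq)

        def best_match(query):
            q = model.encode([query])[0]
            dists = np.linalg.norm(faq_vecs - q, axis=1)  # Euclidean distance
            return faq[int(np.argmin(dists))]

        print(best_match("I forgot my login password"))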

    NASARI: a novel approach to a Semantically-Aware Representation of items

    The semantic representation of individual word senses and concepts is of fundamental importance to several applications in Natural Language Processing. To date, concept modeling techniques have mainly based their representation either on lexicographic resources, such as WordNet, or on encyclopedic resources, such as Wikipedia. We propose a vector representation technique that combines the complementary knowledge of both these types of resource. Thanks to its use of explicit semantics combined with a novel cluster-based dimensionality reduction and an effective weighting scheme, our representation attains state-of-the-art performance on multiple datasets in two standard benchmarks: word similarity and sense clustering. We are releasing our vector representations at http://lcl.uniroma1.it/nasari/
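    For sparse, interpretable concept vectors of this kind, a rank-based comparison such as Weighted Overlap is commonly used; the sketch below shows that measure on toy dictionaries. The weights are invented, and the released NASARI vectors may be compared differently.

        def weighted_overlap(v1, v2):
            # v1, v2: dicts mapping dimension -> weight. Score rewards shared
            # dimensions that rank highly in both vectors; 1.0 for identical vectors.
            overlap = set(v1) & set(v2)
            if not overlap:
                return 0.0
            rank1 = {d: r for r, d in enumerate(sorted(v1, key=v1.get, reverse=True), 1)}
            rank2 = {d: r for r, d in enumerate(sorted(v2, key=v2.get, reverse=True), 1)}
            num = sum(1.0 / (rank1[d] + rank2[d]) for d in overlap)
            den = sum(1.0 / (2 * i) for i in range(1, len(overlap) + 1))
            return num / den

        bank_institution = {"money": 0.9, "loan": 0.7, "credit": 0.5}
        bank_river = {"water": 0.8, "shore": 0.6, "money": 0.1}
        print(weighted_overlap(bank_institution, bank_river))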