Language classification from bilingual word embedding graphs
We study the role of the second language in bilingual word embeddings in
monolingual semantic evaluation tasks. We find strongly and weakly positive
correlations between down-stream task performance and second language
similarity to the target language. Additionally, we show how bilingual word
embeddings can be employed for the task of semantic language classification and
that joint semantic spaces vary in meaningful ways across second languages. Our
results support the hypothesis that semantic language similarity is influenced
by both structural similarity and geography/contact. Comment: To be published at Coling 201
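The correlations the abstract reports between second-language similarity and downstream performance are rank correlations; a minimal pure-Python sketch of Spearman's rho, with made-up illustrative numbers rather than the paper's data:

```python
# Illustrative sketch (not the paper's code): Spearman rank correlation
# between a hypothetical language-similarity score and downstream task
# performance, the kind of correlation the abstract describes.

def rank(values):
    """Assign 1-based ranks; ties receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of equal values (a tie group)
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical numbers: similarity of each second language to the target,
# and a monolingual evaluation score of the resulting bilingual embedding.
lang_similarity = [0.9, 0.7, 0.5, 0.3, 0.1]
task_score = [0.82, 0.78, 0.74, 0.70, 0.66]
print(round(spearman(lang_similarity, task_score), 3))  # perfectly monotone -> 1.0
```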
Measuring associational thinking through word embeddings
The development of a model to quantify semantic similarity and relatedness between words has been the major focus of many studies in various fields, e.g. psychology, linguistics, and natural language processing. Unlike the measures proposed by most previous research, this article aims to estimate automatically the strength of association between words that may or may not be semantically related. We demonstrate that the performance of the model depends not only on the combination of independently constructed word embeddings (namely, corpus- and network-based embeddings) but also on the way these word vectors interact. The research concludes that the weighted average of the cosine-similarity coefficients derived from independent word embeddings in a double vector space tends to yield high correlations with human judgements. Moreover, we demonstrate that evaluating word associations through a measure that relies not only on the rank ordering of word pairs but also on the strength of associations can reveal findings that go unnoticed by traditional measures such as Spearman's and Pearson's correlation coefficients.
Financial support for this research has been provided by the Spanish Ministry of Science, Innovation and Universities [grant number RTC 2017-6389-5], the Spanish "Agencia Estatal de Investigación" [grant number PID2020-112827GB-I00 / AEI / 10.13039/501100011033], and the European Union's Horizon 2020 research and innovation program [grant number 101017861: project SMARTLAGOON].
Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.
Periñán-Pascual, C. (2022). Measuring associational thinking through word embeddings. Artificial Intelligence Review. 55(3):2065-2102. https://doi.org/10.1007/s10462-021-10056-6
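The measure the abstract describes, a weighted average of cosine similarities taken in two independently built embedding spaces, can be sketched as follows; the toy vectors and the alpha weight are assumptions for illustration, not the paper's implementation:

```python
# Hedged sketch (not the paper's code) of combining cosine similarities
# from two independent vector spaces (e.g. corpus- and network-based).

import math

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def association_strength(w1, w2, corpus_vecs, network_vecs, alpha=0.5):
    """Weighted average of the two cosines; alpha weights the corpus-based
    space and (1 - alpha) the network-based one (alpha is an assumption)."""
    sim_corpus = cosine(corpus_vecs[w1], corpus_vecs[w2])
    sim_network = cosine(network_vecs[w1], network_vecs[w2])
    return alpha * sim_corpus + (1 - alpha) * sim_network

# Toy 3-d vectors; real embeddings would come from e.g. word2vec/GloVe and
# an embedding of a semantic network such as WordNet.
corpus = {"cat": [1.0, 0.2, 0.0], "dog": [0.9, 0.3, 0.1]}
network = {"cat": [0.0, 1.0, 0.5], "dog": [0.1, 0.9, 0.6]}
print(association_strength("cat", "dog", corpus, network, alpha=0.6))
```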
On the Effect of Semantically Enriched Context Models on Software Modularization
Many of the existing approaches for program comprehension rely on the
linguistic information found in source code, such as identifier names and
comments. Semantic clustering is one such technique for modularization of the
system that relies on the informal semantics of the program, encoded in the
vocabulary used in the source code. Treating the source code as a collection of
tokens loses the semantic information embedded within the identifiers. We try
to overcome this problem by introducing context models for source code
identifiers to obtain a semantic kernel, which can be used for both deriving
the topics that run through the system as well as their clustering. In the
first model, we abstract an identifier to its type representation and build on
this notion of context to construct contextual vector representation of the
source code. The second notion of context is defined based on the flow of data
between identifiers to represent a module as a dependency graph where the nodes
correspond to identifiers and the edges represent the data dependencies between
pairs of identifiers. We have applied our approach to 10 medium-sized open
source Java projects, and show that by introducing contexts for identifiers,
the quality of the modularization of the software systems is improved. Both of
the context models give results that are superior to the plain vector
representation of documents. In some cases, the authoritativeness of
decompositions is improved by 67%. Furthermore, a more detailed evaluation of
our approach on JEdit, an open source editor, demonstrates that inferred topics
through performing topic analysis on the contextual representations are more
meaningful compared to the plain representation of the documents. The proposed
approach in introducing a context model for source code identifiers paves the
way for building tools that support developers in program comprehension tasks
such as application and domain concept location, software modularization and
topic analysis.
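The second context model lends itself to a small sketch: a module represented as a graph whose nodes are identifiers and whose edges are data dependencies. Extracting the dependencies from real Java source is assumed away here; the class and the example identifiers are hypothetical and only illustrate the structure:

```python
# Hedged sketch of the abstract's second context model: identifiers as
# nodes, data dependencies as directed edges. Dependency extraction from
# actual source code is out of scope for this illustration.

from collections import defaultdict

class DependencyGraph:
    def __init__(self):
        # maps an identifier to the set of identifiers its data flows into
        self.edges = defaultdict(set)

    def add_dependency(self, src, dst):
        """Record that data flows from identifier `src` to identifier `dst`."""
        self.edges[src].add(dst)

    def neighbours(self, ident):
        """Identifiers directly reachable from `ident`, sorted for stability."""
        return sorted(self.edges[ident])

# Hypothetical identifiers from an assignment `total = price * quantity`:
g = DependencyGraph()
g.add_dependency("price", "total")
g.add_dependency("quantity", "total")
print(g.neighbours("price"))  # ['total']
```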
Evaluation of taxonomic and neural embedding methods for calculating semantic similarity
Modelling semantic similarity plays a fundamental role in lexical semantic
applications. A natural way of calculating semantic similarity is to consult
handcrafted semantic networks, but similarity can also be predicted in a
distributional vector space. Similarity calculation continues
to be a challenging task, even with the latest breakthroughs in deep neural
language models. We first examined popular methodologies in measuring taxonomic
similarity, including edge-counting that solely employs semantic relations in a
taxonomy, as well as the complex methods that estimate concept specificity. We
further extrapolated three weighting factors in modelling taxonomic similarity.
To study the distinct mechanisms between taxonomic and distributional
similarity measures, we ran head-to-head comparisons of each measure with human
similarity judgements from the perspectives of word frequency, polysemy degree
and similarity intensity. Our findings suggest that without fine-tuning the
uniform distance, taxonomic similarity measures can depend on the shortest path
length as a prime factor to predict semantic similarity; in contrast to
distributional semantics, edge-counting is free from sense distribution bias in
use and can measure word similarity both literally and metaphorically; the
synergy of retrofitting neural embeddings with concept relations in similarity
prediction may indicate a new trend to leverage knowledge bases on transfer
learning. It appears that a large gap still exists in computing semantic
similarity across different ranges of word frequency, polysemy degree and
similarity intensity.
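Edge-counting, which the abstract identifies as depending on shortest path length as the prime factor, can be illustrated over a toy taxonomy; the 1/(1+d) score below is a generic path-based sketch, not the paper's exact measure:

```python
# Illustrative edge-counting similarity (not the paper's code): the score
# falls off with the shortest path length between two concepts in an
# is-a taxonomy, treated here as an undirected graph.

from collections import deque

def shortest_path_len(graph, a, b):
    """Breadth-first search; returns the edge count from a to b, or None."""
    if a == b:
        return 0
    seen = {a}
    queue = deque([(a, 0)])
    while queue:
        node, d = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == b:
                return d + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None  # not connected

def path_similarity(graph, a, b):
    """Simple path-based score: 1 / (1 + shortest path length)."""
    d = shortest_path_len(graph, a, b)
    return None if d is None else 1.0 / (1.0 + d)

# Toy hierarchy: animal -> {dog, cat}, stored with back-edges.
taxonomy = {
    "animal": ["dog", "cat"],
    "dog": ["animal"],
    "cat": ["animal"],
}
print(path_similarity(taxonomy, "dog", "cat"))  # path length 2 -> 1/3
```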
A question-answering machine learning system for FAQs
With the increase in usage and dependence on the internet for gathering
information, it’s now essential to efficiently retrieve information according
to users’ needs. Question Answering (QA) systems aim to fulfill this need
by trying to provide the most relevant answer for a user’s query expressed
in natural language text or speech. Virtual assistants like Apple Siri and
automated FAQ systems have become very popular, and with them the drive to
develop efficient, advanced and expedient QA systems has reached new
heights.
In the field of QA systems, this thesis addresses the problem of finding the
FAQ question that is most similar to a user’s query. Finding semantic similarities
between database question banks and natural language text is its
foremost step. The work aims at exploring unsupervised approaches for
measuring semantic similarities for developing a closed domain QA system.
To meet this objective modern sentence representation techniques, such as
BERT and FLAIR GloVe, are coupled with various similarity measures (cosine,
Euclidean and Manhattan) to identify the best model. The developed
models were tested with three FAQ datasets and the SemEval 2015 dataset for
English; the best results were obtained by coupling BERT embeddings with
the Euclidean distance measure, achieving a performance of 85.956% on an
FAQ dataset. The model was also tested for Portuguese with the SNS24
dataset from the Portuguese health-support phone line.
Sumário (translated from Portuguese): A question-answering machine
learning system for FAQs
With the increase in the use of and dependence on the internet for
gathering information, it has become essential to retrieve information
efficiently according to users' needs. Question-Answering (QA) systems aim
to meet this need by trying to provide the most relevant answer to a
user's query expressed in written or spoken natural language. Virtual
assistants such as Apple Siri and automated FAQ systems have become very
popular, increasing the need to develop an efficient, advanced and
convenient QA system.
In the field of QA systems, this dissertation addresses the problem of
finding the question that most closely resembles a user's query. Finding
semantic similarities between the question database and natural-language
text is its most important step. To this end, the dissertation explores
unsupervised approaches to measuring semantic similarity for the
development of a closed-domain question-answering system. Modern
sentence-representation techniques such as BERT and FLAIR GloVe are
combined with several similarity measures (cosine, Euclidean and
Manhattan) to identify the best models. The developed models were tested
on three FAQ datasets and SemEval 2015; the best results were obtained by
combining BERT embeddings with the Euclidean distance, reaching a maximum
performance of 85.956% on an FAQ dataset. The model was also tested for
Portuguese with the SNS24 dataset from the Portuguese health-support
phone line.
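The retrieval step the thesis describes, embedding the user query and every FAQ question and returning the nearest one, can be sketched as follows; a trivial bag-of-words vector stands in for the BERT/FLAIR encoders, and the vocabulary and example questions are illustrative assumptions:

```python
# Hedged sketch of FAQ retrieval by embedding + Euclidean distance.
# A bag-of-words embedding stands in for BERT/FLAIR GloVe; in practice a
# real sentence encoder would replace embed().

import math

VOCAB = ["how", "do", "i", "reset", "my", "password",
         "change", "email", "delete", "account"]

def embed(text):
    """Stand-in sentence embedding: word counts over a fixed vocabulary."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def best_faq_match(query, faq_questions):
    """Return the FAQ question whose embedding is nearest to the query's."""
    q = embed(query)
    return min(faq_questions, key=lambda f: euclidean(q, embed(f)))

faqs = [
    "how do i reset my password",
    "how do i change my email",
    "how do i delete my account",
]
print(best_faq_match("reset password", faqs))
```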
NASARI: a novel approach to a Semantically-Aware Representation of items
The semantic representation of individual word senses and concepts is of fundamental importance to several applications in Natural Language Processing. To date, concept modeling techniques have in the main based their representation either on lexicographic resources, such as WordNet, or on encyclopedic resources, such as Wikipedia. We propose a vector representation technique that combines the complementary knowledge of both these types of resource. Thanks to its use of explicit semantics combined with a novel cluster-based dimensionality reduction and an effective weighting scheme, our representation attains state-of-the-art performance on multiple datasets in two standard benchmarks: word similarity and sense clustering. We are releasing our vector representations at http://lcl.uniroma1.it/nasari/
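The abstract names an "effective weighting scheme" without spelling it out; one rank-based comparison commonly paired with sparse lexical vectors of this kind is Weighted Overlap, sketched here as an assumption rather than NASARI's exact formula:

```python
# Hedged sketch: rank-based Weighted Overlap between two sparse lexical
# vectors (dicts of dimension -> weight). Not necessarily NASARI's exact
# measure; shown as one plausible comparison for such representations.

def weighted_overlap(v1, v2):
    """Score shared dimensions by their harmonic ranks; 1.0 when the
    shared dimensions are ranked identically, 0.0 when none are shared."""
    overlap = set(v1) & set(v2)
    if not overlap:
        return 0.0
    # 1-based rank of each dimension, highest weight first
    r1 = {d: r for r, d in enumerate(sorted(v1, key=v1.get, reverse=True), 1)}
    r2 = {d: r for r, d in enumerate(sorted(v2, key=v2.get, reverse=True), 1)}
    num = sum(1.0 / (r1[d] + r2[d]) for d in overlap)
    den = sum(1.0 / (2 * i) for i in range(1, len(overlap) + 1))
    return (num / den) ** 0.5

# Toy lexical vectors with hypothetical weights (e.g. specificity scores):
bank_river = {"water": 3.0, "shore": 2.0, "stream": 1.0}
bank_money = {"finance": 3.0, "money": 2.0, "water": 0.5}
print(weighted_overlap(bank_river, bank_river))  # identical -> 1.0
```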