13 research outputs found
Workshop Notes of the Sixth International Workshop "What can FCA do for Artificial Intelligence?"
International audience
Modelling semantic relations with distributional semantics and deep learning: question answering, entailment recognition and paraphrase detection
This dissertation presents an approach to the task of modelling semantic relations between
two texts, which is based on distributional semantic models and deep learning.
The present work takes advantage of various disciplines of cognitive science, mainly
computation, linguistics and artificial intelligence, with strong influences from neuroscience
and cognitive psychology.
Distributional semantic models (also known as word embeddings) are used to
represent the meaning of words. Word semantic representations can be further combined
towards obtaining the meaning of a larger chunk of a text using a deep learning
approach, namely with the support of convolutional neural networks.
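The composition step described above can be sketched as follows. This is an illustrative toy encoder, not the exact architecture used in the dissertation: random vectors stand in for pre-trained word embeddings, and the vocabulary, filter width, tanh non-linearity and max-over-time pooling are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary; random vectors stand in for pre-trained word embeddings.
vocab = {"how": 0, "do": 1, "i": 2, "install": 3, "remove": 4, "ubuntu": 5}
emb_dim = 8
filter_width, n_filters = 3, 4
embeddings = rng.normal(size=(len(vocab), emb_dim))
filters = rng.normal(size=(n_filters, filter_width, emb_dim))

def sentence_vector(tokens):
    """CNN-style sentence encoder: slide each filter over the token
    embedding sequence, apply tanh, then max-pool over time."""
    X = embeddings[[vocab[t] for t in tokens]]          # (seq_len, emb_dim)
    feats = []
    for f in filters:
        conv = [np.tanh(np.sum(X[i:i + filter_width] * f))
                for i in range(len(tokens) - filter_width + 1)]
        feats.append(max(conv))                         # max over time
    return np.array(feats)                              # fixed size: (n_filters,)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

q1 = sentence_vector(["how", "do", "i", "install", "ubuntu"])
q2 = sentence_vector(["how", "do", "i", "remove", "ubuntu"])
print(q1.shape, round(cosine(q1, q2), 3))
```

Because max-pooling collapses the time axis, both questions map to vectors of the same fixed size regardless of their length, which is what makes them directly comparable.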
This approach is used to replicate the experiment carried out by Bogdanova
et al. (2015) for the task of detecting questions that can be answered by exactly the
same answer in online user forums. Performance results obtained by my experiments
are comparable to or better than the ones reported in that referenced work.
I also present a study on the impact of appropriate text preprocessing with respect
to the results that can be obtained by the approaches adopted in that referenced
work. Removing certain clues that can unduly help the system to detect equivalent
questions leads to a significant decrease in the performance of the system developed
in that referenced work.
I also present a study of the impact that pre-trained word embeddings have on the
task of detecting semantically equivalent questions. Replacing pre-trained word
embeddings with randomly initialised ones improves the performance of the system.
Additionally, the model was applied to the task of entailment recognition for Portuguese
and showed accuracy on a par with the baseline.
This dissertation also reports on the results of an experimental study on the application
of the adopted approach to the shared task of sentence paraphrase detection
in Russian. The final setup contained two improvements: it uses several convolutional
filters and it uses character embeddings instead of word embeddings. It was tested
in the standard run of Task 2 of that shared task and showed competitive results.
Active Sampling for Large-scale Information Retrieval Evaluation
Evaluation is crucial in Information Retrieval. The development of models,
tools and methods has significantly benefited from the availability of reusable
test collections formed through a standardized and thoroughly tested
methodology, known as the Cranfield paradigm. Constructing these collections
requires obtaining relevance judgments for a pool of documents, retrieved by
systems participating in an evaluation task, and thus involves immense human labor.
To alleviate this effort different methods for constructing collections have
been proposed in the literature, falling under two broad categories: (a)
sampling, and (b) active selection of documents. The former devises a smart
sampling strategy by choosing only a subset of documents to be assessed and
inferring evaluation measures on the basis of the obtained sample; the sampling
distribution is fixed at the beginning of the process. The latter
recognizes that systems contributing documents to be judged vary in quality,
and actively selects documents from good systems. The quality of a system is
re-estimated each time a new document is judged. In this paper we seek to
solve the problem of large-scale retrieval evaluation combining the two
approaches. We devise an active sampling method that avoids the bias of the
active selection methods towards good systems, and at the same time reduces the
variance of the current sampling approaches by placing a distribution over
systems, which varies as judgments become available. We validate the proposed
method using TREC data and demonstrate the advantages of this new method
compared to past approaches.
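As a rough illustration of placing an evolving distribution over systems, the following toy loop samples a system in proportion to its current quality estimate, judges that system's top unjudged document, and updates the estimate. The pools, relevance labels and smoothed-precision estimator here are all hypothetical, not the paper's actual method.

```python
import random

random.seed(0)

# Hypothetical pools: each system's ranked list of document ids.
systems = {
    "sysA": ["d1", "d2", "d3", "d4", "d5"],
    "sysB": ["d3", "d6", "d2", "d7", "d8"],
    "sysC": ["d9", "d1", "d6", "d10", "d4"],
}
# Hypothetical ground-truth relevance, revealed only when a document is judged.
true_relevance = {"d1": 1, "d2": 1, "d3": 0, "d4": 0, "d5": 1,
                  "d6": 0, "d7": 1, "d8": 0, "d9": 0, "d10": 1}

judged = {}                            # document id -> judgment
quality = {s: 1.0 for s in systems}    # uniform prior over system quality

def sample_system():
    """Sample a system with probability proportional to its current
    quality estimate; the distribution changes as judgments arrive."""
    total = sum(quality.values())
    r = random.uniform(0, total)
    acc = 0.0
    for s, q in quality.items():
        acc += q
        if r <= acc:
            return s
    return s

budget = 8
while len(judged) < budget:
    s = sample_system()
    unjudged = [d for d in systems[s] if d not in judged]
    if not unjudged:
        quality[s] = 1e-9              # pool exhausted: effectively stop sampling it
        continue
    d = unjudged[0]                    # judge the highest-ranked unjudged document
    judged[d] = true_relevance[d]
    # Re-estimate quality as smoothed precision over this system's judged docs.
    hits = sum(judged[doc] for doc in systems[s] if doc in judged)
    n = sum(1 for doc in systems[s] if doc in judged)
    quality[s] = (hits + 1) / (n + 2)

print(len(judged), sorted(judged))
```

Sampling (rather than always picking the current best system) is what avoids the bias toward good systems, while weighting by quality keeps the variance lower than a fixed uniform sampling distribution would.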
Crowd-annotation and LoD-based semantic indexing of content in multi-disciplinary web repositories to improve search results
Searching for relevant information in multi-disciplinary web
repositories is becoming a topic of increasing interest among the
computer science research community. To date, methods and techniques to extract useful and relevant information from
online repositories of research data have largely been based on
static full-text indexing, which entails a ‘produce once and use
forever’ kind of strategy. That strategy is fast becoming
insufficient due to increasing data volume, concept
obsolescence, and complexity and heterogeneity of content types
in web repositories. We propose that by automatic semantic
annotation of content in web repositories (using Linked Open
Data, or LoD, sources) without using domain-specific ontologies,
we can sustain the performance of searching by retrieving highly
relevant search results. Secondly, we claim that by expert
crowd-annotation of content on top of automatic semantic
annotation, we can enrich the semantic index over time to
augment the contextual value of content in web repositories so
that they remain findable despite changes in language,
terminology and scientific concepts. We deployed a custom-
built annotation, indexing and searching environment in a web
repository website that has been used by expert annotators to
annotate webpages using free text and vocabulary terms. We
present our findings based on the annotation and tagging data on
top of LoD-based annotations and the overall
modus operandi.
We also analyze and demonstrate that by adding expert
annotations to the existing semantic index, we can improve the
relationship between queries and documents, as measured by the Cosine
Similarity Measure (CSM).
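The effect described, enriching the index with expert annotations to improve query-document cosine similarity, can be illustrated with a minimal bag-of-words sketch. The document text, crowd tags and query below are invented examples, not data from the study.

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

doc_text = "static full text indexing of research data".split()
expert_tags = "semantic annotation linked open data".split()  # hypothetical crowd tags

query = Counter("linked open data indexing".split())
base_index = Counter(doc_text)
enriched_index = Counter(doc_text + expert_tags)  # expert tags enrich the index

print(round(cosine(query, base_index), 3),
      round(cosine(query, enriched_index), 3))
```

The enriched vector shares more terms with the query, so its cosine score is higher; in the same way, expert tags can keep a page findable even when its original wording no longer matches current terminology.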