Hypothesis Only Baselines in Natural Language Inference
We propose a hypothesis-only baseline for diagnosing Natural Language
Inference (NLI). Especially when an NLI dataset assumes inference is occurring
based purely on the relationship between a context and a hypothesis, it follows
that assessing entailment relations while ignoring the provided context is a
degenerate solution. Yet, through experiments on ten distinct NLI datasets, we
find that this approach, which we refer to as a hypothesis-only model, is able
to significantly outperform a majority-class baseline across a number of NLI
datasets. Our analysis suggests that statistical irregularities may allow a
model to perform NLI in some datasets beyond what should be achievable without
access to the context.
Comment: Accepted at *SEM 2018 as a long paper. 12 pages.
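As a minimal illustration of the contrast this abstract draws, the sketch below compares a majority-class baseline with a hypothesis-only predictor on toy data; the examples and negation cues are invented for illustration and are not drawn from any of the ten datasets.

```python
from collections import Counter

# Toy NLI examples as (premise, hypothesis, label); a hypothesis-only model
# never looks at the premise. Data and cue words are illustrative.
train = [
    ("A man eats.", "Nobody is eating.", "contradiction"),
    ("A dog runs.", "An animal is outside.", "neutral"),
    ("A man eats.", "Nobody sleeps.", "contradiction"),
]

# Majority-class baseline: always predict the most frequent training label.
majority_label = Counter(label for _, _, label in train).most_common(1)[0][0]

def hypothesis_only_predict(hypothesis: str) -> str:
    # Exploit a statistical irregularity: negation words in the hypothesis
    # often correlate with the "contradiction" label in crowd-sourced data.
    cues = {"nobody", "no", "never", "not"}
    tokens = set(hypothesis.lower().rstrip(".").split())
    return "contradiction" if tokens & cues else "neutral"
```

The point of the paper is that a predictor of this shape, which never reads the premise, should not beat the majority baseline on a well-constructed dataset, yet it does on several.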
Automatic information search for countering COVID-19 misinformation through semantic similarity
Master's thesis (Trabajo Fin de Máster) in Bioinformatics and Computational Biology.
Information quality in social media is an increasingly important issue, and the misinformation
problem has become even more critical during the current COVID-19 pandemic, leaving people exposed
to false and potentially harmful claims and rumours. Organizations such as the
World Health Organization have issued a global call for action to promote access to health
information and mitigate harm from health misinformation. Consequently, this project pursues
countering the spread of the COVID-19 infodemic and its potential health hazards.
In this work, we give an overall view of models and methods that have been employed in the
NLP field from its foundations to the latest state-of-the-art approaches. Focusing on deep learning methods, we propose applying multilingual Transformer models based on siamese networks,
also called bi-encoders, combined with ensemble and PCA dimensionality reduction techniques.
The goal is to counter COVID-19 misinformation by analyzing the semantic similarity between
a claim and tweets from a collection gathered from official fact-checkers verified by the International Fact-Checking Network of the Poynter Institute.
The number of Internet users increases every year, and the language a person speaks
determines their access to information online. For this reason, we put special effort into applying
multilingual models to tackle misinformation across the globe. Regarding semantic
similarity, we first evaluate these multilingual ensemble models and improve on the results of
the STS-Benchmark compared to monolingual and single models. Second, we enhance the interpretability
of the models' performance through the SentEval toolkit. Lastly, we compare these
models’ performance against biomedical models in TREC-COVID task round 1 using the BM25
Okapi ranking method as the baseline. Moreover, we are interested in understanding the ins
and outs of misinformation. For that purpose, we extend interpretability using machine learning
and deep learning approaches for sentiment analysis and topic modelling. Finally, we develop
a dashboard to ease visualization of the results.
In our view, the results obtained in this project constitute an excellent initial step toward
incorporating multilingualism, and will assist researchers and the public in countering COVID-19
misinformation.
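The core of the pipeline sketched above (bi-encoder embeddings, PCA dimensionality reduction, semantic similarity between a claim and fact-checked tweets) can be illustrated with a minimal sketch; the embedding values below are made-up stand-ins for real bi-encoder outputs, and the function names are ours, not those of any particular library.

```python
import numpy as np

def pca_reduce(embeddings: np.ndarray, n_components: int) -> np.ndarray:
    # Center the embeddings and project them onto the top principal
    # directions, computed via SVD of the centered matrix.
    centered = embeddings - embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Standard cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional embeddings of a claim and two tweets.
claim = np.array([0.2, 0.9, 0.1, 0.4])
tweets = np.array([
    [0.25, 0.85, 0.05, 0.5],   # near-paraphrase of the claim
    [0.9, -0.1, 0.8, -0.3],    # unrelated tweet
])
scores = [cosine_similarity(claim, t) for t in tweets]
```

Ranking fact-checked tweets by such a score against an incoming claim is the retrieval step; the PCA projection is applied to the full embedding matrix before scoring when dimensionality reduction is wanted.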
Semantic relations between sentences: from lexical to linguistically inspired semantic features and beyond
This thesis is concerned with the identification of semantic equivalence between pairs of natural language
sentences, by studying and computing models to address Natural Language Processing tasks where some
form of semantic equivalence is assessed. In such tasks, given two sentences, our models output either
a class label, corresponding to the semantic relation between the sentences, based on a predefined set
of semantic relations, or a continuous score, corresponding to their similarity on a predefined scale. The
former setup corresponds to the tasks of Paraphrase Identification and Natural Language Inference, while
the latter corresponds to the task of Semantic Textual Similarity.
We present several models for English and Portuguese, where various types of features are considered,
for instance based on distances between alternative representations of each sentence, following lexical
and semantic frameworks, or embeddings from pre-trained Bidirectional Encoder Representations from
Transformers (BERT) models. For English, a new set of semantic features is proposed, derived from the formal
semantic representation of Discourse Representation Structures (DRS). In Portuguese, suitable corpora are scarce and formal
semantic representations are unavailable, hence an evaluation of currently available features and corpora is
conducted, following the modelling setup employed for English.
Competitive results are achieved on all tasks, for both English and Portuguese, particularly when considering
that our models are based on generally available tools and technologies, and that all features and models are
suitable for computation in most modern computers, except for those based on embeddings. In particular,
for English, our semantic features from DRS are able to improve the performance of other models, when
integrated in the feature set of such models, and state-of-the-art results are achieved for Portuguese, with
models based on fine-tuning embeddings to a specific task.
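One of the simplest distance features of the kind described, a distance between the token-set representations of the two sentences, can be sketched as follows; the function name and example sentences are illustrative, not taken from the thesis.

```python
def jaccard_distance(sentence_a: str, sentence_b: str) -> float:
    # Lexical distance feature: 1 - |intersection| / |union| over token sets.
    tokens_a = set(sentence_a.lower().split())
    tokens_b = set(sentence_b.lower().split())
    return 1.0 - len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# A feature vector for a sentence pair could combine several such distances,
# each computed over an alternative representation (lemmas, frames, embeddings).
features = [jaccard_distance("The cat sleeps", "The cat is sleeping")]
```

A classifier over such a feature vector then outputs either a class label (Paraphrase Identification, NLI) or a continuous similarity score (Semantic Textual Similarity), matching the two setups described above.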
ArNLI: Arabic Natural Language Inference Entailment and Contradiction Detection
Natural Language Inference (NLI) is an active research topic in natural language processing, and contradiction detection between sentences is a special case of NLI. It is considered a difficult NLP task, and it has a large impact when added as a component of many NLP applications, such as question answering systems and text summarization. Arabic is one of the most challenging low-resource languages for contradiction detection due to its rich lexical and semantic ambiguity. We have created a dataset of more than 12k sentences, named ArNLI, which will be publicly available. Moreover, we have applied a new model inspired by the contradiction-detection solutions proposed by Stanford for English. We propose an approach to detect contradictions between pairs of sentences in Arabic using a contradiction vector combined with a language-model vector as the input to a machine learning model. We analyzed the results of different traditional machine learning classifiers and compared their results on our created dataset (ArNLI) and on automatic translations of the PHEME and SICK English datasets. The best results were achieved using a Random Forest classifier, with accuracies of 99%, 60% and 75% on PHEME, SICK and ArNLI respectively.
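A minimal sketch of the input construction described above: a hand-crafted contradiction vector concatenated with a language-model vector before classification. The cue words, features and numeric values here are illustrative stand-ins, not the paper's actual features.

```python
NEGATION_CUES = {"not", "no", "never"}

def contradiction_vector(sent_a: str, sent_b: str) -> list[float]:
    # Two toy features: does exactly one sentence contain negation,
    # and how much do the token sets overlap?
    tokens_a = set(sent_a.lower().split())
    tokens_b = set(sent_b.lower().split())
    negation_mismatch = bool(tokens_a & NEGATION_CUES) != bool(tokens_b & NEGATION_CUES)
    overlap = len(tokens_a & tokens_b) / max(len(tokens_a | tokens_b), 1)
    return [float(negation_mismatch), overlap]

def classifier_input(contra_vec: list[float], lm_vec: list[float]) -> list[float]:
    # Concatenation: the combined vector is what a classifier such as a
    # Random Forest would consume.
    return contra_vec + lm_vec

x = classifier_input(contradiction_vector("he is happy", "he is not happy"),
                     [0.1, 0.7, 0.2])  # stand-in language-model vector
```

High negation mismatch plus high lexical overlap is a classic surface signal for contradiction, which is why combining such cues with a learned language-model vector is a natural design.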
e-SNLI: Natural Language Inference with Natural Language Explanations
In order for machine learning to garner widespread public adoption, models
must be able to provide interpretable and robust explanations for their
decisions, as well as learn from human-provided explanations at train time. In
this work, we extend the Stanford Natural Language Inference dataset with an
additional layer of human-annotated natural language explanations of the
entailment relations. We further implement models that incorporate these
explanations into their training process and output them at test time. We show
how our corpus of explanations, which we call e-SNLI, can be used for various
goals, such as obtaining full sentence justifications of a model's decisions,
improving universal sentence representations and transferring to out-of-domain
NLI datasets. Our dataset thus opens up a range of research directions for
using natural language explanations, both for improving models and for
asserting their trust.
Comment: NeurIPS 2018.
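The added annotation layer can be pictured as a simple record type, where each example carries a free-text justification alongside the usual premise, hypothesis and label; the field names and example below are illustrative, not the official e-SNLI column names.

```python
from dataclasses import dataclass

@dataclass
class ExplainedNLIExample:
    premise: str
    hypothesis: str
    label: str          # "entailment", "neutral" or "contradiction"
    explanation: str    # human-written natural language justification

example = ExplainedNLIExample(
    premise="A man is playing a guitar on stage.",
    hypothesis="A person is performing music.",
    label="entailment",
    explanation="Playing a guitar on stage is a form of performing music.",
)
```

Models trained on such records can be asked to emit the explanation field at test time, which is the "output them at test time" setup the abstract describes.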