ArNLI: Arabic Natural Language Inference Entailment and Contradiction Detection
Natural Language Inference (NLI) is an active research topic in natural language processing, and contradiction detection between sentences is a special case of NLI. It is considered a difficult NLP task, and it has a large impact when used as a component in NLP applications such as question answering and text summarization. Arabic is one of the most challenging low-resource languages for contradiction detection due to its rich lexical and semantic ambiguity. We have created a dataset of more than 12k sentences, named ArNLI, which will be made publicly available. Moreover, we have applied a new model inspired by the solutions proposed by Stanford for contradiction detection in English. We propose an approach to detect contradictions between pairs of Arabic sentences using a contradiction vector combined with a language-model vector as input to a machine learning model. We analyzed the results of different traditional machine learning classifiers and compared them on our dataset (ArNLI) and on automatic translations of the English PHEME and SICK datasets. The best results were achieved with a Random Forest classifier, with accuracies of 99%, 60% and 75% on PHEME, SICK and ArNLI respectively.
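The feature pipeline described above can be sketched as follows, assuming scikit-learn is available: a contradiction-feature vector and a language-model vector are concatenated per sentence pair and fed to a Random Forest. All arrays here are random stand-ins, and names such as `contradiction_vec` are illustrative, not taken from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical stand-ins: in the paper, the language-model vector would come
# from a trained language model and the contradiction vector from hand-crafted
# contradiction features. Here both are random, just to show the pipeline shape.
n_pairs = 200
lm_vec = rng.normal(size=(n_pairs, 32))            # language-model vector per pair
contradiction_vec = rng.normal(size=(n_pairs, 8))  # contradiction-feature vector per pair
labels = rng.integers(0, 3, size=n_pairs)          # entailment / neutral / contradiction

# The model input is the concatenation of the two vectors.
features = np.concatenate([lm_vec, contradiction_vec], axis=1)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(features, labels)
preds = clf.predict(features)
```

The same `features` matrix can be handed to any of the traditional classifiers the paper compares; Random Forest is shown because it gave the best reported results.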
Meemi: A Simple Method for Post-processing and Integrating Cross-lingual Word Embeddings
Word embeddings have become a standard resource in the toolset of any Natural
Language Processing practitioner. While monolingual word embeddings encode
information about words in the context of a particular language, cross-lingual
embeddings define a multilingual space where word embeddings from two or more
languages are integrated together. Current state-of-the-art approaches learn
these embeddings by aligning two disjoint monolingual vector spaces through an
orthogonal transformation which preserves the structure of the monolingual
counterparts. In this work, we propose to apply an additional transformation
after this initial alignment step, which aims to bring the vector
representations of a given word and its translations closer to their average.
Since this additional transformation is non-orthogonal, it also affects the
structure of the monolingual spaces. We show that our approach both improves
the integration of the monolingual spaces as well as the quality of the
monolingual spaces themselves. Furthermore, because our transformation can be
applied to an arbitrary number of languages, we are able to effectively obtain
a truly multilingual space. The resulting (monolingual and multilingual) spaces
show consistent gains over the current state-of-the-art in standard intrinsic
tasks, namely dictionary induction and word similarity, as well as in extrinsic
tasks such as cross-lingual hypernym discovery and cross-lingual natural
language inference.

Comment: 22 pages, 2 figures, 9 tables. Preprint submitted to Natural Language Engineering.
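The averaging step can be sketched in plain NumPy, under the assumption that the two embedding spaces are already orthogonally aligned: each side learns an unconstrained least-squares map that moves a word's vector toward the average of the word and its translation. The data and sizes below are toy stand-ins, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 300  # embedding dimension, number of dictionary pairs (toy sizes)

# Assume X and Y already live in a shared space after an orthogonal
# alignment step; here Y is a noisy copy of X standing in for translations.
X = rng.normal(size=(n, d))
Y = X + 0.1 * rng.normal(size=(n, d))

# The extra transformation: map each side toward the average of a word and
# its translation. The map is unconstrained (non-orthogonal), so it also
# reshapes the monolingual spaces, and it is learned by ordinary least squares.
M = (X + Y) / 2.0
W_x, *_ = np.linalg.lstsq(X, M, rcond=None)
W_y, *_ = np.linalg.lstsq(Y, M, rcond=None)

# After the map, each side sits closer to the shared averages than before.
before = np.linalg.norm(X - M)
after = np.linalg.norm(X @ W_x - M)
```

Because the maps are plain linear transformations, the same construction extends to any number of languages by averaging each word with all of its translations, which is how a single truly multilingual space is obtained.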
Computational models for semantic textual similarity
The overarching goal of this thesis is to advance computational models of meaning and their evaluation. To achieve this goal we define two tasks and develop state-of-the-art systems that tackle both: Semantic Textual Similarity (STS) and Typed Similarity. STS aims to measure the degree of semantic equivalence between two sentences by assigning graded similarity values that capture intermediate shades of similarity. We have collected pairs of sentences to construct datasets for STS, a total of 15,436 sentence pairs, by far the largest collection of data for STS. We have designed, constructed and evaluated a new approach that combines knowledge-based and corpus-based methods using a cube. This new system for STS is on par with state-of-the-art approaches that rely on Machine Learning (ML) while using none itself, although ML can be added to the system, improving the results. Typed Similarity tries to identify the type of relation that holds between a pair of similar items in a digital library. Providing a reason why items are similar has applications in recommendation, personalization, and search. A range of similarity types in this collection were identified, and a set of 1,500 item pairs from the collection were annotated using crowdsourcing. Finally, we present systems capable of solving the Typed Similarity task. The best system resulted in a real-world application that recommends similar items to users of an online digital library.
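The graded STS scale can be illustrated with a deliberately naive scorer: a token-overlap stand-in for the far richer knowledge-based and corpus-based combination the thesis builds. The function below is an assumption-laden sketch, not the thesis's cube-based system.

```python
def sts_score(s1: str, s2: str) -> float:
    """Toy STS baseline: graded 0-5 similarity from token overlap.

    Illustrative stand-in only; real STS systems combine knowledge-based
    and corpus-based evidence rather than surface overlap.
    """
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    if not t1 or not t2:
        return 0.0
    jaccard = len(t1 & t2) / len(t1 | t2)
    return 5.0 * jaccard  # scale to the standard 0-5 STS range

print(sts_score("a cat sat on the mat", "a cat sat on the mat"))  # 5.0
print(sts_score("a cat sat", "dogs bark loudly"))                 # 0.0
```

Identical sentences score 5, unrelated ones score 0, and partial overlap lands in between, which is exactly the graded behaviour the STS task asks annotators and systems to produce.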
State-of-the-art generalisation research in NLP: a taxonomy and review
The ability to generalise well is one of the primary desiderata of natural
language processing (NLP). Yet, what `good generalisation' entails and how it
should be evaluated is not well understood, nor are there any common standards
to evaluate it. In this paper, we aim to lay the groundwork to improve both of
these issues. We present a taxonomy for characterising and understanding
generalisation research in NLP, we use that taxonomy to present a comprehensive
map of published generalisation studies, and we make recommendations for which
areas might deserve attention in the future. Our taxonomy is based on an
extensive literature review of generalisation research, and contains five axes
along which studies can differ: their main motivation, the type of
generalisation they aim to solve, the type of data shift they consider, the
source by which this data shift is obtained, and the locus of the shift within
the modelling pipeline. We use our taxonomy to classify over 400 previous
papers that test generalisation, for a total of more than 600 individual
experiments. Considering the results of this review, we present an in-depth
analysis of the current state of generalisation research in NLP, and make
recommendations for the future. Along with this paper, we release a webpage
where the results of our review can be dynamically explored, and which we
intend to update as new NLP generalisation studies are published. With this
work, we aim to make steps towards making state-of-the-art generalisation
testing the new status quo in NLP.

Comment: 35 pages of content + 53 pages of references.
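The five-axis taxonomy lends itself to a simple record type for classifying an experiment. The field names below follow the paper's five axes; the example values are assumptions chosen for illustration, not the paper's full vocabularies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneralisationExperiment:
    """One generalisation experiment, characterised along the five axes."""
    motivation: str            # e.g. "practical", "cognitive" (illustrative values)
    generalisation_type: str   # e.g. "cross-lingual", "compositional"
    shift_type: str            # e.g. "covariate", "label"
    shift_source: str          # e.g. "naturally occurring", "generated"
    shift_locus: str           # e.g. "train-test", "pretrain-test"


# A hypothetical experiment classified along all five axes.
exp = GeneralisationExperiment(
    motivation="practical",
    generalisation_type="cross-lingual",
    shift_type="covariate",
    shift_source="naturally occurring",
    shift_locus="train-test",
)
```

Tagging each of the 600+ reviewed experiments with such a record is what makes the review's dynamic webpage filterable along any combination of axes.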