Sentence Meta-Embeddings for Unsupervised Semantic Textual Similarity
We address the task of unsupervised Semantic Textual Similarity (STS) by ensembling diverse pre-trained sentence encoders into sentence meta-embeddings. We apply, extend and evaluate different meta-embedding methods from the word embedding literature at the sentence level, including dimensionality reduction (Yin and Schütze, 2016), generalized Canonical Correlation Analysis (Rastogi et al., 2015) and cross-view auto-encoders (Bollegala and Bao, 2018). Our sentence meta-embeddings set a new unsupervised State of The Art (SoTA) on the STS Benchmark and on the STS12–STS16 datasets, with gains of between 3.7% and 6.4% Pearson's r over single-source systems.
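As a rough illustration of the ensembling idea in this abstract, the sketch below builds a meta-embedding by concatenating length-normalized vectors from several encoders and compares two sentences by cosine similarity. It covers only the concatenation starting point; the dimensionality reduction, GCCA and auto-encoder variants named in the abstract are not shown. The function names and the toy vectors are assumptions made for illustration, not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def concat_meta_embedding(views):
    """Concatenate the L2-normalized embeddings that different encoders
    produced for the same sentence (hypothetical helper)."""
    meta = []
    for vec in views:
        meta.extend(l2_normalize(vec))
    return meta


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


# Toy data: two sentences, each embedded by two made-up encoders
# with different output dimensions (2 and 3).
sentence_a = [[1.0, 0.0], [0.0, 2.0, 0.0]]
sentence_b = [[1.0, 0.0], [0.0, 0.0, 3.0]]

score = cosine(concat_meta_embedding(sentence_a),
               concat_meta_embedding(sentence_b))
```

Normalizing each view before concatenation keeps an encoder with large vector norms from dominating the similarity score; in this toy case the two encoders agree on one view and disagree on the other.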
Vermeidung von Repräsentationsheterogenitäten in realweltlichen Wissensgraphen
Knowledge graphs are repositories providing factual knowledge about entities. They are a great source of knowledge to support modern AI applications for Web search, question answering, digital assistants, and online shopping. Advances in machine learning techniques and the Web's growth have led to colossal knowledge graphs with billions of facts about hundreds of millions of entities collected from a large variety of sources. While integrating independent knowledge sources promises rich information, it inherently leads to heterogeneities in representation due to a large variety of different conceptualizations. Thus, real-world knowledge graphs are threatened in their overall utility. Due to their sheer size, they can hardly be curated manually anymore. Automatic and semi-automatic methods are needed to cope with these vast knowledge repositories. We first address the general topic of representation heterogeneity by surveying the problem throughout various data-intensive fields: databases, ontologies, and knowledge graphs. Different techniques for automatically resolving heterogeneity issues are presented and discussed, while several open problems are identified. Next, we focus on entity heterogeneity. We show that automatic matching techniques may run into quality problems when working in a multi-knowledge graph scenario due to incorrect transitive identity links. We present four techniques that can be used to improve the quality of arbitrary entity matching tools significantly. Concerning relation heterogeneity, we show that synonymous relations in knowledge graphs pose several difficulties in querying. Therefore, we resolve these heterogeneities with knowledge graph embeddings and by Horn rule mining. All methods detect synonymous relations in knowledge graphs with high quality. Furthermore, we present a novel technique for avoiding heterogeneity issues at query time using implicit knowledge storage.
We show that large neural language models are a valuable source of knowledge that can be queried much like knowledge graphs and that already resolve several heterogeneity issues internally.
Knowledge graphs are an important source of entity knowledge. They support many modern AI applications, including web search, automatic question answering, digital assistants, and online shopping. Recent advances in machine learning and the extraordinary growth of the Internet have led to huge knowledge graphs. These often comprise billions of facts about hundreds of millions of entities, frequently drawn from many different sources. While integrating independent knowledge sources can yield a great variety of information, it inherently leads to heterogeneities in knowledge representation. This heterogeneity in the data threatens the practical utility of knowledge graphs. Because of their size, however, knowledge graphs can no longer be cleaned manually. Automatic and semi-automatic methods are therefore frequently needed today. In this thesis, we address the topic of representation heterogeneity. We classify heterogeneity along several dimensions and explain heterogeneity problems in databases, ontologies, and knowledge graphs. We also give a brief overview of various techniques for automatically resolving heterogeneity problems. In the next chapter, we turn to entity heterogeneity. We point out problems that arise in a multi-knowledge-graph scenario due to erroneous transitive links. To solve these problems, we present four techniques that significantly improve the quality of arbitrary entity alignment tools. We show that relation heterogeneity in knowledge graphs can lead to problems in query answering.
We therefore develop several methods for finding synonymous relations. One method works with high-dimensional knowledge graph embeddings, the other with a rule mining approach. Both methods can detect synonymous relations in knowledge graphs with high quality. In addition, we present a novel technique for avoiding heterogeneity problems that uses an implicit knowledge representation. We show that large neural language models are a valuable source of knowledge that can be queried similarly to knowledge graphs. Many of the heterogeneity problems are already resolved within the language model itself, which makes it possible to query heterogeneous knowledge graphs.
Deep learning based semantic textual similarity for applications in translation technology
A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophy.
Semantic Textual Similarity (STS) measures the equivalence of meanings
between two textual segments. It is a fundamental task for many natural
language processing applications. In this study, we focus on employing STS in
the context of translation technology. We start by developing models to estimate
STS. We propose a new unsupervised vector aggregation-based STS method
which relies on contextual word embeddings. We also propose a novel Siamese
neural network based on efficient recurrent neural network units. We empirically
evaluate various unsupervised and supervised STS methods, including these
newly proposed methods, on three English STS datasets, two non-English
datasets and a biomedical STS dataset to identify the best supervised and
unsupervised STS methods.
We then embed these STS methods in translation technology applications.
First, we experiment with Translation Memory (TM) systems. We propose a
novel TM matching and retrieval method based on STS methods that outperforms
current TM systems. We then utilise the developed STS architectures in
translation Quality Estimation (QE). We show that the proposed methods are
simple but outperform complex QE architectures and improve the state-of-the-art
results. The implementations of these methods have been released as open
source.
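To make the idea of STS-based translation memory matching more concrete, here is a minimal sketch: each segment is represented by the average of its word vectors (a simple form of vector aggregation), and the TM entry with the highest cosine similarity above a threshold is returned. The thesis relies on contextual word embeddings; the static `TOY_VECTORS` table, the 0.7 threshold and all function names below are assumptions for illustration only.

```python
import math

# Hypothetical 2-d word vectors; a real system would use contextual embeddings.
TOY_VECTORS = {
    "open": [1.0, 0.0],
    "close": [0.0, 1.0],
    "file": [0.6, 0.4],
    "the": [0.5, 0.5],
    "window": [0.2, 0.8],
}


def mean_vector(sentence):
    """Average the vectors of all known tokens; None if no token is known."""
    vectors = [TOY_VECTORS[t] for t in sentence.lower().split() if t in TOY_VECTORS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def best_tm_match(query, memory, threshold=0.7):
    """Return (source, target, score) of the most similar TM entry,
    or None if nothing reaches the (assumed) threshold."""
    q = mean_vector(query)
    if q is None:
        return None
    best = None
    for source, target in memory.items():
        s = mean_vector(source)
        if s is None:
            continue
        score = cosine(q, s)
        if score >= threshold and (best is None or score > best[2]):
            best = (source, target, score)
    return best


# Toy translation memory: source segment -> stored translation.
memory = {
    "open the file": "abrir el archivo",
    "close the window": "cerrar la ventana",
}
match = best_tm_match("open file", memory)
```

Unlike edit-distance matching, a similarity score of this kind can retrieve a segment whose wording differs from the query as long as the aggregated vectors are close.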
MultiCQA: Zero-Shot Transfer of Self-Supervised Text Matching Models on a Massive Scale
We study the zero-shot transfer capabilities of text matching models on a
massive scale, by self-supervised training on 140 source domains from community
question answering forums in English. We investigate the model performances on
nine benchmarks of answer selection and question similarity tasks, and show
that all 140 models transfer surprisingly well, where the large majority of
models substantially outperforms common IR baselines. We also demonstrate that
considering a broad selection of source domains is crucial for obtaining the
best zero-shot transfer performances, which contrasts with the standard procedure
that merely relies on the largest and most similar domains. In addition, we
extensively study how to best combine multiple source domains. We propose to
incorporate self-supervised with supervised multi-task learning on all
available source domains. Our best zero-shot transfer model considerably
outperforms in-domain BERT and the previous state of the art on six benchmarks.
Fine-tuning of our model with in-domain data results in additional large gains
and achieves the new state of the art on all nine benchmarks.
Comment: EMNLP-202
Automatic information search for countering covid-19 misinformation through semantic similarity
Master's thesis in Bioinformatics and Computational Biology (Trabajo Fin de Máster en Bioinformática y Biología Computacional).
Information quality in social media is an increasingly important issue, and the misinformation problem has become even more critical during the current COVID-19 pandemic, leaving people exposed
to false and potentially harmful claims and rumours. Civil society organizations, such as the
World Health Organization, have issued a global call for action to promote access to health
information and mitigate harm from health misinformation. Consequently, this project aims to
counter the spread of the COVID-19 infodemic and its potential health hazards.
In this work, we give an overall view of models and methods that have been employed in the
NLP field from its foundations to the latest state-of-the-art approaches. Focusing on deep learning methods, we propose applying multilingual Transformer models based on siamese networks,
also called bi-encoders, combined with ensemble and PCA dimensionality reduction techniques.
The goal is to counter COVID-19 misinformation by analyzing the semantic similarity between
a claim and tweets from a collection gathered from official fact-checkers verified by the International Fact-Checking Network of the Poynter Institute.
It is a fact that the number of Internet users increases every year and that the language spoken
determines access to information online. For this reason, we devote special effort to applying multilingual models to tackle misinformation across the globe. Regarding semantic
similarity, we first evaluate these multilingual ensemble models and improve the result on the
STS-Benchmark compared to monolingual and single models. Second, we enhance the interpretability of the models' performance through the SentEval toolkit. Lastly, we compare these
models' performance against biomedical models in TREC-COVID task round 1 using the BM25
Okapi ranking method as the baseline. Moreover, we are interested in understanding the ins
and outs of misinformation. For that purpose, we extend interpretability using machine learning
and deep learning approaches for sentiment analysis and topic modelling. Finally, we developed
a dashboard to ease visualization of the results.
In our view, the results obtained in this project constitute an excellent initial step toward
incorporating multilingualism and will assist researchers and people in countering COVID-19
misinformation.
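For readers unfamiliar with the BM25 Okapi baseline mentioned in this abstract, the following sketch scores a handful of tokenized documents against a query. It uses one common variant of the formula (with a smoothed, non-negative IDF); the parameter values `k1=1.5` and `b=0.75`, the function name and the toy documents are assumptions, not settings reported by the thesis.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25
    (one common variant; k1 and b are assumed default values)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy collection of already-tokenized documents.
docs = [
    ["covid", "vaccine", "safety"],
    ["football", "results"],
    ["covid", "travel", "rules", "update"],
]
query = ["covid", "vaccine"]

scores = bm25_scores(query, docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Because BM25 relies on exact term overlap, it gives a useful lexical reference point against which embedding-based bi-encoders can be compared.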
Local Embeddings for Relational Data Integration
Deep learning based techniques have been recently used with promising results
for data integration problems. Some methods directly use pre-trained embeddings
that were trained on a large corpus such as Wikipedia. However, they may not
always be an appropriate choice for enterprise datasets with custom vocabulary.
Other methods adapt techniques from natural language processing to obtain
embeddings for the enterprise's relational data. However, this approach blindly
treats a tuple as a sentence, thus losing a large amount of contextual
information present in the tuple.
We propose algorithms for obtaining local embeddings that are effective for
data integration tasks on relational databases. We make four major
contributions. First, we describe a compact graph-based representation that
allows the specification of a rich set of relationships inherent in the
relational world. Second, we propose how to derive sentences from such a graph
that effectively "describe" the similarity across elements (tokens, attributes,
rows) in the two datasets. The embeddings are learned based on such sentences.
Third, we propose effective optimization to improve the quality of the learned
embeddings and the performance of integration tasks. Finally, we propose a
diverse collection of criteria to evaluate relational embeddings and perform an
extensive set of experiments validating them against multiple baseline methods.
Our experiments show that our framework, EmbDI, produces meaningful results for
data integration tasks such as schema matching and entity resolution both in
supervised and unsupervised settings.
Comment: Accepted to SIGMOD 2020 as Creating Embeddings of Heterogeneous
Relational Datasets for Data Integration Tasks. Code can be found at
https://gitlab.eurecom.fr/cappuzzo/embd
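The abstract describes deriving sentences from a graph built over relational data but does not spell out the construction. The sketch below shows one plausible reading, under the assumption that cell values, rows and columns become graph nodes and that sentences are generated by random walks; node naming, walk length and helper names are all hypothetical and are not taken from the EmbDI code.

```python
import random
from collections import defaultdict


def build_graph(rows):
    """Link every cell value (token node) to its row node and its column node.
    The node-name prefixes are an assumed convention."""
    graph = defaultdict(set)
    for i, row in enumerate(rows):
        rid = f"row_{i}"
        for col, value in row.items():
            cid = f"col_{col}"
            tok = f"tok_{value}"
            graph[tok].add(rid)
            graph[rid].add(tok)
            graph[tok].add(cid)
            graph[cid].add(tok)
    return graph


def random_walk(graph, start, length, rng):
    """Generate one 'sentence' by walking uniformly at random over neighbours."""
    walk = [start]
    while len(walk) < length:
        neighbours = sorted(graph[walk[-1]])  # sorted for reproducibility
        walk.append(rng.choice(neighbours))
    return walk


# Toy table with two tuples that share the value "Paris".
rows = [
    {"name": "Ann", "city": "Paris"},
    {"name": "Bob", "city": "Paris"},
]
graph = build_graph(rows)
rng = random.Random(0)  # fixed seed so the walks are repeatable
sentences = [random_walk(graph, node, 5, rng) for node in sorted(graph)]
```

Sentences of this kind could then be passed to any word-embedding trainer; the shared token node for "Paris" is what lets the two rows end up near each other, which a tuple-as-sentence encoding would capture less directly.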
Comprehensive Overview of Named Entity Recognition: Models, Domain-Specific Applications and Challenges
In the domain of Natural Language Processing (NLP), Named Entity Recognition
(NER) stands out as a pivotal mechanism for extracting structured insights from
unstructured text. This manuscript offers an exhaustive exploration into the
evolving landscape of NER methodologies, blending foundational principles with
contemporary AI advancements. Beginning with the rudimentary concepts of NER,
the study spans a spectrum of techniques from traditional rule-based strategies
to the contemporary marvels of transformer architectures, particularly
highlighting integrations such as BERT with LSTM and CNN. The narrative
accentuates domain-specific NER models, tailored for intricate areas like
finance, legal, and healthcare, emphasizing their specialized adaptability.
Additionally, the research delves into cutting-edge paradigms including
reinforcement learning, innovative constructs like E-NER, and the interplay of
Optical Character Recognition (OCR) in augmenting NER capabilities. Grounding
its insights in practical realms, the paper sheds light on the indispensable
role of NER in sectors like finance and biomedicine, addressing the unique
challenges they present. The conclusion outlines open challenges and avenues,
marking this work as a comprehensive guide for those delving into NER research
and applications