Information Retrieval: Recent Advances and Beyond
In this paper, we provide a detailed overview of the models used for
information retrieval in the first and second stages of the typical processing
chain. We discuss the current state-of-the-art models, including term-based,
semantic retrieval, and neural methods. Additionally, we delve into the key
topics related to the learning process of these models. In this way, the survey
offers a comprehensive understanding of the field and is of interest to
researchers and practitioners entering or working in the information retrieval
domain.
Jointly Embedding Entities and Text with Distant Supervision
Learning representations for knowledge base entities and concepts is becoming
increasingly important for NLP applications. However, recent entity embedding
methods have relied on structured resources that are expensive to create for
new domains and corpora. We present a distantly-supervised method for jointly
learning embeddings of entities and text from an unannotated corpus, using only
a list of mappings between entities and surface forms. We learn embeddings from
open-domain and biomedical corpora, and compare against prior methods that rely
on human-annotated text or large knowledge graph structure. Our embeddings
capture entity similarity and relatedness better than prior work, both in
existing biomedical datasets and a new Wikipedia-based dataset that we release
to the community. Results on analogy completion and entity sense disambiguation
indicate that entities and words capture complementary information that can be
effectively combined for downstream use.
Comment: 12 pages; Accepted to 3rd Workshop on Representation Learning for NLP
(Repl4NLP 2018). Code at https://github.com/OSU-slatelab/JE
Neural Natural Language Processing for Long Texts: A Survey of the State-of-the-Art
The adoption of Deep Neural Networks (DNNs) has greatly benefited Natural
Language Processing (NLP) during the past decade. However, the demands of long
document analysis are quite different from those of shorter texts, while the
ever-increasing size of documents uploaded online makes automated
understanding of long texts a critical area of research. This article has two
goals: a) it overviews the relevant neural building blocks, thus serving as a
short tutorial, and b) it surveys the state-of-the-art in long document NLP,
mainly focusing on two central tasks: document classification and document
summarization. Sentiment analysis for long texts is also covered, since it is
typically treated as a particular case of document classification.
Additionally, this article discusses the main challenges, issues and current
solutions related to long document NLP. Finally, the relevant, publicly
available, annotated datasets are presented, in order to facilitate further
research.
Comment: 53 pages, 2 figures, 171 citations
Identifying Semantic Divergences in Parallel Text without Annotations
Recognizing that even correct translations are not always semantically
equivalent, we automatically detect meaning divergences in parallel sentence
pairs with a deep neural model of bilingual semantic similarity that can be
trained on any parallel corpus without manual annotation. We show that our
semantic model detects divergences more accurately than models based on surface
features derived from word alignments, and that these divergences matter for
neural machine translation.
Comment: Accepted as a full paper to NAACL 201
Bilingual dictionary generation and enrichment via graph exploration
In recent years, we have witnessed a steady growth of linguistic information represented and exposed as linked data on the Web. Such linguistic linked data have stimulated the development and use of openly available linguistic knowledge graphs, as is the case with Apertium RDF, a collection of interconnected bilingual dictionaries represented and accessible through Semantic Web standards. In this work, we explore techniques that exploit the graph nature of bilingual dictionaries to automatically infer new links (translations). We build upon a cycle-density-based method by partitioning the graph into biconnected components for a speed-up and by simplifying the pipeline through a careful structural analysis that reduces hyperparameter tuning requirements. We also analyse the shortcomings of traditional evaluation metrics used for translation inference and propose to complement them with new ones, both-word precision (BWP) and both-word recall (BWR), aimed at being more informative of algorithmic improvements. Over twenty-seven language pairs, our algorithm produces dictionaries about 70% of the size of existing Apertium RDF dictionaries at a high BWP of 85% from scratch within a minute. Human evaluation shows that 78% of the additional translations generated for dictionary enrichment are correct as well. We further describe an interesting use case: inferring synonyms within a single language, on which our initial human-based evaluation shows an average accuracy of 84%. We release our tool as free/open-source software, which can not only be applied to RDF data and Apertium dictionaries but is also easily usable for other formats and communities.
This work was partially funded by the Prêt-à-LLOD project within the European Union’s Horizon 2020 research and innovation programme under grant agreement no. 825182. This article is also based upon work from COST Action CA18209 NexusLinguarum, “European network for Web-centred linguistic data science”, supported by COST (European Cooperation in Science and Technology). It has also been partially supported by the Spanish projects TIN2016-78011-C4-3-R and PID2020-113903RB-I00 (AEI/FEDER, UE), by DGA/FEDER, and by the Agencia Estatal de Investigación of the Spanish Ministry of Economy and Competitiveness and the European Social Fund through the “Ramón y Cajal” program (RYC2019-028112-I).