Ranking Archived Documents for Structured Queries on Semantic Layers
Archived collections of documents (like newspaper and web archives) serve as
important information sources in a variety of disciplines, including Digital
Humanities, Historical Science, and Journalism. However, the absence of
efficient and meaningful exploration methods remains a major hurdle to
turning them into usable sources of information. A semantic layer is
an RDF graph that describes metadata and semantic information about a
collection of archived documents, which in turn can be queried through a
semantic query language (SPARQL). This allows running advanced queries by
combining metadata of the documents (like publication date) and content-based
semantic information (like entities mentioned in the documents). However, the
results returned by such structured queries can be numerous and, moreover, they
all match the query equally. In this paper, we address this problem and
formalize the task of "ranking archived documents for structured queries on
semantic layers". Then, we propose two ranking models for the problem at hand
which jointly consider: i) the relativeness of documents to entities, ii) the
timeliness of documents, and iii) the temporal relations among the entities.
The experimental results on a new evaluation dataset show the effectiveness of
the proposed models and allow us to understand their limitations.
A Trio Neural Model for Dynamic Entity Relatedness Ranking
Measuring entity relatedness is a fundamental task for many natural language
processing and information retrieval applications. Prior work often studies
entity relatedness in static settings and in an unsupervised manner. However,
entities in the real world are often involved in many different relationships;
consequently, entity relations are very dynamic over time. In this work, we
propose a neural network-based approach for dynamic entity relatedness,
leveraging the collective attention as supervision. Our model is capable of
learning rich and different entity representations in a joint framework.
Through extensive experiments on large-scale datasets, we demonstrate that our
method achieves better results than competitive baselines.
Comment: In Proceedings of CoNLL 201
NASARI: a novel approach to a Semantically-Aware Representation of items
The semantic representation of individual word senses and concepts is of fundamental importance to several applications in Natural Language Processing. To date, concept modeling techniques have in the main based their representation either on lexicographic resources, such as WordNet, or on encyclopedic resources, such as Wikipedia. We propose a vector representation technique that combines the complementary knowledge of both these types of resource. Thanks to its use of explicit semantics combined with a novel cluster-based dimensionality reduction and an effective weighting scheme, our representation attains state-of-the-art performance on multiple datasets in two standard benchmarks: word similarity and sense clustering. We are releasing our vector representations at http://lcl.uniroma1.it/nasari/
Towards Deep Semantic Analysis Of Hashtags
Hashtags are semantico-syntactic constructs used across various social
networking and microblogging platforms to enable users to start a
topic-specific discussion or classify a post into a desired category. Segmenting and
linking the entities present within the hashtags could therefore help in better
understanding and extraction of information shared across social media.
However, due to the lack of space delimiters in hashtags (e.g., #nsavssnowden),
the segmentation of hashtags into constituent entities ("NSA" and "Edward
Snowden" in this case) is not a trivial task. Most of the current
state-of-the-art social media analytics systems like Sentiment Analysis and
Entity Linking tend to either ignore hashtags, or treat them as a single word.
In this paper, we present a context-aware approach to segment and link entities
in the hashtags to a knowledge base (KB) entry, based on the context within the
tweet. Our approach segments and links the entities in hashtags such that the
coherence between hashtag semantics and the tweet is maximized. To the best of
our knowledge, no existing study addresses the issue of linking entities in
hashtags for extracting semantic information. We evaluate our method on two
different datasets, and demonstrate the effectiveness of our technique in
improving the overall entity linking in tweets via additional semantic
information provided by segmenting and linking entities in a hashtag.
Comment: To Appear in 37th European Conference on Information Retrieval
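The segmentation difficulty described in this abstract can be illustrated with a minimal sketch. It is not the paper's context-aware method: it assumes a small hand-written vocabulary (the hypothetical `VOCAB` set) and uses plain dictionary-based dynamic programming that prefers the split with the fewest words. The function name `segment_hashtag` is also an assumption made for illustration, and no entity linking is attempted.

```python
from typing import List, Optional, Set


def segment_hashtag(tag: str, vocabulary: Set[str]) -> Optional[List[str]]:
    """Split a hashtag into vocabulary words, preferring the fewest words.

    Returns None when the hashtag cannot be fully covered by the vocabulary.
    """
    text = tag.lstrip("#").lower()
    n = len(text)
    # best[i] holds the shortest known segmentation of text[:i], or None.
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            prefix = best[start]
            word = text[start:end]
            if prefix is not None and word in vocabulary:
                candidate = prefix + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


# Hypothetical toy vocabulary; a real system would need a large lexicon
# or a knowledge base of entity names.
VOCAB = {"nsa", "vs", "snowden", "snow", "den"}

segments = segment_hashtag("#nsavssnowden", VOCAB)
print(segments)  # ['nsa', 'vs', 'snowden']
```

Preferring fewer words is what keeps "snowden" whole instead of splitting it into "snow" and "den". A purely lexical rule like this cannot decide between equally short splits, which is where the tweet context the abstract mentions would come in.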
Do Multi-Sense Embeddings Improve Natural Language Understanding?
Learning a distinct representation for each sense of an ambiguous word could
lead to more powerful and fine-grained models of vector-space representations.
Yet while 'multi-sense' methods have been proposed and tested on artificial
word-similarity tasks, we don't know if they improve real natural language
understanding tasks. In this paper we introduce a multi-sense embedding model
based on Chinese Restaurant Processes that achieves state-of-the-art
performance on matching human word similarity judgments, and propose a
pipelined architecture for incorporating multi-sense embeddings into language
understanding.
We then test the performance of our model on part-of-speech tagging, named
entity recognition, sentiment analysis, semantic relation identification and
semantic relatedness, controlling for embedding dimensionality. We find that
multi-sense embeddings do improve performance on some tasks (part-of-speech
tagging, semantic relation identification, semantic relatedness) but not on
others (named entity recognition, various forms of sentiment analysis). We
discuss how these differences may be caused by the different role of word sense
information in each of the tasks. The results highlight the importance of
testing embedding models in real applications.
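For readers unfamiliar with the Chinese Restaurant Process mentioned in this abstract, the following is a minimal sketch of the standard CRP prior only. It assumes the textbook formulation, in which a new item joins an existing cluster with probability proportional to that cluster's size and opens a new cluster with probability proportional to a concentration parameter `alpha`. The function names are hypothetical, and the paper's actual model, which presumably also conditions on word context, is not reproduced here.

```python
import random
from typing import List


def crp_table_probabilities(table_sizes: List[int], alpha: float) -> List[float]:
    """Return seating probabilities for each existing table plus a new one.

    The last entry is the probability of opening a new table.
    Assumes alpha > 0.
    """
    total = sum(table_sizes) + alpha
    probs = [size / total for size in table_sizes]
    probs.append(alpha / total)
    return probs


def crp_assign(num_customers: int, alpha: float, rng: random.Random) -> List[int]:
    """Seat customers one at a time and return each customer's table index."""
    table_sizes: List[int] = []
    assignments: List[int] = []
    for _ in range(num_customers):
        probs = crp_table_probabilities(table_sizes, alpha)
        table = rng.choices(range(len(probs)), weights=probs)[0]
        if table == len(table_sizes):
            table_sizes.append(1)  # open a new table (a new "sense")
        else:
            table_sizes[table] += 1
        assignments.append(table)
    return assignments


# With tables of size 3 and 1 and alpha = 1, the denominators are all 5.
example_probs = crp_table_probabilities([3, 1], alpha=1.0)
sample = crp_assign(20, alpha=1.0, rng=random.Random(0))
```

In a multi-sense setting, each table would loosely correspond to one sense of a word, so the number of senses is not fixed in advance but grows as the data requires.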
Distantly Labeling Data for Large Scale Cross-Document Coreference
Cross-document coreference, the problem of resolving entity mentions across
multi-document collections, is crucial to automated knowledge base construction
and data mining tasks. However, the scarcity of large labeled data sets has
hindered supervised machine learning research for this task. In this paper we
develop and demonstrate an approach based on "distantly-labeling" a data set
from which we can train a discriminative cross-document coreference model. In
particular we build a dataset of more than a million people mentions extracted
from 3.5 years of New York Times articles, leverage Wikipedia for distant
labeling with a generative model (and measure the reliability of such
labeling); then we train and evaluate a conditional random field coreference
model that has factors on cross-document entities as well as mention-pairs.
This coreference model obtains high accuracy in resolving mentions and entities
that are not present in the training data, indicating applicability to
non-Wikipedia data. Given the large amount of data, our work is also an
exercise demonstrating the scalability of our approach.
Comment: 16 pages, submitted to ECML 201