Improving Multilingual Named Entity Recognition with Wikipedia Entity Type Mapping
State-of-the-art named entity recognition (NER) systems are statistical
machine learning models that have strong generalization capability (i.e., can
recognize unseen entities that do not appear in training data) based on lexical
and contextual information. However, such a model could still make mistakes if
its features favor a wrong entity type. In this paper, we utilize Wikipedia as
an open knowledge base to improve multilingual NER systems. Central to our
approach is the construction of high-accuracy, high-coverage multilingual
Wikipedia entity type mappings. These mappings are built from weakly annotated
data and can be extended to new languages with no human annotation or
language-dependent knowledge involved. Based on these mappings, we develop
several approaches to improve an NER system. We evaluate the performance of the
approaches via experiments on NER systems trained for 6 languages. Experimental
results show that the proposed approaches are effective in improving the
accuracy of such systems on unseen entities, especially when a system is
applied to a new domain or it is trained with little training data (up to 18.3
F1 score improvement).
Comment: 11 pages, Conference on Empirical Methods in Natural Language Processing (EMNLP), 201
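One plausible use of such an entity type mapping (not necessarily the paper's exact method) is to override a model's predicted type when the Wikipedia-derived mapping disagrees. The mapping entries and the override rule below are hypothetical simplifications:

```python
# Hypothetical mapping built offline from weakly annotated Wikipedia data,
# keyed by lowercased surface form. All entries here are illustrative.
ENTITY_TYPE_MAP = {
    "danube": "LOCATION",
    "siemens": "ORGANIZATION",
    "marie curie": "PERSON",
}

def correct_ner(tokens_with_tags):
    """Override a model's entity type wherever the Wikipedia mapping
    has an entry for the surface form; otherwise keep the prediction."""
    corrected = []
    for surface, predicted_type in tokens_with_tags:
        mapped = ENTITY_TYPE_MAP.get(surface.lower())
        corrected.append((surface, mapped if mapped else predicted_type))
    return corrected
```

A real system would apply such corrections at the entity-mention level with confidence thresholds rather than unconditionally per token.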
Entity Linking for Queries by Searching Wikipedia Sentences
We present a simple yet effective approach for linking entities in queries.
The key idea is to search sentences similar to a query from Wikipedia articles
and directly use the human-annotated entities in the similar sentences as
candidate entities for the query. Then, we employ a rich set of features, such
as link-probability, context-matching, word embeddings, and relatedness among
candidate entities as well as their related entities, to rank the candidates
under a regression based framework. The advantages of our approach lie in two
aspects, which contribute to the ranking process and final linking result.
First, it can greatly reduce the number of candidate entities by filtering out
irrelevant entities with the words in the query. Second, we can obtain the
query sensitive prior probability in addition to the static link-probability
derived from all Wikipedia articles. We conduct experiments on two benchmark
datasets on entity linking for queries, namely the ERD14 dataset and the GERDAQ
dataset. Experimental results show that our method outperforms state-of-the-art
systems and yields 75.0% in F1 on the ERD14 dataset and 56.9% on the GERDAQ
dataset.
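The candidate-generation step described above can be sketched as follows, with word overlap standing in for the actual sentence-similarity search and a toy two-sentence corpus in place of Wikipedia:

```python
# Toy stand-in for Wikipedia: (sentence, human-annotated entity links).
WIKI_SENTENCES = [
    ("apple designed the iphone in california", {"Apple_Inc.", "IPhone"}),
    ("an apple a day keeps the doctor away", {"Apple"}),
]

def candidate_entities(query):
    """Return the annotated entities of the sentence most similar to the
    query, measured here by simple word overlap."""
    q = set(query.lower().split())
    best = max(WIKI_SENTENCES, key=lambda s: len(q & set(s[0].split())))
    return best[1]
```

The full system then ranks these candidates with link-probability, context-matching, and embedding features; only the retrieval-based candidate filtering is illustrated here.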
Distributed Entity Disambiguation with Per-Mention Learning
Entity disambiguation, or mapping a phrase to its canonical representation in
a knowledge base, is a fundamental step in many natural language processing
applications. Existing techniques based on global ranking models fail to
capture the individual peculiarities of words and hence either fall short of
the accuracy requirements of many real-world applications or are too complex
to satisfy their real-time constraints.
In this paper, we propose a new disambiguation system that learns specialized
features and models for disambiguating each ambiguous phrase in the English
language. To train and validate the hundreds of thousands of learning models
for this purpose, we use a Wikipedia hyperlink dataset with more than 170
million labelled annotations. We provide an extensive experimental evaluation
to show that the accuracy of our approach compares favourably with respect to
many state-of-the-art disambiguation systems. The training required for our
approach can be easily distributed over a cluster. Furthermore, updating our
system for new entities or calibrating it for special ones is a computationally
fast process that does not affect the disambiguation of the other entities.
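The per-mention idea can be sketched with a deliberately tiny stand-in model: group labelled hyperlink examples by their surface form, then train one specialised model (here just context-word/entity counts) per ambiguous mention. All data below is hypothetical:

```python
from collections import defaultdict, Counter

def train_per_mention(examples):
    """examples: (mention, context_words, gold_entity) triples, e.g. drawn
    from Wikipedia hyperlink anchors. One model per mention surface form."""
    models = defaultdict(lambda: defaultdict(Counter))
    for mention, context, entity in examples:
        for w in context:
            models[mention][w][entity] += 1
    return models

def disambiguate(models, mention, context):
    """Vote for an entity using the mention's own specialised model."""
    votes = Counter()
    for w in context:
        votes.update(models[mention].get(w, {}))
    return votes.most_common(1)[0][0] if votes else None
```

Because each mention's model is independent, training distributes trivially across a cluster, and adding or recalibrating one entity touches only the models of the mentions that can refer to it.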
Massively Increasing TIMEX3 Resources: A Transduction Approach
Automatic annotation of temporal expressions is a research challenge of great
interest in the field of information extraction. Gold standard
temporally-annotated resources are limited in size, which makes research using
them difficult. Standards have also evolved over the past decade, so not all
temporally annotated data is in the same format. We vastly increase available
human-annotated temporal expression resources by converting older format
resources to TimeML/TIMEX3. This task is difficult due to differing annotation
methods. We present a robust conversion tool and a new, large temporal
expression resource. Using this, we evaluate our conversion process by using it
as training data for an existing TimeML annotation tool, achieving a 0.87 F1
measure -- better than any system in the TempEval-2 timex recognition exercise.
Comment: Proc. LREC (2012)
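The core transduction step (older annotation format to TimeML/TIMEX3) can be illustrated with a much-simplified tag rewrite; a real converter must map many more attributes and normalise values, which is where the difficulty described above lies:

```python
import re

def timex2_to_timex3(text):
    """Rewrite a simplified TIMEX2 annotation as a TimeML TIMEX3 tag.
    Only the VAL attribute is handled, and the type is fixed to DATE;
    the real conversion covers far more attributes and edge cases."""
    pattern = re.compile(r'<TIMEX2 VAL="([^"]+)">([^<]*)</TIMEX2>')
    return pattern.sub(r'<TIMEX3 type="DATE" value="\1">\2</TIMEX3>', text)
```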
Word-Entity Duet Representations for Document Ranking
This paper presents a word-entity duet framework for utilizing knowledge
bases in ad-hoc retrieval. In this work, the query and documents are modeled by
word-based representations and entity-based representations. Ranking features
are generated by the interactions between the two representations,
incorporating information from the word space, the entity space, and the
cross-space connections through the knowledge graph. To handle the
uncertainties from the automatically constructed entity representations, an
attention-based ranking model AttR-Duet is developed. With back-propagation
from ranking labels, the model learns simultaneously how to demote noisy
entities and how to rank documents with the word-entity duet. Evaluation
results on TREC Web Track ad-hoc task demonstrate that all of the four-way
interactions in the duet are useful, the attention mechanism successfully
steers the model away from noisy entities, and together they significantly
outperform both word-based and entity-based learning-to-rank systems.
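The "four-way interactions" of the duet can be made concrete with a toy feature extractor, using plain overlap counts as a stand-in for the learned matching features (query/document word and entity sets are hypothetical):

```python
def duet_features(q_words, q_entities, d_words, d_entities):
    """Interaction features between word-based and entity-based
    representations of a query and a document: word-word, word-entity,
    entity-word, and entity-entity. Overlap counts stand in for the
    richer matching signals used by the actual model."""
    return {
        "qw_dw": len(q_words & d_words),        # word space
        "qw_de": len(q_words & d_entities),     # cross-space
        "qe_dw": len(q_entities & d_words),     # cross-space
        "qe_de": len(q_entities & d_entities),  # entity space
    }
```

In AttR-Duet these interactions feed an attention-based ranker that learns to demote noisy automatically linked entities; here only the feature layout is shown.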
Information Extraction from Scientific Literature for Method Recommendation
As a research community grows, more and more papers are published each year.
As a result, there is increasing demand for improved methods for finding
relevant papers, automatically understanding their key ideas, and recommending
potential methods for a target problem. Despite advances in search engines, it
is still hard to identify new technologies according to a researcher's need.
Due to the large variety of domains and extremely limited annotated resources,
there has been relatively little work on leveraging natural language processing
in scientific recommendation. In this proposal, we aim at making scientific
recommendations by extracting scientific terms from a large collection of
scientific papers and organizing the terms into a knowledge graph. In
preliminary work, we trained a scientific term extractor using a small amount
of annotated data and obtained state-of-the-art performance by leveraging a
large number of unannotated papers through multiple semi-supervised
approaches. We propose to construct a knowledge graph in a way that can make
minimal use of hand annotated data, using only the extracted terms,
unsupervised relational signals such as co-occurrence, and structural external
resources such as Wikipedia. Latent relations between scientific terms can be
learned from the graph. Recommendations will be made through graph inference
for both observed and unobserved relational pairs.
Comment: Thesis Proposal. arXiv admin note: text overlap with arXiv:1708.0607
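The unsupervised co-occurrence signal mentioned above can be sketched as a term graph whose edge weights count how many papers mention both terms; the term lists below are illustrative:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_graph(papers_terms):
    """Build a weighted co-occurrence graph over extracted scientific
    terms: edge weight = number of papers in which both terms appear.
    Edges are stored with sorted endpoints so (a, b) == (b, a)."""
    edges = Counter()
    for terms in papers_terms:
        for a, b in combinations(sorted(set(terms)), 2):
            edges[(a, b)] += 1
    return edges
```

Latent relations between terms would then be inferred over this graph, combined with structural resources such as Wikipedia.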
Entity Query Feature Expansion Using Knowledge Base Links
Recent advances in automatic entity linking and knowledge base
construction have resulted in entity annotations for document and
query collections -- for example, annotations of entities from large,
general-purpose knowledge bases such as Freebase and the Google
Knowledge Graph. Understanding how to leverage these entity
annotations of text to improve ad hoc document retrieval is an open
research area. Query expansion is a commonly used technique to
improve retrieval effectiveness. Most previous query expansion
approaches focus on text, mainly using unigram concepts. In this
paper, we propose a new technique, called entity query feature
expansion (EQFE) which enriches the query with features from
entities and their links to knowledge bases, including structured
attributes and text. We experiment using both explicit query entity
annotations and latent entities. We evaluate our technique on TREC
text collections automatically annotated with knowledge base entity
links, including the Google Freebase Annotations (FACC1) data.
We find that entity-based feature expansion results in significant
improvements in retrieval effectiveness over state-of-the-art text
expansion approaches.
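An EQFE-style expansion can be sketched as follows: given entities linked to the query, enrich the query representation with their names and structured attributes from the knowledge base. The KB rows and attribute names here are hypothetical, not the paper's actual feature set:

```python
# Hypothetical knowledge base rows, keyed by entity id.
KB = {
    "Hawaii": {"names": ["Hawaii", "Aloha State"],
               "types": ["location", "us_state"]},
}

def expand_query(query_terms, linked_entities):
    """Enrich a query with names and structured attributes of the
    entities linked to it (explicit or latent annotations)."""
    expanded = list(query_terms)
    for ent in linked_entities:
        row = KB.get(ent, {})
        expanded += [n.lower() for n in row.get("names", [])]
        expanded += row.get("types", [])
    return expanded
```

The expanded terms would then be used as additional features in the retrieval model rather than as a literal bag-of-words query.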
Towards a Knowledge Graph based Speech Interface
Applications which use human speech as an input require a speech interface
with high recognition accuracy. The words or phrases in the recognised text are
annotated with a machine-understandable meaning and linked to knowledge graphs
for further processing by the target application. These semantic annotations of
recognised words can be represented as subject-predicate-object triples which
collectively form a graph, often referred to as a knowledge graph. This type
of knowledge representation makes it possible to use speech interfaces with
any spoken-input application: since the information is represented in a
logical, semantic form, it can be retrieved and stored using standard web
query languages. In this work, we develop a methodology for linking speech
input to knowledge graphs and study the impact of recognition errors on the
overall process. We show that for a corpus with a lower WER, considerably more
entities are annotated and linked to the DBpedia knowledge graph. DBpedia
Spotlight, a tool that interlinks text documents with linked open data, is
used to link the speech recognition output to the DBpedia knowledge graph. Such a
knowledge-based speech recognition interface is useful for applications such as
question answering or spoken dialog systems.
Comment: Under Review in International Workshop on Grounding Language Understanding, Satellite of Interspeech 201
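Linking recognised speech to DBpedia via Spotlight can be sketched as a call to its public annotation endpoint. The endpoint URL below is the commonly documented public instance; verify it (and rate limits) before relying on it in an application:

```python
import json
import urllib.parse
import urllib.request

def spotlight_request(text, confidence=0.5,
                      endpoint="https://api.dbpedia-spotlight.org/en/annotate"):
    """Build a DBpedia Spotlight annotation request for (possibly noisy)
    speech recognition output; the response is JSON when requested."""
    params = urllib.parse.urlencode({"text": text, "confidence": confidence})
    return urllib.request.Request(endpoint + "?" + params,
                                  headers={"Accept": "application/json"})

# Example usage (performs a network call, so commented out):
# resp = urllib.request.urlopen(spotlight_request("berlin is the capital of germany"))
# entities = [a["@URI"] for a in json.loads(resp.read())["Resources"]]
```

Recognition errors in the ASR output propagate directly into this step, which is exactly the effect the work above measures.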
Named Entity Disambiguation for Noisy Text
We address the task of Named Entity Disambiguation (NED) for noisy text. We
present WikilinksNED, a large-scale NED dataset of text fragments from the web,
which is significantly noisier and more challenging than existing news-based
datasets. To capture the limited and noisy local context surrounding each
mention, we design a neural model and train it with a novel method for sampling
informative negative examples. We also describe a new way of initializing word
and entity embeddings that significantly improves performance. Our model
significantly outperforms existing state-of-the-art methods on WikilinksNED
while achieving comparable performance on a smaller newswire dataset.
Comment: Accepted to CoNLL 201
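One common notion of "informative" negatives for NED, which the sketch below adopts (it is a simplification, not the paper's exact sampling method), is other entities that share the mention's surface form. The candidate table is hypothetical:

```python
import random

# Hypothetical candidate table: surface form -> entities it may refer to.
CANDIDATES = {
    "paris": ["Paris", "Paris,_Texas", "Paris_Hilton"],
}

def sample_negatives(mention, gold_entity, k=2, seed=0):
    """Sample up to k negatives that share the mention's surface form
    but are not the gold link; such confusable entities make harder
    training examples than random entities would."""
    pool = [e for e in CANDIDATES.get(mention, []) if e != gold_entity]
    rng = random.Random(seed)
    return rng.sample(pool, min(k, len(pool)))
```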
Boosting Question Answering by Deep Entity Recognition
In this paper an open-domain factoid question answering system for Polish,
RAFAEL, is presented. The system goes beyond finding an answering sentence; it
also extracts a single string, corresponding to the required entity. Herein the
focus is placed on different approaches to entity recognition, essential for
retrieving information matching question constraints. Apart from the
traditional approach based on named entity recognition (NER) solutions, a
novel technique, called Deep Entity Recognition (DeepER), is introduced and
implemented. It allows a comprehensive search of all forms of entity references
matching a given WordNet synset (e.g. an impressionist), based on a previously
assembled entity library. It has been created by analysing the first sentences
of encyclopaedia entries and disambiguation and redirect pages. DeepER also
provides automatic evaluation, which makes numerous experiments possible,
including over a thousand questions from a quiz TV show answered on the
grounds of Polish Wikipedia. The final results of a manual evaluation on a
separate question set show that the strength of the DeepER approach lies in
its ability to answer questions that demand answers beyond the traditional
categories of named entities.
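The DeepER-style lookup can be sketched with a toy entity library mapping each entity to its name forms and the WordNet-style categories extracted from the first sentence of its encyclopaedia entry; all entries below are hypothetical:

```python
# Hypothetical entity library: entity id -> (name forms, synset labels),
# as might be assembled from first sentences of encyclopaedia entries.
ENTITY_LIBRARY = {
    "Claude_Monet": (["Claude Monet", "Monet"], {"impressionist", "painter"}),
    "Frederic_Chopin": (["Frederic Chopin", "Chopin"], {"composer"}),
}

def entities_for_synset(synset):
    """All entity reference forms whose library entry matches the
    category demanded by the question (e.g. 'an impressionist')."""
    return {form
            for names, synsets in ENTITY_LIBRARY.values()
            if synset in synsets
            for form in names}
```

This is what lets the system answer questions whose expected answer type falls outside the traditional NER categories.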