Query-Driven Sampling for Collective Entity Resolution
Probabilistic databases play a preeminent role in the processing and
management of uncertain data. Recently, many database research efforts have
integrated probabilistic models into databases to support tasks such as
information extraction and labeling. Many of these efforts rely on batch-oriented
inference, which inhibits a real-time workflow. One important task is
entity resolution (ER). ER is the process of determining records (mentions) in
a database that correspond to the same real-world entity. Traditional pairwise
ER methods can lead to inconsistencies and low accuracy due to localized
decisions. Leading ER systems solve this problem by collectively resolving all
records using a probabilistic graphical model and Markov chain Monte Carlo
(MCMC) inference. However, for large datasets this is an extremely expensive
process. One key observation is that such an exhaustive ER process incurs a huge
up-front cost, which is wasteful in practice because most users are interested
in only a small subset of entities. In this paper, we advocate pay-as-you-go
entity resolution by developing a number of query-driven collective ER
techniques. We introduce two classes of SQL queries that involve ER operators
--- selection-driven ER and join-driven ER. We implement novel variations of
the MCMC Metropolis-Hastings algorithm to generate biased samples and
selectivity-based scheduling algorithms to support the two classes of ER
queries. Finally, we show that query-driven ER algorithms can converge and
return results within minutes over a database populated with the extraction
from a newswire dataset containing 71 million mentions.
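To make the sampling step concrete, here is a minimal sketch of Metropolis-Hastings over mention-to-entity assignments. This is a plain (not query-biased) sampler, and the mentions, trigram similarity, and scoring weights are all invented for illustration; they stand in for the far richer factor graphs such systems actually use:

```python
import math
import random

random.seed(0)

# Toy mentions; similar strings should resolve to the same entity.
mentions = ["john smith", "jon smith", "mary jones", "m. jones", "john smyth"]

def grams(s):
    """Character trigrams of a string."""
    return {s[i:i + 3] for i in range(len(s) - 2)}

def sim(a, b):
    """Jaccard similarity over character trigrams (a stand-in for real factors)."""
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb)

def log_score(assign):
    """Toy pairwise model: reward similar pairs placed in the same entity,
    penalise dissimilar ones."""
    s = 0.0
    for i in range(len(mentions)):
        for j in range(i + 1, len(mentions)):
            if assign[i] == assign[j]:
                s += 4.0 * (sim(mentions[i], mentions[j]) - 0.2)
    return s

# Start with every mention as its own entity (a singleton partition).
assign = list(range(len(mentions)))
best = assign[:]

for _ in range(2000):
    i = random.randrange(len(mentions))
    proposal = assign[:]
    proposal[i] = random.randrange(len(mentions))  # move mention i to another entity
    # Metropolis acceptance: the proposal is symmetric, so only the score
    # ratio (a difference in log space) matters.
    if math.log(random.random()) < log_score(proposal) - log_score(assign):
        assign = proposal
    if log_score(assign) > log_score(best):
        best = assign[:]

clusters = {}
for m, e in zip(mentions, best):
    clusters.setdefault(e, []).append(m)
print(clusters)
```

The query-driven techniques in the paper bias which mentions get proposal moves so that entities relevant to the query converge first; the sketch above spends proposals uniformly.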
Linking named entities to Wikipedia
Natural language is fraught with problems of ambiguity, including name reference. A name in text can refer to multiple entities, just as an entity can be known by different names. This thesis examines how a mention in text can be linked to an external knowledge base (KB), in our case Wikipedia. The named entity linking (NEL) task requires systems to identify the KB entry, or Wikipedia article, that a mention refers to, or, if the KB does not contain the correct entry, to return NIL. Entity linking systems can be complex, so we present a framework for analysing their different components, which we use to analyse three seminal systems evaluated on a common dataset; we show the importance of precise search for linking. The Text Analysis Conference (TAC) is a major venue for NEL research. We report on our submissions to the entity linking shared task in 2010, 2011 and 2012. The information required to disambiguate entities is often found in the text, close to the mention. We explore apposition, a common way for authors to provide information about entities. We model syntactic and semantic restrictions with a joint model that achieves state-of-the-art apposition extraction performance. We generalise from apposition to examine local descriptions specified close to the mention. We add local descriptions to our state-of-the-art linker by using patterns to extract them and matching against this restricted context. Not only does this make for a more precise match, we are also able to model failure to match. Local descriptions help disambiguate entities, further improving our state-of-the-art linker. The work in this thesis seeks to link textual entity mentions to knowledge bases. Linking is important for any task where external world knowledge is used, and resolving ambiguity is fundamental to advancing research into these problems.
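The NEL contract described above (return a KB entry, or NIL when the KB lacks one) can be sketched with an alias table and context-overlap disambiguation. The KB entries, aliases, and description tokens below are hypothetical and chosen purely for illustration:

```python
# Toy KB: entity -> description tokens used for disambiguation.
KB = {
    "Chris_Evans_(actor)": {"actor", "captain", "america", "film"},
    "Chris_Evans_(presenter)": {"presenter", "radio", "top", "gear"},
    "Cornwall": {"county", "england", "kernow"},
}

# Alias table: surface form -> candidate KB entries.
ALIASES = {
    "chris evans": ["Chris_Evans_(actor)", "Chris_Evans_(presenter)"],
    "cornwall": ["Cornwall"],
    "kernow": ["Cornwall"],
}

def link(mention, context_tokens):
    """Return the best-matching KB entry for a mention, or NIL if the
    KB has no candidate (the behaviour the NEL task requires)."""
    candidates = ALIASES.get(mention.lower())
    if not candidates:
        return "NIL"
    # Score each candidate by overlap between its description and the context,
    # a crude stand-in for the thesis's local-description matching.
    return max(candidates, key=lambda c: len(KB[c] & set(context_tokens)))

print(link("Chris Evans", ["he", "will", "host", "top", "gear"]))
# Chris_Evans_(presenter)
```

Real linkers replace the overlap score with learned models, but the search step (a precise candidate lookup before disambiguation) and the explicit NIL path mirror the structure analysed in the thesis.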
Efficient techniques for streaming cross document coreference resolution
Large text streams are commonplace; news organisations are constantly producing stories
and people are constantly writing social media posts. These streams should be
analysed in real-time so useful information can be extracted and acted upon instantly.
When natural disasters occur people want to be informed, when companies announce
new products financial institutions want to know and when celebrities do things their
legions of fans want to feel involved. In all these examples people care about getting
information in real-time (low latency).
These streams are massively varied, people’s interests are typically classified by the
entities they are interested in. Organising a stream by the entity being referred to would
help people extract the information useful to them. This is a difficult task: fans of ‘Captain
America’ films will not want to be incorrectly told that ‘Chris Evans’ (the main
actor) was appointed to host ‘Top Gear’ when it was a different ‘Chris Evans’. People
who use local idiosyncrasies such as referring to their home county (‘Cornwall’) as
‘Kernow’ (the Cornish for ‘Cornwall’ that has entered the local lexicon) should not be
forced to change their language when finding out information about their home.
This thesis addresses a core problem for real-time entity-specific NLP: streaming
cross-document coreference resolution (CDC), i.e. how to automatically identify all the
entities mentioned in a stream in real time.
This thesis addresses two significant problems for streaming CDC: there is no representative
dataset, and existing systems consume more resources over time. A new
technique to create datasets is introduced and applied to social media (Twitter)
to create a large (6M mentions) and challenging new CDC dataset that contains a far
more varied range of entities than typical newswire streams. Existing systems are not
able to keep up with large data streams. This problem is addressed with a streaming
CDC system that stores a constant-sized set of mentions. New techniques to maintain
the sample are introduced, significantly outperforming existing ones and maintaining 95%
of the performance of a non-streaming system while using only 20% of the memory.
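A constant-sized mention store can be sketched with classic reservoir sampling. The thesis's maintenance techniques are more sophisticated than a uniform sample, but the constant-memory guarantee works the same way; the class and mention names below are invented:

```python
import random

random.seed(1)

class MentionReservoir:
    """Constant-memory store of mentions from an unbounded stream
    (a uniform reservoir sample; a stand-in for smarter maintenance
    policies that weight which mentions to keep)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.sample = []
        self.seen = 0

    def add(self, mention):
        self.seen += 1
        if len(self.sample) < self.capacity:
            self.sample.append(mention)
        else:
            # Replace a stored mention with probability capacity / seen,
            # which keeps every stream item equally likely to be retained.
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.sample[j] = mention

res = MentionReservoir(capacity=100)
for i in range(10_000):
    res.add(f"mention-{i}")
print(len(res.sample), res.seen)
```

Because the sample never grows past its capacity, resolution cost per incoming mention stays bounded no matter how long the stream runs, which is exactly the property a streaming CDC system needs.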
Entity Linking and Discovery via Arborescence-based Supervised Clustering
Previous work has shown promising results in performing entity linking by
measuring not only the affinities between mentions and entities but also those
amongst mentions. In this paper, we present novel training and inference
procedures that fully utilize mention-to-mention affinities by building minimum
arborescences (i.e., directed spanning trees) over mentions and entities across
documents in order to make linking decisions. We also show that this method
gracefully extends to entity discovery, enabling the clustering of mentions
that do not have an associated entity in the knowledge base. We evaluate our
approach on the Zero-Shot Entity Linking dataset and MedMentions, the largest
publicly available biomedical dataset, and show significant improvements in
performance for both entity linking and discovery compared to identically
parameterized models. We further show significant efficiency improvements with
only a small loss in accuracy over previous work, which uses more
computationally expensive models.
CD²CR: Co-reference resolution across documents and domains
Cross-document co-reference resolution (CDCR) is the task of identifying and
linking mentions to entities and concepts across many text documents. Current
state-of-the-art models for this task assume that all documents are of the same
type (e.g. news articles) or fall under the same theme. However, it is also
desirable to perform CDCR across different domains (type or theme). A
particular use case we focus on in this paper is the resolution of entities
mentioned across scientific work and newspaper articles that discuss them.
Identifying the same entities and corresponding concepts in both scientific
articles and news can help scientists understand how their work is represented
in mainstream media. We propose a new task and English-language dataset for
cross-document cross-domain co-reference resolution (CD²CR). The task aims
to identify links between entities across heterogeneous document types. We show
that in this cross-domain, cross-document setting, existing CDCR models do not
perform well and we provide a baseline model that outperforms current
state-of-the-art CDCR models on CD²CR. Our dataset, annotation tool and
guidelines, as well as our model for cross-document cross-domain co-reference,
are all supplied as open-access, open-source resources.
Comment: 9 pages, 5 figures, accepted at EACL 202