Monte Carlo Methods for Top-k Personalized PageRank Lists and Name Disambiguation
We study the problem of quickly detecting top-k Personalized PageRank lists.
This problem has a number of important applications such as finding local cuts
in large graphs, estimation of similarity distance and name disambiguation. In
particular, we apply our results to construct efficient algorithms for the
person name disambiguation problem. We argue that when finding top-k
Personalized PageRank lists, two observations are important. Firstly, it is
crucial that we quickly detect the top-k most important neighbours of a node,
while the exact order in the top-k list as well as the exact values of PageRank
are far less crucial. Secondly, a small number of wrong elements in top-k
lists does not really degrade the quality of the lists, but tolerating them can
lead to significant computational savings. Based on these two key observations, we
propose Monte Carlo methods for fast detection of top-k Personalized PageRank
lists. We provide performance evaluation of the proposed methods and supply
stopping criteria. Then, we apply the methods to the person name disambiguation
problem. The algorithm developed for the person name disambiguation problem
achieved second place in the WePS 2010 competition.
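As a rough illustration of the idea in the abstract above, a top-k Personalized PageRank list can be approximated by running many short random walks from the seed node and counting where they stop. The sketch below is a generic endpoint-counting Monte Carlo estimator, not the authors' algorithm: the function name `top_k_ppr`, the parameters `num_walks`, `damping`, and `rng_seed`, the handling of dangling nodes, and the toy graph are all assumptions made for illustration, and no stopping criterion is implemented.

```python
import random
from collections import Counter


def top_k_ppr(graph, seed, k, num_walks=10000, damping=0.85, rng_seed=0):
    """Approximate the top-k Personalized PageRank list of `seed`.

    `graph` maps each node to a list of its out-neighbours (assumed format).
    Each walk starts at `seed`, continues with probability `damping`, and
    the node where it stops is counted. The empirical stop frequencies
    estimate the Personalized PageRank vector of `seed`.
    """
    rng = random.Random(rng_seed)
    end_counts = Counter()
    for _ in range(num_walks):
        node = seed
        # Continue the walk with probability `damping`, otherwise stop here.
        while rng.random() < damping:
            neighbours = graph.get(node)
            if not neighbours:
                # Assumption: a walk at a dangling node jumps back to the seed.
                node = seed
                continue
            node = rng.choice(neighbours)
        end_counts[node] += 1
    # Only membership in the list matters here, not exact PageRank values.
    return [node for node, _ in end_counts.most_common(k)]


# Hypothetical toy graph: "c" points into the a-b cycle but is never reached.
toy_graph = {"a": ["b"], "b": ["a"], "c": ["a"]}
print(top_k_ppr(toy_graph, "a", k=2))
```

Because only the membership of the top-k list is needed, a comparatively small `num_walks` may already suffice in practice; how small depends on the graph and is exactly what a stopping criterion would decide.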
Name Disambiguation from link data in a collaboration graph using temporal and topological features
In a social community, multiple persons may share the same name, phone number
or some other identifying attributes. This, along with other phenomena, such as
name abbreviation, name misspelling, and human error, leads to erroneous
aggregation of records of multiple persons under a single reference. Such
mistakes affect the performance of document retrieval, web search, and database
integration and, more importantly, lead to improper attribution of credit (or blame).
The task of entity disambiguation partitions the records belonging to multiple
persons with the objective that each decomposed partition is composed of
records of a unique person. Existing solutions to this task use either
biographical attributes, or auxiliary features that are collected from external
sources, such as Wikipedia. However, for many scenarios, such auxiliary
features are not available, or they are costly to obtain. Besides, attempting
to collect biographical or external data carries the risk of privacy
violation. In this work, we propose a method for solving the entity
disambiguation task using link information obtained from a collaboration
network. Our method preserves privacy, as it uses only the time-stamped graph topology of an
anonymized network. Experimental results on two real-life academic
collaboration networks show that the proposed method has satisfactory
performance.
Comment: The short version of this paper has been accepted to ASONAM 201
MAG: A Multilingual, Knowledge-base Agnostic and Deterministic Entity Linking Approach
Entity linking has recently been the subject of a significant body of
research. Currently, the best performing approaches rely on trained
mono-lingual models. Porting these approaches to other languages is
consequently a difficult endeavor as it requires corresponding training data
and retraining of the models. We address this drawback by presenting a novel
multilingual, knowledge-base agnostic and deterministic approach to entity
linking, dubbed MAG. MAG is based on a combination of context-based retrieval
on structured knowledge bases and graph algorithms. We evaluate MAG on 23 data
sets and in 7 languages. Our results show that the best approach trained on
English datasets (PBOH) achieves a micro F-measure that is up to 4 times worse
on datasets in other languages. MAG, on the other hand, achieves
state-of-the-art performance on English datasets and reaches a micro F-measure
that is up to 0.6 higher than that of PBOH on non-English languages.
Comment: Accepted in K-CAP 2017: Knowledge Capture Conference
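For readers unfamiliar with the metric quoted above, the micro F-measure pools true positives, false positives, and false negatives over all documents before computing precision and recall. The sketch below follows that standard definition; the function name `micro_f_measure` and the example counts are hypothetical, and the abstract does not state the exact evaluation protocol used for MAG and PBOH.

```python
def micro_f_measure(counts):
    """Micro-averaged F-measure.

    `counts` is an iterable of (true_positives, false_positives,
    false_negatives) tuples, one per document (assumed input format).
    """
    counts = list(counts)
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    # Pool the counts first, then compute precision and recall once.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for two documents: pooled tp=9, fp=6, fn=6.
print(micro_f_measure([(8, 2, 2), (1, 4, 4)]))
```

Since the score lies between 0 and 1, a gap of 0.6 between two systems is a very large absolute difference.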
A Hybrid Approach to Domain-Specific Entity Linking
The current state-of-the-art Entity Linking (EL) systems are geared towards
corpora that are as heterogeneous as the Web, and therefore perform
sub-optimally on domain-specific corpora. A key open problem is how to
construct effective EL systems for specific domains, as knowledge of the local
context should in principle increase, rather than decrease, effectiveness. In
this paper we propose the hybrid use of simple specialist linkers in
combination with an existing generalist system to address this problem. Our
main findings are the following. First, we construct a new reusable benchmark
for EL on a corpus of domain-specific conversations. Second, we test the
performance of a range of approaches under the same conditions, and show that
specialist linkers obtain high precision in isolation, and high recall when
combined with generalist linkers. Hence, we can effectively exploit local
context and get the best of both worlds.
Comment: SEM'1
Identifying Geographic Clusters: A Network Analytic Approach
In recent years there has been a growing interest in the role of networks and
clusters in the global economy. Despite being a popular research topic in
economics, sociology and urban studies, geographical clustering of human
activity has often been studied by means of predetermined geographical units
such as administrative divisions and metropolitan areas. This approach is
intrinsically time invariant and it does not allow one to differentiate between
different activities. Our goal in this paper is to present a new methodology
for identifying clusters that can be applied to different empirical settings.
We use a graph approach based on k-shell decomposition to analyze world
biomedical research clusters based on PubMed scientific publications. We
identify research institutions and locate their activities in geographical
clusters. Leading areas of scientific production and their top performing
research institutions are consistently identified at different geographic
scales.
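The k-shell decomposition mentioned in the abstract above assigns each node the largest k for which it survives repeated removal of all nodes with fewer than k neighbours. The sketch below implements the standard peeling procedure on an undirected graph. It is not the authors' pipeline: the function name `k_shell_decomposition`, the adjacency format, and the toy graph are assumptions for illustration, and nothing here reflects how the PubMed network was built.

```python
def k_shell_decomposition(adjacency):
    """Return a dict mapping each node to its k-shell index.

    `adjacency` maps every node to the set of its neighbours and is
    assumed to be symmetric (undirected graph, no self-loops).
    """
    degree = {node: len(nbrs) for node, nbrs in adjacency.items()}
    remaining = set(adjacency)
    shell = {}
    k = 0
    while remaining:
        # Nodes whose remaining degree is at most k belong to shell k.
        to_peel = [node for node in remaining if degree[node] <= k]
        if not to_peel:
            k += 1  # nothing left to remove at this level, move outward
            continue
        while to_peel:
            node = to_peel.pop()
            if node not in remaining:
                continue  # already removed via a duplicate entry
            remaining.discard(node)
            shell[node] = k
            # Removing a node lowers its neighbours' degrees, which may
            # push them into the same shell.
            for nbr in adjacency[node]:
                if nbr in remaining:
                    degree[nbr] -= 1
                    if degree[nbr] <= k:
                        to_peel.append(nbr)
    return shell


# Hypothetical toy graph: a triangle a-b-c with a pendant node d on a.
toy = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
print(k_shell_decomposition(toy))
```

In this toy graph the triangle forms the innermost shell and the pendant node sits in an outer one, which mirrors how densely connected groups of institutions would be separated from peripheral ones.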