AuthCrowd: Author Name Disambiguation and Entity Matching using Crowdsourcing
Despite decades of research and development in
named entity resolution, dealing with name ambiguity is still
a challenging issue for many bibliometric-enhanced information
retrieval (IR) tasks. As new bibliographic datasets are created as
a result of the upward growth of publication records worldwide,
more problems arise when considering the effects of errors
resulting from missing data fields, duplicate entities, misspellings,
extra characters, etc. As these concerns tend to be large-scale,
both the general consistency and the quality of electronic data are
largely affected. This paper presents an approach to handle these
name ambiguity problems through the use of crowdsourcing as a
complementary means to traditional unsupervised approaches.
To this end, we present “AuthCrowd”, a crowdsourcing system
with the ability to decompose named entity disambiguation and
entity matching tasks. Experimental results on a real-world
dataset of publicly available papers published in peer-reviewed
venues demonstrate the potential of our proposed approach for
improving author name disambiguation. The findings further
highlight the importance of adopting hybrid crowd-algorithm
collaboration strategies, especially for handling complexity and
quantifying bias when working with large amounts of data.
AliCG: Fine-grained and Evolvable Conceptual Graph Construction for Semantic Search at Alibaba
Conceptual graphs, which are a particular type of knowledge graph, play an
essential role in semantic search. Prior conceptual graph construction
approaches typically extract high-frequency, coarse-grained, and time-invariant
concepts from formal texts. In real applications, however, it is necessary to
extract less frequent, fine-grained, and time-varying conceptual knowledge and
build the taxonomy in an evolving manner. In this paper, we introduce an approach
to implementing and deploying the conceptual graph at Alibaba. Specifically, we
propose a framework called AliCG which is capable of a) extracting fine-grained
concepts by a novel bootstrapping with alignment consensus approach, b) mining
long-tail concepts with a novel low-resource phrase mining approach, and c)
updating the graph dynamically via a concept distribution estimation method
based on implicit and explicit user behaviors. We have deployed the framework
at Alibaba UC Browser. Extensive offline evaluation as well as online A/B
testing demonstrate the efficacy of our approach. Comment: Accepted by KDD 2021 (Applied Data Science Track)
LAGOS-AND: A Large Gold Standard Dataset for Scholarly Author Name Disambiguation
In this paper, we present a method to automatically build large labeled
datasets for the author ambiguity problem in the academic world by leveraging
the authoritative academic resources, ORCID and DOI. Using the method, we built
LAGOS-AND, two large, gold-standard datasets for author name disambiguation
(AND), of which LAGOS-AND-BLOCK is created for clustering-based AND research
and LAGOS-AND-PAIRWISE is created for classification-based AND research. Our
LAGOS-AND datasets are substantially different from the existing ones. The
initial versions of the datasets (v1.0, released in February 2021) include 7.5M
citations authored by 798K unique authors (LAGOS-AND-BLOCK) and close to 1M
instances (LAGOS-AND-PAIRWISE). Both datasets show close similarities to
the whole Microsoft Academic Graph (MAG) across validations of six facets. In
building the datasets, we reveal the degree of variation of last names in three
literature databases, PubMed, MAG, and Semantic Scholar, by comparing the author
names they host with the authors' official last names shown on the ORCID pages.
Furthermore, we evaluate several baseline disambiguation methods as well as the
MAG's author ID system on our datasets, and the evaluation yields
several interesting findings. We hope the datasets and findings will bring new
insights for future studies. The code and datasets are publicly available. Comment: 33 pages, 7 tables, 7 figures
Using Games to Create Language Resources: Successes and Limitations of the Approach
One of the more novel approaches to collaboratively creating language resources in recent years is to use online games to collect and validate data. The most significant challenges collaborative systems face are how to train users with the necessary expertise and how to encourage participation at the scale required to produce high-quality data comparable with data produced by “traditional” experts. In this chapter we provide a brief overview of collaborative creation and the different approaches that have been used to create language resources, before analysing games used for this purpose. We discuss some key issues in using a gaming approach, including task design, player motivation and data quality, and compare the costs of each approach in terms of development, distribution and ongoing administration. In conclusion, we summarise the benefits and limitations of using a gaming approach to resource creation and suggest key considerations for evaluating its utility in different research scenarios.
A Survey of Location Prediction on Twitter
Locations, e.g., countries, states, cities, and points of interest, are
central to news, emergency events, and people's daily lives. Automatic
identification of locations associated with or mentioned in documents has been
explored for decades. As one of the most popular online social network
platforms, Twitter has attracted a large number of users who send millions of
tweets on a daily basis. Due to the worldwide coverage of its users and
real-time freshness of tweets, location prediction on Twitter has gained
significant attention in recent years. Research efforts are spent on dealing
with new challenges and opportunities brought by the noisy, short, and
context-rich nature of tweets. In this survey, we aim at offering an overall
picture of location prediction on Twitter. Specifically, we concentrate on the
prediction of user home locations, tweet locations, and mentioned locations. We
first define the three tasks and review the evaluation metrics. By summarizing
Twitter network, tweet content, and tweet context as potential inputs, we then
structurally highlight how the problems depend on these inputs. Each dependency
is illustrated by a comprehensive review of the corresponding strategies
adopted in state-of-the-art approaches. In addition, we briefly review two
related problems, i.e., semantic location prediction and point-of-interest
recommendation. Finally, we list future research directions. Comment: Accepted to TKDE. 30 pages, 1 figure
Attaching Translations to Proper Lexical Senses in DBnary
The DBnary project aims at providing high-quality Lexical Linked Data extracted from different Wiktionary language editions. Data from 10 different languages is currently extracted for a total of over 3.16M translation links that connect lexical entries from the 10 extracted languages to entries in more than one thousand languages. In Wiktionary, glosses are often associated with translations to help users understand which sense they refer to, whether through a textual definition or a target sense number. In this article we aim at the extraction of as much of this information as possible and then the disambiguation of the corresponding translations for all languages available. We use an adaptation of various textual and semantic similarity techniques based on partial or fuzzy gloss overlaps to disambiguate the translation relations (to account for the lack of normalization, e.g. lemmatization and PoS tagging) and then extract some of the sense number information present to build a gold standard so as to evaluate our disambiguation as well as tune and optimize the parameters of the similarity measures. We obtain F-measures of the order of 80% (on par with similar work on English only) across the three languages where we could generate a gold standard (French, Portuguese, Finnish) and show that most of the disambiguation errors are due to inconsistencies in Wiktionary itself that cannot be detected at the generation of DBnary (shifted sense numbers, inconsistent glosses, etc.).