
    Bayesian Non-Exhaustive Classification - A Case Study: Online Name Disambiguation using Temporal Record Streams

    The name disambiguation task aims to partition the records of multiple real-life persons so that each partition contains records pertaining to a unique person. Most existing solutions for this task operate in a batch mode, where all records to be disambiguated are initially available to the algorithm. More realistic settings, however, require that name disambiguation be performed in an online fashion, and that records of new ambiguous entities with no preexisting records be identified. In this work, we propose a Bayesian non-exhaustive classification framework for solving the online name disambiguation task. Our proposed method uses a Dirichlet process prior with a Normal × Normal × Inverse-Wishart data model, which enables identification of new ambiguous entities that have no records in the training data. For online classification, we use a one-sweep Gibbs sampler, which is very efficient and effective. As a case study, we consider bibliographic data in a temporal stream format and disambiguate authors by partitioning their papers into homogeneous groups. Our experimental results demonstrate that the proposed method outperforms existing methods for online name disambiguation.
    Comment: to appear in CIKM 201
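    The assignment step at the heart of such a framework can be sketched in a few lines. The following Python sketch uses a Chinese restaurant process prior and, for brevity, substitutes a spherical Gaussian likelihood for the paper's Normal × Normal × Inverse-Wishart model (whose posterior predictive would be a multivariate Student-t); all parameter values are illustrative.

```python
import numpy as np

def assign_record(x, clusters, alpha=1.0, sigma2=1.0, prior_var=10.0):
    """Assign feature vector x to an existing entity or open a new one.

    clusters: list of np.ndarray, each row-stacking the records already
    attributed to one entity. Returns a cluster index, or len(clusters)
    to signal a previously unseen (non-exhaustive) entity.
    """
    d = x.size
    n_total = sum(len(c) for c in clusters)
    scores = []
    for c in clusters:
        mean = c.mean(axis=0)
        # CRP prior weight for an existing entity, times a Gaussian
        # likelihood around that entity's running mean.
        log_prior = np.log(len(c) / (n_total + alpha))
        log_like = (-0.5 * d * np.log(2 * np.pi * sigma2)
                    - 0.5 * np.sum((x - mean) ** 2) / sigma2)
        scores.append(log_prior + log_like)
    # Weight for a brand-new entity: the CRP concentration alpha, times a
    # broader predictive that integrates over the unknown entity mean.
    var_new = sigma2 + prior_var
    scores.append(np.log(alpha / (n_total + alpha))
                  - 0.5 * d * np.log(2 * np.pi * var_new)
                  - 0.5 * np.sum(x ** 2) / var_new)
    # A Gibbs sweep would sample from these weights after normalization;
    # the argmax is taken here for a deterministic illustration.
    return int(np.argmax(scores))
```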

    Name Disambiguation from link data in a collaboration graph using temporal and topological features

    In a social community, multiple persons may share the same name, phone number, or other identifying attributes. This, along with phenomena such as name abbreviation, name misspelling, and human error, leads to erroneous aggregation of the records of multiple persons under a single reference. Such mistakes hurt the performance of document retrieval, web search, and database integration, and, more importantly, lead to improper attribution of credit (or blame). The entity disambiguation task partitions the records belonging to multiple persons so that each partition contains the records of a unique person. Existing solutions to this task use either biographical attributes or auxiliary features collected from external sources such as Wikipedia. In many scenarios, however, such auxiliary features are unavailable or costly to obtain. Moreover, collecting biographical or external data carries a risk of privacy violation. In this work, we propose a method for solving the entity disambiguation task using link information obtained from a collaboration network. Our method is non-intrusive with respect to privacy, as it uses only the time-stamped graph topology of an anonymized network. Experimental results on two real-life academic collaboration networks show that the proposed method performs satisfactorily.
    Comment: The short version of this paper has been accepted to ASONAM 201
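    As an illustration of the signals available in this restricted setting, the sketch below derives a few topological and temporal features for a pair of ambiguous references in a time-stamped collaboration graph. The feature set and the 'time' edge attribute are hypothetical stand-ins; the paper's actual features differ.

```python
import networkx as nx

def pair_features(G, u, v):
    """Features for deciding whether references u and v are the same person.

    G: nx.Graph whose edges carry a 'time' attribute (e.g. publication
    year). Assumes both u and v have at least one time-stamped edge.
    """
    common = list(nx.common_neighbors(G, u, v))
    union = set(G[u]) | set(G[v])
    jaccard = len(common) / max(1, len(union))
    t_u = [G[u][w]["time"] for w in G[u]]
    t_v = [G[v][w]["time"] for w in G[v]]
    # Gap between the two references' activity periods; 0 if they overlap.
    gap = max(min(t_u) - max(t_v), min(t_v) - max(t_u), 0)
    return {"common_neighbors": len(common),
            "jaccard": jaccard,
            "temporal_gap": gap}
```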

    Overview of the M-WePNaD Task: Multilingual web person name disambiguation at IberEval 2017

    Multilingual Web Person Name Disambiguation is a new shared task proposed for the first time at the IberEval 2017 evaluation campaign. Given a set of web search results associated with a person name, the task consists of grouping the results according to the particular individual they refer to. Unlike previous work dealing with monolingual search results, this task further considers the challenge posed by search results written in different languages, making it possible to evaluate the performance of participating systems in a multilingual scenario. This overview summarizes a total of 18 runs received from four participating teams. We present the datasets used and the methodology defined for the task and its evaluation, along with an analysis of the results and the submitted systems.

    Towards Name Disambiguation: Relational, Streaming, and Privacy-Preserving Text Data

    In the real world, our DNA is unique, but many people share names. This phenomenon often causes erroneous aggregation of the documents of multiple persons who are namesakes of one another. Such mistakes deteriorate the performance of document retrieval and web search and, more seriously, cause improper attribution of credit or blame in digital forensics. To resolve this issue, the name disambiguation task is designed to partition the documents associated with a name reference such that each partition contains documents pertaining to a unique real-life person. Existing algorithms for this task mainly suffer from the following drawbacks. First, the majority of existing solutions rely substantially on feature engineering, such as biographical feature extraction or the construction of auxiliary features from Wikipedia. In many scenarios, however, such features may be costly to obtain or unavailable in privacy-sensitive domains. We instead solve the name disambiguation task in a restricted setting by leveraging only relational data in the form of anonymized graphs. Second, most existing works for this task operate in a batch mode, where all records to be disambiguated are initially available to the algorithm. More realistic settings require that name disambiguation be performed in an online streaming fashion in order to identify records of new ambiguous entities having no preexisting records. Finally, we investigate the potential disclosure risk of textual features used in name disambiguation and propose several algorithms to tackle the task in a privacy-aware scenario. In summary, this dissertation presents a number of novel approaches that address the name disambiguation task from the above three aspects independently, namely relational, streaming, and privacy-preserving textual data.

    LAGOS-AND: A Large Gold Standard Dataset for Scholarly Author Name Disambiguation

    In this paper, we present a method to automatically build large labeled datasets for the author ambiguity problem in the academic world by leveraging two authoritative academic resources, ORCID and DOI. Using this method, we built LAGOS-AND, two large gold-standard datasets for author name disambiguation (AND): LAGOS-AND-BLOCK, created for clustering-based AND research, and LAGOS-AND-PAIRWISE, created for classification-based AND research. Our LAGOS-AND datasets are substantially different from existing ones. The initial versions of the datasets (v1.0, released in February 2021) include 7.5M citations authored by 798K unique authors (LAGOS-AND-BLOCK) and close to 1M instances (LAGOS-AND-PAIRWISE). Both datasets show close similarities to the whole Microsoft Academic Graph (MAG) across validations of six facets. In building the datasets, we reveal the degree of last-name variation in three literature databases, PubMed, MAG, and Semantic Scholar, by comparing the author names they host to the authors' official last names shown on their ORCID pages. Furthermore, we evaluate several baseline disambiguation methods, as well as MAG's author ID system, on our datasets, and the evaluation helps identify several interesting findings. We hope the datasets and findings will bring new insights for future studies. The code and datasets are publicly available.
    Comment: 33 pages, 7 tables, 7 figures
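    To make the construction idea concrete, here is a minimal sketch of deriving pairwise labels from ORCID identity. The signature records (with 'doi', 'name_block', and 'orcid' fields) are hypothetical; the actual LAGOS-AND pipeline involves substantially more cleaning, blocking, and validation.

```python
from itertools import combinations

def pairwise_labels(signatures):
    """Yield labeled signature pairs for classification-based AND.

    signatures: list of dicts with 'doi', 'name_block', 'orcid' keys.
    Yields ((doi_a, doi_b), label), where label is 1 iff both signatures
    resolve to the same ORCID, i.e. the same real-world author.
    """
    by_block = {}
    for s in signatures:
        by_block.setdefault(s["name_block"], []).append(s)
    # Only signatures sharing a name block form candidate pairs.
    for block in by_block.values():
        for a, b in combinations(block, 2):
            yield (a["doi"], b["doi"]), int(a["orcid"] == b["orcid"])
```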

    Adding Domain Knowledge to Improve Entity Resolution in 17th and 18th Century Amsterdam Archival Records

    The problem of entity resolution is central in the field of Digital Humanities. It is also one of the major issues in the Golden Agents project, which aims at creating an infrastructure that enables researchers to search for patterns spanning decentralised knowledge graphs from cultural heritage institutes. To this end, we created a method to perform entity resolution on complex historical knowledge graphs. In previous work, we encoded and embedded the relevant (duplicate) entities in a vector space to derive similarities between them based on their sharing a similar context in RDF graphs. In some cases, however, available domain knowledge or rational axioms can be applied to improve entity resolution performance. We show how domain knowledge and rational axioms relevant to the task at hand can be expressed as (probabilistic) rules, and how the information derived from rule application can be combined with quantitative information from the embedding. In this work, we apply our entity resolution method to two data sets. First, we apply it to a data set for which we have a detailed ground truth for validation. This experiment shows that combining the embedding with the application of domain knowledge and rational axioms leads to improved resolution performance. Second, we perform a case study by applying our method to a larger data set for which there is no ground truth and where the outcome is subsequently validated by a domain expert. The results demonstrate that our method achieves very high precision.
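    One simple way to combine the two signals is sketched below: the embedding similarity acts as a base score that rule-derived probabilities scale or veto. The multiplicative fusion is an assumption made for illustration, not necessarily the combination used in the project.

```python
import numpy as np

def combined_score(emb_a, emb_b, rule_prob):
    """Fuse embedding similarity with a rule-derived co-reference probability.

    emb_a, emb_b: entity embedding vectors; rule_prob: probability in
    [0, 1] that the rules allow the two entities to co-refer.
    """
    cos = float(np.dot(emb_a, emb_b)
                / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))
    sim = 0.5 * (cos + 1.0)  # map cosine from [-1, 1] into [0, 1]
    return sim * rule_prob   # a violated hard rule (prob 0) vetoes the match
```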

    Learning Algorithm to Automate Fast Author Name Disambiguation

    The worldwide scientific production represents a massive number of records that can be accessed via numerous databases. Because of the presence of ambiguous records, a time-efficient disambiguation process is required as an essential step in extracting correct information and generating publication statistics. However, the disambiguation task is exhaustive and complex due to the large volume of the databases and the missing data. Currently, there is no complete automatic method able to produce satisfactory results for the disambiguation process. Previously, an efficient entity disambiguation application was developed: a supervised cascade algorithm that gives promising results on large bibliographic databases. Although the existing work produces high-quality results within a reasonable processing time, it lacks an efficient choice of metrics, and the structure of its classifiers is determined heuristically through the analysis of precision and recall errors. Clearly, an automated approach that makes the application flexible and adjustable would directly enhance its usability. Such an approach would help in understanding the importance of each feature classification in the disambiguation process and in selecting the most efficient ones. In this research, we propose a learning algorithm for automating the disambiguation process of this application. The aim of this work is to identify the most appropriate phonetic algorithm and similarity measures, as well as to introduce a desirable automatic approach in place of a heuristic one.
    To achieve our goals, we conduct three major steps. First, we address the problem of evaluating phonetic encoding algorithms that can be used in blocking: six commonly used phonetic encoding algorithms were selected, and specific quantitative evaluation metrics were developed in order to assess their limitations and advantages and select the best one. Second, we test different string similarity measures and analyze the advantages and disadvantages of each technique; in other words, our second goal is to build an efficient disambiguation method by comparing several edit- and token-based algorithms to improve the blocking method. Finally, using bootstrap aggregating (Bagging) and AdaBoost, an algorithm was developed that employs particle swarm and set cover optimization techniques to design a learning framework enabling automatic ordering of the weak classifiers and determination of their thresholds. Performance comparisons were carried out on real data extracted from the Web of Science (WoS) and SCOPUS bibliographic databases. In summary, this work allows us to draw conclusions about the qualities and weaknesses of each phonetic algorithm and similarity measure from the perspective of our application. We show that the NYSIIS phonetic algorithm is a better choice for the blocking step of the disambiguation application, and that the Weighting Table-based algorithm outperforms some commonly used similarity algorithms in terms of time efficiency while producing satisfactory results. Moreover, we propose a learning method to determine the structure of the disambiguation algorithm automatically.
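    The blocking-then-matching flow can be sketched in a few lines of Python, using the jellyfish library (function names as in jellyfish >= 0.8) for NYSIIS codes, with Jaro-Winkler similarity standing in for the Weighting Table-based measure; the threshold is illustrative rather than learned.

```python
from itertools import combinations
import jellyfish

def block_and_match(names, threshold=0.9):
    """Group names by NYSIIS code, then score pairs within each block."""
    blocks = {}
    for name in names:
        blocks.setdefault(jellyfish.nysiis(name), []).append(name)
    matches = []
    for block in blocks.values():
        # Only names sharing a phonetic code are ever compared, which is
        # what makes blocking cheap on large bibliographic databases.
        for a, b in combinations(block, 2):
            if jellyfish.jaro_winkler_similarity(a, b) >= threshold:
                matches.append((a, b))
    return matches
```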

    Pair-Linking for Collective Entity Disambiguation: Two Could Be Better Than All

    Collective entity disambiguation aims to jointly resolve multiple mentions by linking them to their associated entities in a knowledge base. Previous works are primarily based on the underlying assumption that entities within the same document are highly related. However, the extent to which these mentioned entities are actually connected in reality is rarely studied, and it raises interesting research questions. For the first time, we show that the semantic relationships between mentioned entities are in fact less dense than expected. This can be attributed to several reasons, such as noise, data sparsity, and knowledge base incompleteness. As a remedy, we introduce MINTREE, a new tree-based objective for the entity disambiguation problem. The key intuition behind MINTREE is the concept of coherence relaxation, which uses the weight of a minimum spanning tree to measure the coherence between entities. Based on this new objective, we design a novel entity disambiguation algorithm which we call Pair-Linking. Instead of considering all the given mentions, Pair-Linking iteratively selects the pair with the highest confidence at each step for decision making. Via extensive experiments, we show that our approach is not only more accurate but also surprisingly faster than many state-of-the-art collective linking algorithms.
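    To make the iteration concrete, the following simplified sketch commits, at each step, the single most confident pair of mention-entity assignments instead of optimizing over all mentions jointly. The scoring interfaces are assumptions; the published algorithm combines local confidence with MINTREE-based coherence and uses far more efficient bookkeeping.

```python
from itertools import combinations

def pair_linking(mentions, candidates, local, coherence):
    """Greedy pair-wise linking; assumes at least two mentions.

    candidates[m]: candidate entities for mention m;
    local(m, e): mention-to-entity score;
    coherence(e1, e2): pairwise entity coherence.
    """
    linked = {}
    while len(linked) < len(mentions):
        best = None
        for mi, mj in combinations(mentions, 2):
            if mi in linked and mj in linked:
                continue  # this pair is already decided
            ei_opts = [linked[mi]] if mi in linked else candidates[mi]
            ej_opts = [linked[mj]] if mj in linked else candidates[mj]
            for ei in ei_opts:
                for ej in ej_opts:
                    s = local(mi, ei) + local(mj, ej) + coherence(ei, ej)
                    if best is None or s > best[0]:
                        best = (s, mi, ei, mj, ej)
        _, mi, ei, mj, ej = best  # commit the most confident pair
        linked[mi], linked[mj] = ei, ej
    return linked
```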