12 research outputs found

    Identification of Indonesian Authors Using Deep Neural Networks

    Author Name Disambiguation (AND) is a problem that occurs when a set of publications contains ambiguous author names: the same author may appear under different names (synonyms) in other published papers, or different authors may share the same name (homonyms). In this final project, we design a model with a Deep Neural Network (DNN) classifier. The dataset consists of primary data sourced from the Scopus website. This research focuses on integrating data from Indonesian authors. Accuracy, sensitivity, and precision are the standard benchmarks for measuring the performance of methods that solve AND problems. The best DNN classification model achieves 99.9936% accuracy, 93.1433% sensitivity, and 94.3733% precision. The highest performance is measured for the Non Synonym-Homonym (SH) case, with 99.9967% accuracy, 96.7388% sensitivity, and 97.5102% precision.
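
    The three benchmarks named in this abstract can be sketched from confusion-matrix counts. This is a minimal illustration, not the authors' evaluation code; the function name `binary_metrics` and the example counts are assumptions made here.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, sensitivity (recall), and precision from confusion counts.

    tp/fp/tn/fn are hypothetical counts for one class treated as "positive".
    """
    total = tp + fp + tn + fn
    return {
        # Share of all predictions that are correct.
        "accuracy": (tp + tn) / total if total else 0.0,
        # Share of actual positives that were found.
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        # Share of predicted positives that are correct.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
    }


if __name__ == "__main__":
    # Illustrative counts only; not taken from the paper.
    print(binary_metrics(tp=8, fp=2, tn=88, fn=2))
```

    When the negative class dominates, accuracy can be very high while sensitivity and precision stay lower, which is one reason all three are usually reported together.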

    Deep Neural Network Structure to Improve Individual Performance based Author Classification

    This paper proposes an improved method for the author name disambiguation problem, covering both homonyms and synonyms. The prepared data consists of the distances between each pair of authors' attributes, computed with the Levenshtein distance. Using Deep Neural Networks, we found large gains in performance. The results show an accuracy of 99.6% with a low number of hidden layers.
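
    The pairwise distance features described in this abstract might look like the sketch below. It assumes each author record is a dictionary of string attributes; the helper `pair_features` and the field names `name` and `affiliation` are hypothetical and are not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (two-row dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def pair_features(record_a: dict, record_b: dict, fields: tuple) -> list:
    """One distance per attribute for a pair of author records (hypothetical helper)."""
    return [levenshtein(record_a.get(f, ""), record_b.get(f, "")) for f in fields]


if __name__ == "__main__":
    a = {"name": "Smith J", "affiliation": "X"}
    b = {"name": "Smith J.", "affiliation": "Y"}
    print(pair_features(a, b, ("name", "affiliation")))
```

    A vector like this, one per record pair, is the kind of input a classifier could be trained on; how the paper scales or combines the distances is not stated in the abstract.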

    Author identification in bibliographic data using deep neural networks

    Author name disambiguation (AND) is a challenging task for scholars who mine bibliographic information for scientific knowledge. A constructive approach for resolving name ambiguity is to use computer algorithms to identify author names. Some algorithm-based disambiguation methods have been developed by computer and data scientists. Among them, supervised machine learning has been stated to produce decent to very accurate disambiguation results. This paper presents a combination of principal component analysis (PCA) for feature reduction and deep neural networks (DNNs) as a supervised algorithm for classifying AND problems. The raw data is grouped into four classes: synonyms, homonyms, homonyms-synonyms, and non-homonyms-synonyms. We have taken into account the tuning of several hyperparameters, such as learning rate, batch size, and number of neurons and hidden units, and analyzed their impact on the accuracy of the results. To the best of our knowledge, there are no previous studies with such a scheme. The proposed DNNs are validated against other ML techniques such as Naïve Bayes, random forest (RF), and support vector machine (SVM). Across all data, our proposed DNN classifier outperforms the other ML techniques, with accuracy, precision, recall, and F1-score of 99.98%, 97.98%, 97.86%, and 99.99%, respectively. In the future, this approach can be easily extended to any dataset and any bibliographic records provider.

    Harnessing Historical Corrections to build Test Collections for Named Entity Disambiguation

    Matching mentions of persons to the actual persons (the name disambiguation problem) is central for several digital library applications. Scientists have been working on algorithms to create this matching for decades without finding a universal solution. One problem is that test collections for this problem are often small and specific to a certain collection. In this work, we present an approach that can create large test collections from historical metadata with minimal extra cost. We apply this approach to the DBLP collection to generate two freely available test collections. One collection focuses on the properties of defects and one on the evaluation of disambiguation algorithms. Comment: Preprint of a paper accepted at TPDL 201

    Health warning: might contain multiple personalities - the problem of homonyms in Thomson Reuters Essential Science Indicators

    Author name ambiguity is a crucial problem in any type of bibliometric analysis. It arises when several authors share the same name, but also when one author expresses their name in different ways. This article focuses on the former, also called the “namesake” problem. In particular, we assess the extent to which this compromises the Thomson Reuters Essential Science Indicators (ESI) ranking of the top 1% most cited authors worldwide. We show that three demographic characteristics that should be unrelated to research productivity – name origin, uniqueness of one’s family name, and the number of initials used in publishing – in fact have a very strong influence on it. In contrast to what could be expected from Web of Science publication data, researchers with Asian names – and in particular Chinese and Korean names – appear to be far more productive than researchers with Western names. Furthermore, for any country, academics with common names and fewer initials also appear to be more productive than their more uniquely named counterparts. However, this appearance of high productivity is caused purely by the fact that these “academic superstars” are in fact composites of many individual academics with the same name. We thus argue that it is high time that Thomson Reuters starts taking name disambiguation in general, and non-Anglophone names in particular, more seriously.
