242 research outputs found

    Effect of forename string on author name disambiguation

    Full text link
    In author name disambiguation, author forenames are used to decide which name instances are disambiguated together and how much they are likely to refer to the same author. Despite such a crucial role of forenames, their effect on the performance of heuristic (string matching) and algorithmic disambiguation is not well understood. This study assesses the contributions of forenames in author name disambiguation using multiple labeled data sets under varying ratios and lengths of full forenames, reflecting realā€world scenarios in which an author is represented by forename variants (synonym) and some authors share the same forenames (homonym). The results show that increasing the ratios of full forenames substantially improves both heuristic and machineā€learningā€based disambiguation. Performance gains by algorithmic disambiguation are pronounced when many forenames are initialized or homonyms are prevalent. As the ratios of full forenames increase, however, they become marginal compared to those by string matching. Using a small portion of forename strings does not reduce much the performances of both heuristic and algorithmic disambiguation methods compared to using fullā€length strings. These findings provide practical suggestions, such as restoring initialized forenames into a fullā€string format via record linkage for improved disambiguation performances.Peer Reviewedhttps://deepblue.lib.umich.edu/bitstream/2027.42/155924/1/asi24298.pdfhttps://deepblue.lib.umich.edu/bitstream/2027.42/155924/2/asi24298_am.pd

    Scaleā€free collaboration networks: An author name disambiguation perspective

    Full text link
    Peer Reviewedhttps://deepblue.lib.umich.edu/bitstream/2027.42/149559/1/asi24158.pdfhttps://deepblue.lib.umich.edu/bitstream/2027.42/149559/2/asi24158_am.pd

    A Syllable-based Technique for Word Embeddings of Korean Words

    Full text link
    Word embedding has become a fundamental component to many NLP tasks such as named entity recognition and machine translation. However, popular models that learn such embeddings are unaware of the morphology of words, so it is not directly applicable to highly agglutinative languages such as Korean. We propose a syllable-based learning model for Korean using a convolutional neural network, in which word representation is composed of trained syllable vectors. Our model successfully produces morphologically meaningful representation of Korean words compared to the original Skip-gram embeddings. The results also show that it is quite robust to the Out-of-Vocabulary problem.Comment: 5 pages, 3 figures, 1 table. Accepted for EMNLP 2017 Workshop - The 1st Workshop on Subword and Character level models in NLP (SCLeM

    The impact of author name disambiguation on knowledge discovery from large-scale scholarly data

    Get PDF
    In this study, I demonstrate that the choice of disambiguation methods for resolving author name ambiguity can adversely affect our understanding of scholarly collaboration patterns and coauthorship network structures extracted from large-scale scholarly data. By utilizing large-scale bibliometric data, scholars in many fields have gleaned knowledge for use in scholarly evaluation, collaborator recommendations, research policy evaluation, and network-evolution modeling. A common challenge has been that author names in bibliometric data are not properly disambiguated: authors may share the same name (i.e., different authors are sometimes misrepresented to be a single author which can lead to a ā€œmerging of identitiesā€). In addition, one author may use name variations (i.e., an author may be represented as two or more different authors which can lead to a ā€œsplitting of identitiesā€). When faced with these challenges, most scholars have pre-processed bibliometric data using simple heuristics (e.g., if two author names share the same surname and given name initials, they are presumed to represent the same author identity) and assumed that their findings are robust to errors due to author name ambiguity. I test this long-held assumption in bibliometrics by measuring the impact of author name ambiguity on network properties. I accomplish this under varying conditions, including network size and cumulative time window (from 1991 to 2009) using four large-scale bibliometric datasets that cover: biomedicine, computer science, psychology and neuroscience, and one nationā€™s entire domestic publication output. For this task, I collate the statistical properties of coauthorship networks constructed from algorithmically disambiguated data (i.e., close to clean data) against those that come from the same networks, but are compromised by misidentified authors via first-initial and all-initials disambiguation methods. In addition, I simulate the levels of merging and splitting incrementally using those empirical datasets. My findings show that initial-based name disambiguation methods can severely distort our understanding of given networks and such distortion gets worse over time. Moreover, the distortion sometimes leads to biased or false knowledge of coauthorship network formation and evolution mechanisms such as preferential attachment generating the power-law distribution of vertex degree and to false validation of theories about the choice of collaborators in scientific research. This may result in ill-informed decisions about research policy and resource allocation. Besides measuring the impact of name ambiguity on network properties, I also test how name ambiguity can be estimated using simple heuristics such as dataset size and how merged author identities can be detected via an authorā€™s ego-network properties to provide a practical guidance for corrective measures. My research calls for further studying the effects of author name ambiguity on coauthorship network properties and is expected to help scholars establish better practices for knowledge discovery from large-scale scholarly data
    • ā€¦
    corecore