Identification of organization name variants in large databases using rule-based scoring and clustering: With a case study on the Web of Science database
This research describes a general method to automatically clean organizational and business name variants
within large databases, such as patent databases, bibliographic databases, databases in business information
systems, or any other database containing organisational name variants. The method clusters name variants
of organizations based on similarities in their associated metadata, such as postal code and email
domain data. The method is divided into a rule-based scoring system and a clustering system. The method is
tested on the cleaning of research organisations in the Web of Science database for the purpose of bibliometric
analysis and scientific performance evaluation. The clustering results are evaluated using precision and recall
on a verified data set. The evaluation shows that our method performs well
and is conservative: it values precision over recall, with on average 95% precision and 80% recall for clusters