A Statistical Approach to Classify Nationality of Name

Abstract

Named entities (NEs), especially personal names, are important components in interpreting certain kinds of text documents, e.g. news. To extract personal names efficiently, statistical language models are required to capture the characteristics of personal names. Among these characteristics, the nationality of a name is a useful source of information for interpreting the text document. Automatically inferring nationality from a name also directly helps a user gain more information from the name. In this paper, we therefore propose a statistical approach to identifying the nationality of names written in Thai. Features are extracted from decomposed personal names, and their probabilistic bigram and trigram models are used with naive Bayesian classification to assign the most appropriate class to a name. To evaluate the proposed approach, a number of experiments are conducted on real-world data. The experimental results show that our approach works effectively, with about 94% accuracy.
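
As an illustration of the general technique named in the abstract, the following is a minimal sketch of naive Bayes classification over bigram and trigram features. It is not the authors' implementation. Several choices are assumptions made for the example only: the name is decomposed into character n-grams (the abstract does not say which units are used), add-one (Laplace) smoothing is applied, boundary markers `^` and `$` are added, and the training names are hypothetical romanized toy strings rather than Thai-script data from the paper. The identifiers `ngrams` and `NgramNaiveBayes` are likewise invented for this sketch.

```python
import math
from collections import Counter, defaultdict


def ngrams(name, n):
    """Return the character n-grams of a name, padded with boundary markers."""
    padded = "^" * (n - 1) + name.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Naive Bayes over bigram and trigram counts (illustrative sketch only)."""

    def __init__(self, orders=(2, 3), alpha=1.0):
        self.orders = orders              # n-gram orders used as features
        self.alpha = alpha                # assumed add-alpha smoothing constant
        self.class_counts = Counter()     # number of training names per class
        self.feature_counts = defaultdict(Counter)  # n-gram counts per class
        self.vocab = set()                # all n-grams seen in training

    def features(self, name):
        return [g for n in self.orders for g in ngrams(name, n)]

    def fit(self, names, labels):
        for name, label in zip(names, labels):
            self.class_counts[label] += 1
            for g in self.features(name):
                self.feature_counts[label][g] += 1
                self.vocab.add(g)
        return self

    def log_scores(self, name):
        """Return log P(class) + sum of log P(n-gram | class) for each class."""
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, count in self.class_counts.items():
            score = math.log(count / total)
            denom = sum(self.feature_counts[label].values()) + self.alpha * vocab_size
            for g in self.features(name):
                score += math.log((self.feature_counts[label][g] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, name):
        scores = self.log_scores(name)
        return max(scores, key=scores.get)


# Hypothetical toy data for illustration; not taken from the paper.
train_names = ["tanaka", "suzuki", "yamada", "smith", "johnson", "wilson"]
train_labels = ["Japanese", "Japanese", "Japanese", "English", "English", "English"]
model = NgramNaiveBayes().fit(train_names, train_labels)
print(model.predict("tanaka"))
```

Working in log space avoids numerical underflow when many n-gram probabilities are multiplied, and smoothing keeps an unseen n-gram from forcing a class probability to zero. A real system would need far more training names per class than this toy set.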
