
    Application of psycholinguistic features to authorship profiling for first language, gender and age group

    Much of the fraud committed in cyberspace involves the misrepresentation of the demographic data of the perpetrator via the medium of seemingly anonymous text messages. One way to address this issue is to apply techniques from the field of authorship characterisation, or profiling, which is the analysis of text to determine the demographic profile of the author. Most previous research into authorship characterisation has used counts and ratios of lexicographically based features, including words, parts of words and Parts Of Speech (POS) contained within the text. This study examines the effectiveness of classifying the first language, gender and age group of an author using a set of features developed in the psycholinguistic field (the Linguistic Inquiry and Word Count - LIWC), both as a single-type feature set and in combination with the lexicographically based features used in previous studies (function words, character bigrams, and POS unigrams and bigrams). The study also searched for the smallest practical subset of each feature set that remained effective, by ranking the features using three feature selection algorithms and systematically reducing the number used. In addition, the study explored the effective lower word limit for accurate classification by reducing the text size in regular increments. LIWC was found to be more effective than a similar number of features of any of the lexicographic types, and to add insight rather than noise when combined with these feature types. This held true for both the full and reduced text sizes for all three demographic classes examined. In addition, it was found that the size of the feature sets could be greatly reduced while still maintaining effective levels of classification accuracy.
    Doctor of Philosoph
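
The lexicographic feature types named in the abstract (function-word ratios and character bigram frequencies), the reduction of text size, and the stepwise shrinking of a ranked feature set can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the five-word `FUNCTION_WORDS` list, the function names, and the idea that some ranking score per feature is already available. The abstract does not name the three feature selection algorithms, the classifier, or the word-limit increments, so none of those are reproduced here.

```python
from collections import Counter

# Hypothetical, tiny function-word list; the study's actual list is not given.
FUNCTION_WORDS = ("the", "of", "and", "to", "in")


def truncate_words(text, limit):
    """Keep only the first `limit` whitespace-separated words of a text."""
    return " ".join(text.split()[:limit])


def function_word_ratios(text):
    """Ratio of each listed function word to the total word count."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero on empty input
    return {w: counts[w] / total for w in FUNCTION_WORDS}


def char_bigram_ratios(text):
    """Relative frequency of each character bigram in the text."""
    s = text.lower()
    bigrams = Counter(s[i:i + 2] for i in range(len(s) - 1))
    total = sum(bigrams.values()) or 1
    return {bg: n / total for bg, n in bigrams.items()}


def shrinking_subsets(scores, step):
    """Yield feature-name lists from largest to smallest.

    `scores` maps feature name -> ranking score from any feature
    selection method (assumed to be computed elsewhere). Each round
    drops the `step` lowest-ranked features.
    """
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    for size in range(len(ranked), 0, -step):
        yield ranked[:size]
```

In use, each subset yielded by `shrinking_subsets` would be passed to a classifier of the reader's choice, and the same loop could be repeated on texts shortened with `truncate_words` to probe a lower word limit.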