
    Investigating features and techniques for Arabic authorship attribution

    Authorship attribution is the problem of identifying the true author of a disputed text. Throughout history there have been many examples of this problem, concerned with revealing the genuine authors of works of literature that were published anonymously, and in some cases where more than one author claimed authorship of the disputed text. There has been considerable research effort into solving this problem. Initially these efforts were based on statistical patterns, and more recently they have centred on a range of techniques from artificial intelligence. An important early breakthrough was achieved by Mosteller and Wallace in 1964 [15], who pioneered the use of 'function words' (typically pronouns, conjunctions and prepositions) as the features on which to base the discovery of usage patterns relevant to specific authors. The authorship attribution problem has been tackled in many languages, but predominantly in English. In this thesis the problem is addressed for the first time in the Arabic language. We therefore investigate whether the concept of function words in English can be used in the same way for authorship attribution in Arabic. We also describe and evaluate a hybrid of evolutionary algorithms and linear discriminant analysis as an approach to learning a model that classifies the author of a text, based on features derived from Arabic function words. The main target of the hybrid algorithm is to find a subset of features that can robustly and accurately classify disputed texts in unseen data. The hybrid algorithm also aims to do this with relatively small subsets of features. A specialised dataset was produced for this work, based on a collection of 14 Arabic books of different natures, written by six authors. This dataset was processed into training and test partitions in a way that provides a diverse collection of challenges for any authorship attribution approach. The combination of the successful list of Arabic function words and the hybrid algorithm for classification led to satisfactory levels of accuracy in determining the author of portions of the texts in the test data. The work described here is the first (to our knowledge) that investigates authorship attribution in the Arabic language using computational methods. Among its contributions are: the first set of Arabic function words, the first specialised dataset aimed at testing Arabic authorship attribution methods, a new hybrid algorithm for classifying authors based on patterns derived from these function words, and, finally, a number of ideas and variants regarding how to use function words in association with character-level features, leading in some cases to more accurate results.
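
    The idea of function-word features described in the abstract above can be illustrated with a minimal sketch. It is not the thesis's implementation: the function name `function_word_features`, the five-word list `FUNCTION_WORDS`, and the whitespace tokenisation are all assumptions made for illustration. The thesis compiles its own Arabic function-word list, which the abstract does not reproduce. The sketch turns a text into a vector of relative frequencies, the kind of feature vector that a classifier such as linear discriminant analysis could then consume.

```python
from collections import Counter

# Hypothetical, tiny list of common Arabic prepositions, for illustration only.
# It is NOT the function-word list produced in the thesis.
FUNCTION_WORDS = ["في", "من", "على", "إلى", "عن"]


def function_word_features(text, function_words=FUNCTION_WORDS):
    """Return the relative frequency of each function word in `text`.

    Tokenisation is a naive whitespace split (an assumption); real Arabic
    text would usually need normalisation and clitic handling first.
    """
    tokens = text.split()
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[word] / total for word in function_words]


if __name__ == "__main__":
    # English stand-in words, so the behaviour is easy to check by eye.
    print(function_word_features("the cat and the dog", ["the", "and", "of"]))
```

    Relative rather than absolute counts are used here so that text portions of different lengths remain comparable; selecting a small, robust subset of such features is the job the abstract assigns to the hybrid algorithm.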

    Adjacency Pair Recognition in Wikipedia Discussions using Lexical Pairs


    Influence of features discretization on accuracy of random forest classifier for web user identification

    Web user identification based on linguistic or stylometric features helps to solve several tasks in computer forensics and cybersecurity, and can be used to prevent and investigate high-tech crimes and crimes in which a computer is used as a tool. In this paper we present research results on the influence of feature discretization on the accuracy of a Random Forest classifier. To evaluate this influence, we carried out a series of experiments on a text corpus containing Russian online texts of different genres and topics. We used data sets with various levels of class imbalance and various numbers of training texts per user. The experiments showed that discretization of the features improves the accuracy of identification for all data sets. We obtained positive results even for an extremely low number of online messages per user and for the maximum imbalance level.
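
    Feature discretization, as mentioned in the abstract above, can be sketched as follows. The abstract does not say which discretization scheme the authors used, so equal-width binning is an assumption here, as are the function name `equal_width_bins` and the default of four bins. The sketch maps each continuous feature value to a bin index; in practice the binned features would then be passed to a Random Forest implementation.

```python
def equal_width_bins(values, n_bins=4):
    """Map each numeric value to a bin index in the range [0, n_bins - 1].

    Equal-width binning is an assumed scheme, chosen for simplicity; the
    paper's actual discretization method is not given in the abstract.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything falls into one bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


if __name__ == "__main__":
    # Hypothetical values of a single stylometric feature across five texts.
    print(equal_width_bins([0.0, 1.0, 2.0, 3.0, 4.0], n_bins=4))
```

    The bin edges should be computed on the training texts only and then reused for test texts; the function above recomputes them per call, which is acceptable only for illustration.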