19,298 research outputs found

    Profiling a set of personality traits of text author: what our words reveal about us

    Authorship profiling, i.e. revealing information about an unknown author by analyzing their text, is a task of growing importance. One of the most urgent problems of authorship profiling (AP) is selecting text parameters which may correlate with an author’s personality. Most researchers’ selection of these parameters is not underpinned by any theory. This article proposes an approach to AP which applies neuroscience data. The aim of the study is to assess the probability of self-destructive behaviour of an individual via formal parameters of their texts. We used the “Personality Corpus”, which consists of Russian-language texts. A set of correlations has been calculated between text variables (average sentence length, lexical diversity, etc.) and scores on the Freiburg Personality Inventory scales that are known to be indicative of self-destructive behaviour (“Spontaneous Aggressiveness”, “Depressiveness”, “Emotional Lability”, and “Composedness”). Further, a mathematical model which predicts the probability of self-destructive behaviour has been obtained.
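The abstract above names average sentence length and lexical diversity as text variables. As a rough illustration only, the sketch below computes both for a short string. The function name `text_variables`, the punctuation-based sentence splitter, and the use of the type-token ratio as the lexical diversity measure are assumptions, not details taken from the study.

```python
import re


def text_variables(text: str) -> dict:
    """Compute two simple formal text parameters (illustrative assumptions)."""
    # Split on sentence-final punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # \w+ matches Unicode word characters, so Cyrillic text is handled too.
    words = re.findall(r"\w+", text.lower())
    avg_sentence_length = len(words) / len(sentences) if sentences else 0.0
    # Type-token ratio: distinct words divided by total words (assumed measure).
    lexical_diversity = len(set(words)) / len(words) if words else 0.0
    return {
        "avg_sentence_length": avg_sentence_length,
        "lexical_diversity": lexical_diversity,
    }


print(text_variables("I am here. I am not there!"))
```

Values like these could then be correlated with questionnaire scores, which is the step the abstract describes but this sketch does not attempt.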

    Authorship attribution in Portuguese using character N-grams

    For the Authorship Attribution (AA) task, character n-grams are considered among the best predictive features. In the English language, it has also been shown that some types of character n-grams perform better than others. This paper tackles the AA task in Portuguese by examining the performance of different types of character n-grams, and various combinations of them. The paper also experiments with different feature representations and machine-learning algorithms. Moreover, the paper demonstrates that the performance of the character n-gram approach can be improved by fine-tuning the feature set and by appropriately selecting the length and type of character n-grams. This relatively simple and language-independent approach to the AA task outperforms both a bag-of-words baseline and other approaches, using the same corpus.
    Funding: Mexican Government (Conacyt) [240844, 20161958]; Mexican Government (SIP-IPN) [20171813, 20171344, 20172008]; Mexican Government (SNI); Mexican Government (COFAA-IPN)
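To make the character n-gram idea from the entry above concrete, here is a minimal sketch that counts overlapping character n-grams and assigns an unknown text to the most similar known author. The nearest-profile cosine comparison is a stand-in chosen for brevity; the paper itself reports using machine-learning classifiers, and the names `char_ngrams`, `cosine` and `attribute`, as well as the sample texts, are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams of length n."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def attribute(unknown: str, known: dict, n: int = 3) -> str:
    """Return the known author whose n-gram profile is closest to the text."""
    target = char_ngrams(unknown.lower(), n)
    return max(
        known,
        key=lambda author: cosine(target, char_ngrams(known[author].lower(), n)),
    )


# Hypothetical sample texts, one per candidate author.
known_texts = {
    "author_a": "o rato roeu a roupa do rei de roma",
    "author_b": "quem conta um conto aumenta um ponto",
}
print(attribute("o rei de roma tinha um rato", known_texts))
```

The n-gram length `n` is the main knob here, which matches the abstract's point that the length and type of n-grams need to be selected carefully.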

    Multilingual Cross-domain Perspectives on Online Hate Speech

    In this report, we present a study of eight corpora of online hate speech, demonstrating the NLP techniques that we used to collect and analyze jihadist, extremist, racist, and sexist content. Analysis of the multilingual corpora shows that the different contexts share certain characteristics in their hateful rhetoric. To expose the main features, we have focused on text classification, text profiling, keyword and collocation extraction, along with manual annotation and qualitative study. Comment: 24 pages

    A Word Embeddings based Approach for Author Profiling: Gender and Age Prediction

    Author Profiling (AP) is the task of identifying demographic attributes of an author, such as age, gender, location, native language and personality traits, by processing their written texts. AP techniques are used in applications such as literary research, marketing, forensics and security. Researchers have identified differences in authors' writing styles by analysing various datasets, and these differences are represented as stylistic features. The style-based features extracted to recognize author profiles include structural, content, word, character, syntactic, readability and semantic features. Traditionally, researchers combined these features in various ways to differentiate author profiles. Several existing works used Machine Learning (ML) methods to predict the characteristics of a new author, and achieved good accuracies by combining stylistic features with ML algorithms. Recently, with the advent of Deep Learning (DL) techniques, researchers have proposed DL-based approaches to author profiling. A few researchers found that deep learning techniques predict author profiles better than style-based features do. In this work, a word-embedding-based approach is proposed for gender and age prediction. Experiments were conducted with different word embedding models (Word2Vec, GloVe, FastText and BERT) to generate word vectors. Documents are converted to vectors using a document representation technique built on these word embeddings. The document vectors are passed to three ML algorithms, Extreme Gradient Boosting (XGBoost), Random Forest (RF) and Logistic Regression (LR), to train a model. The trained model is then evaluated on the accuracy of its age and gender predictions. The XGBoost classifier with BERT word embeddings achieved better accuracies for age and gender prediction than the other word embeddings and ML algorithms. The experiments were carried out on the PAN 2014 competition Reviews dataset. The proposed approach attained better accuracies for predicting age and gender than various existing AP approaches.
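The abstract above does not say which document representation technique turns word embeddings into a document vector. One common choice, assumed here purely for illustration, is to average the vectors of the words in the document. The tiny two-dimensional `toy_embeddings` table and the function name `document_vector` are hypothetical; real Word2Vec, GloVe, FastText or BERT vectors would have hundreds of dimensions.

```python
def document_vector(tokens, embeddings, dim):
    """Average the embeddings of known tokens (assumed representation)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # No token has an embedding: fall back to the zero vector.
        return [0.0] * dim
    # zip(*vectors) groups the values of each dimension together.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Hypothetical toy embedding table with two dimensions.
toy_embeddings = {
    "good": [1.0, 0.0],
    "movie": [0.0, 1.0],
}
print(document_vector(["good", "movie", "unknown"], toy_embeddings, dim=2))
```

The resulting fixed-length vectors are what a classifier such as XGBoost, Random Forest or Logistic Regression would take as input.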

    Author Profiling for English and Arabic Emails

    This paper reports on some aspects of a research project aimed at automating the analysis of texts for the purpose of author profiling and identification. The Text Attribution Tool (TAT) was developed for the purpose of language-independent author profiling and has now been trained on two email corpora, English and Arabic. The complete analysis provides probabilities for the author’s basic demographic traits (gender, age, geographic origin, level of education and native language) as well as for five psychometric traits. The prototype system also provides a probability of a match with other texts, whether from known or unknown authors. A very important part of the project was the data collection, and we give an overview of the collection process as well as a detailed description of the corpus of email data which was collected. We describe the overall TAT system and its components before outlining the ways in which the email data is processed and analysed. Because Arabic presents particular challenges for NLP, this paper also describes more specifically the text processing components developed to handle Arabic emails. Finally, we describe the Machine Learning setup used to produce classifiers for the different author traits and we present the experimental results, which are promising for most traits examined. The work presented in this paper was carried out while the authors were working at Appen Pty Ltd., Chatswood NSW 2067, Australia.

    A New Term Representation Method for Gender and Age Prediction

    Author Profiling is a text classification task that detects characteristics of authors, such as age, gender, educational background, place of origin, personality traits and native language, by processing their written texts. Applications such as forensic analysis, security and marketing use author profiling techniques to find basic details about authors. The main problem in the domain of author profiling is the preparation of a suitable dataset for predicting author characteristics. PAN is an organization that conducts competitions on various shared tasks. In 2013, the PAN organizers introduced the author profiling task in their series of competitions and have continued it in subsequent years, providing different kinds of datasets in a variety of languages. Since 2013, several researchers have proposed author profiling solutions that predict different author characteristics using the datasets provided in PAN competitions. Researchers have used different kinds of features, including character-based, lexical (word-based), structural, syntactic, content-based and style-based features, to distinguish authors' writing styles. Most researchers observed that content-based features, such as the words or phrases used in the text, are the most useful for detecting author characteristics. In this work, experiments were conducted with content-based features, namely the most important words or terms, to predict age group and gender on the PAN competition datasets. Two datasets, the PAN 2014 and PAN 2016 author profiling datasets, are used. The documents of each dataset are converted into a vector representation, which is a suitable format for training machine learning algorithms. The representation of terms in a document vector plays a crucial role in the performance of gender and age group prediction. Term Weight Measures (TWMs) are techniques used to represent the significance of a term in the document vector. In this work, we developed a new TWM and compared its efficiency with that of existing TWMs. Two Machine Learning (ML) algorithms, SVM (Support Vector Machine) and RF (Random Forest), are used to estimate the accuracy of the proposed approach. We found that the proposed TWM achieved the best accuracies for gender and age prediction on the two PAN datasets.
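The new TWM proposed in the entry above is not defined in the abstract, so it cannot be reproduced here. As a point of reference, the sketch below shows a familiar existing term weight measure, TF-IDF, turning tokenized documents into weighted document vectors. The function name `tfidf_vectors` and the sample documents are hypothetical, and the exact TF-IDF variant (relative term frequency times the natural log of inverse document frequency) is an assumption.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build TF-IDF document vectors; a baseline TWM, not the proposed one."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        length = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append(
            [(tf[t] / length) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


# Hypothetical tokenized documents.
docs = [["a", "b"], ["a", "c"]]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
print(vectors)
```

A term that occurs in every document, such as "a" here, receives weight zero, which illustrates why the choice of weighting scheme affects which terms a classifier such as SVM or RF can exploit.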