93 research outputs found

    Two-layer classification and distinguished representations of users and documents for grouping and authorship identification

    Most studies on authorship identification report a drop in identification performance when the number of authors exceeds 20-25. In this paper, we introduce a new user representation to address this problem and split classification across two layers. There are at least three novelties in this paper. First, the two-layer approach allows authorship identification to be applied to a larger number of authors (tested over 100 authors), and it is extensible. The authors are divided into groups that each contain a smaller number of authors. Given an anonymous document, the primary layer detects the group to which the document belongs. Then, the secondary layer determines the particular author within the selected group. To extract groups that link similar authors, clustering is applied over users rather than documents. Hence, the second novelty of this paper is a new user representation that differs from the document representation. Without the proposed user representation, clustering over documents would scatter each author's documents across several clusters instead of giving each author a single cluster membership. Third, the extracted clusters are descriptive and meaningful for their users, as the dimensions have psychological backgrounds. For authorship identification, the documents are labelled with the extracted groups and fed into machine learning algorithms to build classification models that predict the group and author of a given document. The results show that the documents are highly correlated with their corresponding extracted groups, and the proposed model can be accurately trained to determine the group and the author identity.
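The two-layer flow described in this abstract (cluster users, predict the group, then predict the author inside that group) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the abstract does not specify the user representation, the clustering algorithm, or the classifiers, so the sketch substitutes the mean of an author's document vectors, a tiny k-means, and nearest-centroid classifiers. All function names and the toy feature values are hypothetical.

```python
from collections import defaultdict
from math import dist


def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def nearest(point, centroids):
    """Return the key of the centroid closest to point (Euclidean distance)."""
    return min(centroids, key=lambda key: dist(point, centroids[key]))


def cluster_users(user_vecs, k, iters=10):
    """Tiny k-means over *user* vectors; returns {author: group_id}."""
    authors = sorted(user_vecs)
    # Assumption: deterministic init, the first k authors seed the centroids.
    cents = {g: user_vecs[a] for g, a in enumerate(authors[:k])}
    assign = {}
    for _ in range(iters):
        assign = {a: nearest(user_vecs[a], cents) for a in authors}
        for g in cents:
            members = [user_vecs[a] for a in authors if assign[a] == g]
            if members:  # keep the old centroid if a cluster empties
                cents[g] = mean(members)
    return assign


def train(docs, k):
    """docs: list of (author, feature_vector). Returns the two-layer model."""
    by_author = defaultdict(list)
    for author, vec in docs:
        by_author[author].append(vec)
    # Assumption: stand-in user representation = mean of the author's documents.
    author_cents = {a: mean(vecs) for a, vecs in by_author.items()}
    groups = cluster_users(author_cents, k)
    by_group = defaultdict(list)
    for author, vec in docs:
        by_group[groups[author]].append(vec)
    group_cents = {g: mean(vecs) for g, vecs in by_group.items()}
    return group_cents, author_cents, groups


def predict(model, vec):
    """Layer 1 picks the group; layer 2 picks an author within that group."""
    group_cents, author_cents, groups = model
    group = nearest(vec, group_cents)
    candidates = {a: c for a, c in author_cents.items() if groups[a] == group}
    return group, nearest(vec, candidates)


# Toy 2-D feature vectors (hypothetical values, for illustration only).
TOY_DOCS = [
    ("ann", [0.0, 0.1]), ("ann", [0.1, 0.0]),
    ("bob", [1.0, 1.1]), ("bob", [1.1, 1.0]),
    ("cat", [9.0, 9.1]), ("cat", [9.1, 9.0]),
    ("dan", [10.0, 10.1]), ("dan", [10.1, 10.0]),
]
model = train(TOY_DOCS, k=2)
print(predict(model, [1.0, 0.9]))
```

Because clustering runs over one vector per author, every author lands in exactly one group, which is the property the abstract attributes to clustering users rather than documents.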

    Profiling a set of personality traits of text author: what our words reveal about us

    Authorship profiling, i.e. revealing information about an unknown author by analyzing their text, is a task of growing importance. One of the most urgent problems of authorship profiling (AP) is selecting text parameters which may correlate to an author’s personality. Most researchers’ selection of these is not underpinned by any theory. This article proposes an approach to AP which applies neuroscience data. The aim of the study is to assess the probability of self-destructive behaviour of an individual via formal parameters of their texts. Here we have used the “Personality Corpus”, which consists of Russian-language texts. A set of correlations between scores on the Freiburg Personality Inventory scales that are known to be indicative of self-destructive behaviour (“Spontaneous Aggressiveness”, “Depressiveness”, “Emotional Lability”, and “Composedness”) and text variables (average sentence length, lexical diversity etc.) has been calculated. Further, a mathematical model which predicts the probability of self-destructive behaviour has been obtained.
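The two text variables named in this abstract, and a correlation between a text variable and a scale score, can be sketched as follows. This is a minimal illustration, not the study's procedure: the abstract does not say how sentences are split, how lexical diversity is defined, or which correlation coefficient is used, so the sketch assumes a naive punctuation split, a type-token ratio, and Pearson's r. The function names are hypothetical.

```python
import re
from math import sqrt


def avg_sentence_length(text):
    """Mean number of word tokens per sentence (naive split on . ! ?)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)  # \w is Unicode-aware for str patterns
    return len(words) / len(sentences)


def type_token_ratio(text):
    """Lexical diversity as distinct lowercased words over total words."""
    words = [w.lower() for w in re.findall(r"\w+", text)]
    return len(set(words)) / len(words)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

In use, one would compute a text variable for each author's text, pair it with that author's score on a given scale, and pass the two lists to `pearson`.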

    Gender prediction from Tweets with convolutional neural networks: Notebook for PAN at CLEF 2018

    19th Working Notes of CLEF Conference and Labs of the Evaluation Forum, CLEF 2018; Avignon; France; 10 September 2018 through 14 September 2018. This paper presents a system developed for the author profiling task of PAN at CLEF 2018. The system uses style-based features to predict the gender of each user from their tweets. These features are automatically extracted by Convolutional Neural Networks (CNN). The system relies on the idea that tweets are not equally informative about the gender of a user. Thus, an attention mechanism is applied to the CNN outputs in order to single out the tweets that carry more information. Our architecture obtained competitive results on the three languages provided by the PAN 2018 author profiling challenge, with an average accuracy of 75.1% on local runs and 70.23% on the submission run.
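The idea of weighting tweets by how informative they are can be sketched as attention pooling over per-tweet feature vectors. This is a minimal illustration, not the system's architecture: the abstract does not give the scoring function, so the sketch assumes a dot product with a single query vector (a hypothetical learned parameter) followed by a softmax, and the per-tweet vectors stand in for CNN outputs.

```python
from math import exp


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(tweet_vecs, query):
    """Weight each tweet vector by softmax(dot(vec, query)) and sum them.

    Returns (weights, pooled) where pooled is the user-level vector.
    """
    scores = [sum(v * q for v, q in zip(vec, query)) for vec in tweet_vecs]
    weights = softmax(scores)
    dim = len(tweet_vecs[0])
    pooled = [
        sum(w * vec[i] for w, vec in zip(weights, tweet_vecs))
        for i in range(dim)
    ]
    return weights, pooled
```

With a zero query every tweet gets the same weight and the pooling reduces to a plain average; a query aligned with one tweet's vector shifts most of the weight to that tweet.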

    Gender prediction from tweets: Improving neural representations with hand-crafted features

    Author profiling is the characterization of an author through some key attributes such as gender, age, and language. In this paper, an RNN model with Attention (RNNwA) is proposed to predict the gender of a Twitter user from their tweets. Both word-level and tweet-level attention are used to learn ‘where to look’. This model is improved by concatenating LSA-reduced n-gram features with the learned neural representation of a user. Both models are tested on three languages: English, Spanish, and Arabic. The improved version of the proposed model (RNNwA + n-gram) achieves state-of-the-art performance on English and competitive results on Spanish and Arabic.

    Lost in Translation: What Linguistic Measurements Best Measure Text Quality of Online Listings

    Ecommerce websites are filled with international sellers. Product descriptions on these sites are often written in English by non-native speakers. Linguistic imperfections in these descriptions confuse consumers, which may further attenuate their purchase intentions. How descriptive quality/efficacy can be defined and then improved shall be of great interest to all sellers and their consumers. In this research, we attempt to evaluate online product description quality using lexical measurements from linguistics studies. Linguistics measurements of writing quality were mostly developed in pure academic settings. We test and analyze these measurements' applicability in defining and contrasting business description quality using Amazon.com data. Modern classification techniques in the artificial intelligence and machine learning field are deployed in identifying measurement applicability and assessing computational efficiency. Our findings enable automatic identification of descriptive efficacy through artificial intelligence methods on real ecommerce text data.