A Spanish text corpus for the author profiling task

Abstract

Author profiling is the task of predicting characteristics of a text's author, such as age, gender, personality, and native language. The task is of growing importance because of its potential applications in security, crime, and marketing, among others. One of the main difficulties in this field is the lack of reliable text collections (corpora) for training and testing automatically derived classifiers, particularly for specific languages such as Spanish. Although some data sets were recently generated for the PAN competitions, those documents contain a lot of “noise” that prevents researchers from drawing more general conclusions about the task when more formal documents are used. In this context, this work proposes and describes SpanText, a collection of formal texts in Spanish which is, as far as we know, the first collection with these characteristics for the author profiling task. In addition, an experimental study is carried out that clearly establishes the difference in performance obtained with formal and informal texts, and opens interesting lines of research toward a deeper understanding of the particular challenges that each type of document poses for the author profiling task.

XI Workshop Bases de Datos y Minería de Datos. Red de Universidades con Carreras de Informática (RedUNCI).
