    A Spanish text corpus for the author profiling task

    Author profiling is the task of predicting characteristics of the author of a text, such as age, gender, personality, or native language. The task is of growing importance due to its potential applications in security, crime, and marketing, among others. One of the main difficulties in this field is the lack of reliable text collections (corpora) to train and test automatically derived classifiers, particularly in specific languages such as Spanish. Although some recent data sets were generated for the PAN competitions, these documents contain a lot of “noise” that prevents researchers from drawing more general conclusions about the task when more formal documents are used. In this context, this work proposes and describes SpanText, a collection of formal texts in Spanish which is, as far as we know, the first collection with these characteristics for the author profiling task. In addition, an experimental study clearly establishes the difference in performance obtained with formal and informal texts and opens interesting research lines toward a deeper understanding of the particular challenges that each type of document poses for the author profiling task. XI Workshop Bases de Datos y Minería de Datos. Red de Universidades con Carreras de Informática (RedUNCI)

    Determining geographic origin of social media users with Bayesian Analysis of common syntactical and spelling errors when using foreign languages

    With the growing influence and importance of social media, the need to categorize the authors of overt text on social media by geographic origin is becoming more urgent than ever. To this end, several methods have been developed, for instance classifying by the author's language, by time zone, or by the geographic terms used in the text. This thesis explores a distinct classifier for determining a social media user's geographic background: the Native Language Classifier, which infers an author's native language from text they have written in English. The classifier was trained on an English corpus of 6 million words written by 800 authors from 4 language backgrounds: Chinese, Russian, Spanish, and French. In a test on 200 users (50 from each language group), it reached an overall accuracy of 75% in assigning authors to the correct language background by combining the results of word-level n-gram algorithms, character-level n-gram algorithms, and a spell-checking algorithm. The approach should be valuable both to social media analysts and to text classification researchers. Beyond the classification results, the tests also yielded some interesting observations that reveal regularities underlying the languages. The method developed in this thesis could therefore also become a useful tool for researchers analyzing the features of these languages.

    Representation and use of chemistry in the global electronic age.

    We present an overview of the current state of public semantic chemistry and propose new approaches at a strategic and a detailed level. We show by example how a model for a Chemical Semantic Web can be constructed using machine-processed data and information from journal articles. This manuscript addresses questions of robotic access to data and its automatic re-use, including the role of Open Access archival of data. This is a pre-refereed preprint allowed by the publisher's (Royal Soc. Chemistry) Green policy. The author's preferred manuscript is an HTML hyperdocument with ca. 20 links to images, some of which are JPEGs and some of which are SVG (scalable vector graphics), including animations. There are also links to molecules in CML, for which the Jmol viewer is recommended. We suggest that readers who wish to see the full glory of the manuscript download the zipped version and unpack it on their machine. We also supply PDF and DOC (Word) versions, which obviously cannot show the animations but may be the best place to start, particularly for those more interested in the text.

    Mining Semantic Loop Idioms


    Textual History of Li Livres dou tresor: Fitting the Pieces Together

    Modern editors of medieval texts all face the singular difficulty of determining which version of a text they will edit. Will they adhere to one manuscript? Will they attempt to recreate the author's original? Will they eliminate or include interpolations and glosses? In the Middle Ages, the concepts of literary originality and authorship were not exalted as they are today. In fact, as succinctly stated by Cerquiglini (1989, 25), “L'auteur n'est pas une idee medievale.” Rather, literary compositions were fluid artifacts which were commonly modified with every copying or recitation, although they were frequently attributed to one source. Today, when faced with several extant versions of a given text, scholars of medieval texts must inevitably choose one for publication and subsequent incorporation into the literary canon. As Speer (1991, 42) asserts, the factors which determine how an editor shapes his/her text can be found in a three-fold response to the question “What is the text?” These factors are (1) the material considerations, grounded in codicological evidence; (2) literary history, which considers the author and his socio-historical milieu; (3) theoretical perspectives, stemming from the intent of the piece.

    Book Reviews
