Style over substance: A psychologically informed approach to feature selection and generalisability for author classification
Data availability: Data will be made available on request. Copyright © 2023 The Authors. Author profiling, or classifying user-generated content based on demographic or other personal attributes, is a key task in social media-based research. Whilst high accuracy has been achieved on many attributes, most studies train and test models on a single domain only, ignoring cross-domain performance. Research shows that models often transfer poorly to new domains because they depend heavily on topic-specific (i.e., lexical) features. Field-specific knowledge (e.g., from Psychology or Political Science) is often ignored in favour of data-driven algorithms for feature development and selection.
Focusing on political affiliation, we evaluate an approach that selects stylistic features according to known psychological correlates (personality traits) of this attribute. Training data were collected from Reddit posts made by regular users of the political subreddits r/republican and r/democrat. A second, non-political dataset was created by collecting posts by the same users in different subreddits.
Our results show that introducing domain-specific knowledge in the form of psychologically informed stylistic features resulted in better out-of-training-domain performance than lexical or more commonly used stylistic features.
Making Predictions with Textual Contents
Forecasting real-world quantities based on information from textual descriptions has recently attracted significant interest as a research problem, although previous studies have focused on applications involving only the English language.
This document presents an experimental study on making predictions with textual contents written in Portuguese, using documents from three distinct domains. I specifically report on experiments using different types of regression models, state-of-the-art feature weighting schemes, and features derived from cluster-based word representations.
Through controlled experiments, I show that prediction models using the textual information achieve better results than simple baselines such as taking the average value over the training data, and that richer document representations (i.e., using Brown clusters and the Delta TF-IDF feature weighting scheme) result in slight performance improvements.
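The comparison against an average-value baseline described in this abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy data, the function names, the use of mean absolute error, and the word-overlap nearest-neighbour predictor, which merely stands in for a text-based model and is not one of the regression models used in the study.

```python
from statistics import mean


def mean_baseline(train_targets):
    """Return a predictor that ignores the text and always outputs the training mean."""
    avg = mean(train_targets)
    return lambda text: avg


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def nearest_neighbour_model(train_texts, train_targets):
    """Toy text-based predictor (hypothetical stand-in for a real regression model).

    Predicts the target of the training document whose word set has the
    highest Jaccard overlap with the input text.
    """
    train_sets = [set(t.lower().split()) for t in train_texts]

    def predict(text):
        words = set(text.lower().split())

        def jaccard(other):
            union = words | other
            return len(words & other) / len(union) if union else 0.0

        best = max(range(len(train_sets)), key=lambda i: jaccard(train_sets[i]))
        return train_targets[best]

    return predict


# Hypothetical toy data, not taken from the study.
train_texts = [
    "small flat city centre",
    "large house with garden",
    "small studio flat",
    "large villa with pool and garden",
]
train_targets = [100.0, 300.0, 80.0, 500.0]
test_texts = ["small flat", "large house with pool"]
test_targets = [90.0, 400.0]

baseline = mean_baseline(train_targets)
model = nearest_neighbour_model(train_texts, train_targets)

baseline_mae = mean_absolute_error(test_targets, [baseline(t) for t in test_texts])
model_mae = mean_absolute_error(test_targets, [model(t) for t in test_texts])
print(f"baseline MAE: {baseline_mae:.1f}, text model MAE: {model_mae:.1f}")
```

On this toy data the text-based predictor has a lower error than the mean baseline, which is the kind of comparison the abstract reports; the sketch says nothing about how large the difference is on real data.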
Workshop Proceedings of the 12th edition of the KONVENS conference
The 2014 issue of KONVENS is even more of a forum for exchange: its main topic is the interaction between Computational Linguistics and Information Science, and the synergies that such interaction, cooperation and integrated views can produce. This topic, at the crossroads of different research traditions which deal with natural language as a container of knowledge and with methods to extract and manage knowledge that is linguistically represented, is close to the heart of many researchers at the Institut für Informationswissenschaft und Sprachtechnologie of Universität Hildesheim: it has long been one of the institute’s research topics, and it has received even more attention over the last few years.