Two-layer classification and distinguished representations of users and documents for grouping and authorship identification
Most studies on authorship identification report a drop in identification performance when the number of authors exceeds 20-25. In this paper, we introduce a new user representation to address this problem and split classification across two layers. There are at least three novelties in this paper. First, the two-layer approach allows authorship identification to be applied over a larger number of authors (tested over 100 authors), and it is extendable. The authors are divided into groups that each contain a smaller number of authors. Given an anonymous document, the primary layer detects the group to which the document belongs. Then, the secondary layer determines the particular author inside the selected group. In order to extract groups that link similar authors, clustering is applied over users rather than documents. Hence, the second novelty of this paper is a new user representation that is different from the document representation. Without the proposed user representation, clustering over documents would leave each author's documents distributed over several clusters, instead of giving each author a single cluster membership. Third, the extracted clusters are descriptive of their users and meaningful, as the dimensions have psychological backgrounds. For authorship identification, the documents are labelled with the extracted groups and fed into machine learning to build classification models that predict the group and author of a given document. The results show that the documents are highly correlated with their corresponding extracted groups, and that the proposed model can be accurately trained to determine the group and the author identity.
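The two-layer routing described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the nearest-centroid classifier is a stand-in for whatever machine learning models the paper trains, the author-to-group mapping is assumed to come from a prior clustering of users, and the names `TwoLayerIdentifier`, `centroid`, and `nearest` are hypothetical.

```python
from collections import defaultdict
from math import dist


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def nearest(centroids, x):
    """Return the label whose centroid is closest to x (Euclidean distance)."""
    return min(centroids, key=lambda label: dist(centroids[label], x))


class TwoLayerIdentifier:
    """Layer 1 picks a group of authors; layer 2 picks an author in that group."""

    def __init__(self, author_group):
        # Assumption: maps author -> group label, produced beforehand by
        # clustering user representations (not shown here).
        self.author_group = author_group

    def fit(self, docs, authors):
        by_group = defaultdict(list)
        by_author = defaultdict(list)
        for x, author in zip(docs, authors):
            by_group[self.author_group[author]].append(x)
            by_author[author].append(x)
        # Primary layer: one centroid per group.
        self.group_centroids = {g: centroid(v) for g, v in by_group.items()}
        # Secondary layer: one centroid per author, stored under the author's group.
        self.author_centroids = defaultdict(dict)
        for author, vectors in by_author.items():
            group = self.author_group[author]
            self.author_centroids[group][author] = centroid(vectors)
        return self

    def predict(self, x):
        group = nearest(self.group_centroids, x)            # primary layer
        author = nearest(self.author_centroids[group], x)   # secondary layer
        return group, author
```

Because the secondary layer only compares authors inside the selected group, each individual decision involves far fewer candidates than a flat classifier over all authors. An error in the primary layer cannot be recovered by the secondary one in this sketch.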
A Multiplicative Model for Learning Distributed Text-Based Attribute Representations
In this paper we propose a general framework for learning distributed
representations of attributes: characteristics of text whose representations
can be jointly learned with word embeddings. Attributes can correspond to
document indicators (to learn sentence vectors), language indicators (to learn
distributed language representations), meta-data and side information (such as
the age, gender and industry of a blogger) or representations of authors. We
describe a third-order model where word context and attribute vectors interact
multiplicatively to predict the next word in a sequence. This leads to the
notion of conditional word similarity: how meanings of words change when
conditioned on different attributes. We perform several experimental tasks
including sentiment classification, cross-lingual document classification, and
blog authorship attribution. We also qualitatively evaluate conditional word
neighbours and attribute-conditioned text generation. Comment: 11 pages. An earlier version was accepted to the ICML-2014 Workshop
on Knowledge-Powered Deep Learning for Text Mining
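As a rough illustration of what a multiplicative interaction between word context and attribute vectors can look like, the sketch below projects both vectors into a shared factor space, multiplies them element-wise, and scores each vocabulary word from the gated factors. This factored form, the matrix names `W_ctx`, `W_attr`, `W_out`, and the function names are assumptions made for illustration; the abstract does not specify the parameterization.

```python
from math import exp


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def next_word_probs(context, attribute, W_ctx, W_attr, W_out):
    """Distribution over the next word given a context and an attribute vector."""
    f_ctx = matvec(W_ctx, context)       # context projected to factor space
    f_attr = matvec(W_attr, attribute)   # attribute projected to factor space
    # Multiplicative (three-way) interaction: the attribute gates each factor.
    gated = [c * a for c, a in zip(f_ctx, f_attr)]
    return softmax(matvec(W_out, gated))  # one score per vocabulary word
```

With the same context, changing the attribute vector rescales or flips individual factors and so changes the predicted distribution, which is one way to read the abstract's notion of conditional word similarity.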
Mining online diaries for blogger identification
In this paper, we present an investigation of authorship
identification on personal blogs or diaries, which differ from other types of text such as essays, emails, or articles in their text properties. The investigation uses a couple of intuitive feature sets and studies various parameters that affect identification performance.
Many studies have addressed the problem of authorship
identification in manually collected corpora, but only a few
have used real data from existing blogs. The complexity of
the language model in personal blogs motivates the task of
identifying the corresponding author. The main contribution
of this work is at least threefold. First, we use the LIWC and MRC feature sets together, which were
developed with a psychology background, for the first time
for authorship identification on personal blogs. Second, we analyze the effect of various parameters and feature sets on identification performance. These include the number of authors in the data corpus, the post size or word count, and the number of posts for each author.
Finally, we study applying authorship identification to a limited set of users who share common personality attributes. This analysis is motivated by the lack of standard or solid recommendations in the literature for such a task, especially in the domain of personal blogs.
The results and evaluation show that the features used
are compact while their performance is highly comparable
with that of other, larger feature sets. The analysis also confirmed
the most effective parameters, their ranges in the data
corpus, and the usefulness of the common-users classifier
in improving performance on the author identification
task.
Computational Sociolinguistics: A Survey
Language is a social phenomenon and variation is inherent to its social
nature. Recently, there has been a surge of interest within the computational
linguistics (CL) community in the social dimension of language. In this article
we present a survey of the emerging field of "Computational Sociolinguistics"
that reflects this increased interest. We aim to provide a comprehensive
overview of CL research on sociolinguistic themes, featuring topics such as the
relation between language and social identity, language use in social
interaction and multilingual communication. Moreover, we demonstrate the
potential for synergy between the research communities involved, by showing how
the large-scale data-driven methods that are widely used in CL can complement
existing sociolinguistic studies, and how sociolinguistics can inform and
challenge the methods and assumptions employed in CL studies. We hope to convey
the possible benefits of a closer collaboration between the two communities and
conclude with a discussion of open challenges. Comment: To appear in Computational Linguistics. Accepted for publication:
18th February, 201
The Effect of Gender in the Publication Patterns in Mathematics
Despite the increasing number of women graduating in mathematics, a systemic
gender imbalance persists and is signified by a pronounced gender gap in the
distribution of active researchers and professors. Especially at the level of
university faculty, women mathematicians continue being drastically
underrepresented, decades after the first affirmative action measures have been
put into place. A solid publication record is of paramount importance for
securing permanent positions. Thus, the question arises whether the publication
patterns of men and women mathematicians differ in a significant way. Making
use of the zbMATH database, one of the most comprehensive metadata sources on
mathematical publications, we analyze the scholarly output of ~150,000
mathematicians from the past four decades whose gender we algorithmically
inferred. We focus on development over time, collaboration through
coauthorships, presumed journal quality and distribution of research topics --
factors known to have a strong impact on job prospects. We report
significant differences between genders which may put women at a disadvantage
when pursuing an academic career in mathematics. Comment: 24 pages, 12 figures