Computational Sociolinguistics: A Survey
Language is a social phenomenon and variation is inherent to its social
nature. Recently, there has been a surge of interest within the computational
linguistics (CL) community in the social dimension of language. In this article
we present a survey of the emerging field of "Computational Sociolinguistics"
that reflects this increased interest. We aim to provide a comprehensive
overview of CL research on sociolinguistic themes, featuring topics such as the
relation between language and social identity, language use in social
interaction and multilingual communication. Moreover, we demonstrate the
potential for synergy between the research communities involved, by showing how
the large-scale data-driven methods that are widely used in CL can complement
existing sociolinguistic studies, and how sociolinguistics can inform and
challenge the methods and assumptions employed in CL studies. We hope to convey
the possible benefits of a closer collaboration between the two communities and
conclude with a discussion of open challenges.
Comment: To appear in Computational Linguistics. Accepted for publication:
18th February, 201
Fuzzy rule based systems for gender classification from blog data
Gender classification is a popular machine learning task, which has been applied in various areas, such as business intelligence, access control and cyber security. In the context of information granulation, gender related information can be divided into three types, namely, biological information, vision based information and social network based information. In traditional machine learning, gender identification has typically been treated as a discriminative classification task, i.e. it is aimed at learning a classifier that discriminates between male and female. In this paper, we argue that it is not always appropriate to identify gender by way of discriminative classification, especially given that both male and female people are highly diverse, so individuals of different genders can be highly similar to each other in terms of their characteristics. In order to address this issue, we propose the use of a fuzzy approach for generative classification of gender. In particular, we focus on gender classification based on social network information. We conduct an experimental study using a blog data set, and compare the fuzzy approach with C4.5, Naive Bayes and Support Vector Machine in terms of classification performance. The results show that the fuzzy approach outperforms the other approaches and is also capable of capturing the diversity of both male and female people and dealing with the fuzziness in terms of gender identification.
Author Profiling for English and Arabic Emails
This paper reports on some aspects of a research project aimed at automating the analysis of texts for the purpose of author profiling and identification. The Text Attribution Tool (TAT) was developed for the purpose of language-independent author profiling and has now been trained on two email corpora, English and Arabic. The complete analysis provides probabilities for the author’s basic demographic traits (gender, age, geographic origin, level of education and native language) as well as for five psychometric traits. The prototype system also provides a probability of a match with other texts, whether from known or unknown authors. A very important part of the project was the data collection and we give an overview of the collection process as well as a detailed description of the corpus of email data which was collected. We describe the overall TAT system and its components before outlining the ways in which the email data is processed and analysed. Because Arabic presents particular challenges for NLP, this paper also describes more specifically the text processing components developed to handle Arabic emails. Finally, we describe the Machine Learning setup used to produce classifiers for the different author traits and we present the experimental results, which are promising for most traits examined. The work presented in this paper was carried out while the authors were working at Appen Pty Ltd., Chatswood NSW 2067, Australia.
Evaluation and Sociolinguistic Analysis of Text Features for Gender and Age Identification
The paper presents an interdisciplinary study of automatic gender and age identification, informed by sociolinguistic knowledge of the gender- and age-related linguistic choices that social media users make. The authors gathered standard and novel text features used in text mining approaches to author demographic profiling and examined their efficacy in gender and age detection tasks on a corpus consisting of social media texts. The most informative features are analysed according to the nature of each feature, and the information derived from the features' importance scores is discussed.
Using text mining algorithm to detect gender deception based on Malaysian chat room lingo / Dianne L. M. Cheong and Nur Atiqah Sia Abdullah @ Sia Sze Yieng
E-mail can be a fantasy playground for identity experimentation where players take on an imaginary persona and interact with each other in the virtual world. Therefore, gender deception is difficult, risky and it can be abandoned at will. Inference can be made both from writing style and from clues hidden in the posting data. A text-mining algorithm was designed to detect gender deception based on gender-preferential features at the word or clause level of Malaysian e-mail users. Based on this algorithm, a prototype in Visual Basic was developed. It was tested with 16 documents; each consists of 5 e-mail exchanges of respective individuals. The tests showed that the prototype has an accuracy level of 81.3%. This is consistent with a human reader of the documents. This prototype can be a tool to assist interested parties such as the Criminology and Forensic Department, e-mail users and virtual communities to successfully identify gender deception.
Negative emotions boost users activity at BBC Forum
We present an empirical study of user activity in online BBC discussion
forums, measured by the number of posts written by individual debaters and the
average sentiment of these posts. Nearly 2.5 million posts from over 18
thousand users were investigated. Scale free distributions were observed for
activity in individual discussion threads as well as for overall activity. The
number of unique users in a thread normalized by the thread length decays with
thread length, suggesting that thread life is sustained by mutual discussions
rather than by independent comments. Automatic sentiment analysis shows that
most posts contain negative emotions and the most active users in individual
threads express predominantly negative sentiments. It follows that the average
emotion of longer threads is more negative and that threads can be sustained by
negative comments. An agent based computer simulation model has been used to
reproduce several essential characteristics of the analyzed system. The model
stresses the role of discussions between users, especially emotionally laden
quarrels between supporters of opposite opinions, and represents many observed
statistics of the forum.
Comment: 29 pages, 6 figures