3,424 research outputs found
More blogging features for author identification
In this paper we present a novel improvement in authorship identification for personal blogs. The improvement comes from a hybrid collection of linguistic features that best capture the style of users in diary blogs. The feature sets comprise LIWC, with its psychological background, a collection of syntactic and part-of-speech (POS) features, and misspelling-error features.
Furthermore, we analyze the contribution of each feature set to the final result and compare the outcomes of different combinations of the selected feature sets. Our new categorization of misspelled words, which are mapped into numerical features, noticeably enhances the classification results. The paper also confirms the best ranges of several parameters that affect the final result of authorship identification, such as the number of authors, the number of words in each post, and the number of documents/posts per author/user. The results and evaluation show that the utilized features are compact, while their performance is highly comparable with that of other, much larger feature sets.
Classification of the Stance in Online Debates Using the Dependency Relations Feature
Online discussion forums offer Internet users a medium for discussing current political debates. A debate is a system of claims regarding interactivity and representation. Users make claims in an online discussion with superior content to support their position. Factual accuracy and emotional appeal are critical attributes used to convince readers. A key challenge in debate forums is to identify the participants' stances, which are interdependent and interconnected. This research work aims to construct a classifier that takes the linguistic features of the posts as input and outputs a predicted stance label for each post. Three types of features (lexical, dependency, and morphology) are used to detect the stance of the posts. Lexical features such as cue words are employed as surface features, while the deep features comprise dependency and morphology features. A Multinomial Naïve Bayes classifier is used to build a model for classifying stance, and the Chi-Square method is used to select a good feature set. The performance of the stance classification system is evaluated in terms of accuracy. The proposed system labels the stance of each post as for or against by analyzing the surface and deep features that capture the content of the post.
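The pipeline named in the abstract above (Chi-Square feature selection followed by a Multinomial Naïve Bayes classifier) can be illustrated with a minimal, standard-library-only sketch. Everything here is an assumption made for illustration: the toy posts, the whitespace tokenization, the Laplace smoothing constant `alpha`, and the helper names `chi_square_scores`, `select_features`, `train_nb`, and `predict`. The sketch uses only word tokens; it does not reproduce the paper's dependency or morphology features, nor its data.

```python
import math
from collections import Counter, defaultdict


def chi_square_scores(docs, labels):
    """Chi-square statistic of each token's presence/absence against the labels."""
    n = len(docs)
    class_counts = Counter(labels)
    # df[token][label] = number of documents with that label containing the token
    df = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        for tok in set(tokens):
            df[tok][label] += 1
    scores = {}
    for tok, per_class in df.items():
        n_tok = sum(per_class.values())
        score = 0.0
        for label, n_label in class_counts.items():
            for present in (True, False):
                observed = per_class[label] if present else n_label - per_class[label]
                expected = n_label * (n_tok if present else n - n_tok) / n
                if expected > 0:
                    score += (observed - expected) ** 2 / expected
        scores[tok] = score
    return scores


def select_features(docs, labels, k):
    """Keep the k tokens with the highest chi-square score (ties broken by name)."""
    scores = chi_square_scores(docs, labels)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return set(ranked[:k])


def train_nb(docs, labels, vocab, alpha=1.0):
    """Multinomial Naive Bayes with Laplace smoothing over the selected vocabulary."""
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        token_counts[label].update(t for t in tokens if t in vocab)
    model = {}
    for label, n_label in class_counts.items():
        total = sum(token_counts[label].values())
        log_prior = math.log(n_label / len(docs))
        log_lik = {
            t: math.log((token_counts[label][t] + alpha) / (total + alpha * len(vocab)))
            for t in vocab
        }
        model[label] = (log_prior, log_lik)
    return model


def predict(model, tokens):
    """Return the label with the highest posterior log-probability."""
    def score(label):
        log_prior, log_lik = model[label]
        return log_prior + sum(log_lik[t] for t in tokens if t in log_lik)
    return max(model, key=score)


# Hypothetical toy posts; real debate data would be far larger and noisier.
docs = [
    "i agree this policy is right".split(),
    "i support the policy strongly".split(),
    "i disagree this policy is wrong".split(),
    "i oppose the policy strongly".split(),
]
labels = ["for", "for", "against", "against"]

vocab = select_features(docs, labels, k=6)
model = train_nb(docs, labels, vocab)
print(predict(model, "i support this".split()))
```

Tokens that occur equally often in both classes (such as "policy") receive a chi-square score of zero and are dropped, which is the reason for running feature selection before the classifier.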
Lightme: Analysing Language in Internet Support Groups for Mental Health
Background: Assisting moderators to triage harmful posts in Internet Support
Groups is relevant to ensure their safe use. Automated text classification
methods that analyse the language expressed in posts of online forums are a
promising solution. Methods: Natural Language Processing and Machine Learning
technologies were used to build a triage post classifier using a dataset from
the Reachout mental health forum for young people. Results: Compared with the
state of the art, a solution based mainly on features from lexical resources
achieved the best classification performance for the crisis posts (52%), which
is the most severe class. Six salient linguistic characteristics were found
when analysing the crisis posts: 1) posts expressing hopelessness, 2) short
posts expressing concise negative emotional responses, 3) long posts expressing
variations of emotions, 4) posts expressing dissatisfaction with available
health services, 5) posts utilising storytelling, and 6) posts expressing users
seeking advice from peers during a crisis. Conclusion: It is possible to build
a competitive triage classifier using features derived only from the textual
content of the post. Further research needs to be done in order to translate
our quantitative and qualitative findings into features, as this may improve
overall performance.
Argumentation Mining in User-Generated Web Discourse
The goal of argumentation mining, an evolving research field in computational
linguistics, is to design methods capable of analyzing people's argumentation.
In this article, we go beyond the state of the art in several ways. (i) We deal
with actual Web data and take up the challenges given by the variety of
registers, multiple domains, and unrestricted noisy user-generated Web
discourse. (ii) We bridge the gap between normative argumentation theories and
argumentation phenomena encountered in actual data by adapting an argumentation
model tested in an extensive annotation study. (iii) We create a new gold
standard corpus (90k tokens in 340 documents) and experiment with several
machine learning methods to identify argument components. We offer the data,
source codes, and annotation guidelines to the community under free licenses.
Our findings show that argumentation mining in user-generated Web discourse is
a feasible but challenging task.
Comment: Cite as: Habernal, I. & Gurevych, I. (2017). Argumentation Mining in
User-Generated Web Discourse. Computational Linguistics 43(1), pp. 125-17
Collective emotions online and their influence on community life
E-communities, social groups interacting online, have recently become an
object of interdisciplinary research. As with face-to-face meetings, Internet
exchanges may not only include factual information but also emotional
information - how participants feel about the subject discussed or other group
members. Emotions are known to be important in affecting interaction partners
in offline communication in many ways. Could emotions in Internet exchanges
affect others and systematically influence quantitative and qualitative aspects
of the trajectory of e-communities? The development of automatic sentiment
analysis has made large scale emotion detection and analysis possible using
text messages collected from the web. It is not clear if emotions in
e-communities primarily derive from individual group members' personalities or
if they result from intra-group interactions, and whether they influence group
activities. We show the collective character of affective phenomena on a large
scale as observed in 4 million posts downloaded from Blogs, Digg and BBC
forums. To test whether the emotions of a community member may influence the
emotions of others, posts were grouped into clusters of messages with similar
emotional valences. The frequency of long clusters was much higher than it
would be if emotions occurred at random. Distributions for cluster lengths can
be explained by preferential processes because conditional probabilities for
consecutive messages grow as a power law with cluster length. For BBC forum
threads, average discussion lengths were higher for larger values of absolute
average emotional valence in the first ten comments and the average amount of
emotion in messages fell during discussions. Our results prove that collective
emotional states can be created and modulated via Internet communication and
that emotional expressiveness is the fuel that sustains some e-communities.
Comment: 23 pages including Supporting Information, accepted to PLoS ON
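The clustering step described in the abstract above (grouping consecutive messages with similar emotional valence and examining how cluster length relates to the chance that a cluster continues) can be sketched as follows. This is only an illustration under stated assumptions: valences are taken to be already discretized to -1, 0, or 1, a cluster is taken to be a maximal run of identical values, and the function names and the toy thread are hypothetical. The paper's exact definitions may differ.

```python
from itertools import groupby


def cluster_lengths(valences):
    """Split a thread into maximal runs of identical valence.

    Returns a list of (valence, run_length) pairs in thread order.
    """
    return [(v, len(list(run))) for v, run in groupby(valences)]


def continuation_probability(lengths, n):
    """Empirical probability that a cluster that reached length n reaches n + 1.

    Returns None when no cluster of length >= n was observed.
    """
    reached_n = sum(1 for length in lengths if length >= n)
    reached_next = sum(1 for length in lengths if length >= n + 1)
    return reached_next / reached_n if reached_n else None


# Hypothetical thread: -1 negative, 0 neutral, 1 positive.
thread = [-1, -1, -1, 0, 1, 1, -1, -1, -1, -1, 0]
clusters = cluster_lengths(thread)
negative_lengths = [length for valence, length in clusters if valence == -1]
print(clusters)
print(continuation_probability(negative_lengths, 3))
```

Comparing these empirical probabilities with those from a shuffled copy of the same thread is one way to check whether long clusters occur more often than chance would predict.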
What Goes Around Comes Around: Learning Sentiments in Online Medical Forums
Currently 19%-28% of Internet users participate in online health discussions. A 2011 survey of the US population estimated that 59% of all adults have looked online for information about health topics such as a specific disease or treatment. Although empirical evidence strongly supports the importance of emotions in health-related messages, there are few studies of the relationship between subjective language and online discussions of personal health. In this work, we study sentiments expressed on online medical forums. As well as considering the predominant sentiments expressed in individual posts, we analyze sequences of sentiments in online discussions. Individual posts are classified into one of five categories. We identified three categories as sentimental (encouragement, gratitude, confusion) and two categories as neutral (facts, endorsement). 1438 messages from 130 threads were annotated manually by two annotators with a strong inter-annotator agreement (Fleiss kappa = 0.737 and 0.763 for posts in sequence and separate posts respectively). The annotated posts were used to analyse sentiments in consecutive posts. In four multi-class classification problems, we assessed HealthAffect, a domain-specific affective lexicon, as well as general sentiment lexicons in their ability to represent messages in sentiment recognition
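The agreement figures quoted in the abstract above are Fleiss' kappa values. As a reminder of how that statistic is computed, here is a minimal sketch of the standard formula. The function name `fleiss_kappa`, the input layout (one list of labels per post, one label per annotator), and the four toy posts are assumptions for illustration and are unrelated to the study's actual annotations.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for items that were each labelled by the same number of raters.

    ratings: list of per-item label lists, e.g. [["facts", "facts"], ...].
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Share of rater pairs that agree on this item.
        agreeing = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += agreeing / (n_raters * (n_raters - 1))

    observed = agreement_sum / n_items
    # Chance agreement from the overall category proportions.
    expected = sum(
        (total / (n_items * n_raters)) ** 2 for total in category_totals.values()
    )
    if expected == 1.0:
        return 1.0  # only one category was ever used; agreement is trivially perfect
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators for four posts.
annotator_a = ["facts", "gratitude", "confusion", "facts"]
annotator_b = ["facts", "gratitude", "facts", "facts"]
kappa = fleiss_kappa([list(pair) for pair in zip(annotator_a, annotator_b)])
print(round(kappa, 3))
```

Kappa corrects raw agreement for the agreement expected by chance, so a heavily used category such as "facts" lowers the score relative to the plain percentage of matching labels.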
Mining online diaries for blogger identification
In this paper, we present an investigation of authorship identification on personal blogs or diaries, which differ from other types of text such as essays, emails, or articles in their text properties. The investigation utilizes a couple of intuitive feature sets and studies various parameters that affect the identification performance.
Many studies have addressed the problem of authorship identification in manually collected corpora, but only a few have utilized real data from existing blogs. The complexity of the language in personal blogs makes identifying the corresponding author a motivating problem. The main contribution of this work is at least threefold. First, we utilize the LIWC and MRC feature sets together, which were developed with a psychology background, for the first time for authorship identification on personal blogs. Second, we analyze the effect of various parameters and feature sets on the identification performance. These include the number of authors in the data corpus, the post size or word count, and the number of posts for each author.
Finally, we study applying authorship identification over a limited set of users that have common personality attributes. This analysis is motivated by the lack of standard or solid recommendations in the literature for such a task, especially in the domain of personal blogs.
The results and evaluation show that the utilized features
are compact while their performance is highly comparable
with other larger feature sets. The analysis also confirmed
the most effective parameters, their ranges in the data
corpus, and the usefulness of the common users classifier
in improving the performance for the author identification
task.