3,424 research outputs found

    More blogging features for author identification

    Get PDF
    In this paper we present a novel improvement in the field of authorship identification in personal blogs. The improvement in authorship identification, in our work, is by utilizing a hybrid collection of linguistic features that best capture the style of users in diaries blogs. The features sets contain LIWC with its psychology background, a collection of syntactic features & part-of-speech (POS), and the misspelling errors features. Furthermore, we analyze the contribution of each feature set on the final result and compare the outcome of using different combination from the selected feature sets. Our new categorization of misspelling words which are mapped into numerical features, are noticeably enhancing the classification results. The paper also confirms the best ranges of several parameters that affect the final result of authorship identification such as the author numbers, words number in each post, and the number of documents/posts for each author/user. The results and evaluation show that the utilized features are compact, while their performance is highly comparable with other much larger feature sets

    Classification of the Stance in Online Debates Using the Dependency Relations Feature

    Get PDF
    Online discussion forums offer Internet users a medium for discussions about current political debates. The debate is a system of claims regarding interactivity and representation. Users make claims in an online discussion with superior content to support their position. Factual accuracy and emotional appeal are critical attributes used to convince readers. A key challenge in debate forums is to identify the participants’ stance, each of which is inter-dependent and inter-connected. This research work aims to construct a classifier that takes the linguistic features of the posts as input and outputs predictions for the stance label of each post. Three types of features which include Lexical, Dependency, and Morphology are used to detect the stance of the posts. Lexical features such as cue words are employed as surface features, and deep features include dependency and morphology features. Multinomial Naïve Bayes classifier is used to build a model for classifying stance and the Chi-Square method is used to select the good feature set. The performance of the stance classification system is evaluated in terms of accuracy. The result of stance labels for this proposed research represents as for and against by analyzing the surface and deep features that capture the content of a post

    Lightme: Analysing Language in Internet Support Groups for Mental Health

    Full text link
    Background: Assisting moderators to triage harmful posts in Internet Support Groups is relevant to ensure its safe use. Automated text classification methods analysing the language expressed in posts of online forums is a promising solution. Methods: Natural Language Processing and Machine Learning technologies were used to build a triage post classifier using a dataset from Reachout mental health forum for young people. Results: When comparing with the state-of-the-art, a solution mainly based on features from lexical resources, received the best classification performance for the crisis posts (52%), which is the most severe class. Six salient linguistic characteristics were found when analysing the crisis post; 1) posts expressing hopelessness, 2) short posts expressing concise negative emotional responses, 3) long posts expressing variations of emotions, 4) posts expressing dissatisfaction with available health services, 5) posts utilising storytelling, and 6) posts expressing users seeking advice from peers during a crisis. Conclusion: It is possible to build a competitive triage classifier using features derived only from the textual content of the post. Further research needs to be done in order to translate our quantitative and qualitative findings into features, as it may improve overall performance

    Argumentation Mining in User-Generated Web Discourse

    Full text link
    The goal of argumentation mining, an evolving research field in computational linguistics, is to design methods capable of analyzing people's argumentation. In this article, we go beyond the state of the art in several ways. (i) We deal with actual Web data and take up the challenges given by the variety of registers, multiple domains, and unrestricted noisy user-generated Web discourse. (ii) We bridge the gap between normative argumentation theories and argumentation phenomena encountered in actual data by adapting an argumentation model tested in an extensive annotation study. (iii) We create a new gold standard corpus (90k tokens in 340 documents) and experiment with several machine learning methods to identify argument components. We offer the data, source codes, and annotation guidelines to the community under free licenses. Our findings show that argumentation mining in user-generated Web discourse is a feasible but challenging task.Comment: Cite as: Habernal, I. & Gurevych, I. (2017). Argumentation Mining in User-Generated Web Discourse. Computational Linguistics 43(1), pp. 125-17

    Collective emotions online and their influence on community life

    Get PDF
    E-communities, social groups interacting online, have recently become an object of interdisciplinary research. As with face-to-face meetings, Internet exchanges may not only include factual information but also emotional information - how participants feel about the subject discussed or other group members. Emotions are known to be important in affecting interaction partners in offline communication in many ways. Could emotions in Internet exchanges affect others and systematically influence quantitative and qualitative aspects of the trajectory of e-communities? The development of automatic sentiment analysis has made large scale emotion detection and analysis possible using text messages collected from the web. It is not clear if emotions in e-communities primarily derive from individual group members' personalities or if they result from intra-group interactions, and whether they influence group activities. We show the collective character of affective phenomena on a large scale as observed in 4 million posts downloaded from Blogs, Digg and BBC forums. To test whether the emotions of a community member may influence the emotions of others, posts were grouped into clusters of messages with similar emotional valences. The frequency of long clusters was much higher than it would be if emotions occurred at random. Distributions for cluster lengths can be explained by preferential processes because conditional probabilities for consecutive messages grow as a power law with cluster length. For BBC forum threads, average discussion lengths were higher for larger values of absolute average emotional valence in the first ten comments and the average amount of emotion in messages fell during discussions. Our results prove that collective emotional states can be created and modulated via Internet communication and that emotional expressiveness is the fuel that sustains some e-communities.Comment: 23 pages including Supporting Information, accepted to PLoS ON

    What Goes Around Comes Around: Learning Sentiments in Online Medical Forums

    Get PDF
    Currently 19%-28% of Internet users participate in online health discussions. A 2011 survey of the US population estimated that 59% of all adults have looked online for information about health topics such as a specific disease or treatment. Although empirical evidence strongly supports the importance of emotions in health-related messages, there are few studies of the relationship between a subjective lan-guage and online discussions of personal health. In this work, we study sentiments expressed on online medical forums. As well as considering the predominant sentiments expressed in individual posts, we analyze sequences of sentiments in online discussions. Individual posts are classified into one of five categories. We identified three categories as sentimental (encouragement, gratitude, confusion) and two categories as neutral (facts, endorsement). 1438 messages from 130 threads were annotated manually by two annotators with a strong inter-annotator agreement (Fleiss kappa = 0.737 and 0.763 for posts in se-quence and separate posts respectively). The annotated posts were used to analyse sentiments in consec-utive posts. In four multi-class classification problems, we assessed HealthAffect, a domain-specific af-fective lexicon, as well general sentiment lexicons in their ability to represent messages in sentiment recognition

    Mining online diaries for blogger identification

    Get PDF
    In this paper, we present an investigation of authorship identification on personal blogs or diaries, which are different from other types of text such as essays, emails, or articles based on the text properties. The investigation utilizes couple of intuitive feature sets and studies various parameters that affect the identification performance. Many studies manipulated the problem of authorship identification in manually collected corpora, but only few utilized real data from existing blogs. The complexity of the language model in personal blogs is motivating to identify the correspondent author. The main contribution of this work is at least three folds. Firstly, we utilize the LIWC and MRC feature sets together, which have been developed with Psychology background, for the first time for authorship identification on personal blogs. Secondly, we analyze the effect of various parameters, and feature sets, on the identification performance. This includes the number of authors in the data corpus, the post size or the word count, and the number of posts for each author. Finally, we study applying authorship identification over a limited set of users that have a common personality attributes. This analysis is motivated by the lack of standard or solid recommendations in literature for such task, especially in the domain of personal blogs. The results and evaluation show that the utilized features are compact while their performance is highly comparable with other larger feature sets. The analysis also confirmed the most effective parameters, their ranges in the data corpus, and the usefulness of the common users classifier in improving the performance, for the author identification task
    corecore