2,342 research outputs found

    Automatically extracting polarity-bearing topics for cross-domain sentiment classification

    Get PDF
    Joint sentiment-topic (JST) model was previously proposed to detect sentiment and topic simultaneously from text. The only supervision required by JST model learning is domain-independent polarity word priors. In this paper, we modify the JST model by incorporating word polarity priors through modifying the topic-word Dirichlet priors. We study the polarity-bearing topics extracted by JST and show that by augmenting the original feature space with polarity-bearing topics, the in-domain supervised classifiers learned from augmented feature representation achieve the state-of-the-art performance of 95% on the movie review data and an average of 90% on the multi-domain sentiment dataset. Furthermore, using feature augmentation and selection according to the information gain criteria for cross-domain sentiment classification, our proposed approach performs either better or comparably compared to previous approaches. Nevertheless, our approach is much simpler and does not require difficult parameter tuning

    Text pre-processing of multilingual for sentiment analysis based on social network data

    Get PDF
    Sentiment analysis (SA) is an enduring area for research especially in the field of text analysis. Text pre-processing is an important aspect to perform SA accurately. This paper presents a text processing model for SA, using natural language processing techniques for twitter data. The basic phases for machine learning are text collection, text cleaning, pre-processing, feature extractions in a text and then categorize the data according to the SA techniques. Keeping the focus on twitter data, the data is extracted in domain specific manner. In data cleaning phase, noisy data, missing data, punctuation, tags and emoticons have been considered. For pre-processing, tokenization is performed which is followed by stop word removal (SWR). The proposed article provides an insight of the techniques, that are used for text pre-processing, the impact of their presence on the dataset. The accuracy of classification techniques has been improved after applying text pre-processing and dimensionality has been reduced. The proposed corpus can be utilized in the area of market analysis, customer behaviour, polling analysis, and brand monitoring. The text pre-processing process can serve as the baseline to apply predictive analysis, machine learning and deep learning algorithms which can be extended according to problem definition

    A Case-Based Approach to Cross Domain Sentiment Classification

    Get PDF
    This paper considers the task of sentiment classification of subjective text across many domains, in particular on scenarios where no in-domain data is available. Motivated by the more general applicability of such methods, we propose an extensible approach to sentiment classification that leverages sentiment lexicons and out-of-domain data to build a case-based system where solutions to past cases are reused to predict the sentiment of new documents from an unknown domain. In our approach the case representation uses a set of features based on document statistics, while the case solution stores sentiment lexicons employed on past predictions allowing for later retrieval and reuse on similar documents. The case-based nature of our approach also allows for future improvements since new lexicons and classification methods can be added to the case base as they become available. On a cross domain experiment our method has shown robust results when compared to a baseline single-lexicon classifier where the lexicon has to be pre-selected for the domain in question

    Classification of socially generated medical data

    Get PDF
    The growth of online health communities, particularly those involving socially generated content, can provide considerable value for society. Participants can gain knowledge of medical information or interact with peers on medical forum platforms. However, the sheer volume of information so generated – and the consequent ‘noise’ associated with large data volumes – can create difficulties for information consumers. We propose a solution to this problem by applying high-level analytics to the data – primarily sentiment analysis, but also content and topic analysis - for accurate classification. We believe that such analysis can be of significant value to data users, such as identifying a particular aspect of an information space, determining themes that predominate among a large dataset, and allowing people to summarize topics within a big dataset. In this thesis, we apply machine learning strategies to identify sentiments expressed in online medical forums that discuss Lyme Disease. As part of this process, we distinguish a complete and relevant set of categories that can be used to characterize Lyme Disease discourse. We present a feature-based model that employs supervised learning algorithms and assess the feasibility and accuracy of this sentiment classification model. We further evaluate our model by assessing its ability to adapt to an online medical forum discussing a disease with similar characteristics, Lupus. The experimental results demonstrate the effectiveness of our approach. In many sentiment analysis applications, the labelled training datasets are expensive to obtain, whereas unlabelled datasets are readily available. Therefore, we present an adaptation of a well-known semi-supervised learning technique, in which co-training is implemented by combining labelled and unlabelled data. Our results would suggest the ability to learn even with limited labelled data. In addition, we investigate complementary analytic techniques – content and topic analysis – to leverage best used of the data for various consumer groups. Within the work described in this thesis, some particular research issues are addressed, specifically when applied to socially generated medical/health datasets: • When applying binary sentiment analysis to short-form text data (e.g. Twitter), could meta-level features improve performance of classification? • When applying more complex multi-class sentiment analysis to classification of long-form content-rich text data, would meta-level features be a useful addition to more conventional features? • Can this multi-class analysis approach be generalised to other medical/health domains? • How would alternative classification strategies benefit different groups of information consumers

    A Case-Based Approach to Cross Domain Sentiment Classification

    Get PDF
    This paper considers the task of sentiment classification of subjective text across many domains, in particular on scenarios where no in-domain data is available. Motivated by the more general applicability of such methods, we propose an extensible approach to sentiment classification that leverages sentiment lexicons and out-of-domain data to build a case-based system where solutions to past cases are reused to predict the sentiment of new documents from an unknown domain. In our approach the case representation uses a set of features based on document statistics, while the case solution stores sentiment lexicons employed on past predictions allowing for later retrieval and reuse on similar documents. The case-based nature of our approach also allows for future improvements since new lexicons and classification methods can be added to the case base as they become available. On a cross domain experiment our method has shown robust results when compared to a baseline single-lexicon classifier where the lexicon has to be pre-selected for the domain in question

    Graph-based approaches for semi-supervised and cross-domain sentiment analysis

    Get PDF
    A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of PhilosophyThe rapid development of Internet technologies has resulted in a sharp increase in the number of Internet users who create content online. Usergenerated content often represents people's opinions, thoughts, speculations and sentiments and is a valuable source of information for companies, organisations and individual users. This has led to the emergence of the eld of sentiment analysis, which deals with the automatic extraction and classi cation of sentiments expressed in texts. Sentiment analysis has been intensively researched over the last ten years, but there are still many issues to be addressed. One of the main problems is the lack of labelled data necessary to carry out precise supervised sentiment classi cation. In response, research has moved towards developing semi-supervised and crossdomain techniques. Semi-supervised approaches still need some labelled data and their e ectiveness is largely determined by the amount of these data, whereas cross-domain approaches usually perform poorly if training data are very di erent from test data. The majority of research on sentiment classi cation deals with the binary classi cation problem, although for many practical applications this rather coarse sentiment scale is not su cient. Therefore, it is crucial to design methods which are able to perform accurate multiclass sentiment classi cation. iii The aims of this thesis are to address the problem of limited availability of data in sentiment analysis and to advance research in semi-supervised and cross-domain approaches for sentiment classi cation, considering both binary and multiclass sentiment scales. We adopt graph-based learning as our main method and explore the most popular and widely used graph-based algorithm, label propagation. We investigate various ways of designing sentiment graphs and propose a new similarity measure which is unsupervised, easy to compute, does not require deep linguistic analysis and, most importantly, provides a good estimate for sentiment similarity as proved by intrinsic and extrinsic evaluations. The main contribution of this thesis is the development and evaluation of a graph-based sentiment analysis system that a) can cope with the challenges of limited data availability by using semi-supervised and crossdomain approaches b) is able to perform multiclass classi cation and c) achieves highly accurate results which are superior to those of most stateof- the-art semi-supervised and cross-domain systems. We systematically analyse and compare semi-supervised and cross-domain approaches in the graph-based framework and propose recommendations for selecting the most pertinent learning approach given the data available. Our recommendations are based on two domain characteristics, domain similarity and domain complexity, which were shown to have a signi cant impact on semi-supervised and cross-domain performance

    On the Promotion of the Social Web Intelligence

    Get PDF
    Given the ever-growing information generated through various online social outlets, analytical research on social media has intensified in the past few years from all walks of life. In particular, works on social Web intelligence foster and benefit from the wisdom of the crowds and attempt to derive actionable information from such data. In the form of collective intelligence, crowds gather together and contribute to solving problems that may be difficult or impossible to solve by individuals and single computers. In addition, the consumer insight revealed from social footprints can be leveraged to build powerful business intelligence tools, enabling efficient and effective decision-making processes. This dissertation is broadly concerned with the intelligence that can emerge from the social Web platforms. In particular, the two phenomena of social privacy and online persuasion are identified as the two pillars of the social Web intelligence, studying which is essential in the promotion and advancement of both collective and business intelligence. The first part of the dissertation is focused on the phenomenon of social privacy. This work is mainly motivated by the privacy dichotomy problem. Users often face difficulties specifying privacy policies that are consistent with their actual privacy concerns and attitudes. As such, before making use of social data, it is imperative to employ multiple safeguards beyond the current privacy settings of users. As a possible solution, we utilize user social footprints to detect their privacy preferences automatically. An unsupervised collaborative filtering approach is proposed to characterize the attributes of publicly available accounts that are intended to be private. Unlike the majority of earlier studies, a variety of social data types is taken into account, including the social context, the published content, as well as the profile attributes of users. Our approach can provide support in making an informed decision whether to exploit one\u27s publicly available data to draw intelligence. With the aim of gaining insight into the strategies behind online persuasion, the second part of the dissertation studies written comments in online deliberations. Specifically, we explore different dimensions of the language, the temporal aspects of the communication, as well as the attributes of the participating users to understand what makes people change their beliefs. In addition, we investigate the factors that are perceived to be the reasons behind persuasion by the users. We link our findings to traditional persuasion research, hoping to uncover when and how they apply to online persuasion. A set of rhetorical relations is known to be of importance in persuasive discourse. We further study the automatic identification and disambiguation of such rhetorical relations, aiming to take a step closer towards automatic analysis of online persuasion. Finally, a small proof of concept tool is presented, showing the value of our persuasion and rhetoric studies

    Probabilistic topic models for sentiment analysis on the Web

    Get PDF
    Sentiment analysis aims to use automated tools to detect subjective information such as opinions, attitudes, and feelings expressed in text, and has received a rapid growth of interest in natural language processing in recent years. Probabilistic topic models, on the other hand, are capable of discovering hidden thematic structure in large archives of documents, and have been an active research area in the field of information retrieval. The work in this thesis focuses on developing topic models for automatic sentiment analysis of web data, by combining the ideas from both research domains. One noticeable issue of most previous work in sentiment analysis is that the trained classifier is domain dependent, and the labelled corpora required for training could be difficult to acquire in real world applications. Another issue is that the dependencies between sentiment/subjectivity and topics are not taken into consideration. The main contribution of this thesis is therefore the introduction of three probabilistic topic models, which address the above concerns by modelling sentiment/subjectivity and topic simultaneously. The first model is called the joint sentiment-topic (JST) model based on latent Dirichlet allocation (LDA), which detects sentiment and topic simultaneously from text. Unlike supervised approaches to sentiment classification which often fail to produce satisfactory performance when applied to new domains, the weakly-supervised nature of JST makes it highly portable to other domains, where the only supervision information required is a domain-independent sentiment lexicon. Apart from document-level sentiment classification results, JST can also extract sentiment-bearing topics automatically, which is a distinct feature compared to the existing sentiment analysis approaches. The second model is a dynamic version of JST called the dynamic joint sentiment-topic (dJST) model. dJST respects the ordering of documents, and allows the analysis of topic and sentiment evolution of document archives that are collected over a long time span. By accounting for the historical dependencies of documents from the past epochs in the generative process, dJST gives a richer posterior topical structure than JST, and can better respond to the permutations of topic prominence. We also derive online inference procedures based on a stochastic EM algorithm for efficiently updating the model parameters. The third model is called the subjectivity detection LDA (subjLDA) model for sentence-level subjectivity detection. Two sets of latent variables were introduced in subjLDA. One is the subjectivity label for each sentence; another is the sentiment label for each word token. By viewing the subjectivity detection problem as weakly-supervised generative model learning, subjLDA significantly outperforms the baseline and is comparable to the supervised approach which relies on much larger amounts of data for training. These models have been evaluated on real world datasets, demonstrating that joint sentiment topic modelling is indeed an important and useful research area with much to offer in the way of good results

    The Effects of Urbanization on the Function of the Church

    Get PDF
    The interest in this paper will not be the causes for the rise of urbanism or a study of the social ecology of the city. The emphasis will be on the social relationships in the city. We will ask the question of how these influences in urban areas have affected the function of the church in witnessing
    • …
    corecore