53,301 research outputs found
Three Essays on Opinion Mining of Social Media Texts
This dissertation research is a collection of three essays on opinion mining of social media texts. I explore different theoretical and methodological perspectives in this inquiry. The first essay focuses on improving lexicon-based sentiment classification. I propose a method to automatically generate a sentiment lexicon that incorporates knowledge from both the language domain and the content domain. This method learns word associations from a large unannotated corpus. These associations are used to identify new sentiment words. Using a Twitter data set containing 743,069 tweets related to the stock market, I show that the sentiment lexicons generated using the proposed method significantly outperforms existing sentiment lexicons in sentiment classification. As sentiment analysis is being applied to different types of documents to solve different problems, the proposed method provides a useful tool to improve sentiment classification.
The second essay focuses on improving supervised sentiment classification. In previous work on sentiment classification, a document was typically represented as a collection of single words. This method of feature representation suffers from severe ambiguity, especially in classifying short texts, such as microblog messages. I propose the use of dependency features in sentiment classification. A dependency describes the relationship between a pair of words even when they are distant. I compare the sentiment classification performance of dependency features with a few commonly used features in different experiment settings. The results show that dependency features significantly outperform existing feature representations.
In the third essay, I examine the relationship between social media sentiment and stock returns. This is the first study to test the bidirectional effects in this relationship. Based on theories in behavioral finance research, I speculate that social media sentiment does not predict stock return, but rather that stock return predicts social media sentiment. I empirically test a set of research hypotheses by applying the vector autoregression (VAR) model on a social media data set, which is much larger than those used in previous studies. The hypotheses are supported by the results. The findings have significant implications for both theory and practice
Task-specific Word Identification from Short Texts Using a Convolutional Neural Network
Task-specific word identification aims to choose the task-related words that
best describe a short text. Existing approaches require well-defined seed words
or lexical dictionaries (e.g., WordNet), which are often unavailable for many
applications such as social discrimination detection and fake review detection.
However, we often have a set of labeled short texts where each short text has a
task-related class label, e.g., discriminatory or non-discriminatory, specified
by users or learned by classification algorithms. In this paper, we focus on
identifying task-specific words and phrases from short texts by exploiting
their class labels rather than using seed words or lexical dictionaries. We
consider the task-specific word and phrase identification as feature learning.
We train a convolutional neural network over a set of labeled texts and use
score vectors to localize the task-specific words and phrases. Experimental
results on sentiment word identification show that our approach significantly
outperforms existing methods. We further conduct two case studies to show the
effectiveness of our approach. One case study on a crawled tweets dataset
demonstrates that our approach can successfully capture the
discrimination-related words/phrases. The other case study on fake review
detection shows that our approach can identify the fake-review words/phrases.Comment: accepted by Intelligent Data Analysis, an International Journa
Non-Standard Words as Features for Text Categorization
This paper presents categorization of Croatian texts using Non-Standard Words
(NSW) as features. Non-Standard Words are: numbers, dates, acronyms,
abbreviations, currency, etc. NSWs in Croatian language are determined
according to Croatian NSW taxonomy. For the purpose of this research, 390 text
documents were collected and formed the SKIPEZ collection with 6 classes:
official, literary, informative, popular, educational and scientific. Text
categorization experiment was conducted on three different representations of
the SKIPEZ collection: in the first representation, the frequencies of NSWs are
used as features; in the second representation, the statistic measures of NSWs
(variance, coefficient of variation, standard deviation, etc.) are used as
features; while the third representation combines the first two feature sets.
Naive Bayes, CN2, C4.5, kNN, Classification Trees and Random Forest algorithms
were used in text categorization experiments. The best categorization results
are achieved using the first feature set (NSW frequencies) with the
categorization accuracy of 87%. This suggests that the NSWs should be
considered as features in highly inflectional languages, such as Croatian. NSW
based features reduce the dimensionality of the feature space without standard
lemmatization procedures, and therefore the bag-of-NSWs should be considered
for further Croatian texts categorization experiments.Comment: IEEE 37th International Convention on Information and Communication
Technology, Electronics and Microelectronics (MIPRO 2014), pp. 1415-1419,
201
Opinion Mining on Non-English Short Text
As the type and the number of such venues increase, automated analysis of
sentiment on textual resources has become an essential data mining task. In
this paper, we investigate the problem of mining opinions on the collection of
informal short texts. Both positive and negative sentiment strength of texts
are detected. We focus on a non-English language that has few resources for
text mining. This approach would help enhance the sentiment analysis in
languages where a list of opinionated words does not exist. We propose a new
method projects the text into dense and low dimensional feature vectors
according to the sentiment strength of the words. We detect the mixture of
positive and negative sentiments on a multi-variant scale. Empirical evaluation
of the proposed framework on Turkish tweets shows that our approach gets good
results for opinion mining
- …