
    Sentiment classification using statistical data compression models

    With the growing availability and popularity of user-generated content, the discipline of sentiment analysis has come to the attention of many researchers. Existing work has mainly focused on either knowledge-based methods or standard machine learning techniques. In this paper we investigate sentiment polarity classification based on adaptive statistical data compression models. We evaluate the classification performance of the lossless compression algorithm Prediction by Partial Matching (PPM) as well as compression-based measures using PPM-like character n-gram frequency statistics. Comprehensive experiments on three corpora show that compression-based methods are efficient, easy to apply, and can compete with the accuracy of sophisticated classifiers such as support vector machines.
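
    As a rough illustration of the compression-based idea, the Python sketch below assigns a document to the class whose training text it compresses best with, measured by the extra bytes needed to compress the class corpus together with the document. It uses the standard-library bz2 compressor rather than PPM, and the toy corpora are hypothetical, so it is a minimal sketch of the general approach rather than the authors' implementation.

        import bz2

        def compressed_size(text: str) -> int:
            return len(bz2.compress(text.encode("utf-8")))

        def classify(document: str, class_corpora: dict) -> str:
            # class_corpora maps a label (e.g. "positive"/"negative") to the
            # concatenated training documents for that class.
            def extra_bytes(label: str) -> int:
                corpus = class_corpora[label]
                return compressed_size(corpus + " " + document) - compressed_size(corpus)
            return min(class_corpora, key=extra_bytes)

        # Hypothetical toy corpora, for illustration only.
        train = {
            "positive": "great wonderful loved it excellent highly recommend",
            "negative": "terrible awful hated it waste of money disappointing",
        }
        print(classify("loved the excellent service", train))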


    Enhanced news sentiment analysis using deep learning methods

    We explore the predictive power of historical news sentiments, based on financial market performance, to forecast financial news sentiments. We define news sentiments based on stock price returns averaged over the one minute right after a news article has been released. If the stock price exhibits a positive (negative) return, we classify the news article released just prior to the observed stock return as positive (negative). We use Wikipedia 2014 and Gigaword 5 corpus articles and apply the global vectors for word representation (GloVe) method to this corpus to create word vectors to use as inputs to a deep learning network built in TensorFlow. We analyze the high-frequency (intraday) Thomson Reuters News Archive as well as the high-frequency price tick history of the individual stocks of the Dow Jones Industrial Average (DJIA 30) Index for the period between 1/1/2003 and 12/30/2013. We apply a combination of deep learning methodologies, a recurrent neural network with long short-term memory units, to train on the Thomson Reuters News Archive data from 2003 to 2012, and we test the forecasting power of our method on the 2013 news archive data. We find that the forecasting accuracy of our methodology improves when we switch from random selection of positive and negative news to selecting the news with the highest positive scores as positive news and the news with the highest negative scores as negative news to create our training data set.
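
    The central labelling rule described above is easy to state in code. The sketch below, with hypothetical helper names and a simple first-to-last price return standing in for the one-minute average return, marks a news item as positive or negative from the stock's price ticks in the minute following release; the GloVe embedding and LSTM training steps are not shown.

        from datetime import datetime, timedelta

        def label_news(release_time: datetime, ticks: list) -> str:
            """ticks: chronologically sorted (timestamp, price) pairs for the stock."""
            window_end = release_time + timedelta(minutes=1)
            prices = [p for t, p in ticks if release_time <= t <= window_end]
            if len(prices) < 2:
                return "unlabelled"  # not enough ticks to compute a return
            ret = (prices[-1] - prices[0]) / prices[0]
            return "positive" if ret > 0 else "negative"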

    Rhetorical relations for information retrieval

    Typically, every part of a coherent text has some plausible reason for its presence, some function that it performs with respect to the overall semantics of the text. Rhetorical relations, e.g. contrast, cause, explanation, describe how the parts of a text are linked to each other. Knowledge about this so-called discourse structure has been applied successfully to several natural language processing tasks. This work studies the use of rhetorical relations for Information Retrieval (IR): Is there a correlation between certain rhetorical relations and retrieval performance? Can knowledge about a document's rhetorical relations be useful to IR? We present a language model modification that considers rhetorical relations when estimating the relevance of a document to a query. Empirical evaluation of different versions of our model on TREC settings shows that certain rhetorical relations can benefit retrieval effectiveness notably (> 10% in mean average precision over a state-of-the-art baseline).
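
    One way to picture such a modification (not necessarily the authors' exact formulation) is a query-likelihood scorer in which term counts are re-weighted according to the rhetorical relation of the span they occur in. The relation weights, the Dirichlet smoothing and the input format in the sketch below are assumptions made for illustration.

        import math
        from collections import Counter

        # Assumed per-relation weights; spans with no relation label count normally.
        RELATION_WEIGHTS = {"cause": 1.5, "contrast": 1.3, "elaboration": 1.0}

        def score(query_terms, doc_spans, collection_freq, collection_len, mu=2000.0):
            """doc_spans: list of (relation_label, tokens) pairs for one document."""
            weighted_counts = Counter()
            doc_len = 0.0
            for relation, tokens in doc_spans:
                w = RELATION_WEIGHTS.get(relation, 1.0)
                for tok in tokens:
                    weighted_counts[tok] += w
                    doc_len += w
            log_p = 0.0
            for q in query_terms:
                p_coll = collection_freq.get(q, 0.5) / collection_len    # background model
                p = (weighted_counts[q] + mu * p_coll) / (doc_len + mu)  # Dirichlet smoothing
                log_p += math.log(p)
            return log_p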

    Hash Embeddings for Efficient Word Representations

    We present hash embeddings, an efficient method for representing words in a continuous vector form. A hash embedding may be seen as an interpolation between a standard word embedding and a word embedding created using a random hash function (the hashing trick). In hash embeddings, each token is represented by k d-dimensional embedding vectors and one k-dimensional weight vector. The final d-dimensional representation of the token is the product of the two. Rather than fitting the embedding vectors for each token, these are selected by the hashing trick from a shared pool of B embedding vectors. Our experiments show that hash embeddings can easily deal with huge vocabularies consisting of millions of tokens. When using a hash embedding there is no need to create a dictionary before training, nor to perform any kind of vocabulary pruning after training. We show that models trained using hash embeddings exhibit at least the same level of performance as models trained using regular embeddings across a wide range of tasks. Furthermore, the number of parameters needed by such an embedding is only a fraction of what is required by a regular embedding. Since standard embeddings and embeddings constructed using the hashing trick are actually just special cases of a hash embedding, hash embeddings can be considered an extension of and improvement over the existing regular embedding types.
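
    The mechanism lends itself to a short sketch. The NumPy code below builds a d-dimensional representation for a token by hashing it into k component vectors from a shared pool of B vectors and combining them with k importance weights; the pool sizes, hashing scheme and initialisation are assumptions for illustration, not the paper's exact configuration.

        import hashlib
        import numpy as np

        k, d, B, W = 2, 16, 10_000, 1_000_000  # components per token, dimension, pool size, weight rows
        rng = np.random.default_rng(0)
        pool = rng.normal(scale=0.1, size=(B, d))     # shared pool of component vectors
        weights = rng.normal(scale=0.1, size=(W, k))  # per-token importance weights

        def stable_hash(s: str) -> int:
            return int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16)

        def hash_embedding(token: str) -> np.ndarray:
            component_ids = [stable_hash(f"{i}:{token}") % B for i in range(k)]  # k hash functions
            importance = weights[stable_hash(token) % W]                         # k weights for this token
            return importance @ pool[component_ids]                              # weighted sum -> d-dim vector

        print(hash_embedding("compression").shape)  # (16,)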

    Sentimental Analysis and Its Applications - A Review

    Sentiment analysis has received a lot of recognition over the years, as it has shown tremendous growth. Nowadays sentiment analysis can be applied to almost any field to bring out the emotions attached to a text and to understand what its author wants to convey. In our work we apply sentiment analysis to a dataset of reviews of 15 hotels from the city and apply a new statistical analysis technique. The statistical analysis technique takes attributes such as food quality, ambience and location into account when calculating a document index. The reviews are taken from a trusted website. It is observed that people consult reviews before making further decisions, so e-WOM (electronic word of mouth) is a highly effective way to convey views. Nowadays people rely heavily on electronic reviews because they are considered the most trusted way for people to convey their views.
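
    The abstract does not spell out how the document index is computed; the sketch below shows one hedged reading of it, averaging per-attribute sentiment scores across a hotel's reviews and combining them into a single index. The attribute names, score range and unweighted averaging are assumptions.

        from statistics import mean

        def document_index(reviews, attributes=("food_quality", "ambience", "location")):
            """reviews: list of dicts mapping attribute name -> sentiment score in [-1, 1]."""
            per_attribute = {a: mean(r[a] for r in reviews if a in r) for a in attributes}
            return mean(per_attribute.values()), per_attribute

        # Hypothetical scores for two reviews of one hotel.
        index, breakdown = document_index([
            {"food_quality": 0.8, "ambience": 0.4, "location": 0.6},
            {"food_quality": 0.5, "ambience": -0.2},
        ])
        print(index, breakdown)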

    Making the Most of Tweet-Inherent Features for Social Spam Detection on Twitter

    Social spam produces a great amount of noise on social media services such as Twitter, which reduces the signal-to-noise ratio that both end users and data mining applications observe. Existing techniques for social spam detection have focused primarily on the identification of spam accounts by using extensive historical and network-based data. In this paper we focus on the detection of spam tweets, which optimises the amount of data that needs to be gathered by relying only on tweet-inherent features. This enables the application of the spam detection system to a large set of tweets in a timely fashion, making it potentially applicable in a real-time or near real-time setting. Using two large hand-labelled datasets of tweets containing spam, we study the suitability of five classification algorithms and four different feature sets for the social spam detection task. Our results show that, by using the limited set of features readily available in a tweet, we can achieve encouraging results that are competitive when compared against existing spammer detection systems that make use of additional, costly user features. Our study is the first to attempt to generalise conclusions on the optimal classifiers and sets of features for social spam detection across different datasets.
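
    To make the tweet-inherent-features idea concrete, the sketch below extracts a few features that can be computed from the tweet text alone and trains an off-the-shelf classifier on them. The specific features, the random-forest classifier and the toy data are assumptions for illustration, not the paper's exact feature sets or algorithms.

        from sklearn.ensemble import RandomForestClassifier

        def tweet_features(text: str) -> list:
            words = text.split()
            return [
                len(text),                              # character length
                len(words),                             # word count
                sum(w.startswith("#") for w in words),  # hashtags
                sum(w.startswith("@") for w in words),  # mentions
                sum("http" in w for w in words),        # URLs
                sum(c.isupper() for c in text) / max(len(text), 1),  # uppercase ratio
            ]

        # Hypothetical toy data: (tweet text, is_spam label).
        train = [("Win a FREE iPhone now http://spam.example #win #free", 1),
                 ("Lovely walk along the river this morning", 0)]
        X = [tweet_features(t) for t, _ in train]
        y = [label for _, label in train]
        clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
        print(clf.predict([tweet_features("FREE followers http://x.example #follow")]))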