
    The role of approximate negators in modeling the automatic detection of negation in tweets

    Although improvements have been made in the performance of sentiment analysis tools, the automatic detection of negated text (which affects negative sentiment prediction) still presents challenges. More research is needed on new forms of negation beyond prototypical negation cues such as “not” or “never.” The present research reports findings on the role of a set of words called “approximate negators,” namely “barely,” “hardly,” “rarely,” “scarcely,” and “seldom,” which, on specific occasions (such as when attached to a word from the non-affirmative adverb “any” family), can operationalize negation styles not yet explored. Using a corpus of 6,500 tweets, human annotation allowed for the identification of 17 recurrent usages of these words as negatives (such as “very seldom”) which, along with findings from the literature, helped engineer specific features that guided a machine learning classifier in predicting negated tweets. The machine learning experiments also modeled negation scope (i.e., which specific words in the text are negated) by employing lexical and dependency graph information. Promising results included F1 values ranging from 0.71 to 0.89 for negation detection and from 0.79 to 0.88 for scope detection. Future work will be directed toward applying these findings to automatic sentiment classification, further exploring patterns in the data (such as part-of-speech recurrences for these new types of negation), and investigating sarcasm, formal language, and exaggeration as themes that emerged from observations during corpus annotation.
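
    As a rough illustration of how such cues might be turned into classifier features, the sketch below flags tweets containing an approximate negator, optionally followed by an “any”-family word. The feature names, the word lists, and the three-token lookahead window are hypothetical choices for illustration, not the authors' exact feature set.

    import re

    # The five approximate negators studied in the paper.
    APPROX_NEGATORS = {"barely", "hardly", "rarely", "scarcely", "seldom"}
    # Non-affirmative "any" family: one of the contexts in which these
    # words were annotated as operating like full negators.
    ANY_FAMILY = {"any", "anyone", "anything", "anybody", "anywhere", "anymore"}

    def cue_features(tweet: str) -> dict:
        """Illustrative cue features for one tweet (hypothetical feature set)."""
        tokens = re.findall(r"[a-z']+", tweet.lower())
        feats = {"has_approx_negator": False, "negator_plus_any": False}
        for i, tok in enumerate(tokens):
            if tok in APPROX_NEGATORS:
                feats["has_approx_negator"] = True
                # Look a few tokens ahead for an "any"-family word.
                if any(t in ANY_FAMILY for t in tokens[i + 1 : i + 4]):
                    feats["negator_plus_any"] = True
        return feats

    print(cue_features("I hardly see anything good about this"))
    # {'has_approx_negator': True, 'negator_plus_any': True}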

    Artificial Neural Network methods applied to sentiment analysis

    Sentiment Analysis (SA) is the study of opinions and emotions that are conveyed by text. This field of study has commercial applications, for example in market research (e.g., “What do customers like and dislike about a product?”) and consumer behavior (e.g., “Which book will a customer buy next after writing a positive review about book X?”). A private person can benefit from SA through automatic movie or restaurant recommendations, or through applications on the computer or smartphone that adapt to the user’s current mood. In this thesis we put forward research on artificial Neural Network (NN) methods applied to SA. Many challenges arise, such as sarcasm, domain dependency, and data scarcity, that a successful system needs to address. In the first part of this thesis we perform a linguistic analysis of a word (“hard”) in the light of SA. We show that sentiment-specific word sense disambiguation is necessary to distinguish fine nuances of polarity, and that commonly available resources are not sufficient for this. The introduced Contextually Enhanced Sentiment Lexicon (CESL) is used to label occurrences of “hard” in a real dataset with their senses. This allows us to train a Support Vector Machine (SVM) with deep learning features that predicts the polarity of a single occurrence of the word, given just its context words. We show that the features we propose improve the results compared to existing standard features. Since the labeling effort is not negligible, we propose a clustering approach that reduces the manual effort to a minimum. The deep learning features that help predict fine-grained, context-dependent polarity are computed by a Neural Network Language Model (NNLM), namely a variant of the Log-Bilinear Language model (LBL). Improving this model might therefore also improve the performance of polarity classification. Thus, we propose non-linear versions of the LBL and of the vectorized Log-Bilinear Language model (vLBL), because non-linear models are generally considered more powerful. In a parameter study on a language modeling task, we show that the non-linear versions indeed perform better than their linear counterparts. However, the difference is small, except for settings where the model has only a few parameters, which might be the case when little training data is available and the model therefore needs to be smaller in order to avoid overfitting. An alternative to the fine-grained polarity classification used above is to train classifiers that make the distinction automatically. Due to the complexity of the task, the challenges of SA in general, and certain domain-specific issues (e.g., when using Twitter text), existing systems have much room for improvement. Often, statistical classifiers are used with simple Bag-of-Words (BOW) features or with count features that stem from sentiment lexicons. We introduce a linguistically-informed Convolutional Neural Network (lingCNN) that builds upon the fact that there has been much research on language in general and on sentiment lexicons in particular. lingCNN makes use of two types of linguistic features: word-based and sentence-based. Word-based features comprise features derived from sentiment lexicons, such as polarity or valence, and from general knowledge about language, such as a negation-based feature. Sentence-based features are also based on lexicon counts and valences. The combination of both types of features is superior to the original model without these features.
Especially when little training data is available (as can be the case for under-resourced languages), lingCNN proves to be significantly better (by up to 12 macro-F1 points). Although linguistic features in the form of sentiment lexicons are beneficial, their usage gives rise to a new set of problems: most lexicons consist of infinitive forms of words only, especially lexicons for low-resource languages, whereas the text that needs to be classified is unnormalized. Hence, we want to answer the question whether morphological information is necessary for SA, or whether a system that neglects all this information and can therefore make better use of lexicons actually has an advantage. Our approach is to first stem or lemmatize a dataset and then perform polarity classification on it. On Czech and English datasets we show that better results can be achieved with normalization. As a positive side effect, we can compute better word embeddings by first normalizing the training corpus. This works especially well for languages with rich morphology. We show on word similarity datasets for English, German, and Spanish that our embeddings improve performance. On a new WordNet-based evaluation we confirm these results on five different languages (Czech, English, German, Hungarian, and Spanish). A further benefit of this new evaluation is that it can be applied to many other languages, as the only required resource is a WordNet. In the last part of the thesis, we use a recently introduced method to create an ultradense sentiment space out of generic word embeddings. This method allows us to compress 400-dimensional word embeddings down to 40 or even just 4 dimensions and still obtain similar results on a polarity classification task. While the training speed increases by a factor of 44, the difference in classification performance is not significant.
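
    To make the language-model component concrete, here is a minimal numpy sketch of an LBL-style predictor and the kind of non-linearity the thesis compares against. All sizes, the tanh placement, and the initialization are assumptions for illustration; the thesis's exact parameterization of the non-linear (v)LBL may differ.

    import numpy as np

    rng = np.random.default_rng(0)
    V, D, N = 1000, 50, 3          # vocab size, embedding dim, context size

    R = rng.normal(scale=0.1, size=(V, D))      # word representations
    C = rng.normal(scale=0.1, size=(N, D, D))   # per-position context matrices
    b = np.zeros(V)                             # per-word biases

    def predict_distribution(context_ids, nonlinear=False):
        """P(w | context) under an LBL-style model; illustrative only."""
        # Predicted target representation: sum of linearly transformed
        # context word vectors (the "log-bilinear" part).
        r_hat = sum(C[i] @ R[w] for i, w in enumerate(context_ids))
        if nonlinear:                # the non-linear variant, sketched here
            r_hat = np.tanh(r_hat)   # as a tanh on the predicted vector
        scores = R @ r_hat + b       # similarity of r_hat to every word vector
        scores -= scores.max()       # numerical stability
        p = np.exp(scores)
        return p / p.sum()

    p = predict_distribution([3, 17, 42], nonlinear=True)
    print(p.shape, p.sum())          # (1000,) 1.0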

    A novel approach to stance detection in social media tweets by fusing ranked lists and sentiments

    Stance detection is a relatively new concept in data mining that aims to assign a stance label (favor, against, or none) to a social media post towards a specific pre-determined target. These targets may not be referred to in the post, and may not be the target of opinion in the post. In this paper, we propose a novel enhanced method for identifying the writer’s stance in a given tweet. It comprises a three-phase process for stance detection: (a) tweet preprocessing, where we clean and normalize tweets (e.g., remove stop-words) to generate word and stem lists; (b) feature generation, where we create and fuse two dictionaries to generate the feature vector; and lastly (c) classification, where all feature instances are classified based on the list of targets. Our innovative feature selection fuses two ranked lists (top-k) of term frequency-inverse document frequency (tf-idf) scores with sentiment information. We evaluate our method using six different classifiers: K-nearest neighbor (K-NN), discernibility-based K-NN, weighted K-NN, class-based K-NN, exemplar-based K-NN, and Support Vector Machines. Furthermore, we investigate the use of Principal Component Analysis and study its effect on performance. The model is evaluated on the benchmark dataset (SemEval-2016 Task 6), and the significance of the results is determined using a t-test. We achieve our best performance, a macro F-score (averaged across all topics) of 76.45%, using the weighted K-NN classifier. This tops the current state-of-the-art score of 74.44% on the same dataset.
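
    A minimal sketch of the core idea, under stated assumptions: keep only the top-k terms by tf-idf, append an aggregate sentiment feature per tweet, and classify with a distance-weighted K-NN. The toy tweets, the tiny sentiment lexicon, and k are all illustrative, and sklearn's KNeighborsClassifier stands in for the paper's weighted K-NN variant.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import KNeighborsClassifier

    tweets = ["climate change is real and urgent",
              "no evidence that climate change exists",
              "we must act on climate now",
              "climate alarmism is overblown"]
    stances = ["FAVOR", "AGAINST", "FAVOR", "AGAINST"]
    # Toy sentiment lexicon standing in for the sentiment information that
    # is fused with the tf-idf list; the values are illustrative.
    SENT = {"urgent": -0.3, "real": 0.5, "evidence": 0.1,
            "act": 0.4, "alarmism": -0.6, "overblown": -0.5}

    vec = TfidfVectorizer()
    X = vec.fit_transform(tweets).toarray()
    terms = vec.get_feature_names_out()

    # Keep only the top-k terms by total tf-idf score (the "ranked list"),
    # then append one aggregate sentiment feature per tweet (the "fusion").
    k = 6
    top = np.argsort(X.sum(axis=0))[::-1][:k]
    print([terms[i] for i in top])          # the retained ranked list
    sent_feat = [[sum(SENT.get(t, 0.0) for t in tw.split())] for tw in tweets]
    X_fused = np.hstack([X[:, top], sent_feat])

    # weights="distance" gives a distance-weighted K-NN.
    clf = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_fused, stances)
    print(clf.predict(X_fused[:1]))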

    Sentiment Analysis of Textual Content in Social Networks. From Hand-Crafted to Deep Learning-Based Models

    This thesis proposes several advanced methods to automatically analyse textual content shared on social networks and to identify people’s opinions, emotions and feelings at different levels of analysis and in different languages. We start by proposing a sentiment analysis system, called SentiRich, based on a set of rich features, including information extracted from sentiment lexicons and pre-trained word embedding models. Then, we propose an ensemble system based on Convolutional Neural Networks and XGBoost regressors to solve an array of sentiment and emotion analysis tasks on Twitter. These tasks range from typical sentiment analysis tasks to automatically determining the intensity of an emotion (such as joy, fear, anger, etc.) and the intensity of sentiment (a.k.a. valence) of the authors from their tweets. We also propose a novel Deep Learning-based system to address the multiple emotion classification problem on Twitter. Moreover, we consider the problem of target-dependent sentiment analysis; for this purpose, we propose a Deep Learning-based system that identifies and extracts the target of tweets. While some languages, such as English, have a vast array of resources to enable sentiment analysis, most low-resource languages lack them, so we utilise cross-lingual sentiment analysis techniques to develop a novel, multi-lingual and Deep Learning-based system for low-resource languages. We propose to combine Multi-Criteria Decision Aid and sentiment analysis to develop a system that gives users the ability to exploit reviews alongside their preferences in the process of ranking alternatives. Finally, we applied the developed systems to the field of communication of destination brands through social networks. To this end, we collected tweets from local people, visitors, and official destination brand offices of different tourist destinations and analysed the opinions and the emotions shared in these tweets.
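
    As a rough sketch of the ensemble idea, the snippet below blends an XGBoost regressor's emotion-intensity predictions with the outputs of a (here simulated) CNN. The random features, the 50/50 blend weights, and the placeholder CNN predictions are assumptions for illustration; the thesis's actual combination would be tuned on development data.

    import numpy as np
    from xgboost import XGBRegressor

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 20))            # stand-in tweet feature vectors
    y = rng.uniform(size=200)                 # gold emotion intensities in [0, 1]

    xgb = XGBRegressor(n_estimators=50, max_depth=3).fit(X, y)
    cnn_pred = rng.uniform(size=200)          # placeholder for CNN outputs

    # Simple equal-weight blend of the two models' predictions.
    blend = 0.5 * xgb.predict(X) + 0.5 * cnn_pred
    print(blend[:3])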

    Sentiment Analysis for Social Media

    Sentiment analysis is a branch of natural language processing concerned with the study of the intensity of the emotions expressed in a piece of text. The automated analysis of the multitude of messages delivered through social media is one of the hottest research fields, both in academia and in industry, due to its extremely high potential applicability in many different domains. This Special Issue describes both technological contributions to the field, mostly based on deep learning techniques, and specific applications in areas like health insurance, gender classification, recommender systems, and cyber aggression detection.

    Semantics-Driven Aspect-Based Sentiment Analysis

    People using the Web are constantly invited to share their opinions and preferences with the rest of the world, which has led to an explosion of opinionated blogs, reviews of products and services, and comments on virtually everything. This type of web-based content is increasingly recognized as a source of data that has added value for multiple application domains. While the large number of available reviews almost ensures that all relevant parts of the entity under review are properly covered, manually reading each and every review is not feasible. Aspect-based sentiment analysis aims to solve this issue, as it is concerned with the development of algorithms that can automatically extract fine-grained sentiment information from a set of reviews, computing a separate sentiment value for the various aspects of the product or service being reviewed. This dissertation focuses on which discriminants are useful when performing aspect-based sentiment analysis. What signals for sentiment can be extracted from the text itself, and what is the effect of using extra-textual discriminants? We find that using semantic lexicons or ontologies can greatly improve the quality of aspect-based sentiment analysis, especially with limited training data. Additionally, because semantics drive the analysis, the algorithm is less of a black box and its results are easier to explain.
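
    To give a flavor of lexicon-driven aspect-based sentiment analysis, the sketch below maps aspect terms to aspect categories via a tiny hand-made lexicon and attributes nearby opinion words to each aspect. Both lexicons and the two-token attribution window are illustrative assumptions, not the dissertation's ontology or method.

    # Minimal sketch: both tiny lexicons below are illustrative only.
    ASPECTS = {"pizza": "food", "pasta": "food", "waiter": "service",
               "service": "service", "price": "price"}
    POLARITY = {"great": 1, "delicious": 1, "rude": -1, "slow": -1, "cheap": 1}

    def aspect_sentiment(review: str) -> dict:
        """Per-aspect sentiment score for one review."""
        scores = {}
        tokens = review.lower().split()
        for i, tok in enumerate(tokens):
            if tok in ASPECTS:
                aspect = ASPECTS[tok]
                # Attribute nearby opinion words to this aspect (window of 2).
                window = tokens[max(0, i - 2) : i + 3]
                score = sum(POLARITY.get(w, 0) for w in window)
                scores[aspect] = scores.get(aspect, 0) + score
        return scores

    print(aspect_sentiment("the pizza was delicious but the waiter was rude"))
    # {'food': 1, 'service': -1}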

    Leveraging Longitudinal Data for Personalized Prediction and Word Representations

    This thesis focuses on personalization, word representations, and longitudinal dialog. We first look at users’ expressions of individual preferences. In this targeted sentiment task, we find that we can improve entity extraction and sentiment classification using domain lexicons and linear term weighting. This task is important to personalization and dialog systems, as targets need to be identified in conversation and personal preferences affect how the system should react. Then we examine individuals with large amounts of personal conversational data in order to better predict what people will say. We consider extra-linguistic features that can be used to predict behavior and to predict the relationship between interlocutors. We show that these features improve over just using message content and that training on personal data leads to much better performance than training on a sample from all other users. We look not just at using personal data for these end tasks, but also at constructing personalized word representations. When we have a lot of data for an individual, we create personalized word embeddings that improve performance on language modeling and authorship attribution. When we have limited data but know the user’s demographics, we can instead construct demographic word embeddings, and we show that these representations improve language modeling and word association performance. When we do not have demographic information, we show that, using a small amount of data from an individual, we can calculate similarity to existing users and interpolate or leverage data from those users to improve language modeling performance. Using these types of personalized word representations, we are able to provide insight into which words vary more across users and demographics. The kind of personalized representations that we introduce in this work allows for applications such as predictive typing, style transfer, and dialog systems. Importantly, they also have the potential to enable more equitable language models, with improved performance for those demographic groups that have little representation in the data.
    PhD thesis, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
    http://deepblue.lib.umich.edu/bitstream/2027.42/167971/1/cfwelch_1.pd
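
    A minimal sketch of the leverage-similar-users idea from the abstract, under stated assumptions: each user's history induces a unigram language model, a new user with little data is compared to existing users by cosine similarity, and the existing users' models are interpolated with similarity-derived weights. The toy users and the unigram models are illustrative simplifications; the thesis works with richer language models.

    from collections import Counter

    def unigram_lm(texts):
        """Unigram probabilities from a list of messages."""
        counts = Counter(w for t in texts for w in t.lower().split())
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def cosine(p, q):
        dot = sum(p[w] * q.get(w, 0.0) for w in p)
        norm = lambda d: sum(v * v for v in d.values()) ** 0.5
        return dot / (norm(p) * norm(q) or 1.0)

    # Hypothetical users with small message histories.
    users = {"u1": ["i love hiking and coffee", "coffee every morning"],
             "u2": ["stock markets fell today", "trading was slow"]}
    target = ["any good coffee nearby"]     # new user with little data

    lms = {u: unigram_lm(t) for u, t in users.items()}
    target_lm = unigram_lm(target)

    # Weight each existing user's LM by similarity to the new user, then
    # interpolate -- a minimal version of leveraging similar users' data.
    sims = {u: cosine(target_lm, lm) for u, lm in lms.items()}
    z = sum(sims.values()) or 1.0
    vocab = set().union(*[lm.keys() for lm in lms.values()])
    interp = {w: sum(sims[u] / z * lms[u].get(w, 0.0) for u in users)
              for w in vocab}
    print(max(interp, key=interp.get))      # most likely word: 'coffee'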