428 research outputs found

    Disentangling categorical relationships through a graph of co-occurrences

    The mesoscopic structure of complex networks has proven to be a powerful level of description for understanding the linchpins of the system that the network represents. Nevertheless, mapping a series of relationships between elements onto a graph is sometimes not straightforward. Given that all the information we extract using complex network tools depends on this initial graph, it is essential to preprocess the data so that the graph is built as accurately as possible. Here we propose a procedure to build a network that attends only to statistically significant relations between constituents. We use a paradigmatic example of word associations to show the development of our approach. By analyzing the modular structure of the resulting network, we are able to disentangle categorical relations, disambiguating words with a success comparable to that of the best algorithms designed for the same purpose. We acknowledge financial support through Grant No. FIS2009-13364-C02-01, Holopedia (Grant No. TIN2010-21128-C02-01), MOSAICO (Grant No. FIS2006-01485), PRODIEVO (Grant No. FIS2011-22449), and Complexity-NET RESINEE, all of them from Ministerio de Educación y Ciencia in Spain, as well as support from Research Networks MODELICO-CM (Grant No. S2009/ESP-1691) and MA2VICMR (Grant No. S2009/TIC-1542) from Comunidad de Madrid, and Network 2009-SGR-838 from Generalitat de Catalunya.
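
The idea of keeping only statistically significant relations can be illustrated with a minimal sketch. The abstract does not say which test the authors use, so the upper-tail hypergeometric test, the threshold `alpha`, the function names, and the toy word lists below are all assumptions made for illustration.

```python
from itertools import combinations
from math import comb


def cooccurrence_pvalue(n_total, n_a, n_b, k):
    """Upper-tail hypergeometric probability P(X >= k): the chance of at least
    k joint occurrences if a and b were spread over the contexts independently."""
    upper = min(n_a, n_b)
    hits = sum(comb(n_a, i) * comb(n_total - n_a, n_b - i)
               for i in range(k, upper + 1))
    return hits / comb(n_total, n_b)


def significant_edges(contexts, alpha=0.05):
    """Return {(a, b): p_value} for word pairs whose co-occurrence count is
    unlikely under independence (alpha is an assumed threshold)."""
    sets = [set(c) for c in contexts]
    n_total = len(sets)
    count, pair_count = {}, {}
    for s in sets:
        for w in s:
            count[w] = count.get(w, 0) + 1
        for a, b in combinations(sorted(s), 2):
            pair_count[(a, b)] = pair_count.get((a, b), 0) + 1
    edges = {}
    for (a, b), k in pair_count.items():
        p = cooccurrence_pvalue(n_total, count[a], count[b], k)
        if p < alpha:
            edges[(a, b)] = p
    return edges


# Toy data (hypothetical): "the" appears everywhere, so its co-occurrences
# carry no information and should not become edges.
contexts = ([["river", "water", "the"]] * 4
            + [["money", "the"]] * 3
            + [["the"]] * 3)
edges = significant_edges(contexts)
print(edges)
```

In this toy input only the pair ("river", "water") survives the filter, while every pair involving the ubiquitous word "the" is discarded even though its raw co-occurrence counts are the highest.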

    An event detection approach based on Twitter hashtags

    Twitter is one of the most popular microblogging services in the world. The large amount of information it carries has made Twitter an important channel for people to learn about and share news. Hashtags are a popular feature of Twitter. A hashtag can be treated as human-labeled information and helps people identify the topic of a tweet. Many researchers have proposed event-detection approaches that monitor Twitter data and determine whether special events, such as accidents, extreme weather, earthquakes, or crimes, are happening. Although many approaches use hashtags as one of their features, few of them explicitly examine the effectiveness of hashtags for event detection. In this study, we propose an event detection approach that utilizes hashtags in tweets. We adopt the feature extraction used in STREAMCUBE (Feng et al., 2015) and apply K-means clustering (Lloyd, 1982) to it. The experiments were conducted on 20,514 tweets with 8,616 hashtags collected between November 13, 2015 and November 17, 2015 on the general topic of the Paris attacks. A randomly sampled subset of 200 tweets was also manually labeled by a human subject to verify the approach. On the collected tweets, we demonstrate that the K-means approach can outperform STREAMCUBE in clustering results. We also discuss how to set the value of K so that K-means yields better clustering performance.
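
As a rough illustration of clustering tweets by their hashtags with K-means, the sketch below runs Lloyd's algorithm on binary hashtag vectors. It does not reproduce the STREAMCUBE feature extraction; the binary encoding, the deterministic initialization, and the sample tweets are assumptions chosen to keep the example small.

```python
def squared_distance(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm. Centroids start at the first k points, a
    deterministic choice made for this sketch, not a recommended init."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical tweets, each reduced to its set of hashtags.
tweets = [
    {"paris", "prayforparis"},
    {"weather", "storm"},
    {"paris", "attack"},
    {"storm", "rain"},
    {"prayforparis", "attack"},
    {"weather", "rain"},
]
vocabulary = sorted(set().union(*tweets))
vectors = [[1 if tag in tweet else 0 for tag in vocabulary] for tweet in tweets]

labels, centroids = kmeans(vectors, k=2)
print(labels)
```

With this input the tweets about the attacks and the tweets about the weather end up in separate clusters. In practice the result depends on the choice of K and on the initialization, which the sketch fixes only for reproducibility.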

    Knowledge Modelling and Learning through Cognitive Networks

    One of the most promising developments in modelling knowledge is cognitive network science, which aims to investigate cognitive phenomena driven by the networked, associative organization of knowledge. For example, investigating the structure of semantic memory via semantic networks has illuminated how memory recall patterns influence phenomena such as creativity, memory search, learning, and more generally, knowledge acquisition, exploration, and exploitation. In parallel, neural network models for artificial intelligence (AI) are also becoming more widespread as inferential models for understanding which features drive language-related phenomena such as meaning reconstruction, stance detection, and emotional profiling. Whereas cognitive networks map explicitly which entities engage in associative relationships, neural networks perform an implicit mapping of correlations in cognitive data as weights, obtained after training over labelled data and whose interpretation is not immediately evident to the experimenter. This book aims to bring together quantitative, innovative research that focuses on modelling knowledge through cognitive and neural networks to gain insight into mechanisms driving cognitive processes related to knowledge structuring, exploration, and learning. The book comprises a variety of publication types, including reviews and theoretical papers, empirical research, computational modelling, and big data analysis. All papers here share a commonality: they demonstrate how the application of network science and AI can extend and broaden cognitive science in ways that traditional approaches cannot

    Doctor of Philosophy in Computer Science

    Over the last decade, social media has emerged as a revolutionary platform for informal communication and social interactions among people. Publicly expressing thoughts, opinions, and feelings is one of the key characteristics of social media. In this dissertation, I present research on automatically acquiring knowledge from social media that can be used to recognize people's affective state (i.e., what someone feels at a given time) in text. This research addresses two types of affective knowledge: 1) hashtag indicators of emotion consisting of emotion hashtags and emotion hashtag patterns, and 2) affective understanding of similes (a form of figurative comparison). My research introduces a bootstrapped learning algorithm for learning hashtag indicators of emotions from tweets with respect to five emotion categories: Affection, Anger/Rage, Fear/Anxiety, Joy, and Sadness/Disappointment. With a few seed emotion hashtags per emotion category, the bootstrapping algorithm iteratively learns new hashtags and more generalized hashtag patterns by analyzing emotion in tweets that contain these indicators. Emotion phrases are also harvested from the learned indicators to train additional classifiers that use the surrounding word context of the phrases as features. This is the first work to learn hashtag indicators of emotions. My research also presents a supervised classification method for classifying the affective polarity of similes on Twitter. Using lexical, semantic, and sentiment properties of different simile components as features, supervised classifiers are trained to classify a simile into a positive or negative affective polarity class. The property of comparison is also fundamental to the affective understanding of similes. 
My research introduces a novel framework for inferring implicit properties that 1) uses syntactic constructions, statistical association, dictionary definitions and word embedding vector similarity to generate and rank candidate properties, 2) re-ranks the top properties using influence from multiple simile components, and 3) aggregates the ranks of each property from different methods to create a final ranked list of properties. The inferred properties are used to derive additional features for the supervised classifiers to further improve affective polarity recognition. Experimental results show substantial improvements in affective understanding of similes over the use of existing sentiment resources
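
The final aggregation step, in which ranks from different methods are merged into one list, could look like the following sketch. The abstract does not give the aggregation rule, so averaging rank positions, the penalty for missing candidates, the function name `aggregate_ranks`, and the sample rankings are assumptions.

```python
def aggregate_ranks(rankings):
    """Merge several ranked lists into one by average rank position.
    A candidate missing from a list is charged that list's length + 1
    (an assumed penalty). Ties are broken alphabetically."""
    candidates = {c for ranking in rankings for c in ranking}

    def mean_rank(candidate):
        total = 0
        for ranking in rankings:
            if candidate in ranking:
                total += ranking.index(candidate) + 1
            else:
                total += len(ranking) + 1
        return total / len(rankings)

    return sorted(candidates, key=lambda c: (mean_rank(c), c))


# Hypothetical candidate properties for a simile such as "like a rock",
# as ranked by three different methods.
rankings = [
    ["strong", "hard", "steady"],  # e.g. syntactic constructions
    ["hard", "strong", "cold"],    # e.g. statistical association
    ["hard", "steady", "strong"],  # e.g. embedding similarity
]
final = aggregate_ranks(rankings)
print(final)
```

Here "hard" comes first because it is ranked highly by all three lists, while "cold", proposed by only one method, falls to the bottom.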

    Artificial Neural Network methods applied to sentiment analysis

    Sentiment Analysis (SA) is the study of opinions and emotions that are conveyed by text. This field of study has commercial applications for example in market research (e.g., “What do customers like and dislike about a product?”) and consumer behavior (e.g., “Which book will a customer buy next when he wrote a positive review about book X?”). A private person can benefit from SA by automatic movie or restaurant recommendations, or from applications on the computer or smart phone that adapt to the user’s current mood. In this thesis we will put forward research on artificial Neural Network (NN) methods applied to SA. Many challenges arise, such as sarcasm, domain dependency, and data scarcity, that need to be addressed by a successful system. In the first part of this thesis we perform linguistic analysis of a word (“hard”) under the light of SA. We show that sentiment-specific word sense disambiguation is necessary to distinguish fine nuances of polarity. Commonly available resources are not sufficient for this. The introduced Contextually Enhanced Sentiment Lexicon (CESL) is used to label occurrences of “hard” in a real dataset with its sense. That allows us to train a Support Vector Machine (SVM) with deep learning features that predicts the polarity of a single occurrence of the word, just given its context words. We show that the features we propose improve the result compared to existing standard features. Since the labeling effort is not negligible, we propose a clustering approach that reduces the manual effort to a minimum. The deep learning features that help predicting fine-grained, context-dependent polarity are computed by a Neural Network Language Model (NNLM), namely a variant of the Log-Bilinear Language model (LBL). By improving this model the performance of polarity classification might as well improve. 
Thus, we propose non-linear versions of the LBL and of the vectorized Log-Bilinear Language model (vLBL), because non-linear models are generally considered more powerful. In a parameter study on a language modeling task, we show that the non-linear versions indeed perform better than their linear counterparts. However, the difference is small, except in settings where the model has only a few parameters, which might be the case when little training data is available and the model therefore needs to be smaller in order to avoid overfitting. An alternative to the fine-grained polarity classification used above is to train classifiers that make the distinction automatically. Due to the complexity of the task, the challenges of SA in general, and certain domain-specific issues (e.g., when using Twitter text), existing systems have much room for improvement. Statistical classifiers are often used with simple Bag-of-Words (BOW) features or count features that stem from sentiment lexicons. We introduce a linguistically-informed Convolutional Neural Network (lingCNN) that builds upon the fact that there has been much research on language in general and sentiment lexicons in particular. lingCNN makes use of two types of linguistic features: word-based and sentence-based. Word-based features comprise features derived from sentiment lexicons, such as polarity or valence, and general knowledge about language, such as a negation-based feature. Sentence-based features are also based on lexicon counts and valences. The combination of both types of features is superior to the original model without these features. Especially when little training data is available (which can be the case for under-resourced languages), lingCNN proves to be significantly better (up to 12 macro-F1 points). Although linguistic features in terms of sentiment lexicons are beneficial, their usage gives rise to a new set of problems. 
Most lexicons consist of infinitive forms of words only. Especially, lexicons for low-resource languages. However, the text that needs to be classified is unnormalized. Hence, we want to answer the question if morphological information is necessary for SA or if a system that neglects all this information and therefore can make better use of lexicons actually has an advantage. Our approach is to first stem or lemmatize a dataset and then perform polarity classification on it. On Czech and English datasets we show that better results can be achieved with normalization. As a positive side effect, we can compute better word embeddings by first normalizing the training corpus. This works especially well for languages that have rich morphology. We show on word similarity datasets for English, German, and Spanish that our embeddings improve performance. On a new WordNet-based evaluation we confirm these results on five different languages (Czech, English, German, Hungarian, and Spanish). The benefit of this new evaluation is further that it can be used for many other languages, as the only resource that is required is a WordNet. In the last part of the thesis, we use a recently introduced method to create an ultradense sentiment space out of generic word embeddings. This method allows us to compress 400 dimensional word embeddings down to 40 or even just 4 dimensions and still get similar results on a polarity classification task. While the training speed increases by a factor of 44, the difference in classification performance is not significant.
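
The interaction between normalization and base-form lexicons can be sketched as follows. This is only a toy illustration: the lexicon, the lemma table, and the counting classifier are hypothetical stand-ins, not the stemmers, lemmatizers, or models used in the thesis.

```python
# Hypothetical base-form sentiment lexicon (word -> polarity).
LEXICON = {"love": 1, "enjoy": 1, "hate": -1, "bore": -1}

# Hypothetical lemma table standing in for a real lemmatizer.
LEMMAS = {"loved": "love", "enjoying": "enjoy", "hated": "hate", "bored": "bore"}


def lexicon_score(tokens, normalize):
    """Sum lexicon polarities over the tokens, optionally mapping each
    inflected form to its lemma before the lookup."""
    score = 0
    for token in tokens:
        token = token.lower()
        if normalize:
            token = LEMMAS.get(token, token)
        score += LEXICON.get(token, 0)
    return score


tokens = "We loved the cast and kept enjoying every scene".split()
raw_score = lexicon_score(tokens, normalize=False)
normalized_score = lexicon_score(tokens, normalize=True)
print(raw_score, normalized_score)
```

Without normalization the inflected forms "loved" and "enjoying" miss the lexicon and the sentence scores as neutral; after mapping them to their lemmas, both positive entries are found.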