    Opinion Mining Using Population-tuned Generative Language Models

    We present a novel method for mining opinions from text collections using generative language models trained on data collected from different populations. We describe the basic definitions, methodology and a generic algorithm for opinion insight mining. We demonstrate the performance of our method in an experiment where a pre-trained generative model is fine-tuned using specifically tailored content with unnatural and fully annotated opinions. We show that our approach can learn and transfer the opinions to the semantic classes while maintaining the proportion of polarisation. Finally, we demonstrate the usage of an insight mining system to scale up the discovery of opinion insights from a real text corpus

    Demographic inference and affect estimation of microbloggers

    Abstract Owing to the peculiar nature of the discourse on Twitter, developing analytical frameworks to derive useful insights from Twitter remains challenging as evidenced by the poor performance at tasks such as reliable demographic inference, affect estimation, and event detection. One of the focal problems lies in analyzing short texts in general, and tweets in particular. The analysis is as such made difficult because of the vagaries of the linguistic expressions and Twitter further exacerbates this by enabling the use of emojis, hashtags, URLs, and embedded media. While the previous research has demonstrated ways of extracting useful information from individual tweet-texts to some extent, a detailed and thorough investigation of the role of metadata has not yet been systematically performed. Furthermore, a majority of the previous work has paid little or no attention to the emerging role of deep learning approaches in Twitter-based analytics. These observations motivate this thesis, which aims to enhance machine understanding of tweets towards deriving deeper insights from the public data on Twitter and inform the scientific objectives of this thesis. First, this thesis sets out to empirically investigate the impact and efficacy of deep learning approaches integrating message-text and metadata leveraging on the distributed semantic representations of textual entities. Second, the thesis contributes towards improving capturing enhanced semantics from tweets by harnessing external, open-sourced knowledge graphs and other crowd-sourced lexical resources. Third, the role of the user-created metadata, such as hashtags and URLs, in machine understanding of tweets is examined and quantified. At the same time, computational models are introduced to derive conversational, topical, and temporal contexts of tweets and utilize them in machine learning models to improve Twitter-based analytics. 
Validation of the proposed novel machine learning models integrating the diverse footprints of users’ online activity/behavior is achieved by employing them in various case study applications. In addition, the datasets and the tools developed during this thesis have been made available publicly for the scientific community.
    Summary: Twitter-based analytics has entered the toolkit of several scientific disciplines in recent years. However, developing systematic analytical frameworks is challenging because of the particular nature of microblog discourse. The poor performance of analysis methods has been observed in several applications, such as demographic and affect analysis of authors, and tasks that aim to detect important events from microblog posts. The analyses must be performed on very short text snippets, in this study specifically microblog posts. The use of idiosyncratic and personal linguistic expressions, as well as Twitter's emojis, hashtags, external links (URLs), and embedded images and videos, further complicates the problem. Previous studies have succeeded to some extent in deriving useful information from individual microblog posts, but the role and significance of metadata have not yet been studied systematically or in detail. In addition, the use of deep learning in analyses of Twitter-based data has been studied little or not at all. The aim of this dissertation is to improve the ability of computers to process microblog posts so that a better and more meaningful machine understanding of public Twitter data becomes possible. First, the study empirically tests the impact and efficacy of a deep learning model in integrating distributed semantic representations of the aforementioned textual entities. Second, the work improves content analysis of microblog posts using external, open-source knowledge graphs and other crowd-sourced lexicons. Third, the roles of user-created metadata, such as hashtags and external links, in analytical frameworks are examined and quantified. The work introduces computational models for inferring the conversational, topical, and temporal contexts of microblog posts and uses these models to improve the performance of machine learning models in Twitter-based analytics. The diverse data obtained from microbloggers' online behavior is integrated by means of machine learning models. The datasets used in the work and the tools developed in the study have been made publicly available to the scientific community.

    Novel semantics-based distributed representations for message polarity classification using deep convolutional neural networks

    Abstract Unsupervised learning of distributed representations (word embeddings) obviates the need for task-specific feature engineering for various NLP applications. However, such representations learned from massive text datasets do not faithfully represent finer semantic information in the feature space required by specific applications. This is because (a) models learning such representations ignore the linguistic structure of the sentences, (b) they fail to capture polysemous usages of the words, and (c) they ignore pre-existing semantic information from manually-created ontologies. In this paper, we propose three semantics-based distributed representations of words and phrases as features for message polarity classification: Sentiment-Specific Multi-Word Expressions Embeddings (SSMWE) are sentiment-encoded distributed representations of multi-word expressions (MWEs); Sense-Disambiguated Word Embeddings (SDWE) are sense-specific distributed representations of words; and WordNet embeddings (WNE) are distributed representations of the hypernym and hyponym of the correct sense of a given word. We examine the effects of these features incorporated in a convolutional neural network (CNN) model for evaluation on the SemEval benchmark dataset. Our approach of using these novel features yields a 14.24% improvement in the macro-averaged F1 score on SemEval datasets over existing methods. While we have shown promising results in Twitter sentiment classification, we believe that the method is general enough to be applied to many NLP applications where finer semantic analysis is required.

    On the use of distributed semantics of tweet metadata for user age prediction

    Social media data represent an important resource for behavioral analysis of the aging population. This paper addresses the problem of age prediction from a Twitter dataset, where prediction is treated as a classification task. For this purpose, an innovative model based on a convolutional neural network (CNN) is devised. To this end, we rely on language-related features and social media specific metadata. More specifically, we introduce two features that have not been previously considered in the literature: the content of URLs and hashtags appearing in tweets. We also employ distributed representations of words and phrases present in tweets, hashtags and URLs, pre-trained on appropriate corpora in order to exploit their semantic information in age prediction. We show that our CNN-based classifier, when compared with baseline models, yields an improvement in the micro-averaged F1 score of up to 12.3% for the Dutch dataset, 9.8% for the English1 dataset, and 6.6% for the English2 dataset.
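
The abstract above reports gains in micro-averaged F1, while the preceding entry reports macro-averaged F1. As a reading aid, the following is a minimal sketch of how the two averages differ for single-label classification. The function name `f1_micro_macro` and the toy labels are assumptions made for illustration; this is not the evaluation code used in either paper.

```python
def f1_micro_macro(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label predictions.

    Hypothetical helper for illustration only.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")

    labels = sorted(set(y_true) | set(y_pred))
    tp = {label: 0 for label in labels}  # true positives per class
    fp = {label: 0 for label in labels}  # false positives per class
    fn = {label: 0 for label in labels}  # false negatives per class

    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro


if __name__ == "__main__":
    micro, macro = f1_micro_macro(["a", "a", "a", "b"], ["a", "a", "b", "b"])
    print(f"micro-F1={micro:.3f} macro-F1={macro:.3f}")
```

Micro-averaging weights every instance equally, so frequent classes dominate; macro-averaging weights every class equally, so rare classes count as much as common ones.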

    Inferring demographic data of marginalized users in Twitter with computer vision APIs

    Abstract Inferring demographic intelligence from unlabeled social media data is an actively growing area of research, challenged by low availability of ground truth annotated training corpora. High-accuracy approaches for labeling demographic traits of social media users employ various heuristics that do not scale up and often discount non-English texts and marginalized users. First, we present a framework for inferring the demographic attributes of Twitter users from their profile pictures (avatars) using the Microsoft Azure Face API. Second, we measure the inter-rater agreement between annotations made using our framework against two pre-labeled samples of Twitter users (N1=1163; N2=659) whose age labels were manually annotated. Our results indicate that the strength of the inter-rater agreement (Gwet’s AC1=0.89; 0.90) between the gold standard and our approach is ‘very good’ for labelling the age group of users. The paper provides a use case of Computer Vision for enabling the development of large cross-sectional labeled datasets, and further advances novel solutions in the field of demographic inference from short social media texts
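
The agreement statistic cited above, Gwet's AC1, can be sketched for the two-rater case as follows. This is a minimal illustration, assuming that the category set is exactly the set of labels observed in the two annotation lists; the function name `gwet_ac1` and the toy age-group labels are hypothetical and do not come from the paper.

```python
from collections import Counter


def gwet_ac1(rater_a, rater_b):
    """Gwet's AC1 for two raters over nominal categories.

    Assumption: the categories are those that appear in the data.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(rater_a)
    categories = sorted(set(rater_a) | set(rater_b))
    q = len(categories)
    if q < 2:
        raise ValueError("AC1 needs at least two categories")

    # Observed agreement: share of items given the same label by both raters.
    p_a = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: uses the mean proportion pi_k of each category
    # across the two raters, p_e = sum(pi_k * (1 - pi_k)) / (q - 1).
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(
        pi * (1 - pi)
        for pi in ((count_a[c] + count_b[c]) / (2 * n) for c in categories)
    ) / (q - 1)

    return (p_a - p_e) / (1 - p_e)


if __name__ == "__main__":
    gold = ["young", "young", "old", "old"]
    api = ["young", "young", "old", "young"]
    print(f"AC1={gwet_ac1(gold, api):.3f}")
```

Unlike Cohen's kappa, the chance-agreement term here stays bounded when one category dominates, which is one reason AC1 is often preferred for skewed label distributions.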

    Covert online ethnography and machine learning for detecting individuals at risk of being drawn into online sex work

    Abstract How can we identify individuals at risk of being drawn into online sex work? The spread of online communication removes transaction costs and enables a greater number of people to be involved in illicit activities, including online sex trade. As a result, social media platforms often work as a springboard for criminal careers, posing a significant risk to the economy, public health and trust. Detecting deviant behaviors online is limited by the poor availability of ground-truth data and machine learning tools. Unlike prior work, which focuses exclusively on either qualitative or quantitative methods, in this paper we combine covert online ethnography with semi-supervised learning methodologies, using data from a popular European adult forum. We obtained risk assessment results for 78 users using covert online ethnography, and set out to build a machine learning model that can predict the risk factor for the other 28,832 users. Results show that a combination-based approach in which all features are used yields the most accurate results.

    Meta-terrorism: identifying linguistic patterns in public discourse after an attack

    Abstract When a terror-related event occurs, there is a surge of traffic on social media comprising informative messages, emotional outbursts, helpful safety tips, and rumors. It is important to understand the behavior manifested on social media sites to gain a better understanding of how to govern and manage in a time of crisis. We undertook a detailed study of Twitter during two recent terror-related events: the Manchester attacks and the Las Vegas shooting. We analyze the tweets during these periods using (a) sentiment analysis, (b) topic analysis, and (c) fake news detection. Our analysis demonstrates the spectrum of emotions evinced in reaction and the way those reactions spread over the event timeline. Also, with respect to topic analysis, we find “echo chambers”, groups of people interested in similar aspects of the event. Encouraged by our results on these two event datasets, the paper seeks to enable a holistic analysis of social media messages in a time of crisis.