3 research outputs found

    Low-resource Personal Attribute Prediction from Conversation

    Personal knowledge bases (PKBs) are crucial for a broad range of applications such as personalized recommendation and Web-based chatbots. A critical challenge in building PKBs is extracting personal attribute knowledge from users' conversation data. Given some users of a conversational system, a personal attribute and these users' utterances, our goal is to predict the ranking of the given personal attribute values for each user. Previous studies often rely on a relatively large amount of resources such as labeled utterances and external data, yet the attribute knowledge embedded in unlabeled utterances is underutilized and their performance in predicting some difficult personal attributes is still unsatisfactory. In addition, some text classification methods could be employed to solve this task directly. However, they also perform poorly on those difficult personal attributes. In this paper, we propose a novel framework, PEARL, to predict personal attributes from conversations by leveraging the abundant personal attribute knowledge in utterances under a low-resource setting in which no labeled utterances or external data are used. PEARL combines biterm semantic information with word co-occurrence information by using the updated prior attribute knowledge to iteratively refine the biterm topic model's Gibbs sampling process. Extensive experimental results show that PEARL outperforms all the baseline methods not only on the task of personal attribute prediction from conversations over two data sets, but also on the more general weakly supervised text classification task over one data set. Comment: Accepted by AAAI'2
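    The abstract relies on the notion of a biterm without defining it. In the standard biterm topic model, a biterm is an unordered pair of words that co-occur in the same short text. The sketch below only illustrates that extraction step; it is not PEARL's sampling procedure, and the function name `extract_biterms` and its `window` parameter are hypothetical names chosen for this example.

```python
from itertools import combinations


def extract_biterms(tokens, window=None):
    """Return the unordered word pairs (biterms) co-occurring in one short text.

    `window` is an illustrative option: when set, only pairs of tokens
    fewer than `window` positions apart are kept.
    """
    biterms = []
    for i, j in combinations(range(len(tokens)), 2):
        if window is not None and j - i >= window:
            continue
        # Sort each pair so that (a, b) and (b, a) count as the same biterm.
        biterms.append(tuple(sorted((tokens[i], tokens[j]))))
    return biterms


utterance = ["i", "love", "hiking"]
print(extract_biterms(utterance))
```

    For an utterance as short as the one above, every pair of words becomes a biterm, which is why this representation is often used for sparse conversational text.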

    Twitter alloy steel disambiguation and user relevance via one-class and two-class news titles classifiers

    This paper addresses the nontrivial task of Twitter financial disambiguation (TFD), which is relevant for filtering financial domain tweets (e.g., alloy steel or coffee prices) when no unique identifiers (e.g., cashtags) are adopted. To automate TFD, we propose a transfer learning approach that uses freely labeled news titles to train diverse one-class and two-class classification methods. These include different text handling transforms, adaptations of statistical measures and modern machine learning methods, including support vector machines (SVM), deep autoencoders and multilayer perceptrons. As a case study, we analyzed the domain of alloy steel prices, collecting a recent Twitter dataset. Overall, the best results were achieved by a two-class SVM fed with TFD statistical measures and topic model features, obtaining an 80% and 71% discrimination level when tested with 11,081 and 3,000 manually labeled tweets. The best one-class performance (78% and 69% for the same test tweets) was obtained by a term frequency-inverse document frequency classifier (TF-IDFC). These models were further used to generate a Financial User Relevance rank (FUR) score, aiming to filter relevant users. The SVM and TF-IDFC FUR models obtained a predictive user discrimination level of 80% and 75% when tested with a manually labeled test sample of 418 users. These results confirm the proposed joint TFD-FUR approach as a valuable tool for the selection of Twitter texts and users for financial social media analytics (e.g., sentiment analysis, detection of influential users). Research carried out with the support of resources of Big and Open Data Innovation Laboratory (BODaI-Lab), University of Brescia, granted by Fondazione Cariplo and Regione Lombardia

    Attitudes towards the Covid-19 vaccine on Twitter in Norway

    The goal of this thesis is to characterize the distribution of attitudes present on Norwegian Twitter concerning the Covid-19 vaccine by implementing methods for text analysis and social media network analysis. The first analysis performed was manually classifying a sample of the dataset into four categories: irrelevant, neutral, vaccine hesitancy and anti-vaccine hesitancy. This sample dataset was used to train a supervised machine learning model, using BoW and SVM, in order to classify the total dataset. Furthermore, two methods for topic modeling were implemented: Latent Dirichlet Allocation and Biterm. Lastly, three main social networks were created: a mentioning-network containing users mentioned or mentioning in the dataset, a retweet-network containing users retweeted/quoted or retweeting/quoting, and a sentiment network only including users classified as vaccine hesitancy and anti-vaccine hesitancy in the sample network. The ten users with the highest scores for in-degree, out-degree and betweenness from the retweet network were analyzed to determine sentiment. The main findings are that the methods for topic modeling did not fit expectations and gave limited findings concerning topics in the theme, but topic modeling illustrated the amount of noise in the dataset. The manual classification resulted in approximately 30% vaccine hesitancy, while the trained supervised machine learning model resulted in only 10% vaccine hesitancy. The mentioning-network illustrated that the debate evolved and then stabilized through the autumn/winter of 2020. The most mentioned users were positive towards the vaccine. There was a separation regarding sentiment between the most retweeted users and the users retweeting most. Users displaying vaccine hesitancy sentiment tended to retweet slightly more than users displaying anti-vaccine hesitancy sentiment, and there were signs of echo chambers. Master's thesis in information science (INFO390, MASV-INF)
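    The in-degree and out-degree ranking described in this abstract can be sketched as follows. This is a minimal illustration, not the thesis's implementation: it assumes each retweet is stored as a directed edge from the retweeting user to the retweeted user, and the names `degree_scores` and `top_users` are hypothetical. Betweenness is omitted because it needs a shortest-path computation.

```python
from collections import Counter


def degree_scores(edges):
    """Count in-degree and out-degree for a directed retweet network.

    Each edge is (retweeting_user, retweeted_user); this direction is an
    assumption made for the example.
    """
    in_deg, out_deg = Counter(), Counter()
    for source, target in edges:
        out_deg[source] += 1  # the source user retweeted someone
        in_deg[target] += 1   # the target user was retweeted
    return in_deg, out_deg


def top_users(scores, k=10):
    """Return the k users with the highest score, highest first."""
    return [user for user, _ in scores.most_common(k)]


# Toy edge list with made-up user names.
edges = [("anna", "bjorn"), ("carl", "bjorn"), ("anna", "carl")]
in_deg, out_deg = degree_scores(edges)
print(top_users(in_deg, k=1), top_users(out_deg, k=1))
```

    Under this edge direction, a high in-degree marks a frequently retweeted user and a high out-degree marks a user who retweets often.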