
    A Pointillism Approach for Natural Language Processing of Social Media

    Natural language processing tasks typically start with the basic unit of words; from words and their meanings, a big picture is constructed about the meanings of documents or other larger constructs in terms of the topics discussed. Social media is very challenging for natural language processing because it challenges the notion of a word. Social media users regularly use words that are not in even the most comprehensive lexicons. These new words can be unknown named entities that have suddenly risen in prominence because of a current event, or they might be neologisms newly coined to emphasize meaning or to evade keyword filtering. Chinese social media is particularly challenging: the Chinese language poses difficulties for word-based natural language processing even in formal usage, and social media makes word segmentation in Chinese more difficult still. Thus, even knowing where the word boundaries are in a social media corpus is a difficult proposition.

    For these reasons, in this document I propose the Pointillism approach to natural language processing. In the Pointillism approach, language is viewed as a time series, a sequence of points that represents each gram's usage over time. Time is an essential aspect of the approach: detailed timing information, such as the timestamps of when posts were published, contains correlations driven by human patterns and current events. This timing information provides the context needed to build words and phrases out of trigrams and then group those words and phrases into topical clusters. Rather than words that have individual meanings, the basic unit of the Pointillism approach is the character trigram. These grams take on meaning in aggregate when they appear together in a way that is correlated over time. I anticipate that the Pointillism approach can perform well in a variety of natural language processing tasks for many different languages, but in this document my focus is on trend analysis for Chinese microblogging. Microblog posts carry timestamps accurate to the minute or second (though, in this dissertation, I bin posts by the hour).

    To show that trigrams supplemented with frequency information do collect scattered information into meaningful pieces, I first use the Pointillism approach to extract phrases. I conducted experiments on 4-character idioms, on a set of 500 phrases longer than 3 characters taken from the Chinese-language version of Wiktionary, and on Weibo's hot keywords. My results show that when words and topics have a meme-like trend, they can be reconstructed from trigrams alone. For example, for 4-character idioms that appear at least 99 times in one day in my data, the unconstrained precision (that is, precision that allows for deviation from a lexicon when the result is just as correct as the lexicon version of the word or phrase) is 0.93. For longer words and phrases collected from Wiktionary, including neologisms, the unconstrained precision is 0.87. I consider these results very promising, because they suggest that it is feasible for a machine to reconstruct complex idioms, phrases, and neologisms with good precision without any notion of words.
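
    A minimal sketch of the trigram-merging idea follows. The hourly binning, the 0.9 correlation threshold, and the greedy left-to-right extension are illustrative assumptions, not the dissertation's exact algorithm:

```python
from collections import defaultdict

import numpy as np


def trigram_series(posts, n_hours):
    """Map each character trigram to its hourly frequency vector.

    `posts` is an iterable of (text, hour_index) pairs.
    """
    counts = defaultdict(lambda: np.zeros(n_hours))
    for text, hour in posts:
        for i in range(len(text) - 2):
            counts[text[i:i + 3]][hour] += 1
    return dict(counts)


def extend_phrase(seed, series, threshold=0.9, max_len=10):
    """Greedily extend `seed` with trigrams that overlap it by two
    characters and whose hourly usage curves correlate above `threshold`."""
    phrase, cur = seed, seed
    while len(phrase) < max_len:
        nxt = next(
            (t for t in series
             if t != cur and t.startswith(cur[-2:])
             and np.corrcoef(series[cur], series[t])[0, 1] > threshold),
            None,
        )
        if nxt is None:
            break
        phrase, cur = phrase + nxt[-1], nxt
    return phrase
```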
    Next, I examine the potential of the Pointillism approach for extracting topical trends from microblog posts related to environmental issues. Independent Component Analysis (ICA) is used to find trigrams that share the same independent signal source, i.e., the same topic. Contrast this with probabilistic topic models, which leverage co-occurrence to classify documents into topics they have already learned, making it hard for them to extract topics in real time; the Pointillism approach can extract trends in real time, whether or not those trends have been discussed before. Trend extraction is more challenging than phrase extraction, because phrase extraction uses character-order information to narrow down the candidates, whereas trend extraction considers only the frequencies of the trigrams. The proposed approach is compared against a state-of-the-art topic extraction technique, Latent Dirichlet Allocation (LDA), on 9,147 labelled posts with timestamps. The experimental results show that the highest F1 score of the Pointillism approach with ICA is 4% better than that of LDA. Thus, using the Pointillism approach, the colorful and baroque uses of language that typify social media in challenging languages such as Chinese may in fact be accessible to machines. The thesis that my dissertation tests is this: for topic extraction in scenarios where no adequate lexicon is available, such as social media, the Pointillism approach uses timing information to outperform traditional techniques based on co-occurrence.
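
    The ICA step can be sketched as follows, assuming a matrix of hourly trigram frequencies; the grouping-by-strongest-loading rule and `n_topics=10` are my own illustrative choices, not necessarily the dissertation's pipeline:

```python
import numpy as np
from sklearn.decomposition import FastICA


def ica_topics(freq_matrix, trigrams, n_topics=10):
    """Group trigrams by the independent source they load on most.

    `freq_matrix` has shape (n_trigrams, n_hours): one hourly
    frequency series per trigram.
    """
    ica = FastICA(n_components=n_topics, random_state=0)
    # Treat hours as samples and trigrams as features, so each unmixing
    # component assigns a loading to every trigram.
    ica.fit(freq_matrix.T)
    loadings = np.abs(ica.components_)  # shape (n_topics, n_trigrams)
    topics = {k: [] for k in range(n_topics)}
    for j, tri in enumerate(trigrams):
        topics[int(loadings[:, j].argmax())].append(tri)
    return topics
```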

    Detecting Traffic Information From Social Media Texts With Deep Learning Approaches

    Mining traffic-relevant information from social media data has become an emerging topic due to the real-time and ubiquitous nature of social media. In this paper, we focus on a specific problem in social media mining: extracting traffic-relevant microblogs from Sina Weibo, a Chinese microblogging platform. We cast it as a machine learning problem of short-text classification. First, we apply the continuous bag-of-words model to learn word embedding representations from a data set of three billion microblogs. Compared with the traditional one-hot vector representation of words, word embeddings capture semantic similarity between words and have proven effective in natural language processing tasks. Next, we propose using convolutional neural networks (CNNs), long short-term memory (LSTM) models, and their combination, LSTM-CNN, to extract traffic-relevant microblogs with the learned word embeddings as inputs. We compare the proposed methods with competitive approaches, including a support vector machine (SVM) model based on bag-of-n-gram features, an SVM model based on word vector features, and a multi-layer perceptron model based on word vector features. Experiments show the effectiveness of the proposed deep learning approaches.
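
    A minimal Keras sketch of an LSTM-CNN classifier of this kind; the layer sizes and embedding dimension are illustrative assumptions, not the paper's reported hyperparameters, and in practice the embedding layer would be seeded with the CBOW vectors learned from the microblog corpus:

```python
from tensorflow.keras import layers, models


def build_lstm_cnn(vocab_size, embed_dim=100):
    """LSTM over the embedded token sequence, then a 1-D convolution and
    max pooling, ending in a binary 'traffic-relevant or not' output."""
    model = models.Sequential([
        layers.Embedding(vocab_size, embed_dim),   # seed with CBOW vectors
        layers.LSTM(128, return_sequences=True),   # sequence modeling
        layers.Conv1D(128, kernel_size=3, activation="relu"),  # local features
        layers.GlobalMaxPooling1D(),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```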

    Using social media for air pollution detection-the case of Eastern China Smog

    Air pollution has become an urgent issue affecting public health and daily life in China. The use of social media as a potential 'sensor' for monitoring air pollution has recently drawn attention. In this research, we examine one case, the 2013 Eastern China smog, and focus on two of the most popular Chinese microblog platforms, Sina Weibo and Tencent Weibo. The purpose of this study is to determine whether social media can be used as a 'sensor' to monitor air pollution in China, and to provide an innovative model for air pollution detection through social media. Accordingly, we pose the research question: how do salient changes in air quality manifest in social media discussions, and how well do those discussions reflect the extent of air pollution? Our research (1) determines the correlation between the volume of air-quality-related messages and the observed air quality index (AQI) using a time series analysis model, and (2) further investigates the impact of a salient change in air quality on the relationship between people's subjective perceptions of air pollution posted on Weibo and the extent of air pollution, through a co-word network analysis model. Our study illustrates that social media discussions about air quality reflect the level of air pollution when the air quality changes saliently.
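
    The volume-versus-AQI step can be sketched in a few lines; the daily granularity, the lag window, and the function name are illustrative assumptions rather than the paper's exact model:

```python
import pandas as pd


def volume_aqi_correlation(daily_posts: pd.Series, daily_aqi: pd.Series,
                           max_lag: int = 3):
    """Pearson correlation between post volume and AQI at several lags.

    Both series are indexed by date. A positive lag asks whether today's
    discussion volume tracks the AQI of `lag` days earlier.
    """
    return {lag: daily_posts.corr(daily_aqi.shift(lag))
            for lag in range(-max_lag, max_lag + 1)}
```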

    VIRAL TOPIC PREDICTION AND DESCRIPTION IN MICROBLOG SOCIAL NETWORKS

    Ph.D. (Doctor of Philosophy)

    Toward a Cognitive-Inspired Hashtag Recommendation for Twitter Data Analysis

    This research investigates hashtag suggestion in a large, heterogeneous social network, using a cognitive-inspired deep learning solution based on distributed knowledge graphs. Community detection is first performed to find the connected communities in the network. A knowledge graph is then generated for each discovered community, with an emphasis on expressing the semantic relationships among the Twitter platform's user communities. An embedded deep learning model is trained for each community. To recommend hashtags for a new user, the correlation between that user's tweets and each community's knowledge graph is explored to identify the user's relevant communities; the models of those communities are then used to infer hashtags for the user's tweets. We conducted extensive testing to demonstrate the usefulness of our methods on a variety of tweet collections. Experimental results show that the proposed approach is more efficient than the baseline approaches in terms of both runtime and accuracy.
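
    A minimal sketch of the recommendation flow under my own assumptions: Louvain community detection stands in for the paper's community step, embedding-centroid similarity stands in for the knowledge-graph correlation step, and `embed` (text to vector) and `community_models` (community index to a model with a `.predict_hashtags(texts)` method) are hypothetical components:

```python
import networkx as nx
import numpy as np


def route_and_recommend(follow_graph, member_texts, user_tweets,
                        embed, community_models, top_k=2):
    """Detect communities, match the new user to the closest ones by
    embedding similarity, and pool hashtags from those communities'
    models."""
    communities = nx.community.louvain_communities(follow_graph, seed=0)
    user_vec = np.mean([embed(t) for t in user_tweets], axis=0)

    def cosine(a, b):
        return float(np.dot(a, b) /
                     (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    # Score each community by the similarity between the user's tweets
    # and the centroid of that community's members' tweets.
    scores = {}
    for idx, members in enumerate(communities):
        texts = [t for u in members for t in member_texts.get(u, [])]
        if texts:
            centroid = np.mean([embed(t) for t in texts], axis=0)
            scores[idx] = cosine(user_vec, centroid)

    relevant = sorted(scores, key=scores.get, reverse=True)[:top_k]
    tags = set()
    for idx in relevant:
        tags.update(community_models[idx].predict_hashtags(user_tweets))
    return tags
```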