583 research outputs found

    A Pointillism Approach for Natural Language Processing of Social Media

    Get PDF
    Natural language processing tasks typically start with the basic unit of words, and then from words and their meanings a big picture is constructed about what the meanings of documents or other larger constructs are in terms of the topics discussed. Social media is very challenging for natural language processing because it challenges the notion of a word. Social media users regularly use words that are not in even the most comprehensive lexicons. These new words can be unknown named entities that have suddenly risen in prominence because of a current event, or they might be neologisms newly created to emphasize meaning or evade keyword filtering. Chinese social media is particularly challenging. The Chinese language poses challenges for natural language processing based on the unit of a word even for formal uses of the Chinese language, social media only makes word segmentation in Chinese even more difficult. Thus, even knowing what the boundaries of words are in a social media corpus is a difficult proposition. For these reasons, in this document I propose the Pointillism approach to natural language processing. In the pointillism approach, language is viewed as a time series, or sequence of points that represent the grams\u27 usage over time. Time is an important aspect of the Pointillism approach. Detailed timing information, such as timestamps of when posts were posted, contain correlations based on human patterns and current events. This timing information provides the necessary context to build words and phrases out of trigrams and then group those words and phrases into topical clusters. Rather than words that have individual meanings, the basic unit of the pointillism approach is trigrams of characters. These grams take on meaning in aggregate when they appear together in a way that is correlated over time. I anticipate that the pointillism approach can perform well in a variety of natural language processing tasks for many different languages, but in this document my focus is on trend analysis for Chinese microblogging. Microblog posts have a timestamp of when posts were posted, that is accurate to the minute or second (though, in this dissertation, I bin posts by the hour). To show that trigrams supplemented with frequency information do collect scattered information into meaningful pieces, I first use the pointillism approach to extract phrases. I conducted experiments on 4-character idioms, a set of 500 phrases that are longer than 3 characters taken from the Chinese-language version of Wiktionary, and also on Weibo\u27s hot keywords. My results show that when words and topics do have a meme-like trend, they can be reconstructed from only trigrams. For example, for 4-character idioms that appear at least 99 times in one day in my data, the unconstrained precision (that is, precision that allows for deviation from a lexicon when the result is just as correct as the lexicon version of the word or phrase) is 0.93. For longer words and phrases collected from Wiktionary, including neologisms, the unconstrained precision is 0.87. I consider these results to be very promising, because they suggest that it is feasible for a machine to reconstruct complex idioms, phrases, and neologisms with good precision without any notion of words. Next, I examine the potential of the pointillism approach for extracting topical trends from microblog posts that are related to environmental issues. Independent Component Analysis (ICA) is utilized to find the trigrams which have the same independent signal source, i.e., topics. Contrast this with probabilistic topic models, which leverage co-occurrence to classify the documents into the topics they have learned, so it is hard for it to extract topics in real-time. However, pointillism approach can extract trends in real-time, whether those trends have been discussed before or not. This is more challenging because in phrase extraction, order information is used to narrow down the candidates, whereas for trend extraction only the frequency of the trigrams are considered. The proposed approach is compared against a state of the art topic extraction technique, Latent Dirichlet Allocation (LDA), on 9,147 labelled posts with timestamps. The experimental results show that the highest F1 score of the pointillism approach with ICA is 4% better than that of LDA. Thus, using the pointillism approach, the colorful and baroque uses of language that typify social media in challenging languages such as Chinese may in fact be accessible to machines. The thesis that my dissertation tests is this: For topic extraction for scenarios where no adequate lexicon is available, such as social media, the Pointillism approach uses timing information to out-perform traditional techniques that are based on co-occurrence

    Blog Style Classification: Refining Affective Blogs

    Get PDF
    In the constantly growing blogosphere with no restrictions on form or topic, a number of writing styles and genres have emerged. Recognition and classification of these styles has become significant for information processing with an aim to improve blog search or sentiment mining. One of the main issues in this field is detection of informative and affective articles. However, such differentiation does not suffice today. In this paper we extend the differentiation and suggest a fine-grained set of subcategories for affective articles. We propose and evaluate a classification method employing novel lexical, morphological, lightweight syntactic and structural features of written text. The results show that our method outperforms the existing approaches

    Hierarchical categorisation of web tags for Delicious

    Get PDF
    In the scenario of social bookmarking, a user browsing the Web bookmarks web pages and assigns free-text labels (i.e., tags) to them according to their personal preferences. The benefits of social tagging are clear – tags enhance Web content browsing and search. However, since these tags may be publicly available to any Internet user, a privacy attacker may collect this information and extract an accurate snapshot of users’ interests or user profiles, containing sensitive information, such as health-related information, political preferences, salary or religion. In order to hinder attackers in their efforts to profile users, this report focuses on the practical aspects of capturing user interests from their tagging activity. More accurately, we study how to categorise a collection of tags posted by users in one of the most popular bookmarking services, Delicious (http://delicious.com).Preprin

    Investigating and extending the methods in automated opinion analysis through improvements in phrase based analysis

    Get PDF
    Opinion analysis is an area of research which deals with the computational treatment of opinion statement and subjectivity in textual data. Opinion analysis has emerged over the past couple of decades as an active area of research, as it provides solutions to the issues raised by information overload. The problem of information overload has emerged with the advancements in communication technologies which gave rise to an exponential growth in user generated subjective data available online. Opinion analysis has a rich set of applications which are used to enable opportunities for organisations such as tracking user opinions about products, social issues in communities through to engagement in political participation etc.The opinion analysis area shows hyperactivity in recent years and research at different levels of granularity has, and is being undertaken. However it is observed that there are limitations in the state-of-the-art, especially as dealing with the level of granularities on their own does not solve current research issues. Therefore a novel sentence level opinion analysis approach utilising clause and phrase level analysis is proposed. This approach uses linguistic and syntactic analysis of sentences to understand the interdependence of words within sentences, and further uses rule based analysis for phrase level analysis to calculate the opinion at each hierarchical structure of a sentence. The proposed opinion analysis approach requires lexical and contextual resources for implementation. In the context of this Thesis the approach is further presented as part of an extended unifying framework for opinion analysis resulting in the design and construction of a novel corpus. The above contributions to the field (approach, framework and corpus) are evaluated within the Thesis and are found to make improvements on existing limitations in the field, particularly with regards to opinion analysis automation. Further work is required in integrating a mechanism for greater word sense disambiguation and in lexical resource development
    corecore