
    Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-10)


    Normalization of common noisy terms in Malaysian online media

    This paper proposes a normalization technique for noisy terms that occur in Malaysian micro-texts. Noisy terms are common in online messages and influence the results of tasks such as text classification and information retrieval. Although many researchers have studied methods to solve this problem, few have looked into it for languages other than English. In this study, about 5,000 noisy texts were extracted from 15,000 documents created by Malaysians. Normalization was performed using specific translation rules as part of the preprocessing steps in opinion mining of movie reviews. The results show up to a 5% improvement in the accuracy of opinion mining.
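    As a rough illustration of the rule-based normalization the abstract describes, the sketch below maps noisy Malay shorthand to standard forms; the rule dictionary is hypothetical and not the paper's actual rule set:

```python
# Minimal sketch of rule-based normalization of noisy Malay micro-text.
# The rules below are illustrative examples, not the paper's actual rules.
NORMALIZATION_RULES = {
    "x": "tidak",      # common shorthand for "not"
    "sy": "saya",      # shorthand for "I"
    "yg": "yang",      # shorthand for "that/which"
    "dgn": "dengan",   # shorthand for "with"
}

def normalize(text: str) -> str:
    """Replace each noisy token with its standard form, if a rule exists."""
    tokens = text.lower().split()
    return " ".join(NORMALIZATION_RULES.get(tok, tok) for tok in tokens)
```

A normalized message can then be passed unchanged into a downstream opinion-mining pipeline.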

    Bot Spammer Detection in Twitter Using Tweet Similarity and Time Interval Entropy

    The popularity of Twitter has attracted spammers who disseminate large amounts of spam messages. Preliminary studies have shown that most spam messages are produced automatically by bots, so bot spammer detection can significantly reduce the number of spam messages on Twitter. However, to the best of our knowledge, little research has focused on detecting Twitter bot spammers. This paper therefore proposes a novel approach that differentiates bot spammers from legitimate user accounts using time interval entropy and tweet similarity. Timestamp collections are used to calculate the time interval entropy of each user, and uni-gram matching-based similarity is used to calculate tweet similarity. Datasets containing both normal and spammer accounts were crawled from Twitter. Experimental results show that some legitimate users exhibit posting behaviour as regular as that of bot spammers, and several legitimate users also post similar tweets, so detection using either feature alone is suboptimal. Combining both features, however, gives better classification results: the precision, recall, and F-measure of the proposed method reached 85.71%, 94.74%, and 90% respectively, outperforming methods that use only time interval entropy or only tweet similarity.
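    The two features the abstract combines can be sketched as follows; the implementations below (Shannon entropy over inter-tweet intervals, and Jaccard overlap of unigrams) are plausible readings of the method, not the authors' exact formulations:

```python
import math
from collections import Counter

def time_interval_entropy(timestamps):
    """Shannon entropy of the inter-tweet time intervals (e.g. in seconds).
    Low entropy suggests regular, bot-like posting behaviour."""
    intervals = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if not intervals:
        return 0.0
    total = len(intervals)
    counts = Counter(intervals)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def unigram_similarity(tweet_a, tweet_b):
    """Uni-gram matching similarity as Jaccard overlap of token sets."""
    a, b = set(tweet_a.lower().split()), set(tweet_b.lower().split())
    return len(a & b) / len(a | b) if a or b else 0.0
```

A bot posting on a fixed schedule yields zero interval entropy, while near-duplicate tweets yield similarity close to 1; an account scoring low on the first and high on the second would be flagged.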

    AT&T vs Verizon: Mining Twitter for Customer Satisfaction towards North American Mobile Operators

    The North American telecommunications sector is one of the leading mobile broadband sectors worldwide, representing increasingly important revenue opportunities for mobile operators. Given that the market is becoming saturated and revenue from new subscriptions is deteriorating, mobile carriers tend to focus on customer service and high levels of customer satisfaction in order to retain customers and maintain a low churn rate. In this context, it is critically important to be able to measure the overall customer satisfaction level by explicitly or implicitly mining public opinion. In this paper, we argue that online social media can be exploited as a proxy to infer customer satisfaction through automated, machine-learning-based sentiment analysis techniques. Our work focuses on the two leading mobile broadband carriers in the broader North American area, AT&T and Verizon, analysing tweets fetched during a 15-day period in February 2013 to assess relative degrees of customer satisfaction. The validity of our approach is justified through comparison against surveys conducted during 2012 by Forrester and Vocalabs on customer satisfaction with the overall brand usage experience.
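    A toy illustration of inferring a satisfaction level from classified tweets; the study uses a supervised machine-learning classifier, whereas the tiny lexicon below is a hypothetical stand-in:

```python
# Hypothetical sentiment lexicon -- a stand-in for a trained classifier.
POS = {"great", "love", "fast", "reliable"}
NEG = {"slow", "dropped", "hate", "terrible"}

def tweet_sentiment(tweet: str) -> int:
    """Return +1 (positive), -1 (negative), or 0 (neutral) from lexicon hits."""
    tokens = tweet.lower().split()
    score = sum(t in POS for t in tokens) - sum(t in NEG for t in tokens)
    return (score > 0) - (score < 0)

def satisfaction_share(tweets):
    """Fraction of positive tweets among the non-neutral ones,
    a simple proxy for overall customer satisfaction."""
    labels = [tweet_sentiment(t) for t in tweets]
    classified = [l for l in labels if l != 0]
    return sum(l == 1 for l in classified) / len(classified) if classified else 0.0
```

Computing this share separately per carrier gives the relative comparison the paper makes between AT&T and Verizon.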

    Developing a Prototype System for Syndromic Surveillance and Visualization Using Social Media Data.

    Syndromic surveillance of emerging diseases is crucial for the timely planning and execution of epidemic responses by both local and global authorities. Traditional sources of information employed by surveillance systems are not only slow but also impractical for developing countries. The Internet and social media provide a free source of large amounts of data which can be utilized for syndromic surveillance. We propose a prototype system for gathering, storing, filtering, and presenting data collected from Twitter (a popular social media platform). Since social media data are inherently noisy, we describe ways to preprocess the gathered data and utilize an SVM (Support Vector Machine) to identify tweets relating to influenza-like symptoms. The filtered data are presented in a web application, which allows the user to explore the underlying data in both spatial and temporal dimensions.
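    A minimal sketch of such an SVM-based tweet filter, using scikit-learn; the training examples and labels below are invented for illustration, and a real system would train on a labelled tweet corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical mini training set (1 = influenza-like, 0 = unrelated).
train_texts = [
    "i have a fever and a sore throat",
    "coughing all night feeling feverish",
    "flu symptoms headache and chills",
    "great concert last night",
    "traffic is terrible this morning",
    "new phone arrived today",
]
train_labels = [1, 1, 1, 0, 0, 0]

# TF-IDF features fed into a linear-kernel SVM.
flu_filter = make_pipeline(TfidfVectorizer(), LinearSVC())
flu_filter.fit(train_texts, train_labels)

def is_flu_related(tweet: str) -> bool:
    """Keep a tweet only if the classifier labels it influenza-like."""
    return bool(flu_filter.predict([tweet])[0])
```

Tweets passing the filter would then be stored with their timestamps and geotags for the spatial and temporal views of the web application.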

    Automatic Hate Speech Detection: A Literature Review

    Hate speech has been an ongoing problem on the Internet for many years. Social media platforms, especially Facebook and Twitter, have given it a global stage on which it can spread far more rapidly. Every social media platform needs an effective hate speech detection system to remove offensive content in real time. There are various approaches to identifying hate speech, including rule-based, machine-learning-based, deep-learning-based, and hybrid approaches. In this review, we survey the work of the many authors who have studied the identification of hate speech using these approaches.

    Topic identification using filtering and rule generation algorithm for textual document

    Information stored digitally in text documents is seldom arranged according to specific topics. The need to read whole documents is time-consuming and discourages searching for information. Most existing topic identification methods depend on the occurrence of terms in the text; however, not all frequently occurring terms are relevant. Moreover, the term extraction phase can produce terms with similar meanings, which is known as the synonymy problem. Filtering and rule generation algorithms are introduced in this study to identify topics in textual documents. The proposed filtering algorithm (PFA) extracts the most relevant terms from text and resolves the synonymy problem among the extracted terms; the rule generation algorithm (TopId) then identifies the topic of each verse based on the extracted terms. PFA processes and filters each sentence based on nouns and predefined keywords to produce suitable terms for the topic, and rules are then generated from the extracted terms using a rule-based classifier. An experiment was performed on 224 English-translated Quran verses related to female issues, and the topics identified by TopId and by the Rough Set technique were compared and later verified by experts. PFA extracted more relevant terms than other filtering techniques, and TopId identified topics closer to the experts' topics, with an accuracy of 70%. The proposed algorithms were able to extract relevant terms without losing important terms and to identify the topic of each verse.
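    The filter-then-rules pipeline can be sketched as below; the keyword set and rules are illustrative placeholders, not PFA's or TopId's actual output:

```python
# Hypothetical sketch of the two stages: term filtering, then rule-based topic assignment.
KEYWORDS = {"marriage", "inheritance", "mother", "daughter"}  # illustrative keyword list

# (required terms, topic) pairs -- illustrative rules, not TopId's actual rules.
RULES = [
    ({"marriage"}, "marriage"),
    ({"inheritance", "daughter"}, "inheritance rights"),
]

def filter_terms(sentence):
    """Keep only predefined keywords (a stand-in for PFA's noun filtering
    and synonym merging)."""
    tokens = {t.strip(".,;").lower() for t in sentence.split()}
    return tokens & KEYWORDS

def identify_topic(sentence):
    """Return the topic of the first rule whose required terms all appear."""
    terms = filter_terms(sentence)
    for required, topic in RULES:
        if required <= terms:
            return topic
    return None
```

In the study itself the rules are generated automatically by a rule-based classifier rather than written by hand as here.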

    On using Twitter to monitor political sentiment and predict election results

    The body of content available on Twitter undoubtedly contains a diverse range of political insight and commentary. But to what extent is this representative of an electorate? Can we model political sentiment effectively enough to capture the voting intentions of a nation during an election campaign? We use the recent Irish General Election as a case study for investigating the potential to model political sentiment through mining of social media. Our approach combines sentiment analysis using supervised learning with volume-based measures, and we evaluate against the conventional election polls and the final election result. We find that social analytics using both volume-based measures and sentiment analysis are predictive, and we make a number of observations related to the task of monitoring public sentiment during an election campaign, including examining a variety of sample sizes and time periods as well as methods for qualitatively exploring the underlying content.
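    A volume-based measure of the kind the abstract mentions can be as simple as mention share; the party labels below are hypothetical:

```python
from collections import Counter

def volume_share(tweet_party_labels):
    """Predict each party's vote share as its share of total tweet volume.
    A deliberately simple volume-based measure; the study combines such
    measures with supervised sentiment analysis."""
    counts = Counter(tweet_party_labels)
    total = sum(counts.values())
    return {party: n / total for party, n in counts.items()}
```

A sentiment-weighted variant would count only tweets classified as positive towards each party before normalizing.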