144 research outputs found

    Cluster Analysis of Twitter Data: A Review of Algorithms

    Twitter, a microblogging online social network (OSN), has quickly gained prominence as it provides people with the opportunity to communicate and share posts and topics. Tremendous value lies in automated analysis of, and reasoning about, such data in order to derive meaningful insights, which carries potential opportunities for businesses, users, and consumers. However, the sheer volume, noise, and dynamism of Twitter impose challenges that hinder the efficacy of observing clusters with high intra-cluster similarity (i.e. minimum variance) and low inter-cluster similarity. This review focuses on research that has used various clustering algorithms to analyse Twitter data streams and identify hidden patterns in tweets where text is highly unstructured. This paper performs a comparative analysis of unsupervised learning approaches in order to determine whether empirical findings support the enhancement of decision support and pattern recognition applications. A review of the literature identified 13 studies that implemented different clustering methods. A comparison including clustering methods, algorithms, number of clusters, dataset size, distance measure, clustering features, evaluation methods, and results was conducted. The conclusion reports that the use of unsupervised learning in mining social media data has several weaknesses. Success criteria and future directions for research and practice are discussed


    Understanding dynamic interactions between human activities and land-use structure in a city is a key lens to explore the city as a complex system. This dissertation contributes to understanding the complexity of urban dynamics by gaining knowledge of the interactions between human activities and city land-use structures by utilizing freely accessible socially sensed data sources, and building upon recent research trends and technologies in geographical information science, urban study, and computer science. This dissertation addresses three main questions related to human dynamics: 1) how human activities in an urban environment are shaped by socioeconomic status and the intra-city land-use structure, and how, in turn, the knowledge of socioeconomic status-activity relationships can contribute to understanding the social landscape of a city; 2) how different types of activities are located in space and time in three U.S. cities and how the spatiotemporal activity patterns in these cities characterize the activity profile of different neighborhoods in the cities; and 3) how recent socially sensed information on human activities can be integrated with widely-used remotely sensed geographical data to create a novel approach for discovering patterns of land use in cities that are otherwise lacking in up-to-date land use information. This dissertation models the associations between socioeconomics and mobility in the Washington, D.C. metropolitan area as a case study and applies the learned associations for inferring geographical patterns of socioeconomic status (SES) solely using the socially sensed data. This dissertation also implements a semi-automated workflow to retrieve activity details from socially sensed Twitter data in Washington, D.C., the City of Baltimore, and New York City.
The dissertation integrates remotely-sensed imagery and socially sensed data to model the dynamics associated with changing land-use types in the Washington, D.C.-Baltimore metropolitan area over time

    An integrated semantic-based framework for intelligent similarity measurement and clustering of microblogging posts

    Twitter, the most popular microblogging platform, is gaining rapid prominence as a source of information sharing and social awareness due to its popularity and massive user-generated content. Applications include tailoring advertisement campaigns, event detection, trend analysis, and prediction of micro-populations. These applications are generally conducted through cluster analysis of tweets to generate a more concise and organized representation of the massive raw tweets. However, current approaches perform traditional cluster analysis using conventional proximity measures, such as Euclidean distance, and the sheer volume, noise, and dynamism of Twitter impose challenges that hinder the efficacy of traditional clustering algorithms in detecting meaningful clusters within microblogging posts. The research presented in this thesis sets out to design and develop a novel short text semantic similarity (STSS) measure, named TREASURE, which captures the semantic and structural features of microblogging posts for intelligently predicting the similarities. TREASURE is utilised in the development of an innovative semantic-based cluster analysis (SBCA) algorithm that contributes to generating more accurate and meaningful granularities within microblogging posts. The integrated semantic-based framework incorporating TREASURE and the SBCA algorithm tackles the problem of microblogging cluster analysis and contributes to the success of a variety of natural language processing (NLP) and computational intelligence research. TREASURE utilises word embedding neural network (NN) models to capture the semantic relationships between words based on their co-occurrences in a corpus. Moreover, TREASURE analyses the morphological and lexical structure of tweets to predict the syntactic similarities.
An intrinsic evaluation of TREASURE was performed with reference to a reliable similarity benchmark generated through an experiment to gather human ratings on a Twitter political dataset. A further evaluation was performed with reference to the SemEval-2014 similarity benchmark in order to validate the generalizability of TREASURE. The intrinsic evaluation and statistical analysis demonstrated a strong positive linear correlation between TREASURE and human ratings for both benchmarks. Furthermore, TREASURE achieved a significantly higher correlation coefficient compared to existing state-of-the-art STSS measures. The SBCA algorithm incorporates TREASURE as the proximity measure. Unlike conventional partition-based clustering algorithms, the SBCA algorithm is fully unsupervised and dynamically determines the number of clusters instead of requiring it to be specified beforehand. Subjective evaluation criteria were employed to evaluate the SBCA algorithm with reference to the SemEval-2014 similarity benchmark. Furthermore, an experiment was conducted to produce a reliable multi-class benchmark on the European Referendum political domain, which was also utilised to evaluate the SBCA algorithm. The evaluation results provide evidence that the SBCA algorithm undertakes highly accurate combining and separation decisions and can generate pure clusters from microblogging posts. The contributions of this thesis to knowledge are mainly demonstrated as: 1) Development of a novel STSS measure for microblogging posts (TREASURE). 2) Development of a new SBCA algorithm that incorporates TREASURE to detect semantic themes in microblogs. 3) Generation of a pre-trained word embedding model learned from a large corpus of political tweets. 4) Production of a reliable similarity-annotated benchmark and a reliable multi-class benchmark in the domain of politics

    Tweets on a tree: Index-based clustering of tweets

    Computer-mediated communication (CMC) is a type of communication that occurs through the use of two or more electronic devices. With the advancement of technology, CMC has started to become a more preferred type of communication between humans. Through computer-mediated technologies, news portals, search engines and social media platforms such as Facebook, Twitter, Reddit and many other platforms are created. In social media platforms, a user can post and discuss his/her own opinion and also read and share other users' opinions. This generates a significant amount of data which, if filtered and analyzed, can give researchers important insights about public opinion and culture. Twitter is a social networking service that was founded in 2006 and became widespread throughout the world in a very short time frame. As of 2016, the service has more than 310 million monthly active users, who generate more than 500 million tweets daily. Due to the volume, velocity and variety of Twitter data, it cannot be analyzed by using conventional methods. A clustering or sampling method is necessary to reduce the amount of data for analysis. To cluster documents, in a very broad sense two similarity measures can be used: lexical similarity and semantic similarity. Lexical similarity looks for syntactic similarity between documents. It is usually computationally light to compute lexical similarity; however, for clustering purposes it may not be very accurate as it disregards the semantic value of words. On the other hand, semantic similarity looks for semantic value and relations between words to calculate the similarity, and while it is generally more accurate than lexical similarity, it is computationally difficult to calculate. In our work we aim to create computationally light and accurate clustering of short documents which have the characteristics of big data.
We propose a hybrid clustering approach in which lexical and semantic similarity are combined. In our approach, we use string similarity to create clusters and semantic vector representations of words to interactively merge clusters
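
The lexical first stage described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the string similarity measure, the threshold, or the index structure, so the use of Python's `difflib.SequenceMatcher`, the `threshold=0.6` default, the greedy assignment rule, and the sample tweets are all assumptions made for illustration. The semantic merge stage is omitted.

```python
from difflib import SequenceMatcher


def lexical_similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1] (a stand-in string measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def greedy_lexical_clusters(texts, threshold=0.6):
    """Put each text in the first cluster whose representative is similar enough.

    The representative of a cluster is simply its first member (an assumption).
    """
    clusters = []
    for text in texts:
        for cluster in clusters:
            if lexical_similarity(text, cluster[0]) >= threshold:
                cluster.append(text)
                break
        else:
            # No existing cluster matched, so this text starts a new one.
            clusters.append([text])
    return clusters


# Hypothetical sample tweets, not taken from the paper's dataset.
tweets = [
    "Traffic jam on the highway again",
    "traffic jam on the highway again!!",
    "Loving the new phone camera",
]
clusters = greedy_lexical_clusters(tweets)
```

With these inputs the two near-duplicate traffic tweets share a cluster and the third tweet starts its own. A semantic stage would then compare clusters using word vectors and merge those that are close in meaning but lexically different.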

    Few are as Good as Many: An Ontology-Based Tweet Spam Detection Approach

    Due to the high popularity of Twitter, spammers tend to favor its use in spreading their commercial messages. In the context of detecting Twitter spam, different statistical and behavioral analysis approaches have been proposed. However, these techniques suffer from many limitations due to (1) ongoing changes to Twitter’s streaming API, which constrain access to a user’s list of followers/followees, (2) spammers’ creativity in building diverse messages, (3) use of embedded links and new accounts, and (4) the need to analyze different characteristics of users without their consent. To address the aforementioned challenges, we propose a novel ontology-based approach for spam detection over Twitter during events by analyzing the relationship between ham user tweets and spam. Our approach relies solely on public tweet messages while performing the analysis and classification tasks. In this context, ontologies are derived and used to generate a dictionary that validates real tweet messages from random topics. The similarity ratio between the dictionary and tweets is used to reflect the legitimacy of the messages. Experiments conducted on real tweet data illustrate that message-to-message techniques achieved a low detection rate compared to our ontology-based approach, which outperforms them by approximately 200%, in addition to promising scalability for large data analysis
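
One plausible reading of the "similarity ratio" between a dictionary and a tweet is the fraction of tweet tokens that appear in the dictionary. The sketch below shows that reading only. The abstract does not define the ratio, so the formula, the `tokenize` helper, the sample dictionary, and the sample tweets are all assumptions and not the authors' method.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def similarity_ratio(tweet, dictionary):
    """Fraction of tweet tokens found in the dictionary (assumed definition)."""
    tokens = tokenize(tweet)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in dictionary)
    return hits / len(tokens)


# Hypothetical event dictionary; in the paper it is derived from ontologies.
event_dictionary = {"marathon", "runners", "finish", "race", "city"}

ham_score = similarity_ratio(
    "Runners cross the finish line at the city marathon", event_dictionary
)
spam_score = similarity_ratio("Win a free phone now click here", event_dictionary)
```

Under this assumed definition, a tweet that talks about the event scores higher than an unrelated commercial message, and a threshold on the score could separate the two.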

    Exploiting clustering algorithms in a multiple-level fashion: A comparative study in the medical care scenario

    Clustering real-world data is a challenging task, since many real-data collections are characterized by an inherent sparseness and variable distribution. An appealing domain that generates such data collections is the medical care scenario, where collected data include a large cardinality of patient records and a variety of medical treatments usually adopted for a given disease pathology. This paper proposes a two-phase data mining methodology to iteratively analyze different dataset portions and locally identify groups of objects with common properties. Discovered cohesive clusters are then analyzed using sequential patterns to characterize temporal relationships among data features. To support automatic classification of new data objects within one of the discovered groups, a classification model is created starting from the computed cluster set. A mobile application has also been designed and developed to visualize and update data under analysis as well as to categorize new unlabeled records. A comparative study has been conducted on real datasets in the medical care scenario using diverse clustering algorithms. Results were compared in terms of cluster quality, execution time, classification performance and discovered sequential patterns. The experimental evaluation showed the effectiveness of MLC in discovering interesting knowledge items and easily exploiting them through a mobile application. Results have also been discussed from a medical perspective

    How Do People View COVID-19 Vaccines

    The COVID-19 pandemic has been the most devastating public health crisis in the recent decade and vaccination is anticipated as the means to terminate the pandemic. People's views and feelings over COVID-19 vaccines determine the success of vaccination. This study was set to investigate sentiments and common topics about COVID-19 vaccines by machine learning sentiment and topic analyses with natural language processing on massive tweets data. Findings revealed that concern on COVID-19 vaccine grew alongside the introduction and start of vaccination programs. Overall positive sentiments and emotions were greater than negative ones. Common topics include vaccine development for progression, effectiveness, safety, availability, sharing of vaccines received, and updates on pandemics and government policies. Outcomes suggested the current atmosphere and its focus over the COVID-19 vaccine issue for the public health sector and policymakers for better decision-making. Evaluations on analytical methods were performed additionally

    NLP and ML Methods for Pre-processing, Clustering and Classification of Technical Logbook Datasets

    Technical logbooks are a challenging and under-explored text type in automated event identification. These texts are typically short and written in non-standard yet technical language, posing challenges to off-the-shelf NLP pipelines. These datasets typically represent a domain (a technical field such as automotive) and an application (e.g., maintenance). The granularity of issue types described in these datasets additionally leads to class imbalance, making it challenging for models to accurately predict which issue each logbook entry describes. In this research, we focus on the problem of technical issue pre-processing, clustering, and classification by considering logbook datasets from the automotive, aviation, and facility maintenance domains. We developed MaintNet, a collaborative open source library including logbook datasets from various domains and a pre-processing pipeline to clean unstructured datasets. Additionally, we adapted a feedback loop strategy from computer vision for handling extreme class imbalance, which resamples the training data based on its error in the prediction process. We further investigated the benefits of using transfer learning from sources within the same domain (but different applications), from within the same application (but different domains), and from all available data to improve the performance of the classification models. Finally, we evaluated several data augmentation approaches including synonym replacement, random swap, and random deletion to address the issue of data scarcity in technical logbooks
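
Two of the augmentation operations named in this abstract, random swap and random deletion, can be sketched in a few lines. This is an illustration and not the MaintNet code: the function names, the deletion probability `p`, the seeded generator, and the sample logbook entry are assumptions. Synonym replacement is left out because it needs an external lexical resource.

```python
import random


def random_swap(tokens, n_swaps=1, rng=None):
    """Return a copy of tokens with n_swaps random pairs of positions exchanged."""
    rng = rng or random.Random()
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_deletion(tokens, p=0.2, rng=None):
    """Drop each token independently with probability p, keeping at least one."""
    rng = rng or random.Random()
    tokens = list(tokens)
    if not tokens:
        return tokens
    kept = [token for token in tokens if rng.random() >= p]
    # Avoid producing an empty entry: fall back to one randomly chosen token.
    return kept if kept else [rng.choice(tokens)]


# Hypothetical logbook entry, not taken from the MaintNet datasets.
entry = "replaced left brake pad due to wear".split()
rng = random.Random(0)  # seeded so repeated runs give the same augmentations
swapped = random_swap(entry, n_swaps=2, rng=rng)
deleted = random_deletion(entry, p=0.3, rng=rng)
```

Both functions return new lists and leave the original entry unchanged, so several augmented variants can be generated from one logbook entry.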

    Linguistic Analysis of Agency in Online Discussions about Postpartum

    The research focuses on understanding the relationship between agency and the emotion score of being positive and negative, as well as the similarity score for the term depression within expressions related to postpartum on social media platforms, such as Reddit and Twitter. The research project is executed through a series of detailed and structured steps, including data scraping, preprocessing, and subsequent analysis with a range of tools and methods. The results of the project are presented through various graphs and accompanied by qualitative explanations. Overall, this research sheds light on the significance of utilizing data analytics in social media networks to determine the association between agentic language and emotion scores, providing valuable insights

    Geographies of online social interaction: a big data analytics approach to social media platform Sina Weibo

    Social media has revolutionized many aspects of people’s social life. However, few studies have utilized massive individual-level data from social media to examine the effects of geography. In this study, a program was developed to collect and analyze data from Sina Weibo in ten selected Chinese cities. Four geographic concepts, i.e., borders, distance, places, and urban system hierarchy, were chosen to measure the geographic effects by investigating the geographical distribution of people’s connections and comparing tweet similarity between different cities. The results show that these geographic concepts play an important role in the formation of new online connections and in shaping people’s interests. Social media users still tend to establish connections and share more common interests with people who live in the same city or close to them. People who live in first-tier cities have more opportunities to establish connections across the country, and their interests cover a broader range