Cluster Analysis of Twitter Data: A Review of Algorithms
Twitter, a microblogging online social network (OSN), has quickly gained prominence as it provides people with the opportunity to communicate and share posts and topics. Tremendous value lies in automatically analysing and reasoning about such data to derive meaningful insights, which carries potential opportunities for businesses, users, and consumers. However, the sheer volume, noise, and dynamism of Twitter impose challenges that hinder the efficacy of observing clusters with high intra-cluster similarity (i.e. minimum variance) and low inter-cluster similarity. This review focuses on research that has used various clustering algorithms to analyse Twitter data streams and identify hidden patterns in tweets, where the text is highly unstructured. The paper performs a comparative analysis of unsupervised learning approaches in order to determine whether empirical findings support the enhancement of decision support and pattern recognition applications. A review of the literature identified 13 studies that implemented different clustering methods. A comparison covering clustering methods, algorithms, number of clusters, dataset size, distance measure, clustering features, evaluation methods, and results was conducted. The conclusion reports that the use of unsupervised learning in mining social media data has several weaknesses. Success criteria and future directions for research and practice are discussed.
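To make the cluster-quality criterion above concrete (high intra-cluster similarity, low inter-cluster similarity), a minimal sketch follows; the vectors are invented toy term-frequency data, not taken from any of the reviewed studies:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def mean_intra_similarity(cluster):
    # Average pairwise similarity within one cluster (higher is better).
    pairs = [(i, j) for i in range(len(cluster)) for j in range(i + 1, len(cluster))]
    return sum(cosine(cluster[i], cluster[j]) for i, j in pairs) / len(pairs)

def mean_inter_similarity(c1, c2):
    # Average similarity across two clusters (lower is better).
    sims = [cosine(a, b) for a in c1 for b in c2]
    return sum(sims) / len(sims)

# Toy term-frequency vectors: two tight clusters along different axes.
c1 = [[1.0, 0.1], [0.9, 0.2]]
c2 = [[0.1, 1.0], [0.2, 0.8]]
```

A good clustering would score `mean_intra_similarity` well above `mean_inter_similarity` for every cluster pair.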
USING SOCIALLY SENSED BIG DATA TO MODEL PATTERNS AND GEOGRAPHIC CONTEXT OF HUMAN ACTIVITIES IN CITIES
Understanding dynamic interactions between human activities and land-use structure in a city is a key lens to explore the city as a complex system. This dissertation contributes to understanding the complexity of urban dynamics by gaining knowledge of the interactions between human activities and city land-use structures, utilizing freely accessible socially sensed data sources and building upon recent research trends and technologies in geographical information science, urban studies, and computer science. This dissertation addresses three main questions related to human dynamics: 1) how human activities in an urban environment are shaped by socioeconomic status and the intra-city land-use structure, and how, in turn, the knowledge of socioeconomic status-activity relationships can contribute to understanding the social landscape of a city; 2) how different types of activities are located in space and time in three U.S. cities and how the spatiotemporal activity patterns in these cities characterize the activity profile of different neighborhoods; and 3) how recent socially sensed information on human activities can be integrated with widely used remotely sensed geographical data to create a novel approach for discovering patterns of land use in cities that otherwise lack up-to-date land-use information. This dissertation models the associations between socioeconomics and mobility in the Washington, D.C. metropolitan area as a case study and applies the learned associations to infer geographical patterns of socioeconomic status (SES) solely using the socially sensed data. This dissertation also implements a semi-automated workflow to retrieve activity details from socially sensed Twitter data in Washington, D.C., the City of Baltimore, and New York City.
The dissertation integrates remotely sensed imagery and socially sensed data to model the dynamics associated with changing land-use types in the Washington, D.C.-Baltimore metropolitan area over time.
An integrated semantic-based framework for intelligent similarity measurement and clustering of microblogging posts
Twitter, the most popular microblogging platform, is gaining rapid prominence as a source of
information sharing and social awareness due to its popularity and massive user generated
content. Such content underpins applications including tailored advertisement campaigns, event
detection, trend analysis, and prediction of micro-populations. The aforementioned
applications are generally conducted through cluster analysis of tweets to generate a more
concise and organized representation of the massive raw tweets. However, current approaches
perform traditional cluster analysis using conventional proximity measures, such as Euclidean
distance. Yet the sheer volume, noise, and dynamism of Twitter impose challenges that
hinder the efficacy of traditional clustering algorithms in detecting meaningful clusters within
microblogging posts. The research presented in this thesis sets out to design and develop a
novel short text semantic similarity (STSS) measure, named TREASURE, which captures the
semantic and structural features of microblogging posts for intelligently predicting the
similarities. TREASURE is utilised in the development of an innovative semantic-based
cluster analysis algorithm (SBCA) that contributes to generating more accurate and
meaningful granularities within microblogging posts. The integrated semantic-based
framework incorporating TREASURE and the SBCA algorithm tackles the problem of
microblogging cluster analysis and contributes to the success of a variety of natural language
processing (NLP) and computational intelligence research.
TREASURE utilises word embedding neural network (NN) models to capture the semantic
relationships between words based on their co-occurrences in a corpus. Moreover,
TREASURE analyses the morphological and lexical structure of tweets to predict the syntactic
similarities. An intrinsic evaluation of TREASURE was performed with reference to a reliable
similarity benchmark generated through an experiment to gather human ratings on a Twitter
political dataset. A further evaluation was performed with reference to the SemEval-2014
similarity benchmark in order to validate the generalizability of TREASURE. The intrinsic
evaluation and statistical analysis demonstrated a strong positive linear correlation between
TREASURE and human ratings for both benchmarks. Furthermore, TREASURE achieved a
significantly higher correlation coefficient compared to existing state-of-the-art STSS
measures.
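The intrinsic evaluation described above reduces to correlating a measure's similarity scores with human ratings. A minimal sketch of that check (the score lists are invented for illustration; real evaluations would use the benchmark data):

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient between two score lists.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical measure outputs (0-1) vs. human ratings (0-5 scale)
# for the same five sentence pairs.
measure = [0.9, 0.2, 0.7, 0.4, 0.1]
human   = [4.5, 1.0, 3.8, 2.2, 0.5]
r = pearson(measure, human)
```

A coefficient near 1 indicates the strong positive linear correlation the evaluation reports.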
The SBCA algorithm incorporates TREASURE as the proximity measure. Unlike
conventional partition-based clustering algorithms, the SBCA algorithm is fully unsupervised
and determines the number of clusters dynamically rather than requiring it beforehand. Subjective evaluation criteria
were employed to evaluate the SBCA algorithm with reference to the SemEval-2014 similarity
benchmark. Furthermore, an experiment was conducted to produce a reliable multi-class
benchmark on the European Referendum political domain, which was also utilised to evaluate
the SBCA algorithm. The evaluation results provide evidence that the SBCA algorithm
undertakes highly accurate combining and separation decisions and can generate pure clusters
from microblogging posts.
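The abstract does not detail how SBCA lets the number of clusters emerge from the data, but a common way to achieve that behaviour is threshold-based incremental clustering. A toy sketch under that assumption (Jaccard token overlap stands in for TREASURE; tweets and threshold are invented):

```python
def threshold_cluster(items, sim, threshold=0.5):
    # Greedy single-pass clustering: assign each item to the first cluster
    # whose representative is similar enough, else open a new cluster.
    # The number of clusters is not fixed in advance; it emerges from the data.
    clusters = []
    for item in items:
        for cluster in clusters:
            if sim(item, cluster[0]) >= threshold:
                cluster.append(item)
                break
        else:
            clusters.append([item])
    return clusters

def jaccard(a, b):
    # Cheap token-overlap similarity, standing in for a semantic measure.
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

tweets = [
    "brexit vote results tonight",
    "brexit vote live results",
    "new phone camera review",
    "phone camera review video",
]
clusters = threshold_cluster(tweets, jaccard, threshold=0.4)
```

With the toy data, two themes emerge without the cluster count ever being specified.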
The contributions of this thesis to knowledge are mainly demonstrated as: 1) Development
of a novel STSS measure for microblogging posts (TREASURE). 2) Development of a new
SBCA algorithm that incorporates TREASURE to detect semantic themes in microblogs. 3)
Generating a word embedding pre-trained model learned from a large corpus of political
tweets. 4) Production of a reliable similarity-annotated benchmark and a reliable multi-class
benchmark in the domain of politics.
Tweets on a tree: Index-based clustering of tweets
Computer-mediated communication (CMC) is a type of communication that occurs through the use of two or more electronic devices. With the advancement of technology, CMC has become an increasingly preferred mode of communication between humans. Through computer-mediated technologies, news portals, search engines, and social media platforms such as Facebook, Twitter, Reddit, and many others have been created. On social media platforms, a user can post and discuss his/her own opinion and also read and share other users' opinions. This generates a significant amount of data which, if filtered and analyzed, can give researchers important insights about public opinion and culture. Twitter is a social networking service founded in 2006 that became widespread throughout the world in a very short time frame. The service has more than 310 million monthly active users, and among these users more than 500 million tweets are generated daily as of 2016. Due to the volume, velocity, and variety of Twitter data, it cannot be analyzed using conventional methods. A clustering or sampling method is necessary to reduce the amount of data for analysis. To cluster documents, in a very broad sense, two similarity measures can be used: lexical similarity and semantic similarity. Lexical similarity looks for syntactic similarity between documents. It is usually computationally light to compute lexical similarity; however, for clustering purposes it may not be very accurate as it disregards the semantic value of words. On the other hand, semantic similarity looks for the semantic value of and relations between words to calculate the similarity, and while it is generally more accurate than lexical similarity, it is computationally difficult to calculate. In our work we aim to create a computationally light and accurate clustering of short documents which have the characteristics of big data.
We propose a hybrid clustering approach that combines lexical and semantic similarity. In our approach, we use string similarity to create clusters and semantic vector representations of words to interactively merge clusters.
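The two-stage idea above (cheap lexical clustering first, semantic merging second) can be sketched as follows; the tiny hand-made word vectors, thresholds, and documents are all invented for illustration, not taken from the thesis:

```python
import math

# Tiny hand-made word vectors; a real system would use trained embeddings.
VECS = {
    "soccer": [1.0, 0.1], "football": [0.9, 0.2],
    "piano": [0.1, 1.0], "violin": [0.2, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def lexical_sim(d1, d2):
    # Stage 1 measure: cheap token-overlap (Jaccard) similarity.
    t1, t2 = set(d1.split()), set(d2.split())
    return len(t1 & t2) / len(t1 | t2)

def centroid(cluster):
    # Mean of the word vectors of all known tokens in a cluster.
    vecs = [VECS[w] for doc in cluster for w in doc.split() if w in VECS]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def hybrid_cluster(docs, lex_t=0.5, sem_t=0.9):
    # Stage 1: seed clusters with lexical similarity.
    clusters = []
    for doc in docs:
        for c in clusters:
            if lexical_sim(doc, c[0]) >= lex_t:
                c.append(doc)
                break
        else:
            clusters.append([doc])
    # Stage 2: merge clusters whose semantic centroids are close.
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if cosine(centroid(clusters[i]), centroid(clusters[j])) >= sem_t:
                    clusters[i] += clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters

docs = ["soccer match", "football match", "piano recital", "violin recital"]
clusters = hybrid_cluster(docs)
```

Lexically dissimilar documents ("soccer match", "football match") end up together only because their word vectors are close, which is the payoff of the hybrid design.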
Few are as Good as Many: An Ontology-Based Tweet Spam Detection Approach
Due to the high popularity of Twitter, spammers tend to favor its use in spreading their commercial messages. In the context of detecting Twitter spam, different statistical and behavioral analysis approaches have been proposed. However, these techniques suffer from many limitations due to (1) ongoing changes to Twitter's streaming API which constrain access to a user's list of followers/followees, (2) spammers' creativity in building diverse messages, (3) use of embedded links and new accounts, and (4) the need to analyze different characteristics about users without their consent. To address the aforementioned challenges, we propose a novel ontology-based approach for spam detection over Twitter during events by analyzing the relationship between ham (legitimate) user tweets and spam. Our approach relies solely on public tweet messages while performing the analysis and classification tasks. In this context, ontologies are derived and used to generate a dictionary that validates real tweet messages from random topics. The similarity ratio between the dictionary and tweets is used to reflect the legitimacy of the messages. Experiments conducted on real tweet data illustrate that message-to-message techniques achieved a low detection rate compared to our ontology-based approach, which outperforms them by approximately 200%, in addition to promising scalability for large data analysis.
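The core mechanism described above, scoring tweets by their similarity ratio against an ontology-derived dictionary, can be sketched minimally; the dictionary, threshold, and example tweets are hypothetical stand-ins, not the paper's actual artifacts:

```python
def similarity_ratio(tweet, dictionary):
    # Fraction of a tweet's tokens that appear in the topic dictionary.
    tokens = tweet.lower().split()
    return sum(t in dictionary for t in tokens) / len(tokens)

def is_spam(tweet, dictionary, threshold=0.3):
    # Tweets sharing few terms with the event dictionary are flagged as spam.
    return similarity_ratio(tweet, dictionary) < threshold

# Hypothetical dictionary derived from an event ontology (e.g., an election).
event_dictionary = {"election", "vote", "ballot", "candidate", "polls", "results"}

ham  = "polls close soon cast your vote"   # topical message
spam = "win a free iphone click here now"  # off-topic commercial message
```

Note the approach needs only the public message text, which is what lets it sidestep the API and user-profiling limitations listed above.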
Exploiting clustering algorithms in a multiple-level fashion: A comparative study in the medical care scenario
Clustering real-world data is a challenging task, since many real data collections are characterized by inherent sparseness and variable distribution. An appealing domain that generates such data collections is the medical care scenario, where collected data include a large cardinality of patient records and a variety of medical treatments usually adopted for a given disease pathology.
This paper proposes a two-phase data mining methodology to iteratively analyze different dataset portions and locally identify groups of objects with common properties. Discovered cohesive clusters are then analyzed using sequential patterns to characterize temporal relationships among data features. To support automatic classification of new data objects within one of the discovered groups, a classification model is created starting from the computed cluster set. A mobile application has also been designed and developed to visualize and update the data under analysis, as well as to categorize new unlabeled records.
A comparative study has been conducted on real datasets in the medical care scenario using diverse clustering algorithms. Results were compared in terms of cluster quality, execution time, classification performance, and discovered sequential patterns. The experimental evaluation showed the effectiveness of the multiple-level clustering (MLC) methodology in discovering interesting knowledge items and easily exploiting them through a mobile application. Results have also been discussed from a medical perspective.
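The step of building a classifier from the computed cluster set can be illustrated with a nearest-centroid sketch (the paper does not specify its classifier; the two-feature "patient records" below are invented toy data):

```python
import math

def centroid(points):
    # Component-wise mean of a cluster's feature vectors.
    return [sum(col) / len(points) for col in zip(*points)]

def nearest_cluster(x, centroids):
    # Classify a new record by Euclidean distance to each cluster centroid.
    def dist(a, b):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
    return min(range(len(centroids)), key=lambda i: dist(x, centroids[i]))

# Toy feature vectors grouped into two previously discovered clusters.
cluster_a = [[1.0, 2.0], [1.2, 1.8]]   # e.g., short-treatment records
cluster_b = [[8.0, 9.0], [7.8, 9.2]]   # e.g., long-treatment records
centroids = [centroid(cluster_a), centroid(cluster_b)]

label = nearest_cluster([1.1, 2.1], centroids)  # categorize a new record
```

This is the same categorize-new-unlabeled-records operation the mobile application exposes, reduced to its simplest form.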
How Do People View COVID-19 Vaccines
The COVID-19 pandemic has been the most devastating public health crisis of the recent decade, and vaccination is anticipated as the means to end the pandemic. People's views and feelings about COVID-19 vaccines determine the success of vaccination. This study set out to investigate sentiments and common topics about COVID-19 vaccines by applying machine learning sentiment and topic analyses with natural language processing to massive tweet data. Findings revealed that concern about COVID-19 vaccines grew alongside the introduction and start of vaccination programs. Overall, positive sentiments and emotions outweighed negative ones. Common topics include vaccine development progress, effectiveness, safety, availability, sharing of vaccines received, and updates on the pandemic and government policies. The outcomes offer the public health sector and policymakers a picture of the current atmosphere around the COVID-19 vaccine issue to support better decision-making. Evaluations of the analytical methods were additionally performed.
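The sentiment-analysis step can be reduced to a lexicon-counting sketch; the tiny word lists and tweets below are invented for illustration (published studies typically use richer tools such as VADER or transformer models):

```python
# Tiny hypothetical sentiment lexicon.
POSITIVE = {"effective", "safe", "hope", "great", "protected"}
NEGATIVE = {"fear", "risk", "side", "worried", "unsafe"}

def sentiment(tweet):
    # Score = (#positive tokens) - (#negative tokens); the sign gives the label.
    tokens = tweet.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

tweets = [
    "the vaccine is safe and effective",
    "worried about side effects",
    "got my second dose today",
]
labels = [sentiment(t) for t in tweets]
```

Aggregating such labels over time is what lets a study track how vaccine sentiment shifts as programs roll out.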
NLP and ML Methods for Pre-processing, Clustering and Classification of Technical Logbook Datasets
Technical logbooks are a challenging and under-explored text type in automated event identification. These texts are typically short and written in non-standard yet technical language, posing challenges to off-the-shelf NLP pipelines. These datasets typically represent a domain (a technical field such as automotive) and an application (e.g., maintenance). The granularity of issue types described in these datasets additionally leads to class imbalance, making it challenging for models to accurately predict which issue each logbook entry describes. In this research, we focus on the problem of technical issue pre-processing, clustering, and classification by considering logbook datasets from the automotive, aviation, and facility maintenance domains. We developed MaintNet, a collaborative open source library including logbook datasets from various domains and a pre-processing pipeline to clean unstructured datasets. Additionally, we adapted a feedback loop strategy from computer vision for handling extreme class imbalance, which resamples the training data based on its error in the prediction process. We further investigated the benefits of using transfer learning from sources within the same domain (but different applications), from within the same application (but different domains), and from all available data to improve the performance of the classification models. Finally, we evaluated several data augmentation approaches including synonym replacement, random swap, and random deletion to address the issue of data scarcity in technical logbooks.
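Two of the augmentation approaches named above, random swap and random deletion, are simple enough to sketch directly; the example logbook entry is invented, and real pipelines would tokenize more carefully:

```python
import random

def random_swap(tokens, n=1, rng=random):
    # Swap two randomly chosen token positions, n times.
    tokens = tokens[:]
    for _ in range(n):
        i, j = rng.randrange(len(tokens)), rng.randrange(len(tokens))
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.2, rng=random):
    # Drop each token with probability p, keeping at least one token.
    kept = [t for t in tokens if rng.random() > p]
    return kept or [rng.choice(tokens)]

# Hypothetical maintenance logbook entry.
entry = "hydraulic pump leaking replaced seal".split()
```

Each call yields a perturbed copy of the entry, which is how augmentation multiplies scarce training examples for rare issue classes.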
Linguistic Analysis of Agency in Online Discussions about Postpartum
The research focuses on understanding the relationship between agency and the emotion scores of positivity and negativity, as well as the similarity score for the term depression, within expressions related to postpartum experiences on social media platforms such as Reddit and Twitter. The research project is executed through a series of detailed and structured steps, including data scraping, preprocessing, and subsequent analysis with a range of tools and methods. The results of the project are presented through various graphs and accompanied by qualitative explanations. Overall, this research sheds light on the significance of utilizing data analytics in social media networks to determine the association between agentic language and emotion scores, providing valuable insights.
Geographies of online social interaction: a big data analytics approach to social media platform Sina Weibo
Social media has revolutionized many aspects of people’s social life. However, few studies have utilized massive individual-level data from social media to examine the effects of geography.
In this study, a program was developed to collect and analyze data from Sina Weibo in ten selected Chinese cities. Four geographic concepts, i.e., borders, distance, places, and urban system hierarchy, were chosen to measure geographic effects by investigating the geographical distribution of people's connections and comparing tweet similarity between different cities. The results show that these geographic concepts play an important role in the formation of new online connections and in shaping people's interests. Social media users still tend to establish connections and share more common interests with people who live in the same city or close to them. People who live in first-tier cities have more opportunities to establish connections across the country, and their interests cover a broader range.