A Heuristic Based Pre-processing Methodology for Short Text Similarity Measures in Microblogs
Short text similarity measures have many applications in online social networks (OSNs), where they are integrated into machine learning algorithms. However, data quality is a major challenge in most OSNs, particularly Twitter. The sparse, ambiguous, informal, and unstructured nature of the medium makes it difficult to capture the underlying semantics of the text. Text pre-processing is therefore a crucial phase in similarity identification applications such as clustering and classification, because selecting appropriate data processing methods contributes to an increase in the correlations of the similarity measure. This research proposes a novel heuristic-driven pre-processing methodology for enhancing the performance of similarity measures in the context of Twitter tweets. The components of the proposed pre-processing methodology are discussed and evaluated on an annotated dataset that was published as part of the SemEval-2014 shared task. An experimental analysis was conducted using the cosine angle as a similarity measure to assess the effect of our method against a baseline (C-Method). Experimental results indicate that our approach outperforms the baseline in terms of correlations and error rates.
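As a rough illustration of the kind of pipeline described above, the following minimal sketch cleans two tweets and compares them with the cosine measure over word counts. The cleaning rules (URL, mention, and hashtag-symbol removal) and the tiny stop-word list are assumptions chosen for illustration; they are not the heuristics or the C-Method baseline evaluated in the paper.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "on", "and", "rt"}


def preprocess(tweet):
    """Lowercase, drop URLs and @mentions, strip '#', and remove stop words."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)          # remove user mentions
    text = text.replace("#", " ")              # keep the hashtag word, drop '#'
    tokens = re.findall(r"[a-z0-9']+", text)
    return [t for t in tokens if t not in STOP_WORDS]


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty tweet is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Invented example tweets, not taken from the SemEval-2014 dataset.
    raw_a = "RT @user: The vote is on Thursday! http://t.co/abc #Referendum"
    raw_b = "the referendum vote is on thursday"
    print(cosine_similarity(preprocess(raw_a), preprocess(raw_b)))
```

Without the cleaning step, tokens such as the mention and the URL would count as unshared words and lower the score for two tweets that say the same thing, which is the effect pre-processing is meant to counter.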
An integrated semantic-based framework for intelligent similarity measurement and clustering of microblogging posts
Twitter, the most popular microblogging platform, is gaining rapid prominence as a source of
information sharing and social awareness due to its popularity and massive user-generated
content. Its applications include tailoring advertisement campaigns, event
detection, trend analysis, and prediction of micro-populations. These
applications are generally conducted through cluster analysis of tweets to generate a more
concise and organized representation of the massive volume of raw tweets. However, current
approaches perform traditional cluster analysis using conventional proximity measures, such as
Euclidean distance, and the sheer volume, noise, and dynamism of Twitter impose challenges that
hinder the efficacy of traditional clustering algorithms in detecting meaningful clusters within
microblogging posts. The research presented in this thesis sets out to design and develop a
novel short text semantic similarity (STSS) measure, named TREASURE, which captures the
semantic and structural features of microblogging posts for intelligently predicting the
similarities. TREASURE is utilised in the development of an innovative semantic-based
cluster analysis algorithm (SBCA) that contributes to generating more accurate and
meaningful granularities within microblogging posts. The integrated semantic-based
framework incorporating TREASURE and the SBCA algorithm both tackles the problem of
microblogging cluster analysis and contributes to the success of a variety of natural language
processing (NLP) and computational intelligence research.
TREASURE utilises word embedding neural network (NN) models to capture the semantic
relationships between words based on their co-occurrences in a corpus. Moreover,
TREASURE analyses the morphological and lexical structure of tweets to predict the syntactic
similarities. An intrinsic evaluation of TREASURE was performed with reference to a reliable
similarity benchmark generated through an experiment to gather human ratings on a Twitter
political dataset. A further evaluation was performed with reference to the SemEval-2014
similarity benchmark in order to validate the generalizability of TREASURE. The intrinsic
evaluation and statistical analysis demonstrated a strong positive linear correlation between
TREASURE and human ratings for both benchmarks. Furthermore, TREASURE achieved a
significantly higher correlation coefficient compared to existing state-of-the-art STSS
measures.
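The general idea of embedding-based similarity and its evaluation against human ratings can be sketched as follows. This is not TREASURE: it simply averages word vectors and takes the cosine, and the three-dimensional vectors in `TOY_EMBEDDINGS` are invented for illustration, whereas a real model would be learned from a large tweet corpus. The `pearson` helper shows how a linear correlation between measure outputs and human ratings could be computed.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
TOY_EMBEDDINGS = {
    "vote":       [0.9, 0.1, 0.0],
    "ballot":     [0.8, 0.2, 0.1],
    "referendum": [0.7, 0.3, 0.0],
    "football":   [0.0, 0.1, 0.9],
    "match":      [0.1, 0.0, 0.8],
}


def sentence_vector(tokens, embeddings):
    """Average the vectors of the known tokens; None if no token is known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def embedding_similarity(tokens_a, tokens_b, embeddings=TOY_EMBEDDINGS):
    """Similarity of two token lists via averaged word vectors."""
    va = sentence_vector(tokens_a, embeddings)
    vb = sentence_vector(tokens_b, embeddings)
    if va is None or vb is None:
        return 0.0  # no known words: treat as dissimilar
    return cosine(va, vb)


def pearson(xs, ys):
    """Pearson linear correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)
```

In an intrinsic evaluation of this kind, `pearson` would be applied to the list of similarity scores produced by the measure and the list of averaged human ratings for the same tweet pairs.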
The SBCA algorithm incorporates TREASURE as the proximity measure. Unlike
conventional partition-based clustering algorithms, the SBCA algorithm is fully unsupervised
and dynamically determines the number of clusters rather than requiring it beforehand. Subjective evaluation criteria
were employed to evaluate the SBCA algorithm with reference to the SemEval-2014 similarity
benchmark. Furthermore, an experiment was conducted to produce a reliable multi-class
benchmark on the European Referendum political domain, which was also utilised to evaluate
the SBCA algorithm. The evaluation results provide evidence that the SBCA algorithm
makes highly accurate combining and separation decisions and can generate pure clusters
from microblogging posts.
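To make the contrast with fixed-k partitioning concrete, the sketch below uses a simple single-pass "leader" clustering in which the number of clusters emerges from a similarity threshold. It is a stand-in technique, not the SBCA algorithm; the Jaccard function stands in for the proximity measure, and the threshold value is an assumption. The `purity` helper illustrates one common way to quantify how pure the resulting clusters are against gold labels.

```python
from collections import Counter


def jaccard(tokens_a, tokens_b):
    """Stand-in proximity measure: overlap of the two token sets."""
    sa, sb = set(tokens_a), set(tokens_b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def leader_cluster(items, similarity, threshold):
    """Single-pass clustering; the number of clusters is not fixed in advance.

    Each cluster is a list of item indices whose first element is its leader.
    An item joins the cluster with the most similar leader if that similarity
    reaches the threshold; otherwise it starts a new cluster.
    """
    clusters = []
    for idx, item in enumerate(items):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = similarity(item, items[cluster[0]])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append([idx])
        else:
            best.append(idx)
    return clusters


def purity(clusters, labels):
    """Fraction of items that carry the majority gold label of their cluster."""
    total = sum(len(c) for c in clusters)
    if total == 0:
        return 0.0
    majority = 0
    for cluster in clusters:
        counts = Counter(labels[i] for i in cluster)
        majority += counts.most_common(1)[0][1]
    return majority / total


if __name__ == "__main__":
    # Invented, already-tokenised example posts with invented gold labels.
    posts = [
        ["vote", "leave"],
        ["vote", "leave", "eu"],
        ["football", "match"],
        ["football", "match", "goal"],
    ]
    gold = ["politics", "politics", "sport", "sport"]
    found = leader_cluster(posts, jaccard, threshold=0.5)
    print(found, purity(found, gold))
```

With a threshold of 0.5, the first two posts and the last two posts of this toy input fall into two separate clusters; raising the threshold produces more, smaller clusters, which is the trade-off a threshold-based scheme exposes in place of a preset cluster count.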
The main contributions of this thesis to knowledge are: 1) Development
of a novel STSS measure for microblogging posts (TREASURE). 2) Development of a new
SBCA algorithm that incorporates TREASURE to detect semantic themes in microblogs. 3)
Generation of a pre-trained word embedding model learned from a large corpus of political
tweets. 4) Production of a reliable similarity-annotated benchmark and a reliable multi-class
benchmark in the domain of politics.