Improved Density-Based Spatio-Textual Clustering on Social Media
DBSCAN may not be sufficient when the input data type is heterogeneous in
terms of textual description. When we aim to discover clusters of geo-tagged
records relevant to a particular point-of-interest (POI) on social media,
examining only one type of input data (e.g., the tweets relevant to a POI) may
draw an incomplete picture of clusters due to noisy regions. To overcome this
problem, we introduce DBSTexC, a newly defined density-based clustering
algorithm using spatio-textual information. We first characterize POI-relevant
and POI-irrelevant tweets as the texts that include and do not include a POI
name or its semantically coherent variations, respectively. By leveraging the
proportion of POI-relevant and POI-irrelevant tweets, the proposed algorithm
achieves much higher clustering performance than DBSCAN in terms
of the F1 score and its variants. While DBSTexC performs exactly like
DBSCAN on textually homogeneous inputs, it far outperforms DBSCAN on
textually heterogeneous inputs. Furthermore, to improve the
clustering quality by fully capturing the geographic distribution of tweets, we
present fuzzy DBSTexC (F-DBSTexC), an extension of DBSTexC, which incorporates
the notion of fuzzy clustering into DBSTexC. We then demonstrate the
robustness of F-DBSTexC via intensive experiments. The computational complexity
of our algorithms is also shown analytically and numerically.
Comment: 14 pages, 10 figures, 6 tables. Submitted for publication to the IEEE
Transactions on Knowledge and Data Engineering
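
The abstract does not give the exact core-point rule of DBSTexC, so the following is only a minimal Python sketch of the general idea, under stated assumptions. It assumes a tweet counts as POI-relevant when its text contains any of a list of name variants (plain substring matching stands in for "semantically coherent variations"). It also assumes a point is a core point when its eps-neighborhood holds at least `min_relevant` POI-relevant tweets and the number of POI-irrelevant tweets does not exceed `max_irrelevant_ratio` times the relevant count. The function names, parameters, and this specific rule are hypothetical illustrations, not the authors' definitions.

```python
from math import hypot


def is_relevant(text, poi_names):
    """Assumed rule: a tweet is POI-relevant if it mentions any name variant."""
    lowered = text.lower()
    return any(name.lower() in lowered for name in poi_names)


def neighbors(points, i, eps):
    """Indices of all points within Euclidean distance eps of point i (inclusive)."""
    xi, yi = points[i]
    return [j for j, (x, y) in enumerate(points) if hypot(x - xi, y - yi) <= eps]


def cluster(points, relevant, eps, min_relevant, max_irrelevant_ratio):
    """DBSCAN-style expansion with a text-aware core-point test (hypothetical).

    points   : list of (x, y) coordinates
    relevant : list of bools, True if the tweet is POI-relevant
    Returns one label per point: a cluster id (0, 1, ...) or -1 for noise.
    """
    labels = [None] * len(points)  # None means "not visited yet"

    def core_test(i):
        nb = neighbors(points, i, eps)
        n_rel = sum(relevant[j] for j in nb)
        n_irr = len(nb) - n_rel
        # Assumed condition: enough relevant tweets, not too many irrelevant ones.
        ok = n_rel >= min_relevant and n_irr <= max_irrelevant_ratio * n_rel
        return ok, nb

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        is_core, nb = core_test(i)
        if not is_core:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        labels[i] = cluster_id
        queue = list(nb)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_is_core, nb_j = core_test(j)
            if j_is_core:
                queue.extend(nb_j)  # keep expanding through core points only
        cluster_id += 1
    return labels
```

In this sketch, a dense region dominated by POI-irrelevant tweets fails the core-point test and is labeled noise, whereas a purely spatial DBSCAN with a comparable minimum-points setting would report it as a cluster. When every input tweet is POI-relevant, the irrelevant count is always zero and the test reduces to a plain minimum-count rule, which is consistent with the abstract's remark about textually homogeneous inputs.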