11 research outputs found

    Analysis of Twitter Data Using a Multiple-level Clustering Strategy

    Twitter, currently the leading microblogging social network, has attracted a large body of research. This paper proposes a data analysis framework to discover groups of similar Twitter messages posted on a given event. By analyzing these groups, user emotions or thoughts that seem to be associated with specific events can be extracted, as well as aspects characterizing events according to user perception. To deal with the inherent sparseness of micro-messages, the proposed approach relies on a multiple-level strategy that allows clustering text data with a variable distribution. Clusters are then characterized through the most representative words appearing in their messages, and association rules are used to highlight correlations among these words. To measure the relevance of specific words for a given event, text data has been represented in the Vector Space Model using the TF-IDF weighting score. As a case study, two real Twitter datasets have been analyse
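
As a rough illustration of the TF-IDF weighting mentioned in this abstract, the sketch below weights the words of a few toy messages. The abstract does not say which TF-IDF variant was used, so the formula here (term frequency normalized by message length, times log(N/df)) is an assumption, and the function name `tf_idf` and the sample tweets are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized, non-empty document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy messages, already lower-cased and tokenized.
tweets = [
    "great concert tonight".split(),
    "concert tickets sold out".split(),
    "great goal tonight".split(),
]
vectors = tf_idf(tweets)
```

Words that occur in fewer messages (such as "tickets") receive a larger idf factor than words shared by several messages (such as "concert"), which is what makes the weight usable as a relevance score.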

    Data Mining Algorithms for Internet Data: from Transport to Application Layer

    Nowadays we live in a data-driven world. Advances in data generation, collection and storage technology have enabled organizations to gather data sets of massive size. Data mining is a discipline that blends traditional data analysis methods with sophisticated algorithms to handle the challenges posed by these new types of data sets. The Internet is a complex and dynamic system with new protocols and applications that arise at a constant pace. All these characteristics designate the Internet a valuable and challenging data source and application domain for a research activity, both looking at Transport layer, analyzing network traffic flows, and going up to Application layer, focusing on the ever-growing next generation web services: blogs, micro-blogs, on-line social networks, photo sharing services and many other applications (e.g., Twitter, Facebook, Flickr, etc.). In this thesis work we focus on the study, design and development of novel algorithms and frameworks to support large scale data mining activities over huge and heterogeneous data volumes, with a particular focus on Internet data as data source and targeting network traffic classification, on-line social network analysis, recommendation systems and cloud services and Big data

    Comparison of DBSCAN and PCA-DBSCAN Algorithm for Grouping Earthquake Area

    Geologically, the territory of Indonesia is where three active tectonic plates meet; they are always moving and colliding with each other, resulting in earthquakes, volcanic pathways, and faults. An earthquake is a natural disaster that cannot be avoided or prevented, but its consequences can be minimized. Based on data obtained from the Meteorology, Climatology and Geophysics Agency (MCGA), earthquakes often occur in Indonesia. Data obtained from earthquakes can be grouped to map the areas of earthquake occurrence, and an analysis will be carried out to determine the characteristics of earthquake clustering areas. The clustering in this study is conducted in two experiments: the first experiment is Density-Based Spatial Clustering of Applications with Noise (DBSCAN) without dimensional reduction, and the second experiment is DBSCAN clustering with dimensional reduction using Principal Component Analysis (PCA). The best cluster results can be found by calculating the value of the Silhouette Index (SI) of each cluster. Of the two experiments, the highest SI value was obtained in the experiment using PCA, which was 0.4137. The second experiment was therefore taken as the best cluster result, with the highest Depth and Magnitude features in clusters 19 and 17, which showed that the 5 main regions where earthquakes often occur are Sumatra, Banda Sea, Moluccan Sea, Irian Jaya and Sulawesi. Keywords: Climatology and Geophysics Agency, DBSCAN, DBSCAN-PCA, Earthquake Area, PC
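
The Silhouette Index used here to compare the two experiments can be sketched as follows. This is a minimal illustration of the standard definition, not the authors' code: it assumes Euclidean distance, at least two clusters, and that any DBSCAN noise points have already been removed. The function name `silhouette_index` and the toy points are hypothetical.

```python
import math

def silhouette_index(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's zero distance to itself does not change the sum).
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical toy data: two well-separated pairs of points.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_index(points, [0, 0, 1, 1])  # matches the visible groups
bad = silhouette_index(points, [0, 1, 0, 1])   # splits each group
```

Values close to 1 indicate compact, well-separated clusters, while values near or below 0 indicate overlapping or misassigned points, so the labeling with the higher score is preferred.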

    DBSCAN algorithm: twitter text clustering of trend topic pilkada pekanbaru

    Social media, such as Twitter, is one of the most common channels used to communicate. Every tweet on Twitter contains data such as text, which when collected can be processed into information. Data processed from tweets can reveal trends that are useful in fields such as education, economics, politics, etc. This gave rise to the concept of text mining. Text mining techniques are needed to find interesting patterns when searching for trends in Twitter text on topics related to Pilkada Pekanbaru 2017. This research is intended to cluster Twitter text data using the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm. The research was conducted as several experiments using different Eps and MinPts parameters on 2,184 text records that had been through several stages, such as cleaning, duplicate removal, and pre-processing such as stemming and stopword removal. Based on the highest average Silhouette Index, Eps 0.1 and MinPts 10 with SI = 0.413 were chosen as parameters, forming 31 clusters. According to the frequency of word occurrences in the clusters, the most frequent word is "kpu", followed by "firdaus", "kota", "pasang", and "ayat". The candidate pair that appears most often in the cluster results is Firdaus-Ayat, and based on the results of Pilkada 2017, Firdaus-Ayat was chosen as Mayor and Vice Mayor of Pekanbaru
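
To make the role of the Eps and MinPts parameters concrete, here is a minimal DBSCAN sketch. It is an illustration under stated assumptions rather than the study's implementation: it uses Euclidean distance on 2-D toy points (the abstract does not state the distance measure used on the text vectors), counts a point as its own neighbor, and the names `dbscan`, `NOISE`, and the sample data are hypothetical.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    def neighbors(i):
        # All points within eps of point i, including i itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels

# Hypothetical toy data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

A larger Eps or smaller MinPts merges more points into clusters, while the opposite pushes more points into noise, which is why studies of this kind sweep both parameters and keep the setting with the best Silhouette Index.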

    Exploiting clustering algorithms in a multiple-level fashion: A comparative study in the medical care scenario

    Clustering real-world data is a challenging task, since many real-data collections are characterized by an inherent sparseness and variable distribution. An appealing domain that generates such data collections is the medical care scenario where collected data include a large cardinality of patient records and a variety of medical treatments usually adopted for a given disease pathology. This paper proposes a two-phase data mining methodology to iteratively analyze different dataset portions and locally identify groups of objects with common properties. Discovered cohesive clusters are then analyzed using sequential patterns to characterize temporal relationships among data features. To support an automatic classification of new data objects within one of the discovered groups, a classification model is created starting from the computed cluster set. A mobile application has been also designed and developed to visualize and update data under analysis as well as categorizing new unlabeled records. A comparative study has been conducted on real datasets in the medical care scenario using diverse clustering algorithms. Results were compared in terms of cluster quality, execution time, classification performance and discovered sequential patterns. The experimental evaluation showed the effectiveness of MLC to discover interesting knowledge items and to easily exploit them through a mobile application. Results have been also discussed from a medical perspective

    Cluster Analysis of Twitter Data: A Review of Algorithms

    Twitter, a microblogging online social network (OSN), has quickly gained prominence as it provides people with the opportunity to communicate and share posts and topics. Tremendous value lies in automatically analysing and reasoning about such data in order to derive meaningful insights, which carries potential opportunities for businesses, users, and consumers. However, the sheer volume, noise, and dynamism of Twitter impose challenges that hinder the efficacy of observing clusters with high intra-cluster (i.e. minimum variance) and low inter-cluster similarities. This review focuses on research that has used various clustering algorithms to analyse Twitter data streams and identify hidden patterns in tweets where text is highly unstructured. This paper performs a comparative analysis of unsupervised learning approaches in order to determine whether empirical findings support the enhancement of decision support and pattern recognition applications. A review of the literature identified 13 studies that implemented different clustering methods. A comparison including clustering methods, algorithms, number of clusters, dataset(s) size, distance measure, clustering features, evaluation methods, and results was conducted. The conclusion reports that the use of unsupervised learning in mining social media data has several weaknesses. Success criteria and future directions for research and practice to the research community are discussed

    Analyzing Tweets For Predicting Mental Health States Using Data Mining And Machine Learning Algorithms

    Tweets are usually the outcome of people’s feelings on various topics. Twitter allows users to post casual and emotional thoughts to share in real-time. Around 20% of U.S. adults use Twitter. Using the word-frequency and singular value decomposition methods, we identified the behavior of individuals through their tweets. We graded depressive and anti-depressive keywords using the tweet time-series, time-window, and time-stamp methods. We have collected around four million tweets since 2018. A parameter (Depressive Index) is computed using the F1 score and Matthews correlation coefficient (MCC) to indicate the depressive level. A framework showing the Depressive Index and the Happiness Index is prepared with the time, location, and keywords and delivers F1 Score, MCC, and CI values. COVID-19 changed the routines of most people’s lives and affected mental health. We studied the tweets and compared them with the COVID-19 growth. The Happiness Index from our work is compared with the World Happiness Report for Georgia, New York, and Sri Lanka. An interactive framework is prepared to analyze the tweets, depict the Happiness Index, and compare it. Bad words in tweets are analyzed, and a map showing the Happiness Index is computed for all the US states and compared with WalletHub data. We add tweets continuously, and the framework delivers an atlas of maps based on the Happiness Index; we make these maps available for further study. We forecasted tweets with real-time data. Our results of tweets and COVID-19 reports (WHO) are in a similar pattern. A new moving average method was presented; this unique process gave perfect results at peaks of the function and improved the error percentage. An interactive GUI portal computes the Happiness Index, Depression Index, feel-good factors, and prediction of the keywords, and prepares a Happiness Index map. We plan to create a public web portal to facilitate users to get these results. Upon completing the proposed GUI application, the users can get the Happiness Index, Depression Index values, Happiness map, and prediction of keywords of the desired dates and geographical locations instantaneously
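
For readers unfamiliar with the two scores named in this abstract, the sketch below computes the F1 score and the Matthews correlation coefficient from confusion-matrix counts using their standard definitions. The abstract does not explain how the Depressive Index combines them, so no such combination is attempted here; the function names and the example counts are hypothetical.

```python
import math

def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical counts for a binary "depressive keyword" classifier.
tp, tn, fp, fn = 40, 45, 5, 10
f1 = f1_score(tp, fp, fn)
coefficient = mcc(tp, tn, fp, fn)
```

Unlike F1, the MCC also takes true negatives into account and ranges from -1 to 1, which makes it more informative when the two classes are imbalanced.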

    Data Mining Techniques for Complex User-Generated Data

    Get PDF
    Nowadays, the amount of collected information is continuously growing in a variety of different domains. Data mining techniques are powerful instruments to effectively analyze these large data collections and extract hidden and useful knowledge. Vast amount of User-Generated Data (UGD) is being created every day, such as user behavior, user-generated content, user exploitation of available services and user mobility in different domains. Some common critical issues arise for the UGD analysis process such as the large dataset cardinality and dimensionality, the variable data distribution and inherent sparseness, and the heterogeneous data to model the different facets of the targeted domain. Consequently, the extraction of useful knowledge from such data collections is a challenging task, and proper data mining solutions should be devised for the problem under analysis. In this thesis work, we focus on the design and development of innovative solutions to support data mining activities over User-Generated Data characterised by different critical issues, via the integration of different data mining techniques in a unified framework. Real datasets coming from three example domains characterized by the above critical issues are considered as reference cases, i.e., health care, social network, and urban environment domains. Experimental results show the effectiveness of the proposed approaches to discover useful knowledge from different domains

    Mapping the evolving landscape of child-computer interaction research: structures and processes of knowledge (re)production

    Get PDF
    Implementing an iterative sequential mixed methods design (Quantitative → Qualitative → Quantitative) framed within a sociology of knowledge approach to discourse, this study offers an account of the structure of the field of Child-Computer Interaction (CCI), its development over time, and the practices through which researchers have (re)structured knowledge comprising the field. Thematic structure of knowledge within the field, and its evolution over time, is quantified through implementation of a Correlated Topic Model (CTM), an automated inductive content analysis method, in analysing 4,771 CCI research papers published between 2003 and 2021. Detailed understanding of practices through which researchers (re)structure knowledge within the field, including factors influencing these practices, is obtained through thematic analysis of online workshops involving prominent contributors to the field (n=7). Strategic practices utilised by researchers in negotiating tensions impeding integration of novel concepts in the field are investigated through analysis of semantic features of retrieved papers using linear and negative binomial regression models. Contributing an extensive mapping, results portray the field of CCI as a varied research landscape, comprising 48 major themes of study, which has evolved dynamically over time. Research priorities throughout the field have been subject to influence from a range of endogenous and exogenous factors which researchers actively negotiate through research and publication practices. Tacitly structuring research practices, these factors have broadly sustained a technology-driven, novelty-dominated paradigm throughout the field which has failed to substantively progress cumulative knowledge. Through strategic negotiation of persistent tensions arising as a consequence of these factors, researchers have nonetheless effected structural change within the field, contributing to a shift towards a user needs-driven agenda and progression of knowledge therein. Findings demonstrate that the field of CCI is proceeding through an intermediary phase in maturation, forming an increasingly distinct disciplinary shape and identity through the cumulative structuring effect of community members’ continued negotiation of tensions