Parallel Hierarchical Affinity Propagation with MapReduce
The accelerated evolution and explosion of the Internet and social media is
generating voluminous quantities of data (on zettabyte scales). Paramount
amongst the desires to manipulate and extract actionable intelligence from vast
big data volumes is the need for scalable, performance-conscious analytics
algorithms. To directly address this need, we propose a novel MapReduce
implementation of the exemplar-based clustering algorithm known as Affinity
Propagation. Our parallelization strategy extends to the multilevel
Hierarchical Affinity Propagation algorithm and enables tiered aggregation of
unstructured data with minimal free parameters, in principle requiring only a
similarity measure between data points. We detail the linear run-time
complexity of our approach, overcoming the limiting quadratic complexity of the
original algorithm. Experimental validation of our clustering methodology on a
variety of synthetic and real data sets (e.g. images and point data)
demonstrates our competitiveness against other state-of-the-art MapReduce
clustering techniques.
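For orientation, the message passing behind standard (serial) Affinity Propagation can be sketched as below. This is not the MapReduce parallelization the abstract proposes, only the usual responsibility/availability updates on a dense similarity matrix, which is the quadratic-cost baseline the abstract refers to. The function name, the damping factor, the iteration count, and the preference value in the usage note are assumptions made for illustration.

```python
import numpy as np

def affinity_propagation(S, damping=0.5, iterations=200):
    """Serial Affinity Propagation sketch (hypothetical helper, not the paper's code).

    S is an (n, n) similarity matrix whose diagonal holds the preferences.
    Returns (exemplar indices, cluster label per point).
    """
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibilities r(i, k)
    A = np.zeros((n, n))  # availabilities a(i, k)
    rows = np.arange(n)
    for _ in range(iterations):
        # r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first = AS[rows, idx]
        AS[rows, idx] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[rows, idx] = S[rows, idx] - second
        R = damping * R + (1 - damping) * R_new

        # a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        # a(k,k) = sum_{i' != k} max(0, r(i',k))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new.diagonal().copy()
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, diag)
        A = damping * A + (1 - damping) * A_new

    # Points whose self-responsibility plus self-availability is positive are exemplars.
    exemplars = np.flatnonzero((A + R).diagonal() > 0)
    if exemplars.size == 0:
        return exemplars, np.full(n, -1)  # no exemplar emerged
    labels = np.argmax(S[:, exemplars], axis=1)
    labels[exemplars] = np.arange(exemplars.size)
    return exemplars, labels
```

As a usage example, with six one-dimensional points `x = [0, 1, 2, 10, 11, 12]`, similarities `S = -(x_i - x_j)^2`, and an assumed preference of -10 on the diagonal, the sketch groups the first three and last three points into two clusters.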
Monitoring conceptual development with text mining technologies: CONSPECT
This paper evaluates CONSPECT, a service that analyses states in a learner’s conceptual development. It combines two technologies, Latent Semantic Analysis (LSA) to analyse text and Network Analysis (NA) to provide visualisations, into a technique called Meaningful Interaction Analysis (MIA). CONSPECT was designed to help both online learners and their tutors monitor the learners’ conceptual development. This paper reports on the validation experiments undertaken to determine how well LSA matches first-year medical students in clustering concepts and in annotating text. The validation used several techniques, including card sorting and Likert scales. CONSPECT produces almost ‘peer’ quality results; what remains to be tested is whether it improves with more advanced learners. One of the experiments showed an average correlation of 0.7 between humans and CONSPECT.
Visualization and clustering for SNMP intrusion detection
Accurate intrusion detection is still an open challenge. The present work takes one step toward that goal by studying the combination of clustering and visualization techniques. To that end, the mobile visualization connectionist agent-based intrusion detection system (MOVICAB-IDS), previously proposed as a hybrid intelligent IDS based on visualization techniques, is upgraded by adding automatic response through clustering methods. To check the validity of the proposed clustering extension, it has been applied to the identification of different anomalous situations related to the Simple Network Management Protocol (SNMP) using real-life data sets. Different ways of applying neural projection and clustering techniques are studied in the present article. The experimental validation shows that the proposed techniques could be compatible and consequently applied to a continuous network flow for intrusion detection. Spanish Ministry of Economy and Competitiveness with ref: TIN2010-21272-C02-01 (funded by the European Regional Development Fund) and SA405A12-2 from Junta de Castilla y Leon.
Deep Learning vs Spectral Clustering into an active clustering with pairwise constraints propagation
In our data-driven world, categorization is of major importance in helping end-users and decision makers understand information structures. Supervised learning techniques rely on annotated samples that are often difficult to obtain, and training often overfits. On the other hand, unsupervised clustering techniques study the structure of the data without any training data at their disposal. Given the difficulty of the task, supervised learning often outperforms unsupervised learning. A compromise is to use partial knowledge, selected in a smart way, in order to boost performance while minimizing learning costs, which is called semi-supervised learning. In such a use case, Spectral Clustering has proved to be an efficient method. Deep Learning has also outperformed several state-of-the-art classification approaches, and it is interesting to test it in our context. In this paper, we first introduce the concept of Deep Learning into an active semi-supervised clustering process and compare it with Spectral Clustering. Second, we introduce constraint propagation and demonstrate how it maximizes partitioning quality while reducing annotation costs. Experimental validation is conducted on two different real datasets. Results show the potential of the clustering methods.
Fuzzy and non-fuzzy approaches for digital image classification
This paper classifies different digital images using two types of clustering algorithms: fuzzy clustering methods and non-fuzzy methods. For the performance comparisons, we apply four clustering algorithms, two of the fuzzy type and two of the non-fuzzy (partitional) type. The automatic partitional clustering algorithm and the partitional k-means algorithm are chosen as the two examples of non-fuzzy clustering techniques, while the automatic fuzzy algorithm and the fuzzy C-means clustering algorithm are taken as the examples of fuzzy clustering techniques. The four algorithms are evaluated by applying them to three different types of image databases, based on the following comparison criteria: dataset size, cluster number, execution time, classification accuracy, and k-fold cross-validation. The experimental results demonstrate that the non-fuzzy algorithms have higher accuracies than the fuzzy algorithms, especially when dealing with large data sizes and different types of images. Three types of image databases (human face images, handwritten digits, and natural scenes) are used for the performance evaluation.
Analysis of FMRI Exams Through Unsupervised Learning and Evaluation Index
In the last few years, the clustering of time series has seen significant growth and has proven effective in
providing useful information in various domains of use. This growing interest in time series clustering is the
result of the effort made by the scientific community in the context of time data mining.
For these reasons, the first phase of the thesis focused on the study of the data obtained from fMRI exams
carried out in task-based and resting state mode, using and comparing different clustering algorithms: Self-Organizing Map (SOM), Growing Neural Gas (GNG) and Neural Gas (NG), which are crisp-type
algorithms. A fuzzy algorithm, Fuzzy C-Means (FCM), was also used. The evaluation of the results
obtained by using clustering algorithms was carried out using the Davies Bouldin evaluation index (DBI or
DB index).
Clustering evaluation is the second topic of this thesis. To evaluate the validity of the clustering, there are
specific techniques, but none of these is yet well established for the study of fMRI exams. Furthermore,
the evaluation of evaluation techniques is still an open research field. Eight clustering validation indexes
(CVIs) applied to fMRI data clustering will be analysed. The validation indices that have been used are
Pakhira Bandyopadhyay Maulik Index (crisp and fuzzy), Fukuyama Sugeno Index, Rezaee Lelieveldt Reiber
Index, Wang Sun Jiang Index, Xie Beni Index, Davies Bouldin Index, Soft Davies Bouldin Index. Furthermore,
an evaluation of the evaluation indices will be carried out, which will take into account the sub-optimal
performance obtained by the indices, through the introduction of new metrics. Finally, a new methodology
for the evaluation of CVIs will be introduced, which will use an ANFIS model.
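The Davies Bouldin index mentioned in the abstract above can be illustrated with a short sketch. It follows the standard textbook definition (average over clusters of the worst ratio of summed within-cluster scatter to centroid separation), assuming Euclidean distance and crisp labels; it is not the thesis's own implementation, and the function name is hypothetical. Lower values indicate more compact, better separated clusters.

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index sketch (hypothetical helper; Euclidean distance assumed).

    X is an (n, d) array of points, labels an array of n crisp cluster labels.
    Assumes at least two clusters with distinct centroids.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    # Centroid of each cluster.
    centroids = np.array([X[labels == c].mean(axis=0) for c in ids])
    # Scatter: mean distance of a cluster's points to its centroid.
    scatter = np.array([
        np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(ids)
    ])
    k = len(ids)
    worst = np.zeros(k)
    for i in range(k):
        for j in range(k):
            if i != j:
                separation = np.linalg.norm(centroids[i] - centroids[j])
                worst[i] = max(worst[i], (scatter[i] + scatter[j]) / separation)
    # Average of each cluster's worst-case similarity to another cluster.
    return worst.mean()
```

For example, two clusters {(0, 0), (0, 2)} and {(10, 0), (10, 2)} each have scatter 1 and centroids 10 apart, so the index is (1 + 1) / 10 = 0.2.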