Efficient clustering of massive data with MapReduce
Several decades ago, after the Agrarian Age and the Machine Age, mankind entered the Information Age. Information, and even more importantly knowledge, has become one of the most valuable resources. The usual way to generate knowledge is the analysis of observations, i.e., of raw data, and the more data is available, and the more interconnected it is, the more insights can be gained from it. Over the past decade, the trend of gathering all available information in all areas of life, industry, and science has therefore become overwhelming. Moreover, technological progress in storage and sensor systems has allowed the volume of stored data to grow even faster. As stated by Peter Hirshberg (Global Pulse Summit), the amount of data generated in the year 2011 alone exceeded the amount of data generated over the whole of previous human history. The importance of knowledge extraction led to the formulation of the Knowledge Discovery in Databases process (KDD process) in 1996. The KDD process describes a workflow from raw data gathering through preprocessing and analysis to the final visualization for interpretation.

For decades, the model-driven approach to knowledge extraction was predominant: the gathered data was used to accept or reject a hypothesis formulated by a human expert. The predictive quality of the model therefore depended strongly on the expertise of that specialist, and even good models could miss important aspects of the problem at hand. In recent years, the data-driven approach to knowledge extraction has gained a lot of attention. The idea is to let the data "speak for themselves", i.e., to generate novel models from the given data and to validate them afterwards. As the models are not known in advance, the goal is to find unknown patterns in the data. In the KDD process, this task is usually solved by a group of data mining techniques called unsupervised learning or cluster analysis. However, cluster analysis is often computationally expensive, so efficient techniques for huge amounts of data are indispensable. The usual way to process large amounts of data is to parallelize individual tasks on multi-core machines or in cluster environments.

In this work, the author follows the parallelization approach and presents novel techniques for processing and analyzing huge datasets in the widely used MapReduce framework. MapReduce is a parallelization framework for data-intensive tasks that was proposed by Google Inc. in 2004 and has developed into one of the most prevalent technologies for batch processing of huge amounts of data. More precisely, this thesis deals with two classes of cluster analysis: density-based approaches, in particular the DBSCAN algorithm, and projected clustering techniques, where the P3C algorithm was investigated and further developed for processing huge datasets. As part of the work on density-based approaches, the author proposes efficient MapReduce techniques for the distance-based similarity self-join in vector spaces and for the determination of connected components in huge graphs.
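To make the programming model mentioned above concrete, the following is a minimal, framework-free Python sketch of the map/shuffle/reduce pattern that systems such as Hadoop distribute across a cluster. The word-count task, the function names, and the local in-memory shuffle are illustrative assumptions and are not taken from the thesis.

from collections import defaultdict

def map_phase(record):
    # Map step: emit intermediate (key, value) pairs for one input record;
    # here a toy word count, one pair per word in a line of text.
    for word in record.split():
        yield word.lower(), 1

def reduce_phase(key, values):
    # Reduce step: aggregate all values that share the same key.
    yield key, sum(values)

def run_mapreduce(records, mapper, reducer):
    # Shuffle step: group the mapper output by key, as the framework
    # would do across cluster nodes before invoking the reducers.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [result
            for key in sorted(groups)
            for result in reducer(key, groups[key])]

print(run_mapreduce(["big data needs big clusters",
                     "data mining on big data"],
                    map_phase, reduce_phase))

In a real deployment, the mapper and reducer run on many machines in parallel and the shuffle moves data over the network; this local simulation only illustrates the contract between the two phases.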
MR-DSJ: Distance-Based Self-Join for Large-Scale Vector Data Analysis with MapReduce
Abstract: Data analytics is faced with huge and rapidly increasing amounts of data, for which MapReduce provides a very convenient and effective distributed programming model. Various algorithms already support massive data analysis on computer clusters, but distance-based similarity self-joins in particular lack efficient solutions for large vector data sets, although they are fundamental to many data mining tasks including clustering, near-duplicate detection, and outlier analysis. Our novel distance-based self-join algorithm for MapReduce, MR-DSJ, is based on grid partitioning and delivers correct, complete, and inherently duplicate-free results in a single iteration. Additionally, we propose several filter techniques that reduce the runtime and communication cost of the MR-DSJ algorithm. Analytical and experimental evaluations demonstrate its superiority over other join algorithms for MapReduce.
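As a rough single-machine illustration of the grid-partitioning idea behind such a distance-based self-join, the Python sketch below assigns each point to a grid cell of width EPS, replicates it into lexicographically larger neighboring cells, and lets each reducer report every qualifying pair exactly once. The cell width, the replication and duplicate-avoidance rule, the Euclidean distance, and all names (EPS, map_point, reduce_cell) are illustrative assumptions, not the exact MR-DSJ scheme or its filter techniques.

from collections import defaultdict
from itertools import product
from math import dist

EPS = 1.0  # join threshold; here also used as the grid cell width (assumed)

def cell_of(point):
    # Home cell of a point: integer grid coordinates at resolution EPS.
    return tuple(int(c // EPS) for c in point)

def map_point(point):
    # Map step: emit the point to its home cell and replicate it to every
    # adjacent cell with a lexicographically larger coordinate, so that each
    # cross-cell pair is examined by exactly one reducer (duplicate-free).
    home = cell_of(point)
    yield home, ("home", point)
    for offset in product((-1, 0, 1), repeat=len(home)):
        neighbor = tuple(h + o for h, o in zip(home, offset))
        if neighbor > home:
            yield neighbor, ("copy", point)

def reduce_cell(tagged_points):
    # Reduce step: join home points with each other and with replicas
    # received from neighboring cells; report pairs within distance EPS.
    homes = [p for tag, p in tagged_points if tag == "home"]
    copies = [p for tag, p in tagged_points if tag == "copy"]
    for i, p in enumerate(homes):
        for q in homes[i + 1:] + copies:
            if dist(p, q) <= EPS:
                yield p, q

points = [(0.2, 0.3), (0.9, 0.4), (1.4, 0.5), (3.0, 3.0)]
cells = defaultdict(list)                 # shuffle: group records by cell key
for point in points:
    for key, value in map_point(point):
        cells[key].append(value)
for tagged in cells.values():
    for pair in reduce_cell(tagged):
        print(pair)

Because any two points within distance EPS fall into the same or adjacent grid cells, restricting comparisons to a cell's own points plus replicas from its smaller neighbors keeps the result complete while avoiding both duplicate pairs and most of the quadratic comparison cost.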