7 research outputs found

    Large-scale data mining analytics based on MapReduce

    In this work, we investigate approaches to large-scale data mining analytics. We explore existing MapReduce and MapReduce-like frameworks for distributed data processing, along with distributed file systems for distributed data storage, and study the Hadoop Distributed File System (HDFS) and the Hadoop MapReduce software framework in detail. We analyse the benefits of the newer version of the Hadoop framework, YARN, which improves scalability by separating cluster resource management from the MapReduce framework and flexibly supports kinds of distributed data processing other than MapReduce's batch-mode processing. We also review various MapReduce-based implementations of data mining algorithms to form a comprehensive picture of how such algorithms are developed, and survey tools that provide MapReduce-based scalable data mining algorithms. Mahout was the only tool we found that is built specifically on Hadoop MapReduce, but its developers have decided to abandon Hadoop MapReduce and use Apache Spark as the underlying execution engine instead. WEKA also offers a very small subset of data mining algorithms implemented with MapReduce, but it is not properly maintained or supported by the developer team. Subsequently, we found that Apache Spark, apart from providing an optimised, faster execution engine for distributed processing, also ships with an accompanying library of machine learning algorithms, the Machine Learning library (MLlib). Apache Spark claims to be much faster than Hadoop MapReduce because it exploits in-memory computation, which is particularly beneficial for the iterative workloads typical of data mining. Spark is designed to run on a variety of cluster managers, YARN being one of them, and to process Hadoop data. We selected a particular data mining task: classification and regression based on decision tree learning. We stored properly labelled training data for the predictive mining tasks in HDFS, set up a YARN cluster, and ran Spark MLlib applications on this cluster. These applications use the cluster management capabilities of YARN and the distributed execution framework of Spark's core services. We performed several experiments to measure the performance gains, speed-up, and scale-up of the decision tree learning implementations in Spark's MLlib. The results were much better than expected: we achieved higher-than-ideal speed-up as the number of nodes increased, the scale-up was also excellent, and the run-time for training decision tree models decreased significantly as nodes were added. This demonstrates that Spark MLlib's decision tree learning algorithms for classification and regression analysis are highly scalable.
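    The abstract does not include code; the following is a minimal Scala sketch of the kind of Spark MLlib application it describes, using the RDD-based DecisionTree API on labelled data read from HDFS and submitted to a YARN cluster (e.g. spark-submit --master yarn). The HDFS path, the LIBSVM data format, and all training parameters are illustrative assumptions, not taken from the work.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.util.MLUtils

object DecisionTreeOnYarn {
  def main(args: Array[String]): Unit = {
    // The cluster manager (YARN) is chosen at submit time: spark-submit --master yarn
    val sc = new SparkContext(new SparkConf().setAppName("MLlib decision tree sketch"))

    // Labelled training data stored in HDFS; path and LIBSVM format are hypothetical
    val data = MLUtils.loadLibSVMFile(sc, "hdfs:///user/example/labelled-data.libsvm")
    val Array(training, test) = data.randomSplit(Array(0.7, 0.3), seed = 42L)

    // Illustrative hyperparameters for a binary classification tree
    val numClasses = 2
    val categoricalFeaturesInfo = Map[Int, Int]() // all features treated as continuous
    val impurity = "gini"
    val maxDepth = 5
    val maxBins = 32

    val model = DecisionTree.trainClassifier(
      training, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)

    // Simple hold-out evaluation: fraction of misclassified test points
    val testErr = test
      .map(p => (model.predict(p.features), p.label))
      .filter { case (pred, label) => pred != label }
      .count().toDouble / test.count()
    println(s"Test error = $testErr")

    sc.stop()
  }
}
```

    Speed-up and scale-up experiments of the kind described would then time the training step while varying the number of YARN nodes or executors (for example via spark-submit's --num-executors option).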

    Efficient clustering techniques for big data

    Clustering is an essential data mining technique that divides observations into groups, where each group contains similar observations. K-Means is one of the most popular and widely used clustering algorithms and has been in use for over fifty years. The majority of the running time of the original K-Means algorithm (known as Lloyd’s algorithm) is spent computing the distance from each data point to every cluster centre in order to find the centre closest to each point. Given the current exponential growth of data, it has become necessary to improve K-Means further so that it can cope with large-scale datasets, known as Big Data. Hence, the main aim of this thesis is to improve the efficiency and scalability of Lloyd’s K-Means. One of the most effective techniques for accelerating K-Means is the triangle inequality. Implementing such efficient techniques on a reliable distributed model creates a powerful combination that can lead to an efficient and highly scalable parallel version of K-Means, offering a practical solution to the problem of clustering Big Data. MapReduce, and its popular open-source implementation Hadoop, provides a distributed computing framework that efficiently stores, manages, and processes large-scale datasets over a large cluster of commodity machines. Many studies have introduced parallel implementations of Lloyd’s K-Means on Hadoop to improve the algorithm’s scalability. This research examines methods based on the triangle inequality to further improve the efficiency of parallel Lloyd’s K-Means on Hadoop. K-Means variants that use the triangle inequality usually require extra information from the previous iteration, such as distance bounds and cluster assignments, to work efficiently. This is challenging to achieve on Hadoop for two reasons: 1) Hadoop does not directly support iterative algorithms; and 2) Hadoop does not allow information to be exchanged between two consecutive iterations. Hence, two techniques are proposed to give Hadoop the ability to pass information from one iteration to the next. The first uses a data structure referred to as an Extended Vector (EV), which appends the extra information to the original data vector. The second stores the extra information in files, each referred to as a Bounds File (BF). To evaluate the two proposed techniques, two K-Means variants are implemented on Hadoop using them, and each variant is tested against a variable number of clusters, dimensions, data points, and mappers. Furthermore, the performance of various implementations of K-Means on Hadoop and Spark is investigated. The results show a significant improvement in the efficiency of the new implementations compared to Lloyd’s K-Means on Hadoop, on both real and artificial datasets.
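    The abstract names, but does not specify, the Extended Vector (EV) structure or the triangle-inequality pruning it carries between Hadoop iterations. The Scala sketch below shows one plausible shape for such a structure together with a Hamerly-style bound test; the class name, field names, and the specific pruning rule are assumptions made for illustration, not the thesis's actual design.

```scala
// Sketch of the Extended Vector (EV) idea: the original data vector is augmented
// with the previous iteration's cluster assignment and distance bounds, so a
// mapper can skip most distance computations via the triangle inequality.
// All names and the bound rule are illustrative (Hamerly-style), not the thesis's design.
case class ExtendedVector(
    point: Array[Double],  // original data vector
    assignedCentre: Int,   // cluster index from the previous iteration
    upperBound: Double,    // upper bound on distance to the assigned centre
    lowerBound: Double)    // lower bound on distance to any other centre

object TrianglePruning {
  def euclidean(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  /** Reassign a point only when the bounds cannot rule out a closer centre.
    * Assumes at least two centres. */
  def assign(ev: ExtendedVector, centres: Array[Array[Double]]): ExtendedVector = {
    if (ev.upperBound <= ev.lowerBound) {
      // The triangle inequality guarantees the old centre is still closest:
      // skip all k distance computations for this point.
      ev
    } else {
      // Bounds were not tight enough: recompute exact distances and refresh them.
      val dists = centres.map(c => euclidean(ev.point, c))
      val best = dists.zipWithIndex.minBy(_._1)
      val secondBest = dists.zipWithIndex.filter(_._2 != best._2).map(_._1).min
      ev.copy(assignedCentre = best._2, upperBound = best._1, lowerBound = secondBest)
    }
  }
}
```

    In an actual Hadoop job, each mapper would read the EV fields alongside the raw vector in its input records, update them during assignment, and emit them so that the next iteration's mappers can reuse the bounds; the alternative Bounds File (BF) technique would keep the same extra information in separate files instead of appending it to the vectors.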