    kNN-IS: an iterative spark-based design of the k-nearest neighbors classifier for big data

    The k-Nearest Neighbors classifier is a simple yet effective and widely renowned method in data mining. Applying this model directly in the big data domain is not feasible due to time and memory restrictions. Several distributed alternatives based on MapReduce have been proposed to enable this method to handle large-scale data. However, their performance can be further improved with new designs that fit newly arising technologies. In this work we provide a new solution to perform an exact k-nearest neighbor classification based on Spark. We take advantage of its in-memory operations to classify large amounts of unseen cases against a big training dataset. The map phase computes the k nearest neighbors in different training data splits. Afterwards, multiple reducers process the definitive neighbors from the lists obtained in the map phase. The key point of this proposal lies in the management of the test set, which is kept in memory when possible. Otherwise, it is split into a minimum number of pieces, applying one MapReduce job per chunk and using Spark's caching capabilities to reuse the previously partitioned training set. In our experiments we study the differences between the Hadoop and Spark implementations with datasets of up to 11 million instances, showing the scaling-up capabilities of the proposed approach. As a result of this work, an open-source Spark package is available.
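
    As a rough illustration of the map/reduce pattern described above, the following PySpark sketch computes local top-k candidates per training partition and merges them in a reduce step. All names are illustrative assumptions; this is a sketch of the pattern, not the kNN-IS package API.

    ```python
    # Minimal PySpark sketch of the design described above: each
    # partition of the training RDD emits its local k nearest
    # neighbours for every test point, and a reduce step keeps the
    # k globally nearest. Illustrative only -- not the kNN-IS API.
    import heapq
    from pyspark import SparkContext

    def local_knn(train_partition, test_points, k):
        """For one training split, yield (test_id, local top-k list)."""
        train = list(train_partition)
        for tid, tx in test_points:
            dists = [(sum((a - b) ** 2 for a, b in zip(tx, x)), y)
                     for x, y in train]
            yield tid, heapq.nsmallest(k, dists)

    def merge_topk(a, b, k):
        """Keep only the k globally nearest candidates."""
        return heapq.nsmallest(k, a + b)

    if __name__ == "__main__":
        sc = SparkContext(appName="knn-sketch")
        k = 3
        train_rdd = sc.parallelize(
            [([float(i), float(i)], i % 2) for i in range(1000)], 4)
        # Small test set kept in memory and broadcast to every worker.
        test_bc = sc.broadcast([(0, [10.0, 10.0]), (1, [500.0, 500.0])])

        neighbours = (train_rdd
                      .mapPartitions(lambda p: local_knn(p, test_bc.value, k))
                      .reduceByKey(lambda a, b: merge_topk(a, b, k)))

        # Majority vote over the merged k nearest neighbours.
        for tid, topk in neighbours.collect():
            labels = [y for _, y in topk]
            print(tid, max(set(labels), key=labels.count))
        sc.stop()
    ```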

    Adaptive kNN using Expected Accuracy for Classification of Geo-Spatial Data

    The k-Nearest Neighbor (kNN) classification approach is conceptually simple, yet widely applied since it often performs well in practical applications. However, using a global constant k does not always provide an optimal solution, e.g., for datasets with an irregular density distribution of data points. This paper proposes an adaptive kNN classifier where k is chosen dynamically for each instance (point) to be classified, such that the expected accuracy of classification is maximized. We define the expected accuracy as the accuracy over a set of structurally similar observations. An arbitrary similarity function can be used to find these observations. We introduce and evaluate different similarity functions. For the evaluation, we use five different classification tasks based on geo-spatial data, each consisting of (tens of) thousands of items. We demonstrate that the presented expected-accuracy measures can be a good estimator of kNN performance, and that the proposed adaptive kNN classifier outperforms common kNN and previously introduced adaptive kNN algorithms. We also show that the range of considered k can be significantly reduced to speed up the algorithm without negatively influencing classification accuracy.
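
    A plain-Python sketch of the core idea may help: for each query, estimate the expected accuracy of every candidate k on structurally similar training points, then classify with the best k. The similarity function (plain Euclidean proximity here) and all names are illustrative assumptions, not the paper's exact procedure.

    ```python
    # Sketch of adaptive kNN: pick k per query by maximizing the
    # expected accuracy, estimated on the m training points most
    # similar to the query (leave-one-out). Illustrative only.
    import math
    from collections import Counter

    def knn_label(train, x, k):
        """Majority label among the k nearest training points to x."""
        near = sorted(train, key=lambda p: math.dist(p[0], x))[:k]
        return Counter(lbl for _, lbl in near).most_common(1)[0][0]

    def expected_accuracy(train, x, k, m=10):
        """kNN accuracy on the m training points most similar to x."""
        idx = sorted(range(len(train)),
                     key=lambda i: math.dist(train[i][0], x))[:m]
        hits = 0
        for i in idx:
            held_out = train[:i] + train[i + 1:]   # leave the point out
            if knn_label(held_out, train[i][0], k) == train[i][1]:
                hits += 1
        return hits / len(idx)

    def adaptive_knn(train, x, ks=(1, 3, 5)):
        """Classify x with the k whose expected accuracy is highest."""
        best_k = max(ks, key=lambda k: expected_accuracy(train, x, k))
        return knn_label(train, x, best_k), best_k

    train = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.5, 0.5), "a"),
             ((1.0, 1.0), "b"), ((0.9, 1.1), "b"), ((1.2, 0.8), "b")]
    print(adaptive_knn(train, (0.8, 0.9)))   # -> ('b', <chosen k>)
    ```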

    Parallel classification and optimization of telco trouble ticket dataset

    In the big data age, extracting useful information with traditional machine learning methodology is very challenging. This problem stems from the restricted design of traditional machine learning algorithms, which do not fully support large datasets or distributed processing. The large volume of data available today demands an efficient way of building machine learning classifiers for big data. This research proposes converting a traditional machine learning classifier into a parallel-capable one, with Apache Spark as the primary data processing framework for the research activities. The dataset used in this research is a telco trouble ticket dataset, identified as a large-volume dataset. The study addresses the limitation of classifying such data on a single machine with traditional classifiers such as W-J48; the proposed solution enables the conventional classifier to execute the classification method on big data platforms such as Hadoop. The study's significant contributions are the evaluation of metrics such as accuracy and computation time for both approaches, hyper-parameter tuning, and the improvement of W-J48 classification accuracy on the telco trouble ticket dataset. Additional optimization and estimation techniques, such as grid search and cross-validation, have been incorporated into the study; they improve classification accuracy by 22.62% and reduce classification time by 21.1% in parallel execution inside the big data environment.
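
    A hedged PySpark sketch of the tuning set-up described above follows. MLlib's DecisionTreeClassifier stands in for Weka's W-J48 (a C4.5 implementation), and the input path and column names are hypothetical.

    ```python
    # Grid search with cross-validation over a decision tree on Spark.
    # DecisionTreeClassifier is a stand-in for W-J48; "tickets.parquet",
    # "features" and "label" are hypothetical names.
    from pyspark.sql import SparkSession
    from pyspark.ml.classification import DecisionTreeClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    spark = SparkSession.builder.appName("ticket-tuning").getOrCreate()

    # Assumed schema: a "features" vector column (e.g. built with a
    # VectorAssembler over ticket attributes) and a numeric "label".
    data = spark.read.parquet("tickets.parquet")
    train, test = data.randomSplit([0.8, 0.2], seed=42)

    tree = DecisionTreeClassifier(labelCol="label", featuresCol="features")
    grid = (ParamGridBuilder()
            .addGrid(tree.maxDepth, [5, 10, 20])
            .addGrid(tree.minInstancesPerNode, [1, 5, 10])
            .build())
    evaluator = MulticlassClassificationEvaluator(metricName="accuracy")

    cv = CrossValidator(estimator=tree, estimatorParamMaps=grid,
                        evaluator=evaluator, numFolds=5, parallelism=4)
    model = cv.fit(train)      # candidate models are fit in parallel
    print("test accuracy:", evaluator.evaluate(model.transform(test)))
    spark.stop()
    ```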

    KNN Optimization for Multi-Dimensional Data

    The K-Nearest Neighbors (KNN) algorithm is a simple but powerful technique used in the field of data analytics. It uses a distance metric to identify existing samples in a dataset which are similar to a new sample. The new sample can then be classified via a class majority vote among its most similar samples, i.e. its nearest neighbors. The KNN algorithm can be applied in many fields, such as recommender systems, where it can be used to group related products or predict user preferences. In most cases, the performance of the KNN algorithm suffers as the size of the dataset increases, because the number of comparisons performed grows with it. In this paper, we propose a KNN optimization algorithm which leverages vector space models to enhance the nearest-neighbor search for a new sample. It accomplishes this by restricting the search area, and therefore reducing the number of comparisons necessary to find the nearest neighbors. The experimental results demonstrate significant performance improvements without degrading the algorithm's accuracy. The applicability of this optimization is further explored in the field of Big Data by parallelizing the work using Apache Spark. The experimental results of the Spark implementation demonstrate that it outperforms the serial, or local, implementation once the dataset size reaches a specific threshold, further improving the performance of this optimization in the field of Big Data, where large datasets are prevalent.
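
    The following plain-Python sketch shows one way such a search-area restriction can work: samples are hashed into coarse grid cells, and a query is compared only against samples in its own and adjacent cells. The grid scheme is an illustrative stand-in for the paper's vector space model, not its exact construction.

    ```python
    # Restricting the kNN search area with a coarse grid index: only
    # samples in the query's cell and the neighbouring cells are
    # compared, instead of the whole dataset. Illustrative only.
    import math
    from collections import defaultdict
    from itertools import product

    class GridKNN:
        def __init__(self, cell=1.0):
            self.cell = cell
            self.buckets = defaultdict(list)   # cell coords -> samples

        def _key(self, x):
            return tuple(int(math.floor(v / self.cell)) for v in x)

        def add(self, x, label):
            self.buckets[self._key(x)].append((x, label))

        def query(self, x, k):
            """k nearest among the query's cell and adjacent cells.
            (A full scan would be the natural fallback if too few
            candidates are found nearby.)"""
            key = self._key(x)
            candidates = []
            for offset in product((-1, 0, 1), repeat=len(key)):
                cell = tuple(c + o for c, o in zip(key, offset))
                candidates.extend(self.buckets.get(cell, ()))
            candidates.sort(key=lambda p: math.dist(p[0], x))
            return candidates[:k]

    index = GridKNN(cell=0.5)
    for i in range(1000):
        index.add((i * 0.01, (i * 7 % 100) * 0.01), i % 2)
    print(index.query((0.42, 0.17), k=3))
    ```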

    Data Mining Applications in Big Data

    Data mining is the process of extracting hidden, unknown, but potentially useful information from massive data. Big Data has great impact on scientific discoveries and value creation. This paper introduces methods in data mining and technologies in Big Data. Challenges of data mining in general and of data mining with big data are discussed, and some recent technological progress in both areas is also presented.

    A Nearest Neighbours-Based Algorithm for Big Time Series Data Forecasting

    A forecasting algorithm for big data time series is presented in this work. A nearest-neighbours-based strategy is adopted as the main core of the algorithm. A detailed explanation of how to adapt and implement the algorithm to handle big data is provided. Although some parts remain iterative, and consequently require an enhanced implementation, execution times are considered satisfactory. The performance of the proposed approach has been tested on real-world data related to electricity consumption from a public Spanish university, using a Spark cluster.

    Funding: Ministerio de Economía y Competitividad TIN2014-55894-C2-R; Junta de Andalucía P12-TIC-1728; Centro de Estudios Andaluces PRY153/14; Universidad Pablo de Olavide APPB81309
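
    The nearest-neighbours core of such a forecaster can be sketched in a few lines of plain Python: match the most recent window of the series against all historical windows and predict the next h values as the average of what followed the k closest matches. Window length, k, and the distance are illustrative choices, not the paper's settings.

    ```python
    # Nearest-neighbours forecasting sketch: find the k historical
    # windows closest to the latest one and average their successors.
    import math
    import random

    def nn_forecast(series, w=24, k=3, h=1):
        """Forecast the next h points of `series` from its history."""
        pattern = series[-w:]
        scored = []
        # Candidate windows must leave room for h successor values.
        for start in range(len(series) - w - h + 1):
            window = series[start:start + w]
            scored.append((math.dist(window, pattern), start))
        scored.sort()
        # Average the h values that followed the k closest windows.
        forecast = [0.0] * h
        for _, start in scored[:k]:
            for j in range(h):
                forecast[j] += series[start + w + j] / k
        return forecast

    # Toy example: a noisy daily cycle, forecasting three steps ahead.
    random.seed(0)
    hist = [10 + 5 * math.sin(2 * math.pi * t / 24) + random.gauss(0, 0.3)
            for t in range(24 * 30)]
    print(nn_forecast(hist, w=24, k=5, h=3))
    ```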

    Distributed multi-label learning on Apache Spark

    This thesis proposes a series of multi-label learning algorithms for classification and feature selection implemented on the Apache Spark distributed computing model. Five approaches for determining the optimal architecture to speed up multi-label learning methods are presented, ranging from local parallelization using threads to distributed computing using independent or shared memory spaces. It is shown that the optimal approach performs hundreds of times faster than the baseline method. Three distributed multi-label k-nearest neighbors methods built on top of the Spark architecture are proposed: an exact iterative method that computes pair-wise distances, an approximate tree-based method that indexes the instances across multiple nodes, and an approximate locality-sensitive hashing method that builds multiple hash tables to index the data. The results indicate that the predictions of the tree-based method are on par with those of the exact method while reducing execution times in all scenarios. This method is then used to evaluate the quality of a selected feature subset. The optimal adaptation of a multi-label feature selection criterion is discussed, and two distributed feature selection methods for multi-label problems are proposed: one that selects the feature subset maximizing the Euclidean norm of individual information measures, and one that selects the subset maximizing the geometric mean. The results indicate that each method excels in different scenarios depending on the type of features and the number of labels. Rigorous experimental studies and statistical analyses over many multi-label metrics and datasets confirm that the proposals achieve better performance and scale better to bigger data than state-of-the-art methods.
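
    The voting core of the multi-label k-nearest-neighbours methods the thesis distributes can be sketched as follows; the exact/tree-based/LSH indexing strategies and the Spark distribution layer are out of scope here, and the thresholding rule shown is an illustrative simplification.

    ```python
    # Multi-label kNN voting sketch: predict every label carried by
    # more than half of the query's k nearest training instances.
    import math
    from collections import Counter

    def ml_knn_predict(train, x, k=3, threshold=0.5):
        """train: list of (features, set_of_labels) pairs."""
        nearest = sorted(train, key=lambda p: math.dist(p[0], x))[:k]
        votes = Counter(lbl for _, labels in nearest for lbl in labels)
        return {lbl for lbl, c in votes.items() if c / k > threshold}

    train = [((0.0, 0.0), {"urban", "dense"}),
             ((0.1, 0.1), {"urban"}),
             ((0.9, 1.0), {"rural"}),
             ((1.0, 0.9), {"rural", "farmland"}),
             ((0.2, 0.0), {"urban", "dense"})]
    print(ml_knn_predict(train, (0.05, 0.05)))   # -> {'urban', 'dense'}
    ```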

    Overview of Caching Mechanisms to Improve Hadoop Performance

    In today's distributed computing environments, large amounts of data are generated from different resources at high velocity, making the data difficult to capture, manage, and process within existing relational databases. Hadoop is a tool for storing and processing large datasets in parallel across a cluster of machines in a distributed environment. Hadoop brings many benefits, such as flexibility, scalability, and high fault tolerance; however, it faces challenges in terms of data access time, I/O operations, and duplicate computations, resulting in extra overhead, resource wastage, and poor performance. Many researchers have utilized caching mechanisms to tackle these challenges, presenting approaches to improve data access time, enhance the data locality rate, remove repetitive calculations, reduce the number of I/O operations, decrease job execution time, and increase resource efficiency. In the current study, we provide a comprehensive overview of caching strategies for improving Hadoop performance, and introduce a novel classification based on cache utilization. Using this classification, we analyze the impact on Hadoop performance and discuss the advantages and disadvantages of each group. Finally, a novel hybrid approach called Hybrid Intelligent Cache (HIC), which combines the benefits of two methods from different groups, H-SVM-LRU and CLQLMRS, is presented. Experimental results show that our hybrid method achieves an average improvement of 31.2% in job execution time.
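
    As a point of reference for the cache-replacement side of these schemes, here is a minimal Python sketch of plain LRU eviction, the policy underlying methods such as H-SVM-LRU; the actual approaches combine it with SVM-based block classification and reinforcement-learning-based scheduling, which are not shown.

    ```python
    # Minimal LRU block cache: recently accessed blocks stay cached,
    # the least recently used block is evicted at capacity.
    from collections import OrderedDict

    class LRUBlockCache:
        def __init__(self, capacity):
            self.capacity = capacity
            self.blocks = OrderedDict()      # block_id -> data, LRU order

        def get(self, block_id):
            if block_id not in self.blocks:
                return None                  # miss: caller reads from HDFS
            self.blocks.move_to_end(block_id)  # mark most recently used
            return self.blocks[block_id]

        def put(self, block_id, data):
            if block_id in self.blocks:
                self.blocks.move_to_end(block_id)
            self.blocks[block_id] = data
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)  # evict least recently used

    cache = LRUBlockCache(capacity=2)
    cache.put("blk_1", b"...")
    cache.put("blk_2", b"...")
    cache.get("blk_1")           # touch blk_1 so blk_2 becomes LRU
    cache.put("blk_3", b"...")   # evicts blk_2
    print(cache.get("blk_2"))    # None -> cache miss
    ```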