A HADOOP-Based Framework for Parallel and Distributed Feature Selection

Abstract

In this paper, we introduce a theoretical basis for a Hadoop-based framework for parallel and distributed feature selection. The framework is underpinned by a binary associative-memory neural network, which is highly amenable to parallel and distributed processing and fits the Hadoop paradigm. Many feature selectors are described in the literature, each with its own strengths and weaknesses. We present the implementation details of four feature selection algorithms constructed using our artificial neural network framework embedded in Hadoop MapReduce. Because Hadoop supports parallel and distributed processing, each feature selector can be processed in parallel, and multiple feature selectors can be run together in parallel so that they can be compared. We identify commonalities among the four feature selectors. All four can be processed in the framework using a single representation, and the overall processing can be greatly reduced by processing the common aspects of the feature selectors only once and propagating the results to all four feature selectors as necessary. This allows the best feature selector, and the actual features to select, to be identified for large and high-dimensional data sets by exploiting the efficiency and flexibility of embedding the binary associative-memory neural network in Hadoop.
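As a rough illustration of computing the common aspects once and reusing them across selectors, the following minimal sketch gathers feature/class co-occurrence counts in a single map-and-reduce pass and then derives two feature scores from those shared counts. It rests on several assumptions that are not taken from the paper: plain Python (`map` and `functools.reduce`) stands in for Hadoop MapReduce, a `Counter` stands in for the binary associative-memory neural network, and mutual information and chi-square are chosen only as example selectors. All function names are hypothetical.

```python
from collections import Counter
from functools import reduce
from math import log2


def map_counts(record):
    """Map step (assumed): emit the counts contributed by one (features, label) record."""
    features, label = record
    counts = Counter()
    counts[("class", label)] += 1
    for name, value in features.items():
        counts[("feature", name, value)] += 1
        counts[("joint", name, value, label)] += 1
    return counts


def shared_counts(records):
    """Reduce step (assumed): sum the per-record counts into one shared table."""
    return reduce(lambda a, b: a + b, map(map_counts, records), Counter())


def mutual_information(counts, name, n):
    """Mutual information (in bits) between feature `name` and the class label."""
    mi = 0.0
    for key, joint in counts.items():
        if key[0] == "joint" and key[1] == name:
            _, _, value, label = key
            p_xy = joint / n
            p_x = counts[("feature", name, value)] / n
            p_y = counts[("class", label)] / n
            mi += p_xy * log2(p_xy / (p_x * p_y))
    return mi


def chi_square(counts, name, n):
    """Chi-square statistic between feature `name` and the class label."""
    values = {k[2] for k in counts if k[0] == "feature" and k[1] == name}
    labels = {k[1] for k in counts if k[0] == "class"}
    chi2 = 0.0
    for value in values:
        for label in labels:
            expected = counts[("feature", name, value)] * counts[("class", label)] / n
            observed = counts[("joint", name, value, label)]  # 0 if never seen
            chi2 += (observed - expected) ** 2 / expected
    return chi2


# Toy data: feature "a" determines the class, feature "b" is independent of it.
records = [
    ({"a": 1, "b": 0}, "pos"),
    ({"a": 1, "b": 1}, "pos"),
    ({"a": 0, "b": 0}, "neg"),
    ({"a": 0, "b": 1}, "neg"),
]
counts = shared_counts(records)  # computed once, reused by both selectors
n = len(records)
for feature in ("a", "b"):
    print(feature, mutual_information(counts, feature, n), chi_square(counts, feature, n))
```

Both scores read from the same `counts` table, so the pass over the data happens only once however many selectors are compared; in this toy data both selectors rank feature `a` above feature `b`.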
