Scalable Feature Selection Using ReliefF Aided by Locality-Sensitive Hashing

Abstract

Feature selection algorithms, such as ReliefF, are very important for processing high-dimensionality data sets. However, widespread use of popular and effective algorithms of this kind is limited by their computational cost. We describe an adaptation of the ReliefF algorithm that simplifies its costliest step by approximating the nearest neighbor graph using locality-sensitive hashing (LSH). The resulting ReliefF-LSH algorithm can process data sets that are too large for the original ReliefF, a capability further enhanced by a distributed implementation in Apache Spark. Furthermore, ReliefF-LSH obtains better results and is more generally applicable than currently available alternatives to the original ReliefF, as it can handle regression and multiclass data sets. Because it requires no additional hyperparameters with respect to ReliefF, it also avoids costly tuning. A set of experiments demonstrates the validity of this new approach and confirms its good scalability.

This study has been supported in part by the Spanish Ministerio de Economía y Competitividad (projects PID2019-109238GB-C2 and TIN 2015-65069-C2-1-R and 2-R), partially funded by FEDER funds of the EU, and by the Xunta de Galicia (projects ED431C 2018/34 and Centro Singular de Investigación de Galicia, accreditation 2016-2019). The authors wish to thank the Fundación Pública Galega Centro Tecnolóxico de Supercomputación de Galicia (CESGA) for the use of their computing resources. Funding for open access charge: Universidade da Coruña/CISUG.
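The central idea of the abstract, replacing the exact nearest-neighbor search in ReliefF with candidates found by locality-sensitive hashing, can be illustrated with a small sketch. This is not the authors' implementation. It assumes random-hyperplane LSH with a single hash table, a simplified Relief-style update that pools all other classes as misses, and features already scaled to [0, 1]. The names `lsh_buckets` and `relief_lsh_weights` and the parameters `n_planes` and `seed` are hypothetical, and the sketch covers neither the Apache Spark distribution nor the regression case.

```python
import random
from collections import defaultdict


def lsh_buckets(X, n_planes=4, seed=0):
    """Group row indices by the sign pattern of random hyperplane projections."""
    rng = random.Random(seed)
    dim = len(X[0])
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]
    # Center the data so hyperplanes through the origin can split it.
    means = [sum(row[j] for row in X) / len(X) for j in range(dim)]
    buckets = defaultdict(list)
    for i, row in enumerate(X):
        key = tuple(
            sum(p[j] * (row[j] - means[j]) for j in range(dim)) >= 0.0
            for p in planes
        )
        buckets[key].append(i)
    return buckets


def relief_lsh_weights(X, y, k=3, n_planes=4, seed=0):
    """Relief-style feature weights; neighbors are searched only within a bucket."""
    n, dim = len(X), len(X[0])
    weights = [0.0] * dim
    for members in lsh_buckets(X, n_planes, seed).values():
        for i in members:
            def dist(j):
                # Manhattan distance between instance i and candidate j.
                return sum(abs(X[i][f] - X[j][f]) for f in range(dim))

            others = sorted((j for j in members if j != i), key=dist)
            hits = [j for j in others if y[j] == y[i]][:k]
            misses = [j for j in others if y[j] != y[i]][:k]
            if not hits or not misses:
                continue  # bucket too small or too pure to yield an update
            for f in range(dim):
                hit_diff = sum(abs(X[i][f] - X[j][f]) for j in hits) / len(hits)
                miss_diff = sum(abs(X[i][f] - X[j][f]) for j in misses) / len(misses)
                # Reward features that differ across classes, penalize
                # features that differ within a class.
                weights[f] += (miss_diff - hit_diff) / n
    return weights
```

Setting `n_planes=0` places every instance in one bucket, so the sketch falls back to an exact neighbor search. Larger values give smaller buckets and cheaper searches, at the price of missing some true neighbors and skipping instances whose bucket holds only one class.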
