6 research outputs found

    Multithreaded and Spark parallelization of feature selection filters

    ©2016 Elsevier B.V. All rights reserved. This manuscript version is made available under the CC-BY-NC-ND 4.0 license (https://creativecommons.org/licenses/by-nc-nd/4.0/). This version of the article has been accepted for publication in Journal of Computational Science; the Version of Record is available online at https://doi.org/10.1016/j.jocs.2016.07.002. Final accepted version of: C. Eiras-Franco, V. Bolón-Canedo, S. Ramos, J. González-Domínguez, A. Alonso-Betanzos, and J. Touriño, "Multithreaded and Spark parallelization of feature selection filters", Journal of Computational Science, vol. 17, part 3, Nov. 2016, pp. 609-619.
    [Abstract]: Vast amounts of data are generated every day, constituting a volume that is challenging to analyze. Techniques such as feature selection are advisable when tackling large datasets. Among the tools that provide this functionality, Weka is one of the most popular, although the implementations it provides struggle with large datasets, requiring impractically long processing times. Parallel processing can alleviate this problem, effectively allowing users to work with Big Data. The computational power of multicore machines can be harnessed through multithreading and distributed programming, helping to tackle larger problems. Both techniques can dramatically speed up the feature selection process, allowing users to work with larger datasets. This work focuses on the reimplementation of four popular feature selection algorithms included in Weka: multithreaded implementations previously not available in Weka, as well as parallel Spark implementations, were developed for each algorithm. Experimental results obtained from tests on real-world datasets show that the new versions offer significant reductions in processing times.
    This work has been financed in part by Xunta de Galicia under Research Network R2014/041 and project GRC2014/035, and by the Spanish Ministerio de Economía y Competitividad under projects TIN2012-37954 and TIN2015-65069-C2-1-R, partially funded by FEDER funds of the European Union. V. Bolón-Canedo acknowledges the support of Xunta de Galicia under postdoctoral grant ED481B 2014/164-0. Additionally, the collaboration of Jorge Veiga in setting up and using the MREv tool for Spark execution was essential for this work.
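
    For illustration only, the sketch below shows how a per-feature filter score can be computed in a single distributed pass with Spark, in the spirit of the parallelization described above. It is not the authors' code: the dataset path, the CSV layout and the variance-based score are assumptions, standing in for the Weka filter metrics actually reimplemented in the paper.

        import org.apache.spark.sql.SparkSession

        // Minimal sketch: one distributed pass accumulates per-feature statistics,
        // from which a simple filter score is derived and features are ranked.
        object FilterSketch {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder.appName("fs-filter-sketch").getOrCreate()
            val sc = spark.sparkContext

            // Assumed layout: one CSV line per instance, numeric features first, class label last.
            val rows = sc.textFile("hdfs:///path/to/dataset.csv")
              .map(_.split(",").map(_.toDouble))
              .cache()

            val n = rows.count().toDouble
            val nFeat = rows.first().length - 1

            // Per-feature sums and sums of squares gathered in a single aggregate pass.
            val zero = (Array.fill(nFeat)(0.0), Array.fill(nFeat)(0.0))
            val (sums, sqs) = rows.aggregate(zero)(
              (acc, row) => {
                var j = 0
                while (j < nFeat) { acc._1(j) += row(j); acc._2(j) += row(j) * row(j); j += 1 }
                acc
              },
              (a, b) => {
                var j = 0
                while (j < nFeat) { a._1(j) += b._1(j); a._2(j) += b._2(j); j += 1 }
                a
              }
            )

            // A variance score stands in for the real filter metric; ranking and
            // thresholding would follow the same pattern.
            val scores = (0 until nFeat).map(j => (j, sqs(j) / n - math.pow(sums(j) / n, 2)))
            scores.sortBy(-_._2).take(10).foreach { case (j, s) => println(s"feature $j -> $s") }
            spark.stop()
          }
        }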

    Distributed Correlation-Based Feature Selection in Spark

    CFS (Correlation-Based Feature Selection) is an FS algorithm that has been successfully applied to classification problems in many domains. We describe Distributed CFS (DiCFS), a completely redesigned, scalable, parallel and distributed version of the CFS algorithm, capable of dealing with the large volumes of data typical of big data applications. Two versions of the algorithm were implemented and compared using the Apache Spark cluster computing model, which is currently gaining popularity due to processing times much faster than those of Hadoop's MapReduce model. We tested our algorithms on four publicly available datasets, all with a large number of instances and two also with a large number of features. The results show that our algorithms were superior in terms of both time-efficiency and scalability. By leveraging a computer cluster, they were able to handle larger datasets than the non-distributed WEKA version while maintaining the quality of the results, i.e., exactly the same features were returned by our algorithms as by the original algorithm available in WEKA.
    Comment: 25 pages, 5 figures
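
    For context, the subset evaluation heuristic that standard CFS maximizes (Hall's original formulation, taken from the general CFS literature rather than from this paper) is

        M_S = \frac{k \, \overline{r_{cf}}}{\sqrt{k + k(k-1) \, \overline{r_{ff}}}}

    where k is the number of features in the candidate subset S, \overline{r_{cf}} is the mean feature-class correlation and \overline{r_{ff}} is the mean feature-feature correlation over S. Computing these pairwise correlations on datasets with many instances and features is the costly step that motivates a distributed redesign such as DiCFS.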

    Parallel feature selection for distributed-memory clusters

    This manuscript version is made available under the CC-BY-NC-ND 4.0 license (https://creativecommons.org/licenses/by-nc-nd/4.0/). Final accepted version of: González-Domínguez, J. et al. (2019), "Parallel feature selection for distributed-memory clusters", accepted for publication in Information Sciences, 496, pp. 399-409. The Version of Record is available online at https://doi.org/10.1016/j.ins.2019.01.050.
    [Abstract]: Feature selection is nowadays an extremely important data mining stage in the field of machine learning due to the emergence of high-dimensionality problems. The literature offers numerous feature selection methods, mRMR (minimum-Redundancy-Maximum-Relevance) being one of the most widely used. However, although it achieves good results in selecting relevant features, it is impractical for datasets with thousands of features. A possible solution to this limitation is the fast-mRMR method, a greedy optimization of the mRMR algorithm that improves both scalability and efficiency. In this work we present fast-mRMR-MPI, a novel hybrid parallel implementation that uses MPI and OpenMP to accelerate feature selection on distributed-memory clusters. Our performance evaluation on two different systems using five representative input datasets shows that fast-mRMR-MPI is significantly faster than fast-mRMR while providing the same results. As an example, our tool needs less than one minute to select 200 features of a dataset with more than four million features and 16,000 samples on a cluster with 32 nodes (768 cores in total), while the sequential fast-mRMR required more than eight hours. Moreover, fast-mRMR-MPI distributes data so that it is able to exploit the memory available on different nodes of a cluster and thus complete analyses that fail on a single node due to memory constraints. Our tool is publicly available at https://github.com/borjaf696/Fast-mRMR.
    This research has been partially funded by projects TIN2016-75845-P and TIN2015-65069-C2-1-R of the Ministry of Economy, Industry and Competitiveness of Spain, as well as by Xunta de Galicia projects ED431D R2016/045 and GRC2014/035, all of them partially funded by FEDER funds of the European Union. We gratefully thank CESGA for providing access to the Finis Terrae II supercomputer.
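
    For reference, the greedy step that mRMR (and hence its fast-mRMR optimization) applies when adding a feature to the selected set S is usually written, in terms of mutual information I, as

        \max_{f_j \in F \setminus S} \left[ I(f_j; c) - \frac{1}{|S|} \sum_{f_i \in S} I(f_j; f_i) \right]

    where c is the class label and F the full feature set. This is the standard formulation from the mRMR literature, not a quotation from the article; the cost of repeatedly evaluating these mutual-information terms grows with both the number of features and the number of samples, which is what makes parallelization attractive for datasets of the size reported above.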

    A decentralized metaheuristic approach applied to the FMS scheduling problem

    FMS scheduling has been one of the most popular topics for researchers. A number of approaches have been proposed to schedule FMSs, including simulation techniques and analytical methods. Decentralized metaheuristics divide the population into several subpopulations, aiming to reduce the run time and the number of evaluations thanks to the separation of the search space. Decentralization is a prominent research path in scheduling, as it can reduce the computing cost and find solutions faster without penalizing the objective function. In this project, a decentralized metaheuristic is proposed in the context of a flexible manufacturing system scheduling problem. The main contribution of this project is to analyze other types of search space division, particularly those associated with the physical layout of the FMS. The performance of the decentralized approach will be validated with FMS scheduling benchmarks.
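
    As an illustration only (not the project's actual algorithm), the minimal island-model sketch below shows the structure being described: one population split into subpopulations that evolve independently and occasionally exchange their best individuals. The fitness function, mutation operator and migration policy are placeholders for whatever the FMS scheduling problem actually requires.

        import scala.util.Random

        object IslandSketch {
          type Individual = Vector[Int]                      // a job permutation

          // Hypothetical makespan surrogate; a real FMS evaluation would replace this.
          def fitness(ind: Individual): Double =
            ind.zipWithIndex.map { case (job, pos) => job * (pos + 1) }.sum.toDouble

          // Swap two positions in the permutation.
          def mutate(ind: Individual, rnd: Random): Individual = {
            val i = rnd.nextInt(ind.length); val j = rnd.nextInt(ind.length)
            ind.updated(i, ind(j)).updated(j, ind(i))
          }

          def main(args: Array[String]): Unit = {
            val rnd = new Random(0)
            val nJobs = 20; val nIslands = 4; val islandSize = 25

            // Split one large population into several subpopulations ("islands").
            var islands = Vector.fill(nIslands)(
              Vector.fill(islandSize)(rnd.shuffle((0 until nJobs).toVector))
            )

            for (gen <- 1 to 100) {
              // Each island evolves independently; in a decentralized setting these
              // loops would run concurrently on separate threads or nodes.
              islands = islands.map { pop =>
                val offspring = pop.map(ind => mutate(ind, rnd))
                (pop ++ offspring).sortBy(fitness).take(islandSize)
              }
              // Occasional migration: each island receives its neighbour's best individual.
              if (gen % 10 == 0) {
                val best = islands.map(_.minBy(fitness))
                islands = islands.zipWithIndex.map { case (pop, k) =>
                  (pop :+ best((k + 1) % nIslands)).sortBy(fitness).take(islandSize)
                }
              }
            }
            println("best makespan surrogate: " + islands.flatten.map(fitness).min)
          }
        }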
