Multiple Relevant Feature Ensemble Selection Based on Multilayer Co-Evolutionary Consensus MapReduce
Although feature selection for large data has been intensively investigated in data mining, machine learning, and pattern recognition, the challenge is not just to invent new algorithms to handle noisy and uncertain large data in applications, but rather to link multiple relevant feature sources, structured or unstructured, into an effective feature reduction method. In this paper, we propose a multiple relevant feature ensemble selection (MRFES) algorithm based on multilayer co-evolutionary consensus MapReduce (MCCM). We construct an effective MCCM model to handle feature ensemble selection on large-scale datasets with multiple relevant feature sources, and explore the unified consistency aggregation between the local solutions and the global dominance solutions achieved by the co-evolutionary memeplexes that participate in the cooperative feature ensemble selection process. The model attempts to reach a mutual decision agreement among the co-evolutionary memeplexes, which calls for mechanisms to detect noncooperative co-evolutionary behaviors and achieve better Nash equilibrium solutions. Extensive experimental comparisons on well-known benchmark datasets substantiate the effectiveness of MRFES on large-scale problems with complex noise and multiple relevant feature sources. The algorithm greatly facilitates the selection of relevant feature subsets from the original feature space, with better accuracy, efficiency, and interpretability. Moreover, we apply MRFES to human cerebral cortex-based classification prediction. Such applications are expected to significantly scale up classification prediction for large-scale and complex brain data in terms of efficiency and feasibility.
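The consensus step at the heart of such ensemble selection — local selectors each vote on features, and a reduce phase keeps only features with broad agreement — can be sketched in a minimal MapReduce style. This is an illustrative simplification under our own assumptions (the names and the quorum rule are ours), not the MCCM model itself:

```python
from collections import Counter

def map_phase(feature_subsets):
    """Each local selector emits votes for the features it
    considers relevant on its own data partition."""
    for subset in feature_subsets:
        for feature in subset:
            yield feature, 1

def reduce_phase(votes, n_sources, quorum=0.5):
    """Keep features selected by at least a quorum of the local
    sources, approximating a consensus among the selectors."""
    counts = Counter()
    for feature, v in votes:
        counts[feature] += v
    return sorted(f for f, c in counts.items() if c / n_sources >= quorum)

# Three hypothetical local selectors, each run on a different feature source
local_selections = [["age", "bmi", "bp"], ["age", "bp"], ["age", "glucose", "bp"]]
consensus = reduce_phase(map_phase(local_selections), n_sources=3)
print(consensus)  # ['age', 'bp']
```

Only "age" and "bp" clear the 50% quorum; features backed by a single source are dropped, which is the intuition behind aggregating local and global solutions.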
From fuzzy-rough to crisp feature selection
A central problem in machine learning and pattern recognition is the process of
recognizing the most important features in a dataset. This process plays a decisive
role in big data processing by reducing the size of datasets. One major drawback of
existing feature selection methods is the high chance of redundant features appearing
in the final subset; in most cases, finding and removing them can greatly
improve the resulting classification accuracy. To tackle this problem on two different
fronts, we employed fuzzy-rough sets and perturbation theory. On one side, we used
three strategies to improve the performance of fuzzy-rough set-based feature selection
methods. The first strategy was to code both features and samples in one binary
vector and use a shuffled frog leaping algorithm to choose the best combination using
fuzzy dependency degree as the fitness function. In the second strategy, we designed
a measure to evaluate features based on fuzzy-rough dependency degree in a fashion
where redundant features are given less priority to be selected. In the last strategy,
we designed a new binary version of the shuffled frog leaping algorithm that employs a
fuzzy positive region as its similarity measure to work in complete harmony with the
fitness function (i.e. fuzzy-rough dependency degree). To extend the applicability of
fuzzy-rough set-based feature selection to multi-party medical datasets, we designed
a privacy-preserving version of the original method. In addition, we studied the
feasibility and applicability of perturbation theory to feature selection, which to the
best of our knowledge had not been researched before. We introduced a new feature
selection method based on perturbation theory that is not only capable of detecting
and discarding redundant features but is also fast and flexible in accommodating
the special needs of an application. It employs a clustering algorithm to group
similarly behaving features based on each feature's sensitivity to perturbation,
its angle to the outcome, and the effect of removing it on the outcome; it then
chooses the feature closest to the centre of each cluster and returns those
features as the final subset. To assess the effectiveness of the proposed methods,
we compared the results of each method with those of well-known feature selection
methods on a series of artificially generated datasets and on biological, medical,
and cancer datasets adopted from the University of California Irvine machine
learning repository, the Arizona State University repository, and the Gene
Expression Omnibus repository.
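The fuzzy-rough dependency degree that serves as the fitness function above can be sketched as follows, using a common textbook formulation: per-feature fuzzy similarity, min-aggregation over the candidate subset, a fuzzy lower approximation, and the mean fuzzy positive-region membership. The toy data and function names are ours:

```python
def similarity(a, b, rng):
    """Per-feature fuzzy similarity: 1 at equality, 0 at the full range."""
    return max(0.0, 1.0 - abs(a - b) / rng) if rng else 1.0

def dependency_degree(X, y, subset):
    """Fuzzy-rough dependency of decision y on the feature subset:
    mean membership of each sample in the fuzzy positive region."""
    n = len(X)
    ranges = {f: max(x[f] for x in X) - min(x[f] for x in X) for f in subset}
    total = 0.0
    for i in range(n):
        mu = 1.0  # lower-approximation membership of sample i's own class
        for j in range(n):
            if y[i] == y[j]:
                continue  # same class: the implication holds trivially
            sim = min(similarity(X[i][f], X[j][f], ranges[f]) for f in subset)
            mu = min(mu, 1.0 - sim)  # differing classes must be separable
        total += mu
    return total / n

X = [{"a": 0.0, "b": 1.0}, {"a": 0.2, "b": 0.9}, {"a": 1.0, "b": 0.1}]
y = [0, 0, 1]
print(dependency_degree(X, y, ["a"]))  # higher when 'a' separates the classes
```

A subset scores 1.0 when every pair of samples from different classes is fully distinguishable under it, which is why the measure works as a fitness function for a search algorithm such as shuffled frog leaping.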
A recurrent neural network for urban long-term traffic flow forecasting
This paper investigates the use of a recurrent neural network to predict urban long-term traffic flows. A representation of the long-term flows with related weather and contextual information is first introduced. A recurrent neural network approach, named RNN-LF, is then proposed to predict long-term flows from multiple data sources. Moreover, a parallel GPU implementation of the proposed solution is developed (GRNN-LF), which boosts the performance of RNN-LF. Several experiments have been carried out on real traffic flows from a small city (Odense, Denmark) and a very large city (Beijing). The results reveal that the sequential version (RNN-LF) is capable of dealing effectively with the traffic of small cities. They also confirm the scalability of GRNN-LF compared with the most competitive GPU-based software tools when dealing with big traffic flows such as the Beijing urban data.
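The core recurrence behind such a forecaster — a hidden state updated from the previous state and the current flow observation, with the prediction read off the final state — can be written in a few lines. This is a generic Elman-style cell with made-up scalar weights, not the RNN-LF architecture or its GPU implementation:

```python
import math

def rnn_step(h, x, w_h, w_x, b):
    """One Elman-style update: h' = tanh(w_h*h + w_x*x + b).
    Scalars here for clarity; real models use weight matrices."""
    return math.tanh(w_h * h + w_x * x + b)

def forecast(flows, w_h=0.5, w_x=1.0, b=0.0, w_out=1.0):
    """Run the recurrence over a window of past flow values and
    read the next-step prediction off the final hidden state."""
    h = 0.0
    for x in flows:
        h = rnn_step(h, x, w_h, w_x, b)
    return w_out * h

# Hypothetical normalised hourly flow window for one road segment
window = [0.2, 0.4, 0.7, 0.6]
print(forecast(window))
```

In practice the weights are learned by backpropagation through time, and the input at each step would also carry the weather and contextual features the paper mentions.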
A Robust Multilabel Method Integrating Rule-based Transparent Model, Soft Label Correlation Learning and Label Noise Resistance
Model transparency, label correlation learning and robustness to label
noise are crucial for multilabel learning. However, few existing methods study
these three characteristics simultaneously. To address this challenge, we
propose the robust multilabel Takagi-Sugeno-Kang fuzzy system (R-MLTSK-FS) with
three mechanisms. First, we design a soft label learning mechanism to reduce
the effect of label noise by explicitly measuring the interactions between
labels, which is also the basis of the other two mechanisms. Second, the
rule-based TSK FS is used as the base model to efficiently model the inference
relationship between features and soft labels in a more transparent way than
many existing multilabel models. Third, to further improve the performance of
multilabel learning, we build a correlation enhancement learning mechanism
based on the soft label space and the fuzzy feature space. Extensive
experiments are conducted to demonstrate the superiority of the proposed
method.
Comment: This paper has been accepted by IEEE Transactions on Fuzzy Systems.
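The soft-label idea — replacing each hard label with a blend of itself and evidence propagated from correlated labels, so that a single noisy label is damped by its neighbours — can be sketched as below. The co-occurrence-based correlation and the blending weight alpha are our illustrative choices, not the R-MLTSK-FS formulation:

```python
def label_correlation(Y):
    """Pairwise label correlation from co-occurrence counts:
    corr[j][k] = P(label k | label j) over the training set."""
    q = len(Y[0])
    corr = [[0.0] * q for _ in range(q)]
    for j in range(q):
        support = sum(row[j] for row in Y)
        for k in range(q):
            both = sum(row[j] * row[k] for row in Y)
            corr[j][k] = both / support if support else 0.0
    return corr

def soften(Y, corr, alpha=0.7):
    """Soft label = alpha * hard label + (1 - alpha) * strongest
    correlation-propagated evidence from the sample's other labels."""
    q = len(Y[0])
    soft = []
    for row in Y:
        propagated = [
            max((row[j] * corr[j][k] for j in range(q) if j != k), default=0.0)
            for k in range(q)
        ]
        soft.append([alpha * row[k] + (1 - alpha) * propagated[k] for k in range(q)])
    return soft

Y = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]  # labels 0 and 1 always co-occur
corr = label_correlation(Y)
print(soften(Y, corr)[0])  # → [1.0, 1.0, 0.0]
```

Because labels 0 and 1 always co-occur, each reinforces the other; a spuriously flipped label would likewise be pulled back toward the evidence from its correlated neighbours, which is the noise-resistance intuition.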
Task Allocation in Foraging Robot Swarms: The Role of Information Sharing
Autonomous task allocation is a desirable feature of robot swarms that collect and deliver items in scenarios where congestion, caused by accumulated items or robots, can temporarily interfere with swarm behaviour. In such settings, self-regulation of the workforce can prevent unnecessary energy consumption. We explore two types of self-regulation: non-social, where robots become idle upon experiencing congestion, and social, where robots broadcast information about congestion to their teammates in order to socially inhibit foraging. We show that while both types of self-regulation can lead to improved energy efficiency and increase the amount of resource collected, the speed with which information about congestion flows through a swarm affects the scalability of these algorithms.
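A toy discrete-time simulation makes the two self-regulation rules concrete. All numbers here (site capacity, idle duration, robot count) are invented for illustration, and this sketch says nothing about which rule wins; the paper's actual controllers and congestion model differ:

```python
import random

def simulate(n_robots, steps, social, capacity=3, seed=0):
    """Robots forage at a shared site; more than `capacity` robots at once
    causes congestion. Non-social rule: a congested robot idles for two
    steps. Social rule: it also broadcasts, so every active robot idles."""
    rng = random.Random(seed)
    idle = [0] * n_robots          # remaining idle steps per robot
    collected = blocked_trips = 0
    for _ in range(steps):
        active = [i for i in range(n_robots) if idle[i] == 0]
        for i in range(n_robots):
            idle[i] = max(0, idle[i] - 1)
        if len(active) > capacity:
            blocked_trips += len(active) - capacity
            blocked = rng.sample(active, len(active) - capacity)
            targets = active if social else blocked
            for i in targets:
                idle[i] = 2        # self-inhibit (and, if social, inhibit mates)
        collected += min(len(active), capacity)  # successful trips this step
    return collected, blocked_trips

print(simulate(8, 50, social=False))
print(simulate(8, 50, social=True))
```

Running both variants and comparing collected items against blocked trips gives a feel for the trade-off the paper studies: social inhibition spreads congestion information faster, at the cost of idling robots that never observed the congestion themselves.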
Enhancing Big Data Feature Selection Using a Hybrid Correlation-Based Feature Selection
This study proposes an alternate data extraction method that combines three well-known
feature selection methods for handling large and problematic datasets: the correlation-based feature
selection (CFS), best-first search (BFS), and dominance-based rough set approach (DRSA) methods.
This study aims to enhance the classifier’s performance in decision analysis by eliminating uncorrelated and inconsistent data values. The proposed method, named CFS-DRSA, comprises several
phases executed in sequence, the main phases incorporating two crucial feature extraction tasks.
The first is data reduction, which implements the CFS method with a BFS algorithm; the second is a data selection process that applies DRSA to generate the optimized dataset. In this way, the study
aims to reduce computational time complexity and increase classification accuracy. Several datasets with
various characteristics and volumes were used in the experimental process to evaluate the proposed
method’s credibility. The method’s performance was validated using standard evaluation measures
and benchmarked with other established methods such as deep learning (DL). Overall, the proposed
work proved that it could assist the classifier in returning a significant result, with an accuracy rate
of 82.1% for the neural network (NN) classifier, compared with 66.5% for the support vector machine
(SVM) and 49.96% for DL. The one-way analysis of variance (ANOVA) statistical result indicates that
the proposed method is an alternative extraction tool for those who find big data analysis tools
expensive to acquire and for those who are new to the data analysis field.
Funding: Ministry of Higher Education under the Fundamental Research Grant Scheme
(FRGS/1/2018/ICT04/UTM/01/1); Universiti Teknologi Malaysia (UTM) under Research University Grant
Vot-20H04; Malaysia Research University Network (MRUN) Vot 4L876; SPEV project, University of
Hradec Kralove, Faculty of Informatics and Management, Czech Republic (ID: 2102–2021), “Smart
Solutions in Ubiquitous Computing Environments”.
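The CFS stage scores a candidate subset with the familiar merit heuristic, merit = k·r_cf / sqrt(k + k(k−1)·r_ff), which prefers features correlated with the class but uncorrelated with each other. A minimal greedy forward-selection version (a simplification of the best-first search used here, with toy data of ours) looks like:

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    vy = math.sqrt(sum((b - my) ** 2 for b in ys))
    return cov / (vx * vy) if vx and vy else 0.0

def merit(subset, X, y):
    """CFS merit: k*r_cf / sqrt(k + k(k-1)*r_ff), using mean absolute
    feature-class (r_cf) and feature-feature (r_ff) correlations."""
    k = len(subset)
    r_cf = sum(abs(pearson(X[f], y)) for f in subset) / k
    pairs = [(f, g) for i, f in enumerate(subset) for g in subset[i + 1:]]
    r_ff = sum(abs(pearson(X[f], X[g])) for f, g in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

def greedy_cfs(X, y):
    """Forward selection: keep adding the feature that most improves merit."""
    selected, best = [], 0.0
    while True:
        gains = [(merit(selected + [f], X, y), f) for f in X if f not in selected]
        score, f = max(gains)
        if score <= best:
            return selected
        selected, best = selected + [f], score

# Toy data: 'f1' tracks the class, 'f2' is a near-duplicate, 'f3' is noise
X = {"f1": [1, 2, 3, 4], "f2": [2, 4, 5, 8], "f3": [1, 0, 1, 0]}
y = [0, 0, 1, 1]
print(greedy_cfs(X, y))  # → ['f1']: the redundant f2 and noisy f3 add no merit
```

The redundant near-copy inflates r_ff without raising r_cf, so adding it lowers the merit and the search stops — the same intuition by which the CFS phase discards uncorrelated and inconsistent values before DRSA refines the subset.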