81 research outputs found

    Knowledge management overview of feature selection problem in high-dimensional financial data: Cooperative co-evolution and Map Reduce perspectives

    The term big data characterizes the massive amounts of data generated by advanced technologies in different domains, using the 4Vs (volume, velocity, variety, and veracity) to indicate the amount of data that can only be processed via computationally intensive analysis, the speed of their creation, the different types of data, and their accuracy. High-dimensional financial data, such as time-series and space-time data, contain a large number of features (variables) while having a small number of samples, which are used to measure various real-time business situations for financial organizations. Such datasets are normally noisy, complex correlations may exist between their features, and many domains, including finance, lack the analytic tools to mine the data for knowledge discovery because of the high dimensionality. Feature selection is an optimization problem: find a minimal subset of relevant features that maximizes classification accuracy and reduces computation. Traditional statistical feature selection approaches are not adequate to deal with the curse of dimensionality associated with big data. Cooperative co-evolution, a meta-heuristic algorithm and a divide-and-conquer approach, decomposes high-dimensional problems into smaller sub-problems. Further, MapReduce, a programming model, offers a ready-to-use distributed, scalable, and fault-tolerant infrastructure for parallelizing the developed algorithm. This article presents a knowledge management overview of evolutionary feature selection approaches, state-of-the-art cooperative co-evolution and MapReduce-based feature selection techniques, and future research directions.
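The divide-and-conquer idea behind cooperative co-evolution can be sketched in a few lines. The following is a minimal illustrative toy, not the article's algorithm: the feature count, the grouping, and the fitness function (which rewards a hypothetical set of relevant features and penalises subset size) are all invented for the example.

```python
import random

random.seed(0)

N_FEATURES = 12
RELEVANT = {0, 3, 5, 8}  # hypothetical ground-truth relevant features (toy)

def fitness(mask):
    """Toy objective: reward relevant features, penalise subset size."""
    chosen = {i for i, bit in enumerate(mask) if bit}
    return len(chosen & RELEVANT) - 0.1 * len(chosen)

def coevolve(n_groups=3, generations=30, pop_size=10):
    """Decompose the feature indices into groups and optimise each group
    against a shared context vector (the cooperative co-evolution idea)."""
    group_size = N_FEATURES // n_groups
    groups = [list(range(g * group_size, (g + 1) * group_size))
              for g in range(n_groups)]
    context = [0] * N_FEATURES  # current best full feature mask
    for _ in range(generations):
        for group in groups:
            # start from the group's current assignment so fitness never regresses
            best_part = [context[j] for j in group]
            best_fit = fitness(context)
            for _ in range(pop_size):
                part = [random.randint(0, 1) for _ in group]
                trial = context[:]
                for j, bit in zip(group, part):
                    trial[j] = bit
                f = fitness(trial)
                if f > best_fit:
                    best_fit, best_part = f, part
            for j, bit in zip(group, best_part):
                context[j] = bit
    return context

mask = coevolve()
selected = sorted(i for i, b in enumerate(mask) if b)
```

In a real system each group would be evolved by a genetic algorithm and the fitness would be classifier accuracy, with MapReduce parallelizing the sub-problem evaluations.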

    Shared Nearest-Neighbor Quantum Game-Based Attribute Reduction with Hierarchical Coevolutionary Spark and Its Application in Consistent Segmentation of Neonatal Cerebral Cortical Surfaces

    © 2012 IEEE. The unprecedented increase in data volume has become a severe challenge for conventional patterns of data mining and learning systems tasked with handling big data. The recently introduced Spark platform is a new processing method for big data analysis and related learning systems, which has attracted increasing attention from both the scientific community and industry. In this paper, we propose a shared nearest-neighbor quantum game-based attribute reduction (SNNQGAR) algorithm that incorporates the hierarchical coevolutionary Spark model. We first present a shared coevolutionary nearest-neighbor hierarchy with self-evolving compensation that considers the features of nearest-neighborhood attribute subsets and calculates the similarity between attribute subsets according to the shared neighbor information of attribute sample points. We then present a novel attribute weight tensor model to generate ranking vectors of attributes and apply them to balance the relative contributions of different neighborhood attribute subsets. To optimize the model, we propose an embedded quantum equilibrium game paradigm (QEGP) to ensure that noisy attributes do not degrade the big data reduction results. A combination of the hierarchical coevolutionary Spark model and an improved MapReduce framework is then constructed so that it can better parallelize the SNNQGAR to efficiently determine the preferred reduction solutions of the distributed attribute subsets. The experimental comparisons demonstrate the superior performance of the SNNQGAR, which outperforms most of the state-of-the-art attribute reduction algorithms. Moreover, the results indicate that the SNNQGAR can be successfully applied to segment overlapping and interdependent fuzzy cerebral tissues, and it exhibits a stable and consistent segmentation performance for neonatal cerebral cortical surfaces.
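The shared-neighbour notion at the heart of the SNNQGAR can be illustrated generically. This is a minimal sketch of shared nearest-neighbour similarity between sample points, assuming a precomputed distance matrix; the paper's actual formulation operates over attribute subsets and is considerably richer.

```python
def knn(i, dist, k):
    """Indices of the k nearest neighbours of point i (excluding itself)."""
    order = sorted((d, j) for j, d in enumerate(dist[i]) if j != i)
    return {j for _, j in order[:k]}

def snn_similarity(a, b, dist, k=3):
    """Shared nearest-neighbour similarity: size of the overlap of kNN sets."""
    return len(knn(a, dist, k) & knn(b, dist, k))

# toy 1-D points forming two tight clusters
pts = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
dist = [[abs(p - q) for q in pts] for p in pts]
```

Points within the same cluster share more neighbours than points in different clusters, which is what makes the measure robust to raw distance scales.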

    Intelligent FMI-Reduct Ensemble Frame Work for Network Intrusion Detection System (NIDS)

    Computer networks and information systems in finance, transport, medicine, and education carry a great deal of sensitive and confidential data. As the amount of confidential and sensitive data running over network applications grows, those applications become vulnerable to a variety of cyber threats. The manual monitoring of network connections and malicious activities is extremely difficult, leading to increasing concern about malicious attacks on network-related systems. Network intrusion is a growing issue in the virtual realm of the internet and computer networks that can harm the network structure in various ways, such as by altering system configurations and parameters. To address this issue, the creation of an efficient Network Intrusion Detection System (NIDS) that identifies malicious activities within a network has become necessary. The NIDS must regularly monitor network activities to detect malicious connections and help secure computer networks. The use of machine learning (ML) and data mining approaches has proven beneficial in such scenarios. In this article, a data-driven Fuzzy-Rough feature selection technique is suggested to rank important features for the NIDS model to counter cyber security attacks. The primary goal of the research is to classify potential attacks in high-dimensional scenarios, handling redundant and irrelevant features with the proposed dimensionality reduction technique, which combines Fuzzy and Rough Set Theory techniques. Classical anomaly intrusion detection approaches that use individual classifiers such as SVM, Decision Tree, Naive Bayes, k-Nearest Neighbor, and Multilayer Perceptron are not enough to increase the effectiveness of detecting modern attacks. Hence, the suggested anomaly-based Network Intrusion Detection System, named "FMI-Reduct based Ensemble Classifier", has been tested on the highly imbalanced benchmark intrusion datasets NSL_KDD and UNSW_NB15.
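As a generic illustration of ranking features by their relevance to the class label, the sketch below scores discrete features by mutual information. It is not the proposed FMI-Reduct method, which combines fuzzy and rough set constructs; the data here are invented.

```python
from math import log2
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete variables."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum(c / n * log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

def rank_features(X, y):
    """Rank feature columns by mutual information with the label (best first)."""
    scores = [mutual_information([row[j] for row in X], y)
              for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: -scores[j])

# toy data: feature 0 mirrors the label, feature 1 is constant (uninformative)
X = [[0, 1], [1, 1], [0, 1], [1, 1]]
y = [0, 1, 0, 1]
ranking = rank_features(X, y)
```

A real NIDS pipeline would compute such scores on discretized traffic features and keep only the top-ranked subset before training the ensemble.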

    Advances in Data Mining Knowledge Discovery and Applications

    Advances in Data Mining Knowledge Discovery and Applications aims to help data miners, researchers, scholars, and PhD students who wish to apply data mining techniques. The primary contribution of this book is to highlight frontier fields and implementations of knowledge discovery and data mining. The same things may seem to be repeated, but in general the same approaches and techniques can help us in different fields and areas of expertise. This book presents knowledge discovery and data mining applications in two different sections. As is well known, data mining covers the areas of statistics, machine learning, data management and databases, pattern recognition, artificial intelligence, and others. In this book, most of these areas are covered by different data mining applications. The eighteen chapters have been classified into two parts: Knowledge Discovery and Data Mining Applications.

    Securing cloud-enabled smart cities by detecting intrusion using spark-based stacking ensemble of machine learning algorithms

    With the use of cloud computing, which provides the infrastructure necessary for the efficient delivery of smart city services to every citizen over the internet, intelligent systems may be readily integrated into smart cities and communicate with one another. Any smart system at home, in a car, or in the workplace can be remotely controlled and directed by the individual at any time. Continuous cloud service availability is becoming a critical subscriber requirement within smart cities. However, these cost-cutting measures and service improvements will make smart city cloud networks more vulnerable and at risk. The primary function of Intrusion Detection Systems (IDS) has become increasingly challenging due to the enormous proliferation of data created in the cloud networks of smart cities. To alleviate these concerns, we provide a framework for automatic, reliable, and uninterrupted cloud availability of services for the network data security of intelligent connected devices. This framework enables the IDS to defend against security threats and to provide services that meet the users' Quality of Service (QoS) expectations. This study's intrusion detection solution for cloud network data from smart cities employs Spark and the Waikato Environment for Knowledge Analysis (WEKA). WEKA is linked with Spark so that it becomes scalable and distributed, and the storage advantages of the Hadoop Distributed File System (HDFS) are combined with WEKA's Knowledge Flow for processing the cloud network data of smart cities: utilizing HDFS components, WEKA's machine learning algorithms receive the data directly. This research utilizes a wrapper-based Feature Selection (FS) approach for the IDS, employing both the Pigeon Inspired Optimizer (PIO) and Particle Swarm Optimization (PSO). For classifying the cloud network traffic of smart cities, a tree-based Stacking Ensemble Method (SEM) of J48, Random Forest (RF), and eXtreme Gradient Boosting (XGBoost) is applied.
Performance evaluations of our system were conducted using the UNSW-NB15 and NSL-KDD datasets. Our technique is superior to previous works in terms of sensitivity, specificity, precision, false positive rate (FPR), accuracy, F1 Score, and Matthews correlation coefficient (MCC).
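The stacking idea can be illustrated schematically. The sketch below uses one-feature decision stumps as stand-ins for the tree-based base learners and an accuracy-weighted vote on a held-out fold as a stand-in for the meta-learner; all data and thresholds are invented for the example.

```python
def stump(feature, threshold):
    """A one-feature decision stump returning class 0 or 1."""
    return lambda x: int(x[feature] > threshold)

def fit_weights(base_learners, X_val, y_val):
    """Meta step: weight each base learner by its held-out accuracy."""
    return [sum(clf(x) == y for x, y in zip(X_val, y_val)) / len(y_val)
            for clf in base_learners]

def predict(base_learners, weights, x):
    """Combine base predictions by an accuracy-weighted vote."""
    score = sum(w * (1 if clf(x) == 1 else -1)
                for clf, w in zip(base_learners, weights))
    return int(score > 0)

learners = [stump(0, 0.5), stump(1, 0.5)]
X_val = [(0.2, 0.9), (0.8, 0.1), (0.9, 0.9), (0.1, 0.1)]
y_val = [1, 0, 1, 0]  # the label follows feature 1 in this toy fold
weights = fit_weights(learners, X_val, y_val)
```

In a true stacking setup the base models' out-of-fold predictions become the training inputs of a learned meta-model rather than fixed vote weights; the structure of the two levels is the same.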

    Ensembles for feature selection: A review and future trends

    © 2019. This manuscript version is made available under the CC-BY-NC-ND 4.0 license https://creativecommons.org/licenses/by-nc-nd/4.0/. This version of the article: Bolón-Canedo, V. and Alonso-Betanzos, A. (2019) ‘Ensembles for Feature Selection: A Review and Future Trends’ has been accepted for publication in: Information Fusion, 52, pp. 1–12. The Version of Record is available online at https://doi.org/10.1016/j.inffus.2018.11.008. [Abstract]: Ensemble learning is a prolific field in Machine Learning, since it is based on the assumption that combining the output of multiple models is better than using a single model, and it usually provides good results. It has commonly been employed for classification, but it can also be used to improve other disciplines such as feature selection. Feature selection consists of selecting the relevant features for a problem and discarding those that are irrelevant or redundant, with the main goal of improving classification accuracy. In this work, we provide the reader with the basic concepts necessary to build an ensemble for feature selection, as well as reviewing the up-to-date advances and commenting on the future trends that are still to be faced. This research has been financially supported in part by the Spanish Ministerio de Economía y Competitividad (research project TIN 2015-65069-C2-1-R), by the Xunta de Galicia (research projects GRC2014/035 and the Centro Singular de Investigación de Galicia, accreditation 2016–2019, Ref. ED431G/01) and by the European Union (FEDER/ERDF).
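A common way to build an ensemble for feature selection is to aggregate the rankings produced by several individual selectors. The sketch below shows mean-rank aggregation, one of the simplest combination rules in this literature; the three rankings are invented.

```python
def aggregate_ranks(rankings):
    """Combine feature rankings (best-first lists of feature indices)
    from several selectors by mean position; lower mean = better."""
    n = len(rankings[0])
    mean_pos = [sum(r.index(f) for r in rankings) / len(rankings)
                for f in range(n)]
    return sorted(range(n), key=lambda f: mean_pos[f])

# three hypothetical selectors ranking four features
combined = aggregate_ranks([[2, 0, 1, 3],
                            [0, 2, 1, 3],
                            [2, 1, 0, 3]])
```

Other aggregation rules surveyed in this area (median rank, Borda count, stability-weighted voting) slot into the same interface.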

    Mobile analytics database summarization using rough set

    A mobile device supports mobility activities and is highly portable. However, mobile devices have limited resources and storage capacity, a deficiency that should be considered in order to maximize their functionality. Hence, this study provides a data management formulation to support the storage of large-scale data by using a Rough Set approach to select the data with relevant and useful information. Additionally, an analytics method is combined with these features to complete the analysis of the data storage processing, making it easier for users to understand and read the analysis results. Testing is done using data from Malaysia’s Open Government Data about the Air Pollutant Index (API) to determine the effect of air pollution levels on the health and safety of the population. The testing successfully created a summary of the API data with the Rough Set approach, selecting significant data from the main database based on generated rules. The analysis results of the selected API data are stored as a mobile database and presented in charts intended to make the data meaningful and the analysis of API conditions easier to understand on a mobile device.
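Rough Set selection rests on lower and upper approximations of a target set of objects, built from equivalence classes of objects that are indiscernible on the chosen attributes. The sketch below computes both for a toy decision table; the attribute names and readings are invented and only loosely echo the API use case.

```python
from collections import defaultdict

def partition(table, attrs):
    """Equivalence classes: objects with identical values on attrs."""
    blocks = defaultdict(set)
    for name, row in table.items():
        blocks[tuple(row[a] for a in attrs)].add(name)
    return list(blocks.values())

def approximations(table, attrs, target):
    """Rough-set lower and upper approximations of a target set."""
    lower, upper = set(), set()
    for block in partition(table, attrs):
        if block <= target:       # block certainly inside the target
            lower |= block
        if block & target:        # block possibly inside the target
            upper |= block
    return lower, upper

# hypothetical pollution readings
table = {
    "o1": {"api": "high", "wind": "low"},
    "o2": {"api": "high", "wind": "low"},
    "o3": {"api": "low",  "wind": "low"},
    "o4": {"api": "low",  "wind": "high"},
}
unhealthy = {"o1", "o2", "o3"}
lower, upper = approximations(table, ("api",), unhealthy)
```

Objects in the lower approximation can be summarized with certainty by a rule over the chosen attributes; the gap between upper and lower is the boundary region the method must keep or refine.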

    Improving the classification performance on imbalanced data sets via new hybrid parameterisation model

    The aim of this work is to analyse the performance of the newly proposed hybrid parameterisation model in handling problematic data. Three types of problematic data are highlighted in this paper: i) big data sets, ii) uncertain and inconsistent data sets, and iii) imbalanced data sets. The proposed hybrid model is an integration of three main phases: data decomposition, parameter reduction, and parameter selection. Three main methods were employed: soft set and rough set theories to reduce and select the optimised parameter set, and a neural network to classify the optimised data set. This model can process a data set that contains uncertain, inconsistent, and imbalanced data. One additional phase, data decomposition, was therefore introduced and executed after the pre-processing task was completed in order to manage the big data issue. Imbalanced data sets were used to evaluate the capability of the proposed hybrid model in handling problematic data. The experimental results demonstrate that the proposed hybrid model has the potential to be implemented with any type of data set in a classification task, especially with complex data sets.
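The first two phases of such a pipeline can be sketched generically. Chunking illustrates the data decomposition phase, and dropping constant columns stands in for soft/rough-set parameter reduction; neither is the paper's actual method, and the example data are invented.

```python
def decompose(rows, chunk_size):
    """Data decomposition phase: split a big data set into chunks
    small enough for the later phases to process."""
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]

def reduce_parameters(rows):
    """Parameter reduction stand-in: drop constant columns, which carry
    no discriminating information (the role rough/soft-set reduction plays)."""
    keep = [j for j in range(len(rows[0]))
            if len({row[j] for row in rows}) > 1]
    return [[row[j] for j in keep] for row in rows], keep

chunks = decompose(list(range(5)), 2)
reduced, keep = reduce_parameters([[1, 0, 5], [1, 1, 5], [1, 0, 7]])
```

Parameter selection and the neural-network classifier would then operate on each reduced chunk in turn.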

    SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary

    The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is considered the de facto standard in the framework of learning from imbalanced data. This is due to its simplicity in the design of the procedure, as well as its robustness when applied to different types of problems. Since its publication in 2002, SMOTE has proven successful in a variety of applications from several different domains. SMOTE has also inspired several approaches to counter the issue of class imbalance, and has significantly contributed to new supervised learning paradigms, including multilabel classification, incremental learning, semi-supervised learning, and multi-instance learning, among others. It is a standard benchmark for learning from imbalanced data, and is also featured in a number of different software packages, from open source to commercial. In this paper, marking the fifteen-year anniversary of SMOTE, we reflect on the SMOTE journey, discuss the current state of affairs with SMOTE and its applications, and identify the next set of challenges to extend SMOTE for Big Data problems. This work has been partially supported by the Spanish Ministry of Science and Technology under projects TIN2014-57251-P, TIN2015-68454-R and TIN2017-89517-P; the Project 887 BigDaP-TOOLS - Ayudas Fundación BBVA a Equipos de Investigación Científica 2016; and the National Science Foundation (NSF) Grant IIS-1447795.
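The core SMOTE interpolation step is simple enough to sketch directly: pick a minority sample, find its k nearest minority neighbours, and create a synthetic point along the line segment to one of them. This is a minimal pure-Python illustration, not the reference implementation; the toy data are invented.

```python
import random

def smote(minority, n_new, k=2, seed=0):
    """Generate synthetic minority samples by interpolating between a
    minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (squared Euclidean, excluding x)
        neighbours = sorted(
            (p for p in minority if p != x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # random position along the segment x -> nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_points = smote(minority, 5)
```

Because every synthetic point lies between two existing minority samples, the oversampled class fills out its own region rather than duplicating points, which is what distinguishes SMOTE from random oversampling.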