
    Evolutionary algorithms and weighting strategies for feature selection in predictive data mining

    Improvements in Deoxyribonucleic Acid (DNA) microarray technology mean that thousands of genes can be profiled simultaneously, quickly and efficiently. DNA microarrays are increasingly being used for prediction and early diagnosis in cancer treatment. Feature selection and classification play a pivotal role in this process. The correct identification of an informative subset of genes may directly lead to putative drug targets, and these genes can also be used as an early diagnosis or predictive tool. However, the many thousands of features in a typical dataset pose a formidable barrier to feature selection efforts. Many approaches to feature selection in such datasets have been presented in the literature. Most use classical statistical approaches (e.g. correlation), which, although fast, are incapable of detecting non-linear interactions between features of interest. Evolutionary Algorithms (EAs), by contrast, naturally take non-linear interactions into account, which makes them very promising for feature selection in such datasets. It has been shown that dimensionality reduction increases the efficiency of feature selection in large and noisy datasets such as DNA microarray data. The two-phase Evolutionary Algorithm/k-Nearest Neighbours (EA/k-NN) algorithm is a promising approach that carries out initial dimensionality reduction as well as feature selection and classification. This thesis further investigates the two-phase EA/k-NN algorithm and introduces an adaptive weights scheme for the k-Nearest Neighbours (k-NN) classifier, a novel weighted centroid classification technique, and a correlation-guided mutation approach. Results show that the weighted centroid approach is capable of outperforming the EA/k-NN algorithm across five large biomedical datasets. The thesis also identifies promising new areas of research that would complement the techniques introduced and investigated.
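
    For illustration, the following is a minimal sketch of a feature-weighted centroid classifier of the kind described above. The function name, the uniform weights, and the toy data are assumptions for the example, not the thesis's implementation; in the thesis the per-feature weights would be evolved by the EA rather than fixed.

```python
import numpy as np

def weighted_centroid_predict(X_train, y_train, X_test, weights):
    """Assign each test sample to the class whose centroid is nearest
    under a feature-weighted Euclidean distance."""
    classes = np.unique(y_train)
    # One centroid per class: the mean of that class's training samples.
    centroids = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    preds = []
    for x in X_test:
        # Feature weights stretch or shrink each dimension before measuring distance.
        d = np.sqrt((weights * (x - centroids) ** 2).sum(axis=1))
        preds.append(classes[np.argmin(d)])
    return np.array(preds)

# Toy usage: random data and uniform weights (an EA would evolve these weights).
rng = np.random.default_rng(0)
X_tr, y_tr = rng.normal(size=(40, 5)), rng.integers(0, 2, size=40)
X_te = rng.normal(size=(10, 5))
print(weighted_centroid_predict(X_tr, y_tr, X_te, np.ones(5)))
```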

    Particle Swarm Optimisation for Feature Selection in Classification

    Classification problems often have a large number of features, but not all of them are useful for classification. Irrelevant and redundant features may even reduce classification accuracy. Feature selection is the process of selecting a subset of relevant features, which can decrease the dimensionality, shorten the running time, and/or improve the classification accuracy. There are two types of feature selection approaches, wrappers and filters; their main difference is that wrappers use a classification algorithm to evaluate the goodness of the features during the feature selection process, while filters are independent of any classification algorithm. Feature selection is a difficult task because of feature interactions and the large search space. Existing feature selection methods suffer from different problems, such as stagnation in local optima and high computational cost. Evolutionary computation (EC) techniques are well-known global search algorithms. Particle swarm optimisation (PSO) is an EC technique that is computationally less expensive and can converge faster than other methods. PSO has been successfully applied to many areas, but its potential for feature selection has not been fully investigated. The overall goal of this thesis is to investigate and improve the capability of PSO for feature selection, selecting a smaller number of features while achieving similar or better classification performance than using all features. The thesis investigates the use of PSO for both wrapper and filter approaches, for both single-objective and multi-objective feature selection, and also investigates the differences between wrappers and filters. It proposes a new PSO-based wrapper, single-objective feature selection approach by developing new initialisation and updating mechanisms; the results show that by considering the number of features in the initialisation and updating procedures, the new algorithm can improve the classification performance, reduce the number of features and decrease computational time. It develops the first PSO-based wrapper multi-objective feature selection approach, which aims to maximise the classification accuracy and simultaneously minimise the number of features; the results show that the proposed multi-objective algorithm can obtain more and better feature subsets than single-objective algorithms, and outperforms other well-known EC-based multi-objective feature selection algorithms. It also develops a filter, single-objective feature selection approach based on PSO and information theory, with two measures proposed to evaluate the relevance of the selected features based on pairs of features and on groups of features, respectively; the results show that PSO and information-based algorithms can successfully address feature selection tasks, where the group-based method achieves higher classification accuracies but the pair-based method is faster and selects smaller feature subsets. Finally, the thesis proposes the first PSO-based multi-objective filter feature selection approach using information-based measures; this is also the first work to use two other well-known multi-objective EC algorithms for filter feature selection, which serve as comparisons for the PSO-based approach. The results show that the PSO-based multi-objective filter approach can successfully address feature selection problems, outperforms single-objective filter algorithms and achieves better classification performance than the other multi-objective algorithms. The thesis further investigates the difference between wrapper and filter approaches in terms of classification performance and computational time, and examines the generality of wrappers. The results show that wrappers generally achieve classification performance similar to or better than filters, but do not always need longer computational time than filters; they also show that wrappers built with simple classification algorithms can generalise to other classification algorithms.
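
    As a concrete illustration of the wrapper idea, here is a minimal sketch of binary-PSO feature selection wrapped around a k-NN classifier. It uses scikit-learn's KNeighborsClassifier and the breast-cancer dataset as stand-ins; the inertia/acceleration constants and the small subset-size penalty are illustrative assumptions, not the thesis's actual initialisation or updating mechanisms.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X, y = load_breast_cancer(return_X_y=True)
n_particles, n_feat, iters = 15, X.shape[1], 15
w, c1, c2 = 0.7, 1.5, 1.5  # inertia and acceleration coefficients (illustrative)

def fitness(mask):
    """Wrapper fitness: cross-validated k-NN accuracy, lightly
    penalising larger feature subsets."""
    if not mask.any():
        return 0.0
    acc = cross_val_score(KNeighborsClassifier(5), X[:, mask], y, cv=3).mean()
    return acc - 0.01 * mask.sum() / n_feat

pos = rng.random((n_particles, n_feat))       # continuous positions in [0, 1]
vel = rng.normal(scale=0.1, size=pos.shape)   # velocities
pbest = pos.copy()                            # personal best positions
pbest_fit = np.array([fitness(p > 0.5) for p in pos])
gbest = pbest[pbest_fit.argmax()].copy()      # global best position

for _ in range(iters):
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, 0.0, 1.0)
    for i in range(n_particles):
        f = fitness(pos[i] > 0.5)             # a position > 0.5 selects a feature
        if f > pbest_fit[i]:
            pbest[i], pbest_fit[i] = pos[i], f
    gbest = pbest[pbest_fit.argmax()].copy()

print((gbest > 0.5).sum(), "of", n_feat, "features selected;",
      "best fitness %.3f" % pbest_fit.max())
```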

    Forecasting Models for Integration of Large-Scale Renewable Energy Generation to Electric Power Systems

    Amid growing concerns about climate change and the depletion of non-renewable energy sources, variable renewable energy sources (VRESs) are considered a feasible substitute for conventional environment-polluting fossil fuel-based power plants. Furthermore, the transition towards clean power systems requires additional transmission capacity. Dynamic thermal line rating (DTLR) is being considered as a potential solution to enhance current transmission line capacity and omit or postpone transmission system expansion planning, although DTLR is highly dependent on weather variations. With the increasing accommodation of VRESs and the application of DTLR, the resulting fluctuations and variations impose severe and unprecedented challenges on power system operation. Therefore, short-term forecasting of large-scale VRESs and DTLR plays a crucial role in electric power system operation problems. To this end, this thesis focuses on developing forecasting models for two large-scale VRES types (i.e., wind and tidal) and for DTLR. Deterministic prediction can be employed for a variety of power system operation problems solved by deterministic optimization, and its outcomes can also feed conditional probabilistic prediction, which models uncertainty in power system operation problems based on robust optimization, chance-constrained optimization, etc. By virtue of the importance of deterministic prediction, deterministic prediction models are developed. Commonly, time-frequency decomposition approaches are adopted to decompose the wind power time series (TS) into several less non-stationary and less non-linear components, which can be predicted more precisely. However, in addition to non-stationarity and non-linearity, wind power TS exhibit chaotic characteristics, which reduce their predictability. In this regard, a wind power generation prediction model that accounts for the chaotic nature of the wind power generation TS is proposed. The model consists of a novel TS decomposition approach, named multi-scale singular spectrum analysis (MSSSA), and least squares support vector machines (LSSVMs). Furthermore, a deterministic tidal TS prediction model is developed, employing a variant of empirical mode decomposition (EMD) that alleviates the issues associated with standard EMD. To further improve the prediction accuracy, the impact of the different frequency components (scales) of the wind power TS on the spatiotemporal modeling of the wind farm is assessed; consequently, a multiscale spatiotemporal wind power prediction model is developed, using information theory-based feature selection, wavelet decomposition, and LSSVM. Power system operation problems based on robust optimization and interval optimization require prediction intervals (PIs) to model the uncertainty of renewables. Advanced PI models are mainly based on non-differentiable and non-convex cost functions, which make the use of heuristic optimization for tuning the large number of unknown parameters of the prediction models inevitable; however, heuristic optimization suffers from several issues (e.g., being trapped in local optima, irreproducibility, etc.). To this end, a new wind power PI (WPPI) model based on a bi-level optimization structure is put forward, in which the main unknown parameters of the prediction model are globally tuned by optimizing a convex and differentiable cost function. In line with addressing the non-differentiability and non-convexity of the PI formulation, an asymmetrically adaptive quantile regression (AAQR), which benefits from a linear formulation, is proposed for tidal uncertainty modeling. In prevalent QR-based PI models, for a specified reliability level, the probabilities of the quantiles are selected symmetrically with respect to the median probability; however, it is found that asymmetric and adaptive selection of quantiles with respect to the median can provide more efficient PIs. To make the formulation of AAQR linear, an extreme learning machine (ELM) is adopted as the prediction engine. The parameters of the activation functions in ELMs are commonly selected at random, yet different sets of random values can yield dissimilar prediction accuracy; a heuristic optimization is therefore devised to tune the parameters of the activation functions. Also, to enhance the accuracy of probabilistic DTLR, the inclusion of latent variables in DTLR prediction is assessed, and it is observed that the convective cooling rate can provide informative features for DTLR prediction. To address the high-dimensional feature space in DTLR, a DTLR prediction model based on deep learning and latent variables is put forward. Numerical results in this thesis are based on realistic data, and the simulations confirm the superiority of the proposed models in comparison to traditional benchmark models as well as state-of-the-art models.
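
    To make the quantile-selection idea concrete, the following is a minimal numerical sketch, not the thesis's AAQR model: it uses the standard pinball loss and empirical quantiles of a skewed error sample to show that, at the same 90% nominal coverage, an asymmetric quantile pair can give a narrower interval than the symmetric one. The gamma-distributed sample and the specific quantile pairs are assumptions for illustration.

```python
import numpy as np

def pinball_loss(y, q_pred, tau):
    """Quantile (pinball) loss for quantile level tau in (0, 1)."""
    e = y - q_pred
    return np.mean(np.maximum(tau * e, (tau - 1) * e))

# Skewed stand-in for forecast errors; both quantile pairs below have the
# same nominal 90% coverage (tau_hi - tau_lo = 0.9).
rng = np.random.default_rng(2)
y = rng.gamma(shape=2.0, scale=1.0, size=100_000)
for lo, hi in [(0.05, 0.95),   # symmetric about the median
               (0.02, 0.92)]:  # asymmetric, shifted toward the short tail
    q_lo, q_hi = np.quantile(y, [lo, hi])
    covered = np.mean((y >= q_lo) & (y <= q_hi))
    print(f"taus ({lo:.2f}, {hi:.2f}): width {q_hi - q_lo:.2f}, "
          f"coverage {covered:.2f}")
print("pinball loss of the 0.95 quantile:",
      round(pinball_loss(y, np.quantile(y, 0.95), 0.95), 4))
```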

    Integrated information gain with extra tree algorithm for feature permission analysis in android malware classification

    The rapid growth of free applications in the Android market has led to the fast spread of malware apps, since users store sensitive personal information on their mobile devices when using those apps. The permission mechanism is designed as a security layer to protect the Android operating system by restricting access to local resources of the system, at installation time and, for updated versions of the Android operating system, at run time. Even though permissions provide a security layer to users, they can be exploited by attackers to threaten user privacy. Consequently, exploring the patterns of those permissions becomes necessary to find the relevant permission features that contribute to classifying Android apps. However, in the era of big data and the rapid explosion of malware, along with many unnecessarily requested permissions, recognizing permission patterns in these data has become a challenge, owing to the irrelevant and redundant features that degrade classification performance and increase the complexity cost overhead. The Ensemble-based Extra Tree Feature Selection (FS-EX) algorithm is proposed in this study to explore permission patterns by selecting a minimal-sized subset of highly discriminant permission features capable of discriminating malware samples from non-malware samples. The integrated Information Gain with Ensemble-based Extra Tree Feature Selection (FS-IGEX) algorithm is proposed to assign weight values to permission features instead of binary values, to determine the impact of weighted attribute variables on classification performance. The two proposed methods based on Ensemble Extra Tree Feature Selection were evaluated on five datasets with various sample sizes and feature spaces, using nine machine learning classifiers. Comparison studies were carried out between FS-EX subsets and the dataset of Full Permission features (FP), and between the two approaches of the FS-IGEX method: the Permission-Binary (PB) approach and the Permission-Weighted (PW) approach. Permissions in PB were represented with binary values, whereas permissions in PW were represented with weighted values. The results demonstrated that the FS-EX approach was promising in obtaining the most prominent permission features related to the class target, attaining the same or similar classification accuracy, with the highest mean accuracy of 96%, compared to the FP. In addition, the PW approach of the FS-IGEX method had highly influential weighted permission features that could classify apps as malware and non-malware with the highest mean accuracy of 93%, compared to the PB approach of the FS-IGEX method and the FP.
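
    As an illustrative sketch of the weighting idea (not the FS-IGEX implementation itself), the snippet below computes the information gain of binary permission features with respect to a malware/benign label; the toy permission matrix and the label rule are assumptions for the example.

```python
import numpy as np

def entropy(y):
    """Shannon entropy of a label vector, in bits."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(feature, y):
    """H(y) minus the weighted entropy after splitting on one
    binary permission feature."""
    gain = entropy(y)
    for v in (0, 1):
        mask = feature == v
        if mask.any():
            gain -= mask.mean() * entropy(y[mask])
    return gain

# Toy data: rows are apps, columns are requested-permission flags (0/1).
rng = np.random.default_rng(3)
X = rng.integers(0, 2, size=(200, 4))
y = (X[:, 0] & X[:, 2]) | (rng.random(200) < 0.1)  # label driven by permissions 0 and 2
weights = [information_gain(X[:, j], y) for j in range(X.shape[1])]
print(np.round(weights, 3))  # higher weight = more discriminative permission
```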

    Discretization in Subgroup Discovery

    Subgroup discovery is a data mining technique for discovering interesting subgroups within a selected population. It seeks to find interesting relationships between different objects in a set with respect to a specific property. The discovered patterns, called subgroups, are represented in the form of rules. Discretization is a technique that replaces numerical attributes with nominal ones, making it possible to use them with algorithms that do not support numerical attributes. In this thesis, two datasets are discretized for the application of subgroup discovery, using four different discretization methods and three different bin counts. The datasets used are the heart disease and Australian credit approval datasets from the UCI Machine Learning Repository. The subgroup discovery technique produced eleven subgroup sets as a result: eight from the heart disease dataset and three from the Australian credit approval dataset. We observed that the bin count greatly affects the results. Also, with binary discretization there are subgroup sets with a high share of subgroups containing discretized attributes. In addition, the importance of expert guidance is emphasized.
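
    As a small sketch of one common discretization scheme, the snippet below performs equal-width binning of a numeric attribute into nominal bin labels; the function name, bin count, and sample values are illustrative assumptions (the thesis compares four methods, which are not reproduced here).

```python
import numpy as np

def equal_width_bins(values, n_bins):
    """Replace a numeric attribute with nominal bin labels using
    equal-width discretization."""
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # Interior edges split the range; digitize maps each value to 0..n_bins-1.
    idx = np.digitize(values, edges[1:-1], right=True)
    return np.array([f"bin_{i + 1}" for i in idx])

# Example: an age-like attribute discretized into 3 bins.
ages = np.array([29, 41, 44, 52, 57, 63, 70])
print(equal_width_bins(ages, 3))  # ['bin_1' 'bin_1' 'bin_2' 'bin_2' 'bin_3' 'bin_3' 'bin_3']
```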

    Computer-Aided Biomimetics: Semi-Open Relation Extraction from scientific biological texts

    Engineering inspired by biology – recently termed biom* – has led to various groundbreaking technological developments. Example areas of application include aerospace engineering and robotics. However, biom* is not always successful and is only sporadically applied in industry. The reason is that a systematic approach to biom* remains elusive, despite the existence of a plethora of methods and design tools. In recent years computational tools have been proposed as well, which can potentially support a systematic integration of relevant biological knowledge during biom*. However, these so-called Computer-Aided Biom* (CAB) tools have not been able to fill all the gaps in the biom* process. This thesis investigates why existing CAB tools fail, proposes a novel approach – based on Information Extraction – and develops a proof-of-concept for a CAB tool that does enable a systematic approach to biom*. Key contributions include: 1) a disquisition of existing tools that guides the selection of a strategy for systematic CAB, 2) a dataset of 1,500 manually annotated sentences, and 3) a novel Information Extraction approach that combines the outputs of a supervised Relation Extraction system and an existing Open Information Extraction system. The implemented exploratory approach indicates that it is possible to extract a focused selection of relations from scientific texts with reasonable accuracy, without imposing limitations on the types of information extracted. Furthermore, the tool developed in this thesis is shown to i) speed up trade-off analysis by domain experts, and ii) improve access to biological information for non-experts.
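
    As a purely hypothetical sketch of how outputs from a supervised Relation Extraction system and an Open Information Extraction system might be combined (the thesis's actual merging logic is not described in this abstract), consider unioning their (subject, relation, object) triples while preferring the supervised, typed relation when both systems extract the same fact. The example triples and the normalisation rule are invented for illustration.

```python
# Hypothetical triples (subject, relation, object) from each system.
supervised = {("trichome", "protects-against", "herbivory")}
open_ie = {("trichome", "is located on", "leaf surface"),
           ("trichome", "protects against", "herbivory")}

def key(triple):
    """Normalise a triple so near-identical relations compare equal."""
    s, r, o = triple
    return (s, r.lower().replace("-", " ").strip(), o)

# Prefer the supervised (typed) relation when both systems found the same
# fact; keep the remaining open extractions to preserve coverage.
merged = set(supervised)
supervised_keys = {key(t) for t in supervised}
merged.update(t for t in open_ie if key(t) not in supervised_keys)
print(sorted(merged))
```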
