8 research outputs found

    A neuroevolutionary approach to feature selection using multiobjective evolutionary algorithms

    Get PDF
    Feature selection plays a central role in predictive analysis where datasets have hundreds or thousands of variables available. It can also reduce the overall training time and the computational costs of the classifiers used. However, feature selection methods can be computationally intensive or dependent of human expertise to analyze data. This study proposes a neuroevolutionary approach which uses multiobjective evolutionary algorithms to optimize neural network parameters in order to find the best network able to identify the most important variables of analyzed data. Classification is done through a Support Vector Machine (SVM) classifier where specific parameters are also optimized. The method is applied to datasets with different number of features and classes.FCT - Fundação para a Ciência e Tecnologia in the scope of the projects: PEst-OE/EEI/UI0319/2014, UID/MAT/00013/2013, UID/CEC/ 00319/2019 and the European project MSCA-RISE-2015, NEWEX, with reference 734205

    Pareto front feature selection based on artificial bee colony optimization

    No full text
    Feature selection has two major conflicting aims, i.e., to maximize the classification performance and to minimize the number of selected features to overcome the curse of dimensionality. To balance their trade-off, feature selection can be handled as a multi-objective problem. In this paper, a feature selection approach is proposed based on a new multi objective artificial bee colony algorithm integrated with non-dominated sorting procedure and genetic operators. Two different implementations of the proposed approach are developed: ABC with binary representation and ABC with continuous representation. Their performance are examined on 12 benchmark datasets and the results are compared with those of linear forward selection, greedy stepwise backward selection, two single objective ABC algorithms and three well-known multi-objective evolutionary computation algorithms. The results show that the proposed approach with the binary representation outperformed the other methods in terms of both the dimensionality reduction and the classification accuracy. (C) 2017 Elsevier Inc. All rights reserved

    Machine Learning Approaches for Cancer Analysis

    Get PDF
    In addition, we propose many machine learning models that serve as contributions to solve a biological problem. First, we present Zseq, a linear time method that identifies the most informative genomic sequences and reduces the number of biased sequences, sequence duplications, and ambiguous nucleotides. Zseq finds the complexity of the sequences by counting the number of unique k-mers in each sequence as its corresponding score and also takes into the account other factors, such as ambiguous nucleotides or high GC-content percentage in k-mers. Based on a z-score threshold, Zseq sweeps through the sequences again and filters those with a z-score less than the user-defined threshold. Zseq is able to provide a better mapping rate; it reduces the number of ambiguous bases significantly in comparison with other methods. Evaluation of the filtered reads has been conducted by aligning the reads and assembling the transcripts using the reference genome as well as de novo assembly. The assembled transcripts show a better discriminative ability to separate cancer and normal samples in comparison with another state-of-the-art method. Studying the abundance of select mRNA species throughout prostate cancer progression may provide some insight into the molecular mechanisms that advance the disease. In the second contribution of this dissertation, we reveal that the combination of proper clustering, distance function and Index validation for clusters are suitable in identifying outlier transcripts, which show different trending than the majority of the transcripts, the trending of the transcript is the abundance throughout different stages of prostate cancer. We compare this model with standard hierarchical time-series clustering method based on Euclidean distance. Using time-series profile hierarchical clustering methods, we identified stage-specific mRNA species termed outlier transcripts that exhibit unique trending patterns as compared to most other transcripts during disease progression. This method is able to identify those outliers rather than finding patterns among the trending transcripts compared to the hierarchical clustering method based on Euclidean distance. A wet-lab experiment on a biomarker (CAM2G gene) confirmed the result of the computational model. Genes related to these outlier transcripts were found to be strongly associated with cancer, and in particular, prostate cancer. Further investigation of these outlier transcripts in prostate cancer may identify them as potential stage-specific biomarkers that can predict the progression of the disease. Breast cancer, on the other hand, is a widespread type of cancer in females and accounts for a lot of cancer cases and deaths in the world. Identifying the subtype of breast cancer plays a crucial role in selecting the best treatment. In the third contribution, we propose an optimized hierarchical classification model that is used to predict the breast cancer subtype. Suitable filter feature selection methods and new hybrid feature selection methods are utilized to find discriminative genes. Our proposed model achieves 100% accuracy for predicting the breast cancer subtypes using the same or even fewer genes. Studying breast cancer survivability among different patients who received various treatments may help understand the relationship between the survivability and treatment therapy based on gene expression. In the fourth contribution, we have built a classifier system that predicts whether a given breast cancer patient who underwent some form of treatment, which is either hormone therapy, radiotherapy, or surgery will survive beyond five years after the treatment therapy. Our classifier is a tree-based hierarchical approach that partitions breast cancer patients based on survivability classes; each node in the tree is associated with a treatment therapy and finds a predictive subset of genes that can best predict whether a given patient will survive after that particular treatment. We applied our tree-based method to a gene expression dataset that consists of 347 treated breast cancer patients and identified potential biomarker subsets with prediction accuracies ranging from 80.9% to 100%. We have further investigated the roles of many biomarkers through the literature. Studying gene expression through various time intervals of breast cancer survival may provide insights into the recovery of the patients. Discovery of gene indicators can be a crucial step in predicting survivability and handling of breast cancer patients. In the fifth contribution, we propose a hierarchical clustering method to separate dissimilar groups of genes in time-series data as outliers. These isolated outliers, genes that trend differently from other genes, can serve as potential biomarkers of breast cancer survivability. In the last contribution, we introduce a method that uses machine learning techniques to identify transcripts that correlate with prostate cancer development and progression. We have isolated transcripts that have the potential to serve as prognostic indicators and may have significant value in guiding treatment decisions. Our study also supports PTGFR, NREP, scaRNA22, DOCK9, FLVCR2, IK2F3, USP13, and CLASP1 as potential biomarkers to predict prostate cancer progression, especially between stage II and subsequent stages of the disease

    Enhanced Harris's Hawk algorithm for continuous multi-objective optimization problems

    Get PDF
    Multi-objective swarm intelligence-based (MOSI-based) metaheuristics were proposed to solve multi-objective optimization problems (MOPs) with conflicting objectives. Harris’s hawk multi-objective optimizer (HHMO) algorithm is a MOSIbased algorithm that was developed based on the reference point approach. The reference point is determined by the decision maker to guide the search process to a particular region in the true Pareto front. However, HHMO algorithm produces a poor approximation to the Pareto front because lack of information sharing in its population update strategy, equal division of convergence parameter and randomly generated initial population. A two-step enhanced non-dominated sorting HHMO (2SENDSHHMO) algorithm has been proposed to solve this problem. The algorithm includes (i) a population update strategy which improves the movement of hawks in the search space, (ii) a parameter adjusting strategy to control the transition between exploration and exploitation, and (iii) a population generating method in producing the initial candidate solutions. The population update strategy calculates a new position of hawks based on the flush-and-ambush technique of Harris’s hawks, and selects the best hawks based on the non-dominated sorting approach. The adjustment strategy enables the parameter to adaptively changed based on the state of the search space. The initial population is produced by generating quasi-random numbers using Rsequence followed by adapting the partial opposition-based learning concept to improve the diversity of the worst half in the population of hawks. The performance of the 2S-ENDSHHMO has been evaluated using 12 MOPs and three engineering MOPs. The obtained results were compared with the results of eight state-of-the-art multi-objective optimization algorithms. The 2S-ENDSHHMO algorithm was able to generate non-dominated solutions with greater convergence and diversity in solving most MOPs and showed a great ability in jumping out of local optima. This indicates the capability of the algorithm in exploring the search space. The 2S-ENDSHHMO algorithm can be used to improve the search process of other MOSI-based algorithms and can be applied to solve MOPs in applications such as structural design and signal processing
    corecore