7 research outputs found

    Population Subset Selection for the Use of a Validation Dataset for Overfitting Control in Genetic Programming

    Get PDF
    [Abstract] Genetic Programming (GP) is a technique able to solve different problems through the evolution of mathematical expressions. However, its tendency to overfit the data is one of the main obstacles to its application. The use of a validation dataset is a common way to prevent overfitting in many Machine Learning (ML) techniques, including GP. But one key point differentiates GP from other ML techniques: instead of training a single model, GP evolves a population of models. The validation dataset can therefore be used in several ways, because any of the evolved models could be evaluated on it. This work explores the possibility of applying the validation dataset not only to the training-best individual but also to a subset containing the training-best individuals of the population. The study was conducted on 5 well-known databases covering regression and classification tasks. In most cases, the results point to an improvement when the validation dataset is used on a subset of the population instead of only on the training-best individual, which also induces a reduction in the number of nodes and, consequently, a lower complexity of the expressions.
    Funding: Xunta de Galicia; ED431G/01. Xunta de Galicia; ED431D 2017/16. Xunta de Galicia; ED431C 2018/49. Xunta de Galicia; ED431D 2017/23. Instituto de Salud Carlos III; PI17/0182
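The subset-validation idea described in the abstract can be sketched in a few lines. This is an illustrative toy, not the paper's GP system: each "individual" is just a candidate slope parameter and fitness is mean squared error, but the two validation strategies being compared are the ones the abstract names.

```python
import random

random.seed(0)

# Toy stand-in for a GP population: each "individual" is a candidate slope
# for the target function y = 2x, and fitness is mean squared error.
train_x = [random.uniform(-1.0, 1.0) for _ in range(30)]
valid_x = [random.uniform(-1.0, 1.0) for _ in range(30)]

def mse(slope, xs):
    return sum((2.0 * x - slope * x) ** 2 for x in xs) / len(xs)

population = [random.uniform(0.0, 4.0) for _ in range(50)]

# Strategy A: use the validation set only on the training-best individual.
best_train = min(population, key=lambda s: mse(s, train_x))

# Strategy B: use it on the k training-best individuals and keep the one
# with the lowest validation error.
k = 5
subset = sorted(population, key=lambda s: mse(s, train_x))[:k]
best_subset = min(subset, key=lambda s: mse(s, valid_x))

# The subset winner can never validate worse than strategy A's choice,
# because the training-best individual is itself a member of the subset.
assert mse(best_subset, valid_x) <= mse(best_train, valid_x)
```

Since the training-best individual is always contained in the subset, strategy B's validation error is at most that of strategy A, which is the intuition behind the improvement the study reports.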

    The relationship between search based software engineering and predictive modeling

    Full text link
    Search Based Software Engineering (SBSE) is an approach to software engineering in which search-based optimization algorithms are used to identify optimal or near-optimal solutions and to yield insight. SBSE techniques can cater for multiple, possibly competing, objectives and/or constraints, and for applications where the potential solution space is large and complex. This paper provides a brief overview of SBSE, explaining some of the ways in which it has already been applied to the construction of predictive models. There is a mutually beneficial relationship between predictive models and SBSE. The paper sets out eleven open problem areas for Search Based Predictive Modeling and describes how predictive models also have a role to play in improving SBSE.

    Classification of EEG data using machine learning techniques

    Get PDF
    Automatic interpretation of readings from the brain could allow for many interesting applications, including movement of prosthetic limbs and more seamless man-machine interaction. This work studied classification of EEG signals recorded in a study of memory. The goal was to evaluate the performance of state-of-the-art algorithms. A secondary goal was to try to improve upon the result of a method used in a study similar to this one. For the experiment, the signals were transformed into the frequency domain and their magnitudes were used as features. A subset of these features was then selected and fed into a support vector machine classifier. The first part of this work tried to improve the selection of features used to discriminate between different memory categories. The second part investigated the use of time series as features instead of single time points. Two feature selection methods, genetic algorithm and correlation-based, were implemented and tested. Both of them performed worse than the baseline ANOVA method. The time series classifier also performed worse than the standard classifier. However, experiments showed that there was information to be gained by using the time series, motivating the exploration of more advanced methods. Both the results achieved in this thesis and those in other work are above chance. However, high accuracies can only be achieved at the cost of long delays and few output alternatives. This limits the information that can be extracted from the EEG sensor and its usability.
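The pipeline described above (magnitude spectrum as features, ANOVA-style ranking to select a subset) can be sketched as follows. The synthetic two-class "EEG" signals, the naive DFT, and all parameter values are illustrative stand-ins, not the thesis's data or code:

```python
import cmath
import math
import random

random.seed(1)

def dft_magnitudes(signal):
    """Magnitude spectrum via a naive DFT (stand-in for an FFT)."""
    n = len(signal)
    return [abs(sum(signal[t] * cmath.exp(-2j * math.pi * f * t / n)
                    for t in range(n)))
            for f in range(n // 2)]

def f_score(a, b):
    """One-way ANOVA-style F statistic for one feature over two classes."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    grand = (sum(a) + sum(b)) / (len(a) + len(b))
    between = len(a) * (ma - grand) ** 2 + len(b) * (mb - grand) ** 2
    within = sum((v - ma) ** 2 for v in a) + sum((v - mb) ** 2 for v in b)
    return between / (within + 1e-12)

def trial(freq_bin, n=64, noise=0.3):
    """Synthetic one-channel 'EEG' trial: a tone at freq_bin plus noise."""
    return [math.cos(2 * math.pi * freq_bin * t / n) + random.gauss(0, noise)
            for t in range(n)]

# Class 0 carries a bin-5 tone, class 1 a bin-10 tone.
class0 = [dft_magnitudes(trial(5)) for _ in range(10)]
class1 = [dft_magnitudes(trial(10)) for _ in range(10)]

# Rank frequency bins by F score and keep the two most discriminative;
# these selected features would then feed a classifier such as an SVM.
scores = [f_score([x[i] for x in class0], [x[i] for x in class1])
          for i in range(len(class0[0]))]
top = sorted(range(len(scores)), key=lambda i: -scores[i])[:2]
# The selected bins should be exactly the two tone bins, 5 and 10.
```

On this toy data the F ranking recovers the two informative frequency bins, which is the role the ANOVA baseline plays in the thesis.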

    Genetic programming and serial processing for time series classification

    Full text link
    This work describes an approach devised by the authors for time series classification. In our approach genetic programming is used in combination with a serial processing of data, where the last output is the result of the classification. The use of genetic programming for classification, although still a field where more research is needed, is not new. However, the application of genetic programming to classification tasks is normally done by considering the input data as a feature vector. That is, to the best of our knowledge, there are no examples in the genetic programming literature of approaches where the time series data are processed serially and the last output is considered as the classification result. The serial processing approach presented here fills a gap in the existing literature. This approach was tested on three different problems. Two of them are real-world problems whose data were gathered for online or conference competitions. As there are published results for these two problems, this gives us the chance to compare the performance of our approach against top-performing methods. The serial processing of data in combination with genetic programming obtained competitive results in both competitions, showing its potential for solving time series classification problems. The main advantage of our serial processing approach is that it can easily handle very large datasets.
    Alfaro Cid, E.; Sharman, KC.; Esparcia Alcázar, AI. (2014). Genetic programming and serial processing for time series classification. Evolutionary Computation. 22(2):265-285. doi:10.1162/EVCO_a_00110
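The serial-processing idea, in which a program consumes the series one sample at a time and its last output is taken as the class, can be illustrated with a hand-written candidate in place of an evolved one. The `candidate` function here is a hypothetical leaky accumulator, not an expression from the paper:

```python
# A candidate "program" maintains internal state and consumes the series
# one sample at a time; the sign of its last output gives the class.
def serial_classify(program, series):
    state, out = 0.0, 0.0
    for x in series:
        state, out = program(state, x)
    return 1 if out > 0 else 0

# Hypothetical evolved expression: a leaky accumulator of the input.
def candidate(state, x):
    new_state = 0.9 * state + x
    return new_state, new_state

rising = [0.1 * t for t in range(20)]    # trends upward  -> class 1
falling = [-0.1 * t for t in range(20)]  # trends downward -> class 0
assert serial_classify(candidate, rising) == 1
assert serial_classify(candidate, falling) == 0
```

Because the series is never materialised as a feature vector, memory use is independent of series length, which matches the stated advantage on very large datasets.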

    Differential evolution technique on weighted voting stacking ensemble method for credit card fraud detection

    Get PDF
    Differential Evolution is a stochastic, population-based optimization technique that is powerful and efficient over continuous spaces for solving differentiable and non-linear optimization problems. The weighted voting stacking ensemble method is an important technique that combines various classifier models. However, selecting appropriate weights for the classifier models so that transactions are classified correctly is a problem. This research study therefore explores whether the Differential Evolution optimization method is a good approach for defining the weighting function. Manual and random selection of weights for voting on credit card transactions has previously been carried out; however, a large number of fraudulent transactions were not detected by the classifier models, which means that a technique to overcome the weaknesses of the classifier models is required. Thus, the problem of selecting appropriate weights was treated in this study as a weight optimization problem. The dataset was downloaded from the Kaggle competition data repository. Various machine learning algorithms were used to weight-vote the class of a transaction, with the Differential Evolution optimization technique used as the weighting function. In addition, the Synthetic Minority Oversampling Technique (SMOTE) and Safe Level Synthetic Minority Oversampling Technique (SL-SMOTE) oversampling algorithms were modified to preserve the definition of SMOTE while improving performance. The results of this research study showed that the Differential Evolution optimization method is a good weighting function, which can be adopted as a systematic weight function for the weighted voting stacking ensemble of various classification methods.
    School of Computing; M. Sc. (Computing)
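A minimal DE/rand/1/bin loop over ensemble weights illustrates the idea of treating weight selection as an optimization problem. The three base-classifier score vectors and all hyperparameter values below are made-up toy values, not the study's models or data:

```python
import random

random.seed(2)

# Hypothetical fraud scores from three base classifiers for 8 transactions
# (higher = more fraud-like) and the true labels (1 = fraud). Classifier A
# is reliable, B is anti-correlated with the truth, C is mediocre.
scores = [
    [0.9, 0.2, 0.8, 0.3, 0.7, 0.2, 0.8, 0.3],   # A
    [0.1, 0.9, 0.2, 0.8, 0.3, 0.9, 0.1, 0.7],   # B
    [0.6, 0.4, 0.7, 0.6, 0.4, 0.3, 0.6, 0.4],   # C
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

def accuracy(weights):
    """Fraction of transactions the weighted vote classifies correctly."""
    correct = 0
    for j, y in enumerate(labels):
        vote = sum(w * scores[i][j] for i, w in enumerate(weights))
        correct += int((vote > 0.5 * sum(weights)) == bool(y))
    return correct / len(labels)

# Minimal DE/rand/1/bin loop over the weight vector.
NP, F, CR, DIMS, GENS = 12, 0.8, 0.9, 3, 60
pop = [[random.uniform(0.01, 1.0) for _ in range(DIMS)] for _ in range(NP)]
for _ in range(GENS):
    for i in range(NP):
        a, b, c = random.sample([p for k, p in enumerate(pop) if k != i], 3)
        cand = [max(0.01, a[d] + F * (b[d] - c[d]))
                if random.random() < CR else pop[i][d]
                for d in range(DIMS)]
        if accuracy(cand) >= accuracy(pop[i]):   # greedy DE selection
            pop[i] = cand

best = max(pop, key=accuracy)
# Equal weights misclassify several transactions here; the evolved weights
# should do at least as well, typically by shifting mass onto classifier A.
```

The greedy selection step never replaces an individual with a worse one, so the best accuracy in the population is non-decreasing over generations.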

    Analysis of microarray and next generation sequencing data for classification and biomarker discovery in relation to complex diseases

    Get PDF
    PhD thesis. This thesis presents an investigation into gene expression profiling, using microarray and next generation sequencing (NGS) datasets, in relation to multi-category diseases such as cancer. It has been established that if the sequence of a gene is mutated, it can result in the unscheduled production of protein, leading to cancer. However, identifying the molecular signature of different cancers amongst thousands of genes is complex. This thesis investigates tools that can aid the study of gene expression to infer useful information towards personalised medicine. For microarray data analysis, this study proposes two new techniques to increase the accuracy of cancer classification. In the first method, a novel optimisation algorithm, COA-GA, was developed by synchronising the Cuckoo Optimisation Algorithm and the Genetic Algorithm for data clustering in a shuffle setup, to choose the most informative genes for classification purposes. Support Vector Machine (SVM) and Multilayer Perceptron (MLP) artificial neural networks are utilised for the classification step. Results suggest this method can significantly increase classification accuracy compared to other methods. An additional method involving a two-stage gene selection process was developed. In this method, a subset of the most informative genes is first selected by the Minimum Redundancy Maximum Relevance (MRMR) method. In the second stage, optimisation algorithms are used in a wrapper setup with SVM to minimise the number of selected genes whilst maximising classification accuracy. A comparative performance assessment suggests that the proposed algorithm significantly outperforms other methods at selecting fewer genes that are highly relevant to the cancer type, while maintaining a high classification accuracy.
    In the case of NGS, a state-of-the-art pipeline for the analysis of RNA-Seq data is investigated to discover differentially expressed genes and differential exon usage between normal and AIP-positive Drosophila datasets, produced in-house at Queen Mary, University of London. The functional genomics of the differentially expressed genes was examined and found to be relevant to the case study under investigation. Finally, after normalising the RNA-Seq data, machine learning approaches similar to those used for the microarray data were successfully applied to these datasets.
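The first (filter) stage of the two-stage selection can be sketched with an MRMR-flavoured greedy criterion: relevance to the label minus average redundancy with already-picked genes. The synthetic expression matrix and all sizes below are illustrative assumptions, not the thesis's data or implementation:

```python
import random

random.seed(3)

def corr(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / ((vx * vy) ** 0.5 + 1e-12)

# Synthetic expression matrix: 40 samples x 20 genes. Only genes 0 and 1
# carry class information; the other 18 are pure noise.
labels = [i % 2 for i in range(40)]
data = [[(labels[i] if g < 2 else 0) + random.gauss(0, 0.5)
         for g in range(20)] for i in range(40)]
cols = [[row[g] for row in data] for g in range(20)]

# Filter stage: greedily pick genes with high relevance to the label and
# low redundancy with the genes already picked (MRMR-flavoured criterion).
relevance = [abs(corr(cols[g], labels)) for g in range(20)]
picked = [max(range(20), key=lambda g: relevance[g])]
while len(picked) < 5:
    def mrmr_score(g):
        redundancy = sum(abs(corr(cols[g], cols[p])) for p in picked)
        return relevance[g] - redundancy / len(picked)
    picked.append(max((g for g in range(20) if g not in picked),
                      key=mrmr_score))
# 'picked' would then feed the second, wrapper stage (e.g. an SVM whose
# accuracy guides an optimisation algorithm pruning the subset further).
```

On this toy matrix the top-ranked gene is one of the two informative ones, which is the job the MRMR filter performs before the wrapper stage refines the subset.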