    Automating biomedical data science through tree-based pipeline optimization

    Over the past decade, data science and machine learning have grown from a mysterious art form to a staple tool across a variety of fields in academia, business, and government. In this paper, we introduce the concept of tree-based pipeline optimization for automating one of the most tedious parts of machine learning: pipeline design. We implement a Tree-based Pipeline Optimization Tool (TPOT) and demonstrate its effectiveness on a series of simulated and real-world genetic data sets. In particular, we show that TPOT can build machine learning pipelines that achieve competitive classification accuracy and discover novel pipeline operators, such as synthetic feature constructors, that significantly improve classification accuracy on these data sets. We also highlight the current challenges to pipeline optimization, such as the tendency to produce pipelines that overfit the data, and suggest future research paths to overcome these challenges. As such, this work represents an early step toward fully automating machine learning pipeline design.

    Comment: 16 pages, 5 figures, to appear in EvoBIO 2016 proceedings

    Applications of Artificial Intelligence in Power Systems

    Artificial intelligence tools, which are fast, robust, and adaptive, can overcome the drawbacks of traditional solutions for several power system problems. In this work, applications of AI techniques are studied for solving two important problems in power systems. The first problem is static security evaluation (SSE). The objective of SSE is to identify contingencies in the planning and operation of power systems. Conventional numerical solutions are time-consuming and computationally expensive, and they are not suitable for online applications. SSE may be treated as a binary classification, multi-class classification, or regression problem. In this work, a multi-support vector machine (multi-SVM) is combined with several evolutionary computation algorithms, including particle swarm optimization (PSO), differential evolution, ant colony optimization for the continuous domain, and harmony search, to solve the SSE. Moreover, support vector regression is combined with a modified PSO, in which a modification of the inertia weight is proposed, to solve the SSE. The classification accuracy, the speed of training, and the final cost of using power equipment all depend heavily on the selected input features. In this dissertation, multi-objective PSO is used to solve this feature selection problem. Furthermore, a multi-classifier voting scheme is proposed to produce the final test output. The classifiers participating in the voting scheme include multi-SVMs with different types of kernels and random forests with an adaptive number of trees. In short, the development and performance of different machine learning tools combined with evolutionary computation techniques are studied to solve the online SSE. The performance of the proposed techniques is tested on several benchmark systems, namely the IEEE 9-bus, 14-bus, 39-bus, 57-bus, 118-bus, and 300-bus power systems. The second problem is the non-convex, nonlinear, and non-differentiable economic dispatch (ED) problem.
The purpose of solving the ED is to improve the cost-effectiveness of power generation. To solve ED with multi-fuel options, prohibited operating zones, the valve-point effect, and transmission line losses, genetic algorithm (GA) variants such as breeder GA, fast navigating GA, twin removal GA, kite GA, and United GA are used. IEEE systems with 6, 10, and 15 units are used to study the efficiency of the algorithms.
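
The abstract does not say how the inertia weight is modified, so the following is only a minimal sketch of a standard PSO with the widely used linearly decreasing inertia weight, not the dissertation's method. The function name `pso_minimize`, the default coefficients, and the toy objective in the usage note are all assumptions made for illustration.

```python
import random


def pso_minimize(objective, bounds, n_particles=20, n_iters=100,
                 w_start=0.9, w_end=0.4, c1=2.0, c2=2.0, seed=0):
    """Minimize `objective` over box `bounds` with a basic global-best PSO.

    The inertia weight w decreases linearly from w_start to w_end
    (an assumed, textbook schedule; not the modification in the thesis).
    """
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]               # personal best positions
    best_val = [objective(p) for p in pos]       # personal best values
    g = min(range(n_particles), key=best_val.__getitem__)
    gbest_pos, gbest_val = best_pos[g][:], best_val[g]

    for it in range(n_iters):
        # Large w early favors exploration, small w late favors refinement.
        w = w_start - (w_start - w_end) * it / max(1, n_iters - 1)
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (gbest_pos[d] - pos[i][d]))
                # Clamp the new position to the search box.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest_pos, gbest_val = pos[i][:], val
    return gbest_pos, gbest_val
```

A call such as `pso_minimize(lambda x: sum(v * v for v in x), [(-5.0, 5.0)] * 2)` returns the best position found and its objective value. In a setting like the one described above, the objective would instead be a model's validation error as a function of its hyperparameters.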

    Parallel optimization algorithms for high performance computing: application to thermal systems

    The need for optimization is present in every field of engineering. Moreover, applications that require a multidisciplinary approach in order to make a step forward are increasing. This leads to the need to solve complex optimization problems that exceed the capacity of the human brain or intuition. A standard way of proceeding is to use evolutionary algorithms, among which genetic algorithms hold a prominent place. These are characterized by their robustness and versatility, as well as by their high computational cost and low convergence speed. Many optimization packages are available under free software licenses and are representative of the current state of the art in optimization technology. However, the ability of optimization algorithms to adapt to massively parallel computers while reaching satisfactory efficiency levels is still an open issue. Even packages suited for multilevel parallelism encounter difficulties when dealing with objective functions involving long and variable simulation times. This variability is common in Computational Fluid Dynamics and Heat Transfer (CFD & HT), nonlinear mechanics, etc., and is nowadays a dominant concern for large-scale applications. Current research on improving the performance of evolutionary algorithms is mainly focused on developing new search algorithms. Nevertheless, a large body of well-performing sequential algorithms suitable for implementation on parallel computers already exists. The gap to be covered is efficient parallelization. Moreover, advances in the research of both new search algorithms and efficient parallelization are additive, so the enhancement of current state-of-the-art optimization software can be accelerated if both fronts are tackled simultaneously.
The motivation of this Doctoral Thesis is to make a step forward towards the successful integration of Optimization and High Performance Computing capabilities, which has the potential to boost technological development by providing better designs, shortening product development times, and minimizing the required resources. After a thorough study of the state of the art in the mathematical optimization techniques available to date, a generic mathematical optimization tool has been developed, with a special focus on the application of the library to the field of Computational Fluid Dynamics and Heat Transfer (CFD & HT). Then the main shortcomings of the standard parallelization strategies available for genetic algorithms and similar population-based optimization methods have been analyzed. Computational load imbalance has been identified as the key cause of the degradation of the optimization algorithm's scalability (i.e., parallel efficiency) when the average makespan of the batch of individuals is greater than the average time required by the optimizer to perform inter-processor communications. It occurs because processors are often unable to finish the evaluation of their queue of individuals simultaneously and need to be synchronized before the next batch of individuals is created. Consequently, the computational load imbalance translates into idle time in some processors. Several load balancing algorithms have been proposed and exhaustively tested; they can be extended to any other population-based optimization method that needs to synchronize all processors after the evaluation of each batch of individuals.
Finally, a real-world engineering application that consists of optimizing the refrigeration system of a power electronic device has been presented as an illustrative example in which the proposed load balancing algorithms reduce the simulation time required by the optimization tool.
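
The load imbalance described in this abstract can be illustrated with a small sketch. It compares a static round-robin assignment of individuals to processors with a greedy longest-processing-time-first assignment. This is a classic scheduling heuristic used here only as an assumption-laden illustration: the thesis's own load balancing algorithms are not specified in the abstract, and the sketch assumes that evaluation times are known (or well estimated) in advance, which is generally not the case for CFD simulations. All function names are hypothetical.

```python
import heapq


def makespan_and_idle(queues):
    """Return (makespan, total idle time) for per-processor queues of times.

    All processors synchronize at the end of the batch, so each one idles
    for the difference between the slowest processor's load and its own.
    """
    loads = [sum(q) for q in queues]
    makespan = max(loads)
    idle = sum(makespan - load for load in loads)
    return makespan, idle


def round_robin(times, n_procs):
    """Static assignment: individual i goes to processor i mod n_procs."""
    queues = [[] for _ in range(n_procs)]
    for i, t in enumerate(times):
        queues[i % n_procs].append(t)
    return queues


def longest_first(times, n_procs):
    """Greedy LPT: give the longest remaining job to the least loaded processor."""
    heap = [(0.0, p) for p in range(n_procs)]  # (current load, processor id)
    queues = [[] for _ in range(n_procs)]
    for t in sorted(times, reverse=True):
        load, p = heapq.heappop(heap)
        queues[p].append(t)
        heapq.heappush(heap, (load + t, p))
    return queues


# Hypothetical evaluation times for one batch of eight individuals.
batch = [9, 1, 8, 2, 7, 3, 6, 4]
print(makespan_and_idle(round_robin(batch, 2)))    # (30, 20)
print(makespan_and_idle(longest_first(batch, 2)))  # (20, 0)
```

With these made-up times, round-robin leaves one processor idle for 20 time units at the synchronization point, while the greedy assignment finishes both queues at the same time.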

    Advances in Evolutionary Algorithms

    With the recent trends towards massive data sets and significant computational power, combined with evolutionary algorithmic advances, evolutionary computation is becoming much more relevant to practice. The aim of this book is to present recent improvements, innovative ideas, and concepts in a part of the huge EA field.

    evtree: Evolutionary Learning of Globally Optimal Classification and Regression Trees in R

    Commonly used classification and regression tree methods like the CART algorithm are recursive partitioning methods that build the model in a forward stepwise search. Although this approach is known to be an efficient heuristic, the results of recursive tree methods are only locally optimal, as splits are chosen to maximize homogeneity at the next step only. An alternative way to search over the parameter space of trees is to use global optimization methods like evolutionary algorithms. This paper describes the "evtree" package, which implements an evolutionary algorithm for learning globally optimal classification and regression trees in R. Computationally intensive tasks are fully computed in C++ while the "partykit" (Hothorn and Zeileis 2011) package is leveraged for representing the resulting trees in R, providing unified infrastructure for summaries, visualizations, and predictions. "evtree" is compared to "rpart" (Therneau and Atkinson 1997), the open-source CART implementation, and conditional inference trees ("ctree", Hothorn, Hornik, and Zeileis 2006). The usefulness of "evtree" is illustrated in a textbook customer classification task and a benchmark study of predictive accuracy in which "evtree" achieved at least similar and most of the time better results compared to the recursive algorithms "rpart" and "ctree".

    Keywords: machine learning, classification trees, regression trees, evolutionary algorithms, R

    Optimisation of multiplier-less FIR filter design techniques

    This thesis is concerned with the design of multiplier-less (ML) finite impulse response (FIR) digital filters. The use of multiplier-less digital filters results in simplified filtering structures, better throughput rates, and higher speed. These characteristics are very desirable in many DSP systems. This thesis concentrates on the design of digital filters with power-of-two coefficients that result in simplified filtering structures. Two distinct classes of ML FIR filter design algorithms are developed and compared with traditional techniques. The first class is based on the sensitivity of filter coefficients to rounding to a power of two. Novel elements include extending the algorithm to multiband filters and introducing mean square error as the sensitivity criterion. This improves the performance of the algorithm and reduces the complexity of the resulting filtering structures. The second class of filter design algorithms is based on evolutionary techniques, primarily genetic algorithms. Three different algorithms based on a genetic algorithm kernel are developed: a simple genetic algorithm, a knowledge-based genetic algorithm, and a hybrid of a genetic algorithm and simulated annealing. Inclusion of the additional knowledge has been found very useful when re-designing filters or refining previous designs. Hybrid techniques are useful when exploring large, N-dimensional search spaces. Here, the genetic algorithm is used to explore the search space rapidly, followed by a fine search using simulated annealing. This approach has been found beneficial for the design of high-order filters. Finally, a formula for estimating the filter length from its specification, which complements both classes of design algorithms, has been evolved using symbolic regression and genetic programming.
Although the evolved formula is very complex and not easily understandable, statistical analysis has shown that it produces more accurate results than the traditional Kaiser formula. In summary, several novel algorithms for the design of multiplier-less digital filters have been developed. They outperform the traditional techniques used for the design of ML FIR filters and hence contribute to the knowledge in the field of ML FIR filter design.
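
Two ingredients mentioned in this abstract can be sketched briefly. The first function rounds a coefficient to the nearest single signed power of two within an assumed exponent range. This is only the simplest form of a power-of-two coefficient, and the thesis's actual coefficient set and sensitivity-based ordering are not given here. The second function evaluates Kaiser's formula in its commonly quoted form, N ≈ (−20·log10(√(δp·δs)) − 13) / (14.6·Δf), where δp and δs are the passband and stopband ripples and Δf is the transition width as a fraction of the sampling frequency. The function names and default exponent limits are assumptions.

```python
import math


def nearest_power_of_two(c, min_exp=-8, max_exp=0):
    """Round c to the nearest value in {0, ±2**k : min_exp <= k <= max_exp}.

    The exponent range is an assumed word-length limit; coefficients too
    small to be represented are allowed to round to zero.
    """
    if c == 0.0:
        return 0.0
    mag = abs(c)
    lo = math.floor(math.log2(mag))
    lo = min(max(lo, min_exp), max_exp)   # clamp to the allowed exponents
    hi = min(lo + 1, max_exp)
    candidates = (0.0, 2.0 ** lo, 2.0 ** hi)
    best = min(candidates, key=lambda p: abs(p - mag))
    return math.copysign(best, c)


def kaiser_order_estimate(delta_p, delta_s, transition_width):
    """Kaiser's estimate of the FIR filter order for a lowpass specification.

    transition_width is (stopband edge - passband edge) / sampling frequency.
    """
    attenuation_db = -20.0 * math.log10(math.sqrt(delta_p * delta_s))
    return (attenuation_db - 13.0) / (14.6 * transition_width)


# Hypothetical coefficients and specification, for illustration only.
print([nearest_power_of_two(c) for c in (0.3, -0.4, 0.001)])
print(kaiser_order_estimate(0.01, 0.001, 0.05))
```

The estimate is real-valued and is usually rounded up. Under the common convention that an order-N FIR filter has N + 1 taps, the corresponding length estimate is one larger.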

    Hybrid Advanced Optimization Methods with Evolutionary Computation Techniques in Energy Forecasting

    More accurate and precise energy demand forecasts are required when energy decisions are made in a competitive environment. Particularly in the Big Data era, forecasting models are always based on a complex combination of functions, and energy data are always complicated, exhibiting seasonality, cyclicity, fluctuation, dynamic nonlinearity, and so on. When these forecasting models lack the ability to determine data characteristics and patterns, the result is an over-reliance on informal judgment and higher expenses. The hybridization of optimization methods and superior evolutionary algorithms can provide important improvements via good parameter determination in the optimization process, which is of great assistance to the actions taken by energy decision-makers. This book aimed to attract researchers with an interest in the research areas described above. Specifically, it sought contributions to the development of any hybrid optimization methods (e.g., quadratic programming techniques, chaotic mapping, fuzzy inference theory, quantum computing, etc.) with advanced algorithms (e.g., genetic algorithms, ant colony optimization, particle swarm optimization, etc.) that have superior capabilities over traditional optimization approaches and overcome some of their embedded drawbacks, and to the application of these advanced hybrid approaches to significantly improve forecasting accuracy.
