6 research outputs found

    DoME: A Deterministic Technique for Equation Development and Symbolic Regression

    Funded for open-access publication: Universidade da Coruña/CISUG. [Abstract] Based on a solid mathematical background, this paper proposes a method for Symbolic Regression that enables the extraction of mathematical expressions from a dataset. Contrary to other approaches, such as Genetic Programming, the proposed method is deterministic and, consequently, does not require the creation of a population of initial solutions. Instead, a simple expression is grown until it fits the data. This method has been compared with four well-known Symbolic Regression techniques on a large number of datasets. On average, the proposed method performs better than the other techniques, with the advantage of returning mathematical expressions that can be easily used by different systems. Additionally, this method makes it possible to set a threshold on the complexity of the generated expressions, i.e., the system can return mathematical expressions that are easily analyzed by the user, as opposed to other techniques that return very large expressions. This study is partially supported by Instituto de Salud Carlos III, grant number PI17/01826 (Collaborative Project in Genomic Data Integration (CICLOGEN)), funded by the Instituto de Salud Carlos III from the Spanish National Plan for Scientific and Technical Research and Innovation 2013–2016 and the European Regional Development Funds (FEDER), “A way to build Europe”. It was also partially supported by different grants and projects from the Xunta de Galicia [ED431D 2017/23; ED431D 2017/16; ED431G/01; ED431C 2018/49; IN845D-2020/03]. The authors thank the CyTED, Spain and each National Organism for Science and Technology for funding the IBEROBDIA project (P918PTE0409). 
In this regard, Spain specifically thanks the Ministry of Economy and Competitiveness for the financial support for this project through the State Program of I+D+I Oriented to the Challenges of Society 2017–2020 (International Joint Programming 2018), project PCI2018-093284.

    Automatic synthesis of sorting algorithms by gene expression programming + (geometric) semantic gene expression programming + encouraging phenotype variation with a new semantic operator: semantic conditional crossover

    Gene Expression Programming (GEP) is an alternative to Genetic Programming (GP). Given its characteristics compared to GP, we question whether GEP should be the standard choice for evolutionary program synthesis, both as a base for research and for practical application. We ask whether such a shift could increase the rate of investigation, the applicability and the quality of results obtained from evolutionary techniques for code optimization. We present three distinct and unprecedented studies using GEP in an attempt to develop understanding, investigate its potential and advance the field. Each study makes an individual contribution involving GEP. As a whole, the three studies investigate different aspects that might be critical to answering the questions raised above. In the first contribution, we investigate GEP's applicability to automatically synthesizing sorting algorithms. Performance is compared against GP under similar experimental conditions. GEP is shown to be capable of producing sorting algorithms and outperforms GP in doing so. As a second experiment, we enhanced GEP's evolutionary process with semantic awareness of candidate programs, giving rise to Semantic Gene Expression Programming (SGEP), similarly to how Semantic Genetic Programming (SGP) builds on GP. Geometric semantic concepts are then introduced to SGEP, forming Geometric Semantic Gene Expression Programming (GSGEP). A comparative experiment between GP, GEP, SGP and SGEP is performed using different problems and setup combinations. Results were mixed when comparing SGEP and SGP, suggesting that performance is significantly related to the problem addressed. By outperforming the alternatives in many of the benchmarks, SGEP demonstrates practical potential. The results are analyzed from different perspectives, also providing insight into the potential of different crossover variations when applied with GP/GEP. 
GEP's compatibility with innovations developed for GP is shown to be possible without extensive adaptation. Considerations for the integration of SGEP are discussed. In the last contribution, a new semantic operator is proposed, Semantic Conditional Crossover (SCC), which applies crossover only when the elements are semantically different enough, and performs mutation otherwise. The strategy attempts to encourage semantic diversity and widen the portion of the semantic solution space searched. A practical experiment was performed with and without SCC integrated in the evolutionary process. When using the operator, the quality of the solutions obtained alternated between slight improvements and slight declines. The results show no relevant indication of an advantage from its use and do not confirm what was expected in theory. We discuss ways in which further work might investigate this concept and assess whether it has practical potential under different circumstances. On the other hand, with regard to the fundamental questions of this investigation, the development and testing of SCC was performed entirely on a GEP/SGEP base, suggesting that the latter can be used as the base for future research on evolutionary program synthesis.
Gene Expression Programming (GEP) is a recent alternative to Genetic Programming (GP). In this study we examine GEP and ask whether it should be treated as the first choice for the automatic synthesis of programs by evolutionary methods. Given the characteristics of GEP, we ask whether this change of perspective could increase the investigation, applicability and quality of the results obtained for code optimization by evolutionary methods. In this study we present three novel and distinct contributions using the GEP algorithm. Each contribution presents an advance or an investigation in the field of GEP. As a whole, these contributions seek to obtain knowledge and information to address the general question presented above. In the first contribution, we investigate and test GEP on the problem of the automatic synthesis of sorting algorithms. To the best of our knowledge, this is the first time this problem has been addressed with GEP. Performance is compared with that of GP under similar conditions, so as to isolate the characteristics of each algorithm as the distinguishing factor. In the second contribution, we add to GEP's evolutionary process the ability to measure the semantic value of the programs that make up the population. We call this variant Semantic Gene Expression Programming (SGEP). This variant brings to GEP the same characteristics that Semantic Genetic Programming (SGP) brought to conventional GP. Geometric concepts are also introduced for SGEP, thereby extending the variant and creating Geometric Semantic Gene Expression Programming (GSGEP). To test these new variants, we carry out an experiment comparing GP, GEP, SGP and SGEP across different problems and combinations of crossover operators. The results showed that no algorithm stood out in all experiments, suggesting that performance is significantly related to the problem being addressed. In any case, SGEP had the advantage in many of the benchmarks, indicating potential practical usefulness. Overall, this contribution demonstrates that technology developed with GP in mind can be used in GEP without great adaptation effort. At the end of the contribution, some considerations about SGEP are discussed. In the third contribution we propose a new operator, Semantic Conditional Crossover (SCC). This operator, based on the semantic distance between two proposed elements, decides whether the elements are put forward for crossover, or whether one of them is mutated and both are re-introduced into the population. This strategy aims to increase the genetic diversity of the population at crucial stages of the evolutionary process and to widen the portion of the semantic space searched. To assess the potential of this operator, we carry out a practical experiment and compare similar evolutionary processes in which the use or non-use of SCC is the distinguishing factor. The results obtained did not demonstrate advantages in using SCC and do not confirm what was expected in theory. Nevertheless, we discuss ways in which the concept can be reused in new tests where it might have the potential to show positive results. Regarding the central question of the thesis, since this study was developed on a GEP/SGEP base and since the theory of SCC is compatible with GP, it is shown that a general study in the area of algorithm synthesis by evolutionary means can be conducted on the basis of GEP.
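
The conditional behaviour of SCC described in this abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the toy individual (a pair of coefficients), the Euclidean semantic distance, the fixed `threshold`, and the `one_point_crossover` and `mutate` helpers are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical toy individual: (a, b) stands for the program a*x + b.
Program = Tuple[float, float]

# Fixed fitness cases; a program's "semantics" is its output vector on them.
INPUTS: List[float] = [0.0, 1.0, 2.0, 3.0]


def semantics(p: Program) -> List[float]:
    """Output vector of the toy program on the fitness cases."""
    a, b = p
    return [a * x + b for x in INPUTS]


def semantic_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two output vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def one_point_crossover(p1: Program, p2: Program) -> Tuple[Program, Program]:
    """Swap the second coefficient between the two parents."""
    return (p1[0], p2[1]), (p2[0], p1[1])


def mutate(p: Program, step: float = 0.5) -> Program:
    """Deterministic toy mutation: shift the first coefficient."""
    return (p[0] + step, p[1])


def scc(p1: Program, p2: Program, threshold: float) -> Tuple[Program, Program]:
    """Cross the parents only if they are semantically different enough;
    otherwise mutate one of them and return both to the population."""
    if semantic_distance(semantics(p1), semantics(p2)) >= threshold:
        return one_point_crossover(p1, p2)
    return mutate(p1), p2
```

With identical parents the distance is zero, so `scc((1.0, 0.0), (1.0, 0.0), 1.0)` mutates the first parent and returns the second unchanged; with distant parents the pair is crossed.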

    Mining Explicit and Implicit Relationships in Data Using Symbolic Regression

    Identification of implicit and explicit relations within observed data is a generic problem commonly encountered in several domains including science, engineering, finance, and more. It forms the core component of data analytics, a process of discovering useful information from data sets that are potentially huge and otherwise incomprehensible. In industry, such information is often instrumental for profitable decision making, whereas in science and engineering it is used to build empirical models, propose new theories or verify existing ones, and explain natural phenomena. In recent times, digital and internet-based technologies have proliferated, making it viable to generate and collect large amounts of data at low cost. This in turn has resulted in an ever-growing need for methods to analyse and draw interpretations from such data quickly and reliably. With this overarching goal, this thesis attempts to make contributions towards developing accurate and efficient methods for discovering such relations through evolutionary search, a method commonly referred to as Symbolic Regression (SR). A data set of input variables x and a corresponding observed response y is given. The aim is to find an explicit function y = f(x) or an implicit function f(x, y) = 0 that represents the data set. While seemingly simple, the problem is challenging for several reasons. Some conventional regression methods try to “guess” a functional form, such as linear, quadratic or polynomial, and attempt to fit the data to that equation, which may limit the possibility of discovering more complex relations, if they exist. On the other hand, there are meta-modelling techniques such as the response surface method, Kriging, etc., that model the given data accurately, but provide a “black-box” predictor instead of an expression. 
Such approximations convey little or no insight into how the variables and responses depend on each other, or into their relative contribution to the output. SR attempts to bridge these two extremes by providing a structure which evolves mathematical expressions instead of assuming them. Thus, it is flexible enough to represent the data, but at the same time provides useful insights instead of a black-box predictor. SR can be categorized as part of Explainable Artificial Intelligence and can contribute to Trustworthy Artificial Intelligence. The work proposed in this thesis aims to integrate the concept of “semantics” deeper into Genetic Programming (GP) and Evolutionary Feature Synthesis, which are the two algorithms usually employed for conducting SR. The semantics will be integrated into well-known components of the algorithms such as compactness, diversity, recombination, constant optimization, etc. The main contribution of this thesis is the proposal of two novel operators to generate expressions based on Linear Programming and Mixed Integer Programming, with the aim of controlling the length of the discovered expressions without compromising on accuracy. In the experiments, these operators are shown to discover expressions with better accuracy and interpretability on many explicit and implicit benchmarks. Moreover, some applications of SR to real-world data sets are shown to demonstrate the practicality of the proposed approaches. In addition, with regard to practical problems, the thesis presents how GP can be applied to effectively solve Resource Constrained Scheduling Problems.
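
The distinction between an explicit model y = f(x) and an implicit model f(x, y) = 0 can be made concrete with a small Python sketch. It only scores two hand-written candidate expressions on points sampled from the unit circle; it is not the method of the thesis, and the function names and the data are assumptions chosen for illustration. A practical implicit search also has to guard against trivial solutions such as f ≡ 0, which this sketch does not address.

```python
import math
from typing import Callable, List, Tuple


def circle_points(n: int = 8) -> Tuple[List[float], List[float]]:
    """Sample n points (x, y) on the unit circle."""
    angles = [2.0 * math.pi * k / n for k in range(n)]
    return [math.cos(t) for t in angles], [math.sin(t) for t in angles]


def explicit_mse(f: Callable[[float], float],
                 xs: List[float], ys: List[float]) -> float:
    """Mean squared error of an explicit candidate y = f(x)."""
    return sum((y - f(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def implicit_mse(g: Callable[[float, float], float],
                 xs: List[float], ys: List[float]) -> float:
    """Mean squared residual of an implicit candidate g(x, y) = 0."""
    return sum(g(x, y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def upper_half(x: float) -> float:
    """Explicit candidate: only the upper half of the circle."""
    return math.sqrt(max(0.0, 1.0 - x * x))


def circle(x: float, y: float) -> float:
    """Implicit candidate: x^2 + y^2 - 1 = 0."""
    return x * x + y * y - 1.0


xs, ys = circle_points()
print("explicit candidate error:", explicit_mse(upper_half, xs, ys))
print("implicit candidate error:", implicit_mse(circle, xs, ys))
```

No single-valued explicit function can cover both halves of the circle, so the explicit candidate leaves a large error on the lower half, while the implicit candidate fits every point up to rounding.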

    New Genetic Programming Methods for Rainfall Prediction and Rainfall Derivatives Pricing

    Rainfall derivatives are part of the umbrella concept of weather derivatives, whereby the underlying weather variable, in our case rainfall, determines the value of the derivative. These financial contracts are currently in their infancy, having been traded on the Chicago Mercantile Exchange (CME) only since 2011. Such contracts are very useful for investors or trading firms who wish to hedge against the direct or indirect adverse effects of rainfall. The first crucial problem addressed in this thesis is the prediction of the level of rainfall. Two techniques are routinely used for this. The first and most commonly used approach is a Markov chain extended with rainfall prediction. The second approach is the Poisson-cluster model. Both techniques have weaknesses in their predictive power for rainfall data. More specifically, a large number of the rainfall pathways obtained from these techniques are not representative of future rainfall levels. Additionally, the predictions are heavily influenced by prior information, leading to future rainfall levels being the average of previously observed values. This motivates us to develop a new algorithm for the problem domain, based on Genetic Programming (GP), to improve the prediction of the underlying rainfall variable. GP is capable of producing white-box (interpretable, as opposed to black-box) models, which allows us to probe the models produced. Moreover, we can capture nonlinear and unexpected patterns in the data without making any strict assumptions regarding the data. Daily rainfall data presents some difficulties for GP. These include the data values being non-negative and discontinuous on the real time line. Moreover, the rainfall time series exhibits high volatility and low seasonality. This makes rainfall derivatives much more challenging to deal with than other weather contracts such as temperature or wind. 
However, GP does not perform well when it is applied directly to the daily rainfall data. We thus propose a data transformation method that improves GP's predictive power. The transformation works by accumulating the daily rainfall amounts over a sliding window. To evaluate the performance, we compare the prediction accuracy obtained by GP against the approach currently most used in rainfall derivatives, and against six other machine learning algorithms. They are compared on 42 different data sets collected from different cities across the USA and Europe. We find that GP is able to predict rainfall more accurately than the approaches currently most used in the literature and comparably to other machine learning methods. However, we find that the equations generated by GP are not able to take into account the volatilities and the extreme wet and dry periods. Thus, we propose decomposing the problem of rainfall into 'sub problems' for GP to solve. We decompose the rainfall time series by creating partitions, each representing a selected range of the total rainfall amounts and each modelled by a separate equation from GP. We use a Genetic Algorithm to assist with the partitioning of the data. We find that, through the decomposition of the data, we are able to predict the underlying data better than all machine learning benchmark methods. Moreover, GP is able to provide a better representation of the extreme periods in the rainfall time series. The natural progression is to price rainfall futures contracts from the rainfall prediction. Unlike other pricing domains in the trading market, there is no generally recognised pricing framework used within the literature. Much of this is due to weather derivatives (including rainfall derivatives) existing in an incomplete market, where the existing and well-studied pricing methods cannot be directly applied. 
There are two well-known techniques for pricing: the first is indifference pricing and the second is arbitrage-free pricing. One of the requirements for pricing is knowing the level of risk or uncertainty that exists within the market. This allows for a contract price free of arbitrage. GP can be used to price derivatives, but the risk cannot be directly estimated. To estimate the risk, we must calculate a density of proposed rainfall values from a single GP equation, in order to calculate the most probable outcome. We propose three methods to achieve the required results. The first samples many different equations and extrapolates a density from the best of each generation over multiple runs. The second builds on the first by considering contract-specific equations, rather than a single equation explaining all contracts, before extrapolating a density. The third has GP evolve and create a collection of stochastic equations for pricing rainfall derivatives. We find that GP is a suitable method for pricing and both proposed methods are able to produce good pricing results. Our first and second methods are capable of pricing closer to the rainfall futures prices given by the CME. Moreover, we find that our third method reproduces the actual rainfall for the specified period of interest more accurately.
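
The sliding-window transformation mentioned in this abstract, which accumulates daily rainfall amounts before GP is applied, can be sketched in Python as follows. The window length and the function name are assumptions; the abstract does not state which window the thesis uses.

```python
from typing import List


def accumulate_sliding_window(daily: List[float], window: int) -> List[float]:
    """Return the rainfall total over each run of `window` consecutive days.

    The window length is a free parameter here (an assumption of this sketch).
    """
    if window <= 0:
        raise ValueError("window must be a positive number of days")
    if len(daily) < window:
        return []
    total = sum(daily[:window])
    totals = [total]
    # Slide the window one day at a time: add the new day, drop the oldest.
    for i in range(window, len(daily)):
        total += daily[i] - daily[i - window]
        totals.append(total)
    return totals


# Example with a hypothetical six-day series and a three-day window.
daily_mm = [0.0, 2.0, 0.0, 0.0, 5.0, 1.0]
print(accumulate_sliding_window(daily_mm, 3))  # [2.0, 2.0, 5.0, 6.0]
```

Accumulating over a window replaces many zero-rainfall days with running totals, which is one plausible reason the transformed series is easier to model than the raw daily values.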