
    Gradient Information and Regularization for Gene Expression Programming to Develop Data-Driven Physics Closure Models

    Learning accurate numerical constants when developing algebraic models is a known challenge for evolutionary algorithms such as Gene Expression Programming (GEP). This paper introduces the concept of adaptive symbols to the GEP framework of Weatheritt and Sandberg (2016) to develop advanced physics closure models. Adaptive symbols use gradient information to learn locally optimal numerical constants during model training, for which we investigate two types of nonlinear optimization algorithms. The second contribution of this work is the implementation of two regularization techniques that incentivize implementable and interpretable closure models: we apply L2 regularization to keep the magnitudes of numerical constants small, and we devise a novel complexity metric that supports the development of low-complexity models via custom symbol complexities and multi-objective optimization. The extended framework is applied to four use cases, namely rediscovering Sutherland's viscosity law, developing laminar flame speed combustion models, and training two types of fluid dynamics turbulence models. Model prediction accuracy improves significantly for the more complex use cases, and training converges faster for the less complex ones. The two regularization methods prove essential for developing implementable closure models, and we demonstrate that the developed turbulence models substantially improve simulations over state-of-the-art models.

    Statistical methods of SNP data analysis with applications

    Various statistical methods important for genetic analysis are considered and developed. Namely, we concentrate on multifactor dimensionality reduction (MDR), logic regression, random forests, and stochastic gradient boosting. These methods and their new modifications, e.g., the MDR method with an "independent rule", are used to study the risk of complex diseases such as cardiovascular disease. The roles of certain combinations of single nucleotide polymorphisms and external risk factors are examined. The data analysis concerning ischemic heart disease and myocardial infarction was performed on the supercomputer SKIF "Chebyshev" of Lomonosov Moscow State University.
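    Of the methods listed, stochastic gradient boosting on 0/1/2-coded genotypes is straightforward to sketch. The following toy example is not the study's actual setup (the data, SNP indices, and hyperparameters are invented for illustration); it trains scikit-learn's gradient boosting with subsampling, the "stochastic" variant, on synthetic SNP data containing a two-SNP interaction:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 2000, 20
X = rng.integers(0, 3, size=(n, p))          # SNP genotypes coded 0/1/2
# Disease risk driven by an interaction of SNP 3 and SNP 7
logit = 0.9 * (X[:, 3] * X[:, 7]) - 1.5
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = GradientBoostingClassifier(n_estimators=200,
                                 subsample=0.7,   # row subsampling = "stochastic"
                                 random_state=0).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(acc)
```

Tree-based boosting can pick up such SNP-SNP interactions without them being specified in advance, which is one reason these methods are attractive for genotype data.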

    Machine Learning for Fluid Mechanics

    The field of fluid mechanics is rapidly advancing, driven by unprecedented volumes of data from field measurements, experiments, and large-scale simulations at multiple spatiotemporal scales. Machine learning offers a wealth of techniques for extracting information from data that can be translated into knowledge about the underlying fluid mechanics. Moreover, machine learning algorithms can augment domain knowledge and automate tasks related to flow control and optimization. This article presents an overview of the history, current developments, and emerging opportunities of machine learning for fluid mechanics. It outlines fundamental machine learning methodologies and discusses their uses for understanding, modeling, optimizing, and controlling fluid flows. The strengths and limitations of these methods are addressed from the perspective of scientific inquiry that considers data as an inherent part of modeling, experimentation, and simulation. Machine learning provides a powerful information-processing framework that can enrich, and possibly even transform, current lines of fluid mechanics research and industrial applications.
    Comment: To appear in the Annual Reviews of Fluid Mechanics, 202

    Generalized Clusterwise Regression for Simultaneous Estimation of Optimal Pavement Clusters and Performance Models

    The existing state-of-the-art approach of Clusterwise Regression (CR) to estimate pavement performance models (PPMs) pre-specifies explanatory variables without testing their significance and requires the number of clusters for a given data set as an input. Time-consuming 'trial and error' methods are then needed to determine the optimal number of clusters. A common objective function is the minimization of the total sum of squared errors (SSE). Given that SSE decreases monotonically as a function of the number of clusters, the number of clusters with minimum SSE is always the total number of data points. Hence, the minimization of SSE is not an appropriate objective function when seeking an optimal number of clusters. In previous studies, the PPMs were restricted to be either linear or nonlinear, irrespective of which functional form provided the best results. The existing mathematical programming formulations did not include constraints to ensure the minimum number of observations required in each cluster for statistical significance. In addition, a pavement sample could be associated with multiple performance models, so additional modeling was required to combine the results from multiple models. To address all these limitations, this research proposes a generalized CR that simultaneously 1) finds the optimal number of pavement clusters, 2) assigns pavement samples to clusters, 3) estimates the coefficients of cluster-specific explanatory variables, and 4) determines the best functional form between linear and nonlinear models. Linear and nonlinear functional forms were investigated to select the best model specification. A mixed-integer nonlinear mathematical program was formulated with the Bayesian Information Criterion (BIC) as the objective function. The advantage of using BIC is that it penalizes the inclusion of additional parameters (i.e., number of clusters and/or explanatory variables).
Hence, the optimal CR models provide a balance between goodness of fit and model complexity. In addition, the search process for the best model specification using BIC has the property of consistency: it asymptotically selects the best model with probability 1. Comprehensive solution algorithms (Simulated Annealing coupled with Ordinary Least Squares for linear models, and All Subsets Regression for nonlinear models) were implemented to solve the proposed mathematical program. The algorithms selected the best model specification for each cluster after exploring all possible combinations of potentially significant explanatory variables. Potential multicollinearity issues were investigated and addressed as required. The variables identified as significant were average daily traffic, pavement age, rut depth along the pavement, annual average precipitation and minimum temperature, road functional class, prioritization category, and the number of lanes. All these variables are considered in the literature to be among the most critical factors for pavement deterioration. In addition, the predictive capability of the estimated models was investigated. The results showed that the models were robust, free of overfitting, and produced small prediction errors. The models developed using the proposed approach provided superior explanatory power compared with those developed using the existing state-of-the-art approach of clusterwise regression. In particular, for the data set used in this research, nonlinear models provided better explanatory power than the linear models. As expected, the results illustrated that different clusters might require different explanatory variables and associated coefficients. Similarly, determining the optimal number of clusters while estimating the corresponding PPMs contributed significantly to reducing the estimation error.
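    The BIC trade-off described above can be written down directly: for a Gaussian-error regression with n observations, residual sum of squares SSE, and k estimated parameters, BIC = n ln(SSE/n) + k ln(n), and the specification with the lowest BIC wins. A minimal sketch choosing between a linear and a nonlinear (exponential) deterioration curve for one hypothetical cluster (the data and functional forms are illustrative, not the paper's):

```python
import numpy as np

def bic(y, y_hat, k):
    # Gaussian-error BIC: n*ln(SSE/n) + k*ln(n); extra parameters are penalized
    n = len(y)
    sse = float(np.sum((y - y_hat) ** 2))
    return n * np.log(sse / n) + k * np.log(n)

rng = np.random.default_rng(1)
age = rng.uniform(0.0, 20.0, 120)                            # pavement age [years]
cond = 95.0 * np.exp(-0.08 * age) + rng.normal(0, 1.5, 120)  # condition index

# Candidate 1: linear model cond = a + b*age (k = 3 incl. error variance)
b1, a1 = np.polyfit(age, cond, 1)
bic_lin = bic(cond, a1 + b1 * age, k=3)

# Candidate 2: nonlinear (exponential) model, fitted via log(cond)
b2, a2 = np.polyfit(age, np.log(cond), 1)
bic_exp = bic(cond, np.exp(a2 + b2 * age), k=3)

best = "nonlinear" if bic_exp < bic_lin else "linear"
print(bic_lin, bic_exp, best)
```

Because both candidates here use the same number of parameters, BIC reduces to comparing fit quality; the penalty term matters when, as in the paper, adding clusters or explanatory variables grows k.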

    Learning from life-logging data by hybrid HMM: a case study on active states prediction

    In this paper, we propose employing a hybrid classifier-hidden Markov model (HMM) as a supervised learning approach to recognize daily active states from sequential life-logging data collected with wearable sensors. We generate synthetic data from the real dataset to cope with noise and incompleteness during training and, in conjunction with the HMM, propose using a multiobjective genetic programming (MOGP) classifier, which we compare against support vector machines (SVM) with various kernels. We demonstrate that the system works effectively with either algorithm to recognize personal active states against medical reference values. We also show that MOGP generally yields better results than SVM without requiring an ad hoc kernel.
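    The abstract does not detail how the classifier and the HMM are coupled; a common hybrid scheme treats per-frame classifier scores as HMM emissions and smooths them with Viterbi decoding, so that transient misclassifications are overridden by the temporal model. A sketch of that scheme (the states, scores, and transition probabilities are invented for illustration):

```python
import numpy as np

def viterbi(log_emis, log_trans, log_prior):
    """Most likely state path given per-frame classifier log-scores
    (treated as emissions) and an HMM transition structure."""
    T, S = log_emis.shape
    delta = log_prior + log_emis[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans     # indexed (from_state, to_state)
        back[t] = scores.argmax(axis=0)         # best predecessor per state
        delta = scores.max(axis=0) + log_emis[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):               # backtrack
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Two activity states (0 = rest, 1 = active); the noisy classifier score
# flips frame 3, but "sticky" transitions recover the smooth sequence.
log_emis = np.log(np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3],
                            [0.4, 0.6], [0.8, 0.2], [0.2, 0.8],
                            [0.1, 0.9], [0.2, 0.8]]))
log_trans = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
log_prior = np.log(np.array([0.5, 0.5]))
print(viterbi(log_emis, log_trans, log_prior))  # [0, 0, 0, 0, 0, 1, 1, 1]
```

Frame 3's isolated "active" score is smoothed away, while the sustained change from frame 5 onward is kept, which is exactly the behavior wanted for active-state sequences.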

    Data driven theory for knowledge discovery in the exact sciences with applications to thermonuclear fusion

    In recent years, the techniques of the exact sciences have been applied to the analysis of increasingly complex and nonlinear systems. The related uncertainties and the large amounts of data available have progressively exposed the limits of traditional hypothesis-driven methods based on first-principle theories. Therefore, a new approach of data-driven theory formulation has been developed. It is based on the manipulation of symbols with genetic computing and is meant to complement traditional procedures by exploring large datasets to find the most suitable mathematical models to interpret them. The paper reports on the vast number of numerical tests that have shown the potential of the new techniques to provide very useful insights in various studies, ranging from the formulation of scaling laws to the identification of the most appropriate dimensionless variables with which to investigate a given system. The application to some of the most complex experiments in physics, in particular thermonuclear plasmas, has proved the capability of the methodology to address real problems, even highly nonlinear and practically important ones such as catastrophic instabilities. The proposed tools are therefore being increasingly used in various fields of science, and they constitute a very good set of techniques for bridging the gap between experiments, traditional data analysis, and theory formulation.

    A study of generalization in regression: proposal of a new metric and loss function to better understand and improve generability

    Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics.
    Intuitively, generalization in machine learning can be understood as a model's ability to apply its trained or acquired knowledge to a previously unseen scenario. In recent years there has been exponential growth in both the efficiency and accuracy of machine learning models, yet current research still struggles to understand and trust how well models can perform on previously unseen data. For this thesis we propose a study of machine learning's theoretical background to further expand the notion of generalization and its limitations, enabling us to derive its commonly accepted approximation; these definitions are then used to present a new generalization metric, or score, that is more consistent in detecting, and providing an understanding of, the occurrence of generalization. Additionally, a new loss function is presented in order to mitigate the generalization error inherent in a noisy sample; extensive tests suggest that this loss function has a significantly higher rate of convergence while producing statistically similar or even better results when compared with classical loss functions.

    Grammatical evolution-based ensembles for algorithmic trading

    The literature on trading algorithms based on Grammatical Evolution commonly presents solutions that rely on static approaches. Given the prevalence of structural change in financial time series, this implies that the rules might have to be updated at predefined time intervals. We introduce an alternative solution based on an ensemble of models that are trained using a sliding window. The structure of the ensemble combines the flexibility required to adapt to structural changes with the need to control the excessive transaction costs associated with over-trading. The performance of the algorithm is benchmarked against five comparable strategies: the traditional static approach, the generation of trading rules that are used for a single time period and subsequently discarded, and three alternatives based on ensembles with different voting schemes. The experimental results, based on market data, show that the suggested approach is very competitive against comparable solutions and highlight the importance of containing transaction costs.
    The authors would like to acknowledge the financial support of the Spanish Ministry of Science, Innovation and Universities under project PGC2018-096849-B-I00 (MCFin).
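    The voting schemes themselves are not specified in the abstract; one way an ensemble can contain over-trading is to require a super-majority of members to agree before taking a position, so weak disagreement maps to staying out of the market rather than flip-flopping. A hypothetical sketch of such a combiner (the threshold and vote encoding are assumptions, not the paper's design):

```python
import numpy as np

def ensemble_signal(signals, threshold=0.6):
    """Combine member buy/sell votes (+1/-1) into one position per step.
    Acting only on a super-majority curbs over-trading: weak disagreement
    maps to 0 (stay out) instead of flipping between long and short."""
    votes = np.asarray(signals, dtype=float)   # rows: members, cols: time
    frac = votes.mean(axis=0)                  # net vote in [-1, 1]
    out = np.zeros_like(frac)
    out[frac >= 2 * threshold - 1] = 1         # >= 60% of members say buy
    out[frac <= 1 - 2 * threshold] = -1        # >= 60% of members say sell
    return out

# Five ensemble members voting over three time steps
signals = [[1, 1, -1], [1, -1, -1], [1, 1, 1], [-1, -1, -1], [1, 1, -1]]
print(ensemble_signal(signals))
```

In a sliding-window setting, the member rules would be re-evolved on the most recent window before each voting round.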

    Kernel alignment for identifying objective criteria from brain MEG recordings in schizophrenia

    The current wide access to data from different neuroimaging techniques has made it possible to explore whether objective criteria can be found for diagnostic purposes. In order to decide which features of the data are relevant to the diagnostic task, we present in this paper a simple method for feature selection based on kernel alignment with the ideal kernel in support vector machines (SVM). The method shows state-of-the-art performance while being more efficient than other methods for feature selection in SVM. It is also less prone to overfitting, owing to the properties of the alignment measure. These abilities are essential in neuroimaging studies, where the number of features representing recordings is usually very large compared with the number of recordings. The method has been applied to a dataset in order to determine objective criteria for the diagnosis of schizophrenia. The dataset was obtained from multichannel magnetoencephalogram (MEG) recordings made during the performance of a mismatch negativity (MMN) auditory task by a set of schizophrenia patients and a control group. All signal frequency bands are analyzed (from δ (1-4 Hz) to high-frequency γ (60-200 Hz)), and the signal correlations among the different sensors at these frequencies are used as features.
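    Kernel-target alignment compares a kernel matrix against the ideal kernel yy^T built from the labels: A(K1, K2) = <K1, K2>_F / sqrt(<K1, K1>_F <K2, K2>_F), where <.,.>_F is the Frobenius inner product. A minimal sketch of ranking features by the alignment of their single-feature linear kernels with the ideal kernel (the synthetic data and per-feature linear kernel are illustrative choices, not the paper's MEG setup):

```python
import numpy as np

def alignment(K1, K2):
    # Frobenius inner-product alignment between two kernel matrices
    return np.sum(K1 * K2) / np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))

rng = np.random.default_rng(0)
n, p = 80, 6
y = rng.choice([-1.0, 1.0], size=n)   # class labels
X = rng.normal(size=(n, p))
X[:, 2] += 1.2 * y                    # only feature 2 carries class information

K_ideal = np.outer(y, y)              # ideal kernel: +1 within class, -1 across
scores = [alignment(np.outer(X[:, j], X[:, j]), K_ideal) for j in range(p)]
best = int(np.argmax(scores))
print(scores, best)
```

Because the alignment is computed from the kernel matrix alone, no SVM has to be trained per candidate feature, which is the efficiency advantage the abstract refers to.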