
    Mathematically Reduced Chemical Reaction Mechanism Using Neural Networks


    Boosting in structured additive models

    Variable selection and model choice are of major concern in many statistical applications, especially in regression models for high-dimensional data. Boosting is a convenient statistical method that combines model fitting with intrinsic model selection. We investigate the impact of base-learner specification on the performance of boosting as a model selection procedure. We show that variable selection may be biased if the base-learners have different degrees of flexibility, both for categorical covariates and for smooth effects of continuous covariates. We investigate these problems from a theoretical perspective and suggest a framework for unbiased model selection based on a general class of penalized least squares base-learners. Making all base-learners comparable in terms of their degrees of freedom strongly reduces the selection bias observed with naive boosting specifications. Furthermore, the definition of degrees of freedom used in the smoothing literature is questionable in the context of boosting, and an alternative definition is theoretically derived. The importance of unbiased model selection is demonstrated in simulations and in an application to forest health models. A second aspect of this thesis is the extension of the boosting algorithm to new estimation problems: by using constrained base-learners, monotonicity-constrained effect estimates can be seamlessly incorporated into the existing boosting framework. This holds both for smooth effects and for ordinal variables. Furthermore, cyclic restrictions can be integrated into the model for smooth effects of continuous covariates; cyclic constraints play an important role in time-series models in particular. Monotonic and cyclic constraints of smooth effects can, in addition, be extended to smooth, bivariate function estimates. If the true effects are monotonic or cyclic, simulation studies show that constrained estimates are superior to unconstrained ones. Three case studies (modeling the presence of the Red Kite in Bavaria, modeling activity profiles of Roe Deer, and modeling deaths caused by air pollution in São Paulo) show that both kinds of constraints can be integrated into the boosting framework and are easy to use. All described results are included in the R add-on package mboost.
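    The degrees-of-freedom matching idea is concrete enough to sketch. Below is a minimal Python sketch, not the thesis implementation (that is the R package mboost, whose API is not reproduced here): componentwise L2-boosting in which each base-learner is a ridge-penalized least squares fit whose penalty is calibrated so the trace of its hat matrix equals a common df. All names are illustrative, and each design block is assumed to have rank larger than the target df.

        import numpy as np
        from scipy.optimize import brentq

        def hat_matrix(X, lam):
            # Ridge hat matrix S = X (X'X + lam*I)^{-1} X'.
            p = X.shape[1]
            return X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)

        def penalty_for_df(X, df):
            # Calibrate the ridge penalty so that trace(S) equals the target df.
            # Assumes 0 < df < rank(X), so the root is bracketed.
            return brentq(lambda lam: np.trace(hat_matrix(X, lam)) - df, 1e-8, 1e8)

        def componentwise_boost(blocks, y, df=4.0, nu=0.1, mstop=250):
            # blocks: one design matrix per base-learner, e.g. a B-spline basis
            # for each continuous covariate or a dummy coding for each factor.
            S = [hat_matrix(X, penalty_for_df(X, df)) for X in blocks]
            fit = np.full(len(y), y.mean())
            selected = []
            for _ in range(mstop):
                r = y - fit                             # residuals = negative gradient of L2 loss
                cand = [Sj @ r for Sj in S]             # fit every base-learner to r
                j = int(np.argmin([np.sum((r - c) ** 2) for c in cand]))
                fit += nu * cand[j]                     # small step with the best learner
                selected.append(j)
            return fit, selected

    Because every block enters with the same trace(S), no base-learner wins the selection step merely by being more flexible, which is the source of the bias the thesis describes.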

    Estimation Stability with Cross Validation (ESCV)

    Cross-validation (CV) is often used to select the regularization parameter in high-dimensional problems. However, when applied to the sparse modeling method Lasso, CV leads to models that are unstable in high dimensions and consequently not suited for reliable interpretation. In this paper, we propose a model-free criterion, ESCV, based on a new estimation stability (ES) metric and CV. ESCV finds a locally ES-optimal model smaller than the CV choice, so that it fits the data well and also enjoys the estimation stability property. We demonstrate that ESCV is an effective alternative to CV at a similar, easily parallelizable computational cost. In particular, we compare the two approaches with respect to several performance measures when applied to the Lasso on both simulated and real data sets. For the dependent predictors common in practice, our main finding is that ESCV cuts down false positive rates, often by a large margin, while sacrificing little of the true positive rate. ESCV usually outperforms CV in terms of parameter estimation while giving similar performance in terms of prediction. For the two real data sets, from neuroscience and cell biology, the models found by ESCV are less than half the size of those chosen by CV and, judged on subject knowledge, are also more plausible. We also discuss some regularization parameter alignment issues that arise in both approaches.
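    As a rough sketch of the selection rule, the snippet below computes one common form of the ES statistic, the normalized variance of fitted mean vectors across V cross-validation pseudo-solutions, and picks the most stable Lasso penalty among those at least as large as the CV choice. Reducing "locally ES-optimal" to a constrained argmin is a simplification, and all names are illustrative assumptions.

        import numpy as np
        from sklearn.linear_model import Lasso, LassoCV
        from sklearn.model_selection import KFold

        def escv_lasso(X, y, lambdas, V=5, seed=0):
            lambdas = np.asarray(lambdas, dtype=float)
            folds = list(KFold(V, shuffle=True, random_state=seed).split(X))
            es = []
            for lam in lambdas:
                # One pseudo-solution per fold, each fit on its training part only.
                yhat = np.array([
                    Lasso(alpha=lam, max_iter=10000).fit(X[tr], y[tr]).predict(X)
                    for tr, _ in folds
                ])
                ybar = yhat.mean(axis=0)
                denom = np.sum(ybar ** 2)
                # ES(lambda): variance of the fitted mean vectors, normalized.
                es.append(np.mean(np.sum((yhat - ybar) ** 2, axis=1)) / denom
                          if denom > 0 else np.inf)
            es = np.array(es)
            lam_cv = LassoCV(alphas=lambdas, cv=V).fit(X, y).alpha_
            ok = lambdas >= lam_cv   # restrict to models no larger than the CV choice
            return lambdas[ok][np.argmin(es[ok])], lam_cv

    The fold fits are independent of one another, which is what makes the procedure as easily parallelizable as CV itself.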

    New Statistical Methods for Evaluating Brain Functional Connectivity

    The human brain functions through the coordination of a complex network of billions of neurons. This network, when defined by the functions it dictates, is known as functional brain connectivity. Associating brain networks with clinical symptoms and outcomes has great potential for shaping future work in neuroimaging and clinical practice. Resting-state functional magnetic resonance imaging (rfMRI) has commonly been used to establish the functional brain network; however, understanding the links to clinical characteristics is still an ongoing research question. Existing methods for the analysis of functional brain networks, such as independent component analysis and canonical correlation analysis, have laid a good foundation for this research; yet most methods do not directly model the node-level association between connectivity and clinical characteristics, and thus provide limited ability for interpretation. To address these limitations, this dissertation develops efficient methods that identify node-level associations to answer important research questions in brain imaging studies. In the first project, we propose a joint modeling framework for estimating functional connectivity networks from rfMRI time series data and evaluating the predictability of individuals' brain connectivity patterns from their clinical characteristics. Our goal is to understand the link between clinical presentations of psychiatric disorders and functional brain connectivity at different region pairs. The modeling framework consists of two components: estimation of individual functional connectivity networks and identification of associations with clinical characteristics. We propose a procedure for jointly estimating these components via the alternating direction method of multipliers (ADMM) algorithm. The key advantage of the proposed approach lies in its ability to directly identify the brain region pairs at which functional connectivity is strongly associated with the clinical characteristics. Compared to existing methods, our framework has the flexibility to integrate machine learning methods to estimate nonlinear predictive effects of clinical characteristics. Additionally, jointly modeling the precision matrix and the predictive model provides a novel framework for accommodating the uncertainty in estimating functional connectivity. In the second project, we focus on a scalar-on-network regression problem that uses brain functional connectivity networks to predict a single clinical outcome of interest, where the regression coefficient is edge-dependent. To improve estimation efficiency, we develop a two-stage boosting algorithm that estimates the sparse edge-dependent regression coefficients by leveraging knowledge of brain functional organization. Simulations show that the proposed method has higher power to detect true signals while controlling the false discovery rate better than existing approaches. We apply the method to the analysis of rfMRI data from the Adolescent Brain Cognitive Development (ABCD) study and identify important functional connectivity sub-networks associated with general cognitive ability. In the third project, we extend the scalar-on-network regression via boosting from the second project by relaxing the homogeneity constraints within the prespecified functional connectivity networks. We adopt deep neural networks (DNN) to model the edge-dependent regression coefficients in light of edge-level and node-level features of the brain network, as well as the well-known brain functional organization. In addition, the proposed DNN-based scalar-on-network regression has the flexibility to incorporate signal patterns from other imaging modalities into the model. We develop an efficient model fitting method based on ADMM. The proposed method is evaluated and compared with existing alternatives via simulations and analysis of rfMRI and task fMRI data from the ABCD study.
    PhD thesis, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/169684/1/emorrisl_1.pd
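    The two-stage idea from the second project can be caricatured in a few lines: boost over sub-network-level summaries first, then boost over individual edges inside the selected sub-networks. This toy Python sketch is not the dissertation's algorithm (which involves a more careful formulation with false discovery control); the plain componentwise booster, the sub-network averaging, and all names are illustrative assumptions.

        import numpy as np

        def componentwise_boost(Z, y, nu=0.1, mstop=200):
            # Plain L2 componentwise boosting over the columns of Z.
            beta = np.zeros(Z.shape[1])
            r = y - y.mean()
            col_ss = (Z ** 2).sum(axis=0) + 1e-12
            for _ in range(mstop):
                b = Z.T @ r / col_ss                 # univariate LS coefficient per column
                j = int(np.argmax(b ** 2 * col_ss))  # column with the best SS reduction
                beta[j] += nu * b[j]
                r -= nu * b[j] * Z[:, j]
            return beta

        def two_stage(X, y, subnet_of):
            # X: n x E matrix of vectorized connectivity edges; subnet_of:
            # length-E labels assigning each edge to a functional sub-network.
            X = X - X.mean(axis=0)
            ids = np.unique(subnet_of)
            # Stage 1: boost over sub-network average-connectivity features.
            Z = np.column_stack([X[:, subnet_of == s].mean(axis=1) for s in ids])
            g = componentwise_boost(Z, y)
            # Stage 2: edge-level boosting restricted to the selected sub-networks.
            keep = np.isin(subnet_of, ids[g != 0])
            beta = np.zeros(X.shape[1])
            if keep.any():
                beta[keep] = componentwise_boost(X[:, keep], y)
            return beta

    The payoff of the two stages is sparsity that respects brain functional organization: an edge can only enter the model if its whole sub-network carries signal.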

    Demand Estimation with Machine Learning and Model Combination

    We survey and apply several techniques from the statistical and computer science literature to the problem of demand estimation. We derive novel asymptotic properties for several of these models. To improve out-of-sample prediction accuracy and obtain parametric rates of convergence, we propose a method of combining the underlying models via linear regression. We illustrate our method using a standard scanner panel data set to estimate promotional lift and find that our estimates are considerably more accurate in out-of-sample predictions of demand than some commonly used alternatives. While demand estimation is our motivating application, these methods are widely applicable to other microeconometric problems.
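    The combination step lends itself to a compact sketch: generate out-of-fold predictions from several base demand models, then regress the outcome on those predictions to learn combination weights (a stacking-style construction; the paper's exact estimators and asymptotic refinements are not reproduced). The base models below are illustrative stand-ins.

        import numpy as np
        from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
        from sklearn.linear_model import LinearRegression, LassoCV
        from sklearn.model_selection import cross_val_predict

        def combine_models(X, y, cv=5):
            base = [
                LassoCV(cv=cv),
                RandomForestRegressor(n_estimators=200, random_state=0),
                GradientBoostingRegressor(random_state=0),
            ]
            # Out-of-fold predictions avoid rewarding in-sample overfitting.
            P = np.column_stack([cross_val_predict(m, X, y, cv=cv) for m in base])
            combiner = LinearRegression().fit(P, y)   # linear combination weights
            fitted = [m.fit(X, y) for m in base]      # refit each model on the full sample
            def predict(Xnew):
                Pnew = np.column_stack([m.predict(Xnew) for m in fitted])
                return combiner.predict(Pnew)
            return predict, combiner.coef_

    Using cross-fitted rather than in-sample predictions when learning the weights is what keeps flexible models from dominating the combination merely because they can memorize the training data.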

    Computational Statistics and Applications

    Nature evolves mainly in a statistical way. Different strategies, formulas, and conformations are continuously confronted in natural processes. Some of them are selected, and evolution then continues with a new loop of confrontation for the next generation of phenomena and living beings. Failures are corrected without any prior program or design. The new options generated by different statistical and random scenarios lead to solutions for surviving the present conditions. This is the general panorama at every level of scrutiny of life's cycles. Over three sections, this book examines different statistical questions and techniques in the context of machine learning and clustering methods, the frailty models used in survival analysis, and other studies of statistics applied to diverse problems.

    Self-Validated Ensemble Modelling

    An important objective when performing designed experiments is to build models that predict the future performance of the system under study, e.g. predicting future yields of a bio-process used to manufacture therapeutic proteins. Because experimentation is costly, experimental designs are structured to be efficient in terms of the number of trials while providing substantial information about the behavior of the physical system. The strategy for building accurate predictive models on larger data sets is to partition the data into a training set, used to fit the model, and a validation set, used to assess prediction performance; models are selected that have the lowest prediction error on the validation set. However, designed experiments are usually small in sample size and have a fixed structure, which precludes partitioning of any kind: the entire set must be used for training. Contemporary methods use information criteria such as the AICc or BIC with model algorithms such as Forward Selection or Lasso to select candidate models. These surrogate prediction measures often produce models with poor prediction performance relative to models selected using a validation procedure such as cross-validation. This approach also uses a single fit from a model algorithm, which we show to be insufficient. We propose a novel approach that allows the original data set to function as both a training set and a validation set. We accomplish this auto-validation strategy by employing a unique fractionally re-weighted bootstrapping technique. The weighting scheme is structured to induce anti-correlation between the original set and the auto-validation copy. We randomly assign new fractional weights using the bootstrap algorithm and fit a predictive model. This procedure is iterated many times, producing a new model each time; the final model is the average of these models. We refer to this new methodology as Self-Validated Ensemble Modeling (SVEM). In this dissertation we investigate the performance of the SVEM algorithm across various scenarios: different model selection algorithms, different designs with varying sample sizes, model noise levels, and sparsity. This investigation shows that SVEM outperforms contemporary one-shot model selection approaches.
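    A minimal sketch of the fractionally re-weighted bootstrap at the heart of SVEM, under one common construction: anti-correlated exponential fractional weights w_train = -log(U) and w_valid = -log(1-U) for U ~ Uniform(0,1), a weighted Lasso path standing in for the model-selection algorithm, and averaging of the per-iteration winners. The weight construction and all names are assumptions for illustration, not the dissertation's code.

        import numpy as np
        from sklearn.linear_model import Lasso

        def svem_lasso(X, y, alphas, n_iter=200, seed=0):
            rng = np.random.default_rng(seed)
            n = len(y)
            coefs, icepts = [], []
            for _ in range(n_iter):
                u = rng.uniform(size=n).clip(1e-12, 1 - 1e-12)
                w_tr = -np.log(u)      # fractional training weights
                w_va = -np.log1p(-u)   # anti-correlated validation weights
                best, best_err = None, np.inf
                for a in alphas:
                    m = Lasso(alpha=a, max_iter=10000).fit(X, y, sample_weight=w_tr)
                    err = np.sum(w_va * (y - m.predict(X)) ** 2)  # weighted validation SSE
                    if err < best_err:
                        best, best_err = m, err
                coefs.append(best.coef_)
                icepts.append(best.intercept_)
            # The SVEM model is the average of the per-iteration selected models.
            return np.mean(coefs, axis=0), float(np.mean(icepts))

    Every observation participates in both roles on every iteration: where the training weight is large the validation weight is small and vice versa, which is how the single fixed design stands in for a held-out set.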