
26,924 research outputs found

Regression tree models for designed experiments

Author: Loh Wei-Yin
Publication venue: 'Institute of Mathematical Statistics'
Publication date: 01/01/2006

Although regression trees were originally designed for large datasets, they can profitably be used on small datasets as well, including those from replicated or unreplicated complete factorial experiments. We show that in the latter situations, regression tree models can provide simpler and more intuitive interpretations of interaction effects as differences between conditional main effects. We present simulation results to verify that the models can yield lower prediction mean squared errors than the traditional techniques. The tree models span a wide range of sophistication, from piecewise constant to piecewise simple and multiple linear, and from least squares to Poisson and logistic regression. Comment: Published at http://dx.doi.org/10.1214/074921706000000464 in the IMS Lecture Notes--Monograph Series (http://www.imstat.org/publications/lecnotes.htm) by the Institute of Mathematical Statistics (http://www.imstat.org

arXiv.org e-Print Archive

Crossref
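
The abstract above describes interaction effects as differences between conditional main effects. A minimal sketch of that arithmetic for a 2x2 factorial with factors coded -1/+1 follows. The cell means and function names are illustrative assumptions, not values or code from the paper, and no tree is fitted here.

```python
def conditional_main_effects(cell_means):
    """Return the effect of factor A at the low and at the high level of B.

    cell_means maps (a, b), with a and b in {-1, +1}, to a mean response.
    """
    effect_a_at_b_low = cell_means[(+1, -1)] - cell_means[(-1, -1)]
    effect_a_at_b_high = cell_means[(+1, +1)] - cell_means[(-1, +1)]
    return effect_a_at_b_low, effect_a_at_b_high


def interaction_effect(cell_means):
    """AB interaction as half the difference of the conditional A effects."""
    low, high = conditional_main_effects(cell_means)
    return (high - low) / 2.0


# Hypothetical cell means, chosen only to make the arithmetic easy to follow.
example_means = {(-1, -1): 20.0, (+1, -1): 30.0, (-1, +1): 40.0, (+1, +1): 70.0}
low, high = conditional_main_effects(example_means)
ab = interaction_effect(example_means)
print(low, high, ab)
```

Read this way, a nonzero interaction says that the effect of A depends on the level of B, which is the kind of statement a split on B followed by separate A effects in each branch expresses directly.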

Minimizing Bias in Biomass Allometry: Model Selection and Log‐Transformation of Data

Author: Hughes R. Flint
Litton Creighton M.
Mascaro Joseph
Schnitzer Stefan A.
Uowolo Amanda
Publication venue: e-Publications@Marquette
Publication date: 01/11/2011

Nonlinear regression is increasingly used to develop allometric equations for forest biomass estimation (as opposed to the traditional approach of log‐transformation followed by linear regression). Most statistical software packages, however, assume additive errors by default, violating a key assumption of allometric theory and possibly producing spurious models. Here, we show that such models may bias stand‐level biomass estimates by up to 100 percent in young forests, and we present an alternative nonlinear fitting approach that conforms with allometric theory.

epublications@Marquette
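
The abstract above contrasts log-transformation followed by linear regression with nonlinear fits that assume additive errors. As a rough sketch of the traditional route only, the snippet below fits a power law on the log-log scale to synthetic data with multiplicative error. The parameter values, sample size, and variable names are assumptions made for illustration, and this does not reproduce the authors' alternative nonlinear fitting approach.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" allometry: biomass = a * diameter**b (values are made up).
a_true, b_true, sigma_true = 0.1, 2.5, 0.3
diameter = rng.uniform(5.0, 50.0, size=500)

# Multiplicative (lognormal) error, i.e. additive error on the log scale.
noise = rng.normal(0.0, sigma_true, size=diameter.size)
biomass = a_true * diameter**b_true * np.exp(noise)

# Ordinary least squares on the log-log scale: log(y) = log(a) + b * log(x).
slope, intercept = np.polyfit(np.log(diameter), np.log(biomass), deg=1)
b_hat = slope
a_hat = np.exp(intercept)

# Residual variance on the log scale (two fitted parameters).
residuals = np.log(biomass) - (intercept + slope * np.log(diameter))
sigma2_hat = residuals.var(ddof=2)


def predict_mean_biomass(d):
    """Back-transformed mean prediction with the lognormal factor exp(sigma^2 / 2)."""
    return a_hat * d**b_hat * np.exp(sigma2_hat / 2.0)


print(a_hat, b_hat, sigma2_hat)
```

The factor `exp(sigma2_hat / 2)` is the usual correction for back-transforming a lognormal mean; omitting it gives the median rather than the mean on the original scale.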

Exploring helical dynamos with machine learning

Author: Nauman Farrukh
Nättilä Joonas
Publication venue: 'EDP Sciences'
Publication date: 01/01/2019

We use ensemble machine learning algorithms to study the evolution of magnetic fields in magnetohydrodynamic (MHD) turbulence that is helically forced. We perform direct numerical simulations of helically forced turbulence using mean field formalism, with electromotive force (EMF) modeled both as a linear and non-linear function of the mean magnetic field and current density. The form of the EMF is determined using regularized linear regression and random forests. We also compare various analytical models to the data using Bayesian inference with Markov Chain Monte Carlo (MCMC) sampling. Our results demonstrate that linear regression is largely successful at predicting the EMF and the use of more sophisticated algorithms (random forests, MCMC) does not lead to significant improvement in the fits. We conclude that the data we are looking at is effectively low dimensional and essentially linear. Finally, to encourage further exploration by the community, we provide all of our simulation data and analysis scripts as open source IPython notebooks. Comment: accepted by A&A, 11 pages, 6 figures, 3 tables, data + IPython notebooks: https://github.com/fnauman/ML_alpha

arXiv.org e-Print Archive

EDP Sciences OAI-PMH repository (1.2.0)

Chalmers Research
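
The abstract above mentions fitting the EMF with regularized linear regression. One common form is ridge (L2-penalized) regression, sketched below in closed form on synthetic data. The choice of ridge rather than another penalty, the two stand-in features, and the coefficient values are all assumptions for illustration; the authors' own notebooks are linked in the entry.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form ridge solution: (X^T X + alpha * I)^{-1} X^T y (no intercept)."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ y)


rng = np.random.default_rng(1)

# Two synthetic features standing in for mean magnetic field and current density.
X = rng.normal(size=(200, 2))
true_coeffs = np.array([0.8, -0.3])  # hypothetical linear response
y = X @ true_coeffs + rng.normal(0.0, 0.05, size=200)

coeffs = ridge_fit(X, y, alpha=1e-3)
shrunk = ridge_fit(X, y, alpha=1e3)  # a large penalty shrinks the coefficients
print(coeffs, shrunk)
```

When the underlying relation really is linear and low dimensional, as the abstract concludes for its data, a fit like this leaves little for a more flexible model to improve on.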

An analysis of the evolving comoving number density of galaxies in hydrodynamical simulations

Author: Griffen Brendan F.
Hernquist Lars
Leal Machado Francisco
Ma Chung-Pei
McKinnon Ryan
McKinnon Ryan Michael
Nelson Dylan
Pillepich Annalisa
Rodriguez-Gomez Vicente
Springel Volker
Torrey Paul A.
Vogelsberger Mark
Wellons Sarah
Publication venue: 'Oxford University Press (OUP)'
Publication date: 07/07/2015

The cumulative comoving number-density of galaxies as a function of stellar mass or central velocity dispersion is commonly used to link galaxy populations across different epochs. By assuming that galaxies preserve their number-density in time, one can infer the evolution of their properties, such as masses, sizes, and morphologies. However, this assumption does not hold in the presence of galaxy mergers or when rank ordering is broken owing to variable stellar growth rates. We present an analysis of the evolving comoving number density of galaxy populations found in the Illustris cosmological hydrodynamical simulation focused on the redshift range $0 \leq z \leq 3$. Our primary results are as follows: 1) The inferred average stellar mass evolution obtained via a constant comoving number density assumption is systematically biased compared to the merger tree results at the factor of $\sim 2(4)$ level when tracking galaxies from redshift $z=0$ out to redshift $z=2(3)$; 2) The median number density evolution for galaxy populations tracked forward in time is shallower than for galaxy populations tracked backward in time; 3) A similar evolution in the median number density of tracked galaxy populations is found regardless of whether number density is assigned via stellar mass, stellar velocity dispersion, or dark matter halo mass; 4) Explicit tracking reveals a large diversity in galaxies' assembly histories that cannot be captured by constant number-density analyses; 5) The significant scatter in galaxy linking methods is only marginally reduced by considering a number of additional physical and observable galaxy properties as realized in our simulation. We provide fits for the forward and backward median evolution in stellar mass and number density and discuss implications of our analysis for interpreting multi-epoch galaxy property observations. Comment: 18 pages, 11 figures, submitted to MNRAS, comments welcome

arXiv.org e-Print Archive

DSpace@MIT

Crossref

Caltech Authors
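
The abstract above tests the assumption that galaxies keep a constant cumulative comoving number density. A minimal sketch of how that assumption links populations across epochs: rank galaxies by stellar mass, convert rank to a cumulative density n(>M), and read off the mass at a fixed density in each snapshot. The catalogues, volume, and function name below are made up for illustration and are not Illustris data.

```python
import numpy as np


def mass_at_number_density(stellar_masses, volume, target_density):
    """Stellar mass M at which the cumulative density n(>=M) equals target_density.

    Uses linear interpolation between ranked galaxies; targets outside the
    sampled range are clamped to the most or least massive galaxy by np.interp.
    """
    sorted_masses = np.sort(np.asarray(stellar_masses, dtype=float))[::-1]
    cumulative_density = np.arange(1, sorted_masses.size + 1) / volume
    return np.interp(target_density, cumulative_density, sorted_masses)


# Hypothetical catalogues in the same comoving volume (arbitrary units).
volume = 1000.0
masses_early = np.array([1e11, 1e10, 1e9, 1e8])
masses_late = 2.0 * masses_early  # every galaxy doubles its mass, ranks preserved

m_early = mass_at_number_density(masses_early, volume, target_density=0.002)
m_late = mass_at_number_density(masses_late, volume, target_density=0.002)
print(m_early, m_late, m_late / m_early)
```

In this toy case rank order is preserved and there are no mergers, so the inferred growth matches the true growth. The abstract's point is that mergers and scatter in growth rates break exactly these two conditions.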

Survival ensembles by the sum of pairwise differences with application to lung cancer microarray studies

Author: Johnson Brent A.
Long Qi
Publication venue: 'Institute of Mathematical Statistics'
Publication date: 09/08/2011

Lung cancer is among the most common cancers in the United States, in terms of incidence and mortality. In 2009, it is estimated that more than 150,000 deaths will result from lung cancer alone. Genetic information is an extremely valuable data source in characterizing the personal nature of cancer. Over the past several years, investigators have conducted numerous association studies where intensive genetic data is collected on relatively few patients compared to the numbers of gene predictors, with one scientific goal being to identify genetic features associated with cancer recurrence or survival. In this note, we propose high-dimensional survival analysis through a new application of boosting, a powerful tool in machine learning. Our approach is based on an accelerated lifetime model and minimizing the sum of pairwise differences in residuals. We apply our method to a recent microarray study of lung adenocarcinoma and find that our ensemble is composed of 19 genes, while a proportional hazards (PH) ensemble is composed of nine genes, a proper subset of the 19-gene panel. In one of our simulation scenarios, we demonstrate that PH boosting in a misspecified model tends to underfit and ignore moderately-sized covariate effects, on average. Diagnostic analyses suggest that the PH assumption is not satisfied in the microarray data and may explain, in part, the discrepancy in the sets of active coefficients. Our simulation studies and comparative data analyses demonstrate how statistical learning by PH models alone is insufficient. Comment: Published at http://dx.doi.org/10.1214/10-AOAS426 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org

arXiv.org e-Print Archive

Crossref
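
The abstract above describes minimizing a sum of pairwise differences in residuals under an accelerated lifetime model. One standard loss that fits this description is the Gehan loss, sketched below. Whether the paper uses exactly this form and this 1/n^2 normalization is an assumption here, and the toy data and function name are hypothetical. The boosting procedure itself is not shown.

```python
import numpy as np


def gehan_loss(log_time, event, X, beta):
    """Gehan-type loss for a linear accelerated failure time model.

    residual e_i = log_time_i - x_i . beta
    loss = (1 / n^2) * sum_i sum_j event_i * max(e_j - e_i, 0)
    Only pairs whose first member i is an observed event (event_i = 1) count.
    """
    log_time = np.asarray(log_time, dtype=float)
    event = np.asarray(event, dtype=float)
    residuals = log_time - X @ beta
    # pairwise[i, j] = e_j - e_i
    pairwise = residuals[None, :] - residuals[:, None]
    total = np.sum(event[:, None] * np.maximum(pairwise, 0.0))
    return total / log_time.size**2


# Toy data: log survival time is exactly x, so beta = 1 fits perfectly.
X = np.array([[0.0], [1.0], [2.0]])
log_time = np.array([0.0, 1.0, 2.0])

loss_perfect = gehan_loss(log_time, [1, 1, 1], X, np.array([1.0]))
loss_null = gehan_loss(log_time, [1, 1, 1], X, np.array([0.0]))
loss_censored = gehan_loss(log_time, [0, 1, 1], X, np.array([0.0]))
print(loss_perfect, loss_null, loss_censored)
```

The loss is convex and piecewise linear in `beta`, which is one reason it is convenient as a target for gradient-style procedures such as boosting.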

Nonparametric Methods in Astronomy: Think, Regress, Observe -- Pick Any Three

Author: Jermyn Adam S.
Steinhardt Charles L.
Publication venue: 'IOP Publishing'
Publication date: 19/01/2018

Telescopes are much more expensive than astronomers, so it is essential to minimize required sample sizes by using the most data-efficient statistical methods possible. However, the most commonly used model-independent techniques for finding the relationship between two variables in astronomy are flawed. In the worst case they can lead without warning to subtly yet catastrophically wrong results, and even in the best case they require more data than necessary. Unfortunately, there is no single best technique for nonparametric regression. Instead, we provide a guide for how astronomers can choose the best method for their specific problem and provide a Python library with both wrappers for the most useful existing algorithms and implementations of two new algorithms developed here. Comment: 19 pages, PAS

arXiv.org e-Print Archive

Copenhagen University Research Information System

Caltech Authors