Regression tree models for designed experiments
Although regression trees were originally designed for large datasets, they
can profitably be used on small datasets as well, including those from
replicated or unreplicated complete factorial experiments. We show that in the
latter situations, regression tree models can provide simpler and more
intuitive interpretations of interaction effects as differences between
conditional main effects. We present simulation results to verify that the
models can yield lower prediction mean squared errors than the traditional
techniques. The tree models span a wide range of sophistication, from piecewise
constant to piecewise simple and multiple linear, and from least squares to
Poisson and logistic regression.
Comment: Published at http://dx.doi.org/10.1214/074921706000000464 in the IMS
Lecture Notes--Monograph Series
(http://www.imstat.org/publications/lecnotes.htm) by the Institute of
Mathematical Statistics (http://www.imstat.org)
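The idea of reading an interaction as a difference between conditional main effects can be sketched for a 2x2 factorial. The cell means below are invented for illustration, and the halving follows the usual two-level factorial convention; nothing here is taken from the paper itself.

```python
# Hypothetical 2x2 factorial: mean response at each (A, B) setting, coded -1/+1.
# The numbers are made up purely for illustration.
cell_means = {(-1, -1): 10.0, (1, -1): 14.0, (-1, 1): 11.0, (1, 1): 21.0}


def conditional_main_effect_of_a(b_level):
    """Change in response when A moves from -1 to +1 with B held at b_level."""
    return cell_means[(1, b_level)] - cell_means[(-1, b_level)]


effect_a_at_b_low = conditional_main_effect_of_a(-1)   # 14 - 10
effect_a_at_b_high = conditional_main_effect_of_a(1)   # 21 - 11

# The main effect of A is the average of the two conditional effects;
# the AB interaction is half their difference.
main_effect_a = (effect_a_at_b_high + effect_a_at_b_low) / 2
interaction_ab = (effect_a_at_b_high - effect_a_at_b_low) / 2

print(effect_a_at_b_low, effect_a_at_b_high, main_effect_a, interaction_ab)
```

A tree that first splits on B and then on A within each branch reports the two conditional effects directly, which is one way to read the "simpler interpretation" claim in the abstract.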
Minimizing Bias in Biomass Allometry: Model Selection and Log-Transformation of Data
Nonlinear regression is increasingly used to develop allometric equations for forest biomass estimation (i.e., as opposed to the traditional approach of log-transformation followed by linear regression). Most statistical software packages, however, assume additive errors by default, violating a key assumption of allometric theory and possibly producing spurious models. Here, we show that such models may bias stand-level biomass estimates by up to 100 percent in young forests, and we present an alternative nonlinear fitting approach that conforms with allometric theory.
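For reference, the traditional approach named in the abstract (log-transformation followed by linear regression) can be sketched as below. It assumes a power law M = a * D^b with multiplicative error, so that log M is linear in log D. The function name, the variable names, and the data are hypothetical, and this is not the alternative nonlinear approach the paper proposes.

```python
import math


def fit_loglog(diameters, biomasses):
    """Ordinary least squares on (log D, log M); returns (a, b) for M = a * D**b."""
    log_d = [math.log(d) for d in diameters]
    log_m = [math.log(m) for m in biomasses]
    n = len(log_d)
    mean_d = sum(log_d) / n
    mean_m = sum(log_m) / n
    sxx = sum((u - mean_d) ** 2 for u in log_d)
    sxy = sum((u - mean_d) * (v - mean_m) for u, v in zip(log_d, log_m))
    b = sxy / sxx                  # slope = allometric exponent
    log_a = mean_m - b * mean_d    # intercept on the log scale
    return math.exp(log_a), b


# Noise-free illustrative data generated from a = 0.1, b = 2.5.
example_diameters = [5.0, 10.0, 20.0, 40.0]
example_biomasses = [0.1 * d ** 2.5 for d in example_diameters]
a_hat, b_hat = fit_loglog(example_diameters, example_biomasses)
print(a_hat, b_hat)
```

Fitting on the log scale treats errors as multiplicative on the original scale, whereas a default nonlinear least-squares fit on the raw data treats them as additive; that difference is the one the abstract is concerned with.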
Exploring helical dynamos with machine learning
We use ensemble machine learning algorithms to study the evolution of
magnetic fields in magnetohydrodynamic (MHD) turbulence that is helically
forced. We perform direct numerical simulations of helically forced turbulence
using mean field formalism, with electromotive force (EMF) modeled as both a
linear and a non-linear function of the mean magnetic field and current density.
The form of the EMF is determined using regularized linear regression and
random forests. We also compare various analytical models to the data using
Bayesian inference with Markov Chain Monte Carlo (MCMC) sampling. Our results
demonstrate that linear regression is largely successful at predicting the EMF
and the use of more sophisticated algorithms (random forests, MCMC) does not lead
to significant improvement in the fits. We conclude that the data we are
looking at is effectively low dimensional and essentially linear. Finally, to
encourage further exploration by the community, we provide all of our
simulation data and analysis scripts as open source IPython notebooks.
Comment: accepted by A&A, 11 pages, 6 figures, 3 tables, data + IPython
notebooks: https://github.com/fnauman/ML_alpha
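As a rough illustration of regularized linear regression in this setting, the sketch below fits an EMF value as a linear combination of a mean-field value and a current-density value using ridge regression. The two-coefficient model, the function name, and the data are assumptions made for illustration; the authors' own scripts are in the repository linked above and may differ.

```python
def ridge_two_features(field, current, emf, lam=0.0):
    """Ridge fit of emf ~ c_field * field + c_current * current (no intercept).

    Solves the 2x2 system (X^T X + lam * I) w = X^T y by Cramer's rule.
    """
    s_ff = sum(u * u for u in field) + lam
    s_cc = sum(v * v for v in current) + lam
    s_fc = sum(u * v for u, v in zip(field, current))
    s_fy = sum(u * e for u, e in zip(field, emf))
    s_cy = sum(v * e for v, e in zip(current, emf))
    det = s_ff * s_cc - s_fc * s_fc
    c_field = (s_fy * s_cc - s_fc * s_cy) / det
    c_current = (s_ff * s_cy - s_fc * s_fy) / det
    return c_field, c_current


# Synthetic, noise-free data built from known coefficients (0.3, -0.1).
field_values = [1.0, 2.0, 3.0, 4.0]
current_values = [0.5, -1.0, 2.0, 0.0]
emf_values = [0.3 * b - 0.1 * j for b, j in zip(field_values, current_values)]

unregularized = ridge_two_features(field_values, current_values, emf_values)
regularized = ridge_two_features(field_values, current_values, emf_values, lam=10.0)
print(unregularized, regularized)
```

With lam set to zero this reduces to ordinary least squares and recovers the generating coefficients; a positive lam shrinks them toward zero, which is the regularization trade-off.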
An analysis of the evolving comoving number density of galaxies in hydrodynamical simulations
The cumulative comoving number-density of galaxies as a function of stellar
mass or central velocity dispersion is commonly used to link galaxy populations
across different epochs. By assuming that galaxies preserve their
number-density in time, one can infer the evolution of their properties, such
as masses, sizes, and morphologies. However, this assumption does not hold in
the presence of galaxy mergers or when rank ordering is broken owing to
variable stellar growth rates. We present an analysis of the evolving comoving
number density of galaxy populations found in the Illustris cosmological
hydrodynamical simulation focused on the redshift range . Our
primary results are as follows: 1) The inferred average stellar mass evolution
obtained via a constant comoving number density assumption is systematically
biased compared to the merger tree results at the factor of 2(4) level
when tracking galaxies from redshift out to redshift ; 2) The
median number density evolution for galaxy populations tracked forward in time
is shallower than for galaxy populations tracked backward in time; 3) A similar
evolution in the median number density of tracked galaxy populations is found
regardless of whether number density is assigned via stellar mass, stellar
velocity dispersion, or dark matter halo mass; 4) Explicit tracking reveals a
large diversity in galaxies' assembly histories that cannot be captured by
constant number-density analyses; 5) The significant scatter in galaxy linking
methods is only marginally reduced by considering a number of additional
physical and observable galaxy properties as realized in our simulation. We
provide fits for the forward and backward median evolution in stellar mass and
number density and discuss implications of our analysis for interpreting
multi-epoch galaxy property observations.
Comment: 18 pages, 11 figures, submitted to MNRAS, comments welcome
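The constant cumulative number density assumption that the abstract tests can be sketched as follows: compute the number density of galaxies at or above a given mass at one epoch, then read off the mass with the same number density at another epoch. The catalogs, the volume, and the function names are hypothetical; this is the simple linking method, not the merger-tree tracking used as the reference in the paper.

```python
def cumulative_number_density(masses, volume, threshold):
    """Number of galaxies with mass >= threshold, per unit comoving volume."""
    return sum(1 for m in masses if m >= threshold) / volume


def mass_at_number_density(masses, volume, density):
    """Mass of the galaxy whose rank (by descending mass) matches the density."""
    ranked = sorted(masses, reverse=True)
    rank = int(round(density * volume))        # galaxies at or above that mass
    rank = max(1, min(rank, len(ranked)))      # clamp to the catalog size
    return ranked[rank - 1]


# Toy catalogs of log10 stellar mass in the same comoving volume (made up).
volume = 1000.0
early_epoch = [10.0, 10.4, 10.9, 11.2]
late_epoch = [10.3, 10.8, 11.1, 11.5, 10.6]

density = cumulative_number_density(early_epoch, volume, 10.9)
inferred_descendant_mass = mass_at_number_density(late_epoch, volume, density)
print(density, inferred_descendant_mass)
```

Mergers and scatter in growth rates change a galaxy's rank between epochs, which is why the abstract reports that this kind of matching is biased relative to explicit tracking.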
Survival ensembles by the sum of pairwise differences with application to lung cancer microarray studies
Lung cancer is among the most common cancers in the United States, in terms
of incidence and mortality. In 2009, it is estimated that more than 150,000
deaths will result from lung cancer alone. Genetic information is an extremely
valuable data source in characterizing the personal nature of cancer. Over the
past several years, investigators have conducted numerous association studies
where intensive genetic data is collected on relatively few patients compared
to the numbers of gene predictors, with one scientific goal being to identify
genetic features associated with cancer recurrence or survival. In this note,
we propose high-dimensional survival analysis through a new application of
boosting, a powerful tool in machine learning. Our approach is based on an
accelerated lifetime model and minimizing the sum of pairwise differences in
residuals. We apply our method to a recent microarray study of lung
adenocarcinoma and find that our ensemble is composed of 19 genes, while a
proportional hazards (PH) ensemble is composed of nine genes, a proper subset
of the 19-gene panel. In one of our simulation scenarios, we demonstrate that
PH boosting in a misspecified model tends to underfit and ignore
moderately-sized covariate effects, on average. Diagnostic analyses suggest
that the PH assumption is not satisfied in the microarray data and may explain,
in part, the discrepancy in the sets of active coefficients. Our simulation
studies and comparative data analyses demonstrate how statistical learning by
PH models alone is insufficient.
Comment: Published at http://dx.doi.org/10.1214/10-AOAS426 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
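One common form of a "sum of pairwise differences in residuals" for accelerated lifetime models with censoring is the Gehan loss. The sketch below evaluates it for a given set of predictions. It is an assumption that this matches the exact loss used in the paper, and the function name and data are hypothetical.

```python
def gehan_loss(log_times, events, predictions):
    """Gehan-type loss: (1 / n^2) * sum_i sum_j event_i * max(e_j - e_i, 0),

    where e_i = log_times[i] - predictions[i] is the residual on the log scale
    and event_i is 1 for an observed event, 0 for a censored observation.
    """
    residuals = [t - p for t, p in zip(log_times, predictions)]
    n = len(residuals)
    total = 0.0
    for i in range(n):
        if not events[i]:
            continue  # censored observations do not contribute as index i
        for j in range(n):
            total += max(residuals[j] - residuals[i], 0.0)
    return total / n ** 2


# Three subjects; the second is censored. Predictions of zero leave the
# residuals equal to the log times.
example_loss = gehan_loss([1.0, 2.0, 3.0], [1, 0, 1], [0.0, 0.0, 0.0])
perfect_fit_loss = gehan_loss([1.0, 2.0, 3.0], [1, 1, 1], [1.0, 2.0, 3.0])
print(example_loss, perfect_fit_loss)
```

A boosting procedure would repeatedly update the predictions to reduce a loss of this kind; that outer loop is omitted here.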
Nonparametric Methods in Astronomy: Think, Regress, Observe -- Pick Any Three
Telescopes are much more expensive than astronomers, so it is essential to
minimize required sample sizes by using the most data-efficient statistical
methods possible. However, the most commonly used model-independent techniques
for finding the relationship between two variables in astronomy are flawed. In
the worst case they can lead without warning to subtly yet catastrophically
wrong results, and even in the best case they require more data than necessary.
Unfortunately, there is no single best technique for nonparametric regression.
Instead, we provide a guide for how astronomers can choose the best method for
their specific problem and provide a python library with both wrappers for the
most useful existing algorithms and implementations of two new algorithms
developed here.
Comment: 19 pages, PAS
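As a generic example of nonparametric regression between two variables, the sketch below implements a Nadaraya-Watson kernel smoother with a Gaussian kernel. It is a textbook method chosen for illustration only. It is not one of the two new algorithms from the paper, and it does not use the paper's Python library, whose interface is not described here.

```python
import math


def nadaraya_watson(x_train, y_train, x_query, bandwidth):
    """Kernel-weighted average of y_train, with Gaussian weights centred on x_query.

    If x_query is very far from every training point, all weights can underflow
    to zero and the division fails; a real implementation should guard against that.
    """
    weights = [
        math.exp(-0.5 * ((x_query - x) / bandwidth) ** 2) for x in x_train
    ]
    return sum(w * y for w, y in zip(weights, y_train)) / sum(weights)


# Toy data on a straight line; the estimate at the central point is 1 by symmetry.
xs = [0.0, 1.0, 2.0]
ys = [0.0, 1.0, 2.0]
estimate_at_centre = nadaraya_watson(xs, ys, 1.0, bandwidth=0.5)
estimate_constant = nadaraya_watson(xs, [3.0, 3.0, 3.0], 0.7, bandwidth=0.5)
print(estimate_at_centre, estimate_constant)
```

The bandwidth controls the bias-variance trade-off, and its choice is one reason no single nonparametric technique is best for every problem.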