Improving medium-range ensemble weather forecasts with hierarchical ensemble transformers
Statistical post-processing of global ensemble weather forecasts is revisited
by leveraging recent developments in machine learning. Verification of past
forecasts is exploited to learn systematic deficiencies of numerical weather
predictions in order to boost post-processed forecast performance. Here, we
introduce PoET, a post-processing approach based on hierarchical transformers.
PoET has 2 major characteristics: 1) the post-processing is applied directly to
the ensemble members rather than to a predictive distribution or a functional
of it, and 2) the method is ensemble-size agnostic in the sense that the number
of ensemble members in training and inference mode can differ. The PoET output
is a set of calibrated members that has the same size as the original ensemble
but with improved reliability. Performance assessments show that PoET can bring
up to 20% improvement in skill globally for 2m temperature and 2% for
precipitation forecasts and outperforms the simpler statistical
member-by-member method, used here as a competitive benchmark. PoET is also
applied to the ENS10 benchmark dataset for ensemble post-processing and
outperforms other deep learning solutions for most of the evaluated
parameters. Furthermore, because each ensemble member is calibrated
separately, downstream applications should benefit directly from the
improvements that post-processing brings to the ensemble forecast.
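To make the member-by-member idea concrete, here is a minimal shift-and-scale calibration sketch: each member is adjusted so the calibrated ensemble matches a target mean and spread, while the members' relative positions are preserved. The `obs_mean`/`obs_std` targets and the function name are hypothetical simplifications, not the paper's exact benchmark; note the function is ensemble-size agnostic, since it works for any number of members.

```python
import numpy as np

def mbm_calibrate(members, obs_mean, obs_std):
    """Shift and scale each ensemble member so the calibrated ensemble
    has the target mean and spread (a generic member-by-member sketch).

    members: array of shape (n_members, n_points)
    """
    ens_mean = members.mean(axis=0)
    ens_std = members.std(axis=0, ddof=1)
    # Each member keeps its rank within the ensemble; only the
    # ensemble's location and spread are corrected.
    return obs_mean + (members - ens_mean) * (obs_std / ens_std)

rng = np.random.default_rng(0)
raw = rng.normal(loc=2.0, scale=0.5, size=(10, 4))  # 10 members, 4 grid points
cal = mbm_calibrate(raw, obs_mean=0.0, obs_std=1.0)
```

The output `cal` has the same shape as the raw ensemble, in line with the abstract's point that post-processing returns a set of members rather than a predictive distribution.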
Improving the performance of cascade correlation neural networks on multimodal functions
Intrinsic qualities of the cascade correlation algorithm make it a popular choice for many researchers wishing to utilize neural networks. Problems arise when the outputs required are highly multimodal over the input domain. The mean squared error of the approximation increases significantly as the number of modes increases. By applying ensembling and early stopping, we show that this error can be reduced by a factor of three. We also present a new technique based on subdivision that we call patchworking. When used in combination with early stopping and ensembling the mean
improvement in error is more than tenfold in some cases.
A study of early stopping, ensembling, and patchworking for cascade correlation neural networks
The constructive topology of the cascade correlation algorithm makes it a popular choice for many researchers wishing to utilize neural networks. However, for multimodal problems, the mean squared error of the approximation increases significantly as the number of modes increases. The components of this error will comprise both bias and variance and we provide formulae for estimating these values from mean squared errors alone. We achieve a near threefold reduction in the overall error by using early stopping and ensembling. Also described is a new subdivision technique that we call patchworking. Patchworking, when used in combination with early stopping and ensembling, can achieve an order of magnitude improvement in the error. Also presented is an approach for validating the quality of a neural network’s training, without the explicit use of a testing dataset
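The claim that bias and variance can be estimated from mean squared errors alone can be illustrated with the standard ambiguity decomposition for an averaging ensemble: the average member MSE splits into the ensemble MSE (what averaging cannot remove) plus the average spread of members around the ensemble. This is a generic sketch, not necessarily the paper's exact formulae, and the function name is hypothetical.

```python
import numpy as np

def bias_variance_from_mses(preds, target):
    """Split ensemble error into a bias-like and a variance-like term
    using only mean squared errors (ambiguity decomposition sketch).

    preds: array of shape (n_members, n_points)
    """
    ens = preds.mean(axis=0)                   # ensemble-average prediction
    avg_mse = np.mean((preds - target) ** 2)   # mean MSE of individual members
    ens_mse = np.mean((ens - target) ** 2)     # MSE of the ensemble average
    variance = avg_mse - ens_mse               # average spread around the ensemble
    bias_sq = ens_mse                          # error that averaging cannot remove
    return bias_sq, variance

# Two members symmetric around the target: all error is variance.
target = np.zeros(5)
preds = np.array([[1.0] * 5, [-1.0] * 5])
bias_sq, variance = bias_variance_from_mses(preds, target)
```

In this toy case the ensemble average hits the target exactly, so the bias-like term is zero and the entire average MSE is attributed to variance, which is exactly the component that ensembling reduces.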
The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures
Motivation: Biomarker discovery from high-dimensional data is a crucial
problem with enormous applications in biology and medicine. It is also
extremely challenging from a statistical viewpoint, but surprisingly few
studies have investigated the relative strengths and weaknesses of the plethora
of existing feature selection methods. Methods: We compare 32 feature selection
methods on 4 public gene expression datasets for breast cancer prognosis, in
terms of predictive performance, stability and functional interpretability of
the signatures they produce. Results: We observe that the feature selection
method has a significant influence on the accuracy, stability and
interpretability of signatures. Simple filter methods generally outperform more
complex embedded or wrapper methods, and ensemble feature selection generally
has no positive effect. Overall, a simple Student's t-test seems to
provide the best results. Availability: Code and data are publicly available at
http://cbio.ensmp.fr/~ahaury/
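A filter method of the kind the abstract favors can be sketched in a few lines: rank features by the absolute value of a two-sample t statistic between the classes and keep the top k. The function name is hypothetical and Welch's form of the t statistic is assumed; the paper's exact setup may differ.

```python
import numpy as np

def t_test_filter(X, y, k):
    """Rank features by |t| between classes 0 and 1 and keep the top k
    (a minimal filter-method sketch for two-class data).

    X: array of shape (n_samples, n_features); y: binary labels.
    """
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    v0, v1 = X0.var(axis=0, ddof=1), X1.var(axis=0, ddof=1)
    t = (m1 - m0) / np.sqrt(v0 / len(X0) + v1 / len(X1))  # Welch's t statistic
    return np.argsort(-np.abs(t))[:k]                     # indices of top-k features

# Synthetic check: only feature 0 separates the classes.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = np.array([0] * 50 + [1] * 50)
X[y == 1, 0] += 5.0
selected = t_test_filter(X, y, k=1)
```

Because the filter scores each feature independently of any classifier, it is cheap and, per the abstract, surprisingly hard to beat with more complex wrapper or embedded methods.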
COMET: A Recipe for Learning and Using Large Ensembles on Massive Data
COMET is a single-pass MapReduce algorithm for learning on large-scale data.
It builds multiple random forest ensembles on distributed blocks of data and
merges them into a mega-ensemble. This approach is appropriate when learning
from massive-scale data that is too large to fit on a single machine. To get
the best accuracy, IVoting should be used instead of bagging to generate the
training subset for each decision tree in the random forest. Experiments with
two large datasets (5GB and 50GB compressed) show that COMET compares favorably
(in both accuracy and training time) to learning on a subsample of data using a
serial algorithm. Finally, we propose a new Gaussian approach for lazy ensemble
evaluation which dynamically decides how many ensemble members to evaluate per
data point; this can reduce evaluation cost by 100X or more
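The lazy-evaluation idea can be sketched as a sequential stopping rule: evaluate members one at a time and stop as soon as a Gaussian (normal-approximation) confidence interval on the positive-vote fraction clearly excludes 0.5. This is a generic sketch of the idea, not COMET's exact rule; the function name, the `z` threshold, and `min_members` are assumptions.

```python
import math

def lazy_vote(member_preds, z=2.58, min_members=10):
    """Binary-classification vote with early stopping.

    member_preds: iterable of 0/1 predictions, one per ensemble member.
    Returns (decision, n_members_evaluated).
    """
    votes = 0
    total = 0
    for pred in member_preds:
        votes += pred
        total += 1
        if total >= min_members:
            p = votes / total
            half_width = z * math.sqrt(p * (1 - p) / total)
            # Stop once the interval around p no longer contains 0.5.
            if p - half_width > 0.5 or p + half_width < 0.5:
                return int(p > 0.5), total
    return int(votes / total > 0.5), total  # fell through: full evaluation

unanimous = lazy_vote([1] * 1000)   # decided after min_members evaluations
contested = lazy_vote([1, 0] * 500) # never decisive: all members evaluated
```

When the members agree, the rule stops after only `min_members` evaluations, which is where the large evaluation-cost reductions reported in the abstract come from; genuinely contested points still fall back to the full ensemble.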