Combining similarity in time and space for training set formation under concept drift
Concept drift is a challenge in supervised learning for sequential data. It describes the phenomenon in which the data distribution changes over time. In such a case, the accuracy of a classifier benefits from selective sampling of the training data. We develop a method for training set selection that is particularly relevant when the expected drift is gradual. At each time step, training instances are selected based on their distance to the target instance, using a distance function that combines similarity in space and similarity in time. The method determines an optimal training set size online at every time step using cross-validation. It is a wrapper approach and can be used with different base classifiers plugged in. The proposed method shows the best accuracy in its peer group on real and artificial drifting data, and its complexity is reasonable for field applications.
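The core idea of the abstract can be sketched in a few lines: score each candidate training instance by a distance that mixes feature-space similarity with temporal recency, then keep the nearest ones. The linear mixing form, the `alpha` weight, and the min-max normalisation below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def combined_distance(x_target, t_target, X, t, alpha=0.5):
    """Distance mixing similarity in feature space and in time.

    alpha trades off spatial against temporal distance; both the linear
    combination and the normalisation are assumptions for illustration.
    """
    d_space = np.linalg.norm(X - x_target, axis=1)
    d_time = np.abs(t - t_target).astype(float)
    # Normalise each component to [0, 1] before mixing, guarding
    # against a zero maximum when all candidates coincide.
    d_space /= d_space.max() or 1.0
    d_time /= d_time.max() or 1.0
    return alpha * d_space + (1.0 - alpha) * d_time

# Toy usage: select the 3 candidates closest (in the combined sense)
# to the newest instance; in the paper the set size itself would be
# chosen online by cross-validation.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
t = np.arange(10)
d = combined_distance(X[-1], t[-1], X[:-1], t[:-1])
train_idx = np.argsort(d)[:3]
```

A wrapper method would then fit any base classifier on `X[train_idx]` and repeat the selection at the next time step.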
Gaussian Process Regression for Estimating EM Ducting Within the Marine Atmospheric Boundary Layer
We show that Gaussian process regression (GPR) can be used to infer the electromagnetic (EM) duct height within the marine atmospheric boundary layer (MABL) from sparsely sampled propagation factors in the context of bistatic radars. We use GPR to calculate the posterior predictive distribution on the labels (i.e., duct height) from both noise-free and noise-contaminated arrays of propagation factors. For duct height inference from noise-contaminated propagation factors, we compare a naive approach, utilizing one random sample from the input distribution (i.e., disregarding the input noise), with an inverse-variance weighted approach, utilizing a few random samples to estimate the true predictive distribution. The resulting posterior predictive distributions from these two approaches are compared to a "ground truth" distribution, which is approximated using a large number of Monte Carlo samples. The ability of GPR to yield accurate and fast duct height predictions from a few training examples indicates the suitability of the proposed method for real-time applications.
Comment: 15 pages, 6 figures
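The inverse-variance weighted approach described above can be sketched as follows: draw a few samples from the noisy input's distribution, query the GP at each, and combine the predictive means weighted by the reciprocal of the predictive variance. The toy 1-D regression problem, the sample count, and the kernel choice below are assumptions standing in for the propagation-factor data, not the authors' setup.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Toy 1-D regression standing in for propagation factors -> duct height.
rng = np.random.default_rng(1)
X_train = np.linspace(0.0, 1.0, 20)[:, None]
y_train = np.sin(4.0 * X_train).ravel()
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=0.2)).fit(X_train, y_train)

def ivw_prediction(x_mean, x_std, n_samples=5):
    """Inverse-variance weighted prediction over noisy-input samples.

    Draws n_samples inputs from N(x_mean, x_std**2), queries the GP at
    each, and averages the predictive means with weights 1/variance.
    A sketch of the idea in the abstract, not the paper's code.
    """
    xs = rng.normal(x_mean, x_std, size=(n_samples, 1))
    mu, sd = gpr.predict(xs, return_std=True)
    w = 1.0 / np.maximum(sd**2, 1e-12)
    return float(np.sum(w * mu) / np.sum(w))

est = ivw_prediction(0.5, 0.05)
```

The naive approach the abstract compares against would instead call `gpr.predict` on a single draw and ignore the input noise entirely.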
The problem with Kappa
It is becoming clear that traditional evaluation measures used in Computational Linguistics (including Error Rates, Accuracy, Recall, Precision and F-measure) are of limited value for unbiased evaluation of systems, and are not meaningful for comparison of algorithms unless both the dataset and algorithm parameters are strictly controlled for skew (Prevalence and Bias). The use of techniques originally designed for other purposes, in particular Receiver Operating Characteristics Area Under Curve, plus variants of Kappa, has been proposed to fill the void. This paper aims to clear up some of the confusion relating to evaluation by demonstrating that the usefulness of each evaluation method is highly dependent on the assumptions made about the distributions of the dataset and the underlying populations. The behaviour of a number of evaluation measures is compared under common assumptions. Deploying a system in a context which has the opposite skew from its validation set can be expected to approximately negate Fleiss Kappa and halve Cohen Kappa, but leave Powers Kappa unchanged. For most performance evaluation purposes, the latter is thus most appropriate, whilst for comparison of behaviour, Matthews Correlation is recommended.
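Two of the measures discussed above, Cohen's Kappa and Matthews Correlation, are directly available in common libraries, which makes the abstract's skew-sensitivity argument easy to probe empirically. The toy labels below (90% negatives) are an illustrative assumption, chosen only to show both chance-corrected measures computed on skewed data; they do not reproduce the paper's experiments.

```python
from sklearn.metrics import cohen_kappa_score, matthews_corrcoef

# Imbalanced toy labels: 90 negatives, 10 positives.
y_true = [0] * 90 + [1] * 10
# Predictions: 85 of the negatives and 5 of the positives correct.
y_pred = [0] * 85 + [1] * 5 + [0] * 5 + [1] * 5

# Both measures correct for chance agreement, but they respond
# differently when the class skew of the deployment data flips,
# which is the phenomenon the paper analyses.
kappa = cohen_kappa_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)
```

Repeating the computation with the label skew inverted (swapping the roles of 0 and 1 in the data, not merely relabelling) is one way to observe the halving behaviour the abstract attributes to Cohen Kappa.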
Survival ensembles by the sum of pairwise differences with application to lung cancer microarray studies
Lung cancer is among the most common cancers in the United States, in terms of incidence and mortality. In 2009, it is estimated that more than 150,000 deaths will result from lung cancer alone. Genetic information is an extremely valuable data source in characterizing the personal nature of cancer. Over the past several years, investigators have conducted numerous association studies where intensive genetic data is collected on relatively few patients compared to the numbers of gene predictors, with one scientific goal being to identify genetic features associated with cancer recurrence or survival. In this note, we propose high-dimensional survival analysis through a new application of boosting, a powerful tool in machine learning. Our approach is based on an accelerated lifetime model and minimizing the sum of pairwise differences in residuals. We apply our method to a recent microarray study of lung adenocarcinoma and find that our ensemble is composed of 19 genes, while a proportional hazards (PH) ensemble is composed of nine genes, a proper subset of the 19-gene panel. In one of our simulation scenarios, we demonstrate that PH boosting in a misspecified model tends to underfit and ignore moderately-sized covariate effects, on average. Diagnostic analyses suggest that the PH assumption is not satisfied in the microarray data and may explain, in part, the discrepancy in the sets of active coefficients. Our simulation studies and comparative data analyses demonstrate how statistical learning by PH models alone is insufficient.
Comment: Published at http://dx.doi.org/10.1214/10-AOAS426 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)
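A "sum of pairwise differences in residuals" objective for an accelerated lifetime model can be sketched as a rank-based, Gehan-type loss: censored observations contribute only through pairs whose event-time residual is observed. The exact objective, variable names, and toy data below are assumptions illustrating this loss family, not the authors' implementation or boosting machinery.

```python
import numpy as np

def gehan_loss(log_t, pred, delta):
    """Sum-of-pairwise-differences (Gehan-type) loss for an AFT model.

    log_t: observed log survival times; pred: model predictions on the
    log scale; delta: 1 if the event was observed, 0 if censored.
    A sketch of the rank-based loss family, not the paper's objective.
    """
    e = log_t - pred                     # residuals on the log scale
    diff = e[:, None] - e[None, :]       # e_i - e_j for all pairs
    # A pair (i, j) contributes e_j - e_i when that gap is positive and
    # subject i's event was observed (delta_i = 1).
    return float(np.sum(delta[:, None] * np.maximum(-diff, 0.0)))

# Toy data: three subjects, the second one censored.
log_t = np.log(np.array([2.0, 5.0, 3.0]))
pred = np.array([0.5, 1.2, 0.9])
delta = np.array([1.0, 0.0, 1.0])
loss = gehan_loss(log_t, pred, delta)
```

A boosting procedure in this spirit would fit each base learner to reduce this loss, in contrast to the partial-likelihood objective used by proportional hazards boosting.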
- …