On Using Active Learning and Self-Training when Mining Performance Discussions on Stack Overflow
Abundant data is the key to successful machine learning. However, supervised
learning requires annotated data that are often hard to obtain. In a
classification task with limited resources, Active Learning (AL) promises to
guide annotators to examples that bring the most value for a classifier. AL can
be successfully combined with self-training, i.e., extending a training set
with the unlabelled examples for which a classifier is the most certain. We
report our experiences on using AL in a systematic manner to train an SVM
classifier for Stack Overflow posts discussing performance of software
components. We show that the training examples deemed as the most valuable to
the classifier are also the most difficult for humans to annotate. Despite
carefully evolved annotation criteria, we report low inter-rater agreement, but
we also propose mitigation strategies. Finally, based on one annotator's work,
we show that self-training can improve the classification accuracy. We conclude
the paper by discussing implications for future text miners aspiring to use AL
and self-training.
Comment: Preprint of paper accepted for the Proc. of the 21st International
Conference on Evaluation and Assessment in Software Engineering, 201
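The loop this abstract describes — querying the examples the classifier is least sure about, while pseudo-labelling the ones it is most sure about — can be sketched on toy data. Everything below (the 1-D Gaussian data, the centroid classifier standing in for the SVM, the query budget) is an invented illustration, not the authors' setup:

```python
import random

random.seed(0)

# Toy 1-D pool: class 0 clusters near 0, class 1 near 4 (invented data).
pool = ([(random.gauss(0, 1), 0) for _ in range(50)]
        + [(random.gauss(4, 1), 1) for _ in range(50)])
labelled = [pool.pop(0), pool.pop(-1)]       # one seed example per class

def fit(examples):
    """Centroid classifier: threshold halfway between the class means."""
    c0 = [x for x, y in examples if y == 0]
    c1 = [x for x, y in examples if y == 1]
    return (sum(c0) / len(c0) + sum(c1) / len(c1)) / 2

for _ in range(10):
    thr = fit(labelled)
    # Active learning: query the pool point closest to the decision boundary.
    i = min(range(len(pool)), key=lambda j: abs(pool[j][0] - thr))
    labelled.append(pool.pop(i))             # an oracle reveals the true label
    # Self-training: pseudo-label the point the classifier is most sure about.
    k = max(range(len(pool)), key=lambda j: abs(pool[j][0] - thr))
    x, _ = pool.pop(k)
    labelled.append((x, int(x > thr)))       # trust the confident prediction

thr = fit(labelled)
accuracy = sum((x > thr) == bool(y) for x, y in pool) / len(pool)
```

The sketch also shows why the abstract's finding is plausible: the actively queried points sit near the boundary, exactly where human annotators disagree most, while the self-trained points are the easy, far-from-boundary ones.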
Suggestive Annotation: A Deep Active Learning Framework for Biomedical Image Segmentation
Image segmentation is a fundamental problem in biomedical image analysis.
Recent advances in deep learning have achieved promising results on many
biomedical image segmentation benchmarks. However, due to large variations in
biomedical images (different modalities, image settings, objects, noise, etc.),
utilizing deep learning on a new application usually requires a new set of
training data. This can incur a great deal of annotation effort and cost,
because only biomedical experts can annotate effectively, and often there are
too many instances in images (e.g., cells) to annotate. In this paper, we aim
to address the following question: With limited effort (e.g., time) for
annotation, what instances should be annotated in order to attain the best
performance? We present a deep active learning framework that combines a fully
convolutional network (FCN) with active learning to significantly reduce
annotation effort by making judicious suggestions on the most effective
annotation areas. We utilize uncertainty and similarity information provided by
the FCN and formulate a generalized version of the maximum set cover problem to
determine the most representative and uncertain areas for annotation. Extensive
experiments using the 2015 MICCAI Gland Challenge dataset and a lymph node
ultrasound image segmentation dataset show that, using annotation suggestions
by our method, state-of-the-art segmentation performance can be achieved by
using only 50% of the training data.
Comment: Accepted at MICCAI 201
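The two-step selection this abstract outlines — filter candidates by uncertainty, then greedily pick the subset that best represents the rest — can be sketched with toy numbers. The scalar "descriptors", the similarity function, and the budgets K and B below are all made up; in the paper both the uncertainty and the similarity come from the FCN's features:

```python
import random

random.seed(1)

# Invented scalar "descriptors" for 30 candidate annotation areas; in the
# paper both the uncertainty scores and the similarity come from FCN features.
features = [random.random() for _ in range(30)]
uncertainty = {i: random.random() for i in range(len(features))}

def sim(a, b):
    return 1.0 - abs(features[a] - features[b])    # toy similarity in [0, 1]

# Step 1: keep only the K most uncertain candidate areas.
K, B = 12, 4
uncertain = sorted(uncertainty, key=uncertainty.get, reverse=True)[:K]

def coverage(subset):
    """How well `subset` represents all uncertain areas (submodular)."""
    if not subset:
        return 0.0
    return sum(max(sim(i, j) for j in subset) for i in uncertain)

# Step 2: greedily pick B areas maximizing coverage, the standard (1 - 1/e)
# approximation for this generalized maximum set cover objective.
chosen = []
for _ in range(B):
    best = max((c for c in uncertain if c not in chosen),
               key=lambda c: coverage(chosen + [c]))
    chosen.append(best)
```

Because the coverage objective is monotone submodular, the greedy loop is the natural choice here: each added area is the one that most improves representativeness of the uncertain pool.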
Optimism in Active Learning with Gaussian Processes
In the context of Active Learning for classification, the classification error depends on the joint distribution of samples and their labels, which is initially unknown. Minimizing this error requires estimating that distribution, and estimating it online involves a trade-off between exploration and exploitation. This is a common problem in machine learning, for which multi-armed bandit theory, building upon Optimism in the Face of Uncertainty, has proven very efficient in recent years. We introduce two novel algorithms that combine Optimism in the Face of Uncertainty with Gaussian Processes for the Active Learning problem. An evaluation conducted on real-world datasets shows that these new algorithms compare favourably to state-of-the-art methods.
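The optimism principle the abstract builds on can be caricatured without the Gaussian process machinery: replace the GP posterior with a per-region sample average plus a confidence bonus, i.e. a plain UCB bandit. The five regions and their error rates below are invented, and this bandit stand-in is not the paper's algorithm — it only illustrates why an optimistic learner concentrates its label queries where the error is largest:

```python
import math
import random

random.seed(2)

# Five regions of input space with invented label-error rates.  The paper
# maintains a Gaussian Process posterior over the error; this per-region
# sample average is a crude stand-in used only to illustrate the principle.
error_rate = [0.05, 0.2, 0.5, 0.1, 0.15]
counts = [1] * 5                             # one fictitious pull per region
errors = [0.0] * 5

for t in range(1, 2001):
    # Optimism in the Face of Uncertainty: query the region whose upper
    # confidence bound on the classification error is largest.
    a = max(range(5),
            key=lambda i: errors[i] / counts[i]
                          + math.sqrt(2 * math.log(t) / counts[i]))
    errors[a] += random.random() < error_rate[a]   # one noisy binary label
    counts[a] += 1
```

After a few hundred rounds the optimistic rule has sampled the hardest region far more than the easy ones, while the shrinking confidence bonus keeps every region from being abandoned too early.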
A Monte Carlo study of the three-dimensional Coulomb frustrated Ising ferromagnet
We have investigated by Monte-Carlo simulation the phase diagram of a
three-dimensional Ising model with nearest-neighbor ferromagnetic interactions
and small, but long-range (Coulombic) antiferromagnetic interactions. We have
developed an efficient cluster algorithm and used different lattice sizes and
geometries, which allows us to obtain the main characteristics of the
temperature-frustration phase diagram. Our finite-size scaling analysis
confirms that the melting of the lamellar phases into the paramagnetic phase is
driven first-order by the fluctuations. Transitions between ordered phases with
different modulation patterns are observed in some regions of the diagram, in
agreement with a recent mean-field analysis.
Comment: 14 pages, 10 figures, submitted to Phys. Rev.
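For readers unfamiliar with the setup, a single-spin Metropolis sketch of the plain 3D nearest-neighbour Ising ferromagnet is below. It deliberately omits the small long-range Coulomb antiferromagnetic term and the cluster moves that the paper's simulations rely on, and the lattice size, coupling, and temperature are arbitrary choices for illustration:

```python
import math
import random

random.seed(3)

# Single-spin Metropolis sketch of the plain 3D nearest-neighbour Ising
# ferromagnet.  The paper's model adds a long-range Coulomb antiferromagnetic
# term and is simulated with a cluster algorithm; both are omitted here.
L, J, T = 6, 1.0, 2.5                        # arbitrary size, coupling, temperature
spin = {(x, y, z): 1                         # cold start in the ordered phase
        for x in range(L) for y in range(L) for z in range(L)}

def neighbours(x, y, z):
    """The six nearest neighbours under periodic boundary conditions."""
    for dx, dy, dz in ((1, 0, 0), (-1, 0, 0), (0, 1, 0),
                       (0, -1, 0), (0, 0, 1), (0, 0, -1)):
        yield ((x + dx) % L, (y + dy) % L, (z + dz) % L)

for sweep in range(400):
    for site in spin:
        h = sum(spin[n] for n in neighbours(*site))
        dE = 2 * J * spin[site] * h          # energy cost of flipping this spin
        if dE <= 0 or random.random() < math.exp(-dE / T):
            spin[site] = -spin[site]

m = abs(sum(spin.values())) / L**3           # magnetisation per spin
```

Single-spin updates like these slow down dramatically near the transition, which is precisely why the paper develops an efficient cluster algorithm for the frustrated model.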
Active Sampling-based Binary Verification of Dynamical Systems
Nonlinear, adaptive, or otherwise complex control techniques are increasingly
relied upon to ensure the safety of systems operating in uncertain
environments. However, the nonlinearity of the resulting closed-loop system
complicates verification that the system does in fact satisfy those
requirements at all possible operating conditions. While analytical proof-based
techniques and finite abstractions can be used to provably verify the
closed-loop system's response at different operating conditions, they often
produce conservative approximations due to restrictive assumptions and are
difficult to construct in many applications. In contrast, popular statistical
verification techniques relax the restrictions and instead rely upon
simulations to construct statistical or probabilistic guarantees. This work
presents a data-driven statistical verification procedure that instead
constructs statistical learning models from simulated training data to separate
the set of possible perturbations into "safe" and "unsafe" subsets. Binary
evaluations of closed-loop system requirement satisfaction at various
realizations of the uncertainties are obtained through temporal logic
robustness metrics, which are then used to construct predictive models of
requirement satisfaction over the full set of possible uncertainties. As the
accuracy of these predictive statistical models is inherently coupled to the
quality of the training data, an active learning algorithm selects additional
sample points in order to maximize the expected change in the data-driven model
and thus, indirectly, minimize the prediction error. Various case studies
demonstrate the closed-loop verification procedure and highlight improvements
in prediction error over both existing analytical and statistical verification
techniques.
Comment: 23 pages
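The closed-loop idea — simulate at a perturbation, score the run with a robustness metric, label it safe or unsafe, then sample where the model would change most — can be caricatured in one dimension. The robustness function, the threshold at 0.6, and bisection as the acquisition rule are all invented stand-ins for the paper's statistical models and expected-model-change criterion:

```python
# One scalar perturbation theta in [0, 1]; the closed loop is declared "safe"
# when an invented temporal-logic robustness value rho(theta) = 0.6 - theta
# is positive.  Bisecting between the closest safe/unsafe pair is a 1-D
# stand-in for the paper's expected-model-change sampling criterion.
def robustness(theta):
    return 0.6 - theta                       # hypothetical robustness metric

samples = {0.0: True, 1.0: False}            # two seed simulations
for _ in range(20):
    safe = max(t for t, ok in samples.items() if ok)
    unsafe = min(t for t, ok in samples.items() if not ok)
    t = (safe + unsafe) / 2                  # where the model is least certain
    samples[t] = robustness(t) > 0           # run one more "simulation"

boundary = (max(t for t, ok in samples.items() if ok)
            + min(t for t, ok in samples.items() if not ok)) / 2
```

Twenty actively chosen simulations pin down the safe/unsafe boundary to about one part in a million, whereas twenty uniformly random samples would locate it only to within a few percent — the same economy of simulations the abstract claims over passive statistical verification.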
An analysis of a manufacturing process using the GERT approach
Graphical Evaluation and Review Technique for analyzing manufacturing processes
Higgs signals and hard photons at the Next Linear Collider: the -fusion channel in the Standard Model
In this paper, we extend the analyses carried out in a previous article for
-fusion to the case of Higgs production via -fusion within the Standard
Model at the Next Linear Collider, in the presence of electromagnetic radiation
due to real photon emission. Calculations are carried out at tree level, and rates of
the leading order (LO) processes
e^+e^- \rightarrow e^+e^- H \rightarrow e^+e^- b\bar{b} and
e^+e^- \rightarrow e^+e^- H \rightarrow e^+e^- WW \rightarrow e^+e^- \mathrm{jjjj}
are compared to those of the next-to-leading order (NLO) reactions
e^+e^- \rightarrow e^+e^- H (\gamma) \rightarrow e^+e^- b\bar{b}\gamma and
e^+e^- \rightarrow e^+e^- H (\gamma) \rightarrow e^+e^- WW (\gamma) \rightarrow e^+e^- \mathrm{jjjj}\gamma,
in the case of energetic and isolated photons.
Comment: 12 pages, LaTeX, 5 PostScript figures embedded using epsfig and
bitmapped at 100dpi, complete paper including high definition figures
available at ftp://axpa.hep.phy.cam.ac.uk/stefano/cavendish_9611.ps or at
http://www.hep.phy.cam.ac.uk/theory/papers
Multiple-scattering effects on incoherent neutron scattering in glasses and viscous liquids
Incoherent neutron scattering experiments are simulated for simple dynamic
models: a glass (with a smooth distribution of harmonic vibrations) and a
viscous liquid (described by schematic mode-coupling equations). In most
situations multiple scattering has little influence upon spectral
distributions, but it completely distorts the wavenumber-dependent amplitudes.
This explains an anomaly observed in recent experiments.
Discovering Valuable Items from Massive Data
Suppose there is a large collection of items, each with an associated cost
and an inherent utility that is revealed only once we commit to selecting it.
Given a budget on the cumulative cost of the selected items, how can we pick a
subset of maximal value? This task generalizes several important problems such
as multi-arm bandits, active search and the knapsack problem. We present an
algorithm, GP-Select, which utilizes prior knowledge about similarity between
items, expressed as a kernel function. GP-Select uses Gaussian process
prediction to balance exploration (estimating the unknown value of items) and
exploitation (selecting items of high value). We extend GP-Select to be able to
discover sets that simultaneously have high utility and are diverse. Our
preference for diversity can be specified as an arbitrary monotone submodular
function that quantifies the diminishing returns obtained when selecting
similar items. Furthermore, we exploit the structure of the model updates to
achieve an order of magnitude (up to 40X) speedup in our experiments without
resorting to approximations. We provide strong guarantees on the performance of
GP-Select and apply it to three real-world case studies of industrial
relevance: (1) Refreshing a repository of prices in a Global Distribution
System for the travel industry, (2) Identifying diverse, binding-affine
peptides in a vaccine design task and (3) Maximizing clicks in a web-scale
recommender system by recommending items to users.
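Stripped of the Gaussian process machinery, the selection rule the abstract describes — an optimistic value estimate per unit cost, applied greedily under a budget — can be sketched as below. The item costs, value estimates, uncertainties, and the beta weight are all invented; GP-Select additionally updates its posterior after each observed utility and handles the submodular diversity term, both of which this sketch omits:

```python
import random

random.seed(4)

# Hypothetical items: (cost, prior mean value, posterior std-dev stand-in).
# In GP-Select the mean and uncertainty come from Gaussian process
# prediction and are updated after every selection; here they are fixed.
items = [(random.uniform(1, 3), random.random(), random.uniform(0.1, 0.5))
         for _ in range(20)]
budget, beta = 10.0, 2.0                     # invented budget and optimism weight

selected, spent = [], 0.0
while True:
    affordable = [i for i in range(len(items))
                  if i not in selected and spent + items[i][0] <= budget]
    if not affordable:
        break
    # Optimistic score (mean + beta * std-dev) per unit cost, chosen greedily.
    best = max(affordable,
               key=lambda i: (items[i][1] + beta * items[i][2]) / items[i][0])
    selected.append(best)
    spent += items[best][0]
```

The beta term plays the same role as in bandit algorithms: uncertain items get a bonus, so the budget is split between exploiting items already believed valuable and exploring ones whose utility is still unknown.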