Characterizing Datapoints via Second-Split Forgetting
Researchers investigating example hardness have increasingly focused on the
dynamics by which neural networks learn and forget examples throughout
training. Popular metrics derived from these dynamics include (i) the epoch at
which examples are first correctly classified; (ii) the number of times their
predictions flip during training; and (iii) whether their prediction flips if
they are held out. However, these metrics do not distinguish among examples
that are hard for distinct reasons, such as membership in a rare subpopulation,
being mislabeled, or belonging to a complex subpopulation. In this paper, we
propose second-split forgetting time (SSFT), a complementary metric
that tracks the epoch (if any) after which an original training example is
forgotten as the network is fine-tuned on a randomly held-out partition of the
data. Across multiple benchmark datasets and modalities, we demonstrate that
mislabeled examples are forgotten quickly, and seemingly rare examples are
forgotten comparatively slowly. By contrast, metrics that consider only the
first-split learning dynamics struggle to differentiate the two. At large learning
rates, SSFT tends to be robust across architectures, optimizers, and random
seeds. From a practical standpoint, the SSFT can (i) help to identify
mislabeled samples, the removal of which improves generalization; and (ii)
provide insights about failure modes. Through theoretical analysis addressing
overparameterized linear models, we provide insights into how the observed
phenomena may arise. Code for reproducing our experiments can be found here:
https://github.com/pratyushmaini/ssft
Comment: Accepted at NeurIPS 202
Seismic Event Classification using Machine Learning
The manual detection of seismic events is a labor-intensive task, requiring highly skilled workers to continuously analyze recorded waveforms. Previous work has shown the potential of machine learning methods for aiding in this task, and that deep neural networks are able to learn important patterns in seismic recordings. This study aims to develop a deep neural network to classify earthquake, explosion, and noise events using long beamformed waveform snippets from NORSAR's ARCES array. The final model was evaluated on an unseen test set and on recordings of the North Korean nuclear weapons tests. I developed custom augmentation methods to combat the uneven class distribution, and several preprocessing techniques were deployed in pursuit of performance. Models developed for similar data, state-of-the-art multivariate time series models, and self-developed models were experimented with and evaluated. Analysis of the results demonstrated that the final model can classify noise and explosion events with a high degree of accuracy, while earthquake classifications were less reliable. I conclude that deep neural networks can learn distinguishing features and detect events of interest in long beamformed three-component waveforms.
Master's thesis in informatics (Masteroppgave i informatikk), INF399, MAMN-INF, MAMN-PRO
Methods for labeling error detection in microarrays based on the effect of data perturbation on the regression model
Abstract
Motivation: Mislabeled samples often appear in gene expression profiles because of the similarity between different subtypes of a disease and subjective misdiagnosis. Mislabeled samples degrade supervised learning procedures. The LOOE-sensitivity algorithm detects mislabeled samples in microarray data via data perturbation. However, its failure to measure the perturbing effect leads to poor performance. The purpose of this article is to design a novel detection method for mislabeled microarray samples that can properly measure the effect of data perturbations.
Results: To measure the effect of data perturbation, we define an index named the perturbing influence value (PIV), based on the support vector machine (SVM) regression model. The Column Algorithm (CAPIV), Row Algorithm (RAPIV), and Progressive Row Algorithm (PRAPIV), all based on the PIV, are proposed to detect mislabeled samples. Experimental results obtained using six artificial datasets and five microarray datasets demonstrate that all methods proposed in this article are superior to LOOE-sensitivity. Moreover, compared with the simple SVM and CL-stability, the PRAPIV algorithm shows increased precision and high recall.
Availability: The program and source code (in Java) are publicly available at http://ccst.jlu.edu.cn/CSBG/PIVS/index.htm
Contact: [email protected]; [email protected]
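The abstract defines the PIV only as a perturbation-influence index on an SVM regression model. As one illustrative reading of that idea (not the paper's exact CAPIV/RAPIV/PRAPIV algorithms), the sketch below flips each label in turn, refits an SVR, and scores the sample by how much the fitted model moves on the remaining samples:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)

# Toy binary-labeled "expression" data: labels used as regression targets
# in {0, 1}, with one deliberately mislabeled sample at index 0.
X = rng.normal(size=(40, 5))
y = (X[:, 0] > 0).astype(float)
y[0] = 1.0 - y[0]  # inject a label error

def fitted_predictions(X, y):
    """Fit an SVM regression model and return its in-sample predictions."""
    return SVR(kernel="linear", C=1.0).fit(X, y).predict(X)

base = fitted_predictions(X, y)

# Perturbing-influence score: flip each label in turn and measure the mean
# shift of the refit model's predictions on all other samples.
piv = np.empty(len(y))
for i in range(len(y)):
    y_pert = y.copy()
    y_pert[i] = 1.0 - y_pert[i]
    pert = fitted_predictions(X, y_pert)
    others = np.arange(len(y)) != i
    piv[i] = np.abs(pert[others] - base[others]).mean()
```

Samples whose label flip changes the regression model in an anomalous way are then candidates for label errors; the paper's column-wise, row-wise, and progressive variants build detection procedures on top of such a score.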
The GTZAN dataset: Its contents, its faults, their effects on evaluation, and its future use
The GTZAN dataset appears in at least 100 published works, and is the
most-used public dataset for evaluation in machine listening research for music
genre recognition (MGR). Our recent work, however, shows GTZAN has several
faults (repetitions, mislabelings, and distortions), which challenge the
interpretability of any result derived using it. In this article, we disprove
the claims that all MGR systems are affected in the same ways by these faults,
and that the performances of MGR systems in GTZAN are still meaningfully
comparable since they all face the same faults. We identify and analyze the
contents of GTZAN, and provide a catalog of its faults. We review how GTZAN has
been used in MGR research, and find few indications that its faults have been
known and considered. Finally, we rigorously study the effects of its faults on
evaluating five different MGR systems. The lesson is not to banish GTZAN, but
to use it with consideration of its contents.
Comment: 29 pages, 7 figures, 6 tables, 128 references
DEVELOPMENT OF AN EEG BRAIN-MACHINE INTERFACE TO AID IN RECOVERY OF MOTOR FUNCTION AFTER NEUROLOGICAL INJURY
Impaired motor function following neurological injury may be overcome through therapies that induce neuroplastic changes in the brain. Therapeutic methods include repetitive exercises that promote use-dependent plasticity (UDP), the benefit of which may be increased by first administering peripheral nerve stimulation (PNS) to activate afferent fibers, resulting in increased cortical excitability. We speculate that PNS delivered only in response to attempted movement would induce timing-dependent plasticity (TDP), a mechanism essential to normal motor learning. Here we develop a brain-machine interface (BMI) to detect movement intent and effort in healthy volunteers (n=5) from their electroencephalogram (EEG). This could be used in the future to promote TDP by triggering PNS in response to a patient's level of effort in a motor task. Linear classifiers were used to predict state (rest, sham, right, left) based on EEG variables in a handgrip task and to discriminate between three levels of applied force. Mean classification accuracy with out-of-sample data was 54% (23-73%) for tasks and 44% (21-65%) for force. There was a slight but significant correlation (p < 0.001) between sample entropy and force exerted. The results indicate the feasibility of applying PNS in response to motor intent detected from the brain.
Image Classification with Deep Learning in the Presence of Noisy Labels: A Survey
Image classification systems recently made a giant leap with the advancement
of deep neural networks. However, these systems require an excessive amount of
labeled data to be adequately trained. Gathering a correctly annotated dataset
is not always feasible due to several factors, such as the expense of the
labeling process or the difficulty of correctly classifying data, even for
experts. Because of these practical challenges, label noise is a common problem
in real-world datasets, and numerous methods to train deep neural networks with
label noise are proposed in the literature. Although deep neural networks are
known to be relatively robust to label noise, their tendency to overfit data
makes them vulnerable to memorizing even random noise. Therefore, it is crucial
to account for label noise and develop counter-algorithms that mitigate its
adverse effects so that deep neural networks can be trained effectively. Even though
an extensive survey of machine learning techniques under label noise exists,
the literature lacks a comprehensive survey of methodologies centered
explicitly around deep learning in the presence of noisy labels. This paper
aims to present these algorithms while categorizing them into one of two
subgroups: noise-model-based and noise-model-free methods. Algorithms in the
first group aim to estimate the noise structure and use this information to
avoid the adverse effects of noisy labels. In contrast, methods in the second
group aim to devise inherently noise-robust algorithms using approaches such
as robust losses, regularizers, or other learning paradigms
- …
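A well-known representative of the noise-model-based subgroup is forward loss correction: if the noise process is summarized by a transition matrix T with T[i, j] = P(observed label j | true label i), the model's clean-label predictions are mapped through T before computing cross-entropy. The sketch below is a minimal NumPy illustration of that single idea, not of any specific surveyed method; the matrix and logits are made-up examples:

```python
import numpy as np

def softmax(z):
    """Row-wise softmax over logits."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward_corrected_ce(logits, noisy_labels, T):
    """Cross-entropy against the *observed* (noisy) labels, computed on
    predictions pushed through the noise transition matrix T."""
    p_true = softmax(logits)   # model's estimate of the true-label distribution
    p_noisy = p_true @ T       # implied distribution over observed labels
    n = len(noisy_labels)
    return -np.log(p_noisy[np.arange(n), noisy_labels] + 1e-12).mean()

# Example: 3 classes with 20% symmetric label noise
# (each row of T sums to 1; diagonal = 0.8, off-diagonal = 0.1).
T = np.full((3, 3), 0.1) + 0.7 * np.eye(3)

logits = np.array([[4.0, 0.0, 0.0],
                   [0.0, 4.0, 0.0]])
loss_clean_labels = forward_corrected_ce(logits, np.array([0, 1]), T)
loss_wrong_labels = forward_corrected_ce(logits, np.array([1, 0]), T)
```

Confident predictions that agree with the observed labels still incur a lower corrected loss than those that disagree, but the correction keeps the loss from fully penalizing the model when a label was plausibly flipped by the noise process. Noise-model-free methods skip T entirely and instead change the loss itself (e.g., bounded losses) or the training procedure.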