Robust Loss Functions under Label Noise for Deep Neural Networks
In many applications of classifier learning, training data suffers from label
noise. Deep networks are trained on huge data sets, where the problem of
noisy labels is particularly relevant. The current techniques proposed for
learning deep networks under label noise focus on modifying the network
architecture and on algorithms for estimating true labels from noisy labels. An
alternate approach would be to look for loss functions that are inherently
noise-tolerant. For binary classification there exist theoretical results on
loss functions that are robust to label noise. In this paper, we provide some
sufficient conditions on a loss function so that risk minimization under that
loss function would be inherently tolerant to label noise for multiclass
classification problems. These results generalize the existing results on
noise-tolerant loss functions for binary classification. We study some of the
widely used loss functions in deep networks and show that the loss function
based on mean absolute value of error is inherently robust to label noise. Thus
standard backpropagation is enough to learn the true classifier even under
label noise. Through experiments, we illustrate the robustness of risk
minimization with such loss functions for learning neural networks.

Comment: Appeared in AAAI 201
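The noise-tolerance property behind this result is a symmetry condition: summing the loss over all possible labels gives a constant independent of the prediction. The sketch below (a minimal NumPy illustration, not the paper's code; function names are our own) checks numerically that the multiclass mean-absolute-error loss satisfies this, summing to 2K - 2 for K classes.

```python
import numpy as np

def mae_loss(probs, label, num_classes):
    """Mean absolute error between predicted class probabilities
    and the one-hot encoding of `label`."""
    one_hot = np.eye(num_classes)[label]
    return np.abs(probs - one_hot).sum()

rng = np.random.default_rng(0)
K = 5
logits = rng.normal(size=K)
probs = np.exp(logits) / np.exp(logits).sum()  # softmax prediction

# Symmetry condition: the loss summed over all K labels is the
# constant 2K - 2, no matter what `probs` is. This is the sufficient
# condition under which risk minimization tolerates label noise.
total = sum(mae_loss(probs, j, K) for j in range(K))
print(total)  # 2*K - 2 = 8.0 for any probability vector
```

Cross-entropy fails this check (its sum over labels depends on the prediction), which is consistent with its known sensitivity to noisy labels.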
Biologically Inspired Oscillating Activation Functions Can Bridge the Performance Gap between Biological and Artificial Neurons
Nonlinear activation functions endow neural networks with the ability to
learn complex high-dimensional functions. The choice of activation function is
a crucial hyperparameter that determines the performance of deep neural
networks. It significantly affects the gradient flow, speed of training and
ultimately the representation power of the neural network. Saturating
activation functions like sigmoids suffer from the vanishing gradient problem
and cannot be used in deep neural networks. Universal approximation theorems
guarantee that multilayer networks of sigmoids and ReLU can learn arbitrarily
complex continuous functions to any accuracy. Despite the ability of multilayer
neural networks to learn arbitrarily complex functions, each neuron
in a conventional neural network (networks using sigmoids and ReLU like
activations) has a single hyperplane as its decision boundary and hence makes a
linear classification. Thus single neurons with sigmoidal, ReLU, Swish, and
Mish activation functions cannot learn the XOR function. Recent research has
discovered biological neurons in layers two and three of the human cortex
having oscillating activation functions and capable of individually learning
the XOR function. The presence of oscillating activation functions in
biological neurons might partially explain the performance gap between
biological and artificial neural networks. This paper proposes four new
oscillating activation functions which enable individual neurons to learn the
XOR function without manual feature engineering. The paper explores the
possibility of using oscillating activation functions to solve classification
problems with fewer neurons and to reduce training time.
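A single oscillating-activation neuron solving XOR can be demonstrated directly. The sketch below uses the Growing Cosine Unit z·cos(z), one oscillating activation from this line of work; the weights, bias, and threshold are hand-picked for illustration, not learned and not taken from the paper.

```python
import math

def gcu(z):
    """Growing Cosine Unit, an oscillating activation: z * cos(z)."""
    return z * math.cos(z)

def neuron(x1, x2):
    # A single neuron with weights (1, 1) and bias 0. Because the
    # activation oscillates, the neuron's decision boundary is a union
    # of hyperplanes, so it can represent XOR -- impossible for a
    # single sigmoid or ReLU neuron, whose boundary is one hyperplane.
    z = 1.0 * x1 + 1.0 * x2
    return 1 if gcu(z) > 0.25 else 0

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), "->", neuron(x1, x2))
# Outputs 0, 1, 1, 0: the XOR truth table from one neuron.
```

The inputs map to pre-activations z = 0, 1, 1, 2, and gcu gives 0, 0.54, 0.54, -0.83; thresholding at 0.25 separates the middle two points from the outer two, which no single linear threshold on z can do.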
Correlations of random classifiers on large data sets
Classification of large data sets by feedforward neural networks is investigated. To deal with unmanageably large sets of classification tasks, a probabilistic model of their relevance is considered. Optimization of networks computing randomly chosen classifiers is studied in terms of correlations of classifiers with network input–output functions. Effects of increasing sizes of sets of data to be classified are analyzed using geometrical properties of high-dimensional spaces. Consequences for the concentration of values of sufficiently smooth functions of random variables around their means are applied. It is shown that the critical factor for suitability of a class of networks for computing randomly chosen classifiers is the maximum of the mean values of their correlations with network input–output functions. To include cases in which function values are not independent, the method of bounded differences is exploited.
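The concentration phenomenon the abstract relies on can be illustrated empirically. The sketch below (an illustrative simulation with names of our own choosing, not the paper's setup) estimates the spread of correlations between a fixed function and uniformly random binary classifiers on m points; the spread shrinks on the order of 1/sqrt(m) as the data set grows.

```python
import random
import statistics

def correlation_spread(m, trials=2000, seed=0):
    """Standard deviation of the correlation <f, g>/m between a fixed
    classifier f = (1, ..., 1) and uniformly random +/-1 classifiers g
    on m data points."""
    rng = random.Random(seed)
    corrs = []
    for _ in range(trials):
        g = [rng.choice((-1, 1)) for _ in range(m)]
        corrs.append(sum(g) / m)  # f is all-ones, so <f, g>/m = mean(g)
    return statistics.stdev(corrs)

# As m grows, correlations of random classifiers concentrate sharply
# around their mean value 0 -- the high-dimensional effect the paper
# exploits.
for m in (100, 10000):
    print(m, correlation_spread(m))
```

With m = 100 the spread is near 0.1, and with m = 10000 near 0.01, matching the 1/sqrt(m) concentration rate.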