An Exploration of Softmax Alternatives Belonging to the Spherical Loss Family
In a multi-class classification problem, it is standard to model the output
of a neural network as a categorical distribution conditioned on the inputs.
The output must therefore be positive and sum to one, which is traditionally
enforced by a softmax. This probabilistic mapping allows us to use the maximum
likelihood principle, which leads to the well-known log-softmax loss. However,
the choice of the softmax function seems somewhat arbitrary, as there are many
other possible normalizing functions. It is thus unclear why the log-softmax
loss should perform better than other loss alternatives. In particular, Vincent
et al. (2015) recently introduced a class of loss functions, called the
spherical family, for which there exists an efficient algorithm to compute the
updates of the output weights irrespective of the output size. In this paper,
we explore several loss functions from this family as possible alternatives to
the traditional log-softmax. In particular, we focus our investigation on
spherical bounds of the log-softmax loss and on two spherical log-likelihood
losses, namely the log-Spherical Softmax suggested by Vincent et al. (2015) and
the log-Taylor Softmax that we introduce. Although these alternatives do not
yield as good results as the log-softmax loss on two language modeling tasks,
they surprisingly outperform it in our experiments on MNIST and CIFAR-10,
suggesting that they might be relevant in a broad range of applications.
Comment: Published at ICLR 201
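A minimal NumPy sketch (my own illustration, not the authors' code) of the normalizing functions compared above: the standard softmax, the spherical softmax, and the second-order Taylor softmax, each mapping a score vector to a valid probability vector. The small epsilon in the spherical softmax is an assumption added to avoid division by zero.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())            # shift for numerical stability
    return e / e.sum()

def spherical_softmax(z, eps=1e-12):
    s = z ** 2 + eps                   # squared scores are non-negative by construction
    return s / s.sum()

def taylor_softmax(z):
    # second-order Taylor expansion of exp: 1 + z + z^2/2 (strictly positive for all z)
    t = 1.0 + z + 0.5 * z ** 2
    return t / t.sum()

z = np.array([1.0, 2.0, -0.5])
for f in (softmax, spherical_softmax, taylor_softmax):
    p = f(z)
    print(f.__name__, p, p.sum())      # each output sums to one
```

The corresponding log-likelihood loss in each case is the negative log of the target component of the resulting probability vector.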
The Z-loss: a shift and scale invariant classification loss belonging to the Spherical Family
Despite being the standard loss function to train multi-class neural
networks, the log-softmax has two potential limitations. First, it involves
computations that scale linearly with the number of output classes, which can
restrict the size of problems we are able to tackle with current hardware.
Second, it remains unclear how closely it matches the task loss, such as the top-k
error rate or other non-differentiable evaluation metrics, which we ultimately aim
to optimize. In this paper, we introduce an alternative classification
loss function, the Z-loss, which is designed to address these two issues.
Unlike the log-softmax, it has the desirable property of belonging to the
spherical loss family (Vincent et al., 2015), a class of loss functions for
which training can be performed very efficiently with a complexity independent
of the number of output classes. We show experimentally that it significantly
outperforms the other spherical loss functions previously investigated.
Furthermore, we show on a word language modeling task that it also outperforms
the log-softmax with respect to certain ranking scores, such as top-k scores,
suggesting that the Z-loss has the flexibility to better match the task loss.
These qualities thus make the Z-loss an appealing candidate for efficiently
training networks with very large outputs, such as word language models or other
extreme classification problems. On the One Billion Word (Chelba et al., 2014) dataset,
we are able to train a model with the Z-loss 40 times faster than the
log-softmax and more than 4 times faster than the hierarchical softmax.
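The Z-loss itself is not defined in this abstract, so the sketch below (my own illustration) only demonstrates the invariance property it is named after: the ordinary log-softmax loss is invariant to shifting the logits but not to rescaling them.

```python
import numpy as np

def log_softmax_loss(z, target):
    z = z - z.max()                               # constant shifts cancel in the normalization
    return -(z[target] - np.log(np.exp(z).sum()))

z = np.array([2.0, 0.5, -1.0])
print(log_softmax_loss(z, 0), log_softmax_loss(z + 10.0, 0))  # identical: shift invariant
print(log_softmax_loss(z, 0), log_softmax_loss(3.0 * z, 0))   # different: not scale invariant
```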
Exploring Alternatives to Softmax Function
The softmax function is widely used in artificial neural networks for multiclass
classification, multilabel classification, attention mechanisms, etc. However,
its efficacy is often questioned in the literature. The log-softmax loss has been
shown to belong to a more generic class of loss functions, called the spherical
family, and its member, the log-Taylor softmax loss, is arguably the best alternative
in this class. In another approach which tries to enhance the discriminative
nature of the softmax function, soft-margin softmax (SM-softmax) has been
proposed as the most suitable alternative. In this work, we investigate
Taylor softmax, SM-softmax, and our proposed SM-Taylor softmax, an amalgamation
of the two earlier functions, as alternatives to the softmax function. Furthermore,
we explore the effect of expanding the Taylor softmax up to ten terms (the original
work proposed expanding only to two terms), along with the ramifications of
considering Taylor softmax to be a finite or infinite series during
backpropagation. Our experiments for the image classification task on different
datasets reveal that there is always a configuration of the SM-Taylor softmax
function that outperforms the normal softmax function and its other
alternatives.
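A hedged sketch (my own reconstruction, not the paper's code) of an n-term Taylor softmax and a soft-margin variant that subtracts a margin from the target logit before normalizing; the exact SM-Taylor formulation and the margin value are assumptions here.

```python
import numpy as np
from math import factorial

def taylor_softmax_n(z, order=2):
    # even-order truncations of the exp series stay strictly positive
    t = sum(z ** k / factorial(k) for k in range(order + 1))
    return t / t.sum()

def sm_taylor_softmax_loss(z, target, margin=0.3, order=2):
    z = z.copy()
    z[target] -= margin               # soft margin: make the target harder to rank first
    p = taylor_softmax_n(z, order)
    return -np.log(p[target])         # negative log-likelihood of the target class

z = np.array([2.0, 1.0, 0.1])
print(taylor_softmax_n(z, order=4))
print(sm_taylor_softmax_loss(z, target=0))
```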
On Controllable Sparse Alternatives to Softmax
Converting an n-dimensional vector to a probability distribution over n
objects is a commonly used component in many machine learning tasks like
multiclass classification, multilabel classification, attention mechanisms, etc.
For this, several probability mapping functions have been proposed and employed
in the literature, such as softmax, sum-normalization, spherical softmax, and
sparsemax, but there is very little understanding of how they relate to
each other. Further, none of the above formulations offers explicit control
over the degree of sparsity. To address this, we develop a unified framework
that encompasses all these formulations as special cases. This framework
ensures simple closed-form solutions and existence of sub-gradients suitable
for learning via backpropagation. Within this framework, we propose two novel
sparse formulations, sparsegen-lin and sparsehourglass, that seek to provide a
control over the degree of desired sparsity. We further develop novel convex
loss functions that help induce the behavior of aforementioned formulations in
the multilabel classification setting, showing improved performance. We also
demonstrate empirically that the proposed formulations, when used to compute
attention weights, achieve better or comparable performance on standard seq2seq
tasks like neural machine translation and abstractive summarization.
Comment: To appear in NIPS 2018, total 16 pages including appendix
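For reference, a minimal NumPy sketch of sparsemax (Martins & Astudillo, 2016), one of the mappings unified by the framework above; this is my own illustration, not the authors' implementation, and sparsegen-lin and sparsehourglass are not shown.

```python
import numpy as np

def sparsemax(z):
    z_sorted = np.sort(z)[::-1]               # scores in descending order
    cssv = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cssv         # prefix of coordinates kept in the support
    k_z = k[support][-1]
    tau = (cssv[support][-1] - 1) / k_z       # threshold so the result sums to one
    return np.maximum(z - tau, 0.0)           # Euclidean projection onto the simplex

z = np.array([2.0, 1.1, 0.2, -1.0])
p = sparsemax(z)
print(p, p.sum())   # exact zeros for low-scoring entries, unlike softmax
```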
Adaptive Sampled Softmax with Kernel Based Sampling
Softmax is the most commonly used output function for multiclass problems and
is widely used in areas such as vision, natural language processing, and
recommendation. A softmax model has linear costs in the number of classes which
makes it too expensive for many real-world problems. A common approach to speed
up training involves sampling only some of the classes at each training step.
It is known that this method is biased and that the bias increases the more the
sampling distribution deviates from the output distribution. Nevertheless,
almost all recent work uses simple sampling distributions that require a large
sample size to mitigate the bias. In this work, we propose a new class of
kernel based sampling methods and develop an efficient sampling algorithm.
Kernel based sampling adapts to the model as it is trained, thus resulting in
low bias. Kernel based sampling can be easily applied to many models because it
relies only on the model's last hidden layer. We empirically study the
trade-off of bias, sampling distribution and sample size and show that kernel
based sampling results in low bias with few samples.
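A hedged sketch of sampled-softmax training with the standard logQ correction; the uniform proposal distribution and all sizes below are placeholders of my own, whereas the paper's kernel-based sampler adapts the proposal to the model's last hidden layer.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, dim, num_sampled = 50_000, 64, 128

W = rng.normal(size=(num_classes, dim)) * 0.01   # output embedding matrix
h = rng.normal(size=dim)                         # last hidden layer for one example
target = 123

q = np.full(num_classes, 1.0 / num_classes)      # proposal distribution (uniform here)
negatives = rng.choice(num_classes, size=num_sampled, replace=False, p=q)
classes = np.concatenate(([target], negatives))  # duplicates of the target ignored for brevity

logits = W[classes] @ h - np.log(q[classes])     # simplified logQ correction against sampling bias
logits -= logits.max()
log_probs = logits - np.log(np.exp(logits).sum())
loss = -log_probs[0]                             # the target sits at index 0
print(loss)
```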
Stolen Probability: A Structural Weakness of Neural Language Models
Neural Network Language Models (NNLMs) generate probability distributions by
applying a softmax function to a distance metric formed by taking the dot
product of a prediction vector with all word vectors in a high-dimensional
embedding space. The dot-product distance metric forms part of the inductive
bias of NNLMs. Although NNLMs optimize well with this inductive bias, we show
that this results in a sub-optimal ordering of the embedding space that
structurally impoverishes some words at the expense of others when assigning
probability. We present numerical, theoretical and empirical analyses showing
that words on the interior of the convex hull in the embedding space have their
probability bounded by the probabilities of the words on the hull.
Comment: Preprint of paper accepted for ACL-202
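A tiny numerical illustration of the claim (my own construction): an embedding lying strictly inside the convex hull of other embeddings can never receive the highest dot-product logit, and hence never the highest softmax probability, for any prediction vector.

```python
import numpy as np

rng = np.random.default_rng(0)
hull_words = rng.normal(size=(10, 8))            # embeddings forming the hull
weights = rng.dirichlet(np.ones(10))
interior_word = weights @ hull_words             # strict convex combination -> interior point
E = np.vstack([hull_words, interior_word])       # vocabulary matrix; interior word is index 10

wins = 0
for _ in range(10_000):
    h = rng.normal(size=8)                       # random prediction vector
    logits = E @ h                               # softmax argmax equals the logits argmax
    wins += int(np.argmax(logits) == 10)
print("interior word ranked first:", wins, "out of 10000")   # expected: 0
```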
Sigsoftmax: Reanalysis of the Softmax Bottleneck
Softmax is an output activation function for modeling categorical probability
distributions in many applications of deep learning. However, a recent study
revealed that softmax can be a bottleneck of the representational capacity of
neural networks in language modeling (the softmax bottleneck). In this paper,
we propose an output activation function for breaking the softmax bottleneck
without additional parameters. We re-analyze the softmax bottleneck from the
perspective of the output set of log-softmax and identify the cause of the
softmax bottleneck. On the basis of this analysis, we propose sigsoftmax, which
is composed of a multiplication of an exponential function and a sigmoid
function. Sigsoftmax can break the softmax bottleneck. The experiments on
language modeling demonstrate that sigsoftmax and mixture of sigsoftmax
outperform softmax and mixture of softmax, respectively.
Comment: 15 pages, 2 figures
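A minimal sketch of sigsoftmax as described above: the unnormalized score is exp(z) multiplied by sigmoid(z), then normalized (my own illustration, not the authors' code).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigsoftmax(z):
    # the constant shift inside exp cancels in the normalization; sigmoid uses the unshifted scores
    g = np.exp(z - z.max()) * sigmoid(z)
    return g / g.sum()

z = np.array([1.5, 0.3, -2.0])
p = sigsoftmax(z)
print(p, p.sum())
```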
Von Mises-Fisher Loss for Training Sequence to Sequence Models with Continuous Outputs
The Softmax function is used in the final layer of nearly all existing
sequence-to-sequence models for language generation. However, it is usually the
slowest layer to compute, which limits the vocabulary size to a subset of the most
frequent types, and it has a large memory footprint. We propose a general
technique for replacing the softmax layer with a continuous embedding layer.
Our primary innovations are a novel probabilistic loss, and a training and
inference procedure in which we generate a probability distribution over
pre-trained word embeddings, instead of a multinomial distribution over the
vocabulary obtained via softmax. We evaluate this new class of
sequence-to-sequence models with continuous outputs on the task of neural
machine translation. We show that our models obtain up to a 2.5x speed-up in
training time while performing on par with the state-of-the-art models in terms
of translation quality. These models are capable of handling very large
vocabularies without compromising on translation quality. They also produce
more meaningful errors than softmax-based models, as these errors
typically lie in a subspace of the vector space of the reference translations.
Comment: Seventh International Conference on Learning Representations (ICLR 2019)
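A hedged sketch of the continuous-output idea: the decoder emits a vector in a pre-trained embedding space and decoding is nearest-neighbour search. The cosine objective below is a simplified stand-in of my own, not the paper's von Mises-Fisher loss, and all sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 5_000, 300
pretrained = rng.normal(size=(vocab, dim))
pretrained /= np.linalg.norm(pretrained, axis=1, keepdims=True)   # unit-norm word embeddings

def cosine_loss(pred, target_id):
    pred = pred / np.linalg.norm(pred)
    return 1.0 - pred @ pretrained[target_id]       # no sum over the vocabulary is needed

def decode(pred):
    pred = pred / np.linalg.norm(pred)
    return int(np.argmax(pretrained @ pred))        # nearest neighbour by cosine similarity

pred = pretrained[42] + 0.1 * rng.normal(size=dim)  # a noisy prediction near word 42
print(cosine_loss(pred, 42), decode(pred))
```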
AMC-Loss: Angular Margin Contrastive Loss for Improved Explainability in Image Classification
Deep-learning architectures for classification problems involve the
cross-entropy loss, sometimes assisted by auxiliary loss functions like center
loss, contrastive loss and triplet loss. These auxiliary loss functions
facilitate better discrimination between the different classes of interest.
However, recent studies hint that these loss functions do not take
into account the intrinsic angular distribution exhibited by the low-level and
high-level feature representations. This results in less compactness between
samples from the same class and unclear boundary separations between data
clusters of different classes. In this paper, we address this issue by
proposing the use of geometric constraints, rooted in Riemannian geometry.
Specifically, we propose Angular Margin Contrastive Loss (AMC-Loss), a new loss
function to be used along with the traditional cross-entropy loss. The AMC-Loss
employs a discriminative angular distance metric, equivalent to the geodesic
distance on a hypersphere manifold, so that it admits a clear
geometric interpretation. We demonstrate the effectiveness of AMC-Loss by
providing quantitative and qualitative results. We find that although the
proposed geometrically constrained loss function improves quantitative results
only modestly, it has a surprisingly beneficial qualitative effect, increasing
the interpretability of deep-net decisions as seen in the visual explanations
generated by techniques such as Grad-CAM. Our code is available at
https://github.com/hchoi71/AMC-Loss
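A hedged reconstruction (not the released code at the URL above): features are L2-normalized, the geodesic distance on the hypersphere is the arccos of their cosine similarity, and a contrastive penalty with an angular margin is applied; the margin value is an assumption. In practice this term would be added to the usual cross-entropy loss.

```python
import numpy as np

def amc_pair_loss(f1, f2, same_class, margin=0.5):
    f1, f2 = f1 / np.linalg.norm(f1), f2 / np.linalg.norm(f2)
    theta = np.arccos(np.clip(f1 @ f2, -1.0, 1.0))   # geodesic distance on the unit hypersphere
    if same_class:
        return theta ** 2                            # pull same-class features together
    return max(0.0, margin - theta) ** 2             # push different classes beyond the angular margin

rng = np.random.default_rng(0)
a, b = rng.normal(size=64), rng.normal(size=64)
print(amc_pair_loss(a, b, same_class=True), amc_pair_loss(a, b, same_class=False))
```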
DropMax: Adaptive Variational Softmax
We propose DropMax, a stochastic version of the softmax classifier which, at each
iteration drops non-target classes according to dropout probabilities
adaptively decided for each instance. Specifically, we overlay binary masking
variables over class output probabilities, which are input-adaptively learned
via variational inference. This stochastic regularization has an effect of
building an ensemble classifier out of exponentially many classifiers with
different decision boundaries. Moreover, the learning of dropout rates for
non-target classes on each instance allows the classifier to focus more on
classification against the most confusing classes. We validate our model on
multiple public datasets for classification, on which it obtains significantly
improved accuracy over the regular softmax classifier and other baselines.
Further analysis of the learned dropout probabilities shows that our model
indeed selects confusing classes more often when it performs classification.
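A simplified, hedged sketch of the DropMax idea at training time: non-target classes are randomly dropped from the softmax according to per-instance keep probabilities. Here the keep probabilities are fixed constants of my own choosing, whereas in the paper they are input-adaptive and learned via variational inference.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropmax_loss(logits, target, keep_prob):
    keep = rng.random(len(logits)) < keep_prob     # Bernoulli masks over classes
    keep[target] = True                            # the target class is never dropped
    masked = np.where(keep, logits, -np.inf)       # dropped classes receive zero probability
    masked = masked - masked[keep].max()           # stabilize before exponentiating
    log_z = np.log(np.exp(masked[keep]).sum())
    return -(masked[target] - log_z)               # cross-entropy over the retained classes

logits = np.array([2.0, 1.5, 0.3, -0.7, 1.9])
print(dropmax_loss(logits, target=0, keep_prob=np.full(5, 0.5)))
```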