Breaking the Softmax Bottleneck via Learnable Monotonic Pointwise Non-linearities
The Softmax function on top of a final linear layer is the de facto method to
output probability distributions in neural networks. In many applications, such
as language modeling or text generation, this model must produce distributions
over large output vocabularies. Recently, this has been shown to have limited
representational capacity due to its connection with the rank bottleneck in
matrix factorization. However, little is known about the limitations of
Linear-Softmax for quantities of practical interest such as cross entropy or
mode estimation, a direction that we explore here. As an efficient and
effective solution to alleviate this issue, we propose to learn parametric
monotonic functions on top of the logits. We theoretically investigate the
rank-increasing capabilities of such monotonic functions. Empirically, our
method improves on two quality metrics over the traditional Linear-Softmax
layer in synthetic and real language model experiments, adding little time or
memory overhead, while remaining comparable to the more computationally
expensive mixture of Softmaxes.
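As a rough illustration of the idea, here is a minimal PyTorch sketch that
applies a learnable, strictly increasing pointwise function to the logits
before the Softmax. The particular parameterization (a residual sum of shifted
tanh units with non-negative weights) and all names below are illustrative
assumptions, not the paper's exact construction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotonicPointwise(nn.Module):
    """Learnable strictly increasing pointwise function applied to each logit.

    Assumed parameterization for illustration:
        f(z) = z + sum_k w_k * tanh(z - b_k),  w_k >= 0.
    Each weighted tanh term is non-decreasing in z and the identity term is
    strictly increasing, so f is strictly increasing by construction.
    """
    def __init__(self, num_units: int = 8):
        super().__init__()
        self.raw_weight = nn.Parameter(torch.zeros(num_units))           # softplus -> w_k >= 0
        self.shift = nn.Parameter(torch.linspace(-3.0, 3.0, num_units))  # knot locations b_k

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        w = F.softplus(self.raw_weight)                   # non-negative weights
        basis = torch.tanh(z.unsqueeze(-1) - self.shift)  # (..., vocab, K), monotone in z
        return z + (w * basis).sum(dim=-1)                # pointwise monotone warp of logits

# Usage: warp the logits of the final linear layer before the Softmax.
hidden = torch.randn(2, 512)
linear = nn.Linear(512, 10000)   # vocabulary-sized output layer
mono = MonotonicPointwise()
probs = torch.softmax(mono(linear(hidden)), dim=-1)
```

Because the warp is monotone, the ranking of logits (and hence the mode) is
preserved, while the resulting log-probability matrix is no longer constrained
to the low rank of the linear layer.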
Rethinking Channel Dimensions for Efficient Model Design
Designing an efficient model within a limited computational budget is
challenging. We argue that the accuracy of a lightweight model has been
further limited by the design convention: a stage-wise configuration of the
channel dimensions, which resembles a piecewise linear function of the network
stage. In this paper, we study a channel dimension configuration that yields
better performance than this convention. To this end, we empirically study how
to design a single layer properly by analyzing the rank of its output
features. We then investigate a model's channel configuration by searching
network architectures over channel configurations under a computational cost
restriction. Based on this investigation, we propose a simple yet effective
channel configuration that can be parameterized by the layer index. As a
result, our proposed model following the channel parameterization achieves
remarkable performance on ImageNet classification and transfer learning tasks
including COCO object detection, COCO instance segmentation, and fine-grained
classifications. Code and ImageNet pretrained models are available at
https://github.com/clovaai/rexnet.
Comment: 13 pages, 8 figures, CVPR 2021
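As a small illustration of a channel configuration parameterized by the layer
index, the sketch below generates per-layer widths that grow linearly with
depth, in contrast to a stage-wise (piecewise constant) schedule. The endpoint
widths and the rounding to multiples of 8 are illustrative assumptions, not
the paper's searched values.

```python
# Minimal sketch of a channel schedule parameterized by the layer index,
# assuming linear growth from c_in to c_out. The constants are illustrative,
# not the paper's searched configuration.
def linear_channel_schedule(num_layers: int, c_in: int = 16, c_out: int = 184) -> list[int]:
    """Return one channel dimension per layer, growing linearly with depth."""
    step = (c_out - c_in) / max(num_layers - 1, 1)
    # Round each width to a multiple of 8 for hardware-friendly tensor shapes.
    return [int(round((c_in + step * i) / 8) * 8) for i in range(num_layers)]

# Prints 16 monotonically increasing widths from 16 up to 184.
print(linear_channel_schedule(16))
```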