A Modern Take on the Bias-Variance Tradeoff in Neural Networks
The bias-variance tradeoff tells us that as model complexity increases, bias
falls and variance increases, leading to a U-shaped test error curve. However,
recent empirical results with over-parameterized neural networks are marked by
a striking absence of the classic U-shaped test error curve: test error keeps
decreasing in wider networks. This suggests that there might not be a
bias-variance tradeoff in neural networks with respect to network width, contrary
to what was originally claimed by, e.g., Geman et al. (1992). Motivated by the shaky
evidence used to support this claim in neural networks, we measure bias and
variance in the modern setting. We find that both bias and variance can
decrease as the number of parameters grows. To better understand this, we
introduce a new decomposition of the variance to disentangle the effects of
optimization and data sampling. We also provide theoretical analysis in a
simplified setting that is consistent with our empirical findings.
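As a rough illustration of how such measurements can be made (a toy setup, not the paper's protocol; the regression task, model, and trial count below are assumptions), bias and variance under squared loss can be estimated by training several models on independently resampled training sets and decomposing the average test error. The paper's finer decomposition would further split the variance between the randomness of the training sample and that of optimization (e.g., initialization seeds).

```python
# Illustrative bias/variance estimate under squared loss (toy setup, not the paper's).
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(3 * x)                 # assumed "true" function
x_test = np.linspace(-1, 1, 200).reshape(-1, 1)
preds = []

for trial in range(20):                          # independent training sets (and seeds)
    x_train = rng.uniform(-1, 1, size=(50, 1))
    y_train = true_f(x_train).ravel() + rng.normal(0, 0.1, size=50)
    model = MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000,
                         random_state=trial).fit(x_train, y_train)
    preds.append(model.predict(x_test))

preds = np.stack(preds)                          # shape: (trials, test points)
mean_pred = preds.mean(axis=0)
bias_sq = np.mean((mean_pred - true_f(x_test).ravel()) ** 2)
variance = np.mean(preds.var(axis=0))
print(f"bias^2 ~ {bias_sq:.4f}, variance ~ {variance:.4f}")
```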
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
The rising popularity of intelligent mobile devices and the daunting
computational cost of deep learning-based models call for efficient and
accurate on-device inference schemes. We propose a quantization scheme that
allows inference to be carried out using integer-only arithmetic, which can be
implemented more efficiently than floating point inference on commonly
available integer-only hardware. We also co-design a training procedure to
preserve end-to-end model accuracy post quantization. As a result, the proposed
quantization scheme improves the tradeoff between accuracy and on-device
latency. The improvements are significant even on MobileNets, a model family
known for run-time efficiency, and are demonstrated in ImageNet classification
and COCO detection on popular CPUs.
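As a rough sketch of the kind of scheme described (one common affine quantization formulation, r ≈ scale · (q − zero_point) with 8-bit unsigned integers; the paper's scheme may differ in details such as per-channel scales):

```python
# Illustrative affine quantization to uint8 and back (per-tensor scale/zero-point).
import numpy as np

def quantize(x, num_bits=8):
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(np.clip(round(qmin - x.min() / scale), qmin, qmax))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

x = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = quantize(x)
print("max quantization error:", np.abs(x - dequantize(q, scale, zp)).max())
```

In integer-only inference, matrix multiplications are then performed directly on the q values, with the scales folded into a fixed-point requantization step rather than materializing floating-point intermediates.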
An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks
Catastrophic forgetting is a problem faced by many machine learning models
and algorithms. When trained on one task, then trained on a second task, many
machine learning models "forget" how to perform the first task. This is widely
believed to be a serious problem for neural networks. Here, we investigate the
extent to which the catastrophic forgetting problem occurs for modern neural
networks, comparing both established and recent gradient-based training
algorithms and activation functions. We also examine the effect of the
relationship between the first task and the second task on catastrophic
forgetting. We find that it is always best to train using the dropout
algorithm: dropout is consistently best at adapting to the new task and at
remembering the old task, and it has the best tradeoff curve between these
two extremes. We also find that different tasks and relationships between tasks
result in very different rankings of activation function performance. This
suggests that the choice of activation function should always be cross-validated.
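A minimal sketch of the measurement protocol this describes (the permuted-pixel tasks, model, and training lengths are illustrative assumptions, not the paper's setup): train on a first task, record its test accuracy, continue training on a second task, then re-evaluate the first task to quantify forgetting.

```python
# Illustrative catastrophic-forgetting measurement with two permuted-input tasks.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X, y = load_digits(return_X_y=True)
perm = rng.permutation(X.shape[1])

def make_task(permute):
    Xt = X[:, perm] if permute else X
    return train_test_split(Xt, y, test_size=0.3, random_state=0)

Xa_tr, Xa_te, ya_tr, ya_te = make_task(False)    # task A: plain digits
Xb_tr, Xb_te, yb_tr, yb_te = make_task(True)     # task B: pixel-permuted digits

clf = MLPClassifier(hidden_layer_sizes=(128,))
classes = np.unique(y)
for _ in range(30):                              # train on task A
    clf.partial_fit(Xa_tr, ya_tr, classes=classes)
acc_before = clf.score(Xa_te, ya_te)

for _ in range(30):                              # then train on task B
    clf.partial_fit(Xb_tr, yb_tr, classes=classes)
acc_after = clf.score(Xa_te, ya_te)
print(f"task A accuracy: {acc_before:.3f} -> {acc_after:.3f} (forgetting)")
```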
Reconciling modern machine learning practice and the bias-variance trade-off
Breakthroughs in machine learning are rapidly changing science and society,
yet our fundamental understanding of this technology has lagged far behind.
Indeed, one of the central tenets of the field, the bias-variance trade-off,
appears to be at odds with the observed behavior of methods used in modern
machine learning practice. The bias-variance trade-off implies that a model
should balance under-fitting and over-fitting: rich enough to express
underlying structure in data, simple enough to avoid fitting spurious patterns.
However, in modern practice, very rich models such as neural networks are
trained to exactly fit (i.e., interpolate) the data. Classically, such models
would be considered over-fit, and yet they often obtain high accuracy on test
data. This apparent contradiction has raised questions about the mathematical
foundations of machine learning and their relevance to practitioners.
In this paper, we reconcile the classical understanding and the modern
practice within a unified performance curve. This "double descent" curve
subsumes the textbook U-shaped bias-variance trade-off curve by showing how
increasing model capacity beyond the point of interpolation results in improved
performance. We provide evidence for the existence and ubiquity of double
descent for a wide spectrum of models and datasets, and we posit a mechanism
for its emergence. This connection between the performance and the structure of
machine learning models delineates the limits of classical analyses, and has
implications for both the theory and practice of machine learning.
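A small illustration of how a double-descent curve can be reproduced (a toy construction, not the paper's experiments): fit minimum-norm least squares on random Fourier features and sweep the number of features past the interpolation point of roughly one feature per training sample.

```python
# Illustrative double descent with random Fourier features and min-norm least squares.
import numpy as np

rng = np.random.default_rng(0)
n_train = 40
x_tr = rng.uniform(-1, 1, n_train)
x_te = rng.uniform(-1, 1, 500)
f = lambda x: np.sin(4 * x)
y_tr = f(x_tr) + rng.normal(0, 0.2, n_train)
y_te = f(x_te)

W = rng.normal(0, 5, 2000)                       # shared random frequencies
b = rng.uniform(0, 2 * np.pi, 2000)

def features(x, n_feat):
    return np.cos(np.outer(x, W[:n_feat]) + b[:n_feat]) / np.sqrt(n_feat)

for n_feat in [5, 10, 20, 40, 80, 200, 1000]:
    Phi_tr, Phi_te = features(x_tr, n_feat), features(x_te, n_feat)
    coef, *_ = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)   # min-norm solution
    err = np.mean((Phi_te @ coef - y_te) ** 2)
    print(f"{n_feat:5d} features  test MSE = {err:.3f}")
```

In constructions like this, the test error typically peaks near the interpolation threshold (here around 40 features, one per training sample) and then descends again as the feature count keeps growing.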
Wireless Data Acquisition for Edge Learning: Data-Importance Aware Retransmission
By deploying machine-learning algorithms at the network edge, edge learning
can leverage the enormous real-time data generated by billions of mobile
devices to train AI models, which enable intelligent mobile applications. In
this emerging research area, one key direction is to efficiently utilize radio
resources for wireless data acquisition to minimize the latency of executing a
learning task at an edge server. Along this direction, we consider the specific
problem of retransmission decision in each communication round to ensure both
reliability and quantity of those training data for accelerating model
convergence. To solve the problem, a new retransmission protocol called
data-importance aware automatic-repeat-request (importance ARQ) is proposed.
Unlike the classic ARQ, which focuses merely on reliability, importance ARQ
selectively retransmits a data sample based on its uncertainty, which indicates
how much the sample helps learning and can be measured using the model under
training. Underpinning the proposed protocol is a derived communication-learning
relation between two corresponding metrics, i.e., signal-to-noise ratio (SNR)
and data uncertainty. This relation facilitates the design of a simple threshold-based
policy for importance ARQ. The policy is first derived based on the classic
classifier model of support vector machine (SVM), where the uncertainty of a
data sample is measured by its distance to the decision boundary. The policy is
then extended to the more complex model of convolutional neural networks (CNN)
where data uncertainty is measured by entropy. Extensive experiments have been
conducted for both the SVM and CNN using real datasets with balanced and
imbalanced distributions. Experimental results demonstrate that importance ARQ
effectively copes with channel fading and noise in wireless data acquisition to
achieve faster model convergence than the conventional channel-aware ARQ.
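A rough sketch of what a threshold-based importance-ARQ decision could look like (the policy form, threshold values, and helper names are illustrative assumptions based on the description above, not the paper's derived rule): a sample keeps being retransmitted while its received signal quality falls short of a target that grows with the model's uncertainty about that sample.

```python
# Illustrative importance-ARQ decision: retransmit while reliability is below a
# target that scales with data uncertainty (entropy of the model's prediction).
import numpy as np

def entropy_uncertainty(probs):
    """Predictive-entropy uncertainty for a classifier (the CNN case above)."""
    probs = np.clip(probs, 1e-12, 1.0)
    return float(-np.sum(probs * np.log(probs)))

def should_retransmit(accumulated_snr, uncertainty,
                      base_snr_target=5.0, gain=10.0, n_retx=0, max_retx=4):
    """Assumed rule: uncertain (important) samples demand a higher SNR target."""
    snr_target = base_snr_target + gain * uncertainty
    return n_retx < max_retx and accumulated_snr < snr_target

# Example: the model is unsure about this sample, so a marginal SNR of 6 is not enough.
probs = np.array([0.4, 0.35, 0.25])
print(should_retransmit(accumulated_snr=6.0, uncertainty=entropy_uncertainty(probs)))
```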