Machine learning problems rely heavily on stochastic gradient descent (SGD)
for optimization. The effectiveness of SGD is contingent upon accurately
estimating gradients from a mini-batch of data samples. Instead of the commonly
used uniform sampling, adaptive or importance sampling reduces noise in
gradient estimation by forming mini-batches that prioritize crucial data
points. Previous research has suggested that data points should be selected
with probabilities proportional to their gradient norm. Nevertheless, existing
algorithms have struggled to efficiently integrate importance sampling into
machine learning frameworks. In this work, we make two contributions. First, we
present an algorithm that can incorporate existing importance functions into
our framework. Second, we propose a simplified importance function that relies
solely on the loss gradient of the output layer. By leveraging our proposed
gradient estimation techniques, we observe improved convergence in
classification and regression tasks with minimal computational overhead. We
validate the effectiveness of our adaptive and importance-sampling approach on
image and point-cloud datasets.
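
As a rough illustration of the gradient-norm and output-layer ideas summarized above, the sketch below samples mini-batch elements with probabilities proportional to the norm of the loss gradient at the output layer (p - y for softmax cross-entropy) and reweights the sampled gradients by inverse probabilities so the estimate stays unbiased. It assumes a simple linear softmax model; the function names, hyperparameters, and overall structure are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (not the authors' code): importance sampling of mini-batches
# using the norm of the output-layer loss gradient as a cheap importance proxy.
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def output_layer_importance(X, Y_onehot, W):
    """Per-sample importance: norm of dL/dz at the output layer.
    For softmax cross-entropy, dL/dz = p - y."""
    p = softmax(X @ W)
    return np.linalg.norm(p - Y_onehot, axis=1) + 1e-12  # avoid zero probabilities

def importance_sgd_step(X, Y_onehot, W, lr=0.1, batch_size=32):
    n = X.shape[0]
    w = output_layer_importance(X, Y_onehot, W)
    q = w / w.sum()                              # sampling distribution over data points
    idx = rng.choice(n, size=batch_size, p=q)    # importance-sampled mini-batch
    iw = 1.0 / (n * q[idx])                      # inverse-probability weights (unbiasedness)
    p = softmax(X[idx] @ W)
    # Reweighted mini-batch gradient of cross-entropy w.r.t. W for a linear model.
    grad = X[idx].T @ ((p - Y_onehot[idx]) * iw[:, None]) / batch_size
    return W - lr * grad

# Toy usage: 3-class classification on random data.
n, d, c = 512, 20, 3
X = rng.normal(size=(n, d))
Y = np.eye(c)[rng.integers(0, c, size=n)]
W = np.zeros((d, c))
for _ in range(100):
    W = importance_sgd_step(X, Y, W)
```

In this toy setup the importance proxy is exact, since the output-layer gradient is the only gradient; for a deep network it serves as an inexpensive surrogate for the full per-sample gradient norm, which is the point of restricting the importance function to the output layer.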