Layer Freezing & Data Sieving: Missing Pieces of a Generic Framework for Sparse Training
Recently, sparse training has emerged as a promising paradigm for efficient
deep learning on edge devices. The current research mainly devotes efforts to
reducing training costs by further increasing model sparsity. However,
increasing sparsity is not always ideal since it will inevitably introduce
severe accuracy degradation at an extremely high sparsity level. This paper
intends to explore other possible directions to effectively and efficiently
reduce sparse training costs while preserving accuracy. To this end, we
investigate two techniques, namely, layer freezing and data sieving. First, the
layer freezing approach has shown its success in dense model training and
fine-tuning, yet it has never been adopted in the sparse training domain.
Nevertheless, the unique characteristics of sparse training may hinder the
incorporation of layer freezing techniques. Therefore, we analyze the
feasibility and potentiality of using the layer freezing technique in sparse
training and find it has the potential to save considerable training costs.
Second, we propose a data sieving method for dataset-efficient training, which
further reduces training costs by ensuring only a partial dataset is used
throughout the entire training process. We show that both techniques can be
well incorporated into the sparse training algorithm to form a generic
framework, which we dub SpFDE. Our extensive experiments demonstrate that SpFDE
can significantly reduce training costs while preserving accuracy from three
dimensions: weight sparsity, layer freezing, and data sieving.
Comment: Published in the 36th Conference on Neural Information Processing Systems
(NeurIPS 2022).
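The abstract describes the framework at a high level only, so the sketch below is a minimal illustration of the two ideas in a plain PyTorch training loop: data sieving is approximated by keeping a random 70% subset of CIFAR-10 for the whole run, and layer freezing by a simple epoch-based schedule over ResNet-18 layer groups. The model, subset ratio, and freezing schedule are assumptions for illustration, not SpFDE's actual configuration, and the weight-sparsity component is omitted.

```python
# Minimal sketch of layer freezing and data sieving in a training loop.
# The model, sieving ratio, and freezing schedule are illustrative assumptions,
# not SpFDE's actual settings; the sparsity mask itself is omitted.
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader, Subset

model = torchvision.models.resnet18(num_classes=10)
dataset = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=T.ToTensor())

# Data sieving: keep only a fixed fraction of the dataset for the entire run.
sieve_ratio = 0.7  # assumed value
keep = torch.randperm(len(dataset))[: int(sieve_ratio * len(dataset))]
loader = DataLoader(Subset(dataset, keep.tolist()), batch_size=128, shuffle=True)

# Layer freezing: progressively freeze early layer groups as training proceeds.
freeze_schedule = {10: "layer1", 20: "layer2"}  # epoch -> module to freeze (assumed)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
criterion = nn.CrossEntropyLoss()

for epoch in range(30):
    if epoch in freeze_schedule:
        for p in getattr(model, freeze_schedule[epoch]).parameters():
            p.requires_grad_(False)  # frozen layers no longer receive gradients
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```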
Dynamic Sparse Training via Balancing the Exploration-Exploitation Trade-off
Over-parameterization of deep neural networks (DNNs) has shown high
prediction accuracy for many applications. Although effective, the large number
of parameters hinders its popularity on resource-limited devices and has an
outsize environmental impact. Sparse training (using a fixed number of nonzero
weights in each iteration) could significantly mitigate the training costs by
reducing the model size. However, existing sparse training methods mainly use
either random-based or greedy-based drop-and-grow strategies, which tend to get
trapped in local minima and yield low accuracy. In this work, we treat dynamic
sparse training as a sparse connectivity search problem and design an
exploitation-and-exploration acquisition function to escape from local optima
and saddle points. We further provide theoretical guarantees for the proposed
method and clarify its convergence properties.
Experimental results show that sparse models (up to 98% sparsity) obtained by
our proposed method outperform the SOTA sparse training methods on a wide
variety of deep learning tasks. On VGG-19 / CIFAR-100, ResNet-50 / CIFAR-10, and
ResNet-50 / CIFAR-100, our method even achieves higher accuracy than the dense models.
On ResNet-50 / ImageNet, the proposed method yields up to an 8.2% accuracy
improvement over SOTA sparse training methods.
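As a rough illustration of the drop-and-grow mechanism discussed above, the sketch below drops the weakest active connections in a layer and grows inactive ones using a hypothetical acquisition-style score that mixes gradient magnitude (exploitation) with random noise (exploration). The score, the mixing weight, and the function name are assumptions for illustration, not the paper's actual acquisition function.

```python
# One drop-and-grow update for a single layer in dynamic sparse training.
# The acquisition score below (gradient magnitude + scaled noise) is an
# illustrative stand-in for the paper's exploitation/exploration criterion.
import torch

def drop_and_grow(weight, mask, grad, update_frac=0.1, explore_weight=0.5):
    """Drop the smallest-magnitude active weights, then grow inactive ones by score."""
    n_update = int(update_frac * int(mask.sum()))

    # Drop: remove the weakest currently-active connections.
    active_mag = torch.where(mask.bool(), weight.abs(),
                             torch.full_like(weight, float("inf")))
    drop_idx = torch.topk(active_mag.flatten(), n_update, largest=False).indices
    mask.view(-1)[drop_idx] = 0.0

    # Grow: score inactive connections; larger explore_weight favours random picks.
    score = grad.abs() + explore_weight * torch.rand_like(grad)
    score = torch.where(mask.bool(), torch.full_like(score, -float("inf")), score)
    grow_idx = torch.topk(score.flatten(), n_update, largest=True).indices
    mask.view(-1)[grow_idx] = 1.0
    weight.data.view(-1)[grow_idx] = 0.0  # newly grown connections start at zero

    weight.data *= mask  # keep the weight tensor consistent with the mask
    return mask
```

A typical dynamic sparse training loop would call such an update every few hundred steps per layer, re-using the gradients from a regular backward pass, so the number of nonzero weights stays fixed while the connectivity pattern keeps moving.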
The Unreasonable Effectiveness of Random Pruning: Return of the Most Naive Baseline for Sparse Training
Random pruning is arguably the most naive way to attain sparsity in neural
networks, but it has been deemed uncompetitive compared with either post-training pruning or
sparse training. In this paper, we focus on sparse training and highlight a
perhaps counter-intuitive finding, that random pruning at initialization can be
quite powerful for the sparse training of modern neural networks. Without any
delicate pruning criteria or carefully pursued sparsity structures, we
empirically demonstrate that sparsely training a randomly pruned network from
scratch can match the performance of its dense equivalent. There are two key
factors that contribute to this revival: (i) network size matters: as the
original dense networks grow wider and deeper, the performance of training a
randomly pruned sparse network quickly approaches that of its dense
equivalent, even at high sparsity ratios; (ii) appropriate layer-wise sparsity
ratios can be pre-chosen for sparse training, which proves to be another
important performance booster. Simple as it looks, a randomly pruned subnetwork
of Wide ResNet-50 can be sparsely trained to outperform a dense Wide
ResNet-50 on ImageNet. We also observe that such randomly pruned networks
outperform their dense counterparts in other favorable aspects, such as
out-of-distribution detection, uncertainty estimation, and adversarial
robustness. Overall, our results strongly suggest there is larger-than-expected
room for sparse training at scale, and the benefits of sparsity might be more
universal beyond carefully designed pruning. Our source code can be found at
https://github.com/VITA-Group/Random_Pruning.
Comment: Published as a conference paper at ICLR 2022. Code is available at
https://github.com/VITA-Group/Random_Pruning.
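A minimal sketch of what random pruning at initialization with pre-chosen layer-wise sparsity ratios can look like is given below; the layer names and ratios are placeholders for illustration, not the specific configurations (e.g., ERK-style ratios) evaluated in the paper.

```python
# Random pruning at initialization: each targeted layer gets a fixed,
# pre-chosen sparsity and a random binary mask that is kept for the whole run.
# Layer names and sparsity values here are illustrative placeholders.
import torch
import torch.nn as nn

def random_prune_at_init(model, layer_sparsity):
    """layer_sparsity maps parameter-name substrings to a sparsity in [0, 1)."""
    masks = {}
    for name, param in model.named_parameters():
        sparsity = next((s for key, s in layer_sparsity.items() if key in name), None)
        if sparsity is None:
            continue  # e.g., leave biases dense
        mask = (torch.rand_like(param) >= sparsity).float()  # keep ~(1 - sparsity)
        param.data *= mask
        masks[name] = mask
    return masks

def apply_masks(model, masks):
    """Re-apply the masks (e.g., after each optimizer step) so the network stays static-sparse."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param.mul_(masks[name])

model = nn.Sequential(nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 10))
masks = random_prune_at_init(model, {"0.weight": 0.9, "2.weight": 0.5})
```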
Sparse Training Theory for Scalable and Efficient Agents
A fundamental task for artificial intelligence is learning. Deep Neural
Networks have proven able to cope with all major learning paradigms, i.e.,
supervised, unsupervised, and reinforcement learning. Nevertheless, traditional
deep learning approaches rely on cloud computing facilities and do not
scale well to autonomous agents with low computational resources. Even in the
cloud, they suffer from computational and memory limitations, and they cannot
adequately model large physical worlds for agents that would require
networks with billions of neurons. These issues have been addressed in the last few
years by the emerging topic of sparse training, which trains sparse networks
from scratch. This paper discusses the state of the art in sparse training, its
challenges and limitations, while introducing a couple of new theoretical
research directions that have the potential to alleviate sparse training
limitations and push deep learning scalability well beyond its current
boundaries. Finally, the impact of these theoretical advancements in complex
multi-agent settings is discussed from a real-world perspective, using a
smart grid case study.
Do We Actually Need Dense Over-Parameterization? In-Time Over-Parameterization in Sparse Training
In this paper, we introduce a new perspective on training deep neural
networks capable of state-of-the-art performance without the need for the
expensive over-parameterization by proposing the concept of In-Time
Over-Parameterization (ITOP) in sparse training. By starting from a random
sparse network and continuously exploring sparse connectivities during
training, we can perform an Over-Parameterization in the space-time manifold,
closing the gap in the expressibility between sparse training and dense
training. We further use ITOP to understand the underlying mechanism of Dynamic
Sparse Training (DST) and indicate that the benefits of DST come from its
ability to consider across time all possible parameters when searching for the
optimal sparse connectivity. As long as there are sufficient parameters that
have been reliably explored during training, DST can outperform the dense
neural network by a large margin. We present a series of experiments to support
our conjecture and achieve the state-of-the-art sparse training performance
with ResNet-50 on ImageNet. More impressively, our method achieves dominant
performance over the overparameterization-based sparse methods at extreme
sparsity levels. When trained on CIFAR-100, our method can match the
performance of the dense model even at an extreme sparsity (98%). Code can be
found at https://github.com/Shiweiliuiiiiiii/In-Time-Over-Parameterization.
Comment: 16 pages; 10 figures; Published in Proceedings of the 38th
International Conference on Machine Learning. Code can be found at
https://github.com/Shiweiliuiiiiiii/In-Time-Over-Parameterization.
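The central quantity in the abstract is how much of the full parameter space a dynamic sparse training run manages to explore over time. The sketch below tracks a simple proxy for that rate, treating a weight as explored once the mask has activated it at least once; this "ever activated" criterion and the class name are assumptions for illustration, not ITOP's exact definition of reliable exploration.

```python
# Track what fraction of all weights a dynamic sparse training run has
# activated at least once; used here as a rough proxy for in-time
# over-parameterization. The criterion is an illustrative simplification.
import torch

class ExplorationTracker:
    def __init__(self, param_shapes):
        # One boolean "ever active" map per sparse parameter tensor.
        self.ever_active = {name: torch.zeros(shape, dtype=torch.bool)
                            for name, shape in param_shapes.items()}

    def update(self, masks):
        # Call after every mask update (e.g., each drop-and-grow step).
        for name, mask in masks.items():
            self.ever_active[name] |= mask.bool()

    def exploration_rate(self):
        explored = sum(m.sum().item() for m in self.ever_active.values())
        total = sum(m.numel() for m in self.ever_active.values())
        return explored / total

# A rate approaching 1.0 means that, over the course of training, nearly all
# weights have been considered at some point, even though the model is sparse
# at every individual step.
```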
JaxPruner: A concise library for sparsity research
This paper introduces JaxPruner, an open-source JAX-based pruning and sparse
training library for machine learning research. JaxPruner aims to accelerate
research on sparse neural networks by providing concise implementations of
popular pruning and sparse training algorithms with minimal memory and latency
overhead. Algorithms implemented in JaxPruner use a common API and work
seamlessly with the popular optimization library Optax, which, in turn, enables
easy integration with existing JAX-based libraries. We demonstrate this ease of
integration by providing examples in four different codebases: Scenic, t5x,
Dopamine, and FedJAX, and we provide baseline experiments on popular benchmarks.
Comment: JaxPruner is hosted at http://github.com/google-research/jaxpruner.
Cross likelihood ratio based speaker clustering using eigenvoice models
This paper proposes the use of eigenvoice modeling techniques with the Cross Likelihood Ratio (CLR) as a criterion for speaker clustering within a speaker diarization system. The CLR has previously been shown to be a robust decision criterion for speaker clustering using Gaussian Mixture Models. Recently, eigenvoice modeling techniques have become increasingly popular, due to their ability to adequately represent a speaker based on sparse training data, as well as their improved ability to capture differences in speaker characteristics. This paper hence proposes that it would be beneficial to capitalize on the advantages of eigenvoice modeling in a CLR framework. Results obtained on the 2002 Rich Transcription (RT-02) Evaluation dataset show improved clustering performance, resulting in a 35.1% relative improvement in the overall Diarization Error Rate (DER) compared to the baseline system.
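For reference, the Cross Likelihood Ratio criterion this abstract builds on is commonly defined in the diarization literature as below; this definition is included as background and is not restated in the abstract itself.

```latex
% CLR between clusters i and j, with frame sets X_i, X_j (of sizes N_i, N_j),
% speaker models \lambda_i, \lambda_j, and a universal background model
% \lambda_{UBM}; the pair with the highest CLR is merged while the value
% exceeds a threshold.
\mathrm{CLR}(i, j) =
  \frac{1}{N_i} \log \frac{p(X_i \mid \lambda_j)}{p(X_i \mid \lambda_{UBM})}
+ \frac{1}{N_j} \log \frac{p(X_j \mid \lambda_i)}{p(X_j \mid \lambda_{UBM})}
```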