SparseOptimizer: Sparsify Language Models through Moreau-Yosida Regularization and Accelerate through Compiler Co-design
This paper introduces SparseOptimizer, a novel deep learning optimizer that
exploits Moreau-Yosida regularization to naturally induce sparsity in large
language models such as BERT, ALBERT and GPT. Key to the design of
SparseOptimizer is an embedded shrinkage operator, which imparts sparsity
directly within the optimization process. This operator, backed by a sound
theoretical framework, includes an analytical solution, thereby reinforcing the
optimizer's robustness and efficacy. Crucially, SparseOptimizer's plug-and-play
functionality eradicates the need for code modifications, making it a
universally adaptable tool for a wide array of large language models. Empirical
evaluations on benchmark datasets such as GLUE, RACE, SQuAD1, and SQuAD2
confirm that SparseBERT and SparseALBERT, when sparsified using
SparseOptimizer, achieve performance comparable to their dense counterparts,
BERT and ALBERT, while significantly reducing their parameter count. Further,
this work proposes an innovative optimizer-compiler co-design strategy,
demonstrating the potential of inference acceleration (3.37x,
6.30x, and 7.15x in comparison with PyTorch, TensorFlow, and
LLVM generic compile, respectively) in SparseBERT when paired with an
appropriately designed compiler. This study represents a significant step
forward in the evolution of efficient, scalable, and high-performing large
language models, setting a precedent for future exploration and optimization in
this domain. The SparseOptimizer code and SparseALBERT model will be made
available upon paper acceptance.
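The abstract above does not spell out the shrinkage operator, so the following is only a minimal sketch under a stated assumption: that the sparsity penalty is an L1 term, whose proximal operator has the well-known closed form called soft-thresholding. The function names `soft_threshold` and `sparse_step` and the parameters `lr` and `lam` are hypothetical and are not taken from the SparseOptimizer code.

```python
def soft_threshold(w, lam):
    """Proximal operator of lam * |x|, applied elementwise.

    Shrinks every entry toward zero by lam and sets entries whose
    magnitude is at most lam exactly to zero.
    """
    out = []
    for x in w:
        magnitude = max(abs(x) - lam, 0.0)
        out.append(magnitude if x > 0 else -magnitude)
    return out


def sparse_step(w, grad, lr, lam):
    """One proximal gradient step (assumed form, not the paper's update).

    A plain gradient step is followed by shrinkage with threshold lr * lam,
    so sparsity is produced inside the optimizer step itself.
    """
    stepped = [wi - lr * gi for wi, gi in zip(w, grad)]
    return soft_threshold(stepped, lr * lam)


# Usage: with a zero gradient the step reduces to pure shrinkage,
# and the smallest weight is driven exactly to zero.
weights = sparse_step([0.5, -2.0, 0.05], [0.0, 0.0, 0.0], lr=1.0, lam=0.1)
```

Because the shrinkage runs after the gradient step, the model code itself needs no changes in this sketch, which matches the plug-and-play property the abstract describes.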
Mathematical Optimization Algorithms for Model Compression and Adversarial Learning in Deep Neural Networks
Large-scale deep neural networks (DNNs) have made breakthroughs in a variety of tasks, such as image recognition, speech recognition and self-driving cars. However, their large model size and computational requirements place a significant burden on state-of-the-art computing systems. Weight pruning is an effective approach to reduce the model size and computational requirements of DNNs. However, prior works in this area are mainly heuristic methods. As a result, the performance of a DNN cannot be maintained at a high weight pruning ratio. To mitigate this limitation, we propose a systematic weight pruning framework for DNNs based on mathematical optimization. We first formulate weight pruning for DNNs as a non-convex optimization problem, and then systematically solve it using the alternating direction method of multipliers (ADMM). Our work achieves a higher weight pruning ratio on DNNs without accuracy loss and a higher acceleration of DNN inference on CPU and GPU platforms compared with prior works.
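The ADMM-based pruning described above can be illustrated on a toy problem. This is a sketch under stated assumptions, not the authors' implementation: the DNN loss is replaced by a separable quadratic `0.5 * sum(c[i] * (w[i] - a[i])**2)` so that the weight update has a closed form, and the pruning constraint is "at most `k` nonzero weights". The names `project_top_k`, `admm_prune`, `a`, `c`, `k`, and `rho` are hypothetical.

```python
def project_top_k(v, k):
    """Euclidean projection onto vectors with at most k nonzeros:
    keep the k entries of largest magnitude and zero the rest."""
    order = sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)
    keep = set(order[:k])
    return [v[i] if i in keep else 0.0 for i in range(len(v))]


def admm_prune(a, c, k, rho=1.0, iters=50):
    """ADMM for: minimize 0.5 * sum(c[i] * (w[i] - a[i])**2)
    subject to w having at most k nonzero entries.

    w is the unconstrained copy of the weights, z the sparse copy,
    and u the scaled dual variable tying the two together.
    """
    n = len(a)
    z = [0.0] * n
    u = [0.0] * n
    for _ in range(iters):
        # w-step: minimize loss + (rho / 2) * ||w - z + u||^2 (closed form here).
        w = [(c[i] * a[i] + rho * (z[i] - u[i])) / (c[i] + rho) for i in range(n)]
        # z-step: project w + u onto the sparsity constraint set.
        z = project_top_k([w[i] + u[i] for i in range(n)], k)
        # dual step: accumulate the remaining disagreement between w and z.
        u = [u[i] + w[i] - z[i] for i in range(n)]
    return z


# Usage: keep 2 of 4 weights; the two small-magnitude weights are pruned.
pruned = admm_prune([3.0, -0.2, 0.1, -4.0], [1.0, 1.0, 1.0, 1.0], k=2)
```

In a real network the w-step would be a few epochs of gradient-based training on the loss plus the quadratic penalty rather than a closed-form update; the z-step and dual step keep the same shape.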
Besides the issue of model size, DNNs are also sensitive to adversarial attacks: a small, imperceptible perturbation of the input data can fully mislead a DNN. Research on the robustness of DNNs generally follows two directions. The first is to enhance the robustness of DNNs, which makes it harder for adversarial attacks to fool them. The second is to design adversarial attack methods to test the robustness of DNNs. These two directions reinforce each other in hardening DNNs. In our work, we propose to generate adversarial attacks with low distortion via convex optimization, which achieves a 100% attack success rate with lower distortion than prior works. We also propose a unified min-max optimization framework for adversarial attack and defense on DNNs over multiple domains. Our proposed method outperforms prior works, which use average-based strategies to solve the problems over multiple domains.
Efficient Multi-Template Learning for Structured Prediction
Conditional random field (CRF) and Structural Support Vector Machine
(Structural SVM) are two state-of-the-art methods for structured prediction
which captures the interdependencies among output variables. The success of
these methods is attributed to the fact that their discriminative models are
able to account for overlapping features on the whole input observations. These
features are usually generated by applying a given set of templates on labeled
data, but improper templates may lead to degraded performance. To alleviate
this issue, in this paper, we propose a novel multiple template learning
paradigm to learn structured prediction and the importance of each template
simultaneously, so that hundreds of arbitrary templates could be added into the
learning model without caution. This paradigm can be formulated as a special
multiple kernel learning problem with exponential number of constraints. Then
we introduce an efficient cutting plane algorithm to solve this problem in the
primal, and its convergence is presented. We also evaluate the proposed
learning paradigm on two widely-studied structured prediction tasks,
i.e., sequence labeling and dependency parsing. Extensive experimental
results show that the proposed method outperforms CRFs and Structural SVMs due
to exploiting the importance of each template. Our complexity analysis and
empirical results also show that our proposed method is more efficient than
OnlineMKL on very sparse and high-dimensional data. We further extend this
paradigm for structured prediction using generalized -block norm
regularization with , and experiments show competitive performances when
- …