80 research outputs found

    SparseOptimizer: Sparsify Language Models through Moreau-Yosida Regularization and Accelerate through Compiler Co-design

    Full text link
    This paper introduces SparseOptimizer, a novel deep learning optimizer that exploits Moreau-Yosida regularization to naturally induce sparsity in large language models such as BERT, ALBERT and GPT. Key to the design of SparseOptimizer is an embedded shrinkage operator, which imparts sparsity directly within the optimization process. This operator, backed by a sound theoretical framework, includes an analytical solution, thereby reinforcing the optimizer's robustness and efficacy. Crucially, SparseOptimizer's plug-and-play functionality eradicates the need for code modifications, making it a universally adaptable tool for a wide array of large language models. Empirical evaluations on benchmark datasets such as GLUE, RACE, SQuAD1, and SQuAD2 confirm that SparseBERT and SparseALBERT, when sparsified using SparseOptimizer, achieve performance comparable to their dense counterparts, BERT and ALBERT, while significantly reducing their parameter count. Further, this work proposes an innovative optimizer-compiler co-design strategy, demonstrating the potential of inference acceleration (\textbf{3.37x}, \textbf{6.30x}, and \textbf{7.15x} in comparison with Pytorch, TensorFlow, and LLVM generic compile, respectively) in SparseBERT when paired with an appropriately designed compiler. This study represents a significant step forward in the evolution of efficient, scalable, and high-performing large language models, setting a precedent for future exploration and optimization in this domain. The SparseOptimizer code and SparseALBERT model will be made available upon paper acceptance

    Mathematical Optimization Algorithms for Model Compression and Adversarial Learning in Deep Neural Networks

    Get PDF
    Large-scale deep neural networks (DNNs) have made breakthroughs in a variety of tasks, such as image recognition, speech recognition and self-driving cars. However, their large model size and computational requirements add a significant burden to state-of-the-art computing systems. Weight pruning is an effective approach to reduce the model size and computational requirements of DNNs. However, prior works in this area are mainly heuristic methods. As a result, the performance of a DNN cannot maintain for a high weight pruning ratio. To mitigate this limitation, we propose a systematic weight pruning framework for DNNs based on mathematical optimization. We first formulate the weight pruning for DNNs as a non-convex optimization problem, and then systematically solve it using alternating direction method of multipliers (ADMM). Our work achieves a higher weight pruning ratio on DNNs without accuracy loss and a higher acceleration on the inference of DNNs on CPU and GPU platforms compared with prior works. Besides the issue of model size, DNNs are also sensitive to adversarial attacks, a small invisible noise on the input data can fully mislead a DNN. Research on the robustness of DNNs follows two directions in general. The first is to enhance the robustness of DNNs, which increases the degree of difficulty for adversarial attacks to fool DNNs. The second is to design adversarial attack methods to test the robustness of DNNs. These two aspects reciprocally benefit each other towards hardening DNNs. In our work, we propose to generate adversarial attacks with low distortion via convex optimization, which achieves 100% attack success rate with lower distortion compared with prior works. We also propose a unified min-max optimization framework for the adversarial attack and defense on DNNs over multiple domains. Our proposed method performs better compared with the prior works, which use average-based strategies to solve the problems over multiple domains

    Efficient Multi-Template Learning for Structured Prediction

    Full text link
    Conditional random field (CRF) and Structural Support Vector Machine (Structural SVM) are two state-of-the-art methods for structured prediction which captures the interdependencies among output variables. The success of these methods is attributed to the fact that their discriminative models are able to account for overlapping features on the whole input observations. These features are usually generated by applying a given set of templates on labeled data, but improper templates may lead to degraded performance. To alleviate this issue, in this paper, we propose a novel multiple template learning paradigm to learn structured prediction and the importance of each template simultaneously, so that hundreds of arbitrary templates could be added into the learning model without caution. This paradigm can be formulated as a special multiple kernel learning problem with exponential number of constraints. Then we introduce an efficient cutting plane algorithm to solve this problem in the primal, and its convergence is presented. We also evaluate the proposed learning paradigm on two widely-studied structured prediction tasks, \emph{i.e.} sequence labeling and dependency parsing. Extensive experimental results show that the proposed method outperforms CRFs and Structural SVMs due to exploiting the importance of each template. Our complexity analysis and empirical results also show that our proposed method is more efficient than OnlineMKL on very sparse and high-dimensional data. We further extend this paradigm for structured prediction using generalized pp-block norm regularization with p>1p>1, and experiments show competitive performances when p∈[1,2)p \in [1,2)
    • …
    corecore