17 research outputs found

    Neural Architecture Search for Effective Teacher-Student Knowledge Transfer in Language Models

    Full text link
    Large pretrained language models have achieved state-of-the-art results on a variety of downstream tasks. Knowledge Distillation (KD) into a smaller student model addresses their inefficiency, allowing for deployment in resource-constrained environments. However, KD can be ineffective when the student is manually selected from a set of existing options, since it can be a sub-optimal choice within the space of all possible student architectures. We develop multilingual KD-NAS, the use of Neural Architecture Search (NAS) guided by KD to find the optimal student architecture for task-agnostic distillation from a multilingual teacher. In each episode of the search process, a NAS controller predicts a reward based on the distillation loss and the latency of inference. The top candidate architectures are then distilled from the teacher on a small proxy set. Finally, the architecture(s) with the highest reward are selected and distilled on the full training corpus. KD-NAS can automatically trade off efficiency and effectiveness, and recommends architectures suited to various latency budgets. Using our multi-layer hidden-state distillation process, our KD-NAS student model achieves a 7x speedup on CPU inference (2x on GPU) compared to an XLM-RoBERTa Base teacher, while maintaining 90% of its performance, and has been deployed in 3 software offerings requiring high throughput, low latency, and deployment on CPU. (Comment: 11 pages, 5 figures)
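    The controller's reward, as described, trades off distillation loss against inference latency. A minimal sketch of one plausible reward shaping — the function name, the weighting `alpha`, and the budget-normalized penalty are assumptions for illustration, not the paper's exact formula:

```python
def kd_nas_reward(distill_loss, latency_ms, latency_budget_ms=10.0, alpha=0.5):
    """Hypothetical KD-NAS-style reward: lower distillation loss is better,
    and latency is only penalized once it exceeds the budget.

    The penalty form and alpha weighting are assumptions, not the paper's
    published reward function.
    """
    quality = -distill_loss                                  # lower loss -> higher reward
    latency_penalty = max(0.0, latency_ms / latency_budget_ms - 1.0)
    return quality - alpha * latency_penalty
```

    Candidates under the latency budget are then ranked purely by distillation quality, which matches the abstract's claim that the search can "automatically trade off efficiency and effectiveness."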

    Ensemble-Instruct: Generating Instruction-Tuning Data with a Heterogeneous Mixture of LMs

    Full text link
    Using in-context learning (ICL) for data generation, techniques such as Self-Instruct (Wang et al., 2023) or the follow-up Alpaca (Taori et al., 2023) can train strong conversational agents with only a small amount of human supervision. One limitation of these approaches is that they resort to very large language models (around 175B parameters) that are also proprietary and non-public. Here we explore the application of such techniques to language models that are much smaller (around 10B--40B parameters) and have permissive licenses. We find the Self-Instruct approach to be less effective at these sizes and propose new ICL methods that draw on two main ideas: (a) categorization and simplification of the ICL templates to make prompt learning easier for the LM, and (b) ensembling over multiple LM outputs to help select high-quality synthetic examples. Our algorithm leverages the 175 Self-Instruct seed tasks and employs separate pipelines for instructions that require an input and instructions that do not. Empirical investigations with different LMs show that: (1) our proposed method yields higher-quality instruction-tuning data than Self-Instruct, (2) it improves the performance of both vanilla and instruction-tuned LMs by significant margins, and (3) smaller instruction-tuned LMs generate more useful outputs than their larger un-tuned counterparts. Our codebase is available at https://github.com/IBM/ensemble-instruct
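    Idea (b) — ensembling over multiple LM outputs to keep only high-quality synthetic examples — can be sketched as a consensus filter: keep the candidate that agrees most with the rest, and discard the example entirely if no candidate reaches a threshold. Here a simple token-overlap F1 stands in for whatever similarity metric the paper actually uses, and `min_consensus` is an assumed threshold:

```python
from collections import Counter

def token_f1(a, b):
    """Token-overlap F1 between two strings — a simple stand-in similarity."""
    ta, tb = a.split(), b.split()
    if not ta or not tb:
        return 0.0
    overlap = sum((Counter(ta) & Counter(tb)).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / len(ta), overlap / len(tb)
    return 2 * p * r / (p + r)

def select_output(candidates, min_consensus=0.3):
    """Pick the candidate output most consistent with the others; return None
    (drop the synthetic example) if consensus is too low. The threshold value
    is an assumption for illustration."""
    best, best_score = None, -1.0
    for i, cand in enumerate(candidates):
        others = [o for j, o in enumerate(candidates) if j != i]
        score = sum(token_f1(cand, o) for o in others) / len(others)
        if score > best_score:
            best, best_score = cand, score
    return best if best_score >= min_consensus else None
```

    Outputs that no other model in the heterogeneous mixture corroborates are filtered out, which is one way the ensemble can compensate for the weaker individual 10B--40B models.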

    Parallel finite element processing using Gaussian belief propagation inference on probabilistic graphical models

    No full text
    The Finite Element Method (FEM) is one of the most popular numerical methods for obtaining approximate solutions to Partial Differential Equations (PDEs). Due to its wide applicability, robustness, and accuracy, the FEM takes a prominent role in the design and analysis of many engineering applications. However, the FEM is also computationally intensive when used for complex designs requiring accurate modeling. As a result, high-fidelity FEM simulations create strong demand for scalable High Performance Computing (HPC) systems. Conventional FEM approaches are based on global sparse matrix operations that severely limit parallel scalability on costly HPC systems. In this work we look into Belief Propagation (BP) inference algorithms to address the FEM's computational bottleneck. BP is a message-passing algorithm on probabilistic Graphical Models (GMs) used to efficiently compute marginal distributions. BP algorithms can exploit the structure of variable interactions in the underlying problem to expose more parallelism in the computation. A particular instance of BP algorithms is BP on Gaussian models with pairwise interaction (PW-GaBP). However, PW-GaBP generally did not provide the sought numerical efficiency for FEM problems due to its large number of iterations. Nonetheless, our analysis of using PW-GaBP to solve FEM problems reveals valuable insights into how to improve the numerical efficiency of BP-style algorithms while maintaining their parallelism. Instead, we developed a novel probabilistic Gaussian GM for the FEM based on Factor Graphs (FEM-FG) by reformulating the FEM problem as a distributed variational inference problem. This development facilitates the use of computational inference algorithms such as BP to solve the FEM problem by decoupling it into local systems of low rank.
The resulting FEM Gaussian Belief Propagation (FGaBP) algorithm solves the FEM in parallel, element by element, eliminating the need for large sparse matrix operations. The FGaBP algorithm demonstrates significant parallel scalability; nonetheless, its number of iterations scales linearly with the number of unknowns, making it slow for larger FEM problems. We remedy this by introducing a multigrid adaptation of FGaBP, resulting in the fast FMGaBP algorithm. The FMGaBP algorithm processes the FEM in a fully distributed and parallel manner, with stencil-like element-by-element operations, while eliminating all sparse algebraic operations. To our knowledge, this is the first multigrid formulation for continuous-domain Gaussian BP algorithms that is derived directly from a variational formulation of the FEM. In comparison with state-of-the-art parallel implementations of the Multigrid Preconditioned Conjugate Gradient (MG-PCG) solver, the FMGaBP algorithm demonstrates considerable speedups for both 2D and 3D FEM problems.
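    The pairwise Gaussian BP (PW-GaBP) baseline the thesis analyzes can be sketched as follows for a symmetric, diagonally dominant linear system Ax = b. This is a minimal dense-matrix illustration of the standard message-passing updates (precision and mean messages along the nonzero off-diagonal entries of A), not the FEM-FG factor-graph formulation the thesis develops:

```python
import numpy as np

def gabp_solve(A, b, iters=100, tol=1e-10):
    """Solve A x = b by pairwise Gaussian belief propagation (PW-GaBP).

    A is assumed symmetric; convergence is guaranteed for diagonally
    dominant (more generally, walk-summable) systems. Each directed edge
    (i -> j) carries a precision message P[i, j] and a mean message mu[i, j].
    """
    n = len(b)
    nbrs = [[j for j in range(n) if j != i and A[i, j] != 0] for i in range(n)]
    P = np.zeros((n, n))   # precision message from i to j
    mu = np.zeros((n, n))  # mean message from i to j
    x = np.zeros(n)
    for _ in range(iters):
        for i in range(n):
            for j in nbrs[i]:
                # Cavity belief at node i: local term plus all incoming
                # messages except the one coming back from j.
                P_cav = A[i, i] + sum(P[k, i] for k in nbrs[i] if k != j)
                mu_cav = (b[i] + sum(P[k, i] * mu[k, i]
                                     for k in nbrs[i] if k != j)) / P_cav
                P[i, j] = -A[i, j] ** 2 / P_cav
                mu[i, j] = -A[i, j] * mu_cav / P[i, j]
        # Marginal means = current solution estimate.
        x_new = np.array([
            (b[i] + sum(P[k, i] * mu[k, i] for k in nbrs[i]))
            / (A[i, i] + sum(P[k, i] for k in nbrs[i]))
            for i in range(n)
        ])
        if np.max(np.abs(x_new - x)) < tol:
            return x_new
        x = x_new
    return x
```

    On tree-structured systems (such as the tridiagonal matrix of a 1D mesh) GaBP converges to the exact solution; the linear growth in iteration count with problem size illustrated by loops like this one is exactly what motivates the multigrid FMGaBP adaptation.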

    Sparse Matrix-Vector floating-point multiplication with FPGAs for finite element electromagnetics

    No full text
    The Finite Element Method (FEM) is a computationally intensive scientific and engineering analysis tool with diverse applications ranging from structural engineering to electromagnetic simulation. Field Programmable Gate Arrays (FPGAs) have been shown to have higher peak floating-point performance than general-purpose CPUs, and the trends are moving in favor of FPGAs. We present an architecture and implementation of an FPGA-based Sparse Matrix-Vector Multiplier (SMVM) for use in the iterative solution of large, sparse systems of equations arising from FEM applications. Our architecture exploits the FEM matrix sparsity structure to achieve a balance between performance and hardware resource requirements. The architecture is based on a pipelined linear array of Processing Elements (PEs). A hardware-oriented matrix "striping" scheme is developed which reduces the number of required processing elements. The implemented SMVM-pipeline prototype contains 8 PEs and is clocked at 110 MHz, obtaining a peak performance of 1.76 GFLOPS. For the 8 GB/s of memory bandwidth typical of recent FPGA reconfigurable systems, this architecture can achieve 1.5 GFLOPS sustained performance. A single pipeline uses 30% of the logic resources and 40% of the memory resources of a Stratix S80 FPGA. Using multiple instances of the pipeline, linear scaling of the peak and sustained performance can be achieved. Our stream-through architecture provides the added advantage of enabling an iterative implementation of the SMVM computation required by iterative solvers such as the conjugate gradient method, avoiding initialization time due to data loading and setup inside the FPGA's internal memory.
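    The kernel the FPGA pipeline accelerates is the standard sparse matrix-vector product. A reference software version over the common CSR (compressed sparse row) layout shows what the hardware computes; the paper's matrix "striping" scheme is a hardware-specific layout and is not reproduced here:

```python
import numpy as np

def csr_spmv(values, col_idx, row_ptr, x):
    """y = A @ x with A stored in CSR form: `values` holds the nonzeros
    row by row, `col_idx` their column indices, and `row_ptr[i]:row_ptr[i+1]`
    delimits row i. This inner multiply-accumulate loop is the operation the
    SMVM pipeline streams through its processing elements."""
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y
```

    Each dot-product accumulation is independent across rows, which is why the computation maps naturally onto a linear array of pipelined processing elements.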

    Acceleration of the Finite-Element Gaussian Belief Propagation Solver Using Minimum Residual Techniques

    No full text
    The recently introduced finite-element Gaussian belief propagation (FGaBP) method provides a powerful alternative to conventional finite-element method solvers for efficiently utilizing high-performance computing platforms. In this paper, we accelerate the FGaBP convergence by combining it with two methods based on residual minimization techniques, namely, the flexible generalized minimum residual method and the iterant recombination method. The numerical results show considerable reductions in the total number of operations compared with the stand-alone FGaBP method, while maintaining the scalability features of FGaBP.
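    The iterant recombination idea — combine stored solver iterates so that the recombined vector minimizes the residual — reduces, in its simplest form, to a small least-squares problem. A sketch under that reading (the exact recombination scheme used in the paper may differ):

```python
import numpy as np

def iterant_recombination(A, b, iterates):
    """Given iterates x_1..x_m of a stationary solver (e.g. FGaBP), return
    x = X @ alpha with alpha chosen to minimize ||b - A X alpha||_2.

    This is the generic residual-minimization step; the coupling with
    FGaBP's message-passing sweeps is not modeled here."""
    X = np.column_stack(iterates)          # n x m matrix of stored iterates
    alpha, *_ = np.linalg.lstsq(A @ X, b, rcond=None)
    return X @ alpha
```

    Because m is small (a handful of stored iterates), the least-squares solve is cheap relative to a solver sweep, so the recombination can cut the total iteration count without disturbing the solver's parallel structure.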

    Parallel Multigrid Acceleration for the Finite-Element Gaussian Belief Propagation Algorithm

    No full text

    Communication-Avoiding Krylov Techniques on Graphic Processing Units

    No full text
    Communicating data within the graphic processing unit (GPU) memory system and between the CPU and GPU are major bottlenecks in accelerating Krylov solvers on GPUs. Communication-avoiding techniques reduce the communication cost of Krylov subspace methods by computing several vectors of a Krylov subspace "at once," using a kernel called "matrix powers." The matrix powers kernel is implemented on a recent generation of NVIDIA GPUs, and speedups of up to 5.7 times are reported for the communication-avoiding matrix powers kernel compared to the standard sparse matrix-vector multiplication (SpMV) implementation.
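    The "matrix powers" kernel computes the Krylov basis vectors x, Ax, A²x, ..., Aˢx. A naive sketch of its mathematical contract — a real communication-avoiding implementation produces all s vectors in a single pass over A (with redundant ghost-zone computation), which this plain loop does not attempt:

```python
import numpy as np

def matrix_powers(A, x, s):
    """Return the Krylov basis [x, Ax, A^2 x, ..., A^s x] as the columns of
    an n x (s+1) matrix. Written as s separate matrix-vector products; the
    communication-avoiding variant fuses them to read A only once."""
    V = np.empty((len(x), s + 1))
    V[:, 0] = x
    for k in range(s):
        V[:, k + 1] = A @ V[:, k]
    return V
```

    The payoff is that a Krylov solver can then take s steps' worth of progress per round of GPU memory traffic, which is where the reported speedup over repeated standalone SpMV calls comes from.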