32 research outputs found

    Decoupled Model Schedule for Deep Learning Training

    Recent years have seen an increase in the development of large deep learning (DL) models, which makes training efficiency crucial. Common practice struggles with the trade-off between usability and performance. On one hand, DL frameworks such as PyTorch use dynamic graphs to facilitate model developers, at the price of sub-optimal model training performance. On the other hand, practitioners propose various approaches to improving training efficiency by sacrificing some flexibility, ranging from making the graph static for more thorough optimization (e.g., XLA) to customizing optimizations for large-scale distributed training (e.g., DeepSpeed and Megatron-LM). In this paper, we aim to address the tension between usability and training efficiency through separation of concerns. Inspired by DL compilers that decouple the platform-specific optimizations of a tensor-level operator from its arithmetic definition, this paper proposes a schedule language to decouple model execution from definition. Specifically, the schedule works on a PyTorch model and uses a set of schedule primitives to convert the model for common model training optimizations such as high-performance kernels, effective 3D parallelism, and efficient activation checkpointing. Compared to existing optimization solutions, we optimize the model as needed through high-level primitives, thus preserving programmability and debuggability for users to a large extent. Our evaluation results show that by scheduling the existing hand-crafted optimizations in a systematic way, we are able to improve training throughput by up to 3.35x on a single machine with 8 NVIDIA V100 GPUs, and by up to 1.32x on multiple machines with up to 64 GPUs, when compared to the out-of-the-box performance of DeepSpeed and Megatron-LM.
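
    To make the decoupling concrete, below is a minimal sketch of what a schedule interface over an unmodified PyTorch model could look like. The Schedule class and its replace/checkpoint primitives are illustrative stand-ins under my own naming, not the paper's actual API; only the PyTorch calls (get_submodule, torch.utils.checkpoint) are real.

```python
import torch
import torch.nn as nn
import torch.utils.checkpoint as ckpt

# Hypothetical schedule interface sketching the separation of a model's
# execution plan from its definition; names are illustrative, not the
# paper's actual API.
class Schedule:
    def __init__(self, model: nn.Module):
        self.model = model  # the unmodified PyTorch definition

    def _owner(self, path: str):
        # Resolve "a.b.c" to the parent module of "c" and the leaf name.
        parent, _, name = path.rpartition(".")
        owner = self.model.get_submodule(parent) if parent else self.model
        return owner, name

    def replace(self, path: str, new_module: nn.Module):
        """Swap the submodule at `path` for another implementation,
        e.g. a high-performance kernel."""
        owner, name = self._owner(path)
        setattr(owner, name, new_module)

    def checkpoint(self, path: str):
        """Wrap a submodule so its activations are recomputed in backward."""
        owner, name = self._owner(path)
        inner = getattr(owner, name)

        class Checkpointed(nn.Module):
            def forward(self, x):
                return ckpt.checkpoint(inner, x, use_reentrant=False)

        setattr(owner, name, Checkpointed())

# Usage: optimize a model without touching its definition.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
sch = Schedule(model)
sch.replace("1", nn.GELU(approximate="tanh"))  # swap in a faster GELU variant
sch.checkpoint("0")                            # recompute layer 0 in backward
out = model(torch.randn(8, 1024, requires_grad=True))
```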

    Rewriting History: Repurposing Domain-Specific CGRAs

    Coarse-grained reconfigurable arrays (CGRAs) are domain-specific devices promising both the flexibility of FPGAs and the performance of ASICs. However, with restricted domains comes a danger: designing chips that cannot accelerate enough current and future software to justify the hardware cost. We introduce FlexC, the first flexible CGRA compiler, which allows CGRAs to be adapted to operations they do not natively support. FlexC uses dataflow rewriting, replacing unsupported regions of code with equivalent operations that are supported by the CGRA. We use equality saturation, a technique enabling efficient exploration of a large space of rewrite rules, to effectively search the program space for supported programs. We applied FlexC to over 2,000 loop kernels, compiling to four different research CGRAs and 300 generated CGRAs, and demonstrate a 2.2× increase in the number of loop kernels accelerated, leading to a 3× speedup compared to an Arm A5 CPU on kernels that would otherwise be unsupported by the accelerator.
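
    To illustrate the rewriting idea, the sketch below searches a space of rewrite rules for an expression equivalent to the input but built only from operators a target supports. This is a heavily simplified stand-in: real equality saturation represents all rewrites compactly in an e-graph and extracts the best candidate, whereas this version enumerates rewrites with a breadth-first search. The expression encoding and the two rules are hypothetical examples, not FlexC's.

```python
from collections import deque

# Expressions as nested tuples, e.g. ("mul", ("var", "x"), ("const", 2)).
# Each rule is a (guard, apply) pair over a single expression node.
RULES = [
    # multiply by 2 -> left shift by 1 (shifts are cheap on many CGRAs)
    (lambda e: e[0] == "mul" and e[2] == ("const", 2),
     lambda e: ("shl", e[1], ("const", 1))),
    # subtraction -> addition of a negation
    (lambda e: e[0] == "sub",
     lambda e: ("add", e[1], ("neg", e[2]))),
]

def ops(e):
    """The set of operators an expression uses (leaves contribute none)."""
    if e[0] in ("var", "const"):
        return set()
    return {e[0]} | set().union(*(ops(c) for c in e[1:]))

def rewrites(e):
    """All expressions reachable from e by one rule application, anywhere."""
    if e[0] in ("var", "const"):
        return
    for guard, apply in RULES:
        if guard(e):
            yield apply(e)
    for i, child in enumerate(e[1:], start=1):
        for new_child in rewrites(child):
            yield e[:i] + (new_child,) + e[i + 1:]

def compile_to(e, supported, limit=10_000):
    """BFS through the rewrite space for an all-supported equivalent of e."""
    seen, queue = {e}, deque([e])
    while queue and len(seen) < limit:
        cur = queue.popleft()
        if ops(cur) <= supported:
            return cur
        for nxt in rewrites(cur):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None  # no supported equivalent found within the budget

expr = ("mul", ("var", "x"), ("const", 2))
print(compile_to(expr, supported={"add", "shl", "neg"}))
# -> ('shl', ('var', 'x'), ('const', 1))
```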

    Complex-to-Real Random Features for Polynomial Kernels

    Polynomial kernels are among the most popular kernels in machine learning, since their feature maps model the interactions between the dimensions of the input data. However, these features correspond to tensor products of the input with itself, which makes their dimension grow exponentially with the polynomial degree. We address this issue by proposing Complex-to-Real (CtR) sketches for tensor products that can be used as random feature approximations of polynomial kernels. These sketches leverage intermediate complex random projections, leading to better theoretical guarantees and potentially much lower variances than analogs using real projections. Our sketches are simple to construct and their final output is real-valued, which makes their downstream use straightforward. Finally, we show that they achieve state-of-the-art performance in terms of accuracy and speed.
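
    The sketch below illustrates the CtR idea for the homogeneous polynomial kernel (x^T y)^p using plain complex Gaussian projections: elementwise products of independent complex projections give an unbiased complex sketch, and stacking its real and imaginary parts yields real-valued features whose inner product equals Re(<z(x), z(y)>). The paper's actual constructions (and the variance guarantees) use more refined structured sketches; this is only a minimal demonstration of how a complex sketch ends up with a real output.

```python
import numpy as np

def ctr_poly_features(X, degree=3, D=512, seed=0):
    """Complex-to-Real random features approximating the kernel (x^T y)^degree.

    Minimal sketch: elementwise products of independent complex Gaussian
    projections, returned as real features via [Re, Im] stacking.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Z = np.ones((n, D), dtype=np.complex128)
    for _ in range(degree):
        # Complex Gaussian entries with E[w * conj(w)] = 1 and E[w^2] = 0,
        # so E[conj(w^T x)(w^T y)] = x^T y for each projection.
        W = (rng.standard_normal((d, D)) +
             1j * rng.standard_normal((d, D))) / np.sqrt(2)
        Z *= X @ W
    Z /= np.sqrt(D)
    # [Re, Im] stacking makes the features real-valued: their inner product
    # is Re(<z(x), z(y)>), an unbiased estimate of (x^T y)^degree.
    return np.hstack([Z.real, Z.imag])

# Usage: compare the approximate kernel against the exact polynomial kernel.
X = np.random.default_rng(1).standard_normal((5, 16))
Phi = ctr_poly_features(X, degree=3, D=4096)
approx = Phi @ Phi.T
exact = (X @ X.T) ** 3
```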