Compressing the Backward Pass of Large-Scale Neural Architectures by Structured Activation Pruning
The rise of Deep Neural Networks (DNNs) has led to an increase in model size
and complexity, straining the memory capacity of GPUs. Sparsity in DNNs,
characterized as structural or ephemeral, has gained attention as a solution.
This work focuses on ephemeral sparsity, aiming to reduce memory consumption
during training. It emphasizes the significance of activations, an often
overlooked component, and their role in memory usage. This work employs
structured pruning in Block Sparse Compressed Row (BSR) format in combination
with a magnitude-based criterion to efficiently prune activations. We
furthermore introduce efficient block-sparse operators for GPUs and showcase
their effectiveness, as well as the superior compression offered by block
sparsity. We report the effectiveness of activation pruning by evaluating
training speed, accuracy, and memory usage of large-scale neural architectures, using ResMLP on image classification tasks as an example. As a result, we observe a memory reduction of up to 32% while maintaining accuracy. Ultimately, our approach aims to democratize large-scale model training, reduce GPU requirements, and address ecological concerns.
Comment: 8 pages, 11 figures, submitted to the 6th AccML workshop at the HiPEAC conference 202
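
To make the pruning scheme concrete, the following is a minimal NumPy/SciPy sketch of the general idea, assuming a simple L1 block-magnitude criterion and a fixed keep ratio; the block size, criterion, and function names are illustrative choices, not the paper's implementation:

import numpy as np
from scipy.sparse import bsr_matrix

def prune_activations_bsr(acts, block=16, keep_ratio=0.5):
    """Zero out low-magnitude blocks of an activation matrix and return
    the survivors in Block Sparse Row (BSR) format.  Illustrative sketch."""
    rows, cols = acts.shape
    assert rows % block == 0 and cols % block == 0, "pad to a block multiple"
    # View the matrix as a grid of (block x block) tiles and score each
    # tile by its L1 magnitude.
    tiles = acts.reshape(rows // block, block, cols // block, block)
    scores = np.abs(tiles).sum(axis=(1, 3))
    # Keep the top keep_ratio fraction of tiles, zero out the rest.
    threshold = np.quantile(scores, 1.0 - keep_ratio)
    mask = scores >= threshold
    pruned = (tiles * mask[:, None, :, None]).reshape(rows, cols)
    # BSR stores only the surviving dense blocks plus their block indices,
    # which is where the memory savings come from.
    return bsr_matrix(pruned, blocksize=(block, block))

acts = np.random.randn(128, 256).astype(np.float32)
sparse_acts = prune_activations_bsr(acts, block=16, keep_ratio=0.5)
print(sparse_acts.data.nbytes / acts.nbytes)   # roughly the keep ratio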
On the Non-Associativity of Analog Computations
The energy efficiency of analog forms of computing makes it one of the most
promising candidates to deploy resource-hungry machine learning tasks on
resource-constrained systems such as mobile or embedded devices. However, it is
well known that for analog computations the safety net of discretization is
missing, thus all analog computations are exposed to a variety of imperfections
of corresponding implementations. Examples include non-linearities, saturation
effects, and various forms of noise. In this work, we observe that the ordering
of input operands of an analog operation also has an impact on the output
result, which essentially makes analog computations non-associative, even
though the underlying operation might be mathematically associative. We conduct
a simple test by creating a model of a real analog processor which captures
such ordering effects. With this model we assess the importance of ordering by
comparing the test accuracy of a neural network for keyword spotting that is trained either on an ordered model, on a non-ordered variant, or on real hardware. The results prove the existence of ordering effects as well as their high impact, as neglecting ordering results in substantial accuracy drops.
Comment: Published at the ECML PKDD Conference 2023, at the 4th Workshop on IoT, Edge, and Mobile for Embedded Machine Learning
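
The ordering effect is easy to reproduce with a toy model. The sketch below, a simulation assuming a tanh saturation and additive Gaussian read-out noise (both illustrative assumptions, not the paper's hardware model), folds the same operands in different orders and obtains different sums:

import numpy as np

rng = np.random.default_rng(0)

def analog_add(a, b, sat=4.0, noise=0.01):
    """Toy analog adder: the exact sum is distorted by a saturating
    transfer function plus small read-out noise (both assumptions)."""
    s = sat * np.tanh((a + b) / sat)        # saturation effect
    return s + rng.normal(0.0, noise)       # additive noise

def analog_sum(values, order):
    """Fold the operands pairwise in the given order, as an analog
    accumulator would: acc = ((v0 + v1) + v2) + ..."""
    acc = 0.0
    for i in order:
        acc = analog_add(acc, values[i])
    return acc

values = rng.normal(0.0, 2.0, size=32)
for order in (np.arange(32), np.arange(32)[::-1], rng.permutation(32)):
    print(analog_sum(values, order))
# The three results differ: the accumulation is effectively
# non-associative although exact addition is associative.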
Implications of Noise in Resistive Memory on Deep Neural Networks for Image Classification
Resistive memory is a promising alternative to SRAM, but is also an
inherently unstable device that requires substantial effort to ensure correct
read and write operations. To avoid the associated costs in terms of area, time
and energy, the present work is concerned with exploring how much noise in
memory operations can be tolerated by image classification tasks based on
neural networks. We introduce a special noisy operator that mimics the noise in
an exemplary resistive memory unit, explore the resilience of convolutional
neural networks on the CIFAR-10 classification task, and discuss a couple of countermeasures to improve this resilience.
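
As an illustration of such a noisy operator, the following PyTorch sketch perturbs stored weights on every read, mimicking an unstable memory cell; the additive Gaussian model and sigma value are assumptions made for the sketch, whereas the paper derives its operator from a concrete resistive device:

import torch
import torch.nn as nn

class NoisyLinear(nn.Linear):
    """Linear layer whose weights are read through a noisy memory model:
    every forward pass perturbs the stored weights with Gaussian noise
    (an illustrative stand-in for resistive-memory read noise)."""

    def __init__(self, in_features, out_features, sigma=0.05, **kwargs):
        super().__init__(in_features, out_features, **kwargs)
        self.sigma = sigma

    def forward(self, x):
        # Each read of resistive memory returns a slightly different value.
        noisy_w = self.weight + self.sigma * torch.randn_like(self.weight)
        return nn.functional.linear(x, noisy_w, self.bias)

# Example: sweep sigma to study how classification accuracy degrades.
layer = NoisyLinear(512, 10, sigma=0.1)
logits = layer(torch.randn(8, 512))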
Reducing Memory Requirements for the IPU using Butterfly Factorizations
High Performance Computing (HPC) has benefited from various improvements over the last decades, especially in terms of hardware platforms that provide more processing power while keeping power consumption at a reasonable level. The Intelligence Processing Unit (IPU) is a new type of massively parallel processor, designed to speed up parallel computations with a huge number of processing cores and on-chip memory components connected by high-speed fabrics. IPUs mainly target machine learning applications; however, due to the architectural differences between GPUs and IPUs, especially the significantly smaller memory capacity of an IPU, methods for reducing model size by sparsification
have to be considered. Butterfly factorizations are well-known replacements for
fully-connected and convolutional layers. In this paper, we examine how
butterfly structures can be implemented on an IPU and study their behavior and
performance compared to a GPU. Experimental results indicate that these methods provide a 98.5% compression ratio, decreasing the immense need for memory, and that the IPU implementation benefits from performance improvements of 1.3x and 1.6x for butterfly and pixelated butterfly, respectively. We also achieve a 1.62x training-time speedup on a real-world dataset such as CIFAR-10.
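
The compression argument is visible in a small sketch. Below, a minimal NumPy construction of butterfly factors, assuming random 2x2 mixing blocks per index pair (an illustrative layout, not the paper's IPU implementation), replaces an n x n weight matrix with log2(n) factors of 2n nonzeros each:

import numpy as np

def random_butterfly_factors(n, rng):
    """Build log2(n) butterfly factors for an n x n transform (n = 2**k).
    Factor s has one 2x2 mixing block per index pair (i, i XOR 2**s),
    i.e. only 2n nonzeros, versus n**2 for a dense weight matrix."""
    k = n.bit_length() - 1
    assert 1 << k == n, "n must be a power of two"
    factors = []
    for s in range(k):
        stride = 1 << s
        B = np.zeros((n, n))
        for i in range(n):
            j = i ^ stride                  # partner index at this stage
            B[i, i], B[i, j] = rng.normal(size=2)
        factors.append(B)
    return factors

rng = np.random.default_rng(0)
n = 64
factors = random_butterfly_factors(n, rng)
W = np.linalg.multi_dot(factors)            # implicit dense n x n weight
y = W @ rng.normal(size=n)

dense_params = n * n                                          # 4096
butterfly_params = sum(int((B != 0).sum()) for B in factors)  # 2n*log2(n) = 768
print(dense_params, butterfly_params)       # the gap widens with n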
GraphMatch: Subgraph Query Processing on FPGAs
Efficiently finding subgraph embeddings in large graphs is crucial for many
application areas like biology and social network analysis. Set intersections
are the predominant and most challenging aspect of current join-based subgraph
query processing systems for CPUs. Previous work has shown the viability of
utilizing FPGAs for acceleration of graph and join processing.
In this work, we propose GraphMatch, the first general-purpose stand-alone
subgraph query processing accelerator based on worst-case optimal joins (WCOJ)
that is fully designed for modern, field programmable gate array (FPGA)
hardware. For efficient processing of various graph data sets and query graph
patterns, it leverages a novel set intersection approach, called AllCompare,
tailor-made for FPGAs. We show that this set intersection approach efficiently
solves multi-set intersections in subgraph query processing, superior to
CPU-based approaches. Overall, GraphMatch achieves speedups of over 2.68x and 5.16x compared to the state-of-the-art systems GraphFlow and RapidMatch, respectively.
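
AllCompare itself is an FPGA circuit, but the CPU baseline it competes with is easy to state: the core step of a worst-case optimal join extends a partial embedding by intersecting the sorted adjacency lists of all already-matched neighbors. A minimal Python sketch of that step (a generic merge-based intersection, not AllCompare):

def intersect_sorted(a, b):
    """Merge-based intersection of two sorted adjacency lists --
    the CPU baseline that set-intersection accelerators compete with."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def extend_candidates(adj, matched):
    """Core WCOJ step: candidates for the next query vertex are the
    intersection of the neighbor lists of all already-matched vertices."""
    lists = sorted((adj[v] for v in matched), key=len)  # smallest first
    result = lists[0]
    for nxt in lists[1:]:
        result = intersect_sorted(result, nxt)
        if not result:                                  # early exit
            break
    return result

# Example: find vertices adjacent to both 0 and 1 (triangle candidates).
adj = {0: [1, 2, 3, 5], 1: [0, 2, 5], 2: [0, 1], 3: [0], 5: [0, 1]}
print(extend_candidates(adj, [0, 1]))   # -> [2, 5]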
Analyzing Communication Models for Distributed Thread-Collaborative Processors in Terms of Energy and Time
Accelerated computing has become pervasive for increasing computational power and energy efficiency in terms of GFLOPs/Watt. For application areas with the highest demands, for instance high performance computing, data warehousing, and high performance analytics, accelerators like GPUs or Intel's MICs are distributed throughout the cluster. Since current analyses and predictions show that data movement will be the main contributor to energy consumption, we are entering an era of communication-centric heterogeneous systems that operate under hard power constraints. In this work, we analyze data movement optimizations for distributed heterogeneous systems based on CPUs and GPUs. Thread-collaborative processors like GPUs differ significantly in their execution model from general-purpose processors like CPUs, but available communication models are still designed and optimized for CPUs. Similar to heterogeneity in processing, heterogeneity in communication can have a huge impact on energy and time. To analyze this impact, we use multiple workloads with distinct properties regarding computational intensity and communication characteristics. We show for which workloads tailored communication models are essential, not only reducing execution time but also saving energy. Exposing the impact in terms of energy and time for communication-centric heterogeneous systems is crucial for future optimizations, and this work is a first step in this direction.
A HT3 Platform for Rapid Prototyping and High Performance Reconfigurable Computing
FPGAs as reconfigurable devices play an important role in both rapid prototyping and high performance reconfigurable computing. Usually, FPGA vendors help users with pre-designed cores, for instance for various communication protocols. However, this is only true for widely used protocols. In the use case described here, the target application may benefit from a tight integration of the FPGA into a computing system. Typical commodity protocols like PCI Express may not fulfill these demands. HyperTransport (HT), on the other hand, allows connecting directly to a processor interface, without intermediate bridges or protocol conversion. As a result, communication costs between the FPGA unit and both the processor and main memory are minimal. In this paper we present an HT3 interface for Stratix IV based FPGAs, which allows for minimal latencies and high bandwidths between processor and device as well as between main memory and device. Designs targeting an HT connection can now be prototyped in real-world systems. Furthermore, this design can be leveraged for acceleration tasks, with the minimal communication costs allowing fine-grain work deployment and the use of cost-efficient main memory instead of size-limited and costly on-device memory.