42,673 research outputs found
Kerncraft: A Tool for Analytic Performance Modeling of Loop Kernels
Achieving optimal program performance requires deep insight into the
interaction between hardware and software. For software developers without an
in-depth background in computer architecture, understanding and fully utilizing
modern architectures is close to impossible. Analytic loop performance modeling
is a useful way to understand the relevant bottlenecks of code execution based
on simple machine models. The Roofline Model and the Execution-Cache-Memory
(ECM) model are proven approaches to performance modeling of loop nests. In
comparison to the Roofline model, the ECM model can also describes the
single-core performance and saturation behavior on a multicore chip. We give an
introduction to the Roofline and ECM models, and to stencil performance
modeling using layer conditions (LC). We then present Kerncraft, a tool that
can automatically construct Roofline and ECM models for loop nests by
performing the required code, data transfer, and LC analysis. The layer
condition analysis allows to predict optimal spatial blocking factors for loop
nests. Together with the models it enables an ab-initio estimate of the
potential benefits of loop blocking optimizations and of useful block sizes. In
cases where LC analysis is not easily possible, Kerncraft supports a cache
simulator as a fallback option. Using a 25-point long-range stencil we
demonstrate the usefulness and predictive power of the Kerncraft tool.Comment: 22 pages, 5 figure
Training deep neural networks with low precision multiplications
Multipliers are the most space and power-hungry arithmetic operators of the
digital implementation of deep neural networks. We train a set of
state-of-the-art neural networks (Maxout networks) on three benchmark datasets:
MNIST, CIFAR-10 and SVHN. They are trained with three distinct formats:
floating point, fixed point and dynamic fixed point. For each of those datasets
and for each of those formats, we assess the impact of the precision of the
multiplications on the final error after training. We find that very low
precision is sufficient not just for running trained networks but also for
training them. For example, it is possible to train Maxout networks with 10
bits multiplications.Comment: 10 pages, 5 figures, Accepted as a workshop contribution at ICLR 201
- …