1,536 research outputs found
Ithemal: Accurate, Portable and Fast Basic Block Throughput Estimation using Deep Neural Networks
Predicting the number of clock cycles a processor takes to execute a block of
assembly instructions in steady state (the throughput) is important for both
compiler designers and performance engineers. Building an analytical model to
do so is especially complicated in modern x86-64 Complex Instruction Set
Computer (CISC) machines with sophisticated processor microarchitectures in
that it is tedious, error prone, and must be performed from scratch for each
processor generation. In this paper we present Ithemal, the first tool which
learns to predict the throughput of a set of instructions. Ithemal uses a
hierarchical LSTM--based approach to predict throughput based on the opcodes
and operands of instructions in a basic block. We show that Ithemal is more
accurate than state-of-the-art hand-written tools currently used in compiler
backends and static machine code analyzers. In particular, our model has less
than half the error of state-of-the-art analytical models (LLVM's llvm-mca and
Intel's IACA). Ithemal is also able to predict these throughput values just as
fast as the aforementioned tools, and is easily ported across a variety of
processor microarchitectures with minimal developer effort.Comment: Published at 36th International Conference on Machine Learning (ICML)
201
A Survey on Design Methodologies for Accelerating Deep Learning on Heterogeneous Architectures
In recent years, the field of Deep Learning has seen many disruptive and
impactful advancements. Given the increasing complexity of deep neural
networks, the need for efficient hardware accelerators has become more and more
pressing to design heterogeneous HPC platforms. The design of Deep Learning
accelerators requires a multidisciplinary approach, combining expertise from
several areas, spanning from computer architecture to approximate computing,
computational models, and machine learning algorithms. Several methodologies
and tools have been proposed to design accelerators for Deep Learning,
including hardware-software co-design approaches, high-level synthesis methods,
specific customized compilers, and methodologies for design space exploration,
modeling, and simulation. These methodologies aim to maximize the exploitable
parallelism and minimize data movement to achieve high performance and energy
efficiency. This survey provides a holistic review of the most influential
design methodologies and EDA tools proposed in recent years to implement Deep
Learning accelerators, offering the reader a wide perspective in this rapidly
evolving field. In particular, this work complements the previous survey
proposed by the same authors in [203], which focuses on Deep Learning hardware
accelerators for heterogeneous HPC platforms
HaTS: Hardware-Assisted Transaction Scheduler
In this paper we present HaTS, a Hardware-assisted Transaction Scheduler. HaTS improves performance of concurrent applications by classifying the executions of their atomic blocks (or in-memory transactions) into scheduling queues, according to their so called conflict indicators. The goal is to group those transactions that are conflicting while letting non-conflicting transactions proceed in parallel. Two core innovations characterize HaTS. First, HaTS does not assume the availability of precise information associated with incoming transactions in order to proceed with the classification. It relaxes this assumption by exploiting the inherent conflict resolution provided by Hardware Transactional Memory (HTM). Second, HaTS dynamically adjusts the number of the scheduling queues in order to capture the actual application contention level. Performance results using the STAMP benchmark suite show up to 2x improvement over state-of-the-art HTM-based scheduling techniques
Vertical Optimizations of Convolutional Neural Networks for Embedded Systems
L'abstract è presente nell'allegato / the abstract is in the attachmen
- …