10,181 research outputs found
Ithemal: Accurate, Portable and Fast Basic Block Throughput Estimation using Deep Neural Networks
Predicting the number of clock cycles a processor takes to execute a block of
assembly instructions in steady state (the throughput) is important for both
compiler designers and performance engineers. Building an analytical model to
do so is especially complicated in modern x86-64 Complex Instruction Set
Computer (CISC) machines with sophisticated processor microarchitectures in
that it is tedious, error prone, and must be performed from scratch for each
processor generation. In this paper we present Ithemal, the first tool which
learns to predict the throughput of a set of instructions. Ithemal uses a
hierarchical LSTM--based approach to predict throughput based on the opcodes
and operands of instructions in a basic block. We show that Ithemal is more
accurate than state-of-the-art hand-written tools currently used in compiler
backends and static machine code analyzers. In particular, our model has less
than half the error of state-of-the-art analytical models (LLVM's llvm-mca and
Intel's IACA). Ithemal is also able to predict these throughput values just as
fast as the aforementioned tools, and is easily ported across a variety of
processor microarchitectures with minimal developer effort.Comment: Published at 36th International Conference on Machine Learning (ICML)
201
An Experimental Study of Reduced-Voltage Operation in Modern FPGAs for Neural Network Acceleration
We empirically evaluate an undervolting technique, i.e., underscaling the
circuit supply voltage below the nominal level, to improve the power-efficiency
of Convolutional Neural Network (CNN) accelerators mapped to Field Programmable
Gate Arrays (FPGAs). Undervolting below a safe voltage level can lead to timing
faults due to excessive circuit latency increase. We evaluate the
reliability-power trade-off for such accelerators. Specifically, we
experimentally study the reduced-voltage operation of multiple components of
real FPGAs, characterize the corresponding reliability behavior of CNN
accelerators, propose techniques to minimize the drawbacks of reduced-voltage
operation, and combine undervolting with architectural CNN optimization
techniques, i.e., quantization and pruning. We investigate the effect of
environmental temperature on the reliability-power trade-off of such
accelerators. We perform experiments on three identical samples of modern
Xilinx ZCU102 FPGA platforms with five state-of-the-art image classification
CNN benchmarks. This approach allows us to study the effects of our
undervolting technique for both software and hardware variability. We achieve
more than 3X power-efficiency (GOPs/W) gain via undervolting. 2.6X of this gain
is the result of eliminating the voltage guardband region, i.e., the safe
voltage region below the nominal level that is set by FPGA vendor to ensure
correct functionality in worst-case environmental and circuit conditions. 43%
of the power-efficiency gain is due to further undervolting below the
guardband, which comes at the cost of accuracy loss in the CNN accelerator. We
evaluate an effective frequency underscaling technique that prevents this
accuracy loss, and find that it reduces the power-efficiency gain from 43% to
25%.Comment: To appear at the DSN 2020 conferenc
A Survey on Compiler Autotuning using Machine Learning
Since the mid-1990s, researchers have been trying to use machine-learning
based approaches to solve a number of different compiler optimization problems.
These techniques primarily enhance the quality of the obtained results and,
more importantly, make it feasible to tackle two main compiler optimization
problems: optimization selection (choosing which optimizations to apply) and
phase-ordering (choosing the order of applying optimizations). The compiler
optimization space continues to grow due to the advancement of applications,
increasing number of compiler optimizations, and new target architectures.
Generic optimization passes in compilers cannot fully leverage newly introduced
optimizations and, therefore, cannot keep up with the pace of increasing
options. This survey summarizes and classifies the recent advances in using
machine learning for the compiler optimization field, particularly on the two
major problems of (1) selecting the best optimizations and (2) the
phase-ordering of optimizations. The survey highlights the approaches taken so
far, the obtained results, the fine-grain classification among different
approaches and finally, the influential papers of the field.Comment: version 5.0 (updated on September 2018)- Preprint Version For our
Accepted Journal @ ACM CSUR 2018 (42 pages) - This survey will be updated
quarterly here (Send me your new published papers to be added in the
subsequent version) History: Received November 2016; Revised August 2017;
Revised February 2018; Accepted March 2018
Improving the translation environment for professional translators
When using computer-aided translation systems in a typical, professional translation workflow, there are several stages at which there is room for improvement. The SCATE (Smart Computer-Aided Translation Environment) project investigated several of these aspects, both from a human-computer interaction point of view, as well as from a purely technological side.
This paper describes the SCATE research with respect to improved fuzzy matching, parallel treebanks, the integration of translation memories with machine translation, quality estimation, terminology extraction from comparable texts, the use of speech recognition in the translation process, and human computer interaction and interface design for the professional translation environment. For each of these topics, we describe the experiments we performed and the conclusions drawn, providing an overview of the highlights of the entire SCATE project
Timing verification of dynamically reconfigurable logic for Xilinx Virtex FPGA series
This paper reports on a method for extending existing VHDL design and verification software available for the Xilinx Virtex series of FPGAs. It allows the designer to apply standard hardware design and verification tools to the design of dynamically reconfigurable logic (DRL). The technique involves the conversion of a dynamic design into multiple static designs, suitable for input to standard synthesis and APR tools. For timing and functional verification after APR, the sections of the design can then be recombined into a single dynamic system. The technique has been automated by extending an existing DRL design tool named DCSTech, which is part of the Dynamic Circuit Switching (DCS) CAD framework. The principles behind the tools are generic and should be readily extensible to other architectures and CAD toolsets. Implementation of the dynamic system involves the production of partial configuration bitstreams to load sections of circuitry. The process of creating such bitstreams, the final stage of our design flow, is summarized
ENTRA:Whole-systems energy transparency
Promoting energy efficiency to a first class system design goal is an
important research challenge. Although more energy-efficient hardware can be
designed, it is software that controls the hardware; for a given system the
potential for energy savings is likely to be much greater at the higher levels
of abstraction in the system stack. Thus the greatest savings are expected from
energy-aware software development, which is the vision of the EU ENTRA project.
This article presents the concept of energy transparency as a foundation for
energy-aware software development. We show how energy modelling of hardware is
combined with static analysis to allow the programmer to understand the energy
consumption of a program without executing it, thus enabling exploration of the
design space taking energy into consideration. The paper concludes by
summarising the current and future challenges identified in the ENTRA project.Comment: Revised preprint submitted to MICPRO on 27 May 2016, 23 pages, 3
figure
Thwarting Code-Reuse and Side-Channel Attacks in Embedded Systems
Nowadays, embedded devices are increasingly present in everyday life, often
controlling and processing critical information. For this reason, these devices
make use of cryptographic protocols. However, embedded devices are particularly
vulnerable to attackers seeking to hijack their operation and extract sensitive
information. Code-Reuse Attacks (CRAs) can steer the execution of a program to
malicious outcomes, leveraging existing on-board code without direct access to
the device memory. Moreover, Side-Channel Attacks (SCAs) may reveal secret
information to the attacker based on mere observation of the device. In this
paper, we are particularly concerned with thwarting CRAs and SCAs against
embedded devices, while taking into account their resource limitations.
Fine-grained code diversification can hinder CRAs by introducing uncertainty to
the binary code; while software mechanisms can thwart timing or power SCAs. The
resilience to either attack may come at the price of the overall efficiency.
Moreover, a unified approach that preserves these mitigations against both CRAs
and SCAs is not available. This is the main novelty of our approach, Secure
Diversity by Construction (SecDivCon); a combinatorial compiler-based approach
that combines software diversification against CRAs with software mitigations
against SCAs. SecDivCon restricts the performance overhead in the generated
code, offering a secure-by-design control on the performance-security
trade-off. Our experiments show that SCA-aware diversification is effective
against CRAs, while preserving SCA mitigation properties at a low, controllable
overhead. Given the combinatorial nature of our approach, SecDivCon is suitable
for small, performance-critical functions that are sensitive to SCAs. SecDivCon
may be used as a building block to whole-program code diversification or in a
re-randomization scheme of cryptographic code
- …