On the Distribution of Control in Asynchronous Processor Architectures
Institute for Computing Systems Architecture
The effective performance of computer systems is to a large measure
determined by the synergy between the processor architecture, the
instruction set and the compiler. In the past, the sequencing of
information within processor architectures has normally been
synchronous: controlled centrally by a clock. However, this global
signal could limit the future gains in performance that can
potentially be achieved through improvements in implementation
technology.
This thesis investigates the effects of relaxing this strict synchrony
by distributing control within processor architectures through the use
of a novel asynchronous design model known as a micronet. The impact
of asynchronous control on the performance of a RISC-style processor
is explored at different levels. Firstly, improvements in the
performance of individual instructions by exploiting actual run-time
behaviours are demonstrated. Secondly, it is shown that micronets are
able to exploit further (both spatial and temporal) instruction-level
parallelism (ILP) efficiently through the distribution of control to
datapath resources. Finally, exposing fine-grain concurrency within a
datapath can only be of benefit to a computer system if it can easily
be exploited by the compiler. Although compilers for micronet-based
asynchronous processors may be considered to be more complex than
their synchronous counterparts, it is shown that the variable
execution time of an instruction does not adversely affect the
compiler's ability to schedule code efficiently. In conclusion, the
modelling of a processor's datapath as a micronet permits the
exploitation of both fine-grain ILP and actual run-time delays, thus
leading to the efficient utilisation of functional units and in turn
resulting in an improvement in overall system performance.
Are We There Yet? Product Quantization and its Hardware Acceleration
Conventional multiply-accumulate (MAC) operations have long dominated
computation time for deep neural networks (DNNs). Recently, product
quantization (PQ) has been successfully applied to these workloads, replacing
MACs with memory lookups to pre-computed dot products. While this property
makes PQ an attractive solution for model acceleration, little is understood
about the associated trade-offs in terms of compute and memory footprint, and
the impact on accuracy. Our empirical study investigates the impact of
different PQ settings and training methods on layerwise reconstruction error
and end-to-end model accuracy. When studying the efficiency of deploying PQ
DNNs, we find that metrics such as FLOPs, number of parameters, and even
CPU/GPU performance, can be misleading. To address this issue, and to more
fairly assess PQ in terms of hardware efficiency, we design the first custom
hardware accelerator to evaluate the speed and efficiency of running PQ models.
We identify PQ configurations that are able to improve performance-per-area for
ResNet20 by 40%-104%, even when compared to a highly optimized conventional DNN
accelerator. Our hardware performance outperforms recent PQ solutions by 4x,
with only a 0.6% accuracy degradation. This work demonstrates the practical and
hardware-aware design of PQ models, paving the way for wider adoption of this
emerging DNN approximation methodology.
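The idea of replacing MACs with lookups to pre-computed dot products can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the tiny hand-written codebooks, and the nearest-prototype encoding by squared Euclidean distance are all assumptions made for illustration.

```python
# Toy product-quantization (PQ) dot product. All names and values here are
# illustrative assumptions, not taken from the paper.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def split(v, num_subspaces):
    """Cut a vector into equal-length subvectors, one per subspace."""
    d = len(v) // num_subspaces
    return [v[i * d:(i + 1) * d] for i in range(num_subspaces)]

def build_table(codebooks, weights):
    """Pre-compute dot(prototype, weight subvector) for every prototype."""
    w_subs = split(weights, len(codebooks))
    return [[dot(p, w) for p in protos] for protos, w in zip(codebooks, w_subs)]

def encode(x, codebooks):
    """Map each input subvector to the index of its nearest prototype."""
    codes = []
    for protos, xs in zip(codebooks, split(x, len(codebooks))):
        dists = [sum((a - b) ** 2 for a, b in zip(p, xs)) for p in protos]
        codes.append(dists.index(min(dists)))
    return codes

def pq_dot(x, codebooks, table):
    """Approximate dot(x, weights) using only table lookups and additions."""
    return sum(table[c][k] for c, k in enumerate(encode(x, codebooks)))

# Two subspaces of dimension 2, two prototypes each (hypothetical values).
codebooks = [[[0, 0], [1, 1]], [[1, 0], [0, 1]]]
weights = [1, 2, 3, 4]
table = build_table(codebooks, weights)
x = [0.9, 1.1, 0.1, 0.8]
approx = pq_dot(x, codebooks, table)   # lookup-based estimate
exact = dot(x, weights)                # conventional MAC result, for comparison
```

In this toy case the lookup path returns 7 while the exact dot product is about 6.6; the gap is the quantization error, which is the kind of accuracy cost the study sets against the savings in compute.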
A Low-Power Two-Digit Multi-dimensional Logarithmic Number System Filterbank Architecture for a Digital Hearing Aid
This paper addresses the implementation of a filterbank for digital hearing aids using a multi-dimensional logarithmic number system (MDLNS). The MDLNS, which has similar properties to the classical logarithmic number system (LNS), provides more degrees of freedom than the LNS by virtue of having two or more orthogonal bases and the ability to use multiple MDLNS components, or digits. The logarithmic properties of the MDLNS also allow for reduced-complexity multiplication and a large dynamic range, and a multiple-digit MDLNS provides a considerable reduction in hardware complexity compared to a conventional LNS approach. We discuss an improved design for a two-digit 2D MDLNS filterbank implementation which reduces power and area by more than a factor of two compared to the original design.
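A short sketch may help show why multiplication is cheap in such a representation. The abstract does not name the bases or the word format, so the sketch assumes a first base of 2, a second base of 3, signed integer exponents, and a digit stored as a (sign, a, b) tuple; these choices are illustrative only.

```python
from fractions import Fraction

SECOND_BASE = 3  # assumption: the abstract does not specify the second base

def digit_value(sign, a, b):
    """Exact value of one 2D MDLNS digit: sign * 2**a * SECOND_BASE**b."""
    return sign * Fraction(2) ** a * Fraction(SECOND_BASE) ** b

def digit_mul(d1, d2):
    """Multiply two digits by multiplying signs and adding exponents."""
    (s1, a1, b1), (s2, a2, b2) = d1, d2
    return (s1 * s2, a1 + a2, b1 + b2)

def number_value(digits):
    """A multi-digit number is the sum of its digits."""
    return sum(digit_value(*d) for d in digits)

def scale(digits, coeff):
    """Multiply a two-digit number by a one-digit coefficient, digit by digit."""
    return [digit_mul(d, coeff) for d in digits]

# Hypothetical two-digit sample and one-digit filter coefficient.
sample = [(1, 3, -1), (-1, 0, 1)]   # 8/3 - 3
coeff = (1, -1, 2)                  # 9/2
product = scale(sample, coeff)
```

Under these assumptions each digit product needs only two small integer additions and a sign multiplication, with no multiplier array; the cost moves to summing the resulting digits, which this sketch performs exactly with `Fraction` rather than modelling hardware conversion.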
A Network-based Asynchronous Architecture for Cryptographic Devices
Institute for Computing Systems Architecture
The traditional model of cryptography examines the security of the cipher as a
mathematical function. However, ciphers that are secure when specified as mathematical
functions are not necessarily secure in real-world implementations. The physical
implementations of ciphers can be extremely difficult to control and often leak so-called
side-channel information. Side-channel cryptanalysis attacks have been shown to
be especially effective as a practical means for attacking implementations of cryptographic
algorithms on simple hardware platforms, such as smart-cards. Adversaries
can obtain sensitive information from side-channels, such as the timing of operations,
power consumption and electromagnetic emissions. Some of the attack techniques
require surprisingly little side-channel information to break some of the best known
ciphers. In constrained devices, such as smart-cards, straightforward implementations
of cryptographic algorithms can be broken with minimal work. Preventing these attacks
has become an active and challenging area of research.
Power analysis is a successful cryptanalytic technique that extracts secret information
from cryptographic devices by analysing the power consumed during their operation.
A particularly dangerous class of power analysis, differential power analysis
(DPA), relies on the correlation of power consumption measurements. It has been proposed
that adding non-determinism to the execution of the cryptographic device would
reduce the danger of these attacks. It has also been demonstrated that asynchronous
logic has advantages for security-sensitive applications. This thesis investigates the
security and performance advantages of using a network-based asynchronous architecture,
in which the functional units of the datapath form a network. Non-deterministic
execution is achieved by exploiting concurrent execution of instructions both with and
without data-dependencies; and by forwarding register values between instructions
with data-dependencies using randomised routing over the network. The executions of
cryptographic algorithms on different architectural configurations are simulated, and
the obtained power traces are subjected to DPA attacks. The results show that the
proposed architecture introduces a level of non-determinism in the execution that significantly
raises the threshold for DPA attacks to succeed. In addition, the performance
analysis shows that the improved security does not degrade performance.
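The reliance of power analysis on correlation can be illustrated with a toy key-recovery sketch. It does not reproduce the thesis's simulator or attack setup: the 4-bit substitution table, the Hamming-weight leakage model, the single-sample "traces", and all function names are assumptions chosen to keep the example small.

```python
import random

# Toy 4-bit substitution table (an illustrative assumption).
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]

def hamming_weight(v):
    return bin(v).count("1")

def simulate_traces(key, plaintexts, noise_sigma=0.0, seed=0):
    """One power sample per plaintext: Hamming weight of the S-box output
    plus optional Gaussian noise (assumed leakage model)."""
    rng = random.Random(seed)
    return [hamming_weight(SBOX[p ^ key]) + rng.gauss(0.0, noise_sigma)
            for p in plaintexts]

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either side is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5

def recover_key(plaintexts, traces):
    """Score every key guess by how well its predicted leakage
    correlates with the measured samples; return the best guess."""
    scores = []
    for guess in range(16):
        model = [hamming_weight(SBOX[p ^ guess]) for p in plaintexts]
        scores.append(pearson(model, traces))
    best = max(range(16), key=lambda g: scores[g])
    return best, scores

plaintexts = list(range(16))
traces = simulate_traces(0xA, plaintexts)      # noise-free for clarity
guess, scores = recover_key(plaintexts, traces)
```

With noise-free samples the correct guess correlates perfectly with the measurements. Non-deterministic execution of the kind described above would, in effect, break the fixed alignment between a predicted intermediate value and the moment its power is measured, so an attacker would need more traces before any guess stands out.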
- …