54 research outputs found
VThreads: A novel VLIW chip multiprocessor with hardware-assisted PThreads
We discuss VThreads, a novel VLIW CMP with hardware-assisted shared-memory Thread support. VThreads supports Instruction Level Parallelism via static multiple-issue and Thread Level Parallelism via hardware-assisted POSIX Threads along with extensive customization. It allows the instantiation of tightly coupled streaming accelerators and supports up to 7-address Multiple-Input, Multiple-Output instruction extensions. VThreads is designed in technology-independent Register-Transfer-Level VHDL and prototyped on 40 nm and 28 nm Field-Programmable Gate Arrays. It was evaluated against a PThreads-based multiprocessor based on the SPARC-V8 ISA. On a 65 nm ASIC implementation, VThreads achieves up to a 7.2x performance increase on synthetic benchmarks, 5x on a parallel Mandelbrot implementation, 66% better performance on a threaded JPEG implementation, 79% better on an edge-detection benchmark and ~13% improvement on DES compared to the Leon3MP CMP. In the range of 2 to 8 cores VThreads demonstrates a post-route (statistical) power reduction between 65% and 57% at an area increase of 1.2%-10% for 1-8 cores, compared to a similarly-configured Leon3MP CMP. This combination of micro-architectural features, scalability, extensibility, hardware support for low-latency PThreads, power efficiency and area makes the processor an attractive proposition for low-power, deeply-embedded applications requiring minimum OS support.
Efficient design-space exploration of custom instruction-set extensions
Customization of processors with instruction set extensions (ISEs) is a technique
that improves performance through parallelization with a reasonable area overhead,
in exchange for additional design effort. This thesis presents a collection of
novel techniques that reduce the design effort and cost of generating ISEs by advancing
automation and reconfigurability. In addition, these techniques maximize
the performance gained as a function of the additional committed resources.
Including ISEs into a processor design implies development at many levels.
Most prior works on ISEs solve separate stages of the design: identification,
selection, and implementation. However, the interactions between these stages
also hold important design trade-offs. In particular, this thesis addresses the lack
of interaction between the hardware implementation stage and the two previous
stages. Interaction with the implementation stage has been mostly limited to
accurately measuring the area and timing requirements of the implementation
of each ISE candidate as a separate hardware module. However, the need to
independently generate a hardware datapath for each ISE limits the flexibility
of the design and the performance gains. Hence, resource sharing is essential in
order to create a customized unit with multi-function capabilities.
Previously proposed resource-sharing techniques aggressively share resources
amongst the ISEs, thus minimizing the area of the solution at any cost. However,
it is shown that aggressively sharing resources leads to large ISE datapath latency.
Thus, this thesis presents an original heuristic that can be parameterized
in order to control the degree of resource sharing amongst a given set of ISEs,
thereby permitting the exploration of the existing implementation trade-offs between
instruction latency and area savings. In addition, this thesis introduces an
innovative predictive model that is able to quickly expose the optimal trade-offs
of this design space. Compared to an exhaustive exploration of the design space,
the predictive model is shown to reduce by two orders of magnitude the number
of executions of the resource-sharing algorithm that are required in order to find
the optimal trade-offs.
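
The area/latency trade-off described above can be illustrated with a minimal sketch. It assumes a resource-sharing heuristic has been run at several sharing levels and has produced (area, latency) pairs; the function keeps only the Pareto-optimal ones. The name `pareto_front` and all numbers are hypothetical and are not taken from the thesis, and the sketch does not reproduce the thesis's heuristic or predictive model.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (area, latency); both are to be minimized


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the points not dominated by any other point, sorted by area."""
    front: List[Point] = []
    best_latency = float("inf")
    # Visit points by increasing area (ties broken by latency). A point is
    # kept only if it has strictly lower latency than every cheaper point.
    for area, latency in sorted(set(points)):
        if latency < best_latency:
            front.append((area, latency))
            best_latency = latency
    return front


# Hypothetical design points: (area units, datapath latency in cycles).
designs = [(100, 9), (120, 7), (130, 8), (150, 5), (150, 6), (200, 5)]
optimal = pareto_front(designs)
print(optimal)
```

Each surviving point is a configuration where latency cannot be reduced without spending more area, which is the kind of trade-off curve a designer would inspect.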
This thesis presents a technique that is the first one to combine the design
spaces of ISE selection and resource sharing in ISE datapath synthesis, in order
to offer the designer solutions that achieve maximum speedup and maximum
resource utilization using the available area. Optimal trade-offs in the design
space are found by guiding the selection process to favour ISE combinations that
are likely to share resources with low speedup losses. Experimental results show
that this combined approach unveils new trade-offs between speedup and area
that are not identified by previous selection techniques; speedups of up to 238%
over previous selection techniques were obtained.
Finally, multi-cycle ISEs can be pipelined in order to increase their throughput.
However, it is shown that traditional ISE identification techniques do not
allow this optimization due to control flow overhead. In order to obtain the benefits
of overlapping loop executions, this thesis proposes to carefully insert loop
control flow statements into the ISEs, thus allowing the ISE to control the iterations
of the loop. The proposed ISEs broaden the scope of instruction-level
parallelism and obtain higher speedups compared to traditional ISEs, primarily
through pipelining, the exploitation of spatial parallelism, and reducing the
overhead of control flow statements and branches. A detailed case study of a
real application shows that the proposed method achieves 91% higher speedups
than the state-of-the-art, with an area overhead of less than 8% in hardware
implementation.
Investigating the Potential of Custom Instruction Set Extensions for SHA-3 Candidates on a 16-bit Microcontroller Architecture
In this paper, we investigate the benefit of instruction set extensions for software implementations of all five SHA-3 candidates. To this end, we start from optimized assembly code for a common 16-bit microcontroller instruction set architecture. By themselves, these implementations provide a reference for the complexity of the algorithms on 16-bit architectures, which are commonly used in embedded systems. For each algorithm, we then propose suitable instruction set extensions and implement the modified processor core. We assess the gains in throughput, memory consumption, and the area overhead. Our results show that with less than 10% additional area, it is possible to increase the execution speed on average by almost 40%, while reducing memory requirements on average by more than 40%. In particular, the Grøstl algorithm, which was one of the slowest algorithms in previous reference implementations, ends up being the fastest implementation by some margin, once minor (but dedicated) instruction set extensions are taken into account.
Customising compilers for customisable processors
The automatic generation of instruction set extensions to provide application-specific acceleration
for embedded processors has been a productive area of research in recent years. There
have been incremental improvements in the quality of the algorithms that discover and select
which instructions to add to a processor. The use of automatic algorithms, however, results in
instructions which are radically different from those found in conventional, human-designed,
RISC or CISC ISAs. This has resulted in a gap between the hardware’s capabilities and the
compiler’s ability to exploit them.
This thesis proposes and investigates the use of a high-level compiler pass that uses graph-subgraph
isomorphism checking to exploit these complex instructions. Operating in a separate
pass permits techniques to be applied that are uniquely suited for mapping complex instructions,
but unsuitable for conventional instruction selection. The existing, mature, compiler
back-end can then handle the remainder of the compilation. With this method, the high-level
pass was able to use 1965 different automatically produced instructions to obtain an initial average
speed-up of 1.11x over 179 benchmarks evaluated on a hardware-verified cycle-accurate
simulator.
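
As a rough illustration of mapping a complex instruction onto program dataflow, the sketch below performs rooted pattern matching on a small dataflow graph. This is a deliberately simplified stand-in for graph-subgraph isomorphism checking: it ignores operand commutativity and does not enforce that distinct pattern nodes map to distinct graph nodes. The graph encoding, the function names and the multiply-accumulate pattern are all assumptions made for this example and do not come from the thesis.

```python
from typing import Dict, List, Optional, Tuple

# node name -> (opcode, operand node names); value inputs use opcode "in"
Graph = Dict[str, Tuple[str, List[str]]]
Binding = Dict[str, str]  # pattern node -> graph node


def match_at(pattern: Graph, p: str, graph: Graph, g: str,
             binding: Binding) -> Optional[Binding]:
    """Try to map pattern node p onto graph node g, extending binding."""
    if p in binding:
        # A pattern node seen before must map to the same graph node.
        return binding if binding[p] == g else None
    p_op, p_args = pattern[p]
    if p_op == "in":
        # A pattern input can bind to any value in the graph.
        return {**binding, p: g}
    g_op, g_args = graph[g]
    if p_op != g_op or len(p_args) != len(g_args):
        return None
    current: Optional[Binding] = {**binding, p: g}
    for p_arg, g_arg in zip(p_args, g_args):
        current = match_at(pattern, p_arg, graph, g_arg, current)
        if current is None:
            return None
    return current


def find_matches(pattern: Graph, root: str, graph: Graph) -> List[Binding]:
    """Return one binding per graph node at which the pattern root matches."""
    matches: List[Binding] = []
    for g in graph:
        binding = match_at(pattern, root, graph, g, {})
        if binding is not None:
            matches.append(binding)
    return matches


# Hypothetical extension instruction: r = (a * b) + c
mac: Graph = {
    "r": ("add", ["m", "c"]),
    "m": ("mul", ["a", "b"]),
    "a": ("in", []), "b": ("in", []), "c": ("in", []),
}

# Hypothetical program fragment: t3 = ((x * y) + z) + w
program: Graph = {
    "x": ("in", []), "y": ("in", []), "z": ("in", []), "w": ("in", []),
    "t1": ("mul", ["x", "y"]),
    "t2": ("add", ["t1", "z"]),
    "t3": ("add", ["t2", "w"]),
}

matches = find_matches(mac, "r", program)
print(matches)
```

Running the matcher in its own pass, before conventional instruction selection, mirrors the separation the abstract describes; a production implementation would also need to check legality constraints such as the number of register ports.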
This result was improved following an investigation of how the produced instructions were
being used by the compiler. It was established that the models the automatic tools were using to
develop instructions did not take account of how well the compiler could realistically use them.
Adding additional parameters to the search heuristic to account for compiler issues increased
the speed-up from 1.11x to 1.24x. An alternative approach using a re-designed hardware interface
was also investigated and this achieved a speed-up of 1.26x while reducing hardware and
compiler complexity.
A complementary, high-level, method of exploiting dual memory banks was created to increase
memory bandwidth to accommodate the increased data-processing bandwidth provided
by extension instructions. Finally, the compiler was considered for use in a non-conventional
role where rather than generating code it is used to apply source-level transformations prior to
the generation of extension instructions and thus affect the shape of the instructions that are
generated.
Microarchitectural Low-Power Design Techniques for Embedded Microprocessors
With the omnipresence of embedded processing in all forms of electronics today, there is a strong trend towards wireless, battery-powered, portable embedded systems which have to operate under stringent energy constraints. Consequently, low power consumption and high energy efficiency have emerged as the two key criteria for embedded microprocessor design. In this thesis we present a range of microarchitectural low-power design techniques which enable the increase of performance for embedded microprocessors and/or the reduction of energy consumption, e.g., through voltage scaling. In the context of cryptographic applications, we explore the effectiveness of instruction set extensions (ISEs) for a range of different cryptographic hash functions (SHA-3 candidates) on a 16-bit microcontroller architecture (PIC24). Specifically, we demonstrate the effectiveness of light-weight ISEs based on lookup table integration and microcoded instructions using finite state machines for operand and address generation. On-node processing in autonomous wireless sensor node devices requires deeply embedded cores with extremely low power consumption. To address this need, we present TamaRISC, a custom-designed ISA with a corresponding ultra-low-power microarchitecture implementation. The TamaRISC architecture is employed in conjunction with an ISE and standard cell memories to design a sub-threshold capable processor system targeted at compressed sensing applications. We furthermore employ TamaRISC in a hybrid SIMD/MIMD multi-core architecture targeted at moderate to high processing requirements (> 1 MOPS). A range of different microarchitectural techniques for efficient memory organization are presented. Specifically, we introduce a configurable data memory mapping technique for private and shared access, as well as instruction broadcast together with synchronized code execution based on checkpointing. 
We then study an inherent suboptimality due to the worst-case design principle in synchronous circuits, and introduce the concept of dynamic timing margins. We show that dynamic timing margins exist in microprocessor circuits, and that these margins are to a large extent state-dependent and that they are correlated to the sequences of instruction types which are executed within the processor pipeline. To perform this analysis we propose a circuit/processor characterization flow and tool called dynamic timing analysis. Moreover, this flow is employed in order to devise a high-level instruction set simulation environment for impact-evaluation of timing errors on application performance. The presented approach improves the state of the art significantly in terms of simulation accuracy through the use of statistical fault injection. The dynamic timing margins in microprocessors are then systematically exploited for throughput improvements or energy reductions via our proposed instruction-based dynamic clock adjustment (DCA) technique. To this end, we introduce a 6-stage 32-bit microprocessor with cycle-by-cycle DCA. Besides a comprehensive design flow and simulation environment for evaluation of the DCA approach, we additionally present a silicon prototype of a DCA-enabled OpenRISC microarchitecture fabricated in 28 nm FD-SOI CMOS. The test chip includes a suitable clock generation unit which allows for cycle-by-cycle DCA over a wide range with fine granularity at frequencies exceeding 1 GHz. Measurement results of speedups and power reductions are provided.
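
The idea behind instruction-based dynamic clock adjustment can be made concrete with a small arithmetic sketch. It assumes, as a simplification, that each instruction class has a single critical-path delay and that one instruction determines the clock period of each cycle; the thesis instead relates margins to sequences of instruction types in the pipeline. The delay values, the trace and the function names are hypothetical.

```python
from typing import Dict, List


def static_time_ps(trace: List[str], delay_ps: Dict[str, int]) -> int:
    """Execution time when every cycle uses the worst-case clock period."""
    return len(trace) * max(delay_ps.values())


def dca_time_ps(trace: List[str], delay_ps: Dict[str, int]) -> int:
    """Execution time when each cycle uses the period of its instruction."""
    return sum(delay_ps[op] for op in trace)


# Hypothetical per-class critical-path delays in picoseconds.
delay_ps = {"add": 600, "load": 800, "mul": 1000}
# Hypothetical instruction trace, one instruction per cycle.
trace = ["add", "add", "load", "mul"]

t_static = static_time_ps(trace, delay_ps)
t_dca = dca_time_ps(trace, delay_ps)
speedup = t_static / t_dca
print(t_static, t_dca, speedup)
```

With these made-up numbers the worst-case clock spends 4000 ps on the trace while the adjusted clock spends 3000 ps, so the gain comes entirely from cycles whose instructions finish earlier than the slowest class.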