5 research outputs found
ASAM: Automatic Architecture Synthesis and Application Mapping; dl. 3.2: Instruction set synthesis
No abstract
Customising compilers for customisable processors
The automatic generation of instruction set extensions to provide application-specific acceleration
for embedded processors has been a productive area of research in recent years. There
have been incremental improvements in the quality of the algorithms that discover and select
which instructions to add to a processor. The use of automatic algorithms, however, results in
instructions that are radically different from those found in conventional, human-designed
RISC or CISC ISAs. This has resulted in a gap between the hardware's capabilities and the
compiler's ability to exploit them.
This thesis proposes and investigates the use of a high-level compiler pass that uses graph-subgraph
isomorphism checking to exploit these complex instructions. Operating in a separate
pass permits techniques to be applied that are uniquely suited for mapping complex instructions,
but unsuitable for conventional instruction selection. The existing, mature compiler
back-end can then handle the remainder of the compilation. With this method, the high-level
pass was able to use 1965 different automatically produced instructions to obtain an initial average
speed-up of 1.11x over 179 benchmarks evaluated on a hardware-verified cycle-accurate
simulator.
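To make the role of the high-level pass concrete, the sketch below shows how graph-subgraph isomorphism checking can locate sites in a basic block's data-flow graph where a complex extension instruction applies. It is a minimal illustration under stated assumptions: the pattern, the node labels and the use of networkx's VF2 matcher are not the representation or tooling used in the thesis.

```python
# Sketch: find occurrences of an extension-instruction pattern in a basic
# block's data-flow graph via subgraph isomorphism (networkx VF2 matcher).
# The 'op' attribute and all node names are illustrative assumptions.
import networkx as nx
from networkx.algorithms import isomorphism

# Pattern for a hypothetical multiply-accumulate extension: (a * b) + c
pattern = nx.DiGraph()
pattern.add_node("mul", op="mul")
pattern.add_node("add", op="add")
pattern.add_edge("mul", "add")  # the multiply result feeds the add

# Data-flow graph of one basic block (operators only, operands elided)
block = nx.DiGraph()
block.add_node("n1", op="mul")
block.add_node("n2", op="add")
block.add_node("n3", op="shl")
block.add_edge("n1", "n2")
block.add_edge("n2", "n3")

# Every match is a candidate region that could be rewritten as one instruction
matcher = isomorphism.DiGraphMatcher(
    block, pattern, node_match=lambda a, b: a["op"] == b["op"])
for mapping in matcher.subgraph_isomorphisms_iter():
    print("candidate site:", mapping)  # e.g. {'n1': 'mul', 'n2': 'add'}
```

Matches found in this way mark regions that the separate pass can rewrite into single extension instructions, leaving the remainder of the code to the conventional back-end.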
This result was improved following an investigation of how the produced instructions were
being used by the compiler. It was established that the models the automatic tools were using to
develop instructions did not take into account how well the compiler could realistically use them.
Adding parameters to the search heuristic to account for compiler issues increased
the speed-up from 1.11x to 1.24x. An alternative approach using a re-designed hardware interface
was also investigated and this achieved a speed-up of 1.26x while reducing hardware and
compiler complexity.
A complementary, high-level method of exploiting dual memory banks was created to increase
memory bandwidth, accommodating the increased data-processing bandwidth provided
by extension instructions. Finally, the compiler was considered for use in a non-conventional
role in which, rather than generating code, it applies source-level transformations prior to
the generation of extension instructions and thus influences the shape of the instructions that are
generated.
Increasing the efficacy of automated instruction set extension
The use of Instruction Set Extension (ISE) in customising embedded processors for a specific
application has been studied extensively in recent years. The addition of a set of complex
arithmetic instructions to a baseline core has proven to be a cost-effective means of meeting
design performance requirements. This thesis proposes and evaluates a reconfigurable ISE
implementation called "Configurable Flow Accelerators" (CFAs), a number of refinements to
an existing Automated ISE (AISE) algorithm called "ISEGEN", and the effects of source form
on AISE.
The CFA is demonstrated repeatedly to be a cost-effective design for ISE implementation.
A temporal partitioning algorithm called "staggering" is proposed and demonstrated
to reduce the area of a CFA implementation by 37% on average, for only an 8% reduction in acceleration.
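The sketch below illustrates the general idea behind temporal partitioning: spreading the operations of an ISE data-flow graph over several cycles so that a small pool of functional units can be reused, trading a little acceleration for a large area saving. The greedy list scheduling, the operation names and the unit budget are assumptions for illustration; this is not the staggering algorithm itself.

```python
# Sketch: greedily assign data-flow operations to cycles under a functional-unit
# budget, so hardware can be shared across cycles instead of replicated.
def stagger(ops, deps, units_per_cycle):
    """ops: operation ids; deps: op -> set of predecessor ops;
    units_per_cycle: functional units available per cycle."""
    schedule, done, remaining = [], set(), set(ops)
    while remaining:
        ready = [o for o in remaining if deps.get(o, set()) <= done]
        step = ready[:units_per_cycle]   # fill this cycle up to the unit budget
        schedule.append(step)
        done.update(step)
        remaining.difference_update(step)
    return schedule                      # one list of operations per cycle

# Four independent multiplies feeding an add tree, with only two units:
# more cycles than a fully spatial datapath, but far less hardware.
deps = {"a1": {"m1", "m2"}, "a2": {"m3", "m4"}, "a3": {"a1", "a2"}}
print(stagger(["m1", "m2", "m3", "m4", "a1", "a2", "a3"], deps, 2))
```

Fewer units per cycle means a smaller CFA at the cost of extra cycles per ISE invocation, which is the trade-off the 37% area and 8% acceleration figures quantify.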
This thesis then turns to concerns within the ISEGEN AISE algorithm. A methodology
for finding a good static heuristic weighting vector for ISEGEN is proposed and demonstrated.
Up to 100% of merit is shown to be lost or gained through the choice of vector. ISEGEN
early-termination is introduced and shown to improve the runtime of the algorithm by up to
7.26x, and 5.82x on average. An extension to the ISEGEN heuristic to account for pipelining
is proposed and evaluated, increasing acceleration by up to an additional 1.5x. An energy-aware
heuristic is added to ISEGEN, which reduces the energy used by a CFA implementation
of a set of ISEs by an average of 1.6x, and by up to 3.6x. This result directly contradicts the frequently
espoused notion that "bigger is better" in ISE.
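As a rough illustration of why the weighting vector matters, the sketch below scores hypothetical ISE candidates with a weighted merit function; changing the vector changes which candidate is selected. The terms, weights and candidates are invented and do not reproduce ISEGEN's actual merit function.

```python
# Sketch: a static weighting vector steers which ISE candidates look best.
# All figures below are made up for illustration.
def merit(candidate, weights):
    speedup, area, energy = candidate            # per-candidate estimates
    w_speed, w_area, w_energy = weights
    return w_speed * speedup - w_area * area - w_energy * energy

candidates = {
    "mac":      (1.8, 1200, 0.9),   # (estimated speedup, gates, relative energy)
    "wide_alu": (1.4,  400, 0.5),
    "crc_step": (2.3, 3000, 1.7),
}

# A performance-only vector and an energy-aware vector pick different ISEs.
for weights in [(1.0, 0.0, 0.0), (1.0, 0.0005, 0.8)]:
    best = max(candidates, key=lambda name: merit(candidates[name], weights))
    print(weights, "->", best)
```

An energy-aware vector can prefer a smaller, cheaper instruction over the largest accelerator, consistent with the observation above that bigger is not always better in ISE.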
The last stretch of work in this thesis is concerned with source-level transformation: the effect
of changing the representation of the application on the quality of the combined hardware-software
solution. A methodology for combined exploration of source transformation and ISE
is presented, and demonstrated to improve the acceleration of the result by an average of 35%
versus ISE alone. Floating point is demonstrated to perform worse than fixed point, for all
design concerns and applications studied here, regardless of the ISEs employed.
From software to hardware: making dynamic multi-core processors practical
Heterogeneous processors such as Arm's big.LITTLE have become popular because they
offer a choice between performance and energy efficiency. However, the core configurations
are fixed at design time, which allows only a limited amount of adaptation. Dynamic
Multi-core Processors (DMPs) bridge the gap between homogeneous and fully reconfigurable
systems. They offer a new way of improving single-threaded performance
by running a thread on groups of cores (compositions), and because the processor topology
can be changed on the fly, they can better adapt themselves to the task at hand.
However, realising these potential performance improvements is made difficult by
two main challenges: determining the processor configuration that leads
to the best performance, and tackling the hardware bottlenecks that may
limit the performance of a composition.
This thesis first demonstrates that ahead-of-time thread and core partitioning used
to improve the performance of multi-threaded programs can be automated. This is
done by analysing static code features to generate a machine-learning model that determines
a processor configuration that leads to good performance for an application.
The machine learning model is able to predict a configuration that is within 16% of the
performance of the best configuration from the search space.
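A minimal sketch of this kind of ahead-of-time prediction is given below. The static features, configuration labels and the choice of a decision tree are assumptions for illustration; the thesis's actual feature set, training data and model are not reproduced here.

```python
# Sketch: predict a core-composition configuration from static code features.
# Features, labels and training data are invented for illustration.
from sklearn.tree import DecisionTreeClassifier

# Each row: [branches, memory ops, loop depth, instruction-level parallelism]
X_train = [
    [ 5, 40, 3, 1.2],
    [30, 10, 1, 3.5],
    [12, 25, 2, 2.0],
    [40,  5, 1, 4.1],
]
# Label: how many cores to fuse into a composition for this kind of code
y_train = ["2-core", "8-core", "4-core", "8-core"]

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# Query once, ahead of time, for an unseen program's static features
print(model.predict([[20, 15, 2, 2.8]])[0])
```

In practice such a model would be trained on measured performance across the configuration search space and queried once per application before it runs.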
This is followed by studying how dynamically adapting the size of a composition
at runtime can be used to reduce energy consumption whilst maintaining the same
speedup as the fastest static core composition. An analysis of how much energy can
be saved by adapting the size of the composition at runtime is conducted, showing that
dynamic reconfiguration can reduce energy consumption by 42% on average. A model
is then built using linear regression that analyses the features of the basic blocks being
executed to determine whether the current composition should be reconfigured; on average it
reduces energy consumption by 37%.
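A minimal sketch of the runtime decision is shown below, assuming invented basic-block features, a regression target of predicted benefit and an arbitrary reconfiguration threshold; it shows the shape of the approach rather than the thesis's actual model.

```python
# Sketch: linear regression over basic-block features decides whether the
# current composition should be reconfigured. All numbers are invented.
from sklearn.linear_model import LinearRegression

# Features per basic block: [memory ops, branches, arithmetic ops, block length]
X = [[12, 2, 30, 44], [3, 6, 10, 19], [20, 1, 50, 71], [5, 5, 8, 18]]
# Target: measured benefit (speedup) of running the block on a larger composition
y = [1.6, 0.9, 1.9, 0.95]

model = LinearRegression().fit(X, y)

def should_grow_composition(block_features, threshold=1.2):
    """Grow the composition only when the predicted benefit outweighs the
    (assumed) cost of reconfiguring, represented here by a fixed threshold."""
    return model.predict([block_features])[0] > threshold

print(should_grow_composition([15, 2, 35, 52]))
```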
Finally, the hardware mechanisms that drive core composition are explored. A
new fetching mechanism for core composition is proposed, in which cores fetch code in
a round-robin fashion. The use of value prediction is also motivated, as large core
compositions are more susceptible to data dependencies. This new hardware setup
shows massive potential. By exploring a perfect value predictor with perfect branch
prediction and the new fetching scheme, the performance of a large core composition
can be improved by a factor of up to 3x, and 1.88x on average. Furthermore, this thesis
shows that state-of-the-art value prediction with a normal branch predictor still leads
to good performance improvements, with an average speedup of 1.33x and up to 2.7x.
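As a rough illustration of the round-robin idea, the sketch below distributes consecutive blocks of a single thread's instruction stream across the cores of a composition in round-robin order. The block size, core count and the purely software-level view are assumptions; the thesis's hardware mechanism is not reproduced here.

```python
# Sketch: round-robin assignment of fetch blocks to the cores of a composition.
# Block size and core count are illustrative assumptions.
from itertools import cycle

def assign_fetch_blocks(num_instructions, num_cores, block_size=4):
    """Return, per core, the instruction indices it is responsible for fetching."""
    assignment = {core: [] for core in range(num_cores)}
    cores = cycle(range(num_cores))
    for start in range(0, num_instructions, block_size):
        core = next(cores)  # next core in round-robin order
        assignment[core].extend(
            range(start, min(start + block_size, num_instructions)))
    return assignment

# With a 4-core composition, core 0 fetches instructions 0-3, core 1 fetches 4-7, ...
for core, insts in assign_fetch_blocks(32, 4).items():
    print(core, insts)
```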