Instruction scheduling optimizations for energy efficient VLIW processors
Very Long Instruction Word (VLIW) processors are wide-issue statically scheduled
processors. Instruction scheduling for these processors is performed by the compiler
and is therefore a critical factor in their operation. Some VLIWs are clustered, a design
that improves scalability to higher issue widths while improving energy efficiency and
frequency. Their design is based on physically partitioning the shared hardware resources
(e.g., register file). Such designs further increase the challenges of instruction
scheduling since the compiler has the additional tasks of deciding on the placement
of the instructions to the corresponding clusters and orchestrating the data movements
across clusters.
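
To make the extra compiler tasks concrete, the following is a minimal sketch of a greedy list scheduler that chooses a cluster and a cycle for each instruction. It is an illustration only, not any of the schedulers proposed in the thesis. The function name, its parameters and the tiny example graph are all assumptions, and the sketch ignores the issue slots and registers that real inter-cluster copies would occupy.

```python
from collections import defaultdict


def schedule_clustered(instrs, deps, latency, n_clusters, issue_width, icc_latency):
    """Greedy cluster assignment and scheduling (illustrative assumption).

    instrs:      instruction names in topological order
    deps:        maps an instruction to the list of its predecessors
    latency:     maps an instruction to its execution latency in cycles
    icc_latency: extra cycles needed to move a value to another cluster
    Returns {instruction: (cluster, cycle)}.
    """
    placement = {}
    used = defaultdict(int)  # (cluster, cycle) -> issue slots already taken
    for ins in instrs:
        best = None
        for c in range(n_clusters):
            # Earliest cycle at which every operand is available on cluster c.
            ready = 0
            for p in deps.get(ins, []):
                p_cluster, p_cycle = placement[p]
                avail = p_cycle + latency[p]
                if p_cluster != c:
                    avail += icc_latency  # operand must cross clusters
                ready = max(ready, avail)
            # Skip cycles whose issue slots on this cluster are full.
            cycle = ready
            while used[(c, cycle)] >= issue_width:
                cycle += 1
            # Keep the cluster that lets the instruction issue earliest.
            if best is None or cycle < best[1]:
                best = (c, cycle)
        placement[ins] = best
        used[best] += 1
    return placement


# Hypothetical four-instruction graph: c needs a and b, d needs c.
deps = {"c": ["a", "b"], "d": ["c"]}
latency = {"a": 1, "b": 1, "c": 1, "d": 1}
placement = schedule_clustered(["a", "b", "c", "d"], deps, latency,
                               n_clusters=2, issue_width=1, icc_latency=2)
print(placement)
```

In this example a and b are spread over both clusters to issue in the same cycle, so c has to wait for one operand to cross clusters. Raising or lowering `icc_latency` changes how costly that early decision turns out to be.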
In this thesis we propose instruction scheduling optimizations for energy-efficient
VLIW processors. Some of the techniques aim at improving the existing state-of-the-art
scheduling techniques, while others aim at using compiler techniques for closing
the gap between lightweight hardware designs and more complex ones. Each of the
proposed techniques targets individual features of energy-efficient VLIW architectures.
Our first technique, called Aligned Scheduling, makes use of a novel scheduling
heuristic for hiding memory latencies in lightweight VLIW processors without hardware
load-use interlocks (Stall-On-Miss, SOM). With Aligned Scheduling, a software-only
technique, a SOM processor coupled with non-blocking caches can better cope with
cache latencies and perform closer to heavyweight designs. Performance
is improved by up to 20% across a range of benchmarks from the Mediabench II and
SPEC CINT2000 benchmark suites.
The rest of the techniques target a class of VLIW processors known as clustered
VLIWs, which are more scalable and more energy efficient, and operate at higher frequencies,
than their monolithic counterparts.
The second scheme (LUCAS) is an improved scheduler for clustered VLIW processors
that addresses the high sensitivity of existing state-of-the-art schedulers
to the inter-cluster communication latency. The proposed unified clustering
and scheduling technique is a hybrid scheme that switches, instruction by instruction,
between the two state-of-the-art clustering heuristics, leading to better
scheduling than either of them. It generates better performing code compared to the
state-of-the-art for a wide range of inter-cluster latency values on the Mediabench II
benchmarks.
The third technique (called CAeSaR) is a scheduler for clustered VLIW architectures
that minimizes inter-cluster communication by local caching and reuse of already
received data. Unlike dynamically scheduled processors, where this can be supported
by the register renaming hardware, in VLIWs it has to be done by the code generator.
The proposed instruction scheduler combines cluster assignment, instruction scheduling
and communication minimization in a single algorithm, solving the phase-ordering
issues among all three parts. It improves execution time
by up to 20.3%, and by 13.8% on average, across a range of benchmarks
from the Mediabench II and SPEC CINT2000 benchmark suites.
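
The benefit of reusing already received data can be illustrated with a small counting sketch. It is not the CAeSaR algorithm itself: the function name, the input format and the example values are assumptions, and the sketch leaves out the register pressure that keeping a received copy alive would add.

```python
def count_copies(producer_cluster, uses, reuse):
    """Count inter-cluster copies for a fixed cluster assignment.

    producer_cluster: maps a value name to the cluster that produces it
    uses:             list of (value, consumer_cluster) in program order
    reuse:            if True, a value already copied to a cluster is
                      reused there instead of being copied again
    """
    copies = 0
    received = set()  # (value, cluster) pairs that were already copied
    for value, cluster in uses:
        if cluster == producer_cluster[value]:
            continue  # operand is local, no copy needed
        if reuse and (value, cluster) in received:
            continue  # the earlier copy is still available locally
        copies += 1
        received.add((value, cluster))
    return copies


# Hypothetical example: x is produced on cluster 0 and read three times on cluster 1.
producer_cluster = {"x": 0}
uses = [("x", 1), ("x", 1), ("x", 0), ("x", 1)]
print(count_copies(producer_cluster, uses, reuse=False))  # 3
print(count_copies(producer_cluster, uses, reuse=True))   # 1
```

Because the number of copies depends on where each consumer is placed, and the best placement depends on which copies already exist, the two decisions are hard to make well in separate passes.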
The last technique applies to heterogeneous clustered VLIWs that support dynamic
voltage and frequency scaling (DVFS) independently per cluster. In these processors
there are no hardware interlocks between clusters to honor the data dependencies.
Instead, the scheduler has to be aware of the DVFS decisions to guarantee correct
execution. Effectively controlling DVFS, to selectively decrease the frequency of clusters
with slack in their schedule, can lead to significant energy savings. The proposed
technique (called UCIFF) solves the phase ordering problem between frequency selection
and scheduling that is present in existing algorithms. The results show that UCIFF
produces better code than the state-of-the-art and very close to the optimal across the
Mediabench II benchmarks.
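
The idea of slowing down clusters with slack can be sketched as a separate pass that runs after scheduling. This is the decoupled style of frequency selection, not UCIFF, and every name and number in it is an assumption. It treats each cluster independently and ignores cross-cluster data dependencies, which is why a real scheduler for such a processor has to be aware of the frequency decisions.

```python
def pick_frequencies(busy_cycles, deadline_ns, freqs_ghz):
    """Pick, per cluster, the slowest frequency that still meets the deadline.

    busy_cycles: maps a cluster id to the cycles it needs for its schedule
    deadline_ns: time budget in nanoseconds
    freqs_ghz:   available frequency levels in GHz (cycles / GHz = ns)
    """
    choice = {}
    for cluster, cycles in busy_cycles.items():
        feasible = [f for f in sorted(freqs_ghz) if cycles / f <= deadline_ns]
        # Fall back to the fastest level if no level meets the deadline.
        choice[cluster] = feasible[0] if feasible else max(freqs_ghz)
    return choice


# Hypothetical example: cluster 1 has slack and can run at half speed.
print(pick_frequencies({0: 100, 1: 40}, deadline_ns=100, freqs_ghz=[0.5, 0.75, 1.0]))
```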
Overall, the proposed instruction scheduling techniques either lead to better efficiency
on existing designs or allow simpler lightweight designs to be competitive
against ones with more complex hardware.