Heterogeneous processors such as Arm’s big.LITTLE have become popular as they
offer a choice between performance and energy efficiency. However, their core configurations
are fixed at design time, which limits how far they can adapt to a workload. Dynamic
Multi-core Processors (DMPs) bridge the gap between homogeneous and fully reconfigurable
systems. They offer a new way of improving single-threaded performance
by running a thread on a group of cores (a composition). Because they can change
the processor topology on the fly, they can better adapt to the task at
hand. However, realising these potential performance improvements is difficult because of
two main challenges: determining a processor configuration that leads
to optimal performance, and tackling the hardware bottlenecks that may
impede the performance of a composition.
This thesis first demonstrates that ahead-of-time thread and core partitioning used
to improve the performance of multi-threaded programs can be automated. This is
done by analysing static code features to generate a machine-learning model that determines
a processor configuration that leads to good performance for an application.
The machine-learning model predicts a configuration whose performance is within 16% of
that of the best configuration in the search space.
This is followed by studying how dynamically adapting the size of a composition
at runtime can be used to reduce energy consumption whilst maintaining the same
speedup as the fastest static core composition. An analysis of how much energy can
be saved by adapting the size of the composition at runtime is conducted, showing that
dynamic reconfiguration can reduce energy consumption by 42% on average. A linear
regression model is then built that analyses the features of the basic blocks being
executed to determine whether the current composition should be reconfigured; on average it
reduces energy consumption by 37%.
Finally, the hardware mechanisms that drive core composition are explored. A
new fetching mechanism for core composition is proposed, where cores fetch code in
a round-robin fashion. The use of value prediction is also motivated, as large core
compositions are more susceptible to data dependencies. This new hardware setup
shows considerable potential. With a perfect value predictor, perfect branch
prediction and the new fetching scheme, the performance of a large core composition
can be improved by up to 3x, and by 1.88x on average. Furthermore, this thesis
shows that state-of-the-art value prediction with a conventional branch predictor still leads
to good performance improvements, averaging 1.33x and reaching up to a 2.7x speedup