Resource and thermal management in 3D-stacked multi-/many-core systems
Continuous semiconductor technology scaling and the rapid increase in computational needs have stimulated the emergence of multi-/many-core processors. While up to hundreds of cores can be placed on a single chip, the performance capacity of the cores cannot be fully exploited in traditional (2D) chips due to high interconnect and memory latencies, high power consumption, and low manufacturing yield. 3D stacking is an emerging technology that aims to overcome these limitations of 2D designs by stacking processor dies on top of each other and using through-silicon vias (TSVs) for on-chip communication, thereby providing a large amount of on-chip resources and shortening communication latency. These benefits, however, are limited by the challenges of high power density and elevated temperature.
3D stacking also enables integrating heterogeneous technologies into a single chip. One example of heterogeneous integration is building many-core systems with silicon-photonic network-on-chip (PNoC), which reduces on-chip communication latency significantly and provides higher bandwidth compared to electrical links. However, silicon-photonic links are vulnerable to on-chip thermal and process variations. These variations can be countered by actively tuning the temperatures of optical devices through micro-heaters, but at the cost of substantial power overhead.
This thesis claims that unearthing the energy efficiency potential of 3D-stacked systems requires intelligent and application-aware resource management. Specifically, the thesis improves energy efficiency of 3D-stacked systems via three major components of computing systems: cache, memory, and on-chip communication. We analyze characteristics of workloads in computation, memory usage, and communication, and present techniques that leverage these characteristics for energy-efficient computing.
This thesis introduces 3D cache resource pooling, a cache design that allows for flexible heterogeneity in cache configuration across a 3D-stacked system and improves cache utilization and system energy efficiency. We also demonstrate the impact of resource pooling on a real prototype 3D system with scratchpad memory.
At the main memory level, we claim that combining heterogeneous memory modules with management at the granularity of individual memory objects significantly improves energy efficiency. This thesis proposes a memory management scheme at this finer, memory-object granularity, together with a page allocation policy that leverages the heterogeneity of the available memory modules and caters to the diverse memory requirements of workloads.
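As a minimal sketch of the idea (not the thesis's actual implementation), the following toy policy places memory objects on a fast or slow module by access intensity per kilobyte; the `MemoryObject` class, object names, and capacity figures are all hypothetical:

```python
# Hypothetical sketch: greedy placement of memory objects across
# heterogeneous memory modules by profiled access intensity.
from dataclasses import dataclass

@dataclass
class MemoryObject:
    name: str
    size_kb: int
    accesses_per_ms: int  # profiled access frequency

def place_objects(objects, fast_capacity_kb):
    """Hottest objects (accesses per KB) go to the fast module until its
    capacity is exhausted; the rest fall back to the slow module."""
    placement = {}
    remaining = fast_capacity_kb
    for obj in sorted(objects, key=lambda o: o.accesses_per_ms / o.size_kb,
                      reverse=True):
        if obj.size_kb <= remaining:
            placement[obj.name] = "fast"
            remaining -= obj.size_kb
        else:
            placement[obj.name] = "slow"
    return placement

objs = [MemoryObject("lookup_table", 256, 9000),
        MemoryObject("log_buffer", 512, 40),
        MemoryObject("frame_cache", 384, 6000)]
print(place_objects(objs, fast_capacity_kb=700))
```

A real object-level manager would also weigh migration cost and bandwidth demand, but the greedy ranking above captures the core intuition of matching hot data to fast memory.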
On the on-chip communication side, we introduce an approach to limit the power overhead of PNoCs in (3D) many-core systems through cross-layer thermal management. Our proposed thermally-aware workload allocation policies, coupled with an adaptive thermal tuning policy, minimize the thermal tuning power required for the PNoC and thereby facilitate its broader integration. The thesis also introduces placement and floorplanning techniques for optical devices that reduce optical loss and, thus, laser source power consumption.
2018-03-09
From software to hardware: making dynamic multi-core processors practical
Heterogeneous processors such as Arm's big.LITTLE have become popular as they offer a choice between performance and energy efficiency. However, their core configurations are fixed at design time, which limits how much they can adapt. Dynamic Multi-core Processors (DMPs) bridge the gap between homogeneous and fully reconfigurable systems. They offer a new way of improving single-threaded performance by running a thread on groups of cores (compositions), and, with the ability to change the processor topology on the fly, they can better adapt to the task at hand. However, realizing these potential performance improvements is difficult for two main reasons: determining the processor configuration that leads to optimal performance, and tackling the hardware bottlenecks that may impede the performance of a composition.
This thesis first demonstrates that the ahead-of-time thread and core partitioning used to improve the performance of multi-threaded programs can be automated. This is done by analysing static code features to train a machine-learning model that determines a processor configuration yielding good performance for a given application. The model predicts a configuration whose performance is within 16% of the best configuration in the search space.
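To make the idea concrete, here is a hedged sketch, not the thesis's actual model: static code features are mapped to a processor configuration by nearest-neighbour lookup over profiled training points. The feature names, training data, and configurations are all invented for illustration:

```python
# Hypothetical sketch: predict a (threads, cores-per-composition)
# configuration from static code features via nearest-neighbour lookup.
import math

# (branch_density, mem_op_ratio, loop_parallelism) -> (threads, cores_per_composition)
TRAINING = [
    ((0.05, 0.10, 0.9), (4, 4)),   # compute-bound, parallel: wide compositions
    ((0.20, 0.40, 0.2), (8, 1)),   # memory-bound, serial-ish: many small cores
    ((0.12, 0.25, 0.6), (4, 2)),
]

def predict_config(features):
    """Return the configuration of the closest training point."""
    _, config = min(TRAINING, key=lambda t: math.dist(t[0], features))
    return config

print(predict_config((0.06, 0.12, 0.85)))  # closest to the first training point
```

The thesis's actual model is trained on a much richer feature set; this sketch only illustrates the shape of the mapping from code features to configurations.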
This is followed by a study of how dynamically adapting the size of a composition at runtime can reduce energy consumption whilst maintaining the same speedup as the fastest static core composition. An analysis of how much energy can be saved by adapting the composition size at runtime shows that dynamic reconfiguration can reduce energy consumption by 42% on average. A linear regression model is then built that analyses the features of the basic blocks being executed to determine whether the current composition should be reconfigured; on average, it reduces energy consumption by 37%.
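As an illustrative sketch of such a decision rule (the weights, feature names, and thresholds below are invented, not taken from the thesis), a linear model over basic-block features can score whether to grow, shrink, or keep the current composition:

```python
# Hypothetical sketch: linear model over basic-block features deciding
# whether to reconfigure the core composition.
WEIGHTS = {"ilp_estimate": 0.8, "branch_rate": -0.6, "mem_stall_ratio": -0.5}
BIAS = 0.1
GROW, SHRINK = 0.5, -0.5  # decision thresholds on the model score

def reconfigure_decision(features):
    score = BIAS + sum(WEIGHTS[k] * features[k] for k in WEIGHTS)
    if score > GROW:
        return "grow"      # more ILP available: fuse more cores
    if score < SHRINK:
        return "shrink"    # composition wastes energy: release cores
    return "keep"

print(reconfigure_decision({"ilp_estimate": 1.2, "branch_rate": 0.2,
                            "mem_stall_ratio": 0.1}))
```

The appeal of a linear model here is that it is cheap enough to evaluate at runtime on every candidate reconfiguration point.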
Finally, the hardware mechanisms that drive core composition are explored. A new fetching mechanism for core compositions is proposed, in which cores fetch code in a round-robin fashion. The use of value prediction is also motivated, as large core compositions are more susceptible to data dependencies. This new hardware setup shows significant potential: with a perfect value predictor, perfect branch prediction, and the new fetching scheme, the performance of a large core composition improves by a factor of up to 3x, and by 1.88x on average. Furthermore, this thesis shows that state-of-the-art value prediction with a conventional branch predictor still delivers good performance improvements, with an average speedup of 1.33x and up to 2.7x.
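The round-robin distribution pattern behind the proposed fetch mechanism can be sketched as follows; this toy model only illustrates how fetch blocks interleave across composed cores, not the proposed microarchitecture itself:

```python
# Toy sketch: round-robin assignment of instruction fetch blocks to the
# cores of a composition. Block names and core count are illustrative.
def round_robin_fetch(blocks, n_cores):
    """Assign fetch blocks to cores in round-robin order."""
    assignment = {c: [] for c in range(n_cores)}
    for i, block in enumerate(blocks):
        assignment[i % n_cores].append(block)
    return assignment

print(round_robin_fetch(["B0", "B1", "B2", "B3", "B4"], n_cores=2))
# core 0 fetches B0, B2, B4; core 1 fetches B1, B3
```

Interleaving fetch this way keeps every composed core's front end busy, which is why the scheme pairs naturally with value prediction to break the data dependencies that would otherwise stall a wide composition.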