There is a growing consensus that heterogeneous multicores are the future of CPUs. These processors would be composed of cores that are specifically adapted or tuned to particular types of applications and use cases, thereby increasing performance. The move from homogeneous to heterogeneous multicores causes the design space to explode, however. An architect of a heterogeneous processor must make design decisions per processor core rather than once for the entire processor as before. Currently, there are no methods for handling this design complexity to yield a processor that performs well for real workloads. As a step forward, we propose weak heterogeneity. A weakly heterogeneous processor is one whose cores are different, but not significantly so. The cores share an ISA and major microarchitectural features, differing only in minor details. Limiting the design space in this way allows us to explore the heterogeneous space without becoming overwhelmed by its size. We show preliminary results suggesting that a design space so constrained still has interesting trade-offs among performance, power consumption, and area.
INTRODUCTION
The current trend of increasing processor performance by increasing the number of cores on a single chip is not sustainable. Even if software were eventually able to make use of a very large number of cores, thermal considerations would limit the number of cores actually utilized. At an 8nm node, for example, it is expected that the density of transistors will require half of the processor to be powered down at any given time to stay within thermal limits [2]. There is little benefit to adding identical cores to a processor if they cannot be used. Heterogeneous processors work around this limitation by containing many different cores. Even though only a fraction of all cores can be powered at a time, heterogeneous CPUs would power only those cores that are particularly suited for the job at hand.
ADAPT '13, January 22 2013, Berlin, Germany. Copyright 2013 ACM 978-1-4503-2022-1/13/01.
The possible design space of heterogeneous processors is huge. There are microarchitectural decisions (e.g. pipeline depth, issue width), architectural decisions (e.g. RISC vs CISC, out-of-order vs in-order), and even several possible instruction sets, all to be chosen on a per-core basis. The problem of balancing these variables appears intractable. We propose to limit this design space to its simplest form by constraining it to one ISA and architecture, and only varying microarchitectural parameters. Our preliminary results suggest that such a weakly heterogeneous design space contains sufficient variation to be interesting in itself. We further hope that methods that prove useful for exploring a weakly heterogeneous space can be extended to handle more heterogeneity.
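To give a rough sense of how quickly the space grows, the following sketch counts processor designs under two simplifying assumptions that are ours for illustration only: every core is drawn from the same pool of candidate configurations, and two processors count as the same design if they contain the same multiset of cores. The function name and its parameters are hypothetical.

```python
from math import comb


def count_processor_designs(core_configs, cores, heterogeneous):
    """Count distinct processors built from `cores` cores.

    Illustrative assumptions (not from our experiments): all cores are drawn
    from one pool of `core_configs` candidate configurations, and the order
    of cores on the chip does not matter.
    """
    if core_configs < 1 or cores < 1:
        raise ValueError("core_configs and cores must be positive")
    if not heterogeneous:
        # One decision for the whole chip: every core shares a configuration.
        return core_configs
    # Number of multisets of size `cores` drawn from `core_configs` options.
    return comb(core_configs + cores - 1, cores)


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, count_processor_designs(135, n, heterogeneous=True))
```

Even with a single varying parameter that takes 135 values, a four-core chip admits 14,463,090 distinct combinations under these assumptions, compared with 135 homogeneous designs.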
Related work on heterogeneous computing is in the next section. Our experimental methodology is described in section 3, and a motivating example with results is in section 4. We conclude with our future research direction in section 5.
RELATED WORK
Current research on heterogeneous computing can broadly be divided into two categories. In the first is research on combining different existing cores into new processors (e.g. [3] [4]). In the second is research on middleware and hardware support for mapping jobs to heterogeneous platforms (e.g. [6] ). The first category reuses cores designed for homogeneous processors; the second is targeted at a higher level of abstraction than core design. Neither category of work provides guidance on how cores for heterogeneous processors should be designed. The design question is crucial, however, as the design goals for cores on homogeneous and heterogeneous processors are contradictory. Traditionally, cores have been balanced for use in the general case, whereas cores on heterogeneous processors must instead target subsets of the general case so that the overall processor is balanced. Our interest is precisely in methods for finding unbalanced cores that combine into balanced processors.
METHODOLOGY
A balanced, heterogeneous processor is one that minimizes the energy-delay product (ED) of a set of benchmarks when each benchmark is run on the best possible core. EDD (energy-delay-delay product) is sometimes used, but this metric overemphasizes execution time and biases the design space exploration toward large, power-hungry cores. We intend to simulate benchmarks on cores within a weakly heterogeneous design space to find configurations of good heterogeneous processors. Our simulations also consider area to avoid solutions that contain impractically large numbers of cores. We use the gem5 simulator [1] to measure performance and the McPAT power model [5] to estimate power and area. Simulations are based on the arm_detailed model, a small, out-of-order processor model distributed with gem5. Eventually, we wish to supplant simulation with machine learning models that capture relationships among a core's configuration, performance, area, and energy consumption for given workloads.
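The selection criterion above can be sketched as a small exhaustive search. This is a minimal illustration, not our tool flow: it assumes that per-benchmark ED values and per-core areas have already been obtained (for example from simulation), that ED is aggregated over benchmarks by a plain sum, and that area enters as a hard budget. The function name, the aggregation rule, and the data layout are all assumptions made for the example.

```python
from itertools import combinations


def best_processor(ed, area, num_cores, area_budget):
    """Pick `num_cores` distinct core configurations by brute force.

    ed[c][b] : energy-delay product of benchmark b on configuration c
    area[c]  : area of configuration c
    Returns (chosen configurations, summed ED), or None if no
    combination fits within `area_budget`.
    """
    best = None
    for chosen in combinations(sorted(ed), num_cores):
        if sum(area[c] for c in chosen) > area_budget:
            continue  # too large to be practical
        benchmarks = ed[chosen[0]].keys()
        # Each benchmark runs on whichever chosen core gives it the lowest ED.
        total = sum(min(ed[c][b] for c in chosen) for b in benchmarks)
        if best is None or total < best[1]:
            best = (chosen, total)
    return best
```

Exhaustive search is only feasible for small candidate pools; it is shown here because it states the objective directly, not because it scales.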
MOTIVATING EXAMPLE
The motivation for weakly heterogeneous design space exploration can be demonstrated with a simple example: The size of the reorder buffer (ROB) is one parameter that controls how much instruction-level parallelism (ILP) can be extracted from a workload. Increasing its size increases its footprint and the energy cost of accesses, but can potentially decrease overall execution time. To explore these trade-offs, we run the CJPEG (JPEG encoding) benchmark from the EEMBC 2 Consumer suite on the arm_detailed model while varying the size of the ROB from 16 to 150 in steps of one (the default size is 40). CJPEG is used because it is short-running, allowing all sizes in the range to be simulated. For comparison, we also run coarse-grained simulations with HUFFDE (Huffman decoding, also from EEMBC 2) in steps of 10. It is important to note that the ROB is tiny relative to the core.
Execution time decreases with ROB size, leveling off after around 45, where CJPEG's ILP limit is reached. Power mirrors time: as ROB size increases, more instructions can be in flight at once, so power rises as execution time falls. While positive for speed, this could be problematic for thermally limited processors. The energy cost of executing CJPEG varies irregularly with ROB size. There are several local energy minima that coincide with ROB restructuring decisions. The global minimum is at a size of 62, but other minima come close. Total energy cost is important when considering battery-limited devices, for example. The energy-delay product is much smoother than energy, as the multiplication with time reduces jumps in the curve. While ROB size makes a significant difference to total execution time, the ROB is such a small structure that variations in its own energy consumption almost disappear at the core level.
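The quantities discussed above are related by simple identities: energy is average power multiplied by execution time, and ED is energy multiplied by execution time again. The sketch below derives both from a power and time measurement and locates local minima in a sweep. The helper names are hypothetical, and the strict-inequality definition of a local minimum is an assumption made for the example.

```python
def derive_metrics(power_w, time_s):
    """Return (energy in joules, energy-delay product in joule-seconds)."""
    energy_j = power_w * time_s
    return energy_j, energy_j * time_s


def local_minima(values):
    """Indices of interior points strictly lower than both neighbours."""
    return [
        i
        for i in range(1, len(values) - 1)
        if values[i] < values[i - 1] and values[i] < values[i + 1]
    ]
```

Applied to a per-ROB-size list of energy values, `local_minima` returns the positions of candidate energy-optimal sizes; applying it to the corresponding ED values gives a way to compare how many minima survive the multiplication by time.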
The most obvious trade-off made possible by changing ROB size is power vs time and energy, which is useful in thermally constrained situations. Ultimately, the goal is to find other parameters that in combination form a design space that is much less predictable than that in figure 1.
RESEARCH DIRECTION
We have demonstrated that varying one microarchitectural parameter has a non-linear effect on a core's area, execution time, power, energy consumption, and ED. We are actively working on identifying other parameters that have similarly broad effects, with a particular interest in those parameters that behave non-monotonically, i.e. like "Energy" in figure 1. As the HUFFDE data points demonstrate, a parameter does not have an identical influence on all benchmarks. A set of parameters should therefore produce a design space with hot spots where at least some workloads perform better than average. Such hot spots represent individually unbalanced cores that together can combine into a balanced, weakly heterogeneous processor. To the best of our knowledge, this is the first proposal for designing a heterogeneous processor from the bottom up to target real use cases.
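One possible way to make the notion of a hot spot concrete is sketched below. It assumes that "better than average" means an ED below the benchmark's mean ED over all candidate configurations, optionally by a relative margin. Both that interpretation and the `margin` parameter are assumptions for illustration; the final definition remains open.

```python
from statistics import mean


def hot_spots(ed, margin=0.0):
    """Map each configuration to the benchmarks it serves better than average.

    ed[c][b] : energy-delay product of benchmark b on configuration c
    margin   : required relative improvement over the per-benchmark mean
               (0.1 means at least 10% below the mean)
    """
    configs = list(ed)
    benchmarks = list(ed[configs[0]])
    # Per-benchmark average over every configuration in the design space.
    avg = {b: mean(ed[c][b] for c in configs) for b in benchmarks}
    result = {}
    for c in configs:
        winners = [b for b in benchmarks if ed[c][b] < (1.0 - margin) * avg[b]]
        if winners:
            result[c] = winners
    return result
```

Configurations that appear in the result for different benchmarks are candidates for the individually unbalanced cores described above.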
There could even be a case for creating products with weakly heterogeneous processors in the near future. Given that all cores are similar, average performance should be comparable to that of homogeneous multicores; the stakes are lowered because even worst-case job scheduling is unlikely to be catastrophic. In the best case, however, a weakly heterogeneous processor in the wild will drive the development of schedulers that can take advantage of it.
ACKNOWLEDGMENTS
This work has made use of the resources provided by the Edinburgh Compute and Data Facility (ECDF) (http://www.ecdf.ed.ac.uk/). The ECDF is partially supported by the eDIKT initiative (http://www.edikt.org.uk).
