Comparing Composite vs. Wave-Cores in a Novel Dark-Silicon Methodology by Crispin-Bailey, Christopher & Arnone, Antonio
Comparing Composite vs. Wave-Cores in a Novel 
Dark-Silicon Methodology 
Antonio Arnone 
Department of Computer Science 
University of York 
York, United Kingdom 
aa835@york.ac.uk 
Chris Bailey 
Department of Computer Science 
University of York 
York, United Kingdom 
chrisb@cs.york.ac.uk
 
 
Abstract— As transistor scaling continues to push us into new 
design spaces, where power density is increasingly a major 
performance constraint, there have been moves to explore 
solutions which exploit so-called Dark Silicon, the UCSD 
Greendroid project being a notable exemplar. In this paper, we 
explore one novel dark silicon methodology, based on a 
heterogeneous multi-accelerator system model and an implicit 
execution model for the host processor. We also highlight a
back-end translation methodology from raw machine code into
data-flow style hardware cores, and introduce two distinct
implementation styles. We then demonstrate comparative power 
benefits as compared to a relevant CPU model, assuming a 65nm 
benchmark technology node for both cases. 
Keywords— Accelerator Cores, data-flow, stack-machine, 
power-density, code translation 
I.  INTRODUCTION  
It is widely suggested, with the apparent demise of so-
called Dennardian scaling, that the percentage of a silicon chip 
useable at full frequency is dropping alarmingly with each 
process generation, primarily as a result of power-density 
limitations [1,2]. Transistor size and frequency continue to 
scale at a healthy rate, at least for the moment, but effective 
power consumption per transistor lags behind increasingly. 
Consequently, transistors simply cannot switch at full rate in a 
sustained way across a large silicon area, as the result would be 
thermal overload. Forced underclocking or intermittent use 
represent two well known symptoms, those of ‘dim silicon’ and 
‘dark silicon’ respectively. 
Dark Silicon therefore represents silicon area that can be 
used best only if power efficiency and operability lie within a 
‘sweet spot’ of intermittent use and high pay-off [3].  For 
instance, at the 22nm process-node, it is predicted that just 25% 
of chip area can be active at any moment, and at 11nm just 
10% [1,3]. This is somewhat analogous to the anecdote that 
humans only use 10% of their brain at any one time, perhaps 
for similar reasons. Approaches to tackle this problem include 
novel power management techniques and use of special 
purpose cores to spread ‘useful’ power density across the 
otherwise poorly utilised chip area [4,5,6]. This paper 
introduces some novel aspects to these ideas, including the 
choice of a non-traditional host-architecture, and using 
alternative core structures and implementations.  
The purpose of this approach is to explore automated ways 
in which silicon area can be best utilized within overall power-
density limits, by using a number of highly efficient cores 
intermittently. Ultimately a subset of these cores would be 
chosen according to their relative usage and resulting benefit. 
 Two novel attributes of our approach are highlighted in 
this paper. First the host-architecture assumes a stack-machine 
model, rather than a register-file approach. With recent 
advancement of compiler and code generation for such 
architectures [7,8], this seems to be an area that deserves more 
attention, since the core CPU in multi-accelerator architectures 
need not be as complex as those of a purely CPU based system. 
We also introduce two distinct core implementation styles 
(‘Composite’ and ‘Wave’), with different attributes. Evaluation 
of these cores in terms of power, area, and timing is performed,  
assuming a 65nm process. However, the methodology is 
entirely independent of the chosen target process.  
Whilst it is common to refer to these cores as accelerators, 
their goal is often to reduce power density, rather than to boost 
execution speed. Likewise, since typical dataflow sequences 
contain plenty of fine grain instruction level parallelism that 
may be less exploitable on a host processor, parallelism may be 
obtained – but again not as the primary goal. In this paper we 
focus upon the general power and area trade-off of such cores 
rather than parallelism and speedup capabilities per se. 
II. METHODOLOGY 
A. Translation from assembler to custom logic 
 A translation tool was developed to map machine code 
fragments (typically basic-blocks) into hardware structures. 
This involved translating machine code generated by a C++ 
compiler [8] into VHDL dataflow sequences, and subsequently 
synthesising to FPGA and ASIC targets. The translation tool is 
able to generate two core styles, which we refer to as 
‘Composite’ and ‘Wave’ cores (see subsection B). The 
translation tool can target both Xilinx™ FPGA tool-sets and 
Cadence™ VLSI toolsets. It also auto-generates Verilog test 
benches for each core, and outputs scripts for performing 
synthesis and power analysis, which is important when hundreds 
or thousands of cores are being translated and investigated. 
This work was funded by UK EPSRC Grant No. GR/R67668/01 
An example given in Figure-1 shows such a high-level 
data-flow structure as it relates and decomposes into a 
low-level machine code sequence. This highlights the fact that the 
core receives inputs, generates outputs, and communicates with 
main memory, whilst groups of instructions are separated by 
successive memory accesses and treated as atomic states in our 
design. Each group (coloured blue, green, orange, red) 
represents one state in a simple state machine. Interestingly, 
one can observe that the red instruction group could have been 
merged with the blue group, reducing the number of clocked 
states from 4 to 3, and increasing parallelism from 1.75 
instructions per clock to 2.33. This is an optimisation that 
deserves further investigation, though not covered in this paper. 
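The state count and the two parallelism figures quoted above can be reproduced with a short sketch. It assumes that only non-memory instructions are counted towards instructions per clock and that each memory access (@loc, !loc) closes the current group; the paper does not state the counting rule explicitly, so this is one reading that happens to match the quoted numbers. The function names are illustrative and are not part of the authors' tool.

```python
# Instruction sequence as printed in Figure 1.
SEQUENCE = ["Inc", "Add", "Lit 5", "@loc 7", "Add", "@loc 18",
            "Add", "!loc 4", "Lit 3", "Add"]


def is_memory_access(instr):
    """Loads (@...) and stores (!...) are treated as group boundaries."""
    return instr.startswith("@") or instr.startswith("!")


def split_into_groups(sequence):
    """Return the runs of non-memory instructions between memory accesses."""
    groups, current = [], []
    for instr in sequence:
        if is_memory_access(instr):
            if current:
                groups.append(current)
            current = []
        else:
            current.append(instr)
    if current:
        groups.append(current)
    return groups


groups = split_into_groups(SEQUENCE)
compute_instrs = sum(len(g) for g in groups)

# Assumption: parallelism = non-memory instructions / clocked states.
ipc_four_states = compute_instrs / len(groups)
ipc_three_states = compute_instrs / (len(groups) - 1)  # after merging two groups

print(groups)
print(round(ipc_four_states, 2), round(ipc_three_states, 2))
```

Under these assumptions the sequence yields four groups containing seven non-memory instructions, giving 7/4 = 1.75 and 7/3 ≈ 2.33 instructions per clock.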
Formal translation of the core is relatively straightforward: 
each instruction in a group has an equivalent VHDL statement, 
for example, if we assume inputs are named {i1,i2} and 
intermediate values {p1,p2, p3 etc.}, then the first (blue) group 
of instructions might translate as:-  
  p1 <= i1+1;   -- INC  
  p2 <= p1 + i2; -- ADD 
  p3 <= 5;  -- LIT 5  (literal is a constant) 
When stack reordering is encountered (SWAP for example), no 
hardware is needed – this is simply a wiring transposition 
between logic components, meaning that any implied stack 
behaviour actually disappears at the hardware level when 
synthesis is applied and HDL sequences are optimised.  
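The translation rule and the treatment of SWAP can be illustrated by symbolically executing a group on a stack of signal names. This is a minimal sketch, not the authors' translator: the function `translate_group`, the handling of only four opcodes, and the assumption that `i1` is the top-of-stack input (inferred from the example above, where INC applies to `i1`) are all illustrative choices.

```python
def translate_group(instructions, inputs):
    """Symbolically execute one instruction group, emitting VHDL-style lines.

    `inputs` lists the signal names present on the stack at entry,
    top of stack first. Returns (emitted lines, final symbolic stack).
    """
    stack = list(reversed(inputs))  # end of the list is the top of stack
    lines = []
    counter = 0

    def new_signal():
        nonlocal counter
        counter += 1
        return f"p{counter}"

    for instr in instructions:
        op, *args = instr.split()
        op = op.upper()
        if op == "INC":
            a = stack.pop()
            p = new_signal()
            lines.append(f"{p} <= {a} + 1;  -- INC")
            stack.append(p)
        elif op == "ADD":
            a = stack.pop()
            b = stack.pop()
            p = new_signal()
            lines.append(f"{p} <= {a} + {b};  -- ADD")
            stack.append(p)
        elif op == "LIT":
            p = new_signal()
            lines.append(f"{p} <= {args[0]};  -- LIT {args[0]}")
            stack.append(p)
        elif op == "SWAP":
            # Pure renaming: the two top names trade places, nothing is emitted.
            stack[-1], stack[-2] = stack[-2], stack[-1]
        else:
            raise ValueError(f"opcode not handled in this sketch: {instr}")
    return lines, stack


lines, stack = translate_group(["Inc", "Add", "Lit 5"], ["i1", "i2"])
lines_swapped, stack_swapped = translate_group(
    ["Inc", "Add", "Lit 5", "Swap"], ["i1", "i2"])
print("\n".join(lines))
```

Appending a SWAP leaves the emitted statements unchanged and only reorders the names on the symbolic stack, which is the sense in which stack reordering becomes a wiring transposition.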
A key problem solved with our approach was how to take 
generated stack-cpu code blocks and identify the inputs and 
outputs to each block. Here, reiterative backtracking is used, 
whereby the stack content of a block is tracked instruction by 
instruction, and the number of initial inputs (values present on 
the host CPU stack before execution) and outputs (values 
remaining after execution), can then be determined.  An 
example of backtracking is given in Table I and Table II, 
where the effect of stack operators is traced repeatedly until a 
complete code sequence is parsed. With each pass, initial stack 
depth is increased, until the depth is valid at the end point.  
Table I.  Stack effects of instructions relevant to the example 
Instruction   Pops (consumed)   Pushes (produced) 
Lit           0                 1 
@loc          0                 1 
Add           2                 1 
Sub           2                 1 
!loc          1                 0 
Nop           0                 0 
Swap          2                 2 
 
Table II.  Backtracking algorithm example 
Command   Pops   Pushes   Pass 1      Pass 2      Pass 3 
                          (depth=0)   (depth=1)   (depth=2) 
Nop       0      0        0           1           2 
Add       2      1        -2          -1          1 
Lit 5     0      1        restart     restart     2 
@Loc 7    0      1                                3 
Nop       0      0                                3 
@Loc      0      1                                4 
Add       2      1                                3 
Nop       0      0                                3 
Nop       0      0                                3 
@         1      1                                3 
Add       2      1                                2 
Swap      2      2                                2 (done) 
Once the code block has been analysed using the 
backtracking algorithm, the block can be generated such that a 
known number of input operands are clocked into the state 
machine from the host CPU before beginning a core operation. 
On completion, the core generates a known number of results 
to be returned to the host CPU. Data transfer is assumed to be 
via a common bus connected to the array of cores. 
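The backtracking pass described above can be sketched as a restart loop. This is an illustrative reconstruction, not the authors' code: the two numeric columns of Table II are read here as (values consumed, values produced), which is the reading under which the trace in that table is reproduced, and the names `STACK_EFFECT` and `stack_interface` are hypothetical.

```python
# (values consumed, values produced) per opcode, taken from Table II.
STACK_EFFECT = {
    "nop": (0, 0), "lit": (0, 1), "@loc": (0, 1), "@": (1, 1),
    "add": (2, 1), "sub": (2, 1), "!loc": (1, 0), "swap": (2, 2),
}


def stack_interface(block):
    """Return (inputs, outputs) for a block of stack-machine code.

    Each pass walks the block with an assumed initial stack depth. If an
    instruction needs more operands than are available, the pass is
    abandoned and restarted with the initial depth increased by one.
    """
    initial_depth = 0
    while True:
        depth = initial_depth
        underflow = False
        for instr in block:
            opcode = instr.split()[0].lower()
            consumed, produced = STACK_EFFECT[opcode]
            if depth < consumed:
                underflow = True  # the "restart" entries in Table II
                break
            depth += produced - consumed
        if not underflow:
            return initial_depth, depth
        initial_depth += 1


# The command sequence listed in Table II.
BLOCK = ["Nop", "Add", "Lit 5", "@Loc 7", "Nop", "@Loc", "Add",
         "Nop", "Nop", "@", "Add", "Swap"]

inputs, outputs = stack_interface(BLOCK)
print(inputs, outputs)
```

For the Table II sequence the loop abandons the passes that start at depth 0 and depth 1 and completes at depth 2, so the core would expect two input operands and return two results.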
B. Overview of Core Architecture 
This paper introduces two styles of architecture to 
implement the cores: Composite and Wave-core architectures. 
The translator generates both core styles at the same time to 
provide two sets of cores, either of which can be chosen. The 
‘Composite core’ model uses a shared logic structure to 
process a sequence of instruction groups in successive 
hardware states, and is closely related to the Mealy State 
Machine model (Figure-2a). The Wave-core, in contrast, treats 
each state in a computational flow as a separate hardware 
structure, operating in a fashion that is similar to a systolic 
array. Division into states is identical to that of the Composite 
core model, bounded by load or store operations. However, the 
computation structure for each state is implemented as a 
separate circuit and these are activated in turn, hence a ‘wave’ 
of activity rolling through a ‘daisy chained’ structure, stage by 
stage (Figure 2b). This means no logic circuit is used in 
successive states, reducing localized hot-spot occurrence. 
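The difference in logic reuse between the two styles can be made concrete with a toy activity count. This is only an abstraction of the description above, assuming one activation per state per pass; it models no real switching activity, and the block names are invented for illustration.

```python
from collections import Counter


def composite_activity(n_states):
    """Composite style: one shared logic structure is exercised in every state."""
    return Counter({"shared_logic": n_states})


def wave_activity(n_states):
    """Wave style: each state has its own circuit, exercised once per pass."""
    return Counter({f"stage_{s}": 1 for s in range(1, n_states + 1)})


n_states = 4  # e.g. the four groups of the Figure 1 example
composite = composite_activity(n_states)
wave = wave_activity(n_states)

# Total activations are equal, but the Wave style spreads them over n blocks.
print(sum(composite.values()), max(composite.values()))
print(sum(wave.values()), max(wave.values()))
```

In this toy model both styles perform the same number of activations per pass, but the busiest block in the Composite core is used in every state while no Wave-core block is used more than once.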
III. ANALYSIS AND RESULTS 
Translation of Assembly code to VHDL was performed 
using “test case” benchmarks selected for initial 
experimentation and processed to translate all code sections 
into cores (several thousand cores were generated in total). 
These were: “Coremark”, “Scimark” and three individual 
programs “spectral-normal”, “n-body” and “binary-tree”, to 
illustrate some program specific variations in our results. 
Additionally we utilized compiler library functions ‘library’ as 
a separate test set (not reported here). Many ‘library’ functions 
were used commonly by all benchmarks (e.g. malloc, printf ). 
We therefore separated the cores into library and non-library 
groups to prevent duplication-bias of characteristics.  

[Figure 1: the machine code sequence Inc, Add, Lit 5, @loc 7, Add, @loc 18, Add, !loc 4, Lit 3, Add is split into instruction groups and mapped to a VHDL data-flow structure with Input 1 and Input 2, memory transactions m1, m2, m3, and output result1.] 
Figure 1 The high level data flow relationship of the core 
Where @loc and !loc represent memory fetch and store to local variables in 
system memory, lit is a constant, and m1, m2, m3 are memory transactions. 

Our auto-generated VHDL was synthesized to FPGA and ASIC targets. 
Xilinx ISE™ Design tools were used for FPGA work, and 
Cadence IC™ Suite for ASIC work with a 65nm UMC™ 
process library being chosen. Verilog test-benches are also 
auto-generated as well as VCD files for power analysis. In each 
case we used the aforementioned stack machine C++ compiler 
tools [8] to generate the initial assembler code with the default 
code generation settings. We have noticed in undertaking this 
work that the advanced compiler optimization available in this 
compiler can significantly change core structure. This is 
certainly an area worthy of future investigation. 
We present our results in terms of a comparison between 
Wave and Composite core style for each key metric. Our data 
supports the following observations:- 
a.  Composite core style is on average 2.3 times smaller in 
area than Wave-Cores. However, some Wave-cores are up to 10 
times larger than their Composite equivalents, and some are 
of equal size (Figures 3a to 3c). 
b.  For frequency of core operation and critical paths, we see 
that Composite and Wave implementation styles are 
roughly equal overall.  (Figures 4a to 4c). 
c.  Total power (Static and Dynamic combined) again shows 
almost equal performance between Composite and 
Wave-core styles (Figures 5a to 5c). 
d.  Relative power density of the two core styles shows that 
Wave-cores are superior in this respect (Figures 6a to 6c). 
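Observations (a), (c) and (d) are linked by the definition of power density as power divided by area. The sketch below works through that arithmetic in normalised units, idealising observation (c) as exactly equal power and using the average area ratio from observation (a); the values are placeholders chosen for illustration, not measurements.

```python
def power_density(power, area):
    """Power density is power per unit area (units cancel in the ratio below)."""
    return power / area


# Normalised placeholder values, not measured data.
composite_power, composite_area = 1.0, 1.0
wave_power = 1.0   # idealisation of observation (c): equal total power
wave_area = 2.3    # observation (a): Wave-cores are 2.3x larger on average

density_ratio = (power_density(wave_power, wave_area)
                 / power_density(composite_power, composite_area))
print(round(density_ratio, 2))
```

With equal power spread over 2.3 times the area, the Wave-core power density comes out at roughly 0.43 of the Composite value under these idealised assumptions.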
What is interesting is that although Wave-cores have better 
power density, they have similar power and timing attributes. 
However, the key question is how these cores compare to a 
relevant CPU doing the same work. To quantify this final issue, 
we compared a suitable CPU core [9] using an implicit 
execution model and the same 65nm process node against our 
core power consumption (Figure 7). Here we assumed that 
each instruction in a group required an average of 1.5 CPU 
clocks, and used the 85 µW/MHz data quoted in the cited 
source. Generated cores have a range of relative power 
consumptions that tend toward an average of about 40%-50% 
of equivalent CPU power, but with some exceeding five times 
the power efficiency (points at 20% and less) for the same work. 
Given the low complexity of implicit execution models used in 
our host architecture, this is not out of line with the 10 to 20 
times gains reported in the Greendroid Project for an ARM 
CPU core [3]. A 65nm layout example of a Wave-core and 
Composite core of same function is given in Figure 8.  
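One plausible reading of the CPU comparison is an energy-per-task estimate, using the fact that 1 µW/MHz equals 1 pJ per clock cycle. The sketch below applies the 1.5 clocks per instruction and 85 µW/MHz figures quoted above. The core-side values (100 µW/MHz, 4 clocks) and the instruction count are hypothetical placeholders, not results from the paper, and the exact accounting used for Figure 7 may differ.

```python
CPU_UW_PER_MHZ = 85.0          # quoted CPU figure; 1 uW/MHz == 1 pJ per cycle
CPU_CLOCKS_PER_INSTR = 1.5     # average assumed in the text


def cpu_energy_pj(n_instructions):
    """Estimated CPU energy (pJ) to execute n instructions."""
    return n_instructions * CPU_CLOCKS_PER_INSTR * CPU_UW_PER_MHZ


def core_energy_pj(core_uw_per_mhz, core_clocks):
    """Estimated core energy (pJ) for one pass through its states."""
    return core_uw_per_mhz * core_clocks


# Hypothetical example values, chosen only to show the calculation.
n_instructions = 10
core_uw_per_mhz = 100.0
core_clocks = 4

cpu_pj = cpu_energy_pj(n_instructions)
core_pj = core_energy_pj(core_uw_per_mhz, core_clocks)
relative_power = core_pj / cpu_pj
print(cpu_pj, core_pj, round(relative_power, 2))
```

A relative value below 1.0 means the core does the same work for less energy than the CPU estimate; a value of 0.2 would correspond to the five-times efficiency mentioned above.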
IV. CONCLUSIONS 
In this paper we have explored a novel model to be applied 
in a heterogeneous multi-accelerator system. To improve the 
performance of the overall system, we have implemented a 
Wave-core architecture based upon a synchronous sequential 
state-per-stage system. We have presented an automated 
validation and synthesis tool chain to convert stack-machine 
assembly-code into hardware and so generate possibly 
thousands of units. The final selection and use of such units in 
practice requires further research: it is likely that a subset of 
frequently utilized cores would be better than a comprehensive 
suite of cores in which some are rarely used. In conclusion, by 
using an appropriate core architecture, it is possible to achieve 
similar or better power consumption than an alternative core 
model, and better speed of execution, whilst having a beneficial 
effect of reducing power density. Further work on the tradeoff 
between code quality and core structure is recommended. 
Access to sophisticated stack code optimization tools, and 
using these to generate better quality code before translation 
will, we believe, yield greater efficiencies, higher levels of 
instruction level parallelism and more optimal core structures. 
REFERENCES 
[1]  Vikas Agarwal, M.S. Hrishikesh, Stephen W. Keckler, Doug Burger. 
“Clock Rate versus IPC: The End of the Road for Conventional 
Microarchitectures”. In Proc. of 27th Annual International Symposium 
on Computer Architecture, pp. 248-259, May 2000. 
[2]  M. B. Taylor. “A Landscape of the New Dark Silicon Design Regime”. IEEE 
Micro, pp. 8-19, Sep/Oct 2013. 
[3]  Ganesh Venkatesh, Jack Sampson, et al. “The GreenDroid Mobile 
Application Processor: An Architecture for Silicon’s Dark Future”. IEEE 
Micro, pp. 86-95, March/April 2011. 
[4]  Guihai Yan, et al. “AgileRegulator: A hybrid voltage regulator scheme 
redeeming dark silicon for power efficiency in a multicore 
architecture”. In Proc. IEEE 18th Intl. Symposium on High Performance 
Computer Architecture (HPCA), 2012. 
[5]  Eric S. Chung, Peter A. Milder, James C. Hoe, and Ken Mai. “Single-Chip 
Heterogeneous Computing: Does the Future Include Custom Logic, 
FPGAs, and GPGPUs?”. In 43rd Annual IEEE/ACM International 
Symposium on Microarchitecture, pp. 225-236, Dec. 4-8, 2010. 
[6]  Ganesh Venkatesh, Jack Sampson, et al. “Conservation Cores: Reducing the 
Energy of Mature Computations”. ASPLOS, pp13-17, March 2010. 
[7]  H. Shi, C. Bailey. “Instruction Level Parallelism of Stack-Code 
under Varied Issues Widths, and One-Level Branch Prediction”. IADIS Int. 
Conference on Applied Computing, pp. 23, 2005. 
[8]  M. Shannon, C. Bailey. “Global Register Allocation, Register 
allocation for Stack Machines”. In Proceedings of Euroforth 2006. 
[9]  G. A. Subbarao. “Low-power Microprocessor based on Stack Architecture”. 
University of Lund, Master Thesis, 2015. 
[Figure 2a: block diagram. Labels recovered from the extraction: mem#, inputs, outputs, STATE 1, STATE 2, STATE 3.] 
Figure 2a  – Composite Core Structure and Typical Operation 
[Figure 2b: block diagram. Labels recovered from the extraction: mem#, inputs, outputs, STAGE 1, STAGE 2, STAGE 3.] 
Figure 2b  – Wave Core Structure and Typical Operation 
[Figures 3(a) to 3(c): three plots. Panels: 3(a) Coremark, 3(b) Scimark, 3(c) Binary-tree/spectral/Nbody (series: Binary, Spectral, Nbody). x-axis range 0 to 8000, y-axis range 0 to 40000.] 
Figures 3(a) to 3(c), Area comparison for Composite (x-axis) and Wave Core (y-axis), all units in µm² 
[Figures 4(a) to 4(c): three plots. Panels: 4(a) Coremark, 4(b) Scimark, 4(c) Binary-tree/spectral/Nbody (series: Binary, Spectral, Nbody). Both axes range from 900 to 1500.] 
Figures 4(a) to 4(c), timing comparison for Composite (x-axis) and Wave Core (y-axis), all units in picoseconds 
[Figures 5(a) to 5(c): three plots. Panels: 5(a) Coremark, 5(b) Scimark, 5(c) Binary-tree/spectral/Nbody (series: Binary, Spectral, Nbody). x-axis range 70 to 120, y-axis range 50 to 200.] 
Figures 5(a) to 5(c), Power comparison for Composite (x-axis) and Wave Core (y-axis), all units in µW/MHz 
[Figures 6(a) to 6(c): three plots. Panels: 6(a) Coremark, 6(b) Scimark, 6(c) Binary-tree/spectral/Nbody (series: Binary, Spectral, Nbody). x-axis range 0 to 6, y-axis range 0 to 8.] 
Figures 6(a) to 6(c), Power Density comparison for Composite (x-axis) and Wave Core (y-axis), all units in milliWatts/cm2/MHz 
 
[Figure 7: plot. y-axis: Wave Core Relative Power Consumed, 0% to 120%. x-axis: Estimated CPU Dynamic Power Consumption for equivalent work (pW), 0 to 8000.] 
Figure 7, Comparison of Wave Cores vs. CPU power (Scimark)   
 
Figure 8, Example of a larger core showing a Composite core 
overlaid on an equivalent Wave core for size comparison. 
(implemented at 65nm with 90% utilization in macrocell area) 
  
