The Feasibility of Domain Specific Compilation for Spatially Programmable Architectures by Mackay, Curtis Alexander (Author) et al.
The Feasibility of Domain Specific Compilation  
for Spatially Programmable Architectures 
by 
Curtis Mackay 
 
 
 
 
 
A Thesis Presented in Partial Fulfillment  
of the Requirements for the Degree  
Master of Science  
 
 
 
 
 
 
 
 
 
 
Approved June 2016 by the 
Graduate Supervisory Committee:  
 
John Brunhaver, Chair 
Lina Karam 
Jae-Sun Seo 
 
 
 
 
 
 
 
 
 
 
 
 
ARIZONA STATE UNIVERSITY  
August 2016  
ABSTRACT  
   
Integrated circuits must be energy efficient. This efficiency affects all aspects of 
chip design, from the battery life of embedded devices to thermal heating on high 
performance servers. As technology scaling slows, future generations of transistors will 
not deliver the energy efficiency gains of previous generations. Therefore, other 
sources of energy efficiency will become much more important. Many computations have the 
potential to be executed with extreme energy efficiency but are not, because the 
platforms they run on are not optimized for efficient execution. ASICs improve energy 
efficiency by reducing flexibility and leveraging the properties of a specific computation. 
However, ASICs are fixed in function and therefore carry a large opportunity cost. 
FPGAs offer a reconfigurable solution but are 25x less energy efficient than an ASIC 
implementation. Spatially programmable architectures (SPAs) are similar in design and 
structure to ASICs and FPGAs but are able to bridge the ASIC-FPGA energy efficiency gap 
by trading flexibility for efficiency. However, SPAs are difficult to program because they 
do not share the programming model of conventional architectures that execute in time. 
This work addresses compiler challenges for coarse grained, locally interconnected SPAs 
for domain efficiency (SPADE). A novel SPADE topology, called the wave pipeline, is 
introduced for the image signal processing domain; it is both efficient 
and simple to compile to. A compiler for the wave pipeline is created that solves for 
maximum energy and area efficiency using low complexity, greedy methods. The wave 
pipeline topology and compiler allow us to investigate and experiment with image 
signal processing applications to prove the feasibility of SPADE compilers.  
DEDICATION  
   
 
To my beloved wife Katie. She has been a pillar of support for me to lean on through all 
of my pursuits.  
ACKNOWLEDGMENTS  
   
 I would like to express my gratitude to my advisor, Dr. Brunhaver, for his 
guidance, support and patience. He instilled in me the knowledge, intuition, and habits I 
needed to progress in my academic studies. The time he sacrificed to have face to face 
meetings with all of his students was invaluable to all of our studies. He gave me the 
freedom to explore and search while leaving me a line to anchor myself to.  
 I would also like to express thanks to Dr. Jae-Sun Seo and Dr. Lina Karam for 
being on my committee. Their comments, suggestions, and critique were insightful for 
my research. 
 This work would not be complete if it were not for my colleagues Saktiswarup 
Satapathy, Ron Jokai, and Prateek Mohan. It has been an honor to work alongside 
these brilliant students, who I know will go on to do great things. The discussions we 
have shared have helped me to expand my faculties and stretch to new heights. 
 Finally, I would like to acknowledge my wife and family. They have supported 
my efforts since the beginning. My mother and father, Donald and Kathleen Mackay, 
have given me the foundation to be all that I can, and the confidence to pursue my 
dreams. My mother and father in law, Craig and Darlene Palmer, have also 
sacrificed to support me through my endeavors. And most of all, I thank my wife, who was with 
me during the late nights without complaint. 
 
TABLE OF CONTENTS  
LIST OF TABLES
LIST OF FIGURES
CHAPTER
1     INTRODUCTION
2     ENERGY EFFICIENCY AND FLEXIBILITY
2.1 Energy Efficient Computations
2.2 Image Signal Processing is an Energy Efficient Domain
2.3 Convolution is Energy Efficient
2.4 General ISP Abstraction
3     SPATIALLY PROGRAMMABLE ARCHITECTURE FOR DOMAIN EFFICIENCY
3.1 Coarse Grain Reconfigurable Arrays (CGRA)
3.2 Spatially Programmable Architectures
3.3 SPADE
3.3.1 Line Buffer and Stencil Registers
3.3.2 Programmable Elements
3.3.3 Switches
3.3.4 Topologies
3.3.5 PE Heterogeneity
4     THE SPADE KERNEL COMPILER
4.1 Compilers for Specialized Architectures
4.2 SPA Programming Model
4.3 Compiling to SPA
4.4 SPADE Backend Compiler
4.5 A Fast Optimal Compiler for the Wave Pipeline
5     METHODOLOGY
5.1 Compiler Implementation
5.2 Compiler Metrics
6     RESULTS
6.1 Linear Compile Times
6.2 High Utilization
6.3 Utilization Cost of Heterogeneous PEs
6.4 Measuring Communication
7     DISCUSSION
7.1 Fabric Heterogeneity
7.2 Localized Communication
8     CONCLUSION
9     FUTURE WORK
REFERENCES
LIST OF TABLES 
Table
1. SPA Architectures Compared to the Wave Pipeline
2. Application Kernel Profiles
LIST OF FIGURES 
Figure
1. Flexibility-Efficiency Tradeoff
2. Energy Cost of Different Operations Measured in Energy per Op per Bit
3. Convolution Demonstration
4. Block Diagram of 3x3 Convolution with a Line Buffer and Shift Registers
5. General ISP Block Diagram with a Programmable Functional Unit
6. Example of Cascaded ISP Pipeline
7. Block Diagram of PE
8. Block Diagram of Switch
9. Energy and Area Cost of Switch
10. Wave Pipeline Topology
11. Operation Profile of Kernels
12. SPADE Compiler Flow
13. Programming in Space vs. Programming in Time
14. Compiler Framework Block Diagram
15. Pseudo Code for Greedy Compiler Algorithm Using Breadth First Search
16. Updated Pseudo Code for Greedy Compiler with a Heterogeneous Fabric
17. Compile Times Plotted Against the Number of PEs Configured
18. Utilization of PEs by Each Application
19. Harris Wave Pipeline Graphs
20. Fabric Size Increase Due To Heterogeneity
21. Histogram of Level Changes
22. Banded Switch
23. Utilization Cost of Banded Switch
 
 
CHAPTER 1 
INTRODUCTION 
Computers are ubiquitous. For mobile devices, the energy efficiency of their 
computation directly affects the amount of processing that can be done in a desired 
battery life. Similarly, the energy efficiency of thermally constrained systems determines 
the rate of computation [1]. Therefore, high performance systems and devices with long 
battery lives must minimize the energy cost of their computation.  
The chip making community cannot wait for better transistors to combat energy 
efficiency issues. Previously, these problems were resolved by moving to smaller 
technology nodes, since power consumption decreased in proportion to device size under 
Dennard’s law. However, constant field scaling ended when the oxide thickness of transistors 
could not be scaled down any further without severe current leakage. To compound the 
problem, Moore’s law is slowing, and the marginal benefit of each new 
generation is decreasing.  
Alternatively, we can increase efficiency by reducing the architectural energy 
overheads of computation. For example, we can eliminate redundant memory accesses, 
amortize instruction sequencing across multiple data, and reduce the precision of 
execution units. However, many of these overheads exist to increase the flexibility of a 
microprocessor [2]. Fig. 1 shows that as flexibility and generality increase, energy and 
area costs increase. 
 
Fig. 1: Flexibility-Efficiency Tradeoff. Architectures are more energy and area efficient as 
they become more specialized. This specialization sacrifices flexibility for efficiency. 
Values were calculated using 45nm technology based on work presented at ISSCC and 
JSSC [3]. The FPGA values are general estimates based on Kuon’s paper [4].   
 ASICs are incredibly energy efficient because functions are fixed in hardware. 
This eliminates architectural overhead that is inherent to flexible architectures. They also 
capitalize on the structure of computation with fixed data paths and memory structures. 
However, the cost of developing ASICs is tremendous, so they are reserved for 
intrinsically energy efficient computations. 
Removing flexibility, as in an ASIC, carries a large opportunity cost. The 
area taken up by the hardware is now fixed to a single function and cannot be used for 
other computations that could have been executed by a more flexible architecture, such as 
a CPU. The design effort is also allocated to one function. There is not enough real estate 
on a chip die to justify a large number of ASICs. Therefore, ASICs are 
mainly restricted to a few extremely common, recurring computations.  
FPGAs are much more flexible than ASICs, but are still orders of magnitude 
more efficient than CPUs. Their flexibility comes at an energy cost of approximately 
25x-50x that of an ASIC [4], [2]. FPGAs are extremely flexible because of 
their fine grained configurable fabrics and hierarchical networks of global and local 
interconnects. However, these same features are the fundamental source of their 
inefficiency. While fine grained control is useful for 
predicate operations, it is a drain on data path operations [5]. Fine grained control and 
global communication also imply a large number of switched interconnects. Large 
global interconnects incur large parasitic RC power dissipation and area cost. 
The goal of this work is to close the energy gap between FPGAs and ASICs with 
new architectures. As a place to start, an FPGA’s energy efficiency could be increased by 
reducing flexibility and the accompanying overheads. This is done by using coarse 
grained logic units, local interconnects, and custom memories to efficiently control data 
flow. Coarse grained logic has been shown to improve energy and area efficiency in other 
FPGA-like architectures [6], [7]. Local communication is more energy efficient than 
global communication because there are significantly fewer wires running between 
elements. Memory accesses are another significant cost of processing. Custom memories 
control data flow much more efficiently than general global memory and should be used 
whenever possible.  
Optimized FPGAs begin to look like spatially programmable architectures (SPA) 
[5]. SPAs are fabrics of routable interconnects connecting programmable processors 
together to execute computations in space rather than time. They are processors unto 
themselves and typically accelerate computations for CPUs. Previous work has shown 
that SPAs efficiently execute computations for applications in linear algebra [8], 
image and multimedia processing [6], [7], and signal processing [9]. These types of 
computation are commonly executed and socially significant in modern computing.  
Despite the promising energy savings from SPAs, these energy efficient 
architectures are difficult to program. One fundamental reason is that 
underlying knowledge of the hardware is required to effectively program it. For example, 
FPGAs, which are fine-grained SPAs, are generally made up of heterogeneous blocks of 
DSPs, LUTs, and BRAMs. High level synthesis tools exist in an attempt to make it easy 
for developers to compile C code onto the FPGA. Though it is simple and familiar to 
compile the C code, it is difficult for the compile tools to find an optimal solution during 
the place and route phases. The developer must manually optimize the code to improve 
performance.  
Programmability can be simplified by narrowing target applications to a specific 
domain. Application developers are usually unfamiliar with the hardware space, but they 
are familiar with the problem domain of their applications. Domain specific languages 
(DSLs) are an active area of research; they trade a restriction on expression for 
programmer productivity and application performance [10], [11], [12]. These languages 
can be leveraged at the front end of a programming model to map code to energy efficient 
architectures. 
Concurrently, if we are reducing the flexibility of the architecture, we can use the 
domain of the compilation as a guide for these restrictions. This affords further energy 
savings because the architecture's data paths and programmable 
elements can now be customized to match domain specific requirements. By restricting the domain, programmers 
no longer need to understand the details of the underlying architecture because the DSL 
restricts the code enough to match the hardware abstraction.  
Therefore, to achieve an architecture that will bridge the FPGA-ASIC energy gap, 
this work focuses on SPA for domain efficiency (SPADE). Specifically, we target the 
image signal processing (ISP) domain, although there are many other suitable domains 
worth exploring. This domain is chosen because there are many architectural works and 
DSLs that are optimized for it. 
To achieve this goal, a novel SPADE topology called the wave pipeline is created 
to execute the inner loop function of kernels in the ISP domain. The wave pipeline is a 
fabric of programmable execution elements organized into stages or waves. It is similar 
to other SPA works but removes much of the hardware overhead, such as large register 
files, sequencers, and global data buses. The area savings from the reduced overhead 
allow the fabric to be large so that many computations can be executed 
simultaneously in the pipeline. 
 Compiling a DSL program to the wave pipeline hardware requires a compiler 
that maps operations to hardware, places them onto a spatial fabric, and routes the data 
flow between processing elements. Many SPA compilers are constrained by some 
optimization cost such as energy, area, or physical fabric constraints. Some have attacked 
this problem by using modulo scheduling [13], [6], integer linear programming [6], [14], 
virtualization [7], and mapping tree paths to hardware [15]. However, most of these 
compilers are designed for small fabrics and only sub-graphs of kernels. The compiler for 
the wave pipeline compiles for large fabrics and targets the entire graph of kernels. 
 Energy efficiency is largely due to the types of kernels and computations that are 
executed. These types of computations share similarities in data flow and data type, 
especially when they are within the same domain (Chapter 2 Energy Efficiency and 
Flexibility). Since only certain computations are fundamentally efficient, efficient 
architectures ought to sacrifice flexibility and specialize on these types of computations. 
This is specifically shown with the wave pipeline SPA that is designed for this work 
(Chapter 3 Spatially Programmable Architecture for Domain Efficiency). Compiling to 
SPAs has historically been difficult due to heterogeneity and large computational fabrics. 
However, by specializing to a domain specific programming model it is shown that a low 
complexity compiler can actually solve for optimal configurations on the wave pipeline 
topology by leveraging the intermediate representations of DSLs (Chapter 4 The SPADE 
Kernel Compiler). The compiler was used to compile a few ISP applications to evaluate 
the generated configurations and the performance of the compiler (Chapter 5 Methodology). 
The results indicate the tradeoffs in fabric utilization with certain optimizations made to 
the wave pipeline topology, which are helpful to understand the balance between high 
level parameters including heterogeneity, fabric complexity, and element size (Chapter 6 
Results). Fabrics are optimized with heterogeneous elements and localized communication 
to save on energy and area costs. However, as fabrics become more heterogeneous and as 
communication between elements is localized, the compiler’s configurations themselves 
become less efficient. This implies that optimization requires an understanding of both 
the compiler effort and hardware costs (Chapter 7 Discussion).  
CHAPTER 2 
ENERGY EFFICIENCY AND FLEXIBILITY 
 There exists a tradeoff between efficiency and flexibility. Therefore, an 
architecture that fills the ASIC-FPGA efficiency gap will be less flexible than an FPGA 
but flexible enough to overcome fixed functionality opportunity costs associated with 
ASICs. Reducing flexibility naturally constrains targeted computations to a specific 
domain, such as communications, scientific calculations, and image processing. There are 
enough operations and kernels within these domains that it is worthwhile to create 
specialized architectures for them. By studying the tradeoffs between efficiency and 
flexibility, it becomes easier to design an architecture for specific domains in an efficient 
manner.  
2.1 Energy Efficient Computations 
If the architecture is to be restricted, it should focus on executing domain specific 
computations that are intrinsically energy efficient. Domain computations are energy 
efficient if their computations have a small average energy spent per arithmetic operation. 
Therefore, such computations favor low energy operations over high energy operations.  
 
Fig. 2: Energy cost of different operations measured in energy per op per bit. The cost is 
plotted against bitwidth. Note that the axes use a base 2 logarithmic scale. Figure was 
taken from Brunhaver’s PhD Thesis [16]. 
Arithmetic operations use much less energy than memory and architecture 
overhead operations. Figure 2 illustrates the gaps in energy cost between various 
operations. DRAM memory accesses 
are the most expensive operations and are 300-8000x more energy expensive than any 
arithmetic operation. SRAM memory accesses are significantly cheaper than DRAM but 
are approximately 100x more expensive than basic arithmetic operations. Overhead 
operations, such as instruction fetch and decode, are essential to highly flexible 
architectures because they allow for arbitrary instruction sequences. Such overhead is 
architecture dependent and is between 100x and 500x more energy expensive than 
arithmetic operations. 
Memory and overhead operations are unavoidable but can be made to execute 
less frequently to limit their costs. If for every high energy operation there are many 
low energy operations executed, the average cost of high energy operations per cycle 
begins to approach the average cost of low energy operations per cycle. The limit here is 
when the average energy spent for memory and overhead operations equals the average 
energy spent for arithmetic operations. 
For global memory accesses to be energy efficient, there must be a high compute 
to bandwidth ratio. For each word of data accessed from DRAM memory, approximately 
1000 arithmetic operations should be executed to offset the memory cost. This is a 
scenario that often occurs in the inner loop of functions where a data set is accessed once 
and operated on many times before storing and moving to the next data set. 
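The amortization argument can be sketched numerically. The following is a minimal illustration, not part of the original work; the function names are hypothetical, costs are expressed in units of one arithmetic operation, and the 1000x DRAM cost is an assumed round number chosen from within the 300-8000x range quoted above.

```python
def average_energy_per_op(n_arith, high_cost, arith_cost=1.0):
    """Average energy per arithmetic operation when one expensive operation
    (memory access or instruction overhead) is amortized over n_arith
    arithmetic operations. Costs are relative to one arithmetic operation."""
    if n_arith <= 0:
        raise ValueError("n_arith must be positive")
    return (high_cost + n_arith * arith_cost) / n_arith


def break_even_ops(high_cost, arith_cost=1.0):
    """Number of arithmetic operations at which the expensive operation
    costs as much as all of the arithmetic it is amortized over."""
    return high_cost / arith_cost


# Assumed relative cost for illustration: one DRAM access = 1000 arithmetic ops.
DRAM_COST = 1000.0

for n in (1, 10, 1000, 10000):
    print(n, average_energy_per_op(n, DRAM_COST))
```

With these assumed numbers, the break-even point falls at 1000 arithmetic operations per DRAM access, which matches the ratio suggested in the text.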
Similarly, data should have a high degree of locality to mitigate local memory 
cost. Locality here refers to the interdependence between data points in both space and 
time. The working data set should be small enough to fit in local memory and then 
require only a few accesses per set of computations. This concept is similar to amortizing DRAM 
cost but is applied to SRAM memory at a smaller scale, with about 100 arithmetic operations per SRAM 
memory access.  
Instruction overhead is about as costly as accessing local memory, so each 
instruction should encode many arithmetic operations. It is intractable to imagine an 
instruction for every computation in existence, but there are certainly code traces and 
functions that could justify the specialized effort. The easier it is to encode complex 
instructions, the more useful an architecture will be at executing energy efficient 
computations. 
Though arithmetic operations have low energy costs, they are most energy 
efficient at low precision. The arithmetic operations in Figure 2 have a positive slope, 
indicating that the energy spent per operation per bit increases with bitwidth. 
This means that arithmetic operations become less efficient per bit at larger 
precisions. Therefore, energy efficient computations should mainly operate on low precision 
data.  
Energy efficient computations can then be summarized by four principles: 
1. High compute to bandwidth ratio [17], [18]. 
2. High data locality [19], [20]. 
3. Execute many operations per instruction [21]. 
4. Preference towards low precision operations [22]. 
Computations that do not follow these four principles are inherently energy 
inefficient and as such should not be executed by energy efficient architectures. These 
types of computations can be executed on a CPU instead, leaving only the targeted 
energy efficient computations for a separate coprocessor. 
2.2 Image Signal Processing is an Energy Efficient Domain 
Applications in the image signal processing (ISP) domain generally follow these 
four principles of energy efficient computation. Most ISP applications are made up of 
multiple kernels that do some looped computation over a sub region of an input image. 
These functions share common characteristics that allow them to be executed efficiently. 
Principles 1 and 2 are fulfilled because ISP functions have a finite working set of pixels 
that can be stored in specialized local memory to avoid costly redundant global memory 
accesses. Principle 3 is fulfilled because the inner loops of ISP functions are common to 
many applications, such as filters, non-maximal suppression, and up-sampling. These 
functions essentially become complex fused instructions. However, these kernels usually 
contain only a few tens of operations, which is far from the hundreds of operations 
needed to offset instruction and memory costs. To resolve this, kernels can be pipelined 
with other kernels in order to execute thousands of operations per global memory access. 
This abstraction is flexible enough to execute many ISP applications with little 
configuration. 
2.3 Convolution is Energy Efficient 
2D convolution is an extremely common function in many applications and a 
standard example of energy efficiency in the ISP domain. When applied to an image, a 
kernel window selects a sub-region of the image and each pixel is multiplied by the 
corresponding element of a kernel matrix. The products are summed together and then placed at the 
corresponding location in the output image. The kernel window is shifted by one pixel to a 
neighboring sub-region of the image and the process is repeated until the entire input 
image is scanned and all pixels in the output image are calculated.  
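The sliding window computation described above can be sketched in a few lines. This is a minimal software illustration rather than the hardware implementation; the function name and the choice to compute only fully covered ("valid") window positions are assumptions made for brevity.

```python
def convolve2d(image, kernel):
    """Direct 2D sliding-window convolution over the valid region.

    image  -- list of rows, each a list of numbers
    kernel -- k x k list of coefficient rows
    Each window pixel is multiplied by the kernel element at the same
    position (no kernel flip, as in the description above) and the
    products are summed into one output pixel.
    """
    k = len(kernel)
    height, width = len(image), len(image[0])
    out = []
    for r in range(height - k + 1):          # window moves down one row at a time
        out_row = []
        for c in range(width - k + 1):       # window shifts right one pixel at a time
            acc = 0
            for i in range(k):
                for j in range(k):
                    acc += image[r + i][c + j] * kernel[i][j]
            out_row.append(acc)
        out.append(out_row)
    return out


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
box = [[1, 1, 1]] * 3
print(convolve2d(image, box))  # [[54, 63], [90, 99]]
```

Note that this direct form reads every window pixel again for each output, which is exactly the redundancy that custom memory structures are meant to remove.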
Data flow in convolution is very regular. If the data flow were imagined as 
running on hardware, then as the kernel scans the input image, the pixels selected 
by the kernel are loaded from memory and could be placed in local registers. The 
coefficients could be stored in registers as well and multiplied by their corresponding 
pixels. The products are then summed into a single pixel, which is stored back into 
memory.  
Convolution satisfies three of the four principles of energy efficiency mentioned 
in Section 2.1 Energy Efficient Computations. Principles 1 and 2 are illustrated in Fig. 3 
by highlighting the areas where pixels can be reused to reduce memory accesses. The 
kernel is represented by the red square in Fig. 3 and the three sub figures show how it 
scans horizontally across a row before it moves down to the next row. Fig. 3 (b) shows 
the kernel shifted by one. The green highlighted area represents the pixels that are reused 
immediately from the previous kernel operation. Fig. 3 (c) shows the kernel as it moves 
to the next row. The orange highlighted area represents the row of pixels that are going to 
be reused from the last row that was processed. These areas of spatial and temporal 
locality highlight how convolution can be executed efficiently through custom memory 
structures to satisfy Principles 1 and 2. Principle 4 is satisfied because pixels are usually 
encoded with a small bitwidth (e.g. 8, 16 or 24 bit), so low precision hardware can be used. 
Principle 3 is not satisfied by convolution because only a few tens 
of operations are executed instead of thousands.  
 
Fig. 3: Convolution Demonstration. (a) A simple block diagram illustrating convolution. 
(b) Highlights immediate data reuse between computations in green. (c) Highlights row 
reuse after each row is processed. 
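The benefit of the reuse highlighted in Fig. 3 can be estimated by counting pixel reads. The sketch below is a rough illustration under simplifying assumptions: only valid window positions are computed, every read costs the same, and the function names are hypothetical.

```python
def naive_reads(height, width, k):
    """Pixel reads from memory if every k x k window is fetched in full."""
    return (height - k + 1) * (width - k + 1) * k * k


def reuse_reads(height, width):
    """Pixel reads from memory if every input pixel is fetched exactly once
    and all further uses are served from local storage."""
    return height * width


# Hypothetical frame size, chosen only for illustration.
h, w, k = 1080, 1920, 3
print(naive_reads(h, w, k) / reuse_reads(h, w))
```

For a large image the ratio approaches k * k, so a 3x3 kernel with full reuse needs roughly one ninth of the global memory reads of the naive scheme.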
 
For hardware to take advantage of the efficient properties of convolution, custom 
memory and functional units can be used to make an energy efficient convolution block. 
A line buffer interfaces with global memory to receive incoming pixels from an image. 
As each pixel is accessed, it is stored in the line buffer so that the rows highlighted in 
Fig. 3 (c) can be accessed again in the near future. Meanwhile, the same pixel is sent to 
an array of shift registers, while the other pixels belonging to the same column as the 
incoming pixel are sent from the line buffer. The shift registers act like the window that 
traverses the image in Fig. 3 (b), and the array is the same size as the kernel. 
At each cycle, the shift registers send the data from all the pixels to a multiplication unit 
to be multiplied with the coefficients. The products are then summed to produce a single 
output that is stored back in memory. Fig. 4 illustrates this abstraction for a 3x3 
convolution. 
 
  
Fig. 4: Block diagram of 3x3 convolution with a line buffer and shift registers. 
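The line buffer and shift register arrangement can be modeled in software to check the data flow. The following is a behavioral sketch only, not a description of the actual hardware: the function name, the use of a deque as the line buffer, and the zero initialization are assumptions. Each input pixel is read exactly once, in raster order.

```python
from collections import deque


def stream_convolve3x3(pixels, width, kernel):
    """Streaming 3x3 convolution over pixels arriving in raster order.

    line_buffer holds the previous two image rows (2 * width pixels).
    window models a 3x3 array of shift registers: every cycle it shifts
    left by one column and loads a new column made of two line-buffer
    pixels plus the incoming pixel. One output is produced for each
    fully valid window position.
    """
    line_buffer = deque([0] * (2 * width), maxlen=2 * width)
    window = [[0] * 3 for _ in range(3)]
    for index, pixel in enumerate(pixels):
        row, col = divmod(index, width)
        two_up = line_buffer[0]        # same column, two rows above
        one_up = line_buffer[width]    # same column, one row above
        new_column = (two_up, one_up, pixel)
        for i in range(3):
            window[i] = window[i][1:] + [new_column[i]]
        line_buffer.append(pixel)      # oldest pixel falls out automatically
        if row >= 2 and col >= 2:      # window lies fully inside the image
            yield sum(window[i][j] * kernel[i][j]
                      for i in range(3) for j in range(3))


box = [[1, 1, 1]] * 3
print(list(stream_convolve3x3(range(1, 17), 4, box)))  # [54, 63, 90, 99]
```

The outputs match a direct sliding window computation on the same 4x4 image, while the only storage required is two image rows and nine window registers.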
2.4 General ISP Abstraction 
There are many kernels in the ISP domain that fall under an abstraction very 
similar to convolution [16]. Such an abstraction is shown in Fig. 5, again using a line 
buffer and shift registers, and now with a programmable functional unit. The only major 
difference is that the multiplication-sum unit is replaced by a general functional unit 
that is able to execute any number of kernel functions. The details, such as line buffer 
size, window size, and pixel bitwidth, become flexible configurations so that any type of 
kernel can be executed. 
 
Fig. 5: General ISP block diagram with a programmable functional unit. 
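In software terms, the general abstraction amounts to a stencil operator parameterized by the function applied to each window. The sketch below illustrates the idea under that assumption; `apply_stencil`, `box_sum`, and `window_max` are hypothetical names, and the window size `k` stands in for the configurable window size mentioned above.

```python
def apply_stencil(image, k, func):
    """Apply func to every k x k window of image (valid region only).

    func plays the role of the programmable functional unit: it receives
    the window as a list of k rows and returns one output pixel.
    """
    height, width = len(image), len(image[0])
    return [
        [func([row[c:c + k] for row in image[r:r + k]])
         for c in range(width - k + 1)]
        for r in range(height - k + 1)
    ]


def box_sum(window):
    """Convolution with an all-ones kernel."""
    return sum(sum(row) for row in window)


def window_max(window):
    """Maximum over the window, a building block of non-maximal suppression."""
    return max(max(row) for row in window)


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
print(apply_stencil(image, 3, box_sum))     # [[54, 63], [90, 99]]
print(apply_stencil(image, 3, window_max))  # [[11, 12], [15, 16]]
```

Only the function passed in changes between kernels; the window traversal, and therefore the memory structure, stays the same.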
 
Image signal processing pipelines use many convolution-like kernels by 
cascading them together. The ISP abstraction can also be used in the same way to create 
useful ISP pipelines as shown in Fig. 6. By cascading multiple instances of the pipeline, 
many operations are executed per memory access, thus satisfying Principle 3. 
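The effect of cascading on the compute to bandwidth ratio can be estimated with simple arithmetic. This is a back-of-the-envelope sketch: the function name is hypothetical, the cascade is assumed to perform one global read and one global write per pixel with all intermediate results kept in local line buffers, and the per-stage operation counts other than the 3x3 convolution (9 multiplies plus 8 additions) are made-up placeholders.

```python
def ops_per_global_access(stage_ops, accesses_per_pixel=2):
    """Arithmetic operations per global memory access for a cascade.

    stage_ops          -- operations each stage performs per output pixel
    accesses_per_pixel -- global accesses per pixel for the whole cascade;
                          assumed to be one input read and one output write,
                          since intermediate data stays in local buffers
    """
    return sum(stage_ops) / accesses_per_pixel


single_conv = [17]                # one 3x3 convolution: 9 multiplies + 8 adds
cascade = [17, 17, 40, 30]        # hypothetical four-stage pipeline
print(ops_per_global_access(single_conv))  # 8.5
print(ops_per_global_access(cascade))      # 52.0
```

The ratio grows with every stage added to the cascade because the global memory traffic stays fixed while the arithmetic accumulates.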
For this work we use a set of sample ISP applications for evaluating the 
architecture and the compiler: a Canny edge detection filter [23], a Harris corner 
detection filter [24], a FAST feature detection filter [25], [26], convolution, and a motion 
estimation kernel as found in the H.264/AVC joint model [27]. The motion estimation 
kernel implements an exhaustive search and is used for measuring compiler complexity 
because it has so many operations. The other applications are used to gain insight on the 
efficiency tradeoffs between architecture and compiler design.  
 
Fig. 6: Example of cascaded ISP pipeline. 
CHAPTER 3 
SPATIALLY PROGRAMMABLE ARCHITECTURE FOR DOMAIN EFFICIENCY 
  The ISP abstraction above needs a programmable functional unit that can execute
hundreds of operations. This requirement cannot be met by exploiting parallelism as done in
wide VLIW-SIMD processors. Instead of having a few processing elements that operate on
parallel data, we can utilize a large array of processing elements that can be fused together in a
reconfigurable way. The growing fields of coarse grain reconfigurable arrays (CGRA) and
spatially programmable architectures (SPA) provide the processing array needed to fuse
hundreds of operations together.
3.1 Coarse Grain Reconfigurable Arrays (CGRA) 
CGRAs are reconfigurable computational arrays that allow operations to be fused 
together to make a complex operation but have the added flexibility to be reconfigured at 
any time. They consist of an array of coarse processing elements interconnected by a set 
of switches for routing data. CGRAs are usually implemented as part of a processor and 
not as isolated processors. They fit along with the rest of the pipeline stages and receive 
data and instructions from the memory and register files. CGRAs are often used to
execute the hot traces of a program. The reconfigurable array of
processing elements can be reconfigured as new hot traces are discovered.
However, reconfiguring a CGRA involves a substantial amount of overhead, so each
configuration must do enough work between reconfigurations to make the approach
feasible.
DySER is a dynamically scheduled CGRA made up of blocks of locally 
interconnected functional units [15]. It runs in parallel with the execution stage pipeline 
and interfaces with the register file and memory at the input and output. The DySER
architecture is able to run a variety of programs at higher efficiency and performance
than an out of order (OOO) processor. The efficiency gain depends on how
energy efficient the computation is. Gains in energy efficiency were measured to be from
5% to 90% for DySER compared to OOO processors [15].
Despite their ability to fuse instructions together, CGRAs fail to achieve ASIC-
like efficiency. This is understandable because CGRAs run as part of the execution phase 
of a general processor, and thus still have the overhead associated with general 
processors. It is also difficult to find which traces should be fused together that will yield 
energy savings. In order to save energy without sacrificing performance, a CGRA 
configuration needs to be executed many times to offset the cost of configuring.  
3.2 Spatially Programmable Architectures 
 Spatially programmable architectures (SPA) are closely related to CGRAs. 
CGRAs are reconfigurable fabrics that run as a step or stage in a general processor 
pipeline.  Spatial Architectures exist as processors unto themselves, or as coprocessors, 
and may be much larger than typical CGRAs. In spatial architectures, data is fed into the 
computational fabric made up of programmable elements (PE) interconnected by a switch 
network. 
Systolic arrays are energy-conscious spatial architectures that became popular
in the late 1970s [28]. They have largely been used for executing BLAS and also have
applications in digital signal processing [8]. They consist of a fabric of PEs
interconnected by a mesh of buses. At every cycle they pump data in and out of the PEs in
a systolic fashion. Each PE executes independently of the others to compute a partial result
and stores that result as part of its state before sending it downstream.
Scheduling instructions onto an SPA fabric varies among the architectures. Some 
SPAs utilize sequencers to overload a PE with multiple instructions. PipeRench takes 
advantage of this to allow for a virtualized fabric that can execute kernels that would 
otherwise not fit in the given space [7]. This is done by sequencing programmable PEs to 
execute different operations in a schedule. Other SPAs take a systolic approach and allow 
the PEs to have guards that alter the operation and data flow based on the inputs received 
by the PE [29]. A simple approach is to schedule a fixed static configuration that maps a 
single operation to a PE for the life of the configuration. This requires a larger fabric 
since PEs cannot be virtualized or sequenced to execute multiple operations, but it does 
simplify scheduling.  
3.3 SPADE 
 When an SPA is restricted to a specific domain it becomes a spatially programmable
architecture for domain efficiency (SPADE). Different implementations of SPADE can
be optimized for different domains. In this work we focus on the ISP domain by using the ISP
abstraction described in Section 2.4 General ISP Abstraction. The basic SPADE block is
made up of a line buffer and stencil registers that feed into an SPA fabric which executes 
the inner loop of ISP kernels.  
Kernels are mapped to the SPA fabric using a compiler to place and route the 
operations. The compiled kernels are then executed many times with the data being 
streamed in from the line buffer and stencil register. The fabric is made up of 
programmable elements that execute the operations and programmable switches for 
routing data paths. This fabric is similar to those found in other SPAs, except instruction 
sequencing is removed, all hardware elements are coarse-grained, and the fabric is large 
(thousands of elements instead of hundreds of elements).  
3.3.1 Line Buffer and Stencil Registers 
SPADE is a programmable architecture that focuses on shifting pixels from an 
image into a computational fabric programmed by the user using the compiler. An image 
is streamed in through FIFO structures and buffered in line buffers. The size of the line 
buffer is determined by the height of the stencil window and the width of the region of 
the image the stencil window slides over. The line buffer feeds into a stencil window that 
selects the set of pixels that will then be shifted into the SPA fabric via a set of shift 
registers. After the pixels are shifted into the fabric, the stencil is shifted over to the next 
set of pixels. For each stencil shift, a new pixel is added to the line buffer as an old one 
leaves. The shifted in pixels are moved into the SPA fabric where the kernel is applied. 
The output pixel is then stored back in a memory where the output image is located.  
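The dependence of the buffer size on the stencil height and region width can be made concrete. The sketch below is an assumption about one common sizing and is not taken from the SPADE design: the line buffer holds the most recent (stencil height - 1) full rows of the region, and the stencil registers hold one full window. The function names and the 16-bit default pixel width are illustrative.

```python
def line_buffer_pixels(stencil_height, region_width):
    """Pixels held by the line buffer: the stencil_height - 1 most recent rows."""
    if stencil_height < 1 or region_width < 1:
        raise ValueError("stencil height and region width must be positive")
    return (stencil_height - 1) * region_width


def stencil_register_pixels(stencil_height, stencil_width):
    """Pixels held by the stencil (shift) registers: one full window."""
    if stencil_height < 1 or stencil_width < 1:
        raise ValueError("stencil dimensions must be positive")
    return stencil_height * stencil_width


def storage_bits(pixels, pixel_bits=16):
    """Storage needed for `pixels` pixels at an assumed pixel width."""
    return pixels * pixel_bits
```

Under these assumptions a 3x3 stencil over a 640-pixel-wide region needs a 1280-pixel line buffer and 9 stencil registers.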
3.3.2 Programmable Elements 
 The programmable element (PE) is a minimal processor that is the workhorse of 
SPADE. Each PE supports a set of basic operations that is programmed at configuration 
time. PEs are comparable to LUTs in an FPGA, but instead of operating at 1-bit granularity, PEs
are coarse-grained, such as 8-bit or 16-bit, which is useful for ISP applications since pixels are
uniform, coarse-grained data types. Each PE can be programmed individually to execute
an operation that its hardware can support. The PE will only execute this one operation 
for the duration of the configuration. PE inputs are streamed from other PEs, shift 
registers, or top level constant coefficients through the reserved input ports.  
 
 
Fig. 7: Block diagram of PE. 
 
PEs are interconnected using switches that are configured to route data paths 
between PEs. A PE may attempt to route a signal to a receiving PE that is not a direct 
neighbor (i.e. not directly interconnected by a single switch). In this case, the sending PE 
must use neighboring PEs to reach its intended target. This comes at the cost of 
consuming an entire PE for a bypass. Such PEs are called bypass PEs. Bypass PEs may
be required either by the structure of the network itself or because a resource on the
preferred path is already reserved, so that an alternate path must be taken.
All PE operations execute synchronously in a single cycle to make scheduling 
simple. Every PE is also pipelined so that the throughput is a single operation per cycle. 
There is no need for buffers between PEs because of these restrictions. It also implies that 
if SPADE inputs and outputs a single pixel then the throughput is one pixel per cycle, 
since the entire SPADE flow is deeply pipelined. 
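The throughput claim can be checked with a small cycle-level model. The Python sketch below is illustrative only: it assumes every wave is a single register stage, one pixel enters per cycle, and no stalls occur. The name `simulate_pipeline` is hypothetical.

```python
def simulate_pipeline(num_pixels, depth):
    """Return the cycle on which each pixel's result leaves a `depth`-stage pipeline."""
    if depth < 1:
        raise ValueError("pipeline depth must be at least 1")
    stages = [None] * depth  # stages[i] holds the index of the pixel in stage i
    completion = []
    next_pixel = 0
    cycle = 0
    while len(completion) < num_pixels:
        cycle += 1
        incoming = next_pixel if next_pixel < num_pixels else None
        next_pixel += 1
        # Clock edge: every stage latches the value of the stage before it.
        stages = [incoming] + stages[:-1]
        if stages[-1] is not None:
            completion.append(cycle)
    return completion
```

After an initial latency of `depth` cycles the model emits one result per cycle, so `num_pixels` pixels take `depth + num_pixels - 1` cycles in total.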
3.3.3 Switches 
 Switches control the flow of data along the fabric. A switch has incoming ports 
and outbound ports that are interconnected through a large set of multiplexors, as shown 
in Figure 8. As described by their names, data can only flow into the incoming ports and 
out of the outbound ports. Routing is done by configuring the multiplexors during the 
configuration phase.  
The energy and area cost of a switch grows as $N^2$, where $N$ is the bandwidth (number of
ports) of the switch, as shown in Figure 9. Therefore, there is great benefit to having local
communication over global. The largest switch would be fully interconnected to all PEs
in the fabric, making compilation straightforward but at the cost of low energy efficiency.
In contrast, a network of small switches complicates
compilation but increases energy and area efficiency. Switches are generally
connected to PEs but may be connected to other switches if desired.
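The benefit of local communication follows directly from the quadratic growth. The sketch below uses a relative cost model with a hypothetical unit constant (it does not reproduce the measured values of Fig. 9) and compares one large switch with several small ones that provide the same total number of ports.

```python
def switch_cost(ports, unit_cost=1.0):
    """Relative energy or area cost of one switch, assuming cost grows as ports**2."""
    if ports < 0:
        raise ValueError("port count must be non-negative")
    return unit_cost * ports ** 2


def fabric_switch_cost(num_switches, ports_per_switch, unit_cost=1.0):
    """Total relative cost of `num_switches` identical switches."""
    return num_switches * switch_cost(ports_per_switch, unit_cost)


# Same total of 64 ports, arranged globally or locally.
one_global_switch = fabric_switch_cost(1, 64)   # 64**2
eight_local_switches = fabric_switch_cost(8, 8)  # 8 * 8**2
```

Under this model the eight 8-port switches cost one eighth of the single 64-port switch, at the price of restricting which PEs can reach each other directly.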
 
Fig. 8: Block diagram of switch. 
 
 

Fig. 9: Energy (a) and Area (b) cost of switch. Panel (a), "Energy Cost of Switch", plots
energy (pJ) against the number of switch ports; panel (b), "Area Cost of Switch", plots
area (μm2) against the number of switch ports.
3.3.4 Topologies 
 The organization of the PEs and switches in the SPA fabric is called the topology. 
Topologies may stand alone as a single general fabric or be made up of sub-topologies 
that can execute various specific tasks. For example, a topology may be made for general 
execution while another is specifically designed for convolution. Multiple topologies can 
be combined to form a single class of topology that takes advantage of the properties of 
each individual topology. 
For this work we designed a new, simple SPA topology based on the PipeRench
topology [7], called the wave pipeline (see Fig. 10). In this topology there are N columns of
M PEs distributed into waves of a pipeline with depth N. Waves are a collection of PEs 
that are analogous to stripes in other SPAs. PEs in the same wave cannot communicate 
with each other, and waves only communicate with the wave one step further down the 
pipeline. If a PE needs to send data down to another PE two stages down, then it will 
have to use bypass PEs to route the signal there. The switches are fully interconnected so 
any input port is able to route to any output port. This topology was chosen for this work 
mainly because it seemed it would be simple to compile ISP kernels onto it. 
 
Fig. 10: Wave Pipeline Topology. Data flows from left to right. 
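The connectivity rules of the wave pipeline can be captured in a few lines. The following Python sketch is a hypothetical model of the topology described above (the class and method names are not part of SPADE): waves are indexed from zero in the direction of data flow, and a signal that skips waves consumes one bypass PE in every wave it crosses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WavePipeline:
    num_waves: int     # N: pipeline depth (number of columns)
    pes_per_wave: int  # M: PEs in each wave

    def can_route_directly(self, src_wave, dst_wave):
        """A wave only communicates with the wave one step further down the pipeline."""
        return dst_wave == src_wave + 1

    def bypass_pes_needed(self, src_wave, dst_wave):
        """Bypass PEs consumed to carry a signal from src_wave to dst_wave."""
        if not 0 <= src_wave < dst_wave < self.num_waves:
            raise ValueError("data may only flow forward within the fabric")
        return dst_wave - src_wave - 1
```

For example, sending a result from wave 0 to wave 2 requires one bypass PE in wave 1, while communication within a wave or backwards is rejected.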
 The wave pipeline topology is similar to other SPA and CGRA topologies, such
as DySER [15], PipeRench [7], systolic arrays [28], DynaSpAM [30], the Multi-Granular
FPGA [31], and the Triggered Instruction Architecture [29]. However, key parameters
differentiate these architectures from one another: how the fabric is
configured, how data is routed between PEs, the size of the PEs, and the scope of targeted
applications. Table 1 classifies SPAs and related architectures according to these
common parameters. Along with the SPA architectures, FPGAs and ASICs are
included, with LUTs and standard cells taking the place of PEs.
Table 1: SPA Architectures compared to the wave pipeline.

| Architecture | Fabric Configuration | PE Communication* | Fabric Complexity | Granularity (bits) | PE Register File Size | Scope |
|---|---|---|---|---|---|---|
| ASIC | Fixed | Fixed | High | Mixed | N/A | Fixed Function |
| FPGA | Static | Local & Global | High | 1 | 1 | General Purpose |
| DySER [15] | Trace Schedule | Neighbor PEs | Medium | 32-64 | 1 | General-Purpose |
| PipeRench [7] | Virtual Pipeline | Direct PE-PE, Inter-Stripe | Low | 2-32 | 2 to 16 | General-Purpose |
| Systolic Array | Pure Systolic | Neighbor PEs | Low | 32-64 | 300 | Domain Specific |
| DynaSpAM [30] | Trace Schedule | Intra-Stripe, Inter-Stripe, Global Stripe | Low | 32-64 | 2 | General-Purpose |
| Multi Granularity FPGA [9] | Static | Local & Global | Medium-High | Mixed | 1 | Domain (Communication) |
| Triggered Instruction [29] | Triggered States | Blocked | Medium | 32 | 78 | Parallelism |
| SPA Wave Pipeline | Static | Feed-Forward Wave | Low | 16 | 1 | Domain (ISP) |

Note: * The rules for PE communication. Communication among neighbors indicates that PEs are able to
communicate with any other PE in the array that is one unit away from them. For blocked communication,
all PEs are interconnected within a block and blocks are able to communicate with other blocks.
3.3.5 PE Heterogeneity
The simplest fabric is a homogeneous fabric, but this would require PEs to
support many operations. Having large PEs that support every operation comes at a
tremendous opportunity cost. Some area and energy cost is associated with every
operation a PE supports. Additionally, since each PE can only perform one operation per
configuration, the rest of the PE functionality is unusable. Instead of a homogeneous fabric of
PEs that can execute any operation, it is more effective to have a heterogeneous fabric of
PEs that can only execute a few operations. Heterogeneous fabrics reduce PEs’
opportunity cost because there is less overhead when a PE is utilized. The maximum
efficiency of heterogeneous PEs occurs when the overhead penalty is negligible
compared to the value of the operation being configured.
A heterogeneous fabric of PEs is more energy and area efficient, but at the
cost of lower utilization and higher compiler complexity. In the case of homogeneous PEs, placing and
routing are simple tasks. With heterogeneous PEs an operation may not be able to be
placed in its ideal location and instead has to be placed further downstream in the
pipeline.
A profile of the sample applications showed that the most common operations
were ALU operations, followed by multiplication and reduction operations. Fig. 11 shows
the breakdown of operations for each benchmark application used. To reflect the
distribution of operations onto hardware, most of the PEs in a heterogeneous fabric would 
support simple ALU operations and the remaining would support more expensive 
operations. By creating a topology that reflects the profile of applications, the utilization 
penalty is reduced. The savings in area and energy from the reduced overhead in 
supporting so many expensive operations would likely outweigh the cost of these delays.  
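One way to derive such a fabric from a profile is to allocate PE types in proportion to the operation counts. The Python sketch below is an illustration only: the function name is hypothetical, the largest-remainder rounding is an assumed design choice, and the counts in the usage note are made-up numbers rather than the measured profile of the benchmarks.

```python
def allocate_pes(op_counts, total_pes):
    """Split `total_pes` among PE types in proportion to profiled operation counts.

    Uses largest-remainder rounding so the allocation always sums to `total_pes`.
    """
    total_ops = sum(op_counts.values())
    if total_ops <= 0 or total_pes < 0:
        raise ValueError("need a non-empty profile and a non-negative PE budget")
    allocation = {}
    remainders = {}
    for kind, count in op_counts.items():
        # Integer share and what is left over after rounding down.
        allocation[kind], remainders[kind] = divmod(total_pes * count, total_ops)
    leftover = total_pes - sum(allocation.values())
    # Hand the remaining PEs to the types that lost the most to rounding.
    for kind in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        allocation[kind] += 1
    return allocation
```

With a hypothetical profile of 60 ALU, 30 multiply, and 10 reduction operations and a budget of 20 PEs, the sketch yields 12 ALU PEs, 6 multiply PEs, and 2 reduction PEs. A real design would also need to guarantee at least one PE for every operation type that occurs.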

Fig. 11: Operation profile of Canny edge detection, Harris corner detection, FAST feature
detection and convolution. Operations are grouped by the hardware that can execute them.
The chart shows, for Conv3x3, Canny, Harris, Fast, and Total, the share of operations
(0% to 100%) that are Reduction, Multiply, and ALU operations.

CHAPTER 4 
THE SPADE KERNEL COMPILER 
Energy efficient compute requires a new programming model to write programs 
for SPA fabrics. It is desirable that developers be able to write code that can be compiled 
down to a hardware configuration and executed optimally for a given architecture. 
Domain specific languages (DSL) are restricted high level languages designed for energy 
efficient compute. They produce intermediate representations of the code similar to 
assembly code in general purpose languages. But the intermediate representation is 
architecture agnostic and so a compiler is required to link between DSL and SPA, as 
shown in Fig. 8. The contribution of this work is a SPADE backend compiler that 
compiles energy efficient programs by consuming the intermediate representation of a 
DSL and transforming it to an SPA configuration that can be used for the SPADE 
architecture.  
 
Fig. 8: SPADE compiler flow. 
Previous compilers for specialized architectures have been successful by changing 
the programming model. Similarly, the programming model must change for SPA. This 
is typically done for SPAs by transforming the computations into a directed acyclic graph 
so that the computations then have spatial context. These graphs of operations then need 
to be mapped to available hardware functions, placed onto PEs, and routed by 
configuring switches. Beyond compiling for correctness, cost is also a major 
consideration. Compilers for SPA target minimum energy and area configurations 
without significantly raising the complexity of the compiler. A compiler for SPADE is 
described below to show how these issues can be addressed for a wave pipeline topology. 
By making certain assumptions about the topology, it is shown that a low complexity 
compiler can create minimum energy and area configurations.  
4.1 Compilers for Specialized Architectures 
 When traditional programming methods do not fit the architectural model, new 
programming models must be used to effectively compile code to specialized 
architectures. Compiling programs onto specialized architectures is difficult because it is 
not always clear how to assign physical resources to program operations written in 
generic languages such as C, where parallelism, locality, and communication between 
functions may not be apparent. Therefore, restrictions are placed on the programming 
model to make the relationship between code and architecture clear.  
GPUs rely on programming models, such as OpenGL [32], that differ from classical
general purpose programming models in order to execute parallelizable
programs effectively. These programming models abstract the architectural details of a GPU away
from the developer, fostering innovation and rapid development. Developers can then
focus on creating image and graphics pipelines while being assured that they will run
optimally on hardware. OpenGL was adopted and generalized by NVIDIA to create
CUDA [33], which creates more abstractions to execute applications beyond graphics 
processing. Using regular C constructs, a developer can seamlessly integrate their general 
parallelizable code onto GPUs using CUDA.  
 VLIW-DSP architectures change the programming model by packaging libraries 
of optimized code that are common to signal processing. Such libraries include matrix 
arithmetic, convolution, polynomial evaluation, and FIR filtering. The Convolution 
Engine [34] uses libraries to program map and reduction units, as well as route data flow. 
However, beyond this limited number of libraries, VLIW-DSPs are fairly difficult to
program due to their specialized execution units and data flow structures.
4.2 SPA Programming Model 
Programming SPAs is difficult, as it is for other specialized architectures, because
it is not clear how to transform programs in time into programs in space. A program in
time is executed as a set of instructions that are sequenced through an execution unit 
connected to memory (see Fig. 9 (a)). In contrast, a program in space unrolls 
computations along the fabric of many execution units with interconnects that control the 
flow of data (see Fig. 9 (b)). 
FPGAs directly program onto spatial fabrics using Verilog or high level synthesis 
(HLS). The low level complexity of Verilog makes it difficult to program; therefore, HLS 
is used to ease the development cycle. HLS raises the level of abstraction of hardware design by
leveraging the expressiveness of a high level language while maintaining Verilog’s
programming model of concurrent execution. However, it is difficult to predict its
behavior without expertise in HLS design, and manual optimizations are required
regardless.
 
 
(a) 
 
(b) 
Fig. 9: Programming in space vs. programming in time. 
There needs to be a less complex programming model that restricts the expression 
of code to match the structure and data flow of SPA. GRAMPS is a programming model 
that generalizes graphics pipelines to expose the connectivity between processing stages 
in a graphics pipeline in a similar way that SPA kernels may communicate to each other 
[31]. Software defines the type of processing kernels being executed and how data is 
exchanged through queue, buffer, and thread abstractions. SPA can use a similar model
of kernels that exchange data using streams.
SPA can also leverage programming models from high level domain specific 
languages (DSL) and express programs in a domain specific context [12], [10], [11]. 
StreamIt, for example, is a language that enforces interfaces between ISP pipeline stages
for common connections between kernels, such as pipelines, splits, joins and feedback 
loops [12]. In DSLs, kernel functions are usually modeled as computation graphs that 
describe how operations are related to each other spatially. DSLs also provide front end 
compilers to transform their high level representation to intermediate representation that 
encodes the computation graphs.  
4.3 Compiling to SPA 
A backend compiler is required to transform the computation graph of a DSL into 
a hardware configuration. To do this a compiler must map, place, and route the compute 
graph onto a spatial fabric. The compiler must therefore be aware of the topology and 
parameters of the fabric so that it can make decisions about mapping, placement, and 
routing. 
In further detail, mapping is a process that checks whether each operation in a compute
graph can be executed by a PE on the fabric. If not, then the operation must be
broken up into simpler supported operations. There are three common situations where
this is required: first, when a complex operation is not supported directly but can be built
from multiple basic operations; second, when an operation with many inputs is supported but there are no
functional units with enough input ports to execute it; and third, when the precision of the
operation exceeds the precision of the PE. In the first scenario an operation such as
multiply-accumulate (MAC) may instead be mapped to a separate PE configured for
multiply and a PE configured for accumulate. In the second scenario an operation such as
summation is supported but there is not a PE that can support the number of inputs
required (e.g., the computation graph contains a 6-input summation operation, but
only 3-input summations are supported). The operation is broken up into multiple smaller
functions and merged with a reduction tree to produce a single result.
Finally, in the third scenario, an operation with higher precision than allowed in hardware
is split into multiple lower-precision operations. Mapping is usually
done before placing and routing as most of the mappings are obvious. But as resources
are used up, mapping may be interleaved with placing and routing as needed.
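The second mapping scenario can be sketched as follows. This Python example is a hypothetical illustration, not the SPADE mapper: it assumes the only constraint is the maximum number of PE input ports and names intermediate results `sum0`, `sum1`, and so on.

```python
def build_reduction_tree(operands, max_inputs):
    """Break a wide summation into a tree of sums with at most `max_inputs` inputs each.

    Returns a list of (result_name, operand_tuple) pairs in dependency order.
    """
    if max_inputs < 2:
        raise ValueError("a summation PE needs at least two inputs")
    ops = []
    level = list(operands)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), max_inputs):
            group = level[i:i + max_inputs]
            if len(group) == 1:
                # A lone operand is passed on unchanged to the next tree level.
                next_level.append(group[0])
            else:
                name = f"sum{len(ops)}"
                ops.append((name, tuple(group)))
                next_level.append(name)
        level = next_level
    return ops
```

For the 6-input summation from the text with 3-input PEs, the sketch produces two 3-input sums followed by one 2-input sum that merges them. An operand that is passed on unchanged would, on the wave pipeline, still have to travel through a bypass PE.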
Placing involves assigning each operation on the computational graph to a 
physical PE. Two factors are considered when placing: first, what operations are 
supported by which PEs; and second, where should operations be placed relative to each 
other. If all the PEs on a fabric are homogenous then the first factor does not need to be 
considered. As for the second factor, virtual PEs that are interdependent should be placed 
as close as possible down the same data path. 
The flow of data in the hardware fabric is determined by the switches during the
routing phase. PEs are connected by interconnects, but switches control the flow of data
to the interconnects. PEs that are attached to the same switch are neighboring PEs and
signals can be routed between them directly. If two interdependent PEs are not attached
to the same switch then the signal must hop across neighboring PEs, which are configured as
bypass PEs, with the switches configured to complete the route. This comes at an
opportunity cost since PEs used for bypassing are unavailable for execution.
FPGA and ASIC compilers follow a similar flow. For FPGAs, a compiler must
map structures and functions to LUTs, DSPs, and memory. Then these mappings are
placed on the spatial fabric and routed by switching a large, complex hierarchy of
interconnects. For ASICs, structures and functions are mapped to standard circuit cells.
Instead of placing the cells onto a fixed fabric, they are placed in a fixed, open area and
then routed using vias and wires. These turn out to be very complex and difficult
processes, due to a large design space and heterogeneous blocks.
It is easier to map, place, and route onto SPA fabrics because of their simplified 
characteristics. They are coarse grained fabrics, as opposed to the fine or mixed 
granularity fabrics found in FPGA and ASIC, which reduces the complexity of wires and 
computational resources. Interconnects are localized by smaller switches instead of 
allowing for global communication between nodes. Further assumptions can be made, as 
well, about the flow and topology of an SPA depending on the needs of the targeted 
applications. By restricting the fabric, the configuration space shrinks to a manageable 
size. 
The configurations created by a compiler should be energy efficient. Energy is a 
key cost of concern for energy efficient computing and so it is important that compilers 
produce optimal or near optimal energy efficient solutions. The SPIRAL framework does 
this by extending a power model to empirically solve for efficient compilations [35]. 
Nowatzki et al.[14] use constraint centric integer linear programming solvers by 
describing the fabric limitations and cost functions as linear programming constraints. 
However, expressing cost and constraints as linear programming constraints is not always 
straightforward or possible.  
A simple model for estimating configuration energy cost is to measure the length
of the data paths. The lower bound for energy cost in these terms is reached when all operations
are placed onto PEs without any unnecessary bypasses between them. Some bypasses
are necessary in order to have correct timing of signals, and are therefore distinguished
from unnecessary bypasses. The upper bound on energy cost is reached when all of the PEs in a
topology are used, but only some are programmed to execute operations and the rest
bypass signals.
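This cost model can be sketched as a count of active PEs. The Python example below is a simplified illustration under stated assumptions: the fabric is feed-forward with one wave per pipeline stage, every operation occupies one PE, and a producer with several consumers shares a single bypass chain among them. The function name and input format are hypothetical.

```python
def count_active_pes(wave_of, edges):
    """Return (operation PEs, bypass PEs) for a placed compute graph.

    `wave_of` maps each operation to its wave; `edges` lists (producer, consumer) pairs.
    """
    farthest = {}  # producer -> bypass PEs needed to reach its most distant consumer
    for src, dst in edges:
        span = wave_of[dst] - wave_of[src]
        if span < 1:
            raise ValueError("consumers must sit in a later wave than their producers")
        farthest[src] = max(farthest.get(src, 0), span - 1)
    return len(wave_of), sum(farthest.values())
```

For a placement with `a` and `b` in wave 0, `c` in wave 1, and `d` in wave 2, where `d` reads both `c` and `a`, the model reports four operation PEs and one bypass PE that carries `a` across wave 1.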
For general topologies solving for the minimum energy cost begins to look like a 
minimum spanning tree problem with each extra hop adding to the weight between 
connecting nodes. For the wave pipeline topology from Section 3.3.4 Topologies, 
bypasses only occur horizontally across the stages. The problem is much simpler as there 
is only one direction that optimization can occur. 
Area efficiency is another concern for compilers. Area is an opportunity cost. 
Every bit of space used by an architecture is now unavailable for other processors and 
accelerators. Therefore, it is important to use area efficiently. The compiler reduces the 
required area by utilizing the fabric of PEs efficiently. This is similar to optimizing for 
energy cost because higher utilization means solving for the least number of hops 
between operations. Therefore, this problem is likely solved using the same or similar 
methods for reducing energy costs. 
 Compilers are also concerned about their own complexity. One of the major
issues that hinders FPGAs is the time required to compile a program into a configuration
stream. This is due in part to the complexity of the large networks in FPGAs as well as the
general scope of their programming model. An SPA compiler will likely have low complexity
if the fabric is constrained for domain efficiency because the data flow will be consistent,
removing a complicating factor from placing and routing.
4.4 SPADE Backend Compiler 
We created a compiler framework for the SPADE abstraction described in Section 3.3
SPADE. The SPADE abstraction is used because energy efficient computations are
limited to a few domains. For this work, the framework focuses on the ISP domain since
it contains many energy efficient computations that are socially significant.
Therefore, DSLs that focus on ISP, such as Darkroom, will be integrated into the
compiler framework.
The framework of this compiler is detailed in Fig. 10. The compiler requires two 
inputs: the intermediate representation and fabric hardware description. The intermediate 
representation is specific to its parent DSL but follows the same general concepts. It 
describes the registers storing data as well as the operations operating on those registers. 
These operations and registers are modeled as a directed acyclic graph. This work uses 
the Darkroom DSL [10] and consequently the compiler consumes the intermediate 
representation created by Darkroom called the data path description assembler (DPDA). 
A DPDA parser consumes the DPDA code to form it into an object that can then be used 
for compiling. 
 The fabric hardware description is a file or object that describes the fabric 
topology parameters and configuration. From this, virtual hardware can be built for 
placing and routing. As the compiler places and routes elements it can ask questions of 
the hardware to make decisions. The virtual hardware is made up of three atomic units: 
PEs, switches, and nets. The PEs and switches are created and are then interconnected 
with nets, which ultimately decide the location of the PEs and switches in the topology. 
PEs and switches are also given IDs to easily identify their location. For the wave 
pipeline topology PEs are numbered from top to bottom, and then left to right. Switches 
are numbered from left to right. All numbering is zero indexed. 
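The numbering scheme can be sketched as a pair of index conversions. The Python example below assumes that each wave is one column of the fabric and that a single switch sits on each boundary between adjacent waves; both are assumptions for illustration, as are the function names.

```python
def pe_id(wave, row, pes_per_wave):
    """Zero-indexed PE ID: top to bottom within a wave, then left to right across waves."""
    if wave < 0 or not 0 <= row < pes_per_wave:
        raise ValueError("PE position is outside the fabric")
    return wave * pes_per_wave + row


def pe_position(identifier, pes_per_wave):
    """Inverse of pe_id: returns (wave, row)."""
    return divmod(identifier, pes_per_wave)


def switch_id_between(wave):
    """Zero-indexed ID of the switch connecting `wave` to `wave + 1` (assumed layout)."""
    if wave < 0:
        raise ValueError("wave index must be non-negative")
    return wave
```

With four PEs per wave, PE 6 is the third PE from the top of the second wave, and its outputs pass through switch 1.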
 
 Fig. 10: Compiler framework block diagram. 
 The compiler then prepares the input for actual compilation. This is done by 
flattening all multidimensional registers, resolving all register aliases, and identifying 
each register as an input, coefficient, or data path signal. The new data input is then 
stored into pseudo-PEs in preparation for compilation. 
 The SPADE compiler schedules applications onto the SPADE fabric through 
three steps: 
1. Mapping unsupported complex operations to atomic operations. 
2. Placing pseudo-PEs on the fabric by programming PEs. 
3. Routing data between PEs by programming switches. 
 
4.5 A Fast Optimal Compiler for the Wave Pipeline 
To prove the feasibility of an SPA compiler, an implementation of the compiler 
was created using the framework discussed above for the wave pipeline topology. The 
compiler consumes Darkroom DPDA code and creates a JSON [36] file that can then be 
parsed into a machine specific binary configuration. Several assumptions and restrictions 
are placed on the topology to allow for rapid development. These assumptions are: 
1. PEs are homogenous and support all operations. 
2. All ports of switches are fully interconnected. 
3. All operations are 16 bit wordlength. 
4. The topology fabric is as large as needed to support a kernel. 
5. All operations execute in a single cycle. 
 
Some of these assumptions may be too restrictive or inefficient, but they suit the needs of this
work since the goal is to prove compiler feasibility. Future work would explore a more
generalized approach that could be applied to more realistic settings.
 The assumptions and restrictions on the wave topology listed above reduce the
energy and area optimization problem to a minimum depth assignment problem. This is
because the wave pipeline topology only allows communication between adjacent waves in a
feed forward fashion, so placing a PE is essentially determining its depth in the pipeline.
The placement of PEs using minimum depth assignment is the most compact
placement allowed in the wave pipeline fabric.
The compiler uses a breadth first search (BFS) algorithm [37] on the graph parsed 
from the DPDA. The input operation nodes are the first to be processed by the algorithm, 
so the ordering will always be correct. As operations are assigned a depth, they are 
placed onto PEs on the hardware and routed. If two neighboring operations are assigned 
to PEs only one stage apart, they are immediately routed together by the switch 
between them. If the PEs are more than one stage apart, the PE closer to the front 
of the pipeline propagates its output along the pipeline through bypass PEs until the 
two PEs are one stage apart. The pseudo code in Fig. 11 shows how the routing works 
with the BFS algorithm. Placing and routing occur in a single loop, so the code executes 
in linear time. Because depth placement and routing are decided locally, this 
algorithm is called greedy. 
G = compute graph
V = vertex of compute graph
H = hardware graph

BFS = Breadth-First-Iterator(G)
while V = BFS.next():
    V.wave = 1 + max wave of V.Parents    # one stage after the deepest parent
    place V on H

    foreach P in V.Parents:
        if P.wave != (V.wave - 1):
            Bypass(P)    # forward P through bypass PEs up to wave V.wave - 1
        route P to V on H
Fig. 11: Pseudo Code for greedy compiler algorithm using breadth first search. 
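The greedy pass in Fig. 11 can be sketched in Python as follows. This is a minimal illustration, not the thesis implementation: the function name, the dictionary-based graph, and the returned tuples are hypothetical. It assumes that each operation is placed one wave after its deepest parent, that input operations sit at wave 0, that a node is visited only after all of its parents have been placed (a Kahn-style breadth-first order), and that a bypass chain is shared by every consumer of the same producer.

```python
from collections import deque

def greedy_place_and_route(parents):
    """Assign a wave (pipeline depth) to every operation and list bypass PEs.

    `parents` maps each operation name to the list of operations feeding it;
    input operations map to an empty list.
    """
    # Build child lists and count the unplaced parents of each operation.
    children = {op: [] for op in parents}
    pending = {op: len(ps) for op, ps in parents.items()}
    for op, ps in parents.items():
        for p in ps:
            children[p].append(op)

    wave, routes, bypasses = {}, [], set()
    queue = deque(op for op, ps in parents.items() if not ps)
    while queue:
        v = queue.popleft()
        # Minimum depth: one stage after the deepest parent (inputs get wave 0).
        wave[v] = 1 + max((wave[p] for p in parents[v]), default=-1)
        for p in parents[v]:
            # Forward p through bypass PEs until it is one stage before v.
            for w in range(wave[p] + 1, wave[v]):
                bypasses.add((p, w))
            routes.append((p, v))
        for c in children[v]:
            pending[c] -= 1
            if pending[c] == 0:  # every parent of c has now been placed
                queue.append(c)
    return wave, sorted(bypasses), routes

# Hypothetical kernel: s = (a * b) + a
graph = {"a": [], "b": [], "m": ["a", "b"], "s": ["m", "a"]}
wave, bypasses, routes = greedy_place_and_route(graph)
print(wave)      # {'a': 0, 'b': 0, 'm': 1, 's': 2}
print(bypasses)  # [('a', 1)]
```

In the example, input a feeds both the multiply at wave 1 and the sum at wave 2, so one bypass PE at wave 1 carries a forward to the sum.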
It is important that compiling to SPADE is quick, because this allows just-in-time 
compilation onto any implementation of SPADE. A single SPADE chip can then 
replace multiple ASICs, since it can switch between configurations quickly, 
starting from source code. A low complexity compiler also lets developers compile 
interactively and test their code on physical hardware. Code distribution becomes easy 
as well, because source code can be compiled on the client, even on low 
performance devices. This is more useful than distributing pre-compiled configuration 
byte streams because the distributed code is platform agnostic and portable. 
To understand the cost of heterogeneous PEs, the compiler was modified to restrict 
which stages can execute multiply and reduction operations. These operations were 
supported only periodically among the stages, and the resulting overhead connections 
were counted. Programs that perform a large number of 
arithmetic operations, such as Canny and Fast, are affected very little by these 
restrictions. Programs that rely upon more expensive operations can suffer heavily from 
the connection overhead introduced by the need to delay those operations until they reach 
a supported stage.  
 
Despite the extra constraints applied to the topology as noted above, the 
complexity of the compiler does not increase. For heterogeneous PEs the compiler needs 
a priori knowledge of which PEs support which functions; it can then simply ask 
whether a PE exists that supports the operation being placed. If the 
operation is supported it is placed; otherwise the operation is 
moved down to a further stage. The updated pseudo code for this type of compiler is shown 
in Fig. 12, and it can be seen that the added check does not increase complexity. 
G = compute graph
V = vertex of compute graph
H = hardware graph

BFS = Breadth-First-Iterator(G)
while V = BFS.next():
    V.wave = 1 + max wave of V.Parents    # one stage after the deepest parent
    if V.Op not available at V.wave:
        V.wave = next wave that supports V.Op
    place V on H

    foreach P in V.Parents:
        if P.wave != (V.wave - 1):
            Bypass(P)    # forward P through bypass PEs up to wave V.wave - 1
        route P to V on H
Fig. 12: Updated pseudo code for greedy compiler with a heterogeneous fabric. 
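The extra availability check in Fig. 12 can be sketched as a constant-time wave assignment. The names below are hypothetical, and the sketch assumes zero-indexed waves with restricted operations supported at waves 0, S, 2S, and so on, where S is passed as `period`.

```python
def next_supported_wave(wave, period):
    """Smallest wave >= `wave` that is a multiple of `period`."""
    return -(-wave // period) * period  # ceiling division, then scale back up

def assign_wave(op, parent_waves, restricted_ops, period):
    """Wave for `op`, given the waves of its already placed parents."""
    wave = 1 + max(parent_waves, default=-1)      # minimum-depth wave
    if op in restricted_ops:
        wave = next_supported_wave(wave, period)  # delay to a supporting stage
    return wave

restricted = {"mul", "sum"}
print(assign_wave("add", [0, 0], restricted, 4))  # 1
print(assign_wave("mul", [0, 0], restricted, 4))  # 4
```

In the second call the multiply is pushed from wave 1 to wave 4, so each of its parents must be bypassed through three extra stages. The lookup itself is a single arithmetic step, so the per-operation work stays constant.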
CHAPTER 5 
METHODOLOGY 
5.1 Compiler Implementation 
A fully functional compiler was created to prove the feasibility of a SPADE 
compiler. It was written in Python 2.7 for easy, rapid development. The compiler was 
built to receive DPDA code produced by the front end Darkroom compiler. The code for 
the compiler is available on GitHub [38]. 
The compiler must know the specifics of the target architecture a priori; these are 
provided by a hardware description configuration. For the wave pipeline, the hardware 
description configuration is a file that sets values for a handful of parameters: 
height, depth, and number of inputs to PEs.  
The compiler produces a JSON file that contains the configuration of every PE 
and switch. The JSON output creates a flexible format that can then be used to draw a 
picture of the programmed hardware, extract statistics, or create a binary encoding of the 
configuration to be sent to the actual hardware. 
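As an illustration of how such a file could be produced and consumed, the sketch below writes and re-reads a small configuration with Python's standard `json` module. The schema is hypothetical: the key names (`hardware`, `pes`, `switches`, `pe_inputs`, and the route fields) are assumptions made for this example and are not taken from the SPADE compiler.

```python
import json
from collections import Counter

# Hypothetical hardware description: the three parameters named in the text.
hardware = {"height": 3, "depth": 4, "pe_inputs": 2}

# Hypothetical compiled configuration: one entry per programmed PE and switch.
config = {
    "hardware": hardware,
    "pes": [
        {"wave": 0, "level": 0, "op": "input"},
        {"wave": 0, "level": 1, "op": "input"},
        {"wave": 1, "level": 0, "op": "mul"},
    ],
    "switches": [
        # Each route joins a PE in one wave to an input port of a PE in the next.
        {"after_wave": 0, "routes": [
            {"from_level": 0, "to_level": 0, "to_port": 0},
            {"from_level": 1, "to_level": 0, "to_port": 1},
        ]},
    ],
}

text = json.dumps(config, indent=2, sort_keys=True)  # contents of the output file
loaded = json.loads(text)                            # what a later tool would read

# Example statistic extracted from the file alone: programmed PEs per wave.
per_wave = Counter(pe["wave"] for pe in loaded["pes"])
print(dict(per_wave))  # {0: 2, 1: 1}
```

Because the file is plain JSON, a drawing tool, a statistics script, and a binary encoder can all read the same output without sharing code with the compiler.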
5.2 Compiler Metrics 
A few implementations of ISP applications were used as sample programs for the 
compiler, including Canny edge detection, Harris corner detection, Fast feature 
detection, convolution, and motion estimation. Code for these sample applications was 
written in the Darkroom language in previous work [10]. It was then compiled by 
the front end Darkroom compiler and used as input for the back end SPADE compiler.  
These applications and corresponding implementations were chosen because they 
are useful in the ISP domain and exhibit the qualities of energy efficient computations. 
Many other applications could have been chosen for testing, but this work does not seek to 
comprehensively test all available options. Instead we infer from the results of these 
chosen applications that many other ISP applications should perform similarly and leave 
a more exhaustive study to future work. 
The compiler output was verified for correctness by comparing results with the 
results from a C implementation. An RTL model [39] was configured using the compiler 
output and run through a series of tests while its output was captured. The same inputs 
were used to run C implementations of the applications, and the output of the RTL model 
was compared to the output of the C code. This verification process confirmed that the 
compiler correctly mapped the programs to hardware for these applications. 
The compiler’s performance depends on the three costs discussed in Section 4.3 
Compiling to SPA: energy, area, and computational complexity. Each of these metrics 
must be evaluated to determine the effectiveness of the compiler.  
Measuring energy efficiency is the simplest of the three because it amounts to 
finding the shortest path. The path is measured by counting the number of hops a 
signal has to travel. For the given wave pipeline topology and the assumptions discussed in 
the previous section, the compiler will always find the shortest path to route all of the 
PEs, so this metric needs little further consideration.  
Area is measured by tracking hardware utilization, defined as the percentage of 
the hardware in the fabric that is used. The height and depth of the topology are assumed 
to be only as large as necessary for each kernel, so hardware utilization can be described 
by the equation 
$$\mathit{Util} = \frac{\text{PEs Used}}{\text{Total PEs}} \times 100\%.$$
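As a small worked example of this formula, the sketch below computes utilization for a rectangular fabric. The function name is hypothetical and the numbers are illustrative values only, not measurements from this work; the sketch assumes the total PE count is the number of waves times the number of PEs per wave.

```python
def utilization(pes_used, waves, pes_per_wave):
    """Percentage of PEs used in a rectangular waves x pes_per_wave fabric."""
    total_pes = waves * pes_per_wave
    if total_pes == 0:
        raise ValueError("fabric has no PEs")
    return 100.0 * pes_used / total_pes

# Illustrative values: 6 programmed PEs on a fabric 4 waves deep, 3 PEs per wave.
print(utilization(6, 4, 3))  # 50.0
```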
Computational complexity refers to how quickly the compiler can produce results 
for a large number of operations. This was measured by profiling compile times 
against the number of operations that needed to be placed. Each kernel of each 
application was compiled five times, and the minimum time was taken from this set 
to remove outliers caused by operating system interaction. This process was 
repeated 1000 times for each kernel and the results were averaged. The compiler was 
tested on an Intel 4440 processor at a max of 2.3 GHz with 2 cores, 3 MB L3 cache, and 
1 MB L2 cache.  
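The measurement procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the profiling script used in this work: `compile_fn` stands in for the compiler entry point, the function names are hypothetical, and `time.perf_counter` requires Python 3 (the thesis compiler targets Python 2.7).

```python
import time

def min_of_runs(compile_fn, kernel, runs=5):
    """Best wall-clock time, in seconds, over `runs` compilations of one kernel."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        compile_fn(kernel)
        best = min(best, time.perf_counter() - start)
    return best

def profile_kernel(compile_fn, kernel, trials=1000, runs=5):
    """Average of `trials` min-of-`runs` measurements, in milliseconds."""
    total = sum(min_of_runs(compile_fn, kernel, runs) for _ in range(trials))
    return 1000.0 * total / trials

# Stand-in workload: sorting a list plays the role of compiling a kernel.
elapsed_ms = profile_kernel(sorted, list(range(1000)), trials=20)
print(f"{elapsed_ms:.4f} ms")
```

Taking the minimum of each small batch discards runs that were interrupted by the operating system, and averaging many such minima smooths the remaining jitter.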
These measurements were taken for both a homogeneous fabric and a 
heterogeneous fabric. For the heterogeneous fabric experiments, operations were 
restricted to only be allowed at certain waves in the wave pipeline. We chose to restrict 
multiply and summation, as they are not needed at every stage for most kernels. The 
restricted operations were supported at the first wave of the wave pipeline and then 
supported every S waves. These experiments were conducted with S = 2, 4, and 8. 
CHAPTER 6 
RESULTS 
I implemented a compiler that maps and schedules the intermediate representation 
output from Darkroom compiled code onto a wave pipeline topology. The kernels 
mentioned in the previous section were passed to the compiler and the output was verified 
for correctness through visual inspection. From the recorded output I can show that the 
compiler runs in linear time, optimizes mapping and scheduling generally, and that 
extra restrictions made to the fabric do not increase compiler complexity. These factors 
are important to show that the compiler is feasible even as optimizations are made to the 
architecture.  
A set of applications important to image processing and computer vision was 
used to test the compiler and analyze its configurations. The applications are listed in 
Table 2 along with their corresponding kernels and attributes of the compiled 
configurations. Kernels were compiled onto rectangular fabrics with N waves and M PEs 
per wave. The fabrics were engineered to be exactly as large as needed to fit the 
configuration for each kernel. The size of these fabrics (N and M) is reflected in the last 
two columns of Table 2. 
 
 
 
 
 
 
Table 2: Application Kernel Profiles

Application        Kernel Name               Operation  Compile    Number of  Number of  PEs Per
                                             Count      Time (ms)  Bypasses   Waves      Wave
Harris             Convert to Illum.         6          < 0.1      0          4          3
                   1x5 Convolution           10         < 0.1      2          5          6
                   5x1 Convolution           10         < 0.1      2          5          6
                   Response                  41         2.95       27         17         11
                   Non Maximal Suppression   10         < 0.1      6          6          5
Canny              Convert to Illum.         6          < 0.1      0          4          3
                   1x5 Convolution           10         < 0.1      2          5          6
                   5x1 Convolution           10         < 0.1      2          5          6
                   Sobel                     22         1          18         10         11
                   Non Maximal Suppression   74         7.5        47         11         25
                   Hysteresis                1485       1479       3398       40         370
                   Convert to Peak Image     8          < 0.1      2          4          5
Fast               fast sequence             400        419.6      753        44         64
                   Convert to Illum.         5          < 0.1      0          3          3
Motion Estimation  6x6 Block                 226        19         0          4          100
                   10x10 Block               730        191        0          4          324
                   12x12 Block               1234       278        0          4          484
 
6.1 Linear Compile Times 
Results from the compile time profiling show that compile times scale linearly 
after some initial startup cost. Fig. 13 plots the compile times against the number of PEs 
configured for all kernels that are sufficiently large. The motion estimation 
kernels were implemented using 6x6, 10x10, and 12x12 search windows. These compile 
times range from 1 ms to 1.5 s on an Intel 4440 processor at a max of 2.3 
GHz with 2 cores, 3 MB L3 cache, and 1 MB L2 cache.  
There is good reason to believe that compiler performance can improve by 
using different tools and languages. Python was chosen to build 
the compiler for ease of development, not performance. Porting the code to a 
more efficient, compiled language such as C would likely improve 
compiler performance. Similarly, the compiler was built with many 
debug and design exploration features that would not be necessary in production grade 
code. Removing these features would make the objects slimmer and probably more 
efficient. Despite the existing inefficiencies, the fast compile times and linear trend 
suggest that the compiler will scale out to large kernels. This is important because many 
energy efficient ISP pipelines execute thousands of operations per pixel. 
Fig. 13: Compile times plotted against the number of PEs configured. Only the kernels 
that used 30 or more PEs are shown. 
 
6.2 High Utilization 
 Despite the naïve compiling algorithm, most of the applications attained an 
average of nearly 40% utilization. This means that the compiled fabric is approximately 
2x away from ideal area efficiency. The utilization metrics assumed that each kernel had 
its own rectangular fabric that was only as tall and deep as required by the kernel. The 
number of utilized PEs was counted to produce the utilization plot in Fig. 14 for 
each application, each of which is made up of at least one kernel. When the utilized PEs 
are plotted, many of the kernels, especially the large kernels, form a triangular tree. This 
gives insight into the nature of this type of computation, and it is likely that other energy 
efficient computations exhibit similar shapes.  
 
Fig. 14: Utilization of PEs by each application. The vertical axis is the percentage of 
fabric utilized (0% to 100%); the horizontal axis lists Harris, Canny, Fast, Conv3x3, 
Conv1x3, and the mean. 
This near 50% utilization is due to the shape of the programmed PEs 
on the wave pipeline fabric. The kernels of these applications typically reduce down to a 
single output, and this happens gradually so that the PEs are placed in a triangle shape, as 
shown in Fig. 15. To improve utilization, it would be feasible to use triangular arrays as 
opposed to rectangular arrays. Similarly, rectangular fabrics could be shared between two 
kernels flowing in opposite and mirrored directions. 
 
Fig. 15: Harris wave pipeline graphs. These plots show only the programmed PEs, labeled 
with red circles, of a rectangular fabric for four of the kernels in the Harris pipeline: (a) 
Corner Response, (b) Non-Maximal Suppression, (c) Convert from RGB to Illuminance, 
(d) 1x5 Gaussian Filter.  
6.3 Utilization Cost of Heterogeneous PEs 
The compiler also created configurations with a heterogeneous fabric using the 
strategy described in Section 5.2 Compiler Metrics. Fig. 20 plots the results of the 
heterogeneous configurations compared to the homogeneous fabric, measured as the 
fabric size increase required to support the heterogeneous configuration. The results in 
Fig. 20 show that multiply operations do not impact the size of the configuration 
significantly. This is because each multiply supports only two inputs, so for each stage 
that it has to be delayed only 4 connections are added: two at the input of a bypass node 
and two at the output. For affected applications, the fabric size increase for entire 
applications due to delayed multiplication operations ranges from 0.4% to 37%, with mean 
increases of 1.9%, 12%, and 20% for S = 2, 4, and 8 respectively. The fabric size 
increase for entire applications due to delayed summation operations ranges from 0.1% to 
262%, with mean increases of 24%, 72%, and 172% for S = 2, 4, and 8 
respectively.  
 
 
Fig. 20: Fabric size increase due to heterogeneity. The plots show the total number of 
extra hops that result from supporting the multiply and reduction operations only every 
few stages. Support begins at the first stage and recurs at even intervals of 2, 4, 6, and 8. 
Fabric size increase is relative to the homogeneous scenario where all PEs support all 
operations. 
6.4 Measuring Communication 
Even though communication between stages was not considered as a compiler cost 
in the current work, it can dramatically affect architecture cost. To quantify the 
communication in the compiled configurations, the routing distance between PEs was 
measured. Since PEs can only communicate horizontally one wave at a time in the wave 
pipeline, only the vertical distance between connected PEs is considered. PEs are stacked 
in discrete positions, called levels, along the vertical axis in each stage. Between stages 
the PEs are interconnected via switches, and the vertical distance between two 
interconnected PEs is their level change. 
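A minimal sketch of this measurement is given below. The data layout is an assumption made for illustration: `levels` maps a hypothetical PE name to its vertical position, and `routes` lists producer and consumer pairs in adjacent waves. The percentile helper uses the nearest-rank definition, which is one of several common conventions.

```python
from collections import Counter

def level_changes(levels, routes):
    """Vertical distance covered by each routed connection."""
    return [abs(levels[src] - levels[dst]) for src, dst in routes]

def percentile_90(values):
    """Smallest value that is >= at least 90% of the entries (nearest rank)."""
    if not values:
        raise ValueError("no level changes to summarize")
    ordered = sorted(values)
    rank = -(-9 * len(ordered) // 10)  # ceil(0.9 * n)
    return ordered[rank - 1]

# Hypothetical placement: three producers feeding two consumers.
levels = {"a": 0, "b": 1, "c": 4, "m": 0, "s": 2}
routes = [("a", "m"), ("b", "m"), ("c", "s"), ("b", "s")]

changes = level_changes(levels, routes)
print(changes)                           # [0, 1, 2, 1]
print(sorted(Counter(changes).items()))  # [(0, 1), (1, 2), (2, 1)]
print(percentile_90(changes))            # 2
```

The counted pairs give the histogram of level changes, and the 90th percentile summarizes how far most connections travel.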
 Despite the naïve compiler methods, the communication between PEs was quite 
local for these sample programs. Fig. 21 shows the distribution of level changes in the 
Harris, Fast, and Canny pipelines. The Canny pipeline shows much larger communication 
distances between waves than the other two. This is due to the implementation of Canny: 
the hysteresis kernel is implemented with a morphological filter, and as a result most of 
its operations are single bit. Thus each wave is very large. If instead multiple single bit 
operations were packed into a PE, then the degree of level changes would drop dramatically.  
It is interesting to note that most of the level changes are small and the 
distributions tail off as the level changes increase. This is promising because large level 
changes are expensive: they require large switches and long wires, neither of which is 
energy or area efficient. It also leads us to believe that other ISP kernels may 
follow similar trends. If a kernel does require global communication, then it is likely that 
it is not energy efficient and should be executed on a different architecture. It will be 
important to explore and categorize these types of applications as this may lead to design 
decisions in future work to optimize for the common case. 
 
Fig. 21: Histogram of level changes. The three histograms are for the three ISP pipelines 
of (a) Harris, (b) Canny, and (c) Fast. The red lines indicate the 90th percentile of level 
changes. The kernels were compiled using the algorithm from Fig. 11.  
 
 
CHAPTER 7 
DISCUSSION 
7.1 Fabric Heterogeneity  
 Reduction operations, such as summation, are penalized more heavily than small 
input operations, like multiplication, when placing an operation results in a miss due to 
PE heterogeneity. Therefore, operations like multiplication can remain fairly sparse in a 
fabric with little overhead cost. Reduction operations, however, must be more abundant 
in the fabric to avoid poor area utilization. Luckily, the neighboring PEs to reduction ops 
are generally spatially local to each other. Therefore, the compiler can predict the best 
placement of the neighbors of a reduction op by placing its neighbors close to a PE that 
supports it. Also some reduction units can be virtualized into a number of smaller input 
units. For example, summation can be virtualized into a number of smaller input 
summers. 
 Heterogeneous fabrics can follow many patterns and distributions, and there is 
likely a design space for optimizing the heterogeneity for target applications. One way 
to determine a best known topology for heterogeneous PEs is to start with a fully 
supported, homogeneous fabric and compile a set of target applications to train the fabric. 
The PEs can then be de-functionalized of operations they did not use during the training 
phase. To test the fabric’s flexibility, one can attempt to compile additional applications 
onto the fabric. Since there are many common patterns between applications in the same 
domain, there will likely be applications that can still run on the de-functionalized fabric. 
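The training idea can be sketched with simple set operations. This is a hypothetical illustration only: each compiled configuration is assumed to map a (wave, level) position to the operation placed there, and the fit check keeps a new kernel's placement fixed, whereas a real compiler could move operations to other PEs.

```python
def defunctionalize(training_configs):
    """Keep, per PE position, only the operations used by some training kernel."""
    supported = {}
    for config in training_configs:
        for position, op in config.items():
            supported.setdefault(position, set()).add(op)
    return supported

def fits(config, supported):
    """True if every operation lands on a PE that still supports it."""
    return all(op in supported.get(pos, set()) for pos, op in config.items())

# Two hypothetical training kernels placed on the same fabric.
training = [
    {(0, 0): "input", (1, 0): "mul"},
    {(0, 0): "input", (1, 0): "add", (1, 1): "mul"},
]
fabric = defunctionalize(training)
print(sorted(fabric[(1, 0)]))                          # ['add', 'mul']
print(fits({(0, 0): "input", (1, 1): "mul"}, fabric))  # True
print(fits({(0, 0): "input", (1, 1): "add"}, fabric))  # False
```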
7.2 Localized Communication  
Fully interconnected switches are inefficient, as many energy efficient 
computations do not require them, as demonstrated by the graphs in Fig. 21. Therefore, 
without changing the fabric size, the lower bound on switch cost in the wave 
pipeline topology is a switch that supports only the maximum amount of communication 
used in a kernel. 
Switches can be restricted to only allow communication between PEs within a 
certain vertical distance from each other. These are called banded switches, as shown in 
Fig. 16. Their communication is similar to what is observed in energy efficient 
compute. 
 
Fig. 16: Banded switch. 
Restricting a switch to local communication trades switch cost against utilization. 
Like increasing PE heterogeneity, limiting the connections inside a switch 
can result in extra hops and decreased utilization. If a PE is unable to communicate with 
another PE because they are too far apart, then the signals from both PEs need to be 
bypassed to the next stage and moved closer together until they are close 
enough to communicate, as illustrated in Fig. 17. 
 
Fig. 17: Utilization cost of banded switch. This banded switch only allows communication 
between PEs one level apart. The first wave of PEs needs to bypass their signals to the next 
stage while moving them closer together. At the third wave they can be routed together. 
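The bypass cost illustrated in Fig. 17 can be estimated with a simple model. The sketch below is an assumption-laden illustration rather than the compiler's routing logic: it assumes each hop through a banded switch can shift a signal by at most `band` levels, so two signals close their vertical gap by at most twice the band per hop. The function names are hypothetical.

```python
def hops_to_meet(level_a, level_b, band):
    """Fewest switch hops before two signals can enter the same PE."""
    if band < 1:
        raise ValueError("band must allow at least one level of movement")
    gap = abs(level_a - level_b)
    return max(1, -(-gap // (2 * band)))  # ceiling division, at least one hop

def bypass_stages(level_a, level_b, band):
    """Extra stages spent only forwarding the two signals toward each other."""
    return hops_to_meet(level_a, level_b, band) - 1

print(bypass_stages(0, 4, 1))  # 1
print(bypass_stages(0, 4, 4))  # 0
```

With a band of one level and producers four levels apart, one stage of bypass PEs is needed before the consumer can be placed, which occupies two PEs that do no computation. Widening the band to four levels removes that stage at the price of a larger switch.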
There are some applications in energy efficient domains that do not fit the local 
communication model, such as deep neural networks. These computations will require a 
different architecture that allows for global communication between stages within a 
kernel. Other architectural works such as DianNao address issues with massively 
interconnected compute graphs [40]. 
  
 
CHAPTER 8 
CONCLUSION 
Energy efficient design is important for all types of devices and applications. But 
as Dennard scaling ends and Moore’s law slows down, the industry cannot wait on new 
disruptive technology to solve these issues. There must be a focus on alternatives for 
energy efficient design.  
General purpose processors, such as CPUs, are extremely inefficient due to large 
overhead costs inherent to their generality. Other architectures such as FPGAs and ASICs 
are much more efficient because they remove overhead functionality that is not 
necessary. But not all kinds of compute can be executed more efficiently, only those that 
satisfy the principles in Section 2.1 Energy Efficient Computations. It was shown that 
these energy efficient computations are not currently optimized on general purpose 
processors, such as CPUs. ASICs can execute these computations 1000 times more 
efficiently than CPUs but at the extreme cost of flexibility. SPAs, however, are shown to 
execute these computations very efficiently while maintaining 
flexibility. This type of architecture is typically used as a co-processor to a CPU to handle 
heavily used computations that would otherwise be very inefficient if run on a CPU. 
SPAs hold the promise of near-ASIC efficiency while maintaining flexibility. 
It is difficult to effectively map a program that runs in time to a program that runs 
on a spatial fabric. Therefore, a new programming model is required to efficiently 
program coarse grained SPADE fabrics to execute energy efficient computations. 
Existing DSLs provide a programming model that makes energy efficient design simple for 
application developers by imposing restrictions for streaming communications and kernels. 
Developers can use these DSLs to easily develop code that is then compiled to an 
intermediate graph representation of the program using the DSL’s front end compiler. But 
there is no back end compiler to link the DSL’s compute graph to a SPADE specific 
configuration.  
This study shows the feasibility of a SPADE compiler by creating one that 
compiles programs written in DPDA to SPADE hardware. While the SPADE 
abstraction has several components for ISP problems, the compiler is mainly concerned 
with the computational fabric topology where operations from a kernel are placed and 
routed. The wave pipeline topology was proposed as an efficient SPA for image 
processing, which the SPADE compiler targets. Some assumptions were made about the 
wave pipeline implementation and, for the given topology, the compiler solves 
for minimum energy and area configurations using low complexity greedy algorithms. 
This lower bound is important for future work in compilers for spatial 
computing. It shows that a SPADE compiler can run in milliseconds, 
which is fast enough to feel interactive. This will shorten development time for 
application developers, who will be able to compile quickly for iterative 
development and testing. It also makes distributing code easy: source code can be 
distributed and compiled on the client device, regardless of the exact hardware the 
client is using.  
  
CHAPTER 9 
FUTURE WORK 
This initial compiler implementation sets the stage for studying efficient 
architectural designs. It will allow us to ask questions about fabric heterogeneity, 
communication costs, and computation scale. There are tradeoffs associated with 
architectural optimizations that have to be taken into account. As discussed in this work, 
changing the support of operations at each PE will save area and energy but will also 
degrade utilization. Likewise, connectivity is expensive for large routers, but smaller 
routers again degrade utilization. The direction of this research is to explore these 
tradeoffs through simulated hardware and compiler trials. Comparing the actual costs 
of PE and switch sizes with utilization costs will make the proper 
balance apparent.  
As the topological network of PEs and switches evolves, restrictions that existed in 
the wave pipeline may be relaxed to optimally accommodate other computation flows. 
Communication in the wave pipeline only allowed for forward movement but to achieve 
higher utilization it may be beneficial to relax switch communication to arbitrary 
directions. Sequencers may be added as hardware elements or integrated with PEs to 
allow for more complex data flow inside the fabric. All of these factors will contribute to 
architectural overhead as well as further complicate the compiler and so it will be 
important to study those tradeoffs as well. Future work would calculate these costs and 
tradeoffs to come up with optimal topologies, hardware design, and compilers.  
The existing compiler has been set up as a modular framework to 
allow for easy rework and experimentation with the compiler code. The architecture will 
be able to evolve with input from an application perspective as well as a hardware 
perspective. As these architectures mature we will see increased savings in energy and 
area as well as broader support for multiple domains.  
REFERENCES 
 
[1] R. Mahajan, C. P. Chiu, and G. Chrysler, "Cooling a microprocessor chip," 
Proceedings of the IEEE, vol. 94, pp. 1476-1486, 2006. 
 
[2] R. Hameed, W. Qadeer, M. Wachs, O. Azizi, A. Solomatnikov, B. C. Lee, et al., 
"Understanding Sources of Inefficiency in General-Purpose Chips," Proceedings 
of the 37th annual international symposium on Computer architecture, pp. 37-47, 
2010. 
 
[3] D. Marković and R. W. Brodersen, DSP Architecture Design Essentials: Springer 
US, 2012. 
 
[4] I. Kuon and J. Rose, "Measuring the Gap Between FPGAs and ASICs," IEEE 
Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 
26, pp. 203-215, 2007. 
 
[5] S. Vassiliadis and D. Soudris, Fine-and coarse-grain reconfigurable computing 
vol. 16: Springer, 2007. 
 
[6] F. Bouwens, M. Berekovic, A. Kanstein, and G. Georgi, "Architectural 
Exploration of the ADRES Coarse-Grained Reconfigurable Array," ARC, pp. 1-
13, 2007. 
 
[7] S. C. Goldstein, H. Schmit, M. Moe, M. Budiu, S. Cadambi, R. R. Taylor, et al., 
"PipeRench: a co/processor for streaming multimedia acceleration," ACM 
SIGARCH Computer Architecture News, vol. 27, pp. 28-39, 1999. 
 
[8] A. Pedram, R. A. Van de Geijn, and A. Gerstlauer, "Codesign tradeoffs for high-
performance, low-power linear algebra architectures," Computers, IEEE 
Transactions on, vol. 61, pp. 1724-1736, 2012. 
 
[9] F.-L. Yuan, C. C. Wang, T.-H. Yu, and D. Markovic, "A Multi-Granularity FPGA 
With Hierarchical Interconnects for Efficient and Flexible Mobile Computing," 
Solid-State Circuits, IEEE Journal of, vol. 50, pp. 137-149, 2015. 
 
[10] J. Hegarty, J. Brunhaver, Z. DeVito, J. Ragan-Kelley, N. Cohen, S. Bell, et al., 
"Darkroom: compiling high-level image processing code into hardware 
pipelines," ACM Trans. Graph., vol. 33, pp. 1-11, 2014. 
 
 
 
 
[11] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe, 
"Halide: a language and compiler for optimizing parallelism, locality, and 
recomputation in image processing pipelines," presented at the Proceedings of the 
34th ACM SIGPLAN Conference on Programming Language Design and 
Implementation, Seattle, Washington, USA, 2013. 
 
[12] W. Thies, M. Karczmarek, and S. P. Amarasinghe, "StreamIt: A Language for 
Streaming Applications," presented at the Proceedings of the 11th International 
Conference on Compiler Construction, 2002. 
 
[13] J. Sanchez and A. Gonzalez, "Modulo Scheduling for a Fully-Distributed 
Clustered VLIW Architecture," 33rd Int. Symp. on Microarchitecture (MICRO-33), 
pp. 124-133, Dec 2000. 
 
[14] T. Nowatzki, M. Sartin-Tarm, L. D. Carli, K. Sankaralingam, C. Estan, and B. 
Robatmili, "A general constraint-centric scheduling framework for spatial 
architectures," SIGPLAN Not., vol. 48, pp. 495-506, 2013. 
 
[15] V. Govindaraju, C.-H. Ho, and K. Sankaralingam, "Dynamically Specialized 
Datapaths for Energy Efficient Computing," High Performance Computer 
Architecture (HPCA), 2011 IEEE 17th International Symposium, pp. 503-514, 
Feb. 2011. 
 
[16] J. Brunhaver, "Design and Optimization of a Stencil Engine.," PhD Thesis, 
Stanford University, 2015. 
 
[17] R. Sites, "It's the Memory, Stupid!" Microprocessor Report, vol. 10, pp. 1-2, Aug 
1996. 
 
[18] A. Saulsbury, F. Pong, and A. Nowatzyk, "Missing the memory wall: the case for 
processor/memory integration," SIGARCH Comput. Archit. News, vol. 24, pp. 90-
101, 1996. 
 
[19] J. Ragan-Kelley, A. Adams, S. Paris, M. Levoy, S. Amarasinghe, and F. Durand, 
"Decoupling algorithms from schedules for easy optimization of image processing 
pipelines," ACM Trans. Graph., vol. 31, pp. 1-12, 2012. 
 
[20] H. Lee, K. J. Brown, A. K. Sujeeth, T. Rompf, and K. Olukotun, "Locality-Aware 
Mapping of Nested Parallel Patterns on GPUs," presented at the Proceedings of 
the 47th Annual IEEE/ACM International Symposium on Microarchitecture, 
Cambridge, United Kingdom, 2014. 
 
[21] G. Venkatesh, J. Sampson, N. Goulding, S. Garcia, V. Bryksin, J. Lugo-Martinez, 
et al., "Conservation Cores: Reducing the Energy of Mature Computations," 
SIGARCH Comput. Archit. News, vol. 38, pp. 205–218, Mar 2010. 
 
[22] J. Cong, V. Sarkar, G. Reinman, and A. Bui, "Customizable domain-specific 
computing," IEEE Design & Test of Computers, pp. 6-15, 2010. 
 
[23] J. Canny, "A Computational Approach to Edge Detection," IEEE Transactions on 
Pattern Analysis and Machine Intelligence, vol. PAMI-8, pp. 679-698, 1986. 
 
[24] C. Harris and M. Stephens, "A combined corner and edge detector," presented at 
the Fourth Alvey Vision Conference, 1988. 
 
[25] E. Rosten and T. Drummond, "Fusing points and lines for high performance 
tracking," in Tenth IEEE International Conference on Computer Vision 
(ICCV'05) Volume 1, 2005, pp. 1508-1515 Vol. 2. 
 
[26] E. Rosten and T. Drummond, "Machine learning for high-speed corner detection," 
presented at the Proceedings of the 9th European conference on Computer Vision 
- Volume Part I, Graz, Austria, 2006. 
 
[27] H.264/AVC Software Coordination. Available: http://iphome.hhi.de/suehring/tml/ 
 
[28] H. T. Kung and C. E. Leiserson, Systolic Arrays (for VLSI): Carnegie-Mellon 
University, Department of Computer Science, 1978. 
 
[29] A. Parashar, M. Pellauer, M. Adler, B. Ahsan, N. Crago, D. Lustig, et al., 
"Triggered instructions: a control paradigm for spatially-programmed 
architectures," presented at the Proceedings of the 40th Annual International 
Symposium on Computer Architecture, Tel-Aviv, Israel, 2013. 
 
[30] F. Liu, A. Heejin, S. R. Beard, T. Oh, and D. I. August, "DynaSpAM: Dynamic 
Spatial Architecture Mapping using Out of Order Instruction Schedules," 
Computer Architecture (ISCA), 2015 ACM/IEEE 42nd Annual International 
Symposium, pp. 514-553, June 2015. 
 
[31] J. Sugerman, K. Fatahalian, S. Boulos, K. Akeley, and P. Hanrahan, "GRAMPS: 
A programming model for graphics pipelines," ACM Trans. Graph., vol. 28, pp. 
1-11, 2009. 
 
[32] (2016). OpenGL. Available: https://www.opengl.org/ 
 
[33] J. Nickolls, I. Buck, M. Garland, and K. Skadron, "Scalable Parallel Programming 
with CUDA," Queue, vol. 6, pp. 40-53, 2008. 
 
[34] W. Qadeer, R. Hameed, O. Shacham, P. Venkatesan, C. Kozyrakis, and M. A. 
Horowitz, "Convolution Engine: Balancing Efficiency & Flexibility in 
Specialized Computing," ISCA '13 Proceedings of the 40th Annual International 
Symposium on Computer Architecture, pp. 24-35, 2013. 
 
[35] P. Milder and M. Telgarsky, "Two Approaches to Optimizing For Power In the 
Spiral Framework." 
 
[36] Introducing JSON. Available: http://www.json.org/ 
 
[37] C. Y. Lee, "An Algorithm for Path Connections and Its Applications," IRE 
Transactions on Electronic Computers, vol. EC-10, pp. 346-365, 1961. 
 
[38] SPADE-ARCH Compiler. Available: https://github.com/ASU-SPADE/SPADE-ARCH 
 
[39] S. Satapathy, "Data path implementation for a spatially programmable 
architecture customized for image processing applications," Arizona State 
University, 2016. 
 
[40] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, et al., "DianNao: A Small-
Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning," 
ASPLOS '14 Proceedings of the 19th international conference on Architectural 
support for programming languages and operating systems, pp. 269-284, 2014. 
 
 
