DESIGNING COST-EFFECTIVE COARSE-GRAINED RECONFIGURABLE ARCHITECTURE by Kim, Yoonjin
  
 
 
DESIGNING COST-EFFECTIVE COARSE-GRAINED 
RECONFIGURABLE ARCHITECTURE 
 
 
A Dissertation 
by 
YOONJIN KIM 
 
 
Submitted to the Office of Graduate Studies of 
Texas A&M University 
in partial fulfillment of the requirements for the degree of  
DOCTOR OF PHILOSOPHY 
 
 
May 2009 
 
 
Major Subject: Computer Engineering 
  
 
 
 
Approved by: 
Chair of Committee,  Rabi N. Mahapatra 
Committee Members, Eun Jung Kim 
 Duncan Henry M. Walker 
 Gwan Choi 
Head of Department, Valerie E. Taylor 
 
ABSTRACT 
 
Designing Cost-Effective Coarse-Grained Reconfigurable Architecture. (May 2009) 
Yoonjin Kim, B.S., SungKyunKwan University; 
M.S., Seoul National University 
Chair of Advisory Committee: Dr. Rabi N. Mahapatra 
 
Application-specific optimization of embedded systems has become inevitable as market
demand pushes designers to meet tighter constraints on cost, performance, and power.
At the same time, system flexibility is important to accommodate the short
time-to-market requirements of embedded systems. To reconcile these conflicting
demands, coarse-grained reconfigurable architecture (CGRA) has emerged as a suitable
solution. A typical CGRA requires many processing elements (PEs) and a configuration
cache for reconfiguration of its PE array. However, such a structure consumes significant
area and power. Therefore, designing a cost-effective CGRA has been a serious concern
for the reliability of CGRA-based embedded systems.
As an effort to provide such a cost-effective design, the first half of this work
focuses on reducing power in the configuration cache. For power saving in the
configuration cache, a low-power reconfiguration technique is presented based on reusable
context pipelining, which merges the concept of context reuse into context pipelining.
In addition, we propose dynamic context compression, which enables only the required
bits of the context words and disables the redundant bits. Finally, we provide dynamic
context management, which reduces power consumption in the configuration cache by
controlling the read/write operations of redundant context words.
In the second part of this dissertation, we focus on designing a cost-effective PE
array to reduce area and power. For area and power saving in a PE array, we devise a
cost-effective array fabric based on a novel rearrangement of processing elements and
their interconnection designs to reduce area and power consumption. In addition,
hierarchical reconfigurable computing arrays are proposed, consisting of two reconfigurable
computing blocks with two types of communication structure. The two computing
blocks share critical resources, and this sharing structure provides an efficient
communication interface between them while reducing overall area.
Based on the proposed design approaches, a CGRA combining the multiple design
schemes is presented to verify the synergy of the integrated approach. Experimental
results show that the integrated approach reduces area by 23.07% and power by up to
72% compared with the conventional CGRA.
 
DEDICATION 
 
To my family and friends 
ACKNOWLEDGEMENTS 
 
It has been my lifelong dream to become a professional in the engineering/science field
and to infuse my passion into my research work. For this reason, this acknowledgement
is very meaningful to me. The completion of a Ph.D. in computer engineering at Texas
A&M University has been the best way to accomplish my goals and achieve my dream.
First of all, I am sincerely grateful to my advisor, Dr. Rabi N. Mahapatra, for allowing
me to conduct research with him and for his guidance during my Ph.D. program. His
exceptional commitment to research and strong demand for excellence have guided me
this far. I am truly grateful for his insightful advice, encouragement, and constant
motivation throughout this work. Many thanks also go to my previous advisor, Professor
Kiyoung Choi of Seoul National University, for his encouragement and helpful discussions.
During the two years of my master's course, he taught me that a successful graduate
student must have a diligent and passionate attitude. I would also like to thank the
other members of my dissertation committee: Professors Eun Jung Kim, Duncan Henry
M. Walker, and Gwan Choi. Their insightful comments and constructive criticisms
helped me improve my research. Without their feedback, this dissertation would not
have reached its present form. In addition, I am deeply grateful to Professor Jundong
Cho of SungKyunKwan University for his teaching and advice in my undergraduate
days.
Furthermore, I would like to thank my friends and fellow students at Texas A&M
University for numerous discussions about various issues related to research and life. I
sincerely thank the current and former members of the Embedded Systems and Co-design
Group for being supportive of me during this work. I thank them all, including
Woo-Seok Hong, In-choon Yeo, Baik-Song Ahn, Ja-Ryeong Koo, Sun-Young Choi,
Ju-Young Jung, Young-Ah Kim, Moon-Jeong Kang and Young-Ho Koh, for being great
friends and for always being available whenever I needed their assistance. Members
of the Design Automation Lab at Seoul National University have helped me in various
ways during the years of my Ph.D. program. I thank them all, especially Yong-Jin Ahn,
Dong-Kwan Suh, Young-Chul Cho, Imyong Lee, Il-Hyun Park, Dong-Wook Lee, and
Man-Hwee Jo.
Last, but not least, I am especially grateful to my parents and my elder brother for
their incredible support and trust in me. Without their dedication and belief in me, I
couldn't have completed this work in due time.
TABLE OF CONTENTS 
              
Page 
ABSTRACT..............................................................................................................  iii 
DEDICATION ..........................................................................................................  v 
ACKNOWLEDGEMENTS ......................................................................................  vi 
TABLE OF CONTENTS..........................................................................................  viii 
LIST OF FIGURES...................................................................................................  xiii 
LIST OF TABLES ....................................................................................................  xvii 
CHAPTER 
 I INTRODUCTION................................................................................  1 
 
  A. Objective and Approach..................................................................  2 
  B. Contributions ...................................................................................  3 
  C. Dissertation Organization ................................................................  5 
 
 II BACKGROUND AND RELATED WORKS .....................................  6 
 
  A. Coarse-Grained Reconfigurable Architecture .................................  6 
  B. Related Works .................................................................................  8 
 
 III BASE CGRA IMPLEMENTATION ..................................................      13 
 
  A. Reconfigurable Array Architecture Coupling with Processor ........  13 
  B. Base Reconfigurable Array Architecture ........................................  15 
1. Processing Element ...................................................................  16 
2. PE Array....................................................................................  16 
3. Frame Buffer .............................................................................  18 
4. Configuration Cache .................................................................  18 
5. Execution Controller .................................................................  19 
  C. Breakdown of Area, Delay, and Power Cost...................................  19 
1. Area and Delay..........................................................................  20 
2. Power.........................................................................................  23 
 
 
 IV LOW POWER RECONFIGURATION TECHNIQUE.......................      25 
 
  A. Motivation .......................................................................................  25 
1. Loop Pipelining .........................................................................  25 
2. Spatial Mapping and Temporal Mapping..................................  31 
  B. Individual Approaches to Reduce Power in Configuration Cache..  32 
1. Spatial Mapping with Context Reuse........................................  33 
2. Temporal Mapping with Context Pipelining.............................  34 
3. Limitation of Individual Approaches ........................................  35 
  C. Integrated Approach to Reduce Power in Configuration Cache .....  36 
1. Reusable Context Pipelining .....................................................  37 
2. Limitation of Reusable Context Pipelining...............................  41 
3. Hybrid Configuration Cache Structure .....................................  43 
  D. Application Mapping Flow .............................................................  44 
1. Temporal Mapping Algorithm ..................................................  45 
a. Covering ............................................................................  45 
b. Time Assignment ..............................................................  46 
c. Place Assignment ..............................................................  47 
2. Context Rearrangement.............................................................  47 
  E. Experiments .....................................................................................  49 
1. Experimental Setup ...................................................................  49 
2. Results .......................................................................................  50 
a. Necessary Context Registers for Evaluated Kernels .........  50 
b. Configuration Cache Size..................................................  52 
c. Performance Evaluation ....................................................  52 
d. Power Evaluation ..............................................................  53 
 
 V DYNAMIC CONTEXT COMPRESSION FOR LOW POWER  
CGRA...................................................................................................  55 
 
  A. Preliminary ......................................................................................  55 
1. Context Architecture .................................................................  55 
  B. Motivation .......................................................................................  57 
1. Power Consumption by Configuration Cache...........................  57 
2. Valid Bit-Width of Context Words ...........................................  57 
3. Dynamic Context Compression for Low Power CGRA ...........  59 
  C. Design Flow of Dynamically Compressible Context Architecture .  59 
1. Context Architecture Initialization............................................  62 
2. Field Grouping ..........................................................................  62 
3. Field Sequence Graph Generation.............................................  64 
4. Generation of Field Control Signal ...........................................  65 
a. Control Signals for ALU-Dependent Fields......................  66 
b. Control Signals for ALU-Independent Fields ...................  66 
5. Field Positioning .......................................................................  67 
a. Field Positioning on Uncompressed Context Word ..........  67 
b. Field Positioning on Compressed Context Word..............  69 
6. Compressible Context Architecture ..........................................  78 
7. Context Evaluation....................................................................  78 
  D. Experiments.....................................................................................  78 
1. Experimental Setup ...................................................................  78 
2. Results .......................................................................................  79 
a. Area Cost Evaluation.........................................................  79 
b. Performance Evaluation ....................................................  80 
c. Context Compression Ratio and Power Evaluation ..........  81 
 
 VI DYNAMIC CONTEXT MANAGEMENT FOR LOW POWER 
CGRA...................................................................................................  82 
 
  A. Motivation .......................................................................................  82 
1. Power Consumption by Configuration Cache...........................  82 
2. Redundancy of Context Words .................................................  83 
a. NOP Context Words..........................................................  83 
b. Consecutively Same Part in Context Words .....................  85 
c. Redundancy Ratio .............................................................  86 
  B. Dynamic Context Management .......................................................  86 
1. Context Partitioning ..................................................................  87 
2. Context Management at Transfer Time ....................................  90 
3. Context Management at Run Time ...........................................  92 
  C. Experiments .....................................................................................  94 
1. Experimental Setup ...................................................................  94 
2. Results .......................................................................................  94 
a. Area Cost Evaluation.........................................................  94 
b. Power Evaluation ..............................................................  95 
c. Performance Evaluation ....................................................  96 
 
 VII COST-EFFECTIVE ARRAY FABRIC...............................................      97 
 
  A. Preliminary ......................................................................................  97 
1. Resource Sharing.......................................................................  98 
2. Resource Pipelining...................................................................  101 
  B. Cost-Effective Reconfigurable Array Fabric...................................  103 
1. Motivation .................................................................................  103 
a. Characteristics of Computation-Intensive and Data-Parallel 
    Application ........................................................................  103 
b. Redundancy in Conventional Array Fabric.......................  104 
2. New Cost Effective Data Flow-Oriented Array Structure ........  105 
a. Derivation of Data Flow-Oriented Array Structure...........  105 
b. Mitigation of Spatial Limitation in the Proposed Array  
Structure ............................................................................  109 
3. Data Flow-Oriented Array Design Flow...................................  110 
a. Input Reconfigurable Array Fabric ...................................  112 
b. New Array Fabric Specification-Phase I...........................  112 
c. New Array Fabric Specification-Phase II..........................  118 
d. Connectivity Enhancement ...............................................  122 
4. Cost-Effective Array Fabric with Resource Sharing  
and Pipelining............................................................................    123 
  C. Experiments .....................................................................................  125 
1. Experimental Setup ...................................................................  125 
a. Evaluated Applications......................................................  125 
b. Hardware Design and Power Estimation ..........................  126 
2. Results .......................................................................................  127 
a. Area Evaluation .................................................................  127 
b. Performance Evaluation ....................................................  127 
c. Power Evaluation...............................................................  130 
 
 VIII HIERARCHICAL RECONFIGURABLE COMPUTING ARRAYS .    131 
 
  A. Motivation .......................................................................................  131 
1. Limitation of Existing Processor-RAA  
Communication Structures........................................................    131 
2. RAA-based Computing Hierarchy ............................................  133 
  B. Computing Hierarchy in CGRA ......................................................  134 
1. Computing Hierarchy – Size and Speed ...................................  135 
2. Resource Sharing in RCC and RAA .........................................  136 
3. Computing Flow Optimization..................................................  140 
  C. Experiments .....................................................................................  142 
1. Experimental Setup ...................................................................  142 
a. Architecture Implementation.............................................  142 
b. Evaluated Applications .....................................................  144 
2. Results .......................................................................................  144 
a. Area Cost Evaluation.........................................................  144 
b. Performance Evaluation ....................................................  144 
c. Power Evaluation...............................................................  146 
 
 IX INTEGRATED APPROACH TO OPTIMIZE CGRA ........................    149 
 
  A. Combination among the Cost-Effective CGRA Design Schemes ..  149 
  B. Case Study for Integrated Approach ..............................................  150 
1. A CGRA Design Example Merging Three Design Schemes...  150 
2. Results .......................................................................................  151 
a. Area and Performance Evaluation.....................................  151 
b. Power Evaluation ..............................................................  152 
  C. Potential Combinations and Expected Outcomes ...........................  153 
 
 X CONCLUSIONS..................................................................................    155 
 
REFERENCES..........................................................................................................  158 
 
VITA .........................................................................................................................  168 
LIST OF FIGURES 
 
FIGURE                                                                                                                        Page 
 1 Block diagram of general CGRA...............................................................  7 
 
 2 Basic types of reconfigurable array coupling.............................................  14 
 
 3 Block diagram of base CGRA....................................................................  14 
 
 4 Processing element structure of base RAA ................................................  16 
 
 5 Interconnection structure of PE array.........................................................  17 
 
 6  Distributed configuration cache structure ..................................................  18 
 
 7 Area cost breakdown for CGRA ................................................................  20 
 
 8 Cost analysis for a PE.................................................................................  21 
 
 9 Power cost breakdown for CGRA running 2D-FDCT...............................  22 
 
 10 4x4 reconfigurable array ............................................................................  26 
 
 11 C-code of Eq. (2)........................................................................................  28 
 
 12 Execution model for CGRA.......................................................................  29 
 
 13 Comparison between temporal mapping and spatial mapping...................  32 
 
 14 Configuration cache structure for context reuse ........................................  33 
 
 15 Cache structure for context pipelining .......................................................  35 
 
 16 Proposed configuration cache structure .....................................................  37 
 
 17 Reusable context pipelining for Eq. (2) .....................................................  40 
 
 18 Reusable context pipelining with temporal cache......................................  41 
 
 19 Reusable context pipelining according to the execution time for one 
iteration (i > 1) ..........................................................................................  42 
 
 20 Hybrid configuration cache structure.........................................................  44 
 
 21 Application mapping flow for base architecture and proposed architecture 45 
 
 22 Temporal mapping steps ............................................................................  46 
 
 23 Context rearrangement ...............................................................................  48 
 
 24 PE structure and context architecture of MorphoSys.................................  56 
 
 25 Valid bit-width of context words ...............................................................  58 
 
 26 Entire design flow ......................................................................................  60 
 
 27 Context architecture initialization ..............................................................  61 
 
 28 Field grouping ............................................................................................  63 
 
 29 Field sequence graph..................................................................................  64 
 
 30     Control signals for 'MUX_B' and 'PRED' ..................................................  65 
 
 31 Updated FSG from flag merging................................................................  67 
 
 32 Default field positioning.............................................................................  68 
 
 33 Field concurrency graph.............................................................................  69 
 
 34 Examples of ‘Find_Interval’ ......................................................................  75 
 
 35 Multiplexer port-mapping graph ................................................................  76 
 
 36 Compressible context architecture .............................................................  77 
 
 37 Consecutively same part in context words.................................................  84 
 
 38 Redundancy ratio of context words............................................................  85 
 
 39 An example of PE and context architecture ...............................................  87 
 
 40 Context partitioning....................................................................................  88 
 
 41 Comparison between general CE and proposed CE...................................  89 
 
 42 Context management when context words are transferred ........................  89 
 
 43 Context management at run time ...............................................................  92 
 
 44 Snapshots of three mappings......................................................................  98 
 
 45 Eight multipliers shared by sixteen PEs.....................................................  99 
 
 46 The connection between a PE and shared multipliers................................  100 
 
 47 Critical paths ..............................................................................................  101 
 
 48 Loop pipelining with pipelined multipliers................................................  102 
 
 49 Subtask classification .................................................................................  104 
 
 50 Data flow on square reconfigurable array ..................................................  105 
 
 51 Data flow-oriented array structure derived from three types of data flow.  106 
 
 52 An example of data flow-oriented array ....................................................  107 
 
 53 Snapshots showing the maximum utilization of PEs .................................  109 
 
 54 Overall design flow ....................................................................................  111 
 
 55 Basic concept of local triangulation method .............................................  114 
 
 56 Local triangulation method........................................................................  115 
 
 57 Interconnection derivation in Phase I.........................................................  116 
 
 58 New array fabric example by Phase I.........................................................  117 
 
 59 Global triangulation method when n = 2 (L2)...........................................  120 
 
 60 New array fabric example by Phase II .......................................................  122 
 
 61 New array fabric example by connectivity enhancement ..........................  123 
 
 62 New array fabric with resource sharing and pipelining .............................  124 
 
 63 Mapping example on new array fabric.......................................................  125 
 
 64 Analogy between Memory and RAA-computing hierarchy ......................  134 
 
 65 Computing hierarchy of CGRA .................................................................  134 
 
 66 CGRA configuration with RCC and RAA.................................................  136 
 
 67 Two cases of functional resource assignment ............................................  138 
 
 68 Critical resource sharing and pipelining in L1 and L2 PE array................  139 
 
 69 Interconnection structure among RCC, shared critical resources and  
L2 PE array................................................................................................  139 
 
 70 Four cases of computing flow according to the input/output size of  
application .................................................................................................  141 
 
 71 Performance comparison............................................................................  145 
 
 72 Power comparison ......................................................................................  147 
  
 73 Combination flow of the proposed design schemes...................................  150 
  
 74 A combination example combining three design schemes ........................  151 
 
 75 Potential combination of multiple design schemes ....................................  154 
 
 
 
 
 
 
 
LIST OF TABLES 
 
TABLE                                                                                                                          Page 
 
 I Architecture Specification of Base and Proposed Architecture ...............  51 
 
 II Necessary Context Registers for Evaluated Kernels................................  51 
 
 III Size of Configuration Cache and Context Registers ................................  52 
 
 IV Power Reduction Ratio by Reusable Context Pipelining ........................  53 
 
 V Notations for Port-Mapping Algorithm....................................................  71 
 
 VI Area Overhead by Dynamic Context Compression .................................  80 
 
 VII Power Reduction Ratio by Dynamic Context Compression ....................  80 
 
 VIII Area Overhead by Dynamic Context Management .................................  94 
 
 IX Power Reduction Ratio by Dynamic Context Management ...................  95 
 
 X Area Reduction Ratio by RSPA and NAF ..............................................  127 
 
 XI Applications Characteristics and Performance Evaluation ......................  128 
 
 XII Power Reduction Ratio by RSP+NAF .....................................................  129 
 
 XIII Comparison of the Basic Coupling Types................................................  133 
 
 XIV Comparison of the Architecture Implementations ...................................  142 
 
 XV Applications Characteristics.....................................................................  143 
 
 XVI Area Cost Comparison ............................................................................  144 
 
 XVII Area Reduction Ratio by Integrated RAA ..............................................  152 
 
 XVIII Entire Power Comparison.........................................................................  153 
CHAPTER I 
INTRODUCTION 
 
With the growing demand for high-quality multimedia, especially over portable media,
there has been continuous development of more sophisticated algorithms for audio,
video, and graphics processing. These algorithms are characterized by data-intensive
computation of high complexity. For such applications, we can consider two
extreme approaches to implementation: software running on a general-purpose processor
and hardware in the form of an ASIC. A general-purpose processor is flexible
enough to support various applications but may not provide sufficient performance to
cope with the complexity of the applications. An ASIC can be optimized
for power and performance, but only for a specific application. A coarse-grained
reconfigurable architecture (CGRA) combines the advantages of these two
approaches: it offers higher performance than a general-purpose processor
and wider applicability than an ASIC.
As market pressure on embedded systems compels designers to meet tighter
constraints on cost, performance, and power, application-specific optimization of a
system becomes inevitable. On the other hand, the flexibility of a system is also
important to accommodate rapidly changing consumer needs. To reconcile these conflicting
demands, domain-specific design has come into focus as a suitable solution for recent
embedded systems. Coarse-grained reconfigurable architecture is exactly such a
domain-specific design: it can boost performance by adopting specific hardware engines,
while it can be reconfigured to adapt to the ever-changing characteristics of the
applications.
____________

The journal model is IEEE Transactions on Very Large Scale Integration Systems.
In spite of the above advantages, the deployment of CGRA is hindered by its significant
area and power consumption. This is because a CGRA is composed of several memory
components and an array of many processing elements, each including an ALU, a multiplier,
a divider, etc. In particular, the processing element (PE) array occupies most of the area
and consumes most of the power in the system in order to support flexibility and high
performance. Therefore, reducing area and power consumption in the PE array has been a
serious concern for the adoption of CGRA.
A. Objective and Approach 
This dissertation explores the problem of reducing area and power in CGRA based on 
architecture optimization. To provide cost-effective CGRA design, the following ques-
tions are considered.  
• How to reduce area and power consumption in CGRA? To save area and power in CGRA, we
should first obtain area and power breakdown data to identify the area- and
power-dominant components. These components may then be optimized by removing
redundancies that waste area and power. Such redundancies may depend on the
characteristics of the computation model or the applications.
• How to design cost-effective CGRA without sacrificing, or even while enhancing,
performance? Ultimately, the goal of designing cost-effective CGRA is that the proposed
approaches save area and power without causing performance degradation. This means that
the proposed cost-effective CGRA keeps the original functionality of CGRA intact and does
not increase the critical path delay. In addition, performance may be enhanced by
optimizing the performance bottleneck while keeping the area- and power-efficient
approaches.
In this dissertation, these central questions are addressed for area/power-critical 
components of CGRA and we suggest new frameworks to achieve these goals. The vali-
dation of the proposed approaches is demonstrated through the use of real application 
benchmarks and gate level simulations. 
B. Contributions  
This work makes the following contributions: 
 
• Low power reconfiguration technique for CGRA. It presents a novel power-
conscious architectural technique called reusable context pipelining (RCP) for 
CGRA to close the power-performance gap between low power-oriented spatial 
mapping and high performance-oriented temporal mapping prevailing in existing 
CGRA architectures. A new configuration cache structure has been proposed to 
support reusable context pipelining with negligible overheads. The temporal 
mapping with RCP has been shown to be a universal approach in reducing power 
and enhancing performance for CGRA.  
• Dynamic context compression for low power CGRA. A new design flow for CGRA has been
proposed to generate the architecture specifications required for modifying the
configuration cache dynamically. A design methodology for a dynamically compressible
context architecture and a new cache structure to support the configurability are
presented to reduce the power consumption in the configuration cache without performance
degradation.
• Dynamic context management for low power CGRA. It presents a novel control 
mechanism of configuration cache called dynamic context management to reduce 
the power consumption in configuration cache without performance degradation. 
A new configuration cache structure is proposed to support dynamic context 
management.  
• A new array fabric for CGRA. A novel array fabric design exploration method 
has been proposed to generate cost-effective reconfigurable array structure. 
Novel rearrangement of processing elements and their interconnection designs 
are introduced for CGRA to reduce area and power consumption without any 
performance degradation.  
• Hierarchical reconfigurable computing arrays for efficient CGRA-based em-
bedded systems. A new reconfigurable computing hierarchy has been proposed to 
design cost-effective CGRA-based embedded systems.  Efficient communication 
structure between processor and reconfigurable computing blocks is introduced 
to reduce performance bottleneck in the CGRA-based architecture. 
C. Dissertation Organization 
The rest of the dissertation is organized as follows. In Chapter II, we describe the
background and related work. Chapter III presents the base architecture implementation
and its cost breakdown. In Chapter IV, we propose a low power reconfiguration technique to
reduce power in the configuration cache. Chapters V and VI present dynamic context
compression and dynamic context management, both of which reduce power consumption in the
configuration cache. In Chapter VII, we devise a cost-effective array fabric for CGRA to
reduce area and power in the PE array. Chapter VIII presents a hierarchical
reconfigurable computing array that reduces area and power while enhancing performance.
Finally, we present an integrated approach that merges the multiple design schemes and
conclude this work in Chapters IX and X.
 
 
 
 
 
 
 
 
 
 
 
CHAPTER II 
BACKGROUND AND RELATED WORKS 
 
A.  Coarse-Grained Reconfigurable Architecture  
A recent trend in the architectural platforms for embedded systems is the adoption of 
reconfigurable computing elements for cost, performance, and flexibility issues [1]. 
Coarse-Grained Reconfigurable Architectures (CGRAs) [1] exploit both the flexibility 
and efficiency, and are shown to be a generally better solution for compute-intensive ap-
plications than fine-grained reconfigurable architectures. There are different styles of 
CGRAs, but many architectures are based on 2D array of ALU-like datapath blocks. 
These are particularly interesting due to the wide acceptance in recent reconfigurable 
processors as well as their expected high performance for many heavy-load applications 
in the domains of signal processing, multimedia, communication, security, and so on.  
Typically, a CGRA consists of a main processor, a Reconfigurable Array Architecture
(RAA), and their interface, as shown in Fig. 1. The RAA has identical processing elements
(PEs), each containing functional units and a few storage units such as an ALU, a
multiplier, a shifter, and a register file. The data buffer provides operand data to the
PE array through a high-bandwidth data bus. The configuration cache (or context memory)
stores the context words used for configuring the PE array elements. The context register
between a PE and a cache element (CE) in the configuration cache is used to keep the
cache access path from being the critical path of the CGRA.
[Figure: main processor, main memory, data buffer, context registers, and the RAA with a 4x4 PE array and a configuration cache of 4x4 CEs]
Fig. 1. Block diagram of general CGRA.
 
 
 
Unlike an FPGA (the most typical fine-grained reconfigurable architecture), which is
built with bit-level configurable logic blocks (CLBs), a CGRA is built with PEs, which
are word-level configurable functional blocks. By raising the granularity of operations
from a bit to a word, a CGRA can improve speed and performance as well as resource
utilization for compute-intensive applications. Another consequence of this raised
granularity is that whereas an FPGA can be used to implement any digital circuit, a CGRA
is targeted only at a limited set of applications, although different CGRAs may target
different application domains. Still, a CGRA retains the idea of “reprogrammable
hardware” in its reprogrammable interconnects as well as in its configurable functional
blocks (i.e., PEs). Moreover, since the amount of configuration bit-stream is greatly
reduced through the raised granularity, the configuration can be changed very fast, even
at runtime. Most CGRAs feature single-cycle configuration change, fetching the
configuration data from a distributed local cache. This unique combination of efficiency
and flexibility, which is the main advantage of CGRA, explains an evaluation result [2]
that under certain conditions CGRAs are actually more cost-effective for wireless
communication applications than alternatives such as FPGA implementations and DSP
architectures. It is worth mentioning that the improved efficiency of CGRAs in terms of
speed, performance, and area is a result of architecture specialization for
compute-intensive applications.
B.  Related Works  
Many kinds of coarse-grained reconfigurable architecture were proposed up to 2001 with
the increasing interest in reconfigurable computing [1]. These CGRAs can be classified
into two groups: mesh-based reconfigurable arrays and linear reconfigurable arrays.
Mesh-based reconfigurable arrays arrange their processing elements (PEs) mainly as a
rectangular 2-D array with horizontal and vertical connections, which provide rich
communication resources for efficient parallelism. Linear reconfigurable arrays support
pipelined execution for stream-based applications with static or dynamic reconfiguration.
MorphoSys [3] and REMARC [4] are representative mesh-based architectures. MorphoSys
consists of a Tiny_RISC processor, an RC (Reconfigurable Cell) array, a frame buffer, a
context memory, and a DMA controller. The RC array is an 8×8 array of ALUs that performs
16-bit operations based on the SIMD programming model. REMARC consists of a global
control unit and an 8x8 array of nano processors. A nano processor consists of an ALU, a
16-entry data RAM, an 8-entry register file, data input registers, and data output
registers. The configuration for each nano processor is stored in a 32-entry instruction
RAM to support the MIMD execution model as well as the SIMD model. RaPiD [5] and
PipeRench [6][7] have a linear array structure. RaPiD provides different computing
resources such as ALUs, RAMs, multipliers, and registers. These resources are irregularly
distributed along one dimension and are mostly statically reconfigured. In contrast,
PipeRench [6][7] relies on dynamic reconfiguration, allowing the reconfiguration of a
processing element (PE) in each execution cycle. It consists of strips composed of
interconnect and PEs with registers and ALUs. The reconfigurable fabric allows the
configuration of one pipeline stage in every cycle, while concurrently executing all
other stages.
Since then, many more CGRAs [2][8][9][10][11][12][13][14][15][16][17][18][19] have been
proposed and have continued to evolve. Most of them comprise a fixed set of specialized
processing elements (PEs) and interconnection fabrics between them. Run-time control of
the operation of each PE and of the interconnection provides the reconfigurability.
However, such a fixed architecture has limitations in optimizing area cost and
performance for various applications. For example, MorphoSys [3] consists of an 8x8 array
of Reconfigurable Cells coupled with a Tiny_RISC processor through a system bus. It shows
good performance for regular code segments in computation-intensive domains but requires
a large amount of area and power. The XPP configurable system-on-chip architecture [10]
is another example. XPP has a 4x4 or 8x8 reconfigurable array and a LEON processor with
an AMBA bus architecture. A processing element of XPP is composed of an ALU and some
registers. Since the processing elements do not include heavy resources, the total area
cost is not high, but the range of applicable domains is restricted. In addition, XPP
shows significant communication overhead between the processor and the RAA through the
system bus. REMARC [4] is a reconfigurable multimedia array coprocessor that consists of
a global control unit and an 8x8 array of nano processors. Like the PEs of XPP, the nano
processors do not include heavy resources, which likewise restricts the range of
applicable domains. The communication with the main processor is faster than in [3] or
[20] because the processor can access the register set through coprocessor data transfer
instructions. However, the limited size of the register set causes heavy register-array
traffic, which restricts performance enhancement. ADRES [21] tightly couples a VLIW
processor and a reconfigurable matrix through a shared register file. The reconfigurable
matrix is used to accelerate dataflow-like kernels in a highly parallel way, whereas the
VLIW processor executes the non-kernel code by exploiting instruction-level parallelism.
Although ADRES also provides fast communication between the VLIW processor and the
matrix, the entire structure is very dependent on the VLIW processor architecture and
requires a huge register file for the communication. Therefore, the performance is
limited by the size of the register file. Most design space exploration techniques
previously suggested are limited to the configuration of the internal structure of a PE
and the interconnection scheme. Such configuration techniques are in general good at
obtaining high performance but require high hardware cost. This is mainly because even a
primitive PE design should be equipped with basic functional resources to gain reasonable
performance. Moreover, adding a small functional block to a primitive PE design greatly
increases the total cost of the aggregate architecture. In the ADRES template [21], an
XML-based architecture description language is used to define the overall topology,
supported operation set, resource allocation, timing, and even the internal organization
of each processing element. KressArray [20] also defines exploration properties such as
array size, interconnections, and functionality of certain processing elements. However,
neither template supports common resources shared among processing elements; thus some
critical functional resources may have low utilization while occupying a large area.
The research on low power CGRA has three different aspects: architecture exploration,
code compilation and mapping, and physical implementation. Although the architecture
exploration flows that have been suggested in [8][20][21][22][23][24][25][26][27][28]
generate a good instance of CGRA considering area and performance, they do not deal with
power consumption. Interconnect architecture explorations have been suggested for low
energy [21][29]. Because a CGRA has complex interconnection for performance and
flexibility, power consumption due to interconnection is crucial. In [8][29] the authors
have proposed energy-aware interconnection exploration to minimize energy by changing the
topology between the global register file and the function units. However, this
exploration only provides the trade-off between performance and energy. In [30] the
authors have suggested a hierarchical generalized mesh structure exploration that
continues to exploit locality while reducing the cost of long connections, but it has
been evaluated only for specific reconfigurable DSPs. In the case of code compilation and
mapping, loops are exploited mainly for performance
[9][31][32][33][34][35][36][37][38][39][40][43][44]. Many reconfigurable architectures
have been implemented with various technologies [6][10][12][43][44][45][46]. Most of this
research has focused on efficient design with respect to small area and high performance.
In [6][8], even though the authors have presented power estimation data for the
implemented architectures, these are only ancillary results and they do not offer
power/energy-aware implementation. In [2][14], the authors have emphasized that the
implemented architectures are power-efficient compared to fine-grained architectures such
as FPGA when running specific applications. These architectures are not general CGRAs but
are specific to running certain applications with low power consumption. In [6], the
authors have fabricated PipeRench [7] in a 0.18 micron process. Their experimental
results show that the power consumption is significantly high. The authors explain that
the increase in power consumption is due to dynamic reconfiguration, which requires
frequent configuration and state memory accesses. Hence, power consumption by dynamic
reconfiguration is a serious overhead compared to other types of IP cores such as ASIC or
ASIP.
 
 
 
 
 
 
 
 
 
CHAPTER III 
BASE CGRA IMPLEMENTATION 
 
We have first designed a conventional CGRA as the base architecture and implemented 
it at the RT-level. This conventional architecture will be used throughout this disserta-
tion as a reference for quantitative comparison with our cost-effective approaches. 
A.  Reconfigurable Array Architecture Coupling with Processor 
A typical coarse-grained reconfigurable architecture consists of a microprocessor, a Re-
configurable Array Architecture (RAA), and their interface. We can consider three ways 
of connecting the RAA to the processor [47]. First, the array can be connected to a bus 
as an ‘Attached IP’ shown in Fig. 2(a). Secondly, the array can be placed next to the 
processor as a ‘Coprocessor’ as shown in Fig. 2(b). In this case, the communication is 
done using a protocol similar to those used for floating point coprocessors. Finally, the 
array can be placed inside the processor like a ‘FU (Functional Unit)’ as shown in Fig. 
2(c). In this case, the instruction decoder issues special instructions to perform specific 
functions on the reconfigurable array as if it were one of the standard functional units of 
the processor.  
 
 
[Figure: (a) Attached IP: processor, memory, and RAA on the system bus; (b) Coprocessor: RAA coupled to the processor through a co-processor interface and MUX unit; (c) Functional unit: RAA inside the processor]
Fig. 2. Basic types of reconfigurable array coupling.
 
 
 
    
[Figure blocks: RISC processor, external RAM interface, AHB, DMA controller, cache controller, configuration memory, configuration cache, data memory, frame buffer, frame buffer controller, PE array, and execution controller; the Reconfigurable Array Architecture (RAA) is marked as a sub-block]
Fig. 3. Block diagram of base CGRA.
 
 
 
We have implemented the first type of reconfigurable architecture, connecting the RAA as
an Attached IP. In this case, the speed improvement from the RAA may have to compensate
for significant communication overhead. However, the main benefit of this type is the
ease of constructing such a system using a standard processor and a standard
reconfigurable array without any modification. The system consists of a RISC processor, a
main memory block, a DMA controller, and an RAA. The RISC processor is a small and simple
32-bit processor with three pipeline stages, and the communication bus is AMBA AHB [48],
which couples the RISC processor and the DMA controller as master devices and the RAA as
a slave device. The RISC processor executes control-intensive, irregular code segments
and the RAA executes data-intensive kernel code segments. The block diagram of the entire
reconfigurable architecture is shown in Fig. 3.
B.  Base Reconfigurable Array Architecture 
The base RAA is similar to MorphoSys [3], a representative CGRA that shows high
performance and flexibility and has a physical implementation. The difference from
MorphoSys is that the proposed architecture supports both the SIMD and MIMD execution
models, whereas the memory structure (frame buffer and configuration cache) of MorphoSys
supports only the SIMD model. The SIMD model is efficient for data parallelism since it
saves configurations and cache storage by sharing one instruction among multiple data.
However, it is limited in that individual PEs cannot execute different instructions
independently at the same time. Therefore, we take a MIMD-style CGRA in which each PE can
be configured separately to process its own instructions. Since this allows more
versatile configurations than the SIMD style, we adopt more general forms of loop
pipelining [32] through simultaneous execution of multiple iterations of a loop in a
pipeline.
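The storage trade-off between the two execution styles can be sketched in C. This is only an illustration: the function names, the enum, and the assumption that SIMD broadcasts exactly one context word per column are hypothetical and are not taken from the implementation described here.

```c
#include <assert.h>
#include <stdio.h>

/* Execution styles contrasted in the text. */
typedef enum { STYLE_SIMD, STYLE_MIMD } exec_style;

/*
 * Number of distinct context words needed to configure a rows x cols PE
 * array for one cycle.
 * Assumption: in SIMD style one context word is broadcast to every PE of a
 * column; in MIMD style every PE receives its own context word.
 */
unsigned context_words_per_cycle(unsigned rows, unsigned cols, exec_style style)
{
    assert(rows > 0 && cols > 0);
    return (style == STYLE_SIMD) ? cols : rows * cols;
}

/* Print the comparison for an 8x8 array. */
void print_context_comparison(void)
{
    printf("SIMD: %u words/cycle, MIMD: %u words/cycle\n",
           context_words_per_cycle(8, 8, STYLE_SIMD),
           context_words_per_cycle(8, 8, STYLE_MIMD));
}
```

Under these assumptions an 8x8 array needs 8 context words per cycle in SIMD style and 64 in MIMD style, which is the extra configuration storage the MIMD style pays for its flexibility.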
[Figure: PE datapath with input multiplexors MUX A and MUX B, a register file (R0 to R3), ALU and MULT, shift logic, SAT logic, and an output register, configured by a context word from the cache element. DA, DB: from frame buffer; D_OUT: to frame buffer.]
Fig. 4. Processing element structure of base RAA.
 
 
 
Base architecture specification is determined by our target application domain in-
cluding audio/video codec as well as various benchmark kernels. Detailed features of 
each component of the architecture are as follows. 
1.  Processing Element 
Each PE is a dynamically reconfigurable unit executing arithmetic and logical operations.
The inner structure of a PE is shown in Fig. 4. A PE contains a 16-bit ALU, a 16 x 16-bit
array multiplier, shift logic, arithmetic saturation logic (SAT_Logic), multiplexors, and
registers.
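As a rough behavioral sketch of how these resources could combine, the C code below multiplies two 16-bit operands, shifts the 32-bit product, and saturates the result back to 16 bits. The ordering multiplier, shift logic, saturation and the function names are assumptions made for illustration; the actual PE datapath is the one shown in Fig. 4.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a 32-bit value to the signed 16-bit range (arithmetic saturation). */
int16_t saturate16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/*
 * Hypothetical PE operation: 16 x 16-bit multiply, right shift, saturate.
 * The shift rounds toward negative infinity, like a hardware arithmetic
 * shift; it is written out explicitly because >> on negative values is
 * implementation-defined in C.
 */
int16_t pe_mul_shift_sat(int16_t a, int16_t b, unsigned shift)
{
    assert(shift < 16u);
    int32_t product = (int32_t)a * (int32_t)b;   /* always fits in 32 bits */
    int32_t step = (int32_t)1 << shift;
    int32_t shifted = (product >= 0)
        ? product >> shift
        : -((-product + step - 1) >> shift);
    return saturate16(shifted);
}
```

For example, 300 x 300 with no shift overflows 16 bits and saturates to 32767, while 100 x 50 shifted right by 4 gives 312.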
2.  PE Array 
The PE array is an 8x8 reconfigurable array of PEs, which we consider big enough for most
of the applications in our experiments. We assume that the computation model of the array
is loop pipelining based on temporal mapping [32] for high performance: each iteration of
the application kernel (critical loop) is mapped onto one column of the square array.
Therefore, in this PE array, columns have more interconnections than rows. Fig. 5 shows
the interconnection structure of the PE array. The interconnection in rows is used mainly
for communication that takes care of loop-carried dependencies. Columns and rows have
nearest-neighbor and hopping interconnections for connectivity between two PEs in a half
column and a half row. In addition, each column has pair-wise interconnections and two
global buses for connectivity between two half columns. Each row shares two read buses
and one write bus.

[Figure: 8x8 PE array with one CE per PE, showing the interconnection in the column direction (global bus, pair-wise, hopping), the interconnection in the row direction, and the buses from and to the frame buffer]
Fig. 5. Interconnection structure of PE array.
3.  Frame Buffer 
The frame buffer (FB) of MorphoSys does not support concurrency between the load of two
operands and the store of a result in the same column, since this is not needed in
SIMD-style mapping. However, in the case of MIMD-style execution, concurrent load and
store operations can happen between different loop iterations. So our FB has two sets of
buffers, each having three banks: one bank connected to the write bus and the other two
banks connected to the read buses. However, any combination of one-to-one mapping between
the three banks and the three buses is possible.

[Figure: 8-row by 8-column arrangement of CEs, each feeding its PE through a context register (REG), with a cache controller driving the CS signal (8 x 8-bit)]
Fig. 6. Distributed configuration cache structure.
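The one-to-one constraint between banks and buses can be illustrated with a small C check. The bus numbering and function names below are hypothetical; the sketch only shows that each of the three banks of one set must take a distinct bus, which leaves 3! = 6 legal assignments.

```c
#include <assert.h>
#include <stdbool.h>

enum { NUM_BANKS = 3, NUM_BUSES = 3 };

/* Hypothetical bus numbering: two read buses and one write bus per set. */
enum { BUS_READ_A = 0, BUS_READ_B = 1, BUS_WRITE = 2 };

/*
 * bank_to_bus[b] is the bus assigned to bank b.
 * Returns true when the assignment is one-to-one, i.e. every bus is used by
 * exactly one bank.
 */
bool is_one_to_one(const int bank_to_bus[NUM_BANKS])
{
    bool used[NUM_BUSES] = { false, false, false };
    for (int b = 0; b < NUM_BANKS; b++) {
        int bus = bank_to_bus[b];
        if (bus < 0 || bus >= NUM_BUSES || used[bus])
            return false;
        used[bus] = true;
    }
    return true;
}

/* Count all one-to-one bank-to-bus assignments by brute force. */
int count_valid_mappings(void)
{
    int count = 0;
    for (int a = 0; a < NUM_BUSES; a++)
        for (int b = 0; b < NUM_BUSES; b++)
            for (int c = 0; c < NUM_BUSES; c++) {
                int map[NUM_BANKS] = { a, b, c };
                if (is_one_to_one(map))
                    count++;
            }
    assert(count == 6);   /* 3! permutations of three buses */
    return count;
}
```

An assignment such as {BUS_WRITE, BUS_READ_A, BUS_READ_B} is accepted, whereas giving two banks the same read bus is rejected.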
4.  Configuration Cache 
The context memory of MorphoSys is designed for broadcasting configuration, so PEs in the
same row or column share the same context word for SIMD-style operation [3]. However, in
the case of MIMD-style operation, each PE can be configured by a different context word.
Our configuration cache is composed of 64 Cache Elements (CEs) and a cache controller for
controlling the CEs (Fig. 6). Each CE has 32 layers, each of which stores a context that
configures the corresponding PE. The context register between a PE and a CE is used to
keep the cache access path from being the critical path of the CGRA.
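A minimal data-structure sketch of this organization is given below in C. It assumes a 32-bit context word and a controller that selects the same layer for all CEs in a cycle; both are simplifications for illustration, and all type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { NUM_CES = 64, NUM_LAYERS = 32 };

/* Assumption: a context word fits in 32 bits. */
typedef uint32_t context_word;

typedef struct {
    context_word layer[NUM_LAYERS];      /* one context per layer */
} cache_element;

typedef struct {
    cache_element ce[NUM_CES];           /* one CE per PE */
    context_word  context_reg[NUM_CES];  /* register between CE and PE */
} config_cache;

/*
 * Model one cache read cycle: every CE outputs the context stored in the
 * selected layer, and the value is latched in its context register, so the
 * PE reads the register instead of the cache itself.
 */
void load_layer(config_cache *cache, unsigned layer)
{
    assert(cache != 0 && layer < (unsigned)NUM_LAYERS);
    for (int i = 0; i < NUM_CES; i++)
        cache->context_reg[i] = cache->ce[i].layer[layer];
}

/* Capacity of the context storage in bits (context registers excluded). */
unsigned long cache_capacity_bits(void)
{
    return (unsigned long)NUM_CES * NUM_LAYERS * 8u * sizeof(context_word);
}

/* Small usage check: write a context into layer 3 of CE 5 and load it. */
int demo_load_layer(void)
{
    static config_cache cache;           /* zero-initialized */
    cache.ce[5].layer[3] = 0xABCDu;
    load_layer(&cache, 3);
    return cache.context_reg[5] == 0xABCDu && cache.context_reg[0] == 0;
}
```

With the assumed 32-bit word, 64 CEs of 32 layers hold 65536 bits of context storage; a different context width would scale this figure proportionally.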
5.  Execution Controller
Controlling the PE array execution directly from the main processor through AMBA 
AHB will cause high overhead in the main processor. In addition, the latency of the con-
trol will degrade the performance of the whole system, especially when dynamic recon-
figuration is used. So a separate control unit is necessary to control the execution of the 
PE array every cycle. The execution controller receives the encoded control data from 
the main processor. The control data contains read/write mode and addresses of frame 
buffer and cache for guaranteeing correct operations of the PE array. 
C.  Breakdown of Area, Delay, and Power Cost 
We have implemented the base architecture shown in Fig. 3 at the RT-level with VHDL. We
have synthesized a gate-level circuit from the VHDL description and analyzed its area,
delay, and power cost. The synthesis has been done using Design Compiler [49] with
0.18 µm technology. We have used the DesignWare [49] library for the multipliers
(carry-save array synthesis model) and dividers (restoring carry-look-ahead, 2-way
overlapped synthesis model). An SRAM Macro Cell library is used for the frame buffer and
configuration cache. ModelSim [50] and PrimePower [49] have been used for gate-level
simulation and power estimation.
 
(a) Entire CGRA
Component              Area (GEs)   Share
RAA                    896318       90%
RISC                   53670        5%
AMBA                   37777        1%
DMA                    8911         1%
Interconnection        31433        3%

(b) RAA
Component              Area (GEs)   Share
PE Array               659635       70%
Configuration Cache    150012       15%
Frame Buffer           129086       13%
Execution Controller   4009         2%

GEs: Gate Equivalents

Fig. 7. Area cost breakdown for CGRA.
 
 
1.  Area and Delay  
As shown in Fig. 7 (a), the RAA occupies as much as 90 % of the total area of the CGRA.
Fig. 7 (b) shows a more detailed area breakdown of the RAA. The PE array occupies as much
as 70.5 % of the total area of the RAA, which is mainly due to heavy computational
resources such as the ALU, multiplier, etc. in each PE. The critical path of the entire
RAA is also in the PEs and its delay is given by
T_{Critical path} = T_{Multiplexor} + T_{Multiplier} + T_{Shift_logic} + T_{others}    (1)
(8.96 ns = 0.32 ns + 5.21 ns + 1.42 ns + 1.78 ns)
From the area and delay cost breakdown of the RAA as shown in Figs. 7 and 8, we see that
PE array design is crucial for cost-effective design. In the case of area, Fig. 8 (a)
shows that the multiplier occupies about 33.4% of the total area in a PE. In the case of
delay, the multiplier again takes as much as 58.12 % (Fig. 8 (b)). Therefore, in our PE
design, the multiplier is considered to be the area-critical and delay-critical resource.
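As a back-of-the-envelope check, and assuming that the 8.96 ns path of Eq. (1) is the only limit on the clock period, the critical path bounds the clock frequency and gives the multiplier share as follows. The symbol f_max is introduced here for illustration only.

```latex
f_{\max} = \frac{1}{T_{\text{Critical path}}} = \frac{1}{8.96\,\text{ns}} \approx 111.6\,\text{MHz},
\qquad
\frac{T_{\text{Multiplier}}}{T_{\text{Critical path}}} = \frac{5.21\,\text{ns}}{8.96\,\text{ns}} \approx 58.1\,\%
```

Under this assumption the bound is compatible with the 100 MHz operating frequency (10 ns period) used for the power measurements in this chapter.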
 
[Figure: bar charts per PE component. (a) Area, in gate equivalents, for multiplexor, ALU, shift logic, multiplier, register file, context register, output register, and interconnection; (b) Delay, in ns, for multiplexor, ALU, shift logic, multiplier, register file, context register, and output register]
Fig. 8. Cost analysis for a PE.
 
(a) Entire CGRA
Component                           Power (mW)   Share
Reconfigurable Array Architecture   417.33       92.09%
RISC+AHB+Interconnection            31.64        7%
DMA                                 4.23         0.9%

(b) RAA
Component                           Power (mW)   Share
PE Array                            212.25       50.8%
Configuration Cache                 190.03       45.3%
Frame Buffer                        15.05        3.4%
Execution Controller                1.28         0.3%

Fig. 9. Power cost breakdown for CGRA running 2D-FDCT.
 
 
 
2.  Power   
To obtain power breakdown data, we have used 2D-FDCT as the kernel for simulation-based
power measurement. The simulation has been done under the typical operating condition of
100 MHz frequency, 1.8 V Vdd, and 27℃ temperature. As can be observed from Fig. 9 (a),
the RAA consumes about 92.09% of the total power consumed in the CGRA. Fig. 9 (b) shows a
more detailed power breakdown of the RAA. The RAA spends about 50.8% of its total power
in the PE array, which consists of many components such as ALUs, multipliers, shifters,
and register files. The PE array consumes most of the power, which is natural because a
coarse-grained architecture aims to achieve high performance and flexibility with plenty
of resources. The configuration cache spends about 45.3% of the RAA power, which is the
second largest share. Even though the frame buffer uses the same kind of SRAM as the
configuration cache, it consumes much less power (3.4%). This is because the
configuration cache performs read operations frequently to load the context words, one
for each PE, whereas the frame buffer performs load/store operations less frequently,
accessing data on a row basis rather than for every PE.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
CHAPTER IV 
LOW POWER RECONFIGURATION TECHNIQUE 
 
In this chapter, we suggest a novel power-conscious architectural technique called
reusable context pipelining (RCP) to reduce power consumption in the configuration cache
[51]. RCP is a universal approach to reducing power and enhancing performance for CGRA
because it closes the power-performance gap between low power-oriented spatial mapping
and high performance-oriented temporal mapping. Furthermore, we propose a new
configuration cache structure (called hybrid configuration cache) to support reusable
context pipelining with reduced memory size. Experimental results show that the proposed
approach saves much power even with reduced configuration cache size. The power reduction
ratios in the configuration cache and in the entire architecture are up to 86.33 % and
47.60 %, respectively, compared to the base architecture.
A.  Motivation  
In this section, we present the motivation of our power-conscious approaches. The main 
motivation is due to the characteristics of loop pipelining (spatial mapping and temporal 
mapping) [32] based on MIMD-style execution model.  
1.  Loop Pipelining  
To represent the characteristics of loop pipelining [32], we examine the difference
between SIMD and MIMD in the RAA with a simple example. We assume a mesh-based 4x4
coarse-grained reconfigurable array of PEs, where a PE is a basic reconfigurable element
composed of an ALU, an array multiplier, etc., and the configuration is controlled by the
words stored in the CE as shown in Fig. 10 (a). In addition, we assume that the Frame
Buffer simply has one set of three banks with two read ports and one write port,
supporting any combination of one-to-one mapping between the three banks and the three
buses. Fig. 10 (b) shows such a Frame Buffer and data bus structure, where the PEs in
each row of the array share two read buses and one write bus. The 4x4 array has
nearest-neighbor interconnections as shown in Fig. 10 (c), and each row or each column
has a global bus as shown in Fig. 10 (d).

[Figure: (a) Distributed cache structure: 4x4 CEs paired with 4x4 PEs; (b) Frame buffer and data bus: banks A, B, and C connected through MUX and DEMUX to the 4x4 PE array over 4n-bit buses, with bus taps taking off n bits per PE; (c) Nearest neighbor interconnection; (d) Global bus interconnection]
Fig. 10. 4x4 reconfigurable array.
 
 
28 
(a) Before parallelization 

    for (i = 0; i <= 3; i = i+1)
    {
        for (j = 0; j <= 3; j = j+1)
            z[i] = (x[i][j]+y[i][j])*c[j] + z[i];
        z[i] = K*z[i];
    }

(b) After parallelization 

    for (i = 0; i <= 3; i = i+1)
    {
        t1 = x[i][0]+y[i][0];      /* LD/+ */
        t2 = x[i][1]+y[i][1];
        t3 = x[i][2]+y[i][2];
        t4 = x[i][3]+y[i][3];
        t1 = t1*c[0];              /* ×    */
        t2 = t2*c[1];
        t3 = t3*c[2];
        t4 = t4*c[3];
        tmp1 = t1+t2;              /* 1+   */
        tmp2 = t3+t4;
        z[i] = tmp1+tmp2;          /* 2+   */
        z[i] = K*z[i];             /* ×/ST */
    }

Fig. 11. C-code of Eq. (2). 
 
 
 
Consider two square matrices X and Y, both of order N, and the computation of Z, an 
N-element vector, given by 

\[ Z(i) = K \times \sum_{j=0}^{N-1} \{ (X(i,j) + Y(i,j)) \times C(j) \} \qquad (2) \]

where i, j = 0, 1, …, N-1, C(j) is a constant vector, and K is a constant.  
Consider N = 4 for mapping the computation defined in Eq. (2) onto our 4x4 PE array, 
and let the computation be given as a C program (Fig. 11 (a)). It is assumed that the 
input matrices X and Y, the constant vector C and the output vector Z are stored in the 
arrays x[i][j], y[i][j], c[j] and z[i], and that z[i] is initialized to zero. Fig. 11 (b) shows the 
parallelized code for execution on the array as shown in Fig. 12, where we assume that 
matrices X and Y have been loaded into the Frame Buffer (FB) and all of the constants 
(C and K) have already been saved in a register file of each PE. Vector Z is stored in the 
FB after it has been processed by the PE array, as shown in Fig. 12 (a). 
 
 
 
[Figure: (a) Operand and result data in FB: the elements x[i][j] and y[i][j] and the 
results z[i] stored in the Frame Buffer banks; (b) Configuration broadcast to the 4x4 PE 
array in the column direction and in the row direction] 

Symbol    Meaning
LD/+      Data Load and Addition
NOP       No Operation
×         Multiplication
1+, 2+    Addition
×/ST      Multiplication and Store

(c) SIMD model 

Broadcast: column direction, then row direction, then column direction

Cycle Time  1     2     3     4     5     6     7     8     9     10    11
Column#1    LD/+  NOP   NOP   NOP   ×     1+    2+    ×/ST  NOP   NOP   NOP
Column#2    -     LD/+  NOP   NOP   ×     1+    2+    NOP   ×/ST  NOP   NOP
Column#3    -     -     LD/+  NOP   ×     1+    2+    NOP   NOP   ×/ST  NOP
Column#4    -     -     -     LD/+  ×     1+    2+    NOP   NOP   NOP   ×/ST

(d) Loop pipelining schedule 

Cycle Time  1     2     3     4     5     6     7     8
Column#1    LD/+  ×     1+    2+    ×/ST  NOP   NOP   NOP
Column#2    -     LD/+  ×     1+    2+    ×/ST  NOP   NOP
Column#3    -     -     LD/+  ×     1+    2+    ×/ST  NOP
Column#4    -     -     -     LD/+  ×     1+    2+    ×/ST

Fig. 12. Execution model for CGRA. 

 
SIMD-based scheduling enables parallel execution of multiple loop iterations 
as shown in Fig. 12 (c), whereas MIMD-based scheduling enables loop pipelining as 
shown in Fig. 12 (d). The first row of Fig. 12 (c) gives the direction of configuration 
broadcast. The second row of Fig. 12 (c) and the first row of Fig. 12 (d) give the 
schedule time in cycles from the start of the loop. In the SIMD model, the load and 
addition operations are executed column by column until the 4th cycle, with the 
configuration broadcast in the column direction. Then the PEs in a row perform the same 
operation, with the configuration broadcast in the row direction. In the case of loop 
pipelining, the PEs in the first column perform load and addition operations in the first 
cycle and multiplications in the second cycle. In the next two cycles, the PEs in the first 
column perform summations, while the PEs in the next column perform multiplication 
and summation operations. When the first column performs the multiplication/store 
operation in the 5th cycle, the fourth column performs multiplication. Comparing the 
latencies (11 cycles versus 8 cycles), SIMD takes three more cycles.  
As shown in this example, the SIMD model does not utilize PEs efficiently, since all 
data must be loaded before computations of the same type are performed synchronously. 
On the other hand, since MIMD allows any type of computation at any moment, a PE 
does not need to wait for specific data to be loaded but can process other data that is 
readily available. Loop pipelining is an effective way of exploiting this fact, thereby 
utilizing PEs better. Loop pipelining in the example of Fig. 11 improves performance by 
three cycles compared to SIMD, and for loops with more frequent memory operations 
the improvement will be larger. 
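
The latency gap in Fig. 12 can be reproduced with a small cycle-count model. The sketch below is only an illustration: the function names and the general formulas are assumptions that extrapolate from the single 4-column, 5-step schedule of Fig. 12, and they assume that load and store steps are issued one column per cycle because the PEs in a row share the read and write buses.

```c
#include <assert.h>

/* Cycle count of the SIMD schedule in Fig. 12 (c), under the assumption
 * that the load step and the store step are issued one column per cycle
 * and that the steps in between run on all columns in lock step.
 * `steps` counts every step of the loop body, including load and store. */
int simd_latency(int columns, int steps)
{
    assert(columns > 0 && steps >= 2);
    return columns          /* staggered load step        */
         + (steps - 2)      /* lock-step compute steps    */
         + columns;         /* staggered store step       */
}

/* Cycle count of the loop-pipelining schedule in Fig. 12 (d): each column
 * starts one cycle after its left neighbor and then runs its steps back
 * to back. */
int pipelined_latency(int columns, int steps)
{
    assert(columns > 0 && steps >= 1);
    return (columns - 1) + steps;
}
```

For the example (4 columns, 5 steps: LD/+, ×, 1+, 2+, ×/ST), `simd_latency(4, 5)` gives 11 cycles and `pipelined_latency(4, 5)` gives 8 cycles, which matches the three-cycle difference read off the two schedules.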
2.  Spatial Mapping and Temporal Mapping  
When mapping kernels onto the reconfigurable architecture with loop pipelining, we can 
consider two mapping techniques [32]: spatial mapping and temporal mapping. Fig. 13 
shows the difference between the two techniques using the previous example. In the case 
of temporal mapping (Fig. 13 (a)), as in the previous illustration of loop pipelining in 
Fig. 12 (d), a PE executes multiple operations within a loop by changing its configuration 
dynamically. Therefore, complex loops with many operations and heavy data 
dependencies can be mapped better in temporal fashion, provided that the configuration 
cache has enough layers to execute the whole loop body.   
In the case of spatial mapping, a loop body is spatially mapped onto the reconfigurable 
array, meaning that each PE executes a fixed operation with a static configuration, as 
shown in Fig. 13 (b). The advantage of spatial mapping is that it may not need 
reconfiguration during execution of a loop. As can be seen from Fig. 13, spatial mapping 
needs only one or two cache layers whereas temporal mapping needs 4 cache layers. One 
disadvantage of spatial mapping is that spreading all the operations of the loop body 
over the limited reconfigurable array may require too many resources. Moreover, data 
dependencies between the operations must be handled by allocating interconnect 
resources to provide a path and by inserting registers (or using PEs) in the path to 
synchronize the arrival of operands. Therefore, if the loop is simple enough for its body 
to fit on the limited reconfigurable array and there is not much data dependency between 
the operations, then spatial mapping is the right choice. The effectiveness of the mapping 
strategies depends on the characteristics of the target architecture as well as the target 
application. 

(a) Temporal mapping (Distributed Cache with 5 Layers) 

Cycle Time  1     2     3     4     5     6     7     8
Column#1    LD/+  ×     1+    2+    ×/ST  NOP   NOP   NOP
Column#2    -     LD/+  ×     1+    2+    ×/ST  NOP   NOP
Column#3    -     -     LD/+  ×     1+    2+    ×/ST  NOP
Column#4    -     -     -     LD/+  ×     1+    2+    ×/ST

(b) Spatial mapping (Distributed Cache with 1 or 2 layers) 

Cycle Time  1     2     3     4     5     6     7     8
Column#1    LD/+  LD/+  LD/+  LD/+  ×/ST  ×/ST  ×/ST  ×/ST
Column#2    -     ×     ×     ×     ×     NOP   NOP   NOP
Column#3    -     -     1+    1+    1+    1+    NOP   NOP
Column#4    -     -     -     2+    2+    2+    2+    NOP

[Figure: for each mapping, the 4x4 PE array with the operation executed by each PE at 
the 5th cycle] 

Fig. 13. Comparison between temporal mapping and spatial mapping. 
 
 
 
B. Individual Approaches to Reduce Power in Configuration Cache  
In this section, we suggest individual power-conscious approaches for two different 
execution models (spatial mapping and temporal mapping) and describe their limitations. 
These approaches reduce power by exploiting the characteristics of spatial mapping and 
temporal mapping [52][53][54].  
1. Spatial Mapping with Context Reuse 
Because most of the power consumed in the configuration cache is due to memory read 
operations, one of the most effective ways to reduce power in the configuration cache is 
to reduce the frequency of read operations.  

[Figure: 4x4 PE array in which each PE is connected to its CE of the spatial cache 
through a context register (REG); the register is clocked by a gated clock derived from 
CLK and a 1-bit REG Enable signal] 

Fig. 14. Configuration cache structure for context reuse. 

Even though temporal mapping is more efficient in mapping complex loops onto 
the reconfigurable array, it requires many configuration data layers for each PE and 
performs power-consuming read operations in every cycle. On the other hand, spatial 
mapping does not need to read a new context word from the cache every cycle because 
each PE executes a fixed operation within a loop. As shown in Fig. 14, if the context 
register between a CE and a PE is driven by a gated clock, one spatial cache (see 
footnote 1) read operation is enough in spatial mapping to configure the PEs for static 
operations: once the clock is gated off, the context register holds its output. In summary, 
spatial mapping with context reuse is more efficient than temporal mapping from the 
viewpoint of power consumption in the configuration cache. However, not all loops can 
be spatially mapped because of the limitations of spatial mapping. Moreover, if we 
consider performance alone, temporal mapping is a better choice for loops with a long 
and complex loop body. In the next subsection, we propose a new cache structure and 
mapping technique that reduce power consumption while retaining the merits of 
temporal mapping. 
Footnote 1: We use the term 'spatial cache' for the configuration cache that is connected 
to context registers implemented with a gated clock. 'Spatial' means that this 
configuration cache is used for spatial mapping with context reuse. The name 
differentiates the spatial cache from a general configuration cache. 

2. Temporal Mapping with Context Pipelining 
As shown in Fig. 13 (a), in temporal mapping with loop pipelining, operations flow 
column by column from left to right. In Fig. 13 (a), for example, the first column executes 
'LD/+' in the first cycle, and in the second cycle the second column executes 'LD/+' 
while the first column executes '×'. In temporal mapping, therefore, not every PE needs 
its own CE. Instead, only the PEs in the first column have CEs, and every other PE 
fetches its context word from the left neighboring column. By organizing a pipelined 
cache structure as shown in Fig. 15, we can propagate the context words column by 
column through the pipeline. In this way, we can remove most of the CEs from the array, 
keeping only the temporal cache (see footnote 2), thereby saving power without any 
performance degradation. In summary, temporal mapping with context pipelining can 
efficiently support long and complex loops while reducing power consumption in the 
configuration cache. However, temporal mapping with context pipelining still needs 
cache-read operations to provide context words to the first column of the PE array, 
whereas spatial mapping with context reuse needs no further cache-read operations after 
the initial one.  
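
The difference between the three schemes can be made concrete by counting context words read from the configuration cache. The following is a rough sketch with hypothetical function names; it assumes that every CE present in the array performs one read per cycle (including cycles whose context word is a NOP), which the text does not state explicitly.

```c
#include <assert.h>

/* Context words read from the configuration cache while a loop runs for
 * `cycles` cycles on an array with n rows and m columns. */

/* Base temporal mapping: every PE has its own CE and reads it each cycle. */
long reads_distributed(int n, int m, long cycles)
{
    assert(n > 0 && m > 0 && cycles >= 0);
    return (long)n * m * cycles;
}

/* Context pipelining: only the CEs of the first column are read; the other
 * columns take the word from the context register of their left neighbor. */
long reads_context_pipelining(int n, int m, long cycles)
{
    assert(n > 0 && m > 0 && cycles >= 0);
    (void)m;                    /* the column count does not matter here */
    return (long)n * cycles;
}

/* Context reuse: one read per PE, after which the gated context register
 * holds the word for the rest of the loop. */
long reads_context_reuse(int n, int m)
{
    assert(n > 0 && m > 0);
    return (long)n * m;
}
```

Under these assumptions, a 4x4 array running for 8 cycles reads 128 words with the distributed cache, 32 with context pipelining, and 16 with context reuse. The counts say nothing about energy per read, which depends on the memory sizes.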

[Figure: (a) Cache structure: in each row, a CE of the temporal cache feeds the first PE, 
and context registers (REG) pass the context word on to the PEs in the following 
columns; the spatial cache is not used. (b) Context pipelining: a loop body consisting of 
Load, Execute1 to Execute7 and Store flows through the columns, with the operation 
executed in the current cycle marked] 

Fig. 15. Cache structure for context pipelining. 

Footnote 2: We use the term 'temporal cache' for the cache elements connected to the 
PEs in the first column. 'Temporal' means that these CEs are used for temporal mapping 
with context pipelining. The name differentiates the temporal cache from a general 
configuration cache and from the spatial cache. 

3.  Limitation of Individual Approaches  
As mentioned in the previous section, even though the individual low-power techniques 
reduce power consumption for spatial mapping and temporal mapping, each has both an 
advantage and a disadvantage. Spatial mapping with context reuse needs only one 
cache-read operation for initialization, but it cannot support complex loops that cannot 
be spatially mapped. Temporal mapping with context pipelining supports such complex 
loops, but cache-read operations still remain throughout the running time. Therefore, we 
should consider the trade-off between performance and power when deploying these 
techniques.  
We can consider two ways to close the gap between spatial mapping and temporal 
mapping. One is to implement a more complex architecture that supports high-
performance spatial mapping by adding interconnections or global register files to 
handle data dependencies. However, in this case the area cost and mapping complexity 
will increase. The other is to implement low-power temporal mapping that takes 
advantage of spatial mapping with negligible overhead. The problem is how to 
implement this method. In the next section, we propose a new technique that retains the 
advantages of both spatial mapping and temporal mapping. This is achieved by merging 
the concept of context reuse into context pipelining. 
C.  Integrated Approach to Reduce Power in Configuration Cache 
Filling the gap between the two mappings means that context pipelining is executed with 
reusable context words. However, this means combining two mappings that are contrary 
to each other: spatial mapping with context reuse requires a spatially static position for 
each context, whereas temporal mapping with context pipelining is performed with 
temporally changing context words. To resolve this contradiction, we propose to add a 
circular interconnection between the first PE and the last PE in the same row, and we 
suggest reusable context pipelining using this interconnection. 
1.  Reusable Context Pipelining 
Reusable context pipelining (RCP) means that reusable context words in the spatial cache 
are pipelined through context registers, as in context pipelining. Fig. 16 (a) depicts the 
proposed configuration cache structure for RCP. Even though it is similar to the structure 
of Fig. 14 (spatial mapping with context reuse), the new one has two context registers 
('R1' and 'R2') connected to each PE, circular interconnections and fewer cache layers, 
whereas the original model had one context register and more cache layers.  
 
[Figure: (a) Entire structure: 4x4 PE array in which each PE is connected to its CE of the 
spatial cache through two context registers (R1 and R2), with a circular interconnection 
between the last and the first column of each row. (b) Connection between a CE and a 
PE: MUX 1 and MUX 2 (controlled by Select #1 and Select #2 from the cache control 
unit) feed REG 1 and REG 2 from the CE or from the left context register or circular 
interconnection; the registers use a gated clock with REG 1 Enable and REG 2 Enable; a 
further MUX (Register Select) passes REG 1, REG 2 or 'Zero' to the PE; the register 
outputs continue to the right context registers or circular interconnection] 

Fig. 16. Proposed configuration cache structure. 
 
 
 
The circular interconnections and the context registers are necessary for pipelining 
reusable context words from the spatial cache. Fig. 16 (b) shows the detailed structure 
between a CE and a PE for RCP. A multiplexer ('MUX') is added between the context 
registers ('REG 1' and 'REG 2') and the PE to select one of the context registers or 
'Zero'. Each context register is connected to its own multiplexer ('MUX 1' or 'MUX 2'), 
which has two inputs: the context word from the left context register and the context 
word from the spatial cache. The input from the spatial cache is for loading a reusable 
context word into the context register, and the input from the left context register is for 
pipelined execution of the context word held in the left context register. Each select 
signal ('Select #1' or 'Select #2') selects one of the two inputs as the single output, which 
is passed on toward the right context registers. Each context register is implemented with 
a gated clock, both to hold its output and to reduce wasteful power consumption. All 
select signals of the multiplexers are generated by the cache control unit. 
To present the RCP process in detail, we assume that the matrix-vector computation 
given in Eq. (2) is mapped onto the proposed structure as in Fig. 17. Fig. 17 (a) shows 
the context words stored in the spatial cache for RCP, and Fig. 17 (b) ~ (i) show the RCP 
process from the first cycle to the eighth cycle. Before execution starts, the context words 
of the first layer in the spatial cache are loaded into the first context registers ('REG 1'). 
In the first cycle, the PEs in the first column perform 'Load' and the context word 
'Store' in the spatial cache is loaded into 'REG 2' of the first column, while the other 
columns perform no operation ('NOP'). In the second cycle, the first column performs 
'Execute1', received through the circular interconnection, while the PEs in the next 
column perform 'Load', received from the first column. The context words in the first 
registers are then pipelined for two more cycles (the third and fourth cycles), and the 
first column performs 'Store' from the second register in the fifth cycle. This context 
pipelining continues and finishes in the eighth cycle. Therefore, if reusable context words 
are loaded into the context registers in circular order, the context words from the spatial 
cache can be rotated for temporal mapping without a temporal cache. This means that the 
spatial nature of the array structure and the added context registers can be exploited for 
low power in temporal mapping. 
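
The rotation described above can be checked with a small simulation of one row of the array. This is a simplified sketch, not the hardware behavior in full: the names are hypothetical, all words (including 'Store') are assumed to be loaded before the first cycle, and both register rings are assumed to shift on every cycle, whereas the actual design loads 'Store' during the first cycle and may hold a register with its gated clock.

```c
#include <assert.h>

#define RCP_MAX_COLS 16
#define RCP_MAX_REGS 4
#define RCP_EMPTY    (-1)

/* ring[q][c] is the index of the context word held by context register q+1
 * of column c. Word 0 is 'Load' and word c_iter-1 is 'Store'. Every cycle
 * each ring shifts one column to the right, and the last column feeds the
 * first through the circular interconnection.
 * Returns 1 if every column finds the words of the loop body in order in
 * its own registers, starting one cycle after its left neighbor. */
int rcp_simulate(int m, int n_ctxt, int c_iter)
{
    int ring[RCP_MAX_REGS][RCP_MAX_COLS];
    int q, c, t, w;

    assert(m > 0 && m <= RCP_MAX_COLS);
    assert(n_ctxt > 0 && n_ctxt <= RCP_MAX_REGS);
    assert(c_iter > 0 && c_iter <= m * n_ctxt);

    for (q = 0; q < n_ctxt; q++)
        for (c = 0; c < m; c++)
            ring[q][c] = RCP_EMPTY;

    /* Initial load in circular order: columns 0, m-1, m-2, ..., 1,
     * then the next register. */
    for (w = 0; w < c_iter; w++)
        ring[w / m][(m - w % m) % m] = w;

    for (t = 0; t < c_iter + m - 1; t++) {
        for (c = 0; c < m; c++) {
            w = t - c;                    /* word column c needs in cycle t */
            if (w < 0 || w >= c_iter)
                continue;                 /* idle column: NOP ('Zero') */
            if (ring[w / m][c] != w)
                return 0;                 /* wrong word would be executed */
        }
        for (q = 0; q < n_ctxt; q++) {    /* circular shift to the right */
            int last = ring[q][m - 1];
            for (c = m - 1; c > 0; c--)
                ring[q][c] = ring[q][c - 1];
            ring[q][0] = last;
        }
    }
    return 1;
}
```

With `m = 4`, `n_ctxt = 2` and `c_iter = 5` (Load, Execute1 to Execute3, Store), the simulation runs for 8 cycles and the first column finds 'Store' in its second register in the fifth cycle, as in the walk-through above.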

[Figure: (a) Context words stored in the spatial cache, with the columns holding Load, 
Execute 3, Execute 2 and Execute 1; (b) Execution at the First Cycle; (c) Execution at 
the Second Cycle; (d) Execution at the Third Cycle; (e) Execution at the Fourth Cycle; 
(f) Execution at the Fifth Cycle; (g) Execution at the Sixth Cycle; (h) Execution at the 
Seventh Cycle; (i) Execution at the Eighth Cycle. Each panel shows the contents of 
REG 1 and REG 2 in every column and marks the operation executed in the current 
cycle] 

Fig. 17. Reusable context pipelining for Eq. (2). 

[Figure: the REG1 registers rotate the reusable context words Load, Execute1, Execute2 
and Execute3, for which the spatial cache is used only for initialization of reusable 
context pipelining, while the REG2 registers pipeline Execute4 to Execute7 and Store 
from the temporal cache] 

Fig. 18. Reusable context pipelining with temporal cache. 
 
 
 
2.  Limitation of Reusable Context Pipelining  
If the loop given in Fig. 15 (b) is mapped onto the 4x4 PE array with the added context 
registers of Fig. 16 (a), RCP cannot complete the entire execution, because the given 
architecture supports at most 8 cycles (2 context registers × 4 columns) per loop 
iteration whereas the loop body takes 9 cycles. In this case, the temporal cache is 
necessary to support the entire execution, as shown in Fig. 18: RCP is performed for 
4 cycles using register 1, and the original context pipelining is performed for 5 cycles 
using register 2. Hence, after the first iteration, RCP guarantees a reduction of 4 
cache-read operations. This example shows that the power efficiency of reusable context 
pipelining varies with the complexity of the evaluated loops and the architecture 
specification. 

(a) ith iteration in the case of a loop body taking 9 cycles (Col#1) 

Cycle       i+1   i+2   i+3   i+4   i+5    i+6    i+7    i+8    i+9
Cache/REG   REG1  REG1  REG1  REG1  Cache  Cache  Cache  Cache  Cache
Operation   LD    EX1   EX2   EX3   EX4    EX5    EX6    EX7    ST

(b) ith iteration in the case of a loop body taking 8 cycles (Col#1) 

Cycle       i+1   i+2   i+3   i+4   i+5   i+6   i+7   i+8
Cache/REG   REG1  REG1  REG1  REG1  REG2  REG2  REG2  REG2
Operation   LD    EX1   EX2   EX3   EX4   EX5   EX6   ST

Fig. 19. Reusable context pipelining according to the execution time for one iteration 
(i > 1). 
 
 
 
Therefore, we can estimate how many cache-read operations occur after the first 
iteration under the architecture constraints:  

\[
N_{Tcache\_read} =
\begin{cases}
0 & \text{if } C_{iter} \le m \times N_{ctxt} \qquad (3) \\
C_{iter} - m \times (N_{ctxt} - 1) & \text{if } C_{iter} > m \times N_{ctxt} \qquad (4)
\end{cases}
\]

where 
•   NTcache_read : cycle count of temporal cache-read operations after the first iteration 
•   Citer : cycle count for an iteration of the loop 
•   m : number of columns in the reconfigurable array 
•   Nctxt : number of context registers per PE 

Based on the above formula, the optimal case is NTcache_read = 0, in which the context 
registers are sufficient to support the entire loop body without temporal cache-read 
operations after the first iteration. Fig. 19 shows two cases of temporal mapping with 
RCP after the first iteration. Fig. 19 (a) shows the schedule for the previous example in 
Fig. 18 and corresponds to Eq. (4): with Citer = 9, m = 4 and Nctxt = 2, five temporal 
cache reads remain. Fig. 19 (b) shows the other case, in which the execution time for an 
iteration is 8 cycles, and corresponds to Eq. (3).   
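
Eq. (3) and (4) translate directly into a small helper. The function name and parameter names below are illustrative choices, not part of the original tool flow.

```c
#include <assert.h>

/* Number of temporal cache-read cycles per iteration after the first one,
 * following Eq. (3) and Eq. (4).
 *   c_iter : cycle count for one iteration of the loop (Citer)
 *   m      : number of columns of the reconfigurable array
 *   n_ctxt : number of context registers per PE (Nctxt) */
int temporal_cache_reads(int c_iter, int m, int n_ctxt)
{
    assert(c_iter > 0 && m > 0 && n_ctxt > 0);
    if (c_iter <= m * n_ctxt)
        return 0;                        /* Eq. (3): registers suffice   */
    return c_iter - m * (n_ctxt - 1);    /* Eq. (4): one register is fed
                                            from the temporal cache      */
}
```

For the two cases of Fig. 19 on the 4-column array with two context registers, `temporal_cache_reads(9, 4, 2)` gives 5 and `temporal_cache_reads(8, 4, 2)` gives 0.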
3.  Hybrid Configuration Cache Structure  
Based on the modified interconnection structure of Fig. 16 (b), we propose a power-
conscious configuration cache structure that supports reusable context pipelining. We 
call it the hybrid configuration cache; it includes two parts: a spatial cache for reusable 
context pipelining and a temporal cache to make up for the limitation of RCP. Fig. 20 
shows the modified configuration cache structure that supports the example given in 
Fig. 18. It is composed of a cache controller, spatial cache, temporal cache, multiplexer 
and demultiplexer. The cache controller supports the same functions as the previous 
controller and, in addition, controls the added context registers as well as the selection 
between the spatial cache and the temporal cache. The new cache controller is therefore 
more complex than the base one, but it supports reusable context pipelining with 
negligible area and power overheads. Compared to the distributed cache of the base 
architecture, both the spatial cache and the temporal cache have far fewer layers, since 
spatial mapping does not require many layers and RCP can reduce the number of 
temporal cache layers by up to the number of columns by using the context registers. 
The number of spatial cache layers should nevertheless be larger than the number of 
context registers connected to a PE, because the spatial cache should be able to hold the 
context words of several applications. The area overhead caused by the added context 
registers is offset, because the temporal cache can be reduced by the same size as the 
total of the added registers. As mentioned earlier, the approach does not incur any 
performance degradation, and the hybrid structure saves cache area since we keep only 
one column of temporal CEs with a reduced number of layers and spatial CEs with 
fewer layers, compared to the distributed configuration cache, which has many more 
layers of CEs.  

[Figure: hybrid configuration cache: a cache control unit (producing control and data 
signals, register and mux select signals, and a select signal), a temporal cache with few 
layers (one column of CEs), a spatial cache with few layers, a multiplexer and a 
demultiplexer, feeding a 4x4 PE array in which each PE has two context registers 
(REG 1 and REG 2); bus taps split the context buses into partial bit fields] 

Fig. 20.  Hybrid configuration cache structure. 
 
 
 
D.  Application Mapping Flow 
We have implemented an automatic compilation flow that maps applications onto the 
base architecture to support temporal mapping [37]. The binary context words for 
reusable context pipelining are basically the same as the context words used for temporal 
mapping, but they must be rearranged for context pipelining with the circular 
interconnection. Fig. 21 shows the entire mapping flow for the base architecture and the 
proposed architecture. Binary context words are automatically generated by the compiler 
for temporal mapping. The timing and control information used to operate the execution 
controller is manually optimized, and the final encoded data is loaded into the registers 
of the execution controller. 
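
[Figure: mapping flow. Applications in C pass through the compiler for temporal 
mapping, which produces the binary context (initial context) and the timing and control 
information. For the base architecture, the binary context is loaded into the base 
configuration cache. For the proposed architecture, context rearrangement (parameters: 
array size, number of cache layers, cycle count for an iteration of the loop, number of 
context registers) distributes it over the spatial cache and the temporal cache of the 
hybrid configuration cache. The manually optimized timing and control information is 
loaded as encoded data into the execution controller] 

Fig. 21.  Application mapping flow for base architecture and proposed architecture. 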
 
 
 
1. Temporal Mapping Algorithm 
The temporal mapping algorithm minimizes the execution time of kernel code on the 
PE array. This execution time is directly proportional to the number of cache layers in 
the configuration array. The time, Tcritical, is the parameter to be minimized during 
temporal mapping. We implement temporal mapping in three sequential steps: 
covering, time assignment, and place assignment.  
a. Covering  
For compilation, the original kernel code is first transformed into a DAG, called the 
kernel DAG, using the common subexpression elimination technique [55]. One or 
more operation nodes in a kernel DAG are scheduled in a single configuration of a PE. 
For this, we generate a configuration DAG (CDAG) by clustering the nodes in the kernel 
DAG. The CDAG is used to find the minimum number of configurations for kernel code 
execution. To perform this task, we formulate it as a DAG covering problem, in which 
one has to find the minimal-cost set of patterns that covers all the nodes in the input 
CDAG. To solve our DAG covering problem efficiently, we base our algorithm on 
binate covering [56]. For example, Fig. 22 (a) shows CDAG generation from an input 
DAG.  
 
 
 
[Figure: example DAG of Ld, +, -, * and St nodes shown after each step: (a) Covering; 
(b) Time assignment; (c) Place assignment] 

Fig. 22. Temporal mapping steps. 
 
 
 
b. Time Assignment  
Each node in the CDAG is assigned to the cycle in which it will be executed. In 
order to minimize Tcritical, we must fully exploit the parallel resources provided by the 
m×n PE array using modulo scheduling [57]. For example, Fig. 22 (b) shows the 
schedule obtained after applying modulo scheduling to the CDAG. Note that the cycle in 
which a node in the CDAG is scheduled as part of a configuration in this phase 
represents a layer location inside the configuration cache. 
c. Place Assignment  
In this phase, we assign all nodes in the CDAG to actual PEs by storing each of them as 
a configuration entity in the cache of a PE. We split the PEs in a column into two groups, 
called slots. The CDAG nodes are first assigned to one of the two slots by resorting to 
an ILP solver, and then, within each slot, the nodes are finally mapped onto actual PEs. 
Fig. 22 (c) shows the final mapping result after place assignment. 
2.  Context Rearrangement 
In the case of the base architecture, the binary context words generated by the compiler 
can be loaded into the configuration cache without any modification. In the case of the 
proposed architecture, however, the generated context words are rearranged and assigned 
to the spatial and temporal caches. The address of each context word in the hybrid 
configuration cache can be represented by a three-dimensional position, as in Fig. 23 (a). 
Fig. 23 (b) shows pseudocode for the context rearrangement algorithm, which is easily 
implemented based on Eq. (3) and (4). Before explaining the algorithm in detail, we 
introduce the notation used in it; NTcache_read, Citer, Nctxt and m are defined in 
subsection C.2.  
•   n : number of rows in the reconfigurable array  
•   k, l : number of temporal cache layers and number of spatial cache layers, respectively  
•   Tctxt : set of the context words having positions in the temporal cache 
•   Sctxt : set of the context words having positions in the spatial cache 
•   Tctxt(x, y, z) : context word at position (x, y, z) in the temporal cache  
•   Sctxt(x, y, z) : context word at position (x, y, z) in the spatial cache  
    (x : layer index, y : row index, z : column index) 
48 
Layer k-1
Layer k-2
Layer k-3
Tctxt(k-3, 0)
Layer l-1
ctxt(k-3, 1)
Layer l-2
Layer l-3
Sctxt(l-3,0,0)
Sctxt(l, 1,0)
Sctxt(0, 2, 0)
Sctxt(0, n, 0)
Sctxt(l-3,0,1)
Sctxt(0, 1, 1)
Sctxt(0, 2, 1)
Sctxt(0, n, 1)
Sctxt(l-3,0,m-1)
Sctxt(0,gm-1)
Sctxt(0,  m-1)
Sctxt(0,  m-1)
Layer 2
Layer 1
Sctxt(0,0,0)
Sctxt(0,1,0)
Sctxt(0,2,0)
Sctxt(0,n-1,0)
Layer 0
Sctxt(0,0,1)
Sctxt(0,1,1)
Sctxt(0,2,1)
Sctxt(0,n-1,1)
Sctxt(0,0,m-1)
Sctxt(0,1,m-1)
Sctxt(0,2,m-1)
Sctxt(0,n-1,m-1)
Tctxt(k-2, 2)
Tctxt(k-2, n-1)
Layer 1
Layer 2
Layer 1
Tctxt(0,0,0)
Tctxt(0,1,0)
Tctxt(0,2,0)
Tctxt(0,n-1,0)
Layer 0
Spatial Cache Temporal Cache 
 
(a) positions of binary contexts in hybrid configuration cache 
       
CONTEXT REARRANGEMENT (Tctxt, m, n, k, Citer, Nctxt)
L1   Sctxt ← Ø
L2   p ← 0
L3   r ← 0
L4   u ← 0
L5   if Citer ≤ m×Nctxt
L6      then for i ← 0 to k-1
L7              do for j ← 0 to n-1
L8                    do Sctxt(p, j, r) ← Tctxt(i, j, 0)
L9                 if r = 0
L10                   then r ← m - 1
L11                   else if r > 1
L12                           then r ← r - 1
L13                           else r ← 0, p ← p + 1
L14     else u ← m×(Nctxt-1)
L15          for i ← 0 to k-1
L16              do if i ≤ u
L17                    then for j ← 0 to n-1
L18                            do Sctxt(p, j, r) ← Tctxt(i, j, 0)
L19                         if r = 0
L20                            then r ← m - 1
L21                            else if r > 1
L22                                    then r ← r - 1
L23                                    else r ← 0, p ← p + 1
L24                    else for j ← 0 to n-1
L25                            do Tctxt(i-u-1, j, 0) ← Tctxt(i, j, 0)
L26  return Tctxt, Sctxt
 
(b) rearrangement algorithm. 
 
Fig. 23. Context rearrangement. 
The code between L1 and L4 initializes the temporary variables (p, r, u) and Sctxt. If Nctxt×m
is sufficient to support the entire loop body without temporal cache read operations (L5), all
of the context positions in the temporal cache are remapped to positions in the spatial cache,
rearranged in circular order (L6 ~ L13). Otherwise, the limited number of temporal cache
layers that can be executed by reusable context pipelining is estimated (L14), and all of the
context positions within those layers are remapped to positions in the spatial cache
(L16 ~ L23) in the same manner as (L6 ~ L13). Then the layer indices of the context
positions remaining in the temporal cache are updated to fill up the empty layers (L24 ~ L25).
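The rearrangement of Fig. 23 (b) can be sketched in Python as follows. This is only a minimal illustration, not the C++ implementation used in the experiments. The function name `rearrange_contexts`, the dictionary representation keyed by (layer, row, column), and the choice to return only the layers that remain in the temporal cache are assumptions made for this example. The sketch also assumes that the column index r advances once per temporal layer (after all n rows are copied) and that m ≥ 2.

```python
def rearrange_contexts(tctxt, m, n, k, c_iter, n_ctxt):
    """Sketch of CONTEXT REARRANGEMENT (Fig. 23 (b)).

    tctxt maps (layer, row, 0) -> context word for k layers and n rows.
    Returns (contexts remaining in the temporal cache, spatial contexts).
    Assumes m >= 2 columns.
    """
    sctxt = {}
    remaining = {}
    p, r = 0, 0  # spatial layer index and spatial column index

    # L5 / L14: index of the last temporal layer moved to the spatial cache.
    if c_iter <= m * n_ctxt:
        u = k - 1              # whole loop body fits: move every layer
    else:
        u = m * (n_ctxt - 1)   # only layers 0..u are moved

    for i in range(k):
        if i <= u:
            for j in range(n):
                sctxt[(p, j, r)] = tctxt[(i, j, 0)]
            # Circular column order: 0, m-1, m-2, ..., 1, then the next layer.
            if r == 0:
                r = m - 1
            elif r > 1:
                r -= 1
            else:
                r, p = 0, p + 1
        else:
            # L24-L25: shift the remaining layers down to fill the empty layers.
            for j in range(n):
                remaining[(i - u - 1, j, 0)] = tctxt[(i, j, 0)]
    return remaining, sctxt


# Hypothetical example: 5 temporal layers, 1 row, 2 columns, 2 context registers.
tctxt = {(i, 0, 0): f"ctx{i}" for i in range(5)}
remaining, sctxt = rearrange_contexts(tctxt, m=2, n=1, k=5, c_iter=10, n_ctxt=2)
print(sctxt)
print(remaining)
```

In this example Citer = 10 exceeds m×Nctxt = 4, so only layers 0 to 2 move to the spatial cache and layers 3 and 4 are renumbered as temporal layers 0 and 1.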
E.  Experiments 
1. Experimental Setup 
For a fair comparison between the base model and the proposed one, we have implemented
two reconfigurable architectures as specified in Table I. The base architecture is as
specified in Chapter III. The proposed architecture is the same as the base architecture but
additionally includes extra context registers and the hybrid configuration cache to support
reusable context pipelining. Both models have been designed at RT-level with VHDL and
synthesized using Design Compiler [49] with 0.18 µm technology. We have used an SRAM
Macro Cell library for the frame buffer and configuration cache. ModelSim [50] and
PrimePower [49] have been used for gate-level simulation and power estimation, respectively.
To estimate the power consumption overhead of the proposed model, the context registers
and multiplexers in each case (base model and proposed architecture) have been separated
from the PE array and included in the configuration cache of each model during
implementation. To obtain the power consumption data, we have simulated various kernels
(Table II) under the same conditions as those used in Chapter III (subsection C.3): an
operating frequency of 100 MHz and the typical case of 1.8 V Vdd and 27℃. We have
implemented the context rearrangement algorithm (Fig. 23) in C++ and realized the
application mapping flow of Fig. 21 by adding the algorithm to the compiler for temporal mapping.
2.  Results 
a.  Necessary Context Registers for Evaluated Kernels 
We have applied several kernels from the Livermore loops benchmark [58] and DSPstone [59],
together with representative loops in an MPEG-4 AAC decoder, an H.263 encoder and an H.264
decoder, to the base and proposed architectures. To determine the number of context registers
necessary to support reusable context pipelining, we have analyzed each of the selected
kernels; Table II shows the execution cycle count for an iteration and the necessary number
of context registers for each kernel. 2D-FDCT shows the largest execution cycle count
(11 cycles) and requires 2 context registers, which is the maximum among the selected
kernels. This means that each PE needs 2 context registers to support reusable context
pipelining for all of the selected kernels. Therefore, each PE in the proposed architecture
has 2 context registers for reusable context pipelining, whereas the base architecture
has one context register, as shown in Table I.
 
 
 
Table I. Architecture Specification of Base and Proposed Architecture
Component            Parameter                               Base architecture    Proposed architecture
PE Array             Number of context registers for a PE    1                    2
                     Number of rows                          8                    8
                     Number of columns                       8                    8
Frame buffer         Number of sets and banks                2 sets and 3 banks   2 sets and 3 banks
                     Bit width                               16-bit               16-bit
                     Bank size                               1 KB                 1 KB
Configuration Cache  Number of layers for a CE               32                   16
                     Number of Cache Elements (CEs)          64                   72
                     Bit width of a CE                       32-bit               32-bit
 
 
 
 
Table II. Necessary Context Registers for Evaluated Kernels
Kernels                                           Execution cycle count    Necessary number of
                                                  for an iteration         context registers
First_Diff (a)                                    10                       2
Tri-Diagonal (a)                                  4                        1
Hydro (a)                                         7                        2
ICCG (a)                                          5                        1
Dot_Product (b)                                   5                        1
24-Taps FIR (b)                                   8                        2
Complex Multiplication in MPEG-4 AAC decoder      10                       2
ITRANS in H.264 decoder                           9                        2
2D-FDCT in H.263 encoder                          11                       2
SAD in H.263 encoder                              5                        1
Matrix(10x8)-Vector(8x1) Multiplication (MVM)     5                        1
(a) Livermore loop benchmark suite. (b) DSPstone benchmark suite.
 
 
 
 
 
 
 
 
 
Table III. Size of Configuration Cache and Context Registers
Size of memory elements    Base         Proposed     Reduced(%)
Context registers          256-Byte     512-Byte     -
Spatial cache                           4096-Byte
Temporal cache             8192-Byte    512-Byte     43.75
Total amount               8448-Byte    5120-Byte    39.39
 
b.  Configuration Cache Size  
Both the temporal cache and the spatial cache of the proposed architecture have 16 layers,
half the number of layers in the base architecture. Reducing the cache size does not degrade
the performance of the evaluated kernels, because the reduced size is sufficient to run the
selected kernels with reusable context pipelining. Table III compares the memory size of
the base architecture with that of the proposed one. It shows that the overhead of the added
context registers is more than offset by the reduction in configuration cache size. Compared
to the base architecture, we have reduced the total size of the memory elements by 39.39%.
This means that the reconfigurable architecture with the new configuration cache structure is
more efficient than the previous one in terms of memory size and power saving.
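The totals in Table III can be cross-checked against the parameters of Table I. The short sketch below does this arithmetic; it assumes that each context register holds one 32-bit context word, and the helper names `cache_bytes` and `register_bytes` are introduced only for this illustration.

```python
BYTES_PER_WORD = 32 // 8   # 32-bit CE width (Table I); assumed also for context registers
NUM_PES = 8 * 8            # 8 rows x 8 columns (Table I)

def cache_bytes(num_ces, layers):
    """Configuration cache size: CEs x layers x bytes per context word."""
    return num_ces * layers * BYTES_PER_WORD

def register_bytes(regs_per_pe):
    """Total size of the context registers in the PE array."""
    return NUM_PES * regs_per_pe * BYTES_PER_WORD

base_cache = cache_bytes(64, 32)        # 8192 bytes
proposed_cache = cache_bytes(72, 16)    # 4608 bytes = 4096 + 512
base_total = base_cache + register_bytes(1)
proposed_total = proposed_cache + register_bytes(2)

cache_reduction = 100 * (base_cache - proposed_cache) / base_cache
total_reduction = 100 * (base_total - proposed_total) / base_total
print(base_total, proposed_total)                           # 8448 5120
print(f"{cache_reduction:.2f} {total_reduction:.2f}")       # 43.75 39.39
```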
c.  Performance Evaluation 
The execution cycle counts of the evaluated kernels on the proposed architecture do not
differ from those on the base architecture because the functionality of the proposed
architecture is the same as that of the base model. This indicates that reusable context
pipelining does not cause performance degradation in terms of execution cycle count. In
addition, the synthesis results show that the critical path delay of the proposed
architecture is the same as that of the base model, i.e. 8.96 ns. This indicates that the
proposed approach does not cause performance degradation in terms of critical path delay either.
d.  Power Evaluation  
To demonstrate the effectiveness of our power-conscious approach, we have evaluated
the power consumption of the base architecture with temporal mapping only and of the
proposed architecture with reusable context pipelining on the hybrid configuration cache.
 
Table IV. Power Reduction Ratio by Reusable Context Pipelining
                 Cache power (mW)      Entire power (mW)     Reduced (%)
Kernels          base      proposed    base      proposed    Cache    Entire
First_Diff       171.77    28.08       376.17    232.48      83.65    38.20
Tri-Diagonal     174.18    31.58       400.19    257.59      81.87    35.63
Dot_Product      117.84    29.87       328.54    240.57      74.65    26.78
Complex_Mult     180.63    32.82       452.00    304.19      81.83    32.70
Hydro            148.23    32.40       356.47    240.64      78.14    32.49
ICCG             205.80    32.64       434.45    261.29      84.14    39.86
24-Taps FIR      227.56    31.11       471.44    274.99      86.33    41.67
MVM              227.57    34.45       405.70    212.58      84.86    47.60
ITRANS           204.85    69.96       417.95    283.06      65.85    32.27
2D-FDCT          190.03    37.59       417.33    264.89      80.22    36.53
SAD              185.30    75.08       415.27    305.05      59.48    26.54
 
Table IV compares the power consumption of the two architectures. The selected kernels
were executed for 100 iterations. Compared to the base architecture, reusable context
pipelining saves up to 86.33% of the power consumed in the configuration cache and up to
47.60% of that in the entire architecture. These results show that reusable context
pipelining is an effective solution for power saving in CGRA. ITRANS and SAD show less
power reduction than the other kernels because they need additional spatial cache read
operations for data arrangement. For 24-Taps FIR, which shows the maximum reduction ratio
in the configuration cache, the total power consumption of the proposed architecture
(274.99 mW) is much less than that reported for PipeRench [6]. PipeRench has been
fabricated in a 0.18 micron process, and [6] reports power measurements for varying FIR
filter tap sizes, taken with a 33.3 MHz fabric clock and a 16.7 MHz IO clock. Those
measurements show that the power consumption of 24-Taps FIR ranges from 600 mW to 700 mW.
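The reduction ratios in Table IV follow directly from the measured power values as (base - proposed) / base. As a quick check, the sketch below recomputes three rows of the table; the function name `reduction_percent` is ours and serves only this illustration.

```python
def reduction_percent(base_mw, proposed_mw):
    """Power reduction of the proposed architecture relative to the base, in percent."""
    return 100.0 * (base_mw - proposed_mw) / base_mw

# (cache base, cache proposed, entire base, entire proposed) in mW, from Table IV.
rows = {
    "24-Taps FIR": (227.56, 31.11, 471.44, 274.99),
    "MVM": (227.57, 34.45, 405.70, 212.58),
    "SAD": (185.30, 75.08, 415.27, 305.05),
}

results = {}
for name, (cache_base, cache_prop, all_base, all_prop) in rows.items():
    results[name] = (
        round(reduction_percent(cache_base, cache_prop), 2),
        round(reduction_percent(all_base, all_prop), 2),
    )
    print(name, results[name])
```

The recomputed values match the 'Reduced (%)' columns, e.g. 86.33% (cache) for 24-Taps FIR and 47.60% (entire architecture) for MVM.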
 
 
 
 
 
 
 
 
 
 
 
 
CHAPTER V 
DYNAMIC CONTEXT COMPRESSION FOR LOW POWER CGRA 
 
In this chapter, we address power reduction in CGRA and provide a framework to achieve it.
A new design flow and a new configuration cache structure are presented to reduce power
consumption in the configuration cache [60]. The power saving is achieved by dynamic
context compression in the configuration cache: only the required bits of the context
words are enabled and the redundant bits are disabled. To this end, a new design flow for
CGRA is proposed to generate the architecture specifications required for supporting a
dynamically compressible context architecture without performance degradation.
Experimental results show that the proposed approach saves up to 39.72% of the power in
the configuration cache with negligible area overhead (2.16%).
A.  Preliminary 
1. Context Architecture 
The configuration cache provides context words to the context register of each PE on a
cycle-by-cycle basis. From the context register, these context words configure the PEs.
Fig. 24 shows an example of the PE structure and context architecture of MorphoSys [3].
A 32-bit context word specifies the function of the ALU-multiplier, the inputs to be
selected from MUX_A and MUX_B, the amount and direction of the shift of the ALU output,
and the register for storing the result, as shown in Fig. 24 (a). Context architecture means
the organization of a context word into several fields that control the resources in a PE, as
shown in Fig. 24 (b). The context architectures of other CGRAs such as
[2][8][9][10][11][12][13][14][15][16][17][18] are similar to that of MorphoSys, although
there is wide variance in the context width and in the kinds of fields used for different
functionality.

[Figure: (a) PE datapath driven by the context register (context word from context memory): MUX_A, MUX_B, ALU+MULT, SHIFT, register file (R0 to R3), constant and output register, with outputs to the data bus and to other RCs; (b) 32-bit context word fields: Write_EXPR, Write_RF_En, REG_FILE, RS_LS, ALU_SFT, MUX_A, MUX_B, ALU_OP, SUB_OP, Constant, Empty]
(a) PE structure    (b) Context architecture
Fig. 24. PE structure and context architecture of MorphoSys.
B.  Motivation 
1.  Power Consumption by Configuration Cache 
By loading the context words from the configuration cache into the array, we can
dynamically change the configuration of the entire array within just one cycle. However,
such dynamic reconfiguration of a CGRA causes many SRAM read operations in the
configuration cache. In [6], the authors have fabricated a CGRA (PipeRench) in a 0.18 µm
process. Their experimental results show that the power consumption is significantly high
because dynamic reconfiguration requires frequent configuration memory accesses. Fig. 9
shows the power breakdown of a CGRA running 2D-FDCT, obtained from a gate-level
implementation in 0.18 µm technology based on the MorphoSys architecture. The
configuration cache consumes about 43% of the overall power, the second largest share after
the PE array, which consumes 48% of the overall power budget. This is because the
configuration cache performs SRAM read operations to load the context words in every
cycle at run time. In addition, [8][30] show the power breakdown of another CGRA
(ADRES) running IDCT in 90 nm technology. In this case, the configuration memory
consumes about 37.22% of the overall power. Therefore, it is clear that the power consumed
by the configuration cache (memory) is a serious overhead compared to other types of IP
cores such as ASIC or ASIP.
2.  Valid Bit-Width of Context Words 
When a kernel is mapped onto a CGRA and the application is executed, the context fields
actually used are limited to those required by the types of operations in the kernel
executed at run time. Furthermore, the operation types of an executed kernel on the PE array
change in every cycle. This means that the valid bit-width of an executed context word is
frequently less than the full bit-width of a context word, and that the full bit-width is
seldom used.

[Figure: bar chart of average and maximum valid bit-width (axis: bit-width, 0 to 20) for each kernel: *First_Diff, *Tri-Diagonal, *State, *Hydro, *ICCG, **Inner Product, **24-Taps FIR, Matrix-vector multiplication, Mult loop in FFT, Complex_Mult in MPEG-4 AAC decoder, ITRANS in H.264 decoder, 2D-FDCT in H.263 encoder, 2D-IDCT in H.263 encoder, SAD in H.263 encoder, Quantization in H.263 encoder, Dequantization in H.263 encoder]
*Livermore loops benchmark [58], **DSPstone [59]
Fig. 25. Valid bit-width of context words.

For a statistical evaluation of the valid bit-width of contexts, we selected the 32-bit
context architecture of the base architecture (Fig. 4) and mapped several kernels onto its
PE array so as to maximize the utilization of the context fields. Fig. 25 shows the results
for various benchmark kernels and critical loops in real applications. In Fig. 25, the
average bit-width is the average of the valid bit-widths of all the context words executed
at run time, and the maximum bit-width is the largest valid bit-width among them. The
results show that the average bit-widths vary from 7 to 11 bits and that the maximum
bit-width is at most 18 bits, whereas the full bit-width is 32 bits.
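To make the two statistics concrete, the following sketch computes the average and maximum valid bit-width over a short trace of executed context words. It is an illustration only. It assumes that the valid bit-width of a context word is the sum of the widths of the fields it uses, it borrows the field widths of the generic context architecture in Fig. 27 rather than those of the base architecture in Fig. 4, and the trace itself is hypothetical rather than measured.

```python
# Field bit-widths of the generic 32-bit context architecture (Fig. 27).
FIELD_WIDTH = {
    "REG_FILE": 3, "MUX_A": 4, "MUX_B": 4, "ALU_OP": 5, "SAT": 2,
    "SHIFT": 6, "WDB_EN": 1, "PRED": 1, "CTXT_CTRL": 6,
}

def valid_bit_width(used_fields):
    """Bits a context word needs: sum of the widths of the fields it uses."""
    return sum(FIELD_WIDTH[f] for f in used_fields)

def bit_width_stats(trace):
    """Return (average, maximum) valid bit-width over all executed context words."""
    widths = [valid_bit_width(fields) for fields in trace]
    return sum(widths) / len(widths), max(widths)

# Hypothetical trace: one entry per executed context word.
trace = [
    set(),                          # idle PE
    {"MUX_A", "ALU_OP"},            # one-operand operation
    {"MUX_A", "MUX_B", "ALU_OP"},   # two-operand operation
    {"MUX_A", "ALU_OP", "SHIFT"},   # operation followed by a shift
]
average, maximum = bit_width_stats(trace)
print(average, maximum)  # 9.25 15
```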
3. Dynamic Context Compression for Low Power CGRA 
If the configuration cache provides only the required bits (valid bits) of the context
words to the PE array at run time, the power consumption in the configuration cache can be
reduced. The redundant bits of the context words can be disabled, making them invalid at
run time. In this way, one can achieve a low-power implementation of CGRA without
performance degradation, provided that the context architecture dynamically supports both
cases at run time: an uncompressed context word with the full bit-width, and a compressed
context word for which the unused part of the configuration cache is disabled. To support
such dynamic context compression, we propose a new context architecture and
configuration cache structure in this chapter.
C.  Design Flow of Dynamically Compressible Context Architecture  
In order to design and evaluate a dynamically compressible context architecture, we propose
a new context architecture design flow. The entire design flow is shown in Fig. 26. The flow
starts from context architecture initialization, which is similar to the architecture
specification stage of the general CGRA design flows given in [21][22][27][29]. Based on
such architecture specifications, the PE operations are determined and the initial context
architecture is defined. From the context initialization, fields are grouped by their
essentiality for PE operation and their dependency on the ALU operation, which provides the
criteria for context compression. A field sequence graph (FSG) is then generated to show the
possible field combinations for PE operations. Next, field control signals are generated to
enable or disable individual fields when contexts are compressed. Based on the former
stages, the position of each field is defined and the final context architecture is
generated. Finally, the context evaluator determines whether the initially uncompressed
contexts can be compressed or not. Subsections C.1 to C.5 describe each stage of the design
flow in more detail.

[Figure: flow chart with the boxes Domain (APP1, APP2, APP3), Constraints, Context Architecture Initialization, Field-Grouping, Field-Sequence Graph Generation, Field-Control Signal Generation, Field Positioning, Compressible Context Architecture, Initial Context, Context Evaluator with the decision "Can it be compressed?" (Yes: Compressed Context, No: Uncompressed Context)]
Fig. 26. Entire design flow.
 
[Figure: PE structure. MUX A and MUX B select operands from the data buffer, neighbor PEs or the register file and feed the ALU, followed by SAT_logic, the shifter and the register file (Reg #0 to Reg #3); results go to the data buffer or neighbor PEs and to the pred' bus. Each component is annotated with its field name, bit-width and component index.]
(a) PE structure

Field name    Bit-width    Component index        Control
REG_FILE      3-bit        [0,1,2]                Processing Element
MUX_A         4-bit        [3,4,5,6]              Processing Element
MUX_B         4-bit        [7,8,9,10]             Processing Element
ALU_OP        5-bit        [11,12,13,14,15]       Processing Element
SAT           2-bit        [16,17]                Processing Element
SHIFT         6-bit        [18,19,20,21,22,23]    Processing Element
WDB_EN        1-bit        [24]                   Processing Element
PRED          1-bit        [25]                   Processing Element
CTXT_CTRL     6-bit        -                      Context register
(b) Context architecture initialization
Fig. 27. Context architecture initialization.
1.  Context Architecture Initialization 
Context architecture in CGRA design depends on the architecture specification. In the
process of architecture specification, the CGRA structure evolves with the PE array size,
the PE functionalities and their interconnect scheme. The proposed approach starts from a
conventional context architecture and turns it into a dynamically compressible context
architecture through the proposed design flow. To illustrate the design flow, we have
defined a generic 32-bit context architecture that supports the kernels in Fig. 25. It is
similar to those of representative CGRAs such as MorphoSys [3], REMARC [4], ADRES
[8][22][30][43] and PACT_XPP [9][10][31]. The PE structure and the bit-width of each field
are shown in Fig. 27. The architecture supports various arithmetic and logical operations
with two operands (MUX_A and MUX_B), predicated execution (PRED), arithmetic saturation
(SAT_logic), shift operations (SHIFT) and saving of temporary data in the register file
(REG_FILE). In Fig. 27 (b), all of the fields are classified under 'Control' into two cases,
'Processing element' and 'Context register', meaning that each of them is configured by the
fields listed under it. Furthermore, Fig. 27 (a) shows the bit-width of each field and the
component index that identifies the component configured by each field.
Even though each field could already be positioned on the context word under a conventional
design flow, this initialization stage does not define any field position. This is because
the field positions for the uncompressed case should be assigned with context compression
in mind.
2.  Field Grouping 
All of the context fields are grouped into three sets: the necessary set, the optional set
and the unnecessary set. The necessary set includes the fields that are indispensable for
all PE operations, the optional set includes the fields that are optional for PE operations,
and the unnecessary set is composed of the fields unrelated to PE operations. This means
that the necessary fields should be included in context words even when the context words
are compressed, whereas optional and unnecessary fields can be left out. In addition, we
classify the optional set into two subsets: one composed of the fields dependent on the
'ALU_OP' field and another composed of the fields independent of 'ALU_OP'. This
classification is necessary for generating field control signals in subsection C.4. Fig. 28
shows the field grouping based on the context initialization presented in subsection C.1.
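The grouping can be written down as a small data structure. The sketch below is our own illustration, with group names chosen for this example; the assignment of fields to groups is as we read it from Fig. 28, and the widths are those of Fig. 27. It also checks that the groups partition the 32-bit context word.

```python
# Field bit-widths from the context architecture initialization (Fig. 27).
FIELD_WIDTH = {
    "REG_FILE": 3, "MUX_A": 4, "MUX_B": 4, "ALU_OP": 5, "SAT": 2,
    "SHIFT": 6, "WDB_EN": 1, "PRED": 1, "CTXT_CTRL": 6,
}

# Field grouping as read from Fig. 28 (group names are illustrative).
GROUPS = {
    "necessary": ["MUX_A", "ALU_OP"],
    "optional_alu_dependent": ["MUX_B", "PRED"],
    "optional_alu_independent": ["SHIFT", "WDB_EN", "REG_FILE", "SAT"],
    "unnecessary": ["CTXT_CTRL"],
}

def group_width(group):
    """Total bit-width of all fields in one group."""
    return sum(FIELD_WIDTH[f] for f in GROUPS[group])

group_widths = {g: group_width(g) for g in GROUPS}
total_width = sum(group_widths.values())
print(group_widths)
print(total_width)  # 32
```

Under this grouping, only the 9 bits of the necessary set must be present in every compressed context word; the remaining 23 bits are candidates for compression.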
 
[Figure: field sets. Necessary for PE operation: MUX_A, ALU_OP. Unnecessary for PE operation: CTXT_CTRL. Optional for PE operation: ALU_OP-dependent fields PRED (1-bit) and MUX_B (4-bit); ALU_OP-independent fields SHIFT (6-bit), WDB_EN (1-bit), REG_FILE (3-bit) and SAT (2-bit).]
 
Fig. 28. Field grouping.  
[Figure: field sequence graph overlaid on the PE structure of Fig. 27 (a). Vertices: MUX_A (4), MUX_B (4), ALU_OP (5), SAT (2), PRED (1), SHIFT (6), REG_FILE (3) and WDB_EN (1); the number on each vertex is the bit-width of the field, and the legend distinguishes necessary, ALU-dependent and ALU-independent fields.]
 
Fig. 29. Field sequence graph.  
 
 
 
3.  Field Sequence Graph Generation 
The field sequence graph (FSG) is generated from the context architecture initialization and
the field grouping. The FSG is a directed graph composed of the necessary and optional
fields, and it shows the possible field combinations for PE operations based on the PE
structure. Each vertex of the FSG corresponds to a necessary or optional field in the field
grouping, and each edge shows a possible combination of two fields. The possible field
combinations can be found by tracing vertices in the edge directions; each combination must
include all of the necessary fields, whereas optional fields can be skipped during tracing.
Fig. 29 shows an example of an FSG derived from Fig. 27 and Fig. 28. For example, the
combination (MUX_A, ALU_OP, SAT) is possible, whereas (MUX_A, ALU_OP, SAT, PRED) is
not. The FSG is a useful data structure for field positioning, as described in subsection C.5.
4.  Generation of Field Control Signal 
When contexts are compressed, optional fields are relocated into the compressed space, and
the positions of these fields may overlap with each other. Therefore, each optional field
should be disabled when it is not included in the compressed context word. This means that
a compressed context should carry control information for all of the optional fields so that
unused fields can be disabled. This subsection describes how the control signals for the
optional fields are generated.
 
[Figure: (a) truth table of the 5-bit 'ALU_OP' (bits A4 to A0) for the logical operations A ≤ B, A!, A < B, A || B and A && B; (b) 'CTRL BLOCK' combinational logic (one OR gate and two AND gates) that derives MUX_B_EN and PRED_EN from A4 to A0]
 
(a) logical operations                                 (b) control signals 
 
Fig. 30. Control signals for 'MUX_B' and 'PRED'. 
 
 
 
 
 
a.  Control Signals for ALU-Dependent Fields 
If the truth table of 'ALU_OP' is organized by operation type, the enable/disable signals
for ALU-dependent fields can be generated from 'ALU_OP' with some combinational logic.
Fig. 30 (a) shows the truth table, arranged by operation class, for the example given in
subsection C.1. The MSB (A4) of 'ALU_OP' classifies operations according to the number of
operands: MSB = 1 is used for operations with two operands and MSB = 0 for operations with
one operand. In addition, A3~A0 are used for classifying logical operations. Based on the
truth table, we can generate the control signals for the two fields with some combinational
logic, as shown in Fig. 30 (b). We define this combinational logic as the 'CTRL BLOCK'.
b. Control Signals for ALU-Independent Fields   
In order to control ALU-independent fields when context words are compressed, the
enable/disable flag bit of each ALU-independent field should be merged into a necessary
field. Fig. 31 (a) shows how the 1-bit flags of the ALU-independent fields are merged into
'ALU_OP'. After flag merging, the FSG should be updated because the bit-widths of some of
the fields have changed and a 1-bit field such as 'WDB_EN' is no longer a valid vertex in
the FSG. Fig. 31 (b) shows the updated FSG with the modified bit-widths.
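The effect of flag merging on the field widths can be illustrated with a few lines of Python. The sketch assumes, as the widths in Fig. 31 suggest, that each ALU-independent field gives up one bit (its enable/disable flag) to 'ALU_OP'; the function name `merge_flags` is introduced here only for illustration.

```python
def merge_flags(widths, necessary, independent):
    """Move the 1-bit enable flag of each ALU-independent field into `necessary`.

    A field whose width drops to zero (a pure flag such as WDB_EN) disappears.
    """
    merged = dict(widths)
    for field in independent:
        merged[field] -= 1
        merged[necessary] += 1
        if merged[field] == 0:
            del merged[field]
    return merged

# Widths of the necessary and optional fields before merging (Fig. 27).
before = {
    "MUX_A": 4, "MUX_B": 4, "ALU_OP": 5, "PRED": 1,
    "SHIFT": 6, "WDB_EN": 1, "REG_FILE": 3, "SAT": 2,
}
after = merge_flags(before, "ALU_OP", ["SHIFT", "WDB_EN", "REG_FILE", "SAT"])
print(after)
```

The result reproduces the widths of the updated FSG: 'ALU_OP' grows from 5 to 9 bits, 'SHIFT', 'REG_FILE' and 'SAT' shrink to 5, 2 and 1 bits, and 'WDB_EN' no longer appears, while the total number of bits is unchanged.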
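[Figure: (a) the 1-bit enable/disable flags of REG_FILE (3-bit), WDB_EN (1-bit), SHIFT (6-bit) and SAT (2-bit) are merged into ALU_OP (5-bit), giving ALU_OP (9-bit), ALU_SFT (5-bit), REG_FILE (2-bit) and SAT (1-bit); (b) updated FSG with vertices MUX_A (4), MUX_B (4), ALU_OP (9), PRED (1), SAT (1), SHIFT (5) and REG_FILE (2)]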
 
(a) Flag merging                                               (b) Updated FSG 
Fig. 31. Updated FSG from flag merging. 
 
5.  Field Positioning 
The final stage of the proposed design flow positions each field on the context word.
Field positioning should be considered for both modes (uncompressed and compressed)
to support dynamic compression.
a.  Field Positioning on Uncompressed Context Word
All the fields should have default positions for the case in which contexts cannot be
compressed. First of all, the necessary fields are positioned near the MSB and the
unnecessary fields are positioned near the LSB, as shown in Fig. 32. Then the optional
fields are positioned in the space that remains between them. For optional field
positioning, the bit-width of the compressed context word must be determined first. The
compressed bit-width depends on how the capacity of the compressed context word is defined.
A large capacity allows more context words to be compressed (a high compression ratio), but
the power reduction per compressed word is limited by the long bit-width. Conversely, a
small capacity yields a high power reduction per compressed word but may cause a low
compression ratio. To avoid these extremes (a compressed context word that is much too
short or much too long), we determine the compressed bit-width based on the following criteria.
i) Compressed context words should be able to support all of the ALU-dependent fields.
ii) Compressed context words should be able to include at least one ALU-independent field.
 
[Figure: default field positions on the 32-bit uncompressed context word, from MSB to LSB]
Field        Bit position    Field set
ALU_OP       A31…A23         Necessary for PE operation
MUX_A        A22…A19         Necessary for PE operation
MUX_B        A18…A15         Optional for PE operation
PRED         A14             Optional for PE operation
SAT          A13             Optional for PE operation
SHIFT        A12…A8          Optional for PE operation
REG_FILE     A7, A6          Optional for PE operation
CTXT_CTRL    A5…A0           Unnecessary for PE operation
Longest field combination (compressed width: 18-bit): A31…A14; others: A13…A0. Uncompressed width: 32-bit.
 
Fig. 32. Default field positioning. 
 
To satisfy these criteria, we determine the longest field combination, i.e. the one with the
maximum bit-width among those required by i) and ii). The maximum width satisfying i) and
ii) is found to be 18 bits, consisting of 'ALU_OP', 'MUX_A', 'MUX_B' and 'PRED'. Therefore,
the compressed bit-width is 18 bits. The optional fields included in the longest field
combination are preferentially positioned in the compressed zone near the MSB, and the
other fields are positioned in the uncompressed zone near the LSB, as shown in Fig. 32.
After this, the positions of the necessary fields on the FSG are fixed, and the positions
of the field control signals are also determined because they are included in the
necessary field 'ALU_OP'.
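The default layout of Fig. 32 can be expressed as a table of bit intervals, from which fields are extracted by shifting and masking. The following sketch is illustrative only: the helper names `extract` and `in_compressed_zone` and the sample field values are assumptions, while the bit positions are those given in Fig. 32.

```python
# Default field positions (msb, lsb) on the uncompressed context word (Fig. 32).
LAYOUT = {
    "ALU_OP": (31, 23), "MUX_A": (22, 19), "MUX_B": (18, 15), "PRED": (14, 14),
    "SAT": (13, 13), "SHIFT": (12, 8), "REG_FILE": (7, 6), "CTXT_CTRL": (5, 0),
}
FULL_WIDTH = 32
COMPRESSED_WIDTH = 18
CMP_LSB = FULL_WIDTH - COMPRESSED_WIDTH  # bit 14 is the LSB of the compressed zone

def extract(word, field):
    """Return the value of `field` from a 32-bit uncompressed context word."""
    msb, lsb = LAYOUT[field]
    return (word >> lsb) & ((1 << (msb - lsb + 1)) - 1)

def in_compressed_zone(field):
    """True if the default position of `field` lies within bits A31..A14."""
    return LAYOUT[field][1] >= CMP_LSB

# Hypothetical context word: ALU_OP = 0x1AB, MUX_A = 5, MUX_B = 12, PRED = 1.
word = (0x1AB << 23) | (5 << 19) | (12 << 15) | (1 << 14)
compressed_fields = [f for f in LAYOUT if in_compressed_zone(f)]
compressed_bits = sum(LAYOUT[f][0] - LAYOUT[f][1] + 1 for f in compressed_fields)
print(compressed_fields, compressed_bits)
print(extract(word, "ALU_OP"), extract(word, "MUX_B"), extract(word, "SAT"))
```

The fields whose default positions fall in the compressed zone are exactly the longest field combination ('ALU_OP', 'MUX_A', 'MUX_B', 'PRED'), and their widths add up to the 18-bit compressed width.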
 
[Figure: (a) field concurrency graph with vertices MUX_B (4), PRED (1), SHIFT (5), SAT (1) and REG_FILE (2), derived from the updated field sequence graph; (b) example in which Field#1, Field#2 and Field#3 are connected through a dummy vertex]
 
 
(a) FCG from FSG                                  (b) FCG with dummy vertex 
Fig. 33. Field concurrency graph. 
 
b.  Field Positioning on Compressed Context Word 
This stage positions fields on the compressed context word so as to guarantee that no
possible field combination exceeds the compressed bit-width. Therefore, all the possible
field combinations must first be found. This is done by searching the FSG and then
generating a field concurrency graph (FCG) such as the one in Fig. 33 (a). The FCG shows the
concurrency between the optional fields and is therefore used to prevent concurrent
optional fields from being assigned overlapping positions. An edge between two fields means
that the two fields are included together in one of the possible field combinations. Even
though this example does not show concurrency among more than two optional fields, such a
case can be represented by adding a dummy field connected to the fields concerned, as
shown in Fig. 33 (b).
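The construction of the FCG from a list of field combinations can be sketched as follows. The combinations used here are hypothetical and are not the ones obtained from the FSG of Fig. 31 (b); the function name `build_fcg` is likewise an assumption. The sketch adds one edge for every pair of optional fields that appear together in a combination and does not model dummy vertices.

```python
from itertools import combinations

OPTIONAL_FIELDS = {"MUX_B", "PRED", "SHIFT", "SAT", "REG_FILE"}

def build_fcg(field_combinations, optional_fields):
    """Return the FCG edge set: pairs of optional fields used concurrently."""
    edges = set()
    for combo in field_combinations:
        present = sorted(f for f in combo if f in optional_fields)
        for a, b in combinations(present, 2):
            edges.add((a, b))
    return edges

# Hypothetical field combinations, each containing the necessary fields.
field_combinations = [
    ("MUX_A", "ALU_OP", "MUX_B", "PRED"),
    ("MUX_A", "ALU_OP", "MUX_B", "SHIFT"),
    ("MUX_A", "ALU_OP", "SAT"),
    ("MUX_A", "ALU_OP", "REG_FILE"),
]
fcg_edges = build_fcg(field_combinations, OPTIONAL_FIELDS)
print(sorted(fcg_edges))  # [('MUX_B', 'PRED'), ('MUX_B', 'SHIFT')]
```

In this hypothetical case 'SAT' and 'REG_FILE' have no adjacent fields, so they could share positions with any other optional field, whereas 'MUX_B' must not overlap with 'PRED' or 'SHIFT'.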
Based on a given FCG, the next step is to position the optional fields on the compressed
context word. This means that some optional fields have additional positions besides their
default positions on the uncompressed context word. To select one position among the default
and additional positions, multiplexers can be used, each with multiple position inputs and
one position output. Therefore, in this step, field positioning is a mapping among the
inputs, outputs and control signals of the multiplexers connected to the optional fields.
Thus, we propose a port-mapping algorithm for the multiplexers. Before explaining the
procedure in detail, we introduce the notation used in the explanation in Table V.
 
 
 
 
 
Table V. Notations for Port-Mapping Algorithm
Notation          Meaning
GFCG              field concurrency graph, GFCG = (V, E): V is a set composed of the optional fields and E is a set composed of edges showing the concurrency between two fields.
GMUX              multiplexer port-mapping graph, GMUX = (VMUX, EMUX): VMUX is a set composed of input signals and control signals for multiplexers and EMUX is a set composed of weighted edges connecting input data with control signals.
defV              subset of V composed of the fields having their default positions on the compressed context word
ndefV             subset of V composed of the fields not having their default positions on the compressed context word
ctxt[Ai, Aj]      bit interval from index Ai to index Aj on the uncompressed context word; it is used for showing the bit position of a field.
width[v]          bit-width of field v
cmp_lsb           LSB of the compressed context word
def_pos[field]    default position of field, given as an interval of the type ctxt[Ai, Aj]
ctrl_pos[field]   one-bit position of the control signal for field, such as ctxt[Ai] or ctrl_blk[Ai]
field[i, j]       component index corresponding to the interval from the ith bit position to the jth bit position of field
cmp_ctrl          one-bit signal from the cache control unit; ‘1’ means the executed context word is compressed and ‘0’ means it is not compressed.
pdone[field]      ‘1’ means positioning is firmly done and ‘0’ means positioning is not finished.
mux[field]        mux (multiplexer) connected with field
data_in[mux]      set composed of mux input data signals
ctrl_in[mux]      set composed of field control signals for mux
data_out[mux]     set composed of mux output data signals
Adj[field]        adjacency list of field on graph GFCG; if an adjacent field is a dummy, it returns the adjacency list of the dummy field.
 
Algorithm 1 Mux_Port Mapping (GFCG) - fields having default position
L1   VMUX ← Ø, EMUX ← Ø, GMUX ← (VMUX, EMUX)
L2   Add cmp_ctrl on VMUX
L3   for each v ∈ defV do
L4      data_in[mux[v]] ← data_in[mux[v]] ∪ {def_pos[v]} ∪ {[null]}
L5      ctrl_in[mux[v]] ← ctrl_in[mux[v]] ∪ {ctrl_pos[v]}
L6      Add data_in[mux[v]] and ctrl_in[mux[v]] to VMUX
L7      Add an edge between def_pos[v] and ctrl_pos[v] with weight ‘1’ to EMUX
L8      Add an edge between [null] and ctrl_pos[v] with weight ‘0’ to EMUX
L9      data_out[mux[v]] ← v[width[v]-1, 0]
L10     pdone[v] = 1
L11  end do
 
The input to the port-mapping algorithm is the FCG and the output is the multiplexer
port-mapping graph (PMG), which shows the relationship among field control signals and input
data signals (field positions). The algorithm is composed of two parts: the first part handles
the optional fields that have a default position on the compressed context word, and the second
part handles the optional fields that do not. The procedure of the first part is described in
Algorithm 1. The algorithm starts with an initialization step (L1 and L2). In this part, each
multiplexer has only two input data signals: the default field position, and ‘zero’, which is
selected when the field is not used. This is because these fields already have default
positions in the compressed context space. Therefore the default field position, ‘zero’ and the
field control signal of each field are mapped to the inputs of the multiplexer (L4~L6). The
next step defines the relationship between the field control signal and a field position by
adding a weighted edge between them (L7 and L8). Weight ‘1’ (or ‘0’) means the input signal is
selected when the control signal is ‘1’ (or ‘0’). Finally, the outputs of the multiplexers are
connected with the component index defined in subsection C.1 (L9) and the positioning of the
field is firmly done (L10). 
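The steps of Algorithm 1 can be sketched in Python as follows. This is only an illustrative sketch, not the implementation used in this work: the class `PortMappingGraph`, the function `map_default_fields` and the string labels used for positions and control signals are hypothetical names, and step L9 (connecting the multiplexer output to the component index) is omitted.

```python
from dataclasses import dataclass, field

NULL = "[null]"  # the 'zero' input selected when a field is not used


@dataclass
class PortMappingGraph:
    """PMG: vertices plus weighted edges (data_input, control_signal, weight)."""
    vertices: set = field(default_factory=set)
    edges: set = field(default_factory=set)


def map_default_fields(def_fields):
    """Sketch of Algorithm 1.

    def_fields maps a field name to (def_pos, ctrl_pos): the label of its
    default position and the label of its control signal (hypothetical labels).
    """
    pmg = PortMappingGraph()
    pmg.vertices.add("cmp_ctrl")                         # L1-L2: initialization
    data_in, ctrl_in, pdone = {}, {}, {}
    for v, (def_pos, ctrl_pos) in def_fields.items():    # L3
        data_in[v] = {def_pos, NULL}                     # L4
        ctrl_in[v] = {ctrl_pos}                          # L5
        pmg.vertices |= data_in[v] | ctrl_in[v]          # L6
        pmg.edges.add((def_pos, ctrl_pos, 1))            # L7: selected when ctrl = 1
        pmg.edges.add((NULL, ctrl_pos, 0))               # L8: 'zero' when ctrl = 0
        pdone[v] = 1                                     # L10
    return pmg, data_in, ctrl_in, pdone
```

For example, calling `map_default_fields({"PRED": ("ctxt[14,14]", "PRED_EN")})` yields two edges for the ‘PRED’ multiplexer: one of weight 1 from the default position and one of weight 0 from the ‘zero’ input.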
Algorithm 2 Mux_Port Mapping (GFCG) - fields not having default position
L1 for each v ∈ ndefV do
L2   data_in[mux[v]] ← data_in[mux[v]] ∪ {def_pos[v]}
L3   ctrl_in[mux[v]] ← ctrl_in[mux[v]] ∪ {cmp_ctrl}
L4   Add def_pos[v] to VMUX
L5   Add an edge between def_pos[v] and cmp_ctrl with weight ‘0’ on EMUX
L6   if Adj[v] = Ø on GFCG then
L7     tmp_interval ← ctxt[(width[v]+cmp_lsb), cmp_lsb]
L8     data_in[mux[v]] ← data_in[mux[v]] ∪ {tmp_interval}
L9     ctrl_in[mux[v]] ← ctrl_in[mux[v]] ∪ {ctrl_pos[v]}
L10    Add data_in[mux[v]] and ctrl_in[mux[v]] to VMUX
L11    Add an edge between tmp_interval and ctrl_pos[v] with weight ‘1’ to EMUX
L12    Add an edge between tmp_interval and cmp_ctrl with weight ‘1’ to EMUX
L13  else Check_Adjacency(v)
L14  end if
L15  data_out[mux[v]] ← v[width[v]-1, 0]
L16  pdone[v] ← 1
L17 end do
 
The procedure of the second part is described in Algorithm 2. The algorithm starts by mapping
the default field position and the signal ‘cmp_ctrl’ to the inputs of the multiplexer for each
field (L2 and L3). Signal ‘cmp_ctrl’ is a one-bit signal from the cache control unit that
indicates whether the context word is compressed (‘1’) or not (‘0’). The algorithm then defines
the relationship between ‘cmp_ctrl’ and the default position by adding an edge of weight ‘0’
between them (L5). The next step is split into two cases: one for the fields that have no
adjacent fields on the FCG and another for the fields that do. In the first case the field can
be positioned anywhere in the compressed zone except the positions of the necessary fields,
whereas in the second case the field must be positioned so that it does not overlap the
positions of its adjacent fields. In the first case (L6), the field is positioned in the part
near the LSB of the compressed context word (L7). Then the new field position and the field
control signal are mapped to the inputs of the multiplexer (L8 and L9). The next step defines
the relationship between the field control signal (or ‘cmp_ctrl’) and the new field position by
adding an edge of weight ‘1’ between them (L11 and L12).  
Algorithm 3 Check_Adjacency ( field )
L1 for each u ∈ Adj[v] on GFCG do
L2   if pdone[u] = 1 then
L3     tmpV ← tmpV ∪ {u}
L4   end if
L5 end do
L6 position_set and ctrl_set ← Find_Interval (v, tmpV)
L7 for each ctxt[Ai, Aj] ∈ position_set do
L8   if {ctxt[Ai, Aj]} ∩ data_in[mux[v]] = Ø then
L9     data_in[mux[v]] ← data_in[mux[v]] ∪ {ctxt[Ai, Aj]}
L10    ctrl_in[mux[v]] ← ctrl_in[mux[v]] ∪ {ctrl_pos[v]} ∪ ctrl_set
L10    Add data_in[mux[v]] and ctrl_in[mux[v]] to VMUX
L11    Add an edge between ctxt[Ai, Aj] and ctrl_pos[v] with weight ‘1’ to EMUX
L12    Add an edge between ctxt[Ai, Aj] and cmp_ctrl with weight ‘1’ to EMUX
L13    for each ctrl ∈ ctrl_set do
L14      if ctrl overlapped with ctxt[Ai, Aj] then
L15        Add an edge between ctxt[Ai, Aj] and ctrl with weight ‘0’ to EMUX
L16      end if
L17    end do
L18  end if
L18 end do
 
[Figure 34: (a) FCG of Field1 to Field4; (b) available space on compressed zone; (c) when the
target field (‘Field2’) is not overlapped with adjacent fields, ‘Find_Interval’ returns
ctxt[Ai, Aj]; (d) when the target field (‘Field4’) is overlapped with adjacent fields (‘Field1’
or ‘Field3’), ‘Find_Interval’ returns ctxt[Ai, Aj] and ctrl_pos[Field1], and ctxt[Bi, Bj] and
ctrl_pos[Field3]. Legend: field positioning firmly done, field positioning not done,
overlapped part.]

Fig. 34. Examples of ‘Find_Interval’. 
 
 
 
In the second case (L13), the ‘Check_Adjacency’ function is used; it is described in
Algorithm 3. The algorithm starts by gathering the adjacent fields that are already firmly
positioned (L1~L5). Then a new position in the compressed zone is assigned by the
‘Find_Interval’ function (L6). Fig. 34 shows examples of this function for two cases: (c) when
the new position of the input field does not overlap the adjacent field positions and (d) when
it does. In Fig. 34 (c), ‘Find_Interval’ returns only a new position (ctxt[Ai, Aj]) because
there is no conflict with the adjacent fields. In Fig. 34 (d), however, it returns two
positions (ctxt[Ai, Aj] and ctxt[Bi, Bj]) together with the field control signals of the
overlapped fields. This is because the adjacent field control signals are necessary to select
the proper field position when multiple field positions exist in the compressed zone. The
returned position set and control signal set are mapped to the inputs of the multiplexer for
the input field (L9 and L10), and the relationship among the field control signals and a new
position is defined by adding weighted edges among them (L11~L17). Finally, back in
Algorithm 2, the outputs of the multiplexers are connected with the component index (L15) and
the positioning of the field is firmly done (L16). 
 
[Figure 35: port-mapping graph. Data-input vertices: MUX_B, PRED, SAT0, SAT1, SHIFT0, SHIFT1,
REG_FILE0, REG_FILE1 and ZERO; control-signal vertices: MUX_B_EN, PRED_EN, SAT_EN, SHIFT_EN,
REG_FILE_EN and CMP_EN; edges carry weight ‘0’ or ‘1’. Multiplexer outputs: MUX_B[3, 0],
PRED[0], SAT[0], SHIFT[4, 0] and REG_FILE[1, 0].]

Fig. 35. Multiplexer port-mapping graph. 
 
An example PMG produced by the port-mapping algorithm is shown in Fig. 35. Each vertex of the
PMG corresponds to an input or control signal of a multiplexer, and each edge shows the
relationship between a control signal (such as 'SAT_EN' or 'MUX_B_EN') and a position: the
weight of the edge gives the control-signal value for which the position is selected. The
outputs of the multiplexers are then connected with the component index defined in
Fig. 27 (b). Therefore we can implement the multiplexers for the optional fields from the PMG.

[Figure 36: (a) Field layout of compressible context architecture (compressed width: 18-bit,
entire width: 32-bit), with the field positions labelled ALU_OP A31…A23, MUX_A A22…A19,
MUX_B A18…A15, PRED A14, SAT0 A13, SAT1 A14, SHIFT0 A12…A8, SHIFT1 A18…A14, REG_FILE0 A7, A6,
REG_FILE1 A18, A17 and CTXT_CTRL A5…A0. (b) Modified structure between a PE and a CE: the
cache control unit drives the 'CMP' select of the new cache element (CE#1 and CE#2), which
feeds the context register of the PE.]

Fig. 36. Compressible context architecture. 

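How a multiplexer could be derived from PMG edges can be illustrated with a small Python sketch. It rests on assumptions that go beyond the text: an input position is taken to be selected when every control signal attached to it by an edge has the value given by the edge weight, and ‘zero’ is output when no position matches. The function names and the bit positions in `SHIFT_EDGES` are hypothetical and chosen only for illustration.

```python
def extract(word, msb, lsb):
    """Return bits word[msb..lsb] as an integer."""
    return (word >> lsb) & ((1 << (msb - lsb + 1)) - 1)


def mux_output(word, edges, controls):
    """Evaluate one field multiplexer described by PMG edges.

    edges    : list of ((msb, lsb), control_signal, weight) triples
    controls : dict mapping a control-signal name to 0 or 1
    """
    required = {}
    for position, ctrl, weight in edges:
        required.setdefault(position, {})[ctrl] = weight
    for (msb, lsb), conditions in required.items():
        # Selected only if all attached control signals match their weights.
        if all(controls[c] == w for c, w in conditions.items()):
            return extract(word, msb, lsb)
    return 0  # assumption: 'zero' when no position is selected


# Hypothetical 5-bit field: default position A20..A16, compressed position A10..A6.
SHIFT_EDGES = [
    ((20, 16), "CMP_EN", 0),   # default position when the word is uncompressed
    ((10, 6), "SHIFT_EN", 1),  # compressed position when the field is used ...
    ((10, 6), "CMP_EN", 1),    # ... and the word is compressed
]
```

With these edges, the multiplexer reads the default position when `CMP_EN` is 0, the compressed position when both `CMP_EN` and `SHIFT_EN` are 1, and outputs zero otherwise.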
6.  Compressible Context Architecture 
After the field positioning, we have generated a specification of the dynamically compressible
context architecture, shown in Fig. 36. Fig. 36 (a) shows the final field layout of the
compressible context architecture. 'REG_FILE', 'SHIFT' and 'SAT' each have two positions, one
for the compressed case and one for the uncompressed case. Fig. 36 (b) shows the modified
structure between a PE and a cache element (CE). The new cache element is composed of CE#1 and
CE#2, and the cache control unit indicates through port 'CMP' whether the executed contexts are
compressed or not. CE#1 is always selected, but CE#2 is not selected under compression
('CMP'=1), which removes the power consumption of CE#2. 
7.  Context Evaluation 
The context evaluator in Fig. 26 determines whether initially uncompressed contexts can be
compressed or not. This evaluation can be implemented by checking whether a given context word
matches one of the possible field combinations that do not exceed the compressed bit-width.
Using the FCG, we can easily perform this check and generate compressed context words using
the position information from the PMG. 
D. Experiments 
1.  Experimental Setup 
We have implemented the entire design flow in Fig. 26 in C++. We have initialized the context
architecture as in the example described in Section C. The implemented design flow generated
the specification of the dynamically compressible context architecture. For quantitative
evaluation, we have designed two CGRAs based on the 8x8 reconfigurable array at RT-level with
VHDL: one is the conventional base CGRA and the other is the proposed CGRA supporting
compressible features in the context architecture. The architectures have been synthesized
using Design Compiler [49] with 0.18 ㎛ technology. We have used an SRAM Macro Cell library for
the frame buffer and configuration cache. ModelSim [50] and PrimePower [49] have been used for
gate-level simulation and power estimation. To obtain the power consumption data, we have
simulated the kernels (Fig. 25) at an operating frequency of 100 MHz under typical conditions
of 1.8 V Vdd and 27℃. These kernels have been executed for 100 iterations while varying the
test vectors. 
2. Results 
a. Area Cost Evaluation  
Table VI shows the synthesis results from Design Compiler [49] of proposed architec-
ture and base architecture. It shows that area cost of new configuration cache including 
cache control unit, added interconnects and multiplexers has increased by 10.35% but 
the overall area-overhead is only 1.62 %. Thus, the new configuration cache structure 
can support dynamic context compression with negligible overheads.  
 
 
 
 
 
 
 
Table VI. Area Overhead by Dynamic Context Compression 
                       Area Cost (gate equivalent)
Component              Base Architecture   Proposed Architecture   Overhead (%)
Configuration Cache    150012              165538                  10.35
Entire RAA             942742              958268                  1.62
Overhead (%): {(Proposed/Base) – 1}×100 
 
 
 
Table VII. Power Reduction Ratio by Dynamic Context Compression 
                                  Configuration Cache Power (mW)
Kernels          Compression      Base            Proposed        Reduced (%)
                 Ratio (%)        Architecture    Architecture
First_Diff       100              171.77          104.97          38.89
Tri-Diagonal     100              174.18          105.00          39.72
State            100              161.23          99.38           38.36
Hydro            100              148.23          91.50           38.27
ICCG             100              205.80          125.68          38.93
Inner Product    100              117.84          72.60           38.39
24-Taps FIR      100              227.56          139.56          38.67
MVM              100              227.57          140.43          38.29
Mult in FFT      100              175.48          107.08          38.98
Complex Mult     100              180.63          110.18          39.00
ITRANS           100              204.85          125.27          38.85
2D-FDCT          95.53            190.03          119.87          36.92
2D-IDCT          95.49            188.47          118.98          36.87
SAD              100              185.30          113.07          38.98
Quant            95.12            185.23          117.51          36.56
Dequant          95.23            187.78          118.77          36.75
Compression Ratio (%): (number of compressed context words / number of entire context words)×100,  
Reduced (%): {1-(Proposed/Base)}×100, Execution Cycle Count : cycle count for an iteration. 
 
b. Performance Evaluation 
The synthesis results also show that the critical path delay of the proposed architecture is
the same as that of the base model, i.e. 8.96 ns. This indicates that dynamic context
compression does not cause performance degradation in terms of the critical path delay. In
addition, we have applied several kernels in Fig. 25 to the new and base architectures. The
execution cycle count of each kernel on the proposed architecture does not vary from the base
architecture because the functionality of the proposed architecture is the same as that of the
base model. This indicates that dynamic context compression does not cause performance
degradation in terms of the execution cycle count either. 
c. Context Compression Ratio and Power Evaluation 
Table VII shows the context compression ratio for the evaluated kernels. The compression ratio
is the fraction of context words that can be compressed among all context words. All of the
kernels show a high compression ratio of more than 95 %. Furthermore, the comparison of power
consumption is shown in Table VII. Compared to the base architecture, the proposed
architecture saves up to 39.72% of the power. 4 kernels (2D-FDCT, 2D-IDCT, Quant and Dequant)
show less reduction in power than the other kernels. This is because not all of the context
words of these 4 kernels are compressed: their compression ratios are in the range of
95.12 ~ 95.53.   
 
 
 
 
 
 
CHAPTER VI 
DYNAMIC CONTEXT MANAGEMENT FOR LOW POWER CGRA 
 
In this chapter, we present a novel control mechanism of configuration cache called dy-
namic context management to reduce the power consumption in configuration cache 
without performance degradation [61]. In addition, a new configuration cache structure 
is proposed to support such a dynamic context management. Experimental results show 
that the proposed approach saves 38.24%/38.15% of the power in write/read-operation 
of configuration cache with negligible area overhead compared to the base design. 
A.  Motivation 
1.  Power Consumption by Configuration Cache 
By loading the context words from the configuration cache into the array, we can dynamically
change the configuration of the entire array within just one cycle. However, such dynamic
reconfiguration of a CGRA causes many SRAM-read operations in the configuration cache. In [6],
the authors have fabricated a CGRA (PipeRench) in a 0.18 ㎛ process. Their experimental results
show that the power consumption is significantly high due to the dynamic reconfiguration,
which requires frequent configuration memory access. Fig. 9 presents the power breakdown for a
CGRA running 2D-FDCT, obtained from a gate-level implementation in 0.18 ㎛ technology based on
the MorphoSys architecture. It shows that the configuration cache consumes about 43% of the
overall power, the second largest share after the PE array, which consumes 48% of the overall
power budget. This is because the configuration cache performs SRAM-read operations to load
the context words in every cycle at run time. In addition, [8][30] show the power breakdown
for another CGRA (ADRES) running IDCT in 90 nm technology. In this case, the configuration
memory consumes about 37.22% of the overall power. Therefore, it is evident that the power
consumed by the configuration cache (memory) is a serious overhead compared to other types of
IP cores such as ASIC or ASIP.  
2.  Redundancy of Context Words 
Context words are stored in the configuration cache and they show redundancies at run time.
We describe two cases of context-word redundancy in the following subsections.  
a.  NOP Context Words 
Most coarse-grained reconfigurable arrays arrange their processing elements (PEs) as a square
or rectangular 2-D array with horizontal and vertical connections, which provide rich
communication resources for efficient parallelism. However, such PE arrays have many redundant
or unutilized PEs while applications execute on the array. Most subtasks in DSP applications
leave many PEs unused. The redundant PEs should be configured with NOP (no operation) context
words to avoid malfunction and unnecessary waste of power by the PEs. This means that the
configuration cache performs redundant read operations for NOP context words.  
[Figure 37: three context-word sequences over cycles 1 to 4, each with columns ‘ALU Operation’,
‘Operands’ and ‘Other Operations’. R0~R3: registers of register file. T, L, R and BT: output
from Top PE, Left PE, Right PE and Bottom PE.
(a) Consecutive load operations: ALU_OUT <= A and A <= Data bus in every cycle, with R0, R1,
R2 and R3 <= ALU_OUT in successive cycles.
(b) Consecutive shift operations: SHIFT(ALU_OUT) in every cycle, with the ALU operation and
operands changing from cycle to cycle.
(c) Consecutive store operations: ALU_OUT <= A and Data bus <= ALU_OUT in every cycle, with
A <= R0, R1, R2 and R3 in successive cycles.]
Fig. 37. Consecutively same part in context words. 
 
 
 
 
[Figure 38: horizontal bar chart. Vertical axis: kernels (*First_Diff, *Tri-Diagonal, *State,
*Hydro, *ICCG, **Dot Product, **24-Taps FIR, Matrix-vector multiplication, Mult loop in FFT,
Complex_Mult in MPEG4 AAC dec', ITRANS in H.264 dec', 2D-FDCT in H.263 enc', 2D-IDCT in
H.263 enc', SAD in H.263 enc', Quantization in H.263 enc', Dequantization in H.263 enc').
Horizontal axis: Ratio (%), 0 to 90. Series: Total, Consecutively Same, NOP.]

*Livermore loops benchmark [58], **DSPstone [59] 
Consecutively Same (%) = 100 × (consecutively same part [bits]/total context words [bits]), NOP (%) = 
100 × (NOP context words [bits] / total context words [bits]), Total (%) = NOP + Consecutively Same 

Fig. 38. Redundancy ratio of context words.  
 
 
 
b.  Consecutively Same Part in Context Words 
When a kernel is mapped onto the CGRA and executed, the context fields that change from cycle
to cycle are limited by the types of operations the kernel performs at run time. Fig. 37 shows
3 cases of a consecutively same part in context words at run time. In the case of Fig. 37 (a),
PEs perform continuous ‘Load’ operations with fixed ‘ALU Operation’ and ‘Operands’ whereas the
operand data are saved in a different register in every cycle. Fig. 37 (b) and (c) show
consecutive shift operations and store operations with different ‘Operands’ while keeping the
same ‘Other Operations’ in every cycle. This means that the context words contain a
consecutively same part that is repeatedly read from the configuration cache without changing
value.  
c.  Redundancy Ratio 
For a statistical evaluation of redundant context words, we selected the 32-bit context
architecture of the base architecture (Fig. 4) and mapped several kernels onto its PE array so
as to maximize the utilization of the context fields. Fig. 38 shows the results for various
benchmark kernels and critical loops in real applications. For each kernel, three redundancy
ratios are shown: ‘NOP’, ‘Consecutively Same’ and Total. The total redundancy ratio varies
from 31% to 75%.  
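The ratios defined under Fig. 38 can be computed along the following lines. This is a minimal sketch under stated assumptions, not the measurement procedure used for Fig. 38: the function name `redundancy_ratios` is hypothetical, NOP is assumed to be encoded as the all-zero word, and the granularity of a "part" is left to the caller as a list of (msb, lsb) bit ranges that are compared with the preceding context word.

```python
def redundancy_ratios(words, parts, width=32, nop=0):
    """Return (NOP %, Consecutively Same %, Total %) for a context-word sequence.

    words : context words of one PE in execution order
    parts : list of (msb, lsb) bit ranges compared with the previous word
    """
    total_bits = width * len(words)
    nop_bits = same_bits = 0
    prev = None
    for word in words:
        if word == nop:
            nop_bits += width                     # the whole word is redundant
        elif prev is not None:
            for msb, lsb in parts:
                mask = ((1 << (msb - lsb + 1)) - 1) << lsb
                if (word & mask) == (prev & mask):
                    same_bits += msb - lsb + 1    # this part repeats unchanged
        prev = word
    nop_pct = 100.0 * nop_bits / total_bits
    same_pct = 100.0 * same_bits / total_bits
    return nop_pct, same_pct, nop_pct + same_pct
```

For a four-word sequence with two NOP words and one word whose upper 14 bits repeat, the function reports 50% for ‘NOP’ and 14/128 of the bits (about 10.9%) for ‘Consecutively Same’.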
B.  Dynamic Context Management 
If the configuration cache does not perform read/write operations for the redundant part of
context words, it is possible to reduce power consumption in the configuration cache. In this
way, one can achieve a low-power implementation of a CGRA without performance degradation by
managing context words at both transfer time and run time in two cases: no read/write
operation is performed for NOP context words, and only one read/write operation is performed
for a consecutively same part of context words. In order to support such dynamic context
management, we propose a new configuration cache structure and an efficient control mechanism
in this chapter. 
[Figure 39: PE structure (MUX A, MUX B, ALU, shifter, SAT_logic and a register file with
Reg #0 to Reg #3; inputs from data buffer, neighbor PEs or register file and from pred’ bus;
outputs to data buffer or neighbor PEs and to pred’ bus) and its context architecture:]
Field name   Bit-width   Bit-position
REG_FILE     3-bit       A31…A29
MUX_A        4-bit       A28…A25
MUX_B        4-bit       A24…A21
ALU_OP       5-bit       A20…A16
SAT          2-bit       A15, A14
SHIFT        6-bit       A13…A8
WDB          1-bit       A7
PRED         1-bit       A6
CTXT_CTRL    6-bit       A5…A0
Fig. 39. An example of PE and context architecture.  
 
 
 
1.  Context Partitioning 
Context partitioning splits the context architecture into two parts suitable for dynamic
context management. As mentioned in subsection A.2.b, context words show consecutively same
parts that are repeatedly read from the configuration cache without changing value. Therefore,
if a CE is divided into two parts (CE#1 and CE#2) by context partitioning, the part of the CE
holding the consecutively same part can be disabled for power saving while the other part of
the CE continues its read/write operations. The partitioning starts by grouping the context
field for the ALU operation with the context fields that depend on the ALU operation. This is
because the ALU has the most dependencies on other components, so these fields are highly
likely to be consecutively changed or unchanged together. Context partitioning therefore
places such fields in one part of the context architecture and the other fields in the other
part. We have defined a generic PE structure and a 32-bit context architecture, shown in
Fig. 39, as an example to illustrate context partitioning. It can support the kernels in
Fig. 38. It is similar to representative CGRAs such as MorphoSys [3], REMARC [4],
ADRES [8][30] or PACT_XPP [10]. The bit-width and initial bit-position of each field are shown
in Fig. 39. The architecture supports various arithmetic and logical operations (ALU_OP) with
two operands (MUX_A and MUX_B), predicated execution (PRED), arithmetic saturation
(SAT_logic), shift operations (SHIFT) and saving temporary data in the register file
(REG_FILE). Fig. 40 shows the context partitioning of Fig. 39. Field ‘ALU_OP’ and the fields
dependent on ‘ALU_OP’ are positioned in the part near the MSB and the other fields are
positioned near the LSB. 
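[Figure 40: 32-bit context word after partitioning, from MSB to LSB:]
Field        Bit-position   Group
ALU_OP       A31…A27        ALU_OP and ALU_OP-dependent fields : 14-bit
MUX_A        A26…A23        ALU_OP and ALU_OP-dependent fields : 14-bit
MUX_B        A22…A19        ALU_OP and ALU_OP-dependent fields : 14-bit
PRED         A18            ALU_OP and ALU_OP-dependent fields : 14-bit
SAT          A17, A16       ALU_OP-independent fields : 18-bit
WDB          A15            ALU_OP-independent fields : 18-bit
SHIFT        A14…A9         ALU_OP-independent fields : 18-bit
REG_FILE     A8…A6          ALU_OP-independent fields : 18-bit
CTXT_CTRL    A5…A0          ALU_OP-independent fields : 18-bit

Fig. 40. Context partitioning.  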
 
 
 
After context partitioning, the bit-widths of CE#1 and CE#2 are known, and the context
register can also be split into two parts with the same bit-widths. Fig. 41 compares the
general CE with the proposed CE. The proposed CE is composed of CE#1 (14-bit) and CE#2
(18-bit) whereas the general CE is a unified one (32-bit). In subsections B.2 and B.3, we
describe the control mechanism for dynamic context management based on the proposed CE
structure in more detail. 
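The partitioned layout can be made concrete with a short Python sketch that packs fields into a 32-bit context word and splits it into the CE#1 and CE#2 parts. The bit positions follow the layout labelled in Fig. 40; the helper names `pack`, `split` and `join` are hypothetical and are not part of the proposed hardware.

```python
# Field layout after context partitioning (Fig. 40): name -> (msb, lsb)
LAYOUT = {
    "ALU_OP": (31, 27), "MUX_A": (26, 23), "MUX_B": (22, 19), "PRED": (18, 18),
    "SAT": (17, 16), "WDB": (15, 15), "SHIFT": (14, 9), "REG_FILE": (8, 6),
    "CTXT_CTRL": (5, 0),
}
CW = 32  # bit-width of a context word
W = 14   # bit-width of ALU_OP and the ALU_OP-dependent fields (CE#1)


def pack(fields):
    """Build a context word from a {field name: value} dict."""
    word = 0
    for name, value in fields.items():
        msb, lsb = LAYOUT[name]
        if value >> (msb - lsb + 1):
            raise ValueError(f"{name} does not fit in {msb - lsb + 1} bits")
        word |= value << lsb
    return word


def split(word):
    """Split a context word into its CE#1 (upper 14-bit) and CE#2 (lower 18-bit) parts."""
    return word >> (CW - W), word & ((1 << (CW - W)) - 1)


def join(ce1, ce2):
    """Rebuild the full context word from the two parts."""
    return (ce1 << (CW - W)) | ce2
```

A context word that only changes its ‘SHIFT’ or ‘REG_FILE’ field between cycles keeps the same CE#1 part, which is the situation the proposed CE exploits.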
 
[Figure 41: (a) General CE: a unified 32-bit cache element of the configuration cache feeds
the 32-bit context register of each PE and is always ON. (b) Proposed CE: the configuration
cache is split into Configuration Cache #1 and #2; CE#1 (14-bit) and CE#2 (18-bit) feed the two
parts of the context register, and each can be switched ON/OFF.]

Fig. 41. Comparison between general CE and proposed CE.  
 
 
 
[Figure 42: context words are transferred from main memory through the DMA controller to the
cache controller, whose ‘Management’ block writes them to the configuration memory (SRAM; 4x4
CEs, each split into CE#1 and CE#2; SRAM block width: 32-bit, depth: 8) and to a register file
(RF) of 2-bit registers (R), one per CE.]

Fig. 42. Context management when context words are transferred. 
 
2.  Context Management at Transfer Time 
Context management at transfer time removes redundant cache-write operations by using
additional hardware that detects redundancy in context words. Fig. 42 shows the transfer flow
of context words from main memory to the configuration cache in the case of 4x4 CEs. To check
for redundancy, a ‘Management’ hardware block is added to the general cache controller. The
‘Management’ block checks whether each transferred context word is redundant and then controls
the cache-write operation as in Algorithm 4. In addition, Fig. 42 shows a register file
connected to the ‘Management’ block; it has the same addressability as a CE but its bit-width
is 2. The register file stores 2-bit redundancy information, which is used for context
management at run time.  
Algorithm 4     Context Management at Transfer Time
L1 begin
L2 if cur_ctxt = NOP then
L3 reg_file[ctxt_addr] ← “01”
L4 cs1 ← ‘0’, cs2 ← ‘0’
L5 else if cur_ctxt[cw-1, cw-w+1] = prev_ctxt[cw-1, cw-w+1] then
L6 reg_file[ctxt_addr] ← “10”
L7 cs1 ← ‘0’, cs2 ← ‘1’
L8 CE#2[ctxt_addr] ← cur_ctxt[cw-w, 0]
L9 else if cur_ctxt[cw-w, 0] = prev_ctxt[cw-w, 0] then
L10 reg_file[ctxt_addr] ← “11”
L11 cs1 ← ‘1’, cs2 ← ‘0’
L12 CE#1[ctxt_addr] ← cur_ctxt[cw-1, cw-w+1]
L13 else 
L14 cs1 ← ‘1’, cs2 ← ‘1’
L15 CE#1[ctxt_addr] ← cur_ctxt[cw-1, cw-w+1]
L16 CE#2[ctxt_addr] ← cur_ctxt[cw-w, 0]
L17 end if     
L18 prev_ctxt ← cur_ctxt
L19 end  
Algorithm 4 shows this management process for a CE. Before explaining it in detail, we
introduce the notation used in Algorithm 4.  
• cw: bit-width of context word 
• w: bit-width of field group (ALU_OP and ALU_OP-dependent fields) 
• cur_ctxt: context word currently transferred to configuration cache 
• prev_ctxt: context word previously transferred to configuration cache  
• ctxt_addr: address of current context word in configuration cache 
• reg_file: register file, CE#1 and CE#2: Cache Element 
• out_ctxt: context word currently provided to context register 
• cs1 and cs2: chip select signals of CE#1 and CE#2 
The algorithm starts by checking whether the current context word is NOP (L2). If the context
word is NOP, the 2-bit information (“01”) is stored in the register file (L3) and both CE#1
and CE#2 are disabled (L4). If it is not NOP, the next step checks whether the upper part
(near the MSB) of the context word is identical to that of the previous context word (L5). If
so, the information (“10”) is stored in the register file (L6) and only CE#2 is enabled (L7)
for the cache write-operation (L8). The lower part (near the LSB) of the current context word
is checked in the same manner (L9~L12), but CE#1 is enabled instead of CE#2. If the current
context word does not match any of the previous cases, both CE#1 and CE#2 are enabled (L14)
and the full context word is stored in the configuration cache (L15, L16). Finally, the
previous context word is updated with the current context word (L18).  
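The decision logic of Algorithm 4 can be modelled behaviourally in Python as sketched below. Several points are assumptions of this sketch rather than statements of the text: NOP is encoded as the all-zero word, the upper part is taken as the top 14 bits and the lower part as the remaining 18 bits (the CE#1/CE#2 widths given above), and the tag “00” is recorded for a fully written word, a case for which Algorithm 4 does not update the register file explicitly. The class name `TransferManager` is hypothetical.

```python
CW = 32        # bit-width of a context word
W = 14         # bit-width of the upper part stored in CE#1 (assumed split)
LOW = CW - W   # bit-width of the lower part stored in CE#2
NOP = 0        # assumed NOP encoding


class TransferManager:
    """Behavioural model of the 'Management' block at transfer time."""

    def __init__(self, depth):
        self.ce1 = [0] * depth
        self.ce2 = [0] * depth
        self.reg_file = ["00"] * depth
        self.prev = None   # prev_ctxt
        self.writes = 0    # number of CE write operations performed

    def transfer(self, addr, cur):
        upper, lower = cur >> LOW, cur & ((1 << LOW) - 1)
        same_upper = self.prev is not None and upper == (self.prev >> LOW)
        same_lower = self.prev is not None and lower == (self.prev & ((1 << LOW) - 1))
        if cur == NOP:                       # L2-L4: no write at all
            tag, cs1, cs2 = "01", 0, 0
        elif same_upper:                     # L5-L8: write CE#2 only
            tag, cs1, cs2 = "10", 0, 1
        elif same_lower:                     # L9-L12: write CE#1 only
            tag, cs1, cs2 = "11", 1, 0
        else:                                # L13-L16: write both parts
            tag, cs1, cs2 = "00", 1, 1
        self.reg_file[addr] = tag
        if cs1:
            self.ce1[addr] = upper
        if cs2:
            self.ce2[addr] = lower
        self.writes += cs1 + cs2
        self.prev = cur                      # L18
        return cs1, cs2
```

Transferring four words of which one repeats the upper part, one repeats the lower part and one is NOP needs four CE writes in this model instead of eight.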
3.  Context Management at Run Time 
Context management at run time removes redundant cache-read operations by checking the
redundancy information stored in the register file. Fig. 43 shows the structure between the
configuration cache and the PE array for this context management. The ‘Management’ hardware
block controls all of the CEs, and the context register between a CE and a PE is implemented
with a gated clock driven by the chip select signals (CS1 and CS2). With the gated clock, the
context register keeps its output fixed while its clock is not toggling, and the PE remains
configured by that fixed output. Therefore, PEs can be configured without cache-read
operations in the case of consecutively same context words. 
 
[Figure 43: the ‘Management’ block of the cache controller reads the 2-bit information from
the register file (RF) and drives chip select (CS) signals for each CE. For every PE, CE1 and
CE2 feed context registers R1 and R2, whose clocks are gated by CS1 and CS2 through 1-bit
registers (R). Symbols: CE: cache element, R1/R2: context register, PE: processing element,
R: 1-bit register.]

Fig. 43. Context management at run time. 
 
Algorithm 5     Context Management at Run Time
L1 begin
L2 if reg_file[ctxt_addr]  = “01” then
L3 cs1 ← ‘0’, cs2 ← ‘0’
L4 else if reg_file[ctxt_addr] = “10” then
L5 cs1 ← ‘0’, cs2 ← ‘1’
L6 out_ctxt[cw-w, 0] ← CE#2[ctxt_addr] 
L7 else if reg_file[ctxt_addr]  = “11” then
L8 cs1 ← ‘1’, cs2 ← ‘0’
L9 out_ctxt[cw-1, cw-w+1] ← CE#1[ctxt_addr] 
L10 else 
L11 cs1 ← ‘1’, cs2 ← ‘1’
L12 out_ctxt[cw-1, cw-w+1] ← CE#1[ctxt_addr]
L13 out_ctxt[cw-w, 0] ← CE#2[ctxt_addr]
L14 end if     
L15 end  
Algorithm 5 shows this management process for a CE. The notation defined for Algorithm 4 is
also used in Algorithm 5. The algorithm starts by checking whether the information stored in
the register file at the current address indicates NOP (L2). If the information is NOP (“01”),
both CE#1 and CE#2 are disabled (L3). If it is not NOP, the next step checks whether the
information corresponds to the case (“10”) of a consecutively same part near the MSB (L4). If
it is “10”, only CE#2 is enabled (L5) for the cache read-operation (L6). The next step checks
whether the information corresponds to the case (“11”) of a consecutively same part near the
LSB (L7). It is handled in the same manner as the previous case, but CE#1 is enabled for the
read-operation instead of CE#2 (L8, L9). Finally, if the information does not correspond to
any of the previous cases, both CE#1 and CE#2 are enabled (L11) and a full context word is
read from the configuration cache (L12, L13).  
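A behavioural sketch of Algorithm 5 in Python is given below. It models only what the algorithm states: a disabled part is not read and the corresponding half of the context register keeps its previous value, as a gated clock would. The function names, the 14/18-bit split and the initial register value of zero are assumptions of the sketch, and how a NOP is signalled to the PE is not modelled.

```python
LOW = 18  # assumed bit-width of the lower part held in CE#2


def read_context(tag, ce1_word, ce2_word, r1, r2):
    """One run-time cycle for one CE; returns (cs1, cs2, new r1, new r2)."""
    cs1, cs2 = {"01": (0, 0), "10": (0, 1), "11": (1, 0)}.get(tag, (1, 1))
    if cs1:
        r1 = ce1_word   # upper part of out_ctxt is read from CE#1
    if cs2:
        r2 = ce2_word   # lower part of out_ctxt is read from CE#2
    return cs1, cs2, r1, r2


def run(reg_file, ce1, ce2):
    """Replay all addresses in order; return the context words seen and the read count."""
    r1 = r2 = 0
    reads = 0
    out = []
    for addr, tag in enumerate(reg_file):
        cs1, cs2, r1, r2 = read_context(tag, ce1[addr], ce2[addr], r1, r2)
        reads += cs1 + cs2
        out.append((r1 << LOW) | r2)
    return out, reads
```

Replaying the tags “00”, “10”, “11” and “01” takes four CE reads in this model, compared with eight if both parts were read in every cycle.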
C. Experiments 
1.  Experimental Setup 
For quantitative evaluation, we have designed two CGRAs based on the 8x5 reconfigurable array
at RT-level with VHDL: one is the conventional base CGRA and the other is the proposed CGRA
supporting dynamic context management. The architectures have been synthesized using Design
Compiler [49] with 0.18 ㎛ technology. We have used an SRAM Macro Cell library for the frame
buffer and configuration cache. ModelSim [50] and PrimePower [49] have been used for
gate-level simulation and power estimation. To obtain the power consumption data, we have
simulated the kernels (Fig. 38) at an operating frequency of 100 MHz under typical conditions
of 1.8 V Vdd and 27℃. 
 
Table VIII. Area Overhead by Dynamic Context Management 
                 Area cost (gate equivalent)
Component        Base        Proposed     Overhead (%)
Config’cache     150012      162538       8.35
RAA              942742      955268       1.33
Base: base architecture, Proposed: proposed architecture,  
Overhead(%) : {(Proposed/Base) – 1}×100 
 
2.  Results 
a.  Area Cost Evaluation 
Table VIII shows the synthesis results from Design Compiler [49] of proposed architec-
ture and base architecture. It shows that area cost of new configuration cache including 
cache control unit, hardware block of “Management” and register file increased by 
8.35% but the overall area-overhead is only 1.33 %. Thus, the new configuration cache 
structure can support dynamic context management with negligible overheads.  
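As a worked check of the overhead formula given with Table VIII, the two percentages can be recomputed from the gate-equivalent counts. The function name `overhead_pct` is a hypothetical helper used only for this illustration.

```python
def overhead_pct(base, proposed):
    """Overhead (%) = {(Proposed/Base) - 1} x 100."""
    return (proposed / base - 1) * 100


cache_overhead = round(overhead_pct(150012, 162538), 2)  # configuration cache
raa_overhead = round(overhead_pct(942742, 955268), 2)    # entire RAA
```

Both components grow by the same 12526 gate equivalents, which amounts to 8.35% of the configuration cache but only 1.33% of the whole RAA.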
Table IX. Power Reduction Ratio by Dynamic Context Management 
                 Configuration cache Power (mW)               Reduction
                 Write-operation      Read-operation          Ratio (%)
Kernels          Base     Proposed    Base      Proposed      Write    Read
Tri-Diagonal     14.98    6.89        171.77    79.03         54.00    53.99
First_Diff       13.34    8.25        174.18    104.51        38.12    40.00
State            15.23    9.37        161.23    93.87         38.45    41.78
Hydro            11.22    7.17        148.23    96.14         36.14    35.14
ICCG             15.39    7.56        205.80    103.35        50.87    49.78
Dot Product      12.11    7.28        117.84    72.51         39.88    38.47
24-Taps FIR      19.20    11.63       227.56    138.90        39.41    38.96
MVM              14.23    8.68        227.57    138.54        38.99    39.12
Mult in FFT      12.12    7.62        175.48    105.88        37.14    39.66
Complex Mult     11.57    7.86        180.63    123.59        32.12    31.58
ITRANS           14.22    10.17       204.85    148.64        28.47    27.44
2D-FDCT          16.23    11.69       190.03    140.30        27.96    26.17
2D-IDCT          17.34    13.16       188.47    139.88        24.13    25.78
SAD              14.30    4.45        185.30    55.87         68.89    69.85
Quant            12.12    8.73        185.23    134.94        27.99    27.15
Dequant          15.33    11.05       187.78    137.10        27.89    26.99
Average                                                       38.24    38.15
Base: base architecture, Proposed: proposed architecture, Reduced: {1-(Proposed/Base)}×100  
Write/Read: reduction ratio in the case of write/read operation 
 
b.  Power Evaluation 
To demonstrate the effectiveness of the proposed approach, we have applied several kernels
in Fig. 38 to the proposed and base architectures. Each kernel was executed for 100
iterations. Table IX shows the power evaluation of the configuration cache for two cases:
write-operation and read-operation. Write-operations consume less power than
read-operations because a CE performs a write-operation at transfer time, whereas all of
the CEs perform read-operations at run time. Compared to the base architecture, the
proposed architecture saves up to 68.89%/69.85% of the power in write/read-operation.
Five kernels (ITRANS, 2D-FDCT, 2D-IDCT, Quant and Dequant) show less power reduction
than the other kernels because their context words have lower redundancy ratios: Fig. 38
shows that the redundancy ratios of these kernels are in the range of 31.22% to 33.79%.
The average power reduction ratios in write-operation and read-operation are 38.24% and
38.15%, respectively.
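As a small worked check of the reduction-ratio definition given under Table IX, the following Python sketch recomputes the ratio for three kernels from their listed power values. The function and variable names are illustrative only, and the recomputed values can differ from the tabulated ones by a few hundredths, presumably because the listed power values are rounded.

```python
def reduction_ratio(base_mw, proposed_mw):
    """Reduction ratio in percent: {1 - (Proposed / Base)} x 100."""
    return (1.0 - proposed_mw / base_mw) * 100.0

# (base, proposed) power in mW copied from Table IX: ((write), (read)) per kernel
table_ix = {
    "Tri-Diagonal": ((14.98, 6.89), (171.77, 79.03)),
    "2D-IDCT": ((17.34, 13.16), (188.47, 139.88)),
    "SAD": ((14.30, 4.45), (185.30, 55.87)),
}

ratios = {
    kernel: (reduction_ratio(*write), reduction_ratio(*read))
    for kernel, (write, read) in table_ix.items()
}

for kernel, (write, read) in ratios.items():
    print(f"{kernel:12s} write {write:5.2f}%  read {read:5.2f}%")
```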
c.  Performance Evaluation 
The synthesis results show that the critical path delay of the proposed architecture is
the same as that of the base model, i.e. 8.96 ns. This indicates that dynamic context
management does not cause performance degradation in terms of the critical path delay.
In addition, the execution cycle count of each kernel on the proposed architecture does
not differ from that on the base architecture because the functionality of the proposed
architecture is the same as that of the base model. Dynamic context management therefore
does not cause performance degradation in terms of the execution cycle count either.
 
 
 
 
 
 
 
 
 
 
 
CHAPTER VII 
COST-EFFECTIVE ARRAY FABRIC 
 
In this chapter, we propose a new domain-specific array fabric design space exploration
method to generate a cost-effective reconfigurable array structure [62]. The exploration
flow efficiently rearranges PEs while reducing the array size and changes the
interconnection scheme to achieve a substantial reduction in power and area while
maintaining the same performance as the original architecture. In addition, the proposed
array fabric splits the computational resources into two groups (primitive resources and
critical resources). Critical resources can be area-critical and/or delay-critical.
Primitive resources are replicated for each processing element of the reconfigurable
array, whereas area-critical resources are shared among multiple basic PEs in order to
further reduce the area of the CGRA. Delay-critical resources can be pipelined to curtail
the overall critical path so as to increase the system clock frequency. Experimental
results show that for multimedia applications, the proposed approach reduces area by up
to 36.75%, execution time by up to 42.86% and power by up to 35.45% when compared with
the base CGRA architecture.
A.  Preliminary 
In this section, we present the preliminary concepts of our cost-effective design [44].
They come from the characteristics of loop pipelining based on the MIMD-style execution
model. We then propose two techniques to make an RAA cost-effective in terms of area and
delay: resource sharing and resource pipelining.
[Figure: for each mapping, a snapshot of the 4x4 PE array (Col#1 to Col#4) and the schedule below.]

(a) SIMD
Cycle Time | 1    | 2    | 3    | 4    | 5    | 6    | 7    | 8    | 9    | 10   | 11
Column#1   | LD/+ | NOP  | NOP  | NOP  | ×    | 1+   | 2+   | ×/ST | NOP  | NOP  | NOP
Column#2   |      | LD/+ | NOP  | NOP  | ×    | 1+   | 2+   | NOP  | ×/ST | NOP  | NOP
Column#3   |      |      | LD/+ | NOP  | ×    | 1+   | 2+   | NOP  | NOP  | ×/ST | NOP
Column#4   |      |      |      | LD/+ | ×    | 1+   | 2+   | NOP  | NOP  | NOP  | ×/ST

(b) Temporal mapping
Cycle Time | 1    | 2    | 3    | 4    | 5    | 6    | 7    | 8
Column#1   | LD/+ | ×    | 1+   | 2+   | ×/ST | NOP  | NOP  | NOP
Column#2   |      | LD/+ | ×    | 1+   | 2+   | ×/ST | NOP  | NOP
Column#3   |      |      | LD/+ | ×    | 1+   | 2+   | ×/ST | NOP
Column#4   |      |      |      | LD/+ | ×    | 1+   | 2+   | ×/ST

(c) Spatial mapping
Cycle Time | 1    | 2    | 3    | 4    | 5    | 6    | 7    | 8
Column#1   | LD/+ | LD/+ | LD/+ | LD/+ | ×/ST | ×/ST | ×/ST | ×/ST
Column#2   |      | ×    | ×    | ×    | ×    | NOP  | NOP  | NOP
Column#3   |      |      | 1+   | 1+   | 1+   | 1+   | NOP  | NOP
Column#4   |      |      |      | 2+   | 2+   | 2+   | 2+   | NOP

Fig. 44.  Snapshots of three mappings.
 
 
 
1.  Resource Sharing 
Fig. 44 shows the snapshot taken at the 5th cycle of execution of the previous example
shown in Fig. 12 for three cases: (a) SIMD and two cases of loop pipelining, (b) temporal
mapping and (c) spatial mapping. The operations in the 5th cycle for (a), (b) and (c)
include multiplication and therefore the multipliers in the PE array are used. In the
case of SIMD, all PEs perform multiplication, which requires all of them to have
multipliers and thereby increases the area cost of the PE array. However, in the case of
temporal mapping, only PEs in the 1st column and the 4th column perform multiplication
while PEs in the 2nd and 3rd columns perform addition. In the spatial mapping, only PEs
in the 1st and 2nd columns perform multiplication. As can be observed, in the temporal
mapping and spatial mapping, there is no need for all PEs to have the same functional
resources at the same time. This allows the PEs in the same column or in the same row to
share area-critical resources. Fig. 45 shows four PEs in a row sharing two multipliers³
at the 5th cycle in temporal mapping and spatial mapping. We depict only the connections
related to resource sharing.

[Figure: for each mapping, the 4x4 PE array (Col#1 to Col#4) in which the four PEs of each row are connected through switches (SW) to two shared multipliers (MULT), with the schedule below.]

(a) Temporal mapping
Cycle Time | 1    | 2    | 3    | 4    | 5    | 6    | 7    | 8
Column#1   | LD/+ | ×    | 1+   | 2+   | ×/ST | NOP  | NOP  | NOP
Column#2   |      | LD/+ | ×    | 1+   | 2+   | ×/ST | NOP  | NOP
Column#3   |      |      | LD/+ | ×    | 1+   | 2+   | ×/ST | NOP
Column#4   |      |      |      | LD/+ | ×    | 1+   | 2+   | ×/ST

(b) Spatial mapping
Cycle Time | 1    | 2    | 3    | 4    | 5    | 6    | 7    | 8
Column#1   | LD/+ | LD/+ | LD/+ | LD/+ | ×/ST | ×/ST | ×/ST | ×/ST
Column#2   |      | ×    | ×    | ×    | ×    | NOP  | NOP  | NOP
Column#3   |      |      | 1+   | 1+   | 1+   | 1+   | NOP  | NOP
Column#4   |      |      |      | 2+   | 2+   | 2+   | 2+   | NOP

Fig. 45.  Eight multipliers shared by sixteen PEs.
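The sharing argument can be checked mechanically. The Python sketch below encodes the column schedules of Fig. 44 and counts, for each cycle, how many PEs of one row multiply at the same time; the peak is the number of multipliers that row needs. It assumes that all four rows follow the same column schedule, and the helper names (`shifted`, `multipliers_per_row`) are illustrative, not part of the proposed design. Trailing NOPs are omitted because they do not affect the count.

```python
def shifted(ops, start_cycle):
    """Place a column's operations so that ops[0] runs in start_cycle (1-based)."""
    return [""] * (start_cycle - 1) + ops

def multipliers_per_row(schedule):
    """Peak number of PEs in one row that multiply in the same cycle."""
    n_cycles = max(len(ops) for ops in schedule.values())
    peak = 0
    for cycle in range(n_cycles):
        busy = sum(1 for ops in schedule.values()
                   if cycle < len(ops) and "×" in ops[cycle])
        peak = max(peak, busy)
    return peak

ROWS = 4  # 4x4 PE array

simd = {
    1: shifted(["LD/+", "NOP", "NOP", "NOP", "×", "1+", "2+", "×/ST"], 1),
    2: shifted(["LD/+", "NOP", "NOP", "×", "1+", "2+", "NOP", "×/ST"], 2),
    3: shifted(["LD/+", "NOP", "×", "1+", "2+", "NOP", "NOP", "×/ST"], 3),
    4: shifted(["LD/+", "×", "1+", "2+", "NOP", "NOP", "NOP", "×/ST"], 4),
}
loop_body = ["LD/+", "×", "1+", "2+", "×/ST"]
temporal = {col: shifted(loop_body, col) for col in range(1, 5)}
spatial = {
    1: ["LD/+"] * 4 + ["×/ST"] * 4,
    2: shifted(["×"] * 4, 2),
    3: shifted(["1+"] * 4, 3),
    4: shifted(["2+"] * 4, 4),
}

needed = {
    name: ROWS * multipliers_per_row(schedule)
    for name, schedule in [("SIMD", simd), ("temporal", temporal), ("spatial", spatial)]
}
print(needed)
```

Under these assumptions the SIMD schedule needs one multiplier per PE, while both loop-pipelined schedules need two per row, which is consistent with the eight shared multipliers of Fig. 45.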
Fig. 46 depicts the detailed connections for multiplier sharing. The two n-bit oper-
ands of a PE are connected to the bus switch. The dynamic mapping of a multiplier to a 
PE is determined at compile time and the information is encoded into the configuration 
word. At run-time, the mapping control signal from the configuration word is fed to the 
bus switch and the bus switch decides where to route the operands. After the multiplica-
tion, the 2n-bit output is transferred from the multiplier to the original issuing PE via the 
bus switch.  
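The routing behavior described above can be modeled in a few lines of Python. This is only a behavioral sketch: the operand width of 16 bits, unsigned operands, and the names `ConfigWord` and `bus_switch` are assumptions made for illustration and do not describe the actual hardware interface.

```python
from dataclasses import dataclass

N_BITS = 16  # operand width n; 16 is an assumed value for illustration

@dataclass
class ConfigWord:
    pe_id: int    # issuing PE
    mult_id: int  # shared multiplier selected at compile time

def bus_switch(cfg, a, b, n_multipliers=2):
    """Route two n-bit operands to the selected shared multiplier and
    return (pe_id, product) so the 2n-bit result goes back to the issuing PE."""
    if not 0 <= cfg.mult_id < n_multipliers:
        raise ValueError("configuration selects a multiplier that does not exist")
    mask = (1 << N_BITS) - 1
    product = (a & mask) * (b & mask)  # always fits in 2n bits
    return cfg.pe_id, product

dest, result = bus_switch(ConfigWord(pe_id=3, mult_id=1), 300, 200)
print(dest, result)
```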
 
 
[Figure: a PE with its CE connected through a bus switch to two shared multipliers (MULT); labels: ctrl, n-bit operand buses, 2n-bit result buses.]

Fig. 46.  The connection between a PE and shared multipliers.
 
 
 
                                                 
3 Since multipliers take much more area than other resources, we classify them as critical resources. 
2.  Resource Pipelining 
If there is a critical functional resource with long latency in a PE, the functional
resource can be pipelined to curtail the critical path. Resource pipelining has a clear
advantage in loop pipelining execution because heterogeneous functional units with
different delays can run at the same time. In the traditional design (Fig. 47 (a)), the
latency of a PE is fixed, but in our pipelined PE design (Fig. 47 (b)) we allow
multi-cycle operations, so the latency can vary depending on the operation. This helps
increase the system clock frequency.
 
[Figure: (a) General PE: the critical path runs from the front end through the critical resource to the output register in a one-cycle operation. (b) Pipelined PE: a register separates the critical path into two, giving a two-cycle operation through the critical resource and a one-cycle operation otherwise. Both panels show connections to neighbor PEs.]

Fig. 47.  Critical paths.
 
 
 
[Figure: for each mapping, the 4x4 PE array (Col#1 to Col#4) in which the four PEs of each row are connected through switches (SW) to one shared pipelined multiplier (MULT), with the schedule below.]

(a) Temporal mapping
Cycle Time | 1    | 2    | 3    | 4    | 5    | 6    | 7     | 8     | 9     | 10
Column#1   | LD/+ | 1×   | 2×   | 1+   | 2+   | 1×   | 2×/ST | NOP   | NOP   | NOP
Column#2   |      | LD/+ | 1×   | 2×   | 1+   | 2+   | 1×    | 2×/ST | NOP   | NOP
Column#3   |      |      | LD/+ | 1×   | 2×   | 1+   | 2+    | 1×    | 2×/ST | NOP
Column#4   |      |      |      | LD/+ | 1×   | 2×   | 1+    | 2+    | 1×    | 2×/ST

(b) Spatial mapping
Cycle Time | 1    | 2    | 3     | 4     | 5     | 6    | 7        | 8        | 9        | 10
Column#1   | LD/+ | LD/+ | LD/+  | LD/+  | NOP   | 1×   | 2×/ST/1× | 2×/ST/1× | 2×/ST/1× | 2×/ST
Column#2   |      | 1×   | 2×/1× | 2×/1× | 2×/1× | 2×   | NOP      | NOP      | NOP      | NOP
Column#3   |      |      |       | 1+    | 1+    | 1+   | 1+       | NOP      | NOP      | NOP
Column#4   |      |      |       |       | 2+    | 2+   | 2+       | 2+       | NOP      | NOP

1×: First pipeline stage on multiplication, 2×: Second pipeline stage on multiplication

Fig. 48.  Loop pipelining with pipelined multipliers.
 
 
 
If a critical functional resource such as a multiplier has both large area and long la-
tency, the resource sharing and resource pipelining can be applied at the same time in 
such a way that the shared resource executes multiple operations at the same time in dif-
ferent pipeline stages. With this technique, the conditions for resource sharing are re-
laxed and so the critical resources are utilized more efficiently. Fig. 48 shows this situa-
tion. Through the pipelining, we can reduce the number of multipliers from 8 to 4 to per-
form the execution without any stall. This is because two PEs sharing one pipelined mul-
tiplier can perform two multiplications at the same time using different pipeline stages.  
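The reduction from 8 to 4 multipliers can be reproduced with a short Python sketch over the schedules of Fig. 48. It assumes a fully pipelined two-stage multiplier that accepts a new operation every cycle, so a row needs as many multipliers as the peak number of its PEs that occupy the same pipeline stage in the same cycle. All four rows are assumed to follow the same column schedule, and the helper names are illustrative.

```python
def shifted(ops, start_cycle):
    """Place a column's operations so that ops[0] runs in start_cycle (1-based)."""
    return [""] * (start_cycle - 1) + ops

def pipelined_multipliers_per_row(schedule, n_stages=2):
    """Peak number of PEs in one row that occupy the same multiplier
    pipeline stage ("1×", "2×", ...) in the same cycle."""
    n_cycles = max(len(ops) for ops in schedule.values())
    peak = 0
    for cycle in range(n_cycles):
        for stage in range(1, n_stages + 1):
            tag = f"{stage}×"
            busy = sum(1 for ops in schedule.values()
                       if cycle < len(ops) and tag in ops[cycle].split("/"))
            peak = max(peak, busy)
    return peak

ROWS = 4  # 4x4 PE array

loop_body = ["LD/+", "1×", "2×", "1+", "2+", "1×", "2×/ST"]
temporal = {col: shifted(loop_body, col) for col in range(1, 5)}
spatial = {
    1: ["LD/+"] * 4 + ["NOP", "1×"] + ["2×/ST/1×"] * 3 + ["2×/ST"],
    2: shifted(["1×"] + ["2×/1×"] * 3 + ["2×"], 2),
    3: shifted(["1+"] * 4, 4),
    4: shifted(["2+"] * 4, 5),
}

total = {name: ROWS * pipelined_multipliers_per_row(schedule)
         for name, schedule in [("temporal", temporal), ("spatial", spatial)]}
print(total)
```

In both schedules no two PEs of a row are ever in the same multiplier stage at the same time, so one pipelined multiplier per row is enough under these assumptions.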
B.  Cost-Effective Reconfigurable Array Fabric 
In this section, we propose an array fabric design space exploration method to generate a 
cost-effective reconfigurable array structure in terms of area and power. It is mainly mo-
tivated by the characteristics of typical computation-intensive and data-parallel applica-
tions.  
1. Motivation 
a. Characteristics of Computation-Intensive and Data-Parallel Applications 
Most CGRAs have been designed to satisfy the performance requirements of a range of
applications in a particular domain. In particular, they have been designed for
applications that exhibit computation-intensive and data-parallel characteristics. Common
examples of such applications are digital signal processing (DSP) applications like audio
signal processing, image processing, video signal processing, speech signal processing,
speech recognition, and digital communications. Such applications have many subtasks such
as trigonometric functions, filters and matrix/vector operations that can be mapped onto
a coarse-grained reconfigurable array. We have classified such subtasks into four types
as shown by the data flow graphs in Fig. 49. Type (a) shows the merge operation, in which
outputs from multiple operations in the previous stage are used as inputs to an operation
in the next stage. Type (b) shows the butterfly operation, where output data from
multiple operations in the previous stage are fed as input data to the same number of
next-stage operations. Finally, types (c) and (d) show the combinations of (a) and (b).
[Figure: four data flow graphs of operations OPi: (a) Merge, (b) Butterfly, (c) Merge-butterfly, (d) Butterfly-merge. OPi: operation]

Fig. 49. Subtask classification.
 
 
 
b. Redundancy in Conventional Array Fabric 
Most coarse-grained reconfigurable arrays arrange their processing elements (PEs) in a
square or rectangular 2-D array with a rich set of horizontal and vertical connections
for effective exploitation of parallelism. However, such square/rectangular array
structures have many redundant or unutilized PEs during the execution of applications.
Fig. 50 shows an example of three types of data flow (Fig. 49 (a), (c), and (d)) mapped
onto 8x8 square reconfigurable arrays in the two cases of loop pipelining: temporal
mapping and spatial mapping. The upper part of Fig. 50 shows the scheduling for a column
of PEs based on temporal mapping and how the utilization of the PEs changes over the 8
cycles of the schedule. As can be seen from the figure, some PEs have very low
utilization. The lower part of Fig. 50 shows the spatial mapping on the 8x8 array, where
some PEs are not used at all. All three types of implementations leave many PEs unused.
[Figure: 8x8 PE arrays for (a) Merge, (b) Merge-butterfly and (c) Butterfly-merge. Upper part: temporal mapping, one column of PEs per cycle (axis: Cycle Time, 1 to 8). Lower part: spatial mapping (axis: Column Number, Col#1 to Col#8). Used and unused processing elements are marked differently.]

Fig. 50. Data flow on square reconfigurable array.
 
 
 
From these observations, we see that the existing square/rectangular array fabric
cannot efficiently utilize the PEs in the array and therefore wastes considerable area
and power. To overcome this waste in the square/rectangular array fabric, we propose a
new cost-effective array fabric in the next subsection.
2. New Cost Effective Data Flow-Oriented Array Structure 
a. Derivation of Data Flow-Oriented Array Structure 
To reduce the redundancy in the conventional square/rectangular array, we can first
consider a specific array shape that fits well with the applications’ common data flows.
Fig. 51 shows such a data flow-oriented array structure derived from three types of data
flow. In Fig. 51 (a), a triangular-shaped array and uni-directional interconnections
among PEs can be derived from the first data flow (merge). Then the interconnections can
be made bi-directional to support the merge-butterfly data flow as shown in Fig. 51 (b).
Finally, in Fig. 51 (c), the entire array becomes a diamond-shaped structure to reflect
the butterfly-merge data flow. In this case, the butterfly operations are spatially
spread on both sides of the array. The intermediate data then merge either at the ends of
both sides or at the center of the array.
 
[Figure: for each of (a) Merge, (b) Merge-butterfly and (c) Butterfly-merge, the data flow and the derived array shape with the direction of its interconnections; input data and intermediate data are marked differently.]

Fig. 51. Data flow-oriented array structure derived from three types of data flow.
 
 
 
To show how the data flow-oriented array structure can efficiently utilize PEs, we
examine the difference between the conventional square-shaped array and the proposed
data flow-oriented array with a simple example. We assume a diamond-shaped
reconfigurable array composed of 12 PEs as shown in Fig. 52 (a); this is a counterpart
of the 4x4 PE array shown in Fig. 10. In addition, we assume that a Frame Buffer similar
to the one in Fig. 10 (b) is connected to the array, where the PEs in each row of the
array share two read buses and the PEs in two neighboring rows share one write bus as
shown in Fig. 52 (b). The array has nearest-neighbor and global bus interconnections in
diagonal and horizontal directions as shown in Fig. 52 (c) and (d).
 
[Figure: a diamond-shaped array of 12 PEs. (a) Distributed cache structure, with one CE per PE. (b) Frame buffer and data bus: Frame Buffer with Bank A, Bank B and Bank C, MUX and DEMUX, 4n-bit buses tapped into n-bit buses to the PE array. (c) Nearest neighbor interconnection. (d) Global bus interconnection.]

Fig. 52. An example of data flow-oriented array.
 
 
 
Consider mapping Eq. (2) in Chapter IV with N = 4 onto the proposed array in the same
way as we did for the 4x4 square-shaped PE array in Chapter IV. Fig. 53 shows the
snapshots taken at the time of maximum utilization of PEs for three cases: (a) temporal
mapping on the square-shaped array (4x4 PEs), (b) spatial mapping on the square-shaped
array (4x4 PEs) and (c) spatial mapping on the data flow-oriented array (12 PEs). In
cases (a) and (b), five PEs are not used because the merging addition does not fit well
with the square shape. However, in case (c), the array efficiently utilizes all of the
PEs without delaying any operation. This example shows that the proposed array structure
can avoid the area and power waste of the square-shaped array without performance
degradation.
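The utilization gap in this example is simple arithmetic, sketched below in Python. The operation lists follow the snapshots of Fig. 53 (one entry per PE, with NOP marking an unused PE); the function name is illustrative.

```python
def utilization(snapshot):
    """Fraction of PEs that execute something other than NOP in the snapshot."""
    used = sum(1 for op in snapshot if op != "NOP")
    return used / len(snapshot)

# One entry per PE at the time of maximum utilization (Fig. 53).
square_4x4 = ["LD/+"] * 4 + ["×"] * 4 + ["+"] * 3 + ["NOP"] * 5   # cases (a) and (b)
data_flow_12 = ["LD/+"] * 4 + ["×"] * 4 + ["+"] * 3 + ["×/ST"]     # case (c)

print(f"4x4 square array:    {utilization(square_4x4):.2%}")
print(f"12-PE data flow array: {utilization(data_flow_12):.2%}")
```

With five idle PEs out of sixteen, the square array peaks at 11/16 utilization, whereas the 12-PE array uses every PE.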
 
 
 
[Figure: PE-array snapshots showing the operation executed by each PE (LD/+, ×, +, ×/ST or NOP), with data coming from the register file and feedback paths marked. (a) Temporal mapping on 4x4 PE array; (b) Spatial mapping on the 4x4 PE array; (c) Spatial mapping on the data-flow oriented PE array.]

Fig. 53. Snapshots showing the maximum utilization of PEs.
 
 
 
b. Mitigation of Spatial Limitation in the Proposed Array Structure 
As shown in Fig. 53 (c), we spread the operations in the data flows (mostly loop bodies)
over the array space, instead of spreading the operations over time for each column to
implement temporal loop pipelining as shown in Fig. 53 (a). This implies that spatial
loop pipelining is the most suitable for the new array fabric. However, as mentioned in
Chapter IV (see subsection A.2), spatial mapping is not feasible for complex loops for
two reasons. One is that a large loop body may not fit in the limited reconfigurable
array and the other is that data dependencies between the operations typically require
allocating many interconnect resources. To mitigate this limitation, the new array
fabric should have rich interconnections to provide more flexible and multi-directional
data communication, and the PEs should be arranged in such a way as to utilize this
interconnection structure efficiently. As a solution to this problem, we propose a
design flow that generates a data-flow oriented array structure by determining the
topology of PEs and their interconnections.
3.  Data Flow-Oriented Array Design Flow 
The generation of a data-flow oriented array starts from a square-shaped array fabric,
on the assumption that the original square-shaped array fabric is well designed. We
generate the new data-flow oriented array such that it can efficiently implement any
application that can be implemented on the square-shaped array fabric. In the example of
Fig. 52 (a), since the data flow has a diamond shape, we can generate a diamond-shaped
array with fewer PEs but without any performance degradation, which is just like garment
cutting. Since we also want to cover the applications that can be implemented through
temporal mapping on the square-shaped array fabric, we do not just cut the fabric but
compose the new array by transforming the temporal interconnection structure into a
spatial interconnection structure.
In the temporal mapping, each loop iteration of an application kernel (critical loop)
is mapped onto a column of the square-shaped array. Therefore, it is sufficient to
analyze the interconnection fabric only within a column to derive the new array
structure.

[Figure: flow chart with the boxes: Square Array Fabric feasible to Temporal Loop Pipelining; Analysis of Intra-Half Column Connectivity; New Array Fabric Specification – Phase I; Analysis of Inter-Half Column Connectivity; New Array Fabric Specification – Phase II; Connectivity Enhancement; Cost-Effective Reconfigurable Array Fabric.]

Fig. 54. Overall design flow.

Fig. 54 shows the entire design flow. The flow starts from the analysis of the
intra-half column and inter-half column connectivity of a general square array fabric.
Intra-half column connectivity means nearest-neighbor or hopping interconnection between
PEs in a half column, and inter-half column connectivity means pair-wise interconnection
or global bus between PEs, with one PE in a half column and another PE in the other half
column. The new array fabric is partially derived by analyzing the intra-half column
connectivity of the square fabric in Phase I. Then Phase II elaborates the new array
fabric by analyzing the inter-half column connectivity of the square fabric. Finally, the
connectivity of the new array fabric is enhanced by adding vertical and horizontal global
buses. In the remainder of this subsection, from a through d below, we describe each
stage of the exploration flow in more detail.
a. Input Reconfigurable Array Fabric 
The 8x8 array given in Fig. 5 is used for the input array fabric to illustrate the proposed 
design flow. 
b. New Array Fabric Specification – Phase I 
In this phase, an initial version of the new array fabric is constructed by analyzing
the intra-half column connectivity of the input square array. Algorithm 6 shows this
procedure. Before we explain the procedure in detail, we describe the notations used in
it.
• (L1) base column denotes a half column in the n x n reconfigurable array.
• (L3) new_array_space denotes the 2-dimensional space of the constructed reconfigurable
array.
• (L5) source_column_group denotes a group of PEs composed of one or two columns in the
new_array_space. It is used as a source for deriving the new array fabric.
• (L12) |source_column_group| denotes the number of PEs in source_column_group.
• (L6) CHECK_INTERCONNECT is a function to identify nearest-neighbor or hopping
interconnections of PEs in source_column_group by analyzing the base column. If there is
such an interconnection that has not been processed yet, it returns true.
• (L8) LOC_TRI is a function that implements the local triangulation method, which adds
PEs, assigns them new positions in new_array_space, and connects them with the PEs
already existing in the source_column_group.
Algorithm 6  New Array Fabric Specification – Phase I
L1   base ← a half column of n x n reconfigurable array
L2   m ← number of memory-read buses of n x n reconfigurable array
L3   new_array_space ← Ø
L4   begin
L5     source_column_group ← Add a column composed of n/2 PEs in new_array_space
L6     while CHECK_INTERCONNECT(source_column_group, base)
L7     do
L8       LOC_TRI(source_column_group)
L9     end do
L10    source_column_group ← Ø
L11    source_column_group ← next two columns on the both sides in new_array_space
L12    if |source_column_group| > 2 then
L13      goto L6
L14    end if
L15    Add nearest-neighbor interconnections
L16    Add m memory-read buses
L17    Connect the read buses with the added PEs in the same row
L18    Copy the constructed fabric on vertically symmetric position
L19  end
The algorithm starts with the initialization step (L1~L3). Then a half column is added
into new_array_space, which is the initial source_column_group (L5). The next process is
to check the nearest-neighbor or hopping connectivity between two PEs in the same column
included in source_column_group (L6). This checking process (L6) continues until no more
interconnections are found. The first checking process is performed by simply
identifying the interconnections of the base column. If an interconnection is found, two
PEs are added into new_array_space and their interconnections and positions are assigned
by the local triangulation method (L8). This method reflects the intra-half column
connectivity while building the data flow-oriented array structure shown in Fig. 52.
 
[Figure: (a) An operation fully utilizing interconnections between two PEs: PE1 and PE2 hold A1 and A2, exchange them, and produce A1+A2 and A1-A2. (b) Butterfly operation example: the equivalent data flow graph. (c) Butterfly operation executed on a triangle structure including four PEs: PE1 and PE2 send A1 and A2 to PE3 (A1+A2) and PE4 (A1-A2). Legend: Ai is the data saved in PEi; interconnection; data flow.]

Fig. 55. Basic concept of local triangulation method.
 
 
 
We illustrate the basic concept of the local triangulation method in Fig. 55. Consider a
base column including 2 PEs and an operation that fully utilizes their interconnections
as shown in Fig. 55 (a): the data (A1 and A2) saved in two PEs (PE1 and PE2) are
exchanged with each other through the bidirectional interconnection, and then addition
and subtraction are performed in PE1 and PE2. This is a kind of butterfly operation, and
Fig. 55 (b) shows an equivalent data flow graph for it. If we consider a triangular
structure composed of four PEs reflecting the shape of the data flow graph, the example
can be mapped on the PEs as shown in Fig. 55 (c): two PEs (PE3 and PE4) on both sides
receive the data (A1 and A2) from the PEs (PE1 and PE2), and then addition and
subtraction are performed in PE3 and PE4. In this manner, the local triangulation method
builds a data flow-oriented array structure that reflects the intra-half column
connectivity.
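The equivalence between the two mappings of Fig. 55 can be stated as a short Python sketch. The function names are illustrative; each function only mimics which PE holds or receives which value.

```python
def butterfly_in_place(a1, a2):
    """Fig. 55 (a): PE1 and PE2 exchange their data, then PE1 adds and PE2 subtracts."""
    pe1, pe2 = a1, a2
    recv_at_pe1, recv_at_pe2 = pe2, pe1      # exchange over the bidirectional link
    return pe1 + recv_at_pe1, recv_at_pe2 - pe2   # PE1: A1+A2, PE2: A1-A2

def butterfly_on_triangle(a1, a2):
    """Fig. 55 (c): PE1 and PE2 forward A1 and A2 to PE3 and PE4 on both sides."""
    pe3_inputs = (a1, a2)
    pe4_inputs = (a1, a2)
    pe3 = pe3_inputs[0] + pe3_inputs[1]      # A1 + A2
    pe4 = pe4_inputs[0] - pe4_inputs[1]      # A1 - A2
    return pe3, pe4

print(butterfly_in_place(7, 3), butterfly_on_triangle(7, 3))
```

Both mappings produce the same sum and difference; the triangle version spreads the operation over space, which is what spatial mapping needs.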
 
[Figure: (a) Nearest-neighbor; (b) Hopping. Legend: PE in base column, PE in source_column_group, added PE.]

Fig. 56. Local triangulation method.
 
Fig. 56 shows two cases of the method. In the first case (a), two PEs in the base column
have a nearest-neighbor interconnection, which means that at most two PEs can be used
for a butterfly operation. Therefore, the local triangulation method adds two PEs into
new_array_space and assigns each PE the nearest-neighbor position on each side of the
source column, so that the positions are vertices of a triangle. Then the method assigns
nearest-neighbor interconnections between the added PEs and the PEs in the
source_column_group. The second case (b) shows that two PEs in the base column have a
bidirectional hopping interconnection. The local triangulation method is applied to this
case in the same way, with hopping interconnections instead of the nearest-neighbor
interconnections of the first case. Even though one-way interconnections are sufficient
to perform the butterfly operation in the two cases of Fig. 56, the added
interconnections are bidirectional. This keeps the basic characteristics of the data
flow-oriented array structure derived in Fig. 52.
 
[Figure: base column PE0 to PE3 and the derived PEs PE4 to PE9 for (a) Nearest-neighbor and (b) Hopping. Legend: PE in base column, PE in source_column_group, added PE, preoccupied PE in new array space, added interconnection, already connected.]

Fig. 57. Interconnection derivation in Phase I.
 
From the second checking process onward, preoccupied columns are included in
source_column_group. Fig. 57 shows two examples of how to find connectivity on the source
columns. In case (a), no interconnection between ‘PE4’ and ‘PE6’ (or ‘PE7’ and ‘PE9’) is
added because there is no hopping connectivity between ‘PE0’ and ‘PE2’ (or ‘PE0 (or
PE1)’ and ‘PE3’) in the base column. However, in case (b), the base column has an
interconnection between ‘PE0’ and ‘PE2’. Therefore, PEs and interconnections between
‘PE4’ and ‘PE6’ (or ‘PE7’ and ‘PE9’) are added by the local triangulation method. After
the iteration of adding PEs and interconnections (L5~L14) is finished, nearest-neighbor
interconnections are added between any two nearest-neighbor PEs that are not connected
with each other (L15). This guarantees the minimum data-mobility for data rearrangement.
Finally, memory-read buses are added⁴ (L16, L17) and the derived array is copied to the
vertically symmetric position (L18).

[Figure: (a) Base column; (b) Nearest-neighbor interconnection; (c) Hopping interconnection; (d) Memory read-buses, connected to the Frame Buffer.]

Fig. 58. New array fabric example by Phase I.

Fig. 58 shows the result of the Phase I procedure for the example of the 8x8
reconfigurable array shown in Fig. 5.
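To make the iteration of Algorithm 6 more concrete, the following Python sketch runs steps L5 to L14 for a base column that has nearest-neighbor interconnections only. It is a simplified illustration under several assumptions that are not part of the original procedure: PEs are placed on an integer grid with a vertical pitch of 2, an added PE sits one column outward at the midpoint of its two source PEs, hopping interconnections are ignored, and steps L15 to L18 (remaining nearest-neighbor links, memory-read buses, vertical copy) are omitted.

```python
from collections import Counter

def phase1_nearest_neighbor(n):
    """Return (pes, links) for one half of the fabric, nearest-neighbor links only."""
    pes = {(0, 2 * i) for i in range(n // 2)}     # L5: initial half column at x = 0
    links = set()
    source_columns = [0]
    while True:
        next_columns = []
        for x in source_columns:
            column = sorted(y for (px, y) in pes if px == x)
            # the central column grows to both sides, outer columns grow outward
            sides = (-1, 1) if x == 0 else ((1,) if x > 0 else (-1,))
            for y_lo, y_hi in zip(column, column[1:]):   # L6: neighboring pairs
                for side in sides:                        # L8: local triangulation
                    new_pe = (x + side, (y_lo + y_hi) // 2)
                    pes.add(new_pe)
                    links.add(((x, y_lo), new_pe))
                    links.add(((x, y_hi), new_pe))
                    if x + side not in next_columns:
                        next_columns.append(x + side)
        group_size = sum(1 for (px, _) in pes if px in next_columns)
        source_columns = next_columns                     # L10, L11
        if group_size <= 2:                               # L12: stop, else repeat L6
            break
    return pes, links

pes, links = phase1_nearest_neighbor(8)
column_sizes = [count for _, count in sorted(Counter(x for x, _ in pes).items())]
print(len(pes), column_sizes, len(links))
```

For n = 8 this sketch yields 16 PEs for one half, arranged in columns of 1, 2, 3, 4, 3, 2 and 1 PEs; the vertical copy of L18 would double that count under the same assumptions.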
c. New Array Fabric Specification – Phase II 
In this phase, new PEs and interconnections are added to reflect the inter-half column
connectivity of the input square fabric. Phase II analyzes two kinds of interconnections:
pair-wise and global bus. We propose another procedure as Algorithm 7. Before we explain
the procedure in detail, we introduce the notations used in the explanation.
• (L5) CHECK_INTERCONNECT is a function to identify bus-connectivity between two PEs in
the source column by analyzing the base column.
• (L6) GB_RHT_TRI means the global triangulation method, a function used to add global
buses and PEs.

Algorithm 7  New Array Fabric Specification – Phase II
L1   base ← a column of n x n reconfigurable array
L2   n ← number of global buses
L3   begin
L4     source_column_group ← central column in new_array_space
L5     while CHECK_INTERCONNECT(source_column_group, base) do
L6       GB__TRI (source_column)
L7     end do
L8     source_column_group ← Ø
L9     source_column_group ← next two columns on the both sides in new_array_space
L10    if |source_column_group| > 2 then
L11      goto L5
L12    end if
L13    Add nearest-neighbor interconnections
L14  end

The algorithm starts with the initialization step (L1, L2). Then the central column in
new_array_space becomes the initial source_column_group (L4). The next process is to
check the pair-wise or global bus connectivity between two PEs in the same column
included in source_column_group (L5). If an interconnection is found, global buses and
PEs are added in new_array_space and their interconnections and positions are assigned
by the global triangulation method (L6). The global triangulation method has the same
basic concept as the local triangulation method in that it also builds a
triangular-shaped array fabric suitable for spatial mapping while guaranteeing the
maximum inter-half column connectivity of the base column.

4 Memory write-buses are added in the step of connectivity enhancement in subsection VI.B.5. This is because some PEs can be
added in phase II and they should be connected to memory-write buses.
Fig. 59 shows three cases of the global triangulation method when the base column has a bidirectional pair-wise interconnection and two global buses. In the first case (a), the bidirectional pair-wise interconnection means that at most two PEs can be used for a butterfly operation. Therefore, the global triangulation method adds two PEs in new_array_space and assigns each added PE to the intersection point of two diagonal lines drawn from the two PEs in the source column, so that the positions form the vertices of a triangle. The method then assigns four global buses between the added PEs and the PEs in source_column_group. Fig. 59 (b) and (c) show the method when the base column has two global buses. In case (b), the two diagonal lines from the two PEs in the source column intersect on an already existing PE, called the ‘destination PE’. Therefore, four global buses are added, and they connect the destination PEs with the PEs in the source column. In case (c), however, no destination PE exists at the intersection point of the four diagonal lines. Therefore, new PEs, called global PEs (GPEs), are added to new_array_space together with the global buses.
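
The placement rule described above can be illustrated with a minimal sketch. It assumes that PEs sit on an integer (row, column) grid and that the diagonal lines run at 45 degrees; the function names and the "new PE" label (standing for an added PE or a GPE) are hypothetical and are not taken from the exploration flow itself.

```python
from typing import List, Optional, Set, Tuple

Pos = Tuple[int, int]  # (row, column) on an assumed integer grid


def diagonal_intersections(pe_a: Pos, pe_b: Pos) -> Optional[Tuple[Pos, Pos]]:
    """Return the left and right grid points where the 45-degree diagonals
    from two PEs in the same source column meet, or None if they do not
    meet on a grid point."""
    (row_a, col_a), (row_b, col_b) = pe_a, pe_b
    if col_a != col_b:
        raise ValueError("both PEs must lie in the same source column")
    gap = abs(row_a - row_b)
    if gap == 0 or gap % 2 == 1:
        return None  # identical PEs, or the diagonals cross between grid points
    half = gap // 2
    mid_row = min(row_a, row_b) + half
    return (mid_row, col_a - half), (mid_row, col_a + half)


def classify_vertices(pe_a: Pos, pe_b: Pos, occupied: Set[Pos]) -> List[Tuple[Pos, str]]:
    """Label each triangle vertex: an occupied position is a destination PE
    (only buses are needed); a free position requires a new PE (assumption:
    this covers both the added PE of case (a) and the GPE of case (c))."""
    vertices = diagonal_intersections(pe_a, pe_b)
    if vertices is None:
        return []
    return [(v, "destination PE" if v in occupied else "new PE") for v in vertices]
```

For example, under these assumptions two source PEs at rows 2 and 4 of column 4 produce triangle vertices at (3, 3) and (3, 5); whether each vertex needs a new PE depends only on whether that position is already occupied.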
 
(a) When pair-wise interconnection exists in base column
(b) When destination PE exists
(c) When GPE is added
[Legend: PE in base column; PE in source_column_group; added PE; preoccupied PE in new array space; added global bus; connection between PE and bus; ?: destination PE does not exist; GPE: added global PE]
Fig. 59. Global triangulation method when n = 2 (L2).

Fig. 60. New array fabric example by Phase II.
 
 
 
This checking process (L5) continues until no more connectivity is found. Nearest-neighbor interconnections are then added between any two adjacent PEs that are not yet connected with each other, which guarantees the minimum data mobility needed for data rearrangement.
Fig. 60 shows the result of the Phase II procedure for the example of an 8x8 reconfigurable array.
d.  Connectivity Enhancement 
Finally, vertical and horizontal buses can be added to enhance the connectivity of the new reconfigurable array. This is because the new array fabric from Phases I and II supports sufficient diagonal connectivity but has only nearest-neighbor or hopping interconnections in the vertical and horizontal directions. Each added horizontal bus is used as a memory-write bus connected to the frame buffer as well as for data transfer between PEs.
 
Fig. 61 shows the result of the connectivity enhancement for the example of an 8x8 reconfigurable array. Each bus is shared by two PEs on both sides.
 
 
 
Fig. 61. New array fabric example by connectivity enhancement.
 
4. Cost-Effective Array Fabric with Resource Sharing and Pipelining 
The resource sharing and pipelining described in Section A can be applied to the proposed new array fabric because its computation model is spatial loop pipelining: spatial mapping spreads the entire loop body over the PE array, so not all PEs need the same functional resources at the same time. Fig. 62 shows that the PEs in the same row share two pipelined multipliers.
Fig. 63 shows an application mapping on an example of the new array fabric generated by the exploration flow. Consider N = 8 for the mapping of the computation defined in Eq. (2) in Chapter IV onto the new array fabric shown in Fig. 53. Load and addition operations are executed by the PEs in the central column in the first cycle. The subsequent multiplications and summations are then spread spatially over both sides of the array until the 6th cycle. Finally, in the next two cycles, a PE in the central row performs the multiplication and store operations. The architecture, which includes 16 multipliers, supports the mapping example without any stall caused by a lack of multipliers. In this example, nearest-neighbor interconnections and global buses are used efficiently for multi-directional data transfer, and the new array achieves the same performance (number of execution cycles) as the square array.

[Legend: MULT: 2-stage pipelined multiplier; SW: bus switch]
Fig. 62. New array fabric with resource sharing and pipelining.

[Legend: LD/+: data load and addition; 1×/2×: consecutive pipelined multiplication; SUM: summation; 1×/2×/ST: consecutive pipelined multiplication, then store; NOP: no operation, unused PE; the number beside each operation is the cycle in which it is executed; arrows show data flow on global buses and on nearest-neighbor interconnections]
Fig. 63. Mapping example on new array fabric.
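
The condition "no stall caused by a lack of multipliers" can be checked with a small sketch. It assumes that each pipelined multiplier accepts one new multiplication per cycle and that a mapping is summarized as (cycle, row) pairs, one per multiplication issued; the function name and this schedule representation are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def multiplier_stalls(
    issues: Iterable[Tuple[int, int]],
    multipliers_per_row: int = 2,
) -> List[Tuple[int, int, int]]:
    """Return (cycle, row, excess) for every cycle/row pair in which more
    multiplications are issued than the row's shared multipliers can accept.

    Assumption: a pipelined multiplier can start one new operation per cycle,
    so only the number of issues in the same cycle and row matters.
    """
    demand = Counter(issues)
    return sorted(
        (cycle, row, count - multipliers_per_row)
        for (cycle, row), count in demand.items()
        if count > multipliers_per_row
    )
```

An empty result means the mapping never asks a row for more than its two shared multipliers in one cycle, which is the property the mapping example relies on.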
C. Experiments 
1. Experimental Setup 
a. Evaluated Applications 
The target application domain is composed of representative kernels in the MPEG-4 AAC decoder, the H.263 encoder, the H.264 decoder, and 3D graphics. In addition, to demonstrate the effectiveness of our approaches on benchmark domains, we have applied several kernels of the Livermore loops benchmark [58] and DSPstone [59].
 
 
b. Hardware Design and Power Estimation  
To demonstrate the effectiveness of Resource Sharing and Pipelining (RSP), we have applied the RSP techniques to the base RAA (BASE) defined in Chapter III (see Section B) and implemented the RSP architecture (RSPA) at the RT-level with VHDL. In Chapter III (see Section C), we confirmed that the multiplier is both an area-critical and a delay-critical resource. Therefore, we have taken the multipliers out of the PE design and arranged them as shared and pipelined resources. From the analysis of our target applications, we have determined the sharing architecture: two pipelined multipliers shared by the 8 PEs in each row. The RSP architecture, which includes 16 multipliers, therefore supports all of the target applications without any stall caused by a lack of multipliers. In addition, we have implemented the entire exploration flow in Fig. 54 in C++. The implemented exploration flow has generated the specification of the new reconfigurable array fabric, with the base RAA (Chapter III) used as its input. For quantitative evaluation, we have designed two PE arrays based on the generated specification at the RT-level with VHDL: the new array fabric alone (NAF) and NAF with RSP (RSP+NAF). In RSP+NAF, 18 multipliers are shared by PEs in both the row and column directions, and this architecture also supports all of the target applications without any stall caused by a lack of multipliers. The architectures have been synthesized using Design Compiler [49] with 0.18 ㎛ technology. ModelSim [50] and PrimePower [49] have been used for gate-level simulation and power estimation. Simulation has been done for the typical case at an operating frequency of 100 MHz, a Vdd of 1.8 V, and a temperature of 27℃.
2.  Results 
a.  Area Evaluation 
Table X shows the area cost evaluation for the four cases. In RSPA, the area cost of the PE array is reduced by 22.11% because it has fewer multipliers than BASE. NAF achieves a larger area reduction ratio (25.58%) than RSPA because it has fewer PEs. RSP+NAF achieves the largest area reduction ratio (36.75%) because it reduces the number of multipliers in addition to the number of PEs. However, the interconnect area of RSPA (or RSP+NAF) is larger than that of BASE (or NAF), because several buses are added to connect the shared multipliers with the PEs.
 
Table X. Area Reduction Ratio by RSPA and NAF
PE Array    Number    Number of      Gate Equivalent                         Reduction Ratio (%)
Structure   of PEs    Multipliers    Interconnect(a)   Logic(b)   Total(c)   compared with BASE
BASE        64        64             164908            494726     659635     -
RSPA        64        16             170008            343781     513789     22.11
NAF         44        44             156163            334737     490900     25.58
RSP+NAF     44        18             164414            252805     417219     36.75
(a) Interconnect: net interconnect area. (b) Logic: total cell area. (c) Total: Interconnect + Logic.
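
The reduction ratios in Table X follow directly from the Total column. The short sketch below reproduces them from the reported totals; the dictionary and function names are illustrative only.

```python
# Total gate equivalents as reported in the Total column of Table X.
TOTAL_GATES = {
    "BASE": 659635,
    "RSPA": 513789,
    "NAF": 490900,
    "RSP+NAF": 417219,
}


def reduction_ratio(total: int, base_total: int) -> float:
    """Area reduction in percent relative to the base array."""
    return 100.0 * (base_total - total) / base_total


ratios = {
    name: round(reduction_ratio(total, TOTAL_GATES["BASE"]), 2)
    for name, total in TOTAL_GATES.items()
    if name != "BASE"
}
print(ratios)  # matches the last column of Table X
```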
 
b. Performance Evaluation 
The synthesis results show that RSPA has a shorter critical path delay (5.12 ns) than BASE (8.96 ns). This is because the RSP technique excludes the combinational logic path of the multiplier from the original set of critical paths. The critical path of RSPA and its delay are given by

T_{Critical path} = T_{Multiplexor} + T_{ALU} + T_{Shift_logic} + T_{others}          (3)
(5.12 ns = 0.32 ns + 2.22 ns + 1.42 ns + 1.16 ns)
 
Table XI shows that BASE and NAF (or RSPA and RSP+NAF) have the same critical path delay, which indicates that NAF does not degrade performance in terms of the critical path delay. In addition, the execution cycle count of each kernel on NAF (or RSP+NAF) does not vary from that on BASE (or RSPA) because the functionality of NAF is the same as that of the base model. This indicates that NAF does not degrade performance in terms of the execution cycle count either.

Table XI. Applications Characteristics and Performance Evaluation
                                                        BASE and NAF (8.96 ns)(d)   RSPA and RSP+NAF (5.12 ns)(d)
Kernels                        Operations(c)            Cycle count   ET (ns)(e)    Cycle count   ET (ns)(e)   Reduced (%)(f)
First_Diff(a)                  sub                      15            134.40        15            76.80        42.86
Tri-Diagonal(a)                sub, mult                17            152.32        18            92.16        39.50
State(a)                       add, mult                20            179.20        23            117.76       34.29
Hydro(a)                       add, mult                15            134.40        19            97.28        27.62
ICCG(a)                        sub, mult                18            161.28        19            97.28        39.68
Inner Product(b)               add, mult                21            188.16        22            112.64       40.14
24-Taps FIR(b)                 add, mult                20            179.20        21            107.52       40.00
Matrix-vector multiplication   add, mult                19            170.24        20            102.40       39.85
Mult in FFT                    add, sub, mult           23            206.08        27            138.24       32.92
Complex Mult in AAC decoder    add, sub, mult           16            143.36        17            87.04        39.29
ITRANS in H.264 decoder        add, sub, shift          18            161.28        18            92.16        42.86
DCT in H.263 encoder           add, sub, shift, mult    32            286.72        40            204.80       28.57
IDCT in H.263 encoder          add, sub, shift, mult    34            304.64        42            215.04       29.41
SAD in H.263 encoder           add, abs                 39            349.44        39            199.68       42.86
Quant in H.263 encoder         add, sub, shift, mult    39            349.44        45            230.40       34.07
Dequant in H.263 encoder       add, sub, shift, mult    41            367.36        47            240.64       34.49
(a) Livermore loop benchmark suite. (b) DSPstone benchmark suite. (c) Acronyms of operations: add: addition, sub: subtraction, shift: bit-shift, mult: multiplication. (d) Critical path delay. (e) Execution time = cycle count × critical path delay (ns). (f) Reduction ratio of execution time compared with BASE.
We have run the application kernels on the implemented architectures to obtain the results in Table XI. The amount of performance improvement depends on the application. For example, RSPA and RSP+NAF improve performance much more for First_Diff, SAD, and ITRANS, which have no multiplication, than for DCT and Hydro, which include multiplication. This is because pipelining the multipliers increases the clock frequency, whereas the execution cycle count of these kernels does not vary from that on BASE and NAF.
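
The relation behind Table XI (execution time = cycle count × critical path delay) can be made concrete with a short sketch. The two delays and the cycle counts are taken from Table XI; the function names are illustrative.

```python
BASE_DELAY_NS = 8.96  # critical path delay of BASE and NAF
RSP_DELAY_NS = 5.12   # critical path delay of RSPA and RSP+NAF


def execution_time_ns(cycles: int, delay_ns: float) -> float:
    """Execution time = cycle count x critical path delay."""
    return cycles * delay_ns


def reduction_percent(base_cycles: int, rsp_cycles: int) -> float:
    """Reduction of execution time relative to BASE, in percent."""
    base = execution_time_ns(base_cycles, BASE_DELAY_NS)
    rsp = execution_time_ns(rsp_cycles, RSP_DELAY_NS)
    return 100.0 * (base - rsp) / base


# First_Diff has no multiplication: 15 cycles on both architectures.
print(round(reduction_percent(15, 15), 2))
# DCT uses the pipelined multipliers: 32 cycles on BASE, 40 on RSPA.
print(round(reduction_percent(32, 40), 2))
```

When the cycle count is unchanged, the reduction equals 1 - 5.12/8.96, about 42.86%, which is the upper bound seen for First_Diff, SAD, and ITRANS; extra cycles spent in the multiplier pipeline lower it for the other kernels.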
 
Table XII. Power Reduction Ratio by RSP+NAF
Kernels                         BASE          RSP+NAF       Reduction Ratio (%)
                                Power (mW)    Power (mW)    compared with BASE
First_Diff                      201.07        129.79        35.45
Tri-Diagonal                    190.75        130.89        31.38
State                           198.37        138.62        30.12
Hydro                           190.86        129.35        32.23
ICCG                            164.42        112.92        31.32
Inner Product                   200.09        139.30        30.38
24-Taps FIR                     174.38        116.40        33.25
Matrix-vector multiplication    163.25        113.48        30.49
Mult in FFT                     187.68        125.30        33.24
Complex Mult in AAC decoder     222.14        148.55        33.13
ITRANS in H.264 decoder         198.32        137.89        30.47
DCT in H.263 encoder            212.25        147.90        30.32
IDCT in H.263 encoder           208.99        143.58        31.30
SAD in H.263 encoder            181.22        123.23        32.00
Quant in H.263 encoder          199.38        137.33        31.12
Dequant in H.263 encoder        196.97        131.28        33.35
c. Power Evaluation  
Table XII compares the power consumption of the two reconfigurable arrays, BASE and RSP+NAF. Both arrays have been implemented without any low-power technique so that the power saving of the architecture itself can be evaluated. Compared to BASE, RSP+NAF saves between 30.12% and 35.45% of the power. This reduction is possible because RSP+NAF uses fewer PEs and multipliers than the base reconfigurable array to do the same job. For larger array sizes, the power saving is expected to increase further owing to the significant reduction in unutilized PEs.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
CHAPTER VIII 
HIERARCHICAL RECONFIGURABLE COMPUTING ARRAYS 
 
In this chapter, we propose a new computing hierarchy consisting of two reconfigurable computing blocks that use two types of communication structure together [63]. In addition, the two computing blocks share critical resources. Such a sharing structure provides an efficient communication interface between the blocks while reducing the overall area. Based on the proposed architecture, optimized computing flows have been implemented according to the varying applications for low power and high performance. Experimental results show that the proposed approach reduces on-chip area by 22%, execution time by up to 72%, and power consumption by up to 55% compared with conventional CGRA-based architectures.
A. Motivation 
1. Limitation of Existing Processor-RAA Communication Structures 
A typical coarse-grained reconfigurable architecture consists of a microprocessor, a Reconfigurable Array Architecture (RAA), and their interface. We can consider three types of organization for connecting the RAA to the processor. First, the array can be connected to the processor through a system bus as an ‘Attached IP’ [3][10][12][15][19][64], as shown in Fig. 2 (a). The main benefit of this organization is the ease of constructing such a system using a standard processor without modifying the processor or its compiler. In addition, the large data buffer of the RAA can be used to support applications having large inputs/outputs. However, the speed improvement from the RAA has to compensate for the significant communication overhead between the processor and the RAA through the system bus, and the large SRAM-based data buffer in the RAA consumes much power. In the second type of organization, the array is connected to the processor as a ‘Coprocessor’ [4][65][66], as shown in Fig. 2 (b). In this case, the standard processor does not change, and communication is faster than in the ‘Attached IP’ type because the coprocessor register-set is used as the data buffer of the RAA and the processor can access the register-set with coprocessor data transfer instructions. In addition, the register-set consumes less power than the data buffer of the ‘Attached IP’ type. However, since the size of the register-set is fixed by the processor ISA, it creates a performance bottleneck in register-to-PE-array traffic when applications having large inputs/outputs run on the RAA. In the third type of organization, the array is placed inside the processor like a ‘FU (Functional Unit)’ [2][16][22][67][68], as shown in Fig. 2 (c). In this case, the instruction decoder issues special instructions to perform specific functions on the RAA as if it were one of the standard functional units of the processor. The communication speed is faster than in the ‘Coprocessor’ type and the power consumption of the data storage is less than in the ‘Attached IP’ type, because the processor register-set is used as the data buffer of the RAA and the processor can directly access the register-set with processor instructions. However, the standard processor needs to be modified to integrate the RAA, and its compiler must also be changed. In addition, the limited size of the processor register-set causes a performance bottleneck, as in the ‘Coprocessor’ type organization. Table XIII summarizes the advantages and disadvantages of the three coupling types.
 
Table XIII. Comparison of the Basic Coupling Types
Coupling type     Comm’ power*   Comm’ speed**   Performance bottleneck                      Application feasibility
Attached IP       high           slow            communication through system bus            large size of input/output
Coprocessor       low            fast            limited size of coprocessor register-set    small size of input/output
Functional unit   low            very fast       limited size of processor registers         small size of input/output
*Comm’ power: power consumption by data storage (data buffer or registers)
**Comm’ speed: communication speed between processor and RAA
 
 
 
2. RAA-based Computing Hierarchy 
As mentioned in the previous subsection, each of the three basic types of RAA organization has advantages and disadvantages depending on the input/output size of the applications. This shows that an existing coupling structure with a conventional RAA cannot flexibly support various applications without sacrificing performance. In addition, such an RAA structure cannot efficiently utilize its PE array and data buffer, which leads to high power consumption.
We hypothesize that if a CGRA maintains a computing hierarchy of RAAs having different sizes and communication speeds, as shown in Fig. 64 (b), the CGRA-based embedded system can be optimized for performance and power. This is because such a hierarchical arrangement of RAAs can optimize the communication latency and efficiently utilize the functional resources of the PE arrays in various applications. In this chapter, we propose a new CGRA-based architecture that supports such an RAA-based computing hierarchy.
[Figure: panels (a) and (b) each show a processor above two levels of memory or RAA, annotated with speed (fastest to slowest) and size (smallest to largest).]
(a) Memory                                                    (b) RAA
Fig. 64. Analogy between Memory and RAA-computing hierarchy.
 
[Figure: (a) an nxn PE array split into an nxm L1 PE array and an nx(n-m) L2 PE array; (b) the reconfigurable computing cache (RCC) with coprocessor registers (CREG) on the coprocessor interface and the RAA with its data buffer on the system bus, shown beside a conventional RAA; each block has a configuration cache.]
(a) Size                                           (b) Speed
Fig. 65. Computing hierarchy of CGRA.
 
B.  Computing Hierarchy in CGRA 
In order to implement efficient CGRA-based embedded systems, we propose a new computing hierarchy consisting of two computing blocks that use two types of coupling structure together: ‘Attached IP’ and ‘Coprocessor’. In this organization, a general RAA with a large PE array is connected to the system bus, and a small RAA with a small PE array is coupled with the processor through the coprocessor interface. We call the small RAA a reconfigurable computing cache (RCC) because, like a data cache, it plays an important role in enhancing the performance and power of the entire CGRA. The RCC and the RAA share critical resources, and this sharing structure provides an efficient communication interface between the two computing blocks. The proposed approach ensures that the RCC and the RAA are efficiently utilized to support variable sizes of inputs and outputs for a variety of applications. In subsections B.1 and B.2, we describe the computing hierarchy and the resource sharing in the RCC and the RAA in detail. Then, in subsection B.3, we show how to optimize the computing flow based on the reconfigurable computing cache according to the applications.
1.   Computing Hierarchy – Size and Speed  
A CGRA-based computing hierarchy is formed by splitting a conventional RAA computing block into two computing blocks, an RCC with a small PE array and an RAA with a large PE array, as shown in Fig. 65 (a). The RCC is coupled through the coprocessor interface and the RAA is attached to the system bus, as shown in Fig. 65 (b). The RCC provides fast communication with the processor and offers low power consumption by using the coprocessor register-set and a small array. Therefore, the RCC can enhance performance and reduce power consumption when small applications run on the CGRA. If the RCC is not sufficient to support the computing requirements of an application, intermediate data from the RCC can be moved to the RAA through the interconnections shown in Fig. 66. Such interconnections between the two blocks offer flexibility in migrating computing demands from one to the other. Such a computing flow may help to optimize performance and power for applications having various sizes of inputs/outputs, whereas the existing models show performance bottlenecks caused by communication overhead or limited data storage, as shown in Table XIII. We describe the computing flow optimization in detail in subsection B.3.

[Figure blocks: on-chip bus, processor, memory, DMA, MUX unit, CREG (coprocessor registers), reconfigurable computing cache with L1 PE array and configuration cache, RAA with L2 PE array, data buffer, and configuration cache.]
Fig. 66. CGRA configuration with RCC and RAA.

2.  Resource Sharing in RCC and RAA 
We have so far presented two factors (speed and size) in building a computing hierarchy for CGRAs similar to a memory hierarchy. It may seem that a small portion of the RAA has simply been detached from the large CGRA block and placed as the fast RCC block adjacent to the processor, coupled through the coprocessor interface. However, considering only these two factors is not sufficient to design a compact RCC with power and area benefits. This is because computing blocks can have diverse functionality, which affects the system capabilities. The functionality of a computing block is specified by the functional resources of its PEs, such as adders, multipliers, shifters, and logic operators. Therefore, it is necessary to examine how to select the functionalities of the RCC and the RAA, which leads to a further study of resource assignment and sharing between the RCC and the RAA.
First of all, we can classify the functional resources into two groups: primitive resources and critical resources. Primitive resources are basic functional units such as adders/subtractors and logical operators. Critical resources are area/delay-critical ones such as multipliers and dividers. Based on this classification, let us consider the two functional resource configurations shown in Fig. 67. Fig. 67 (a) shows hierarchical functionality, in which the L1 PE array has only primitive resources and the L2 PE array includes critical resources as well as primitive resources. Fig. 67 (b) shows identical functionality in both the L1 and L2 PE arrays. In case (a), the RCC with the L1 PE array is a relatively lightweight computing block compared to the RAA with the L2 PE array. Therefore, the RCC can run small applications having only primitive operations with low power consumption. However, this causes a ‘lack of resource’ problem when applications demand critical operations. In case (b), the L1 and L2 PE arrays have identical functionality at the cost of area and power overheads.
To avoid both extremes, we propose resource sharing for the RCC and the RAA based on [44]. The L1 and L2 PE arrays have the same primitive resources and share the pipelined critical resources, as shown in Fig. 68. Here, the RCC and the RAA basically perform the primitive operations, and their functionality is extended to the critical operations by using the shared resources. Fig. 69 shows the interconnection structure of the shared critical resources together with the RCC and the RAA. PEs in the same row of the L1 and L2 arrays share the pipelined critical resources in the same manner as in [44]. Such a structure avoids the ‘lack of resource’ problem of Fig. 67 (a), and it is more area- and power-efficient than Fig. 67 (b) because the number of critical resources is reduced and the critical resources taken out of the L1 and L2 PE arrays are not affected by unnecessary switching activity caused by other resources. In addition, the interconnections for resource sharing can also be used as the communication interface between the RCC and the RAA by adding a multiplexer and a de-multiplexer at the front and end of the critical resources, as shown in Fig. 69 (b).
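
To compare the three resource assignments numerically, the sketch below counts the units of one critical resource type (for example, multipliers) under each scheme. It assumes one unit per PE wherever a PE owns the resource, and it uses the 8x2 L1 array, the 8x6 L2 array, and the two shared units per row reported in the experiments of this chapter; the function and scheme names are hypothetical.

```python
def critical_resource_count(
    rows: int,
    l1_cols: int,
    l2_cols: int,
    scheme: str,
    shared_per_row: int = 2,
) -> int:
    """Number of units of one critical resource type for a given scheme.

    Assumption: a PE that owns the resource holds exactly one unit of it.
    """
    if scheme == "hierarchical":  # Fig. 67 (a): only L2 PEs own the resource
        return rows * l2_cols
    if scheme == "identical":     # Fig. 67 (b): every PE owns the resource
        return rows * (l1_cols + l2_cols)
    if scheme == "shared":        # Fig. 68: units are shared by each row
        return rows * shared_per_row
    raise ValueError(f"unknown scheme: {scheme}")


for scheme in ("hierarchical", "identical", "shared"):
    print(scheme, critical_resource_count(8, 2, 6, scheme))
```

Under these assumptions the shared scheme needs 16 units instead of 64 for identical functionality, while still giving the L1 PE array access to critical operations.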
 
[Figure: (a) the L1 PE array provides ADD, SUB, AND, OR, XOR and the L2 PE array provides ADD, SUB, AND, OR, XOR, MULT, SHIFT; (b) both the L1 and L2 PE arrays provide ADD, SUB, AND, OR, XOR, MULT, SHIFT. CR: critical resource.]
(a) Hierarchical functionality                         (b) Identical functionality
Fig. 67. Two cases of functional resource assignment.

[Figure: the L1 and L2 PE arrays each provide ADD, SUB, AND, OR, XOR; the shared critical resources provide MULT, SHIFT. CR: pipelined critical resource.]
Fig. 68. Critical resource sharing and pipelining in L1 and L2 PE array.

[Figure: (a) on-chip bus with processor, memory, DMA, CREG, and MUX unit; RCC with L1 PE array and configuration cache; RAA with L2 PE array, data buffer, and configuration cache; shared critical resources; (b) L1 PE array, demux, critical resource (CR), mux, and L2 PE array, with the interconnect for communication between the L1 and L2 PE arrays.]
(a) Entire structure                               (b) Interconnection structure
Fig. 69. Interconnection structure among RCC, shared critical resources and L2 PE array.

3.  Computing Flow Optimization  
Based on the proposed CGRA structure, we can classify four cases of optimized computing flow to achieve low power and high performance. Fig. 70 shows these four computing flows on the proposed CGRA according to the input and output sizes of applications. Subsection C.1.a shows that we can select the optimal case among the proposed computing flows for several applications with varying input/output sizes. In all of the cases, the shared critical resources are used only as needed, that is, only when applications have operations requiring the critical resources.
Fig. 70 (a) shows the computing flow when the application has the smallest inputs and outputs. In this case, only the RCC functional units are used to execute the application while the RAA is disabled to reduce power consumption. However, if the application has larger inputs and outputs than in (a), the computing flow can be extended to the L2 PE array, as shown in Fig. 70 (b). Even though the L2 PE array is used in this case, the data buffer of the RAA is not used because the coprocessor register-set (CREG) is sufficient to hold all of the inputs and outputs. In the next case, shown in Fig. 70 (c), the RAA is used with the RCC because of large inputs and small outputs. Here the data buffer of the RAA receives the inputs using DMA, which is more efficient for overall performance than using the CREG, because the CREG is too small for the large inputs and would cause a performance bottleneck with heavy register-to-PE-array traffic. Therefore, the L2 PE array may be used first for running such an application, and the L1 PE array can be utilized as needed to enhance parallel execution. However, the outputs are stored in the CREG because their size is small. Finally, Fig. 70 (d) shows the case in which the RAA is used, optionally together with the L1 PE array, for large inputs and outputs. To avoid heavy register-to-PE-array traffic caused by the large input/output size, the data buffer with DMA is used, and the L1 PE array can optionally be utilized to enhance parallel execution.

(a) Smallest inputs and outputs (STIO)
(b) Small inputs and outputs (SIO)
(c) Large inputs and small outputs (LISO)
(d) Large inputs and outputs (LIO)
Fig. 70. Four cases of computing flow according to the input/output size of application.

In summary, the computing flow on the proposed CGRA can be adapted to the input/output size of applications. It is more power-efficient than a conventional CGRA because the computing blocks are separated and share critical resources, so that only the necessary computing blocks are utilized. In addition, a computing flow that supports two communication interfaces reduces power and enhances performance.
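
The four cases can be summarized as a small decision procedure. The sketch below is only an illustration under stated assumptions: "small" is taken to mean that the data fits in the coprocessor register-set (512 bytes in the implementation of Table XIV), the `fits_l1` flag stands for whether the L1 PE array alone can run the kernel, and the combination of small inputs with large outputs, which is not described in the text, is mapped to LIO. The names `Flow` and `select_flow` are hypothetical.

```python
from enum import Enum


class Flow(Enum):
    STIO = "smallest inputs and outputs"
    SIO = "small inputs and outputs"
    LISO = "large inputs and small outputs"
    LIO = "large inputs and outputs"


def select_flow(
    input_bytes: int,
    output_bytes: int,
    creg_bytes: int = 512,
    fits_l1: bool = True,
) -> Flow:
    """Pick a computing flow from the input/output sizes of a kernel.

    Assumptions: data is "small" when it fits in the CREG; `fits_l1` tells
    whether the L1 PE array alone is sufficient for the kernel.
    """
    inputs_fit = input_bytes <= creg_bytes
    outputs_fit = output_bytes <= creg_bytes
    if inputs_fit and outputs_fit:
        # CREG holds all data; extend to the L2 PE array only if needed.
        return Flow.STIO if fits_l1 else Flow.SIO
    if not inputs_fit and outputs_fit:
        # Inputs arrive through the data buffer with DMA, outputs go to CREG.
        return Flow.LISO
    # Large outputs need the data buffer with DMA (assumption: this also
    # covers small inputs with large outputs, a case the text does not list).
    return Flow.LIO
```

For instance, with the default 512-byte CREG, a kernel with 4096 bytes of input and 64 bytes of output would be classified as LISO under these assumptions.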
C.  Experiments 
1.  Experimental Setup 
a. Architecture Implementation  
To demonstrate the effectiveness of the proposed RCC-based CGRA, we have designed 
three different organizations of CGRA with RT-level implementation using VHDL as 
shown in Table XIV.   
 
Table XIV. Comparison of the Architecture Implementations
CGRA                  PE array                               Data storage
Attached IP           8x8 PE array                           6KB data buffer
Coprocessor           8x8 PE array                           512-byte coprocessor register-set
Proposed RCC-based    8x2 L1 PE array and 8x6 L2 PE array    4KB data buffer and 512-byte coprocessor register-set
(An ARM7-compatible 32-bit RISC processor is used as the main processor.)

In addition, for resource sharing in the RCC-based CGRA, two pipelined multipliers and two shifters are shared by the PEs in the same row of the L1 and L2 PE arrays, whereas the two conventional types of CGRA do not support such resource sharing and pipelining.
The architectures have been synthesized using Design Compiler [49] with 0.18 ㎛ technology. PrimePower [49] has been used for gate-level simulation and power estimation. To obtain the power consumption data, we have used the applications in Table XV for simulation at an operating frequency of 100 MHz under the typical case of 1.8 V Vdd and 27℃.
 
Table XV. Applications Characteristics
Real Applications                              SHR    Computing Flow
(H.263) 8x8 DCT                                Yes    SIO
(H.263) 8x8 IDCT                               Yes    SIO
(H.263) 8x8 QUANT                              Yes    SIO
(H.263) 8x8 DEQUANT                            Yes    SIO
(H.263) SAD                                    -      LISO
(H.264) 4x4 ITRANS                             Yes    STIO
(H.264) MSE                                    Yes    LISO
(H.264) MAE                                    -      LISO
(H.264) 16x16 DCT                              Yes    LISO
8x8*8x1 Matrix-Vector Multiplication           Yes    SIO
16x16*16x1 Matrix-Vector Multiplication        Yes    LISO
8x8 Matrix Multiplication                      Yes    SIO
16x16 Matrix Multiplication                    Yes    LISO

Benchmarks                                     SHR    Computing Flow
*256-point FFT                                 Yes    LISO
*256-tap FIR                                   Yes    LISO
*Complex Mult                                  Yes    LISO
**State                                        Yes    STIO
**Hydro                                        Yes    STIO
**Tri-Diagonal                                 Yes    LIO
**First-Diff                                   -      STIO
**ICCG                                         Yes    STIO
**Inner Product                                Yes    LIO
*: DSPstone benchmarks [71]
**: Livermore loop benchmarks [70]
SHR: ‘Yes’ means critical resources are used for the application.
STIO: smallest inputs and outputs
SIO: small inputs and outputs
LISO: large inputs and small outputs
LIO: large inputs and outputs

b. Evaluated Applications  
The evaluated applications are composed of real multimedia applications and benchmarks. We have analyzed the input/output sizes and operation types of the applications to identify the appropriate computing flow in Fig. 70 for each of them. Table XV shows the selected applications and their optimal computing flows.
 
Table XVI. Area Cost Comparison
PE Array    No’ of   No’ of   No’ of   Gate Equivalent                    Reduction
            PEs      MULTs    SHTs     Interconnect   Logic     Total     (%)
Base 8x8    64       64       64       164908         494726    659635    -
Proposed    64       16       16       175434         334595    510029    22.68
 
2.  Results 
a.  Area Cost Evaluation  
Table XVI shows the area cost evaluation for the two cases. ‘Base 8x8’ denotes the 8x8 PE
array included in the ‘Attached IP’ and ‘Coprocessor’ type CGRAs. ‘Proposed’ denotes the
L1 and L2 PE array included in the proposed RCC-based CGRA. Even though the
interconnection area of the proposed model increases because of the resource sharing
structure, its total area is reduced by 22.68% because it has fewer critical resources
than the base 8x8 PE array.
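The reduction ratio can be reproduced directly from the gate counts in Table XVI. The following Python sketch is only an illustrative check; the helper name `reduction_percent` and the dictionary layout are assumptions made for this example and are not part of the original design flow.

```python
# Illustrative check of Table XVI; gate-equivalent counts are copied from the table.
# The helper name and the data layout are assumptions made for this sketch.

def reduction_percent(base, proposed):
    """Relative reduction of `proposed` with respect to `base`, in percent."""
    return (base - proposed) / base * 100.0

base_8x8 = {"interconnect": 164908, "logic": 494726, "total": 659635}
proposed = {"interconnect": 175434, "logic": 334595, "total": 510029}

area_change = {
    part: reduction_percent(base_8x8[part], proposed[part])
    for part in base_8x8
}

for part, value in area_change.items():
    # A negative value means that the area of this part increased.
    print(f"{part:12s} {value:7.2f} %")
```

The negative interconnect value reflects the additional resource sharing structure, while the logic saving from using 16 instead of 64 multipliers and shifters more than compensates for it.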
b. Performance Evaluation
The synthesis results show that the proposed PE array has a shorter critical path delay
(5.12 ns) than the base PE array (8.96 ns). This is because the pipelined multipliers are
excluded from the original set of critical paths. Based on the synthesis results, we
evaluate the execution times of the selected applications on the three types of CGRA, as
shown in Fig. 71. The execution times include the communication time between the
memory/processor and the RAA or RCC.

[Figure: bar charts of execution time (ns) for the Proposed, Coproc-Type and IP-Type CGRAs.
(a) Real applications  (b) Benchmarks]
A%/B%: A% means reduced execution time ratio compared with Coproc-Type and B% means reduced
execution time ratio compared with IP-Type.

Fig. 71. Performance comparison.

Each application is executed on the RCC-based CGRA using the computing flow selected in
Table XV; all of the applications are classified under four computing flows (STIO, SIO,
LISO and LIO). For STIO and SIO, the performance improvement over the ‘Coprocessor’ type
(30.31%~37.03%) is smaller than for LIO and LISO (60.94%~72.92%). This is because the
improvements of STIO and SIO come only from the reduced critical path delay, whereas the
improvements of LIO and LISO come from avoiding the heavy traffic between the coprocessor
registers and the PE array as well as from the reduced critical path delay. However,
compared with the ‘Attached-IP’ type, STIO and SIO achieve a larger performance
improvement (56.60%~67.90%) than LISO and LIO (42.05%~59.85%). This is because STIO and
SIO do not use the data buffer of the RAA, which causes communication overhead on the
system bus.
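The way the two effects combine can be illustrated with a simple timing model. The sketch below is not the evaluation method used in this work: it assumes that execution time is the number of compute cycles multiplied by the critical path delay, plus a separate communication term. The cycle counts and communication times are hypothetical; only the two critical path delays are taken from the synthesis results.

```python
# Minimal sketch of an assumed execution-time model. Cycle counts and
# communication times are hypothetical; only the critical path delays
# (8.96 ns and 5.12 ns) come from the synthesis results in the text.

BASE_CRITICAL_PATH_NS = 8.96      # base 8x8 PE array
PROPOSED_CRITICAL_PATH_NS = 5.12  # proposed PE array

def execution_time_ns(compute_cycles, critical_path_ns, comm_time_ns):
    """Assumed model: computation at the critical-path-limited clock period
    plus a separate communication time."""
    return compute_cycles * critical_path_ns + comm_time_ns

def reduction_percent(base, proposed):
    return (base - proposed) / base * 100.0

# Reduction when only the clock period changes and communication is ignored.
clock_only = reduction_percent(BASE_CRITICAL_PATH_NS, PROPOSED_CRITICAL_PATH_NS)

# Hypothetical kernel whose communication time is unchanged (STIO/SIO-like case).
same_comm = reduction_percent(
    execution_time_ns(1000, BASE_CRITICAL_PATH_NS, 2000.0),
    execution_time_ns(1000, PROPOSED_CRITICAL_PATH_NS, 2000.0),
)

# Hypothetical kernel whose communication time also shrinks (LISO/LIO-like case).
less_comm = reduction_percent(
    execution_time_ns(1000, BASE_CRITICAL_PATH_NS, 8000.0),
    execution_time_ns(1000, PROPOSED_CRITICAL_PATH_NS, 2000.0),
)

print(f"clock period only:        {clock_only:.2f} %")
print(f"unchanged communication:  {same_comm:.2f} %")
print(f"reduced communication:    {less_comm:.2f} %")
```

Under this assumed model, a kernel that benefits only from the shorter critical path cannot exceed the clock-period reduction, whereas a kernel that also avoids communication traffic can, which is in line with the trend reported above.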
c. Power Evaluation
Fig. 72 compares the power consumption of the three CGRA organizations. First of all, the
proposed L1 and L2 PE array is more power-efficient than the base PE array because of the
reduced critical resources. With such a power-efficient PE array, the amount of power
saving depends on the computing flow selected for the application.

[Figure: bar charts of power (mW) for the Proposed, Coproc-Type and IP-Type CGRAs, with the
computing flow of each application labeled. (a) Real applications  (b) Benchmarks]
A%/B%: A% means power saving ratio compared with Coproc-Type, and B% means power saving ratio
compared with IP-Type.

Fig. 72. Power comparison.

The most power-efficient computing flow is STIO, which shows a relatively large power
saving (40.44%~55.55%) compared to the other cases (7.93%~29.67%) because STIO does not
use the RAA. In particular, ‘First_Diff’ shows the highest power saving ratio of
51.71%/55.55% because it does not use the shared critical resources. The next most
power-efficient flow is SIO, with a power saving of 23.67%~29.67%. This is because the SIO
computing flow does not use the data buffer of the RAA, whereas LISO (7.93%~26.03%) and
LIO (17.13%~22.91%) use the data buffer for input or output data. Finally, the power
saving of LISO and LIO is mostly achieved by the reduced critical resources and by not
activating the L1 PE array.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
CHAPTER IX 
INTEGRATED APPROACH TO OPTIMIZE CGRA 
 
In this chapter, we present an integrated approach that merges the multiple design schemes
presented in the previous chapters. A case study is shown to verify the synergy effect of
combining the design schemes. Experimental results show that the integrated approach
reduces the area of the entire RAA by 23.07% and power by up to 72% when compared with
the conventional RAA. In addition, we discuss potential combinations among the proposed
design schemes and their expected outcomes.
A. Combination among the Cost-Effective CGRA Design Schemes 
From Chapter IV to Chapter VIII, we have proposed cost-effective CGRA design schemes, and
these schemes can be combined with each other to optimize a CGRA in terms of area, power
and performance. Fig. 73 shows the combination flow of the proposed design schemes. Each
arrow of the flow shows a possible integration between two design schemes, and the
possible scheme combinations can be found by tracing the flow in the arrow directions.
The combination flow can be classified into two cases according to the computation model
of the CGRA. In the case of temporal mapping, the low power reconfiguration technique by
reusable context pipelining (Chapter IV) can be selected, whereas the cost-effective
array fabric (Chapter VII) is applicable to spatial mapping. This is because each of
these two design schemes has been devised to keep the characteristics of its own mapping
model: the cost-effective array fabric spreads the operations in the data flows spatially
over the array, whereas reusable context pipelining spreads the operations over time for
each column to implement temporal loop pipelining. Therefore, even though these two
design schemes cannot be merged with each other, any combination of the design scheme in
Chapter IV or VII with the remaining three schemes is possible.
 
[Figure: flow diagram with the following boxes.
Temporal Mapping: Chapter IV. Low Power Reconfiguration Technique
Spatial Mapping: Chapter VII. Cost-Effective Array Fabric
Chapter V. Dynamic context compression
Chapter VI. Dynamic context management
Chapter VIII. Hierarchical reconfigurable Computing arrays]

Fig. 73. Combination flow of the proposed design schemes.
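The combination rule stated above (exactly one mapping-specific scheme, plus any of the remaining three schemes) can be enumerated with a few lines of Python. This is only a sketch of the rule as stated in the text; the identifiers `MAPPING_SPECIFIC`, `SHARED` and `scheme_combinations` are hypothetical names introduced for the example, and the enumeration assumes that at least one of the remaining schemes is selected.

```python
from itertools import combinations

# One scheme is tied to each computation model (Chapters IV and VII);
# these two cannot be merged with each other.
MAPPING_SPECIFIC = {
    "temporal": "IV: low power reconfiguration technique",
    "spatial": "VII: cost-effective array fabric",
}

# The remaining three schemes can be combined with either of the above.
SHARED = (
    "V: dynamic context compression",
    "VI: dynamic context management",
    "VIII: hierarchical reconfigurable computing arrays",
)

def scheme_combinations():
    """Enumerate one mapping-specific scheme plus a non-empty subset of the rest."""
    combos = []
    for mapping, specific in MAPPING_SPECIFIC.items():
        for size in range(1, len(SHARED) + 1):
            for extra in combinations(SHARED, size):
                combos.append((mapping, (specific,) + extra))
    return combos

all_combos = scheme_combinations()
for mapping, schemes in all_combos:
    print(f"{mapping:8s} -> " + " + ".join(schemes))
```

Under these assumptions the enumeration contains the combination of Chapters IV, V and VIII used in the case study of Section B, as well as the two maximum combinations discussed in Section C.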
 
B.  Case Study for Integrated Approach 
1. A CGRA Design Example Merging Three Design Schemes
To demonstrate the effectiveness of the integrated approach, we have designed an RAA
combining three design schemes, as shown in Fig. 74, and implemented it at the RT level
in VHDL. The architectures have been synthesized using Design Compiler [49] with 0.18 µm
technology. PrimePower [49] has been used for gate-level simulation and power
estimation. To obtain the power consumption data, we have used the same applications as
in the previous chapters for simulation at an operating frequency of 100 MHz under
typical-case conditions of 1.8 V Vdd and 27℃.
 
[Figure: the three combined design schemes: Dynamic context compression, Hierarchical
reconfigurable Computing arrays, and Low Power Reconfiguration Technique.]
Fig. 74. A combination example combining three design schemes.
 
2.  Results 
a.  Area and Performance Evaluation  
Table XVII shows the area cost of each component for the base RAA specified in Chapter
III and for the integrated RAA combining the three design schemes. The area of the
configuration cache is reduced by 16.79%: even though dynamic context compression
increases area as shown in Chapter V, the low power reconfiguration technique offsets
the increase by reducing the size of the configuration cache. The areas of the PE array
and the frame buffer are also reduced by 17.27%/30% because hierarchical reconfigurable
computing arrays support critical resource sharing with a reduced frame buffer size.
As a result, the area of the entire RAA is reduced by 23.07% compared to the base RAA.
The synthesis results show that the integrated RAA has a shorter critical path delay
(5.12 ns) than the base RAA (8.96 ns). This is because dynamic context compression and
the low power reconfiguration technique do not affect the original critical path delay,
and the pipelined multipliers are excluded from the original set of critical paths by
hierarchical reconfigurable computing arrays. In addition, the execution time evaluation
of the applications gives the same results as in Chapter VIII: a performance enhancement
of 42.05%~67.90% compared with the IP-type base RAA.
 
Table XVII. Area Reduction Ratio by Integrated RAA
                        Gate Equivalent
Component               Base      Integrated   Reduction (%)
Configuration Cache     150012    124824       16.79
PE Array                659635    510029       22.68
Frame Buffer            129086    90329        30.00
Entire RAA              942742    760869       23.07
 
 
b. Power Evaluation  
To verify the synergy effect of the integrated approach, we have evaluated the power
consumption for five cases:
a. Base RAA
b. RAA with low power reconfiguration technique
c. RAA with dynamic context compression
d. RAA with hierarchical reconfigurable computing array
e. Integrated RAA
Table XVIII compares the power of the entire RAA among the five cases. Each individual
design scheme (b, c and d) yields only a moderate reduction in the power of the entire
RAA: 26.54%~47.6% in b, 13.77%~21.48% in c and 11.09%~30.19% in d. However, the
integrated RAA saves much more power (44.65%~71.29%) because each component of the RAA is
optimized by an individual design scheme.
 
Table XVIII. Entire Power Comparison
Columns: (a) Base, (b) Low Power Reconfig', (c) Dynamic context compression,
(d) Hierarchical Reconfig' Array, (e) Integrated. P: power in mW (f). R: reduction in % (g).

Kernels                        (a) P    (b) P    (b) R   (c) P    (c) R   (d) P    (d) R   (e) P    (e) R
First_Diff                     376.17   232.48   38.2    309.37   17.76   262.62   30.19   108.01   71.29
Tri-Diagonal                   400.19   257.59   35.63   331.01   17.29   355.79   11.09   200.65   49.86
State                          356.08   228.45   35.84   294.23   17.37   266.23   25.23   125.71   64.7
Hydro                          356.47   240.64   32.49   299.74   15.91   261.64   26.6    133.41   62.57
ICCG                           434.45   261.29   39.86   354.33   18.44   323.39   25.56   137.52   68.35
Inner Product                  328.54   240.57   26.78   283.3    13.77   281.27   14.39   181.83   44.65
24-Taps FIR                    471.44   274.99   41.67   383.44   18.67   408.98   13.25   200.5    57.47
Matrix-vector multiplication   405.7    212.58   47.6    318.56   21.48   356.56   12.11   150.25   62.97
Mult in FFT                    423.59   287.67   32.09   355.19   16.15   360.12   14.98   208.78   50.71
Complex Mult in AAC decoder    452      304.19   32.7    381.55   15.59   381.38   15.62   220.77   51.16
ITRANS in H.264 decoder        417.95   283.06   32.27   338.37   19.04   318.49   23.8    156.42   62.57
DCT in H.263 encoder           417.33   264.89   36.53   347.17   16.81   356      14.7    189.68   54.55
IDCT in H.263 encoder          412.91   263.45   36.2    343.42   16.83   352.55   14.62   188.71   54.3
SAD in H.263 encoder           415.27   305.05   26.54   343.04   17.39   362.12   12.8    222.63   46.39
Quant in H.263 encoder         401.35   255.77   36.27   333.63   16.87   341.22   14.98   181.14   54.87
Dequant in H.263 encoder       401.64   252.3    37.18   332.63   17.18   341.85   14.89   178.38   55.59

(a) Base RAA (configuration cache + frame buffer + PE array), (b) RAA with low power
reconfiguration technique, (c) RAA with dynamic context compression, (d) RAA with
hierarchical reconfigurable computing array, (e) RAA combining three schemes, (f) power
consumption of RAA, (g) power reduction ratio of entire RAA compared with Base.
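The reduction ratios in Table XVIII follow from the power values by R = (P_base - P_scheme) / P_base x 100. The Python sketch below recomputes the ratio for the integrated RAA from the base and integrated power columns; the variable and function names are assumptions made for this example. The recomputed values agree with the table to within rounding in the second decimal place.

```python
# Power of the entire RAA in mW, copied from Table XVIII: (base, integrated).
# Names used here are illustrative and not taken from the original work.
POWER_MW = {
    "First_Diff": (376.17, 108.01),
    "Tri-Diagonal": (400.19, 200.65),
    "State": (356.08, 125.71),
    "Hydro": (356.47, 133.41),
    "ICCG": (434.45, 137.52),
    "Inner Product": (328.54, 181.83),
    "24-Taps FIR": (471.44, 200.5),
    "Matrix-vector multiplication": (405.7, 150.25),
    "Mult in FFT": (423.59, 208.78),
    "Complex Mult in AAC decoder": (452.0, 220.77),
    "ITRANS in H.264 decoder": (417.95, 156.42),
    "DCT in H.263 encoder": (417.33, 189.68),
    "IDCT in H.263 encoder": (412.91, 188.71),
    "SAD in H.263 encoder": (415.27, 222.63),
    "Quant in H.263 encoder": (401.35, 181.14),
    "Dequant in H.263 encoder": (401.64, 178.38),
}

def power_reduction_percent(base_mw, scheme_mw):
    """Power reduction of a scheme relative to the base RAA, in percent."""
    return (base_mw - scheme_mw) / base_mw * 100.0

reduction = {
    kernel: power_reduction_percent(base, integrated)
    for kernel, (base, integrated) in POWER_MW.items()
}

lowest = min(reduction, key=reduction.get)
highest = max(reduction, key=reduction.get)
print(f"lowest saving:  {lowest} ({reduction[lowest]:.2f} %)")
print(f"highest saving: {highest} ({reduction[highest]:.2f} %)")
```

The lowest and highest values correspond to the range of power saving quoted for the integrated RAA.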
 
 
C. Potential Combinations and Expected Outcomes 
As mentioned in Section A, any combination of a design scheme limited by the computation
model with the remaining three schemes is possible, and we can consider two cases of
maximum combination: one is maximum power optimization for the configuration cache and
the other is area/power optimization of the PE array. Fig. 75 shows these two
combinations. In the case of Fig. 75 (a), all of the design schemes that reduce power in
the configuration cache are merged with hierarchical reconfigurable computing arrays.
Therefore, the power saving of the configuration cache can be optimized based on the
temporal mapping computation model. The second case is area/power optimization of the PE
array, as shown in Fig. 75 (b). Compared with (a), the cost-effective array fabric is
combined with the other design schemes instead of the low power reconfiguration
technique. In this case, the area/power of the PE array can be optimized by reducing the
number of PEs (cost-effective array fabric) and by sharing critical resources
(hierarchical reconfigurable computing arrays).
 
[Figure: two combinations of design schemes.
(a) Power optimization for the configuration cache: Dynamic context compression, Dynamic
context management, Hierarchical reconfigurable Computing arrays, Low Power
Reconfiguration Technique.
(b) Area/power optimization of the PE array: Dynamic context compression, Dynamic context
management, Hierarchical reconfigurable Computing arrays, Cost-Effective Array Fabric.]

Fig. 75. Potential combination of multiple design schemes.
CHAPTER X 
CONCLUSIONS 
 
In this chapter, we summarize the major results of this dissertation.  
In Chapter IV, we propose reusable context pipelining for low power reconfiguration and
a hybrid configuration cache structure supporting this technique. Our architecture can be
used to achieve power savings in a reconfigurable architecture while maintaining the same
performance as a general CGRA. In addition, the new configuration cache structure is more
efficient than the previous one in terms of memory size. In the experiments, we show that
the proposed approach saves power even with a reduced configuration cache size. The power
reduction ratios in the configuration cache and in the entire architecture are up to
86.33% and 47.60%, respectively, compared to the base architecture.
In Chapter V, we introduce a new context architecture (dynamically compressible context
architecture) together with its design flow and a configuration cache structure to
support it. The proposed dynamically compressible context architecture can save power in
the configuration cache without performance degradation. Experimental results show that
our approach saves much power compared to the conventional base model with negligible
area overhead. We have reduced the power in the configuration cache by up to 39.72%.
In Chapter VI, we propose a novel dynamic context management scheme for low power CGRA
and a new configuration cache structure supporting this technique. The proposed
management method can be used to achieve power savings in the configuration cache while
maintaining the same performance as a general CGRA. In the experiments, we show that our
approach saves much power compared to the conventional base model with negligible area
overhead. We have reduced the power by 38.24%/38.15% in the write/read operations of the
configuration cache.
In Chapter VII, we propose a novel reconfigurable array fabric optimized for
computation-intensive and data-parallel applications. We have shown that the new array
fabric is derived from a standard square array using the proposed exploration flow. The
exploration flow efficiently rearranges PEs while reducing the array size and changes the
interconnection scheme to save area and power. In addition, we suggest a new array fabric
that splits the computational resources into two groups (primitive resources and critical
resources). Critical resources can be area-critical and/or delay-critical. Primitive
resources are replicated for each processing element of the reconfigurable array, whereas
area-critical resources are shared among multiple basic PEs. Delay-critical resources can
be pipelined to shorten the overall critical path so as to increase the system clock
frequency. Experimental results show that the proposed approaches save significant area
and power compared to the conventional base model while enhancing performance.
Implementation of sixteen kernels on the new array structure demonstrates consistent
results. An area reduction of up to 36.75%, a performance enhancement of up to 42.86% and
power savings of up to 35.45% are observed when compared with the conventional array
architecture.
In Chapter VIII, we propose a hierarchical reconfigurable computing array architecture to
reduce power/area and enhance performance in configurable embedded systems. CGRA-based
embedded systems that consist of hierarchical configurable computing arrays with varying
size and communication speed were examined for multimedia and other applications.
Experimental results show that the proposed approach reduces on-chip area by 22%,
execution time by up to 72% and power consumption by up to 55% when compared with the
conventional CGRA-based architectures.
In Chapter IX, we present an integrated approach that merges the multiple design schemes.
A case study is shown to verify the synergy effect of combining the design schemes.
Experimental results show that the integrated approach reduces the area of the entire RAA
by 23.07% and power by up to 72% when compared with the conventional RAA.
 
 
 
 
 
 
 
 
 
 
 
REFERENCES 
 
[1] R. Hartenstein, “A decade of reconfigurable computing: a visionary retrospective,” 
in Proc. of Design Automation and Test in Europe Conf., pp. 642-649, March 2001.  
[2] F. Barat, M. Jayapala, T. Vander Aa, H. Corporaal, G. Deconinck, and R. Lauwereins,
“Low power coarse-grained reconfigurable instruction set processor,” in Proc. of Int.
Conf. on Field Programmable Logic and Applications, pp. 230-239, September 2003.
[3] H. Singh, M. Lee, G. Lu, F. Kurdahi, N. Bagherzadeh, and E. Filho, “MorphoSys: 
An integrated reconfigurable system for data-parallel and computation-intensive ap-
plications,” IEEE Trans. on Computers, vol. 49, no. 5, pp. 465-481, May 2000. 
[4] T. Miyamori and K. Olukotun, “A quantitative analysis of reconfigurable coprocessors
for multimedia applications,” in Proc. of IEEE Symp. on FPGAs for Custom Computing
Machines, pp. 15-17, April 1998.
[5] C. Ebeling, D. Cronquist, and P. Franklin, “Configurable computing: The catalyst for 
high-performance architectures,” in Proc. of IEEE Int. Conf. Appl.-Specific Syst., 
Arch., Process., pp. 364–372, July 1997.  
[6] H. Schmit, D. Whelihan, A. Tsai, M. Moe, B. Levine, and R. Taylor, "PipeRench: A
virtualized programmable datapath in 0.18 micron technology," in Proc. of IEEE
Custom Integrated Circuits Conf., pp. 63-66, May 2002.
[7] Y. Chou, P. Pillai, H. Schmit, and J. Shen, "PipeRench implementation of the
instruction path coprocessor," in Proc. of Annual IEEE/ACM Int. Symp. on
Microarchitecture, pp. 147-158, December 2000.
[8] F. Bouwens, M. Berekovic, A. Kanstein and G. Gaydadjiev, "Architectural exploration
of the ADRES coarse-grained reconfigurable array," in Proc. of Int. Workshop on Applied
Reconfigurable Computing, pp. 1-13, March 2007.
[9] F. Hannig, H. Dutta, and J. Teich, “Regular mapping for coarse-grained reconfigurable
architectures,” IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 57-60,
May 2004.
[10] J. Becker and M. Vorbach, “Architecture, memory and interface technology integration
of an industrial/academic configurable system-on-chip (CSoC),” in Proc. of IEEE Computer
Society Annual Symp. on VLSI, pp. 107-112, February 2003.
[11] Y. Kim, J. Lee, J. Jung, S. Kang, and K. Choi, "Design of coarse-grained
reconfigurable hardware," in Proc. of IEEK SoC Design Conf., pp. 312-317, April 2004.
[12] Y. Kim, C. Park, S. Kang, H. Song, J. Jung, and K. Choi, “Design and evaluation 
of coarse-grained reconfigurable architecture,” in Proc. of Int. SoC Design Conf., pp. 
227-230, October 2004. 
[13] N. Suzuki, S. Kurotaki, M. Suzuki, N. Kaneko, Y. Yamada, K. Deguchi, Y. Ha-
segawa, H. Amano, K. Anjo, M. Motomura, K. Wakabayashi, T. Toi, and T. Awa-
shima, “Implementing and evaluating stream applications on the dynamically recon-
figurable processor,” in Proc. of Field-Programmable Custom Computing Machines, 
pp. 328-329, April 2004. 
[14] S. Khawam, T. Arslan, and F. Westall, “Synthesizable reconfigurable array tar-
geting distributed arithmetic for system-on-chip applications,” in Proc. of IEEE Int. 
Parallel & Distributed Processing Symp., pp. 150-157, April 2004.  
[15] A. Deledda, C. Mucci, A. Vitkovski, M. Kuehnle, F. Ries, M. Huebner, J. Becker,
P. Bonnot, A. Grasset, P. Millet, M. Coppola, L. Pieralisi, R. Locatelli, and G. Maruccia,
“Design of a HW/SW communication infrastructure for a heterogeneous reconfigurable
processor,” in Proc. of Design, Automation, and Test in Europe Conf., pp. 1352-1357,
March 2008.
[16] M. Galanis and C. Goutis, “Speedups from extending embedded processors with
a high-performance coarse-grained reconfigurable data-path,” Journal of Systems
Architecture - Embedded Systems Design, vol. 50, no. 2, pp. 479-490, February 2008.
[17] G. Rauwerda, P. Heysters, and G. Smit, “Towards software defined radios using 
coarse-grained reconfigurable hardware,” IEEE Trans. on Very Large Scale Integra-
tion Systems, vol. 16, no. 1, pp. 3-13, January  2008. 
[18] M. Myjak, and J. Delgado-Frias, “A medium-grain reconfigurable architecture 
for DSP: VLSI design, benchmark mapping, and performance,” IEEE Trans. on Very 
Large Scale Integration Systems, vol. 16, no. 1, pp. 14-23, January  2008. 
[19] A. Poon, “An energy-efficient reconfigurable baseband processor for wireless 
communications,” IEEE Trans. on Very Large Scale Integration Systems, vol. 15, no. 
3, pp. 319-327, March  2007. 
[20] R. Hartenstein, M. Herz, T. Hoffmann, and U. Nageldinger, “KressArray 
Xplorer: a new CAD environment to optimize reconfigurable datapath array archi-
tectures,” in Proc. of Asia and South Pacific Design Automation Conf., pp. 163-168, 
January 2000. 
[21] B. Mei, S. Vernalde, D. Verkest, and R. Lauwereins, “Design methodology for a 
tightly coupled VLIW/reconfigurable matrix architecture: a case study,” in Proc. of 
Design Automation and Test in Europe Conf., pp. 1224-1229, March 2004. 
[22] N. Bansal, S. Gupta, N. Dutt, and A. Nicolau, “Analysis of the performance of 
coarse-grain reconfigurable architectures with different processing element configu-
rations,” presented at the Workshop on Application Specific Processors, San Diego, 
CA, December 2003. 
[23] N. Bansal, S. Gupta, N. Dutt, A. Nicolau, and R. Gupta, “Interconnect-aware 
mapping of applications to coarse-grain reconfigurable architectures,” in Proc. of  Int. 
Conf. on Field Programmable Logic and Applications, pp. 891-899, August 2004.  
[24] N. Bansal, S. Gupta, N. Dutt, A. Nicolau, and R. Gupta, “Network topology ex-
ploration of mesh-based coarse-grain reconfigurable architectures,” in Proc. of De-
sign Automation and Test in Europe Conf., pp. 474-479, February 2004.  
[25] J. Lee, K. Choi, and N. Dutt, "Evaluating memory architectures for media
applications on coarse-grained reconfigurable architectures," in Proc. of IEEE Int. Conf.
on Application-Specific Systems, Architectures, and Processors, pp. 166-176, June 2003.
[26] J. Lee, K. Choi, and N. Dutt, "Design space exploration of reconfigurable ALU 
array (RAA) architectures," in Proc. of IEEE SOC Design Conf.  pp. 302-307, No-
vember 2003. 
[27] J. Lee, K. Choi,  and N. Dutt, "Evaluating memory architectures for media appli-
cations on coarse-grained reconfigurable architectures," Int. Journal of Embedded 
Systems, vol. 3 no. 3, pp.119-127, October 2008.  
[28] Y. Kim, M. Kiemb, and K. Choi, "Efficient design space exploration for domain-
specific optimization of coarse-grained reconfigurable architecture," in Proc. of 
IEEK SoC Design Conf., pp. 19-24, May 2005. 
[29] A. Lambrechts, P. Raghavan, and M. Jayapala, “Energy-aware interconnect-
exploration of coarse-grained reconfigurable processors,” presented at the Workshop 
on Application Specific Processors, New York, September 2005. 
[30] H. Zhang, M. Wan, V. George, and J. Rabaey, “Interconnect architecture explo-
ration for low-energy reconfigurable single-chip DSPs,” in Proc. of VLSI’ 99, April 
1999. 
[31] F. Hannig, H. Dutta, and J. Teich, “Mapping of regular nested loop programs to
coarse-grained reconfigurable arrays – Constraints and methodology,” in Proc. of
IEEE Int. Parallel & Distributed Processing Symp., pp. 148-155, April 2004.
[32] J. Lee, K. Choi, and N. Dutt, “Mapping loops on coarse-grained reconfigurable 
architectures using memory operation sharing,” Center for Embedded Computer Sys-
tems (CECS), University of California, Irvine, Tech. Rep. 02-34, 2002. 
[33] J. Lee, K. Choi, N. Dutt, "Compilation approach for coarse-grained reconfigur-
able architectures," IEEE Design & Test of Computers, vol. 20 no. 1, pp.26-33, Janu-
ary 2003. 
[34] J. Lee, K. Choi, and N. Dutt, "An algorithm for mapping loops onto coarse-grained
reconfigurable architectures," in Proc. of ACM Workshop on Languages, Compilers, Tools
for Embedded Systems, pp. 183-188, June 2003.
[35] J. Lee, K. Choi, and N. Dutt, "An algorithm for mapping loops onto coarse-grained
reconfigurable architectures," ACM Sigplan Notices, vol. 38, no. 7, pp. 183-188,
July 2003.
[36] M. Ahn, J. Yoon, Y. Paek, Y. Kim, M. Kiemb, and K. Choi, “A spatial mapping 
algorithm for heterogeneous coarse-grained reconfigurable architectures,”  in Proc. 
of Design Automation and Test in Europe Conf., pp. 262-268, March 2006. 
[37] J. Yoon, Y. Kim, M. Ahn, Y. Paek, and K. Choi, "Temporal mapping for loop 
pipelining on a MIMD style coarse-grained reconfigurable architecture," presented at 
the IEEE Int. SoC Design Conf., Seoul, Korea, October 2006. 
[38] G. Lee, S. Lee, and K. Choi, "Automatic mapping of application to coarse-
grained reconfigurable architecture based on high-level synthesis techniques," in 
Proc. of IEEE Int. SoC Design Conf., pp.395-398, September 2008. 
[39] J. Yoon, A. Shrivastava, S. Park, M. Ahn, R. Jeyapaul, and Y. Paek, “SPKM : A 
novel graph drawing based algorithm for application mapping onto coarse-grained 
reconfigurable architectures,” in Proc. of Asia and South Pacific Design Automation 
Conf., pp. 776-782, March 2008. 
[40] H. Park, K. Fan, M. Kudlur, and S. Mahlke, “Modulo graph embedding: Map-
ping applications onto coarse-grained reconfigurable architectures,” in Proc. of Int. 
Conf. on Compilers, Architecture, and Synthesis for Embedded Systems, pp. 136-146, 
October 2006.  
[41] H. Park, K. Fan, S. Mahlke, T. Oh, H. Kim, and H. Kim, “Edge-centric modulo 
scheduling for coarse-grained reconfigurable architectures,” in Proc. of  17th Intl. 
Conf. on Parallel Architectures and Compilation Techniques, pp. 166-176, October 
2008. 
[42] G. Dimitroulakos, N. Kostaras, M. Galanis, and C. Goutis, “Compiler assisted 
architectural exploration for coarse grained reconfigurable arrays,” in Proc. of  Great 
Lakes Symp. on VLSI, pp.164-167, March 2007. 
[43] F. Vererdas, M. Scheppler, W. Moffat, and B. Mei, "Custom implementation of 
the coarse-grained reconfigurable ADRES architecture for multimedia purposes,” in 
Proc. of Int. Conf. on Field Programmable Logic and Applications, pp. 106-111, 
August 2005. 
[44] Y. Kim, M. Kiemb, C. Park, J. Jung, and K. Choi, “Resource sharing and pipelin-
ing in coarse-grained reconfigurable architecture for domain-specific optimization,” 
in Proc. of Design Automation and Test in Europe Conf., pp. 12-17, March 2005. 
[45] C. Park, Y. Kim, and K. Choi, "Domain-specific optimization of reconfigurable 
array architecture," presented at the US-Korea Conference on Science, Technology, 
& Entrepreneurship, Irvine, CA, August 2005. 
[46] M. Lanuzza, M. Margala, and P. Corsonello, “Cost-effective low-power proces-
sor-in-memory-based reconfigurable datapath for multimedia applications,” in Proc. 
of Int. Symp. on Low Power Electronics and Design, pp. 161-166, August 2005. 
[47] F. Barat and R.Lauwereins, “Reconfigurable instruction set processors: A sur-
vey,” in Proc. of Int. Workshop on Rapid System Prototyping, pp. 168-173, April 
2000. 
[48] ARM Corp., Cambridge, U.K., “ARM Corp. home page,” 2002. [Online]. Avail-
able: http://www.arm.com/arm/AMBA 
[49] Synopsys Corp., Mountain View, CA, “Synopsys Corp. home page,” 2005. 
[Online]. Available: http://www.synopsys.com 
[50] Model Technology Corp., Wilsonville, OR, “Model Technology Corp. home 
page,” 2005. [Online]. Available: http://www.model.com 
[51] Y. Kim and R. Mahapatra, “Reusable context pipelining for low power  coarse-
grained reconfigurable architecture,” in Proc. of Int. Parallel & Distributed Process-
ing Symp., pp. 1-8, April, 2008. 
[52] Y. Kim, I. Park, K. Choi, and Y. Paek, "Power-conscious configuration cache 
structure and code mapping for coarse-grained reconfigurable architecture," in Proc. 
of Int. Symp. on Low Power Electronics and Design, pp. 310-315, October 2006. 
[53] I. Park, Y. Kim, C. Park, J. Son, M. Jo, and K. Choi, "Chip implementation of a 
coarse-grained reconfigurable architecture," in Proc. of  IEEE Int. SoC Design Conf.,  
pp. 628-629, October 2006. 
[54] I. Park, Y. Kim, M. Jo, and K. Choi, "Chip implementation of power conscious 
configuration cache for coarse-grained reconfigurable architecture," in Proc. of the 
15th Korean Conf. on Semiconductors, pp.527-528, February 2008. 
[55] J. Cocke, “Global common sub expression elimination,” in Proc. of Symposium
on Compiler Construction, ACM SIGPLAN Notices 5, pp. 850-856, July 1970.
[56] S. Keutzer, S. Tjiang, and S. Devadas, “A new viewpoint on code generation for
directed acyclic graphs,” ACM Transactions on Design Automation of Electronic
Systems, vol. 3, no. 1, pp. 51-75, January 1998.
[57] R. Rau, “Iterative modulo scheduling,” Technical Report, Hewlett-Packard Lab: 
HPL-94-115, 1995. 
[58] Netlib Repository at the Oak Ridge National Laboratory, Oak Ridge, TN. 
[Online]. Available: http://www.netlib.org/benchmark/livermorec 
[59] Institute for Integrated Signal Processing Systems, Aachen, Germany. [Online]. 
Available: http://www.ert.rwth-aachen.de/Projekte/Tools/DSPSTONE 
[60] Y. Kim and R. Mahapatra, “Dynamically compressible context architecture for 
low power coarse-grained reconfigurable array,” in Proc. of Int. Conf. on Computer 
Design, pp. 295-400, October 2007. 
[61] Y. Kim and R. Mahapatra, “Dynamic context management for low power coarse-grained
reconfigurable architecture,” presented at the ACM Great Lakes Symp. on VLSI, Boston,
MA, May 2009.
[62] Y. Kim and R. Mahapatra, “A new array fabric for coarse-grained reconfigurable 
architecture,” in Proc. of EuroMicro Conf. on Digital System Design, pp. 584-591, 
September 2008. 
[63] Y. Kim, and R. Mahapatra, “Hierarchical reconfigurable computing arrays for 
efficient CGRA-based embedded systems,” presented at the Design Automation 
Conf., San Francisco, CA, July 2009.  
[64] M. Jo, V. Arava, H. Yang, and K. Choi, "Implementation of floating-point opera-
tions for 3D graphics on a coarse-grained reconfigurable architecture, " in Proc. of  
IEEE Int. SoC Conf., pp.127-130, September 2007. 
[65] M. Galanis, G. Dimitroulakos, S. Tragoudas, and C. Goutis, “Speedups in embedded
systems with a high-performance coprocessor datapath,” ACM Transactions on Design
Automation of Electronic Systems, vol. 12, no. 35, pp. 1-22, August 2007.
[66] T. Callahan, J. Hauser, and J. Wawrzynek, “The Garp architecture and C compiler,”
IEEE Computer, vol. 33, no. 4, pp. 62-69, April 2000.
[67] C. Arbelo, A. Kanstein, S. López, J.F. López, M. Berekovic, R. Sarmiento, and
J.-Y. Mignolet, “Mapping control-intensive video kernels onto a coarse-grain
reconfigurable architecture: the H.264/AVC deblocking filter,” in Design Automation
and Test in Europe Conf., pp. 642-649, March 2007.
[68] M. Galanis, G. Dimitroulakos, and C. Goutis, “Speedups and energy savings of 
microprocessor platforms with a coarse-grained reconfigurable data-path,” in Proc. 
of Int. Parallel & Distributed Processing Symp., pp. 1-8, March, 2007. 
VITA 
 
Yoonjin Kim received the B.S. degree in information and communication engi-
neering from SungKyunKwan University, Suwon, Korea, in 2003, and the M.S. degree 
in electrical engineering and computer science from Seoul National University, Seoul, 
Korea, in 2005. He received the Ph.D. degree in computer engineering from Texas A&M
University in May 2009. His research interests are system-on-chip design, embedded
systems, and reconfigurable computing. He may be contacted at:
Yoonjin Kim 
Department of Computer Science and Engineering 
Texas A&M University  
TAMU 3112 
College Station, TX 77843-3112 
U.S.A. 
   
