Introduction
Recent IoT (Internet of Things) and wearable computing devices require accelerators that achieve a certain level of performance within an extremely small power budget. Coarse Grained Reconfigurable Arrays (CGRAs), which use a large array of processing elements (PEs), are advantageous for low-power yet high-performance processing [1], [2]. Thus, they have attracted a lot of attention as extremely low-power accelerators. However, the large PE array in a CGRA often consumes a large amount of leakage power, which diminishes its benefits.
Silicon-on-Thin BOX (SOTB) CMOS technology, developed by LEAP [3], allows transistors to operate at a much lower supply voltage than conventional bulk CMOS transistors by reducing the variation of the threshold level. By using body-biasing, the leakage current and operational delay can be controlled over a wide range.
A previously proposed CGRA, called Cool Mega Array (CMA) [4], was implemented using SOTB technology; the resulting chip is called CC-SOTB (CMA Cube-SOTB). In CC-SOTB, the large PE array consists of combinational logic, and the data flow of the target application is mapped onto it directly. The microcontroller manages the data reading and writing between the input/output of the PE array and the data memory modules. CC-SOTB has independent body-bias supplies for the PE array and the microcontroller in order to balance the performance and leakage power according to the arithmetic intensity of the target application. For a computation-intensive application, zero-bias or forward-bias is given to the PE array to enhance the performance, while reverse-bias is given to the microcontroller and data memory. If the data transfer between the memory and the PE array is the bottleneck of the target application, zero- or forward-bias is given to the microcontroller and memory, while the PE array receives reverse-bias to suppress the leakage power without degrading the performance [5].
Although a method to find the optimal body-bias voltage has been investigated [6], the same bias voltage is given to all PEs in the PE array in order to keep the voltage management simple. If the body-bias can be controlled at a finer grain (for example, per PE or per group of PEs), more leakage power can be saved without degrading the performance.
This paper investigates the impact of body-bias domain size on the leakage power and area overhead. The contributions of this work are as follows:
• Body-bias voltage selection method for each application program and body-bias domain size using Genetic Algorithm is proposed.
• The overhead of various sizes of domain division is evaluated based on real chip layout.
• By applying the proposed algorithm to the domain including a single PE (1x1 domain), the leakage power can be reduced by 50% on average with 12.6% area overhead. The domain with 2x1 PEs (2x1 domain) can reduce 40% of leakage power on average with 6% area overhead.
• A combination of three body-bias voltages (zero-bias, weak reverse-bias, and strong reverse-bias) achieves the best leakage reduction in most cases; using a larger number of voltages is not needed.
The rest of the paper is organized as follows: SOTB and CC-SOTB are introduced in Sect. 2 in addition to some of the related work. Section 3 proposes a finer domain division and a body bias voltage selection method using Genetic Algorithm. In Sect. 4, the overhead for the domain division is analyzed. The leakage power reduction and the influence of the number of bias voltages are evaluated in Sect. 5. The key points are summarized, and future work is mentioned in Sect. 6.
Copyright © 2017 The Institute of Electronics, Information and Communication Engineers
CMA-SOTB

The CMA Architecture
A key concept of the CMA architecture is reducing any energy usage other than that required for computation. The PE array is built with combinatorial circuits to eliminate the power needed to store the intermediate results in registers and to distribute the clock to each PE. The data-flow graph of the target application is directly mapped onto the PE array. Registers are only provided at the inputs/outputs of the PE array. Computation starts when all data are set up in the input "Launch register," and the outputs of the PE array are stored into the "Gather register" with a certain delay. The energy overhead caused by glitches in the large combinatorial circuits can be reduced by carefully setting the configuration data of the switching elements so as not to propagate glitches [4] .
The microcontroller flexibly manages the data transfer between the data memory (DMEM) and registers by using mapping registers and vector operations. The aforementioned structure enables the implementation of various application programs without the need for power hungry dynamic reconfiguration in the PE array.
Another key concept of the CMA architecture is optimizing the energy of each target application by balancing the performance of the PE array and the microcontroller. For applications with a high degree of arithmetic intensity, the power budget is spent on enhancing the performance of the PE array, while the power of the microcontroller is lowered. However, when the application requires a lot of data sets for a computation, the power budget is spent on the microcontroller, which manages the data transfer between the data memory and the Launch/Gather registers. The first prototype, CMA-1 [4], [7], independently changes the supply voltages of the PE array and the microcontroller.
The problem with CMA-1 is the large leakage power consumed in the PE array. Since leakage is critical in IoT and wearable applications, we adopted SOTB CMOS technology to suppress it.
SOTB CMOSFET
SOTB is classified as an FD-SOI technology where the transistors are formed on thin BOX (Buried Oxide) layer, as shown in Fig. 1 .
Unlike conventional bulk CMOS, transistors in SOI are formed on top of an insulator (typically SiO2). Surrounding the transistor with insulating material means that electrical interference does not need to be considered, and the electrical characteristics therefore become sharp [8]. By using both an extremely thin FD-SOI layer and a thin BOX layer, the Short Channel Effect (SCE) can be suppressed. Furthermore, since no impurity dopant is required in the channel, the variation of the threshold level is reduced. The delay and leakage power consumption can be optimized by controlling the bias voltage applied to the body (back gate). Here, we refer to the body-bias voltages of the NMOS and PMOS transistors as VBN and VBP, respectively. VBN for NMOS transistors is applied to the p-well. If VBN = 0, the transistor works with its normal threshold level. If reverse-bias (VBN < 0) is given, the threshold is raised; thus, the leakage current is reduced while the delay is stretched. On the contrary, forward-bias (VBN > 0) lowers the threshold, which enhances the operational speed at the cost of an increased leakage current. In the case of PMOS transistors, VBP is applied to the n-well; thus, zero bias means VBP = VDD. VBP > VDD corresponds to reverse-bias, while VBP < VDD corresponds to forward-bias.
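The polarity conventions for the two wells are easy to mix up, so a minimal Python sketch of the classification may help. The function name `classify_bias`, its parameters, and the 0.4 V supply used in the usage note are hypothetical and chosen only for illustration.

```python
def classify_bias(v_body: float, vdd: float, is_pmos: bool, tol: float = 1e-9) -> str:
    """Classify a body-bias voltage as 'zero', 'reverse', or 'forward'."""
    # Zero bias is 0 V for NMOS (p-well) and VDD for PMOS (n-well).
    reference = vdd if is_pmos else 0.0
    offset = v_body - reference
    if abs(offset) <= tol:
        return "zero"
    # Reverse bias raises the threshold: below the reference for NMOS,
    # above the reference for PMOS.
    raises_threshold = offset > 0 if is_pmos else offset < 0
    return "reverse" if raises_threshold else "forward"
```

For example, with an assumed VDD of 0.4 V, `classify_bias(-0.2, 0.4, is_pmos=False)` and `classify_bias(0.6, 0.4, is_pmos=True)` both describe a reverse-biased transistor, while `classify_bias(0.2, 0.4, is_pmos=True)` describes a forward-biased PMOS transistor.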
The characteristics of SOTB are summarized as follows: (1) The junction capacitance of SOI is about 1/10 of that of bulk, which makes high-speed operation possible, especially at lower voltages. (2) Latch-up is a problem of bulk CMOS caused by a parasitic thyristor formed by adjacent transistors. This problem is suppressed in SOI. (3) Radiation tolerance is high. Charges incidentally generated by radiation in a part of the substrate are blocked by the insulation layer and do not affect the circuit operations. (4) Noise propagation (cross-talk) is small because of the insulation.
CC-SOTB
Figure 2 shows the block diagram of CC-SOTB, a prototype CMA architecture using SOTB technology [5]. A PE consists of a simple 24-bit ALU that executes multiply, add, subtract, shift, and logic operations, and a switching element (SE). CC-SOTB has a 12 × 8 PE array connected with a network using a two-channel island-style interconnection and direct links that connect to the north-east and east of the PE. The SEs transfer the input data from the PEs to the south, west, and east of the PE, as well as the output data of the ALU, to the PE in the appropriate direction according to the configuration data.
The microcontroller is a tiny microprocessor that executes 14-bit micro-code stored in a 128-entry micromemory. It has 16 general-purpose registers and eight special-purpose registers storing pointers to the DMEM, bitmap vectors, and stride lengths for a stride data transfer. It reads eight data words from the DMEM and sets the Launch register with a single instruction. A dedicated memory controller triggered by the instruction executes the data transfer in eight clock cycles. Also, the data in the Gather register can be written back to the DMEM with a single instruction handled by another controller. Because the DMEM is a single-read/single-write dual-port memory, reading and writing can be done in parallel. Two banks of 256-entry 24-bit dual-port memory are provided to overlap the streaming data input/output with computation. Since an SRAM macro of the appropriate size is not available in SOTB, the memory is implemented with a set of registers and its size is limited.
In CC-SOTB, instead of independent power supplies, independent body-biases are given to the PE array and the microcontroller/data memory. Here, the bias voltages for the PE array are referred to as VBNM and VBPM, and those for the microcontroller as VBN and VBP. By controlling the body-bias separately, we can optimize the energy consumption while ensuring the required performance. For a target application with strong arithmetic intensity, the PE array is given a forward-bias (VBNM > 0, VBPM < VDD) while the microcontroller/data memory is given a reverse-bias (VBN < 0, VBP > VDD). In contrast, if the data transfer is the bottleneck, the forward-bias is given to the microcontroller/data memory, and the reverse-bias is given to the PE array.
The process technology and CAD tools used in the CC-SOTB design are shown in Table 1. Figure 3 shows the chip layout of CC-SOTB. The black square is a part of an inter-chip communication mechanism that is not related to this paper. As shown in the chip layout, CC-SOTB consists of two macros: the microcontroller and the PE array. Although the area of the PE array is larger than that of the microcontroller, the difference is not large. This is because the microcontroller macro includes the DMEMs implemented with registers, the configuration registers, the constant registers, and the Launch/Gather registers. Here, the PE array and microcontroller share the same supply voltage, but they have their own body-bias voltages to control the performance and leakage current independently. In the CMA architecture, the PE array consists of combinatorial circuits. Giving a reverse bias stretches the total delay time in the PE array. In this case, the time at which the results are stored in the Gather register must be delayed, which degrades the computation performance. On the contrary, if the delay time in the PE array is reduced by a forward bias, the results can be stored in the Gather register earlier, so that the total computation time is reduced.
Related Work
Kuehn et al. applied body-bias control [9] to a coarse grained dynamically reconfigurable processor called MuCCRA-4. Since it changes its configuration almost every clock cycle, the main subject of the research is how to change the body-bias dynamically. FlexPower FPGA [10] uses body-bias control for each configurable logic block. Each logic block has its own body-bias domain and the voltage of the body-bias is selected by the configuration data. Although precise delay and leakage current control can be done, the overhead of the small body-bias domain corresponding to the logic cell becomes large.
Finding the minimum energy point by controlling the power supply voltage and the body-bias voltage has been widely researched [11] - [13]. Kao et al. [14] investigated optimization techniques from a practical viewpoint, but their study targeted only the functional units and used conventional bulk technology. Our previous work [5] demonstrated the effect of body-bias control by using real chip measurements. However, in that study, the same body-bias voltage is given to all PEs in the PE array. Although it can control the balance between the PE array and the microcontroller, the leakage of the PE array can be further reduced by giving a body-bias voltage to each PE or to a small group of PEs. Also, our paper [15] at an earlier stage of this research only proposed the concept of fine grained body-bias control and did not include the practical aspects of applying this method.
Body Bias Domain Division in the PE Array
The concept of domain division in the PE array is illustrated in Fig. 4, where each body-bias domain is enclosed in a red frame. White, gray, and black rectangles represent zero-bias, weak reverse-bias, and strong reverse-bias domains, respectively.
The current CMA uses a single body-bias for the whole PE array; thus, the body bias voltage required by the critical path must be given to all PEs (Fig. 4 (a)), even if one or more PEs do not work at all. This wastes leakage power. When we divide the PE array into groups, each of which has 2x1 PEs, we can apply a strong reverse-bias to the unused PEs, as depicted in Fig. 4 (b). Furthermore, as represented in Fig. 4 (c), we can save more leakage power by giving a weak reverse-bias to PEs that are not on the critical path. Obviously, the smaller the PE group is, the more leakage can be saved without degrading performance, provided that the appropriate body-bias is assigned. However, the domain partitioning increases the chip area for two reasons. First, in order to apply different body-bias voltages, the substrate must be separated by a certain distance. Second, two body-bias lines, one for PMOS transistors (VBP) and one for NMOS transistors (VBN), must be delivered to each body-bias domain. Such body-bias distribution requires additional area.
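The effect of the domain size can be illustrated with a minimal Python sketch. It assumes that every PE has a preferred VBN and that a shared domain must adopt the highest (least reverse) VBN requested by any of its PEs so that no path is slowed down. The function names and the small 2 × 2 example array are hypothetical.

```python
from typing import Dict, List, Tuple

def domain_of(row: int, col: int, dom_rows: int, dom_cols: int) -> Tuple[int, int]:
    """Return the (domain_row, domain_col) index that contains PE (row, col)."""
    return row // dom_rows, col // dom_cols

def domain_bias(pe_vbn: List[List[float]], dom_rows: int, dom_cols: int) -> Dict[Tuple[int, int], float]:
    """Give each domain the highest (least reverse) VBN requested by its PEs."""
    result: Dict[Tuple[int, int], float] = {}
    for r, row in enumerate(pe_vbn):
        for c, vbn in enumerate(row):
            key = domain_of(r, c, dom_rows, dom_cols)
            result[key] = max(result.get(key, vbn), vbn)
    return result

# Hypothetical 2 x 2 array: one critical PE (0.0 V), one light PE (-0.2 V),
# and two unused PEs (-1.0 V).
wanted = [[0.0, -1.0],
          [-0.2, -1.0]]
whole_array = domain_bias(wanted, 2, 2)   # a single domain for all PEs
per_column = domain_bias(wanted, 2, 1)    # domains of 2 (vertical) x 1 (horizontal)
per_pe = domain_bias(wanted, 1, 1)        # one domain per PE
```

In this toy case the single domain forces zero-bias on all four PEs, the 2x1 division already lets the unused column take the strong reverse-bias, and only the 1x1 division lets the light PE use its weak reverse-bias.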
Here, we first propose the body-bias assignment algorithm for each PE group. Then, the benefit of body-bias partitioning is evaluated based on simple application programs. Considering the area overhead based on the layout, we investigate the optimal size of the PE group.
Body Bias Assignment Algorithm
Delay and Leakage Power Table
The body-bias voltage must be selected considering which type of calculation is done in each PE. In the current CMA program design flow, a data-flow graph is extracted from the application program written in a C-like language. It is then mapped onto the PE array with the Blackdiamond [16] tool, which uses simulated annealing for the mapping. Here, our body-bias assignment algorithm searches for an optimized setting without changing the initial mapping obtained with Blackdiamond. For a large PE group size, a mapping that takes the PE groups into account would achieve more efficient results. However, this is left for our future work. Unlike in algorithms for dual-Vth or dual-Vdd FPGA design [17], [18], the bias voltage can be changed widely and finely in SOTB.
We previously [5] proposed a method to control the body-bias in which accurate parameters of the formulas used are obtained from real chip measurements. However, in order to give an appropriate body-bias voltage considering the operation executed in each PE, we need a more precise delay estimation of each operation for each body-bias voltage. Therefore, we first made a table of delay and leakage power for each operation in a PE. Here, balanced body-biasing, which gives the same amount of bias to PMOS and NMOS transistors, is used. That is, VBN + VBP = VDD. This means that the bias voltage can be represented by VBN alone. As shown in Table 2, we evaluated the maximum delay and leakage power when varying VBN from −1.0V to 0.4V with a 0.2V interval for each instruction (ADD, SUB, MULT, PASS, NOUSE, AND, OR, SL, SR). The table's results are obtained from simulation of the PE layout design using Synopsys's HSIM, a light-weight SPICE simulator. Note that the leakage power does not depend on the executed operation. The operation only influences the dynamic power and the delay time.
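As a minimal sketch of the sweep described above, the following Python function enumerates the VBN points and derives the matching VBP from the balanced condition VBN + VBP = VDD. The function name and the 0.6 V supply in the example are assumptions for illustration; the delay and leakage values of Table 2 are not reproduced.

```python
def balanced_bias_sweep(vdd, start_mv=-1000, stop_mv=400, step_mv=200):
    """Return (VBN, VBP) pairs in volts for balanced body-biasing."""
    points = []
    # Iterate in millivolts so that the grid points are generated without float drift.
    for vbn_mv in range(start_mv, stop_mv + 1, step_mv):
        vbn = vbn_mv / 1000.0
        vbp = vdd - vbn  # balanced condition: VBN + VBP = VDD
        points.append((vbn, vbp))
    return points

sweep = balanced_bias_sweep(vdd=0.6)  # 0.6 V is a hypothetical supply voltage
```

The sweep contains eight points, from strong reverse-bias (VBN = −1.0V) through zero-bias (VBN = 0V) to forward-bias (VBN = 0.4V).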
Genetic Algorithm for Assignment
Since there is an enormous number of combinations for the body-bias assignment, we used Genetic Algorithm (GA) to find the sub-optimal combinations.
An element of the algorithm is a domain, that is, a PE or a PE group that shares a body bias, together with its assigned body-bias voltage. Figure 5 shows an example of the individual format used for the 2x1 division size. An individual is a vector formed by concatenating all elements in order. The i-th numerical value in this vector is the body bias voltage assigned to p_{i/2}. Each element calculates the delay of the paths that go through the PE group by referring to the delay table and the configuration data for each application data flow. Here, the GA is designed as follows:
• The fitness function of individual i is determined with the following expression. Here, a low fitness is more preferable.
The expression is defined in terms of the delay of each path p that goes through PE group i. L_i is the total leakage power of the PE groups represented by individual i. D_critical is the delay of the critical path, and P is the set of PE groups that make up individual i. By setting α large enough, we can avoid assignments that stretch the path delay.
• Tournament selection with size 3 is used. That is, three individuals are selected randomly from 1400 individuals, and the one with the best fitness is selected and left for cross-over.
• Two-point cross-over is applied with a probability of 0.2.
• The mutation probability is set to 0.2, and each element is changed randomly with a probability of 0.3.
• 1400 generations are computed.
The implementation uses Python and a GA library called DEAP [19]. The outline of the GA is summarized in Fig. 6.
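The GA settings listed above can be sketched in plain Python as follows. This is not the DEAP-based implementation: the fitness is an assumed penalty form (total leakage plus α times the delay excess over the critical path), the delay and leakage tables are made-up three-level values, the population size and generation count are reduced, and keeping the best individual in every generation (elitism) is an addition made only so that the sketch always returns a timing-feasible result.

```python
import random

# Hypothetical per-domain tables indexed by bias level
# (0 = zero bias, 1 = weak reverse, 2 = strong reverse).
DELAY = [1.0, 1.3, 2.0]
LEAK = [1.0, 0.5, 0.1]

def fitness(ind, paths, d_critical, alpha=1000.0):
    """Assumed penalty form: total leakage plus alpha times the delay excess."""
    leakage = sum(LEAK[g] for g in ind)
    excess = sum(max(0.0, sum(DELAY[ind[d]] for d in p) - d_critical) for p in paths)
    return leakage + alpha * excess

def tournament(pop, scores, rng, size=3):
    """Pick `size` random individuals and return a copy of the lowest-fitness one."""
    picks = rng.sample(range(len(pop)), size)
    return list(pop[min(picks, key=lambda i: scores[i])])

def two_point_crossover(a, b, rng):
    """Swap the segment between two random cut points, in place."""
    i, j = sorted(rng.sample(range(len(a) + 1), 2))
    a[i:j], b[i:j] = b[i:j], a[i:j]

def mutate(ind, rng, n_levels, gene_prob=0.3):
    """Re-draw each element with probability gene_prob, in place."""
    for k in range(len(ind)):
        if rng.random() < gene_prob:
            ind[k] = rng.randrange(n_levels)

def run_ga(n_domains, paths, d_critical, pop_size=60, generations=80,
           cx_prob=0.2, mut_prob=0.2, seed=0):
    rng = random.Random(seed)
    n_levels = len(DELAY)
    pop = [[rng.randrange(n_levels) for _ in range(n_domains)] for _ in range(pop_size)]
    pop[0] = [0] * n_domains  # all-zero-bias individual, always timing-feasible
    best = min(pop, key=lambda ind: fitness(ind, paths, d_critical))
    for _ in range(generations):
        scores = [fitness(ind, paths, d_critical) for ind in pop]
        children = [tournament(pop, scores, rng) for _ in range(pop_size)]
        for a, b in zip(children[::2], children[1::2]):
            if rng.random() < cx_prob:
                two_point_crossover(a, b, rng)
        for child in children:
            if rng.random() < mut_prob:
                mutate(child, rng, n_levels)
        children[0] = list(best)  # elitism (an addition in this sketch)
        pop = children
        best = min(pop, key=lambda ind: fitness(ind, paths, d_critical))
    return best, fitness(best, paths, d_critical)

# Toy case: domains 0-2 form the critical path, 3-4 a shorter path, 5 is unused.
best, score = run_ga(6, paths=[[0, 1, 2], [3, 4]], d_critical=3.0)
```

In the toy case, any reverse-bias on domains 0 to 2 is penalized heavily, so the returned individual keeps them at zero-bias and only the shorter path and the unused domain are candidates for reverse-bias.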
While this algorithm has the advantage of finding a sub-optimal solution in a short time, it should be noted that it does not guarantee finding the exact optimal solution. To examine the quality of the GA solution, a brute-force search (BFS) was applied to two simple application programs, "alpha-blender" and "sepia-filter". For them, the results from GA and BFS were exactly the same. We can thus expect that the quality of the GA results is not far from that of BFS in other application programs. The computation time of GA is about five minutes, while BFS requires S^P iterations, where S is the number of VBN sources and P is the number of PEs in the PE array. Therefore, BFS requires 8^96 iterations for a 12x8 PE array and eight VBN sources, which makes it unsuitable for complex applications.
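To give a feeling for the size of this search space, the count can be evaluated directly with Python's arbitrary-precision integers. The helper name is hypothetical, and applying the same formula to the number of domains instead of the number of PEs is an assumption made here for coarser divisions.

```python
def brute_force_iterations(n_sources: int, n_domains: int) -> int:
    """Number of assignments an exhaustive search has to visit."""
    return n_sources ** n_domains

full_1x1 = brute_force_iterations(8, 96)   # one domain per PE in a 12x8 array
full_2x1 = brute_force_iterations(8, 48)   # assumed: 48 domains of two PEs each
digits_1x1 = len(str(full_1x1))            # number of decimal digits
```

The 1x1 count is an 87-digit number, that is, on the order of 10^86 assignments, and even halving the number of domains leaves an exhaustive search far out of reach.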
Overhead of Domain Separation
Overhead Estimation Method
SOTB provides a triple-well structure in which each body-bias domain has an independent N-well and P-well. In order to avoid interference between them, neighboring domains must be separated by 7.2μm vertically and 5.2μm horizontally. This restriction introduces an area overhead for the domain separation. In addition, the straps for body bias increase the area between domains. Figure 7 explains how the body bias is delivered to each domain in the SOTB process. Vertically aligned tap cells are used for body bias delivery. A tap cell delivers the body bias to a certain number of cells in the same row. The column of tap cells must be provided at a fixed interval. Here, common mesh-style delivery is used; that is, the body bias is first given through vertical straps provided at both sides of the domain and then delivered with horizontal straps. A pair of straps for NMOS and PMOS is needed for each domain. Each strap wire must be 0.2μm wide, and the wires are placed at a 0.32μm interval. A space of 0.52μm must be kept between two pairs.
Fig. 7 Layout for body-bias delivery
Figure 8 shows the outline of the overhead estimation. First, the layout of a PE is designed by using Synopsys IC Compiler, and it is the minimum unit of the body bias domain (Fig. 8 (a)). In order to send the results to the "Gather registers" quickly, the switching element for the south direction (SES) is kept at zero bias. By putting multiple PEs into a single domain, the overhead can be reduced, as represented in Fig. 8 (b), since the space between two PEs can be omitted. When two vertically neighboring PEs are in a single domain, the space for the body bias straps is also reduced, since they can share the column of tap cells. The total extension of the layout width caused by the domain separation, ΔW (the horizontal overhead), is represented with the following expression.
where N_row and N_column are the numbers of rows and columns of body bias domains in the PE array. The term (N_row × 0.72 × 2) represents the space for the pairs of body bias straps, while (N_row + 1) × 0.32 is for the spaces between pairs. The minimum horizontal interval of 5.2μm must be kept even when the space for body bias straps is smaller than that. On the other hand, since only four body bias straps are required in the horizontal direction, the vertical space between domains is equal to the minimum vertical interval of 7.2μm. That is, the vertical overhead ΔH can be represented as follows:
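The horizontal spacing described above can be sketched in Python as follows. The sketch covers only the width of one gap between horizontally neighboring domains, built from the two terms and the minimum interval quoted in the text; how the gaps add up to ΔW is not modeled. The function name, the parameter names, and the choice of N_row = 8 and N_row = 2 in the example are assumptions.

```python
def horizontal_gap_um(n_row, pair_width=0.72, pair_space=0.32, min_interval=5.2):
    """Width (in um) of one gap between horizontally neighboring domains."""
    # Space taken by the strap pairs plus the spaces between pairs.
    strap_space = n_row * pair_width * 2 + (n_row + 1) * pair_space
    # The minimum well-separation interval applies when the straps need less room.
    return max(strap_space, min_interval)

gap_many_rows = horizontal_gap_um(8)  # strap space dominates
gap_few_rows = horizontal_gap_um(2)   # clamped to the minimum interval
```

With many domain rows the strap space exceeds the minimum interval, whereas with few rows the gap is set by the well-separation rule; both constants are exposed as parameters so that other design-rule values can be tried.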
The total area of the PE array including the overhead is then obtained from the extended width and height.
Area Overhead by the Domain Separation
Figure 9 shows the area overhead of each domain size. Note that the area overhead shown here covers only the intervals between body bias domains and the wires that deliver the body bias. Since there are no cells in this area, it does not increase the leakage power. Here, the body bias domains are classified into three categories depending on their size ratios: 1) "1:1" ratio domains (e.g., 1x1 and 2x2 PE domain sizes), 2) "1:2" ratio domains (e.g., 1x2 and 2x4 PE domain sizes), and 3) "2:1" ratio domains (e.g., 2x1 and 4x2 PE domain sizes). We opted for this classification since domains with a long height or width introduce other layout challenges that we would like to analyze. The overhead of the finest domain (1x1 PE domain) is about 12.6%. It can be reduced to about 6% by merging two neighboring PEs into a single domain. Note that the overhead of the 1x2 domain is slightly smaller than that of the 2x1 domain, since the vertical straps can be shared. According to our evaluation results, the space for body bias lines is larger than the margin required between body bias domains. When the number of domains increases, a lot of body bias straps must be laid out, which increases the overhead.
Evaluation
Evaluation Setup
The evaluation target uses the same semiconductor process and design tools shown in Table 1. The simple image processing programs shown in Table 3 are used for the evaluation. Table 3 also shows the PE utilization. alpha and sepia use 8-bit input, so their arithmetic intensity is not very large, while af and gray use a relatively large number of PEs. The data width of the PE array in CMA is 24 bits, and we used 24-bit image data or 8-bit x 3 image data. Other application programs such as DCT and SAT functions have also been implemented on CMA. However, their tendency was not very different from that of the programs adopted here.
The PE array used here is 12x8, the same design as that of CC-SOTB. The size of a PE group in the same domain is represented as (vertical number) × (horizontal number). Here, seven sizes are evaluated: 1x1, 2x1, 1x2, 2x2, 4x2, 2x4, and 4x4.
Figure 10 shows the power reduction ratio compared to the case in which all PEs use zero-bias. Note that the body bias voltages optimized by the proposed algorithm are assumed to be given to each body bias domain directly. As expected, in alpha, which uses a small number of PEs, a high degree of leakage reduction can be achieved even with the size of 4x4. However, in the other application programs, little or no leakage is reduced with 4x2, 2x4, and 4x4. 4x2 achieves slightly better power reduction for alpha than 2x4, for the same reason that 2x1 is more advantageous than 1x2, as described later. 1x1 achieves the best leakage reduction, which reaches 50% on average. If the 12.6% area overhead is acceptable, it is clearly the best solution.
2x1 achieves more reduction than 1x2. This comes from the fact that the data flow is assigned from the lower rows to the upper rows in Blackdiamond. Thus, the horizontally longer domain can make better use of reverse bias than the vertically longer one. Unfortunately, the area overhead of 2x1 is slightly larger than that of 1x2, but the difference is quite small. Considering the small area overhead (6%) and the large power reduction (40% on average), 2x1 is the best domain size in most cases.
The Number of Body Bias Voltages
From a practical viewpoint, the evaluation in Fig. 10 uses too many independent body bias voltages. Although the area overhead to deliver the body bias voltages is independent of the number of voltages used, it is difficult to prepare a large number of independent body bias voltage sources. Usually, the body bias voltages are supplied from outside the chip or generated inside the chip by using a type of charge-pump circuit [20]. In both cases, the number of body bias voltages should be as small as possible.
Hereafter, we investigate the number of body bias voltage sources (N_Vbb) that can efficiently reduce the leakage. First, the reduction ratio achieved by each combination of selected voltage sources is examined. Figures 11 and 12 show how the leakage is reduced by the combinations of voltage sources for the 1x1 and 2x1 domain sizes, respectively. The results are averaged over all evaluated application programs.
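The combinations of voltage sources can be enumerated with a minimal Python sketch such as the one below. It assumes that the candidates are the eight VBN values of the delay and leakage table; the names `CANDIDATES_MV` and `source_combinations` are hypothetical, and the per-combination body bias assignment itself is not shown.

```python
from itertools import combinations

# Assumed candidate VBN sources in millivolts (-1.0 V to 0.4 V, 0.2 V steps).
CANDIDATES_MV = [-1000, -800, -600, -400, -200, 0, 200, 400]

def source_combinations(n_vbb, candidates=CANDIDATES_MV):
    """All ways to choose n_vbb distinct body bias sources from the candidates."""
    return list(combinations(candidates, n_vbb))

three_sources = source_combinations(3)
total = sum(len(source_combinations(k)) for k in range(1, len(CANDIDATES_MV) + 1))
```

Under this assumption there are 56 combinations of three sources and 255 non-empty combinations in total, so every combination can be evaluated exhaustively by restricting the assignment algorithm to the selected sources.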
The figures show that the maximum reduction ratio is achieved with a combination of three voltage sources (0V, −0.2V, and −1.0V) in both cases, although a combination of four voltages achieves the same reduction ratio. Figures 13 and 14 show the best reduction ratios that are achieved with each number of voltage sources N_Vbb for the 1x1 and 2x1 domain sizes, respectively.
In both figures, increasing the number of voltages yields diminishing returns. In most application programs, three voltages can achieve enough leakage reduction even for the 1x1 domain size. For the 2x1 domain size, the difference between two and three voltages becomes small. So, we can conclude that a combination of three bias voltages is enough in most cases: zero-bias (0V) for PEs with heavy-weight operations, weak reverse-bias for PEs with light-weight operations, and strong reverse-bias (−1.0V) for unused PEs. Although −0.2V was selected as the best weak reverse-bias voltage in both Fig. 11 and Fig. 12, other values were selected in specific application programs. Thus, a variable reverse bias is advantageous if the implementation allows it. Otherwise, −0.2V is a good candidate for the weak reverse-bias.
Conclusion
The leakage power reduction and area overhead of body bias domain separation applied to an energy efficient CGRA were analyzed. By using a Genetic Algorithm based body bias assignment method, the leakage reduction of various grain sizes was evaluated. As a result, a domain with 2x1 PEs achieved about 40% power reduction with a 6% area overhead. The results show that a combination of three body bias voltages (zero-bias, weak reverse-bias, and strong reverse-bias) can achieve the maximum leakage reduction in most cases.
In the proposed algorithm, the application mapping onto the PE array did not consider the body bias domain. By using the application mapping algorithm considering body bias domain, the leakage power reduction in larger domains will be improved. Improvement of the mapping algorithm is our future work.
