Coarse-grained Reconfigurable Architecture (CGRA) is a parallel computing platform that combines the high performance of hardware with the high flexibility of software. It is becoming a promising platform for embedded and mobile applications. Since embedded and mobile devices are usually battery-powered, improving battery lifetime is one of the primary design issues in using CGRAs. In this paper, we propose a battery-aware task-mapping method to optimize energy consumption and improve battery lifetime. The proposed method addresses two problems that arise when mapping applications onto a CGRA: task partitioning and task scheduling. Task partitioning and scheduling are formulated as a joint optimization problem of minimizing the energy consumption. The nonlinear effects of real batteries are taken into account in the problem formulation. Using the insights from the problem formulation, we design the task-mapping algorithm. We have used several real-world benchmarks to test the effectiveness of the proposed method. Experimental results show that our method can dramatically lower the energy consumption and prolong battery life.
Introduction
Coarse-grained Reconfigurable Architecture (CGRA) is a kind of parallel computing platform that combines the high performance of hardware with the high flexibility of software. It is becoming a promising platform for embedded and mobile applications. During the past decade, a large number of CGRAs [1]-[3] have been developed, demonstrating high performance, flexibility and energy efficiency.
Since embedded and mobile devices are usually battery-powered, improving battery lifetime is one of the primary design issues in using CGRAs. Recent research results show that real batteries have nonlinear characteristics: the rate capacity effect and the recovery effect [5]. The rate capacity effect means that a higher discharge current brings lower energy conversion efficiency, and the recovery effect means that inserting idle time into the discharge procedure increases the effective charge that the battery can provide. These effects indicate that the battery lifetime is sensitive to the discharge current distribution. In [5], Rakhmatov proposed an analytical battery model. This model describes the battery charge loss as a function of the profile of discharge currents and takes into account both the rate capacity effect and the recovery effect. It indicates that the battery charge loss can be optimized by adjusting the profile of discharge currents.

For a CGRA, the energy consumption comes from the execution of the application. A typical CGRA is composed of a host controller and a 2-D processing element (PE) array. To implement a given application on a CGRA, the application is partitioned into several tasks, and the tasks are scheduled and executed on the PE array. The host controller reconfigures the PE array several times to execute these tasks. Therefore, the discharge current distribution is closely correlated with the task partitioning and scheduling. More specifically, different task partitionings lead to different numbers of operators in each task. When these tasks are mapped onto the PE array, they activate different numbers of PEs, which results in different working currents of the PE array. Thus, how the application is partitioned into tasks determines the working current of each task. The task scheduling determines the executing order and the start time of each task, so it also affects the profile of discharge currents.
Since the energy consumption of battery is determined by the profile of discharge currents, the most rational way to optimize the battery lifetime of CGRA is to find the proper task partitioning and scheduling, resulting in the profile of discharge currents that causes minimum battery charge loss. In this paper, we formulate the task partitioning and scheduling as a joint optimization problem of minimizing the battery charge loss. Using the insights from the problem formulation, we design the battery-aware task-mapping algorithm.
The remainder of this paper is structured as follows. Sect. 2 introduces the background on the battery model and CGRA. Sect. 3 reviews related work on energy-efficient task partitioning and scheduling. Sect. 4 formulates the task mapping problem, and Sect. 5 presents the battery-aware task mapping method. Experimental results are shown in Sect. 6. Finally, Sect. 7 concludes the paper.
Backgrounds

Battery Model
The model we use in this paper is the Rakhmatov battery model [5]. This model describes the simplified chemical reactions occurring in the battery with an analytical expression. It is based on the laws of chemical kinetics, and it is a variable-load analytical model that takes both the rate capacity effect and the recovery effect into account. It was shown that the average prediction error over various constant-current discharge profiles was 2%, with a maximum of 4%. The model requires only two parameters, $\alpha$ and $\beta$, which are estimated by conducting constant-current discharge experiments. The parameter $\alpha$ represents the battery capacity and $\beta$ captures the nonlinear effects of the battery. The higher the value of $\beta$, the more the battery behaves like an ideal one. The battery model is described by Eq. (1), in which the value of $\delta$ gives the charge loss by time $T$. The total discharge procedure is described by $n$ distinct discharge intervals; $I_k$, $t_k$ and $\Delta_k$ respectively stand for the discharge current of the battery, the start time and the duration of the $k$th discharge interval. The battery lifetime is estimated by evaluating (1) for increasing values of $T$ and stopping where $\delta$ reaches $\alpha$. The battery lifetime then equals that value of $T$. Equation (1) can also be read from the point of view of task execution. Considering $n$ tasks, for the $k$th task, $(I_k, t_k, \Delta_k)$ denotes the working current, the start time and the duration. The total battery charge loss for executing all $n$ tasks is $\delta$, so we can use Eq. (1) as the energy consumption function to be minimized.

Copyright © 2013 The Institute of Electronics, Information and Communication Engineers
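As a sketch of how the charge loss can be evaluated numerically, the snippet below assumes that Eq. (1) takes the commonly cited closed form of the Rakhmatov model,

$$\delta(T) = \sum_{k=1}^{n} I_k \left[ \Delta_k + 2 \sum_{m=1}^{\infty} \frac{e^{-\beta^2 m^2 (T - t_k - \Delta_k)} - e^{-\beta^2 m^2 (T - t_k)}}{\beta^2 m^2} \right].$$

The function name `charge_loss`, the truncation of the infinite series to `terms` terms, and all numeric values are illustrative assumptions, not values from this paper.

```python
import math

def charge_loss(intervals, beta, T, terms=10):
    """Charge loss by time T for discharge intervals (I_k, t_k, delta_k).

    Assumed form of Eq. (1); the infinite series is truncated to `terms` terms.
    """
    total = 0.0
    for current, start, duration in intervals:
        if start >= T:
            continue  # interval has not started yet at time T
        duration = min(duration, T - start)  # clip an interval still running at T
        series = 0.0
        for m in range(1, terms + 1):
            b = beta * beta * m * m
            series += (math.exp(-b * (T - start - duration))
                       - math.exp(-b * (T - start))) / b
        total += current * (duration + 2.0 * series)
    return total

# Two hypothetical profiles with the same work: back-to-back vs. with idle time.
busy = [(1.0, 0.0, 1.0), (1.0, 1.0, 1.0)]
rested = [(1.0, 0.0, 1.0), (1.0, 2.0, 1.0)]
loss_busy = charge_loss(busy, beta=0.5, T=2.0)
loss_rested = charge_loss(rested, beta=0.5, T=3.0)
```

Under this assumed form, the profile with idle time loses less charge than the back-to-back one, which matches the recovery effect, and a very large $\beta$ makes the loss approach the ideal value $\sum_k I_k \Delta_k$.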
Typical Architecture of CGRA
A typical CGRA is composed of one host controller and one 2-D PE array. Each PE includes an ALU and a register file. Each PE can be configured to perform different word-level operations on fixed-point numbers. The PEs are power-gated so that unused PEs can be shut down to reduce power consumption. The configuration words for the PE array are stored in configuration memory. The data memory is used to store intermediate computing results. Figure 1 shows a typical CGRA with a 4 × 4 PE array. To implement a given application on the CGRA, the application is first partitioned into several tasks. The size of each task is limited by the size of the PE array. Then the tasks are scheduled and executed on the PE array. The host controller reconfigures the PE array several times to execute these tasks.
Related Works
Low-power design is a fundamental theme pervading embedded system and mobile computing research. Voltage and frequency scaling has been extensively used on single-processor and multi-processor platforms. However, most of the published work assumes an ideal power source. In [7], the authors proposed an energy-efficient scheduling algorithm aimed at sporadic real-time tasks on multiprocessors. By employing the LRE-TL scheduling algorithm [16], it minimizes energy consumption by adjusting the voltage and frequency of the processors. An energy-aware partitioned fixed-priority scheduling method is proposed in [8]. This method optimizes the energy cost by adjusting the clock frequency. In [9], a cluster-based energy-efficient scheduling method for real-time tasks is given. It minimizes energy consumption by controlling the active islands and the frequency assignment. In [10], the authors proposed an energy-efficient scheduling method for lightly loaded multicore processors. They let the overfull cores work at low frequency and lock down idle cores. A task scheduling method for dual-core systems is proposed in [11]. It employs DVS (Dynamic Voltage Scaling) for each subtask and matches the subtasks to certain processors. This method shows good results in energy optimization. A priority-based scheduling method to optimize energy consumption is proposed in [6]. It first partitions the input tasks by priority, and then chooses a different strategy for each priority. By controlling the power threshold of each processor, it balances the load and reduces energy consumption. Since non-ideal battery properties play an important role in the effective utilization of the battery capacity, the actual performance of all the above methods is overestimated.
Our work differs from the voltage and frequency scaling methods. Those methods find the proper supply voltage and frequency to reduce energy consumption, while we focus on how to partition and schedule the tasks of a given application.
Recently, as reconfigurable architectures have become more and more popular, some issues of energy optimization for reconfigurable computing platforms have been investigated. In [12], the authors proposed a task-scheduling algorithm to manage energy consumption for FPGAs. This algorithm schedules the tasks based on the Rakhmatov battery model and works well on fine-grained reconfigurable architectures. An interconnect optimization method is proposed in [13]. It optimizes energy consumption by shutting down connections that are not in use. However, it does not take into account the nonlinear effects of the battery, which limits its effect. In [14], the authors proposed a task-mapping algorithm for CGRA. This method takes the architectural features of CGRA into account and achieves lower power consumption and fewer execution cycles. However, an ideal power source is assumed. A power-centric mapping method for reconfigurable processors with dual $V_{dd}$ and dual $V_{th}$ is shown in [15], but its target architecture is different from ours and it does not consider the nonlinear features of the battery either.
Our work is also quite different from the above methods for reconfigurable computing platforms. First, few of them consider the nonlinear features of the battery, while our proposed method is based on a nonlinear battery model. Second, most existing methods, such as the algorithm in [12], only consider the task-scheduling problem. However, task partitioning determines the profile of the working currents of tasks, which cannot be neglected. In our work, we deal with task partitioning and scheduling jointly.
Task Mapping Problem of CGRA
Mapping Application onto CGRA
In a CGRA, the PE array is the basic block that executes tasks. Since the PE array has limited computing resources (e.g. the number of PEs), the size of the task it can execute (e.g. the number of operators and the length of the critical path) is constrained. A given application is usually larger than the PE array, and its critical path is usually too long for one PE array to accommodate. As a result, an input application usually cannot be executed on the CGRA directly in one pass.
Task mapping is the procedure of mapping an application onto the CGRA according to the resource constraints. The input application is usually described as a data flow graph (DFG). A DFG is a directed acyclic graph described by $G = (V, E)$, where $V$ represents all nodes in the DFG, $|V| = n$, and $E$ stands for all the edges in the DFG. Each node $v_i \in V$ represents an operation element. A directed edge $e_{ij} = \langle v_i, v_j \rangle$, $e_{ij} \in E$, exists if the function of node $v_j$ depends on the result of $v_i$. The first step of task mapping is partitioning the application into several tasks, i.e. dividing $G$ into several parts. Using $V_i$ to represent the $i$th task, the size of $V_i$, $S(V_i)$, equals the number of operators that task $V_i$ has. In partitioning, some edges are cut, which results in the data inputs and data outputs of tasks. The size of each task must not exceed the size of the PE array, so that each task can be executed on the PE array. After task partitioning, the number of tasks, the size of each task and the duration of each task are determined. The second step is scheduling the tasks. The tasks are executed in a serialized manner on the PE array. The executing order is constrained by the dependency relationships of the tasks. Each task is assigned a start time according to the executing order. Figure 2 shows an example of task partitioning and scheduling. The application is first partitioned into six tasks. According to the dependency relationships, there are four feasible scheduling schemes for execution.
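The set of feasible scheduling schemes is the set of serial orders that respect the task dependencies. A minimal sketch that enumerates them is given below; the function name `feasible_orders` and the four-task dependency structure are hypothetical and do not reproduce the six-task example of Fig. 2.

```python
def feasible_orders(tasks, deps):
    """Enumerate every serial execution order that respects the dependencies.

    deps maps a task to the set of tasks that must finish before it starts.
    """
    orders = []

    def extend(prefix, remaining):
        if not remaining:
            orders.append(list(prefix))
            return
        done = set(prefix)
        for task in sorted(remaining):
            # A task is ready once all of its prerequisites have been scheduled.
            if deps.get(task, set()) <= done:
                extend(prefix + [task], remaining - {task})

    extend([], frozenset(tasks))
    return orders

# Hypothetical dependency structure: T1 feeds T2 and T3, which both feed T4.
deps = {"T2": {"T1"}, "T3": {"T1"}, "T4": {"T2", "T3"}}
orders = feasible_orders(["T1", "T2", "T3", "T4"], deps)
```

For this hypothetical graph only the relative order of T2 and T3 is free, so two orders are feasible.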
Energy Consumption Modeling
When executing a task, the CGRA will firstly configure the PE array to meet the task's function requirement. When the hardware is configured, the PE array will load the input data from memory. Then the PE array will start the task execution. After the execution, the calculated results will be stored into memory for further use. This is the standard process of executing a task, which is shown in Fig. 3 . It includes four phases: configuration, data loading, execution and data storing. The energy consumption for executing a task is from these four phases. To model the energy consumption, we need to calculate the working current of each phase.
In the configuration phase, the host controller writes the configuration words into the configuration register of each PE. We denote the working current of configuration as $I_{cfg}$, i.e. the current consumed by the controller to write configuration words to the PEs' registers. The duration of the configuration phase, $\Delta_{cfg}$, equals the time needed to write configuration words to all PEs. Therefore, for a given CGRA, $\Delta_{cfg}$ is a fixed number.
In the data loading and data storing phases, the working current equals the current required for memory access, which is denoted as $I_{cc}$. The durations of data loading and data storing, namely $\Delta_{LD}$ and $\Delta_{ST}$, are directly proportional to the data volume that needs to be loaded and stored. We use $D_{IN}(V_i)$ and $D_{OUT}(V_i)$ to denote the input data volume and output data volume respectively, and $B_{cc}$ to denote the memory bandwidth. Then

$$\Delta_{LD}(V_i) = \frac{D_{IN}(V_i)}{B_{cc}}, \qquad \Delta_{ST}(V_i) = \frac{D_{OUT}(V_i)}{B_{cc}}.$$
In the execution phase, the unnecessary PEs can be shut down to save energy. The working current, $I_{exe}$, then equals the sum of the working currents of all activated PEs. For task $V_i$, the number of activated PEs equals the size of the task, $S(V_i)$. The working current of task $V_i$ can be calculated as

$$I_{exe}(V_i) = \sum_{k \in V_i} I_{PE}(k),$$

where $I_{PE}(k)$ denotes the working current of a PE executing the operation $k$ in $V_i$. The duration of execution, $\Delta_{exe}(V_i)$, equals the critical path length of task $V_i$. Figure 4 gives an example of the working current and duration calculation. The task has 7 operators: four "+", two "-" and one "×". When it is mapped to the PE array, there are 7 activated PEs. Thus the working current is $4I_{PE}(+) + 2I_{PE}(-) + I_{PE}(\times)$. The critical path length of the task is 4, so $\Delta_{exe} = 4$ cycles.
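The calculation can be sketched as follows. The operator mix (four "+", two "-", one "×") and the critical path length of 4 follow the example above, but the edge list and the per-operation currents are hypothetical, since the actual graph of Fig. 4 and the measured currents are not given in the text. The sketch also assumes that each operator takes one cycle, so the critical path length is counted in nodes.

```python
def working_current(ops, i_pe):
    """Sum of the working currents of all activated PEs (one PE per operator)."""
    return sum(i_pe[op] for op in ops.values())

def critical_path_length(ops, edges):
    """Longest dependency chain of the task, counted in operators."""
    preds = {v: [] for v in ops}
    for u, v in edges:
        preds[v].append(u)
    depth = {}

    def level(v):
        if v not in depth:
            depth[v] = 1 + max((level(u) for u in preds[v]), default=0)
        return depth[v]

    return max(level(v) for v in ops)

# Hypothetical task with the same operator mix as the example in the text.
ops = {1: "+", 2: "+", 3: "-", 4: "+", 5: "-", 6: "+", 7: "*"}
edges = [(1, 4), (2, 4), (3, 5), (4, 6), (5, 6), (6, 7)]
i_pe = {"+": 2.0, "-": 2.0, "*": 5.0}  # hypothetical currents, arbitrary units

i_exe = working_current(ops, i_pe)            # 4*I(+) + 2*I(-) + I(*)
delta_exe = critical_path_length(ops, edges)  # in cycles
```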
Based on the above analysis, we get the working current and duration of each phase: $(I_{cfg}, \Delta_{cfg})$, $(I_{cc}, \Delta_{LD})$, $(I_{exe}, \Delta_{exe})$ and $(I_{cc}, \Delta_{ST})$. Therefore, for a set of tasks, we can construct the whole discharge current distribution. The energy consumption for executing an application on the CGRA can then be calculated by Eq. (1).
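A minimal sketch of how the four phases of one task expand into discharge intervals $(I, t, \Delta)$ is given below. It assumes that the load and store durations are the data volume divided by the bandwidth $B_{cc}$ and that the phases run back to back; the function name `task_profile`, the dictionary keys and all numeric values are hypothetical.

```python
def task_profile(start, task, i_cfg, d_cfg, i_cc, b_cc):
    """Expand one task into its four (current, start, duration) intervals.

    Phases: configuration, data loading, execution, data storing.
    Returns the intervals and the finish time of the task.
    """
    phases = [
        (i_cfg, d_cfg),                  # configuration
        (i_cc, task["d_in"] / b_cc),     # data loading (assumed D_IN / B_cc)
        (task["i_exe"], task["d_exe"]),  # execution
        (i_cc, task["d_out"] / b_cc),    # data storing (assumed D_OUT / B_cc)
    ]
    profile, t = [], start
    for current, duration in phases:
        profile.append((current, t, duration))
        t += duration
    return profile, t

# Hypothetical task and platform parameters (arbitrary units).
task = {"d_in": 8, "d_out": 4, "i_exe": 17.0, "d_exe": 4}
profile, finish = task_profile(0.0, task, i_cfg=3.0, d_cfg=3.0, i_cc=6.0, b_cc=4.0)
```

Concatenating the profiles of all tasks in schedule order, with each task starting at its assigned start time, gives the discharge current distribution that the battery model takes as input.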
Problem Formulation
As mentioned in 4.1 and 4.2, the task partitioning affects the task size, the critical path length and intermediate data volume; the task scheduling affects the start time of each task. In other words, the discharge current distribution is determined by task partitioning and scheduling. Therefore, we can formulate the task partitioning and scheduling as a joint optimization problem. The goal of this problem is to minimize the battery charge loss. The solution of this problem is the optimal task partitioning and scheduling scheme.
The optimization problem has two extra constraints. First, the task size must not exceed the size of the PE array. Second, the total runtime of all tasks must not exceed the deadline of the application. We use $A_{PEA}$ and $T$ to denote the size of the PE array and the deadline of the application, respectively. Assuming the input DFG is partitioned into $K$ tasks, we use $R(V_i)$ to denote the runtime of $V_i$. Then the two constraints read

$$S(V_i) \le A_{PEA}, \quad i = 1, \dots, K, \qquad \sum_{i=1}^{K} R(V_i) \le T.$$
Based on the above notations, we can formulate the task partitioning and scheduling problem as follows:
Input: DFG G = (V, E).
Constraints: $S(V_i) \le A_{PEA}$ for $i = 1, \dots, K$, and $\sum_{i=1}^{K} R(V_i) \le T$.
Generate the task partitioning and scheduling scheme of G such that the total battery charge loss δ is minimized.
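The two constraints can be checked for a candidate scheme with a few lines of code. This is a sketch that reads both constraints as non-strict inequalities; the function name `is_feasible` and the numbers are hypothetical.

```python
def is_feasible(task_sizes, runtimes, a_pea, deadline):
    """Check the area constraint per task and the deadline over all tasks."""
    fits_array = all(size <= a_pea for size in task_sizes)
    meets_deadline = sum(runtimes) <= deadline
    return fits_array and meets_deadline

# Hypothetical scheme with three tasks on a 16-PE array.
ok = is_feasible(task_sizes=[16, 12, 7], runtimes=[10, 9, 6], a_pea=16, deadline=30)
```

Only schemes that pass this check need to be evaluated with the battery model.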
Battery-Aware Task Mapping Method
Basic Idea
As mentioned in 4.2, the energy consumption comes from three aspects: the configuration of PE array, the execution of PE array and necessary memory accessing for data loading/storing.
The energy consumption of configuration is related to the number of tasks, since executing one task means one time configuration of PE array. For a given DFG, the number of partitioned tasks is subject to the size of each task. If we want the least number of tasks, we should make the task size as large as possible in partitioning.
The energy consumption of memory accessing depends on the data transfer between tasks, and the transferred data volume is determined by the task partitioning. Figure 5 shows an example of the relationship between task partitioning and data transfer. The DFG in Fig. 5 is partitioned into two tasks. The scheme in Fig. 5 (a) requires 1 datum to be transferred, while the scheme in Fig. 5 (b) requires 3 data to be transferred. Therefore, to reduce energy consumption, we should minimize the data transfer in task partitioning.
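The dependence of data transfer on the partitioning can be illustrated by counting the edges that cross task boundaries. The sketch below uses a hypothetical six-node DFG, not the graph of Fig. 5, and assumes that every cut edge carries one datum.

```python
def transferred_data(edges, assignment):
    """Number of DFG edges whose endpoints are assigned to different tasks."""
    return sum(1 for u, v in edges if assignment[u] != assignment[v])

# Hypothetical DFG: nodes 1, 2 and 3 feed node 4, followed by the chain 4 -> 5 -> 6.
edges = [(1, 4), (2, 4), (3, 4), (4, 5), (5, 6)]

# Two ways of splitting the same DFG into two tasks (task 0 and task 1).
scheme_a = {1: 0, 2: 0, 3: 0, 4: 0, 5: 1, 6: 1}
scheme_b = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}

cost_a = transferred_data(edges, scheme_a)
cost_b = transferred_data(edges, scheme_b)
```

Moving node 4 into the second task triples the number of data that must be stored and reloaded, which is the kind of difference the mincut step is meant to avoid.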
The energy consumption of execution is related to how we distribute DFG's critical path into tasks, since the length of critical path determines the execution time of each task.
Considering all the above insights, the basic idea of our task mapping method is as follows: 1) We partition the DFG into several macro blocks by considering data transfer minimization and task size maximization; 2) We further divide each macro block into basic blocks by considering the critical path distribution. The basic blocks are the basic units from which tasks are formed; 3) We traverse all possible combinations of basic blocks to generate all feasible task partitioning schemes; 4) For each task-partitioning scheme, we find the task-scheduling scheme that minimizes the total battery charge loss $\delta$; 5) From all feasible task partitioning schemes, we select the one with minimal $\delta$. Since the corresponding task scheduling scheme is obtained in 4), this gives the final task mapping result.
Task Partitioning and Scheduling Algorithms
Based on the above idea, we design three algorithms. Algorithm 1 partitions the DFG into macro blocks, using the maxflow-mincut method to minimize data transfer. Algorithm 2 divides macro blocks into basic blocks and generates all feasible tasks. Algorithm 3 schedules the tasks and finds the optimal solution. In Algorithm 1, the maxflow-mincut method cuts the input DFG into two parts, $X$ and $X'$, such that the data transfer between them is the least among all possible cuts. This cut is called a mincut and is represented by $C = (X, X')$. By executing the maxflow-mincut method iteratively on the DFG, we can find all of its mincuts. For a mincut $C = (X, X')$, the size of $X$ may be less than the area constraint $A_{PEA}$. In that case we can add nodes from $X'$ to $X$ to enlarge $X$ up to $A_{PEA}$, which gives a new cut $C' = (X_1, X_1')$. If we use this cut to partition the DFG, the task size is increased. The newly obtained cut $C'$ is called an areacut. Leveraging the mincut and areacut methods, the DFG is divided into several macro blocks. The borderline of a macro block is either a mincut or an areacut.
Algorithm 1 is shown in Fig. 6. In step 1, a source node $s$ and a sink node $t$ are added to the DFG; these two nodes are used to execute the maxflow-mincut algorithm. If $Area(X) > A_{PEA}$, we start to search for a mincut inside $X$: steps 14 and 15 form $X$ into a new graph $G'$, and in step 16 the function is called recursively on it. If $Area(X) \le A_{PEA}$, we have found a mincut $C$. Steps 6 to 9 add nodes to $X$ to find an areacut, and in step 10 an areacut is found. Figure 7 shows an example of Algorithm 1 with $A_{PEA} = 8$. In (a), the first mincut is found, and the size of $X$ is 5, so we can add nodes to $X$. By executing steps 7 to 9 of Algorithm 1, we get the first areacut in (b). By executing the maxflow-mincut method iteratively, we get another mincut as shown in (c). The size of $X$ in this step is 6, so we can add nodes to $X$ to get the second areacut in (d). Continuing with the iterative procedure, we get the result in (f): we find 3 mincuts and 3 areacuts in total, and the DFG is divided into 6 macro blocks.
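One mincut step followed by one areacut step can be sketched as below. This is not the listing of Fig. 6: it assumes unit edge capacities, uses the breadth-first (Edmonds-Karp) variant of the Ford-Fulkerson method, and grows $X$ only with nodes whose predecessors are already in $X$, which is one plausible reading of the node-adding steps. The function names and the eight-node DFG are hypothetical.

```python
from collections import deque

def min_cut_source_side(edges, sources, sinks):
    """Return the DFG nodes on the source side X of a minimum cut.

    Unit capacity per DFG edge; "s" and "t" are the added source and sink
    nodes, so DFG nodes must not use these two names.
    """
    inf = float("inf")
    cap = {}

    def add(u, v, c):
        cap.setdefault(u, {})
        cap.setdefault(v, {})
        cap[u][v] = cap[u].get(v, 0) + c
        cap[v].setdefault(u, 0)  # residual (reverse) edge

    for u, v in edges:
        add(u, v, 1)
    for node in sources:
        add("s", node, inf)
    for node in sinks:
        add(node, "t", inf)

    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {"s": None}
        queue = deque(["s"])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if "t" not in parent:
            # No augmenting path left: reachable nodes form the source side.
            return set(parent) - {"s"}
        path, v = [], "t"
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        flow = min(cap[u][w] for u, w in path)
        for u, w in path:
            cap[u][w] -= flow
            cap[w][u] += flow

def grow_to_area_cut(x, nodes, edges, a_pea):
    """Enlarge X up to a_pea nodes, adding only nodes whose inputs lie in X."""
    preds = {v: set() for v in nodes}
    for u, v in edges:
        preds[v].add(u)
    x = set(x)
    grew = True
    while len(x) < a_pea and grew:
        grew = False
        for v in nodes:
            if v not in x and preds[v] <= x:
                x.add(v)
                grew = True
                break
    return x

# Hypothetical DFG with inputs 1 and 2 and output 8.
nodes = list(range(1, 9))
edges = [(1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 7), (6, 7), (7, 8)]
x = min_cut_source_side(edges, sources=[1, 2], sinks=[8])
x1 = grow_to_area_cut(x, nodes, edges, a_pea=5)
```

In this example the mincut isolates the first three nodes, and the areacut then pulls two more nodes across the cut to fill a five-PE budget.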
The purpose of Algorithm 2 is to divide each macro block into basic blocks. For a given DFG, the critical path length is a fixed value that equals the sum of the critical path lengths of its tasks. The energy consumption of execution is determined by how we distribute the DFG's critical path among tasks. Note that the critical path length equals the number of priority levels, so the critical path length of a task is determined by how many different priority levels it has. Here, the priority is defined by the data dependency of the nodes, which constrains the temporal ordering of the nodes. We obtain the priority by finding root nodes iteratively. Therefore, we can divide the nodes in the same macro block by priority level: the nodes that have the same priority level form a basic block. Once we have all the basic blocks, we can group basic blocks to form a task. By traversing all possible combinations of basic blocks, we get all feasible task partitioning schemes. To facilitate this traversal, we form the blocks into a new graph.

Figure 8 shows the details of Algorithm 2. Steps 1 to 4 calculate the priority of the nodes in the DFG. Steps 7 to 17 group the nodes between adjacent cuts (in other words, in one macro block) by their Priority flag. First, in step 9 we assign the nodes between adjacent cuts a Macro block flag. Then, for each block, steps 10 to 12 scan all possible values of the Priority flag in ascending order, and steps 13 to 15 assign the nodes a Basic block count flag to identify their group. Steps 18 to 27 form the new graph $M$. Each node in $M$ represents a group in the DFG, and the weight of a node in $M$ equals the number of DFG nodes it represents. Steps 18 to 22 add nodes to $M$. Since the nodes in one basic block come from the same macro block in the DFG and are grouped because they have the same priority, their Macro block flag and Priority flag are the same. In step 22 we assign the Macro block flag and Priority flag to the node in $M$. In step 23 we renumber the nodes in $M$ according to their Priority flag. Sometimes several groups have the same Priority flag; they must then come from different macro blocks and have different Macro block flags, so we renumber these nodes by their Macro block flag in ascending order. Steps 24 to 27 add edges to $M$: we add an edge from node $i$ to node $j$ if the sum of the node weights from $i + 1$ to $j$ does not exceed $A_{PEA}$.

Figure 9 shows an example of Algorithm 2. In (a), every node in the DFG is assigned a number that shows its priority level. By grouping the nodes between cuts according to their priority level, we get the result in (b), where the number on each node shows its basic block number. (c) shows an example of graph $M$: $s$ is a dummy node, the blue number near each node is its basic block number in $M$, and the number on each node is its weight. By traversing all possible paths from node $s$ to node 15, we get all possible task partitioning schemes. Each node that a path goes through marks a dividing point of the task partitioning. The red line in (c) shows an example path. This path goes through node 4, node 7, node 10 and node 15. In this case, nodes 1 to 4 form the first task, nodes 5 to 7 form the second task, nodes 8 to 10 form the third task and the last 5 nodes form the fourth task.
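The two core operations of this step, computing priority levels by peeling off root nodes and enumerating the paths of graph $M$, can be sketched as follows. This is an illustration rather than the listing of Fig. 8: the function names are hypothetical, the edge condition is read as "sum of weights no larger than $A_{PEA}$", and the example weights are invented.

```python
def priority_levels(nodes, edges):
    """Assign priority k to the nodes removed in the k-th round of root peeling."""
    preds = {v: set() for v in nodes}
    for u, v in edges:
        preds[v].add(u)
    level, remaining, k = {}, set(nodes), 0
    while remaining:
        k += 1
        roots = {v for v in remaining if not (preds[v] & remaining)}
        if not roots:
            raise ValueError("the graph contains a cycle, so it is not a DFG")
        for v in roots:
            level[v] = k
        remaining -= roots
    return level

def partition_schemes(weights, a_pea):
    """Enumerate all paths from the dummy node s (index 0) to the last block.

    weights[i - 1] is the weight of basic block i. An edge i -> j exists when
    blocks i+1..j together fit on the PE array. Each path is returned as the
    list of blocks at which a task ends.
    """
    n = len(weights)
    schemes = []

    def walk(i, path):
        if i == n:
            schemes.append(path)
            return
        total = 0
        for j in range(i + 1, n + 1):
            total += weights[j - 1]
            if total > a_pea:
                break
            walk(j, path + [j])

    walk(0, [])
    return schemes

levels = priority_levels([1, 2, 3, 4], [(1, 3), (2, 3), (3, 4)])
schemes = partition_schemes(weights=[2, 3, 1], a_pea=4)
```

For the hypothetical weights, the path `[1, 3]` means that block 1 forms the first task and blocks 2 and 3 together form the second task.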
Task scheduling is the procedure that determines the executing order of tasks. In this step, we must consider the time constraint $T$: the runtime of all tasks should be no longer than $T$. However, the runtime of tasks depends not only on the task scheduling but also on the task partitioning. That means we cannot get an optimal partitioning without considering scheduling, so these two problems are resolved together. The details of Algorithm 3 are shown in Fig. 10.
The basic idea of Algorithm 3 is to scan all possible task partitioning schemes by searching all paths in graph $M$. For each path, we use the simulated annealing algorithm to find an optimal scheduling scheme. First we calculate the energy consumption of the current scheduling scheme. Then we generate a new scheduling scheme and compute its energy consumption. We compare the two energy consumptions using the acceptance condition of the simulated annealing algorithm to decide whether to accept the new scheme. By repeating this procedure iteratively, we find the optimal result we want. In Fig. 10, steps 1 to 10 form the path into a new graph $P$, in which each node represents a task decided by the path. Steps 11 to 16 generate an initial schedule scheme, and step 17 calculates its energy consumption. Steps 20 to 29 generate a new schedule scheme. First we choose a number $k$ between 1 and Schedule step - 1, where Schedule step identifies the order during the schedule. Then we give all nodes with a Schedule flag bigger than $k$ back to $P$, and by repeating steps 24 to 29 we find a new schedule scheme. In step 30 we calculate the energy consumption of the newly generated scheme. For each newly generated scheme, the acceptance condition of the simulated annealing algorithm checks whether we should accept the new scheme as the chosen scheme. The acceptance condition and more details of the simulated annealing algorithm are given in Fig. 11.
Steps 21 and 22 are used for generating a new result.
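The neighbor generation and the acceptance test can be sketched as below. The acceptance rule shown is the standard Metropolis criterion, used here as an assumption because the exact condition of Fig. 11 is not reproduced in the text; the function names, the temperature parameter and the example dependencies are hypothetical.

```python
import math
import random

def accept(current_loss, new_loss, temperature, rng=random):
    """Metropolis criterion: always accept improvements, sometimes accept worse."""
    if new_loss <= current_loss:
        return True
    return rng.random() < math.exp((current_loss - new_loss) / temperature)

def neighbor(order, deps, rng=random):
    """Keep the first k scheduled tasks and reschedule the rest at random.

    k is drawn between 1 and len(order) - 1; deps maps a task to the set of
    tasks that must run before it. The input order must itself be feasible.
    """
    k = rng.randint(1, len(order) - 1)
    prefix, pool = list(order[:k]), set(order[k:])
    while pool:
        done = set(prefix)
        ready = sorted(t for t in pool if deps.get(t, set()) <= done)
        pick = rng.choice(ready)
        prefix.append(pick)
        pool.remove(pick)
    return prefix

# Hypothetical task graph: T1 feeds T2 and T3, which both feed T4.
deps = {"T2": {"T1"}, "T3": {"T1"}, "T4": {"T2", "T3"}}
candidate = neighbor(["T1", "T2", "T3", "T4"], deps, random.Random(0))
```

In a full annealing loop, the charge loss of each candidate would be computed with the battery model, the candidate would replace the current schedule whenever `accept` returns true, and the temperature would be lowered between iterations.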
Complexity Analysis of the Algorithms
Since the proposed battery-aware task mapping method includes two main parts: task partitioning and task scheduling, we analyze the complexity of each part respectively.
The most time-consuming operation in task partitioning is finding the maxflow of the DFG. In our algorithm, we use the Ford-Fulkerson method [19] to find the maxflow. Considering a DFG with $n$ nodes and $m$ edges, the complexity of this method is $O(nm^2)$. The worst case of task partitioning is that each iteration of the Mincut Areacut algorithm cuts only one node out of the DFG, so that the DFG must be cut $n$ times to finish the algorithm. Since we use the maxflow method in each iteration, the complexity of task partitioning is $O(n^2 m^2)$. To do the task scheduling, we have to traverse all paths in graph $M$, with complexity $O(n + m)$. Each time we find a path, we use the simulated annealing algorithm to find the optimal result. The complexity of the simulated annealing algorithm is $O(x^3 \ln x)$, where $x$ is the scale of the problem. In the worst case, each node in the original DFG becomes one basic block, so the worst-case path has $n$ nodes. Thus the worst-case complexity of task scheduling is $O(n + m) \cdot O(n^3 \ln n) = O(n^4 \ln n + n^3 m \ln n)$. From the above analysis, the main causes of complexity are the maxflow algorithm and the simulated annealing algorithm. In future work, we will try other maxflow and search algorithms to reduce the complexity.
Experiment Results
To verify and evaluate the proposed task mapping method, we conduct experiments on the REINDEER processor. REINDEER is the low-power version of the REMUS [4] processor. It is a reconfigurable media computing processor composed of a host controller (UNITY-2 core) and a reconfigurable programmable unit (RPU). The RPU is a 4 × 4 PE array. The system clock is 100 MHz. REINDEER was fabricated in a 65 nm 1P6M CMOS process, and the core and I/O voltages are 1.2 V and 3.3 V respectively. Figure 12 shows the basic architecture and die photograph of REINDEER. Some key parameters are summarized in Table 1. Each PE has two input ports (A and B). The functionality of the PE and the working current of each operation are shown in Table 2.

Table 2: The functionality of PE and the working current.
We choose several test cases to evaluate the effectiveness of our method. All the test cases are extracted from an H.264 decoding application. Details of the cases are shown in Table 3. Loop Filter is a functional pattern for vertical filtering, IDCT for Col performs the inverse cosine transform on a column of a matrix, and IDCT for Row performs the inverse cosine transform on a row of a matrix. ADD4 Clip sums the predicted result and the residual of IDCT. IDCT main is the top function of IDCT, and Mat mul4 4 performs the multiplication of 4 × 4 matrices. Table 3 shows the number of nodes, the number of edges, the length of the critical path and the degree of parallelism, where the degree of parallelism is the number of nodes divided by the length of the critical path.
To evaluate the performance, we compare our method with some related methods. Since our method resolves task partitioning and scheduling jointly, we get the partitioning result and the scheduling result simultaneously. However, no other method resolves these two problems jointly. For example, the layered partition algorithm (LPA) [17] and the network flow-based algorithm (NFA) [18] only focus on task partitioning, and the battery-efficient scheduling algorithm (BES) [12] only focuses on task scheduling. Therefore, in our comparison, we use combinations of them (LPA+BES and NFA+BES) as our counterparts. Another reason to choose the BES algorithm for comparison is that it is also built on the Rakhmatov battery model, which makes the comparison fair.
The comparisons are made in the following aspects: the number of tasks (TN) that an algorithm produces, the data volume (DV) that needs to be transferred between tasks, the battery charge consumed by the tasks, and the remaining discharge time of the battery.
The experimental results depend on the nonlinear factor $\beta$ and the time constraint $T$. We first evaluate the performance with a typical $\beta = 0.574$ [5] and a very tight time constraint (no slack time). Then we vary $\beta$ to evaluate the influence of the nonlinear factor. Finally we vary $T$ to evaluate the influence of the time constraint.
6.1 Performance with β = 0.574 and no Slack Time
We first evaluate the performance of task partitioning. Table 4 shows the comparison of TN and DV among three methods.
From Table 4 we can see that LPA always gets the smallest TN. This is because LPA only takes the task size into account: it generates tasks with the same size as the PE array ($A_{PEA}$). NFA is efficient at optimizing data transfer, since it also adopts the mincut method. However, it does not consider task size, which results in a large TN. The result of our method resembles a combination of LPA's TN and NFA's DV. Its TN is only slightly larger than LPA's, but its DV is much smaller than LPA's; its DV is almost the same as NFA's, but its TN is much smaller than NFA's. Generally speaking, our method has the advantages of both LPA and NFA, since we take both data transfer cost and reconfiguration cost into account.
Table 5 Comparison of battery charge loss at β = 0.574 (unit: mA-sec).
Next we evaluate the performance of energy consumption optimization. Based on the tasks generated by LPA and NFA, the BES algorithm schedules the tasks onto the PE array to be executed. Table 5 compares the battery charge loss among the three methods. Since the charge loss in a real battery is hard to measure directly, it is calculated with the Rakhmatov battery model. The data in Table 5 are the cumulative charge loss over 5000 executions of each test case. Table 5 shows that the battery charge loss of both LPA+BES and NFA+BES is larger than that of our method. Our method gives much better results than LPA+BES for tasks with a high degree of parallelism, and much better results than NFA+BES for tasks with a low degree of parallelism. On average, our method is 22.37% better than LPA+BES and 10.49% better than NFA+BES.
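The model-based calculation of charge loss can be sketched as follows. This is a minimal sketch of the analytical Rakhmatov model in its commonly cited form, not the evaluation code used for Table 5. The function name `charge_loss`, the profile values, and the truncation of the infinite series to ten terms are assumptions made for illustration, and the time unit of the profile is assumed to match the unit of β.

```python
import math


def charge_loss(profile, T, beta, terms=10):
    """Apparent charge lost by time T under the Rakhmatov analytical model.

    profile: list of (start, duration, current) constant-current intervals,
             all finishing no later than T.
    beta:    nonlinear factor; beta**2 has units of 1/time.
    terms:   number of series terms kept (the truncation is an approximation).
    """
    sigma = 0.0
    for start, duration, current in profile:
        end = start + duration
        series = 0.0
        for m in range(1, terms + 1):
            c = (beta * m) ** 2
            series += (math.exp(-c * (T - end)) - math.exp(-c * (T - start))) / c
        # duration * current is the charge drawn; the series adds the
        # temporarily unavailable charge caused by the nonlinear effects.
        sigma += current * (duration + 2.0 * series)
    return sigma


# Hypothetical profile: two back-to-back tasks.
profile = [(0.0, 1.0, 300.0), (1.0, 1.0, 100.0)]
delivered = sum(d * i for _, d, i in profile)   # charge actually drawn
loss = charge_loss(profile, T=2.0, beta=0.574)  # model value, above `delivered`
```

The gap between `loss` and `delivered` is the part of the charge loss that depends on the shape of the discharge profile, which is what task partitioning and scheduling can influence.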
The improvement of our method comes from three aspects. First, our method treats task partitioning and scheduling as a joint optimization problem. As mentioned in Sect. 4, task partitioning strongly affects the discharge current distribution, but neither LPA nor NFA takes this into account. Although the BES algorithm can optimize the discharge current distribution during scheduling, its effect is limited by the task partitioning results. In contrast, our method optimizes task partitioning and scheduling jointly. Second, the nonlinear features of the battery are inherently considered in both task partitioning and scheduling. In our method, all feasible task partitioning and scheduling schemes are generated first. Because the optimization problem is modeled jointly, each feasible scheme is a joint partitioning-scheduling scheme. Next, the charge loss of each feasible scheme is evaluated with the Rakhmatov battery model, so both task partitioning and task scheduling are evaluated by the nonlinear model. Finally, the optimal solution is the scheme with minimal charge loss. Therefore, both the task partitioning and the scheduling in the optimal solution must be battery-efficient. In other words, our method uses the nonlinear features as the criterion for judging both the task partitioning and the task scheduling scheme. In contrast, although the BES algorithm is also based on the nonlinear battery model, the tasks generated by LPA and NFA are not battery-efficient, which constrains the effect of BES. Third, the energy consumption model in our method is more precise. BES does not model the configuration, data loading/storing and execution of the PE array separately, which results in a rough energy model.
We also test the performance on a real 1200 mAh lithium-ion battery. Since the charge loss of a real battery is hard to measure directly, we can only use an indirect method. Instead of measuring the battery capacity used, we measure how long the battery survives after a number of executions. Although this gives only relative results, it still offers some insight. Specifically, after one hundred million executions, we discharge the battery at a constant rate of 500 mA until it is exhausted. The remaining discharge time is shown in Table 6. On average, our method is 20.2% better than LPA+BES and 5.7% better than NFA+BES. The better performance again comes from the consideration of the battery's nonlinear characteristics and the joint optimization of task partitioning and scheduling. Note that the numeric results depend strongly on the battery capacity and the constant discharge rate used in the experiment. With a different battery capacity and discharge rate, the improvement may be much higher than the results in Table 6.
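The selection step described earlier in this section, in which every feasible joint scheme is scored with the battery model and the one with minimal charge loss is kept, can be illustrated with a small sketch. It is a simplified illustration under stated assumptions, not the proposed task-mapping algorithm: the candidate schemes, their names, the deadline, and the helper `pick_scheme` are hypothetical, each scheme is reduced to the discharge profile it produces, and the series of the model is truncated to ten terms.

```python
import math


def charge_loss(profile, T, beta, terms=10):
    """Rakhmatov-model charge loss at time T for (start, duration, current) intervals."""
    total = 0.0
    for start, duration, current in profile:
        end = start + duration
        series = sum(
            (math.exp(-(beta * m) ** 2 * (T - end))
             - math.exp(-(beta * m) ** 2 * (T - start))) / (beta * m) ** 2
            for m in range(1, terms + 1)
        )
        total += current * (duration + 2.0 * series)
    return total


def pick_scheme(candidates, deadline, beta):
    """Return the name of the feasible scheme with minimal charge loss."""
    feasible = {
        name: prof for name, prof in candidates.items()
        if all(start + duration <= deadline for start, duration, _ in prof)
    }
    return min(feasible, key=lambda name: charge_loss(feasible[name], deadline, beta))


# Hypothetical schemes that draw the same charge with different current profiles.
candidates = {
    "high_current_first": [(0.0, 1.0, 300.0), (1.0, 1.0, 100.0)],
    "low_current_first": [(0.0, 1.0, 100.0), (1.0, 1.0, 300.0)],
    "misses_deadline": [(0.0, 1.5, 300.0), (1.5, 1.0, 100.0)],
}
best = pick_scheme(candidates, deadline=2.0, beta=0.574)
```

In this toy case the two feasible schemes draw the same charge, and the model prefers the one that places the high-current task first, because charge made unavailable early has more time to recover before the deadline.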
Influence of Nonlinear Factor
To evaluate the influence of the battery's nonlinear factor, we test the performance with two more typical values of β, 0.75 and 0.273. These two factors are also extracted from real batteries [20]. Tables 7 and 8 show the comparison of battery charge loss at β = 0.75 and β = 0.273. The data are the cumulative charge loss over 5000 executions of each test case. For both values of β, our method outperforms LPA+BES and NFA+BES. Together with the results in Table 5 (the case of β = 0.574), we conclude that the proposed method reduces energy consumption greatly regardless of whether β is high or low. Therefore, our method is applicable across a wide range of battery types and nonlinear factors.
Further analysis of the results in Tables 5, 7 and 8 shows that the smaller β is, the more the battery charge loss is reduced. This is consistent with the nonlinear features of the battery: a smaller β means stronger nonlinearity, so the room for optimization is also larger. The proposed method takes the nonlinear features fully into account, so it achieves better results when β is smaller.
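This trend can be read off the battery model itself. The block below assumes the commonly cited form of the Rakhmatov model [5], with task k drawing a constant current I_k from time t_k for a duration Δ_k; these symbol names are assumptions and may differ from the notation used in the problem formulation.

```latex
\sigma(T) = \sum_{k} I_k \left[ \Delta_k
  + 2 \sum_{m=1}^{\infty}
    \frac{e^{-\beta^2 m^2 (T - t_k - \Delta_k)} - e^{-\beta^2 m^2 (T - t_k)}}
         {\beta^2 m^2} \right]

% Each series term is the integral of a decaying exponential:
\frac{e^{-\beta^2 m^2 a} - e^{-\beta^2 m^2 b}}{\beta^2 m^2}
  = \int_{a}^{b} e^{-\beta^2 m^2 x}\, dx ,
  \qquad a = T - t_k - \Delta_k , \quad b = T - t_k , \quad 0 \le a < b
```

The integrand decreases as β grows, so every series term shrinks with larger β and vanishes as β tends to infinity, leaving only the ideal charge Σ I_k Δ_k. For smaller β, a larger share of σ(T) depends on when and at what current each task runs, which is the share that partitioning and scheduling can reduce.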
Influence of Time Constraint
To evaluate the influence of the time constraint, we test the performance with loose time constraints (large slack time). Table 9 shows the time constraints set for the experiments. We set the time constraints as follows: 1) we use the raw execution time of each task as the baseline, which means there is no slack time; 2) time constraint 1 is extracted from real-time decoding of H.264 video, which means each task must be executed before the deadline to ensure real-time decoding; 3) to test the performance with very large slack time, time constraint 2 is set to 5 times time constraint 1. Tables 10 and 11 show the comparison of battery charge loss with time constraints 1 and 2. The data are the cumulative charge loss over 5000 executions of each test case. With both time constraints, our method outperforms LPA+BES and NFA+BES. Together with the results in Table 5 (the case of no slack time), we conclude that the proposed method reduces energy consumption greatly regardless of whether the time constraint is tight or loose.
Further analysis of the results in Tables 5, 10 and 11 shows that, with a loose time constraint, the battery charge loss is reduced more, and the improvement of our method increases as the slack time is prolonged. This is mainly due to the recovery effect of the battery: a large slack time means more opportunities to recover battery charge, so the proposed method is more effective at reducing battery charge loss. Even with no slack time, however, the proposed method can still adjust the discharge current distribution through task partitioning and scheduling, which reduces the battery charge loss. Therefore, our method is applicable under both tight and loose time constraints.
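The recovery effect behind this trend can be illustrated with a short sketch that evaluates one fixed discharge profile at increasingly loose deadlines. It assumes the commonly cited form of the Rakhmatov model with the series truncated to ten terms; the profile, the deadline values, and the function name `charge_loss` are hypothetical and are not the time constraints of Table 9.

```python
import math


def charge_loss(profile, T, beta, terms=10):
    """Rakhmatov-model charge loss at time T for (start, duration, current) intervals."""
    total = 0.0
    for start, duration, current in profile:
        end = start + duration
        series = sum(
            (math.exp(-(beta * m) ** 2 * (T - end))
             - math.exp(-(beta * m) ** 2 * (T - start))) / (beta * m) ** 2
            for m in range(1, terms + 1)
        )
        total += current * (duration + 2.0 * series)
    return total


# Hypothetical profile that finishes at t = 2; time units must match beta.
profile = [(0.0, 1.0, 300.0), (1.0, 1.0, 100.0)]
delivered = sum(d * i for _, d, i in profile)

# Deadlines with growing slack after the last task has finished.
deadlines = [2.0, 3.0, 5.0, 10.0]
losses = [charge_loss(profile, T, beta=0.574) for T in deadlines]
```

The values in `losses` fall as the deadline moves out and approach `delivered` from above: the idle time lets charge that was temporarily unavailable become available again. A scheduler that has slack can place this idle time where it recovers the most charge.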
Conclusion
In this paper, we proposed a battery-aware task-mapping method for battery-powered reconfigurable computing systems. The nonlinear characteristics of the battery are taken into account. The proposed method optimizes task partitioning and scheduling jointly to minimize energy consumption. Experimental results show that the proposed task-mapping method can greatly reduce energy consumption and improve battery lifetime.
