In hardware/software (HW/SW) co-design, HW/SW partitioning is the most important step because it determines which components are implemented in hardware and which are implemented in software. Since most HW/SW partitioning problems are NP-hard, heuristic methods have to be used to solve them, especially for large problems. GPU-based heuristic methods are a promising way to reduce the run time of HW/SW co-design. However, the existing methods cannot deal with very large embedded applications because of GPU resource limitations. This paper presents a method that overcomes the GPU resource limitations for very large partitioning while keeping a reasonable run time. First, at the stage of computing the costs of the candidates, we propose a fast method of 2-flipping computing for very large HW/SW co-design. The method is general and handles both odd and even numbers of nodes. More importantly, it avoids double-precision arithmetic units, which are scarce resources in GPU architectures. Second, since the GPU is constrained by memory limitations and the costs of the candidates cannot be stored directly in the GPU's global memory, we present a time-space tradeoff strategy to break the memory limitation for very large HW/SW partitioning. In this way, the subsequent steps can run within the GPU's memory limits. Third, an in-place removal of infeasible solutions is proposed to halve the global memory overhead when the neighborhood is compacted. Fourth, when evaluating the tabu status of feasible candidates, we present a bitwise representation of tabu status to minimize the transfer overhead. Finally, we conduct a number of experiments. The results show that the proposed 2-flipping method works correctly with single-precision data types. The results also demonstrate that the proposed approach expands the number of nodes of the task graph from 10,000 to 30,000 under the limitation of the GPU's global memory of 6 GB. The correlations between compression intensity and solution quality are analyzed to ensure the fairness and soundness of our method. Our work is general and can provide guidance for other applications.
Introduction
An embedded system generally consists of a software unit and a hardware unit. The software unit usually refers to a general-purpose processor. The hardware unit refers to a Field Programmable Gate Array (FPGA) or an Application-Specific Integrated Circuit (ASIC). When the system is implemented in hardware, it is significantly faster and more power-efficient, but the cost is very high. When the system is implemented on a general-purpose processor, the cost is relatively small, but the system is slower and consumes more power. Therefore, an important issue is to obtain an optimal trade-off between cost, performance and power.
In hardware/software co-design, hardware/software partitioning (HW/SW partitioning) is an essential step because it determines which components are implemented in hardware and which are implemented in software. The remarkable advantage of HW/SW partitioning is that it can improve the overall performance of modern embedded systems (Teich, 2012). However, as the architecture of the target embedded system becomes increasingly complex, the process of hardware/software partitioning becomes problem-dependent. To focus on the essence of HW/SW partitioning and to design an algorithm that can be used in large-scale projects, Arato et al. did not aim at partitioning for a given architecture, nor did they present a complete co-design environment (Arato, et al., 2003). Instead, HW/SW partitioning was given a more theoretical description. Specifically, the application to be partitioned was modeled as an undirected communication graph. Based on this model, Arato et al. distinguished two versions of the partitioning problem. One can be solved in polynomial time, while the other is NP-hard in the strong sense (Arato, et al., 2005).
From the perspective of the algorithm, there are two categories of algorithms for HW/SW partitioning: exact algorithms and heuristic methods. An exact algorithm can obtain an exact solution for small problems. However, as the problem size grows, the solution space of HW/SW partitioning increases exponentially, so it is impractical to find the exact solution in a reasonable time frame. Heuristic methods have become popular approximate alternatives because they can obtain good-quality solutions within a limited computing time (Hong-Seok and Nguyen, 2016), (Karuno and Saito, 2017).
Heuristic methods for HW/SW partitioning mainly include various intelligent optimization algorithms, such as the genetic algorithm (Arato, et al., 2005), (Trindade and Cordeiro, 2016), (Janakiraman and Kumar, 2014), the ant colony algorithm (Wang, Gong and Kastner, 2006), (Ferrandi, et al., 2013), (Zhou, He and Qiu, 2017), artificial bees (Koudil, et al., 2007), particle swarm optimization (Abdelhalim and Habib, 2011), (Yan, et al., 2017), (Shimizu, Sakaguchi and Miura, 2014), simulated annealing (López and López, 2003), (Henkel and Ernst, 2001), the artificial immune algorithm (Zhang, et al., 2008), tabu search (Wiangtong, Cheung and Luk, 2002), (Wu, et al., 2013) and hybridizations of these methods (Yan, He and Hou, 2017), (Li, et al., 2014), (Lin, Zhu and Ali, 2014), (Jiang, et al., 2012).
In recent years, the GPU has become a popular parallel computing platform with low cost and low power consumption (Tan and Ding, 2016). Additionally, GPUs are available on very general PC systems (Owens, et al., 2008). In our previous work (Hou, et al., 2016), an adaptive neighborhood Tabu search on GPU, named GANTS, was proposed for HW/SW partitioning. Experiments showed that its solution quality and run time outperformed the state-of-the-art Tabu search for the same problem (Wu, et al., 2013). However, the previous approach is constrained by GPU resources. First, when the problem becomes very large (more than about 4000 nodes in the task graph), the previous method obtains correct 2-flipping candidates only in double-precision arithmetic. Second, when the number of nodes in the task graph exceeds 15,000, the previous method runs out of the GPU's global memory. These limitations motivate this work. This paper presents a method that overcomes the GPU resource limitations with a reasonable run time compared to existing methods. The main contributions of this manuscript are a fast method of 2-flipping computing for very large HW/SW partitioning; a novel neighborhood compression strategy at the block and warp levels to reduce the size of a very large neighborhood; an in-place removal of infeasible solutions to halve the overhead of global memory; and a bitwise representation of tabu status to minimize the transfer overhead. The experiments confirm the effectiveness of the proposed method.
The remainder of this manuscript is organized as follows. Section 2 introduces the background of our work. In section 3, we describe our proposed method in detail. In section 4, the experiments test the effectiveness of our proposed method. Finally, section 5 discusses the conclusions and future work.
Background
HW/SW partitioning model
Formally, the task to be partitioned is represented as an undirected graph G = (V, E) with cost functions s, h: V → R+ and c: E → R+. V = {v1, v2, ..., vn} is the set of task nodes. Each node vi has a hardware cost h(vi) and a software cost s(vi). E is the set of edges between the nodes. The weight c(vi, vj) of an edge is the communication cost incurred when its two adjacent nodes are implemented separately (one in hardware and the other in software). P = {VH, VS} is called a HW/SW partitioning if it satisfies VH ∩ VS = ∅ and VH ∪ VS = V. Accordingly, the edge set of P is defined as EP = {(vi, vj) | vi ∈ VH, vj ∈ VS or vi ∈ VS, vj ∈ VH}.
As described in (Arato, et al., 2005), (Wu, Srikanthan and Chen, 2010), a partition is characterized by three metrics: hardware cost HP, software cost SP, and communication cost CP. They are formulated as follows.
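Consistent with the definitions of VH, VS and EP above, the three metrics take the usual additive form. The notation below is a reconstruction from those definitions, and the numbering (1) to (3) is assumed from the later references to these equations:

$$H_P = \sum_{v_i \in V_H} h(v_i) \tag{1}$$

$$S_P = \sum_{v_i \in V_S} s(v_i) \tag{2}$$

$$C_P = \sum_{(v_i, v_j) \in E_P} c(v_i, v_j) \tag{3}$$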
Hence, the partitioning problem is defined as follows. Problem P. Given a graph G with the cost functions s, h and c, and a bound R ≥ 0, find a hardware/software partitioning P with SP + CP ≤ R that minimizes HP.
Problem formulation
In the n-dimensional space {0,1}^n, let x = (x1, x2, ..., xn). Here, x denotes a solution of the problem P (a partition of the graph G with n nodes). xi = 1 (xi = 0) indicates that the node vi is assigned to software (hardware), 1 ≤ i ≤ n. Based on equations (1) to (3), the software cost S(x) and the hardware cost H(x) are formulated as
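The communication cost C(x) sums the weights of the edges whose two endpoints are assigned to different sides of the partition.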
According to the definition, the problem is treated as a constrained optimization problem:
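With s_i = s(v_i), h_i = h(v_i) and c_ij = c(v_i, v_j), and with x_i = 1 denoting software, the costs and the constrained problem can be written in the following form. This is a sketch derived from the definitions in this section; the exact notation and the numbering (4) to (7), taken from the later references in the text, are assumptions:

$$S(x) = \sum_{i=1}^{n} s_i \, x_i \tag{4}$$

$$H(x) = \sum_{i=1}^{n} h_i \, (1 - x_i) \tag{5}$$

$$C(x) = \sum_{(v_i, v_j) \in E} c_{ij} \, |x_i - x_j| \tag{6}$$

$$\min_{x \in \{0,1\}^n} H(x) \quad \text{subject to} \quad S(x) + C(x) \le R \tag{7}$$

The term |x_i − x_j| equals 1 exactly when the edge (v_i, v_j) crosses the partition, which matches the definition of EP.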
By treating the problem P as a variation of the standard 0-1 knapsack problem, the authors in (Wu, et al., 2010), (Wu, et al., 2013) successively proposed alg-new3 and HEUR to obtain approximate solutions. In their experiments, HEUR obtained better solution quality than alg-new3. Furthermore, the authors in (Wu, et al., 2013) proposed TABU to refine the solution obtained by HEUR. However, when the problem size became large, TABU became time-consuming. Therefore, in previous work (Hou, et al., 2016), we proposed an adaptive neighborhood tabu search on GPU (GANTS) to optimize the initial solution obtained by HEUR. The experimental results showed that GANTS outperformed TABU in (Wu, et al., 2013). However, GANTS cannot deal with very large HW/SW partitioning, as discussed in detail below.
Proposed method
In previous work, we proposed GANTS for HW/SW partitioning. In every iteration of that method, we considered all the 2-flipping candidates of the current solution to form the neighborhood. For each candidate, the hardware, software and communication costs were obtained by equations (4) to (6). After that, each candidate was checked against the constraint R in (7). Candidates that did not satisfy the constraint were removed by compacting the neighborhood. For the remaining candidates, we further checked their tabu status. Our goal was to select the non-tabu candidate with the smallest hardware cost. After that, the tabu list, the tabu status array, the current solution and the global hardware cost were updated. When porting the whole procedure to the GPU, we further presented a GPU-based representation of the task graph, a GPU thread-candidate mapping method, a GPU-based removal of infeasible candidates and a GPU-based tabu evaluation. Although GANTS is much more efficient than the state-of-the-art tabu search for HW/SW partitioning in (Wu, et al., 2013), several shortcomings make it unsuitable for very large HW/SW partitioning. Fig. 1 shows the framework of the method proposed in this paper. The largest difference from the previous method is that, after the costs of the candidates are obtained, a neighborhood compression strategy is invoked to reduce the size of the neighborhood.
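The per-iteration flow described above can be illustrated with a small sequential sketch. It is not the GANTS kernel code: costs are recomputed from scratch for readability, and keying the tabu set by position pairs is an assumption made only for this example, as are all function and variable names.

```python
from itertools import combinations

def costs(x, h, s, edges):
    """Full cost evaluation; x[i] == 1 means software, x[i] == 0 means hardware."""
    H = sum(h[i] for i in range(len(x)) if x[i] == 0)
    S = sum(s[i] for i in range(len(x)) if x[i] == 1)
    C = sum(c for (i, j, c) in edges if x[i] != x[j])   # only cut edges cost
    return H, S, C

def tabu_iteration(x, h, s, edges, R, tabu):
    """Return ((i, j), hardware_cost) of the best admissible 2-flip, or None."""
    best = None
    for i, j in combinations(range(len(x)), 2):   # all 2-flipping candidates
        y = list(x)
        y[i] ^= 1
        y[j] ^= 1
        H, S, C = costs(y, h, s, edges)
        if S + C > R:        # infeasible candidate: removed from the neighborhood
            continue
        if (i, j) in tabu:   # tabu-active candidate: skipped
            continue
        if best is None or H < best[1]:
            best = ((i, j), H)
    return best

# Tiny 4-node path graph, starting from the all-hardware solution.
h = [5, 3, 8, 2]
s = [1, 2, 3, 1]
edges = [(0, 1, 1), (1, 2, 1), (2, 3, 1)]
x = [0, 0, 0, 0]
print(tabu_iteration(x, h, s, edges, R=5, tabu=set()))
```

In this toy instance, moving nodes 2 and 3 to software is the feasible move with the smallest hardware cost; once that pair is marked tabu, the next best feasible pair is selected instead.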
Neighborhood generation
We choose to sequentially generate all possible 2-flipping candidates from the current solution beforehand, as shown in Fig. 2. Since the dimension of a solution equals the number of nodes n of the given task graph, the size of the neighborhood is n×(n-1)/2.
Fig. 2. Neighborhood Generation
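To make the growth of the neighborhood concrete, the following sketch enumerates the 2-flipping candidates of a small solution and evaluates n×(n-1)/2 for the node counts used later in the experiments. The function names are illustrative only.

```python
from itertools import combinations

def neighborhood_size(n):
    """Number of 2-flipping candidates of a solution with n nodes."""
    return n * (n - 1) // 2

def two_flip_neighborhood(x):
    """Yield (i, j, candidate) for every unordered pair of positions i < j."""
    for i, j in combinations(range(len(x)), 2):
        y = list(x)
        y[i] ^= 1   # flip position i
        y[j] ^= 1   # flip position j
        yield i, j, y

current = [0, 1, 1, 0, 1, 0]
neighborhood = list(two_flip_neighborhood(current))
print(len(neighborhood), neighborhood_size(len(current)))
for n in (10_000, 20_000, 30_000):
    print(n, neighborhood_size(n))
```

The quadratic growth is what makes storing one cost record per candidate problematic for very large task graphs.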

Neighborhood evaluation of incremental strategy
To efficiently obtain the hardware, software and communication costs for candidates in the neighborhood, the hardware, software and communication costs of the current solution can be reused. We assume that a new solution xnew is formed by flipping xi in the current solution xcurrent. Hence, the corresponding hardware cost H(xnew) can be formulated as
Likewise, the corresponding software cost S(xnew) can be formulated as
When xi is flipped, the communication cost changes as well. The change is influenced only by the nodes adjacent to vi. Therefore, based on the communication cost of the current solution, the communication cost of xnew can be obtained by Algorithm 1.
Algorithm 1. Communication cost of xnew
Set xnew to xcurrent; set C(xnew) to C(xcurrent);
By the above methods, the costs of candidates are efficiently obtained. Since two positions are flipped, the above methods are invoked twice for each candidate.
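The incremental evaluation can be sketched as follows. This is an illustrative reading of the description above rather than the original Algorithm 1: when xi flips from software to hardware, H grows by h(vi) and S shrinks by s(vi) (and vice versa), and each edge incident to vi either becomes cut or stops being cut. The adjacency-list layout and all names are assumptions.

```python
def full_costs(x, h, s, adj):
    """Reference evaluation from scratch; x[i] == 1 means software."""
    n = len(x)
    H = sum(h[i] for i in range(n) if x[i] == 0)
    S = sum(s[i] for i in range(n) if x[i] == 1)
    # every undirected edge appears twice in adj, so count it once via i < j
    C = sum(c for i in range(n) for j, c in adj[i] if i < j and x[i] != x[j])
    return H, S, C

def flip_one(x, costs, i, h, s, adj):
    """Flip x[i] in place and return the updated (H, S, C)."""
    H, S, C = costs
    if x[i] == 1:            # software -> hardware
        H += h[i]
        S -= s[i]
    else:                    # hardware -> software
        H -= h[i]
        S += s[i]
    x[i] ^= 1
    for j, c in adj[i]:      # only edges incident to v_i can change
        if x[i] != x[j]:
            C += c           # edge (i, j) has just become cut
        else:
            C -= c           # edge (i, j) is no longer cut
    return H, S, C

def flip_two(x, costs, i, j, h, s, adj):
    """Evaluate a 2-flipping candidate by invoking the one-flip update twice."""
    y = list(x)
    costs = flip_one(y, costs, i, h, s, adj)
    costs = flip_one(y, costs, j, h, s, adj)
    return y, costs

h = [5, 3, 8, 2]
s = [1, 2, 3, 1]
adj = [[(1, 2)], [(0, 2), (2, 3)], [(1, 3), (3, 4)], [(2, 4)]]   # path 0-1-2-3
x = [0, 0, 0, 0]
base = full_costs(x, h, s, adj)
print(flip_two(x, base, 2, 3, h, s, adj))
```

Each update touches only the neighbors of the flipped node, so the cost of evaluating a candidate depends on node degree rather than on the size of the graph.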
A fast and general method of 2-flipping without involving double-precision arithmetic
Using an incremental strategy, the threads in the GPU can compute the hardware, software and communication costs of each candidate in the neighborhood in Single Program, Multiple Data (SPMD) fashion at each iteration. In a GPU thread grid, let B_ID be the index of a thread block, B_DIM the dimension of a thread block, and T_ID the index of a thread within its thread block. The index of each candidate (neibIdx) is obtained by the following equation
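In CUDA terms this is the usual flattened global thread index. Assuming a one-dimensional grid and one-dimensional thread blocks, it reads:

$$\mathrm{neibIdx} = B\_ID \times B\_DIM + T\_ID$$

For example, with B_DIM = 256, thread T_ID = 5 of block B_ID = 3 evaluates candidate 3 × 256 + 5 = 773.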
Next, to form a candidate in the GPU, two flipped positions are needed. In previous work, we utilized the method in (Van-Luong, Melab and Talbi, 2013). Although the method can theoretically generate all 2-flipping candidates regardless of the number of nodes, its effectiveness depends on two factors: the data type and the number of nodes. The method is only reliable when the data type is double precision. If the data type is single precision and the number of nodes in the task graph is more than 4000 (to be exact, 4609), the method is prone to out-of-range errors that produce partially incorrect 2-flippings. A similar method can be found in (Rocki and Suda, 2012), which has the same problem when the number of nodes in the task graph is more than 4095.
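One plausible reason for thresholds near 4000 nodes (an interpretation, not a statement taken from the cited papers) is that single precision has a 24-bit significand, so not every integer above 2^24 = 16,777,216 is exactly representable, and quantities of order n² reach that size once n is about 4096. The sketch below only demonstrates the representability limit; it does not reproduce the cited index formulas.

```python
import struct

def to_float32(value):
    """Round a number to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", float(value)))[0]

limit = 2 ** 24                       # 16,777,216
print(to_float32(limit) == limit)     # still exact
print(to_float32(limit + 1))          # rounded: the odd integer cannot be stored
print(4096 * 4096 == limit)           # n * n reaches the limit at n = 4096
```

Any index computation that routes such values through single-precision arithmetic, for instance under a square root, can therefore land on a neighboring integer and return a wrong position.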
Hence, devising an efficient 2-flipping method is an important issue. In this paper, we present a fast and general computing model of 2-flipping without involving double-precision arithmetic in the context of HW/SW partitioning.
Computation model
The computation model of the proposed 2-flipping is shown in equation (11).
Compared with the computation models in (Van-Luong, Melab and Talbi, 2013) and (Rocki and Suda, 2012), the proposed computation model is simple and requires no root calculation. Therefore, our model does not involve the time-consuming arithmetic of a square root (Avril, Gouranton and Arnaldi, 2012). Importantly, our model can deal with a very large neighborhood by using only single-precision data, while the other two models have to use double-precision data to reach the same range.
Therefore, the proposed 2-flipping computation model is significant in two respects.
(i) From the view of hardware resources, even in the latest generations of GPUs, the number of double-precision arithmetic units is less than that of single-precision arithmetic units (Nvidia, 2016). Hence, our computation model avoids consuming scarce double-precision arithmetic units in the GPU.
(ii) Because it avoids the square-root calculation, the proposed computation model is faster than the previous methods.
Thread-space mapping
This paper presents a new kind of thread-space mapping. As shown in Fig. 3, we compare our mapping (mapping 4) with previous mappings for a task graph with 6 nodes, where the size of the neighborhood is 15.
Fig. 3. Thread-space mappings: mapping 1 in (Rocki and Suda, 2012), mapping 2 in (Van-Luong, Melab and Talbi, 2013), mapping 3 in (Zhou, He and Qiu, 2016), and our mapping 4.
At first glance, the shape of the candidates in our mapping appears to be neither an upper triangle nor a lower triangle. However, due to the symmetry of the triangle domain, the pairs (i, j) and (j, i) are equivalent. Hence, our mapping still covers all of the correct results.
Although Zhou et al. presented a relatively simple method (Zhou, He and Qiu, 2016) that does not involve double-precision arithmetic, it loses some candidates if the number of nodes is even, and it has to repair this problem with an additional operation. As shown in Fig. 3, mapping 1, mapping 2 and our mapping (mapping 4) correctly obtain all 15 2-flipping positions. By contrast, mapping 3 misses two 2-flipping candidates.
Therefore, our mapping is general and can deal with both odd and even numbers of nodes with correct results.
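One integer-only mapping with these properties can be sketched as follows. It is given purely as an illustration and is not claimed to be the paper's equation (11): candidate idx is mapped to a position i and a circular offset, so some pairs come out with i > j, which is harmless because (i, j) and (j, i) denote the same 2-flip. The function names are hypothetical.

```python
def two_flip_positions(idx, n):
    """Map idx in [0, n*(n-1)//2) to two distinct positions using integers only."""
    offset = idx // n + 1        # circular distance between the two positions
    i = idx % n
    j = (i + offset) % n
    return i, j

def covers_all_pairs(n):
    """Check that every unordered pair of positions is produced exactly once."""
    size = n * (n - 1) // 2
    seen = {frozenset(two_flip_positions(idx, n)) for idx in range(size)}
    return len(seen) == size and all(len(pair) == 2 for pair in seen)

print([two_flip_positions(idx, 6) for idx in range(15)])
print(all(covers_all_pairs(n) for n in range(2, 200)))
```

For odd n the neighborhood consists of (n-1)/2 complete rows of n pairs; for even n the last row is only half filled, and the bound n×(n-1)/2 on idx cuts it off at exactly the right place, so no repair step is needed.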
A neighborhood compression strategy to break the limitation of GPU's global memory
The data structure of a candidate consists of an index, software cost, hardware cost, communication cost and feasibility. For the very large HW/SW partitioning considered in this study, the candidates' costs cannot be stored directly in the GPU's global memory because the global memory is limited. Therefore, a challenging issue for very large HW/SW partitioning is how to store the neighborhood costs in the physical memory of a resource-constrained GPU device.
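A rough estimate shows the scale of the problem. The sketch assumes that each of the five fields occupies 4 bytes, that is, 20 bytes per candidate; the actual record layout is not given in the text, so the figures are indicative only. The optional compression_ratio parameter anticipates the reduction factor discussed later in this section.

```python
BYTES_PER_CANDIDATE = 5 * 4   # assumption: index, three costs, feasibility, 4 bytes each
GIB = 2 ** 30

def neighborhood_bytes(n, compression_ratio=1):
    """Bytes needed to store one record per (possibly compressed) candidate."""
    candidates = n * (n - 1) // 2
    stored = -(-candidates // compression_ratio)   # ceiling division
    return stored * BYTES_PER_CANDIDATE

for n in (10_000, 15_000, 20_000, 30_000):
    print(n, round(neighborhood_bytes(n) / GIB, 2), "GiB")
```

Under this assumption the uncompressed neighborhood of a 30,000-node graph alone would exceed a 6 GB global memory, before counting the task graph, the tabu status and any output buffer.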
Neighborhood compression with time-space tradeoff
In this paper, we take advantage of GPU architecture and present a time-space tradeoff strategy to break memory limitations for very large HW/SW partitioning.
In sequential algorithms, the time-space tradeoff is an important issue (Borodin and Cook, 1980). In our approach, the issue becomes the tradeoff between parallel computing time and global memory overhead. In a sequential algorithm, the time complexity is typically reduced by reasonably increasing the memory overhead. In contrast, we reduce the overhead of neighborhood cost storage by increasing the workload of the processing units. We call this procedure neighborhood compression. This strategy makes sense for the following reasons.
(i) A GPU consists of hundreds of single-precision arithmetic units working in parallel, which ensures that the extra work is finished within an acceptable time.
(ii) In addition to global memory, the GPU has on-chip shared memory whose access speed is close to that of registers. Before the final results are written into global memory, the temporary results are stored in shared memory and are accessed by the scalar processors of a single streaming multiprocessor.
(iii) Our strategy is general because it relies only on the hierarchical memory organization of the GPU.
Neighborhood compression at block level and warp level
On the GPU, the threads in a grid are organized into equally sized thread blocks. Within a thread block, every 32 consecutive threads form a warp. Therefore, we compress the neighborhood at two levels.
At the first level, the neighborhood is compressed at the thread block level. The neighborhood is split into several parts according to the thread block size. After performing compression at the thread block level, the size of the neighborhood is reduced. In current GPUs, the maximum size of a thread block is 1024. Thus, the compression ratio between the size of original neighborhood and that of reduced neighborhood can be up to 1024.
At the second level, the neighborhood is compressed at the thread warp level. The neighborhood is split into several parts according to the warp size, which is fixed at 32. After performing compression at the warp level, the compression ratio between the size of original neighborhood and that of reduced neighborhood is 32.
The detailed process of neighborhood compression at the thread block or warp level is illustrated in Fig. 4. Since HW/SW partitioning is treated as a constrained optimization problem, there exist both feasible and infeasible candidates in the neighborhood. Compression involves the following three cases.
(i) Comparison between feasible candidates. In this case, the feasible candidate with the smaller hardware cost is the winner.
(ii) Comparison between feasible candidate and infeasible candidate. In this case, the feasible candidate is the winner, even though the infeasible candidate has smaller hardware cost.
(iii) Comparison between infeasible candidates. In this case, the first infeasible candidate is the winner, meaning we do nothing.
Fig. 4. An illustration of neighborhood compression. We assume that the size of the neighborhood is 12 and that of the warp or block is 4. By compression, the size of the neighborhood is reduced to 3.
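The three comparison cases can be sketched sequentially as follows, using the sizes of Fig. 4 (12 candidates, groups of 4). On the GPU the winner of each group would be found by a parallel reduction within a warp or thread block; the tuple layout (index, hardware cost, feasible) and the cost values are made up for the example.

```python
def better(a, b):
    """Return the winner of two candidates given as (index, hardware_cost, feasible)."""
    if a[2] and b[2]:
        return a if a[1] <= b[1] else b   # case (i): smaller hardware cost wins
    if a[2] != b[2]:
        return a if a[2] else b           # case (ii): the feasible candidate wins
    return a                              # case (iii): keep the first candidate

def compress(neighborhood, group_size):
    """Keep one winner per group of group_size consecutive candidates."""
    winners = []
    for start in range(0, len(neighborhood), group_size):
        group = neighborhood[start:start + group_size]
        winner = group[0]
        for cand in group[1:]:
            winner = better(winner, cand)
        winners.append(winner)
    return winners

neighborhood = [
    (0, 50, True), (1, 40, False), (2, 45, True), (3, 60, True),
    (4, 10, False), (5, 5, False), (6, 7, False), (7, 9, False),
    (8, 30, False), (9, 70, True), (10, 20, False), (11, 65, True),
]
print(compress(neighborhood, 4))
```

Only the winners need to be written to global memory, so the stored neighborhood shrinks by the group size, at the price of discarding the other candidates of each group.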
Hou, He, Zhou and Ai, Journal of Advanced Mechanical Design, Systems, and Manufacturing, Vol.11, No.5 (2017)
After neighborhood compression, a number of candidates are lost, which influences the final solution quality to some extent. Since the compression ratio is tied to the size of a single thread block, which is tunable, the question is how to choose an appropriate value. For our problem, the goal is to retain as large a neighborhood subset as possible. A compression ratio of 32 is enough for very large HW/SW partitioning. Therefore, priority is given to compressing the neighborhood at the warp level.
In-place removal of infeasible solutions
Although the original neighborhood is compressed, it still contains infeasible candidates that do not satisfy the constraint. In previous work, neighborhood compaction at the thread block level was adopted to remove the infeasible candidates (Hou, et al., 2016). However, we found that it consumed significantly more of the GPU's global memory, since both an input array and an output array have to be allocated.
In fact, during neighborhood compaction the original data of the input array are copied into temporary scratchpad memory, so the input array space becomes free. The final feasible candidates can therefore be written back into the input array. To remove infeasible candidates within the limits of the GPU's global memory, an in-place procedure is proposed. Fig. 5 shows the difference between the original implementation and the proposed in-place implementation.
Fig. 5. Abstract illustration of in-place removal of infeasible solutions
Compared with the previous GPU-based tabu search for HW/SW partitioning, the proposed in-place removal halves the GPU memory used by neighborhood compaction.
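The idea can be sketched sequentially as follows. Each block of the input is first copied to a scratch buffer (standing in for on-chip scratchpad memory), after which its feasible entries are written back to the front of the same array. A parallel GPU version would additionally need to compute the write offsets, for example with a prefix sum; that part is assumed and omitted here, and all names are illustrative.

```python
def compact_in_place(candidates, is_feasible, block_size):
    """Move feasible candidates to the front of the same list; return their count."""
    write = 0
    for start in range(0, len(candidates), block_size):
        scratch = candidates[start:start + block_size]   # copy of the current block
        for cand in scratch:
            if is_feasible(cand):
                # write never passes the block being read, so no unread data is lost
                candidates[write] = cand
                write += 1
    return write

data = [(0, True), (1, False), (2, True), (3, True), (4, False), (5, True)]
count = compact_in_place(data, lambda cand: cand[1], block_size=2)
print(count, data[:count])
```

Because the feasible candidates end up in the first count slots of the input array, no separate output array of the same size has to be allocated.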
Tabu list with bit-level representation of Tabu status
At the stage of tabu evaluation, the best candidate is chosen from the feasible candidates. To avoid traversing the tabu list for every candidate, a tabu status is used to check whether a candidate is tabu. If the candidate is tabu-active, its tabu status is set to 1; otherwise, it is set to 0.
Different from previous research (Hou, et al., 2016), this paper presents a bit-level representation of tabu status for HW/SW partitioning. With this representation, the tabu status of each candidate takes up only one bit. Hence, one byte can store the tabu status of eight consecutive candidates. Fig. 6 shows the difference between the byte-level and bit-level representations of tabu status. The bit-level representation has the following advantages.
(i) At each iteration, the tabu status is transferred to the device side through PCI-E, which incurs transfer overhead between the host side and the device side. The proposed bit-level representation minimizes this transfer overhead.
(ii) Meanwhile, when the neighborhood is very large, it takes up a considerable portion of the GPU's global memory to store the tabu status of all candidates, which becomes a serious and unavoidable issue. The proposed bit-level representation can minimize the overhead of global memory.
(iii) Specifically, when accessing the bit-level tabu status of a candidate with index id, its byte position and bit offset are given by id/8 and id%8. Its tabu status is written with the OR operator and read with the AND operator. This kind of bitwise operation is computationally efficient.
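The bit-level access pattern can be sketched as follows, with a bytearray standing in for the status buffer that would be transferred to the device. The clear operation (AND with the inverted mask) is not described in the text and is included as an assumption; the function names are illustrative.

```python
def make_tabu_status(num_candidates):
    """Allocate one bit per candidate, rounded up to whole bytes."""
    return bytearray((num_candidates + 7) // 8)

def set_tabu(status, idx):
    status[idx // 8] |= 1 << (idx % 8)               # OR writes the bit

def is_tabu(status, idx):
    return bool(status[idx // 8] & (1 << (idx % 8)))  # AND reads the bit

def clear_tabu(status, idx):
    status[idx // 8] &= ~(1 << (idx % 8)) & 0xFF     # assumed release operation

status = make_tabu_status(20)
set_tabu(status, 10)
print(len(status), is_tabu(status, 10), is_tabu(status, 9))
```

Compared with one byte per candidate, the buffer sent over PCI-E at each iteration and kept in global memory is eight times smaller.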
Experiment
Our method includes both parallel and sequential procedures. The parallel procedures are implemented in CUDA C, and the sequential procedures are programmed in C++. The platform running the parallel procedures is an NVIDIA GTX 980Ti, which consists of 22 SMs with 128 SPs per SM. The clock frequency of each SP is 1.00 GHz. The size of the global memory is 6 GB. The platform running the sequential procedures is an Intel i7-4770 CPU with a 3.4 GHz clock frequency. The size of the main memory is 16 GB.
Micro-benchmark for 2-flipping
To test the effectiveness of the proposed 2-flipping, we compare it to the methods from (Van-Luong, Melab and Talbi, 2013) and (Rocki and Suda, 2012). We name them method 1 and method 2, respectively. The experimental configuration is listed as follows.
Accuracy test
Given the number of nodes n, we implement the three methods and count the number of correct results of each method. Calculation accuracy is utilized to illustrate the percentage of correct 2-flipping candidates. It is noteworthy that only an accuracy of 100 percent is valid. As shown in Table 2 , when the data type is single-precision, our method can correctly obtain all the 2-flipping candidates. 
Efficiency test
Next, we launch four GPU kernels: the single-precision and double-precision versions of method 1, method 2, and our method. The kernels only compute the 2-flipping positions of each candidate in parallel and do not access the global memory. Table 3 shows the run time of the methods on the GPU. Our method is the fastest.
Speed up
Let the double-precision version of method 1 be the baseline. Fig. 7 further shows the speedups. The speedup of our method is always higher than that of method 1 and method 2, which illustrates the speed advantage of our method more clearly.
Effectiveness of proposed strategy
Benchmark for very large HW/SW partitioning
This section tests our proposed neighborhood compression strategy and the in-place implementation of neighborhood compaction. We follow the same random graph generation method as in (Hou, et al., 2016). Table 3 shows the number of nodes n and the number of edges m in each task graph. The total size is given by 2×n + 3×m.

Name      n       m       Size
Random1   10000   10000   50000
Random2   10000   20000   80000
Random3   10000   30000   110000
Random4   15000   15000   75000
Random5   15000   30000   120000
Random6   15000   45000   165000
Random7   20000   20000   100000
Random8   20000   40000   160000
Random9   20000   60000   220000
Random10  25000   25000   125000
Random11  25000   50000   200000
Random12  25000   75000   275000
Random13  30000   30000   150000
Random14  30000   60000   240000
Random15  30000   90000   330000

Software costs were generated as uniform random numbers from the interval [1, 100]. Hardware costs were generated as random numbers from a normal distribution with expected value k·si and a given standard deviation, where si is the software cost of the given node. The value k has no algorithmic implications since it only corresponds to the choice of units for software and hardware cost (Arató, et al., 2005), (Wu, et al., 2010), (Wu, et al., 2013).
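A generator following these rules, together with the communication-cost rule stated in the next paragraph, might look like the sketch below. Several details are assumptions because the text does not specify them: the relative standard deviation rel_std, the clamping of hardware costs at zero, the choice of m distinct edges uniformly at random, and all names. It is not the generator of (Hou, et al., 2016).

```python
import random

def random_instance(n, m, rho, k=1.0, rel_std=0.1, seed=0):
    """Return (software costs, hardware costs, {(i, j): communication cost})."""
    rng = random.Random(seed)
    s = [rng.uniform(1, 100) for _ in range(n)]
    # normal distribution with mean k * s_i; spread and clamping are assumed
    h = [max(0.0, rng.gauss(k * si, rel_std * k * si)) for si in s]
    s_max = max(s)
    edges = {}
    while len(edges) < m:                     # requires m <= n * (n - 1) / 2
        i, j = rng.sample(range(n), 2)        # two distinct nodes
        edges[(min(i, j), max(i, j))] = rng.uniform(0, 2 * rho * s_max)
    return s, h, edges

s, h, edges = random_instance(n=50, m=100, rho=0.1)
print(len(s), len(h), len(edges), 2 * len(s) + 3 * len(edges))
```

The last printed value corresponds to the size measure 2×n + 3×m used in the table.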
Communication costs were generated as uniform random numbers from the interval [0, 2·ρ·smax], where smax is the highest software cost. Thus, the communication cost has an expected value of ρ·smax, and ρ is the so-called communication to computation ratio (CCR). ρ was taken as 0.1, 1 and 10, which correspond to the computation-intensive case, the intermediate case, and the communication-intensive case, respectively. R was randomly generated as a uniform random number from one of two intervals, and the two cases are indicated as R=low and R=high, respectively.

Tables 4 to 9 show the solution quality and run time for the 6 cases. In our method, the maximal iteration number is set to 2000. The results are averaged over 10 runs. In the tables, GANTS denotes the previous method. Alg-improve1 denotes the method improved by in-place compaction. We can see that GANTS cannot find a solution when the number of nodes in the task graph is beyond 15,000. When it is improved by the in-place implementation of neighborhood compaction, the number of nodes can be extended to 20,000. Alg-improve2 is the method improved by both the neighborhood compression and in-place compaction strategies. With it, the number of nodes can be further extended to 30,000. In this subsection, neighborhood compression at the warp level is utilized.

Table 4 (partial; only the rows from Random8 onward are available).

Name      GANTS               Alg-improve1        Alg-improve2
          Solution  Run time  Solution  Run time  Solution  Run time
Random8   -         -         780118    582       -         -
Random9   -         -         668497    816       -         -
Random10  -         -         -         -         855897    546
Random11  -         -         -         -         976980    924
Random12  -         -         -         -         951859    1272
Random13  -         -         -         -         927565    792
Random14  -         -         -         -         1196172   1320
Random15  -         -         -         -         1201804   1866

Table 5. Solution quality and run time (s) in the case of CCR=0.1, R=high.
Name      GANTS               Alg-improve1        Alg-improve2
          Solution  Run time  Solution  Run time  Solution  Run time
Random1   131695    204       -         -         -         -
Random2   147940    246       -         -         -         -
Random3   119630    300       -         -         -         -
Random4   319804    372       -         -         -         -
Random5   244863    516       -         -         -         -
Random6   345077    540       -         -         -         -
Random7   -         -         301023    756       -         -
Random8   -         -         389279    828       -         -
Random9   -         -         300049    1056      -         -
Random10  -         -         -         -         254889    564
Random11  -         -         -         -         341166    930
Random12  -         -         -         -         472566    1284
Random13  -         -         -         -         398476    810
Random14  -         -         -         -         370202    1362
Random15  -         -         -         -         567320    1890

[Table: solution quality and run time (s) for a further case; caption not recovered]

Name      GANTS               Alg-improve1        Alg-improve2
          Solution  Run time  Solution  Run time  Solution  Run time
Random1   227284    114       -         -         -         -
Random2   332330    168       -         -         -         -
Random3   368863    222       -         -         -         -
Random4   279117    270       -         -         -         -
Random5   450299    336       -         -         -         -
Random6   565814    438       -         -         -         -
Random7   -         -         386098    468       -         -
Random8   -         -         675966    546       -         -
Random9   -         -         724203    720       -         -
Random10  -         -         -         -         541473    546
Random11  -         -         -         -         768308    906
Random12  -         -         -         -         894923    1266
Random13  -         -         -         -         626963    786
Random14  -         -         -         -         957379    1320
Random15  -         -         -         -         1101130   1854

[Table: solution quality and run time (s) for a further case; caption not recovered]

Name      GANTS               Alg-improve1        Alg-improve2
          Solution  Run time  Solution  Run time  Solution  Run time
Random1   402405    90        -         -         -         -
Random2   449992    156       -         -         -         -
Random3   465409    216       -         -         -         -
Random4   642074    192       -         -         -         -
Random5   681091    300       -         -         -         -
Random6   693616    420       -         -         -         -
Random7   -         -         857922    324       -         -
Random8   -         -         890620    498       -         -
Random9   -         -         918104    684       -         -
Random10  -         -         -         -         1043127   528
Random11  -         -         -         -         1090704   900
Random12  -         -         -         -         1151236   1260
Random13  -         -         -         -         1252296   1326
Random14  -         -         -         -         1324254   1314
Random15  -         -         -         -         1387618   1854
Correlation between compression ratio and solution quality
Since the proposed compression strategy can reduce neighborhood size according to the block or warp size, it is necessary to check the correlation between the compression ratio and solution quality. We evaluate solution quality by error as follows.
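Given that the number of nodes is 20,000, we illustrate the difference between the solution quality of the proposed method and that of the previous method. Let the solution quality of the previous method be the baseline. Hence, the compression error is defined as

$$\mathrm{error} = \frac{H_{\mathrm{proposed}} - H_{\mathrm{previous}}}{H_{\mathrm{previous}}} \times 100$$

where H_proposed and H_previous denote the solution quality (hardware cost) obtained by the proposed method and by the previous method, respectively.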
In the six cases, we compare the solution quality under different compression intensities, as Tables 10 to 12 show. Positive values indicate that the solution quality of the previous method is better, while negative values indicate that the proposed method is better. The results show that although the neighborhood is compressed, the solution quality is not significantly worsened. However, when the compression ratio is 1024, the errors for n=20000, m=40000 and n=20000, m=60000 are significant. According to our strategy, compression at the warp level is enough for HW/SW partitioning.
Compared with previous methods
The analysis of the compression error shows that the proposed method does not worsen the solution quality significantly when compared to the previous GANTS. In this subsection, we further compare the run time of the proposed method with that of GANTS from (Hou, et al., 2016) on the same benchmark. To illustrate the advantage of the GPU-based methods, the run time of TABU from (Wu, et al., 2013) is also included. Fig. 8 shows the results for the 6 cases. GANTS and Alg-improve2 are significantly faster than TABU from (Wu, et al., 2013). In addition, for Alg-improve2, neighborhood compression does not increase the execution time significantly, so the method remains competitive.
Conclusions
This paper presents a GPU-based Tabu search for very large HW/SW partitioning with constraints on resource usage. The proposed method overcomes the GPU resource limitation for very large partitioning while retaining a reasonable run time. A number of experiments demonstrated the effectiveness of the proposed method.
Specifically, the contributions of this paper include the following: (1) At the stage of thread-candidate mapping, we propose a fast method of 2-flipping computing for very large HW/SW co-design; (2) After computing each candidate's costs, a novel strategy of neighborhood compression is proposed, and compression at two different levels is discussed; (3) At the stage of the removal of infeasible candidates, an in-place implementation is proposed to halve the global memory overhead; (4) To minimize the transfer overhead of the tabu status, a bitwise representation is proposed; (5) Finally, in the experiments, the effectiveness of the proposed method is verified from different aspects.
In future work, we will explore new models of HW/SW co-design and new acceleration methods on multi-core SIMD CPUs (Ouyang, et al., 2016), (Zhou, et al., 2017). We will also try to extend the proposed idea and method to other areas, such as large-scale data visualization and rendering (Chen, et al., 2017), (Inui M, et al., 2016), (Kim, Kyung and Lee, 2012), (Mandachi, Usuki and Miura, 2014), (Umezu, 2013), (Zhang, et al., 2017), large-scale cooperative editing of text and geometry in Computer-Supported Cooperative Work (Cheng, et al., 2016), (Lv, et al., 2016), (Nomaguchi, Tsutsumi and Fujita, 212), large-scale 3D model interchange and retrieval in CAD (Wu, et al., 2016), (Zhang, et al., 2016), (Komoto, Kondoh and Masui, 2016), (Yeoun and Kim, 2016), (Qin, et al., 2016), (Qin, et al., 2016), and real-time video and large HD image processing in computer vision (Li, He and Chen, 2016), (Ni, et al., 2016), (Li, He and Chen, 2017), (Li, et al., 2017), (Li, et al., 2017), (Sun, et al., 2016), (Liu, et al., 2016).
