Energy saving and optimization play an increasingly important role in industrial electronic systems. A heterogeneous embedded system is composed of a general-purpose central processing unit with an enhanced module of graphics processing units (GPU). This article explores the effective strategies of task granularity and software prefetching for energy optimization. In this article, we propose a novel energy optimization model for GPU-based embedded systems by harnessing a communication-based pipeline spatial and temporal relation. We analyze the characteristics of a multiple thread execution of parallel GPUs. We present an effective algorithm for the dynamic power optimization with the adaptively adjusted distance of software prefetching. The experimental results show that the dynamic energy consumption can be saved by 22.1% and 21.8%, respectively, under two prefetching strategies (register and shared memory) without loss of performance. We demonstrate the effectiveness of the proposed methods for energy saving and consumption reduction of performance driven computing in industrial scenarios.
I. INTRODUCTION
IN THE field of industrial manufacturing, heterogeneous embedded systems are needed to support product design and R&D processes in sectors such as aerospace, automobiles, and ships. Although a heterogeneous embedded system offers a higher peak computing speed and peak computing efficiency, the problem of massive power consumption remains. Excessive power consumption poses severe reliability and heat-dissipation challenges for large-scale heterogeneous embedded systems in safety-critical industrial applications. Therefore, power consumption has become a central concern [1], [2].
The speed gap between a processor and off-chip memory gives rise to the memory wall, which has long been one of the main problems hindering improvements in computational efficiency [3]-[6]. With the development of on-chip multi-core and many-core processors, parallel systems place an even heavier burden on the memory system, which aggravates memory-related problems. Thus, reducing or hiding memory latency is important when specifying a system architecture. Prefetching hides memory latency by overlapping memory access with computation [7], [8]. Prefetching optimization aims to relieve memory-access bottlenecks and improve execution performance by fetching data into the cache in advance, so that computation and memory access overlap on the processor. Prefetching optimization can be divided into hardware and software prefetching. In hardware prefetching, a prefetch engine identifies and predicts the memory access pattern of a program and prefetches data automatically. Hardware prefetching incurs no software overhead, but it is less flexible and less targeted. In software prefetching, programmers or compilers insert prefetching instructions at appropriate locations in the code to fetch data into a cache (or register) in advance, thus avoiding stalls in computation while waiting for memory access. Software prefetching is flexible, efficient, and targeted, but it incurs software and power overheads. In this article, only software prefetching is taken into account.
The contributions of this article are as follows. By considering the energy consumed by prefetching instructions, graphics processing unit (GPU) processors, and memory access, as well as the static energy consumption of a system, an optimization model for the energy consumption of a GPU-based embedded system is established. An algorithm for optimizing the dynamic energy consumption of homogeneous multi-GPU processors based on adaptive adjustment of the distance of software prefetching is proposed.
This article is organized as follows. Section II reviews existing work. Section III presents the architecture of a GPU-based embedded system. Section IV analyzes the opportunities that task partitioning and software prefetching offer for energy optimization. Section V establishes a model for energy optimization of a heterogeneous embedded system. Based on this model, Section VI proposes an algorithm for optimizing the dynamic energy consumption based on adaptive adjustment of the distance of software prefetching. Section VII evaluates and analyzes the experimental results. Section VIII concludes this article.
II. RELATED WORK
Software prefetching has been investigated for a long time. Mowry et al. [9] are one of the teams that have investigated optimization algorithms based on software prefetching. Approaching software prefetching from the perspective of compilation, they propose an algorithm for inserting prefetching instructions. The algorithm prefetches only data that are likely to cause cache misses, which avoids the extra overhead of unnecessary prefetching.
Very recently, it has been demonstrated that software prefetching-based optimization can effectively hide memory latency and improve program performance, but it inevitably increases power consumption [10], [11]. The main reason is that prefetching instructions increase the instruction count, and prefetching exploits the spatial parallelism of memory and processor. As a result, the energy consumed by the whole processor per unit time increases. To address the influence of software prefetching-based optimization on power consumption, Agarwal et al. [12] propose an energy optimization strategy in which the performance gain obtained by software prefetching is converted into a reduction in energy consumption by using dynamic voltage and frequency scaling (DVFS) technology [13]. In this way, about 38% of the energy overhead can be eliminated without performance loss. However, the method ignores the influence of voltage (or frequency) scaling on prefetching-based optimization. Others [14], [15] point out that controlling the overhead of software prefetching depends on determining the prefetch distance, and that the optimal prefetch distance is codetermined by the memory latency and the execution time of a single iteration. After the voltage (or frequency) of a processor is scaled, only the execution time of an iteration is affected, while the absolute latency of memory access is unchanged. Therefore, after reducing the frequency (or voltage), it is necessary to decrease the prefetch distance accordingly.
Task scheduling and dynamic voltage scaling are two main methods used for optimizing the performance or energy consumption of a system. Keqin et al. [16] investigated the combined optimization of the two methods for a homogeneous parallel system. Through theoretical analysis, they pointed out that energy optimization under a performance constraint and performance optimization under an energy constraint can both be treated as a general power sum problem. In a further investigation [17], the author analyzed the energy optimization of parallel tasks in a parallel system and showed that the influences of three factors (system partitioning, task scheduling, and frequency scaling) on the energy consumption of a system must be considered simultaneously. Goraczko et al. [18] propose a task partitioning method for energy optimization on heterogeneous multicore processors. The method optimizes the energy overhead of processors under real-time constraints by mapping tasks onto heterogeneous multicore processors and combining this mapping with processor frequency scaling.
The difference between this article and other power optimization work is that software prefetching is introduced into the energy consumption optimization model, and energy consumption optimization is carried out by voltage frequency regulation and task partition.
III. ARCHITECTURE OF A GPU-BASED EMBEDDED SYSTEM
The architecture of a typical heterogeneous embedded system with multiple GPUs is shown in Fig. 1. It contains a central processing unit (CPU) host processor and multiple GPU coprocessors. The CPU and each GPU have their own memories. When programs run, the CPU can issue direct memory access (DMA) commands and transfer data between the host memory and GPU memory by using a specialized DMA. Because the host memory is shared by the GPUs, the host processor can transfer data to only one GPU at a given time, while the GPU processors can run independently. In this architecture, a program is divided into serial and parallel program segments during execution. Serial program segments are executed on the CPU and parallel program segments are executed on multiple GPUs. Since there is only one CPU, the questions of how to allocate tasks and how to adjust the dynamic voltage and frequency do not arise for it. Therefore, under this architecture, we focus on power modeling and optimization in the multi-GPU environment.
IV. TASK PARTITIONING AND SOFTWARE PREFETCHING FOR ENERGY OPTIMIZATION
Three communication-computing pipeline spatio-temporal diagrams may arise during the execution of parallel programs in a homogeneous multi-GPU system.
We assume that the whole program contains $n$ tasks $m_i$, where $m = (m_1, m_2, \ldots, m_n)$ is the sequence of tasks in program order. For $1 \le i \le n$, $\mathrm{Type}(m_i)$ denotes the operation type of task $m_i$. It takes values in $\{C, P, T, M\}$, which represent kernel computing, data prefetching, data transferring, and memory latency, respectively. $\mathrm{Time}(\mathrm{Type}(m_i))$ is the time required for the task operation.
The perfect-overlap communication-computing pipeline spatio-temporal diagram is shown in Fig. 2. The vertical axis refers to different GPUs, and the part between the two blue dashed lines denotes software prefetching and memory access conducted by the same GPU. The horizontal axis represents the execution time of programs, where $P(m_i)$ denotes prefetching, which is responsible for computing the address of the data to be prefetched. For example, $P(m_1)$ refers to prefetching the data required in the $C(m_{n+1})$ computing section, and the prefetched data are accessed at point B. It can be seen from Fig. 2(a) that, in the case of perfect overlap, the memory access brought about by the last prefetching task is completely overlapped with the computing operations in the processors. At this time there is
When prefetching is executed early, the prefetched data are not used immediately after the memory access completes. A waiting period follows, so the memory is idle for some time. At this time there is
As depicted in Fig. 2(b) , the situation when the execution time of the processors plays a dominant role, due to prefetching operations being executed early, thus incurring an idle period of memory, is called GPU-bound. The occurrence of the idle period of the memory has an adverse influence on performance. In this case, from the perspective of energy, it is necessary to search for an appropriate task granularity to minimize the area of the rectangle enclosed by horizontal and vertical coordinates in the spatio-temporal diagram, thus decreasing static power consumption. Additionally, the frequency of the memory needs to be scaled to make the memory work at a low frequency. By doing so, on the condition of having no influence on the performance, this avoids accessing data in advance, therefore, reducing dynamic energy consumption.
Under prefetching delay, the time of memory access is longer than the computing time of processors within the same stage. In this case, the idle period in computation that occurs while waiting for data from memory cannot be completely removed. At this time there is
As depicted in Fig. 2(c), the situation in which the time of memory access plays a dominant role, because prefetching operations are delayed and an idle period is thus incurred on all processors, is called memory-bound. Under this circumstance, from the perspective of energy, it is feasible to determine an appropriate task granularity that minimizes the area of the rectangle enclosed by the horizontal and vertical coordinates in the spatio-temporal diagram, thereby lowering static power consumption. Additionally, the frequency of the processors is adjusted so that they work at a low frequency. In this way, without any influence on performance, accessing data in advance is avoided, thus decreasing dynamic energy consumption.
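The three cases above differ only in how the computing time and the memory-access time of a stage compare. The following is a minimal sketch of that comparison, not the authors' implementation; the function names and the tolerance parameter `tol` are assumptions introduced for illustration.

```python
def classify_stage(compute_time: float, memory_time: float, tol: float = 0.0) -> str:
    """Classify one pipeline stage by comparing compute and memory-access time.

    `tol` is a hypothetical tolerance for treating the two times as equal.
    """
    if abs(compute_time - memory_time) <= tol:
        return "perfect-overlap"
    if compute_time > memory_time:
        # Memory finishes early and sits idle: candidate for lowering memory frequency.
        return "GPU-bound"
    # Processors wait for data: candidate for lowering processor frequency.
    return "memory-bound"


def idle_time(compute_time: float, memory_time: float) -> float:
    """Length of the idle period on whichever side finishes first."""
    return abs(compute_time - memory_time)
```

For instance, a stage with 5 time units of computation and 3 of memory access is classified as GPU-bound with 2 units of memory idle time, which is the slack that frequency scaling of the memory would try to absorb.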
V. POWER MODEL OF HETEROGENEOUS SYSTEMS
In a CPU-GPU heterogeneous embedded system, $k$ denotes the basic unit of task partitioning for a given kernel program. If the number of tasks assigned to a certain GPU is greater than the basic unit $k$, that is, when each streaming multiprocessor (SM) can be assigned more than one thread block, the computational intensity in the SM improves to a certain extent, the single instruction multiple data computing pipeline is used more effectively, and the delay caused by GPU memory access is hidden. When the number of thread blocks allocated to the GPU is not an integer multiple of the number of SMs, load imbalance occurs, which leaves some SMs idle. Therefore, in task partitioning, we take a multiple $k \cdot r$ ($r \ge 1$) of the basic task partitioning unit $k$ as the granularity of task partitioning. A parallel program generally consists of multiple parallel loops. As no data dependence exists between iterations in parallel loops, loop iterations can be mapped onto multiple processors for concurrent execution. In a heterogeneous embedded system with homogeneous multiple GPUs, the task $k \cdot r$ corresponds to a loop iteration in a parallel program and is assigned to a GPU processor for execution by default.
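The load-imbalance argument can be illustrated with a small calculation. This is a sketch under the simplifying assumption that thread blocks are issued in equal-sized waves across all SMs; `sm_waves`, `blocks_per_sm`, and `granularity` are hypothetical names, not part of the original model.

```python
import math


def sm_waves(num_blocks: int, num_sms: int, blocks_per_sm: int = 1):
    """Return (waves, idle_slots) for num_blocks thread blocks on num_sms SMs.

    Assumes every wave offers num_sms * blocks_per_sm block slots.
    """
    slots_per_wave = num_sms * blocks_per_sm
    waves = math.ceil(num_blocks / slots_per_wave)
    idle_slots = waves * slots_per_wave - num_blocks
    return waves, idle_slots


def granularity(k: int, r: int) -> int:
    """Task granularity k * r, with r >= 1 as required in the text."""
    if r < 1:
        raise ValueError("r must be at least 1")
    return k * r
```

With 15 SMs, 30 thread blocks fill two waves with no idle slot, whereas 31 thread blocks need a third wave in which 14 slots stay idle; choosing the granularity as a multiple of the basic unit avoids the second situation.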
When describing the problem related to energy optimization of a CPU-GPU heterogeneous embedded system under a performance constraint, the following parameters are involved:
$\mathrm{Data}(k \cdot r)$ refers to the size of the data transferred from the host processor (Host) to the GPUs for task $k \cdot r$. $T_t(\mathrm{Data}(k \cdot r))$ represents the time taken to transfer data of size $\mathrm{Data}(k \cdot r)$ from the host processor to the DRAM memory of the GPUs. $C_c(k \cdot r)$ refers to the number of processor clock cycles required for the GPUs to complete task $k \cdot r$. $b$ denotes the number of cache blocks prefetched by a prefetching instruction. $N_b$ denotes the number of prefetching instructions in a loop iteration. $E_b$ denotes the energy overhead of prefetching a cache block. $C_p(\mathrm{Data}(k \cdot r))$ denotes the clock cycles consumed by prefetching in each loop iteration. $C_m(\mathrm{Data}(k \cdot r))$ denotes the cycles of memory latency in an iteration caused by cache misses.
The optimization objective and the performance constraint for the energy optimization problem of a multi-GPU system are discussed below. The optimization objective is to minimize the total energy ($E_t$) consumed, including dynamic ($E_d$) and static ($E_s$) energy consumption, during program execution in a prefetched loop section, while ignoring the energy consumed by the bus and clocks.
Dynamic energy consumption ($E_d$) mainly includes the energy ($E_p$) consumed when prefetching instructions calculate data addresses, the energy ($E_c$) consumed during GPU computation, and the energy ($E_m$) consumed by memory access.
It is supposed that the energy consumed in prefetching a cache block is $E_b$. The number of cache blocks prefetched by a prefetching instruction is $b$ and the number of prefetching instructions in an iteration is $N_b$. In this case, the number of cache blocks to be prefetched in an iteration is $b \cdot N_b$. Therefore, the energy ($E_p$) consumed when prefetching instructions calculate data addresses within $N_i$ iterations can be expressed as follows:
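The expression itself is not reproduced in this text. From the definitions above ($b \cdot N_b$ blocks per iteration, $E_b$ per block, $N_i$ iterations), one consistent form, offered as a reconstruction rather than the original typesetting, is:

```latex
E_p = N_i \cdot b \cdot N_b \cdot E_b
```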
For task $k \cdot r$, it is supposed that the power consumption of the processors at frequency $f_c$ is $p_c(f_c)$ and the number of clock cycles consumed by the computation of a processor within an iteration is $C_c(k \cdot r)$. Thus, the energy ($E_c$) consumed by the computation of the processor within $N_i$ iterations is calculated as follows:
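The expression is not reproduced in this text. Assuming that the time of one iteration equals its cycle count divided by the clock frequency, a form consistent with the definitions above would be:

```latex
E_c = N_i \cdot p_c(f_c) \cdot \frac{C_c(k \cdot r)}{f_c}
```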
For task $k \cdot r$, it is assumed that the power consumption of the memory at frequency $f_m$ is $p_m(f_m)$ and the number of memory access cycles within an iteration caused by cache misses is $C_m(\mathrm{Data}(k \cdot r))$. Under this circumstance, the energy ($E_m$) consumed by the memory within $N_i$ iterations can be expressed as follows:
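The expression is not reproduced in this text. Under the same assumption that time equals cycles divided by frequency, now applied to the memory clock, a consistent form would be:

```latex
E_m = N_i \cdot p_m(f_m) \cdot \frac{C_m(\mathrm{Data}(k \cdot r))}{f_m}
```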
Therefore, the dynamic energy consumption is
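The expression is not reproduced in this text. Since the dynamic energy is described as consisting of the prefetching, computation, and memory-access terms, it is presumably their sum:

```latex
E_d = E_p + E_c + E_m
```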
According to [19], the dynamic power consumption $p$ and the frequency $f$ of complementary metal-oxide-semiconductor (CMOS) circuits satisfy $p \propto \alpha C f^3$ ($\alpha$ is the switching activity factor and $C$ is the switched capacitance); therefore, the power consumptions of the processors and memories can be separately expressed as follows:
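The two expressions are not reproduced in this text. A form consistent with the cubic relation is given below, where $a_c$ and $a_m$ are hypothetical constants, introduced here only for illustration, that absorb the activity factor and capacitance of the processors and memories, respectively:

```latex
p_c(f_c) = a_c \, f_c^{3}, \qquad p_m(f_m) = a_m \, f_m^{3}
```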
By substituting the above two expressions into Formula (7), Formula (10) can be obtained
where
As the main source of static power consumption, leakage-current-induced power consumption is generated even when the circuit is stable; therefore, it can be assumed that the static power consumption of GPUs remains unchanged while programs run. The system contains $M$ GPUs and the static power consumption is $P_s$. After a certain task partition $C$ is applied, $N$ ($N \le M$) GPUs take part in computing while the other GPUs are turned off or run at the lowest possible power consumption, so the static power consumption of these other GPUs can be ignored. The total execution time of programs is denoted by $T$. Thus, the total static energy consumption of the GPUs can be expressed as follows:
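The expression is not reproduced in this text. Reading $P_s$ as the static power of a single active GPU (an assumption), the static energy would take the form below, which matches the proportionality to the number of GPUs and the execution time stated next:

```latex
E_s = N \cdot P_s \cdot T
```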
Because $P_s$ remains unchanged during program execution, $E_s \propto N_G \cdot T$ holds, where $N_G$ and $T$ refer to the number of GPUs and the total execution time of parallel programs, respectively. This means that the static energy consumption generated by multiple GPUs is proportional to the area of the rectangle enclosed by the horizontal and vertical coordinates in the spatio-temporal diagram. Therefore, the optimization objective for the total energy of a CPU-GPU heterogeneous embedded system is as follows:
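The objective is not reproduced in this text. As a rough numerical sketch of the energy terms defined in this section, the code below evaluates the dynamic and static parts under two stated assumptions: power follows the cubic law $p = a f^3$ with hypothetical constants `a_c` and `a_m`, and time equals cycles divided by frequency. All function and parameter names are illustrative, not the authors' notation.

```python
def dynamic_energy(n_iter, b, n_b, e_b, c_c, c_m, f_c, f_m, a_c, a_m):
    """Dynamic energy E_d = E_p + E_c + E_m over n_iter loop iterations."""
    e_p = n_iter * b * n_b * e_b           # prefetch energy: blocks * energy per block
    p_c = a_c * f_c ** 3                   # assumed cubic processor power law
    p_m = a_m * f_m ** 3                   # assumed cubic memory power law
    e_c = n_iter * p_c * c_c / f_c         # compute energy: power * (cycles / frequency)
    e_m = n_iter * p_m * c_m / f_m         # memory energy: power * (cycles / frequency)
    return e_p + e_c + e_m


def static_energy(n_gpus, p_s, total_time):
    """Static energy, assuming p_s is the static power of one active GPU."""
    return n_gpus * p_s * total_time


def total_energy(e_dynamic, e_static):
    """Total energy E_t = E_d + E_s."""
    return e_dynamic + e_static
```

Under these assumptions the compute term scales with the square of the frequency, so halving `f_c` cuts it to a quarter, while the static term grows if the lower frequency lengthens the execution time; this is the trade-off the optimization balances.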
In terms of the energy optimization problem, performance is the most important constraint condition. Performance refers to the execution time of parallel programs after being optimized based on an optimal task partition and software prefetching. The performance constraint should ensure that the relative increase in the total execution time of parallel programs stays below $\beta$. By analyzing the opportunities of task partitioning and software prefetching for energy optimization, it can be seen that the total execution time of parallel programs is related to the communication-computing spatio-temporal diagram. The diagram contains four basic operation types: communication for transferring data from the host processor to GPU DRAM, software prefetching for prefetching data from GPU DRAM to GPU cache, memory latency caused by software prefetching, and data calculation and processing. In the spatio-temporal diagram, it can be seen that, in a heterogeneous embedded system with homogeneous multiple GPUs, all memory latencies caused by software prefetching are hidden by the computing tasks of GPUs. Therefore, the ratio of the sum of the times consumed in GPU computations and software prefetching to the communication time (recorded as $R$) is a fundamental factor determining the distribution of the spatio-temporal diagram.
The total task load of parallel programs is expressed as $F$. For task $k \cdot r$ assigned to a certain GPU ($\mathrm{Data}(k \cdot r)$).
When considering the execution time of parallel programs, two cases arise: GPU-bound [Fig. 2(b)] and memory-bound [Fig. 2(c)].
1) If $R_{k \cdot r} \ge N_G$, the GPU processors are all in a full-load working state. Therefore, it can be considered that all data communication latencies are hidden by software prefetching and computing tasks in GPUs. Because the number of tasks assigned to each GPU is $F/(N_G \cdot k \cdot r)$, the total execution time of parallel programs is calculated as follows:
2) If $R_{k \cdot r} < N_G$, idling occurs in the pipelines and data communication becomes the key factor determining program execution time. It can be considered that all software prefetching and computing in GPUs are hidden by data communication; therefore, the total execution time of programs is expressed as follows:
Thus, the execution time of parallel programs with $N_G$ GPUs satisfies the following piecewise continuous function:
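The function is not reproduced in this text. The following is a simplified steady-state sketch of the two cases, not the authors' formula: it ignores pipeline fill and drain, takes the per-task times as given inputs, and uses hypothetical parameter names. The ratio `r_ratio` follows the definition of $R$ as compute plus prefetch time over communication time.

```python
def execution_time(F, k, r, n_g, t_compute, t_prefetch, t_transfer):
    """Return (regime, total_time) for a total load F split into tasks of size k * r.

    t_compute, t_prefetch and t_transfer are per-task times (assumed inputs).
    """
    n_tasks = F / (k * r)
    r_ratio = (t_compute + t_prefetch) / t_transfer
    if r_ratio >= n_g:
        # GPUs stay fully loaded; communication is hidden behind GPU work.
        return "GPU-bound", n_tasks / n_g * (t_compute + t_prefetch)
    # The host cannot feed the GPUs fast enough; communication dominates.
    return "memory-bound", n_tasks * t_transfer
```

At the boundary `r_ratio == n_g` both branches give the same value, so this sketch is continuous across the two cases, in line with the piecewise continuous behavior described above.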
When the degree of performance loss is allowed to be less than $\beta$, the following condition should be satisfied: Here, $f_c^0$ represents the initial frequency of GPU processors and $T_r^0$ denotes the execution time of parallel programs when $f_c = f_c^0$. In this case, the following formula is acquired:
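Neither expression is reproduced in this text. Reading $\beta$ as the maximum tolerated relative slowdown, and writing $T_r$ for the execution time after optimization (a label assumed here), the condition can be sketched as:

```latex
\frac{T_r - T_r^{0}}{T_r^{0}} \le \beta
\quad\Longleftrightarrow\quad
T_r \le (1 + \beta)\, T_r^{0}
```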
The goal of the performance constraint is to require that the execution time of the optimized program not exceed the original execution time by more than the performance loss expressed by parameter $\beta$, while energy consumption is minimized. The optimization objective is the total energy consumption of the system (including dynamic and static energy consumption). Two constraints, cond1 and cond2, ensure that the character of the program (GPU-bound or memory-bound) is not changed during the frequency regulation process. At the same time, the ranges of the processor frequency and memory frequency are limited. $f_c'$ and $f_c$ are the processor frequencies in the next and last periods of frequency change, respectively; $f_m'$ and $f_m$ are the memory frequencies in the next and last periods of frequency change, respectively.
The energy optimization problem is described as follows:
VI. ALGORITHM FOR DYNAMIC ENERGY OPTIMIZATION BASED ON THE ADAPTIVELY ADJUSTED DISTANCE OF SOFTWARE PREFETCHING
The key to controlling the overhead of software prefetching is to determine the prefetch distance. For a loop structure, the prefetch distance denotes the number of loop iterations between a prefetching instruction and the true access. To hide the memory latency caused by prefetching, the time at which a prefetching instruction completes must coincide as closely as possible with the moment of true access. Therefore, the prefetch distance is co-determined by the iteration delay and the memory latency, so it can be expressed as follows:
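The formula is not reproduced in this text. From the definitions that follow (average memory latency $AD$, shortest iteration time $RT$, rounding up), and writing $PD$ for the prefetch distance as a label assumed here, it presumably reads:

```latex
PD = \left\lceil \frac{AD}{RT} \right\rceil
```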
where AD denotes the average memory latency and RT denotes the shortest possible execution time (containing time consumed by prefetching instructions) of each loop iteration. The purpose of rounding up is to guarantee that data have been prefetched before they are accessed. If the numerator and denominator of the fraction in Formula (19) are defined as wall-clock times, but not clock cycles, the formula can be written as follows:
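The rewritten formula is not reproduced in this text. With $AD$ taken as a wall-clock time and the iteration time expressed as $C/f$, a consistent form would be:

```latex
PD = \left\lceil \frac{AD}{C / f} \right\rceil = \left\lceil \frac{AD \cdot f}{C} \right\rceil
```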
where $C$ and $f$ refer to the clock cycles within a single iteration and the working frequency of the processors, respectively. Generally, scaling the working frequency of processors cannot change the absolute latency of memory access. Thus, after the frequency is reduced, $AD$ in Formula (20) remains unchanged. The number of clock cycles within a single iteration does not change with the working frequency either, but the duration of each clock cycle increases. It follows that the prefetch distance is approximately proportional to the clock frequency
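The relation is not reproduced in this text. Writing $PD_0$ (a label assumed here) for the prefetch distance at the original frequency $f_0$ and $\alpha = f / f_0$ for the scaling factor, the stated proportionality can be sketched as:

```latex
PD \approx \alpha \cdot PD_0
```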
where PD and α denote the prefetch distance after scaling and the frequency scaling factor, respectively. For the architecture of GPUs, the burden on a register caused by prefetching may influence the degree of parallelism of programs, thus exerting a significant influence on performance. Therefore, reducing the prefetch distance generally means an increase in the degree of parallelism to improve performance. From this perspective, an algorithm for optimizing dynamic energy consumption based on adaptively adjusted distance of software prefetching (ECADP) is proposed and the pseudocodes of the algorithm are as shown in Fig. 3 .
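The prefetch-distance relations in this section can be checked numerically. The sketch below is illustrative only: it assumes latency in nanoseconds and frequency in gigahertz so that their product is a cycle count, and it rounds the scaled distance up (an assumption, chosen so that data still arrive before use). The function names are hypothetical.

```python
import math


def prefetch_distance(avg_latency_ns: float, cycles_per_iter: float, freq_ghz: float) -> int:
    """Prefetch distance as ceil(AD * f / C): iterations needed to cover the latency."""
    return math.ceil(avg_latency_ns * freq_ghz / cycles_per_iter)


def scaled_prefetch_distance(pd_original: int, alpha: float) -> int:
    """Approximate distance after scaling the frequency by alpha, never below 1."""
    return max(1, math.ceil(alpha * pd_original))
```

For a 400 ns average latency and 50 cycles per iteration, the distance is 8 iterations at 1 GHz and 4 iterations at 0.5 GHz, which agrees with scaling the original distance by $\alpha = 0.5$.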
We assume that a heterogeneous embedded system contains $E$ GPUs, each of which has $e$ SMs. Additionally, the processors initially work at frequency $f_0$ while the initial memory frequency is $f_{m0}$. Simulation analysis is carried out on the original programs, from which it can be concluded that $s$ thread blocks can work concurrently in each SM. To approximate the real execution process, $2s$ thread blocks are assigned to each SM so that the SM always works in a full-load state during the simulation. According to the communication-computing pipeline spatio-temporal diagrams in Section IV, $T_a(r)$ and $T_b(r)$ represent the execution times of parallel programs under memory-bound and GPU-bound conditions, respectively. For a given parallel program optimized by prefetching, the algorithm computes the data at intervals of $2esE$ threads, and the execution of $2esE$ threads is taken as a repetition period. During the execution of the $2esE$ threads, if the parallel program is in a memory-bound state, the initial $2esE$ threads are first executed with prefetching allowed. In this case, the execution time of parallel programs is $T_a(r)$. Afterward, the subsequent $2esE$ threads are executed without prefetching. Under this circumstance, the execution time of parallel programs is $T_a'(r)$. From the execution times of $2esE$ threads with and without prefetching, the performance gain can be calculated as $\mathrm{gain} = (T_a'(r) - T_a(r))/T_a'(r)$. If the performance does not increase through prefetching, the programs are executed at the current voltage and frequency of the processors and prefetching is not allowed. When the performance is improved through prefetching, the performance gain can be converted into an energy saving by scaling the voltage and frequency of the processors.
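The gain test described above amounts to a relative comparison of two measured periods. A minimal sketch follows, assuming the gain is normalized by the time without prefetching; the function names are hypothetical.

```python
def prefetch_gain(time_without_prefetch: float, time_with_prefetch: float) -> float:
    """Relative performance gain of one measurement period with prefetching enabled."""
    return (time_without_prefetch - time_with_prefetch) / time_without_prefetch


def allow_prefetch(gain: float) -> bool:
    """Prefetching is kept only when it actually shortens the period."""
    return gain > 0.0
```

For example, a period that takes 10 time units without prefetching and 8 with it yields a gain of 0.2, which is the slack that frequency scaling can then convert into an energy saving.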
$C_{cp}(i)$ refers to the number of execution cycles of the $i$th GPU when prefetching is allowed, and $\max C_{cnp}(i)$ denotes the largest number of execution cycles of all GPUs when prefetching is not allowed. Through $C_{cp}(i)/\max C_{cnp}(i)$, the performance improvement ratio is calculated. According to the relative increase in performance, the voltage and frequency of each GPU processor core can be recalculated to find the frequency scaling factor ($\alpha = f(i)/f_0$) of each GPU processor. In a similar way, if parallel programs are GPU-bound, the execution times of $2esE$ threads with and without prefetching are expressed as $T_b(r)$ and $T_b'(r)$, respectively. When a performance gain is generated through prefetching, $C_{mp}(i)$ represents the memory access cycles of the $i$th GPU memory when prefetching is allowed, and $\max C_{mnp}(i)$ denotes the largest number of memory latency cycles of all GPU memories when prefetching is not allowed. Through $C_{mp}(i)/\max C_{mnp}(i)$, the performance improvement ratio is obtained. Similarly, according to the performance improvement ratio, the voltage and frequency of each GPU memory can be recalculated to find the frequency scaling factor ($\alpha = f_m(i)/f_{m0}(i)$) of each GPU memory. In steps 46-50, $C$ points are uniformly sampled within the interval $[\alpha, 1]$ as candidate frequency scaling factors. The frequency scaling factor $\alpha_X$ is selected to minimize the dynamic energy consumption ($P_X T_X$). If $P_X T_X < P_0 T_0$, $\alpha_X$ is taken as the final optimal frequency scaling factor. Otherwise, the original frequency $f_0$/$f_{m0}$ is used, which implies that optimizing the program by software prefetching is inapplicable under the performance constraint. After the optimal frequency scaling factor $\alpha_X$ is obtained, the prefetch distance is first roughly determined from the frequency scaling factor, using Formula (20), when the proper prefetch distance is determined based on frequency during simulation in step 56.
Thereafter, slight scaling, with a small amplitude, is conducted to determine an appropriate prefetch distance, thus reducing dynamic energy consumption. We analyze the time complexity of the algorithm in terms of code nested layers. 
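The selection of the frequency scaling factor described in this section can be sketched as follows. This is not the pseudocode of Fig. 3: the power and time models are passed in as hypothetical callables, the lower bound is taken directly as the cycle ratio (an assumption about how the improvement ratio maps to $\alpha$), and all names are illustrative.

```python
from typing import Callable, Tuple


def scaling_lower_bound(cycles_with_prefetch: float,
                        max_cycles_without_prefetch: float) -> float:
    """Lowest scaling factor that keeps the period no longer than without prefetching."""
    return min(1.0, cycles_with_prefetch / max_cycles_without_prefetch)


def select_scaling_factor(alpha_min: float,
                          n_points: int,
                          power: Callable[[float], float],
                          time: Callable[[float], float],
                          baseline_energy: float) -> Tuple[float, bool]:
    """Sample n_points factors uniformly in [alpha_min, 1] and pick the lowest-energy one.

    Returns (factor, use_prefetch). If no candidate beats baseline_energy (P_0 * T_0),
    the original frequency is kept and prefetching is reported as not worthwhile.
    """
    if n_points < 2 or not 0.0 < alpha_min <= 1.0:
        raise ValueError("need at least two sample points and 0 < alpha_min <= 1")
    step = (1.0 - alpha_min) / (n_points - 1)
    candidates = [alpha_min + i * step for i in range(n_points)]
    best = min(candidates, key=lambda a: power(a) * time(a))
    if power(best) * time(best) < baseline_energy:
        return best, True
    return 1.0, False
```

As a usage illustration with made-up models, if prefetching raises power by 20% (`power = lambda a: 1.2 * a**3`) and cuts time to 70% (`time = lambda a: 0.7 / a`), sampling four points in [0.7, 1] against a baseline energy of 1.0 selects 0.7 and keeps prefetching enabled.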
VII. EXPERIMENTAL VERIFICATION

A. Experimental Platform and Test Cases
We tested high-performance computing platforms in industrial scenarios. Existing GPUs offer only a few adjustable levels and cannot fully support dynamic voltage (or frequency) scaling, which hinders theoretical research on, and verification of, the low-power optimization of GPUs. Therefore, a simulator for GPU power consumption is used for experimental verification [20].
TABLE I. PARAMETER SETTINGS OF THE SIMULATOR FOR GPU POWER CONSUMPTION
TABLE II. TEST CASE
The simulator for power consumption is realized by adding a Wattch model for power consumption into the GPGPUSim simulator to model the power consumption of various components (Shader Cores, L2 Cache, and the Memory Controller) in GPUs [21]. For the interconnection network, the modeling approach for power consumption used in PowerRed is applied [22]. For DRAM, the modeling is carried out by utilizing a published method [23]. For each component, the simulator counts the activities of each clock cycle in its clock domain and accumulates power consumption. Finally, the total power consumption of the GPU is summed up. Because the semiconductor technology adopted by modern GPUs is more mature and its characteristic coefficient is smaller than that set in the Wattch model, the absolute power consumption given by the simulator is slightly higher than that of the simulated target GPU (the general error is less than 10%). However, as a theoretical optimization method, this article focuses on the relationship between the power and performance changes of the GPU after frequency reduction, rather than the absolute value of power consumption. Therefore, the absolute power error is acceptable. Parameters of the simulator for GPU power consumption are listed in Table I. Eight typical applications, namely BlackScholes (BS), dwtHaar1D (DH), fwtBatch1 (FB), MatrixMul (MM), Matrix-Vector (MV), Laplace (LP), MersenneTwister (MT), and SobolQRNG (SQ), from multiple areas such as signal processing, finance, and scientific computation were used as the test cases. BS comes from the financial field. It implements the Black-Scholes model and calculates partial differential equations for financial prices. DH realizes the wavelet transform of a signal. The FB application comes from the fast Walsh transform. MT implements the Mersenne Twister pseudorandom number generation algorithm.
SQ is the Sobol quasi-random number generation algorithm. MM, MV, and LP come from the field of scientific computing. They are matrix multiplication, matrix vector multiplication, and Laplace transformation. These applications are characterized by the fact that their loops are contained in a kernel function and the loop contains references to accessing global memory space. They satisfy the basic conditions for conducting software prefetching-based optimization. The specific data size of the test cases is listed in Table II . Fig. 4 shows the execution time and the ratio between power consumptions (after prefetching/before prefetching) of the applications on the condition of separately using register and shared memory as a prefetching buffer. When using shared memory as the buffer for software prefetching in LP and MT applications, the execution time of the applications increased compared with that before software prefetching, thus leading to poorer program performance. Under the two strategies, the performances after prefetching separately increase by 41% and 30% (on average) while the power consumption increased slightly due to software prefetching. The changes in simulated performances and energy consumptions of eight typical applications before, and after, optimization of dynamic energy consumption based on adaptively adjusted distance of software prefetching are analyzed on a GPU platform. Through analysis, it can be concluded that the adaptively adjusted distance of software prefetching is related to the frequency scaling factor. Therefore, the dynamic energy consumption can be optimized by scaling frequencies of processors and memories. Fig. 5 separately shows the performance and energy benefits when separately using the register and shared memory as a prefetching buffer. The prefetch distance is the optimal value calculated by using the algorithm for optimizing dynamic energy consumption based on the adaptively adjusted distance of software prefetching. 
As shown, when utilizing the register and shared memory as the prefetching buffer, the energy consumptions of the system can be separately reduced by 18% and 13% by applying the aforementioned algorithm under a performance constraint. It can be seen that the power consumptions of DH, LP, and MT, when using the shared memory as a prefetching buffer, are not optimized. The power consumptions of the application SQ under both strategies are not optimized. The main reason for this is that, when a program undergoing prefetching-based optimization is subjected to incremental frequency reduction, the moment at which the execution time increases to its original level is earlier than that when the energy consumption falls to its original level. This means that, when the execution time increases to the original time through frequency scaling, the power consumption is significantly greater than that seen in its original condition, thus failing to reduce energy consumption by decreasing frequency. Therefore, according to the optimization method, it is not applicable to utilize software prefetching for optimization and the application is still executed using the original programs at the original frequency.
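The applicability check described above can be sketched as follows. This is a minimal illustration rather than the article's implementation: `time_at` and `energy_at` are hypothetical callables that return the execution time and energy of the prefetch-optimized program at a given frequency scaling factor (for example, from simulator runs), and the step size and lower limit of the sweep are assumed values.

```python
def select_scaling(time_at, energy_at, t_orig, e_orig, step=0.05, alpha_min=0.1):
    """Sweep the frequency scaling factor downward from 1.0.

    time_at / energy_at are hypothetical measurement callbacks for the
    prefetch-optimized program. Returns (alpha, True) when some admissible
    alpha (execution time <= t_orig) uses less energy than e_orig; otherwise
    returns (1.0, False), meaning the original program is kept at the
    original frequency.
    """
    best_alpha, best_energy = None, e_orig
    steps = int(round((1.0 - alpha_min) / step))
    for k in range(steps + 1):
        alpha = round(1.0 - k * step, 10)
        # Assumption: execution time grows monotonically as alpha shrinks,
        # so the sweep can stop at the first violation of the time constraint.
        if time_at(alpha) > t_orig:
            break
        energy = energy_at(alpha)
        if energy < best_energy:
            best_alpha, best_energy = alpha, energy
    if best_alpha is None:
        return 1.0, False
    return best_alpha, True
```

If the execution time reaches its original level before the energy has dropped below its original level, no admissible factor is found and the function falls back to the unoptimized configuration, which mirrors the behavior reported for SQ.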
B. Test Results
Additionally, in a homogeneous multi-GPU system, there are three communication-computing pipeline spatio-temporal diagrams during software prefetching. According to different pipeline spatio-temporal diagrams, the performance bottlenecks of the application program can be further judged. By exploring the performance bottleneck, some parameters are calculated, including the frequency scaling factor α X of processors (or memories) under optimal dynamic energy consumption and the frequency scaling factor α of processors (or memories) in performance-bound conditions under a time constraint. Based on the frequency scaling factor, an appropriate prefetch distance can be rapidly determined to further reduce dynamic energy consumption.
In the related work, we mentioned the work of Agarwal et al. [12], who proposed an energy optimization strategy that applies prefetching before DVFS. This method ignores the effect of DVFS on prefetching optimization. From the analysis of Section IV, we conclude that, after DVFS, the prefetching distance should be reduced appropriately. It can also be seen from the experimental results of Fig. 4 that the register pressure brought by prefetching may affect the parallelism of the program and thus have a greater impact on the performance. Therefore, reducing the prefetching distance often means increasing parallelism and improving performance.
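To illustrate why the prefetch distance shrinks after DVFS, the sketch below uses the classical rule that the distance equals the memory latency divided by the time of one loop iteration, rounded up. This rule, the function name, and all parameters are assumptions made for illustration; the article's own analysis is the one given in Section IV. The sketch also assumes that only the processor frequency is scaled, so the memory latency in wall-clock time is unchanged.

```python
import math


def prefetch_distance(mem_latency_cycles, iter_cycles, alpha_c=1.0):
    """Return the number of loop iterations to prefetch ahead.

    mem_latency_cycles: memory latency, in cycles of the nominal core clock.
    iter_cycles: core cycles needed by one loop iteration.
    alpha_c: hypothetical core frequency scaling factor (1.0 = nominal).
    """
    # At a lower core frequency each iteration takes longer in wall-clock
    # time, so fewer iterations are needed to cover the same latency.
    iter_time = iter_cycles / alpha_c
    return max(1, math.ceil(mem_latency_cycles / iter_time))


d_nominal = prefetch_distance(400, 50)              # 8 iterations ahead
d_scaled = prefetch_distance(400, 50, alpha_c=0.5)  # 4 iterations ahead
```

A smaller distance needs fewer buffer registers per thread, which is consistent with the observation that reducing the distance can relieve register pressure.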
For energy consumption optimization, Agarwal et al. [12] adopted an online dynamic voltage regulation algorithm, in which the voltage is adjusted online according to what is learned from an instruction window. This method requires that the instruction window not be too small, in order to avoid excessive learning overhead. For a GPU program, the special structure of the program makes it unsuitable to adjust the voltage dynamically by learning. First, the thread space of a GPU is composed of a large number of thread blocks with isomorphic instructions, which also run on SMs with isomorphic hardware. Because the number of thread blocks running at the same time on an SM is limited, thread blocks occupy the SM in batches at the macroscopic level and by time-sharing at the microscopic level. Over the time dimension of the whole program execution, SM execution is composed of a large number of similar computational processes, so there is no need to learn program behavior at the thread block level; it is only necessary to extract the behavior of a single thread block by simulation. Second, the execution time of a single thread block is relatively short (it accounts for a very small proportion of the total program execution time). If the instruction window were learned within a single thread block, the granularity would be too fine. This would result in excessive learning overhead and have no practical significance. Therefore, 2esE thread blocks are simulated to extract the relationship between program performance, power consumption, and software prefetching optimization.
In this article, the proposed algorithm for dynamic energy optimization based on the adaptively adjusted distance of software prefetching is compared with the energy optimization method proposed by Agarwal et al. [12] (abbreviated as the Ag algorithm), as shown in Fig. 5. The experimental results show that, with the prefetch distance appropriately reduced, the performance of our method is improved by 19% and 16%, respectively, compared with the Ag algorithm, while the energy consumption is basically the same. The method proposed in this article is therefore more effective. Table III lists the optimal frequency scaling factors α X , the frequency scaling factors α at the performance-bound state, and the energy optimization effects for the eight applications. According to the communication-computing pipeline spatio-temporal diagram, the eight applications can be approximately classified into two types. The applications MM, DH, MV, LP, and MT are constrained by the memory of the GPUs and, therefore, communication overhead is considered to be the performance bottleneck (memory-bound) of this type of application. Under this circumstance, the energy consumption can be reduced by decreasing the frequency α c of the processors. Applications BS and FB are restricted by the processors in the GPUs and, therefore, the computing time of the processors is regarded as the performance bottleneck (GPU-bound) of this type of application. In this case, the energy consumption can be reduced by reducing the frequency of the memories. It can be seen from Fig. 4 that prefetching using shared memory is inapplicable for the DH, LP, and MT applications. For the SQ program, prefetching is not suitable under either strategy. Therefore, the power consumptions under these conditions are not optimized, with an energy saving of 0%.
In the other applications, under most circumstances, the optimal frequency scaling factor α X for energy consumption is consistent with the frequency scaling factor α (α c /α m ) at the performance bound; however, for the three applications FB, LP, and MM, the condition α X < α arises. The main reason is that there is also non-negligible static energy consumption in the system, apart from its dynamic energy consumption. When increasing the frequency to some extent in the vicinity of α, the reduction in static energy consumption is more significant than the increase in dynamic energy consumption. According to the frequency regulation factor α X , the core clock and DRAM clock parameters are adjusted in the GPGPUSim power simulator. In GPGPUSim, each power consumption value is recorded in a power_result_type structure, and the total power consumption is output through double total_power. From the ratio of the total energy consumption after frequency regulation to that before, the percentage of energy consumption optimization in Table III is obtained. Taking the average value, under the two prefetching strategies, energy consumption can be reduced by 22.1% and 21.8%, respectively, through frequency scaling.
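The percentages follow directly from the energy ratio. The sketch below is a minimal illustration with hypothetical function names; the energy values in the usage lines are placeholders and are not taken from Table III. Applications for which prefetching is not applied are counted with a saving of 0%, as stated in the text.

```python
def energy_saving_percent(e_before, e_after):
    """Saving in percent from the ratio of energy after to before scaling."""
    return (1.0 - e_after / e_before) * 100.0


def average_saving(results):
    """Average the savings over all applications.

    results maps an application name to (e_before, e_after), or to None
    when the application is not optimized (counted as 0% saving).
    """
    savings = [
        0.0 if pair is None else energy_saving_percent(*pair)
        for pair in results.values()
    ]
    return sum(savings) / len(savings)


# Placeholder energies in joules, for illustration only.
example = {"app_a": (100.0, 80.0), "app_b": None}
mean_saving = average_saving(example)
```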
To verify the effectiveness of the optimal frequency scaling factor in the present study, the applications MM and BS are selected and verified on a real Quadro FX 5600 platform. The two applications represent the two types of test cases, i.e., memory-bound and GPU-bound, respectively. Fig. 6 shows the performance improvements in applications MM and BS obtained under different frequency scaling factors (α c and α m ). It can be seen from Fig. 6(a) that an inflection point occurs in the plot for the processor frequency scaling factor α c between 0.4 and 0.45, below which the performance of MM rapidly declines. The reason for this is that MM is a memory-intensive application. When the processor works at a high frequency, the bottleneck during program execution lies in memory. When the frequency of the processor decreases to a certain extent, the bottleneck during the execution of application MM is found in the processors rather than in memory. Under this circumstance, the performance of the application decreases if the frequency of the processor continues to be lowered. Therefore, the optimal frequency scaling factor of the MM application is expected to occur at the inflection point. As seen from Table III, the optimal frequency scaling factors of the GPU processors for the MM application are theoretically calculated as 0.42 and 0.46 by using the algorithm for optimizing dynamic energy consumption based on the adaptively adjusted distance of software prefetching. These results are obtained when the register and the shared memory, respectively, are applied for prefetching-based optimization. They show an insignificant discrepancy with the optimal frequency scaling factor 0.40 ≤ α c ≤ 0.45 for processors tested on the real platform. In Fig. 6(b), it can be seen that, for application BS, the performance exhibits a linear relationship with the frequency of the processors for most α m . This is because application BS is computation-intensive.
In this case, the bottleneck of program operation is attributed to computation and therefore reducing the frequency of the GPU processors certainly decreases the performance. Thus, under this circumstance, it is necessary to guarantee the performance and lower the energy consumption by decreasing the memory frequency. As shown in the figure, when α m is low enough, that is, α m = 0.2, the bottleneck during program operation is transformed. In this context, reducing the frequency of the GPU processors by a small amount cannot significantly influence program performance. Therefore, it can be judged that the optimal frequency scaling factor for application BS is α m = 0.2, which differs by about 10% from the theoretical optimal frequency scaling factors (0.35 and 0.33). This further validates the effectiveness of the model for energy consumption and of the algorithm for optimizing dynamic energy consumption.
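Locating such an inflection point from measured data can be sketched as below. This is only an illustrative procedure under stated assumptions: the function name and the 2% tolerance are hypothetical, and the sample values are synthetic, chosen to resemble the shape described for Fig. 6(a) rather than taken from the measurements.

```python
def find_knee(samples, tolerance=0.02):
    """Return the lowest scaling factor whose performance stays within
    `tolerance` of the performance at the highest scaling factor.

    samples: iterable of (alpha, normalized_performance) pairs.
    """
    ordered = sorted(samples, reverse=True)  # highest alpha first
    reference = ordered[0][1]
    knee = ordered[0][0]
    for alpha, perf in ordered:
        if perf < reference * (1.0 - tolerance):
            break  # performance has started to decline noticeably
        knee = alpha
    return knee


# Synthetic samples for illustration only.
synthetic = [(1.0, 1.0), (0.8, 1.0), (0.6, 0.99), (0.45, 0.985), (0.4, 0.9), (0.3, 0.7)]
knee_alpha = find_knee(synthetic)
```

With these synthetic samples the knee falls at the last point before the sharp decline, which corresponds to the lowest frequency that still preserves performance.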
VIII. CONCLUSION
The opportunities of task partition and software prefetching for energy optimization of a CPU-GPU heterogeneous embedded system were analyzed through the communication-computing pipeline spatio-temporal diagram. Furthermore, a model for energy optimization of a homogeneous multi-GPU system was established. Based on the model, an algorithm for optimizing the dynamic energy consumption of a homogeneous multi-GPU architecture based on the adaptively adjusted distance of software prefetching was proposed in this article. The algorithm was used for energy optimization of parallel programs. The test results showed that, under the two prefetching strategies (register and shared memory), the dynamic energy consumption decreased by 22.1% and 21.8% (at most), respectively, on the premise of maintaining the program performance through frequency scaling of processors and memories.
Parallel programs exhibited various problems (such as load imbalance and data dependence) during actual executions, so it was difficult to establish models capable of exploring the energy optimization of parallel programs. For this reason, future work will focus on the investigation of the characteristics of program parallelism in order to conduct energy analysis and optimization thereof.
