Energy efficiency is one of the most important metrics in embedded processor design. The use of a wide SIMD architecture is a promising approach to building energy-efficient, high-performance embedded processors. In this paper, we propose a design framework for a configurable wide SIMD architecture that utilizes an explicit datapath to achieve high energy efficiency. The framework is able to generate processor instances based on architecture specification files. It includes a compiler to efficiently program the proposed architecture with standard programming languages including OpenCL. This compiler can analyze the static memory access patterns in OpenCL kernels, generate efficient mappings, and schedule the code to fully utilize the explicit datapath. Extensive experimental results show that the proposed architecture is efficient and scalable in terms of area, performance, and energy. In a 128-PE SIMD processor, the proposed architecture is able to achieve up to 200 times speed-up and reduce the total energy consumption by 50 % compared to a basic RISC processor.
Introduction
Emerging mobile systems such as smartphones are becoming more and more important in our daily lives. The rapid development of semiconductor technology has driven a dramatic increase in the computational power of embedded processors. As a result, mobile devices are able to run more and more high-performance applications such as 4G wireless communication and high-definition video codecs. However, energy efficiency is becoming the bottleneck in high-performance embedded system design, especially for systems running on limited power sources like batteries. The Single Instruction Multiple Data (SIMD) architecture is able to perform the same operation on multiple data items simultaneously, thereby providing high computational throughput with low control overhead. Since many emerging embedded applications contain abundant data-level parallelism (DLP), using wide SIMD processors in such applications is a promising solution [19, 28].
In wide SIMD processors, the register files (RFs) are among the most energy consuming components [17, 19] . In this work, we propose to use a wide SIMD architecture with explicit datapath to reduce the RF energy consumption. By using an explicit datapath that allows the software to directly control the data bypassing, accesses to the RFs can be dramatically reduced. Hence the processor energy consumption is brought closer to the intrinsic energy consumption, i.e., the part of energy that is used by useful computation.
An efficient compiler is key to utilizing the proposed architecture. Code generation for SIMD processors has always been one of the most difficult problems in compiler design. The Open Computing Language (OpenCL) is a standard language for programming heterogeneous parallel platforms. It was initially designed for general-purpose computing on Graphics Processing Units (GPUs) [4, 6], some of which are also wide SIMD processors. Therefore it is also suitable for programming low-energy SIMD processors. However, the proposed architecture poses additional challenges:
- Compared to the flexible crossbar and network in GPGPUs, the interconnect between PEs and the memory system in the proposed architecture only allows a limited form of communication between PEs, which makes efficient mapping of OpenCL kernels challenging.
- The bypassing in the PE datapath is directly controlled by the software, which means the compiler has to handle it in order to generate correct and efficient code.
We propose the design of a compiler that can analyze the OpenCL kernels which have statically analyzable memory accesses. After mapping OpenCL kernels, the compiler performs bypass-aware scheduling to efficiently utilize the explicit datapath in the proposed architecture.
To achieve high energy efficiency, interfaces and tools that enable software-hardware co-design are required. The TCE co-design framework [20] and the Cadence Tensilica Customizable Processor IP [1] are good examples of design frameworks that aim to provide such capability. Based on the proposed architecture and compiler, we propose a hardware/software co-design framework. The framework is able to automate the instantiation of processors based on the proposed SIMD architecture for standard FPGA and ASIC toolflows. Five kernels developed in OpenCL are tested on the proposed framework. The results are compared against a RISC processor and a SIMD architecture without explicit bypassing. The results show that the proposed framework is scalable and is able to achieve high energy efficiency. For the 128-PE configuration, the average speed-up is 85× compared to the RISC reference, which is the same as that of the SIMD processor without explicit bypassing. On average, the proposed architecture with 128 PEs reduces the total energy consumption by 49.5 % compared to the RISC reference, and it consumes 33 % less energy than the SIMD processor with automatic bypassing.
The key contributions of this work are:
- We introduce modifications to a wide SIMD processor architecture with explicit bypassing so that it supports OpenCL kernel mapping efficiently.
- We design a compiler for the proposed architecture. The compiler maps OpenCL kernels and optimizes memory mapping for the proposed architecture. The compiler performs bypass-aware scheduling to fully utilize the explicit datapath.
- We propose a co-design framework. The framework includes an RTL generator that can automatically generate implementations for FPGA and ASIC design flows based on an architecture description.
- Detailed experiments are carried out. The results show that the proposed framework is scalable and is able to achieve substantial improvement in both performance and energy consumption.
Compared to the preliminary version published in [10] , this work makes the following extensions:
- We have a more extensive description of the code generation for the explicit datapath, including the instruction scheduling algorithm.
- We introduce a co-design framework, including the RTL generator.
- The benchmark applications run on three different configurations, which demonstrates the scalability of the proposed framework.
- Compared to [10], the experiments are more detailed. The core energy consumption is obtained by running gate-level simulation, which is more accurate than the energy model in [10].
The remainder of this paper proceeds as follows: Section 2 introduces the background information of explicit datapath architectures and the OpenCL programming language. The proposed architecture is described in Section 3. Section 4 introduces the compiler design and the mapping of OpenCL kernels in the proposed architecture. Section 5 presents the co-design framework for the proposed architecture. Experimental results that show the effectiveness of the proposed design are given in Section 6. Related work is discussed in Section 7. Finally, Section 8 concludes our findings and discusses future work.
Background
In this section the concepts that are essential to this work are introduced. Section 2.1 describes the idea of explicit datapath. The basic knowledge of the OpenCL language is given in Section 2.2.
Processors with Explicit Datapath
The key idea of explicit datapath architectures is to expose more details of the datapath to the software, thereby enabling fine-grained control over the datapath in the software. By having fine-grained control, a considerable amount of redundant data movement can be eliminated, which can potentially result in improvement in performance and energy efficiency. The transport-triggered architecture (TTA) is a prime example of explicit datapath architectures [11, 20] . Figure  1b shows an example sequence of TTA instructions that performs the same operations as the RISC instructions shown in Fig. 1a . In a TTA, the software controls data movement, and operations are side-effects of the data movements. By allowing software to have full control over data movement in the datapath, a TTA is able to reduce the register file port requirements dramatically and improve performance [11, 20] . Recent studies also exploit the explicit datapath in TTAs for building energy efficient processors [9, 26, 29] . However, two common problems can hurt their energy efficiency:
- low code density, and
- a more flexible, and therefore more complex, interconnect between FU/RF input and output ports.
The code density problem can be mitigated by code compression [15] and micro-architecture modification [29]. The interconnect overhead in TTAs can be reduced by interconnect reduction, which is particularly effective in application-specific processor design. Explicit bypassing is an alternative way of building an explicit datapath architecture [8, 12, 13, 14]. The bypassing (forwarding) network in a conventional processor is used to reduce the data hazards caused by pipelining. In an explicit bypassing architecture, this network is exposed to the software. Figure 1c shows a code fragment with explicit software bypassing that performs the same operations as Fig. 1a. In explicit bypass architectures, the ISAs are typically similar to those of conventional processors. The main difference is that part of the internal pipeline state is exposed to the software, thereby enabling a dramatic decrease in redundant register file traffic. Compared to TTAs, explicit bypass architectures offer less flexibility in datapath control. On the other hand, such architectures have less control overhead. In particular, the instruction width is smaller, as the three operand moves for one operation (two sources and one destination) and the opcode can always be specified in one instruction, which requires fewer bits than a typical TTA-based processor.
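To make the register-traffic argument concrete, the following minimal sketch counts RF reads and writes for a short dependent chain under two simplified models. The three-instruction program, the register names, and the rule that only the result of the immediately preceding instruction can be bypassed are illustrative assumptions; they are not the exact semantics of Fig. 1c or of the proposed PE.

```python
# Each instruction is (dest, src1, src2). All names are hypothetical.
PROGRAM = [
    ("t0", "a", "b"),    # t0 = a + b
    ("t1", "t0", "c"),   # t1 = t0 * c
    ("d", "t1", "t0"),   # d  = t1 - t0
]
LIVE_OUT = {"d"}         # values that must survive the fragment


def rf_accesses_conventional(program):
    """Simplified conventional model: every source is read from the RF
    and every result is written back to the RF."""
    reads = 2 * len(program)
    writes = len(program)
    return reads, writes


def rf_accesses_explicit(program, live_out):
    """Simplified explicit-bypass model (assumption): a source produced by
    the immediately preceding instruction is taken from the bypass network,
    and a result is written back only if it is live-out or read by an
    instruction that is not directly adjacent."""
    reads = writes = 0
    for i, (dest, *srcs) in enumerate(program):
        prev_dest = program[i - 1][0] if i > 0 else None
        reads += sum(1 for s in srcs if s != prev_dest)
        needed_later = any(dest in program[j][1:]
                           for j in range(i + 2, len(program)))
        if dest in live_out or needed_later:
            writes += 1
    return reads, writes
```

Under these assumptions, the intermediate value `t1` never touches the RF, which is the kind of saving the explicit datapath is meant to expose to the compiler.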
An explicit datapath is particularly interesting for a wide SIMD architecture. The vector register files often consume a considerable amount of energy in SIMD processors [19]. Meanwhile, the potential control overhead can be amortized over the large number of processing elements (PEs). In this work, explicit bypassing is used to build the PEs of the proposed SIMD architecture.
The OpenCL standard [16] defines:
- a C-based language called OpenCL C that is used to define kernels for performing computation on compute devices;
- a set of APIs in standard C for invoking the kernels from the host and transferring data between the host and compute devices.
In an OpenCL kernel, the workload is divided into work-groups. Each work-group consists of a number of work-items with different indices. Figure 2 illustrates the index space of an OpenCL kernel. In OpenCL kernel semantics, every work-item executes the kernel function independently of other work-items. Different work-items can only synchronize by calling synchronization functions explicitly. So different work-items of the same kernel can be executed in parallel between explicit synchronization points. This model is ideal for wide SIMD architectures because: i) work-items of the same kernel execute the same instruction sequence, which fits SIMD semantics easily; ii) the implicit independence of work-items gives the compiler more freedom to map and schedule them on SIMD processors. The conceptual device architecture in OpenCL is shown in Fig. 3. There are four different address spaces: private, local, global and constant. Each work-item is mapped onto a processing element (PE) and each work-group is mapped onto a compute unit. The different address spaces make the analysis and mapping of communication between work-items easier for wide SIMD architectures.
The two most important aspects of mapping an OpenCL kernel onto a processor are:
- Map and schedule work-items on the PEs of the target architecture.
- Map the different address spaces onto the memory hierarchy of the target architecture.
Wide SIMD Architecture
The wide SIMD processor architecture used in this work is based on the one in [17] . To improve the mapping support for OpenCL programs, some modifications are made. The proposed processor architecture consists of two parts, a control processor (CP) and a wide one dimensional (1-D) array of processing elements (PEs), which run in lock-step. It results in a VLIW processor with one scalar issue slot for the CP and one vector issue slot for the PE array. Figure 4 depicts the proposed architecture. The ISA of both the CP and the PE is based on a 24-bit RISC-like ISA similar to the one used in [8] . Table 1 shows the key features of the baseline ISA.
To adapt to the proposed SIMD architecture, the following modifications are made to the baseline ISA:
- Only the CP can execute control instructions.
- Two bits are added to the instruction format to encode the communication. The first source operand of each instruction may come from one of: i) local RF/bypass; ii) left neighbor; iii) right neighbor; iv) CP broadcasting. More details are given in Section 3.2.
- A two-entry predicate register file is introduced. Two extra bits are used to encode whether each instruction is predicated by predicate register 0, 1 or both. This is essential for mapping OpenCL kernels efficiently, as it simplifies the control flow mapping.
The instruction format of the proposed architecture is shown in Fig. 5 . The size of a 2-issue VLIW instruction packet (1 CP instruction + 1 PE instruction) is 56 bits.
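The stated widths are consistent with a 24-bit base instruction extended by two communication bits and two predicate bits, giving 28 bits per issue slot and 2 × 28 = 56 bits per packet. The sketch below packs and unpacks such a packet. The field order, the selector values, and the assumption that the CP slot uses the same 28-bit extended format as the PE slot are hypothetical; the actual layout is the one given in Fig. 5.

```python
BASE_BITS = 24   # baseline RISC-like instruction
COMM_BITS = 2    # source selector for the first operand
PRED_BITS = 2    # one enable bit per predicate register
SLOT_BITS = BASE_BITS + COMM_BITS + PRED_BITS   # 28 bits per issue slot

# Hypothetical encoding of the first-operand source selector.
COMM_LOCAL, COMM_LEFT, COMM_RIGHT, COMM_CP = range(4)


def pack_slot(base, comm, pred):
    """Pack one issue slot as [pred | comm | base] (assumed field order)."""
    assert 0 <= base < (1 << BASE_BITS)
    assert 0 <= comm < (1 << COMM_BITS)
    assert 0 <= pred < (1 << PRED_BITS)
    return (pred << (BASE_BITS + COMM_BITS)) | (comm << BASE_BITS) | base


def unpack_slot(word):
    """Inverse of pack_slot: return (base, comm, pred)."""
    base = word & ((1 << BASE_BITS) - 1)
    comm = (word >> BASE_BITS) & ((1 << COMM_BITS) - 1)
    pred = (word >> (BASE_BITS + COMM_BITS)) & ((1 << PRED_BITS) - 1)
    return base, comm, pred


def pack_packet(cp_slot, pe_slot):
    """Combine one CP slot and one PE slot into a 2-issue VLIW packet."""
    return (cp_slot << SLOT_BITS) | pe_slot
```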
Explicit Datapath
In the proposed architecture, the datapath of the PE uses explicit bypassing. The datapaths of the 4-stage and 5-stage PE are shown in Fig. 6 . Compared to conventional architectures, instructions in the proposed architecture have more control over input operands and destinations:
- An instruction directly specifies whether an input operand comes from the RF or from one of the bypassing sources.
- Each function unit (FU) in the datapath has separate input registers that ensure the result of the FU remains stable until the next operation is issued to the FU, resulting in more bypassing opportunities.
- Each instruction can control whether the result needs write-back. Three options are available:
  - No write-back: the result is only available at the FU output;
  - To WB stage: the result is written to the pipeline register in the write-back stage, but not to the RF;
  - To RF: the result is stored into the RF.
To use the explicit bypass without changing the instruction format, part of the RF index space is used for the bypassing sources. As a result, the number of registers in the RF is reduced from 32 to 28 (4-stage) or 27 (5-stage). The impact of a smaller RF is mitigated by the fact that in many cases there is no need to allocate registers for short-lived variables, as shown in the example in Fig. 1c.
Circular Neighborhood Communication Network
Unlike SIMD architectures with small vectors, a fully connected shuffling network is not scalable in a wide SIMD processor. In the proposed architecture, a one dimensional (1-D) neighborhood network is used. Figure 7 shows the architecture of the network. Each PE can only communicate with its left and right neighbors. The network is circular, i.e., the first and last PEs can be neighbors. The connection between the first PE and the last PE does not introduce extra long wires, since in physical layout the PEs can be placed in a circular manner. The PE accesses the network in the decode stage where source operands are collected for an instruction. The first operand of each instruction comes from either the local RF/Bypass or the communication network. Two bits in each instruction are used to indicate which source it is using.
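The behavior of the circular neighborhood network can be sketched with a list that holds one value per PE. This is only a functional model: the convention that the left neighbor of PE i is PE i-1 is an assumption, and the function names are hypothetical.

```python
def shift_from_left(values):
    """Every PE reads the value of its left neighbor in one step.
    PE 0 reads from the last PE because the network is circular."""
    n = len(values)
    return [values[(i - 1) % n] for i in range(n)]


def shift_from_right(values):
    """Every PE reads the value of its right neighbor in one step.
    The last PE reads from PE 0."""
    n = len(values)
    return [values[(i + 1) % n] for i in range(n)]


def ring_distance(src, dst, num_pes):
    """Minimum number of single-hop shifts needed to move data from
    PE src to PE dst, going around the ring in the shorter direction."""
    d = (dst - src) % num_pes
    return min(d, num_pes - d)
```

With 128 PEs the worst-case distance in this model is 64 hops, which illustrates why the compiler tries to keep communication local.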
The CP can be an extra node in the PE communication network, allowing data exchange between the scalar datapath and the vector datapath. In addition, the CP is able to broadcast data to all PEs, allowing the CP to perform calculations that are common to all PEs, which is more energy efficient.
Accelerator Architecture
Figure 8 depicts a generic architectural template for using the proposed architecture as an accelerator. The slave interface allows the host to have direct access to all the memories of the accelerator. For more efficient data transfer, a direct-memory-access (DMA) controller is used to move data between external memory and the memories of the accelerator.
Figure 9 shows the process to compile an OpenCL program that runs on the platform depicted in Fig. 8. The host code is compiled by the native compiler of the host processor. The device code is compiled by a retargetable compiler designed for the proposed wide SIMD architecture. Figure 10 shows the flow of the device compiler. The frontend of the compiler is based on the open-source LLVM compiler framework [5]. The frontend produces a low-level [...]
For work-groups with a size that is not aligned to the number of PEs, predication is used to guard the execution. Figure 11 illustrates how the control flow of a kernel is converted. For work-group synchronization barriers, this work uses an approach similar to [21]. The work-group loop is split at each barrier, which ensures that all work-items finish the work prior to the barrier before continuing.
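The work-group loop with predication and barrier splitting can be emulated as follows. This is a functional sketch only: the kernel is assumed to be already divided into barrier-delimited phases, the inner loop over PEs stands in for lock-step vector execution, and all function names are hypothetical.

```python
def run_work_group(phases, group_size, num_pes):
    """Emulate one work-group on num_pes PEs.

    phases: list of functions f(work_item_id, state), one per region
            between barriers. Splitting the loop per phase guarantees
            that all work-items finish a phase before the next starts.
    """
    state = {}
    trace = []
    for phase_idx, phase in enumerate(phases):        # loop split at barriers
        for base in range(0, group_size, num_pes):    # CP-controlled loop
            for pe in range(num_pes):                 # lock-step PE array
                wi = base + pe
                predicate = wi < group_size           # guards the partial last pass
                if predicate:
                    phase(wi, state)
                    trace.append((phase_idx, wi))
    return state, trace


def write_phase(wi, state):
    """Phase before the barrier: each work-item writes one local value."""
    state.setdefault("local", {})[wi] = wi * 10


def read_phase(wi, state):
    """Phase after the barrier: each work-item reads its neighbor's value."""
    n = len(state["local"])
    state.setdefault("out", {})[wi] = state["local"][(wi + 1) % n]
```

For example, `run_work_group([write_phase, read_phase], group_size=6, num_pes=4)` needs two passes per phase, and the second pass runs with only two of the four PEs enabled.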
Code Generation
Memory mapping is an important step in mapping OpenCL kernels. Figure 12 shows the address space of the PE data memory in the proposed architecture. Table 2 shows the mapping of different address spaces in OpenCL to the memory hierarchy of the proposed architecture. The private memory space is only accessible by each work-item. Therefore it is natural to map data in private memory to the memory bank of each PE. Global and local data may be accessed by different PEs, and they are mapped to the vector data memory. By default, the compiler maps global and local data arrays linearly onto the vector data memory, i.e., in row-major order. The method to determine the physical location of an address in a linearly-mapped array is shown in Fig. 12 . If the launch parameters and memory accesses in the kernel can be statically analyzed, the compiler calculates the exact location and the inter-PE communication required for each access.
A series of data shifts over the neighborhood network is inserted if the data needed by a PE is not in the memory bank of that PE. The distance of the communication can be optimized by changing the memory mapping, which is described in Section 4.2. The data in the constant memory can be mapped to the memory of the CP if all PEs access the same data in each loop iteration. Otherwise it is mapped as constant data in the memory banks of each PE.
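A minimal sketch of the default linear mapping and the resulting shift count is given below. It assumes that element a of an array starting at offset 0 is stored in bank a mod N at row a div N, where N is the number of PEs; the exact method used by the compiler is the one in Fig. 12, and the function names here are hypothetical.

```python
def linear_location(addr, num_pes):
    """Return (bank, row) of a linear address under the assumed
    row-major mapping across the PE memory banks."""
    return addr % num_pes, addr // num_pes


def shifts_needed(requesting_pe, addr, num_pes):
    """Number of neighbor-to-neighbor shifts needed to bring the data at
    addr to requesting_pe over the circular network (shorter direction)."""
    bank, _ = linear_location(addr, num_pes)
    d = (bank - requesting_pe) % num_pes
    return min(d, num_pes - d)
```

When both the address expression and the launch parameters are known statically, a value such as `shifts_needed(4, 6, 128)` can be evaluated at compile time and turned into a fixed sequence of shift instructions.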
To generate code for a complete OpenCL program for the system shown in Fig. 8 , the proposed toolflow requires the complete sources of both the host and the device to be available at compile time. The host processor controls the kernel launching. For each kernel, the host processor sends the launch parameters to the accelerator. The input data and instructions are copied to the local memory of the accelerator by the DMA controller. After the kernel execution, the output data is copied out by the DMA controller.
Handling Generic Communication
If the compiler fails to analyze the communication parameters statically, generic communication is required. A generic communication loop for a store operation is shown in Algorithm 1. The upper bound of the distance U is the maximum distance in number of hops in the communication network, which is the number of PEs in the worst case. The loop for a load operation is shown in Algorithm 2. For a load, the loop always has #PE+1 iterations, as each data item has to reach the node that requests it. Though the load/store loop is not efficient, it allows mapping generic memory accesses on the proposed architecture, resulting in more flexibility. For generic kernels, the work-items are mapped onto the PE array linearly, i.e., the i-th work-item is mapped to PE (i mod N), where N is the number of PEs. If the index space has more than one dimension, only the first dimension is mapped to the vector, and the others are controlled by CP loops.
In the PE array of the proposed architecture, each PE can only communicate with its left and right neighbors. Therefore long-distance communication is not efficient. In this work, interleaved mapping and data layout are used to reduce the communication distance. Figure 13 shows an example of the interleaved mapping. The index in Fig. 13 represents both the work-item and the linear address of the data. Each work-item needs to access data in a window of size 5, e.g., work-item 4 needs to access data at addresses 2 to 6. If a linear mapping is used, each work-item needs to communicate with PEs two steps away, which is less efficient in the proposed architecture. On an architecture with a fully-connected network (e.g., a crossbar), such communication can be done in one step. By using a mapping with an interleaving factor of 2, as shown in Fig. 13, the maximum communication distance is 1 instead of 2. As a result, the proposed architecture can communicate in one step, which is the same as a fully-connected architecture.
Each OpenCL memory buffer object is analyzed by its access patterns. If there are different interleaving factors for multiple accesses (from the same or different kernels), the biggest one is used as the actual interleaving factor. The interleaving information is also used to generate the host code for proper data transfer between the system memory and the local memory of the accelerator. The host processor programs the DMA controller according to the interleaving factors determined by the device compiler. In the current implementation, kernels have to be compiled off-line in order to use the interleaved mapping.
When a kernel is mapped with an interleaving factor N > 1, the work-group loop has to be unrolled N times in order to handle the irregular communication pattern, which may introduce energy overhead. For example, in Fig. 13, work-item 4 needs two samples from left and one from right, while work-item 5 needs one from left and two from right. Therefore they need different instructions and the kernel has to be unrolled.
Figure 13: Example of interleaved mapping with factor of 2
The limitation of the interleaved mapping and layout is that it requires that: i) all address expressions for the global and the local memory can be analyzed statically; ii) the kernel launch parameters are compile-time constants, or are chosen from a set of compile-time constants.
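The window example of Fig. 13 can be reproduced with a small model. The sketch assumes that, with interleaving factor F, index i (work-item and data address alike) is placed on PE (i div F) mod N; this placement is an inference that is consistent with the left/right sample counts quoted above, and the function names are hypothetical.

```python
def pe_of(index, num_pes, factor=1):
    """PE that holds work-item/data `index` for a given interleaving
    factor (factor=1 is the plain linear mapping)."""
    return (index // factor) % num_pes


def signed_offset(src_pe, dst_pe, num_pes):
    """Signed hop count from src_pe to dst_pe on the ring:
    negative means to the left, positive means to the right."""
    d = (dst_pe - src_pe) % num_pes
    return d if d <= num_pes // 2 else d - num_pes


def neighbor_counts(work_item, window, num_pes, factor):
    """Return (from_left, local, from_right): how many samples of the
    window centered on work_item live left of, on, or right of its PE."""
    half = window // 2
    home = pe_of(work_item, num_pes, factor)
    left = local = right = 0
    for addr in range(work_item - half, work_item + half + 1):
        off = signed_offset(home, pe_of(addr, num_pes, factor), num_pes)
        if off < 0:
            left += 1
        elif off > 0:
            right += 1
        else:
            local += 1
    return left, local, right


def max_window_distance(num_items, window, num_pes, factor):
    """Largest hop distance any work-item needs to fetch its window."""
    half = window // 2
    worst = 0
    for wi in range(half, num_items - half):
        home = pe_of(wi, num_pes, factor)
        for addr in range(wi - half, wi + half + 1):
            off = signed_offset(home, pe_of(addr, num_pes, factor), num_pes)
            worst = max(worst, abs(off))
    return worst
```

In this model, work-items 4 and 5 share a PE but see different left/right splits of their windows, which is why the unrolled loop body needs a distinct instruction sequence for each of them.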
Code Generation for Explicit Datapath
After mapping the kernel, the compiler treats the OpenCL kernel as a normal program and proceeds with the compilation. Figure 10 shows the flow of the compiler backend in this work.
To generate correct and energy efficient code, the compiler has to be aware of the explicit bypassing. In particular, the instruction scheduling has to be optimized for increasing the opportunities for bypassing.
In this work, a basic-block level list scheduler is used. A basic block is represented by a Data Flow Graph (DFG) G(V, E), where:
- V is a set of nodes. Each node in V represents either an actual operation or a live-in variable.
- E is a set of directed edges. An edge e = (u, v) represents that node v depends on u; the dependency can be either a true data dependency or a false dependency.
In the proposed architecture, the register pressure in many kernels is higher than in conventional architectures for two reasons: i) part of the RF index space is used by the bypassing sources, resulting in a smaller RF; ii) because only an 8-bit immediate value can be encoded in an instruction, more registers may be needed to hold long immediate values.
Prior to scheduling, the DFG is partitioned into subgraphs using Algorithm 3. Each partition p can be represented by a tree structure, in which all nodes except the root are used only within p. The fact that nodes in p are consumed locally is helpful in the proposed explicit bypassing architecture: if nodes in p are scheduled close to each other, few or even no registers are required to store the intermediate results, such as the example in Fig. 14. To maximize the bypassing, nodes in p are numbered by running the Sethi-Ullman algorithm on the tree that represents p [24] .
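The partitioning idea can be sketched as follows. This is not Algorithm 3 itself: it assumes a simple rule in which a node with exactly one use joins the partition of its consumer, and every other node (no uses, several uses, or live-out) becomes the root of its own partition. The Sethi-Ullman labels are then computed over the children that belong to the same partition. The DFG encoding and all names are hypothetical.

```python
from collections import defaultdict


def partition_into_trees(operands, live_out=()):
    """operands: dict mapping every node to the list of nodes it reads
    (live-in variables map to an empty list).
    Returns a dict mapping each node to the root of its partition."""
    consumers = defaultdict(list)
    for node, srcs in operands.items():
        for s in srcs:
            consumers[s].append(node)   # a value read twice counts as two uses

    root_of = {}

    def find_root(node):
        if node in root_of:
            return root_of[node]
        users = consumers[node]
        if len(users) == 1 and node not in live_out:
            root = find_root(users[0])  # consumed locally: join the consumer
        else:
            root = node                 # shared, unused, or live-out value
        root_of[node] = root
        return root

    for node in operands:
        find_root(node)
    return root_of


def sethi_ullman(node, operands, root_of):
    """Sethi-Ullman label of node, restricted to its own partition tree."""
    kids = [s for s in operands[node] if root_of[s] == root_of[node]]
    if not kids:
        return 1
    labels = sorted((sethi_ullman(k, operands, root_of) for k in kids),
                    reverse=True)
    return max(label + i for i, label in enumerate(labels))


# Small example DFG: t3 = ((a + b) - (c * d)) + a
EXAMPLE = {
    "a": [], "b": [], "c": [], "d": [],
    "t0": ["a", "b"],
    "t1": ["c", "d"],
    "t2": ["t0", "t1"],
    "t3": ["t2", "a"],
}
```

In the example, `a` is read by two operations and therefore forms its own partition, while every other node ends up in the tree rooted at `t3`.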
In this work, the scheduler works in a bottom-up fashion, i.e., a node v in G is scheduled after all nodes that depend on it have been scheduled. There are two main advantages in using bottom-up scheduling: i) branch delay slots can be easily handled by first scheduling the branch in the second-to-last cycle, as it is always ready to be scheduled in bottom-up scheduling; ii) the scheduler knows precisely whether the result of the instruction to be scheduled needs to be written back to the register file.
Algorithm 4 depicts the scheduling algorithm used in this work. When selecting the next instruction (line 11), the scheduler tries to keep nodes in the same partition close together. Within a partition, nodes are ordered by their Sethi-Ullman numbers.
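A much simplified bottom-up list scheduler with partition affinity might look as follows. It is a sketch rather than Algorithm 4: it assumes a single issue slot, unit latencies and no resource constraints, and it takes the partition of each node and a tie-breaking rank as inputs. How Sethi-Ullman numbers translate into that rank is deliberately left open, and all names are hypothetical.

```python
def bottom_up_schedule(operands, partition, rank):
    """Return the nodes in final program order.

    operands:  dict node -> list of nodes it reads
    partition: dict node -> partition identifier
    rank:      dict node -> number; among candidates in the same
               partition, the smallest rank is picked first
    """
    consumers = {n: set() for n in operands}
    for node, srcs in operands.items():
        for s in srcs:
            consumers[s].add(node)

    pending_users = {n: len(consumers[n]) for n in operands}
    # Bottom-up: a node is ready once all of its consumers are scheduled.
    ready = [n for n in operands if pending_users[n] == 0]
    reverse_order = []
    last_partition = None

    while ready:
        # Prefer the partition of the most recently scheduled node,
        # then the rank, then the name for determinism.
        ready.sort(key=lambda n: (partition[n] != last_partition, rank[n], n))
        node = ready.pop(0)
        reverse_order.append(node)
        last_partition = partition[node]
        for s in set(operands[node]):
            pending_users[s] -= 1
            if pending_users[s] == 0:
                ready.append(s)

    return reverse_order[::-1]   # the pass ran from the last instruction upward


# Two independent partitions, P and Q.
OPS = {"t0": [], "t1": [], "t2": ["t0", "t1"], "u0": [], "u1": ["u0"]}
PART = {"t0": "P", "t1": "P", "t2": "P", "u0": "Q", "u1": "Q"}
RANK = {n: 0 for n in OPS}
```

In the example, the nodes of each partition end up adjacent in the final order, so a producer is followed closely by its consumer and the intermediate value can be bypassed instead of being written to the RF.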
The operand bypassing state is set by scanning the scheduled machine code. However, the schedule may change during register allocation if there is spilling. In that case, the operand bypass state initialization and the register allocation need to be run again. This process is illustrated by the loop in Fig. 10. In practice, the loop usually terminates quickly, because spill and reload code on the proposed architecture does not need extra registers thanks to the explicit bypassing.
Co-Design Framework
Figure 15 shows the structure of the proposed design flow. The architecture specification is given by the user in a JSON structured document. The specification file is used by three components in the framework: a compiler (described in Section 4), an RTL generator and a cycle-accurate simulator.
The automatic RTL generator is depicted in Fig. 16 . RTL codes are generated according to architecture parameters, including:
- The number of pipeline stages in the control processor (CP) and the processing elements (PEs).
In addition, the generator provides various options for different architectures and target technologies:
- RTL code is optimized for different target technologies. For example, when targeting FPGAs, it is much more efficient to implement small memories like the register files in the look-up tables of the FPGA.
- For a Xilinx FPGA target, the generator generates system wrapper files, enabling the processor to be integrated as a peripheral in a multi-core processing system.
- For an ASIC target, a testbench that simulates the behavior of the host and the system interface is generated, along with auxiliary files for synthesizing and simulating the core in a standard ASIC design flow. Area, timing and power can be accurately estimated for different technology libraries.
Figure 14: Partitioning the DFG
The cycle-accurate simulator is able to simulate the program at much higher speed than the RTL simulation, which is particularly helpful for software development.
Experimental Results
In this work, the metrics used to evaluate the proposed architecture are performance and energy consumption. The kernels used for evaluation are listed in Table 3 . The experimental setup is described in Section 6.1. The performance and energy results are shown in Section 6.3 and Section 6.4, respectively. 
Experimental Setup
To evaluate the performance and energy efficiency of the proposed design, the kernels in Table 3 are tested on two types of processors: one with explicit bypassing and one with automatic bypassing. Three different settings are tested: 32 PEs, 64 PEs and 128 PEs. Other important parameters of the two architectures are listed in Table 4.
The kernel code written in OpenCL is compiled off-line by the proposed compiler, i.e., at compile time. The generated binaries are executed on the cycle-accurate simulator for the SIMD processor to collect statistics. Due to the lack of an implementation of a complete platform as shown in Fig. 8, the host processor is emulated by a test driver program. Therefore, parts of the host overhead, e.g., the cost of re-organizing the data layout, are not reflected in the results in this section. Sequential code written in C is compiled and run on a 4-stage 32-bit RISC processor with 24-bit instructions as the reference (RISC). The RISC processor has a 12 kB (24-bit) instruction memory and a 16 kB (32-bit) data memory.
The core area of the different processors used in the experiments is shown in Table 5. The results show that with the neighborhood network, the proposed architecture is scalable. Note that compared to cores with automatic bypassing, cores with explicit bypassing have smaller physical register files and simpler bypassing logic. As a result, the core area is reduced by 4.6 %, 3.9 % and 4.0 % for the 32-PE, 64-PE and 128-PE configurations, respectively.
Performance
In the FIR, Sobel and YUV2RGB kernels, the speed-up is larger than the number of PEs. This is because the proposed architecture is able to exploit instruction-level parallelism by using two issue slots. It is also shown that by using the interleaved mapping described in Section 4.2, the FIR kernel gains an extra 23.4 % in performance for 128 PEs. The performance of the matrix transpose kernel is considerably worse than that of the other kernels. The main reason is that matrix transpose requires long-distance communication. Since only neighborhood communication is possible, such a kernel is not efficient on the proposed architecture.
Energy
In this work, the energy of both the core and the memory is considered. The 7 processors (3 automatic bypassing SIMD + 3 explicit bypassing SIMD + 1 RISC) used in the experiments are generated by the RTL generator described in Section 5, and synthesized with a TSMC 40 nm low-power library. The core energy consumption is estimated using the physical information in the technology library and the circuit toggle rates generated by post-synthesis simulation on the gate-level netlist. The energy consumption of the memory is estimated with CACTI [7]. Table 6 shows the energy consumption of different types of accesses.
To compare scalar and vector register accesses, an access to the register file of one PE is considered equivalent to an access to the scalar register file. Figure 18 shows the number of register accesses. For the SIMD processor with automatic bypassing, the reduction in register accesses comes mostly from the fact that the control-related instructions are run on the CP, which greatly reduces the number of register accesses in these instructions. It is clear that explicit bypassing has a dramatic impact on the number of register accesses. In particular, almost all register accesses are eliminated in MAdd and FIR. In the transpose kernel, the SIMD processor with automatic bypassing has many more register accesses than RISC due to the communication. In contrast, the processor with explicit bypassing is able to eliminate all register accesses in the communication, resulting in a decrease of over 70 % in the number of accesses.
Figure 19 shows the normalized energy results. Both SIMD processors benefit from the reduction in instruction memory accesses, which is a natural advantage of SIMD architectures. However, it is also clear that the reduction in register accesses has a noticeable impact on the energy consumption. Compared to RISC, the explicit bypassing SIMD processors with 32, 64 and 128 PEs reduce the energy consumption by 51.6 %, 49.9 % and 49.5 %, respectively. The average energy improvement of explicit bypassing over automatic bypassing is 26.6 %, 30.6 % and 33.2 % for 32, 64 and 128 PEs, respectively. As mentioned in Section 4.2, the interleaved mapping introduces a small overhead in the number of register accesses. The reason is that in the unrolled kernel, samples need to be stored in registers for the following work-item, preventing the elimination of write-back. Overall, it still achieves an energy reduction compared to the non-interleaved version.
In all, the proposed architecture and compiler are able to achieve substantial improvement in both performance and energy consumption. The proposed architecture is highly scalable in terms of area, performance and energy.
Figure 19 Energy consumption (normalized to RISC)
Related Work
Wide SIMD architectures are used in many embedded processors. The Xetal from NXP [3] is an SIMD processor with 320 PEs that is designed for smart camera data processing. The PEs in Xetal are connected by a neighborhood network. However, due to the lack of a register file, the energy consumption of the vector memory is high. He et al. addressed this problem in Xetal-Pro by introducing an extra level of memory as well as aggressive voltage scaling, resulting in a much more efficient architecture [28, 30]. The IMAP from NEC [2] is another example of a wide SIMD processor. It has 128 PEs connected by a ring network. A key difference between IMAP and Xetal is that IMAP has independent address generation for each PE. While the memory is more complex in such a configuration, it also results in much better programmability. Woh et al. proposed AnySP, a wide SIMD processor targeting wireless and multimedia applications [19]. The PE interconnect in AnySP is a reconfigurable RAM-based crossbar, which is more flexible than the interconnects of Xetal and IMAP. In [19], the vector register file is also identified as one of the biggest contributors to the energy consumption of a wide SIMD processor, and the authors propose to eliminate register file accesses by utilizing bypassing (forwarding). In the implementation, AnySP introduces an extra small 4-entry register file for the most frequently used variables. The architecture proposed in this work is similar to Xetal-Pro [28]. The main differences are that the proposed architecture uses a per-PE register file with local index addressing, and a PE datapath with explicit bypassing. The results show that the proposed architecture achieves high energy efficiency.
Reducing the energy consumption of the register file has long been considered important for improving processor energy efficiency [19, 29]. Park et al. presented a greedy algorithm for bypass-aware scheduling on a modified RISC processor [25]. For architectures with an explicit datapath, the compiler has a great impact on the energy efficiency. She et al. proposed a graph-based model for scheduling MOVE-Pro, a variant of TTA [9]. Guzma et al. presented an algorithm for optimizing RF accesses in the TCE framework using compiler optimization [27]. In this work, we proposed an algorithm that partitions the DFG prior to running the list scheduler. The DFG partitioning algorithm in this work is similar to the one used by Govindarajan et al. [22]. The main difference is that, to better utilize the explicit bypassing feature, the algorithm in this work partitions the DFG into tree-like sub-graphs rather than the linkages used in [22].
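To make the idea of tree-like partitioning concrete, the following minimal sketch splits a DFG into trees using a common decomposition: a node with exactly one consumer joins the tree of that consumer, while a node with zero or several consumers becomes a tree root. This is an illustration under stated assumptions and not the partitioning algorithm of this work or of [22]; the function name `partition_into_trees` and the dictionary-based graph encoding are hypothetical, and the graph is assumed to be acyclic with every node present as a key.

```python
from typing import Dict, List


def partition_into_trees(consumers: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each tree root to the nodes that belong to its tree.

    `consumers[n]` lists the nodes that use the value produced by `n`.
    Assumes an acyclic graph in which every node appears as a key.
    """
    def root_of(node: str) -> str:
        # Follow single-consumer edges until reaching a node whose value
        # is used by zero or by several consumers.
        while len(consumers[node]) == 1:
            node = consumers[node][0]
        return node

    trees: Dict[str, List[str]] = {}
    for node in consumers:
        trees.setdefault(root_of(node), []).append(node)
    return trees


# m = a * b is used by both s and t, so it closes its own tree.
dfg = {
    "a": ["m"], "b": ["m"], "m": ["s", "t"],
    "c": ["s"], "s": [], "t": [],
}
for root, nodes in partition_into_trees(dfg).items():
    print(root, nodes)
```

Inside each tree every intermediate value has a single consumer, so a scheduler that keeps producer and consumer close together has a chance to forward the value without a register file access. Values at tree roots with several consumers are the more likely candidates for a write-back.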
Programming wide SIMD architectures has always been difficult. IMAP uses a dedicated C dialect called one-dimension C (1DC) to develop data-parallel processing programs [2]. Languages like 1DC can be fine-tuned for the target architecture and are used by similar architectures such as Xetal, but they lack portability and are not compatible with standard languages. OpenCL is a standard parallel programming language for heterogeneous platforms [16]. It was initially designed for GPGPU architectures [4, 6], and it can also be mapped efficiently to general-purpose CPUs [23]. Recent studies attempt to use OpenCL for more diverse target architectures, such as FPGA [18] and ASIP [21]. In [10], She et al. proposed OpenCL code generation for wide SIMD architectures. In this work, we improved on the work in [10] and presented a complete design framework for generating and programming scalable wide SIMD architectures with an explicit datapath.
Conclusions and Future Work
In this work, an energy-efficient wide SIMD processor design framework is introduced. The proposed framework is based on an SIMD processor architecture with a scalable neighborhood communication network and an explicit datapath. An OpenCL compiler design is proposed for the target architecture. The compiler is able to analyze OpenCL kernels and map them onto the proposed architecture. A bypass-aware scheduling algorithm is proposed to efficiently utilize the explicit bypassing feature. The proposed framework is able to generate implementations for different target technologies, including FPGA and ASIC. Experimental results show that the framework is scalable and efficient. Compared to the RISC reference, a 128-PE instance achieves an average speed-up of 85× and reduces the average total energy consumption by 51.9 %.
Future work includes more sophisticated analysis and optimization of the memory layout and work-item scheduling. In particular, handling less static address expressions would be useful for generalizing the proposed architecture. It would also be interesting to perform software-hardware exploration to improve the inter-PE communication in an energy-efficient way, such that kernels with irregular and long-distance communication patterns become easier to map and run more efficiently. In addition, further changes in the architecture, e.g., clustering PE memory banks to reduce memory energy, also require adaptation of the compiler.
