Abstract-In this paper, a novel parallel DSP platform based on a master-multi-SIMD architecture is introduced. The platform is named ePUMA [1]. The essential technology is to use separate data access kernels and algorithm kernels, and to minimize the communication overhead of parallel processing by running the two types of kernels in parallel. The ePUMA platform is optimized for predictable computing. According to benchmarking results, a memory subsystem design that relies on regular and predictable memory accesses can dramatically improve performance. Because the platform is scalable, the chip area is estimated for different numbers of co-processors. The aim of the ePUMA parallel platform is to achieve low-power, high-performance embedded parallel computing with low silicon cost for communications and similar signal processing applications.
I. INTRODUCTION
Parallel processing has emerged as a way to meet the ever-increasing computing demands of real-time signal processing. Existing parallel solutions are mostly based on general multi-core platforms with a cache-coherent programming model [2] or on custom implementations with an application-specific hardware design [3]. General solutions are not cost and energy efficient for embedded systems. Custom solutions are excellent only for a selection of applications. Parallel architectures with on-chip scratchpad memory and very large register files have been developed [4]. However, large register files consume much power. Recently, the master-multi-SIMD architecture has emerged in high-performance parallel computing, for example the Cell processor from STI [5]. The Cell architecture provides up to hundreds of GOPS of computing power for a wide range of applications. However, the power consumption of Cell is not satisfactory for some applications.
From the computational perspective, most high-performance embedded algorithms and applications are based on predictable computing kernels. Most of the computations use regular and repetitive data accesses. For example, the memory access patterns for video coding are analyzed in [6]. The data accesses tend to be data independent, meaning that the locations of the data in memory do not depend on the data values and can be determined beforehand. Therefore, when designing a parallel platform for such applications, it is important that both the hardware architecture and the software programming model rely on regular and predictable memory access. The ePUMA project at Linköping University aims at providing such a parallel platform optimized for predictable computing. Its essential technology is to separate data access from arithmetic computing and run them in parallel to save execution time.
In a recent publication [7] we presented a white paper on the ePUMA project. The programming flow was described in detail in the white paper. The originally proposed multi-core architecture had one master controller, eight SIMD processors, two ring buses, and two DMA controllers. In this paper, a new interconnection network is designed to replace the previous two buses. The new architecture is presented in Section II. Section III covers the essentials of the memory subsystem design. Section IV gives the chip area estimation and Section V presents further benchmarking results. The paper ends with a conclusion and a discussion of future work in Section VI.
II. MASTER-MULTI-SIMD ARCHITECTURE
The ePUMA master-multi-SIMD architecture is illustrated in Figure 1. It consists of one master controller, eight SIMD co-processors, and a memory subsystem for on-chip communication. The master processor executes the sequential tasks of an application algorithm, while the SIMD cores run the parallelizable portion of the algorithm. For embedded streaming applications, the sequential task usually accounts for about 10% of the MIPS cost of the algorithm.
Each SIMD has a local program memory (PM) and a local data memory (DM). The DM is a vector memory that can exchange data with main memory through the central DMA controller. Vector data from one SIMD can also be sent to any other SIMD(s) over the packet-based interconnection network, which has eight switching nodes.
III. MEMORY SUBSYSTEM
The goal of the memory subsystem design is to minimize the data access cost for the SIMD cores. It should hide the communication overhead as much as possible so that the execution time approaches the computing time of the arithmetic instructions. This is possible for regular and predictable memory access patterns.
A. The memory hierarchy
The ePUMA memory hierarchy is compatible with the OpenCL specification. OpenCL (Open Computing Language) is a programming framework for heterogeneous parallel platforms and will be used in the software tool-chain of ePUMA. As illustrated in Figure 2, the memory hierarchy consists of three layers. The highest layer is the off-chip main memory, which runs at a low clock rate and has the longest access latency from the processing cores. The local memory, as the second-level computing buffer, includes both data memory and program memory. The master controller uses two data memories and a cache as the local program memory. The SIMD processors use an eight-bank vector memory as the local data buffer and a simple scratchpad memory for the program. The lowest layer in the memory hierarchy is the register files in the master and SIMD processors.
It takes two steps to move vector data from main memory to the SIMD data-path across the memory hierarchy. The first step is to load the data from main memory into the SIMD local vector memory. This is done by a DMA transaction issued by the master controller. To hide this communication time, ePUMA implements a ping-pong buffer in the local vector memory. In this way the DMA transaction is overlapped with the SIMD computing. The second step is to load the data from the local vector memory into the SIMD registers.
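The effect of the ping-pong buffer on the first step can be illustrated with a small cycle-count model. This is only a sketch under stated assumptions: the function names, the fixed per-block DMA and compute costs, and the cycle numbers in the usage lines are hypothetical and are not taken from the ePUMA simulator or benchmarks.

```python
def serial_cycles(num_blocks, dma_cycles, compute_cycles):
    """Total cycles when every DMA load must finish before computing starts."""
    return num_blocks * (dma_cycles + compute_cycles)


def ping_pong_cycles(num_blocks, dma_cycles, compute_cycles):
    """Total cycles when the DMA fills one buffer while the SIMD uses the other.

    Assumed model: the first load cannot be hidden, every following load is
    overlapped with the computation on the previous block, and the last block
    only needs to be computed.
    """
    if num_blocks == 0:
        return 0
    overlapped_steps = num_blocks - 1
    return (dma_cycles
            + overlapped_steps * max(dma_cycles, compute_cycles)
            + compute_cycles)


# Hypothetical numbers: 4 blocks, 100 DMA cycles and 150 compute cycles each.
print(serial_cycles(4, 100, 150))     # 1000
print(ping_pong_cycles(4, 100, 150))  # 700
```

In this model, when the DMA transfer of a block is shorter than the computation on a block, only the first load remains visible and the total approaches the pure computing time (600 cycles in the hypothetical case above). When the DMA transfer is the longer of the two, it dominates each overlapped step instead.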
It is also beneficial if the SIMD load instruction can run in parallel with the arithmetic instructions. The ePUMA platform implements SIMT (Single Instruction Multiple Tasks) [8] instructions to achieve this. A SIMT instruction is a task-level instruction, usually an iterative loop function, and is handled by an FSM in the data-path. With SIMT instructions, the run-time cost of loading data into the register file can become negligible.
[Figure labels: Table 1, Vector AGU 1, Permutation Table 2, Vector AGU 2]
The data is stored back to the main memory in the reverse process. Data access is overlapped with SIMD computing in the two steps in the same way as for loading data.
B. SIMD local memory and data permutation
The SIMD local memory uses a group of scratchpad memory blocks instead of a cache. Three vector memories are provided for each SIMD; each one is composed of eight single-port SRAMs. At run time, two vector memories are accessible from the SIMD data-path, and the third one can exchange data with either the main memory or other SIMD(s). A software-controlled switch circuit is implemented to achieve the ping-pong swapping.
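The role swapping among the three vector memories can be sketched as follows. The class and method names are hypothetical, and the model only tracks which memory is attached where; it is an illustration of the idea, not a description of the actual switch circuit.

```python
class VectorMemorySwitch:
    """Hypothetical model of the software-controlled switch over three vector memories."""

    def __init__(self):
        # Indices of the two vector memories visible to the SIMD data-path.
        self.datapath = [0, 1]
        # Index of the vector memory attached to the DMA or to other SIMDs.
        self.transfer = 2

    def swap(self, datapath_slot):
        """Exchange the transfer memory with one of the two data-path memories."""
        released = self.datapath[datapath_slot]
        self.datapath[datapath_slot] = self.transfer
        self.transfer = released


switch = VectorMemorySwitch()
switch.swap(0)  # the freshly loaded memory 2 becomes visible to the data-path
print(switch.datapath, switch.transfer)  # [2, 1] 0
```

After each swap, the memory that the data-path has finished with becomes the transfer memory, so the next DMA transaction can proceed while the SIMD works on the other two.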
As vector memory is used as the SIMD local buffer, it is essential to achieve conflict-free parallel memory access in order to provide vector data to the SIMD data-path with minimum latency. Conflict-free memory access is important when designers want to reduce the SIMD register size for cost and power reasons, especially for transform and matrix operations. The solution for conflict-free memory access is to use data permutation. In hardware, two permutation tables are implemented in the SIMD local memory. One is used for DMA access and the other for SIMD access. On the software side, the prolog code of a kernel running on the master processor selects the permutation function and configures the permutation tables in the SIMD. Since the data access kernels in ePUMA are separated from, and orthogonal to, the computing kernels, designers of the data access kernels can focus on conflict-free data access. A well-known data permutation example is shown in Figure 4. It shows two ways to store a 4x4 matrix in the local vector memory. In Figure 4(a), where there is no permutation, only row vectors can be accessed without bank conflict. In Figure 4(b), a permutation is applied while loading data into the memory. In this way both row and column vectors can be accessed in parallel.
The dirty data from one SIMD is sent to all the other SIMDs from the central DMA controller.
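The 4x4 matrix example of Figure 4 can be reproduced with a short bank-conflict check. The sketch below assumes four memory banks and a skewed (row-rotated) storage scheme as the permutation; the text does not state which permutation function Figure 4(b) uses, so `bank_skewed` and the other names here are illustrative assumptions.

```python
N = 4  # matrix size and number of memory banks assumed in this example


def bank_plain(row, col):
    """No permutation: element (row, col) is stored in bank `col`."""
    return col


def bank_skewed(row, col):
    """Assumed permutation: each row is rotated by its row index."""
    return (row + col) % N


def is_conflict_free(coords, bank_of):
    """True if every element of the vector lives in a different bank."""
    banks = [bank_of(r, c) for r, c in coords]
    return len(set(banks)) == len(banks)


def row_vector(r):
    return [(r, c) for c in range(N)]


def column_vector(c):
    return [(r, c) for r in range(N)]


for name, bank_of in [("plain", bank_plain), ("skewed", bank_skewed)]:
    rows_ok = all(is_conflict_free(row_vector(i), bank_of) for i in range(N))
    cols_ok = all(is_conflict_free(column_vector(i), bank_of) for i in range(N))
    print(name, "rows:", rows_ok, "columns:", cols_ok)
```

Without permutation, all four elements of a column fall into the same bank, so a column read needs four sequential accesses. With the skewed layout, the elements of any row or any column land in four different banks and can be fetched in a single parallel access.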
D. On-chip network
As discussed in our previous publication [7], the first proposed interconnection network used conventional buses: one shared bus for program code and one cross-bar for data. The two buses were implemented in parallel, connecting two off-chip memories and the on-chip processor cores. Each bus had its own DMA controller, and a bridge was designed to support communication between the two shared buses.
In the new interconnection architecture, we use a star connector for DMA transactions, a ring connector for streaming computing, and a mix of star and ring for emulating cache coherence. The DMA controller is designed with multiple I/O ports; it has a direct connection to the local memory of every processing core. We change the interconnection between 
IV. CHIP AREA ESTIMATION
The chip area is estimated in this section. Table I lists the local memories for the master and the SIMD; both program memory and data memory are specified. Table II lists the logic gate estimation. We use a single-MAC DSP controller [8] as the master processor, which has already been developed in our group. The total gate count of the master processor is about 50k gates. The SIMD logic estimation is divided into three parts: the register file, the data-path, and the control-path plus AGU. The register file includes both the SIMD vector registers and the permutation table implementation.
The ePUMA architecture is fully scalable and the number of SIMD co-processors can be configured from 1 up to 8. Table III shows the final chip area estimation (65 nm) for different numbers of SIMD co-processors.
V. BENCHMARKING RESULTS
The ePUMA platform is designed to minimize the communication overhead in parallel processing. At the beginning of this project, we use the ratio R (Equation 1) as a measure of efficiency when evaluating the ePUMA platform.
total cycles
The ideal value of R is 1, that is, the total execution time is just the arithmetic computing time and the data access overhead equals 0. Table IV shows
VI. CONCLUSION
ePUMA is a parallel DSP platform optimized for predictable computing. It aims at providing a low-power, high-performance embedded parallel computing platform with low silicon cost. In this paper, the master-multi-SIMD architecture has been introduced. The memory subsystem architecture, conflict-free vector memory access, and the interconnection network were described. A preliminary chip area estimation has been made for ePUMA as a scalable platform.
The work finished in the first year includes a white paper as the design guideline, a behavior simulator, and some benchmarks as an early proof of ePUMA's design essentials.
Future work is to improve the cycle accuracy of the ePUMA behavior simulator, which will be done using our cycle-accurate simulation framework developed specifically for the ePUMA project. Models of the SIMD processor core and the memory subsystem will be further improved and integrated into the simulator. The existing master processor simulator will be wrapped into one module and compiled into the ePUMA simulator. After the behavior model is updated, more algorithms will be mapped to the ePUMA platform and simulated in the cycle-true simulation. More benchmarking results are expected to demonstrate the efficiency of the ePUMA platform. From the later benchmarking we also need to find the optimal hardware configuration, the best-fit permutation table size, and the addressing patterns for the vector address generation unit.
