Based on an analysis of the application requirements and of parallel computer architectures, an embedded data-parallel computer architecture model is proposed for multimedia processing applications. In the proposed model, local memory based on PIM technology reduces memory latency and increases bandwidth. Additionally, a segmentable bus provides flexibility for different demands so that processing elements (PEs) can cooperate with each other more efficiently. The main components and the instruction set are described in detail. A typical algorithm example is given to show the process of parallel computation. The model is being implemented on a Xilinx FPGA board.
Introduction
With the growing demand for multimedia processing, it has become important to achieve high performance on algorithms such as video compression and decompression. This has motivated new technologies that improve processor performance in multimedia applications.
This paper focuses on a one-dimensional SIMD array based on PIM technology with a segmentable bus [1], as shown in Fig. 1 (a). In this model, each processor element (PE) has local memory based on PIM technology. Communication among PEs is achieved by a segmentable bus, which is one of the most fundamental structures in reconfigurable computing. The proposed model not only meets the needs of SIMD computing but also reduces communication overhead while remaining highly flexible.
Each PE is composed of an arithmetic logic unit (ALU), a status register (PSR), a shift register (SR), four general-purpose registers, a router, and on-chip DRAM. Communication among the components inside a PE is achieved by three buses: the A bus and the B bus carry the source operands, and the C bus carries the destination operand. The structure is shown in Fig. 1.
The PSR is a special register that stores the PE's current state, including its current connection state with the segmentable bus. The router is a register used for communication among PEs: a PE sends data to or receives data from the segmentable bus through its router. According to the PE's connection state held in the PSR, the router exchanges data with the segmentable bus through two data ports, segment_L and segment_R.
The rest of this paper is organized as follows. Section 2 discusses the implementation of the segmentable bus, and Section 3 presents the instruction set architecture. Section 4 describes an application example of the proposed data-parallel computer model. Concluding remarks are made in Section 5.
The Implementation of the Segmentable Bus
A segmentable bus is a bus with three switches [2] placed on it, as shown in Fig. 2 (a). By opening or closing these switches, PEs can segment the bus into many independently usable pieces or connect pieces together. From the router's point of view, data can be transferred in three ways: through the west port (segment_L), through the east port (segment_R), or through the internal connection. The internal connection is used for communication when the PE is inactive, as discussed below. From the point of view of the bus, data are simply transferred from one bus segment to the next (seg_bus_(i-1) and seg_bus_i in Fig. 2 (a)), following the direction of data flow. In summary, a PE's connection state consists of the values of these three switches, denoted L_con, R_con, and LR_insidecon. These three Boolean variables are true when the corresponding switch is closed [3]. As mentioned before, they are stored in each PE's PSR so that each PE can use them to control its router.
As shown in Fig. 2 (a), the signals L_read and L_write (R_read and R_write) stand for the read and write operations of the segment_L (segment_R) port of the router module. These signals cooperate with L_con and R_con to accomplish I/O operations on the two data ports of the router module.
In the example of Fig. 2 (b), PE0's east port sends data to PE2's west port. PE0's R_con is true and its R_read signal is asserted, so data are transferred from the segment_R port of PE0's router to seg_bus0. Following the data flow, seg_bus0 passes the data to the next bus segment (seg_bus1). PE2's west port then receives the data from seg_bus1, provided that PE2's L_con and L_write are true.
In short, an I/O operation can be performed when the data port is connected to the segmentable bus and the corresponding signal is asserted.
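The I/O rule above can be sketched as a small Python model. This is an illustration only: the class `RouterPort` and the function `transfer` are hypothetical names, not part of the proposed hardware, and the sketch assumes (following the Fig. 2 (b) example) that a read signal moves data from the port onto the bus and a write signal moves data from the bus into the port.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RouterPort:
    """One data port (segment_L or segment_R) of a PE's router (illustrative model)."""
    con: bool = False    # L_con or R_con: port is connected to the segmentable bus
    read: bool = False   # L_read or R_read: port data is read out onto the bus
    write: bool = False  # L_write or R_write: bus data is written into the port
    data: Optional[int] = None

    def can_send(self) -> bool:
        # Sending requires a closed switch and an asserted read signal.
        return self.con and self.read

    def can_receive(self) -> bool:
        # Receiving requires a closed switch and an asserted write signal.
        return self.con and self.write


def transfer(src: RouterPort, dst: RouterPort) -> bool:
    """Copy src.data to dst over the bus; succeed only if both ends are enabled."""
    if src.can_send() and dst.can_receive():
        dst.data = src.data
        return True
    return False


# Scenario modeled on Fig. 2 (b): PE0's east port sends to PE2's west port.
pe0_east = RouterPort(con=True, read=True, data=42)
pe2_west = RouterPort(con=True, write=True)
ok = transfer(pe0_east, pe2_west)
```

In this model a transfer is refused whenever either port is disconnected or its signal is low, which mirrors the two conditions stated in the text.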
Neighbor localization [4] is fundamental to dynamic reconfiguration. Consider a one-dimensional array of PEs. Let processor i (where 0 ≤ i < N) hold a flag f(i). If f(i) = 1, processor i is termed active; otherwise, it is inactive. For 0 ≤ i < j < N, processor j is the neighbor of processor i if and only if f(i) = f(j) = 1 and f(k) = 0 for every index k such that i < k < j. That is, processor j is the nearest active processor after active processor i.
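The definition can be made concrete with a short Python sketch. The function name `neighbor` and the sample flag vector are assumptions chosen for illustration. The sketch evaluates the definition sequentially, whereas in the proposed model the same relation is established by configuring the bus switches.

```python
from typing import List, Optional


def neighbor(flags: List[int], i: int) -> Optional[int]:
    """Return the nearest active processor after active processor i, or None."""
    if flags[i] != 1:
        return None  # the definition only covers active processors
    for j in range(i + 1, len(flags)):
        if flags[j] == 1:
            return j  # every processor between i and j is inactive
    return None  # no active processor follows processor i


# Hypothetical flag vector f(0..5); processors 0, 2 and 5 are active.
flags = [1, 0, 1, 0, 0, 1]
neighbors = [neighbor(flags, i) for i in range(len(flags))]
```

Here processor 2 is the neighbor of processor 0 and processor 5 is the neighbor of processor 2, while inactive processors and the last active processor have no neighbor.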
The internal switch (denoted LR_insidecon) is used to activate or deactivate PEs. When internal switch of
A PE sends and receives data through the segment_L and segment_R ports of its router module as long as it is active. When a PE is inactive, however, its router cannot send or receive any data; that is, data bypass the inactive PE. In the example of Fig. 2 (b), f(0) = f(2) = 1 and f(1) = 0, so processor 2 is the neighbor of processor 0. Processor 0 and processor 2 communicate with each other through the internal connection of processor 1.
Instruction Set Architecture
The proposed model is organized as a Harvard architecture with a 32-bit instruction width and a 16-bit data width [5]. By function, instructions are divided into ALU instructions, transfer instructions, access instructions, segmentable bus configuration instructions, and control instructions [6], as shown in Table 1. Instructions can also be divided into controller instructions and array instructions.
Memory is accessed in two ways: internal and external DRAM access. In internal access, each PE accesses its own DRAM. In external access, the host accesses a PE's DRAM to exchange data.
At the assembly language level, a number of pseudo-instructions are needed for data definition and memory allocation (DW, DD), segment definition (CREGION, DREGION), procedure definition (PROC), and other functions. They are defined in a way similar to traditional assembly languages.
Application examples
Image processing operations can be divided into point operations, local operations, and global operations. Point operations are simple for data-parallel computing because each PE processes one pixel. Local and global operations tend to have higher communication complexity.
To save space, this paper gives only one example of SIMD computing on the proposed model: computing a histogram by means of the prefix sum, a common algorithm. The histogram is a typical point operation with a complicated communication pattern, and its parallel form differs from the ordinary serial algorithm. In this algorithm, the image is sent to the PEs column by column so that the PEs can concurrently count the gray values of each column of the picture. After the gray values of each column have been counted, the problem becomes computing the prefix sum of a series a[n] (n is the number of PEs) 256 times (gray values are integers from 0 to 255).
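The decomposition described above can be sketched in Python as follows. This is a sequential illustration under stated assumptions, not the program for the proposed array: the function names `column_histograms` and `histogram` are hypothetical, one image column is assumed per PE, and the loops that run over columns stand in for work the PEs would perform concurrently.

```python
from typing import List

GRAY_LEVELS = 256  # gray values are integers from 0 to 255


def column_histograms(image: List[List[int]]) -> List[List[int]]:
    """Return hists[c][g]: how often gray value g occurs in column c (one column per PE)."""
    n_cols = len(image[0])
    hists = [[0] * GRAY_LEVELS for _ in range(n_cols)]
    for row in image:
        for c, g in enumerate(row):
            hists[c][g] += 1
    return hists


def histogram(image: List[List[int]]) -> List[int]:
    """Combine the per-column counts with one prefix sum per gray value."""
    hists = column_histograms(image)
    total = [0] * GRAY_LEVELS
    for g in range(GRAY_LEVELS):
        series = [h[g] for h in hists]  # the series a[n] for gray value g
        running = 0
        for value in series:            # prefix sum across the PEs
            running += value
        total[g] = running              # last prefix sum is the image-wide count
    return total


# Tiny 2 x 3 test image (rows of gray values), chosen only for illustration.
image = [[0, 255, 3],
         [0, 3, 3]]
result = histogram(image)
```

The last element of each prefix sum is the image-wide count for that gray value, which is why the histogram reduces to 256 prefix sums over the PE array.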
Given a series a[n], let S[k] = a[0] + a[1] + ... + a[k] (k = 0, 1, 2, ..., n-1); then S[k] is the prefix sum of a[n]. Fig. 3 shows part of the prefix sum solution process. The entire calculation requires log2 N iterations, which is N/log2 N times faster than the serial algorithm. The assembly language program is as follows:
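As a separate illustration in Python (not the assembly program referred to above), the following sketch shows one common way to compute a prefix sum in about log2 N iterations, namely recursive doubling. Whether the communication pattern of Fig. 3 matches this scheme exactly is an assumption, and the function name `prefix_sum_doubling` is hypothetical. The inner loop stands in for all PEs updating at once.

```python
from typing import List, Tuple


def prefix_sum_doubling(a: List[int]) -> Tuple[List[int], int]:
    """Return (prefix sums of a, number of parallel iterations used)."""
    s = list(a)
    n = len(s)
    step = 1
    rounds = 0
    while step < n:
        prev = list(s)  # snapshot: every PE reads values from the previous round
        for k in range(step, n):
            # PE k adds the value held by the PE `step` positions to its left.
            s[k] = prev[k] + prev[k - step]
        step *= 2       # the communication distance doubles each round
        rounds += 1
    return s, rounds


# Example with 8 PEs, each holding one element of the series.
sums, rounds = prefix_sum_doubling([3, 1, 4, 1, 5, 9, 2, 6])
```

For 8 elements the loop runs 3 times (distances 1, 2, and 4), compared with 7 additions performed one after another in the serial algorithm.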
Future Work And Conclusions
In this paper, a data-parallel computer architecture model is proposed that uses PIM and dynamic reconfiguration to improve coprocessor performance on image and video processing. The model is currently being implemented on a Xilinx DNV6_F2PCIe board. The fundamental modules have been designed and tested. Future work includes the study of pipelining in the controller module and the PE array.
