One Dimensional SIMD Array Processor with Segmentable Bus  by Zhang, Fa-cun et al.
Procedia Engineering 15 (2011) 3704 – 3709
1877-7058 © 2011 Published by Elsevier Ltd.
doi:10.1016/j.proeng.2011.08.694
Available online at www.sciencedirect.com
www.elsevier.com/locate/procedia
Advanced in Control Engineering and Information Science 
One Dimensional SIMD Array Processor with Segmentable Bus
Fa-cun ZHANG*, Wei Liu, Qian-kun Wang
School of Computer Science and Engineering, Xi’an University of Technology, Xi’an, Shaanxi, China  
Abstract 
Based on an analysis of application requirements and parallel computer architectures, an embedded data 
parallel computer architecture model is proposed for multimedia processing applications. In the proposed 
model, local memory based on PIM technology reduces memory latency and increases bandwidth. In addition, 
a segmentable bus provides the flexibility to meet different communication demands, so that PEs can 
cooperate with each other more efficiently. The main components and the instruction set are described in 
detail. A typical algorithm example is given to show the process of parallel computation. The model is 
currently being implemented on a Xilinx FPGA board. 
© 2011 Published by Elsevier Ltd. Selection and/or peer-review under responsibility of [CEIS 2011] 
Keywords:  SIMD; PIM; Data Parallel Computer Architecture; Segmentable Bus; PE Array; Instruction Set Architecture; 
1. Introduction 
With the growing demand for multimedia processing, it has become important to achieve high performance 
on algorithms such as video compression and decompression. This has motivated new technologies that 
improve processor performance on multimedia applications. 
This paper focuses on a one-dimensional SIMD array based on PIM technology with a segmentable bus [1], 
as shown in Fig. 1(a). In this model, each processing element (PE) has local memory based on PIM 
technology. Communication among PEs is achieved by a segmentable bus, which is one of the most 
fundamental structures in reconfigurable computing. The proposed model not only meets the needs of SIMD 
computing but also reduces communication overhead while retaining high flexibility. 
Each PE is composed of an arithmetic logic unit (ALU), a status register (PSR), a shift register (SR), 
four general purpose registers, a router, and on-chip DRAM. Communication among the components inside a 
PE is achieved by three buses: the A bus and the B bus carry source operands, and the C bus carries the 
destination operand. The structure is shown in Fig. 1(b). 
* Corresponding author: +8613609186827  
E-mail address: zfc@xaut.edu.cn  
Open access under CC BY-NC-ND license.
The PSR is a special register that stores the PE’s current state, including its current connection state 
with the segmentable bus. The router is a register used for communication among PEs: a PE sends data to 
or receives data from the segmentable bus through its router. According to the connection state held in 
the PSR, the router exchanges data with the segmentable bus through two data ports, segment_L and 
segment_R. 
The rest of this paper is organized as follows. Section 2 discusses the implementation of the 
segmentable bus, and Section 3 presents the instruction set architecture. Section 4 describes an 
application example for the proposed data parallel computer model. Concluding remarks are made in 
Section 5. 
      
Fig. 1. (a) The data parallel computer architecture; (b) PE node structure 
2. The Implementation of the Segmentable Bus 
        
Fig. 2. (a) Internal relationship of router and segmentable bus; (b) A configuration of a 3-processor segmentable bus 
A segmentable bus is a bus with three switches [2] placed on it at each PE, as shown in Fig. 2(a). By 
opening or closing these switches, PEs can segment the bus into many independently usable pieces or 
connect pieces together. From the router's point of view, data can be transferred in three ways: through 
the west port (segment_L), through the east port (segment_R), or through the internal connection. The 
internal connection is used for communication when the PE is inactive, as discussed below. From the 
point of view of the bus, data are simply transferred from one bus segment to the next (seg_bus_{i-1} 
and seg_bus_i in Fig. 2(a)), following the direction of data flow. Fig. 2(a) shows the relationship 
between the router and the segmentable bus in more detail. 
Accordingly, a PE’s connection state consists of the values of these three switches, denoted L_con, 
R_con, and LR_insidecon. These three variables are Boolean and are true when the corresponding switch is 
closed [3]. As mentioned before, they are stored in each PE’s PSR so that the PE can use them to control 
its router. 
As shown in Fig. 2(a), the signals L_read and L_write (R_read and R_write) stand for the read and write 
operations on the segment_L (segment_R) port of the router module. These signals work together with 
L_con and R_con to carry out I/O operations on the two data ports of the router module.  
In the example of Fig. 2(b), PE0’s east port sends data to PE2’s west port. PE0’s R_con is true and the 
signal R_read is asserted, so data can be transferred from the segment_R port of PE0’s router to 
seg_bus0. Following the data flow, seg_bus0 passes these data to the next bus segment (seg_bus1). PE2’s 
west port then receives the data from seg_bus1, since L_con and L_write of PE2 are both true. 
In short, an I/O operation can take place when the data port is connected to the segmentable bus and the 
corresponding signal is asserted. 
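The gating rule can be sketched in Python. This is a minimal illustration rather than the hardware description used in the paper: the `RouterPort` class and its method names are hypothetical, and each port is modeled only as a connection flag plus the read and write signals named in Fig. 2(a).

```python
from dataclasses import dataclass


@dataclass
class RouterPort:
    """One data port (segment_L or segment_R) of a PE's router.

    Hypothetical model: `con` mirrors L_con/R_con from the PSR,
    `read` and `write` mirror L_read/R_read and L_write/R_write.
    """
    con: bool = False    # switch between port and bus is closed
    read: bool = False   # asserted: port contents are driven onto the bus
    write: bool = False  # asserted: bus contents are latched into the port

    def can_send(self) -> bool:
        # Port -> bus transfer needs a closed switch and the read signal.
        return self.con and self.read

    def can_receive(self) -> bool:
        # Bus -> port transfer needs a closed switch and the write signal.
        return self.con and self.write


# Fig. 2(b): PE0's east port sends, PE2's west port receives.
pe0_east = RouterPort(con=True, read=True)
pe2_west = RouterPort(con=True, write=True)
transfer_ok = pe0_east.can_send() and pe2_west.can_receive()
print(transfer_ok)  # True
```

A port whose switch is open cannot take part in a transfer even if its signal is asserted, which is the condition the bus configuration instructions rely on.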
Neighbor localization [4] is fundamental to dynamic reconfiguration. Consider a one-dimensional array of 
N PEs. Let processor i (where 0 ≤ i < N) hold a flag f(i). If f(i) = 1, processor i is termed active; 
otherwise, it is inactive. For 0 ≤ i < j < N, processor j is the neighbor of processor i if and only if 
f(i) = f(j) = 1 and f(k) = 0 for every index k such that i < k < j. That is, processor j is the nearest 
active processor after active processor i. 
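The definition can be checked with a short sequential sketch in Python. It is only a reference implementation of the definition, not the way the PE array itself locates neighbors; the function name `neighbors` and the use of `None` for "no neighbor" are assumptions made for illustration.

```python
from typing import List, Optional


def neighbors(flags: List[int]) -> List[Optional[int]]:
    """For each active processor i, return the index of the nearest active
    processor j > i; None for inactive processors or when no such j exists."""
    result: List[Optional[int]] = [None] * len(flags)
    nearest_active: Optional[int] = None
    # Scan right to left, remembering the closest active index seen so far.
    for i in range(len(flags) - 1, -1, -1):
        if flags[i] == 1:
            result[i] = nearest_active
            nearest_active = i
    return result


print(neighbors([1, 0, 1, 1, 0, 0, 1, 0]))
# [2, None, 3, 6, None, None, None, None]
```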
The internal switch (denoted LR_insidecon) is used to activate or deactivate a PE. When the internal 
switch of PE i is open, PE i is active (f(i) = 1); otherwise, PE i is inactive (f(i) = 0). 
A PE sends and receives data through the segment_L and segment_R ports of its router module as long as 
it is active. When a PE is inactive, however, its router cannot send or receive any data; the data 
simply pass by the PE. In the example of Fig. 2(b), f(0) = f(2) = 1 and f(1) = 0, so processor 2 is the 
neighbor of processor 0. Processor 0 and processor 2 communicate with each other through processor 1. 
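The bypass behavior of inactive PEs can be illustrated with a small Python model. This is a simplified sketch under stated assumptions: the `ConnState` class and the `deliver_east` function are hypothetical, only eastward transfers are modeled, and the switch values chosen for ports that Fig. 2(b) does not use (PE0's west port, PE1's ports, PE2's east port) are arbitrary.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ConnState:
    """Connection state held in a PE's PSR (True = switch closed)."""
    L_con: bool
    R_con: bool
    LR_insidecon: bool

    @property
    def active(self) -> bool:
        # Per the text: internal switch open <=> PE active, f(i) = 1.
        return not self.LR_insidecon


def deliver_east(pes: List[ConnState], sender: int) -> Optional[int]:
    """Return the index of the PE that receives data sent east by `sender`,
    or None if the data reaches no active PE. Assumed simplified semantics."""
    src = pes[sender]
    if not (src.active and src.R_con):
        return None  # sender must be active with its east port connected
    for k in range(sender + 1, len(pes)):
        if pes[k].active:
            # First active PE to the east: it receives only if its west
            # port is connected; otherwise the bus is cut at this point.
            return k if pes[k].L_con else None
        # Inactive PE: internal connection closed, data passes by.
    return None


# Fig. 2(b): PE0 and PE2 are active, PE1 is inactive and is bypassed.
fig2b = [
    ConnState(L_con=False, R_con=True, LR_insidecon=False),   # PE0
    ConnState(L_con=False, R_con=False, LR_insidecon=True),   # PE1
    ConnState(L_con=True, R_con=False, LR_insidecon=False),   # PE2
]
print(deliver_east(fig2b, 0))  # 2
```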
3. Instruction Set Architecture 
The proposed model is organized as a Harvard architecture with a 32-bit instruction width and a 16-bit 
data width [5]. By function, instructions are divided into ALU instructions, transfer instructions, 
access instructions, segmentable bus configuration instructions, and control instructions [6], as shown 
in Table 1. Instructions can also be divided into controller instructions and array instructions. 
Memory is addressed in two ways: internal and external DRAM access. In internal access, each PE accesses 
its own DRAM. In external access, the host accesses a PE’s DRAM to exchange data. 
At the assembly language level, a number of pseudo instructions are needed for data definition and 
memory allocation (DW, DD), segment definition (CREGION, DREGION), procedure definition (PROC), and 
other functions. They are defined in much the same way as in traditional assembly languages. 
Table 1. ISA 
| Type | Mnemonic | Function |
|---|---|---|
| Arithmetic operation | padd, psub, pmul, pdiv, pincls, pmod | add, sub, mul, div, condition count, mod |
| Logic operation | pand, por, pxor, pnot | and, or, xor, not |
| Comparison | pcmpie, pcmpil | equal, less than |
| Shift operation | psll, psrl, psra, pshlr | integer left, integer right, logical right, rotate left |
| Transfer operation | pmove, pmovge, pmovg, pmovle, pmovl, pmovne | data is transferred between registers when the source register value is equal to, greater than or equal to, greater than, less than or equal to, less than, or not equal to 0 |
| Transfer operation | prcvw, prcve, psndw, psnde | PE west and east ports receive data from the bus; PE sends data from the two ports to the bus |
| Transfer operation | pmovreg, pmovrt | data transfer between router and registers |
| Bus connection | pdcone, pdconw, pdconew, pconiew, pdconiew, prst | PE west and east ports are disconnected from the bus; internal port and bus are connected or cut; the bus is reset |
| Data access | ppload, ppstore, prindex | PE loads and saves data from internal DRAM; read PE index |
| No operation | pnop, pbnop, pnbnop, penop | no operation; conditional no operation start; no operation end |
| Control operation | bt, blink, bsub, return | jump instructions |
| Control operation | ldn, ldary, ldcell, ldreg, stary | controller registers; load and save DRAM data |
| Control operation | loop | loop control instructions |
4. Application Examples 
Image processing operations can be divided into point operations, local operations, and global 
operations. Point operations are simple for data parallel computing, since each PE processes one pixel. 
Local and global operations tend to have higher communication complexity. 
To save space, this paper gives only one example of SIMD computing on the proposed model: the prefix sum 
used to calculate a histogram, a typical point operation with a complicated communication pattern that 
differs from the ordinary serial algorithm. In this algorithm, the image is sent to the PEs column by 
column so that the PEs can concurrently process the gray values of each column. Once the gray values of 
each column have been processed, the problem reduces to computing the prefix sum of a series a[n] (n is 
the number of PEs) 256 times (gray values are integers from 0 to 255). 
Given a series a[n], the prefix sum S[k] is defined as S[k] = a[0] + a[1] + ... + a[k] 
(k = 0, 1, 2, ..., n-1). Fig. 3 shows part of the prefix sum solution process. 
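As a reference point, the definition can be written as a short sequential Python sketch. The function name `prefix_sums` and the `counts` data are illustrative assumptions; the first input matches the `src` values used in the assembly listing of this section.

```python
from itertools import accumulate
from typing import List


def prefix_sums(a: List[int]) -> List[int]:
    """S[k] = a[0] + a[1] + ... + a[k] for k = 0 .. n-1."""
    return list(accumulate(a))


print(prefix_sums([0, 1, 2, 3, 4, 5, 6, 7]))
# [0, 1, 3, 6, 10, 15, 21, 28]

# Hypothetical per-column counts of one gray value, one entry per PE.
counts = [2, 0, 1, 3]
print(prefix_sums(counts)[-1])  # 6: the total over all columns
```

The last prefix sum is the total over all PEs, which is why repeating the prefix sum once per gray value yields the 256 histogram bins.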
Fig. 3. Part of the prefix sum solution process 
In Fig. 3, the calculation on 8 operands completes after 3 reconfigurations of the bus. Each data 
transfer between PEs is achieved by dynamic reconfiguration of the segmentable bus. The assembly 
language program, shown below, consists of a data segment and a code segment. Lines 0 to 4 form the data 
segment, which allocates space for the input data, the result, and the loop variable. The remaining 
lines are code. From the above process, the prefix sum of N numbers can be summarized in the following 
steps:
a. The source data a[k] (0 ≤ k < N) is loaded into PE k’s internal DRAM at address x (0 ≤ x ≤ 0x3FF). 
N is the number of elements in the series (N must be less than or equal to the number of PE 
nodes). As shown in Fig. 3, i identifies one configuration of the segmentable bus, and its initial 
value is 0. This part corresponds to code lines 6~9. Line 8 stands for the omitted part of the 
initialization, and line 9 initializes the loop counter. 
b. Reset the segmentable bus; each PE reads its own index. This part corresponds to code lines 
10~11. 
c. Reconfigure the segmentable bus: if index % $2^{i}$ == 0, the PE disconnects its internal connection 
and its west port. This part corresponds to code lines 12~18. 
d. Send and receive data: if index % $2^{i}$ == $2^{i-1} - 1$, the PE sends its data; if 
index % $2^{i}$ > $2^{i-1} - 1$, the PE receives data from the segmentable bus, adds them to its local 
data, and stores the result at address x. This part corresponds to code lines 19~31. 
e. Increase i by 1. If i ≤ $\log_2 N$, go back to step b; otherwise continue. This part corresponds to 
code lines 32~33. 
f. The calculation ends. The result is stored in each PE’s internal DRAM at address x. 
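The steps above can be simulated sequentially in Python to check the send and receive conditions. This is a sketch under assumptions, not the authors' program: the function name `segmented_prefix_sum` is hypothetical, N is taken to be a power of two, i is taken to run from 1 to $\log_2 N$ (matching the loop count of 3 for 8 operands), and the bus is modeled only by which PE's value each receiver adds.

```python
from math import log2
from typing import List


def segmented_prefix_sum(a: List[int]) -> List[int]:
    """Simulate steps a-f for N = len(a) PEs, N a power of two."""
    n = len(a)
    steps = int(log2(n))
    assert 2 ** steps == n, "this sketch assumes N is a power of two"
    mem = list(a)  # mem[k] models address x in PE k's internal DRAM
    for i in range(1, steps + 1):
        seg = 2 ** i          # segment length after reconfiguration
        half = 2 ** (i - 1)
        # Senders read their value first; all PEs act in lockstep (SIMD).
        sent = {k: mem[k] for k in range(n) if k % seg == half - 1}
        for k in range(n):
            if k % seg > half - 1:
                # Receiver: the sender is the last PE of the lower half
                # of the segment that PE k belongs to.
                sender = (k // seg) * seg + half - 1
                mem[k] += sent[sender]
    return mem


print(segmented_prefix_sum([0, 1, 2, 3, 4, 5, 6, 7]))
# [0, 1, 3, 6, 10, 15, 21, 28]
```

In each iteration one PE per segment broadcasts to the upper half of its segment, so the number of bus reconfigurations grows with $\log_2 N$ rather than with N.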
The entire calculation requires $\log_2 N$ iterations, whereas the serial algorithm requires N - 1 
additions, so the parallel version is faster by a factor of about N / $\log_2 N$. The assembly language 
program is as follows: 
0 DREGION 
1  src DW 0,1,2,3,4,5,6,7 
2  rslt DW 8 DUP(0) 
3  LCnt DW 3   ;log2(8) 
4 ENDD 
5 CREGION 
6 start: LDN AR1, AR0, src 
7  LDARY AR1, 0 
8  ... 
9  LDN AR4, AR0, LCnt 
10 begin: prst 
11  prindex r3 
12  ppload  r1, 1  ;load 2^i
13  pmod  r3, r3, r1 
14  pcmpiei  psr, r3, 0 
15  pnbnop 
16  pdconiew    ;adjust bus 
17  pdconw 
18  penop 
19  ppload  r0, 0  ;load src data 
20  ppload  r1, 2  ;load 2^(i-1)
21  psubi  r2, r1, 1 
22  pcmpie  psr, r3, r2 
23  pnbnop 
24  psnde  r0   ;send 
25  penop 
26  pcmpil  psr, r2, r3 
27  pnbnop 
28  prcvw  r1    ;receive 
29  padd  r2, r0, r1 
30  ppstore r2, 0 
31  penop 
32  ... 
33  LOOP  AR4, begin 
34  LDN  AR1, AR0, rslt 
35  STARY AR1, 0   ;save result 
36 ENDC 
37 END start 
5. Future Work And Conclusions 
In this paper, a data parallel computer architecture model is proposed that uses PIM and dynamic 
reconfiguration to improve coprocessor performance on image and video processing. The model is currently 
being implemented on a Xilinx DNV6_F2PCIe board. The fundamental modules have been designed and tested. 
Future work includes the study of pipelining in the controller module and the PE array. 
References 
[1]  Todman T J, Constantinides G A, Wilton S J E. Reconfigurable Computing: Architectures and Design methods[J]. IEE 
Proceedings: Computers and Digital Techniques, 2005, 152(2):193-207. 
[2] Hatem M. El-Boghdadi, Ramachandran Vaidyanathan, Jerry L. Trahan and Suresh Rai. On the Communication Capability of 
the Self-Reconfigurable Gate Array Architecture[J]. 9th Reconfigurable Architectures Workshop in Proc. Int. Parallel and Distrib.
Proc. Symp. 2002. 
[3] El-Boghdadi, Hatem Mahmoud El-Sayed. On Implementing Dynamically Reconfigurable Architectures[J]. Electrical and 
Computer Engineering of Louisiana State University. 2003. 
[4] Ramachandran Vaidyanathan and Jerry L. Trahan. Dynamic Reconfiguration Architectures and Algorithms[J]. Kluwer 
Academic/Plenum Publishers. 2003. 
[5] Guo-chang Zhou, Zhong Wang, De-liang Che and Guo-chen Feng. The Improved Design of Embeded SIMD Coprocessor[J]. 
COMPUTER ENGINEERING AND APPLICATIONS, 2004, 40(31): 13-16. 
[6] CHEN Chao-Yang, Wang Zhong, SHEN Xu-Bang. The LS MPP Parallel Image Processor[J]. CHINESE J. COMPUTERS, 
2002, 25(3): 292-296. 
