Techniques for algorithm design on the instruction systolic array by Bertil Schmidt (84784)
 
 
 
This item is held in Loughborough University’s Institutional Repository 
(https://dspace.lboro.ac.uk/) and was harvested from the British Library’s 
EThOS service (http://www.ethos.bl.uk/). It is made available under the 
following Creative Commons Licence conditions. 
 
 
 
 
 
For the full text of this licence, please go to: 
http://creativecommons.org/licenses/by-nc-nd/2.5/ 
 
Loughborough 
Univcrs i ty 
Techniques for Algorithm Design on 
the Instruction Systolic Array 
Bertil Schmidt 
A Doctoral Thesis 
Submitted in partial fulfilment of the requirements for the award of 
Doctor of Philosophy 
Parallel Algorithms and Architectures Research Centre 
Department of Computer Science 
Loughborough University 
Loughborough, Leics., U. K. 
January 1999 
0--) by Bertil Schmidt 1999 
Acknowledgements 
Greatest thanks must go to my supervisor Professor Heiko Schr6der whose insights, support, 
encouragement, and friendship have made the past three years a truly enjoyable experience. I 
am particularly indebted to Heiko for giving me the opportunity to work on this thesis. Many 
thanks also to my director of research, Dr. Helmut Bez. 
Thanks must go to Professor Manfred Schimmler of the TU Braunschweig for many useful 
inspirations, discussions, and proof reading. 
Thanks also go to Professor Juraj HromkoviC' of the RWTH Aachen and Professor Hartmut 
Schmeck of the Universitdt Karlsruhe for giving me the opportunity to work on this thesis 
during my employment at the RWTH Aachen and at the Universitdt Karlsruhe. 
Many thanks also to the people of the ISATEC Soft- und Hardware GmbH. They have helped 
me immeasurably with useful ideas and technical support, most notably Christoph Starke, 
Wulf Kolbe, RUdiger MaaJ3, J6rg Zessin, and Emil Treuer. 
To my friends in the PARC Department at Loughborough University (most notably Alistair 
May), in the Department of Computer Science I at RWTH Aachen, in the Department AIIFB at 
Universitdt Karlsruhe,, and from outside the university, I owe many thanks for providing the 
essential social environment that has allowed me to retain my sanity. 
Many thanks also go to my parents and family, who have always given me their unconditional 
support during my studies. 
iii 
Abstract 
Instruction systolic arrays (ISAs) provide a programmable high performance hardware for 
specific computationally intensive applications. Typically, such an array is connected to a 
sequential host, thus operating like a coprocessor which solves only the computationally 
intensive tasks within a global application. The ISA model is a mesh connected processor 
grid, which combines the advantages of special purpose systolic arrays with the flexible 
programmability of general purpose machines. 
The subject of this thesis is the analysis, design, and implementation of several special 
purpose algorithms and subroutines on the ISA that take advantage of the special features of 
the systolic information flow. 
The ability of ISAs to perform parallel prefix computations in an extremely efficient way is 
exploited as a key-operation to derive efficiency as well as local operations within each 
processor. Therefore, given sequential algorithms has to be decomposed in simple building 
blocks of parallel prefix computations and parallel local operations. To modify sequential 
algorithms for a parallelisation several techniques are introduced in this thesis, e. g. swapping 
of loops in the sequential algorithm, shearing of data, and appropriate mapping of input data 
onto the processor array 
It is demonstrated how these techniques can be exploited to derive efficient ISA algorithms 
for several computationally intensive applications. These include cryptographic applications 
(e. g. arithmetic operations on long operands, RSA encryption, RSA key generation) and image 
processing applications (e. g. convolution, Wavelet Transform, morphological operators, 
median filter, Fourier Transform., Hough Transform, Morphological Hough Transform, and 
tomographic image reconstruction). 
Their implementation on Systola 1024 - the first commercial parallel computer with the ISA 
architecture - shows that the concept of the ISA is very suitable for these applications and 
results in significant run time savings. 
The results of this thesis emphases the suitability of the ISA concept as an accelerator for 
computationally intensive applications in the areas of cryptography and image processing. 
This might lead research towards further high-speed low cost systems based on ISA hardware. 
Keywords: instruction systolic array, parallel computing, image processing, cryptography, 
algorithm design, Systola 1024. 
iv 
Contents 
1 INTRODUCTION 
1.1 Aims and Research Context 
1.2 Organisation of the Thesis 3 
2 THE ISA 7 
2.1 Principle of the ISA 7 
2.2 Parallel Prefix Computations 10 
2.3 Relation to other Models of Parallel Computers 13 
3 SYSTOLA 1024 15 
3.1 Board Architecture 15 
3.2 Processor Architecture 17 
3.3 Programming Environment 20 
3.3.1 Application Level 20 
3.3.2 Controller Level 21 
3.3.3 ISA Level 23 
3.4 Modifications of the Conventional Systola Architecture 25 
3.4.1 Bus system 25 
3.4.2 Processor Architecture 27 
4 ARITHMETIC ON LONG NUMBERS AND ITS APPLICATION IN 
RSA CRYPTOGRAPHY 29 
4.1 Addition and Subtraction of Long Operands 29 
4.2 Multiplication of Long Operands 31 
4.3 Division of Long Operands 35 
4.4 The ISA in RSA Cryptography Applications 37 
4.4.1 Private-key and Public-key Cryptosystems 37 
4.4.2 The RSA Cryptosystern. 38 
4.4.3 Parallel Implementation of RSA encryption and decryption 40 
V 
4.4.3.1 Modular Multiplication 40 
4.4.3.2 Modular Exponentiation 41 
4.4.3.2.1 Square-and-multiply Algorithm 42 
4.4.3.2.2 m-ary Algorithm 43 
4.4.3.2.3 Sliding-window Algorithm 44 
4.4.4 Performance Evaluation of RSA encryption 46 
4.4.5 Parallel Implementation of RSA Key Generation 47 
4.5 Other Cryptographic Applications on the ISA 49 
5 THE ISA IN IMAGE PROCESSING APPLICATIONS 50 
5.1 Neighbourhood Operations 51 
5.1.1 Convolution 52 
5.1.2 Wavelet Transform 57 
5.1.3 Morphological Processing 61 
5.1.4 Median Filter 64 
5.1.5 Improvements of Systola 4096 to increase the Efficiency of Neighbourhood 
Operations 66 
5.2 Fourier Transform 67 
5.2.1 Discrete Fourier Transform 67 
5.2.2 Fast Fourier Transform 69 
5.2.3 ISA algorithm 72 
5.2.4 Parallel Implementation on Systola 1024 76 
5.2.5 Discrete Cosine Transform 78 
5.3 Hough Transform 80 
5.3.1 Line Detection by Hough Transform 80 
5.3.2 Morphological Hough Transform 82 
5.3.3 Parallel Implementation of Morphological Hough Transform 84 
5.3.4 Performance Evaluation and Experimental Results 87 
5.3.5 Future Work 89 
5.4 Tomographic Image Reconstruction 91 
5.4.1 Cornputerised Tomography 91 
5.4.2 Filtered Backprojection 93 
5.4.3 ISA algorithm of Image Reconstruction 94 
5.4.4 Performance Evaluation on Systola 1024 97 
5.4.5 Future Work 98 
6 CONCLUSIONS AND FUTURE WORK 100 
6.1 Conclusions 100 
6.2 Future Work 100 
vi 
BIBLIOGRAPHY 102 
A RSA PROGFLAM CODE 108 
A-1 Application Program 108 
A. 2 Controller Program 113 
A. 3 ISA Programs 114 
A. 3.1 XEXPOI024 114 
A. 3.2 YEXPO1024 118 
A. 3.3 INIT 119 
A. 3.4 INIT2 119 
A. 3.5 INPUT 120 
A. 3.6 OUTPUT 120 
A. 3.7 INTROINT 120 
vii 
1 Introduction 
1.1 Aims and Research Context 
Personal Computers (PCs) are increasingly used for very demanding tasks in the professional 
sector. The necessary basis for this are high performance processors (e. g. Pentium HO) as well 
as sufficiently large memories. 
There are, however, a number of tasks, especially in the technical and scientific field, that 
require an even higher computing power. These include for instance: 
* multiplication of large matrices 
solution of large systems of linear equation (e. g. in connection with finite-element- 
methods), 
many image processing applications (e. g. line detection by Hough Transform, image 
reconstruction from projections, object recognition using Karhunen-Loeve Transform, 
Wavelet Transform, or steerable filtering) 
video compression (e. g. MPEG-2 video encoding) 
computer graphics (e. g. radiosity) 
visualisation (e. g. perspective views of landscapes) 
artificial intelligence (e. g. simulation of neural networks) 
cryptography (e. g. RSA encryption, RSA key generation) 
This has hindered the use of computer systems in real time applications where high computing 
power is required. 
Parallel computing is one of the techniques available which can reduce the processing time 
significantly. However, cost efficiency does not allow for a massively parallel general purpose 
machine. We argue that there are niches where massively parallel special purpose computing 
does deliver a very good cost efficiency. We believe that there is a range of such niches and 
we argue that the above applications on Instruction Systolic Arrays (ISAs) belong into these 
niches. 
The ISA is a special type of a massively parallel computer architecture based on a mesh 
connected processor array [41,42,441. It provides a programmable high performance 
hardware for specific computationally intensive applications. ISAs are available as square 
arrays of small RISC processors capable of performing integer and floating point arithmetic. 
Typically, such an array is connected to a sequential host, thus operating like a coprocessor 
which solves only the computationally intensive tasks within a global application. 
The ISA model has been developed in order to combine the area efficiency, speed and the low 
cost of Systolic Arrays (SAs) with the programmability of general purpose machines [33]. 
For example, with current 0.25g CMOS technology it is possible to integrate an ISA of size 
I 
32 X 32 with a clock frequency of 200 MHz onto a single chip with only I CM2 silicon area 
[73]. However, the essential advantage of the ISA concept compared with the SA concept is 
its flexibility [82,83]: 
While an SA only realises one special algorithm, an ISA is capable of performing a large 
variety of algorithms belonging to quite different problem classes. For example, ISA 
algorithms have been developed for numeric (e. g. matrix multiplication, matrix transposition, 
matrix inversion, computation of eigenvalues), graph theory (e. g. connected components, 
transitive closure), sorting, simulation of neural networks, and pattern matching [9,15,16,43, 
58,66,72,74,85,86]. 
Systola 1024 (see Figure 1.1) is the first commercial parallel computer based on the ISA 
architecture [27,45,69,70]. The ISA has been integrated on a low cost add-on board for 
standard PCs. The board can be used following a strict coprocessor concept: Sequential 
programs can be accelerated by replacing computationally intensive procedures with the 
execution of corresponding parallel programs on the Systola 1024. 
Optical surface inspection of coated surfaces serves as an example of a real time application 
where ISAs can be used. This application requires special measuring procedures, which 
enable fast scanning of large surfaces, avoiding the direct contact to the surface. Optical 
methods combined with digital image processing are a suitable solution. Systems for 
applications in the area of Machine Vision and Fast Vision offer the needed computing power 
using special image processing hardware or high-power workstations. The drawback of these 
systems is the large budget involved. The immediate consequence is that out of economical 
reasons, cuts in the quality control are made, i. e. the end quality control is often performed by 
human visual inspection. The Surface Quality Scanner (SQS 1024), realised by the company 
ISATEC, is a low cost alternative to large budget solutions [34]. The combination of a 
standard PC, low cost video data acquisition boards and the Systola 1024 board, as the 
hardware base for the technology, offers a quality control solution at a competitive 
price/performance ratio. 
The disadvantage of this solution is the effort to develop efficient parallel ISA programs. 
Programming a parallel computer is a challenge. In contrast to sequential programming, one 
has to watch and co-ordinate the activity of numerous processors. Although the structure of 
the ISA is a useful simplification, the central question is to match the algorithm and 
architecture in such a good way that good performance is reached. It is often the case that a 
2 
Figure 1.1: Picture of Systola 1024 
good sequential algorithm behaves poorly once implemented on a parallel architecture, unless 
a significant effort is devoted to its optimisation. 
In practice, each application requires an individual solution on the ISA to reach good 
performance. Thus, it is of great interest to study which algorithms are suitable for the ISA 
architecture and which techniques are used for their efficient parallelisation. The main 
emphasis of this thesis is the analysis of design techniques and implementation of algorithms 
and subroutines that take advantage of the special features of the systolic information flow 
within the ISA. 
One important advantage of ISAs is the capability of performing parallel prefix 
computations extremely efficiently. These operations can provide an elegant and efficient 
solution for the problem of interprocessor communication, e. g. broadcasting, accumulation, 
ringshifting with constant period. Of course, purely local operations within each processor 
are very efficient to perform computational parts of an applications. In this thesis we will 
exploit these two properties as key-operations to derive efficient ISA algorithms. 
However, the problem of this approach is that most algorithms do not use parallel prefix 
computations and parallel local operations in an explicit form. Thus, a given sequential 
algorithm has to be decomposed in simple building blocks of parallel prefix computations and 
parallel local operations. Several techniques are introduced in this thesis to modify sequential 
algorithms for an efficient parallelisation on the ISA, e. g. swapping of loops in the sequential 
algorithm, shearing of data, and appropriate mapping of input data onto the processor array. 
The main contribution of this thesis is how these techniques can be exploited to derive 
efficient ISA algorithms for several computationally intensive applications in the areas of 
cryptography (e. g. arithmetic operations on long operands, RSA encryption, RSA key 
generation) and image processing (e. g. convolution, Wavelet Transform, morphological 
operators, median filter, Fourier Transform, Hough Transform, Morphological Hough 
Transform, tomographic image reconstruction). Their implementation on Systola 1024 results 
in significant run time savings and are likely to lead to further low cost systems based on ISA 
hardware in these areas. 
Some of the results and techniques presented in this thesis have already been published in 
several papers by the author [75 - 79]. All presented parallel implementations on Systola 1024 
were used in products or projects of the company ISATEC. 
1.2 Organisation of the Thesis 
Chapter 2: The concept of the ISA is explained in detail in Chapter 2. Its relation to other 
models of parallel computers (Systolic Array, SIN4D processor array, MIN4D processor array) 
is also given. The capability of ISAs to perform parallel prefix computations extremely 
efficient is described and demonstrated in several examples, e. g. broadcasting, accumulation, 
ringshifting with constant period. These operations will later serve as building blocks for more 
complex algorithms. 
Chapter 3: This chapter introduces Systola 1024, the first implementation of a parallel 
computer based on the ISA. The ISA is integrated on an low cost add-on board for PCs. The 
architecture of this board is described in Section 3.1. The architecture of the individual 
processor is described in Section 3.2. 
3 
Beside its architecture the programming environment of a parallel computer is a crucial part. 
The programming environment of Systola 1024 consists of three different levels: application 
level, controller level, and ISA level. These three levels are explained by means of a simple 
ISA program that multiplies two matrices in Section 3.3 . 
The provided high performance architecture has been used in several applications, e. g. visual 
quality control. Of course, the computing power of the current Systola is not sufficient for 
other applications, e. g. real-time face recognition. Thus, Systola 4096, the successor of 
Systola 1024, has been already announced. Because of the four times higher processor number 
and the two times higher cycling frequency it will have eight times the theoretical peak 
performance of Systola 1024. In practice, the performance also depends on the architecture of 
the processors and the throughput of the bus system. 
In Section 3.4 we propose a collection of minor improvements in the architecture of the 
processors and the bus system. It will be shown in Chapter 4 and 5 that these improvements 
lead to an additional increase of efficiency between 10% and 150 % with respect to the 
conventional Systola architecture. 
Chapter 4: Chapter 4 describes, how the systolic control flow can be used to implement 
arithmetic operations on very long operands (e. g. 1024 bits) in an extremely efficient way, 
i. e. addition, subtraction, multiplication, division. 
The difficulty of partitioning an addition in an array where only neighbours can talk to each 
other is the carry propagation. If it meets a sum value consisting of ones only, then an 
incoming carry produces a new carry in the most significant digit. In the worst case this can 
hold for all blocks of the partitioned addition, such that the carry-bit of the first block 
propagates through all other blocks. Of course, the addition on the ISA must be solved data 
independently, i. e. the number of instructions must not depend on the propagation distance of 
the carries. Therefore, an parallel prefix computation technique based on the generate and 
propagate signals of a carry-look-ahead-adder is introduced in Section 4.1 to solve the carry 
propagation in constant Period on the ISA. The same technique can be also applied for the 
subtraction of long operands. 
Multiplication of long operands on an ISA (Section 4-2) also benefits from the systolic 
instruction flow. The idea behind the multiplication on the ISA is related to the school method 
for integer multiplication: One operand is cut into pieces and all pieces are multiplied with the 
other operand in parallel. The results are then shifted with respect to each other in an 
appropriate way and added to form the final result. 
Division (Section 4.3) can be efficiently reduced to the implementations of multiplication and 
subtraction on the ISA by using the Newton-Raphson-method. For a value B the reciprocal 
value is computed by an iteration that converges rapidly towards 11B. 
The demand for long operand arithmetic arises in the field of public-key cryptography. In 
particular, in applications like RSA encryption and RSA key generation, it can lead to very 
fast implementations. The computation of these algorithms is based on modular 
exponentiation. In order to implement the modular exponentiation a lot of multiplications and 
subtractions have to be performed in a row. In Section 4.4 it is shown how the new arithmetic 
leads to implementations for these algorithms that are superior to all previously known 
software solutions on hardware within the same price range. Finally, some other cryptographic 
(e. g. IDEA, elliptic curve cryptosystem) applications on the ISA are shortly described in 
Section 4.5. 
4 
Chapter 5: Several applications in the field of image processing on the ISA are discussed in 
this chapter. We will give insight into different techniques to parallelise computationally 
intensive standard routines. 
An analysis of the spatial resolution of the pixels in a local neighbourhood provides the first 
clue for the recognition of objects in images, e. g. edge detection, boundary extraction. Thus, 
classes of operations are necessary that combine the pixels of a neighbourhood in an 
appropriate manner and yield a result which forms a new image. Operations of this kind 
belong to the general class of neighbourhood operations. They are a central tool for low- 
level image processing. 
Very frequently used neighbourhood operations are convolution, Wavelet Transform, 
morphological operators, and median filter. However, their computation on sequential 
machines remains a very time consuming task when the image size and the considered local 
neighbourhood is large. The ISA algorithms in Sections 5.1.1 to 5.1.4 provide one way to 
speed up their computation. 
A problem of many parallel implementations of neighbourhood operations is the cost of 
communication between processors. Thus, we present techniques, which provide an efficient 
communication between neighboured pixels based on the parallel prefix computations 
row/column ringshift and row/column broadcast. 
The Fourier Transform - introduced in Section 5.2 - is another important image processing 
tool which is used to decompose an image into its sine and cosine components. The output of 
the transformation represents the image in the Fourier or frequency space, while the input 
image is the real space equivalent. In the Fourier space image, each point represents a 
particular frequency contained in the real domain image. The Fourier Transform is used in a 
wide range of applications, such as image analysis, image filtering, image reconstruction and 
image compression. 
The Fourier Transform can be computed efficiently by the well-known Fast Fourier Transform 
(FFT) algorithm. The key idea behind the FFT algorithm is to use the divide and conquer 
paradigm: The FFT of a vector of N= 2" points consists of n stages. In each stage the 
calculation of the transform for k points is recursively reduced to two calculations of length 
k12. In Section 5.2.3 it is demonstrated how the divide and conquer strategy of the FFT can be 
mapped efficiently onto the ISA architecture. The same technique can be also applied for the 
parallel implementation of the Discrete Cosine Transform, which is a frequently used tool in 
image and video compression standards, e. g. JPEG, MPEG, H. 263. 
The Hough Transform is introduced as a powerful standard technique for line detection in 
image processing in Section 5.3.1. It is based on the accumulation of information about 
straight lines intersecting with pixels of the image plane. The classical Hough Transform 
creates artefacts if there are structures in the image like dashed lines or small line segments. 
Therefore, modifications of Hough transform have been developed. In Section 5.3.2 an 
approach based on mathematical morphology is taken to find an efficient implementation of 
Hough Transform that reduces the number of artefacts in the transform space (Morphological 
Hough Transform). 
This new algorithm has been tailored towards the capability of the ISA to perform parallel 
prefix computations. Thus, the parallel implementation technique described in Section 5.3.3 is 
very efficient: 
5 
Detection of horizontal lines can be done very fast by row accumulation and row ringshift. 
This can be achieved by swapping the loops in the sequential algorithm. By means of 
shearing the image all other line angles can be also transformed in horizontal position, such 
that the efficient horizontal operations are always applied. In Section 5.3.4 it is shown that the 
resulting processing time per image on Systola 1024 (Section 5.3.4) leads to a solution on the 
ISA which is superior with respect to performance and hardware cost. 
Computerised Tomography is used in many applications, e. g. medicine and non-destructive 
evaluation, to look inside objects and to analyse their internal structures. However, the 
problem of CT image reconstruction is computationally intensive. This has hindered the use 
of CT in applications where CT image data is required in real time. 
In Section 5.4 we illustrate how the capabilities of ISAs to perform parallel prefix 
computations are exploited again to derive an efficient parallel algorithm of CT image 
reconstruction on the ISA based on the well-known Filtered Backprojection algorithm. To 
achieve this goal the similar techniques as in the parallelisation of the Hough Transform in the 
previous section can be applied again, i. e. swapping of loops in the sequential algorithm and 
shearing. The implementation on Systola 1024 shows that the concept of the ISA is very 
suitable for CT image reconstruction and results in significant run time savings. 
Chapter 6: Finally, a summary of the thesis and an outlook on possible further research 
directions related to the topic of this thesis is given. 
6 
2 The ISA 
In this chapter the principle of the ISA and its relation to other models of parallel computers is 
described. Especially the ability of ISAs to perform parallel prefix computations extremely 
efficiently is pointed out, e. g. broadcasting, accumulation, and ringshifting with constant 
period. These operations will provide an efficient and elegant solution for the problem of 
interprocessor communication in many ISA algorithms in the following chapters. 
2.1 Principle of the ISA 
In an ISA, the instructions are pumped through a processor array in a systolic way, instead of 
data as in standard Systolic Arrays (SAs). This allows the execution of different algorithms 
on the same processor array belonging to quite different problem classes (e. g. numeric, image 
processing, cryptography, sorting [9,15,16,43,58,66,72,74-79,85,86]), in contrast to SAs 
which are special purpose architectures. Furthermore, the instruction stream is combined with 
two streams of selector bits. This allows a very flexible addressing of subsets of the 
processors. The basic architecture of the ISA is a quadratic nxn array of identical processors, 
each connected to its four direct neighbours by data wires. The array is synchronised by a 
global clock. The processors can execute instructions from some instruction set, the execution 
time is assumed to be the same for all instructions. The processors are controlled by 
instructions, row selectors and column selectors. 
The ISA instructions are input into the upper left corner of the processor array, and from 
there they move step by step in horizontal and vertical direction through the array. This 
guarantees that within each diagonal of the array the same instruction is active during each 
clock cycle (see Figure 2.1). An instruction that enters the array at time t in processor (0,0) is 
executed at time t+k in all processors (r, s) with r+s=k. Since every instruction is 
propagated to the neighbour processor directly after its execution, the processors do not 
require their own control memory. 
(0) 
(4) 
7 
(1) 
(5) 
Figure 2.1: Execution of an ISA instruction 
(2) (3) 
(6) (7) 
The selectors also move systolically through the array: the row selectors horizontally from 
left to right, the column selectors vertically from top to bottom. The selectors mask the 
execution of the instructions within the processors, i. e. an instruction is executed if and only if 
both selector bits, currently in that processor, are equal to one. Otherwise, a no-operation is 
executed, which leaves all data register contents unchanged. 
An instruction together with the corresponding row selectors and column selectors is called an 
ISA diagonal (see Figure 2.2). 
instructions column 
selectors 
.......... 
roi 
selec 
rT7" 
(1) 
Lii 
E M--MýIM 
M 
(5) 
(2) 
(6) 
(0) 
1-1 
(3) (4) 
(7) 
Figure 2.2: Execution of an ISA diagonal (black squares indicate the selector bit zero, grey 
squares the selector bit one) 
An ISA program is a sequence of ISA diagonals. It consists of a sequence of instructions 
P M'... IP(s) together with two sequences of row-selectors r(1),..., r(s) and column selectors 
CMI..., c(s) of n-tuples over 10,11. For every I :! ý i, j :! ý n and 1 s: t :! ý s the instruction p(l) is 
executed by processor (ij) in clock cycle t+i +j -2 if ri(t) =I and c, 
(t) =I and no-operation 
otherwise (see Figure 2.3). 
The period of ISA programs is the number of their instructions, which is In -2 clock cycles 
less than their execution time. The period of an ISA program describes the minimal time from 
the first input of an instruction of this Program to the first input of an instruction of the next 
program. In the following the period of ISA programs is used to specify their time-complexity. 
This is appropriate because they will be used as subroutines of much larger ISA programs in 
the next chapters. 
8 
instructions 
row 
selector 
(3) 
(1) 
Figure 2.3: Execution of an ISA program 
(2) 
(5) 
ISA programs can be viewed as diamond shaped. This is the case because each instruction can 
be combined with its corresponding row selector bits and column selector bits while it is 
shifted through the processor array. As shown in Figure 2.4 the column selector bits can be 
merged together with their corresponding instruction as follows: A selector bit zero is 
replaced by a no-operation and a selector bit one is replaced by the corresponding instruction. 
Now, the ith entry in a row selector determines if the corresponding instruction will be 
executed in row i of the ISA or not. This graphical notation is used for explaining the ISA 
programs in this section. 
Communication between processors is done in the following way. Every processor has read 
and write access to its own memory. Besides that, it has a designated communication 
register (C-register) that can also be read by the four neighbour. Within each clock phase 
reading access is always performed before writing access. Thus, two adjacent processors can 
column 
electors 
(0) 
r--ý-l 
Ll 
(4) 
9 
exchange data within a single clock cycle in which both processors overwrite the contents of 
their own C-register with the contents of the C-register of its neighbour. This convention 
avoids read/write conflicts and also creates the possibility to perform parallel prefix 
computations within constant period. 
The advantages of the ISA architecture can be summarised with the following properties 
" only local communication for data and control flow 
" small processors (small is fast, small is many) 
" fast parallel prefix computations 
" modularity, scalability 
" wide applicability 
well suited for VLSI (Very Large Scale Integration) technology 
The company ISATEC holds the patents for the architecture in Europe and the US [46-48]. 
2.2 Parallel Prefix Computations 
We define parallel prefix computations to be operations where every processor needs 
information of all processors with smaller column index and the same row index (or vice 
versa). Examples of parallel prefix computations on the ISA are the computation of row 
broadcast, row ringshift and row accumulation. 
Row Broadcast 
Each processor reads the contents of the C-register from its western neighbour and writes the 
result into its own C-register (this instruction is denoted as C: =C [WEST] in Figure 2.4 (A)). 
2: 
C: =C[WEST] 
M: 
no-operation 
C: =C[EAST] : selector bit 
(A) (B) 
Figure 2.4: Row broadcast on an 4x4 from west to east is possible with period I (A), while 
row broadcast in the opposite direction requires period 5 (B). 
Since the execution of this operation is pipelined along the row, the same value is propagated 
from one C-register to the next, until it finally arrives at the rightmost processor. It is an 
10 
interesting fact that, although there is no common bus connecting the processors of a row, this 
is possible with period 1. Similarly, column broadcasting from north to south is also possible. 
However, there is no fast broadcasting in the opposite direction, i. e. from east to west and 
from south to north. Broadcasting in these directions takes period 2-n-3 in a row with n 
processors (see Figure 2.4 (B)). 
Figure 2.5: Input of a4X4 matrix of data items 
By iterating n instructions of this kind, a program for input of an nxn matrix of data items 
can be constructed (see Figure 2.5). For input and output of data items the open ended 
processor links at the western and northern boundaries of the array are used, which might be 
interpreted as 1/0-pads of a chip. In particular, a data item is input whenever a processor on 
the boundary executes an instruction that attempts to read from the C-register of a non- 
existent neighbour. 
Row Ringshift 
The contents of the C-registers can be ringshifted along the processor rows by two 
instructions. Every two horizontally adjacent processors exchange data (using one read west 
and one read east operation). Because of the instruction flow from west to east this 
implements a ringshift (Figure 2.6). 
C: =C [WEST] 
C: =C[EAST] 
Figure 2.6: Ringshift in each row of an 4x4 ISA. 
II 
Again it seems surprising that, although there are no wrap-around connections, such a 
ringshift can be done with constant period. The program for column ringshift is analogue. As 
in the case of broadcasting, ringshifting in the opposite directions can only be done with 
period Q(n). Ringshift is used as a basic routine for the implementation of many other 
permutations on the ISA, e. g. interchanging of two halves and the perfect shuffle. 
Row Accumulation 
Each processor computes the sum of the C-register of its western neighbour and itself 
(denoted as C: =C+C [WEST] in Figure 2.7). Since the execution of this operation is pipelined 
along the row each processor accumulates the sum of all C-registers up to its own which is 
identical to the prefix sum. Thus, row accumulation and the calculation of row prefix sums is 
possible with period 1, without needing a common bus connecting the processors of a row. 
C: = C+ C[WEST] 
Figure 2.7: Row accumulation in each row of an 4x4 ISA. 
Formally, the parallel prefix computation (also known as all partial sums or scan operation 
[39]) can be recursively defined as to be given inputs XO, ... 1, Xn- IG IB' compute outputs 
YO, --., Yn- IG 
IB' with yo = xo and yj = yi- I ED xi for I :! ý- i :! ý n- 1 where IB' x IB' -> IB' is an 
associative operator. 
Some examples of the parallel prefix operator ED are x@y=x+y (accumulation), x@y=x 
(broadcast), x (D y=x or y, and x ED y= maxfx, yj. Of course, ringshifting cannot be identified 
as a pure parallel prefix computation. It is a broadcast operation together with a simple 
leftshift. 
The following theorem holds for the calculation of parallel prefix computations on the ISA: 
Theorem 2.1: If the operator (D can be executed in a single ISA processor within a single 
clock cycle, the corresponding parallel prefix computations yo, -'Yn- I can 
be computed in an 
ISA processor row (column) of length n with period 1. 
Proof: Let the C-register of the ith processor of the ISA processor row (column) initially store 
the input value xi, 0:! ý- i:! ý n-l. The instruction C: =C (@ C [WEST] (C: =C (D C [NORTH]) 
performs the parallel prefix computation, i. e. each processor i, 0 :! ý i :! ý n- 1, has calculated yj in 
its C-register after the execution of this instruction. This holds because of the pipelined 
instruction flow. 
Which parallel prefix computations are executable efficiently on the ISA depends on the 
instruction set of the individual processor. In the case of the Systola 1024, the above examples 
broadcast, ringshift, and accumulation can be implemented. It is necessary that each operation 
12 
that needs to be executed within a single processor can be executed within a single clock 
cycle, i. e. it can be implemented by a single instruction. This requirement restricts the row 
accumulation in case of the Systola 1024 to 16-bit integers and does not allow to implement 
the row-product with constant period. Also maximum and minimum (desirable parallel prefix 
computations) are not implemented on the Systola 1024. 
2.3 Relation to other Models of Parallel Computers 
There are a number of results concerning the problem of transforming programs that were 
originally designed for different models of parallel computers into ISA programs [35,82,83]. 
Among these different models are the MIMD- and SIMD-type mesh connected processor 
array, the Systolic Array, and some variants of the ISA. 
The most powerful control structure is provided by the MIMD (Multiple Instruction Multiple 
Data) concept, where all processors of the array may execute different instructions, while in 
the SIMD (Single Instruction Multiple Data) concept all active processors execute the same 
instruction. 
The main result in [35] is that arbitrary mesh connected processor array programs can be 
simulated on the ISA. This transformation is in general not efficient, i. e. an arbitrary program 
that runs on an nxn MIN41) computer in k steps can be transformed into an ISA program 
having O(n - k) steps. In many cases, however, especially when the original machine is SIMD- 
type, this transformation introduces only a delay of O(l). On the other hand, the computation 
of a parallel prefix computations on an nxn SIMD-type mesh connected array is only 
possible with period Q(n). 
Because of their own control memory, processors of the MIMD-type have to be much larger 
than ISA processors. Thus, the ISA needs a much smaller area than the MIMD processor array 
and can provide a larger degree of parallelism for a given chip area. 
Systolic Arrays (SAs) are networks of not necessarily identical processing elements [36 - 38]. 
Each of them is capable of executing one instruction only. In contrast to ISAs a sequence of 
data (not instructions) is pumped through the network. SAs have turned out to be well suited 
for VLSI technology, since they, 
" consist of a regular net of simple processing cells, 
" use local communication between the processing cells, 
9 exploit a maximal degree of parallelism. 
Of course, the ISA also meets these requirements for VLSI. Furthermore, it is capable of 
efficiently executing a large variety of parallel algorithms belonging to quite different problem 
classes, while SAs fit only one special algorithm. For example, an SA designed for sorting 
cannot multiply matrices, whereas an SA for matrix multiplication cannot solve pattern 
matching problems. 
It was shown in [831, that simulating SAs by the ISA does not affect the cost. Moreover, the 
length of a program on an ISA does not affect its area requirements, whereas the area 
complexity of many systolic algorithms is proportional to the product of their time complexity 
and their data rate. The reduced area requirements imply that on the fixed area of a chip 
larger 
problem sizes can be treated than on comparable standard systolic architectures. 
13 
These results show that the ISA concept combines the advantages of SAs with those of the 
MIN4D concept. Sometimes it is possible, e. g. in the case of two dimensional systolic 
algorithms, to re-design an existing algorithm. Often it will be necessary to develop a new 
algorithm in order to meet the requirements of the ISA. Especially, this is the case for 
implementations on Systola 1024, which is introduced in the next chapter. 
14 
Systola 1024 
This chapter introduces Systola 1024, the first implementation of a parallel computer based on 
the ISA concept. The ISA is integrated on an low cost add-on for PCs. The architecture of this 
board is described in Section 3.1 and the architecture of the individual processors are 
described in Section 3.2. 
Beside its architecture the programming environment of a parallel computer is a crucial part. 
The programming environment of Systola 1024 consists of three different levels: application 
level, controller level, and ISA level. These three levels are explained by means of a simple 
ISA program that multiplies two matrices in Section 3.3 . 
The current Systola was built in 1994 and is a machine based on technology far from being 
state-of-the-art (I. Og CMOS technology). Extrapolating to technology used for processors 
such as the Pentium II (0.25R CMOS technology) would lead to much higher speedups of the 
ISA. 
Thus, Systola 4096, the successor of Systola 1024, has been already announced. Because of 
the four times higher processor number and the two times higher cycling frequency it will 
have eight times the theoretical peak performance of Systola 1024. In practice, the 
performance also depends on the architecture of the processors and the throughput of the bus 
system. In Section 3.4 we propose a collection of minor improvements in the architecture of 
the processors and the bus system. It will be shown in Chapter 4 and 5 that these 
improvements lead to an additional increase of efficiency with respect to the conventional 
Systola architecture. 
3.1 Board Architecture 
The parallel computer Systola 1024 is a low cost add-on board for standard PCs [43,66,671. 
The ISA on the board is integrated on a4x4 array of processor chips. Each chip contains 64 
processors, arranged as an 8x8 square. This provides 1024 processors on one PC board. In 
order to exploit the computational capabilities of this unit, it is necessary to provide data and 
control information at an extremely high speed. Therefore, a cascaded memory concept is 
implemented on board that forms a fast input and output environment for the parallel 
processing unit. 
For the fast data exchange with the processor array there are rows of intelligent memory units 
at the northern and western borders of the array. These units are called interface processors 
(lPs). Eight IPs are integrated on one chip. Each IIP is connected to its adjacent array processor 
by a single wire for data transfer in each direction. The IPs have access to an on-board 
memory by means of special fast data channels, those at the northern interface chips with the 
northern board RAM, and those of the western chips with the western 
board RAM. The 
northern and the western board RAM can communicate bidirectionally with the 
PC memory 
over the PCI bus. 
15 
Figure 3.1: Data paths in the parallel computer 
16 
Figure 3.2: Execution of a program 
The data transfer between every two memory units within this hierarchy is controlled by an 
on-board controller chip. In particular, it controls the channels between the IPs and the board 
RAM as well as between the board RAM and the PCI bus (Figure 3.1). The controller receives 
its instructions either directly from the PC as board instructions, or it can operate auto- 
nomously. In the second case it receives controller instructions from an instruction queue, 
which is located on the board, too. The instruction queue is loaded before the execution of an 
application. The ISA gets its instructions from the ISA program memory. It is also loaded 
before the execution with the desired application programs. The execution of such programs is 
started by the controller, as indicated in Figure 3.2. 
Each IIP contains two independent RAM banks with 128 words each that are organised as 
stacks (LIFO memory: last in, first out). One of these stacks is always assigned to the ISA, the 
other to its environment. The exchange of data between the ISA and the environment is done 
by bank switching. Both memory banks can be active in parallel. Thus, the data transfer 
between IPs and board RAM can be done concurrently to the execution of an ISA program. 
3.2 Processor Architecture 
Due to the limited chip area each processor has to be very compact. Reduction of size was the 
main reason for their bit-serial design. The size of the local memory is limited to 32 internal 
registers (RO,..., R31) plus communication register (C). The word length of the data items is 
16 bits. The instruction set and the internal data formats were defined such that arbitrary 
formats for integer and floating point arithmetic could be handled [71]. Figure 3.4 shows the 
global architecture of one single processor. 
Two operand-multiplexers choose two registers of the own local memory or of the C-registers 
of the neighbours (incoming data wires from the neighbouring processors; called according to 
their position in the north, west, south or east: CN, CW, CE, CS). 
The operands are propagated to the units required for execution of the current instruction. The 
following processing units can be used: 
" ALU 
" conditional unit 
" multiplier 
" floating point unit 
" shifter 
The result of the execution is given to the result bus and from there written into the local 
registers or the C-register or both. 
During the execution of one instruction the processor receives the next instruction together 
with a column selector bit from the upper neighbour and a row selector bit from the left 
neighbour. The selectors are interpreted. If both are 1 then the execution of the instruction 
is 
prepared. Is one of the selectors 0 then the execution of a no-operation 
is initialised. 
Instruction and column selector are propagated bit serially to the lower neighbour and the row 
selector to the right neighbour during the execution of the instruction. 
17 
datalinks 
north 
COlUrT 
row ý 
elector 
ý; ýt 
ilinks 
as, 
clatal 
wc 
coiUrrm 
OL 
Figure 3.4: Architecture of the array processor of Systola 1024 
18 
The processors operate with an instruction set of 44 instructions. It is a RISC concept, where 
the instruction set is even further reduced than in standard RISC processors. The instruction 
set consists of 
assignment instructions, e. g. A: =B, 
logical instructions, e. g. AND, OR, 
arithmetic instructions, e. g. ADD, SUB, 
conditional instructions, e. g. A: =IF zero THEN B ELSE C, 
special instructions, e. g. floating point arithmetic, input-/output instruction. 
Complex operations, like floating point multiplication are composed of elementary 
instructions. 
Furthermore, each array processor has two special registers 0 and -1. The register 0 always 
outputs the value naught (all 16 bits are 0) and the register -1 always outputs the value -1 (all 
16 bits are 1). The register 0 can also be given as a destination address. In this case, the result 
is thrown away. 
In addition to the registers, there are a number flags that control the processing units 
depending on data or depending on the state of the processor: 
" flags (flag manager) 
" activity flag AF 
The flags can be used to control the unit for conditional instructions dependent on previous 
computing results; the activity flag contains information about the status of the processor. 
This can be active or inactive. The activity flag is set by instructions. If the processor is 
inactive, all instructions (excluding the reactivating instruction) are in interpreted as no- 
operation. 
The FPs operate on the same instruction set as the array processors. Since they do not contain 
an ALU, arithmetic instructions are interpreted as no-operation. Other instructions only 
address the EPs. These are in particular in- and out-instructions and switching of banks. 
Performance: At a clock frequency of f 50 MHz and using a word format of m= 16 bits, 
each processor can execute f/m= (50 16)- 10 
6=3.125-10 6 word operations per second. 
Therefore, each processor has a performance of 3 AKPS (millions of instructions per second). 
The whole array with its 32 x 32 = 1024 processors thus performs up to 3200 MIPS. 
The multiplication of two matrices may serve as an example for floating point performance 
and data rate of Systola 1024. 
Using the 64 bit floating point format of IEEE, one multiplication requires 16 machine 
instructions and one addition requires 18 instructions. For a scalar product operation, 
44 
instructions are necessary (10 to fetch the operands and store them internally, 16 
for the 
multiplication, and 18 for the addition). Defining a scalar product operation two 
floating point 
operations (one multiplication and one addition), each processor reaches a performance of 
0.14 MFlops (millions of floating point operations per second). This leads to a value of 145 
MFIops for the whole processor array. 
Using a precision of 32 bits, a scalar product operation requires 
27 instructions (6 for fetching 
the operands, 7 for the multiplication, and 
14 for the addition). This adds up to a performance 
19 
of 237 MFlops. Note that these numbers are not theoretical peak values, but real performance 
measurements of this particular application. The peak values are 188 Mflops for double 
precision and 305 MFIops for single precision. 
3.3 Programming Environment 
Beside its architecture the programming environment of a parallel computer is a crucial part. 
The programming environment of Systola 1024 consists of three different levels: 
e Application level 
e Controller level 
9 ISA level 
These three levels are explained by means of a simple ISA program that multiplies two 
matrices. 
In Section 3.3.1 it is shown how ISA programs are embedded in PASCAL or CIC++ 
programs. 
Section 3.3.2 shows how on board data transfer is organised by means of the controller 
program. The program that runs on the ISA itself is explained in Section 3.3.3. A more 
detailed description of the programming environment can be found in the manual of Systola 
1024 [271. 
3.3.1 Application Level 
The top level of programming is the interface between the host computer and Systola 1024. 
These interfaces are provided for PASCAL, Delphi, and CIC++. Typically, the application 
program controls data transfers between PC and board RAM and loads and starts programs on 
Systola 1024. 
Example 3.1: In this example, a 32 X 128 matrix A and a 128 x 32 matrix B are multiplied 
resulting in a 32 x 32 matrix C. A simple program that performs this operation on Systola 
1024 by use of the program interface for PASCAL could look like this: 
program matmult; 
uses isa; isa contains Systola instructions 
const n=32; P=128; 
var A: array [O.. n-1,0-. P-11 of integer; 
B: array [O.. p-1,0.. n-11 of integer; 
C: array [O.. n-1,0.. n-11 of integer; 
begin 
isa_board_reset; Activate Systola 1024 
load 
- cfg(Imatmult. 
cfg') Load configuration 
load 
- 
block (WEST, 0, n*p, A) Matrix A -> western board RAM 
load-block-transpose(NORTH, O, p, n, B); 
I Matrix B -> northern board RAM 
run_iq('START', 1); Computation of the product 
store - 
block(WEST, 4, n*n, C); Transfer of the results back to C 
isa_board-down; Deactivate Systola 1024 
end. 
20 
With uses isa a module is linked into the program that contains all the necessary global 
variables and procedures for using the board. 
The first instruction isa board-reset switches Systola 1024 from the stand-by mode 
into the fast operating mode and the instruction isa_board_down switches the parallel 
machine back into the stand-by mode. 
With 1o ad_c fga configuration is loaded onto the parallel computer board. It consists of all 
parallel programs required for this application and of instruction sequences to control the 
controller. 
The operand matrix A is written into the western board RAM of Systola 1024 by 
load-block (WEST, 0, n*p, A), then the transposed matrix of Bs written into the 
northern board RAM with load-block-transpose (NORTH, 0, p, n, B). 
The execution of a controller program, which performs the required matrix multiplication on 
Systola 1024, is started by run_iq (I START' , 1). The procedure run-iq used waits until 
the parallel computation is completely performed. There is another possibility to run parallel 
computations in order to avoid such busy waiting. Instead of run-iq the programmer can 
use start_iq, which does not wait for the termination of the process on the parallel unit. 
Therefore, the host PC can perform other operations while the Systola 1024 is active. 
The results of the calculation on Systola 1024 are written into the western board RAM. The 
resulting matrix C is stored in the main memory of the PC with 
store-b1ock(WEST, 4, n*n, C). 
The next section describes the implementation of the configuration ma tmu 1t. cfg and 
Section 3.3.3 goes into what happens within the processor array. 
3.3.2 Controller Level 
The next level of programming are controller programs. Controller instructions organises the 
exchange of data between the interface processors and the memory of the board. Typically, 
such a controller program consists of a data transfer to the interface processors, followed by 
the execution of an ISA program, followed by the transfer of the results back to the board 
RAM. A controller program is stored in a configuration. Loading and execution of a 
configuration is started by the PC (see above). 
Example 3.2: Here follows the controller program for the example above. It is implemented 
using the toolset ISATOOLSTM, which provide a progranu-ning environment to develop 
configurations. 
START ISA-START 
WAIT 
BORDER-PAR 
BORDER-PAR 
BLOCK-TO-IP 
BLOCK-TO-IP 
WAIT 
ISA-START 
WAIT 
IP-TO-BLOCK 
HALT 
INTRO 
Ignore 
WEST 
NORTH 
WEST 
NORTH 
N_Ready 
CALC 
Ignore 
WEST 
Ignore 
Ignore 
32 
32 
0 
0 
W-Ready 
Ignore 
4 
W-Ready 
Prg_Ready 
128 
128 
Reset 
Reset 
Ignore 
Prg_Ready 
Reset 
Ignore 
21 
The label START is the reference point for the PC to start this instruction sequence 
ISA_START INTRO starts the execution of an ISA program named INTRO. This program 
must be stored in the ISA program memory as part of the configuration, in order to be 
accessible to the controller. INTRO is a standard program for initialisation of the processor 
array. Since it comes from the delivered library it can be linked into any configuration. 
The WAIT instruction assures that the execution of subsequent operations is delayed until the 
program INTRO is terminated. Such an explicit WAIT instruction is necessary because 
different activities can run on the board simultaneously, e. g. data transfer between different 
levels can run in the same time with the execution of an ISA program. 
WAIT has three parameters, corresponding to these different activities on the board: the first 
one is responsible for data transfer in the north, the second one for the data transfer in the 
west, and the third for ISA program executions. If one of these parameters has the value 
ignore, the controller does not wait for the termination of the corresponding activity. If it 
has the value ready, the controller is delayed until the operation is finished. 
The instruction BORDER 
- 
PAR arranges the channel between the board RAM and the interface 
processors. Each data transfer between one of the board RAM and the interface processors is 
controlled by the parameters of the last BORDER-PAR instruction. 
The first parameter (WEST or NORTH) identifies the part of the board RAM from/to where 
data transmission shall be done. The second parameter (I ... 3 2) 
defines the number of 
interface processors that take part at the transmission. The third parameter (1 ... 12 8) gives the 
number of words to be sent from/to each interface processor. In the example above, each of 
the 32 interface processors in the west and north will receive/send 128 words per data transfer. 
With BLOCK_TO_IP WEST 0 Reset a data transmission is started from the western 
board RAM to the western interface processors. The first parameter (NORTH or WEST) 
identifies the part of the board RAM where the data is to be sent. The second parameter 
(0 
... 255) 
is addressing the memory block. With the third parameter (Reset or Retain) the 
controller takes notice whether after reading the start address within the block is kept or is 
reset. 
As mentioned above, the amount of data to be transmitted has been determined by the last 
BORDER PAR instruction. In this example, the complete matrix A (32 x 128 words) is loaded 
from the western board RAM row wise into the western interface processors and the matrix B 
(128 x 32 words) is loaded from the northern board RAM column wise into the northern 
interface processors. 
The following WAIT delays until this transfer is completed. 
The program to perform the calculations is now started by ISA_START CALC. The 
controller has again to wait before the results can be output. 
The data transfer from the interface processors to the memory is started by the instruction 
IP-TO-BLOCK. The parameters are used like in the data transmission, but in opposite 
direction (BLOCK_TO-IP). 
Finally, HALT delays the controller until this activity is finished. A ready signal is then sent to 
the host PC to indicate that the results are available. 
22 
3.3.3 ISA Level 
The bottom level of programming consists of ISA programs. Here the actual calculation steps 
are performed, while the higher programming levels are mostly used for transfer of data, 
partitioning of problems and task control. 
ISA programs are edited and compiled by means of the language LAISA7 in the ISATOOLS 
programming environment. Compiled programs can be then loaded into the ISA program 
memory. From there on they are accessible to the controller, which can start their execution 
with an I SA_START instruction. 
LAISA is a PASCAL-like language. In particular, variables and constants, for, while, and 
repeat loops, if statements, functions and procedures can be used just as in PASCAL. Basic 
machine code for the ISA is implemented in LAISA using brackets: 
<instruction; row selector; column selector>; 
As defined in Section 2.1, an instruction is executed by a processor in the array if both of its 
corresponding selector bits are equal to one. 
Example 3.3: The program CALC of the above example shows the abilities of the LAISA 
notation: 
program calc; 
{#include IO. LIB} 
const n= 128; 
p= 32; 
one : selector 
var i: integer; 
{Compute product matrix I 
= 
JA 
begin (n) 
< bank WEST, 3,1, true; one; one >; Switch banks 
< bank NORTH, 3,1, true; one; one >; for input of Operand data 
< set 0, R2; one; one >; 
I Initialising registers for the partial sums: R2: =O in all processors 
for k: =1 to p do p steps of matrix multiplication 
begin 
< set CW, C; one; one >; 
I C: =C[WEST]: Sending a column of A to all processors 
< set C, R27; one; one >; 
I Rl: =C: saving these values in RI 
< set CN, C; one; one >; 
I C: =C[NORTH]: Sending a row of B to all processors 
< mult C, R27, C; one; one >; 
IC := C*Rl: Multiplication of the operands 
< add R2, C, R2; one; one >; 
I R2: = C+R2: Addition of the product to the partial sum 
end; 
< bank WEST, 2,1, true; one; one >; 
OutMat(n, WEST, R2); 
< bank WEST, 2,1, true; one; one >; 
< bank WEST, 3,1, true; one; one >; 
I Reset bank I 
Write results into interface processors 
Reset bank I 
Switch bank 
23 
for i: =1 to n do< noop; one; one >; Wait loop 
end. 
Before the execution of the loop the memory banks of the interface processors must be opened 
to the parallel unit. Furthermore, the register R2 must be set to 0 before the first execution of 
the loop. 
The program executes the innermost loop p times. Within the loop, a column of matrix A is 
broadcast to all processors (each processor (ij) gets the value a[i, k]). Similarly, in the next 
instruction a row of matrix B is broadcast to all processors, such that processor (ij) gets value 
b[kj]. This is accomplished by the parallel prefix computations row broadcast and column 
broadcast (set CW, C in rows or set CN, C in columns). 
Now each processor computes with one multiplication a[i, k] x b[kj]. Afterwards it adds the 
result to the partial sum. Thus, by p loops the matrix product is accumulated in the processors. 
After the computation the memory banks of the interface processors are reset to prepare them 
for the output of the results. The library routine OutMat then outputs the product matrix. 
Afterwards the memory banks are switched such that the controller can now access the result 
data. 
24 
3.4 Modifications of the Conventional Systola Architecture 
Systola 4096 is the already announced successor of Systola 1024 [80]. Because of its the four 
times higher processor number (4096) and its two times higher clock frequency (100 MHz) it 
will have eight times the theoretical peak performance of Systola 1024. Beside the processor 
number and the clock frequency the performance of ISA applications also depends on the 
architecture of the individual processors, e. g. number of registers per processor, instruction 
set, and throughput of the bus system. 
In this section a collection of minor improvements in the architecture of the processors and the 
bus system of Systola 4096 is proposed. The technical realisation of them is easy. In Chapter 4 
and 5 it will be shown in several applications how these improvements lead to an additional 
increase of efficiency with respect to the conventional Systola 1024 architecture. 
The layout of processors has been developed at the company ISATEC for a 0.35ýL CMOS 
technology and a full-custom design. 
3.4.1 Bus system 
Many image processing applications perform only a few operations on a large amount of data, 
e. g. calculating the absolute difference between two continuous frames in a video sequence 
requires two operations per pixel. Since the operations can be performed for each pixel in 
parallel, this algorithm is suitable for the ISA. In spite of this, its implementation on Systola 
1024 results in a bad runtime behaviour, because data transfer time dominates computing 
time. Such bad runtime behaviours of ISA suitable algorithms on Systola 1024 can be caused 
by the following facts: 
Due to too small memory within the board RAM (1 MB), or within the array processors 
(1024 x 32 registers) or within the EPs (64 x2x 256 words) additional data transfer has to 
be done, e. g. for processing of a 1024 x 1024 RGB image Q MB). 
Too small bandwidth of busses, i. e. PCI bus (maximal 130 MB/s), RAM - IP bus (maximal 
50 MB/s) 
Too many instructions are required for the 1/0 of data between EPs and processor array, e. g. 
the 1/0 of a 32 X 32 matrix of data items requires 32 input instructions and 64 output 
instructions 
To overcome these limitations of Systola 1024 several improvements of the architecture of 
Systola 4096 are proposed: 
" Larger memories 
" Larger bus bandwidths 
" New 1/0 concept between EPs and processor array 
Larger memories: Larger memory are provided for the board 
RAM (between 8 MB and 128 
MB), array processors (4096 x 64 registers) and IPs (512 x2x 
64 words). 
25 
Larger bus bandwidths: The maximal bandwidth of the RAM - EP bus is increased to 800 
MB/s. For a fast data acquisition in real time image processing applications a fast frame grabber 
interface of maximal 400 MB/s is provided by the Systola 4096 board, too. 
New 1/0 concept between IPs and processor array: To speedup the 1/0 between EPs and 
processor array a modification of the conventional architecture is suggested on Systola 4096. 
The processor array consists of a4x4 matrix of several small ISAs. Each of the small ISAs 
contains a 16 x 16 processor array together with 16 western IPs and 16 northern IPs. All lPs of 
the same row resp. column are connected via a common bus with the board RAM (see Figure 
3.5). The modified ISA architecture can be used in three different modes. Switching between 
these modes can be done within a single instruction. 
1.1/0 Mode (Figure 3.5): The processor array is configured as a4x4 array of small 16 x 
16 ISAs. All IPs can communicate with the board RAM. Concurrently, ISA programs can 
be executed on all small 16 x 16 ISAs. Thus, the input of a 64 x 64 matrix of data items 
requires only 16 instructions and the output 32 instructions. This leads to a speedup of 
factor 4 compared to the conventional architecture of a 64 x 64 ISA. 
Frame grabber interface 
400 MByte/s F- 
Figure 3.5: Bus architecture of Systola 4096: The processor array 
is configured in the 1/0 
mode. 
2. ISA Mode (Figure 3.6 left): The processor array is configured as a conventional 
64 x 64 
ISA. Thus, only the left and upper IN of the 
4x4 matrix of small ISAs are active and 
each array processor is physically connected 
to its logical neighbour. 
26 
3. Systolic Mode (Figure 3.6 right): The processor array is configured as in the ISA mode, 
switching the western/northern IN to eastern/southern IPs. This implements a continuous 
data flow in one direction (north-south or west-east). The systolic mode can speedup 
applications which produce results in the last processor row/column, e. g. row 
accumulation. 
Figure 3.6: ISA mode (left): A conventional 64 x 64 ISA. Systolic mode (right): Continuous 
data flow from west to east. 
3.4.2 Processor Architecture 
The runtime of computing intensive applications on Systola 1024 is often determined by the 
abilities of individual processors, e. g. the number of instructions to perform a division. The 
following improvements of the processor architecture of Systola 4096 are proposed to get an 
additional increase of efficiency in several applications in Chapter 4 and 5: 
" more registers 
" indirect addressing 
" MMX instruction set 
two C-registers 
ready flag 
more CR-registers 
faster floating point unit 
More registers: In many applications each processor has to store more data than fits into 32 
registers. Thus, the processor memory is increased to 64 registers. Since there is a trade-off 
between processor area and memory capacity a further increase in the number of registers is 
likely to decrease the overall performance. 
Indirect addressing: Indirect addressing is a frequently used technique 
in many algorithms 
using arrays as data structure, e. g. sorting, histogram. 
The instructions on Systola 4096 are 
able to address source registers and destination register 
by an index register. 
MMX instruction set: MMX technology, recently introduced by Intel [26], is designed as a 
set of basic, general purpose integer 
instructions for PCs that can be easily applied to the 
27 
needs of the wide diversity of multimedia and communication applications. The principle data 
type of the Systola 4096 MMX instruction set is a byte, where two bytes are grouped into a 
single 16-bit register. When an MMX instruction is executed, it takes the two byte values 
from a register, performs the arithmetic operation on both bytes in parallel, and writes the 
result into another register. These instructions lead to significant speedups in several image 
processing applications in Chapter 5, e. g. the instruction that counts the number of ]-bits in a 
register, will speed up our Hough transform implementation by a factor of more than 2. 
Two C-registers: While the C-register on Systola 1024 is operating in western/eastern 
direction the data links to northern/southern neighbour are not active. An additional C-register 
can easily be added to each processor, which can operate in this direction simultaneously. 
Thus, the routing capabilities of the ISA are improved, e. g. row broadcast and column 
broadcast can be done within the same clock cycle. 
Ready flag: Implementations of iterative algorithms on Systola 1024 always require the worst 
case runtime, since there is no global control during the execution of an ISA program, e. g. 
testing if a sequence of numbers contains the value zero, all numbers have to be tested in 
every case. Using the ready flag of a Systola 4096 processor it is possible to stop the execution 
of an ISA program dependent on the data. If any processor finds the result it can set its ready 
flag. Afterwards a global or of all ready flags is computed and sent to the controller. 
Dependent on this result the controller can stop the execution of ISA program. 
More CR-registers: On Systola 1024, one of the registers C, RO,..., R31, AF, 0 can be 
chosen as destination as well as the combined addressing CR24,..., CR31. In the latter case, 
the two registers are written into at the same time,, i. e. the communication register C and one 
of the registers R24,..., R31. On Systola 4096, the combined addressing is possible with each 
internal register, i. e. CRO,..., CR63. 
Faster Floating Point Unit: There is a number of instructions for the floating point 
arithmetic that automatically deal with all exceptional cases. These are no simple machine 
instructions, but macros made up of several instructions. The number of instructions required 
for these macros are reduced by approximately 10% on Systola 4096 in comparison to Systola 
1024. 
28 
4 Arithmetic on Long Numbers and its 
Application in RSA Cryptography 
The data security issue is of primary importance for the users of global networks. One of the 
most efficient protection against unauthorised access is the encryption of the transmitted data, 
e. g. for electronic banking, electronic commerce, data transmission (e. g. data, speech, video) in 
internet/intranets [8 1 ]. 
RSA [631 is a very popular cryptographic algorithm which provides high security. However, the 
speed of RSA implementations is not tremendous. Computing the RSA algorithm requires 
modular exponentiation of long integers, e. g. 1024 bits. Thus, software implementations of the 
RSA algorithm suffer from the microprocessors' inability of directly processing long integers. 
This has hindered the use of the RSA algorithm in applications where encryption and 
decryption is required in real time. 
In this chapter we present new ISA algorithms for arithmetic operations on operands that are 
too long to be handled within one processor, i. e. addition, subtraction, multiplication, division, 
modular multiplication, and modular exponentiation. It is shown how the new arithmetic leads 
to a high-speed instruction systolic implementation for RSA encryption and decryption, and 
for RSA key generation based on prime number testing. Finally, some other cryptographic 
applications on the ISA are presented. 
4.1 Addition and Subtraction of Long Operands 
The difficulty of partitioning an addition in an array where only neighbours can talk to each 
other is the carry propagation. The situation is easy as long as a generated carry has to be 
passed to the next neighbour only where it can be added without generating a new carry. But if 
it meets a sum value consisting of ones only, then an incoming carry produces a new carry in 
the most significant digit. In the worst case this can hold for all blocks of the partitioned 
addition, such that the carry-bit of the first block propagates through all other blocks up to the 
most significant block where it is finally absorbed. Of course, the addition on the ISA must be 
solved data independently, i. e. the number of instructions must not depend on the propagation 
distance of the carries. Therefore, an accumulation technique based on the generate and 
propagate signals of a carry-look-ahead-adder is used. 
Suppose, every processor knows whether it has to propagate an incoming carry from the left to 
the right. Then it can set a flag (e. g. the zero flag) to one if and only if it propagates an 
incoming carry. Furthermore every processor can store the carry generated by itself in its 
communication register. With the parallel prefix computation 
(*) if zeroflag then C: =C[WEST] 
all carry-bits travel to the processor one position left of their 
destinations. This is again 
because of the skewed execution of the instructions in the processor array. 
Suppose, 
processors (ij-1), (ij), and (ij+l) have set their zero 
flag to one (in order to propagate a 
29 
carry) and processor (ij-2) has generated a carry-bit one. The instruction (*) is executed in 
processor (ij-1) which replaces its carry-bit (0) with that of processor (ij-2). In the next 
clock cycle the same instruction is executed in processor (ij), where again the own carry (0) is 
replaced by the value of the left neighbour (which has been set to one in the previous 
instruction cycle). In the following cycle, processor (ij+l) replaces its carry with the one from 
the left neighbour, etc. 
With the final assignment C: =C [WEST] in every processor all carries are at the place where 
they are needed. C denotes the own communication register and C [WEST] the 
communication register of the left (western) neighbour. Note that the carry propagation 
requires only two instructions, although the carry-bits can travel over an arbitrary distance (of 
up to n-I processors of an nx n-ISA). 
Of course, this technique only works if no processor propagating a carry has generated a carry- 
bit on its own. But this is obviously impossible, since the propagation condition is a sum 
consisting of ones only, whereas the generating condition is a sum ý! 2' where m is the 
number of bits to be added in one block of the partition. If both conditions hold, the sum 
would be 2'+1-1, which is impossible because the operands are both:! ý- 2'-l. 
To achieve the optimal speedup, the operands to be added or subtracted are distributed over 
the rows of the ISA. If the length of the operands is 512 bits, every processor in the row (of 32 
processors) gets 16 bits. The least significant bits are stored in the leftmost and the most 
significant bits in the rightmost processor of the row. 
For longer operands (e. g. 1024 bits), every processor is responsible for more bits (e. g. 32 bits). 
If the length of the operand is smaller, it is useful to divide the rows into pieces where every 
piece is responsible for one addition or subtraction. It is not reasonable to store less then 16 
bits per processor because the arithmetic units of the processors of Systola 1024 perform one 
16-bit operation within one clock cycle. 
Now the addition is performed in the following way. Every processor has 16 bits of the 
operand A in register RA and 16 bits of B in register RB. The operations 
RSUM RA + RB; 
C carry; 
store the sum of RA and RB in RSUM and the carry-bit in the C-register. Now every 
processor must find out whether it has to propagate an incoming carry. This is the case if and 
only if the value of RSUM is consisting of ones only. If RSUM has at least one zero, this will 
absorb an incoming carry-bit. If the propagate condition holds, RSUM +1 is zero. 
Now we can proceed as pointed out above with 
if (RSUM +I= 0) then C := C[WEST]; 
RSUM := C[WESTI + RSUM; 
completing the long addition. Note that the carry-bit generated 
in the final addition is 
redundant. 
30 
Example 4.1: Addition of two 24-bit numbers in a row with 6 processors works as follows: 
Processor 3 4--] [= 
A 010, 1100 1010 0010 1101 1000 Every processor gets 4 bits of the 
B 000, 001, 010, 1010 1010 0100 operands A and B. LSB in the 
leftmost bit in processor 1. MSB in 
the rightmost bit in processor 6. 
ISUM ---] OjOO 1111 1 1111 1001 0000 1100 SUM: = A+B 
C 1 0 0 0 1 0 Communication register (C) stores 
the carry-bit 
zeroflag 0 1 1 0 0 0 if (SUM +1 = 0000) 
then zeroflag :=I else zeroflag: = 0 
1 0 1 0 C after the propagation step: 
if zeroflag then C: = C[WESTI I sum 1 0100 1 0000 0000 1 0101 1 0000 0010 SUM: = SUM + QWEST] 
The subtraction works in principle in the same way as the addition, using RDIFF: =RA-RB, 
the propagation condition if (RDIFF=O) and the carry-bit subtraction RDIFF: =RSUM- 
C [WEST I. 
Due to the systolic control flow the n-bit addition / subtraction is possible using O(n) ISA 
processors with constant period, although there is no common bus connecting the processors 
of a row. Table 4.1 shows the good runtime behaviour of the introduced technique on Systola 
1024. 
Table 4.1: Number of instructions in one Systola 1024 processor row for a complete addition 
and subtraction of length 512 bits and 1024 bits. 
Length of operands 512 bit 1024 bit 
Addition 7 10 
Subtraction 7 10 
However, having 32 processor rows, 32 different additions / subtractions can be calculated in 
parallel in the same time. 
The source code for the 1024-bit subtraction on Systola 1024 is presented in the ISA 
procedure s ub- 10 24 in Appendix A. 3.1. 
4.2 Multiplication of Long Operands 
Multiplication of long operands on an ISA also benefits from the systolic instruction flow. Of 
course, the solution depends on the capabilities of the multiplier of the individual processor. 
Therefore, we begin with a description of the function of the multiplier of Systola 1024. 
Multiplication of operands of wordlength (16 bits) is performed within three cycles: 
1. The first operand is loaded into the multiplier register where it is temporarily stored. 
2. The other operand is shifted through the multiplier register and the 
lower word of the 
result is produced. 
3. The higher word of the result is produced. 
31 
Three additional features are available for gaining efficiency: 
" Loading of a new operand: The last cycle of one multiplication can be used as the first 
cycle of the following multiplication. 
" Multiply-Add: Simultaneously with multiplication, a third operand can be added to the 
result without additional delay. 
" Multiplication of longer operands: If the second operand is longer than one word, the 
multiplier can continue immediately after the second cycle with the next word of the 
second operand, thus producing the second word of the result in the third cycle, etc. This 
means that the multiplier can be used to perform a multiplication of a 16-bit operand with 
a 16. q-bit operand within q+l cycles (q+2 if the initial loading cycle is counted too). 
The idea behind the multiplication on the ISA is related to the school method for integer 
multiplication: 
One operand is cut into pieces and all pieces are multiplied with the other operand in parallel. 
The results are then shifted with respect to each other in an appropriate way and added to form 
the final result (Figure 4.1). 
b, 
_1 ... 
bo - a, -, ... ao 
CnO*** C20CIOCOO (=bn-I ... bo - ao) Cn I*- C2 I Cl I CO I bn-I ... bo - a, ) 
.......... I ............ 
............... 
............ Cn m-1 *'., C2 M-1 Cl M-1 CO M-1 bn- I ... bo - am-, ) 
Cn+m-i ................................. C2 Cl CO 
Figure 4.1: Scheme of school method multiplication 
For the mx n-bit (m = 16- p, n= 16-q) multiplication, the first (m-bit) operand is distributed 
over the rows of the ISA with p processors per row, like the operands in long addition. Every 
processor gets the complete second operand, using broadcast along rows (see Section 2.1). 
Now a 16 x 16-q-bit multiplication is computed in each processor, as described above. The 
result is a 16. (q+1)-bit number. 
Next step is the addition of all 16. (q+l)-bit results in a processor-row, where the result in 
processor (ij) has to be considered as a multiple of 2 16 with respect to the result in processor 
(ij-1) (Figure 4.1). Each processor writes the least significant word of the result in its 
communication register. This value is added to the 16 bit more significant word of the western 
neighbour, which again stores the sum in its C-register and an internal register. This addition 
(with carry) is repeatedp times up to the most significant word. Finally, the carry-bit, which is 
generated by the last addition, is propagated along the processor rows as described in the 
previous section. 
Now the most significant m bits of the result of the mx n-bit multiplication are distributed 
over the rows of the ISA. They can be used as an operand in a following m-bit operation 
without additional communication. The least significant n bits of the result are stored in the 
first processor. If they are used as an operand in a following n-bit operation they can be 
broadcasted along the processor-row in 2-q instructions. 
32 
ISA program: The ISA program for the mx n-bit (m = 16-p, n= 16. q) multiplication in each 
processor-row is explained below. 
Every processor has 16 bits of the first operand in register RA, the least significant 16 bits in 
the leftmost and the most significant 16 bits in the rightmost processor. 
The complete second operand is stored in the registers RBO, ..., RBq-1in each processor. At 
the end the most significant m bits of the result will be distributed over the rows in RPq. The 
least significant n bits will be stored in the first processor of each row in RPO, ..., RPq-1- 
Loadmul RA; 
for i: =O to q-1 do 
multiply RBj -> RPi; 
multiply 
C: = RPO; 
for i: =1 
-ý RPq; 
to q do 
begin 
RPi-1: =C; 
C: =C [EAST] +RPj +carry; 
end; 
RPq := 
C= carry; 
If (RPq+1=0) then C: =C [WEST]; 
RPq := C[WESTI+RPq; 
Hload RA into the multiplier register 
Hcompute the 16 x 16. q-bit multiplication: 
RPO ... RPq := RA - RBO ... RBq-1 
Hload least significant result in C-register 
add RPO ... RPq along processor-row as 
shown in Figure 4.1 in q iteration steps 
//store result of previous addition in RPj -I 
Hadd with carry of RPj and C-register of the 
eastern neighbour (C [EAST]). C [EAST] 
stores the result of the eastern neighbours' 
previous addition. Result is stored in the own 
C -register. 
//propagate the carry-bit of the last addition 
33 
To illustrate the concept of multiplication presented in this section, we discuss the following 
example: 
Example 4.2: Multiplication of two 64-bit numbers in a row with 4 processors: 
Processor 1 2 3 4 
RA $3210 $7654 $BA98 $FEDC Every pr. has rst 16 bits of the fi- 
operand in RA (least significant 16 bits 
in pr. 1, most significant 16 bits in 
proc. 4) 
RBO $0001 $0001 $0001 $0001 The complete second operand is stored 
RB1 $0000 $0000 $0000 $0000 in the registers RBO, RB3 in each 
RB2 $0000 $0000 $0000 $0000 processor. 
RB3 $1000 $1000 $1000 $1000 
RPO $3210 $7654 $BA98 $FEDC 16 x 64-bit multiplication 
RP, $0000 $0000 $0000 $0000 RPO ... RP4 = RA - RBO ... RB3 RP2 $0000 $0000 $0000 $0000 RPO in Proc. 1 stores word 0 of the 
RP3 $0000 $4000 $8000 $COOO final result 
RP4 $0321 $0765 $OBA9 $OFED 
RP, $7654 $BA98 $FEDC $0000 RP, (pr. i) RPI(pr. i) + RPO (pr. 
i+]) (pr. 4 adds zero) 
Pr. 1 stores word 1 of final result 
RP2 $BA98 $FEDC $0000 $0000 RP2(pr. i) := RP2(pr. i) + RP, (pr. 
i+l) + carry 
Pr. 1 stores word 2 of final result 
RP3 $FEDC $4000 $8000 $COOO RPApr. i) := RP3(pr. i) + RP2 (pr- 
i+l) + carry 
Pr. 1 stores word 3 of final result 
RP4 $4321 $8765 $CBA9 $OFED RP4(pr. i) := RP4(pr. i) + RP3 (pr. 
C 0 0 0 0 i+1) +carry; C := carry 
Pr. 1 stores word 4 of final result 
Zeroflag 0 0 0 0 if (RP4 +1 $0000) then 
zeroflag := 
else zeroflag :=0 
C 0 0 0 0 C after the propagation step: 
if zeroflag then C: = 
C[WEST] 
RP4 $4321 $8765 $CBA9 $OFED RP4 := RP4 + C[WEST] 
Pr. 2-4 store words 5-7 of final 
result 
The time complexity of the presented ISA algorithm for nx n-bit multiplication is O(n) using 
O(n) ISA processors, whereby the maximum speedup compared to the school method 
multiplication is attained. A drawback of the algorithm is the need of O(n) registers per 
processor for storing the complete second operand and for storing the lower n bits of the 
result. The following decomposition technique solves the problem with a constant number of 
registers in the same time: 
Partitioning of the multiplication with a constant number of registers: 
An nx n-bit 
multiplication can be solved by computing n/k times an nx 
k-bit multiplication and adding 
the most significant n bits of the previous result to the current result 
in each iteration step. The 
ISA implementation achieves this by loading only a constant number k bits of the second 
operand into all processors. Each processor computes a 
16 X k-bit multiplication. 
34 
Simultaneously with this multiplication, the 16-bit result of the previous multiplication is 
added to the result without additional delay, using the multiply-add feature of the multiplier 
(see above). The next k bits of the second operand are loaded after the completion of the nx k- 
bit multiplication from the adjacent western IP into the processor row. 
The performance of the implementation of the above partitioning algorithm on Systola 1024 is 
depicted in Table 4.2. 
Of course, 32 multiplications can be calculated simultaneously within 223 (507) clock cycles, 
i. e. it takes less than 7 (16) ISA instructions per 512 x 512 (1024 x 1024)-bit multiplication. 
The constant k is chosen to be 128 (8 - 16) bits, taking advantage of the 8 CR-registers of 
each processor of Systola 1024 (see Section 3.4.2). 
The source code for the 1024-bit multiplication on Systola 1024 is presented in the ISA 
procedure mu 1t 10 24 in Appendix A. 3.1. 
4.3 Division of Long Operands 
Division can be efficiently reduced to multiplication and subtraction by using the Newton- 
Raphson-method. For a value B the reciprocal value is computed by an iteration that 
converges rapidly towards 1/B. To achieve a fast convergence (2' bits precision in m iteration 
steps), 0.5 :! -ý B<1 must hold for B. The iteration is given by the equations 
(1) XO: = 1 
(2) xi+l ý (2 -B- xi), xi 
Obviously, any division Q=A/B can be computed by multiplying A with I/B. For division 
on the ISA we assume B to be in the range between 0.5 and 1. If this is not the case, B is 
initially shifted in the ISA such that the most significant one is the MSB of the rightmost word 
of B. The shifting distance is stored in order to adjust the result after having computed the 
value of I/B. 
In order to understand the choice of intermediate operand lengths in our division method it is 
useful to consider the behaviour of the convergence in the Newton-Raphson-algorithm. Let xi 
differ from the precise value of I/B by some F-. Then the following holds for xi+i: 
x, +, -= 
(2-B- x, ) -x, = (2- B. (11B-e». (IIB-e) = 11B- B. e' 
Since B<1, this means that the precision of xj+1 (i. e. the number of correct leading digits) is 
at least double the precision of xj. This means that we can restrict the operand lengths to 
2' bits 
for B and xj-1 while computing the ith iteration xi. E. g. the first four iterations can 
be performed 
in one processor, since the required precision of the operands 
is :! ý- 16. The fifth iteration 
requires 32-bit precision, etc. Only the last iteration step needs 
full precision of m bits. The 
subtractions and multiplications can be performed efficiently as explained above. 
35 
Table 4.2: Number of instructions in one Systola 1024 processor row for one multiplication 512 x 512-bit resp. 1024 x 1024-bit multiplication. 
The implementation of the iteration is straightforward according to equation (2). Therefore, 
we only concentrate on some technical details to gain efficiency in the ISA division: 
The multiplicand B is used in each iteration (with increasing precision). It is distributed in the 
usual way along the row and stays there from the first to the last iteration step, being used 
once in every iteration. 
Whenever the subtraction is performed in equation (2), we know beforehand that the result 
will be greater than zero. Therefore, we need no information about the sign of the result after 
the subtraction. This means that we can represent the number two by a sequence of zeros (the 
digits following the most significant bit which itself is hidden). 
The subtrahend must then also be represented as one digit before and m-1 digits behind the 
point. In this representation a problem arises when a multiplication of two such numbers is 
performed: the point moves by one position with respect to the most significant digit. To 
compensate for this, we have to shift the result by one bit after each multiplication. This shift 
operation can be executed on the ISA in three instructions. Assume the value to be shifted is 
stored in the RA registers of the processors of a row. The instructions 
RA := RA + RA; 
C: = carry; 
RA := RA or C[WEST]; 
perform this shift operation. The first instruction shifts RA one digit to the left, filling the least 
significant bit up with a zero. The second instruction stores the bit that has been shifted out in 
the first statement into the communication register. Now it is read from the right neighbour 
(C [ WEST ]) which copies it into the LSB of its shifted RA register. 
The shift operation can be avoided after multiplication with B because B is smaller than one. It 
can be represented with the binary point left to the MSB. Whenever a multiplication with xj 
has to be performed, this value has to be distributed to all processors that are active in that 
multiplication. This can be done in a constant number of instructions by using the broadcast 
operation (see Section 2). 
The performance of the computation of 1/B on Systola 1024 is shown in Table 4.3. 
Again we need to point out that 32 divisions can be executed simultaneously on the ISA, 
leading to an average of less than 35 (80) instructions per 512 (1024)-bit division. 
36 
Table 4.3: Number of instructions for the computation of I/B with 0.5 :! ý B<1 in one 
processor row of Systola 1024 with a precision of 512 bits and 1024 bit. 
4.4 The ISA in RSA Cryptography Applications 
4.4.1 Private-key and Public-key CryPtosYstems 
Imagine two communication partners, a sender and a receiver, where the sender privately 
transmits a message M to the receiver. Therefore, the sender uses a cryptographic algorithm E 
and a key k in order to encrypt the message M to the ciphertext C= E(Mk), making it 
unreadable for opponents. Since the receiver also knows the key k, he can recreate the initial 
message M by decrypting the ciphertext C. For that purpose, he computes M= El (C, k) using 
the inverse cryptographic algorithm E" (see Figure 4.2). 
An opponent might know about the used cryptographic algorithm E; the reason why he cannot 
recreate the plain message M from the ciphertext C is that the key k is exclusively known by 
the sender and the receiver. Cryptosystems of this type are known as private-key systems, 
since the exposure of k renders the system insecure. Two well known private-key systems are 
DES and IDEA [55]. 
joendoo, Ciphertext Ciphertext 1.9colmew 
C insecure C %%- 
Secret Messag M 'o" Channel too, 
M E-I (C, k) 
Key k C= E(Mk) 
oppol"Ofto 
M= E-I (C, ?) 
does not know k 
Figure 4.2: Principle of a private-key cryptosystem 
One drawback of a private-key system is that it requires the prior communication of the key k 
between sender and receiver, using a secure channel, before any ciphertext is transmitted. In 
practice, this may be very difficult to achieve, e. g. for e-mail. 
A public-key cryptosystem uses different keys ek and dk for encryption and decryption. The 
idea behind a public-key system is that it might be possible to find a cryptosystem, where it is 
cornputationally infeasible to determine dk given ek- If so, then the encryption key ek could be 
made public by publishing it in a directory (hence the term public-key system). 
The advantage of a public-key system is that any person can send an encrypted message to the 
receiver by using the public encryption key ek. The receiver will be the only person that can 
decrypt the ciphertext, using his secret decryption key dk, see Figure 4.3. A well 
known 
public-key system is RSA, which is introduced in the next section. 
A drawback of public-key cryptosystems is their relative high computing time, e. g. 
DES 
implementations have approximately 1000 times higher throughput than RSA 
implementations. 
37 
Encryption 
Key ek 
Ciphertext 
C insecure Message M Channel 
C= E(M, ek) 
opponso4r 
M=E-I(C, ed 
intractable 
Ciphertext 
c 
Figure 4.3: Principle of a public-key cryptosystern 
4.4.2 The RSA Cryptosystem 
M= D(Cdý 
Decryption 
Key dk 
Rivest, Shamir and Adleman invented the most popular public-key cryptosystem named RSA 
[63]. The security of RSA is based on the difficulty of factoring large integers. The encryption 
and decryption keys are functions of a pair of large prime numbers. Recovering the plaintext 
from the public encryption key and the ciphertext is conjectured to be equivalent to factoring 
the product of the two primes. 
To generate the two keys, choose two large random prime numbers, p and q. For maximum 
security, choose p and q of equal length, e. g. 512 bit. 
Compute the product 
N=p. q 
Then randomly choose the encryption key, e, such that e and (P- 1) - (q- 1) are relatively prime. 
For optimum security e has the same bit length as N. Finally, use the extended Euclidean 
algorithm to compute the decryption key, d, such that 
e. d=- 1 mod (p-l). (q-1) 
Note that d and N are also relatively prime. The numbers e and N are published in a directory; 
the number d is kept secret (see Figure 4.4). The two primes, p and q, are no longer needed. 
They should be discarded, but never revealed. 
To encrypt a message M, first divide it into numerical blocks smaller than N (with binary data, 
choose the largest power of two less than N). The encrypted message, C, will be made up of 
similarly sized message blocks, G of about the same length. 
The encryption formula is simply 
Ci = Mi'mod N 
To decrypt a message, take each encrypted block Ci and compute 
M 
i= Cid mod N 
Since 
38 
cd (Me)d 
i (mod 
Med 
i (mod 
M k(p-1)(q-1) +I 
i (mod N) (for some integer k ý! I with e-dk- (p- (q- 1)+ 1) 
(m (P-1)(q-1))k m (mod N) 
Ik Mi (mod N) (because Z N* is a multiplicative group of order (p-1)(q-1), 
M (p-1)(q-1) I it holds i (mod N) [for more details see [9111) 
= Mi (mod N) 
the formula recovers the message. This is summarised in Table 4.4. 
'emblic 
'trust cent%, 
I Name I Key I 
I N=p-q I 
generate 
large primes 
p, q 
Schimmler I N, e 
de 
such that d- e -=I 
mod (p-1)(q-1) 
I 
GbIMM, oeB 
Pd",,, I 
d 
Figure 4A RSA key generation and distribution (The secret key d has to be transmitted 
securely, e. g. by using a key-exchange protocol. ) 
Encryption Key 
N product of two primes p and q (p and q must remain secret) 
e relatively prime to (p-1)(q-1) 
Decgption Key 
d e-I mod ((p-l)(q-l)) 
Encrypting 
C=MmodN 
Decrypting 
C- 
-ýI mod 
N 
Table 4.4: RSA encryption and decryption 
39 
4.4.3 Parallel Implementation of RSA encryption and decryption 
For obtaining a high-speed implementation of RSA encryption and decryption a fast modular 
exponentiation (M mod N) is necessary. To achieve a sufficient degree of security the 
wordlengths in the modular exponentiation have to be relatively large (currently ranging from 
512 up to 2048 bits). The modular exponentiation can be accomplished by iterating modular 
multiplications. 
In the next two sections it is described how these two operations can be efficiently 
implemented on the ISA. 
4.4.3.1 Modular Multiplication 
The difficulty in implementing an efficient modular multiplication (a-b mod N) on the ISA is 
the modular reduction step. In the case of RSA encryption and decryption the modulus N is 
known in advance to each modular multiplication. So the precomputation of the modulus' 
reciprocal 1/N requires only a negligible amount of work. Now the division x div N can be 
replaced by the more efficient multiplication x. (IIN). 
Of course we cannot produce 1/N precisely. We instead produce a number [t such that x. [t is 
close enough to x div N, i. e. the result needs only a small correction by at most four. This is 
2n- n-1 n achieved by using g=21 div N with 2<N<2. 
Instead of computing the precise quotient 
q=x div N= Lx / NJ = L(x /2n )(2 
2n-1 / N) 2 n-1 
an estimate is calculated by replacing the floating point divisions by integer divisions 
q'= ((x div 2n)-, u) div 2 
n-1 
Obviously this estimate is never too large. The situation in RSA encryption and decryption is 
x< N2,2 n-1 <N<2 n. Then the following holds: 
q'= «x div 2") -, u) div 2 
n-1 = «x div 2n )(2 2n-' div N» div 2"-' 
L«x div 2n )(2 2n-' div N» /2 n-1 
j 
ýý L«x /2n -1)(2 
2n-1 /N- 1» /2 n-1 
1 
L(2 n-1 
-x/N-x2n -2 
2n-1 /N+ 1) /2 n-1 
1= Lx 
/N-x/2 2n-1 -2 
n/N+2 n-1 
ýýLxlN-2-2j LxlNJ-4 = q-4 
Thus, the error of the integer division is at most four: 
(3) q-4 !ý q':! ý q 
With another multiplication an estimate of the remainder x mod N is given 
by 
(4) r9= x- qýN 
Because of (3) at most four further subtractions of N are required to obtain the correct result. 
(5) for i :=1 to 4 do if Wý! N) then r': = r ý- N 
The complete modular multiplication is summarised 
in Algorithm 4.1 (known as Barrett's 
algorithm [2,6]). It is taken to implement the n-bit modular multiplication 
in each ISA 
processor-row. 
40 
Algorithm 4.1: n-bit modular multiplication 
Input: a, b, N, g, with a<N, b<N, 2'-' <N<2n2 
2n- I div N 
Output: x= a-b mod N 
1. x: = a. b 
2. y: = «x div 2 n)-g) div 2 n-1 
3. y: =y. N 
4. x: =x-y 
5. fori: =lto4doif(xýýN)thenx: =x-N 
The nx n-bit multiplication (see Section 4.2) and the (n+3)-bit subtraction (see Section 4.1) 
are used as basic operations. At first three subsequent nx n-bit multiplications (1., 2. , 3. in Alg. 4.1) are computed. Because x-y<2 n+3 the 2n-bit subtraction in (4. ) can be replaced by 
a (n+3)-bit subtraction x=x mod 2 n+3 _y mod 2 n+3 . The (n+3)-bit subtraction is implemented 
as a n-bit subtraction with an additional subtraction with carry of the most significant three 
bits in the last processor of each processor row. 
Finally, the corrections of the remainder x=x-N are computed using again (n+3)-bit 
subtractions. Because the number of corrections has to be determined data independently on 
the ISA the worst case of four corrections has to be considered. 
The time complexity of the presented ISA algorithm for n-bit modular multiplication in one 
processor row (with precomputedu ) is O(n) using 0(n) ISA processors and 0(l) registers, 
whereby the maximum speedup is attained again. The performance of the implementation of 
the presented ISA algorithm on Systola 1024 is shown in Table 4.5. 
Table 4.5: Number of instructions for a single 512 (1024)-bit modular multiplication in one 
processor-row of Systola 1024. 
Length of operands 512 bits 1024 bits 
Modular Multiplication IF- 701 1519 
Of course, 32 modular multiplications are calculated simultaneously within 701 (1519) clock 
cycles, i. e. it takes less than 22 (48) ISA instructions per 512 (1024)-bit modular 
multiplication. 
The source code for the 1024-bit modular multiplication on Systola, 1024 is presented in the 
ISA program XEXPO 10 24 in Appendix A. 3.1. 
4.4.3.2 Modular Exponentiation 
One of the most important arithmetic operations for public-key cryptography is modular 
exponentiation (At mod N), e. g. for the encryption and decryption operations in RSA, as 
noted above. 
Because it holds (a - b) mod N= ((a mod N) - (b mod N)) mod 
N, computation of 0 mod N 
can be done by using e-I modular multiplications. However this is very 
inefficient if e is 
large. 
41 
There are two ways to reduce the time to do modular exponentiation. One way is to decrease 
the time to do modular multiplication (see previous section); the other to reduce the time to 
reduce the number of modular multiplications (presented in this section). 
In the following the implementations of three efficient algorithms for modular exponentiation 
[18,33] on Systola 1024 are investigated: 
" square-and-multiply algorithm 
" m-ary algorithm 
" sliding-window algorithm 
Each of these algorithms reduce the number of modular multiplications required to compute 
At mod N toO(1092e). 
4.4.3.2.1 Square-and-multiply Algorithm 
The well-known square-and-multiply algorithm (also known as binary algorithm) [32] uses 
the fact that 
me=(... ((M e,, -, ) 
2. M e,, -2 ) 
2.... ) 2. M eo; 
e= (en-I e(, )2 
Thus, the number of modular multiplications required in this method varies with the number 
of ones in the binary representation of the exponent. Assuming the values of bits in the binary 
representation of the n-bit exponent are equal distributed the square-and-multiply method 
requires 1.5. (n-1) modular multiplications on the average and 2-(n-1) modular multiplications 
in the worst case. 
Algorithm 4.2a: left-to-right square-and-multiply algorithm 
Input: M, e= (e, -,,..., eo)2, 
en-1=1, N 
Output: y= Atmod N 
I. y: = M 
2. for i: = n-2 downto 0 do 
2a. y := Y2 mod N 
2b. if (e, = 1) then y: = y-M mod N 
The right-to-left square-and-multiply algorithm becomes possible, if the exponent e is scanned 
from the LSB to the MSB: 
n-I 
me... (((Meo). M2e, ). M4e2).... ). M2n-len-I = 
1-1 M2'ej; 
i=O 
Algorithm 4.2b: right-to-left square-and-multiply algorithm 
Input: M, e= (en-l, -.., eo)2, 
en-1=1, N 
output: x= Alf mod N 
1. x: = I 
2. Y: = M 
3. for i: = Oto n-ldo 
3a. if (e, = I) then xx-y mod N 
3b. Y: = y-M mod N 
e= (en-I ý..,, eo)2 
42 
Both square-and-multiply algorithms can be implemented efficiently on Systola 1024 if the 
same exponent e is used for the modular exponentiation in each processor row, i. e. one large 
message is encrypted or decrypted for one user, in the following way: 
The sequential host evaluates the exponent bits ej. Depending on the result (if ej = 1) the ISA 
program for modular multiplication (Step 2b in Alg. 4.2a, Step 3a in AIg. 4.2b) is executed on 
the processor array, otherwise not. Thus, the implementation requires 1.5. (n-1) modular 
multiplications on the average. 
Having 32 different exponents used for the modular exponentiation each processor row, i. e. 
several small messages are encrypted or decrypted for several users, the worst-case execution 
time of 2. (n-1) modular multiplications has to be considered in every case: 
The ISA program for modular multiplication (Step 2b in Alg. 4.2a, Step 3b in Alg. 4.2b) is 
executed in each iteration step i, n-2:! ý i:: 'ý, 0. After that, all processors of each processor row j, 
I :! ýj:! ý 32, get the ith bit of their corresponding exponent e(i), using the efficient row broadcast 
operation (see Chapter 2). Depending on ej(') each processor takes either the result of the 
modular multiplication (if ei(') = 1) or the previous y-value (if ei(i) = 1) as the new y-value. 
Table 4.6 depicts the performance of modular exponentiation using the right-to-left square- 
and-multiply algorithm on Systola 1024. The corresponding source code for 1024-bit modular 
exponentiation with same exponent is presented in Appendix A. 
4.4.3.2.2 m-ary Algorithm 
The m-ary algorithm is a generalisation of the square-and-multiply algorithm. Partitioning the 
exponent into blocks of d bits, the exponentiation can be calculated by: 
Me ((M 
A-1 )M. M A-2 )M.... ) rn 
-M fO ;e= (e,, -, 
eo)2 = (fk-lfk-2 fo)m; m=2d; k=n d 
The m-ary method (for m=2 d) first computes the values of Af for w=2,3,..., 2ý-I. The 
exponent e is then scanned d bits at a time from the most significant to the least significant. At 
each step the partial result is raised to the 2d power and multiplied with M I, where wi is the 
current nonzero word. 
43 
Table 4.6: Computing time for a 512 (1024)-bit modular exponentiation using the square- 
and-multiply algorithm in each processor row on Systola 1024 with the same exponent in each 
processor row and using different exponents in each processor row. (All data transfers are 
included. ) 
Algorithm 4.3: m-ary method 
d Input: M, e= (e, -,,..., eO)2, en-1 
N, m2, n= k-d 
Output: y= At mod N 
1. compute and store M' for w=2,3,..., 2 
d 
_I 
2. decompose e into d bit words fj for i=0,1,..., k-1 
3. y := 
Mfk-I 
4. for i: = k-2 downto 0 
4a. Y: -----,: y2 
4b. if fi #0 then y: = y- MJ, 
The number of modular multiplications of the m-ary method is in the worst case 
2d-2 (Step 1) +n-d (Step 4a) +n-I (Step 4b) 
d 
Taking n= 512 (1024) and d=4, the number of modular multiplications is 649 (1289) in the 
worst case. Thus, the square-and-multiply algorithm requires about 50% more multiplications, 
i. e. 1022 (2046), than the m-ary method in these cases. 
The m-ary algorithm can be taken to implement the modular exponentiation efficiently on 
Systola 1024 if 32 different exponents are used in each processor row. The values Mj' are 
computed in all processor rowsj, 1 :! ýj:! ý 32, w=1,2,..., 2 d_ I (Step 1). A table of these values 
is stored in the memory blockj in the western board RAM, where memory blockj, I :! ýj:! ý 32, 
stores Mjw for w=1,3,..., 2 d_ 1. 
The sequential host partitions the exponents d) for i=1,..., 32 (Step 2). For a modular 
multiplication y- Mj f' (Step 4b) the table entry Mi f' is transferred from memory block j of 
the western board RAM to the jth western IP. If fjý-) =0 holds the n-bit constant one is loaded 
into the j th western IP. This data transfer can be done concurrently to the computation of the 
previous modular multiplication. 
The computing time for a 512/1024-bit modular exponentiation using d=4 with the m-ary 
algorithm in each processor row of Systola 1024 with different exponents in each row would 
be constant 160/640 ms including all data transfers. 
4.4.3.2.3 Sliding-window Algorithm 
The m-ary method decomposes the bits of the exponent into d-bit words. The probability of a 
word of length d being zero is 2 -d , assuming that the zero and one 
bits are produced with equal 
probability. In Step 4b of the m-ary method, we skip a multiplication whenever the current 
word is equal to zero. Thus, as d grow larger, the probability that we have to perform a 
multiplication operation in Step 4a becomes larger. However, the total number of 
multiplications increases as d decreases (see above). 
The sliding-window algorithm provides a compromise by allowing zero and nonzero words of 
variable length; this strategy aims to increase the average number of zero words, while using 
relatively large values of d. 
44 
The sliding-window method decomposes the n-bit exponent into zero and nonzero words 
(windows) fi of length L(fi). Let k be the number of windows and d be the length of the longest 
window d= max (L(fj)) for i=0,1,.., k- I- 
Furthermore, iff is a nonzero window, then the least significant bit off must be equal to one. 
This is because the exponent is partitioned starting from the least significant bit, and there is 
no point in starting a nonzero window with a zero bit. Consequently, the number of pre- 
processed multiplications (Step 1 in the sliding window algorithm) are only 2 d-I , since 
M' are 
computed for odd w only. 
Algorithm 4.4: Sliding-window 
Input: M, e= (e, -,...., eo), e, _1 = 
1, N, d 
Output: y= Af mod N 
1. Compute and store M' for all w=1,3,..., 2' - 1 
2. Decompose e into zero and nonzero windows f of bitlength L(f i=0,... , 
k- I 
3. y. - =M fk- 1 
4. for i: = k- 2 downto 0 do 
4a. Y: =y 2 
L(Fi) 
4b. if (fi # 0) then y: =y -M fi 
The average number of multiplications in the sliding-window algorithm depends on the 
partitioning strategy (Step 2) and the window size d. We use the Variable Length Nonzero 
Window (VLNUO partitioning strategy, described in [33]. For example the partitioning of an 
exponent for d=5 is 
1010 1110100 101 10111000000 100 111000 1011 
For n= 512 (1024) and d=5 (6) the average number of multiplications is 595 (1174). 
The sliding-window algorithm can be taken to implement the modular exponentiation 
efficiently on Systola 1024 if 32 modular exponentiations for the same exponent and 
different bases are computed in each processor row. 
The table for the precomputed values in each row Mj' is stored in the memory block w in the 
western board RAM for 1 :! ý i :! ý 32, j=1,3,..., 2ý-l (Step 1). Concurrently to the ISA 
computation of Step 1 the sequential host partitions the exponent using the VLNW strategy 
(Step 2). Depending on this partition Systola 1024 computes a modular squaring of the 
previous result (Step 4a) or a modular multiplication of the previous result with a table entry 
(Step 4b). For a modular multiplication with the table entry Mjw the memory block w of the 
western board RAM is transferred to the western IPs for i=1,..., 32. This data transfer can be 
done concurrently to the computation of the previous modular multiplication. Note that no 
additional work for the input of the first operand (y in Steps 4a and 4b) is necessary because 
this operand is always the result of the previous multiplication. 
The computing time for a 512/1024-bit modular exponentiation using 
d=516 with the sliding- 
window algorithm in each processor row of Systola 1024 with the same exponent 
in each row 
would be approximately 140/580 ms on average including all 
data transfers. Since the window 
lengths of the partitioned exponents are variable, there 
is no chance in implementing the 
sliding-window algorithm efficiently with different exponents 
in each processor row. 
45 
4.4.4 Performance Evaluation of RSA encryption 
The performance of the implementation on Systola 1024 is compared with a sequential 
implementation on a PC (Pentium 11 Processor, 266 MHz). Both implementations use the 
same technique for modular exponentiation (right-to-left square-and-multiply algorithm (Alg. 
4.2b)) with same exponent in each processor row, see Section 4.4.3.2.1) and the same modular 
multiplication presented in Section 4.4.3.1. 
The time critical routine for modular multiplication is implemented in assembly language 
optimised for 32-bit Intel processors. The throughput of the PC is 14 Kbit/s for 512-bit RSA, 
resp. 5 Kbit/s for 1024-bit RSA encryption on average. 
The performance of Systola 1024 is 102 Kbit/s for 512-bit RSA resp. 44 Kbit/s for 1024-bit 
RSA, which corresponds to a speedup of 7 resp. 9 of the parallel implementation (see Table 
4.9). 
The source code for the 1024-bit RSA encryption and decryption on Systola 1024 is presented 
in Appendix A. 
Gieseke reports [19] the performance of 1024-bit RSA encryption/decryption on a 600 MHz 
DEC Alpha processor to be 16 Kbit/s. 
As mentioned in Chapter 3, the current ISA-prototype Systola 1024 is a machine based on 
technology far from being state-of-the-art. Extrapolating to technology used for processors 
such as the Pentium 11266 MHz would lead to a speedup of the ISA. Thus, the performance of 
the PC is also compared to the expected performance on Systola 4096. 
The performance of Systola 4096 is estimated to be (at least) ten times higher than on Systola 
1024: 
Factor 8 because of the four times higher processor number and two times higher cycling 
frequency 
Factor 1.25 because of a faster modular multiplication routine (faster 1/0 of operands and 
taking advantage of more CR-registers in the partitioning of multiplication) 
Taking advantage of the Chinese Remainder Theorem (CRT), computational effort of RSA 
decryption can be reduced by almost a factor of 4. The CRT can not be applied for RSA 
encryption [9 1 ]. 
Table 4.9: Comparison between the performance of RSA encryption and RSA decryption on 
Systola 1024, Pentium 11266 MHz, and Systola 4096 with a 512-bit and 1024-bit exponent 
(without taking advantage of the CRT). 
512-bit RSA 1024-b-it RSA 
Systola 1024 102 Kbit/s 44 Kbit/s 
Systola 4096 1020 Kbit/s 440 Kbit/s 
Pentium H 266 14 Kbit/s 5 Kbit/s 
Speedup 7/70 9/90 
46 
4.4.5 Parallel Implementation of RSA Key Generation 
Especially for key generation within the RSA algorithm, it is important to test whether a 
random number is a prime. In order to answer the question "Is p prime ? ", it is not necessary 
to factor p- testing for primality is much easier than factoring. The probabilistic Rabin- 
Miller algorithm [56,6 1] is frequently used for primality testing (see below). 
Thus, for generating one of the primes p, q one can proceed as follows: 
1. Generate a random n-bit number 
2. Set the MSB andthe LSB ofp to 1. (This ensures thatp is an odd number with 2`1 <p 
<2 n) 
3. Make sure that p is not a multiple of small primes lower than a certain threshold, e. g. 
100. Otherwise, start again with Step 1. 
4. Perform the Rabin-Miller test for some random a. If p passes, generate another random 
a and go through the test again. Sixteen subsequent tests are assumed to prove the 
primality of p with sufficiently very high probability. If any of the tests fails, p is not 
prime. Start again with Step 1. 
Step 3 can significantly accelerate primality testing. For example, 76% of all odd numbers can 
be divided by a prime less than 100, and are thus quickly verified to be composite. 
The Rabin-Miller test for composites is a yes-biased Monte Carlo algorithm, i. e. the test 
gives always the right result if it identifies a number as composite. But it makes mistakes with 
a certain probability if it identifies a number as prime. 
The test uses Fermat's theorem, which tells us that 
Suppose p is prime and acI0,..., p- 11, then aP- 1 -= 1 (mod 
and the fact that 
( **) If x2=1 (mod p) has another solution than x=1 or x=-1, p is not prime 
The Rabin-Miller test calculates a-' mod p. The number a is called a witness as soon as p is 
identified as a composite number by running the test with a. Three-quarters of the possible 
values of a are guaranteed to be witnesses [23]. Thus, the probability for making a mistake is 
at most I/4 per test and the probability for delivering the wrong result with t tests is only I 
4'. 
This makes the Rabin-Miller algorithm very reliable and it is definitely secure enough to use it 
for generating RSA keys. 
Algorithm 4.5: Rabin-Miller 
Input: p, 2' "<p< 2", rO)2: = p-1, a<p 
Output: isprime 
1. isprime: = TRUE 
2. y: = a; 
3. for i: = n-2 downto 0 do 
3a. x: = y 
3b. Y: = y2 modp 
47 
3c. if (y =I and x#I and x# -1) then isprime: = FALSE 
3d. if (r, = 1) then y: = y-a mod p 
if (y: gl- 1) then isprime: = FALSE 
The Rabin-Miller algorithm calculates a-' mod p (see Algorithm 4.5) using modular 
squarings (3b) and modular multiplications (Step 3d) based on the left-to-right square-and- 
multiply method (see Algorithm 4.2a). Additionally, the results of each modular squaring 
(Step 3b) and the final result (Step 4) is evaluated based on Fermats' Theorem and Theorem 
(**). Depending on the result p is detected as a composite. 
The ISA algorithm for one iteration step (Steps 3a, 3b, 3c, 3d) in a processor row uses the 
instruction systolic modular multiplication, presented in Section 4.4.3.1, for the computation 
of 3b and 3d. The required comparison on long numbers (Step 3c) can be performed as 
follows: 
Values to be compared (y and 1, x and 1, x and -1) are subtracted along the processor 
row, using the ISA subtraction algorithm on long numbers (see Section 4.1) 
Compared values are equal if and only if the result in each processor is equal to zero. 
Assuming this result is stored in the C-register, the parallel prefix computation 
C :=C or C[WESTI 
computes the result of the comparison in the C-register of the rightmost processor. 
The boolean value of isprime (Step 4) is stored in a register R[i sprime I also in the 
rightmost processor. This register is initialised with TRUE (1) (Step 1). After each 
comparison in Step 3c it is updated by R[ isprime I: =R [ isprime I and C. The C- 
register stores FALSE (0), if the result of the performed comparison is true, and TRUE 
(1) otherwise. 
The complete iteration (Step 3) is carried out in the same way as the modular exponentiation 
with the square-and-multiply algorithm on Systola 1024 (see Section 4.4.3.2.1). Thus, 32 
different numbers are tested for primality in parallel. After the final comparison (Step 4) the 
resulting values of R[isprimel are transferred to the sequential host. If a number is 
identified as a composite (R [ isprimel = 0) anew number is loaded into the corresponding 
processor row. Otherwise, if a number is identified as a prime (R [i spr ime I= 1) it is tested 
with a new a-value. Generation of new prime number candidates and new witnesses can be 
computed on the sequential host concurrently to the parallel Rabin-Miller algorithm. 
Of course, the Rabin-Miller algorithm could stop if isprime (Step 3c) is seeded to f alse 
once. A disadvantage of the parallel ISA algorithm is, that it can not stop its computation data 
dependently. Thus, the worst case computing time has to be considered in each test. 
Our implementation of the Rabin-Miller algorithm on Systola 1024, which is described more 
detailed (with source code) in [21], searches for 512 bit prime numbers. For each tested 
number 16 witnesses are used constantly. About 40 to 70 possible prime numbers are found 
on the average out of 4096 tested numbers within ten minutes. 
48 
4.5 Other Cryptographic Applications on the ISA 
It should be emphasised that the ISA is not only suitable for the implementation of the RSA 
cryptosystem. Also the efficient implementation of many private-key cryptosystems is pretty 
easy to achieve. One example is the implementation of IDEA, which is used in the famous 
PGP electronic mail security program [81]. IDEA is often applied in hybrid cryptosystems 
in connection with RSA to speedup data encryption: 
A (fast) private-key algorithm, e. g. IDEA, with a random session key is used to encrypt the 
message, and a (slower) public-key algorithm, e. g. RSA, is only used to encrypt the random 
session key. 
IDEA operates on 64-bit data blocks. The key is 128 bits long. The same algorithm is used for 
both encryption and decryption. All operations operate on 16-bit sub-blocks, i. e. XOR, 
addition modulo 2 16 , multiplication modulo 
2 16 . Thus, each ISA processor can encrypt / 
decrypt a 64-bit data block only using its internal registers without additional communication 
between processors. The implementation on Systola 1024 has a throughput of 80 Mbit/s, 
which corresponds to a speedup of factor 8 compared to a Pentium 11266 MHz. 
The Elliptic Curve Cryptosystem (ECQ is another popular public-key system. The ECC 
relies upon the difficulty of the Elliptic Curve Discrete Logarithm Problem. In [3] an efficient 
parallel implementation of ECC on Systola 1024 is described, based on the techniques for 
instruction systolic long operand arithmetic present in this chapter. 
49 
The ISA in Image Processing Applications 
Real-time image processing and understanding has long been regarded as a particularly 
demanding problem for computer implementation, both because of the computational 
complexity and because of the large 1/0 bandwidth required by most of the tasks involved. 
Consider, for an example, the 1/0 bandwidth required to perform HDTV simulation. Such a 
task typically involves the handling of 1024 x 1024 frames at the rate of 60 frames per second 
and results in a bandwidth requirement of approximately 500 MB per second for a 
progressively scanned image. 
Not surprisingly, such problems generally lie beyond the capacity of existing sequential 
processors. Consequently, a great deal of effort has been devoted to developing parallel 
architectures and algorithms for real-time image processing [8,22,53,88]. A central question 
is to match architectures and algorithms in such a way that good performance is reached. 
A two-dimensional array of processors can be used in image processing applications to match 
the spatial relationship of an input image by assigning pixels to processors in a natural way. 
As a consequence, machines of this architecture can perform the local window operation with 
extreme efficiency. While nearest-neighbour links make local inter-processor communication 
quite fast, communication between two processors on either end of an nxn array will require 
Q(n) time on a SIMD processor array. 
Because of the skewed instruction execution this operation can be performed more efficiently 
on the ISA within constant period. In this chapter we present techniques, which provide 
efficient solutions for several low level and intermediate level tasks on the ISA based on 
parallel prefix computations and local operations, e. g. convolution, Wavelet Transform, 
morphological operations, median filter, Fourier Transform, Hough Transform, and 
tomographic image reconstruction. In the case of Morphological Hough Transform these 
techniques are used to modify a classical algorithm (Hough Transform) in order to improve its 
performance, while fitting the characteristics of the ISA. 
In general, high level image processing requires the simultaneous execution of several 
independent tasks, which is more appropriate for a MIMD machine. But the disadvantage with 
using a purely MIMD machine, while it can simulate ISA operations, is that it do so only with 
a considerably overhead for control and synchronisation. Hence, they can not be expected to 
perform the data parallel computations typical of low level image processing with the same 
efficiency displayed by truly ISAs. 
50 
5.1 Neighbourhood Operations 
An analysis of the spatial resolution of the pixels in a local neighbourhood provides the first 
clue for the recognition of objects in images, e. g. edge detection, boundary extraction. Just 
processing individual pixels of an image by point operations does not provide this type of 
information. Point operations are only useful as an initial step of image processing to correct 
inhomogeneous and non-linear responses of the image sensor, to interactively manipulate 
images for inspection, or to improve the visual appearance. Other classes of operations are 
necessary that combine the pixels of a neighbourhood in an appropriate manner and yield a 
result which forms a new image. Operations of this kind belong to the general class of 
neighbourhood operations. They are a central tool for low-level image processing. This is 
why we discuss their implementation on the ISA in this section. 
Very frequently used neighbourhood operations are convolution, wavelet transform, 
morphological operators, and median filter [29]. However, their computation on sequential 
machines remains a very time consuming task when the image size and the considered local 
neighbourhood is large. The ISA algorithms in Sections 5.1.1 to 5.1.4 provide one way to 
speed up their computation. A problem of many parallel implementations is the cost of 
communication between processors, see e. g. In the following sections techniques are 
presented, which provide an efficient communication between neighboured pixels based on 
parallel prefix computations and local operations. 
Convolution 
Convolution is a fundamental image processing operation that is computationally demanding. 
It focuses on the question how to combine the grey values of pixels in a local neighborhood. 
Its first characteristic is the size of the neighbourhood, which we call the window or filter 
mask. The window may be rectangular or of any other form. We must also specify the 
position of the pixel relative to the window which will receive the result of the operation. 
With regard to symmetry, the most natural choice is to place the result of the operation at the 
pixel at the centre of an odd-sized mask. 
The elementary combination of the pixels in the window is given by an operation which 
multiplies each pixel in the range of the filter mask with the corresponding weighting factor of 
the mask, adds up the products (multiply-add), and writes the sum to the position of the 
centre pixel: 
mm 
gj =12: hn, 'fi-n, j-m = 
2: 2: h-n, 
-m * 
fi+n, j+m for i, j =O,..., N-1 
n=-M m=-M n=-M m=-M 
This equation assumes an odd-sized mask hm, nwith 2M+ Ix 2M+ I coefficients. It describes a 
discrete convolution. In comparison to the continuous convolution, the integral is replaced by 
a sum over discrete elements. The complexity of the discrete convolution 
is O(N 2. AJ ). The 
undefined borders of the input image can be zero-padded or cyclic. 
In the following, we will 
use the cyclic convolution. Thus, pixelsfi-nj-mwith 
i-n or j-m 010,..., N- 11 are defined by 
the physically existing pixels 
f(j_, ) mod N, &-m) mod N-Only this type of convolution will reduce to a 
multiplication in the Fourier space. 
51 
n 
-- 
Figure 5.1.1: Illustration of the discrete convolution operation with a3x3 filter mask. 
Typical application of convolutions are the detection of simple local structures such as edges 
and comers, and complex local structures such as textures. Especially the detection of 
complex local structures requires a large window size, which is computationally very 
intensive. The following convolution algorithm on the ISA is one way to speedup computing 
time significantly, such that convolution techniques can be applied to more real time image 
processing tasks. 
ISA algorithm: The computation of the convolution g between an NxN image f and an 
2M+I x 2M+I filter mask h on an NxN ISA assigns each pixel of the image to a unique 
processor. The basic idea of the parallel algorithm is to consider a superposition of (2M+l )2 
frames of the input image. If the computations for pixels in each frame can be made in 
parallel, then the time complexity of this scheme is O(A12). 
When M is much smaller than N, this scheme is very efficient. Otherwise, the convolution is 
more efficiently performed by the product of the Fourier Transforms of f and h using the FFT 
algorithm (see Section 5.2). Even more attractive is the indication that the time complexity 
can be independent of N, the image size. 
The strategy of the ISA algorithm for moving data to compute each frame of image for 
superposition is based on the fast row ringshift and column ringshift. Thus, the computation of 
each image can be done with constant period, although there are no wrap-around connections 
at the borders of the processor array. 
Assume image pixels have been assigned to the ISA one pixel per processor as follows: 
Processor (ij) stores fi-mj-m in a register RO for i, j=0, - 
N- 1. Three procedures are used 
for a concise representation of routing: 
row_ringshift (register): Ringshifts the content of each processor's register one 
position to the left (requires four instructions using the row ringshift operation: 
C: =register; C. =C [WEST] ; C: =C [EAST] , register: =C; 
) 
column - 
ringshift (register): Ringshifts the content of each processor's register one 
position up (requires four instructions using the column ringshift operation: 
C: =register; C. =C [NORTH] ; C: =C [SOUTH] , register: =C; 
) 
in_broadeast (value): Loads value into the C-register of all processors (requires two 
instructions using row/column broadcast) 
n-I 
DUE-l n 
1 ! 1111 11 n+l 
52 
The result of the convolution is saved in register RI of each processor, i. e. processor (ij) saves 
gjj for i, j=0,... N- 1. Because all computations and communication operations are 
accomplished in constant time, the complexity of this algorithm is optimal (O(M). 
Algorithm Convolve; 
begin 
RI :=0; 
[ initialise output register in all processors 
for p: =-M to M do 
begin 
R2 RO; 
I R2 storesfipj-m in all processors (ij), ij = 0,... N- I 
for q: =-M toMdo 
begin 
in-broadcas t (h-p, -q) 
f C-register stores h-,, -q. in all processors 
RI RI + R2 C; one; one; 
I R1 RI +fi+pj+q - h-p, -q 
in all processors (ij), ij = 0,... N- I 
row-ringshift(R2); 
I calculate new framefi+pj+q+l in R2 in all processors (ij) I 
end; 
column-ringshift(RO) 
calculate new framefi+p+lj_m in RO in all processors (ij) I 
end; 
end; 
Implementation on Systola 1024: The natural way of mapping an image processing problem 
onto a 32 x 32 processor array architecture is to associate each pixel of a 32 x 32 window of 
the image with one processor. However, there are a lot of other possibilities of such mappings. 
As an example, an entire row of 1024 pixels can be mapped onto the 1024 processors of the 
ISA in row-major way. Since it is possible to store more than one pixel in each processor, 
several rows of the image can be processed simultaneously with the additional benefit that 
local operations in the columns of the image take place in one processor (i. e. no routing is 
necessary). 
In order to implement a 2M+1 x 2M+1 convolution the pixels of the image have to be stored 
in the processor array in a way that the locality of the operation corresponds to local routing of 
data on the array. 
The solution that we chose is to store all N pixels of a row of the image in adjacent NI(32-P) 
rows of the ISA, where each processor stores P subsequent pixels. The main reason for this is 
for better support of the row-wise input of image data from the camera. Each processor 
produces also P subsequent of pixels of a row of the output image. Thus, the complete output 
image is produced in N2 / (1024-P) iteration steps. Like in the previous ISA algorithm the 
input image rows are initially loaded, such that the input pixels are shifted by M positions with 
respect to the output pixels (see Figure 5.1.2). 
53 
I fi-2,5 
10 
fi-2,511 fi-2,0 
-* 
fi, 
5 gi, o ... 90 
fi-2,246 
... 
fi, 
253 gi, 248 ... gi, 255 
I fi-2,254 
... 
fi, 
261 gi, 256 -- gi, 263 
I fi_2,502 
*- 
fi, 
509 gi, 504 ... gi, 511 
fi+13,510 fi+13,511 fi+13,0 
... 
fi+13,5 gi+15,0 ... gi+ 15,7 
I fi+13,254 
... 
fi+ 
13,261 gi+15,256 ... gi+15,2631 
fi+ 
13,246 ... 
fi+13,253 
fi+13,502 
*- 
fi+13,509 
gi+15,248 **- gi+15,255 
gi+ 15,504 ... gi+ 15,511 
Figure 5.1.2: Illustration of the initially mapping of two 16 x 512 subimages of the 512 x 512 
input imagef and the a 512 x 512 output image g onto the 32 x 32 ISA of Systola 1024 for P 
= 8. 
In order to reduce routing P should be chosen as high as possible. Of course, P depends on the 
number of registers per processor (32), the pixel size, the coefficient size, and the accumulator 
size. In our implementation 8-bit pixels, 16-bit coefficients, and 32-bit accumulators are used. 
Thus, P is seeded to 8. 
Because of the new mapping of the input image onto the processor array, procedures for 
ringshifting of the input image (row-ringshift and column-ringshift in the 
procedure convolve) have to be modified: 
In contrast to the original ISA algorithm an image row is distributed over K :=N/ (32-P) 
processor rows instead of only one. Thus, the modified procedure 
image_row-ringshif t has to perform routing as illustrated in Fig. 5.1.3: 
processors (ij) with 1 :! ý i :! ý 32 and 1j:! ý 3 1, get a value from their right neighbour 
(iIj+ I). 
processors (i, 32), with I :! -ý i: 5 32 and i#0 mod K receive a value from the first processor 
of the lower row: (i+ 1,1). 
processors (i, 32), with I :! ý, i:! ý 32 and i=0 mod K get a value from the first processor of 
K- 1 upper rows: (i-(K- 1), 1). 
Since each processor stores P subsequent pixels of a row, this procedure needs only to be 
performed on the register storing the rightmost pixel. Taking advantage of the efficient 
parallel prefix computations this operation is accomplished by the following LAISA 
procedure with constant period 0(l): 
procedure image_row-ringshift(reg: register); 
begin(n) 
< set reg, C; one; one >; 
< set CW, C; one; [2.. nl >; 
< set CE, reg; one; [1.. n-11 >; 
I all processors (ij) withj <n receive a value from processor (ij+ 1) 1 
< set CN, C; 
(0 JA (K-1)) A (n div K); [n] >; 
< set cs, reg; (1"(K-1) 
0)^(n div K); [n] >; 
I all processors (ij) with i#0 mod K receive a value from processor 
(i+ 1,1) 1 
< set C, reg; 
(OA (K-1) 1)^(n div K); [n] >; 
54 
I all processors (ij) with i=0 mod K receive a value from processor 
(i+ 1-K, 1) ) 
end; 
Figure 5.1.3: Illustration of the performed routing of the above procedure row ringshift on a 
4x 32 subarray of Systola 1024 for N= 1024, i. e. K=4. (All other 4x 3f-subarrays are 
performing the same routing operation in parallel) 
The modified procedure image_column - ringshift 
is more complex than 
r ow-r i ng shift. This has two reasons. First, the distance between two neighbouring image 
rows in the processor array is K processor rows instead of one. Second, the lower neighbour of 
the image row stored in the lower K processor rows is currently not available in the processor 
array. 
The second problem can be solved by broadcasting the required image row in the northern IPs. 
to the lower K processor rows. All other image rows are shifted K processor rows up (see Fig. 
5.1.4). Since each processor stores P subsequent pixels of a row, this routing has to be done 
for each of the corresponding registers. The following LAISA procedure performs the 
operation for one register by K subsequent column ringshift operation. 
procedure image_column_ringshift(reg: register); 
begin 
< set reg, C; one; one 
for i: =l to K do 
begin 
< in NORTH, CN, C; one; one >; 
< out NORTH, CS, C; [l.. n-11; one >; 
I these two instructions works like a normal column ringshift (set CN, C; set 
CS, C; ) operation with two differences: 
1. Values from the northern lPs are broadcasted to the lowest processor 
row (instead of broadcasting it from the uppest processors) 
2. C-register values of the uppermost processor row are output to the 
northern IPs I 
end; 
< set C, reg; one; one >; 
end; 
Note that our mapping of the input image onto the ISA reduces the time 
for row ringshifting. 
2 
This is a nice property, since row ringshifting is applied O(M 
) times and column ringshifting 
is only applied O(M) times in the procedure c onvo lve. 
55 
The implementation of the multiply-add operation on Systola 1024 benefits from the multiply- 
add feature (see Section 4.2) of the multiplier. Thus, one multiply-add operation requires only 
three processor instructions. 
fi+16, 
j+N/2 
fi+3, 
j+N/2 fij+N/2 
. fi+16j 
I 
fij 
II 
fij 
c= fij C= fij+N/2 
i, j+N/2 
f 
/2 'j 
c= fi,,, 
i 
C= fij+N/2 c= fi+ 
1+ 
+N 
j 
C= fi+lj+N/2 
c= fi+lli c fi+lj+N/2 fl+l'j+N/2 C fi+2, 
j 
C= fi+lj+N/2 fi+2, 
jj +N/ 
C 
--= 
fi+14, 
j 
II 
r->i 
c= fi+14, 
j+N/21 
IC= fi+15, 
j 
fi+14, 
j+N/21 
I 
r--41 
C 
-= 
fi+15, 
j 
IIc= fi+i 
c 
--= 
fi+15, 
j 
I 
r->[C=fi+15, j+N/21 
IC= fi+16, 
j 
fi+15j+N/21 C ý'- fi+16, j 
IIC : -- fi+ I 6, j+N/2 
(1) 
Figure 5.1.4: Illustration of the performed routing in the procedure column-ringshif t in 
one processor column of Systola 1024 for N= 512, i. e. K=2. (All other processor column are 
performing the same routing operation in parallel): (1) illustrates routing performed in the first 
iteration step (i = 1) and (2) illustrates routing performed in the second step (i = 2). 
Table 5.1.1 shows the good runtime behaviour of the introduced routing techniques based on 
parallel prefix computations and local operations: The number of processor instructions for 
computing (multiply-add) exceeds the number of instructions for communication (row 
ringshift, column ringshift, and 1/0) by a high factor if M increases. This results in high 
speedup values compared to the sequential implementation on a PC for large window sizes. 
Data transfers between board RAM and IPs are executed concurrently to the computation. 
While computing subimage i in the processor array, input image data for computing subimage 
41 in the next iteration step is loaded into the northern IPs, and the previously calculated 
subimage i-1 of the output image is transferred from the western EPs into the western board 
RAM. For filter mask sizes larger than 5X5 computing time dominates data transfer time in 
our implementation. Thus, runtimes for 3x3 convolution and 5x5 convolution are equal on 
Systola 1024. 
Many image processing applications perform several neighborhood operations in a row on the 
same image. In these cases, output data need not to 
be transferred to the board RAM at once. 
It can be used as input data for the next operation. 
56 
If the image size is larger than the capacity of the western/northern board RAM, e. g. 2 MEB, it 
is impossible to store the complete input image data on Systola 1024. However, our 
implementation of convolution requires only a small subimage in each iteration step. Thus, it 
is sufficient to load only a subimage from the PC into the board RAM, resp. to store only a 
subimage from the board RAM into the PC. 
Table 5.1.1: Comparison between the performance of the convolution of a 512 x 512 x8 bit 
image on Systola 1024 and a Pentium H 266 for different filter mask sizes. The numbers in 
brackets show the percentual part of processor instructions to perform all multiply-add 
operations in the parallel implementation. 
F3x3 x5 15 x 15 23 x 23 F Systola 1024 16 ms (30 %) 16 ms (44 %) 26 ms (56 %) 95 ms (68 %) 195 ms (82 %) 
Pentium 11266 50 ms 115 ms 260 ms 1.15s 3. Os 
Speedup 3 7 10 12 15 
5.1.2 Wavelet Transform 
Wavelet transform (WT) provides an alternative approach to traditional signal processing 
techniques such as Fourier analysis for breaking a signal up into its constituent parts. The 
driving impetus behind WT is their property of being localised in time (space) as well as scale 
(frequency). This provides a time-scale map of a signal, enabling the extraction of features 
that vary in time. WTs have become prominent in many areas such as signal and image 
processing, medical imaging, data compression, to name only a few [5,11 ]. 
The booming of practical applications of WTs is greatly attributed to the discovery of a fast 
(sequential) algorithm, namely Mallat algorithm [54], for computing a discrete wavelet 
transform (DWT). This algorithm does a K-scale DWT with a time complexity of 
O(K- M. N) in 2-D, where M is the size of the used filter mask and N2 is the size of the 
image. Problems from the areas of the applications mentioned above are typically large and 
the DWTs can be very time-consuming although the algorithmic complexity is proportional to 
the problem size. The parallelization of Mallat algorithm provides one way to speed up the 
computation of DWTs. 
The computational structure of a 2-D Mallat algorithm is quite simple. The basic building 
block consists of convolutions and subsequent downsamplings at half rate. The algorithm runs 
over a finite number of scales with a number of filter masks. In the simplest yet most common 
case where wavelets are considered separable in x and y directions, two filter masks are 
employed. The choice of the filter coefficients and the window size comes from wavelet 
theory, see e. g. [ 13 ] for a good explanation. 
Figure 5.1.5 illustrates a single 2-D forward WT of an image, which is accomplished by two 
separate 1-D transforms. The image f is first convolved along the x 
dimension, resulting in a 
lowpass image t and a highpass image 
/I. Since the bandwidth of I and r along the x 
dimension is now half that off, we can safely downsample each of the 
filtered images in the x 
dimension by 2 without loss of information. The downsampling is simply accomplished 
by 
dropping every second filtered value. 
L 
Both/ and)"' are then convolved along the y 
dimension, resulting in four subimages: JL '_ýH 
P, and P. Once again, we can 
downsample the subimages by 2, this time along the y 
dimension. As illustrated in Figure 5.1.5, the 2-D convolution 
decomposes an image into an 
57 
average signal VL ) and three detail signals which are directionally sensitive: f- H emphasises 
the horizontal image features, P the vertical features, and r the diagonal features. The 
directional sensitivity of the detail signal is an artefact of the frequency ranges they contain. 
This transformation is applied recursively on the average signal up to the coarsest level, i. e. 
the overall average is represented by a single coefficient. However, in practical applications 
typically no more than 4-7 levels are computed. The DWT is fully invertable by the inverse 
WT. 
IýU 
Figure 5.1.5: Block diagram of the 2-D forward DWT. 
ISA algorithm: To adapt the ISA 2M+I x 2M+I convolution algorithm for WT, we have to 
consider two separable convolution masks, i. e. two vectors h, and gp for p M,... M. 
When the convolution mask is separable into a row and a column vector, moving of data to 
compute each frame of image for superposition reduces to separate row ringshifts and column 
ringshifts. In the row convolution, there are 2M row ringshifts. Similarly, there are 2M column 
ringshifts in the column convolution. Note that the number of filter coefficients to be 
broadcasted is now 4M. This means that the time complexity of the ISA convolution 
algorithm is reduced to O(M for separable filter masks. 
After the convolution at one scale, only the processors on both even-indexed columns and 
even-indexed rows hold the valid results for the convolution at next scale. Conceptually a 
subsampling at half rate along the rows and columns will accomplish this decimation. On an 
ISA, we can simply mask off invalid processors by selectors. However, this makes valid 
processors not to be nearest neighbours at the next scale. When the scale increases further, the 
valid processors will be even further apart. 
This will make convolutions at higher scales more 
58 
time consuming on data shifts between valid processors and complicate the convolution 
algorithm. An alternative is to compact the valid data to processors on the northwest corner 
after convolutions on scale 1. At scale 2 we work on only a quarter of the mesh on the 
northwest comer. The valid processors are again the nearest neighbours and the convolution 
algorithm works equally efficiently as at scale 1. This procedure repeats until the K-scale 
DWT is done. 
Assume an NxN input image f is to be processed by an NxN ISA, and the filter masks f and 
g have length 2M+I. In each processor (ij), 0:! ý iqj:! ý N-1, the following internal registers are 
utilised: 
RO: holds initially the pixelfi-mj-m 
R1, R2, R3, R4: hold convolved pixels jLLij, jW ij, 
P 
i J, 
r 
ij 
0 R5, R6: hold temporary values 
Procedures row - ringshift, column - ringshif 
t, and in_broadcast are defined as 
in the previous section. Procedure reset - working_grid 
(m, n) enables a processor 
subarray of m rows and n columns on the northwest comer and disables all other processors. 
Procedure store-result (level) stores the results after the scale level of the DWT in 
a manner as shown in Figure 5.1.6. in RO. 
JIL(k) 
... 
fLH(2) 
J 
)HL(2) )HH(2) 
)IHH(l) 
Figure 5.1.6: Multiresolution image data storage after k-scale DWT 
Algorithm Wavelet, 
begin 
for k: =1 to K do 
begin 
k-1 k-1) 
reset - working-grid(m/2 
n/2 
Rl: =O; R2: =O; R3: =O; R4: =O; R5: =O; R6: =O; 
for p: =-M to M do row convolution 
begin 
in 
- 
broadcast (h-p) 
R5: =R5 + RO*C; 
JR5: =R5+h-p- fi+p, j-min all processors (ij), ij 0,... N-11 
in_broadcast (g-p) ; 
59 
R6: =R6 + RO*C; 
I R6 := R6 + g- p- fip, j- m in all processors (ij) 
row-ringshift(RO); 
end; 
I R5 holdstij- m and R6 holdsrij- m in all processors (ij) 
for P: =-M toMdo column convolution 
begin 
in-broadcast (h-p) 
RI: =R1 + R5*C; 
IR1: =RI+ h- p- 
fj, 
j+p in all processors (ij) 
R3: =R3 + R6*C; 
I R3: = R3 + h-p 
column_ringshift(R5); 
in_broadcast (g-p) 
R2: =R2 + R5*C; 
I R2: = R2 + g-p 
R4: =R4 + R6*C; 
)"j, p, min all processors (ij) I 
. 
ýj+ýp, 
min all processors (ij) I 
I R4 := R4 + g- p-r,, P, j- m in all processors (ij) 
column-ringshift(R6); 
end; 
IR1 holds jLLi, j, R2 holdspi, j, R3 holdsp,,,., R4 holdsri, j 
store-result(k); 
end; 
end; 
The time complexity of Wavelet is divided into computational and communication 
complexity. It is easy to see the computational complexity is O(M. K). The communication 
complexity in convolution has the same order. It is well balanced with the computation. Extra 
communication is added by rearranging the data to the northwest corner after convolution of 
each scale. This operation accounts for O(N). Therefore the total time complexity of 
Wave 1et is O(M. K+N). When K equals N and L is much smaller than N, Wave 1et has the 
cost O(N). 
Implementation on Systola 1024: In the above ISA algorithm only the processors on both 
even-indexed rows and even-indexed columns produce valid results. Computing time can be 
saved by producing only valid results. This can be achieved by assigning the activity of a2x2 
processor subarray to a single processor: Each processor holds a2x2 window of the input 
image and produces only the corresponding valid value of each output image, dropping 
unnecessary computations for invalid values. 
In order to reduce routing, the considered neighbourhood in each processor can be enlarged 
further, e. g. holding an N/nxN/n subimage in each processor of an nxn ISA. On Systola 
1024 this size is limited by the number of internal registers per processor. Thus, the window 
size is seeded to 2x4. All pixels of a row of the image are distributed over adjacent processor 
rows like in the implementation of convolution. Thus, routing of the 
image can be performed 
by the procedures r ow-r i ng shift and co1 umn-r i ng shift of the previous section. 
At scale 2 of the DWT the ISA algorithm only works on a quarter of the processor mesh on 
the northwest comer, and so on. On Systola 1024 the result of each scale 
is output to the board 
60 
RAM. The average signal is then input for the computations of the next scale, using again the 
complete processor array. 
We implemented the 4-scale DWT of a 512 x 512 image on Systola 1024 for different kernel 
lengths. This results in speedup values over the sequential implementation on a PC as shown 
in Table 5.1.2. For kernel sizes larger than 7 computing time also dominates the data transfer 
time between board RAM and IPs. The ratio between number of instructions for computing 
and communications is worse than in the previous section. This is due to the fact that because 
of separability the more efficient procedure row-ringshif t is only applied O(M times to 
perform routing, instead of O(A42) times in the implementation of arbitrary convolution. 
Table 5.1.2: Comparison between the performance of the convolution of a 512 x 512 x8 bit 
image on Systola 1024 and a Pentium Id 266 for different kernel lengths. The numbers in 
brackets show the percentual part of processor instructions to perform all multiply-add 
operations in the parallel implementation 
1 3 ---] F- 57 JE 15 IF 23 
[-Systola 1024 20 ms (22 %) 20 ms (29 %) 20 ms (32 %) 30 ms (38 %) 45 ms (40 %) 
I Pentium Il 266 45 ms 70 ms lloms 240 ms 400 ms 
,I 
Speedup 2 3.5 5.5 89 
5.1.3 Morphological Processing 
The term morphology originally comes from the study of forms of plants and animals. In the 
context of image processing we mean the study of topology or structure of objects from their 
images. Morphological operations can be applied for image analysis in applications requiring 
defect or object shape recognition, such as robotic vision and diagnostic medical imaging 
[891. 
Although they do not involve multiply-add operations, the very high data rates required by 
real-time applications can be achieved only by employing techniques like parallel processing 
or pipelining. In this section we will derive efficient instruction systolic algorithms for the 
most important morphological operations for binary and greyscale images. 
Most morphological operations can be defined in terms of two basic operations: erosion and 
dilation. Both need a structural element. This is a small image segment, that is moved over 
the binary image such that a predefined reference point of the structural element is 
subsequently aligned with every pixel of the image. 
The erosion matches the structural element in the image, i. e. if all pixels of the structural 
element are set in the image, the corresponding pixel in the erosion is set too. Suppose the 
binary object X and the structuring element B are represented as sets in 2D Euclidean space. 
Let Bx denote the translation of B so that its reference point is located at x. Erosion is then 
defined as 
Erosion: XOB: =Ix. -B, (--Xl 
In order to produce a parallel implementation the computations 
have to be done in a matrix 
oriented way. The erosion of an 
image can also be defined using a binary convolution with 
logical AND operations: 
61 
n 
nhn, 
m 
A fi-n, j-m 
n=-M m=-M 
where f is a binary NxN image and h is 
22 complexity of erosion is O(N -M 
for i, j=0,..., N-I 
a (2M+ I)x(2M+ 1) structuring element. The 
The dilation checks whether there is an intersection of the structural element with the image. 
A pixel of the dilation is set if and only if the structural element an the image have at least one 
pixel in common: 
Dilation: X@B: =Ix: Bxr)X#01 
The matrix orientated definition of dilation for a binary image f and a structuring element h is 
given by 
m" 
U Uh 
n, m 
A fi-n, j-m 
n=-M m=-M 
for i, j=0,..., N-1 
NNW 
MEN 
mom: 
(a) 
(c) 
L 1 unwä 
t 
Bauzaun 
+ 
--Roman 1 »ZUM 11 1 1 
(d) 
Figure 5.1-7: Erosion (c) and dilation (d) of the image (a) with the structuring element (b). 
Figure 5.1.7 gives an example of erosion and dilation. Although these two basic operations are 
very simple and cheap to be implemented they provide an extremely powerful toolbox for 
image processing applications. More complex operations to work on the forms of objects can 
be developed by combinations of the elementary operations erosion and dilation, e. g. opening, 
62 
closing, and skeleton. Opening dilates an image after erosion with the same structuring 
element. Closing performs the same operations in reversed order. The skeleton of a binary 
object is defined as the set of points whose distance from the boundary is locally maximal. In 
the context of basic morphological operations the skeleton can be computed in the following 
way: 
n. n.. 
Skeleton: us, (x)=U[(X(S)nG)-(X(&nG)G 
n=O n=O 
where G denotes the 3x3 square structuring element (see Fig. 5.1.7 (b)), the operation (X & 
nG) denotes the nh iteration (X 0 G) &G0... & G, (X & nG)G is the opening of (X 0 nG) 
with respect to G, and nax is the max size after which X erodes down to an empty set. 
Morphological operations can be easily extended to greyscale images. The erosion/dilation is 
then defined as the minimum/maximum grey value of the set of pixels defined by the 
structuring element. 
ISA algorithm: The ISA algorithm for convolution can be easily adapted for the binary and 
greyscale erosion/dilation on an NxN ISA. The multiply-add operations have only to be 
replaced by the corresponding minimum/maximum operations. Thus, the optimal speedup of 
N2 is also achieved. 
Implementation on Systola 1024: Like mentioned above the implementation of greyscale 
erosion/dilation can be efficiently performed by the Systola program for convolution, where 
the multiply-add operations are replaced by minimum/maximum operations. Since each 
minimum/maximum operation requires only two processor instructions (a subtraction 
followed by a conditional statement) - instead of three instructions for each multiply-add 
operation - greyscale erosion/dilation is a little bit faster than convolution on Systola 1024 
(see Table 5.1.3). 
A more efficient implementation of binary erosion/dilation is possible, if it is parallelised at 
bitlevel. For a 512 x 512 image, each processor can hold a 16 x 16 subimage in 16 internal 
registers, where each register holds 16 subsequent pixels of a row. Thus, computations can be 
efficiently performed by AND/OR-operations between a register calculating output image 
pixels and a register holding corresponding input image pixels for 16 pixels in parallel in each 
processor. 
Again, we have to consider a superposition of (2M+ 1)2 frames of the input image. But in 
contrast to the Systola implementation of convolution, column ringshifting is now more 
efficient then row ringshifting. This is because pixels are held rowwise in each register. Thus, 
an image column ringshift is simply performed by the column ringshift of the register holding 
the actually uppermost image row pixels in each processor (requires 4 instructions using the 
procedure column-ringshif t (register) of Section 5.1.1), while an image row 
ringshift has to perform the same operation along the processor rows using the pixels of the 
leftmost image column in each processor. Since these pixels are stored in 16 different 
registers, this operation is more complex. It requires 66 instructions. 
Since the Systola implementation of convolution reduces the time for row ringshifting, it 
applies row ringshift O(A42) times and column ringshift only O(M times. 
Because this is 
contrary in binary erosion/dilation, we swap the two routing procedures. 
Thus, column 
ringshift is applied O(A42) times and row ringshift only 
O(M times. As shown in Table 5.1.3 
63 
this results in a good runtime behaviour, since the number of processor instructions for 
computing exceeds the number of instructions for communication if M increases. 
Because the two basic binary morphological operations are very cheap, their single 
computation on Systola 1024 is dominated by the data transfer time between board RAM and 
IPs (2 ms for a 512 x 512 image). Of course, most applications of morphological processing 
are performing many operations in a row, e. g. skeleton. The presented implementation of 
binary erosion/dilation holds the complete input and output image in the processor array. 
Thus, several morphological operations can be performed on the same image without 
additional data transfer time. In Table 5.1.3 the parallel program is compared to a sequential 
implementation on a PC. Because we assume that several morphological operations are 
performed successively data transfer times are not included. 
Table 5.1.3: Comparison between the performance of binary and greyscale erosion of 512 x 
512 image on Systola 1024 and a Pentium R 266 for different window sizes. Note that no data 
transfer times are included for the runtimes of binary erosion on Systola 1024. The numbers in 
brackets show the percentual part of processor instructions to perform all minimum/maximurn 
operations in the parallel implementation. 
3x3 --I 1 l7; 7 [ 15 23 x 23 
Systola Binary 0.25 ms (28 %) 0.7 ms (48 %) 2.3 ms (60 %) 5.1 ms (63 %) 
L 
1024 j Greyscale 16 ms (23 %) 20 ms (46 %) 70 ms (60 %) 130 ms(75 %) 
Pentium Binary 2 ms 9 ms 44 ms 106 ms 
Il 266 Greyscale 35 ms 190 ms 870 ms 2. Os 
Speedup Binary 8 13 19 21 
., Ila Greyscale 2 9.5 12 15 
5.1.4 Median Filter 
Median filtering is the process of replacing each pixel of a given image by the median of the 
pixels contained in a window centred around that pixel. This filtering is useful in removing 
isolated lines or pixels while preserving spatial resolution. More specifically, let fij be a 
grayscale input of size NxN and let the window be of size 2M+ 1x 2M+ 1. The filtered image 
fij is defined by 
med ij = median I fi- , j-,,, 
I-M:! ý n, m: ý M and 0:! ý i- nj- m:! ý N- II 
Figure 5.1.8 illustrates how the median filter works. 
64 
sorted list 
i 
I 
39 33 132 
mdr. ý 
36 31 
35 34 1 37 36 33 34 
34 
1 
33 1 98 36 
1 34 32 
321 361 32 351 36 35 
43 34 29 44 47 41 
33 31 36 34 31 32 
input 
I 
output 
Figure 5.1.8: Illustration of the principle of a3x3 median filter 
The straightforward approach to solving this problem on the ISA is to compute the median independently for each output pixel. Specifically, to produce one output pixel in each 
processor in parallel, each processor first accumulates the (2M+ 1)2 pixels necessary to 
compute the median. The median of these values is then computed by an appropriate 
sequential algorithm. 
There is, however, a more efficient procedure for median filtering. This procedure is based on 
the observation that the window surrounding pixel (ij) and the window surrounding pixel 
(i, j+]) differ only by 2-(2M+l) pixels. This immediately suggests that we should first sort the 
common (2M+ 1)2 _ (2M+I) values. After pre-sorting of these values the median 2. (2M+I) 
elements remain as candidates for the medians of each of the two overlapping 2M+1 x 2M+I 
windows. The two medians are computed by performing a merge operation between the 
2. (2M+1) candidates and the (2M+l) elements not considered so far of each window. 
This procedure can be further improved by using akxk, k ý! 2, neighbourhood instead of only 
two pixels. Of course, k is limited by the number of registers in each processor, e. g. for aIIx 
11 median and k=2,81 values have to be pre-sorted, which is impossible on Systola 1024. 
Thus, the performance of the parallel implementation on Systola 1024, using k=2, is 
compared in Table 5.1.4 with a sequential implementation on a PC only for small window 
sizes up to 7x7. 
Table 5.1.4: Comparison between the performance of median filtering of a 512 x 512 x8 bit 
image on Systola 1024 and a Pentium 11 266. The numbers in brackets are the ratio between 
the processor instructions for computing and communication of the parallel implementation. 
1 
3X3 5x-5--][ 7X7 
Systola 1024 16 ms (35 %) 22 ms (40 %) 48 ms (67 %) 
Pent um Il 266 
Speedup 
2 ms 
4 
145 ms 320 ms 
77 
i 
s, k9 33 32 35 36 31 
35 "% 37 36 33 34 
34 33 36 34 _ 32 
32 36 32 35 
1 36 35 
43 
- 
34 29 44 47 41 
P ý 31 36, 34 31 32 
65 
5.1.5 Improvements of Systola 4096 to increase the Efficiency of Neighbourhood 
Operations 
Beside the four times higher processor number and its two times higher cycling frequency the 
following four improvements of the architecture of Systola 4096 will additionally increase the 
efficiency of the implementation of neighbourhood operations further with respect to the 
conventional Systola 1024 architecture. 
More internal registers: Each processor of Systola 4096 has 64 internal registers instead 
of 32. Thus, each processor can store more subsequent pixels of a row. This has the 
benefit that more local operations in the columns of the image take place in one processor 
(i. e. less routing is necessary). 
Larger bus bandwidths: For neighbourhood operations with small window sizes, e. g. 3x 
3, data transfer time between board RAM and EPs dominates computing time. Because the 
maximal bandwidth of the RAM - IP bus is increased to 800 MB/s on Systola 4096, the 
data transfer time is reduced significantly. 
Faster 1/0 between IN and processor array: A high number of instructions is required 
for the 1/0 between IPs and processor array, e. g. nearly 50% of all instructions in the 
Systola 1024 program for 3x3 convolution. Due to the modified 1/0 architecture of 
Systola 4096 (see Section 3.4.1), the input of a 64 x 64 matrix of data items requires only 
16 instructions and the output 32 instructions. This leads to a speedup of a factor 4 
compared to the conventional architecture of a 64 x 64 ISA. 
66 
5.2 Fourier Transform 
5.2.1 Discrete Fourier Transform 
The Fourier Transform is an important image processing tool which is used to decompose an 
image into its sine and cosine frequencies. The output of the transformation represents the 
image in the Fourier or frequency space, while the input image is the spatial space equivalent. 
In the Fourier space image, each point represents a particular frequency contained in the 
spatial domain image. The Fourier Transform is used in a wide range of applications, such as 
image analysis, image filtering, image reconstruction and image compression [7]. 
As we are only concerned with digital images, we will restrict this discussion to the Discrete 
Fourier Transform (DFT). 
The DFT is the sampled continuous Fourier Transform and therefore does not contain all 
frequencies forming an image, but only a set of samples which is large enough to approximate 
the spatial domain image. The number of frequencies corresponds to the number of pixels in 
the spatial domain image, i. e. the image in the spatial and Fourier space are of the same size. 
For a square image of size NxN, the two-dimensional DFT is given by: 
i N-1 N-1 
Wk. mWI-n F(k, 1) -= - 
11: f(m, n) xT2 NN N m=o n=O 
i. 2. z - where WN= exp - 
land 
i=V I 
N) 
whereftm, n) is the image in the spatial domain and the exponential term is the basis function 
corresponding to each point F(k, l) in the Fourier space. The equation can be interpreted as: the 
value of each point F(kl) is obtained by multiplying each point of the input image with the 
corresponding base function and summing the result. 
The Fourier Transform produces a complex number valued output image. The basis functions 
( wk-m AAh jv )of the DFT are the complex roots of unity (a complex number w is said to be an 
Mhcomplex root of unity if wN= 1). Because of Euler's Formula (exp(i. x) = cos(x) + i-sin(x)) 
the real and imaginary parts of the basis functions are sine and cosine waves with increasing 
frequencies, i. e. F(0,0) represents the DC-component of the image which corresponds to the 
average brightness and F(N-1, N-1) represents the highest frequency. 
The Fourier domain image has a much greater range than the image in the spatial domain. 
Hence, to be sufficiently accurate, its values are usually calculated and stored in float values. 
In a similar way, the Fourier image can be re-transformed to the spatial domain. The inverse 
Fourier transform is given by: 
N-1 N-1 
f(m, n)=l IF(k, 1)W, -k- 
w. -1- 
k=O l=O 
The inverse DFT can be calculated by complex conjugation of the input values, adding on the 
introduced forward FFT algorithm without dividing by N, and complex conjugation of the 
67 
output values. This is achieved by the complex conjugation of both sides of the above 
equation (complex conjugation is denoted by *): 
N-1 N-1 
k-mw-l-n 
N-1 N-1 
k. m -1-n 
ý_, 
(F(k, 1)WN N f*(m, n)=l 2: F* (k, 
1»(WN (WN- 
k=O l=O k=O l=O 
N-1 N-1 
-1. n 
«pý 
1. n WNI n YF*(k, 1)WNk because (W, 2: *, wý 
1) N 
k=O l=O 
A Fourier transformed image can be used for frequency filtering. This operation takes an 
image and a filter function in the Fourier domain. This input image is then multiplied with the 
filter function in a point by point fashion: 
F(kl) - H(kl) 
where F(k, l) is the input image in the Fourier domain, H(k, l) the filter function, and G(k, I) is 
the filtered image. To obtain the resulting image in the spatial domain, G(kI) has to be re- 
transformed using the inverse Fourier Transform. 
Since the multiplication in the Fourier Space is identical to convolution (see Section 5.1.1) in 
the spatial domain, all frequency filters can in theory be implemented as a spatial filter. 
However, in practice, the Fourier domain filter function can only be approximated by the 
filtering mask in the spatial domain. 
The form of the filter function determines the effects of the operator. There are basically three 
different kinds of filters: lowpass, highpass and bandpass filters. A low-pass filter attenuates 
high frequencies and retains low frequencies unchanged. The result in the spatial domain is 
equivalent to that of a smoothing filter - as the blocked high frequencies correspond to sharp 
intensity changes, i. e. to the fine-scale details and noise in the spatial domain image. 
A highpass filter, on the other hand, yields edge enhancement or edge detection in the spatial 
domain, because edges contain many high frequencies. Areas of rather constant greylevel 
consist of mainly low frequencies and are therefore suppressed. An example of highpass 
filtering is shown in Figure 5.2.1. 
A bandpass attenuates very low and very high frequencies, but retains a middle-range band of 
frequencies. Bandpass filtering can be used to enhance edges (suppressing low frequencies) 
while reducing the noise at the same time (attenuating high frequencies). 
68 
(a) 
(c) 
(b) 
(d) 
Figure 5.2.1: Highpass filtering in the Fourier domain: (a) Original input image, (b) Fourier 
transformed image, (c) Fourier transformed image after multiplication with a highpass filter (higher frequencies away from the centre are kept, while lower ones are masked out) (d) Inverse Fourier Transform of (c). 
5.2.2 Fast Fourier Transform 
To obtain the result of the Fourier Transform, a double sum has to be calculated for each 
image point. Each point in the transformed image requires N2 complex multiplications and 
N2-1 complex additions (not counting the calculation of the basis functions). In total we need 
N4 complex multiplications and N2(N2-1) complex additions. This adds up to 8-N4 floating 
point operations. For a 512 x 512 image, this results in 5.1011 operations. A single DFT of a 
512 x 512 with 5.1011 operations would require about 10000 s or 3h on a Pentium 11 266, 
much too slow to be of any relevance for practical applications. 
However, because the Fourier Transform is separable, it can be written as 
I N-1 1 N-1 
WI-n 
JWI. 
m 
-I -I: f(m, n) N N m=o 
Nn=O 
Using this formula, the spatial domain image is first transformed into an intermediate image 
using N one-dimensional Fourier Transforms of each image row. This intermediate image is 
then transformed into the final image, again using N one-dimensional Fourier Transforms of 
each image column. Expressing the two-dimensional Fourier Transform in terms of a series of 
IN one-dimensional transforms decreases the number of required computations to O(N3). 
Even with these computational savings, the ordinary one-dimensional DFT has N2 
complexity. This can be reduced to O(N-log2N) if we employ the one dimensional Fast 
69 
Fourier Transform (FFT) to compute the one dimension DFTs [1,121. This is a significant 
improvement, in particular for large images. 
The key idea behind the FFT algorithm is to use the divide and conquer paradigm: The FFT 
of a vector of N= 2' points consists of n stages. In each stage the calculation of the transform 
for k points is recursively reduced to two calculations of length k12. In detail, this algorithm 
works as follows: 
The transformation of a vectorj(n) of length N is split into two transformations of length N12 
over the even and odd sampling points: 
N-1 
Wn*k F(k) f (n) N 
n=O 
N12-1 
f (2n)W2nk + N 
n=O 
N/ 2-1 
+ I)W(2n+l)k Y, f(2n N 
n=O 
N12-1 N/ 2-1 
If (2n)W k+ Wk ; /2 N If (2n + 1)Wnk N12 
n=O n=O 
Both sums constitute a DFT of length N12. The second sum is multiplied by a base function 
which depends only on the wave number k. 
The operations necessary to combine the partial Fourier Transforms are just one complex 
multiplication and addition, i. e. O(N). Some more detailed considerations are necessary, 
however, since the DFT over the half-sized vectors only yields N12 values. In order to see 
how the composition of the N values works, we study the values for k from 0 to N12-1 and 
N12 to N- I separately. 
The partial transformations over the even and odd sampling points are abbreviated by F(k) 
and F(k), respectively. For the first part we can just take the partitioning as expressed above. 
For the second part, k'= k+ N12, only the base function changes. Because of Euler's Formula, 
the addition of N12 results in a change of sign: 
k+N12 k 
- -Wý 
Making use of this symmetry we can write 
F(k) =Fe (k) + WkF'(k) O: fý, k:! ý N/2 
F(k + N12) =Fe (k) - 
WNkFo (k) 
The DFT for the indices k and k+ N12 only differ by the sign of the second term. Thus, for the 
composition of two terms we only need one complex multiplication. The partitioning is now 
applied recursively. The two transformations of the vectors of length N12 are parted again into 
two transformations each. We obtain similar expressions as in (*) with the only difference 
N12 . The even and odd being that the base function before the second sum has doubled to 
Wk 
parts of the even contain the points 10,4,8, ..., 
N12-41 and 12,6,10, ..., N/2-2 
70 
In the last step, we decompose a vector with two elements into two vectors with one element. Since the DFT of a single element vector is an identical operation, no further calculations are 
necessary. After the decomposition is complete we can use (**) recursively with appropriate 
base functions to compose the original vector step by step in the inverse order. 
F(O) 
F(1) 
F(2) 
F(3) 
F(4) 
F(5) 
F(6) 
F(7) 
Figure 5.2.2: Signal flow diagram for the FFT algorithm for N=8. 
All steps of the FFT algorithm are shown in Figure 5.2.2 for N=8. The left half of the 
diagram shows the decomposition steps. The first column contains the original vector, the 
second the result of the first decomposition step into two vectors. The vectors with the even 
and odd elements are put in the lower and upper halves, respectively. This decomposition is 
continued until we obtain vectors with one element. 
As a result of the decomposition, the elements of the vectors are arranged in a new order. That 
is all that is performed in the decomposition steps. No computations are required. We can 
easily understand the new ordering scheme if we present the indices of the vector in binary 
notation. In the first decomposition step we order the elements according to the least 
significant bit, first the even elements (least significant is zero), then the odd elements (least 
significant bit is one). With each further decomposition step, the bit that governs the sorting is 
shifted one place to the left. In the end, we obtain a sorting in which the ordering of the bits is 
completely reversed. The element with index I ..: (0002, for example, will be at the position 4 
ý (100)2, and vice versa. Consequently, the chain of decomposition steps can be performed 
with one permutation operation, by interchanging the elements at the normal and bit-reversed 
positions. 
This reordering is known as bit reversal permutation p,: N -> N. It can be expressed as 
follows: 
Let k be a natural number in binary representation of length n, i. e. k= (xO + (x, 2 +_+ 
(Xn-I 
2 n-I 'OýE_= 
10,11. 
n-I 
The bit reversal of k is p,, (k) =a n-i+a n- 2-2... +a o2. 
71 
Further steps on the right side of the signal flow diagram show the stepwise composition to 
vectors of double size. Thus, the complete ID-FFT algorithm can be divided into the 
following three parts: 
Algorithm 5.2.1: ID-FFT 
1. The input values J(k), 0 :! ý k :! ý N- 1, N=2n, are sorted in vector a according to the bit 
reversal function Pn. After this step holds a[p,, (k)] =J(k). 
2. The following algorithm is processed on the vector a: 
for M : 7- 1 to 1092 N do 
begin 
fork := Oto2'1 -1 do 
begin 
e exp (-27cik/2m) 
for r0 to N/2m-1 do 
begin 
u a[2 m r+k] 
v := a[2 M. r+k+2m-l I *e; 
a[2m. r+k] := u+v; 
a[2m. r+k+2m-ll := U-V; 
end; 
end; 
end; 
3. At the end of step 2, vector a contains the values: a[k] =N- F(k), 0 :! ý k:! ý N-1. Finally, 
every value must be divided by N. After that, a contains the complex values of the FFT 
F(k) in natural order: a[k] = F(k). 
As explained in Section 5.2.1, the inverse 1D-DFT can be realised by complex conjugation of 
the input values, adding on the introduced forward FFT algorithm without dividing by N (Step 
3 in Algorithm 5.2.1), and complex conjugation of the output values. 
5.2.3 ISA algorithm 
The implementation of the 2D-FFT of an NxN, N= 2", image on an NxN ISA consists of 
two subsequent passes: first all image rows are transformed,, using a 1D-FFT along each 
processor row, and then all columns are transformed, using a 1D-FFT along each processor 
column. 
For the implementation of the 1D-FFT, Algorithm 5.2.1 is parallelised on the ISA. 
First all rows and columns of the input image flm, n), 0 :! ý i, j :! ý N- 1, have to be permuted 
according to the bit reversal permutation (see Step I in Algorithm 5.2.1). Because we want to 
assign each input image pixel to a unique processor, processor (ij), 0 :! ý i, j :! ý N- 1, has to store 
the pixelflPn(i), Pn(i))- 
Loading pixel (ij) into processor (ij), for all 0 :! ý i, j :! ý N- 1, and performing this permutation 
within the processor array would require a very complex routing procedure. This time 
consuming operation can be avoided by taking advantage of the 1/0 concept of the ISA as 
follows. 
72 
1. transfer image row p, (i) into the western EP i, for all 0 :! ý i :! ý N- 1. (performs the bit 
reversal permutation within each column) 
2. load image row p, (i) into processor row i from the western EP i, for all 0 :! ý i :! ý N-1, by 
performing the bit reversal permutation within each processor row. 
The second step can be accomplished very efficiently by the procedure ff t-input (see 
below): Column i of each image row p, (i) is broadcasted from the western EP i to all 
processors of the row i, 0:! ý i:! ý N-1 (using the procedure in - 
broadcast 
- complex). Afterwards, only processors in column p (i)= bit-reversal (i, n) store the input 
image data. Thus, after N iteration steps , processor (ij) stores J(p, (i), p, (j)), for all 0j 
:! ý- N- 1. 
Because input image pixels are complex numbers, operands of type complex has to be 
used (temp and pixel in procedure ff t_input). These operands occupy more than 
one register. Thus, the procedure in - 
broadcast_complex is a very fast procedure if 
the registers chosen for complex numbers are CR-registers, i. e. it can be written into the 
C-register and the corresponding internal register at the same time (see also Section 3.4.2). 
Taking advantage of the efficient row broadcast operation and the CR-registers 
in-broadcast-complex can be implemented with only m instructions: 
for j: =O to m-1 do CR[j]: =C[WESTI; one; one >; 
(where the complex number operand consists of the m CR-registers R[0 
procedure fft_input; 
begin(n) 
for i: =O to N-1 do 
begin 
in-broadcast-complex(WEST, temp); 
f broadcasts column i of each image row Pn(i) from the western IP i to 
all processors of the row iI 
< pixel: =temp; one; bit - reversal(i, n) 
>; 
f only processors in column p, (i) = bit_reversal (i, n) store the 
input image data in pixel. Finally, processor (ij) storesj(pji), p, (j)) 
in pixel, for all 0:! ý i, j:! ý N-1 I 
end; 
end; 
Fig. 5.2.3 shows how the introduced input procedure works on an 8x8 ISA. 
The computation of the 1D-FFT consists of n stages. In each stage, a complex multiplication 
with some primitive root of unity, a complex addition, and a complex subtraction is 
performed at each of the N12 point pairs (see Step 2 in Algorithm 5.2.1). 
The complex roots of unity, exp (-2 IC ik/ 2') in Algorithm 5.2.1, are known 
in advance. 
Thus, they are precomputed and are loaded into the processor array when needed. This 
load 
operation takes only a small constant number of instructions cycles since the values can 
be 
loaded using the efficient column/row broadcast operation (column 
broadcast for row FFT; 
row broadcast for the column FFT). 
73 
IN west 8x8- ISA 
(A) 
(B) f(0,0) t--i f(0,4) rl f(0,2) rl f(0,6) rl f(0,1) rl f(0,5) 
f(4, O) ý-j f(4,4) " f(4,2) " f(4,6) " f(4,1) r-1 f(4,5) 
f(2,0) ý-j f(2,4) rl f(2,2) rl f(2,6) " f(2,1) " f(2,5) 
f(6, O) ý--j f(6,4) rl f(6,2) rl f(6,6) rl f(6,1) r-1 f(6,5) 
f(1,0) ý-j f(1,4) " f(1,2) rl f(1,6) rl f(l, I) rl f(1,5) 
f(5, O) H f(5,4) ý-j f(5,2) [I f(5,6) rl f(5,1) rl f(5,5) 
f(3, O) H f(3,4) ýj f(3,2) " f(3,6) [I f(3,1) rl f(3,5) 
f(7, O) H f(7,4) " f(7,2) rl f(7,6) rl f(7, I) rl f(7,5) 
f(0,3) rl f(0,7) 
f(4 , 3) f(4 , 7) 
f(2,3) f(2 , 7) 
f(6 , 3) f(6 , 7) 
f(1 , 3) f(1,7) 
f(5,3) f(5,7) 
f(3,3) rl f(3,7) 
f(7,3) ý-j f(7,7) 
Figure 5.2.3: The bit reversal permutation of all rows and columns of an 8x8 image J(ij), 0 
:! ý ij:! ý 7, on an 8x8 ISA. (A) All columns are permuted by transferring column PO) 
into the 
western IP i, for all 0 :! ý i :! ý 7 (Step I in the above algorithm). 
(B) All rows are permuted by 
loading image row P80) into processor row i from the western EP i, for all 0: ý i:! ý 7 
(Step 2 in 
the above algorithm). 
74 
The implementation of the exchange operation between every two stages involves routing of data (see Figure 5.2.4). This is performed within each row/column of the ISA by 
interchanging the two halves of subsequences of length 2' in stage m, I :! ý m :! ý n. The 
interchanging of the C-register contents of two halves of m processors is implemented m12 
ringshift operations. 
The following algorithm computes the 2D-FFT on the ISA as described above. 
Algorithm Transform(var pixel: complex); 
I input pixels are assigned one pixel per processor in pixel in bit 
reversed order (as described above) 
var temp, coef: complex; 
procedure row-fft; 
local procedure to compute I D-FFT along each processor row 
begin(n) 
for m: =1 to n do 
begin 
in-broadcast_complex(NORTH, coef); 
I complex root of unity exp(-2ni(k mod 2'-')/2') is broadcasted from 
the northern IP k into all processors of column k for all 0 k:! ý N- I in 
parallel I 
pixel: = complex-mult (pixel, coef, one, ( (0^2']-) (IIN2m-") 
n-m 
I all processor columns k, 0:! ý k:! ý N-1, with k mod 2' > 2`1 perform 
the complex multiplication a[ 2m. r+k+2'--'] *e of Algorithm 5.2.1 
(with r=k div 2') in parallel I 
exchange_complex (WEST, pixel, temp, 2m) 
interchange contents of pixel of two halves of 2'processors of one 
row and store it in temp I 
pixel: =complex-add(pixel, temp, one, ((lA2m-'-) 
(OA 2m-'-) ) A2 
n-m) 
t all processor columns k, 0: ý k:! ý- N-1, with k mod 2' < 2'-1 perform 
the complex addition a[ 2m. r+k =u+v of Algorithm 5.2.1 (with r 
k div 2') in parallel I 
pixel: =complex_sub(temp, pixel, one, 
(OA 2'-') (1A2 m-") 2 
n-m) 
I all processor columns k, 0:! ý k! ý- N-1, with k mod 2' >- 2'-1 perform 
the complex subtraction a[ 2'. r+k+2u-v of Algorithm 5.2.1 
(with r=k div 2') in parallel I 
end; 
end; 
begin (* main 
row-f f t; 
reflect(row-fft) 
I ID-FFT along each processor column is calculated by applying the 
reflected ISA program of the ID-FFT along the processor rows (by 
interchanging row and column selectors, changing west and north, 
east and south) I 
pixel: =sif t_complex (pixel, -2-n) 
f division by IV2 (implemented efficiently by a shift operation by -2-n 
positions). After that, processor (ij), 0:! ý ij:! ý N-1, contains the value 
F(ij) of the 2D-FFT of the input image f(m, n) in pixel I 
end. 
75 
The time complexity of Trans f orm is divided into computational and communication 
complexity. All computations within one iteration step can be performed in constant time. Thus, the computational complexity is O(log2N). The procedure exchange_complex 
(WEST, pixel, temp, 2m) accounts for 0(2'-') operation (implemented by 2'-1 ringshifts of 
1092 N 
M-I a complex value). Therefore, the communication complexity is 0 12 O(N). Thus, 
the total complexity of the introduced 2D-FFT ISA algorithm is O(N). 
The inverse 2D-FFT can be realised by the same ISA algorithm with a complex conjugation 
of the input values, without dividing by N2 and with a complex conjugation of the output 
values (see Section 5.2.1). 
(A) 
(B) 
(C) 
Figure 5.2.4: Interprocessor communication and used basis functions in Algorithm 
Trans fo rm in each iteration step for one row of an 8x8 IS A, i. e. n=8 ((A) for m= 1, (B) 
for m=2, (C) for m=3). 
5.2.4 Parallel Implementation on Systola 1024 
The implementation of the FFT on Systola 1024 has to partition the algorithm described in 
previous section for an NxN ISA on an 32 x 32 ISA. We implemented the forward and 
inverse 2D-FFT for complex valued input images of size the NxN for N= 256,512,1024, 
2048. IEEE single floating point numbers are used for all computations to achieve reasonable 
accuracy. 
In order to reduce computing and communication time, each processor calculates two values 
in every stage instead of only one. Because of the limited processor memory this number 
cannot be increased further. Thus, each image row is mapped onto N164 processor rows. In 
the case of N= 512, for instance, eight processor rows are needed. 
To perform a ID-FFT one complex multiplication, addition and subtraction is calculated in 
each stage in every processor. The permutation between every two stages requires only 
routing of one of the two values. It is implemented either within each processor row (in the 
case of interchanging the two halves of subsequences of length :! ý- 64) or by interchanging 
76 
w wi wl xxx XW, 
X W2 X W2 A VV2 VV2 
"' 4 '^' "r 4A VV 4A VV4 
complete rows of the ISA (in the case of interchanging the two halves of longer 
subsequences). 
In this partitioning a problem arises when the last iteration step is performed for N= 2048: 
each processor needs a different complex root of unity, i. e. processor (P, q), 0 !ýp, q :! ý 31, 
needs exp(-27cik/2') for k= 32-p +q and m= 10922048. Loading all these values from the IN 
would require 6.32 instructions and a large amount of data transfer into the IN (8 KBytes). To 
compensate for this, we only load one complex value from each western EP (exp(-27Ciq/2m) 
from western IP q, 0:! ý q:! ý 31) and from each northern EP (exp(-2ni-(32-p)/2m) from northern 
IP p, 0 :! ý p :! ý- 3 1) into the processor array by using the efficient row broadcast and column 
broadcast operations. Afterwards each processor (p, q), 0 :! ý, p, q :! ý 3 1, calculates its required 
coefficient by the complex multiplication exp(-27ci-(32-p)/2m) - exp(-27ciq/2m) = exp(-21ri(32-p 
+ q)12'). The new load procedure only requires 96 instructions and a smaller amount of data 
transfer into the IN (512 Bytes). 
Of course, each N164 processor rows compute a 1D-FFT in parallel. Thus, all rows of the 
input image are transformed by N2/2048 iteration steps. Afterwards all columns of the 
intermediate image are transformed by applying the same program on the image columns. The 
number of processor instructions to perform N/64 lD-FFTs on Systola 1024 is displayed in 
Table 5.2.1, for N= 256,512,1024,2048. 
Table 5.2.1: Number of instructions to perform N/64 ID-FFTs on Systola 1024 for N= 256, 
512,1024,2049 in single floating point format. The overall number of instructions consists of 
instructions for 1/0 (read input image data from the IPs and write transformed data into IPs), 
communication (perform permutations and load coefficients), and computing (arithmetic on 
complex numbers). 
IN 256 ] F--512 10 2048 
EOverall 3289 3021 3349 3881 
11/0 1288 1288 1288 1288 
Communication 1161 781 913 1249 
Computation 840 952 1148 1344 
The data transfer time between board RAM and lPs can be almost totally hidden as it can be 
executed concurrently to the computation. For N= 512,1024, and 2048 the image size (8 
Bytes per pixel) is larger than the capacity of the board RAM. Thus, the intermediate image of 
transformed rows has to be transferred to the main memory of the PC and loaded again 
columnwise onto the board RAM for the transformation of columns. The necessary 
transpositions of the intermediate image and the column transformed image is also performed 
on the PC. It can be executed concurrently to the computation on Systola 1024. The 
transposition for N= 256 is performed on the board by some additional routing in the 
processor array and applying a corresponding controller program. Thus, the number of 
processor instructions spend on communication for N= 256 is higher than for N= 512 (see 
Table 5.2.1). The absolute runtimes on Systola 1024 including the data transfer between PC 
RAM and board RAM are shown in Table 5.2.2 in comparison to a sequential implementation 
on a PC. 
77 
Table 5.2.2: Comparison between the performance of the 2D-FFT of a complex valued NxN 
image in single floating point data format on Systola 1024 and a Pentium 11266 for N= 256, 
512,1024,2048. The numbers in brackets show the time spent on data transfer between PC 
RAM and board RAM, which is not executed concurrently to the computation. 
N 1 256 1 11 1024 2048 
Systola 1024 93 ms (25 ms) 447 ms (200 ms) 1,9 s (0ý8 S) 8,3 s (3,3 s) 
Pentium 11266 160 ms 720 ms 3,3 s 14,7s 
ISpeedup 197 1,6 1,7 1,8 
In order to overcome the limitation of storing the intermediate transformed image on board a 
significant larger board RAM has been integrated onto the architecture of Systola 4096. 
Beside that, the faster floating point unit of the processor architecture, the faster 1/0 between 
IPs and processor aff ay, and the higher number of internal registers of Systola 4096 will 
additionally increase the efficiency of the FFT with respect to the conventional Systola 1024 
architecture. 
The described FFT implementation of the Fourier Transform maps a complex valued image 
into the complex valued frequency domain. However, in image processing applications the 
input image is often real valued. If so, additional efficiency can be gained by applying the 
doubling real FFT algorithm. It states that the DFT of a real valued NxN, N= 2", image can 
be computed by the FFT of a complex valued N12 xN image and an additional permutation of 
input and output values (see [60]). 
Moreover, if the input and output pixels of the Fourier Transform are 8-bit greyscale values 
the runtime of the FFT on Systola 1024 can be reduced further, e. g. the instructions for 1/0 
and the data transfer time are reduced. Thus, the FFT for an 256 x 256 8-bit greyscale image 
can be accomplished within only 30 ms on Systola 1024. 
5.2.5 Discrete Cosine Transform 
Another periodical transform (i. e. a transform with periodical base functions) related to the 
DFT is the Discrete Cosine Transform (DCT). For an NxN image, the DCT is given by 
N-1 N-1 (2m + I)k; r Icos (2n + I)hr 
Qk, 1) = a(k, 1)11 f (i, j) Cos 
I 
2N 2N m=O n=O 
wit 
I 
for k, n=O 
N 
2 for k, nN-I N 
The main advantages of the DCT are that it yields a real output 
image and that it is a fast 
transform. The DCT is able to represent a high proportion of the 
image information in 
relatively few coefficients, and is therefore often used 
in data compression, e. g. in the JPEG 
[93], MPEG, and H. 263 compression standards [28]. Like the 
DFT, the DCT is also a fast 
transform, i. e. the DCT of an NxN image, N=2, can 
be computed in time O(N2 1092N) by 
78 
the Fast Cosine Transform (FCT) [501. The FCT can be implemented in a similar way as 
the DFI' on the ISA. 
79 
5.3 Hough Transform 
5.3.1 Line Detection by Hough Transform 
Straight lines occur very frequently in scenes containing man-made objects, e. g. manufactured 
parts, buildings, satellite images of towns and road systems. Therefore, the detection of 
straight lines is an important task in many pattern recognition systems. The Hough Transform 
(HT) [24] is a very powerful robust technique to detect straight lines in binary edge images. It 
has proved to be a valuable method in the areas of automatic quality control and image 
classification [25,49]. 
The HT transforms an input image into a parameter space by counting pixels on straight lines. 
For every possible straight line in the image a counter is introduced which accumulates the 
number of pixels which satisfy the corresponding line equation 
=m x+C. 
In the parameter space we represent a straight line by these two parameters: the slope m and 
the intercept c (the intersection point with the y-axis). This representation is rather simple. 
The set of straight lines intersecting in one point of the original image is represented by a 
straight line in the m-c-space. 
10 
9 
8 
7 
6 
y5 
4 
3 
10. 
9 
8 
7 
6 
5 
4 
3 
2 
1 
D 
(B) 10 
9 
8 
7 
6 
C5 
4 
3 
2 1 1 1 
2 21 1 
2 21 2 
2 31 3 3 
2 21 4 5 5 
1 2 2 5 10 14 2 11 1 
16 5 3 2 
3 4_ 3 
2 
_2 
1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 
m 
m 
Figure 5.3.1: The basics of the HT for line detection: (A) (xy) point image space; (B) 
transformation of (A) in the continuos m-c-parameter space; (C) 
discrete accumulator array 
corresponding to (B). 
80 
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 
Figure 5.3.1 (A) is a typical point image and Figure 5.3.1 (B) shows the parameter lines 
produced by backprojecting image points into parameter space using the above line equation. 
Points which are collinear in image space all intersect at a common point in parameter space 
and the co-ordinates of this parameter point characterises the straight line connecting the 
image points. The HT identifies these points of intersection in parameter space. Determination 
of the point of intersection in parameter space is a local operation and should be considerably 
easier than detecting extended point patterns in image space. 
For the purpose of computing the HT the continuous parameter space is usually considered to 
be composed of the union of a number of finite-sized regions in the following way: An 
accumulator array B[mkcl] is introduced to represent the parameter space. For every seeded 
pixel in the image and every slope Mk the corresponding value cl is computed from the line 
equation and the counters B for all straight lines intersecting in this pixel are incremented by 
one. By that at the end of the execution the value of each counter represents the fraction of the 
corresponding straight line in the original image (see Figure 5.3.1 (Q). The local maxima of 
the B array indicate the presence of lines of certain slopes and intercepts in the image. 
The HT algorithm in the simplest form for a binary image I(ij) of size NxN and an 
accumulator array B[mk, cl], with k=0, ---, 
M-1, Mk = k1M, looks like this: 
Algorithm 5.3.1: Standard Hough Transform 
for i: =O to N-I do 
for j: =O to N-1 do 
if I(i, j)=l then 
for k: =O to M-1 do increment B[ný, round(j-i-k/M)] 
A disadvantage of the used m-c-representation is the fact that vertical lines (with infinite 
slope) cannot be represented. This problem can be eliminated by considering only slopes m 
with -1 :! ý m :! ý 1. By applying the Hough transform to the original input image and the 
transposed input image the obtained discrete parameter domain estimates the required Hough 
transform adequately. 
The HT method is very robust to the addition of random data and to occlusion. Random 
image points are very unlikely to contribute coherently to a single bin of the accumulator 
array and therefore produce only a very low level background of counts in the accumulator 
array. An example is shown in an Figure 5.3.2. 
Another desirable feature of the HT is that it can simultaneously accumulate evidence for 
several straight lines occurring in the same image. Generally each line produces a distinct 
peak or cluster in the accumulator array. 
One disadvantage of HT is the high requirement in computing power such that online 
computations are impossible on existing sequential computers. The work is of the order N2M 
Thus, a parallel high-speed implementation is useful and necessary for many applications. 
Another disadvantage is the fact that HT creates artefacts by counting structures in the image 
that finally turn out to be only very small fractions of straight lines (e. g. only one dot). This is 
because the HT examines purely the collinearity of edge points. The current distribution of 
edge points along a line is ignored. On the other hand, this global nature of the HT is a useful 
property since it leads to robustness. Thus, to obtain a useful line finding scheme, when 
faced 
with a complex scene both collinearity and proximity constraints should 
be applied. 
In the next subsection a new method to solve this problem 
is presented, which applies the 
morphological approach to the HT. The quality of 
its results is significantly higher than that 
81 
of the HT. This new algorithm is also tailored towards the capabilities of the ISA. It can serve 
as an example how a classical algorithm is modified in order to improve its performance, 
while fitting the characteristics of a parallel architecture. 
Figure 5.3.2: Robustness of the HT: The left image contains two lines where uniformly 
distributed and uncorrelated noise has been added. The corresponding HT is shown in the 
right image. Each of the two peaks correspond to the lines in the input image. The line 
parameters can here be recovered by simple means, e. g. thresholding. 
5.3.2 Morphological Hough Transform 
The idea of the Morphological Hough Transform (MHT [4,77]) is to reduce artefacts in the 
accumulator array by combining the global character of the Hough transform with a local 
morphological operation. 
Mathematical morphology is a technique well suited for image processing applications (see 
also Section 5.1.3). Complex functions on binary images such as opening, closing, and 
skeleton are decomposed into sequences of atomic operations of two different types: erosion 
and dilation. Both need a structural element. This is a small image segment, that is moved 
over the image such that a predefined reference point of the structural element is 
subsequently aligned with every pixel of the image. 
The erosion matches the structural element in the image, i. e. if all pixels of the structural 
element are set in the image, the corresponding pixel in the erosion is set too. The dilation 
checks whether there is an intersection of the structural element with the image. A pixel of the 
dilation is set if and only if the structural element an the image have at least one pixel in 
common. Figure 5.3.3 gives an example of erosion and dilation on an 8x8 image. 
Although these two basic operations are very simple and cheap to implement they provide an 
extremely powerful toolbox for image processing applications. 
(b) 
(a) (c) (d) 
Figure 5.3.3: Erosion (c) and dilation (d) of the image (a) with the structuring element (b). 
82 
Thus, the identification of a seeded pixel is the erosion with a structural element that consists 
of only one dot at its origin. Whenever the Hough transform algorithms finds a seeded pixel, it increments the counter for every straight line containing this pixel. 
By using a structural element that represents a straight line by more than one dot we can 
significantly reduce the number of counter increments. In particular, all those cases where the 
white pixel was isolated or consists of only a small white segment do not lead to an increment 
of any counter. This suppresses the artefacts in the m-c-space. Ideally, a straight line can be 
represented by two pixels (because there is exactly one straight line intersecting with these 
two pixels). By using the morphological approach with a structural element consisting of two 
pixels for every possible slope we increase a counter if and only if both pixels match in the 
original image. 
Algorithm 5.3.2 shows an implementation of this method (with I(ij), B[Mk,, Cll like in Alg. 
5.3.1) using the structuring elements Bk (0,0), (qk, Pk) I for each slope k. 
Algorithm 5.3.2: Morphological Hough Transform 
for i: =O to N-1 do 
f or j: =0 to N-1 do 
if I(i, j)=l then 
for k: =O to M-1 do 
if I (i+q :1 then increment B[mk, round (j -'*Pk/q k) k +Pk) 
In order to produce a parallel implementation on the ISA a much more efficient version of the 
morphological approach becomes possible if the loops in this program are interchanged as in 
Algorithm 5.3.3. 
Algorithm 5.3.3: Morphological Hough Transform with interchanged loops 
for k: =O toM-1 do 
for i: =O to N-1 do 
for j: =O to N-1 do 
if I(i, j)=l and I(i+qk, j+pk) =1 
then increment B[m,, round(j-i-pk/q k) 
Now we perform the accumulation operations for the complete image slope by slope. By 
shifting the image by (-qk, -pk) for the slope k we get two copies of the image on top of each 
other. Then the complete erosion can be performed by one AND-operation of the original 
image with the shifted image. 
In reality, there may occur uncertainties due to rounding effects: Not every straight line has a 
slope which can be exactly represented by two pixels with a constant distance. Skewing the 
image and choosing an appropriate structural element can avoid this problem. The details of 
this solution as well as the parallel implementation of the morphological erosion with the 
shifted image are given in the next subsection. 
83 
5.3.3 Parallel Implementation of Morphological Hough Transform 
We have implemented the MHT for binary images of size NXN, N= 512. To reduce data 
transfer and communication time it is useful to store the complete input in the processor array. Thus, the image is partitioned as follows: every processor of Systola 1024 stores a 16 x 16 
subimage in the 16 internal registers R[01R[ 15 Each of these registers store 16 
adjacent pixels of the same row. 
We will use the capabilities of the ISA to compute the row (column) sum and the row (column) ringshift parallel prefix computations in a very efficient way (see Chapter 2) for the 
parallel implementation: 
The easiest case is the detection of horizontal (vertical) lines. We use an erosion with a 
structuring element of two pixels with horizontal (vertical) distance d. Initially, we shift a 
copy of the image relative to the original image by d pixels. Now a counter has to be 
incremented if the pixel of both, the origin and the shifted image are set. 
By means of shearing the image all other line angles can be transformed in horizontal 
(vertical) position, such that the efficient horizontal (vertical) operations are always applied. 
1. Horizontal Accumulation. This operation produces the sum of the seeded pixels within 
each image row. Each processor first counts the seeded pixels in the corresponding 
register. This sum is written into the C-register. Afterwards the row sum operation 
(explained in Chapter 2) computes the number of seeded pixels in the whole image row. 
This operation is repeated 16 times for every subimage row of each processor. The results 
are collected in the last processor in each row. Because of the limited memory (32 
registers) of each processor the results are shifted one processor column to the west in 
each iteration step. 
2. Horizontal Erosion. Let the structuring element have the distance d from the image 
origin. Then an image row ringshift for d positions and an AND-Operation computes the 
horizontal erosion. 
Each image subrow is first ringshifted one processor to the left. Afterwards a shifting for 
(d mod 16) positions within each processor is performed. Then a ringshift for a (d div 16) 
processor distance computes the image row ringshift. Finally one AND-Operation 
between the origin subrow register and the ringshifted subrow register in each processor 
calculates the erosion. 
3. Shearing. Shearing is an operation frequently used in scan line image processing [64]. 
The idea is to reshape the image such that the actual scan line of arbitrary slope appears to 
be horizontal. This is achieved by shifting the columns with increasing shift distances. 
Figure 5.3.4 depicts the effect of shearing: A straight line of 45' slope appears to be 
horizontal if the image is sheared by 45'. 
84 
shear by 45" column ringshift 
Figure 5.3.4: The central image depicts the left image sheared by 45 Mr right image 
depicts the sheared image stored in place using column ringshift. 
The implementation of shearing an NxN image on the ISA has to be done in place taking 
only integer values as shifting distances. Thus, shifting of each column is implemented by 
the ringshift operation. 
' 1, and an intercept cl, - For a line slope Mk., -1 :! ý Mkfý N <-vj < N, the accumulation has to 
be done along the pixels (i, round(i. Mk+C1)) for all columns i=0,... N-1. TAdng only 
integer values as intercept parameters the pixels to be accumulated are (i, round(i. mk) + 
cl). These pixels can be transformed into a horizontal position by ringshifting each column 
i=0,..., N-1 of the input image by round(i. mk)positions. Now the efficient horizontal 
erosion and accumulation can be applied again. 
The ISA procedure for the horizontal accumulation of one register reg in each processor is 
implemented as follows: 
procedure hor-accumulation(reg: register); 
begin 
C := sum(reg); one; one; 
I computes the sum of seeded pixels in reg and writes the result into 
the C-register in every processor (requires 16 instructions on Systola 
1024)1 
CR[251 C+C [WEST] ; one; 2. . n] 
I row sum of the C-register values in each processor row; result is 
written into C and the internal register R2 5 in the last processor 
column [n] I 
C := R24; one; one; 
R24 R25; one; [n] 
R24 C[EAST]; one; [I.. n-1]; 
shift results one position to the left and store them in R24 
end; 
When all processor columns are covered the accumulator array results are transferred to the 
board RAM. This data transfer can be done concurrently to the computation of the next line 
slopes. The procedure hor-accumulation requires 20 processor instructions on 
Systola 
1024. 
The ISA procedure for the horizontal erosion of one image row 
in each processor row (the 
image subrow in each processor is stored in register source, the eroded result 
is written into 
des t) with distance d>0 works as follows: 
85 
procedure hor_erosion(source, dest: register; d: integer); 
begin 
dest lo word(leftshift(source, d mod 16)); 
C hi-word(left_shift(source, d mod 16)); 
1 leftshift source-value by d mod 16 bit positions in each processor. 
Low Word of the result is written into dest, High Word into C 
row-ringshift; 
I ringshift C-register contents one processor to the left in each 
processor row (implemented by: C: =C [WEST]; C: =C[EAST])l 
CC OR dest; 
I now the input image row is shifted for d mod 16 positions in each 
processor row in CI 
for i: = 1 to d div 16 do row-ringshift; 
I ringshift C-register d div 16 processors to the left. Afterwards image 
row is shifted for d positions 
dest :=C AM source; 
performs the erosion operation in each processor 
end; 
The procedure hor - ero s 
ion requires 6+ (d div 16). 2 instructions on Systola 1024 (if d> 
0). Afterwards the result is horizontally accumulated. Thus, the horizontal erosion and 
horizontal accumulation of the complete image is computed by: 
for i: =O to 15 do 
begin 
hor erosion(R[i], R[161, d); 
hor accumulation(R[161); 
end; 
The horizontal erosion and accumulation of the complete image requires 417 + (d div 16). 32 
processor instructions on Systola 1024. 
For an efficient implementation of shearing on Systola 1024 ringshifting along large distances 
has to be avoided. Thus, the different line slopes are handled in increasing order and the 
ringshifting distance only needs to incorporate the relative shear between neighbouring angles 
round(i mk) - round(i ný, 1), for all columns i=0,..., N-1. 
The distances for relative shearing are known in advance and can be precalculated offline. We 
assume that the maximal absolute ringshifting distance between two neighbouring angles is D 
= 1. Then, the precalculated shearing distances turn into a binary image row mask: Only the 
image columns with an entry in the corresponding mask positions have to be ringshifted. 
Larger absolute ringshift distances between two neighbouring angles of D>1 can be 
accomplished by D iterations of the above binary case. 
The binary mask can be stored in the same way as an image row using only one register per 
processor column. After the broadcasting of this register along the columns each processor 
stores the relevant part of the mask. The upper image subrow within each processor (R [0]) is 
ringshifted in vertical direction and stored in R[ 16 1. Then each processor can compute the 
relative shearing by replacing the pixels at seeded mask positions with the pixels of the 
underlying row. Thus, the ISA procedure for shearing with D=I looks as follows: 
86 
procedure shear; 
begin 
C R[O]; 
colmn 
- ringshift; R[161 :=C; 
I upper image subrow in R[01 within each processor is ringshifted in 
vertical direction in each processor column and written into R[ 16. ] 
In_Broadcast(NORTH, mask); 
I input binary mask with shearing distances from the northern I]Ps 
using the efficient column broadcast operation I 
for i: =O to 15 do R[i] : =(R[il A --, mask) v (R[i+l] A mask) 
I performs the shear operation by replacing the pixels at seeded mask 
positions with the pixels of the underlying row 
end; 
The procedure shear requires 54 processor instructions on Systola 1024. After each shearing 
of the image the horizontal erosion and horizontal accumulation is applied. These two 
operations require much more instructions than shearing. Thus, shearing is a very efficient 
operation on the ISA. 
Because column ringshifting on the ISA is only efficient from southern in northern direction, 
the above implementation can only be used for line slopes between 0' and -45'. Slopes 
between 0' and 45' are computed by loading the input image with transposed rows on the ISA 
and applying the same program. The angles between 45' and 90' (-45' and -90') are 
calculated by storing the 16 x 16 subimage columnwise in each processor and applying the 
reflected ISA program (swapping row and column selectors, changing west and north, east 
and south) for 0' to 45' (0' and -45'). 
5.3.4 Performance Evaluation and Experimental Results 
To compare the performance of the parallel algorithm in the previous section, we compare the 
implementation on Systola 1024 with two sequential versions on a Pentium 11266 MHz. 
1. The standard version of the HT algorithm (referred to as "P266 Standard" in Table 
5.3.1). 
2. The sequential simulation of the MHT algorithm implemented on Systola 1024 (referred 
to as "P266 Parallel" in Table 5.3.1). 
For a comparison to other parallel architectures for the HT see [ 17]. 
For the implementations we use binary images of size NxN, N= 512, and a 16 bit 
accumulator array consisting of 4-N slopes and N intercepts. Systola needs 0,25 sec. (375 
processor instructions per slope) for the standard HT and 0,31 sec. (471 instructions per slope) 
for the MHT (D = 1, d= 10). Since computing time dominates communication time in this 
application, data transfers (board RAM <--> IPs, board RAM <--> PC RAM) can be almost 
totally hidden as it can be executed concurrently to the computation. 
The standard HT (Algorithm 5.3.1) depends on the number of seeded pixels 
in the input 
image, but has the disadvantage of a complex computation of the accumulator bins to be 
incremented. Since the corresponding bins are not held in contiguous area of memory, many 
cache misses are caused. This means a computing time of 
32 sec in the worst case (all pixels 
seeded). The used edge images from the pick and place process of multichip modules 
(Figure 
5.3.9) have 10 % seeded pixels on an average, which leads to a computing time of 3,2 sec. 
87 
The standard MHT (Algorithm 5.3-2) depends on the number of seeded pixels in the input 
image and in each eroded image. An average multichip module edge image (Figure 5.3.9) has 
10 % seeded pixels in the input and 1% in each eroded image, which leads to a computing 
time of 2,6 sec. in the average case. 
The sequential version of the parallel algorithm is independent of the number of white pixels, 
but has the advantage of an easier accumulation and a linear memory access of accumulator 
bins. This leads to a constant computing time of 2,3 sec for the HT and 2,8 sec for the MHT 
for each input image. 
Table 5.3.1: Comparisons of the average performance of HT/MHT for a 512 x 512 image and 
2048 different slopes on Systola 1024 (including data transfers) and Pentium 11266 
I 
---I 
F- HT I F MHT I Systola 1024 0.25s 0.31 s 
P266 Standard 3.2s 2.6s 
P266 Parallel 2.3s 2.8s 
ISpeedup 13/9 8/9 
In order to evaluate the qualitative performance of the MHT we have compared it with the HT 
on several different images. 
Figure 5.3.7: Accumulator from a 
MHT 
Fig. 5.3.5 shows three lines close together. The relevant part of the accumulator array 
from 
the HT and MHT (with distance d= 50 for the structuring element) 
is displayed as a height 
field in Fig. 5.3.6 and 5.3.7. 
88 
Figure 5.3.5: Original Image 
Figure 5.3.6: Accumulator from a HT 
It can be seen that both, HT and MHT produce artefacts (in the Figures they are behind the three peaks representing the lines). In case of MHT these artefacts are significantly reduced. The MHT also creates an enhanced accumulator contrast, which simplifies the higher order image analysis. 
Comparisons of the effectiveness of the HT and MHT methods are displayed in Figures 5.3.8 
- 5.3.11. Figure 5.3.8 shows an image from the pick and place process of multichip modules. Fig. 5.3.9 shows the result after an edge detector is performed on the image. Figures 5.3.10 
and 5.3.11 show the result after a HT and MHT (d = 10), where accumulator entries greater than 90 resp. 80 are detected. The resulting lines are displayed after an AND-operation with the edge image. It can be readily seen that the HT picks up many more false lines than the MHT in the dotted areas. 
5.3.5 Future Work 
The presented algorithm for line detection provides a morphological approach to Hough 
transform. The algorithm is designed to take advantage of the ability of ISAs to implement 
accumulation and ringshift operations extremely efficient. The processing time per image 
89 
Figure 5.3.8: Original image Figure 5.3.9: Edge image 
Figure 5.3.10: Result from a HT Figure 5.3.11: Result from a MHT 
shows that the principle of the ISA leads to solutions which are superior with respect to 
performance and hardware cost. Since image processing is one of the most computation 
intensive fields in computer science, these results might lead research into the direction of low 
cost solutions for automatic visual quality control. 
As for many application of parallel processing, a central question is to match algorithms and 
architectures in such a way that good performance is reached. We modified a classical 
algorithm in order to improve its performance, while fitting the characteristics of the flexible 
ISA architecture. 
Obviously there is still lots of scope for fine tuning the combination of HT and mathematical 
morphology to best fit a given application. Although no details will be given, the structuring 
elements could be optimised to search for lines: 
" of a given thickness (using larger structuring elements) 
" of given length 
" which are dotted or dashed 
For the MHT, tailoring towards one of the above requirements will not result in a much 
higher execution time, since the implementation of the morphological operations is very 
cheap. 
The presented ideas for the ISA implementation of the MHT can be extended to efficient ISA 
implementations of other algorithms in image processing, like the Filtered Backprojection of 
parallel beams and fan beams in Computerised Tomography. This algorithm is presented in 
the next section. 
The current ISA prototype Systola 1024 is a machine based on technology far from being 
state-of-the-art. Extrapolating to technology used for processors such as the Pentium 11266 
MHz would lead to a speedup of the ISA by a factor of at least 100. Thus, resulting in 
processing speeds under 40 ms for the MHT of 512 x 512 images with 2048 slopes and 512 
intercepts. Such performance can be expected from Systola 4096. 
90 
5.4 Tomographic Image Reconstruction 
5.4.1 Computerised Tomography 
Computerised Tomography (CT) is used in many applications, e. g. medicine and non- 
destructive evaluation (NDE), to look inside objects and to analyse their internal structures 
[31]. However, the problem of CT image reconstruction is computationally intensive. This 
has hindered the use of CT in applications where CT image data is required in real time. 
Parallel processing is one of the techniques available which can reduce the image 
reconstruction time significantly. 
In this section we illustrate how the capabilities of ISAs are exploited to derive an efficient 
parallel algorithm for CT image reconstruction on the ISA. The implementation on Systola 
1024 shows that the concept of the ISA is very suitable for CT reconstruction and results in 
significant run time savings. 
detectors detectors 
(A) 
Ci .: i.. 
(B) 
__ Eýz I, I, 'I a M ED FM 
X-ray ) sources FD 
CO 
point sourc6--------" 
scan different scan different 
angles angles 
detectors 
(C) 
-- 
-/ 
-r 
X-ray 
point source 
scan different 
angles 
Figure 5.4.1: Different scanning modes in CT: (A) parallel beam, (B) fan beam (C) cone 
beam 
Tomographic reconstruction deals with the reconstruction of an object 
from its projections 
[65]. The technique of CT scanners consists of passing a series of X-rays through an object, 
and measuring the attenuation in these rays 
by placing a series of detectors on the receiving 
side of the object. These measurements are called projections. 
For the collection of projection 
91 
three different modes are used: parallel beam scanning, fan beam scanning and cone beam 
scanning (see Figure 5.4.1). 
The parallel beam scanning mode and the fan beam scanning mode collects the data in the 2D 
plane from several different projection angles. The goal of tomographic reconstruction is to 
obtain an image of a cross section of the object from such a set of profiles. If this process is 
repeated at various heights along the object, we obtain several 2D cross-sectional images, 
which can then be stacked one on the top of another to get internal 3D volume visualisation of 
the object. Thus, the computational burden is large because of many 2D reconstructions 
required to obtain the 3D volume. 
The cone beam scanning mode collects projection data directly in the 3D volume from several 
different angles. In comparison to the other two scanning modes these measurements have not 
to be repeated from various heights. The scanning mode is chosen according to the 
complexity and suitability of the structure to be tested. 
The Radon Transform (RT) [14,62,92] can model X-ray attenuation density acquired from 
an 1D array of detectors. The RT R(p, O) of a continuous image g(xy) is defined as a line 
integral through g(xy) along all lines L inclined at an angle 0 from the y axis: 
R(p, 0) = 
fL(P, 
0) g 
(x, y)du 
where p= x-cos 0+ y-sin 0. 
As Figure 5.4.2 (A) shows, L(p 0ý the line inclined at an angle 0 forn the y axis at a 
distance p from the origin, and du is the distance along the line L. All values of R(p, O) for a 
fixed 0 form a 1D signal called a profile. Figure 5.4.2 (B) shows an example of a projection 
at one angle. The Hough Transform - introduced in the previous section - is a special case of 
the RT for binary images. 
(A) 
V 
(B) 
y 
p 
x 
x 
Figure 5.4.3: 
of one profile. 
Because only 
imno,,;,,, qible to 
(A) Illustration of the (p, O) line parameterisation. (B) Example of calculation 
a finite number of angles and rays are taken from the CT scanner it is 
reconstruct the analysed object exactly. This problem is mathematically 
described as the Inverse Radon Transform [65]. Thus, only discrete approximations can 
be 
computed. Filtered Backprojection is currently the most used reconstruction method 
to 
approximate the inverse RT. This algorithm 
is introduced in the next subsection. 
92 
5.4.2 Filtered Backprojection 
Filtered Backprojection of parallel beams consists first of a filtering operatlon and a following 
backprojection operation. 
The backprojection alone is not strictly the inverse of the RT. As shown in [ 14], the 
backprojected RT (without filtering) is a function of the original image, g(xy), blurred by the 
point spread function 
1 
x2+y2 
Hence, to recover the original image, filtering is needed to undo blurring effects due to the 
lowpass nature of the backprojection line integrals (see Figure 5.4.3 for an example). Thus, a 
ID filter is applied on each profile (projections of the same angle). Filtering of parallel beams 
is performed by taking the Fast Fourier Transform (FFT, see Section 5.2.2) of each projection, 
multiplying by the filter frequency response, and taking the inverse FFT. The type of filter 
used can have significant impact on the quality of image reconstruction. 
In our implementation, we use the Hamming, Bracewell and Shepp-Logan filter [3 1 ], which 
are commonly used in NDE image reconstruction. The work of filtering is the number of 
angular samples (7) times the work of the FFT operation of the number of projections per 
profile (R): TFiltering ---: O(T-R-log2R). 
(A) (B) 
Figure 5.4.3: Backprojection without filtering (A) and with filtering (B) (using the Shepp- 
Logan filter) 
Backprojection of parallel beams reconstructs each pixel by summing up one filtered 
projection value of each angular sample. To reconstruct the pixel 
(xy) the projection value 
PO(x-cos 0+ y-sin 0) has to be added for the angle 0, asdepicted in Figure 
5.4.4. In general, 
x-cos 0+ y-sin 0 is a non-integer. Thus, the needed value 
has to be determined by 
interpolation between the two physically existing neighbouring projections. 
Usually, linear interpolation provides the best trade-off between reconstruction quality and 
execution time [90]. The work of the backprojection 
is given by the image size (N x N) times 
2 
the number of angular samples (7): 
TBackprojection "' O(N - 7). 1hus, the backprojection part is 
computationally more expensive than the 
filtering part. 
93 
Algorithm 5.4.1: Backprojection of parallel beams with linear interpolation 
for i: =O to N-1 do 
for j: =O to N-1 do 
for t: =O to T-1 do 
cos 6, +j- sin 6, 
g 0, j) :=g0, j) + P, (Lp J) + (Lp I-p)- (Pg, (Fp 1) - Po, (Lp 1)) 
Filtered backprojection of fan beams can be reduced efficiently to the filtered backprojection 
of parallel beams using the rebinning algorithm [30]. Since a ray in the fan beam geometry 
is also some ray in the parallel beam geometry, their respective projection functions can be 
related. Using this relation and linear interpolation parallel beam profiles can be generated. In 
practice, rebinning is preferred compared to directly fan beam reconstruction because it can 
be fed into already developed filtered backprojection implementations for parallel beams. 
t 
Po( 
x 
Figure 5.4.4: For the reconstruction of the pixel g(xy) the projection Po(x. cos 0+ y-sin 0) has 
to added for the angle 0. 
5.4.3 ISA algorithm of Image Reconstruction 
In this section the ISA algorithm for the Filtered Backprojection of parallel beams for image 
reconstruction is described. The capabilities of ISAs to compute the row (column) broadcast 
and the row (column) ringshift parallel prefix computations very efficiently (see Chapter 2) 
are exploited in the parallel algorithm. To achieve this goal, similar techniques as in the 
parallelisation of the Hough Transform are applied: 
" Swapping the loops in the sequential algorithm 
" Shearing 
Filtering: The filtering step is implemented by computing the ID-FFT of each profile, 
multiplying each transformed projection by the filter frequency response and the 
corresponding inverse 1D-FFT (see previous section). 
The ISA algorithm for forward and inverse 1D-FFT is described in detail in Section 5.2.3. 
The filter coefficients are known in advance. Thus, they are precomputed and loaded into the 
processor array when needed. The multiply operation for filtering can then performed 
in each 
processor in parallel. 
94 
Backprojection: In order to produce a parallel implementation of backprojection of an NxN image on an NxN ISA each processor reconstructs one pixel in parallel. Because the backprojection in Algorithm 5.4.1 reconstructs each pixel sequentially a more efficient 
version of the backprojection becomes possible if the loops in Algorithm 5.4.1 are interchanged as follows: 
Algorithm 5.4.2: Backprojection with interchanged loops 
for t: =O to T-1 do 
for i: =O to N-1 do 
for j: =O to N-1 do 
i- cos 0, +j -sin 0, 
g 0, j) :=g (i, j) + Po, (Lp J) + (Lp I-p)- (Po, (Fp 1) - Pig, (Lp J)) 
Now we perform the accumulation operations for the complete image slope by slope. In each 
iteration step t only the projection values of one profile (0t) have to be considered. These 
projections are broadcasted through the processor array such that each processor gets the 
relevant projection values corresponding to its pixel. Afterwards each processor can compute 
the interpolation and accumulation in parallel. 
The easiest case is the backprojection of horizontal projections Each projection Po(i), i 
0,..., N-1,0 = 0', is broadcasted and accumulated along processor row i in parallel. By means 
of shearing of the image all other projection angles can be reduced to horizontal 
backprojection, such that the efficient row broadcast operation can always be applied. 
The effect of the shear operation is already explained in the previous section (see Figure 5.3.4 
for an example). 
Shearing an image by an angle 0, with 0' -< 
0 :! ý 45', is calculated by shifting each column j 
by the distance j. tan(O). The implementation of shearing an NxN image on an NxN ISA has 
to be done in place taking only integer values as shifting distances. Thus, shifting of columnj 
is implemented by ringshifting columnj with the distance Lj. tan(O)J . 
For an efficient implementation on the ISA ringshifting along large distances has to be 
avoided. Thus, different projection angles are handled in increasing order and the ringshifting 
distance only needs to incorporate the relative shear between neighbouring angles Lj. tan(Ot)d - 
Li-tan(Ot-01 for each column j. The distances for relative shearing are known in advance. 
Thus, they are precomputed and are loaded into the processor array when needed using the 
column broadcast operation. 
Assuming, projections were generated using N rays for each angle and the reconstruction 
image is of size Nx N, the distance between every two rays of an angle 0, with 0' <- 0 :! ý 45', 
is l1cos(O) with respect to each image column (see Figure 5.4.5, left). 
Because row broadcasting is always applied for backprojection, exactly one projection value 
has to exist between two neighboured pixels of each image column. Thus, projection values 
between every two rays of an angle 0, with 0' <- 0 :! ý 45', are calculated with distance I with 
respect to each image column (see Figure 5.4.6, right). This operation is computed 
by linear 
interpolation between the existing projections and takes only a negligible amount of work. 
95 
linear interpolation 
10 
Figure 5.4.4: Linear interpolation of projections to distance I with respect to each image 
column. 
Now, the indices for linear interpolation between neighbouring projections of an angle 0 urn 
to a constant value for each image column j: frac(i. 0). Thus, they can be loaded into the 
processor array using the column broadcast operation. 
The accumulation step for each angle 0, with 0' <- 0 :! ý 45', loads the projection values 
PO(i- 1) and PO(- N+im- 1) into each processor row i, for i=O,... N- Iýng row broadcast (see 
Figure 5.4.5). The interpolation indices IO(j) are loaded into the processor columnsj, forj = 
0,..., N-1, using column broadcast (see also Figure 5.4.5). Afterwards each processor gets also 
the projection values from its southern neighbour using the column ringshift operation. 
interface 
0 
0 
13 
-n -R 
= 
processors 
C z 0 (D 
west T 0 T- Cr II 0 PO(-N-1) POH) accu accu a 0 W 0 accu - (0,0) (0,1) 7 0 (0, N-1) 
I- I m (A 
Mo. 
I 
- PO(-N) PO(O) accu accu accu] 
(1,0) (1,1) (1, N-1 
row broadcast NxN- ISA 
PO(-2) PO(N-2) accu accu accu 
301- 1 
-d 
(N- 
1 
1,1)] (N-1 
I 
N-1) (N-1, O) 
Figure 5.4.5: Broadcasting of projections PO and interpolation indices 10 into the processor 
array using the efficient row broadcast and column broadcast operations. 
Because the implementation of shearing was done in place there are pixels in some processor 
rows i, i=O,..., N-1, needing the projections Po(-N+i) and 
PO(- N-4- 1) iltead of PO(i) and 
PO(i-I). These are exactly the pixels which lie above the Oth image row in the original sheared 
image (see Figure 5.3.4). Each processor stores this information in a flag register. Depending 
on this flag each processor (ij) 
decides either to take the projections PO(i), PO(i-l) or 
po(-N+i), po(-N+i-1) for the linear interpolation and accumulation: 
96 
for all processors (i, j) do in parallel 
if flag then PO: =Pq (i) ; Pi: =PO (i-1) -PO 
else PO: =PO (-N+i) ; pl: =PO (-N+i-1) -Po (-N+i) 
accu[i, j]: =accu[i, j] + PO + IO(j)-P1; 
Because column ringshifting on the ISA is efficient only in one direction, the above implementation can only be used for projection angles 0, with 00 <0:! ý 450. Angles 0, with 
-45' <- 0:! ý 0' are backprojected by mirroring the image columns and applying the same ISA 
program. The angles between 45' and 90' (-45' and -900) are calculated by applying the 
reflected ISA program (interchanging row and column selectors, changing west and north, 
east and south) for 0' to 45' (-45' to 0'). 
5.4.4 Performance Evaluation on Systola 1024 
The Filtered Backprojection of parallel beams has been implemented on Systola 1024. 
Projection were generated using N rays over N angles and the reconstruction grid was of size 
NxN, for N= 256,512,1024,2048. IEEE single floating point numbers are used for all 
computations to achieve reasonable accuracy in the reconstructed images. 
The filtering step is implemented on Systola 1024 using the FFT and inverse FFT 
implementation described in Section 5.2.4 and an additional complex multiplication with the 
filter coefficients. Having N rays over N angles this computation has to be iterated N2 /2048 
times. 
FFT, filtering, and inverse FFT can be performed in a row on each profile. Thus, no 1/0 
between these three computational steps is necessary on the processor array. Thus, computing 
time dominates communication time in this application, i. e. data transfers (PC-RAM <-> board 
RAM, board RAM <-> interface processors) can be executed concurrently to the computation. 
Table 5.4.1 shows the runtimes of filtering. 
The implementation of backprojection on Systola 1024 has to partition the implementation 
described for an NxN ISA on an 32 X 32 ISA. In order to reduce computing time each 
processor reconstructs 2x2 pixels instead of only one. Because of the limited processor 
memory (32 internal registers per processor) this number cannot be increased further. Thus, a 
64 x 64 subimage is reconstructed in parallel in one iteration step. The complete NxN image 
is reconstructed by N164 - N164 iteration steps. The backprojection and shearing needs 325 
processor instructions (210 instructions for accumulation, 115 instructions for shearing) per 
slope for a 64 x 64 subimage. Thus, the data transfers of projections, interpolation indices and 
shearing distances can be executed concurrently to the computation. 
Table 5.4.1. Performance on Systola 1024 of filtering and backprojection 
N] E-256 512 1024 2048 
Filtering 0.06s 0.25s 1.2s 5.5s 
Backprqjection 0.5s 3.8s 28.5s 223s 
As seen in Table 1, the time for backprojection and filtering increases by a 
factor of 
approximately 8 resp. 4.5 if N increases by a factor of 2. In general, the computation time is 
proportional to N3 resp. N2- b&N (see Section 5.4.2). Thus, the execution time scales nearly 
perfectly (i. e. without loss of efficiency) with the size of the problem. 
97 
The performance on Systola 1024 is compared with two sequential version on a PC (Pentium 
11 Processor, 266 MHz): 
1. The standard filtered backprojection algorithm (see Algorithm 5.4.1). An appropriate 
program package for this comparison has been provided by the Federal Institute for 
Materials Research and Testing (BAM). The performance of the PC is 8 sec for N 
256. This corresponds to a speedup of 14 of the parallel implementation. 
2. The sequential simulation of the ISA algorithm. The algorithm accesses pixels and 
projections always in contiguous areas of memory. Thus, it supports the memory 
hierarchy structure of the PC and causes much less cache misses than the standard 
algorithm. Its sequential simulation on a PC is more than factor two faster than the 
standard algorithm. The runtime is 3,6 sec for N= 256. This corresponds to a speedup of 6 
of the parallel implementation on Systola 1024. 
The parallel algorithm performs two linear interpolation steps (see Figures 5.4.4,5.4.5). Thus, 
the introduced ISA algorithm uses a different interpolation scheme than the standard linear 
interpolation. We compared the quality of the reconstruction of both implementations on 
several test images. Obviously, the parallel implementation creates no artefacts. Of course, the 
reconstructed images are a little bit smoother than that ones of the standard algorithm. 
5.4.5 Future Work 
We presented how the capabilities of ISAs to compute the broadcast and the ringshift parallel 
prefix computations can be exploited to derive a very efficient Parallel algorithm of CT image 
reconstruction for parallel beams and fan beams. The parallelisation and implementation of 
the backprojection of cone beams on the ISA is a challenging task for the future. 
The implementation on Systola 1024 shows that the concept of the ISA is very suitable for CT 
image reconstruction and results in significant runtime savings. Extrapolating these results to 
technology used for processors such as the Pentium 11 would lead to a much higher speedup 
of the ISA. 
Beside reconstruction, interpretation and visualisation of reconstructed object data are two 
other computationally intensive tasks in the area of CT. A next generation Systola board (e. g. 
Systola 4096) could be used to build a low cost complete system for real-time inspection 
system of CT data on PC basis. This system could consist of the following components 
" High-speed reconstruction, e. g. filtered backprojection of parallel beams, fan beams, and 
cone beams 
" High-speed interpretation, e. g. the detection of cracks in glass shells for nuclear using the 
image segmentation algorithms of [591 waste (see Figure 5.4.6 for an example) 
" 3D-visualisation of reconstructed and interpreted objects 
98 
(A) (B) 
A-A I-.. 
In 1" 9 
TIN 
Figure 5.4.6: (A) image representing CT reconstruction result of defective glass shells; (B) 
automatically detected cracks in (A) (images are provided by [59]) 
99 
Conclusions and Future Work 
6.1 Conclusions 
The subject of this thesis has been the analysis of design techniques and implementation of 
several special purpose algorithms and subroutines on the ISA that take advantage of the 
special features of the systolic information flow. 
The ability of ISAs to perform parallel prefix computations (e. g. broadcasting, accumulation, 
ringshifting) in an extremely efficient way was exploited as a key-operation to derive 
efficiency as well as local operations within each processor. 
Therefore, given sequential algorithms has to be decomposed in simple building blocks of 
parallel prefix computations and parallel local operations. To modify sequential algorithms for 
a parallelisation several techniques has been introduced in this thesis, e. g. swapping of loops 
in the sequential algorithm, shearing of data, and appropriate mapping of input data onto the 
processor array 
It was demonstrated how these techniques can be exploited to derive efficient ISA algorithms 
for several computationally intensive applications. These included cryptographic applications 
(e. g. arithmetic operations on long operands, RSA encryption, RSA key generation) and image 
processing applications (e. g. convolution, Wavelet Transform, morphological operators, 
median filter, Fourier Transform, Hough Transform, Morphological Hough Transform, and 
tomographic image reconstruction). 
Their implementation on Systola 1024 shows that the concept of the ISA is very suitable for 
these applications and results in significant run time savings. Of course, the current ISA 
prototype Systola 1024 is a machine based on technology far from being state-of-the-art. 
Extrapolating to technology used for processors such as the Pentium Id would lead to a much 
higher speedup of the ISA. Such performance can be expected from the already announced 
next generation Systola board. 
6.2 Future Work 
The results of this thesis emphases the suitability of the ISA concept as an accelerator for 
computationally intensive applications in the areas of cryptography and image processing. 
This might lead research towards further high-speed low cost systems based on ISA hardware. 
We are currently working on the integration of an ISA of size 32 x 32 onto a single chip with 
only I cm 2 silicon area in a 0.25g technology [73]. At the expected clock 
frequency of 200 
MHz the chip will have a peak performance of 12 GOPS (16 
bit operations) resp. 750 MFlops 
(IEEE single accuracy). Of course, this chip could 
be used to build an extremely powerful 
add-on board for PCs, e. g. 192 GOPS peak performance with an 
4x4 array of processor 
chips. 
100 
Moreover, a simpler version of this chip is designated for the application in videophones and 
mobile phones of the next generation (UMTS standard). These two applications are currently 
investigated in a project together with the company SIEMENS. 
For the videophone application it is planned to use the ISA as a video accelerator unit to 
perform the computationally intensive parts of the H. 263 video codec standard [28], e. g. 
motion estimation [10], forward and inverse Discrete Cosine Transform [50], filtering. 
Of course, the necessary ISA algorithms can take advantage of the fast parallel prefix 
computations and parallel local operations. High level control tasks can be mapped onto a 
(slower) programmable standard RISC processor, e. g. rate control. 
This would lead to a more cost effective solution than using a fully programmable device, 
which is required for the consumer market. Simultaneously enough flexibility is provided to 
implement sophisticated encoding strategies on the ISA in comparison to a pure dedicated 
solution. The same statements hold for the mobile phone application. 
101 
Bibliography 
Aho, A. V., J. E. Hopcroft, J. E.., J. D. Ullman, J. D.: The Design and Analysis of Com uterAlgorithms, Addison-Wesley (1974). 
2. Barrett, P. D.: Implementing the Rivest, Shamir and Adleman public key encryption 
algorithm on a standard digital processor, In: Proc. Crypto'86, Lecture Notes in 
Computer Science Vol. 263 (Springer, 1987) 311-342. 
3. Benaoun, M.: Implementierung eines Kryptoalgorithmus auf einem Systolic-Array- 
Prozessor, Master Thesis, Technical University of Braunschweig, Germany (1998). 
4. Beresford-Smith, B. Pham, B., Schr6der, H., A Parallel Morphological Implementation of 
the Hough transform, In: Proc. 25thHICSS, (1992) 111-119. 
Beylkin, G., Coiffnan, R., Rohklin, V.: Wavelets in Numerical Analysis, In: Ruskai, M. B. 
et al. (eds. ) : Wavelets and their applications, Jones and Bartlett (1992) 181-210. 
Bosselaers, A., Govaerts, R., Vandewalle, J.: Comparison of three modular reduction 
functions, In: Proc. Crypto'93, LNCS 773 (Springer, 1994) 377-386. 
7. Bracewell, R.: The Fourier Transform and its Applications., Mc Graw Hill, New York 
(1986). 
8. Brdunl, T., Feyrer, S., Rapf, W., Reinhardt, M.: Parallele Bildverarbeitung, Addison- 
Wesley (1995). 
9. Bfittner, P.: Simulation neuronaler Netze mittels eines befehlssystolischen Feldes, Master 
Thesis, University of Karlsruhe, Germany (1995). 
10. Cheng, S. C., Hang, H. -M.: A Comparison of Block-Matching Algorithms Mapped to 
Systolic-Array Implementation, ]EEE Transactions on Circuits and Systems for Video 
Technology, 7 (5) (1997) 741-757. 
Chui, C. K.: Wavelets: A Mathematical Tool for Signal Analysis, SIAM Monographs on 
Mathematical Modelling and Computation (1997). 
12. Cooley, J. W., Tukey, J. W.: An Algorithm for a Machine Calculation of Complex 
Fourier Series, Mathematics of Computation 19 (1965). 
13. Daubechies, I.: Ten Lectures on Wavelets, SIAM (1992). 
14. Deans, S. R.: The Radon Transform and Some of its Applications, Wiley, New York 
(1983). 
102 
15. Dittrich, A., Schmeck, H.: Given's Rotation on the Instruction Systolic Array. In: Wolf, 
G., Legendi, T., Schendel, U. (eds. ): PARCELLA '88, Mathematical Research, Vol. 48 
(Akadernie Verlag, Berlin, 1988) 340-346. 
16. Dittrich, A.: Matrixoperationen auf dem befehlssystolischen Feld, Master Thesis, 
Christian-Albrechts-University, Kiel, Germany (1989). 
17. Ferretti, M., Albanesi, M. G.: Architectures for the Hough Transform A Survey, In: 
Proc. of IAPR workshop on MVA, Tokyo (1996) 542-551. 
18. Fox, D.: Der "Digital Signature Standard": Aufivand, Implementierung und Sicherheit, 
In: Proc. GI-Fachtagung VIS'93, DuD-Fachbeitrdge 16, (Vieweg, Braunschweig, 1993) 
333-352. 
19. Gieseke, B.: A 600 MHz superscalar RISC microprocessor with out-of-order execution, 
ISSCC Digest of Technical Papers (1997). 
20. Gonzales, R. C., Woods, R. E.: Digital Image Processing, Addison-Wesley (1992). 
21. Hahnel, T.: The Rabin-Miller Prime Number Test on Systola 1024 on the Background of 
Cryptography, Master Thesis, University of Karlsruhe, Germany (1998). 
22. Helman, D., Jaja, J.: Efficient Image Processing Algorithms on the Scan Lien Array 
Processor, IEEE Transactions on Pattern Analysis and Machine Intelligence 17 (1) 
(1995). 
23. Higgins, B. C.: The Rabin-Miller Probabilistic Primality Test: Some Results on the 
Number of Non-Witnesses to Composites, MAA 1996 Spring Meeting, 
htlp: //www. ma. iup. edu/MAAýproceedings/voll/higgins. htm (1996). 
24. Hough, P. V. C.: A method and means for recognizing complex patterns, US Patent 
3,069,654 (1962). 
25. Illingworth, J., Kittler, J.: A Survey of the Hough Transform, Computer Vision, Graphics 
and linage Processing, 44, (1988) 87-116. 
26. Intel Corporation: MMx Technology Application Notes, 
http: //developer. intel. com/drý/mmx/i! PPnotes/index. htm (1998). 
27. ISATEC Soft-und Hardware GrnbH: The ISATEC Parallel Computer Systola 1024, Users 
Guide V. 3.0 (1998). 
28. ITU-T (CCITT) Recommendation H. 263: Video Coding for low bit rate communication, 
Geneva (1996). 
29. Jdhne, B.: Digital Image Processing, Springer Verlag, Berlin (1997). 
30. Jain, A. K: Fundamentals of Digital Image Processing, Prentice Hall, Englewood 
Cliffs 
(1989). 
103 
3 1. Kak, A. C., Slaney, M.: The Fundamentals of Computerized Tomographic Imaging, IEEE 
Press, New York (1988). 
32. Knuth, D. E.: The Art of Computer Programming, Vol. 2, Seminumerical Algorithms 
(Addison-Wesley, Reading MA, 198 1). 
33. Koc, C. K.: Analysis of Sliding Window Techniques for Exponentiation, Computers and 
Mathematics with Applications 30 (10) (1995) 17-24. 
34. Kolbe, W.: Online Qualitdtskontrolle von Oberfldchen mit den ISATEC Surface Quality 
Scannern SQS, Joumal ftir Oberfldchentechnologie 3 (1997) 
35. Kunde, M., Lang, H. -W., Schimmler, M., Schmeck, H., Schr6der, H.: The Instruction 
Systolic Array and its Relation to Other Models of Parallel Computers, Parallel 
Computing 7 (1988) 25-39. 
36. Kung, H. T., Leiserson, C. E.: Systolic arrays (for VLSI), In: Proc. 1978 Symposium on 
Sparse Matrix Computations (1978) 256-282. 
37. Kung, H. T.: Why systolic architectures ?, Computer Magazine 15 (1982) 37-46. 
38. Kung, S. Y.: VLSIArray Processors, Prentice Hall, Englewood Cliffs (1988). 
39. Ladner, R. E., Fischer, M. J.: Parallel Prefix Computation, Journal of the ACM, 27 (4) 
(1980) 831-838. 
40. Lang, H. -W., Schimmler, M., Schmeck, H., Schrbder, H... Systolic sorting on a mesh- 
connected network, ]EEE Transactions on Computers 34 (1985) 652-658. 
41. Lang, H. -W.: The Instruction Systolic Array, a parallel architecture 
for VLSI, Integration, 
the VLSI Journal 4 (1986) 65-74. 
42. Lang, H. -W.: ISA and SISA, Two Variants of a 
General Purpose Systolic Array 
Architecture, In: Proc. 2nd Int. Conf. on Supercomputing (1988) 460-465. 
43. Lang, H. -W.: Transitive Closure on the Instruction 
Systolic Array, In: Proc. Int. Conf. On 
Systolic Arrays, (Computer Society Press, 1988) 295-304. 
44. Lang, H. -W.: Das befehlssystolische 
Prozessorfeld - Architektur und Programmierung, 
PhD Thesis, Christian-Albrechts-University, Kiel, Germany (1989). 
45. Lang, H. -W., MaaB, R., Schinunler, 
M.: Implementation of a 1024-Processor Array 
Computer as an Add-On Boardfor Personal Computers, In: 
Proc. HPCN 94, LNCS 797 
(Springer, Berlin, 1994) 487-488. 
46. Lang, H. -W.: Verfahren zum 
Betreiben eines hochintegrierten Wellenfront-Feldrechners. 
Patent No. DE 35 33 800, Federal Republic of Germany. 
47. Lang, H. -W.: Waveftont 
Array Processor. Patent No. 4,884,193, USA. 
48. Lang ., H. -W.: Wellenftontfeldrechner. 
Patent No. 0 220 474, European Patent Office. 
104 
49. Leavers, V. F.: R%ich Hough Transform, Computer Vision, Graphics and Image Processing 58 (1993) 250-265. 
50. Lee, A H.: On Computing 2-D systolic algorithms for discrete cosine transform, ]EEE 
Transactions on CAS 37 (10) (1990) 1321-1323. 
5 1. Leighton, F. T.: Introduction to parallel algorithms and architectures: arrays, trees, hypercubes, Morgan Kaufmann (1992). 
52. Lenders, P., Schr6der, H., Strazdins, P.: Microprogramming Instruction Systolic Arrays, 
MICRO 22, Dublin, (1989). 
53. Lenders, P., Schr6der, H.: A programmable systolic device for image processing based on 
mathematical morphology, Parallel Computing 13 (1990) 337-344. 
54. Lu, J.: Parallelizing Mallat Algorithm for 2-D Wavelet Transforms, Information 
Processing Letters 45 (1993) 255-259. 
55. Menezes, A. J.: Handbook ofApplied Cryptography, CRC Press (1997). 
56. Miller, G. L.: ARiemann*s Hypothesis and Tests for Primality, Journal of Computer 
Systems Science, 13 (3) (1976) 300-317. 
57. Montgomery, P. L.: Modular Multiplication without trial division, Mathematics of 
Computation, 44 (170) (1985) 519-521. 
58. Pham, B., Schr6der, H.: An Instruction Systolic Device for Quadratic Surface 
Generation, CompEuro 89, Hamburg., West Germany (1989). 
59. Pisinger., G.: Risserkennung in Glaskokillen, http: //www. forwiss. uni- 
passau. de/projekte/riss/index. html, Forwiss Passau (1997) 
60. Press, W. H., Teukolsky, S. A., Vetterling, W. T., Flannery, B. P.. - Numerical Recipes in 
C, Cambridge University Press (1992). 
61. Rabin, M. 0.: Probabilistic Algorithm for Primality Testing, Journal of Number Theory 
12 (1980) 128-138. 
62. Radon, J.: Über die Bestimmung von Funktionen durch ihre Integralweglänge längs 
gewisser Manigfaltigkeiten, Ber. Ver. Sächs. Akad. Wiss. Leipzig, Math-Phys. KI., 69 
(1917) 262-277. 
63. Rivest, R. L., Shamir A., Adleman, L: A method for obtaining digital signatures and 
public key cryptosystems, Communications of the ACM 21 (2) (1978) 120-126. 
64. Robertson, P. K.: Fast perspective views of images using one-dimensional operations, 
CG and A, 7 (2) (1987) 47-56. 
65. Rowland, S. W.: Computer implementation of image reconstruction formulas, in image 
Reconstruction from Projections, G. T. Herman, ed., vol. 32 in Topics in Applied 
Physics. New York, Springer Verlag, pp. 9-79 (1979). 
105 
66. Schimmler, M.: Fast Sorting on the Instruction Systolic Array, Technical Report No. 
8709, Christian-Albrechts-University, Kiel, Germany (1987) 
67. Schimmler, M.: Gra henalgorithmen auf gitterverbundenen Prozessorfelden, PhD p 
Thesis, Christian-Albrechts-University, Kiel, Germany (1988). 
68. Schimmler, M., Schrbder, H.: A Simple Systolic Method to Find all Bridges on an 
undirected Graph, Parallel Computing 12 (1989) 107-111. 
69. Schimmler, M.: Instruction Systolic Arrays - Experiences with a First Implementation, In: Proc. ISCA'92, Hamilton Island, Australia (1992). 
70. Schimmler, M.: Realisierung eines Instruction Systolic Array als Parallelrechner- 
Zusatzkarteftir PCs, In: Proc. HCF, Berlin (1993). 
71. Schimmler, M.: A bit-serial floating point processor below 10 mW, In: Proc. 7. 
Fachtagung Mikroelektronik, Zwickau, Germany (1996). 
72. Schimmler, M., Lang, H. -W.: The Instruction Systolic Array in Image Processing 
Applications, In: Proc. Europto'96, SPEE Vol. 2784 (1996) 136-144. 
73. Schinu-nIer, M.: Kiloprocessor -A Floating Point Processor with 0,1 mm 
2 Silicon Area, 
http: //www. ida. ing. tu-bs. de/projects/kilopro/home. e. htrnl (1997) 
74. Schmeck, H.: A Comparison-Based Instruction Systolic Array, In: Robert, Y. et al. (eds. ). - 
Parallel Algorithms and Architectures (North Holland, Amsterdam, 1986). 
75. Schmidt, B.: Computer-Tomographie mit Standard-PCs mit der Parallelrechnerkarte 
Systola 1024, In: Statustagung des BMBF HPSC97 - Paralleles Höchstleistungsrechnen 
und seine Anwendungen, (Munich, 1997) 261-27 1. 
76. Schmidt, B., Schimmler, M., Schr6der, H.: Line Detection by Mathematical Morphology 
on the Instruction Systolic Array, Technical Report, Loughborough University, Computer 
Studies 1032 (1997). 
77. Schmidt, B., Schimmler, M., Schr6der, H.: Morphological Hough Transform on the 
Instruction Systolic Array, In: Proc. Euro-Par'97, Lecture Notes in Computer Science, 
Vol. 1300 (Springer, 1997) 798-806. 
78. Schmidt., B., Schinu-nler, M., Schr6der, H.: Long Operand Arithmetic on Instruction 
Systolic Computer Architectures and Its Application to RSA Cryptography, In: Proc. 
Euro-Par'98, Lecture Notes in Computer Science, Vol. 1470 (Springer, 1998) 916-922. 
79. Schmidt, B., Schinuffler, M., Schrbder, H.: The Instruction Systolic Array in 
Tomographic Image Reconstruction Applications, In: Proc. PART'98, (Springer, 1998) 
343-354. 
80. Schmidt, S.: Modellierung eines Prozessors ffir ein befeWssystolisches Feld mit 4096 
Prozessoren, Master Thesis, Technical University of Braunschweig, Germany (1998). 
106 
8 1. Schneier, B.: Applied Cryptography: Protocols, Algorithms, and Source Code in C, John 
Wiley and Sons (1995). 
82. Schr6der, H.: Systolic Arrays Versus Instruction Systolic Arrays, TR-CS-86-08, l5pp, 
Dept. of Computer Science, Australian National University, 1986. 
83. Schr6der, H.: Instruction systolic array - trade-off between flexibility and speed, Computer Systems, Science and Engineering 3 (2) (1988) 83-90. 
84. Schr6der, H.: Systematic Design of Instruction Systolic Arrays -a Case Study, 
Proceedings of the 1 Ith Australian Computer Science Conference, Brisbane, (1988) 2-15. 
85. Schr6der, H.: Top-Down Designs of Instruction Systolic Arrays for Polynomial 
Interpolation and Evaluation, Parallel and Distributed Computing, 6 (1989) 692-703. 
86. Schrbder, H., Krishnamurthy, E. V.: Instruction Systolic Array Computation of the 
Characteristic Polynomial of a Ressenberg Matrix, Parallel Computing, 17, (1991) 273- 
278. 
87. Schr6der, H., Strazdins, P.: Program compression on the ISA, Parallel Computing 17 
(1991) 207-219. 
88. Schr6der, H., et al: PIPADS: A Vertically Integrated Parallel Image Processing and 
Display System, 5th Australian Supercomputing Conference, Melbourne (1992). 
89. Serra, J.: Image analysis and mathematical morphology, Academic Press, New York 
(1982). 
90. Shieh, E., Current, K. W., Hurst, P. J., Agi, I.: High-Speed Computation of the Radon 
Transform and Backprojection Using an Expandable Multiprocessor Architecture, ]EEE 
Transactions on Circuits and Systems for Video Technology 2 (4) (1992) 347-359. 
91. Stinson, D.: Cryptography -Theory and Practice, CRC Press (1995). 
92. Toft, P.: The Radon Transform - Theory and Implementation, PhD Thesis, Technical 
University of Denmark (1996). 
93. Wallace, G. K.: The JPEG still picture compression standard, Commun. ACM 34 (4) 
(1991) 
107 
A RSA Program Code 
In this appendix, source codes of the 1024-bit RSA encryption and decryption on Systola 1024 
- as described in Section 4.4 - are presented. In Section 3.3 the used programming 
environment is explained in more detail. 
A-1 Application Program 
This level of programming is the interface between the host computer and Systola 1024 (see 
Section 3.3.1). Typically, the application program controls data transfer between PC and 
board RAM and loads and starts controller programs on Systola 1024. 
The following application program performs 1024-bit RSA encryption (decryption) on Systola 
1024 by use of the program interface for Borland PASCAL 7.0. The program reads plaintext 
(ciphertext) to be encrypted (decrypted) and key data (exponent E, modulus N, modulus' 
reciprocal REZ - 
N) from two input file parameters (data-file and key_file in 
procedure init). 
The modulus N (N with 2`1 <N< 2", and n= 1024, see Section 4.4.3.1) and modulus' 
reciprocal REZ -N 
(ýt =2 2n-1 div N in Section 4.4.3.1) is written into the northern board RAM 
of Systola 1024 by load_block (NORTH, 2, L*ROWS, N) and 
load-block(NORTH, O, L*ROWS, REZ-N). 
The input plaintext (ciphertext) is partitioned into blocks of adequate size and transferred into 
the western board RAM with 1o ad-b 1ock (WE S T, 0, ROWS * L, bu ffe r) . 
The encryption (decryption) is performed as follows: The sequential host evaluates the 
exponent bits using the right-to-left square-and-multiply algorithm (see Algorithm 4.2b) for 
modular exponentiation in procedure expo_isa. Depending on the result (if 
(ptext (exponent) ^[i div 161 and (I shl (i mod 16) ) <>O)) an ISA 
program for modular multiplication and modular squaring is executed on the processor array 
by run- iq('X", I otherwise only the ISA program for modular squaring is executed by 
run-iq ( 'Y' , 1). 
The results of the calculation on Systola 1024 are written into the western board RAM. The 
encrypted (decrypted) ciphertext (plaintext) is stored in the main memory of the PC with 
run - Mst -stor e-b 
1ock (WE S T, 12 8, ROWS * L, bu ffe r) and written into the output 
file Out. 
Section A. 2 describes the implementation of the configuration expo1024. cfg andSection 
A. 3 goes into what happens within the processor array. 
108 
Program isa_rsal024; 
{$N+} 
uses wincrt, isawin, winprocs, win3l, windos, wintypes; 
const cfg-path='expol024. cfg'; 
L=64; 
ROWS=32; 
type text array[O.. L-11 of word; 
ptext '*text; 
text_isa array[O.. ROWS*L-1] of word; 
ptext 
- 
isa A text 
- 
isa; 
bytearray array[O.. $FFFE] of byte; 
pbytearray A bytearray; 
var N, REZ-N, buffer : ptext-isa; 
E: text; 
j, e_length, rest, teiler, offset integer; 
size longint; 
fl, f2 file; 
hl, h2, ml, m2, sl, s2, cl, c2 word; 
time : single; 
searchpath : string; 
procedure get-searchpath; 
I get searchpath for input files information from paramstr (0) 1 
var i: integer; 
begin 
searchpath: =paramstr(O); 
i: =length (searchpath) ; falways>01 
while (i>O) and (not(searchpath[i] 
searchpath[01: =chr(i); 
in do dec(i); 
if not (searchpath[il in [#0,1: 1,1\11) then searchpath .= 
searchpath + 1\1; 
end; 
procedure write-time(hl, ml, sl, cl, h2, m2, s2, c2: word); 
I output computing time required for encryption/decryption I 
begin 
time (h2*3600.0+m2*60.0+s2*1.0+c2*0.01) - 
(hl*3600.0+ml*60.0+sl*I. O+cl*0.01); 
writeln; 
writeln(time: 0: 4,1 sec for ', size, ' Bytes'); 
if (time <> 0) then writeln('Throughput: ', 
(8*size)/(1000*time): 0: 4, ' Kbit/sec'); 
end; 
109 
P-r0cedure init; 
I initialise input file and output file; read key data from a file; activate Systola 1024; load 
controller configuration I 
var i, j, posl, pos2: integer; 
file_l, file_2, key_file, data_file, endung: string; 
begin 
new (N) 
new(REZ-N); 
new(buffer); 
isa-board_reset; activate Systola 1024 
get_searchpath; 
searchpath: =searchpath + cfg_path + #0; 
load_cfg(@searchpath[ll); load configuration 
file-1 paramstr(l); 
file_2 paramstr(2); 
if (pos('. RSA, file_1)<>O) or (pos('. rsa', file-I)<>O) then 
begin 
key-file: =file_l; 
data-file: =file-2 
end 
else begin 
key-file: =file_2; 
data-file: =file-1; 
end; 
assign(fl, key-file); 
reset (f 1,1) ; 
blockread(f1, N'^', 2*L); 
for i: =O to ROWS-1 do hmemcpy(@N A [L*i], N, 2*L); 
blockread(f1, REZ_N'^', 2*L); 
for i: =O to ROWS-1 do hmemcpy(@REZ_N"ý[L*i], REZ-N, 2*L); 
blockread(fl, E, 2*L); 
close(fl); 
load-block(NORTH, O, L*ROWS, REZ-N); 
load-block(NORTH, 2, L*ROWS, N); 
dispose(N); 
dispose(REZ-N); 
e-length := 16*L; 
while (E[(e-length-1) div 161 and 
(1 shl ((e-length-1) mod 16))=O) do 
dec(e-length); 
if (pos('. enc', data-file)<>O) or 
(pos('. ENC', data-file)<>O) then 
begin 
endung: ='. dec'; 
writeln(, systola1024 
1024-Bit-RSA Decryption') 
end 
else begin 
endung:: ý:: '. enc'; 024 1024-Bit-RSA Encryption 
writeln(, Systolal 
end; 
. 
data-file); 
posl: =Pos(l 
' 
110 
asSign(fl, data 
- 
file); 
reset (f 1,1) 
writeln; 
writeln('Input File: '); 
writeln(data-file); 
writeln; 
delete(data_file, posl, 4); 
insert(endung, data_file, posl); 
assign(f2, data 
- 
file); 
rewrite(f2,1); 
size: =filesize(fl); 
rest: =size and $OOOOOFFF; 
of f set: = ($0080 - (rest and $007F) and $007F; 
teiler: =size shr 12; 
writeln('Output File:, ); 
writeln(data-file); 
writeln; 
writeln(Ifilesize= ', size); 
end; 
procedure expo_isa(exponent: pointer; 
length, teiler, rest: integer; var inp, out: file); 
perform encryption/decryption by modular exponentiation using the right-to-left square-and- 
multiply algorithm 
var i, j: word; 
begin 
gettime(hi, ml, sl, cl); 
for j: =O to teiler do 
begin 
if (j=teiler) then 
begin 
blockread(inp, buffer^, rest); 
for i: =rest to rest+offset-1 do 
pbytearray(buffer)^[i1: =O; 
end 
else blockread(inp, buffer^, 2*ROWS*L); 
run_iq('START', I); 
load 
- 
block(WEST, O, ROWS*L, buffer); 
run_iq (1 IN 1,1) ; 
for i: =O to length-1 do 
if (ptext (exponent) ^[i div 16 ] and (1 shl (i mod 16) <>O) 
then run_iq('X', l) 
else run_iq('Y', I); 
run_iq(1OUTPUT', 1); 
run_mst_store_block(WEST, 
128, ROWS*L, buffer); 
if (j=teiler) then blockwrite(out, buffer^, rest+offset) 
else blockwrite(out, buffer^, 2*ROWS*L); 
end; 
gettjme(h2, m2, s2, c2); 
end; 
III 
PrOcedure quit; 
var i, j, counter: word; 
begin 
isa-board_down; 
close(fl); 
close(f2); 
dispose (buffer) 
end; 
I deactivate Systola 1024 1 
begin (* main *) 
if (paramcount=2) then 
begin 
init; 
expo-isa(@E[O], e_length, teiler, rest, fl, f2); 
write time(hl, ml, sl, cl, h2, m2, s2, c2); 
quit; 
end; 
end. 
112 
A. 2 Controller Program 
Here follows the controller program for the configuration expo1024. Cf9 used in the 
application program above. It is implemented using the toolset ISATOOLSTM (see Section 
3.3.2). 
The label START is the reference point for the PC to start an instruction sequence for the 
initialisation of the processor array (ISA - 
START INTROINT) and for the transfer of the 
modulus N and the modulus' reciprocal REZ N to the northern IN (BLOCK_TO_IP 
NORTH). The programs for the input of this data from the northern lPs into the processor array 
are started by ISA_START INIT and ISA-START INIT2. Afterwards, HALT delays the 
controller until this activity is finished. 
The label IN is the reference point to start an instruction sequence for a data transmission of 
plaintext (cpihertext) to be encrypted (decrypted) from the western board RAM into the 
western IPs (BLOCK_TO_IP WEST). ISA_START INPUT starts the executions of an 
ISA program for the input of this data from the western IN into the processor array. 
The label X is the reference point to execute the ISA programs for modular multiplication 
(ISA_START XEXP1024) and modular squaring (ISA-START YEXP1024). The 
execution of only a modular squaring on the ISA can be referenced by the label Y. Finally, the 
label OUTPUT is the reference point to start an instruction sequence for the data transfer of the 
results from the western IPs into the western board RAM (IP_TO_BLOCK WEST). 
STAP, T ISA-START INTROINT 
WAIT Ignore Ignore Prg_Ready 
BORDER_PAR NORTH 32 2 
BORDER_PAR WEST 32 64 
BLOCK_TO_IP NORTH 2 Reset 
WAIT N_Ready Ignore Ignore 
ISA_START INIT 
WAIT Ignore Ignore Prg_Ready 
BORDER_PAR NORTH 32 64 
BLOCK_TO_IP NORTH 0 Reset 
BLOCK_TO-IP NORTH 2 Reset 
WAIT N_Ready Ignore Ignore 
START ISA INIT2 
_ HALT Ignore Ignore Prg_Ready 
IN BLOCK_TO_IP WEST 0 Reset 
WAIT Ignore Wý_Ready Ignore 
START ISA INPUT , __ HALT Ignore Ignore Prg_Ready 
x ISA. START XEXPI024 
y ISA. START YEXPI024 
HALT Ignore Ignore 
Ignore 
OUTPUT WAIT 
Ignore Ignore Prg_Ready 
START isA OUTPUT . 
WAIT Ignore Ignore 
Prg-Ready 
TO-BLOCK IP WEST 
128 Reset 
_ 
HALT Ignore 
W-Ready Ignore 
113 
A-3 ISA Programs 
The source codes of all ISA programs used in the configuration expo 10 24. cf9 of the 
previous section are presented. The ISA programs are edited and compiled by means of the 
language LAISA TM in the ISATOOLS programming environment (see also Section 3.3.3). 
A. 3.1 XEXPOI024 
This program performs a 1024-bit modular multiplication X: = X-Y mod N in each ISA 
processor row using the algorithm described in Section 4.4.3.1 (with the same N-value in each 
processor row). 
Each processor stores 32 bit of the corresponding operand X in the two (16-bit) registers 
X-LO and X_HI. The LSBs are stored in the leftmost processor column and the MSB are 
stored in the rightmost processor column. Each western IP stores the corresponding complete 
second operands Y. 
The modulus N and the modulus' reciprocal REZ N are completely stored in each northern EP. 
The modulus N is also distributed along each processor in the registers N_LO and N_HI. The 
registers N62 and N63 store the most significant 32 bits of N. The registers S- HI1, S-HIO, 
S_LO 11 S_LOO, Q_HI1, Q_HIO, Q_LO1, Q_LOO, S62, S63, S64, Q62, Q63 are 
used for storing intermediate results. 
At the end, the results are distributed along the corresponding processor rows in the registers 
X LO and X_HI. 
program XEXPO1024; 
{#include io. lib} 
{#include fpsng. lib} 
const n=32; 
one: selector = 
JA 
n; 
Y_LO=Rl; Y_HI=R2; X_HI=R3; X-LO=R4; 
S-HI1=R5; S-HIO=R6; S-LOI=R7; S-LOO=R8; 
Q-HII=R9; Q_HIO=RIO; Q_LO1=Rll; Q_LOO=RI2; 
S62=R13; S63=R14; S64=R15; 
Q62=R17, Q63=RI8; Q64=R19; 
N_LO=R24; Ný_HI=R25; 
N62=R26; N63=R27; 
procedure MUJt1024 (A. HI, A. 
LO, P-HII, P-HIO, P-LO1, P-LOO, P-62, 
P-63, P_64 : register; in_dir: 
direction; store - 
low: boolean); 
1024-bit multiplication in each processor row: 
P2047 
... 
PO : =A1023 ... 
Ao x B1023 ... 
Bo 
Input: 
Each operand A is distributed along 
the corresponding processor row in A_HI, A_LO 
least significant 32 bits 
in the leftmost processor columns 
most significant 
32 bits in the rightmost processor columns 
if in_dir == WEST each operand 
B is completely stored in the adjacent western EP 
114 
if in-dir = NORTH each processor row uses the same value of B 
In the latter case, B is completely stored in each northern EP 
output: 
The least significant 1024 bits of each result (PI023 ... 
Po) will be distributed over 
the corresponding row in the registers P_LO 1, P_LO 0 (if store- low = TRUE) 
The most significant 1024 bits of each result (PI023 ... Po) will be distributed over the corresponding row in the registers P- HI1, P-HIO 
P- 62, P- 63, P_ 64 store the bits P991 P 1039 of the corresponding 
result in each processor 
var i, j: integer; 
begin (n) 
for j: =0 to 7 do 
begin broadcast 8 words of operand B into each processor 
for i: =O to 7 do 
if (in_dir=WEST) then < in WEST, CW, CR[24+il; one; one > 
else < in NORTH, CN, CR[24+i]; one; one >; 
if (j=O) then perform 1024-bit x 128-bit multiplication 
begin 
< ldmul 0, A_LO, O; one; one >; 
< mul R24,0, RI6; one; one >; 
end 
else begin 
< ldmul P_HIO, A. LO, O; one; one >; 
< mul R24, P_HI1, RI6; one; one >; 
end; 
for i: =O to 6 do < mul R[25+i], O, R[17+i]; one; 
< ldmul R17, A_HI, P_HII; one; one >; 
IP 
- 
Hl: =Mul-Carriage I 
for i: =O to 5 do < mul R[24+il R[18+il R[24+i] 
< mul R30, P_HI1, R30; one; one >; 
< mul R3 1,0, R3 1; one; one >; 
< ldmul O, O, P_HIl; one; one >; 
< set R16, C; one; one >; 
< add R2 4,0, C; one; one >; 
for i: =O to 7 do 
if (i=7) then < adc CE, P_HI1, P_HIl; one; one 
else < adc CE, R[25+i], CR[25+il; one; 
< adc O, O, C; one; one >; 
< set R31, P_hj-u; c 
< sub P_HIO, -1,0; 
< adc P_HI 1,0,0; c 
< adc 0,0,0; one; 
< ifz C, CW, C; one; 
< add P_HIO, CW, 
P_HIO; 
< adc p_HI1,0, 
P_HIl; 
if store_10w then 
begin 
R31, P_HIO; one; 
P-HIO, -1,0; one; 
P-HII, 0,0; one; 
0,0,0; one; one 
o rTAT r- nne: one 
IC := carry-bit 
one >; 
one >; 
one >; 
I propagate carry-bit 
one; [2 . n] >; 
one; [2 . n] >; 
one 
one; one 
> 
one >; 
115 
for i: =3 downto 0 do 
begin 
< set R[24+2*i], C; one; [11 >; if (j<>O) or (i<>O) 
then < set CW, C; one; [2.. 4*j+l+il >; if (j=7) and (i=3) then < set C, P-63; one; one >; 
end; 
< set C, P-LO1; one; [4*j+l.. 4*j+41 >; for i: =3 downto 0 do 
begin 
if (i=O) then < set R16, C; one; [11 > 
else < set R[23+2*i], C; one; >; 
if (j<>O) or (i<>O) 
then < set CW, C; one; [2.. 4*j+l+il >; if (j=7) and (i=3) 
then < set C, P-62; one; one >; 
end; 
< set C, P_LOO; one; [4*j+l.. 4*j+41 >; 
end; 
end; 
if store-low then 
begin 
< set P-HIO, C; one; [11 >; 
< set CW, C; one; [2. n] >; 
< set C, P_64; one; one >; 
end 
else begin 
< add P_HIO, P_HIO, P_HIO; one; 
< adc P_HII, P_HI1, P_HIl; one; 
< adc O, O, C; one; one >; 
< add P-HIO, CW, P_HIO; one; 
< add R30, R30,0; one; [11 
< adc P-HIO, O, P_HIO; one; 
end; 
end; 
one >; P-Hl: = P-HI shl I 
one >; 
[2. nl 
>; 
procedure sub_1024 (A, __HI, 
A, 
__LO, 
B_HI, B_LO, DIFF_HI, DIFF_LO 
: register); 
1024-bit subtraction in each processor row DIFF: = A-B 
Input: 
Each operand A (B) is distributed along the corresponding processor row 
in the registers A_HI, A_LO (B_HI, B_LO) 
least significant 32 bits in the leftmost processor columns 
most significant 32 bits in the rightmost processor columns 
output: 
Each result will be distributed over the corresponding row 
in 
f the registers DIFF-HI, DIFF_LO 
begin (n) 
116 
< sub A- LO, B_LO, DIFF 
_LO; one; one >; < sbc A_HI, B_HI, DIFF 
- 
HI; one; one >; 
< sbc O, O, C; one; on e >; generate carry-bit 
< add DIFF_LO, -1,0; one; one >; 
< adc DIFF_HI, -1,0; one; one >; 
< adc O, O, RI6; one; one >; 
< add R16, -I, O; one; one >; test if DIFF-LO=O and DIIFF-HI=O. 
I Since CW cannot be used as first operand (i. e. <if z CW, C, C; one; [2. n] >is 
not possible), the more efficient test <or DIFF_LO, DIFF_HI, 0; one; one> does 
not work on Systola 1024.1 
< ifz C, CW, C; one; [2.. nl >; propagate carry-bit 
< add DIFF_LO, CW, DIFF - 
LO; one; [2.. nl >; 
< adc DIFF_HI, CW, DIFF_HI; one; [2.. n] >; 
end; 
procedure subtraction(HI, LO: register); 
I perform steps 4 and 5 of Algorithm 4.1 1 
var i: integer; 
begin(n) 
< bank NORTH, 2,1, true; one; one >; 
< bank NORTH, 3,1, true; one; one >; 
< in NORTH, CN, CR24; one; one >; input N _LO 
< in NORTH, CN, CR25; one; one >; input N -HI 
< bank NORTH, 2,1, true; one; one >; 
< bank NORTH, 3,0, false; one; one >; 
< in NORTH, CN, CR27; one; one >; input N -63 
< in NORTH, CN, CR26; one; one >; input N _62 
< bank NORTH, 2,1, true; one; one >; 
< sub S62, Q62, S62; one; one >; 
< sbc S63, Q63, S63; one; one >; 
< sbc S64, Q64, S64; one; one >; 
sub_1024(S_LO1, S_LOO, Q_LOI, Q_LOO, S_LOI, S_LOO); 
for i: =O to 3 do 
begin 
sub-1024 (S_LO1, S-LOO, N_HI, N_LO, Q_LOI, 
Q_LOO) 
< sub S62, N62, S62; one; one >; 
< sbc S63, N63, S63; one; one >; 
< sbc S64,0, S64; one; one >; 
if (i=3) then 
begin 
< ifn S-LOO, Q_LOO, LO; one; one >; 
< ifn S-LO1, Q_LO1, HI; one; one 
>; 
end 
else begin 
< ifn S-LOO, Q_LOO, 
S_LOO; one; one >; 
< ifn S_LO1, Q_LO1, 
S_LO1; one; one >; 
end; 
end; 
end; 
117 
]begin (n) (* main *) 
< bank NORTH, 2,1, true; one; one >; reset banks < bank WEST, 2,1, true; one; one >; 
multl024(X_HI, X 
- 
LO, S_HI1, S_HIO, S_LO1, 
S_LOO, S62, S63, S64, WEST, true); 
I perform Step 1 of Alg. 4.1 in each processor row 
mult1024(S_HI1, S_HIO, X_HI, X 
- 
LO, O, 
0,0,0,0, NORTH, false); 
I perform Step 2 of AIg. 4.1 in each processor row 
multi 02 4 (X. HI, X-LO, Q_HI1, Q_ýHIO, Q-LO1 , Q_LOO, Q62, Q63, Q64, NORTH, true); 
I perform Step 3 of AIg. 4.1 in each processor row 
subtraction(X_HI, X_LO); 
perform Step 4 and Step 5 of Alg. 4.1 in each processor row 
end. 
A. 3.2 YEXP01024 
This program performs a 1024-bit modular squaring Y: = Y-Y mod N in each ISA processor 
row using the algorithm described in Section 4.4.3.1 (with the same N-value in each processor 
row). 
The program works in the same way as XEYPO 10 24 with only two differences 
o The registers Y-LO, Y_HI are used instead of X-HI, X-LO. 
e The results are written into the adjacent western IPs. 
program YEXPO1024; 
{#include io. lib) 
{#include fpsng. libl 
I Same constants and procedures as in YEXPO 1024 
procedure output-1024; 
begin(n) 
< set Y- LO, S[6,01 ; one; one >; 
< set Y HI, S[6,11; one; one >; 
< bank WEST, 2,1, true; one; one >; 
OutVecShift_sng(WEST, S[6l, one, n); 
end; 
ht=-c! ri-n (n) 
true: one; one >; reset banks < JDUIJY,,. IýI -I -- ---I 
< bank WEST, 2,1, true; one; one >; 
multl024(Y-HI, Y_LO, 
S_HI1, S_HIO, S_LOlI 
S-LOO, S62, S63, S64, WEST, true); 
I perform Step 1 in Alg. 4.1 in each processor row 
mul t 10 24 (S__ýHj 
1, S-HI 0, Y_HI, Y_LO, 0, 
0,0,0,0, NORTH, false); 
I perform Step 2 in Alg. 4.1 in each processor row 
Mult 10 24 (Yý_HI, 
Y_LO, Q_HI 1, Q_HI 0, Q-LO1 I 
(* main *) 
118 
Q_LOO, Q62, Q63, Q64, NORTH, true); 
I perform Step 3 in Alg. 4.1 in each processor row subtraction(Y_HI, Y_LO); 
I perform Steps 4 and 5 in Alg. 4.1 in each processor row Output_1024; write results into the western EPs 
end. 
A. 3-3 INIT 
This program initialises the registers N-LO, N-HI, SHIFT-REG and the multiplier in 
each processor. 
program init; 
J#include io. lib) 
const n=32; one: selector = 1^n; 
SHIFT_REG=RO; 
Y-LO=Rl; Y-HI=R2; X_HI=R3; X_LO=R4; 
N_LO=R5; N_HI=R6; 
var i, j: integer; 
begin (n) 
< initmult; one; one >; 
< bank NORTH, 3,1, true; one; one >; 
< bank WEST, 3,1, true; one; one >; 
< in NORTH, CN, C; one; one >; 
< set C, N_LO; one; one >; 
< in NORTH, CN, C; one; one >; 
< set C, N__HI; one; one >; 
< in WEST, CW, C; [I] ; one >; 
< set CN, C; [2.. n]; one >; 
< set C, SHIFT_REG; one; one >; 
< bank WEST, 2,1, true; one; one >; 
< bank NORTH, 2,1, true; one; one >; 
end. 
A. 3.4 INIT2 
This program initialises the registers N_62 and N_63 in each processor. 
program INIT2; 
{#include io. lib) 
const n=32; one: selector = 1^n; 
Y-LO=Rl; Y-HI=R2; X_HI=R3; X. LO=R4; 
N_LO=R5; Ný_HI=R6; N_62=R7; Ný_63=R8; 
var i, j: integer; 
begin(n) 
< bank NORTH, 3,0, false; one; one 
< in NORTH, CN, C; one; one >; 
< set C, N 63; one; one >; 
< in NORTH, CN, 
C; one; one >; 
< set c, Ný_62; one; one 
>; 
119 
< bank NORTH, 2,1, true; one; one 
end. 
A. 3.5 INPUT 
Each processor row initialises the two corresponding operands X and Y (Step I and Step 2 in Alg- 4.2b) in the registers X_LO and X-HI resp. Y-LO and Y-HI. 
Program input; 
f#include io. lib) 
const n=32; one: selector = 1^n; 
Y-LO=Rl; Y_HI=R2; X-HI=R3; X_LO=R4; 
var i, j: integer; 
begin (n) 
< bank WEST, 3,0, false; one; one >; 
for i: =32 downto 1 do 
for j: =l downto 0 do 
< in WEST, CW, CR[24+j]; one; 
< set R24, Y-LO; one; one >; 
< set R25, Y_HI; one; one >; 
< bank WEST, 2,1, true; one; one >; 
< set O, X - 
HI; one; one >; 
< set O, X-LO; one; [2. nl >; 
< sub 0, -I, X_LO; one; [11 >; 
end. 
A. 3.6 OUTPUT 
11. 
-il 
This program writes the 1024-bit result in each processor row - stored in X-LO and X-HI - 
into the western EPs. 
program output; 
{#include io. lib) 
{#include fpsng-lib) 
const n=32; one: selector = 1^n; 
Y-LO=Rl; Y_HI=R2; X_HI=R3; X_LO=R4; 
S-HII=R5; S_HIO=R6; S_LO1=R7; S_LOO=R8; 
Q_HI1=R9; Q_HIO=RIO; Q_LO1=Rll; Q_LOO=Rl2; 
var i: integer; 
begin (n) 
< set X LO, S[6,01; one; one >; 
< set X- HI, S[6,11 ; one; one >; 
< bank WEST, 2,1, true; one; one >; 
OutVecShift_sng(WEST, S[6l, one, n); 
< bank WEST, 2,1, true; one; one >; 
< bank WEST, 3,1, true; one; one >; 
end. 
A. 3.7 INTROINT 
init ISA. 
120 
PrOgram intro; 
t#include intl6. lib) 
begin (32) 
Intro-int 
end. 
121 
