Sparse Matrix Sparse Vector Multiplication using Parallel and Reconfigurable Computing by Baugher, Kirk Andrew
University of Tennessee, Knoxville 
TRACE: Tennessee Research and Creative 
Exchange 
Masters Theses Graduate School 
5-2004 
Sparse Matrix Sparse Vector Multiplication using Parallel and 
Reconfigurable Computing 
Kirk Andrew Baugher 
University of Tennessee 
Follow this and additional works at: https://trace.tennessee.edu/utk_gradthes 
 Part of the Electrical and Computer Engineering Commons 
Recommended Citation 
Baugher, Kirk Andrew, "Sparse Matrix Sparse Vector Multiplication using Parallel and Reconfigurable 
Computing. " Master's Thesis, University of Tennessee, 2004. 
https://trace.tennessee.edu/utk_gradthes/4651 
This Thesis is brought to you for free and open access by the Graduate School at TRACE: Tennessee Research and 
Creative Exchange. It has been accepted for inclusion in Masters Theses by an authorized administrator of TRACE: 
Tennessee Research and Creative Exchange. For more information, please contact trace@utk.edu. 
To the Graduate Council: 
I am submitting herewith a thesis written by Kirk Andrew Baugher entitled "Sparse Matrix Sparse 
Vector Multiplication using Parallel and Reconfigurable Computing." I have examined the final 
electronic copy of this thesis for form and content and recommend that it be accepted in partial 
fulfillment of the requirements for the degree of Master of Science, with a major in Electrical 
Engineering. 
Gregory D. Peterson, Major Professor 
We have read this thesis and recommend its acceptance: 
Donald W. Bouldin, Kwai L. Wong 
Accepted for the Council: 
Carolyn R. Hodges 
Vice Provost and Dean of the Graduate School 
(Original signatures are on file with official student records.) 

Sparse Matrix Sparse Vector Multiplication using Parallel and Reconfigurable 
Computing 
A Thesis 
Presented for the 
Master of Science 
Degree 
The University of Tennessee 
Kirk Andrew Baugher 
May 2004 
Dedication 
This thesis is dedicated to my loving wife and our families for their motivation 
and support, which has inspired me to push my goals higher and obtain them. 
Acknowledgements 
I wish to thank all of those who have helped me along my journey of completing 
my Master of Science degree in Electrical Engineering. I would especially like to thank 
Dr. Peterson for his patience, guidance, wisdom, and support for me in obtaining my 
degree. I would like to thank Dr. Bouldin for exposing me to microelectronic design and 
for serving on my committee. I would also like to thank Dr. Wong for his support and 
guidance and also serving on my committee. In thanking Dr. Wong, I wish to also thank 
him on behalf of the Joint Institute for Computational Science in Oak Ridge for making 
all of this happen by their support through graduate school. Finally, I wish to thank my 
peers who have encouraged and helped me make this all possible. 
Abstract 
The purpose of this thesis is to provide analysis and insight into the 
implementation of sparse matrix sparse vector multiplication on a reconfigurable parallel 
computing platform. Common implementations of sparse matrix sparse vector 
multiplication are completed by unary processors or parallel platforms today. Unary 
processor implementations are limited by their sequential solution of the problem while 
parallel implementations suffer from communication delays and load balancing issues 
when preprocessing techniques are not used or are unavailable. By exploiting the
deficiencies in sparse matrix sparse vector multiplication on a typical unary processor as
a strength of parallelism on a Field Programmable Gate Array (FPGA), the potential
performance improvements and tradeoffs for shifting the operation to a hardware assisted
implementation will be evaluated. This will be accomplished through multiple
collaborating processes designed on an FPGA.
Table of Contents 
Chapter
1 Introduction
2 Background
    2.1 Double Precision Floating-Point
    2.2 Sparse Representation
    2.3 Sparse Matrix Sparse Vector Multiplication
    2.4 Field Programmable Gate Arrays
    2.5 Pilchard System
    2.6 Computing Platform
3 Analysis of Related Work
    3.1 Floating Point Multiplication and Addition on FPGAs
    3.2 Sparse Matrix Vector Multiplication on FPGAs
    3.3 Sparse Matrix Vector Multiplication on a Unary Processor
    3.4 Sparse Matrix Vector Multiplication on Parallel Processors
    3.5 Sparse Matrix and Sparse Matrix Sparse Vector Multiplication
4 Design Approach
    4.1 Assumptions
        4.1.1 Limited IEEE 754 Format Support
        4.1.2 Use of Compressed Row Scheme
        4.1.3 Sparse Matrix and Sparse Vectors
        4.1.4 Generic Design Approach
        4.1.5 No Pre-Processing
        4.1.6 Addressable Range
    4.2 Analysis of the Problem
    4.3 Analysis of Hardware Limitations
    4.4 Partitioning of the Problem
        4.4.1 Transmission of Data
        4.4.2 Logic Flow
        4.4.3 Comparing of Addresses
        4.4.4 Multiply Accumulator
    4.5 FPGA Design
        4.5.1 Pilchard System Interface
        4.5.2 State Machine
        4.5.3 Comparators
        4.5.4 Multiply Accumulator Interface
        4.5.5 Double Precision Floating-Point Multiplier and Adder
        4.5.6 C code interface
5 Results
    5.1 Comparison of Results
    5.2 Difficulties
        5.2.1 Pcore Interface
        5.2.2 Memory and I/O Constraints
        5.2.3 Logic Glitch
6 Conclusions and Future Work
    6.1 Hardware Improvements
    6.2 FPGA Architecture Improvement
    6.3 Algorithmic Improvements
    6.4 Future Applications
    6.5 Conclusion
References
Appendices




List of Tables 
Table 2.1 - Floating-Point Value Range
List of Figures 




The implementation of sparse matrix sparse vector multiplication on a 
reconfigurable computing platform provides a unique solution to limitations often 
encountered in software programming. Typical software programming languages such as
C, C++, and Fortran are usually used in scientific computing applications. The drawback
to using such languages as the primary method of solving equations or systems of
equations is that their programs are executed in a sequential fashion.
Many applications based on software languages such as C, C++, or Fortran can all 
be implemented in some fashion on parallel machines to help improve their performance. 
This can be accomplished using parallel platforms such as MPI [1] or PVM [2]. While
using these parallel tools to implement sparse matrix sparse vector multiplication can 
improve the computational performance, a cost is paid for communication over the 
network of parallel machines. In addition to the parallel communication cost, efficiently 
distributing the workload between machines can be challenging. When designing parallel 
architectures, the problem must be broken down into the ideal granularity to distribute 
between machines to achieve the best possible load balance. Unfortunately, if the sparse 
matrix is structured and that structure is unknown before designing the system, there is no 
way of achieving optimal load balance without dynamic scheduling of tasks. While 
dynamic scheduling may then improve performance, its overhead also cuts into
performance.
In this thesis, the focus will be on the performance of one processor
accompanied by an FPGA compared to a stand-alone processor. The performance
comparison is limited in this way for two reasons: designing a parallel
computer architecture specifically for this comparison is too costly, and if the FPGA
assisted processor yields better performance than one processor, then both systems
could arguably be expected to scale equally to parallel machines, provided that
identical parallelization schemes benefit both designs equally.
The type of data supported for the sparse matrix sparse vector multiplication is
double precision floating-point. This data type corresponds to usage in real scientific
applications of sparse matrix sparse vector multiplication, as scientific computations
are typically concerned with data precision and accuracy. This way more reasonable
performance measures can be obtained for actual computation times, providing a level of
realism and not just theoretical or simulated results. The particular format for the double
precision floating-point values used is the IEEE 754 standard [3]. The IEEE
standard is recognized worldwide and is a logical choice for use as a standard to represent
the floating-point values used here. The difficulty in using the double precision
floating-point format is the bandwidth that the data type commands, as it uses 64 bits to
represent one piece of data, putting a strain on I/O and memory.
The following chapter will provide background into the IEEE 754 floating-point
standard representation, floating-point multiplication and accumulation, sparse matrix
and sparse vector representation, FPGAs, the Pilchard System [4], and the computer
system used. The remaining chapters will discuss areas of related work, the overall
design approach, results, future work, and conclusions describing the successes and




2.1 Double Precision Floating-Point 
For double precision floating-point data the IEEE 754 format was utilized. It is
important that the format be defined, as it determines how a double precision value in C
maps to its binary representation in memory and in the FPGA. This format
then ensures compatibility so long as the compiler used for the software code supports the
IEEE 754 double precision floating-point standard.
The double precision standard calls for values to be represented by a specific 64-bit
structure. As can be seen in Figure 2.1 below, the binary structure is broken up into
three sections: the sign bit, exponent bits, and fraction bits. The exponent field
is 11 bits wide while the fraction is represented by 52 bits of precision. The exponent
is biased by 1023, i.e. if the exponent field equals 1023, the value's actual exponent
equals 0.
s - sign bit
e - exponent bits
f - fraction bits

widths:    1       11          52
         +---+-----------+-----------+
fields:  | s |     e     |     f     |
         +---+-----------+-----------+
order:        msb     lsb msb     lsb
Figure 2.1 - Floating-Point Representation
Table 2.1 - Floating-Point Value Range
e               f             Value
e = 2047        f ≠ 0         NaN
e = 2047        f = 0         (-1)^s ∞
0 < e < 2047    Don't care    (-1)^s 2^(e-1023) (1.f)
e = 0           f ≠ 0         (-1)^s 2^(-1022) (0.f)
e = 0           f = 0         0
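The cases in Table 2.1 can be sketched in C as a small classifier. This is only an illustration under the assumption that the platform stores a C double in the IEEE 754 64-bit format; the names dp_class and classify_double are hypothetical and do not come from the thesis code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical labels for the five rows of Table 2.1. */
typedef enum {
    DP_NAN,          /* e = 2047, f != 0 */
    DP_INFINITY,     /* e = 2047, f = 0  */
    DP_NORMALIZED,   /* 0 < e < 2047     */
    DP_DENORMALIZED, /* e = 0, f != 0    */
    DP_ZERO          /* e = 0, f = 0     */
} dp_class;

/* Classify a double by inspecting its exponent and fraction fields.
   Assumes doubles are stored in the IEEE 754 64-bit format. */
dp_class classify_double(double v)
{
    uint64_t bits, e, f;

    assert(sizeof bits == sizeof v);
    memcpy(&bits, &v, sizeof bits);          /* view the 64 raw bits */

    e = (bits >> 52) & 0x7FFu;               /* 11 exponent bits */
    f = bits & ((UINT64_C(1) << 52) - 1);    /* 52 fraction bits */

    if (e == 2047)
        return f != 0 ? DP_NAN : DP_INFINITY;
    if (e == 0)
        return f != 0 ? DP_DENORMALIZED : DP_ZERO;
    return DP_NORMALIZED;
}
```

The sign bit is not examined because it does not change which row of the table applies.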
Depending on the value of the three components, the value of the floating-point
number is determined by Table 2.1. In general, for 0 < e < 2047, the formula used to
represent a number from its binary floating-point representation is
$$V = (-1)^{s} \cdot \left(1 + \left[f(51)\,2^{51} + f(50)\,2^{50} + \cdots + f(0)\,2^{0}\right] \cdot 2^{-52}\right) \cdot 2^{(e-1023)}$$
where f(i) is bit i of the 52-bit fraction.
The leading 1 is an implied 1 that is prepended to the fraction; it is not stored in the
64-bit word. An example of going from scientific notation to binary floating-point
representation is below:
If converting 1.1e1 to its 64-bit double precision floating-point value
1. Convert 1.1e1 to its decimal representation = 11
2. Convert 11 to its binary representation = 1011
3. The leading bit is the implied 1 and is not stored, therefore move the
binary point left until it is just to the right of the leading 1
= 1.011
4. Since the binary point was moved 3 places, the actual exponent is 3
5. Add the bias of 1023 to the exponent and convert to binary: e = 1026 = 10000000010
6. Now the
f = 0110000000000000000000000000000000000000000000000000
and it is positive so s = 0
7. v =
0 10000000010 0110000000000000000000000000000000000000000000000000
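The worked example above can be checked with a short C sketch that splits a double into its three fields and rebuilds the value from them. It assumes the compiler stores doubles in the IEEE 754 format; the type double_fields and the functions decode_double, two_to_the, and encode_normalized are illustrative names, not part of the thesis code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the three IEEE 754 double precision fields. */
typedef struct {
    unsigned sign;      /* s: 1 bit */
    unsigned exponent;  /* e: 11 bits, biased by 1023 */
    uint64_t fraction;  /* f: 52 bits */
} double_fields;

/* Split a double into s, e, and f. */
double_fields decode_double(double v)
{
    uint64_t bits;
    double_fields d;

    assert(sizeof bits == sizeof v);
    memcpy(&bits, &v, sizeof bits);          /* view the 64 raw bits */

    d.sign     = (unsigned)(bits >> 63);
    d.exponent = (unsigned)((bits >> 52) & 0x7FFu);
    d.fraction = bits & ((UINT64_C(1) << 52) - 1);
    return d;
}

/* 2^n by repeated doubling or halving; exact for -1022 <= n <= 1023. */
double two_to_the(int n)
{
    double p = 1.0;
    for (; n > 0; --n) p *= 2.0;
    for (; n < 0; ++n) p /= 2.0;
    return p;
}

/* Rebuild a normalized value (0 < e < 2047) from its fields:
   V = (-1)^s * (1 + f * 2^-52) * 2^(e - 1023). */
double encode_normalized(double_fields d)
{
    double significand;

    assert(d.exponent > 0 && d.exponent < 2047);
    significand = 1.0 + (double)d.fraction * two_to_the(-52);
    if (d.sign)
        significand = -significand;
    return significand * two_to_the((int)d.exponent - 1023);
}
```

For the value 11.0, decode_double is expected to report s = 0, e = 1026, and a fraction whose top three bits are 011, matching steps 5 through 7.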
2.2 Sparse Representation 
Sparse matrices or vectors can be defined as a matrix or vector that is sparsely
filled with nonzero data. So for example, a matrix may have only 10% of its elements
filled with nonzeros. Due to this large number of zero values, it is not practical to
spend time operating on or accessing zeros; therefore, special methods or representations
have been designed to compress their storage of data. In short, a matrix or vector is
considered sparse when it contains enough zero elements that using special measures
to index it becomes worthwhile [5]. Some sparse matrices are structured, where the data
appears to have some sort of pattern, while other sparse matrices are irregular and
therefore have no pattern. Comparing the following two figures, Figure 2.3 has a diagonal
pattern while Figure 2.2 has no such pattern.
Figure 2.2 - Irregular Sparse Matrix
Figure 2.3 - Structured Sparse Matrix
 10   0   4  -1   0   0
 -4   9   0   0  -1   0
  0   8   0   3   0  -1
  1   0  -3   0   0   7
  0   1   0   0   6   2
  0   0   1   0  -2   5
Figure 2.4 - Sparse Matrix
Because these structures are filled with a high percentage of zeros, it is best to use 
a format to only represent the nonzero values so time and memory space are not wasted 
on processing or storing zeros. Some popular formats for storing sparse matrices and 
vectors are the Compressed Row, Compressed Column, and Coordinate Storage Schemes 
(CRS, CCS, CSS) [6]. The matrix in Figure 2.4 above would have the following 
representations for these three schemes: 
Compressed Row Scheme
Val(i) = (10,4,-1,-4,9,-1,8,3,-1,1,-3,7,1,6,2,1,-2,5)
Col(i) = (0,2,3,0,1,4,1,3,5,0,2,5,1,4,5,2,4,5)
Rowptr = (0,3,6,9,12,15,18)
Compressed Column Scheme
Val(i) = (10,-4,1,9,8,1,4,-3,1,-1,3,-1,6,-2,-1,7,2,5)
Row(i) = (0,1,3,1,2,4,0,3,5,0,2,1,4,5,2,3,4,5)
Colptr(i) = (0,3,6,9,11,14,18)
Coordinate Storage Scheme
Val(i) = (10,4,-1,-4,9,-1,8,3,-1,1,-3,7,1,6,2,1,-2,5)
Row(i) = (0,0,0,1,1,1,2,2,2,3,3,3,4,4,4,5,5,5)
Col(i) = (0,2,3,0,1,4,1,3,5,0,2,5,1,4,5,2,4,5)
The Coordinate Storage Scheme is a typical representation of a matrix with the
data represented from left to right and top to bottom in three storage arrays. The arrays
hold the column addresses, row addresses, and values, respectively. The Compressed Row
Scheme stores the values
and column addresses in two separate arrays in the same order as the Coordinate Storage
Scheme, except the row pointer, or "Rowptr", array stores the index of the first number in
each row of the value array, followed by a final entry equal to the total number of stored
values. As can be observed, less storage room is necessary when
using the row pointer array versus a full row array as in the Coordinate Storage Scheme.
This can become very important as the amount of data becomes large. The Compressed
Column Scheme works like the Compressed Row Scheme except that values are stored
with respect to column order, the row addresses are stored in an array, and it has a column
pointer array instead of a row pointer array.
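As an illustration of how the Compressed Row Scheme arrays relate to the full matrix, the following C sketch builds Val, Col, and Rowptr from a dense row-major matrix. It is a minimal example under the assumption that the caller has already allocated arrays large enough for every nonzero; the function name dense_to_crs and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Convert a dense row-major n_rows x n_cols matrix into the Compressed
   Row Scheme. val and col must have room for every nonzero, and rowptr
   must have room for n_rows + 1 entries. Returns the number of nonzeros
   stored. */
size_t dense_to_crs(const double *dense, size_t n_rows, size_t n_cols,
                    double *val, int *col, int *rowptr)
{
    size_t nnz = 0;

    assert(dense && val && col && rowptr);
    for (size_t r = 0; r < n_rows; ++r) {
        rowptr[r] = (int)nnz;   /* index of the first stored value of row r */
        for (size_t c = 0; c < n_cols; ++c) {
            double a = dense[r * n_cols + c];
            if (a != 0.0) {     /* zeros are skipped, not stored */
                val[nnz] = a;
                col[nnz] = (int)c;
                ++nnz;
            }
        }
    }
    rowptr[n_rows] = (int)nnz;  /* one past the last stored value */
    return nnz;
}
```

Applied to the 6 x 6 matrix of Figure 2.4, this sketch should reproduce the Compressed Row Scheme arrays listed above, with 18 stored values and 7 row pointer entries instead of the 18 row addresses that the Coordinate Storage Scheme needs.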
The most popular storage format typically used with sparse matrix sparse vector 
multiplication is the Compressed Row Scheme as it lends itself well to coding and 
memory access for improved performance with respect to this problem. An advantage of 
using these schemes is that pointer indexing can be used for the arrays, which is faster for 
indexing large arrays than actual array indexing if programming in C. Because of this 
advantage, linked lists are usually not used with large arrays. 
Storage for a sparse vector is simpler than for matrices because it only requires
two arrays to store information instead of three. One array stores the values while the
other array stores each value's vector address. This can be seen in Figure 2.5 below.
Val(i) = (1,2,3)
Row(i) = (0,3,4)
Figure 2.5 - Sparse Vector
2.3 Sparse Matrix Sparse Vector Multiplication 
Sparse Matrix Sparse Vector Multiplication is simply the multiplication of a
sparse matrix by a sparse vector. The general format follows typical matrix vector
multiplication except that it would be a waste of time to multiply zeros by any number.
Figure 2.6 illustrates this dilemma. To avoid this waste, a storage scheme is
used to hold the data for the sparse structures. Because of the storage scheme, matrix
vector multiplication is no longer a straightforward operation. The column address of the
current matrix value in the row being multiplied must correspond with an existing row
address of the vector. If there is a match, then the two corresponding values can be multiplied
together. This situation can be observed in Figure 2.7. If the first row of the sparse
matrix from Figure 2.4 was multiplied by the sparse vector in Figure 2.5, the resulting
answer would be 10*1 + -1*2 = 8. The C code to implement this was derived from the
algorithm for sparse matrix vector multiplication, where the vector is dense and the
Compressed Row Scheme is used. The col(j) array is the column address array for the
sparse matrix and directly maps each matrix value to its matching value in the dense vector.
Do I = 1 to number of rows
    Sum(I) = 0
    Do j = Rowptr(I) to Rowptr(I+1)-1
        Sum(I) = Sum(I) + matrix(j)*vector(col(j))
    End Do
End Do
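The pseudocode above could be written in C roughly as follows. This is a sketch with zero-based indexing and illustrative names (crs_matvec_dense, val, col, rowptr); it is not the code from the thesis appendices.

```c
#include <assert.h>
#include <stddef.h>

/* Multiply a Compressed Row Scheme matrix by a dense vector.
   sum must have room for n_rows results. */
void crs_matvec_dense(size_t n_rows, const double *val, const int *col,
                      const int *rowptr, const double *vector, double *sum)
{
    assert(val && col && rowptr && vector && sum);
    for (size_t i = 0; i < n_rows; ++i) {
        double acc = 0.0;
        /* rowptr[i] .. rowptr[i + 1] - 1 are the stored values of row i */
        for (int j = rowptr[i]; j < rowptr[i + 1]; ++j)
            acc += val[j] * vector[col[j]];  /* column address indexes the vector directly */
        sum[i] = acc;
    }
}
```

Each stored matrix value is visited exactly once, so the work grows with the number of nonzeros rather than with the full matrix size.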
Figure 2.7 - Sparse Vector Multiplication
Because the vector is sparse in the case of this thesis, the matrix value's column
address must be compared to the vector's row address and cannot be directly mapped as
above. If the matrix address is less than the vector address, then the next matrix value's
address must be retrieved. If the matrix address is greater than the vector
address, then the next vector value's address must be retrieved. If they both match, then
the two values are multiplied together. This algorithm can be seen in Appendix B
and is the C code that will be compared against the FPGA assisted processor.
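The address comparison described above can be sketched in C as follows. This is not the Appendix B code; it is a minimal illustration that assumes the column addresses within each matrix row and the vector addresses are both stored in ascending order, and all names (crs_matvec_sparse, vec_val, vec_addr, vec_nnz) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Multiply a Compressed Row Scheme matrix by a sparse vector stored as
   (vec_val, vec_addr) pairs. Addresses in each matrix row and in the
   vector are assumed to be sorted in ascending order. */
void crs_matvec_sparse(size_t n_rows, const double *val, const int *col,
                       const int *rowptr,
                       const double *vec_val, const int *vec_addr,
                       size_t vec_nnz, double *sum)
{
    assert(val && col && rowptr && vec_val && vec_addr && sum);
    for (size_t i = 0; i < n_rows; ++i) {
        int j = rowptr[i];            /* current matrix value in row i */
        int end = rowptr[i + 1];
        size_t k = 0;                 /* current vector value */
        double acc = 0.0;

        while (j < end && k < vec_nnz) {
            if (col[j] < vec_addr[k]) {
                ++j;                  /* matrix address is behind: next matrix value */
            } else if (col[j] > vec_addr[k]) {
                ++k;                  /* vector address is behind: next vector value */
            } else {
                acc += val[j] * vec_val[k];   /* addresses match: multiply accumulate */
                ++j;
                ++k;
            }
        }
        sum[i] = acc;
    }
}
```

With the matrix of Figure 2.4 and the vector of Figure 2.5, the first row produces 10*1 + -1*2 = 8, as in the example given earlier. Every loop iteration performs a comparison but only some perform a multiplication, which is the comparison overhead the thesis targets.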
The large amount of comparing necessary to implement the sparse matrix sparse
vector multiplication is where the sequential nature of software programming becomes a
weakness. Unfortunately for this algorithm, no optimization exists for an implementation
used for both structured and irregular sparse matrices. Specific algorithms can be created
for the optimization of structured sparse matrices but such an approach is beyond the
scope of this thesis. The critical question is how often address matches
typically occur; this cannot be answered unless the sparse matrix and vector
formats are known in advance, and it affects the load balance of the problem.
2.4 Field Programmable Gate Arrays 
Field Programmable Gate Arrays or FPGAs are prefabricated rows of transistor
and logic level gates attached to electronically programmable switches on a chip. Many
different tools exist to program an FPGA. Typically a
Hardware Description Language (HDL) is used to describe the design at the behavioral or
register transfer level (RTL). VHDL and Verilog are the two most popular
hardware description languages. Programming an FPGA involves the development of
"processes". A process is essentially a set of digital logic that continuously runs.
Creating multiple processes on one FPGA in essence creates a parallel architecture on a
chip that handles information on a bit or signal level. The ability to create multiple
processes all running simultaneously, sharing and computing information on a bit level,
gives FPGAs the capability to handle the processing of specific problems efficiently. The
more gates or transistors that are on one FPGA, the more data an FPGA can process at
one time. Because FPGAs are electronically reprogrammable, designs can quickly be
loaded, erased, and upgraded provided that designs have already been developed. Due to
the portable nature of HDLs, HDL designs can be used on many different FPGAs.
The use of FPGAs when improving existing problems is usually targeted to
exploit any and all redundancy and maximize parallelism through multiple processes.
This allows FPGAs to outperform software programs where large
amounts of redundant information are processed or parallelism can be exploited. Depending
on the complexity of the design, interfacing and synchronizing multiple processes can be
difficult. If used correctly, FPGAs could demonstrate beneficial performance
improvements.
2.5 Pilchard System 
The Pilchard System [4] is an FPGA based platform that was developed by the 
Chinese University of Hong Kong to add FPGA functionality to an existing computer. 
While other systems that add FPGA functionality to computers utilize the PCI bus of a 
computer to interface with an FPGA, the Pilchard System uses the memory bus. 
Essentially an FPGA has been placed on a board which fits into a DIMM memory slot on 
a computer and can be accessed using special read and write functions in C as if writing 
and reading to and from the computer's main memory. The advantage the Pilchard 
System provides is the use of the faster memory bus over the slower PCI bus, which 
allows for higher communication speeds in data processing. 
The Pilchard System has a relatively light interface that helps HDL programmers
spend less time learning the system and allows for more room on the FPGA to be
utilized. The Pilchard System in use has a Xilinx Virtex 1000-E FPGA on the board.
Figure 2.8 is a picture of the Pilchard System.
Figure 2.8 - Pilchard System
2.6 Computing Platform 
The computing platform was kept the same for both the FPGA
assisted computer and the computer performing the software-only sparse matrix sparse
vector multiplication. The computer system used has a 933 MHz
Pentium III processor with a 64-bit, 133 MHz memory bus that the Pilchard System has
access to. The operating system is Mandrake Linux version 8.1 with a Linux kernel no




Analysis of Related Work 
3.1 Floating Point Multiplication and Addition on FPGAs 
In the early stages of FPGA development and usage exploration, it was deemed 
that FPGAs were not suitable for floating-point operations. This was mainly due to the 
low density of early FPGAs being unable to meet the high demands of resources by 
floating-point operations [8]. Floating-point operations involve separate processes to 
handle the exponents, sign values, and fractions. These operations must normalize 
portions of the floating-point data as well. 
Shirazi, Walters, and Athanas [8] demonstrated in 1995 that FPGAs had become a viable
medium for floating-point operations, as Moore's Law had had time to alter the FPGA
landscape. Designs were created that supported eighteen- and sixteen-bit floating-point
adders/subtractors, multipliers, and dividers. Shirazi et al. reported tested speeds of 10
MHz in their improved methods for handling addition/subtraction and multiplication, all
using three stage pipelines. The multiplier had to be placed on two Xilinx 4010 FPGAs.
Seven years later in 2002, Lienhard, Kugel, and Manner [9] demonstrated the
ability of the FPGA technology of that time and its profound effect on floating-point
operations conducted on FPGAs. Essentially Moore's Law had continued to provide
greater chip density as faster silicon was being produced. These authors reported design
frequencies ranging from 70 to 90 MHz for signed addition and 60 to 75 MHz for
multiplication.
In comparing the improvements in floating-point calculations over the last several
years, it has become apparent that floating-point operations can be done efficiently and
effectively on FPGAs, thus lending themselves to co-processor uses.
3.2 Sparse Matrix Vector Multiplication on FPGAs 
Sparse matrix vector multiplication is the multiplication of a sparse matrix and a
dense vector. Minimal work has actually been documented in applying this to FPGAs.
The need always exists for faster methods of handling sparse matrix vector
multiplication; however, the lack of published FPGA implementations leaves
little guidance for possible future implementations.
ElGindy and Shue [10] implemented a sparse matrix vector multiplier on an
FPGA based platform. In their research they used the PCI-Pamette, which is a PCI
board developed by Compaq that houses five FPGAs with two SRAMs connected to two
of the FPGAs. The implementations explored used one to three multipliers and the
problem is described as a bin-packing problem. The bin-packing side of the problem is
handled by preprocessing on the host computer and the constant, or vector, values are
stored before computation times are observed. When comparing results obtained, the
single multiplier is outperformed by the other two methods and by software. All of the
execution times grew quadratically as the size of the matrix grew, giving the performance
an O(n^2) appearance. The dual multiplier saw results close to that of the software
multiplier and the triple multiplier showed some improvements in performance over the
software multiplier. Performance was measured in clock ticks with the triple multiplier
taking roughly 200 clocks; the software and dual multipliers were around 50% slower and
the single multiplier was almost 4 times as slow as the triple multiplier. How these
multipliers are developed is not discussed in any detail. The performances of the FPGA
based implementations are only given for the core multiplication. No information is
provided as to how the preprocessing times affect results and if preprocessing is also
done for the software version.
3.3 Sparse Matrix Vector Multiplication on a Unary Processor 
Sparse matrix vector multiplication on a single processor is widely used in
scientific computing and circuit simulations, among various other fields. Even though
the use of sparse matrix vector multiplication varies widely across industries, the basic
form remains unchanged. Wong [6] provides a simple model to compute sparse matrix
vector multiplication in compressed row storage and compressed column storage formats.
The very same format can be seen in multiple resources found through Netlib.org [11], a
major website that provides vast amounts of efficient computing algorithms in various
programming languages. The formula driving this algorithm was previously mentioned
in section 2.3 in compressed row storage. This algorithm simply uses a column address
(assuming compressed row storage) from the sparse matrix to pull the appropriate vector
data out for multiplication and accumulation, and can be performed in O(n) time where n
represents the number of nonzero elements in the sparse matrix. No more efficient
implementation of this algorithm has been found for sequential designs.
3.4 Sparse Matrix Vector Multiplication on Parallel Processors 
Implementing sparse matrix vector multiplication on parallel processors has been
done with success. In general, the problem is distributed by rows of the sparse matrix
across the parallel processors. Wellein et al. [12] demonstrated that use of parallel
machines could provide performance improvements that improve linearly with the
number of processors added to the overall design. Performance was measured in
gigaflops. Some of the machines that were used were vector computers and then-current
supercomputers such as the SGI Origin3800, NEC SX5e, and Cray T3E, to name a few.
Gropp et al. [13] provide ways of analyzing realistic performance that can be
achieved on processors and parallel processors by simply evaluating the memory bus
bandwidth available. They state that the sparse matrix vector multiplication
algorithm is a mismatch for today's typical computer architecture, as can be seen by the
low ratio of observed performance to the peak performance available from processors.
Geus and Rollin [14] evaluated the problem to improve eigenvalue solution 
performance. Eigenvalue problems can compute sparse matrix vector multiplication 
"several thousand times" for large sparse matrices and thus take up "80 to 95% of the 
computation time." Performance speedup was achieved by "pipelining the software" by 
forcing the compiler to prefetch data. Matrix reordering and register blocking provided 
some additional improvements as well. These additions help improve performance in a 
supporting sense. The same preprocessing techniques can be applied when targeting 
designs to an FPGA. What makes Geus and Rollin's research applicable is that they 
ran their parallel implementation on more standard parallel computing 
platforms. The workload was again distributed by rows, more specifically in this case, 
blocks of rows per processor. Performance improvements ranged from 48% (DEC 
Alpha) to 151% (IBM SP2). These results also demonstrated that the inherent problem 
scales well. 
3.5 Sparse Matrix and Sparse Matrix Sparse Vector Multiplication 
Virtually no resources are available in this area for reference, let alone discovery; 
however, the need exists for scientific computations. These computations are used in 
Iterative Solver [15] methods, Eigenvalue problems [6], and Conjugate Gradient methods 
[6]. Khoury [15] also stated the lack of existing information regarding this area. Khoury 
needed sparse matrix multiplication in solving blocked bidiagonal linear systems through 
cyclic reduction. Khoury had to develop a sparse matrix multiplication method due to 
being unable to find resources supporting such areas. Unfortunately, Khoury's results 
were skewed due to compilers unoptimizing the design and the heap not being cleaned 
appropriately. 
Sparse matrix multiplication is nevertheless an area of interest, because the very core 
of its operation is sparse vector multiplication. Sparse vector multiplication is also 
the basis behind sparse matrix sparse vector multiplication. Sparse matrix sparse vector 
multiplication can then be viewed as a core component of sparse matrix multiplication. By 
speeding up sparse matrix sparse vector multiplication, sparse matrix multiplication can 
be sped up as a result. 
Chapter 4 
Design Approach 
The flow of the design process involves making assumptions and providing an in-depth 
analysis of the problem. In observing the big picture considering sparse matrix 
sparse vector multiplication running in software on a stand-alone processor, the biggest 
possible limitation was considered to involve the sequential compares. Due to this 
observation, the FPGA design was built around the parallelization of the compares and 
the supporting components. The following sections will discuss the assumptions made, 
analysis of the problem, analysis of the hardware limitations, detailed partitioning of the 
problem, and design of the implementation on an FPGA. 
4.1 Assumptions 
To help constrain the problem to reasonable limitations to allow for an effective 
implementation of sparse matrix sparse vector multiplication, some necessary 
assumptions are required. All assumptions made apply to both the HDL and to the C 
algorithm used except where an assumption can only apply to the HDL. 
4.1.1 Limited IEEE 754 Format Support 
In the design of the floating-point multiplier and accumulator, support is not given 
to all features of the standard. The floating-point multiplier and adder can at the least 
handle NaN, zero, and infinite valued results, but that is all. Neither of the two supports 
rounding, invalid operations, or exceptions, including the handling of underflow and 
overflow. Overflow should not be an issue since rounding is not supported. Invalid 
operations are those such as a divide by zero, magnitude subtraction of 
infinities, and an operation involving a NaN, among other scenarios. A full list can be 
found in the IEEE 754 Standard. 
4.1.2 Use of Compressed Row Scheme 
The assumption is made that all sparse matrices used are formatted using the 
Compressed Row Scheme. This is so there are no discrepancies in performance of data 
that use different storage format schemes. This constraint also helps simplify the design 
process by limiting the support to one input format. The storage scheme will be 
combined with only using the C programming language to eliminate performance 
discrepancies across various programming languages. 
4.1.3 Sparse Matrix and Sparse Vectors 
It is assumed that the design involved in this scope of work is to improve 
performance of sparse matrix sparse vector multiplication. All matrix and vector 
structures that are not sparse will not have competitive performance results as that is out 
of the design scope; however, the ability for dense matrices and vectors to be solved will 
be available. This is necessary as a portion of a sparse matrix and sparse vector 
multiplication could have the potential of appearing dense. Since this possibility is 
supported, dense matrix vector multiplication can be accomplished but with a significant 
cost in performance. Dense matrices and vectors are still expected to conform to the 
Compressed Row Scheme. 
4.1.4 Generic Design Approach 
In the consideration and design of this sparse matrix sparse vector multiplication 
algorithm, a general approach towards the possible sparse matrix structure is assumed. 
Due to the vast types of structured sparse matrices, many designs would be necessary to 
cover them all. This design is to have the capability to solve any type of sparse matrix. 
This also makes the assumption that no optimizations are made towards any particular 
sparse matrix structure such that it might reduce the performance of a different structured 
sparse matrix. 
4.1.5 No Pre-Processing 
It is also assumed that no pre-processing of matrices or vectors will take place. It 
is recognized that pre-processing of sparse matrices and even sparse vectors can help 
improve performance; however, the implementation would then be constrained to one 
particular type of sparse matrix sparse vector multiplication. That would defeat the 
purpose of not optimizing for any one particular type of sparse matrix structure. 
4.1.6 Addressable Range 
The addressable range for data will be limited to 32 bits in any one particular 
dimension. This means the potential address span of a matrix could be 4,294,967,296 by 
4,294,967,296. The vector address range must also be able to support up to a 32-bit 
address value. 
4.2 Analysis of the Problem 
In analyzing the overall problem to solve the multiplication of sparse matrices 
with sparse vectors, one key critical area appears to allow for the most improvement 
given the sequential nature of the C programming language. This important area is the 
penalty paid by the C program if the address values for a matrix and vector value do not 
match. When this occurs, the program must then begin searching through the next 
address values of the matrix or vector, comparing the addresses one-by-one until a match 
is found or no match can exist. This searching adds another nested for loop to the 
algorithm, thus creating a potential O(n^3) worst case scenario to solve the matrix vector 
multiplication. It is this main point that the FPGA design will focus upon. 
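The search penalty can be made concrete with a minimal C sketch of sparse matrix sparse vector multiplication. It assumes the matrix is in compressed row storage and that the sparse vector is held as two parallel arrays of ascending addresses and values. The function name smsv_multiply and all array names are assumptions for illustration; this is not the C program used in this work.

```c
#include <stddef.h>
#include <stdint.h>

/* y = A * v, where A is in compressed row storage and v is sparse.
 * vec_addr[j] / vec_val[j] : address and value of the j-th vector nonzero,
 *                            with vec_addr sorted in ascending order
 * All names are illustrative assumptions. */
void smsv_multiply(size_t n_rows, const double *val, const uint32_t *col_idx,
                   const size_t *row_ptr, size_t vec_nnz,
                   const uint32_t *vec_addr, const double *vec_val, double *y)
{
    for (size_t i = 0; i < n_rows; i++) {
        double sum = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
            /* Sequential search: compare addresses one by one until a
             * match is found or the sorted vector addresses pass the
             * matrix column address, so that no match can exist. */
            for (size_t j = 0; j < vec_nnz && vec_addr[j] <= col_idx[k]; j++) {
                if (vec_addr[j] == col_idx[k]) {
                    sum += val[k] * vec_val[j];
                    break;
                }
            }
        }
        y[i] = sum;
    }
}
```

The innermost loop is the sequential compare that runs once per matrix nonzero; it is the third level of nesting behind the worst case noted above.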
As mentioned earlier, the sequential nature of the C code prevents the algorithm 
from handling concurrency in the processing of a piece of data multiple times at once. 
The more parallelism that can be explored and put to use in the FPGA design, the greater 
the benefit can become for using an FPGA. 
4.3 Analysis of Hardware Limitations 
The hardware limitations imposed by the equipment being used are important to 
mention, because they ultimately have a significant impact on the overall design. Limitations 
can be found in the FPGA, Pilchard System, memory bus, and computer system used, 
with the underlying theme of memory limitations. 
The FPGA used, the Xilinx Virtex 1000-E, is itself limited in 
resources. This FPGA part has approximately 1.5 million gates, 12,288 slices and 4Kbits 
of block RAM. While this may appear like plenty, the double precision multiply 
accumulator uses 26% of the available slices, 18% of the FF Slices, 21% of the LUTs, 
and 120,000 gates. If using Dual Port RAM [16] IP from Xilinx to hold data leaving and 
entering the Pilchard System as is customarily done, that will cost over 70,000 gates. 
Very quickly 20-25% of the FPGA's resources have been used as an absolute minimum 
for the sparse matrix sparse vector multiplication design to work with. While the design 
will likely fit, room for improvements like adding an additional adder or even multiply 
accumulator will become difficult if not impossible. Another issue regarding limitations 
of the current FPGA is its age. The Xilinx part being used is becoming obsolete, as there 
are much larger and faster FPGA parts available today. While how a design is created in 
HDL has the largest effect on overall system speed, that overall system speed is limited 
by the speed of the logic available on the FPGA itself. If a faster and larger chip were to 
be available the design would have better performance as more parallel compares could 
also fit. The effects the FPGA size and speed have on the overall design will be explored 
further in the Results, and Conclusions and Future Work Chapters. 
The limitations of the Pilchard System's interface affect the overall I/O bandwidth 
of the FPGA system. Figure 4.1 below displays a behavioral view of the memory bus 
and Pilchard connection. It is highly unlikely that the sparse matrix sparse vector code 
design will run at the Pilchard interface's top speed; therefore, it only makes sense to run 
the sparse matrix sparse vector multiplication code at half the Pilchard System's speed. 
This way, for every two clock cycles of the Pilchard, the sparse code can work on twice the 
amount of data as it would have been able to if it could have worked at twice the speed. 
Figure 4.1 - Memory Bus to Pilchard: Behavioral View 
This then puts more pressure on the Pilchard System's interface to operate at a higher 
speed since the code beneath it will be running at half that. Unfortunately, simply 
passing information through the Pilchard at its top speed of 133 MHz is too difficult for it 
to handle. This makes the target top speed for the code underneath it (likely limited by 
the floating-point unit speeds) slower than hoped. Because the Pilchard operates at twice 
the clock speed of the sparse operating code, the Pilchard needs to read in two 64-bit 
values from the memory bus in two clock cycles so that it may send 128 bits for every 
sparse code clock cycle. Although the code to handle this is relatively straightforward 
and not very complicated, producing results that allow the Pilchard to operate at 
100 MHz will remain a challenge. 
An additional limitation of the Pilchard System is the lack of onboard RAM or 
cache. This requires that the Pilchard then take the time to access main memory, which is 
costly, while the C code has the benefit of being able to take advantage of cache. If the 
Pilchard Board were to have onboard RAM and/or cache, the entire vector and extremely 
large portions of the matrix could quite possibly be stored right on the board itself, saving 
the Pilchard System and sparse matrix sparse vector code time in having to constantly use 
and compete for the memory bus for data. 
Another major limitation is the memory bus itself. The memory bus operates at 
133 MHz and is 64 bits wide; therefore only one double precision floating-point value can 
be passed per bus clock cycle. This will put a significant strain on the memory bus as 
potentially thousands to hundreds of thousands of double precision values will be passed 
along the memory bus alone, not to mention all of the 32-bit address values that need to 
be compared. Two 32-bit address values can be passed per clock cycle. 
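The bus arithmetic above can be written out as a small C sketch. The constants come from the figures stated in the text; the function names are assumptions for illustration, and the result is a theoretical peak that ignores any protocol overhead.

```c
#include <stdint.h>

/* Bus parameters stated in the text. */
#define BUS_WIDTH_BITS 64u
#define BUS_CLOCK_HZ   133000000u

/* Peak bytes per second the bus can move: one 64-bit word per clock. */
uint64_t bus_peak_bytes_per_second(void)
{
    return (uint64_t)BUS_CLOCK_HZ * (BUS_WIDTH_BITS / 8u);
}

/* Bus clock cycles needed to pass n_values doubles (one per cycle)
 * and n_addresses 32-bit addresses (two per cycle, rounded up). */
uint64_t bus_cycles(uint64_t n_values, uint64_t n_addresses)
{
    return n_values + (n_addresses + 1u) / 2u;
}
```

Under these assumptions the peak is 1,064,000,000 bytes per second, and two values with their two addresses occupy three bus cycles.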
4.4 Partitioning of the Problem 
With the necessary assumptions made, analysis of the problem completed, and 
hardware limitations explored, the problem can then be partitioned. When partitioning a 
problem four main factors need to be considered to achieve the best possible load 
balance. These factors are decomposition, assignment, orchestration, and mapping [17]. 
Decomposition involves exposing enough concurrency to exploit parallelism, but not too 
much such that the cost of communication begins to outweigh the benefits of parallelism. 
Assignment considers the assignment of data to reduce communication between 
processors and balance workload, and efficiently interfacing parallel processes is what 
orchestration entails. This means reducing communication through data locality, 
reducing synchronization costs, and effective task scheduling. Mapping is simply 
exploiting existing topology and fitting as many processes on the same processor as 
effectively as possible. 
Altering one of these attributes of a parallel design affects the other attributes. 
Ideally some sort of complete balance is achieved between them all. These attributes will 
be addressed specifically or implied as the problem is partitioned in the subsequent 
sections. The greater the understanding of both the software and hardware issues, the 
more effective the partitioning process can be, which leads to a more complete design. 
The decomposition and mapping stages are essentially predetermined due to hardware 
limitations and data format already being determined. The data has already been 
decomposed into 64-bit double precision floating-point values and 32-bit address values. 
The only other area of decomposition is in the parallel comparators, which is attempting 
to create the maximum number of parallel compares on the FPGA. As for mapping, the 
goal is for the entire sparse matrix sparse vector architecture to fit on the FPGA chip 
provided. The problem will be analyzed with the flow of data as it moves from the 
memory bus to the FPGA to the multiply accumulators. Figure 4.2 provides a general 
architecture for the possible design. 
4.4.1 Transmission of Data 
The transmission of data encompasses several different issues. Those issues 
include the transmission and storage of the sparse vector, sparse matrix, answers, and any 
handshaking if necessary. 
In the handling of the sparse vector, consideration must be given towards either 
the storage of the vector addresses and vector data, or just the vector addresses. Because 
Figure 4.2 - Basic Architecture 
only makes sense to store the vector information and not resend the same information 
repeatedly. In determining how much of the sparse vector to store, as much as 
reasonably possible should be stored due to the large amount of reuse in the comparison 
of matrix column addresses and vector addresses. If only the vector addresses are stored, 
it would result in a reduced overhead for storing the vector data; however, it would cost 
more to send the vector value when a match is repeatedly found for one vector location. 
Consideration could also be given to storing vector values only after a match is found 
instead of storing them when they may or may not be needed. The cost for sending the 
vector data when needed would ideally be the same as sending the value before knowing 
if it is needed. This way unnecessary resources are not spent in transmitting vector 
values that will never be needed. The downside to following this path is that the 
complexity to handle this format would be increased on both the FPGA side and 
supporting C code. Additional logic would be needed to determine if the vector value 
exists and how to handle the request of it. The space would have to be available to store 
all possible vector values for each address stored so there would be no benefit in memory 
reduction, only in overall performance so long as the extra complexity does not negate 
the benefits. Both vector address and value could be stored, with the convenience of 
having all necessary vector data available at the cost of memory usage. Registers are 
inexpensive on an FPGA and thus a large number of vector data could be stored to make 
the additional overhead cost worthwhile. 
After determining the storage scheme for the sparse vector, it is more than likely 
that the entire vector will not fit all at once on the FPGA. This makes the decision of 
how to store the sparse vector even more important because the more often a new section 
of the vector is stored, the more often the vector loading overhead will be incurred. 
When sending matrix information over the memory bus to the FPGA, similar 
issues are encountered as with the vector transmission, which was determining whether to 
send matrix values with the addresses, or just the addresses alone. If matrix addresses 
were accompanied by their values, then those values would be readily available to begin 
multiplication. If the values were not needed, they would simply be discarded. The 
downside to sending the values with addresses is that if the values are not needed then 
time was wasted on the bus sending the information. The fewer matches there are per 
compare, the more costly this becomes. If considering sending the addresses alone, after a match is 
found the matrix value could be requested by the FPGA. While this format may reduce 
the waste of matrix value transmissions, some form of handshaking would have to be 
introduced to notify the C code what values need to be sent. Unless performed cleverly, 
handshaking could be costly and it disrupts any notion of streaming data to the FPGA. 
The final area to evaluate in data transmission is how data is transmitted over the 
memory bus itself. This partly depends on the storage scheme of vectors and how matrix 
information is processed. In regards to sending vector information to the FPGA, if both 
vector values and addresses are transmitted then simply transmitting two addresses in one 
clock and the corresponding values the next two clock cycles should be efficient. The 
memory bus would be utilized to the fullest. If vector values were transmitted as needed 
then they would need to be transmitted after a certain number of compares have been 
processed. The C code would need to know what vector value(s) to transmit; therefore, 
the FPGA would have to initiate some form of handshaking. Upon completion of 
handshaking, the C code should send only the necessary vector values in the order needed 
by the FPGA. 
In the former method mentioned of sending vector values with addresses, the data 
can simply be streamed in until all vector registers are full. In the latter format, vector 
addresses could be streamed in, but values would only be transmitted after some 
additional handshaking to notify the C code of what is needed. In general, vector 
transmission is a cost only paid when necessary to load up vector data or send vector 
values separately. 
Most of the bus time will be spent sending matrix information instead of vector 
information in the overall scheme of things. Here, two main different methods are 
explored, streaming and block transfer. The streaming method is tied to the transmission 
of both matrix addresses and values. This could be accomplished by sending two address 
values in one memory bus clock cycle followed by two clock cycles of sending the two 
corresponding values. The C code and FPGA code should already agree on a set number of 
transmissions before either needs to take any special action. 
The block transfer method would send a set number of matrix addresses or block 
of addresses, and the FPGA would respond in some manner with a request for matrix 
values if there were a match. The C code would then send the correct matrix values for 
multiplication. This block transfer process would be repeated as necessary. 
In comparing the two different data transmission methods, both have their 
advantages and disadvantages. The streaming method loses no time to 
handshaking. A disadvantage of streaming, however, is that two out of 
every three clock cycles are spent sending data that may or may not be needed when most 
of the time addresses are needed for comparing. The block transfer method does not 
waste valuable clock time in transmitting unwanted matrix values, but additional 
handshaking is necessary, which has its own penalties to be paid. All of these different 
methods have their advantages and drawbacks. 
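The trade-off between the two methods can be explored with a rough bus cycle model in C. The streaming cost follows the two addresses then two values pattern described above. The block transfer model and its handshake_cycles parameter are hypothetical, since the cost of handshaking is not quantified here; the function names are likewise assumptions.

```c
#include <stdint.h>

/* Streaming: every pair of matrix elements costs 3 bus cycles
 * (1 cycle for two 32-bit addresses, 2 cycles for two 64-bit values),
 * whether or not the addresses match. An odd count rounds up to a pair. */
uint64_t streaming_cycles(uint64_t n_elements)
{
    uint64_t pairs = (n_elements + 1u) / 2u;
    return 3u * pairs;
}

/* Block transfer: addresses travel two per cycle, each block pays a
 * handshake cost (hypothetical parameter), and only matched values are
 * sent afterwards, one per cycle. block_size must be greater than zero. */
uint64_t block_transfer_cycles(uint64_t n_elements, uint64_t n_matches,
                               uint64_t block_size, uint64_t handshake_cycles)
{
    uint64_t address_cycles = (n_elements + 1u) / 2u;
    uint64_t n_blocks = (n_elements + block_size - 1u) / block_size;
    return address_cycles + n_blocks * handshake_cycles + n_matches;
}
```

For example, 56 elements with 4 matches, sent as one block with an assumed 10 cycle handshake, cost 84 cycles when streamed and 42 cycles by block transfer in this model. A larger handshake cost or a higher match rate shifts the balance back toward streaming.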
4.4.2 Logic Flow 
The flow of logic and data must be controlled in some fashion as it enters the 
FPGA because there is not enough bandwidth to handle requests for comparison 
information, multiply information, handshaking, and answers all simultaneously. The 
entire design has one 64-bit bus to utilize; therefore, the data traffic must be 
controlled. Several different possible states must be considered. Data needs to go to 
comparators in some efficient and highly parallel manner, data needs to fill vector 
information stored on the FPGA, and data needs to be directed to the multiply 
accumulators. Also, the state machine will need to accommodate the ability to send 
information back to the memory bus for the C code to retrieve. In addition to the state 
machine providing all of the necessary states, it must flow between states in a logical 
manner and have the capability to jump to any necessary state given any possible 
condition. The state machine should also be able to protect the sparse matrix sparse 
vector code from operating when it should not. Ideally the state machine will help 
improve orchestration between processes. Also, in orchestrating the processes and data 
communication, a balanced workload should be strived for by keeping the assignment of 
data dispersed appropriately. 
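One hypothetical way to arrange the states discussed above is sketched here as a behavioral model in C. The state names, event flags, and transitions are all assumptions made for illustration; they are not the HDL state machine of this design, and a real controller would likely need more states and conditions.

```c
#include <stdbool.h>

/* Hypothetical uses of the single 64-bit bus, one per state. */
typedef enum {
    ST_IDLE,         /* hold the sparse code until data arrives */
    ST_LOAD_VECTOR,  /* bus data fills the stored vector section */
    ST_COMPARE,      /* bus data carries matrix addresses to comparators */
    ST_MULTIPLY,     /* bus data carries matrix values to the MACs */
    ST_SEND_RESULT   /* bus returns information to the C code */
} bus_state;

/* Hypothetical conditions sampled on each clock. */
typedef struct {
    bool start;         /* host has begun a transfer */
    bool vector_full;   /* vector storage section is loaded */
    bool block_done;    /* current address block has been compared */
    bool have_matches;  /* at least one address matched in the block */
    bool values_done;   /* all requested matrix values were received */
    bool row_done;      /* row addresses passed the stored vector range */
} bus_events;

bus_state next_state(bus_state s, bus_events e)
{
    switch (s) {
    case ST_IDLE:
        return e.start ? ST_LOAD_VECTOR : ST_IDLE;
    case ST_LOAD_VECTOR:
        return e.vector_full ? ST_COMPARE : ST_LOAD_VECTOR;
    case ST_COMPARE:
        if (!e.block_done)  return ST_COMPARE;
        if (e.have_matches) return ST_MULTIPLY;
        return e.row_done ? ST_SEND_RESULT : ST_COMPARE;
    case ST_MULTIPLY:
        if (!e.values_done) return ST_MULTIPLY;
        return e.row_done ? ST_SEND_RESULT : ST_COMPARE;
    case ST_SEND_RESULT:
        return ST_COMPARE;
    }
    return ST_IDLE; /* defensive default for unexpected values */
}
```

Because only one state is active at a time, the model reflects the point that the bus serves one purpose per cycle.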
4.4.3 Comparing of Addresses 
As the flow of data moves from the memory bus to the state machine, it should be 
sent into a structure to handle parallel comparisons. This is essentially the main reason 
for this entire implementation of a sparse matrix sparse vector multiplication. Ideally the 
more parallel compares that can be implemented the better; however, a few 
considerations need to be made. As the number of simultaneous compares increases, 
more room on the FPGA is used. At the very least, enough space needs to be provided 
for a floating-point multiply accumulator as well as minimal control for data flow. To 
accommodate greater amounts of concurrent comparators, the capability needs to exist to 
handle the possible large amount of data resulting from all of the comparing. One giant 
comparator cannot efficiently do all of the comparing at once as it would be too slow, so 
the comparing would have to be divided into multiple processes. The more processes 
running, the more individual results there are to multiplex. If more than one element of 
the matrix is being compared then a matching result can exist for as many elements of the 
matrix being compared. This creates a dynamic load balance of results being passed on 
to the multiply accumulator. When and how often multiple results will be calculated is 
unknown and can make handling the results difficult. Dynamic task scheduling must 
then be employed to help balance possible imbalanced results passed on to the multiply 
accumulators. The increased complexity becomes very real as parallel comparators are 
added and increased in size while trying to achieve optimal performance. In determining 
how to handle this portion of the design, a balance needs to be achieved between creating 
as many compares as possible, with still being able to provide the means to handle the 
results under desired operating speeds. 
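A behavioral C model of the comparison step may help clarify what the comparators must produce. On an FPGA the comparisons in the inner loop would run concurrently; the loops here only model the result. The function name compare_block and the match_index output format are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Compare each of n_mat matrix column addresses against all n_vec stored
 * vector addresses. match_index[m] receives the position of the matching
 * vector entry, or -1 when matrix address m has no match.
 * Returns the number of matches found in the block. */
size_t compare_block(const uint32_t *mat_addr, size_t n_mat,
                     const uint32_t *vec_addr, size_t n_vec,
                     int *match_index)
{
    size_t matches = 0;
    for (size_t m = 0; m < n_mat; m++) {
        match_index[m] = -1;
        for (size_t v = 0; v < n_vec; v++) {
            if (mat_addr[m] == vec_addr[v]) {
                match_index[m] = (int)v;
                matches++;
                break;
            }
        }
    }
    return matches;
}
```

The returned count varies from block to block, which is the unpredictable load that the multiply accumulators must absorb.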
4.4.4 Multiply Accumulator 
Performing any kind of matrix vector multiplication requires the use of multiply 
accumulation. When a row of a matrix is multiplied by the matching elements of a vector, 
all of the multiplication results need to be accumulated for the resulting answer vector's 
corresponding element. When performing this operation on sparse matrix sparse vector 
data, the essential nature remains the same of needing to multiply values together and 
sum those results. Constructing such an architecture to handle this is not as obvious as 
handling dense matrices and vectors where maximum parallelization can be achieved due 
to the static scheduling nature of information and results. In consideration of sparse 
matrix sparse vector multiplication on an FPGA sharing resources, load balancing 
becomes a factor as well as limited real estate for placing MACs. 
The effects of an imbalanced load and the uncertainty in the frequency at which 
address matching will occur complicate multiply accumulator design immensely. These 
unknowns make it extremely difficult if not impossible to create an optimized 
architecture when designing for the general case with capabilities to handle any situation. 
The multiply accumulator design must therefore be able to handle dense matrices and 
vectors. Obviously performance will suffer heavily for dense data since that is not the 
target of the design, but the level of sparseness to target cannot be 
determined, so the design must be prepared to handle all levels of sparseness. The structure 
of the sparse matrix and sparse vector also plays a part. What is important given the limited design 
space is that the "best bang for the buck" is achieved. In other words, in the 
determination of how many multipliers and accumulators are to be used, it is desired that 
all of the arithmetic units placed on the FPGA stay busy. There is no point in wasting 
space on the chip if arithmetic units are not used often, because it is all about designing 
for the average or general case and getting the most out of what is available or in use. 
Another issue to observe when creating the multiply accumulation units is how to 
handle answers or the sums created. Ideally, one or multiple answers could be found at 
one time. The problems in doing so range from discerning one sum from the next, to 
knowing which values in the pipelined adder correspond to what partial sum. Unless 
there are multiple multiply accumulator units available to keep track of their own sum, 
keeping track of multiple sums would become difficult and complex even though the 
capability would be convenient. Traversing an entire row of the matrix and vector to 
obtain just one sum would create the need to reload the entire vector per matrix row. This 
would become costly due to not maximizing reuse; therefore, the multiply accumulator 
must solve for a partial sum for a given row of the answer vector. In simplifying the 
work of handling vector storage and reuse, handling partial sums instead of full sums 
becomes another complexity to consider. It must then be determined if the FPGA or the 
computer keeps track of the partial sums, while keeping in mind that there could be a few 
partial sums to thousands upon thousands of them. If handling partial sums, each partial 
sum can be sent back out to the CPU to let it finish each sum. As the data 
flows from the memory bus down to the multiply accumulators and into answers, the effects 
of each part all tie in to each other; they will be put together in one design in the following 
section. 
4.5 FPGA Design 
The architecture of the sparse matrix sparse vector multiplication algorithm 
attempts to utilize the partitioning of the problem to the highest degree possible. The 
overall design has been broken down into six major components: the interface to the 
Pilchard System or Pcore, State Machine, Comparators, Multiply Accumulator Interface, 
the Double Precision Floating-Point Multiplier and Adder, and the C code that is used to 
interface to the Pilchard System from the user side. 
In developing these components in HDL, the flow chart in Figure 4.3 shows 
the general design flow process in developing HDL for an FPGA. Essentially, after the 
specifications and requirements of a system have been determined, HDL in the form of 
behavioral and structural code formats is simulated to check for accuracy. If simulation 
provides accurate results, the design is then synthesized and re-simulated (post-synthesis 
simulation). After post-synthesis simulation produces valid results, the design is 
placed and routed, which provides the real design that will go into an FPGA. The placed 
and routed design is also simulated to check for accuracy. If the design continues to 
prove accurate, it is placed on the FPGA for actual implementation. Unfortunately, the 
Pilchard System does not currently allow support for simulation after synthesis or place 
and route. This deficiency is critical as these simulations can often show design flaws 
that pre-synthesis simulation cannot, thus making debugging of actual performance 
extremely difficult. 
In general, the sparse matrix sparse vector multiplier reads in 128 bits of 
information in one clock cycle. With this information, vector addresses and values are 
both stored on the FPGA to minimize the complexities of having to request vector values 
on an as-needed basis. After storing a vector, matrix addresses are compared to vector 
addresses massively in parallel. Sending in 56 addresses in one block transfer, 4 separate 
Figure 4.3 - FPGA Design Flow 
determination of this block transmit size will be discussed later. When an address value 
has exceeded the vector address range, the partial sum for that answer vector element 
corresponding to the portion of the vector and matrix is found and the overall design will 
proceed to the next row. After a portion of the vector has been compared to all of the 
corresponding portions of the matrix rows, the next 32 vector locations are loaded and 
compared to the rest of the remaining rows of the matrix. This is repeated until all partial 
sums have been found. Figure 4.4 provides a very general description of the design on 
the FPGA. 
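The section by section processing described above can be modeled in software. The following C sketch loads 32 vector locations at a time, as stated in the text, and accumulates one partial sum per row for each section. The function and array names are assumptions, the column addresses within each row and the vector addresses are assumed to be in ascending order, and each row is rescanned from its start for every section, which is a simplification rather than a description of the hardware.

```c
#include <stddef.h>
#include <stdint.h>

#define VEC_SECTION 32u  /* vector locations stored at once, per the text */

/* y = A * v computed one stored vector section at a time.
 * A is in compressed row storage (val, col_idx, row_ptr); v is sparse
 * (vec_addr ascending, vec_val). All names are illustrative assumptions. */
void smsv_partial_sums(size_t n_rows, const double *val,
                       const uint32_t *col_idx, const size_t *row_ptr,
                       size_t vec_nnz, const uint32_t *vec_addr,
                       const double *vec_val, double *y)
{
    for (size_t i = 0; i < n_rows; i++)
        y[i] = 0.0;

    for (size_t base = 0; base < vec_nnz; base += VEC_SECTION) {
        size_t end = base + VEC_SECTION < vec_nnz ? base + VEC_SECTION
                                                  : vec_nnz;
        uint32_t last_addr = vec_addr[end - 1];
        for (size_t i = 0; i < n_rows; i++) {
            double partial = 0.0;
            for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
                if (col_idx[k] > last_addr)
                    break;  /* address exceeded the stored vector range */
                for (size_t j = base; j < end; j++) {
                    if (vec_addr[j] == col_idx[k]) {
                        partial += val[k] * vec_val[j];
                        break;
                    }
                }
            }
            y[i] += partial;  /* host combines the partial sums per row */
        }
    }
}
```

The final addition per row stands in for the CPU finishing each sum from the partial sums it receives.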
[Figure 4.4: blocks labeled Pcore, Comparator, MAC] 
On a more systemic level, after the vector is stored and matrix vector address 
comparing begins, the results from the compare matches are encoded with some flags in 
the 64-bit output to handshake with the C code. While matches are being found, the 
vector values that will be necessary for multiplication are stored in four buffers in the 
order that they will be multiplied. Following the handshaking, all of the matching matrix 
values are streamed in for multiplication with their corresponding vector values in one of 
two multipliers. Up to two multiply results are then ready to be accumulated per clock. 
The adder will then accumulate all of the multiply results and intermediate partial sums 
into one consolidated partial sum. There is just one adder to accomplish this. The 
following sections describe each portion in detail. Figure 4.5 offers a 
more detailed view of the overall architecture. 
4.5.1 Pilchard System Interface 
Two main VHDL files, Pilchard.vhd and Pcore.vhd, predominantly handle the 
Pilchard System interface. The former helps set up the actual interfacing between the 
FPGA pins and Pilchard board to memory bus while the latter is in a sense the wrapper 
around all of the supporting sparse matrix sparse vector code design. It is the Pcore file 
that needs to be manipulated to accommodate the input and output requirements of the 
design to the Pilchard interface for the memory bus. The Pcore operates by receiving a 
write and a read signal when there is a request to send information or read information to 
and from the Pilchard's FPGA. Also, there are dedicated lines for the input and output of 
data as well as address lines if interfacing directly with generated block RAM on the 
FPGA. There are other signals available but they are mainly for use with the Pilchard 

























Figure 4.6 - Pcore
When developing the Pcore, the requirements and needs of the underlying system 
are important. The necessary components can be seen in Figure 4.6 above. Since the 
sparse matrix sparse vector multiplication will be operating on a clock cycle twice as long 
as Pcore's clock, it is important that the synchronization between clocks and the 
communication of information between those clocks is accurate. To make up for the
slower speed of the matrix vector multiplication, twice the amount of memory bus data
can be sent to the sparse code to operate on. Pcore will have the capability to read in two 
64-bit values in two clock cycles and pass one 128-bit value on to the sparse code in one 
sparse code clock cycle. This allows the memory bus to stream data in, while providing a 
way to get the information to the sparse matrix sparse vector code on a slower clock. 
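The width conversion described above can be sketched in C as a simple software model. This is only an illustration: the type `word128_t` and both function names are hypothetical, and the choice that the first 64-bit write lands in the upper half of the 128-bit word is an assumption, since the text does not state the ordering.

```c
#include <assert.h>
#include <stdint.h>

/* One 128-bit word as seen by the sparse code, split into two 64-bit halves. */
typedef struct {
    uint64_t hi; /* bits 127..64 */
    uint64_t lo; /* bits 63..0   */
} word128_t;

/* Combine two consecutive 64-bit bus writes into one 128-bit word.
 * Assumption: the first write becomes the upper half. */
word128_t pack_bus_writes(uint64_t first, uint64_t second)
{
    word128_t w = { first, second };
    return w;
}

/* Number of 128-bit words produced by n 64-bit bus writes. A word is only
 * complete after two writes, so n must be even. */
unsigned words_from_writes(unsigned n_writes)
{
    assert(n_writes % 2 == 0);
    return n_writes / 2;
}
```

Two writes at the 100 MHz bus rate therefore yield one word per 50 MHz clockdiv cycle, which is the rate match the Pcore is designed to provide.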
The difficulty lies in the synchronization of passing the data back and forth 
between the top level Pilchard structure and the slower clock of the sparse code. The 
slower clock is derived from the main clock by a clock divider and will be referred to as
clockdiv. Because the faster clock operates at twice the speed of clockdiv, the 128 bits
being passed along to the sparse code need to be held long enough for the sparse code to
retrieve them accurately. To accomplish this, an asynchronous FIFO buffer was
generated using Xilinx's Coregen program. This generated core can handle reading data
on one clock while writing data out on a different clock. Because the core is provided
for professional use, it is reliable and can handle the asynchronous data transfer
effectively. The use of this asynchronous FIFO was a convenient and time-saving
solution for the data transfer from the memory bus to the sparse matrix sparse vector code.
When passing answers back from the sparse code through the Pcore out of the 
FPGA, Xilinx's Coregen block RAM was used. Using block RAM to output data
ensured that data would be stabilized for the memory bus to read from. This is important 
due to interfacing two different clock speeds again. The depth of the RAM was four. 
Currently only two locations are in use; however, that can be expanded if desired. 
4.5.2 State Machine 
The state machine is the main interface to the Pcore when controlling the data 
read into the FPGA and also for controlling and monitoring the sparse matrix sparse 
vector multiplication process. The different states utilized to accomplish this are: 
INITIALIZE, ADDRESS, DATA, PROCESSING, REPORTw, REPORTx, SEND, and 
MACN. All states check an input signal called "din_rdy" that, when it goes high, signals
that 128 valid bits of input are available. If this signal is not high, the state
machine simply holds its current status and position. Figure 4.7 gives a graphical





Figure 4.7 - FPGA State Machine 
The INITIALIZE state is run first and only once. It receives the first writes from 
the C code, which notify the state machine of how many rows exist in the matrix. This is 
necessary so that the state machine knows when it is handling the last row so it can 
transition to appropriate states. After this state, the state machine moves to the 
ADDRESS state. 
The ADDRESS state receives the address data for the vector and stores the
addresses in registers. Registers are used for storage to simplify their frequent
access by the comparators. Because the input is 128 bits wide, four 32-bit addresses can be
stored into registers simultaneously in one clock cycle. After the four addresses are read
from the input, the state machine transitions to the DATA state for the next two clock
cycles.
The DATA state breaks the 128-bit input into two 64-bit values, which represent
vector data, in one clock cycle and stores them into block RAM designated to hold vector
values. Because four addresses were read in during the previous state, the DATA state is held
for two clock cycles so that it will have read in four vector values. After reading in four vector
values, the state machine transitions back to the ADDRESS state. The transition back
and forth between these two states continues until 32 vector addresses and values have all
been loaded. When this is done, the state machine moves on to the
PROCESSING state.
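The alternation between the ADDRESS and DATA states can be modeled in a few lines of C. The sketch below is a behavioral model only, not the VHDL; the enum and function names are hypothetical, and each call to `load_step` stands for one valid 128-bit input word.

```c
#include <assert.h>

enum { ADDRS_PER_WORD = 4, VALUES_PER_WORD = 2, VECTOR_SLOTS = 32 };

typedef enum { ST_ADDRESS, ST_DATA, ST_PROCESSING } load_state_t;

/* Advance the load sequence by one valid 128-bit input word.
 * *loaded counts vector entries whose address and value are both stored;
 * *data_words counts DATA words seen since the last ADDRESS word. */
load_state_t load_step(load_state_t s, int *loaded, int *data_words)
{
    assert(s != ST_PROCESSING);
    if (s == ST_ADDRESS) {            /* four addresses latched into registers */
        *data_words = 0;
        return ST_DATA;
    }
    *loaded += VALUES_PER_WORD;       /* two 64-bit values written to block RAM */
    if (++*data_words < ADDRS_PER_WORD / VALUES_PER_WORD)
        return ST_DATA;               /* second DATA clock still to come */
    return (*loaded == VECTOR_SLOTS) ? ST_PROCESSING : ST_ADDRESS;
}

/* Count the 128-bit words needed to fill all 32 vector slots. */
int words_to_load_vector(void)
{
    load_state_t s = ST_ADDRESS;
    int loaded = 0, data_words = 0, words = 0;
    while (s != ST_PROCESSING) {
        s = load_step(s, &loaded, &data_words);
        ++words;
    }
    return words;
}
```

Under this model a full vector section costs 24 input words: eight address words plus sixteen data words.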
The PROCESSING state continuously reads in matrix addresses for massively parallel
comparison. This state keeps count of how many input words have been read in using a
decrementing counter. The counter allows for 14 transfers of 4 matrix addresses
each (56 addresses in total). When the counter reaches zero, the maximum number of address
values has been read in and the state machine transitions to the REPORTw state and then
the REPORTx state.
The two REPORT states are executed successively. The REPORTw state
follows PROCESSING and provides the one-clock delay required to ensure that all
comparisons are complete before the comparator results are sent to the C code. The REPORTw
state is left out of the diagram for simplification. In the REPORTx state, information is
gathered from all of the address comparisons. This information is used to notify
the C code whether there were any address matches, which addresses had matches, and whether the
matrix addresses went past the current address range stored for the vector; it also includes a
special last address match flag. All of this information must fit into one 64-bit output signal to
reduce the handshaking to a single clock. Five bits ended up being
extra and are reserved for future use. One bit each is reserved for the match flag, over
flag, and last address match flag. The over flag signals to the C code that a matrix address
went past the vector address range. The match flag indicates that there was at least one
match, and the last address match flag, when equal to one, indicates that the last bit in the
56-bit encoded compare result stands for a match. This is done for redundancy checking
to ensure the very last bit is transmitted correctly. The remaining 56 bits are used to
encode which matrix addresses produced matches. This will be described in the
Comparators section. After reporting the status of the compares back to the C code,
the state machine will transition to one of three states, depending on that status:
MACN, SEND, or back to PROCESSING. The MACN state has first priority, as it
must be next if there were any address matches. The SEND state has second priority:
if there were no matches and the over flag is high, then a partial sum needs to be
found based on the data that has been input so far and sent to the C
code. Last priority is given to returning to the PROCESSING state. This is done only if
there were no matches and the over flag has not been set high, in which case the state
machine continues processing more matrix addresses.
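As an illustration of the 64-bit status word and the three-way priority, consider the following C sketch. The field widths (56 + 1 + 1 + 1 + 5 reserved bits) come from the text, but the exact bit positions chosen here are an assumption, and all identifiers are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit positions; the text gives only the field widths. */
#define HIT_MASK   ((UINT64_C(1) << 56) - 1)  /* bits 55..0: encoded compare result */
#define MATCH_FLAG (UINT64_C(1) << 56)        /* at least one address match */
#define OVER_FLAG  (UINT64_C(1) << 57)        /* a matrix address passed the vector range */
#define LAST_FLAG  (UINT64_C(1) << 58)        /* last encoded bit stands for a match */
                                              /* bits 63..59 remain reserved */

typedef enum { NEXT_MACN, NEXT_SEND, NEXT_PROCESSING } next_state_t;

/* Assemble the status word reported in the REPORTx state. */
uint64_t make_status(uint64_t hits, int match, int over, int last_is_match)
{
    assert((hits & ~HIT_MASK) == 0);
    uint64_t status = hits;
    if (match)         status |= MATCH_FLAG;
    if (over)          status |= OVER_FLAG;
    if (last_is_match) status |= LAST_FLAG;
    return status;
}

/* Priority after REPORTx: MACN first, then SEND, then PROCESSING. */
next_state_t next_state(uint64_t status)
{
    if (status & MATCH_FLAG) return NEXT_MACN;
    if (status & OVER_FLAG)  return NEXT_SEND;
    return NEXT_PROCESSING;
}
```

Note that a status with both the match and over flags set still selects MACN first; the over flag is then acted on after the multiplications, as described for the MACN state.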
If the MACN state is next, all of the matrix values that correspond to matches will
be read in, in order. As these values are read in, they are sent to the two multipliers to be
multiplied with their corresponding vector values. Because there is no dedicated signal to
notify the FPGA when the C code is done sending matrix values, the C code
sends in all zeros when it is done. This is necessary because the "din_rdy" flag is already
in use to notify the MACN state whether it needs to be looking for valid input at all, since
there may be a delay in sending matrix values to the FPGA. If the MACN state receives all
zeros as an input, it knows it is time to move on to the next state. All zeros can safely
serve as the notification because a zero should never be stored as a matrix value to begin
with, which provides flexibility in reading
inputs. After reading in all of the input values, the state machine will transfer to the
SEND state if the over flag is high; otherwise, if no answer needs to be sent, it goes back to
the PROCESSING state.
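The all-zeros terminator can be illustrated with a short C model of the MACN input stream. The `word128_t` type and the function name are hypothetical; the sketch relies on the fact that IEEE 754 positive zero is the all-zero bit pattern and on the stated assumption that no stored matrix value is zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t hi, lo; } word128_t;  /* raw bits of two doubles */

/* Count the 128-bit words the MACN state would consume before it sees
 * the all-zero terminator. Returns n if no terminator is present. */
size_t macn_words_before_stop(const word128_t *in, size_t n)
{
    assert(in != NULL || n == 0);
    for (size_t i = 0; i < n; ++i)
        if (in[i].hi == 0 && in[i].lo == 0)
            return i;
    return n;
}
```

Because only a word with both halves zero ends the state, a valid value paired with a nonzero dummy value is still processed normally.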
In the SEND state, the state machine simply waits for the floating-point multiply 
accumulator to find the current partial sum. When the partial sum is ready, the state 
machine notifies Pcore that an answer is ready to be sent out from the FPGA. When this 
is done, the state machine checks if the flag was set to notify it that the last row of the 
matrix was being processed. If so, then the state machine needs to go back to the 
ADDRESS state to begin loading a new vector; otherwise, the state machine will transfer 
back to the PROCESSING state to begin handling another partial sum for a different 
matrix row. 
4.5.3 Comparators 
Each comparator is made up of four parallel compare processes for each of the 
four matrix addresses input per clock during the PROCESSING state. Each process 
handles 8 compares in one clock cycle for a total of 128 simultaneous compares in one 
clock cycle. This is where the strength of the design lies. Figure 4.8 displays the overall 
comparator design in its entirety with an individual compare process shown in greater 
detail. 
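A sequential C model may help make the comparator organization concrete. In hardware all of these compares happen in the same clock cycle; the loops below only describe the result. The function names are hypothetical, and the assignment of vector slots 0-7, 8-15, 16-23, and 24-31 to the four processes is an assumption.

```c
#include <assert.h>
#include <stdint.h>

enum { VEC_SLOTS = 32, PROCESSES = 4, CMPS_PER_PROCESS = 8 };

/* One compare process: check a matrix address against its 8 vector
 * addresses. Returns the matching vector slot, or -1 on a miss. */
int compare_process(uint32_t maddr, const uint32_t vaddr[VEC_SLOTS], int process)
{
    assert(process >= 0 && process < PROCESSES);
    for (int i = 0; i < CMPS_PER_PROCESS; ++i) {
        int slot = process * CMPS_PER_PROCESS + i;
        if (vaddr[slot] == maddr)
            return slot;
    }
    return -1;
}

/* One comparator: four processes whose results are multiplexed into a
 * single hit. At most one process can hit because vector addresses are unique. */
int comparator(uint32_t maddr, const uint32_t vaddr[VEC_SLOTS])
{
    int hit = -1;
    for (int p = 0; p < PROCESSES; ++p) {
        int slot = compare_process(maddr, vaddr, p);
        if (slot >= 0)
            hit = slot;
    }
    return hit;
}
```

With four such comparators, one per matrix address in the 128-bit input, the total is 4 x 4 x 8 = 128 compares per clock.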
For each of the four processes per matrix address, only one match can exist, as all
vector addresses are unique per row. After the initial 32 compares per matrix address are
executed, the four individual process results are multiplexed to check for a match. If
there is a match, several different tasks occur. One task is that the corresponding vector
value for that match is fetched and sent to a dynamic task scheduler, which handles the
unpredictable way in which multiple matches may occur. The dynamic task scheduler
keeps all of the matches in order in the four buffers that feed the two floating-point
multipliers. Each buffer is filled in order as a match is found, so if buffer one is filled
first, buffer two is next. After buffer two, buffer three is next, and so on. After buffer
four is filled, the dynamic task scheduler assigns the next matched vector value to
the first buffer. Due to the random nature of the matching, the next buffer to fill on a clock
cycle depends on how many matches the previous clock cycle had and which buffer was
first in line to be filled in that previous clock cycle. Figures 4.9, 4.10, and 4.11
demonstrate this process.
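The rotation of the scheduler's fill pointer can be sketched as follows. This is a minimal C model with hypothetical names; buffers are numbered 0 to 3 here rather than one to four.

```c
#include <assert.h>

enum { NUM_BUFFERS = 4 };

typedef struct {
    int next;                 /* buffer that receives the next match (0..3) */
    int count[NUM_BUFFERS];   /* entries queued in each buffer */
} scheduler_t;

/* Record the matches found in one clock cycle (0..4) and return the
 * buffer that will receive the first match of the following cycle. */
int schedule_matches(scheduler_t *s, int matches)
{
    assert(matches >= 0 && matches <= NUM_BUFFERS);
    for (int m = 0; m < matches; ++m) {
        s->count[s->next]++;
        s->next = (s->next + 1) % NUM_BUFFERS;
    }
    return s->next;
}
```

Starting from an empty scheduler, three matches leave the pointer at the fourth buffer, and three more leave it at the third, which is the progression described for Figures 4.9 through 4.11.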
Another task that is completed on a match is the encoding of matching addresses
for the REPORTx state. A 14-bit vector is used for each of the four main sets of
comparators to encode matching information. To break it down, 56 addresses from the
matrix are sent to the FPGA in one block transfer. This means 4 matrix addresses are
processed at once on each of 14 clock cycles. If there is a match for a particular matrix
address on the first of the 14 sends, then the highest bit of its particular vector is assigned
a one. If there was a miss, a zero is put in that bit. As compares are completed, the results
are encoded in all 14 bits, one for each of the 14 clock cycles needed to transmit the block
transfer. These vectors are assigned their values when the four processes for each matrix
address are multiplexed. After the block transfer is done, the four vectors must be rearranged into
Figure 4.9 - Dynamic Scheduler Before any Matches
Figure 4.10 - Dynamic Scheduler After 3 Matches
Figure 4.11 - Dynamic Scheduler After Another 3 Matches
one 56-bit vector such that the order of the matches is preserved. Figures 4.12 and 4.13
show how this comes together.
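One way to picture the rearrangement is as an interleave of the four 14-bit vectors. The C sketch below assumes that comparator A holds the first of the four addresses in each 128-bit word, that bit 13 of each 14-bit vector is the first clock cycle, and that bit 55 of the result is the first address sent. These orderings are assumptions made for illustration, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { CYCLES = 14, COMPARATORS = 4 };

/* Interleave four 14-bit hit vectors (index 0 = comparator A) into one
 * 56-bit vector ordered by transmission: first address at bit 55. */
uint64_t merge_hit_vectors(const uint16_t hit14[COMPARATORS])
{
    uint64_t hit56 = 0;
    for (int c = 0; c < CYCLES; ++c) {
        for (int k = 0; k < COMPARATORS; ++k) {
            assert((hit14[k] >> CYCLES) == 0);        /* only 14 bits are valid */
            uint64_t bit = (hit14[k] >> (CYCLES - 1 - c)) & 1u;
            int order = c * COMPARATORS + k;          /* transmission order 0..55 */
            hit56 |= bit << (55 - order);
        }
    }
    return hit56;
}
```

With this layout, reading the 56-bit vector from left to right visits the matches in the same order in which the matrix addresses were sent.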
The final task performed by the comparator is monitoring for matches and for a
matrix address going past the vector address range. If there is even one match by any of
the comparator processes, the match flag goes high for that block transfer. Similarly, as
soon as the first matrix address exceeds the vector range, the over flag goes high. To
support these functions, two smaller functions are performed. When the first compare
trips the over flag, that exact location in the entire 56-address compare scheme is
identified and that particular location in the 56-bit encoded vector is set to one. After an
over is triggered, all of the remaining bits of the 56-bit encoded vector are set to zero. Thus,
when the C code sees that the over flag has been tripped, the first "one" it encounters in
checking the encoded match vector from right to left marks the position where the
first matrix address exceeded the vector address range. That matrix address will then be
the starting address when that row is returned to, to multiply with the next section of the
vector. Figure 4.14 on the next page shows an over bit being assigned to its appropriate
location in the 56-bit encoded vector.
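The over-bit marking can be expressed compactly in C. The sketch assumes, as an illustration only, that the first address of a block transfer maps to bit 55 and the last to bit 0; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Mark the position of the first over-range matrix address in the 56-bit
 * encoded vector and clear every later position.
 * over_order is the address's place in the block transfer (0 = first). */
uint64_t encode_over(uint64_t hit56, int over_order)
{
    assert(over_order >= 0 && over_order < 56);
    int bit = 55 - over_order;
    uint64_t keep = ~((UINT64_C(1) << bit) - 1);   /* positions at or before the over */
    uint64_t marked = (hit56 & keep) | (UINT64_C(1) << bit);
    return marked & ((UINT64_C(1) << 56) - 1);
}
```

Because every bit to the right of the marker is zero, a right-to-left scan always reaches the over position before any genuine match bit.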
Figure 4.12 - Fourteen Bit Hit Vector
Figure 4.13 - Fifty-six Bit Hit Vector
Figure 4.14 - Hit Vector with Over Bit
4.5.4 Multiply Accumulator Interface 
The multiply accumulator interface handles several processes involved in the
overall design of finding a partial sum. Figure 4.15 displays how these processes
interconnect. Due to the dynamic scheduling of the four buffers that feed into the
multipliers from the comparators, handling the multiplier input data becomes a static
workload. The first multiplier reads data from buffers one and three, alternating
between them, to multiply with the top 64 bits of the 128-bit input signal. The second
multiplier alternates between the second and fourth buffers, multiplying the buffer data
with the bottom 64 bits of the input signal. This ordering preserves the matching
order of addresses from the comparator so that the appropriate matrix and vector values
are multiplied together. The upper 64-bit input value is multiplied by a
buffer one value while, simultaneously, the bottom 64-bit input value is multiplied by a





Figure 4.15 - MAC Interface
the same multipliers, but multipliers one and two get their vector values from buffers
three and four respectively. This alternating sequence continues until all data has been
sent in for multiplication. The advantage of using two multipliers and four buffers is
that there is no backup of data on the FPGA: together, the two 64-bit multipliers can handle
the 128-bit input.
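The static pairing of buffers and multipliers reduces to a parity check on the input word count. The following C sketch uses a hypothetical function name and assumes the first 128-bit word of the MACN state is word 0.

```c
#include <assert.h>

/* For the n-th 128-bit MACN input word (n = 0, 1, 2, ...), report which
 * of the four buffers (numbered 1..4) feeds each multiplier. */
void buffers_for_word(unsigned n, int *mult1_buf, int *mult2_buf)
{
    assert(mult1_buf && mult2_buf);
    *mult1_buf = (n % 2 == 0) ? 1 : 3;  /* pairs with the upper 64 bits */
    *mult2_buf = (n % 2 == 0) ? 2 : 4;  /* pairs with the lower 64 bits */
}
```

Since matches are written to the buffers in round-robin order, the k-th match always meets the k-th matrix value streamed in by the C code.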
As the multiplication results are found, they are fed into an adder. Over time,
addition results are accumulated into one partial sum, but the process of accumulation is
complex. Due to the pipelined nature of the adder, results are not available on the
clock following the input. As the number of values to accumulate drops below what can
fit in the pipeline, the values must be temporarily stored in a buffer until another result is
available to be added to them. There will be times when there are so many values to
accumulate that accumulation has not finished before the next round of multiplication
results comes in. Monitoring all of the data to be summed quickly becomes difficult. Input
to the adder can come from the two multipliers, the output of the adder, and the FIFO buffer
used to store overflow. The combination of which values are available, and when, is
complex. There could be two multiplication results available or there could
be only one. There could be an adder result available too. The situation is further
complicated if data is being stored in the buffer. When data is requested from the buffer,
there is a two-clock-cycle delay. Whether the data requested from the buffer is to be the
first or the second input into the adder is another issue as well.
To begin sorting out this complication, priorities must be set as to which
component's result has the highest and lowest priority with respect to being an input into
the adder. The multiplication results are given the highest priority because their four
buffers must be cleared as soon as possible to avoid a backup of matching vector value
information. If a backup were to occur, the system as a whole would have to stall, a
situation to be avoided if possible. Because the multiplication results are given such
priority and the buffers can be cleared during the MACN state, this potential backup is avoided.
Multiplier one has priority over multiplier two, as multiplier one handles
the greater number of matches if the number of matches is odd. Next in line on the priority
chain is the adder result. Last priority is given to retrieving data from the buffer. A
mini-pipeline is established to handle the funneling of data into the adder, mainly due to the
possibility of one answer being available for a clock or two before another potential
input is ready. This pipeline is also used to time input into the adder upon a request for
data from the buffer. When one input is available and waiting for another input, the first
input is held for two clock cycles. If no other input is available at that time, it
is written to the buffer to wait for a longer period for another input. When multiple
inputs are present, the priority scheme is used to determine which values are put into the
adder and which value is written to the buffer.
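The basic priority ordering can be sketched in C as a selection of up to two adder operands per clock. This is a deliberate simplification: it ignores the two-clock hold, the buffer's read latency, and pending buffer requests, and all identifiers are hypothetical.

```c
#include <assert.h>

/* Candidate sources for an adder operand, listed in priority order. */
typedef enum { SRC_MULT1, SRC_MULT2, SRC_ADDER, SRC_BUFFER, SRC_COUNT } source_t;

/* Choose up to two operands from the sources flagged as available.
 * Returns how many were chosen (0..2) and writes them to out[]. */
int pick_adder_inputs(const int avail[SRC_COUNT], source_t out[2])
{
    int n = 0;
    assert(avail && out);
    for (int s = 0; s < SRC_COUNT && n < 2; ++s)
        if (avail[s])
            out[n++] = (source_t)s;
    return n;
}
```

When only one source is chosen, the design described above would hold that value for up to two clocks and then write it to the buffer; any available source left unchosen would likewise be written to the buffer.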
One complication of using a buffer with a delayed output is that once a
request has been made for buffer data, that data holds the "trump card" over all other inputs.
This is because of the complicated nature of timing its availability with the
unpredictability of other inputs. If the first input for the adder is a call to the buffer for
input, the process monitoring all the activity will wait for another input to become available
while the buffer output is taking its two clock cycles to come out. If something becomes
available, the newly available data is sent to one of three stages to time it with the
buffer's output into the adder. If more than one piece of data becomes available while
waiting on output from the buffer, the priority scheme takes effect. If two inputs are
available, one will be sent to the buffer while the other will be sent with the buffer
output to the adder. If data is available from both multipliers and the adder while not
waiting for output from the buffer, an overflow signal must be used to store the second
extra piece of data available. The worst-case scenario occurs when two values are being
pulled from the buffer (one ahead of the other) and values become available from the
multipliers and the adder. Now both buffer outputs hold the highest priority, one
multiplication result is written to the buffer, the other multiplication result is written to
the overflow signal, and the adder result is written to a second overflow signal.
Fortunately, this worst-case scenario cannot happen in consecutive clocks or every other
clock, as it takes at least that many clocks for such a situation to develop. This allows time
immediately following the worst-case scenario to clear out the two overflow signals so
they are not overwritten. Another reason why the worst-case scenario cannot repeat itself
is that once multiplication results are incoming and the worst-case scenario has occurred,
the multipliers will control the inputs into the adder for the next several clock cycles, thus
flushing out any remaining multiplication results. All adder results in the meantime are
written to the buffer.
4.5.5 Double Precision Floating-Point Multiplier and Adder 
The floating-point multipliers and adder both handle double precision (64-bit)
data and are pipelined. The multipliers have 9 pipeline stages and the adder has
13 pipeline stages. Both support the IEEE 754 format and are constrained as mentioned
in the Assumptions section.
The multipliers XOR the sign bits to determine the sign of the answer.
The fractional part of each input has a one prepended to it to account for the implied
one, and the two are multiplied together. Meanwhile, the exponents are added together and
then re-biased (by subtracting 1023), since the sum contains the bias of each exponent.
The exponent result is then checked for overflow. After these simultaneous
processes have occurred, the top 54 bits of the multiplication result are taken, and the
rest discarded. If the highest bit of the multiplication result is a one, then the exponent
needs to be incremented by one and the fraction shifted left by one. If the highest bit is
a 0, then the fraction is shifted left by two. After the fractional shift, the top 52
bits are kept to fit the fraction format in the IEEE standard. The sign bit, exponent, and
fraction are then put together to form the 64-bit double precision representation. The
flowchart in Figure 4.16 outlines the behavioral model of two floating-point
numbers being multiplied on a bit level.
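The bit-level steps above can be followed in a small C model. It is a sketch under stated assumptions rather than the VHDL implementation: inputs must be normalized and nonzero, exponent overflow and underflow are only asserted against, and the result is truncated as described, so it can differ in the last bit from a compiler's round-to-nearest product. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FRAC_BITS 52
#define FRAC_MASK ((UINT64_C(1) << FRAC_BITS) - 1)
#define EXP_BIAS  1023

static uint64_t to_bits(double d)   { uint64_t u; memcpy(&u, &d, sizeof u); return u; }
static double from_bits(uint64_t u) { double d; memcpy(&d, &u, sizeof d); return d; }

/* Top 54 bits (bits 105..52) of the product of two 53-bit significands,
 * computed from 27-bit and 26-bit halves so nothing exceeds 64 bits. */
static uint64_t top54_of_product(uint64_t a, uint64_t b)
{
    uint64_t a_hi = a >> 26, a_lo = a & 0x3FFFFFF;
    uint64_t b_hi = b >> 26, b_lo = b & 0x3FFFFFF;
    uint64_t mid = a_hi * b_lo + a_lo * b_hi;
    uint64_t low = a_lo * b_lo;
    return a_hi * b_hi + ((mid + (low >> 26)) >> 26);
}

double fp_multiply(double x, double y)
{
    uint64_t xb = to_bits(x), yb = to_bits(y);
    uint64_t sign = (xb ^ yb) >> 63;                   /* XOR of the sign bits */
    int64_t ex = (int64_t)((xb >> FRAC_BITS) & 0x7FF);
    int64_t ey = (int64_t)((yb >> FRAC_BITS) & 0x7FF);
    assert(ex != 0 && ex != 0x7FF && ey != 0 && ey != 0x7FF);

    uint64_t mx = (xb & FRAC_MASK) | (UINT64_C(1) << FRAC_BITS); /* implied one */
    uint64_t my = (yb & FRAC_MASK) | (UINT64_C(1) << FRAC_BITS);
    uint64_t top = top54_of_product(mx, my);
    int64_t e = ex + ey - EXP_BIAS;                    /* remove the doubled bias */
    uint64_t frac;
    if (top >> 53) {          /* highest bit set: bump exponent, drop 1 bit */
        e += 1;
        frac = (top >> 1) & FRAC_MASK;
    } else {                  /* highest bit clear: drop 2 leading bits */
        frac = top & FRAC_MASK;
    }
    assert(e > 0 && e < 0x7FF);
    return from_bits((sign << 63) | ((uint64_t)e << FRAC_BITS) | frac);
}
```

For example, multiplying 1.5 by 1.5 sets the highest product bit, so the exponent is incremented and the result is 2.25; multiplying 3.0 by 5.0 does not, and the result is 15.0.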
The floating-point adder is more involved. First, the larger input needs to be
determined. This is done by subtracting the two exponents. If the exponent difference is
positive, then the first operand is the larger; otherwise, the second operand is. The larger
operand's exponent is stored for the answer, while the exponent difference is used
to shift the fraction of the smaller number to the right, aligning it with the larger
number's fraction for addition. The sign bit will also be equal to the larger number's sign
bit. If subtraction is being performed (the sign bits are different), the smaller number's
fraction needs to be two's complemented after being shifted. Before any modifications are
made to either fraction or before they are added, a 1 is appended to the highest bit to account for the










Figure 4.16 - Floating-Point Multiplier Flow Chart
are summed. After the fractions are added, the resulting fraction must be shifted left until
the highest bit is one to renormalize the fraction. The sign bit, exponent, and resulting
fraction are all appended together to form the double precision addition result.
The flowchart in Figure 4.17 depicts this process.
4.5.6 C Code Interface
The C code that will interface with the Pilchard System is an optimized code that 
is written to cater to the state machine inside of the FPGA; therefore, to a large degree the 
state machine of the C code will look identical to the FPGA state machine. The Pilchard 
System comes with a special header file and C file that defines special functions to map 
the Pilchard to memory, and to read and write to the Pilchard System. The Read64 and 
Write64 commands will read and write 64 bits of data and the inputs to the functions are 
assigned their values by using pointers. This is so 64-bit double precision values do not
have to be manipulated in order to store the data in the upper and lower 32-bit halves of the
data type required for the special read and write functions.
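The idea of reusing a double's bit pattern without arithmetic manipulation can be shown with a stand-in type. The struct `pair32_t` below is hypothetical and is not the Pilchard header's actual type, and `memcpy` is used here as a portable equivalent of the pointer-based assignment described above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a 64-bit transfer type made of two 32-bit halves. */
typedef struct { uint32_t w[2]; } pair32_t;

/* Copy the raw bit pattern of a double into the transfer type. */
pair32_t pair_from_double(double value)
{
    pair32_t p;
    assert(sizeof p == sizeof value);
    memcpy(&p, &value, sizeof p);
    return p;
}

/* Recover the double from the transfer type, bit for bit. */
double double_from_pair(pair32_t p)
{
    double value;
    memcpy(&value, &p, sizeof value);
    return value;
}
```

No shifting or masking of the value is needed in either direction; which half holds the upper 32 bits depends on the host's byte order.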
The C code begins by opening the necessary files, which are in CRS format,
checking how big the matrix and vector are, dynamically allocating space, and
storing all of the data in arrays. After closing those files, the Pilchard space in memory
is initialized. Now the processing can begin. The C code has several states:
INITIALIZE, SEND_VADDR, SEND_VDATA, CMPING, GET_STATUS,
SEND_MDATA, and GET_ANS. The code starts out in the INITIALIZE state by
sending the FPGA the number of rows in the matrix. It then transitions to the
SEND_VADDR state, sending two addresses consecutively to achieve the four 32-bit
Figure 4.17 - Floating-Point Adder Flow Chart
address input. After this state, the program goes to the SEND_VDATA state, where
four separate writes are performed to write the four vector values that correspond to
the address values. After sending 32 vector locations, the state machine moves to
the CMPING state. If there is an odd number of vector entries left, or
if the amount of vector data to be sent is less than 32, then the C code sends all zeros
for the remaining address data and values. This is so an over flag will be correctly triggered
provided that the matrix addresses exceed the vector address range. These states keep
track of where they are in the vector so that each new section of the vector is loaded into
the FPGA appropriately. Figure 4.18 provides a graphical view of the state machine.
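The zero padding of a short vector section might look like the following in C. The function and parameter names are hypothetical; `addr` and `val` stand for the nonzero addresses and values of the sparse vector as read from the CRS files.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTION 32  /* vector entries held on the FPGA at one time */

/* Fill one 32-entry section starting at index `start` of a sparse vector
 * with nnz nonzeros, padding unused slots with zero addresses and values.
 * Returns the number of real entries copied. */
size_t build_vector_section(const uint32_t *addr, const double *val,
                            size_t nnz, size_t start,
                            uint32_t out_addr[SECTION], double out_val[SECTION])
{
    assert(start <= nnz);
    size_t real = (nnz - start < SECTION) ? nnz - start : SECTION;
    for (size_t i = 0; i < SECTION; ++i) {
        out_addr[i] = (i < real) ? addr[start + i] : 0;
        out_val[i]  = (i < real) ? val[start + i]  : 0.0;
    }
    return real;
}
```

The caller would advance `start` by 32 after each section, which is one way of keeping track of the current position in the vector.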
The CMPING state is very straightforward in that it sends matrix addresses to the
FPGA for comparison. It sends 56 addresses, two per write. If the addresses
to send run out before 56 addresses have been sent, the program sends all ones as the
addresses to trip the over flag on the FPGA and let it know that the row is done. Before
leaving this state, the program checks to see if it has sent information from the last row.
Next, the state machine proceeds to the GET_STATUS state, where it reads the 64-bit
output from the FPGA to get status information on what to do next. If the match bit is
high, the program knows to go to the SEND_MDATA state next. After this check, the
over bit is checked. If the over bit is one, the program scans the 56
bits of the FPGA output from right to left to find the first one. The first one that is found
marks the point in the row address transmission at which the matrix address values exceeded
the vector address range. This point is remembered for when this row is processed again after a new set of




Figure 4.18 - C Code State Machine
left off. After finding the over bit, that bit is set to zero. This is done because the
SEND_MDATA state will check these bits for a one, and will send the corresponding
data if a one is found. After all of the over processing is done, or after a match flag is
found without the over flag equal to one, the state machine transfers to one of three
states: SEND_MDATA, GET_ANS, or CMPING. If the match flag was high, the state
machine goes to the SEND_MDATA state next. If the over flag was high, then the
state machine transitions to the GET_ANS state; otherwise, the CMPING state is next.
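The host-side handling of the status word can be sketched as follows. The names are hypothetical, and the mapping of bit 55 to the first address sent is the same illustrative assumption used for the encoded vector; the thesis does not give the C source.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { C_SEND_MDATA, C_GET_ANS, C_CMPING } c_state_t;

/* Scan the 56-bit vector from right (bit 0) to left and return the bit
 * index of the first one found, or -1 if the vector is empty. */
int find_over_bit(uint64_t hit56)
{
    for (int bit = 0; bit < 56; ++bit)
        if ((hit56 >> bit) & 1u)
            return bit;
    return -1;
}

/* Record where the row must resume (as a position in the block transfer)
 * and clear the over bit so it is not mistaken for a match. */
uint64_t strip_over_bit(uint64_t hit56, int *resume_order)
{
    int bit = find_over_bit(hit56);
    assert(bit >= 0 && resume_order);
    *resume_order = 55 - bit;
    return hit56 & ~(UINT64_C(1) << bit);
}

/* Next C state after GET_STATUS: matches first, then over, then more compares. */
c_state_t next_c_state(int match, int over)
{
    if (match) return C_SEND_MDATA;
    return over ? C_GET_ANS : C_CMPING;
}
```

`strip_over_bit` should be called only when the over flag is set, since otherwise the rightmost one would be an ordinary match bit.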
If the state machine goes to the SEND_MDATA state, the program traverses
the 56-bit match vector from left to right to send matching data in order. After gathering
two matches, it writes the two matrix values. If there is an odd number of matches,
the program sends in dummy data so that there is an even number of writes
(so that the asynchronous FIFO in Pcore gets 128 bits; the FPGA will not process the dummy
data). After the matching data has been sent, all zeros are sent to notify the FPGA that all
data has been sent. This occurs when 56 values have been transmitted; alternatively, if the
stop-point (the point at which an "over" occurred) is reached while sending data, the state
terminates the sending of data and sends in all zeros to signify that it is done. If the over bit
is high, the state machine then moves to the GET_ANS state; otherwise, it moves on to the
CMPING state.
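A C sketch of how the SEND_MDATA writes could be assembled is given below. The function name, the dummy value of 1.0, and the bit-55-first ordering are assumptions made for illustration; the all-zero terminator is modeled as two 64-bit writes that together form one 128-bit zero word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Build the sequence of 64-bit matrix values to write for one block.
 * block[i] is the matrix value of the i-th address sent in the block;
 * hit56 has bit (55 - i) set where that address matched (over bit cleared).
 * out must hold at least 58 entries. Returns the number of writes. */
size_t build_mdata_writes(uint64_t hit56, const double block[56], double out[58])
{
    size_t n = 0;
    assert(block && out);
    for (int order = 0; order < 56; ++order)       /* left to right */
        if ((hit56 >> (55 - order)) & 1u)
            out[n++] = block[order];
    if (n % 2 != 0)
        out[n++] = 1.0;     /* dummy value (hypothetical choice) to even out the writes */
    out[n++] = 0.0;         /* all-zero 128-bit terminator, */
    out[n++] = 0.0;         /* sent as two 64-bit writes    */
    return n;
}
```

Because the over bit and everything to its right are already zero, the left-to-right traversal naturally stops contributing values at the stop-point.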
The GET_ANS state simply waits for a valid answer to present itself
from the FPGA. When it does, the partial sum is taken and added to the existing partial
sum for that particular row in the answer vector. If the program had been processing the
last row of the matrix (but not the last row and set of columns), it then goes to the
SEND_VADDR state to send in new vector data and start processing the next chunk of
the matrix. If the last row and column were just processed, then the program has finished;
otherwise, the program proceeds to the CMPING state where the next set of row




The following chapter summarizes the results of the FPGA assisted computer's 
design implementation in comparison to the stand-alone processor's results. Several 
points of interest will be observed and evaluated to help distinguish the differences and 
similarities in the results. The overall performance of the FPGA design yielded slower 
results than hoped, in that the stand-alone processor outperformed the FPGA design. The
design was placed and routed with timing constraints of 50 MHz for the sparse matrix
sparse vector portion of the FPGA, while the Pcore interface ran at the 100 MHz bus speed so
it could supply 128 bits per 50 MHz clock. Approximately 70% of the FPGA's slices
were used and approximately 60 of the 96 block RAM locations were also utilized. The
following sections will discuss and interpret the results and difficulties encountered in
developing a double precision floating-point sparse matrix sparse vector multiplier on an
FPGA.
5.1 Comparison of Results 
In the process of evaluating results, it is important to put them in proper
perspective. To accomplish this, various characteristics of sparse matrix sparse vector
multiplication data will be utilized in the analysis of the results: overall
performance, hits (when a matrix element address and vector address match to yield a
FPMAC), compares, hits-to-compares ratio (Ψ), quantity of nonzero values, the number
of vector loads, and percentage of theoretical MFLOPS achieved. Several sets of test
data were used to determine all of the following evaluations. When observing these
results, note that dataset 4 was varied four times, with those variations all yielding
extremely similar results. A table with all of the statistics regarding each dataset can be
viewed in Appendix A.
The performance of the FPGA assisted implementation proved to be slow at best
when compared to the stand-alone computer's performance. Across the various tests, the
stand-alone implementation averaged 50-60 times faster than the FPGA assisted
implementation. This slow-down will be discussed further in the difficulties
section later in this chapter. The figures throughout this chapter relate the difference in
computational performance to the characteristics mentioned above. In all of the
graphs, the legend shows the curves for the "CPU time" and "FPGA time", where the
"CPU time" refers to the total time for the stand-alone processor to compute its results,
while the "FPGA time" represents the time taken for the FPGA and supporting CPU to
compute its results. Both times include the time spent communicating the data over the
memory bus. Due to the large performance difference between designs, all
characteristics plotted versus performance are shown on both a linear and a logarithmic
scale for execution time. Also, in all graphs where the y-axis represents time, the time is
in microseconds.
Figure 5.1 plots execution time for each of the datasets used. The performance 
slow-down in the FPGA design is obvious when comparing results between the two 
design implementations. This graph depicts the 50-60x performance difference 
throughout the datasets. The performance of the two designs appears to mimic one 
another on the logarithmic scaled graph in Figure 5.2. 
[Figure 5.1 - Dataset Performances. Plot "Overall Performance": CPU time and FPGA time versus Data Set (Data 1, Data 2, Data 3, Data 4a-4d, Data 5).] 
[Figure 5.2 - Dataset Performances (Log Scale). Same plot with a logarithmic time axis.] 
The following two figures, Figure 5.3 and 5.4, plot the number of hits against 
execution time. Both figures continue the trend of a 50-60 times performance slow-down 
for the FPGA based design. The hits were determined by counting the total number of 
vector address to matrix address compare matches in each dataset computation. 
The performance times generally increase for both designs as the number of hits increases. 
This is likely due to the additional floating-point operations and matrix data 
that need to be communicated. The four points with nearly identical performance as the 
number of hits varies represent the dataset 4 variations, where the number of hits was 
altered on purpose to observe performance tradeoffs as the number of hits is varied 
for a given dataset. 
[Figure 5.3 - Performance to Hits. Execution time versus Hits.] 
[Figure 5.4 - Performance to Hits (Log Scale). FPGA time and CPU time versus Hits.] 
The next two figures, Figure 5.5 and Figure 5.6, depict the performance of both 
the FPGA and CPU against the number of compares incurred by the stand-alone 
processor. The results show the performance time increasing with the number of 
compares executed. The logarithmic scale shows both performance times increasing at 
roughly the same rate. At the far right hand side of Figure 5.6, it appears as if the 
performance time continues to increase for the CPU while the FPGA performance begins 
to level out. The results here are surprising, as it was expected that the logarithmic 
curves would at least converge, proving the effectiveness of the parallel compares on the 
FPGA. While the trend on the far right hand side of the graph may support this 
expectation, there is not enough data here to fully support it. 
[Figure 5.5 - CPU Compares to Performance. Plot "Compares to Performance": FPGA time and CPU time versus Compares.] 
[Figure 5.6. Plot "Compares to Performance": FPGA time and CPU time versus Compares, logarithmic time axis.] 
Figures 5.7 and 5.8, which compare Ψ to performance, show an interesting trend 
between the datasets. Each row of a matrix multiplied by the vector yields its own 
number of compares and hits. A ratio of hits to compares can then be determined for 
each row. Ψ represents the average of these ratios over all of the rows of a matrix. 
Ψ is important because viewing the performance trends against the number of hits or 
compares separately does not take the whole picture into account. The performances of 
the stand-alone and FPGA assisted computer designs could have a different relationship 
when looking at the effects Ψ has on them. In order to isolate the effects Ψ has on both 
methods, the same dataset was used for the first four points of Figure 5.7 and Figure 
5.8, with the number of hits varied to alter Ψ; these are the four variations of 
dataset 4. The next three points are datasets with relatively the same number of nonzero 
values, while the last data point is a small matrix. The varied data essentially shows 
no performance variation. This is most likely due to the structures of the variations 
being similar, with the matched data also positioned close together in the matrix. The 
three data points in the center convey the potential impact Ψ has on the overall 
performance. Evaluating these three points shows a potential for performance time 
increasing as Ψ increases. 
[Figure 5.7 - Performance to Psi. Plot "Psi to Performance": FPGA time and CPU time versus Psi.] 
[Figure 5.8 - Performance to Psi (Log Scale). Plot "Psi to Performance": FPGA time and CPU time versus Psi.] 
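The definition of Ψ can be made concrete with a short sketch. The Python below is illustrative only: it assumes each matrix row and the vector are given as sorted lists of nonzero addresses, and it counts compares the way a merge-style software loop would (one compare per loop step), which is an assumption about the stand-alone code rather than something stated here. The names `row_hits_and_compares` and `psi` are hypothetical.

```python
def row_hits_and_compares(row_addrs, vec_addrs):
    """Count address matches (hits) and compares for one matrix row.

    Both inputs are sorted lists of nonzero addresses. A two-pointer
    merge is assumed, with one compare counted per loop iteration.
    """
    i = j = hits = compares = 0
    while i < len(row_addrs) and j < len(vec_addrs):
        compares += 1
        if row_addrs[i] == vec_addrs[j]:
            hits += 1
            i += 1
            j += 1
        elif row_addrs[i] < vec_addrs[j]:
            i += 1
        else:
            j += 1
    return hits, compares


def psi(matrix_rows, vec_addrs):
    """Average of the per-row hits-to-compares ratios.

    Rows that need no compares are skipped (an assumption).
    """
    ratios = []
    for row in matrix_rows:
        hits, compares = row_hits_and_compares(row, vec_addrs)
        if compares:
            ratios.append(hits / compares)
    return sum(ratios) / len(ratios) if ratios else 0.0
```

For a row with addresses 1, 4, 7 and a vector with addresses 2, 4, 9, this counting yields one hit in four compares, a ratio of 0.25 for that row.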
The next characteristic observed is the number of nonzeros found in each matrix. 
Figures 5.9 and 5.10 clearly depict performance reduction as the number of nonzeros 
increases. 
[Figure 5.9 - Nonzero to Performance. Plot "Nonzeros to Performance": FPGA time and CPU time versus Nonzeros.] 
[Figure 5.10 - Nonzeros to Performance (Log Scale). Plot "Nonzeros to Performance": FPGA time and CPU time versus Nonzeros.] 
For each nonzero value that exists, at least one compare must be executed on the 
stand-alone computer while 32 compares will be performed on the FPGA. 
The final characteristic observed is the number of vector loads necessary to 
complete a full sparse matrix sparse vector multiplication on an FPGA. Each vector load 
represents only a portion of the main vector, as portions are loaded as needed and 
only once each. Because each portion is loaded only once, loading the vector in pieces 
adds no load time beyond that of loading the entire vector once; however, each load 
disrupts the flow of the program and requires a partial sum per matrix row, per vector 
load. As the number of vector loads grows, the overall computation takes longer because 
more partial sums are required. Figure 5.11 shows the performance of the FPGA against 
the number of vector loads. 
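The relationship between vector loads and partial sums can be sketched in software. The following Python is a minimal model, not the FPGA design itself: it assumes rows and the vector are stored as address-to-value dictionaries and that each load holds 32 nonzero vector entries (the onboard vector size mentioned later in this chapter). The function name and return format are hypothetical.

```python
def spmspv_by_vector_loads(rows, vector, slots=32):
    """Multiply sparse rows by a sparse vector, one vector portion at a time.

    rows:   list of dicts {column address: value}, one per matrix row
    vector: dict {address: value} holding the nonzero vector entries
    slots:  nonzero vector entries held per load (32 is assumed here)

    Returns (result, loads, partial_sums), where result[r] is the dot
    product for row r and partial_sums counts one partial sum per row
    per vector load.
    """
    addrs = sorted(vector)
    result = [0.0] * len(rows)
    loads = partial_sums = 0
    for start in range(0, len(addrs), slots):
        chunk = addrs[start:start + slots]   # one vector load
        loads += 1
        for r, row in enumerate(rows):
            partial = sum(row[a] * vector[a] for a in chunk if a in row)
            result[r] += partial             # accumulate this load's partial sum
            partial_sums += 1
    return result, loads, partial_sums
```

In this model the number of partial sums is the number of rows times the number of loads, which is why additional loads lengthen the computation even when no extra vector data is transferred.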
[Figure 5.11 - Vector Loads to FPGA Performance. Plot "Vector Loads to Performance": FPGA time versus Vector Loads.] 
One other area that provided interesting results was comparing the number of 
FPGA compares versus CPU compares for each dataset. Due to the 128 simultaneous 
compares per 50 MHz FPGA clock cycle, the number of compares performed reaches 
into the tens of thousands, while the CPU performs just the number of compares 
necessary. It is important to mention, though, that the FPGA pays no penalty for 
computing excess compares as they are all done in parallel. Figure 5.12 shows these 
results, while Figure 5.13 puts the same results on a logarithmic scale. 
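A rough software emulation shows why the FPGA compare count grows so quickly. The sketch below assumes that the 128 simultaneous compares arise from 4 matrix addresses each meeting 32 stored vector addresses per clock; the figure of 4 is inferred from 128 / 32 and is not stated explicitly. All names are hypothetical.

```python
VECTOR_SLOTS = 32        # vector addresses held on the FPGA
ADDRS_PER_CLOCK = 4      # assumed: 128 compares / 32 slots


def compare_clock(matrix_addrs, vector_slots):
    """Emulate one clock: every matrix address meets every vector slot.

    Returns (hits, compares), where hits is a list of
    (matrix address, slot index) pairs whose addresses matched.
    """
    hits = [(m, s)
            for m in matrix_addrs
            for s, v in enumerate(vector_slots)
            if m == v]
    return hits, len(matrix_addrs) * len(vector_slots)


def fpga_compares(row_addrs, vector_slots):
    """Total hits and compares for one row streamed 4 addresses per clock."""
    total_hits, total_compares = [], 0
    for k in range(0, len(row_addrs), ADDRS_PER_CLOCK):
        hits, compares = compare_clock(row_addrs[k:k + ADDRS_PER_CLOCK],
                                       vector_slots)
        total_hits += hits
        total_compares += compares
    return total_hits, total_compares
```

Under these assumptions a row of 8 nonzeros costs 256 compares regardless of how many hits occur, whereas a software loop performs only as many compares as its traversal requires.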
[Figure 5.12. CPU Compares and FPGA Compares versus Dataset.] 
[Figure 5.13 - Compares per Dataset (Log Scale). CPU Compares and FPGA Compares versus Dataset.] 
The various graphs paint a picture of the different intricacies affecting overall 
performance. Most of the characteristics do not influence the outcome on their own, but 
have a collective effect. The most important characteristic is Ψ, as it takes into account 
the floating-point operations that will be necessary to solve the problem due to hit 
quantity. Because of this, there is no single MFLOPS figure to be obtained. As Ψ varies, 
the actual MFLOPS, or the percentage of theoretical MFLOPS achieved, will also vary. For 
example, the third dataset has 320 hits. The stand-alone CPU runs at 933 MHz; therefore, 
its theoretical MFLOPS is 933. Its actual MFLOPS for this problem is 12.8. The 
theoretical MFLOPS for the FPGA design is 150 while the actual MFLOPS is 0.278. The 
percentage of the theoretical MFLOPS yielded is 1.07% and 0.19% for the stand-alone 
and FPGA based designs respectively. Figures 5.14 and 5.15 display the variation 
in percentage of theoretical MFLOPS achieved as Ψ varies. 
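The FPGA figures quoted for dataset 3 can be reproduced with a small calculation. This sketch assumes that each hit counts as two floating-point operations (one multiply and one add) and uses the dataset 3 FPGA execution time of roughly 2300 microseconds reported in Section 5.2.2; the function names are hypothetical.

```python
def actual_mflops(hits, time_us, flops_per_hit=2):
    """Floating-point operations per microsecond, i.e. MFLOPS.

    flops_per_hit=2 assumes each hit costs one multiply and one add.
    """
    return hits * flops_per_hit / time_us


def percent_of_theoretical(actual, theoretical):
    """Actual MFLOPS expressed as a percentage of the theoretical figure."""
    return 100.0 * actual / theoretical


# Dataset 3 on the FPGA: 320 hits, roughly 2300 microseconds.
fpga = actual_mflops(320, 2300)
fpga_pct = percent_of_theoretical(fpga, 150)
print(f"{fpga:.3f} MFLOPS, {fpga_pct:.2f}% of theoretical")
# prints "0.278 MFLOPS, 0.19% of theoretical"
```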
When looking back over the graphs, the number of nonzero values plays a major 
role in all of these results, for each nonzero value is processed through the designs. 
[Figure 5.14 - Percentage of Theoretical MFLOPS Achieved. Percent of MFLOPS for the CPU and FPGA versus Psi.] 
[Figure 5.15 - Percentage of Theoretical MFLOPS Achieved (Log Scale). Percent of MFLOPS for the CPU and FPGA versus Psi.] 
When comparing results, the number of nonzero values must be taken into 
consideration. If the number of nonzero values is relatively the same across datasets, 
the results allow for comparison of other characteristics. If the number of nonzero 
values is not relatively the same, then the results cannot be compared. The next 
sections discuss the difficulties involved. 
5.2 Difficulties 
Several areas of difficulty were encountered in the development of the sparse 
matrix sparse vector computation engine. These included developing a proper 
interface from the FPGA back out to the C code, memory and I/O limitations, and a 
minor logic glitch. 
5.2.1 Pcore Interface 
An extremely large amount of time was spent on developing a consistently working 
Pcore interface to connect the sparse matrix sparse vector multiplication code to the 
memory bus. Roughly 33% of the design time was spent creating and adjusting a Pcore 
interface to adequately support the overall design. The Pcore interface must monitor 
data traffic across two different clock speeds: the memory bus (100 MHz) and the sparse 
matrix sparse vector multiplication code (50 MHz). The synchronization of data between 
the two clocks presented the greatest challenge. There is no guarantee of a consistent 
stream of data from the memory bus; therefore, the Pcore interface must be very 
versatile as well as fast. To further increase the difficulty of this task, the 
Pilchard System's design does not lend itself to post-layout simulation, a key stage in 
the design process where the designer can get a much more realistic view of how a 
design will work in reality, versus the perfect world of presynthesis simulation. Pcore 
designs ranged from using block RAM as the main device for interfacing to the memory 
bus, to relying on the handshaking read/write signals to grab and send data 
appropriately. Various combinations of designs were implemented with varying degrees of 
success. Often this part of the design drove the overall clock speed of the design as a 
whole. This is because the Pcore needs to run twice as fast as the rest of the FPGA so 
that it can supply twice the amount of data (128 bits) as the memory bus provides per 
transfer (64 bits) in one 50 MHz clock cycle. The final design used an asynchronous 
FIFO buffer created by Xilinx's Coregen to handle the movement of data between the two 
clock speeds. A small (4 address locations) block RAM was used for data leaving the 
FPGA for the memory bus to read. 
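The rate matching between the two clock domains can be illustrated with a toy model. The sketch below is not the Coregen FIFO; it only shows that 64 bits at 100 MHz and 128 bits at 50 MHz carry the same data rate, and how two bus words pair into one wide word. The class name and the choice of which word forms the upper half are assumptions.

```python
from collections import deque

BUS_BITS, BUS_MHZ = 64, 100      # memory bus side
CORE_BITS, CORE_MHZ = 128, 50    # sparse matrix sparse vector side

# Both sides must move the same number of bits per microsecond.
assert BUS_BITS * BUS_MHZ == CORE_BITS * CORE_MHZ


class WidthConvertingFifo:
    """Toy model of a FIFO written 64 bits at a time and read 128 at a time."""

    def __init__(self):
        self._words = deque()

    def write64(self, word):
        # Keep only the low 64 bits, as a 64-bit bus would.
        self._words.append(word & (2**64 - 1))

    def can_read128(self):
        return len(self._words) >= 2

    def read128(self):
        # Assumption: the first word written becomes the upper half.
        high = self._words.popleft()
        low = self._words.popleft()
        return (high << 64) | low
```

A 128-bit word only becomes available after two bus writes, which is why an irregular stream from the memory bus forces the slower side to check for data before each read.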
5.2.2 Memory and I/O Constraints 
Memory and I/O limitations dominated the performance of the FPGA assisted sparse 
matrix sparse vector multiplication. These limitations ranged from space on the FPGA, 
to FPGA slice usage, to a lack of RAM on the Pilchard System board to supply this data 
hungry design with information as fast as possible. 
When possible, Xilinx Coregen components were utilized to help save space and 
time in the FPGA design; however, this was not always enough. Originally the FPGA 
was to hold a vector of size 64 and perform 256 simultaneous compares. Unfortunately, 
doing so utilized 87% of the FPGA's slices when combined with the rest of the design. 
Completely placing and routing a design with slice usage over 75% becomes difficult, 
not even considering the small likelihood of meeting timing constraints. To alleviate 
this limitation, the onboard vector size was cut back to 32, thus reducing the total 
number of simultaneous compares to 128 in one clock cycle. The vector data was also 
moved from registers to block RAM. The slice usage dropped to 70% of the FPGA's 
available 12,288 slices. This was enough to allow the design to be placed and routed at 
the necessary timing constraints. The effect of reducing the vector size was felt 
immediately in presynthesis simulation, as simulation times nearly doubled. The effect 
on the actual FPGA was never observed because the larger design could never be placed 
and routed. 
Lastly, due to the need for the FPGA to have constant and quick access to large 
amounts of data, the use of onboard RAM on the Pilchard would have been very 
beneficial. The FPGA must grab data from the main memory (RAM) on the CPU side, which 
is a very time consuming operation, whereas typical programs can utilize cache. RAM on 
the Pilchard board would not only act similar to cache, but it would also give the 
FPGA the ability to have the entire vector stored nearby instead of some location off 
in the main memory. This would eliminate time lost to memory bus contention. Also, 
large amounts of the matrix, and possibly the entire vector, could be stored. This 
reduction in the use of the memory bus would reduce the communication cost incurred 
over essentially the entire design execution time. 
The single factor that potentially had the largest effect was the I/O constraint. 
The design is highly dependent upon the I/O or memory bus due to the large amounts of 
data transfer. A small test was conducted to measure the time to perform 10 writes to 
get an estimate of how much time is spent on I/O in the overall design. This test 
concluded that 10 writes take 4.5 µs. Simulation provided accurate results, but neither 
the implementation of the Pilchard System Wrapper nor the memory bus could be 
accurately simulated, and the execution time estimated by the simulator painted a much 
different picture. For dataset 3, the FPGA execution time was 2300 µs. By determining 
the number of writes and using the small test data, it was determined that 30% of that 
execution time was spent conducting I/O. While this seems underestimated and the small 
test is likely inaccurate when referring to a larger design, the simulation results 
estimated that the design should execute in 53 µs, a 97% improvement. At the very 
least, I/O performance has a huge impact on the design, whether it accounts for 30% or 
as much as 97% of the execution time. 
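The arithmetic behind these two estimates can be laid out explicitly. The sketch below uses only the numbers quoted above; the number of writes for dataset 3 is not given here, so the script back-calculates the write count that a 30% share would imply. Variable and function names are hypothetical.

```python
WRITE_TEST_US = 4.5      # measured time for 10 writes
WRITES_IN_TEST = 10
FPGA_TIME_US = 2300.0    # dataset 3, measured on the FPGA
SIM_TIME_US = 53.0       # dataset 3, estimated by the simulator

us_per_write = WRITE_TEST_US / WRITES_IN_TEST


def io_fraction(num_writes, total_us=FPGA_TIME_US):
    """Fraction of the run spent on writes, at the measured cost per write."""
    return num_writes * us_per_write / total_us


# Fraction of the measured time that the simulator does not account for.
unexplained = (FPGA_TIME_US - SIM_TIME_US) / FPGA_TIME_US

# Number of writes that a 30% I/O share would imply (back-calculated).
implied_writes = 0.30 * FPGA_TIME_US / us_per_write

print(f"{us_per_write:.2f} us per write")
print(f"{unexplained:.1%} of the run not explained by simulation")
print(f"about {implied_writes:.0f} writes implied by a 30% share")
```

The gap between the measured and simulated times is what bounds the I/O share from above, while the write test bounds it from below.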
While performance results did not meet the desired goals, it is clear what is 
necessary to alleviate these problems: larger FPGAs, multiple FPGAs, improved 
interfacing, and onboard RAM. These suggestions for improvement are discussed in 
further detail in the next chapter, Conclusions and Future Work. 
5.2.3 Logic Glitch 
Through the various tests run on the FPGA design, a minor flaw in the logic was 
exposed. Fortunately this glitch has no effect on performance or accuracy when the 
FPGA produces results. The flaw was identified as the FPGA logic not synchronizing 
correctly with the host C code when a set of compares produced no hits for a partial 
sum, thus yielding a partial sum of zero. Currently, several conditions must be met for 
a zero partial sum to occur, and not all of these conditions are being met. When this 
occurs, the FPGA essentially locks into the SEND state. Extensive simulation never 
brought this situation to light, even when using the same test data that caused the 
error in real-time calculations on the FPGA, thus underlining the importance of 
post-synthesis and post-layout simulation capabilities. Due to the specific nature of 
this issue, it can be efficiently resolved by creating a process on the FPGA to monitor 
whether hits ever occur during a partial sum. If no hits ever occur, the C code and 
FPGA will not expect nor send a zero partial sum respectively, and both will move on to 
the next appropriate state. This 




Conclusions and Future Work 
In hindsight, the overall design was successful in that results were achieved, 
providing data from which to extrapolate and learn about sparse matrix sparse vector 
multiplication on an FPGA. When comparing the performance to the stand-alone 
processor, a significant gap in performance remains to be corrected and improved upon. 
This chapter discusses future work by analyzing areas of improvement and 
interesting applications to apply this design to. Finally, conclusions will be given 
encapsulating the entire experience. 
6.1 Hardware Improvements 
A few areas are available for improvement on the hardware side of the system. The 
Pilchard System is aging; new Xilinx Virtex-II Pro FPGAs are larger, faster, and have 
optional PowerPC processors on board, and onboard memory or cache could be added 
to the system. The Pilchard System was designed to operate on a 133 MHz memory bus; 
however, today's computers have much faster memory buses with speeds of up to 800 
MHz [18]. The Pilchard System cannot take advantage of today's bus speeds. If the 
overall system hardware were upgraded, several innovations could play an immense role 
without the sparse matrix sparse vector design even changing. If the bus were dedicated 
to the FPGA's needs and running at a speed of 800 MHz, the bus could theoretically 
support up to 8 FPGAs all running at 100 MHz, assuming the Pilchard board were 
compatible or upgraded. 
Upgrading the FPGA could play a significant role in improvement as well. If a 
board using a Xilinx Virtex-II Pro X were placed on one of the latest computers, 
speedups and improvements would be found in several areas. The Xilinx Virtex-II Pro X 
XC2VPX70 has almost three times the number of slices (33,088) as the Virtex 1000-E 
(12,288 slices and 4Kbits of block RAM) and has 5.5Mb of dual port block RAM 
available, allowing for significant upgrades in vector storage size, concurrent 
compares, and vector data storage. With that much block RAM available, it is even 
possible that the entire vector and even small to midsize sparse matrices could be 
stored on the FPGA. The optional PowerPCs on the latest Virtex-II Pro X FPGAs could 
also be used to assist in the overall control or various other areas. With this single 
improvement in FPGA size, the vector size stored on the FPGA and number of parallel 
compares could at least be tripled if the rest of the design remains intact. This 
estimate is based on an earlier design that had twice the current vector size stored 
and double the number of compares, and that design was over-mapped by only 4-5,000 
slices. Simulation times improved almost 40% when twice the current number of compares 
and vector storage was implemented. 
Another benefit of being able to store more of the vector, if not the entire vector 
and possibly the matrix, on the FPGA is that the memory bus would only be needed at the 
beginning of execution to load all of the information onto the FPGA. As mentioned in 
the previous chapter, the I/O communication cost of the memory bus is potentially 
consuming 30 to 97% of execution time. Having onboard RAM or cache on the Pilchard 
board, or another similar type of board, would create the same improvement by 
eliminating as much of the memory bus dependency as possible. Onboard RAM would likely 
be very large (32 - 128 Mb) in comparison to the FPGA's RAM and could quite possibly 
store all necessary matrix and vector data. 
If upgrading the Pilchard System is not an option, at the very least more than one 
Pilchard System could be placed on a computer with a faster bus speed so more FPGA 
resources are available to improve overall performance (again assuming the Pilchard 
Board could operate on a faster bus). If the Pilchard System were no longer usable, the 
design, excluding the Pcore interface, could be fitted onto another system where more 
FPGA resources are available. The Pcore interface is specific to the Pilchard System; 
therefore, a new interface would have to be developed for any new system the design is 
placed on. While hardware improvements are not always easy to accommodate due to 
economics, design improvements can still be made. 
A final hardware improvement would be for the Pilchard Board to be on a bus 
utilizing some sort of DMA controller. Currently the Pilchard must compete for the 
memory bus like every other process running on the main computer. This can create 
unknown and unpredictable data transfer times, not to mention increased communication 
costs. If a DMA controller were used, the controller could gain access to the bus and 
send dedicated block transfers of data to the FPGA without so much interruption, again 
further reducing I/O costs. 
6.2 FPGA Architecture Improvement 
An architectural improvement in the design or layout of processes on the FPGA 
would be to add the capability of allowing multiple configurations based on the 
structure of the sparse matrix. This analysis and design would require pre-processing, 
which could be done on the FPGA or software side, and would likely require a fair 
amount of research into how it could be efficiently implemented. Still, the ability to 
tailor how the FPGA solves a structured matrix would benefit overall performance, 
provided that the cost of the preprocessing step did not outweigh the improvement. 
Sometimes sparse matrix sparse vector multiplications are run multiple times, for 
example in executing some iterative solvers. If the FPGA could adapt by monitoring Ψ 
after the first run, the design could adjust, possibly by utilizing more MACs if the Ψ 
value was large (0.6 to 1.0 possibly). This assumes more MAC units could fit on the 
FPGA. 
6.3 Algorithmic Improvements 
In the creation and design of the sparse matrix sparse vector multiplication, 
several important areas have opened up to improve efficiency and overall performance by 
reducing the amount of inactivity during waiting, improving data accuracy, and not 
wasting clock cycles handling unnecessary information. These algorithmic 
improvements include only grabbing vector values as needed and then storing them if 
needed again, loading up the next compare results while waiting on an answer from the 
multiply accumulator, adding full IEEE 754 support for floating point numbers, keeping 
track of multiple partial sums simultaneously, and reducing the number of pipeline stages 
in the floating point units. 
An algorithm that only requested and stored vector values as needed could 
possibly be implemented over the existing model while incurring a penalty of only one 
extra sparse code clock cycle per REPORTx state encountered. This extra clock cycle 
would be needed to encode a request to the C program to send a vector value or values 
with the matched matrix values. To handle this on the FPGA side, each vector slot would 
have a status bit indicating whether the value for its address is available or not. 
The check for the existing vector value could be done in the same processes as the 
current compare processes if the overall speed is not slowed down, or it could be 
handled in separate processes. The only difficulty in this approach would be keeping 
track of matrix and vector values when streaming them into the multipliers. All that 
should be necessary is to predetermine the expected ordering of the values, which the 
FPGA side would handle and the C program would implement. The improvement from this 
scheme would be realized every time the startup cost of reloading the vector is 
incurred. In the best case, 2 out of every 3 clock cycles during a vector load would be 
saved for each vector value that is never needed. As is typical, the denser the matrix 
and vector, the less improvement will be observed; however, while this improvement does 
not help the worst-case scenario, it would help the best-case scenario of an extremely 
sparse matrix and vector. The cost of sending the vector value with the matrix values 
is no different than preloading it. Again, this method would reduce costly I/O, which 
has been shown to be a large problem. 
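The status-bit idea can be sketched as a small software model. This is only an illustration of the bookkeeping, under the assumption that vector addresses are always loaded while values are fetched on the first hit; the class `VectorSlots` and the function `values_to_send` are hypothetical names, not part of the existing design.

```python
class VectorSlots:
    """Onboard vector storage with one status (valid) bit per slot."""

    def __init__(self, addresses):
        self.addresses = list(addresses)          # addresses are always loaded
        self.values = [None] * len(self.addresses)
        self.valid = [False] * len(self.addresses)

    def missing(self, hit_slots):
        """Slots that produced a hit but have no value stored yet."""
        return [s for s in hit_slots if not self.valid[s]]

    def store(self, slot, value):
        self.values[slot] = value
        self.valid[slot] = True


def values_to_send(slots, hit_slots, host_vector):
    """Host side: send only the values the FPGA reports as missing."""
    sent = {}
    for s in slots.missing(hit_slots):
        value = host_vector[slots.addresses[s]]
        slots.store(s, value)
        sent[s] = value
    return sent
```

In this model a vector value crosses the bus at most once, and a value whose address never produces a hit is never transferred at all.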
Another improvement to the algorithm would be improving the overall system 
efficiency while waiting for an answer from the FPGA code. Currently, while the FPGA 
code is determining a partial sum after an over flag has gone high, the C program waits 
for the answer, which could take about 1 to 1.5 microseconds (according to simulation 
times) if waiting on a partial sum after a set of compare matches. During this time the 
C program could send in the next row's matrix addresses to begin the next round of 
compares. If the partial sum were found during this time, the FPGA could simply wait 
until after the address data has been streamed in for comparing. A special case to take 
care of is when there are no matches to be calculated into the current partial sum: the 
partial sum could already be ready, so in this case the C program should just wait on 
the partial sum. This improvement could help both the best and worst-case scenarios. 
Adding full IEEE 754 support to the floating-point units would make this design 
more practical for scientific use. While improving the floating-point units, both could 
be analyzed to see if any of the existing pipeline stages could be consolidated. 
Typically the more pipeline stages, the faster the component, but if the speed can be 
maintained while reducing pipeline stages, the overall latency when waiting for a 
result is reduced. In particular, consolidating the adder pipeline would shave clock 
cycles off finding the partial sum, which is a bottleneck in the overall design because 
the adder has to wait on itself for 13 clock cycles for a result if the last two pieces 
of data to be summed are not available at the same time. 
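The cost of the adder waiting on itself can be estimated with a simple cycle count. The model below is a simplification and not the actual multiply accumulator: it assumes all values to be summed are available at cycle 0, that the adder accepts one new addition per cycle, and that each result emerges a fixed number of cycles after issue. The function name and the alternative latency of 6 are hypothetical.

```python
def reduction_cycles(n_values, latency=13):
    """Clock cycles for one pipelined adder to sum n_values numbers.

    Simplifying assumptions: all values are ready at cycle 0, the adder
    accepts one new addition per cycle, and each result emerges
    `latency` cycles after it is issued.
    """
    ready = n_values          # operands waiting to be added
    in_flight = []            # completion cycles of issued additions
    cycle = 0
    while True:
        # Retire additions that finish on this cycle.
        ready += sum(1 for done in in_flight if done <= cycle)
        in_flight = [done for done in in_flight if done > cycle]
        if ready <= 1 and not in_flight:
            return cycle
        if ready >= 2:
            ready -= 2
            in_flight.append(cycle + latency)
        cycle += 1
```

Under these assumptions, summing three values takes 26 cycles because the second addition cannot start until the first result returns, while a hypothetical 6-cycle adder sums four values in 13 cycles instead of 27.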
The last algorithmic improvement continues to address the adder bottleneck. As the 
number of results left for the adder to combine into a partial sum runs low, the adder 
may only have a couple of additions in the pipeline while another piece of data waits 
on results. The adder is in use, but the pipeline is becoming a hindrance instead of a 
benefit. To help mask this problem, giving the multiply accumulator the ability to 
handle multiple partial sums would help immensely. Creating this improvement would 
automatically improve some other areas too. To handle multiple partial sums 
simultaneously, the overall system would need to just send in rows of information and 
not have to wait for an answer as mentioned above. The FPGA can notify the C program 
that an answer is ready by using a reserved bit of the output vector during the 
REPORTx stage. Also, the remaining 3 bits could be used to signal that up to 8 answers 
are available; therefore, up to 8 partial sums could be the limit supported (000 would 
stand for 1, since the answer flag must be high to signal that any results are 
available). This improvement would definitely require further analysis, as supporting 8 
partial sums could overload the existing single adder, requiring the addition of one or 
more adder units. The downside to this approach is determining how an adder knows which 
partial sum the data currently being considered for addition belongs to. A buffer that 
mirrors the data buffer would likely be used to record the partial sum each piece of 
data corresponds to. Implementing this could be very complex yet very beneficial. 
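The proposed encoding of the answer count can be sketched as follows. The bit positions are assumptions made for illustration (the flag as the top bit of a 4-bit field, the count in the low 3 bits); only the rule that 000 stands for one answer when the flag is high comes from the description above.

```python
ANSWER_FLAG = 0b1000   # assumed position of the reserved "answer ready" bit
COUNT_MASK = 0b0111    # assumed position of the remaining 3 bits


def encode_answers(count):
    """Pack 0 to 8 ready partial sums into 4 status bits."""
    if not 0 <= count <= 8:
        raise ValueError("count must be between 0 and 8")
    if count == 0:
        return 0                      # flag low: nothing is ready
    return ANSWER_FLAG | (count - 1)  # 000 stands for 1, 111 for 8


def decode_answers(bits):
    """Recover the number of ready partial sums from the 4 status bits."""
    if not bits & ANSWER_FLAG:
        return 0
    return (bits & COUNT_MASK) + 1
```

Offsetting the count by one is what lets 3 bits cover 8 answers, since the zero-answer case is already expressed by the flag being low.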
6.4 Future Applications 
In addition to improvements that could be made on the existing system, the 
overall approach could be applied to new problems or altered for different 
applications. Some of these applications would be applying the design to sparse matrix 
sparse matrix multiplication problems, altering the design to only handle compares, or 
reading in multiple rows of a matrix instead of multiple elements from one row. 
Altering the design to handle only comparisons could have a unique impact on the 
overall problem. Essentially, the only involvement of the FPGA would be to read in 
vector and matrix addresses, compare them, and send out the results in some encoded 
fashion. No multiply accumulation would be handled on the FPGA. While this greatly 
simplifies the work done on the FPGA, it also further complicates the work of the C 
program. Unfortunately for the C program, it must compete with an operating system and 
possibly other programs. The FPGA only has to worry about itself once it has the data 
necessary to process. As has been observed, the overhead of transferring and partial 
summing all of the floating-point data is costly. 
Another application of this design would be to transfer the matrix addresses of 
two or more rows at a time. This would require the ability to handle multiple partial 
sums. The performance results could have some intriguing effects when compared to the 
original algorithm. A problem with this method, however, arises if one row is ready for 
a new vector to be loaded while other rows are not. 
A final and more practical application would involve applying this design to the 
calculation of sparse matrix sparse matrix multiplication. Sparse matrix sparse matrix 
multiplication is essentially sparse matrix sparse vector multiplication repeated for 
each vector of the second matrix. This application could be divided among multiple 
computers using the Pilchard System over a network. The load could be distributed as 
one row of the second matrix per computer, or it could be broken down further so that 
each computer gets a portion of a vector per row of the second matrix. 
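The reduction of sparse matrix sparse matrix multiplication to repeated sparse matrix sparse vector multiplication can be sketched briefly. This is an illustrative software model, assuming dictionary storage and assuming that the vectors of the second matrix are taken as its columns; the function names are hypothetical.

```python
def spmspv(rows, vector):
    """Sparse matrix times sparse vector; rows and vector are dicts."""
    result = {}
    for r, row in enumerate(rows):
        total = sum(value * vector[c] for c, value in row.items() if c in vector)
        if total != 0:
            result[r] = total
    return result


def spmspm(a_rows, b_columns):
    """Sparse matrix product built from one spmspv call per column of B.

    a_rows:    list of {column: value} dicts, one per row of A
    b_columns: list of {row: value} dicts, one per column of B
    Returns the product as a list of {row: value} dicts, one per column.
    """
    return [spmspv(a_rows, column) for column in b_columns]
```

Each call to `spmspv` is independent of the others, which is what would allow the vectors of the second matrix to be distributed across several machines.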
There are numerous possibilities to improve upon the existing design and to apply 
it to new areas in computing for improved performance results. 
6.5 Conclusion 
In evaluating the entire design process, design, and results, a lot has been learned 
about sparse matrix sparse vector multiplication on parallel and reconfigurable 
computing. When evaluating results from this problem, various characteristics of the 
sparse matrix and vector affect the outcome. When viewing results, the number of 
nonzeros must always be taken into consideration when comparing to other results. The 
potential influence of Ψ on performance, through how it relates the number of compares 
and hits to performance time, is important too. 
The importance of being able to perform post-synthesis and post-layout simulations 
was reinforced too. When attempting to troubleshoot problems that did not appear in 
simulation, determining what exactly is going wrong in the chip is extremely difficult, 
and it is hard to know if processes sharing data over different clock rates are 
synchronizing correctly. A lot of time spent troubleshooting errors could conceivably 
have been saved if this capability had been available. 
Even though the FPGA performance results were 50-60 times slower than the stand-alone
computer's performance, continued research in this area is worthwhile for several
reasons. First, not much research has been conducted, or at least published, regarding
sparse matrix sparse vector multiplication using FPGAs, and this one approach certainly
does not cover all of the possible implementations; however, it does take a
performance-minded approach and discusses several possibilities for improvement. Because
the largest bottleneck is the heavy performance cost paid for memory bus
communication and contention, as well as memory constraints, the design could actually
be extremely competitive with, if not faster than, the stand-alone computer if those
limitations were relieved. As was mentioned in the previous chapter, simulation times were
extremely close to the actual performance times of the stand-alone computer. With more
FPGA room, the outlook for performance improvement is quite optimistic. Exploring these
external limitations to the design's performance is a must. In addition to external
limitations, there are also design improvements that can still be made, as mentioned
earlier in the chapter.
In summary, as a pioneering effort in this particular application of sparse matrix sparse
vector multiplication, this work has taught much, but plenty of research remains in
exploring this topic. Its importance is felt in the scientific computing community, which
could therefore take advantage of performance improvements resulting from continued
research.





1. The Message Passing Interface Standard (2004). http://www-unix.mcs.anl.gov/mpi/mpich/
2. Parallel Virtual Machine (2004). http://www.csm.ornl.gov/pvm/pvm_home.html
3. Institute for Electrical and Electronics Engineers. IEEE 754 Standard for Binary
Floating-Point Arithmetic, 1985.
4. K. H. Tsoi. Pilchard User Reference (V0.1), Department of Computer Science
and Engineering, The Chinese University of Hong Kong, Shatin, NT Hong Kong,
January 2002.
5. K. A. Gallivan. Set 2 - Sparse Matrix Basics. School of Computational Science
and Information Technology, Florida State University, 2004.
http://www.csit.fsu.edu/~gallivan/courses/NLA2/set2.pdf
6. K. L. Wong. Iterative Solvers for System of Linear Equations. Joint Institute for
Computational Science, 1997.
7. D. W. Bouldin. Lecture Notes: Overview, ECE 551, Electrical and Computer
Engineering Department, University of Tennessee, 2002.
8. N. Shirazi, A. Walters, and P. Athanas. Quantitative Analysis of Floating Point
Arithmetic on FPGA Based Custom Computing Machines. IEEE Symposium on
FPGAs for Custom Computing Machines, Napa, California, Apr. 1995.
9. G. Lienhart, A. Kugel, and R. Manner. Using Floating-Point Arithmetic on
FPGAs to Accelerate Scientific N-Body Simulations. Proceedings, IEEE
Symposium on Field-Programmable Custom Computing Machines, pages 182-191,
Napa, CA, Apr. 2002.
10. H. ElGindy and Y. Shue. On Sparse Matrix-Vector Multiplication with FPGA-Based
System. Proceedings of the 10th Annual IEEE Symposium on Field-Programmable
Custom Computing Machines (FCCM'02), p. 273, September 22-24, 2002.
11. Netlib Repository. The University of Tennessee, Knoxville and Oak Ridge
National Laboratories, www.netlib.org.
12. G. Wellein, G. Hager, A. Basermann, and H. Fehske. Fast sparse matrix-vector
multiplication for TeraFlop/s computers. In High Performance Computing for
Computational Science - VECPAR 2002, Lecture Notes in Computer Science,
pages 287-301. Springer, 2003.
13. W. Gropp, D. Kaushik, D. Keyes, and B. Smith. Improving the performance of
sparse matrix-vector multiplication by blocking. Technical report, MCS Division,
Argonne National Laboratory. www-fp.mcs.anl.gov/petsc-fun3d/Talks/multivec_siam00_1.pdf.
14. R. Geus and S. Rollin. Towards a fast parallel sparse matrix-vector
multiplication. Proceedings of the International Conference on Parallel
Computing (ParCo), pages 308-315. Imperial College Press, 1999.
15. F. Khoury. Efficient Parallel Triangular System Solvers for Preconditioning
Large Sparse Linear Systems. Honour's Thesis, School of Mathematics,
University of New South Wales.
http://www.ac3.edu.au/edu/papers/Khoury-thesis/thesis.html, 1994.
16. Xilinx, www.xilinx.com.
17. D. E. Culler, J. P. Singh, and A. Gupta. Parallel Computer Architecture: A
Hardware/Software Approach. Morgan Kaufmann Publishers, Inc., San Francisco,
1999.




Appendix A - Dataset Statistics 
Characteristic               Data 1     Data 2     Data 3
CPU time (usec)              10         36         50
FPGA time (usec)             548        1770       2300
Hits                         18         170        320
Compares                     62         1455       1852
Ψ                            0.3        0.12       0.17
Nonzeros                     18         858        1020
Vector Loads                 1          4          6
Actual MFLOPS CPU            3.600      9.444      12.800
Actual MFLOPS FPGA           0.066      0.192      0.278
Theoretical MFLOPS CPU       933        933        933
Theoretical MFLOPS FPGA      150        150        150
Percent of MFLOPS CPU        0.30%      0.79%      1.07%
Percent of MFLOPS FPGA       0.04%      0.13%      0.19%
FPGA Compares                576        27456      32640

Characteristic               Data 4a    Data 4b    Data 4c
CPU time (usec)              37         38         38
FPGA time (usec)             2353       2354       2357
Hits                         24         48         72
Compares                     1512       1488       1464
Ψ                            .016       .032       .049
Nonzeros                     1344       1344       1344
Vector Loads                 1          1          1
Actual MFLOPS CPU            1.297      2.526      3.789
Actual MFLOPS FPGA           0.020      0.041      0.061
Theoretical MFLOPS CPU       933        933        933
Theoretical MFLOPS FPGA      150        150        150
Percent of MFLOPS CPU        0.14%      0.27%      0.41%
Percent of MFLOPS FPGA       0.01%      0.03%      0.04%
FPGA Compares                43008      43008      43008

Characteristic               Data 4d    Data 5
CPU time (usec)              35         28
FPGA time (usec)             2353       1703
Hits                         96         96
Compares                     1440       1080
Ψ                            .067       .089
Nonzeros                     1344       984
Vector Loads                 1          1
Actual MFLOPS CPU            5.486      6.857
Actual MFLOPS FPGA           0.082      0.113
Theoretical MFLOPS CPU       933        933
Theoretical MFLOPS FPGA      150        150
Percent of MFLOPS CPU        0.59%      0.73%
Percent of MFLOPS FPGA       0.05%      0.08%
FPGA Compares                43008      31488
Appendix B - Sparse Matrix Sparse Vector Multiplication C Code 
#include <stdio.h>
#include <stdlib.h>   /* malloc, exit */
#include <string.h>
#include <sys/time.h>
#include <time.h>

long ltime;
float ltime2;
struct timeval t_start, t_finish;

/* Pointer indexing for a 1-d array
   p[5] = *(p+5) */
/* Remember to p = i; first if int *p, i[10]; earlier */
/* Pointer indexing for a 10 by 10 int array
   the 0,4 element of array a may be referenced by
   a[0][4] or *((int *)a+4)
   element 1,2
   a[1][2] or *((int *)a+12)
   in general
   a[j][k] = *((base type *)a+(j*row length)+k) */
int main(int argc, char *argv[])
{
  FILE *mdatfp, *mrowfp, *vecfp;
  unsigned long int *c, *r, *va, *ya, a, matsize=0, matrsize=0,
    vecsize=0;
  double *v, *x, *y, d;
  int i, j, k;

  if (argc != 4) {
    printf("mv_fpgaz matrixfile matrixrowfile vectorfile\n");
    exit(1);
  }
  if ((mdatfp = fopen(argv[1], "r")) == NULL) {
    printf(" cannot open file 1.\n");
    exit(1);
  }
  if ((mrowfp = fopen(argv[2], "r")) == NULL) {
    printf(" cannot open file 2.\n");
    exit(1);
  }
  if ((vecfp = fopen(argv[3], "r")) == NULL) {
    printf(" cannot open file 3.\n");
    exit(1);
  }
  /* First pass: count the entries in each file */
  while (fscanf(mdatfp, "%lu%le", &a, &d) == 2)
  {
    matsize++;
  }
  while (fscanf(mrowfp, "%lu", &a) == 1)
  {
    matrsize++;
  }
  while (fscanf(vecfp, "%lu%le", &a, &d) == 2)
  {
    vecsize++;
  }
  rewind(mdatfp);
  rewind(mrowfp);
  rewind(vecfp);
  c = (unsigned long int *)malloc(matsize * sizeof(unsigned long int));
  v = (double *)malloc(matsize * sizeof(double));
  r = (unsigned long int *)malloc(matrsize * sizeof(unsigned long int));
  x = (double *)malloc(vecsize * sizeof(double));
  va = (unsigned long int *)malloc(vecsize * sizeof(unsigned long int));
  y = (double *)malloc(matrsize * sizeof(double));
  ya = (unsigned long int *)malloc(matrsize * sizeof(unsigned long int));
  /* Second pass: read the data into the arrays */
  i=0;
  while (fscanf(mdatfp, "%lu%le", (c+i), (v+i)) == 2) i++;
  i=0;
  while (fscanf(mrowfp, "%lu", (r+i)) == 1) i++;
  i=0;
  while (fscanf(vecfp, "%lu%le", (va+i), (x+i)) == 2) i++;
  fclose(mdatfp);
  fclose(mrowfp);
  fclose(vecfp);
  gettimeofday(&t_start, NULL);
  for (i = 0; i < matrsize-1; i++)
  {
    *(y+i) = 0.0;
    k=0;
    for (j = *(r+i); j < *(r+i+1); j++)
    {
      if (k >= vecsize) break;           /* vector exhausted for this row */
      if (*(c+j) < *(va+k)) continue;    /* no vector entry at this column */
      else if (*(c+j) == *(va+k))
      {
        *(y+i) = *(y+i) + *(v+j) * *(x+k);
        k++;
      }
      else
      {
        /* column address is past the vector address: step the vector
           forward until it reaches or passes the column address */
        for (k = k+1; k < vecsize; k++)
        {
          if (*(c+j) < *(va+k)) break;
          else if (*(c+j) == *(va+k))
          {
            *(y+i) = *(y+i) + *(v+j) * *(x+k);
            k++;
            break;
          }
        }
      }
    }
  }
  gettimeofday(&t_finish, NULL);
  ltime = (t_finish.tv_sec-t_start.tv_sec) * 1000000 +
          (t_finish.tv_usec-t_start.tv_usec);
  ltime2 = (float) ltime / 1;
  printf("CPU : calculation completed in %f usec\n", ltime2);
  for (i=0; i<matrsize-1; i++)
  {
    printf(" --> %f\n", *(y+i));
  }
  return 0;
}
Appendix C - FPGA Host C Code 
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include "iflib.h"
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/mman.h>

long ltime;
float ltime2;
struct timeval t_start, t_finish;

/* Pointer indexing for a 1-d array
   p[5] = *(p+5) */
/* Remember to p = i; first if int *p, i[10]; earlier */
/* Pointer indexing for a 10 by 10 int array
   the 0,4 element of array a may be referenced by
   a[0][4] or *((int *)a+4)
   element 1,2
   a[1][2] or *((int *)a+12)
   in general
   a[j][k] = *((base type *)a+(j*row length)+k) */

int main(int argc, char *argv[])
{
  unsigned long long int *c, *va, *ya, al, *r1, *r2, matsize=0,
    matrsize=0, vecsize=0, zeros;
  unsigned long long int *ansptr, temp, *size2fpga;
  double *v, *x, *y, d;
  FILE *mdatfp, *mrowfp, *vecfp;
  int64 data, addr, input, input2, init;
  int i, fd, sw, t;
  char *memp;
  int stoppoint, q, firsttime;
  double *z, ps, *partialsum, lastsent, ps2, ps3;
  char ch[2], status[65], match, over, ansf, crrnt;
  int stppt, strtm, A, rowcnt, numrows, strtpt, n, done, check,
    newvecflag2, g, p, sent, r, w;
  if (argc != 4) {
    printf("mv_fpgaz matrixfile matrixrowfile vectorfile\n");
    exit(1);
  }
  if ((mdatfp = fopen(argv[1], "r")) == NULL) {
    printf(" cannot open file 1.\n");
    exit(1);
  }
  if ((mrowfp = fopen(argv[2], "r")) == NULL) {
    printf(" cannot open file 2.\n");
    exit(1);
  }
  if ((vecfp = fopen(argv[3], "r")) == NULL) {
    printf(" cannot open file 3.\n");
    exit(1);
  }
  /* First pass: count the entries in each file */
  while (fscanf(mdatfp, "%llu%le", &al, &d) == 2)
  {
    matsize++;
  }
  while (fscanf(mrowfp, "%llu", &al) == 1)
  {
    matrsize++;
  }
  while (fscanf(vecfp, "%llu%le", &al, &d) == 2)
  {
    vecsize++;
  }
  rewind(mdatfp);
  rewind(mrowfp);
  rewind(vecfp);
  /*Matrix Column Address*/
  c = (unsigned long long int *)malloc(matsize * sizeof(unsigned long long int));
  /*Matrix Value*/
  v = (double *)malloc(matsize * sizeof(double));
  /*Matrix Row Pointer 1 & 2*/
  r1 = (unsigned long long int *)malloc(matrsize * sizeof(unsigned long long int));
  r2 = (unsigned long long int *)malloc(matrsize * sizeof(unsigned long long int));
  /*Vector Value*/
  x = (double *)malloc(vecsize * sizeof(double));
  /*Vector Address*/
  va = (unsigned long long int *)malloc(vecsize * sizeof(unsigned long long int));
  /*Resultant Vector (zero-initialized, since partial sums are accumulated into it)*/
  y = (double *)calloc(matrsize, sizeof(double));
  /*Resultant Vector Address*/
  ya = (unsigned long long int *)malloc(matrsize * sizeof(unsigned long long int));
  partialsum = (double *)malloc(1 * sizeof(double));
  i=0;
  while (fscanf(mdatfp, "%llu%le", (c+i), (v+i)) == 2) i++;
  i=0;
  while (fscanf(mrowfp, "%llu", (r1+i)) == 1) i++;
  i=0;
  while (fscanf(vecfp, "%llu%le", (va+i), (x+i)) == 2) i++;
  for (w=0; w<matrsize; w++)
  {
    *(r2+w) = *(r1+w);
    /* *(y+w) = 0e0;
    printf("y %d init to %le\n", w, *(y+w)); */
  }
  /*printf("%u and %e for %d reads\n", addr, data, x);*/
  fclose(mdatfp);
  fclose(mrowfp);
  fclose(vecfp);
  fd = open(DEVICE, O_RDWR);
  memp = (char *)mmap(NULL, MTRRZ, PROT_READ, MAP_PRIVATE, fd, 0);
  if (memp == MAP_FAILED) {
    perror(DEVICE);
    exit(1);
  }
  sw = 0;
  t = 0;
  stoppoint = 31-3;
  q = 0;
  /*gettimeofday(&t_start, NULL);*/
  /* View the 64-bit transfer words as doubles or integers */
  z = (double *)&data;
  ansptr = (unsigned long long int *)&input;
  partialsum = (double *)&input2;
  size2fpga = (unsigned long long int *)&init;
  stppt = 0;
  A = 0;
  strtm = 0;
  rowcnt = 0;
  numrows = matrsize - 2;
  temp = 0;
  /* zeros =
  0000000000000000000000000000000000000000000000000000000000000000; */
  zeros = 0x0000000000000000;
  status[64] = '\0';
  ch[0] = '0';
  ch[1] = '\0';
  match = '0';
  over = '0';
  strtpt = 0;
  strtm = 0;
  n = 63;
  done = 0;
  check = 0;
  crrnt = '0';
  newvecflag2 = 0;
  g = 8;
  r = 0;
  lastsent = 0e0;
  sent = 0;
  p = 0;
  ps = 0e0;
  ps2 = 0e0;
  ps3 = 0e0;
  firsttime = 0;
  *size2fpga = matrsize - 2;
  /*Note: The case statements will fall through to each one checking
    sw. If a break is the last statement in a case, then it breaks
    out of the switch without falling through the rest of the
    cases*/
  gettimeofday(&t_start, NULL);
  while (1)
  {
    switch (sw)
    {
    case 0:
      /*printf("# of rows = %08x, %08x\n", init.w[1], init.w[0]);*/
      write64(init, memp+(0x0000<<3));
      write64(init, memp+(0x0000<<3));
    case 1:
      /*Loop will send 64 vector addresses and values*/
      /*It remembers where to continue for next time*/
      for (t=t; t<=stoppoint; t=t+4)
      {
        if (t <= vecsize-1-3)
        {
          /* *k = *(va + t);*/
          /*data.w[0] = *(va + t + 1);*/
          addr.w[1] = *(va + t);
          addr.w[0] = *(va + t + 1);
          write64(addr, memp+(0x0000<<3));
          addr.w[1] = *(va + t + 2);
          addr.w[0] = *(va + t + 3);
          write64(addr, memp+(0x0000<<3));
          *z = *(x + t);
          write64(data, memp+(0x0000<<3));
          *z = *(x + t + 1);
          write64(data, memp+(0x0000<<3));
          *z = *(x + t + 2);
          write64(data, memp+(0x0000<<3));
          *z = *(x + t + 3);
          write64(data, memp+(0x0000<<3));
        }
        else if (t == vecsize-1-2)
        {
          addr.w[1] = *(va + t);
          addr.w[0] = *(va + t + 1);
          write64(addr, memp+(0x0000<<3));
          addr.w[1] = *(va + t + 2);
          addr.w[0] = 0x00000000;
          write64(addr, memp+(0x0000<<3));
          *z = *(x + t);
          write64(data, memp+(0x0000<<3));
          *z = *(x + t + 1);
          write64(data, memp+(0x0000<<3));
          *z = *(x + t + 2);
          write64(data, memp+(0x0000<<3));
          *z = 0x0000000000000000;
          write64(data, memp+(0x0000<<3));
        }
        else if (t == vecsize-1-1)
        {
          addr.w[1] = *(va + t);
          addr.w[0] = *(va + t + 1);
          write64(addr, memp+(0x0000<<3));
          addr.w[1] = 0x00000000;
          addr.w[0] = 0x00000000;
          write64(addr, memp+(0x0000<<3));
          *z = *(x + t);
          write64(data, memp+(0x0000<<3));
          *z = *(x + t + 1);
          write64(data, memp+(0x0000<<3));
          *z = 0x0000000000000000;
          write64(data, memp+(0x0000<<3));
          *z = 0x0000000000000000;
          write64(data, memp+(0x0000<<3));
        }
        else if (t == vecsize-1)
        {
          addr.w[1] = *(va + t);
          addr.w[0] = 0x00000000;
          write64(addr, memp+(0x0000<<3));
          addr.w[1] = 0x00000000;
          addr.w[0] = 0x00000000;
          write64(addr, memp+(0x0000<<3));
          *z = *(x + t);
          write64(data, memp+(0x0000<<3));
          *z = 0x0000000000000000;
          write64(data, memp+(0x0000<<3));
          *z = 0x0000000000000000;
          write64(data, memp+(0x0000<<3));
          *z = 0x0000000000000000;
          write64(data, memp+(0x0000<<3));
        }
        else
        {
          /* past the end of the vector: pad with zeros */
          addr.w[1] = 0x00000000;
          addr.w[0] = 0x00000000;
          write64(addr, memp+(0x0000<<3));
          addr.w[1] = 0x00000000;
          addr.w[0] = 0x00000000;
          write64(addr, memp+(0x0000<<3));
          *z = 0x0000000000000000;
          write64(data, memp+(0x0000<<3));
          *z = 0x0000000000000000;
          write64(data, memp+(0x0000<<3));
          *z = 0x0000000000000000;
          write64(data, memp+(0x0000<<3));
          *z = 0x0000000000000000;
          write64(data, memp+(0x0000<<3));
        }
      }
      stoppoint = t + 28;
      /*go to processing stage*/
      sw = 2;
    case 2:
      /*Reset all handshaking locations*/
      addr.w[1] = 0xFFFFFFFF;
      addr.w[0] = 0xFFFFFFFF;
      write64(addr, memp+(0x0002<<3)); /*set report location*/
      write64(addr, memp+(0x0003<<3)); /*set answer location*/
      for (q=0; q<=52; q=q+4)
      {
        if (*(r1+A)+strtm+q < *(r2+A+1)-3)
        {
          addr.w[1] = *(c + (*(r1+A)+q+strtm));
          addr.w[0] = *(c + (*(r1+A)+q+1+strtm));
          write64(addr, memp+(0x0000<<3));
          addr.w[1] = *(c + (*(r1+A)+q+2+strtm));
          addr.w[0] = *(c + (*(r1+A)+q+3+strtm));
          write64(addr, memp+(0x0000<<3));
        }
        else if (*(r1+A)+strtm+q == *(r2+A+1)-3)
        {
          addr.w[1] = *(c + (*(r1+A)+q+strtm));
          addr.w[0] = *(c + (*(r1+A)+q+1+strtm));
          write64(addr, memp+(0x0000<<3));
          addr.w[1] = *(c + (*(r1+A)+q+2+strtm));
          addr.w[0] = 0xFFFFFFFF;
          write64(addr, memp+(0x0000<<3));
        }
        else if (*(r1+A)+strtm+q == *(r2+A+1)-2)
        {
          addr.w[1] = *(c + (*(r1+A)+q+strtm));
          addr.w[0] = *(c + (*(r1+A)+q+1+strtm));
          write64(addr, memp+(0x0000<<3));
          addr.w[1] = 0xFFFFFFFF;
          addr.w[0] = 0xFFFFFFFF;
          write64(addr, memp+(0x0000<<3));
        }
        else if (*(r1+A)+strtm+q == *(r2+A+1)-1)
        {
          addr.w[1] = *(c + (*(r1+A)+q+strtm));
          addr.w[0] = 0xFFFFFFFF;
          write64(addr, memp+(0x0000<<3));
          addr.w[1] = 0xFFFFFFFF;
          addr.w[0] = 0xFFFFFFFF;
          write64(addr, memp+(0x0000<<3));
        }
        else
        {
          /* past the end of the row: pad with all ones */
          addr.w[1] = 0xFFFFFFFF;
          addr.w[0] = 0xFFFFFFFF;
          write64(addr, memp+(0x0000<<3));
          addr.w[1] = 0xFFFFFFFF;
          addr.w[0] = 0xFFFFFFFF;
          write64(addr, memp+(0x0000<<3));
        }
      }
      /*if (rowcnt == numrows)
      {
        Notify FPGA on last row!
        addr.w[1] = 0x00000000;
        addr.w[0] = 0x00000000;
        write64(addr, memp+(0x0000<<3));
        addr.w[1] = 0x00000000;
        addr.w[0] = 0x00000000;
        printf("%08x --> %08x\n", addr.w[1], addr.w[0]);
        write64(addr, memp+(0x0000<<3));
      }*/
      /*Now send start signal*/
      sw = 3;
      /*go to report stage to receive feedback*/
      /*break;*/
    case 3:
      for (i=0; i<100; i++);
      read64(&input, memp+(0x0002<<3));
      while (input.w[1]==0xFFFFFFFF && input.w[0]==0xFFFFFFFF)
      {
        read64(&input, memp+(0x0002<<3));
      }
      addr.w[1] = 0xFFFFFFFF;
      addr.w[0] = 0xFFFFFFFF;
      write64(addr, memp+(0x0002<<3));
      /*temp = *ansptr;*/
      /* unpack the 64-bit report into a string of '0'/'1', MSB first */
      for (i=63; i>=0; i--)
      {
        temp = *ansptr;
        temp = (temp >> i) % 2;
        sprintf(ch, "%llu", temp);
        status[63-i] = ch[0];
      }
      status[64] = '\0';
      match = status[0];
      over = status[1];
      strtpt = *(r1+A) + strtm;
      strtm = strtm + 56;
      stppt = 0;
      if (over == '1')
      {
        n = 63; /*range 8 to 63*/
        done = 0;
        stppt = 0;
        while (done == 0)
        {
          crrnt = status[n];
          if (status[2] == '1')
          {
            stppt = 63;
            check = strtpt + 56;
            if (check > matsize-1)
            {
              *(r1+A) = matsize;
            }
            else
            {
              *(r1+A) = check;
            }
            done = 1;
          }
          else if (crrnt == '1' && match == '1')
          {
            stppt = n;
            status[n] = '0';
            check = strtpt - 1 - (63 - n) + 56;
            if (check > matsize-1)
            {
              *(r1+A) = matsize;
            }
            else
            {
              *(r1+A) = check;
            }
            done = 1;
          }
          else if (crrnt == '1' && match == '0')
          {
            stppt = n;
            status[n] = '0';
            check = strtpt - 1 - (63 - n) + 56;
            if (check > matsize-1)
            {
              *(r1+A) = matsize;
            }
            else
            {
              *(r1+A) = check;
            }
            done = 1;
          }
          n = n - 1;
        }
        strtm = 0;
        if (rowcnt == numrows)
        {
          /* last row finished: wrap around and request a new vector load */
          A = 0;
          rowcnt = 0;
          newvecflag2 = 1;
          firsttime = 1;
        }
        else
        {
          A++;
          rowcnt++;
        }
      }
      /*grab individual bits to see where to go*/
      if (match == '1')
      { /*Send matched data*/
        sw = 4;
      }
      else if (over == '1')
      { /*Get answer*/
        sw = 5;
      }
      else
      { /*Go back to processing*/
        sw = 2;
      }
      break;
    case 4:
      /*Loop back through previous 56 submissions to send data*/
      /*if (over == 1 || ansf == 1)
      {
        sw = 5;
      }*/
      match = '0';
      sent = 0;
      lastsent = 0e0;
      r = 0;
      for (g=8; g<=63; g++)
      {
        if (status[g] == '1')
        {
          *z = *(v + (strtpt + r));
          write64(data, memp+(0x0000<<3));
          lastsent = *z;
        }
        sent = sent + 1;
        if (g == stppt) break;
        else r++;
      }
      if ((sent%2) == 1)
      {
        *z = lastsent;
        write64(data, memp+(0x0000<<3));
      }
      else
      {
        /* *z = 0x0000000000000000;
        write64(data, memp+(0x0000<<3));*/
      }
      *z = 0x0000000000000000;
      write64(data, memp+(0x0000<<3));
      *z = 0x0000000000000000;
      write64(data, memp+(0x0000<<3));
      *z = 0x0000000000000000;
      write64(data, memp+(0x0000<<3));
      *z = 0x0000000000000000;
      write64(data, memp+(0x0000<<3));
      if (over == '1') sw = 5;
      else sw = 2;
      /*Now send start signal*/
      break;
    case 5:
      /*grab answer*/
      read64(&input2, memp+(0x0003<<3));
      while (input2.w[1]==0xFFFFFFFF && input2.w[0]==0xFFFFFFFF)
      {
        read64(&input2, memp+(0x0003<<3));
      }
      if (A==0 && firsttime==1) p = numrows;
      else if (A==0 && firsttime==0) p = 0;
      else p = A-1;
      *(y+p) += *partialsum;
      if (newvecflag2 == 1)
      {
        if (t >= vecsize) sw = 6;
        else sw = 1;
        newvecflag2 = 0;
      }
      else sw = 2;
      over = '0';
      break;
    case 6:
      gettimeofday(&t_finish, NULL);
      for (i=0; i<=matrsize-2; i++)
      {
        printf("---> %le\n", *(y+i));
      }
      ltime = (t_finish.tv_sec-t_start.tv_sec) * 1000000 +
              (t_finish.tv_usec-t_start.tv_usec);
      ltime2 = (float) ltime / 1;
      printf("CPU : calculation completed in %f usec\n", ltime2);
      munmap(memp, MTRRZ);
      close(fd);
      exit(1);
    default:
      break;
    }
  }
  return 0;
}
Appendix D - Pilchard.vhd 
library ieee;
use ieee.std_logic_1164.all;

entity pilchard is
  port (
    PADS_exchecker_reset: in std_logic;
    PADS_dimm_ck: in std_logic;
    PADS_dimm_cke: in std_logic_vector(1 downto 0);
    PADS_dimm_ras: in std_logic;
    PADS_dimm_cas: in std_logic;
    PADS_dimm_we: in std_logic;
    PADS_dimm_s: in std_logic_vector(3 downto 0);
    PADS_dimm_a: in std_logic_vector(13 downto 0);
    PADS_dimm_ba: in std_logic_vector(1 downto 0);
    PADS_dimm_rege: in std_logic;
    PADS_dimm_d: inout std_logic_vector(63 downto 0);
    PADS_dimm_cb: inout std_logic_vector(7 downto 0);
    PADS_dimm_dqmb: in std_logic_vector(7 downto 0);
    PADS_dimm_scl: in std_logic;
    PADS_dimm_sda: inout std_logic;
    PADS_dimm_sa: in std_logic_vector(2 downto 0);
    PADS_dimm_wp: in std_logic;
    PADS_io_conn: inout std_logic_vector(27 downto 0) );
end pilchard;
architecture syn of pilchard is

  component INV
    port (
      O: out std_logic;
      I: in std_logic );
  end component;

  component BUF
    port (
      I: in std_logic;
      O: out std_logic );
  end component;

  component BUFG
    port (
      I: in std_logic;
      O: out std_logic );
  end component;

  component CLKDLLHF is
    port (
      CLKIN: in std_logic;
      CLKFB: in std_logic;
      RST: in std_logic;
      CLK0: out std_logic;
      CLK180: out std_logic;
      CLKDV: out std_logic;
      LOCKED: out std_logic );
  end component;

  component FDC is
    port (
      C: in std_logic;
      CLR: in std_logic;
      D: in std_logic;
      Q: out std_logic );
  end component;

  component IBUF
    port (
      I: in std_logic;
      O: out std_logic );
  end component;

  component IBUFG
    port (
      I: in std_logic;
      O: out std_logic );
  end component;

  component IOB_FDC is
    port (
      C: in std_logic;
      CLR: in std_logic;
      D: in std_logic;
      Q: out std_logic );
  end component;

  component IOBUF
    port (
      I: in std_logic;
      O: out std_logic;
      T: in std_logic;
      IO: inout std_logic );
  end component;

  component OBUF
    port (
      I: in std_logic;
      O: out std_logic );
  end component;

  component STARTUP_VIRTEX
    port (
      GSR: in std_logic;
      GTS: in std_logic;
      CLK: in std_logic );
  end component;

  component pcore
    port (
      clk: in std_logic;
      clkdiv: in std_logic;
      rst: in std_logic;
      read: in std_logic;
      write: in std_logic;
      addr: in std_logic_vector(13 downto 0);
      din: in std_logic_vector(63 downto 0);
      dout: out std_logic_vector(63 downto 0);
      dmask: in std_logic_vector(63 downto 0);
      extin: in std_logic_vector(25 downto 0);
      extout: out std_logic_vector(25 downto 0);
      extctrl: out std_logic_vector(25 downto 0) );
  end component;
  signal clkdllhf_clk0: std_logic;
  signal clkdllhf_clkdiv: std_logic;
  signal dimm_ck_bufg: std_logic;
  signal dimm_s_ibuf: std_logic;
  signal dimm_ras_ibuf: std_logic;
  signal dimm_cas_ibuf: std_logic;
  signal dimm_we_ibuf: std_logic;
  signal dimm_s_ibuf_d: std_logic;
  signal dimm_ras_ibuf_d: std_logic;
  signal dimm_cas_ibuf_d: std_logic;
  signal dimm_we_ibuf_d: std_logic;
  signal dimm_d_iobuf_i: std_logic_vector(63 downto 0);
  signal dimm_d_iobuf_o: std_logic_vector(63 downto 0);
  signal dimm_d_iobuf_t: std_logic_vector(63 downto 0);
  signal dimm_a_ibuf: std_logic_vector(14 downto 0);
  signal dimm_dqmb_ibuf: std_logic_vector(7 downto 0);
  signal io_conn_iobuf_i: std_logic_vector(27 downto 0);
  signal io_conn_iobuf_o: std_logic_vector(27 downto 0);
  signal io_conn_iobuf_t: std_logic_vector(27 downto 0);
  signal s, ras, cas, we: std_logic;
  signal VDD: std_logic;
  signal GND: std_logic;
  signal CLK: std_logic;
  signal CLKDIV: std_logic;
  signal RESET: std_logic;
  signal READ: std_logic;
  signal WRITE: std_logic;
  signal READ_p: std_logic;
  signal WRITE_p: std_logic;
  signal READ_n: std_logic;
  signal READ_buf: std_logic;
  signal WRITE_buf: std_logic;
  signal READ_d: std_logic;
  signal WRITE_d: std_logic;
  signal READ_d_n: std_logic;
  signal READ_d_n_buf: std_logic;
  signal pcore_addr_raw: std_logic_vector(13 downto 0);
  signal pcore_addr: std_logic_vector(13 downto 0);
  signal pcore_din: std_logic_vector(63 downto 0);
  signal pcore_dout: std_logic_vector(63 downto 0);
  signal pcore_dmask: std_logic_vector(63 downto 0);
  signal pcore_extin: std_logic_vector(25 downto 0);
  signal pcore_extout: std_logic_vector(25 downto 0);
  signal pcore_extctrl: std_logic_vector(25 downto 0);
  signal pcore_dqmb: std_logic_vector(7 downto 0);

  -- CLKDIV frequency control, default is 2
  -- uncomment the following lines so as to redefine the clock rate
  -- given by clkdiv
  attribute CLKDV_DIVIDE: string;
  attribute CLKDV_DIVIDE of U_clkdllhf: label is "4";

begin

  VDD <= '1';
  GND <= '0';
  U_ck_bufg: IBUFG port map (
    I => PADS_dimm_ck,
    O => dimm_ck_bufg );

  U_reset_ibuf: IBUF port map (
    I => PADS_exchecker_reset,
    O => RESET );

  U_clkdllhf: CLKDLLHF port map (
    CLKIN => dimm_ck_bufg,
    CLKFB => CLK,
    RST => RESET,
    CLK0 => clkdllhf_clk0,
    CLK180 => open,
    CLKDV => clkdllhf_clkdiv,
    LOCKED => open );

  U_clkdllhf_clk0_bufg: BUFG port map (
    I => clkdllhf_clk0,
    O => CLK );

  U_clkdllhf_clkdiv_bufg: BUFG port map (
    I => clkdllhf_clkdiv,
    O => CLKDIV );

  U_startup: STARTUP_VIRTEX port map (
    GSR => RESET,
    GTS => GND,
    CLK => CLK );

  U_dimm_s_ibuf: IBUF port map (
    I => PADS_dimm_s(0),
    O => dimm_s_ibuf );

  U_dimm_ras_ibuf: IBUF port map (
    I => PADS_dimm_ras,
    O => dimm_ras_ibuf );

  U_dimm_cas_ibuf: IBUF port map (
    I => PADS_dimm_cas,
    O => dimm_cas_ibuf );

  U_dimm_we_ibuf: IBUF port map (
    I => PADS_dimm_we,
    O => dimm_we_ibuf );
  G_dimm_d: for i in integer range 0 to 63 generate
    U_dimm_d_iobuf: IOBUF port map (
      I => dimm_d_iobuf_i(i),
      O => dimm_d_iobuf_o(i),
      T => dimm_d_iobuf_t(i),
      IO => PADS_dimm_d(i) );
    U_dimm_d_iobuf_o: IOB_FDC port map (
      C => CLK,
      CLR => RESET,
      D => dimm_d_iobuf_o(i),
      Q => pcore_din(i) );
    U_dimm_d_iobuf_i: IOB_FDC port map (
      C => CLK,
      CLR => RESET,
      D => pcore_dout(i),
      Q => dimm_d_iobuf_i(i) );
    U_dimm_d_iobuf_t: IOB_FDC port map (
      C => CLK,
      CLR => RESET,
      D => READ_d_n_buf,
      Q => dimm_d_iobuf_t(i) );
  end generate;

  G_dimm_a: for i in integer range 0 to 13 generate
    U_dimm_a_ibuf: IBUF port map (
      I => PADS_dimm_a(i),
      O => dimm_a_ibuf(i) );
    U_dimm_a_ibuf_o: IOB_FDC port map (
      C => CLK,
      CLR => RESET,
      D => dimm_a_ibuf(i),
      Q => pcore_addr_raw(i) );
  end generate;
  pcore_addr(3 downto 0) <= pcore_addr_raw(3 downto 0);

  addr_correct: for i in integer range 4 to 7 generate
    ADDR_INV: INV port map (
      O => pcore_addr(i),
      I => pcore_addr_raw(i) );
  end generate;

  pcore_addr(13 downto 8) <= pcore_addr_raw(13 downto 8);

  G_dimm_dqmb: for i in integer range 0 to 7 generate
    U_dimm_dqmb_ibuf: IBUF port map (
      I => PADS_dimm_dqmb(i),
      O => dimm_dqmb_ibuf(i) );
    U_dimm_dqmb_ibuf_o: IOB_FDC port map (
      C => CLK,
      CLR => RESET,
      D => dimm_dqmb_ibuf(i),
      Q => pcore_dqmb(i) );
  end generate;

  pcore_dmask(7 downto 0) <= (others => (not pcore_dqmb(0)));
  pcore_dmask(15 downto 8) <= (others => (not pcore_dqmb(1)));
  pcore_dmask(23 downto 16) <= (others => (not pcore_dqmb(2)));
  pcore_dmask(31 downto 24) <= (others => (not pcore_dqmb(3)));
  pcore_dmask(39 downto 32) <= (others => (not pcore_dqmb(4)));
  pcore_dmask(47 downto 40) <= (others => (not pcore_dqmb(5)));
  pcore_dmask(55 downto 48) <= (others => (not pcore_dqmb(6)));
  pcore_dmask(63 downto 56) <= (others => (not pcore_dqmb(7)));
G_io_conn : for i in integer range 2 to 27 generate
  U_io_conn_iobuf : IOBUF port map (
    I => io_conn_iobuf_i(i),
    O => io_conn_iobuf_o(i),
    T => io_conn_iobuf_t(i),
    IO => PADS_io_conn(i));
  U_io_conn_iobuf_o : IOB_FDC port map (
    C => CLK,
    CLR => RESET,
    D => io_conn_iobuf_o(i),
    Q => pcore_extin(i - 2));
  U_io_conn_iobuf_i : IOB_FDC port map (
    C => CLK,
    CLR => RESET,
    D => pcore_extout(i - 2),
    Q => io_conn_iobuf_i(i));
  U_io_conn_iobuf_t : IOB_FDC port map (
    C => CLK,
    CLR => RESET,
    D => pcore_extctrl(i - 2),
    Q => io_conn_iobuf_t(i));
end generate;
U_io_conn_0_iobuf : IOBUF port map (
  I => dimm_ck_bufg,
  O => open,
  T => GND,
  IO => PADS_io_conn(0));
U_io_conn_1_iobuf : IOBUF port map (
  I => GND,
  O => open,
  T => VDD,
  IO => PADS_io_conn(1));
READ_p <=
  (not dimm_s_ibuf) and
  (dimm_ras_ibuf) and
  (not dimm_cas_ibuf) and
  (dimm_we_ibuf);
U_read : FDC port map (
  C => CLK,
  CLR => RESET,
  D => READ_p,
  Q => READ);
U_buf_read : BUF port map (
  I => READ,
  O => READ_buf);
U_read_d : FDC port map (
  C => CLK,
  CLR => RESET,
  D => READ,
  Q => READ_d);
WRITE_p <=
  (not dimm_s_ibuf) and
  (dimm_ras_ibuf) and
  (not dimm_cas_ibuf) and
  (not dimm_we_ibuf);
U_write : FDC port map (
  C => CLK,
  CLR => RESET,
  D => WRITE_p,
  Q => WRITE);
U_buf_write : BUF port map (
  I => WRITE,
  O => WRITE_buf);
U_write_d : FDC port map (
  C => CLK,
  CLR => RESET,
  D => WRITE,
  Q => WRITE_d);
READ_n <= not READ;
U_read_d_n : FDC port map (
  C => CLK,
  CLR => RESET,
  D => READ_n,
  Q => READ_d_n);
U_buf_read_d_n : BUF port map (
  I => READ_d_n,
  O => READ_d_n_buf);
-- User logic should be placed inside pcore
U_pcore : pcore port map (
  clk => CLK,
  clkdiv => CLKDIV,
  rst => RESET,
  read => READ,
  write => WRITE,
  addr => pcore_addr,
  din => pcore_din,
  dout => pcore_dout,
  dmask => pcore_dmask,
  extin => pcore_extin,
  extout => pcore_extout,
  extctrl => pcore_extctrl
);
end syn;
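The combinational parts of the wrapper above can be checked quickly in software. The following is a minimal behavioral sketch in Python, not part of the original design: the function names are hypothetical, the arguments are the raw pin levels (0 or 1) seen on dimm_s_ibuf, dimm_ras_ibuf, dimm_cas_ibuf and dimm_we_ibuf, and the FDC/IOB_FDC register stages (one clock of delay each) are deliberately left out.

```python
def decode_command(s, ras, cas, we):
    """Model READ_p / WRITE_p: chip selected (s low), ras high, cas low.

    we high selects a read, we low selects a write.
    Returns the pair (read, write) as booleans.
    """
    column_cmd = (not s) and ras and (not cas)
    read = bool(column_cmd and we)
    write = bool(column_cmd and not we)
    return read, write


def expand_dmask(dqmb):
    """Model pcore_dmask: byte i of the 64-bit mask is all ones
    when bit i of the 8-bit dqmb value is low (byte enabled)."""
    mask = 0
    for byte in range(8):
        if not (dqmb >> byte) & 1:
            mask |= 0xFF << (8 * byte)
    return mask


def correct_addr(raw):
    """Model the addr_correct generate: bits 4 to 7 of the 14-bit
    raw address are inverted, all other bits pass through."""
    return (raw ^ 0xF0) & 0x3FFF


if __name__ == "__main__":
    print(decode_command(0, 1, 0, 1))   # read command
    print(decode_command(0, 1, 0, 0))   # write command
    print(hex(expand_dmask(0xFE)))      # only byte 0 enabled
    print(hex(correct_addr(0x0000)))    # bits 4..7 come back set
```

A model of this kind is useful mainly for generating expected values for a testbench; it says nothing about timing.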
Appendix E - Pcore.vhd 
-- Pcore Wrapper
-- Author: Kirk A Baugher
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
entity pcore is
  port (
    clk : in std_logic;
    clkdiv : in std_logic;
    rst : in std_logic;
    read : in std_logic;
    write : in std_logic;
    addr : in std_logic_vector(13 downto 0);
    din : in std_logic_vector(63 downto 0);
    dout : out std_logic_vector(63 downto 0);
    dmask : in std_logic_vector(63 downto 0);
    extin : in std_logic_vector(25 downto 0);
    extout : out std_logic_vector(25 downto 0);
    extctrl : out std_logic_vector(25 downto 0));
end pcore;
architecture syn of pcore is
component asyncfifo
  port (
    din : IN std_logic_VECTOR(127 downto 0);
    wr_en : IN std_logic;
    wr_clk : IN std_logic;
    rd_en : IN std_logic;
    rd_clk : IN std_logic;
    ainit : IN std_logic;
    dout : OUT std_logic_VECTOR(127 downto 0);
    full : OUT std_logic;
    empty : OUT std_logic;
    rd_ack : OUT std_logic;
    rd_err : OUT std_logic;
    wr_ack : OUT std_logic;
    wr_err : OUT std_logic);
END component;
component sparsemvmult
  PORT (
    CLK : IN STD_LOGIC;
    RESET : IN STD_LOGIC;
    din_rdy : IN STD_LOGIC;
    INP : IN STD_LOGIC_VECTOR(127 DOWNTO 0);
    ADDR : OUT STD_LOGIC_VECTOR(1 DOWNTO 0);
    REPORTFLAG : OUT STD_LOGIC;
    ANS_FLAG_OUT : OUT STD_LOGIC;
    OUTPUT : OUT STD_LOGIC_VECTOR(63 DOWNTO 0));
END component;
component outram
  port (
    addra : IN std_logic_VECTOR(1 downto 0);
    addrb : IN std_logic_VECTOR(1 downto 0);
    clka : IN std_logic;
    clkb : IN std_logic;
    dina : IN std_logic_VECTOR(63 downto 0);
    dinb : IN std_logic_VECTOR(63 downto 0);
    douta : OUT std_logic_VECTOR(63 downto 0);
    doutb : OUT std_logic_VECTOR(63 downto 0);
    wea : IN std_logic;
    web : IN std_logic);
END component;
signal wr_en, wr_ack, wr_err, rd_en, rd_ack, rd_err : std_logic;
signal full, empty, din_rdy, tic, finish, ready, read2, read3, read4 : std_logic;
signal FIFO_in, FIFO_out, out2parith : std_logic_vector(127 downto 0);
signal parithout : std_logic_vector(63 downto 0);
signal wea, web, ndb, rfdb, rdyb, tac : std_logic;
signal REPORTFLAG, ANS_FLAG_OUT : std_logic;
signal dina, dinb, douta, doutb : std_logic_vector(63 downto 0);
signal addrb : std_logic_vector(1 downto 0);
begin
fifo0 : asyncfifo port map (
  din => FIFO_in,
  wr_en => wr_en,
  wr_clk => clk,
  rd_en => rd_en,
  rd_clk => clkdiv,
  ainit => rst,
  dout => FIFO_out,
  full => full,
  empty => empty,
  rd_ack => rd_ack,
  rd_err => rd_err,
  wr_ack => wr_ack,
  wr_err => wr_err
);
sparsemvmult port map (
  CLK => clkdiv,
  RESET => rst,
  din_rdy => din_rdy,
  INP => out2parith,
  ADDR => addrb,
  REPORTFLAG => REPORTFLAG,
  ANS_FLAG_OUT => ANS_FLAG_OUT,
  OUTPUT => parithout
);
outram0 : outram port map (
  addra => addr(1 downto 0),
  addrb => addrb,
  clka => clk,
  clkb => clkdiv,
  dina => din,
  dinb => parithout,
  douta => douta,
  doutb => doutb,
  wea => write,
  web => finish
);
finish <= (REPORTFLAG OR ANS_FLAG_OUT);
process (clk, rst)
  variable tmpx : std_logic_vector(63 downto 0);
begin
  if rst='1' then
    tmpx := (OTHERS=>'1');
    wr_en <= '0';
    tic <= '0';
    FIFO_in <= (OTHERS=>'1');
  elsif clk'event and clk='1' then
    if write='1' and addr(1)='0' then
      if tic = '0' then
        tmpx := din;
        tic <= '1';
        wr_en <= '0';
      else
        FIFO_in <= tmpx & din;
        wr_en <= '1';
        tic <= '0';
      end if;
    else
      wr_en <= '0';
      tic <= tic;
      tmpx := tmpx;
    end if;
  end if;
end process;
process (clkdiv, rst)
begin
  if rst = '1' then
    rd_en <= '0';
  elsif clkdiv'event and clkdiv='1' then
    if empty = '0' then
      rd_en <= '1';
    else
      rd_en <= '0';
    end if;
  end if;
end process;
process (clkdiv, rst)
begin
  if rst='1' then
    out2parith <= (OTHERS=>'1');
    din_rdy <= '0';
  elsif clkdiv'event and clkdiv='1' then
    if rd_err = '0' and rd_ack = '1' then
      out2parith <= FIFO_out;
      din_rdy <= '1';
    else
      out2parith <= (OTHERS=>'1');
      din_rdy <= '0';
    end if;
  end if;
end process;
dout <= doutb;
end syn ;  
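The write-side process of the Pcore listing pairs two consecutive 64-bit host writes into one 128-bit FIFO word. The sketch below is a small behavioral model in Python, offered only as an illustration: the class and method names are hypothetical, each call to clock() stands for one rising edge of clk, and the one-cycle register delay on FIFO_in and wr_en is collapsed so that the word is appended at the moment the second write arrives.

```python
class WritePacker:
    """Behavioral model of the tic/tmpx packing logic (names are illustrative)."""

    def __init__(self):
        self.tic = 0                   # 0: waiting for first half, 1: waiting for second
        self.tmpx = (1 << 64) - 1      # reset value is all ones
        self.fifo = []                 # 128-bit words handed to the FIFO

    def clock(self, write, addr, din):
        """Apply one clock edge; return True when a 128-bit word was queued."""
        if write and not (addr >> 1) & 1:      # write = '1' and addr(1) = '0'
            if self.tic == 0:
                self.tmpx = din                # hold the first 64 bits
                self.tic = 1
                return False
            # second half: first word goes in the upper 64 bits (tmpx & din)
            self.fifo.append((self.tmpx << 64) | din)
            self.tic = 0
            return True
        return False                           # idle cycles keep tic and tmpx


if __name__ == "__main__":
    packer = WritePacker()
    packer.clock(1, 0b00, 0x1)
    packer.clock(1, 0b00, 0x2)
    print(hex(packer.fifo[0]))
```

Note that an idle cycle or a write with addr(1) set between the two halves does not reset the pairing, which matches the else branch of the process that simply holds tic and tmpx.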
Appendix F - Sparsemvmult.vhd 
-- Sparse Matrix Sparse Vector Multiplier
-- < sparsemvmult.vhd >
-- 4/19/2004
-- Kirk A Baugher
-- kbaugher.edu
LIBRARY IEEE;
USE IEEE.std_logic_1164.ALL;
USE IEEE.std_logic_arith.ALL;
use IEEE.std_logic_unsigned.all;
ENTITY sparsemvmult IS
  PORT (
    CLK : IN STD_LOGIC;
    RESET : IN STD_LOGIC;
    din_rdy : IN STD_LOGIC;
    INP : IN STD_LOGIC_VECTOR(127 DOWNTO 0);
    ADDR : OUT STD_LOGIC_VECTOR(1 DOWNTO 0);
    REPORTFLAG : OUT STD_LOGIC;
    ANS_FLAG_OUT : OUT STD_LOGIC;
    OUTPUT : OUT STD_LOGIC_VECTOR(63 DOWNTO 0));
END sparsemvmult;
ARCHITECTURE behavior OF sparsemvmult IS
SIGNAL ANS_FLAG, overflag, New_vectorflag : STD_LOGIC;
SIGNAL ANSWER : STD_LOGIC_VECTOR(63 DOWNTO 0);
TYPE STATE_TYPE IS (ADDRESS, DATA, PROCESSING, INITIALIZE, REPORTw, REPORTx, SEND, MACN);
SIGNAL STATE, STATEX, STATE_DEL : STATE_TYPE;
TYPE elmnt_addr IS ARRAY (0 TO 31) OF STD_LOGIC_VECTOR(31 DOWNTO 0);
SIGNAL ea : elmnt_addr;
--TYPE elmnt_data IS ARRAY (0 TO 63) OF STD_LOGIC_VECTOR(63 DOWNTO 0);
--SIGNAL ed : elmnt_data;
SIGNAL j, gnd_bit : STD_LOGIC;
SIGNAL i : INTEGER RANGE 0 TO 63;
TYPE MACbuffer IS ARRAY (0 TO 63) OF STD_LOGIC_VECTOR(63 DOWNTO 0);
SIGNAL buff, accbuff : MACbuffer;
SIGNAL count1, count2 : INTEGER RANGE 0 TO 31;
SIGNAL OUTPUT1, OUTPUT2, OUTPUT3, OUTPUT4 : STD_LOGIC_VECTOR(13 DOWNTO 0);
SIGNAL GND, rowcnt, rowcnt_less1, cntr : STD_LOGIC_VECTOR(31 DOWNTO 0);
SIGNAL Acount, Acount1 : INTEGER RANGE 0 TO 13;
SIGNAL Mult_in1A, Mult_in1B : STD_LOGIC_VECTOR(63 DOWNTO 0);
SIGNAL Mult_in2A, Mult_in2B : STD_LOGIC_VECTOR(63 DOWNTO 0);
SIGNAL over, stall, matchflag, one, din_rdy2 : STD_LOGIC;
SIGNAL over11, over12, over13, over14, over1 : STD_LOGIC;
SIGNAL match11, match12, match13, match14, match1, match1x : STD_LOGIC;
SIGNAL OUTPUT11, OUTPUT12, OUTPUT13, OUTPUT14 : STD_LOGIC;
--SIGNAL Dataout11, Dataout12, Dataout13, Dataout14, Dataout1 : STD_LOGIC_VECTOR(63 DOWNTO 0);
SIGNAL over21, over22, over23, over24, over2 : STD_LOGIC;
SIGNAL match21, match22, match23, match24, match2, match2x : STD_LOGIC;
SIGNAL OUTPUT21, OUTPUT22, OUTPUT23, OUTPUT24 : STD_LOGIC;
--SIGNAL Dataout21, Dataout22, Dataout23, Dataout24, Dataout2 : STD_LOGIC_VECTOR(63 DOWNTO 0);
SIGNAL over31, over32, over33, over34, over3 : STD_LOGIC;
SIGNAL match31, match32, match33, match34, match3, match3x : STD_LOGIC;
SIGNAL OUTPUT31, OUTPUT32, OUTPUT33, OUTPUT34 : STD_LOGIC;
--SIGNAL Dataout31, Dataout32, Dataout33, Dataout34, Dataout3 : STD_LOGIC_VECTOR(63 DOWNTO 0);
SIGNAL over41, over42, over43, over44, over4 : STD_LOGIC;
SIGNAL match41, match42, match43, match44, match44x, match4, match4x : STD_LOGIC;
SIGNAL OUTPUT41, OUTPUT42, OUTPUT43, OUTPUT44 : STD_LOGIC;
--SIGNAL Dataout41, Dataout42, Dataout43, Dataout44, Dataout4 : STD_LOGIC_VECTOR(63 DOWNTO 0);
SIGNAL spot : INTEGER RANGE 0 TO 55;
SIGNAL addra1, addrb1, addra2, addrb2 : STD_LOGIC_VECTOR(5 DOWNTO 0);
SIGNAL addra1x, addrb1x, addra2x, addrb2x : STD_LOGIC_VECTOR(5 DOWNTO 0);
SIGNAL addra1z, addrb1z, addra2z, addrb2z : STD_LOGIC_VECTOR(5 DOWNTO 0);
SIGNAL addra11, addrb11, addra21, addrb21 : STD_LOGIC_VECTOR(5 DOWNTO 0);
SIGNAL addra12, addrb12, addra22, addrb22 : STD_LOGIC_VECTOR(5 DOWNTO 0);
SIGNAL addra13, addrb13, addra23, addrb23 : STD_LOGIC_VECTOR(5 DOWNTO 0);
SIGNAL addra14, addrb14, addra24, addrb24 : STD_LOGIC_VECTOR(5 DOWNTO 0);
SIGNAL dina1, dinb1, dina2, dinb2 : STD_LOGIC_VECTOR(63 DOWNTO 0);
SIGNAL douta1, doutb1, douta2, doutb2 : STD_LOGIC_VECTOR(63 DOWNTO 0);
SIGNAL wea1, web1, wea2, web2 : STD_LOGIC;
SIGNAL ia1, ia2, ia3, ia4 : STD_LOGIC_VECTOR(5 DOWNTO 0);
signal mout1, mout2, Aa, Xa, Ab, Xb, mout1a, mout2a : STD_LOGIC_VECTOR(63 DOWNTO 0);
signal wr_enx : std_logic_vector(1 to 4);
signal rd_en1, rd_ack1, rd_err1 : std_logic;
signal wr_en1 : std_logic;
signal rd_en2, rd_ack2, rd_err2 : std_logic;
signal wr_en2 : std_logic;
signal rd_en3, rd_ack3, rd_err3 : std_logic;
signal wr_en3 : std_logic;
signal rd_en4, rd_ack4, rd_err4 : std_logic;
signal wr_en4 : std_logic;
TYPE quadBufferin IS ARRAY (1 TO 4) OF std_logic_vector(63 downto 0);
signal buffin : quadBufferin;
signal dout1, dout2, dout3, dout4 : std_logic_vector(63 downto 0);
signal empty1, empty2, empty3, empty4 : std_logic;
signal full1, full2, full3, full4 : std_logic;
signal sm1, sm2, fm1a, fm2a : std_logic;
signal ptr1, ptr2, ptr3, ptr4 : integer range 1 to 4;
signal sm1a, sm2a, sm1b, sm2b : std_logic;
signal side1, side1a, side1b : std_logic;
signal side2, side2a, side2b : std_logic;
signal c, d, c1, c2, d1, Ain1, Ain2, aout : std_logic_vector(63 downto 0);
signal rd_en, wr_enbuff : std_logic;
signal sa, rd_err : std_logic;
signal rd_ack : std_logic;
signal ready : std_logic;
signal fa, full_out : std_logic;
signal empty_out : std_logic;
signal dinbuff : std_logic_vector(63 downto 0);
signal dout_out : std_logic_vector(63 downto 0);
signal overflow_val, overflow_val2 : std_logic_vector(63 downto 0);
signal overflow, overflow2 : std_logic;
signal fm1, fm2, num_inputs : std_logic;
signal instatus, inputstatus : integer range 0 to 9;
signal size : integer range 0 to 64;
signal pending : integer range 0 to 13;
signal pendingm1, pendingm2 : integer range 0 to 9;
signal buffreset : std_logic;
--signal rea1, rea2, rea3, rea4, rea5, rea6, rea7, rea8, rea9, reb1, reb2, reb3, reb4, reb5, reb6, reb7, reb8, reb9 : std_logic;
COMPONENT dpfpmult
  port (
    CLK : in std_logic;
    A : in std_logic_vector(63 downto 0);
    B : in std_logic_vector(63 downto 0);
    OUTx : out std_logic_vector(63 downto 0);
    start : in std_logic;
    finish : out std_logic
  );
end COMPONENT;
COMPONENT dpfpadd
  port (
    CLK : in std_logic;
    Ain : in std_logic_vector(63 downto 0);
    Bin : in std_logic_vector(63 downto 0);
    OUTx : out std_logic_vector(63 downto 0);
    start : in std_logic;
    finish : out std_logic
  );
end COMPONENT;
component syncfifo
  port (
    clk : IN std_logic;
    sinit : IN std_logic;
    din : IN std_logic_VECTOR(63 downto 0);
    wr_en : IN std_logic;
    rd_en : IN std_logic;
    dout : OUT std_logic_VECTOR(63 downto 0);
    full : OUT std_logic;
    empty : OUT std_logic;
    rd_ack : OUT std_logic;
    rd_err : OUT std_logic);
end component;
component dpram64_64
  port (
    addra : IN std_logic_VECTOR(5 downto 0);
    addrb : IN std_logic_VECTOR(5 downto 0);
    clka : IN std_logic;
    clkb : IN std_logic;
    dina : IN std_logic_VECTOR(63 downto 0);
    dinb : IN std_logic_VECTOR(63 downto 0);
    douta : OUT std_logic_VECTOR(63 downto 0);
    doutb : OUT std_logic_VECTOR(63 downto 0);
    wea : IN std_logic;
    web : IN std_logic);
END component;
BEGIN
GND <= (OTHERS=>'0');
gnd_bit <= '0';






dpram64_64 port map (
  addra => addra1,
  addrb => addrb1,
  clka => clk,
  clkb => clk,
  dina => dina1,
  dinb => dinb1,
  douta => douta1,
  doutb => doutb1,
  wea => wea1,
  web => web1
);
dpram64_64 port map (
  addra => addra2,
  addrb => addrb2,
  clka => clk,
  clkb => clk,
  dina => dina2,
  dinb => dinb2,
  douta => douta2,
  doutb => doutb2,
  wea => wea2,
  web => web2
);
fpmult1 : dpfpmult port map (




  B=>Mult_in1B,
  OUTx=>mout1,
  start=>sm1,
  finish=>fm1




  B=>Mult_in2B,
  OUTx=>mout2,
  start=>sm2,
  finish=>fm2






finish=>fa ) ; 
buf : syncfifo port map (
  clk => clk,
  din => dinbuff,
  wr_en => wr_enbuff,
  rd_en => rd_en,
  sinit => buffreset,
  dout => dout_out,
  full => full_out,
  empty => empty_out,
  rd_ack => rd_ack,
  rd_err => rd_err
);
buf1 : syncfifo port map (
  clk => clk,
  din => buffin(1),
  wr_en => wr_enx(1),
  rd_en => rd_en1,
  sinit => buffreset,
  dout => dout1,
  full => full1,
  empty => empty1,
  rd_ack => rd_ack1,
  rd_err => rd_err1
);
buf2 : syncfifo port map (
  clk => clk,
  din => buffin(2),
  wr_en => wr_enx(2),
  rd_en => rd_en2,
  sinit => buffreset,
  dout => dout2,
  full => full2,
  empty => empty2,
  rd_ack => rd_ack2,
  rd_err => rd_err2
);
buf3 : syncfifo port map (
  clk => clk,
  din => buffin(3),
  wr_en => wr_enx(3),
  rd_en => rd_en3,
  sinit => buffreset,
  dout => dout3,
  full => full3,
  empty => empty3,
  rd_ack => rd_ack3,
  rd_err => rd_err3
);
buf4 : syncfifo port map (
  clk => clk,
  din => buffin(4),
  wr_en => wr_enx(4),
  rd_en => rd_en4,
  sinit => buffreset,
  dout => dout4,
  full => full4,
  empty => empty4,
  rd_ack => rd_ack4,
  rd_err => rd_err4
);
--buffreset <= reset OR ANS_FLAG;
--mout1 <= mout1a when rea9='0' else (OTHERS=>'0');
--mout2 <= mout2a when reb9='0' else (OTHERS=>'0');
--fm1 <= fm1a when rea9='0' else '0';
--fm2 <= fm2a when reb9='0' else '0';
MAIN : PROCESS (CLK, RESET)
  VARIABLE w : INTEGER RANGE 0 TO 3;
  VARIABLE over, reportflagx : std_logic;
BEGIN
  IF RESET='1' THEN
    STATE<=INITIALIZE;
    STATEX<=INITIALIZE;
    ANS_FLAG_OUT <= '0';
    w:=0;
    i<=0;
    j<='0';
    Acount<=13;
    --overflag <= '0';
    over:='0';
    reportflagx:='0';
    ADDR <= "10";
    --match:='0';
    OUTPUT<=(OTHERS=>'0');
    ea(0)<=(OTHERS=>'0');
    ea(1)<=(OTHERS=>'0');
    ea(2)<=(OTHERS=>'0');
    ea(3)<=(OTHERS=>'0');
    ea(4)<=(OTHERS=>'0');
    ea(5)<=(OTHERS=>'0');
    ea(6)<=(OTHERS=>'0');
    ea(7)<=(OTHERS=>'0');
    ea(8)<=(OTHERS=>'0');
    ea(9)<=(OTHERS=>'0');
    ea(10)<=(OTHERS=>'0');
    ea(11)<=(OTHERS=>'0');
    ea(12)<=(OTHERS=>'0');
    ea(13)<=(OTHERS=>'0');
    ea(14)<=(OTHERS=>'0');
    ea(15)<=(OTHERS=>'0');
    ea(16)<=(OTHERS=>'0');
    ea(17)<=(OTHERS=>'0');
    ea(18)<=(OTHERS=>'0');
    ea(19)<=(OTHERS=>'0');
    ea(20)<=(OTHERS=>'0');
    ea(21)<=(OTHERS=>'0');
    ea(22)<=(OTHERS=>'0');
    ea(23)<=(OTHERS=>'0');
    ea(24)<=(OTHERS=>'0');
    ea(25)<=(OTHERS=>'0');
    ea(26)<=(OTHERS=>'0');
    ea(27)<=(OTHERS=>'0');
    ea(28)<=(OTHERS=>'0');
    ea(29)<=(OTHERS=>'0');
    ea(30)<=(OTHERS=>'0');
    ea(31)<=(OTHERS=>'0');
    ia1 <= "000000";
    ia2 <= "000001";
    ia3 <= "000010";
    ia4 <= "000011";
    wea1 <= '0';
    web1 <= '0';
    wea2 <= '0';
    web2 <= '0';
    dina1<=(OTHERS=>'0');
    dinb1<=(OTHERS=>'0');
    dina2<=(OTHERS=>'0');
    dinb2<=(OTHERS=>'0');
    addra1z <= (OTHERS=>'0');
    addrb1z <= (OTHERS=>'0');
    addra2z <= (OTHERS=>'0');
    addrb2z <= (OTHERS=>'0');
    rowcnt <= (OTHERS=>'0');
    rowcnt_less1 <= (OTHERS=>'0');
  ELSIF CLK'EVENT AND CLK='1' THEN
    CASE STATE IS
      WHEN INITIALIZE =>
        IF din_rdy = '1' THEN
          rowcnt <= INP(31 DOWNTO 0);
          rowcnt_less1 <= INP(63 DOWNTO 32);
          STATE <= ADDRESS;
        END IF;
        OUTPUT <= (OTHERS=>'1');
        reportflagx := gnd_bit;
        ANS_FLAG_OUT<='0';
        wea1 <= '0';
        web1 <= '0';
        wea2 <= '0';
        web2 <= '0';
      WHEN ADDRESS =>
        --If these 128 bit not equal to's are too slow for 66 or 50 MHz, they can be checked at
        -- 133 or 100 MHz, 64-bits at a time at Pcore and Pcore could send a 1-bit flag here
        -- notifying the code if the input is invalid or not
        -- IF INP /= "11111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111" THEN
        IF din_rdy = '1' THEN
          ea(i)<=INP(127 DOWNTO 96);
          ea(i+1)<=INP(95 DOWNTO 64);
          ea(i+2)<=INP(63 DOWNTO 32);
          ea(i+3)<=INP(31 DOWNTO 0);
          STATE<=DATA;
        END IF;
        OUTPUT <= (OTHERS=>'1');
        reportflagx := gnd_bit;
        ANS_FLAG_OUT<='0';
        wea1 <= '0';
        web1 <= '0';
        wea2 <= '0';
        web2 <= '0';
      WHEN DATA =>
        --IF INP /= "11111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111" THEN
        IF din_rdy = '1' THEN
          IF j = '0' THEN
            dina1<=INP(127 DOWNTO 64);
            dinb1<=INP(63 DOWNTO 0);
            dina2<=INP(127 DOWNTO 64);
            dinb2<=INP(63 DOWNTO 0);
            wea1 <= '1';
            web1 <= '1';
            wea2 <= '1';
            web2 <= '1';
            addra1z <= ia1;
            addrb1z <= ia2;
            addra2z <= ia1;
            addrb2z <= ia2;
            j <= '1';
          ELSE
            dina1<=INP(127 DOWNTO 64);
            dinb1<=INP(63 DOWNTO 0);
            dina2<=INP(127 DOWNTO 64);
            dinb2<=INP(63 DOWNTO 0);
            wea1 <= '1';
            web1 <= '1';
            wea2 <= '1';
            web2 <= '1';
            addra1z <= ia3;
            addrb1z <= ia4;
            addra2z <= ia3;
            addrb2z <= ia4;
            j <= '0';
            IF i<28 THEN
              STATE<=ADDRESS;
              i<=i+4;
              ia1 <= ia1 + "000100";
              ia2 <= ia2 + "000100";
              ia3 <= ia3 + "000100";
              ia4 <= ia4 + "000100";
            ELSE
              STATE<=PROCESSING;
              i<=0;
              ia1 <= "000000";
              ia2 <= "000001";
              ia3 <= "000010";
              ia4 <= "000011";
            END IF;
          END IF;
        END IF;
        OUTPUT <= (OTHERS=>'1');
        reportflagx := gnd_bit;
      WHEN PROCESSING =>
        --Don't decrement the counter if the input is invalid!!
        --Reading in addresses for comparators here
        wea1 <= '0';
        web1 <= '0';
        wea2 <= '0';
        web2 <= '0';
        IF din_rdy='1' THEN
          ADDR <= "10";
          IF Acount = 0 THEN
            Acount<=13;
            STATE<=REPORTw;
          ELSE
            Acount<=Acount-1;
            -- OUTPUT<=(OTHERS=>'0');
          END IF;
          OUTPUT <= (OTHERS=>'1');
        END IF;
        ANS_FLAG_OUT<='0';
        reportflagx := gnd_bit;
      WHEN REPORTw =>
        STATE <= REPORTx;
        reportflagx := gnd_bit;
      WHEN REPORTx =>
        reportflagx := one;
        --match := match1 OR match2 OR match3 OR match4;
        --over := over1 OR over2 OR over3 OR over4;
        OUTPUT(63)<=matchflag;
        OUTPUT(62)<=overflag;
        OUTPUT(61)<=match44x; --notifies C code if last bit of status is a match
        --this is important so if the overflag goes high the
        --C code will know that the last bit was a match and not
        --the over bit address
        OUTPUT(60)<='0'; --Reserved for future use
        OUTPUT(59)<=ANS_FLAG; --Not currently in use but will be when multiple answers
        --are supported
        --Bits 58 to 56 are reserved for future use
        --OUTPUT(58 DOWNTO 56)<=ANS_SIZE;
        --These results will be reserved for later use for when the sparse code can
        -- keep track of multiple answers so these bits can signify up to 7
        --answers available
        OUTPUT(58 DOWNTO 56)<="000";
        OUTPUT(55) <= OUTPUT1(13);
        OUTPUT(54) <= OUTPUT2(13);
        OUTPUT(53) <= OUTPUT3(13);
        OUTPUT(52) <= OUTPUT4(13);
        OUTPUT(51) <= OUTPUT1(12);
        OUTPUT(50) <= OUTPUT2(12);
        OUTPUT(49) <= OUTPUT3(12);
        OUTPUT(48) <= OUTPUT4(12);
        OUTPUT(47) <= OUTPUT1(11);
        OUTPUT(46) <= OUTPUT2(11);
        OUTPUT(45) <= OUTPUT3(11);
        OUTPUT(44) <= OUTPUT4(11);
        OUTPUT(43) <= OUTPUT1(10);
        OUTPUT(42) <= OUTPUT2(10);
        OUTPUT(41) <= OUTPUT3(10);
        OUTPUT(40) <= OUTPUT4(10);
        OUTPUT(39) <= OUTPUT1(9);
        OUTPUT(38) <= OUTPUT2(9);
        OUTPUT(37) <= OUTPUT3(9);
        OUTPUT(36) <= OUTPUT4(9);
        OUTPUT(35) <= OUTPUT1(8);
        OUTPUT(34) <= OUTPUT2(8);
        OUTPUT(33) <= OUTPUT3(8);
        OUTPUT(32) <= OUTPUT4(8);
        OUTPUT(31) <= OUTPUT1(7);
        OUTPUT(30) <= OUTPUT2(7);
        OUTPUT(29) <= OUTPUT3(7);
        OUTPUT(28) <= OUTPUT4(7);
        OUTPUT(27) <= OUTPUT1(6);
        OUTPUT(26) <= OUTPUT2(6);
        OUTPUT(25) <= OUTPUT3(6);
        OUTPUT(24) <= OUTPUT4(6);
        OUTPUT(23) <= OUTPUT1(5);
        OUTPUT(22) <= OUTPUT2(5);
        OUTPUT(21) <= OUTPUT3(5);
        OUTPUT(20) <= OUTPUT4(5);
        OUTPUT(19) <= OUTPUT1(4);
        OUTPUT(18) <= OUTPUT2(4);
        OUTPUT(17) <= OUTPUT3(4);
        OUTPUT(16) <= OUTPUT4(4);
        OUTPUT(15) <= OUTPUT1(3);
        OUTPUT(14) <= OUTPUT2(3);
        OUTPUT(13) <= OUTPUT3(3);
        OUTPUT(12) <= OUTPUT4(3);
        OUTPUT(11) <= OUTPUT1(2);
        OUTPUT(10) <= OUTPUT2(2);
        OUTPUT(9) <= OUTPUT3(2);
        OUTPUT(8) <= OUTPUT4(2);
        OUTPUT(7) <= OUTPUT1(1);
        OUTPUT(6) <= OUTPUT2(1);
        OUTPUT(5) <= OUTPUT3(1);
        OUTPUT(4) <= OUTPUT4(1);
        OUTPUT(3) <= OUTPUT1(0);
        OUTPUT(2) <= OUTPUT2(0);
        OUTPUT(1) <= OUTPUT3(0);
        OUTPUT(0) <= OUTPUT4(0);
        IF overflag = '1' THEN
          OUTPUT(spot) <= '1';
        END IF;
        IF matchflag = '1' THEN
          STATE <= MACN;
        ELSE
          IF overflag = '1' THEN
            --go here regardless of ANS_FLAG state
            STATE <= SEND;
          ELSE --should never have ansflag = '1' before overflag!!
            STATE <= PROCESSING;
          END IF;
        END IF;
        --END IF;
        --overflag <= over;
        --New_vectorflag <= New_vector;
      WHEN SEND =>
        reportflagx := gnd_bit;
        IF ANS_FLAG = '1' THEN
          ANS_FLAG_OUT <= ANS_FLAG;
          OUTPUT <= ANSWER;
          ADDR <= "11";
          IF New_vectorflag = '0' THEN
            STATE <= PROCESSING;
            --overflag <= '0';
          ELSE
            --New_vectorflag <= '0';
            STATE <= ADDRESS;
            --overflag <= '0';
            --New_vectorflag <= '0';
          END IF;
        ELSE
          --wait until an answer is found
          ANS_FLAG_OUT <= ANS_FLAG;
        END IF;
      WHEN MACN =>
        IF din_rdy='1' AND INP="00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000" THEN
          --C code done sending inputs
          --IF overflag = '1' or New_vector = '1' THEN
          IF overflag = '1' THEN
            STATE <= SEND;
          ELSE
            STATE <= PROCESSING;
          END IF;
        ELSE
          --do nothing here, other processes are handling things
        END IF;
        reportflagx := gnd_bit;
        OUTPUT <= (OTHERS=>'1');
      WHEN OTHERS =>
    END CASE;
  END IF;
  reportflag <= reportflagx;
END PROCESS MAIN;
PROCESS (CLK, RESET)
BEGIN
  IF RESET = '1' THEN
    cntr <= (OTHERS=>'0');
    New_vectorflag <= '0';
  ELSIF CLK'EVENT AND CLK='1' THEN
    IF STATE = REPORTx AND overflag = '1' THEN
      IF cntr = rowcnt THEN
        cntr <= (OTHERS=>'0');
        New_vectorflag <= '1';
      ELSE
        cntr <= cntr + '1';
      END IF;
    ELSIF STATE = SEND AND ANS_FLAG = '1' AND New_vectorflag = '1' THEN
      New_vectorflag <= '0';
    END IF;
  END IF;
END PROCESS;
addra1 <= addra1z when STATE = DATA OR STATE=ADDRESS OR STATE_DEL = DATA else addra1x;
addrb1 <= addrb1z when STATE = DATA OR STATE=ADDRESS OR STATE_DEL = DATA else addrb1x;
addra2 <= addra2z when STATE = DATA OR STATE=ADDRESS OR STATE_DEL = DATA else addra2x;
addrb2 <= addrb2z when STATE = DATA OR STATE=ADDRESS OR STATE_DEL = DATA else addrb2x;
DELAY_PROC : PROCESS (CLK, RESET)
BEGIN
  IF RESET = '1' THEN
    Acount1<=0;
    din_rdy2 <= '0';
    STATE_DEL <= ADDRESS;
  ELSIF CLK'EVENT AND CLK='1' THEN
    Acount1<=Acount;
    din_rdy2 <= din_rdy;
    STATE_DEL <= STATE;
    --Acount2<=Acount1;
    --overa <= over1 or over2 or over3 or over4;
  END IF;
END PROCESS DELAY_PROC;
PROCESS (CLK, RESET)
BEGIN
  IF RESET='1' THEN
    overflag <= '0';
  ELSIF CLK'EVENT AND CLK='1' THEN
    IF din_rdy2 = '1' AND STATE_DEL = PROCESSING then
      -- IF ANS_FLAG = '1' THEN
      overflag <= '0'; --reset it for next time
      IF over44 = '1' THEN
        overflag <= '1';
        --ELSIF Acount1 = 0 AND match44 = '1' THEN
        overflag <= '1';
      ELSE
        overflag <= overflag;
      END IF;
    ELSIF ANS_FLAG = '1' AND STATE = SEND THEN
      overflag <= '0';
    END IF;
  END IF;
END PROCESS;
MATCH_PROC : PROCESS (CLK, RESET)
BEGIN
  IF RESET='1' THEN
    matchflag <= '0';
  ELSIF CLK'EVENT AND CLK='1' THEN
    IF din_rdy2 = '1' AND STATE_DEL = PROCESSING THEN
      IF (match11 OR match12 OR match13 OR match14) = '1' THEN
        matchflag <= '1';
      ELSIF (match21 OR match22 OR match23 OR match24) = '1' THEN
        matchflag <= '1';
      ELSIF (match31 OR match32 OR match33 OR match34) = '1' THEN
        matchflag <= '1';
      ELSIF (match41 OR match42 OR match43 OR match44) = '1' THEN
        matchflag <= '1';
      ELSE
        matchflag <= matchflag;
      END IF;
    ELSIF STATE_DEL = MACN THEN
      matchflag <= '0';
    END IF;
  END IF;
END PROCESS MATCH_PROC;
OVER_ADJ : PROCESS (CLK)
BEGIN
  IF CLK'EVENT AND CLK='1' THEN
    IF STATE_DEL = PROCESSING AND din_rdy2='1' THEN
      IF overflag = '0' THEN
        IF over44 = '1' AND (match41 OR match42 OR match43 OR match44)='0' THEN
          spot <= (4*Acount1);
        ELSE
        END IF;
      ELSE
        spot <= spot;
      END IF;
    END IF;
  END IF;
END PROCESS OVER_ADJ;
COMPARATOR11 : PROCESS (CLK, STATE, RESET)
BEGIN
  IF RESET = '1' THEN
    addra11 <= "000000";
    OUTPUT11<='0';
    over11<='0';
    match11<='0';
  ELSIF CLK'EVENT AND CLK='1' THEN
    IF STATE /= PROCESSING or din_rdy='0' THEN
      addra11 <= "000000";
      OUTPUT11<='0';
      over11<='0';
      match11<='0';
    --ELSE
    ELSIF INP(127 DOWNTO 96) = ea(0) THEN
      addra11 <= "000000";
      over11<='0';
      match11<='1';
      OUTPUT11<='1';
    ELSIF INP(127 DOWNTO 96) = ea(1) THEN
      addra11 <= "000001";
      over11<='0';
      match11<='1';
      OUTPUT11<='1';
    ELSIF INP(127 DOWNTO 96) = ea(2) THEN
      addra11 <= "000010";
      over11<='0';
      match11<='1';
      OUTPUT11<='1';
    ELSIF INP(127 DOWNTO 96) = ea(3) THEN
      addra11 <= "000011";
      over11<='0';
      match11<='1';
      OUTPUT11<='1';
    ELSIF INP(127 DOWNTO 96) = ea(4) THEN
      addra11 <= "000100";
      over11<='0';
      match11<='1';
      OUTPUT11<='1';
    ELSIF INP(127 DOWNTO 96) = ea(5) THEN
      addra11 <= "000101";
      over11<='0';
      match11<='1';
      OUTPUT11<='1';
    ELSIF INP(127 DOWNTO 96) = ea(6) THEN
      addra11 <= "000110";
      over11<='0';
      match11<='1';
      OUTPUT11<='1';
    ELSIF INP(127 DOWNTO 96) = ea(7) THEN
      addra11 <= "000111";
      over11<='0';
      match11<='1';
      OUTPUT11<='1';
    ELSIF INP(127 DOWNTO 96) > ea(63) THEN
      over11<='1';
      match11<='0';
      OUTPUT11<='0';
    ELSE
      addra11 <= "000000";
      match11<='0';
      over11<='0';
      OUTPUT11<='0';
    END IF;
  END IF;
END PROCESS COMPARATOR11;
COMPARATOR12 : PROCESS (CLK, STATE, RESET)
BEGIN
  IF RESET = '1' THEN
    OUTPUT12<='0';
    over12<='0';
    match12<='0';
    addra12 <= "000000";
  ELSIF CLK'EVENT AND CLK='1' THEN
    IF STATE /= PROCESSING or din_rdy='0' THEN
      OUTPUT12<='0';
      over12<='0';
      match12<='0';
      addra12 <= "000000";
    --ELSE
    ELSIF INP(127 DOWNTO 96) = ea(8) THEN
      addra12 <= "001000";
      over12<='0';
      match12<='1';
      OUTPUT12<='1';
    ELSIF INP(127 DOWNTO 96) = ea(9) THEN
      addra12 <= "001001";
      over12<='0';
      match12<='1';
      OUTPUT12<='1';
    ELSIF INP(127 DOWNTO 96) = ea(10) THEN
      addra12 <= "001010";
      over12<='0';
      match12<='1';
      OUTPUT12<='1';
    ELSIF INP(127 DOWNTO 96) = ea(11) THEN
      addra12 <= "001011";
      over12<='0';
      match12<='1';
      OUTPUT12<='1';
    ELSIF INP(127 DOWNTO 96) = ea(12) THEN
      addra12 <= "001100";
      over12<='0';
      match12<='1';
      OUTPUT12<='1';
    ELSIF INP(127 DOWNTO 96) = ea(13) THEN
      addra12 <= "001101";
      over12<='0';
      match12<='1';
      OUTPUT12<='1';
    ELSIF INP(127 DOWNTO 96) = ea(14) THEN
      addra12 <= "001110";
      over12<='0';
      match12<='1';
      OUTPUT12<='1';
    ELSIF INP(127 DOWNTO 96) = ea(15) THEN
      addra12 <= "001111";
      over12<='0';
      match12<='1';
      OUTPUT12<='1';
    ELSIF INP(127 DOWNTO 96) > ea(63) THEN
      over12<='1';
      match12<='0';
      OUTPUT12<='0';
    ELSE
      match12<='0';
      over12<='0';
      addra12 <= "000000";
      OUTPUT12<='0';
    END IF;
  END IF;
END PROCESS COMPARATOR12;
COMPARATOR13 : PROCESS (CLK, STATE, RESET)
BEGIN
  IF RESET = '1' THEN
    OUTPUT13<='0';
    over13<='0';
    match13<='0';
    addra13 <= "000000";
  ELSIF CLK'EVENT AND CLK='1' THEN
    IF STATE /= PROCESSING or din_rdy='0' THEN
      OUTPUT13<='0';
      over13<='0';
      match13<='0';
      addra13 <= "000000";
    --ELSE
    ELSIF INP(127 DOWNTO 96) = ea(16) THEN
      over13<='0';
      match13<='1';
      OUTPUT13<='1';
      addra13 <= "010000";
    ELSIF INP(127 DOWNTO 96) = ea(17) THEN
      over13<='0';
      match13<='1';
      OUTPUT13<='1';
      addra13 <= "010001";
    ELSIF INP(127 DOWNTO 96) = ea(18) THEN
      over13<='0';
      match13<='1';
      OUTPUT13<='1';
      addra13 <= "010010";
    ELSIF INP(127 DOWNTO 96) = ea(19) THEN
      over13<='0';
      match13<='1';
      OUTPUT13<='1';
      addra13 <= "010011";
    ELSIF INP(127 DOWNTO 96) = ea(20) THEN
      over13<='0';
      match13<='1';
      OUTPUT13<='1';
      addra13 <= "010100";
    ELSIF INP(127 DOWNTO 96) = ea(21) THEN
      over13<='0';
      match13<='1';
      OUTPUT13<='1';
      addra13 <= "010101";
    ELSIF INP(127 DOWNTO 96) = ea(22) THEN
      over13<='0';
      match13<='1';
      OUTPUT13<='1';
      addra13 <= "010110";
    ELSIF INP(127 DOWNTO 96) = ea(23) THEN
      over13<='0';
      match13<='1';
      OUTPUT13<='1';
      addra13 <= "010111";
    ELSIF INP(127 DOWNTO 96) > ea(63) THEN
      over13<='1';
      match13<='0';
      OUTPUT13<='0';
    ELSE
      match13<='0';
      over13<='0';
      addra13 <= "000000";
      OUTPUT13<='0';
    END IF;
  END IF;
END PROCESS COMPARATOR13;
COMPARATOR14: PROCESS (CLK, STATE, RESET)
BEGIN
  IF RESET = '1' THEN
    addra14 <= "000000";
    OUTPUT14 <= '0';
    over14 <= '0';
    match14 <= '0';
  ELSIF CLK'EVENT AND CLK = '1' THEN
    IF STATE /= PROCESSING or din_rdy = '0' THEN
      OUTPUT14 <= '0';
      over14 <= '0';
      match14 <= '0';
      addra14 <= "000000";
    --ELSE
    ELSIF INP(127 DOWNTO 96) = ea(24) THEN
      over14 <= '0';
      match14 <= '1';
      OUTPUT14 <= '1';
      addra14 <= "011000";
    ELSIF INP(127 DOWNTO 96) = ea(25) THEN
      over14 <= '0';
      match14 <= '1';
      OUTPUT14 <= '1';
      addra14 <= "011001";
    ELSIF INP(127 DOWNTO 96) = ea(26) THEN
      over14 <= '0';
      match14 <= '1';
      OUTPUT14 <= '1';
      addra14 <= "011010";
    ELSIF INP(127 DOWNTO 96) = ea(27) THEN
      over14 <= '0';
      match14 <= '1';
      OUTPUT14 <= '1';
      addra14 <= "011011";
    ELSIF INP(127 DOWNTO 96) = ea(28) THEN
      over14 <= '0';
      match14 <= '1';
      OUTPUT14 <= '1';
      addra14 <= "011100";
    ELSIF INP(127 DOWNTO 96) = ea(29) THEN
      over14 <= '0';
      match14 <= '1';
      OUTPUT14 <= '1';
      addra14 <= "011101";
    ELSIF INP(127 DOWNTO 96) = ea(30) THEN
      over14 <= '0';
      match14 <= '1';
      OUTPUT14 <= '1';
      addra14 <= "011110";
    ELSIF INP(127 DOWNTO 96) = ea(31) THEN
      over14 <= '0';
      match14 <= '1';
      OUTPUT14 <= '1';
      addra14 <= "011111";
    ELSIF INP(127 DOWNTO 96) > ea(31) THEN
      --over14 <= '1';
      match14 <= '0';
      OUTPUT14 <= '0';
    ELSE
      match14 <= '0';
      over14 <= '0';
      addra14 <= "000000";
      OUTPUT14 <= '0';
    END IF;
  END IF;
END PROCESS COMPARATOR14;
MUXER1: PROCESS (CLK, RESET)
  variable match1a : std_logic;
BEGIN
  IF RESET = '1' THEN
    --over1 <= '0';
    match1x <= '0';
    addra1x <= (OTHERS => '0');
    OUTPUT1 <= (OTHERS => '0');
    match1a := '0';
  ELSIF CLK'EVENT AND CLK = '1' THEN
    --over1 <= over11 OR over12 OR over13 OR over14;
    match1a := match11 OR match12 OR match13 OR match14;
    match1x <= match1a;
    OUTPUT1(Acount1) <= OUTPUT11 OR OUTPUT12 OR OUTPUT13 OR OUTPUT14;
    IF match11 = '1' THEN
      addra1x <= addra11;
    ELSIF match12 = '1' THEN
      addra1x <= addra12;
    ELSIF match13 = '1' THEN
      addra1x <= addra13;
    ELSIF match14 = '1' THEN
      addra1x <= addra14;
    ELSE
      addra1x <= (OTHERS => '0');
    END IF;
  END IF;
END PROCESS MUXER1;
COMPARATOR21: PROCESS (CLK, STATE, RESET)
BEGIN
  IF RESET = '1' THEN
    OUTPUT21 <= '0';
    over21 <= '0';
    match21 <= '0';
    addrb11 <= "000000";
  ELSIF CLK'EVENT AND CLK = '1' THEN
    IF STATE /= PROCESSING or din_rdy = '0' THEN
      OUTPUT21 <= '0';
      over21 <= '0';
      match21 <= '0';
      addrb11 <= "000000";
    --ELSE
    ELSIF INP(95 DOWNTO 64) = ea(0) THEN
      over21 <= '0';
      match21 <= '1';
      OUTPUT21 <= '1';
      addrb11 <= "000000";
    ELSIF INP(95 DOWNTO 64) = ea(1) THEN
      over21 <= '0';
      match21 <= '1';
      OUTPUT21 <= '1';
      addrb11 <= "000001";
    ELSIF INP(95 DOWNTO 64) = ea(2) THEN
      over21 <= '0';
      match21 <= '1';
      OUTPUT21 <= '1';
      addrb11 <= "000010";
    ELSIF INP(95 DOWNTO 64) = ea(3) THEN
      over21 <= '0';
      match21 <= '1';
      OUTPUT21 <= '1';
      addrb11 <= "000011";
    ELSIF INP(95 DOWNTO 64) = ea(4) THEN
      over21 <= '0';
      match21 <= '1';
      OUTPUT21 <= '1';
      addrb11 <= "000100";
    ELSIF INP(95 DOWNTO 64) = ea(5) THEN
      over21 <= '0';
      match21 <= '1';
      OUTPUT21 <= '1';
      addrb11 <= "000101";
    ELSIF INP(95 DOWNTO 64) = ea(6) THEN
      over21 <= '0';
      match21 <= '1';
      OUTPUT21 <= '1';
      addrb11 <= "000110";
    ELSIF INP(95 DOWNTO 64) = ea(7) THEN
      over21 <= '0';
      match21 <= '1';
      OUTPUT21 <= '1';
      addrb11 <= "000111";
    ELSIF INP(95 DOWNTO 64) > ea(63) THEN
      over21 <= '1';
      match21 <= '0';
      OUTPUT21 <= '0';
    ELSE
      match21 <= '0';
      over21 <= '0';
      addrb11 <= "000000";
      OUTPUT21 <= '0';
    END IF;
  END IF;
END PROCESS COMPARATOR21;
COMPARATOR22: PROCESS (CLK, STATE, RESET)
BEGIN
  IF RESET = '1' THEN
    OUTPUT22 <= '0';
    over22 <= '0';
    match22 <= '0';
    addrb12 <= "000000";
  ELSIF CLK'EVENT AND CLK = '1' THEN
    IF STATE /= PROCESSING or din_rdy = '0' THEN
      OUTPUT22 <= '0';
      over22 <= '0';
      match22 <= '0';
      addrb12 <= "000000";
    --ELSE
    ELSIF INP(95 DOWNTO 64) = ea(8) THEN
      over22 <= '0';
      match22 <= '1';
      OUTPUT22 <= '1';
      addrb12 <= "001000";
    ELSIF INP(95 DOWNTO 64) = ea(9) THEN
      over22 <= '0';
      match22 <= '1';
      OUTPUT22 <= '1';
      addrb12 <= "001001";
    ELSIF INP(95 DOWNTO 64) = ea(10) THEN
      over22 <= '0';
      match22 <= '1';
      OUTPUT22 <= '1';
      addrb12 <= "001010";
    ELSIF INP(95 DOWNTO 64) = ea(11) THEN
      over22 <= '0';
      match22 <= '1';
      OUTPUT22 <= '1';
      addrb12 <= "001011";
    ELSIF INP(95 DOWNTO 64) = ea(12) THEN
      over22 <= '0';
      match22 <= '1';
      OUTPUT22 <= '1';
      addrb12 <= "001100";
    ELSIF INP(95 DOWNTO 64) = ea(13) THEN
      over22 <= '0';
      match22 <= '1';
      OUTPUT22 <= '1';
      addrb12 <= "001101";
    ELSIF INP(95 DOWNTO 64) = ea(14) THEN
      over22 <= '0';
      match22 <= '1';
      OUTPUT22 <= '1';
      addrb12 <= "001110";
    ELSIF INP(95 DOWNTO 64) = ea(15) THEN
      over22 <= '0';
      match22 <= '1';
      OUTPUT22 <= '1';
      addrb12 <= "001111";
    ELSIF INP(95 DOWNTO 64) > ea(63) THEN
      over22 <= '1';
      match22 <= '0';
      OUTPUT22 <= '0';
    ELSE
      match22 <= '0';
      over22 <= '0';
      addrb12 <= "000000";
      OUTPUT22 <= '0';
    END IF;
  END IF;
END PROCESS COMPARATOR22;
COMPARATOR23: PROCESS (CLK, STATE, RESET)
BEGIN
  IF RESET = '1' THEN
    OUTPUT23 <= '0';
    over23 <= '0';
    match23 <= '0';
    addrb13 <= "000000";
  ELSIF CLK'EVENT AND CLK = '1' THEN
    IF STATE /= PROCESSING or din_rdy = '0' THEN
      OUTPUT23 <= '0';
      over23 <= '0';
      match23 <= '0';
      addrb13 <= "000000";
    --ELSE
    ELSIF INP(95 DOWNTO 64) = ea(16) THEN
      over23 <= '0';
      match23 <= '1';
      OUTPUT23 <= '1';
      addrb13 <= "010000";
    ELSIF INP(95 DOWNTO 64) = ea(17) THEN
      over23 <= '0';
      match23 <= '1';
      OUTPUT23 <= '1';
      addrb13 <= "010001";
    ELSIF INP(95 DOWNTO 64) = ea(18) THEN
      over23 <= '0';
      match23 <= '1';
      OUTPUT23 <= '1';
      addrb13 <= "010010";
    ELSIF INP(95 DOWNTO 64) = ea(19) THEN
      over23 <= '0';
      match23 <= '1';
      OUTPUT23 <= '1';
      addrb13 <= "010011";
    ELSIF INP(95 DOWNTO 64) = ea(20) THEN
      over23 <= '0';
      match23 <= '1';
      OUTPUT23 <= '1';
      addrb13 <= "010100";
    ELSIF INP(95 DOWNTO 64) = ea(21) THEN
      over23 <= '0';
      match23 <= '1';
      OUTPUT23 <= '1';
      addrb13 <= "010101";
    ELSIF INP(95 DOWNTO 64) = ea(22) THEN
      over23 <= '0';
      match23 <= '1';
      OUTPUT23 <= '1';
      addrb13 <= "010110";
    ELSIF INP(95 DOWNTO 64) = ea(23) THEN
      over23 <= '0';
      match23 <= '1';
      OUTPUT23 <= '1';
      addrb13 <= "010111";
    ELSIF INP(95 DOWNTO 64) > ea(63) THEN
      over23 <= '1';
      match23 <= '0';
      OUTPUT23 <= '0';
    ELSE
      match23 <= '0';
      over23 <= '0';
      addrb13 <= "000000";
      OUTPUT23 <= '0';
    END IF;
  END IF;
END PROCESS COMPARATOR23;
COMPARATOR24: PROCESS (CLK, STATE, RESET)
BEGIN
  IF RESET = '1' THEN
    OUTPUT24 <= '0';
    over24 <= '0';
    match24 <= '0';
    addrb14 <= "000000";
  ELSIF CLK'EVENT AND CLK = '1' THEN
    IF STATE /= PROCESSING or din_rdy = '0' THEN
      OUTPUT24 <= '0';
      over24 <= '0';
      match24 <= '0';
      addrb14 <= "000000";
    --ELSE
    ELSIF INP(95 DOWNTO 64) = ea(24) THEN
      over24 <= '0';
      match24 <= '1';
      OUTPUT24 <= '1';
      addrb14 <= "011000";
    ELSIF INP(95 DOWNTO 64) = ea(25) THEN
      over24 <= '0';
      match24 <= '1';
      OUTPUT24 <= '1';
      addrb14 <= "011001";
    ELSIF INP(95 DOWNTO 64) = ea(26) THEN
      over24 <= '0';
      match24 <= '1';
      OUTPUT24 <= '1';
      addrb14 <= "011010";
    ELSIF INP(95 DOWNTO 64) = ea(27) THEN
      over24 <= '0';
      match24 <= '1';
      OUTPUT24 <= '1';
      addrb14 <= "011011";
    ELSIF INP(95 DOWNTO 64) = ea(28) THEN
      over24 <= '0';
      match24 <= '1';
      OUTPUT24 <= '1';
      addrb14 <= "011100";
    ELSIF INP(95 DOWNTO 64) = ea(29) THEN
      over24 <= '0';
      match24 <= '1';
      OUTPUT24 <= '1';
      addrb14 <= "011101";
    ELSIF INP(95 DOWNTO 64) = ea(30) THEN
      over24 <= '0';
      match24 <= '1';
      OUTPUT24 <= '1';
      addrb14 <= "011110";
    ELSIF INP(95 DOWNTO 64) = ea(31) THEN
      over24 <= '0';
      match24 <= '1';
      OUTPUT24 <= '1';
      addrb14 <= "011111";
    ELSIF INP(95 DOWNTO 64) > ea(31) THEN
      --over24 <= '1';
      match24 <= '0';
      OUTPUT24 <= '0';
    ELSE
      match24 <= '0';
      over24 <= '0';
      addrb14 <= "000000";
      OUTPUT24 <= '0';
    END IF;
  END IF;
END PROCESS COMPARATOR24;
MUXER2: PROCESS (CLK, RESET)
  variable match2a : std_logic;
BEGIN
  IF RESET = '1' THEN
    --over2 <= '0';
    match2x <= '0';
    addrb1x <= (OTHERS => '0');
    OUTPUT2 <= (OTHERS => '0');
    match2a := '0';
  ELSIF CLK'EVENT AND CLK = '1' THEN
    --over2 <= over21 OR over22 OR over23 OR over24;
    match2a := match21 OR match22 OR match23 OR match24;
    match2x <= match2a;
    OUTPUT2(Acount1) <= OUTPUT21 OR OUTPUT22 OR OUTPUT23 OR OUTPUT24;
    IF match21 = '1' THEN
      addrb1x <= addrb11;
    ELSIF match22 = '1' THEN
      addrb1x <= addrb12;
    ELSIF match23 = '1' THEN
      addrb1x <= addrb13;
    ELSIF match24 = '1' THEN
      addrb1x <= addrb14;
    ELSE
      addrb1x <= (OTHERS => '0');
    END IF;
  END IF;
END PROCESS MUXER2;
COMPARATOR31: PROCESS (CLK, STATE, RESET)
BEGIN
  IF RESET = '1' THEN
    OUTPUT31 <= '0';
    over31 <= '0';
    match31 <= '0';
    addra21 <= "000000";
  ELSIF CLK'EVENT AND CLK = '1' THEN
    IF STATE /= PROCESSING or din_rdy = '0' THEN
      OUTPUT31 <= '0';
      over31 <= '0';
      match31 <= '0';
      addra21 <= "000000";
    --ELSE
    ELSIF INP(63 DOWNTO 32) = ea(0) THEN
      over31 <= '0';
      match31 <= '1';
      OUTPUT31 <= '1';
      addra21 <= "000000";
    ELSIF INP(63 DOWNTO 32) = ea(1) THEN
      over31 <= '0';
      match31 <= '1';
      OUTPUT31 <= '1';
      addra21 <= "000001";
    ELSIF INP(63 DOWNTO 32) = ea(2) THEN
      over31 <= '0';
      match31 <= '1';
      OUTPUT31 <= '1';
      addra21 <= "000010";
    ELSIF INP(63 DOWNTO 32) = ea(3) THEN
      over31 <= '0';
      match31 <= '1';
      OUTPUT31 <= '1';
      addra21 <= "000011";
    ELSIF INP(63 DOWNTO 32) = ea(4) THEN
      over31 <= '0';
      match31 <= '1';
      OUTPUT31 <= '1';
      addra21 <= "000100";
    ELSIF INP(63 DOWNTO 32) = ea(5) THEN
      over31 <= '0';
      match31 <= '1';
      OUTPUT31 <= '1';
      addra21 <= "000101";
    ELSIF INP(63 DOWNTO 32) = ea(6) THEN
      over31 <= '0';
      match31 <= '1';
      OUTPUT31 <= '1';
      addra21 <= "000110";
    ELSIF INP(63 DOWNTO 32) = ea(7) THEN
      over31 <= '0';
      match31 <= '1';
      OUTPUT31 <= '1';
      addra21 <= "000111";
    ELSIF INP(63 DOWNTO 32) > ea(63) THEN
      over31 <= '1';
      match31 <= '0';
      OUTPUT31 <= '0';
    ELSE
      match31 <= '0';
      over31 <= '0';
      addra21 <= "000000";
      OUTPUT31 <= '0';
    END IF;
  END IF;
END PROCESS COMPARATOR31;
COMPARATOR32: PROCESS (CLK, STATE, RESET)
BEGIN
  IF RESET = '1' THEN
    OUTPUT32 <= '0';
    over32 <= '0';
    match32 <= '0';
    addra22 <= "000000";
  ELSIF CLK'EVENT AND CLK = '1' THEN
    IF STATE /= PROCESSING or din_rdy = '0' THEN
      OUTPUT32 <= '0';
      over32 <= '0';
      match32 <= '0';
      addra22 <= "000000";
    --ELSE
    ELSIF INP(63 DOWNTO 32) = ea(8) THEN
      over32 <= '0';
      match32 <= '1';
      OUTPUT32 <= '1';
      addra22 <= "001000";
    ELSIF INP(63 DOWNTO 32) = ea(9) THEN
      over32 <= '0';
      match32 <= '1';
      OUTPUT32 <= '1';
      addra22 <= "001001";
    ELSIF INP(63 DOWNTO 32) = ea(10) THEN
      over32 <= '0';
      match32 <= '1';
      OUTPUT32 <= '1';
      addra22 <= "001010";
    ELSIF INP(63 DOWNTO 32) = ea(11) THEN
      over32 <= '0';
      match32 <= '1';
      OUTPUT32 <= '1';
      addra22 <= "001011";
    ELSIF INP(63 DOWNTO 32) = ea(12) THEN
      over32 <= '0';
      match32 <= '1';
      OUTPUT32 <= '1';
      addra22 <= "001100";
    ELSIF INP(63 DOWNTO 32) = ea(13) THEN
      over32 <= '0';
      match32 <= '1';
      OUTPUT32 <= '1';
      addra22 <= "001101";
    ELSIF INP(63 DOWNTO 32) = ea(14) THEN
      over32 <= '0';
      match32 <= '1';
      OUTPUT32 <= '1';
      addra22 <= "001110";
    ELSIF INP(63 DOWNTO 32) = ea(15) THEN
      over32 <= '0';
      match32 <= '1';
      OUTPUT32 <= '1';
      addra22 <= "001111";
    ELSIF INP(63 DOWNTO 32) > ea(63) THEN
      over32 <= '1';
      match32 <= '0';
      OUTPUT32 <= '0';
    ELSE
      match32 <= '0';
      over32 <= '0';
      addra22 <= "000000";
      OUTPUT32 <= '0';
    END IF;
  END IF;
END PROCESS COMPARATOR32;
COMPARATOR33: PROCESS (CLK, STATE, RESET)
BEGIN
  IF RESET = '1' THEN
    OUTPUT33 <= '0';
    over33 <= '0';
    match33 <= '0';
    addra23 <= "000000";
  ELSIF CLK'EVENT AND CLK = '1' THEN
    IF STATE /= PROCESSING or din_rdy = '0' THEN
      OUTPUT33 <= '0';
      over33 <= '0';
      match33 <= '0';
      addra23 <= "000000";
    --ELSE
    ELSIF INP(63 DOWNTO 32) = ea(16) THEN
      over33 <= '0';
      match33 <= '1';
      OUTPUT33 <= '1';
      addra23 <= "010000";
    ELSIF INP(63 DOWNTO 32) = ea(17) THEN
      over33 <= '0';
      match33 <= '1';
      OUTPUT33 <= '1';
      addra23 <= "010001";
    ELSIF INP(63 DOWNTO 32) = ea(18) THEN
      over33 <= '0';
      match33 <= '1';
      OUTPUT33 <= '1';
      addra23 <= "010010";
    ELSIF INP(63 DOWNTO 32) = ea(19) THEN
      over33 <= '0';
      match33 <= '1';
      OUTPUT33 <= '1';
      addra23 <= "010011";
    ELSIF INP(63 DOWNTO 32) = ea(20) THEN
      over33 <= '0';
      match33 <= '1';
      OUTPUT33 <= '1';
      addra23 <= "010100";
    ELSIF INP(63 DOWNTO 32) = ea(21) THEN
      over33 <= '0';
      match33 <= '1';
      OUTPUT33 <= '1';
      addra23 <= "010101";
    ELSIF INP(63 DOWNTO 32) = ea(22) THEN
      over33 <= '0';
      match33 <= '1';
      OUTPUT33 <= '1';
      addra23 <= "010110";
    ELSIF INP(63 DOWNTO 32) = ea(23) THEN
      over33 <= '0';
      match33 <= '1';
      OUTPUT33 <= '1';
      addra23 <= "010111";
    ELSIF INP(63 DOWNTO 32) > ea(63) THEN
      over33 <= '1';
      match33 <= '0';
      OUTPUT33 <= '0';
    ELSE
      match33 <= '0';
      over33 <= '0';
      addra23 <= "000000";
      OUTPUT33 <= '0';
    END IF;
  END IF;
END PROCESS COMPARATOR33;
COMPARATOR34: PROCESS (CLK, STATE, RESET)
BEGIN
  IF RESET = '1' THEN
    OUTPUT34 <= '0';
    over34 <= '0';
    match34 <= '0';
    addra24 <= "000000";
  ELSIF CLK'EVENT AND CLK = '1' THEN
    IF STATE /= PROCESSING or din_rdy = '0' THEN
      OUTPUT34 <= '0';
      over34 <= '0';
      match34 <= '0';
      addra24 <= "000000";
    --ELSE
    ELSIF INP(63 DOWNTO 32) = ea(24) THEN
      over34 <= '0';
      match34 <= '1';
      OUTPUT34 <= '1';
      addra24 <= "011000";
    ELSIF INP(63 DOWNTO 32) = ea(25) THEN
      over34 <= '0';
      match34 <= '1';
      OUTPUT34 <= '1';
      addra24 <= "011001";
    ELSIF INP(63 DOWNTO 32) = ea(26) THEN
      over34 <= '0';
      match34 <= '1';
      OUTPUT34 <= '1';
      addra24 <= "011010";
    ELSIF INP(63 DOWNTO 32) = ea(27) THEN
      over34 <= '0';
      match34 <= '1';
      OUTPUT34 <= '1';
      addra24 <= "011011";
    ELSIF INP(63 DOWNTO 32) = ea(28) THEN
      over34 <= '0';
      match34 <= '1';
      OUTPUT34 <= '1';
      addra24 <= "011100";
    ELSIF INP(63 DOWNTO 32) = ea(29) THEN
      over34 <= '0';
      match34 <= '1';
      OUTPUT34 <= '1';
      addra24 <= "011101";
    ELSIF INP(63 DOWNTO 32) = ea(30) THEN
      over34 <= '0';
      match34 <= '1';
      OUTPUT34 <= '1';
      addra24 <= "011110";
    ELSIF INP(63 DOWNTO 32) = ea(31) THEN
      over34 <= '0';
      match34 <= '1';
      OUTPUT34 <= '1';
      addra24 <= "011111";
    ELSIF INP(63 DOWNTO 32) > ea(31) THEN
      --over34 <= '1';
      match34 <= '0';
      OUTPUT34 <= '0';
    ELSE
      match34 <= '0';
      over34 <= '0';
      addra24 <= "000000";
      OUTPUT34 <= '0';
    END IF;
  END IF;
END PROCESS COMPARATOR34;
MUXER3: PROCESS (CLK, RESET)
  variable match3a : std_logic;
BEGIN
  IF RESET = '1' THEN
    --over3 <= '0';
    match3x <= '0';
    addra2x <= (OTHERS => '0');
    OUTPUT3 <= (OTHERS => '0');
    match3a := '0';
  ELSIF CLK'EVENT AND CLK = '1' THEN
    --over3 <= over31 OR over32 OR over33 OR over34;
    match3a := match31 OR match32 OR match33 OR match34;
    match3x <= match3a;
    OUTPUT3(Acount1) <= OUTPUT31 OR OUTPUT32 OR OUTPUT33 OR OUTPUT34;
    IF match31 = '1' THEN
      addra2x <= addra21;
    ELSIF match32 = '1' THEN
      addra2x <= addra22;
    ELSIF match33 = '1' THEN
      addra2x <= addra23;
    ELSIF match34 = '1' THEN
      addra2x <= addra24;
    ELSE
      addra2x <= (OTHERS => '0');
    END IF;
  END IF;
END PROCESS MUXER3;
COMPARATOR41: PROCESS (CLK, STATE, RESET)
BEGIN
  IF RESET = '1' THEN
    OUTPUT41 <= '0';
    over41 <= '0';
    match41 <= '0';
    addrb21 <= "000000";
  ELSIF CLK'EVENT AND CLK = '1' THEN
    IF STATE /= PROCESSING or din_rdy = '0' THEN
      OUTPUT41 <= '0';
      over41 <= '0';
      match41 <= '0';
      addrb21 <= "000000";
    --ELSE
    ELSIF INP(31 DOWNTO 0) = ea(0) THEN
      over41 <= '0';
      match41 <= '1';
      OUTPUT41 <= '1';
      addrb21 <= "000000";
    ELSIF INP(31 DOWNTO 0) = ea(1) THEN
      over41 <= '0';
      match41 <= '1';
      OUTPUT41 <= '1';
      addrb21 <= "000001";
    ELSIF INP(31 DOWNTO 0) = ea(2) THEN
      over41 <= '0';
      match41 <= '1';
      OUTPUT41 <= '1';
      addrb21 <= "000010";
    ELSIF INP(31 DOWNTO 0) = ea(3) THEN
      over41 <= '0';
      match41 <= '1';
      OUTPUT41 <= '1';
      addrb21 <= "000011";
    ELSIF INP(31 DOWNTO 0) = ea(4) THEN
      over41 <= '0';
      match41 <= '1';
      OUTPUT41 <= '1';
      addrb21 <= "000100";
    ELSIF INP(31 DOWNTO 0) = ea(5) THEN
      over41 <= '0';
      match41 <= '1';
      OUTPUT41 <= '1';
      addrb21 <= "000101";
    ELSIF INP(31 DOWNTO 0) = ea(6) THEN
      over41 <= '0';
      match41 <= '1';
      OUTPUT41 <= '1';
      addrb21 <= "000110";
    ELSIF INP(31 DOWNTO 0) = ea(7) THEN
      over41 <= '0';
      match41 <= '1';
      OUTPUT41 <= '1';
      addrb21 <= "000111";
    ELSIF INP(31 DOWNTO 0) > ea(63) THEN
      over41 <= '1';
      match41 <= '0';
      OUTPUT41 <= '0';
    ELSE
      match41 <= '0';
      over41 <= '0';
      addrb21 <= "000000";
      OUTPUT41 <= '0';
    END IF;
  END IF;
END PROCESS COMPARATOR41;
COMPARATOR42: PROCESS (CLK, STATE, RESET)
BEGIN
  IF RESET = '1' THEN
    OUTPUT42 <= '0';
    over42 <= '0';
    match42 <= '0';
    addrb22 <= "000000";
  ELSIF CLK'EVENT AND CLK = '1' THEN
    IF STATE /= PROCESSING or din_rdy = '0' THEN
      OUTPUT42 <= '0';
      over42 <= '0';
      match42 <= '0';
      addrb22 <= "000000";
    --ELSE
    ELSIF INP(31 DOWNTO 0) = ea(8) THEN
      over42 <= '0';
      match42 <= '1';
      OUTPUT42 <= '1';
      addrb22 <= "001000";
    ELSIF INP(31 DOWNTO 0) = ea(9) THEN
      over42 <= '0';
      match42 <= '1';
      OUTPUT42 <= '1';
      addrb22 <= "001001";
    ELSIF INP(31 DOWNTO 0) = ea(10) THEN
      over42 <= '0';
      match42 <= '1';
      OUTPUT42 <= '1';
      addrb22 <= "001010";
    ELSIF INP(31 DOWNTO 0) = ea(11) THEN
      over42 <= '0';
      match42 <= '1';
      OUTPUT42 <= '1';
      addrb22 <= "001011";
    ELSIF INP(31 DOWNTO 0) = ea(12) THEN
      over42 <= '0';
      match42 <= '1';
      OUTPUT42 <= '1';
      addrb22 <= "001100";
    ELSIF INP(31 DOWNTO 0) = ea(13) THEN
      over42 <= '0';
      match42 <= '1';
      OUTPUT42 <= '1';
      addrb22 <= "001101";
    ELSIF INP(31 DOWNTO 0) = ea(14) THEN
      over42 <= '0';
      match42 <= '1';
      OUTPUT42 <= '1';
      addrb22 <= "001110";
    ELSIF INP(31 DOWNTO 0) = ea(15) THEN
      over42 <= '0';
      match42 <= '1';
      OUTPUT42 <= '1';
      addrb22 <= "001111";
    ELSIF INP(31 DOWNTO 0) > ea(63) THEN
      over42 <= '1';
      match42 <= '0';
      OUTPUT42 <= '0';
    ELSE
      match42 <= '0';
      over42 <= '0';
      addrb22 <= "000000";
      OUTPUT42 <= '0';
    END IF;
  END IF;
END PROCESS COMPARATOR42;
COMPARATOR43: PROCESS (CLK, STATE, RESET)
BEGIN
  IF RESET = '1' THEN
    OUTPUT43 <= '0';
    over43 <= '0';
    match43 <= '0';
    addrb23 <= "000000";
  ELSIF CLK'EVENT AND CLK = '1' THEN
    IF STATE /= PROCESSING or din_rdy = '0' THEN
      OUTPUT43 <= '0';
      over43 <= '0';
      match43 <= '0';
      addrb23 <= "000000";
    --ELSE
    ELSIF INP(31 DOWNTO 0) = ea(16) THEN
      over43 <= '0';
      match43 <= '1';
      OUTPUT43 <= '1';
      addrb23 <= "010000";
    ELSIF INP(31 DOWNTO 0) = ea(17) THEN
      over43 <= '0';
      match43 <= '1';
      OUTPUT43 <= '1';
      addrb23 <= "010001";
    ELSIF INP(31 DOWNTO 0) = ea(18) THEN
      over43 <= '0';
      match43 <= '1';
      OUTPUT43 <= '1';
      addrb23 <= "010010";
    ELSIF INP(31 DOWNTO 0) = ea(19) THEN
      over43 <= '0';
      match43 <= '1';
      OUTPUT43 <= '1';
      addrb23 <= "010011";
    ELSIF INP(31 DOWNTO 0) = ea(20) THEN
      over43 <= '0';
      match43 <= '1';
      OUTPUT43 <= '1';
      addrb23 <= "010100";
    ELSIF INP(31 DOWNTO 0) = ea(21) THEN
      over43 <= '0';
      match43 <= '1';
      OUTPUT43 <= '1';
      addrb23 <= "010101";
    ELSIF INP(31 DOWNTO 0) = ea(22) THEN
      over43 <= '0';
      match43 <= '1';
      OUTPUT43 <= '1';
      addrb23 <= "010110";
    ELSIF INP(31 DOWNTO 0) = ea(23) THEN
      over43 <= '0';
      match43 <= '1';
      OUTPUT43 <= '1';
      addrb23 <= "010111";
    ELSIF INP(31 DOWNTO 0) > ea(63) THEN
      over43 <= '1';
      match43 <= '0';
      OUTPUT43 <= '0';
    ELSE
      match43 <= '0';
      over43 <= '0';
      addrb23 <= "000000";
      OUTPUT43 <= '0';
    END IF;
  END IF;
END PROCESS COMPARATOR43;
COMPARATOR44: PROCESS (CLK, STATE, RESET)
BEGIN
  IF RESET = '1' THEN
    OUTPUT44 <= '0';
    over44 <= '0';
    match44 <= '0';
    addrb24 <= "000000";
  ELSIF CLK'EVENT AND CLK = '1' THEN
    IF STATE /= PROCESSING or din_rdy = '0' THEN
      OUTPUT44 <= '0';
      over44 <= '0';
      match44 <= '0';
      addrb24 <= "000000";
    --ELSE
    ELSIF INP(31 DOWNTO 0) = ea(24) THEN
      over44 <= '0';
      match44 <= '1';
      OUTPUT44 <= '1';
      addrb24 <= "011000";
    ELSIF INP(31 DOWNTO 0) = ea(25) THEN
      over44 <= '0';
      match44 <= '1';
      OUTPUT44 <= '1';
      addrb24 <= "011001";
    ELSIF INP(31 DOWNTO 0) = ea(26) THEN
      over44 <= '0';
      match44 <= '1';
      OUTPUT44 <= '1';
      addrb24 <= "011010";
    ELSIF INP(31 DOWNTO 0) = ea(27) THEN
      over44 <= '0';
      match44 <= '1';
      OUTPUT44 <= '1';
      addrb24 <= "011011";
    ELSIF INP(31 DOWNTO 0) = ea(28) THEN
      over44 <= '0';
      match44 <= '1';
      OUTPUT44 <= '1';
      addrb24 <= "011100";
    ELSIF INP(31 DOWNTO 0) = ea(29) THEN
      over44 <= '0';
      match44 <= '1';
      OUTPUT44 <= '1';
      addrb24 <= "011101";
    ELSIF INP(31 DOWNTO 0) = ea(30) THEN
      over44 <= '0';
      match44 <= '1';
      OUTPUT44 <= '1';
      addrb24 <= "011110";
    ELSIF INP(31 DOWNTO 0) = ea(31) THEN
      over44 <= '0';
      match44 <= '1';
      OUTPUT44 <= '1';
      addrb24 <= "011111";
    ELSIF INP(31 DOWNTO 0) > ea(31) THEN
      over44 <= '1';
      match44 <= '0';
      OUTPUT44 <= '0';
    ELSE
      match44 <= '0';
      over44 <= '0';
      addrb24 <= "000000";
      OUTPUT44 <= '0';
    END IF;
  END IF;
END PROCESS COMPARATOR44;
MUXER4: PROCESS (CLK, RESET)
  variable match4a : std_logic;
BEGIN
  IF RESET = '1' THEN
    over4 <= '0';
    match4x <= '0';
    addrb2x <= (OTHERS => '0');
    OUTPUT4 <= (OTHERS => '0');
    match4a := '0';
  ELSIF CLK'EVENT AND CLK = '1' THEN
    --over4 <= over41 OR over42 OR over43 OR over44;
    over4 <= over44;
    match4a := match41 OR match42 OR match43 OR match44;
    match4x <= match4a;
    OUTPUT4(Acount1) <= OUTPUT41 OR OUTPUT42 OR OUTPUT43 OR OUTPUT44;
    IF match41 = '1' THEN
      addrb2x <= addrb21;
    ELSIF match42 = '1' THEN
      addrb2x <= addrb22;
    ELSIF match43 = '1' THEN
      addrb2x <= addrb23;
    ELSIF match44 = '1' THEN
      addrb2x <= addrb24;
    ELSE
      addrb2x <= (OTHERS => '0');
    END IF;
  END IF;
END PROCESS MUXER4;
LAST_CMP_MTCH: PROCESS (CLK, RESET)
BEGIN
  IF RESET = '1' THEN
    match44x <= '0';
  ELSIF CLK'EVENT AND CLK = '1' THEN
    IF Acount1 = 0 THEN
      match44x <= match44;
    ELSE
      match44x <= '0';
    END IF;
  END IF;
END PROCESS LAST_CMP_MTCH;
STATE_DELAY: PROCESS (CLK, RESET)
BEGIN
  IF RESET = '1' THEN
    match1 <= '0';
    match2 <= '0';
    match3 <= '0';
    match4 <= '0';
  ELSIF CLK'EVENT AND CLK = '1' THEN
    match1 <= match1x;
    match2 <= match2x;
    match3 <= match3x;
    match4 <= match4x;
  END IF;
END PROCESS STATE_DELAY;
process (clk, reset)
begin
  if reset = '1' then
    wr_enx <= (OTHERS => '0');
    buffin(1) <= (OTHERS => '0');
    buffin(2) <= (OTHERS => '0');
    buffin(3) <= (OTHERS => '0');
    buffin(4) <= (OTHERS => '0');
  elsif clk'event and clk = '1' then
    if match1 = '1' then
      wr_enx(ptr1) <= '1';
      buffin(ptr1) <= douta1;
      if match2 = '1' then
        wr_enx(ptr2) <= '1';
        buffin(ptr2) <= doutb1;
        if match3 = '1' then
          wr_enx(ptr3) <= '1';
          buffin(ptr3) <= douta2;
          if match4 = '1' then
            --4 matches
            wr_enx(ptr4) <= '1';
            buffin(ptr4) <= doutb2;
          else
            --3 matches
            wr_enx(ptr4) <= '0';
            buffin(ptr4) <= (OTHERS => '0');
          end if;
        else
          if match4 = '1' then
            --3 matches
            wr_enx(ptr3) <= '1';
            buffin(ptr3) <= doutb2;
            buffin(ptr4) <= (OTHERS => '0');
            wr_enx(ptr4) <= '0';
          else
            --2 matches
            wr_enx(ptr3) <= '0';
            wr_enx(ptr4) <= '0';
            buffin(ptr3) <= (OTHERS => '0');
            buffin(ptr4) <= (OTHERS => '0');
          end if;
        end if;
      else
        if match3 = '1' then
          wr_enx(ptr2) <= '1';
          buffin(ptr2) <= douta2;
          if match4 = '1' then
            --3 matches
            wr_enx(ptr3) <= '1';
            wr_enx(ptr4) <= '0';
            buffin(ptr3) <= doutb2;
            buffin(ptr4) <= (OTHERS => '0');
          else
            --2 matches
            wr_enx(ptr3) <= '0';
            wr_enx(ptr4) <= '0';
            buffin(ptr3) <= (OTHERS => '0');
            buffin(ptr4) <= (OTHERS => '0');
          end if;
        else
          if match4 = '1' then
            --2 matches
            wr_enx(ptr2) <= '1';
            wr_enx(ptr3) <= '0';
            wr_enx(ptr4) <= '0';
            buffin(ptr2) <= doutb2;
            buffin(ptr3) <= (OTHERS => '0');
            buffin(ptr4) <= (OTHERS => '0');
          else
            --1 match
            wr_enx(ptr2) <= '0';
            wr_enx(ptr3) <= '0';
            wr_enx(ptr4) <= '0';
            buffin(ptr2) <= (OTHERS => '0');
            buffin(ptr3) <= (OTHERS => '0');
            buffin(ptr4) <= (OTHERS => '0');
          end if;
        end if;
      end if;
    else
      if match2 = '1' then
        wr_enx(ptr1) <= '1';
        buffin(ptr1) <= doutb1;
        if match3 = '1' then
          wr_enx(ptr2) <= '1';
          buffin(ptr2) <= douta2;
          if match4 = '1' then
            --3 matches
            wr_enx(ptr3) <= '1';
            wr_enx(ptr4) <= '0';
            buffin(ptr3) <= doutb2;
            buffin(ptr4) <= (OTHERS => '0');
          else
            --2 matches
            wr_enx(ptr3) <= '0';
            wr_enx(ptr4) <= '0';
            buffin(ptr3) <= (OTHERS => '0');
            buffin(ptr4) <= (OTHERS => '0');
          end if;
        else
          if match4 = '1' then
            --2 matches
            wr_enx(ptr2) <= '1';
            wr_enx(ptr3) <= '0';
            wr_enx(ptr4) <= '0';
            buffin(ptr2) <= doutb2;
            buffin(ptr3) <= (OTHERS => '0');
            buffin(ptr4) <= (OTHERS => '0');
          else
            --1 match
            wr_enx(ptr2) <= '0';
            wr_enx(ptr3) <= '0';
            wr_enx(ptr4) <= '0';
            buffin(ptr2) <= (OTHERS => '0');
            buffin(ptr3) <= (OTHERS => '0');
            buffin(ptr4) <= (OTHERS => '0');
          end if;
        end if;
      else
        if match3 = '1' then
          wr_enx(ptr1) <= '1';
          buffin(ptr1) <= douta2;
          if match4 = '1' then
            --2 matches
            wr_enx(ptr2) <= '1';
            wr_enx(ptr3) <= '0';
            wr_enx(ptr4) <= '0';
            buffin(ptr2) <= doutb2;
            buffin(ptr3) <= (OTHERS => '0');
            buffin(ptr4) <= (OTHERS => '0');
          else
            --1 match
            wr_enx(ptr2) <= '0';
            wr_enx(ptr3) <= '0';
            wr_enx(ptr4) <= '0';
            buffin(ptr2) <= (OTHERS => '0');
            buffin(ptr3) <= (OTHERS => '0');
            buffin(ptr4) <= (OTHERS => '0');
          end if;
        else
          if match4 = '1' then
            --1 match
            wr_enx(ptr1) <= '1';
            wr_enx(ptr2) <= '0';
            wr_enx(ptr3) <= '0';
            wr_enx(ptr4) <= '0';
            buffin(ptr1) <= doutb2;
            buffin(ptr2) <= (OTHERS => '0');
            buffin(ptr3) <= (OTHERS => '0');
            buffin(ptr4) <= (OTHERS => '0');
          else
            --0 matches
            wr_enx(ptr1) <= '0';
            wr_enx(ptr2) <= '0';
            wr_enx(ptr3) <= '0';
            wr_enx(ptr4) <= '0';
            buffin(ptr1) <= (OTHERS => '0');
            buffin(ptr2) <= (OTHERS => '0');
            buffin(ptr3) <= (OTHERS => '0');
            buffin(ptr4) <= (OTHERS => '0');
          end if;
        end if;
      end if;
    end if;
  end if;
end process;
process (clk, reset)
  variable g, h, j, f : std_logic;
begin
  if reset = '1' then
    ptr1 <= 1;
  elsif clk'event and clk = '1' then
    --mx1a := match1 NOR match2
    --mx1b := match1 NOR match2
    --mx1c := match1 NOR match3
    --mx1d := match2 NOR match3
    --match1 AND match2
    --match1 AND match3
    --match1 AND match4
    --match2 AND match3
    --match2 AND match4
    --match1 AND match2 AND match4;
    --match1 AND match3 AND match4;
    --match2 AND match3 AND match4;
    --j := g NOR h;
    --j NOR (g, h);
    --j := NOR3(g, h, f);
    --j := g NOR h NOR f;
    if STATE_DEL = REPORTx then
      ptr1 <= 1;
    elsif (match1 AND match2 AND match3 AND match4) = '1' then
      --4 match
      --do nothing
      ptr1 <= ptr1;
    elsif ((match1 AND match2 AND match3) OR (match1 AND match2 AND match4)
           OR (match1 AND match3 AND match4) OR (match2 AND match3 AND match4)) = '1' then
      --3 match
      if ptr1 = 2 then
        ptr1 <= 1;
      elsif ptr1 = 3 then
        ptr1 <= 2;
      elsif ptr1 = 4 then
        ptr1 <= 3;
      else
        ptr1 <= 4;
      end if;
    elsif ((match1 AND match2) OR (match1 AND match3) OR (match1 AND match4)
           OR (match2 AND match3) OR (match2 AND match4) OR (match3 AND match4)) = '1' then
      --2 match
      if ptr1 = 3 then
        ptr1 <= 1;
      elsif ptr1 = 4 then
        ptr1 <= 2;
      elsif ptr1 = 2 then
        ptr1 <= 4;
      else
        ptr1 <= 3;
      end if;
    elsif (match1 OR match2 OR match3 OR match4) = '0' then
      --0 matches do nothing
    else
      --1 match
      if ptr1 = 4 then
        ptr1 <= 1;
      elsif ptr1 = 3 then
        ptr1 <= 4;
      elsif ptr1 = 2 then
        ptr1 <= 3;
      else
        ptr1 <= 2;
      end if;
    end if;
  end if;
end process;
process(clk, reset)
begin
  if reset = '1' then
    ptr2 <= 2;
  elsif clk'event and clk='1' then
    if STATE_DEL = REPORTx then
      ptr2 <= 2;
    elsif (match1 AND match2 AND match3 AND match4) = '1' then
      --4 match
      --do nothing
      ptr2 <= ptr2;
    elsif ((match1 AND match2 AND match3) OR (match1 AND match2 AND match4) OR
           (match1 AND match3 AND match4) OR (match2 AND match3 AND match4)) = '1' then
      --3 match
      if ptr2 = 2 then
        ptr2 <= 1;
      elsif ptr2 = 3 then
        ptr2 <= 2;
      elsif ptr2 = 4 then
        ptr2 <= 3;
      else
        ptr2 <= 4;
      end if;
    elsif ((match1 AND match2) OR (match1 AND match3) OR (match1 AND match4) OR
           (match2 AND match3) OR (match2 AND match4) OR (match3 AND match4)) = '1' then
      --2 match
      if ptr2 = 3 then
        ptr2 <= 1;
      elsif ptr2 = 4 then
        ptr2 <= 2;
      elsif ptr2 = 1 then
        ptr2 <= 3;
      else
        ptr2 <= 4;
      end if;
    elsif (match1 OR match2 OR match3 OR match4) = '0' then
      --0 matches do nothing
    else
      --1 match
      if ptr2 = 4 then
        ptr2 <= 1;
      elsif ptr2 = 3 then
        ptr2 <= 4;
      elsif ptr2 = 2 then
        ptr2 <= 3;
      else
        ptr2 <= 2;
      end if;
    end if;
  end if;
end process;
process(clk, reset)
begin
  if reset = '1' then
    ptr3 <= 3;
  elsif clk'event and clk='1' then
    if STATE_DEL = REPORTx then
      ptr3 <= 3;
    elsif (match1 AND match2 AND match3 AND match4) = '1' then
      --4 match
      --do nothing
      ptr3 <= ptr3;
    elsif ((match1 AND match2 AND match3) OR (match1 AND match2 AND match4) OR
           (match1 AND match3 AND match4) OR (match2 AND match3 AND match4)) = '1' then
      --3 match
      if ptr3 = 2 then
        ptr3 <= 1;
      elsif ptr3 = 3 then
        ptr3 <= 2;
      elsif ptr3 = 4 then
        ptr3 <= 3;
      else
        ptr3 <= 4;
      end if;
    elsif ((match1 AND match2) OR (match1 AND match3) OR (match1 AND match4) OR
           (match2 AND match3) OR (match2 AND match4) OR (match3 AND match4)) = '1' then
      --2 match
      if ptr3 = 3 then
        ptr3 <= 1;
      elsif ptr3 = 2 then
        ptr3 <= 4;
      elsif ptr3 = 1 then
        ptr3 <= 3;
      else
        ptr3 <= 2;
      end if;
    elsif (match1 OR match2 OR match3 OR match4) = '0' then
      --0 matches do nothing
    else
      --1 match
      if ptr3 = 4 then
        ptr3 <= 1;
      elsif ptr3 = 3 then
        ptr3 <= 4;
      elsif ptr3 = 2 then
        ptr3 <= 3;
      else
        ptr3 <= 2;
      end if;
    end if;
  end if;
end process;
process(clk, reset)
begin
  if reset = '1' then
    ptr4 <= 4;
  elsif clk'event and clk='1' then
    if STATE_DEL = REPORTx then
      ptr4 <= 4;
    elsif (match1 AND match2 AND match3 AND match4) = '1' then
      --4 match
      --do nothing
      ptr4 <= ptr4;
    elsif ((match1 AND match2 AND match3) OR (match1 AND match2 AND match4) OR
           (match1 AND match3 AND match4) OR (match2 AND match3 AND match4)) = '1' then
      --3 match
      if ptr4 = 2 then
        ptr4 <= 1;
      elsif ptr4 = 3 then
        ptr4 <= 2;
      elsif ptr4 = 4 then
        ptr4 <= 3;
      else
        ptr4 <= 4;
      end if;
    elsif ((match1 AND match2) OR (match1 AND match3) OR (match1 AND match4) OR
           (match2 AND match3) OR (match2 AND match4) OR (match3 AND match4)) = '1' then
      --2 match
      if ptr4 = 3 then
        ptr4 <= 1;
      elsif ptr4 = 2 then
        ptr4 <= 4;
      elsif ptr4 = 1 then
        ptr4 <= 3;
      else
        ptr4 <= 2;
      end if;
    elsif (match1 OR match2 OR match3 OR match4) = '0' then
      --0 matches do nothing
    else
      --1 match
      if ptr4 = 4 then
        ptr4 <= 1;
      elsif ptr4 = 3 then
        ptr4 <= 4;
      elsif ptr4 = 2 then
        ptr4 <= 3;
      else
        ptr4 <= 2;
      end if;
    end if;
  end if;
end process;
process(clk, reset)
begin
  if reset = '1' then
    sm1a <= '0';
    sm1b <= '0';
    --sm1 <= '0';
    side1 <= '0';
    side1a <= '0';
    side1b <= '0';
    Aa <= (OTHERS=>'0');
    Ab <= (OTHERS=>'0');
    Mult_in1A <= (OTHERS=>'0');
    rd_en1 <= '0';
    rd_en3 <= '0';
  elsif clk'event and clk='1' then
    if STATE = REPORTx THEN
      side1 <= '0';
    elsif din_rdy='1' and STATE_DEL=MACN AND
          INP /= "00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000" then
      if side1 = '0' then
        if empty1='0' then
          sm1a <= '1';
          rd_en1 <= '1';
          rd_en3 <= '0';
          Aa <= INP(127 DOWNTO 64);
          side1 <= '1';
        elsif empty3='0' then
          sm1a <= '1';
          rd_en1 <= '0';
          rd_en3 <= '1';
          Aa <= INP(127 DOWNTO 64);
          side1 <= '0';
        else
          sm1a <= '0';
          rd_en1 <= '0';
          rd_en3 <= '0';
          Aa <= (OTHERS=>'0');
          side1 <= '0';
        end if;
      else
        if empty3='0' then
          sm1a <= '1';
          rd_en1 <= '0';
          rd_en3 <= '1';
          Aa <= INP(127 DOWNTO 64);
          side1 <= '0';
        elsif empty1='0' then
          sm1a <= '1';
          rd_en1 <= '1';
          rd_en3 <= '0';
          Aa <= INP(127 DOWNTO 64);
          side1 <= '0';
        else
          sm1a <= '0';
          rd_en1 <= '0';
          rd_en3 <= '0';
          Aa <= (OTHERS=>'0');
          side1 <= '0';
        end if;
      end if;
    else
      sm1a <= '0';
      rd_en1 <= '0';
      rd_en3 <= '0';
    end if;
    sm1b <= sm1a;
    --sm1 <= sm1b;
    ----sm1 <= sm1a;
    Ab <= Aa;
    --Mult_in1A <= Ab;
    Mult_in1A <= Aa;
    side1a <= side1;
    side1b <= side1a;
  end if;
end process;
sm1 <= '0' when (rd_err1 OR rd_err3) = '1' else sm1b;
sm2 <= '0' when (rd_err2 OR rd_err4) = '1' else sm2b;
--Mult_in1B <= dout1 when side1a='0' else dout3;
--Mult_in2B <= dout2 when side2a='0' else dout4;
Mult_in1B <= dout1 when side1b='0' else dout3;
Mult_in2B <= dout2 when side2b='0' else dout4;
process(clk, reset)
begin
  if reset = '1' then
    sm2a <= '0';
    sm2b <= '0';
    --sm2 <= '0';
    side2 <= '0';
    side2a <= '0';
    side2b <= '0';
    Xa <= (OTHERS=>'0');
    Xb <= (OTHERS=>'0');
    Mult_in2A <= (OTHERS=>'0');
    rd_en2 <= '0';
    rd_en4 <= '0';
  elsif clk'event and clk='1' then
    if STATE = PROCESSING THEN
      side2 <= '0';
    elsif din_rdy='1' and STATE_DEL=MACN AND
          INP /= "00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000" then
      if side2 = '0' then
        if empty2='0' then
          sm2a <= '1';
          rd_en2 <= '1';
          rd_en4 <= '0';
          Xa <= INP(63 DOWNTO 0);
          side2 <= '1';
        elsif empty4='0' then
          sm2a <= '1';
          rd_en2 <= '0';
          rd_en4 <= '1';
          Xa <= INP(63 DOWNTO 0);
          side2 <= '0';
        else
          sm2a <= '0';
          rd_en2 <= '0';
          rd_en4 <= '0';
          Xa <= (OTHERS=>'0');
          side2 <= '0';
        end if;
      else
        if empty4='0' then
          sm2a <= '1';
          rd_en2 <= '0';
          rd_en4 <= '1';
          Xa <= INP(63 DOWNTO 0);
          side2 <= '0';
        elsif empty2='0' then
          sm2a <= '1';
          rd_en2 <= '1';
          rd_en4 <= '0';
          Xa <= INP(63 DOWNTO 0);
          side2 <= '0';
        else
          sm2a <= '0';
          rd_en2 <= '0';
          rd_en4 <= '0';
          Xa <= (OTHERS=>'0');
          side2 <= '0';
        end if;
      end if;
    else
      sm2a <= '0';
      rd_en2 <= '0';
      rd_en4 <= '0';
    end if;
    sm2b <= sm2a;
    --sm2 <= sm2b;
    ----sm2 <= sm2a;
    Xb <= Xa;
    --Mult_in2A <= Xb;
    Mult_in2A <= Xa;
    side2a <= side2;
    side2b <= side2a;
  end if;
end process;
process(clk, reset)
begin
  if reset = '1' then
    instatus <= 6;
    num_inputs <= '0';
    rd_en <= '0';
    C1 <= (OTHERS=>'0');
    D1 <= (OTHERS=>'0');
    C2 <= (OTHERS=>'0');
  elsif clk'event and clk='1' then
    if num_inputs = '0' then
      if (fm1 = '1' and fm2 = '1') then
        C1 <= mout1;
        D1 <= mout2;
        instatus <= 0;
        num_inputs <= '0';
        rd_en <= '0';
      elsif (fm1 = '1' or fm2 = '1') then
        if fm1 = '1' then
          if fa = '1' then
            C1 <= mout1;
            D1 <= aout;
            rd_en <= '0';
            instatus <= 0;
            num_inputs <= '0';
          elsif empty_out = '0' and NOT (rd_en = '1' and size = 1) then
            rd_en <= '1';
            instatus <= 1;
            C1 <= mout1;
            num_inputs <= '0';
          else
            C2 <= mout1;
            instatus <= 2;
            num_inputs <= '1';
            rd_en <= '0';
          end if;
        else
          if fa = '1' then
            C1 <= mout2;
            D1 <= aout;
            rd_en <= '0';
            instatus <= 0;
            num_inputs <= '0';
          elsif empty_out = '0' and NOT (rd_en = '1' and size = 1) then
            rd_en <= '1';
            instatus <= 1;
            C1 <= mout2;
            num_inputs <= '0';
          else
            C2 <= mout2;
            instatus <= 2;
            num_inputs <= '1';
            rd_en <= '0';
          end if;
        end if;
--empty_out '1' is delayed
elsif fa = '1' and ((rd_en = '1' and size = 1) or empty_out = '1') then
--why go to num_inputs='1' when there's nothing in the buffer
--maybe go and wait for 2 clocks, not no add result emerges
'0' and size
NOT (rd_en
NOT (rd_en
--then write to the buffer
1 then
'0' then
'0' and
'0' and
'0' and
rd_en
C2 <= aout;
instatus <= 2;
num_inputs <= '1';
rd_en <= '0';
elsif fa = '1' and empty_out
rd_en <= '1';
instatus <= 1;
C1 <= aout;
num_inputs <= '0';
elsif fa = '0' and empty_out
rd_en <= '1';
num_inputs <= '1';
instatus <= 3;
elsif fa = '0' and empty_out
'1' and size = 1) and pending = 0 then
rd_en <= '0';
num_inputs <= '0';
instatus <= 6;
C1 <= (OTHERS=>'0');
D1 <= (OTHERS=>'0');
C2 <= (OTHERS=>'0');
elsif fa = '0' and empty_out
'1' and size = 1) then
rd_en <= '1';
num_inputs <= '1';
instatus <= 3;
else
C1 <= (OTHERS=>'0');
D1 <= (OTHERS=>'0');
C2 <= (OTHERS=>'0');
rd_en <= '0';
instatus <= 6;
num_inputs <= '0';
end if;
    else
      if ans_flag = '1' then
        C1 <= (OTHERS=>'0');
        C2 <= (OTHERS=>'0');
        D1 <= (OTHERS=>'0');
        rd_en <= '0';
        num_inputs <= '0';
        instatus <= 0;
      elsif instatus = 2 then
        if fm1 = '1' then
          C1 <= C2;
          D1 <= mout1;
          rd_en <= '0';
          instatus <= 0;
          num_inputs <= '0';
        elsif fm2 = '1' then
          C1 <= C2;
          D1 <= mout2;
          rd_en <= '0';
          instatus <= 0;
          num_inputs <= '0';
        elsif fa = '1' then
          C1 <= C2;
          D1 <= aout;
          rd_en <= '0';
          instatus <= 0;
          num_inputs <= '0';
        --could probably check for rd_en & size as a NOT in the next elsif
        elsif fa = '0' and ((rd_en = '1' and size = 1) or empty_out = '1') then
          rd_en <= '0';
          C2 <= C2;
          instatus <= 5;
          num_inputs <= '1';
        elsif fa = '0' and empty_out = '0' then
          C1 <= C2;
          rd_en <= '1';
          instatus <= 1;
          num_inputs <= '0';
        end if;
      elsif instatus = 3 then
        if fm1 = '1' then
          C1 <= mout1;
          rd_en <= '0';
          instatus <= 7;
          num_inputs <= '0';
        elsif fm2 = '1' then
          C1 <= mout2;
          rd_en <= '0';
          instatus <= 7;
          num_inputs <= '0';
        elsif fa = '1' then
          --C1 <= conv_integer(unsigned(dout_out(31 downto 0)));
          C1 <= aout;
          rd_en <= '0';
          instatus <= 7;
          num_inputs <= '0';
        --could probably check for rd_en & size as a NOT in the next elsif
        elsif fa = '0' and ((rd_en = '1' and size = 1) or empty_out = '1') then
          --this stage is dangerous, why it hasn't screwed anything up yet
          -- I don't know, it's hit 8 times. Redirect to instatus 5 and
          -- have the buffer read back in the data.
          --rd_en <= '0';
          --C2 <= C2;
          --instatus <= instatus;
          num_inputs <= '1';
          rd_en <= '0';
          instatus <= 9;
        elsif fa = '0' and empty_out = '0' then
          --C2 <= conv_integer(unsigned(dout_out(31 downto 0)));
          rd_en <= '1';
          --instatus <= 8;
          num_inputs <= '1';
          instatus <= 4;
        end if;
      elsif instatus = 4 then
        --C1 <= C2;
        C1 <= dout_out;
        instatus <= 7;
        num_inputs <= '0';
        rd_en <= '0';
      elsif instatus = 5 then
        if fm1 = '1' then
          C1 <= C2;
          D1 <= mout1;
          rd_en <= '0';
          instatus <= 0;
          num_inputs <= '0';
        elsif fm2 = '1' then
          C1 <= C2;
          D1 <= mout2;
          rd_en <= '0';
          instatus <= 0;
          num_inputs <= '0';
        elsif fa = '1' then
          C1 <= C2;
          D1 <= aout;
          rd_en <= '0';
          instatus <= 0;
          num_inputs <= '0';
        elsif rd_ack = '1' then
          --rewrite back into buffer
          C2 <= (OTHERS=>'0');
          rd_en <= '0';
          num_inputs <= '0';
          C1 <= (OTHERS=>'0');
          D1 <= (OTHERS=>'0');
          instatus <= 6;
        else
          --write C2 to buffer and clear it
          C2 <= (OTHERS=>'0');
          rd_en <= '0';
          num_inputs <= '0';
          C1 <= (OTHERS=>'0');
          D1 <= (OTHERS=>'0');
          instatus <= 6;
        end if;
      elsif instatus = 9 then
        if fm1 = '1' then
          C1 <= mout1;
          D1 <= dout_out;
          rd_en <= '0';
          instatus <= 0;
          num_inputs <= '0';
        elsif fm2 = '1' then
          C1 <= mout2;
          D1 <= dout_out;
          rd_en <= '0';
          instatus <= 0;
          num_inputs <= '0';
        elsif fa = '1' then
          C1 <= dout_out;
          D1 <= aout;
          rd_en <= '0';
          instatus <= 0;
          num_inputs <= '0';
        else
          instatus <= 0;
          num_inputs <= '0';
          rd_en <= '0';
          C1 <= (OTHERS=>'0');
          D1 <= (OTHERS=>'0');
        end if;
      else
        rd_en <= '0';
        C2 <= C2;
        instatus <= instatus;
        num_inputs <= '1';
      end if;
    end if;
    --end if;
  end if;
end process;
process(clk, reset)
begin
  if reset = '1' then
    inputstatus <= 0;
    C <= (OTHERS=>'0');
    D <= (OTHERS=>'0');
  elsif clk'event and clk='1' then
    if instatus = 0 then
      C <= C1;
      D <= D1;
      inputstatus <= 1;
    elsif instatus = 1 then
      C <= C1;
      --D <= conv_integer(unsigned(dout_out(31 downto 0)));
      D <= (OTHERS=>'0');
      inputstatus <= 2;
    elsif instatus = 7 then
      C <= C1;
      D <= dout_out;
      inputstatus <= 1;
    else
      D <= (OTHERS=>'0');
      C <= (OTHERS=>'0');
      inputstatus <= 3;
    end if;
  end if;
end process;
process(clk, reset)
begin
  if reset = '1' then
    sa <= '0';
    Ain1 <= (OTHERS=>'0');
    Ain2 <= (OTHERS=>'0');
  elsif clk'event and clk='1' then
    if inputstatus = 1 then
      IF C = "0000000000000000000000000000000000000000000000000000000000000000" and
         D = "0000000000000000000000000000000000000000000000000000000000000000" THEN
        sa <= '0';
      ELSE
        sa <= '1';
        Ain1 <= C;
        Ain2 <= D;
      END IF;
    elsif inputstatus = 2 then
      sa <= '1';
      Ain1 <= C;
      Ain2 <= dout_out;
    else
      Ain1 <= (OTHERS=>'0');
      Ain2 <= (OTHERS=>'0');
      sa <= '0';
    end if;
  end if;
end process;
process(clk, reset)
--variable overflow_val : std_logic_vector(63 downto 0);
--variable overflow : std_logic;
begin
  if reset = '1' then
    wr_enbuff <= '0';
    overflow <= '0';
    overflow_val <= (OTHERS=>'0');
    overflow2 <= '0';
    overflow_val2 <= (OTHERS=>'0');
    dinbuff <= (OTHERS=>'0');
  elsif clk'event and clk='1' then
    IF (instatus = 4 and fm1='1' and fm2='1' and fa='1') THEN
      wr_enbuff <= '1';
      dinbuff <= aout;
      overflow <= '1';
      overflow_val <= mout1;
      overflow2 <= '1';
      overflow_val2 <= mout2;
    ELSIF (instatus = 4 and (fm1='0' and fm2='0') and fa='1')
       OR ((fm1='1' and fm2='1') and fa='1' and num_inputs='0') then
      wr_enbuff <= '1';
      dinbuff <= aout;
    ELSIF (instatus = 4 and (fm1='1' and fm2='1')) then
      wr_enbuff <= '1';
      dinbuff <= mout1;
      if overflow = '0' then
        overflow <= '1';
        overflow_val <= mout2;
      else
        overflow2 <= '1';
        overflow_val2 <= mout2;
      end if;
    ELSIF (instatus = 4 and (fm1='1' or fm2='1')) then
      if fm1 = '1' then
        dinbuff <= mout1;
      else
        dinbuff <= mout2;
      end if;
      wr_enbuff <= '1';
    ELSIF ((fm1='1' and fm2='1') and fa='1' and num_inputs='1') then
      wr_enbuff <= '1';
      dinbuff <= aout;
      --and temporarily store mout2 until it can be put into the buffer
      if overflow = '0' then
        overflow <= '1';
        overflow_val <= mout2;
      else
        overflow2 <= '1';
        overflow_val2 <= mout2;
      end if;
    ELSIF ((fm1='1' and fm2='1') and fa='0' and num_inputs='1') then
      wr_enbuff <= '1';
      dinbuff <= mout2;
    ELSIF ((fm1='1' or fm2='1') and fa='1' and num_inputs='1') then
      wr_enbuff <= '1';
      dinbuff <= aout;
    elsif (instatus = 5 and rd_ack = '1') then
      wr_enbuff <= '1';
      dinbuff <= dout_out;
    ELSIF (instatus = 5 and fa = '0' and fm1='0' and fm2='0'
           and ANS_FLAG='0') then
      wr_enbuff <= '1';
      dinbuff <= C2;
    ELSIF (instatus = 9 and fa = '0' and fm1 = '0' and fm2='0') then
      wr_enbuff <= '1';
      dinbuff <= dout_out;
    ELSIF overflow = '1' THEN
      wr_enbuff <= '1';
      dinbuff <= overflow_val;
      overflow <= '0';
    ELSIF overflow2 = '1' THEN
      wr_enbuff <= '1';
      dinbuff <= overflow_val2;
      overflow2 <= '0';
    else
      wr_enbuff <= '0';
      dinbuff <= (OTHERS=>'0');
    end if;
  end if;
end process;
-- keeps a detailed account of the size of the buffer
process(clk, reset, buffreset)
begin
  if reset = '1' or buffreset = '1' then
    size <= 0;
  elsif clk'event and clk='1' then
    if wr_enbuff = '1' and rd_en = '1' then
      size <= size;
    elsif wr_enbuff = '1' and rd_en = '0' then
      if size = 64 then
        size <= size;
      else
        size <= size + 1;
      end if;
    elsif wr_enbuff = '0' and rd_en = '1' then
      if size = 0 then
        size <= 0;
      else
        size <= size - 1;
      end if;
    else
      size <= size;
    end if;
  end if;
end process;
process(clk, reset, buffreset)
begin
  if reset = '1' or buffreset = '1' then
    pendingm1 <= 0;
  elsif clk'event and clk = '1' then
    if sm1 = '1' and fm1 = '1' then
      pendingm1 <= pendingm1;
    elsif sm1 = '1' and fm1 = '0' then
      if pendingm1 = 12 then
        --pending <= 0;
      else
        pendingm1 <= pendingm1 + 1;
      end if;
    elsif sm1 = '0' and fm1 = '1' then
      if pendingm1 = 0 then
        --pendingm1 <= 12;
      else
        pendingm1 <= pendingm1 - 1;
      end if;
    else
      pendingm1 <= pendingm1;
    end if;
  end if;
end process;
process(clk, reset, buffreset)
begin
  if reset = '1' or buffreset = '1' then
    pending <= 0;
  elsif clk'event and clk = '1' then
    if sa = '1' and fa = '1' then
      pending <= pending;
    elsif sa = '1' and fa = '0' then
      if pending = 13 then
        --pending <= 0;
      else
        pending <= pending + 1;
      end if;
    elsif sa = '0' and fa = '1' then
      if pending = 0 then
        --pending <= 12;
      else
        pending <= pending - 1;
      end if;
    else
      pending <= pending;
    end if;
  end if;
end process;
process(clk, reset)
begin
  if reset = '1' then
    ANS_FLAG <= '0';
    ANSWER <= (OTHERS=>'0');
  elsif clk'event and clk='1' then
    --if empty1='1' and empty2='1' and empty3='1' and
    --empty4='1' and pendingm1=0 and pendingm2=0 and pending=1 and
    --overflag='1' and empty_out='1' and fa='1' and sa='0' and instatus /= 9
    --and num_inputs='0' and wr_enbuff='0' then
    if empty1='1' and empty2='1' and empty3='1' and empty4='1'
       and pendingm1=0 and pending=1 and overflag='1' and empty_out='1' and
       fa='1' and sa='0' and instatus /= 9 and instatus /= 0 and inputstatus /= 1
       and num_inputs='0' and wr_enbuff='0' and STATE_DEL=SEND then
      ANS_FLAG <= '1';
      ANSWER <= aout;
    --elsif empty1='1' and empty2='1' and empty3='1' and
    --empty4='1' and pendingm1=0 and pendingm2=0 and pending=0 and
    --overflag='1' and empty_out='1' and fa='0' and sa='0' and inputstatus=3
    --and instatus=6 and wr_enbuff='0' and num_inputs='0' and STATE_DEL=SEND
    --then
    elsif empty1='1' and empty2='1' and empty3='1' and
          empty4='1' and pendingm1=0 and pending=0 and overflag='1' and
          empty_out='1' and fa='0' and sa='0' and inputstatus=3 and instatus=6
          and wr_enbuff='0' and num_inputs='0' and STATE_DEL=SEND then
      ANS_FLAG <= '1';
      ANSWER <= (OTHERS=>'0');
    --elsif empty1='1' and empty2='1' and empty3='1' and
    --empty4='1' and pendingm1=0 and pendingm2=0 and pending=0 and
    --overflag='1' and empty_out='1' and fa='0' and sa='0' and instatus=5 and
    --wr_enbuff='0' then
    elsif empty1='1' and empty2='1' and empty3='1' and
          empty4='1' and pendingm1=0 and pending=0 and overflag='1' and
          empty_out='1' and fa='0' and sa='0' and instatus=5 and wr_enbuff='0'
          and STATE_DEL=SEND then
      ANS_FLAG <= '1';
      ANSWER <= C2;
    elsif empty1='1' and empty2='1' and empty3='1' and
          empty4='1' and pendingm1=0 and pending=0 and overflag='1' and
          empty_out='1' and fa='0' and sa='0' and instatus=9 and wr_enbuff='0'
          and STATE_DEL=SEND then
      ANS_FLAG <= '1';
      ANSWER <= dout_out;
    elsif STATE = PROCESSING OR STATE = ADDRESS then
      ANS_FLAG <= '0';
      ANSWER <= (OTHERS=>'0');
    end if;
  end if;
end process;
process(clk, reset)
begin
  if reset = '1' then
    buffreset <= '0';
  elsif clk'event and clk='1' then
    if ANS_FLAG = '1' then
      buffreset <= '1';
    else
      buffreset <= '0';
    end if;
  end if;
end process;
END behavior;
Appendix G - DPFPMult.vhd
-- Double Precision Floating Point Multiplier
-- < dpfpmult.vhd >
-- 4/18/2004
-- kbaugher@utk.edu
-- Author: Kirk A Baugher
--Library XilinxCoreLib;
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
use IEEE.std_logic_unsigned.all;
entity dpfpmult is
  port ( CLK : in std_logic;
         A : in std_logic_vector(63 downto 0);
         B : in std_logic_vector(63 downto 0);
         OUTx : out std_logic_vector(63 downto 0);
         start : in std_logic;
         finish : out std_logic
       );
end dpfpmult;
architecture RTL of dpfpmult is
signal MA, MB : std_logic_vector(52 downto 0);
signal EA, EB : std_logic_vector(10 downto 0);
signal Sans, s1, s2, s3, s4, s5, s6, s7, s8, s9 : std_logic;
signal step1, step2, step3, step4, step5, step6, step7, step8 : std_logic;
signal Q : std_logic_vector(105 downto 0);
signal eaddans : std_logic_vector(11 downto 0);
signal exp_result : std_logic_vector(12 downto 0);
signal answer : std_logic_vector(63 downto 0);
signal exponent : std_logic_vector(10 downto 0);
signal exponent1 : std_logic_vector(11 downto 0);
signal mca1, mca2, mca3, mca4, mca5, mca6, mca7 : std_logic;
signal eca1, eca2, eca3, eca4 : std_logic;
signal mcb1, mcb2, mcb3, mcb4, mcb5, mcb6, mcb7 : std_logic;
signal ecb1, ecb2, ecb3, ecb4 : std_logic;
signal mc8, mc8a, mc8b, mc8c, mc8d : std_logic;
signal ec5, ec5a, ec5b, ec5c, ec5d : std_logic;
component mul53
  port (
    clk : IN std_logic;
    a : IN std_logic_VECTOR(52 downto 0);
    b : IN std_logic_VECTOR(52 downto 0);
    q : OUT std_logic_VECTOR(105 downto 0)
  );
end component;
component expadd11
  port (
    A : IN std_logic_VECTOR(10 downto 0);
    B : IN std_logic_VECTOR(10 downto 0);
    Q_C_OUT : OUT std_logic;
    Q : OUT std_logic_VECTOR(10 downto 0);
    CLK : IN std_logic);
END component;
component expbias11
  port (
    A : IN std_logic_VECTOR(11 downto 0);
    Q : OUT std_logic_VECTOR(12 downto 0);
    CLK : IN std_logic);
END component;
begin
MA(51 downto 0) <= A(51 downto 0);
MA(52) <= '1';
MB(51 downto 0) <= B(51 downto 0);
MB(52) <= '1';
EA <= A(62 downto 52);
EB <= B(62 downto 52);
Sans <= A(63) XOR B(63);
mul53_0 : mul53 port map (a => MA, b => MB, clk => CLK, q => Q);
expadd11_0 : expadd11 port map (A => EA, B => EB, Q => eaddans(10 downto 0),
                                Q_C_OUT => eaddans(11), CLK => CLK);
expbias11_0 : expbias11 port map (A => eaddans, Q => exp_result, CLK => CLK);
------------< Floating-Point Multiplication Algorithm >---------
process (CLK)
begin
  --some latch should be inserted here for delay 4 cycle
  --wait until rising_edge(CLK);
  IF (CLK = '1' and CLK'event) THEN
    S1 <= Sans;
    S2 <= S1;
    S3 <= S2;
    S4 <= S3;
    S5 <= S4;
    S6 <= S5;
    S7 <= S6;
    S8 <= S7;
    S9 <= S8;
    step1 <= start;
    step2 <= step1;
    step3 <= step2;
    step4 <= step3;
    step5 <= step4;
    step6 <= step5;
    step7 <= step6;
    step8 <= step7;
    finish <= step8;
  END IF;
end process;
process (CLK)
  variable mca, mcb : std_logic_vector(51 downto 0);
  variable eca, ecb : std_logic_vector(10 downto 0);
begin
  --check for a zero value for an input and adjust the answer if necessary at end
  IF (CLK = '1' and CLK'event) THEN
    mca := A(51 DOWNTO 0);
    mcb := B(51 DOWNTO 0);
    eca := A(62 DOWNTO 52);
    ecb := B(62 DOWNTO 52);
    mca1 <= mca(51) OR mca(50) OR mca(49) OR mca(48) OR mca(47) OR mca(46) OR
            mca(45) OR mca(44) OR mca(43);
    mcb1 <= mcb(51) OR mcb(50) OR mcb(49) OR mcb(48) OR mcb(47) OR mcb(46) OR
            mcb(45) OR mcb(44) OR mcb(43);
    mca2 <= mca(42) OR mca(41) OR mca(40) OR mca(39) OR mca(38) OR mca(37) OR
            mca(36) OR mca(35) OR mca(34);
    mcb2 <= mcb(42) OR mcb(41) OR mcb(40) OR mcb(39) OR mcb(38) OR mcb(37) OR
            mcb(36) OR mcb(35) OR mcb(34);
    mca3 <= mca(33) OR mca(32) OR mca(31) OR mca(30) OR mca(29) OR mca(28) OR
            mca(27) OR mca(26) OR mca(25);
    mcb3 <= mcb(33) OR mcb(32) OR mcb(31) OR mcb(30) OR mcb(29) OR mcb(28) OR
            mcb(27) OR mcb(26) OR mcb(25);
    mca4 <= mca(24) OR mca(23) OR mca(22) OR mca(21) OR mca(20) OR mca(19) OR
            mca(18) OR mca(17) OR mca(16);
    mcb4 <= mcb(24) OR mcb(23) OR mcb(22) OR mcb(21) OR mcb(20) OR mcb(19) OR
            mcb(18) OR mcb(17) OR mcb(16);
    mca5 <= mca(15) OR mca(14) OR mca(13) OR mca(12) OR mca(11) OR mca(10) OR
            mca(9) OR mca(8) OR mca(7);
    mcb5 <= mcb(15) OR mcb(14) OR mcb(13) OR mcb(12) OR mcb(11) OR mcb(10) OR
            mcb(9) OR mcb(8) OR mcb(7);
    mca6 <= mca(6) OR mca(5) OR mca(4) OR mca(3) OR mca(2) OR mca(1) OR
            mca(0);
    mcb6 <= mcb(6) OR mcb(5) OR mcb(4) OR mcb(3) OR mcb(2) OR mcb(1) OR
            mcb(0);
    mca7 <= mca1 OR mca2 OR mca3 OR mca4 OR mca5 OR mca6;
    mcb7 <= mcb1 OR mcb2 OR mcb3 OR mcb4 OR mcb5 OR mcb6;
    mc8 <= mca7 AND mcb7;
    mc8a <= mc8;
    mc8b <= mc8a;
    mc8c <= mc8b;
    mc8d <= mc8c;
    eca1 <= eca(10) OR eca(9) OR eca(8) OR eca(7);
    eca2 <= eca(6) OR eca(5) OR eca(4) OR eca(3);
    eca3 <= eca(2) OR eca(1) OR eca(0);
    eca4 <= eca1 OR eca2 OR eca3;
    ecb1 <= ecb(10) OR ecb(9) OR ecb(8) OR ecb(7);
    ecb2 <= ecb(6) OR ecb(5) OR ecb(4) OR ecb(3);
    ecb3 <= ecb(2) OR ecb(1) OR ecb(0);
    ecb4 <= ecb1 OR ecb2 OR ecb3;
    ec5 <= eca4 AND ecb4;
    ec5a <= ec5;
    ec5b <= ec5a;
    ec5c <= ec5b;
    ec5d <= ec5c;
  END IF;
end process;
process (CLK) --7th step-- Check for exponent overflow
  variable exponent1a : std_logic_vector(12 downto 0);
begin
  --wait until rising_edge(CLK);
  IF (CLK = '1' and CLK'event) THEN
    IF (exp_result(12) = '1' OR exp_result = "0111111111111") THEN
      exponent <= "11111111111"; --If overflow set to max value of 254 (biased)
    ELSE
      exponent <= exp_result(10 downto 0);
    END IF;
  END IF;
end process;
process (CLK, Q)
  variable exponent1a : std_logic_vector(11 downto 0);
  variable mantissa : std_logic_vector(53 downto 0);
  variable exponent1x : std_logic_vector(10 downto 0);
begin --8th step--
  --wait until rising_edge(CLK);
  IF (CLK = '1' and CLK'event) THEN
    exponent1a(10 downto 0) := exponent;
    exponent1a(11) := '0';
    mantissa := Q(105 downto 52);
    IF mantissa(53) = '1' THEN
      IF ec5d = '0' AND mc8d = '0' THEN
        exponent1a := "000000000000";
        mantissa := "000000000000000000000000000000000000000000000000000000";
        --ELSIF ec5d = '0' THEN
        exponent1a := "000000000000";
        --ELSIF mc8d = '0' THEN
        mantissa := "000000000000000000000000000000000000000000000000000000";
      ELSIF exponent1a < "11111111111" THEN
        exponent1a := exponent1a + "000000001";
      END IF;
      exponent1x := exponent1a(10 downto 0);
      answer <= S7 & exponent1x & mantissa(52 downto 1);
    ELSE
      IF ec5d = '0' AND mc8d = '0' THEN
        exponent1a := "000000000000";
        mantissa := "000000000000000000000000000000000000000000000000000000";
        --ELSIF ec5d = '0' THEN
        exponent1a := "000000000000";
        --ELSIF mc8d = '0' THEN
        mantissa := "000000000000000000000000000000000000000000000000000000";
      END IF;
      exponent1x := exponent1a(10 downto 0);
      answer <= S7 & exponent1x & mantissa(51 downto 0);
    END IF;
    OUTx <= answer;
  END IF;
end process;
end RTL;
189 
Appendix H - DPFPAdd.vhd 
1 90 
Double Precision Floating Point Adder 
< dpfpadd . vhd > 
4 / 1 8 /2 0 0 4  
kbaugher@ut k . edu 
--Author : Kirk A Baugher 
LIBRARY IEEE;
USE IEEE.std_logic_1164.ALL;
USE IEEE.std_logic_arith.ALL;
USE IEEE.std_logic_unsigned.ALL;

ENTITY dpfpadd IS
  PORT (
    CLK    : IN  STD_LOGIC;
    start  : IN  STD_LOGIC;
    Ain    : IN  STD_LOGIC_VECTOR(63 DOWNTO 0);
    Bin    : IN  STD_LOGIC_VECTOR(63 DOWNTO 0);
    OUTx   : OUT STD_LOGIC_VECTOR(63 DOWNTO 0);
    finish : OUT STD_LOGIC);
END dpfpadd;

ARCHITECTURE behavior OF dpfpadd IS
  SIGNAL ediff1, ediff2, ediffout, Aout, Bout : STD_LOGIC_VECTOR(11 DOWNTO 0);
  SIGNAL expa, expb, exp1out, exp1out1, exp1out1a, exp1out1b :
    STD_LOGIC_VECTOR(10 DOWNTO 0);
  SIGNAL manta, mantb, mantx1out, mant1out, mantx2out, mant1out1 :
    STD_LOGIC_VECTOR(53 DOWNTO 0);
  SIGNAL mantx2a, mant1out1a, mant1out1b : STD_LOGIC_VECTOR(53 DOWNTO 0);
  SIGNAL sa, sb, Sans1out, Sans1out1 : STD_LOGIC;
  SIGNAL Sans1out1a, Sanswer1, Sans1out1b, Sanswerz1, change : STD_LOGIC;
  SIGNAL mant_result : STD_LOGIC_VECTOR(53 DOWNTO 0);
  SIGNAL Sans2, Sans3, Sans4, Sans5, Sans6 : STD_LOGIC;
  SIGNAL exp1a, exp1b, exp1c, exp1d, exp1e : STD_LOGIC_VECTOR(10 DOWNTO 0);
  SIGNAL Z1, Z2, Z3, Z4, Z5, Z6, Z7, Z8, Z9, Z10, Z11, zeroflag1, zeroflag2 :
    std_logic;
  SIGNAL f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11, f12 : STD_LOGIC;
  SIGNAL SSx1, SSx2, SSx3, SSx4, SSx5, SSx6, SSx7, SSx8, SSxout, SSxout2 :
    STD_LOGIC;
  SIGNAL SSout, SStoCompl : STD_LOGIC;
  SIGNAL expanswer1, expanswerz1 : STD_LOGIC_VECTOR(10 DOWNTO 0);
  SIGNAL shift1, shiftout, shift, shift12, shiftn2 : STD_LOGIC_VECTOR(5 DOWNTO 0);
  SIGNAL mantans3, mantz1 : STD_LOGIC_VECTOR(51 DOWNTO 0);
  SIGNAL mantx2tomantadd : STD_LOGIC_VECTOR(54 DOWNTO 0);
  --Port A is input, Port B output
  COMPONENT subexp1
    port (
      CLK : IN  std_logic;
      A   : IN  std_logic_VECTOR(11 downto 0);
      B   : IN  std_logic_VECTOR(11 downto 0);
      Q   : OUT std_logic_VECTOR(11 downto 0));
  END COMPONENT;

  --Port A is input, Port B output
  COMPONENT mantadd5
    port (
      CLK : IN  std_logic;
      A   : IN  std_logic_VECTOR(53 downto 0);
      B   : IN  std_logic_VECTOR(53 downto 0);
      Q   : OUT std_logic_VECTOR(53 downto 0));
  END COMPONENT;

  --Port A is input, Port B output
  COMPONENT twoscompl
    port (
      CLK    : IN  std_logic;
      BYPASS : IN  std_logic;
      A      : IN  std_logic_VECTOR(53 downto 0);
      Q      : OUT std_logic_VECTOR(54 downto 0));
  END COMPONENT;

BEGIN
  Aout <= '0' & Ain(62 DOWNTO 52);
  Bout <= '0' & Bin(62 DOWNTO 52);

  subexp12 : subexp1 port map (
    A   => Aout,
    B   => Bout,
    Q   => ediff1,
    CLK => CLK);

  subexp21 : subexp1 port map (
    A   => Bout,
    B   => Aout,
    Q   => ediff2,
    CLK => CLK);

  mantexe : mantadd5 port map (
    A   => mant1out1a,
    B   => mantx2out,
    Q   => mant_result,
    CLK => CLK);

  twos : twoscompl port map (
    A      => mantx2a,
    BYPASS => SStoCompl,
    Q      => mantx2tomantadd,
    CLK    => CLK);

  PROC1 : PROCESS (CLK) --Occurs during expdiff
  BEGIN
    IF CLK'EVENT AND CLK='1' THEN
      expa  <= Ain(62 downto 52);
      manta <= "01" & Ain(51 downto 0);
      sa    <= Ain(63);
      expb  <= Bin(62 downto 52);
      mantb <= "01" & Bin(51 downto 0);
      sb    <= Bin(63);
    END IF;
  END PROCESS PROC1;

  PROC2 : PROCESS (CLK) --depending on expdiff larger number goes to FP1
    variable exp1, exp2 : std_logic_vector(10 downto 0);
    variable mant1, mant2, mantx1 : std_logic_vector(53 downto 0);
    variable ediff : std_logic_vector(11 downto 0);
    variable Sans1, SS : std_logic;
  BEGIN
    IF CLK'EVENT AND CLK='1' THEN
      IF ediff1(11) = '0' THEN
        exp1  := expa;
        mant1 := manta;
        Sans1 := sa;
        exp2  := expb;
        mant2 := mantb;
        ediff := ediff1;
      ELSE
        exp1  := expb;
        mant1 := mantb;
        Sans1 := sb;
        exp2  := expa;
        mant2 := manta;
        ediff := ediff2;
      END IF;
      SS := sa XOR sb;
      --Begin shifting lower number mantissa
      --IF (ediff(7) OR ediff(6) OR ediff(5)) = '1' THEN --for single-precision
      IF (ediff(11) OR ediff(10) OR ediff(9) OR ediff(8) OR ediff(7) OR
          ediff(6)) = '1' THEN --for DP
        mantx1 := (OTHERS => '0'); --change to 25 zeros for sp
      ELSE
        IF ediff(5) = '1' THEN --shift 32 zeros
          mantx1(20 downto 0)  := mant2(52 downto 32);
          mantx1(52 downto 21) := (OTHERS => '0');
        ELSE
          --For Single Precision
          --IF ediff(4) = '1' THEN --shift 16 zeros
          --  mantx1(36 downto 0) := mant2(52 downto 16);
          --  mantx1(52 downto 37) := "0000000000000000";
          --  mantx1(37 downto ...) := "0000000000000000";
          --ELSE
          mantx1 := mant2;
        END IF;
      END IF;
      SSout <= SS;
      mantx1out <= mantx1;
      exp1out <= exp1;
      mant1out <= mant1;
      Sans1out <= Sans1;
      ediffout <= ediff;
    END IF;
  END PROCESS PROC2;

  PROC3 : PROCESS (CLK) --Finish shifting
    variable mantx2 : std_logic_vector(53 downto 0);
    variable ediffa : std_logic_vector(11 downto 0);
    variable SSx    : std_logic;
  BEGIN
    IF CLK'EVENT AND CLK='1' THEN
      SSx := SSout;
      mantx2 := mantx1out;
      ediffa := ediffout;
      --Comment ediffa(4) out for single precision
      IF ediffa(4) = '1' THEN --shift 16 zeros
        mantx2(36 downto 0)  := mantx2(52 downto 16);
        mantx2(52 downto 37) := "0000000000000000";
      ELSE
        mantx2 := mantx2;
      END IF;
      IF ediffa(3) = '1' THEN --shift 8 zeros
        mantx2(44 downto 0)  := mantx2(52 downto 8);
        mantx2(52 downto 45) := "00000000";
      ELSE
        mantx2 := mantx2;
      END IF;
      IF ediffa(2) = '1' THEN --shift 4 zeros
        mantx2(48 downto 0)  := mantx2(52 downto 4);
        mantx2(52 downto 49) := "0000";
      ELSE
        mantx2 := mantx2;
      END IF;
      IF ediffa(1) = '1' THEN --shift 2 zeros
        mantx2(50 downto 0)  := mantx2(52 downto 2);
        mantx2(52 downto 51) := "00";
      ELSE
        mantx2 := mantx2;
      END IF;
      IF ediffa(0) = '1' THEN --shift 1 zeros
        mantx2(51 downto 0) := mantx2(52 downto 1);
        mantx2(52) := '0';
      ELSE
        mantx2 := mantx2;
      END IF;
      IF SSx = '1' THEN
        mantx2 := NOT(mantx2) + 1;
      ELSE
        mantx2 := mantx2;
      END IF;
      mantx2a <= mantx2;
      exp1out1 <= exp1out;
      mant1out1 <= mant1out;
      Sans1out1 <= Sans1out;
      SSxout <= SSx;
      SStoCompl <= NOT(SSx);
    END IF;
  END PROCESS PROC3;

  --this process occurs during the two's complement
  PROC2_3 : PROCESS (CLK)
  BEGIN
    IF CLK'EVENT AND CLK='1' THEN
      mantx2out <= mantx2tomantadd(53 DOWNTO 0);
      SSxout2 <= SSxout;
      exp1out1b <= exp1out1;
      exp1out1a <= exp1out1b;
      mant1out1b <= mant1out1;
      mant1out1a <= mant1out1b;
      Sans1out1b <= Sans1out1;
      Sans1out1a <= Sans1out1b;
    END IF;
  END PROCESS PROC2_3;

  PROC4 : PROCESS (CLK) --mantissa
    variable mant_result1 : std_logic_vector(53 downto 0);
    variable mantans1     : std_logic_vector(51 downto 0);
    variable expanswer    : std_logic_vector(10 downto 0);
    variable Sanswer      : std_logic;
  BEGIN
    IF CLK'EVENT AND CLK='1' THEN
      mant_result1 := mant_result;
      expanswer := exp1e;
      Sanswer := Sans6;
      change <= '1';
      IF mant_result1(53) = '1' THEN
        mantans1 := mant_result1(52 downto 1);
        shift <= "000000";
        --shiftn <= "000001";
        shiftn2 <= "000001";
      ELSIF mant_result1(52) = '1' THEN
        mantans1 := mant_result1(51 downto 0);
        shift <= "000000";
        --shiftn <= "000010";
        shiftn2 <= "000000";
      ELSIF mant_result1(51) = '1' THEN
        mantans1(51 downto 1) := mant_result1(50 downto 0);
        mantans1(0) := '0';
        shift <= "000001";
        --shiftn <= "000011";
        shiftn2 <= "000001";
      ELSIF mant_result1(50) = '1' THEN
        shift <= "000010";
        --shiftn <= "000100";
        shiftn2 <= "000010";
        mantans1(51 downto 2) := mant_result1(49 downto 0);
        mantans1(1 downto 0) := (OTHERS => '0');
      ELSIF mant_result1(49) = '1' THEN
        shift <= "000011";
        --shiftn <= "000101";
        shiftn2 <= "000011";
        mantans1(51 downto 3) := mant_result1(48 downto 0);
        mantans1(2 downto 0) := (OTHERS => '0');
      ELSIF mant_result1(48) = '1' THEN
        shift <= "000100";
        --shiftn <= "000110";
        shiftn2 <= "000100";
        mantans1(51 downto 4) := mant_result1(47 downto 0);
        mantans1(3 downto 0) := (OTHERS => '0');
      ELSIF mant_result1(47) = '1' THEN
        shift <= "000101";
        --shiftn <= "000111";
        shiftn2 <= "000101";
        mantans1(51 downto 5) := mant_result1(46 downto 0);
        mantans1(4 downto 0) := (OTHERS => '0');
      ELSIF mant_result1(46) = '1' THEN
        shift <= "000110";
        --shiftn <= "001000";
        shiftn2 <= "000110";
        mantans1(51 downto 6) := mant_result1(45 downto 0);
        mantans1(5 downto 0) := (OTHERS => '0');
      ELSIF mant_result1(45) = '1' THEN
        shift <= "000111";
        --shiftn <= "001001";
        shiftn2 <= "000111";
        mantans1(51 downto 7) := mant_result1(44 downto 0);
        mantans1(6 downto 0) := (OTHERS => '0');
      ELSIF mant_result1(44) = '1' THEN
        shift <= "001000";
        --shiftn <= "001010";
        shiftn2 <= "001000";
        mantans1(51 downto 8) := mant_result1(43 downto 0);
        mantans1(7 downto 0) := (OTHERS => '0');
      ELSIF mant_result1(43) = '1' THEN
        shift <= "001001";
        --shiftn <= "001011";
        shiftn2 <= "001001";
        mantans1(51 downto 9) := mant_result1(42 downto 0);
        mantans1(8 downto 0) := (OTHERS => '0');
      ELSIF mant_result1(42) = '1' THEN
        shift <= "001010";
        --shiftn <= "001100";
        shiftn2 <= "001010";
        mantans1(51 downto 10) := mant_result1(41 downto 0);
        mantans1(9 downto 0) := (OTHERS => '0');
      ELSIF mant_result1(41) = '1' THEN
        shift <= "001011";
        --shiftn <= "001101";
        shiftn2 <= "001011";
        mantans1(51 downto 11) := mant_result1(40 downto 0);
        mantans1(10 downto 0) := (OTHERS => '0');
      ELSIF mant_result1(40) = '1' THEN
        shift <= "001100";
        --shiftn <= "001110";
        shiftn2 <= "001100";
        mantans1(51 downto 12) := mant_result1(39 downto 0);
        mantans1(11 downto 0) := (OTHERS => '0');
      ELSIF mant_result1(39) = '1' THEN
        shift <= "001101";
        --shiftn <= "001111";
        shiftn2 <= "001101";
        mantans1(51 downto 13) := mant_result1(38 downto 0);
        mantans1(12 downto 0) := (OTHERS => '0');
      ELSIF mant_result1(38) = '1' THEN
        shift <= "001110";
        --shiftn <= "010000";
        shiftn2 <= "001110";
        mantans1(51 downto 14) := mant_result1(37 downto 0);
        mantans1(13 downto 0) := (OTHERS => '0');
      ELSIF mant_result1(37) = '1' THEN
        shift <= "001111";
        --shiftn <= "010001";
        shiftn2 <= "001111";
        mantans1(51 downto 15) := mant_result1(36 downto 0);
        mantans1(14 downto 0) := (OTHERS => '0');
      ELSIF mant_result1(36) = '1' THEN
        shift <= "010000";
        --shiftn <= "010010";
        shiftn2 <= "010000";
        mantans1(51 downto 16) := mant_result1(35 downto 0);
        mantans1(15 downto 0) := (OTHERS => '0');
      ELSIF mant_result1(35) = '1' THEN
        shift <= "010001";
        --shiftn <= "010011";
        shiftn2 <= "010001";
        mantans1(51 downto 17) := mant_result1(34 downto 0);
        mantans1(16 downto 0) := (OTHERS => '0');
      ELSIF mant_result1(34) = '1' THEN
        shift <= "010010";
        --shiftn <= "010100";
        shiftn2 <= "010010";
        mantans1(51 downto 18) := mant_result1(33 downto 0);
        mantans1(17 downto 0) := (OTHERS => '0');
      ELSIF mant_result1(33) = '1' THEN
        shift <= "010011";
        --shiftn <= "010101";
        shiftn2 <= "010011";
        mantans1(51 downto 19) := mant_result1(32 downto 0);
        mantans1(18 downto 0) := (OTHERS => '0');
      ELSIF mant_result1(32) = '1' THEN
        shift <= "010100";
        --shiftn <= "010110";
        shiftn2 <= "010100";
        mantans1(51 downto 20) := mant_result1(31 downto 0);
        mantans1(19 downto 0) := (OTHERS => '0');
      ELSIF mant_result1(31) = '1' THEN
        shift <= "010101";
        --shiftn <= "010111";
        shiftn2 <= "010101";
        mantans1(51 downto 21) := mant_result1(30 downto 0);
        mantans1(20 downto 0) := (OTHERS => '0');
      ELSIF mant_result1(30) = '1' THEN
        shift <= "010110";
        --shiftn <= "011000";
        shiftn2 <= "010110";
        mantans1(51 downto 22) := mant_result1(29 downto 0);
        mantans1(21 downto 0) := (OTHERS => '0');
      ELSIF mant_result1(29) = '1' THEN
        shift <= "010111";
        --shiftn <= "011001";
        shiftn2 <= "010111";
        mantans1(51 downto 23) := mant_result1(28 downto 0);
        mantans1(22 downto 0) := (OTHERS => '0');
      ELSIF mant_result1(28) = '1' THEN
        shift <= "011000";
        --shiftn <= "011010";
        shiftn2 <= "011000";
        mantans1(51 downto 24) := mant_result1(27 downto 0);
        mantans1(23 downto 0) := (OTHERS => '0');
      ELSE
        mantans1 := mant_result1(51 DOWNTO 0);
        change <= '0';
        shift <= "000000";
        --shiftn <= "000000";
        shiftn2 <= "000000";
      END IF;
      mantz1 <= mantans1;
      shiftout <= shift;
      expanswerz1 <= expanswer;
      Sanswerz1 <= Sanswer;
    END IF;
  END PROCESS PROC4;

  PROC4A : PROCESS (CLK) --mantissa normalization
    variable mant_result1a   : std_logic_vector(51 downto 0);
    variable mantans         : std_logic_vector(51 downto 0);
    variable expanswer       : std_logic_vector(10 downto 0);
    variable Sanswer         : std_logic;
    variable shiftc, shiftd2 : std_logic_vector(5 downto 0);
  BEGIN
    IF CLK'EVENT AND CLK='1' THEN
      --shiftc := shiftout;
      shiftc := shift;
      --shiftd := shiftn;
      shiftd2 := shiftn2;
      mant_result1a := mantz1;
      expanswer := expanswerz1;
      Sanswer := Sanswerz1;
      IF change = '1' THEN
        mantans := mant_result1a;
      ELSIF mant_result1a(27) = '1' THEN
        shiftc := "011001";
        --shiftd := "011011";
        shiftd2 := "011001";
        mantans(51 downto 25) := mant_result1a(26 downto 0);
        mantans(24 downto 0) := (OTHERS => '0');
      ELSIF mant_result1a(26) = '1' THEN
        shiftc := "011010";
        --shiftd := "011100";
        shiftd2 := "011010";
        mantans(51 downto 26) := mant_result1a(25 downto 0);
        mantans(25 downto 0) := (OTHERS => '0');
      ELSIF mant_result1a(25) = '1' THEN
        shiftc := "011011";
        --shiftd := "011101";
        shiftd2 := "011011";
        mantans(51 downto 27) := mant_result1a(24 downto 0);
        mantans(26 downto 0) := (OTHERS => '0');
      ELSIF mant_result1a(24) = '1' THEN
        shiftc := "011100";
        --shiftd := "011110";
        shiftd2 := "011100";
        mantans(51 downto 28) := mant_result1a(23 downto 0);
        mantans(27 downto 0) := (OTHERS => '0');
      ELSIF mant_result1a(23) = '1' THEN
        shiftc := "011101";
        --shiftd := "011111";
        shiftd2 := "011101";
        mantans(51 downto 29) := mant_result1a(22 downto 0);
        mantans(28 downto 0) := (OTHERS => '0');
      ELSIF mant_result1a(22) = '1' THEN
        shiftc := "011110";
        --shiftd := "100000";
        shiftd2 := "011110";
        mantans(51 downto 30) := mant_result1a(21 downto 0);
        mantans(29 downto 0) := (OTHERS => '0');
      ELSIF mant_result1a(21) = '1' THEN
        shiftc := "011111";
        --shiftd := "100001";
        shiftd2 := "011111";
        mantans(51 downto 31) := mant_result1a(20 downto 0);
        mantans(30 downto 0) := (OTHERS => '0');
      ELSIF mant_result1a(20) = '1' THEN
        shiftc := "100000";
        --shiftd := "100010";
        shiftd2 := "100000";
        mantans(51 downto 32) := mant_result1a(19 downto 0);
        mantans(31 downto 0) := (OTHERS => '0');
      ELSIF mant_result1a(19) = '1' THEN
        shiftc := "100001";
        --shiftd := "100011";
        shiftd2 := "100001";
        mantans(51 downto 33) := mant_result1a(18 downto 0);
        mantans(32 downto 0) := (OTHERS => '0');
      ELSIF mant_result1a(18) = '1' THEN
        shiftc := "100010";
        --shiftd := "100100";
        shiftd2 := "100010";
        mantans(51 downto 34) := mant_result1a(17 downto 0);
        mantans(33 downto 0) := (OTHERS => '0');
      ELSIF mant_result1a(17) = '1' THEN
        shiftc := "100011";
        --shiftd := "100101";
        shiftd2 := "100011";
        mantans(51 downto 35) := mant_result1a(16 downto 0);
        mantans(34 downto 0) := (OTHERS => '0');
      ELSIF mant_result1a(16) = '1' THEN
        shiftc := "100100";
        --shiftd := "100110";
        shiftd2 := "100100";
        mantans(51 downto 36) := mant_result1a(15 downto 0);
        mantans(35 downto 0) := (OTHERS => '0');
      ELSIF mant_result1a(15) = '1' THEN
        shiftc := "100101";
        --shiftd := "100111";
        shiftd2 := "100101";
        mantans(51 downto 37) := mant_result1a(14 downto 0);
        mantans(36 downto 0) := (OTHERS => '0');
      ELSIF mant_result1a(14) = '1' THEN
        shiftc := "100110";
        --shiftd := "101000";
        shiftd2 := "100110";
        mantans(51 downto 38) := mant_result1a(13 downto 0);
        mantans(37 downto 0) := (OTHERS => '0');
      ELSIF mant_result1a(13) = '1' THEN
        shiftc := "100111";
        --shiftd := "101001";
        shiftd2 := "100111";
        mantans(51 downto 39) := mant_result1a(12 downto 0);
        mantans(38 downto 0) := (OTHERS => '0');
      ELSIF mant_result1a(12) = '1' THEN
        shiftc := "101000";
        --shiftd := "101010";
        shiftd2 := "101000";
        mantans(51 downto 40) := mant_result1a(11 downto 0);
        mantans(39 downto 0) := (OTHERS => '0');
      ELSIF mant_result1a(11) = '1' THEN
        shiftc := "101001";
        --shiftd := "101011";
        shiftd2 := "101001";
        mantans(51 downto 41) := mant_result1a(10 downto 0);
        mantans(40 downto 0) := (OTHERS => '0');
      ELSIF mant_result1a(10) = '1' THEN
        shiftc := "101010";
        --shiftd := "101100";
        shiftd2 := "101010";
        mantans(51 downto 42) := mant_result1a(9 downto 0);
        mantans(41 downto 0) := (OTHERS => '0');
      ELSIF mant_result1a(9) = '1' THEN
        shiftc := "101011";
        --shiftd := "101101";
        shiftd2 := "101011";
        mantans(51 downto 43) := mant_result1a(8 downto 0);
        mantans(42 downto 0) := (OTHERS => '0');
      ELSIF mant_result1a(8) = '1' THEN
        shiftc := "101100";
        --shiftd := "101110";
        shiftd2 := "101100";
        mantans(51 downto 44) := mant_result1a(7 downto 0);
        mantans(43 downto 0) := (OTHERS => '0');
      ELSIF mant_result1a(7) = '1' THEN
        shiftc := "101101";
        --shiftd := "101111";
        shiftd2 := "101101";
        mantans(51 downto 45) := mant_result1a(6 downto 0);
        mantans(44 downto 0) := (OTHERS => '0');
      ELSIF mant_result1a(6) = '1' THEN
        shiftc := "101110";
        --shiftd := "110000";
        shiftd2 := "101110";
        mantans(51 downto 46) := mant_result1a(5 downto 0);
        mantans(45 downto 0) := (OTHERS => '0');
      ELSIF mant_result1a(5) = '1' THEN
        shiftc := "101111";
        --shiftd := "110001";
        shiftd2 := "101111";
        mantans(51 downto 47) := mant_result1a(4 downto 0);
        mantans(46 downto 0) := (OTHERS => '0');
      ELSIF mant_result1a(4) = '1' THEN
        shiftc := "110000";
        --shiftd := "110010";
        shiftd2 := "110000";
        mantans(51 downto 48) := mant_result1a(3 downto 0);
        mantans(47 downto 0) := (OTHERS => '0');
      ELSIF mant_result1a(3) = '1' THEN
        shiftc := "110001";
        --shiftd := "110011";
        shiftd2 := "110001";
        mantans(51 downto 49) := mant_result1a(2 downto 0);
        mantans(48 downto 0) := (OTHERS => '0');
      ELSIF mant_result1a(2) = '1' THEN
        shiftc := "110010";
        --shiftd := "110100";
        shiftd2 := "110010";
        mantans(51 downto 50) := mant_result1a(1 downto 0);
        mantans(49 downto 0) := (OTHERS => '0');
      ELSIF mant_result1a(1) = '1' THEN
        shiftc := "110011";
        shiftd2 := "110011";
        --shiftd := "110101";
        mantans(51) := mant_result1a(0);
        mantans(50 downto 0) := (OTHERS => '0');
      ELSE
        shiftc := "000000";
        --shiftd := "000000";
        shiftd2 := "000000";
        mantans := (OTHERS => '0');
      END IF;
      --OUTx <= Sanswer & expanswer & mantans; --OUTx is driven by PROCESS_EXP_ADJ
      Sanswer1 <= Sanswer;
      expanswer1 <= expanswer;
      mantans3 <= mantans;
      shift1 <= shiftc;
      --shift11 <= shiftd;
      shift12 <= shiftd2;
    END IF;
  END PROCESS PROC4A;

  PROC4B : PROCESS (CLK) --occurs during normalization
  BEGIN
    IF CLK'EVENT AND CLK='1' THEN
      Sans2 <= Sans1out1a;
      Sans3 <= Sans2;
      Sans4 <= Sans3;
      Sans5 <= Sans4;
      Sans6 <= Sans5;
      SSx1 <= SSxout2;
      SSx2 <= SSx1;
      SSx3 <= SSx2;
      SSx4 <= SSx3;
      SSx5 <= SSx4;
      SSx6 <= SSx5;
      SSx7 <= SSx6;
      SSx8 <= SSx7;
      exp1a <= exp1out1a;
      exp1b <= exp1a;
      exp1c <= exp1b;
      exp1d <= exp1c;
      exp1e <= exp1d;
    END IF;
  END PROCESS PROC4B;

  PROCESS_EXP_ADJ : PROCESS (CLK)
    variable expanswer2       : STD_LOGIC_VECTOR(10 DOWNTO 0);
    variable mantans4         : STD_LOGIC_VECTOR(51 DOWNTO 0);
    variable Sanswer2         : STD_LOGIC;
    variable shift1x, shift12x : STD_LOGIC_VECTOR(5 DOWNTO 0);
  BEGIN
    IF CLK'EVENT AND CLK='1' THEN
      Sanswer2 := Sanswer1;
      expanswer2 := expanswer1;
      mantans4 := mantans3;
      shift1x := shift1;
      --shift11x := shift11;
      shift12x := shift12;
      IF Z11 = '1' THEN
        Sanswer2 := '0';
        expanswer2 := (OTHERS => '0');
        mantans4 := (OTHERS => '0');
      ELSIF (shift1x > "000000" AND SSx8 = '1') THEN
        expanswer2 := expanswer2 - shift1x;
      ELSIF (shift12x > "000000" AND SSx8 = '0') THEN
        expanswer2 := expanswer2 + shift12x;
      END IF;
      OUTx <= Sanswer2 & expanswer2 & mantans4;
    END IF;
  END PROCESS PROCESS_EXP_ADJ;

  PROCESS_FINISH : PROCESS (CLK)
  BEGIN
    IF CLK'EVENT AND CLK='1' THEN
      f1 <= start;
      f2 <= f1;
      f3 <= f2;
      f4 <= f3;
      f5 <= f4;
      f6 <= f5;
      f7 <= f6;
      f8 <= f7;
      f9 <= f8;
      f10 <= f9;
      f11 <= f10;
      f12 <= f11;
      finish <= f12;
    END IF;
  END PROCESS PROCESS_FINISH;

  Zerocheck1 : PROCESS (CLK)
  BEGIN
    IF CLK'EVENT AND CLK='1' THEN
      IF Ain = X"0000000000000000" THEN
        zeroflag1 <= '1';
      ELSE
        zeroflag1 <= '0';
      END IF;
    END IF;
  END PROCESS Zerocheck1;

  Zerocheck2 : PROCESS (CLK)
  BEGIN
    IF CLK'EVENT AND CLK='1' THEN
      IF Bin = X"0000000000000000" THEN
        zeroflag2 <= '1';
      ELSE
        zeroflag2 <= '0';
      END IF;
    END IF;
  END PROCESS Zerocheck2;

  Zeropass : PROCESS (CLK)
  BEGIN
    IF CLK'EVENT AND CLK='1' THEN
      Z1 <= zeroflag1 AND zeroflag2;
      Z2 <= Z1;
      Z3 <= Z2;
      Z4 <= Z3;
      Z5 <= Z4;
      Z6 <= Z5;
      Z7 <= Z6;
      Z8 <= Z7;
      Z9 <= Z8;
      Z10 <= Z9;
      Z11 <= Z10;
    END IF;
  END PROCESS Zeropass;
END behavior;
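The two shifters in the adder listing above are written out bit by bit, which makes them hard to check by eye. The following is a minimal Python sketch of what they appear to compute, offered as a reference model under stated assumptions rather than as part of the original design: `align_mantissa` mirrors the staged right shift of the smaller operand in PROC2 and PROC3, and `normalize` mirrors the leading-one search in the ELSIF chains of PROC4 and PROC4A. Both function names and the convention of returning -1 for the carry case are hypothetical.

```python
MASK52 = (1 << 52) - 1  # 52 stored fraction bits of an IEEE 754 double


def align_mantissa(mant2, ediff):
    """Right-shift the smaller operand's mantissa by the exponent difference.

    One power-of-two stage per bit of ediff, as in PROC2 (32) and
    PROC3 (16, 8, 4, 2, 1). mant2 holds the hidden one at bit 52.
    """
    if ediff >> 6:
        # ediff(11 downto 6) is nonzero: the operand shifts out entirely.
        return 0
    for stage in (32, 16, 8, 4, 2, 1):
        if ediff & stage:
            mant2 >>= stage
    return mant2


def normalize(mant_result):
    """Find the leading one of the 54-bit adder result.

    Returns (fraction, shift). shift is -1 when bit 53 is set (the result
    is shifted right by one), otherwise the left-shift count k that moves
    the leading one from bit 52-k up to the hidden-bit position.
    """
    if (mant_result >> 53) & 1:
        return (mant_result >> 1) & MASK52, -1
    for k in range(52):                       # bits 52 down to 1
        if (mant_result >> (52 - k)) & 1:
            return (mant_result << k) & MASK52, k
    return 0, 0                               # final ELSE branch: all zeros


if __name__ == "__main__":
    print(normalize((1 << 40) | 1))           # leading one at bit 40
```

Expressed this way, each ELSIF branch of the listing corresponds to one value of `k`, which is convenient when comparing simulation output against the packed fraction and the exponent adjustment.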
Vita 
Kirk Andrew Baugher was born on January 2, 1980 in Enterprise, Alabama. Kirk
was raised all across the country, spending his childhood and adolescence in
Alabama, Washington, Virginia, Tennessee, and Texas. He began attending college at
the University of Tennessee, Knoxville in the fall of 1998. During his undergraduate
term at the University of Tennessee, Kirk co-oped for one year with the Tennessee Valley
Authority and graduated with a Bachelor of Science degree in Computer
Engineering and a minor in Engineering Communication and Performance in 2003.
Immediately following the completion of his undergraduate degree, Kirk started his
graduate degree. One year later, in 2004, Kirk graduated with a Master of Science in
Electrical Engineering.
Kirk will be starting his career as an engineer with Honeywell in Clearwater,
Florida.
