Memory optimization for a parallel sorting hardware architecture by Lu, Shih-Lien
AN ABSTRACT OF THE THESIS OF  
Dale A. Beyer for the degree of Master of Science in Electrical and Computer 
Engineering presented on May 22, 1997. Title: Memory Optimization for a Parallel 
Sorting Hardware Architecture 
Abstract Approved: 
Shih-Lien Lu 
Sorting is one of the more computationally intensive tasks a computer performs. 
One of the most effective ways to speed up the task of sorting is by using parallel 
algorithms. When implementing a parallel algorithm, the designer has to make several 
decisions. Among the decisions are the algorithm and the physical implementation of the 
algorithm. A dedicated hardware solution is often physically quicker than a software 
solution. 
In this thesis, we will investigate the optimization of a hardware implementation 
of max-min sort. I propose an optimization to the data structures used in the algorithm. 
The new data structure allows quicker sorting by changing the basic workings of the 
max-min sort. The results are presented by comparing the new data structure with the 
original data structure. The thesis also discusses the design and performance issues 
related to implementing the algorithm in hardware. 
© Copyright by Dale A. Beyer
May 22, 1997
All Rights Reserved
Memory Optimization for a Parallel Sorting Hardware Architecture
By  
Dale A. Beyer  
A THESIS  
submitted to  
Oregon State University  
in partial fulfillment of 
the requirements for the 
degree of 
Master of Science 
Presented: May 22, 1997  
Commencement June 1998
Master of Science thesis of Dale A. Beyer presented May 22, 1997.
APPROVED: 
Major Professor, Representing Electrical and Computer Engineering 
Head of Department of Electrical and Computer Engineering
Dean of Graduate School
I understand that my thesis will become part of the permanent collection of Oregon State 





Acknowledgements:
I would like to thank my roommate Jeff McNeal, without whose help I might
never have developed some of the algorithmic improvements. I also thank all my friends
for believing in me and putting up with my crazy days of writing and proofreading the 
thesis. I also would like to thank Professor Ben Lee for his input and advice on methods 
to thoroughly analyze the performance of the parallel algorithms presented. Additionally, 
I would like to thank Professor Shih-Lien Lu, who had the faith in me being able to 
accomplish so much in so little time. Finally, I would like to thank my parents whose 
love, support, and proofreading (thanks Mom) helped me throughout my academic 
endeavors.
TABLE OF CONTENTS
1. INTRODUCTION
1.1. Background
1.2. Custom Designs
1.3. Speeding Up Sorting
1.4. Approach and Outline of the Thesis
2. PARALLEL SORTING
2.1. Single Processor Sorting
2.2. Analyzing Parallel Sorting
2.2.1. Lower Bound
2.2.2. Measurement
2.2.3. Classification of Parallel Systems
2.3. Parallel Sorting Algorithms
2.3.1. Odd-Even Sort
2.3.2. Enumeration Sort
2.3.3. MM-Sort
2.4. Limitations of Parallel Sorting Algorithms and Summary
3. HARDWARE COMPONENTS DESCRIPTION
3.1. SRAM
3.1.1. Cell Structure
3.1.2. Chip Interface
3.1.3. Timing and Applications
3.2. DRAM
3.2.1. Cell Structure
3.2.2. Interface Signals
3.2.3. Timing and Applications
3.3. FPGAs
3.3.1. Structure
3.3.2. Timing
3.3.3. Applications
4. MAX-MIN PROCESSOR DESIGN
4.1. Max-Min Heap Structure
4.1.1. Data Structure Algorithms
4.1.2. Speed of Creation
4.1.3. Estimated Running Time
4.2. Insertion Sort Structure
4.2.1. Algorithm Used
4.2.2. Speed of Creation
4.2.3. Estimated Running Time
4.2.4. Possible Optimizations
4.3. SRAM Insertion Sort Structure
4.3.1. Data Structure Algorithm
4.3.2. Speed of Creation
4.3.3. Estimated Running Time
4.4. Comparison of Processor Designs
5. MM-SORT PERFORMANCE
5.1. Overall Speed
5.2. Variation Using SRAM
5.3. Scalability
5.4. SRAM Design Decisions
5.4.1. Hole Size
5.4.2. DRAM Pointer Stack
5.4.3. SRAM Speed
6. DESIGN OVERVIEW AND CONCLUSION
6.1. Design Guidelines
6.1.1. Communication Interface
6.1.2. Processor Memory Interface
6.1.3. Sorting Network
6.1.4. Interface Chip
6.2. Future Research
6.3. Conclusion
BIBLIOGRAPHY
APPENDICES
APPENDIX A SRAM Insertion Sort Processor Code
APPENDIX B Max-Min Heap Processor Code
APPENDIX C Optimized Insertion Sort Processor Code
LIST OF FIGURES
2.1  A comparison unit for odd-even sorting
2.2  An eight element odd-even sorter
2.3  A four element serial odd-even sorter
2.4  Data representation for a heap
2.5  Data representation for a max-min heap
3.1  A six transistor SRAM cell
3.2  Two implementations of a DRAM cell
3.3  FLEX 10k FPGA family layout
4.1  Total number of memory accesses to create a max-min heap
4.2  Creation constant of a max-min heap
4.3  Total number of memory accesses to create a sorted DRAM array
4.4  Creation constant of a sorted DRAM array
4.5  A memory array with data holes for quicker shifting
4.6  Data storage method for a SRAM insertion sort processor
4.7  Effect of hole size on the SRAM insertion sort processor
4.8  Effect of hole size on the SRAM insertion sort processor (zoomed in)
4.9  Memory storage efficiency of the SRAM insertion sort processor
4.10  Memory accesses per element to create a sorted array
4.11  Optimal hole size choice for creating a sorted array
5.1  Chip layout to build a MM-sorter
5.2  Original sorting method with 2 processors
5.3  Original sorting method with 16 processors
5.4  Original sorting method with 64 processors
5.5  MM-sort running time for 16 processors (including unloading)
5.6  MM-sort alternating the global and local sort for 2 processors
5.7  MM-sort alternating the global and local sort for 16 processors
5.8  MM-sort alternating the global and local sort for 64 processors
5.9  Speed improvement by adding processors to sort 16k elements
5.10  Scalability of problem with 8k elements per processor
5.11  Number of lost DRAM pods for 16 processors sorting 64k elements
MEMORY OPTIMIZATION FOR A PARALLEL SORTING
HARDWARE ARCHITECTURE
CHAPTER 1  
INTRODUCTION  
1.1  Background 
Design and implementation of computing machines have fascinated many 
generations. Only recently in the 20th century was the first electronic computer built. 
One of the early machines, the ENIAC, was capable of 5,000 addition or 14 
multiplication operations in a second [1]. This early technology used vacuum tubes, over 
17,000 to be more exact, which frequently burned out. The invention of the silicon 
transistor improved the reliability and speed of the computer while reducing the size 
necessary for a single machine. The technology rapidly improved from SSI (Small Scale 
Integrated circuits) to LSI (Large Scale Integrated circuits) to ULSI (Ultra Large Scale 
Integrated circuits). Initially, in the SSI technology, only single logic gates were 
available on a chip like the NAND, NOR, and NOT gates. The improved technology of 
LSI and ULSI circuits, however, allowed a designer to place thousands of these gates on 
a single chip. 
With multiple gates available on a single chip, a designer could build complex logic 
including registers, memories, and eventually processors. As uniprocessors improved, 
designers achieved an increase in speed by connecting several processors in parallel. 
After the success of parallel computers, many modern chipsets now incorporate the 
ability to link several processors in a single system. With the ability to process several 
tasks simultaneously, engineers have the power to complete problems quicker by dividing 
a single problem into several smaller tasks and completing the work in parallel. An
example of such a machine is the Tera-Flop machine recently developed by Intel.
By combining 7,264 Pentium Pro® processors [2], engineers built a machine
capable of completing 1.06 trillion floating point operations in a single second. As a
point of comparison, the ENIAC would take roughly 6.7 years (if all calculations are 
additions and within the 10 digit precision) to complete the same number of calculations 
as the Tera-Flop machine does in one second. 
The resources available to a computer system have increased along with single and
parallel processing power. In the early days of the personal computer, a system needed
only 640 Kbytes of memory and 10 megabytes of disk space to run the majority of
programs. Now systems running Windows 95® require at least 8 megabytes of memory
and roughly 500 megabytes of disk space to run modern programs. Large scale computer
systems, like the Tera-Flop machine, can have 454 gigabytes of system memory and over
2 terabytes of disk space [2]. In business applications, many computers require large 
resources for applications like data warehouses, a system that aids in management 
decisions by maintaining a large collection of data based on previous decisions [3]. It is 
reasonable for a computer in the data warehouse application to contain over one terabyte 
of data in a single database. 
1.2  Custom Designs 
When an engineer designs a function for a computer application, there are three 
different ways to implement the design. The first choice is to use general purpose
processors in the design and write software that completes the function. Designs
solved in software often compromise speed, since it may take multiple instructions to
perform a single task. Additionally, a function can be intentionally slow to implement in
software, such as the Data Encryption Standard (DES) [4]. The opposite method is to
design the function completely in hardware, by designing an Application Specific
Integrated Circuit (ASIC). The downside to an all-hardware approach is that ASICs have
a long design and implementation cycle and are very expensive unless used in high
volume. The last design choice uses firmware, which is equivalent to software written
for hardware. The limitations of firmware are the amount of logic available in the part
and the maximum attainable clock rate.
The larger programmable parts, such as Electronically Programmable Logic Devices 
(EPLDs), or the even larger Field Programmable Gate Arrays (FPGAs), allow an 
engineer plenty of room to implement many functions. A single part from the Altera
FLEX 10K family of FPGAs, the FLEX10K50, contains over 50 thousand usable gates [5]
to build an application specific processor. In the field of Digital Signal Processing (DSP),
for example, several digital filters can be implemented in a single chip of this size. 
Recent technology has made moving from an FPGA design to an ASIC easier. The
movement from firmware to production ASICs allows an engineer to prove the design
and work out the design flaws. Using programmable parts improves the design cycle
with lower risk and cost than using only ASICs. Then when full production of the device
is required, the designer can convert the programmable part into a smaller, less expensive
solution, an ASIC. In very high speed communication or advanced processor design,
however, the programmable parts are often too slow or small to be used as prototypes. 
1.3  Speeding Up Sorting 
One of the classical problems in computer science is the task of sorting. Besides 
games and screen savers, sorting is one of the more computationally intensive tasks. 
Numerous algorithms are efficient in improving the speed of a sorting task dependent on 
the criteria of the problem and the physical limitations of the computer system, i.e. 
number of processors, amount of memory, and accessibility of data. In a parallel 
application, the algorithm divides the larger problem into smaller parts. The entire task
then takes the amount of time needed to sort a single smaller piece. Each algorithm has an
advantage depending on the criteria of data size, data accessibility, and design
constraints.
A quick parallel algorithm to sort data would greatly speed up the analysis time for 
applications with massive amounts of data, including the government census and data 
warehouse applications. Optimized software sorting applications can improve
performance of the system by 50-80% [6]. Building sorting into the hardware of a
computer system not only improves the performance, but also reduces the processor load,
making the processor available for other tasks as the sorting occurs.
1.4  Approach and Outline of the Thesis 
In this thesis, I took the practical engineering approach to solving the sorting task. 
The hardware described in the thesis is intended for a computer system where large
database sorts regularly occur. The data is made available serially, either from reading it
off the disk, or from copying the values out of main memory. Many of the parallel
algorithms assume the same number of processors as data values. While this is possible
on a small scale, when the data approaches several gigabytes ($10^9$), the algorithms are not
realistic from an engineering standpoint. To reduce the design risk, standard
components, such as FPGAs and standard memory DRAM or SRAM, are targeted for 
use. Discussions about tradeoffs, including power, cost, and speed, are mentioned when 
applicable. 
The next chapter of the thesis presents the different algorithms of parallel sorting.
The algorithms include the MM-sort algorithm, which is simulated in the later chapters.
Chapter 3 discusses the standard hardware components used in the design. Chapter 4
describes the different memory data structures for the MM-sort processors and their
relative performance. Chapter 5 compares the data structures as the multiple processors
sort the data in parallel, including discussions about scalability and relative performance.
Finally, Chapter 6 discusses the final development necessary for implementing the design
and possible further research.
CHAPTER 2  
PARALLEL SORTING  
When it comes to the topic of sorting, there seem to be many algorithms which 
claim to be the quickest. But how do they measure the speed of operation? When 
analyzing an algorithm for hardware implementation, there are several essential points to 
keep in mind. First and most importantly, is the algorithm easily realizable in dedicated 
hardware? For example, an algorithm which uses a division by 2 is easy to implement, 
by using a bit shift right, while division by 3 is much more complex in terms of the 
hardware necessary. Another point is the amount of memory required to implement the 
algorithm. Some algorithms are in-place, which means they require no more memory
than the data occupies, while other algorithms require extra memory for the recursive nature of the
algorithm, like quicksort [7]. Quicksort divides a list into two sublists by choosing a
pivot value and moving the values larger than the pivot value into one list and the smaller 
values into the other list. Then the algorithm recursively sorts the sublists by choosing a 
new pivot and subdividing each of the sublists. 
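The quicksort description above can be sketched in a few lines of Python. This is only an illustration and not code from the thesis: the function name `quicksort` and the choice of the first element as the pivot are assumptions made here. The sketch builds new sublists at every level of recursion, which shows where the extra memory goes.

```python
def quicksort(values):
    """Return a sorted copy of values using a list-based quicksort."""
    if len(values) <= 1:
        return list(values)
    pivot = values[0]  # assumed pivot choice; any element would work
    rest = values[1:]
    smaller = [v for v in rest if v <= pivot]  # values not larger than the pivot
    larger = [v for v in rest if v > pivot]    # values larger than the pivot
    # Each recursive call allocates new sublists, so memory beyond the
    # original data is required.
    return quicksort(smaller) + [pivot] + quicksort(larger)
```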
Before we get deeply into the topic of sorting, the definition of a sorted list should
be clear. A list of elements $\{a_1, a_2, a_3, \ldots, a_n\}$ is sorted when the key value, the value
which is being sorted, is in increasing or decreasing order. For example, in a decreasing ordered list
of n elements, $a_k \ge a_{k+1}$ for all $k = 1, 2, \ldots, n-1$. Values that are not quantifiable are not
good candidates for a sorted list, since the comparison operation does not have a real
definition.
In this chapter, the basic tools to analyze sorting functions are developed. Then
several common parallel sorting algorithms are presented. Finally, the parallel algorithm
chosen for the remainder of the thesis is presented. The difficulties of implementing all
the algorithms in hardware when sorting a large number of elements are discussed.
2.1.  Single Processor Sorting 
Before delving into the world of parallel algorithms, let's define some of the
common terms used when dealing with sorting. Speed is the major concern of most
algorithms, so how do you measure it? The notation used for measuring an algorithm's
running time is $O(f(n))$. Pronounced big Oh, the measure indicates the asymptotic upper
bound behavior, or order, of the algorithm. The formal definition of big Oh notation is
that a function $g(n)$ is $O(f(n))$ if and only if there exist some $n_0$ less than infinity and $c$
such that $g(n) < c \cdot f(n)$ for all $n > n_0$ [8]. The function variable, n, in the case of sorting is
the number of elements in the list. Thus an algorithm which sorts in $O(n^2)$ time takes
approximately 4 times as long to sort twice as many elements.
For a given algorithm, normally there are two cases for consideration, the average
case and the worst case. A designer needs to consider both when designing a system to
estimate the maximum and average sorting times. Due to the difficulty of determining
the true worst case for some of the data structures, average case analysis is used for the
majority of the thesis.
When sorting a list with a single processor there is a lower bound on the speed of 
a given algorithm. This lower limit is $O(n \lg n)$ [8,9]. Here the term $\lg n$ refers to the
logarithm base 2, or symbolically $\log_2 x = \lg x$. The lower bound tells algorithm
designers when they have achieved an optimal algorithm. An optimal order algorithm 
indicates to the designer that further optimizations will only reduce the constant and not 
improve the asymptotic running time. 
One important note is that while an algorithm may have better asymptotic behavior,
for a given n the actual performance could be worse. For example, an algorithm which
takes $n^2/8$ turns is slower, in terms of the order of operation, than one which takes
$10 n \lg n$ turns: $O(n^2)$ versus $O(n \lg n)$. For small values of n, however, the $O(n^2)$ algorithm is faster. Why
is this? For smaller values of n, the constants that are ignored in the big Oh notation can
play a significant role. In the example, the first algorithm sorts a list of 64 elements in
512 turns. In comparison, the "optimal" $O(n \lg n)$ algorithm takes 3840 turns. Therefore,
it is important to pay close attention to the limits of your problem as you consider an
algorithm.
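The comparison between the two cost functions can be checked numerically. The following Python sketch is an illustration only; the function names are made up here, and the search limit in `crossover` is an arbitrary assumption. It evaluates both functions at n = 64 and searches for the list size at which the quadratic algorithm first becomes the slower one.

```python
import math

def quadratic_turns(n):
    """Turns taken by the example O(n^2) algorithm, n^2 / 8."""
    return n * n / 8

def nlogn_turns(n):
    """Turns taken by the example O(n lg n) algorithm, 10 n lg n."""
    return 10 * n * math.log2(n)

def crossover(limit=10_000):
    """Smallest n >= 2 for which the quadratic algorithm needs more turns."""
    for n in range(2, limit):
        if quadratic_turns(n) > nlogn_turns(n):
            return n
    return None  # no crossover found below the assumed limit

print(quadratic_turns(64), nlogn_turns(64))
print(crossover())
```

Under these two cost functions the quadratic algorithm stays ahead until the list holds several hundred elements, which is why the expected problem size matters when choosing an algorithm.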
A sample single-processor sorting algorithm is insertion sort with a binary search
used for placement. Insertion sort starts with a single element and,
through induction, uses it as a sorted list. The algorithm inserts the next element into the
list in its sorted position. The algorithm keeps the sorted nature of the list as each of the
remaining elements is added to the list. To find the place for inserting the element, a
binary search compares the elements of the sorted list. A binary search starts with the
middle of the list. After comparing with the middle element, half of the sorted list is
eliminated from consideration. Repeating the comparison step with the remaining
portion of the list will eventually find the spot to insert the element. The maximal
number of comparisons required is $\lg n$ [8].
The worst case number of comparisons for binary search insertion sort when
sorting a list of n elements is derived in equation 2.1 as follows. The comparisons per
turn are equal to $\log_2$ of the number of elements in the list. Using the properties of
logarithms, the addition of logs turns into the multiplication of the terms. The factorial
can be approximated as shown in equation 2.2, which results in $O(n \lg n)$ comparisons.
$\sum_{i=1}^{n} \lg i = \lg\left(\prod_{i=1}^{n} i\right) = \lg n! \approx \frac{n}{2} \lg \frac{n}{2} = O(n \lg n)$   Eq. 2.1
$n! > n(n-1)(n-2)\cdots(n/2) > (n/2)^{n/2}$   Eq. 2.2
While the number of comparisons for this algorithm is optimal, the number of
memory moves necessary for inserting all the items is $O(n^2)$. We will revisit the insertion
sort algorithm in Chapter 4 and look at ways to improve the performance of the memory
moves, which would be the limiting function in a hardware implementation unless a
specialized shiftable memory was designed.
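A short Python sketch makes the gap between comparisons and memory moves visible. It is an illustration under assumptions, not the processor code from the appendices: the function name and the way moves are counted (one move per element shifted to make room) are choices made here.

```python
def binary_insertion_sort(values):
    """Sort a copy of values; return (sorted_list, comparisons, moves)."""
    data = []
    comparisons = 0
    moves = 0
    for x in values:
        lo, hi = 0, len(data)
        while lo < hi:  # binary search for the insertion point
            mid = (lo + hi) // 2
            comparisons += 1
            if data[mid] <= x:
                lo = mid + 1
            else:
                hi = mid
        moves += len(data) - lo  # elements that must shift up by one slot
        data.insert(lo, x)
    return data, comparisons, moves

# A descending input forces every insertion to the front of the list.
result, comparisons, moves = binary_insertion_sort(list(range(16, 0, -1)))
print(result, comparisons, moves)
```

For a descending input of 16 elements every insertion lands at the front, so the move count is 0 + 1 + ... + 15 = 120, while the comparison count stays within 4 per insertion.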
2.2.  Analyzing Parallel Sorting 
For sorting, applying parallel processing speeds up the computation time by 
reducing the size of the problem. One method of speeding up sorting is to partition the 
sorting task into several smaller tasks, such as dividing the list by the number of 
processors and letting each processor sort the smaller list. Another method is to reduce 
the sorting problem to a task independent of the other processors for each element in the 
list. Each algorithm has advantages and disadvantages. There are several different ways
to measure the performance of the algorithm, including processor speedup, cost, and
processor efficiency.
A point of disagreement in algorithm analysis is whether to include the 
measurement of loading and unloading time of the processors with the data [9]. While 
many authors ignore the loading requirement, computers have the need to communicate 
with the outside world. The communication interface is not instantaneous, nor does it 
have infinite bandwidth. A sorting algorithm that takes advantage of sorting as the data is 
loading is quicker in a real world environment than algorithms that require all the data to 
be present before sorting. For the remainder of the thesis, the time to load the data and 
the utilization of this time will be mentioned for each of the algorithms. 
The size of the problem considered is large enough that an accepted scenario is for
the data to come off a non-volatile memory source such as a hard disk or tape drive.
With the data rates for an Ultra Wide SCSI drive at 40 MB/s [10], a list of two million
32-bit elements would take over 3 seconds to load into the array. The loading time is a
significant portion of the operating time, since a single DRAM memory access only takes
100 ns [11], which means during loading there is enough time for 30 million memory
accesses.
2.2.1. Lower Bound 
It may seem surprising to the reader that the same lower bound, which exists for a 
single processor sorting algorithm, exists for the parallel algorithm. The lower bound of 
a parallel algorithm is calculated by multiplying the number of processors used to sort the 
data and the parallel running time order. If it were possible for a parallel algorithm to 
break the $O(n \lg n)$ lower bound, then a single processor could perform the tasks of the
parallel algorithm serially and violate the single processor lower bound. 
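The argument can be written as an inequality. The symbols $p$ (number of processors), $T_p(n)$ (parallel running time), and $c$ (the constant of the single processor bound) are notation introduced here for illustration and do not appear in the original text.

```latex
p \cdot T_p(n) \;\ge\; c \, n \lg n
\quad\Longrightarrow\quad
T_p(n) \;\ge\; \frac{c \, n \lg n}{p}
```

With $p = n$ processors, for instance, the parallel running time can be no better than $c \lg n$.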
If the loading time is taken into account then no algorithm can sort quicker than $O(n)$,
since it will take $O(n)$ to load and unload the data into the processors. No algorithm can
output the beginning of the sorted list without having the entire list, unless there is some 
prior knowledge of the remaining elements not in the processors. Since the data sorted is 
not of a special form, there is no guarantee of such a condition. 
2.2.2. Measurement 
The most important measurement of a parallel algorithm is the running time of the 
algorithm. Often there are two stages to a parallel algorithm: a parallel computational
step where all the processors sort their individual data, and a movement or routing step
where the individual processors merge and combine their results. The main parallel
speedup is gained from all the processors working in parallel during the first step. The 
second portion of the algorithm is considered part of the overhead of a parallel algorithm, 
and the best algorithms minimize the time spent merging. 
The speedup of the sorting algorithm can be measured by taking the ratio of the
worst case running time of the fastest sequential algorithm to that of the parallel algorithm.
The speedup ratio, hopefully greater than 1, is the speedup gained by doing the sorting in
parallel. In an algorithm using p processors, the speedup ratio would optimally achieve a
value of p. The overhead of dividing the list and reconstructing the smaller lists
into the original problem, however, causes the speedup to be less than p.
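The measures named in this section can be expressed as small functions. The sketch below is illustrative: speedup and cost follow the descriptions in the text, while the definition of efficiency as speedup divided by the number of processors is a common convention assumed here and is not spelled out in the thesis. The sample running times are made up.

```python
def speedup(t_sequential, t_parallel):
    """Ratio of the fastest sequential running time to the parallel running time."""
    return t_sequential / t_parallel

def cost(num_processors, t_parallel):
    """Number of processors multiplied by the parallel running time."""
    return num_processors * t_parallel

def efficiency(t_sequential, t_parallel, num_processors):
    """Speedup per processor (assumed definition); 1.0 would be ideal."""
    return speedup(t_sequential, t_parallel) / num_processors

# Hypothetical timings: 1600 time units sequentially, 200 units on 16 processors.
print(speedup(1600, 200), cost(16, 200), efficiency(1600, 200, 16))
```

In this made-up case the 16 processors give a speedup of only 8, so half of the processing capacity is lost to overhead.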
Another consideration in ULSI technology is the amount of silicon space that is used.
The amount of silicon space is directly proportional to the cost of the chip. Two similar 
algorithms can have extremely different hardware requirements. Therefore, one 
algorithm could be more efficiently built than the other. The efficient processor allows 
for a more cost effective solution due to less silicon area or a more powerful design by 
putting more processors in the extra silicon area. 
A measure of an algorithm that is not considered by most algorithm designers is the 
scalability of the design. Flexibility in the algorithm to add or remove processors 
depending on the application improves the feasibility of using the design. Several
algorithms require n or $n^2$ processors, which limits the algorithms' use to small applications.
If there are more than n elements in the list, then new hardware needs to be designed to 
handle the extra elements. 
2.2.3. Classifications of Parallel Systems 
Several different ways exist to design each processor within the entire parallel 
system. The first method is to have a simple design for each of the processors. All the 
processors are running the exact same instructions but on different data. The instructions 
could be as simple as a compare unit which passes the larger value in one direction and 
the smaller the other. Each processor has limited memory, usually temporary registers,
and a centralized control unit serves all the processors. Therefore, an algorithm which uses
this architecture needs at least as many processors as data elements to be sorted.
The algorithms which use the single instruction multiple data method for processor
design usually require a very high bandwidth memory called Parallel Random Access
Machine (PRAM) [12]. PRAM is an abstract model of a parallel computer that has
infinite memory bandwidth. All the processors in a PRAM can be connected to the same
global memory and sort the data in the memory in which it is stored. Many theorists don't
explain what happens when multiple writes occur to the same location. Algorithms do
exist, however, to resolve such conflicts. The PRAM model is attractive from a 
programming point of view since all the data is available to any processor at any time. 
The hardware realization of such a memory, on the other hand, is difficult to implement 
and would cause large latencies for any access due to bus contention. 
The other extreme of parallel design is a multi-computer approach. Here general 
processors are connected through either a high speed network like a mesh or something
simple like a Local Area Network (LAN). Parallel algorithms using the multi-computer
architecture rely on minimal communication between the processors, since the network
has a considerable delay and affects the performance of the system. Implementation of
algorithms on multi-computers is normally in software, and it uses an already existing 
hardware platform. 
The final approach is a multi-computer approach with dedicated hardware. The 
processors have individual memory like the multi-computer approach, but the processors 
have been specialized to accomplish the task of the algorithm. Communication between 
processors is not an issue, since the processors have a specialized network dedicated for 
the algorithm's communication. An algorithm on this architecture is normally very 
scalable by just scaling the communication network.
2.3.  Parallel Sorting Algorithms 
2.3.1. Odd-Even Sort 
Perhaps the easiest of the parallel sorting algorithms to understand is the odd-even 
sort. The algorithm sorts by successively merging larger and larger lists in parallel until 
the entire list has been merged. The basic operation of the algorithm is to break the n
elements into n lists. With each list containing a single element, you combine it with
another single element list to create a two element list. The combination step is
performed for all the elements. Figure 2.1 shows the processor used in each step, a simple
comparison unit. The unit routes the maximum element of $a_0$ and $a_1$ to the output marked
H, and the minimum element to the output marked L. With a single comparison unit, two 
elements can be sorted. 
[Figure 2.1: a comparison unit with inputs $a_0 = 15$ and $a_1 = 7$; output L carries 7 and output H carries 15.]
Figure 2.1 A comparison unit for odd-even sorting
For larger lists the odd-even sort combines the lists a and b in the following
manner. A list c is created by alternating the odd elements of lists a and b. So the
beginning elements of c would be $a_1, b_1, a_3, b_3$, etc. Another list d is created out of the
even elements from lists a and b. Several comparison units sort lists c and d and then
combine them into a final output e. For lists c and d each containing n elements and $i = 1, 2, \ldots, n-1$,
$e_{2i} = \min(c_{i+1}, d_i)$ and $e_{2i+1} = \max(c_{i+1}, d_i)$, with $e_1 = c_1$ and $e_{2n} = d_n$. An eight element sorter
is shown in Figure 2.2.
[Figure 2.2: two merging stages, labeled "2x2 element Merging" and "4x4 element Merging".]
Figure 2.2 An eight element odd-even sorter
Figure 2.2 shows how odd-even sort uses many processors to parallelize the task,
even though each processor is a simple comparison unit. The sorting of n elements can
be done at the rate of $O(\lg^2 n)$ [9]. Since the data in the problem which we are solving is
handled serially, the final order of the algorithm would only be $O(n)$. Even with the $O(n)$
running time providing a significant improvement over the $O(n \lg n)$ running time of a single
processor algorithm, the number of processors necessary for the speed increase is
$O(n \lg^2 n)$. For small lists, the number of additional processors is minimal for the speedup.
The largest size list, however, needs to be specified to allow a designer to accommodate
the maximum sized list.
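The merge rule given earlier in this section can be tried out in software. The Python sketch below is a sequential model written for illustration, not a description of the hardware: the function names are assumptions, list lengths are assumed to be powers of two, and each `min`/`max` pair stands in for one comparison unit.

```python
def odd_even_merge(a, b):
    """Merge two sorted lists of equal power-of-two length."""
    n = len(a)
    if n == 1:
        return [min(a[0], b[0]), max(a[0], b[0])]  # one comparison unit
    c = odd_even_merge(a[0::2], b[0::2])  # odd-numbered elements a1, a3, ...
    d = odd_even_merge(a[1::2], b[1::2])  # even-numbered elements a2, a4, ...
    e = [c[0]]                            # e_1 = c_1
    for i in range(1, n):                 # i = 1 .. n-1 in 1-based notation
        e.append(min(c[i], d[i - 1]))     # e_2i     = min(c_(i+1), d_i)
        e.append(max(c[i], d[i - 1]))     # e_(2i+1) = max(c_(i+1), d_i)
    e.append(d[n - 1])                    # e_2n = d_n
    return e

def odd_even_merge_sort(values):
    """Sort a list whose length is a power of two (assumed)."""
    if len(values) <= 1:
        return list(values)
    half = len(values) // 2
    return odd_even_merge(odd_even_merge_sort(values[:half]),
                          odd_even_merge_sort(values[half:]))
```

In hardware the comparisons within one merging stage run at the same time, whereas this model simply executes them one after another.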
A way to reduce the number of processors necessary is to implement the 
comparison processors in a single array. Then each processor is a single bit comparison 
unit connected to shift registers to hold the values in-between the comparisons. The 
values are processed with the most significant bit first. The single bit compare unit only 
has to handle a single value at a time. If the new value is less than the current value it 
will be stored in the lower shift register. Otherwise the new value is passed along the 
array. Figure 2.3 shows an example of a 4 element array sorter. Before processing data 
the lower shift registers are reset to an infinite value, all ones. To read the data out, a 
zero value, all zeros, is passed in and the sorted list will emerge from the end of the array. 
[Figure 2.3: the input "In" feeds a chain of four compare cells, each connected to shift registers, ending at the output "Out".]
SR = Shift Register
Figure 2.3 A four element serial shift sorter
The algorithm reduces the processing time to $O(n)$, which is as long as it takes to get
the data, and the number of processors to n. An advantage the array sort has over the
standard odd-even sort is that the processors sort as the values are loaded. A hardware
implementation of the array sorter has been produced which can sort 512 16-bit keys
[13]. While the chip is capable of sorting 512 keys, extending the chip to support a
million 32-bit keys would be difficult or require too many chips to place on a single
board.
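The behavior of the array sorter can be modeled at the word level. The Python sketch below is a simplification made for illustration: it compares whole keys instead of single bits, processes one value at a time instead of pipelining, and assumes that a cell passes its displaced value along when it stores a smaller one. The class name, function name, and 16-bit key width are assumptions.

```python
INFINITY = 0xFFFF  # all ones for an assumed 16-bit key
ZERO = 0x0000      # all zeros, used to flush the array

class SerialArraySorter:
    """Word-level model of a chain of compare cells with storage registers."""

    def __init__(self, num_cells):
        self.cells = [INFINITY] * num_cells  # reset state: every cell holds all ones

    def push(self, value):
        """Pass one value down the array; return what leaves the last cell."""
        for i in range(len(self.cells)):
            if value < self.cells[i]:
                # Store the smaller value and pass the displaced one along.
                self.cells[i], value = value, self.cells[i]
        return value

def array_sort(values):
    """Load the values, then flush with zeros; results emerge largest first."""
    sorter = SerialArraySorter(len(values))
    for v in values:
        sorter.push(v)  # outputs during loading are all-ones padding
    return [sorter.push(ZERO) for _ in values]

print(array_sort([15, 7, 9, 3]))
```

In this model the sorted list leaves the array in descending order, because the largest key sits in the last cell when the flush begins.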
2.3.2. Enumeration Sort 
Another sorting technique, which is a brute force approach to the problem of sorting,
is enumeration sort. Enumeration sort uses a processor per element and counts the
number of elements less than the value associated with the processor. The final count
that the processor holds is the location of the processor's element in the sorted array. In
the case where equal values exist in the list, a processor counts an equal value as less
than its own element only when that value has a lower index than its own element. For
example, let the processors p1, p2, ..., pn process elements a1, a2, ..., an. If
elements aj and ak are equal and j < k, then processor pk will count element aj as less
than ak, while processor pj will count element ak as greater than aj.
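The counting rule, including the tie-break on the index, can be sketched in software as follows. This is a sequential illustration in which each pass of the outer loop stands in for one processor; the function name is hypothetical.

```python
def enumeration_sort(a):
    """Place each element at the position given by its rank."""
    n = len(a)
    out = [None] * n
    for k in range(n):                 # work of "processor" k
        rank = 0
        for j in range(n):
            # Count smaller values, and equal values with a lower index.
            if a[j] < a[k] or (a[j] == a[k] and j < k):
                rank += 1
        out[rank] = a[k]               # rank is the final location
    return out


print(enumeration_sort([3, 1, 3, 2]))  # [1, 2, 3, 3]
```

Because of the index tie-break, the two equal values receive ranks 2 and 3 instead of colliding on the same location.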
By inspection the algorithm takes O(n) time, since each element needs to be
examined once by each processor. A quicker algorithm uses n^2 processors; however, it
requires the use of a PRAM. Due to the infeasibility of building the PRAM and the
number of processors, discussion of that algorithm is omitted. Enumeration sort,
however, is well suited to a multi-computer system due to its low interprocessor
communication. Each processor can perform its portion of the task without the need to
modify its execution based on the results from other processors.
Another feature of enumeration sort is that it is scalable. Since each task is
independent, fewer than n processors can be used in the algorithm. When a processor is
finished, it looks for another element and starts calculating the new element's position.
Since the algorithm requires the processors to examine each element, the processors have
a well balanced load. Therefore, each processor takes roughly the same
amount of time to examine an element. An implementation of enumeration sort with 4
processors will be 2 times as fast as enumeration sort with 2 processors in the ideal case.
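The scaling argument can be checked with a small model that shares the elements among p workers and tallies the comparisons each worker performs. The round-robin assignment below is an assumption standing in for "look for another element", and the function name is hypothetical.

```python
def enumeration_sort_shared(a, p):
    """Enumeration sort with p workers; returns (sorted list, work per worker)."""
    n = len(a)
    out = [None] * n
    work = [0] * p                     # comparisons done by each worker
    for k in range(n):
        worker = k % p                 # assumed round-robin assignment
        rank = sum(
            1 for j in range(n)
            if a[j] < a[k] or (a[j] == a[k] and j < k)
        )
        work[worker] += n              # one comparison per list element
        out[rank] = a[k]
    return out, work


data = [4, 2, 4, 1, 3, 0, 2, 5]
print(enumeration_sort_shared(data, 2)[1])  # [32, 32]
print(enumeration_sort_shared(data, 4)[1])  # [16, 16, 16, 16]
```

Each worker's share of the comparisons halves when the number of workers doubles, which is the ideal-case factor of 2 quoted above.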
By removing the need for O(n) processors, the designer does not have to worry
about specifying the maximum size of the list to be sorted. By designing for the reuse
of processors, 16 processors can sort a list of any size. An implementation of odd-even sort
with that number of processors has a maximum number of elements it can efficiently sort.
Admittedly, the processing unit necessary for enumeration sort is more complex than the
simple comparison unit in odd-even sort, but the unlimited list size is an attractive feature
to a designer.
2.3.3. MM-sort
The algorithm I chose to implement in hardware is the MM-sort, or max-min sort.
MM-sort combines several qualities of the other sorting algorithms into one algorithm.
The first and most important quality is scalability. Max-min sort can use
far fewer processors than the number of elements in the list. Like enumeration
sort, the MM-sort divides the work between all the processors. The key advantage over
enumeration sort is that the single processor sorting algorithm is O(n lg n), not O(n^2).
Therefore, the speedup for two processors is close to 2 in comparison to a single
processor running an optimal algorithm. Two enumeration sort processors, in contrast,
would have slower performance than a single processor running an optimal algorithm
like quicksort.
Another quality of the MM-sort is that each processor sorts during the entire process.
Unlike the odd-even sort, where each processor's task is so specialized that it is used once
during the sort, the processor in MM-sort has better utilization. The higher
complexity makes the processor larger, but it can still fit into an FPGA, one of the
requirements for testing the technology.
The MM-sort also takes advantage of being a specialized multi-computer algorithm
and uses a special communication network for interprocessor communication (IPC).
The communication channel and its functionality can fit into a single chip and can operate
quickly. The quick operation minimizes the idle time of the processors. The next
sections explain the basic operation of an MM-sort processor as outlined by the algorithm.
Chapter 4 then develops performance evaluations of a single processor with different
data structures. Chapter 5 discusses the scalability and interaction of multiple processors
in a single sorting unit. Chapter 6 briefly discusses requirements and concerns for
processor implementation.
2.3.3.1.  Heaps 
Max-min sort uses the communication between processors to pass the maximum
and minimum elements from each of the processors' lists. After passing two elements to
the network, the processor receives two new values to add to its list. The two new values
are a piece of a sorted version of all the values communicated by the processors. The IPC
continues until the values each processor sends out to the communication network are the
same as those it receives. The originally proposed data structure to hold the list in each
processor's memory is a max-min heap [14].
The heap is a data structure that is similar to a binary tree. The root value is the
maximum value of the entire list, and each child's value is less than its parent's value.
Each node holds this assertion about the values below it. Figure 2.4 shows an example of
a simple heap and the corresponding representation of the data in an array. For a heap
stored in an array, if the parent's location is denoted Heap[j], the children are at nodes
Heap[2j] and Heap[2j+1]. The root is at location 1 of the array. In Figure 2.4,
notice how the Level 4 values are actually greater than some of the values on Level 3.
The looser organization of data lends itself to quicker creation than a sorted array.
Figure 2.4 Data representation for a heap (tree Levels 1 through 4, and the heap as stored in a linear array)
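The array indexing rule can be made concrete with a short check of the heap property. The sketch below leaves index 0 unused so that the root sits at location 1, as in the text; the sample values are hypothetical and are not the values of Figure 2.4.

```python
def is_max_heap(heap):
    """heap[0] is unused; the root is heap[1], children of j are 2j and 2j+1."""
    n = len(heap) - 1
    for j in range(1, n + 1):
        for child in (2 * j, 2 * j + 1):
            if child <= n and heap[child] > heap[j]:
                return False           # a child exceeds its parent
    return True


# Hypothetical example: level 4 holds 1, 0, 7, 6 and level 3 holds 2, 9, 8, 11.
example = [None, 30, 10, 15, 2, 9, 8, 11, 1, 0, 7, 6]
print(is_max_heap(example))            # True
```

The example is a valid heap even though the Level 4 value 7 is greater than the Level 3 value 2, since the ordering is only enforced along parent-child paths.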
A max-min heap is similar to a regular heap except that the levels alternate
between maximum and minimum nodes. Thus on a maximum level, the parent node j
needs to be greater than its four grandchildren 4j, 4j+1, 4j+2, 4j+3 and its two children
2j and 2j+1. On a minimum level the opposite is true: a parent is less than its
grandchildren and children. The max-min heap is harder to implement algorithmically,
since the algorithm must determine whether the current node is a max or min node. By
counting the number of right shifts the current address takes to equal zero, the hardware
can determine the polarity of the node. An odd number of shifts indicates a maximum
node and an even number of shifts indicates a minimum node. The max-min heap,
however, has the ability to find the maximum and minimum values of the list quickly
without sorting all the data.
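The shift-counting test and the ordering conditions can be sketched in software as follows. The function names and the sample values are hypothetical; in Python the shift count equals the built-in int.bit_length() of the address.

```python
def shifts_to_zero(j):
    """Number of right shifts needed to turn address j into zero."""
    count = 0
    while j:
        j >>= 1
        count += 1
    return count


def is_max_node(j):
    # Root (address 1) needs one shift, so odd counts are maximum levels.
    return shifts_to_zero(j) % 2 == 1


def is_max_min_heap(heap):
    """heap[0] is unused; check each node against children and grandchildren."""
    n = len(heap) - 1
    for j in range(1, n + 1):
        for d in (2*j, 2*j + 1, 4*j, 4*j + 1, 4*j + 2, 4*j + 3):
            if d > n:
                continue
            if is_max_node(j) and heap[d] > heap[j]:
                return False
            if not is_max_node(j) and heap[d] < heap[j]:
                return False
    return True


print([is_max_node(j) for j in range(1, 8)])
# [True, False, False, True, True, True, True]
print(is_max_min_heap([None, 30, 1, 2, 15, 10, 8, 11]))  # True
```

Address 1 is a maximum node, addresses 2 and 3 are minimum nodes, and addresses 4 through 7 are maximum nodes again, matching the alternating levels.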
Figure 2.5 shows an example of a max-min heap with the same data values as
Figure 2.4. The Hasse diagram, a diagram representing the order relationships implicit
within the structure [14], aids in the visualization of the max-min heap. The diagram 
shows the leaf nodes as the pivot point between the maximum and minimum nodes. 
From the data representation, the maximum value is at the root node, and the minimum 









Figure 2.5 Data representation for a max-min heap (the MM-heap as stored in a linear array, and its Hasse diagram in decreasing order)
The max-min heap has the same order as a regular heap for creating the heap and for
inserting and deleting values [14]. Creating the heap is O(n), which is better than the
O(n lg n) needed if the data were sorted into an array. Insertion into the heap is O(lg n),
while insertion into a linearly sorted array is O(n). The only place where the heap suffers
slower performance is in deleting the maximum and minimum values. The max-min heap
requires O(lg n) while a sorted array only takes O(1). Estimating the run time from the
Big Oh notation seems to point to the max-min heap being the best data structure for the
algorithm.
2.3.3.2.  Original Algorithm 
MM-sort, as described by [15], has three phases of operation. During the first
phase the processors load data into a local array without sorting. After all the data is
loaded, the processors create the max-min heaps in parallel. The parallel creation of the
heaps is called the local sorting phase. After the creation of the heaps, the global sorting
phase begins. The object of the global sorting phase is to make processor p1's minimum
value greater than p2's maximum value, likewise for p2's minimum value and p3's
maximum value, and so on up to the last processor. Then each processor has a list that is
a contiguous subset of the final sorted list. The arrangement of the sorted data occurs by
each processor handing its maximum and minimum values into the communication network,
which sorts all the values passed to it. Then the communication/sorting network returns
the two largest values to p1, then the next two largest to p2, until finally it passes the two
smallest values to pk. The global sorting continues until the values that all the processors
pass down as the maximum and minimum values are the same values that come back from
the sorting network.
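The global sorting phase can be illustrated with a small sequential model. The sketch below is an illustration under stated assumptions and not the implementation from [15]: each local list is kept as a sorted Python list instead of a max-min heap, the sorting network is modeled by a call to sorted(), every processor is assumed to hold at least two elements, and the function name is hypothetical.

```python
import bisect


def mm_global_sort(lists):
    """lists[i] is the local list of processor i (index 0 plays the role of p1)."""
    lists = [sorted(lst) for lst in lists]      # stand-in for the local phase
    rounds = 0
    while True:
        # Each processor hands its (maximum, minimum) pair to the network.
        sent = []
        for lst in lists:
            low = lst.pop(0)
            high = lst.pop()
            sent.append((high, low))
        # The network sorts all 2p values and returns them two at a time.
        pool = sorted((v for pair in sent for v in pair), reverse=True)
        returned = [(pool[2 * i], pool[2 * i + 1]) for i in range(len(lists))]
        for lst, (high, low) in zip(lists, returned):
            bisect.insort(lst, high)
            bisect.insort(lst, low)
        rounds += 1
        if returned == sent:                    # nothing moved: finished
            return lists, rounds


result, rounds = mm_global_sort([[1, 5, 2], [9, 3, 7], [8, 4, 6]])
print(result)   # [[7, 8, 9], [4, 5, 6], [1, 2, 3]]
print(rounds)   # 3
```

In this example the third round returns to every processor exactly the pair it sent, so the loop stops with the first processor holding the largest values and the last processor holding the smallest.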
The number of loops necessary in the global sorting phase is O(n/p), where p
represents the number of processors. The time for a single loop is O(lg(n/p)) since it
involves the deletion and insertion of elements into the max-min heap. This gives an
overall running time of O(N lg N), where N = n/p. This running order achieves near ideal
speedup for each processor added to the array. The speedup is less than ideal, however,
because of the delay associated with the sorting network.
The order of the algorithm can be evaluated by looking at a single processor in the
array. Let the size of the list for a processor be N = n/p. In the worst case, every single
value needs to be removed from the processor's list. Since two elements are removed
at once, this takes N/2 steps of O(lg N) each, or O(N lg N) in total. The lg N is due to the
time necessary to adjust the heap after the addition and removal of elements in each step.
As long as the sorting time of the sorting network is minimal, the overall running time is
O(N lg N).
2.3.3.3.  Variations 
The original algorithm is quite efficient. The testing in the original paper,
however, used over 2,000 processors [15]. The concern with over a thousand processors
is the same as with the odd-even type of sort: how do you physically connect all the
processors and synchronize the clocks? When sorting with fewer processors, closer to
64, the connections become more realizable. To sort the same number of elements,
however, n/p becomes very large. The cost of removing and adding elements to the heap
structure increases, which motivates looking for another way to speed up the sorting
time.
A method to speed up the sorting is to presort the data values as they are passed
into the processor array. For a system with 16 processors, the interface chip could presort
every 16 elements. The pre-sorting reduces the amount of work done by the sorting
network in the global sorting phase. The initial movement of the elements gets them
closer to the proper processor. In the worst case, however, the extra hardware doesn't
reduce the sorting time. Actual implementation of the pre-sorting is dependent on the
speed at which the data becomes available to the sorter. The slower the data arrives, the
longer the sorter has to massage the data before it has to accept more data. If the data
arrives quickly, a pipelined sorter, which could be a simple odd-even sorter for just 16
elements, would only incur a delay during the filling of the pipeline. For a system of
thousands of elements, the filling delay would be negligible.
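One way to picture the presorting step is sketched below. It assumes that the interface sorts each group of p incoming values and hands the largest of the group to the first processor, the next largest to the second, and so on; this distribution rule and the function name are assumptions made for illustration.

```python
def presort_and_distribute(stream, p):
    """Sort each group of p values and deal them out, largest to processor 0."""
    lists = [[] for _ in range(p)]
    for start in range(0, len(stream), p):
        group = sorted(stream[start:start + p], reverse=True)
        for i, value in enumerate(group):
            lists[i].append(value)
    return lists


print(presort_and_distribute([3, 9, 1, 7, 8, 2, 6, 4], 4))
# [[9, 8], [7, 6], [3, 4], [1, 2]]
```

In this small example the data happens to arrive at the correct processors already, so the global sorting phase would have nothing left to move; in the worst case the groups give no such help.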
Another possible optimization is a different data structure to hold the memory
values. With only 100 elements, one could quickly design a custom memory. Since the
sorter needs to sort close to a million elements, however, a custom memory would not
fit in an FPGA. The extra overhead associated with the addition and deletion of elements
in a heap could prove costly as the number of elements increases. In addition, the
max-min heap does not end up with a sorted list. The removal of items after the global
sorting phase is done takes O(N lg N).
A different data structure, one that keeps an updated sorted list, could perform
the global sorting on every turn. The sorted list data structure would allow a processor to
overlap the delay of the sorting network with active sorting time. While the network is
sorting the max-min elements, the processor could add a new value from the interface
chip. The combination of the global and local sorting eliminates the early moves in the
global sorting phase where all elements need to get into a single processor's list. For the
first sort in the global sort, all the maximum and minimum values of the sub-lists are
moved at once. All the maximum elements probably belong in the first processor's list,
while all the minimum elements probably belong in the last processor's list. Each
processor, however, can only accept two elements at a time. If the sorting network
moves the maximum and minimum values as they arrive at the arrays, a
collision between maximum values is less likely.
2.4.  Limitations of Parallel Sorting Algorithms and Summary 
The first limitation of the non-scalable algorithms is the limit on the number of
elements to sort. The scalable algorithms, max-min sort and enumeration sort,
have physical limitations as well, based on the amount of memory per processor, but the
addition of extra memory is easier than redesigning the complex connections in an
odd-even sort processor array. To double the capacity of the odd-even sorting network not
only requires more than double the number of processors, but also requires another layout
of chips on the circuit board to accommodate the additional processors. The enumeration
and max-min sorts, however, can just add extra address lines and memory to
accommodate more elements.
Another limitation of all parallel algorithms is the problem of interconnecting
hundreds of chips. Probably even more difficult than processor interconnection is the
synchronization of all the chips. Synchronization of all the clocks requires careful board
layout, and with many chips connected, synchronization at high speeds may not be
possible. In addition, the physical size of a chip in its package puts a limit on the
total number of chips that can fit on a circuit board.
The clock skew problem may not be a limitation of an algorithm which operates
asynchronously, like odd-even sort. In sorting 256 elements, however, the odd-even sort
algorithm requires the movement of 128 elements. These moves are from the final
recombination of the two 128 element lists to produce a 256 element list. While the
number of comparison units may be possible to achieve on a single chip, the sorting of a
list 4 times larger requires 4 times as many interconnection lines. Another problem is the
number of pins or internal connections necessary to move the data. If each element is 16
bits, then the 128 elements become 128 x 16 = 2048 pins, which is not practical. A method
to reduce the number of pins would be to move the data serially. Serial data, however,
would require a system clock that must be distributed to all the chips. The problem then
reduces to trying to synchronize all of the clocks as you scale the design upwards.
A final note is the cost of the design. An algorithm that uses processors
inefficiently will waste both power and silicon area. One of the deciding factors in the cost
of a chip is the amount of silicon real estate used. The larger the chip, the larger the
possibility of a defect in a chip and the fewer chips per wafer. The lower yield and larger
size result in higher costs to produce the chip. Therefore a chip that has a lot of unused
processing power costs more than an efficient implementation which uses the processing
power for the majority of the algorithm.
CHAPTER 3  
HARDWARE COMPONENTS DESCRIPTION  
For the design of a parallel sorter implementing the max-min sort algorithm, several
standardized hardware components are used for reliability. By minimizing the new
components in the system, the design is safer to implement. If you combine several new
technologies at once, for instance a large shiftable memory, the processor, and the sorting
network, then the design has several locations for errors. Since the shiftable memory
would be an ASIC design due to its size, any error requires workarounds, which affect
the performance or greatly increase the design cost of the system. By using standardized
components, the bugs of the algorithm and state machines can be worked out, since the
remainder of the hardware has well specified behavior.
The standard components for a successful implementation include Dynamic Random
Access Memory (DRAM), Static Random Access Memory (SRAM), and Field
Programmable Gate Arrays (FPGAs). The DRAM and SRAM compose the memory
system for storing the list. Each processor in the system has access to its own individual
memory subsystem. Therefore, the processor is responsible for refreshing the DRAM.
The algorithm is simple enough that a single FPGA can hold the logic for a processor. The
lower sorting network chip for the MM-sort can also fit into a single FPGA. To reduce
the interconnect problem, which exists due to the number of processors and the size of the
data path (32 bits), a fast serial interface needs to be used.
3.1.  SRAM 
The high power but quick memory cell is the static RAM, or SRAM. The memory is
static, since the stored values are continually refreshed. The refresh operation is
expensive in terms of power, for it can consist of a D.C. path to ground. The constant
refresh, however, allows the external interface to the memory to be simple and requires
no additional cycles besides the memory accesses.
With the many applications for SRAM, the initial development branched into bipolar, 
NMOS, and CMOS designs [16]. In the early 70's, bipolar transistors were used in 
designs requiring high speed memory cells, like memory caches. NMOS transistor 
designs made efficient use of the silicon and were used for a low cost solution. The 
CMOS transistors had the advantage of no static power dissipation from an active load, 
and were used for low power implementations. 
As CMOS technology improved, the use of NMOS transistors was phased out, for the
advantages of an active load and improved CMOS technology were far greater than the 
decreasing cost difference. Other technologies such as GaAs have been introduced for 
very high speed applications as the need for even quicker memory has arisen. 
3.1.1. Cell Structure 
The standard SRAM memory cell, which stores one bit of data, consists of six 
transistors. These six transistors consist of a latch and two transmission gates. Figure 3.1 
shows the six transistors in a CMOS SRAM cell. The upper transistors of the latch, M3 
and M4, are PMOS transistors and act as the active load. The transmission gates, M5 and 
M6, are the interface to the external bit lines. Both the active value and its complement
are available during each read. A write cycle changes the latched value by using a larger
drive capability, which overcomes the cross coupled inverters.
Figure 3.1 A six transistor SRAM cell
3.1.2. Chip Interface
When accessing SRAM there are two types of interfaces, asynchronous and
synchronous. The major difference between the two interfaces is the
requirement of a system clock. The asynchronous SRAM is implemented using latches
that assert data shortly after the signals are valid. Synchronous SRAM uses the clock pin
to latch the address, and data is available on the bus after a clock edge. Synchronous
memories often have a latency of two clock cycles between the address and the data.
Zero latency SRAM chips, however, are available [17]. Registers internal to the
synchronous SRAM can pipeline accesses to allow a processor to present multiple
addresses in a row. The latency from such a configuration is present for only the first
access. Some synchronous SRAMs still have the asynchronous interface signals for
added control of the output [18].
Due to the high speed access required from SRAM, all the data and address pins are 
placed on the chip package. Large bandwidth memories are expensive due to the high 
pin count on the package. Other types of SRAM memories have separate data lines for 
reading and writing data to the memory. The separate lines, however, are not required for 
the majority of SRAM applications. 
3.1.3. Timing and Applications 
The timing requirements of SRAM are the smallest of the concerns for the MM-sort
design. The designer can usually spend more money, burn more power, and use more
silicon to achieve the necessary speed. When designing a cache system for a high
performance computer, it is important that the SRAM makes the processor wait as little
as possible. A system could tolerate less frequently accessed SRAM using an extra clock
cycle in exchange for a lower production cost. Caches, however, are not limited to memory
systems; they are used in other peripherals as well. Hard drives such as the Western
Digital AC31600 have a cache of 128k [19] to improve drive access time and to buffer
transfers.
3.2.  DRAM 
In contrast to the static nature of the SRAM, DRAM does not continually maintain
the stored value in the cell. Without external refreshing, the DRAM cell loses the stored
value in a few milliseconds. The data loss results from storing the value on a capacitor.
The transistor's leakage current, due to the reverse biased pn junction, drains the charge
representing the value off the capacitor.
Designers developed the DRAM cell because it was a more efficient way to pack data 
cells onto a single chip. The six transistors in a standard SRAM cell are a lot of silicon 
area for a single value. DRAM technology, in contrast, can store a single bit with three 
transistors [16]. Later designers managed to reduce the data cell to a single transistor and 
capacitor. A single transistor cell allowed higher density memory devices. The 
disadvantage of the DRAM was the design complexity of needing additional refresh 
circuitry to maintain data integrity. 
Technologically, DRAM cells are MOS transistors because they have extremely low 
leakage currents and are very compact. As designers made process advances in MOS 
technology, they gained the ability to place more cells into a single chip. In 1978, a high 
density DRAM chip contained 16kbits, while in 1996 a DRAM chip is capable of over 
64Mbits per chip [18].
3.2.1. Cell Structure 
As stated before, the DRAM cell structure has evolved from three or four transistors
down to the single transistor cell. Figure 3.2 shows the two different cell structures [20].
Since the data is stored on a capacitor, every read can destroy the data by changing the
charge. Thus in a memory access, the value has to be rewritten back into the memory
cell before the transmission gate is turned off and the read completed. The capacitance of
a cell is small in comparison to that of the bit line, and thus the bit lines have to be
precharged. The circuitry then senses the bit line's current: if current is drawn in, then a
zero is stored. If very little or no current is drawn in, then a one is stored on the capacitor.
When the charge has leaked off the capacitor, the circuitry cannot determine if the value
is a zero or a very weakly charged one.
Figure 3.2 Two implementations of a DRAM cell (four transistor cell and one transistor cell)
The majority of modern DRAM chips have internal refresh circuitry. The external 
hardware supplies a refresh signal periodically, which allows the DRAM to refresh a row 
of cells. An internal counter refreshes each row when signaled by the external controller. 
The DRAM refreshes either all the memory at once, or one row at a time. As long as the
external hardware supplies the refresh signal often enough, dependent on the mode of
refreshing, the data will remain valid.
3.2.2. Interface signals 
Due to the high density of memory cells available on a single chip, the number of
address lines necessary to access the data would be large. The address pins would
increase the cost of the chip due to packaging. DRAM is supposed to be inexpensive, so
the DRAM address lines are time multiplexed to avoid unnecessary cost. The address
signals are divided into rows and columns. The division is a natural move for a chip
designer since the cells are laid out in rows and columns on the chip. A typical memory
access puts the appropriate row address on the address lines and lowers the signal Row
Address Strobe, RAS. To complete the access, the hardware places the column address
on the address lines and asserts Column Address Strobe, CAS. Shortly after the falling
edge of CAS, the DRAM outputs the data on the data lines.
In a page mode access, typically called fast page mode, the hardware can access 
several different column addresses from the same row address. In fast page mode, the 
RAS signal is held low and the new column address is placed on the address lines. The 
CAS signal is reasserted to get the next memory access. The multiple accesses reduce
the memory latency since the column lines already contain the data to be read and do not
require precharging.
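The benefit of fast page mode depends on how often consecutive accesses fall in the same row. The sketch below splits an address into its multiplexed row and column parts and counts how many times a new row must be opened with RAS. The geometry (11 row bits, 10 column bits), the choice of the upper address bits as the row, and the function names are assumptions made for illustration.

```python
ROW_BITS = 11   # assumed: 2048 rows
COL_BITS = 10   # assumed: 1024 columns per row


def split_address(addr):
    """Return (row, column) as they would be presented on the address pins."""
    row = addr >> COL_BITS
    col = addr & ((1 << COL_BITS) - 1)
    return row, col


def count_row_opens(addresses):
    """Count accesses that need a new RAS cycle; the rest can use page mode."""
    opens = 0
    open_row = None
    for addr in addresses:
        row, _ = split_address(addr)
        if row != open_row:
            opens += 1
            open_row = row
    return opens


print(count_row_opens(range(2048)))          # 2
print(count_row_opens([0, 1024, 0, 1024]))   # 4
```

A linear sweep of 2048 addresses opens only two rows, while alternating between two rows pays the full row access on every read.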
3.2.3. Timing and Applications 
An external timer is required to allow for a periodic memory refresh. For example, a
2 Meg x 8 bit DRAM memory chip, the MT4C2M8B1/2 by Micron, requires all 2048 rows
to be refreshed every 32 ms [21]. If the addresses have not been refreshed and the system
needs to read data, the refresh cycles can cause access delays. Therefore, it is
advantageous to refresh the memory when there are idle cycles available.
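The refresh requirement translates into a simple timing budget. The sketch below computes the average interval between row refreshes when the rows are refreshed one at a time, and the fraction of time lost to refresh. It assumes that refreshing one row costs one full random access cycle and uses the 110 ns cycle time quoted for this part [21] as that cost; the function names are hypothetical.

```python
def refresh_interval_us(rows, period_ms):
    """Average time between single-row refreshes, in microseconds."""
    return period_ms * 1000.0 / rows


def refresh_overhead(rows, period_ms, refresh_cycle_ns):
    """Fraction of time spent refreshing, assuming one cycle per row."""
    return rows * refresh_cycle_ns / (period_ms * 1_000_000.0)


print(refresh_interval_us(2048, 32))       # 15.625
print(refresh_overhead(2048, 32, 110))     # about 0.007
```

Under these assumptions one row must be refreshed roughly every 15.6 microseconds, and refresh consumes under one percent of the available memory cycles.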
Typically, memory is specified by the time RAS is required to be low. The total
cycle time is roughly twice as long as the RAS time, due to the required precharging of the
column lines before an access. The MT4C2M8B1/2 with a speed grade of 60 ns has a
minimum full random access cycle time of 110 ns [21]. Memory access can be
optimized by accessing memory linearly and grouping multiple reads or writes of
consecutive addresses together. The most common application for DRAM is mass
storage of data. DRAM is used for storage because the inexpensive memory requires little
power. The slow access times are a trade off, and often a small amount of SRAM can be
used in conjunction with the DRAM as a cache to reduce the average latency.
3.3.  FPGAs 
In the area of FPGAs, several different types of technology exist. The major types of
FPGA devices include the external memory loading, the EPROM, and the fuse blown
technologies available from Xilinx, Altera, and Actel respectively [22]. Each technology
has its advantages and disadvantages. The advantage of the external memory technology
is that field upgrades are easy to implement, for only the external memory has to be
changed to reconfigure the part. Upon power up, the memory loading FPGA loads the
design from the non-volatile memory. The disadvantage of external memory is that the
design on the device cannot be used until the download is complete. The larger Altera
parts in the FLEX 10K family, which download from external memory, take less than
200 ms for configuration [5].
The Altera EPROM technology is available in the smaller MAX 7000 and MAX 5000
families. The technology is similar to the fuse blown Actel parts. The designer programs
both parts like any EPROM. Upon power-up, the device is immediately available to start
working. The fuse version from Actel makes the internal connections quicker, since signals
do not pass through a programmable transistor transmission gate, which is often small
and therefore slow. The disadvantage of the Actel part is that once a fuse is blown, the
part cannot be reprogrammed.
Altera has several different size parts, but the FLEX 10K family is ideal because of
the range of sizes available. The pin compatibility between different parts, common to
most FPGAs, allows a designer to move to a larger device if the extra space is necessary.
The 10K family has built in RAM blocks which are ideal for processor register locations
[5]. Parts from other manufacturers require a portion of the standard logic to build the
registers, which reduces the number of gates available to the designer and places a larger
strain on the chip's routing resources.
3.3.1. Structure 
The internal structure of the FLEX 10K family consists of three major parts: the
logic array blocks (LABs), the FastTrack interconnect, and the embedded array blocks
(EABs). Each LAB consists of eight Logic
Elements (LEs), a common cell device in programmable parts. The interconnect forms a
grid-like pattern allowing any LAB to connect to any output pin with minimal delay.
Embedded into each EAB is 2048 bits of memory addressable as 2048x1, 1024x2, 512x4,
or 256x8 memories. Additional logic is available in the EAB to aid in address decoding
and memory select lines. Figure 3.3 shows an overview of the basic chip layout. The
number of available LABs and EABs depends on the size of the part chosen.
3.3.2. Timing 
One of the problems with implementing a complex design in a large part is estimation
of the timing delay internal to the chip. While each component has a calculable delay,
calculation is difficult when the delay can depend on how many LEs are required to
realize the logic function. Fortunately, good software support for timing analysis exists,
which enables the designer to see the critical paths in the design. Logic reduction or
changing the design can reduce the delay of the critical path, allowing a higher maximum
clock speed [23].
Figure 3.3 FLEX 10K family FPGA layout (logic array blocks and embedded array blocks joined by the FastTrack interconnect)
Several small functions in a FLEX 10K part can run at very high speeds, roughly a
10 ns clock. The speed of these functions gives the designer a gauge for calculating the
speed obtainable in a processor design. For example, a 16-bit accumulator can run as fast
as 107 MHz [5]. Doubling the data path has a slowing effect on the clock, but with the
speed of 107 MHz, it is reasonable to aim for a 40-50 MHz system clock.
3.3.3. Applications 
With the capability to implement registers, and built-in logic for adders and counters,
the FLEX 10K is ideal for complex controller design. The 10K family can implement not
only processors for sorting, but also advanced digital signal processing (DSP)
chips. Due to the SRAM download time to configure the chip, the large FPGAs cannot
be used in applications where functionality is required immediately after power is
applied.
CHAPTER 4  
MAX-MIN PROCESSOR DESIGN  
After selection of an appropriate algorithm, the design of an individual processor is 
the next step. While the odd-even sort algorithm has simple processors, the processor for 
the MM-sort algorithm is more elaborate. Careful consideration must go into the design 
details. With each processor responsible for sorting thousands of elements, any gain in 
processor performance per element greatly improves the overall running time of the 
sorter array. 
The main bottleneck, in terms of the processor's speed in sorting elements, is the 
memory interface. The processor can quickly perform comparisons and manipulations of 
pointers compared to the latency associated with accessing DRAM. A large effort, 
therefore, should be focused on the optimal storage structure of the elements in memory. 
This chapter examines and evaluates the performance of several different structures, not
only in Big Oh notation, but also in the actual number of memory accesses. The chapter
investigates three different data structures used to store the data: the max-min heap
proposed by [15] in the original algorithm, and two different versions of a linear array
developed in this chapter.
All the data structures use DRAM for the main memory; however, the problems of
refreshing the memory are not discussed. The effect of refresh cycles is small in
comparison to the total time. The end of the chapter discusses places where modes of
DRAM access, such as Fast Page, are advantageous. For comparison purposes, all
memory accesses are assumed to take the same amount of time.
I simulated all of the data structures using code models for performance. The source
code for each of the separate processors is in the Appendix. The data taken from the
transactions were the number of memory reads and writes, and the number of
comparisons the data structure required during the local sorting phase of the algorithm.
The values sorted were random numbers, with the code using the same random number
streams for each of the different data structures. The reported data is the average of 16
separate trials, to capture the average case operation.
4.1.  Max-Min Heap Structure 
As described in chapter 3, the max-min heap structure alternates maximum and 
minimum nodes to allow quick access to the maximum and minimum values in the list. 
The state machine to implement the max-min heap needs two different methods for
adding elements to the list. The first method applies when the heap has not been
constructed in memory. The processor adds each new element to the end of the array.
Then, when the sorting network requests the first maximum or minimum element from
the list, the processor creates the max-min heap and then processes the request. Once the
processor has constructed the heap, it can only return to the first method by completely
emptying the list, and therefore eliminating the heap structure in memory.
The processor needs an algorithm that changes the max-min heap into a linear array. 
The interface chip should not have to wait the O(lgn) time necessary to remove each 
element after the sorting is complete. Algorithms for creating a linear array out of a 
regular heap have been proposed which operate in O(nlgn) time [24]. These algorithms, 
however, essentially perform the removal of elements one at a time, placing the value at 
the end of the array. With slight modification to the processor, the removal algorithm 
could be used to create the linear array and allow the task to operate in parallel with all 
the processors. 
4.1.1. Data Structure Algorithm 
The algorithms to create and maintain the max-min heap lend themselves to state
machine design. Many of the algorithms have an opposite algorithm that differs only in
the type of comparison. The difference is whether the starting node is a maximum or
minimum node. The algorithms used are TrickleDown Max/Min and BubbleUp
Max/Min. The trickle algorithms move down the heap and move the starting node value
down into the proper location in the heap. The bubble algorithms move an element up
the heap until it reaches its proper location. The max or min term denotes the type of node
that the algorithm starts on.
When creating the max-min heap, the processor runs TrickleDown for all the node
values n/2, n/2 - 1, ..., 2, 1. Each step creates a correctly formed max-min heap by
moving the node value into the mini heap structure below the current node. Creating the
heap by using the trickle down functions is a modification of Floyd's algorithm [24] used
for quickly creating a linear array. The TrickleDown algorithm is the same algorithm
used later for adjusting the heap after deleting an item. When the processor deletes the
maximum or minimum value, it places the last element from the heap into the empty
location and then trickles the value down into the proper place in the heap.
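As a rough illustration of the creation step, the following Python sketch builds a max-min heap by running a trickle-down routine on nodes n/2 down to 1. It is a software model under stated assumptions, not the state machine from this thesis: the names `is_max_level`, `trickle_down`, and `build_max_min_heap` are hypothetical, the root is assumed to be a maximum node, and address 0 of the list is left unused so that the children of node i sit at 2i and 2i+1.

```python
import operator


def is_max_level(i):
    """Assumption: the root (index 1) is a maximum node and levels alternate below it."""
    return i.bit_length() % 2 == 1


def trickle_down(heap, i, n):
    """Move the value at node i down to its proper place among nodes 1..n."""
    while True:
        # Children and grandchildren of node i that exist in the heap.
        family = [c for c in (2 * i, 2 * i + 1,
                              4 * i, 4 * i + 1, 4 * i + 2, 4 * i + 3) if c <= n]
        if not family:
            return
        on_max = is_max_level(i)
        pick = max if on_max else min
        wins = operator.gt if on_max else operator.lt   # which value belongs on top
        m = pick(family, key=lambda c: heap[c])
        if not wins(heap[m], heap[i]):
            return                       # node i already dominates its family
        heap[i], heap[m] = heap[m], heap[i]
        if m < 4 * i:
            return                       # m was a direct child: nothing below to fix
        parent = m // 2                  # m was a grandchild: check the node in between
        if wins(heap[parent], heap[m]):
            heap[parent], heap[m] = heap[m], heap[parent]
        i = m                            # continue trickling from the grandchild


def build_max_min_heap(values):
    """Run trickle_down on nodes n/2, n/2 - 1, ..., 2, 1."""
    heap = [None] + list(values)         # address 0 is unused
    n = len(values)
    for i in range(n // 2, 0, -1):
        trickle_down(heap, i, n)
    return heap


heap = build_max_min_heap([5, 17, 3, 42, 8, 23, 11])
largest = heap[1]                        # the root holds the maximum
smallest = min(heap[2], heap[3])         # the minimum is one of the root's children
```

The same `trickle_down` routine could serve the deletion case described above: place the last element in the vacated location and trickle it down.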
The Bubble algorithm adds a new element to the list, by adding the element to the 
bottom of the heap. After adding the element, the processor compares the parent and 
grandparent values to determine whether it needs to bubble up the max or min side of the 
heap. The movement of the element continues until it reaches the correct place in the 
heap. 
Both the Trickle and Bubble algorithms are efficient in their memory usage. Since
the memory is partitioned into a binary tree, the maximum depth that any item will have
to move is O(lg n). Therefore, the addition of an element to the array only requires
O(lg n) moves before it is located in the proper location.
Perhaps the most time-consuming portion of the algorithm, apart from the memory
accesses, is the determination of the initial node as a maximum or minimum node. One
method to determine the polarity of the node uses a shift register to shift the bits of the
node address to the right until the value is zero. The number of shifts required to reach
zero indicates the polarity of the node. Assuming the root node is a maximum node, an odd
number of shifts indicates a maximum node. An even number of shifts indicates a
minimum node. The processor can quickly compute the polarity of the other nodes
relative to the initial node based on their relative position.
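The shift-count test can be modeled in a few lines of Python. This is only a sketch of the idea: the function names are hypothetical, node addresses are assumed to start at 1 for the root, and the loop stands in for the shift register.

```python
def shifts_to_zero(address):
    """Count the right shifts needed to reduce a node address to zero."""
    shifts = 0
    while address != 0:
        address >>= 1
        shifts += 1
    return shifts


def is_max_node(address):
    """An odd shift count marks a maximum node when the root (address 1) is a maximum."""
    return shifts_to_zero(address) % 2 == 1


# Addresses 1..8: root is max, its two children are min, the next four are max, and so on.
polarity = [is_max_node(a) for a in range(1, 9)]
```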
4.1.2. Speed of Creation 
According to [14], the worst case complexity for creating the max-min heap is 7n/3
data movements. Therefore, we expect O(n) for the creation of the heap. After
implementing the algorithm, the program measured the speed of creating the heap.
Figure 4.1 shows the results from the data run. The program added elements to the








Figure 4.1 Total number of memory accesses to create a max-min heap (horizontal axis: Number of Elements, 0 to 300,000)
4.1.3. Estimated Running Time 
Looking at figure 4.1, the slope of the line is relatively constant. Further
analysis, dividing each value by the number of elements sorted, reveals the constant.
Figure 4.2 shows the result of the division. The graph shows the value settles around 4.1n
for the creation constant of the max-min heap in the average case. Note that the good
running time is only for the creation of the heap. In comparison to the other techniques, each
global sorting phase requires O(lg n) steps for each insertion and deletion of the maximum
and minimum elements. Since the number of global sorts is O(n), the total time for the












Figure 4.2 Creation constant of the max-min heap processor (horizontal axis: Number of Elements, 0 to 300,000)
4.2.  Insertion Sort Structure 
An easier algorithm to implement in hardware is an insertion sort routine. The 
processor adds every element to the sorted list as it arrives. By adding the elements as
they arrive, the processor utilizes any latency associated with the initial reading of data.
The only problem is if data waiting at the source blocks further activities, since the 
addition of each element may take a while. In a computer system where the interface 
chip dedicated to the sorting array accesses the main memory, the system performance 
should not be adversely affected by the speed of reading the values out of main memory. 
4.2.1. Algorithm Used 
The first modification made to the normal insertion sort routine is the location of the 
values in memory. Usually, the smallest element is at address zero and the values 
increase with increasing address. When the processor adds a new smallest element, it
needs to shift the entire array to allow the insertion of the item. Since the
sorting network also causes the deletion of the smallest value, the data structure needs
modification.
By using two pointers, one for the top and one for the bottom of the list, the deletion
of either the maximum or the minimum value takes a single access and an adjustment of a
pointer. The dual pointers reduce the worst case number of moves from n to n/2, which
increases the speed by a factor of two. If the memory is thought of as circular, then the
processor avoids the problem of a pointer hitting the bottom or top of physical memory.
For example, when the minimum pointer decrements to the bottom of the memory,
address zero, it wraps around to the top of the address space and continues to decrement.
A binary search of the existing list locates the insertion point with a minimal number
of comparisons. The usual insertion sort performs the same number of comparisons as
data movements. Since a binary search only uses O(lg n) accesses and comparisons, the
overhead to find a location is small. A simple comparison of the insertion location and
the top and bottom pointers indicates the quickest direction to shift data.
Wrap-around pointers cause a problem, however, with the comparison operation.
After the minimum pointer wraps to the maximum value, for example, the minimum
pointer is greater in value than the maximum pointer. Adding more bits to the pointers
than necessary allows a pointer to wrap around and still maintain the correct magnitude
necessary for insertion sort. The main memory would ignore the upper bits of the pointer.
The effective memory mapping shadows the memory on top of itself and therefore puts the
minimum address next to the maximum address. For equal movement capability of the
pointers, the processor should initialize them to the middle address of the memory space.
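The dual-pointer scheme can be sketched in Python as follows. This is an illustrative model, not the processor's implementation: the class name `CircularSortedArray` and its fields are hypothetical, the pointers are plain integers that play the role of pointers with extra upper bits, and the mask models the memory ignoring those bits.

```python
class CircularSortedArray:
    def __init__(self, address_bits):
        self.size = 1 << address_bits
        self.mask = self.size - 1        # the memory ignores the upper pointer bits
        self.mem = [None] * self.size
        # Pointers carry extra upper bits and start at the middle address.
        # bottom addresses the minimum; top is one past the maximum.
        self.bottom = self.top = self.size + self.size // 2
        self.moves = 0                   # element shifts (each one read plus one write)

    def __len__(self):
        return self.top - self.bottom

    def _read(self, pointer):
        return self.mem[pointer & self.mask]

    def _write(self, pointer, value):
        self.mem[pointer & self.mask] = value

    def insert(self, value):
        if len(self) == self.size:
            raise OverflowError("array is full")
        # Binary search for the first position holding a value >= the new one.
        lo, hi = self.bottom, self.top
        while lo < hi:
            mid = (lo + hi) // 2
            if self._read(mid) < value:
                lo = mid + 1
            else:
                hi = mid
        if lo - self.bottom < self.top - lo:
            # Insertion point is nearer the bottom: shift the lower part down by one.
            for p in range(self.bottom, lo):
                self._write(p - 1, self._read(p))
                self.moves += 1
            self.bottom -= 1
            self._write(lo - 1, value)
        else:
            # Insertion point is nearer the top: shift the upper part up by one.
            for p in range(self.top - 1, lo - 1, -1):
                self._write(p + 1, self._read(p))
                self.moves += 1
            self.top += 1
            self._write(lo, value)

    def pop_min(self):
        if len(self) == 0:
            raise IndexError("array is empty")
        value = self._read(self.bottom)
        self.bottom += 1                 # a single access and a pointer adjustment
        return value

    def pop_max(self):
        if len(self) == 0:
            raise IndexError("array is empty")
        self.top -= 1
        return self._read(self.top)

    def to_list(self):
        return [self._read(p) for p in range(self.bottom, self.top)]
```

Because the pointers are compared before masking, the bottom pointer stays numerically below the top pointer even after the masked address has wrapped, which is the purpose of the extra bits. Each insertion shifts at most half of the stored elements.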
4.2.2. Speed of Creation 
A binary search insertion sort does the optimal number of comparisons for a single
processor algorithm, O(n lg n). Equations 2.1 and 2.2 showed the derivation of this value.
The number of data moves required by the algorithm, however, is not optimal. Under
worst case conditions, the maximum number of moves performed for one insertion is n/2.
When sorting n items the value becomes O(n^2). Equation 4.1 shows the derivation of this value.
\sum_{i=1}^{n} \frac{i}{2} = \frac{n(n+1)}{4} = \frac{n^2}{4} + \frac{n}{4} = O(n^2)    Eq. 4.1
Figure 4.3 graphs the results of the insertion sorter using the same data values as
the max-min heap processor from section 4.1. The curve has the distinct shape of an n^2
function. While the number of memory accesses is much larger than for the max-min heap,
the deletion cost when running the global sort is less than that of the max-min heap, O(1)













Figure 4.3 Total number of memory accesses to create a sorted DRAM array (horizontal axis: Number of Elements, 0 to 300,000)
4.2.3. Estimated Running Time 
The worst case execution time of the algorithm is close to
equation 4.2. The first two terms are from the insertion of elements into the list. The final
term is from the comparisons performed by the binary search. The multiplication factor
of two comes from needing both a memory read and a memory write to shift an element in
the array.
\#\,\text{Memory accesses} = 2 c_1 (n^2 + n) + c_2 \, n \lg n    Eq. 4.2
For large values of n the dominant term is the n^2 term. Figure 4.4 graphs the
estimation of the constant, c1, for the average case. The data points were divided by the
square of the number of elements sorted. The straight line for the larger values of n is
from the smaller influence of the n lg n and n terms. Thus, the function 0.25n^2 estimates
















Figure 4.4 Creation constant of a sorted DRAM array (horizontal axis: Number of Elements, 0 to 300,000)
4.2.4. Possible Optimizations 
Further optimizations of the insertion sort routine are possible. One optimization takes
advantage of having several elements to add to the array at once. The processor spends
the bulk of its time moving elements already in the array to make space for a new
element. When the processor needs to add an element to the center of the array, it must
shift n/2 elements. If the processor adds several elements at once, it could shift the
existing elements by more than one position, combining the addition of two elements
into a single shift of the elements. Therefore, it only takes n/2 moves to insert two
elements into the center of the array, as opposed to the n moves that the standard
algorithm takes.
In addition, we could perform other optimizations, such as checking to see if the 
element is greater than or less than the maximum and minimum elements. If the 
processor kept copies of these elements internally then no memory accesses would be 
required for the check. As more processors are in the parallel sorter, it is less likely that a 
new element is in the individual processors' list. This optimization assumes that the 
processor combines global and local sorts into a single event. The assumption is valid 
since the array data structure allows the combination of the two sorting phases into a 
single phase. 
If the list has a large number of duplicate values, another optimization is to handle 
equal terms as a separate case. Upon a comparison indicating that the values are equal, 
rather than insert the value at the current location, the comparisons could continue in the 
direction of shortest shifting. If there were 20 similar elements, checking for duplicates 
could result in a reduction of 19 moves during the insertion phase of the element. 
All of these optimizations do have a cost, however, of added hardware complexity.
Some of the optimizations, such as checking the maximum and minimum values in the
list, require very little change to the algorithm internal to the processor. The multiple move
optimization, however, comes at a much higher hardware cost. Together, all the
optimizations resulted in only a 15% performance improvement. Therefore, we need a
different data structure.
4.3.  SRAM Insertion Sort Structure 
Comparison of figures 4.1 and 4.3 shows that for a large number of elements the
standard DRAM insertion sort routine is not efficient enough to come close to the
performance of a max-min heap. Even the simulations show evidence of the difference.
The heap algorithm took roughly one hour to run all the data points, while the DRAM
simulation took about three days.
The optimizations discussed in section 4.2.4 are not enough to bridge the 10,000 
memory accesses per element difference between the two algorithms for a large number 
of elements. Even implementing the insertion sort routine in SRAM could not keep up. 
The disparity between the two algorithms is why big Oh notation is so useful in analyzing 
algorithms. With a creative use of SRAM, however, the performance gap is bridgeable. 
4.3.1. Data Structure Algorithm 
Looking at the data from the DRAM simulations, the largest effect on the number of
memory accesses is the movement of data. As discussed in section 4.2.4, moving larger
blocks of data improves performance. The processor could move several elements at
once without needing to fill the newly created "hole" right away. The "hole"
would give a major performance advantage to the processor during subsequent moves.
How can you move a hole into the array and keep track of it? With a hole, data only has
to shift to the nearest hole in the array, and if one is not close enough, a new set of holes
is moved into the array. Using holes can reduce the number of expensive DRAM
moves required to add values. Figure 4.5 shows a pictorial model of the proposed
memory array. Notice how values can move into the hole with a minimal number of
moves.
Keeping track of holes is a difficult task. A reserved value could be used to indicate a
hole, but then there is no efficient method to find the closest hole relative to an insertion
point. A better way to keep track of the holes is in a table. The table should contain a
list of allocated memory and how much is used. The table needs to be quick for
efficiency considerations. Since the table is relatively small and only has the data width
of the address space of the DRAM, it is an ideal application for SRAM. Because the
memory allocation table is stored in SRAM, I call the technique SRAM insertion sort. It is
not a requirement, however, that the designer implement the table with SRAM. In the
following sections, I discuss the terminology developed to explain the SRAM insertion
sort algorithm.
Values inserted: (1) 25, (2) 38, (3) 17
  A1: ... 12 15 18 21 23 29 36 41 45 53 58 60 ...
  A2: ... 12 15 18 21 23 25 h h h 29 36 41 45 ...
  A3: ... 12 15 18 21 23 25 29 h h 36 38 41 45 ...
  A4: ... 12 15 17 18 21 23 25 29 h 36 38 41 45 ...
The initial array is shown as A1.
(1) To insert the value 25, 4 holes are moved between 23 and 29, denoted by 'h'. The value 25 is then placed in one of the new holes in A2.
(2) Then 38 is added to A2 to produce A3. The values 29 and 36 can shift with 29 moving into the closest hole.
(3) Then 17 is added shifting 18, 21, 23, and 25 up. The final array is A4.
Figure 4.5 A memory array with data holes for quicker shifting
4.3.1.1.  DRAM Pods 
A DRAM pod is the base unit of allocation for the SRAM insertion sort. For easy
hardware implementation, the sizes are kept to powers of 2. Without powers of two,
the processor needs modular arithmetic hardware to calculate the number of elements
used in the pod and the base address of the pod. Each pod can contain from one element
up to the pod size. A pod with zero elements cannot be represented without extra data
bits, so if a pod ever decreases to zero elements, the processor removes the pod from the
array. Inside a pod, the lowest address holds the smallest value, and the largest used
address holds the largest value in the sorted pod. The SRAM table arranges the pods such
that the largest value of each pod is less than the smallest value of the next higher pod.
The arrangement of pods is much like the processors' lists at the end of the MM-sort
algorithm.
4.3.1.2.  Holes 
Holes are the empty memory locations to which data values move. In figure
4.5 the number of holes allocated, or hole size, is 4. The hole size is the same as the pod
size and the terms are used interchangeably. The processor only allocates new holes
when there are no holes in any of the adjacent DRAM pods. When no adjacent pod
has a free hole, the processor shifts a pod filled with holes into the location below the
current pod. The new holes are now close enough to shift the value to be inserted into the
array quickly, filling the empty pod.
4.3.1.3.  SRAM Values 
Each pod has an associated SRAM value. The value indicates the top address
used in the pod, where the processor allocates each DRAM pod on boundaries of 2^d
elements at a time. Thus, the lower d bits of the SRAM value are one less than the number
of values stored in the DRAM pod. The SRAM array maintains the actual sorted order of
the pods, eliminating the need to move the elements in the DRAM pods. The smaller
SRAM, therefore, exhibits the O(n^2) characteristic of the insertion sort algorithm. The
number of memory elements in SRAM is n/2^d, making the number of elements to sort
smaller. Figure 4.6 shows an example of an array with a pod size of 4. Any element
added to the array would be able to shift to the nearest hole.
SRAM array (sorted order of the pods; each entry is the top DRAM address used in its pod):
  u | 9 | 7 | 2 | 18 | 13 | u
DRAM array (pod size 4):
  addresses 0 to 11:   21 23 29 h | 12 12 15 18 | 7 11 h h
  addresses 12 to 23:  58 60 h h | 36 41 45 h | u u u u
Legend: u = unused or unallocated, h = hole. The dashed line helps to indicate the pod boundaries.
Data Array Represented:
  7 11 h h | 12 12 15 18 | 21 23 29 h | 36 41 45 h | 58 60 h h
Figure 4.6 Data storage method for a SRAM insertion sort processor
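The encoding of an SRAM value can be illustrated with a short Python sketch. The helper names are hypothetical, and the table contents (9, 7, 2, 18, 13 with d = 2) are my reading of the pod-size-4 example in figure 4.6, so they should be treated as an assumption rather than exact figure data.

```python
def pod_base(sram_value, d):
    """Base DRAM address of the pod: the SRAM value with its lower d bits cleared."""
    return sram_value & ~((1 << d) - 1)


def pod_count(sram_value, d):
    """Number of elements stored in the pod: the lower d bits plus one."""
    return (sram_value & ((1 << d) - 1)) + 1


D = 2                                    # pod size 2**D = 4
sram_table = [9, 7, 2, 18, 13]           # top address used in each pod, in sorted pod order

bases = [pod_base(v, D) for v in sram_table]
counts = [pod_count(v, D) for v in sram_table]
stored = sum(counts)                     # elements actually held
allocated = len(sram_table) * (1 << D)   # DRAM locations tied up in pods
efficiency = stored / allocated
```

With a power-of-two pod size, both quantities fall out of simple bit masking, which is why no modular arithmetic hardware is needed. For these values the sketch gives 14 stored elements in 20 allocated locations.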
4.3.1.4.  DRAM Pointer Stack 
Due to the nature of the MM-sort algorithm, values in the array are dynamic.
With addition and deletion of elements, the pod sizes increase and decrease. When a pod
reaches a size of zero, the processor removes the pod and shifts the SRAM array to cover
the invalid SRAM location. Since the processor allocates memory in increasing order, the
DRAM address of the pod needs to be saved. Without a method to save the pod address,
the processor can slowly lose all the memory, leaving no memory for the processor to
allocate when it needs a new pod. The DRAM pointer stack keeps track of the empty pod
addresses. The processor removes a value from the stack when it needs a new pod. A
design consideration is the size of the stack. If the stack is too small, then when the
processor destroys a pod and the stack is full, the DRAM address is lost for the remainder
of the sort. A large stack is only a problem in implementation when there is not space to
implement the entire stack in the processor.
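A minimal Python model of the pointer stack is sketched below. The class `PodAllocator` and its parameters are hypothetical names chosen for illustration; the behavior follows the description above: reuse a stacked pod address if one is present, otherwise take the next sequential pod, and lose the address of a destroyed pod when the stack is full.

```python
class PodAllocator:
    def __init__(self, pod_size, total_pods, stack_depth):
        self.pod_size = pod_size
        self.limit = pod_size * total_pods   # first address past the DRAM
        self.stack_depth = stack_depth
        self.next_free = 0                   # pointer to never-used memory
        self.stack = []                      # base addresses of destroyed pods
        self.lost = 0                        # pods whose address could not be saved

    def allocate(self):
        """Return the base address for a new pod."""
        if self.stack:
            return self.stack.pop()          # reuse a previously destroyed pod
        if self.next_free >= self.limit:
            raise MemoryError("no DRAM pods left")
        base = self.next_free
        self.next_free += self.pod_size      # allocate memory in increasing order
        return base

    def release(self, base):
        """Record the base address of a destroyed pod, if the stack has room."""
        if len(self.stack) < self.stack_depth:
            self.stack.append(base)
        else:
            self.lost += 1                   # address is gone for the rest of the sort


alloc = PodAllocator(pod_size=4, total_pods=4, stack_depth=1)
first, second, third = alloc.allocate(), alloc.allocate(), alloc.allocate()
alloc.release(second)                        # saved on the stack
alloc.release(third)                         # stack already full: this pod is lost
reused = alloc.allocate()                    # comes back from the stack
```

The `lost` counter makes the design tradeoff visible: a deeper stack costs processor area but keeps more of the DRAM available.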
4.3.1.5.  Efficiency 
A problem with using the SRAM table is that the processor allocates more memory
than elements. The unused allocated memory space results in inefficient memory
management. If the pod size is too large, then large holes can exist at the end of the
sorting. The ratio of used memory to allocated memory is the memory efficiency of the
processor. In figure 4.6, the efficiency of memory usage is 14 elements in 20 allocated
spots or 70%. The size of the pods and the algorithm used to determine the creation of a
new pod affect the efficiency.
4.3.1.6.  Algorithm 
The SRAM insertion algorithm works in the following steps. First, to add an
element, the processor performs a binary search on the maximum value of each of the
DRAM pods by searching through the SRAM array. Since the middle value of
the SRAM array is not necessarily the middle value of the data array, the binary search
loses some of its efficiency. The SRAM pointers have the wrapping capability described
for the standard insertion sort in section 4.2, which allows the processor to add new
values from either side. Next, the processor continues in the DRAM pod to find the exact
place in which to insert the value. After the processor finds the DRAM location, the
DRAM values are shifted to make room for the new value. Finally, the SRAM value
pointing to the pod is incremented to indicate the addition of an element.
If the pod is full, the processor checks the two adjacent DRAM pods for an 
available hole. The processor first checks the lower pod, because it requires fewer moves 
on average to shift a value down. If no adjacent pod has a hole, then the processor inserts 
a new pod below the full pod. The base value for the pod stored in SRAM is removed 
from the DRAM pointer stack, if a value is present. If no value is present, the processor 
allocates the next sequentially available pod and increments the pointer to free memory. 
The new SRAM value pointing to the pod is shifted into the SRAM array from
the closest side, using the algorithm described in the standard implementation of the
insertion sort. The processor inserts the minimum value of the current pod into the empty
pod and uses the new hole to insert the value into the current pod. If the new value would
have been the smallest value in the current pod, then the new value is placed in the empty
pod instead. The processor sets the SRAM value of the new pod to indicate one element
in the pod.
Since the processor only needs to remove the maximum and minimum values, the
deletion algorithm is easy to implement. To remove the maximum value, the processor
removes the largest value in the largest DRAM pod and updates the SRAM value.
Removing the minimum value is not as easy. After removing the minimum value, the
remainder of the elements in the pod shift to cover up the missing value at address zero
of the pod. The SRAM value of the pod is updated. If the SRAM value pointing to the
maximum or minimum pod indicates only one element left, the processor removes the
value and destroys the pod. The pod value is pushed onto the DRAM pointer stack if
space is available.
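The insertion and deletion steps above can be summarized in a simplified Python model. It is a sketch under several assumptions and not the hardware algorithm itself: the class `PodArray` is a hypothetical name, each pod is modeled as a small Python list instead of a DRAM region, the outer list stands in for the SRAM table, and DRAM addresses, the pointer stack, and the wrap-around table pointers are left out.

```python
import bisect


class PodArray:
    def __init__(self, pod_size):
        self.h = pod_size
        self.table = []                  # pods in sorted order; no pod is ever empty

    def _find_pod(self, value):
        """Binary search on pod maxima: first pod whose maximum is >= value."""
        lo, hi = 0, len(self.table) - 1
        while lo < hi:
            mid = (lo + hi) // 2
            if self.table[mid][-1] < value:
                lo = mid + 1
            else:
                hi = mid
        return lo                        # the last pod when value is a new maximum

    def insert(self, value):
        if not self.table:
            self.table.append([value])
            return
        k = self._find_pod(value)
        pod = self.table[k]
        if len(pod) < self.h:            # a hole inside the pod itself
            bisect.insort(pod, value)
            return
        if k > 0 and len(self.table[k - 1]) < self.h:
            target, use_min = self.table[k - 1], True       # hole in the lower pod
        elif k + 1 < len(self.table) and len(self.table[k + 1]) < self.h:
            target, use_min = self.table[k + 1], False      # hole in the upper pod
        else:
            target, use_min = [], True                      # new pod below the full pod
            self.table.insert(k, target)
        if use_min:
            if value <= pod[0]:
                target.append(value)     # new value would have been the pod minimum
            else:
                target.append(pod.pop(0))
                bisect.insort(pod, value)
        else:
            if value >= pod[-1]:
                target.insert(0, value)
            else:
                target.insert(0, pod.pop())
                bisect.insort(pod, value)

    def pop_min(self):
        value = self.table[0].pop(0)     # remaining elements shift down inside the pod
        if not self.table[0]:
            del self.table[0]            # pod emptied: remove it from the table
        return value

    def pop_max(self):
        value = self.table[-1].pop()
        if not self.table[-1]:
            del self.table[-1]
        return value

    def to_list(self):
        return [v for pod in self.table for v in pod]
```

In this model every data move touches at most one pod and one neighbor, while only the much shorter table is ever shifted as a whole, which is the source of the performance gain discussed in the following sections.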
The lower bound of efficiency is determined by looking at the pod creation
algorithm. Since the processor requires three full pods before creating an empty pod, the
worst case memory model is two full pods and one single element pod. Depending on the
hole size created, the lower bound is (2h+1)/3h, where h is the hole size, or 2^d.
The memory inefficiency requires the system to have extra memory space available.
Since memory is addressed in powers of two and the efficiency is never less than 66%,
two times the maximum number of elements is a safe DRAM estimate. The design is safe,
since a large enough DRAM pointer stack guarantees the memory does not become
fragmented.
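The bound can be checked numerically with exact fractions. The short Python sketch below uses a hypothetical helper name and simply evaluates (2h+1)/3h for a few hole sizes; since (2h+1)/3h = 2/3 + 1/(3h), the value stays above 2/3 for every h.

```python
from fractions import Fraction


def efficiency_lower_bound(h):
    """Worst case from the text: two full pods plus one single element pod."""
    return Fraction(2 * h + 1, 3 * h)


bounds = {h: efficiency_lower_bound(h) for h in (4, 8, 16, 32, 64)}

# DRAM needed per stored element in the worst case is the reciprocal of the bound.
worst_case_overhead = {h: 1 / b for h, b in bounds.items()}
```

The reciprocal stays below 1.5 locations per element, so rounding up to the next power of two gives the factor of two used as the DRAM estimate.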
4.3.2. Speed of Creation 
The algorithm still has an n^2 term, but the term is reduced by a factor of several
hundred. The array that has the same characteristics as the original insertion sort is
the SRAM table. Since each SRAM value represents h elements, the algorithm reduces
the n^2 term by a factor of h^2. The DRAM pods do not add to the order of the algorithm,
since the number of moves within them is bounded by a constant. The constant number of
moves is, in the worst case, 2h-1 moves.
The SRAM table effectively removes the n^2 term from the memory access equation
through division by the large constant. While according to big Oh notation the order is
still O(n^2), we are using values of n that do not exhibit the asymptotic behavior of the big
Oh notation. Much like the example in chapter 2, where the larger order equation is more
efficient than the lower order equation due to the constants, the same is true for this
equation.
The same number streams from figures 4.1-4.4 generated figures 4.7-4.10. The 
assumption was made that all memory accesses were equal, e.g. SRAM was not 
considered quicker. The results of the SRAM insertion sort were dramatically better than 
the results from the standard insertion sorting routine. At the extreme of 2^16 elements, a
disparity still exists between the SRAM and max-min heap numbers of 35 accesses per
element. The difference is much less than before, 16,000 for the standard insertion sort. 
With the improved behavior of the SRAM algorithm, it is reasonable to believe that 
during the global sort the SRAM array has comparable performance to the max-min heap. 
Figure 4.7 shows the number of memory accesses for the different hole sizes. 
Figure 4.8 is a close-up view of the smaller element values. As the number of
values in the SRAM table becomes large, the O((n/h)^2) term starts to dominate the number
of memory accesses and makes the SRAM storing routine inefficient. Shortly after the
table sorting affects the number of memory accesses, the next larger hole size is more
efficient. The line crossings on the graph show the changes in efficiency. First 8, then
16, then 32, and finally 64 element pods have the lowest number of memory accesses as
the number of total elements increases.
Figure 4.7 Effect of hole size on the SRAM insertion sort processor (horizontal axis: Number of elements, 0 to 300,000; one curve per hole size)
Figure 4.8 Effect of hole size on the SRAM insertion sort processor (zoomed in) (horizontal axis: Number of elements, 0 to 40,000; curves for hole sizes 8, 16, 32, and 64)
In addition to the number of memory accesses, the hole size affects the efficiency
of the memory allocation scheme. Figure 4.9 shows the average efficiency for each of
the different hole sizes. These values are greater than the worst case of 66%, but still not
close to the ideal 100%. The larger the hole size, however, the greater the tendency to













Figure 4.9 Memory storage efficiency of the SRAM insertion sort processor (horizontal axis: Number of Elements, 0 to 300,000)
4.3.3. Estimated Running Time 
The derivation of the run time equation is more difficult than for the standard insertion
sort. Breaking up the sorting into separate phases and looking at the memories separately
aids in determining the total number of memory accesses. Equation 4.3 is the equation
for the total number of memory accesses. The first term is from the movement of values
into the DRAM pods. The second term is from the movement of items in the SRAM
table, approximated as n^2 moves for a list of length n. The multiplication by two is from
the memory read and write needed to move an element. The third term is the binary
search section of the sort. Since each data read requires on average a table read and a
DRAM read, there is an additional factor of two. The final term is from the actual
writing of the element into the data structure. The writing of the element involves two
writes, one for the DRAM pod, and one to update the SRAM value of the pod. The
approximate value of the constants is from the data.
\#\,\text{Memory accesses} = 2 c_1 n h + 2 c_2 \left(\frac{n}{h}\right)^2 + 2 c_3 \, n \lg n + 2n    Eq. 4.3
where c_1 \approx 0.535, c_2 \approx 0.189, c_3 \approx 0.903
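To get a feel for how the terms trade off, the estimate can be evaluated for several hole sizes. The Python sketch below reads equation 4.3 as 2 c1 n h + 2 c2 (n/h)^2 + 2 c3 n lg n + 2n with the constants quoted above; that reading, the function names, and the candidate hole sizes are assumptions made for illustration, and the constants themselves are only approximate fits to the simulation data.

```python
import math

C1, C2, C3 = 0.535, 0.189, 0.903         # approximate constants quoted with Eq. 4.3


def memory_accesses(n, h):
    """Estimated memory accesses to build a sorted array of n elements with hole size h."""
    pod_moves = 2 * C1 * n * h           # shifting values inside the DRAM pods
    table_moves = 2 * C2 * (n / h) ** 2  # shifting entries in the SRAM table
    search = 2 * C3 * n * math.log2(n)   # binary search: table read plus DRAM read
    writes = 2 * n                       # one DRAM write and one SRAM update per element
    return pod_moves + table_moves + search + writes


def best_hole_size(n, sizes=(8, 16, 32, 64)):
    """Hole size with the lowest estimated access count for n elements."""
    return min(sizes, key=lambda h: memory_accesses(n, h))


small_choice = best_hole_size(1000)
large_choice = best_hole_size(2 ** 15)
```

Under these assumptions the smallest hole size is preferred for about a thousand elements, and a hole size of 32 is preferred for 2^15 elements, because the pod term grows with h while the table term shrinks with h^2.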
Figure 4.10 shows the effects of the terms as the number of elements increases.
By dividing the memory accesses by the number of elements and using a log scale, the
slope of the line corresponds to the n lg n term. Each hole size has a relatively flat section
of the line, where the n^2 term does not adversely affect the overall running time. When
the value starts to increase steeply, the effect of the SRAM memory moves has taken over
the order of the function.
Figure 4.11 was created by choosing the optimal hole size for each range of element
values. The breaks in the slope of the line correspond to the use of different hole sizes.
The slope of the line is determined by the dominant term. Since the number of elements
divides the memory accesses, the slope is 1/h^2. As the hole size increases the slope of the
line decreases. The second line is the calculated number of memory accesses using





Figure 4.10 (horizontal axis: Number of Elements, log scale, 100 to 1,000,000; one curve per hole size, 8 to 64)









Figure 4.11 Optimal hole size choice for creating sorted array (horizontal axis: Number of Elements, 0 to 300,000)
4.4.  Comparison of Processor Designs 
The algorithms in SRAM insertion sort are easy to implement in hardware. The
optimizations discussed for the standard insertion sort require a lot of hardware and
provide minimal improvements in the performance of the processor. The separate
memory interfaces on the SRAM insertion sort use separate registers to maintain the
values used in the SRAM and DRAM interfaces. A few registers are a small tradeoff for
the vastly improved performance.
One advantage of the SRAM insertion sort is that the accesses associated with memory
moves are linear. Thus, the design can take advantage
of the DRAM Fast Page mode. The processor can buffer all the shifting data values
during a single read, and then write them back in another burst. The bursting of data
accesses eliminates over half of the time necessary to access each element in a memory
move. The accesses of the max-min heap, however, are often between a node and its
children or grandchildren, each of which requires separate RAS and CAS strobes. These
non-page accesses take a longer time. The difference is not enough to make up the gap of
35 memory accesses per element, but it would make the gap much smaller.
Another advantage of the SRAM insertion sort is the processor has sorted the values 
in memory. With the max-min heap the values do not have the sorted structure, and 
therefore are slower to remove from the memory array. Conversion of the max-min heap 
into a sorted list would require an extra signal and an additional to perform the 
conversion in parallel. Once signaled the processors can repeatedly remove the 
maximum value from the heap and create the sorted list. The time necessary using this
algorithm is an additional O(n lg n) memory accesses.
Another difference between the max-min heap and the SRAM insertion sort is the
amount of memory required to sort the same number of elements. While the max-min
heap uses all but one location in memory, address 0, SRAM insertion sort requires twice
as much memory. The doubling of memory is mainly due to the DRAM memory
holes used by the processor. The holes that are never filled require more memory to be
available than there are elements.
An interesting characteristic of the SRAM insertion sort is the size of the table. As
the table size becomes large, the table sorting becomes inefficient. The self-regulating
size of the table makes it ideal for an SRAM application. In the storage of 2^15 data
elements, there need to be 2^16 available data locations because of the over-allocation
of memory. Each table entry consists of 2 bytes, 16 bits, and the table would have only
2^11 different addresses, since the most efficient hole size is 32 for this many values.
Therefore, the SRAM would only need to be 4 kB, 2^11 * 2^1 bytes. The DRAM, on the
other hand, would be 2^16 elements at 2^5 bytes per element, or 2^21 bytes.
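The sizing arithmetic above can be checked with a short sketch. The struct and function names are hypothetical and do not appear in the thesis code; the constants used in the usage note (an over-allocation factor of two, a hole size of 32, 2-byte table entries, and 2^5 bytes per DRAM element) are the ones quoted in the preceding paragraph.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of the thesis code: memory needed by
// one SRAM insertion sort processor.
struct MemorySizes {
    std::uint64_t dram_locations;  // element slots allocated in DRAM
    std::uint64_t sram_entries;    // table entries, one per hole-sized block
    std::uint64_t sram_bytes;
    std::uint64_t dram_bytes;
};

MemorySizes SizeMemories(std::uint64_t elements, std::uint64_t hole_size,
                         std::uint64_t table_entry_bytes,
                         std::uint64_t element_bytes) {
    assert(hole_size > 0);
    MemorySizes m;
    m.dram_locations = 2 * elements;  // over-allocation factor of two
    m.sram_entries = m.dram_locations / hole_size;
    m.sram_bytes = m.sram_entries * table_entry_bytes;
    m.dram_bytes = m.dram_locations * element_bytes;
    return m;
}
```

Calling `SizeMemories(1u << 15, 32, 2, 32)` reproduces the figures in the text: 2^16 DRAM locations, 2^11 table entries, 4 kB of SRAM, and 2^21 bytes of DRAM.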
CHAPTER 5  
MM-SORT PERFORMANCE  
While chapter 4 investigated the performance of an individual processor, the speed gain
of the algorithm comes from the processors working together. One processor may be
quick at sorting by itself, but when several processors are placed together scalability
becomes important. Since all of the processor designs have an identical delay through
the sorting network, only the running time of the different processor designs affects the
relative performance. For the discussion in this chapter, all memory accesses are assumed
equal and no attempt was made to use the Fast Page mode access of the DRAM.
Figure 5.1 shows the connection of an array of processors for the max-min sort 
algorithm. Each individual processor works in parallel as the interface chip adds data to 
each processor. The main interface chip acts as the communication interface to the 
outside world. The interface chip contains the logic to read data from the bus, start the 
sort, and write the sorted data. The network sorter chip contains the logic to read and 
write the maximum and minimum values to the processors, detect completion, and 
quickly sort 2p elements. The network sorter could be one of the previously mentioned 
parallel sorters such as the odd-even sorter, due to the small number of elements to sort. 
The simulations in this chapter use the memory models developed for the different 
processors in chapter 4. The testing configured each type of processor into a variable 
sized array. The testing varied the number of processors used in the array and the 
number of elements added to each processor. When the processor had additional
parameters to vary, such as the hole size, they were varied as well. Section 5.1 presents
the results from the original version of the algorithm. Section 5.2 shows the results from
variations of the original algorithm, which the insertion sort routines make possible.
[Figure 5.1: block diagram. Blocks, top to bottom: External Bus; Interface Chip;
Processor Chip 1, Processor Chip 2, ..., Processor Chip p; Network Sorter.]
Figure 5.1 Chip layout to build a MM-sorter
5.1.  Overall Speed 
Tests of the original algorithm used the two separate sorting phases, local and
global, and were run on a single-processor computer. Each processor performed
the local sort serially. After sorting its list, a processor returned control to the program
to start the next processor. After all p processors had finished the local sort, the program
performed the global sorting phase until the network sorter indicated that the arrays were
sorted, i.e. no data movements occurred between the incoming values.
Each processor kept track of the number of memory accesses, by type where
applicable, and the number of comparisons used during the sort. Each data run was
performed for 16 different sets of data. All of the data was saved to a file
containing additional information about the number of processors and number of
elements.
Figures 5.2-5.4 show the results of the original algorithm for 2, 16 and 64 processors
respectively. The labels SRAM(h) refer to the hole size used for the SRAM processor
and DRAM refers to a highly optimized version of the standard insertion sort. The
optimized insertion sort creates the array using an O(n lg n) algorithm and also has all
the optimizations discussed in chapter 4. From the figures, the max-min heap processor
appears to be the most efficient for sorting the data.
An important note, however, is that the charts do not factor in the removal of elements
from the processor array. When the data incorporates the removal of elements, as in
figure 5.5 for 16 processors, the run time of the max-min heap becomes worse in
comparison to the other algorithms. The max-min heap requires an additional O(n lg n)
time to remove the elements, while the SRAM and DRAM versions, in comparison,
require only O(n) time.
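The extra unloading cost of a heap can be illustrated with a small sketch. It uses a plain binary max-heap from the C++ standard library as a stand-in for the max-min heap, which is an assumption of this example rather than the thesis's data structure; the function name is hypothetical. Each removal re-heapifies the remaining values, so draining n values costs on the order of n lg n comparisons, whereas an already sorted array is read out in a single linear pass.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Drains a binary max-heap into descending order and reports how many
// comparisons the removals took. A binary heap stands in for the
// max-min heap here (an assumption of this sketch).
std::vector<int> DrainHeap(std::vector<int> values, std::size_t& comparisons) {
    comparisons = 0;
    auto less = [&comparisons](int a, int b) {
        ++comparisons;
        return a < b;
    };
    std::make_heap(values.begin(), values.end(), less);
    comparisons = 0;  // count only the removal work, not heap construction
    std::vector<int> out;
    out.reserve(values.size());
    for (auto end = values.end(); end != values.begin(); --end) {
        std::pop_heap(values.begin(), end, less);  // moves the maximum to end - 1
        out.push_back(*(end - 1));
    }
    return out;
}
```

The comparison count returned through `comparisons` grows faster than linearly with the input size, which is the behavior the unloading charts reflect.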
[Figures 5.2 and 5.3: plots not recoverable from the source. Recoverable labels: x-axis
"Number of Elements"; legend entries DRAM, SRAM(8), SRAM(16).]
Figure 5.3 Original sorting method with 16 processors 
[Figures 5.4 and 5.5: plots not recoverable from the source. Recoverable labels: x-axis
"Number of Elements" (1,000 to 1,000,000); legend entries DRAM, SRAM(8), SRAM(16).]
Figure 5.5 MM-sort running time for 16 processors (including unloading) 
In figure 5.5, the SRAM insertion sort processor has closed the performance
gap between it and the max-min heap. For certain numbers of elements, the SRAM
method is actually faster. As the number of elements increases, the max-min heap takes
longer to output the data. Therefore, the SRAM method is comparable in total
performance to the max-min heap using the original sorting algorithm.
5.2.  Variation Using SRAM 
One of the reasons to develop the new SRAM method of data storage was the
capability to remove the maximum and minimum values at any time during the sort. If
the sorting network requests the maximum and minimum values from the max-min heap,
the processor first has to construct the heap, and then all later additions are performed in
O(n lg n) time as opposed to the O(n) time it takes to create the heap. The speed
advantage of the max-min heap is lost if the network sorting chip forces creation of the
heap before the interface chip has added all the elements. One variation of the algorithm,
usable with the SRAM data structure, is to perform a single global sort after adding each
element. After all the elements are in the processors, the global sorting phase continues
until the lists are sorted.
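The variation can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the simulator used for the results: `std::multiset` stands in for the SRAM data structure, `std::sort` stands in for the network sorter, and the redistribution rule (processor i receives the i-th pair of the 2p sorted values) is an assumed reading of the global step. The names `Processor`, `GlobalStep`, and `InterleavedSort` are hypothetical, and the input length is assumed to be a multiple of p and at least 2p.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <set>
#include <vector>

// Stand-in for one processor's sorted storage (hypothetical).
using Processor = std::multiset<int>;

// One global step: take the minimum and maximum from every processor,
// sort the 2p values, and hand pair i back to processor i. Returns true
// if any value changed position, i.e. the lists were not yet sorted.
bool GlobalStep(std::vector<Processor>& procs) {
    std::vector<int> before;
    for (Processor& p : procs) {
        assert(p.size() >= 2);
        before.push_back(*p.begin());
        p.erase(p.begin());
        before.push_back(*std::prev(p.end()));
        p.erase(std::prev(p.end()));
    }
    std::vector<int> after = before;
    std::sort(after.begin(), after.end());  // stand-in for the network sorter
    for (std::size_t i = 0; i < procs.size(); ++i) {
        procs[i].insert(after[2 * i]);
        procs[i].insert(after[2 * i + 1]);
    }
    return after != before;
}

// Adds one element per processor, then runs a single global step;
// finishes with global steps until no data moves.
std::vector<int> InterleavedSort(const std::vector<int>& input, std::size_t p) {
    assert(p > 0 && input.size() % p == 0 && input.size() >= 2 * p);
    std::vector<Processor> procs(p);
    for (std::size_t next = 0; next < input.size(); next += p) {
        for (std::size_t i = 0; i < p; ++i) procs[i].insert(input[next + i]);
        if (procs[0].size() >= 2) GlobalStep(procs);
    }
    while (GlobalStep(procs)) {
    }
    std::vector<int> sorted;
    for (const Processor& proc : procs)
        sorted.insert(sorted.end(), proc.begin(), proc.end());
    return sorted;
}
```

Because the structure is sorted at all times, the minimum and maximum are available after every insertion, which is what lets the global step run before all the data has arrived.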
Combination of the local and global sorting means the processor array is not 
dependent on receiving all the data before it starts the sort. It is hard to quantify the 
amount of time receiving all the data takes, since the data loading time is dependent on 
the system's bus traffic, the system memory latency, and disk latency. It should be 
sufficient to say the advantage of the phase combination is considerable. 
Figures 5.6-5.8 show the results of using the global sort each time the interface chip
added p elements, i.e. one element per processor. The max-min heap line in the graphs is
from the standard implementation of the algorithm. The line is meant as a baseline
for comparison with the alternate method. The DRAM version performs the local and
global sorts and is the maximally optimized version. None of the data points take the
unloading of data into account.
From figure 5.8, the high-performance system of 64 SRAM insertion sort processors
is over 25% faster than the max-min heap processors. With additional time gained by
quicker removal of elements from the array and sorting while adding elements to the
[Figures 5.6, 5.7, and 5.8: plots not recoverable from the source. Recoverable labels:
x-axis "Number of Elements"; legend entries DRAM, MM Heap.]
Figure 5.8 MM-sort alternating the global and local sort for 64 processors 
5.3.  Scalability 
Another advantage of the SRAM design is the nearly linear scaling of the processors to
tackle larger and larger jobs. When the number of elements and the number of
processors are both doubled, the number of memory moves performed by the processors
increases by about 10%. Therefore, the sorter can sort twice as many elements with twice
as many processors in only 10% more time. The speed scaling of the max-min heap is not
constant. With a small number of processors, doubling the elements and processors
simultaneously can take up to 50% more time.
Figure 5.9 compares the scaling of simultaneous sorting using the SRAM method with
that of the max-min heap sorter. It shows the speed scaling for a constant number of
elements as more processors are added. The relatively straight lines of the SRAM sorter
show linear improvement in the performance, while the max-min heap has nonlinear
scaling. The change in the scaling factor for the max-min heap is explained by the small
[Figure 5.9: plot not recoverable from the source. Recoverable label: x-axis "Number of
Processors (log scale)".]
Figure 5.9 Speed improvement of adding processors to sort 16k elements 
Figure 5.10 shows the effects of scaling the number of processors while keeping the
number of elements added to each processor constant. With the same number of elements
per processor, the total number of elements sorted doubles as the number of processors
doubles. The figure shows how more processors can handle larger jobs with similar
performance.
There are a few reasons for the better performance of the max-min heap in the 2 and 4
processor cases. First, the SRAM method performs many unnecessary memory accesses as
the elements are added to the list. The accesses are unnecessary since the likelihood of an
element leaving the processor's list is (p-1)/p. With only a few processors, the chances
are relatively good that the new element should remain in the list. If the value should
remain in the list, an unnecessary global sort is performed. Another reason for the better
performance is that the smaller number of processors requires fewer network sorts. The
fewer network sorts, the fewer expensive removal operations performed by the max-min heap,
[Figure 5.10: plot not recoverable from the source. Recoverable label: x-axis "Number of
Processors" (0 to 70).]
Figure 5.10 Scalability of problem with 8k elements per processor 
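The (p-1)/p likelihood quoted above for the small-array case can be tabulated with a one-line helper. The function name is hypothetical, and the formula assumes uniformly distributed values spread over p lists of equal size, which is an assumption of this sketch.

```cpp
#include <cassert>

// Probability that a newly added element belongs in some other
// processor's list, assuming uniformly distributed values and p
// equally sized lists (an assumption of this sketch).
double LeaveProbability(int p) {
    assert(p >= 1);
    return static_cast<double>(p - 1) / p;
}
```

For 2 processors the value is 0.5 and for 4 processors it is 0.75, so a sizeable share of the per-element global sorts move no data; for 64 processors it exceeds 0.98.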
5.4.  SRAM Design Decisions 
In the SRAM processor design, several different parameters need optimization for
efficient performance. A miscalculation of these parameters could cause the array to sort
slower than expected. The worst case is that the array is unable to complete the sort
because it runs out of memory due to memory fragmentation. The design
parameters for efficient performance include the DRAM hole size, the depth of the
DRAM pointer stack, and the SRAM speed.
5.4.1. Hole Size 
From previous figures, the size of the holes is proportional to the number of memory
accesses. If the maximum number of elements of a problem is known in advance, then
the hole size can be optimized. The problem with hard-coding the hole size is that smaller
sorting jobs in the same processor array would have sluggish performance. To maintain
the best performance, a processor design with configurable hole sizes is the best option.
Then the software driver could set the proper hole size based on the number of elements
and processors. Running out of SRAM should not be a problem, for fewer elements
require fewer table entries. From the design criteria, the safety factor of two should allow
for enough table entries to run at an optimal speed independent of hole size.
5.4.2. DRAM Pointer Stack 
The DRAM pointer stack cannot be made too large, for the more values
saved, the less memory the processor fragments. If the design space is limited,
however, the designer needs to know the minimal size that allows for a functional
processor. Figure 5.11 shows the effect of the pointer stack depth on the number of
DRAM pods lost. After a depth of three, the loss of holes occurs only during the final
global-only sorting phase. The holes lost are due to the memory efficiency increasing.
Therefore the lost pods are an artifact of over-allocating memory during the earlier
phases of sorting. A good way to visualize the process is to think of the edge of a
processor list overlapping with the adjacent lists held by the other processors. During the
final global sorting, the accesses are localized to the outer portion of the list. The
addition of elements to the edge of the
[Figure 5.11: plot not recoverable from the source. Recoverable labels: x-axis "Depth of
DRAM Pointer Stack" (0 to 9); legend entries "8 elements" and "16 elements".]
Figure 5.11 Number of lost DRAM pods for 16 processors sorting 64k elements 
5.4.3. SRAM Speed 
Another design decision is the SRAM speed. While components such as the logic for
the element comparison and the pointer addition affect the maximum system clock, the
designer still has to choose the SRAM speed. The best performance is not necessarily
obtained by the fastest SRAM. With the timing for the DRAM being fixed at roughly
100 ns per random cycle, the fastest SRAM may not be necessary. Assuming a 20 ns
clock is obtainable on the processor chip, the DRAM data is available 40 ns after the
RAS signal. The entire cycle occupies 60 ns of memory bus time. The processor can use
the additional 40 ns before the next memory access to interface with the SRAM without
incurring a delay. Careful choice of the SRAM speed grade can maximize the speed
while still minimizing the cost.
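The timing budget described above can be worked through in a few lines. The names below are hypothetical, and rounding each SRAM access up to a whole number of processor clocks is an assumption of this sketch; the 100 ns, 60 ns, and 20 ns figures come from the preceding paragraph.

```cpp
#include <cassert>

// Hypothetical timing helper. All figures are in nanoseconds.
struct MemoryTiming {
    int dram_random_cycle_ns;  // full random-access cycle, roughly 100 ns
    int dram_bus_busy_ns;      // time the cycle occupies the memory bus, 60 ns
    int clock_ns;              // processor clock period, 20 ns
};

// Time left in each DRAM cycle for SRAM traffic.
int SramSlackNs(const MemoryTiming& t) {
    return t.dram_random_cycle_ns - t.dram_bus_busy_ns;
}

// Number of SRAM accesses that fit in the slack when each access is
// rounded up to a whole number of processor clocks (an assumption).
int SramAccessesInSlack(const MemoryTiming& t, int sram_access_ns) {
    assert(sram_access_ns > 0 && t.clock_ns > 0);
    int clocks_per_access = (sram_access_ns + t.clock_ns - 1) / t.clock_ns;
    int slack_clocks = SramSlackNs(t) / t.clock_ns;
    return slack_clocks / clocks_per_access;
}
```

Under these assumptions a 15 ns part and a 20 ns part both allow two SRAM accesses in the 40 ns of slack, so the faster grade gains nothing, while a 35 ns part still allows one access per DRAM cycle.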
CHAPTER 6  
DESIGN OVERVIEW AND CONCLUSIONS 
The SRAM insertion sort processor has a speed advantage over the max-min heap 
processor. If the designer uses the SRAM processor in designs larger than four 
processors, then the sorter has better performance than the max-min heap. The following 
section outlines the basic requirements for each component in the max-min parallel 
sorter. 
After discussing the logic necessary to implement the design, Section 6.2 discusses
areas for optimizing the sorter. In addition to hardware-based optimizations, the section
discusses changes in the algorithm necessary for quick implementation on a symmetric
multiprocessor (SMP) or a message-passing multi-computer. Section 6.3 summarizes the
results presented in the thesis.
6.1.  Design Guidelines 
The design ideas discussed in the following section are not a comprehensive list of 
details for a hardware implementation of the MM-sorter. The ideas presented are 
observations that a designer should address in the development of the hardware. Several 
of the ideas aid in tolerating the latency associated with different portions of the 
algorithm. For example, a processor that creates a new pod takes more time to add an 
item to its list than a processor that inserts a new element into an existing pod. Allowing 
the second processor to start on a new element reduces the idle time of the processor. 
The main control unit in the processor is an important piece to build. If the parallel 
sorter strictly enforces the alternation between the global and local sorting steps, then 
processors may spend a considerable amount of time idle. The idle time is from the delay 
associated with the network sorting chip as it sorts the maximum and minimum elements.
By not enforcing strict alternation between global and local sorting, the processor can add 
additional elements to the array during the network sort.  Overlapping the cycles 
maximizes the amount of time the processor spends sorting elements and minimizes the 
need for a highly efficient network sort chip. 
6.1.1. Communication Interface 
One of the possible problems discussed earlier in the thesis is the number of pins on
the chips. By implementing a shift register, a serial communication interface can shrink
the pin width of the data bus to one bit. While one bit would increase the communication
time, the processor can use a buffer to hide the latency. Once the buffer has a value, the
processor moves it to another register for processing, clearing the buffer to load a new
value.
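The shift register and buffer arrangement described above can be modeled with a small class. The class name, the 16-bit element width, and the most-significant-bit-first ordering are all assumptions made for illustration; the thesis does not specify them here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a one-bit serial input port. Bits arrive most
// significant bit first; kWidth is an assumed element width.
class SerialReceiver {
public:
    static constexpr int kWidth = 16;

    // Shifts in one bit. Returns false (bit refused) if completing this
    // value would overwrite a buffer the processor has not yet emptied.
    bool ShiftIn(bool bit) {
        if (count_ == kWidth - 1 && buffer_full_) return false;
        shift_ = static_cast<std::uint16_t>((shift_ << 1) | (bit ? 1u : 0u));
        if (++count_ == kWidth) {
            buffer_ = shift_;
            buffer_full_ = true;
            count_ = 0;
        }
        return true;
    }

    bool HasValue() const { return buffer_full_; }

    // Moves the buffered value out for processing and frees the buffer.
    std::uint16_t Take() {
        assert(buffer_full_);
        buffer_full_ = false;
        return buffer_;
    }

private:
    std::uint16_t shift_ = 0;
    std::uint16_t buffer_ = 0;
    int count_ = 0;
    bool buffer_full_ = false;
};
```

While a completed value waits in the buffer, the next value can already be shifting in, which is how the buffer hides most of the serial transfer time from the processor.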
Single bit communication is necessary for the interface chip to processor 
communication. The idealized bus shown in figure 5.1 presents a heavy load on the 
driving chip's circuitry, which decreases the maximum attainable bus rate. Only having a 
single bit for data communication, the interface chip can use a line dedicated to each 
processor, minimizing the loading. Separate control signals can perform the necessary 
handshaking. 
The network sorter's buffers can serve as the positions in which the data is stored. The
final global-only stage incurs a large latency if the sorter uses a single-bit data bus, for
each processor is idle while the data is loading into the network chip. A wider, slower bus
allows a single processor to start processing data and therefore minimizes the idle time.
6.1.2. Processor Memory Interface 
The processor to memory interface was the main optimization presented in the thesis. 
Chapter 4 outlines the algorithm for inserting an element into memory. Due to the
similarity between the binary sort first performed on the SRAM and then DRAM, the
circuitry of the comparison and adder units can be combined to minimize the hardware.
The DRAM refresh cycles require a timer and counter internal to the processor. The
timer and counter are used to guarantee that the refresh cycles, CAS before RAS, occur
often enough. When the processor uses the memory bus to move a new pod pointer into
SRAM, the processor can initiate refresh cycles on the DRAM, since the address and data
buses are not necessary for a refresh cycle. When the processor is idle, it can allow the
refresh controller to refresh the DRAM. By capturing free cycles to refresh the DRAM,
the design minimizes the amount of delay incurred by the use of DRAM.
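One way to picture the timer and counter is the sketch below. It is an illustrative model rather than the thesis design: the class name, the "owed refresh" counter, and both constructor parameters are assumptions. A refresh becomes owed every `interval_clocks` clocks; owed refreshes are issued for free whenever the DRAM is not in use, and a refresh is forced, stalling the processor, only when `max_owed` refreshes have accumulated.

```cpp
#include <cassert>

// Hypothetical refresh scheduler: steals free bus cycles for refresh
// and forces a refresh only when too many are overdue.
class RefreshController {
public:
    RefreshController(unsigned interval_clocks, unsigned max_owed)
        : interval_(interval_clocks), max_owed_(max_owed) {}

    // Advances one clock. bus_free is true when the DRAM is not needed
    // this clock. Returns true if a refresh cycle is issued.
    bool Tick(bool bus_free) {
        if (++timer_ == interval_) {
            timer_ = 0;
            ++owed_;
        }
        if (owed_ == 0) return false;
        if (bus_free) {  // free cycle: refresh without delaying the sort
            --owed_;
            return true;
        }
        if (owed_ >= max_owed_) {  // deadline reached: stall and refresh
            --owed_;
            ++stalls_;
            return true;
        }
        return false;
    }

    unsigned Stalls() const { return stalls_; }

private:
    unsigned interval_;
    unsigned max_owed_;
    unsigned timer_ = 0;
    unsigned owed_ = 0;
    unsigned stalls_ = 0;
};
```

The stall count gives a rough measure of the delay that DRAM refresh adds; the more free cycles the processor exposes, the closer that count stays to zero.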
6.1.3. Sorting Network 
Perhaps the least optimized chip on the board is the sorting network chip. Since the
design has been constrained to a small number of processors, the chip can use one of the
other algorithms presented, like odd-even sort, to sort the elements. For a fully functional
chip, the designer needs to build additional hardware to detect elements swapping
locations. Only during the final stage of global-only sorting does the latency affect the
run time of the circuit. A simple algorithm, which takes minimal hardware resources, is
the standard insertion sort. While the delay associated with sorting is higher than with
other algorithms, the processor can hide the majority of the delay during the local and
global sorting phases.
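A software sketch of an odd-even transposition sort with swap detection is given below. It models the behavior only, not the hardware; the function name is hypothetical, and the returned flag plays the role of the additional swap-detection hardware mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Odd-even transposition sort over the 2p values held by the network
// sorter. Returns true if any pair was exchanged, which is the signal
// that the processor lists were not yet globally sorted.
bool OddEvenSort(std::vector<int>& v) {
    bool any_swap = false;
    const std::size_t n = v.size();
    for (std::size_t phase = 0; phase < n; ++phase) {
        // Even phases compare pairs (0,1), (2,3), ...; odd phases
        // compare pairs (1,2), (3,4), ...
        for (std::size_t i = phase % 2; i + 1 < n; i += 2) {
            if (v[i] > v[i + 1]) {
                std::swap(v[i], v[i + 1]);
                any_swap = true;
            }
        }
    }
    return any_swap;
}
```

In hardware, the comparisons within one phase are independent and can run in parallel, so the sort completes in n phases; a pass that reports no swap indicates completion.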
6.1.4. Interface Chip 
The bus interface chip is the most complex chip on the board to design. The need to 
conform to an external bus protocol requires much research. The most important 
function of the interface chip is the ability to do direct memory access (DMA). DMA 
removes the need for a system processor to send the data to the MM-sorter. The system 
processor can work on other programs until the interface chip interrupts it to signal 
completion. In addition to the bus protocol, choosing an appropriate register interface for 
a software driver should not be ignored. 
For communication with the array, each processor chip needs a separate
address. The addresses help the interface chip to communicate with the processors
individually or as a group. The decision between sending data in parallel to all the
processors and sending data to a single processor at a time depends on the amount of
control circuitry available to the interface chip.
6.2.  Future Research 
Before a full implementation of the proposed sorter, the designer needs to examine
different bus interfaces and simulate the time in terms of processor clocks. The
simulations can explore variations of the parallel and serial interfaces that connect to the
processors. Another optimization to investigate in the processor development is the use
of the page capabilities of DRAM and its effect on the processor performance. Additionally,
presorting data before it reaches the processor increases the execution time and lowers
the number of necessary network sorts.
The hardware solution is expensive to build in both the cost of designing the
hardware and the use of space inside a computer. Another viable solution is using a
symmetric multiprocessor (SMP) or a multi-computer system to perform the sorting in
software. The main drawback of implementing the design on a general system, such as an
SMP, is the overhead associated with message passing. Development of larger messages,
which pass several values at once, could reduce the average latency. The effect of larger
messages on the sorter's performance, however, needs to be investigated.
6.3.  Conclusion 
The use of parallel processors to speed up different tasks is a difficult problem. The
inherent linearity of certain tasks, such as sorting, often requires interesting partitioning
of data to perform the work in parallel. The large data capacity of many modern
systems has created the need for these parallel algorithms to process the large quantities
of data. In the task of parallel sorting, we looked at the MM-sort algorithm. The max-min
algorithm stands out because of its scalability, performance, and relative ease of
hardware implementation.
Before designing the processor, we looked at several different data structures in
which the processor stores data. The max-min heap, which the algorithm suggests, has
the disadvantage of not sorting until all the elements are present. The standard linear
array starts sorting immediately, but uses a large number of memory accesses to shift data
in the array. With improvements to the standard array, a new data structure, which
manages memory holes in the array to reduce the amount of shifting, achieved
performance much closer to that of the max-min heap. From the different data structures,
we saw how the performance of the same algorithm could be vastly different.
Chapter 5 discussed the performance of each of the different data structures as they
worked in parallel. It introduced a modification to the max-min algorithm, which the
new SRAM data structure could use. The performance of the new algorithm scaled
more smoothly than the original algorithm and outperformed it with as few as eight
processors. Running only 16 processors sorting 8k elements per processor, the new
algorithm is 35% faster. Finally, chapter 6 discussed the details important to the design
and areas for further research.
BIBLIOGRAPHY 
[1]  Akera, A. and Winegrad, D. (1996, January 30). A Short History of the Second 
American Revolution. [WWW page]. 
URL: http://www.upenn.edu/almanac/v42/n18/eniac. html 
[2]  How Fast is the INTEL ASCI Teraflops Computer, [WWW page]. 
URL:http://www.intel.com/pressroom/archiveheleases/cn1217fs.html 
[3]  Bull Data Warehousing. Overview, [WWW page]. 
URL:http://wwvv.dwo.bull.com/dwtechtc.htm 
[4]  Kaufman, Perlman, and Spenciner. Network Security: Private Communication in 
a Public World Prentice Hall Inc. 1995. 
[5]  Flex 10k Embedded Programmable Logic Family Data Sheet. Altera, June 1996 
[6]  Richman, Dan "Users discover little-known benefits of sorting software", 
Computerworld, Jan 15, 1996, v30, n3, p 52. 
[7]  Hoare, C. A. R. "Quicksort", Computer Journal, 1962, v5, n1, p 10-15. 
[8]  Baase, Sara. Computer Algorithms: Introduction to Design and Analysis. 
Addison-Wesley Publishing Company. 1988 
[9]  AKL, Selim G. Parallel Sorting Algorithms. Academic Press, Inc. 1985 
[10]  Adaptec: Top Performing, True multitasking. [WWW page] 
URL:http://www.adaptec.com/deskpoon/promohopperf.html 
[11]  HM514405D 1Megx4 Dynamic RAM access memory Data Sheet. Hitachi, 
December 1996. 
[12]  Leighton, F. Thomson. Introduction to parallel Algorithms and Architectures: 
Arrays, Trees, Hypercubes. Morgan Kaufmann Publishers, Inc. 1992 
[13]  Afghahi, A. A 512 16-b Bit Serial Sorter Chip. IEEE Journal of Solid-State 
Circuits, Vol 26, No 10. October 1991: pg 1452-1457. 
[14]  Atkinson, Sack, Santoro, Strothotte. Min Max Heaps and Generalized Priority 
Queues. Communications of the ACM. October 1986, Volume 29, Number 10. 
Pg 996-1000 73 
[15]  Zhang, Yanjun and Zheng, S. Q. A Simple and Efficient VLSI Sorting 
Architecture. Proceedings of the 37th Midwest Symposium on Circuits and 
Systems. Vol. 1, pgs 70-73. August 3-5, 1994 
[16]  Prince, B., Due-Gundersen, G. Semiconductor Memories. John Wiley & Sons 
Ltd, 1983. 
[17]  Fast Static RAM Databook, Motorola. 1993. 
[18]  Prince, Betty. High Performance Memories: New architecture DRAMs and 
SRAMs - evolution and function. John Wiley & Sons Ltd, 1996. 
[19]  Western Digital Drive Specifications AC31600, [WWW page] 
URL:http://www.wdc.com/products/drives/drive-specs/AC31600.html 
[20]  Triebel, W. and Chu, A. Handbook of Semiconductor and Bubble Memories. 
Prentice Hall, Inc. 1982. 
[21]  DRAM Databook, Micron Technology, Inc. 1992 
[22]  Jenkins, Jesse H. Designing with FPGAs and CPLDs. Prentice Hall, Inc. 1994. 
[23]  Max+Plus II for Workstations: Data sheet. Altera, July 1994. 
[24]  Floyd, Robert W. Algorithm 245: Treesort3. Communications of the ACM, 
December 1964 Vol 7 no 12. pp 701. 74 
APPENDICES  75 
APPENDICES  
On the following pages are the C++ class definitions used in the simulation of the
different processors. The processor chips are the SysArrayChip (SAC) objects. The
processor objects interacted with the NetworkSortChip (NSC) and the main program,
which acted as the BusInterfaceChip (BIC). The SRAM insertion sort processor code is
fully listed in Appendix A. Appendices B and C show the changes necessary to
implement the max-min heap and the optimized insertion sort, respectively.
All code was run on Pentium(TM) processors running Linux version 2.0.0. The
source code was compiled using the GNU C++ compiler, g++ v2.7.2. The data values were
varied by a testing program, which tabulated the results into a table. All the results
presented are the average of 16 different seed values for the random number generator.
APPENDIX A  SRAM Insertion Sort Processor Code 
The following code comprises the two files that define the C++ class SAC2, which
models the SRAM insertion sort processor.
/****************************************************************/
/*                                                              */
/* SysArrayChip2.h                                              */
/*                                                              */
/*  This file contains the class definitions for the MM-Sort    */
/* Processor Chip.  All the global enumerated types are defined */
/* in the file sorter.h                                         */
/*                                                              */
/****************************************************************/
#ifndef SYSARRAYCHIP2_H  
#define SYSARRAYCHIP2_H  
#include "sorter.h"  
class SAC2  {  
public:  
SAC2(void);  // Constructor function  
void Reset(void);  // Reset the chips pointers  
int Sort(unsigned type_of_sort);  // Manual excitation for Sorting  
/////////// BIC Interface functions /////////////  
BOOL BIC_DataRDY(void);  // Is the BIC data register empty  
CMD  BIC_Data( int & Data, CMD Command);  // Perform a read/write
/////////// NSC Interface functions /////////////  
BOOL NSC_DataRDY(void);  // Are the NSC data registers empty  
CMD  NSC_Data( int & DataHi, int & DataLo, CMD Command);  
// Perform a NSC read/write  
/////////// Statistical Functions  ///////////////  
double DRAMReads(void);  // Num of Dram Reads  
double SRAMReads(void);  // Num of Sram Reads  
double Comparisons(void);  // Num of Comparisons  
double DRAMWrites(void);  // Num of Dram Writes  
double SRAMWrites(void);  // Num of Sram Writes  
int  NumberinMemory(void);  // Num of elements in Array  
int  DRAMEffeciency(void);  // Effeciency of DRAM usage  
//Friends for allowing the testing programs to have  
//full access to the elements.  Remove after testing is done.  
friend void update(int);  
friend void main();  
private:  
///// BIC Registers /////  
int BicData;  // I/O Register for BIC Side
BOOL NewBicData;  // flag indicating if data needs to be sorted  
///// NSC Registers /////  
int NscDataHi;  // I/O Register for NSC Side (Upper value)  
int NscDataLo;  // I/O Register for NSC Side (Lower value)  
BOOL NewNscData;  // flag indicating if data needs to be sorted  
///// Statistic Registers ////  
double Dread;  // Number of DRAM Reads  
double Dwrite;  // Number of DRAM Writes  
double Sread;  // Number of SRAM Reads  
double Swrite;  // Number of SRAM Writes  
double Comparison;  // Number of Comparisons  
long Dholes;  // Number of DRAM holes lost  
///// Memory interface /////  
int SptrHi;  // Points to the next available SRAM slot  
// which a high value can go into  
int SptrLo;  // Points to the next available SRAM slot  
// which a low value can go into  
int numSorted;  // Number of elements in DRAM  
int DRAM[DRAMARRAYSIZE];  // The DRAM  
int SRAM[SRAMARRAYSIZE];  // The SRAM  
int NextDRAM;  // The pointer to the next available DRAM spot  
BOOL DRAMFrags;  // Flag indicating if there are DRAM holes  
// left to be reclaimed  
int DRAMHole[15];  // Stack for DRAM holes created  
int DRAMHolesptr;  // Pointer to the empty slot in DRAMHoles;  
///// Internal sorting functions for Sort ////  
// Sort a single value into Memory  
int Sort_Single(int DataValue);  
// Insert Value into Memory based on the pointers
int Insert(int Value, int sptr, int dptr, int direction);  
// Temporary Definitions to allow data gathering  
int DRAM_MASK;
int DRAM_USED;
int MAXHOLES;
}; // End SAC2 class definition
#endif // End of #ifndef SYSARRAYCHIP2_H
/****************************************************************/ 
/*  */ 
/*  SysArrayChip2.cpp                                           */
/*  */  
/*  This is the file for the implementation of the MM-sort  */  
/* processor chip.  The code implements the SRAM insertion sort */  
/* method of maintaining the data.  Global definitions are kept */  
/* in the file "sorter.h"  */  
/*  */  
/****************************************************************/  
#include "SysArrayChip2.h"  
SAC2::SAC2()
{
  // Initialize all the bus interface variables to the initial state
  BicData = 0;
  NewBicData = FALSE;
  NscDataHi = 0;
  NscDataLo = 0;
  NewNscData = FALSE;

  // ... (remaining initialization not recoverable from the source) ...

  // Initialize the DRAMHoles variables
  DRAMFrags = FALSE;
  DRAMHolesptr = 0;
  // Initialize the pointers to the middle of the SRAM ARRAY
  // By adding an extra SRAMARRAYSIZE, the pointers can wrap around
  // memory
  SptrHi = SRAMARRAYSIZE + SRAMARRAYSIZE/2;
  SptrLo = SRAMARRAYSIZE + (SRAMARRAYSIZE/2) - 1;
}
/****************************************************************/  
/* Reset  */  
/*  */  
/* Reset the chip by clearing all the variables  */  
/****************************************************************/  
void SAC2::Reset(void)
{
  // ... (lines not recoverable from the source) ...
  NewNscData = FALSE;

  // ... (lines not recoverable from the source) ...

  // Initialize the DRAMHoles variables
  DRAMFrags = FALSE;
  DRAMHolesptr = 0;
  // Reinitialize the pointers to the middle of the SRAM ARRAY
  // By adding an extra SRAMARRAYSIZE, the pointers can wrap around
  // memory
  SptrHi = SRAMARRAYSIZE + SRAMARRAYSIZE/2;
  SptrLo = SRAMARRAYSIZE + (SRAMARRAYSIZE/2) - 1;
}
/****************************************************************/
/* BIC_DataRDY()                                                */
/*                                                              */
/* This function returns a bool indicating if a CMD_WRITE       */
/* will be successful on the BIC side of the array              */
/****************************************************************/
BOOL SAC2::BIC_DataRDY()
{
  if(NewBicData==TRUE)  // If there is already unsorted Data
    return FALSE;       // then not ready for a transaction
  else
    return TRUE;
}
/****************************************************************/
/* CMD BIC_Data(int &, CMD)                                     */
/*                                                              */
/* This function performs a transaction on the BIC              */
/* side of the System Array Chip.  If the command               */
/* was successful the original command is returned              */
/* otherwise CMD_ERROR is returned                              */
/****************************************************************/
CMD SAC2::BIC_Data(int & Data, CMD Command)
{
  int Dramptr;
  // If the user wants to read the chip
  if(Command==CMD_READ)
  {  if(NewBicData != TRUE && numSorted>0) // If any elements are
     {  // in the array and Register is free.
        // Set Data to uppermost value
        Dramptr = SRAM[(SptrHi - 1) & SRAMARRAYSIZE_MASK];
        Data = DRAM[Dramptr];
        Dread++;
        Sread++;
        if((Dramptr & DRAM_USED) == 0)
        {  SptrHi--;
           if((DRAMFrags==FALSE) && (DRAMHolesptr < MAXHOLES))
           {  DRAMHole[DRAMHolesptr++] = Dramptr;
              DRAMFrags = TRUE;
           }
           else
           {  if(DRAMHolesptr < MAXHOLES)
              {  DRAMHole[DRAMHolesptr++] = Dramptr;
              }
              else
              {  Dholes++;
              }
           }
        }
        else
        {  SRAM[(SptrHi - 1) & SRAMARRAYSIZE_MASK]=Dramptr-1;
        }
        numSorted--;
        //Done to make looking at memory easier.  Can be removed
        DRAM[Dramptr] = 0;
        return Command; //Return that command was successful
     }
     else // no values to read
        return CMD_ERROR;  // Return Error indicating no read
  }
  // If the user wants to write the chip
  else if (Command == CMD_WRITE)
  // Check to make sure that there isn't data which needs to
  // be sorted into memory
  {  if(NewBicData!=TRUE)  // If register is empty
     {  NewBicData = TRUE; // Fill it and set flag
        BicData = Data;
        return Command;    // Return that the write was accepted
     }
     else // Otherwise cannot accept write
        return CMD_ERROR;  // Return error, no write
  }
  else // Not a read or a write command (unsupported command)
     return CMD_ERROR;
}
/****************************************************************/
/* BOOL NSC_DataRDY()  */
/*  */
/* This function returns if a CMD_WRITE will be successful on  */
/* the NSC side of the array  */
/*  */
/****************************************************************/
BOOL SAC2::NSC_DataRDY()
{
  if(NewNscData==TRUE || numSorted <= 1)  // If there is unsorted
    return FALSE;  // Data or not enough for a transaction
                   // Then false
  else
    return TRUE;
}
/****************************************************************/
/*  CMD NSC_Data(int &, int &, CMD)  */
/*  */
/*  This function performs a transaction on the NSC  */
/*  side of the System Array Chip.  If the command  */
/*  was successful the original command is returned  */
/*  otherwise CMD_ERROR is returned  */
/****************************************************************/
CMD SAC2::NSC_Data(int & DataHi, int & DataLo, CMD Command)
{
  int Dramptr;
  // If the user wants to read the chip
  if(Command==CMD_READ)
  {  if(NewNscData!=TRUE && numSorted >= 2) // Make sure enough data
     {  // to send and then send it
        // First read the UpperValue
        Dramptr = SRAM[(SptrHi - 1) & SRAMARRAYSIZE_MASK];
        DataHi = DRAM[Dramptr];
        //Done to make looking at memory easier.  Can be removed.
        DRAM[Dramptr] = 0;
        Dread++;
        Sread++;
        if((Dramptr & DRAM_USED) == 0)
        {  SptrHi--;
           SRAM[SptrHi & SRAMARRAYSIZE_MASK] = 0;
           if((DRAMFrags==FALSE) && (DRAMHolesptr < MAXHOLES))
           {  DRAMHole[DRAMHolesptr++] = Dramptr;
              DRAMFrags = TRUE;
           }
           else
           {  if(DRAMHolesptr < MAXHOLES)
              {  DRAMHole[DRAMHolesptr++] = Dramptr;
              }
              else
              {  Dholes++;
              }
           }
        }
        else
        {  SRAM[(SptrHi - 1) & SRAMARRAYSIZE_MASK]=Dramptr - 1;
           Swrite++;
        }
        numSorted--;
        // Next Read the LowerValue
        Dramptr = SRAM[(SptrLo + 1) & SRAMARRAYSIZE_MASK];
        DataLo = DRAM[(Dramptr & DRAM_MASK)];
        //Done to make looking at memory easier.  Can be removed.
        DRAM[(Dramptr & DRAM_MASK)] = 0;
        Dread++;
        Sread++;
        if((Dramptr & DRAM_USED) == 0)
        {  SptrLo++;
           SRAM[SptrLo & SRAMARRAYSIZE_MASK] = 0;
           if((DRAMFrags==FALSE) && (DRAMHolesptr < MAXHOLES))
           {  DRAMHole[DRAMHolesptr++] = Dramptr;
              DRAMFrags = TRUE;
           }
           else
           {  if(DRAMHolesptr < MAXHOLES)
              {  DRAMHole[DRAMHolesptr++] = Dramptr;
              }
              else
              {  Dholes++;
              }
           }
        }
        else // Shift all the memory values down
        {  for(int i = 0; i< (Dramptr & DRAM_USED); i++)
           {  DRAM[(Dramptr & DRAM_MASK) + i] =
                 DRAM[(Dramptr & DRAM_MASK) + i + 1];
              Dread++;
              Dwrite++;
              // Done to read DRAM easier
              DRAM[(Dramptr & DRAM_MASK) + i + 1]=0;
           }
           SRAM[(SptrLo+1) & SRAMARRAYSIZE_MASK] = Dramptr - 1;
           Swrite++;
        }
        numSorted--;
        return Command;  // Return success
     }
     else  // Either Registers are in use or not enough memory
        return CMD_ERROR;  // Return error
  }
  // If the user wants to write the chip
  else if (Command == CMD_WRITE)
  {  if(NewNscData!=TRUE)  // If the registers are empty
     {  NewNscData = TRUE; // Set the flag and read the values
        NscDataHi = DataHi;
        NscDataLo = DataLo;
     }
     else  // The registers are in use
        return CMD_ERROR;  // Return error
  }
  else // Unsupported command
     return CMD_ERROR;  // Return error
  return Command;  // Write accepted: return the original command
}
/****************************************************************/
/* double DRAMReads()  */
/*  */
/* This accessor function returns the number of DRAM reads  */
/****************************************************************/
double SAC2::DRAMReads()
{  return Dread;}
/****************************************************************/
/* double SRAMReads()  */
/*  */
/* This accessor function returns the number of SRAM reads  */
/****************************************************************/
double SAC2::SRAMReads()
{  return Sread;}
/****************************************************************/
/* double Comparisons()  */
/*  */
/* This function returns the number of Data comparisons  */
/****************************************************************/
double SAC2::Comparisons()
{  return Comparison;}
/****************************************************************/
/* double DRAMWrites()  */
/*  */
/* This accessor function returns the number of DRAM writes  */
/****************************************************************/
double SAC2::DRAMWrites(void)
{  return Dwrite;}
/****************************************************************/
/* double SRAMWrites()  */
/*  */
/* This accessor function returns the number of SRAM writes  */
/****************************************************************/
double SAC2::SRAMWrites(void)
{  return Swrite;}
/****************************************************************/
/* int NumberInMemory()  */
/*  */
/* This accessor function returns the number of items in the  */
/* memory array.  */
/****************************************************************/
int SAC2::NumberInMemory(void)
{  return numSorted;}
/****************************************************************/
/* int DRAMEffeciency()  */
/*  */
/* This accessor function returns the percentage of DRAM which  */
/* is being used over that which is allocated  */
/****************************************************************/
int SAC2::DRAMEffeciency(void)
{ int Used = 0, Allocated = 0;
  for(int i=SptrLo+1; i < SptrHi; i++)
  {  Used += (SRAM[(i & SRAMARRAYSIZE_MASK)] & DRAM_USED) + 1;
     Allocated += DRAM_USED + 1;
  }
  if(Allocated == 0)
  {  return 100;  }
  else
  {  return (Used*100)/Allocated;}
}
/****************************************************************/
/* int Sort(unsigned)  */
/*  */
/* This function is the manual excitation required to make the  */
/* Array sort the data values loaded into the Bus Buffer  */
/* Registers.  The unsigned int represents the types of  */
/* optimization which are selected by the user.  The return  */
/* value is the number of data values which were added to the  */
/* array.  */
/****************************************************************/
int SAC2::Sort(unsigned type_of_sort)
{
  int temp=0;  //Value holder for the number of elements sorted
  int Dramptr;  //pointer to the DRAM value
  if(NewNscData == TRUE && numSorted >= 1)  // Sort the NSC Data
  {  temp +=2;
     Comparison += 2;
     Sread +=2;
     Dread +=2;
     Dramptr = SRAM[(SptrHi - 1) & SRAMARRAYSIZE_MASK];
     if(NscDataHi >= DRAM[Dramptr])
     {  Insert(NscDataHi,SptrHi-1,Dramptr,1);}
     else
     {  Sort_Single(NscDataHi);
     }
     Dramptr = SRAM[(SptrLo + 1) & SRAMARRAYSIZE_MASK];
     if(NscDataLo <= DRAM[(Dramptr & DRAM_MASK)])
     {  Insert(NscDataLo, SptrLo+1, Dramptr & DRAM_MASK, -1);  }
     else
     {  Sort_Single(NscDataLo);  }
     NewNscData = FALSE; // Indicate the values have been sorted
  }
  else if(NewNscData == TRUE)
  {  temp +=2;
     Sort_Single(NscDataHi);
     Sort_Single(NscDataLo);
     NewNscData = FALSE; // Indicate the values have been sorted
  }
  if(NewBicData == TRUE)  // Second sort BIC Data if valid
  {  temp++;
     Sort_Single(BicData);
     NewBicData = FALSE; // Indicate the value has been sorted
  }
  return temp;
}
/****************************************************************/
/* int Sort_Single( int )  */
/*  */
/* This function does the actual sorting into the DRAM memory  */
/* array.  If the memory value compares to the new sorted  */
/* value, no further comparisons are done.  */
/* The return value is where the item was placed in the list  */
/****************************************************************/
int SAC2::Sort_Single(int DataValue)
{
  int tmpHi_ptr, tmpLo_ptr, cmp_ptr;
  int Dramptr;
  int temp = 0, done = 0;
  int offset;
  int direction;
  tmpHi_ptr = SptrHi-1;  // Point to upper most valid value
  tmpLo_ptr = SptrLo+1;  // Point to lower most valid value
  cmp_ptr = (tmpHi_ptr + tmpLo_ptr)/2;
  if(numSorted >= 1)  //  If there is a valid value in the array
  // Loop until sorted
  { while(done != 1)
    { Dramptr = SRAM[(cmp_ptr & SRAMARRAYSIZE_MASK)];




      if(temp > DataValue)  // if data is less than value
      { tmpHi_ptr=cmp_ptr-1; // adjust the upper pointer
      }
      else if(temp < DataValue)  // if data is greater than value
      { tmpLo_ptr = cmp_ptr+1; // adjust the lower pointer
      }
      else // data is equal to the value to insert
      { // Insert up, since it normally requires less moves
        Insert(DataValue, cmp_ptr, Dramptr, +1);
        done=1;  // Enable exit of loop
      }
      if(done != 1)  // If we are not done check to see
      { if(tmpHi_ptr < tmpLo_ptr) // if pointers cross then
        { // "zoom in" on DRAM pointer for insertion
          cmp_ptr = tmpLo_ptr;  // Save the SRAM ptr
          if(cmp_ptr == SptrHi)
          { cmp_ptr--;
            Dramptr=SRAM[(cmp_ptr & SRAMARRAYSIZE_MASK)];
            Sread++;
            Insert(DataValue, cmp_ptr, Dramptr, 1);
          }
          else if (cmp_ptr == SptrLo)
          { cmp_ptr++;
            Dramptr=SRAM[(cmp_ptr & SRAMARRAYSIZE_MASK)];
            Sread++;
            Insert(DataValue, cmp_ptr, (Dramptr & DRAM_MASK), -1);
          }
          else
          { direction = -1;
            Dramptr = SRAM[(cmp_ptr & SRAMARRAYSIZE_MASK)];
            Sread++;
            tmpHi_ptr = Dramptr-1;
            tmpLo_ptr = Dramptr & DRAM_MASK;
            while(tmpHi_ptr >= tmpLo_ptr)
            { Dramptr = (tmpHi_ptr + tmpLo_ptr)/2;
              temp = DRAM[Dramptr];
              Dread++;
              Comparison++;
              if(temp > DataValue)// if data is l.t. value
              { tmpHi_ptr=Dramptr-1; // adjust the upper pointer
                direction = -1;
              }
              else  // if data is g.t or equal to the value
              { tmpLo_ptr = Dramptr+1; // adjust the pointer
                direction = +1;
              }
            }
            // shift memory between pointers
            Insert(DataValue, cmp_ptr, Dramptr, direction);
          }
          done=1;  // Enable exit of loop
        } // End of DRAM "Zoom"
        else   // Otherwise continue
        { cmp_ptr=(tmpLo_ptr+tmpHi_ptr)/2;  // Calculate new pointer
        }
      }
    }
  }
  else if(numSorted==0)// SRAM & DRAM is empty just insert the element
  { SRAM[(SptrHi++) & SRAMARRAYSIZE_MASK] = NextDRAM;
    DRAM[NextDRAM] = DataValue;




  }
  else  // Number of elements became negative.  Indicate an error
  {  cout << "ERROR.  Number sorted became negative"  ;
     exit( -1);
  }
  return cmp_ptr-1;
}
/****************************************************************/  
/* int Insert( int  ,  int ,  int ,  int )  */  
/*  */  
/* This function will insert the first int Value into the DRAM  */  
/* memory.  If there isn't space in the DRAM, a hole will be  */  
/* created in the SRAM and the new value will be moved into the */  
/* hole.  The second integer is a pointer to the location in  */
/* the SRAM where the hole would need to be created.  The third */  
/* value is the pointer to the actual DRAM value, with the last */  
/* int indicating the direction which it needs to move, Positive*/  
/* indicating up, negative indicating down.  */  
/****************************************************************/  
int SAC2::Insert(int Value, int sptr, int dptr, int direction)
int tmp_ptr, i, tmp;
int dptr_below, dptr_above;
int tempData;
//Error Checking of SRAM pointers
if(SptrHi > (3*SRAMARRAYSIZE - 4) || SptrLo < 4)
{  cout << "Reached boundary of the SRAM array! Exiting." ;
   exit(-20);
}
//Error Checking of NextDRAM pointer
if(NextDRAM >= DRAMARRAYSIZE - 3*(DRAM_USED+1))
{  cout << "Reached the edge of DRAM usage! Exiting." ;
   exit(-60);
}
tmp_ptr = SRAM[(sptr & SRAMARRAYSIZE_MASK)];
if((tmp_ptr & DRAM_MASK) != (dptr & DRAM_MASK))
{  cout << "DRAM and SRAM pointers do not agree.  Exiting."  ;
   exit(-2);
}
Sread++;
if(direction >0)
{ dptr += 1;
if((tmp_ptr&DRAM_USED)==DRAM_USED )// Then we need to look above and
// Below for a memory hole
dptr_below = SRAM[(sptr-1) & SRAMARRAYSIZE_MASK];
dptr_above = SRAM[(sptr+1) & SRAMARRAYSIZE_MASK];
// Statistics updated if values are actually used
if((sptr-1 != SptrLo) && (Sread++) &&
((dptr_below & DRAM_USED) != DRAM_USED))
{ SRAM[(sptr-1) & SRAMARRAYSIZE_MASK] = dptr_below+1;
Swrite++;
tmp_ptr = tmp_ptr & DRAM_MASK;
if(tmp_ptr == dptr )  // dptr is supposed to be the value moved




{  DRAM[dptr_below+1] = DRAM[tmp_ptr];  
while(tmp_ptr < dptr -1)  









else if ((sptr+1 != SptrHi) && (Sread++) &&
((dptr_above & DRAM_USED) != DRAM_USED))
SRAM[(sptr+1) & SRAMARRAYSIZE_MASK] = dptr_above+1;
Swrite++;
// Loop until the pointer wraps around
while((dptr_above & DRAM_USED) != DRAM_USED)
{  DRAM[dptr_above+1]=DRAM[dptr_above];















while(tmp_ptr >= dptr)  









else // No memory hole above or below need to create one  
// Shift memory which will require the shortest number of moves  
if( (sptr-1-SptrLo) > (SptrHi-sptr-1) )  // shift memory up
{ tmp=SptrHi-1;	  // Memory hole is one above tmp  
SptrHi++;  
while(sptr <= tmp)  // as long as the hole !eq to the pointer  
{  SRAM[(tmp+1) & SRAMARRAYSIZE_MASK] =  






else  // Shift Memory down  
{	  tmp=SptrLo+1;  // Memory hole is one below tmp  
SptrLo--;  
while(sptr-1 >= tmp) // as long as the hole isn't at ptr - 1
{  SRAM[(tmp-1) & SRAMARRAYSIZE_MASK] =  





}
}
// Now there is a free space below sptr
if(DRAMFrags == TRUE)
{  SRAM[(sptr-1) & SRAMARRAYSIZE_MASK] = DRAMHole[--DRAMHolesptr];
   Swrite++;
   if(DRAMHolesptr == 0)
   {  DRAMFrags = FALSE;
   }
   tmp = DRAMHole[DRAMHolesptr];
   DRAMHole[DRAMHolesptr] = 0;
}
else
{  SRAM[(sptr-1) & SRAMARRAYSIZE_MASK] = NextDRAM;
   tmp = NextDRAM;
   NextDRAM = NextDRAM + DRAM_USED + 1;
   Swrite++;
}
// Move the lower value into the new array
tmp_ptr = tmp_ptr & DRAM_MASK;
if(tmp_ptr == dptr) // dptr is supposed to be the value moved




DRAM[tmp] = DRAM[tmp_ptr];  
Dread++;  
Dwrite++;  
while(tmp_ptr < dptr - 1)  









else // Else Shift memory  in DRAM "pod" up  
{	  SRAM[(sptr) & SRAMARRAYSIZE_MASK]=tmp_ptr+1;  
Swrite++;  
while(tmp_ptr >= dptr)  







numSorted++;
APPENDIX B Max-Min Heap Processor Code
Only the differences between the SRAM insertion sort processor and the max-min heap
processor code are included for brevity. The major change is in the sorting functions.
For a full description of the sorting functions, see [11].
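The heap routines in this appendix treat the root (index 1) as a max node and alternate between max and min levels below it. As a minimal sketch of that convention, the helper below derives a node's level type from its 1-based index. The name `isMaxLevel` is hypothetical and does not appear in the thesis code, which computes the same parity inline with a `maxnode` toggle.

```cpp
// Hypothetical helper (not part of the thesis code): returns true when a
// 1-based heap index lies on a max level, assuming the root (index 1) is
// a max node and the levels alternate max, min, max, ...
bool isMaxLevel(unsigned node)
{
    bool maxLevel = true;          // the root level is a max level
    while (node / 2 > 0) {         // climb one level toward the root
        node /= 2;
        maxLevel = !maxLevel;      // each step up flips the level type
    }
    return maxLevel;
}
```

Under this assumption, indices 2 and 3 (the children of the root) are min nodes, which is why the minimum is always found at one of those two positions.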
/****************************************************************/
/*  */
/* SysArrayChip2.h  */
/*  */
/****************************************************************/
#ifndef SYSARRAYCHIP2_H
#define SYSARRAYCHIP2_H
#include "sorter.h"
class SAC2  {
private:
  ///// Memory interface /////
  int numSorted;  // Number of elements in DRAM
  int DRAM[DRAMARRAYSIZE];  //  The DRAM
  int Dptr;  //  Pointer to the next available DRAM place
  ///// Internal functions for the MM Heap ////
  BOOL Created;
  // Remove the max or min value from Memory
  int Delete(BOOL top );
  // Insert Value into Memory based on the pointers
  int Insert(int Value);
  // Create the Heap
  int Create(void);
  void TrickleDownMax( int node);
  void TrickleDownMin( int node);
  void BubbleUp( int node);
  void BubbleUpMax( int node);
  void BubbleUpMin( int node);
}; // End SAC2 class definition
#endif // End of #ifndef SYSARRAYCHIP2_H
/****************************************************************/
/*  */
/* SysArrayChip2.cpp  */
/*  */
/****************************************************************/
#include "SysArrayChip2.h"
/****************************************************************/
/* CMD BIC_Data(int &, CMD)  */
/*  */
/* This function performs a transaction on the BIC  */
/* side of the System Array Chip.  If the command  */
/* was successful the original command is returned  */
/* otherwise CMD_ERROR is returned  */
/****************************************************************/
CMD SAC2::BIC_Data(int & Data, CMD Command)
{
  int Dramptr;
  // If the user wants to read the chip
  if(Command==CMD_READ)
  { if(NewBicData != TRUE && numSorted>0 && Created == TRUE)
    {  //If there are elements, the Register is free, and the Heap was
       // created Set Data to uppermost value
       Data = Delete(TRUE);  // Remove the Maximum value
       return Command; //Return that command was successful
    }
    else // no values to read
       return CMD_ERROR;  // Return Error indicating no read
  }
  // If the user wants to write the chip
  else if (Command == CMD_WRITE)
  // Check to make sure that there isn't data which needs to
  // be sorted into memory
  { if(NewBicData!=TRUE)  // If register is empty
    {  NewBicData = TRUE;  // Fill it and set flag
       BicData = Data;
       if(Created == FALSE)
       {  Insert(BicData);
          NewBicData = FALSE;
       }
    }
    else // Otherwise cannot accept write
       return CMD_ERROR;  // Return error, no write
  }
  else // Not a read or a write command (unsupported command)
     return CMD_ERROR;
  return Command;  // Write accepted: return the original command
}
/****************************************************************/
/* CMD NSC_Data(int &, int &, CMD)  */
/*  */
/* This function performs a transaction on the NSC  */
/* side of the System Array Chip.  If the command  */
/* was successful the original command is returned  */
/* otherwise CMD_ERROR is returned  */
/****************************************************************/
CMD SAC2::NSC_Data(int & DataHi, int & DataLo, CMD Command)
{
  int Dramptr;
  // If the user wants to read the chip
  if(Command==CMD_READ)
  { if(NewNscData!=TRUE && numSorted >= 2 && Created == TRUE)
    {  // Make sure enough data to send and it is a Heap
       // First read the UpperValue
       DataHi = Delete(TRUE);
       // Next Read the LowerValue
       DataLo = Delete(FALSE);
       return Command;  // Return success
    }
    else  // Either Registers are in use or not enough memory
       return CMD_ERROR;  // Return error
  }
  // If the user wants to write the chip
  else if (Command == CMD_WRITE)
  { if(NewNscData!=TRUE)  // If the registers are empty
    {  NewNscData = TRUE; // Set the flag and read the values
       NscDataHi = DataHi;
       NscDataLo = DataLo;
    }
    else  // The registers are in use
       return CMD_ERROR;  // Return error
  }
  else // Unsupported command
     return CMD_ERROR;  // Return error
  return Command;  // Write accepted: return the original command
}
/****************************************************************/
/* int Sort()  */
/*  */
/* This function is the manual excitation required to make the  */
/* Array sort the data values loaded into the Bus Buffer  */
/* Registers. The return value is the number of data values  */
/* which were added to the array.  */
/****************************************************************/
int SAC2::Sort(void)
{
  int temp = 0; // Temporary value for number of elements sorted
  if(Created == FALSE)
  {  Create();
     Created = TRUE;
  }
  if(NewNscData == TRUE )  // First Sort both the NSC Data values
  {  temp =2;
     Insert(NscDataHi);
     Insert(NscDataLo);
     NewNscData = FALSE; // Indicate that the values have been sorted
  }
  if(NewBicData == TRUE)  // Second sort BIC Data if valid
  {  temp++;
     Insert(BicData);
     NewBicData = FALSE; // Indicate that the value has been sorted
  }
  return temp;
}
/****************************************************************/  
/* int Delete(BOOL top)   */  
/*   */  
/* This function deletes a value from the heap.  If top is true */  
/* then the value is removed from the Maximum node, if the  */  
/* value is false then it is removed from one of the minimum  */  
/* nodes.  The return value is the value deleted from the list  */  
/****************************************************************/  
int SAC2::Delete(BOOL top)
{
  int Returnvalue;
  if(Created == FALSE)
  {  cout << " ERROR: Trying to delete values when heap hasn't been";
     cout << " created" << endl;
     exit(0);
  }
if(top == TRUE)  //Remove maximum value  
{ Returnvalue = DRAM[1];  
Dread++;  
DRAM[1] = DRAM[--Dptr];  






else	  //Find minimum value and remove it  
{	  Dread +=2;  
Comparison ++;  
if(DRAM[3] > DRAM[2])  
{  Returnvalue = DRAM[2];  
DRAM[2] = DRAM[--Dptr];  






{ Returnvalue = DRAM[3];  
DRAM[3] = DRAM[--Dptr];  




TrickleDownMin(3);
return Returnvalue;
}
/****************************************************************/  
/* int Insert( int Value )  */  
/*  */  
/* This function will insert the Value into the DRAM Heap.  */  
/****************************************************************/  
int SAC2::Insert(int Value)
{
  //Error Checking of NextDRAM pointer
  if(Dptr >= DRAMARRAYSIZE - 3)
  {  cout << "Reached the edge of DRAM usage! Exiting." ;
     exit(-60);
  }
  if(Created == FALSE)
  {  DRAM[Dptr++] = Value;
     numSorted++;
     return 1;
  }
  else
  {  DRAM[Dptr++] = Value;
     numSorted++;
     BubbleUp(Dptr-1);




/* int Create( void )  */  
/*  */  
/* This function will Create the DRAM Heap.  */  
/****************************************************************/  
int SAC2::Create( void)
{
  int i = Dptr/2;
  int temp;
  unsigned int maxnode = 1;
  for( ; i > 0 ; i--)
  {  temp = i;
     maxnode = 1;
     while( (temp / 2) > 0 )
     {  temp = temp / 2;
        maxnode = maxnode ^ 1;
     }
     if(maxnode == 1)
     {  TrickleDownMax(i);
     }
     else
     {  TrickleDownMin(i);
     }
  }
}
/****************************************************************/
/* void TrickleDownMax( int node )  */
/*  */
/* This function will ensure a correct MaxMin heap below the  */
/* current node. (root node is a max node)  */
/****************************************************************/
void SAC2::TrickleDownMax( int node)
{
  int tempnode, swapnode, Maxvalue;
  BOOL Grandchild = FALSE;
  if(node*2 >= Dptr)
  {  return;
  }
  Maxvalue = DRAM[node*2];
  swapnode = node*2;
  if((node * 2) + 1 < Dptr)
  {  Dread ++;
     Comparison ++;
     if(Maxvalue < DRAM[node*2 + 1])
     {  Maxvalue = DRAM[node*2 + 1];
        swapnode = node*2 + 1;
     }
  }
  tempnode = node * 4;
  while(tempnode < Dptr && tempnode < node*4+4)
  {  Dread ++;
     Comparison ++;
     if(DRAM[tempnode] > Maxvalue )
     {  Maxvalue = DRAM[tempnode];
        swapnode = tempnode;
        Grandchild = TRUE;
     }
     tempnode++;
  }
  if(Grandchild == TRUE)
  {  Comparison ++;
     Dread ++;
     if(DRAM[node] < Maxvalue)
     {  DRAM[swapnode] = DRAM[node];
        DRAM[node] = Maxvalue;
        Dwrite +=2;
        Dread ++;
        Comparison ++;
        if(DRAM[swapnode] < DRAM[swapnode/2])
        {  Maxvalue = DRAM[swapnode/2];
           DRAM[swapnode/2] = DRAM[swapnode];
           DRAM[swapnode] = Maxvalue;
           Dwrite +=2;
        }
        TrickleDownMax(swapnode);
     }
     else // Node is correct
     {  return;}
  }
  else // Swapping with the child
  {  Comparison ++;
     Dread ++;
     if(Maxvalue > DRAM[node])
     {
        DRAM[swapnode] = DRAM[node];
        DRAM[node] = Maxvalue;
        Dwrite +=2;
     }
  }
}
/****************************************************************/
/* void TrickleDownMin( int node )  */
/*  */
/* This function will ensure a correct MaxMin heap below the  */
/* current node. (root node is a min node)  */
/****************************************************************/
void SAC2::TrickleDownMin( int node)
{
  int tempnode, swapnode, Minvalue;
  BOOL Grandchild = FALSE;
  if(node*2 >= Dptr)
  {  return;
  }
  Minvalue = DRAM[node*2];
  swapnode = node*2;
  if((node * 2) + 1 < Dptr )
  {  Dread ++;
     Comparison ++;
     if(Minvalue > DRAM[node*2 + 1])
     {  Minvalue = DRAM[node*2 + 1];
        swapnode = node*2 + 1;
     }
  }
  tempnode = node * 4;
  while(tempnode < Dptr && tempnode < node*4+4)
  {  Dread ++;
     Comparison ++;
     if(DRAM[tempnode] < Minvalue )
     {  Minvalue = DRAM[tempnode];
        swapnode = tempnode;
        Grandchild = TRUE;
     }
     tempnode++;
  }
  if(Grandchild == TRUE)
  {  Comparison ++;
     Dread ++;
     if(DRAM[node] > Minvalue)
     {  DRAM[swapnode] = DRAM[node];
        DRAM[node] = Minvalue;
        Dwrite +=2;
        Dread ++;
        Comparison ++;
        if(DRAM[swapnode] > DRAM[swapnode/2])
        {  Minvalue = DRAM[swapnode/2];
           DRAM[swapnode/2] = DRAM[swapnode];
           DRAM[swapnode] = Minvalue;
           Dwrite +=2;
        }
        TrickleDownMin(swapnode);
     }
     else // Node is correct
     {  return;  }
  }
  else // Swapping with the child
  {  Comparison ++;
     Dread ++;
     if(Minvalue < DRAM[node])
     {  DRAM[swapnode] = DRAM[node];
        DRAM[node] = Minvalue;
        Dwrite +=2;
     }
  }
}
/****************************************************************/
/* void BubbleUp( int node )  */
/*  */
/* This function will check above the current node to make sure */
/* that the value is correctly placed.  */
/****************************************************************/
void SAC2::BubbleUp( int node)
{
  int temp = node;
  int tempValue;
  unsigned int maxnode = 1;
  while( (temp / 2) > 0 )
  {  temp = temp / 2;
     maxnode = maxnode ^ 1;
  }
  if(maxnode == 0)
  {  if( node / 2 > 0)
     {  Dread +=2;
        Comparison ++;
        if( DRAM[node] > DRAM[node/2])
        {  tempValue = DRAM[node];
           DRAM[node] = DRAM[node/2];
           DRAM[node/2] = tempValue;




        {  BubbleUpMin(node);}
     }
  }
  else // Max node
  {  if( node / 2 > 0)
     {  Dread +=2;
        Comparison ++;
        if( DRAM[node] < DRAM[node/2])
        {  tempValue = DRAM[node];
           DRAM[node] = DRAM[node/2];
           DRAM[node/2] = tempValue;
           Dwrite +=2;
           BubbleUpMin(node/2);
        }
        else
        {  BubbleUpMax(node);
        }
     }
  }
}
/****************************************************************/
/* void BubbleUpMax( int node )  */
/*  */
/* This function will check above the current node to make sure */
/* that the value is correctly placed.  */
/****************************************************************/
void SAC2::BubbleUpMax( int node)
{
  int Temp;
  if(node/4 > 0)
  {  Dread++;
     Comparison++;
     if(DRAM[node] > DRAM[node/4])
     {  Temp = DRAM[node];
        DRAM[node] = DRAM[node/4];
        DRAM[node/4] = Temp;




/* void BubbleUpMin( int node  )  */  
/*  */  
/* This function will check above the current node to make sure */  
/* that the value is correctly placed.  */  
/****************************************************************/  
void SAC2::BubbleUpMin( int node)
{
  int Temp;
  if(node/4 > 0)
  {  Dread++;
     Comparison++;
     if(DRAM[node] < DRAM[node/4])
     {  Temp = DRAM[node];
        DRAM[node] = DRAM[node/4];
        DRAM[node/4] = Temp;
        Dwrite +=2;
        BubbleUpMin(node/4);
     }
  }
}
APPENDIX C Optimized Insertion Sort Processor Code 
Only the differences between the SRAM insertion sort processor and the standard 
insertion sort processor code are included for brevity. The major change is in the sorting 
functions. 
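The sort options declared in this appendix's header (`EQUAL_OPT`, `BOUNDS_OPT`, `NSCONLY_OPT`, `GROUPSORT_OPT`, `GRPMOVE_OPT`) are distinct powers of two, so several can be combined into the single `unsigned` argument of `Sort` and tested individually with a bitwise AND. The sketch below illustrates that convention only. The constants repeat the header's values as `const unsigned` rather than `#define`, and the helper `hasOption` is hypothetical and not part of the thesis code.

```cpp
// Option values as listed in SysArrayChip.h, written here as typed
// constants for the sake of a self-contained example.
const unsigned EQUAL_OPT     = 1;
const unsigned BOUNDS_OPT    = 2;
const unsigned NSCONLY_OPT   = 4;
const unsigned GROUPSORT_OPT = 8;
const unsigned GRPMOVE_OPT   = 16;

// Hypothetical helper (not in the thesis code): true when `flag` is
// set in the combined option word `type_of_sort`.
bool hasOption(unsigned type_of_sort, unsigned flag)
{
    return (type_of_sort & flag) != 0;
}
```

A caller would combine options with a bitwise OR, for example `BOUNDS_OPT | GROUPSORT_OPT`, and the sorter would then check each option with a test equivalent to `hasOption`.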
/****************************************************************/  
/* SysArrayChip.h  */  
/****************************************************************/  
#ifndef SYSARRAYCHIP_H  
#define SYSARRAYCHIP_H  
#include "sorter.h"  
// Changeable Sort Options  
#define EQUAL_OPT 1  // This optimizes the moves which will occur
// when the data is equal to each other
#define BOUNDS_OPT 2 // This optimizes the sorter to check the upper
// value and lower value passed in from the NSC Side
#define NSCONLY_OPT 4 // This optimizes the sorter when the NSC is
// the only side giving values
#define GROUPSORT_OPT 8  // This optimizes the sorter using the
// previous comparison as a starting value for the next sort
#define GRPMOVE_OPT 16  // This optimizes the sorter to move the
// elements as groups minimizing the number of duplicate moves
class SAC {  
private:  
///// Memory interface /////  
int ptrHi;  // Points to the next available DRAM slot  
// which a high value can go into  
int ptrLo;  // Points to the next available DRAM slot  
// which a low value can go into  
int numSorted;  // Number of elements in DRAM  
int DRAM[ARRAYSIZE];  // The DRAM  
///// Internal sorting functions for Sort ////  
//Sort a single value  
int Sort_Single(int DataValue, unsigned type_of_sort,  
int place, int ptr);  
int ShiftMemory(int ptr);  // Shift memory around the pointer  
int Sort2(void);  // Secondary Sorting function (optimized)  
///// Internal sorting functions for Sort2 ////  
// Find the memory hole location for the data  
int Find_Place(int upperPtr, int lowerPtr, int Value);  
// Shift memory around the pointers  
void ShiftMemory3(int ptr[], int holes);  
}; // End SAC class definition  
#endif // End of #ifndef SYSARRAYCHIP_H
/****************************************************************/
/*  */
/* SysArrayChip.cpp  */
/*  */
/****************************************************************/
/* CMD BIC_Data(int &, CMD)  */
/*  */
/* This function performs a transaction on the BIC  */
/* side of the System Array Chip.  If the command  */
/* was successful the original command is returned  */
/* otherwise CMD_ERROR is returned  */
/****************************************************************/
CMD SAC::BIC_Data(int & Data, CMD Command)
{
  // If the user wants to read the chip
  if(Command==CMD_READ)
  { if(NewBicData != TRUE && numSorted>0) // If elements to read
    {  Data=DRAM[(--ptrHi) & ARRAYSIZE_MASK];  // and Register is free.
       // Set Data to uppermost value
       Dread++;
       numSorted--;
       //Done to make looking at memory easier.  Can be removed
       DRAM[(ptrHi) & ARRAYSIZE_MASK]=0;
       return Command; //Return that command was successful
    }
    else // no values to read
       return CMD_ERROR;  // Return Error indicating no read
  }
  // If the user wants to write the chip
  else if (Command == CMD_WRITE)
  // Check to make sure that there isn't data which needs to
  // be sorted into memory
  { if(NewBicData!=TRUE)  // If register is empty
    {  NewBicData = TRUE; // Fill it and set flag
       BicData = Data;
    }
    else // Otherwise cannot accept write
       return CMD_ERROR;  // Return error, no write
  }
  else // Not a read or a write command (unsupported command)
     return CMD_ERROR;
  return Command;  // Write accepted: return the original command
}
/****************************************************************/
/* CMD NSC_Data(int &, int &, CMD)  */
/*  */
/* This function performs a transaction on the NSC  */
/* side of the System Array Chip.  If the command  */
/* was successful the original command is returned  */
/* otherwise CMD_ERROR is returned  */
/****************************************************************/
CMD SAC::NSC_Data(int & DataHi, int & DataLo, CMD Command)
{
  // If the user wants to read the chip
  if(Command==CMD_READ)
  {  if(NewNscData!=TRUE && numSorted >= 2)// Make sure enough elements
     {  DataHi=DRAM[(--ptrHi) & ARRAYSIZE_MASK]; // to send and then send
        DataLo=DRAM[(++ptrLo) & ARRAYSIZE_MASK];
        Dread += 2;
        numSorted -= 2;
        //Done to make looking at memory easier.  Can be removed.
        DRAM[(ptrHi) & ARRAYSIZE_MASK]=0;
        DRAM[(ptrLo) & ARRAYSIZE_MASK]=0;
        return Command;  // Return success
     }
     else  // Either Registers are in use or not enough memory
        return CMD_ERROR;  // Return error
  }
  // If the user wants to write the chip
  else if (Command == CMD_WRITE)
  {  if(NewNscData!=TRUE)  // If the registers are empty
     {  NewNscData = TRUE; // Set the flag and read the values
        NscDataHi = DataHi;
        NscDataLo = DataLo;
     }
     else  // The registers are in use
        return CMD_ERROR;  // Return error
  }
  else // Unsupported command
     return CMD_ERROR;  // Return error
  return Command;  // Write accepted: return the original command
}
/****************************************************************/  
/* int Sort(unsigned)  4/  
/*  */  
/* This function is the manual excitation required to make the  */  
/* Array sort the data values loaded into the Bus Buffer  */  
/* Registers.  The unsigned int represents the types of  */  
/* optimization which are selected by the user.  The return  */  
/* value is the number of data values which were added to the  */  
/* array.  */  
/****************************************************************/  
int SAC::Sort(unsigned type_of_sort)
{
  int temp=0;  // Value holder for the number of elements sorted
  if(type_of_sort & GRPMOVE_OPT)   // If the type of sort includes the
  {  return Sort2();               // GRPMOVE option run Sort2()
  }
  if(type_of_sort & GROUPSORT_OPT) // If the sorting is to be done as a
  { int TempSortArray[3];          // group sort using the following code
    int ptr=0,done=0;
    if(NewNscData == TRUE)  // If there is data from the NSC put
    { TempSortArray[0] = NscDataLo; // the sorted values in the array
      TempSortArray[1] = NscDataHi;
      ptr = 2;  // Move the pointer to the next free space in the array
      NewNscData = FALSE; // Indicate the data will be added to the DRAM
      temp += 2;  // Indicate there are 2 values to be sorted
    }
    if(NewBicData == TRUE)  // If there is data from the BIC
    { temp += 1;  // Indicate there is one more value to be sorted
      NewBicData = FALSE;
      while(ptr != 0 && done == 0) // While a space hasn't been found and
      {  // the empty spot isn't the bottom of the array
        if(TempSortArray[ptr-1] > BicData) // If the next lower data is
                                           // greater than the new data
        { TempSortArray[ptr] = TempSortArray[ptr-1]; // move the data up
          ptr--;  // Decrement the pointer
        }
        else  // Otherwise put the data in the hole
        { TempSortArray[ptr] = BicData;
          done = 1;  // Enable the exit of the loop
        }
      }
      if(done == 0) // If the loop exited and we are not done
      { TempSortArray[ptr] = BicData;} // put value in 0's place
    }
    ptr = 0;  // Reset the pointer
    for(int i = 0; i < temp; i++)  // Loop on the number of elements
    { done = 0;
      if(i == 0 && (type_of_sort & BOUNDS_OPT)) // Check the lowest
      { Comparison++;  // value against the bottom value
        Dread++;
        // If it is less insert it into DRAM
        if(TempSortArray[i] <= DRAM[(ptrLo+1) & ARRAYSIZE_MASK])
        { DRAM[(ptrLo--) & ARRAYSIZE_MASK] = TempSortArray[i];
          Dwrite++;
          numSorted++;
          done = 1;  // Don't check any more
        }
      }
      // Check the highest value against the max value
      if(i == (temp-1) && (type_of_sort & BOUNDS_OPT) && done != 1)
      { Comparison++;
        Dread++;
        // If it is greater than the max value
        if(TempSortArray[i] >= DRAM[(ptrHi-1) & ARRAYSIZE_MASK])
        { DRAM[(ptrHi++) & ARRAYSIZE_MASK] = TempSortArray[i];
          Dwrite++;  // insert it into the DRAM
          numSorted++;
          done = 1;
        }
      }
      if(done != 1)  // If the value hasn't been inserted insert it
      { ptr = Sort_Single(TempSortArray[i],type_of_sort,i,ptr);
        // Sort the value, giving a pointer hint
      }
    }
  }
  else  // If no group optimization
  { if(NewBicData == TRUE)  // First sort BIC data if valid
    { temp++;
      Sort_Single(BicData, type_of_sort,0,0);
      NewBicData = FALSE; // Indicate that the value has been sorted
    }
    if(NewNscData == TRUE)  // Second sort both the NSC data values
    { temp += 2;
      if(type_of_sort & BOUNDS_OPT)
      { Comparison += 2;
        Dread += 2;
        // The bodies of the two bounds checks below were lost from the
        // listing; they are restored by analogy with the group-sort branch.
        if(NscDataHi >= DRAM[(ptrHi-1) & ARRAYSIZE_MASK])
        { DRAM[(ptrHi++) & ARRAYSIZE_MASK] = NscDataHi;
          Dwrite++;
          numSorted++;
        }
        else
        { Sort_Single(NscDataHi, type_of_sort,1,0);}
        if(NscDataLo <= DRAM[(ptrLo+1) & ARRAYSIZE_MASK])
        { DRAM[(ptrLo--) & ARRAYSIZE_MASK] = NscDataLo;
          Dwrite++;
          numSorted++;
        }
        else
        { Sort_Single(NscDataLo, type_of_sort,0,0);}
        NewNscData = FALSE; // Indicate the values have been sorted
      }
      else
      { Sort_Single(NscDataHi, type_of_sort,1,0);
        Sort_Single(NscDataLo, type_of_sort,0,0);
        NewNscData = FALSE; // Indicate the values have been sorted
      }
    }
  }
  return temp;
}
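The group sort above first merges the single BIC value into the already ordered NSC pair before touching the DRAM. A minimal sketch of that small insertion step follows; `insertSorted` and its parameters are hypothetical names chosen for illustration and do not appear in the original listing.

```cpp
#include <cassert>

// Insert `value` into the first `count` sorted entries of `arr`.
// `arr` must have room for count + 1 entries. Returns the new entry count.
int insertSorted(int arr[], int count, int value) {
    assert(count >= 0);
    int pos = count;                       // first free slot
    while (pos != 0 && arr[pos - 1] > value) {
        arr[pos] = arr[pos - 1];           // slide the larger entry up
        --pos;
    }
    arr[pos] = value;                      // drop the value into the hole
    return count + 1;
}
```

With the NSC pair {3, 8} in place, inserting a BIC value of 5 gives {3, 5, 8}, and inserting 1 instead gives {1, 3, 8}. Folding the final store into the code after the loop removes the need for the separate `done` flag used in the listing.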
/****************************************************************/  
/* int Sort_Single(int, unsigned, int, int)                     */
/*                                                              */
/* This function does the actual sorting into the DRAM memory   */
/* array.  The second integer passes in the sorting             */
/* optimizations which should be performed.  If it is 0         */
/* no optimizations are performed.  If the memory               */
/* value compares equal to the new sorted value, no further     */
/* comparisons are done.  The third integer gives the sorting   */
/* unit a guess at where the element might end up in the array. */
/* The return value is where the item was placed in the list.   */
/****************************************************************/  
int SAC::Sort_Single(int DataValue, unsigned type_of_sort, int place,
                     int ptr)
{
  int tmpHi_ptr, tmpLo_ptr, cmp_ptr;
  int temp = 0, done = 0;
  int offset;
  tmpHi_ptr = ptrHi-1;  // Point to upper most valid value
  tmpLo_ptr = ptrLo+1;  // Point to lower most valid value
  if(ptr > 0)           // Use the pointer hint as the lower bound
  { tmpLo_ptr = ptr;
  }
  if(type_of_sort & NSCONLY_OPT)
  { offset = (tmpHi_ptr - tmpLo_ptr)/16;
    if(place >= 1) cmp_ptr = tmpHi_ptr - offset;
    else cmp_ptr = tmpLo_ptr + offset;
  }
  else
  { cmp_ptr = (tmpHi_ptr + tmpLo_ptr)/2;
  }
  if(numSorted >= 1)  // If there is a valid value in the array
  {                   // Loop until sorted
    while(done != 1)
    { temp = DRAM[(cmp_ptr) & ARRAYSIZE_MASK];
      Dread++;
      Comparison++;
      if(temp > DataValue)  // If new data is less than the memory value
      { tmpHi_ptr = cmp_ptr-1; // adjust the upper pointer
      }
      else if(temp < DataValue)  // If new data is greater than the memory value
      { tmpLo_ptr = cmp_ptr+1;  // adjust the lower pointer
      }
      else // Memory value is equal to the value to insert
      { if(type_of_sort & EQUAL_OPT)  // If equal opt. enabled
        { if((ptrHi - cmp_ptr) < (cmp_ptr - ptrLo))  // equal to case
          { tmpLo_ptr = cmp_ptr+1;} // Upper end is closer
          else
          { tmpHi_ptr = cmp_ptr-1;} // Lower end is closer
        }
        else  // With them equal insert right here (non optimized)
        { cmp_ptr = ShiftMemory(cmp_ptr);  // Just insert it here
          DRAM[(cmp_ptr) & ARRAYSIZE_MASK] = DataValue;
          Dwrite++;  // Put it into the hole created
          numSorted++;
          done = 1;  // Enable exit of loop
        }
      }
      if(done != 1)  // If we are not done check to see
      { if(tmpHi_ptr < tmpLo_ptr)  // if pointers cross
        { cmp_ptr = ShiftMemory(tmpLo_ptr); // shift memory between ptrs
          DRAM[(cmp_ptr) & ARRAYSIZE_MASK] = DataValue;
          Dwrite++;  // Write value into memory hole
          numSorted++;
          done = 1;  // Enable exit of loop
        }
        else  // Otherwise continue and calculate a
        { cmp_ptr = (tmpLo_ptr+tmpHi_ptr)/2;  // new comparison pointer
        }
      }
    }
  }
  else if(numSorted == 0)  // DRAM is empty just insert the element
  { cmp_ptr = ptrHi;
    DRAM[(ptrHi++) & ARRAYSIZE_MASK] = DataValue;
    Dwrite++;
    numSorted++;
  }
  else  // Number of elements became negative.  Indicate an error
  { cout << "ERROR.  Number sorted became negative";
    exit(0);
  }
  return cmp_ptr-1;
}
/****************************************************************/  
/* int ShiftMemory(int)                                         */
/*                                                              */
/* This function will shift the DRAM memory to create an opening*/
/* in between the pointer passed and the next lower value.  The */
/* function returns the value of the pointer where the space    */
/* was created.                                                 */
/****************************************************************/  
int SAC::ShiftMemory(int ptr)
{
  int tmp_ptr;
  int tempData;
  // Error checking of pointers
  if(ptrHi > (3*ARRAYSIZE - 4) || ptrLo < 4)
  { cout << "Reached boundary of the array! Exiting.";
    exit(0);
  }
  // Shift memory which will require the shortest number of moves.
  // The loop bodies were partly lost from the listing; the stores and
  // counter updates are restored to match ShiftMemory3().
  if( (ptr-1-ptrLo) > (ptrHi-ptr) )  // Shift memory up
  { tmp_ptr = ptrHi-1;  // Memory hole is one above tmp_ptr
    ptrHi++;
    while(ptr <= tmp_ptr) // While the hole is not equal to the pointer
    { tempData = DRAM[(tmp_ptr) & ARRAYSIZE_MASK];  // move element up
      DRAM[(tmp_ptr+1) & ARRAYSIZE_MASK] = tempData;
      Dread++;
      Dwrite++;
      tmp_ptr--;
    }
    return ptr;  // Return pointer to memory hole
  }
  else  // Shift memory down
  { tmp_ptr = ptrLo+1;  // Memory hole is one below tmp_ptr
    ptrLo--;
    while(ptr - 1 >= tmp_ptr) // As long as the hole isn't at ptr
    { tempData = DRAM[(tmp_ptr) & ARRAYSIZE_MASK]; // move element down
      DRAM[(tmp_ptr-1) & ARRAYSIZE_MASK] = tempData;
      Dread++;
      Dwrite++;
      tmp_ptr++;
    }
    return ptr - 1;  // Return pointer to memory hole
  }
}
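ShiftMemory() opens a hole by moving whichever side of the sorted window holds fewer elements. The sketch below illustrates that choice on a small unmasked array; `Window`, `openHole`, and the fixed capacity of 16 are assumptions made for the example and are not part of the original listing.

```cpp
#include <cassert>

// Hypothetical double-ended sorted array: valid elements lie strictly
// between `lo` and `hi`.
struct Window {
    int data[16];
    int lo;     // one below the lowest valid element
    int hi;     // one above the highest valid element
    int moves;  // number of element moves performed so far
};

// Open a hole just below the element at index `pos`, moving the shorter
// side of the window. Returns the index of the hole.
int openHole(Window &w, int pos) {
    assert(w.lo >= 0 && w.hi <= 15);        // room to grow at either end
    if ((pos - 1 - w.lo) > (w.hi - pos)) {  // fewer elements above: shift up
        for (int i = w.hi - 1; i >= pos; --i) {
            w.data[i + 1] = w.data[i];
            ++w.moves;
        }
        ++w.hi;
        return pos;
    }
    for (int i = w.lo + 1; i <= pos - 1; ++i) {  // otherwise shift down
        w.data[i - 1] = w.data[i];
        ++w.moves;
    }
    --w.lo;
    return pos - 1;
}
```

For a window holding 10, 20, 30, 40, 50, opening a hole below 50 moves only that one element up, and opening a hole below 20 moves only 10 down. In each case one move is needed instead of the four a one-directional shift would take in the worse case.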
/****************************************************************/  
/* int Sort2(void)  */  
/*  */  
/* This function is the manual excitation required to make the  */  
/* Array sort the data values loaded into the Bus Buffer  */  
/* Registers.  This function will do group moves of up to 3     */
/* items so that the moves are more efficient.  The return      */
/* value is the number of items sorted into the array.          */
/****************************************************************/  
int SAC::Sort2(void)
{
  int TempSortArray[3];
  int ptr[3] = {ptrHi,ptrHi,ptrHi};
  int temp=0;
  int tmpptr=0;
  int done = 0;
  if(NewNscData == TRUE) // If data is from the NSC put into array
  { TempSortArray[0] = NscDataLo;
    TempSortArray[1] = NscDataHi;
    tmpptr = 2; // Increment the pointer to the next spot on the array
    NewNscData = FALSE;
    temp += 2;  // Indicate 2 values to be sorted
  }
  if(NewBicData == TRUE) // If there is data from the BIC to add
  { temp += 1;  // Indicate 1 more value to be sorted
    NewBicData = FALSE;
    while(tmpptr != 0 && done == 0) // While the value hasn't been added
    { if(TempSortArray[tmpptr-1] > BicData)  // If less than current
      { TempSortArray[tmpptr] = TempSortArray[tmpptr-1]; // Move up one
        tmpptr--;  // Decrement the pointer
      }
      else  // Otherwise value is greater than or equal to the current value
      { TempSortArray[tmpptr] = BicData; // Add to the hole indicated
        done = 1;
      }
    }
    if(done == 0)  // If exited the loop but were not finished
    { TempSortArray[tmpptr] = BicData; // add value to the hole
    }
  }
  tmpptr = 0;
  // Loop on the number of values to add to the array
  for(int i = 0; i < temp; i++)
  { done = 0;
    if(i == 0)  // If it is the lower value
    { Comparison++;
      Dread++;
      if(TempSortArray[i] <= DRAM[(ptrLo+1) & ARRAYSIZE_MASK])
      // Compare it with the min value
      { ptr[i] = ptrLo+1; // If it is less indicate it with the pointer
        done = 1;
      }
    }
    if(i == temp-1 && done != 1) // If the highest value in the array
    { Comparison++;
      Dread++;
      if(TempSortArray[i] >= DRAM[(ptrHi-1) & ARRAYSIZE_MASK])
      // Compare it with the max value
      { ptr[i] = ptrHi; // If it is greater indicate it with the pointer
        done = 1;
      }
    }
    if(done != 1)  // If a pointer hasn't been found yet
    { if(i != 0)  // If it isn't the first value use previous pointer
      { ptr[i] = Find_Place(ptrHi, ptr[i-1] - 1, TempSortArray[i]);}
      else  // If it is the first value, use the ptrHi and ptrLo
      { ptr[i] = Find_Place(ptrHi, ptrLo, TempSortArray[i]);}
    }
  }
  ShiftMemory3(ptr, temp); // Shift the memory to get memory holes
  for(int i = 0; i < temp; i++) // Loop on number of values to enter
  { DRAM[(ptr[i]) & ARRAYSIZE_MASK] = TempSortArray[i];
    numSorted++;  // Add it to the DRAM
    Dwrite++;
  }
  return temp;  // Return the number of values added to memory
}
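The benefit of the group move in Sort2() is that each stored element is moved at most once, however many new values are inserted. The sketch below shows the same idea in its simplest form, a backward merge that always shifts upward. The listing instead chooses up or down for each hole, so this is a simplification, and `groupInsert` and its parameters are hypothetical names.

```cpp
#include <cassert>

// Merge `k` sorted values from `vals` into the sorted range data[0..n-1].
// `data` must have room for n + k entries. Every existing element is moved
// at most once. Returns the number of element moves performed.
int groupInsert(int data[], int n, const int vals[], int k) {
    assert(n >= 0 && k >= 0);
    int moves = 0;
    int src = n - 1;      // highest existing element not yet placed
    int dst = n + k - 1;  // highest free slot
    for (int j = k - 1; j >= 0; --j) {
        while (src >= 0 && data[src] > vals[j]) {
            data[dst--] = data[src--];   // move straight to its final position
            ++moves;
        }
        data[dst--] = vals[j];           // fill the hole left for this value
    }
    return moves;
}
```

Merging {3, 5, 8} into {1, 4, 7, 9} this way takes three moves. Inserting the same three values one at a time into a one-directional array would take six (one for 8, two for 5, three for 3), because 9 and 7 would each be shifted more than once.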
/****************************************************************/  
/* int Find_Place(int, int, int)                                */
/*                                                              */
/* This function does all the comparisons and finds a place to  */
/* put the element in the list.  It uses as small a window as   */
/* possible by having the user pass in the boundary conditions. */
/* The return value is where the item should be placed in the   */
/* list: the larger of the 2 pointers indicating the space.     */
/****************************************************************/  
int SAC::Find_Place(int upperPtr, int lowerPtr, int Value)
{
  int tmpHi_ptr, tmpLo_ptr, cmp_ptr;
  int temp = 0, done = 0;
  int offset;
  tmpHi_ptr = upperPtr-1;  // Point to upper most valid value
  tmpLo_ptr = lowerPtr+1;  // Point to lower most valid value
  cmp_ptr = (tmpHi_ptr + tmpLo_ptr)/2;
  if(upperPtr == lowerPtr)
  { cmp_ptr = lowerPtr;
    done = 1;
  }
  if(numSorted >= 1)  // If there is a valid value in the array
  {                   // Loop until sorted
    while(done != 1)
    { temp = DRAM[(cmp_ptr) & ARRAYSIZE_MASK];
      Dread++;
      Comparison++;
      if(temp > Value)  // If new value is less than the memory value
      { tmpHi_ptr = cmp_ptr-1; // adjust the upper pointer
      }
      else if(temp < Value)  // If new value is greater than the memory value
      { tmpLo_ptr = cmp_ptr+1;  // adjust the lower pointer
      }
      else // Memory value is equal to the value to insert
      { if((upperPtr - cmp_ptr) < (cmp_ptr - lowerPtr))  // equal case
        { tmpLo_ptr = cmp_ptr+1;} // Upper end is closer
        else
        { tmpHi_ptr = cmp_ptr-1;} // Lower end is closer
      }
      if(done != 1)  // If we are not done check to see
      { if(tmpHi_ptr < tmpLo_ptr) // if pointers cross
        { cmp_ptr = tmpLo_ptr;  // Set the pointer to the upper hole
          done = 1;  // Enable exit of loop
        }
        else  // Otherwise continue to calculate a
        { cmp_ptr = (tmpLo_ptr+tmpHi_ptr)/2;  // new comparison pointer
        }
      }
    }
  }
  else if(numSorted == 0)  // DRAM is empty just set the pointer
  { cmp_ptr = upperPtr;
  }
  else  // Number of elements became negative.  Indicate an error
  { cout << "ERROR.  Number sorted became negative";
    exit(0);
  }
  return cmp_ptr;
}
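Find_Place() is a binary search that reports an insertion position instead of a match, counting one DRAM read and one comparison per probe. A minimal sketch of that search on a plain array follows. `findPlace` and its parameters are hypothetical names, and equal values are always placed above the existing element, whereas the listing sends them toward the nearer end of the window.

```cpp
#include <cassert>

// Return the index at which `value` should be inserted into the sorted,
// inclusive range data[first..last]: the index of the first element greater
// than `value`, or last + 1 if there is none. `comparisons` counts probes.
int findPlace(const int data[], int first, int last, int value,
              int &comparisons) {
    assert(first <= last + 1);           // an empty range is allowed
    int lo = first;
    int hi = last;
    while (lo <= hi) {                   // stop when the pointers cross
        int mid = lo + (hi - lo) / 2;
        ++comparisons;
        if (data[mid] > value) hi = mid - 1;  // value belongs below mid
        else lo = mid + 1;                    // value belongs above mid
    }
    return lo;
}
```

Passing a narrower `first` and `last`, as Sort2() does with the position of the previously placed value, shortens the search window and so reduces the number of probes.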
/****************************************************************/  
/* void ShiftMemory3(int [], int)                               */
/*                                                              */
/* This function will shift the DRAM memory to create an opening*/
/* in between the pointer passed and the next lower value for   */
/* the number of holes given by the int argument.               */
/* The function adjusts the values in the int array to the      */
/* addresses where the memory holes were created.               */
/****************************************************************/  
void SAC::ShiftMemory3(int ptr[], int holes)
{
  int numDown = 0, numUp = 0, undecided = 0;
  int Moveptr, numToMove = 0;
  // Error checking of pointers passed in by Sort2()
  for(int i = 0; i < holes; i++)
  { if(ptr[i] > (3*ARRAYSIZE - 4) || ptr[i] < 4)
    { cout << "Reached boundary of the array! Exiting.";
      exit(0);
    }
  }
  // Shift memory which will require the shortest number of moves
  if( (ptr[0] - 1 - ptrLo) > (ptr[1] - ptr[0]) )
  { undecided++;  // If true want to shift with other pointer
  }
  else  // Otherwise down most efficient
  { numDown++;
  }
  if( (ptr[1] - 1 - ptr[0]) > (ptr[2] - ptr[1]) )
  { undecided++;  // If true want to shift up with other pointer
  }
  else  // Else if lower pointer is farther to bottom
  { if( (ptr[0] - 1 - ptrLo) > (ptr[2] - ptr[1]) )
    { undecided++;  // Still undecided
    }
    else  // Otherwise we want to shift down
    { numDown += 1 + undecided;
      undecided = 0;
    }
  }
  if( (ptr[2] - 1 - ptr[1]) >= (ptrHi - ptr[2]) )
  { numUp += 1 + undecided; // If true want to shift up and
    undecided = 0;          // make undecided shift ups
  }
  else  // Otherwise
  { if(undecided == 2)  // If we have 2 undecideds
    { if( (ptr[0] - 1 - ptrLo) > (ptrHi - ptr[2]) )  // Check to see if
      { numUp += 1 + undecided; // group of three is closer to bottom
        undecided = 0;          // or top; if top move up
      }
      else
      { numDown += 1 + undecided;  // else move down
        undecided = 0;
      }
    } // If only one undecided check to see if closer to bottom or top
    else if( (ptr[1] - 1 - ptrLo) >= (ptrHi - ptr[2]) )
    { numUp += 1 + undecided;  // if top move up
      undecided = 0;
    }
    else
    { numDown += 1 + undecided;  // else move down
      undecided = 0;
    }
  }
  // Adjust the values if we are not wanting to create 3 memory holes
  if(holes == 1)  // If only one hole reduce numUp and numDown
  { if(numDown > 0) // If a move down exists.  Must be only valid value
    { numDown = 1;
      numUp = 0;
    }
    else  // Otherwise move up
    { numDown = 0;
      numUp = 1;
    }
  }
  else if(holes == 2)  // If 2 holes to be created
  { if(numDown > 1)  // If numDown is 2 or greater
    { numDown = 2;
      numUp = 0;
    }
    else  // Otherwise numUp should be difference of holes and down.
    { numUp = 2 - numDown;
    }
  }
  ptrLo -= numDown;  // Move the lower pointer to make space
  ptrHi += numUp;  // Move the upper pointer to make space
  Moveptr = ptrLo+1; // First take care of creating the lower holes
  for(int i = 0; i < numDown; i++) // Loop on lower holes created
  {
    ptr[i] = ptr[i] - (numDown-i);
    // Change the pointer value since other holes will move it
    // While the current hole isn't the hole that we want
    while(Moveptr++ != ptr[i])
    { DRAM[(Moveptr-1) & ARRAYSIZE_MASK] =
        DRAM[(Moveptr - 1 + (numDown-i)) & ARRAYSIZE_MASK];
      // Shift memory the number of holes left to create
      Dread++;
      Dwrite++;
    }
  }
  Moveptr = ptrHi-1; // Create the upper holes
  for(int k = holes - 1; k >= numDown; k--) // Start with the max ptr
  {
    ptr[k] = ptr[k] + (k-numDown);
    // Change pointer to indicate the number of holes left
    // While the current hole isn't the hole that we want
    while(Moveptr-- != ptr[k])
    { DRAM[(Moveptr+1) & ARRAYSIZE_MASK] =
        DRAM[(Moveptr - (k-numDown)) & ARRAYSIZE_MASK];
      // Shift memory the number of holes left to create
      Dread++;
      Dwrite++;