Design, implementation, and evaluation of a shared-memory parallel processing system (SMPPS) by Staub, Eric H.
New Jersey Institute of Technology
Fall 1998
Recommended Citation
Staub, Eric H., "Design, implementation, and evaluation of a shared-memory parellel processing system (SMPPS)" (1998). Theses.
881.
https://digitalcommons.njit.edu/theses/881
 
 
ABSTRACT
DESIGN, IMPLEMENTATION, AND EVALUATION OF A
SHARED-MEMORY PARALLEL PROCESSING SYSTEM
(SMPPS)
by
Eric H. Staub
As technology reaches its limits of improvements in microprocessor processing speeds,
scientists and engineers have to find viable solutions to meet ever-increasing demands for
faster processing speed. One such solution is parallel processing. No longer does one
have to wait on sequential operations. A specific task can be split into sub-tasks that can
run simultaneously, thus reducing the overall execution time of the task.
The design and implementation of these systems is crucial to the effectiveness of
parallel systems. A dual-processor SMPPS was designed and implemented in order to
demonstrate how multiple processors are a viable solution to increasing the speed of
computer processing. Parallel algorithms were developed for this system and were used
for performance analysis. The results show that SMPPS systems of a small scale can
result in very significant increases in speed for problems characterized by fine-grain
parallelism.
DESIGN, IMPLEMENTATION, AND EVALUATION OF A
SHARED-MEMORY PARALLEL PROCESSING SYSTEM
(SMPPS)
by
Eric H. Staub
A Thesis
Submitted to the Faculty of
New Jersey Institute of Technology
in Partial Fulfillment of the Requirements for the Degree of
Master of Science in Computer Engineering
Department of Electrical and Computer Engineering
January 1999
APPROVAL PAGE
DESIGN, IMPLEMENTATION, AND EVALUATION OF A
SHARED-MEMORY PARALLEL PROCESSING SYSTEM
(SMPPS)
Eric H. Staub
Dr. Sotirios G. Ziavras, Thesis Advisor	 Date
Associate Professor of Electrical and Computer Engineering, and Computer and
Information Science, NJIT
Dr. Solomon Rosenstark, Thesis Co-Advisor 	 Date
Professor of Electrical and Computer Engineering, NJIT
Dr. Edwin Hou, Committee Member 	 Date
Associate Professor of Electrical and Computer Engineering, NJIT
BIOGRAPHICAL SKETCH
Author:
	 Eric H. Staub
Degree:
	 Master of Science in Computer Engineering
Date: 	 January 1999
Undergraduate and Graduate Education:
• Master of Science in Computer Engineering
New Jersey Institute of Technology, Newark, NJ, 1998
• Bachelor of Science in Computer Engineering
New Jersey Institute of Technology, Newark, NJ, 1997
Major: 	 Computer Engineering
ACKNOWLEDGMENT
The author wishes to express sincere thanks to my two advisors, Dr. Ziavras and
Dr. Rosenstark, for their guidance, support, knowledge, and mentoring. I would also like
to thank Dr. Hou for serving as a committee member.
I would like to give a special thanks to Rosalie Gaddala for her friendship and
support throughout my academic career at NJIT and to Amy Sun who was an inspiration
for me over the last few months of my graduate work.
Most of all, I would like to thank my Mother and the rest of my Family for the
love and support that has gotten me this far.
I would like to thank all of the Electrical and Computer Engineering Department
and all of the Faculty and Staff of NJIT that have been an integral part of my academic
career at NJIT. I would like to thank the numerous students that I had the opportunity to
work and study with. I would like to thank the United States Air Force for giving me the
opportunity to become an officer to serve my country and the Air Force Institute of
Technology for giving me the opportunity to get my graduate degree at NJIT. I would
like to thank the AFROTC Detachment 490 Staff, past and present, for their guidance and
support while I have been at NJIT. Also, I would like to thank my brothers of Tau Delta
Phi Fraternity. And finally, I would like to thank Altera's University Program for
supplying hardware and software for this project.
TABLE OF CONTENTS
Chapter
1 INTRODUCTION
1.1 Parallel Processing
1.1.1 Importance of Parallel Processing
1.1.2 Classes of Parallel Processing
1.2 Existing Machines
1.2.1 Message-Passing
1.2.2 Shared-Memory
2 IMPLEMENTING A SHARED-MEMORY PARALLEL PROCESSING SYSTEM (SMPPS)
2.1 Objectives
2.2 A Dual-Processor Shared-Memory Parallel Processing System
2.2.1 Meeting Design Objectives
2.2.2 The Design
2.2.3 Timer Configuration
3 IMPLEMENTATION OF PARALLEL ALGORITHMS
3.1 Matrix Multiplication
3.1.1 Demonstrating a [4x4], [8x8], and [16x16] with [4x4] Matrix
4 PERFORMANCE EVALUATIONS
4.1 Matrix Multiplication
5 CONCLUSIONS
6 APPENDIX A - Diagrams
7 APPENDIX B - Programs
8 REFERENCES
LIST OF FIGURES
Figure
1 Steps processors make to solve the equation g = (a+b)*(c+d)
2 MIMD architecture (with shared-memory)
3 Generic model of a message-passing multicomputer (M=Memory, P=Processor)
4 The UMA multiprocessor model (e.g., the Sequent Symmetry S-81) [ P = Processor; SM = Shared-Memory; I/O = Input/Output ]
5 Two NUMA models for multiprocessor systems
6 The COMA model of a multiprocessor (D: Directory, C: Cache, P: Processor; e.g., the KSR-1)
7 Address location of devices
8 Differences between the 28C64 and the Atmel 28C256
9 Differences between the 6264 and the HM62256LP-12
10 Differences in the wiring of the 74LS138
11 Truth table for the shared-memory control logic
12 Karnaugh Maps for the shared-memory control logic
13 Timer Interrupt Service Routine (written in assembly)
14 [4x4] Matrix Multiplication on a single processor
15 [4x4] Matrix Multiplication
16 Matrix-Multiplication Execution Times
17 [4x4] Matrix Multiplication on dual processors
18 [4x4] Matrix Multiplication on dual processors using shared-memory
16 Matrix-Multiplication Execution Times (Repeat)
LIST OF DIAGRAMS
Diagram
1 Dual-Processor Shared-Memory Block Diagram (I)
2 Dual-Processor Shared-Memory Block Diagram (II)
3 Original Control Logic Design
4 1-2 DeMultiplexor Logic
5 2-1 Multiplexor Logic
6 Final Shared-Memory Control Logic Design
7 Default Symbol CTEST Logic
8 Timer Control Logic
9 Flow-Chart I - One Processor Operation
10 Flow-Chart II - Dual-Processor Operation
11 Flow-Chart III - Dual-Processor Operation using Shared-Memory
CHAPTER 1
INTRODUCTION
1.1 Parallel Processing
1.1.1 Importance of Parallel Processing
Even with ever changing technology, industry is always looking for ways to improve
performance. Scientists are continually finding innovative ways to speed up the
processing power of computers. Still, we need faster and more effective ways to
accomplish a task. Now that advancements in technology are reaching their limits,
industry must look for a new way to keep up with the demands. There is the old adage
that two minds are greater than one. This theory can be applied to computer processing.
With two processors, not only can more tasks be accomplished, but also tasks can be
accomplished faster.
For example, the simple task of {g = (a+b)*(c+d)} would take three steps (part a.
of Figure 1) on one computer. On a system with two processors, that same task would
take two steps (part b. of Figure 1). For simplicity's sake, the time to pass information
between the processors is not considered.
(a) System with one processor.
Step 1: Processor A adds 'a' to 'b' and places value in 'e'.
Step 2: Processor A adds 'c' to 'd' and places value in 'f'.
Step 3: Processor A multiplies 'e' and 'f', and places in 'g'.
(b) System with two processors.
Step 1: Processor A adds 'a' to 'b' and places value in 'e'.
        Processor B adds 'c' to 'd' and places value in 'f'.
Step 2: Processor A or B multiplies 'e' and 'f', and places in 'g'.
Figure 1: Steps processors make to solve the equation g = (a+b)*(c+d).
This is a 33% improvement in the time to accomplish a simple task. If the
additional processor gives a 33% increase, why not add another processor? In this simple
case the addition of more processors would not have any effect. This is because the task
is made up of three subtasks, one of which requires information from the previous two.
Even if the third processor was assigned the multiplication of `e' and `f' it would not be
able to proceed until the additions were complete.
One might conclude that the improvement of processing time using multiple
processors is limited. Actually the limit only exists for a particular task. As the task
changes, the speedup factor changes. When multiple processor theory is applied to the
task of (a+b)*(c+d)*(e+f)*(g*h), the results are quite different. On one processor the
task will take seven steps. On a two-processor system it would take four steps, a
decrease of over 40% in processing time. On a four-processor system that same task would
take only three steps, a decrease of over 50%. If the task is applied to a five-processor
system, there is no improvement in processing time. Once again the processing time can
only be improved to a certain limit.
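The step counts quoted above can be checked with a small scheduling sketch. It is illustrative only: the dictionary encoding, the function name `steps_needed`, and the rule of running the operations with the longest remaining chain first are assumptions. Each operation is taken to cost one step and communication time is ignored, as in the text.

```python
def steps_needed(deps, processors):
    """Return the number of unit-time steps needed to finish every operation.

    deps maps an operation name to the operations whose results it needs.
    """
    # Invert the dependency map so the longest chain after each operation
    # can be measured.
    users = {op: [] for op in deps}
    for op, needs in deps.items():
        for needed in needs:
            users[needed].append(op)

    def chain_length(op):
        # Number of operations on the longest path from op to the final result.
        return 1 + max((chain_length(u) for u in users[op]), default=0)

    done = set()
    steps = 0
    while len(done) < len(deps):
        ready = [op for op in deps
                 if op not in done and all(n in done for n in deps[op])]
        # Assumed priority rule: longest remaining chain first.
        ready.sort(key=chain_length, reverse=True)
        done.update(ready[:processors])
        steps += 1
    return steps


# g = (a+b)*(c+d): two additions feed one multiplication.
small = {"e": [], "f": [], "g": ["e", "f"]}

# Product of four parenthesized terms, multiplied pairwise.
large = {"t1": [], "t2": [], "t3": [], "t4": [],
         "m1": ["t1", "t2"], "m2": ["t3", "t4"],
         "result": ["m1", "m2"]}

for name, task in (("small", small), ("large", large)):
    for p in (1, 2, 3, 4, 5):
        print(name, p, "processor(s):", steps_needed(task, p), "steps")
```

The dependency chain, not the processor count, sets the floor: the larger task can never finish in fewer than three steps because its final multiplication waits on two earlier levels.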
Another factor to consider is that going from two processors to four reduced the
processing time by only about 14 percentage points more. Adding one processor gave a
reduction of about 43% (seven steps to four), while adding two more processors brought the
total reduction to only about 57% (three steps). Also, during some of the steps, some of the
processors are not needed. Further complicating the matter is the movement of data
between processors. This transfer will take additional time that will decrease the overall
speedup of the system. Deciding what is the best possible design to obtain the best
possible results is a topic that will not be discussed in detail and will be left to
independent research. However, the focus of this paper will center on the design of a
shared-memory parallel dual-processor system and the timing results of running
algorithms on the system.
1.1.2 Classes of Parallel Processing
Before I get into the design of the system, I will discuss the different types of parallel
computing systems. As one might guess, parallel systems are designed in different ways.
In general, parallel systems are classified into two major groups. The system I have
designed falls into the shared-memory class and the other class consists of
message-passing systems. Each system has its pros and cons and the type of system needed is
basically dependent on the task that needs to be accomplished. How parallel computers
communicate with one another and how they share memory determines which one of the
two major classes of parallel computers the systems belong to.
Systems that are considered inherent parallel computers are those which operate
in the MIMD (multiple instruction stream over multiple data stream) mode. An example
of a MIMD system is shown in Figure 2. Since parallel computers must share
information, there has to be a way for them to access the shared information. In
multiprocessor shared-memory systems this is accomplished by placing information in
some variable and giving all systems access to that variable. In message-passing systems
the information is passed between computers by using an interprocessor communication
network.
Figure 2: MIMD architecture (with shared-memory).
1.2 Existing Machines
1.2.1 Message-Passing
A system in the message-passing class consists of one or more multiple-computer
networks. These networks connect together computer nodes. The computer nodes
communicate information between one another through these networks. Hardware
routers usually handle this communication. An example of a message-passing
interconnection network is shown in Figure 3.
Figure 3: Generic model of a message-passing multicomputer (M=Memory, P=Processor).
Each network node is attached to a router. Based on the design and type of
protocols that the router uses, information is then sent between the computer nodes via
routing. This gives the designer the flexibility of creating multiple types of
communications between the networks. By changing how the networks interact, the
designer has the ability to use the same networks to accomplish numerous different tasks.
As with all technology, the scientist and engineer strive to improve the original
design. Message-passing systems are now in their third stage of development.
Development started in 1983 with systems like the Caltech Cosmic and the Intel iPSC/1.
These systems were designed with software-controlled message-passing for the
hypercube architecture.
Over the years of 1988-1992, systems such as the Intel Paragon and the Parsys
SuperNode 1000 represented the next stage in the evolution of message-passing systems.
The systems incorporated routing messages via hardware, utilizing software for medium-
grain distributed computing, and using mesh-connected architectures.
The third stage of the development started in 1993 and consisted of machines that
placed processing and communication devices on the same chip. Systems such as the
MIT J-Machine and the Caltech Mosaic are based on this design.
Listed above are a few of the many systems that have been developed. Each
system has its own unique design. What that design is and how each accomplishes its
message passing can be found in numerous technical notes and publications. These
systems were mentioned just to give a flavor of the type of systems and progression of
the development of message-passing systems.
1.2.2 Shared-Memory
Shared-memory systems consist of multiple-processors, each of which has its own private
memory, and information is shared through an independent memory that all of the
processors have the ability to access. As with message-passing systems, I will give a brief
description of shared-memory systems. I will briefly describe only three of the many
models of shared-memory systems. Many other models incorporate one or more features
of these three models.
The first model, Figure 4, is the uniform-memory-access (UMA) model. In this model
all processors have equal access to all memory. These systems suit problems
characterized by a high degree of (that is, fine-grain) parallelism. The system I
designed falls under this model.
Figure 4: The UMA multiprocessor model (e.g., the Sequent Symmetry S-81)
[ P = Processor; SM = Shared-Memory; I/O = Input/Output ].
The next model, Figure 5, is the non-uniform-memory-access (NUMA). NUMA
systems consist of groups of multiple-processors that are connected by interconnection
networks. There is local-shared-memory within each group and global-shared-memory
between the groups. These systems share memory based on the location of the memory in
relation to the processor needing access to that memory. Therefore, the access time to
memory is not uniformly distributed among the processors.
Figure 5: Two NUMA models for multiprocessor systems.
The last model I will discuss, Figure 6, is the cache-only memory architecture
(COMA). These systems are similar to NUMA systems, but the shared memories are
replaced with cache memories. Processors wanting to access memory in another
processor's cache memory must do so through cache directories.
Figure 6: The COMA model of a multiprocessor (D: Directory, C: Cache, P: Processor; e.g., the KSR-1).
Further information about parallel systems can be found in numerous sources,
including the Internet. This follow-on information is not necessarily needed
to understand the design of my shared-memory system or the results of testing algorithms
on that system.
CHAPTER 2
IMPLEMENTING A SHARED-MEMORY
PARALLEL PROCESSING SYSTEM
(SMPPS)
2.1 Objectives
There are three main objectives to this project. The first is the design of the shared-
memory parallel processing system. Next is the implementation of that system. The final
objective is the evaluation of the system for some algorithms.
2.2 A Dual-Processor Shared-Memory Parallel Processing System
2.2.1 Meeting Design Objectives
Since the evaluation of the system consisted of testing algorithms, I needed to design a
system that could be implemented within time and monetary constraints. This system
would have to show the effectiveness of running an algorithm on a parallel system as
opposed to running that same algorithm on a single processor system.
I chose to develop a system with two processors and a single shared-memory.
This would reduce the cost and complexity of the project. Also, it would help keep me
within the time and monetary constraints. The next step was to determine which
processor to use for the project.
I initially chose to use the TI TMS320C80 processor. The C80 processor consists
of four DSPs and one RISC processor. I spent the next month gathering information
about the C80. I considered how I would implement a system using two C80 processors
and what software would have to be developed to manage and test the interface between
the two processors. After carefully considering the options that the information I
collected presented to me, I determined that I would be unable to use the C80 for this
project. Using the C80 would not only be cost prohibitive, but implementing a
dual-processor system with it would also be extremely complex.
I then focused my attention on using TI's C40. Even though the cost was considerably
lower, the complexity still remained quite high. After another month of investigation I
determined that using the C40 was not a viable solution. This left the Motorola
68000 series microprocessor. These processors would be much more cost effective and
the complexity would be greatly reduced. Since I was familiar with this series of
microprocessor, I determined that it would be the most promising candidate for a dual-
processor system.
2.2.2 The Design
As an undergraduate, I was involved in many projects. The most significant was my
senior project. In this project I developed a control system for a constant-pressure
floodgate. I used the Motorola 68008 microprocessor as the control system processor. I
used a micro-controller design that was developed by Dr. Rosenstark and is part of
EE-393, Electrical Engineering Lab III. The micro-controller design and specifications are
explained in detail in the EE-393 Lab Manual (Rosenstark 1998). The current version of the
Lab Manual uses the newer Motorola 68EC000 microprocessor in place
of the Motorola 68008 microprocessor.
Once it was determined what microprocessor I should be using, the project was
set in motion. The Electrical Engineering Laboratory III (EE 393 — Spring 98) was using
the last of the MC68008 processors to build micro-controllers. Because the processor had
been discontinued, Dr. Rosenstark was seeking an alternative. The alternative was the
MC68EC000. To test the feasibility of using this processor, Dr. Rosenstark had one
student build a micro-controller with the MC68EC000. The student was successful in
using the MC68EC000.
In order to accomplish the objectives I set, I needed to make modifications to the
micro-controller in the EE-393 Lab Manual. The micro-controller has its own memory,
which includes DRAM and an EEPROM. The memory used in the EE-393 Lab Manual
was a 28C64 EEPROM and a 6264 DRAM. Since my design required a larger memory
space, I chose to use an Atmel 28C256 EEPROM and a 62256 DRAM. This would
give me two blocks, each 32K bytes, of addressable memory. This change in address
space changed the addressing scheme of the micro-controller (see Figure 7).
Figure 7: Address location of devices.
Another benefit of using these chips is that they are 28-pin packages. This would
allow me to use the original design while only changing two wires for each chip. The
additional wires are address lines A13 and A14. These lines are connected to pins
that were originally no-connect pins on the EEPROM, and they replace the nCE2 pin and
a no-connect pin of the DRAM. This is shown in Figure 8 and Figure 9.
Figure 8: Differences between the 28C64 and the Atmel 28C256.
Figure 9: Differences between the 6264 and the HM62256LP-12.
Since I am using a larger address space, the address lines on the 74LS138 will
have to change. Lines A13, A14, and A15 will be replaced with A15, A16, and A17 as
shown in Figure 10.
Figure 10: Differences in the wiring of the 74LS138.
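The effect of moving the 74LS138 select inputs can be illustrated with a short sketch. It assumes that the lowest of the three address lines drives the least significant select input and that each decoder output enables one device. Which output is wired to which device is not stated here, so the hypothetical function `decoder_output` only returns the output number.

```python
def decoder_output(address, low_line):
    """Return which of the eight 74LS138 outputs an address selects.

    low_line is the lowest of three consecutive address lines wired to the
    decoder select inputs (13 for A13-A15, 15 for A15-A17).
    """
    return (address >> low_line) & 0b111


def block_size(low_line):
    # Each decoder output covers 2**low_line consecutive byte addresses.
    return 1 << low_line


for low in (13, 15):
    print(f"A{low}-A{low + 2}: {block_size(low) // 1024}K bytes per output")

# Example addresses; 010000H and 020000H are the base addresses adopted
# later in this chapter for the private and shared memories.
for addr in (0x00000, 0x10000, 0x20000):
    print(f"{addr:06X}H -> output {decoder_output(addr, 15)}")
```

With A13 to A15 on the decoder each output spans 8K bytes; with A15 to A17 each output spans 32K bytes, which matches the size of the larger memory chips.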
Now that the major design decisions were out of the way I started to build the
circuits around the microprocessor. I proceeded as far as possible with the parts that I
had acquired up to this point. I was having difficulties acquiring some of the important
components, so I was unable to go any further. Lacking the parts to complete the
micro-controllers, I decided to work on the control logic and the 2-1 Mux.
After spending some time designing the control logic I received most of the
components needed to finish the micro-controllers. After completing the first micro-
controller, I ran into difficulties interfacing with the computer. Since I was only having
trouble with communicating with the computer I started to build the second micro-
controller. Once I completed this micro-controller, I ran into the same difficulties. After
an exhaustive troubleshooting effort, I was only able to communicate with the computer
on a simple level. I was still unable to run the Monitor program. I then changed my
focus to the software and the assembler.
After more intense troubleshooting, Dr. Rosenstark and I determined that one of
the problems was created by my larger address space, specifically the range from 8000H
to FFFFH. This problem was caused by the assembler when it sign-extended addresses. As a
solution we decided not to use this address range. I moved the private memory to the range
0001 0000H through 0001 7000H and moved the shared-memory to 0002 0000H through 0002 7FFFH.
This solved some of the problems, but I was still unable to get the monitor program to work.
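The sign-extension problem can be reproduced in a few lines. This sketch assumes the assembler emitted the address as a 16-bit word, which the 68000 family sign-extends to 32 bits in its absolute short addressing mode. The helper name `sign_extend_16` is hypothetical.

```python
def sign_extend_16(word):
    """Extend a 16-bit value to 32 bits by copying bit 15 upward."""
    word &= 0xFFFF
    if word & 0x8000:
        return word | 0xFFFF0000
    return word


for addr in (0x7FFF, 0x8000, 0xFFFF):
    print(f"{addr:04X}H is seen as {sign_extend_16(addr):08X}H")
```

Any 16-bit address with bit 15 set is moved to the top of the 32-bit range, so a device decoded at 8000H to FFFFH is never reached by such a reference.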
While working on my project I was teaching EE-393 over the second summer
session. These students were using the MC68EC000 with the
smaller EPROMs and RAMs. They did not have the communication problems that I was
having. This was very perplexing since it was the same program, except for the different
address scheme. Since I was able to communicate on a simple level, it had to be a
software problem. After some unique debugging, I determined that James L.
Antonakos' assembler was assembling addresses that used the LEA command with an
offset of 6H. I also found another problem. The Antonakos assembler creates
S1 records. This would not allow me to write a program to be loaded by the monitor in
my memory location, since S1 records carry only a 16-bit address and my addressing
scheme required a long word.
At this point I tried using another assembler. I found that Paragon's assembler
was able to assemble the program, and I was able to run the monitor program. This
created another problem. The Paragon assembler used S2 records in the Hex file. The
monitor was not able to load S2 files, so I would not be able to load a program into
memory.
Working with Dr. Rosenstark, I came up with several solutions. The first was to
change the LEA commands to MOVEA.L commands. This solved most of the problems,
but I would still be unable to use Antonakos' assembler for files to be loaded into the
memory because my addressing scheme requires S2 records. Dr. Rosenstark solved this
problem by changing the monitor program to load S2 records. Dr. Rosenstark has passed
this information on to James L. Antonakos, who is currently working on a solution.
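The difference between S1 and S2 records is the width of the address field: two bytes for S1 and three bytes for S2. The following sketch builds one data record in the standard Motorola S-record layout (type, byte count, address, data, ones-complement checksum). The function name `s_record` and the sample bytes are hypothetical, and this is not the code of either assembler or of the monitor.

```python
def s_record(address, data):
    """Build one Motorola S-record data line (S1 or S2) for the given bytes."""
    last = address + len(data) - 1
    if last <= 0xFFFF:
        rec_type, addr_len = 1, 2      # S1: 16-bit address field
    elif last <= 0xFFFFFF:
        rec_type, addr_len = 2, 3      # S2: 24-bit address field
    else:
        raise ValueError("address does not fit in an S1 or S2 record")
    count = addr_len + len(data) + 1   # address + data + checksum bytes
    body = bytes([count]) + address.to_bytes(addr_len, "big") + bytes(data)
    checksum = ~sum(body) & 0xFF       # ones complement of the low byte of the sum
    return f"S{rec_type}{body.hex().upper()}{checksum:02X}"


sample = bytes([0x4E, 0x71])           # two arbitrary sample bytes
print(s_record(0x7000, sample))        # fits in an S1 record
print(s_record(0x010000, sample))      # needs an S2 record
```

A loader that only understands S1 lines cannot place data at 0001 0000H or above, which is consistent with the monitor needing S2 support once the memory was moved.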
I now had two fully working micro-controllers. Now it was time to start to work
on the shared-memory logic. For simplicity, I chose to make the shared-memory the
same type as the private-memory of the micro-controllers. This way I would be able to
use the same address and data bus as the micro-controllers.
The next step was to design the interface between the micro-controllers and the
shared-memory. My design called for single-port access of the memory. Also, access of
the shared-memory should not interfere with the independent processing of the other
processor unless both processors try to access the shared-memory at the same time. In
order to accomplish that, I needed to separate the address and data buses of the individual
processor while allowing access to those buses when shared-memory is accessed.
Diagram I in Appendix A shows the initial block diagram for the system. I
separated the address buses with 2-1 multiplexors and the data buses with bus-
transceivers. I used a bus-transceiver on the data bus because of the bi-directional nature
of the data bus. After further evaluation of my design I found that I had unnecessary
logic.
Diagram II in Appendix A shows that I removed two bus-transceiver blocks and
two 2-1 MUX blocks. The DRAM chip has an enable pin on it. This enable pin would
only be activated when a processor requires access to the shared-memory. This allowed
me to remove the MUX blocks. The bus-transceiver is bi-directional so it can be placed
in the direction of the shared memory while a processor is accessing its private memory.
Since the shared-memory is not enabled during this time, the data on the data lines of the
shared-memory chip is ignored. This allowed me to remove the bus-transceiver blocks.
Now that the design for the address and data bus was complete I needed to design
the shared-memory control logic. The problem that needed to be solved was how to
access the shared-memory without interrupting independent processing of the other
processor. I used one of the features of the MC68EC000 to build my design.
I used the MC68EC000 /AS pin and the /DTACK pin. When an instruction is
executed the MC68EC000 places a signal on the /AS pin. In order for the processor to continue
to the next instruction, a signal must be placed on the /DTACK pin. Once the state on the
/DTACK pin has gone from high to low and then back to high, the processor will
continue on to the next instruction. If the transition is not completed the processor will
not continue.
Since my design requires that a second processor wait till the first processor is
done when both processors try to access shared-memory, I can use these pins to my
advantage. In the EE 393 design the two pins are connected directly together. If I could
separate the pins during shared-memory access, I would have solved my problem. Now
that I had a possible solution to this problem, I had to consider the other chips that needed
to be controlled by this logic.
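The intended behaviour can be sketched as a cycle-by-cycle model. This is only an illustration under stated assumptions: every access takes one cycle, 'P' marks a private-memory access and 'S' a shared-memory access, and processor A wins when both request shared memory in the same cycle (the text does not state a priority). Withholding /DTACK is modelled simply as the losing processor repeating its access in the next cycle. The function name `run` is hypothetical.

```python
def run(program_a, program_b):
    """Simulate two processors sharing one single-port memory.

    Each program is a string of 'P' (private access) and 'S' (shared access).
    Returns (total_cycles, stalls) where stalls counts waiting cycles per CPU.
    """
    ia = ib = 0
    cycles = 0
    stalls = {"A": 0, "B": 0}
    while ia < len(program_a) or ib < len(program_b):
        want_a = program_a[ia] if ia < len(program_a) else None
        want_b = program_b[ib] if ib < len(program_b) else None
        if want_a == "S" and want_b == "S":
            ia += 1             # A completes its bus cycle (assumed priority)
            stalls["B"] += 1    # B gets no acknowledge and retries next cycle
        else:
            if want_a is not None:
                ia += 1
            if want_b is not None:
                ib += 1
        cycles += 1
    return cycles, stalls


print(run("PSP", "PSP"))   # both reach shared memory in the same cycle
print(run("PPS", "SPP"))   # shared accesses never collide
```

In this model a processor loses time only when both request the shared memory in the same cycle; private accesses always proceed in parallel.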
The shared-memory had to be enabled when accessed and whether the operation
is a read or write must be handled. The bus-transceiver on the data bus must be enabled
and the direction set. And finally the multiplexor on the address bus must be set
correctly. This design would require large amounts of logic and testing would become a
nightmare. Luckily, as part of my undergraduate work I used a software package by
Altera called MAX+plus II.
I decided to use ALTERA programmable chips for the control logic and the 2-1
Mux. Using the Altera chips would be much more cost effective and would reduce the
area required for the shared-memory system. Also these chips would allow flexibility in
the design of the logic. The design could be easily modified and reprogrammed onto the
chip.
MAX+plus II can be used to design entire logic devices, from those as simple as a
gate to those as advanced as microcomputers. The designs can be created in text format
or in graphical format. Once the design is complete, it can be thoroughly tested. If it
does not meet the specifications needed, then it can be easily changed and tested again.
This eliminates the need to build the circuits, test them, and then throw them away
because they did not meet the specifications you had planned. Another advantage was
that the design could be placed on a single chip the size of a computer processor. Not
only would I save time and money, but also the space I needed for my control logic
would be reduced.
Diagram III in Appendix A shows one of the preliminary designs. The final
design for the most part was similar to this design. One of the features of MAX+plus II
that is very useful is the ability to create default symbols. This allows the use of the same
sub-design in multiple places. This became particularly useful when testing a specific
point of the design.
I used this feature in two places in my design. One place was the point that
became the focal point of fault with my original design. This will be explained as I
describe the final design of the control logic. The second place is the 1-2 de-multiplexor
I created. I would have had to create a third default symbol, but this symbol had already
been created. This was the 2-1 multiplexor.
The 1-2 DEMUX is shown in Diagram IV in Appendix A. I created it using tri-
state buffers. This design allows one signal to be sent over a different line based on what
is selected by the select pin. The drawback to this design is the high 'Z' output that is
created when a line is not selected. This would be a problem when a processor is
working with its own memory. Then the input to the shared-memory logic would be high
'Z'. Since my design of this control logic requires a high or low signal to be present, I
had to come up with another solution.
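The floating-output problem can be seen in a small model of the 1-2 de-multiplexor. The string 'Z' stands for a high-impedance (undriven) output, and the mapping of select values to outputs is an assumption, as is the function name.

```python
def demux_1_to_2(signal, select):
    """Model a 1-to-2 de-multiplexor built from two tri-state buffers.

    Returns (output0, output1); the unselected output is 'Z' (not driven).
    """
    if select == 0:
        return (signal, "Z")
    return ("Z", signal)


for select in (0, 1):
    for signal in (0, 1):
        print(f"select={select} signal={signal} -> {demux_1_to_2(signal, select)}")
```

Whichever output is not selected carries neither a 0 nor a 1, which is why logic that expects a definite level on that line cannot be fed from it directly.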
The simple solution was an open-collector buffer. Since the chip that I will be
placing the design on does not support open-collector buffers in the design, I chose to
route the 1-2 DEMUX output out of the chip and then back into the chip via an input pin.
The signal would then go through the open-collector buffer and then back into the design
on the chip. This would require an additional chip. Since I had saved large amounts of
space by using the Altera chip, I didn't mind adding one additional chip.
In order to save additional space, I chose to design the 2-1 multiplexors for the
address bus with the MAX+plus II software. Diagram V in Appendix A shows this
design. This would require the use of two Altera MAX EPM7128SLC84-7 chips. Using
two MAX chips still required less space than using 2-1 multiplexor chips. After running
the control design through many simulations, I programmed the design into the second
MAX chip. I then proceeded to wire the chip into the micro-controller. Before I could
actually test the design, I had to wire the bus-transceivers for the data bus and the second
MAX chip, which has the 2-1 Multiplexors for the address bus.
Once the wiring was complete, I started testing the design. The design did not
work the way it was expected to. After days of testing and troubleshooting, I narrowed
the problem down to a specific area in the design. I removed this area from the design
and created a default symbol for this area. It is shown as default symbol `ctest' in
Diagram VI in Appendix A. This would allow me to redesign and test the problem area
of the design.
After many days of testing and modifications, I determined that I would have to
redesign this portion of the control logic. Any modifications I made to the design would
either introduce a race condition into the logic or give total control of the shared-memory
to one processor. Just before starting from scratch, I asked Scott Margo, an MIT
Electrical Engineering Ph.D. student, what he thought might solve the problem. After
evaluating the design, he came to the same conclusion: that I should start over from the
truth tables. The resulting truth table is shown in Figure 11.
Figure 11: Truth table for the shared-memory control logic.
Using the Karnaugh Maps in Figure 12 (a) and (b) the following equations emerged:
A'out = (Ain /Bin)+(Ain /Bout)+(/Ain Bin /Aout /Bout)
B'out = (/Ain Bin)+(Bin /Aout Bout)
Figure 12: Karnaugh Maps for the shared-memory control logic
The resulting logic is shown in Diagram VII in Appendix A.
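The two equations can be tabulated directly in software. The sketch below only transcribes them, reading `/X` as NOT X, adjacent terms as AND, and `+` as OR, and treating A'out and B'out as the next values of Aout and Bout. The function names are hypothetical. Figure 11 is not reproduced in this text, so no signal polarity (active high or active low) is assumed.

```python
from itertools import product


def next_a_out(a_in, b_in, a_out, b_out):
    """A'out = (Ain /Bin) + (Ain /Bout) + (/Ain Bin /Aout /Bout)."""
    return int((a_in and not b_in)
               or (a_in and not b_out)
               or (not a_in and b_in and not a_out and not b_out))


def next_b_out(a_in, b_in, a_out, b_out):
    """B'out = (/Ain Bin) + (Bin /Aout Bout)."""
    return int((not a_in and b_in) or (b_in and not a_out and b_out))


def tabulate():
    """Return one row per input combination: inputs followed by both outputs."""
    rows = []
    for a_in, b_in, a_out, b_out in product((0, 1), repeat=4):
        rows.append((a_in, b_in, a_out, b_out,
                     next_a_out(a_in, b_in, a_out, b_out),
                     next_b_out(a_in, b_in, a_out, b_out)))
    return rows


if __name__ == "__main__":
    print("Ain Bin Aout Bout | A'out B'out")
    for row in tabulate():
        print("  {}   {}    {}    {}  |   {}     {}".format(*row))
```

Comparing the printed rows against the truth table is a quick way to check a transcription of the equations before entering them into a schematic tool.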
I tested this design by running it through several simulations. The results of these
simulations were very promising. After compiling the control design with this new
design, I programmed it into the MAX chip. This began the testing phase of the new
control logic. I used the monitor program on each micro-controller to manually access
the shared-memory. I was able to edit and display the shared-memory from both micro-
controllers. This confirmed that the hardware design was complete.
The next step was to write a program that used software semaphores to lock the
shared-memory. The program I wrote is in Appendix B. The program ran flawlessly on
both processors. Not only did the hardware design work, but also the software-controlled
locks were executing properly.
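The idea of a software semaphore guarding the shared-memory can be sketched as follows. This is not the program from Appendix B. It assumes a single lock flag that is claimed with an indivisible test-and-set step, with two threads standing in for the two processors; the dictionary layout and all names are hypothetical.

```python
import threading
import time

# Hypothetical shared-memory layout: one semaphore flag and one data word.
shared_memory = {"semaphore": 0, "counter": 0}

# Models a single indivisible read-modify-write bus cycle (an assumption).
_bus_cycle = threading.Lock()


def test_and_set():
    """Return the old flag value and set the flag to 1 in one indivisible step."""
    with _bus_cycle:
        old = shared_memory["semaphore"]
        shared_memory["semaphore"] = 1
        return old


def acquire():
    while test_and_set() != 0:
        time.sleep(0)  # busy-wait; yield so the other "processor" can run


def release():
    shared_memory["semaphore"] = 0


def processor(iterations):
    for _ in range(iterations):
        acquire()
        shared_memory["counter"] += 1  # critical section
        release()


def run_two_processors(iterations):
    shared_memory["semaphore"] = 0
    shared_memory["counter"] = 0
    workers = [threading.Thread(target=processor, args=(iterations,))
               for _ in range(2)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return shared_memory["counter"]


if __name__ == "__main__":
    print(run_two_processors(100))
```

Because only one caller can ever observe the flag as 0, the two workers never update the data word at the same time.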
2.2.3 Timer Configuration
Before I could move on to the algorithms, I had to decide how I would track the
execution times. The most effective way is to interface a timer directly with the
micro-controllers. This would allow the software to directly control the timer. Not only
would this be more efficient, but it would also produce more accurate times.
I chose the Intel 8253-5 programmable interval timer to accomplish the task of
timing the execution of the algorithms. The 8253 timer is a 24-pin dual in-line package
with three 16-bit counters, each with a count rate of up to 2 MHz. The timer has six
different modes of operation (modes 0 through 5) and four different ways of obtaining
count values. I will be using mode 0, interrupt on terminal count, and will use
`Read/Load least significant byte first, then most significant byte' for obtaining the
count value. The timer counts down from 2^16 - 1 (65,535). This produces a 16-bit number.
The timer has an eight-bit data bus that can be easily interfaced with the micro-
controller's eight-bit data bus. This data bus is used to read the count value in the count
register. As stated before this is done with two reads of the chip. The first read is stored
in one register and the second read is stored in another register. The final result is the
combination of the two values, which is a 16-bit number.
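The mode selection and the two-byte read can be sketched in a few lines. The bit layout used here (SC1 SC0 RL1 RL0 M2 M1 M0 BCD) is the 8253 control word format; the function names are hypothetical, and counter 1 is used in the example only because the text refers to the out1 pin.

```python
RL_LSB_THEN_MSB = 0b11  # "Read/Load least significant byte first, then most significant byte"


def control_word(counter, read_load, mode, bcd=False):
    """Build an 8253 control word: SC1 SC0 RL1 RL0 M2 M1 M0 BCD."""
    if not (0 <= counter <= 2 and 0 <= read_load <= 3 and 0 <= mode <= 5):
        raise ValueError("counter 0-2, read_load 0-3, mode 0-5 expected")
    return (counter << 6) | (read_load << 4) | (mode << 1) | int(bcd)


def combine(lsb, msb):
    """Join the two 8-bit reads into the 16-bit count value."""
    return ((msb & 0xFF) << 8) | (lsb & 0xFF)


if __name__ == "__main__":
    # Counter 1, LSB-then-MSB access, mode 0, binary counting.
    print(hex(control_word(1, RL_LSB_THEN_MSB, 0)))
    # First read gives the low byte, second read gives the high byte.
    print(hex(combine(0x34, 0x12)))
```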
Once I completed the interface of the chip to the micro-controller, I conducted
preliminary tests on the timer chip. These tests were done to ensure the timer was
working properly. Even though I chose to operate the timers at 1.2 MHz, I noticed that
the timer was counting completely down several times. I was getting valid count values
but had no way of telling how many times the counter started over. This could cause a
problem when determining the speed up of the algorithms that I would be testing on the
project.
In order to solve this problem I had to find a way to track how many times the
counter reaches zero. This was one of the main reasons I chose to operate the timer in
mode 0. In mode 0 the timer would count down to zero, and once zero was reached a
high signal would be placed on the out1 pin of the timer chip. Now I had a way to keep
track of how many times the timer reached zero. Of course, it was not as simple as I
thought.
Once the timer reached zero, the signal would be placed on the out1 pin. The
timer would then continue to count down again. The problem with this is that the signal
on the out1 pin was not reset. The only way to reset the out1 pin was to reset the entire
timer and then restart the timer. This presented another problem. All of these actions
would take time. Even though it was a very small amount of time, it was still enough to
reduce the accuracy of the execution times of the algorithms.
The solution to this problem brought about the final design for the interface of the
timer. Since the resetting of the timer would take time, I needed to halt the execution of
the algorithm while I was resetting the timer. I accomplished this by using the external
interrupts on the MC68EC000.
Using the 68EC000's interrupts I could reset the timer and count the number of times the
timer reached zero. This was accomplished by adding an interrupt service routine to the
monitor program. The routine, which is written in assembly, is shown in Figure 13.
Using the interrupts also required some additional hardware design.
Figure 13: Timer Interrupt Service Routine (written in assembly).
While designing the hardware interface between the timer and micro-controller, I
developed a way to totally automate the resetting of the timer and the reading of the final
count value. This would require additional interrupts and logic for the interface. After a
few weeks of testing designs, I decided just to use the interrupt for the resetting of the
timer and keeping track of how many times the timer reached zero. I made this decision
based on the fact that these additional features of automation were not really necessary
and the fact that I would not be able to work out the bugs in the design in the time
allocated for the timer design.
Since I was not using automation for the stopping and reading of the timer, I had
to create a design that would allow the software to stop and read the timer. In the
micro-controller design the 74LS138 is used to select different chips. This is accomplished
by having three upper address lines connected to the 74LS138. By executing a read/write at
the address location specified by the address lines that are connected to the 74LS138, a
particular chip will be enabled. Since I was not using all of the locations available on the
74LS138, I decided to use it to help with the stopping and reading of the timer.
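The chip-select scheme can be modelled as a 3-to-8 decode of the upper address bits. The sketch below assumes the decoder's enable inputs are asserted, and the position of the three address lines (`ADDRESS_SHIFT`) is a hypothetical value, since the text does not say which lines are wired to the 74LS138. The outputs are active low, as on the real part.

```python
ADDRESS_SHIFT = 17  # hypothetical: bit position of the lowest decoded address line


def selected_output(address, shift=ADDRESS_SHIFT):
    """Return which of the outputs Y0..Y7 the three address bits select."""
    return (address >> shift) & 0b111


def decoder_outputs(address, shift=ADDRESS_SHIFT):
    """Return the eight active-low outputs; only the selected one is 0."""
    chosen = selected_output(address, shift)
    return [0 if index == chosen else 1 for index in range(8)]


if __name__ == "__main__":
    for address in (0x000000, 0x060000, 0xFE0000):
        print(hex(address), selected_output(address), decoder_outputs(address))
```

An unused output of the decoder can then act as the enable for an extra device, which is the approach the text takes for the timer.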
Now that my new design for the timer required additional logic, I decided to use
the MAX+plus II software. I designed the logic and then added it to the design for the
address bus multiplexors. This is shown in Diagram VIII in Appendix A. The logic
would allow for the interrupt that tracks the number of times the counter reaches
zero, and for the software-controlled stopping and reading of the timer.
The timer will be initialized and started under software control. When the timer
reaches zero, the execution of the algorithm will be interrupted, a count variable will be
incremented, the timer will be reset and restarted, and then the execution of the algorithm
will resume. This will be done without any action by the algorithm's own code. When the
algorithm is complete, software will stop the timer and read the count value. The software
control consists of additional lines of code that will be added to the code for the
algorithms. This code will not affect the results of the execution time of the algorithms.
After running several tests, I determined the design was sufficient to give effective
timing results for the algorithms I would be testing on the SMPPS.
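The final reading can be turned into an execution time by combining the interrupt count with the last count value. This is a minimal sketch under two assumptions: each complete pass of the counter is treated as 65,536 ticks, and the time spent resetting the timer inside the interrupt is ignored. The names are hypothetical; the 1.2 MHz clock is the rate given in the text.

```python
START_COUNT = 0xFFFF               # the timer counts down from 2**16 - 1
TICKS_PER_PASS = START_COUNT + 1   # assumption: one full pass counted as 65,536 ticks
CLOCK_HZ = 1_200_000               # timer clock rate stated in the text


def elapsed_ticks(overflows, final_count):
    """Ticks elapsed given the interrupt count and the final count value."""
    return overflows * TICKS_PER_PASS + (START_COUNT - final_count)


def elapsed_seconds(overflows, final_count):
    """Convert the tick total to seconds at the assumed clock rate."""
    return elapsed_ticks(overflows, final_count) / CLOCK_HZ


if __name__ == "__main__":
    # Example: two interrupts, then the timer is stopped 100 ticks into a pass.
    print(elapsed_ticks(2, START_COUNT - 100))
    print(elapsed_seconds(2, START_COUNT - 100))
```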
This concluded the hardware design of the system. Now it was time to move on
to the development of the algorithms for the system. For this project I will be testing two
algorithms. The first will be matrix multiplication and the second will be parallel
sorting.
CHAPTER 3
IMPLEMENTATION OF PARALLEL ALGORITHMS
3.1 Matrix Multiplication
3.1.1 Demonstrating a [4x4], [8x8], and [16x16] with [4x4] Matrix
For the matrix-multiplication algorithm (MMA), I wanted to use several different sized
matrices to show the effective speed up of using a SMPPS. I would multiply two
matrices and place the results in a third matrix. The three matrix sizes I chose were 4x4,
8x8, and 16x16. This would give me speed up values for matrix multiplications ranging
from simple to time-consuming.
I would also produce results for computing the matrix-multiplication on one
processor and on the SMPPS. The multiplication of the matrices on the SMPPS would be
done in two different ways. One way would be just utilizing the two processors, and the
second would utilize the shared-memory. I will be expecting a speed up of almost two
for the dual processor system without shared data, and considerably less of a speed up for
the shared-memory implementation. This would be caused by the overhead involved in
using the SMPPS. The transfer of data through the shared-memory is considerably
slower than using registers of a single micro-controller. I do, however, expect a
reasonable speed up over the single processor.
I will use 4x4 matrices to demonstrate the different ways I will do the matrix-
multiplication algorithm. I will be multiplying matrices A and B, and placing the results
in matrix C as shown in Figure 14.
Figure 14: [4x4] Matrix Multiplication on a single processor.
The operations required to compute Matrix C are shown in Figure 15.
Figure 15: [4x4] Matrix Multiplication.
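For reference, the operations behind C = A x B can be written as a short routine. This is an illustrative sketch in Python rather than the 68000 assembly program of Appendix B, and the function name is hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i in range(n):          # row of A and of C
        for j in range(n):      # column of B and of C
            for k in range(n):  # running index of the dot product
                c[i][j] += a[i][k] * b[k][j]
    return c


if __name__ == "__main__":
    a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    identity = [[1 if r == c else 0 for c in range(4)] for r in range(4)]
    print(mat_mul(a, identity))  # multiplying by the identity returns A
```

Each element of C takes n multiplications, so the work grows with the cube of the matrix dimension, which is why the 16x16 case is the time-consuming one.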
To obtain the execution time for running the algorithm on one processor, I gave
the processor access to all of matrix A and matrix B. The program I developed for this
algorithm is in Appendix B. I started out by writing individual programs for each of the
three different sized matrices and each of the three different ways. While developing the
first few programs, it occurred to me that this might affect the results for the execution
times of the algorithm. What I needed was a program that accomplished the three
different types of matrix-multiplication on all three of the matrix sizes. Also, the
program must accomplish it with as little difference in overhead as possible.
As I developed the program, I would test it numerous times. I started to get count
values for the different matrices. The values I was getting were very close to the
speedups I expected. The problem I was having was that I could not get the program to
work exactly like I wanted it to. It would give me results for one matrix size and not the
others. As I made changes to correct the problem, another problem would be introduced.
Rather than spend a tremendous amount of time trying to resolve these problems, I
chose to continue with the writing of the thesis. Figure 16 shows the results of the
execution times.
Figure 16: Matrix-Multiplication Execution Times (clock cycles).
The flowchart for the one-processor matrix multiplication algorithm is shown in
Diagram IX in Appendix A. In this program the micro-controller had access to all
of matrix A and matrix B. The program was loaded into the memory of one
micro-controller and then started. After the program went through its
initializations and loading of variables, the timer was started and the processor
calculated the results for matrix C by the previously stated equations. Once the results
were calculated they were moved to shared-memory and the timer was stopped. The last
step of the program was to read the values in the timer.
The next step was the program that used two processors to do the matrix
multiplication. This was accomplished by giving Processor A access to the first half of
matrix A (half of the rows) and access to all of matrix B. Processor A computes the
results for the first half of the C matrix. Processor B was given access to the second half
of matrix A and all of matrix B. Processor B computes the results for the second half of
the C matrix. The dashed line in Figure 17 shows the separation for the 4x4 matrices:
Figure 17: [4x4] Matrix Multiplication on dual processors.
The flowchart for this program is shown in Diagram X in Appendix A. Since the
only difference between the program in each processor is what portion of matrix A is
accessible, I developed the program to load only the portion of the matrix that the
individual processor needed. I accomplished this by using a subroutine that required a
start and finish location for the values of the matrix. The start and finish locations were
determined by which processor was using the program. This was all controlled by the
settings placed in the beginning of the program. To gain a better understanding of what I
did, a review of the program in Appendix B will be necessary.
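The start and finish idea can be sketched as one routine that both processors share, with only the row range differing. This is an illustration in Python, not the Appendix B program; the two halves are computed one after the other here, and the function names are hypothetical.

```python
def mat_mul_rows(a, b, first_row, last_row):
    """Compute rows first_row .. last_row - 1 of C = A x B."""
    n = len(b)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(first_row, last_row)]


def dual_processor_mat_mul(a, b):
    """Split the rows of A in half, one half per processor, then join the results."""
    n = len(a)
    half = n // 2
    top = mat_mul_rows(a, b, 0, half)      # work assigned to Processor A
    bottom = mat_mul_rows(a, b, half, n)   # work assigned to Processor B
    return top + bottom


if __name__ == "__main__":
    a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    identity = [[1 if r == c else 0 for c in range(4)] for r in range(4)]
    print(dual_processor_mat_mul(a, identity))
```

Since neither half depends on the other's results, no data has to be exchanged during the computation, which is why a speed up close to two is expected for this case.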
In order to obtain the most accurate times possible, I chose to have one of the
processors control the start and stop of the timer. I accomplished this by using
semaphores. These semaphores would be used to signal the other processor when it
could continue with its operations. This would allow the initialization and loading of
variables by both processors without having to include these operations in the execution
times.
Processor A would start by loading its start values and then would enter into a
wait state. It would exit that wait state when Processor B signaled that it had finished
loading variables and was now in its own wait state. Now Processor A would start the
timer, signal Processor B to start executing, and then start its own execution. Once
Processor A completed its execution it would check to see if Processor B was complete.
If Processor B was complete, Processor A would stop and read the count value of the
timer. Otherwise, Processor A would enter a wait state until Processor B completed its
execution.
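The handshake described above can be sketched with two threads and three signals. This is an illustrative model only: `threading.Event` objects stand in for the semaphores in shared-memory, the log entries stand in for the timer operations, and all names are hypothetical.

```python
import threading


def timed_run(work_a, work_b):
    """Run two workloads with the start/stop handshake and return the event log."""
    log = []
    b_ready = threading.Event()   # B has loaded its variables and is waiting
    go = threading.Event()        # A has started the timer; B may execute
    b_done = threading.Event()    # B has finished its share of the work

    def processor_b():
        log.append("B loaded")
        b_ready.set()
        go.wait()
        work_b()
        log.append("B finished")
        b_done.set()

    def processor_a():
        b_ready.wait()            # wait state until B signals it is ready
        log.append("timer started")
        go.set()
        work_a()
        b_done.wait()             # wait state until B completes
        log.append("timer stopped")

    threads = [threading.Thread(target=processor_a),
               threading.Thread(target=processor_b)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return log


if __name__ == "__main__":
    print(timed_run(lambda: sum(range(1000)), lambda: sum(range(1000))))
```

The timer is started only after both sides have finished loading and stopped only after both have finished computing, so setup time stays out of the measurement.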
The final program would give timing results for using shared memory as well as
the dual processors. The flowchart for this process is shown in Diagram XI in Appendix
A. In this program, both the A matrix and the B matrix are split up. The separation of
the matrices is shown in Figure 18.
Figure 18: [4x4] Matrix Multiplication on dual processors using shared-memory.
In this program, Processor A has access to the first half of matrix A and the first
half of matrix B. Processor A computes the results for the first half of the C matrix.
Processor B has access to the second half of matrix A and the second half of matrix B.
Processor B computes the results for the second half of the C matrix.
The difference between this program and the dual-processor program is that each
processor does not have all of the data to complete the computations for the C matrix.
For instance, for Processor A to compute the value of C00 it would need access to B20 and
B30. Since Processor B has access to these locations, the data in these locations must be
transferred to Processor A through the shared-memory. During the computation portion
of the program, each processor must finish the calculations that are possible and wait
until it is given the needed data.
I tried to develop the program in a fashion that would allow one processor to
make its possible calculations while the other processor was sending and receiving data
from the shared-memory. To ensure that a processor did not retrieve the data before it
was placed in shared-memory, I used the semaphores to place the processor into a wait
state until the required data was available. Once again, a better understanding can be
obtained by reviewing the program in Appendix B.
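The data exchange can be illustrated with a sequential model in which each processor first uses the rows of B it holds and then completes its results with the rows published by the other processor. This is a sketch, not the Appendix B program: there is no real concurrency, the semaphore waits are omitted, and the dictionary standing in for shared-memory and all names are hypothetical.

```python
def partial_products(a_rows, b_rows, k_offset, n):
    """Accumulate a_rows[i][k] * B[k][j] for the k values covered by b_rows."""
    part = [[0] * n for _ in a_rows]
    for i, a_row in enumerate(a_rows):
        for local_k, b_row in enumerate(b_rows):
            k = k_offset + local_k
            for j in range(n):
                part[i][j] += a_row[k] * b_row[j]
    return part


def shared_memory_mat_mul(a, b):
    """Each processor holds half the rows of A and half the rows of B."""
    n = len(a)
    half = n // 2
    local = {"A": (a[:half], b[:half], 0), "B": (a[half:], b[half:], half)}

    # Each processor publishes its rows of B through the shared area.
    shared = {name: (b_rows, offset) for name, (_, b_rows, offset) in local.items()}

    c = []
    for name, other in (("A", "B"), ("B", "A")):
        a_rows, b_rows, offset = local[name]
        own = partial_products(a_rows, b_rows, offset, n)          # possible at once
        remote_rows, remote_offset = shared[other]                 # fetched later
        rest = partial_products(a_rows, remote_rows, remote_offset, n)
        c.extend([x + y for x, y in zip(r1, r2)] for r1, r2 in zip(own, rest))
    return c


if __name__ == "__main__":
    a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    identity = [[1 if r == c else 0 for c in range(4)] for r in range(4)]
    print(shared_memory_mat_mul(a, identity))
```

The split into `own` and `rest` mirrors the text: the first part can be computed while the other processor is still transferring data, and the second part must wait for it.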
I have described how I implemented the different programs by showing
how it was done on a [4x4] matrix. I developed the program to compute the results for
the [8x8] matrix. To get the results for the [4x4] case, I added code to reduce the number
of loops in the matrix-multiplication routines. I increased the number of loops in the
matrix-multiplication routines to get the results for the [16x16] case. In order to produce
valid timing results, I tried to do this in a way that keeps the overall operation of the
program the same for all matrix sizes. The theory of adding and subtracting loops was
sound, but the code to keep the operations the same became quite complex. This is what
is causing the delay in the development of a fully operational program.
CHAPTER 4
PERFORMANCE EVALUATIONS
4.1 Matrix Multiplication
As I stated earlier, I am getting consistent results from the current program. However, I
am still unable to remove all of the bugs from the program to produce results for all of the
program operations. I noticed that overall the results I obtained do not change as I make
changes to the program. When I make changes to the program I am able to get results for
different size matrices. Several times I was able to get results for more than one size
matrix and the results were quite similar to the ones I was getting when I was only able to
produce results for one size matrix. Since I am getting results like I expected, I could
continue to troubleshoot the current program. With time, I expect to have all the
problems worked out of the program. The speedups, based on the results in Figure 16
are:
Figure 16: Matrix-Multiplication Execution Times (clock cycles).
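Speedup is taken here in its usual sense: the single-processor execution time divided by the execution time of the parallel version. A minimal sketch follows; the function names are hypothetical, and the cycle counts in the usage lines are placeholders, not measurements from Figure 16.

```python
def speedup(single_cycles, parallel_cycles):
    """Ratio of single-processor cycles to parallel cycles."""
    if parallel_cycles <= 0:
        raise ValueError("parallel_cycles must be positive")
    return single_cycles / parallel_cycles


def efficiency(single_cycles, parallel_cycles, processors):
    """Speedup divided by the number of processors used."""
    return speedup(single_cycles, parallel_cycles) / processors


if __name__ == "__main__":
    # Placeholder values only; substitute the measured clock-cycle counts.
    print(speedup(1000, 500))
    print(efficiency(1000, 500, 2))
```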
CHAPTER 5
CONCLUSIONS
Based on the results I achieved with the matrix multiplication algorithm, I am concluding
that there is an overall effective speedup in using a SMPPS. Overall I would rate this
project as a success. I accomplished the first two objectives and made significant
progress on the third objective. This project gave me the opportunity to work on a
project from the design phase to the testing phase and the opportunity to apply the
knowledge I acquired while at NJIT as well as hone my engineering skills.
During the project, I conquered many hurdles and had the chance to have an
impact on the curriculum of undergraduate students. Many of the discoveries I made
while designing and implementing the micro-controller were beneficial to the EE 393
Lab. Teaching the EE 393 Lab over the summer session was equally rewarding. Not
only was I able to increase my understanding of the micro-controller, but I was also able
to impart to the students the knowledge I had gained while working on the project.
The SMPPS project leaves the door open for future areas of study and research.
Basing a new system with more processors on this design would present an interesting
challenge. Also, developing more parallel algorithms for the system would present an
equally challenging obstacle. The possibilities that can be pursued are virtually limitless.
APPENDIX A
DIAGRAMS
Appendix A has the following diagrams:
Dual-Processor Shared-Memory Block Diagram (I)
Dual-Processor Shared-Memory Block Diagram (II)
Original Control Logic Design
1-2 DeMultiplexor Logic
2-1 Multiplexor Logic
Final Shared-Memory Control Logic Design
Default Symbol CTEST Logic
Timer Control Logic
Flow Chart I — One Processor Operation
Flow Chart II — Dual-Processor Operation
Flow Chart III — Dual-Processor Operation using Shared-Memory
Diagrams 1 through 11 (images not reproduced here).
APPENDIX B
Programs
This is the program for a [4x4], [8x8], and [16x16] Matrix-Multiplication on One
Processor System, a Dual-Processor System, and a Dual-Processor System using Shared-
Memory.
A7E     EQU     $1407E
A7F     EQU     $1407F
A80     EQU     $14080
A81     EQU     $14081
A82     EQU     $14082
A83     EQU     $14083
A84     EQU     $14084
A85     EQU     $14085
A86     EQU     $14086
A87     EQU     $14087
A88     EQU     $14088
A89     EQU     $14089
A8A     EQU     $1408A
A8B     EQU     $1408B
A8C     EQU     $1408C
A8D     EQU     $1408D
A8E     EQU     $1408E
A8F     EQU     $1408F
A90     EQU     $14090
A91     EQU     $14091
A92     EQU     $14092
A93     EQU     $14093
A94     EQU     $14094
A95     EQU     $14095
A96     EQU     $14096
A97     EQU     $14097
A98     EQU     $14098
A99     EQU     $14099
A9A     EQU     $1409A
A9B     EQU     $1409B
A9C     EQU     $1409C
A9D     EQU     $1409D
A9E     EQU     $1409E
A9F     EQU     $1409F
AA0     EQU     $140A0
AA1     EQU     $140A1
AA2     EQU     $140A2
AA3     EQU     $140A3
AA4     EQU     $140A4
AA5     EQU     $140A5
AA6     EQU     $140A6
AA7     EQU     $140A7
AA8     EQU     $140A8
AA9     EQU     $140A9
AAA     EQU     $140AA
AAB     EQU     $140AB
AAC     EQU     $140AC
AAD     EQU     $140AD
AAE     EQU     $140AE
AAF     EQU     $140AF
AB0     EQU     $140B0
AB1     EQU     $140B1
AB2     EQU     $140B2
AB3     EQU     $140B3
AB4     EQU     $140B4
AB5     EQU     $140B5
AB6     EQU     $140B6
AB7     EQU     $140B7
AB8     EQU     $140B8
AB9     EQU     $140B9
ABA     EQU     $140BA
ABB     EQU     $140BB
ABC     EQU     $140BC
ABD     EQU     $140BD
ABE     EQU     $140BE
ABF     EQU     $140BF
AC0     EQU     $140C0
AC1     EQU     $140C1
AC2     EQU     $140C2
AC3     EQU     $140C3
AC4     EQU     $140C4
AC5     EQU     $140C5
AC6     EQU     $140C6
AC7     EQU     $140C7
AC8     EQU     $140C8
AC9     EQU     $140C9
ACA     EQU     $140CA
ACB     EQU     $140CB
ACC     EQU     $140CC
ACD     EQU     $140CD
ACE     EQU     $140CE
ACF     EQU     $140CF
AD0     EQU     $140D0
AD1     EQU     $140D1
AD2     EQU     $140D2
AD3     EQU     $140D3
AD4     EQU     $140D4
AD5     EQU     $140D5
AD6     EQU     $140D6
AD7     EQU     $140D7
AD8     EQU     $140D8
AD9     EQU     $140D9
ADA     EQU     $140DA
ADB     EQU     $140DB
ADC     EQU     $140DC
ADD     EQU     $140DD
ADE     EQU     $140DE
ADF     EQU     $140DF
AE0     EQU     $140E0
AE1     EQU     $140E1
AE2     EQU     $140E2
AE3     EQU     $140E3
AE4     EQU     $140E4
AE5     EQU     $140E5
AE6     EQU     $140E6
AE7     EQU     $140E7
AE8     EQU     $140E8
AE9     EQU     $140E9
AEA     EQU     $140EA
AEB     EQU     $140EB
AEC     EQU     $140EC
AED     EQU     $140ED
AEE     EQU     $140EE
AEF     EQU     $140EF
SA99    EQU     $28099
SA9A    EQU     $2809A
SA9B    EQU     $2809B
SA9C    EQU     $2809C
SA9D    EQU     $2809D
SA9E    EQU     $2809E
SA9F    EQU     $2809F
SAA0    EQU     $280A0
SAA1    EQU     $280A1
SAA2    EQU     $280A2
SAA3    EQU     $280A3
SAA4    EQU     $280A4
SAA5    EQU     $280A5
SAA6    EQU     $280A6
SAA7    EQU     $280A7
SAA8    EQU     $280A8
SAA9    EQU     $280A9
SAAA    EQU     $280AA
SAAB    EQU     $280AB
SAAC    EQU     $280AC
SAAD    EQU     $280AD
SAAE    EQU     $280AE
SAAF    EQU     $280AF
SAB0    EQU     $280B0
SAB1    EQU     $280B1
SAB2    EQU     $280B2
SAB3    EQU     $280B3
SAB4    EQU     $280B4
SAB5    EQU     $280B5
SAB6    EQU     $280B6
SAB7    EQU     $280B7
SAB8    EQU     $280B8
SAB9    EQU     $280B9
SABA    EQU     $280BA
SABB    EQU     $280BB
SABC    EQU     $280BC
SABD    EQU     $280BD
SABE    EQU     $280BE
SABF    EQU     $280BF
SAC0    EQU     $280C0
SAC1    EQU     $280C1
SAC2    EQU     $280C2
SAC3    EQU     $280C3
SAC4    EQU     $280C4
SAC5    EQU     $280C5
SAC6    EQU     $280C6
SAC7    EQU     $280C7
SAC8    EQU     $280C8
SAC9    EQU     $280C9
SACA    EQU     $280CA
SACB    EQU     $280CB
SACC    EQU     $280CC
SACD    EQU     $280CD
SACE    EQU     $280CE
SACF    EQU     $280CF
SAD0    EQU     $280D0
SAD1    EQU     $280D1
REFERENCES
Rosenstark, Dr. Sol, 1998, Computer Construction Project and Experiments — EE 393:
Electrical Engineering Laboratory III New Jersey: New Jersey Institute of Technology.
The following references have been included as further reading to help with
the understanding of the material, figures, and programs contained in this thesis.
Hwang, Kai, 1993, ADVANCED COMPUTER ARCHITECTURE: Parallelism,
Scalability, Programmability. New York: McGraw-Hill, Inc.
Antonakos, James L., 1996, THE 68000 MICROPROCESSOR: Hardware and Software
Principles and Applications. Third Edition. New Jersey: Prentice Hall.
Anon., 1992, M68000: Family Programmer's Reference Manual. Arizona: Motorola
Literature Distribution.
Katz, Randy H., 1994, CONTEMPORARY LOGIC DESIGN, California: The
Benjamin/Cummings Publishing Company, Inc.
