Design of a high-speed digital processing element for parallel simulation by Cwynar, D. S. & Milner, E. J.
NASA Technical Memorandum 83373 NASA-TM-83373 19830021326
Design of a High-Speed Digital
Processing Element For
Parallel Simulation
Edward J. Milner and David S. Cwynar
Lewis Research Center
Cleveland, Ohio
https://ntrs.nasa.gov/search.jsp?R=19830021326 2020-03-21T02:16:53+00:00Z
DESIGN OF A HIGH-SPEED DIGITAL PROCESSING ELEMENT FOR PARALLEL SIMULATION
Edward J. Milner and David S. Cwynar
National Aeronautics and Space Administration
Lewis Research Center
Cleveland, Ohio 44135
SUMMARY
A prototype of a custom-designed computer to be used as a processing ele-
ment in a multi-processor-based jet engine simulator is described in this re-
port. The computer was custom-designed to give it the speed and versatility
required to simulate a jet engine in real-time. Real-time simulations are
needed for closed-loop testing of digital electronic engine controls. The
prototype computer has a microcycle time of 133 nanoseconds. This speed was
achieved by: (1) prefetching the next instruction while the current one is
executing, (2) transporting data using high-speed data busses, and (3) using
state-of-the-art components such as a VLSI multiplier. However, some other
features usually found in commercially available computers, but not neces-
sarily required in a simulator, were left out of the custom design to reduce
cost and system complexity. These include complex interrupt structures, byte
addressability, and a large memory addressing range.
The report discusses processing element requirements, design philosophy,
the architecture of the custom-designed processing element, the comprehensive
instruction set, the diagnostic support software, and the development status
of the custom design. Problem areas encountered in designing the processing
element, along with logic circuitry used to eliminate those problems, are
pointed out. By doing so, the authors hope that the reader may gain some in-
sight into the kinds of difficulties that might be expected when undertaking
the development of a custom-designed computer to meet special application
needs.
INTRODUCTION
Even though the rapid growth in microelectronic technology in recent years
has made it possible to build compact high-speed digital computers, accurate
real-time digital simulations of modern airbreathing engines and their controls
are still difficult to achieve. Because of the large number of complex non-
linear calculations required in such a simulation and the sequential nature of
digital computers, real-time execution of these calculations requires simpli-
fication of the engine model and/or the use of an extremely fast, dedicated
mainframe computer (e.g., ref. 1).
The technique of parallel processing seems to show promise for real-time
simulation. Several computers concurrently operating on different portions of
the simulation can effectively reduce the computation time and may provide
real-time response of the simulation (ref. 2). Parallel computation may permit
real-time execution of simulations heretofore considered too complex for such
execution speeds. The use of low-cost mini or microcomputers can make this
approach cost-effective relative to current approaches (i.e., use of mainframe
computers).
The design of a real-time digital simulator (RTDS), to be used as a research
tool for the development and testing of airbreathing engines and their
controls, is being pursued at the NASA Lewis Research Center (LeRC). A
conceptual design of a possible Lewis RTDS is presented in reference 3. A
block diagram of this design is shown in figure 1.
Basically, the RTDS consists of several processing elements (PE's)
synchronized on a high-speed data transfer bus by a transfer controller. All
but two of the processing elements can be used to perform concurrent
simulation computations. One of the two remaining processing elements is
dedicated to input/output functions. The last processing element serves as
the real-time-extension (RTX) of the front-end-processor (FEP). It is a
special purpose processor linking low-speed operator interaction with the
high-speed simulator core. The FEP provides the interface between the
operator and simulator that permits the operator to control the simulation
execution. The FEP, based on the Motorola MC68000 microprocessor, handles
such functions as peripheral communications and mode control. In addition, a
host computer interface allows the downlinking and uplinking of data and
programs between a host computer and the simulator. At LeRC the host computer
is an International Business Machines (IBM) 370/3033.
This report describes the design of a prototype computer meant to be used
as a processing element (PE) in a multi-processor-based jet engine simulator.
The custom design provides computation speeds and programming flexibility not
obtainable using commercially available computers. The report discusses the
requirements for a PE to be used in a RTDS, the Lewis design philosophy, the
architecture of the custom-designed computer and its support software, and the
development status of the custom design.
PROCESSING ELEMENT REQUIREMENTS
The custom-designed processing element (PE) was required to satisfy several
basic requirements. The first and foremost requirement was computational
speed. The kinds of engine control research for which the simulator would be
used require the simulation calculations to be performed in less than 10
milliseconds. This requirement pointed to a maximum microcycle time of 250
nanoseconds. Since the PE was to be used not only as the principal computing
element in the simulator but also as a tool in future hardware and software
research, every effort was made to minimize the microcycle time.
Another requirement was that the PE be versatile and convenient to use.
The PE must be flexible enough to accommodate both hardware and software
changes. Because the simulator is intended to be a research tool, the PE must
be compatible with different simulator hardware configurations. The PE design
must allow changes to its instruction set so that their effects on system
operation can be evaluated.
Because the PE is to be used in a parallel processing system, the PE memory
access must be sufficiently fast. Otherwise, a disproportionate amount of
time will be required to transfer the data among the elements comprising the
simulator. Finally, the prototype design must be inexpensive to build.
PROCESSOR OVERVIEW
The custom-designed computer presented in this report satisfies all of
these requirements. Speed was acquired by building the PE with
state-of-the-art components (a VLSI multiplier, e.g.) and employing innovative
circuitry to reduce the execution time. As will be seen later, the prototype
design, with its microcycle time of 133 nanoseconds, is capable of executing a
sequence of code two to three times faster than a Digital Equipment
Corporation (DEC) PDP 11/70 computer. In addition, the custom-designed PE
provides the programmer with a very powerful and complete instruction set for
simulating jet engine behavior. Almost 300 different instructions, each with
an easy to understand natural mnemonic name, cover the spectrum of arithmetic,
logical, and conditional operations. Many operations which would normally
require multiple instructions on other computers are available on this custom
design as a single command. Selecting the maximum of two values, MAX, for
example, is a single instruction on the custom-designed PE.
Instructions are not hardwired in the custom-designed PE, but reside in
the PE as microcode. No hardware changes are required to modify the
instruction set. In addition, all circuit connections on all circuit boards
are wirewrap connections. Thus, soldering is unnecessary when modifying wiring
on the computer boards.
The PE design will permit the transfer of data between PE's to be carried
out quickly and in an orderly fashion in the simulator, the process consuming
an acceptable portion of the computation update cycle. The hardware meets this
high-speed data transfer requirement by using pipelining, 45 nanosecond
memory, and Schottky TTL components. Costs were kept to a minimum by building
the prototype PE inhouse at LeRC. The hardware for the system cost
approximately $11 000. This includes the chassis, circuit boards, components,
wire, etc. for the PE; it does not include any support hardware or diagnostic
equipment. A photograph of the prototype hardware is included as figure 2.
Salient features of the design include: a 133 nanosecond microcycle time;
an advanced microcycle controller to minimize the number of cycles associated
with instructions involving conditionals; an enhanced instruction set that
permits rapid execution of simulation-related functions (SELECT MAX or SELECT
MIN executed in 166 nsec, for example); and a very-large-scale integrated
(VLSI) 16-bit multiplier that permits three different types of multiplication
to be executed in 400 nanoseconds each.
DESIGN PHILOSOPHY
The emergence of moderately priced, very high speed metal-oxide-
semiconductor (MOS) and transistor-transistor-logic (TTL) memories has made
possible the design of moderately-priced memory modules with access times of
80 to 100 nanoseconds. As a result, traditional minicomputer design
guidelines, which assume that the arithmetic logic unit (ALU) is at least
twice as fast as the memory, have become obsolete. Therefore, to achieve ALU
cycle times that are compatible with the speeds of these advanced memories,
one could use the latest emitter-coupled-logic (ECL) bit-slice devices.
However, these devices are very expensive, consume a considerable amount of
power, and require special circuit boards. They also lack standardization and
have poor noise immunity. TTL bit-slice circuit elements, like the Advanced
Micro Devices (AMD) 2900, for example, offer noise immunity but lack the speed
needed to take advantage of the high speed memories. An alternate approach to
achieving high-speed computation would be to use parallel ALU's. The Plessey
MIPROC-16
computer is an example. However, to achieve reasonable speed (a 250 nano-
second microcycle time) at low cost, the MIPROC-16 incorporates a single ac-
cumulator, a 16-bit instruction word length, and an awkward addressing mode.
This limits the computational power of the machine and requires many cycles
for complex operations.
Thus, to provide the required speed and programming flexibility for the
real-time simulator, it was necessary to custom design a processor. Making
the design compatible with TTL and ECL bit-slice architectures would leave
open the possibility for incorporating new bit-slice technology as it becomes
available to improve performance and/or lower costs. The basic processor de-
sign is a 16-bit computer with a 32-bit instruction word length. The design
incorporates saturated logic of the Schottky and low-power Schottky TTL family.
Current packaging technology restricts the flexibility of available VLSI and
LSI devices. Hence, our custom-designed computer features mostly medium-scale
integrated (MSI) circuits in the ALU. The micro-programmable architecture
employs instruction prefetch and permits pipelining of processor control sig-
nals to increase speed and provides the ability to modify the instruction set
when necessary.
ARCHITECTURE OF THE CUSTOM-DESIGNED PROCESSING ELEMENT
The architecture of the custom-designed PE is presented in figure 3. It
consists of an arithmetic logic unit (ALU), high-speed data busses, a
high-speed TRW MPY16HJ multiplier, a status register with an associated status
logic
generator, an exponent generator/zero detector, a microprogram controller for
sequencing execution of microcoded instructions and a 32K-by-16-bit memory.
These components and associated design considerations will be discussed further
in the following sections of this report. Also included in the PE architecture
are a program counter, which points to the next instruction to be executed,
and a memory address register, which keeps track of memory access. Associated
with each of these is an adder which can be used to increment or decrement the
program counter or memory address register. Access to the PE is provided by
an input/output port and an external memory port. Through these paths the
programmer not only may monitor parameter values being used by the PE, but he
is also provided with a means of modifying those which he feels need adjusting.
The Status Register
At any instant during execution, a special register gives the current
condition and mode of operation of the computer and the calculations taking
place. It also provides the programmer with control over the execution of the
program by allowing him to select and/or change the mode of operation. In the
custom-designed computer this special register, called the status register, is
16 bits long. Its layout is shown in figure 4. Notice that by setting bits
12 through 15, the programmer can enable various levels of interrupts and/or
overflow limiting. When overflow limiting is enabled, any calculation that
would overflow is restricted to its full scale value. Bits 7 through 11 are
a series of flags that allow different actions to be taken depending on
whether they are set or reset. Bits 5 and 6 are the overflow latch and
overflow latch enable, respectively. When the overflow latch is enabled (bit
6 set), an overflow will cause the overflow latch (bit 5) to be set and remain
set until it is specifically reset by an action taken by the operator or the
program itself. Bit 4 acts as a carryout bit; that is, it is a storage loca-
tion for a bit which may be lost in a calculation due to limited register size.
For example, in a 16-bit machine, adding the hexadecimal words FFFF and FFFE
(-1 plus -2 decimal) results in FFFD (-3 decimal) with the carry bit (bit 4)
set because a bit is "carried out" of the calculation. The final four bits
(bits 0 through 3) are designated as the condition code. Generally, for
arithmetic operations bit 3 indicates whether an overflow has occurred, and
bits 0 through 2 indicate whether the result is positive, zero, or negative,
respectively. Generally, for logical operations comparing A with B, the con-
dition code bits 0 through 2 indicate whether A is greater than B, equal to B,
or less than B. A summary of status register characteristics is presented in
table I.
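
The bit assignments described above can be illustrated with a short Python
sketch. It is not taken from the report: the names StatusFields, decode_status,
and add16 are hypothetical, and the sketch only restates the field boundaries
given in the text (bits 12 to 15, 7 to 11, 6, 5, 4, 3, and 0 to 2) together
with the FFFF plus FFFE carry example.

```python
from dataclasses import dataclass

WORD_MASK = 0xFFFF  # 16-bit machine word


@dataclass
class StatusFields:
    """Hypothetical container for the fields of the 16-bit status register."""
    enables: int            # bits 12-15: interrupt levels and/or overflow limiting
    flags: int              # bits 7-11: flags tested by the program
    ovf_latch_enable: bool  # bit 6: overflow latch enable
    ovf_latch: bool         # bit 5: overflow latch
    carry: bool             # bit 4: carryout bit
    overflow: bool          # bit 3: overflow part of the condition code
    condition: int          # bits 0-2: remaining condition code bits


def decode_status(sw: int) -> StatusFields:
    """Split a status word into the fields described in the text."""
    return StatusFields(
        enables=(sw >> 12) & 0xF,
        flags=(sw >> 7) & 0x1F,
        ovf_latch_enable=bool(sw & (1 << 6)),
        ovf_latch=bool(sw & (1 << 5)),
        carry=bool(sw & (1 << 4)),
        overflow=bool(sw & (1 << 3)),
        condition=sw & 0x7,
    )


def add16(a: int, b: int):
    """Add two 16-bit words; return the 16-bit result and the carryout bit."""
    total = (a & WORD_MASK) + (b & WORD_MASK)
    return total & WORD_MASK, total > WORD_MASK


result, carry = add16(0xFFFF, 0xFFFE)   # -1 plus -2 in two's complement
print(f"{result:04X}", carry)
```

Here add16(0xFFFF, 0xFFFE) yields the word FFFD with the carryout set, which
matches the example in the text.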
Status Logic Capabilities
Comparisons. - A unique feature of the PE is that circuitry has been
included in the design which allows valid, logical comparisons of arbitrarily
large, multiple-precision signed-words, as well as single-precision
signed-words. That logical comparisons of signed-words cannot be made by just
subtracting one word from another in the ALU and checking the sign of the
result is well-known. To do so can result in an error when the subtraction
overflows. To prevent these kinds of errors, an external zero detector circuit
has been included as part of the PE hardware. The inputs to the zero detector
include: (1) the words to be compared; (2) the output bits, Uxx, of the ALU;
and (3) bit 1 of the status word, SW1, which is set when, at the time the
status word is updated, all bits of the current ALU output are zero (indicated
as Uxx = 0).
The logical comparison is based on the relationship that if B is not
greater than A ($\overline{B > A}$) and B is not equal to A
($\overline{\mathrm{ZERO}}$), then B must be less than A. ZERO denotes the
output from the external zero detector circuit. When performing
single-precision comparisons and when operating on the least significant 16
bits of multiple-precision words, the ZERO signal will be false if any of the
Uxx bits output from the ALU are set. When operating on the more significant
16-bit portions of multiple-precision words being compared, ZERO represents
the current status of a running test on the multiple-precision comparison.
Here the ZERO signal is based not only on the current 16 bits being compared
(indicated as Uxx), but also on the previously compared 16 bits (indicated as
SW1). The ZERO signal then satisfies the relationship

$$\mathrm{ZERO} = \mathrm{SW1} \cdot (U_{xx} = 0) \qquad (1)$$

For both single and multiple-precision comparisons, the B > A signal is
determined by the sign bit and the carryout bit from the ALU subtraction.
Thus, the logical comparison approach used in the PE is valid for both single
and multiple-precision comparisons. The multiple-precision comparisons reduce
to a series of multiple-precision subtractions in the ALU together with the
use of a running condition code.
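
A minimal Python sketch of a least-significant-word-first comparison with a
running zero test in the spirit of equation (1) follows. It is an illustration
under stated assumptions, not the PE circuitry: the names to_words and
compare_lsw_first are hypothetical, and the B > A decision is modeled with the
conventional sign-and-overflow rule of two's complement arithmetic rather than
the exact sign and carryout logic of the hardware, which the text does not
spell out.

```python
WORD_BITS = 16
MASK = 0xFFFF


def to_words(value: int, n_words: int):
    """Split a signed integer into 16-bit words, least significant first."""
    value &= (1 << (WORD_BITS * n_words)) - 1   # two's complement wrap
    return [(value >> (WORD_BITS * i)) & MASK for i in range(n_words)]


def compare_lsw_first(a_words, b_words):
    """Compare two multiple-precision signed words (non-empty, equal length).

    Each step subtracts one 16-bit word pair with borrow, as a chain of
    16-bit ALU subtractions would, and updates the running zero flag.
    """
    borrow = 0
    running_zero = True                # plays the role of SW1
    for a, b in zip(a_words, b_words):
        diff = a - b - borrow
        borrow = 1 if diff < 0 else 0
        u = diff & MASK                # ALU output bits Uxx
        running_zero = running_zero and (u == 0)   # ZERO = SW1 . (Uxx = 0)

    # a, b, u now hold the most significant words and their difference.
    sign_a, sign_b, sign_u = a >> 15, b >> 15, u >> 15
    overflow = (sign_a != sign_b) and (sign_u != sign_a)
    # Assumption: A - B is negative when the sign and overflow bits differ.
    b_greater = (bool(sign_u) != overflow) and not running_zero

    if running_zero:
        return "B = A"
    return "B > A" if b_greater else "B < A"


print(compare_lsw_first(to_words(-30000, 1), to_words(30000, 1)))
```

The last line compares -30000 with 30000 in a single word. The 16-bit
difference is positive there, so a check of the result sign alone would give
the wrong answer; the sketch reports that B is greater than A.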
Because the custom-designed PE is a research-oriented computer, it has been
designed to be extremely flexible when it comes to logical operations. For
example, it has been designed to permit the multiple-precision comparisons to
be performed from the most significant words to the least significant words.
In some applications this can result in increased speed. The general procedure
is similar to that just discussed. In this case, however, the procedure can
stop after comparison of the most significant words is made, provided an
inequality is detected. As before, the status register is updated after each
subtraction in the ALU. SW0 is used in this case to indicate if a previous
word comparison determined that B > A. If SW1 is set, all less significant
words compared thus far have been equal. And since the less significant words
of multiple-precision integers are unsigned, inequality may be detected by
simply observing the carryout bit of the ALU upon subtraction. If the carryout
bit is reset, B is greater than A.
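
The early exit described above can be sketched in Python as follows. The
function names are hypothetical, and the sketch makes two assumptions that the
text only implies: the most significant word pair is compared as signed
values, and each later (unsigned) word pair is compared by forming A + NOT B +
1 and inspecting the carryout, so that a reset carryout means B > A.

```python
MASK = 0xFFFF


def to_signed16(word: int) -> int:
    """Interpret a 16-bit word as a two's complement value."""
    return word - 0x10000 if word & 0x8000 else word


def compare_msw_first(a_words, b_words):
    """Compare word lists given most significant word first.

    Returns the verdict and the number of word pairs examined, so the
    saving from stopping at the first inequality is visible.
    """
    examined = 0
    for index, (a, b) in enumerate(zip(a_words, b_words)):
        examined += 1
        if index == 0:
            # Most significant words carry the sign of the whole number.
            if a != b:
                verdict = "B > A" if to_signed16(b) > to_signed16(a) else "B < A"
                return verdict, examined
        else:
            # Less significant words are unsigned: subtract and watch carryout.
            total = a + ((~b) & MASK) + 1
            carry_out = total > MASK
            if (total & MASK) != 0:
                return ("B < A" if carry_out else "B > A"), examined
    return "B = A", examined


print(compare_msw_first([0x0001, 0x0000], [0x0002, 0xFFFF]))
```

In the printed example the most significant words already differ, so only one
of the two word pairs is examined.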
Multiplications. - Since the custom-designed PE uses the TRW MPY16HJ
multiplier which has no built-in overflow indicator, logic to sense
multiplication overflow, MPYOVFL, had to be included in the custom design. The
only possible way to overflow in an integer (scaled fraction) multiplication
on the custom-designed PE is to try to multiply minus full scale by minus full
scale. In scaled fraction notation, this amounts to trying to perform the
calculation (-1) × (-1) = +1. The result, +1, overflows the scaled fraction
format, and hence, causes a multiplication overflow condition. To determine
multiplication overflow all that needs to be done is to determine when minus
full scale is fed to each input of the TRW multiplier. Since this multiplier
is separate from the ALU, the ALU can be used to add both multiplier inputs
while the multiplication is taking place. The multiplication overflow
indication, then, can come from the relationship

$$\mathrm{MPYOVFL} = \mathrm{OVFL} \cdot (U_{xx} = 0) \qquad (2)$$

since, when minus full scale is added to minus full scale, both an overflow
(OVFL) occurs in the ALU and the Uxx output of the ALU is zero. Notice that
this is the only set of inputs to the ALU that will result in these
conditions.
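
Equation (2) can be checked with a few lines of Python. This is a sketch of
the logic only: alu_add and mpy_ovfl are hypothetical names, and minus full
scale is taken to be the 16-bit word 8000 hexadecimal, as is usual for two's
complement scaled fractions.

```python
MASK = 0xFFFF
MINUS_FULL_SCALE = 0x8000   # -1.0 as a 16-bit scaled fraction (assumed encoding)


def alu_add(a: int, b: int):
    """Add two 16-bit words; return the 16-bit sum and the signed overflow flag."""
    total = (a + b) & MASK
    sign_a, sign_b, sign_r = a >> 15, b >> 15, total >> 15
    ovfl = (sign_a == sign_b) and (sign_r != sign_a)
    return total, ovfl


def mpy_ovfl(a: int, b: int) -> bool:
    """MPYOVFL = OVFL . (Uxx = 0), evaluated on the sum of the multiplier inputs."""
    u, ovfl = alu_add(a, b)
    return ovfl and (u == 0)


# Sweep one operand over all 16-bit values while the other is minus full scale.
hits = [b for b in range(0x10000) if mpy_ovfl(MINUS_FULL_SCALE, b)]
print([f"{b:04X}" for b in hits])
```

The sweep flags only the case in which both inputs are minus full scale. A sum
of zero requires the second operand to be the two's complement negation of the
first, and an overflow requires both operands to be negative, which leaves
8000 hexadecimal as the only possibility.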
The Instruction Set
Normally, instructions reside in a computer as an integral part of its
hardware. Thus, the programmer must be satisfied with the instruction set
built into the computer he is using. He may be able to construct some other
instructions by building PROCEDURES or MACROS using the instructions provided
to him by the manufacturer. This may necessitate the use of awkward constructs
that require extra execution time and more core storage. In a time-critical
simulation application, this may not be acceptable. The custom-designed PE
includes a microcoded instruction set which offers several advantages over a
system with conventional hardwired instructions. As mentioned previously, the
PE was designed for use as a research tool to develop and evaluate parallel
processor hardware and software. Having the instruction set reside in software
complements the versatility of the design. Instructions can be conveniently
added or deleted from the instruction set as the need arises.
The instruction set residing in microcode in the custom-designed PE gives
the jet engine simulation programmer extraordinary computing power with which
to do his work. The highlights of this instruction set are summarized in
table II. The set includes 285 instructions in all. There are 22 different
arithmetic instructions. These include the basic operations of addition, sub-
traction, multiplication, and division operating on integer (including scaled
fraction) or floating point inputs, both single precision and double precision.
The enabling and controlling of the interrupt structure and data transfer con-
trol of the machine is governed by 11 different control instructions. These
instructions also control input/output to the PE.
The following instructions are unique to this machine insofar as they are
extremely fast and provide the programmer with powerful one instruction com-
puting capability.
Data type conversions are provided by 22 data conversion instructions.
These allow any of the data types: integer (scaled fraction), floating-point,
both single and double precision, to be converted to any other data type auto-
matically. These instructions allow automatic scaling and descaling between
integer and floating-point. Implementation of integration schemes used in the
simulations can be facilitated by 14 instructions which perform multiple-
precision, cumulative addition and subtraction.
Block move instructions allow the moving of the contents of blocks of mem-
ory around within memory using only a single instruction. Single instructions
also allow fast implementation of complex multiconditional jumps. The jump
will occur on status word condition true. The status word condition is field
selectable on four fields with 16 different combinations on each field.
Instructions operating on integer input data are very fast. They require
at most two machine microcycles to execute, with most requiring only one. The
multiply instruction using the TRW MPY16HJ multiplier requires 3 microcycles
to execute and the divide instruction executes in only 19 microcycles in this
custom-designed machine. Likewise, select maximum/minimum requires only one
microcycle for 16-bit integers.
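
The microcycle counts quoted above translate into execution times as in the
following short Python calculation. The dictionary keys and the function name
are illustrative only; the cycle counts and the 133 nanosecond microcycle time
are the figures given in the text.

```python
MICROCYCLE_NS = 133  # microcycle time of the prototype PE

# Microcycle counts quoted in the text for instructions on integer data.
MICROCYCLES = {
    "integer instruction (at most)": 2,
    "multiply": 3,
    "divide": 19,
}


def execution_time_ns(microcycles: int) -> int:
    """Execution time of an instruction that takes the given number of microcycles."""
    return microcycles * MICROCYCLE_NS


for name, cycles in MICROCYCLES.items():
    print(f"{name}: {cycles} microcycles = {execution_time_ns(cycles)} ns")
```

Three microcycles come to 399 nanoseconds, consistent with the 400 nanosecond
multiplication time quoted in the processor overview, and the 19-microcycle
divide takes about 2.5 microseconds.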
For comparison purposes, the time required to execute typical instructions
with the custom-designed PE and various commercially available machines is
presented in table III. The table shows the superior speed advantage of the PE.
The Memory and Addressing Modes
As mentioned earlier, the PE has a 32K-by-16-bit memory. Included in the
PE architecture is an external memory port interface which allows access to
the PE memory via external data busses (see fig. 3).
Memory addressing is aided by the powerful addressing modes which allow
flexibility in specifying memory locations. Memory may be specified as a base
location, as a relative displacement, as a relative location specified in an
index register, or as a combination of all three. In addition, memory loca-
tions may be specified as an absolute address offset by a relative location
specified in an index register.
In the base address mode of referencing memory, the memory address is
obtained by adding together the contents of two registers - one known as a
base register, the other known as an index register - plus a 12-bit
displacement specified as a constant in the operand field of the instruction.
As shown in
figure 3, the contents of the base and index registers for memory calculations
are transported on the RA and RB register busses, respectively. The displace-
ment comes directly from the instruction register. The reason for two regis-
ters is to allow the addressing of a block of data. The base register holds a
reference location which acts as an origin for determining the addresses of
the other memory locations. The contents of the index register acts as an
index, just as its name implies, allowing the programmer to address a block of
memory. He may also address a pattern of memory locations by changing the
contents of the index register in some prearranged fashion.
In the absolute address mode of referencing memory, the memory address is
obtained by adding together a 16-bit constant and the contents of an index
register. For the memory address calculation these data are transported on
the RA and RB register busses, respectively. The result of this summation is
an absolute memory reference relative to the first location used by the simu-
lation. Either of these addressing modes may be used in a simulation at any
time. The programmer is free to intermix them, using the one he feels is most
suited to his application in that particular portion of the simulation.
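
The two address calculations can be written out as a small Python sketch. The
function names are hypothetical, and two details are assumptions rather than
statements from the report: the 12-bit displacement is treated as unsigned,
and sums are wrapped to 16 bits.

```python
ADDRESS_MASK = 0xFFFF        # assumption: addresses wrap at 16 bits
DISPLACEMENT_MAX = 0x0FFF    # 12-bit displacement field (assumed unsigned)


def base_mode_address(base: int, index: int, displacement: int) -> int:
    """Base address mode: base register + index register + 12-bit displacement."""
    if not 0 <= displacement <= DISPLACEMENT_MAX:
        raise ValueError("displacement must fit in 12 bits")
    return (base + index + displacement) & ADDRESS_MASK


def absolute_mode_address(constant: int, index: int) -> int:
    """Absolute address mode: 16-bit constant + index register."""
    return ((constant & ADDRESS_MASK) + index) & ADDRESS_MASK


# Walk a four-word block starting at base 0x1000 by stepping the index register.
block = [base_mode_address(0x1000, index, 0x010) for index in range(4)]
print([f"{address:04X}" for address in block])
```

Stepping the index register by one visits consecutive locations of the block;
stepping it by some other prearranged amount would visit a pattern of
locations, as the text describes.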
Microprogram Control Logic
The microprogram control logic (MPCL) which coordinates execution of the
PE microcoded instructions, can be considered as a computer within the PE.
Each PE instruction has associated with it a sequence of microcode commands
which carry out the desired PE operation. When the microprogram requests an
operation to be performed, sufficient cycles are granted to allow that opera-
tion to be carried out. The microprogram memory is 1K-by-72-bits. Each
microprogram instruction is 72 bits long. The MPCL is displayed in figure 5.
The op-code for the next PE instruction is directed from the prefetch latch to
the microprogram control shifter. The microprogram control shifter sets the
microprogram counter to the correct microprogram address in order to begin
executing the sequence of code which will perform the operation. Inputs are
sent to the ALU via high-speed busses where the arithmetic operations then take
place. Included in the MPCL is a prioritizer which controls the operations.
It will halt other operations if it determines that: (1) the system has been
master cleared, (2) transfer of data between PE's is in progress, (3) a halt
has been requested, (4) a parity error has been detected, (5) the op-code for
the next instruction is not available from the instruction prefetch, (6) an
external interrupt has occurred, or (7) a system pause has been requested by
the operator.
The select logic for the microprogram next address control is shown in
figure 6. Depending on whether a special microprogram bit is set or reset in
the microprogram instruction word, one of two paths is selected as shown in
the figure - either tier A or tier B. Furthermore, depending on whether
condition code bits cc1 and cc2 are set or not, various options can arise in
selecting the address of the next microprogram instruction to be executed.
For example, suppose that tier B is being followed. If cc1 is reset, then the
condition for a conditional jump is not satisfied. The jump is not taken and
the next microprogram instruction to be executed is the next microprogram
instruction in the program sequence. However, if cc1 is set and cc2 is also
set, then the condition for a jump is also satisfied. A jump is taken to the
address indicated (pipeline address), and the microprogram instruction
contained therein is the next microprogram instruction to be executed. If cc2
is reset, the microprogram incrementer (shown in fig. 5) advances the
microprogram counter by one and the next microprogram instruction in the
sequence is executed as the next instruction.
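
The tier B selection just described reduces to a simple rule, sketched here in
Python. The function name is hypothetical, and wrapping the counter at 1K
locations is an assumption based on the stated size of the microprogram
memory; the tier A path is not modeled.

```python
MICROPROGRAM_WORDS = 1024  # microprogram memory is 1K words deep


def next_address_tier_b(counter: int, cc1: bool, cc2: bool,
                        pipeline_address: int) -> int:
    """Select the next microprogram address on the tier B path.

    The jump to the pipeline address is taken only when both cc1 and cc2
    are set; in every other case the incrementer advances the counter.
    """
    if cc1 and cc2:
        return pipeline_address % MICROPROGRAM_WORDS
    return (counter + 1) % MICROPROGRAM_WORDS


print(next_address_tier_b(0x040, cc1=True, cc2=True, pipeline_address=0x200))
print(next_address_tier_b(0x040, cc1=True, cc2=False, pipeline_address=0x200))
```

The first call takes the jump and returns 512 (200 hexadecimal); the second
falls through to the next sequential address, 65.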
If the tier A path is followed and cc1 is reset, interrupts may be
serviced. If an interrupt has occurred, a jump is made to the interrupt map to
determine the cause and the action to be taken. Examples include: HALT, to
stop execution of the program; MASTER CLEAR, to clear registers and
reinitialize the PE; a parity error having occurred; transfer of data among
PE's about to take place.
The Arithmetic Logic Unit (ALU)
The ALU logical arrangement is shown in figure 7. The ALU is responsible
for performing the actual arithmetic and/or logical operations requested in
the instruction operation code. Inputs to the ALU come via two high-speed
data busses (A-bus and B-bus in fig. 3). Output from the ALU goes to a
multiplexer/shifter where the final output is shaped as to whether it is to
have its bits reversed, passed as is, or shifted left or right. These values
are then stored in registers and/or memory via high-speed data busses for use
in future calculation, as appropriate. An overflow detector checks the output
to determine whether an overflow has occurred. If overflow limiting is in
effect, a result which overflows is limited to its corresponding full scale
value with appropriate sign. The output also goes to the status logic
generator so that the status register may be updated to reflect the
calculation just completed in the ALU. Information necessary for the ALU to
carry out its duties is supplied to the ALU control via high-speed data busses
from the program instruction register, the microprogram control, the status
register, and the PE memory.
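
Overflow limiting can be illustrated with a short Python sketch of a 16-bit
signed addition. The function name and the limit_overflow parameter are
hypothetical; the sketch assumes that the full scale values are the extreme
16-bit two's complement values and that, with limiting disabled, the result
simply wraps.

```python
FULL_SCALE_POS = 0x7FFF     # +32767 (assumed positive full scale)
FULL_SCALE_NEG = -0x8000    # -32768 (assumed negative full scale)


def add_signed16(a: int, b: int, limit_overflow: bool):
    """Add two signed 16-bit values; return the result and the overflow flag.

    With limiting enabled, an overflowing result is replaced by the full
    scale value of the appropriate sign. Otherwise it wraps around.
    """
    exact = a + b
    overflow = not (FULL_SCALE_NEG <= exact <= FULL_SCALE_POS)
    if overflow and limit_overflow:
        result = FULL_SCALE_POS if exact > 0 else FULL_SCALE_NEG
    else:
        result = ((exact + 0x8000) & 0xFFFF) - 0x8000   # two's complement wrap
    return result, overflow


print(add_signed16(30000, 10000, limit_overflow=True))
print(add_signed16(30000, 10000, limit_overflow=False))
```

With limiting enabled the sum is held at positive full scale; without it the
same addition wraps to a negative value, which is the behavior limiting is
meant to prevent in a simulation.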
High-Speed Data Busses
Information is moved around within the custom-designed PE structure via
high-speed data busses. The layout of the bus system is shown in figure 3.
The A and B-busses are used to transmit input information to the ALU. The
B-bus also supplies the TRW high-speed multiplier with input information.
These busses receive register information from the register busses (RA and RB
busses) through a system of latches also shown in figure 3. The RA and RB
busses also transmit register information for memory storage and for memory
address calculation. Address information is transmitted via the memory ad-
dress bus to the high-speed multiplier, to the instruction address register
through the prefetch latch, and to memory.
A special pair of busses are used to transfer data between PE's in the
digital parallel processor. The data being transferred is transmitted over
the transfer data bus, and the address from which the data came is transmitted
over the transfer address bus. The transfer control is used to signal the
system when a computation cycle is completed, so that a data transfer cycle
may begin.
Data Transfer Between Processing Elements
When a data transfer cycle begins, the transfer control logic increments
the address register every 100 nanoseconds. To meet this high rate of data
transfer between PE's, pipelining, special 45 nanosecond memory, and Schottky
TTL circuitry are used.
The memory configuration during a transfer cycle is shown in figure 8.
The address of the data to be transferred is transmitted to the source memory
from the transfer controller via the transfer address bus. Once this command
is received, the data is latched at the source memory output. Then the PE
places its latched data onto the 16-bit high-speed data transfer bus. The
destination PE's latch in the data and store it. The storage location of the
data received is determined by the local destination address register. Before
a transfer cycle occurs, each PE initializes this local register and auto-
matically increments the memory address after receiving each 16-bit data word.
Because pipelining is used during the data transfer cycle, a data transfer
to as many as nine different destination processors can take place every 100
nanosecond clock cycle. After the first cycle, which is needed to initialize
the pipeline, the source PE will gate the source data onto the data transfer
bus. Those PE's designated as destinations then latch the incoming data.
After incrementing their local addresses, they clock in a new command word and
thus prepare for the next cycle. When the transfer cycle is completed, the
control logic reinitializes the address register.
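
The transfer sequence can be mimicked by a small behavioral model in Python.
It is a sketch only: the class and function names are hypothetical, the
transfer controller and command words are not modeled, and the elapsed time
assumes one pipeline initialization cycle followed by one 16-bit word per 100
nanosecond cycle.

```python
TRANSFER_PERIOD_NS = 100
MAX_DESTINATIONS = 9


class ProcessingElement:
    """Hypothetical model of a PE's memory and local destination address register."""

    def __init__(self, name: str, memory_words: int = 32 * 1024):
        self.name = name
        self.memory = [0] * memory_words
        self.destination_address = 0

    def prepare_to_receive(self, start_address: int) -> None:
        """Initialize the local destination address register before a transfer."""
        self.destination_address = start_address

    def latch(self, word: int) -> None:
        """Store an incoming 16-bit word, then auto-increment the local address."""
        self.memory[self.destination_address] = word & 0xFFFF
        self.destination_address += 1


def transfer_block(source, start_address, count, destinations) -> int:
    """Broadcast count words from the source PE; return the elapsed time in ns."""
    if len(destinations) > MAX_DESTINATIONS:
        raise ValueError("at most nine destination PE's per transfer")
    for offset in range(count):
        word = source.memory[start_address + offset]   # latched at source output
        for pe in destinations:
            pe.latch(word)                             # each destination stores it
    return (count + 1) * TRANSFER_PERIOD_NS            # +1 cycle to fill the pipeline


src, dst_a, dst_b = (ProcessingElement(n) for n in ("PE0", "PE1", "PE2"))
src.memory[0x100:0x104] = [11, 22, 33, 44]
dst_a.prepare_to_receive(0x200)
dst_b.prepare_to_receive(0x300)
elapsed_ns = transfer_block(src, 0x100, 4, [dst_a, dst_b])
print(elapsed_ns, dst_a.memory[0x200:0x204], dst_b.memory[0x300:0x304])
```

Each destination places the same four words at its own local address, which is
the point of keeping a separate destination address register in every PE.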
Diagnostic Test Software
Several diagnostic programs have been written to check that the custom-
designed PE hardware is operating properly. A complete memory test was car-
ried out at the bit level using a generalized memory test algorithm (ref. 4).
Using a similar algorithm, a complete register test at the bit level was also
carried out. These diagnostics checked the hardware and the microcode soft-
ware to make sure that for each register or memory location every bit could
be set and cleared; and that while a bit pattern was being stored in one word,
bits were not being erroneously cleared and/or set elsewhere. The system,
undergoing testing, is shown in figure 9.
Once it was determined that the PE registers and memory were working prop-
erly, the checkout of the microcode software for each instruction was initia-
ted. The philosophy followed for checking the instructions was to execute
each instruction with every possible combination of the sign bit and the two
most significant data bits of the instruction input parameters. The test was
run once with the remaining data bits set, and then again with the remaining
data bits cleared. It was felt that if a wiring error existed in the hardware
or if an error existed in the microcoded instruction, it would manifest itself
using this set of input parameters for the diagnostic test. The output from
each instruction for each set of input data was then checked against a table
of hand calculated values of predicted PE outputs. Anytime a discrepancy be-
tween the PE computed value and the table value occurred, the error was
flagged. If no discrepancy occurred, using the input values described above,
the instruction was considered to have passed the error test. That is, it was
assumed that the instruction was microcoded correctly and that it was executing
correctly.
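The input set described above can be enumerated mechanically. The following Python sketch generates the operand patterns for a two-input instruction and compares an instruction's output with a reference. It assumes 16-bit operands and assumes that the remaining 13 bits are set or cleared in all operands together. The names `operand_patterns`, `generate_cases`, and `check_instruction` are hypothetical. The original checkout compared results with a table of hand-calculated values, which the `expected` function stands in for here.

```python
from itertools import product

WORD_BITS = 16
LOW_MASK = (1 << (WORD_BITS - 3)) - 1  # the 13 bits below the top three


def operand_patterns(fill_low_bits):
    """All 8 combinations of the sign bit and the two most significant data bits."""
    low = LOW_MASK if fill_low_bits else 0
    return [(top << (WORD_BITS - 3)) | low for top in range(8)]


def generate_cases(num_operands=2):
    """Yield operand tuples: first with the remaining bits set, then cleared."""
    for fill in (True, False):
        for operands in product(operand_patterns(fill), repeat=num_operands):
            yield operands


def check_instruction(execute, expected):
    """Return the operand tuples for which the instruction output is flagged."""
    return [ops for ops in generate_cases() if execute(*ops) != expected(*ops)]


if __name__ == "__main__":
    def add16(a, b):
        # Reference model of a 16-bit add with wraparound.
        return (a + b) & 0xFFFF

    def faulty_add(a, b):
        # Simulated fault: the most significant result bit is stuck at zero.
        return (a + b) & 0x7FFF

    print(len(list(generate_cases())))
    print(len(check_instruction(faulty_add, add16)))
```

Under these assumptions a two-input instruction is exercised with 8 x 8 x 2 = 128 operand pairs, a set small enough to verify against hand-calculated values.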
Fifteen diagnostic programs were developed to cover the checking of the
various classes of instructions in the custom PE's library. Each instruction
was checked by one of the diagnostic programs. In addition to the memory and
register diagnostic programs already mentioned, separate diagnostic programs
covering the checking of jump instructions, shift instructions, status word
modification instructions, and Boolean logic instructions were developed. The
remaining diagnostic programs were used to check the arithmetic instructions.
STATUS OF THE CUSTOM-DESIGNED PROCESSING ELEMENT
At the current time, prototype hardware for the custom-designed processing
element is built, and most of the instructions operating on either integers or
scaled fractions are operational. The Boolean logic, shift, and jump instruc-
tions also execute satisfactorily. The arithmetic operations of addition and
subtraction are operational. Problems have been experienced with the divide
instructions and multiply instructions. The divide instruction executes cor-
rectly as long as both inputs are from registers. If one of the inputs is from
memory, however, the result of the division is incorrect. The source of the
error has not yet been found. But because the divide instruction does execute
correctly in the all-register-input case, the basic divide algorithm does not
seem to be the source of these errors. More than likely the problem is being
caused by a hardware wiring error or incorrect microcoding.
Still to be checked out are floating-point instructions. Having floating-point
instructions available would be very convenient because the programmer
wouldn't have to be concerned about scaling the parameters in his engine simu-
lation. However, floating-point capability is not considered critical in the
PE. Although it takes more time for the programmer to set up a fixed-point
simulation because of the required scaling, simulations using integer or scaled
fraction arithmetic generally will execute faster. And as mentioned earlier,
achieving high execution speed was a major consideration in designing a PE for
real-time simulation of jet engines and their controls.
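The scaling burden mentioned above can be made concrete with a small Python sketch of scaled-fraction arithmetic. The format used here (a 16-bit two's complement fraction with 15 fractional bits) is an assumption for illustration, as are the function names and the example full-scale value. The report does not specify the PE's fraction format in this section.

```python
FRACTION_BITS = 15            # assumed: 1 sign bit + 15 fractional bits
SCALE = 1 << FRACTION_BITS    # 32768 raw counts represent 1.0


def to_fraction(value, full_scale):
    """Scale a physical value into a raw 16-bit fraction, saturating at the limits."""
    raw = int(round(value / full_scale * SCALE))
    return max(-SCALE, min(SCALE - 1, raw))


def from_fraction(raw, full_scale):
    """Convert a raw fraction back to physical units."""
    return raw / SCALE * full_scale


def multiply_fractions(a, b):
    """Multiply two raw fractions; the shift removes the doubled scale factor."""
    return (a * b) >> FRACTION_BITS


if __name__ == "__main__":
    # Hypothetical example: a shaft speed of 7500 rpm with a 15000 rpm full scale.
    speed = to_fraction(7500.0, 15000.0)
    squared = multiply_fractions(speed, speed)
    print(speed, squared, from_fraction(squared, 1.0))
```

The programmer must choose a full-scale value for every variable and track how the scale factors combine through each operation. That is the setup effort that floating-point instructions would remove, at the cost of execution speed.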
CONCLUDING REMARKS
This report describes a prototype of a custom-designed computer to be used
as a processing element in a multi-processor-based jet engine simulator. The
purpose of the custom-design was to give the computer the speed and versatility
required so that it could be used as a research tool for the development and
testing of airbreathing engine digital electronic controls. Speed was of
utmost importance in designing this custom computer. Using state-of-the-art
components and innovative circuitry, a computer computation cycle time of 133
nanoseconds was achieved.
Problem areas encountered in designing the processing element, along with
logic circuitry used to eliminate those problems, are pointed out. These may
give the reader some insight into the kinds of difficulties that might be ex-
pected when undertaking the development of a custom-designed computer to meet
special application needs. The authors hope that this report proves helpful
and facilitates that kind of development. In the future, undertakings of this
kind will be prompted more and more by the rapid advances being made in the
area of microelectronic technology. Powerful integrated circuitry is becoming
available on more dense and less costly chips. These powerful, inexpensive
chips will act like building blocks which the designer will be able to link
together in many different configurations, giving him the capability of custom
tailoring a computer to his exact needs.
REFERENCES
1. Mihaloew, J. R.: A Nonlinear Propulsion System Simulation Technique for
Piloted Simulators. NASA TM-82600, 1981.
2. Szuch, J. R.: Advancements in Real-Time Engine Simulation Technology.
NASA TM-82825, 1982.
3. Blech, R. A.; and Arpasi, Dale J.: An Approach to Real-Time Simulation
Using Parallel Processing. NASA TM-81731, 1981.
4. Milner, E. J.: A Generalized Memory Test Algorithm. NASA TM-82874, 1982.
TABLE I. - STATUS REGISTER SUMMARY
• 40 Instructions for direct manipulation of the status register
• Conditional jumps and links
  [illegible] 16 Combinations within each field
• Overflow flag and overflow latch with enable
• Auto overflow limiting with enable
• 5 Program flags - 4 with external set
• 4 Condition code bits plus a carry bit
• 3 Internally vectored interrupts
• Running conditionals for multiple-precision operations
TABLE II. - HIGHLIGHTS OF MACHINE INSTRUCTION SET
• 22 Basic arithmetic instructions
• 11 Control instructions
• 22 Data conversion instructions
• 14 Integration instructions
• 5 Addressing modes
• Block move instructions
• Complex functional jumps
• Fast 1-2 cycle integer instructions
• 3 1/3 Cycle multiply
• 19 Cycle divide
• 1 Cycle 16-bit integer select maximum/minimum
TABLE III. - COMPARISONS OF MICROPROCESSORS
FOR REAL-TIME SIMULATOR COMPUTATIONS

                                    Estimated time/function (μsec)
                               -----------------------------------------
                               Lewis       Motorola   Zilog     Intel      68000/
                               Custom PE   68000      Z8002     8086       Custom PE
Basic function                 (7.5 MHz)   (8 MHz)    (6 MHz)   (10 MHz)   ratio
------------------------------------------------------------------------------------
Add/sub      16-bit variables  0.25        1.43       1.55      1.61       5.7
             16-bit constants  0.133       1.0        1.17      1.3        7.5
             32-bit variables  0.35        1.93       2.52      3.22       5.5
Compares     16-bit integer    0.33        2.68       2.60      2.27       8.1
Mult./Div.   16-bit integer    0.88        13.2       15.3      14.0       15.0
Integration  32-bit            0.80        4.75       8.6       6.4        5.9
             48-bit            0.93        11.0       14.5      8.9        11.8
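
The ratio column of table III follows directly from the two timing columns it compares, and the 0.133 μsec entry corresponds to a single microcycle at the 7.5 MHz clock rate. The short Python sketch below recomputes both from the table values. The variable names are chosen here for illustration only.

```python
# Times in microseconds, copied from table III in row order.
custom_pe = [0.25, 0.133, 0.35, 0.33, 0.88, 0.80, 0.93]
m68000 = [1.43, 1.0, 1.93, 2.68, 13.2, 4.75, 11.0]

# 68000/Custom PE ratio, rounded to one decimal place as in the table.
ratios = [round(m / c, 1) for m, c in zip(m68000, custom_pe)]

# One microcycle at 7.5 MHz, in nanoseconds (about 133 ns).
microcycle_ns = 1e9 / 7.5e6

if __name__ == "__main__":
    print(ratios)
    print(round(microcycle_ns, 1))
```

The recomputed ratios agree with the ratio column of the table to the one decimal place shown.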
Figure 1. - Basic simulator structure. [Block diagram. Labeled blocks: host computer, front-end processor (FEP), input/output processor (IOP), real-time extension (RTX), processing elements, transfer controller, high speed data transfer bus.]
Figure 2. - LeRC custom-designed processing element. [Photograph C-81-3883.]
Figure 3. - Processing element architecture. [Block diagram.]
Figure 4. - Status register format. [Bit layout diagram, bits 15 to 0.]
Figure 5. - Microprogram control logic. [Block diagram. Labeled blocks include the microprogram memory (1K x 72, expandable to 4K x 80), pipeline register, microprogram counter, incrementer, next address control, prioritizer/vectorizer, and op-code prefetch latch.]
Figure 6. - Microprogram next address select logic. [Flowchart.]
Figure 7. - ALU logical arrangement. [Block diagram.]
Figure 8. - Memory configuration during transfer cycle. [Block diagram. Labeled blocks include the source memory (255 words), destination memory (up to 32K), memory address register, incrementer, transfer address bus, and transfer data bus.]
Figure 9. - Custom-designed processing element undergoing testing. [Photograph C-81-3885.]
1. Report No.: NASA TM-83373
2. Government Accession No.:
3. Recipient's Catalog No.:
4. Title and Subtitle: DESIGN OF A HIGH-SPEED DIGITAL PROCESSING ELEMENT
   FOR PARALLEL SIMULATION
5. Report Date: June 1983
6. Performing Organization Code: 505-40-58
7. Author(s): Edward J. Milner and David S. Cwynar
8. Performing Organization Report No.: E-1641
9. Performing Organization Name and Address:
   National Aeronautics and Space Administration
   Lewis Research Center
   Cleveland, Ohio 44135
10. Work Unit No.:
11. Contract or Grant No.:
12. Sponsoring Agency Name and Address:
    National Aeronautics and Space Administration
    Washington, D. C. 20546
13. Type of Report and Period Covered: Technical Memorandum
14. Sponsoring Agency Code:
15. Supplementary Notes:
16. Abstract:
Described in this report is a prototype of a custom-designed computer to be used
as a processing element in a multi-processor-based jet engine simulator. The
purpose of the custom design was to give the computer the speed and versatility
required to simulate a jet engine in real-time. Real-time simulations are needed
for closed-loop testing of digital electronic engine controls. The prototype
computer has a microcycle time of 133 nanoseconds. This speed was achieved by:
(1) prefetching the next instruction while the current one is executing,
(2) transporting data using high-speed data busses, and (3) using state-of-the-art
components such as a VLSI multiplier. Included in the report are discussions of
processing element requirements, design philosophy, the architecture of the
custom-designed processing element, the comprehensive instruction set, the
diagnostic support software, and the development status of the custom design.
17. Key Words (Suggested by Author(s)):
    Microcomputer design
    Parallel processing element
    High-speed computer design
    Digital simulator processor
18. Distribution Statement:
    Unclassified - unlimited
    STAR Category 33
19. Security Classif. (of this report): Unclassified
20. Security Classif. (of this page): Unclassified
21. No. of pages:
22. Price*:
*For sale by the National Technical Information Service, Springfield, Virginia 22161
