The 168/E is a SLAC-developed microprocessor which emulates the IBM 360/370 computers with an execution speed of about one half of an IBM 370/168. These processors are used in parallel for the track finding and geometry programs of the LASS spectrometer. The system is controlled by a PDP-11 minicomputer via a three-port interface which we call the Bermuda Triangle. The tape handling and downloading is controlled by one of SLAC's IBM computers via a SLAC-built interface between the PDP-11 and an IBM channel.
INTRODUCTION
An average of 0.5 sec of 370/168 CPU time is required for each good event to read the raw data, do the basic event reconstruction, and output the results for each successful event.
The software program for these spectrometers generally takes many man-years to develop on a large computer system, and is often changed as it is better understood.
It is therefore not easily removed from the large computer on which it was developed. Unlike its IBM cousin, the linker has an optional input with which the user can assign the address of the COMMON blocks. This feature is used to make the data memory overlays which are described later.
CAN A MICROPROCESSOR DO THE BIG JOB?
To have built, at very low cost, a microprocessor that can be programmed in FORTRAN and that runs at no less than half the speed of a 370/168 is a fine achievement.
But due to the design choices that have been made, it is still fair to ask the question: can it do the real number crunching job that we have with the LASS production code?
First of all, to be useful it must do a significant fraction of the time-consuming part of the job.
With the LASS production code, well over half of the CPU time is spent in the subroutine which finds tracks in the solenoid detectors. Thus the 168/E must be able to execute this subroutine and all the subroutines it calls to be a useful processor.
This part of the program is slightly over 32 K bytes of executable code, and it translates to a little over 16 K microinstructions, which is 5 168/E memory boards filled on the program side.
Thus, the 168/E processor could be used to take the most time-consuming part of the production code away from the central computer. However, the event-by-event input to this part of the code is very large, much larger than the original raw input tape data. This is because the first part of the code unpacks the raw integer data such as wire numbers, widths, etc., into banks of floating point coordinates appropriately scaled, aligned, and corrected.
Once overlays were necessary, it was easy to extend this technique to the code which is executed after the time-consuming part, including the formatting of the result tape record. When an overlay is executed on the 168/E, all of the processor's program memory will be overwritten.
The translation also creates a data set which contains all the constant and variable data which was internal to the subroutines.
We call this the 'Local Memory' and it may be defined as all the data space a program uses which is not in a COMMON block. The local memory also needs to be loaded into the 168/E data memory when the program memory is loaded with an overlay.
For the LASS production code, the local memory is typically 10% of the data memory required by an overlay.
With the overlays described above, the 168/E can handle programs much larger than can fit into its memory at one time.
Still larger programs can be handled by further overlaying the remaining data memory which contains the program's COMMON blocks.
In order to do this, additional knowledge of the program is needed. One would like to know exactly in which overlays a COMMON is needed, in which overlays data is stored into the COMMON, and in which overlays data is fetched from the COMMON. If, for example, a COMMON block is used only in overlays 3, 4, and 5, then this physical data space can be used for other COMMONs which are only used in overlays 6, 7, and 8.
A method has been developed to study the whole program in this level of detail [3]. With the master index as a data base, software tools have been developed to generate data memory load maps for all the overlays. An example is given in figure 1. The left-hand vertical scale is data memory location expressed in bytes, and the nine columns are the nine overlays.
Note that one first loads the local memory (LM01 through LM09) into the low addresses of the processor, then the constant COMMONs. Banks of coordinates generated in overlay 1 are stored in COMMONs DYNA and WIDTHS; they are used by all the following overlays. Other COMMONs such as PTBANK are generated at a later overlay, then saved until the end of processing the event.
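The sharing of data space between COMMONs whose overlay ranges do not overlap can be sketched as a simple first-fit placement. This is only an illustration under stated assumptions, not the software tool referred to above: the block names, sizes, and overlay ranges are hypothetical, and the real load maps also account for local memory and constant COMMONs.

```python
def assign_addresses(blocks, base=0):
    """Greedy first-fit placement of COMMON blocks.

    blocks maps a name to (size_bytes, first_overlay, last_overlay).
    Two blocks may share addresses only if their overlay ranges are disjoint.
    """
    placed = {}  # name -> (address, size, first, last)
    # Place blocks in order of the first overlay that needs them.
    ordered = sorted(blocks.items(), key=lambda kv: (kv[1][1], kv[0]))
    for name, (size, first, last) in ordered:
        # Address ranges already taken by blocks that are live at the same time.
        busy = sorted(
            (addr, addr + sz)
            for addr, sz, f, l in placed.values()
            if not (l < first or last < f)
        )
        addr = base
        for lo, hi in busy:
            if addr + size <= lo:
                break  # the gap below this range is big enough
            addr = max(addr, hi)
        placed[name] = (addr, size, first, last)
    return {name: p[0] for name, p in placed.items()}


# Hypothetical block names and sizes, chosen only for illustration.
blocks = {
    "ALWAYS": (1024, 1, 9),  # needed in every overlay
    "EARLY": (4096, 3, 5),   # used only in overlays 3 to 5
    "LATE": (4096, 6, 8),    # used only in overlays 6 to 8
}
addresses = assign_addresses(blocks)
footprint = max(addresses[n] + blocks[n][0] for n in blocks)
```

In this toy case EARLY and LATE land at the same address, so the footprint is 5120 bytes rather than the 9216 bytes needed if all three blocks were resident at once.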
The net effect of the data memory overlaying is a substantial saving in memory required by the 168/E processor. Since memory is the most expensive part of the processor, enough money is saved to add more processors to the system. If all the COMMONs were loaded into the memory at one time, it would require over 250 K bytes of data memory; but with the overlaying only 90 K bytes is required. On the program side, if all the code were loaded into the program memory at one time it would require over 120 K microinstruction words, while with the overlays less than 20 K microinstructions are needed. One pays a cost, however: the processor is idle during the transfer of the data and program into its memory.
For the LASS production code, we have measured that the total time spent overlaying is 90 msec per event.
This is less than 10% of the average event execution time on the processor, which is over 1 second per event. Thus, we feel the overlaying technique is a good compromise for our production code, and in the following sections we will describe the scheme for implementing the overlays.
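The trade-off can be checked with a few lines of arithmetic. The sketch below uses only the round figures quoted in the text; since those are stated as bounds ("over 250 K bytes", "less than 20 K", "over 1 second"), the computed values are bounds as well, not measurements.

```python
# Figures quoted in the text, treated here as round numbers.
data_all_kbytes, data_overlay_kbytes = 250, 90
prog_all_kwords, prog_overlay_kwords = 120, 20
overlay_ms_per_event, execute_ms_per_event = 90, 1000

# Fraction of memory saved by overlaying (at least this much).
data_saving = 1 - data_overlay_kbytes / data_all_kbytes
prog_saving = 1 - prog_overlay_kwords / prog_all_kwords

# Fraction of the event time spent idle during overlay transfers (at most this much).
overlay_overhead = overlay_ms_per_event / execute_ms_per_event

print(f"data memory saved:    {data_saving:.0%}")
print(f"program memory saved: {prog_saving:.0%}")
print(f"overlay overhead:     {overlay_overhead:.0%}")
```

With these inputs the data memory saving is about 64%, the program memory saving about 83%, and the overlay overhead 9% of the event time.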
BERMUDA TRIANGLE SYSTEM
The Bermuda Triangle system, shown in figure 2, is our method of overlaying the 168/E memory.* The Bermuda Triangle is a three-way interface with I/O ports to a large buffer memory, a PDP-11 UNIBUS, and a bus to the 168/E processors. Data may be transferred bidirectionally between any two ports.
Two Bermuda Triangles are used, one for the program memory and one for the data memory.
The first port of the Bermuda Triangle is to the buffer memories. The program buffer memory, with 128 K words by 24 bits, is large enough to hold a single copy of all the program to be executed.
The data buffer memory, with 64 K words by 32 bits (256 K bytes), is large enough to hold all the local memory and copies of the constant COMMON blocks.
The data buffer memory also buffers events on input and results on output. The memory used is slower but much less expensive than the 168/E memory.
The memories are implemented with general purpose memory cards purchased from Mostek Memory Systems.
Their MK8000 memory card offers up to 128 K words of 24 bits.
The program memory is thus a single card, while the data memory is two cards depopulated to 64 K words of 16 bits. The cycle time is 500 nsec with an access time of 375 nsec. We have used the backplane and chassis that Mostek provides for PDP-11/70 add-on memory.
The signal traces on the backplane were cut across the middle so that both the program and data memories could plug into the same backplane and chassis.
The second port of the Triangle is the bus to the processors. It is a 50-line flat cable with TTL Tri-state drivers and receivers. The transfer uses a protocol which is essentially identical to the one being developed by the FASTBUS committee [4].
A 24 bit address field and 32 bit data field are used.
They are time multiplexed on a set of 32 bus lines. The 4 most significant bits of the address field are decoded to select one processor, with the remaining bits selecting the internal addresses of the processor's memory. Thus the bus allows direct access to any location within any processor.
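The address split described above can be sketched in a few lines. The field widths (24-bit address, 4 most significant bits selecting the processor) come from the text; the function names are hypothetical, and whether the internal address counts words or bytes is not stated here, so the sketch treats it as an opaque 20-bit value.

```python
ADDRESS_BITS = 24
PROCESSOR_BITS = 4                              # most significant bits
INTERNAL_BITS = ADDRESS_BITS - PROCESSOR_BITS   # remaining 20 bits
INTERNAL_MASK = (1 << INTERNAL_BITS) - 1


def decode(bus_address):
    """Split a 24-bit bus address into (processor number, internal address)."""
    if not 0 <= bus_address < (1 << ADDRESS_BITS):
        raise ValueError("address does not fit in 24 bits")
    return bus_address >> INTERNAL_BITS, bus_address & INTERNAL_MASK


def encode(processor, internal):
    """Build a 24-bit bus address from a processor number and internal address."""
    if not 0 <= processor < (1 << PROCESSOR_BITS):
        raise ValueError("processor number does not fit in 4 bits")
    if not 0 <= internal <= INTERNAL_MASK:
        raise ValueError("internal address does not fit in 20 bits")
    return (processor << INTERNAL_BITS) | internal
```

For example, `decode(0x312345)` gives processor 3 and internal address `0x12345`. A 4-bit select field can distinguish at most 16 processors, each with a 20-bit internal address range.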
The rate of transfer on this bus is one word in 700 nsec.
Thus the transfer rate on the data side is nearly 6 M bytes per second, and on the program side it is equivalent to nearly 3 M bytes per second of IBM object code.
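The two rates follow from the 700 nsec word time. The sketch below assumes one 32-bit data word (4 bytes) per transfer on the data side, and on the program side uses the approximate ratio given earlier for the track finding code, where slightly over 32 K bytes of IBM object code became a little over 16 K microinstructions, i.e. roughly 2 bytes of object code per microinstruction. That ratio is an approximation for this one program, not a general property of the translator.

```python
WORD_TIME_S = 700e-9          # one bus transfer every 700 nsec

# Data side: each transfer carries one 32-bit word.
DATA_BYTES_PER_WORD = 4
data_rate = DATA_BYTES_PER_WORD / WORD_TIME_S

# Program side: each transfer carries one microinstruction, which stands for
# roughly 2 bytes of IBM object code (32 K bytes -> about 16 K microinstructions).
OBJECT_BYTES_PER_MICROINSTRUCTION = 32 / 16
program_rate = OBJECT_BYTES_PER_MICROINSTRUCTION / WORD_TIME_S

print(f"data side:    {data_rate / 1e6:.2f} M bytes per second")
print(f"program side: {program_rate / 1e6:.2f} M bytes per second of object code")
```

These work out to about 5.7 and 2.9 M bytes per second, consistent with the "nearly 6" and "nearly 3" quoted above.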
The third port of the Triangle is a PDP-11 UNIBUS.
A PDP-11/04 with 40 K bytes of memory is used as the control computer for the system. This port has 6 control registers to allow the PDP-11 to control the data flow between the three ports. Care has been taken that different software tasks in the computer have different registers that they control, thus making the software tasks easier to write. The buffer memories are loaded from the UNIBUS. This means that ordinary batch jobs can transfer data to and from the Bermuda Triangle system. The FORTRAN programmer gets access to the system by a simple FORTRAN callable subroutine.
Thus the IBM 360/370 reads the raw data from tape, sends it to the PDP-11 to be processed by the 168/E Bermuda Triangle system, receives the results, and writes the output tape.
The IBM system with its 24 hour staff handles all the job scheduling, tape mounting, etc.
Production jobs will be submitted to the system as is done now, and each job will first initialize the PDP-11 and buffer memories.
To synchronize the PDP-11 and 370 software, the 370 always attempts a read from the PDP-11 before a write. When the IBM computer reads results from the PDP-11, it frees a buffer in the PDP-11 system, so a write can then always be done. For normal event transfers, the control unit transfers directly to or from the data buffer memory through the 8 K byte UNIBUS window of the Bermuda Triangle, with the PDP-11 setting up the appropriate address and page registers. If the IBM computer attempts a read when no data is ready in the 168/E system, the control unit sends back a 'BUSY' response. When this signal is received, the IBM channel simply queues the read command without causing an interrupt to the CPU.
When the data becomes ready for transfer, the PDP-11 loads the word count register in the control unit, and it sends a request for service to the IBM channel.
This request signal wakes up the channel and the transfer is started. This is standard operating procedure for devices on an IBM 360/370 channel.
The whole data transfer procedure is handled by the IBM channel.
The IBM CPU is free to work on other jobs from the time it issues the Start I/O instruction until it receives an interrupt that the transfer is complete.
PDP-11 SOFTWARE
The PDP-11/04 computer has the job of controlling the 168/E overlays, the transfer of event data to and from the 168/E, and the transfer of data to and from the control unit.
The job is divided into a number of software tasks, corresponding to the non-shareable hardware resources. There is a task for each processor, a task for the channel interface, and a task for each of the processor busses.
As was mentioned earlier, the Bermuda Triangle was designed so that the hardware resources could easily be assigned to specific software tasks. We have chosen a small multi-tasking executive called SPEX [5] which allows all the tasks to be resident in memory, and hence no disk is required on the PDP-11.
It has been used as the data acquisition system in several experiments at Fermilab, Brookhaven, and CERN.
Each of these tasks is "driven" by a queue of work to do.
The channel interface task receives raw event data and queues it to the processor work queue. When a processor becomes available, its task will take an event from the work queue, supervise its transfer to the processor, its execution through the various overlays, and the transfer of the results back into the buffer memory.
The processor task will then queue the result buffer to the channel interface work queue and start working on another event from the processor work queue. Meanwhile, the channel interface task initiates the transfer of results from the buffer memory to the IBM channel. 
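The queue-driven flow described above can be sketched as follows. This is a sequential toy model, not SPEX: the real tasks run under a multi-tasking executive and respond to hardware events, and all names here (`processor_task`, `run`, the event numbers) are hypothetical. The count of nine overlays is taken from figure 1.

```python
from collections import deque

N_OVERLAYS = 9  # figure 1 shows nine overlays


def processor_task(proc_id, processor_queue, channel_queue):
    """Take one event from the work queue, step it through every overlay,
    and queue the result for the channel interface task."""
    if not processor_queue:
        return False  # nothing to do for this processor
    event_id = processor_queue.popleft()
    overlays_run = list(range(1, N_OVERLAYS + 1))  # stand-in for real execution
    channel_queue.append(
        {"event": event_id, "processor": proc_id, "overlays": overlays_run}
    )
    return True


def run(raw_events, n_processors=2):
    processor_queue = deque(raw_events)  # filled by the channel interface task
    channel_queue = deque()              # results waiting for the IBM channel
    sent = []
    while processor_queue or channel_queue:
        for proc_id in range(n_processors):
            processor_task(proc_id, processor_queue, channel_queue)
        while channel_queue:             # channel interface task returns results
            sent.append(channel_queue.popleft())
    return sent


results = run([101, 102, 103])
```

Each task only ever touches its own queue ends, which mirrors the point made earlier that giving every task its own hardware resource keeps the tasks simple to write.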
