I. Introduction
The real-time operation of algorithms that arise in speech and vision research is often limited by the speed of the low-level signal handling algorithms, which in turn are typically limited by the computation time of a critical inner loop. To speed up the execution of these algorithms, dedicated special-purpose hardware or very fast auxiliary processors have been used to implement the critical code.
In signal processing problems, and especially with the FFT, good results have been obtained with a special machine architecture relying heavily on instruction-cycle overlap (pipelining) and on multiple parallel arithmetic units capable of performing certain complex operations, notably the FFT butterfly-multiply, very efficiently [FDP] [SPS-41] [AP-120B]. These processors are almost invariably both expensive to build and difficult to program, due to their complex structure and parallelism. Their computational power on suitable tasks has nevertheless made them useful.
At Carnegie-Mellon University we are investigating artificial intelligence algorithms which, although they deal directly with input or generated signals, do not fall entirely within the realm of conventional "signal processing". Floating point and complex arithmetic are not usually required, and while the FFT is used in some problems, most algorithms cannot consistently utilize such special features as butterfly-multiply hardware efficiently. Low precision integers may be used for most computations; often only small bit fields are manipulated, as in vision tasks where pixels may be as small as 4 bits. When signal processing is done, multiple precision is as satisfactory as floating point. Considerable logical manipulation and decision-making capability are also called for. We have designed a new processor, Harp, which suits the needs of our research and is considerably more general-purpose in its orientation than the signal processors.
II. Architecture
Harp is an auxiliary processor to a host PDP-11. It is a "pure" 16-bit machine: instructions, data paths, and ALU are all 16 bits wide; there are no "byte" instructions or packed data representations.
The Harp processor operates from two small, very high speed memories: the data and instruction working stores. There are no data registers or accumulators in
Harp has the following design goals, some of which are in keeping with the signal-processor approach, some not:
A. Extremely high speed (under 40 ns cycle).
B. Low cost ($5000-$10,000 parts).
C. Programming simplicity: transparent pipeline and parallelism.
D. Large number of high-speed registers.
E. Separate instruction and data memories for overlapped access.
F. General-purpose capability (minicomputer-like instruction repertoire).
G. 16-bit data and instructions for minicomputer compatibility.
H. Large high speed second-level memory (4K-64K).
I. Extensive diagnostic capability.
Software tools were used extensively during the design of Harp. A simulator, debugger, and assembler were built very early in the design phase of each proposed architecture. These tools were used to code 8 representative algorithms (summarized in Section VII) to give the designers programming experience and performance information. A number of complete design iterations resulted before Harp was committed to hardware.
The working store contents can be transferred to or from a large buffer memory via a block transfer mechanism. The transfer rate of 0.8 Gbps (one 16-bit word every 20 ns) avoids bottlenecking the fast processing of Harp. Buffer memory is expandable from 4K to 64K words and is double-ported: one port permits high speed transfers to the Harp working stores, and the other is compatible with a PDP-11 UNIBUS.
No single-word direct access to buffer memory is provided, for several reasons: 1) this memory is interleaved, and the delay for initiating access is five times greater than the average transfer time; 2) allowing access on a cycle-to-cycle basis would have slowed Harp's execution rate considerably; 3) two words of working store would be taken up by addresses for each reference to buffer memory.
The block transfer mechanism is used to overlay programs too large to fit in the available instruction memory. Block transfers are fast enough to make frequent overlays acceptable in many situations: a 64-word transfer takes less than 1.5 µs.
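The timing figures quoted above can be checked with a little arithmetic. The sketch below is only an illustration: it takes the 20 ns per-word time and the "five times" start-up delay from the text, and it assumes (this is not stated in the paper) that the start-up delay is paid once per block. The function names are hypothetical.

```python
# Figures from the text: one 16-bit word every 20 ns, and a start-up
# delay five times the average per-word transfer time.
WORD_BITS = 16
NS_PER_WORD = 20
STARTUP_NS = 5 * NS_PER_WORD  # assumption: charged once per block transfer


def transfer_rate_gbps():
    """Sustained transfer rate in Gbit/s (equivalently, bits per nanosecond)."""
    return WORD_BITS / NS_PER_WORD


def block_transfer_ns(words):
    """Estimated time to move `words` words, including one start-up delay."""
    return STARTUP_NS + words * NS_PER_WORD


print(transfer_rate_gbps())    # 0.8
print(block_transfer_ns(64))   # 1380
```

Under these assumptions a 64-word block takes about 1.38 µs, which is consistent with the "less than 1.5 µs" figure.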
Since Harp is intended to operate in conjunction with a host computer, no input or output capability is provided other than the buffer memory connection to the UNIBUS. The UNIBUS bandwidth is sufficient for most I/O to real-time devices and mass storage. Since the relatively slow PDP-11 coordinates these transfers to the buffer memory, no hardware interrupt capability is included in Harp.
III. Instruction Set
Two-address instructions are advantageous in high speed processing, since one such instruction can accomplish as much as two or three single-address instructions. A special set of instructions disposes of the output of the separate multiplication processor, which performs 16x16-bit multiplications and retains the 32-bit product.
Instructions are provided to store the two halves of this result in data memory and to add them to data memory (to accumulate a double-precision inner product). The multiplication processor operates in parallel with the central processor but receives instructions and operands from it. The 16x16 multiply takes 80 ns to complete, so a program must execute at least one instruction after the MUL before accessing the product.
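The double-precision inner product mentioned above can be modeled in a few lines. This is a minimal sketch, not Harp code: it assumes unsigned operands and assumes that a carry out of the low half is propagated into the high half of the accumulator, neither of which the text specifies. All names are hypothetical.

```python
MASK16 = 0xFFFF


def mul16(a, b):
    """Unsigned 16x16 multiply; returns the (high, low) halves of the 32-bit product."""
    product = (a & MASK16) * (b & MASK16)
    return (product >> 16) & MASK16, product & MASK16


def accumulate(acc_high, acc_low, high, low):
    """Add a product, given as two 16-bit halves, to a two-word accumulator."""
    low_sum = acc_low + low
    carry = low_sum >> 16  # assumption: carry ripples into the high word
    return (acc_high + high + carry) & MASK16, low_sum & MASK16


def inner_product(xs, ys):
    """Accumulate sum(x * y) in two 16-bit words, as (high, low)."""
    acc_high, acc_low = 0, 0
    for x, y in zip(xs, ys):
        high, low = mul16(x, y)
        acc_high, acc_low = accumulate(acc_high, acc_low, high, low)
    return acc_high, acc_low


high, low = inner_product([300, 400, 65535], [300, 400, 2])
print((high << 16) | low)  # 381070
```

Keeping the accumulator as two separate words mirrors the idea of adding each half of the product to data memory, so no register wider than 16 bits is needed.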
Though Harp is a pipelined machine, the pipe is invisible to all instructions except explicit state-register references. Branch lookahead hardware avoids delay while allowing natural instruction sequencing. Programmers need not be concerned with such complications as "clearing the pipe on a branch" which plague most signal processors.
simultaneously without the need for extender boards. Although the cross consumes considerable packaging volume, it is rack-mountable at 17 x 17 x 12 inches. Double-sided PC construction was chosen over wire-wrap because of cost, signal fidelity, and reproducibility. Multilayer PC, with its higher prototype turnaround time and expense, is not justified by the present circuit density.
Memory chips of sufficient size and speed for the buffer memory are presently least expensive in TTL, so this memory and the PDP-11 interface are implemented with TTL. The buffer memory chips and their control circuitry are mounted on PC boards adjoining the cross. There is room to expand the memory to 64K words (16 boards) while keeping the memory bus length to the ECL converters under 12 inches. Cables connect to the PDP-11 interface and control circuitry. This 100-chip TTL circuit is wire-wrapped in the prototype.
The prospect of having several Harps in our environment encouraged diagnostic capability to ease the hardware maintenance burden. Since most of the machine failures are expected to be hard static faults rather than high speed timing problems, considerable work has gone into a hardware diagnostic system and its software support package. All of these involve the PDP-11 as an interrogator and data collector. The machine clock may be stopped and Harp may be single-stepped by PDP-11 software. Between cycles, all registers and inter-board data and control signals may be examined by the PDP-11 without affecting the state of Harp; the working stores can be accessed through the Harp processor while Harp is halted.
Lower-level diagnostics exercise the buffer memory, working stores, processor, and control at levels successively deeper within Harp. Combined with the accessibility of inter-board signals, these should allow location of faults at the register level.
V. Software
Harp is fully supported by an extensive collection of system software, including a symbolic assembler, an unusual graphics-based debugging package, a complete simulator, and a set of PDP-11 routines for Harp control.
The Harp display-oriented debugger runs on the PDP-11 and uses a graphic display terminal to maintain a picture of the instruction memory (decoded), the data memory, and the state registers; it has facilities for tracing programs and modifying memory in Harp. The debugger can be used with either the simulator or the actual hardware.
A package of BLISS-11 routines and macros allows the PDP-11 program to access all Harp memories and to control execution of Harp. These routines are the basic facility for inner loop execution on Harp.
VI. General-Purpose Capabilities
The designers of Harp began with a view of creating a "functionally specialized architecture" for the problems encountered in speech and vision research. As we progressed it became clear that the only relevant common characteristics of the tasks are 1) a fairly large amount of processing on each datum, and 2) small items, usually 4 to 12 bits of precision. The first observation is expressed in the size of the data working store: 64 words is large if considered as "registers", but small for a "main memory"; as "working store", it is well matched to Harp's tasks. The second led to the choice of small (16-bit) integers as the data type. The resulting design looks not at all like a "specialized" processor, but more like a clean vertically microcoded minicomputer with residual control. Notably, at least one recent signal processor design is billed by its builders as basically a "high performance minicomputer" [LDVT].
We expect Harp to be so cost-effective that it will be attractive whenever more processing power is needed on a PDP-11 system or similar minicomputer.
VII. Performance
The simulated performance of Harp on 8 representative tasks is shown in Figure 5, compared to the performance of two conventional computers on the same tasks.
The effectiveness of the Harp architecture and instruction set design can be seen in Figure 4, which shows frequencies of use of various machine features.
Smoothing is an image understanding algorithm that examines every point in a picture and smoothes it in relation to its context. The 8 points surrounding the point to be smoothed are summed; if the sum exceeds a threshold, the point is smoothed out.
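The Smoothing step described above can be sketched as follows. The text does not say what "smoothed out" means or how picture borders are handled, so this illustration makes two labeled assumptions: a point that passes the threshold test is replaced by the integer mean of its 8 neighbors, and border points are left unchanged. The function name and the list-of-rows image representation are hypothetical.

```python
def smooth(image, threshold):
    """Return a smoothed copy of `image`, a list of equal-length rows of ints."""
    rows, cols = len(image), len(image[0])
    out = [row[:] for row in image]
    for r in range(1, rows - 1):          # assumption: border points untouched
        for c in range(1, cols - 1):
            # Sum the 8 points surrounding (r, c), excluding the point itself.
            total = sum(image[r + dr][c + dc]
                        for dr in (-1, 0, 1)
                        for dc in (-1, 0, 1)
                        if (dr, dc) != (0, 0))
            if total > threshold:
                out[r][c] = total // 8    # assumption: replace by neighbor mean
    return out


picture = [[8, 8, 8],
           [8, 0, 8],
           [8, 8, 8]]
print(smooth(picture, 32)[1][1])   # 8
print(smooth(picture, 100)[1][1])  # 0
```

With pixels of at most 8 bits, the 8-neighbor sum fits easily in a 16-bit word, which matches the low-precision integer arithmetic the design targets.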
Another image understanding algorithm is Edge Suppression, used for thinning out the edges in a picture by removing extraneous picture elements on an edge. Each scan line is examined, and only local maximum points are retained. Both of these image algorithms work on picture elements of up to 8 bits in size.
Each field selects a pair of addressing modes that a source or destination operand can choose from. The high bit of the instruction address field selects one of the pair.
D: The low 6 bits directly address the operand.
X1: The low 6 bits are added to index register 1 to obtain the address.
X2: The low 6 bits are indexed with index register 2.
X1+: The low 6 bits are indexed with X1, which is then incremented.
X2+: The low 6 bits are indexed with X2, which is then incremented.
X1-: The low 6 bits are indexed with X1, which is then decremented.
X2-: The low 6 bits are indexed with X2, which is then decremented.
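The addressing modes listed above can be illustrated with a small effective-address calculation. This is a sketch under assumptions the text does not confirm: addresses and index registers are taken to wrap modulo 64 (the size of the data working store), and the mode names are passed as strings. The function and its dictionary of index registers are hypothetical.

```python
ADDR_MASK = 0x3F  # 6-bit address field; the data working store holds 64 words


def effective_address(mode, low6, index):
    """Return the operand address for `mode`, updating `index` in place.

    `mode` is one of 'D', 'X1', 'X2', 'X1+', 'X2+', 'X1-', 'X2-';
    `index` is a dict holding the index registers under 'X1' and 'X2'.
    """
    low6 &= ADDR_MASK
    if mode == "D":
        return low6                      # direct addressing
    reg = mode[:2]                       # 'X1' or 'X2'
    address = (low6 + index[reg]) & ADDR_MASK   # assumption: wraps modulo 64
    if mode.endswith("+"):
        index[reg] = (index[reg] + 1) & ADDR_MASK   # post-increment
    elif mode.endswith("-"):
        index[reg] = (index[reg] - 1) & ADDR_MASK   # post-decrement
    return address


index = {"X1": 10, "X2": 60}
print(effective_address("D", 5, index))     # 5
print(effective_address("X1+", 5, index))   # 15
print(index["X1"])                          # 11
```

The post-increment and post-decrement modes let an inner loop step through a vector in the working store without spending a separate instruction on updating the index register.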
