This paper introduces a novel (non-von Neumann) programming paradigm of parallel computation featuring a much more efficient implementation of parallel algorithms, as well as a novel (hardware) machine paradigm efficiently supporting such implementations. Acceleration factors of up to more than 2000 have been obtained experimentally on an example architecture for a number of important applications, although using hardware simpler than that of a single RISC microprocessor. Due to its auto-sequencing data memory the machine principles are partly related to the organization of associative memories or systems. The machine organization and its most important hardware features are briefly introduced. The programming paradigm and its flexibility are illustrated by a few application examples.
Introduction
For a number of real-time applications extremely high throughput (up to several kiloMIPS) is needed at very low hardware cost. For at least another decade this will mostly be possible only with dedicated hardware, but not with programmable von-Neumann-type universal hardware. Even technologically advanced processors will very often still be too slow and/or too expensive. Parallel or concurrent computers also do not meet the requirements, or are by far too expensive. Their sustained average performance is lower than the peak rate by orders of magnitude. The reason is that the communication mechanisms offered by this hardware are not sufficiently powerful and/or too inflexible: the hardware is compiler-hostile, since most of the dense data dependencies of parallel algorithms cannot be mapped onto it. The next section gives more details about the reasons. This paper introduces and illustrates a new machine paradigm that is much more efficient in the implementation of parallel algorithms by avoiding most of these problems. In contrast to the von Neumann paradigm this paradigm accepts an extraordinarily wide variety of optimization methods, since it is supported by a number of innovative hardwired machine features, such as:
- auto-sequencing data memory (minimizing overhead)
- auto-sequencing register file organization (scan cache)
- reconfigurable r-ALU (featuring ultra micro parallelism)
- variable word length memory with innovative interface
- minimized access data trace by ultra micro scheduling
After a brief discussion of the difficulties of achieving massive parallelism on contemporary computer systems and also on VLSI solutions, some fundamental requirements for better hardware efficiency are identified. Contemporary throughput measures are also briefly discussed. After outlining basic xputer principles the MOM xputer architecture is introduced. Simple algorithm examples are used to illustrate xputer operation and programming. After summarizing a number of performance-relevant architectural features contributing to the superiority of xputers, more examples are introduced to illustrate semi-associative xputer applications. Finally technology aspects and possible embeddings are discussed.
From Contemporary Hardware towards What?
Communication mechanisms within concurrent computer systems are extremely hostile to optimizing compilers. Vector machines also have fundamental performance bottlenecks [Vec, Ve3], and their sustained average performance is lower than their peak rate by several orders of magnitude [Vec], even when creative coding techniques help the compiler [Ve2]. VLIW (Very Long Instruction Word) architectures [Eli, Ced] are much more optimizer-friendly due to a lower level of parallelism (at instruction level) [Bul, Para, Ga2], and relatively good optimization results have been reported [Bul], but only for algorithms with locally regular data dependencies (systolic or systolizable algorithms). VLIW architectures still have substantial drawbacks.
Data flow machines are also optimizer-hostile, since indeterministic operation does not permit optimization at compile time. Data flow machine throughput is also affected by a number of other drawbacks: several new kinds of bottlenecks have been introduced. Code causes an enormous addressing overhead and data accessing conflicts [Gaj].
A higher degree of parallelism may be achieved by Application-specific Array Processors (ASAPs). Even ASAPs have substantial drawbacks: extensive I/O overhead is caused by scrambling and unscrambling of data streams, and expensive design of special hardware is required. A more important drawback is that only algorithms with locally regular data dependencies (systolic or systolizable algorithms, see [SK] and others) are supported. This drawback also holds for parallel computer architectures for systolic emulation [Wp].
Data-driven Ultra Micro Parallelism
A more detailed comparative analysis has been published elsewhere [Mch]. We strongly believe in the following fundamental requirements to avoid most of these problems, to obtain sufficiently optimizer-friendly hardware, and to avoid most of the massive overhead caused (within software and hardware) by von Neumann principles. To obtain sufficiently flexible communication mechanisms, parallelism should be implemented at a level much lower than usual: (1) below instruction level (ultra micro parallelism). Optimization (parallelization) should be based on very fine granularity resource allocation and scheduling (2), determined at compile time to a much larger extent than usual (3). The paradigm should be deterministically data-driven (4). The non-von Neumann xputer paradigm introduced in this paper is an approach in this direction. Its novel processor organization is based on data sequencing (in contrast to the control flow sequencing paradigm of von Neumann machines), so that optimization methods based on data dependencies are also efficiently supported. Its architectural implementation is supported by an auto-sequencing data memory, which is partly related to functional memory [Fl, Gar]. A computer-to-xputer performance comparison seems to be the best possible way to evaluate the merits of these results. Since an xputer does not have a hardwired "instruction set", it does not make sense to use MIPS, which is normally used for computer-to-computer comparison and indicates the progress of technology and physical design rather than the efficiency of machine principles. But other computational devices also benefit from progress of technology. That's why we prefer the technology-independent measure of acceleration factor, obtained experimentally (fig. 1) from two equivalent implementations of the same algorithm: one on a computer (VAX-11/750) and one on the technologically comparable MOM xputer [MOM]. We have also found that for computed acceleration factor estimates a good model is obtained from comparing the total number or duration of primary memory cycles.
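The memory-cycle model mentioned above can be sketched as a simple ratio. The following Python fragment is only an illustration under stated assumptions: the function names and the per-operation cycle counts are hypothetical and are not taken from the measurements in fig. 1.

```python
def memory_cycles(n_operations: int, cycles_per_operation: int) -> int:
    """Total primary memory cycles for n_operations data manipulations.

    cycles_per_operation is an assumed, machine-specific figure: on a
    computer it would include instruction fetches and address computation,
    on an xputer essentially only the data accesses themselves.
    """
    if n_operations < 0 or cycles_per_operation < 0:
        raise ValueError("counts must be non-negative")
    return n_operations * cycles_per_operation


def estimated_acceleration(computer_cycles: int, xputer_cycles: int) -> float:
    """Estimate the acceleration factor as a ratio of memory cycle totals."""
    if xputer_cycles <= 0:
        raise ValueError("xputer cycle count must be positive")
    return computer_cycles / xputer_cycles


# Hypothetical example: 1000 operations, 6 memory cycles each on the
# computer (fetches, addressing, data) versus 2 each on the xputer (data only).
computer = memory_cycles(1000, 6)
xputer = memory_cycles(1000, 2)
print(estimated_acceleration(computer, xputer))
```

The ratio deliberately ignores clock rates, which is what makes it a technology-independent estimate in the sense used above.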
Another important measure is the r-ALU size, which depends on the computation needed for a particular application and on the number of applications resident simultaneously (its role will be explained later). A rough measure of expense is the number of PLDs (programmable logic devices) of a particular type that are needed. Fig. 1 shows some such expense figures obtained experimentally on the MOM xputer [MOM] with code from an optimizing compiler implemented and tested at Kaiserslautern [CE].
Xputer Machine Organization
For clarification xputers are compared to computers. The ALU of computers is a very narrow bandwidth device: it can carry out only a single simple operation at a time. Xputers, however, use a PLD-based r-ALU ([MOM], fig. 2 b), reconfigurable such that several highly parallel data paths also form powerful compound operators, which need only a few nanoseconds per execution, due to the highly parallel dedicated intra-chip read / modify / write interconnect between register files (called scan caches, see below) and r-ALU (fig. 4 a). The r-ALU is configured only during loading, not at run time, so that PLD set-up slowness does not affect performance: dedicated wires are fast and avoid the multiplexing overhead of buses [Bus]. Although PLDs with 2 ns gate delay are available commercially, PLDs might be slower than traditional ALU technologies. This is more than compensated by the r-ALU's micro parallelism and other xputer features. Soft instruction set computers are not new: dynamically microprogrammable architectures are well known, where, however, flexibility is based on sequential programs. Combinationally programmable repertories of functions have also been proposed [Fl, Gar], however, based on function tables mixed with data within data memory. But the r-ALU used for xputers is kept strictly separate from data.
In computers control flow is the primary activator (fig. 2 c): the instruction counter is the control state register. The rate of control flow is very high (control flow overhead): for each single data manipulation action at least one preceding control action is needed, each of which requires at least one memory cycle. If neither emit address nor emit data is used, additional control flow and even data operations are needed for address computation (addressing overhead).
Driven by an auto-sequencing data memory, xputers are deterministically data-driven (fig. 2 d). For auto-sequencing the data memory interface includes a data sequencer, a hardwired data address generator (fig. 2 b, instead of the instruction sequencer of computers: fig. 2 a). This hardwired data sequencer provides a repertory of generic data address sequences without any addressing overhead. Such an address sequence makes a scan cache move through data memory space, step by step, scanning a predefined segment of primary memory space along a path, which we call a scan pattern.
Fig. 3. Scan pattern examples: a-c) 1-dimensional scan, d-f) 2-dimensional scan.
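To make the notion of a scan pattern more concrete, the following Python sketch generates two generic address sequences in software: a 1-dimensional linear scan and a 2-dimensional video scan of a rectangular scan cache. It only imitates what the hardwired data sequencer produces without software overhead; all function names, parameters and the row-major memory layout are assumptions made for illustration.

```python
from typing import Iterator, Tuple


def linear_scan(start: int, count: int, stride: int = 1) -> Iterator[int]:
    """1-dimensional scan: yield `count` addresses, `stride` apart."""
    for i in range(count):
        yield start + i * stride


def video_scan(width: int, height: int,
               cache_w: int = 1, cache_h: int = 1) -> Iterator[Tuple[int, int]]:
    """2-dimensional video scan: yield the top-left (x, y) position of a
    cache_w-by-cache_h scan cache, moving line by line in unit steps."""
    for y in range(height - cache_h + 1):
        for x in range(width - cache_w + 1):
            yield (x, y)


def to_address(x: int, y: int, width: int, base: int = 0) -> int:
    """Map a 2-D position to a linear address (assumed row-major layout)."""
    return base + y * width + x


# A 2-by-2 cache scanning a 4-by-3 memory segment visits six locations.
for pos in video_scan(4, 3, cache_w=2, cache_h=2):
    print(pos, to_address(*pos, width=4))
```

In this software imitation every step costs loop and address arithmetic; the point of the data sequencer is that the same sequence is produced in hardware while the r-ALU processes data.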
The MOM (Map-oriented Machine)
Let's illustrate the role of this data sequencer by the [PP, We]. For stack-based hardware support of nested scan patterns see [Hir].
Computer control flow has only a single "scan pattern" (fig. 3 a, compare 3 b), scanning instructions one by one (as long as no branch or jump is encountered, which we consider to be an escape from the scan). In contrast to those of xputers this scan pattern is not free of overhead: each step requires its own instruction fetch. Each instruction fetch requires a memory access cycle. This makes iterative operations especially inefficient, since the same instruction is fetched again and again. Looping instructions cause additional control overhead and thus additional memory access cycles. It is obvious that the computer paradigm is extremely overhead-prone, whereas the xputer paradigm avoids most kinds of overhead.
The Data Sequencing Paradigm
For high level programming of xputers we use a simple model which is supported by the auto-sequencing data memory, and which we call data sequencing. This paradigm will be illustrated here by 2 simple algorithm examples. The first example (a systolic algorithm: fig. 5) is not a good one to demonstrate the merits of xputers over vector machines. It has been selected for easy illustration of the data sequencing paradigm.
Fine Granularity Scheduling
This first example has illustrated the task of the innovative kind of compilers needed for xputers [CE, We]: a kind of fine granularity scheduling (or: ultra micro scheduling) of caches and r-ALU subnets, and of data words, ready to be auto-sequenced. This is fundamentally different from sequentially piling up sequential code as conventional compilers do for computers. Later, in a section on xputer high performance features, a more detailed impression of this scheduling task will be given.
Organization of Residual Control
At the end of the above data sequence example the cache finds a tagged control word (TCW, fig. 6 c), which then is decoded (right side in fig. 6 b) to change the state of the residual control logic (fig. 4 a) and to select further actions of the xputer. We call this sparse insertion of TCWs into data maps sparse control. Note that the control state changes only after many data operations (driven by the data sequencer).
That's why we use the term residual control or sparse control for this philosophy. Note that xputer operation is data-driven, so that TCWs may be encountered only from within a data sequence. A TCW decoder is defined at compile time and configured as a subnet within the r-ALU. Fig. 7 a illustrates the distribution of the residual control state between a scan state register (holding scan pattern select code and parameters), an ALU state register (holding subnet select code) and a residual control state register. We define that only conditional branching, operator select and scan pattern select, but not data addressing, are control actions. Thus during a scan there is no control action: the data counter is not a state register. But escape from a scan is a control action (like in computers, see fig. 3 a). Escapes are (fig. 7 a): normal escape (by end of scan flag from data sequencer), delimiter escape (on TCW encounter), off-limits escape (address exceeds memory segment limits), conditional branch escape (by decision data from r-ALU), and event escape (by external event flag). Upon off-limits escape, branch escape, or event escape a remote control word (RCW) or remote address word (RAW) is fetched from a remote memory segment via an escape cache. A second decision mechanism (implicit branching, because residual control state is not affected) is activated only within data-dependent scans (i. e. without escape: curve following etc. [CE, MOM]). Such a data-dependent scan may be exited by conditional branch escape or off-limits escape.
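The escape mechanism can be pictured with a small software model. The sketch below is a simplified assumption, not the MOM hardware: it covers only normal, delimiter and off-limits escape (conditional branch escape and event escape are left out), uses a 1-by-1 cache, and all class and function names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable, List, Tuple, Union


class Escape(Enum):
    NORMAL = "end of scan"
    DELIMITER = "tagged control word encountered"
    OFF_LIMITS = "address outside memory segment"


@dataclass(frozen=True)
class TCW:
    """Tagged control word embedded in the data map (code is illustrative)."""
    code: int


Word = Union[int, TCW]


def run_scan(memory: List[Word], addresses: Iterable[int],
             operator: Callable[[int], int]) -> Tuple[Escape, int]:
    """Apply `operator` as a read-modify-write at each scanned address.

    No control action happens per data word; the loop only stops on an
    escape. Returns the escape kind and the number of words processed.
    """
    processed = 0
    for addr in addresses:
        if not 0 <= addr < len(memory):
            return Escape.OFF_LIMITS, processed
        word = memory[addr]
        if isinstance(word, TCW):
            return Escape.DELIMITER, processed
        memory[addr] = operator(word)
        processed += 1
    return Escape.NORMAL, processed


data_map: List[Word] = [1, 2, 3, TCW(code=7), 5]
print(run_scan(data_map, range(5), lambda v: v * 2), data_map[:3])
```

In this model the control state would change only once, at the TCW, after three data operations, which is the sense of "sparse" control.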
To achieve xputer universality non-generic scans and individual data accessing are also needed, implemented by list-directed scan: the next data address is read from a TAW (tagged address word) within the data map or from a RAW (in case of an escape). This list mode can be entered directly during a scan upon TAW encounter. If no TCW is found, a TAW does not activate residual control. Reading addresses from primary memory means addressing overhead, so that list-driven sequencing is slower than hardwired scan patterns. But even in this mode of operation the xputer paradigm is still superior to the computer paradigm.
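A list-directed scan can be imitated in a few lines. The following sketch assumes a unit-step scan that switches to the address stored in a TAW whenever one is encountered; the `TAW` class, the function name and the step limit are illustrative assumptions. It also counts the extra memory reads spent on addresses, which is the addressing overhead mentioned above.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class TAW:
    """Tagged address word: holds the next data address to visit."""
    next_address: int


def list_directed_scan(memory: List[Union[int, TAW]], start: int = 0,
                       max_steps: int = 1000) -> Tuple[List[int], int]:
    """Return (addresses of data words visited, number of TAW reads).

    The scan ends when the address leaves the memory segment; max_steps
    guards against TAWs that form a cycle.
    """
    visited: List[int] = []
    address_reads = 0
    addr = start
    for _ in range(max_steps):
        if not 0 <= addr < len(memory):
            break
        word = memory[addr]
        if isinstance(word, TAW):
            address_reads += 1      # a memory read that yields no data
            addr = word.next_address
        else:
            visited.append(addr)
            addr += 1               # generic unit-step scan
    return visited, address_reads


segment = [10, 11, TAW(5), 0, 0, 12, 13, TAW(99)]
print(list_directed_scan(segment))
```

Here two of the eight memory reads fetch addresses rather than data, whereas a hardwired scan pattern would need none.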
I/O Data Sequencing
Xputer I/O is simple: the scan-cache-based data sequencing hardware (more details in fig. 4 c) is linked to an I/O channel (fig. 4 d, e), which is more powerful than the DMA known from computers. The data streaming in are not just downloaded into a memory segment. Via a suitable scan pattern selection, along with proper scan cache adjustment, the data sequencer sets up a structured data map already during the input operation. During output (fig. 4 e) data may also be picked (by the data sequencer) from memory in a structured way.
Highly Flexible Cost / Performance Ratio
Xputer word lengths are compiler-defined: data path, cache, and control words. Thus extensible xputer architectures are feasible, upgradable by inserting PLDs into free r-ALU sockets and more boards into free memory slots. E. g. it is easy to design a VWL memory (Variable Word Length Memory), where data word length could be changed under software control to support VLDW (very large data word) strategies for more parallelism (see fig. 6 d, where a single word holds 14 operands). The scan pattern is very short, so that the 1-by-1 cache visits only 2 locations (fig. 6 e). In total the number of primary memory semicycles has been reduced from 41 to 2, so that a speed-up by about a factor of 20 has been obtained. This illustrates the extremely high flexibility of the xputer paradigm with respect to cost/performance trade-off.
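The effect of a very large data word can be illustrated by packing several operands into one memory word, so that a single access delivers all of them. The sketch below is an assumption-laden illustration: the 8-bit operand width and the function names are hypothetical, and only the operand count (14) and the semicycle counts (41 and 2) come from the text.

```python
from typing import List


def pack_operands(operands: List[int], bits: int) -> int:
    """Pack unsigned operands of `bits` bits each into one wide word."""
    word = 0
    for i, value in enumerate(operands):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"operand {value} does not fit in {bits} bits")
        word |= value << (i * bits)
    return word


def unpack_operands(word: int, count: int, bits: int) -> List[int]:
    """Recover `count` operands of `bits` bits each from a wide word."""
    mask = (1 << bits) - 1
    return [(word >> (i * bits)) & mask for i in range(count)]


def semicycle_speedup(before: int, after: int) -> float:
    """Speed-up estimated from primary memory semicycle counts."""
    return before / after


operands = list(range(14))                 # 14 operands in a single word
wide_word = pack_operands(operands, bits=8)
assert unpack_operands(wide_word, 14, 8) == operands
print(wide_word.bit_length(), semicycle_speedup(41, 2))
```

With the counts quoted above the ratio is 41 / 2 = 20.5, i.e. the speed-up of about 20 stated in the text.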
Data Address Generator Hardware
This section illustrates the address generator operation. 
Non-systolizable Algorithms
The introductory application example in fig. 5 / 6 has been a systolic algorithm, which is easy to convert into a data sequencing scheme because of the locality of data communication. In digital signal processing and in other important application areas, however, non-systolizable algorithms are also very important. In contrast to parallel computer systems and VLSI arrays, xputers smoothly accept non-systolic data sequencing schemes as well.
The implementation of non-systolizable algorithms on the xputer has been shown in [Bil, PP] for the constant geometry FFT. The xputer FFT implementation uses 3 caches to support processing of non-local data dependencies. This method is shown in fig. 9 b.
Xputer High Performance Features
Having explained introductory sequencing examples we may obtain deeper insight into xputer performance issues more easily. Xputer performance stems from a number of different phenomena and concepts. Fig. 10 surveys the most important mechanisms contributing to the efficiency of parallel algorithm implementations running on xputers, which will be discussed throughout this chapter. Important roots of xputer efficiency are: the r-ALU's ultra micro parallelism, the data sequencing paradigm, and the high flexibility of xputer memory interface architecture.
Much wider varieties of optimization strategies than possible with computers can be efficiently mapped onto this innovative methodology. Compound operators' ultra micro parallelism reduces memory access by substantially minimizing the number of stored intermediate variables.
Often the r-ALU's flexible data path width facilitates better utilization of r-ALU space (e. g. see 2-D filtering example in next chapter). Dedicated intra-r-ALU interconnect avoids the use of buses, which are slow and cause multiplexing overhead [Bus].
The data sequencing paradigm obviously is by far less overhead-prone than the von Neumann control flow paradigm. Control flow overhead is almost completely avoided (also no instruction fetch cycles are needed). The above examples have demonstrated that addressing overhead is substantially reduced, and not only by the hardwired address generator (also see the pattern matching example in next chapter). Not all mechanisms of overhead reduction in xputer programs are well understood yet: we propose basic research also covering overhead mechanisms of the von Neumann paradigm. Now let's look at memory bandwidth. We may distinguish two kinds of factors: reduced memory bandwidth requirements due to the xputer paradigm, the r-ALU concept, and optimizing compilers (see above), and the provision of higher memory bandwidth. Interface flexibility offers an extremely wide variety of strategies (optimum data maps) to meet the remaining bandwidth requirements, where the xputer scan cache model is an important concept in finding such strategies. Important means are wide memory data paths (VLDW approaches, see above) supported by VWL memories (Variable Word Length memories, see above).
... fig. 6 b) would be further reduced from 5 to 2 (total speed-up factor: 6). In unit step sequencing of large caches memory bandwidth bottlenecks can be reduced (due to optimizing compiler strategies) by another cache feature reducing repetitive access to memory locations. The MOM 2-D cache hardware also provides a multidirectional shift path, separately for each dimension, such that e. g. for a 4-by-4 cache in a video scan the number of semicycles is reduced from 32 to 8 [MOM]. By combination of this feature with interleaving the memory access rate may be further reduced to 2 (total speed-up factor: 12). Thus several relatively cheap hardware features supporting optimization may total up another order of magnitude of acceleration. ... fig. 9 b) or between distant subarrays. Also comparing acceleration factors in lines no. 4 and 5 within the table in fig. 1 shows that here multiple-cache solutions tend to be much more efficient. Like cache memories of computers, scan caches in xputers help to reduce performance degradation due to the memory access bottleneck. It is obvious that xputer cache use is fully deterministic, due to a data scheduling strategy being completely compiler-driven.
That's why a much larger variety of optimization strategies may be applied, in contrast to computers permitting only probabilistic strategies which yield only low hit rates. By xputer cache use, however, extraordinarily high hit rates may be achieved, since cache traffic can be scheduled very precisely in detail to the optimum, tailored to any particular sequencing problem. This is because xputer hardware accepts almost any optimized schedule which always provides the right data at the right location at the right time. Thus compilation for xputers is a kind of very high level synthesis, where the number of visits to data locations in 
Xputers in Image Processing
Xputers are especially well suited for image preprocessing, so that no specialized and much more expensive image processing computers are needed. Due to its universality other kinds of parallel algorithms may also be accelerated by the same xputer, and in mass product applications stand-alone xputer use substantially reduces the total chip count. In image preprocessing, systolizable algorithms (mainly using simple scan patterns, see fig. 3 e, f) and methods using data-dependent scan patterns dominate. This section illustrates xputer use by electronics design automation examples implemented at Kaiserslautern, where integrated circuit layout uses data structures quite similar to those well known from image preprocessing. Fig. 11 shows a 2-D digital filtering example implemented at Kaiserslautern: a systolic algorithm example in image preprocessing. A video scan pattern (fig. 11 b) is used to move a 3-by-3-sized single scan cache, which at each location recomputes the center pixel c4 by an expression shown in fig. 11 a. The cache map in fig. 11 a shows the integer weight distribution. The r-ALU subnet (fig. 11 c) is derived from the local DG in fig. 11 a. Although including 18 arithmetic functions, this compound function is purely combinational and fits on a small fraction of a single 5128 chip (last line in fig. 1), due to the extraordinarily efficient minimization made possible by the high flexibility of xputer r-ALUs. Since xputer data path width is not hardwired, a low path width (e. g. 8 bits for the adders in fig. 11) may save PLD space. Multiplication by 1 saves a multiplier entirely. In case of binary coded integers multiplication by 2 or 4 (see fig. 11) may be replaced by a shift left by 1 bit or by 2 bits, respectively. All this demonstrates xputers' high acceptance of a wide variety of optimization strategies. Further minimization results from memory accessing strategies possible with xputers only.
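The replacement of multipliers by shifts can be sketched in software. The following Python fragment is an illustration only: the 3-by-3 weight arrangement (1, 2, 1 / 2, 4, 2 / 1, 2, 1), the normalization by 16 and the use of a separate output image are assumptions, since the actual expression of fig. 11 a is not reproduced here.

```python
from typing import List

# Assumed weights expressed as shift amounts: weight = 1 << shift,
# i.e. 1 2 1 / 2 4 2 / 1 2 1. No multiplier is needed for any of them.
SHIFTS = [[0, 1, 0],
          [1, 2, 1],
          [0, 1, 0]]
NORMALIZE_SHIFT = 4  # sum of the assumed weights is 16 = 1 << 4


def filter_3x3(image: List[List[int]]) -> List[List[int]]:
    """Move a 3-by-3 window over `image` in a video scan and recompute
    each center pixel from shifted and added neighbours.

    Border pixels are copied unchanged; results go to a separate output
    image (a simplification of the read-modify-write loop in the text).
    """
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(height - 2):            # video scan: line by line
        for x in range(width - 2):
            acc = 0
            for dy in range(3):
                for dx in range(3):
                    acc += image[y + dy][x + dx] << SHIFTS[dy][dx]
            out[y + 1][x + 1] = acc >> NORMALIZE_SHIFT
    return out


impulse = [[0, 0, 0], [0, 16, 0], [0, 0, 0]]
print(filter_3x3(impulse)[1][1])
```

In the r-ALU the nine shifted terms would be wired into one combinational adder network, so the inner two loops correspond to a single compound operator rather than to sequential steps.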
On-cache shift paths minimize the number of memory access cycles needed to 1 per word and video scan per line. Combined with suitable memory interleaving this may total up to an order of magnitude (see section 5 for explanations).
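The saving from an on-cache shift path can be estimated by counting word transfers per scan step. The sketch below rests on assumptions that are not stated in the text: a unit step along a scan line, every transferred word being read and later written back (two semicycles per word), and only one new column of the cache being transferred when the shift path is used. Under these assumptions the count reproduces the figures of 32 and 8 semicycles reported for a 4-by-4 cache [MOM].

```python
def semicycles_per_step(cache_w: int, cache_h: int, on_cache_shift: bool) -> int:
    """Memory semicycles per unit scan step of a cache_w-by-cache_h cache.

    Without a shift path the whole cache is reloaded and written back;
    with it, the cache contents are shifted and only one column
    (cache_h words) is exchanged with memory.
    """
    words = cache_h if on_cache_shift else cache_w * cache_h
    return 2 * words  # one read and one write semicycle per word


for size in (3, 4):
    without = semicycles_per_step(size, size, on_cache_shift=False)
    with_shift = semicycles_per_step(size, size, on_cache_shift=True)
    print(size, without, with_shift, without / with_shift)
```

The model does not cover memory interleaving, which the text names as a further, independent source of reduction.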
Two-dimensional digital filtering

Pattern Matching Applications on Xputers
We use pattern matching examples to illustrate the image preprocessing capabilities of xputers, which are also applicable to integrated circuit layout verification and routing using grid-based design rules. A DRC may be carried out by a finite state machine or combinational logic [San]. Such algorithms run very fast on ASIC hardware, which, however, has to be reimplemented for changed design rules and for porting. Due to the very large primary memories of modern workstations conventional software implementation is also feasible, which, however, is very inefficient because of sequential processing of the very large number of reference patterns. But such implementations are needed to measure acceleration factors. The MOM-DE environment with tools like a reference pattern generator and the PISA [San] package facilitates comparative performance measurement by convenient generation of such pattern matching algorithms.
In contrast to computers, the performance of xputers here is competitive with ASIC solutions. E. g. for a grid-based design rule check (DRC) the MOM xputer has been programmed such that a single video scan over the layout is sufficient. Substantial acceleration is also obtained for other kinds of grid-based layout processing, such as Lee routing [Aq, CE], ERC (electrical rules check), compaction, fault extraction, etc. Reference patterns are configured combinationally into the r-ALU as a single very powerful compound function linked with a video scan sequence within a 2-dimensional bit map memory segment. A single read-modify-write data loop is performed per cache location without using decision data. Experimental results in grid-based DRC with a 4-by-4 cache are acceleration factors of up to 2000 (CMOS design rules).
The extremely high acceleration factor is mainly due to two reasons: all (hundreds of) reference patterns are bundled into a huge compound Boolean operator (massive ultra micro parallelism), and caching completely avoids addressing overhead (an analysis of the VAX version of this algorithm has shown a substantial share of CPU time being spent on addressing). The MOM on-cache shift and access mode flag features (also see section 5) also contribute to the high performance by minimized storage access time.
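The idea of checking all reference patterns at once per cache location can be imitated in software. In the sketch below each window of the bit map is encoded as an integer, and a set membership test stands in for the compound Boolean operator of the r-ALU; the function names, the encoding and the returned hit list are assumptions made for illustration (the text describes a read-modify-write loop rather than a hit list).

```python
from typing import FrozenSet, List, Tuple


def window_code(bitmap: List[List[int]], x: int, y: int, size: int) -> int:
    """Encode the size-by-size window at (x, y) row by row as an integer."""
    code = 0
    for dy in range(size):
        for dx in range(size):
            code = (code << 1) | (bitmap[y + dy][x + dx] & 1)
    return code


def design_rule_check(bitmap: List[List[int]],
                      violation_patterns: FrozenSet[int],
                      size: int = 4) -> List[Tuple[int, int]]:
    """Single video scan over the layout bit map.

    At each cache location the window is compared against all reference
    patterns in one step (set lookup), mimicking one compound operator.
    Returns the top-left positions of windows matching a violation pattern.
    """
    hits = []
    for y in range(len(bitmap) - size + 1):
        for x in range(len(bitmap[0]) - size + 1):
            if window_code(bitmap, x, y, size) in violation_patterns:
                hits.append((x, y))
    return hits


layout = [[0, 0, 0],
          [0, 1, 1],
          [0, 1, 0]]
# Hypothetical 2-by-2 reference pattern: rows (1, 1) and (1, 0) -> 0b1110.
print(design_rule_check(layout, frozenset({0b1110}), size=2))
```

A sequential implementation would instead loop over every reference pattern at every location, which is the inefficiency of conventional software noted above.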
Application Development Support
MOM-DE, the MOM application development environment, runs on a host (a µVAX [Mch]) and features a self-explanatory syntax-driven editor for the high level language MoPL (Map-oriented Programming Language), roughly a Pascal extension. MoPL sources are accepted by the MoMpiler, the "code generator" of which includes a commercial PLD programming tool needed for r-ALU personalization. MoPL includes a sublanguage PaDL, which efficiently supports pattern matching applications in general. An optimizing reference pattern generator has been implemented [San], which accepts VLSI layout design rules [Aq]. For inclusion of pattern matching applications an interactive graphic pattern editor has been implemented [We] for easy editing, modification, inspection and surveying of sets of reference patterns.
Embedding and Technology Issues
The most common PLD application is hardware prototyping. But recently an innovative kind of PLD use has been commercialized: ASIC emulation from netlist sources [Qu1, Qu2], which replaces simulation as a more efficient way of ASIC verification. In contrast to xputers, however, ASIC emulation does not provide a new design paradigm: the netlist is imported as the result of a separate (conventional) hardware design process. Xputers have a programming paradigm: a very high level model of parallel algorithms.
Running an implementation on an xputer is execution, not emulation. Since for some PLDs compatible gate arrays are also available commercially (e. g. by Plessey), xputer machine code may be directly submitted for fabrication. That's why the xputer paradigm may be considered to be an alternative high level synthesis approach to ASIC design [CE], more precisely: very high level synthesis. Neither ASIC emulation nor simulation is needed, since direct execution is available for design verification.
Partitioning large r-ALUs
To avoid communication bandwidth problems, cache(s) and r-ALU should be on the same chip. If for "large" applications or VLDW use more than a single PLD chip is needed for the r-ALU, more expensive inter-chip wiring is also needed in addition to the very efficient intra-chip wiring. This is a packaging issue rather than a performance issue, since primary memory access still remains the only critical bottleneck. In implementing several such "large algorithms" we have always found heuristically a clever partitioning scheme, by slicing caches into multi-bit slices and distributing the compound operator such that only loose coupling is required between chips.
Conclusions
With xputers an innovative computational machine paradigm has been introduced and implemented which achieves for parallel algorithms (also non-systolizable ones) drastically better performance and hardware utilization and drastically more (compiler-) optimizer-friendliness than the von Neumann paradigm (comparative summary: fig. 12 ). Acceleration factors up to more than 2000 have been obtained experimentally with a simple monoprocessor. For many applications xputers may outperform large parallel computer systems or ASIC solutions. Due to convenient conversion into a gate array the xputer also provides an alternative ASIC design methodology.
Xputers fit well to image preprocessing and digital signal processing, so that often special DSP processors or expensive special image processing computers are not needed. Due to xputer universality other kinds of parallel algorithms and glue software may also run on the same xputer, and in mass product applications a stand-alone use is possible, which substantially reduces the total chip count. For xputer architectures an extremely low amount of specific hardware is needed, not being performance-critical, so that it's easy to keep up with technology. Not yet all phenomena are well understood which contribute to the high acceleration factors found experimentally. We need a new direction of (very) high level synthesis, a new ...
