Abstract: Programmable Active Memories (PAM) are a novel form of universal reconfigurable hardware co-processor. Based on Field-Programmable Gate Array (FPGA) technology, a PAM is a virtual machine, controlled by a standard microprocessor, which can be dynamically and indefinitely reconfigured into a large number of application-specific circuits. PAMs offer a new mixture of hardware performance and software versatility.
I. Introduction

There are two ways to implement a specific high-speed digital processing task. The simplest is to program some general-purpose computer to perform the processing. In this software approach, one effectively maps the algorithm of interest onto a fixed machine architecture. However, the structure of that machine will have been highly optimized to process arbitrary code. In many cases, it will be poorly suited to the specific algorithm, so performance will be short of the required speed. The alternative is to design ad hoc circuitry for the specific algorithm.
In this hardware approach, the machine structure (processors, storage and interconnect) is tailored to the application. The result is more efficient, with less actual circuitry than general-purpose computers require. The disadvantage of the hardware approach is that a specific architecture is usually limited to processing a small number of algorithms, often a single one. Meanwhile, the general-purpose computer can be programmed to process every computable function, as we have known since the days of Church and Turing.
Adding special-purpose hardware to a universal machine, say for video compression, speeds up the processor, but only while the system is actually compressing video. It contributes nothing when the system is required to perform some different task, say cryptography or stereo vision.
We present an alternative machine architecture that offers the best of both worlds: software versatility and hardware performance. The proposal is a standard high-performance microprocessor enhanced by a PAM coprocessor. The PAM can be configured as a wide class of specific hardware systems, one for each interesting application. PAMs merge together hardware and software.
This paper presents results from seven years of research, at INRIA, DEC-PRL and other places. It addresses the following topics:
How to build PAMs. How to program PAMs. What are the applications?

Section II introduces the principles of the underlying FPGA technology.
Section III highlights the interesting features of PAM architecture.
Section IV presents some of the methods used in programming large PAM designs.
Section V describes a dozen applications, chosen from a wide variety of scientific fields. For each, PAM outperforms all other existing technologies. A hypothetical machine equipped with a dozen different conventional co-processors would achieve the same level of performance, at a higher price. Through reconfiguration, a PAM is able to time-share its internal circuitry between our twelve (or more) applications; the hypothetical machine would require different custom circuits for each, which must be physically present at all times.
We assess, in Section VI, the computing power of PAM technology, today and in the future.
II. Virtual circuits
The first commercial FPGA was introduced in 1986 by Xilinx [1]. This revolutionary component has a large internal configuration memory, and two modes of operation: in download mode, the configuration memory can be written, as a whole, through some external device; once in configured mode, an FPGA behaves like a regular application-specific integrated circuit (ASIC).
To realize an FPGA, one simply connects together, in a regular mesh, $n \times m$ identical programmable active bits (PABs). Surprisingly enough, there are many ways to implement a PAB with the required universality. In particular, it can be built from either or both of the following primitives: a configurable logic block implements a boolean function with $k$ inputs (typically $2 \le k \le 6$); its truth table is defined by $2^k$ (or fewer) configuration bits, stored in local registers; a configurable routing block implements a switchbox whose connectivity table is set by local configuration bits. Such an FPGA implements a Von Neumann cellular automaton. What is more, the FPGA is a universal example of such a structure: any synchronous digital circuit can be emulated, through a suitable configuration, on a large enough FPGA, for a slow enough clock.
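As an illustration of the configurable logic block described above, here is a minimal C++ sketch of a $k$-input look-up table whose behavior is entirely defined by its $2^k$ configuration bits. The names `Lut` and `configure` are hypothetical and chosen for this example only; they do not come from any vendor tool.

```cpp
#include <cassert>
#include <cstdint>

// A k-input configurable logic block modeled as a look-up table (LUT).
// The 2^k configuration bits are packed into one 64-bit word, hence k <= 6.
struct Lut {
    unsigned k;          // number of inputs, 2 <= k <= 6
    std::uint64_t bits;  // truth table: bit i is the output for input pattern i

    // Evaluate the block; `inputs` carries the k input values in its low bits.
    bool eval(unsigned inputs) const {
        assert(k >= 2 && k <= 6);
        unsigned index = inputs & ((1u << k) - 1u);
        return ((bits >> index) & 1u) != 0;
    }
};

// Derive the configuration bits of a k-input LUT from any boolean function f,
// where f receives the input pattern as an unsigned integer.
template <typename F>
Lut configure(unsigned k, F f) {
    assert(k >= 2 && k <= 6);
    Lut lut{k, 0};
    for (unsigned i = 0; i < (1u << k); ++i)
        if (f(i)) lut.bits |= (std::uint64_t{1} << i);
    return lut;
}
```

Configuring the same structure with a different truth table yields a different gate, which is the sense in which the block is universal for $k$-input functions.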
Some vendors, such as Xilinx [2] or AT&T [3], form their PABs from both configurable routing and logic blocks. Other early ones, such as Algotronix [4] (now with Xilinx) or Concurrent Logic [5] (now with Atmel), combine routing and computing functions into a single primitive: this is the fine grain approach. An idealized implementation of this fine grain concept is given in figure 1. A third possibility is to build the PAB from a configurable routing box connected to a fixed (non-configurable) universal gate such as a nor or a multiplexor [6].

The PAB of figure 1 has 4 inputs $\langle n, s, e, w \rangle$, 4 outputs $\langle N, S, E, W \rangle$, one register (flip-flop) with input $R$ and output $r$, and a combinational gate $g(n, s, e, w, r) = (N, S, E, W, R)$.
The truth table of $g$ is specified by $160 = 5 \times 32$ bits. Each FPGA implementation can emulate each of the others, granted enough PABs. In order to make quantitative performance comparisons between the diverse significant implementations, let us, from now on, choose as our reference unit any active bit with one 4-input boolean function (configurable or not) and one internal bit of state (see section VI and Vuillemin [7]). With its five 5-input functions, the PAB from figure 1 counts for ten or so such units.
The FPGA is a virtual circuit which can behave like a number of different ASICs: all it takes to emulate a particular one is to feed the proper configuration bits. This means that prototypes can be made quickly, tested and corrected. The development cycle of circuits with FPGA technology is typically measured in weeks, as opposed to months for hardwired gate array techniques. But FPGAs are used not just for prototypes; they also get incorporated in many production units. In all branches of the electronics industry other than the mass market, the use of FPGAs is expanding, despite the fact that they still cost ten times as much as ASICs in volume production. In 1992, FPGAs were the fastest growing part of the semiconductor industry, increasing output by 40%, compared with 10% for chips overall.
As a consequence, FPGAs are on the leading edge of silicon chips. They grow bigger and faster at the rate of their enabling technology, namely that of the static RAM used for storing the internal configuration.
In the past 40 years, the feature size of silicon technology has been shrinking by a factor of about 1.25 each year. This phenomenon is known as Noyce's thesis; it was first observed in the early sixties. The implications of Noyce's thesis for FPGA technology are analyzed by Vuillemin [7]. The prediction is that the leading edge FPGA, which has 400 PABs operating at 25 MHz in 1992, will, by year 2001, contain 25k PABs operating at 200 MHz.
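The order of magnitude of this prediction can be checked with a short calculation. The sketch below assumes a yearly linear shrink factor of 1.25, and further assumes that the number of PABs grows as the square of the linear shrink while the clock frequency grows linearly with it. This scaling model is our simplification for illustration; the detailed analysis is the one in Vuillemin [7]. The names `FpgaEstimate` and `extrapolate` are hypothetical.

```cpp
#include <cassert>
#include <cmath>

struct FpgaEstimate {
    double pabs;  // estimated number of PABs on the leading edge FPGA
    double mhz;   // estimated clock frequency in MHz
};

// Extrapolate from a starting point over `years` years.
// Assumption: density scales with the square of the linear shrink,
// clock frequency scales with the linear shrink itself.
FpgaEstimate extrapolate(double pabs0, double mhz0, int years,
                         double shrink_per_year = 1.25) {
    assert(years >= 0 && shrink_per_year > 0.0);
    double linear = std::pow(shrink_per_year, years);
    return {pabs0 * linear * linear, mhz0 * linear};
}
```

Under these assumptions, `extrapolate(400.0, 25.0, 9)` gives roughly 22k PABs at about 186 MHz for 2001, the same order as the 25k PABs and 200 MHz quoted above.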
III. PAMs as Virtual Machines
The purpose of a PAM is to implement a virtual machine that can be dynamically configured as a large number of specific hardware devices. The structure of a generic PAM is found in figure 2. It is connected, through the in and out links, to a host processor. A function of the host is to download configuration bitstreams into the PAM. After configuration, the PAM behaves, electrically and logically, like the ASIC defined by the specific bitstream. It may operate in stand-alone mode, hooked to some external system through the in' and out' links. It may operate as a co-processor under host control, specialized to speed up some crucial computation. It may operate as both, and connect the host to some external system, like an audio or video device, or some other PAM.

To justify our choice of name, observe that a PAM is attached to some high-speed bus of the host computer, like any RAM memory module. The processor can write into, and read from, the PAM. Unlike RAM, however, a PAM processes data between write and read instructions, which makes it an "active" memory. The specific processing is determined by the contents of its configuration bitstream, which can be updated by the host in a matter of milliseconds, hence the "programmable" qualifier.
We now describe the architecture of a specific PAM: it is named DECPeRLe-1 and will be referred to as P1. It was built at Digital's Paris Research Laboratory in 1992. A dozen copies operate at various scientific centers in the world; some are cited as we enumerate the operational applications in section V.
The overall structure of P1 is shown in figure 3. Each of the 23 squares denotes one Xilinx XC3090 FPGA [2]. Each of the 4 rectangles represents 1 MB of static RAM (letter R). Each line represents 32 wires, physically laid out on the printed circuit board (PCB) of P1. A photo of the system is shown in figure 4. The merit of this structure is to host, in a natural manner, the diverse networks of processing units presented in section V. Depending upon the application, individual units are implemented within one to many FPGAs; they may also be implemented as look-up tables (LUT) through the local RAM; some slow processes are implemented by software running on the host. Connections between processing units are mapped, as part of the design configuration, either on PCB wires or on internal FPGA wires.
A. FPGA matrix
The computational core of P1 is a 4 [...]

The fourth external connection links to the host interface of P1: a 100 MB/s TURBOchannel adapter [8]. In order to avoid having to synchronize the host and PAM clocks, host data goes through two FIFOs, for input and output respectively. On the PAM side of the FIFOs is another switch FPGA, which shares two 32-bit buses with the other switches and controllers (see figure 3). The host connection itself consists of a host-independent part implemented on the P1 mother board and a host-dependent part implemented on a small option board specific to the host bus. A short cable links the two parts (see figure 4).
In addition to the above, P1 features daughter-board connectors that can provide more than 1.2 GB/s of bandwidth to specialized hardware extensions.
D. Firmware
One extra FPGA on P1 is not configurable by the user; call it POM, by analogy with ROM. Its function is to provide control over the state of the PAM, through software from the host. The logical protocol of the host bus itself is programmed in POM configuration. Adapting from TURBOchannel to some other logical bus format, such as VME, HIPPI or PCI, is just a matter of re-programming the POM and redesigning the small host-dependent interface board.
A function of the POM is to assist the host in downloading a PAM configuration (1.5 Mb for P1). Thanks to this hardware assist, we are able to reconfigure P1 up to fifty times per second, a crucial feature in some applications.
One can regard P1 as a software silicon foundry, with a 20 ms turn-around time.
We take advantage of an extra feature of the XC3090 component: it is possible to dynamically read back the contents of the internal state register of each PAB. Together with a clock stepping facility (stop the main clock and trigger clock cycles one at a time from the host), this provides a powerful debugging tool, where one takes a snapshot of the complete internal state of the system after each clock cycle. This feature drastically reduces the need for software simulation of our designs.
PAM designs are synchronous circuits: all registers are updated on each cycle of the same global clock. The maximum speed of a design is directly determined by its critical combinational path. This varies from one PAM design to another. It has thus been necessary to design a clock distribution system whose speed can be programmed as part of the design configuration. On P1, the clock can be finely tuned, with increments on the order of 0.01%, for frequencies up to 100 MHz.
A typical P1 design receives a logically uninterrupted flow of data, through the input FIFO. It performs some processing, and delivers its results, in the same manner, through the output FIFO. The host is responsible for filling in and emptying out the other side of both FIFOs. Our firmware supports a mode in which the application clock automatically stops when P1 attempts to read an empty FIFO or write a full one, effectively providing fully automatic and transparent flow control.
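The flow-control rule above can be modeled in a few lines of C++. This is only a behavioral sketch under our own assumptions: the `Fifo` type, its capacity and the stand-in processing step (adding 1 to each word) are hypothetical and do not describe the actual firmware.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Bounded FIFO between host and PAM; the capacity is an illustrative parameter.
struct Fifo {
    std::size_t capacity;
    std::deque<std::uint32_t> data;
    bool empty() const { return data.empty(); }
    bool full() const { return data.size() >= capacity; }
};

// One attempted application clock cycle of a design that reads one word,
// processes it (here: adds 1, a stand-in for real processing) and writes
// one word. Returns true if the clock ticked, false if it was stopped.
bool clock_cycle(Fifo& in, Fifo& out) {
    assert(in.capacity > 0 && out.capacity > 0);
    if (in.empty() || out.full()) return false;  // clock stops, no data is lost
    std::uint32_t word = in.data.front();
    in.data.pop_front();
    out.data.push_back(word + 1);
    return true;
}
```

Because a stalled cycle changes no state, the design itself needs no knowledge of how fast the host fills or drains the FIFOs.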
The full firmware functionality may be controlled through host software. Most of it is also available to the hardware design: all relevant wires are brought to the two controller FPGAs of P1. This allows a design to synchronize itself, in the same manner, with some of the external links. Another unique possibility is the dynamic tuning of the clock. This feature is used in designs where a slow and infrequent operation (say, changing the value of some global controls every 256 cycles) coexists with fast and frequent operations. The strategy is then to slow the clock down before the infrequent operation, every 256 cycles, and speed it up afterwards, for 255 cycles. Tricky, but doable.
E. Other Reconfigurable Systems
Besides our PAMs, which were built first at INRIA in 1987 (up to Perle-0, whose architecture is described in some detail in an earlier report [9]) and then at DEC-PRL, other successful implementations of reconfigurable systems have been reported, in particular at the universities of Edinburgh [10] and Zurich [11], and at the Supercomputing Research Center in Maryland [12].
The ENABLE machine is a system, built from FPGAs and SRAM, specifically constructed at the University of Mannheim [13] for solving the TRT problem of section V-G.2. Many similar application-specific machines have been built in recent years: their reconfigurable nature is exploited only while developing and debugging the application. Once complete, the final configuration is frozen, once and for all, until the next "hardware release".
Commercial products already exist: QuickTurn [14] sells large configurable systems, dedicated to hardware emulation. Compugen [15] sells modular PAM-like hardware, together with several configurations focusing on genetic matching algorithms. More systems exist than just the ones mentioned here.
A thorough presentation of the issues involved in PAM design, with alternative implementation choices, is given by Bertin [16] .
IV. PAM programming

A PAM program consists of three parts:
1. The driving software, which runs on the host and controls the PAM hardware.
2. The logic equations describing the synchronous hardware implemented on the PAM board.
3. The placement and routing directives that guide the implementation of the logic equations onto the PAM board.

The driving software is written in C or C++ and is linked to a runtime library encapsulating a device driver. The logic equations and the placement and routing directives are generated algorithmically by a C++ program. As a deliberate choice of methodology, all PAM design circuits are digital and synchronous. Asynchronous features (such as RAM write pulses, FIFO flags decoding or clock tuning) are pushed into the firmware (POM), where they get implemented once and for all.
A full P1 design is a large piece of hardware: excluding the RAM, twenty-three XC3090 containing 15k PABs are roughly the equivalent of 200k gates. This amount of logic would barely fit in the largest gate arrays available in 1994.
The goal of a P1 designer is to encode, through a 1.5 Mb bitstream, the logic equations, the placement and the routing of fifteen thousand PABs in order to meet the performance requirements of a compute-intensive task. To achieve this goal with a reasonable degree of efficiency, a designer needs full control over the final logic implementation and layout. In 1992, no existing computer-aided design (CAD) tool was adapted to such needs.
Emerging synthesis tools were too wasteful in circuit area and delay. One has to keep in mind that we already pay a performance penalty by using SRAM-based FPGAs instead of raw silicon. Complex designs can be synthesized, placed and routed automatically only when they do not attempt to reach high device utilization; even then, the resulting circuitry is significantly slower than what can be achieved by careful hand placement.
Careful low-level circuit implementation has always been possible through a painful and laborious process: schematic capture. For PAM programming, schematic capture is not a viable alternative: it can provide the best performance, but it is too labor intensive for large designs.
Given these constraints, we have but one choice: a middle-ground approach where designs are described algorithmically at the structural level, and the structure can be annotated with geometry and routing information to help generate the final physical design.
A. Programming tools
We first had to choose a programming language to describe circuits. Three choices were possible: a general-purpose programming language such as C++, a hardware description language such as VHDL, or our own language. We do not discuss the latter approach here; it is the subject of current research.
We decided to use C++ for reasons of economy and simplicity. VHDL is a complex, expensive language. C++ programming environments are considerably cheaper, and we are tapping a much wider market in terms of training, documentation and programming tools. Though we had to develop a generic software library to handle netlist generation and simulation, the amount of work remains limited. Moreover, we keep full control over the generated netlist, and we can include circuit geometry information as desired.
A.1 The netlist library
To describe synchronous circuits with our C++ library is straightforward. We introduce a new type Net, overload the boolean operators to describe combinational logic, and add a primitive for the synchronous register. From these, a C++ program can be written which generates a netlist representing any synchronous circuit. This type of low-level description is made convenient by the use of basic programming techniques such as arrays, for loops, procedures and data abstraction. Figure 5 shows, for example, a piece of code representing a generic n-bit ripple-carry adder. 
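To make the approach concrete, here is a minimal sketch of what such a library could look like. It is not the authors' library and does not reproduce figure 5: the names `Netlist`, `Node`, `input`, `reg`, `adder` and `step` are hypothetical. It shows a `Net` type, overloaded boolean operators that append gates to a netlist, a register primitive, a generic n-bit ripple-carry adder written with an ordinary loop, and a small cycle-level simulator.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical miniature netlist library, for illustration only.
enum class Op { Input, And, Or, Xor, Not, Reg };
struct Node { Op op; int a; int b; };  // a, b: operand node indices (-1 if unused)

struct Netlist {
    std::vector<Node> nodes;
    int add(Op op, int a = -1, int b = -1) {
        nodes.push_back({op, a, b});
        return static_cast<int>(nodes.size()) - 1;
    }
};

// A Net is a handle on one node of the netlist being built.
struct Net { Netlist* nl; int id; };

Net input(Netlist& nl) { return {&nl, nl.add(Op::Input)}; }
Net operator&(Net x, Net y) { return {x.nl, x.nl->add(Op::And, x.id, y.id)}; }
Net operator|(Net x, Net y) { return {x.nl, x.nl->add(Op::Or, x.id, y.id)}; }
Net operator^(Net x, Net y) { return {x.nl, x.nl->add(Op::Xor, x.id, y.id)}; }
Net operator~(Net x) { return {x.nl, x.nl->add(Op::Not, x.id)}; }
// Synchronous register primitive: the output is the input delayed by one clock.
Net reg(Net d) { return {d.nl, d.nl->add(Op::Reg, d.id)}; }

// Generic n-bit ripple-carry adder: returns n sum bits, then the carry out.
std::vector<Net> adder(const std::vector<Net>& a, const std::vector<Net>& b, Net cin) {
    assert(a.size() == b.size());
    std::vector<Net> out;
    Net carry = cin;
    for (std::size_t i = 0; i < a.size(); ++i) {
        Net p = a[i] ^ b[i];
        out.push_back(p ^ carry);
        carry = (a[i] & b[i]) | (p & carry);
    }
    out.push_back(carry);
    return out;
}

// Simulate one clock cycle. Nodes are evaluated in creation order, which is
// valid because operands are always created before their users. `in` lists
// the input values in creation order; `state` holds the register contents
// and is updated at the end of the cycle.
std::vector<bool> step(const Netlist& nl, const std::vector<bool>& in,
                       std::vector<bool>& state) {
    state.resize(nl.nodes.size(), false);
    std::vector<bool> v(nl.nodes.size(), false);
    std::size_t next = 0;
    for (std::size_t i = 0; i < nl.nodes.size(); ++i) {
        const Node& n = nl.nodes[i];
        switch (n.op) {
            case Op::Input: v[i] = in.at(next++); break;
            case Op::And:   v[i] = v[n.a] && v[n.b]; break;
            case Op::Or:    v[i] = v[n.a] || v[n.b]; break;
            case Op::Xor:   v[i] = v[n.a] != v[n.b]; break;
            case Op::Not:   v[i] = !v[n.a]; break;
            case Op::Reg:   v[i] = state[i]; break;
        }
    }
    for (std::size_t i = 0; i < nl.nodes.size(); ++i)
        if (nl.nodes[i].op == Op::Reg) state[i] = v[nl.nodes[i].a];
    return v;
}
```

In this sketch the in-memory `nodes` vector plays the role of the netlist: the same structure could be walked to emit a textual format or, as here, simulated directly. Feedback loops through registers would need a forward-declaration mechanism that this sketch omits.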
The execution of such a program builds a netlist in memory; this netlist can be analyzed and translated into an appropriate format (XNF or EDIF), or used directly for simulation. Linking a netlist description program with behavioral code yields mixed-mode simulation with no special effort.
Since we have direct access to the netlist at this level of description, we can easily annotate logic operators with placement directives. For example, to specify that our ripple-carry adder should be aligned vertically, with the paired carry and sum bits generated by the same logic block, we simply add the lines shown in figure 6 to the description of the adder. Contrary to the silicon compilers from a decade ago [17], these placement annotations do not affect the logic behavior of the generated netlist. They do not specify contacts; they only specify the partitioning of logic into physical blocks and the absolute or relative placement of these blocks in a two-dimensional grid. A back-end tool analyzes these attributes and emulates the interface of a schematic capture software in order to guarantee that the placement and logic partitioning information is preserved by the FPGA vendor software.
A.2 The runtime library
At the system level, the programming environment provides two main functions: a device driver interface, and full simulation support of that interface. This simulation capability allows the designer to operate together the hardware and software parts of a PAM program. The device driver interface provides the mandatory controls to the application program: the usual UNIX I/O interface with open, close, synchronous and asynchronous read and write; download of the configuration bitstreams for the PAM FPGAs; readback of their state (i.e. the values of all PAB registers); read and write of the PAM static RAMs; software control of the PAM board clock.
A.3 Lessons
The main lesson we draw from our experience with these programming tools is that PAM programming is much easier than ASIC development. Students with no electrical engineering background were able to use our tools after a few weeks of training. In particular, users can easily develop their own module generators in a matter of days, while only highly skilled engineers are able to write module generators for custom VLSI. This capability is one of the main reasons why we were able to develop such complex applications spanning dozens of chips, with engineers and students not previously exposed to PAM, each in a matter of months.
B. Debugging and optimization tools
Debugging of a PAM design can be done entirely through software. Mixed-mode simulation at the block level allows designers to certify datapath components before using them in complex designs. Full-system simulation eliminates the need for generating special input patterns to test the hardware part of the program. It also allows for hardware/software co-debugging: both application driver and hardware, working together.
After simple bugs have been removed, it becomes necessary to simulate the design on a large number of cycles. To do so, the most effective technique is to compile the design into a bitstream, download this bitstream into the board, and run the board in trace mode (single-step the clock, read back the board state at each cycle and collect these states for analysis; it is possible to run this mode at up to 100 Hz). In simple cases, this can be done with no modification to the runtime application source code. In complex cases, all necessary primitives are available to build application-specific code to generate and/or analyze the readback traces. P1's clock generator can also be operated in double-step mode. In that mode, the clock runs at full speed every second cycle. By comparing double-step traces taken at increasing clock frequencies with a previously recorded single-step trace, we can automatically locate the critical path of a design for a given execution. This method alleviates the need to rely on delay simulation as provided by the standard industrial simulation packages. It is necessary to perform that tedious task only once, when certifying the operating speed of the final design.
We developed a screen visualization tool called showRB to help analyze readback traces. It can display the state of every flip-flop in every FPGA of a PAM board, at the rate of tens of frames per second. In conjunction with the double-step mode, it can be used to detect critical paths along execution traces. Interestingly enough, such a tool also proved invaluable in demonstrating the structure of some hardware algorithms.
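The trace comparison described above amounts to finding the first register that disagrees with the reference. A minimal sketch follows, assuming each readback snapshot is stored as a vector of flip-flop values; the types `Snapshot`, `Trace` and `Mismatch` and the function `first_mismatch` are hypothetical names, not part of the actual runtime library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One readback snapshot: the value of every flip-flop after a clock cycle.
using Snapshot = std::vector<bool>;
using Trace = std::vector<Snapshot>;

struct Mismatch {
    bool found;            // false if the two traces agree everywhere
    std::size_t cycle;     // first cycle at which they differ
    std::size_t flipflop;  // index of the first differing flip-flop
};

// Compare a trace taken at a higher clock frequency against the single-step
// reference trace and report the first flip-flop that latched a wrong value.
Mismatch first_mismatch(const Trace& reference, const Trace& fast) {
    assert(reference.size() == fast.size());
    for (std::size_t c = 0; c < reference.size(); ++c) {
        assert(reference[c].size() == fast[c].size());
        for (std::size_t f = 0; f < reference[c].size(); ++f)
            if (reference[c][f] != fast[c][f]) return {true, c, f};
    }
    return {false, 0, 0};
}
```

Repeating the comparison at increasing frequencies, the first flip-flop to fail marks the end point of the slowest combinational path exercised by that execution.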
V. Applications
Our applications were chosen to span a wide range of current leading-edge computational challenges. In each case, we provide a brief description of the design, a performance comparison with similar reported work, and pointers to publications describing the work in more detail.
PAMs may be configured as long integer multipliers [18]. They compute the product $P = A \times B + S$, where $A$ is an $n$-bit long multiplier, and $B$, $S$ are arbitrary size multiplicands and summands [19]; $n$ may be up to 2k for the P1 implementation.
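For reference, the operation $P = A \times B + S$ can be written in software as a schoolbook multiply-accumulate over 32-bit digits. This is a plain software sketch of the arithmetic function only; it says nothing about how the hardware datapath is organized, and the names `Limbs` and `mul_add` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Little-endian digits in base 2^32.
using Limbs = std::vector<std::uint32_t>;

// Compute P = A * B + S on arbitrary-length operands.
Limbs mul_add(const Limbs& a, const Limbs& b, const Limbs& s) {
    // One extra digit is enough to hold the sum of the product and S.
    Limbs p(std::max(a.size() + b.size(), s.size()) + 1, 0);
    std::copy(s.begin(), s.end(), p.begin());
    for (std::size_t i = 0; i < a.size(); ++i) {
        std::uint64_t carry = 0;
        for (std::size_t j = 0; j < b.size(); ++j) {
            // Cannot overflow: (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1.
            std::uint64_t t = std::uint64_t{a[i]} * b[j] + p[i + j] + carry;
            p[i + j] = static_cast<std::uint32_t>(t);
            carry = t >> 32;
        }
        // Propagate the remaining carry into the higher digits.
        for (std::size_t k = i + b.size(); carry != 0; ++k) {
            assert(k < p.size());
            std::uint64_t t = std::uint64_t{p[k]} + carry;
            p[k] = static_cast<std::uint32_t>(t);
            carry = t >> 32;
        }
    }
    return p;
}
```

Folding the summand into the same pass is what lets a long product be accumulated block by block when one operand is wider than the multiplier.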
Our multipliers are interfaced with the public domain arbitrary-precision arithmetic package BigNum [20]: programs based on that software automatically benefit from the PAM, simply by linking with an appropriately modified BigNum library. P1 computes product bits at 66 Mb/s (using radix 4 operations at 33 MHz), which is faster than all previously published benchmarks. This is 16 times the figures reported by Buell and Ward [21] for the Cray II and Cyber 170/750. P1's multiplier can compute a 50-coefficient 16- [...]
A more aggressive multiplier design is reported by Louie and Ercegovac [22]: using radix 16 and deep pipeline, this multiplier operates at 79 MHz, which is 2.5 times faster than ours, within 3 times the area. At that speed, this design is faster than conventional multipliers, even for short 32-bit operands.
B. RSA cryptography
To investigate further the tradeoffs in our hybrid hardware and software system, we have focused on the RSA cryptosystem [23]. Both encryption and decryption involve computing modular exponentials, which can be decomposed as sequences of long modular multiplications, with operand sizes ranging from 256 bits to 1k bits.
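The decomposition of a modular exponential into modular multiplications is the classical square-and-multiply scheme. The sketch below illustrates it on toy 64-bit operands, far smaller than the 256-bit to 1k-bit operands discussed here; it is a generic textbook method, not a description of the PAM datapath, and the names `mulmod` and `powmod` are our own.

```cpp
#include <cassert>
#include <cstdint>

// Modular multiplication by shift-and-add, one multiplier bit per step.
// The modulus must be below 2^63 so that doubling never overflows.
std::uint64_t mulmod(std::uint64_t a, std::uint64_t b, std::uint64_t m) {
    assert(m > 0 && m < (std::uint64_t{1} << 63));
    a %= m;
    b %= m;
    std::uint64_t r = 0;
    while (b != 0) {
        if (b & 1u) { r += a; if (r >= m) r -= m; }
        a <<= 1;
        if (a >= m) a -= m;
        b >>= 1;
    }
    return r;
}

// Modular exponentiation: one squaring per exponent bit, plus one extra
// multiplication for each exponent bit that is set.
std::uint64_t powmod(std::uint64_t base, std::uint64_t exp, std::uint64_t m) {
    assert(m > 0);
    std::uint64_t result = 1 % m;
    base %= m;
    while (exp != 0) {
        if (exp & 1u) result = mulmod(result, base, m);
        base = mulmod(base, base, m);
        exp >>= 1;
    }
    return result;
}
```

With the textbook toy key $n = 3233$, $e = 17$, $d = 2753$, `powmod(65, 17, 3233)` encrypts the message 65 and `powmod` with exponent 2753 recovers it. The cost is dominated by the modular multiplications, which is why a fast long multiplier matters.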
Starting from the general-purpose multiplier above, we have implemented a series of systems spanning two orders of magnitude in performance, over three years.
Our first system [18] uses three differently programmed Perle-0 boards, all operating in parallel with the host. At 200 kb/s decoding speed, this was faster than all existing 512-bit RSA implementations, regardless of technology, in 1990. A survey by E. Brickell [24] grants the previous speed record for 512-bit key RSA decryption to an ASIC from AT&T, at 19 kb/s. The [...]
For 512-bit keys the same datapath delivers a decryption rate in excess of 300 kb/s although it uses only half the logic resources in P1.
PAM implementations of RSA rely on reconfigurability in many ways: we use a different PAM design for RSA encryption and decryption; we generate a different hardware modular multiplier for each different prime modulus, with the coefficients of the binary representation of each modulus hardwired into the logic equations of the design.
C. Molecular biology
Given an alphabet $A = (a_1, \ldots, a_n)$, a probability $(S_{ij})_{i,j=1 \ldots n}$ of substitution of $a_i$ by $a_j$, and a probability $(I_i)_{i=1 \ldots n}$ (resp. $(D_i)_{i=1 \ldots n}$) of insertion (resp. deletion) of $a_i$, one can use a classical dynamic programming algorithm to compute the probability of transforming a word $w_1$ over $A$ into another one $w_2$; this defines a distance between words in $A$. Applications include automated mail sorting through OCR scanners, on-the-fly keyboard spelling corrections, and DNA sequencing in biology.
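The structure of this dynamic programming algorithm is easiest to see in its simplest variant, where every substitution, insertion and deletion has a fixed integer cost instead of a per-symbol probability. The sketch below implements that simplified variant (with unit costs it is the Levenshtein distance); replacing the constants by per-symbol weights is left aside, and the function name `edit_distance` is our own.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Cost of transforming w1 into w2 with fixed costs per edit operation.
// Only two rows of the dynamic programming table are kept in memory.
std::size_t edit_distance(const std::string& w1, const std::string& w2,
                          std::size_t sub = 1, std::size_t ins = 1,
                          std::size_t del = 1) {
    assert(sub > 0 && ins > 0 && del > 0);
    std::vector<std::size_t> prev(w2.size() + 1), cur(w2.size() + 1);
    // Row 0: build the prefix of w2 from the empty word by insertions.
    for (std::size_t j = 0; j <= w2.size(); ++j) prev[j] = j * ins;
    for (std::size_t i = 1; i <= w1.size(); ++i) {
        cur[0] = i * del;  // erase the first i symbols of w1
        for (std::size_t j = 1; j <= w2.size(); ++j) {
            std::size_t match = prev[j - 1] + (w1[i - 1] == w2[j - 1] ? 0 : sub);
            cur[j] = std::min({match, prev[j] + del, cur[j - 1] + ins});
        }
        std::swap(prev, cur);
    }
    return prev[w2.size()];
}
```

Each table entry depends only on its three neighbors, so all entries on one anti-diagonal can be computed at the same time, which is what makes the recurrence attractive for a hardware pipeline.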
D. Lavenier from IRISA (Rennes, France) has implemented this algorithm with a Perle-0 design which computes the distance between an input word and all 30k words in a dictionary; it reports the k words found in the dictionary which are closest to the input. The system processes 200k words/s which is faster than a solution previously implemented at CNET using 12 Transputers. It has only half of the performance obtained by a system previously developed at IRISA based on 28 custom VLSI chips and two printed-circuit boards.
The DNA matching algorithm [26] is one of the driving applications for the PAM developed at the Supercomputing Research Center in Maryland [12] : the reported performance is, here again, in excess of that obtained with existing supercomputers.
The Compugen commercial company [15] sells the Bioccelerator, a PAM which can be configured as a number of molecular biology search functions. This device is a coprocessor to a host server; it can be accessed through remote procedure call from any workstation on the network.
It is interfaced with a widely used software package and its use is transparent, except for the speed-up advantages.
D. Heat and Laplace equations
Solving the heat and Laplace equations has numerous applications in mechanics, integrated circuit technology, fluid dynamics, electrostatics, optics and finance [27].
The classical finite difference method [28] provides computational solutions to the heat and Laplace equations. Vuillemin [29] shows how to speed up this computation with help from special-purpose hardware. A first implementation of the method on P1, by Vuillemin and Rocheteau [29], operates with a pipeline depth of 128 operators. Each operator computes [...]

S. Hadinger and P. Raynaud-Richard further improved the implementation [32]. Refining the statistical analysis, they show that the datapath width can be reduced to 16 bits provided the rounding-off of the low-order bit is done randomly: with all deterministic round-off schemes, parasitic stable solutions exist which significantly perturb the result. Their implementation therefore uses a 64-bit linear feedback shift-register to randomly set the rounding direction for each processing stage.
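The idea of random rounding can be sketched as follows: when the low-order bit of a value is dropped, a pseudo-random bit decides whether to round up or down. The code below is an illustration under our own assumptions, not the design of [32]: the tap positions (64, 63, 61, 60) come from commonly published LFSR tables rather than from the paper, and the names `Lfsr64` and `round_random` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// 64-bit Fibonacci linear feedback shift register.
// Assumption: taps at bit positions 64, 63, 61 and 60 (numbered from 1).
struct Lfsr64 {
    std::uint64_t state;
    explicit Lfsr64(std::uint64_t seed) : state(seed) {
        assert(seed != 0);  // the all-zero state would never change
    }
    // Shift once and return the new pseudo-random bit.
    bool next() {
        std::uint64_t bit = ((state >> 63) ^ (state >> 62) ^
                             (state >> 60) ^ (state >> 59)) & 1u;
        state = (state << 1) | bit;
        return bit != 0;
    }
};

// Halve x, dropping its low-order bit. When that bit is set, the result is
// rounded up or down according to the next pseudo-random bit.
std::uint32_t round_random(std::uint32_t x, Lfsr64& rng) {
    std::uint32_t half = x >> 1;
    if ((x & 1u) != 0 && rng.next()) ++half;
    return half;
}
```

Exact halves are unaffected; only values that fall between two representable results are rounded in a direction that varies from one operation to the next, so no fixed bias accumulates along the pipeline.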
The width reduction in the datapath allows us to extend the pipeline length to 256, pushing the equivalent processing power up to 39 GIPS. Using P1's fast DMA-based I/O capabilities and a large buffer of host memory, this design can accurately simulate the evolution of temperature over time in a 3-D volume, discretized on $512^3$ points, with arbitrary power source distributions on the boundaries. It also supports the use of multigrid simulation, where one "zooms out" to coarser discretization grids in order to rapidly advance in simulated time, then "zooms back in" to full resolution, in order to accurately smooth out the desired final result.
E. Neural networks
M. Skubiszewski [33] [34] has implemented a hardware emulator for binary neural networks, based on the Boltzmann machine model.
The Boltzmann machine is a probabilistic algorithm which minimizes quadratic forms over binary variables, i.e. [...]

The latest P1 realization solves problems with 1400 binary variables, using 16-bit weights, for a total computing power of 500 megasynapses per second. (The megasynapse is the traditional unit used in this field; it amounts to one million additions and multiplications by small coefficients.)
F. Multi-standard video compression
In view of the required input bandwidth (30 MB/s for standard TV color images) and the amount of computation required by current standards (resp. 3, 4 and 8 Gop/s for JPEG, DCT 3D and MPEG), custom hardware is currently necessary for compressing video in real time.
Matters get complicated, as several different video compression standards are emerging. The following shows how a single configurable system such as P1 [...]
G. High-energy physics
G.1 Image classification
The calorimeter is part of a series of benchmarks proposed by CERN [36]. The goal is to measure the performance of various computer architectures, in order to build the electronics required for the Large Hadron Collider (LHC), before the turn of the millennium. The calorimeter is challenging, and well documented: CERN benchmarks seven different electronic boxes, including some of the fastest current computers, with architectures as different as DSP-based multiprocessors, systolic arrays and massively parallel systems.
This problem is typical of high-energy physics data acquisition and filtering: $20 \times 20 \times 32$b images are input every 10 µs from the particle detectors, and one must discriminate within a few µs whether the image is interesting or not. This is achieved by computing some simple statistics on it (maximum value, second-order moment, ...) and using them to decide whether or not a sharp peak is present (figure 12). What makes the problem difficult here are the high input bandwidth (160 MB/s) and the low latency constraint.
Fig. 12. Calorimeter typical input images (hadron jet, electron).

Vuillemin [7] analyzes in detail the possible implementations of the calorimeter, on both general-purpose computer architectures (single and multi processors, SIMD and MIMD) and special-purpose electronics (full-custom, gate-array, FPGAs). The conclusion provides an accurate quantitative analysis of the computing power required for this task: the PAM is the only structure found to meet this bound.
This algorithm was implemented by P. Boucard and J. Vuillemin on P1 [37] [38]. Using the external I/O capabilities described in section III-C, data is input from the detectors through two off-the-shelf HIPPI-to-TURBOchannel interface boards plugged directly onto P1. The datapath itself uses about half of P1's logic and RAM resources, for a virtual computing power of 39 GBOPS (figure 13).
G.2 Image analysis
The Transition Radiation Tracker (TRT) is another benchmark from CERN, analyzed in the same report [36]. The problem is to find straight lines (particle trajectories) in a noisy digital black and white image.
The algorithm used is based on the classical Hough transform: first compute the number of active ("on"
and the low latency requirement ( 2 images) preclude any implementation solution other than one using specialized hardware, as shown by CERN [36].
R. Männer and his team from University of Mannheim [13] have successfully built the specialized FPGA-based ENABLE machine for solving this problem, using the straightforward O(N^3) implementation of the Hough transform. It computes the score for all lines of 16 different slopes crossing a 128 × 96 grid at the required 100 kHz rate, with a latency of 2 images (20 µs). It needs more than twice the computing power of P1 to achieve this result.
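To make the cost of the straightforward approach concrete, here is a minimal Python sketch of direct line scoring. It is not the ENABLE design, whose details are not given here. The parameterization of a line by a starting column and a per-row column shift, and the names `hough_scores` and `best_line`, are assumptions made for illustration.

```python
def hough_scores(image, slopes):
    """Count active pixels along every candidate line.

    image  : list of rows, each a list of 0/1 pixels.
    slopes : column shift per row (assumed parameterization).
    A line is x = x0 + round(slope * y) for each row y.
    The triple loop costs slopes * width * height steps.
    """
    height = len(image)
    width = len(image[0])
    scores = {}
    for slope in slopes:
        for x0 in range(width):
            count = 0
            for y in range(height):
                x = x0 + round(slope * y)
                if 0 <= x < width:
                    count += image[y][x]
            scores[(slope, x0)] = count
    return scores


def best_line(image, slopes):
    """Return the (slope, x0) pair with the highest score."""
    scores = hough_scores(image, slopes)
    return max(scores, key=scores.get)
```

With 16 slopes on a 128 × 96 grid, this performs 16 × 128 × 96 = 196,608 pixel reads per image, which must be sustained 100,000 times per second at the rate quoted above.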
J. Vuillemin [39] describes an O(N^2 log N) algorithm to compute the Hough transform, in a recursive way analogous to the Fast Fourier Transform (figure 14). The resulting gain in the processing power needed by the computation makes it just possible to fit it in one P1 board.
This was implemented by L. Moll, P. Boucard and J. Vuillemin [37] [38]. As above, data is directly input from the detectors through two HIPPI-to-TURBOchannel boards plugged in P1's extension slots. The design computes 31 slopes at the required 100 kHz rate with a latency of 1 image (10 µs). A 64-bit sequential processor would need to run at 1.2 GHz to achieve the same computation.
G.3 Cluster detection
The NESTOR Neutrino Telescope under construction in the Mediterranean near Pylos, Greece, is a three-dimensional array of 168 photomultiplier tubes (PMTs) designed to detect Cherenkov radiation from fast muons created by neutrino interactions. Clustered detections from actual Cherenkov-generated photons are expected to happen at a maximum rate of a few per second, while the background noise originating from bioluminescence and radioactive potassium (⁴⁰K) causes random PMT firings at a rate of 100 kHz per PMT.
A P1 board will be used to process the raw data and detect muon trajectories³, by looking for space- and time-correlation among events. The peak and average data rates are 500 MB/s and 100 MB/s respectively. Data enters directly through P1's 256b-wide daughter-board connectors (see section III-C). Provided the peak data rate can be accommodated (which is the case with the P1 solution), subsequent processing is straightforward (see Katsanevas et al. [40] for details).
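The time-correlation part of such a trigger can be pictured with a small sliding-window sketch in Python. This is only an assumed illustration: the actual NESTOR processing is described in [40], and the window length, the `min_pmts` threshold and the names below are hypothetical.

```python
from collections import Counter

PMT_COUNT = 168
NOISE_RATE_HZ = 100_000  # random firings per PMT, from the text


def background_hits_per_second():
    """Aggregate random firing rate over the whole array."""
    return PMT_COUNT * NOISE_RATE_HZ


def find_clusters(hits, window_ns, min_pmts):
    """Report time windows in which at least min_pmts distinct PMTs fired.

    hits : iterable of (time_ns, pmt_id) pairs.
    Returns a list of (window_start_time, sorted PMT ids).
    """
    hits = sorted(hits)
    clusters = []
    counts = Counter()  # PMT id -> hits currently inside the window
    start = 0
    last_reported_end = -1
    for end, (t_end, pmt) in enumerate(hits):
        counts[pmt] += 1
        # Drop hits that fell out of the time window.
        while t_end - hits[start][0] > window_ns:
            old = hits[start][1]
            counts[old] -= 1
            if counts[old] == 0:
                del counts[old]
            start += 1
        # Report each cluster once, when the threshold is first reached.
        if len(counts) >= min_pmts and start > last_reported_end:
            clusters.append((hits[start][0], sorted(counts)))
            last_reported_end = end
    return clusters
```

The sketch only shows why isolated noise hits are cheap to reject once the full data rate can be absorbed; spatial correlation between PMT positions is left out.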
³ In high-energy physics terminology, this is the first-level trigger.

H. Image acquisition

P1's TURBOchannel adapter (see section III-C), being built around a single XC3090, is a PAM in its own right, albeit a small one. M. Shand [41] describes a number of experiments based on this board, including an interface to a large frame CCD camera [42]. This camera delivers image data at 10 MB/s with no flow control. Conventionally an interface for such a camera would use a dedicated frame buffer. Our interface dispenses with this buffer by transferring the incoming image data directly into system memory, using Direct Memory Access (DMA) over the TURBOchannel. In addition to the obvious cost savings of eliminating the frame buffer memory, use of system memory makes the captured image immediately available to software and allows the system to capture images continuously. These attributes prove essential to one use of this interface: the principal image acquisition system at the Swedish Vacuum Solar Telescope, where the system has been in use since May 1993 [43].
The success of this small PAM (or PAMette) has led us to develop a new PAM board, I/O-oriented and of small size, to explore these new kinds of application. M. Shand, in collaboration with G. Scharmer and Wang Wei of the Swedish Royal Observatory, is investigating the use of this board in an adaptive optics system combining image acquisition, image processing, and on-the-fly servo control.
I. Stereo vision
Part of the research on stereo vision at INRIA⁴ is focused on computing dense, accurate and reliable range maps, from simultaneous images obtained by two cameras. The selected stereo matching algorithm is presented by Faugeras et al. [44]: a recursive implementation of the score computation makes the method independent of the size of the correlation window, and the calibration method does not require the use of a calibration pattern.
Stereo matching is integrated in the navigation system of the INRIA cart, and used to correct for inertial and odometric navigation errors. Another application, jointly with CNES⁵, uses stereo to construct digital elevation maps for a future planetary rover.
A software implementation of the selected method computes the correlation between a pair of images in 59 seconds on a SPARCStation II. A dedicated hardware implementation using four digital signal processors (DSP), developed jointly by INRIA and Matra MSII, performs the same task in 9.6 seconds. A P1 implementation of the very same algorithm by L. Moll [45]

In order to explore the digital signal processing domain, D. Roncin and P. Boucard implemented a real-time digital audio synthesizer on P1, capable of producing up to 256 independent voices at a sampling rate of 44.1 kHz. Primarily designed for the use of additive synthesis techniques based on lookup-tables, this implementation includes features which allow frequency-modulation synthesis and/or non-linear distortion and can also be used as a sampling machine.
This design contains 4 MB of wave-table memory, shared by the 256 voice generators, which can be partitioned into sub-tables of various sizes allowing the simultaneous use of up to 1k different sound patterns. It also includes an output mixing section and global control.
Each of the 256 voices consists of:
- A phase computation section, which computes the index of a voice sample in the selected wave-table (using 24-bit arithmetic). Using the output of another voice in this computation leads to frequency modulation and non-linear distortion.
- An envelope generator and static level section, which computes the amplitude value for the current sample (also using 24-bit arithmetic) and combines it with the output of the wave-table to produce the amplitude modulated sample. Dynamic amplitude envelopes are generated using linked linear segment techniques.
- A control section, which defines the operating mode of the voice: normal oscillator, carrier operator for frequency modulation, non-linear transfer function operator, free-running or single shot, synchronous phase operation, wave-table size and location selection, output channel selection, ...

The output mixing section contains four 32-bit accumulators, which connect to two SPDIF⁶ (stereo) digital audio output ports. Synthesizing this standard consumer audio format allows for the direct connection of P1 to an off-the-shelf tape recorder or audio amplifier, through a mere cable.

⁵ Centre National d'Etudes Spatiales, France

All parameters and controls can be updated by the host at any time in parallel with the running synthesis. At 22 MHz, this design produces 11M samples per second, which amounts to about 22M 16 × 16-bit multiplications, 100M ALU operations and 45M load/store operations. A software implementation of this algorithm running on standard CPUs shows that the DECPeRLe-1 implementation is equivalent to a computing power of about 2 GIPS. A simpler version of this design has been ported to a standard DSP processor (27-MHz Motorola 56001). The DSP is capable of computing only 24 voices at the required sampling rate, less than one tenth the number computed by P1.
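The phase computation of one voice can be sketched in Python as a 24-bit phase accumulator indexing a wave-table. This is a simplified software model under stated assumptions, not the P1 datapath: the `Voice` class, the sine table, the floating-point `level` and the power-of-two table size are all hypothetical choices, and envelopes and mixing are left out.

```python
import math

PHASE_BITS = 24
PHASE_MASK = (1 << PHASE_BITS) - 1


def make_sine_table(size=1024, amplitude=32767):
    """One period of a sine wave; size is assumed to be a power of two."""
    return [round(amplitude * math.sin(2 * math.pi * i / size))
            for i in range(size)]


class Voice:
    """Wave-table oscillator with a 24-bit phase accumulator."""

    def __init__(self, table, frequency_hz, sample_rate_hz=44100, level=1.0):
        self.table = table
        self.phase = 0
        # Phase advance per output sample, in 24-bit fixed point.
        self.increment = round(
            frequency_hz * (1 << PHASE_BITS) / sample_rate_hz) & PHASE_MASK
        self.level = level

    def next_sample(self, modulation=0):
        """Return one sample; `modulation` offsets the phase (FM input)."""
        index_bits = len(self.table).bit_length() - 1
        phase = (self.phase + modulation) & PHASE_MASK
        index = phase >> (PHASE_BITS - index_bits)
        sample = self.table[index] * self.level
        self.phase = (self.phase + self.increment) & PHASE_MASK
        return sample


def total_sample_rate(voices=256, sample_rate_hz=44100):
    """Samples per second over all voices."""
    return voices * sample_rate_hz
```

With 256 voices at 44.1 kHz the total is 11,289,600 samples per second, consistent with the 11M figure above and with roughly two clock cycles per sample at 22 MHz.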
K. Long Viterbi Decoder
In many of today's digital communications systems the signal-to-noise ratio (SNR) of the link has become the most severe limitation. Convolutional encoding with maximum likelihood (Viterbi) decoding provides a means to improve the SNR of a link without increasing the power budget, and has become an important technique in satellite and deep-space communications systems.⁷ The coding gain of a Viterbi system is primarily determined by the constraint length K of the code, while the complexity of the decoder increases exponentially with K.
Today's VLSI implementations typically offer codes with K = 7 and K = 8. NASA's Galileo space probe is equipped with a constraint length 15 rate 1/4 encoder, for which a Viterbi decoder based on an array of 256 custom VLSI chips is being developed [46]. R. Keaney and D. Skellern from Macquarie University (Sydney, Australia), together with M. Shand and J. Vuillemin from PRL, have implemented a Viterbi decoder for the Galileo code on P1 [47]. Using on-board RAM to trace through the 2^14 possible states of the encoder, this design computes 4 states in parallel at each 40 ns clock cycle, for an overall decoding speed of 2 kb/s. The coding gain has been measured to be within 0.5 dB of the optimal gain for this particular code.
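To show why decoder complexity grows exponentially with K, here is a minimal hard-decision Viterbi decoder in Python. It is a generic textbook sketch, not the P1 design: the function names are hypothetical, and the small K = 3, rate 1/2 code used in the usage note (generators 7 and 5 in octal) is chosen only to keep the example short.

```python
def trellis_states(K):
    """Number of encoder states for constraint length K."""
    return 1 << (K - 1)


def parity(x):
    return bin(x).count("1") & 1


def encode(bits, generators, K):
    """Convolutional encoder; one output tuple per input bit."""
    state = 0
    out = []
    for b in bits:
        register = (b << (K - 1)) | state
        out.append(tuple(parity(register & g) for g in generators))
        state = register >> 1
    return out


def viterbi_decode(symbols, generators, K):
    """Hard-decision maximum likelihood decoding (Hamming metric)."""
    n_states = trellis_states(K)
    inf = float("inf")
    metrics = [0] + [inf] * (n_states - 1)  # encoder starts in state 0
    history = []
    for received in symbols:
        new_metrics = [inf] * n_states
        step = [None] * n_states
        for state in range(n_states):
            if metrics[state] == inf:
                continue
            for b in (0, 1):
                register = (b << (K - 1)) | state
                expected = tuple(parity(register & g) for g in generators)
                cost = metrics[state] + sum(
                    e != r for e, r in zip(expected, received))
                nxt = register >> 1
                if cost < new_metrics[nxt]:
                    new_metrics[nxt] = cost
                    step[nxt] = (state, b)  # survivor path
        metrics = new_metrics
        history.append(step)
    # Trace back from the best final state.
    state = min(range(n_states), key=metrics.__getitem__)
    decoded = []
    for step in reversed(history):
        state, b = step[state]
        decoded.append(b)
    decoded.reverse()
    return decoded
```

Every received symbol requires a pass over all 2^(K-1) states, which is 4 states for K = 3 but 16,384 for the constraint length 15 code discussed above.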
There is no analytical method to prove that a particular code provides the optimal coding gain for a given constraint length. Taking further advantage of PAM reconfigurability, this system will be used to perform a code search among constraint length 15 convolutional codes, by recompiling a new P1 configuration on-the-fly for each code.

Our particular choice of unit for measuring computing power is based on the 4-input combinational function⁸. A bit-serial binary adder, which is composed of two functions of three inputs, also counts for one unit. The accounting rules that follow, for arithmetic and logic operations over n-bit wide inputs, are thus straightforward:
- One (n + n → n + 1)-bit addition each nanosecond is worth n GBOPS. Subtraction, integer comparison and logical operations are bit-wise equivalent to addition.
- One (n × m → n + m)-bit multiplication each nanosecond is worth nm GBOPS. Division, integer shifts and transitive (see Vuillemin [49]) bit permutations are bit-wise equivalent to multiplication.

Due to the great variety of the operations required by each application, quantitative performance comparison between different computer architectures is a challenging art [50]. Millions of instructions per second (MIPS) and millions of floating-point operations per second (MFLOPS) are more traditional units for measuring computing power. By our definition, a 32-bit standard microprocessor⁹ operating at 100 MHz (100 MIPS) has a virtual computing power of 3.2 GBOPS, and a 200 MHz, 64-bit processor features 12.8 GBOPS. A 100-MHz, 64-bit floating-point multiplier delivering one operation per cycle (100 MFLOPS) would rate 281 GBOPS.
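The accounting rules above reduce to two small formulas, sketched here in Python. The function names are hypothetical. The floating-point case additionally assumes that only the 53-bit significand of a 64-bit IEEE 754 number enters the multiplication; that assumption is ours, and it reproduces the quoted 281 GBOPS figure.

```python
def addition_gbops(n_bits, clock_mhz):
    """n-bit addition (or comparable operation), one per clock cycle."""
    return n_bits * clock_mhz / 1000


def multiplication_gbops(n_bits, m_bits, clock_mhz):
    """n x m-bit multiplication, one per clock cycle."""
    return n_bits * m_bits * clock_mhz / 1000


# 32-bit microprocessor at 100 MHz, no hardware multiplier.
cpu32 = addition_gbops(32, 100)

# 64-bit processor at 200 MHz.
cpu64 = addition_gbops(64, 200)

# 64-bit floating-point multiplier at 100 MHz,
# assuming a 53 x 53-bit significand product.
fpu = multiplication_gbops(53, 53, 100)

print(cpu32, cpu64, round(fpu))
```

The three values come out as 3.2, 12.8 and 280.9 GBOPS, the last rounding to 281.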
It follows from this accounting that P1 has a virtual computing power which is higher than that of the fastest integer microprocessor existing in 1994.

⁸ The particular choice of the unit function only affects our measure by a constant factor, provided we keep bounded fan-in.
⁹ With no hardware multiplier.

We have shown that it is now possible to build high-performance PAMs, with applications in a large number of domains. Table II updates what is feasible within 1994 technology. The technology curves for PAM cost/performance derive from those for FPGA and static RAM [51]; we can use them as a basis for extrapolation, from now into the future.
Let us compare the respective merits of three possible implementation technologies, for a given specific high-performance system. High-performance means here that the computational requirement far exceeds the possibilities of the fastest microprocessor. That leaves three implementation possibilities: (1) program a parallel machine; (2) design a specific PAM configuration; (3) build a custom system. The first two involve only software; the third involves hardware as well. Let us review some of the comparative merits, for each technology.
1. Each reported PAM design was implemented and tested within one to three months, starting from the delivery of the specification software. This is roughly equivalent to the time it takes to implement a highly optimized software version of the same system on a supercomputer: both are technically challenging, yet both are orders of magnitude faster than what it takes to cast a system into custom ASICs and printed-circuit boards.
2. For many specific high-speed computational problems, PAM technology has now proved superior, both in performance and cost, to all current forms of general-purpose processing systems: pipelined machines, massively parallel ones, networks of microprocessors, ...
The cost of P1 is comparable to that of a high-end workstation. This is much lower than the cost of a supercomputer. Based on figures from McBryan [30], the price (in $ per operation per second) for solving the heat and Laplace equations is 100 times higher with supercomputers than with P1.
3. PAM technology is currently best applied to low-level, massively repetitive tasks such as image or signal processing. Due to their software complexity, many current supercomputer applications still remain outside the possibilities of current PAM technology.
4. For many real-time problems, PAMs already have performance and cost equal to those of specific, custom systems: the lower the volume, the better for the PAM. By tuning a specific application for a PAM, we have shown that very high performance implementations are possible. For at least six of the cases presented in section V, the performance achieved by our P1 implementation exceeds, by at least one order of magnitude, those of any other implementation, including custom VLSI-based ones.
5. An important set of applications is accessible only through PAM technology: high-bandwidth interfaces to the external world, with a fully programmable, real-time capability. P1 has 256-bit wide connectors, capable of delivering up to 1.2 GB/s of external I/O bandwidth. It is then a "simple matter of hardware programming" to interface directly to any electrically-compatible external device, by programming its communication protocol into the PAM itself. Applications include high-bandwidth networks, audio and video input or output devices, and data acquisition.
