The future Large Hadron Collider (LHC) 
The Large Hadron Collider: a triggering challenge
The community of High Energy Physics is about to decide to go forward with the next generation collider to be built at CERN, the 'Large Hadron Collider' or LHC. This new instrument will allow the international community of researchers to explore unknown areas of physics at the smallest scale, as it collides two counter-rotating beams of protons each at an energy of 7000 GeV, not attainable today. The development of critical components for this collider, to be installed in the existing LEP underground ring, is well advanced: in particular, the critical superconducting magnets with fields of more than 9 Tesla have been industrially produced and successfully tested. Experimentation in that ring is expected to start at the beginning of the next century, in an optimistic scenario around the year 2002.
One characteristic property of the future collider [1] arises from the fact that the collisions giving clues to the physics of interest are rare: their ratio to the overall collision rate is very small, of the order of one in a million. The accelerator builders, therefore, put their ingenuity to work to achieve the highest possible 'luminosity', i.e. beam density and collision rate. They do this by finely focusing the largest possible number of protons into packets ('bunches') which follow each other at very short time intervals (25 nsec, corresponding to an inter-bunch distance of only about eight meters at the speed of light). The detectors studying the collisions will then have to deal with very high rates of events and must attempt to achieve a time separation that takes advantage of the bunch structure, the limit being that some of the physics processes put to use in detectors take longer than the bunch separation.
Although only partially true at high luminosity (multiple collisions will occur in a single bunch crossing, and appear as one 'supercollision'), let us assume that the problem of separating collisions recorded in the detector into individual signals can be solved. There remains yet another challenge, though: to use the signals from a single collision, or at least a subset of them, in order to decide if the collision at hand should be analyzed in more detail and eventually recorded. The detectors are, of course, constructed to provide signals corresponding to the signatures of interesting collisions, in nearly all cases characterized by high transverse momenta and by the occurrence of leptons (electrons, muons, tau-s, and neutrinos). This selection procedure of entire collisions is called 'triggering', and is familiar to physicists from past experiments, albeit at rates much lower than those imposed by the LHC. Our present contribution discusses briefly the structure of triggers at the LHC, and a specific implementation possibility of one critical trigger part.
Trigger structure
Physics at the LHC will start with a primary event rate of 40 MHz, the bunch crossing frequency. Each event is characterized by several Mbytes of information, once it is fully digitized. In real time at high frequency before rate-reducing triggers, this volume of information is transmitted (in analog or digital form) in parallel into many thousands of individual buffers, with characteristics specific to the different subdetectors. The task of the trigger is to find the small number of interesting physics events, not more than a few (to be specific, certainly less than a hundred) per second. A succession of event selection algorithms is applied; close to the detectors they have to run at bunch crossing frequency and must be simple enough to be implemented in custom-made or specifically adapted hardware, with limited or no programmability. As there is a finite latency, viz. delay between the availability of data and a final 'yes/no' decision, transmission and operations have to be pipelined and all data stored in a buffer, avoiding dead time as much as possible. Schematically, this is shown in figure 1. As the successive stages of event selection reduce the rates, algorithms of increasing complexity, implemented in processors of some generality, become both necessary and possible. We concentrate in this paper on an implementation of 'second-level' algorithms, where the assumed input rate of events does not exceed 100 kHz. The algorithms to be implemented at this level, in order to achieve another rate reduction by a factor of ~100, are experiment-dependent; examples are discussed below.
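The rates quoted above fix the rejection factor that each trigger level has to deliver. The following minimal sketch only restates that arithmetic; the constant and function names are ours, and the recorded rate is taken at its stated upper bound of one hundred events per second, which is an assumption made for illustration.

```python
# Rates quoted in the text, in Hz. RECORDED_MAX_HZ is an upper bound, not a design value.
BUNCH_CROSSING_HZ = 40e6   # primary event rate (bunch crossing frequency)
LEVEL2_INPUT_HZ = 100e3    # assumed maximum input rate of the second level
LEVEL2_REDUCTION = 100     # "factor of ~100" rate reduction at the second level
RECORDED_MAX_HZ = 100      # "certainly less than a hundred" events per second


def rejection_factors():
    """Return the rejection factor implied for each trigger level."""
    level2_output_hz = LEVEL2_INPUT_HZ / LEVEL2_REDUCTION
    return {
        "level1": BUNCH_CROSSING_HZ / LEVEL2_INPUT_HZ,
        "level2": LEVEL2_REDUCTION,
        # Lower bound, because the recorded rate is itself an upper bound.
        "level3_min": level2_output_hz / RECORDED_MAX_HZ,
        "overall_min": BUNCH_CROSSING_HZ / RECORDED_MAX_HZ,
    }


def decision_period_us(rate_hz):
    """Average time between successive events at a given rate, in microseconds."""
    return 1e6 / rate_hz
```

At 100 kHz the average interval between events is 10 microseconds. This is the sustained decision frequency, not a bound on latency: with pipelining and parallel processing, an individual decision may take longer.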
This somewhat idealized trigger structure is completed by a 'third-level' trigger. We assume that it can be implemented as a group of general-purpose processors, each of which is served a full event (~ 1000 events/second), and executes high-level code that allows a final data reduction in real time.
Functional structure of the second-level trigger
In R&D work over the last few years 1 , several guiding principles have emerged which are by now quite generally accepted. One of them is the fundamental Region-of-Interest (RoI) concept for level-2 triggering, very critical at least at high luminosity (viz. at high level-1 rates). The RoI concept relies on the level-1 trigger to identify those parts of the detector containing candidate features (electrons, photons, muons, jets). Only the data fraction in this candidate region (of order a few % of the total) is considered relevant and is moved to the level-2 processors; the restricted readout alleviates the very stringent bandwidth requirements on providing data for algorithms from distributed buffers, at high frequency.
Simultaneously, the RoI concept considerably simplifies algorithms: local algorithms convert limited data from a single subdetector into variables containing the relevant physics message ('features'), in order to confirm or disprove a known hypothesis.
Another principle is to allow in level 2 for algorithms using full-granularity, full-precision data on all characteristic detectors. The extent to which data with full precision are effectively needed in level 2 will have to be the subject of future detailed physics-driven studies, and may even depend on the target physics. In general, the option for use of full data is expected, and systems must provide for this. Reducing the requirements on precision in the trigger will, of course, result in architectural simplification and thus cost savings.
A further guideline for all implementations is that any proposed or demonstrated hardware solution must be envisaged to hold for the entire detector, or at least for several detector components, including the usual constraints of flexibility, robustness, ease of control and maintenance. It also must be readily embedded in the overall data acquisition system. As we rely as much as possible on commercially available components, and are dealing with a market that evolves fast and is not driven by our application, we will have to ensure that future technology improvements can be absorbed as easily as possible in the system, the limit being that some new technologies will, obviously, require architectural adaptation. These criteria translate into a maximum use of standard interfaces, and in particular the introduction of as few different components as possible (viz. detector-independent solutions).
We can decompose the problem of second-level triggering in some detail, using as model the detector design as pursued in the ongoing work on LHC experimental proposals (two experiments are planned, they are known under the names ATLAS and CMS). This structuring of the problem is vital to limit the choices of transmission and processing technology, and to introduce the use of natural parallelism. The following paragraphs outline the functional decomposition of the problem into three phases.
Phase 1 consists of the buffering of level-1 selected data, and the collection of regions of interest. Raw detector data, after a level-1 (L1) trigger has occurred, are naturally transmitted via cables or fibers and collected in multiple local non-overlapping memory modules, the L2 buffers. They hold information over the duration of level-2 (L2) operation. A level-1-guided device, the RoI-builder, indicates regions of interest, which in general extend across the boundaries of physical L2 buffers. The data pertaining to RoI-s are selected by some mechanism, which we term RoI collection, whose implementations can make use of both the subdetector and the RoI parallelisms.
Phase 2 consists of 'feature extraction', or local processing of the data in a single RoI of a subdetector. On data collected for a single RoI, a relatively simple feature extraction algorithm can perform a local pre-processing function. Features are variables containing the relevant physics message, like cluster or track parameters that can be used to corroborate or disprove the different physics hypotheses. Feature extraction algorithms have a locality that permits exploiting the natural double parallelism of RoI-s and subdetectors. Future simulation will have to show if and to what extent this simple concept has to be diluted (the fate of nearly all simple concepts), in order to avoid physics losses. This could be true in regions of overlap of detector parts (e.g. barrel/end cap), where each subdetector only has a weak signal. This paper concentrates on implementations of feature extraction algorithms for different types of subdetector.
Phase 3 contains global decision making on the basis of features, viz. processing full RoI-s and then full events. Figure 2 shows the natural and efficient order of processing: first, all subdetectors that have 'observed' the same physics 'object' are combined into decision variables which are again local (same RoI). This is followed by combining all RoI-s into an event decision. The bandwidth requirements are substantially reduced by feature extraction, so that implementations of this phase typically deal with fewer and smaller packets. The trigger frequency of 100 kHz is, of course, unchanged. We believe that in this (global) phase 3 of level 2, general programmability must provide the reserves for introducing new ideas quickly in an unknown future of experimentation. The present paper does not expand on this subject.
Algorithms
Our discussion concentrates on how the first two phases, RoI collection and feature extraction, can be implemented in programmable hardware, fast enough to follow the imposed decision frequency of 100 kHz, and general enough to be adapted to different detectors by reprogramming only. The focus is more on feature extraction than on RoI collection, as the latter is a largely non-algorithmic task requiring detailed knowledge of the detector readout. Feature extraction, on the other hand, results in algorithms that have properties of image processing tasks, particularly if data are transmitted in iconic (uncompressed) form. We present demonstrated implementations that fulfill all requirements outlined above, with hardware available today. The following examples are all taken from the proposed experiment ATLAS [2] , for which detailed information was available to us.
Router
The Router is the functional unit collecting data for regions of interest (RoI-s) for a single subdetector [3] . In a present implementation for the transition radiation tracker (TRT) prototype of RD-6 (see section 4.3 below), it is placed between the detector and the level-2 buffer, spying into the data flow and performing, beyond RoI collection, various functions of data formatting and preprocessing [4] .
These functions arise from the necessity to keep the feature extraction processors as free as possible from algorithm parts dealing with the physical layout of the detector readout and of decoding the transmission format. Data are locally arranged in a 'natural' order of all or large parts of the detector data, an 'image' of the data or (for thresholded data) ordered lists 1 . In the example of the TRT, data from groups of 16 planes in z-direction were transmitted in successive 32-bit HIPPI words, in separate groups for even and odd straws, and containing three successive bunch crossings (as the drift time of the chambers requires). The Router reduces the signals from three bunch crossings into one, and arranges the data, depending on straw signal amplitudes, into two binary images in memory corresponding to data with low and high threshold. Only the part of the image corresponding to an RoI (indicated from level 1's RoI builder) has to be transmitted to the feature extractor.
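The Router's formatting step for the TRT can be illustrated in a few lines. This is a sketch under stated assumptions, not the Router's actual logic: the text does not say how the three bunch crossings are merged, so we assume the largest of the three amplitudes is kept; the data layout (`samples[plane][straw]`) and all function names are hypothetical.

```python
def build_threshold_images(samples, low_thr, high_thr):
    """Build the low- and high-threshold binary images.

    samples[plane][straw] is a tuple of three amplitudes, one per bunch crossing
    (hypothetical layout). Each image is a list with one integer bit mask per
    plane, where bit number `straw` is set if that straw passed the threshold.
    """
    low_image, high_image = [], []
    for plane in samples:
        low_word = high_word = 0
        for straw, crossings in enumerate(plane):
            peak = max(crossings)  # ASSUMPTION: merge crossings by keeping the largest
            if peak >= low_thr:
                low_word |= 1 << straw
            if peak >= high_thr:
                high_word |= 1 << straw
        low_image.append(low_word)
        high_image.append(high_word)
    return low_image, high_image


def select_roi(image, first_straw, n_straws):
    """Cut the straw window [first_straw, first_straw + n_straws) out of every plane."""
    mask = (1 << n_straws) - 1
    return [(word >> first_straw) & mask for word in image]
```

Only the output of `select_roi` would need to travel to the feature extractor, which is the bandwidth saving that the RoI concept aims at.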
For a later generalization of this Router unit to other and larger detectors the challenge consists in routing information from hundreds of data carriers (fibers) to a single output per RoI. Presently, a data-driven implementation using specialized connections and programmable devices (FPGA-s) has been proposed [5] . A particular difficulty arises from thresholded data transmitted in lists of dynamic length, as is necessary for low-occupancy devices like the Si tracker (SCT). Potentially, these data can be accumulated for each RoI in a variable-length list, one for every potential part of the RoI. Only the parts corresponding to active RoI-s will then be combined and transmitted for feature extraction.
Feature extraction in a Silicon Tracker (SCT)
Conceptually and in an overall detector environment, the conversion of raw data in an inner detector at radii up to 100 cm and at large angles (|η| < 1.4) was discussed in [6]. The layout of the Si tracker we have considered now is no longer the same, but the principle of the feature extraction algorithm has not changed: this detector is powerful by its high resolution in the bending plane (100 µm pitch in rφ), and characterized by a modest redundancy (4 high-precision hits in a good track). The ATLAS tracker is now arranged on 6 cylindrical surfaces, four of which are high precision in the rφ coordinate, and radii go from 30 to 105 cm. Strips are arranged on wafers of a size of roughly 6 x 6 cm²; pairs of wafers are read out via a single fiber in a thresholded format, indicating wafer and strip addresses for hits only. This thresholding, or zero-suppression, is needed because of the large number of strips, in excess of a million, and the low occupancy (1%, dominated by noise).
In our algorithm we use only four rφ layers, because the two remaining cylinders contain z-measuring pads, with no precision information in φ. The input to our algorithm then consists of the ordered lists of $n_a$, $n_b$, $n_c$, $n_d$ points in layers a, b, c, d at the radii $r_a$, $r_b$, $r_c$, $r_d$. We want to find at least 3 points aligned in a narrow road.
We first implemented a least-squares fit through all alignment permutations of the points. If $n = n_a + n_b + n_c + n_d$ is small, this 4-fold loop over the 4 lists a, b, c, d is acceptable. For each quadruple we have to perform the 5 least-squares fits through abcd, abcm, abmd, amcd, mbcd, where m stands for the missing layer. We pre-computed all this, and stored the residual of the best of these fits in a lookup table. When n gets large, however, the $n_a \cdot n_b \cdot n_c \cdot n_d$ necessary simple table lookups take too long. Although a simple "masking" operation can reduce the time to $n_a \cdot \max(n_b, n_c, n_d)$ table lookups (the candidate list is produced via 3 parallel lookup tables containing the possible combinations of layer $r_a$ with the 3 other layers $r_b$, $r_c$, $r_d$), we still find execution times increasing with the square of detector occupancy.
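The scaling argument can be checked with a small calculation. The function below simply evaluates the two lookup counts given in the text; its name and the hit counts used in the usage note are illustrative only.

```python
def lookup_counts(n_a, n_b, n_c, n_d):
    """Number of table lookups for the two combinatorial variants in the text."""
    return {
        # One lookup per quadruple of hits, one hit taken from each layer.
        "exhaustive": n_a * n_b * n_c * n_d,
        # With masking: layer a against the other three layers in parallel.
        "masked": n_a * max(n_b, n_c, n_d),
    }
```

Doubling the occupancy from 8 to 16 hits per layer raises the masked count from 64 to 256, a factor of four, which is the quadratic growth mentioned above. The exhaustive count grows from 4096 to 65536 over the same step.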
We then explored a quite different method and showed that it produces results of the same quality. In this algorithm, we form variable slope histograms with very narrow bins. As in all histogram techniques, one has to either use overlapping roads or combine 2 neighboring bins. This method behaves linearly in $n = n_a + n_b + n_c + n_d$, and has been selected for implementation in DECPeRLe-1.
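A software sketch of such a variable-slope histogram is given below. It is an illustration under assumptions, not the implemented design: we approximate a high-momentum track as a straight line φ = φ0 + slope·r, keep one flag per layer in each (slope, intercept) bin, and combine two neighboring intercept bins when searching for the peak. All names and the binning parameters are hypothetical.

```python
def best_road(hits, radii, slopes, phi_min, bin_width, n_bins):
    """Find the road crossed by hits in the largest number of layers.

    hits  : list of (layer_index, phi) pairs
    radii : radius of each layer, indexed by layer_index
    Returns (n_layers, slope_index, bin_index); bin_index is the lower of the
    two combined neighboring intercept bins.
    """
    # One layer bit mask per (slope, intercept) bin.
    hist = [[0] * n_bins for _ in slopes]
    for layer, phi in hits:  # single pass over the hits: work is linear in len(hits)
        for s, slope in enumerate(slopes):
            intercept = phi - slope * radii[layer]  # ASSUMPTION: straight-line road
            b = int((intercept - phi_min) // bin_width)
            if 0 <= b < n_bins:
                hist[s][b] |= 1 << layer
    best = (0, -1, -1)
    for s, row in enumerate(hist):
        for b in range(n_bins - 1):
            mask = row[b] | row[b + 1]  # combine two neighboring bins
            n_layers = bin(mask).count("1")
            if n_layers > best[0]:
                best = (n_layers, s, b)
    return best
```

A road would be accepted when `n_layers` is at least 3, matching the requirement of at least 3 aligned points. The number of slopes and bins is fixed, so the cost of filling grows only with the number of hits.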
Feature extraction in a Transition Radiation Tracker (TRT)
The endcap TRT is a device based on radial straw tubes of small (4mm) diameter with interspersed foils for transition radiation. This device serves simultaneously as a precision and high-redundancy tracker, and (by pulse height analysis) as an electron identifier ( [7] , [8] and [9] ). The barrel TRT uses similar technologies, with straw chambers arranged longitudinally, and the analysis can be reduced to a near-identical algorithm. The endcap part is arranged in 16 wheels, each containing 8 planes equidistant along z. Wheels are mounted non-equidistant along z, from z=109 to z=320cm. Each plane contains 600 radial straws at equal distance in φ, but with small φ offsets from plane to plane. The inner radius of all straws is 0.5m, the outer 1.0 m. The total number of straws is 76800 in one endcap.
The tracking algorithm works in the projection natural for the TRT, i.e. in the φ/z plane, which corresponds closely to the readout coordinates straw/plane number. Tracks in this projection appear as straight lines, with the slope dφ/dz directly related to $p_T$, the position along φ indicating azimuth, and start and end point crudely indicative of z and η. The algorithm recognizes patterns of digitizings that correspond to high-momentum tracks, taking into account the pulse height distribution of digitizings for identification of electrons. It is assumed that two rectangular images are transmitted, one for a low, the other for a high threshold on pulse height. Pixel positions, however, cannot be used directly as z and φ; the detailed knowledge of the geometry, i.e. the precise positions of straws, is used to convert into true coordinates, and thus to refine the histogramming and get improved results for position and $p_T$. A possible straw configuration is shown in figure 6 below. The algorithm consists of making a large number of histograms whose bins correspond to roads of different dφ/dz and to different offsets in φ. The method can be extended to include the use of drift time measurement, should this become necessary: drift time improves the resolution along φ by a factor of ten or better, although it introduces a left-right ambiguity on each single hit.
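The road histogramming described above can be sketched in software as follows. This brute-force version is only meant to show the logic; a hardware implementation would be organized quite differently (for instance around lookup tables). The geometry tables `straw_phi` and `plane_z`, the road half-width, and the function name are assumptions made for the example.

```python
def trt_best_road(low_image, high_image, straw_phi, plane_z, slopes, offsets, half_width):
    """Scan roads phi = offset + slope * z and count hits inside each road.

    low_image, high_image : one integer bit mask per plane (bit number = straw)
    straw_phi[plane][straw], plane_z[plane] : true coordinates from the geometry
    Returns (n_low, n_high, slope, offset) for the road with most low-threshold hits.
    """
    best = (0, 0, None, None)
    for slope in slopes:
        for offset in offsets:
            n_low = n_high = 0
            for plane, z in enumerate(plane_z):
                centre = offset + slope * z
                for straw, phi in enumerate(straw_phi[plane]):
                    if abs(phi - centre) <= half_width:
                        n_low += (low_image[plane] >> straw) & 1
                        n_high += (high_image[plane] >> straw) & 1
            if n_low > best[0]:
                best = (n_low, n_high, slope, offset)
    return best
```

The low-threshold count serves the track finding, and the high-threshold count in the same road carries the pulse-height information relevant for electron identification. How the two counts are combined into a decision is outside the scope of this sketch.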
Feature extraction in a calorimeter
Various calorimeter segmentations and different technologies are foreseen to cover the different angular regions of the future LHC detectors with electromagnetic and hadronic calorimeters. Our calorimeter algorithm is rather generic and works well for a basic tower size of 0.025 x 0.025 in ∆η x ∆φ. We assume a single electromagnetic and a single hadronic layer. The algorithm is based on an RoI of 20 x 20 towers (pixels). Data from electromagnetic and hadronic cells are simply added, and the summed 'image' of pixels containing energy $E_{i,j}$ is analyzed for its transverse profile. The objective of the algorithm is to find features, i.e. decision variables, suitable to distinguish electrons from pions or from hadronic jets, or even pions from jets.
Inside the RoI, a near-circular region has first to be defined in which the peak energy deposition is fully contained (cluster area). Our cluster area is defined by all pixels (i, j) at a radius of less than 5 pixels from the cluster center $(x_{cm}, y_{cm})$:
$$(i - x_{cm})^2 + (j - y_{cm})^2 < 5^2$$
Simple variables like the hadronic energy fraction over the cluster area, or more complicated features like the second moment M of the cluster radius in two dimensions weighted with total energy $E = E_{em} + E_{had}$, or, equivalently, energy sums in different ring-shaped zones and longitudinal volumes of fine granularity, are then calculated over the cluster area.
These features are derived from the ones used in off-line programs, and contain relevant information for electron/pion/jet discrimination that cannot be exploited in a level-1 algorithm. In our simulation [10] this analysis over an RoI of ∆η x ∆φ = 0.5 x 0.5 gives a factor around ten in rate reduction, disregarding information from other detectors (rejecting the background of level-1-accepted QCD jets in an electron sample). Features also include the position of the cluster area, refining the position resolution inside the RoI.
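The calorimeter features can be illustrated with a short sketch. The exact definitions are not spelled out in the text, so the following are assumptions: the cluster center is taken as the energy-weighted centroid of the summed image over the whole RoI, and the second moment M as the energy-weighted mean of the squared distance from that center, evaluated over the cluster area. The function and key names are ours.

```python
def cluster_features(e_em, e_had, radius=5.0):
    """Compute simple cluster features from two equally sized 2D energy grids.

    Returns None if the RoI or the cluster area contains no energy.
    """
    n_rows, n_cols = len(e_em), len(e_em[0])
    # Summed image E[i][j] = E_em + E_had, as described in the text.
    total = [[e_em[i][j] + e_had[i][j] for j in range(n_cols)] for i in range(n_rows)]
    e_roi = sum(sum(row) for row in total)
    if e_roi <= 0:
        return None
    # ASSUMPTION: cluster center = energy-weighted centroid of the RoI.
    x_cm = sum(i * total[i][j] for i in range(n_rows) for j in range(n_cols)) / e_roi
    y_cm = sum(j * total[i][j] for i in range(n_rows) for j in range(n_cols)) / e_roi
    e_clu = e_had_clu = r2_sum = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            r2 = (i - x_cm) ** 2 + (j - y_cm) ** 2
            if r2 < radius ** 2:  # pixel lies inside the cluster area
                e_clu += total[i][j]
                e_had_clu += e_had[i][j]
                r2_sum += total[i][j] * r2
    if e_clu <= 0:
        return None
    return {
        "x_cm": x_cm,
        "y_cm": y_cm,
        "energy": e_clu,
        "had_fraction": e_had_clu / e_clu,
        # ASSUMPTION: second moment = energy-weighted mean squared radius.
        "second_moment": r2_sum / e_clu,
    }
```

A narrow shower with little hadronic energy gives a small `had_fraction` and a small `second_moment`, whereas a jet would be expected to give larger values of both. The cut values are a matter for simulation and are not suggested here.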
Implementing functions in programmable hardware
Two opposite and extreme ways seem possible to implement a specific high-speed digital processing task. The apparently simplest is to program some general-purpose processor to perform the processing at hand. In this software approach, one effectively maps the algorithm of interest onto a fixed machine architecture. The structure of that machine has been highly optimized by its manufacturer to process arbitrary code. In many cases, it is poorly suited to the specific algorithm. This is particularly true if the algorithm includes low-level bit handling as in our case. On the other hand, a general-purpose processor can be programmed to process any computable function, and thus is infinitely flexible. Due to high-volume production, it may also be the most attractive solution overall.
The opposite extreme is to design custom circuitry for the specific algorithm. In this hardware approach, the entire machine structure is tailored to the application. The result is more efficient in execution time, with less actual circuitry than what general-purpose computers require. The drawback of the hardware approach is that a specific architecture is limited to processing a single or a small number of specified algorithms. If we add special-purpose hardware to a universal machine, say for image processing tasks, we may speed up the processor, but we know well that this is limited to the fraction of the code concerned (Amdahl's law of disappointing returns).
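Amdahl's law can be made concrete with a short calculation. The sketch below states the law in its usual form; the function names and the numbers in the usage note are illustrative and are not measurements from this work.

```python
def amdahl_speedup(accelerated_fraction, acceleration):
    """Overall speedup when a fraction of the run time is accelerated by a factor."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must lie in [0, 1]")
    if acceleration <= 0:
        raise ValueError("acceleration must be positive")
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / acceleration)


def speedup_limit(accelerated_fraction):
    """Upper bound on the speedup, reached for infinitely fast special hardware."""
    if accelerated_fraction >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - accelerated_fraction)
```

If half of the run time is accelerated tenfold, the overall speedup is only about 1.8, and it can never exceed 2 however fast the added hardware is.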
We discuss here an alternative machine architecture, which offers the best of both worlds: software versatility and hardware performance. One such device has been used for RoI collection in a very scaled-down case, and has served to understand the principle of generalizing the data collection of subsets of a very large (multi-crate) data acquisition system. Two devices have been used for feature extraction. They are called DECPeRLe-1 and Enable. They differ in that DECPeRLe-1 was conceived with multiple applications in mind, whereas Enable is a configuration optimized for a specific algorithm (to be described below). On both devices, pipelined algorithms were implemented that could keep up with the required data flow corresponding to a decision frequency of 100 kHz. Excellent adaptability was demonstrated in our studies, as shown by the number of algorithms implemented on DECPeRLe-1.
General-purpose machines based on FPGAs
We deal with an architecture called Programmable Active Memories (PAMs), which should be seen as a novel form of universal hardware co-processor 1 . Based on Field Programmable Gate Array (FPGA) technology, a PAM is a virtual machine, which can be dynamically configured into a large number of devices of very different performance and software characteristics. We present PAM programming as an alternative to classical gate-array and full custom circuit design.
The first commercial FPGA was introduced in 1986 by Xilinx [11] . This revolutionary component has a large internal configuration memory, and two modes of operation: in download mode, the configuration memory can be written, as a whole, through some external device; once configured, an FPGA behaves like a regular application-specific integrated circuit (ASIC). Any synchronous digital circuit can be emulated, through a suitable configuration, on a large enough FPGA, for a slow enough clock.
An FPGA is simply a regular mesh of n x m simple programmable logic units, called programmable active bits (PABs). The FPGA is a virtual circuit which can behave like a number of different ASICs: all it takes to emulate a particular one is to feed the proper configuration bits. The purpose of a PAM is to implement a virtual machine which can be dynamically configured as a large number of specific hardware devices. The structure of a generic PAM is shown in figure 3. It is connected through the in and out links to a host processor. A function of the host is to download configuration bit streams into the PAM. After configuration, the PAM behaves, electrically and logically, like an ASIC defined by the specific bit stream. It may operate in stand-alone mode, hooked to some external system through the in' and out' links. It may operate as a co-processor under host control, specialized to speed up some crucial computation. It may operate as both, and connect the host to some external system, like an audio or video device, a specific detector, or some other PAM.
The reprogramming facility means that prototypes can quickly be made, tested and corrected. The development cycle of circuits with FPGA technology is typically measured in weeks, as opposed to months for hardwired gate array techniques. FPGAs, however, are not only used for prototypes, they also get incorporated in many production units. In all branches of the electronics industry other than the mass market, the use of FPGAs is increasing, despite the fact that they still cost ten times as much as ASICs in volume production. In 1992/93, FPGAs were the fastest growing part of the semi-conductor industry, increasing output by 40%, compared with 10% for chips overall.
As a consequence, FPGAs are at the leading edge of silicon devices. The future development of the technology has been analyzed in [12] and [13]. The prediction is that the leading-edge FPGA, which had 400 PABs operating at 25 MHz in 1992, will, by the year 2001, contain 25k PABs operating at 200 MHz.
DECPeRLe-1 as a PAM realization
DECPeRLe-1 is a specific PAM implementation; it was built as an experimental device at Digital's Paris Research Laboratory in 1992. Over a dozen copies operate at various scientific centers in the world. We review here the important general architectural features of PAMs using the example of DECPeRLe-1. A view of the board and the overall structure of DECPeRLe-1 are shown in the following figures 4 and 5. The computational core of DECPeRLe-1 is a 4 x 4 matrix (M) of XC3090 chips. Each FPGA has 16 direct connections to each of its four nearest neighbors. The four FPGAs in each row and each column share two common 16-bit buses. In addition, the board holds an FPGA which is not programmable by the user. It contains firmware to control the state of the PAM through software from the host. Adapting from TURBOchannel to some other logical bus format, such as VME, HIPPI or PCI can be done by re-programming this FPGA, in addition to re-designing a small host-dependent interface board.
DECPeRLe-1 has four 32-bit-wide external connections. Three of these link edges of the FPGA matrix to external connectors. They are used for establishing real-time links, at up to 33MHz, between DECPeRLe-1 and external devices such as the Router's HIPPI output links in our application. Their aggregated peak bandwidth exceeds 400 MB/s. The fourth external connection links to the host interface of DECPeRLe-1: a 100 MB/s TURBOchannel adapter. In order to avoid having to synchronize the host and PAM clocks, host data transit through two FIFOs, for input and output respectively. On the PAM side of the FIFOs is another switch FPGA, which shares two 32-bit buses with the other switches and controllers.
PAM designs are synchronous circuits: all registers are updated on each cycle of the same global clock. The maximum speed of a design is directly determined by its critical combinational path. This varies from one PAM design to another. It has thus been necessary to design a clock distribution system whose speed can be programmed as part of the design configuration. On DECPeRLe-1, the clock can be finely tuned, with increments on the order of 0.01%, for frequencies up to 100 MHz.
A typical DECPeRLe-1 design receives a logically uninterrupted flow of data, through the input FIFO. It performs some processing, and delivers its results through the output FIFO. The host or attached special hardware is responsible for filling-in and emptying-out the other side of both FIFOs. Our firmware supports a mode in which the application clock automatically stops when DECPeRLe-1 attempts to read an empty FIFO or write a full one, effectively providing fully automatic and transparent flow-control.
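The effect of stopping the application clock on an empty input or a full output FIFO can be mimicked in software. The following is a behavioral toy model, not the firmware: one loop iteration stands for one clock period, the design consumes and produces one word per running cycle, and the function name, its parameters, and the ordering of host and design actions within a period are assumptions.

```python
from collections import deque


def simulate(arrivals, drains, out_capacity, process):
    """Toy model of a design sitting between two FIFOs with automatic clock gating.

    arrivals[t] : list of words the host pushes into the input FIFO at period t
    drains[t]   : True if the host pops one word from the output FIFO at period t
    Returns (words received by the host, number of stalled periods, output FIFO content).
    """
    fifo_in, fifo_out = deque(), deque()
    received, stalled = [], 0
    for words, drain in zip(arrivals, drains):
        fifo_in.extend(words)  # host fills the input FIFO
        if drain and fifo_out:  # host empties the output FIFO
            received.append(fifo_out.popleft())
        if fifo_in and len(fifo_out) < out_capacity:
            # Application clock runs: one word is read, processed and written.
            fifo_out.append(process(fifo_in.popleft()))
        else:
            # Clock stopped: the design keeps its state and no data is lost.
            stalled += 1
    return received, stalled, list(fifo_out)
```

In this model the design never has to test FIFO flags itself, which mirrors the point made above: flow control is handled outside the synchronous application logic.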
The full firmware functionality may be controlled through host software. Most of it is also available to the hardware design: all relevant wires are brought to the two controller FPGAs of DECPeRLe-1. This allows a design to synchronize itself, in the same manner, with some of the external links. Another unique possibility is the dynamic tuning of the clock. This feature can be used in designs where a slow and infrequent operation coexists with fast and frequent operations.
A thorough presentation of the issues involved in PAM design, with alternative implementation choices, is given by P. Bertin in [14] .
A PAM program consists of three parts: the driving software, which runs on the host and controls the PAM hardware; the logic equations describing the synchronous hardware implemented on the PAM board, and the placement and routing directives that guide the implementation of the logic equations onto the PAM board.
The driving software is written in C or C++ and is linked to a runtime library encapsulating a device driver. The logic equations and the placement and routing directives are generated algorithmically by a C++ program. As a deliberate choice of methodology, all PAM design circuits are digital and synchronous. Asynchronous features, such as RAM write pulses, FIFO flags decoding or clock tuning, are pushed into the firmware where they get implemented once and for all.
A full DECPeRLe-1 design is a large piece of hardware. The goal of a DECPeRLe-1 designer is to encode, through a stream of 1.5 Mbits, the logic equations, the placement and the routing of fifteen thousand PABs in order to meet the performance requirements of a compute-intensive task. To achieve this goal with a reasonable degree of efficiency, a designer needs full control on the implementation of the logic. In 1992, no existing computer-aided design (CAD) tool was adapted to these needs. Therefore a C++ library was made, enabling the designer to describe a design algorithmically at the structural level, and supply geometric information to guide the final automatic physical implementation. This type of low-level description is made convenient by the use of basic programming techniques such as arrays, loops, procedures and data abstraction. Another library facilitates the writing of driver software and simulation support. In addition, powerful debugging and optimization tools have been developed, enabling the designer to visualize in detail the states of every flip-flop in every FPGA of a PAM board.
The main lesson we draw from our experience with these programming tools is that PAM programming is much easier than ASIC development, enabling us to develop complex applications spanning dozens of chips with even untrained engineers or students, in a matter of a few months.
Implementing the algorithms in DECPeRLe-1
The general-purpose idea underlying the DECPeRLe-1 board has been demonstrated by mapping all algorithms described above onto this device. The following remarks concern these implementations, and their limitations.
Feature extraction in a Silicon Tracker (SCT)
For all existing implementations of the SCT algorithm, the same minimal specifications have been used. The task is to find the best track in a set of 64 slopes x 32 φ intercepts, assuming an occupancy of 1% equivalent to about 30 active pixels per RoI. Software implementations in general-purpose processors are currently 2 or 3 orders of magnitude slower than the required 100 kHz. Using the simple histogramming algorithm based on sequential processing of the list of active pixels, we could meet the required performance. Moreover, as the algorithm computation time is linear in the number of hits, it is possible to absorb bursts of high-occupancy images. The histogram solution also has the advantage of using look-up tables, resulting in flexibility on search patterns used.
In the latest implementation on DECPeRLe-1 described in [15], a 64 x 16 histogram containing 4 bits for each search pattern (1 for each row) is filled by a look-up table while the coordinates of hits arrive one per clock cycle. The histogram peak is extracted by shift registers and max-units. At 23 MHz, with an event frequency of 100 kHz, it is possible to process 230 pixels per event. The circuit also provides all the needed functionality to do multiple passes and extract the best of all the passes. It is also possible to do a zoom after the first pass around the best track to increase the overall precision. Thus, in two passes, a full 64 x 32 grid can be scanned with an average of 115 pixels per event, which is more than the required performance.
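The look-up-table histogramming described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the straight-line pattern geometry inside `build_lut`, and the small grid used in the usage note are hypothetical and are not taken from the design in [15]. Only the principle (one bit per row in each pattern bin, one hit consumed per clock cycle, peak search at the end) follows the text.

```python
def build_lut(n_rows, n_cols, n_slopes, n_intercepts):
    """Map each pixel (row, col) to the search patterns (slope, intercept) crossing it."""
    lut = {}
    for s in range(n_slopes):
        k = s - n_slopes // 2  # signed column shift from first to last row (assumed geometry)
        for i in range(n_intercepts):
            for r in range(n_rows):
                c = i + (k * r) // (n_rows - 1)
                if 0 <= c < n_cols:
                    lut.setdefault((r, c), []).append((s, i))
    return lut


def best_track(hits, lut, n_slopes, n_intercepts):
    """Set one bit per row in each pattern's bin; return (rows_hit, slope, intercept)."""
    hist = [[0] * n_intercepts for _ in range(n_slopes)]
    for r, c in hits:  # hits are consumed sequentially, as one per clock cycle in hardware
        for s, i in lut.get((r, c), ()):
            hist[s][i] |= 1 << r
    return max(
        (bin(hist[s][i]).count("1"), s, i)
        for s in range(n_slopes)
        for i in range(n_intercepts)
    )


def hits_budget(clock_hz, event_hz, passes=1):
    """Number of hits that fit into one event period at one hit per clock cycle."""
    return clock_hz // (event_hz * passes)
```

With a 23 MHz clock and a 100 kHz event rate, `hits_budget(23_000_000, 100_000)` gives the 230 pixels per event quoted above, and 115 for two passes. Because the search patterns live entirely in the table, changing them means regenerating the table, not the circuit.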
Feature extraction in a Transition Radiation Tracker (TRT)
As part of the study of detectors for LHC experiments, a TRT endcap prototype has been built and tested by the RD6 project [8], [9]. The prototype for our run contained two sectors covering an angle (φ) of about 30 degrees (32 straws), each sector made of 16 planes spaced along z. A typical RoI in a future endcap TRT is expected to consist of 16 straws in the φ direction and most of the 128 planes. A RoI for the present prototype was assumed to be 32 x 16 straws (in z x φ). The geometric arrangement of the straws in the z-φ projection is shown in figure 6. The detector output, preprocessed by the Router, consists of two images, one for the high and one for the low threshold signal. Each image consists of 16 words of 32 bits. Each 32-bit word corresponds to the output from 32 straws, the z axis being defined by the 'zigzag' line at the bottom of figure 6. The Router was connected to DECPeRLe-1 by two HIPPI lines, using two commercially available HIPPI-to-TURBOchannel Interface boards (HTI) originally developed at CERN. These boards implement a HIPPI destination sending data by DMA to a TURBOchannel bus. A DMA receiver and a full I/O controller were implemented on DECPeRLe-1, utilizing resources not used by the feature extraction algorithm. The HTI units could be directly plugged into DECPeRLe-1's extension slots, so the DECPeRLe-1 board could be used without any hardware changes, only loading the necessary configuration file.
As described in section 4.3, the aim of the feature extraction task is to identify a straight line in the z-φ plane of the RoI. Based on the number of low threshold hits and high threshold hits along the track, an electron identification can be made. As for the SCT implementation, the track finding is done by histogramming and peak finding. Instead of using lookup tables, the histogramming is performed by a 'Fast Hough Transform'. The FHT is a variation of the Hough Transform, and has been fully described in [16].
The algorithm is based on superimposing a 128 x 128 pixel grid on the straw pattern in the z-φ projection. The grid is graded in φ steps of 1/8 of the inter-straw distance, which makes all straw positions fall exactly on grid points. In this grid (see also [17] ) it is possible to define lines of slope 0, ±1, ±2, ... pixels, where the slope denotes how much the line ascends or descends across the RoI. The φ offset (or intercept) of a line is represented by the pixel where the line enters the RoI. In figure 6 a set of lines is shown with slopes -15, -14, ..., -1, 0, 1, ..., 15, all with intercept 64, i.e. entering the RoI in the middle of the φ axis.
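To make the slope/intercept parameterization concrete, the following Python sketch counts hits along every line of the grid by brute force. It is deliberately a plain Hough-style histogram and not the Fast Hough Transform of [16]. The rounding rule in `line_phi` and all function names are assumptions made for illustration; the text only fixes that the slope is the total φ displacement across the RoI and that the intercept is the pixel where the line enters.

```python
def line_phi(intercept, slope, z, n_z=128):
    """Phi pixel of a line at column z; slope = total phi shift across the RoI.

    Uses integer round-half-up interpolation (an assumed convention).
    """
    return intercept + (2 * slope * z + (n_z - 1)) // (2 * (n_z - 1))


def hough_histogram(hits, slopes=range(-15, 16), n_phi=128, n_z=128):
    """Count, for every (slope, intercept) pair, the hits lying on that line."""
    hit_set = set(hits)
    hist = {}
    for slope in slopes:
        for intercept in range(n_phi):
            hist[(slope, intercept)] = sum(
                (z, line_phi(intercept, slope, z, n_z)) in hit_set
                for z in range(n_z)
            )
    return hist
```

The peak is then `max(hist, key=hist.get)`. With 31 slopes and 128 intercepts the histogram has 3968 channels, which in hardware are filled in parallel instead of in the nested loops shown here.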
The output of the algorithm consists of four fields: track intercept (7 bits), track slope (5 bits), number of low threshold hits along the track (6 bits), and number of high threshold hits along the track (6 bits). These data are packed into one 32-bit word, and written into the output FIFO of DECPeRLe-1, which is subsequently read by the host program. This makes it just possible to fit a histogrammer using 31 slopes and 128 intercepts in one DECPeRLe-1 board. This yields a total of 31 x 128 = 3968 possible patterns in the pixel grid described above, close to the maximum that can be handled by DECPeRLe-1. Two alternative designs operating at the required 100 kHz rate with a latency of 1 image (10 µsec) were implemented on DECPeRLe-1: one with a road width of 4 pixels, another with 8. A 64-bit sequential processor would need to run at more than 1 GHz, with matched input, to achieve the same computational result.
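As an illustration of how the four result fields fit into one 32-bit word, here is a minimal Python sketch. The field widths come from the text; the bit positions and the offset encoding of the signed slope are assumptions, since the actual layout of the output word is not specified here.

```python
SLOPE_OFFSET = 15  # assumed: slopes -15..15 stored as 0..30 in the 5-bit field


def pack_result(intercept, slope, n_low, n_high):
    """Pack the four track fields (7 + 5 + 6 + 6 = 24 bits) into one integer."""
    if not (0 <= intercept < 128 and -15 <= slope <= 15):
        raise ValueError("intercept or slope out of range")
    if not (0 <= n_low < 64 and 0 <= n_high < 64):
        raise ValueError("hit count out of range")
    return (
        intercept                       # bits 0..6   (assumed position)
        | (slope + SLOPE_OFFSET) << 7   # bits 7..11  (assumed position)
        | n_low << 12                   # bits 12..17 (assumed position)
        | n_high << 18                  # bits 18..23 (assumed position)
    )


def unpack_result(word):
    """Inverse of pack_result; returns (intercept, slope, n_low, n_high)."""
    return (
        word & 0x7F,
        ((word >> 7) & 0x1F) - SLOPE_OFFSET,
        (word >> 12) & 0x3F,
        (word >> 18) & 0x3F,
    )
```

The four fields occupy 24 of the 32 bits, so under this assumed layout the top byte of the word stays free.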
Other combinations of slopes/intercepts are possible. For instance, using a broader range of slopes (±1, ±3, ..., ±31 pixels), the number of intercepts should be reduced to 64. In a full TRT detector, the RoI will consist of 80-100 planes, giving about 3 times the volume of data of the present prototype. In addition, drift time data may be included. The PAM provided by DECPeRLe-1 will not be able to handle this data rate. As faster and larger PAMs are likely to be available in the near future, we are confident that a similar implementation for a full detector RoI is a realistic possibility.
Feature extraction in a Calorimeter
The calorimeter algorithm appears as an ideal image processing application. We have implemented the algorithm as a double-pass operation, due to the need to find center-of-gravity values first. In a high-level algorithm, the two passes could be combined, if no numerical problems arise, but on an FPGA-based device, high-precision multiplication would be unduly penalized. The high input bandwidth (160 MB/s) and the (self-imposed) low latency constraint constitute a serious challenge to any implementation.
In a previous benchmark exercise [18], the possible implementations of the calorimeter algorithm have been discussed at length, on both general-purpose computer architectures (single and multi processors, SIMD and MIMD) and special-purpose electronics (full-custom, gate-array, FPGAs). The conclusion provides an accurate quantitative analysis of the computing power required for this task: PAM-s are the only structure found to meet the requirements.
This algorithm was implemented on DECPeRLe-1 with input from the host. Using the external I/O capabilities described for the TRT, data input could be from the detectors through two off-the-shelf HIPPI-to-TURBOchannel interface boards plugged directly onto DECPeRLe-1. The data path inside DECPeRLe-1 uses only about 25% of DECPeRLe-1's logic and all the RAM resources, for a virtual computing power of 39 Giga-operations/sec (binary). The initial input of 2 x 32 bits at 25 MHz is reduced to two 16-bit pipelines at 50 MHz, after summing E_em and E_had. The total accumulation of E_ij and x_ij E_ij is done in carry-save accumulators. Meanwhile, the raw data (E_em and E_tot) are stored in an intermediate memory. The masking/summing operations on clusters are performed by using, in a second pipeline, matrices pre-computed for all possible positions of the center of gravity (a relatively small number, due to the limited size of RoI and cluster).
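The double-pass structure can be illustrated by the following Python sketch. It is a simplified software analogue, not the pipeline implemented on DECPeRLe-1: the square mask shape, the `radius` parameter, the integer rounding and all function names are assumptions for illustration. The first pass accumulates the sums of E and of coordinate times E to obtain the center of gravity; the second pass applies a mask pre-computed for that position.

```python
def center_of_gravity(e_em, e_had):
    """First pass: accumulate sum(E), sum(i*E), sum(j*E) over the RoI."""
    total = sum_i = sum_j = 0
    for i, (row_em, row_had) in enumerate(zip(e_em, e_had)):
        for j, (em, had) in enumerate(zip(row_em, row_had)):
            e = em + had  # E_tot of this cell
            total += e
            sum_i += i * e
            sum_j += j * e
    if total == 0:
        return None
    # Integer round-half-up division (assumed; avoids floating point).
    ci = (2 * sum_i + total) // (2 * total)
    cj = (2 * sum_j + total) // (2 * total)
    return ci, cj


def build_masks(n_rows, n_cols, radius=1):
    """Pre-compute one 0/1 cluster mask per possible center position."""
    return {
        (ci, cj): [
            [int(abs(i - ci) <= radius and abs(j - cj) <= radius)
             for j in range(n_cols)]
            for i in range(n_rows)
        ]
        for ci in range(n_rows)
        for cj in range(n_cols)
    }


def cluster_energy(e_em, e_had, masks):
    """Second pass: sum E_tot under the mask selected by the center of gravity."""
    center = center_of_gravity(e_em, e_had)
    if center is None:
        return 0
    mask = masks[center]
    return sum(
        m * (em + had)
        for row_m, row_em, row_had in zip(mask, e_em, e_had)
        for m, em, had in zip(row_m, row_em, row_had)
    )
```

Pre-computing the masks trades memory for arithmetic, which matches the remark above that the number of possible center positions is small.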
Enable as a specific PAM realization
The Enable Machine is a systolic 2nd level trigger processor for feature extraction in the transition radiation tracker (TRT) of ATLAS/LHC [19], [20]. It has been designed to satisfy benchmark specifications defined by the EAST-RD11 collaboration to make different computer architectures comparable. Results of the benchmark exercise were published previously [18]; candidate architectures comprised massively parallel SIMD architectures like MASPar or Blitzen, pipelined image processing systems like Maxvideo, and RISC multiprocessors like iWarp. Results showed that general-purpose machines were one to two orders of magnitude too slow to meet the required execution frequency of 100 kHz, or could only run with major simplifications in the algorithm. I/O restrictions, a limited RAM bandwidth, and a lack of adaptability to the algorithm's fine-grain parallelism are the main reasons for the failure of conventional architectures. The only commercially available system that partly fulfilled the benchmark specs was Maxvideo (Maxvideo was restricted to infinite-momentum tracks, i.e., horizontal lines in the z,φ input images). The high degree of parallelism and the execution of the algorithm in hardware, both provided by the FPGA matrix, were the major advantages that made the Enable Machine the only candidate which satisfied all given requirements at the time of benchmarking (DECPeRLe-1 was introduced later).
The Enable Machine meets the benchmark requirements for the envisaged full detector, not only for the present detector prototype. To fit the required functionality on a single PAM board of size 36 × 40 cm² (figure 8), the Enable architecture is specifically tailored to the TRT application. This is achieved by two PAM matrices with a total of 36 FPGAs, handling the two input images. Despite this specialization, the Enable Machine is a PAM implementation optimized for a general class of systolic algorithms with similarities in the data flow. All algorithms implemented so far for the different ATLAS trigger tasks are systolic algorithms. In the following description of the Enable Machine we will focus on those architectural features that give major advantages for algorithms proposed for ATLAS 2nd level triggers.
System overview: A complete Enable system consists of several boards each providing a distinct function: the VME host CPU, the I/O interface (HIPPI), the active backplane, and one or more PAM matrix boards. These boards are housed together in a 9U VME crate. Due to the synchronous design of the backplane, even the extension of the bus into another crate is possible, thus allowing the use of additional PAM matrix boards.
FPGA matrix: The PAM unit of the Enable Machine consists of 36 Xilinx XC3190 FPGAs arranged in a square matrix of 6 × 6 chips. The principal communication scheme of this matrix is nearest-neighbor connection, which is natural and well adapted to systolic implementations of typical 2nd level trigger algorithms. Operations are performed on a constant data flow in a pipeline mode. Due to the optimization of the Enable Machine's architecture for the TRT algorithm, the FPGA connection scheme is not entirely regular, in particular by dividing the 6 × 6 matrix into two blocks of 18 chips each, connected by a 32 bit wide bus. Thus the two data streams for low and high threshold can easily be processed in parallel. The entire matrix operates at 50 MHz.
Local RAM: One main characteristic of the Enable PAM implementation is a distributed memory. Each of the 36 FPGAs in the matrix is equipped with a 128 kByte synchronous SRAM, for a total RAM size of 4.5 MB. The whole memory is thus accessible in parallel, resulting in an aggregate RAM bandwidth of 2 GB/s. Also, with this organization the FPGAs need far fewer routing resources to access memory than with a global RAM. Distributed RAM is the architecture best suited to algorithms based on a massive use of table lookup. The use of synchronous RAM with up to 66 MHz frequency provides a simple method for accessing the memory at the full speed of the FPGA matrix, especially for table look-up, but also for many other systolic algorithms requiring large block transfers. No wait states have to be introduced and there is also no need for any kind of specialized firmware.
Internal Links: For the set-up of a scalable system, the use of a high-speed bus system is critical. Enable uses a synchronous high-speed backplane bus system with broadcast capability. It is split into two buses, one for each data flow direction, with a total bandwidth of 600 MBytes/s. This bus connects the I/O system of the Enable Machine with a certain number of PAM boards. A transfer along that bus is routed one slot per cycle. Although the transfer latency increases with the slot distance, this disadvantage is compensated by the 50 MHz frequency of the bus. For block transfers common in trigger applications, the 64 bit wide bus is able to transfer incoming detector images at a rate of 400 MBytes/s. Due to the much less stringent communication requirements in the opposite direction, a 32 bit wide bus is used there (figure 9). This concept has major advantages for system scalability. Unlike traditional bus systems, the transfer frequency is not influenced by the number of boards involved and is independent of the length of the backplane. The most serious limitation of this concept is the physical size of backplane and crate rather than electrical parameters. Theoretically, an unlimited number of boards could be involved. In the Enable Machine prototype a seven slot system is used over an entire VME crate. For an application in high energy physics experiments, the separation of I/O subsystem and PAM boards has one obvious advantage: an adaptation to the various bus systems provided by different detectors affects only the I/O board, which is the only part that has to be redesigned.
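The quoted backplane figures can be cross-checked with a few lines of arithmetic. The Python sketch below assumes that both buses are clocked at the 50 MHz mentioned above and that one data word is transferred per clock cycle; the function names are hypothetical.

```python
def bus_bandwidth_mbytes(width_bits, clock_mhz):
    """Peak bandwidth in MBytes/s, assuming one word per clock cycle."""
    return width_bits // 8 * clock_mhz


def hop_latency_ns(slots, clock_mhz):
    """Added latency when a transfer advances one slot per clock cycle."""
    return slots * 1000 / clock_mhz


forward = bus_bandwidth_mbytes(64, 50)   # incoming detector images
backward = bus_bandwidth_mbytes(32, 50)  # results in the opposite direction
total = forward + backward
```

Under these assumptions the 64-bit bus gives 400 MBytes/s and the 32-bit bus 200 MBytes/s, which together account for the total of 600 MBytes/s; each slot crossed adds 20 ns of latency without reducing the transfer rate.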
External Links: Two external links are provided by the Enable prototype. The first link is the HIPPI interface located on the I/O board mentioned above. Input data of up to 256 x 32 x 2 bits are routed to the Enable Machine's Interface board via two HIPPI lines. They provide a combined input data rate of up to 200 MB/s. Four FPGAs (Xilinx XC3190) on the Interface board allow data formatting and coupling to the backplane bus. In a system equipped with several PAM boards a final decision can be extracted by the Interface board, based on the data from all Histogrammer units. The second link is a VMEbus system. Although the VMEbus is mainly used for communication with the host system, it serves additionally as a secondary link. Because of its wide use in high energy physics experiments, a special VME based link -the VIC bus -is the most important link to data acquisition systems (e.g. for testing of the prototype) or the trigger control logic.
Enable programming: The PAMs are programmed by downloading a configuration program into the FPGA devices from the host CPU, using normal VME transfer cycles. To provide maximum configuration speed, all chips of the FPGA matrix are configured in parallel. Broadcasting of the same configuration to multiple chips increases the configuration speed further, leading to a configuration time of 20 ms. Taking advantage of the reconfigurability of the FPGAs, the Enable Machine does not waste its resources on infrequently used functions such as the RAM control needed for filling the LUTs. To write the LUT data to the RAM, the FPGAs have to be loaded with a RAM controller configuration. After updating the RAM, the old configuration is restored. Because reconfiguration is fast, the resulting RAM access time is acceptable, and the approach saves many hardware resources by avoiding dedicated RAM controller logic.
Development tools: Experience with commercial tools (XILINX, VIEWLOGIC) shows that high design speeds can be achieved only if placement of the FPGA logic is done by hand, leaving the routing for the automatic tools. We therefore developed at the University of Mannheim the XILLIB tool, a graphical interface for interactive FPGA placement.
Together with the XILLIB library editor and a commercial schematic layout tool (PADS) which creates the netlist of the library element connections, it serves as a powerful tool for creating and changing FPGA designs. The preplaced FPGA design can then be routed by the standard tools (APR) with good success. The main advantage of XILLIB is the ease of placing and moving logic elements. Changes in the design can be made in minutes. Small changes of the placement or of the library elements also lead to small changes of the routed design. Merge functions between old and new designs leave unchanged parts of a design untouched. An additional feature is the possibility to selectively highlight parts of the net list, which makes it possible to inspect the interweaving of logic cells. The XILLIB package also contains a library generator. User-generated functional elements are usually much better optimized for a given design than elements generated automatically or taken from circuit libraries. A large number of re-usable library elements is already available. To make XILLIB compatible with other commercial software, several interfaces have been developed. Our experience shows that this fast placement tool increases the design speed and flexibility, and decreases the necessary turnaround time. Using XILLIB, complex designs can be made in weeks instead of months.
Debugging tools: The Enable system provides not only a software design simulator but also powerful tools for off-line design verification and on-line debugging. For off-line design verification a special microcontroller-based test environment has been developed, to be fully integrated into the simulation software. Simulations can run transparently on a software basis, or be executed on a test board. Running a design at full speed for a given number of clock cycles and analyzing the internal states results in much shorter time for design verification than full software simulation. By performing such tests with various clock frequencies a close estimation of the timing constraints of the actual design can be obtained.
Arbitrary input data streams can be supplied to the chip under test, and the output streams plus all internal states of the chip can be transferred to the host. Complex trigger conditions can be defined on the host system using PLD-like test vectors, or by selecting CLBs and IOBs on a graphical display. The test sequence is performed by the microcontroller and the results are reported to the host system. Additionally, a full-trace option can be selected to show the states of all selected CLBs and IOBs after an arbitrary number of clock cycles.
The idea of on-line debugging is to collect data from different stages of Enable's data path and to run checks while the Enable Machine is operating. Data collection is done by copying data to FIFOs on-the-fly on an event by event basis. The controlling VME CPU reads the FIFOs and performs the checking for the captured event data set. Data may be captured on the (HIPPI) input lines or as they travel from the Enable input board to the Enable backplane. The active configuration of the histogrammer XILINX chips determines the type of data that will be available in this stage. This feature of tracing data on their way through the Enable system can also be used for off-line debugging, supplying data via VME.
Implementing the TRT algorithm on Enable
Because of the non-equidistant planes along z and the small φ offsets inside planes (for the layout of the prototype detector we refer to section 5.3.2.), a somewhat irregular structure of track patterns is expected in the TRT. Tracks become even more irregular if damaged straws have to be masked, or other unpredictable image distortions have to be handled in a running system. To give maximum flexibility in the pattern definition, Enable uses an algorithm based on general template matching and histogramming along predefined search roads. Since the corresponding patterns are stored in the distributed memory, these search roads do not have to be straight lines; nearly arbitrary road shapes can be programmed. In the first step two histograms are computed from the images corresponding to the two energy thresholds. In a second step the track with the highest electron probability is determined. For that purpose, a weighting and filtering function of identical histogram channels in both images (low and high threshold) is computed, and the maximum of this function is evaluated. Background noise in particular (e.g. at high luminosity) strongly influences the weighting function. To provide maximum flexibility the histogrammer unit again uses table lookup for the implementation of the weighting function. Simulations have shown that fine-tuning of this function is important for maximum trigger efficiency. Every three FPGAs form a processing unit with two FPGAs calculating histograms for 2 slopes and 20 offsets (low and high threshold), and the third FPGA calculating the weighted maximum. The distributed RAM allows all of these units to run locally and in parallel.
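The two steps described above (histogramming along stored search roads, then a table-driven weighted maximum) can be sketched as follows. This Python fragment is illustrative only: the data layout, the function names and the example weight table are assumptions, and a real weighting function would be tuned by simulation as noted in the text.

```python
def road_histogram(image, roads):
    """Count the hits of one threshold image along each stored search road.

    image: set of (plane, straw) hit pixels
    roads: dict mapping a road id to its list of (plane, straw) pixels
    """
    return {rid: sum(p in image for p in pixels) for rid, pixels in roads.items()}


def best_road(low_image, high_image, roads, weight_lut):
    """Return the road id maximizing weight_lut[n_low][n_high]."""
    low = road_histogram(low_image, roads)
    high = road_histogram(high_image, roads)
    return max(roads, key=lambda rid: weight_lut[low[rid]][high[rid]])


# Hypothetical example: one straight and one bent road across 8 planes.
roads = {
    "straight": [(z, 3) for z in range(8)],
    "bent": [(z, 3 + (z >= 4)) for z in range(8)],
}
# Hypothetical weight table favoring high threshold hits.
weight_lut = [[n_low + 2 * n_high for n_high in range(9)] for n_low in range(9)]
```

Because a road is just a list of pixels held in memory, the bent road costs nothing extra compared with the straight one, which is the flexibility the distributed RAM is meant to provide.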
According to the specifications of the benchmark algorithm the Enable Machine is able to handle image sizes up to 255 x 32 pixels of 2 bits each. For each of the low and high thresholds, 400 search roads can be processed in parallel in one PAM board. All of these search roads belong to the relevant part of the image; tracks at the edges with less than a minimal contribution to the image (here 32 pixels) are ignored. Because the latency depends linearly on the image size, Enable is able to process full detector images (128 planes) within 3.5 µs. Assuming the RoI size of the TRT prototype (16 x 32 x 2 pixels), up to 12 RoIs can be processed on one PAM board at the required 100 kHz rate. This performance corresponds to about 4,800 search roads at 100 kHz for the prototype RoI.
Another algorithm has been proposed for the TRT if the drift time information has to be considered. Using a general Hough transform, arbitrary patterns can be processed. No restriction to straight lines or simple patterns applies. This alternative is a massive LUT algorithm working with zero-suppressed input. Detector data are received as a list of 16-bit coordinates from hits (pixel addresses). Every pixel addresses one RAM location and each RAM data bit corresponds to one histogram channel. The total number of histogram channels depends on the width of the RAM data path. Enable's distributed RAM again provides a possible implementation. The FPGA matrix has to be configured as a long line of FPGAs. The whole list is routed and processed systolically through that line. Cycle by cycle, every FPGA asserts the coordinates onto the addresses of its dedicated RAM and builds up the Hough space by histogramming the RAM data lines. The whole distributed RAM could work in parallel. Considering the data flow of the TRT prototype, each RAM may be used to process approximately 8 banks of histogram channels serially, which leads to a total in excess of 3,000 search roads per PAM matrix board at 100 kHz.
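The RAM-based Hough scheme, in which each hit address selects one RAM word and each data bit of that word feeds one histogram channel, can be sketched in Python as below. The function names and the tiny address space in the usage note are assumptions for illustration; a single list stands in for the RAM of one FPGA.

```python
def build_ram(roads, n_addresses):
    """Fill the look-up RAM: bit `ch` of word `addr` is set if pixel `addr`
    belongs to search road `ch`."""
    ram = [0] * n_addresses
    for ch, pixels in enumerate(roads):
        for addr in pixels:
            ram[addr] |= 1 << ch
    return ram


def histogram_hits(hit_addresses, ram, n_channels):
    """Process a zero-suppressed hit list: every hit reads one RAM word and
    increments the counter of each channel whose data bit is set."""
    counts = [0] * n_channels
    for addr in hit_addresses:
        word = ram[addr]
        for ch in range(n_channels):
            counts[ch] += (word >> ch) & 1
    return counts
```

For example, with `roads = [[0, 1, 2, 3], [0, 5, 10, 15], [3, 6, 9, 12]]` on a 16-pixel address space, the hit list `[0, 5, 10, 15, 3]` yields the counts `[2, 4, 1]`. In hardware the inner loop over channels disappears, since all data bits of a RAM word are available in the same cycle.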
Test runs with DECPeRLe-1 and Enable
Prototypes of a data-driven Router and an Enable Machine were built and used in 1992/93 with the TRT prototype of RD6. A DECPeRLe-1 board was acquired in 1993.
We have made multiple runs in the laboratory with our own hardware emulator SLATE [21] to test the functioning of the components at full speed, with programmed simulated TRT data. We have also connected the Router together with Enable [22] or DECPeRLe-1 in turn, to a prototype TRT detector running during beam test periods in September 1993 and June 1994. Although parasitic to detector tests, both setups were able to take in excess of 10,000 individual events on tape. The Router demonstrated that its input board was operating in a fully transparent way to the detector's data taking, and showed its programmability by supplying data in different formats to the feature extraction devices.
During the 1994 beam run the Enable Machine was integrated in the test detector data acquisition system supplied by project RD13. The Enable VME crate was coupled to the data acquisition crate via a VIC bus module containing 4 MB of memory. During accelerator bursts the memory was filled by Enable data. In the time gap between bursts the data acquisition system read this buffer. In a certain debug mode the Enable Machine provides data output of several kBytes per event. These data contain the full histogram content and preprocessed event data from different stages within the machine. This feature has successfully been used for a detailed analysis of the electronics. Off-line analysis verified that both the z-φ corner turning and the histogrammer produced correct results.
During the SLATE tests of DECPeRLe-1 an incompatibility between the Router and the HTI boards was discovered, causing word losses at the end of each accelerator burst. There was no time to fix this problem for the beam test, so a small percentage of the events was lost. Events that were successfully transmitted to DECPeRLe-1 were subjected to off-line analysis after the tests. No direct connection between DECPeRLe-1 and the DAQ crate was established since the necessary VME-based software was not available in time. However, data from DECPeRLe-1 were successfully transmitted by Ethernet at the end of the accelerator burst, and read out together with the raw detector data. The results were associated by event number, and compared bit-by-bit to the output of a simulation program fed with the raw data; a 100% correspondence was found.
Conclusions
The concept of programmable active memories (PAM-s) has been presented as a general alternative for low-level and I/O-intensive computing tasks. The field-programmable gate array technology used in PAM-s is evolving fast. Plans exist in industry for designing larger and faster PAMs, able to cope with a variety of general image processing problems. More specifically, we have demonstrated on two PAM implementations a fully programmable concept which can implement data-driven processing tasks in the second-level trigger, for all presently proposed detector parts and in existing technology.
The two devices we report on were realized for trigger tasks at a 100 kHz event rate. Both were benchmarked to be far superior in performance to conventional SIMD and MIMD machines in many aspects. The Enable Machine is a TRT-specialized PAM realization. Its power is in the bandwidth of the distributed RAM and of the I/O system. DECPeRLe-1 has been built for more general applicability, and several trigger algorithms have been implemented on this device. Parallel synchronous bus systems with high bandwidth are used in both systems. High flexibility is obtained through the reprogrammability of the basic elements, and by the scalable design. Software tools assure short development times for designs or design changes.
