Abstract--The H1 Level 2 neural network trigger has been running successfully at DESY for 4 years. In order to provide increased selectivity at the higher luminosity planned for the HERA upgrade, an improved 'intelligent' preprocessing has been devised. This system extracts complementary physics information from the Level 1 trigger stream and furnishes it to the L2 neural network in order to improve its decision. A new preprocessing board (the Data Distribution Board version 2, DDB2) is currently being designed at the Max Planck Institute for Physics in Munich in order to implement the necessary algorithms in fast Field Programmable Gate Arrays, taking advantage of parallelism and pipelined structures in order to meet the timing requirement of 8 µs. We present the different algorithmic steps and report on the current status of the DDB2 hardware upgrade.
I. INTRODUCTION
The neural approach has proven efficient in particle physics triggering applications as a discriminator between interesting events and background [1] [2] [3] [4] [5]. The multiplicity of signals from the different sub-detectors leads to a large number of parameters to be processed. In order to exploit the correlations between these parameters, a first approach might have been to conceive large neural network structures with an adequate number of connections to treat the problem in its entirety. This approach, although valid in theory, has shown limitations in practice. The solution adopted for the H1 Level 2 neural network trigger is to provide an initial preprocessing step that uses a priori human understanding to extract physically relevant variables.
A first advantage of this approach is that the input variables are generally correlated in some way, which can be exploited to restrict the input space to one of lower dimensionality and thus reduce the number of training samples required for the neural network. Such training samples are not easily obtained because of the limited statistics available for interesting physics channels.
The second advantage is to significantly reduce the neural network size, and particularly the dimension of the input vector. This leads to a reduction of the number of connections to be processed, resulting in an important practical improvement.
In fact, the need for the trigger to provide an answer within a relatively short time (typically of the order of 20 µs for a Level 2 trigger) imposes a maximum number of computations and thus constrains the type of hardware that can be used. The massive computation required by large networks cannot be realized in practice because of limited circuit capabilities, and compromises are generally made on the network architectures, resulting in a loss of efficiency.
The principles of the H1 neural network trigger itself are detailed in [4] and [5] . The main subject of this article is to describe the adopted strategy for the preprocessing and to present the new DDB2 (Data Distribution Boards version 2). In section II the H1 detector and general Level 2 scheme are presented. Section III outlines the preprocessing algorithms adopted for DDB2, while in section IV, the actual or planned hardware implementations of the algorithms are given. A conclusion appears in the final section.
II. THE H1 DETECTOR AND LEVEL 2 TRIGGERING
HERA is a large particle accelerator located in Hamburg, Germany, which has continuously improved its effectiveness since its inception in 1992. The accelerator is currently being upgraded to increase its luminosity by a factor of 5, bringing more selectivity and access to more relevant physics.
The H1 Detector is a very complex apparatus designed to detect particles which are created when high energy electrons and protons collide [6]. In order to track charged particles within the detector, an inner and outer central jet chamber (CJC) and proportional chambers (MWPC) are distributed around the interaction point. The system is surrounded by a liquid argon calorimeter (LAr) with electromagnetic and hadronic parts for energy measurement. Two additional warm calorimeters cover the forward ("plug") and backward ("SpaCal") part of the solid angle.
In the H1 experiment, the trigger decision is taken on four online levels enabling the data flow to be continuously reduced. Level 1 and Level 2 have to employ special hardware since the timing constraints are very strong (respectively 2.4 µs and 20 µs). A Level 3, although provided, is not currently used. Data are sent directly to the Level 4 processor farms performing a full event reconstruction.
The current Level 2 trigger consists of a pattern recognition module (PRM) implementing the neural network to discriminate between different classes such as heavy flavor production, jets, etc. The PRM is based on the commercially available CNAPS chip by Adaptive Solutions Inc. [7]. It is a parallel computer in SIMD architecture containing 64 processor nodes (PN). Up to eight chips (512 PNs) can be combined on one board. The main internal parts of the PNs are a multiplier (24 bits), an adder (32 bits), a logic unit, registers, and memory (4K). The CNAPS-1064 chip can model a neural net with 64 inputs, 64 hidden nodes and 1 output node (typical of those used in H1) in 8 µs. The offline training of the neural networks used in the H1 trigger was performed using the standard backpropagation algorithm [8] with the Aspirin/MIGRAINES software [9]. More details about training and simulations of the neural networks can be found in [10].
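As a rough illustration of the network size quoted above, the following sketch evaluates a fully connected 64-64-1 network and counts its multiply-accumulate operations. The sigmoid activation, the random weights, the floating-point arithmetic and all function names are assumptions made for illustration only; the fixed-point CNAPS implementation is not reproduced here.

```python
import math
import random

N_IN, N_HIDDEN, N_OUT = 64, 64, 1  # network shape quoted in the text


def sigmoid(x):
    """Logistic activation (assumed; the text does not name the activation)."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, w_hidden, w_out):
    """Evaluate a feed-forward network with one hidden layer and one output."""
    hidden = [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in w_hidden]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)))


def mac_count(n_in=N_IN, n_hidden=N_HIDDEN, n_out=N_OUT):
    """Number of multiply-accumulate operations for one network evaluation."""
    return n_in * n_hidden + n_hidden * n_out


# Illustrative evaluation with random weights and inputs (not trained values).
rng = random.Random(0)
w_hidden = [[rng.uniform(-1, 1) for _ in range(N_IN)] for _ in range(N_HIDDEN)]
w_out = [rng.uniform(-1, 1) for _ in range(N_HIDDEN)]
inputs = [rng.random() for _ in range(N_IN)]
output = forward(inputs, w_hidden, w_out)
```

With these dimensions one evaluation requires 64 x 64 + 64 = 4160 multiply-accumulate operations, which gives a feeling for why the input dimension has a direct impact on the processing time.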
The purpose of the DDB boards is to take data from the Level 2 bus and preprocess them before they pass to the CNAPS chips. The main bus is divided into eight sub-buses, each receiving data from an octant of the total θ and ϕ plane in which signals are collected. Data on the sub-buses are transmitted sequentially as 16-bit words and represent different quantities such as calorimetric energy, tracks, muon hit map, etc. Preprocessed data are then provided to the PRMs via a mezzanine receiver board. Control of this level (loading, checking, monitoring) is handled by a SUN SPARC station connected via a VME interface to both the PRM and DDB boards. The current DDBs can perform only rudimentary operations such as splitting and combining bit strings, summing binary quantities, etc. In order to transform the neural network input into more physics-oriented quantities, new preprocessing algorithms have been designed for DDB2, as described in Section III. Simulations performed using these algorithms show a significant impact on the trigger selectivity. Figs. 1 and 2, for example, compare the neural network trigger output for real φ photoproduction data using the current DDB and a simulation of DDB2. The separation between physics and background is significantly improved with DDB2. The DDB2 boards are designed to implement the new intelligent preprocessing while retaining the old DDB functionalities to maintain compatibility between the versions.
III. PREPROCESSING ALGORITHMS FOR DDB2
The Level 2 preprocessing is performed in 4 steps (clustering, matching, ordering and post-processing), which are designed to combine data from the sub-detectors having similar properties. These steps, represented schematically in Fig. 3, are outlined in the following subsections. Additional details may be found in [11].
A. Clustering
This algorithm combines neighboring quantities within a sub-detector. As the intrinsic features of the sub-detectors are different, multiple varieties of the algorithm are necessary, even if the main idea is the same throughout.
The data formats and computed parameters of the subdetectors are summarized in Table I . Each is represented by a θ, ϕ plane, or layer, in which signals are collected. Received data consist of an address correlated with their position in the layer and a coded quantity representing the energy (calorimeters) or "pseudo-energy" (other detectors).
In the Liquid Argon Calorimeter, data are clustered in the electromagnetic and hadronic parts separately, providing two types of granularities, big towers and trigger towers. The algorithm simply sums nearest neighbor energy around a local peak in order to determine a region of interest. For each cluster found, the energy of the center, Ecent, the energy of the ring, Ering, corresponding to the energy of the surrounding neighbors, and the total energy, Etot, are evaluated. The number of hits, Nhits, containing energy above a certain threshold and the angular information are also provided to the next algorithmic steps.
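The nearest-neighbor summation around a local peak can be sketched as follows. This is a minimal software model, assuming a rectangular grid of towers indexed by (θ, ϕ), an 8-cell ring, no wrap-around in ϕ, and a simple tie rule (a cell is a peak if no neighbor is strictly larger); the function and key names are hypothetical and the hardware realization is not reproduced.

```python
def find_clusters(grid, threshold=0):
    """Return one cluster per local energy peak of a 2D (theta, phi) grid."""
    n_theta, n_phi = len(grid), len(grid[0])
    clusters = []
    for t in range(n_theta):
        for p in range(n_phi):
            e_cent = grid[t][p]
            if e_cent <= threshold:
                continue
            # Energies of the direct neighbors that lie inside the grid.
            ring = [
                grid[t + dt][p + dp]
                for dt in (-1, 0, 1)
                for dp in (-1, 0, 1)
                if (dt, dp) != (0, 0)
                and 0 <= t + dt < n_theta
                and 0 <= p + dp < n_phi
            ]
            if any(e > e_cent for e in ring):
                continue  # a neighbor is larger, so this cell is not a peak
            e_ring = sum(ring)
            clusters.append({
                "theta": t,
                "phi": p,
                "e_cent": e_cent,
                "e_ring": e_ring,
                "e_tot": e_cent + e_ring,
                # Center plus every ring cell above the threshold.
                "n_hits": 1 + sum(1 for e in ring if e > threshold),
            })
    return clusters


# Illustrative tower energies (arbitrary units, invented for this example).
example_grid = [
    [0, 1, 0, 0],
    [2, 9, 1, 0],
    [0, 3, 0, 0],
    [0, 0, 0, 5],
]
example_clusters = find_clusters(example_grid)
```

In this example two clusters are found: one centered on the tower with energy 9 (Ering = 7, Etot = 16, Nhits = 5) and an isolated one on the tower with energy 5.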
The MWPC (MultiWire Proportional Chamber) data are transferred via the PQZP¹ bus without zero suppression and consist of 256 bits organized in 16 words ordered by ϕ, each of which is coded in 16 bits ordered with increasing θ index. Since the available data only contain binary values, a presumming is performed to obtain discrete quantities before applying the general cluster algorithm: each non-zero bit is summed with all its direct neighbors. The cluster pseudo-energy parameters Pcent, Ptot, Pring, Nhits as well as the angular information θ, ϕ are then calculated.
Because of the one-dimensional nature of the drift chambers, CJC (Central Jet Chamber) data only provide ϕ information, in the form of 45 bits signaling a hit in the respective bin. Four masks are provided according to the momentum of the tracks. One-dimensional pre-clustering is then performed for each of the bits before applying the general cluster algorithm in one dimension. The resulting cluster parameters Dcent, Dtot, Dring, Nhits and the angles θ, ϕ are then obtained.
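The MWPC presumming step can be illustrated with a short sketch. It assumes that bit i of each 16-bit word corresponds to θ index i, that "direct neighbors" means the eight surrounding cells, and that no wrap-around is applied in ϕ; these conventions and the function names are assumptions, not taken from the hardware specification.

```python
def unpack_bits(words, width=16):
    """Expand the 16-bit words (one per phi index) into rows of 0/1 values."""
    return [[(w >> i) & 1 for i in range(width)] for w in words]


def presum(bits):
    """Replace every set bit by the number of set bits in its 3x3 neighborhood."""
    n_rows, n_cols = len(bits), len(bits[0])
    out = [[0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            if not bits[r][c]:
                continue  # only non-zero bits are summed with their neighbors
            out[r][c] = sum(
                bits[r + dr][c + dc]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if 0 <= r + dr < n_rows and 0 <= c + dc < n_cols
            )
    return out


# Illustrative hit pattern: three adjacent hits in an otherwise empty plane.
words = [0] * 16
words[3] = 0b0110
words[4] = 0b0010
pseudo_energy = presum(unpack_bits(words))
```

The resulting pseudo-energies are small integers (here 3 for each of the three hit cells), which can then be fed to the same peak-finding cluster algorithm as the calorimeter energies.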
Data provided by the SpaCal calorimeter pass three different thresholds (low, medium, high) and arrive in two granularities: coarse (5x5) for low and high thresholds, and fine (20x20) for medium threshold. No clustering is performed for data coming from sub-detectors of coarse granularity as it would not be useful. Nevertheless, for subdetectors of fine granularity, pseudo-energy parameters Scent, Stot, Sring, Nhits, formed from pre-summed bits, as well as the usual angles θ, ϕ, are furnished.
¹ PQZP: Parallel Quickbus Zero Suppression Processor

B. Matching
The matching algorithm builds objects from clusters in the different sub-detectors according to their three dimensional positions. Two types of matching are processed relative to the angular distributions in the sub-detectors. Table I shows the type of algorithms and the layers being matched together in the two distinct topological regions.
The algorithm starts from a cluster center in an upper layer and determines clusters having the same angular coordinates in the subsequent layers. Matched clusters are then merged to constitute an object whose properties are stored for later use. Processed clusters are removed to avoid double counting, except for the one dimensional CJC layers whose clusters may be used several times. An exact match between θ and ϕ of the clusters is not necessary; a deviation is tolerated to take into account statistical or systematic variations. In order to take into consideration the different layers' granularities, look-up tables are supplied, storing all the topological relations of clusters between sub-detectors. The number of objects is configurable but a maximum of 16 is foreseen. 
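The matching procedure can be sketched in software as follows. The layer names, the layout of the look-up table (a dictionary keyed by seed layer, seed address and target layer), the choice of the first compatible cluster, and the rule that an unmatched seed forms an object on its own are all assumptions made for this illustration.

```python
MAX_OBJECTS = 16  # maximum number of objects foreseen in the text


def match_clusters(layers, lut, seed_order, reusable=("CJC",)):
    """Build objects from clusters of different layers using a look-up table.

    layers:     dict layer name -> list of cluster addresses
    lut:        dict (seed layer, seed address, other layer) -> set of
                addresses in the other layer that are topologically compatible
    seed_order: layers used as seed, in processing order
    reusable:   layers whose clusters may be used by several objects
    """
    remaining = {name: list(addrs) for name, addrs in layers.items()}
    objects = []
    for seed_layer in seed_order:
        for seed in list(remaining[seed_layer]):
            if len(objects) >= MAX_OBJECTS:
                return objects
            obj = {seed_layer: seed}
            for other in remaining:
                if other == seed_layer:
                    continue
                allowed = lut.get((seed_layer, seed, other), set())
                hit = next((a for a in remaining[other] if a in allowed), None)
                if hit is not None:
                    obj[other] = hit
                    if other not in reusable:
                        remaining[other].remove(hit)  # avoid double counting
            if seed_layer not in reusable:
                remaining[seed_layer].remove(seed)
            objects.append(obj)
    return objects


# Hypothetical clusters and topological relations.
layers = {"LAr_em": [10, 20], "LAr_had": [11, 40], "CJC": [3]}
lut = {
    ("LAr_em", 10, "LAr_had"): {11, 12},
    ("LAr_em", 10, "CJC"): {3},
    ("LAr_em", 20, "CJC"): {3},
    ("LAr_had", 40, "CJC"): {5},
}
objects = match_clusters(layers, lut, seed_order=["LAr_em", "LAr_had"])
```

In this example the CJC cluster 3 contributes to two objects, while the hadronic cluster 11 is consumed by the first object and cannot be matched again. The tolerance on θ and ϕ is encoded entirely in the sets stored in the look-up table.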
C. Ordering and post-processing
The objects delivered by the matching algorithm are subsequently combined to form exploitable quantities that may be sent to the neural network. In the ordering step, objects are sorted into three lists, executed in parallel, according to parametrizable variables, for example angular orientation, total energy, presence of a muon, etc. The angular information of the sorted objects is then expressed relative to the biggest object in the list.
In the post-processing step, thresholds are applied to quantities calculated from the sorted objects. For example, one could count the number of objects in the LAr lists having an energy above a certain threshold and a number of hits below a specific number. The exact specifications of this step are still under investigation and will depend upon the reaction to be studied.
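A minimal sketch of the ordering step for one list is given below. It assumes that objects carry integer θ and ϕ bin indices, that ϕ is periodic with a hypothetical number of bins `n_phi`, and that "biggest" means the largest value of the chosen sort key; the key names are illustrative.

```python
def order_objects(objects, key="e_tot", n_phi=16):
    """Sort objects by `key` (descending) and express angles relative to the leader."""
    ranked = sorted(objects, key=lambda o: o[key], reverse=True)
    if not ranked:
        return []
    lead = ranked[0]
    return [
        dict(
            o,
            d_theta=o["theta"] - lead["theta"],
            d_phi=(o["phi"] - lead["phi"]) % n_phi,  # phi treated as periodic
        )
        for o in ranked
    ]


# Hypothetical objects delivered by the matching step.
matched = [
    {"e_tot": 5, "theta": 2, "phi": 1},
    {"e_tot": 9, "theta": 4, "phi": 14},
    {"e_tot": 7, "theta": 1, "phi": 15},
]
ordered = order_objects(matched)
```

The three lists mentioned in the text would correspond to three calls with different sort keys, which the hardware can evaluate in parallel.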
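The counting example mentioned above could look as follows in software. Since the exact specification of this step is still open, the thresholds, key names and sample values are purely hypothetical.

```python
def count_selected(objects, e_min, max_hits):
    """Count objects with total energy above e_min and fewer than max_hits hits."""
    return sum(1 for o in objects if o["e_tot"] > e_min and o["n_hits"] < max_hits)


# Hypothetical LAr object list.
lar_objects = [
    {"e_tot": 16, "n_hits": 5},
    {"e_tot": 5, "n_hits": 1},
    {"e_tot": 12, "n_hits": 9},
]
n_selected = count_selected(lar_objects, e_min=10, max_hits=6)
```

Only the first object passes both conditions here, so a single integer is passed on to the neural network instead of the full object list.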
IV. LEVEL 2 HARDWARE SCHEME WITH DDB2
A. Overview
Depending on the type of events to be detected, different DDB2 boards and associated CNAPS boards are used, functioning in parallel on the same data stream but differently configured in order to target specific properties of a collision.
The general structure of the Level 2 trigger is represented in Fig. 4. The current system is composed of two crates, one containing the CNAPS processor-based VME boards in which the neural networks are implemented and the other, the associated DDB preprocessing boards. In the upgraded system, five new DDB2 boards and their corresponding CNAPS boards are added in a third crate, thus enabling a combination of the two preprocessing modes. Other cards implementing the data acquisition from the Level 1 trigger, control, monitoring and interfaces are also provided in these crates. More technical details about the general configuration of the Level 2 trigger are given in [12].
Since the available time for Level 2 is 20 µs and 12 µs are necessary for the neural network-related calculations, the remaining 8 µs must be sufficient to execute the clustering, matching, ordering and post-processing. This timing constraint requires exploiting maximum parallelism and developing pipelined structures. These concepts are treated more explicitly in the following sections.
B. Input protocol and hardware specification
Data transferred to the DDB2 are provided by the Level 1 trigger and collected by receiver cards, which distribute them in a time-multiplexed way on the Level 2 bus composed of 8 sub-buses. An interleaved mode allows data to be transferred at 20.8 MHz, i.e., double the HERA master clock speed of 10.4 MHz.
All algorithms, interfaces and modules are described in VHDL using the VIEWLOGIC design flow and are to be implemented in Field Programmable Gate Arrays (FPGAs) from the XILINX™ VIRTEX family [13]. Processing circuits on the DDB2 boards are designed to function at multiples of the HERA master clock frequency, in particular at 20.8 and 41.6 MHz, and simulations to date indicate that there should be no problems in attaining this goal.
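Combining the 8 µs preprocessing budget with the clock frequencies quoted above gives the number of clock cycles available to the pipeline. The short calculation below is plain arithmetic on the numbers given in the text; the function name is illustrative.

```python
import math

LEVEL2_BUDGET_US = 20.0      # total Level 2 decision time
NETWORK_TIME_US = 12.0       # time needed for the neural network calculations
PREPROCESSING_US = LEVEL2_BUDGET_US - NETWORK_TIME_US  # 8 microseconds


def cycles_available(clock_mhz, budget_us=PREPROCESSING_US):
    """Whole clock cycles that fit in the budget (MHz times microseconds)."""
    return math.floor(clock_mhz * budget_us)


cycle_budget = {f: cycles_available(f) for f in (10.4, 20.8, 41.6)}
```

At 41.6 MHz this corresponds to roughly 330 clock cycles for clustering, matching, ordering and post-processing together, which is why these steps must overlap in a pipeline rather than run strictly one after the other.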
C. DDB2 board description
The DDB2 boards implement all algorithms that process the data from Level 1 before they are provided to the neural network. They must also provide control functionality, interface circuitry with any host computer and test capabilities. A block diagram of the DDB2 is given in Fig. 5. The input module constitutes the interface with the L2 bus and conveys incoming data to three different circuits as described below.
1) The MOST master-chip
A MOST (Monitoring, Setup and Test) interface manages the communication between all FPGAs of the DDB2 and maintains the connection with any host computer. Moreover, it is used to access the external memories from the VME interface. A master chip supplying the VME slave interface is linked to the other FPGAs in which clustering, matching, ordering and post processing are implemented, making the configuration of these circuits via a host computer possible.
2) The event controller circuit
The event controller circuit manages the event handling, the data collection from the data processing chips and the transfer to the CNAPS boards. Status checking and all other steering of the circuits are also provided by this module.
3) The processing chips
The general functionality of these chips is shown in Figs. 6 and 7. The data arrive sequentially on the L2 bus and are distributed through FPGAs consisting of two distinct units. The first implements both clustering and matching algorithms; the second, both ordering and post-processing, as described below.
a) The clustering/matching unit
The clustering part of the circuit is made of eight units working in parallel on the data provided by the eight Level 2 sub-buses. Each unit sorts the incoming data according to predefined criteria such as energy or angular information.
During a first step, addresses of every cluster center and its corresponding parameters are stored in two distinct memories. Subsequently, each center's address is sequentially sent to a look-up table that provides the address of the corresponding neighbors and delivers it to the CAM (Content Addressable Memory) module. This module consists of storage devices which can be addressed by their own contents. A data value input to the CAM is simultaneously compared to all the stored data. This module enables a cluster to determine whether a pre-stored one belongs to its neighborhood or not. Neighbor address determination is not obvious and depends on the technology of the sub-detector and how the granularity is defined. If the neighbor condition is fulfilled, the properties of the centers are merged and the new cluster's coordinates become those of the biggest energy center.
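The role of the CAM in the merging of cluster centers can be illustrated with a small software model. In hardware all stored words are compared simultaneously; the loop in `match` only emulates this behavior. The class and function names, the dictionary-based neighbor look-up table and the descending-energy processing order are assumptions made for this sketch.

```python
class ContentAddressableMemory:
    """Software model of a CAM: look up slots by their stored content."""

    def __init__(self, size):
        self.slots = [None] * size

    def store(self, slot, value):
        self.slots[slot] = value

    def match(self, value):
        """Return an integer bit vector with bit i set if slot i holds `value`."""
        vec = 0
        for i, stored in enumerate(self.slots):
            if stored is not None and stored == value:
                vec |= 1 << i
        return vec


def merge_centers(centers, neighbor_lut):
    """Merge neighboring centers; the highest-energy center keeps its address.

    centers:      dict address -> energy
    neighbor_lut: dict address -> iterable of neighbor addresses
    """
    addresses = list(centers)
    cam = ContentAddressableMemory(len(addresses))
    for slot, addr in enumerate(addresses):
        cam.store(slot, addr)
    merged, absorbed = {}, set()
    for addr in sorted(addresses, key=lambda a: centers[a], reverse=True):
        if addr in absorbed:
            continue
        energy = centers[addr]
        for nb in neighbor_lut.get(addr, ()):
            hits = cam.match(nb)  # is this neighbor address a stored center?
            for slot, other in enumerate(addresses):
                if (hits >> slot) & 1 and other not in absorbed and other not in merged:
                    energy += centers[other]
                    absorbed.add(other)
        merged[addr] = energy
    return merged


# Hypothetical centers (address -> energy) and neighbor relations.
centers = {5: 10, 6: 4, 20: 7}
neighbor_lut = {5: (4, 6), 6: (5, 7), 20: (19, 21)}
merged_centers = merge_centers(centers, neighbor_lut)
```

Here the centers at addresses 5 and 6 are neighbors, so they are merged into a single cluster located at address 5, the center with the higher energy, while the isolated center at address 20 is left unchanged.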
The output unit combines the results obtained in the previous steps and suppresses potential plane-splitting due to the distribution of data across the eight sub-buses of the L2 bus. Parameters described in Table I are also calculated in this unit. Addresses of newly found clusters are then provided to the matching part of the chip, which stores them in internal registers after a sorting procedure. This process is concurrently executed in all processing chips at the rate of the outgoing clusters. The number of clusters to be matched is configurable but limited to 16.
The matching unit is composed of 16 input registers and 16 vicinity comparators connected to their associated external memory in which neighborhood information is stored. The entire operation is managed by a control module, and an output-encoding unit delivers the addresses of clusters constituting the object. The matching procedure is initiated once all clusters coming from all layers are stored in registers and when there are no more clusters to come.
A first layer is defined as the seed layer, and the address of its first cluster, i.e., the one ranked highest by the sorting criterion, is sent to the inter-bus used to address the memories. The coordinates of the neighbors of this cluster are then accessed and compared with the coordinates of the clusters available in all filled registers.
When all clusters of the seed layer have been processed, the same operation is carried out considering the next available layer as seed layer. The sequence of processing layers can be configured. Matched clusters except those coming from the CJC layer are removed and cannot be matched again to build another object, in order to avoid double-counting.
In the case of successful comparison within a layer, a single bit corresponding to the matched cluster's address is set in a comparison vector. The bits of this vector are then decoded and used to address memories in which all clusters' parameters were previously stored by the clustering unit. Each object can have access to all the properties of the clusters that build it. The parameters are then collected and supplied to the ordering unit.
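The comparison vector and its decoding can be modeled with a few lines of code. The sketch assumes that register i and parameter memory entry i refer to the same cluster and represents the neighborhood read from the external memory as a set of addresses; all names and values are hypothetical.

```python
N_REGISTERS = 16  # number of input registers and vicinity comparators


def comparison_vector(registers, neighborhood):
    """Set bit i when the cluster address in register i lies in the neighborhood."""
    vec = 0
    for i, addr in enumerate(registers[:N_REGISTERS]):
        if addr is not None and addr in neighborhood:
            vec |= 1 << i
    return vec


def decode(vec):
    """Return the indices of the bits set in the comparison vector."""
    return [i for i in range(N_REGISTERS) if (vec >> i) & 1]


def collect_parameters(vec, parameter_memory):
    """Read the stored parameters of every matched cluster."""
    return [parameter_memory[i] for i in decode(vec)]


# Hypothetical register contents (None marks an empty register).
registers = [100, 205, None, 310] + [None] * 12
parameter_memory = [{"e_tot": 3}, {"e_tot": 8}, None, {"e_tot": 5}] + [None] * 12
vector = comparison_vector(registers, neighborhood={205, 206, 310})
matched_parameters = collect_parameters(vector, parameter_memory)
```

In hardware the 16 comparisons are performed concurrently by the vicinity comparators; the loop above is only a sequential emulation of that step.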
b) The ordering/post-processing unit
The ordering and post-processing units will execute the exact algorithms outlined in Section III-C, with parameter lists stored in the internal memories of the FPGAs, so as to provide data that can be directly exploited by the neural networks. In contrast to the clustering and matching units, which are now completely designed and checked out, the exact hardware implementation of the ordering and post-processing units has not yet been completely defined.
D. Hardware status
The assembled DDB2 circuit boards will be available shortly, after which programming of the FPGAs can begin. The ordering and post processing algorithms are still under development. As for clustering and matching, the hardware specifications are completely defined; however, these must be tailored to the specific sub-detector modules. For the LAr preprocessing circuit this is nearly implemented, while the SpaCal and Towers chips are still under development.
The specifications of the steering modules are completed. The MOST master chip and event controller will be checked out, along with the LAr module and any other modules that are ready, in an upcoming beam test, which should yield some first results. This configuration can then be upgraded to incorporate the remaining circuits as they become available.
V. CONCLUSION
We have presented the algorithmic concepts and the technical realization of the new preprocessing board DDB2, which may be seen as an intelligent way of triggering data in the H1 experiment at Level 2. The system applies original principles in the sense that it exploits physical ideas to improve the trigger selectivity. The hardware implementation has also represented a challenge because of the relatively strong timing constraint of 8 µs to process all algorithms. This problem has been solved by taking advantage of the intrinsic parallelism of the data distribution and by pipelining in an optimal way.
All these concepts are implemented making intensive use of FPGAs, which are interesting for several reasons. First, the recent development of these circuits in terms of resources and speed makes them an attractive alternative to custom circuits like ASICs. Moreover, their relatively low cost makes it possible to rapidly implement a prototype design without major developmental constraints. The reconfigurability is also very important, enabling flexibility, adaptivity and independence from possible changes in the design specifications. In the case of DDB2, the XILINX™ VIRTEX family has been chosen, which combines the general properties of programmable logic with its own technical features.
