The Design Real-Time 2.0 software environment provides the user with a set of tools to create a fully programmable real-time system which can sustain a rate of several MHz from multi-channel inputs, can extend processing time in a pipeline stage, and can provide a result within a few hundred ns. The tool set allows the user to a) create applications of different sizes with different throughput and filtering algorithms; b) select the processor speed, internal bus width, lookup tables, and number of input and output result bits for each set of data received at each channel for each algorithm execution; c) simulate a specific parallel-processing system for a given algorithm on different sets of data; d) monitor the system in real time via an RS232 interface; e) create a hardware component (FPGA or ASIC) by interfacing to the Electronic Design Automation (EDA) tools. The advantage of these tools is that they allow simulation, before construction, of an entire programmable high-speed data acquisition and processing system, and selection of the processor speed, bus width, and real-time algorithm which is most cost effective for a specific application where general-purpose processors fail in speed and performance. Benefits derived from the use of the novel architecture and cost comparisons with respect to similar designs are also provided.
I. INTRODUCTION
The Design Real-Time 2.0 software environment for designing programmable systems capable of acquiring and processing data at a rate of several MHz is based on the 3D-Flow architecture.
The advantage of the 3D-Flow programmable system [1], [2], [3], [4] with respect to the "hardwired" alternative systems described in the references [5], [6], [7], [8], [9], [10], [11] is in a) allowing it to adapt to unexpected operating conditions and enabling new, unforeseen physics because of its flexibility/scalability; b) simplifying the electronics in many areas (see Section V-B: backplane, number of boards, cost, size of the system, ease of monitoring in real time, etc.); c) allowing extension of the processing time in a pipeline stage (see Sec. V-A1); d) providing full programmability in processing data acquired from multisensor systems at a speed not envisaged before; and e) making it possible to build software tools (the Design Real-Time) capable of creating, simulating, and verifying at the gate level, before construction, entire systems of different sizes and topologies.
The 3D-Flow architecture has the above advantages because it is based on a single replicated component, while the "hardwired" system does not have these advantages because a) it would have required an exorbitant amount of effort to model each different component at the gate level, and b) it would have been limited to a specific application. Similar high-speed systems in current High Energy Physics (HEP) experiments, or the ones proposed for future experiments at the Large Hadron Collider at CERN, Geneva, a) make use of special hardwired ASIC development for each experiment, and b) use a time window for each pipeline processing stage that must not exceed the time interval between two consecutive input data. (See Sec. V-A1.)
The innovation is the 3D-Flow processor and system architecture which is the result of the ability to constrain an entire design to a single type of replicated component and a minimal number of different types of boards which make a system scalable, meeting the requirements of several high speed data acquisition and processing applications.
The significance of the innovation is that it meets the requirements (see Section V-C) of the specifications for HEP applications [10], [11], [12], [13] in less development time while using the latest technology and requiring a simplified implementation at a lower cost, although it is not limited to those applications. Other commercial applications such as Positron Emission Tomography (PET), quality control in industry, and high-speed data processing from multi-sensors or from high-speed communication can benefit from this innovation.
II. THE NEED FOR PROGRAMMABILITY IN FAST REAL-TIME DATA ACQUISITION AND PROCESSING
In commercial applications (see Fig. 1b) the demand for real-time digital video, image processing, and networking is increasing. The 2.5 Gbps optical networking products available today (and 10 Gbps available for long distances) require high-performance processing systems capable of handling Gbyte/s up to several Tbyte/s of information from multiple channels. The system should be scalable in size and also in performance as the technology level advances.
Performance of one 3D-Flow crate as described in [3].
In High Energy Physics applications (see Fig. 1a) we typically have a high input data rate (of the order of 800 Gbyte/s to a few Tbyte/s) with the need to detect some specific patterns (photons/electrons, single hadrons, muons, and jets, as well as global sums of energy and missing energy). In addition, there are combinations of objects such as lepton pairs and jets with leptons or missing energy. Valid patterns which satisfy the first-level trigger algorithm criteria occur only at a rate of the order of 100 kHz to 1 MHz.
III. SOLUTION TO BREAK CURRENT SPEED BARRIERS IN HIGH-SPEED PROGRAMMABLE SYSTEMS
The key element is the presence in the 3D-Flow processor and system of the Top-to-Bottom "bypass switches" (see Sec. III-A, Sec. V-A1), which remove the constraint of executing, within the time interval of two consecutive input data sets, the operations of 1. fetching input data; 2. exchanging with neighbors; and 3. performing eventual pattern recognition and data reduction in order to obtain a reasonable amount of reduced data that can be sent through a reasonable number of output lines.
The above feature, together with the ability 1. to constrain to a single type of replicated circuit; 2. to constrain to a minimal number of different boards; and 3. to constrain to an architecture that simplifies software development and hardware assembly, and which meets the requirements of several fast real-time applications, provides a novel processor and system architecture which breaks the current speed barriers in programmable systems.
This novel architecture feature allows for implementation of a programmable acquisition and processing system acquiring data from multi-sensors at speeds related to the processor speed in the following manner. For example, with a processor speed @ 100 MHz, the system can acquire from each channel a) 4-bit data @ 400 MHz, b) 8-bit @ 200 MHz, c) 16-bit @ 100 MHz, or d) 32-bit @ 50 MHz. The input data rate and the complexity of the real-time algorithm can change and will affect only the latency of the results.
Since the processor input Top port is 8 lines multiplexed to an internal 16-bit wide bus, the 4-bit @ 400 MHz inputs from the sensors will require an external 1:2 multiplexer.
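The rates quoted above are all consistent with a fixed per-channel bandwidth equal to the internal bus width times the processor clock. The following minimal sketch works through that arithmetic; the function name and the constant-bandwidth model are assumptions made for illustration and are not taken from the Design Real-Time tools.

```python
# Illustrative arithmetic only: assumes the per-channel bandwidth is fixed at
# (internal bus width) x (processor clock), which reproduces the figures
# quoted in the text for a 100 MHz processor.

INTERNAL_BUS_BITS = 16  # internal bus width stated in the text


def max_input_rate_mhz(processor_mhz: float, word_bits: int) -> float:
    """Return the highest per-channel word rate (MHz) under the assumed model."""
    if word_bits <= 0:
        raise ValueError("word_bits must be positive")
    return processor_mhz * INTERNAL_BUS_BITS / word_bits


# Word widths listed in the text, evaluated for a 100 MHz processor.
rates = {bits: max_input_rate_mhz(100, bits) for bits in (4, 8, 16, 32)}

for bits, rate in rates.items():
    print(f"{bits:2d}-bit data @ {rate:.0f} MHz")
```

Under this model, halving the word width doubles the sustainable input rate, which is why a narrower word can be acquired faster than the processor clock.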
A. Component of the technology platform
The overall architecture is based on a single circuit, the 3D-Flow [1] Processing Element (PE), consisting of fewer than 100K gates. It is technology independent and is replicated several times in a chip, on a board, and on a crate.
The 3D-Flow processor is essentially a Very Long Instruction Word (VLIW) processor. Its 128-bit-wide instruction word allows for concurrent operation (up to 26 in one cycle) of the processor's internal units: Arithmetic Logic Units (ALUs), Look Up Table memories, I/O buses, Multiply Accumulate and Divide unit (MAC/DIV), comparator units, a register file, an interface to the RS-232 serial port used to preload programs and to debug and monitor during their execution, and a program storage memory.
The high-performance I/O capability is built around four bi-directional ports (North, East, South, and West) and two mono-directional ports (Top and Bottom). The Top port receives input data and the Bottom port transmits results of calculations along successive layers. Data and results flow through the stack from the sensors to the last layer. The last layer outputs results only. (See Figure 2.) A built-in pipelining capability (which complements the internal processing pipeline capability of the system) is realized using a "bypass mode," in which a processor will automatically transmit the data at its Top port to the Top port of the processor in the next layer without disturbing internal processing. The "bypass mode" is controlled in a synchronous manner by a programmable counter located on each CPU and presettable by RS-232. This feature thus provides an automatic procedure to route the incoming data to the layer free to process it.
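The routing effect of the bypass counters can be illustrated with a small behavioral model. This is a sketch under stated assumptions, not the actual 3D-Flow hardware logic: it assumes one counter per layer, preset so that layer k lets k data sets pass before capturing its first one, and that all counters advance once per incoming data set. The function names are hypothetical.

```python
# Behavioral sketch of "bypass mode" routing (assumed counter scheme, not the
# actual hardware implementation).


def route_data_sets(n_layers: int, n_data_sets: int) -> list[int]:
    """Return, for each incoming data set, the index of the layer that keeps it."""
    if n_layers <= 0:
        raise ValueError("n_layers must be positive")
    # counters[k] = number of data sets layer k still has to bypass.
    counters = list(range(n_layers))
    assignment = []
    for _ in range(n_data_sets):
        # Exactly one layer has a counter of zero: it captures this data set.
        assignment.append(counters.index(0))
        # The capturing layer reloads its counter; all others count down.
        counters = [n_layers - 1 if c == 0 else c - 1 for c in counters]
    return assignment


def processing_budget_ns(n_layers: int, input_period_ns: float) -> float:
    """Time each layer has before it must accept its next data set."""
    return n_layers * input_period_ns


print(route_data_sets(3, 7))         # layer index chosen for each data set
print(processing_budget_ns(10, 25))  # budget for 10 layers and a 25 ns period
```

In this model each layer receives one data set out of every n_layers, so its processing window grows linearly with the number of layers while the input rate is unchanged.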
B. Technology-independent 3D-Flow ASIC
The goal of this parallel-processing architecture is to acquire multiple data in parallel and to process them rapidly, accomplishing digital filtering, pattern recognition, data exchange with neighbors, and data formatting.
Because the 3D-Flow approach is based on a single type of circuit, it is natural to keep this modularity with a single type of replicated component that does not require glue logic for its interconnection. For this reason, and because IC design advances are very rapid, it is best to retain it in IP (Intellectual Property Virtual Component) form, written in generic, reusable VHDL code, so that it can be implemented at any time using any technology. In this way it can be implemented at the last moment using the latest technology that will provide the best characteristics (low power dissipation, lower cost, smaller size, higher speed).
SOCs (System On a Chip), utilizing IPs (Intellectual Property) Virtual Components (VC), are redefining the world of electronics, as exemplified at the DAC '98 conference.
IV. DESIGN REAL-TIME: THE SOFTWARE TOOLS TO INTERFACE BETWEEN APPLICATION, FPGA, AND ASIC FOR A SYSTEM DESIGNER
Now that the 3D-Flow architecture, the component of the technology, and the technology-independent ASIC have been described, the Design Real-Time software tools can be described. They allow the user to design fast programmable real-time systems of different sizes, topologies, and performance (8-bit or 16-bit wide internal buses). The steps are: a) to create a system and simulate it in software, and b) using the EDA tools, to create a component in hardware, simulate it, and verify each feature against the requirements of each section of the software system (e.g. stack, pyramid, real-time monitoring).
Design Real-Time is an integrated high-level design environment for the development, verification, and implementation of scalable high-speed real-time applications for which commercially available processors fail because of throughput requirements. Design Real-Time:
• interfaces with third-party EDA tools;
• is based on a single type of replicated component, the 3D-Flow PE (in the form of an IP block);
• is technology independent because the PE IP block can be targeted to the latest technology;
• takes the user to a higher level of abstraction and productivity gain during the design phase because of the simplicity of the 3D-Flow architecture, the powerful tools, the set of predefined macros, and the real-time algorithms available to the user;
• allows for implementation of the user's conceptual idea into the fastest programmable system at the gate level.
A. 3D-Flow Design Real-Time tools
The Design Real-Time tools allow the user to:
1. create a new 3D-Flow application (called a project) by varying system size, throughput, filtering algorithm, and routing algorithm, and by selecting the processor speed, lookup tables, and number of input and output bits for each set of data received for each algorithm execution;
2. simulate a specified parallel-processing system for a given algorithm on different sets of data. The flow of the data [...]
The "3DF-CREATE" software module allows the user to:
1. define a 3D-Flow system of any size;
2. interconnect processors for building a specific topology with or without the channel reduction stage ("pyramid");
3. modify an existing algorithm or create a new one. The complexity of the real-time algorithms for the first levels of trigger algorithms in HEP experiments, such as the ones reported in [1], [7], [10], [11], [12], [13], [14], [15], has been examined, and fewer than 10 layers (corresponding to 20 steps, each executing up to 26 operations) of 3D-Flow processors are required;
4. create input data files to be used to test the system during the debugging and verification phase.
The "3DF-SIM" module allows for simulation and debugging of the user's system real-time algorithm and generates the "Bit-Vectors" to be compared later with the ones generated by the third-party silicon foundry tools.
The "3DF-VPS" module is the Virtual Processing System that emulates a 3D-Flow hardware system.
The right side of Figure 4 shows the hardware flow of the 3D-Flow system implementation in a System-On-a-Chip (SOC). The same common entity, the IP 3D-Flow processing element (PE), shown in the center of the figure and previously used as the behavioral model in the simulation, is now synthesized in a specific technology by using the same code.
The number of chips required for an application can be reduced by fitting several PEs into a single die. Each PE requires about 100K gates, and the gate density increases continually (see Fig. 3). Small 3D-Flow systems may fit into a chip; for this reason, it is also called SOC 3D-Flow. However, when an application requires the building of a 3D-Flow system that cannot be accommodated in a single chip, several chips, each accommodating several 3D-Flow PEs, can be interfaced with glueless logic to build a system of any size to be accommodated on a board, on a crate, or on several crates [3].
C. Design Real-Time verification process
The verification process of an entire 3D-Flow system can be performed down to the gate level in the following steps:
• The 3DF-SIM: a) extracts from the system the input data for the selected 3D-Flow processor(s) for which an equivalent hardware chip (which was targeted to a specific technology) has been created, and b) generates the Bit Vectors for the selected processor(s);
• The same input data and the same real-time algorithm are applied to the hardware 3D-Flow model, and the simulation is performed using the third-party tools;
• Bit-Vectors generated by the third-party tools using the hardware model are compared with the Bit-Vectors obtained by the previous software simulation (3DF-SIM);
• Discrepancies are eliminated.
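The comparison step of the verification process above can be sketched as follows. The representation of a Bit-Vector as one string of '0'/'1' characters per clock cycle is an assumption made for illustration; the actual file formats produced by 3DF-SIM and by the third-party tools are not described here, and the function name is hypothetical.

```python
# Illustrative sketch: compare software-simulated and hardware-simulated
# Bit-Vectors cycle by cycle. The string-per-cycle format is an assumption.
from itertools import zip_longest


def compare_bit_vectors(software, hardware):
    """Return a list of (cycle, software_vector, hardware_vector) mismatches.

    A missing vector on either side (different run lengths) is reported
    with None in the corresponding position.
    """
    mismatches = []
    for cycle, (sw, hw) in enumerate(zip_longest(software, hardware)):
        if sw != hw:
            mismatches.append((cycle, sw, hw))
    return mismatches


# Hypothetical example data: the two runs disagree at cycle 1.
sw_vectors = ["0001", "0110", "1111"]
hw_vectors = ["0001", "0100", "1111"]
result = compare_bit_vectors(sw_vectors, hw_vectors)

for cycle, sw, hw in result:
    print(f"cycle {cycle}: software={sw} hardware={hw}")
```

An empty result list would indicate that the hardware model reproduces the software simulation for the selected processor and input data.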
D. Results from the use of Design Real-Time
Preliminary use of the Design Real-Time tools has made it possible to determine the parameters that led to the design of the data acquisition and processing system for pattern recognition (particles in HEP experiments) described in [3] and [4], providing:
1. simulation and implementation results of a real-time system for the Level-0 trigger of LHCb [3], [4], [11]
V. EXAMPLE OF APPLICATIONS WHICH BENEFIT FROM DESIGN REAL-TIME TOOLS
The benefits provided by the architecture described in this article (see full details in [3] and [4]) are applicable to similar applications which require high-speed data acquisition and processing not solvable by general-purpose microprocessors.
A. Advantages of the described architecture vs. current hardwired systems
The proposed architecture offers many advantages: flexibility, programmability, scalability, cost, software development simplification, simplified hardware implementation, and various other advantages in many specific applications.
The most recent questions by the trigger coordinator of the LHCb experiment at CERN (right after an L0 trigger implementation review) gave the opportunity to answer and to compare the advantages with alternative implementations. For example, some differences clarify the advantages, such as:
1) Extending processing time in a pipeline stage
In a high-speed data acquisition and processing system such as the ones at the LHC experiments at CERN, where 16- to 32-bit data per channel are received every 25 ns, a pipeline stage would not only need the time required to fetch the 32-bit input data and to exchange the information with its neighbors (see Figure 5), but would also need the time required to reduce the data received from neighbors (2x2 or 4x4) in order to be able to send through the exit port every 25 ns a reasonable amount of reduced data through a reasonable number of lines. The throughput problem posed by the need to exchange data is illustrated in Figure 5 and explained in its caption.
A design that needs to constrain each pipeline stage to 25 ns must impose limitations by:
1. partitioning the problem. (The option of building a system that handles only ECAL, another that handles HCAL, is not cost effective since more electronics has to be built. The problem is just deferred to a later stage with the need to build other electronics to correlate all partial results from the ECAL, HCAL, Pad chamber, etc., subsystems, with the disadvantage of not having the possibility of using raw data from all subdetectors within a specific area in an integrated manner for better particle identification.);
2. keeping the trigger algorithm very simple. (This may not provide the best efficiency.);
3. limiting the field of analysis to a small area (at the limit to a 2x2), with the intent to limit the number of hardware connections (this limits the efficiency);
4. designing fast electronics ("hardwired," or GaAs adder ASICs, which are not programmable and are expensive because development takes a long time, and which will be outdated by the time they need to be used).
Trigger architectures such as the ones adopted and described in [7] and [11] from LAL and CMS (as well as other groups such as Bologna, Atlas, etc.) have used options 1) and 2) in their solutions. LAL also opted for 3), while CMS performs the analysis on a larger area and has developed a 200 MHz GaAs 8-input 12-bit adder. Regardless, GaAs is not cost effective for common logical functions (it is more suitable for fast analog circuits, radiation-hard components, or digital circuits @ GHz). Problems such as the one of CMS would have found a higher-performance and lower-cost solution using the 3D-Flow architecture, which provides the possibility to execute algorithms requiring up to 250 ns and does not require special technologies such as GaAs.
If the constraint of 25 ns is eliminated, the user will not need to partition the problem into a section for ECAL, another for HCAL, etc., but will be able to use the raw data of a specific area from several subdetectors in an integrated manner for better particle identification. The following is a series of considerations:
1. Alternative approaches such as [7], [9], [10] do not allow for future algorithm flexibility and expansion, since the fixed number of lines at 1 to 2 m distance of cell-to-cell communication as described in [7], Section 3, is a limitation for the 25 ns fixed time window of the highly limited architecture. Within an architecture such as the LAL, the future speed increase provided by advances in technology will provide a minimal performance increase overall, due to the major speed limitation of the information propagation time across the fixed number of lines. Moreover, if the trigger algorithm requires the analysis of more raw data, the entire system will need to be redesigned.
2. While the 2x2 algorithm [7], [10] limits the number of data exchanges, it imposes a limit, however, on event detection efficiency. The need to change the algorithm in the future might not be too remote, since for the past five years the LHCb experiment in the baseline had the 3x3 algorithm, and large experiments such as CMS and Atlas use a 3x3 or a 4x4 trigger algorithm instead of the new 2x2 algorithm proposed for LHCb.
3. The approach of implementing the entire first-level trigger in FPGA (such as the approach in [7] by LAL) either a) must be very simple (to the detriment of trigger efficiency), or b) is very costly and difficult to handle, since the FPGA is best employed for simple combinatorial operations but becomes very inefficient with respect to ASIC when more complex pattern recognition functions need to be implemented.
4. Similarly, more complex hardwired systems such as the one by CMS were forced to partition the tasks into small sections in order to comply with the 25 ns pipeline stage maximum time window, and the hardware implementation became very complex and costly (see the complexity of the crate backplane with respect to the one of the 3D-Flow architecture, which has short traces and regular non-intersecting connections, and the different types of complex boards that were developed).
4) FPGAs vs. ASICs
During the phase of partitioning functions in a large design, the cost-effectiveness of implementing each function or group of functions in ASIC or in FPGA should be studied and tested.
Results of tests using different technologies and results of tests using different synthesizers, which are reported in [3] and [4], show a clear advantage to implementing the front-end combinatorial logic with some pipeline buffering in FPGA, while it is more cost-effective to implement circuits aimed at processing data for pattern recognition in ASIC.
B. The proposed architecture simplifies the implementation
In the 3D-Flow system, the layout of the components within a board, and of the boards in a crate, is simplified not only because the 3D-Flow chip is replicated several times and there is only a single type of board, but also because the connections between components and boards are regular.
As an example, let us take the backplane and compare it with the backplane of two other implementations: LAL group [5], [6], [7], section 3.1.1, and CMS first-level trigger implementation [11], section 2.2, Figures 4 and 5. In both designs traces go from several boards to one board, generating a concentration of traces higher in some areas than in others and requiring a special construction (e.g., 13 layers of PCB required by CMS).
The problem of the 3D-Flow approach needing to connect 20480 pins, which at first seems complicated, turns out to be simpler and solvable with 6-layer PCBs. The reason for the simplicity is the original approach of the 3D-Flow, which simplifies matters in many areas. A different architecture that would have to route traces connecting 20480 pins @ 400 Mbps in a "spaghetti" fashion might require more than 30 layers.
For a clearer understanding of how the 3D-Flow architecture simplifies the backplane as compared to other implementations, Figure 6 shows the scheme of the backplane LVDS links: the logical layout (bottom right) and the physical layout (bottom left) of the interconnection between neighboring processors to the North, East, West, and South.
Traces may be drawn with a controlled impedance at a value of our choice in the range of 50 to 100 Ohms. Each arrow of Figure 6 indicates a group of 320 traces. For best layout, and with the aim of eliminating all trace intersections, one set of seven of these groups will connect pins of even connectors in one PCB layer and another set the odd connectors in another PCB layer. Each group is made of 320 non-intersecting traces of about 5 cm in length, laid side by side on a total width of about 32 cm. These groups of traces will be evenly distributed on the PCB area without the formation of any high-density trace area, as is the case in all alternative implementations.
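As a quick sanity check on the trace density implied by these figures, the following sketch computes the average center-to-center spacing of the traces in one group. It assumes the 320 traces are spread evenly over the full 32 cm width; the function name is hypothetical.

```python
# Illustrative arithmetic only: average trace pitch within one group,
# assuming the traces are evenly distributed across the stated width.


def trace_pitch_mm(n_traces: int, total_width_cm: float) -> float:
    """Average center-to-center spacing in mm for evenly spaced traces."""
    if n_traces <= 0:
        raise ValueError("n_traces must be positive")
    return total_width_cm * 10 / n_traces


pitch = trace_pitch_mm(320, 32)
print(f"average pitch: {pitch} mm per trace")
```

A pitch on the order of a millimeter is consistent with the claim that no high-density trace area is formed.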
C. Fulfill requirements necessary for Physics performance of HEP experiments
The design of the first-level trigger algorithm of an experiment is a task of a large group of people performing many physics simulations. These simulations are usually performed in high-level language programs running on large computers and aim to find a set of parameters and an algorithm that together provide the selection of the best events containing some interesting particles (data) and reject the others.
The task of the hardware implementation, such as the 3D-Flow architecture, is to take the resulting parameters and algorithm generated by the above-mentioned groups and translate them into circuits that implement the required functions at the required speed. In other words, it translates the functions into silicon, i.e., the hardware architecture, the block scheme, the circuit, or the HDL code, the crates, the boards, the connectors, the cables, the timing, the synchronization, etc.
The parameters specify the input data rate, the number of bits per channel, the maximum output data rate expected, the depth of the FIFO which allows for data to be buffered between two different stages of the trigger, and the depth of the pipeline buffer, which takes into account the time interval the input data should be kept (including cables and electronic delay) during the trigger decision time.
The trigger algorithm is the sequence of operations which needs to be performed on the input data in order to determine if the event has to be kept for further investigation or rejected. The articles [3] and [4] describe in detail the hardware implementation of the LHCb level-0 trigger algorithm as specified in [12] (with the possibility of executing all trigger algorithms defined in the last 5 years, including the last 2x2 algorithm proposed by the LAL group), which meets the requirements of the experiment.
Verification of the requirements can be done at high level by checking whether:
1. the input data rate of 40 MHz is satisfied with a processor speed of 80 MHz;
2. the maximum number of bits from each detector channel is 23, which is satisfied by the 32-bit/channel capability of the 3D-Flow system;
3. the required output data rate is 1 MHz for a word not to exceed 64 bits, while the capability of the 3D-Flow system is a word of 64-bit @ 20 MHz without requiring an output lookup table;
4. in 250 ns the 3D-Flow system can execute (in the board designed in [3]) up to 20 steps, each one executing up to 26 operations, which is above and beyond any first-level algorithm foreseen in current experiments.
After this verification of the global requirements, which validates the 3D-Flow architecture against the requirement of the HEP experiment, the Design Real-Time will allow for the details to be verified by simulating any specific algorithm (3x3, 2x2, etc.) from the system level to the gate level.
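The four high-level checks listed above reduce to simple arithmetic, sketched below. The constants are the figures quoted in the text; the 16-bit internal bus width used for the input-rate check, the variable names, and the function name are assumptions made for illustration, not part of the Design Real-Time tools.

```python
# Illustrative check of the global requirements, using figures from the text.
# Assumption: a 32-bit channel word is transferred over a 16-bit internal bus,
# so it costs two processor cycles per word.

PROCESSOR_MHZ = 80
INTERNAL_BUS_BITS = 16        # assumed internal bus width
SYSTEM_BITS_PER_CHANNEL = 32
SYSTEM_OUTPUT_MHZ = 20        # 64-bit words
LATENCY_BUDGET_NS = 250
OPS_PER_STEP = 26

REQUIRED_INPUT_MHZ = 40
REQUIRED_BITS_PER_CHANNEL = 23
REQUIRED_OUTPUT_MHZ = 1       # words of at most 64 bits


def verify_requirements() -> dict:
    """Evaluate each high-level requirement and return the results."""
    sustained_input_mhz = PROCESSOR_MHZ * INTERNAL_BUS_BITS / SYSTEM_BITS_PER_CHANNEL
    # Integer arithmetic: ns x MHz / 1000 gives the number of clock cycles.
    steps = LATENCY_BUDGET_NS * PROCESSOR_MHZ // 1000
    return {
        "input_rate_ok": sustained_input_mhz >= REQUIRED_INPUT_MHZ,
        "channel_bits_ok": REQUIRED_BITS_PER_CHANNEL <= SYSTEM_BITS_PER_CHANNEL,
        "output_rate_ok": REQUIRED_OUTPUT_MHZ <= SYSTEM_OUTPUT_MHZ,
        "steps": steps,
        "max_operations": steps * OPS_PER_STEP,
    }


checks = verify_requirements()
for name, value in checks.items():
    print(f"{name}: {value}")
```

The step count follows directly from the clock period: at 80 MHz one step takes 12.5 ns, so a 250 ns budget corresponds to 20 steps.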
VI. COST/PERFORMANCE COMPARISON BETWEEN HARDWIRED SYSTEMS AND THE 3D-FLOW PROGRAMMABLE SYSTEM
The detailed board and system design of the 3D-Flow (including a list of ICs, connectors, cables and the layout of the components on the boards) is described in [3] and [4] .
To make a meaningful price comparison, a number of HEP documents quoting prices have been studied. Since the prices derived seemed low, the cost of the 3D-Flow boards has been estimated higher. The following criteria have been applied: a) 3D-Flow boards for the simpler 2x2 algorithm $4/cm², and for the more complex 3x3 algorithm requiring more 3D-Flow chips $6.4/cm²; b) LAL-Bologna $2.7/cm²; c) CMS $3.3/cm². Even if the cost of the 3D-Flow board is estimated at almost twice that of the CMS boards, the 3D-Flow architecture has a definite advantage in cost (it is about three times less expensive, which will also be reflected in lower maintenance cost), in addition to its advantage in programmability, scalability, and flexibility.
LAL and Bologna boards (36.6 cm x 40 cm) have been estimated at an average of $3600/board. CMS large boards (36.6 cm x 40 cm) have been estimated at an average of $4800/board. CMS small boards (36.6 cm x 28 cm) have been estimated at an average of $3400/board.
The "3D-Flow mixed-signal processing boards" (36.6 cm x 34 cm) have been estimated at $5000/board for the 2x2 LAL algorithm and $8000/board for the complex CMS algorithm.
The cost to design a 9U board has been estimated at $77000. The cost to design a backplane has been estimated at $50000. The cost of a backplane has been estimated at $3600. The cost of a 9U crate has been estimated at $9000.
Legend:
• LAL board design (4): front-end card (24S units) - (Ref. [6] Sec. 4
While the cost benefit in an experiment is considerable, even more important is the performance of the level-0 trigger and its flexibility to accommodate future changes. The list below gives references for the features/performances. The details are described in Sections I, III, and V of this article and in the references listed in the table.
• Fully programmable
• Add subsystems later: [3], [4]
• No boundary limitation
• Technology independent
VII. CONCLUSIONS
The Design Real-Time allows the user to create, simulate, and verify the design and implementation of a programmable real-time system for which general-purpose processors are limited in speed. Its use has demonstrated these advantages in first-level trigger applications in HEP (however, it is not limited to these applications). The benefits of the 3D-Flow system in HEP experiments compared to the alternative systems at all levels are: a) SCIENTIFIC: allowing it to adapt to unexpected operating conditions and enabling new, unforeseen physics. b) TECHNICAL: flexibility, programmability, and scalability, by allowing the size of the system to be expanded by cascading several crates in x, y dimensions. The complexity of the algorithm (z dimension) can easily be scaled to 10 times the time interval between two consecutive input data sets of 32-bit/channel using a 3D-Flow processor @ 80 MHz (corresponding to an algorithm of up to 20 steps executing up to 26 operations per step). c) COST: at least 50% lower than the alternative proposals. The 3D-Flow architecture clearly has an advantage in cost (about three times less expensive, which will also be reflected in lower maintenance cost), even if the cost of the 3D-Flow board is estimated as almost twice that of the CMS boards. This cost reduction is due to the innovative 3D-Flow architecture, which also provides the most important advantage in system programmability, scalability, and flexibility and in simplifying its hardware implementation.
VIII. ACKNOWLEDGMENTS
I thank the SBIR office of DOE of Dr. Robert Berger, for all its support. I am grateful to A. E. Werbrouck for interesting
