HLTB design for high-speed multi-FPGA pipelines by Magri, Josef et al.
HLTB Design for High-Speed Multi-FPGA Pipelines
Josef Magri∗, Owen Casha, Keith Bugeja, Ivan Grech and Edward Gatt
University of Malta, Msida, Malta
∗E-mail: josef.magri.12@um.edu.mt
Abstract—This paper presents the design and implementation
of a high-level test bench for high-speed multi-FPGA pipelines,
to model and simulate architectures that gather and process
large amounts of data. The test bench was successfully employed
in a nuclear particle detector system, forming part of a large
physics experiment. The design under test consists of three main
stages. The first stage simulates the acquisition of the analog
input data, providing designers with a means to verify correct
operation with unlimited input variation, be it actual or generated
data. The second stage, which contains multiple hierarchies of
FPGAs, comprises the actual detector firmware design. The
last stage is divided into two modules: data acquisition and
triggering, which are based on non-synthesizable VHDL features.
The simulated system has been verified against the provided
technical documentation. Each module was individually tested;
subsequently, integration testing of the entire pipeline was carried
out to ascertain its physical correctness across design corners. The
upfront costs in terms of time and resources required to set up
the environment are outweighed by the benefits of having such
a system, which range from the scalability, predictability and
manageability of modular systems to overcoming the associated
limitations of high-speed synthesis and instrumentation. Hence,
factoring high-level test benches into the design pipeline becomes
not just an asset but an invaluable tool for the optimization,
testing and verification of complex high-speed designs.
I. INTRODUCTION
The proliferation of state-of-the-art computational
modelling and simulation software, made possible by a
substantial increase in computer processing power during
the past decade, has enabled the creation of ever more
complex hardware designs, which necessitate improved
testing and verification techniques to ensure a correct and
more effective final product. Emerging technologies such
as Python-based Cocotb [1] and MATLAB HDL coder [2]
integrate with electronic design automation (EDA) solutions
to provide verification services through co-simulation
environments. In general, these technologies employ external
direct programming interfaces (DPI) or Verilog procedural
interfaces (VPI), which require the actual design under
test (DUT) to be modified for a correct integration. This
limitation introduces an increase in problem complexity due
to a lack of demarcation between the design and verification
processes. Testing and verification tools should be based on
sound design principles such as abstraction, modularity and
separation of concerns; a design should be kept separate from
the simulation environment for a seamless integration with
these tools, without requiring any modifications to the DUT.
At the time of writing, standards [3] for both hardware
and software verification and validation are being compiled,
highlighting the importance of producing fully functional
embedded systems as early as possible in the design cycle,
thus reducing the need for expensive testing procedures.
Moreover, in multi-module environments with high-speed
interconnections, where instrumentation issues limit signal
probing, ensuring system correctness using testing is neither
feasible nor practical. Other systems may require the engineer to
be physically close to the hardware set-up in order to carry
out testing. Since this arrangement is not always feasible,
remote environments are set up to address these issues,
leading to additional costs and overheads. In order to reduce
both the investment risk and the maintenance overhead, the
overall system behaviour must be clearly understood before
its actual deployment, entailing the efficient modelling of
the environment surrounding the design and its respective
verification to shorten the testing cycle.
In this work, we present a novel high-level test bench
(HLTB) design to enable efficient optimization, testing and
verification of high-speed multi-FPGA pipelines. The test
bench was implemented using a combination of hardware
description languages (HDL) and dynamic programming
languages; in particular, structural HDL was employed to
provide a layer of abstraction between the environment and
the design, while Tcl/Tk scripts were used for configuration
and monitoring of the DUT. This methodology, focusing
on abstraction and modularity, separates the role of the
system designer from that of the verification engineer while
facilitating their tasks through a graphical user interface (GUI).
In addition, this tool makes it possible to optimize specific
stages of the architecture independently, owing to the high
level of modularity of the HLTB.
II. APPLICATION
The HLTB was implemented and integrated within a
nuclear particle detector system, forming part of a large
physics experiment at the Conseil Européen pour la Recherche
Nucléaire (CERN). CERN is one of the oldest research
facilities in the world for particle and nuclear physics,
and hosts the circular Large Hadron Collider (LHC), a
particle accelerator with a circumference of 27 km. The High
Momentum Particle Identification Detector (HMPID) is one
of nineteen detectors located in ALICE (A Large Ion Collider
Experiment) at the CERN LHC. The read-out of the HMPID
modules is organized according to a dedicated architecture
based on the VLSI Gassiplex chip [4]. The 16-channel
pre-amplifier and signal conditioning Gassiplex chips are
multiplexed and uniformly distributed on the back side of
the cathode pad planes, each storing an electric charge via
track and hold circuitry, in coincidence with the arrival of a
trigger signal. Corresponding signals are sequentially sent to a
commercial 12-bit analog-to-digital converter (ADC) followed
by the DILOGIC, an ASIC specifically designed to eliminate
empty channels, subtract the baseline (pedestal) and locally
store the true digitized information [5].
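The zero-suppression and pedestal-subtraction role of the DILOGIC can be illustrated with a minimal behavioural sketch. It assumes a simple per-channel rule (keep a sample only if it exceeds its threshold, then subtract its pedestal); the function name `sparsify` and the example values are hypothetical and do not reproduce the actual ASIC logic described in [5].

```python
from typing import List, Tuple


def sparsify(samples: List[int],
             pedestals: List[int],
             thresholds: List[int]) -> List[Tuple[int, int]]:
    """Return (channel, pedestal-subtracted value) for channels above threshold.

    Assumption: a channel is considered empty when its digitized sample does
    not exceed its per-channel threshold.
    """
    kept = []
    for channel, (adc, pedestal, threshold) in enumerate(
            zip(samples, pedestals, thresholds)):
        if adc > threshold:
            kept.append((channel, adc - pedestal))
    return kept


# Hypothetical 12-bit samples for four channels.
samples = [12, 300, 15, 90]
pedestals = [10, 20, 14, 30]
thresholds = [25, 40, 30, 60]
hits = sparsify(samples, pedestals, thresholds)
print(hits)  # [(1, 280), (3, 60)]
```

Only channels 1 and 3 survive in this example, which mirrors the intent of storing locally only the non-empty channels.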
As the LHC Run 3 upgrade approaches (scheduled for the
period 2020-2022) [6], improving the read-out rate of the
front-end electronics of the sub-detectors presents a number
of challenges. Such sub-detectors are composed of complex
multi-FPGA hierarchies, making it very time consuming and
sometimes impractical to debug and monitor at hardware
level. For instance, very few fundamental signals within the
design are readily accessible by means of an instrument.
These challenges make the HMPID an excellent test case to
demonstrate the importance and relevance of the proposed tool.
Fig. 1. A block diagram of the HMPID system integrated within the high-level test bench. It consists of three main stages: the analog front-end (Stage 1), the
read-out electronics module (Stage 2), the data acquisition module and the triggering module (Stage 3).
III. IMPLEMENTATION
Figure 1 shows a block diagram of the HMPID system
integrated within the HLTB. It consists of three main stages:
the analog front-end (Stage 1), the read-out electronics module
(Stage 2), and the data acquisition and triggering
modules (Stage 3). The first stage simulates the acquisition of
the analog input data and signal conditioning performed by
the Gassiplex and provides a means to verify correct operation
with unlimited input variation, be it actual or generated data.
This data is stored in a series of text files and fed to the
simulation by a Tcl script monitoring the data clock. Stage
2, which contains multiple hierarchies of FPGAs, comprises
the actual detector firmware design. The last stage (Stage 3)
is divided into two modules, data acquisition and triggering,
which are based on non-synthesizable VHDL features.
The front-end stage of each column consists of 10 DILOGICs, where
each DILOGIC is fed by three 16-channel Gassiplex chips, resulting in
48 channels per DILOGIC and 480 channels per column. The ADC and
DILOGIC chips are mounted on a dedicated board called DILO5. Two DILO5
cards are read out by a column, an FPGA with an
embedded dual-port FIFO situated on its own board; each
column generates the control signals required to operate and
synchronize the Gassiplex and DILO5 cards. Furthermore, a
column can read data from the DILOGICs and store it for
partial event building and possible event selection. The column
controller is critical for the synchronization of data gathering
as it determines the correct and precise addressing of each
DILOGIC with respect to time division multiplexed Gassiplex
card data. The column and DILO5 boards are plugged on a
motherboard, called a segment, which provides a bi-directional
32-bit data bus as well as the distribution of power supplies and
trigger signals. One segment, controlled by one FPGA, hosts eight
column control cards and handles a total of 3,840 channels.
Three segment boards are daisy-chained by short flat cables
to enable a total read-out of 11,520 analog channels. The
Read-out and Control Board (RCB) interfaces and manages the
communication between the data acquisition module (DAQ)
and the Local Trigger Unit (LTU) with the segments. The RCB
also includes the Timing, Triggering and Control (TTC) [7,8]
receiver and Source Interface Unit (SIU) chips; the former is
used for trigger synchronization and the latter for data transfer
and control. The RCB reads all the fragments of the event
stored on a column, builds a complete event with the Detector
Data Link (DDL) format [9] for ALICE and sends it to the
DAQ over the DDL medium.
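The channel counts quoted above follow directly from the hierarchy. The sketch below recomputes them and adds a flat addressing scheme for illustration; the ordering used by `flat_index` (segment, then column, then DILOGIC, then channel) is an assumption made for this example and is not taken from the HMPID firmware.

```python
CHANNELS_PER_GASSIPLEX = 16
GASSIPLEX_PER_DILOGIC = 3
DILOGICS_PER_COLUMN = 10
COLUMNS_PER_SEGMENT = 8
SEGMENTS = 3

CHANNELS_PER_DILOGIC = CHANNELS_PER_GASSIPLEX * GASSIPLEX_PER_DILOGIC  # 48
CHANNELS_PER_COLUMN = CHANNELS_PER_DILOGIC * DILOGICS_PER_COLUMN       # 480
CHANNELS_PER_SEGMENT = CHANNELS_PER_COLUMN * COLUMNS_PER_SEGMENT       # 3,840
TOTAL_CHANNELS = CHANNELS_PER_SEGMENT * SEGMENTS                       # 11,520


def flat_index(segment: int, column: int, dilogic: int, channel: int) -> int:
    """Map a hierarchical address to a flat channel number (hypothetical order)."""
    if not (0 <= segment < SEGMENTS and 0 <= column < COLUMNS_PER_SEGMENT
            and 0 <= dilogic < DILOGICS_PER_COLUMN
            and 0 <= channel < CHANNELS_PER_DILOGIC):
        raise ValueError("address out of range")
    return ((segment * COLUMNS_PER_SEGMENT + column) * DILOGICS_PER_COLUMN
            + dilogic) * CHANNELS_PER_DILOGIC + channel


def hierarchical_address(index: int):
    """Inverse of flat_index: return (segment, column, dilogic, channel)."""
    if not 0 <= index < TOTAL_CHANNELS:
        raise ValueError("index out of range")
    rest, channel = divmod(index, CHANNELS_PER_DILOGIC)
    rest, dilogic = divmod(rest, DILOGICS_PER_COLUMN)
    segment, column = divmod(rest, COLUMNS_PER_SEGMENT)
    return segment, column, dilogic, channel
```

The last valid address, `flat_index(2, 7, 9, 47)`, maps to 11,519, one less than the total number of analog channels read out through the three segments.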
The Global Trigger Unit (GTU) is the triggering system of
ALICE which delivers commands from its control centre. The
LTU decomposes the trigger signal and instructs the TTCtx
to send the trigger data needed by the assigned sub-detector
settings (Fig. 1, Stage 3). The Central Trigger Processor (CTP),
at the heart of the ALICE TTC system [10], is designed to
identify and select events containing significant and potentially
interesting physical phenomena. This process is performed at
a rate which complies with the restrictions imposed by the
bandwidth of the DAQ system and the High Level Trigger
(HLT). One of the main challenges for the ALICE trigger
system is to make optimum use of the nineteen sub-detectors
since they all have different read-out and detection time
specifications. The triggers generated by the TTC system are
categorized in a three-level hierarchy. The Level 0 (L0) trigger
is the fastest, reaching the detectors 1.2 µs after the collision,
followed by the Level 1 (L1) trigger, which arrives at the detectors
6.5 µs after the collision. These triggers do not include any
information about the validity of the event. The final level
is the Level 2 (L2) trigger, whose occurrence is determined
by the past-future protection condition [10] and information
computed by the trigger detectors. In particular, this trigger
makes sure that pile-ups corrupting the data are avoided within
a programmable time interval before and after the collision.
Under normal operating conditions, this is computed by the
drift time of the Time Projection Chamber (TPC) detector, i.e.
it is issued 88 µs after the time of interaction. The L2 trigger
is distributed to the sub-detectors as a message containing
information classifying the event as either an accept (L2a) or
reject (L2r), subject to a set of predefined criteria [10].
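The pile-up criterion behind the L2 decision can be sketched in a strongly simplified form. The example below assumes that an event is rejected whenever another interaction falls within a symmetric protection window around it; the function name, the symmetric window and the use of 88 µs as the window length are assumptions for illustration, and the actual past-future protection logic of the CTP [10] is more elaborate.

```python
from typing import List


def l2_decisions(interaction_times_us: List[float],
                 window_us: float) -> List[str]:
    """Classify each interaction as 'L2a' (accept) or 'L2r' (reject).

    Assumption: an interaction is rejected if any other interaction occurs
    strictly within +/- window_us of it.
    """
    decisions = []
    for i, t in enumerate(interaction_times_us):
        piled_up = any(
            j != i and abs(other - t) < window_us
            for j, other in enumerate(interaction_times_us)
        )
        decisions.append("L2r" if piled_up else "L2a")
    return decisions


# Hypothetical interaction times in microseconds.
times_us = [0.0, 200.0, 250.0, 500.0]
print(l2_decisions(times_us, window_us=88.0))  # ['L2a', 'L2r', 'L2r', 'L2a']
```

The interactions at 200 µs and 250 µs lie within the window of each other and are therefore both rejected in this model.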
Messages are transferred to the TTCrx via a fibre optic
cable which is composed of two channels. Channel A carries
the information for the L1 message while channel B consists
of the broadcast and command messages. The L0 trigger is
delivered over an LVDS cable from the LTU to the RCB,
independently of the TTC fibre. The CTP synchronizes all
the triggers received from the trigger detectors with the
LHC clock, which has a frequency of 40.08 MHz [10].
Each sub-detector is synchronized by means of this clock.
The TTCrx was implemented in VHDL and tested for
correct operation. The CTP is simulated in a VHDL package
(ctp_pack.vhd) that emulates the function of the related
sub-systems. The required commands are stored in a text file
structure which is passed through an HDL TEXTIO function.
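Since every sub-detector is synchronized to the LHC clock, it is convenient in a test bench to express the trigger latencies quoted earlier as clock-cycle counts. The sketch below performs this conversion; rounding to the nearest cycle is an assumption of this example, and the dictionary and function names are hypothetical.

```python
LHC_CLOCK_HZ = 40.08e6  # LHC clock frequency quoted in the text

# Trigger latencies after the collision, in microseconds (from the text).
LATENCIES_US = {"L0": 1.2, "L1": 6.5, "L2": 88.0}


def latency_in_cycles(latency_us: float, clock_hz: float = LHC_CLOCK_HZ) -> int:
    """Convert a latency in microseconds to the nearest number of clock cycles."""
    return round(latency_us * 1e-6 * clock_hz)


cycles = {name: latency_in_cycles(t) for name, t in LATENCIES_US.items()}
print(cycles)  # {'L0': 48, 'L1': 261, 'L2': 3527}
```

One clock period is 1 / 40.08 MHz, i.e. approximately 24.95 ns, so the L0 trigger corresponds to roughly 48 clock cycles after the collision.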
The DAQ consists of the DAQ Read-Out Receiver Card
(D-RORC) and the Destination Interface Unit (DIU). The DIU
communicates with the SIU via a duplex multi-mode optical
fibre cable. Since the different sub-detectors use the same
data transmission protocol, fragmented event data from the
participating sub-detectors are injected on the DDLs when an
L2a trigger occurs. As with the triggering system, this behaviour is
simulated via a dedicated VHDL package (daq_pack.vhd).
Finally, the DAQ transfers the gathered data to the central data
storage of ALICE [7, 8].
Fig. 2. GUI of the implemented HLTB which enables the user to configure,
control and monitor the simulation environment.
IV. SIMULATION PROCEDURE
The simulation procedure for the proposed HLTB design is
described in detail below; the steps can be broadly organized
into set up, execution and result gathering stages:
Preparation of Test Bench: The startup.tcl script,
launched through the QuestaSim software, presents the user
with a GUI (see Fig. 2), providing a straightforward method
for configuring and controlling the simulation environment.
Loading and Compilation: Each simulation type is bound
to a particular module, which can be either preset (provided
with the test bench) or user-generated. By selecting the
desired module from the GUI, the simulation program can be
compiled and linked to the necessary libraries and packages,
in preparation for the simulation step. In particular, the
compile.tcl script is executed to compile the required
VHDL modules. A VHDL package (values.vhd) hosts
simulation attributes and properties; the sim_set.tcl script
is executed to replace attributes in this module to customize
for specific simulation requirements.
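The attribute replacement performed on the values package can be illustrated as a text substitution on VHDL constant declarations. The sketch below is written in Python rather than Tcl and is not the sim_set.tcl script itself; the constant names and the layout of the package are hypothetical.

```python
import re


def set_constant(vhdl_text: str, name: str, new_value: str) -> str:
    """Replace the value of a VHDL constant declaration.

    Matches declarations of the form 'constant NAME : type := value;'.
    """
    pattern = re.compile(
        r"(constant\s+" + re.escape(name) + r"\s*:\s*[^:;]+?:=\s*)([^;]+)(;)",
        re.IGNORECASE,
    )
    new_text, count = pattern.subn(
        lambda m: m.group(1) + new_value + m.group(3), vhdl_text)
    if count == 0:
        raise KeyError(f"constant {name!r} not found")
    return new_text


# Hypothetical contents of a values package.
package_text = """package values is
  constant N_COLUMNS : integer := 8;
  constant ZERO_SUPPRESSION : boolean := true;
end package values;
"""

updated = set_constant(package_text, "ZERO_SUPPRESSION", "false")
print(updated)
```

Keeping all simulation attributes in one package means that a single rewritten file, followed by a recompilation, is sufficient to switch between simulation configurations.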
Running a Simulation: The simulation is initiated through
the GUI, where a number of parameters can be set to define
the type of simulation and load specific design corners.
The simulation run-time instructions are predefined by the
selected simulation or else may be manually configured in
the appropriate script. Analog input data is passed to the
simulation by a Tcl procedure that is triggered by a clocked
VHDL output. This emulates the time multiplexed data coming
from the Gassiplex chips in the actual implementation.
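The clock-triggered feeding of analog input data can be modelled as follows. This is a behavioural sketch in Python, not the Tcl procedure used in the test bench; it assumes one integer sample per line in the input file, with blank lines and lines starting with '#' ignored, which is a hypothetical file format.

```python
from typing import Iterable, List, Tuple


def parse_samples(lines: Iterable[str]) -> List[int]:
    """Read one integer sample per line, skipping blanks and '#' comments."""
    samples = []
    for line in lines:
        text = line.strip()
        if text and not text.startswith("#"):
            samples.append(int(text))
    return samples


def feed_on_rising_edges(clock_levels: Iterable[int],
                         samples: List[int]) -> List[Tuple[int, int]]:
    """Return (tick, sample) pairs, driving one sample per rising clock edge."""
    driven = []
    pending = iter(samples)
    previous = 0
    for tick, level in enumerate(clock_levels):
        if previous == 0 and level == 1:
            sample = next(pending, None)
            if sample is None:  # input data exhausted
                break
            driven.append((tick, sample))
        previous = level
    return driven


samples = parse_samples(["# ramp", "100", "", "200", "300"])
clock = [0, 1, 0, 1, 0, 1, 0, 1]
print(feed_on_rising_edges(clock, samples))  # [(1, 100), (3, 200), (5, 300)]
```

Each rising edge consumes exactly one sample, which reproduces the time-multiplexed delivery of Gassiplex data at the pace set by the data clock.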
Result Generation: The HLTB simulation was
instrumented to log events (i.e. simulation output and
interactions between the test bench and the DUT) for
further consideration and verification. The logged output
is stored in human-readable format, as opposed to the
binary representation from the physical system. A custom
tool was developed to convert between the two formats,
to exploit existing testing and verification tools such as
hmpDisplayMap (see Fig. 3).
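The conversion between the human-readable log and a binary representation can be sketched as below. The example assumes that the log holds one 32-bit word per line in hexadecimal and that the binary file is little-endian; both the log layout and the byte order are assumptions, and the sketch is not the custom tool mentioned above.

```python
import struct
from typing import Iterable, List


def text_to_binary(lines: Iterable[str], byte_order: str = "<") -> bytes:
    """Pack hexadecimal 32-bit words (one per line) into a byte string."""
    words = [int(line.strip(), 16) for line in lines if line.strip()]
    return struct.pack(f"{byte_order}{len(words)}I", *words)


def binary_to_text(blob: bytes, byte_order: str = "<") -> List[str]:
    """Unpack a byte string into eight-digit hexadecimal words."""
    count = len(blob) // 4
    words = struct.unpack(f"{byte_order}{count}I", blob[:count * 4])
    return [f"{word:08X}" for word in words]


log_lines = ["0000000A", "DEADBEEF"]
blob = text_to_binary(log_lines)
print(blob.hex())            # 0a000000efbeadde
print(binary_to_text(blob))  # ['0000000A', 'DEADBEEF']
```

A round trip through both functions returns the original words, which is a convenient self-check when comparing simulated output with data recorded from the physical system.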
Fig. 3. HMPID Raw Data Display Mapper showing ramp data input. The
screen is divided into 3 segments each having 8 columns with 10 DILOGICs.
Each square represents one of the 48 channels. The color bar represents the
charge intensity on each channel.
V. SIMULATION RESULTS
A typical HMPID detector run can be categorized into
two types: a “Calibration Run”, where data gathering occurs
with disabled zero-suppression to determine the detector noise
levels, and a “Standalone Run”, where following the initial
configuration of the detector with the gathered pre-compiled
pedestals and threshold, data is collected with zero-suppression
enabled. The type of run is broadcast to all detectors, followed
by the specific individual commands; both runs consist of a
number of stages. In calibration runs, the DAQ initially sends a
reset command to the front-end electronics. For the HMPID,
the DAQ starts addressing each segment and column in order,
to pre-configure the acquisition environment for the type of
run. When all segments have been addressed, the DAQ sends a
ready command to the RCB, in preparation for receiving the
triggers. As the L0 trigger arrives, the column controllers start
the track-and-hold procedure; this is followed by the transfer of
analog data from the Gassiplex to the DILOGIC, which is then
fed to the column FIFO and temporarily stored there. The RCB
then processes the L1 trigger payload to determine whether to
accept or reject the previous (L0) trigger. If the L1 trigger
is accepted, the RCB starts building the header files using
the previously stored payloads from the triggers. Conversely,
if the L1 trigger is rejected, the RCB starts the DILOGIC
reset procedure, discarding any currently stored information.
With the final L2a trigger, each detector starts transferring
its gathered data: prior to the transfer, the RCB data header
is further augmented with information from the L2a trigger.
The event data in the columns, together with the end of event
information is transferred to the DAQ and then further on to
the central data storage of ALICE.
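The run sequence described above can be summarized as a small state machine. The sketch below is an illustrative abstraction of the RCB behaviour: the state and event names are hypothetical, and the handling of an L2r message (returning to wait for the next L0 trigger) is an assumption, since the text only describes the L2a case.

```python
TRANSITIONS = {
    ("IDLE", "reset"): "CONFIGURING",
    ("CONFIGURING", "address"): "CONFIGURING",  # one per segment or column
    ("CONFIGURING", "ready"): "WAIT_L0",
    ("WAIT_L0", "L0"): "WAIT_L1",               # track-and-hold, fill column FIFO
    ("WAIT_L1", "L1_accept"): "WAIT_L2",        # start building the header
    ("WAIT_L1", "L1_reject"): "WAIT_L0",        # DILOGIC reset, data discarded
    ("WAIT_L2", "L2a"): "TRANSFER",             # header completed, data sent
    ("WAIT_L2", "L2r"): "WAIT_L0",              # assumption: event dropped
    ("TRANSFER", "end_of_event"): "WAIT_L0",
}


def run(events, state="IDLE"):
    """Return the list of states visited while processing the events."""
    trace = [state]
    for event in events:
        key = (state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"event {event!r} not allowed in state {state!r}")
        state = TRANSITIONS[key]
        trace.append(state)
    return trace


accepted = run(["reset", "address", "address", "ready",
                "L0", "L1_accept", "L2a", "end_of_event"])
print(accepted[-3:])  # ['WAIT_L2', 'TRANSFER', 'WAIT_L0']
```

Encoding the sequence as a transition table makes out-of-order commands, such as an L1 message arriving before any L0 trigger, raise an error immediately, which is the kind of protocol check a test bench is expected to perform.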
Figure 4 shows the RCB wave window generated by the
HLTB for the calibration run of the HMPID system. The run
is limited to one column within one segment, interfaced to two
DILO5 cards. The first cursor shows the initialization command
from the DAQ to the RCB. In return, the RCB initialises
the segment (Set Segment) and the column respectively (Set
Column 1). A ready command prompts the RCB to wait for
trigger signals. The L0 trigger starts the triggering process;
the trigger is relayed (broadcast) to all columns, initiating the
track-and-hold of the analog data. After signal conditioning
and conversion, the data is transferred from the Gassiplex
to the DILOGIC, which processes the data according to the
previously set zero-suppression values. Subsequently, this data
is daisy-chained to each respective column (multiple DILOGIC
to column transfers are carried out concurrently).
Fig. 4. Simulated timing diagram generated by the HLTB for the calibration run of the HMPID system limited to one column within one segment interfaced
to two DILO5 cards. A physical model with typical delay was selected in this particular simulation.
By the time the RCB is instructed to start transferring data
to the DAQ, all the header and read-out data would have
been processed and readily available in the required format.
The RCB accepts or rejects the L1 trigger depending on the
message received from the Sub Address and Data Qualifier.
The L2 trigger starts the initialization of event building. The
data transfer following event building is bounded by the L2
marker and the Column 1 Read; data from a single column is
being fed to the DDL as 32-bit packets at 40 MHz.
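The figures in the last sentence imply a raw transfer rate that can be computed directly. The sketch below does so and estimates the read-out time of one column; it assumes one 32-bit word per channel and no protocol overhead, which is a simplification introduced for this example.

```python
WORD_BITS = 32
WORD_RATE_HZ = 40e6  # one 32-bit packet per 40 MHz clock cycle


def throughput_bits_per_s() -> float:
    """Raw throughput of the 32-bit transfer, ignoring headers and idle cycles."""
    return WORD_BITS * WORD_RATE_HZ


def transfer_time_us(n_words: int) -> float:
    """Time needed to transfer n_words back-to-back, in microseconds."""
    return n_words / WORD_RATE_HZ * 1e6


print(throughput_bits_per_s() / 1e9)  # raw rate in Gbit/s
print(transfer_time_us(480))          # one full column, in microseconds
```

Under these assumptions the raw rate is 1.28 Gbit/s (160 MB/s), and a full column of 480 channels read without zero-suppression takes about 12 µs.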
Besides providing a means to verify the functionality
of the HMPID detector, the proposed HLTB enables system
profiling through in-depth performance analysis across process
variations. Among other things, these features have been used
to study the effect of the variation of the propagation delay on
the 40 MHz clock generated by a ring oscillator implemented
on the column control card (refer to Fig. 5). This oscillator
consists of a number of LCELL primitives [11] and an off-chip
delay-line of 5 ns, and generates a clock which guarantees
the synchronization of the data with the trigger received from
the RCB. The results generated by the HLTB indicate that an
optimized solution is required to achieve a more stable clock
across process variations. This is important since this clock
affects the read-out rate of the whole detector.
Fig. 5. Simulation of the column clock for different delay corner models.
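The sensitivity of the column clock to propagation delay can be illustrated with a first-order model of a ring oscillator, in which the period equals twice the total loop delay. The number of LCELL primitives and the per-cell delays used below are hypothetical values chosen so that the typical corner yields 40 MHz; only the 5 ns delay line is taken from the text, and it is assumed to be constant across corners.

```python
DELAY_LINE_NS = 5.0  # off-chip delay line (from the text)
N_LCELLS = 15        # hypothetical number of LCELL primitives in the loop

# Hypothetical per-LCELL propagation delays for three delay corners, in ns.
LCELL_DELAY_NS = {"fast": 0.35, "typical": 0.5, "slow": 0.7}


def oscillator_frequency_mhz(lcell_delay_ns: float,
                             n_lcells: int = N_LCELLS,
                             delay_line_ns: float = DELAY_LINE_NS) -> float:
    """First-order ring oscillator model: period = 2 x total loop delay."""
    loop_delay_ns = n_lcells * lcell_delay_ns + delay_line_ns
    period_ns = 2.0 * loop_delay_ns
    return 1e3 / period_ns  # a period of 1 ns corresponds to 1000 MHz


for corner, delay in LCELL_DELAY_NS.items():
    print(f"{corner:8s} {oscillator_frequency_mhz(delay):6.2f} MHz")
```

With these illustrative numbers the clock varies from about 32 MHz at the slow corner to about 49 MHz at the fast corner, which shows why a free-running oscillator built from logic-cell delays is sensitive to process variations.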
VI. CONCLUSION
This paper introduced a novel HLTB design for high-speed
multi-FPGA pipelines through the use of modelling and
simulation; a nuclear particle detector system employing
high-speed protocols and a multiple module environment
has been used as a case study. The approach described
in this paper favours design principles such as abstraction
and modularity, promoting component modification within
interface boundaries and independent optimization of any stage
of the architecture. Feature-wise, the HLTB supports highly
detailed simulations across process variations with picosecond
resolution; this is an essential property when monitoring
signals at high data rates. The benefits of the presented
HLTB, such as extensibility, flexibility and maintainability,
far outweigh the setup costs. This is particularly evident in
scenarios where a readily available system requires an upgrade
but the debugging and monitoring of the updated firmware at
hardware level is impractical due to physical constraints or
instrumentation limitations. Thus, factoring HLTBs into design
pipelines becomes not just an asset but an invaluable tool for
testing and verification of complex high-speed designs.
ACKNOWLEDGMENTS
This work was made possible through a collaboration between the
University of Malta and the ALICE HMPID at the Conseil Européen pour
la Recherche Nucléaire (CERN).
REFERENCES
[1] PotentialVentures, “Cocotb,” [Online]. https://cocotb.readthedocs.io.
[2] MathWorks, “HDL Coder,”
[Online]. https://www.mathworks.com/products/hdl-coder.html.
[3] “IEEE Draft Standard for System, Software and Hardware Verification and
Validation - Corrigendum 1,” IEEE P1012/D2, pp. 1–18, Jan 2017.
[4] J. C. Santiard, W. Beusch, S. Buytaert, C. C. Enz, E. H. M. Heijne,
P. Jarron, F. Krummenacher, K. Marent, and F. Piuz, “Gasplex a Low
Noise Analog Signal Processor for Readout of Gaseous Detectors,”
Cern-Ecp, no. May, pp. 17–94, 1994.
[5] H. Witters and P. Martinengo, “DILOGIC-2 a Sparse Data Scan Readout
Processor For the HMPID Detector of ALICE,” 2001.
[6] P. Antonioli, A. Kluge, and W. Riegler, “Upgrade of the ALICE Readout
& Trigger System,” 2013.
[7] ALICE Collab., “The ALICE experiment at the CERN LHC,” 2008.
[8] ALICE Collab., “ALICE Technical Design Report of the Trigger Data
Acquisition High-Level Trigger and Control System,” 1999.
[9] C. G. KFKI-RMKI (Budapest), “DDL & RORC Information Page,”
[Online]. http://alice-proj-ddl.web.cern.ch/alice-proj-ddl/.
[10] D. Evans, S. Fedor, G. T. Jones, P. Jovanović, A. Jusko, L. Králik,
R. Lietava, L. Šándor, J. Urbán, and O. Villalobos-Baillie, “The ALICE
central trigger system,” 14th IEEE-NPSS Real Time Conference, pp.
129–133, 2005.
[11] Altera, “Designing with Low-Level Primitives - User Guide,” 2007.
