This paper presents the implementation of the data acquisition system of the resistive plate chamber (RPC) subdetector in the compact muon solenoid (CMS) experiment at the Large Hadron Collider (LHC) at CERN. The described readout system connects with the RPC detector, the RPC link system, the RPC trigger system and the CMS data acquisition system, and forms one of multiple metrological systems in the CMS experiment. The readout system receives the data provided by the multiple channels of the link system, filters out the non-triggered data, encapsulates the data into standard CMS common data format events and sends them to the global data acquisition system. The main problem in the readout system design was to provide a sufficiently large throughput to transfer the data reliably. The implemented system is a scalable solution based on advanced field programmable gate array (FPGA) technology.
Introduction
The compact muon solenoid (CMS) detector consists of many subdetectors recording the final state particles from the Large Hadron Collider (LHC) proton-proton interactions. The LHC and the CMS operate in a cyclical way: generally, every 25 ns two bunches of accelerated protons interact in the middle of the CMS detector, which is called a bunch crossing (BX). The amount of data provided by the subdetectors is too big to be fully transmitted to the DAQ system for further analysis. Therefore, a hardware-implemented level one (L1) trigger system ([1], chapter 1) is necessary to select interesting BXes (events). The resistive plate chamber pattern comparator muon trigger (RPC PAC muon trigger) is one of the components of this level one trigger system. The measurement results provided by the resistive plate chambers (RPC) are distributed via the link system to the trigger boards (TB), where they are processed by the pattern comparator (PAC) processors ([1], chapter 13) to provide information used by the global muon trigger and the global trigger to elaborate the level one trigger decision, which is then distributed back as the level one accept (L1A) signal. The data from each BX are meanwhile delayed in the shift buffers, waiting for the L1A decision. Then, depending on the value of the L1A signal, the delayed data should be either discarded or accepted and sent as the 'event data' via the RPC data acquisition system (RPC DAQ) to the CMS data acquisition (CMS DAQ) system.
Figure 1. Location of the RPC DAQ system in the RPC PAC muon trigger electronics [2].
The first concept of the RPC DAQ system has been described in [2]. This paper describes the second, final version of the RPC DAQ system, the prototype implementation of which has been successfully tested during the CMS magnet test and cosmic challenge (MTCC) at CERN [3]. The RPC DAQ system is divided into two main components. The first part of this system is the readout mezzanine board (RMB) located on the TB. It rejects the non-triggered data, concentrates the accepted data and sends them via a single optical link to the data concentrator cards (DCC). The DCC boards further concentrate the data provided by the RMBs and transfer the concentrated data in the CMS common data format (CDF) [4], via the S-Link64 interface [5-7], to the CMS DAQ system. The total number of trigger boards in the RPC trigger system is equal to 84. In the current implementation three DCC boards, each connected to 28 TBs, will be used. However, it is possible to increase the number of DCC boards used to obtain higher throughput. The position of the RPC DAQ in the RPC PAC muon trigger system is shown in figure 1.
Implementation of the readout mezzanine board
The readout mezzanine board (RMB) has been implemented as a daughterboard located on the TB (figure 2). All data processing on the RMB is performed by a single FPGA Stratix II device EP2S30F672C3 manufactured by Altera. The RMB features 18 link inputs which receive the data provided by the link system (but only 12 of them will be used simultaneously). The processed data are sent to the DCC via the GOH interface ('GOL opto-hybrid' mezzanine board, based on the gigabit optical link (GOL) chip) [8] with a throughput of 640 Mb s⁻¹. The architecture of the RMB is shown in figure 3.
Architecture and algorithm of the RMB
The link boards (LB) send the data after zero suppression and time-multiplexing compression described in [9]. The data received from the optical link must be synchronized in the input blocks of the TB to compensate for the latency introduced by the link system and optical links. The synchronization algorithm is described in [10]. The synchronized data must be further delayed in the RMB to compensate for the latency of the trigger system. The information contained in each data word may be logically divided into two parts important for the operation of the RMB. The first part is the 'chamber data' (CD), which should be passed transparently through the RPC DAQ; the second part is the 'delay', which denotes by how many BXes the data are delayed with regard to their original time position. The maximum delay value is equal to 7, thus each input may receive up to eight data words from a single BX. However, if we consider the data from $N_{BX}$ consecutive BXes, a single input may provide only up to 7 + $N_{BX}$ data words.
The synchronized data enter the triggered data filter, which is responsible for rejecting the non-triggered data. In the simplest case, the triggered data (event data) are the data originating from the BX for which the L1A pulse (event) has been generated by the trigger system. For diagnostic and debugging purposes, the RMB is also allowed to transfer the data from a few BXes preceding the L1A pulse (the pre-triggered data) and from a few BXes following the L1A pulse (the post-triggered data). Such a solution may be helpful for proper synchronization of the RPC link system and for detection of some RPC-related problems. In the current implementation of the RMB it is possible to send the data from at most eight BXes for a single L1A (three BXes before the L1A, the BX with the L1A and four BXes after the L1A), which requires that the data are additionally delayed by three BXes in the triggered data filter. A special algorithm has been developed for allocation of BXes to the event data in the case when the post-triggered data of one event overlap the data of the next event. The main rules are that every BX may belong to only one event, and that the BX in which the L1A pulse occurred must belong to its own event. An example of BX allocation according to this algorithm is shown in figure 4. The triggered data from each channel (as well as the pre-triggered and the post-triggered data) are transferred to the input queue. Other data are discarded.
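The two allocation rules can be illustrated with a minimal Python sketch. The text does not spell out how overlaps are resolved, so the tie-breaking policy used here is an assumption: a BX stays with the earliest event whose window covers it, unless a later L1A occurred at or before that BX, in which case the later event takes it. The function name `allocate_bxes` and its parameters are hypothetical.

```python
def allocate_bxes(l1a_bxes, pre=3, post=4):
    """Assign every transmitted BX to exactly one event.

    l1a_bxes: BX numbers in which an L1A pulse occurred.
    Returns a dict {bx number: index of the owning event}.
    The overlap policy is an assumption, not the documented algorithm.
    """
    l1as = sorted(set(l1a_bxes))
    owner = {}
    for idx, l1a in enumerate(l1as):
        for bx in range(l1a - pre, l1a + post + 1):
            if bx not in owner:
                owner[bx] = idx      # first claimant is the earliest event
            elif l1a <= bx:
                owner[bx] = idx      # a later L1A at or before this BX takes over
    return owner


def bxes_of_event(owner, idx):
    """List the BX numbers allocated to one event, in time order."""
    return sorted(bx for bx, event in owner.items() if event == idx)
```

With L1A pulses in BX 10 and BX 13, this policy gives BXes 7 to 12 to the first event and BXes 13 to 17 to the second one, so each BX has one owner and each L1A BX belongs to its own event.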
Before arriving at the input queue, the data are transmitted synchronously, i.e., their BX of origin is unambiguously described by the moment when the data arrive at the filter and by the 'delay' part of the data word. Such an unambiguous relation does not exist at the output of the queue. Therefore, the data are supplemented with the 'BX identifier' when entering the input queue. The BX identifier consists of the number of the L1A pulse (the event number) and of the relative BX number (changing from 0 to 7, and equal to 3 for the BX in which the L1A occurred).
The output of the input queue is connected to the sorter, which sorts the data first by their BX identifier, then by their link input number. The sorter starts to process the data not earlier than after seven BXes of delay. This allows us to sort the data correctly when, in one input, the triggered data are delayed by a long sequence of non-triggered data. Otherwise, the sorter could start to process the data from the next BX, available from other link inputs, before the delayed data arrived.
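The ordering applied by the sorter can be sketched in Python as a sort on a composite key. The record type `QueuedWord` and its field names are illustrative assumptions; only the ordering (BX identifier first, link input number second) comes from the text.

```python
from typing import NamedTuple


class QueuedWord(NamedTuple):
    event: int    # number of the L1A pulse (event number)
    rel_bx: int   # relative BX number, 0..7 (3 marks the L1A BX)
    link: int     # link input number
    cd: int       # chamber data, passed through unchanged


def sort_words(words):
    """Order words by BX identifier (event, relative BX), then by link input."""
    return sorted(words, key=lambda w: (w.event, w.rel_bx, w.link))
```

Python's `sorted` is stable, so words sharing the same BX identifier and link input keep their arrival order.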
To efficiently store in the output data stream the information about the BX, and about the link input from which the data originate, the data are organized into the 'RMB data records'. This task is performed by the 'RMB word serializer'. The RMB data record for a single BX has the following format (see also figure 5) [11] :
• Start of the frame marker (SOF) with the frame identifier consisting of 10 lowest bits of the event number, and of 3 bits of relative BX number.
• BX number marker (BXN) with the 11 lowest bits of the absolute BX number.
• Zero to $N_{LI}$ link input data records ($N_{LI}$ is the number of link inputs used in the RMB), each having the following form:
  - Start of the link input data marker (SLD) with the link number.
  - One to eight chamber data (CD) words.
• End of frame marker (EOF) with cyclic redundancy check (CRC) checksum.
The serialized data are finally transferred to the GOH module.
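The record layout listed above can be sketched as follows. The sketch works with symbolic `(marker, value)` pairs because the bit-level encoding of the markers is not given here. The packing order of the frame identifier and the checksum are assumptions: a truncated `zlib.crc32` is only a stand-in for the real CRC of the RMB.

```python
import zlib


def build_rmb_record(event, rel_bx, abs_bx, link_data):
    """Build one RMB data record for a single BX as (marker, value) pairs.

    link_data: dict {link input number: list of 1..8 chamber data words}.
    """
    # Frame identifier: 10 lowest bits of the event number and 3 bits of the
    # relative BX number (the packing order is an assumption).
    record = [("SOF", ((event & 0x3FF) << 3) | (rel_bx & 0x7)),
              ("BXN", abs_bx & 0x7FF)]        # 11 lowest bits of the BX number
    for link in sorted(link_data):
        words = link_data[link]
        if not 1 <= len(words) <= 8:
            raise ValueError("a link input record holds one to eight CD words")
        record.append(("SLD", link))
        record.extend(("CD", w & 0xFFFF) for w in words)
    payload = b"".join(value.to_bytes(2, "big") for _, value in record)
    # Stand-in checksum only: the actual CRC polynomial is not specified here.
    record.append(("EOF", zlib.crc32(payload) & 0xFFF))
    return record
```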
Assessment of the throughput of the RMB
The main problem when designing the RPC DAQ system was to ensure sufficient throughput to avoid losing the triggered data provided by the link system. The required L1A trigger frequency for which the readout system should be able to transfer the data is equal to $f_{L1A}$ = 100 kHz [1]. The RMB processes data from at most $N_{LI}$ = 12 optical link inputs.
Figure 5. Structure of the RMB data record [11].
As stated before, if the RMB sends $N_{BX}$ BXes for each L1A pulse, the maximum amount of raw data to be sent from a single link input is equal to 7 + $N_{BX}$ words. Considering that the RMB uses data words of width $W_{RMB}$ = 16 bits, we can calculate the required raw data bandwidth:
$$B^{raw}_{RMB} = f_{L1A} \, N_{LI} \, (7 + N_{BX}) \, W_{RMB}.$$
For $N_{BX}$ = 8, the calculated raw data bandwidth is equal to $B^{raw}_{RMB}$ = 288 Mb s⁻¹. However, the markers introduced by the serialization process generate some protocol overhead. For each BX we transfer $M_{BX}$ = 3 words as the header (SOF, BXN) and as the trailer (EOF). Additionally, for each link input we add $M_{LI}$ = 1 SLD word. So finally the bandwidth occupied by markers is equal to:
$$B^{proto}_{RMB} = f_{L1A} \, N_{BX} \, (M_{BX} + N_{LI} M_{LI}) \, W_{RMB}.$$
For $N_{BX}$ = 8 and $N_{LI}$ = 12, we obtain $B^{proto}_{RMB}$ = 192 Mb s⁻¹, and finally the total output data bandwidth $B_{RMB} = B^{raw}_{RMB} + B^{proto}_{RMB}$ = 480 Mb s⁻¹, which is significantly less than the available link capacity of 640 Mb s⁻¹.
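The figures quoted above can be reproduced with a short Python sketch. The two expressions are reconstructed from the quantities defined in the text and chosen so that they return the stated 288, 192 and 480 Mb/s; the constant and function names are illustrative.

```python
F_L1A = 100e3     # required L1A rate [Hz]
N_LI = 12         # link inputs used on one RMB
W_RMB = 16        # RMB data word width [bits]
MAX_DELAY = 7     # maximum value of the 'delay' field [BX]
M_BX = 3          # marker words per BX (SOF, BXN, EOF)
M_LI = 1          # marker words per link input (SLD)
GOH_CAPACITY = 640e6   # throughput of the GOH link [bit/s]


def rmb_bandwidth(n_bx):
    """Return (raw, protocol, total) worst-case RMB bandwidth in bit/s."""
    raw = F_L1A * N_LI * (MAX_DELAY + n_bx) * W_RMB
    proto = F_L1A * n_bx * (M_BX + N_LI * M_LI) * W_RMB
    return raw, proto, raw + proto


if __name__ == "__main__":
    raw, proto, total = rmb_bandwidth(8)
    print(f"raw {raw / 1e6:.0f}, protocol {proto / 1e6:.0f}, total {total / 1e6:.0f} Mb/s")
```

Evaluating `rmb_bandwidth` for smaller values of `n_bx` shows how much headroom is gained when fewer pre-triggered and post-triggered BXes are transmitted.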
However, we must consider that the rate 100 kHz is only the mean trigger rate. Therefore, both the throughput reserve, and data queues in the RMB are necessary to compensate for the possible sporadic data bursts caused by statistical trigger fluctuation. The minimum acceptable intervals between the L1A pulses are defined by the (preliminary) 'trigger rules' described in [12] . The more detailed analysis of the required input queues capacity should consider the trigger rules and the statistical properties of the data provided by the link system. The 'worst case' simulations, performed with the Monte Carlo methods, have shown that the input queues of 512 words length are sufficient to provide reliable transmission of the data.
If the amount of data to be sent causes a data buffer overflow, a special 'discarded data marker' (DDM) is inserted into the data stream, so that the occurrence of this problem is recorded in the DAQ data.
Data concentrator card
The next stage of the RPC DAQ is the data concentrator card (DCC). To decrease the development costs, a DCC card developed for another CMS subdetector, the electromagnetic calorimeter (ECAL), has been used in the RPC DAQ [13] (figure 6).
The data flow model required for the RPC detector significantly differs from the data flow in the ECAL readout. Thus, only the hardware of the DCC has been used. The firmware for FPGA devices on the DCC board has been rewritten from scratch to implement the architecture appropriate for the RPC DAQ operation.
Figure 6. The CMS ECAL data concentrator card (DCC) [13].
The main components of the DCC board are the multichannel optical receivers and the FPGA chips. The board contains nine input handler (IH) FPGA chips (Xilinx xc2vp7), the event merger (EM) chip (Altera EP1S25F672C6), and the event builder (EB) chip (Altera EP1S25F672C6).
The architecture of the DCC board, as used in the RPC DAQ, is shown in figure 7 .
The DCC board provides 70 optical inputs; however, only 36 inputs will be used simultaneously in the current implementation of the RPC DAQ. The serialized data arriving from the RMB boards are encapsulated by the DCC in the 'event fragments' recorded in a CDF format [4] .
Encapsulation of RMB data into CDF records
The 'event fragments' encoded in a CDF format as used by the RPC DAQ are the sequences of 64-bit words, where the first and the last words are the predefined header and trailer words. The inner part of the 'event fragment' is the 'payload' with a user defined format. The RPC DAQ uses the 64-bit words of the payload to encapsulate the 16-bit RMB data and 16-bit special markers. The simplest method of encapsulation has been used, where four 16-bit words are put into the single CDF payload word. A special 'No-data' marker has been reserved to fill unused 16-bit chunks, e.g., in the last word in the event, or in other situations, where the encapsulation algorithm is not able to optimally pack the data. The chamber data (CD) sent by the RMBs are transparently transferred to the output data. The special RMB markers are detected and used to control the flow of the data and to generate the special markers introduced into the payload. The CDF payload has the following structure (see also figure 8) [11] :
• The CDF payload consists of $N_{BX}$ 'BX data records'.
• Each BX record consists of a single 'start of the BX data' (SBXD) marker containing the lower 12 bits of the BX number, and of zero or more 'link input data records'.
• Each link input data record consists of a single 'start of the link input data' (SLD) marker containing the 6-bit RMB input number and the 5-bit link input number, and of one or more chamber data (CD) transparently copied from the RMB data record.
An example of the data processing in the RPC DAQ is shown in figure 9 .
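The encapsulation of four 16-bit words into one 64-bit CDF payload word can be sketched as below. The numeric value of the 'No-data' marker and the position of the first 16-bit word inside the 64-bit word are assumptions made only for this illustration.

```python
NO_DATA = 0xFFFF   # hypothetical value of the 'No-data' filler marker


def pack_payload(words16):
    """Pack 16-bit words into 64-bit payload words, four per word.

    The first 16-bit word is placed in the most significant bits
    (an assumed ordering); unused chunks are filled with NO_DATA.
    """
    words = [w & 0xFFFF for w in words16]
    words += [NO_DATA] * (-len(words) % 4)   # pad to a multiple of four
    payload = []
    for i in range(0, len(words), 4):
        w0, w1, w2, w3 = words[i:i + 4]
        payload.append((w0 << 48) | (w1 << 32) | (w2 << 16) | w3)
    return payload
```

For example, five 16-bit words produce two payload words, the second one carrying three filler chunks.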
Algorithm of the DCC operation
Generally, the DCC receives the data from the consecutive RMB inputs, and puts them into the CDF event fragments. However, to achieve the requested performance while operating at reasonable clock frequency, a special highly parallel, pipelined architecture had to be implemented.
The RMB data arrive at the input handlers (IH) and enter the input queues. A special state machine inspects the output of these queues and makes sure that the SOF and BXN (see subsection 2.1) markers are available. Then IH is ready to transmit the data for the particular BX.
The event builder (EB) receives the L1A signal and generates a list of BXes for which it should request data from the input handlers. Then the EB requests the data for the consecutive BXes from each IH. When a particular IH has no more data for the particular BX (or if it does not have those data at all), it signals the 'end of data' condition to the EB, so that the EB can start polling the next IH on the same bus. This task is performed in parallel and independently on all three EM buses.
When all IHs on all buses have sent their data for the particular BX, the next BX for that event is handled.
In the above process, some IHs are asked to introduce the additional markers into the data stream. For example, the first IH on each bus is requested to introduce the 'EM start of data' (EMSOD) and the SBXD markers. The last IH on each bus is requested to send the 'EM end of data' (EMEOD) marker. Additionally for the first BX in the event the IH1 is requested to set the 'start of event' flag in the EMSOD marker and to generate the CDF header. Similarly, for the last BX in the queue the IH9 is requested to set the 'end of event' flag in the EMEOD marker. The SLD markers from the RMB data are supplemented with the RMB/TB identifiers and copied into the output data.
The data transmitted by the IHs are stored in three input queues (one for each bus) in the event merger (EM) chip. The event merger cyclically browses the outputs of the input queues, starting from the A bus. When it finds the EMSOD marker with the 'start of event' flag, it resets the CRC engine (each transferred word updates the CRC). Then the EM copies all the data from the A bus to the S-Link output. When the EM finds the EMEOD marker, it switches to the next bus. If the EMEOD marker contains the 'end of event' flag, the EM generates and outputs the CDF trailer with the calculated CRC.
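The merging loop described above can be sketched as a small software model. It is a simplification under stated assumptions: queue entries are `(kind, value)` pairs, the EMSOD and EMEOD markers are consumed rather than forwarded, the CDF header arrives as ordinary data on the first bus, and a truncated `zlib.crc32` stands in for the real S-Link CRC. The name `merge_events` is hypothetical.

```python
import zlib
from collections import deque


def merge_events(buses):
    """Merge the per-bus queues into one output stream.

    buses: list of deques holding ("EMSOD", start_of_event_flag),
           ("DATA", 64-bit word) or ("EMEOD", end_of_event_flag).
    """
    out = []
    crc = 0
    bus = 0                              # start from the A bus
    while buses[bus]:                    # hardware would wait for data here
        kind, value = buses[bus].popleft()
        if kind == "EMSOD":
            if value:                    # 'start of event': reset the CRC
                crc = 0
        elif kind == "DATA":
            crc = zlib.crc32(value.to_bytes(8, "big"), crc)
            out.append(("DATA", value))
        elif kind == "EMEOD":
            if value:                    # 'end of event': emit the trailer
                out.append(("TRAILER", crc & 0xFFFF))
            bus = (bus + 1) % len(buses)  # switch to the next bus
    return out
```

Because each bus is drained only up to its next EMEOD marker, the data of one BX from all three buses appear in the output before the data of the following BX.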
The above algorithm, due to its parallel and pipelined operation, allows us to fully utilize the DCC S-Link64 output bandwidth, even when working at relatively low clock frequencies. In the current implementation the input/output (I/O) buses between the IH, EM and EB chips operate at the LHC clock (40 MHz, which corresponds to 25 ns between bunch crossings). The internal blocks of these chips operate at the 80 MHz clock frequency, generated by the phase locked loop (PLL) from the LHC clock frequency (40 MHz). The output S-Link interface also operates at the 80 MHz clock generated by the PLL.
Figure 9. The example of data flow through the RPC DAQ system [11]: a simplified case with two RMBs and five link inputs. Two BXes are sent: the one with the L1A and the next one. The marks above the data words in the upper-left section of the figure denote the status of the data ('*': triggered data; 'o': post-triggered data). The triggered and post-triggered data are encapsulated by the RMBs into the RMB data records (the upper-right section of the figure) described in section 2.1. In this process, the frame, time and link markers are added to the data. The RMB data records are received by the DCC board and converted into the CDF event fragments (the lower part of the figure) described in section 3.2. Each CDF event fragment contains data triggered (or pre-triggered, or post-triggered) by the particular L1A pulse. The data from each transmitted BX are grouped together and preceded by the SBXD marker. The link markers (SLD) precede the channel data (CD) originating from the particular link input.
Assessment of the DCC throughput
The total maximum raw data bandwidth provided by the $N_{RMB}$ = 28 connected RMBs is equal to 28 × $B^{raw}_{RMB}$ ≈ 8.1 Gb s⁻¹. The output throughput of the DCC is equal to $B^{out}_{DCC}$ = 80 MHz × 64 bits/word = 5.12 Gb s⁻¹, which is much less than the maximum raw data bandwidth. However, if we consider the statistical properties of the data provided by the link system, the expected data bandwidth turns out to be acceptable.
The expected data bandwidth can be estimated in the following way: there are two sources of the data, each with different behaviour: true muons and noise. Muons are expected to appear only in one BX, namely in L1A BX, and their rate should be equal to the rate given by the muon L1 triggers. Typical rates of the muon L1 triggers are shown in table 1 [1] , chapter 15. Each muon can fire up to six layers of the RPC chambers; each chamber gives up to two fired 'partitions' resulting in the two data words. Additionally, one should add one more data word to take into account a situation when muon fires two neighbouring chambers. It gives up to 13 data words for one muon. Therefore, the data bandwidth produced by muons can be computed:
where k = 3 is a 'safety factor' for the underestimated rate of the muon L1 trigger. The number of 'noise-fired partitions' in each BX is given by the Poisson distribution with mean 25 [14], which means that no more than $N_{noise}$ = 42 'noise-fired partitions' are expected in one BX with probability equal to 99.73% (3σ). Therefore, the expected bandwidth generated by noise equals
where $N_{BX}$ = 8. It gives the total expected data bandwidth
Taking into account the structure of the CDF event fragment (see figure 8), the protocol overhead at $f_{L1A}$ = 100 kHz may be calculated from the above calculations of the data bandwidth. In the worst case, each data word may arrive from a different link, resulting in the bandwidth occupied by the SLD markers equal to the bandwidth of the data: $B_{SLD} = B_{data} \approx 551$ Mb s⁻¹. One SBXD marker is sent for every transmitted BX, occupying the bandwidth of
The CDF header and trailer are data words of width $W_{DCC}$ = 64 bits. For every event exactly one CDF header and one CDF trailer are transmitted, resulting in the bandwidth of $B_{CDF} = 2 f_{L1A} W_{DCC} \approx 13$ Mb s⁻¹. The total protocol bandwidth is equal to
The total data and protocol bandwidth is equal to
which is significantly lower than the DCC output bandwidth.
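The budget of this section can be cross-checked with the Python sketch below. Only $B_{data}$ ≈ 551 Mb/s, $B_{CDF}$ and the DCC output bandwidth are taken directly from the text. The forms used for the noise term and for the SBXD term (one 16-bit word per noise-fired partition and per transmitted BX) are assumptions, and the muon term is not recomputed because the trigger rates of table 1 are not reproduced here.

```python
from math import exp, factorial

F_L1A = 100e3            # L1A rate [Hz]
N_BX = 8                 # BXes transmitted per L1A
W_RMB = 16               # width of a data word or marker in the payload [bits]
W_DCC = 64               # width of the CDF header/trailer [bits]
N_NOISE = 42             # noise-fired partitions per BX (upper bound)
B_DATA = 551e6           # total expected data bandwidth stated in the text [bit/s]
B_DCC_OUT = 80e6 * W_DCC # S-Link64 output bandwidth [bit/s]


def poisson_cdf(n, mean):
    """Probability that a Poisson variable with the given mean is <= n."""
    return sum(exp(-mean) * mean ** k / factorial(k) for k in range(n + 1))


def dcc_budget():
    """Return the individual bandwidth terms in bit/s."""
    b_noise = F_L1A * N_BX * N_NOISE * W_RMB   # assumed form of the noise term
    b_sld = B_DATA                             # worst case: one SLD per data word
    b_sbxd = F_L1A * N_BX * W_RMB              # assumed: one 16-bit SBXD per BX
    b_cdf = 2 * F_L1A * W_DCC                  # one header and one trailer per event
    b_proto = b_sld + b_sbxd + b_cdf
    return {"noise": b_noise, "sbxd": b_sbxd, "cdf": b_cdf,
            "proto": b_proto, "total": B_DATA + b_proto}
```

Under these assumptions the total stays well below the 5.12 Gb/s available at the DCC output, and `poisson_cdf(42, 25.0)` confirms that the bound of 42 noise-fired partitions is met with a probability above 99.73%.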
Reprogrammability of the boards in the RPC DAQ system
Usually, after the power-up, the FPGA chips on the DCC board are configured with the firmware stored in the flash memories on the board. However, the hardware of the DCC allows us to reprogram the FPGA chips via the JTAG interface. The VME interface [15] on the DCC board contains the JTAG master block, which, together with a modified version of the Altera Jam STAPL Player software [16] , allows us to remotely modify the firmware of the IH, EB and EM chips. This feature has been intensively used during the development and debugging, and may also be used to download the dedicated diagnostic firmware in the case of problems. The same solution also allows us to program the flash memories to update the standard firmware.
The firmware for the RMB chip is uploaded via the VME interface after the power-up, which also allows us to easily update the firmware, or to use the dedicated diagnostic firmware.
Diagnostic features and error handling in the RPC DAQ
This paper describes the standard mode of the RPC DAQ operation. However, to allow testing of the system, additional diagnostic modes have been implemented. In the bit error rate testing mode, a linear feedback shift register (LFSR) with a length of 57 bits is used in the RMB to generate a pseudorandom data sequence. An identical register is used in the IH chips to verify whether the received data are correct. To allow testing of the DCC while only a small number of RMBs is temporarily available, a special internal diagnostic mode has been implemented. In this mode the data to be transferred are written to the internal RAM of the IH chips, and the sequence of L1A triggers is written into the internal RAM of the EB chip. The sequence of the triggers may be replayed either once or cyclically.
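The bit error rate testing mode can be modelled in Python as a generator and a checker sharing the same 57-bit LFSR. The feedback polynomial is not given in the text; the taps at bits 57 and 50 (x^57 + x^50 + 1) are an assumption, as are the 16-bit word grouping, the shared seed and all function names.

```python
MASK = (1 << 57) - 1   # 57-bit register


def lfsr_step(state):
    """Advance the register by one bit (assumed taps: bits 57 and 50)."""
    bit = ((state >> 56) ^ (state >> 49)) & 1
    return ((state << 1) | bit) & MASK


def lfsr_words(seed, count, width=16):
    """Yield `count` pseudorandom words of `width` bits."""
    state = seed & MASK
    if state == 0:
        raise ValueError("an all-zero LFSR state never changes")
    for _ in range(count):
        word = 0
        for _ in range(width):
            state = lfsr_step(state)
            word = (word << 1) | (state & 1)
        yield word


def count_bit_errors(received, seed):
    """Compare received words with a locally regenerated sequence."""
    reference = lfsr_words(seed, len(received))
    return sum(bin(r ^ e).count("1") for r, e in zip(received, reference))
```

A receiver built this way needs to start from the same state as the transmitter; how the IH chips achieve that synchronization is not covered by this sketch.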
To allow reliable cooperation with the CMS DAQ system, the RPC DAQ is equipped with a rich set of error detection features, which may be used in both normal operation and diagnostic modes. The DCC automatically detects and blocks the disconnected or noisy RMB inputs. The data transferred via the RPC DAQ optical links are protected with a 12-bit cyclic redundancy check (CRC) checksum. The data transmitted via the S-Link to the CMS DAQ are also protected, with a 16-bit CRC. All detected errors are signalled by the introduction of special markers into the data stream and/or by incrementing the error counters, which can be monitored during the operation of the system via the VME interface.
Test results of the RPC DAQ system
The design has been extensively tested in simulations. The obtained results have confirmed that it is able to operate with a nominal L1A frequency of 100 kHz.
The simulations also allowed us to measure the latency of the data processing. The first data from RMB are transmitted 34 BXes after the corresponding L1A pulse. The first data from the DCC S-Link output are transmitted 16 BXes after the first data are received from the connected RMB.
The system has also been tested using the built-in self-diagnostic features. The optical link between the RMB and the DCC has been tested for 19 h without any transmission errors, resulting in an estimated bit error rate (BER) below 10⁻¹³ at the confidence level of 0.95. The correct operation of the DCC has also been tested with artificially generated data in the internal diagnostic mode. However, testing under the full load was not possible due to the bandwidth limitations of the system receiving the data from the S-Link interface.
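The quoted BER limit can be checked with the standard zero-error estimate: if no errors are seen in N bits, the probability of that outcome is about exp(-p·N), so the upper limit at confidence level CL is p = -ln(1 - CL) / N. The sketch below assumes that the link carried 640 Mb/s for the whole 19 h; the function name is illustrative.

```python
from math import log


def ber_upper_limit(bit_rate, seconds, confidence=0.95):
    """Upper limit on the BER after an error-free test of the given length."""
    n_bits = bit_rate * seconds
    return -log(1.0 - confidence) / n_bits


if __name__ == "__main__":
    limit = ber_upper_limit(640e6, 19 * 3600)
    print(f"BER < {limit:.2e} at 95% confidence")
```

Under this assumption the limit comes out slightly below 10⁻¹³, in agreement with the value given in the text.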
The prototype implementation of the RPC DAQ has been tested during the second phase of the CMS magnet test and cosmic challenge (MTCC), which took place in September 2006 at CERN. The goals of the MTCC were to test the superconducting magnet of the CMS detector and to test multiple metrological subsystems of the detector (including the RPC system) as well as their interplay. During the test, cosmic rays were used as a source of muons, which were measured by the subsystems. The prototype implementation of the RPC DAQ consisted of two RMB mezzanines, each located on a separate TB with 12 optical links connected, and of one DCC board connected via S-Link64 to the global CMS DAQ. The RPC DAQ has been tested with L1A trigger frequencies between about 30 Hz, when the trigger was provided only by the RPC technical trigger (RBC) [17], and about 450 Hz, when the trigger was provided by all trigger subsystems of the CMS operating during the MTCC. The capability to acquire data from a variable number of BXes per L1A has been tested for one BX (L1A BX only), five BXes (L1A, two pre- and two post-trigger BXes) and eight BXes (L1A, three pre- and four post-trigger BXes).
During the tests, some problems with the FPGA firmware on both the RMB and DCC boards have been discovered and corrected, either during the MTCC or later. Also, based on the test results, some changes to the serialization algorithms have been implemented. The MTCC has also shown that it is necessary to extend the diagnostic layer of the RPC DAQ, e.g., to place error counters at different stages of the system. During the MTCC, a few million events have been collected. An example of an event measured by the RPC and DT DAQ systems (DT, drift tubes, is another metrological subsystem of the CMS) and reconstructed by the CMS event display [18] is shown in figure 10.
Summary
The presented RPC data acquisition system (RPC DAQ) is an essential part of the metrological system consisting of the RPC detector, the RPC link system and the RPC trigger. The presented solution allows us to transfer the data transmitted via the RPC link system to the CMS DAQ system.
The proposed implementation uses a parallel, pipelined, throughput-optimized architecture, which allows us to achieve the requested bandwidth while operating at reasonable clock frequencies. The available bandwidth of the presented solution may be easily scaled by increasing the number of DCC boards used and decreasing the number of RMBs connected to a single DCC.