Abstract-Algorithms that calculate and remove, after pedestal subtraction, the common mode of a group of adjacent channels have been studied using Monte Carlo techniques and implemented in a fast Xilinx Virtex-II Pro FPGA. The implementation has been optimized for both speed and minimal FPGA resources, so that it can be used in multi-channel applications. The aim of this work is to define the optimum algorithm for common mode calculation, to be implemented for common mode rejection in the CMS Preshower detector.
I. INTRODUCTION
The readout chain of detectors based on microstrip sensors includes Front-End (FE) electronics for amplification and shaping of the signal induced on the strips by the passage of a charged particle or photon through the sensor. The signals are then usually digitised and sent out of the detector for additional processing or storage.
The common mode in this work is defined as the time-dependent mean baseline shift of the channel pedestals, i.e. of the signal level of a channel without particle charge. This shift may be common to a number of adjacent channels, due to internal or external EM sources (ground bounce, external strip-cable lines acting as RF antennas, etc.). Each channel is considered to include a sensor strip, an amplifier-shaper and an analog memory in the Front-End chip, a multiplexing stage at the FE output and an ADC. The input of the algorithm calculating the common mode is the set of digitized signal values of a group of adjacent channels after pedestal subtraction. The channel pedestals (mean baseline levels) are usually measured from a number of events without particle charge, taken at predefined time intervals short enough to follow any variation of the pedestals.
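As an illustration of the input described above, the following minimal sketch estimates per-channel pedestals from events without particle charge and subtracts them from a new event. The function names and the ADC values are hypothetical and chosen only for this example; they are not taken from the Preshower data.

```python
from statistics import mean


def estimate_pedestals(calibration_events):
    """Per-channel mean over events recorded without particle charge."""
    n_channels = len(calibration_events[0])
    return [mean(event[ch] for event in calibration_events)
            for ch in range(n_channels)]


def subtract_pedestals(event, pedestals):
    """Return the event with each channel's pedestal removed."""
    return [value - ped for value, ped in zip(event, pedestals)]


# Three hypothetical calibration events for a 4-channel group (ADC counts).
calibration = [
    [100, 202, 149, 251],
    [102, 198, 151, 249],
    [98, 200, 150, 250],
]
pedestals = estimate_pedestals(calibration)

# A later event in which all channels are shifted by the same amount:
# after pedestal subtraction, the residual is the common mode.
residuals = subtract_pedestals([105, 205, 155, 255], pedestals)
```

In this toy case every residual equals 5 ADC counts, which is the shift shared by all channels of the group.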
Although the algorithm and its implementation presented here can be applied to remove either the common mode in any electronic detector or the time-dependent background of an image, they have been developed in order to reject the common mode in the readout chain of the CMS silicon Preshower detector [1].
The CMS silicon Preshower is a fine-grain detector placed in front of the endcap ECAL. Its primary function is to detect photons with good spatial resolution in order to perform the π0 rejection required in the search for Higgs bosons.
Each silicon sensor has a total active area of 61 × 61 mm² and is divided into 32 strips of 1.9 mm pitch, with a strip capacitance in the region of 50 pF. The PACE3 chip [2], a large dynamic range, two-gain FE ASIC, is used for amplification, shaping and temporary storage of the analogue signals. The 32 strip signals are multiplexed on demand by the CMS first level trigger and sent out to a 12-bit ADC, AD41240 [3]. The digitized data from a group of up to 4 sensors are multiplexed [4] and sent to the CMS off-detector electronics through an optical link. In the off-detector electronics, data reduction algorithms are applied so that only the useful part of the data is sent to the CMS DAQ event builder for further online analysis and storage.
II. THE METHOD OF COMMON MODE CALCULATION
Various methods have been used for common mode estimation of a group of adjacent channels in electronic detectors used in High Energy Physics [5]. Some of them are based on the differences between the pulse height of each channel and the mean value of all channels in the group. Other methods estimate the common mode using the median pulse height. The main difficulty in common mode estimation is distinguishing the channels having particle-induced charge from the channels having common mode only, in particular when the common mode fluctuations between channels are high.
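The difference between the mean-based and median-based estimates mentioned above can be seen in a small sketch. It is an illustration of those two generic approaches only, not of the method developed in this work, and the 16 ADC values are hypothetical: a baseline shift of about 5 counts with one strip carrying a particle signal.

```python
from statistics import mean, median


def common_mode_mean(values):
    """Common mode estimated as the plain mean of the group."""
    return mean(values)


def common_mode_median(values):
    """Common mode estimated as the median of the group."""
    return median(values)


# Hypothetical pedestal-subtracted ADC values for 16 adjacent strips.
# Strip 7 carries a particle signal on top of the baseline shift.
group = [4, 6, 5, 5, 3, 7, 5, 55, 6, 4, 5, 5, 6, 4, 5, 5]

cm_mean = common_mode_mean(group)      # pulled upwards by the signal strip
cm_median = common_mode_median(group)  # stays at the baseline shift
```

The single signal strip raises the mean to 8.125 counts, while the median remains at 5 counts. This is the kind of bias that makes the separation of signal channels from common-mode-only channels the central difficulty.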
In this work the common mode has been calculated for groups of 16 channels, and therefore the digitized signals of each Preshower sensor have been divided into two groups of 16 channels each. In an earlier study, a method using a cut on the difference between the ADC value of each strip and the mean value of the group of strips was studied and implemented in an FPGA; it showed a limitation, especially for low ADC values of the signal [6]. In the present work a more efficient method has been studied and implemented in an FPGA. According to this method:
Fast ascending sorting of the 16 ADC values (v_i) is performed after the mean pedestal subtraction, using selection sorting with Gray encoding/decoding. The sorting is used to define the first part of the sorted list, which is expected not to include particle charge signal. This part is used to calculate the common mode. The length of the first part of the list is calculated in the following steps: In this case, if c_1 = c_2 = 2 is used, the length of the first part of the sorted list can be determined and the common mode can be calculated (3 ADC counts).
It is worth mentioning that the ADC values after pedestal subtraction are considered to be integers. Although this is an approximation, and although the division used to calculate the gradual mean has been approximated by a multiplication and a right bit shift in order to speed up the common mode calculation, the precision of the calculation remains acceptable. The method has been simulated using estimated values for the pedestal and common mode variation. Its performance, shown in figure 2, is satisfactory. Figure 2 plots the difference between the calculated and the input common mode. For the left plot, the input to the simulation consisted of normally distributed pedestals with rms 7 ADC counts (expected for high gain in the PACE3 without any extra shielding of the on-detector electronics) and a normally distributed common mode with mean 5 ADC counts and rms 10 ADC counts. In addition, a common mode variation of ~25% of the mean common mode has been added to the 16 strips. Signals from the decay Higgs (300 GeV) → μμμμ have been used, mixed with the appropriate minimum bias events for an LHC luminosity of 2 × 10^33 cm^-2 s^-1. The values c_1 = c_2 = 3 were used in the method to calculate the common mode. The rms of the distribution in figure 2 (left) is ~1.8 ADC counts, which is quite low considering that 1 minimum ionizing particle (MIP) produces a signal of 50 ADC counts for PACE3 high gain.
For the right plot in figure 2, normally distributed pedestals with rms 3 ADC counts (expected for low gain in the PACE3) have been used as input to the simulation. The input common mode is similar to that used for the left plot. The rms of the distribution is ~0.9 ADC counts, compared to ~8 ADC counts corresponding to a MIP for PACE3 low gain.
III. IMPLEMENTATION
The algorithms have been developed using VHDL and implemented in an XCV2P7 Virtex-II Pro Xilinx FPGA. The implementation of the algorithm is a trade-off between processing time and resource occupancy, since it will be used in a multi-channel application. The Xilinx ISE Foundation tool has been used for the implementation, together with the Synplicity Synplify Pro synthesizer. The ModelSIM simulator by Mentor Graphics has been used for the verification of the algorithm.
The most time-consuming part of the algorithm is the sorting procedure. Different sorting methods have been tested, and they show a trade-off between sorting time and logic requirements. The method we settled on is selection sorting in conjunction with Gray encoding/decoding. This method has the minimum logic requirements and a satisfactory sorting time.
The sorting is performed in a bit-sequential mode using dual RAM banks [6]. At the start of the procedure an offset is added to the ADC values in order to have positive integer numbers only. The positive integers are then transformed to Gray codes using G_k = XOR(b_k, b_{k-1}), where b_k is the kth bit of the number and G_k the corresponding Gray code bit. At the end of the procedure the Gray codes are transformed back to binary using b_k = XOR(b_{k-1}, G_k) and the added offset is subtracted.
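The Gray encoding and decoding steps can be sketched in software as follows. This is a minimal illustration, not the VHDL implementation; it assumes that bit k-1 in the relations above denotes the next more significant neighbour of bit k, which is the convention that reproduces the four 4-bit values used in the demonstration of figure 3. The function names are chosen for this example.

```python
def binary_to_gray(value):
    """Each Gray bit is the XOR of a binary bit and its more significant neighbour."""
    return value ^ (value >> 1)


def gray_to_binary(gray):
    """Rebuild each binary bit as the XOR of all Gray bits at or above it."""
    value = 0
    while gray:
        value ^= gray
        gray >>= 1
    return value


# The four hexadecimal values of the figure 3 demonstration.
numbers = [0xA, 0x5, 0x3, 0xC]
gray_codes = [binary_to_gray(n) for n in numbers]
recovered = [gray_to_binary(g) for g in gray_codes]
```

Here `gray_codes` is F, 7, 2, A in hexadecimal, matching the demonstration, and decoding returns the original numbers unchanged.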
The improved sorting method used is the following:
a) The elements enter the first of the two banks (pages).
This method is very fast compared with the bubble sort method, and its processing time is always the same, n × m clock cycles, where n is the number of elements to be sorted (16 for the Preshower) and m is the number of bits in each element (12 bits for the Preshower). For the CMS Preshower the processing time is therefore 192 clock cycles. This implementation can run with a clock of up to 160 MHz, so its processing time is 1.2 μs for 16 numbers. If the maximum actual length of the numbers is less than 12 bits (when no particle signal is present or the induced charge is low), the procedure is executed faster, using a skipping circuit to determine the maximum actual length of the numbers. The resources occupied in the FPGA are 39 logic slices, i.e. 2% of the XCV2P7 FPGA. It is worth mentioning that the method described in [6] requires twice the time needed by the current method to perform the sorting. A demonstration of the sorting method for four 4-bit numbers is shown in figure 3. The hexadecimal numbers to be sorted are A, 5, 3, C. Converted to Gray codes they become F, 7, 2, A, and after sorting 2, 7, F, A. The Gray codes are then converted to binary and the resulting list is 3, 5, A, C.
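The timing figures quoted above follow from simple arithmetic, sketched here for convenience. The function names are hypothetical. The last line assumes, for illustration only, that the skipping circuit reduces the cycle count to n times the actual bit length; the text states only that the procedure becomes faster.

```python
def sort_cycles(n_elements, bits_per_element):
    """Clock cycles of the bit-sequential sort: n x m."""
    return n_elements * bits_per_element


def sort_time_us(n_elements, bits_per_element, clock_mhz):
    """Sorting time in microseconds (cycles divided by MHz gives microseconds)."""
    return sort_cycles(n_elements, bits_per_element) / clock_mhz


# CMS Preshower case: 16 values of 12 bits, 160 MHz clock.
preshower_cycles = sort_cycles(16, 12)
preshower_time_us = sort_time_us(16, 12, 160)

# Assumed illustration: only 8 significant bits found by the skipping circuit.
short_time_us = sort_time_us(16, 8, 160)
```

With the Preshower parameters this gives 192 cycles and 1.2 μs, as stated in the text.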
A multiple-register sorting method has also been tested for the sorting [7]. This method was rejected due to the extremely high amount of logic resources needed, even though it requires only 16 clock cycles for 16 numbers.
The block diagram of the implemented full procedure calculating the common mode is shown in figure 4. The procedure occupies 342 slices of the FPGA resources (i.e. 6%) and is executed in 1.45 μs.
In order to minimize the execution time, no direct division is used for finding the gradual mean; instead, a multiplication using a lookup table (LUT) is followed by a bitwise right shift. In particular, for a division by a power of 2, the bits of the dividend are right-shifted by the corresponding number of bits. For a division by any other number between 3 and 15, the division is replaced by a multiplication by the corresponding number in the LUT shown in Table I, followed by a right shift of 8 bits (division by 256). This approximation (error less than 1%) dramatically increases the speed. In addition, in order to keep the FPGA occupancy and the required execution time low, the resulting quotient is truncated rather than rounded, with no compromise to the accuracy.
In addition to the common mode of the adjacent channels (mean common mode), the rms of the common mode is calculated so that it can be used for a cut, applied after the common mode subtraction, to identify the channels having particle signal. The rms is approximated as (m_k - m_1)/(k/4).
After the sorting, the strip data in the first bank are rearranged to recover their original position in the list, so that they can be used in the following stages of the off-detector electronics. This avoids using extra memory for temporary storage of the original data. The rearrangement takes place concurrently with the calculation of the common mode after sorting, and therefore no extra time is needed. At the end of the common mode calculation process, the common mode is subtracted.
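The multiply-and-shift replacement of the division can be sketched as follows. Table I is not reproduced in this text, so the LUT entries used here, round(256/d), are an assumption and may differ from the values actually implemented; the approximation error of this sketch may therefore also differ from the figure quoted above. The names `RECIPROCAL_LUT` and `approx_div` are hypothetical, and the dividend is assumed to be a non-negative integer.

```python
SHIFT = 8  # right shift by 8 bits, i.e. division by 256

# Assumed LUT contents: round(256 / d) for divisors that are not powers of 2.
RECIPROCAL_LUT = {d: ((1 << SHIFT) + d // 2) // d
                  for d in range(3, 16) if d & (d - 1)}


def approx_div(dividend, divisor):
    """Truncated quotient of a non-negative integer by a divisor in 1..16."""
    if divisor & (divisor - 1) == 0:
        # Power of 2: shift right by log2(divisor) bits.
        return dividend >> (divisor.bit_length() - 1)
    # Otherwise: multiply by the LUT entry, then divide by 256 via a shift.
    return (dividend * RECIPROCAL_LUT[divisor]) >> SHIFT


quotient_by_3 = approx_div(100, 3)   # LUT entry 85, then shift by 8 bits
quotient_by_4 = approx_div(64, 4)    # power of 2, plain shift by 2 bits
quotient_by_7 = approx_div(100, 7)   # LUT entry 37, then shift by 8 bits
```

The result is truncated, not rounded, in line with the choice described in the text. In hardware the multiplication and the shift map onto much cheaper logic than a general divider.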
IV. USE OF THE COMMON MODE REJECTION METHOD FOR THE PRESHOWER READOUT
As mentioned in the introduction, the data received by the off-detector electronics are multiplexed. Each data frame includes three consecutive digitised samples (SLOTS), 25 ns apart, of the signals from the strips of up to 4 sensors, in order to reconstruct the pulse produced by the preamplifier-shaper.
The original induced charge is calculated from the reconstructed pulse shape using the deconvolution technique [8]. The functionality of the off-detector electronics for the Preshower is shown in figure 5. After integrity checks (de-serialization, CRC, packet synchronization between on- and off-detector electronics), the data frame is unpacked in order to construct the lists of values of the 16 adjacent channels (SLICES). The pedestals are subtracted online during the unpacking.
After the common mode calculation and subtraction, Bunch Crossing Identification is performed in order to reject samples including residual signal from previous events. Finally, the particle-induced charge is reconstructed and a threshold depending on the pedestal and common mode variation is applied. Only data from strips with particle-induced charge are transmitted to the CMS DAQ system, together with their addresses and a frame header. Due to the low occupancy of the CMS Preshower, less than 5% of the original data would be sent to the CMS DAQ event builder. As shown in the upper right part of figure 5, in a Virtex-II Pro FPGA XCV2P7 running with a clock of 160 MHz, the execution time of the common mode calculation procedure for 4 PACE chips is 7.3 μs in the longest case, which is below the frame readout time of 7.5 μs (readout clock 40 MHz), and the occupancy is 6 × 6% = 36% of the FPGA logic resources.
