ABSTRACT: This paper describes the implementation of an SPI-programmable clock delay chip based on a Delay Locked Loop (DLL), designed to shift the phase of the LHC clock (25 ns period) in steps of 1 ns, with less than 5 ps of jitter and 23 ps of DNL. The delay lines will be integrated into ICECAL, the LHCb calorimeter front-end analog signal processing ASIC, in the near future. The stringent noise requirements on the ASIC imply minimizing the noise contribution of the digital components. This is accomplished by implementing the DLL in differential mode. To achieve the required radiation tolerance several techniques are applied: double guard rings between PMOS and NMOS transistors, glitch suppressors and TMR registers. This 5.7 mm² chip has been implemented in a 0.35 µm CMOS technology.
Introduction
Delay lines are commonly used in high energy physics experiments, since synchronization is critical in this kind of application. The operating principle of digital delay lines is very simple: the user can set an arbitrary delay and thus compensate the latency introduced, for example, by cables or fibers. This paper describes the implementation of an SPI-programmable (Serial Peripheral Interface) delay line with 4 triple channels (12 clock outputs), based on a Delay Locked Loop (DLL) and developed for the LHCb electromagnetic and hadronic calorimeter upgrade. The user can configure up to 25 different clock phases to cover the 25 ns LHC clock period in 1 ns steps.
The DLL is adjusted by means of two control voltages: one is automatically generated by the phase detector and charge pump circuit, while the other is externally adjusted by the user. This delay chip prototype will be integrated into the next LHCb calorimeter front-end ASIC (ICECAL) [1], which is a very low noise analog signal shaper. Therefore, the digital switching noise contribution must be kept as low as possible. The fully differential design of this DLL aims to reduce the switching noise contribution of the delay lines, and the use of double guard rings also decreases noise propagation through the substrate.
The design methodology of this chip prototype is also determined by the radiation environment in which this DLL will operate. ICECAL chips will be mounted on ECAL front-end boards (FEBs) located inside the LHCb cavern, where the radiation dose is expected to reach 5 krad. This justifies a full custom design. Moreover, the energy increase of the LHC machine (from 7 to 14 TeV) will increase the potentially dangerous ionizing radiation and thus the probability of suffering single event effects (SEEs). This design must tolerate Single Event Upsets (SEUs), Single Event Transients (SETs) and Single Event Latch-ups (SELs). The probability of suffering SELs is reduced by increasing the distance between PMOS and NMOS transistors and inserting double guard rings between them, so that PMOS and NMOS transistors are confined inside islands of the same transistor type. SEUs are avoided by implementing Triple Modular Redundancy (TMR) registers to store the DLL configuration and a fault-tolerant Finite State Machine in the SPI Slave. Finally, reset signals are protected from SETs by means of glitch suppressors. This paper is organized as follows: section 2 details the chip requirements, and sections 3 and 4 show the Delay Line and Slow Control implementations respectively. Section 5 then describes the improvements made to increase reliability and section 6 shows the chip measurements. Finally, section 7 concludes the paper.
Requirements
As mentioned in the introduction, this delay line will be integrated into the next LHCb calorimeter analogue shaper ASIC (ICECAL v3) [1], and therefore it must fulfill the following requirements:
• Each DLL channel must be able to generate 3 independent and configurable clock phases in order to delay the LHC clock (40 MHz).
• Simultaneous Switching Noise (SSN) produced by delay elements has to be minimized as much as possible since the analog shaper has a high sensitivity.
• Delay line must tolerate Process, Voltage and Temperature (PVT) variations.
• The DLL must be robust against SEEs.
Figure 2 shows the block diagram of the Delay Chip. The Reset block generates a 1.2-ms-wide reset signal after the chip is switched on and also implements a glitch suppressor able to filter SET glitches up to 8 ns wide. The SPI Slave interface generates the signals to read and/or write the serial registers of the DLL channels. It also enables the user to reset the charge pumps by software. Moreover, the SPI slave state machine is SEU tolerant. Finally, each DLL channel has 3 independent LVDS clock outputs and enables the user to configure delays between 0 and 24 ns in steps of 1 ns, the clock period being 25 ns. The configuration is stored in a 16-bit TMR register that protects the data against SEUs. The DLL implementation is fully differential, so that switching noise is lower than in a single-ended implementation. Taking into account that this delay chip will be integrated into the ICECAL chip, another measure to minimize the effects of switching noise is to place the ADC clock generator as far as possible from the analog shaper, since the ADC is not inside ICECAL (see figure 1) and its clock signals have to be driven out of the chip. This block (figure 2, bottom) receives the reference differential CMOS clock signal, which passes through a Voltage Controlled Delay Line (VCDL); a multiplexer then selects the desired output and finally an LVDS driver converts the voltage levels. The main DLL block, on the other hand, has the same functionality but also converts the LVDS input signal to CMOS and implements the phase comparator which generates the fine-grain control voltage. Two signals, Vcoarse, which is an external and fixed bias voltage, and Vcontrol, ensure that the delay introduced by each VCDL stage is 1 ns.
Delay line implementation
VCDL
A VCDL is a set of N adjustable Delay Elements (DE) connected in cascade (see figure 3). In our particular case, the role of this block is to generate N=25 delayed copies of the original differential clock in steps of 1 ns. Figure 4 shows the structure of the adjustable DE: the starved inverters not only invert the input signal but also apply a delay that is a function of the Vcontrol and Vcoarse signals (see figure 5). Weak Inverters (WI) ensure that the differential signals are 180° out of phase. This is especially critical in the last VCDL stages, where the Integral Nonlinearity (INL) due to mismatches between counterpart starved inverters may change the clock phases substantially. Finally, in order to drive the multiplexer that selects the output clock, a common inverter is placed at the end of each stage, isolating the DE from the input capacitance of the multiplexer.
The starved inverter schematic is shown in figure 5. It is based on a common CMOS inverter where the PMOS/NMOS sources are not directly connected to the power rails but pass through a MOS transistor acting as an adjustable resistor. The starved inverter becomes slower as the impedance that these devices add to the current path increases. The impedance of these MOS transistors, and hence the delay, is adjusted by varying their gate voltage.
The main advantages of using two control transistors instead of a single NMOS transistor are the wider range of achievable delays and the slew rate symmetry, which is critical to achieve an end-to-end delay of 25 ns. The main disadvantages are the area overhead and the need to control an additional signal. In order to minimize this second drawback we have connected the gates of all the NMOS control transistors to the same external input, called Vcoarse. With this signal we compensate the offset that process variations or environmental factors may cause. Once the chip operating point is adjusted, the charge pump of each DLL channel provides Vcontrol, the PMOS control transistor gate voltage, which compensates for mismatch, environmental changes or noise that may perturb the 500 ps delay that every starved inverter should have.
Phase detector and charge pump
The phase detector design is divided into two parts. The first is a flip-flop that detects whether rising edge n+1 of the reference clock arrives before or after rising edge n of the delayed clock, which is expected to be delayed by 25 ns. To do so, the reference clock is used to sample the delayed clock: if the delay is lower than 25 ns, the output is high when the reference clock latches the input; otherwise the output is low. The output of this simple circuit tells us whether charge or discharge is needed. The second part, an XOR between the same two clock signals, determines the magnitude of the charge/discharge pulse. Combining the magnitude and sign of the clock phase error, the stored charge provides a 25 ns delay between the signals.
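The locking behaviour described above can be illustrated with a minimal behavioural sketch in Python. It is not the circuit implementation: the function names, the lumped loop gain `gain`, and the modelling of the XOR pulse as the absolute phase error are assumptions made only for illustration.

```python
T_CLK_NS = 25.0  # LHC clock period (from the text)


def phase_detector(delay_ns, period_ns=T_CLK_NS):
    """Return (ff_high, pulse_ns) for one comparison.

    ff_high  -- flip-flop output: high when the VCDL delay is shorter than
                one clock period, low otherwise.
    pulse_ns -- XOR pulse, modelled here (assumption) as the absolute
                phase error between the two clocks.
    """
    error_ns = period_ns - delay_ns
    return error_ns > 0, abs(error_ns)


def lock(delay_ns, gain=0.2, cycles=200, period_ns=T_CLK_NS):
    """Iterate the loop until the delay settles.

    gain is a hypothetical lumped charge-pump/VCDL gain (ns of delay
    correction per ns of XOR pulse); it is not a measured value.
    """
    for _ in range(cycles):
        ff_high, pulse_ns = phase_detector(delay_ns, period_ns)
        step = gain * pulse_ns
        # Delay too short -> lengthen it; delay too long -> shorten it.
        delay_ns += step if ff_high else -step
    return delay_ns
```

Under these assumptions the loop converges to the 25 ns target from either side, for example from a starting delay of 20 ns or of 28 ns.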
Multiplexer
The aim of the multiplexer is to output one of the 25 available input signals coming from the VCDL while perturbing the delay as little as possible. To do so, the 25:1 multiplexer was implemented by means of two levels of 5:1 tristate multiplexers; hence the clock signal path traverses only 2 multiplexers. This implies low latency and consequently low delay variability due to environmental variations.
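As an illustration of the two-level selection, the following Python sketch splits a tap index into the two 5:1 select values. The grouping of consecutive taps into the same first-level multiplexer and the function names are assumptions; the actual select encoding of the chip is not described here.

```python
N_TAPS = 25  # VCDL outputs
GROUP = 5    # inputs per 5:1 multiplexer


def mux_select(tap):
    """Split a tap index 0..24 into (group, line).

    group -- which first-level 5:1 multiplexer output the second level picks
    line  -- which input each first-level multiplexer picks
    """
    if not 0 <= tap < N_TAPS:
        raise ValueError("tap must be in 0..24")
    return divmod(tap, GROUP)


def mux_25to1(taps, tap):
    """Model the 25:1 selection as five 5:1 muxes followed by one 5:1 mux."""
    group, line = mux_select(tap)
    # First level: five multiplexers sharing the same 'line' select.
    first_level = [taps[g * GROUP + line] for g in range(GROUP)]
    # Second level: a single multiplexer choosing among the five outputs.
    return first_level[group]
```

Whatever the tap, the selected signal crosses exactly two multiplexers, which is the property the design relies on.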
Slow control implementation
Configuration registers
As previously mentioned, the radiation environment in which the final implementation will operate implies the use of redundancy. Triple Modular Redundancy (TMR) consists in replicating the same memory bit three times; thus, even if one (and only one) SEU occurs, not only will the output data be correct but the erroneous bit will also be automatically corrected at the next rising clock edge (see figure 7). In order to make the TMR registers more resilient, the flip-flops were physically interleaved by a factor of 4, i.e. the first set of 4 registers corresponds to the first input bit of the majority voting circuits of four different output bits, the second set of 4 to the second input bit, and so on. In this way, the distance between redundant flip-flops is higher than 100 µm. Hence, the basic TMR block is 4 bits wide. The Configuration Registers are merely an array of four 4-bit TMR blocks and parallel-serial registers that interface with the SPI. Notice that a TMR register recovers from SEUs only at a rising edge of the SPI clock. Therefore, the simplest way to minimize the probability of suffering two SEUs in the same TMR block is to periodically read any of the SPI registers, since the SPI Master then generates several clock pulses and the TMR bits are refreshed.
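The voting and self-correction mechanism can be sketched in a few lines of Python. This is a behavioural model only; the class and function names are hypothetical, and `placement` simply encodes the factor-4 interleaving described above under the assumption that flip-flops are numbered in physical order.

```python
def majority(a, b, c):
    """Bitwise 2-out-of-3 majority vote."""
    return (a & b) | (a & c) | (b & c)


class TmrBit:
    """One configuration bit stored in three redundant flip-flops."""

    def __init__(self, value=0):
        self.copies = [value, value, value]

    def output(self):
        return majority(*self.copies)

    def upset(self, index):
        """Model an SEU flipping one of the three copies."""
        self.copies[index] ^= 1

    def clock(self):
        """Rising clock edge: every copy reloads the voted value."""
        voted = self.output()
        self.copies = [voted, voted, voted]


def placement(position, block_bits=4):
    """Map a physical flip-flop position (0..11) in a 4-bit TMR block
    to (copy, bit): positions 0-3 hold copy 0 of bits 0-3, and so on."""
    return divmod(position, block_bits)
```

A single upset leaves the output unchanged and is erased at the next clock edge, whereas two upsets in the same bit before any clock edge corrupt the output. This is why periodic SPI reads are useful.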
SPI slave
This block implements the interface between the external SPI Master and the internal serial registers. The SPI frame received by the slave consists of an 8-bit address and 16 bits of data. This block decodes the received address, enables the register selector and also generates the read/write signal. It can address up to 32 configuration registers. Apart from reading and writing data, it also implements a software reset command, which we use to empty the charge pumps of the DLL, and an SDI-SDO bypass command for troubleshooting purposes. Figure 8 shows the SPI slave state machine. The binary-encoded state number is shown inside each bubble. After reset, the Finite State Machine (FSM) is in E101, the idle state. When the SPI master sets spiEn low, the FSM goes to E010, where the 8-bit address is decoded. After decoding the address the FSM goes to E110, the state where data is transferred to the selected register. Finally, after 8 or 16 clock cycles, the FSM waits for the SPI master to set spiEn high, indicating that the SPI transmission has ended. As can be observed, 3 bits are used to encode the state number although only 4 states are used. We have protected the idle state against SEUs by using E110 and E010, which are at Hamming distance > 1 from it. Although E111 is at Hamming distance 1, the FSM will recover anyway, assuming that spiEn is at high level.
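The Hamming distance argument can be checked with a short Python sketch. The state codes are those quoted in the text; the constant and function names are hypothetical.

```python
IDLE = 0b101     # idle state
ADDRESS = 0b010  # address decoding state
DATA = 0b110     # data transfer state


def hamming(a, b):
    """Number of bit positions in which two state codes differ."""
    return bin(a ^ b).count("1")


def single_upsets(state, width=3):
    """All state codes reachable from 'state' by one bit flip (one SEU)."""
    return [state ^ (1 << i) for i in range(width)]
```

With this encoding, a single upset in the idle state can produce the codes 001, 100 or 111, none of which is the address or data state, so one SEU cannot start a spurious register access from idle.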
Reliability
In the previous sections we have detailed how SEUs and SETs are avoided in this design. Apart from SEUs and SETs, our design can also suffer SELs, which may cause serious (and permanent) damage to the chip [2]. There are several methods to harden designs against latch-up: electrical or spatial isolation, parasitic bipolar gain reduction or bipolar to well resistance reduction [3]. We have adopted the following design rules in our layouts to reduce the chance of SEL:
• Distance between P Diffusion and N-Well must be 5 µm.
• The use of guard rings between PMOS and NMOS transistors.
Figure 10 shows the layout of the 1-ns Delay Element, where the guard rings between PMOS and NMOS transistors can be observed. These rules make it impossible to use standard cells and automated layout generation tools. However, in order to make the layouts scalable and to shorten the layout design time, we implemented a set of basic standard-cell-like radiation hard gates (NOT, NOR, NAND, XOR, FF) that ease the placement procedure.
The price to pay is basically area: each radiation hard cell is around 4.5 µm taller than its standard cell counterpart. What is more, the use of guard rings makes it impossible to use POLY to join PMOS and NMOS gates, which implies the intensive use of M2 for intra-cell connections and thus limits the inter-cell routability of M2. Another obvious drawback is the manpower required to do the manual digital layout synthesis.
Chip measurements
The design described above has been implemented as a standalone chip; a test run has been produced and encapsulated in a QFN-48 package. The result is shown in figure 11. In order to gather reasonable statistics, a total of 25 chips have been encapsulated for testing. A test board has been developed (figure 12) on which the type of clock input, differential or single-ended, can be selected and the coarse signal is adjusted with a potentiometer. The power supply (3.3 V) and the SPI communication can be provided standalone or obtained from the board hosting the analog signal processing electronics, called the Front End board [4]. The LVDS clock outputs are measured by means of a 1 GHz bandwidth differential probe and a 20 Gsps oscilloscope.
Linearity
In this subsection the delay line linearity measurements are shown. Linearity is measured by sweeping the 25 possible phases of each delay line, and performing delta time measurement between a fixed reference and each clock output. In the absence of process variations each stage is expected to have a delay of 1 ns. Differential Non Linearity (DNL) is defined as the difference between the expected delay and the measured delay.
$\mathrm{DNL}_i = \left| \mathrm{Delay}_i - \mathrm{Delay}_{i-1} - 10^{-9} \right|$ (6.1)
Figure 13 shows the DNL distribution of the whole delay element population (7200 samples: 25 chips with 12 delay lines each and 24 stage-to-stage delta time measurements per line) and its Gaussian fit. The measured σ_DNL is 23 ps, compared with 6 ps in the Monte Carlo simulation. Figure 14.a reveals systematic channel-to-channel within-die variations. It can be observed that the delay lines that generate the ADC clock suffer more from variability than the ones that generate the Integrator and Track & Hold clocks. This difference is caused by the lack of symmetry in the ADC clock delay lines (see figure 2). Moreover, figure 14.b also reveals systematic variations at the VCDL stage level, since purely random variations in a single delay element would be expected to average to zero. The error pattern exhibits a periodic behavior that matches the multiplexer layout pattern.
In order to estimate the impact of random variations (those variations inherent to a particular technology [5]) on our chip implementation, systematic effects are decoupled from the statistics. The average DNL shown in figure 14.b is subtracted from the DNL measurements. Moreover, the ADC delay lines are removed from the statistics since they are clearly affected by systematic variability. Doing so, the resulting σ_DNL is 13 ps and, what is more, 99.88% of the DNL samples are within ±40 ps. That is the minimum achievable variability for this technology node.
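The linearity analysis can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the analysis code used for the measurements: the function names are hypothetical, delays are assumed to be given in seconds, and the systematic component is modelled simply as the per-stage average over the delay lines considered.

```python
from statistics import mean

STEP_S = 1e-9  # nominal stage delay: 1 ns


def dnl(delays_s, step_s=STEP_S):
    """Equation (6.1): |Delay_i - Delay_(i-1) - 1 ns| for each stage.

    delays_s holds the measured delay of each of the 25 phases of one
    delay line, so the result has 24 entries.
    """
    return [abs(b - a - step_s) for a, b in zip(delays_s, delays_s[1:])]


def remove_stage_average(dnl_by_line):
    """Subtract the per-stage average, taken over all delay lines,
    from every sample to decouple systematic stage-level effects."""
    stage_avg = [mean(column) for column in zip(*dnl_by_line)]
    return [[d - m for d, m in zip(line, stage_avg)] for line in dnl_by_line]
```

With 25 phases per line, `dnl` returns 24 values, so 25 chips with 12 delay lines each yield the 25 × 12 × 24 = 7200 samples quoted above.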
Control voltages and jitter
As explained in section 3.2, Vcontrol is generated by means of a phase detector. This voltage, together with Vcoarse, adjusts the VCDL. The measured VCDL die-to-die variability was low enough to use the same Vcoarse for all the chip samples. As for the Vcontrol noise, the simulated σ_Vcontrol was 0.53 mV while the measured σ_Vcontrol was 6.64 mV. Despite the order-of-magnitude difference between simulation and measurement, the effect on jitter is not significant since the Vcontrol noise is mainly synchronous. In fact, σ_Jitter is below the oscilloscope resolution of 5 ps.
Slow Control performance
The SPI interface has been tested at different bit rates (from 3 Mbps to 15 Mbps) with successful results. The frame error rate at 15 Mbps is lower than 10^-5. Some limitations in the current test setup prevent a deeper analysis.
Conclusions
On the one hand, this delay line implementation exhibits a very low jitter (less than 5 ps) and DNL (23 ps), which are low enough to provide a clock of good quality to the ICECAL chip. Moreover, this DNL can be reduced down to 10 ps by improving the layout design in the next prototype version. On the other hand, although the SPI interface has been successfully tested at 15 Mbps, some improvements are foreseen to increase its robustness against SEUs. A test beam to qualify the radiation hardness is also planned.
