Abstract-A small-strip thin gap chamber (sTGC) will be used for both triggering and precision tracking purposes in the upgrade of the ATLAS forward muon spectrometer. Both sTGC pad and strip detectors are read out by a trigger data serializer (TDS) application-specific integrated circuit (ASIC) in the trigger path. This ASIC has two operation modes to prepare trigger data from pad and strip detectors, respectively. The pad mode (pad TDS) collects the firing status for up to 104 pads from one detector layer and transmits the data at 4.8 Gbps to the pad trigger extractor every 25 ns. The pad trigger extractor collects pad-TDS data from eight detector layers and defines a region of interest (ROI) along the path of a muon candidate. Data defining the ROIs is returned to the strip TDS. In the strip mode (strip TDS), the deposited charges from up to 128 strips are buffered and time-stamped, and a trigger matching procedure is performed to read out strips underneath the ROI. The strip-TDS output is also transmitted at 4.8 Gbps to the following field-programmable gate array (FPGA) processing circuits. Details of ASIC design and test results are presented in this paper.
I. INTRODUCTION
ATLAS plans several major improvements in conjunction with the upgrades of the large hadron collider (LHC) over the next decade. The goal of these improvements is to retain the good tracking precision and trigger capabilities of the ATLAS detector in the high background expected from the increased instantaneous and integrated luminosity of the LHC. For the ATLAS muon system, the most important upgrade project is the replacement of the forward muon-tracking region with a so-called "New Small Wheel" (NSW) detector in 2019 [1]. The NSW is composed of eight layers of Micromegas (MMs) chambers and eight layers of small-strip thin gap chambers (sTGC), both arranged in two quadruplets, for a total active surface of more than 2500 m². Both detectors will provide trigger and tracking primitives to the muon trigger and readout system, with a high level of redundancy. The sTGC is a gaseous, multiwire proportional drift chamber. It will be used as the primary trigger device for the NSW, with a spatial resolution of 100 μm per sTGC layer and an angular resolution of less than 1 mrad for online segment reconstruction. A block diagram of the sTGC detector is shown in Fig. 1. The basic structure consists of a grid of gold-plated tungsten anode wires sandwiched between two resistive cathode planes. The wire pitch is 1.8 mm and the wire-to-cathode distance is 1.4 mm. One cathode is covered with ∼8 cm × 8 cm pads, while the other cathode is covered by strips with 3.2-mm pitch and lengths from 0.5 to 2 m. Due to the short wire-to-cathode distance, the time distribution of muon hits is mostly less than 25 ns, making it suitable as a trigger device.
The sTGC has a trapezoidal shape, where the precision coordinate, responsible for momentum resolution, is along the symmetry axis of the trapezoid (theta direction). The cathode strips are perpendicular and the wires are parallel to the precision coordinate. The phi direction, perpendicular to the precision coordinate, is measured coarsely by the cathode pads and more precisely by gangs of grouped wires. Only pads and strips are used in the trigger path.
Each sTGC quadruplet consists of four pad-wire-strip planes, as shown in Fig. 1. The pads are used to identify muon tracks approximately pointing back to the interaction point through a three-out-of-four coincidence tower. A band of strips underneath the tower of pads along a muon track is read out, and the location of a charged particle traversing the sTGC is inferred from the centroid of the charges. The sTGC layers in a quadruplet are staggered by half a pad in both directions, effectively reducing the pad area by four and the number of strips underneath the pad by two for readout. This staggered pad layout further improves the muon phi direction measurement by a factor of four.
0018-9499 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
A block diagram of the sTGC trigger electronics for a sector in one layer is shown in Fig. 2. Both sTGC pad and strip detector signals are processed by a 64-channel amplifier shaper discriminator (ASD) application-specific integrated circuit (ASIC) [2]. The digitized ASD outputs are sent to a trigger data serializer (TDS) ASIC. The TDS ASIC has two operation modes to handle pad and strip detectors, denoted here as pad TDS and strip TDS, respectively. The pad TDS checks the presence of pad signals, then prepares and sends the pad trigger data, together with the LHC bunch crossing identification (BCID) number, to the pad trigger extractor board on the rim of the NSW detector at a rate of 4.8 Gbps. The pad trigger extractor board (pad trigger extractor) collects pad-TDS data from eight sTGC layers and determines a region of interest (ROI) for the candidate muon track in the form of a BCID and position coordinate (band-phi ID). The ROI is then encoded and sent to the strip TDS at a rate of 640 Mbps. The strip TDS decodes the deposited charge on the strips and stores the charge information together with a time tag in buffers. Once it receives the ROI from the pad trigger extractor board, it translates the ROI to a band of strips using a preassigned lookup table. Pad-strip matching is performed and charges for matched strips are packed and serialized to the signal router board (router) on the rim of the NSW detector at 4.8 Gbps [3]. The router board collects strip trigger data from up to 12 strip-TDS ASICs, removes NULL packets, and transmits the data via optical fibers to the segment-finding circuits in the ATLAS underground counting room (USA15).
The TDS is the hub of the sTGC trigger system. It is a chip with two operating modes: the pad mode and strip mode. Each TDS handles 128 channels in the strip mode and 104 channels in the pad mode. Both modes need to receive input signals from the same ASD ASIC and serialize the output at 4.8 Gbps. In addition to the high channel density, the ASIC is also required to have a power consumption less than 1 W and a global latency less than 100 ns for the strip mode and less than 50 ns for the pad mode to meet the overall ∼1-μs latency in the sTGC trigger path [1] . This paper is organized as follows. We describe the shared design features of the two modes of TDS in Section II. Next, we present the signal processing inside pad TDS and strip TDS in Sections III and IV, respectively. The prototype and test results are presented in Section V, followed by the conclusion in Section VI.
II. SHARED FEATURES OF THE MODES IN TDS
The actual mode of TDS operation is selected by an external hardware pin, and once a mode is active, its counterpart is held in reset to save power. A block diagram of the TDS is shown in Fig. 3, in which the architecture is divided into three parts, following the signal flow inside the chip: ASD interface, preprocessor, and serialization. The two modes follow similar signal flow in processing detector signals, with unique features specific to the pad and strip detectors. The two modes utilize a common I2C configuration interface as well as the LHC bunch crossing reset. They share common input buffers (i.e., the 104 ASD channels in the pad mode correspond to the first 104 channels of the 128 input channels in the strip mode). The 4.8-Gbps serialization interface (GBT SER DM) [4] is also shared between modes. The I2C interface is an industry standard. The ASD outputs and the serialization interface are described in the following.
A. ASD Outputs to the TDS
The ASD [2] is a complex ASIC with different modes to process detector inputs from both sTGC and MM in the NSW. It provides the peak amplitude and time with respect to the LHC bunch crossing (BC) clock (∼2-ns precision). In the sTGC trigger path, the signals from sTGC pads and strips are first amplified and shaped. In the strip mode, the peak amplitude is found and digitized with a 6-b ADC. The digitized value is serialized to the strip TDS at 320 Mbps. In the pad mode, the shaped signal is discriminated and a time-over-threshold (TOT) pulse is delivered to the pad TDS. An illustration of the ASD output patterns to the TDS is shown in Fig. 4, in which OUT corresponds to the inputs to the TDS, and CK is the 160-MHz clock sent to the ASD from the ePLL [5] in the TDS. In the pad mode, the time information is coincident with the threshold crossing (i.e., the leading edge of the TOT pulse), while in strip mode, the time information is associated with the peaking time ("peak-found" point as shown in Fig. 4).
B. Serialization
As shown in Fig. 3, the pad and strip modes share the same serialization interface (GBT SER DM) [4]. The serializer is a modified version of the original CERN design [6]. It runs at 4.8 Gbps and loads 30-b words at 160 MHz. For proper operation of the following circuits in the trigger path, the trigger data is required to be shifted out in one LHC BC cycle (25 ns), i.e., the length of a complete packet in either pad or strip mode is 120 b. In the pad mode, a complete packet starts with a 4-b header "1010," followed by a 116-b payload, whereas in the strip mode, the 120 b is broken into four consecutive 30-b packets, each with a 4-b header ("1010" for detector data and "1100" for NULL) and a 26-b payload. The payload in both modes is scrambled according to the IEEE Standard 802.3-2012 for 10-Gbps physical layer implementation (with the polynomial $1 + x^{39} + x^{58}$) [7], to keep the serial stream dc balanced. Packet headers are used for stream recovery in backend circuits. They are already dc balanced and are therefore not scrambled. In addition, a pseudorandom binary sequence generator (PRBS-31, with the polynomial $x^{31} + x^{28} + 1$) is included for testing the 4.8-Gbps serializer interface.
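The packet format and scrambling described above can be illustrated with a minimal Python sketch. It assumes the self-synchronous form of the $1 + x^{39} + x^{58}$ scrambler, first-in-first-out bit order, and a caller-supplied register seed; none of these details (nor the function names) are specified in the text, so this is an illustration rather than the TDS implementation.

```python
# Sketch of a self-synchronous scrambler with polynomial 1 + x^39 + x^58.
# Assumptions (not from the paper): bit order, register seed, function names.

MASK58 = (1 << 58) - 1
HEADER_DATA = [1, 0, 1, 0]  # "1010" header, transmitted unscrambled


def scramble(bits, state=0):
    """Scramble a list of 0/1 bits; state holds the last 58 scrambled bits."""
    out = []
    for b in bits:
        # Taps at delays 39 and 58 (bit 0 of state is the most recent output).
        s = b ^ ((state >> 38) & 1) ^ ((state >> 57) & 1)
        out.append(s)
        state = ((state << 1) | s) & MASK58
    return out, state


def descramble(bits, state=0):
    """Inverse of scramble(); state holds the last 58 received bits."""
    out = []
    for s in bits:
        out.append(s ^ ((state >> 38) & 1) ^ ((state >> 57) & 1))
        state = ((state << 1) | s) & MASK58
    return out, state


def build_pad_packet(payload_116, state=0):
    """4-b header plus scrambled 116-b payload: one 120-b packet per BC."""
    if len(payload_116) != 116:
        raise ValueError("pad payload must be 116 bits")
    scrambled, state = scramble(payload_116, state)
    return HEADER_DATA + scrambled, state
```

A receiver that starts from the same register contents recovers the payload by calling `descramble` on bits 4 to 119 of the packet.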
The phase-locked loop (PLL) of the serializer provides 40-and 160-MHz clocks for the operation of both pad and strip modes. The 40 MHz is mainly used to develop the BC count while the 160 MHz is the global clock for the internal signal processing.
III. SIGNAL PROCESSING IN THE PAD TDS
The ASD interface of the pad TDS captures the pad firing status from up to 104 sTGC pads and adds a BCID time tag for each crossing. The BCID time tag is assigned upon the arrival of the leading edge of the TOT pulse (pulse detection). The total detector area processed by one pad TDS is about 2 m². The corresponding routing length from pad detectors to a pad TDS varies by up to 3 m. This routing length difference would result in different BCID time tags for pad signals with a common arrival time, making per-channel time compensation necessary. The time compensation precision is required to be 3.125 ns for the BCID interval of 25 ns.
Conventionally, additional delay is inserted into early arrival channels, as shown in Fig. 5(a) . However, the usage of either pure delay cells or feedback controlled delay circuits (e.g., with delay-locked loops) is excessive from the point of view of the logic resource usage, system complexity and power consumption for 104 channels. Alternatively, we compensate the delay by adjusting the phase of timing clocks (BC clock), as shown in Fig. 5(b) . An individual timing clock for each channel is utilized and the phase shift is achieved by BC clock regeneration through shift registers running at dual edges of a 160-MHz clock [8] . This scheme costs only a few flip flops and is fully synthesizable. Its stability is inherent to that of the 160-MHz clock, derived from the LHC clock.
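The effect of shifting a channel's BC clock phase can be sketched with a simple timing model. The sketch assumes eight phase steps of 3.125 ns per 25-ns crossing and an ideal floor-division tagger; the function name and the example arrival times are hypothetical.

```python
# Timing model of per-channel BC clock phase compensation.
# Assumptions (not from the paper): ideal tagging, 8 selectable phase steps.

STEP_NS = 3.125      # half a period of the 160-MHz clock (dual-edge shifting)
BC_NS = 25.0         # bunch-crossing period
STEPS_PER_BC = 8     # 25 ns / 3.125 ns


def bcid_tag(arrival_ns, phase_steps):
    """BCID seen by a channel whose BC clock is delayed by phase_steps steps."""
    if not 0 <= phase_steps < STEPS_PER_BC:
        raise ValueError("phase_steps must be in 0..7")
    return int((arrival_ns - phase_steps * STEP_NS) // BC_NS)


# Two pads hit at the same time; channel B has 5 ns more routing delay.
tag_a = bcid_tag(597.0, phase_steps=0)            # BCID 23
tag_b_raw = bcid_tag(602.0, phase_steps=0)        # slips into BCID 24
tag_b_fixed = bcid_tag(602.0, phase_steps=2)      # 6.25-ns shift restores 23
```

Delaying the clock of the late channel, instead of delaying the data of the early one, moves the tagging boundary without adding any delay element to the signal path.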
The preprocessor circuit buffers the time-tagged pad signals in a two-depth ring buffer for each channel (ring buffer). It compares the current BCID to those of the buffered signals to flag the YES/NO firing status of every channel at each BCID. A YES is marked if any hits with the same BCID are found, otherwise, a NO is flagged. BCID comparison and firing status assignment is done at the end of a BC and the referred BCID is given by a global BCID counter, as shown in the pad mode of Fig. 3 . The firing statuses of all 104 channels are collected at each BC and the result is passed to the "frame builder" circuit, which combines the 12-b BCID with 104 pad firing status bits to form a 116-b payload to the serialization interface.
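A minimal sketch of the firing-status decision and payload packing follows. The bit layout (BCID in the upper 12 bits, channel 0 in the least significant bit) and the function names are assumptions made for illustration; the text only fixes the field widths.

```python
# Sketch of the pad-mode firing status and 116-b payload.
# Assumptions (not from the paper): bit layout and function names.

N_PADS = 104
BCID_BITS = 12


def firing_status(current_bcid, ring_buffer_tags):
    """YES (True) if any hit in the two-deep ring buffer has the current BCID."""
    return any(tag == current_bcid for tag in ring_buffer_tags)


def pad_payload(bcid, fired_channels):
    """Pack a 12-b BCID and 104 firing bits into a 116-b integer."""
    if not 0 <= bcid < (1 << BCID_BITS):
        raise ValueError("BCID must fit in 12 bits")
    status = 0
    for ch in fired_channels:
        if not 0 <= ch < N_PADS:
            raise ValueError("pad channel out of range")
        status |= 1 << ch
    return (bcid << N_PADS) | status
```

For example, `pad_payload(0x017, [0, 103])` sets the lowest and highest status bits and places 0x017 in the top 12 bits of the 116-b word.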
IV. SIGNAL PROCESSING IN THE STRIP TDS

A. ASD Interface of the Strip TDS
The serial charges delivered to the strip TDS are decoded in each channel (deserializer) and BCID time tags are also attached. The BCID tag is determined by the leading edge of OUT, which corresponds to the peaking time, as shown in Fig. 4. In addition to the charge and its BCID tag, a "FLAG" bit is included, the value of which depends on the trigger matching time window used. A trigger matching time window is introduced to maximize the charge matching efficiency, given that the muon hit time distribution can be wider than one BC clock cycle. In the strip TDS, this window is programmable from 25 to 50 ns with a step size of 6.25 ns. For a given window size (e.g., ∼31 ns), as shown in Fig. 6(a), any signal arriving at the beginning of BCID k + 1 (the dashed area) might also belong to BCID k if the charge signal has a longer drift time. As a result, for any signal arriving in the dashed area of BCID k + 1, its "FLAG" is set to a logic TRUE; otherwise, it is set to FALSE. Finally, the 6-b charge, BCID, and the "FLAG" bit are packed to be sent forward. In addition, an embedded 6-b ASD charge pattern generator for the first 14 channels is implemented (serial pattern #0-13) for test purposes.
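The FLAG assignment can be sketched as follows. The sketch assumes that the dashed area starts at the BC boundary and is as wide as the part of the window that exceeds 25 ns (6.25 ns for a 31.25-ns window); this reading of Fig. 6(a) and the function name are assumptions.

```python
# Sketch of BCID tagging with the trigger-matching FLAG bit.
# Assumption (not from the paper): FLAG region width = window - 25 ns.

BC_NS = 25.0
WINDOW_STEP_NS = 6.25


def tag_hit(peak_time_ns, window_ns):
    """Return (bcid, flag) for a strip hit under a programmable window."""
    steps = (window_ns - BC_NS) / WINDOW_STEP_NS
    if steps not in (0, 1, 2, 3, 4):
        raise ValueError("window must be 25..50 ns in 6.25-ns steps")
    bcid = int(peak_time_ns // BC_NS)
    offset = peak_time_ns - bcid * BC_NS
    # TRUE if the hit is early enough in this BC to belong to the previous one.
    flag = offset < (window_ns - BC_NS)
    return bcid, flag
```

With a 25-ns window the FLAG region has zero width, so the bit is never set and matching reduces to an exact BCID comparison.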
B. Preprocessor of the Strip TDS
The strip charges obtained from the deserializer are buffered in four shift registers (BUF0-3) for each channel, as shown in Fig. 6(b) . The ring buffer is data driven, and a writing operation is performed by shifting the contents forward to free the first slot (BUF0) for the new event, while discarding the earliest event in BUF3. In this way, the sequence of the charges in the ring buffer preserves their timing sequence, with BUF0 being the latest. A programmable timer monitors the buffer status, inserting dummy events to avoid any slot being occupied forever. Every time there is an ROI sent from the pad trigger extractor board, shown in Fig. 3 , a number of strip channels are enabled for trigger processing ["trigger enable" in Fig. 6(b) ] and the shift registers within the channels are sampled for further processing by the "BCID comparison" unit.
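The data-driven shift behavior of the per-channel buffer can be modeled with a bounded deque. The class and method names are hypothetical, and the timer-driven dummy events are not modeled.

```python
# Behavioral model of the four-deep per-channel buffer (BUF0-3).
# Assumption (not from the paper): class and method names.
from collections import deque


class ChannelBuffer:
    DEPTH = 4

    def __init__(self):
        # Index 0 is BUF0 (latest event); index 3 is BUF3 (earliest event).
        self.slots = deque(maxlen=self.DEPTH)

    def write(self, event):
        # appendleft on a full bounded deque discards the right-most entry,
        # which corresponds to dropping the earliest event in BUF3.
        self.slots.appendleft(event)

    def sample(self):
        """Snapshot of BUF0..BUF3 taken when the channel is trigger-enabled."""
        return list(self.slots)
```

Writing five events 0 to 4 into an empty buffer leaves `[4, 3, 2, 1]`, with the latest event in BUF0.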
The data unit sent to the strip TDS is composed of the trigger BCID and the band-phi ID. The band-phi ID corresponds to the coordinates of the extracted muon track, which is then translated into a range of strip channels via the preassigned lookup tables (pad LUT). For all channels associated with the range, a BCID comparison is performed. In the example of Fig. 6(a), a buffered signal with BCID k + 1 and a TRUE FLAG bit yields a valid match for an ROI BCID of either k or k + 1, where the former may correspond to a late-arriving signal belonging to BCID k. Once the BCID timing is matched for a given channel, a valid matching signal is generated and the corresponding charge is picked up for further processing. If there is more than one match in a strip channel, the most recent charge event is used.
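The matching rule can be sketched in a few lines. Events are represented as hypothetical `(charge, bcid, flag)` tuples ordered from BUF0 to BUF3, and BCID wrap-around is ignored for simplicity.

```python
# Sketch of the BCID comparison for one strip channel.
# Assumptions (not from the paper): tuple layout, no BCID wrap-around.


def matches(hit_bcid, hit_flag, roi_bcid):
    """A hit matches its own BCID, or the previous one if FLAG is TRUE."""
    if hit_bcid == roi_bcid:
        return True
    return hit_flag and hit_bcid - 1 == roi_bcid


def select_charge(buffered, roi_bcid):
    """Return the charge of the most recent matching event, or None.

    buffered: list of (charge, bcid, flag), most recent first (BUF0..BUF3).
    """
    for charge, bcid, flag in buffered:
        if matches(bcid, flag, roi_bcid):
            return charge
    return None
```

Because the list is ordered with the latest event first, returning on the first match implements the most-recent-event rule.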
A maximum of 17 consecutive strips will be included in a pad trigger match operation. This is because the effective area of an sTGC pad is one fourth of the original size after staggering of pads through multiple sTGC layers, i.e., ∼4 cm × 4 cm. The number of strips underneath is roughly 4 cm/3.2 mm ≈ 13. In addition, two neighboring strips at each edge [Fig. 7 (top) and (bottom)] are also considered since a muon could hit the pad boundaries, giving 13 + 2 × 2 = 17 strips. An efficient way to multiplex 17 out of 128 strips for any trigger band of strips is needed. To achieve this, the 128 channels are divided into 17 groups of up to eight channels each, using 17 8-to-1 selectors. For example, the first 8-to-1 selector connects channels 0/17/34/51/68/85/102/119, and the second 8-to-1 selector connects channels 1/18/35/52/69/86/103/120, as shown in Fig. 3. This arrangement guarantees that no two of the 17 strips in a band are routed to the same multiplexer. However, the multiplexer outputs in this configuration might not preserve the original sequence in the trigger band, e.g., the multiplexed outputs from SEL #0-16 for a band with strips #18-34 are 34, 18, 19, …, 33, in which strip #34 is moved to the front. The strip sequence problem is resolved by the "strip-sequencer" circuit, as the leading strip is known for each pad trigger.
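The selector assignment and the reordering step can be checked with a short sketch. It assumes that channel c is wired to selector c mod 17, which reproduces the two selector examples given in the text, and it models the strip sequencer as a rotation; function names are hypothetical.

```python
# Sketch of the 17-selector multiplexing and the strip sequencer.
# Assumptions (not from the paper): channel c -> selector c % 17,
# sequencer modeled as a list rotation.

N_SEL = 17
N_CH = 128


def selector_inputs(sel):
    """Channels wired to selector `sel` (eight or seven inputs)."""
    return list(range(sel, N_CH, N_SEL))


def mux_band(first_strip):
    """Outputs of SEL #0-16 for the band first_strip .. first_strip + 16."""
    if not 0 <= first_strip <= N_CH - N_SEL:
        raise ValueError("band does not fit in 128 channels")
    out = [None] * N_SEL
    for strip in range(first_strip, first_strip + N_SEL):
        out[strip % N_SEL] = strip
    return out


def sequence(mux_out, first_strip):
    """Rotate the selector outputs so that the leading strip comes first."""
    k = first_strip % N_SEL
    return mux_out[k:] + mux_out[:k]
```

Since 17 consecutive integers cover every residue modulo 17 exactly once, each selector receives exactly one strip of any band. Under this wiring, selectors 0 to 8 have eight inputs and selectors 9 to 16 have seven, for a total of 128 channels.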
The strip charges from the ASD interface pass through the preprocessor only upon receipt of a pad trigger ROI. This design makes it difficult to probe individual functional blocks along the signal chain for diagnosis. As part of the design methodology for testability, the ROI requirement can be bypassed, in which case the data from the ASD interface can either be transmitted without any trigger processing or be processed by fake triggers from an embedded test trigger generator. The latter only works with the embedded "serial pattern" generators, as discussed in Section IV-A.
C. Serialization of the Strip TDS
The band of strips after trigger matching is passed to the serialization circuit. The "frame builder" collects charges from 17 strips as well as time and ROI information to build a 120-b packet. The 120-b packet is further broken into four consecutive 30-b frames (with header "1010"). The 30-b NULL frames (with header "1100") are inserted whenever there are no requests from the pad trigger extractor. The use of a short frame length and different headers for data and NULL allows quick frame switching in the sTGC router, where limited time is available to switch data frames from up to 12 strip-TDS links to four optical fiber outputs on the router. The NULL frames are suppressed during the switching operation.
The output data width of the TDS in one LHC BC cycle is 120 b, which is not adequate to cover the information from all 17 strips. The size of a muon cluster is around four to five strips, making it possible to reduce the number of output strips from 17 to 14 by selecting either the first or the last 14 strips of the band. The selection is based upon the firing status of the first strip, as shown in Fig. 7. An output bit is included to indicate which 14 strips are selected.
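A possible form of the first/last selection is sketched below. The exact rule is defined in Fig. 7 and is not reproduced in the text, so the sketch assumes that a nonzero charge on the first strip selects the first 14 strips and that the last 14 are selected otherwise; the polarity of the selection bit and the function name are also assumptions. For reference, 14 six-bit charges occupy 84 of the 4 × 26 = 104 payload bits.

```python
# Hypothetical sketch of the 17-to-14 strip reduction.
# Assumptions (not from the paper): "fired" means nonzero charge,
# select bit 0 = first 14 strips, select bit 1 = last 14 strips.

BAND = 17
KEPT = 14


def reduce_band(charges):
    """Return (select_bit, kept_charges) for a 17-strip band."""
    if len(charges) != BAND:
        raise ValueError("a band must contain 17 strips")
    if charges[0] != 0:
        # Cluster touches the low edge of the band: keep strips 0..13.
        return 0, charges[:KEPT]
    # Otherwise keep strips 3..16.
    return 1, charges[BAND - KEPT:]
```

With a cluster of four to five strips, either choice retains the full cluster as long as the cluster lies on the side of the band that is kept.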
V. TDS PERFORMANCE EVALUATION
A. TDS Prototypes and Test Setup
The TDS design started with a prototype of the serializer-only chip in 2013 [4], which is a crucial part of the TDS. The first prototype of the TDS (TDSV1) followed and proved successful in early 2014. A second prototype (TDSV2) was submitted in May 2015 to accommodate some specification changes. The die of TDSV2 is shown in Fig. 8(a), in which the "TDS logic" includes all logic processing for the pad and strip modes. The area of the "TDS logic" is about 9.86 mm² in TDSV2. The "analog" part includes the serializer core, ePLL, and the pad-trigger interface. The size of the die is 5.2 mm × 5.2 mm. The chip is fabricated in a 130-nm CMOS technology and is packaged in a 400-pin ball grid array (BGA) package, as shown in Fig. 8(b).
A mezzanine card to the Xilinx VC707 evaluation board is designed for the chip performance evaluation. The Virtex-7 field-programmable gate array (FPGA) on the VC707 board provides all test inputs, configuration controls and receives the 4.8-Gbps output from the TDS for data checking. The connections are done via the pair of high-pin-count connectors on the VC707 board. A photograph of the test setup is shown in Fig. 8(c) .
B. Performance of the Serializer
The performance of the serializer core (GBT SER DM) has been evaluated in the serializer-only prototype [4] . However, the noisy mixed-signal environment of TDS may introduce extra noise and degrade the performance of the serializer core. To minimize induced noise from the digital circuits, the silicon substrate of the serializer core is isolated from the "TDS logic" and separate power and ground grids are used for the "analog" part in Fig. 8(a) .
All early tests were repeated and yielded the eye diagram of Fig. 9 with embedded PRBS-31 pattern. Further jitter analysis shows a total jitter of about 39 ps, which is slightly better than the 49.7 ps reported in the early tests [4] . The setup in Fig. 8(c) is similar to that in [4] and the result indicates that the performance of the serializer core has not been degraded after integration into the TDS ASIC. The eye diagram was taken with a Tektronix 12.5-GHz bandwidth 50 GS/s oscilloscope (DSA71254B).
C. Performance of the Strip TDS
To characterize the performance of the strip TDS, we simulate the ASD outputs and the pad trigger generator for the strip TDS inside the Xilinx Virtex-7 FPGA on the VC707. In the test, the charge of each of the 128 channels is assigned as the least significant six bits of its channel number, e.g., channel 15 has a charge of 15 while channel 67 has a charge of 3. These inputs to the TDS are introduced at a fixed BCID (e.g., BCID = 36) to simplify the analysis. The pad trigger signals are generated with respect to the TDS inputs, and we evaluate consecutive triggers in the test [e.g., four consecutive triggers each separated by one BC cycle (25 ns) are given around BCID 36]. For each trigger, a corresponding range of strips is evaluated using a specific configuration of the lookup tables, and the expected patterns have been observed. While these arbitrary patterns of strips and charges bear no relationship to real tracks passing through the detector, a check of the data output has confirmed that the logic functions as desired.
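The test pattern is easy to reproduce when checking decoded output. The helper names below are hypothetical; the charge rule is the one stated above.

```python
# Expected test-pattern charges: low six bits of the channel number.
# Assumption (not from the paper): helper names.


def pattern_charge(channel):
    """6-b charge assigned to a channel in the strip-TDS test."""
    if not 0 <= channel < 128:
        raise ValueError("channel out of range")
    return channel & 0x3F


def expected_band(first_strip, n_strips=17):
    """Charges expected for a band of consecutive strips."""
    return [pattern_charge(c) for c in range(first_strip, first_strip + n_strips)]
```

Because the pattern wraps at channel 64, a band that straddles channels 63 and 64 shows a jump from 63 to 0 in the decoded charges.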
In addition, embedded functions are available for diagnosis. We also make use of these functions as another way to validate the TDS functionality. A summary of three major test functions is listed in Table I . The "TDS-router training frame" corresponds to the "test frame gen" in Fig. 3 . It generates fake strip-TDS output frames and can be used for establishing the link between TDS and router. This test only covers the serialization stage of the strip TDS. The "global test" utilizes an internal serial pattern generator to create the serial patterns in Fig. 4 . It goes through the whole signal flow except the "pad-trigger interface." In this mode, internal triggers are used. The "bypass trigger" works with external ASD inputs which bypass the channel ring buffers and pad trigger units. When used with the individual channel enables, the "bypass trigger" can probe single channel inputs without any pad triggers. The three test functions are all activated through the I2C interface. In total, they can test all functions of the strip TDS except the pad-trigger interface, allowing functional tests 
D. Performance of the Pad TDS
The performance of the pad TDS is evaluated by simulating the pad inputs in VC707 and decoding the serial outputs from TDS to check for correspondence. Hits are generated per Fig. 4 at the same time and additional delay is introduced for each hit to test the individual channel delay compensation. The delays introduce up to 3-ns variances between channels, spreading the hits in some channels to one BCID (e.g., 0x017) while the others to the next BCID (e.g., 0x018). Delay compensation is then added for all channels with the later BCID (0x018) to cancel the extra delay in its path. As a result, all inputs converge into one BCID (0x017), as reflected in the decoded stream of the pad TDS in the FPGA, confirming that the logic functions as desired. Detailed characterization of the delay compensation is reported in [8] .
E. Power Consumption
The TDS uses a single 1.5-V supply. The power consumption is evaluated by putting a 0.1-Ω (1%) resistor in series with the power supply and monitoring the voltage drop at an ambient temperature of about 25°C. The average voltage drop is 57.8 mV for the strip mode and 58.5 mV for the pad mode, which correspond to currents of 578 and 585 mA, respectively. This gives a total power consumption of around 0.9 W in either mode. In the evaluation, the TDS is fully working with all possible functions enabled. The values measured represent the typical total power consumption and meet the <1-W design requirement.
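The arithmetic behind these figures can be reproduced directly. The sketch assumes a sense resistance of 0.1 Ω, which is the value implied by the quoted voltage drops and currents, and takes the power as the nominal 1.5-V supply voltage times the measured current (the small drop across the resistor is not subtracted).

```python
# Power estimate from the sense-resistor voltage drop.
# Assumptions: 0.1-ohm sense resistor, power = nominal supply x current.

SUPPLY_V = 1.5
SENSE_OHM = 0.1


def supply_current_a(drop_mv):
    """Supply current in amperes from the measured drop in millivolts."""
    return drop_mv * 1e-3 / SENSE_OHM


def power_w(drop_mv):
    """Power drawn from the 1.5-V supply in watts."""
    return SUPPLY_V * supply_current_a(drop_mv)


strip_power = power_w(57.8)   # about 0.867 W
pad_power = power_w(58.5)     # about 0.878 W
```

Both values round to the quoted 0.9 W and stay below the 1-W requirement.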
F. Latency
Both pad and strip modes of the TDS work with a global clock of 160 MHz (clk160), as shown in Fig. 3 . Different stages along the signal path from the input to the output are put in a pipeline of the clk160, and the total latency is allocated into each stage.
The latency of the strip TDS starts from the time of any pad trigger request at the pad-trigger interface to the time that the first bit of a data frame exits the serializer core. The latency breakdown along the signal path is listed in Table II. The total latency is found to be about 75 ns, satisfying the requirement of 100 ns.
The latency of the pad TDS starts from the end of a BC to the first bit of a pad frame exiting the serializer core. The total latency is itemized in Table III and the overall latency is found to be around 31 ns, less than the 50-ns requirement.
TABLE III: ALLOCATION OF THE TOTAL LATENCY OF THE PAD TDS
G. Radiation Tolerance
The total ionizing dose for the NSW sTGC trigger front-end boards is about 300 krad with a safety factor of 6 [9] . This radiation exposure poses no problem for the 130-nm CMOS process that was used [10] . For protection against single event upsets (SEU), triple modular redundancy (TMR) is applied to both the analog part (the serializer core and ePLL) and the TDS logic. In the TDS logic, full TMR is implemented with three voters and triple clock trees except for the trigger data. The goal is to sustain the TDS in normal working state while allowing bit upsets in the trigger data to be overcome by the redundancy of data from multiple detector layers.
An SEU test was performed with an 800-MeV neutron beam at the U.S. Los Alamos National Laboratory. The neutron beam has an average flux of about 1.26 × 10¹¹ n/cm²/day, with an energy spectrum similar to that in the ATLAS detector. Over a three-day test, a fluence of 3.79 × 10¹¹ n/cm² was accumulated, corresponding to half a year of operation at the inner rim of the NSW [9]. During the tests, four payload data errors were observed in the pad mode and one header error was observed in the strip mode. In both modes, the link connections were not affected and no functional errors occurred.
VI. CONCLUSION
The TDS ASIC is a crucial component in the sTGC trigger system for the ATLAS NSW upgrade. It is a chip with two operation modes, preparing trigger data for both pad and strip detectors, performing pad-strip matching, and serializing the pad and strip data for transmission to the rim of the NSW detector at 4.8 Gbps. The chip is a mixed-signal ASIC with 128 channels in the strip mode and 104 channels in the pad mode. The detailed design of the TDS has been presented, and the chip is manufactured in a GlobalFoundries 130-nm CMOS technology with a 400-pin BGA package. Extensive tests have been performed and the results show that the TDS meets its design specification, including the latency (<100 ns in the strip mode and <50 ns in the pad mode) and the power consumption (<1 W).
