The MPA is the pixel readout ASIC for the hybrid Pixel-Strip module of the Phase-II CMS Outer Tracker upgrade at the High Luminosity LHC (HL-LHC). It employs a novel technique for identifying high transverse momentum particles and provides this information at a 40 MHz rate to the L1-trigger system. The chip also comprises a binary pipeline buffer for the L1-trigger latency and a data path that supports the readout of full events with a maximum trigger rate of 1 MHz and a latency of 12.8 µs. This contribution presents the design and implementation, in a 65 nm CMOS technology, of the first prototype ASIC that integrates all functionalities for system-level operation with a power density lower than 90 mW/cm².
Topical Workshop on Electronics for Particle Physics, 11-14 September 2017, Santa Cruz, California
Introduction
The CMS Outer Tracker upgrade for the High Luminosity LHC adopts double-layer sensors to enable fast, on-detector identification of high-pT tracks (> 2 GeV/c) and their transmission to the L1 trigger system at the 40 MHz Bunch Crossing (BX) rate. For the first time, data coming from the Tracker will be used in the L1 trigger decision of a high luminosity hadron experiment. In parallel, a readout channel will transmit triggered events to the Data AcQuisition (DAQ) system at a nominal average trigger rate of 750 kHz. A complete description of the CMS Outer Tracker upgrade can be found in the CMS Technical Proposal [1].
The Macro Pixel ASIC (MPA) is the readout chip for the Pixel-Strip module; it extracts hits (binary signals) from the pixelated sensor. The chip comprises a fast front-end with leakage current compensation and amplitude discriminator circuits with binary readout for a 120 × 16 pixel sensor array with dimensions of 12.0 mm × 23.16 mm. The readout of 1920 channels per chip at a 40 MHz BX rate yields a data throughput of ∼80 Gbps per chip. This data stream is combined in real time with the input data (2.56 Gbps) coming from the strip sensor layer, which is read out by another ASIC, the Short Strip ASIC (SSA). The on-chip data processing combines zero-suppression techniques with the capability of recognizing particles with high transverse momentum. The system achieves an almost lossless data transmission to the back-end with an output bandwidth of 1.6 Gbps per chip. In parallel, every event is stored for a maximum latency of 12.8 µs and can be acquired with a trigger signal. The chip supports a maximum trigger rate of 1 MHz and provides the encoded position and dimension of pixel and strip clusters. A complete description of the analog front-end, as well as the details of the data processing logic, can be found in [2]. The results obtained with the first prototype, mainly concerning the front-end performance, are summarized in [3]. This paper focuses on the implementation of the first full-size 65 nm MPA, including all functionalities required by the CMS Tracker.
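The data rates quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes one bit per channel per bunch crossing (binary readout), and the variable names and the derived reduction factor are not taken from the paper.

```python
# Figures quoted in the text.
PIXEL_COLUMNS = 120
PIXEL_ROWS = 16
BX_RATE_HZ = 40e6          # bunch-crossing rate
STRIP_INPUT_GBPS = 2.56    # input data from the strip layer (SSA)
OUTPUT_GBPS = 1.6          # output bandwidth per chip

# Assumption: binary readout, i.e. one bit per channel per bunch crossing.
channels = PIXEL_COLUMNS * PIXEL_ROWS
raw_pixel_gbps = channels * BX_RATE_HZ / 1e9

# Derived quantity (not stated in the paper): ratio of input to output rate.
reduction_factor = (raw_pixel_gbps + STRIP_INPUT_GBPS) / OUTPUT_GBPS

print(f"{channels} channels, {raw_pixel_gbps:.1f} Gbps raw pixel data")
print(f"on-chip reduction factor of about {reduction_factor:.0f}")
```

Under these assumptions the raw pixel rate is 76.8 Gbps, consistent with the ∼80 Gbps quoted, and the on-chip processing has to reduce the combined input by roughly a factor of 50.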
Design Implementation
The strict requirement of a total power density lower than 90 mW/cm² has driven the choice of a 65 nm CMOS technology featuring a 7-metal stack with an aluminum ReDistribution Layer (RDL). This technology provides 5 thin metals which are used for signal routing, while the remaining 3 thicker metals are used for power routing. The voltage drop across the die is of particular importance because power can be supplied only from the bottom of the chip, since the other edges are covered by the sensor.
The very low power requirement also makes it necessary to adopt several low power design techniques. The MPA exploits a Multi-Supply Voltage (MSV) technique to strongly reduce the digital power consumption without degrading the analog front-end and data transmission performance: the digital core is powered at 1 V, while the Analog Front-End and the custom sLVS [4] drivers and receivers are powered at 1.25 V. The sLVS differential interface uses a programmable current to optimize the power consumption while maintaining good signal integrity. The digital core is implemented using standard cell libraries with devices of different threshold voltages (Multi-VT design) to locally improve performance or reduce power consumption. In particular, the design uses low-VT devices for the serializer and deserializer logic running at 320 MHz, while the rest of the design uses normal-VT devices. Three of the main challenges presented by the MPA design are the optimization of the clock distribution, the reduction of the memory power consumption and the radiation hardening. The solutions adopted in the MPA are presented in detail in the next paragraphs.
Clock Distribution
The 40 MHz sampling clock tree is based on a clock trunk for the distribution to the pixel rows, shown in Figure 1B, which achieves the very low skew (<< 1 ns) [5] required for the digitization of the analog output. The clock trunk is routed manually in a thick metal in order to minimize its resistance. This reduces the skew among the pixel rows, but the larger line capacitance must be driven by a multi-cell buffer placed in the periphery. The clock distribution inside each row is performed by automatic clock tree synthesis to achieve the same delay for all the pixels. This clock tree starts from a manually placed row input buffer.
Minimizing the skew has an important drawback: the entire pixel matrix switches within a very short period (<< 1 ns), causing a high current spike and consequently a large IR-drop. In order to avoid this effect, the sampling clock is used only for the first sampling flip-flop after the analog discriminator, while the remaining logic uses a second clock distribution. The latter is based on a re-buffering stage per pixel row, which increases the skew (> 1 ns) and reduces the IR-drop at the clock transition to less than 20 mV (see Figure 1A).
Memory Gating
The MPA is required to store the hits of the entire pixel matrix for a maximum latency of 12.8 µs with a BX period of 25 ns, which corresponds to roughly 1 Mb of data. These data are read out when an external trigger is received, and a lossless communication is required. The maximum trigger rate is 1 MHz, which makes it more convenient to store the data without compression. Indeed, any data compression step placed before the memory would need to operate at 40 MHz. Given the occupancy and the lossless requirement, the power cost of every compression technique investigated exceeded the power saved by the reduced memory size. Our approach was to store uncompressed data locally in every pixel row, thus limiting the data transport across the large die to the read operations of hits. A custom-made radiation-hardened SRAM block [6] of 128 bits × 512 words is employed per pixel row.
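The memory dimensions follow directly from the latency and the bunch-crossing period. The short check below is a sketch under two assumptions that are not spelled out in the text: that there is one SRAM per row of the 16-row array, and that the 128-bit word is wide enough for the 120 hit bits of a row plus some per-word side information.

```python
# Figures quoted in the text.
SRAM_WORD_BITS = 128
SRAM_WORDS = 512
BX_PERIOD_NS = 25
PIXELS_PER_ROW = 120

# Assumption: one SRAM block for each of the 16 rows of the 120 x 16 array.
PIXEL_ROWS = 16

# One word per bunch crossing, so the depth sets the maximum latency.
latency_us = SRAM_WORDS * BX_PERIOD_NS / 1000

# Total storage over all rows.
total_bits = PIXEL_ROWS * SRAM_WORD_BITS * SRAM_WORDS

# Bits left in each word after the 120 binary hits of the row.
spare_bits_per_word = SRAM_WORD_BITS - PIXELS_PER_ROW

print(f"latency covered: {latency_us} us")
print(f"total storage: {total_bits} bits ({total_bits / 2**20:.0f} Mib)")
print(f"spare bits per word: {spare_bits_per_word}")
```

With these numbers, 512 words at 25 ns per word cover exactly 12.8 µs, and 16 blocks of 128 × 512 bits amount to 1,048,576 bits, in line with the roughly 1 Mb stated above.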
In addition, a memory gating technique, shown in Figure 1C, has been developed to limit the power consumption by exploiting the low hit occupancy in the Outer Tracker. Since the memory is read with a fixed latency, the SRAM is managed as a circular memory: every location corresponds to a specific clock cycle. When the memory pointer reaches the last location, it restarts from the first location, overwriting the previous data. Consequently, if an event is not requested within the defined latency, it is deleted. Considering an average occupancy of 1-2 clusters per bunch crossing, only two memories are written every cycle. An OR-tree detects whether a hit is present: if so, a write operation is executed; otherwise the clock to the memory is gated. In this way, the power consumption of the memory is reduced by a factor of ∼8, but the SRAM is not refreshed every cycle. Consequently, a tag is saved with the data in order to identify whether the data being read have been updated. The tag is provided by an 8-bit counter, called the Latency counter, which increments every time the circular memory completes a full cycle. When the memory is read, the counter tag is compared with the current Latency counter, and the data are processed only if the counter tag value comes from the previous latency cycle. This method requires a refresh of the memory before the Latency counter overflows, i.e. around every 3 ms, which is obtained by disabling the OR logic. The final power consumption of the memory with the additional gating logic is ∼5 times lower than the power consumption without any gating technique.
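The interplay between write gating, the tag and the periodic refresh can be illustrated with a small behavioural model. This is a minimal sketch and not the MPA logic: the class and method names are hypothetical, the tag check is simplified to comparing the stored tag with the pass number of the requested bunch crossing, and the refresh is modelled as one ungated pass every time the counter wraps.

```python
class GatedRowMemory:
    """Toy model of one pixel-row SRAM used as a tagged circular buffer."""

    def __init__(self, depth=512, counter_bits=8):
        self.depth = depth
        self.tag_mod = 1 << counter_bits
        self.words = [(0, 0)] * depth   # (hit pattern, tag) per location
        self.bx = 0                     # bunch crossings clocked so far
        self.writes = 0                 # SRAM writes that were not gated

    def clock(self, hits):
        """Advance one bunch crossing; write only on a hit or during refresh."""
        cycle = self.bx // self.depth          # completed passes over the memory
        refresh = cycle % self.tag_mod == 0    # one ungated pass per counter wrap
        if hits != 0 or refresh:
            self.words[self.bx % self.depth] = (hits, cycle % self.tag_mod)
            self.writes += 1
        self.bx += 1

    def read(self, bx):
        """Return the hit pattern of bunch crossing `bx`, or 0 if it was gated."""
        if not (self.bx - self.depth <= bx < self.bx):
            raise ValueError("event is outside the latency window")
        hits, tag = self.words[bx % self.depth]
        expected = (bx // self.depth) % self.tag_mod
        return hits if tag == expected else 0


# Small parameters so that the counter wraps quickly in the demonstration.
mem = GatedRowMemory(depth=4, counter_bits=2)
for bx in range(24):
    mem.clock(0b101 if bx == 5 else 0)
    print(bx, bin(mem.read(bx)))
print("ungated writes:", mem.writes, "of", mem.bx)
```

In this model a stale word is rejected because its tag does not match the requested pass, and the ungated pass overwrites old words before their tag can alias after the counter wraps. With the parameters of the text (512 words, 8-bit counter, 25 ns per bunch crossing) the counter wraps after 256 × 12.8 µs ≈ 3.3 ms, consistent with the refresh interval of around 3 ms.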
Radiation Hardening
Radiation effects are one of the main challenges in the MPA design. The maximum Total Ionizing Dose (TID) foreseen in the CMS Outer Tracker is 100 MRad. As demonstrated in [7], the degradation induced by high TID in 65 nm MOS transistors is strongly gate-length dependent. For this reason, minimum-length transistors are not used in the analog design. On the other hand, the digital design is based on standard cell libraries from the foundry, in which the designer cannot freely size the transistors. Based on the results from [8] and considering the low operating temperature of the tracker (the cooling system should work between -30 °C and -40 °C), we have implemented the digital logic using a 9-track standard cell library from the foundry. Simulations employing radiation device models predict a frequency degradation of around 20% at 200 MRad and -20 °C.
Concerning Single Event Effects, the MPA implements a partial Triple Modular Redundancy (TMR) methodology. The control logic, including the system clock (but not the sampling clock), and the configuration logic have been triplicated, while a full TMR has been excluded due to the restricted power budget. The large number of configuration bits is an important contribution to the power consumption, which can be mitigated by clock gating: the registers are clocked only when addressed and, in order to preserve the Single Event Upset immunity, an error detection and correction logic is added. This additional logic sends a clock pulse to the registers when it detects a discrepancy among the output values of the registers. The voted output is then written into all the registers, correcting the upset bit.
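The clock-gated, self-correcting configuration register described above can be sketched behaviourally as follows. This is an illustration under assumptions, not the MPA implementation: the class and method names are hypothetical, and the clock-pulse counter is only a stand-in for switching activity.

```python
def majority(a, b, c):
    """Bitwise 2-out-of-3 vote over three register copies."""
    return (a & b) | (a & c) | (b & c)


class TmrRegister:
    """Triplicated register that is clocked only on a write or on a mismatch."""

    def __init__(self, value=0):
        self.copies = [value, value, value]
        self.clock_pulses = 0   # proxy for clock activity after initialization

    def write(self, value):
        """Addressed write: the only regular event that clocks the register."""
        self.copies = [value, value, value]
        self.clock_pulses += 1

    def read(self):
        """Voted output, correct as long as at most one copy is upset per bit."""
        return majority(*self.copies)

    def upset(self, copy, bit):
        """Inject a single-event upset by flipping one bit of one copy."""
        self.copies[copy] ^= 1 << bit

    def scrub(self):
        """Error detection and correction: pulse the clock only on a mismatch."""
        if len(set(self.copies)) > 1:
            voted = self.read()
            self.copies = [voted, voted, voted]
            self.clock_pulses += 1
            return True
        return False


reg = TmrRegister(0b1010)
reg.upset(copy=1, bit=0)        # one copy now differs from the other two
assert reg.read() == 0b1010     # the voter masks the upset
assert reg.scrub() is True      # mismatch detected, voted value written back
assert reg.scrub() is False     # copies agree again, the clock stays gated
```

The point of the write-back is that a corrected register can tolerate a later upset in a different copy, which plain voting without correction would not guarantee, while the clock remains gated whenever the three copies agree.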
Conclusions
A module-level verification environment, described in [9], allowed extensive simulation of the MPA with back-annotated delays extracted from the final layout (in Figure 1A). Moreover, activity-based power verification provided the IR-drop maps shown in Figure 1A: static and dynamic power analyses provide a detailed study of the voltage drop across the chip. In conclusion, full chip simulation and power verification run with EDA tools from Cadence (Innovus for digital IC design, Voltus IC Power Integrity Solution for the digital domain and Virtuoso Analog Design Environment for the analog domain) show that the expected performance is achieved with a total power density lower than 90 mW/cm².
PoS(TWEPP-17)032
MPA Davide Ceresa 
