The ATLAS collaboration is currently investigating CMOS monolithic pixel sensors for the outermost layer of the upgrade of its Inner Tracker (ITk). For this application, two large-scale prototypes featuring small collection electrodes have been produced in a radiation-hard process modification of a standard 0.18 µm CMOS imaging technology: MALTA, with a novel asynchronous readout, and TJ MONOPIX, based on the well-established "column-drain" architecture. The MALTA chip is the first full-scale prototype suitable for the development of a monolithic module for the ITk. It features a fast, low-power front-end, LVDS drivers, and an architecture designed to cope with a hit rate of up to 2 MHz/mm² without clock distribution over the matrix, which reduces the total power consumption. Laboratory tests confirmed the performance of the asynchronous architecture expected from simulations. Extensive testbeam measurements have shown an average detection efficiency of 96% before irradiation at a threshold of ∼230 e⁻, with a dispersion of ∼36 e⁻ and an ENC lower than 10 e⁻. A pixel masking scheme that is not fully functional forces operation at relatively high thresholds, causing inefficiency. A severe degradation of efficiency has been measured after neutron irradiation at a fluence of 1 × 10¹⁵ 1 MeV n_eq/cm². Consistent results have been obtained with the TJ MONOPIX. A correlation between the inefficiency plots and the pixel layout has triggered TCAD simulations, which led to two possible solutions, implemented in a new prototype, the MiniMALTA.
MALTA: an asynchronous readout CMOS monolithic pixel detector for the ATLAS High-Luminosity upgrade
Introduction
A dedicated ATLAS [1] CMOS group is currently developing Depleted Monolithic Active Pixel Sensor (DMAPS) prototypes to be evaluated for the outermost pixel layer of the Inner Tracker (ITk) upgrade [2]. The TowerJazz 180 nm CMOS technology has been used for the pixel sensor of the ALICE Inner Tracking System [3]: the ALPIDE chip [4, 5]. Promising results from prototypes implemented in a process modification that fully depletes the sensitive layer [6, 7] have motivated the development towards the more challenging requirements of the ITk, such as a radiation hardness of up to 1.5 × 10¹⁵ 1 MeV n_eq/cm² and a fast signal response compatible with the HL-LHC 25 ns bunch structure. Two large-scale DMAPS have been designed using this process modification, which has been shown to withstand the radiation requirements [8]: the MALTA chip (Monolithic pixel detector from ALICE to ATLAS) uses a novel asynchronous pixel matrix readout, while TJ Monopix uses the more conservative "column-drain" architecture [9]. This paper presents the results obtained with the MALTA chip.
The MALTA sensor and front-end
The MALTA chip matrix contains 512 × 512 square pixels with a pitch of 36.4 µm, for a total sensing area of around 2 × 2 cm². Figure 1 shows the cross-section of the pixel. The small collection electrode and the front-end circuit give a total input capacitance of less than 5 fF. A minimum ionizing particle is expected to generate ∼1500 e⁻.
The pixel front-end, derived from ALPIDE [10, 11], is optimised for a threshold of 200 e⁻. Its fast response meets the requirement imposed by the 25 ns HL-LHC bunch crossing, and the power consumption is 0.9 µW per pixel. Eight different pixel flavours, differing in reset mechanism and electrode/well geometries, are implemented in the matrix, as illustrated in table 1a. A clipping mechanism guarantees the return to baseline in less than 200 ns, reducing the dead time. The discriminator output triggers the in-pixel logic to transmit hit information to the periphery asynchronously. After the discriminator, the timewalk between pulses embeds charge information, which is preserved at the matrix output by balancing the propagation delay in the readout logic, as described in the next section.
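As a rough cross-check of the numbers quoted above, the following Python sketch derives the matrix side length, the total analog power and the analog power density from the pixel count, the pitch and the per-pixel power. The function names are illustrative, and the calculation assumes that all 512 × 512 pixels draw the same 0.9 µW, ignoring differences among the eight flavours and any periphery contribution.

```python
# Back-of-the-envelope figures from the quoted MALTA parameters.
# Assumption: every pixel draws the same analog power (0.9 uW).

N_ROWS = 512
N_COLS = 512
PITCH_UM = 36.4
ANALOG_POWER_PER_PIXEL_UW = 0.9


def matrix_side_mm(n_pixels, pitch_um=PITCH_UM):
    """Length of one matrix side in millimetres."""
    return n_pixels * pitch_um / 1000.0


def total_analog_power_mw(n_rows=N_ROWS, n_cols=N_COLS,
                          power_uw=ANALOG_POWER_PER_PIXEL_UW):
    """Analog power of the full matrix in milliwatts."""
    return n_rows * n_cols * power_uw / 1000.0


def analog_power_density_mw_per_cm2(pitch_um=PITCH_UM,
                                    power_uw=ANALOG_POWER_PER_PIXEL_UW):
    """Analog power per unit sensing area in mW/cm^2."""
    pixel_area_cm2 = (pitch_um * 1e-4) ** 2  # 1 um = 1e-4 cm
    return power_uw * 1e-3 / pixel_area_cm2


if __name__ == "__main__":
    print(f"matrix side:   {matrix_side_mm(N_ROWS):.2f} mm")
    print(f"analog power:  {total_analog_power_mw():.1f} mW")
    print(f"power density: {analog_power_density_mw_per_cm2():.1f} mW/cm^2")
```

Under these assumptions the matrix side comes out at about 18.6 mm, consistent with the quoted sensing area of around 2 × 2 cm².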
The MALTA asynchronous readout architecture
The pixels are organized in groups of 2 × 8. Hits from a pixel are sent to dedicated logic, common within the group, which generates a reference pulse with a programmable width of 0.5 ns, 1 ns or 2 ns. Starting from this pulse, the pixel address (16 bits) and the group address (5 bits) are produced. Pulses are distributed over two parallel 22-bit wide busses, one reserved for the even groups and the other for the odd groups. Each bus therefore serves 32 groups (either blue or red in figure 2). As only non-adjacent groups share the same bus, there are no conflicts when a cluster spans adjacent groups. NAND gates are used as buffering stages and to inject the pulses into the shared bus (figure 2b), giving a fully balanced and symmetric chain, independent of the reverse substrate bias. The transmission lines for the busses along the column are also carefully designed to balance the load and achieve a well-defined delay from each pixel to the output. At the end of the double-column, the two busses are merged into a single bus of 22+1 bits, where the extra bit indicates the group parity. The same concept is used to combine pulses from the 256 double-columns of the matrix. The total number of parallel bits associated with a hit at the output of the chip is summarised in table 1b. Two possible solutions are implemented to merge signals at the end of the double-column: a priority arbitration stage, which delays pulses arriving simultaneously, and a simple OR structure. In the first case, 3 Delay Count bits provide timing information for off-line reconstruction. In the second case, the addresses of hits arriving at the same time at the inputs of the same OR gate will be corrupted. To synchronize the operation with the 40 MHz bunch structure of the LHC, the bunch crossing ID information can be added using a 2-bit counter.
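The word format at the end of a double-column can be illustrated with a short Python sketch. The bit ordering, the field names and the helper functions below are hypothetical and chosen only for illustration; the text fixes only the field widths (1 reference bit, 16 pixel bits, 5 group bits, 1 parity bit). The sketch further assumes that the 16 pixel bits form a hit pattern with one bit per pixel of the 2 × 8 group, and that parity 0 marks the even groups.

```python
from typing import NamedTuple

PIXEL_BITS = 16   # assumed: one bit per pixel of the 2 x 8 group
GROUP_BITS = 5    # 32 groups per bus

# Hypothetical layout, LSB first: reference | pixel pattern | group | parity
REF_SHIFT = 0
PIXEL_SHIFT = 1
GROUP_SHIFT = PIXEL_SHIFT + PIXEL_BITS
PARITY_SHIFT = GROUP_SHIFT + GROUP_BITS  # bit 22, the "+1" bit


class Hit(NamedTuple):
    pixel_pattern: int  # non-zero 16-bit hit pattern
    group_addr: int     # 0..31, position of the group on its bus
    parity: int         # 0 = even-group bus, 1 = odd-group bus (assumed)


def encode_hit(hit: Hit) -> int:
    """Pack a hit into a 23-bit word with the reference bit set."""
    if not 0 < hit.pixel_pattern < (1 << PIXEL_BITS):
        raise ValueError("pixel pattern must be a non-zero 16-bit value")
    if not 0 <= hit.group_addr < (1 << GROUP_BITS):
        raise ValueError("group address must fit in 5 bits")
    if hit.parity not in (0, 1):
        raise ValueError("parity must be 0 or 1")
    return ((1 << REF_SHIFT)
            | (hit.pixel_pattern << PIXEL_SHIFT)
            | (hit.group_addr << GROUP_SHIFT)
            | (hit.parity << PARITY_SHIFT))


def decode_hit(word: int) -> Hit:
    """Unpack a 23-bit word; the reference bit must be set."""
    if not (word >> REF_SHIFT) & 1:
        raise ValueError("reference bit not set")
    return Hit(
        pixel_pattern=(word >> PIXEL_SHIFT) & ((1 << PIXEL_BITS) - 1),
        group_addr=(word >> GROUP_SHIFT) & ((1 << GROUP_BITS) - 1),
        parity=(word >> PARITY_SHIFT) & 1,
    )


def group_index(hit: Hit) -> int:
    """Group position (0..63) within the double-column (assumed mapping)."""
    return 2 * hit.group_addr + hit.parity


def or_merge(word_a: int, word_b: int) -> int:
    """Model of the simple OR merger for two simultaneous words."""
    return word_a | word_b
```

With this toy model, merging two simultaneous hits from groups 3 and 4 through `or_merge` decodes to group address 7, which is the kind of address corruption that the OR structure cannot avoid and that the arbitration stage prevents by delaying one of the pulses.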
The chip can output its data on a 40-bit wide parallel LVDS output at the bottom of the chip [12], or on one of four 40-bit wide parallel CMOS transceivers at the left and right edges of the chip.
Power consumption
The need for lower power consumption motivated the use of both the asynchronous readout and the small collection electrode. Table 2 compares the estimated power consumption of three DMAPS: one based on the TJ MALTA, one on the TJ Monopix and one on a MONOPIX implemented with a large collection electrode [13]. The power consumption at the ITk hit rate is given per pixel and for the full matrix. The numbers shown are derived from simulation and measurements, as no final full-scale prototypes are available. The low capacitance of the small collection electrode reduces the analog power consumption per pixel by a factor of 20, which results in a factor of 4 reduction for the full 2 × 2 cm² matrix because of the smaller pixel pitch. In the 36 µm pitch matrix, the asynchronous readout consumes 20 times less digital power than the synchronous one. Comparing the MALTA architecture with the synchronous one implemented with a ∼6 times larger pixel shows that the advantage of avoiding clock distribution increases with the detector granularity: in this specific case, the power saving drops from a factor of 20 to a factor of ∼7 for the larger pixel. The extra power in the digital periphery of the asynchronous architecture is due to the final data synchronization and serialization at the output.
Asynchronous architecture performance
The first MALTA chips correctly propagate the signals to the output. The delay introduced by the NAND buffering structure has been characterised in simulations and verified with measurements in which pixels at the top, centre and bottom of the same double-column are pulsed simultaneously. Figure 3a shows the analog output of the top pixel (in green) and the reference pulse (in yellow), measured on an oscilloscope. In figure 3b, pulses from the same set of pixels are collected using the MALTA readout system. In both measurements, one can identify the three simultaneously pulsed pixels responding with a well-defined delay. Both the propagation of the test pulse up the column and the propagation of the pixel output down the column introduce a delay that depends on the pixel position within the column (figure 2b). For the test pulse, the largest difference in propagation delay is the one between the bottom and the top pixel, which is 17.5 ns in simulation. The output signals from the bottom pixel arrive first, then those from the pixel at the centre, and last those from the pixel at the top. A maximum relative delay of 7.5 ns is expected when propagating hits to the periphery. The sum of the maximum propagation delays of the test pulse and of the output signal corresponds to the measured 25 ns total delay shown in figure 3. As the timing depends on the pixel address, it can be corrected off-line. In-chip correction is under study for future prototypes.
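An address-based off-line correction of this kind could look like the following Python sketch. It assumes, purely for illustration, that both delays grow linearly with the row number and that row 0 is the pixel closest to the periphery; the text only gives the maximum values (17.5 ns for the test pulse, 7.5 ns for the output signal), and the function names are hypothetical. A real correction would more likely use a measured delay per row.

```python
N_ROWS = 512
PULSE_DELAY_MAX_NS = 17.5    # test pulse, bottom to top (simulation)
READOUT_DELAY_MAX_NS = 7.5   # hit propagation, top to periphery


def row_fraction(row: int) -> float:
    """Relative position in the column: 0.0 at the bottom, 1.0 at the top."""
    if not 0 <= row < N_ROWS:
        raise ValueError(f"row must be in [0, {N_ROWS})")
    return row / (N_ROWS - 1)


def readout_delay_ns(row: int) -> float:
    """Assumed linear delay of the output signal down the column."""
    return READOUT_DELAY_MAX_NS * row_fraction(row)


def pulse_delay_ns(row: int) -> float:
    """Assumed linear delay of the test pulse up the column."""
    return PULSE_DELAY_MAX_NS * row_fraction(row)


def corrected_time_ns(measured_ns: float, row: int,
                      test_pulse: bool = False) -> float:
    """Subtract the position-dependent delay from a measured arrival time.

    For particle hits only the readout delay applies; for test-pulse data
    the pulse distribution delay is subtracted as well.
    """
    delay = readout_delay_ns(row)
    if test_pulse:
        delay += pulse_delay_ns(row)
    return measured_ns - delay
```

In this model a test-pulse hit from the top row that arrives 25 ns after the bottom-row hit is shifted back to 0 ns, while a particle hit from the same row is corrected by 7.5 ns only.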
Detection efficiency before and after irradiation and next steps
Extensive testbeam measurements at the CERN SPS, with 120 GeV pions and a telescope with 3 µm resolution [14], indicate an average efficiency of 96% before irradiation (figure 4a). The inefficiency is mainly due to a pixel masking scheme that is not fully functional, together with a few pixels affected by random telegraph noise. This forced operation at a relatively high threshold of 230 e⁻, with a dispersion of 36 e⁻ and an ENC of 7 e⁻. After neutron irradiation to 1 × 10¹⁵ 1 MeV n_eq/cm², a severe efficiency drop to 74% was observed. It is due to an increased noise level, which forced an even higher threshold of 350 e⁻, and to severe charge loss at the pixel edges [15]. Figures 4b and 4c illustrate the correlation of the efficiency with the deep p-well layout. Similar results have been obtained with the TJ Monopix [16]. Detailed 3D simulations established that the charge loss near the pixel edges is related to a region of very low electric field, which increases the probability that the signal charge is captured by radiation-induced traps [17]. Two pixel improvements were proposed, both of which significantly enhance the lateral electric field near the pixel edges in simulation: a gap in the low-dose n⁻ layer and an additional deep p-type implant, as illustrated in figure 5. The fixes have been implemented in a new chip, called MiniMALTA, currently under test. It contains 16 × 64 pixels organized as in MALTA, including 8 sectors with splits on analogue front-end design, reset mechanism and process. Critical transistors of the front-end have been enlarged to reduce the random telegraph noise measured in MALTA. An asynchronous output data stream complicates the readout system and is not compatible with the ITk. Therefore, a synchronization memory at the end of the column and a serial data output have been implemented in MiniMALTA. This approach preserves the advantage of no clock distribution over the matrix.
Steps towards system integration
The ITk is planning to arrange the hybrid pixels in modules of four readout chips, bump bonded to a single sensor of ∼4 × 4 cm² [18]. The monolithic proposal aims to build a CMOS quad-module, made of four sensors assembled as a large-area detector. The geometry, data format and interface to the ATLAS readout shall be compatible with the hybrid structure, to constitute a "drop-in" solution for the outermost layer. The dead area between chips shall be minimized through accurate dicing and specific design attention to the assembly. MALTA is the first "full-size" monolithic sensor available for the design of a prototype of such an ITk module. It was designed to characterize and validate the asynchronous architecture via a large number of parallel outputs, which need to be replaced with a serial output. Four MALTA chips will be assembled on an adaptor card compatible with the MALTA FPGA-based readout system.
CMOS transceiver pads on the side of the MALTA matrix allow hits coming from one sensor to be transmitted to the neighbouring one, merging the data into a single set of parallel outputs and exploiting the same structure presented in paragraph 5. The number of outputs will be considerably lowered, which eases assembly and reduces cost. The feasibility of a wire-bonded chip-to-chip connection has been proved on a dual chip carrier board (figure 6b), with a spacing of 250 µm between dice. A dedicated flip-chip sensor-to-sensor connection is under investigation.
Conclusions
MALTA is a 2 × 2 cm² DMAPS developed for the ATLAS High-Luminosity upgrade. It features square pixels of 36.4 µm pitch with a small, low-capacitance collection electrode, a fast front-end and a novel asynchronous architecture. The chip has been extensively characterized in laboratory and test beam measurements. A pixel requires only 0.9 µW of analog power, thanks to the small, low-capacitance collection electrode. However, the sensor suffers from degraded efficiency in the pixel corners after irradiation to 1 × 10¹⁵ 1 MeV n_eq/cm². Two possible process modifications have been identified and implemented in a new matrix to avoid the loss of charge collection efficiency after irradiation. A lower threshold should be achievable after a modification of the front-end design that reduces RTS noise. The asynchronous architecture has proven to be a promising candidate for future developments, as it limits the increase of digital power consumption over the matrix. The advantage is more relevant at high hit rate and high granularity, which is the trend for HEP experiments.
