6 cm⁻² junction density. They are presently the largest operational superconducting SFQ circuits ever made. The developed technique distinguishes between "hard" defects (fabrication-related) and "soft" defects (measurement-related) and locates them in the circuit. The "soft" defects are specific to superconducting circuits and caused by magnetic flux trapping either inside the active cells or in the dedicated flux-trapping moats near the cells. The number and distribution of "soft" defects depend on the ambient magnetic field and vary with thermal cycling even if done in the same magnetic environment.
I. INTRODUCTION

SUPERCONDUCTOR digital devices are clear winners in several types of electronics benchmarks: some devices demonstrated the highest clock rate, up to ∼750 GHz [1], whereas other devices showed the lowest energy dissipation per logic operation, below 10⁻¹¹ pJ [2]. These and many other impressive results were achieved using device fabrication tools that are antiquated from the standpoint of modern microelectronics. The evident mismatch between the outdated tools and the record-setting results stimulated much speculation about the potential of superconductor digital circuits made using modern microelectronics foundries. The IARPA C3 Program [3] gives us a chance to correlate the fabrication progress [4], [5] and the complexity of superconducting circuits [6]. Prototypes of superconducting microprocessors and their components presented at ASC 2016 [7]-[10] show new opportunities for rapid development of digital superconductor electronics. Benchmark test circuits are required for comparing fabrication processes and evaluating their maturity. They can also play an important role in distinguishing between numerous potential failure factors. For various reasons, only a few types of benchmark circuits and process diagnostic tests have been published and discussed. Shift registers dominate the pool of benchmark circuits due to a good balance between their universality and simplicity [11], [12]. Owing to their simplicity, shift registers were implemented even in a rudimentary high-Tc fabrication technology [13]. Random access memories are second because partly operational circuits with defective cells can be investigated and the exact locations of those cells extracted [14]. Toggle flip-flops and ring oscillators have been used to benchmark the maximum speed of operation and data transfer; see, e.g., [1], [15]-[17] and references therein.
In principle, any circuit can be used as a benchmark circuit if the yield of this particular circuit or of its functional units is the metric, e.g., [18].
One of the most important requirements for benchmark circuits is scalability, i.e., the ability to increase the integration scale to, potentially, tens of millions of Josephson junctions and other components, similarly to CMOS circuits. Another important requirement is the ability to test and benchmark magnetic flux trapping. This problem is specific to superconductor electronics and has no analog in CMOS circuits.
In our previous work, we suggested a simple and scalable ac-biased shift register, and explained how it can be used for generic technology benchmarking and flux trapping evaluation [19]. Functionally, it is similar to the "scan chain" technique widely accepted in CMOS technology. Being a shift register, our circuit shares the drawback of all shift registers: it is a periodic structure rather than a more valuable "random" mix of logic gates. On the other hand, its design offers a unique possibility of evaluating the critical currents of Josephson junctions in every register cell. Owing to the ac-biasing scheme, the circuit integration scale can be increased to the multimillion-junction level without the significant complications associated with applying large bias currents.
II. STRUCTURE OF THE TEST CIRCUIT
A unit cell of the register developed in [19] is shown in Fig. 1. Each cell contains a chain of four inductances (L1-L4), each connected to the ground plane by a Josephson junction. All four junctions (J1-J4) have identical target parameters. The cell also has a "handle", inductance L6, inductively coupled to the common ac clock line. Since [19], we have implemented a number of cell improvements driven by progress in understanding the technological limitations.
In addition, we have made two major upgrades of the original layout [19] shown in Fig. 1(c). Fig. 1(d) illustrates our progress with cell miniaturization due to incremental upgrades of the MIT LL process from the SFQ4ee to the SFQ5ee node. In particular, we used a smaller number of larger "dummy" Josephson junctions and more aggressive inductor linewidth and via design rules. Both technology nodes use 0.1-mA/μm² Josephson junctions requiring resistive shunts. Fig. 1(e) shows how the cell area can be further reduced by using a prospective technology with 0.5-mA/μm² self-shunted Josephson junctions [20]. Regular cells are connected into interleaved strings with opposite data propagation directions (see Fig. 2). The ends of neighboring strings are connected by "U-turn" cells (the upper right corner in Fig. 2), which are functionally and schematically identical to the regular cells shown in Fig. 1. In [19] we demonstrated the possibility of tapping data from internal points of the register using dc-biased RSFQ-type "U-turn" inserts. In this work, we use a more efficient way to tap data: a new ac-biased U-turn cell shown in Fig. 3. We designed the cell in two steps. First, we connected junction J3 to the first junction (J5) of the usual (dc-biased) JTL line via a relatively large (2 PSCAN units or 5.28 pH) inductance. Then we used PSCAN to align the margins of the U-turn cell with the margins of the unit cells by adjusting the cell inductances.
We placed the strings of cells (see Fig. 2 ) at 1-μm spacing. The empty spaces between them made very long moats. We cut them into shorter moats by connecting the sky and ground planes of the adjacent rows of cells by short bridges. The distance W between the bridges does not affect the circuit operation but could affect the moat efficiency. Different values of W could be used within a single register for a comparative study of moat efficiencies.
III. MEASUREMENT TECHNIQUE
The testing technique proposed in [19], which enables the extraction of the margins of all individual cells in our shift registers, consists of several steps. The first step is to establish that an N-bit register under test is operational and to find its global margins, i.e., the range of positive and negative amplitudes of rectangular clock pulses that shift the input data pattern from the register input to its output after N clock periods. Intermediate data taps (see Fig. 3) can be used to observe the propagation of the input pattern through the register. These global or "nominal" amplitudes are then extensively used to write in and read out the special test patterns. An automated test setup, Octopux, was used to synthesize the clock and data patterns and acquire output patterns at multiple data taps. The typical ac clock waveform is shown in Fig. 4. The typical clock frequency was about 100 kHz.
The next step is the extraction of margins of individual cells. Our technique essentially uses the fact that margins of empty cells (that store logic "zeros") are much wider than margins of cells with magnetic flux quanta that store logic "ones." The technique is easier to explain using the simplest pattern containing only a single logic "1" written into a cell k under investigation. This is done using k clock pulses of the nominal amplitude. The main idea of the method is to shift this "1" from cell k to the next, k + 1 position using the k + 1 clock pulse with intentionally modified amplitude, either positive or negative, as shown in Fig. 4 . In order to check the result, we read out the register content by applying the nominal clock amplitudes and shifting the "1" either to the next data tap or to the register output. Then we compare the number of the clock pulses required with our expectations based on the position of the k-th cell in the register. The cell operated correctly, i.e., the modified positive and negative clock amplitudes used were within its margins, if the "1" in the cell k was shifted into the next, k + 1 cell. No shift or shifts to more than one position are considered as errors. To find the operation margins of the cell k we should repeat the described procedure using different modified amplitudes. Then the whole test cycle should be repeated for all N cells.
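The single-"1" test described above can be illustrated with a toy software model. The sketch below is not the test software used in this work: the `Cell` class, the one-polarity clock, and the rule that an out-of-margin pulse shifts the "1" by zero or two positions are all simplifying assumptions introduced for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    lower: float  # hypothetical lower clock-amplitude margin, mA
    upper: float  # hypothetical upper clock-amplitude margin, mA


def shift_distance(cell, amplitude):
    """Toy model: positions a "1" stored in `cell` moves for one pulse."""
    if amplitude < cell.lower:
        return 0  # pulse too weak: the "1" stays in place
    if amplitude > cell.upper:
        return 2  # pulse too strong: the "1" moves more than one position
    return 1      # within margins: correct single shift


def cell_passes(cells, k, test_amplitude, nominal):
    """Test cell k (0-based, k < len(cells) - 1) with one modified pulse."""
    n = len(cells)
    if not 0 <= k < n - 1:
        raise ValueError("k must address a cell before the last one")
    if not all(c.lower <= nominal <= c.upper for c in cells):
        raise ValueError("nominal amplitude must lie within the global margins")
    position = 0
    for _ in range(k):  # write: k nominal pulses bring the "1" to cell k
        position += shift_distance(cells[position], nominal)
    position += shift_distance(cells[position], test_amplitude)  # test pulse
    pulses = 0
    while position < n:  # read-out: count nominal pulses to the output
        position += shift_distance(cells[position], nominal)
        pulses += 1
    return pulses == n - (k + 1)  # compare with the expected pulse count


def extract_margins(cells, k, amplitudes, nominal):
    """Return (lowest, highest) passing amplitude for cell k, or None."""
    passing = [a for a in amplitudes if cell_passes(cells, k, a, nominal)]
    return (min(passing), max(passing)) if passing else None
```

Scanning `amplitudes` over a grid and repeating `extract_margins` for every `k` corresponds to the full, slow test cycle over all N cells.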
The described ideal pattern is too slow to be practical for long registers. The procedure can be accelerated by using various patterns with many sparse "1"s. In this case, each "1" independently verifies "its own" cell. For example, the extraction of margins can be started using a pattern with logic "1"s separated by two "0"s: 1 0 0 1 0 0 1 0 . . . This pattern deals with the first, fourth, seventh, and so on cells. In particular, we used it to identify and extract margins of the weakest cell among the mentioned. Then, the identified cell was eliminated from further measurements by replacing "1" corresponding to the cell by "0". For example, if the weakest cell was in position four, the new pattern becomes 1 0 0 0 0 0 1 0 . . . The new pattern with the reduced number of "1"s was then used to identify the cell with the second weakest margins, and so on. The described procedure continued until all margins were extracted and the pattern contains only "0"s. Then, the whole sequence was repeated using a "shifted" sparse pattern: 0 1 0 0 1 0 0 1 . . . that deals with cells in the second, fifth, eighth, and so on positions. Finally, the procedure was repeated with pattern 0 0 1 0 0 1 0 0 1 . . . Using the described procedure, we measured margins of all cells for both polarities of ac clock. The variations of cell margins can be also interpreted in terms of deviations of critical currents of Josephson junctions in the shift register from the design values. Indeed, the effect of any cell distortion can be extracted by straightforward numerical simulations of the cell margins with the nominal and distorted values of the selected parameter. Table I shows the results when the distorted parameters are critical currents of four Josephson junctions. The dimensionless numbers in Table I are the rates of margin changes with respect to changes of the critical currents.
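The sparse-pattern procedure with elimination of the weakest cell can be sketched as follows. This is a minimal illustration under stated assumptions: `measure_weakest` is a hypothetical stand-in for the hardware measurement, and the "weakest" cell is taken to be the active cell with the highest lower margin, since it fails first when the clock amplitude is reduced.

```python
def sparse_pattern(n_cells, offset, period=3):
    """Pattern with a "1" in every `period`-th cell, starting at `offset`."""
    return [1 if i % period == offset else 0 for i in range(n_cells)]


def measure_weakest(pattern, true_lower):
    """Hypothetical stand-in for the hardware measurement.

    Returns the index and lower margin of the active cell (holding a "1")
    that fails first as the clock amplitude is reduced.
    """
    active = [i for i, bit in enumerate(pattern) if bit]
    worst = max(active, key=lambda i: true_lower[i])
    return worst, true_lower[worst]


def extract_with_sparse_patterns(true_lower, period=3):
    """Recover the lower margin of every cell using shifted sparse patterns."""
    n = len(true_lower)
    measured = [None] * n
    for offset in range(period):           # patterns 100100.., 010010.., 001001..
        pattern = sparse_pattern(n, offset, period)
        while any(pattern):                # until the pattern holds only "0"s
            cell, margin = measure_weakest(pattern, true_lower)
            measured[cell] = margin        # record the weakest remaining cell
            pattern[cell] = 0              # replace its "1" with "0"
    return measured
```

In this sketch each call to `measure_weakest` represents one margin scan of the whole register, so the number of scans equals the number of cells, while every scan exercises many cells in parallel.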
According to Table I, variations of the critical current of junction J2 have the major impact on the lower positive margin. Numerically, it is about 4.4 μA change of the margin per 1 μA change of the critical current. Similarly, junction J4 has the largest impact on the upper negative margin. The impact of simultaneous deviations of several critical currents can also be calculated using the coefficients in Table I. For example, if we assume equal and statistically independent deviations of all four junctions, then the cumulative impact of the deviations can be described by the two coefficients shown in the last column of Table I. These dimensionless coefficients connect rms deviations of critical currents with rms deviations of clock margins. The coefficients are close to each other and, with reasonable accuracy, it is possible to state that the rms spread of critical currents causes an about 5.5 times higher rms spread of the clock amplitudes.
It is possible to calculate similar impacts of the spreads of self- and mutual inductances on the rms spread of clock amplitudes. However, below we assume that the spread of critical currents is the dominant spread factor and other factors can be neglected.
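For statistically independent deviations of equal rms size, per-junction sensitivities would normally combine in quadrature. The expression below is a sketch of that standard rule, not a formula taken from Table I; the symbols c_i (the margin sensitivity to the critical current of junction Ji), σ_AMP, and σ_IC are notation assumed here for illustration.

```latex
\sigma_{\mathrm{AMP}}
  = \sqrt{\sum_{i=1}^{4} c_i^{2}}\;\sigma_{I_C}
  \approx 5.5\,\sigma_{I_C},
\qquad
c_i = \frac{\partial(\text{margin})}{\partial I_{C,i}}
```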
IV. MEASUREMENT RESULTS
A. Smaller Shift Register With 16k Cells
A number of shift registers have been laid out, fabricated at MIT Lincoln Laboratory using the SFQ4ee and SFQ5ee process nodes with 100-μA/μm² Nb/AlOx/Al/Nb Josephson junctions [4], [5], and investigated. As mentioned earlier [19], it is possible to extract the lower and upper operation margins for the ac clock current of two memory loops (between junctions J1 and J2 and between junctions J3 and J4; see Fig. 1) for every cell in the register, thus characterizing the uniformity of the fabricated cells. Positive clock amplitudes shift a logic "1" from the first memory loop to the second. Negative clock amplitudes shift a logic "1" from the second memory loop, between J3 and J4, to the first memory loop of the next cell. This large amount of per-cell information about the entire circuit can be highly useful for characterizing and optimizing the fabrication processes. Fig. 5 shows a few types of margins for a 16 384-bit register. Fig. 5(a) To sort out possible origins of the spread of individual margins, we repeated the full cycle of measurements with and without thermal cycles. The results are shown in Fig. 6.
The thin blue curve in Fig. 6(a) shows sorted lower positive margins averaged over ten thermal cycles. The thicker red curve is the best fit of the margins M by an error function defined by a mean value and a standard deviation σ, where P(M) is the relative number of cells with margins below M. The fitted value σ = 14.7 μA is equivalent to a 1σ spread of the critical currents of junctions in the circuit of 2.7 μA. This standard deviation characterizes the cumulative impact of the fabrication spread and, to a lesser degree, the impacts of flux trapping and random measurement errors. Its value is quite close to the typical fabrication spread of junctions with 0.125 mA critical currents [5]. Below we estimate the contributions of random measurement errors and flux trapping effects.
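The fitting expression is not reproduced in this text. A Gaussian cumulative distribution of the following form would be consistent with the description of P(M); this form and the symbol for the mean margin are assumptions. The second line checks the quoted conversion using the factor of about 5.5 between the spreads of clock amplitudes and critical currents.

```latex
P(M) = \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{M - \bar{M}}{\sqrt{2}\,\sigma}\right)\right]
\\[4pt]
\sigma_{I_C} \approx \frac{\sigma}{5.5} = \frac{14.7\ \mu\mathrm{A}}{5.5} \approx 2.7\ \mu\mathrm{A}
```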
The measurement errors can be extracted by analyzing the difference between one set of measurements and the averaged data, as shown in Fig. 6(b). The error of measurements (1σ deviation) of the amplitude is only 2.6 μA. The equivalent spread of critical currents according to (1) is only about 0.47 μA. This very low value, comparable to the thermally induced uncertainty of critical currents, can serve as a conservative upper limit to the accuracy of our measurements. Despite the lack of outliers, we suggest that the impact of flux trapping is associated with the largest deviations from zero of the margins shown in Fig. 6(b). Fig. 7 shows the locations of cells with the largest deviations of lower positive margins measured after each of 10 thermal cycles. There are a few coinciding locations, but the majority of them look quite random.
B. A Larger Shift Register With 202 280 Cells
Comprehensive measurements of robust 8k, 16k, and 36k shift registers were convenient for investigating small trapping effects by a comparative study of circuits exposed to thermal cycles. Measurements of longer registers are more difficult and time consuming. We discuss here only the measurements carried out within one thermal cycle. The register (see Fig. 8) occupies an 8 mm × 8 mm payload area of a 1 cm × 1 cm chip. A small part of the payload area was reserved for two 16-bit registers, two Josephson junctions with 0.25 mA critical currents that served as thermometers, and two 40-Ω resistors that served as local heaters. The main shift register consists of 202 280 cells shown in Fig. 1(a), (d), and Fig. 3. The register contains one data input, one data output, and 17 intermediate data taps that allow diagnosing partly operational circuits and simplify the debugging and optimization of the measurement procedure. The measurements were complicated by the much longer time required for working with longer data patterns and by stronger requirements for the reliability of the testing procedures. In particular, the achieved 0.01% error rate for a single bit measurement event was insufficient for our purpose. To resolve the problem, we repeated all measurements ten times and then analyzed the redundant data to exclude incorrect measurement results. This was not difficult because one or two incorrect values were very different from groups of 9 or 8 very close correct values. Only one of six investigated chips was fully operational, while the other five were alive but only partly operational.
Fig. 9. Degradation of margins with the shift register length. Blue circles mark full matching between the measured and expected patterns; red squares mark an incorrectly delayed but undistorted measured pattern; green diamonds mark operation with distorted distances between logic "1"s but with the correct number of logic "1"s.
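The exclusion of incorrect values from ten repeated measurements can be sketched as below. The median-based rule and the `tolerance` parameter are assumptions chosen for illustration; the text states only that the one or two incorrect values differed strongly from the group of close correct values.

```python
from statistics import median


def robust_value(readings, tolerance):
    """Average the readings that agree with the median within `tolerance`.

    Returns (mean of accepted readings, number of rejected readings).
    Raises ValueError if no clear majority of readings agree.
    """
    center = median(readings)
    accepted = [r for r in readings if abs(r - center) <= tolerance]
    if len(accepted) <= len(readings) // 2:
        raise ValueError("no clear majority among repeated measurements")
    rejected = len(readings) - len(accepted)
    return sum(accepted) / len(accepted), rejected


# Ten hypothetical margin readings (mA) with one obviously wrong value.
readings = [1.00, 1.01, 0.99, 1.00, 1.02, 1.00, 0.98, 1.01, 2.50, 1.00]
value, rejected = robust_value(readings, tolerance=0.05)
```

Using the median as the reference keeps one or two wild values from pulling the accepted group away from the correct cluster.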
As the first measurement step, we applied to the register a short "active" pseudorandom pattern (with six logic "1"s and six logic "0"s) followed by a sufficient number of "0"s to match the pattern and register lengths. Several taps were used to monitor the propagation of the pattern through the register. Output patterns were automatically compared with the input ones. To find the operation range, we automatically scanned the parameters of interest and repeated the testing procedure at each selected set of parameters. The three plots in Fig. 9 show different grades of register operation. Blue open circles mark correct operation; red squares mark operation with correct patterns but with incorrect (shorter) delays between the input and output patterns. Finally, green diamonds mark operation with the correct number of logic "1"s but with distorted distances between them.
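The automatic comparison of output and input patterns into the grades plotted in Fig. 9 can be sketched as follows. The function name, the grade labels, and the representation of patterns as lists of bits indexed by clock period are assumptions made for illustration.

```python
def ones(bits):
    """Clock-period indices at which a logic "1" appears."""
    return [i for i, bit in enumerate(bits) if bit]


def grade(input_bits, output_bits, expected_delay):
    """Classify a measured output pattern against the applied input pattern.

    "correct"   - same pattern, arriving after the expected delay
    "shifted"   - same pattern, but with an incorrect delay
    "distorted" - correct number of "1"s, distorted distances between them
    "fail"      - wrong number of "1"s (or an empty pattern)
    """
    src, out = ones(input_bits), ones(output_bits)
    if not src or len(src) != len(out):
        return "fail"
    delays = {o - s for s, o in zip(src, out)}
    if len(delays) > 1:
        return "distorted"
    return "correct" if delays == {expected_delay} else "shifted"


# Hypothetical 12-bit pattern with six "1"s and six "0"s.
pattern = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0]
```

A pattern that arrives early, as observed at the longest tap, would be graded "shifted" by this sketch.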
The leftmost plot shows "the perfect" operation of the shortest register section, containing only 389 cells, in very wide ranges of amplitudes: from about 0.5 to 2.2 mA for positive and from −0.5 to −2.5 mA for negative amplitudes. The middle plot shows a noticeable operation area of a register section with 24 896 cells. The rightmost plot shows that the register is still operational, but the delay between input and output data is always shorter than the 75 077 clock periods expected for this tap. Each thermal cycle changed the operational areas in the register and even the existence of the operation grades.
Shorter delays mean that a tiny fraction of the register cells operate as pieces of Josephson transmission lines (JTLs). This type of error is typically observed when the absolute values of the clock amplitudes are too high for correct operation. For shorter registers, such behavior was not an issue because we were able to return to normal operation by reducing the clock amplitudes. For the sections of the longer register, we could not do this because some other "bad" cells stopped working at the reduced amplitudes.
The described effect would be unacceptable for practical circuits. However, the observed shortening of the delay is nevertheless useful because it allows us to diagnose the technology. For example, we learned that the measured length of the register, 202 218 clock periods, was 62 clock periods shorter than its designed length of 202 280 clock periods. In other words, 62 cells operated as JTLs and were hidden from our testing procedure. We will refer to them as missing cells. The ratio of missing to operational cells, about 0.03% in this case, is a way to characterize the technology. The number of missing cells usually changed after each thermal cycle. This means that the missing-cell effect originated from flux frozen in some unexpected places. Fortunately, the described measurement technique works well for registers with missing cells. Fig. 10 shows lower positive clock margins. These data are similar to the lower positive margins shown in Fig. 5(a). However, in Fig. 10 we use one-dimensional numbering showing the "clock period" distance of cells from the register input. In this way it is easy to see a number of rare but really large spikes of margins. At first glance they look random and difficult to organize. However, sorting them in ascending order reveals their structure, as shown in Fig. 11. Fig. 11 shows only the tails of the distribution, which deviate noticeably from the mean value and contain less than 0.05% of the data points. The remaining 99.95% of the data in the center of the plot were omitted for clarity because they match the Gaussian distribution with an accuracy better than the width of the lines in the plot.
The tails contain spikes that can be sorted into a few groups. The first group contains only the two lowest and two highest margin data points, which could be related to cells with fabrication-related distortions of parameters. The other two, more populated, groups are probably caused by fluxes frozen in unexpected locations, but we cannot prove it now. Note that the number of cells affected by the flux trapping and shown in the left side of Fig. 11 exactly coincides with the number of "missing" cells. We believe that both observed effects are correlated. In other words, each "missing" cell disturbs the margin of its neighboring cell. The extracted ∼15 μA 1σ spread of clock amplitudes and the corresponding ∼3 μA spread of the critical currents of more than 809k junctions in the register are identical to these parameters in the shorter, 16k-bit register fabricated six months earlier. Fig. 12(a) shows that most spikes in positive margins in 
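The missing-cell figures quoted above follow from simple arithmetic, reproduced in the short sketch below. The function name is hypothetical; the two register lengths are the values given in the text, and the measured length is taken as the number of operational cells.

```python
def missing_cell_stats(designed_length, measured_length):
    """Return (number of missing cells, missing/operational ratio in percent)."""
    missing = designed_length - measured_length
    percent = 100.0 * missing / measured_length
    return missing, percent


# Designed and measured register lengths in clock periods, from the text.
missing, percent = missing_cell_stats(202_280, 202_218)
print(f"{missing} missing cells, {percent:.2f}% of operational cells")
```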
V. CONCLUSION
The state of the art in superconductor digital technology is close to the psychologically important million-junction level of integration. We have shown in this work that circuits with such a level of integration can be designed, fabricated, and successfully tested. The main advantage of our ac-biased shift register is the possibility of extracting the properties of individual cells. In this respect, our circuits are similar to random access memories. However, in contrast to RAMs, our circuits are simpler to design and take less labor to lay out. One could expect that the ability to access individual cells is especially important for superconductor circuits, which could be affected by parasitic flux trapping, a phenomenon nonexistent in semiconductor electronic circuits [21].
We have compared the properties of circuits with different integration levels, from about 64k junctions to more than 809k junctions per circuit. We have found that the upper limit of the 1σ spread of critical currents (∼3 μA) does not depend on the level of integration and hence can be measured using small circuits.
In contrast, we have detected some fabrication defects with very low probabilities, ∼5·10⁻⁶ defects per Josephson junction, which could have been detected only using circuits with the maximum possible level of integration.
We have definitely observed a significant difference in the probabilities of high impact flux-trapping events in the relatively small 16k-cell registers on 5-mm chips and the largest, over 200k-cell, registers on 1-cm chips. The rate of high impact flux-trapping in the largest circuits is about one event per 2000 cells. However, we did not find even a single similar effect after 10 thermal cycles of the circuit with over 16k cells. We cannot suggest a simple explanation of this difference in flux trapping. It is possible however that the difference is a result of a high sensitivity of flux trapping to the fabrication-dependent flux "freezing temperature" of superconducting films, as noted in [22] . Besides, the known theoretical flux trapping investigations dealt only with simple single-layer film structures. The real digital circuits contain from 6 to 9 layers of metallization. Properties of such circuits are extremely sensitive to differences between critical temperatures of superconducting layers and critical temperatures of vias connecting them.
We believe that practical multilayer structures could and should be analyzed theoretically despite these complications. The practical value of theoretical modeling of flux trapping may be limited because of the large number of combinations of various, often not fully known, parameters and the high sensitivity to some of them. It could well happen that experimentation with flux trapping in multilayer structures will be a more practical solution to the problem. We are confident that test circuits for analyzing flux trapping effects must become a common test structure, similarly to those that measure the resistive properties of normal films and the inductive properties of superconductor films. The circuits and techniques developed in this work are significant steps in this direction.
