Abstract-An integrated methodology combining redundant clock tree synthesis and pulse-clocked latches mitigates both SEU and SET with reduced power consumption. The approach utilizes commercial CAD tools. An Advanced Encryption Standard (AES) engine implemented with the proposed design is compared to a previous design with non-redundant clock trees and local delay generation. The proposed approach reduces energy per operation by 18% over an improved version of the prior approach, with negligible area impact.
I. INTRODUCTION
SOFT-ERROR mitigation has become increasingly important for aerospace, safety-critical, and commercial integrated circuits (ICs) as process scaling has reduced the critical charge required to produce an upset [1]. This has required that memories be protected by error detection and correction (EDAC), leaving flip-flop (FF) and latch single-event upsets (SEU) and combinational logic single-event transients (SET) as the primary issue. Close proximity of devices allows simultaneous multiple node charge collection (MNCC) [2], which requires spatial separation of redundant nodes in hardened circuits.
A. Motivation
Power dissipation is potentially the most important application specific IC (ASIC) circuit design factor. The clock distribution network and sequential circuits consume as much as 40% of the overall chip's power budget [3], [4]. Radiation hardening by design (RHBD) provides high reliability using commercial foundry processes but often at high power cost. Temporally hardened flip-flops, whether the delay is placed in the feedback loop or in delay filters, require many delay circuits in a given design [5], [6], [7]. However, intentionally producing large delays is difficult, requiring very low current, i.e., high threshold voltages or current starving, which exacerbate SET duration or prohibit low voltage operation. Alternatively, large capacitance increases delay, but commensurately increases power consumption. Finally, some delay based designs, e.g., those using temporal filters, still have single node vulnerability at the outputs [8].
Using multiple temporally separated clocks driving voted TMR FFs was originally proposed by Mavis and Eaton [7] and subsequently discussed in [9]. In these designs, redundant clocks with rising edges separated in time provide the temporal sampling. These designs, by separating the clocks, provide hardness to all but clock root SETs. Pulse-clocked latches can simulate edge-triggered FFs. They save substantial clock and sequential circuit power, generally over 40% if done correctly, at the expense of greater hold time requirements.
This work was sponsored by Space Micro Corp.
A previous pulse-clocked latch based temporally hardened FF design [4] used a single global clock, but shared the clock delay elements across multiple latches to amortize area and power impact. However, the power savings could be greater, as the power improvement over the master-slave FF based design following [7] is 30%. Moreover, it is difficult to add clock gating, which is essential to low power operation, to such approaches without increasing the soft-error rate.
B. Contribution of this work
In this paper we investigate the power savings provided by using multiple clock trees rather than local filtering. We also examine the impact of resulting clock skew by comparing to a singly clocked design using the approach in [4] . An integrated clock and pulse clocked latch based methodology is proposed that provides full temporally hardened SET and SEU mitigation with low power. The proposed scheme can be integrated into existing commercial CAD flows with little additional effort. A soft-error hardened AES design is implemented with the proposed scheme using standard CAD tools and the results are compared with previous implementations. Results show that the proposed design reduces energy per operation over a design with local clock delays inside a multi-bit latch macro by 18%, and with negligible area penalty. Analysis of the resulting clocks on a realistic ASIC circuit block (an AES encryption engine) shows that the clocks have higher quality with the triplicated trees than local generation.
C. Paper Organization
Introduction, motivation and prior work on this topic comprise Section I. Section II presents proton testing results of the latch macros with local delay generation. Section III then presents the proposed integrated clock and temporal latch design. Section IV uses the proposed design in example AES implementations built with standard commercial CAD tools. Section V analyzes the power, performance and SET immunity of the proposed design, and compares it with the previous work. Section VI concludes this paper.
II. BACKGROUND AND PRIOR WORK
Single-event upsets (SEUs) in the storage nodes of a sequential element were the primary concern in soft-error mitigation until recently. With the reduced capacitance and drive per scaled transistor, as well as increasing clock frequencies, SETs in the combinational logic are becoming more important. Voltage glitches produced by the SETs may be captured by the sequential elements [1]. Higher clock rates, providing more sampling edges, as well as fewer logic gates per pipeline stage compared to SET duration, increase this likelihood. Many existing hardened latch and FF designs, e.g., the dual interlocked cell (DICE) [10] and the built-in soft error resilience (BISER) FF [11], mitigate SEUs but not SETs at their D or clock inputs. Simultaneous multiple node charge collection (MNCC) from a single charge track has been able to thwart hardened latch redundancy for some time. Poorly designed 'hardened' FFs with insufficient critical (redundant) node spacing have demonstrated upset rates similar to those of unhardened FFs [12].
SEU and SET hard sequential elements can be produced by sampling the data into three separate sequential circuits that are controlled by temporally separated clocks, as initially proposed in [7]. By providing a delay between the clocks greater than the expected SET duration, any SET at the D input will be captured by at most one FF. A majority gate at the output samples the three data copies and provides the correct output. This design is only hard to SETs on the D input. Any SET on a non-redundant clock input will generate false edges and may capture wrong data. If two FFs capture the same wrong data, the output is incorrect. Thus, our original design used a single clock tree and local delays, which we believed would produce the best skew.
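The temporal sampling argument above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: the function names, the idealized rectangular transient, and the numeric values (other than the idea of three spaced sampling edges and a majority vote) are hypothetical and not taken from [7].

```python
def d_input(t_ps, set_start_ps, set_width_ps, good_value=1):
    """Value seen on D at time t_ps: inverted while the (idealized) SET is active."""
    in_glitch = set_start_ps <= t_ps < set_start_ps + set_width_ps
    return (good_value ^ 1) if in_glitch else good_value


def majority(a, b, c):
    """Two-of-three majority vote on single bits."""
    return (a & b) | (a & c) | (b & c)


def sample_tmr(set_start_ps, set_width_ps, clock_spacing_ps, first_edge_ps=1000):
    """Sample D at three edges spaced clock_spacing_ps apart, then vote."""
    edges = [first_edge_ps + i * clock_spacing_ps for i in range(3)]
    copies = [d_input(t, set_start_ps, set_width_ps) for t in edges]
    return copies, majority(*copies)


# SET shorter than the clock spacing: at most one copy is wrong, the vote is correct.
copies_ok, voted_ok = sample_tmr(set_start_ps=900, set_width_ps=400, clock_spacing_ps=600)

# SET longer than the clock spacing: two copies are wrong, the vote fails.
copies_bad, voted_bad = sample_tmr(set_start_ps=900, set_width_ps=800, clock_spacing_ps=600)
```

The second case shows why the clock separation must exceed the expected SET duration: once a transient spans two sampling edges, majority voting can no longer recover the correct value.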
A. Temporal Pulse Clocked TMR Latches
A hardened temporal pulse-clocked latch (TPL) based flip-flop is shown in Fig. 1a [4]. The TPL has three redundant pulse-clocked latches acting as flip-flops. The latch capture windows are controlled by three redundant pulse-clocks spaced in time, so that the latches sample at different times to mitigate SETs at the D input. Clock delay and pulse generation for 16 TMR latches is local in a multi-bit FF configuration. A single global clock is used, as mentioned above, to provide good clock quality. To avoid global clock SETs propagating to all the redundant clocks, delay filters drive the local clocks before the pulse generators. These mitigate transients of up to the delay element delay, $t_\delta$ (here 600 ps). The TMR ClkA, ClkB, and ClkC signals are produced by cascaded delay filters (see Fig. 1a). A SET on the global clock is removed by the C-elements, which also filter upsets at the delayed clocks D1CLK, D2CLK, or D3CLK. A SET on the C-element output delayed clocks will propagate, but only to one of the three 16-bit latch sets, where it is corrected by the majority voting.
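The filtering action of a delay element followed by a C-element can be sketched as a discrete-time behavioral model. This is an idealized illustration, not the circuit of [4]: the time step, the waveform helper, and the function names are assumptions, and only the 600 ps delay comes from the text. The C-element output follows its inputs when the clock and its delayed copy agree, and holds its state otherwise.

```python
def c_element_filter(signal, delay_steps, initial=0):
    """Filter a sampled waveform with a C-element fed by the signal and a delayed copy.

    The output changes only when the direct and delayed inputs agree;
    otherwise the previous output is held.
    """
    out = []
    state = initial
    for i, now in enumerate(signal):
        delayed = signal[i - delay_steps] if i >= delay_steps else initial
        if now == delayed:
            state = now
        out.append(state)
    return out


def pulse(total_steps, start, width):
    """Hypothetical helper: a single high pulse in an otherwise low waveform."""
    return [1 if start <= i < start + width else 0 for i in range(total_steps)]


STEP_PS = 50                   # assumed simulation time step
DELAY_STEPS = 600 // STEP_PS   # 600 ps delay element, as quoted in the text

short_glitch = pulse(60, start=10, width=8)   # 400 ps transient on the clock
real_clock = pulse(60, start=10, width=30)    # 1500 ps legitimate high phase

filtered_glitch = c_element_filter(short_glitch, DELAY_STEPS)
filtered_clock = c_element_filter(real_clock, DELAY_STEPS)
```

In this model the 400 ps transient never appears on both C-element inputs at once, so it is removed, while the legitimate clock phase passes with its width preserved and both edges shifted by the filter delay.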
SEUs or captured SETs on the D inputs of individual latches are mitigated by majority voting the QA, QB and QC outputs to generate a single corrected Q. The layout spatially separates critical nodes to prevent MNCC induced upset of multiple latches [4]. Referenced to the incoming global clock, the setup time is $-(\delta + T_{PW} - T_{SU})$, where $\delta$ is the delay filter (i.e., delay element and C-element) delay, $T_{PW}$ is the pulse-clock width, and $T_{SU}$ is the latch D input setup time to the pulse falling edge. The most significant drawback of this approach is the large hold time required for the last (C copy) flip-flop to correctly capture. In normal operation, Q changes after the B copy PCLK rising edge (Fig. 1b), which is well before the C copy PCLK asserts, creating a long hold time. Using pulse-clocked latches increases this hold time further, by the width of the latch capture window, resulting in a total required hold time of approximately $3(\delta + T_{PW})$. The FF dead-time is $\delta + T_{SU} + T_{CLK2Q}$, where $T_{CLK2Q}$ is the latch delay from clock input to Q output.
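To make the timing expressions above concrete, the following sketch evaluates them for one set of numbers. Only the three formulas come from the text; the function name and all numeric values are illustrative assumptions (in particular, δ is set to 600 ps as a placeholder that ignores the C-element delay, and the pulse width, setup time and clock-to-Q delay are hypothetical).

```python
def tpl_timing(delta_ps, t_pw_ps, t_su_ps, t_clk2q_ps):
    """Evaluate the TPL timing expressions, all referenced to the global clock."""
    setup = -(delta_ps + t_pw_ps - t_su_ps)          # setup time (negative)
    hold = 3 * (delta_ps + t_pw_ps)                  # approximate required hold time
    dead_time = delta_ps + t_su_ps + t_clk2q_ps      # FF dead-time
    return {"setup_ps": setup, "hold_ps": hold, "dead_time_ps": dead_time}


# Hypothetical values: delta = 600 ps, T_PW = 150 ps, T_SU = 50 ps, T_CLK2Q = 100 ps.
timing = tpl_timing(delta_ps=600, t_pw_ps=150, t_su_ps=50, t_clk2q_ps=100)
```

With these assumed inputs the hold requirement is more than two nanoseconds, which illustrates why the hold time is described as the most significant drawback of the approach.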
B. Test chip and Experimental Hardness Results
The TPL design was fabricated on a 90-nm low standby power foundry process. Test structures comprising parallel shift registers built with the proposed TPL, as well as with unhardened standard foundry library FF designs, were included. The designs were tested by broad beam proton irradiation (Fig. 2) at the UC Davis Crocker Laboratory with 63 MeV protons. The die was not lidded (it is a COB, as evident in Fig. 2), but the plastic protective covering was left in place during irradiation.
The FFs were tested in SEU-only and in SEU- and SET-sensitive conditions, i.e., with the clock held low during irradiation and clocked, respectively, at $V_{DD}$ = 1 V. The primary goal of the testing was another design, so the results are limited. In static testing, however, this design had no failures with a flux of protons/cm$^2$, while the unhardened designs, using the foundry supplied flip-flops, had 2 errors. In dynamic operation with a flux of $70.14 \times 10^{6}$ and total fluence of $41.8 \times 10^{9}$ protons/cm$^2$, the unhardened designs exhibited 14 errors while the TPL hardened designs again had none. Taking into account possible statistical error (i.e., proportional to the square root of the error count), the cross-section of the proposed design in dynamic operation is at least 90.25% less than that of the baseline foundry FF. Static operation has insufficient baseline failure data to make such an estimate. However, given the spatial separation of redundant latches, we believe that the improvement should be even greater, since two latches would have to collect charge.
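The text does not spell out how the 90.25% bound is obtained. One plausible reading, sketched below, reproduces the figure: take the baseline error count minus its square root as a lower bound, and treat the hardened design's zero observed errors as at most one error. Both steps are assumptions, as are the function name and the interpretation of the flux as a per-second quantity in the exposure time estimate.

```python
import math


def min_improvement(baseline_errors, hardened_errors):
    """Lower bound on the relative cross-section reduction at equal fluence."""
    # Assumption: statistical uncertainty of sqrt(N) on the baseline count.
    baseline_low = baseline_errors - math.sqrt(baseline_errors)
    # Assumption: zero observed errors is bounded above by one error.
    hardened_high = max(hardened_errors, 1)
    return 1.0 - hardened_high / baseline_low


improvement = min_improvement(baseline_errors=14, hardened_errors=0)
improvement_percent = round(improvement * 100, 2)

# Assumption: the quoted flux is in protons/cm^2/s, so fluence / flux is seconds.
exposure_s = 41.8e9 / 70.14e6
```

Because both designs saw the same fluence, the ratio of error counts equals the ratio of cross-sections, so no fluence term appears in the bound.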
While protons have been shown capable of upsetting SRAM cells via direct ionization [13], for the energy used here, the proton LET in silicon is under $9 \times 10^{-3}$ MeV-cm$^2$/mg and thus cannot upset these 90-nm latches, which have higher capacitive loading than SRAM cells. Thus the upsets are via indirect mechanisms, either elastic or inelastic scattering. The cross-sections of these interactions are about five orders of magnitude lower than that of direct ionization [14], leading to the low failure count in the baseline FF.
C. Power Dissipation
The TPL saves approximately 30% power compared to a similar design using master-slave flip-flops rather than pulse-clocked latches. The delay filters are essential to protecting the flip-flop against SETs on the global clock but dissipate significant power. To address this, we modified the local delay and pulse generator design by improving internal slopes and by eliminating the first delay element D1 and generating CLKA directly from GCLK. A particle strike on GCLK propagates to CLKA, but is filtered out before CLKB and CLKC generation. This improvement reduces the per flip-flop energy per operation by 6% at 100% clock and 25% data activity factor.
While this design is hard and improves dramatically on the delay circuits per flip-flop required in the DF-DICE approach [15], the large number of delay elements used throughout the chip suggests use of redundant clocks. The remainder of this paper focuses on reducing the power consumption by using the multiple clocks as in [7], [9]. Since neither of these attempted to quantify the impact of redundant clock trees, we focus on that in the rest of this paper.
III. MULTIPLE CLOCK IMPLEMENTATION

A. TPL with Skewed Redundant Clocks
The proposed design, shown in Fig. 3a, uses redundant skewed clock trees with local pulse generators and majority voted TMR pulse-clocked latches that retain the local pulse generation. Local pulse generation ensures good pulse fidelity and is required for clock gating as shown below. CLKA is essentially the primary clock, while CLKB and CLKC are delayed at the clock root using temporal filters. The modified timing windows for this design are shown in Fig. 3(b). A conservative required clock was determined by Monte Carlo SPICE simulations. Using a single global delay and triplicated clock trees also makes a digitally controlled variable delay straightforward. This allows adjustable SET hardness as well as adjustment of the delays to the as-fabricated corners, as shown in Fig. 4. Note that there is almost no penalty to protecting this root delay generation from upstream clock (PLL divider) SETs, as only two C-elements are required.
B. Physical Design for MNCC Robustness
The physical layout of the 16-bit FF macro is illustrated in Fig. 5. There are two groups comprised of four columns of two FFs with their constituent components vertically interleaved, surrounding the pulse-clock generation unit. Vertical interleaving provides an intervening N-well between potential upsetting nodes. The N-well bias at $V_{DD}$ provides a good charge sink. The local clock generation can be compressed over that in [4] due to the elimination of the delay elements, resulting in a 5% 16-bit macro size reduction. Decoupling capacitance is used as filler between pulse generators, so two do not collect charge from a single impinging particle.
C. Clock Gating
In the previous design where the temporal clocks are generated locally, clock gating is difficult. A clock gater residing in the clock tree may have a soft-error at the gater control input or in the latch within the gater itself, propagating the incorrect clock to all redundant copies. We found that putting a gater within each 16-bit FF macro was also problematic, leaving vulnerabilities and increasing the macro power dissipation.
The redundant clocks simplify clock gating dramatically. As shown in Fig. 6a, redundant latches in the standard configuration ensure that the enable hold times are met. While the latches are still vulnerable to both SEU and SET at their controlling inputs, the overall system is robust to errors on a single clock copy, as shown in Fig. 6b. If the clock gater controlling signals are generated by rising edge clocks, there is a significant hold time that is, again, difficult to meet at the third (ClkC) latch closing edge. However, it is less than the hold time required by the receiving TPL flip-flop C copies, since there is no pulse generator.
IV. IMPACT OF THE TMR CLOCKING
To study the clock quality and compare the power dissipation of the local delay vs. global delay with TMR clocks approaches, we synthesized, placed and routed an Advanced Encryption Standard (AES) engine using the two TPL schemes. The AES engine uses a 256-bit key and 128-bit data and is fully pipelined, as described in [16]. We used the same foundry low standby power 90-nm process as the TPL test chip above. Standard commercial CAD tools (Cadence Encounter, Synopsys PrimeTime and Nanotime) were used for automated place and route and timing analysis, respectively. Foundry cells were used for combinational logic. As mentioned in Section III.B, the TPL multi-bit macros have the same footprints and pin locations; only the local clock generation is different. The pulse widths required for robust pulse-clocking of the latches were determined by Monte Carlo analysis with the foundry variation parameters [4]. The three clock trees are spatially separated during physical design to ensure MNCC hardness using cell halos, which keep other clock cells from being within a specified distance. We remove the halos after freezing the clock trees post-CTS optimization, freeing the space for logic optimization. Both designs were implemented using the same floorplan (844 μm by 842 μm) and timing constraints, with a clock frequency of 200 MHz. Figs. 7 and 8 show the clock trees synthesized for the two implementations of an AES engine. The two designs, with single clocks and FF macro integrated delays, and with the triplicated skewed clocks, are compared in Table I. Density is very slightly impacted by the use of TMR clock trees, although the total number of buffers was not large. The clock skew, as we originally feared, is substantially greater in the TMR clocks. However, this is mitigated by removing random skew variation in the delay filters, which is greater, as shown below. Note also that the total FF area is reduced by more than the increase in clock buffer area. The overall area increase is attributed to the clock routing impact.
V. ANALYSIS
A. Power Dissipation
Power dissipated in the clock tree is linearly related to the number of clock sinks in a design [17]. Thus, ideally, clocking three trees, each driving one third of the load, or clocking a single tree should require almost the same amount of energy. However, clock tree synthesis (CTS) minimizes skew by over and under driving nodes, and is constrained in placement, resulting in deviations from that ideal. In our experiments here, using three redundant trees consistently increased the clock tree power dissipation by about 60% over using one tree. Fig. 8 shows considerably more clock routing than Fig. 7, since each macro must now receive CLKA, CLKB, and CLKC. However, the TMR clock tree approach still saves considerable power dissipation overall, when analyzed on an energy per bit basis, due to the elimination of the many delay circuits. The integrated clock and TPL approach reduces the overall energy by 15% at 50% data activity factor and by 18% at vanishing data activity factors (see Table II). Table II also compares these schemes with the BISER flip-flop [11], which uses two redundant flip-flops to provide SEU, but not SET, mitigation. The BISER uses five latches (one jam latch at the output is required to save state when one of the FFs mismatches) per bit of storage. Since our design uses three latches per stored bit, the BISER provides an interesting comparison point. Nonetheless, the BISER flip-flop dissipates around 38% less active power compared to our design at vanishing data activity factors and about 7% less at high data activity factors. This is due to the larger clock loading in our design from the pulse generators and clock buffers, which dominate the energy consumption at low data activity factors. Our proposed TMR clocks with TPL scheme cannot meet the energy per bit of this design, but does provide complete SEU and SET mitigation.
B. Area
The AES implementation shows that the single clock tree uses 25% fewer cells than the TMR clock trees combined. However, this is a negligible impact on the overall block area. The resulting original local delay TPL and proposed TMR clock implementations have similar area utilization of 74.6% and 72.1%, respectively. The increase in area due to the added cells in the redundant clock tree can be compensated by 5% smaller macros. For these experiments, to focus on the clock impact solely, we kept the TPL macros the same in these design trials, as stated in Section IV.
C. SET Hardness and Delay Variability Impact
SET hardness can be characterized by the width of the SET glitch that can be mitigated by the design, here the time delay between the redundant clock pulses. This SET duration hardness varies from latch to latch due to process variability, exacerbated by systematic clock skew between the TMR clock tree endpoints (Fig. 9). We investigated this using Monte Carlo simulated variability of the delay elements, comparing it with the systematic skew between clocks driving the same multi-bit TPL macros.
The delay circuits have a mean delay of 615.4 ps and a standard deviation (σ) of 33.4 ps. The delays due to the skew in the clock trees, however, have a mean of 600.2 ps and a standard deviation (σ) of 14.15 ps. Thus, the clock tree skew, at least for this modest sized design, is less than the variability introduced by many separate delay circuits. The worst-case (i.e., smallest) separation between two pulses is 523 ps and 561 ps for the local delay and global delay generation (TMR clock) designs, respectively. This is due to the relatively small size of the delay circuits required to minimize their energy contribution.
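As a quick check on these numbers, the worst-case separations can be expressed as a distance below the mean in units of the reported σ. This is only an illustrative calculation on the figures quoted above; the function name is hypothetical and no distribution shape is assumed.

```python
def sigmas_below_mean(mean_ps, sigma_ps, worst_case_ps):
    """How many sigma the worst-case (smallest) separation lies below the mean."""
    return (mean_ps - worst_case_ps) / sigma_ps


# Local delay generation: mean 615.4 ps, sigma 33.4 ps, worst case 523 ps.
local_margin = sigmas_below_mean(615.4, 33.4, 523.0)

# Global delay with TMR clock trees: mean 600.2 ps, sigma 14.15 ps, worst case 561 ps.
tmr_margin = sigmas_below_mean(600.2, 14.15, 561.0)

# Worst-case SET width still covered, as a fraction of the mean separation.
local_fraction = 523.0 / 615.4
tmr_fraction = 561.0 / 600.2
```

Both worst cases sit roughly 2.8σ below their means, so the larger guaranteed separation of the TMR clock design follows directly from its smaller σ rather than from a different tail behavior.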
One key advantage of the proposed design is that the impact of variability on hardness depends on the relative clock skew, which can be controlled to a higher degree than delay variations across process corners and due to random variations. Since only two delay elements are required in the top level tree, these can be arbitrarily large to essentially eliminate random variations, without significantly affecting the overall design power dissipation. Another key advantage of the TMR clocks is that the SET duration the design can mitigate can be easily calibrated or adjusted. There is only one delay generation circuit, so providing adjustability is almost trivial.
VI. CONCLUSIONS
This paper proposes an integrated approach to SEU and SET soft-error robustness by using skewed TMR clocks driving TMR pulse-clocked latches. The latter provide low energy per bit storage, even with redundancy that can mitigate SEU and SET throughout the resulting logic. Using TMR clocks rather than local delay generation provides a minimum overall power savings of 15%, resulting in power dissipation and area that approach those of previous designs that mitigate SEU only. The approach is also amenable to design time or even run time adjustment of the SET widths to be mitigated. The proposed scheme is the lowest power published approach to SEU and SET soft-error mitigation.
