As scaling of CMOS slows down, there is growing interest in alternative technologies that can improve performance and energy-efficiency. Superconducting circuits based on Josephson Junctions (JJs) are an emerging technology that provides devices which can be switched with pico-second latencies and consume two orders of magnitude lower switching energy compared to CMOS. While JJ-based circuits can operate at high frequencies and are energy-efficient, the technology faces three critical challenges: limited device density and lack of area-efficient technology for memory structures, low gate fanout, and a new failure mode of flux traps that occurs due to the operating environment.
INTRODUCTION
The slowdown in Moore's law limits the energy-efficiency and performance that can be obtained with general purpose computers. To bridge the gap between available performance and application demand, system designers are increasingly moving towards building application-specific accelerators [17, 26]. While accelerators provide significant performance and energy-efficiency gains, their continued performance growth is also affected by CMOS scaling limits. Marginal improvements in CMOS device density and performance force us to investigate alternative technologies that can provide improved performance and energy-efficiency. Superconducting technology is a potential candidate but is not yet mature enough to support complex designs. This paper presents a case for accelerators based on superconducting technology.
What is the Technology? Certain metals exhibit superconductivity (zero electrical resistance) at extremely low temperatures. This property can be leveraged to build energy-efficient and high-performance switching devices known as Josephson Junctions (JJs), which can serve as building blocks for constructing logic and memory circuits. JJs have a switching delay of about 2 pico-seconds and a switching energy about five orders of magnitude smaller than CMOS. However, to achieve superconductivity, JJs must be operated at temperatures of a few Kelvin (4K). This requires cryogenic coolers that typically consume 300W of power for every 1W dissipated at 4K. Despite the high cooling cost, JJs can still enable devices that have 100x lower energy consumption than CMOS even with the cooling overhead [20].
What are the Challenges? The key challenge in building a JJ-based design is limited logic and memory density. For existing process technology, JJ-device density lags CMOS by 1000x [32, 49]. Although JJ density is projected to grow [24], near-term JJ technology may not be able to close the 1000x density gap between CMOS and JJ devices. The second challenge is higher device complexity, which results from the limited output driving capacity of JJs. JJ logic gates can drive at most one gate without requiring extra drivers known as Josephson transmission lines (JTLs). Each JTL costs 2 JJs, exacerbating the density problem. The limited fanout also results in different design trade-offs for accelerators built in superconducting technology compared to CMOS. The third challenge is that the reliability of JJ devices is susceptible to magnetic flux-trapping and manufacturing defects. These defects can result in intermittent faults. In this paper, we study JJs with near-term fabrication technology for building accelerators and make the following contributions.
Contribution-1: Study of Superconducting Accelerator: Given the lack of dense memory technology, accelerators built with JJs are likely to be restricted to applications with a small working set size and high computational intensity. In this paper, we focus on a SHA-256 accelerator used in blockchain applications such as bitcoin mining. This application fits well with the JJ constraints as the compute intensity is exceptionally high with a tiny memory footprint. Also, blockchain applications have a concrete figure-of-merit for both performance (Giga-hashes per Second, or GH/s) and energy-efficiency (Giga-hashes per Joule, or GH/J). Furthermore, existing bitcoin ASICs serve as a highly optimized baseline, facilitating a technology-to-technology comparison to evaluate JJ-based fixed-function accelerators. We use Goldstrike 1 [3] as the baseline CMOS design (footnote 1).
Contribution-2: Technology-Aware Design: JJ-based adders have different trade-offs compared to CMOS. For example, to overcome the limited fanout challenge, we choose an adder that minimizes the overall JTL count by fusing consecutive additions. The baseline design also incurs significant overheads from registers that store temporary values and from wide buses. To overcome this, we leverage the predictable production and consumption of intermediate variables and use delay lines to synchronize the producer and consumer stages. Redesigning the accelerator to JJ-technology specific constraints improves performance by 1.8x and increases energy-efficiency from 10.6x to 12.4x compared to the CMOS implementation.
Contribution-3: Fault Mitigation and BTWC: JJ technology has a distinct fault mode arising from the operating conditions and a fundamental property known as flux trapping. Flux trapping results in correlated, large-granularity faults that last longer than transient faults but are not permanent. To the best of our knowledge, this is the first paper to mitigate faults in JJ technology through architecture-level solutions, such as redundancy and sparing. To improve the reliability of the proposed accelerator, we design a fault-tolerant SHA-256 engine by provisioning an additional pipeline stage and a bypassing mechanism that can detect and protect the accelerator against large-granularity faults. Our fault-tolerant design incurs a minor storage overhead; however, it can be leveraged to improve energy-efficiency. In superconducting circuits, power is a product of critical current ($I_c$) and operating frequency, and a sufficient critical current is essential for the correct operation of a circuit. This trade-off between $I_c$ and error rate can be leveraged to operate the accelerator at a Better-Than-Worst-Case (BTWC) operating point by using the fault-tolerance circuitry and reducing the bias current from 38 to 10 micro-amperes. This improves the overall energy-efficiency of the superconducting accelerator to 46x compared to the CMOS implementation.
Contribution-4: Methodology for Estimating Area, Performance and Power of Superconducting Accelerators: Estimation of performance, power, and area for superconducting logic is difficult due to the lack of automated design tools. Furthermore, standard cells and design rules in superconducting logic families are fundamentally different from CMOS. This limits the direct usage of standard CMOS tools to perform a design space exploration for superconducting technology. To overcome this problem, we use open-source back-end design tools to incorporate design constraints specific to superconducting logic. In addition to the modified design tool, we use analytical models to calculate performance, power, and area for superconducting logic. We introduce a methodology to explore the design space of accelerators built in superconducting logic.
Footnote 1: Bitcoin mining is a competitive industry and state-of-the-art designs are proprietary. The Goldstrike 1 design is available publicly, enabling us to make a fair comparison with CMOS. Given that the design details of commercially successful bitcoin miners, such as the 16nm Antminer S9 [8], are not disclosed, we are unable to evaluate them directly. Nonetheless, we do compare the energy-efficiency of our design with the reported energy-efficiency of the Antminer S9 in Section 5.6 and observe a 15x improvement.
SUPERCONDUCTING TECHNOLOGY
Josephson Junction Device
Certain metals exhibit zero resistance to the flow of current at cryogenic temperatures, a phenomenon known as superconductivity. Superconductivity can be achieved by cooling metal wires below their critical temperature and can be leveraged in building a switching device called a Josephson Junction (JJ). A JJ consists of a thin barrier between two superconducting wires (shown in Figure 1 (a)) that allows electrons to tunnel through even in the absence of an applied voltage. The tunneling can be controlled by changing the input current. For example, when the current flowing through the device exceeds its critical current ($I_c$), a JJ switches from the superconducting to a resistive state. Conversely, it goes back to the superconducting state if the current is reduced below $I_c$.
Josephson Junction as Switch
In a superconducting loop with a JJ, the magnetic flux ($\phi$) is quantized, i.e., it can only take integer multiples of a single flux quantum (SFQ) ($\phi_0$). The presence or absence of an SFQ can be used to represent the digital values "1" and "0" respectively. When a JJ switches from the superconducting to a resistive state, the magnetic flux through the superconducting loop containing the JJ changes by one flux quantum, generating an SFQ pulse of about 1 pico-second duration and 2 millivolt magnitude, as shown in Figure 1.
JJs are almost ideal digital switches characterized by high-speed switching and ultra-low power dissipation. SFQ pulses can be as narrow as one pico-second, making it possible to clock circuits at very high frequencies. Superconducting passive transmission lines (PTL) are also able to transmit SFQ pulses with extremely low losses at 4K. These lossless interconnects and the low switching energy of Josephson junctions ($2 \times 10^{-20}$ J) enable very low power dissipation.
Superconducting Logic Gates Using RQL
Reciprocal Quantum Logic (RQL) uses JJ switches to encode a digital "1" as a pair of SFQ pulses of opposite polarity and a "0" as the absence of SFQ pulses. The RQL family consists of two universal gates: the AND-OR gate and the logical A-AND-NOT-B (referred to as A-NOT-B) gate that enables the design of other gates and circuits [20, 43]. For details of RQL logic gate design please refer to [20]. Since its introduction in 2011, several RQL circuits with up to 72,800 JJs have been demonstrated [19, 21, 22, 47].
Memory Challenges for JJ Technology
[14, 52]. We observe that memory operations are less energy efficient compared to arithmetic and logic operations for JJ technology. Furthermore, building memory takes more area in JJ technology as there is no dense memory solution currently available in the superconducting domain. Researchers are exploring solutions such as hybrid-JJ-CMOS memory, Josephson magnetic random access memory (JMRAM), but their capacity is likely to remain severely limited compared to conventional memory technologies. These factors constrain the potential applications to computationally intensive ones that have small working sets. We explore one such application for accelerator design in superconducting technology.
SUPERCONDUCTING ACCELERATOR
Superconducting circuits offer high energy efficiency. However, with limited device density and memory capacity, designing superconducting general purpose computers is incredibly challenging. Until the technology reaches the maturity to manufacture and test billion Josephson Junctions per cm$^2$, which is typically required for general-purpose computing, we can leverage the technology to build accelerators. We study the SHA-256 application for building accelerators using JJ technology. We provide an overview of the application, the baseline CMOS design, and our JJ-based design. We optimize the JJ design for performance (in Section 4) and reliability (in Section 5). We use the methodology and workflow described in Section 6 for our evaluations (all numbers include a cooling overhead of 300x).
Background on Bitcoin-Mining
A blockchain is a decentralized public ledger of transactions that maintains the validity of transactions by a distributed consensus mechanism [42]. In bitcoin, the process of authenticating transactions is called mining and involves searching for a 32-bit key, called the nonce, such that when it is combined with the message that lists the transactions, the double SHA-256 hash of the block (message + key) falls within a certain range. A bitcoin miner maximizes profit by trying multiple keys as fast as possible, as the probability of finding the key and getting rewarded is directly proportional to the total hashrate. However, repeated SHA-256 computation requires substantial power due to its high computational intensity. Thus, net profit depends on both reward and operating costs [44]. Hence, energy-efficiency (in GH/J) is the figure-of-merit that is optimized to increase profits.
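The nonce search described above can be illustrated with a minimal software sketch. It is a toy model, not the real Bitcoin block format: the function names, the way the 4-byte nonce is appended to the message, the big-endian interpretation of the digest, and the easy target value are all assumptions made for illustration.

```python
import hashlib


def double_sha256(data: bytes) -> bytes:
    """Apply SHA-256 twice, as in the double hash used for mining."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def mine(message: bytes, target: int, max_nonce: int = 2**32):
    """Try 32-bit nonces in order until the double hash falls below `target`.

    Assumption: the block is modeled as message + 4-byte nonce, and the
    digest is read as a big-endian integer. Real block headers differ.
    """
    for nonce in range(max_nonce):
        digest = double_sha256(message + nonce.to_bytes(4, "little"))
        if int.from_bytes(digest, "big") < target:
            return nonce, digest
    return None  # no nonce in the search space met the target


# Hypothetical easy target: the digest must start with 12 zero bits.
EASY_TARGET = 1 << (256 - 12)
```

Every candidate nonce costs two full SHA-256 computations, which is why hashrate and energy per hash dominate the economics described above.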
Background on SHA-256 Algorithm
The SHA-256 computation of a message is carried out as shown in Figure 2. The message scheduler unit (MSU) takes a message, splits it into 512-bit chunks, and schedules each chunk to the compression function generator (CFG) over 64 rounds. The CFG uses the data and predefined constants to generate a 256-bit hash after every 64 rounds, which is collected by the intermediate hash collector (IHC). When the whole message has been processed, the IHC registers hold the final hash.
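For reference, the data flow above can be written as a compact software sketch of the standard SHA-256 algorithm. The mapping of `message_schedule` to the MSU, `compress` to the CFG, and the running `state` to the IHC is our reading of the description and not the hardware design itself. The round constants and initial hash values are derived here from cube and square roots of the first primes, as the SHA-256 standard defines them.

```python
import math

MASK = 0xFFFFFFFF  # all arithmetic is modulo 2**32


def _primes(count):
    found, n = [], 2
    while len(found) < count:
        if all(n % p for p in found):
            found.append(n)
        n += 1
    return found


def _icbrt(n):
    """Integer cube root by binary search (avoids floating-point error)."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


# First 32 fractional bits of cube roots (K) and square roots (H0) of primes.
K = [_icbrt(p << 96) & MASK for p in _primes(64)]
H0 = [math.isqrt(p << 64) & MASK for p in _primes(8)]


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


def message_schedule(chunk):
    """MSU role: expand one 512-bit chunk into 64 32-bit words."""
    w = [int.from_bytes(chunk[i:i + 4], "big") for i in range(0, 64, 4)]
    for t in range(16, 64):
        s0 = rotr(w[t - 15], 7) ^ rotr(w[t - 15], 18) ^ (w[t - 15] >> 3)
        s1 = rotr(w[t - 2], 17) ^ rotr(w[t - 2], 19) ^ (w[t - 2] >> 10)
        w.append((w[t - 16] + s0 + w[t - 7] + s1) & MASK)
    return w


def compress(state, w):
    """CFG role: 64 rounds over the working variables a..h."""
    a, b, c, d, e, f, g, h = state
    for t in range(64):
        big_s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        ch = (e & f) ^ (~e & g)
        t1 = (h + big_s1 + ch + K[t] + w[t]) & MASK
        big_s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        t2 = (big_s0 + maj) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    # IHC role: accumulate the round output into the intermediate hash.
    return [(s + v) & MASK for s, v in zip(state, (a, b, c, d, e, f, g, h))]


def sha256(message: bytes) -> bytes:
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded) % 64) % 64)
    padded += (len(message) * 8).to_bytes(8, "big")
    state = list(H0)
    for i in range(0, len(padded), 64):
        state = compress(state, message_schedule(padded[i:i + 64]))
    return b"".join(v.to_bytes(4, "big") for v in state)
```

Each round contains several chained 32-bit additions and only fixed-amount rotations, which is why adders, not shifters, dominate the hardware cost discussed later.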
Baseline CMOS Accelerator Design
Bitcoin mining ASICs are available commercially from different vendors today. State-of-the-art ASICs are fully custom designed at 16 nm or lower technology nodes and implement several design and algorithmic optimizations to increase the throughput (GH/s) and energy efficiency (GH/J). Also, bitcoin mining is a competitive industry and the designs of state-of-the-art industrial accelerators are often kept proprietary. To make a fair technology comparison for the same accelerator design, we use the publicly available Goldstrike 1 [3] design as our baseline. We also compare the energy-efficiency of our proposal with the reported energy-efficiency of the state-of-the-art AntMiner S9 in Section 5.6.
A hash engine contains two instances of the SHA-256 computation block. The SHA-256 algorithm uses 64 iterations, which can be pipelined. In Goldstrike 1, these iterations are fully unrolled for both hash computations, which leads to a 128-stage pipeline. Each pipeline stage comprises compression function generation (CFG) logic and a message scheduling unit (MSU). The hash collector compares the output hash with the target to be achieved and, if the criterion is met, sends the result to the host (Figure 3 (a)).
Superconducting Accelerator Design
We propose a superconducting blockchain accelerator that operates at 4K temperature and communicates with a host at room temperature. The architecture of our hash engine is shown in Figure 3 (a). The host receives the incoming messages from the network and offloads them to the accelerator. The accelerator computes hashes for different nonce values and sends a message to the host when the network target is met. We port the CMOS Goldstrike 1 design to superconducting logic without any optimization.
The SHA-256 algorithm requires computation using predefined constants. In our fully pipelined design, each pipeline stage operates with a different 32-bit constant, which is tied off in the superconducting design to save resources. Since the rotations and shifts in the SHA-256 computation involve fixed rotate/shift amounts, our design only requires the signals to be routed appropriately.
Design Overview
Figure 3(a) shows an overview of our JJ-based implementation of Goldstrike 1, which is designed by simply porting the CMOS design to JJ technology. Based on our methodology described in Section 6.1, we compute the area (measured in JJ-complexity) for this design. The baseline design incurs significant overheads due to JTLs (buffers needed when a gate drives other gates). The analysis of JTL overheads for our design reveals that a gate drives on average about 1.5 gates, requiring 50% additional JTLs for fanout. We perform a design space exploration to best meet the requirements of JJ technology and present our results for a technology-aware design of the superconducting SHA accelerator in Section 4 (as shown in Figure 3 (b)). We deal with the reliability challenge in superconducting logic circuits and present a case for a reliable, fault-tolerant SHA accelerator in Section 5 (as shown in Figure 3 (c)).
Performance and Energy-Efficiency
In our design, 128 different values of the nonce are processed in the pipeline and a hash is generated every cycle once the pipeline is full. The critical path in our design has four adders in the CFG. We report the hashrate, power, and energy-efficiency in Table 2 for the accelerator using the methodology described in Section 6 for two design points, with ripple carry adders (RCAs) and Kogge-Stone adders (KSAs). An RCA is 3x more energy-efficient than a KSA, but a KSA has 30% lower latency. This enables us to compare two design points, one optimized for energy-efficiency and another optimized for performance. For the high-performance design, KSAs are used sparingly, only on the speed path, to save resources. The non-critical path adders are still designed as RCAs. Table 2 also compares the performance and energy-efficiency of the Goldstrike 1 accelerator designed with superconducting logic using the baseline CMOS-based architecture for the two different design points.
The JJ-based design that is implemented with only ripple-carry adders is 10x more energy efficient than the CMOS implementation; however, it has 37% lower performance. Using Kogge-Stone adders reduces the energy-efficiency to 6.5x while bringing the performance to within 10% of CMOS. We observe that the energy-efficiency of our design drops by almost one-third for the design optimized with KSAs, indicating that optimizing only for high speed can be detrimental to the overall energy-efficiency. However, both designs show that simply porting the accelerator from CMOS to superconducting logic provides a significant energy-efficiency improvement.
The contribution towards JJ-complexity (reported in Table 3) for our hash engine comes from adders, registers and other logic. We observe that the contribution towards JJ-complexity from adders is 50% for an RCA design, which increases to 67.7% for KSAs. The technology-aware optimizations discussed in Section 4 account for technology-specific constraints to further improve energy efficiency.
TECHNOLOGY-AWARE DESIGN
In this section, we illustrate the contrast between CMOS and JJ technology and discuss its impact on design and architectural decisions. We do so by focusing on two critical components of the SHA accelerator: adders and registers. We also discuss a way to optimize the communication for the accelerator.
Tradeoffs in JJ Adder Circuits
The proposed SHA engine uses 1200 adders, which account for more than 50% of JJ-complexity. Also, the critical path consists of four additions in the CFG unit. Thus, adders dominate the on-chip resources and overall latency, and optimizing them to improve timing and overall energy efficiency is essential. Typically in CMOS adder designs, latency and energy-efficiency improve at the expense of more transistors or complex connectivity. For example, a complex Kogge-Stone Adder (KSA) is faster and more energy efficient compared to a simple Ripple Carry Adder (RCA). However, JJ-based adders do not follow the same trends. Tree-based adders rely on complex communication patterns to improve the critical path from $O(N)$ to $O(\log_2 N)$ through complex logic, greater fanout and increased wiring density, the overheads of which are significant in JJ technology. For example, a JJ-based KSA improves performance, but it worsens the energy-efficiency [10, 12]. While designing the SHA engine, a combination of adders can be selected such that our design meets the baseline CMOS performance and maximizes energy-efficiency. To satisfy these criteria, we choose different design combinations of KSA and RCA as shown in Figure 4. Table 2 shows that replacing RCAs with KSAs improves timing but degrades the energy efficiency. Furthermore, even after replacing all four critical path RCAs with KSAs, the JJ-based design fails to meet the baseline performance.
Figure 4: Latency, energy, and energy-delay product for different adder designs normalized to RCA parameters
Fanout-aware Adder Design
Our goal is to meet timing without deteriorating the energy efficiency. Thus, we try to optimize our design such that JTL overheads are reduced. We observe that most CFG (in Figure 3(a)) additions are back to back without intermediate results being used elsewhere, thus making it possible to use carry save adders (CSA). An n-operand CSA computes the composite addition much faster than RCAs.
If $\delta_{FA}$ is the delay of a 1-bit full adder (FA), the latency of a CSA that adds $N$ $k$-bit numbers is given by $\text{Latency}_{CSA} = (k + N - 1)\,\delta_{FA}$. A CSA has lower fanout and does not require wiring between distant gates. When two critical path adders in the CFG and MSU are replaced by a 3-op CSA, the design has 1.2x the performance and is 1.25x more energy-efficient than our baseline. Hardware optimization proposals already exist to move the addition of variables $W_i$ and $K_i$ from the MSU to the CFG [48]. We propose a similar optimization and pre-compute this value in the $(i-1)$-th pipeline stage to be consumed in the $i$-th stage. This enables us to use a 4-op CSA in both CFG and MSU blocks to fuse 3 adds, resulting in a 1.67x performance improvement over the RCA baseline design and 1.44x better energy-efficiency. Table 4 lists the performance and energy-efficiency of the superconducting hash engine for different adder-optimized designs against the baseline design using all RCAs. A similar CMOS optimization uses multiple CSAs in parallel besides carry-lookahead adders [7], but we use these CSAs in conjunction with RCAs to have a more economical design in terms of JJ-complexity.
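To make the carry-save idea concrete, the following is a minimal bit-level sketch. It is a software model under stated assumptions, not the JJ circuit: the function names are hypothetical, operands are 32-bit words as in SHA-256, and `csa_latency` simply evaluates the latency expression $(k + N - 1)\,\delta_{FA}$ quoted in this section.

```python
MASK = 0xFFFFFFFF  # 32-bit datapath, as used by SHA-256


def carry_save(a, b, c):
    """3:2 compressor: reduce three operands to a sum word and a carry word.

    Each bit position is an independent full adder, so no carry ripples
    across positions and every gate talks only to its neighbours.
    """
    partial_sum = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return partial_sum & MASK, carry & MASK


def multi_operand_add(operands):
    """Fuse back-to-back additions: compress to two words, then add once."""
    ops = list(operands)
    while len(ops) > 2:
        s, c = carry_save(ops[0], ops[1], ops[2])
        ops = [s, c] + ops[3:]
    # A single carry-propagate addition finishes the fused sum.
    return sum(ops) & MASK


def csa_latency(n_operands, k_bits, delta_fa):
    """Latency model from the text: (k + N - 1) full-adder delays."""
    return (k_bits + n_operands - 1) * delta_fa
```

Under this model a 4-operand, 32-bit fused addition takes `csa_latency(4, 32, 1)` = 35 full-adder delays, and only the final step needs a carry chain.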
Reducing Registers Using Delay-Line
In each stage of the baseline design, the MSU and CFG use 16 and 8 32-bit registers respectively, accounting for about 35% of JJ-complexity. The contents of the registers are consumed by adders and other logic to produce an output in every stage, to be consumed by the subsequent stage. Our baseline design replicates all the registers in every stage, requiring a large number of JJs and a wide bus. In JJ technology, the JTL cost of wide buses and registers is high. [6]. The local register file enables a higher clock rate, whereas shared registers improve the critical path significantly, especially for heavily pipelined designs requiring data values every cycle. So, to supply register values, each stage uses a local set of registers, leading to high JJ-complexity (registers account for 35% of JJs).
The baseline design has a fixed control path, and identical operations are performed in every pipeline stage with only a few register values produced. For example, in the MSU only four registers are consumed in each stage of the pipeline to produce one 32-bit output. After that, all the registers are simply copied to the next stage, such that the $N$-th register of the current stage is copied to the $(N+1)$-th register of the subsequent stage, as shown in Figure 5(a).
An alternative way to communicate between stages is to connect producer and consumer via a delay line. Delay-line memory is a form of sequential-access memory used in early computers up to the 1960s [2, 13] that needs to be refreshed from time to time. Such memories rely on transmitting information through circuitry that adds delay and then re-routing the end of the delay path to the input, so that the information circulates continuously through the closed loop. We propose to use delay lines to route data from the producer register to the consumer register in a synchronized manner by using the precise number of delay elements to match the desired delay. In RQL, a delay line can be built using JTLs that repeat signals on every clock activation. On a JTL delay line, input data is propagated from one JJ to the next JJ every clock phase. This provides an efficient way to reduce JJ-complexity for our hash engine. Delay lines keep the data in flight and deliver it to the consumer at the precise clock cycle.
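The behaviour of such a producer-to-consumer delay line can be sketched as a fixed-length FIFO that advances once per clock. This is a cycle-level toy model with hypothetical names, not an RQL circuit; the two cost helpers use the per-bit figures quoted in this section (4 JJs per clock cycle per bit for a delay line, 12 JJs per bit for a register).

```python
from collections import deque


class DelayLine:
    """Value pushed at cycle t comes out at cycle t + delay_cycles."""

    def __init__(self, delay_cycles: int):
        if delay_cycles < 1:
            raise ValueError("delay must be at least one cycle")
        # Pre-fill with None: nothing is in flight before the first push.
        self._slots = deque([None] * delay_cycles, maxlen=delay_cycles)

    def tick(self, value):
        """Advance one clock: emit the oldest value and accept a new one."""
        out = self._slots[0]
        self._slots.append(value)  # maxlen drops the oldest entry
        return out


def delay_line_jjs(bits: int, cycles: int) -> int:
    return 4 * bits * cycles  # 4 JJs per clock cycle per bit


def register_jjs(bits: int) -> int:
    return 12 * bits  # 12 JJs per stored bit
```

With these figures, a 32-bit value held for three cycles costs the same either way (384 JJs), which matches the three-cycle crossover point; shorter producer-to-consumer distances favour the delay line.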
A delay line facilitates delivery of intermediate results from one stage to another. It requires 4 JJs per clock cycle per bit, whereas register storage requires 12 JJs per bit. Although the crossover point for the flop-based register file is 3 clock cycles, delay line memory enables a point-to-point connection between the producer and consumer, eliminating the need for 16 registers per stage. We use four staging registers along with the delay lines to tolerate clock skew. Delay lines reduce the per-stage JJ cost by almost 20%.
Performance and Energy-Efficiency
Table 4 shows the performance and energy-efficiency of our JJ-based accelerator for the basic designs (with RCA/KSA), designs with the technology-aware optimization of a 4-operand (four-input) CSA, and the use of delay lines to reduce register costs. The 4-operand CSA optimization improves energy-efficiency from 6.39x for KSA to 10.0x, while improving the performance by 15% (bringing it in line with the performance of the CMOS-based design). The delay-line optimization reduces register costs and improves the energy efficiency from 10.0x to 12.4x, while still having similar performance.
FAULT TOLERANT & BTWC DESIGN
In this section, we discuss fault models for JJ technology, and present a design that uses architecture-level solutions to protect against these faults. We also discuss how the proposed fault-tolerant design can be leveraged to improve energy efficiency by operating the circuit at a Better-Than-Worst-Case (BTWC) design point.
JJ Logic Fault Sources & Models
There are three primary sources of faults in superconducting logic: fabrication defects, device-level variations, and a non-ideal operating environment. Fabrication defects result from the material and masking defects introduced during fabrication. These defects can manifest as permanent stuck-at faults, similar to birth-time defects in CMOS, and can be mitigated by design-time testing. Device parameter variation can cause degradation in noise margins. For example, variation in critical current ($I_c$) can degrade the noise margin and result in timing errors, so the design must operate at a point where it is robust against such variations. In JJ technology, the flux-trapping phenomenon causes a unique source of faults, which we term operating environment faults. These faults are challenging to protect against due to their correlated nature. Furthermore, the faults are neither permanent nor transient, and they manifest not only because of the device but also due to non-ideal operating conditions. Flux trapping results from the trapping of a stray magnetic field in JJ circuits due to non-uniform cooling and can result in non-functional circuits or a reduced noise margin for parts of the chip. Fortunately, steady progress and innovations in fabrication and device technology limit the problem of flux trapping considerably [21]. However, the reported flux-trapping solutions demonstrated on 50K JJ circuits are costly, hard to scale to large systems, and do not completely eliminate the problem. Some of the demonstrations use active magnetic field cancellation or extremely low temperatures (<1K) at which the flux vortex freezes. These additional requirements are expensive for large systems.
Footnote 2: Including cooling overhead of 300X
Impact of Faults on SHA-256 Hardware
To understand the impact of faults on the output of the SHA engine, we use fault injection to quantify the Architectural Vulnerability Factor (AVF). For the baseline design, injection of faults shows 98.89% AVF. The high AVF of SHA engine results from the entropy maximization property of the algorithm where a single bit operational error can corrupt the output. Protecting a SHA engine is a traditionally non-trivial problem due to its cryptographic properties and tight area and energy constraints. Techniques based on replication or parity detection circuits are either too complex and expensive or provide partial protection against faults.
Application Level Resilience
Transient faults do not have a large impact on mining and hashrate, as a transient error can corrupt only one of the key combinations. The probability of a miner missing out on a reward due to a transient fault is extremely small. If the probability of finding a block is relatively low ($1/2^{32}$) and the probability of a transient fault is $P$, then the probability of both events coinciding is even lower ($P/2^{32}$). Recent proposals enable approximate bitcoin mining using this property [50].
On the other hand, permanent faults would result in a non-functional SHA engine, thus reducing the yield significantly. Furthermore, if not detected before deployment, the miner would simply consume power without doing any work. This problem is significantly worse for flux-trapping faults, as fault patterns change every warm-up cycle, which forces us to test SHA engines after every cool-down. In CMOS, non-functional chips can be isolated by post-fabrication tests, whereas in JJ circuits faults can happen not only because of fabrication defects but also due to operating conditions.
Fault Tolerant Design
The correlated nature of faults, due to the large-granularity impact of flux traps, limits the ability to use the standard low-cost protection techniques that are usually used against single-bit faults in conventional technologies. Our goal is to leverage the regular structure of the accelerator to improve the reliability of the JJ-based SHA-256 engine without significant complexity.
For the pipelined SHA-256 accelerator, all the pipeline stages are functionally identical with a deterministic control and datapath. This can be leveraged to enable low-cost fault tolerance. We propose to add an extra pipeline stage and design a bypass logic between every pipeline stage such that if a fault is detected for a pipeline stage, that stage can be bypassed as shown in Figure 6(b) . The bypass logic and spare pipeline stage can be used to detect a faulty pipeline stage by bypassing the stages one by one with a standard input and output pair until the right hash is produced.
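The detection procedure described above, bypassing stages one at a time against a known input/output pair, can be sketched as follows. This is a toy model under loud assumptions: every stage is the same hypothetical 32-bit function instead of a SHA-256 round, a fault is modeled as a flipped output bit, and all names are illustrative.

```python
MASK = 0xFFFFFFFF


def healthy_stage(x):
    """Stand-in for one functionally identical pipeline stage (hypothetical)."""
    return (x * 31 + 7) & MASK


def make_faulty_stage(flipped_bit):
    """A stage whose output always has one bit flipped (toy fault model)."""
    def stage(x):
        return healthy_stage(x) ^ (1 << flipped_bit)
    return stage


def run_pipeline(stages, x, bypass):
    """Run all physical stages except the bypassed one."""
    for index, stage in enumerate(stages):
        if index != bypass:
            x = stage(x)
    return x


def find_faulty_stage(stages, test_input, golden_output):
    """Bypass stages one by one until the known-good output is reproduced.

    Returns the index to bypass, or None if a single bypass cannot repair
    the pipeline (for example, when two stages are faulty).
    """
    for bypass in range(len(stages)):
        if run_pipeline(stages, test_input, bypass) == golden_output:
            return bypass
    return None
```

With N logical stages and N + 1 physical stages, the search costs at most N + 1 test runs after each cool-down, and a fault-free chip simply keeps its spare bypassed.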
The bypass logic in between pipeline stages consists of four 32-bit 2:1 muxes, as shown in Figure 6 (c). The muxes can bypass the faulty stage and re-route the signals to subsequent working stages, and they must be functional as they are placed between two stages, as shown in Figure 6 (b). A fault on any of the muxes results in a non-functional SHA engine. However, muxes cover only a small fraction of the total area, and the likelihood of a fault occurring on any of the muxes is an order of magnitude less compared to other functional units. Thus, this design enables partial fault-tolerance, as it can function correctly as long as faults do not occur on any of the muxes. To evaluate the effectiveness of the design, we perform binomial trials assuming independent and identically distributed (iid) errors. In the baseline, even a single fault leads to system failure, whereas the sparing design offers some fault-tolerance. To further improve the reliability, we propose a design that uses redundant muxes for the bypass circuitry. As shown in Figure 6 (e), the redundant 8:1 mux can tolerate one fault on any of the four muxes. The design with the redundant mux can tolerate one fault anywhere. Figure 7 shows the probability of system failure for the baseline, stage-sparing, and stage-sparing with redundant muxes. The design with sparing and redundant muxes is almost 5-6 orders of magnitude more reliable than the baseline.
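A binomial model of this kind can be sketched as below. It is only an illustration of the style of calculation, not the evaluation behind Figure 7: the per-unit fault probabilities, the treatment of each bypass block as a single unit, and the function names are all assumptions, so the numbers it produces should not be compared with the reported results.

```python
from math import comb


def prob_at_most(k, n, p):
    """P(at most k faults among n iid units, each faulty with probability p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def fail_baseline(p_stage, n_stages=128):
    """Baseline: any single stage fault breaks the engine."""
    return 1.0 - (1.0 - p_stage) ** n_stages


def fail_sparing(p_stage, p_mux, n_stages=128):
    """One spare stage: tolerate one stage fault, but no bypass-mux fault."""
    n = n_stages + 1  # assumption: one bypass block per physical stage
    return 1.0 - prob_at_most(1, n, p_stage) * (1.0 - p_mux) ** n


def fail_sparing_redundant_mux(p_stage, p_mux, n_stages=128):
    """Sparing plus redundant muxes: tolerate one fault anywhere."""
    n = n_stages + 1
    no_stage = prob_at_most(0, n, p_stage)
    one_stage = prob_at_most(1, n, p_stage) - no_stage
    no_mux = prob_at_most(0, n, p_mux)
    one_mux = prob_at_most(1, n, p_mux) - no_mux
    ok = no_stage * no_mux + one_stage * no_mux + no_stage * one_mux
    return 1.0 - ok
```

For small fault probabilities, the baseline failure rate grows roughly linearly with the number of stages, while a design that tolerates one fault fails only when two or more units are faulty, which is a second-order term.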
Using Fault-Tolerant Design for BTWC
The energy-efficiency and performance of the superconducting circuit are determined by the critical or bias current ($I_c$). Reducing $I_c$ reduces the energy consumption but causes certain devices to fail. Therefore, the critical current is set conservatively such that none of the devices fail. Recent studies suggest that the $I_c$ distribution for future technology nodes may have a large spread between devices, leading to as much as a 5x difference between the average and worst-case $I_c$ [23], forcing designers to pick $I_c$ conservatively. However, we can leverage the proposed fault-tolerant design to tune $I_c$ by using a better-than-worst-case (BTWC) design philosophy. The proposed reliable SHA-engine design can be used to tune the bias current $I_c$ because it can protect against a large-granularity failure. To perform the run-time tuning, $I_c$ is lowered until a failure is observed. If the fault cannot be isolated, $I_c$ is increased. The tuning finds the lowest workable $I_c$ by isolating a faulty pipeline stage and mitigating the fault. This can reduce $I_c$ from 38 µA to 10 µA (based on conservative scaling by Herr et al. [20]).
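The tuning loop can be sketched in a few lines. This is a toy model: in hardware the failing stages would be found by a self-test, whereas here each stage is given a hypothetical minimum working current, and the function names, step size, and example values are assumptions.

```python
def count_failing_stages(stage_min_current_ua, bias_ua):
    """Stages that stop working at this bias (toy stand-in for a self-test)."""
    return sum(1 for need in stage_min_current_ua if need > bias_ua)


def tune_bias_current(stage_min_current_ua, start_ua, floor_ua,
                      step_ua=1, spare_stages=1):
    """Lower the bias step by step while failures can still be bypassed.

    Assumes the design works at `start_ua`. Stops at the last bias value
    for which the number of failing stages does not exceed the spares.
    """
    bias = start_ua
    while bias - step_ua >= floor_ua:
        candidate = bias - step_ua
        if count_failing_stages(stage_min_current_ua, candidate) > spare_stages:
            break  # fault cannot be isolated: keep the previous bias
        bias = candidate
    return bias
```

In the hypothetical case where one outlier stage needs 30 µA and the rest work at 10 µA or less, a design without a spare must stay at 30 µA, while a design with one spare can bypass the outlier and settle at 10 µA.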
Evaluations: Tying it All Together
EVALUATION WORKFLOW
To the best of our knowledge, this is one of the first papers to explore superconducting accelerators and evaluate their performance and power using application-level metrics. As this is an emerging technology, there is no publicly available methodology or workflow for evaluating the performance, power, and area of systems. Furthermore, standard cells and design rules in superconducting logic families are fundamentally different from those in CMOS, which limits the direct use of standard CMOS tools for design space exploration of superconducting designs. To overcome this problem, we use open-source tools and incorporate design constraints specific to RQL. In addition to the modified design tool, we use analytical models to compute performance, power, and area for designs. Figure 9 provides an overview of the workflow of tools used in our evaluation.
Modeling Area Using JJ-Complexity
The area of a superconducting circuit is denoted by JJ-complexity. JJ-complexity is the number of JJs required to design a logic block. A logic block consists of gates and Josephson transmission lines (JTLs). As JJ-based gates have limited driving strength, JTLs are inserted to facilitate the desired fanout. In this paper, we use JJ-complexity as a key figure of merit, similar to prior superconducting system designs [11, 12]. We evaluate the system-level JJ-complexity by computing the gate JJ-complexity and the interconnect JJ-complexity.
Gate JJ-Complexity: To evaluate the gate JJ-complexity, we use the RQL standard cells and Yosys [43, 51], an open-source synthesis tool that enables us to derive the gate-level netlist using only RQL standard cells. Yosys uses ABC [38], which allows it to map a design's gate-level representation to a target custom library (the RQL cell library in this case). We process the netlist to compute the gate JJ-complexity by determining the number of gates used of each type. Note that the lack of place-and-route tools and restricted access to foundry models force superconducting logic designers to use manual routing to calculate JJ-complexity.
Interconnect JJ-Complexity: RQL gates have limited driving strength and require JTLs to drive other gates. A JTL consists of 2 JJs. JTLs enable fanout capacity similar to buffers in CMOS circuits and limit clock skew and jitter. Due to the limited driving strength, RQL gates require one JTL for every output load. We process the Yosys-generated netlist and determine the fanout for every input port, output port, and internal wire. To account for JTL overheads we use rules based on [20]: (1) A JTL is added after a series of five logic gates to suppress clock skew and jitter. (2) A JTL is required per fanout. (3) XOR gates need one additional JTL because they operate at the phase boundary [43] (RQL uses a four-phase clock).
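The counting rules above can be sketched in a few lines. This is a simplified illustration rather than the actual netlist-processing flow. The per-cell JJ counts in `CELL_JJ` are hypothetical placeholders and not foundry data. The netlist format, the `chain_lengths` input (lengths of series gate chains), and the reading of rule (2) as one JTL per driven load are all assumptions.

```python
from collections import Counter

JJS_PER_JTL = 2  # a JTL consists of 2 JJs (stated in the text)

# Hypothetical JJ counts per cell type, for illustration only.
CELL_JJ = {"AND": 4, "OR": 4, "XOR": 6, "NOT": 4}


def gate_jj(netlist):
    """Gate JJ-complexity: count the gates of each type, weight by JJs per cell."""
    counts = Counter(gate["type"] for gate in netlist)
    return sum(CELL_JJ[cell] * n for cell, n in counts.items())


def interconnect_jj(netlist, chain_lengths):
    """Interconnect JJ-complexity from the three JTL rules."""
    jtls = 0
    for gate in netlist:
        jtls += gate["fanout"]       # rule 2: one JTL per driven load (assumed)
        if gate["type"] == "XOR":
            jtls += 1                # rule 3: extra JTL for XOR gates
    # Rule 1: one JTL after every series of five logic gates.
    jtls += sum(length // 5 for length in chain_lengths)
    return jtls * JJS_PER_JTL


def system_jj(netlist, chain_lengths):
    """Total JJ-complexity: gate plus interconnect."""
    return gate_jj(netlist) + interconnect_jj(netlist, chain_lengths)


# Tiny hypothetical netlist: three gates, one chain of five and one of three.
example = [
    {"type": "AND", "fanout": 1},
    {"type": "XOR", "fanout": 2},
    {"type": "OR", "fanout": 1},
]
total = system_jj(example, chain_lengths=[5, 3])
```

A real flow would read the gate types and fanouts from the synthesized netlist and would also account for input and output ports, which this sketch omits.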
For the bitcoin accelerator design, most of the gates drive either 1 or 2 gates, and the percentage of gates that drive more than 2 gates is quite small (less than 1%). Given that approximately half the gates drive exactly 2 gates, the overhead of additional JTLs due to fanout is approximately 50% for our baseline implementation.
System JJ-Complexity: A full system design using superconducting technology requires JJs for implementing logic and for enabling signal routing and fanout. We derive the total JJ-complexity of the system (JJ_system) as shown in Equation 1.
$$JJ_{system} = JJ_{gate} + JJ_{interconnect} \qquad (1)$$
For validation, we compare our method of evaluating JJ-complexity against published designs that use a foundry RQL standard cell library and foundry models, and observe that our estimates are within 12% of the numbers reported in prior work [11, 12].
Modeling Performance & Power
To model performance, we count the number of JJs in the critical path of a design and multiply it by the switching time of a JJ. We assume a uniform JJ switching time of 2 ps [20] . Switching time can be improved by using larger feature size. However, for our analysis, we lack the design and layout tools to study such optimizations.
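As a small illustration of this delay model, the sketch below sums the JJs along a path and multiplies by the uniform 2 ps switching time. The per-element JJ counts in the example path are hypothetical.

```python
JJ_SWITCHING_TIME_PS = 2.0  # uniform JJ switching time assumed in the text


def critical_path_delay_ps(jjs_per_element):
    """Delay of a path given the JJ count of each gate or JTL along it."""
    return sum(jjs_per_element) * JJ_SWITCHING_TIME_PS


# Hypothetical path: gate (4 JJs), JTL (2), gate (6), JTL (2) -> 14 JJs.
delay_ps = critical_path_delay_ps([4, 2, 6, 2])
```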
RQL delivers power to on-chip devices through inductive coupling to an AC transmission line. As a result, RQL circuits dissipate negligible static power. RQL uses reciprocal data encoding where "0" is represented by the absence of SFQ pulses. Therefore, the dynamic power dissipation in RQL circuits results only from digital "1"s, and digital "0"s do not dissipate power. The total power dissipated (P_dynamic) by an RQL circuit is given by Equation 2.
where n is the number of JJs, f is the frequency, I_c is the critical current, Φ_0 is the magnetic flux quantum (a universal constant), and α is the activity factor (the fraction of JJs switching to the "1" state). The power dissipated by the superconducting logic is directly proportional to the critical current, which depends on the device fabrication technology and foundry services. For our evaluations, we assume the critical current to be 38 µA. However, a conservative analysis of I_c reveals that it can be reduced to 10 µA without substantial impact on the bit error rate [20]. We determine the activity factor (α) of a design by counting the number of "1"s in the value change dump (VCD) file of random simulations. We evaluate the total power consumption by multiplying the power dissipated by the design at 4.2K with a cooling overhead. We report the power consumption with a cooling overhead of 300x.
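The power model and cooling overhead can be sketched as follows. This sketch assumes that the dynamic power is a simple product of the quantities defined above. The constant `prefactor` (default 1.0) is an assumption, since the exact constant in Equation 2 is not reproduced here. The function names are hypothetical, and the activity factor is computed from an already extracted list of bit values rather than by parsing a VCD file.

```python
PHI_0 = 2.067833848e-15   # magnetic flux quantum h/(2e), in webers
COOLING_OVERHEAD = 300    # wall-plug watts per watt dissipated at 4.2K


def activity_factor(bits):
    """Fraction of sampled values that are 1 (bits is a sequence of 0/1)."""
    bits = list(bits)
    return sum(bits) / len(bits)


def chip_power_w(n_jj, freq_hz, ic_amps, alpha, prefactor=1.0):
    """Dynamic power at 4.2K, assuming P = prefactor * alpha * n * f * I_c * Phi_0."""
    return prefactor * alpha * n_jj * freq_hz * ic_amps * PHI_0


def wall_power_w(n_jj, freq_hz, ic_amps, alpha, prefactor=1.0):
    """Total power including the 300x cooling overhead."""
    return chip_power_w(n_jj, freq_hz, ic_amps, alpha, prefactor) * COOLING_OVERHEAD


# Hypothetical design point: 1M JJs at 10 GHz with a sampled activity factor.
alpha = activity_factor([1, 0, 1, 1])
p_38 = wall_power_w(1_000_000, 10e9, 38e-6, alpha)
p_10 = wall_power_w(1_000_000, 10e9, 10e-6, alpha)
```

Because the assumed model is linear in I_c, lowering the critical current from 38 µA to 10 µA reduces the modeled power by a factor of 3.8, independent of the other parameters.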
RELATED WORK
Superconducting circuits: A number of circuits were demonstrated in RSFQ logic in the 1990s, such as DSPs, microprocessor components, and mixed-signal devices [10, 16, 18, 25, 29, 31, 39-41, 45, 53]. However, due to static power dissipation challenges and high device counts per logic gate, RSFQ circuits faced scalability issues. The recently proposed RQL can mitigate some of the challenges faced by RSFQ to enable scalable solutions [9, 11, 12, 19-22, 24, 47].
SHA Designs: SHA-256 optimizations involve changing the computational platform to FPGAs and ASICs [1, 4-7, 15, 30, 35, 46]. There also exist several algorithmic and hardware optimizations for SHA engines [4-7, 33-35]. Fault-tolerant SHA hardware uses triple modular redundancy, register protection using Hamming codes, and inbuilt self-checking mechanisms [27, 28, 36, 37]. These schemes assume uncorrelated errors and incur large area and complexity overheads.
CONCLUSION
In this paper, we evaluate the system-level performance and energy improvements for an accelerator built with Josephson junction technology. We focus on three JJ-technology challenges: low device density, limited fanout, and correlated faults due to flux trapping. We focus on SHA-256 engines, commonly used in bitcoin-mining accelerators, which have high computational intensity and a tiny memory footprint, and for which energy-efficiency is a key metric.
A direct translation of the baseline [3] from CMOS to JJs provides a 10x improvement in energy-efficiency (GH/J). We study a technology-aware design that improves performance by 1.6x while boosting the energy efficiency to 12x over the CMOS baseline. We present a unique reliability challenge in JJ technology and propose a fault-tolerant design that protects against the large-granularity faults caused by this new failure mode. Moreover, we use this fault-tolerant design to enable a better-than-worst-case design that scales the critical current without sacrificing functionality, providing a 46x improvement in energy efficiency over the CMOS design.
