Single flux quantum (SFQ) circuits are an attractive beyond-CMOS technology because they promise two orders of magnitude lower power at clock frequencies exceeding 25 GHz. However, every SFQ gate is clocked, creating very deep gate-level pipelines that are difficult to keep full, particularly for sequences that include data-dependent operations. This paper proposes to increase the throughput of SFQ pipelines by redesigning the datapath to accept and operate on least-significant bits (LSBs) clock cycles earlier than more significant bits. This skewed datapath approach reduces the latency of the LSB side, which can be fed back earlier for use in subsequent data-dependent operations, increasing their throughput. In particular, we propose to group the bits into 4-bit blocks that are operated on concurrently and to create block-skewed datapath units for 32-bit operation. This skewed approach allows a subsequent data-dependent operation to start evaluating as soon as the first 4-bit block completes. Using this general approach, we develop a block-skewed MIPS-compatible 32-bit ALU. Our gate-level Verilog design improves the throughput of 32-bit data-dependent operations by 2x and 1.5x compared to previously proposed 4-bit bit-slice and 32-bit Ladner-Fischer ALUs, respectively. We have quantified the benefit of this design on instructions per cycle (IPC) for various RISC-V benchmarks, assuming a range of non-ALU operation latencies from one to ten cycles. Averaging across benchmarks, our experimental results show that, compared to the 32-bit Ladner-Fischer ALU, our proposed architecture provides IPC improvements ranging from 1.37x assuming a one-cycle non-ALU latency to 1.2x assuming a ten-cycle non-ALU latency. Moreover, our average IPC improvements compared to a 32-bit ALU based on the 4-bit bit-slice range from 2.93x to 4x.
I. INTRODUCTION AND MOTIVATION
The ever-increasing computational requirements of high performance computing (HPC) have leveraged the scaling of contemporary technologies for decades, now reaching the atomic level. However, the power density of silicon nanoelectronics limits their applicability to future exascale computing [1], [2], motivating research into alternative technologies. Evolved from rapid SFQ (RSFQ) [3] technology, superconductive circuits that promise an ultra-low switching energy of $10^{-19}$ J [4] and clock frequencies exceeding 25 GHz [5] have become a promising beyond-CMOS technology. Various 8-bit SFQ microprocessors have been developed in the last two decades, including a bit-serial microprocessor with eight 1-bit serial ALU blocks (FLUX-1) [6], a bit-serial CORE1 processor [7], and a bit-serial SCRAM2 asynchronous microprocessor [8]. More specifically, the arithmetic logic unit (ALU), a critical part of a microprocessor, has gained significant research importance in RSFQ [9], [10], [11], [12]. Recently, Tang et al. proposed a 16-bit bit-sliced ALU [13] because the earlier proposed serial [14] and 2-/4-/8-bit bit-sliced [11] ALUs compute at a slower rate for 32-/64-bit processors. As we increase the ALU bit-width, its gate-level pipelined nature forces an increase in latency, and efficiently utilizing this deeply pipelined architecture becomes more difficult.
This work was supported by the Office of the Director of National Intelligence (ODNI), the Intelligence Advanced Research Projects Activity (IARPA), via the U.S. Army Research Office Grant W911NF-17-1-0120.
To improve pipeline utilization, we propose a block-skewed ALU architecture, called qBSA, inspired by the use of skewed datapaths in asynchronous CMOS design [15]. Our proposed architecture uses eight 4-bit ALU blocks skewed in time, reduces the delay of the data feedback loop, and enables each block to start computing a dependent operation as soon as its own output is ready. The choice of 4-bit blocks strikes a balance between keeping the latency of the 32-bit adder relatively low and requiring fewer Josephson junctions (JJs) than higher bit-width blocks would need. We have simulated our design using the MIT LL 100 µA/µm² SFQ5ee RSFQ cell library to demonstrate its functional correctness. We have also estimated its impact on the instructions per cycle (IPC) of a RISC-V processor.
The remainder of this paper is arranged as follows. Section II describes the proposed architecture and explains its functionality. Section III presents our simulation platform, results, and performance analysis. Finally, the paper concludes in Section IV.
II. PROPOSED 32-BIT BLOCK-SKEWED ARCHITECTURE
In this section we describe the logic design of our 32-bit qBSA. We divided the design into eight 4-bit blocks, as shown in Fig. 1(a). Because of its low latency and its simple carry look-ahead circuit with only one feed-forward signal ($C_{out}$), we adopted the Sklansky prefix-tree adder [16] to design each 4-bit block, as illustrated in Fig. 1(b). Notice that the carry ($C_{in}$) is needed to compute the carry out ($C_{out\_early}$, $C_{out}$) and sum ($S_{n+3:n}$) only after five pipeline stages. We leverage this fact and start computing the sum and carry of more significant blocks before the $C_{out\_early}$ of the less significant blocks is evaluated. Note that the first 4-bit ALU block differs from the other seven because its carry input arrives at the same time as its A and B inputs. In contrast, the carry input of each of the other seven blocks arrives five clock cycles after its A and B inputs. We use $C_{out\_early}$ to quickly feed the input carry of the next 4-bit ALU block and delay it by one stage to provide the final $C_{out}$.
The feedback path from the output of each block back to its input (through a multiplexer) enables less significant blocks to start accepting and computing their next data-dependent inputs as soon as the corresponding previous output is ready, thereby avoiding the wait for the entire 32-bit result. This staggers the computation start time of the different blocks, making the datapath skewed, and better utilizes the gate-level pipelined nature of SFQ. In particular, this reduces the initiation interval (II) for back-to-back data-dependent operations, defined as the number of clock cycles separating the start of two consecutive data-dependent operations. Table I shows the operations supported by qBSA and their associated control signals. Table II shows the latency and the initiation interval values of our proposed design.
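The block decomposition described above can be illustrated functionally with a short Python sketch. It is only a bit-level model under stated assumptions, not the gate-level Verilog design: the function names `sklansky_add4` and `block_add32` are hypothetical, the prefix tree is the textbook 4-bit Sklansky structure, and no pipelining or timing is modeled. It shows that the carry-in enters each block only after the prefix tree has been evaluated, and that block k of the result depends only on blocks 0 to k of the operands.

```python
MASK4 = 0xF  # one 4-bit block


def sklansky_add4(a, b, cin):
    """Add two 4-bit values with a textbook Sklansky prefix tree.

    Returns (sum4, cout). Hypothetical helper for illustration only.
    """
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]    # generate bits
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]  # propagate bits

    def combine(hi, lo):
        # Prefix operator: (G, P) o (G', P') = (G | P & G', P & P')
        return (hi[0] | (hi[1] & lo[0]), hi[1] & lo[1])

    gp = list(zip(g, p))
    # Level 1: neighbouring pairs, spans [1:0] and [3:2]
    gp10 = combine(gp[1], gp[0])
    gp32 = combine(gp[3], gp[2])
    # Level 2: spans [2:0] and [3:0] both reuse span [1:0]
    gp20 = combine(gp[2], gp10)
    gp30 = combine(gp32, gp10)
    prefix = [gp[0], gp10, gp20, gp30]  # prefix[i] covers bits i..0

    # The carry-in is only needed here, after the prefix tree is done.
    carries = [cin] + [G | (P & cin) for G, P in prefix]
    s = 0
    for i in range(4):
        s |= (p[i] ^ carries[i]) << i
    return s, carries[4]


def block_add32(a, b, cin=0):
    """Add two 32-bit values as eight 4-bit blocks, least significant first."""
    result, carry = 0, cin
    for k in range(8):
        s, carry = sklansky_add4((a >> 4 * k) & MASK4,
                                 (b >> 4 * k) & MASK4, carry)
        result |= s << (4 * k)
    return result, carry


if __name__ == "__main__":
    total, cout = block_add32(0x0000FFFF, 0x00000001)
    print(hex(total), cout)
```

Because the lowest block of a sum depends only on the lowest blocks of its operands, a dependent operation such as (a + b) + c can begin in block 0 before the upper blocks of a + b exist, which is the property the skewed feedback path exploits.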
III. RESULTS
We used Verilog models of the MIT LL 100 µA/µm² SFQ5ee cell library to design and simulate qBSA in the Xilinx Vivado 2017.4 tool. Note that in our simulated waveforms a signal
A. Gate-level Simulation
Fig. 2 shows a typical waveform generated through gate-level simulation of the proposed 32-bit ALU. Notice that, after the first output is available, the skewed datapath of the qBSA makes back-to-back data-dependent outputs available after the pipeline depth of a 4-bit ALU block (8 clock stages) instead of the pipeline depth of the entire 32-bit ALU (15 clock stages). Thus, the initiation interval of our proposed qBSA is 1.5x and 2x shorter than that of the recently proposed 32-bit Ladner-Fischer ALU (32LFA) [13] and the 4-bit bit-sliced ALU (4BSA) [5], respectively.
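The timing argument above can be sketched as a small scheduling model in Python. The block count (8), the block depth (8 stages), and the full depth (15 stages) come from the text. The one-cycle skew between adjacent blocks is an assumption chosen because 8 + 7 x 1 = 15 matches those figures, and the names `block_schedule` and `chain_finish` are hypothetical. The unskewed case models the same ALU waiting for its full 32-bit result; it is not a model of 32LFA or 4BSA.

```python
BLOCKS = 8       # eight 4-bit blocks (from the text)
BLOCK_DEPTH = 8  # pipeline stages of one 4-bit block (from the text)
SKEW = 1         # ASSUMPTION: one-cycle offset between adjacent blocks


def block_schedule(num_ops, skewed=True):
    """Per-block (start, done) cycles for a chain of data-dependent operations.

    skewed=True:  each block restarts as soon as its own output is ready.
    skewed=False: the next operation waits for the whole 32-bit result.
    """
    full_depth = BLOCK_DEPTH + (BLOCKS - 1) * SKEW  # 15 with the values above
    interval = BLOCK_DEPTH if skewed else full_depth
    schedule = []
    for op in range(num_ops):
        base = op * interval
        schedule.append([(base + k * SKEW, base + k * SKEW + BLOCK_DEPTH)
                         for k in range(BLOCKS)])
    return schedule


def chain_finish(num_ops, skewed=True):
    """Cycle at which the most significant block of the last operation is done."""
    return block_schedule(num_ops, skewed)[-1][-1][1]


if __name__ == "__main__":
    for skewed in (True, False):
        print("skewed" if skewed else "unskewed", chain_finish(4, skewed))
```

In this model every block of operation j+1 starts in exactly the cycle in which the same block of operation j finishes, so the initiation interval equals the block depth rather than the full ALU depth.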
B. Performance Evaluation: Instruction Per Cycle
To quantify the benefit of our proposed design, we estimated its impact on IPC for a set of benchmarks on a generic qBSA-based RISC-V processor with in-order commitment (qBSP). We compared the obtained IPC to that of 32LFA-based (32LFP) and 4BSA-based (4BSP) processors. In particular, the IPC of a benchmark with a total number of instructions $T_i$ and a total number of NOPs needed to resolve dependencies $T_{NOP}$ is as follows:
$$\mathrm{IPC} = \frac{T_i}{T_i + T_{NOP}} \qquad (1)$$
We estimate the IPC using a script that reads benchmark files generated through Spike, a RISC-V sodor core instruction set architecture (ISA) simulator, analyzes the dependencies, and estimates the number of NOPs required [17]. We assume that all processor components are block-skewed and that they consume inputs and generate outputs in block-skewed fashion. In particular, Equations 2 and 3 recursively define the number of NOPs required before each instruction i and its final position considering the added NOPs.
Here, the functions S(i, m) and I(i, m) provide the instruction type and the original index of the instruction that creates the m-th source operand ($m \in N_{S_i}$) of the i-th instruction. L(S(i, m)) is the latency of the instruction that creates the m-th source register of instruction i. Our experiments explore a range of non-ALU data-dependent operation latencies, [1, 10], but in each individual experiment, for simplicity, we assume that all non-ALU operations have the same integral latency. As two examples, Fig. 3 shows the IPC improvement of qBSP over 32LFP and 4BSP assuming non-ALU latencies of 1 and 10 cycles.
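The kind of dependency analysis described above can be sketched in Python as follows. This is a simplified in-order model under stated assumptions and is not claimed to match Equations 2 and 3: each instruction issues one cycle after its predecessor unless a producer of one of its source operands is not yet done, in which case NOPs fill the gap. The function name `estimate_ipc`, the trace format, and the two-way "alu"/"other" classification are hypothetical. In the usage lines, the ALU latency of 8 comes from the block depth in the text, while 12 is inferred from the stated 1.5x ratio and is an assumption.

```python
def estimate_ipc(trace, alu_latency, non_alu_latency):
    """Estimate IPC = T_i / (T_i + T_NOP) for an in-order instruction trace.

    trace: list of (kind, sources); kind is "alu" or "other", and sources
           lists the indices of earlier instructions producing the operands.
    Returns (ipc, total_nops). Hypothetical model for illustration only.
    """
    latency = {"alu": alu_latency, "other": non_alu_latency}
    position = []   # issue cycle of each instruction after NOP insertion
    total_nops = 0
    for kind, sources in trace:
        # Without stalls, issue right after the previous instruction.
        earliest = position[-1] + 1 if position else 0
        # Every source operand must have been produced.
        ready = max((position[s] + latency[trace[s][0]] for s in sources),
                    default=0)
        issue = max(earliest, ready)
        total_nops += issue - earliest  # NOPs inserted before this instruction
        position.append(issue)
    return len(trace) / (len(trace) + total_nops), total_nops


if __name__ == "__main__":
    # Toy trace: i1 depends on i0, i2 on i1, i3 on i0.
    trace = [("alu", []), ("alu", [0]), ("other", [1]), ("alu", [0])]
    print(estimate_ipc(trace, alu_latency=8, non_alu_latency=1))
    print(estimate_ipc(trace, alu_latency=12, non_alu_latency=1))
```

A shorter effective ALU latency for dependent operations reduces the NOP count directly, which is how a smaller initiation interval translates into a higher estimated IPC in this kind of model.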
IV. CONCLUSIONS
The gate-level pipelined nature of RSFQ makes keeping the pipelines full a difficult micro-architectural challenge, especially in the presence of data-dependent operations. This paper proposes a block-skewed ALU to reduce the average pipeline initiation interval and estimates its impact on an ideal RSFQ processor. Averaging across multiple benchmarks with a simple dependency model, block skewing improves IPC by between 1.2x and 1.37x compared to a processor based on a 32-bit Ladner-Fischer ALU and by between 2.93x and 4x compared to a processor based on a 4-bit bit-sliced ALU. Our future work includes evaluating the benefits of block skewing on other processor components, the impact of different block sizes, and refinements of our model of instruction dependencies.
