Bridging the gap between quantum software and hardware, recent research proposed a quantum control microarchitecture QuMA which implements the quantum microinstruction set QuMIS [16]. However, QuMIS does not offer feedback control, and is tightly bound to the hardware implementation. Also, as the number of qubits grows, QuMA cannot fetch and execute instructions fast enough to apply all operations on qubits on time. Known as the quantum operation issue rate problem, this limitation is aggravated by the low information density of QuMIS instructions. In this paper, we propose an executable quantum instruction set architecture (QISA), called eQASM, that can be translated from the quantum assembly language (QASM), supports feedback, and is executed on a quantum control microarchitecture. eQASM alleviates the quantum operation issue rate problem by efficient timing specification, single-operation-multiple-qubit execution, and a very-long-instruction-word architecture. The definition of eQASM focuses on the assembly level to be expressive.
INTRODUCTION
Quantum computing can accelerate solving some problems which are inefficiently solved by classical computers, such as quantum chemistry simulation [1, 2]. The goal is to develop a quantum computer with Noisy Intermediate-Scale Quantum (NISQ) technology [3] (without quantum error correction [4]), whose capability goes beyond that of state-of-the-art classical computers [5]. This capability is also termed quantum supremacy [6]. To this end, a fully programmable quantum computer based on the circuit model should be constructed of several layers [7, 8]. These layers form the full stack, which includes the quantum algorithm, quantum language, quantum compiler, quantum instruction set architecture (QISA), quantum control microarchitecture, quantum-classical interface, and quantum chip. Compared to the flourishing of research at the opposite ends of the stack, relatively little research has been dedicated to the low-level description of quantum applications with the required control microarchitecture for NISQ technology.
Related Work and the Challenges
Previous quantum control paradigms, which operate directly on waveforms, scale poorly, and no control microarchitecture supported the execution of existing quantum assembly languages on real hardware (including QASM [9], a virtual instruction set [10], QASM-HL [11], Quil [12], OpenQASM [13], f-QASM [14], and cQASM [15]). To address both problems, Fu et al. [16] proposed the quantum control microarchitecture QuMA implementing a quantum microinstruction set QuMIS to bridge the gap between quantum software and hardware.
However, QuMIS is unsatisfactory for three reasons. First, instructions in QuMIS do not support feedback based on qubit measurement results, which is vital for circuit-model-based quantum computing applications such as active qubit reset [17], teleportation [18], quantum gate decomposition [19], and Shor's factoring [20]. For example, active qubit reset requires measuring the qubit followed by an X gate if the qubit measurement result is |1⟩ (this process is also called binary control). Teleportation requires performing a subprogram (containing an X and a Z gate) conditioned on the result of measurements on two qubits. In addition, feedback is necessary for fault-tolerant quantum computing, where a key application is the implementation of non-Clifford gates (e.g., the T gate [4]). Feedback has been demonstrated in multiple experiments [21-24] using customized hardware, but not yet using a (micro)architectural solution.
A second drawback of QuMIS is limited scalability. A QuMIS program has a relatively low instruction information density because (1) an explicit waiting instruction is required to separate any two consecutive timing points; (2) each target qubit of a quantum operation occupies a field in the instruction, making the instruction width a limitation for the number of target qubits in a single instruction; (3) two parallel and different operations cannot be combined into a single instruction. The required number of quantum operations per cycle in general increases as the number of qubits grows; fetching all instructions for an increasing number of quantum operations from memory and applying them on qubits on time is a challenge given the limited instruction issue rate (the quantum operation issue rate problem) [16, 25, 26].
Third, QuMIS is limited in flexibility because QuMIS instructions are low level and tightly bound to the electronic hardware implementation. Compared to existing quantum assembly languages, QuMIS instructions are microinstructions without explicit quantum semantics. Thus, QuMIS does not qualify as a QISA, and it remains an open challenge to design an executable QISA with quantum semantics which is scalable and supports runtime feedback.
Contributions
In this paper, we propose an executable QISA based on QASM, named executable QASM (eQASM). eQASM can be generated by the compiler backend from a higher-level representation, like OpenQASM or cQASM. eQASM contains both quantum instructions and auxiliary classical instructions to support quantum program flow control. eQASM supports a set of discrete quantum operations. The contributions of the paper are the following:
• Runtime feedback: eQASM proposes two kinds of feedback, together with the microarchitectural mechanisms required to implement them: fast conditional execution for simple but fast feedback, and comprehensive feedback control (CFC) for arbitrary user-definable feedback;
• Operational implementation: eQASM is a QISA framework whose definition focuses on the assembly level and the basic rules of mapping assembly to binary. It requires customized instantiation of the binary format targeting a particular platform, which allows the pursuit of flexibility and practicability;
• Increased quantum operation issue rate: eQASM adopts Single-Operation-Multiple-Qubit (SOMQ) execution, a Very-Long-Instruction-Word (VLIW) architecture, and a more efficient method for explicit timing specification, which can considerably alleviate the quantum operation issue rate problem when compared to QuMIS;
• Configurable QISA at compile time: As opposed to a classical instruction set architecture (ISA), whose operations are defined at ISA design time, eQASM enables the programmer to configure the allowed quantum operations at compile time, leaving ample space for compiler-based optimization.
We instantiate eQASM into a 32-bit instruction set targeting a seven-qubit superconducting quantum processor and implement it using a control microarchitecture derived from QuMA as proposed in [16]. We validated eQASM by performing several experiments on a two-qubit superconducting quantum processor using the implemented microarchitecture.
This paper is organized as follows. Section 2 introduces the heterogeneous quantum programming model adopted by eQASM and an overview of eQASM. The quantum instructions of eQASM with related mechanisms are explained in Section 3. Section 4 describes the instantiation of eQASM targeting a seven-qubit quantum processor as well as its microarchitecture and implementation. Section 5 shows the experiments, and Section 6 concludes.
EQASM OVERVIEW
To our understanding, it is viable to integrate quantum computing in a similar way as a GPU or an FPGA in a heterogeneous architecture. The quantum part can be seen as a coprocessor used to accelerate particular classically-hard tasks. This section introduces the eQASM programming and compilation model, the design guidelines for eQASM, the architectural state, and an overview of instructions.
Programming and Compilation Model
OpenCL [27], an open industry standard for classical heterogeneous parallel computing, served as the basis for defining eQASM. The programming and compilation model of eQASM is shown in Fig. 1.
A quantum-classical hybrid program contains a host program and one or more quantum kernels, with the quantum kernel(s) accelerating particular parts of the computation. The host program is described using a classical programming language, such as Python or C++, and the quantum kernels are described using a quantum programming language, such as Scaffold [28] or Q# [29]. A hybrid compilation infrastructure compiles the host program into classical code using a conventional compiler such as GCC, which is later executed by the classical host CPU. The quantum compiler, such as OpenQL [16], compiles the quantum kernels in two steps. First, quantum kernels are compiled into QASM, or a similar format mathematically equivalent to the circuit model.
This format is hardware independent and can be ported across different platforms for quantum algorithms. Most of the hardware constraints are taken into account in the second step, where the compiler performs scheduling and low-level optimization. The output is the quantum code consisting of eQASM instructions. The quantum code contains quantum instructions as well as auxiliary classical instructions to support comprehensive quantum program flow control including runtime feedback [17, 30]. After the host CPU has loaded the quantum code into the quantum processor, the quantum code can be directly executed. In the rest of this paper, we focus on the quantum processor, i.e., the microarchitecture in charge of controlling qubits. The interaction between the classical processor and the quantum processor is a research topic outside the scope of this paper.
Design Guidelines
The design of eQASM focuses on being executable on real hardware while providing user-definable feedback. It should be capable of describing quantum applications for various quantum technologies and not be bound to a particular electronic control setup. Calibration experiments usually occupy a considerable share of the time spent using qubits in the NISQ era. Examples include measuring the relaxation time of qubits (T1 experiment) and calibrating the parameters (amplitude, phase, frequency, etc.) of pulses for quantum operations. These experiments need to use uncalibrated or uncommon quantum operations and to change the timing of operations explicitly. eQASM is therefore also expected to support the quantum experiments required to calibrate qubits and quantum operations. The design of eQASM is guided by five main principles:
(1) eQASM should include classical instructions to support quantum program flow control including runtime feedback;
(2) eQASM should contain well-defined methods to specify the timing of quantum operations;
(3) Low-level hardware information should be abstracted away from the eQASM assembly as much as possible to avoid eQASM being tied to a particular hardware implementation;
(4) The quantum operation issue rate is a potential bottleneck of the quantum microarchitecture, and should be addressed, e.g., by densely encoding the instructions, as done with SIMD and VLIW for classical architectures;
(5) Different experiments and radical compiler-based optimization techniques such as quantum optimal control [31, 32] may use a different set of quantum operations, which can be uncalibrated or uncommon. eQASM should be flexible enough to allow different quantum operations via configuration.
Architectural State
As shown in Fig. 2 , the architectural state of the quantum processor includes:
2.3.1 Data Memory. The data memory can buffer intermediate computation results and serve as the communication channel between the host CPU and the quantum processor.
2.3.2 Instruction Memory & Program Counter. The eQASM instructions are stored in the instruction memory, and the Program Counter (PC) contains the address of the next eQASM instruction to fetch. eQASM does not define an instruction memory size or a memory hierarchy.
2.3.3 General Purpose Registers. The general purpose register (GPR) file is a set of 32-bit registers, labeled as Ri, where i is the register address.
2.3.4 Comparison Flags. The comparison flags store the result of comparing two general purpose registers and are used by comparison and branch related instructions (see Table 1).
2.3.5 Quantum Operation Target Registers. Each quantum operation target register can be used as an operand of a quantum operation. Since most quantum technologies support physical operations applied on up to two qubits, there are two types of quantum operation target registers: single-qubit target registers for single-qubit operations [including measurement (MEASZ)], and two-qubit target registers for two-qubit operations. Each single-(two-)qubit target register can store the physical addresses of a set of qubits (allowed qubit pairs). An allowed qubit pair is a pair of qubits on which we can directly apply a physical two-qubit gate. A single-(two-)qubit target register is labeled as Si (Ti), with i being the register address. eQASM does not define the format of target registers (see Section 3.3 for a discussion).
2.3.6 Timing and Event Queues. To support explicit timing specification of quantum operations, eQASM adopts a queue-based timing control scheme [16]. The timing and event queues are used to buffer timing points and operations generated from the execution of quantum instructions (see Section 3.1). Together with the qubit measurement result registers, they separate the processor into two timing domains, the deterministic one and the non-deterministic one.
2.3.7 Qubit Measurement Result Registers. Each qubit measurement result register is 1 bit wide, and stores the result of the last finished measurement instruction on the corresponding qubit when it is valid (see Section 3.6). It is labeled as Qi, where i is the physical address of the qubit.
2.3.8 Execution Flag Registers. Sometimes, the execution of a quantum operation depends on a simple combination of previous measurement results of this qubit [21, 22]. To this end, each qubit is associated with an execution flag register, which contains multiple flags derived automatically by the microarchitecture from the last measurement results of this qubit. The execution flag register file is used for fast conditional execution (see Section 3.5).
2.3.9 Quantum Register. The quantum register is the collection of all physical qubits inside the quantum processor. Each qubit is assigned a unique index, known as the physical address. Since data in qubits can be superposed, eQASM does not allow direct access to the quantum data at the instruction level. Instead, users can measure qubits using measurement instructions and later access the results in the qubit measurement result registers.
Instruction Overview
Quantum technology is evolving rapidly and is still far away from a stable state. To avoid the format of eQASM being tied to a specific quantum technology implementation with particular properties, the definition of eQASM focuses on the assembly level and introduces basic rules of mapping the assembly code to binary instructions. The binary format is defined during the instantiation of eQASM targeting a concrete control electronic setup and quantum chip. This enables the eQASM assembly to be expressive while leaving considerable freedom to the (micro)architecture designer to pursue microarchitectural practicability and performance.
An eQASM program can consist of interleaved quantum instructions and auxiliary classical instructions. An overview of the eQASM instructions is shown in Table 1. Since the host CPU can provide classical computation power, auxiliary classical instructions are simple instructions to support the execution of quantum instructions. Complex instructions (e.g., floating-point instructions) are not included. The top part of Table 1 contains the auxiliary classical instructions. There are four types: control, data transfer, logical, and arithmetic instructions. These are all scalar instructions. The function sign_ext(Imm, 32) sign-extends the immediate value Imm to 32 bits. The operator :: concatenates two bit strings. The CMP instruction sets all comparison flags based on the comparison result of GPRs Rs and Rt. The BR instruction changes the PC to PC + Offset if the specified comparison flag is '1'. To enable arithmetic or logical operations on the comparison result, the FBR instruction fetches the specified comparison flag into GPR Rd. The FMR instruction supports comprehensive feedback control and is explained in Section 3.6. The bottom part of Table 1 contains the quantum instructions. There are three types of instructions:
• Waiting instructions used to specify timing points (QWAIT, QWAITR),
• Quantum operation target register setting instructions (SMIS, SMIT), and
• Quantum bundle instructions, which consist of the specification of a small waiting time and multiple quantum operations.
These quantum instructions have several features based on the following four observations:
• Many quantum experiments, such as the T1 experiment, require changing the timing of operations explicitly. Also, the timing of operations can significantly impact the fidelity of the final result as quantum errors accumulate during computation (see Section 5). eQASM can explicitly specify the timing of quantum operations to support quantum experiments and compiler-based timing optimization. The timing model is explained in Section 3.1.
• Different quantum experiments or algorithms may require a different set of physical quantum operations. To allow using different sets of quantum operations, quantum operations are specified by programmers at compile time via configuration (see Section 3.2) instead of being defined at QISA design time. This flexibility reserves ample space for compiler-based optimization.
• Only single- and two-qubit operations are allowed, and operations on more qubits should be decomposed into single- and two-qubit operations by the compiler [33-37].
• To alleviate the quantum operation issue rate problem, eQASM adopts SOMQ execution, which supports applying a single quantum operation on multiple qubits (see Section 3.3), and a VLIW architecture, which can combine multiple different quantum operations into a quantum bundle (see Section 3.4).
Two kinds of feedback are supported. Fast conditional execution performs a Go/No-go decision for every single-qubit operation based on an execution flag of the target qubit (see Section 3.5). To be more flexible, CFC allows programmers to define arbitrary feedback by redirecting the program flow based on the measurement results (see Section 3.6).
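The two helper notations used for the auxiliary classical instructions, sign_ext and the :: operator, can be illustrated with a minimal Python sketch. It assumes that register values are modeled as unsigned integers; the explicit `imm_width` and `low_width` parameters and the function name `concat` are assumptions made for illustration, since the width of an immediate is fixed only by the instantiated binary format.

```python
def sign_ext(imm: int, imm_width: int, to_width: int = 32) -> int:
    """Sign-extend an imm_width-bit immediate to to_width bits.

    The result is returned as an unsigned to_width-bit integer.
    """
    imm &= (1 << imm_width) - 1           # keep only the immediate's bits
    sign_bit = 1 << (imm_width - 1)
    value = (imm ^ sign_bit) - sign_bit   # two's-complement interpretation
    return value & ((1 << to_width) - 1)


def concat(high: int, low: int, low_width: int) -> int:
    """Model of the :: operator: `high` followed by a low_width-bit `low`."""
    return (high << low_width) | (low & ((1 << low_width) - 1))


# A 20-bit immediate with all bits set represents -1.
minus_one = sign_ext(0xFFFFF, 20)
# A 20-bit immediate with a clear sign bit is unchanged.
positive = sign_ext(0x7FFFF, 20)
# Concatenating 0xABC with the 12-bit string 0x123.
joined = concat(0xABC, 0x123, 12)
```

Here `minus_one` is 0xFFFFFFFF, `positive` stays 0x7FFFF, and `joined` is 0xABC123.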
ARCHITECTURE
In this section, we construct the assembly syntax of quantum operations by introducing the aforementioned mechanisms.
Timing Model

3.1.1 Queue-based Timing Control. eQASM adopts the queue-based timing control scheme proposed in [16] since it can support explicit timing specification. We briefly introduce this scheme and refer readers to the original paper for a detailed discussion.
In the queue-based timing control scheme, the execution of quantum instructions can be divided into a reserve phase in the non-deterministic timing domain and a trigger phase in the deterministic timing domain. A timeline is constructed by the reserve phase and consumed by the trigger phase: executing quantum instructions in the reserve phase consecutively creates new timing points on the timeline and associates events with them; the deterministic timing domain maintains a timer, and triggers all quantum operations associated with the timing point on the timeline that it reaches. Auxiliary classical instructions and mask setting instructions are not directly associated with timing points. The trigger phase is handled by the microarchitecture; we introduce the reserve phase in the following.
3.1.2 Timeline Construction. Quantum instructions fetched from the instruction memory form a quantum instruction stream. Instructions in the stream are executed in order; this constructs a timeline by generating consecutive timing points and assigning operations to them.
If the fetched instruction is a waiting instruction, QWAIT Imm or QWAITR Rs, a new timing point in the timeline is generated. The position of the new timing point is determined by the specified interval since the last generated timing point. The interval length comes from the immediate value Imm or GPR Rs. The first timing point of the timeline can be set by a dedicated instruction, or by an external trigger to the microarchitecture. Both waiting instructions express the interval length in cycles.
If the fetched instruction is a quantum bundle instruction, the quantum operation(s) specified in the bundle instruction are associated with the last generated timing point. If multiple quantum operations are associated with the same timing point, these quantum operations all start execution at that timing point.
Based on our observation over some testbenches (see Section 4.4), short intervals between timing points are a common case. To improve the quantum operation issue rate, eQASM allows merging a short waiting time into the quantum bundle instruction, which takes the form [PI,] <Quantum Operation> [| <Quantum Operation>]* (see Section 3.4.1). Square brackets [...] indicate that the content inside is optional. PI is short for pre-interval, which specifies a short interval between the last generated timing point and the one at which the operations in this instruction are to be triggered. It defaults to 1 if not specified. The value 0 is acceptable to both PI and the waiting instructions, and means that the following timing point is identical to the last timing point.
3.1.3 Example. Assuming the durations of quantum operations Q_OP0, Q_OP1, Q_OP2, and Q_OP3 all equal one cycle, the following code triggers these four operations back-to-back.
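As a rough illustration of the timeline-construction rules of this section, the reserve phase can be modeled in a few lines of Python. This is a minimal sketch, not the microarchitecture: the tuple-based program encoding, the function name `build_timeline`, and the choice of 10 cycles for the initial wait are hypothetical, and QWAITR is omitted because it only differs in taking the interval from a register.

```python
from collections import defaultdict

DEFAULT_PI = 1  # PI defaults to 1 when it is not specified


def build_timeline(program, first_timing_point=0):
    """Map each timing point (in cycles) to the operations attached to it.

    `program` is a list of tuples (a hypothetical encoding, not eQASM syntax):
      ("QWAIT", interval)            new timing point `interval` cycles later
      ("BUNDLE", pi_or_None, [ops])  new timing point `pi` cycles later,
                                     with the operations attached to it
    """
    timeline = defaultdict(list)
    current = first_timing_point
    for instr in program:
        if instr[0] == "QWAIT":
            current += instr[1]
        elif instr[0] == "BUNDLE":
            _, pi, ops = instr
            current += DEFAULT_PI if pi is None else pi
            timeline[current].extend(ops)
        else:
            raise ValueError(f"unknown instruction: {instr[0]}")
    return dict(timeline)


# Four one-cycle operations triggered back-to-back after a 10-cycle wait.
program = [
    ("QWAIT", 10),
    ("BUNDLE", 0, ["Q_OP0"]),     # PI = 0: same timing point as the QWAIT
    ("BUNDLE", 1, ["Q_OP1"]),     # explicit PI = 1
    ("BUNDLE", None, ["Q_OP2"]),  # PI omitted, defaults to 1
    ("BUNDLE", 1, ["Q_OP3"]),
]
timeline = build_timeline(program)
```

With this input the four operations land on cycles 10, 11, 12, and 13. Two bundles separated by PI = 0 would share one timing point and therefore start together.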
Quantum Operation Definition & Decoding
Depending on the qubit technology and the algorithm to run, different quantum operations can be used. eQASM does not define a fixed set of quantum operations at QISA design time, such as {H, T, CNOT, ...}. Instead, the available quantum operations can be configured by the programmer at compile time.
Flexible quantum operation configuration is achieved through the configuration of the assembler, the microcode unit, and the pulse generator of the microarchitecture: on the one hand, the assembler is configured to translate a quantum operation, e.g., the X gate, to the expected opcode, e.g., 0x01; on the other hand, the microcode unit translates the quantum opcodes into the expected microinstruction(s) using a microcode-based instruction decoding scheme [38]. Each microinstruction represents one or more micro-operations, which are finally converted by the pulse generator into precisely timed pulses that apply the operations on qubits. The assembler, the microcode unit, and the pulse generator should be configured consistently at compile time.
Address Mechanism
A quantum operation applied on multiple qubits is a common case. For example, quantum computation usually starts by preparing the superposition state from initialized qubits, which requires applying Hadamard gates on multiple qubits. eQASM uses SOMQ execution, which can apply a single quantum operation on multiple qubits at the same time. SOMQ is similar to classical single-instruction-multiple-data (SIMD) execution [39], with the operation target replaced by qubits. An instantiated eQASM can also be treated as an implementation of the previously proposed Multi-SIMD(k, d) architecture [40], but without the assumption of SIMD regions in each of which only a single quantum operation can be applied.
SOMQ is based on an indirect qubit addressing mechanism. The SMIS or SMIT instruction first defines a set of quantum operation target(s) in a quantum operation target register. Then a quantum operation can use the target register as the operand:
<Operation Name> <Target Register>.
3.3.1 Address of Allowed Qubit Pairs. Since a two-qubit operation, such as a CNOT gate, can operate on its two qubits differently, two qubits in different orders, i.e., (Qubit A, Qubit B) and (Qubit B, Qubit A), are treated as different allowed qubit pairs. The term quantum chip topology indicates the available qubits and allowed qubit pairs of a quantum chip (see Fig. 6 for an example). The quantum chip topology can be represented as a graph where each available qubit is denoted as a vertex, and an allowed qubit pair as a directed edge. In the directed edge (Qubit A, Qubit B), Qubit A is called the source qubit and Qubit B the target qubit of the pair.
3.3.2 Translation from Assembly to Binary. Since the efficiency of encoding the qubit list (qubit pair list) may depend on the target quantum chip topology, the designer can choose different binary encoding schemes for different target quantum processors during eQASM instantiation. In general, it is more efficient to put the address pairs in the instruction for a highly connected quantum processor, while a mask format could be more efficient when the qubit connectivity is limited. For example, since at most two two-qubit gates can be applied in parallel and each qubit can be addressed with 3 bits in a fully connected 5-qubit trapped ion processor [41], only 2 × 2 × 3 bits = 12 bits are required to specify the targets of a two-qubit operation. This is more efficient than a mask of 20 bits, in which each bit indicates whether one of the 20 different allowed qubit pairs is selected. In contrast, a mask of 6 bits is more efficient for the IBM QX2 [42], which also contains five qubits but has only six allowed qubit pairs.
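The bit counts in this comparison can be reproduced with a short calculation. The following Python sketch only restates the arithmetic above; the function names are hypothetical, and the assumption that at most two two-qubit gates run in parallel on a five-qubit chip is carried over to the IBM QX2 case for the sake of comparison.

```python
def address_bits(num_qubits: int) -> int:
    """Bits needed to address one qubit out of num_qubits."""
    return max(1, (num_qubits - 1).bit_length())


def pair_list_bits(num_qubits: int, max_parallel_gates: int) -> int:
    """Explicit address pairs: two qubit addresses per two-qubit gate."""
    return max_parallel_gates * 2 * address_bits(num_qubits)


def mask_bits(num_allowed_pairs: int) -> int:
    """Mask format: one bit per allowed (directed) qubit pair."""
    return num_allowed_pairs


# Fully connected 5-qubit processor: every ordered pair of distinct
# qubits is an allowed pair, giving 5 * 4 = 20 directed edges.
fully_connected_pairs = 5 * 4
trapped_ion = (pair_list_bits(5, max_parallel_gates=2),
               mask_bits(fully_connected_pairs))

# IBM QX2: five qubits but only six allowed pairs.
ibm_qx2 = (pair_list_bits(5, max_parallel_gates=2), mask_bits(6))
```

The tuples hold (address-pair bits, mask bits): (12, 20) for the fully connected processor, where explicit pairs win, and (12, 6) for the IBM QX2, where the mask wins.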
3.3.3 Example. The following code sets the single-qubit target register S7 to contain two qubits (0 and 1), and then applies an X gate on both qubits simultaneously.
The following code sets the two-qubit target register T3 to contain two pairs of qubits, (1, 3) and (2, 4), and then applies a CNOT gate on them.
Very Long Instruction Word
3.4.1 Quantum Bundle Format. Apart from SOMQ, different operations are also allowed to be applied on different qubits in parallel. eQASM can combine parallel quantum operations into a quantum bundle in a VLIW format. We define parallel quantum operations as operations starting at the same timing point, regardless of the duration of each operation. The format of a quantum bundle is:
[PI,] <Quantum Operation> [| <Quantum Operation>]*
The vertical bar | is used to separate different quantum operations in the same bundle. The asterisk * means the item in square brackets can repeat n ≥ 0 times.
3.4.2 Translation from Assembly to Binary. In the assembly code, an arbitrary number of quantum operations can be combined into a single quantum bundle. However, a single instruction can accommodate only a few quantum operations because of the limited instruction width.
The VLIW width of eQASM characterizes the number of quantum operations that can be put in a single instruction word, and is defined during eQASM instantiation. Matching this, a single quantum bundle can be broken into multiple quantum bundle instructions, with the PI of every instruction after the first being 0. If the number of operations is not a multiple of the VLIW width, quantum no-operations (QNOP) fill up the last instruction. For example, given a VLIW width of 2, the bundle
PI, X S5 | H S7 | CNOT T3
can be decomposed by the assembler into two consecutive quantum bundle instructions:
PI, X S5 | H S7
0, CNOT T3 | QNOP
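The decomposition rule can be sketched as a small Python function. This is an illustrative model of what an assembler might do under the rules stated above, not the actual assembler; the names `split_bundle` and `to_assembly` and the list-based representation are assumptions.

```python
def split_bundle(pi, operations, vliw_width):
    """Break one quantum bundle into bundle instructions of fixed width.

    The first instruction keeps the bundle's PI; every following
    instruction uses PI = 0 so that all operations share one timing
    point. The last instruction is padded with QNOP.
    """
    if vliw_width < 1:
        raise ValueError("VLIW width must be at least 1")
    if not operations:
        raise ValueError("a bundle needs at least one operation")
    instructions = []
    for start in range(0, len(operations), vliw_width):
        slots = list(operations[start:start + vliw_width])
        slots += ["QNOP"] * (vliw_width - len(slots))
        instructions.append((pi if start == 0 else 0, slots))
    return instructions


def to_assembly(instruction):
    """Render one (pi, slots) pair in the bundle format of Section 3.4.1."""
    pi, slots = instruction
    return f"{pi}, " + " | ".join(slots)


# The bundle from the text, with PI = 1 and a VLIW width of 2.
lines = [to_assembly(i)
         for i in split_bundle(1, ["X S5", "H S7", "CNOT T3"], vliw_width=2)]
```

For the three-operation bundle this yields the two instructions `1, X S5 | H S7` and `0, CNOT T3 | QNOP`.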
3.4.3 Example. In the code shown in Fig. 3, the instruction QWAIT 10000 initializes both qubits by idling them for 200 µs (assuming a cycle time of 20 ns). Line 6 applies a Y gate on both qubits using SOMQ. Line 7 is a VLIW instruction, which applies an X90 gate and an X gate, one on each qubit. In this paper, X90 (Y90) denotes the gate rotating the quantum state along the x-(y-)axis by an angle of π/2. Xm90 (Ym90) denotes similar gates but with a rotation angle of −π/2. Line 8 measures both qubits using SOMQ. According to the PI values, the Y gate happens immediately after the initialization, followed by the X90 and X gates 20 ns later and the measurement 40 ns later. The 1 µs waiting time (line 9) ensures that no operations happen during the measurement.
Fig. 3. Part of the code for a two-qubit AllXY experiment, which is used in validating eQASM in Section 5.
Fast Conditional Execution
Fast conditional execution allows executing or canceling a single-qubit operation when the micro-operation is triggered. The decision is made based on the value of a selected flag in the execution flag register corresponding to the target qubit. The value of the execution flag is derived by the microarchitecture using predefined combinatorial logic from the last measurement results of the same qubit. Once a measurement result for a qubit returns, the corresponding execution flags are updated automatically. If the execution flag is '1', then the operation executes; otherwise, it is canceled. A selection signal is required for each micro-operation to select which execution flag to use; it can be generated by the microcode unit, or specified by an instruction field [10]. Except for the default execution flag, which should always be '1', the number and definition of the execution flags should be fixed during eQASM instantiation (see Section 4.2 for an example).
Example. In one instantiation of eQASM, the quantum operation C_X uses the execution flag which is '1' if and only if (iff) the last measurement result of the qubit is |1⟩. Figure 4 shows the code for the active qubit reset experiment, where qubit 2 is put in an equal superposition using an X90 gate after initializing it in the |0⟩ state by idling it for 200 µs. After a measurement, a conditional C_X gate is applied to reset the qubit. Qubit 2 is measured again to read out the final state for verification.
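The Go/No-go decision can be mimicked with a purely classical Python model. This sketch assumes only two execution flags (the default flag and the one used by C_X in the example above); the flag names, the `FLAG_SELECT` table, and the representation of the qubit as a classical bit after measurement are hypothetical simplifications.

```python
def execution_flags(last_result):
    """Derive the execution flags of one qubit from its last measurement.

    Only two flags are modeled; which flags exist is decided when
    eQASM is instantiated.
    """
    return {
        "always": True,                   # default flag, always '1'
        "last_is_one": last_result == 1,  # flag assumed for C_X
    }


# Hypothetical selection table: which flag each operation uses.
FLAG_SELECT = {"X": "always", "C_X": "last_is_one"}


def trigger(operation, last_result):
    """Go/No-go decision taken when the micro-operation is triggered."""
    return execution_flags(last_result)[FLAG_SELECT[operation]]


def active_reset(measured_state):
    """Classical model of active reset: measure, then a conditional X."""
    state = measured_state  # after measurement the qubit equals the result
    if trigger("C_X", last_result=measured_state):
        state ^= 1          # the X gate flips |0> and |1>
    return state


final_states = [active_reset(result) for result in (0, 1)]
```

Whichever result the first measurement returns, the modeled qubit ends in state 0, which is what the second measurement in the active reset experiment is meant to verify.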
Comprehensive Feedback Control
CFC allows adjusting the program flow based on measurement results of any qubits to enable arbitrary user-defined feedback. This flexibility comes at the cost of longer feedback latency. We propose a three-step mechanism to implement CFC:
(1) A measurement instruction is applied on the condition qubit i. At the moment that this measurement instruction is issued, Qi is invalidated. At the moment the measurement result is available, it is written into Qi. Qi turns back to valid if there are no more pending measurement instructions on qubit i.
(2) The FMR Rd, Qi instruction fetches the value of the qubit measurement result register Qi into GPR Rd. If Qi is invalid, FMR waits until Qi becomes valid again. Thereafter, the value of Qi can be fetched into Rd. Qi remains valid until qubit i is measured again.
(3) GPR Rd is then used in a BR instruction to select the program flow to follow. Note that multiple FMR and BR instructions can be combined to support more complex feedback logic.
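The valid/invalid protocol of the three steps above can be sketched as a small Python model. It is an illustration under simplifying assumptions: stalling is represented by returning None instead of blocking, results are assumed to arrive in issue order, and the class name, method names, and the two-way `select_flow` function are hypothetical.

```python
class MeasurementResultRegister:
    """Model of one qubit measurement result register Qi and its validity."""

    def __init__(self):
        self.value = 0
        self.pending = 0  # measurement instructions issued but not finished

    @property
    def valid(self):
        return self.pending == 0

    def issue_measurement(self):
        """Step 1: issuing a measurement invalidates Qi."""
        self.pending += 1

    def result_arrives(self, result):
        """Step 1: the result is written; Qi is valid once nothing is pending."""
        if self.pending == 0:
            raise RuntimeError("no pending measurement")
        self.value = result
        self.pending -= 1

    def fmr(self):
        """Step 2: return the value of Qi, or None while FMR has to wait."""
        return self.value if self.valid else None


def select_flow(rd):
    """Step 3: a BR-style choice between two program flows."""
    return "Y" if rd == 1 else "X"


q1 = MeasurementResultRegister()
q1.issue_measurement()
stalled = q1.fmr()            # Qi is invalid, so FMR cannot complete yet
q1.result_arrives(1)
gate = select_flow(q1.fmr())  # Qi is valid again and holds 1
```

In this run `stalled` is None because the measurement is still pending, and `gate` is "Y" once the result 1 has been written back.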
Example. The eQASM program shown in Fig. 5 first measures qubit 1. If the measurement result is 1, a Y gate is applied on qubit 0; otherwise, an X gate is applied.
INSTANTIATION & IMPLEMENTATION
This section introduces an instantiation, microarchitecture, and implementation of eQASM.
Target Superconducting Quantum Chip
The quantum chip topology of the target seven-qubit superconducting quantum chip is shown in Fig. 6. It is part of a two-dimensional square lattice as proposed in [43]. It can implement a distance-2 surface code [44], which can detect one physical error. In this figure, a vertex represents a qubit, and a directed edge represents an allowed qubit pair. Numbers beside the vertex (edge) are the addresses of qubits (allowed qubit pairs). For example, allowed qubit pair 0 has qubit 2 as the source qubit and qubit 0 as the target qubit. The feedlines are used to measure the nearby coupled qubits. Qubits 0, 2, 3, 5, and 6 (1 and 4) are coupled to feedline 0 (1). Each feedline has an input port and an output port. Besides, each qubit is connected to a microwave port and a flux port, which are not shown in Fig. 6.
Operations supported by this quantum processor include measurements, single-qubit x- or y-axis rotations, and a two-qubit controlled-phase (CZ) gate. A typical gate time is 20 ns for single-qubit gates and ∼40 ns for two-qubit gates. The duration of a measurement is typically 300 ns to 1 µs. A cycle time of 20 ns is used in this instantiation.
Instantiation Design Space Exploration
To determine a suitable eQASM instantiation configuration for the target quantum processor [a single-(two-)qubit gate time of 1 (2) cycle(s), and a measurement time of 15 cycles], we analyze three benchmarks using a quantum control architecture simulator derived from the previously proposed QPDO [45]. Because substantial time is spent on calibrating qubits before running applications with NISQ technology, the first benchmark we select is the widely used calibration experiment randomized benchmarking (RB) [46, 47], which might be limited by high memory consumption when the required waveform for control is plainly stored in memory. Each qubit is subject to 4096 single-qubit Clifford gates which have been decomposed into x and y rotations. Because every gate happens immediately following the previous one, randomized benchmarking cannot reveal timing patterns of quantum operations in real quantum algorithms, where the parallelism is limited by two-qubit gates. Addressing this, we also select two benchmarks from ScaffCC [11] as representatives of small-scale quantum algorithms that might be executed with NISQ technology: a parallel algorithm (Ising model using 7 qubits, IM), which has < 1% two-qubit gates, and a relatively sequential algorithm (Grover's algorithm to calculate the square root using 8 qubits, which is the minimum number of qubits required, SR), which has ∼39% two-qubit gates. The evaluation metric is the total number of instructions.
We investigate the impact of the VLIW width (w), three timing-specification methods, and SOMQ on the number of instructions. The three timing-specification methods are: the MIS fashion (specifying every timing point using separate QWAIT instructions, ts_1); including QWAIT in the quantum bundle instruction at the place of a quantum operation (ts_2); and using PI with various bit widths (w_PI) to specify a small waiting time and using separate QWAIT instructions to specify longer waiting times (ts_3). The simulation results are shown in Fig. 7.
Config 1 is (ts_1, no PI, no SOMQ), and Config 1 with w = 1 is chosen as the baseline. By increasing w from 1 to 4, the number of instructions can be reduced by up to 62% (RB). Benchmarks with substantial parallelism (RB and IM) benefit more from a large w. The instruction reduction in SR (∼8%) indicates that a large w only slightly improves quantum applications with limited parallelism.
Config 2 is (ts_2, no PI, no SOMQ). A minimum w of 2 is required by ts_2 to distinguish it from ts_1. Compared with Config 1, by including the QWAIT operation as part of a quantum bundle instruction, Config 2 can reduce the number of instructions by 20-33% (RB), 24-45% (IM), and 43-50% (SR) by varying w from 2 to 4. SR benefits most for two reasons. First, due to its sequential nature, it has relatively more QWAIT instructions. Second, limited parallelism in this algorithm leaves potential VLIW slots unused, which can be filled by QWAIT instructions.
Config 3/4/5/6 is (ts_3, w_PI = 1/2/3/4, no SOMQ). Config 3 can reduce the number of instructions by 13-33% for RB and 28-44% for IM with w varying from 1 to 4 compared with Config 1. Since the intervals between operations in RB and IM are mostly close to 1, further increasing w_PI up to 4 bits introduces marginal benefit. Config 3 reduces the number of instructions of SR by ∼17% regardless of w. Further increasing w_PI to 3 or 4 bits can reduce the number of instructions of SR by up to 48%. Like SR, quantum algorithms are scheduled to be executed in a time as short as possible.
This result of Config 3-6 suggests that most of the waiting time is short and can be encoded in a 3-bit PI field. Note that Config 3/4/5 is also more beneficial than Config 2 when w = 1 or w = 2.
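The difference between the ts_1 and ts_3 timing-specification methods can be illustrated with a toy instruction-count model. This is only a sketch and not the simulator used for Fig. 7: the function name `count_instructions`, the schedule representation, and the assumption that a w_PI-bit PI field can encode any wait below 2^w_PI cycles are all illustrative choices, and SOMQ and ts_2 are not modeled.

```python
from math import ceil


def count_instructions(points, w, w_pi=None):
    """Count instructions needed for a schedule (toy model, see assumptions above).

    points: list of (interval_cycles, n_ops), one entry per timing point, where
            interval_cycles is the wait since the previous timing point and
            n_ops is the number of quantum operations starting at that point.
    w:      VLIW width (quantum operations per bundle instruction).
    w_pi:   width of the PI field in bits, or None for the ts_1 style.
    """
    total = 0
    for interval, n_ops in points:
        bundles = ceil(n_ops / w)  # bundle instructions needed at this point
        if w_pi is None:
            # ts_1: every timing point is specified by a separate QWAIT.
            total += 1 + bundles
        elif interval < (1 << w_pi):
            # ts_3: a short wait fits in the PI field of the bundle itself.
            total += bundles
        else:
            # ts_3: a long wait still needs a separate QWAIT instruction.
            total += 1 + bundles
    return total


# RB-like schedule: 7 qubits, back-to-back gates, one new gate per qubit per cycle.
rb_like = [(1, 7)] * 100
baseline = count_instructions(rb_like, w=1)             # ts_1, w = 1
with_pi = count_instructions(rb_like, w=1, w_pi=1)      # ts_3, w = 1, 1-bit PI
print(baseline, with_pi)
```

For this RB-like schedule the toy model yields a 12.5% reduction at w = 1 and about 33% at w = 4 when moving from ts_1 to ts_3, which is in line with the RB range quoted above, although the real benchmarks are of course richer than this schedule.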
Config 7/8/9/10 is (ts_3, w_PI = 1/2/3/4, SOMQ). Our analysis assumes that the target registers can always provide the required qubit (pair) list, and therefore shows the theoretical maximum benefit that can be obtained by SOMQ. Compared to Config 3/4/5/6, SOMQ can introduce a maximum reduction of 42% (Config 8, w = 2) in the number of instructions for RB, while it can only reduce at most 4% of the instructions for SR (Config 8, w = 1). Regardless of w_PI, SOMQ can help reduce the number of instructions of IM by ∼24, 19, 9, and 2% for different w.
This fact suggests that SOMQ is more effective for highly parallel applications, especially when w is small. An application that would benefit significantly from SOMQ is quantum error correction, which requires repeatedly performing well-patterned error syndrome measurements that present high parallelism. Although not shown in the figure, we also analyzed the number of effective quantum operations in each quantum bundle for Config 9, which is 1.795, 2.296, and 3.144 for RB; 1.485, 1.622, and 1.623 for IM; and 1.118, 1.147, and 1.147 for SR with w varying from 2 to 4, respectively. This indicates that with SOMQ, w > 2 is not strongly required for many quantum applications (RB is a special case with extreme parallelism).
As a result of the analysis, our eQASM instantiation adopts Config 9 (ts_3, w_PI = 3, SOMQ) with w = 2. A width of 32 bits is used by all instructions for memory alignment. Two instruction formats are used: the single format with the highest bit being '0' and the bundle format with the highest bit being '1'. Single format instructions use the other 31 bits to encode a single instruction, including all auxiliary classical instructions, and SMIS, SMIT, QWAIT(R) instructions. For brevity, we only present the format of quantum instructions as shown in Fig. 8. There are 32 single-(two-)qubit target registers, and the target register address width is 5 bits. The target registers use a mask format.
The mask is 7-(16-)bit wide in the single-(two-)qubit target register. Each bit in the mask of the value '1' indicates that the corresponding qubit (allowed qubit pair) is selected. In the QWAIT(R) instruction, only the least significant 20 bits of the Imm field or GPR Rs are used to specify the waiting time. In the quantum bundle instruction, each quantum operation occupies 14 bits and the q opcode is 9 bits.
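The field widths above add up to one 32-bit word for a bundle instruction: 1 format bit, two 14-bit quantum operations (a 9-bit q opcode plus a 5-bit target register address each), and a 3-bit PI field. The following sketch packs and unpacks such a word. The field order is an assumption made for illustration (the actual layout is the one given in Fig. 8), as are the helper names `encode_bundle`, `decode_bundle`, and `mask_to_addresses`, and the convention that mask bit i selects qubit (or allowed qubit pair) i.

```python
Q_OPCODE_BITS = 9      # from the text
TARGET_REG_BITS = 5    # from the text: 32 target registers
PI_BITS = 3            # from the text: w_PI = 3
OP_BITS = Q_OPCODE_BITS + TARGET_REG_BITS  # 14 bits per quantum operation


def encode_bundle(ops, pi):
    """Pack two (q_opcode, target_reg) pairs and a PI value into 32 bits.

    Assumed layout, most significant bit first:
    1 | opcode0 (9) reg0 (5) | opcode1 (9) reg1 (5) | PI (3)
    """
    if len(ops) != 2:
        raise ValueError("this instantiation uses a VLIW width of 2")
    if not 0 <= pi < (1 << PI_BITS):
        raise ValueError("PI does not fit in 3 bits")
    word = 1  # highest bit '1' marks the bundle format
    for opcode, reg in ops:
        if not 0 <= opcode < (1 << Q_OPCODE_BITS):
            raise ValueError("q opcode does not fit in 9 bits")
        if not 0 <= reg < (1 << TARGET_REG_BITS):
            raise ValueError("target register address does not fit in 5 bits")
        word = (word << Q_OPCODE_BITS) | opcode
        word = (word << TARGET_REG_BITS) | reg
    return (word << PI_BITS) | pi


def decode_bundle(word):
    """Inverse of encode_bundle; returns ([(opcode, reg), (opcode, reg)], pi)."""
    if word >> 31 != 1:
        raise ValueError("not a bundle-format instruction")
    pi = word & ((1 << PI_BITS) - 1)
    body = word >> PI_BITS
    ops = []
    for shift in (OP_BITS, 0):
        op = (body >> shift) & ((1 << OP_BITS) - 1)
        ops.append((op >> TARGET_REG_BITS, op & ((1 << TARGET_REG_BITS) - 1)))
    return ops, pi


def mask_to_addresses(mask, width):
    """List the qubit (width=7) or qubit-pair (width=16) addresses set in a mask."""
    return [i for i in range(width) if (mask >> i) & 1]
```

For example, `mask_to_addresses(0b0000101, 7)` selects qubits 0 and 2 under the assumed bit ordering, which is how a single operation in a bundle can act on several qubits at once (SOMQ).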
Microarchitecture
MIS is implemented by the control microarchitecture MA with codeword-based event control, queue-based event timing control, and multi-level instruction decoding [16]. Adopting these three mechanisms, we redesign a quantum control microarchitecture, MA v2, implementing the instantiated eQASM as shown in Fig. 9. It supports all features of eQASM. The classical pipeline maintains the PC and implements the GPR file and the comparison flags. The execution flag register is maintained by the fast conditional execution module. The classical pipeline fetches and processes instructions one by one from the instruction memory. All auxiliary classical instructions are processed by the classical pipeline while quantum instructions are forwarded to the quantum pipeline for further processing. The timestamp manager processes the QWAIT(R) instructions and the PI field to generate timing points.
The quantum pipeline contains a VLIW front end with two VLIW lanes, each lane processing one quantum operation. The SMIS (SMIT) instructions update the corresponding target registers in each VLIW lane. Inside each VLIW lane, the q opcode is translated by the microcode unit into one micro-operation (labeled as µop_s) for a single-qubit operation or two micro-operations (labeled as µop_src and µop_tgt) for a two-qubit operation. µop_src (µop_tgt) will be applied on the source (target) qubit of the target qubit pair. The configuration of the microcode unit is stored in the Q control store, which is implemented using a lookup table. The target register Si (Ti) is read for a single-(two-)qubit operation. The quantum microinstruction buffer resolves the mask-based qubit address and associates the quantum operations to the last generated timing point. It resolves the qubit address in two steps.
First, the mask stored in Si (Ti) is translated into seven two-bit micro-operation selection signals OpSel_i, where i = 0, 1, ..., 6, with each signal for one qubit. Table 2 lists the meaning of every case of the micro-operation selection signal.

Table 2. Definition of the micro-operation selection signal.
Value   Operation to Select
'00'    None
'01'    µop_src
'10'    µop_tgt
'11'    µop_s

For single-qubit operations, OpSel_i is set to '11' ('00') if the i-th bit in the mask is '1' ('0'). For a two-qubit operation, OpSel_i is set to '00' if qubit i is not contained in any selected allowed qubit pair. Otherwise, OpSel_i is '01' ('10') if the target qubit pair contains qubit i as the source (target) qubit. Take qubit 0 as an example. It is connected to edges 0, 1, 8, and 9. When edge 0 or 9 (1 or 8) is selected in the mask, qubit 0 is the target (source) qubit and should be applied with µop_tgt (µop_src). In other words, OpSel_0 should be '10' ('01'), and can be generated using a simple OR (∨) logic: OpSel_0 = (Ti[0] ∨ Ti[9]) :: (Ti[1] ∨ Ti[8]), where :: denotes bit concatenation. The assembler should check the validity of two-qubit target register values. For example, it is invalid if two edges connecting to the same qubit are selected in the same T register.
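The first address-resolution step can be sketched in software as follows. This is a minimal illustration rather than the hardware logic itself. Only edge 0 (source qubit 2, target qubit 0) and the roles of qubit 0 on edges 0, 1, 8, and 9 come from the text; the partner qubits of edges 1, 8, and 9 in the `EDGES` table are placeholders, and the remaining edges of Fig. 6 are omitted. The function names are hypothetical.

```python
NUM_QUBITS = 7
NUM_EDGES = 16

# OpSel encodings from Table 2.
NONE, SRC, TGT, SINGLE = 0b00, 0b01, 0b10, 0b11

# edge address -> (source qubit, target qubit).
# Edge 0 is stated in the text; edges 1, 8, and 9 only reflect that qubit 0 is
# the source on edges 1 and 8 and the target on edge 9. Their partner qubits
# are ASSUMED for illustration, and the other edges are left out.
EDGES = {0: (2, 0), 1: (0, 3), 8: (0, 2), 9: (3, 0)}


def opsel_single(mask):
    """OpSel_i for a single-qubit operation: '11' if bit i of Si is set."""
    return [SINGLE if (mask >> i) & 1 else NONE for i in range(NUM_QUBITS)]


def opsel_two_qubit(mask, edges=EDGES):
    """OpSel_i for a two-qubit operation, with the assembler's validity check."""
    opsel = [NONE] * NUM_QUBITS
    for edge in range(NUM_EDGES):
        if not (mask >> edge) & 1:
            continue
        if edge not in edges:
            raise ValueError(f"edge {edge} is not in the illustrative table")
        src, tgt = edges[edge]
        for qubit, role in ((src, SRC), (tgt, TGT)):
            if opsel[qubit] != NONE:
                # Two selected edges share a qubit: invalid T register value.
                raise ValueError(f"qubit {qubit} appears in two selected edges")
            opsel[qubit] = role
    return opsel
```

Selecting only edge 0 gives OpSel_0 = '10' and OpSel_2 = '01', which agrees with the OR expression above; selecting edges 0 and 9 together is rejected because both contain qubit 0.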
Second, based on OpSel_i, either none or one micro-operation is output for qubit i. This step is fully parallel.
The operation combination module also works in a two-step fashion. First, since each VLIW lane outputs none or one micro-operation for each qubit, the operation combination module merges the micro-operations from both VLIW lanes. If both VLIW lanes output a micro-operation on the same qubit, an error is raised, and the quantum processor stops. Second, as explained in Section 3.4, a long quantum bundle requires multiple quantum bundle instructions to describe it. The operation combination module buffers all micro-operations associated with the same timing point. Only when it detects that all quantum operations in the same quantum bundle have been collected does the operation combination module send the buffered micro-operations to the device event distributor.
This detection can be done, e.g., by recognizing a new timing point generated by the timestamp manager which is different from the one associated with the buffered micro-operations. Also, if two different quantum bundle instructions specify a quantum operation on the same qubit, an error is raised, and the quantum processor stops.
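The behavior of the operation combination module described above can be modeled in a few lines. This is a behavioral sketch only: the class name `OperationCombiner`, the dictionary representation of lane outputs, and the use of a Python exception to stand in for "an error is raised, and the quantum processor stops" are assumptions made for illustration.

```python
class OperationCombiner:
    """Merge per-lane micro-operations and buffer them per timing point."""

    def __init__(self):
        self.timing_point = None  # timing point of the buffered micro-operations
        self.buffer = {}          # qubit -> micro-operation

    def push(self, timing_point, lane_outputs):
        """Accept the output of one quantum bundle instruction.

        lane_outputs: one dict per VLIW lane, mapping qubit -> micro-operation.
        Returns (timing_point, micro_ops) of the previous bundle when a new
        timing point is recognized, otherwise None.
        """
        flushed = None
        if self.timing_point is not None and timing_point != self.timing_point:
            # A new timing point means the previous bundle is complete.
            flushed = (self.timing_point, self.buffer)
            self.buffer = {}
        self.timing_point = timing_point
        for lane in lane_outputs:
            for qubit, micro_op in lane.items():
                if qubit in self.buffer:
                    # Same qubit targeted twice, by two lanes or two instructions.
                    raise RuntimeError(f"two operations on qubit {qubit}")
                self.buffer[qubit] = micro_op
        return flushed


combiner = OperationCombiner()
combiner.push(0, [{0: "X"}, {1: "Y"}])       # first instruction of the bundle
combiner.push(0, [{2: "X"}, {}])             # same timing point: keep buffering
print(combiner.push(1, [{0: "X90"}, {}]))    # new timing point: flush the bundle
```

A single conflict check covers both error cases in the text, because micro-operations from both lanes and from consecutive instructions of the same bundle land in the same buffer.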
As shown in Section 4.4, operating a qubit may require the collaboration of multiple electronic devices in the analog-digital interface, and a single device may also control multiple qubits. Hence, the micro-operations should be reorganized into device operations to trigger the corresponding devices. The device event distributor reorganizes multiple micro-operations associated with the same timing label into different device operations. After that, each device operation with the associated timing label is buffered at an event queue of the timing control unit awaiting execution. The timing controller then triggers every device operation at its expected timing point.

Fig. 9. Quantum microarchitecture implementing the instantiated eQASM for the seven-qubit superconducting quantum processor.
After the device operations have been triggered by the timing controller, fast conditional execution is performed based on the selected execution flags of the target qubits. The execution flag selection signal comes from the microcode unit configured by the programmer. Only device operations for qubits of which the selected execution flag is '1' are released to the analog-digital interface (ADI). In this eQASM instantiation, four types of combinatorial logic are used to define the execution flags:
(1) '1' (the default for unconditional execution);
(2) '1' iff the last finished measurement result is |1⟩;
(3) '1' iff the last finished measurement result is |0⟩;
(4) '1' iff the last two finished measurements get the same result.
Note that the last finished measurement result refers to the result of the last finished measurement instruction on this qubit at the time these flags are used. It is independent of the validity of the quantum measurement result register. Once a measurement result for a qubit returns from the analog-digital interface, the fast conditional execution unit immediately updates the execution flags corresponding to that qubit.
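The four execution flags and the release step can be sketched as below. This is an illustrative model, not the hardware implementation: the function names are hypothetical, flags are indexed 0 to 3 here (corresponding to types (1) to (4) above), and the values of flags (2) to (4) before enough measurements have finished are an assumption, since the text does not specify them.

```python
def execution_flags(results):
    """Compute the four execution flags of one qubit.

    results: finished measurement results (0 or 1) of this qubit, oldest first.
    Returns a tuple (always, last_is_1, last_is_0, last_two_equal).
    ASSUMPTION: flags that depend on missing results are 0.
    """
    last = results[-1] if results else None
    last_two_equal = len(results) >= 2 and results[-1] == results[-2]
    return (1, int(last == 1), int(last == 0), int(last_two_equal))


def release(device_ops, histories):
    """Keep only device operations whose selected execution flag is '1'.

    device_ops: list of (qubit, operation, flag_index) triggered at this point.
    histories:  dict mapping qubit -> list of finished measurement results.
    """
    return [
        (qubit, op)
        for qubit, op, flag_index in device_ops
        if execution_flags(histories.get(qubit, []))[flag_index]
    ]


# Active-reset style example: apply a conditional X on qubit 2 only if the
# last finished measurement of qubit 2 returned 1.
print(release([(2, "C_X", 1)], {2: [1]}))  # operation is released
print(release([(2, "C_X", 1)], {2: [0]}))  # operation is cancelled
```

Because the decision is taken per device operation just before the ADI, the conditional operation stays in the instruction stream and no branch is needed.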
To support CFC, a counter Ci is attached to each qubit measurement result register Qi, with an initial value of 0. Once a measurement instruction acting on qubit i is issued from the classical pipeline to the quantum pipeline, Ci increments by 1. If the measurement discrimination unit writes back a measurement result for qubit i, Ci decrements by 1. Qi is valid only when Ci is 0. If Ci is not 0 when the instruction FMR Rd, Qi is issued, the pipeline is stalled until Ci is 0. In this way, it is ensured that the instruction FMR Rd, Qi always fetches the result of the last measurement instruction acting on qubit i.
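The counter mechanism can be modeled with a small class. This is a behavioral sketch under stated assumptions: the class and method names are hypothetical, a stall is represented by returning `ready = False` instead of blocking, and the register value before any measurement is left as `None` because the text does not define it.

```python
class MeasurementResultRegister:
    """Model of Qi together with its pending-measurement counter Ci."""

    def __init__(self):
        self.pending = 0   # Ci: measurements issued but not yet written back
        self.value = None  # Qi: last written-back result (ASSUMED None at reset)

    def issue_measurement(self):
        """A measurement on this qubit enters the quantum pipeline."""
        self.pending += 1

    def write_back(self, result):
        """The measurement discrimination unit returns a result."""
        if self.pending == 0:
            raise RuntimeError("write-back without a pending measurement")
        self.value = result
        self.pending -= 1

    def fmr(self):
        """Model of FMR Rd, Qi: returns (ready, value).

        ready is False while Ci != 0, meaning the pipeline would stall.
        """
        if self.pending != 0:
            return (False, None)
        return (True, self.value)


q0 = MeasurementResultRegister()
q0.issue_measurement()
q0.issue_measurement()
print(q0.fmr())        # stalled: two results are still outstanding
q0.write_back(1)
print(q0.fmr())        # still stalled: the last measurement has not returned
q0.write_back(0)
print(q0.fmr())        # ready, with the result of the last measurement
```

Counting outstanding measurements, rather than using a single valid bit, is what guarantees that FMR never returns a stale result when several measurements on the same qubit are in flight.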
Implementation
The hardware structure implementing the microarchitecture (Fig. 10) consists of a Central Controller responsible for orchestrating three modules containing slave devices for microwave control, flux control, and measurement. The Central Controller is a digital device built with an Intel Altera Cyclone V SOC 5CSTFD6D5F31I7N Field Programmable Gate Array (FPGA) chip. The Central Controller implements the digital part of the microarchitecture (left of the ADI in Fig. 9). The timing controller and fast conditional execution module work at 50 MHz to get a cycle time of 20 ns. The other parts work at 100 MHz.
Single-qubit x and y rotations are performed by applying microwave pulses to the qubits. The pulses are generated by Zurich Instruments High Density Arbitrary Waveform Generators (HDAWG) and modulated using a Rohde & Schwarz (R&S) SGS100A microwave source. A custom-built vector switch matrix (VSM) is responsible for duplicating and routing the pulses to the respective qubits as well as tailoring the waveforms to the individual qubits [48] using a qubit-frequency reuse scheme that allows for efficient scaling of the microwave control module [43].
Two-qubit CZ gates and single-qubit z rotations are performed by applying flux pulses, generated by an HDAWG, on the dedicated flux line of each qubit. The measurement discrimination unit is implemented using two Zurich Instruments Ultra-High-Frequency Quantum Controllers (UHFQC) connected to the two feedlines shown in Fig. 6. The UHFQC has two analog outputs that can be used to generate the measurement pulses and two analog inputs to sample the transmitted signals from which the UHFQC can infer the measurement result. The measurement pulses going to (coming from) the qubits are modulated (demodulated) using a single R&S SGS100A. All analog ports operate at 1.8 GSa/s, allowing for simultaneous measurement of up to 9 qubits per feedline using frequency multiplexing techniques [49].
The Central Controller connects to the UHFQCs and HDAWGs via a 32-bit digital interface working at 50 MHz. Since measurement results are sent from the UHFQC to the Central Controller, 16 bits of the connection run from the Central Controller to the UHFQC and the other 16 bits run in the opposite direction. All operations on UHFQCs and HDAWGs are codeword triggered. The routing of microwave pulses by the VSM is controlled through seven digital signals with a sampling rate of 400 MSa/s.
EXPERIMENT
Since the target seven-qubit quantum chip is still under test at the time of writing, we replaced the quantum chip of this microarchitecture with a two-qubit superconducting quantum processor to validate the eQASM design. The two qubits are interconnected and coupled to a single feedline. A configuration file is used to specify the quantum chip topology with the two qubits renamed as qubit 0 and 2. It is used by the quantum compiler and the assembler. The eQASM programs used to perform the experiments described below are all compiled from OpenQL descriptions with the corresponding quantum operation configuration.
We first used eQASM to perform some single-qubit calibration experiments which utilize uncalibrated operations. For example, the Rabi oscillation [50] applies an x-rotation pulse on the qubit after initialization and then measures it. A sequence of fixed-length x-rotation pulses with variable amplitudes is used. Each pulse in the sequence is uploaded to the codeword-triggered pulse generation unit of the microarchitecture and configured to be an operation X_Amp_i in eQASM. As a result, this experiment calibrated the amplitude of the X gate pulse. Together with other experiments, the fidelity of single-qubit quantum operations used later reached 99.90% as measured in the following RB experiment. It is worth mentioning that we observed considerable speedup in performing these experiments with the eQASM control paradigm in practice.

Fig. 11. Two-qubit AllXY result, corrected for readout errors.

eQASM is then configured to include single-qubit gates {I, X, Y, X_90, Y_90, X_m90, Y_m90} and a two-qubit CZ gate for the following experiments. The AllXY experiment is typically used to calibrate single-qubit gates. In AllXY, pairs of single-qubit gates are chosen from the set {I, X, Y, X_90, Y_90} and applied in such a way that the expected measurement outcomes produce a characteristic staircase pattern that is highly sensitive to gate errors (red line in Fig. 3). In the two-qubit AllXY experiment, the control pulses are applied on each qubit simultaneously. The sequence is modified to distinguish the qubits on which it is applied: each gate pair in the sequence is repeated on the first qubit while the entire sequence is repeated on the second qubit. The fidelity of the qubit to the |1⟩ state can be extracted by averaging the measurement results for each gate pair over N rounds and correcting for readout errors. The eQASM program for one routine of this experiment is shown in Fig. 3. Figure 11 shows the final measurement result of the entire experiment (blue dots), which matches well with the expectation (red line). This demonstrates that the timing control, SOMQ, and VLIW of eQASM work properly in the experiment.
To evaluate the impact of the timing of operations on the error rate, we use single-qubit randomized benchmarking, a technique that can estimate the average error rate for a set of operations under a very general noise model [46, 47]. In this experiment, a sequence of k random Clifford gates is applied on a qubit initialized in the |0⟩ state. Before measurement, a Clifford is chosen that inverts all preceding operations so that the qubit should end up in the |0⟩ state with survival probability p(k). By performing this experiment for different k and averaging over many randomizations, the Clifford fidelity F_Cl can be extracted from the exponential decay. Because each Clifford gate is decomposed into primitive x- and y-rotations, the gate count is increased by a factor of 1.875 on average. The average error rate per gate, ϵ, is then calculated as ϵ = 1 − F_Cl^(1/1.875). Single-qubit randomized benchmarking was performed for different intervals between the starting points of consecutive gates (320, 160, 80, 40, and 20 ns). As shown in Fig. 12, the average error per gate decreases by a factor of ∼7, from 0.71% to 0.10%, when decreasing the interval from 320 ns to 20 ns. This demonstrates the significant impact of timing on the fidelity of the final computation result, which substantiates the requirement of explicit specification of timing at the QISA level to enable platform-specific optimization and especially scheduling by the compiler.
Fast conditional execution is verified by the active qubit reset experiment with qubit 2 using the code shown in Fig. 4. We find the probability of measuring the qubit in the |0⟩ state after conditionally applying the C_X gate to be 82.7%, limited by the readout fidelity.
We verified CFC by connecting the Central Controller and the UHFQC. The eQASM program used is shown in Fig. 5. The UHFQC is programmed to generate alternating mock measurement results for qubit 0. The alternation between X and Y operations is verified by detecting the output digital signals using an oscilloscope.
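The conversion between Clifford fidelity and average error per gate used in the randomized benchmarking analysis above is a one-line formula; the following sketch writes it out together with its inverse. The function names are illustrative, and the fit that extracts F_Cl from the decay of p(k) is not shown.

```python
GATES_PER_CLIFFORD = 1.875  # average number of primitive gates per Clifford (from the text)


def error_per_gate(f_cl):
    """Average error per primitive gate: eps = 1 - F_Cl ** (1 / 1.875)."""
    return 1.0 - f_cl ** (1.0 / GATES_PER_CLIFFORD)


def clifford_fidelity(eps):
    """Inverse relation: F_Cl = (1 - eps) ** 1.875."""
    return (1.0 - eps) ** GATES_PER_CLIFFORD


# Clifford fidelities implied by the reported per-gate errors of 0.71% and 0.10%.
for eps in (0.0071, 0.0010):
    print(f"eps = {eps:.4f} -> F_Cl = {clifford_fidelity(eps):.4f}")
```

Under this relation, the reported per-gate errors of 0.71% and 0.10% correspond to Clifford fidelities of roughly 0.987 and 0.998, respectively.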
We also measured the feedback latency of fast conditional execution and CFC, which are ∼92 ns and ∼316 ns, respectively. The feedback latency is defined as the time between sending the measurement result into the Central Controller and receiving the digital output based on the feedback from the Central Controller.
As a proof of concept of performing quantum algorithms using eQASM, we executed a two-qubit Grover's search algorithm [51, 52]. The algorithmic fidelity, i.e., the fidelity after correcting for readout infidelity, is found to be 85.6% using quantum tomography with maximum likelihood estimation. This fidelity is limited by the CZ gate.
CONCLUSION
In this paper, we have proposed eQASM, a QISA that can be directly executed on a control microarchitecture after instantiation. With runtime feedback, eQASM supports full quantum program flow control at the (micro)architecture level [17, 30]. With efficient timing specification, SOMQ execution, and VLIW architecture, eQASM alleviates the quantum operation issue rate problem, presenting better scalability than MIS. Quantum operations in eQASM can be configured at compile time instead of QISA design time, which can support uncalibrated or uncommon operations, leaving ample space for compiler-based optimization. Low-level hardware information mainly appears in the binary of a particular eQASM instantiation, which makes eQASM assembly expressive. It is worth noting that by removing the timing information in the eQASM description, the quantum semantics of the program can be kept and further converted into another executable format targeting another hardware platform.
As validation, eQASM was instantiated into a 32-bit instruction set targeting a seven-qubit superconducting quantum chip, and implemented using a quantum microarchitecture. eQASM was verified by several experiments with this microarchitecture performed on a two-qubit chip. The efficiency improvement observed in using eQASM to control quantum experiments broadens the scope of application of quantum assemblies.
Future work will include verifying comprehensive feedback control with qubits and controlling the originally targeted seven-qubit superconducting quantum processor with the implemented microarchitecture. Also, it will be interesting to instantiate eQASM to control other quantum processors, including superconducting quantum processors with a different quantum chip topology, and altogether different quantum hardware, such as spins in quantum dots [53] and nitrogen-vacancy centers [54].
