This paper presents the definition and implementation of a quantum computer architecture to enable creating a full-fledged quantum computer. A key question is to understand what a quantum computer is and how it relates to the classical processor that controls the entire execution process. We present the idea of a quantum accelerator which contains the full-stack of the system layer starting at the quantum logic layer representing the functionality that the accelerator will implement. The next layer is the OpenQL and cQASM layer that implements the algorithm in a set of instructions that can be executed by the micro-architecture. We distinguish very clearly between the experimental research towards better qubits and the industrial and societal applications that need to be developed and executed on a quantum device. In the first case, the physicists are using realistic qubits with decoherence and error-rates and the last case offers perfect qubits to the quantum application developer, where there is no decoherence or error-rates. We conclude the paper by explicitly presenting two examples of a quantum accelerator.

Introduction
Computer architecture has a long history of continuous evolution. An important extension is the emergence of accelerators, as described for FPGAs in [1]. Computer architecture research is now increasingly focused on quantum computing. In the next 5 to 10 years of quantum computer development, it makes sense to talk about quantum computing in the sense of a universal, Turing computer that can be applied in any kind of application domain. Given the recent insights leading to, e.g., Noisy Intermediate-Scale Quantum (NISQ) technology as expressed in [2], as well as randomised compiler techniques as described in [3], we are much more inclined to believe that the first industrial and societally relevant applications will run on a quantum accelerator. This belief is based on the idea that any end-application contains multiple parts, each with properties that make it better suited to a particular accelerator, which can be, as shown in Figure 1, an FPGA, a GPU or a TPU. We now add the quantum accelerator as an additional co-processor. Formally, an accelerator is a co-processor that is linked to the central processor and executes certain parts of the overall application much faster. The classical processor thus keeps control over the total system and delegates the execution of certain parts to the available accelerators.
Computer architectures have evolved quite dramatically over the last couple of decades. The first computers that were built did not have a clear separation between compute logic and memory. It was only with von Neumann's idea to separate the two and develop them distinctly that the famous von Neumann architecture was born. For a long time this architecture had a single processor and was driven forward by the ever-increasing number of transistors on the chip, which doubled every 18 months. At the beginning of the 21st century, the single core was becoming too complex and no longer provided any substantial processing improvement. This is why single-core processors gave way to multi-core processors. The homogeneous multi-core processor dominated processor development for a couple of years, but companies such as IBM and Intel came to understand that heterogeneity is the right way forward to improve compute power. GPUs, TPUs and FPGAs are seen as natural extensions of the computer architecture, which implies that the quantum accelerator is a logical next step.
In the quantum computing world, there are two important challenges. The first is to have enough qubits of sufficient quality for any experimental quantum chip one wants to build. The current competitors are looking at different quantum technologies such as ion traps, Majorana-based qubits, semiconducting and superconducting qubits, NV-centers and even graphene. These technologies all struggle with qubits that suffer from decoherence and that introduce errors whenever a quantum gate operation is performed. Only when the quantum physics community overcomes these problems will the quantum accelerator become a widely adopted solution. This direction is shown in part a) of Figure 2, where different quantum technologies are competing with each other.
The second challenge is to formulate, at a high level, the quantum logic that companies and other organisations need in order to use high-performance accelerators for computations that can only run on a quantum device. This also requires a long-term investment in people and know-how from companies that want to go in that direction. Part b) of Figure 2 shows the industrial commitment to think about the logic that can be executed using the full stack and run on the QX-simulator. It is important to emphasise that the qubits used here are perfect qubits, which do not decohere or generate any other kind of error. With the emergence of huge amounts of data, commonly called Big Data, it is understood that the multi-core paradigm does not scale to very large data sets. The key factor is the huge amount of data that needs to be processed by multiple computing cores, which turns out to be a very difficult problem to solve. Data communication between the cores is a very difficult programming problem, and data management substantially slows down the overall performance. Building on our work since 2004 [1] and as shown in Figure 2, an important concept that we have been applying in the quantum computing world is the implementation of a full stack for a quantum accelerator, as will be described later in this paper. The basic philosophy of any accelerator is that a full stack needs to be defined and implemented. The last 10 to 15 years have seen a large number of accelerators developed as part of modern computer architectures. Such a stack always consists of the same layers. It starts at the highest level, which describes the logic that needs to be mapped onto the accelerator; examples are video processing, security and matrix computation. These application-specific algorithms can be defined in various languages such as C++, Fortran or others.
In the case of FPGAs, these algorithms are translated into VHDL or Verilog. In the case of GPUs, the algorithm is often formulated using mathematical or other libraries and translated by the compiler into an assembly language that can be mapped onto any GPU architecture. Especially in the case of FPGAs, there is no standard micro-architecture on which the VHDL or Verilog can be executed; such an architecture needs to be developed for every application that needs to be accelerated. The final layer is a chip-based implementation of the micro-architecture combined with the hardware accelerator blocks that are needed.
Realistic vs Perfect Qubits
An important concept that is introduced in our line of research is the use of two kinds of qubits, namely realistic and perfect qubits.
Realistic qubits: The first qubit type is the realistic qubit. These are the qubits created and investigated by the quantum physicists in the community. A number of their features still need to be improved substantially. Realistic qubits need to stay in a particular state for a long time (the coherence time), but most of them decay to the ground state within nanoseconds after being brought into a particular state. On top of that, all quantum gates applied to the qubits generate errors. Without going into a detailed discussion, we should not forget that qubits and quantum phenomena such as superposition and entanglement are analogue phenomena and are thus subject to changes caused by contextual influences. Many quantum technologies are currently being explored to find out whether good qubits can be produced that also yield reasonable results for quantum computation. Working with realistic qubits is very important for the physicists, as they need to understand the dynamic and static behaviour of the qubits under all kinds of circumstances.
Perfect qubits: When companies, governments and other organisations are interested in building a quantum accelerator, they have to look at which quantum algorithms are available and need a way to test the correctness of the quantum logic that they are interested in. Many large companies use technology produced by companies such as D-WAVE or IBM, which gives access to realistic qubits. However, both the quality and the number of these qubits are very limited, and the decoherence and error rates mentioned before are very problematic as they influence the overall result that the quantum device computes. For this purpose, we have also introduced a new kind of qubit, called the perfect qubit, with which any erroneous behaviour can be avoided. Perfect qubits do not decohere and stay in the state that the end-user is interested in, so that the end-user works with an idealised model of a qubit. Using perfect qubits guarantees that end-users can verify and check the algorithm they are working on and test whether the computed results have a meaning they can interpret. We are not the only ones who do this, but it is a very clear concept that separates the two directions that we are investigating in the Quantum Computer Architecture Lab.
Background
One of the first papers about quantum computing was written by R. Feynman in 1982 [4]. It sparked world-wide research on quantum computing, focusing on important low-level challenges and leading to the development of superconducting qubits, ion-trap qubits and spin qubits. Feynman proposed quantum computers as an important scientific instrument for understanding the quantum phenomena that quantum physics studies. The design of proof-of-concept quantum algorithms and the analysis of their theoretical complexity improvements over classical algorithms have also received some attention. However, we still need substantial progress in both of those domains. Qubits with a sufficiently long coherence time, combined with a true quantum killer application, are still crucial achievements on which the community is working. They are vital to demonstrate the exponential performance increase of quantum over conventional computers in practice and are urgently needed to convince quantum sceptics of the usefulness of quantum computing, such that it can become a mainstream technology within the coming 10 to 15 years. However, as we will describe in this paper, much more is needed before any kind of computational device has been developed that ultimately connects the algorithmic level with the physical chip. What is needed involves a compiler, run-time support and, more importantly, a micro-architecture that executes a well-defined set of quantum instructions.
An interesting and quite high-level kind of description was published in Communications of the ACM in 2013 [5]. The authors describe their understanding of the blueprint of a quantum computer. They correctly emphasise the need to look at computer engineering to better understand what the similarities and differences are between quantum and classical computing. As said before, the most important difference is the substantially higher error rate that qubits and quantum gates ($10^{-3}$) have compared to CMOS technology ($10^{-15}$). Guaranteeing fault-tolerant computation can easily consume more than 90% of the actual computational activity. The second difference focuses on the nearest-neighbour constraint which imposes that two-qubit gates can only be applied if the qubits reside next to each other. The no-cloning theorem prohibits copying quantum states. The way that two-qubit gates are applied requires the two qubits to be sufficiently close to each other.
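To make the gap between the two error rates concrete, the following minimal sketch computes the probability that a sequence of operations completes without any error. It assumes that errors are independent and identically distributed per operation, which is a simplification introduced here for illustration; the function names are hypothetical.

```python
import math


def success_probability(error_rate, n_ops):
    """Probability that n_ops independent operations all succeed."""
    return (1.0 - error_rate) ** n_ops


def ops_until_half(error_rate):
    """Number of operations after which the success probability drops to 1/2."""
    return math.log(0.5) / math.log1p(-error_rate)


if __name__ == "__main__":
    # Error rates quoted in the text: quantum gates vs CMOS technology.
    for label, p in (("quantum gate", 1e-3), ("CMOS", 1e-15)):
        print(label, success_probability(p, 1000), ops_until_half(p))
```

Under this assumption, a circuit of 1000 gates at an error rate of $10^{-3}$ already fails more often than it succeeds, whereas the same number of CMOS operations is essentially error-free.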
They also describe a hierarchical layered structure but rather than defining these layers in terms of more computer engineering concepts, the schema is more expressed in terms of the different, relevant fields and research domains. Examples are Quantum Error Correction (QEC) theory, programming languages, QEC and Fault-Tolerant (FT) implementation and so on. There are also other mechanisms with undefined time costs that are necessary to make FT-quantum computing (hopefully) efficient and performing. Examples are state distillation for ancilla factories and the emergence of a wide variety of defects and errors, which all impose an additional burden on the micro-architecture and the corresponding run-time management.
An older but conceptually quite similar paper was published by DiVincenzo in 2000. In that article, he specified five criteria needed to build a quantum computer: i) a scalable physical system with well-characterised qubits, ii) the ability to initialise the state of the qubits to a simple fiducial state, iii) long relevant coherence times, iv) a universal set of quantum gates and v) a qubit-specific measurement capability. He also adds two more criteria needed for quantum communication, namely the ability to inter-convert stationary and flying qubits and the ability to transmit flying qubits between specified locations [6]. If we look at currently available quantum processors, we could say that they already comply with DiVincenzo's criteria and thus that we already have a quantum computer. However, an important missing criterion is the number of qubits needed for any kind of reasonable application. Depending on the application domain, estimates of the number of qubits range from relatively low, such as a couple of hundred, to several billion. If we are less critical, we could say that the first criterion explicitly addresses the size of the system, and computing reliably at that size is still a very considerable challenge.
The rest of the paper is structured as follows and as shown in Figure 2 . We first describe the quantum algorithm layer and present the programming language OpenQL and the quantum assembly language cQASM to which the OpenQL compiler translates. We then introduce the micro-architecture, including the mapping of quantum circuits to the quantum chip, and conclude the paper with a detailed discussion of two particular examples of an accelerator that we are currently developing.
The Quantum Full-Stack
In the context of quantum computing, the same full-stack approach is adopted for which either perfect or realistic qubits can be used. The highest level starts at the end-user application for which a part of that application is developed in a quantum language such as OpenQL. That quantum part of any industrial or societal application can be executed on any kind of quantum prototype that is available. As explained before, the execution can be either on an experimental quantum chip or on the QX-simulator. For any quantum logic that is specified, a specific and target-related micro-architecture needs to be defined and used. We go through the different layers in the next pages.
Quantum Logic
For this entire section, we will always assume perfect qubits. The highest level is the application layer, where any potential end-user of the quantum compute power specifies what exactly needs to be computed. Quantum computing promises to become a computational game changer, allowing various algorithms to be computed much faster (in some cases exponentially faster) than their classical counterparts. Applications with a large set of data items are especially suitable for processing by quantum computers, which in this paper we call quantum accelerators. Currently there is no generally acknowledged or accepted functional domain where quantum technology would be the game changer. Evidently, the cryptography domain is a clear candidate, as algorithms such as Shor's factorisation show that a quantum computer can potentially break any RSA-based encryption: it finds the prime factors of the public key [7], from which the private key can be easily calculated. However, the cryptography
Another potential application area is the biological domain, to which chemistry, medication and pharmacology belong. Let us focus on genome sequencing. For instance, a quantum computer would be necessary if we simply want to compute the DNA profile of every human being in the world, since computing one person's DNA profile takes around one week on a large network of very powerful servers. With the availability of enough qubit capacity, the entire parallel input data-set can be encoded simultaneously as a superposition of a single wave function.
This particular property makes it possible to perform the computation on the entire data-set in parallel. This kind of computational acceleration provides a promising approach to address the computational challenges of DNA analysis algorithms. The essence of accelerating sequence reconstruction is the ability to run parallel search operations that match the short reads, obtained by sequencing an individual's DNA on a sequencing machine, against an already available reference of the organism. In recent years, GPUs, FPGAs and cluster computing frameworks like Hadoop and Spark have been used to reduce the total run-time. Quantum computation offers a fundamentally different way to address the enormous volume of data by employing a superposition of reads in the search process, thereby reducing the memory requirement exponentially. The quantum search primitive (Grover's search) itself is provably optimal [8] over any other classical or quantum unstructured search algorithm. The rather modest quadratic speedup in cycles nevertheless becomes extremely relevant for industrial applications because of the total CPU run-time involved in the big data manipulation (in the order of 1000s of CPU hours [9] for a single human DNA sequence reconstruction).
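The quadratic speedup of Grover's search can be illustrated with a small classical simulation of the amplitude vector. This is only a sketch under simplifying assumptions: a single marked item, perfect qubits, and an unstructured list of `n_items` entries; it does not model the genome-sequencing algorithm itself, and the function name is hypothetical.

```python
import math


def grover_search(n_items, marked):
    """Simulate Grover's search on a real-valued amplitude vector.

    Returns (number of iterations, probability of measuring the marked index).
    """
    amp = [1.0 / math.sqrt(n_items)] * n_items       # uniform superposition
    iterations = math.floor(math.pi / 4 * math.sqrt(n_items))
    for _ in range(iterations):
        amp[marked] = -amp[marked]                   # oracle: flip the marked amplitude
        mean = sum(amp) / n_items
        amp = [2 * mean - a for a in amp]            # diffusion: inversion about the mean
    return iterations, amp[marked] ** 2


if __name__ == "__main__":
    its, prob = grover_search(1024, marked=123)
    print(its, prob)
```

For 1024 items, about $\frac{\pi}{4}\sqrt{1024} \approx 25$ oracle calls are enough to find the marked item with high probability, compared with roughly 512 lookups on average for a classical linear scan.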
Programming language, compiler and run-time support
The quantum algorithms and applications presented in the previous section can be described using a high-level programming language such as Q# [10] , Scaffold [11] , Quipper [12] and OpenQL [13] and compiled into a series of instructions that belong to the (quantum) instruction set architecture.
Consistent with our distinction between perfect and realistic qubits, the compiler is capable of generating any qubit that the end-user wants to use. So there is an option that translates the qubits in a perfect qubit or in a realistic one. As shown in Figure 4 , the compiler infrastructure for such a heterogeneous system will consist of the classical or host compiler combined with a quantum compiler. It is important to note that the architectural heterogeneity where classical processors are combined with different accelerators such as the quantum accelerator, imposes a specific compiler structure where each compiler part can target the different instruction sets and ultimately generates one binary file which can be executed on different instruction set architectures. For the computer architecture that we envision, any high-level implementation of the system application will consist of 2 main parts: the classical logic which will be executed by the micro-architecture of the controlling processor and the quantum logic which will be mapped on the quantum chip and also executed by the micro-architecture for the quantum processor.
As we adopt the quantum circuit model as a computational model, the quantum compiler translates the quantum logic into quantum circuits for which reversible circuit design, quantum gate decomposition and circuit mapping are needed. The output of this compiler will be a series of instructions, expressed in a quantum assembly language such as QASM, that belongs to the defined instruction set architecture.
The definition of a shared quantum assembly language is a key challenge such that there is some kind of uniformity in what the different researchers are pursuing.
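As an illustration of what a gate-decomposition pass followed by assembly emission might look like, the sketch below rewrites a SWAP gate as three CNOT gates, checks the rewrite on all computational basis states, and prints the result in a QASM-like text form. The function names and the exact text format are assumptions made for this example; they are not the OpenQL API or the normative cQASM grammar.

```python
from itertools import product


def cnot(bits, control, target):
    """Apply a CNOT to a tuple of classical basis-state bits."""
    out = list(bits)
    out[target] ^= out[control]
    return tuple(out)


def swap(bits, a, b):
    """Reference behaviour of SWAP on basis-state bits."""
    out = list(bits)
    out[a], out[b] = out[b], out[a]
    return tuple(out)


def decompose_swap(a, b):
    """Rewrite SWAP(a, b) as three CNOT gates."""
    return [("cnot", a, b), ("cnot", b, a), ("cnot", a, b)]


def run(gates, bits):
    """Execute a list of CNOT gates on basis-state bits."""
    for name, control, target in gates:
        assert name == "cnot"
        bits = cnot(bits, control, target)
    return bits


def emit(gates, n_qubits):
    """Emit a QASM-like listing (hypothetical format, for illustration only)."""
    lines = [f"qubits {n_qubits}"]
    lines += [f"{name} q[{x}],q[{y}]" for name, x, y in gates]
    return "\n".join(lines)


if __name__ == "__main__":
    gates = decompose_swap(0, 1)
    # Both circuits are permutations of basis states, so agreeing on every
    # basis state means that they implement the same operation.
    assert all(run(gates, b) == swap(b, 0, 1) for b in product((0, 1), repeat=2))
    print(emit(gates, 2))
```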
Realistic Qubits: Another key pass in the quantum compiler is the generation of fault-tolerant (FT) quantum circuits. A general problem of any quantum technology is its fragility: the qubit state disappears quite rapidly. First, the coherence time of qubits is extremely short. For example, superconducting qubits may lose their information in tens of microseconds [14, 15]. Second, quantum operations are unreliable, with error rates around 0.1% [16]. It is therefore difficult to think about building a quantum computer without error correction. In January 2018, however, Preskill [17] emphasised that early-stage quantum computers should be based on Noisy Intermediate-Scale Quantum technology, with far fewer ancilla qubits for Quantum Error Correction (QEC) activities. This is a very interesting approach for computer engineers as well, as we will describe later in this paper. Quantum Error Correction is more challenging than classical error correction because of the no-cloning theorem, which states that (unknown) quantum states cannot be copied. This makes the classical approach of keeping several copies of the same bit impossible. In addition, quantum errors are continuous and any measurement will destroy the information stored in qubits. The basic idea of modern QEC techniques is to use several imperfect physical qubits to compose more reliable units called logical qubits, based on a specific quantum error correction code [18, 19, 20, 21, 22, 23, 24]. This is represented by the third dimension in Figure 2. Any computer architecture needs functionality to continuously monitor the quantum system in order to detect and recover from possible errors, as we will now describe.
For various reasons and for quite a long period, the focus has been mostly on the planar surface code, as it was considered one of the most promising QEC codes for short-term implementations and for scalability in the FT era and in manufacturing. Qubits are arranged in a regular 2D lattice with only nearest-neighbour (NN) interactions. The array comprises two kinds of qubits, namely data and ancilla qubits. Data qubits are used to store the quantum information, whereas ancilla qubits are helper qubits which are used to detect both bit-flip and phase-flip errors by performing error syndrome measurements (ESM). This implies that after every sequence of n quantum gates, the system needs to measure out its state and interpret those measurements to see if an error has occurred. Given the constraints of the coherent qubit lifetime, this means that a potentially very large graph needs to be interpreted in time to identify any error. On top of that, measurements themselves can be erroneous and therefore need to be repeated multiple times before a final conclusion is reached. In 2018, Preskill put forward a counterargument to this approach: the surface code requires too many ancilla qubits to provide the logical protection [17]. This led to renewed interest in small codes, which were first defined almost 20 years ago. It remains to be seen what the impact will be at the system architectural and compiler level, but that is currently the focus of a lot of research.
Figure 5: An Example Micro-Architecture
Perfect Qubits: As explained before, we introduced a new feature in OpenQL, the perfect qubit. These qubits behave the way an end-user expects. When any quantum accelerator logic is defined, the end-user has the option to express everything in terms of perfect qubits. The focus can then be fully on the logic that the end-user has in mind, without having to account for any non-ideal behaviour of the qubits. This way, the correctness of the quantum algorithm can be verified and tested on the implementation of the micro-architecture and the QX-simulator (see below).
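The principle of error syndrome measurement can be sketched with the smallest example, the three-qubit bit-flip repetition code, rather than the surface code. The model below is purely classical and rests on assumptions that are not part of the text: only independent bit-flip errors with probability `p` occur, the two parity checks play the role of the ancilla measurements, and the measurements themselves are error-free.

```python
from itertools import product

# Syndrome (parity of qubits 0 and 1, parity of qubits 1 and 2)
# mapped to the data qubit that should be flipped back.
SYNDROME_TABLE = {(0, 0): None, (1, 0): 0, (1, 1): 1, (0, 1): 2}


def measure_syndrome(bits):
    """Parity checks, standing in for the two ancilla measurements."""
    return (bits[0] ^ bits[1], bits[1] ^ bits[2])


def correct(bits):
    """Decode the syndrome and apply the corresponding correction."""
    fix = SYNDROME_TABLE[measure_syndrome(bits)]
    out = list(bits)
    if fix is not None:
        out[fix] ^= 1
    return tuple(out)


def logical_error_rate(p):
    """Exact failure probability of the decoder for a flip probability p."""
    total = 0.0
    for errors in product((0, 1), repeat=3):
        prob = 1.0
        for e in errors:
            prob *= p if e else (1.0 - p)
        # The encoded logical 0 is 000, so `errors` is also the corrupted word.
        if correct(errors) != (0, 0, 0):
            total += prob
    return total


if __name__ == "__main__":
    print(logical_error_rate(1e-3))
```

With a physical error rate of $10^{-3}$, the logical error rate in this toy model is $3p^2(1-p) + p^3 \approx 3 \times 10^{-6}$, which shows why several imperfect physical qubits can form a more reliable logical qubit as long as the physical error rate is low enough.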
Quantum Micro-Architecture
Any computer has a series of instructions which can be executed on the dominant processor. To this purpose, any kind of processor has a particular architecture capable of executing any sequence of the legitimate instructions. This also holds for the quantum processor which also has a series of instructions that it can execute, some of which are classical logic and others are the quantum instructions that will be executed on the quantum chip. So the quantum accelerator will consist of two components: the classical and digital micro-architecture part that has a classical processor to execute part of the accelerator logic and the quantum chip that contains the qubits that need to be executed in an analogue way.
Essential to any kind of computational device is the presence of one or multiple computer architectures that are responsible for executing the instructions that are delegated to the co-processor. The architecture of a machine connects the physical hardware to the applications that can run (on that hardware) and dictates how instructions are executed. In the case of a quantum accelerator, that is also true.
For the quantum algorithms to be understood by the quantum accelerator, we require a low level representation of the quantum instructions that the classical control hardware of the quantum chip can understand. This is known as the Quantum Instruction Set Architecture (QISA). The content of the QISA can be modified for each accelerator logic that needs to be implemented. Extensions to the compiler may therefore be needed but the micro-architecture will need hardware components that will execute the instructions that are sent to it. We want to be very precise in how the instructions are formulated and executed. As we described earlier, the OpenQL and related cQASM can either assume perfect or realistic qubits. One example of a micro-architecture is given in Figure 5 . For any micro-architecture, there are a number of properties that we have to estimate, such as the appropriate instruction-length, pipeline depth (for parallel quantum gates) and targeting multiple control channels per single instruction. Based on these principles, we can proceed to construct the basic blocks, such as timing control unit and the microcode instruction set of the overall micro-architecture. And that is possible for either perfect or realistic qubits.
Perfect Qubits: We do not yet have a clear implementation of the micro-architecture for logic expressed in terms of perfect qubits. Later in this paper, we present a tentative micro-architecture for Quantum Genome Sequencing (QGS), which is one of the accelerators that we are working on. It is important to define the QISA needed for QGS and the corresponding micro-architecture blocks needed to execute the quantum instructions on the QX-simulator. More development along those lines is needed to fine-tune these blocks.
Realistic Qubits: When participating in the experimental quantum physics part, we have to continue looking at the experimental algorithms that the physics community is interested in, such as randomised single- and two-qubit gates. This phase would also comprise hardware assessment and characterisation to meet the timing-precision and signal-synchronisation requirements of a specific qubit technology. In a later phase, the experimental implementation will have to include error-correcting codes on the realistic qubits. A system-on-chip running a quantum error decoder would enable faster development and debugging of QEC on hardware. Area utilisation and power consumption of such firmware would become a necessary consideration at this point, depending on the size of the decoders. A full-fledged infrastructure to support long DNA sequences would be included in the architecture. The development and testing of this platform would be done on both the QX-simulator and qubit hardware technology.
Mapping of the quantum circuit also needs to be discussed and parts of it are already presented in the compiler part. In later versions of this paper, more extensive sections will be added.
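To give an impression of what mapping under the nearest-neighbour constraint involves, the sketch below routes two-qubit gates on a one-dimensional chain of qubits by inserting SWAP gates until the operands are adjacent. The linear topology, the naive routing strategy and all names are assumptions chosen for brevity; this is not the mapping pass of the OpenQL compiler.

```python
def route_on_line(gates, n_qubits):
    """Insert SWAPs so that every two-qubit gate acts on neighbouring sites.

    gates: list of (name, logical_a, logical_b).
    Returns a list of (name, site_a, site_b) on physical sites of a line.
    """
    position = list(range(n_qubits))   # position[logical qubit] = physical site
    routed = []
    for name, a, b in gates:
        while abs(position[a] - position[b]) > 1:
            step = 1 if position[a] < position[b] else -1
            site = position[a]
            # Logical qubit currently sitting on the adjacent site.
            neighbour = position.index(site + step)
            routed.append(("swap", site, site + step))
            position[a], position[neighbour] = position[neighbour], position[a]
        routed.append((name, position[a], position[b]))
    return routed


if __name__ == "__main__":
    for gate in route_on_line([("cnot", 0, 3)], n_qubits=4):
        print(gate)
```

In this example, a CNOT between the two ends of a four-qubit chain costs two additional SWAP gates, which illustrates why the mapping strategy has a direct effect on circuit depth and therefore on the error budget.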
QX Simulator
As shown in Figure 3, we have the QX-simulator, which can execute the quantum gates on either realistic or perfect qubits. This is the compute engine that can execute any kind of quantum logic as long as it is expressed in OpenQL and translated by the compiler into cQASM, the common quantum assembly language. The assumption here is that there is a micro-architectural level through which every cQASM instruction is executed. The micro-architecture then sends a quantum instruction to QX, which executes it, measures the result and sends it back to the micro-architecture. QX is capable of executing a quantum circuit with up to 35 fully-entangled qubits, either perfect or realistic.
Realistic Qubits: Whenever we are interested in running quantum circuits based on realistic qubits, we need to be able to introduce errors at the qubit or gate level. Current quantum error rates do not go beyond $10^{-2}$ and we have to better understand what the impact will be of error rates of $10^{-5}$ or $10^{-6}$. The errors will have an effect on the qubits as well as on the quantum gates. Using the QX-simulator, we can go beyond simplistic error models such as the depolarising model, where every quantum gate is followed by some error drawn from a uniform distribution over the Pauli errors X, Y and Z. The simulator can be extended with other, more realistic error distributions, which will outline the extensions that quantum physics researchers need to look at.
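The difference between perfect and realistic qubits under a depolarising-style error model can be sketched on a single qubit. The code below is a minimal stand-alone model and not the QX-simulator or its API: it assumes that, with probability `p`, an ideal X gate is followed by one Pauli error chosen uniformly from X, Y and Z, and it estimates how often the final measurement gives the wrong outcome.

```python
import random


def apply_pauli(state, pauli):
    """Apply a Pauli operator to a single-qubit state (a, b) = a|0> + b|1>."""
    a, b = state
    if pauli == "X":
        return (b, a)
    if pauli == "Y":
        return (-1j * b, 1j * a)
    if pauli == "Z":
        return (a, -b)
    return state


def noisy_x_gate(state, p, rng):
    """Ideal X gate followed, with probability p, by a uniform Pauli error."""
    state = apply_pauli(state, "X")
    if rng.random() < p:
        state = apply_pauli(state, rng.choice("XYZ"))
    return state


def estimate_failure(p, shots, seed=0):
    """Fraction of shots in which X|0> is not measured as 1."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(shots):
        _, b = noisy_x_gate((1 + 0j, 0j), p, rng)
        prob_one = abs(b) ** 2
        if rng.random() >= prob_one:   # outcome 0 instead of the expected 1
            failures += 1
    return failures / shots


if __name__ == "__main__":
    print("perfect qubit:", estimate_failure(0.0, 10000))
    print("realistic qubit:", estimate_failure(1e-2, 10000))
```

Setting `p = 0` corresponds to a perfect qubit. For `p > 0`, only the X and Y errors change the measured outcome in this example, so the expected failure rate is $2p/3$.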
Perfect Qubits: Similar to realistic qubits, there will be the need to execute whatever quantum logic one has conceived and verify if the computed results make sense and are expected. The QX-simulator is capable of assuming or not the emergence of certain errors. We will start using the QX-simulator to execute the logic for the Quantum Genome Sequencing algorithm that we are currently working on. In principle, any quantum logic can be executed on the simulator, the result can be measured and fed back to the micro-architecture.
Two Full Stack Architecture Examples
In this section, we present and briefly describe two implementations of the full-stack. One was developed for the experimental design of superconducting qubits and the second is being implemented for the Genome Sequencing logic that we are porting to a quantum version. The full-stack as shown in Figure 2 is used as the basic structure.
Full Stack for Realistic, Superconducting Qubits
This section describes the micro-architecture developed for the superconducting quantum chip, based on an experimental implementation of all the components that were defined and needed over, e.g., one year of work. It covers everything from the highest level, the writing of an algorithm, down to sending the analogue pulses to the qubits. We start with a high-level quantum algorithm that is useful for the physicists. We have been focusing on randomised benchmarking for one or two qubits, written in OpenQL. Any code is translated by the OpenQL compiler into our version of the quantum assembly language, cQASM. As a logical extension of cQASM, the compiler then translates that version into an executable QASM, called eQASM, which takes low-level information such as gate times and topology into account, in principle for any quantum technology. This basically means that there is a second, back-end compiler pass that translates cQASM into eQASM. An example is the following eQASM program, which first measures qubit q2. If the measurement result is 1, a Y gate is applied on qubit q0; otherwise an X gate is applied.
The design of eQASM focuses on being executable on real hardware and on providing user-definable feedback. For run-time feedback, the auxiliary classical instructions contained in eQASM (see Table 1) support full quantum program flow control at the micro-architecture level [26, 27]. The definition of eQASM focuses on the assembly level, and the binary format is defined during the instantiation of eQASM targeting a concrete control electronic setup and quantum chip. This enables the eQASM assembly to be expressive while leaving considerable freedom to the micro-architecture designer to pursue micro-architectural practicability and performance. Since timing is critical in the NISQ era, eQASM allows explicit specification of the timing of quantum operations using the QWAIT and QWAITR instructions and the pre-interval (PI) value in the quantum bundle, which supports compiler-based timing optimisation.
Quantum operations of eQASM are specified by programmers at compile time via configuration instead of being defined at QISA design time. This flexibility reserves ample space for compiler-based optimisation, such as quantum optimal control where uncommon operations are used [28, 29]. In addition, the required number of quantum operations per cycle in general increases as the number of qubits grows; fetching all instructions for an increasing number of quantum operations from memory and applying them to the qubits on time is a challenge given the limited instruction issue rate (the quantum operation issue rate problem) [30, 31, 32]. To alleviate this problem, eQASM adopts Single-Operation-Multiple-Qubit (SOMQ) execution and a Very-Long-Instruction-Word (VLIW) architecture. SOMQ is similar to classical single-instruction-multiple-data (SIMD) execution, with qubits taking the place of the data operands. It uses the SMIS and SMIT instructions to specify the targets of quantum operations.
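The effect of SOMQ and VLIW on the number of issued instructions can be sketched as follows. This is a minimal Python illustration under our own assumptions, not the eQASM encoding: operations that start in the same cycle are given as hypothetical (gate, qubit) pairs, identical gates are merged into one operation with a target list, and the merged operations are packed into instruction words of a fixed width.

```python
def group_somq(operations):
    """Merge (gate, qubit) pairs that share a gate into (gate, targets)."""
    targets = {}
    for gate, qubit in operations:
        targets.setdefault(gate, []).append(qubit)
    # dict preserves insertion order, so gates keep their first-seen order
    return [(gate, tuple(sorted(qubits))) for gate, qubits in targets.items()]


def pack_vliw(somq_operations, width):
    """Split the merged operations into instruction words of `width` slots."""
    return [somq_operations[i:i + width]
            for i in range(0, len(somq_operations), width)]


# Five single-qubit operations that should start in the same cycle.
cycle_operations = [("X", 0), ("Y", 1), ("X", 2), ("H", 3), ("X", 5)]
merged = group_somq(cycle_operations)
words = pack_vliw(merged, width=2)
print(merged)
print(words)
```

In this example five operations are reduced to three SOMQ operations, which fit into two instruction words of width two.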
The eQASM instructions can be executed by the micro-architecture, as shown in Figure 6 [25]. As explained before, the eQASM program is executed and translated at run-time into a horizontal micro-code version, which ultimately sends the micro-operations to the queues. From that level on, the timing requirements of the execution are very strict and need to be precise up to the nanosecond level. The code-words generated by the micro-code unit are ultimately translated into analogue pulses and sent to the qubit chip.
This demonstration was done for two quite different quantum technologies: one for the superconducting qubit chip and one for the semiconducting quantum chip. The specific combination of the micro-architecture, the eQASM compiler pass and the micro-code unit proved very useful. The last two in particular allowed us to re-target the same micro-architecture to two different quantum technologies: the only things needed were to change the configuration file for the compiler, to implement the micro-code unit needed for either quantum technology, and to make sure that the analogue pulses, stored in the analogue-digital interface (ADI), were available.
Full Stack for Quantum Genome Sequencing on Perfect Qubits
Genome sequencing involves taking fragments of the DNA (called short reads) from the sequencing machines and stitching them together to reconstruct the original genome of the individual. Reconstruction can be carried out either by aligning these reads to an already available reference genome, or in a de novo assembly manner. These two approaches require the algorithmic primitives of searching an unstructured database and of graph-based combinatorial optimisation, respectively. Translating such quantum kernels to an efficient implementation on a quantum accelerator requires in-depth tuning of both an architecture-aware quantum algorithm and the underlying micro-architecture.
We already have initial results from combining domain-specific modifications of the Grover's search [33] and quantum associative memory [34] approaches. This new alignment algorithm, described and analysed in [35], has been tested on the QX simulator platform. The reference DNA is sliced and stored as indexed entries in a superposed quantum database, giving an exponential increase in capacity. The designed algorithm takes the inherent read errors in the sequence into account, which imposes the requirement for approximate optimal matching. A quantum search on the database amplifies the measurement probability of the nearest match to the query and thereby of the corresponding index. Because the reference database and the index are entangled, the closest-match index can be estimated. Current explorations involve designing optimisation algorithms for genomics applications using near-term Quantum Machine Learning primitives such as the Quantum Approximate Optimisation Algorithm.
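The amplification of the nearest match can be illustrated with a small classical emulation in Python. This is a sketch under strong assumptions and not the algorithm of [35]: it uses textbook Grover iterations on a list of real amplitudes, the closest entry is identified classically by Hamming distance in order to build the oracle, and the reference string, the slice length and the query are made up for the example.

```python
import math


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def grover_nearest_match(entries, query):
    """Emulate Grover iterations that mark the entries closest to the query.

    Returns the measurement probability of every index.
    """
    n = len(entries)
    distances = [hamming(entry, query) for entry in entries]
    best = min(distances)
    marked = [i for i, d in enumerate(distances) if d == best]
    amplitudes = [1 / math.sqrt(n)] * n          # uniform superposition
    iterations = max(1, math.floor(math.pi / 4 * math.sqrt(n / len(marked))))
    for _ in range(iterations):
        for i in marked:                          # oracle: phase flip
            amplitudes[i] = -amplitudes[i]
        mean = sum(amplitudes) / n                # diffusion: invert about mean
        amplitudes = [2 * mean - a for a in amplitudes]
    return [a * a for a in amplitudes]


reference = "ACGTTGCAAGC"                         # made-up reference
entries = [reference[i:i + 4] for i in range(len(reference) - 3)]
query = "TGCT"                                    # read with one error
probabilities = grover_nearest_match(entries, query)
closest = max(range(len(entries)), key=probabilities.__getitem__)
print(closest, entries[closest], round(probabilities[closest], 3))
```

With eight slices and a single closest entry, two iterations raise the probability of measuring the index of the closest slice from 1/8 to about 0.95.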
As already mentioned, the proposed quantum accelerator will not be a standalone machine, but rather a quantum co-processor that will be part of a heterogeneous system alongside classical processors. There will also be run-time support that will coordinate the activities of the different micro-architectural components and, as discussed, be responsible for the run-time routing of qubit states for two-qubit gates. In the quantum accelerator, the executed instructions will in general flow through modules from left to right. The pink block on the right of the figure represents the QX simulation platform or an implementation of a quantum chip on which the test runs of the QGS algorithms will be performed. The rest of the large (blue) block represents the micro-architecture. The DNA data sets will be retrieved from an external classical database and transported to a local memory in the quantum accelerator. The size of the local memory will depend on the capabilities of the QX simulator platform and on how that information is encoded. This research will be based on the large-scale micro-architecture simulation platform that we have already developed. We will also use the QX simulator platform, which makes it possible to rapidly develop hardware prototypes and verify their behaviour and performance before any FPGA implementation is started. For now, we assume that the set of queues will be relevant for feeding the DNA information to the qubit chip and for defining how the quantum gates will be applied.
In a specific qubit plane topology, qubits will have to move around so that two-qubit gates can be applied on adjacent qubits. It is a prevailing idea that quantum compilers generate technology-dependent instructions [36, 11, 37]. However, not all technology-dependent information can be determined at compile time, because some information only becomes available at run-time due to hardware limitations, for instance qubits that need to be re-calibrated. For testing the functionality of the algorithm, we need to use artificial DNA sequences that preserve the statistical and entropic complexity of the base pairs in biological genomes, yet are reduced in size so that they can be simulated efficiently on a classical architecture with its qubit limitations. This implies understanding which run-time support, and thus routing support, will be necessary to make sure that the quantum accelerator always has enough data to process and that the data are in adjacent positions when necessary.
From an algorithmic perspective, near-term quantum optimisation algorithms employ the variational principle, where a shallow parameterised quantum circuit is iterated multiple times while the parameters are optimised by a classical optimiser in the Host-CPU. This model of Hybrid Quantum-Classical (HQC) algorithms requires fast feedback between the quantum accelerator and the real-time circuit/instruction generator (i.e. the compiler and the micro-architecture). Since most quantum algorithms expect a statistical central tendency over multiple measurements, the expected probability of the solution state can be calculated inside the quantum accelerator itself by aggregating the measurements over multiple runs.
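The hybrid loop can be sketched with a toy example in Python. Everything in it is an assumption made for illustration: the "accelerator" is emulated by sampling a single-qubit RY(theta) circuit, the cost is the expectation value of Z, and the classical optimiser is plain gradient descent using the parameter-shift rule. The function names, the number of shots and the learning rate are hypothetical.

```python
import math
import random


def run_circuit(theta, shots, rng):
    """Emulated accelerator: sample RY(theta)|0> and aggregate the shots.

    Returns the estimated expectation value of Z (+1 for outcome 0, -1 for 1).
    """
    p_one = math.sin(theta / 2) ** 2
    ones = sum(rng.random() < p_one for _ in range(shots))
    return (shots - 2 * ones) / shots


def optimise(theta=0.5, steps=40, learning_rate=0.4, shots=2000, seed=7):
    """Host-side loop: minimise <Z> with parameter-shift gradient descent."""
    rng = random.Random(seed)
    for _ in range(steps):
        plus = run_circuit(theta + math.pi / 2, shots, rng)
        minus = run_circuit(theta - math.pi / 2, shots, rng)
        gradient = (plus - minus) / 2             # parameter-shift rule for RY
        theta -= learning_rate * gradient
    return theta, run_circuit(theta, shots, rng)


theta, energy = optimise()
print(theta, energy)
```

Only the aggregated expectation value crosses the boundary between the accelerator and the host in each iteration, which is the reason for aggregating the measurements close to the quantum device.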
Towards In-Memory Computing
In-memory computing is becoming increasingly important as a new computer architecture. Rather than moving huge amounts of data to the logic, it is much more meaningful to move the logic and keep the data as local as possible, for instance by using innovative technology such as memristors. Memristors were theoretically defined several decades ago by Leon Chua, but only recently have semiconductor manufacturers started to investigate their production seriously. The key idea of a memristor is that it can be used both to store data and to perform calculations on it, which makes memristors an ideal candidate for an in-memory architecture. The concept of in-memory computing is illustrated using memristor-based devices in [38]. An intelligent merging of logic with data storage is therefore the key to an in-memory architecture. It is a completely new way of designing algorithms and computing systems, and it is far from evident which design rules are needed to fully exploit the potential of in-memory computing.
The link with quantum computing is very direct: the quantum logic is applied directly on the qubits, and the qubits do not need to be transported to any quantum Arithmetic and Logical Unit (ALU) before being processed. The routing of qubit states is also a very important problem in quantum computing. The qubits need to be placed on the quantum chip such that the movement of qubit states is minimal. Which routing protocols will be used for any quantum chip is also a big open area of research in quantum computer engineering. Currently, in any of the semiconducting or superconducting quantum implementations, the interaction between qubits has a nearest-neighbour constraint: when a two-qubit gate needs to be executed, the two qubits involved need to be nearest neighbours. This induces the need to decide where to map and how to route the qubits used in the algorithm on the quantum chip. This qubit routing is an important and illustrative example of what in-memory quantum computing actually means. When adopting an in-memory computing architecture, a crucial challenge is to decide on the placement of the data that needs to be processed and to have a programming language and compiler such that the appropriate logic can be placed close to the data. Any kind of algorithm has data that it changes to obtain a result, and it is quite unlikely that there is no dependency between any of those data items. This implies that intermediate results will have to move around in the architecture so that they reach the place where they are used in the next computational step. Even though in-memory computing puts all the data in some kind of memory, those data items still have to move around such that a final result can be computed by the classical Host-CPU.
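The mapping and routing problem under a nearest-neighbour constraint can be sketched in Python as follows. This is a minimal illustration under our own assumptions, not a routing protocol used in any of the cited work: the chip is a hypothetical coupling graph given as an adjacency dictionary, the placement maps logical qubits to physical sites, and one qubit state is moved along a breadth-first shortest path with SWAP operations until the two operands are neighbours.

```python
from collections import deque


def shortest_path(coupling, start, goal):
    """Breadth-first search for a shortest path of sites from start to goal."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        site = queue.popleft()
        if site == goal:
            break
        for neighbour in coupling[site]:
            if neighbour not in previous:
                previous[neighbour] = site
                queue.append(neighbour)
    if goal not in previous:
        raise ValueError("sites are not connected")
    path = []
    site = goal
    while site is not None:
        path.append(site)
        site = previous[site]
    return path[::-1]


def route_two_qubit_gate(coupling, placement, qubit_a, qubit_b):
    """Return the SWAPs that make qubit_a adjacent to qubit_b, plus the
    placement (logical qubit -> site) after those SWAPs."""
    path = shortest_path(coupling, placement[qubit_a], placement[qubit_b])
    occupant = {site: qubit for qubit, site in placement.items()}
    swaps = []
    # Move qubit_a along the path, stopping one site before qubit_b.
    for here, there in zip(path, path[1:-1]):
        swaps.append((here, there))
        occupant[here], occupant[there] = occupant.get(there), occupant.get(here)
    new_placement = {q: s for s, q in occupant.items() if q is not None}
    return swaps, new_placement


# Four sites on a line: 0 - 1 - 2 - 3
line = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
placement = {"a": 0, "b": 1, "c": 2, "d": 3}
swaps, after = route_two_qubit_gate(line, placement, "a", "d")
print(swaps, after)
```

Each SWAP also displaces the qubit states it passes, so a good initial placement reduces the total number of SWAPs over the whole algorithm.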
From a quantum physics point of view, the main challenges are the coherence of the qubits, the fidelity of the operations and the overall error rate of the quantum computation, involving the qubits, their operations and the error correction involved. These are already being studied sufficiently by the quantum community, but there are clearly other challenges that also need to be researched as soon as possible.
One of the main problems is the error-proneness of the qubit behaviour, which consumes up to 90% of the (quantum) computer time. As explained, the routing and moving around of qubit states is a very important challenge. Any progress the physics community makes in that respect is extremely important, as it will substantially reduce the pressure on the micro-architecture and the overall system design. In [39], the authors present a quantum computer architecture that addresses the important problem of qubit state routing for nearest-neighbour two-qubit gate execution. They borrow an idea from the von Neumann architecture of classical machines, namely a quantum bus, which is a refreshable entanglement resource that connects distant memory nodes. The overall approach is at the level of entanglement purification and qubit pairs with different fidelities. Given that a quantum computation on qubits complies with the same overall in-memory computing logic, that particular architecture is definitely interesting for any quantum device. Quantum gates are applied on the qubits and change the state of those qubits. In principle, qubits therefore do not need to be transported somewhere else before one can apply the quantum operation. The challenges involved in in-memory computing are therefore the same as for quantum computing. The underlying technology is not memristors or some other technology but any of the quantum technologies, which also requires a full-stack integration of the different layers. In that sense, quantum computing research should be based very much on the basic principles that we think we understand for in-memory computing.
Future Prospects
It is very important that companies and other organisations start investing in quantum technology as soon as possible. Figure 8 shows a projection of when different parts of software and hardware development will be required to create an efficient quantum computer. A distinction is made between the use of quantum accelerators and the making of the quantum chip. In general, any commercial or other organisation is interested in a new technology if its Technology Readiness Level (TRL) is high enough. If we adopt the same levels as for classical technology, the TRL needs to have reached level 8, which is sketched by the red and black lines shown in Figure 8. There are 4 vertical green dotted lines that illustrate 3 moments leading to the last phase, where we assume there is enough software or hardware that can be used for any accelerator one wants to build. Phase I focuses on the reflection by the organisation on the concrete need that exists and for which a quantum accelerator logic can be developed. In this phase, team members can go for a PhD in quantum computing. Phase II brings the team members together again so that they can brainstorm on the logic for the quantum accelerator. They will express that logic in OpenQL, develop a prototype micro-architecture and execute the logic on the QX simulator. Phase III then focuses exclusively on the actual implementation and execution of the quantum accelerator logic, whether on an experimental quantum chip or on the QX simulator. This is the moment when the upper and lower curves can be combined in a real quantum prototype of the accelerator.
Figure 9 gives a complementary view. It represents how the two lines of research are currently separated and how they will be joined in perhaps 10 years. This division was used in this paper, where we made the distinction between the use of perfect and realistic qubits and showed how it determines the different layers in the full stack.
Conclusion
The main conclusion of this paper is that over the last couple of decades, quantum computing has been a one-dimensional research effort focusing on understanding how to make coherent qubits and how to implement the different universal quantum gate sets on any of the multiple quantum approaches. As far as computer architectural choices were made, the community has focused very much on the von Neumann computer architecture and defined qubits in terms of memory and processing qubits. However, computer engineering as a field has understood by now that this approach does not scale to the size needed for handling, for instance, the Big Data volumes that are being generated and collected worldwide. Two approaches seem to be very promising. The first comes from the accelerator community and involves the full integration of the different layers of the full stack that are needed to build the quantum accelerator. The use of perfect qubits in that context makes sense, as the end-users of any quantum accelerator can focus their reasoning on the quantum logic of the application and verify it through some implementation of the micro-architecture and the execution of the quantum instructions on the QX simulator. The second option is to use the full stack for the control of, for instance, superconducting and semiconducting qubits, with a micro-code layer in which we translate any kind of common QASM into an operational set of micro-instructions; this seems to be a meaningful adoption of existing computer technology. It is very difficult to predict what the performance improvement of any quantum computational device will be. It is clear that it will be much higher than that of any existing computational technology, but whether it will be 10, 100 or even more times faster depends on the quantum application that is looked at and on the way the qubits will be produced. Research will still be needed for at least 10 to 15 years before the effects of the full integration become visible and verifiable.
Figure 9: Structural Division between Perfect and Realistic Qubits
