As quantum systems grow in size, more complicated hardware is required to control them. In order to reduce this complexity, I discuss the amount of parallelism required for a fault-tolerant quantum computer and what computation speed can be achieved in different architectures. To build a large-scale quantum computer, one can use principles from classical computer architecture, like multiplexing or pipelining. In this document, a Quantum von Neumann architecture is introduced which uses specialized hardware for the different tasks of a quantum computer, like computation or storage. Furthermore, it requires long qubit coherence and the capability to move quantum information between the different parts of the quantum computer. As an example, a Quantum von Neumann architecture for trapped ions is presented which incorporates multiplexing in the memory region for large-scale quantum computation. To illustrate the capability of this architecture, a model trapped ion quantum computer based on the Quantum von Neumann architecture, the Quantum 4004, is introduced. Its hardware is optimized for simplicity and uses the classical Intel 4004 CPU from 1971 as a blueprint. The Quantum 4004 has only a single processing zone and is structured in 4-qubit packages. Its quantum memory can store up to 32768 qubit ions and its computation speed is 10 µs for single-qubit operations and 20 µs for two-qubit operations.
Introduction
Since the 1960s, the number of components, e.g. transistors, in integrated circuits (ICs) has doubled approximately every two years. This exponential growth is described by Moore's law [1, 2], and it enabled an exponential increase in calculation power. Before the early 2000s, the clock speeds of ICs grew exponentially as well [3]. But in the early 2000s, clock speeds reached levels at which cooling the ICs became the limiting factor. In order to maintain an exponential increase in calculation power, the size of transistors was decreased and multiple cores were incorporated into ICs. But eventually, the size of transistors will reach the ultimate size limit, the atomic level. Then, one will have to find new ways of computation or settle for a slower increase in computation power over time. One way to speed up computation for specialized applications is quantum computation (QC). In QC, the information is stored in quantum mechanical systems. Thus, a quantum computer can make use of quantum mechanical properties, such as superposition or entanglement, to speed up computation and the execution of quantum algorithms. One of the best known quantum algorithms is Shor's algorithm [4]. It allows factoring of large numbers exponentially faster than with classical algorithms and undermines the security of public-key cryptosystems. Other quantum algorithms allow the implementation of an oracle [5] or fast searches in databases [6]. Another application of quantum computers is quantum simulation [7, 8]. There, the idea is to use a well-controlled quantum mechanical system to simulate the behavior of another quantum mechanical system and gain insight into its behavior. Possible applications are in condensed-matter physics, atomic physics, or quantum chemistry [8, 9].
As the quantum mechanical systems for information storage are subject to decoherence, quantum error correction (QEC) has to be performed for fault-tolerant QC [10, 11]. Experimentally, fault-tolerant QC has not yet been performed on any quantum mechanical system. Hence, building a quantum computer still remains a challenge. There are many different systems under investigation as possible candidates for a fault-tolerant quantum computer. Two promising systems are superconducting circuit quantum electrodynamics (QED) systems [12, 13] and trapped ions [14, 15]. But as the interactions in these systems vary vastly, it has proved difficult to find a common architecture for a quantum computer [16], similar to the von Neumann architecture for classical computers.
Introduction to classical computer science
This section covers the fundamentals of computer science that are most applicable to quantum computers. It is intended for physicists with no specialized knowledge in computer science. If one is already familiar with terms like von Neumann architecture, abstraction layers, pipelining, multiplexing, or Rent's rule, one can skip this section and continue with Section 3.
A computer is a device that executes sequences of instructions [26]. A sequence to perform a certain task is called a program. With such programs, one can solve mathematical problems by executing algorithms [11]. How a computer solves problems and what types of problems it can solve (efficiently) is determined by its architecture. Hence, the choice of computer architecture directly relates to how efficiently one can solve given problems.
Most modern computer architectures are based on the von Neumann architecture [27], which is depicted in Fig. 1. The fundamental idea behind the von Neumann architecture is to divide the computer into individual parts: (1) arithmetic logic unit (ALU), (2) control unit (CU), (3) bus system, (4) memory, and (5) input and output (IO). The ALU can apply certain operations to its input registers. These operations include addition, subtraction, multiplication, and division and allow calculation with the input registers. Together with the CU, the ALU forms the central processing unit (CPU). The CU controls the ALU by sending commands (e.g. when to perform what operation). It also controls the bus system, which enables reading new commands and data from the bus into the CPU and writing the results back to the bus. A memory and an IO unit are attached to the bus so that one can read or store information, and react to external inputs by accessing the bus.
To execute a command in a von Neumann architecture, the CU loads a command from the memory into the CPU where it is decoded and executed. The result of the execution can be access to the IO interface or storage in the memory. For example, if one wants to multiply two numbers, both numbers will be loaded into the CPU and stored in registers of the ALU, which are typically called Accumulator A and Accumulator B. The ALU performs the multiplication and writes the result back in the same registers where it can be used for further processing or it can be written back into the memory.
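The load, decode, execute, and write-back cycle described above can be sketched with a toy machine. This is a minimal illustration, not a model of any real CPU: the opcode names (`LOAD_A`, `LOAD_B`, `MUL`, `STORE_A`, `HALT`) and the memory layout are assumptions invented for this example.

```python
# Toy von Neumann machine: program and data share one memory, and the
# control unit fetches, decodes, and executes one command at a time.
# All opcodes are hypothetical and exist only for this sketch.

def run(memory):
    acc_a = acc_b = 0      # ALU registers (Accumulator A and Accumulator B)
    pc = 0                 # program counter kept by the control unit
    while True:
        opcode, operand = memory[pc]   # fetch the command over the bus
        pc += 1
        if opcode == "LOAD_A":         # decode and execute
            acc_a = memory[operand]
        elif opcode == "LOAD_B":
            acc_b = memory[operand]
        elif opcode == "MUL":
            acc_a = acc_a * acc_b      # result stays in Accumulator A
        elif opcode == "STORE_A":
            memory[operand] = acc_a    # write the result back to memory
        elif opcode == "HALT":
            return memory
        else:
            raise ValueError(f"unknown opcode {opcode!r}")

memory = [
    ("LOAD_A", 5),    # address 0
    ("LOAD_B", 6),    # address 1
    ("MUL", None),    # address 2
    ("STORE_A", 7),   # address 3
    ("HALT", None),   # address 4
    6,                # address 5: first factor
    7,                # address 6: second factor
    0,                # address 7: space for the result
]
result = run(memory)
print(result[7])      # product of the two factors
```

Every access to `memory`, for commands and for data alike, goes through the same path, which is the property that later gives rise to the bus bottleneck.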
Abstraction layers
A computer can only interpret a sequence of instructions to manipulate data. In order to simplify the usage of computers, the abstraction layers have been introduced to make the interface to the computers a more natural process for humans [26] . The idea behind these layers is that each layer hides the implementation details of the layer below and offers simple functions for the layer above. To execute a program at a higher level, the next-lower level translates, or converts, the program to instructions of this level which are then executed. This process is repeated until the program is physically executed.
To discuss the concept of the abstraction layers with an example which has five abstraction layers, let's assume a user wants to generate a program that reads a number from a file and displays on the screen whether the number is smaller than 1000, or not. The user programs in the top layer, called the process layer, and the underlying layer, the operating system layer, offers functions e.g. for opening files, or for comparing numbers. To execute the user's program, a compiler converts the functions of the operating system layer into assembler code for the assembler layer, the third layer. In the firmware layer, which is the second layer from the bottom, these assembler commands are converted into hardware commands that can be executed on this specific CPU. Finally, these hardware commands are executed on the hardware of the computer, the hardware layer.
The concept of abstraction layers is fundamental in computer science. Without it, the usage of modern computers would not be as easy as it is today.
Parallel computing
To speed up computation, one can either increase the clock speed of the CPU or, if this is not possible, execute multiple instructions simultaneously in parallel [26]. Parallel computers incorporate multiple CPUs in one computer, as shown in the multiprocessor system of Fig. 2 a. Such a system illustrates a problem of the von Neumann architecture: data processing in the CPU might be fast but all data has to come through the bus. The clock of a bus is slower than the clock of a CPU. Hence, for fast CPUs (even in single processor systems), the bus system limits the performance of the computer. This is usually referred to as the "von Neumann bottleneck".
Multiprocessor systems with shared memory have applications in, for example, graphics processing units (GPUs) [26]. The algorithms behind these applications have to be repeated many times (e.g. for each pixel on a computer screen). Another way to increase parallelism is to build multicomputers, as illustrated in Fig. 2 b. These computers contain multiple CPUs with their own private memories and buses. The CPUs are connected via special buses, making them very efficient when solving different problems in parallel. But as the different computers do not share memory, distributing a single algorithm usually involves additional overhead to ensure that each CPU always works with correct data.
One thing that unites the different classical computer architectures presented so far is that they separate the CPU from the memory. CPUs require much more silicon area per bit of information and produce much more heat per bit of information than memories. Thus, building the CPU is very hardware demanding, and most modern computers only have a few CPUs but a large memory. In nature, however, brains compute in a massively parallel fashion. As neurons in the brain both store information and are part of the computation, people have tried to emulate the behavior of neurons with electrical circuits to generate artificial neural networks (ANNs) [28]. Such systems, also called neuromorphic systems [29], map their input registers either in an analog or a digital way onto output registers. This mapping is performed in parallel and allows fast data processing for specialized applications. To perform the desired task, these mappings have to be "learned" with learning rules. Applications of ANNs include pattern recognition in images, video, or audio. The field of ANNs is still mainly a research topic, but companies like IBM and Qualcomm are building their first products based on ANNs.
Pipelining in CPUs
Usually, the execution of a single command in a CPU can be divided into multiple sub-operations. In the following example, a command is divided into four parts. First, the command is loaded from the memory. Second, the command is decoded to find out what the CPU has to do. Third, the command is executed. And fourth, the results are written back to the memory. In order to speed up processing, pipelining [30, 26], illustrated in Fig. 3, is introduced, in which a CPU can work on all (four) stages in parallel, with each stage processing a different command. Since the individual steps of a command have to be executed serially, the execution of one command takes the same time as without pipelining. However, the tasks of the different stages of a pipeline can be executed simultaneously. Hence, the execution of algorithms is sped up by up to a factor equal to the number of stages in the pipeline. In the discussed example, that would be four.
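The speed-up from pipelining can be made concrete with a short cycle count. The sketch below assumes an idealized pipeline in which every stage takes exactly one clock cycle and no stage ever has to wait; the function names are chosen for this example only.

```python
def cycles_without_pipeline(n_commands, n_stages):
    # Each command runs through all stages before the next command starts.
    return n_commands * n_stages

def cycles_with_pipeline(n_commands, n_stages):
    # The first command needs n_stages cycles; afterwards one command
    # finishes per cycle because all stages work simultaneously.
    return n_stages + (n_commands - 1)

def speedup(n_commands, n_stages=4):
    return (cycles_without_pipeline(n_commands, n_stages)
            / cycles_with_pipeline(n_commands, n_stages))

for n in (1, 10, 1000):
    print(n, speedup(n))
```

Under these assumptions a single command gains nothing, and the speed-up approaches the number of stages (four here) only for long command sequences.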

Memory architecture - multiplexing
Memory that can be read and written is called random access memory (RAM). In order to read data from or write data into a RAM, the CPU writes the RAM address, i.e. the number of the memory segment to access, on the bus. A read/write flag indicates whether a read or a write process is being executed. For a read process, the CPU expects that the data lines of the bus will then contain the data to be read. For a write process, the CPU writes data on the same data lines of the bus. The number of memory registers that one can access scales as $2^n$, where $n$ is the number of address lines on the bus, which saves a lot of interconnections compared to one dedicated line per register.
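The $2^n$ scaling can be inverted to ask how many address lines a memory of a given size needs. This is a small sketch with a function name chosen for illustration; it counts only address lines and ignores data lines and the read/write flag.

```python
def address_lines_needed(n_registers):
    """Smallest n such that 2**n >= n_registers (at least one line)."""
    if n_registers < 1:
        raise ValueError("a memory needs at least one register")
    return max(1, (n_registers - 1).bit_length())

for size in (2, 1024, 1025, 32768):
    print(size, address_lines_needed(size))
```

A memory with 32768 registers therefore needs 15 address lines rather than 32768 individual select lines.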
Rent's rule is an empirical formula which is often used in the semiconductor industry as a scaling law. It relates the number of pins $P$, or external interconnections, of a system to the number of logical elements $B$ of this system by a simple power law [31, 32]
$$P = K B^{r},$$
where $K$ is the Rent coefficient and $r$ is the Rent exponent. The Rent coefficient is correlated with the average number of pins of a logical element. The Rent exponent sets the scaling with the system size, and r ≤ 0.75. Hence, the number of external interconnections has to grow more slowly than the system size. For example, a 1 MByte RAM chip need not have a thousand times more pins than a 1 kByte RAM chip to obey Rent's rule.
Multiplexers (Fig. 4 a) are electronic circuits that have many input ports and only one output port [26]. An address applied to the multiplexer selects which input is routed to the output. Demultiplexers (Fig. 4 b) are the opposite of multiplexers and have only one input port but multiple output ports. The address applied to the demultiplexer defines to which output port the input port is routed. In a RAM, multiplexer and demultiplexer circuits are used to access (read or write) a specific memory segment, so that only few pins are required to access a large number of memory cells. This results in a Rent exponent r of 0.12 for static RAM (SRAM) [32].
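To get a feeling for Rent's rule, one can compare how the pin count grows when the number of logical elements increases a thousandfold. The sketch below uses the two exponents quoted in the text (0.12 for SRAM, 0.75 as the upper value); the Rent coefficient is set to 1 as an arbitrary assumption, since it cancels in the ratio.

```python
def rent_pins(n_elements, k, r):
    # Rent's rule: P = K * B**r
    return k * n_elements ** r

def pin_growth(size_factor, r, k=1.0, base_elements=1000):
    """Factor by which the pin count grows when the system grows by size_factor."""
    return (rent_pins(base_elements * size_factor, k, r)
            / rent_pins(base_elements, k, r))

for r in (0.12, 0.75, 1.0):
    print(r, pin_growth(1000, r))
```

With r = 0.12, a thousand times more elements require only a little more than twice the pins, whereas linear scaling (r = 1) would require a thousand times as many.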
Quantum computer
A quantum computer is a computer which executes sequences of instructions in a quantum mechanical system. Quantum algorithms exploit quantum mechanical properties like superposition and entanglement and can solve certain problems exponentially faster than classical computers [4, 5, 6]. The quantum information in quantum computers is stored in quantum mechanical two-level systems which are called quantum bits, or qubits for short [11]. As with any quantum mechanical system, the qubits are subject to noise-induced decoherence. Thus, quantum information cannot be stored indefinitely. Furthermore, quantum mechanical gate operations are analog operations which will introduce (small) errors in the computation in the Hilbert space spanned by the qubits. To enable successful QC, QEC has to be performed repeatedly in the quantum computer [11, 33, 34] to correct the errors induced by decoherence and gate operations.
The concept of abstraction layers is fundamental in computer science and is applicable to quantum computers as well. In reference [35] , the authors propose a layered architecture for quantum computing with five layers. The bottom layer is the physical layer in which the physical gate operations are performed [36] . On top of that is the virtual layer in which one can use open-loop error cancellation for example with dynamical decoupling [37, 38] . In the third layer, QEC is executed. And in the next layer, the logical layer, a substrate for universal QC is constructed. Finally, the top layer is the application layer which provides an interface to the user who can input quantum algorithms there. In recent years, quantum assemblers [39, 40, 41] and quantum compilers [42] are being investigated to generate these sequences in an automated fashion similar to assemblers and compilers in classical computers.
A physical quantum computer has to fulfill the following five criteria which were proposed by DiVincenzo [43]. (1) A scalable physical system with well characterized qubits. (2) The ability to initialize the state of the qubits. (3) Long relevant decoherence times. (4) A universal set of quantum gates [11]. (5) A qubit-specific measurement capability. Currently, many different technologies are being investigated as possible candidates for a quantum computer, such as trapped ions [14, 15], superconducting circuit quantum electrodynamics (QED) systems [12, 13], quantum dots in silicon [44, 45], and ultra-cold atoms [46]. Finding a general architecture for a quantum computer, like the von Neumann architecture for a classical computer, is challenging as the interactions for qubit manipulation vary vastly from technology to technology. Therefore, the field of quantum computer architecture is still in its infancy [16].
Parallelism in quantum computation and classical hardware demand
There are two main sources of errors in QC. The first is the storage error, which is caused by noise from the environment coupling to the quantum computer. The second is the error created by imperfect quantum gate operations. QEC has to correct these errors, and for arbitrarily long QC, these errors have to be lower than a fault-tolerant limit [47, 10, 11]. This means that errors during QC must be low enough for QEC to be effective. This fault-tolerant error threshold is assumed to be in the range between $10^{-5}$ and 3% [21]. For serial QC with just one processing zone, the coherence time of the qubits would have to grow with the number of qubits in the quantum computer to maintain a constant memory error. As this increase in coherence time is not a realistic assumption, quantum computers with a large number of qubits will require parallelism for QEC to be fault-tolerant [48].
On the path to minimizing all errors in the system and thus reaching the fault-tolerant threshold, one can decrease the memory error by making the quantum computer massively parallel [20, 17, 18]. If QEC is constantly executed to "keep the qubits alive" and all necessary gate operations for QEC are executed in parallel, the time required for QEC, and thereby the memory error, will be minimized. The error threshold for fault-tolerant QC considers both the memory and the gate error. By minimizing the memory error, the threshold for the gate error will increase. As there is no technology yet that has shown fault tolerance on a (small) set of qubits, a higher gate threshold will make it easier to reach the fault-tolerant threshold of a given architecture.
This massively parallel approach [20] is the most sensible one for small-scale systems. With growing system size, Rent's rule [31, 32] has to be considered. Otherwise, mundane engineering challenges like the complexity or expenses will limit the fabrication of bigger systems [16] .
To estimate the possible size of such massively parallel architectures, one can look at classical architectures. In such a massively parallel quantum architecture, each qubit site has storage and calculation capability. The classical equivalent that is closest to such an architecture is a neural network. As there is no straightforward way to build the hardware of ANNs yet, one can look at graphics processing units (GPUs), which are often used to simulate ANNs in software because of their massively parallel computation capability. Modern GPUs have thousands of ALUs. If one assumes that the hardware resources required for an ALU are comparable to the resources required to control a qubit, one can assume that building a quantum computer in a parallel architecture with hundreds or thousands of physical qubits will be feasible.
The hardware demand of von Neumann type classical computers is the sum of the hardware demand for a CPU and the hardware demand for a memory. More memory will only affect the hardware demand for the memory and the bus connections of the CPU. As the CPU is very hardware demanding and one can use multiplexing technologies in the memory [26], one can build classical computers with large memories. Massively parallel quantum computer architectures, which combine memory capability and computation capability on the same site, will require techniques from ANNs to reduce the hardware overhead for computation or will otherwise be limited to small or medium size because of design considerations like Rent's rule. When following the classical approach for a large-scale quantum computer architecture, one splits computation and memory into separate regions [20]. The computation region will be very hardware demanding and the memory region will incorporate multiplexing technology to combine a big storage capability with low hardware demand.
In classical computer architecture, Rent's rule is the basis of models for power dissipation, interconnection requirements, or packaging [32]. For quantum computers, one should find a similar model to estimate their hardware demand. In general, one can state that typical parameters like price or power dissipation should scale as weakly as possible with the system size of (large-scale) quantum computers. For example, even a small cost for the first qubit can make the hardware uneconomically expensive if the price scales linearly with the number of qubits and one wants to work with $10^8$ qubits (see footnote 7). However, if the cost scales logarithmically with the number of qubits, large-scale QC can still be economically sensible even when the cost of the first qubit is high.
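The difference between linear and logarithmic cost scaling can be illustrated with a few lines of arithmetic. The cost functions and the numbers below are purely hypothetical (arbitrary cost units, and one particular logarithmic form); they are not taken from any real hardware estimate.

```python
import math

def linear_cost(n_qubits, cost_per_qubit):
    # Every additional qubit costs as much as the first one.
    return cost_per_qubit * n_qubits

def logarithmic_cost(n_qubits, first_qubit_cost):
    # Hypothetical form: each doubling of the qubit count adds
    # one more first_qubit_cost to the total.
    return first_qubit_cost * (1 + math.log2(n_qubits))

N = 10**8
cheap_linear = linear_cost(N, cost_per_qubit=1.0)
pricey_log = logarithmic_cost(N, first_qubit_cost=1000.0)
print(cheap_linear, pricey_log)
```

Even with a first qubit that is a thousand times more expensive, the logarithmic model ends up far cheaper at $10^8$ qubits under these assumptions.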
To estimate the maximum number of physical qubits per processing zone for fault-tolerant QC, assume that the quantum computer is built from a quantum mechanical system which allows moving quantum information from a memory region to the processing zone and back, and that, as a first step, it performs only QEC and no additional QC. Typically, the coherence time of an experimental system is defined as a decay time of a coherence measure, e.g. the 1/e-time of the decay of the Ramsey contrast [49]. In the following, the "QEC coherence time" is defined as the time after which decoherence has caused a decay so large that QEC has to be performed, while it can still be performed with a fidelity high enough for fault-tolerant QC. This time will depend on the quantum computer and the incorporated QEC scheme, but it will only be a fraction of the stated coherence time. To estimate the time of one cycle of QEC in a large-scale quantum computer, one can neglect the time for detection and reinitialization because, in a serial architecture with one processing zone, detection and (re-)initialization can be performed in different locations (see footnote 8). Thus, only the sum of all gate times matters, provided that the time required to move the quantum information in and out of the processing zone is counted as part of the gate time. As an example, for a syndrome measurement with a 7-qubit Steane code [34], one needs 4 entangling gates per syndrome, with 3 syndrome measurements for bit flips and 3 for phase flips. This sums up to 24 entangling gates for 7 qubits. Hence for a 7-qubit Steane code, the time for a single cycle of QEC per physical qubit is 24/7 times the average time of an entangling gate operation plus 1/7 of the time to perform the correcting gate operation. For higher level logical qubit encoding, this value can be higher.
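Written out, the gate count for the Steane code example reads as follows. The symbols $t_\text{ent}$ (average entangling gate time) and $t_\text{corr}$ (time of the correcting gate operation) are names introduced here for illustration only.

```latex
\begin{align}
  N_\text{ent} &= 4 \times (3 + 3) = 24, \\
  t_\text{QEC per qubit} &= \frac{N_\text{ent}\, t_\text{ent} + t_\text{corr}}{7}
    = \frac{24}{7}\, t_\text{ent} + \frac{1}{7}\, t_\text{corr}.
\end{align}
```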
With these assumptions, the maximum number of physical qubits per processing zone κ can be defined as
$$\kappa = \frac{\text{QEC coherence time}}{\text{time for a single cycle of QEC per physical qubit}},$$
where κ must be greater than 1 for fault-tolerant QC. The κ value states how much serialization is possible in a quantum computer. If this serialization can be implemented with multiplexing technologies, and thus with much lower hardware demand than for computation, high κ values will indicate a potential for large-scale QC with serialization. Two of the most promising technologies to fabricate a quantum computer are superconducting circuit QED systems and trapped ions. To estimate the (rough) κ values of these systems, let us assume in the following that in the chosen logical qubit encoding, it takes the time equivalent of 10 physical entangling gate operations to perform QEC (see footnote 9). Furthermore, for simplicity the "QEC coherence time" is chosen to be between 1 and 10 % of the experimentally measured coherence time. For superconducting circuit QED systems [25], the coherence times are on the order of 100 µs [50, 51] and entangling gates require about 100 ns [52, 53]. The resulting κ value is between 1 and 10, which suggests a parallel architecture for circuit QED systems (see footnote 10). In trapped ion experiments, coherence times on the order of 100 s [23, 24] and entangling gate times of about 100 µs [54] have been demonstrated, which yields a κ value between 1000 and 10000. This κ value classifies trapped ions as a system which can be used for parallel processing as well as for serial processing. Since the qubits in this system can be moved to a processing zone [55] and multiplexing is possible, the (classical) hardware demand for large-scale trapped ion systems can be reduced drastically by serialization of the computation; see Sections 5 and 6 for details.
Footnote 7: $10^8$ qubits is the approximate number of qubits required to factorize a number with about 10000 digits. See Section 3.2 for details.
Footnote 8: An additional assumption here is that the detection time is so small that it can be neglected when compared to the "QEC coherence time".
Footnote 9: The value 10 was chosen for simplicity, not because of a specific qubit encoding.
Footnote 10: As circuit QED systems with such coherence times do not allow for large-scale QC with a single processing zone, one would have to include the detection time in the "QEC coherence time" to estimate κ more precisely.
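The rough κ estimates for the two technologies can be reproduced with a few lines of code. The sketch uses only the order-of-magnitude numbers quoted in the text (coherence times, entangling gate times, 10 gate times per QEC cycle, and a "QEC coherence time" of 1 to 10 % of the measured coherence time); the function and parameter names are chosen for this example.

```python
def kappa_range(coherence_time_s, entangling_gate_time_s,
                gates_per_qec_cycle=10, usable_fraction=(0.01, 0.10)):
    """Return (lower, upper) estimate of kappa for one technology."""
    # Time for a single QEC cycle, expressed in entangling gate times.
    qec_cycle_time_s = gates_per_qec_cycle * entangling_gate_time_s
    low, high = usable_fraction
    # kappa = "QEC coherence time" / time for a single QEC cycle
    return (low * coherence_time_s / qec_cycle_time_s,
            high * coherence_time_s / qec_cycle_time_s)

circuit_qed = kappa_range(coherence_time_s=100e-6, entangling_gate_time_s=100e-9)
trapped_ions = kappa_range(coherence_time_s=100.0, entangling_gate_time_s=100e-6)
print("circuit QED:", circuit_qed)
print("trapped ions:", trapped_ions)
```

Up to floating-point rounding, this gives the ranges stated above: roughly 1 to 10 for circuit QED and roughly 1000 to 10000 for trapped ions.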
Computation speed
In the previous section, serialization in quantum information processing (QIP) was introduced as a method to reduce the hardware demand in large-scale systems. This serialization implies that fewer physical gate operations can be performed per unit of time compared to a fully parallel architecture. But this serialization allows working with more qubits than in a parallel architecture with similar hardware resources, which will lead to a speed-up as discussed later in this section. The comparison of the computation speed of different architectures heavily depends on the quantum algorithm that has to be executed. In this section, Shor's algorithm [4] , which allows fast factoring of large numbers, is used as a benchmark for computation speed, as discussed in Van Meter's PhD thesis [56] .
With Shor's algorithm, one can factorize a number N which has a binary bit length of n. In his thesis, Van Meter introduces different architectural models to execute Shor's algorithm and relates them to their execution time. For this, he requires a logical clock speed of the hardware, which states how many operations can be executed on the logical qubits per time and implies that QEC is executed in between the operations on the logical qubits.
One architectural model used by Van Meter is Beckman-Chari-Devabhaktuni-Preskill (BCDP), which requires next-neighbor-only interaction. It uses $(5n + 3)$ qubits to factorize a number with $n$ bit length and its execution time scales as $\sim 54n^3$. Another model is the Neighbor-only, Two-qubit-gate, Concurrent (NTC) architecture, which is executed with next-neighbor-only interaction in a $2n^2$ qubit space and its execution time scales as $\sim 20n^2 \log_2(n)$. The last model is the Abstract Concurrent (AC) architecture, which allows two- and three-qubit interactions between arbitrary qubits. This model is executed in a $2n^2$ qubit space and its execution time scales as $\sim 9n \log_2^2(n)$. As the qubit demand in the BCDP architecture only scales linearly with the length of the number to factorize, BCDP is the obvious choice for a massively parallel system, typically a system with a low κ value. In contrast, the other two architectures are executed in a $2n^2$ qubit space, which suggests some kind of serialization to reduce the classical hardware demand for factorizing reasonably big numbers. Hence, such models are executed on hardware with a high κ value, e.g. κ > 1000.
In the following example, let us assume that the hardware allows for a 1 MHz logical clock speed in the massively parallel approach used in the BCDP architectural model. In the NTC model, it is assumed to be possible to compute with next-neighbor-only interaction even in a $2n^2$ qubit space with a 1 MHz logical clock speed. Due to serialization, the AC model can only be executed with a 1 kHz logical clock speed. The execution times of a factorization with Shor's algorithm are depicted in Fig. 5 as a function of $n$. For short numbers (n < 50), the higher logical clock speed in the BCDP and NTC architectures allows for the fastest (quantum) computation. NTC is faster than BCDP and is less dependent on $n$ than BCDP. Hence, a bigger computation space allows for faster computation. As $n$ grows, the AC architecture becomes faster than BCDP and NTC. This illustrates that for large-scale systems, interaction between arbitrary qubits speeds up computation more than a fast logical clock.
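The comparison of the three architectural models can be reproduced numerically. The sketch below makes one simplifying assumption that is not spelled out in the text: the scaling expressions are taken directly as the number of logical clock cycles, so that the execution time is this count divided by the logical clock speed. The clock speeds are the ones assumed in the example above.

```python
import math

# model name -> (logical clock cycles as a function of bit length n, clock in Hz)
MODELS = {
    "BCDP": (lambda n: 54 * n**3, 1e6),
    "NTC": (lambda n: 20 * n**2 * math.log2(n), 1e6),
    "AC": (lambda n: 9 * n * math.log2(n) ** 2, 1e3),
}

def execution_time_s(model, n):
    cycles, clock_hz = MODELS[model]
    return cycles(n) / clock_hz

def fastest(n):
    return min(MODELS, key=lambda model: execution_time_s(model, n))

for n in (16, 128, 1024, 8192):
    times = {model: execution_time_s(model, n) for model in MODELS}
    print(n, fastest(n), times)
```

Under these assumptions the models with the fast logical clock win for short numbers, while AC overtakes BCDP first and NTC only at considerably larger $n$.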
The following statement cannot be generalized to all quantum algorithms, but typically one can state: small- and medium-scale QC is best executed with a fast logical clock speed for the shortest computation time, suggesting massively parallel hardware. However, the bigger the computation, the more it makes sense to switch to hardware which allows interaction between arbitrary qubits and temporary storage of quantum information in ancilla qubits. To outperform massively parallel architectures, such hardware requires a large total number of qubits, which suggests some kind of multiplexing to reduce the classical hardware demand (Rent's rule). Then, systems with high κ values allow not only computation with more qubits but also faster computation.
Quantum von Neumann architecture
When building a novel type of computer, like a quantum computer, one can use an architecture that is based on one of the (presented) classical architectures [26] to avoid having to design a fundamentally different kind of architecture. Furthermore, facilitating scalability in quantum computer hardware is one of the most challenging tasks of quantum computer architectures. In this section, the quantum von Neumann architecture is introduced which combines the classical von Neumann architecture with the requirements of the DiVincenzo-criteria in QC [43] resulting in quantum hardware which incorporates scalability.
In a massively parallel quantum computer, the DiVincenzo criteria have to be fulfilled at every site that holds a qubit. In order to simplify the hardware of a quantum computer, one can fabricate hardware components that are each specialized for only one criterion and move the quantum information between these specialized components to perform QC [20].
The schematic diagram of the quantum von Neumann architecture is depicted in Fig. 6 . Like any quantum computer, it will require a classical control unit which controls the quantum computer. A quantum bus system allows moving quantum information between the different parts of the quantum computer. The manipulation of the quantum information is executed in the quantum arithmetic logic unit (QALU) which is the most hardware demanding 11 part of the quantum computer as quantum gate operations are executed here. The quantum information is stored in the quantum memory which should rely on multiplexing technology for large storage capability. Furthermore, an input and output region acts as an interface to the classical world in which the qubit state is initialized and/or detected. The operation principle of the quantum von Neumann architecture is similar to that of classical von Neumann architecture. A quantum von Neumann machine executes a series of quantum gate operations by loading the qubits which should be manipulated into quantum registers of the QALU. The quantum register length can be arbitrarily long. In order to entangle two qubits from arbitrary positions in the memory, the two qubits are loaded from the quantum memory into the QALU where the gate operations are performed. After the quantum gate operations, the qubits can stay in the QALU for further processing or the qubits are moved back into the quantum memory. To detect quantum states, the required quantum information can be moved to an output, or detection, region. As this detection region works independently from the QALU, QIP in the QALU and detection in the output region can be performed simultaneously.
After quantum state detection, the qubits can be moved to an input region for initialization into one specific state before the (now initialized) quantum information can be moved back into the quantum memory. If a desired initial state is a more complex state, e.g. an entangled state, the initialized qubits can be moved into the QALU which performs quantum gate operations to generate the desired initial quantum state.
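The operation cycle described above (load into the QALU, apply gates, store, detect, re-initialize) can be sketched as a bookkeeping model. This is only an illustration: the class name, the region names, and the method names are hypothetical, and no quantum state is simulated, only the location of each qubit.

```python
from dataclasses import dataclass, field


@dataclass
class QuantumVonNeumannMachine:
    """Tracks only where each qubit is located (hypothetical sketch)."""
    location: dict = field(default_factory=dict)  # qubit id -> region name
    log: list = field(default_factory=list)       # record of executed operations

    def move(self, qubit, region):
        # Quantum bus: every transfer is a relocation of the qubit.
        self.location[qubit] = region
        self.log.append(("move", qubit, region))

    def two_qubit_gate(self, a, b):
        # Load both operands into the QALU if they are not already there.
        for q in (a, b):
            if self.location[q] != "QALU":
                self.move(q, "QALU")
        self.log.append(("gate", a, b))

    def store(self, qubit):
        # Return a qubit from wherever it is to the quantum memory.
        self.move(qubit, "memory")

    def detect_and_reset(self, qubit):
        # Detection and initialization take place outside the QALU.
        self.move(qubit, "output")
        self.log.append(("detect", qubit))
        self.move(qubit, "input")
        self.log.append(("init", qubit))
        self.move(qubit, "memory")


machine = QuantumVonNeumannMachine({q: "memory" for q in range(4)})
machine.two_qubit_gate(0, 3)   # entangle two arbitrary memory qubits
machine.store(0)               # qubit 0 returns to memory
machine.detect_and_reset(3)    # qubit 3 is read out and re-initialized
```

Because detection uses its own region, a scheduler built on such a model could overlap `detect_and_reset` on one qubit with gates on others in the QALU.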
Quantum memory region
In a classical computer, the components of a dynamic RAM (DRAM) needed to store one bit of information are a field effect transistor (FET) and a capacitor, as depicted in Fig. 7 a. Digital multiplexing logic controls the FET to access the DRAM cell. This low hardware demand per bit results in big data storage capacities on DRAM chips [26, 32] . In large-scale quantum computers, one has to achieve big quantum data storage capacities with low (classical) hardware demand in the quantum memory. If the hardware demand scaled linearly with the number of qubits that were stored, it would not obey Rent's rule and the control hardware would get too complicated and too expensive for large-scale QC with thousands or millions of qubits.
Reducing the hardware demand can be achieved with multiplexing circuits, as depicted in Fig. 7 b. For this, one needs the ability to store quantum information with a constant set of parameters. For example, to store an ion chain in a segmented Paul trap, one only needs a negative DC voltage at the position of the ion string and positive DC voltages surrounding it, which form an axial confinement for the ion string. These few voltages can in principle be used to store arbitrarily many ion strings. During storage, this set of parameters (for trapped ions, a set of DC voltages) is applied to all qubits in the quantum memory. To access a specific memory cell, multiplexing technology switches this cell from the shared set of parameters to another set which can be controlled independently. This independent set of parameters enables movement of the quantum information of an arbitrary memory cell out of the quantum memory.
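A minimal sketch of this storage scheme follows. The voltage values, the three-electrode cell, and the function names are assumptions chosen for illustration. It shows that the number of analog voltage sources stays constant while only the digital select lines grow with the memory size.

```python
# Hypothetical static set in volts: positive barrier, negative well, positive barrier.
STATIC_SET = (1.0, -1.0, 1.0)


def memory_voltages(n_cells, selected=None, dac_set=None):
    """Voltage set seen by each memory cell.

    All cells share STATIC_SET; only the selected cell is switched
    to the independently controlled DAC set.
    """
    return [dac_set if cell == selected else STATIC_SET for cell in range(n_cells)]


def control_lines(n_cells):
    """(analog sources, digital select lines) for a memory of n_cells cells.

    Assumption: one digital select line per cell.
    """
    analog = 2 * len(STATIC_SET)  # shared static set plus one DAC-controlled set
    return analog, n_cells


# Cell 2 of 5 is handed over to the DACs, e.g. to start moving its ion string out.
voltages = memory_voltages(5, selected=2, dac_set=(0.5, -2.0, 0.5))
```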
Quantum information transport
One of the most critical features of this quantum von Neumann architecture is the quantum bus system for quantum information transport, which has to be performed with high fidelity to allow fault-tolerant QC. As quantum information cannot be copied [57], it can only be transported by physically moving the qubits, by quantum teleportation [58], or via coupling to photons [59, 60].
Atomic or molecular qubit systems enable quantum information transport via physical movement of the qubit from one location in space to another. For example in trapped ion systems with segmented Paul traps [61, 55] , ions or ion strings can be moved by changing the confining axial DC potential. In solid state systems, such movement is not possible in general. However, there are solid state systems which allow qubit movement, like spins in silicon [44, 62, 63] .
Quantum teleportation [58] requires an entangled qubit pair, of which one qubit is at the location from which the quantum information is taken and the other is at the destination. Furthermore, it requires a qubit measurement and a classical channel to the destination, where a conditional quantum gate has to be performed. Storing and reading quantum information in the quantum memory in this way implies read-out and quantum-gate capability at every site in the memory. This contradicts the idea of specialized hardware for each DiVincenzo criterion and is thus more hardware demanding than physical movement. It could nevertheless be a strategy in many solid-state systems.
Mapping qubits to photons was demonstrated in atomic or molecular qubit systems [64, 65] as well as in solid-state systems [60] . Like in quantum teleportation, this approach requires quantum logic at every site in the memory and is therefore hardware demanding.
Quantum information transport with quantum teleportation or mapping to photons has one advantage over qubit movement: it is possible to change from one qubit system to another. For example, QIP could be performed with superconducting circuit QED systems [12, 13], and for long storage in the quantum memory, one could use nitrogen vacancy centers in diamond [66]. The disadvantage of these technologies compared to systems that allow qubit movement is the high hardware demand in the memory, as quantum gate operations and quantum state readout are required at every site in the quantum memory. If this cannot be overcome, quantum teleportation and mapping to photons will only be applicable to small- and medium-scale systems. Large-scale systems with low hardware demand per stored qubit may have to move the qubits in the quantum computer [20].
Parallelism in quantum von Neumann architectures
In order to work with an increasing number of qubits in a quantum von Neumann architecture, one has to increase the κ value to compensate for decoherence in the quantum memory. To this end, one can either increase the coherence time or decrease the time per quantum gate operation. If neither option is feasible, one has to parallelize QIP. Similar to classical multiprocessor systems, one can use multiple QALUs in one quantum computer, as depicted in Fig. 8 a. Another option, illustrated in Fig. 8 b, is to couple multiple quantum computers via quantum interfaces, which is the quantum equivalent of a multicomputer system. These quantum interfaces can be implemented by actual physical qubit exchange (qubit movement), quantum teleportation [58], or mapping to photons [59, 60]. Here, the hardware demand for the interface can be higher because it is not needed at every storage site in the quantum memory but only once for the interface.
Possible technologies
A technology to implement the quantum von Neumann architecture has to have a high κ value and needs to be capable of quantum information transport, ideally by physically moving the qubits. Both criteria are fulfilled in trapped ion experiments, and details on the implementation of a quantum von Neumann architecture for trapped ion QC are presented in Chapters 5 and 6.
Another technology suitable for a quantum von Neumann architecture is QC with ultracold atoms [46] . There, atoms can be stored in optical lattices. Hence, quantum memories with low hardware demand are feasible. Micro-electro-mechanical systems (MEMS) technology enables beam steering which, in combination with optical tweezers [67] , can be used to move atoms or qubits in QC with ultracold atoms.
A promising candidate for a solid-state system with high κ values and capability to move qubits is QC with spins in silicon [62, 63] , where coherence times of 28 ms have been demonstrated [68] . There, the qubits can be moved in an electric field by changing DC voltages [44] .
A quantum von Neumann architecture for trapped ion quantum computation
This section covers how one can build the different parts of a quantum von Neumann architecture in a trapped ion system. The next section will combine these individual parts to build a model trapped ion quantum computer based on quantum von Neumann architecture called Quantum 4004.
The guideline for development of the architecture is as follows: (1) (4) Simplicity of the hardware, especially for scaling of the quantum computer, is favored over optimization for higher abstraction layer tasks, containing things like QEC or quantum algorithms, throughout this section. (5) As there is no functioning fault-tolerant quantum computer yet, one cannot expect a first-generation quantum computer to work with high computation speed. Thus, computation speed has low priority in the development of the architecture. If a fault-tolerant quantum computer can be built and if Moore's law is applicable to quantum computer development, the computation speed (and quantum memory size) will increase exponentially over time.
In the quantum charge-coupled device (QCCD) principle [55], shown in Fig. 9, a segmented ion trap is used to move ions to different positions on the trap by changing the axially confining DC voltages. This allows using one part of the trap as a quantum memory and another part as a processing zone, or QALU. The QCCD [55] is a general concept for trapped ion QC and resembles a quantum von Neumann architecture, as there are separate regions for the different DiVincenzo criteria [43] and it enables qubit movement (along the RF rails of the segmented Paul trap). In the following, the QCCD is used as the underlying principle of a quantum von Neumann architecture for trapped ion QC.
[Fig. 9 panel labels: Loading Zone, Processing Zone, Storing Zone]
For the QCCD, ideas for QEC and higher level architectures 12 have been proposed [69]. However, trapped ions offer a variety of different gate operations and, thus, it makes sense to adapt the abstraction layer scheme for trapped ion QC. Gates can be performed using local RF fields [70, 23, 71, 72], global RF fields [73] or optical fields [74, 54]. Even for optical entangling gates, there are multiple types of gates [75, 76, 77]. Similarly, multiple procedures for efficient ion movement have been demonstrated [78, 79, 80]. In the following, the lowest abstraction layer of the scheme described in Section 3 (and reference [35]) is split into two. The new lowest level is then called the hardware layer, and on top of the hardware layer, a firmware layer is inserted. This new layer contains the firmware, such as the type of quantum gates and ion movements. Since the top layers have only weak hardware dependence, they do not need to be adapted.
The quantum von Neumann architecture, presented in this section, covers only the hardware abstraction layer. As the exact performance of the hardware is not known and, thus, the optimum QEC scheme cannot be identified 13 , it does not make sense to discuss the higher abstraction layers for this architecture at this point.
The different design challenges for such a quantum von Neumann architecture with trapped ions are
• vacuum pressure,
• decoherence in the quantum memory,
• multiplexing to enable large quantum memories,
• quantum gates,
• read out and initialization, and
• choice of qubits,
which are discussed in the following subsections.
12 Higher level means in abstraction layers above the hardware level.
13 Different QEC schemes have for example different fidelity thresholds or require a different amount of ion movement during computation. Hence, one has to find the best QEC scheme for a given architecture by evaluating the parameters of this architecture.
Reduce collisions with background gas
In order to maximize the coherence time in trapped ion systems, collisions with background gas should not limit the coherence times, as they can lead to ion loss or to loss of quantum information in the ion chain. Although these losses can be corrected with QEC, it is advisable to suppress such collisions as much as possible. In room temperature setups, collisions with residual background gas occur roughly once per hour per ion at typical UHV pressures of $10^{-11}$ mbar [81]. That means that when working with, for example, 3600 ions in a room temperature setup, one will have approximately one collision per second. In a cryogenic ion trap experiment at a temperature of 4 K, a residual background pressure of $10^{-16}$ mbar has been observed [82]. Such pressures reduce the collision rate by 5 orders of magnitude compared to room temperature setups. Hence in cryogenic experiments, one can work with more ions than in room temperature setups while at the same time reducing the collisions with background gas. This suggests that large-scale QC with trapped ions will have to be performed in a cryogenic environment.
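The collision estimate above can be reproduced with a few lines of arithmetic. The sketch assumes that the collision rate scales linearly with the residual pressure, which matches the five orders of magnitude quoted in the text; the function and constant names are illustrative.

```python
RATE_PER_ION_PER_HOUR = 1.0        # from the text, at the reference pressure
REFERENCE_PRESSURE_MBAR = 1e-11    # typical room temperature UHV pressure


def collisions_per_second(n_ions, pressure_mbar):
    """Total background gas collision rate for n_ions ions.

    Assumption: the rate is proportional to the residual pressure.
    """
    per_ion = RATE_PER_ION_PER_HOUR / 3600.0
    return n_ions * per_ion * (pressure_mbar / REFERENCE_PRESSURE_MBAR)


room_temperature = collisions_per_second(3600, 1e-11)  # about one collision per second
cryogenic = collisions_per_second(3600, 1e-16)         # about one collision per 1e5 s
```

At the cryogenic pressure, the same 3600 ions experience roughly one collision per day.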
Ideally, one wants to be able to neglect collisions with background gas as a source of qubit loss or decoherence. To assess this, one can look at the two elements with the lowest boiling point (or triple point), hydrogen and helium. As the exact vacuum pressure in an experiment strongly depends on the materials used, whether they were baked beforehand, and so on, one can only perform a worst-case analysis by looking at the vapor pressure of hydrogen and helium. For the vapor pressure, one assumes that the whole vacuum chamber is covered with at least one monolayer of the element in question. Hydrogen has a sublimation equilibrium pressure of $10^{-6}$ mbar at a temperature of 4.2 K, and $10^{-12}$ mbar at a temperature of 2.6 K [83, 84]. Hence at a temperature around 2 K, hydrogen can no longer sublimate and will definitely be frozen out. If helium is also a source of collisions with the ions, one will have to cool even further, as $^{4}$He has a sublimation equilibrium pressure of $10^{-6}$ mbar at a temperature of 0.46 K, and $10^{-12}$ mbar at a temperature of 0.24 K [83]. $^{3}$He shows a sublimation equilibrium pressure of $10^{-6}$ mbar already at a temperature of 0.22 K, and $10^{-12}$ mbar at a temperature of 0.1 K [83]. This does not imply that the whole experiment has to be performed at a temperature of 0.1 K to avoid being limited by collisions with $^{3}$He. But at least one surface in the cryostat will have to be that cold to exclude collisions with background gas from the sources of qubit loss or decoherence.
Decoherence in the quantum memory and magnetic shielding
In trapped ion QC, the qubit can either be encoded in an optical qubit [85, 74] or a ground state qubit [86, 87, 88]. In the optical qubit, one state of the qubit is a meta-stable D-state of the ion whereas the other one is in the ground state. The qubit transition frequency is in the optical regime and thus it is called an optical qubit. As the lifetime of this qubit is limited by the lifetime of the meta-stable state, which is typically on the order of 1 s [86], the coherence time will ultimately be limited by its spontaneous decay. Therefore, to achieve a long coherence time and a high κ value of the system, the qubit has to be encoded in the ground state of the ion, which does not suffer from such decoherence.
For ground state qubits, the main source of decoherence is magnetic field fluctuations. Therefore, generating a constant magnetic field and magnetic shielding are the most critical challenges to achieve long coherence times in trapped ion systems. Decoherence sources like spin-spin interaction [89] must be suppressed, e.g. by the choice of a $|F_1, M_F = 0\rangle$ to $|F_2, M_F = 0\rangle$ transition qubit, or by the choice of a qubit at a 'clock transition', for which the energy separation does not depend on the magnetic field to first order [86]. Other decoherence sources like leakage of resonant light must be reduced such that they can be neglected in the quantum memory, which is discussed in Section 5.5.5.
Quantum gate operations in trapped ion systems take between 10 and 100 µs [23, 54]. Experimentally, coherence times of more than 100 ms have been shown with mu-metal magnetic shielding [87] and dressed states [90]. All trapped ion experiments with coherence times of more than 100 s [22, 24] were performed with hyperfine qubits at a clock transition [86] and without external magnetic shielding. Hence with appropriate magnetic shielding, one should be able to increase the coherence time by several orders of magnitude. This results in coherence times of hours or days and κ values 14 greater than $10^{6}$. As the main magnetic field noise in a laboratory environment is from alternating current (AC) sources, one way to shield against AC magnetic fields is to use the skin effect in a highly conducting material surrounding the experiment [91]. Another way is to encapsulate the experiment in a mu-metal shield [92], which provides shielding against AC and DC magnetic fluctuations. However, slow magnetic field drifts such as changes in earth's magnetic field [93] still penetrate a magnetic shield made out of a highly conducting material or mu-metal 15. Thus, these simple magnetic shielding schemes will not allow the desired coherence times of hours or days.
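To put the quoted numbers side by side, the following sketch computes κ under the assumption that κ is the ratio of coherence time to gate time. That definition is not restated in this section and is assumed here for illustration only.

```python
def kappa(coherence_time_s, gate_time_s):
    """Assumed definition: number of gate times that fit into one coherence time."""
    return coherence_time_s / gate_time_s


SLOW_GATE_S = 100e-6  # upper end of the quoted 10 to 100 µs gate time

kappa_100s = kappa(100.0, SLOW_GATE_S)         # demonstrated 100 s coherence time
kappa_hour = kappa(3600.0, SLOW_GATE_S)        # one hour of coherence
kappa_day = kappa(24 * 3600.0, SLOW_GATE_S)    # one day of coherence
```

Under this assumption, a coherence time of 100 s with 100 µs gates sits at about $10^{6}$, and coherence times of hours or days lie well above it.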
A consequence of the Meissner effect [94] is that superconductors are perfect diamagnets and thus perfect magnetic shields. Inside a hollow superconductor, the magnetic field is constant and shielded from external magnetic fields. When placing the ion trap (equivalent to the whole quantum computer) in such an environment, the desired coherence times should be feasible with clock transitions in hyperfine qubits.
In practice, it is not straightforward to define a certain magnetic field strength inside a superconductor, as required for clock transitions in hyperfine qubits. During the phase transition into the superconducting regime, local magnetic flux can get pinned 16. To avoid this pinning, the suggested solution is to have the superconductor undergo the phase transition in a zero-field environment [95]. For such a cool-down, the experiment has to be located inside a magnetically shielded room (MSR) [96]. Once the entire shield is superconducting, external magnetic field changes will no longer be able to penetrate the shield. Cables, fibers, etc. to operate the Paul trap will have to enter the superconducting shield through holes to which superconducting tubes should be attached. The shielding of such a superconducting cylinder depends exponentially on its length (for a given diameter) [97]. Hence, long and thin cylinders are desired for high shielding against the environment.
The bias magnetic field at the position of the ions can be generated by superconducting coils inside the magnetic shield, as depicted in Fig. 10 a. During the cool-down in a zero-field environment, the superconducting coils do not contain persistent current [98]. With additional normally conducting coils, one can generate a magnetic field inside the shield, shown in Fig. 10 b. When the superconducting coils are heated locally, as illustrated in Fig. 10 c, the generated magnetic field can penetrate the superconducting coils. After they are cooled back down into the superconducting regime, the magnetic field produced by the normally conducting coils can be switched off. The resulting persistent current in the superconducting coils will generate an ultra-stable magnetic field inside the superconducting shield. Zero-field environments in MSRs with a residual field of less than 1.5 nT have been demonstrated [99]. If the pinning of magnetic flux in the superconductor were to increase the residual magnetic field by a factor of 100, the magnetic field strength would be on the order of 100 nT. The magnetic field strength for clock transitions in hyperfine qubits is generated by the superconducting coils inside the superconducting shield and is on the order of 10 mT [100, 23]. Hence, the pinned magnetic field in the center of the superconducting coils (at the position of the ion trap) can only produce a relative offset on the order of $10^{-5}$ of the total magnetic field, which leaves hyperfine qubits safely in the regime with only quadratic Zeeman shift.
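The order-of-magnitude estimate for the pinned field can be checked directly. The numbers are those quoted in the text; the factor of 100 is the text's own pessimistic assumption, and the function name is illustrative.

```python
def relative_field_offset(residual_field_t, pinning_factor, bias_field_t):
    """Pinned field relative to the bias field at the ion position."""
    pinned_field_t = residual_field_t * pinning_factor
    return pinned_field_t / bias_field_t


offset = relative_field_offset(
    residual_field_t=1.5e-9,  # demonstrated MSR residual field, 1.5 nT
    pinning_factor=100,       # pessimistic enhancement due to flux pinning
    bias_field_t=10e-3,       # bias field for the clock transition, about 10 mT
)
# offset is 1.5e-5, i.e. on the order of 1e-5 as stated in the text
```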
Such a setup will provide a temporally stable magnetic field to reach the desired coherence times of hours or days. In general, it is not necessary to completely eliminate magnetic gradients in trapped ion QC. If the magnetic field at each storage point is well known, one can calculate the phase evolution of all qubits. However, techniques like decoherence free subspace (DFS) encoding [101] require the same magnetic field for multiple ions. Therefore, it is desirable but not necessary to have high homogeneity. Spatial homogeneity is discussed in more detail in Appendix A.1.
Local oscillator stability
The transition frequencies of hyperfine transitions are typically on the order of 1-10 GHz [102, 100, 23]. If one wants to achieve coherence times of up to days, a frequency reference with a stability of about $10^{-15}$ will be required. In order to achieve the required stability of the reference clock, one can sacrifice some ions of the quantum computer to act as a precise long-term frequency reference. Although one has to remove some ions from QIP for the clock signal generation, such a scheme allows stabilizing the local oscillator. It will even enable using the quantum computer for atomic clock measurements.
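The required stability can be estimated with a rough criterion: the phase error accumulated between the local oscillator and the qubit over the storage time should stay below a fixed budget. The budget of 1 rad used below is an assumption made for this sketch, not a value from the text.

```python
import math


def required_fractional_stability(transition_hz, storage_time_s, phase_budget_rad=1.0):
    """Fractional frequency stability that keeps the phase error within budget.

    Criterion (assumed): 2 * pi * delta_f * T <= phase_budget_rad.
    """
    max_offset_hz = phase_budget_rad / (2 * math.pi * storage_time_s)
    return max_offset_hz / transition_hz


ONE_DAY_S = 24 * 3600
stability_1ghz = required_fractional_stability(1e9, ONE_DAY_S)    # roughly 2e-15
stability_10ghz = required_fractional_stability(10e9, ONE_DAY_S)  # roughly 2e-16
```

For transitions between 1 and 10 GHz and one day of storage, this criterion lands within an order of magnitude of the $10^{-15}$ quoted above.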
Multiplexing: ion storage and movement
An idling ion string accumulates on the order of 10 quanta/s and thus about a million phonons during a day of uncooled storage. These high phonon numbers will cause a melting of the ion crystal and the order in the ion string will be lost after a refreeze. Hence, ion storage times of hours or days require sympathetic cooling with a second ion species [103, 104] . Besides working with two ion species, sympathetic cooling implies that one needs cooling beams at each storage position. The illumination of each storage zone can either be accomplished by integrating fibers into the trap [105] or by illuminating multiple storage zones with a beam parallel to the trap surface, as displayed in Fig. 11 . Integrated fiber optics facilitate cooling of ion strings. However, one fiber per storage position will complicate the trap design whereas cooling multiple storage zones with a single beam will simplify the optical setup. These beams along the surface can even be reused by reflecting the light from one line of storage zones to the next line of storage zones, similar to the ideas discussed in reference [106] and shown in Fig. 15 b.
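The phonon estimate at the start of this paragraph is a simple product of heating rate and storage time, assuming a constant heating rate and no cooling during storage. The function name is illustrative.

```python
def accumulated_phonons(heating_rate_quanta_per_s, storage_time_s):
    """Phonons gained during uncooled storage at a constant heating rate."""
    return heating_rate_quanta_per_s * storage_time_s


one_day = accumulated_phonons(10, 24 * 3600)  # 864000, i.e. about a million
```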
One thing that has to be kept in mind when designing a large-scale quantum computer in a cryogenic environment is the heat load. Large-scale QC will require thousands of storage sites. If light is coming from a fiber at every storage site, it will be hard to couple the light back into fibers to avoid heating the cryostat due to light absorption. In contrast, light parallel to the surface cools multiple sites and is easier to couple back into a fiber.
A trap suitable for QIP with thousands of ions will require control over thousands of segments and thus over thousands of voltages with digital-to-analog converter (DAC) channels. In order to reduce this hardware demand, one can use analog multiplexers. By employing such analog switches, one DAC channel can control multiple segments. An example of how this can be incorporated in ion movement is shown in Fig. 12. At first, the ion is stored on the left side by controlling three segment pairs. During the shuttling, one has to control at most four segment pairs. When moving the ion to the right, control of an unused segment pair on the left can be exchanged for control of the next segment pair on the right 17. Hence, with this multiplexing scheme, it is possible to move ions in an arbitrarily big segmented trap with DC control over only four segment pairs and digital multiplexing logic. Furthermore, it is possible to adapt the voltage ramps for each segment individually, which enables the compensation of stray fields on all parts of the trap. The digital multiplexing logic circuits have to contain at least as many digital outputs as there are segment pairs on the trap. As traps for large-scale QC will contain thousands of segment pairs, this will require thousands of interconnects, and it is advisable to place both the analog switches and the demultiplexer circuits close to the trap or even integrate them into the trap chip.
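The exchange of DAC control during shuttling can be sketched as a sliding window over the segment pairs. As a simplification, the sketch keeps four segment pairs under control at every step (the text uses three while the ion is at rest), and it assumes a fixed wiring in which segment pair s is switchable to DAC channel s mod 4. Both the wiring and the function name are hypothetical.

```python
def shuttle_assignments(n_segments, start, stop, n_dacs=4):
    """Yield the segment-pair -> DAC-channel mapping for each shuttling step.

    The controlled window starts at segment pair `left` and moves right by
    one pair per step. Any n_dacs consecutive pairs map to distinct channels,
    so the channel released on the left is exactly the one needed on the right.
    """
    for left in range(start, stop + 1):
        window = range(left, min(left + n_dacs, n_segments))
        yield {segment: segment % n_dacs for segment in window}


# Move the controlled window across a trap with 1000 segment pairs.
steps = list(shuttle_assignments(n_segments=1000, start=0, stop=996))
```

Only the digital select logic depends on the trap size here; the number of DAC channels stays at four for any `n_segments`.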
If the digital multiplexing logic allows the control of multiple segments with just one DAC, one can generate multiple confining potentials on the trap with the same DACs. As illustrated in Fig. 13 , these multiple confining potentials can be moved on the trap the same way as a single one, allowing the transfer of multiple ion strings simultaneously with the same DAC channels 18 . As the same confining potential is used for multiple ion strings, stray fields on individual ion strings cannot be compensated separately. Hence, this scheme can only be used in regions where micromotion does not influence the operation of the quantum computer.
Following the ideas of the hardware requirements for storage in large-scale Quantum von Neumann setups in Section 4.1, the ions in all storage zones can be confined by a small set of static voltages 19 , as shown in Fig. 14 (1) . To access a quantum memory cell, a digital signal from the multiplexing logic switches the voltage from the set of static voltages to a set of voltages controlled by DACs. With the control over the confining voltages of the single memory cell, one can move the stored quantum information from the quantum memory to another region of the trap for further processing 20 , as depicted in Fig. 14. Should splitting of the qubits and the cooling ions be required, it can either be performed in the quantum memory region or in the QALU before QIP is performed.
The movement through X- and Y-junctions [107, 108] may require control over more than four segment pairs to block the ions from entering a wrong arm in the junction. But other than possibly having to control more voltages, there is no reason why this multiplexing architecture cannot be used to move ions through junctions. This allows structuring the memory region by branching with Y-junctions or by generating a grid with X-junctions, as depicted in Fig. 15 a and b. If one wants to use a qubit encoding scheme that is sensitive to magnetic field gradients, like DFS encoding, even tiny magnetic field gradients along the trap axis will cause different phase evolution in the different ions over long storage times of hours or days. If the gradient is linear, one could rotate the ion string in the middle of the storage time or repeatedly after a certain time interval to cancel the effect of the magnetic field gradient. Rotation of an ion string is easiest in a junction by moving the ions from arm 1 into arm 2, from there into arm 3 and then back into arm 1, as illustrated in Fig. 15 c.
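The cancellation by rotating the ion string can be illustrated with a toy model. It assumes a purely linear gradient, an instantaneous rotation at half the storage time, and an ion string that is symmetric about its center, so that the ion at offset +x sits at -x after the rotation. All names and units are illustrative.

```python
def gradient_phases(offsets_m, gradient_rad_per_s_m, storage_time_s, rotate_midway):
    """Phase of each ion relative to the string center in a linear gradient.

    offsets_m: axial offsets of the ions from the string center.
    gradient_rad_per_s_m: phase accumulation rate per unit axial offset.
    """
    if not rotate_midway:
        return [gradient_rad_per_s_m * x * storage_time_s for x in offsets_m]
    half = storage_time_s / 2
    # Second half: the rotation has mapped every offset x to -x.
    return [
        gradient_rad_per_s_m * x * half + gradient_rad_per_s_m * (-x) * half
        for x in offsets_m
    ]


offsets = [-5e-6, 0.0, 5e-6]  # three ions spaced by 5 µm
without_rotation = gradient_phases(offsets, 1e3, 3600.0, rotate_midway=False)
with_rotation = gradient_phases(offsets, 1e3, 3600.0, rotate_midway=True)  # all zero
```

In this idealized model, the differential phase between ions vanishes after one rotation; repeated rotations would play the same role if the gradient drifts slowly.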
Quantum gates
Heating
A major problem with entangling gates which use Coulomb interaction [75, 76] , thus phonons in an ion crystal, is motional heating [109] . A lot of effort has been made to characterize heating [110] , especially its dependence on the distance of the ion to the surface of the trap, and it has been shown that the heating rate is reduced in cryogenic environments [111, 112] . Experimentally, heating rates as low as 0.33 ph/s have been observed in surface traps [113] .
Due to sympathetic cooling in the memory region and short transport times of less than about 1 ms between the quantum memory region and the QALU, heating only affects QIP in the QALU. The quantum computer based on this quantum von Neumann architecture for trapped ions has to be operated in a cryogenic environment and, thus, the heating rate should be low enough to allow for fault-tolerant QC.
RF or optical drive fields
RF fields enable qubit operations with the lowest infidelity in trapped ion systems to date [23]. Entangling operations via Coulomb interaction require high field gradients due to the low Lamb-Dicke parameter of RF fields. These high field gradients are typically generated with high RF amplitudes. If the QALU is surrounded by memory zones, one must protect the qubits in the quantum memory from the resonant and near-resonant RF fields. There is research on minimizing RF fields surrounding the processing zones of traps. However, it is unclear how well this RF field suppression would work for a large-scale quantum computer with tens of thousands of qubits or more surrounding the QALU. Experimentally, one has to stabilize the phase of the RF in the QALU for high fidelity operation such that the path length between the RF source and the ion does not fluctuate on a (tens of) micrometer scale.
On the other hand, high fidelity quantum operations can be performed with optical drive fields as well [54, 36]. There, the demonstrated infidelity is about one order of magnitude worse than with RF fields. However, unwanted fields can be avoided by inhibiting direct line of sight between the quantum memory and surfaces of the QALU that scatter light, see Section 5.5.5 for details. Experimentally, the most challenging part is amplitude and phase control of the light field at the position of the ion. Given that optical frequencies are much higher than RF frequencies, one has to stabilize the phase of the light with sub-nanometer precision. Suggestions on the phase stabilization are given in Appendix A.3. Furthermore, in order to avoid long distances between the quantum memory region and the QALU, the processing zone will be in the center of the trap. Single ion addressing with laser beams will require a numerical aperture (NA) of 0.2 or higher for ion-to-ion distances of about 5 µm. Therefore, the trap needs to be slotted in the region of the QALU to allow high NA addressing perpendicular to the trap surface.
Physical requirements for the gate operations
So far, this architecture requires (at least) two ion chains to be loaded into the QALU for QIP, where gate operations are performed. The type of gates [76, 77, 70, 23] that are executed is defined in the firmware layer of the architecture. For a full set of quantum operations, single ion addressing capability is required.
As the path length between the drive field's source and the ions should not fluctuate for a stable phase reference, vibration isolation of the superconducting magnetic shield will be required, e.g. by suspending the shield with ropes from the vacuum chamber. Please refer to Appendix A.2 for more details.
With RF gates, the trap can be used as part of the transmission line which simplifies the setup. For optical gates, light can be guided via fibers into the magnetic shield and optical alignment in the shield will enable enough optical access to perform the required gate operations. Furthermore, vibration isolation will reduce beam pointing instabilities and thus undesired varying optical crosstalk between the ions. The tight focusing, required for single ion addressing, results in a high local light intensity at the position of the ion. In reference [114], the authors state that between 1 and 10 mW optical power is required for single qubit gates with a gate infidelity of $10^{-4}$ employing Raman transitions. Moreover, between 100 mW and 1 W optical power is required for entangling gates 21 using a Gaussian beam with $w_0 = 20$ µm. If all gates in a quantum von Neumann setup are performed with highly focused Gaussian beams with $w_0 \approx 1$ µm, the required total optical power will drop by a factor of 400 compared to their stated values. This lower optical power reduces problems like bleaching of fibers, which is worse at higher powers.
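The factor of 400 follows from keeping the light intensity at the ion fixed: for a Gaussian beam, the power needed for a given peak intensity scales with the beam area, i.e. with $w_0^2$. The sketch below applies this scaling to the power ranges quoted from reference [114]; the function name is illustrative.

```python
def scaled_power_w(reference_power_w, reference_waist_um, new_waist_um):
    """Power giving the same peak intensity with a different beam waist.

    For a Gaussian beam, peak intensity is proportional to P / w0**2.
    """
    return reference_power_w * (new_waist_um / reference_waist_um) ** 2


reduction = 1.0 / scaled_power_w(1.0, 20.0, 1.0)      # 400
single_qubit_min_w = scaled_power_w(1e-3, 20.0, 1.0)  # 1 mW becomes 2.5 µW
entangling_max_w = scaled_power_w(1.0, 20.0, 1.0)     # 1 W becomes 2.5 mW
```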
The crosstalk onto neighboring ions is a coherent process and thus can be eliminated by calibration and composite pulses [74] . If the crosstalk on all ions is known, one can construct a pulse sequence that performs all single qubit operations required by the quantum algorithm and at the same time corrects for the crosstalk [41] . Such calibration requires precise control over the amplitude of the driving field at the position of the ion. For operation with RF gates, this implies a clever segment structure of the trap. For operation with optical gates, beam pointing instabilities and imperfections in amplitude and timing control must be negligibly small. Thermal drifts might still cause spatial drifts on time scales of seconds or minutes. Therefore, it might be necessary to regularly place "calibration ions" in the QALU to track drifts of the crosstalk.
In order to protect idling qubits, it is possible to shelve populations from the clock state to other states in the Zeeman manifold [74] in which the QIP is performed. With this scheme, gate operations are not resonant with the clock transition in which quantum information is stored in the quantum memory. However, the imperfect shelving operations introduce leakage from the qubit states which needs to be considered in the employed QEC.
Pipelining
Since ions which arrive in the QALU from the quantum memory are only Doppler cooled, they have to be ground-state cooled for high fidelity QIP. If cooling and QIP are executed in the same processing zone of the QALU, the processing cycle will be slowed down by the required initial cooling. Following the pipelining approach from classical computer science, one can use separate regions in the processing zone for the individual tasks required for efficient QIP. These tasks could be:
Fig. 16 depicts such a pipelining approach which enables the execution of multiple tasks on multiple ion strings simultaneously. The thick black lines illustrate the RF rails along which ion strings can be moved. Fig. 16 a shows a QALU architecture for which two (or more) ion strings, which shall interact during QIP, are loaded from the quantum memory and combined into a single ion string. As Doppler cooling typically lasts milliseconds, whereas QIP is performed in tens of microseconds, the ion string passes through multiple stages of sympathetic Doppler cooling to ensure that the ions are at the Doppler limit before further processing is performed. After Doppler cooling, the ion string is ground-state cooled with sympathetic EIT cooling. In the next step, the quantum information is decoded, for example by transferring from the DFS encoding to the bare physical qubit. After qubit decoding, QIP is performed on the ion string. This enables interaction between arbitrary qubits of the quantum memory. After QIP, the ion string is encoded, e.g. with DFS encoding. Finally, the long ion string is split into multiple ion strings which can then be sent back to the quantum memory.
Another QALU architecture is depicted in Fig. 16 b. It has the same cooling and QIP procedure as the previous one. However, the different ion strings loaded from the quantum memory are not combined in the first pipeline step but cooled individually. After ground state cooling, the cooling ions can be separated from the qubit ions. This simplifies the mode structure of the ion crystal but requires efficient ion splitting of ground state cooled ion strings. To simplify the mode structure even further, the qubit ions used only for DFS encoding are separated from the ones containing the quantum information after DFS decoding. In the QIP region, the two ion strings are combined and QIP can be performed. For DFS encoding, the ions that were split off can be reused. At last, the qubit ions are recombined with the cooling ions before ion strings can be sent back to the quantum memory.
Qubit encoding/decoding and QIP require single ion addressing. With optical gates, if there is not enough optical access to perform single ion addressing at multiple locations, these tasks may have to be performed at different positions on the trap.
Having different regions for the different parts required for QIP is not yet pipelining. In the pipelining process, an ion string is moved from one processing region to the next, while the next ion string is moved into the previous processing region 22 . Thus, the number of processing regions defines the depth of the pipeline. The parameters of the cooling and processing time have to be chosen such that they can be synchronized. The time of one execution cycle defines the speed of the processing. The distance between the different processing regions should be short so that ion movement does not increase the execution time of one pipeline step considerably. In the processing regions, micromotion [115] has to be compensated for effective cooling and QIP. This will require many independently controlled voltages in the processing zone. However, in the shuttling regions between the processing regions, micromotion is not crucial and one can use multiplexing, as shown in Fig. 13 , to reduce the number of DC voltages which need to be controlled in the QALU.
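The synchronization argument above can be made concrete with a small sketch. All numbers in it are hypothetical placeholders (a 1 ms Doppler cooling time, a 100 µs dwell time per region, a 10 µs shuttling step), chosen only to match the orders of magnitude mentioned in the text; the function names are illustrative and not part of the architecture.

```python
import math

def doppler_stages(t_doppler_us, t_dwell_us):
    """Number of consecutive Doppler-cooling regions needed so that a string
    accumulates at least t_doppler_us of cooling while it dwells
    t_dwell_us in each region."""
    return math.ceil(t_doppler_us / t_dwell_us)

def pipeline_summary(stage_times_us, t_shuttle_us):
    """Cycle time, depth, latency and throughput of a synchronous pipeline."""
    cycle_us = max(stage_times_us) + t_shuttle_us   # slowest region sets the clock
    depth = len(stage_times_us)                     # number of processing regions
    latency_us = depth * cycle_us                   # time for one string to pass through
    strings_per_s = 1e6 / cycle_us                  # one string leaves per cycle once full
    return cycle_us, depth, latency_us, strings_per_s

# Hypothetical example values (assumptions, not taken from the paper)
T_DOPPLER_US = 1000.0   # total Doppler cooling time
T_DWELL_US = 100.0      # dwell time per pipeline region
T_SHUTTLE_US = 10.0     # shuttling time between neighbouring regions

n_doppler = doppler_stages(T_DOPPLER_US, T_DWELL_US)
# Doppler regions, then EIT cooling, decoding, QIP, encoding
stages = [T_DWELL_US] * n_doppler + [100.0, 20.0, 100.0, 20.0]
cycle_us, depth, latency_us, strings_per_s = pipeline_summary(stages, T_SHUTTLE_US)
print(n_doppler, depth, cycle_us, latency_us, round(strings_per_s))
```

In this toy model the slow Doppler cooling only adds pipeline depth (latency), while the throughput is set by the slowest single region plus the shuttling time, which is why short distances between regions matter.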
In this general pipelining approach, there are no restrictions on the ion strings processed in the QALU. In order to keep the vibrational mode structure of the ion strings simple, one has to limit the length of the ion strings. By choosing the ion string loaded from the quantum memory such that the qubit ions are surrounded by the cooling ions, one can detect ion loss during Doppler cooling. For this, one uses a camera to detect the number and positions of the cooling ions. From the spacing between the cooling ions, the number of processing ions can be inferred. Ion loss can be compensated by adding ions to the ion string either in the QALU or in a special zone outside the QALU.
Trap constraints
In order to minimize axial micromotion (which cannot be compensated), it is imperative to design the trap in the processing zone as symmetric as possible, e.g. as illustrated in Fig. 16 a and b .
In both the quantum memory and the QALU, the tracks along which ions can be shuttled will form loops. Hence, inter-layer connectivity (vias) will be required for the fabrication of such trap structures. Modern traps with vias 23 typically route the signal lines underneath the trap surface to the segment. These traps use vias to connect the actual segments with the routing tracks. This enables placing ground planes at areas on the trap surface which are not used for electrodes. These ground planes shield against electric fields from the lower lying routing layer, thereby reducing the cross-talk between segments. For the operation with optical gates, stray light that is (near-)resonant to a qubit transition is a serious problem for the long coherence times required for a quantum von Neumann architecture. A main source of stray light is light scatter at the slot in the QALU, which is required for QIP with high NA. It can be minimized by blocking the direct line of sight between the quantum memory and the QALU, for example with walls on the segmented traps, as depicted in Fig. 17 . The height of the walls should be larger than the distance from an ion to the surface of the trap. These walls should not be perpendicular to the trap surface but at an angle so that reflections on the wall's surface direct the light away from the trap surface. If the walls are made of a conducting material, they can be grounded and will have little impact on the trapping potentials. For the operation with RF gates, blocking of stray fields is not possible. They can only be minimized by clever segment structures.
QIP
Detection and initialization
For QIP with optical gates on ground state qubits, Raman transitions are incorporated to couple the quantum states [114] . These transitions are off-resonant with a typical detuning in the GHz or low THz regime. Because of this large detuning, a single photon is very unlikely to affect a qubit. Thus, it is safe to assume that reflections somewhere in the vacuum chamber can be neglected and it is sufficient to place a wall around the QIP zone in the QALU to shield the quantum memory from stray light. However, detection requires resonant light which causes fluorescence which is resonant as well. Furthermore, initialization produces resonant fluorescence. In the case of resonant photons, even a single photon can affect the information in the qubits, and one should try to avoid photons resonant with a state used for storing quantum information.
This problem can be circumvented by using a second ion species for detection. For this, the state of the ion to be detected has to be transferred onto the detection ion of a different species. Entanglement between two ions of different species has been demonstrated [116] . The swapping operation, illustrated in Fig. 18 a in the circuit model representation [11] , requires only near-resonant driving fields but no resonant fields. For detection, it transfers the quantum information to another ion species, while initializing the main qubit for further processing. Therefore, both state detection and initialization can be performed with ions of the second ion species and one does not have to worry about stray light resonant to the qubits in the quantum memory. This implies that initialization fidelity will depend on the fidelity of entangling operations. In order to increase the initialization fidelity for qubits which have a quadrupole transition, one can reinitialize the qubits additionally by optical pumping via the quadrupole transition. High fidelity state detection needs to be performed fast, which requires high photon collection efficiency. For example, in Ca⁺, the qubit information is usually stored in the ground state of the ion and, thus, an electron shelving pulse is required to transfer the population of one qubit state into the D₅/₂ state. The D₅/₂ state has a limited lifetime and thus spontaneous emission causes errors in the detection. The lifetime of the D₅/₂ state in Ca⁺ is about 1 s [117] , which means that the detection has to be performed in 10 µs 24 to achieve a detection infidelity of 10⁻⁵. With a scatter rate of about 10 MHz in Ca⁺, an ion emits about 100 photons during 10 µs. If one requires 5 clicks on the detector for reliable detection, one has to collect about 10 % of the photons with a typical detector efficiency of about 50 %. A collection efficiency of 10 % requires the detection optics to have NA > 0.6.
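The photon budget above can be checked with a few lines of arithmetic. The following sketch uses only the numbers quoted in the text; the function names are illustrative, and the NA estimate additionally assumes isotropic emission into vacuum and a single collection lens, which is a simplification.

```python
import math

def decay_probability(t_detect_s, lifetime_s):
    """Probability that the shelved state decays during the detection window."""
    return -math.expm1(-t_detect_s / lifetime_s)   # 1 - exp(-t/tau), numerically stable

def required_collection_efficiency(n_clicks, scatter_rate_hz, t_detect_s, detector_efficiency):
    """Fraction of scattered photons that must reach the detector."""
    photons_scattered = scatter_rate_hz * t_detect_s
    return n_clicks / (photons_scattered * detector_efficiency)

def collected_fraction(numerical_aperture):
    """Solid-angle fraction covered by a lens with the given NA,
    assuming isotropic emission in vacuum (NA = sin(theta))."""
    cos_theta = math.sqrt(1.0 - numerical_aperture ** 2)
    return 0.5 * (1.0 - cos_theta)

T_DETECT_S = 10e-6       # detection window
LIFETIME_S = 1.0         # approximate D5/2 lifetime in Ca+
SCATTER_RATE_HZ = 10e6   # approximate scatter rate

infidelity = decay_probability(T_DETECT_S, LIFETIME_S)
efficiency = required_collection_efficiency(5, SCATTER_RATE_HZ, T_DETECT_S, 0.5)
fraction_na06 = collected_fraction(0.6)
print(f"decay infidelity      ~ {infidelity:.1e}")
print(f"collection efficiency ~ {efficiency:.2f}")
print(f"fraction at NA = 0.6  ~ {fraction_na06:.2f}")
```

Under these assumptions, NA = 0.6 corresponds to a collected solid-angle fraction of 10 %, consistent with the requirement stated above.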
QEC requires ancilla qubits which have to be detected to extract information on the occurred errors, and thus the detection region has to be close to the QALU on the trap. This might limit the optical access to the QALU.
To increase the number of scattered photons in a certain period of time, one can use the fluorescence of multiple ions by employing Greenberger-Horne-Zeilinger (GHZ) states [119] , as demonstrated in reference [120] . The circuit representation of this detection scheme can be seen in Fig. 18 b. The input state |ψ⟩ = α|0⟩ + β|1⟩ is transferred onto N ancilla qubits of a second ion species to generate the GHZ state α|00···0⟩ + β|11···1⟩. With N ancilla qubits, the count rate increases by a factor of N compared to the detection with just one ancilla qubit. For detection in the same time interval, the collection efficiency can be lower by a factor of N compared to the case with one ancilla qubit. It is also possible to increase the detection fidelity for longer detection times by performing a majority vote. As an example, if one chooses a detection time of 100 µs for detection of Ca⁺, this will result in an infidelity of ≈10⁻⁴ due to spontaneous decay. If one chooses 5 ancilla qubits and one can detect how many ions are bright, 3 qubits will need to decay from the D-state to the S-state for a wrong state detection. The probability for this to happen is 10⁻¹². Hence, the overall detection process will more likely be limited by how efficiently one can generate the GHZ state than by detection itself.
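The majority-vote estimate can be reproduced with a short sketch, assuming that the decays of the individual ancilla ions are independent and that a D₅/₂ lifetime of about 1 s applies. The function names are illustrative. The cube of the single-ion decay probability gives the 10⁻¹² order of magnitude for one particular set of three ions; summing over all ways to choose the decaying ions gives a value roughly ten times larger, which does not change the conclusion.

```python
from math import comb, expm1

def decay_probability(t_detect_s, lifetime_s):
    """Probability that one shelved ion decays during the detection window."""
    return -expm1(-t_detect_s / lifetime_s)

def majority_vote_error(n_ions, p_decay):
    """Probability that more than half of n_ions decay (independent decays assumed)."""
    k_min = n_ions // 2 + 1
    return sum(
        comb(n_ions, k) * p_decay ** k * (1.0 - p_decay) ** (n_ions - k)
        for k in range(k_min, n_ions + 1)
    )

p_single = decay_probability(100e-6, 1.0)    # about 1e-4
p_three_specific = p_single ** 3             # about 1e-12 for one given triple of ions
p_majority = majority_vote_error(5, p_single)
print(f"single ion   ~ {p_single:.1e}")
print(f"given triple ~ {p_three_specific:.1e}")
print(f"any majority ~ {p_majority:.1e}")
```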
Pipelining can also be incorporated in the detection/initialization zone, as illustrated in Fig. 18 c. The incoming qubits are cooled to the ground state of motion before they are moved to the swapping zone.
There, the CNOT-gates for GHZ state generation and the swap gate are performed with initialized ancilla qubits of a second ion species. After the swap operation, the initialized qubits are moved to another initialization zone where one can compensate the initialization error due to imperfect gates or leakage into other states during QIP. After the initialization, the ions are shuttled back to other parts of the trapped ion quantum computer. During the compensation of the initialization error, the ancilla qubits of the second ion species are moved to a detection zone where the (GHZ) state is detected. After detection, the ions are cooled and initialized in separate zones before they can be reused in the swap zone. Experimentally, the challenge will lie in protecting the quantum information between the swap and the detection zone from (resonant) stray light of the cooling and initialization zones.
Choice of ion species
At first, one has to decide how many species one needs for this architecture. One ion species is required for the qubit ions. Another ion species is required for sympathetic cooling. The ions for detection can either be from the same ion species as the ions for sympathetic cooling, or one can use a third ion species. If one only uses two ion species, cooling in the memory region and detection have to be pulsed so that resonant stray fields from the memory region do not affect the quantum state in the detection zone. In contrast, using three ion species makes the detection zone independent from the memory region. In the following, the three ion species architecture will be discussed.
Table 1: List of constants of ion species for QIP, part 1 [114] . I is the nuclear spin, ω₀ is the zero field hyperfine splitting, λ₁/₂ and λ₃/₂ are the wavelengths of the S₁/₂ to P₁/₂ and S₁/₂ to P₃/₂ transitions, and Γ₁/₂ is the linewidth of the S₁/₂ to P₁/₂ transition. […] are not in the list as their spin is 0, they do not have hyperfine splitting, and the transitions are only shifted by a frequency in the GHz regime compared to their respective isotopes in this list.
Table 1 and Table 2 show the properties for the choice of ion species in the quantum von Neumann architecture for trapped ions. The criteria for the choice of the ion species of the qubit ions are:
• Ground state qubit: required for long storage time
• Long coherence time: low magnetic field dependence, e.g. clock transitions in hyperfine qubits.
The field strength should be large enough that motional sidebands do not overlap with neighboring transitions, thus B > ∼10 Gauss.
• Wavelength: the lasers incorporated for the operation of the quantum computer can hit the trap surface. To avoid electron emission, the work function of the surface material has to be higher than the energy of a photon. Typical surface materials for segmented Paul traps are gold and aluminum.
Gold has a work function of 5.3 eV [121] , which corresponds to 234 nm. Aluminum has a work function of 4.08 eV [122] , which corresponds to 304 nm.
• Mass ratio: for sympathetic cooling the mass ratio of the ion species in the ion string should be close to 1 [104, 123, 124] , so that all modes of a mixed ion crystal can be efficiently cooled. Experimentally, sympathetic cooling of two-ion crystals with a mass ratio of 3 has been demonstrated [125] .
• Mass: with the same electric field, lighter ions are accelerated faster, which is advantageous for ion movement. In the same trapping potential, lighter ions have higher trap frequencies, which allows faster gate operations and less power is required for the entangling gate operations [114] .
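The wavelength criterion in the list above follows from the relation E = hc/λ. As a minimal sketch, the following converts the quoted work functions into threshold wavelengths; the constant is the standard value of hc in eV·nm, and the function name is illustrative.

```python
# h*c expressed in eV*nm (rounded value of the exact SI product)
HC_EV_NM = 1239.841984

def threshold_wavelength_nm(work_function_ev):
    """Longest wavelength whose photon energy still reaches the work function."""
    return HC_EV_NM / work_function_ev

gold_nm = threshold_wavelength_nm(5.3)       # work function of gold from the text
aluminum_nm = threshold_wavelength_nm(4.08)  # work function of aluminum from the text
print(f"gold:     {gold_nm:.0f} nm")
print(f"aluminum: {aluminum_nm:.0f} nm")
```

Laser wavelengths longer than these thresholds cannot release electrons from the respective surface by single-photon emission.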
The criteria for the choice of the ion species of the cooling ions are:
• Mass ratio: as described for the qubit ion species.
• Wavelength: as described for the qubit ion species.
• No nuclear spin: this simplifies the level structure and the laser system, if one does not require two beams with GHz detuning from one another.
The criteria for the choice of the ion species of the detection/initialization ions are:
• No nuclear spin: as described for the cooling ion species.
• Long lived D-state: there are several different detection schemes like electron shelving [126, 127, 128] , or using sigma polarized light to cyclically drive a single transition [61] . Both schemes allow high-fidelity state detection. But in the same setup, electron shelving with an ion species which has a long lived D₅/₂ state usually yields higher fidelity than a detection scheme which is limited by off-resonant excitations [129, 118] .
The Quantum 4004
After the invention of integrated circuits, the first CPUs were developed [26] . The Intel 4004 was one of the world's first microprocessors and was the first microprocessor which customers could program themselves. It had the following technical specification 25 :
• 4-bit CPU
• Instruction cycle time: 10.8 µs
• Instruction execution time: 1 or 2 instruction cycles
• Able to directly address 32768 bits (4096 bytes) of memory
This was one of the starting points of the exponential increase in clock speed and number of transistors of a CPU (Moore's law [1, 2] ), leading to the computers that we have now. Hence, it makes sense to assume that the speed and capabilities of first generation quantum computers may be the equivalent of the Intel 4004 but in a quantum world. Because of its simplicity, the Intel 4004 will act as a blueprint for a quantum computer in the following. If Moore's law is applicable to the development of quantum computers, the capabilities of quantum computers will increase exponentially over time as well. Following the ideas of the previous section, this section only discusses the hardware of the trapped ion quantum computer. The main focus is on simplicity of the hardware and, if possible, optimizations for certain QEC schemes or quantum algorithms are avoided 26 . This section only serves to exemplify the scalability and capabilities of such a quantum von Neumann architecture. This Quantum 4004 is based on the presented quantum von Neumann architecture for trapped ions with design parameters corresponding to the technical specification of the Intel 4004.
• The Intel 4004 had only one CPU. Thus, the Quantum 4004 will also work with only one QALU.
• The information is structured in ion chains of 4 qubits. For a 7-qubit Steane code [34] , it may be advantageous to structure this architecture with 7 qubits in an ion chain. But since this first design should not (by choice) have any restrictions imposed by higher abstraction layers, the following design will stick to the Intel 4004 model of 4 (qu)bits.
• Each qubit in the memory zone is DFS encoded. Thus, one ion chain in the quantum memory contains 8 physical qubit ions. -The memory of the Intel 4004 was structured in bytes (8 bits) as well.
• Since the proposed hardware should result in coherence times of days or longer, a design with fast gate speed is not mandatory. In analogy to the instruction execution time of 1 or 2 instruction cycles (10.8 µs) in the Intel 4004, a single qubit operation should be executed in 10 µs and an entangling gate operation should be executed in 20 µs in the Quantum 4004.
• The Intel 4004 was able to access 4 kBytes of RAM (32768 bits) and thus the architecture of the Quantum 4004 should be able to store 32768 qubit ions. Due to DFS encoding, this would correspond to 16384 physical (non-encoded) qubits.
The QALU can be structured following the ideas illustrated in Figure 16 . In the region for QIP, there will only be one string of 16 qubit ions (and cooling ions). After DFS decoding, only 8 of these 16 qubit ions contain quantum information. Hence, the optics for QIP has to be optimized for an ion string of 8 ions 27 , following the ideas presented in Appendix A.3. It will be possible to perform 8 single qubit operations simultaneously. Therefore, in an ideal case, it will be possible to perform 800 thousand single qubit operations or 50 thousand entangling gates per second on 16384 physical qubits. For a syndrome measurement in a 7-qubit Steane code, one will need 4 entangling gates per syndrome, 3 syndrome measurements for bit flips and 3 for phase flips. This sums up to 24 entangling gates for 7 qubits. If all qubits were encoded in a 7-qubit Steane code, one would need 1.12 s to perform a syndrome measurement on all 16384 physical qubits. Due to coherence times of days, the 1.12 s for syndrome measurements should allow for fault-tolerant QC. These syndrome measurements would require about 14000 quantum state detections to read out the syndromes. Thus, serialized detection with a single detection zone would have to detect a quantum state in 80 µs or less, which is not possible as detection involves swap gates (consisting of 3 entangling gates, which each take 20 µs) and CNOT-gates. For detection in 80 µs with 7-qubit GHZ states 28 , whose generation takes 180 µs 29 , one will require at least 3 detection zones. Ion shuttling must only be fast enough to not (considerably) slow down pipelining in the QALU. Thus, the already demonstrated speeds of several 100 µm in a couple of microseconds [79, 80] should be sufficient.
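The throughput and timing figures above follow from a few multiplications, sketched below with the design parameters of the Quantum 4004. The variable names are illustrative. The breakdown of the 180 µs GHZ generation time into one swap (3 entangling gates) plus 6 CNOT-gates of 20 µs each is an assumption made here to reproduce the quoted number.

```python
import math

T_SINGLE_US = 10.0      # single qubit operation
T_ENTANGLE_US = 20.0    # entangling gate operation
PARALLEL_SINGLE = 8     # single qubit operations executed simultaneously
N_PHYSICAL = 16384      # physical (non-encoded) qubits
BLOCK_SIZE = 7          # qubits per Steane code block
GATES_PER_SYNDROME = 4
SYNDROMES_PER_BLOCK = 6 # 3 for bit flips, 3 for phase flips

single_ops_per_s = PARALLEL_SINGLE * 1e6 / T_SINGLE_US
entangling_per_s = 1e6 / T_ENTANGLE_US

blocks = N_PHYSICAL / BLOCK_SIZE
gates_total = blocks * GATES_PER_SYNDROME * SYNDROMES_PER_BLOCK
t_syndrome_s = gates_total * T_ENTANGLE_US * 1e-6
detections = blocks * SYNDROMES_PER_BLOCK
t_per_detection_us = t_syndrome_s * 1e6 / detections

# Assumed breakdown: swap = 3 entangling gates, plus 6 CNOT-gates for a 7-qubit GHZ state
t_ghz_us = (3 + 6) * T_ENTANGLE_US
detection_zones = math.ceil(t_ghz_us / 80.0)   # 80 µs budget per detection from the text

print(f"{single_ops_per_s:.0f} single qubit ops/s, {entangling_per_s:.0f} entangling gates/s")
print(f"syndrome round: {t_syndrome_s:.2f} s, detections: {detections:.0f}")
print(f"time per detection: {t_per_detection_us:.0f} µs, zones needed: {detection_zones}")
```

The time budget per detection is independent of the number of qubits: it is simply 24 gates × 20 µs divided by 6 syndromes per code block.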
The layout of the Quantum 4004 with a subfigure of 2 × 2 unit cells in the quantum memory is shown in Fig. 19 and a zoom view of this layout near the QALU is depicted in Fig. 20 . The QALU with pipelining is structured in a round shape, following the ideas illustrated in Fig. 16 . The QIP region in the QALU is surrounded by walls to protect the quantum memory zones from near resonant stray light. Left of the QALU in Fig. 20 is the ion storage, or loading zone, which contains a grid for backup storage of ions which can be used to replace lost ions. The Quantum 4004 contains 3 detection zones and they are located in the center of the trap, as shown in Fig. 20 . The main surface area of the segmented Paul trap of the Quantum 4004 is occupied by the quantum memory, which is divided in 3 parts in Fig. 19 and structured following the ideas depicted in Fig. 15 b. In the two 16 × 24 memory regions, the cooling beams travel from left to right and are reflected back to reuse the same light a second time for cooling. In the 32 × 40 memory region, in contrast, the cooling beams travel from bottom to top. Efficient ion movement in the grid of a memory region is only possible parallel to the cooling beams, depicted by black lines. On the tracks perpendicular to the cooling beams, depicted by light blue lines, the ions are stored. When moving ions from storage to the QALU, it requires no additional ion shuttling to move ion strings perpendicular to the cooling beams. But the paths along the cooling beams are typically occupied by stored ions. To simplify ion movement along the cooling beams, the three memory zones are surrounded by tracks which are not used for ion storage. Table 3 sums up the required hardware resources of the Quantum 4004 architecture shown in Fig. 19 and Fig. 20 .
In total, one would require about 57000 individual segments for this architecture, and thus […] inhomogeneity for the whole trap of 10⁻⁶ with Helmholtz coils, one would need Helmholtz coils with a radius of about 1 m, see Appendix A.1 for details. The dimensions of the superconducting shield would have to be a factor of 2 or 3 bigger than the dimension of the Helmholtz coils. Hence, the superconducting shield could be a cylinder with 3 m radius and 5 m length 30 . To connect the 57000 individual segments to external signals, one would have to place 57000 bond pads on the trap, which would occupy nearly as much space as the trap itself. Hence, if it is possible, one will have to integrate the analog switches and the multiplexing logic circuits into the trap, which will reduce the number of DC interconnections to 280 plus the control signals for the multiplexing logic circuits.
27 Even if one only needs to perform 1 single qubit operation on 1 ion, there will always be 8 qubit ions containing quantum information in the QIP region.
28 A 7-qubit GHZ state would require 7 ions of the detection ion species and 1 ion of the qubit ion species in the QIP part of the detection zone. Thus, like in the QALU, one can use a setup optimized for 8 ions.
29 Due to pipelining in the detection zone, the detection time can be as long as 180 µs as well.
With the design of this chapter, it will be possible to construct a trapped ion quantum computer which is capable of handling 32768 qubit ions on which one can apply 800000 single qubit operations or 50000 entangling gate operations per second.
To get a feeling for how difficult the fabrication of the Quantum 4004 trap would be, one can compare the layouts of the Quantum 4004 (Fig. 19 ) and the Intel 4004 33 . While the Quantum 4004 requires more individual electrical lines, the Intel 4004 has a higher complexity because of the interconnections between the electrical lines.
Possible future quantum computers
The Quantum 4004 illustrates a prototype which is optimized for simple hardware, and it shows that quantum computers with tens of thousands of qubits are technologically feasible. Of course, one will have to start by demonstrating that the individual parts of such a quantum computer work. But once that has been demonstrated, there is no reason why QC with more than 10000 qubits should not be possible in the (near) future.
The optimization for simple hardware in Section 6 limits the capabilities of the quantum computer. For example, the choice of a single QALU slows the Quantum 4004 down. Furthermore, in future publications, one should discuss architectures which are optimized for certain QEC schemes or even quantum algorithms. For example, when working with a Steane code, ion chains of length 7 make more sense than length 4. If coherence times of days are achieved, one will not be restricted to only 32768 physical qubits but one will be able to work with millions of physical qubits (per processing zone). The only disadvantage is that even for 32768 qubits, most of the trap is already occupied by the quantum memory. Hence, such quantum computers will require much bigger traps. In the example of the Quantum 4004 presented in this section, the vacuum chamber of the quantum computer is about the size of a room, which seems feasible. When working with millions of qubits, one will have to find different ways to generate a highly homogeneous magnetic field over a large volume, or a way to allow QC with millions of qubits in a magnetic field with low homogeneity.
Summary
As there exists no functioning quantum computer yet and it is not fully clear how to build a large-scale quantum computer, quantum computer architectural design becomes increasingly important. In this document, I presented knowledge of classical computer architecture, like von Neumann architecture or pipelining, which is applicable in quantum computer architectures. Quantum computers are different from classical computers because decoherence in qubits destroys quantum information over time, which has to be compensated by executing QEC, whereas classical information can, in principle, be stored infinitely long. Thus, quantum computers with a large memory will require parallelism for fault-tolerant computation.
Currently, most quantum computer architectures favor a massively-parallel approach for which every qubit site has storage capability and computation capability. This is the most sensible approach for small- and medium-scale QC, as QEC can be performed on all qubits simultaneously and it allows higher gate thresholds for fault-tolerant QC. However, for large-scale systems, engineering challenges like obeying Rent's rule have to be overcome. This will most likely imply some kind of serialization of the computation.
The presented Quantum von Neumann architecture is based on classical von Neumann architecture and has a specialized part, or section, for each task to perform in the quantum computer, such as a QALU for QIP or a quantum memory for quantum information storage. This architecture is applicable to systems with a long coherence time and the capability to move quantum information from one region of the quantum computer to another.
After the general introduction, trapped ions are chosen to illustrate how a quantum computer based on Quantum von Neumann architecture could be built by incorporating multiplexing and pipelining technology. Furthermore, requirements for the quantum gate operations and the choice of ion species are given. Finally, this theoretical knowledge is applied to a specific trapped ion quantum computer, the Quantum 4004, which has simple hardware and is the quantum equivalent of the Intel 4004. The Quantum 4004 has just one QALU and the computation speed is 10 µs for single qubit operations and 20 µs for two-qubit operations. The quantum computer is structured in 4 qubits per ion string, which are DFS encoded. Thus, there are 8 qubit ions per ion string. In total, the Quantum 4004 can work with up to 32768 qubit ions in a fault-tolerant way on an ion trap measuring 38.9 mm × 42.7 mm with ≈57000 segments.
Acknowledgment
I would like to thank Philipp Schindler for helpful comments on this manuscript.
A. Appendix
A.1. Magnetic field homogeneity
In Section 5.2, a way to shield cryogenic experiments with a superconductor by exploiting the Meissner effect was presented. Trapped ion qubits using a clock transition require a specific magnetic field, which is on the order of 10 mT. The generation of a constant field inside the magnetic shield requires a persistent current in superconducting coils. Typical trap dimensions are 40 to 60 mm, see Section 6 for details, and most of the surface is occupied by the quantum memory. Ideally, one wishes to have the same magnetic field strength over the whole trap. But as coils cannot become arbitrarily big and as the ion trap's permeability is different from vacuum's permeability, one cannot have a perfectly homogeneous magnetic field. Of course, one can measure the local magnetic field at all (relevant) positions on the trap and calculate the phase evolution of each qubit in the trap. This gets more and more laborious with increasing size of the trap and increasing number of qubits. Thus, it is advantageous to have a highly homogeneous field at the position of the ion trap.
The sensitivity to magnetic fields at clock transitions is typically on the order of 100 kHz/mT² at a bias field of about 10 mT [102, 100, 23] . In order to achieve coherence times of a day, one requires the magnetic deviation to be smaller than 10 nT, which is a relative deviation of about 10⁻⁶ of the typical bias field. 10 nT is also the offset that the superconducting shield might produce, see Section 5.2 for details.
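The link between the 10 nT deviation and a coherence time of about a day can be illustrated with a rough estimate. The sketch assumes that the frequency shift depends purely quadratically on the deviation from the field-insensitive point, Δf = k·δB², and that the coherence time is on the order of 1/Δf. Both are simplifying assumptions made here for illustration, and the names are hypothetical.

```python
K_HZ_PER_MT2 = 100e3     # second-order sensitivity, 100 kHz/mT^2
BIAS_FIELD_MT = 10.0     # bias field of the clock transition
DELTA_B_MT = 10e-6       # magnetic deviation of 10 nT expressed in mT

def quadratic_shift_hz(delta_b_mt, k_hz_per_mt2=K_HZ_PER_MT2):
    """Frequency shift for a field deviation from the field-insensitive point."""
    return k_hz_per_mt2 * delta_b_mt ** 2

shift_hz = quadratic_shift_hz(DELTA_B_MT)
coherence_days = 1.0 / shift_hz / 86400.0     # crude estimate: T ~ 1/shift
relative_deviation = DELTA_B_MT / BIAS_FIELD_MT

print(f"shift     ~ {shift_hz:.1e} Hz")
print(f"coherence ~ {coherence_days:.2f} days")
print(f"relative  ~ {relative_deviation:.0e}")
```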
The highest magnetic homogeneity inside a coil can be achieved with a solenoid coil or a cylindrical coil. Such structures limit the optical access to the trap and as all electrical signals have to be sent to the trap along the coil axis, it is not clear if that has an effect on the bias field. Therefore, a solenoid coil is an option for future designs but in the following, it will not be considered.
Helmholtz coils consist of two coils with radius R on the same coil axis, and the distance between the coils is R, as depicted in Fig. 21 a. They combine high magnetic field homogeneity with high optical access. The volume with same relative deviations from the field strength in the center is approximated by a sphere with radius r. Fig. 21 b shows a contour plot of a simulation of the magnetic field distribution of the Helmholtz coils in the XZ-plane. The coordinates in this simulation are normalized to the coil radius 34 , and the results along the axes of the coordinate system are plotted in Fig. 21 c. The results show that if one tries to get a spherical volume with a magnetic homogeneity with relative deviations of less than 10⁻⁶, the radius of the sphere will only be 3 % of the coil radius. As an example, the trap described in Section 6 has a diagonal length of about 60 mm. Therefore, Helmholtz coils that produce magnetic homogeneity with relative deviations of less than 10⁻⁶ have to be at least 1 m in radius. To avoid influencing the magnetic homogeneity in the center, the dimensions of the superconducting shield have to be chosen a factor of 2 or 3 bigger than the dimensions of the Helmholtz coils to influence the magnetic […]
In order to decrease the size of the coils while maintaining the spatial homogeneity and high optical access, one can use Maxwell coils as illustrated in Fig. 22 a. The simulations of the magnetic homogeneity along the axes, depicted in Fig. 22 b, show that a sphere with radius of about 9 % of the main radius R of the Maxwell coils yields deviations of the magnetic field of less than 10⁻⁶ with respect to the field in the center. Hence, using Maxwell coils reduces the dimension of the setup by a factor of about 3 compared to Helmholtz coils.
34 The simulations assume that the coils are infinitely narrow.
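The 3 % figure for Helmholtz coils can be cross-checked along the coil axis with the textbook formula for the on-axis field of a current loop. This is only a minimal sketch: it treats the coils as infinitely narrow, looks at the axis only (not the full spherical volume from the simulation), and uses illustrative function names.

```python
def helmholtz_axis_field(z):
    """On-axis field of an ideal Helmholtz pair in arbitrary units.
    z is the distance from the center in units of the coil radius R;
    the two loops sit at z = -0.5 and z = +0.5."""
    return (1.0 + (z - 0.5) ** 2) ** -1.5 + (1.0 + (z + 0.5) ** 2) ** -1.5

def homogeneous_range(tolerance, lo=1e-3, hi=0.5, iterations=60):
    """Largest on-axis distance (in units of R) for which the relative
    deviation from the center field stays below the tolerance (bisection)."""
    b_center = helmholtz_axis_field(0.0)

    def deviation(z):
        return abs(helmholtz_axis_field(z) / b_center - 1.0)

    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if deviation(mid) < tolerance:
            lo = mid    # still homogeneous enough, move outward
        else:
            hi = mid    # deviation too large, move inward
    return lo

r_over_R = homogeneous_range(1e-6)
trap_half_diagonal_m = 0.030                 # half of the 60 mm trap diagonal
coil_radius_m = trap_half_diagonal_m / r_over_R

print(f"r/R ~ {r_over_R:.3f}")
print(f"required coil radius ~ {coil_radius_m:.2f} m")
```

Along the axis the deviation grows with the fourth power of z/R, which is why relaxing the tolerance only slowly enlarges the usable volume.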
So far, the medium in the center of the coils was assumed to be vacuum. But no material is perfectly nonmagnetic. Therefore, placing a trap in the center of the coils will cause additional inhomogeneity. By choosing low-magnetic materials in trap fabrication, one can try to minimize the effect of the trap on the magnetic field inhomogeneity.
A.2. Cryogenic system
The superconducting shield, which was introduced in Section 5.2, has to become superconducting in a zero-field environment. Hence, if one wants to work with a closed-cycle cryostat, like a Pulse-Tube cryocooler, which produces a magnetic field, the cryocooler will have to be placed in a separate room so that the experiment can be shielded against the magnetic field from the cryocooler, as depicted in Figure 23 . The heat transfer can be maintained via copper rods that connect the coldfinger of the cryocooler with the heat shields which contain the experiment. For the stability of the optical alignment inside the superconducting shield, vibration isolation is required [130] . Therefore, the copper rods are suspended with ropes and the heat connection to the coldfinger is maintained via thin copper wires. To avoid vibration transport through the vacuum chamber, bellows on both sides of the magnetically shielding wall divide the vacuum chamber in a part with vibrations and a part without vibrations. Furthermore, the cryocooler should be mounted on a vibration isolation platform to not mechanically couple the different parts of the vacuum chamber via the floor. If the light is transmitted via fibers to the experiment, the outer heat shield can be mounted rigidly to the vacuum chamber. However, for good vibration isolation, the inner shield with the experiment should be suspended with ropes from the vacuum chamber, or the outer shield. As mentioned in Section A.1, the dimensions of the inner heat shield might be several meters in all directions so that one can generate a homogeneous magnetic field for the experiments. Such a heat shield might weigh a ton or more. When suspended with ropes, this heavy mass will reduce the vibrations inside the shield due to its inertia. Furthermore, dimensions the size of a room reduce the surface to volume ratio in the vacuum chamber, which will allow lower vacuum pressures.
If the magnetic shielding with one layer is not sufficient, one can add a second layer of superconducting shielding. This second layer has to be on a separate heat shield, so that one can make sure that the inner magnetic shield enters the superconducting phase first and the outer magnetic shield afterward. Otherwise, magnetic fields which get locally pinned in the magnetic shields during the transition to the superconducting phase will increase magnetic gradients at the position of the ion trap. The disadvantage is that multiple separate heat shields might complicate the cryogenic setup. If the vacuum pressure is too high for storage times of days with a negligible number of collisions, one will have to add a dilution refrigerator to the cryogenic system to freeze out even more background gas.
A.3. Lasers and optical setup
Locked lasers have servo bumps at the edge of the locking bandwidth, which is typically on the order of 1 MHz [131] . This frequency is similar to the frequencies of motion in the trap. Hence, if one drives a sideband transition, the servo bump might be on resonance with the carrier and cause undesired quantum state evolution. To avoid this, one can use frequency-doubled laser systems. As the second harmonic generation is a non-linear process, the servo bumps are less efficiently converted than the main laser line and thus attenuated compared to the main laser line.
To avoid high-frequency amplitude and phase noise in the laser light, one can use clean-up cavities, as they are used in measurements for gravitational wave detection [132]. The remaining (low-frequency) amplitude noise can be compensated via AOMs or EOMs, as is typically done in trapped ion experiments. Furthermore, the clean-up cavities also reduce servo bumps.
The optical gate operations are performed off-resonantly on an S-to-P transition with blue light; for exact wavelengths see Table 1. Blue light can cause bleaching in the fibers, which is a problem for the continuous operation of the quantum computer, as such fibers have to be replaced. But in recent years, fibers which show low loss for ultra-violet light have been developed [133]. For high-fidelity gate operations, both the amplitude and the phase of the light have to be stabilized. Since the light is transmitted into the cryogenic setup with a fiber, the light polarization at the output of the fiber might fluctuate. Figure 24 shows an optical setup in which a polarizer cleans the polarization of the beam before the beam is sent to the experiment. After the polarizer, a beam sampler partially reflects the light into a multimode fiber whose output is used for intensity stabilization, resulting in a stabilization of both the intensity and the polarization.
Optical Setup in a Vacuum Chamber
As the corresponding wavelength of a clock transition in a hyperfine trapped ion qubit [86] is in the range of millimeters, phase stabilization is not required for single qubit operations. However, entangling operations may require phase stabilization during their execution [114, 116]. Narrow-linewidth light in a fiber experiences fiber noise [134], which might broaden the light to a linewidth as high as 10 kHz. As entangling gate operations take about 100 µs, fiber noise has to be canceled for entangling gates. Fiber noise cancelation (FNC) setups modulate the phase or frequency of an AOM to cancel the effect of the fiber [135]. FNC only works efficiently in continuous wave (CW) operation. In the setup of Figure 24, two fibers are used to send light into the cryostat. A fraction of the main light in the optical setup is coupled out and used for FNC of the second single-mode fiber (FNC2). Inside the cryostat, the output of the second fiber is reflected back into the main fiber, and this CW light can be used for FNC of the main fiber (FNC1). The light used for the gate operations can be pulsed and is sent via the FNC1 setup into the cryostat. The remaining phase drifts stem from thermal drifts and from acoustics coupling into the optical setup. Of these drifts, optical path length fluctuations between the output of the fiber in the cryostat and the ions in the trap cannot be stabilized actively. However, the cryostat is vibration isolated and temperature stabilized, so acoustic noise inside it can be neglected. The remaining drifts are thermal, and thus slow. In the firmware, one has to choose gate operations which can cope with these slow drifts, for example as demonstrated in reference [116]. It is helpful to place the optical setup for pulsing the light and the FNCs inside a small vacuum chamber and thermally stabilize it as well. That way, one can minimize the phase drifts during the experiment.
In order to save space inside the inner heat shield, one should miniaturize the optical setup shown on the right-hand side of Figure 24, consisting of three fibers and a polarizer, as such a setup will be required for each beam used for QIP.
The optical setup described in Section 5.5.2 requires single ion addressing capability for all ions in the QIP zone of the QALU. A possible alignment strategy is illustrated in Figure 25, where each ion that should be illuminated has its own optical setup. The light from a fiber is imaged onto one mirror of a mirror array, from where the light for the different ions is sent towards the ion string in the Paul trap. For this purpose, an optical setup images the mirror array onto the ion string. The mirror array can either be an array of fixed mirrors or a microelectromechanical systems (MEMS) array.
A.4. Detectors
It is advantageous to include detectors for photon counting in the cryogenic setup. There, transition edge sensors (TESs) may be the prime choice, as they have a quantum efficiency of up to 95 % [136] . These sensors exploit the strongly temperature-dependent resistance at a superconducting phase transition. When a photon is absorbed in the superconductor, the material locally enters the normally conducting phase which can be detected electronically.
During operation, an electric current flows through the TES so that it is close to the phase transition. This current is temperature dependent and changes when a photon is detected. As the presented quantum von Neumann architecture (for trapped ions) requires long coherence times and thus high spatial homogeneity and temporal stability of the magnetic field, TESs should not be operated on or close to the trap in this architecture. Ideally, the TESs are located near the superconducting shield or even outside of it so that the magnetic field generated by their operation does not influence the coherence time in the system. For such detection, one can use a detection lens close to the trap and a second lens to focus the light onto the detector. The idea is the same as for addressing, depicted in Figure 25, except that the direction of the light is reversed. Another idea is to integrate a fiber into the trap for light collection. The output of the fiber can then be directed onto the detector [105].
A.5. Trap Design
A major design challenge for large planar traps with thousands of segment pairs is shunting the RF on the DC segments. The traces on the trap might be too long for efficient shunting near the trap with shunt capacitors or on the trap with trench capacitors. Furthermore, trench capacitors on the trap would drastically increase the size of the trap. Hence, the RF will have to be shunted near the trapping zone of the trap.
The capacitances occurring in a planar Paul trap are illustrated in the cutaway view of Figure 26 a. A DC segment has the capacitances $C_\mathrm{shunt}$ and $C'_\mathrm{shunt}$ to ground, where $C_\mathrm{shunt}$ is the capacitance through vacuum and $C'_\mathrm{shunt}$ is the capacitance through the trap material. They are related by $C'_\mathrm{shunt} = \epsilon_r C_\mathrm{shunt}$, with $\epsilon_r$ the relative dielectric constant of the trap material, e.g. $\epsilon_r = 3.8$ for SiO$_2$. Likewise, a DC segment has the capacitances $C_\mathrm{seg}$ and $C'_\mathrm{seg}$ to the next RF rail, where $C_\mathrm{seg}$ is the capacitance through vacuum and $C'_\mathrm{seg}$ is the capacitance through the trap material. The ratio $R = (C_\mathrm{seg} + C'_\mathrm{seg}) / (C_\mathrm{shunt} + C'_\mathrm{shunt})$ defines the RF voltage on the DC segment without an additional shunting network and is typically around 1. For efficient shunting, this ratio should be 1/100 or 1/1000.
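To illustrate why a small ratio is needed, the DC segment can be treated as a lumped capacitive voltage divider between the RF rail and ground. The following minimal sketch assumes this simple model (no trace inductance, no additional shunting network); the function name `rf_fraction_on_segment` is hypothetical and introduced only for illustration.

```python
def rf_fraction_on_segment(ratio_r: float) -> float:
    """Fraction of the RF amplitude that appears on a DC segment.

    Assumption: lumped capacitive divider. The segment couples to the RF
    rail via the total segment capacitance and to ground via the total
    shunt capacitance, so V_segment / V_RF = R / (1 + R), where R is the
    ratio of the two totals as defined in the text.
    """
    if ratio_r < 0:
        raise ValueError("capacitance ratio must be non-negative")
    return ratio_r / (1.0 + ratio_r)


if __name__ == "__main__":
    # R ~ 1 is the typical unshunted case; 1/100 and 1/1000 are the targets.
    for r in (1.0, 1e-2, 1e-3):
        print(f"R = {r:g}: V_seg / V_RF = {rf_fraction_on_segment(r):.4f}")
```

In this model, a ratio around 1 leaves about half of the RF amplitude on the segment, whereas ratios of 1/100 or 1/1000 reduce it to roughly the same fraction of the drive voltage.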
In order to decrease the RF voltage on the DC segments, one can decrease the capacitance to the RF rail through the trap material and increase the capacitance to ground by placing a ground electrode underneath each DC segment, as depicted in Figure 26 b. In typical planar traps, the gaps between segments are between 10 and 30 µm. The skin depth in gold or aluminum for typical trap drive frequencies between 10 and 50 MHz is about 1 µm at cryogenic temperatures. Hence, the metal layers on the trap only need to be 2-5 µm thick. These dimensions cause the capacitance $C'_\mathrm{seg}$ to the RF rail through the trap material to vanish, which reduces the total capacitance of a DC segment to the RF rail by a factor $\epsilon_r + 1$, with $\epsilon_r$ the relative dielectric constant of the trap material. Instead of $C'_\mathrm{seg}$, the capacitance $C_\mathrm{GND}$ to ground appears, where $C_\mathrm{GND}$ is roughly equal to $C'_\mathrm{seg}$ of the old geometry. Therefore, the capacitance for the RF rail does not change, and the RF resonator will experience the same capacitive load. In this new geometry, the capacitance $C'_\mathrm{shunt}$ to ground through the trap material is dominated by the parallel plate capacitor underneath the segment, which results in a higher capacitance than in the previous geometry. The isolation layer of this capacitor can be thin: SiO$_2$ has a dielectric strength of about 30 V/µm, so a thickness of 1 or 2 µm should be enough to safely operate the trap with DC voltages of about ±10 V. As an example, a parallel plate capacitor with an area of 100 µm × 100 µm and a 1 µm thick SiO$_2$ isolation layer results in a capacitance of 0.33 pF. The capacitance $C_\mathrm{seg}$ of such a DC segment to the RF rail through vacuum will be orders of magnitude smaller. In order to increase the capacitance $C'_\mathrm{shunt}$ further, one can extend the ground plane underneath the DC segment all the way between the bond pad and the trapping zone. To sum up, in the new geometry, $C'_\mathrm{seg}$ can be neglected and the higher capacitance $C'_\mathrm{shunt}$ reduces the ratio $R$ without changing the vacuum capacitances $C_\mathrm{seg}$ and $C_\mathrm{shunt}$, which are important for the behavior of the trap.
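The numbers in this example can be checked with the ideal parallel plate formula $C = \epsilon_0 \epsilon_r A / d$. The sketch below neglects fringe fields; the helper names are hypothetical, and the values for area, thickness, dielectric constant, voltage, and dielectric strength are those quoted in the text.

```python
EPSILON_0 = 8.8541878128e-12  # vacuum permittivity in F/m


def parallel_plate_capacitance(area_m2: float, thickness_m: float,
                               eps_r: float) -> float:
    """Ideal parallel plate capacitance in farad (fringe fields neglected)."""
    return EPSILON_0 * eps_r * area_m2 / thickness_m


def field_v_per_um(voltage_v: float, thickness_m: float) -> float:
    """Electric field across the isolation layer in V/um."""
    return abs(voltage_v) / (thickness_m * 1e6)


if __name__ == "__main__":
    area = 100e-6 * 100e-6   # 100 um x 100 um segment
    thickness = 1e-6         # 1 um SiO2 isolation layer
    eps_r_sio2 = 3.8         # relative dielectric constant from the text
    c_plate = parallel_plate_capacitance(area, thickness, eps_r_sio2)
    print(f"C = {c_plate * 1e12:.3f} pF")

    # Compare the field at 10 V with the quoted dielectric strength.
    dielectric_strength = 30.0  # V/um, value quoted in the text
    field = field_v_per_um(10.0, thickness)
    print(f"field = {field:.1f} V/um, limit = {dielectric_strength:.1f} V/um")
```

The result is roughly a third of a picofarad, in line with the value quoted above, and the field at 10 V across a 1 µm layer stays a factor of three below the quoted dielectric strength.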
Another aspect that one has to take into account with planar traps which contain hundreds of junctions is that the RF potentials on both rails must always be in phase with each other. Axial micromotion arises due to the wiring when the electrical path length in one rail is longer than in the other rail by a fraction of the wavelength of the RF trap drive, typically 3 to 5°. To circumvent this problem, one can short all connecting RF rails of a junction underneath it. But this comes at the expense of a higher capacitance between the RF rails and ground (or DC segments).
A.6. DAC design
With multiplexing architectures as described in Section 5.4, large-scale planar traps with thousands of segment pairs may still require hundreds of controllable voltages. An architecture in which a single field-programmable gate array (FPGA) controls all DACs will limit the number of controllable DACs to the number of pins on the FPGA, and serial communication will be required to reduce the number of interconnections per DAC.
To circumvent the possibly slow serial communication, one can use a single FPGA with a RAM chip for one or two DACs only. From the central experiment control, the FPGA receives a command specifying which segment to control and which ion movement to execute, for example, "control segment number 491 and move an ion string from left to right". Standard segments will have only a few commands, like "move from left to right" and "move from right to left". Segments at junctions may require a couple more commands, and segments in the QALU will have the most complicated instruction set, as it has to include operations like splitting and recombining of ion chains. All these different voltage ramps can be stored in the RAM.
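The command handling described above can be sketched as a lookup table that maps a command name to a stored voltage ramp. The sketch below is only illustrative: the class `SegmentController`, the command strings, and the voltage values are hypothetical placeholders, and a real unit would implement this logic in FPGA firmware rather than in Python.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SegmentController:
    """Hypothetical model of one FPGA + RAM unit driving one segment pair."""
    segment_id: int
    ramps: Dict[str, List[float]] = field(default_factory=dict)

    def store_ramp(self, command: str, voltages: List[float]) -> None:
        """Store a voltage ramp (in volts) in the 'RAM' under a command name."""
        self.ramps[command] = list(voltages)

    def execute(self, segment_id: int, command: str) -> List[float]:
        """Return the ramp to play back, or an empty list if the command
        is addressed to a different segment."""
        if segment_id != self.segment_id:
            return []
        if command not in self.ramps:
            raise KeyError(f"segment {self.segment_id}: unknown command {command!r}")
        return self.ramps[command]


if __name__ == "__main__":
    ctrl = SegmentController(segment_id=491)
    ramp = [-1.0, -0.5, 0.0, 0.5, 1.0]  # placeholder values, not a real ramp
    ctrl.store_ramp("move_left_to_right", ramp)
    ctrl.store_ramp("move_right_to_left", ramp[::-1])
    print(ctrl.execute(491, "move_left_to_right"))
```

A standard segment would hold only the two transport ramps shown here, while a unit in the QALU would store additional entries, for instance for splitting and recombining ion chains.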
Such a DAC-architecture allows incorporation of fast 16-bit DACs with more than 100 MSamples/s.
The RAM only needs to contain one DAC output value every 100 ns or 1 µs, and the FPGA can interpolate the values in between to minimize quantization noise. The quantization noise of these DACs will have a frequency of more than 100 MHz and can easily be filtered so that it cannot perturb the QIP. Ideally, one wants to integrate a system consisting of the FPGA, RAM, and DAC on one single chip, like a special-purpose direct digital synthesis (DDS) chip. But even on a PCB, the DAC-unit for one segment pair can be miniaturized. To minimize the capacitive load on a DAC-unit, the DAC-units should be located close to the trap, which would imply that they need to be cryo-compatible. In general, the DAC-units can be placed outside the vacuum chamber as well. However, the multiplexing logic circuits controlling the analog switches will have to be located in the cryogenic environment to reduce the number of interconnects in the cryostat. FPGAs, which can be used for the multiplexing logic circuits, and analog switches have been operated successfully in cryogenic environments [138, 139].
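The interpolation step can be illustrated with a short sketch. It assumes linear interpolation, a RAM value every 100 ns, and a DAC running at 100 MSamples/s, i.e. 10 output samples per RAM interval; the ±10 V output range and the function names are assumptions made for this example, and the text does not prescribe a particular interpolation scheme.

```python
from typing import List


def interpolate_ramp(ram_values: List[float],
                     samples_per_interval: int) -> List[float]:
    """Linearly interpolate between consecutive RAM values.

    Each interval between two stored values is expanded into
    `samples_per_interval` output samples; the last stored value is
    appended once at the end.
    """
    if samples_per_interval < 1:
        raise ValueError("samples_per_interval must be at least 1")
    if not ram_values:
        return []
    out: List[float] = []
    for start, stop in zip(ram_values, ram_values[1:]):
        for k in range(samples_per_interval):
            out.append(start + (stop - start) * k / samples_per_interval)
    out.append(ram_values[-1])
    return out


def to_dac_code(voltage: float, v_min: float = -10.0, v_max: float = 10.0,
                bits: int = 16) -> int:
    """Map a voltage to an unsigned DAC code (assumed +-10 V range, 16 bit)."""
    full_scale = (1 << bits) - 1
    clipped = min(max(voltage, v_min), v_max)
    return round((clipped - v_min) / (v_max - v_min) * full_scale)


if __name__ == "__main__":
    stored = [0.0, 1.0, 0.5]                 # one value every 100 ns (placeholder)
    samples = interpolate_ramp(stored, 10)   # 100 MSamples/s -> 10 ns steps
    codes = [to_dac_code(v) for v in samples]
    print(len(samples), codes[:3])
```

With a stored value every 1 µs instead, the same function would be called with 100 samples per interval, which reduces the RAM size per ramp by a further factor of ten.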
