A wealth of quantum algorithms developed during the past decades brought about the concept of quantum supremacy. The state-of-the-art noisy intermediate-scale quantum (NISQ) devices, although imperfect, enable certain computational tasks that are demonstrably beyond the capabilities of modern classical supercomputers. However, present quantum computations are restricted to probing the quantum processor power, whereas implementation of specific full-scale quantum algorithms remains a challenge. Here we realize hybrid quantum algorithm for solving a linear system of equations with exponential speedup that utilizes quantum phase estimation, one of the exemplary core protocols for quantum computing. Our experiment carried out on superconducting IBMQ devices reveals the main shortcomings of the present quantum processors, which must be surpassed in order to boost quantum data processing via phase estimation. The developed algorithm demonstrates quantum supremacy and holds high promise to meet practically relevant challenges.
I. INTRODUCTION
After Shor invented a quantum algorithm for integer factorization [1] promising to solve the problem exponentially faster than any classical factoring algorithm, a wealth of quantum solutions emerged targeting classically intractable problems. This sparked the idea of an overwhelming advantage of quantum computing coined into the concept of quantum supremacy [2] . The latter was expected to prosper, should quantum solutions be implemented on noiseless quantum processor units (QPUs). Yet the progress in quantum computing is impeded by the inherent errors of the logical gates, by the imperfections of the readout procedure, and by the decoherence affecting the qubits. Despite these obstacles, the state-of-the-art intermediate-scale quantum devices [3] , although suffering the noise-related problems, enable computations lying far beyond the capabilities of modern classical supercomputers.
Recently, the Google collaboration demonstrated quantum supremacy by solving a problem that was purposely devised to profoundly demonstrate the advantage over classical devices [4] . The low-noise superconducting 53-qubit QPU has been programmed to execute random instructions defined by a random quantum circuit containing single-and two-qubit gates. Such a random unitary transformation was designed to mimic all possible quantum computations. However, the implementation of a specific transformation that can be adopted into a framework of the practical quantum algorithms remains a challenge.
One of the exemplary core protocols for quantum computing is quantum phase estimation (QPE) [5] , which has been extensively studied and applied as a subroutine in a framework of the variety of quantum algorithms such as the Shor algorithm, quantum counting, and the calculation of the eigenvalues of unitary matrices. By virtue of the QPE, a quantum computer gains an exponential speedup in the context of the matrix inversion problem [6] , which is the main computational challenge in many areas, including artificial intelligence [7] , partial differential equations [8] , and data analysis [9] .
Here, in order to exploit the QPE protocol for fast matrix inversion, we implement the quantum hybrid Harrow-Hassidim-Lloyd algorithm [10] (H-HHL) for solving the linear system of equations. One of the main advantages of the HHL algorithm is the exponential compression of the data. Therefore a quantum computer could operate on only several dozens of qubits to solve a large system of linear equations, a task for which a classical computer would run out of memory. Our experiment carried out on the superconducting IBMQ devices [11] , enables us to spot the shortcomings of modern QPUs and outline the way for surpassing them in order to boost quantum data processing [12] . We show that our implementation of the H-HHL algorithm offers the route to benchmarking quantum supremacy and to resolving the practically relevant challenges.
II. PROBLEM DESCRIPTION
To begin with, we define a problem for quantum devices to solve and discuss how it corresponds to the real computational tasks. As an exemplary task, we choose an efficient solution of the system of linear equations
... c ... Figure 1 . a. Quantum scheme of the standart HHL algorithm with 3 phase qubits that involves quantum phase estimation protocol (QPE) and ancilla quantum encoding step (AQE) in order to solve Eq. (1) with a unitary matrixÛ . While QPE part exploits controlled unitary operations CÛ , CÛ 2 , CÛ 4 and Quantum Fourier Transform in order to process eigenvalues and eigenvectors ofÛ , AQE algorithm assists the matrix inversion. b. Quantum scheme of the improved H-HHL algorithm that provides the solution for the same matrix as the circuit from a. The algorithm involves only 2 phase registers and the QPE part is significantly reduced.
whereÛ and b are given matrix and vector and (logÛ ) is the spectral radius of the logÛ . One notices the specific structure of this system, which differs from the usual A x = b. Let us now discuss the problem statement in more detail.
We solve Eq. (1) by using the quantum algorithm that employs the matrix exponentiation as a preparation step [6] . In general, such an exponentiation procedure is a major challenge. Let us assume that we know the physical operatorÛ , the exponent of the matrix logÛ , which we aim to invert. Next, since the logarithmic function is ambiguous, we fix the resulting matrix spectrum such that the largest absolute value of eigenvalues were less than 1.
For a quantum circuit construction, we do not decompose the random matrix into the single-qubit and CNOT gates, but, instead, we consider the matrixÛ comprising quantum gates. Furthermore, we choose b to make the corresponding state |b = |0 a computational basis. The entire family of gate-based matrices is expressed as a tensor product of M local operatorsÛ i that act only on the i-dimension subset of the computational circuit:
For illustrative purposes, let us consider three types ofÛ that can be naturally implemented in quantum circuits.
TP 1 :Û TP1 is the Tensor Product of single-qubit gateŝ U s resulting in dim(Û i ) = dim(Û s ) = 2: this operator does not entangle qubits.
TP 2 :Û TP2 is the Tensor Product of two-qubit operators, dim(Û i ) = 2 2 that leads to the emergence of two-qubit clusters within which qubits are entangled.
NTP:Û NTP is Not a Tensor Product of single-or twoqubit gates that leads to dim(Û i ) = dim(Û ): this operator entangles all qubits.
The QPE protocol, which is a vital subroutine of our algorithm, involves controlled unitary operations and, therefore, is the most complicated part. Indeed, the realization of an arbitrary control single-qubit gate inÛ requires at least two CNOTs [5] that leads to a highly complex quantum circuit, since n-qubitÛ TP1 ,Û TP2 and U NTP comprises n, 2n − 1 and 2n − 2 single-qubit gates respectively. Thus, we consider the continuous subset of single-qubit gatesÛ s in a way that the control-Û s is implemented via a single CNOT resulting in dramatic simplification of the algorithm circuit. For more details see Supplementary Information (SI).
We introduce a correcting single-qubit gateÛ c in U TP1 ,Û TP2 ,Û NTP matrices that on one hand allows us to control the matrix spectrum for each circuit type, and, on the other hand, leads toÛ 2s
c ⊗Î, where s ∈ N, that greatly simplifies the QPE as well (see SI).
III. HYBRID QUANTUM SOLUTION
In this section we consider the hybrid quantum HHL algorithm and its implementation on the modern quantum processors and discuss the main shortcomings associated with real superconducting QPUs.
A. Quantum algorithm for solving linear systems of equations
Let us briefly discuss the structure of the HHL algorithm, more details can be found in [6] . The HHL algorithm that inverts a N × N matrix exploits three groups of qubits: the vector register that consists of [log N ] qubits, the p-qubit phase register and a single qubit ancilla register -n = [log N ] + p + 1 qubits in total. The whole computation, in turn, consists of the QPE algorithm, the ancilla quantum encoding (AQE) part, in which the single ancillary qubit conditionally operates on the state of the phase registers, and the inverse QPE. The phase estimation protocol exploits controlled unitary operations CÛ , CÛ 2 , . . . , CÛ 2 p over phase and vector qubits and the Quantum Fourier Transform afterward in order to process eigenvalues and eigenvectors of the matrix. The quantum circuit of the HHL algorithm that utilizes 3 phase registers is depicted on Fig. 1a .
In order to implement the HHL algorithm we use the spectral decomposition of a N × N matrix aŝ
and encode the vector b into qubit's amplitudes |b = N i=0 β i |i [6] . Once we run the algorithm, the solution is encoded into the quantum state
where N x is the normalization coefficient. By the projective measurement, we obtain the expectation value x|M |x for some operatorM within exponentially shorter time than that allowed by classical algorithms.
B. Insights from classical approaches
The HHL algorithm is improved by using some prior knowledge about the system of equations, this classical information allows us to refine quantum algorithm in order to downsize the noise-sensitive quantum part. Leveraging this technique, we implement the hybrid HHL algorithm (H-HHL) [10] that takes advantage of the fact that some bits of (logÛ ) −1 eigenvalues could be the same for any eigenvector. Since the QPE part encodes eigenvalues into phase register qubits, such a fact allows us to apply an iterative quantum phase estimation algorithm [13] in order to determine identical bits of eigenvalues and, as a consequence, qubits corresponding to those bits. Therefore, one treats such phase qubits as classical bits and excludes them from a computational scheme yielding a reduction in the width and in the depth of a circuit.
Since we fully control the matrix spectrum via the correcting gateÛ c , we immediately eliminate unnecessary phase qubits without conducting the iterative phase estimation procedure resulting in sufficiently low p. As an illustrative example, we adjust the correcting gate in such a way that p = 3 phase qubits are required to encode the whole spectrum of N × NÛ TP1 ,Û TP2 orÛ NTP matrices. The hybrid part allows us to utilize p = 2 instead of 3 phase qubits in order to solve Eq. (1) resulting in n = [log N ] + 3 qubits in total. The reduced scheme for the H-HHL algorithm is shown in Fig. 1b .
C. Circuit complexity
In order to estimate the potential size of the quantum circuit, we employ the best existing superconducting quantum processors, such as IBMQ Melbourne (15 qubits), Johannesburg (20 qubits), Rochester (53 qubits) and Google Sycamore (53 qubits) [4] . We use the QPUs coupling map and evaluate the circuit depth -a leading characteristic of the circuit complexity including the amount of a single-and two-qubit gates.
Since the circuit of the quantum algorithm is usually elaborated for the all-to-all connectivity, we can not instantly run it on a real QPU, where the connectivity is limited. Firstly, one needs to transform the quantum circuit to fit the device topology -we will refer to this operation as transpiling. In order to achieve higher final for large-size IBMQ Rochester (blue) and Google Sycamore (red) processors: both 53 qubits. While the 53qubit circuit implemented on the all-to-all coupling map consists of ≤ 10 3 two-qubit gates (solid, dashed and dotted black lines), the circuit implemented on the IBMQ Rochester device requires ≥ 10 3 entangling gates for TP1 (circles), ∼ 3 · 10 3 for TP2 (triangles) and ∼ 6 · 10 3 for NTP matrices (crosses). Google Sycamore's topology allows for a 26% reduction in the number of CNOTs compared to the Rochester QPU, providing a significant improvement in fidelity.
fidelity one needs to solve the transpiling problem of finding the optimal transformation that maximizes the final fidelity. Unfortunately, finding the best transformation is an NP-hard problem, therefore we use the brute force search over dozens of transpile options and pick the best circuit decomposition. The dependence of the circuit depth on the circuit width (number of qubits) used in the H-HHL algorithm is indicated in Fig. 2a for medium-size IBMQ Melbourne and Johannesburg QPUs, and on Fig. 2b for the largest modern Sycamore and Rochester QPUs for all circuit types. In order to obtain the circuit depth, we averaged over 140 transpiled random quantum circuits (RQCs) that realizeÛ for each matrix type and size.
It is clear that the poor connectivity has a huge im-pact on the circuit depth as well as on the expected final.
Since the 53-qubit circuit implemented on the all-to-all coupling map consists of ∼ [10 2 , 5 · 10 2 , 10 3 ] two-qubit gates for TP 1 , TP 2 and NTP type respectively, the same circuit implemented on the IBMQ Rochester device requires ∼ [10 3 , 3·10 3 , 6·10 3 ] CNOTs. Google Sycamore's topology allows for a 26% reduction in CNOTs compared to the Rochester QPU resulting in a significant improvement in fidelity.
For illustrative purposes, we consider a quantum volume V Q of QPUs, a single-number metric presented by IBM that can be experimentally obtained using random quantum circuits [14] . A quantum volume quantifies the largest RQCs of equal width and depth that the computer successfully implements. We measure the quantum volume of the IBM processors and estimate such a metric for the Sycamore machine. We find that all 5qubit IBMQ devices, 15-qubit Melbourne, and 53-qubit Rochester processors, have V Q = 8, the 20-qubit Johannesburg has V Q = 16 and the Google Sycamore processor has V Q = 32. Thus, it is not a surprise that much lesser amount of two-qubit interactions is required for the Sycamore than for the Rochester.
IV. CLASSICAL METHODS FOR SOLVING THE PROBLEM
In this section we discuss the classical approaches of solving the Eq. (1) withÛ TP1 ,Û TP2 ,Û NTP matrices and analyze the computational complexity levels of such methods. We consider the direct solution with an arbitrary matrix and examine the complexity of the classical circuit simulations, that involve Schrödinger, Schrödinger-Feynman, and tensor network approaches.
A. Direct solution
Let us consider existing numerical methods for solving Eq. (1) and estimate the scaling of the modern classical algorithms. Since we construct 2 n × 2 n matrixÛ that defines the system of linear equations via a low-depth quantum circuit, the sparsity of the matrix increase exponentially with a circuit width. However, the computational cost of solving Eq. (1), which consists of the logarithm calculation and inversion procedure, is no lesser than O(2 2n ) regardless of matrix sparsity.
In order to compute logÛ one can utilize the matrix diagonalization [15] or the inverse scaling and squaring method [16] . Such methods require at least an exponential number of operations or addresses to matrix elements. If the sparse logÛ is obtained one may use iterative algorithms to solve the linear system afterward, e.g. by applying Kaczmars method [17] -however, such algorithms may converge in sub-exponential time.
Unfortunately, for classical algorithms, fixed spectrum and matrix decomposition Eq. (2) are unable to sim-plify the direct solution. For more detailed analyses see SI. In summary, the HHL algorithm possesses exponentially lesser computational complexity than classical algorithms in the context of the sparse matrices inversion and, thus, is expected to be highly efficient in solving Eq. (1).
B. Classical methods for quantum circuit simulation
Here, we discuss the classical simulation of the quantum circuits with large width and depth, and we estimate the computational cost. There are three state-of-the-art methods for quantum circuit simulation: Schrödinger, Schrödinger-Feynman, and tensor networks approach [4] . Let us discuss each method in more detail.
1. Schrödinger algorithm is essentially an application of the unitary transformation to the initial state vector from 2 n -dimensional space in the context of matrix multiplication; as a result, we obtain the final 2 n -dimensional quantum state vector. Unitary transformation, in turn, is defined by the quantum circuit of the H-HHL algorithm. This type of simulation, on the one hand, is straightforward and provides a great speed in processing low-width circuits in comparison to the other simulations. On the other hand, the Schrödinger algorithm requires a significant amount of RAM when processing many-qubits circuits. Such a requirement is difficult to meet since the memory of modern supercomputers is limited to ∼3 PB or, in terms of the quantum state size, to ∼47 qubits. As a consequence, high-width quantum circuits cannot be simulated by the Schrödinger algorithm. The runtime of the n-qubit circuit simulation is O(D · 2 n ), where D is the circuit depth.
2. Schrödinger-Feynman algorithm is a more sophisticated approach that may solve the memory issues inherent in a Schrödinger algorithm. Schrödinger-Feynman simulation, at first, cuts the circuit into two parts significantly reducing the width and, afterward, applies the Schrödinger algorithm to each part in order to estimate the final quantum state. Since the two-qubit gate is the only natural entangling operation in a circuit, we obtain a set of such gates that are affected by the cut. In order to perform the circuit split one needs to use the Schmidt decomposition of the two-qubit gates [4] . The obvious advantage of such an approach in comparison to the Schrödinger algorithm is the considerable reduction in RAM. However, by processing a large circuit with Schrödinger-Feynman algorithm we find that the number of entangling gates affected by the cut may be enormous, e.g. if there are k CNOTs on the cut we have to perform 2 k Schrödinger simulations since each CNOT gate has a Schmidt rank of 2 [4] . Thus, the simulation runtime is O D · 2 n+k , where n is the width of a circuit half.
3. Tensor networks approach exploits only local interactions between n qubits in order to construct n low-rank tensors and, as a consequence, does not process the full state vector. Such an approach is highly efficient in processing high-width low-depth circuits providing the significant reduction of the required RAM as well as of the algorithm runtime. The latter scales as O T · e tw{G} [18] , where T is the total number of gates and tw{G} is the treewidth of a tensor graph G corresponding to the quantum circuit. According to [18] , the treewidth is determined by the maximum number of twoqubit gates that affect one qubit. The treewidth of TP 1 , TP 2 , NTP circuit graphs scales as O(n),thus the tensor networks algorithm runtime is O(n · e n ) that is worse than a Schrödinger or Schrödinger-Feynman simulation.
All types of a quantum circuit emulation, which solves the Eq. (1) with N × N matrix, is considerably faster than the direct solution: O(N 2 ) vs O(N log N ). Thus at the moment, the Schrödinger-Feynman simulation is essentially an efficient way to solve the posed problem with 100% fidelity. However, quantum computers possessing a sufficient quantum volume provide an exponential speedup over classical solution resulting in O(log N ) scaling. We expect that future improvements in classical algorithms and hardware will provide a considerable reduction in runtime and computational resources, however, persistent enhancement of quantum hardware allows QPU to consistently outperform classical CPU.
V. IMPLEMENTATION ON THE REAL QPU
Here, we discuss the H-HHL algorithm performance and its implementation on a real QPUs provided by IBMQ [11] . Let us consider the case of a small and large amount of qubits separately: while low-width circuits verify the algorithm performance and can be characterized by the full state tomography F T OM [5] , large circuits show the algorithm scaling and can be characterized by the cross entropy benchmarking fidelity F XEB [4] .
A. Low number of qubits (≤ 7)
Let us consider a low number of qubits case where n ∈ [4, 7] . Under these conditions we employ the full state tomography analysis, which requires 3 n−3 experiments with one circuit; each experiment consists of 8912 runs. We investigate all matrix types TP 1 , TP 2 , NTP by analyzing 140 transpiled RQCs for each type for each matrix size. At first, we perform the full state tomography on the IBMQ simulator and show that the fidelity Fig. 3 (on the right) for different circuit width. Since the quantum volumes of IBMQ Burlington, Yorktown and Melbourne processors are equal, the fidelity behavior is similar. However, V Q of the Johannesburg QPU is twice as large, resulting in a higher fidelity level.
B. Large number of qubits (> 7)
Here, we consider large circuits implementation on IBMQ Melbourne and Johannesburg QPUs, but, at first, let us elaborate on a suitable performance metric. Since the full state tomography requires exp[O(n)] experiments for each circuit, it is not possible to obtain tomographical fidelity in a reasonable time. Thus, we inherit the cross entropy benchmarking approach from [4] that allows us to estimate the algorithm fidelity with only one experiment and single Z-projective measurement instead of 3 n−3 experiments. Thus, we characterize the final state with the 2 n -dimensional probability distribution of measured outcomes.
The obtained multi-qubit state consists of a single ancillary, 2 phase and n vector registers, and the algorithm ends successfully if and only if the ancillary qubit is in |1 state -such a beneficial fact allows us to filter measured outcomes with respect to the ancillary qubit. Let us consider M different RQCs and denote the 2 n−1 -dimensional probability distribution of filtered and normalized outcomes that corresponds to the noiseless implementation of jth circuit as p t j , measured distribution as p e j and chaotic probability distribution as p c = {1/2 n−1 , ..., 1/2 n−1 }. Thus we can define the fidelity as follows:
where (· · · , · · · ) is a scalar product. It is clear that the introduced metric shows averaged proximity of the obtained vector projection to the ideal solution rather than to a chaotic state -such a definition matches the cross entropy fidelity given in [4] . The fidelity F XEB of the H-HHL algorithm as a function of the circuit width is presented in Fig. 4 on the left for IBMQ Melbourne (experimental results) and in Fig. 4 on the right for IBMQ Johannesburg QPU (simulation results). While colored markers corresponds to different matrix types TP 1 (blue circles), TP 2 (orange diamonds), NTP (green triangles), the solid line indicates the digital error model (DEM). The digital error model [4] takes into account gates and readout errors and characterize circuit performance by a set of localized Pauli errors. Each point is the fidelity averaging over 140 RQCs that were transpiled 20 times in order to get the minimal depth and each RQC experiment consists of 10 5 runs.
It is clear from the simulations that the digital error model is a good approximation of fidelity behavior. For the real experiment on Melbourne QPU, we find that the Total number of qubits Total number of qubits performance of TP 1 circuits can be described by DEM, however, due to the device instability, DEM fails in predictions of the fidelity level for other matrix types. We expect that results obtained from the advanced QPU can be interpreted according to the digital error model regardless of the circuit type.
VI. QUANTUM SUPREMACY
Let us compare existing supercomputers and 50+ qubit quantum machines in the context of Eq. (1) solution. At first, we consider the circuit simulations and estimate the runtime. Then we evaluate the fidelity level of the H-HHL algorithm processing TP 1 , TP 2 and NTP matrices -using such a knowledge we compare equal-fidelity simulation on classical supercomputers with QPU.
A. Cost of the classical solution of the problem
Here, we analyze the computational cost of the Eq. (1) solution with TP 1 , TP 2 and NTP matrices on a classical powerful supercomputers.
Let us consider the Schrödinger-Feynman algorithm (SFA) performance -as was shown in Section IV, the SFA is the most efficient simulation of high-deep quantum circuits and, at the same time, the most efficient way to solve the Eq. (1).
Primarily, in order to estimate the SFA runtime, we perform Schrödinger simulations, which are essential building blocks of SFA. It was shown in Section IV that the runtime of a Schrödinger algorithm is T SA = C ·n·2 n . In order to find the scaling constant C we conduct the simulation on POWER8 processor with 160 cores and 512 Gb of RAM [19] -we find that C TP1 = 5 · 10 −9 s, C TP2 ≈ 14 · 10 −9 s and C NTP ≈ 17 · 10 −9 s, which we assume are scaled linearly in number of cores. Since mod-ern supercomputers have roughly 100K cores, we expect the constants to be C TP1 ≈ 10 −12 s, C TP2 ≈ 2.8 · 10 −12 s and C NTP ≈ 3.4 · 10 −12 s in the best scenario.
For memory estimates, while the state-of-the-art supercomputers posses 3 PB of RAM, we suppose that one can store a 2 47 -dimensional vector using 8 bytes to encode single complex number. Hence, SFA allows us to split the quantum circuit in the ratio (n − n) : n, where n < n − n ≤ 47. The execution time of the Schrödinger-Feynman method is T SF A = C (n − n) · 2 n− n · 2 k , where k is the number of CNOTs affected by the cut (calculation of k see in SI) and the scaling constant C is the same as in the Schrödinger algorithm, since we are forced to use almost all RAM and it is not possible to parallelize the algorithm.
B. Quantum supremacy characterization
Here, we discuss the possibility to show quantum supremacy using the H-HHL algorithm. We estimate the fidelity behavior of the 50+ qubit quantum devices by performing numerical simulations -we show that the equal-fidelity classical simulations require considerably more time than a quantum solution implemented on the state-of-the-art QPUs.
Let us estimate the H-HHL algorithm runtime on a superconducting QPU. Most of the hybrid quantum algorithm runtime is spent by the classical part, mainly, the optimal transpile search. We find that one transpile of a 53-qubit NTP circuit takes ∼ 141 seconds per core on the Intel i9 Core CPU, thus, 3K cores can optimally transpile 140 RQCs in a few minutes. By using the gate pulse duration we find that the sampling of the n-qubit circuit million times on a Sycamore device reported in [4] would take 2 + 1.5n seconds, thus net QPU time, which samples the 53 qubit circuit, is less than 2 minutes.
According to the last paragraph, we expect that the Table I . Estimation on the large QPUs performance processing TP1, TP2 and NTP circuits and their competitiveness with 100K cores supercomputer. The total FXEB consists of the fidelity of the readout Fr, single-qubit and two-qubit gates, F1QG and F2QG respectively. All values are presented for real IBMQ Rochester and Google Sycamore QPUs as well as for the enlarged 57-and 62-qubit processors. We also consider improved Google device that posses lower two-qubit gate error (∼ 0.2 %) resulting in higher VQ = 64, which is indicated as Sycamore * . We expect such enhancements to be realized in the nearest future. While the solving of Eq. (1) with TP1 and NTP matrices with the same fidelity as Rochester and Sycamore QPU T f takes less than 1 minute on a supercomputer due to the low final FXEB, an enlarged and improved Sycamore processor provides the fidelity that will keep the supercomputer busy for T f = 5 months in case of 57 qubits, T f = 270 years in case of 62 qubits for TP1 circuits and for T f = 2.2·10 5 years in case of 57 qubits for NTP. In contrast to TP1 and NTP circuits at least an increase of the Hilbert space or a reduction of two-qubit error rate is required: T f = 570 years required to obtain the solution on a supercomputer, when the circuit width is 57 qubits, and T f = 8 years, when the two-qubit error is reduced to ∼ 0.2% on a 53-qubit device. Verification of the fidelity requires TSF A = 10 months, 220 years and 2.2 · 10 5 years of calculations on a supercomputer for TP1 circuits with 53, 57 and 62 qubits respectively; TSF A = 5 · 10 6 years for 53-qubit and TSF A = 8.5 · 10 13 years for 57-qubit TP2 circuit (see Section VI A); TSF A = 4 · 10 8 years for 53-qubit and TSF A = 10 17 years for 57-qubit most complex NTP circuit. It is clear that there is a strong correlation between the quantum volume of the device VQ and equal-fidelity simulation runtime T f .
classical solution with 2 50 × 2 50 TP 1 matrix will be evaluated in T SF A (Û TP1 ) ≈ 10 months, however, other types require T SF A (Û TP2 ) ≈ 5 and T SF A (Û NTP ) ≈ 400 million years of classical computations respectively. In order to compare best classical machines and quantum devices, let us evaluate the fidelity level of a 50+ qubit circuit, which corresponds to the H-HHL algorithm, implemented on existing QPUs. It is clear that the computation time T f , which is necessary for classical computers to get the solution with QPUs fidelity F XEB , scales linearly with fidelity as T f = T SF A · F XEB . One can think of 1/F XEB as a number of attempts required to obtain the correct solution |x .
We assume that only the fidelity of readout F r , singlequbit gates F 1QG and two-qubit gates F 2QG contribute to the final fidelity resulting in
Using digital error model, we evaluate the final F XEB and equal-fidelity simulation runtime for TP 1 , TP 2 and NTP circuits implemented on large 50+ qubit devices. Our estimations on fidelity, simulation time and required quantum volume are presented in Table I . Let us discuss the hardware properties that are required in order to show quantum supremacy with TP 1 matrices. We expect that while Rochester device in a current state is too noisy to demonstrate supremacy using H-HHL algorithm -the equal-fidelity circuit simulation would take less than a minute -the Sycamore QPU, which processes TP 1 matrices, provides the fidelity level that is sufficient to show the quantum speedup by a factor of 10 − 10 2 . Nonetheless, we expect an increase in the number of qubits up to ∼ 60 and a decrease in CNOT errors to ≤ 0.2 % in the nearest future. As a result, equal-fidelity classical sampling would take 5 months and the fidelity verification would take 220 years for 57-qubit Sycamore processor with reduced errors; and 270 years and 225,000 years respectively for 62-qubit Sycamore QPU. Such an error rate was observed for some qubits in [4] , which means that QPU with presented characteristics is already in progress. The quantum volume of the Sycamore processor is 32, however, the mentioned decrease in two-qubit error provides V Q = 64.
Secondly, let us consider more complex quantum circuits based on TP 2 matrices. Current QPUs perform the H-HHL algorithm with TP 2 circuits faster than supercomputers perform equal-fidelity simulation of the corresponding circuit: the speedup is about the same as for TP 1 circuits. However, in order to provide concrete evidence of quantum supremacy, one needs to slightly improve quantum processors. In contrast to TP 1 circuits we do not need to increase the Hilbert space dimension and reduce errors simultaneously: one of such options could be realized. While the simulation of a 57-qubit TP 2 circuit would take 8.5·10 13 years, which is 6200×age of the universe, 570 years are required to obtain the solution on a supercomputer with the same fidelity as on arbitrary 57-qubit QPU with Sycamore's topology and error rate -as far as we know, such a device already exists but has not yet been demonstrated officially. At the same time, a 53-qubit Sycamore processor with V Q = 64 can ensure the fidelity level that would keep the most powerful supercomputers busy almost for a decade to provide the equal-fidelity solution, while the ideal solution would be obtained in 5 million years.
For the most sophisticated NTP circuits, the same improvement in quantum volume as for TP 1 circuits should be realized to claim quantum supremacy on 57qubit QPU. Such an experiment requires 240,000 years of equal-fidelity simulation, while the fidelity verification would take 100 quadrillion years.
In summary, the runtime of an equal-fidelity circuit simulation scales as C c · 2 Cq·n . Here, C c is classical constant, which is defined by the speed of a Schrödinger algorithm and by the dimension of the state that can be placed into RAM; the quantum constant C q is defined by the QPUs gate errors and topology (see SI for details). Both constant significantly vary for different matrix types and sizes, however, it is clear that quantum computers, whose runtime scales as O(n), exponentially outperform classical devices even now, when the noise level is still high.
VII. NEAR-TERM APPLICATIONS
Finally, we consider possible practical applications of the H-HHL algorithm in the framework of the posed problem.
The exact realization of our algorithm allows for addressing some tangible problems, for instance, in the context of Markov processes, one could obtain the generator of a known stochastic P matrix -a transition rate matrix Q = log P [20, p. 37]. Then, we can solve linear equation Q f ·∆t = ∆ f at some point of the process. Here, f is the distribution of state probabilities and ∆ f /∆t is a probability current, which is supposed to be known. Upon obtaining |f and measuring f | M |f we check that the probability of a specific state is non-zero.
An analogous problem can be addressed in the control theory. Let us suppose that the discrete-time process with time step ∆t: x n+1 = A x n is modelled by a continuous one:˙ x = B x. Since A = exp(B∆t), one could obtain vector x by solving˙ x = B x at some point of the process if˙ x is provided. By averaging appropriate Hermitian operators we determine whether some components of x are non-zero.
The algorithm can be modified by performing matrix simulation techniques for A in order to solve A x = b as it was originally proposed in [6] . In that case, a wider range of new problems may be approached. For instance, it was proposed to use the SWAP test in order to determine whether solutions of different systems coincide. Such linear systems are also used to solve partial differential equations, e.g. one can find electromagnetic field energy in some region. One of the promising applications related to deep neural network training was discussed in [21] : since the extension of the Bayesian approach to deep architectures is a serious challenge, one can exploit the hybrid quantum HHL algorithm developed for Gaussian processes in order calculate a model's predictor.
VIII. CONCLUSION
We implemented the quantum hybrid HHL algorithm solving a system of linear equation by the fast matrix inversion. The matrix, in turn, is approximated by the unitary transformation, which was dictated by the sequences of single-qubit rotations and CNOT gates. The size of the linear system grows exponentially with the increasing number of qubits. Using the state-of-art 53 qubits processor, one inverts the 10 15 matrix, which is far beyond the capabilities of the modern supercomputers.
We probed the algorithm on the simulator with embedded noise model and on the real IBMQ QPUs with 5, 15 and 20 qubits, and showed that the cross entropy fidelity F XEB can serve as an adequate performance metric for the real quantum algorithm. Furthermore, we estimated the fidelity level of the algorithm implemented on a nextgeneration 50+ qubit processors and found that the system cannot be solved with the QPUs fidelity on a supercomputer in less than a few centuries neither directly nor by using special techniques such as a high-performance simulation of a quantum circuit. Observed exponential scaling in the equal-fidelity classical computation runtime indicates that NISQ devices, which exploit only polynomial time resources, exponentially outperform existing classical analogues.
Our experimental results and theoretical estimates hold high potential for guiding researchers in solving large systems of linear equations utilizing state-of-the-art NISQ computers. We intend to probe the H-HHL algorithm on the cutting edge low-noise QPUs and collect experimental data in order to give a chance to the future computational devices to verify the fidelity. We envision that the planned experiment will stimulate major players of the quantum computing industry to demonstrate the hardware actually achieving the quantum supremacy.
Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. G.S.P. acknowledges support from the Academy of Finland through the Finnish Center of Excellence in Quantum Technology QTF (project 312296).
