The performance of a D-Wave Vesuvius quantum annealer was recently compared to a suite of classical algorithms on a class of constraint satisfaction instances based on frustrated loops. However, the construction of these instances leads the maximum coupling strength to increase with problem size. As a result, larger instances are subject to amplified analog control error, and are effectively annealed at higher temperatures in both hardware and software. We generate similar constraint satisfaction instances with limited range of coupling strength and perform a similar comparison to classical algorithms. On these instances the D-Wave Vesuvius processor, run with a fixed 20µs anneal time, shows a scaling advantage over the software solvers for the hardest regime studied. This scaling advantage opens the possibility of quantum speedup on these problems. Our results support the hypothesis that performance of D-Wave Vesuvius processors is heavily influenced by analog control error, which can be reduced and mitigated as the technology matures.
I. INTRODUCTION
Recent work by Hen et al. [1] probed for quantum speedup in a D-Wave quantum annealer on a class of Ising spin problems with two desirable properties: a planted (i.e. foreknown) ground state and difficulty that can be tuned with a single parameter α. Their benchmarking results show a scaling advantage for the D-Wave processor over the best of a suite of classical solvers in one region of α, and a scaling disadvantage in another region of α. However, the instances are constructed in such a way that for a fixed value of α, thermal effects and analog error are amplified as the problems increase in size. The construction of these instances can easily be modified to curtail this effect, thereby putting the analog and digital solvers on more level ground. Here we report on experimental results on these range-limited instances, comparing performance of a D-Wave Vesuvius processor with the two best-performing software solvers studied by Hen et al.: the Hamze-de Freitas-Selby (HFS) algorithm [2] [3] [4] [5] , and a solver version of simulated annealing (SAS) [6] .
II. FRUSTRATED LOOP PROBLEMS AND LIMITED COUPLING RANGE
For a particular hardware qubit connectivity graph G with n vertices (qubits), and a particular constraint-toqubit ratio α, Hen et al. construct a frustrated loop instance using k = roundoff(αn) loops like so:
1. For i = 1, . . . , k, loop i is a cycle in G chosen as the first cycle generated by the edges of a random walk in G starting at a random vertex. If i contains fewer than 8 vertices, it is discarded and generated anew.
2. The constraint Ising Hamiltonian J i corresponding to i has value −1 on every edge of i except for a * aking@dwavesys.com randomly selected edge e i of i , where J i has value +1. J i is zero elsewhere.
The final Hamiltonian is (h, J)
, where h is the zero vector and J = k i=1 J i . Any instance constructed with this method has integervalued h and J, but the coupling range R = max i,j {|J i,j |}, i.e. the maximum magnitude of any entry in J, is not necessarily bounded. Moreover, typical instances constructed at a fixed ratio α on increasingly large subgrids of the D-Wave processor have increasing range limits R [1]. Since input (h, J) to the D-Wave processor must be scaled to within the range [−1, 1], large coupling range R necessitates scaling by a factor of 1/R, which presents two problems. First, the operating temperature of the processor relative to the magnitude of the input is increased by a factor of R, increasing undesirable thermal effects. Second, since each coupler and local field is subject to analog control error on the order of 0.03 and 0.05 respectively [7, 8] , the analog control error relative to the magnitude of the input is also increased by a factor of R. In the range-unlimited instances studied by Hen et al. this amplification factor is as high as 17, and can be dictated by a single coupler that happens to be in disproportionately many loops.
In order to address this issue, we construct each instance with respect to an integer coupling range R ≥ 2, so that in our instances each entry of h and J is an integer between −R and R. To do this, when selecting a candidate for i via random walk we ignore edges of G on which | i−1 j=1 J i | is already R. This ensures that the final Hamiltonian (h, J) has all entries in the range [−R, R], so when the instance is necessarily normalized to the range [−1, 1] as input to the D-Wave processor, it is scaled down by no more than a factor of 1/R. Consequently, analog control errors and thermal effects in our instances are relatively amplified by no more than a factor of R.
There is another, less crucial modification of the construction: While Hen et al. reject and resample a choice of i if it is too short, we reject the choice if it is contained arXiv:1502.02098v1 [quant-ph] 7 Feb 2015 in a single eight-qubit unit cell [9] . Thus we sometimes allow loops of length 6, and sometimes forbid loops of length 8. This modification should in principle allow for greater frustration and less domain clustering within unit cells.
In any meaningful study of analog quantum annealing processors it is desirable to limit relative amplification of analog control error and unwanted thermal effects if it can be done without otherwise markedly detracting from the experiment. In this work we consider R ∈ {2, 3, ∞} and α ∈ [0.1, 0.5]; these values of α include the hardest regime. Issues surrounding coupling range and loop selection criteria are discussed further in Appendix B 2.
All of these frustrated loop instances will, by construction, have ↑↑ · · · ↑ and ↓↓ · · · ↓ as planted ground states. Hen et al. describe these instances as being constructed with respect to an arbitrary antipodal pair of planted solutions, but our construction is equivalent under change of variables both in theory and, due to the application of random gauge transformations, in practice.
III. EXPERIMENTAL RESULTS
We compare the D-Wave processor to HFS and SAS, both described by Hen et al. [1] . Our instances are constructed on subgraphs of a Chimera graph C L [9] , in which n of N = 8L 2 qubits are functional, for 4 ≤ L ≤ 8 (see Appendix A 1). Following their methodology, we run SAS on a linear schedule of optimal length in inverse temperature spanning the range β ∈ [0.01, 5] after scaling the input to the range [−1, 1]. Scaling of the Hamiltonian has no bearing on HFS, which is a large-neighborhood zero-temperature search that exploits low-treewidth induced subgraphs in the Chimera architecture. We consider hypothetical resource scaling of n cores for SAS and L = N/8 cores for HFS. Further detail on experimental methods, benchmarking and data analysis is given in Appendix A. To account for differences between SAS implementations, we use the assumption of Hen et al. that each Monte Carlo sweep takes time τ SA = 3.54µs. We assume that each HFS unit cell update takes 1µs. In Appendix B 3 we compare performance of the D-Wave processor to that of SAA (simulated annealing as an annealer). Our results suggest that correlation with the thermal model found by Hen et al. at β f = 5 is an artifact of analog control error in highly sensitive problems, rather than evidence of a predictive model of quantum annealing performance on frustrated loop instances.
A. Scaling and relative speedup results
In Figure 1 we show the scaling of the median time to solution for the three solvers as the problem size L increases. As in previous work [1, 8] , we are particularly interested in how the ratio between two solvers' time to solution scales with respect to problem size. This is shown in Figure 2 . A positive slope in Figure 2 indicates a performance scaling advantage for the D-Wave processor, and the possibility of limited quantum speedup as defined by Rønnow et al. [8] in the case of SAS, and the possibility of potential quantum speedup in the case of HFS 1 . Current D-Wave processors allow a minimum anneal time of 20µs; previous work has shown that C 8 -scale problems with optimal anneal time greater than this are elusive [1, 7, 8] . Proving limited quantum speedup in this framework with this processor would require data from the D-Wave processor using shorter anneals to certify that we are not artificially slowing the processor on easier instances. In particular for the smaller and easier problems, the minimum anneal time may mask the true performance scaling of the quantum annealing platform. When probing for quantum speedup it is important that instances of various sizes are treated fairly with respect to hardness, anneal time, and error sensitivity. In Appendix B 2 we provide evidence that enforcing limits on coupling range does not unduly weaken our results.
IV. DISCUSSION
Far from being artificial, fixed coupling range in the large system limit indeed appears at the phase transition in the Ising formulation of Not-All-Equal 3-SAT [10] 2 , which is NP-hard and can be formulated as a frustrated loop problem on the complete graph. Furthermore, limiting coupling range in hard frustrated loop instances affects a vanishingly small portion of each instance (see Figure 13 ).
The study of instances with fixed coupling range allows for the control of two factors: amplification of analog control error (for D-Wave) and effective operating temperature (for D-Wave, SAS, SAA, and any other simulated physical model with a thermal component [1]). Although we cannot assume that the change in effective temperature between range-2 and range-3 instances would necessarily affect a quantum annealer and SAS similarly, the relatively small advantage shown by SAS in range-2 instances over range-3 instances (see Figure 9 ) is consistent with the hypothesis that the difference in D-Wave performance is primarily due to a reduction of relative control error.
It is straightforward to construct input classes for which analog control error will dominate the performance scaling of an analog processor. When probing for quantum speedup, it is important to do the opposite: construct an input class for which the impact of analog control error is minimal. In doing so we might better observe properties of the annealer's mechanics rather than observing the effect of precision limitations, which by now are reasonably well understood [1, 7, 11] and expected to improve with the maturation of the technology and the possible implementation of error correction strategies [12] [13] [14] . Appendix A: Methods
Quantum annealing platform
In this work we used a Vesuvius quantum annealing processor manufactured by D-Wave Systems Inc. The processor is of identical design to the D-Wave Two V6 processor installed at ISI [1], and is from the same fabrication lot. We call these two processors "SR10-V6" and "ISI-V6" respectively. The problems we consider are generated on subgraphs of the processor's Chimera qubit connectivity graph [9] , using up to 467 qubits (see Figure 3) . The data was gathered in June, 2014. SR10-V6 had an operating temperature of approximately 15mK, and used maximum J inductance (coupling strength) of 1.25pH, compared with temperature and inductance of 17mK and 1.33pH for ISI-V6 [1]. All experiments were run using an anneal length of t a = 20µs, equal to the runs used for most of the key analysis in the work of Hen et al. [1] . Although the processors have the same architecture and similar ratio of temperature to inductance, they used different annealing schedules (see Figure 4) , and we cannot necessarily assume that they would give highly correlated results if they were run on the same input.
Experimental details
The primary testbed of problems studied consists of 200 instances of each size L ∈ {4, 5, 6, 7, 8}, for each value of α ∈ {0.10, 0.15, . . . , 0.50}, for each range limit R ∈ 
a. D-Wave processor
Each instance was annealed 10240 times by the DWave processor with an anneal length of 20µs, the minimum allowed by the system. These experiments were performed in batches of 1024 anneals, each batch with a random gauge transformation applied, as in previous work [8, 11] .
b. HFS
Selby's implementation of the HFS algorithm [1, 4, 5] is a heuristic approach whose main effort consists of tree updates, and which typically restarts to a random state when it performs two tree updates without improvement in energy. In order to treat HFS more like an annealer, we modified the code so that a solution is recorded before each reset -as in SAS, the original algorithm is modified to return possibly optimal intermediate states. It would be algorithmically equivalent to force the HFS algorithm to terminate instead of resetting.
In our experiments, each instance was solved 10240 times by the HFS algorithm. The runs were performed single-threaded in parallel using an HPC cluster of 8-core Intel Xeon E5-2670 processors.
c. SAS
SAS was run on each instance 1024 times at anneal length roundoff (2 a/2 ) for each integer a ∈ {2, . . . , 22}.
Thus the longest anneal for each instance is 2048 Monte Carlo sweeps, enough so that the optimal anneal length for median time to solution is exceeded for all problem sets studied (see Figure 5 ).
SAA was run on each instance 1024 times for 20,000 Monte Carlo sweeps. For similar instances, Hen et al. support the claim that 20,000 gets us reasonably close to equilibration (of performance as a constant-time optimizer) for frustrated loop problems. We did not investigate longer SAA anneals due to limited availability of computing resources. Like SAS, SAA was run on a linear schedule in inverse temperature from β 0 = 0.01 to β f ∈ {3, 4, 5}.
Benchmarking methodology
Following previous work [1, 8, 11] , we treat each solver (now including HFS) as a stochastic sampler, and measure the time to reach 99% confidence of having found the ground state energy of an instance. We call this time to solution (TTS). Given a solver achieving success probability p over a set of trials (i.e. anneals) and taking mean time τ to complete a single trial, we compute the number of samples required as TTS for this solver and instance as
and compute TTS for this solver and instances as τ r. For SAS we assume τ to be 3.54µs multiplied by the number of update sweeps in the anneal in order to remain consistent with Hen et al. [1] . We determine the optimal sweep length for each set of 200 problems as the length giving the minimum bootstrapped median time to solution. We disregard the issue of conditioning our SAS results on minimizing TTS; due to the smoothness of the curves in Figure 5 near the optimal anneal lengths, we do not expect this to have a significant effect on the results. For HFS our methodology differs from that of Hen et al. : We use an enumerative effort computation, as with the D-Wave processor and SAS, rather than timing the process. This allows a bare look at the dominant operations. Following Hen et al., we assume hypothetical parallelization of L cores on a C L instance, and accordingly assume that a tree update on a C L instance takes O(L) parallel steps (actually L steps) of "leaf updates". Using these assumptions we compute the total number of leaf update steps required for a given "anneal" (i.e. sample draw) and for convenience we assume a constant of L · 1µs per tree update, noting that this gives reasonably comparable performance at C 4 scale to the results of Hen et al. [1] . In fact our own timing data indicates that our [1] ). The slight increase in difficulty for the hardest problems as R increases can be explained by higher temperature relative to the gap of the final Hamiltonian. The increase in difficulty for long anneals in high-α problems as R increases can be explained by the suppression of randomness (and accelerated evolution of ferromagnetic structure) for low-R, high-α combinations (see Appendix B 2).
HFS experiments were, pound for pound, approximately five times faster than those of Hen et al.
a. Statistical methods
To generate the data points and error bars in the performance and speedup figures, we used the same Bayesian bootstrapping approach used by Hen et al.
[1]. First we describe the approach for performance data, which differs slightly for HFS.
For a given solver, SR10-V6 or SAS or SAA, and each set S of 200 instances at a given value of (L, α), we have, for each instance s i , an empirical success probability p i representing x i successes out of y trials. We consider the probability distribution of success probability to be β i = β(x i + 1 2 , y − x i + 1 2 ). We then construct 1000 bootstrap sets S j of size 100 by drawing 200 members from S with repetition, resulting in multisets S j = {s i,j | 1 ≤ i ≤ 100} j . Now for each set we sample a probability p i,j from distribution β i,j .
At this point we can apply the desired function f j to the set of 200 probabilities, which is typically the median of {r(p i,j )} for a fixed j. We then take the data point to be the mean of f j , and we take the error in the statistic to be the standard deviation of f j .
For HFS we use a similar approach suggested by Joshua Job (personal communication, January 2015). After we sample 1000 values indicating the number r i of samples needed for 99% assuredness of success, we take r i random samples (with repetition) from our set of 10240 samples, and take the sum of the numbers of tree updates in those r i samples to give a sample of time to solution.
Speedup. To compute data on speedup between two solvers (or equally, the same solver on two problem sets), we assume that time to solution is normally distributed with mean and standard deviation as estimated above. We then sample 1000 points from each normal distribution and compute 1000 speedup ratios. We use the mean and standard deviation of this set of ratios for our data point and error bar.
Scaling coefficients. When computing scaling coefficients (see Appendix B 3), we assume normal distributions on TTS for a given solver and set of 1000 instances (200 of each size). We then draw 1000 samples from each of the five distributions, and for each set of five samples we compute the slope of the best fit line ln(TTS) ≈ a(α) + b(α)L. From these 1000 slopes we take the mean and standard deviation. Error bars in Figure  14 In Section II we described Itay Hen's original construction of frustrated loop instances, and offered a modification. For convenience, we call the former HenFL instances, and the latter KingFL frustration in the problem, measured by Hen et al. as the proportion of extant couplers that are frustrated in the planted ground state. For large α, systems tend towards ferromagnetism. For sufficiently small α, a typical system is simply a collection of disjoint frustrated loops, and therefore both highly degenerate and combinatorially trivial. In Figure 6 we see the same qualitative dependence on α.
In HenFL problems, the mean loop length is approximately 11. In KingFL problems it is approximately 9 (see Figure 7 for the instances studied). It is therefore not surprising that the hardest problems appear for slightly larger α in our results compared with the results of Hen et al. We remark that a well-yielded Chimera graph will have between 2n/5 and 3n edges, so in our data the hardest problems arise when the expected number of loops containing a given coupler is near 1.
Effect of range limitation and loop distribution
Of fundamental importance to this work is the question of whether or not limiting the coupling range of frustrated loop instances makes them intrinsically easier. Here we give evidence that the difference in performance of the D-Wave processors here and in the paper of Hen et al.
[1] cannot be explained by our testbed being easier.
Clearly the range-limited instances studied here are C4 C5 C6 C7 C8 2 easier for the SR10-V6 D-Wave processor used in this work than the range-unlimited HenFL instances are for the ISI-V6 processor, and clearly the range-2 instances are easier than the range-3 instances for SR10-V6. For an idea of whether or not range-limited instances are combinatorially easier, we appeal to HFS performance, since HFS is unaffected by coupling range and has no thermal component. Perhaps the most pertinent answer to this question is in the bottom-middle panel of Figure 9 . There we see that for the highest α, HFS finds range-2 instances consistently easier than range-∞ instances, but this phenomenon weakens as α decreases. There is a simple intuitive reason for this: for any given coupling range limit R, if α is sufficiently large (but not so large that a random KingFL instance cannot be consistently constructed), the number of loops containing an edge is forced to be relatively consistent across the edges of the graph. Consequently, the system is more orderly and therefore more ferromagnetic -recall that at least 5/6 of nonzero couplers in each J i are ferromagnetic. This intuition is corroborated by our HFS results.
In Figures 10 and 11 we give HFS data on a more extensive set of inputs, arranged according to α. For each choice of L and α, the set contains 200 instances. Figure 10 suggests that we should not expect the rangeunlimited KingFL instances on the SR10-V6 hardware graph to be any easier than the range-unlimited HenFL instances on the ISI-V6 hardware graph, if for each set we choose α to maximize median HFS time to solution. This takes into consideration the fact that ISI-V6 uses 503 qubits on C 8 problems, while SR10-V6 uses only 467. Figure 11 suggests that for L ≤ 8 and α ≤ 0.3, we should not expect range-3 KingFL instances to be significantly easier than range-unlimited KingFL instances. For L ≤ 8 and α ≤ 0.2, we should not expect range-2 KingFL instances to be significantly easier than range-unlimited KingFL instances. In short, it appears that the difference between our results and the results of Hen et al. [1] , especially the scaling advantage we see for the D-Wave processor on range-3 instances, cannot be attributed to the problems being fundamentally easier.
For SAS, the issue becomes entangled with the issue of effective temperature. This is most troublesome for range-∞ instances, where the energy scale (and relative temperature) varies significantly between instances (see Figure 12 ). Speedup for range-3 over range-2 problems for low α may indicate that the final inverse temperature β f is too low relative to the energy scale of the normalized Hamiltonian, whereas speedup for range-3 over range-2 problem for high α may indicate, in agreement with HFS data, that these problems are easier in range 2. Figure 5 provides further insight into the question of easiness and temperature.
Going back to Figure 9 , we emphasize that particularly for α ≤ 0.3, which includes the hardest instances, the HFS speedup as coupling range decreases is small compared with both the speedup of the D-Wave processor over HFS and the speedup of the D-Wave processor between range-2 and range-3 instances. For α ≤ 0.25, range-3 instances appear to have very little structural "easiness" compared with range-∞ instances.
As further illustration of why this should be the case, Figure 13 takes the range-∞ KingFL testbed and plots the proportion of nonzero couplers exceeding a given range limit for each value of α. For a fixed α, the structural impact of imposing a coupler range limit of R dimin- ishes quickly as R increases. In other words, in range-∞ instances scaled to the interval [−1, 1], the scaling factor is determined by the tail of the coupler value distribution, which we expect to be insignificant to the overall combinatorial structure of the problem.
Ruling out a thermal model for D-Wave performance
In order to support a thermal model for performance scaling of the D-Wave processor, SAA data should provide a good match with D-Wave processor data on both range-2 and range-3 instances at the same final inverse temperature β f , for various choices of α. In other words, it should have predictive power [15] . Hen et al. found a good match between SAA and their D-Wave processor data at β f = 5 for certain values of α [1]. In Figure 14 we can see the scaling coefficients for the D-Wave processor and for SAA with β f ∈ {3, 4, 5}. The ratio of temperature to energy scale is similar for the D-Wave processor studied here and that studied by Hen et al. (∼ 7% difference). For range-3 instances, β f = 4 appears to be too high and β f = 3 is clearly too low. Although β f = 4 gives a better fit, its scaling does not match that of the D-Wave processor on low-α range-2 instances. Figure 15 shows instance-wise scatter plots of success probabilities for the D-Wave processor and SAA at β f = 4. It is clear from this data that the best correlation is poor when compared with the data of Hen et al., particularly looking at each problem size individually, and that there is not a good fit between success probabili- ties on range-2 instances. This data fails to support the hypothesis that SAA provides a thermal model for performance scaling on frustrated loop instances. On the contrary, there are several pieces of evidence that in this context, analog control error causes the appearance of thermal behavior. First is the poor correlation compared to that found by Hen et al.; the data shown in Figure   15 for range-3 instances represents the best visual fit between the D-Wave processor and SAA data out of all our experiments. Second is the fact that we do not find agreement at the same inverse temperature β f = 5. Third is shown by Hen et al., who show scaling coefficients for various choices of β f , along with results on perturbed instances at β f = 5. Their data suggests that perturbing the Hamiltonian has a similar effect to increasing the relative temperature of SAA. 
