Performance evaluation of coherent Ising machines against classical
  neural networks by Haribara, Yoshitaka et al.
Performance evaluation of coherent Ising machines
against classical neural networks
Yoshitaka Haribara1,2, Hitoshi Ishikawa3, Shoko Utsunomiya4,
Kazuyuki Aihara1,2 and Yoshihisa Yamamoto5
1 Department of Mathematical Informatics, University of Tokyo, Tokyo 113-8656,
Japan
2 Institute of Industrial Science, University of Tokyo, Tokyo 153-8505, Japan
3 PEZY Computing, Tokyo 101-0052, Japan
4 National Institute of Informatics, Tokyo 101-8430, Japan
5 ImPACT Program, Japan Science and Technology Agency, Tokyo 102-0076, Japan
E-mail: haribara@sat.t.u-tokyo.ac.jp
6 October 2018
Abstract. The coherent Ising machine is expected to find a near-optimal solution in
various combinatorial optimization problems, which has been experimentally confirmed
with optical parametric oscillators (OPOs) and a field programmable gate array
(FPGA) circuit. The similar mathematical models were proposed three decades ago
by J. J. Hopfield, et al. in the context of classical neural networks. In this article, we
compare the computational performance of both models.
Keywords: Degenerate Optical Parametric Oscillator, Measurement Feedback System,
Combinatorial Optimization Problems, Maximum Cut Problems
Submitted to: Quantum Science and Technology (focus issue on “Quantum coherent feedback
and reservoir engineering”)
1. Introduction
In recent trends in semiconductor technologies, the Moore’s law is slowing down mainly
due to the limitation of micro-fabrication heat dissipation and communication bottleneck
problems on a chip [1, 2]. Many efforts to boost the processor performance have been
made for parallelized architectures including GPU, other multi/many-core processors,
and neuromorphic hardwares [3]. An optics-based special purpose computer, which is
named the coherent Ising machine (CIM), has been proposed to exploit a rapid physical
convergence time for accelerating the solution search in hard optimization problems [4].
One of the well-known examples of combinatorial optimization problems is a
maximum cut problem (MAX-CUT) on a graph, which is essentially equivalent to the
ar
X
iv
:1
70
6.
01
28
3v
2 
 [q
ua
nt-
ph
]  
13
 Ju
n 2
01
7
2Ising model in statistical mechanics [5]. It is a problem to find the largest cut in a given
graph G = (V,E), where the number of edges at the boundary of a partition of vertices
into two subsets is maximized. The size of the cut is defined as the total weight of edges
separated by the cut, i.e., edges which have each endpoints in the different sides of the
cut. This objective function can be written as
CUT(x) =
∑
1≤i<j≤n
wij
1− xixj
2
, (1)
where the graph order n = |V | is the number of vertices, wij is the weight of the edge
(i, j) ∈ E and xi = ±1 is a binary value indicating which side of the cut the vertex
i ∈ V belongs to.
To implement the above problem on a physical system, the injection-locked lasers
[4] and the degenerate optical parametric oscillators (DOPOs) [6] were proposed to use.
With a series of experimental challenges, the implementation of the optical delay line
based [7, 8] and measurement feedback based [9, 10] coherent Ising machines has been
realized.
As a metaheuristic algorithm, the CIM can be interpreted as a mathematical model
to solve combinatorial optimization problems using recurrently updated neurons with
nonlinear activation function. From this point of view, there have been related and
interesting approaches by mathematical models of the neurons (e.g., [11, 12]) and their
networks (e.g., [13]). Hopfield developed the optimization algorithm by using such neural
networks [14]. Then Hopfield and Tank extended it to the continuous valued model to
improve the performance and applied it to the combinatorial optimization problems
[15, 16]. Simulated annealing (SA) is proposed in the same period [19].
Here, we try to clarify the relative performance of our CIM against a family of
classical neural network approaches and SA for combinatorial optimization problems
especially for MAX-CUT. This paper is organized as follows. In section 2.1, we
describe the basic concept and experimental configuration of CIM. In section 2.2,
we describe models of neural network algorithm for the combinatorial optimization
problems followed by section 2.3 to describe suitable hardware implementation. Then
the numerical experiments are performed in section 3. We discuss the results and other
possibilities of implementations in section 4. Finally, we conclude the paper in section 5.
2. Method
2.1. Coherent Ising machine (CIM)
We intend to solve combinatorial optimization problems by mapping the cost function
(1) to the energy of an Ising spin system. CIM is initially proposed as an injection-
locked laser system [4], followed by the proposal using a degenerate optical parametric
oscillator (DOPO) system [6]. So far, several experimental machines are demonstrated
with n = 4, 16, 100, 2048-pulse systems [7, 8, 9, 10]. Since the original MAX-CUT
has binary variables, we use a bistable optical device, DOPO at the output stage of
3computation, while an analog optical device, degenerate optical parametric amplifier
(DOPA), at the solution search stage of computation.
Figure 1 depicts the schematic of the measurement feedback based CIM [9, 10]. Here
we describe the typical experimental configurations in [10]. The DOPO part consists of a
1 km optical fiber with an externally pumped periodically poled lithium niobate (PPLN)
waveguide. The pulsed pump laser, at the 1 GHz repetition rate of 5000 times as the
cavity circulation frequency, generates 5000 individual DOPO pulses in a single fiber
ring cavity. A segment of them (2000 pulses) is used as the signal pulses for computation
and the remaining portion (3000 pulses) is used for the cavity stabilization.
The feedback circuit stores the interaction strength for each pair of DOPO pulses.
A portion of optical pulse is picked-off by a beamsplitter (numbered as 1 in the figure 1)
and measured by balanced homodyne detectors. The measured values of DOPO pulse
amplitudes are fed into an analog-digital converter (ADC), followed by FPGAs. Here,
1 GHz repetition rate of signal pulses is downclocked to 125 MHz (8 parallel) and the
measured amplitudes c˜i are sliced into the digital signals of 5 bits. Then 2 FPGAs sum
up the coupling effect from the other vertices (in the given topology)
∑
j Jij c˜j for the
ith pulse. The feedback pulse train is modulated in intensity and phase by this output
electrical signal after a digital-analog converter (DAC). The feedback pulse is injected
to the signal DOPO pulse running through the main fiber ring cavity via a beamsplitter
#2.
The DOPO is operated near the oscillation threshold by crossing the pump rate
from below to above the threshold in the case of [10]. In the beginning, the DOPO
is biased at below the threshold in which all phase configuration is established so as
a superposition state and the quantum parallel search is implemented [18]. Then, the
external pump rate (or the feedback) strength is gradually increased, and once the whole
system reaches the oscillation threshold, it selects a particular phase configuration which
corresponds to the near-optimal solution of the original optimization problem.
The dynamics of the CIM can be simulated by the quantum master equation.
Instead of numerically integrating the master equation for the DOPO density operator,
we can expand the density operator by the quasi-probability function in the phase
space. One quasi-probability function need for this purpose is the positive P (α, β)
representation in terms of the off-diagonal coherent state expansion, |α〉 〈β|. The Fokker-
Planck equation for P (α, β) is derived from the master equation and then the c-number
stochastic differential equations for α and β are obtained using the Ito calculus (see [17]
for detail). Another quasi-probability function used for this purpose is the truncated
Wigner representation W (α) in terms of the Gaussian states. The corresponding c-
number stochastic differential equations are derived in [18].
2.2. Classical neural networks
We describe in this section the classical neural network models to solve the same
combinatorial optimization problems, three of which are summarized in the table 1.
4SHG
Pump
pulse
SHG
pulse Signal DOPO pulses
#2    #1
Beam splitter #1
PPLN
waveguide
OPA
LO
pulse
Homodyne
detection
IM/PM
FPGAs
DAC
ADC
Feedback
pulse
fi
Beam splitter #2 Fiber ring cavity
ー
Figure 1. Experimental schematic of a coherent Ising machine implemented on a fiber
DOPO with an FPGA measurement feedback circuit.
Table 1. Classical neural-network approaches for combinatoral optimization problems.
Deterministic Stochastic
Binary Derandomized Hopfield network (HN) Simulated annealing (SA)
Analog Hopfield-Tank neural network (HTNN)
2.2.1. Derandomized Hopfield Network (HN) J. J. Hopfield implemented a classical
neural network model solving combinatorial optimization problems in his 1982 paper
[14], which is referred to the Hopfield network (HN). The neuron in this model has the
discrete output values xi = ±1 with a simple majority voting update rule:
xi ← sgn(
n∑
j=1
Jijxj) (2)
which will execute asynchronously. The spin index i is selected randomly in the original
paper but we derandomized it to enhance the speed, i.e., the spin indices from i = 1
to i = n are updated sequentially. Simultaneous updates introduce the instability or
periodic solution into the system. Since the update is local and deterministic, the
system will converge into the nearest local minimum, which is determined by the initial
state. Note that the model is originally proposed with {0, 1}-binary neurons, but for
comparison, we use equivalent {+1,−1}-valued neurons.
2.2.2. Simulated Annealing (SA) While the HN will often get stacked at poor local
minima, Kirkpatrick, et al. introduced a stochastic spin update strategy in simulated
annealing (SA) algorithm to mimic thermal annealing [19]. The probability of stochastic
spin flip is governed by the Boltzmann factor in the Metropolis-Hastings procedure as
5follows:
P = exp(−∆Ei/T ), (3)
even if the energy difference to flip the ith spin
∆Ei = 2xi
n∑
j=1
Jijxj (4)
makes the total energy increased, namely ∆Ei > 0. The spin index i is selected randomly
while temperature T is gradually decreased.
2.2.3. Hopfield-Tank Neural Network (HTNN) Hopfield and Tank proposed another
neural network approach using an analog valued neuron xi ∈ [−1, 1], which is referred
to the Hopfield-Tank neural network (HTNN) [16]. The time evolution of the HTNN is
described by ordinary differential equations (ODE):
dxi
dt
= −αxi + β
n∑
j=1
Jijf(xj), (5)
where f(x) is a nonlinear sigmoid function. In this study, tanh(x) is used as f(x). In
the extremely high linear gain limit, i.e., when the slope of the sigmoid function around
0 is steep, this HTNN model becomes close to the HN model described above. The
parameters in later section are optimized as the neuron decay rate α = 6 and the synaptic
connection strength β = 0.1 to achieve the best performance for the given MAX-CUT
problems. The numerical integration of (5) is performed by the Euler method with the
discrete time step ∆t = 0.01.
2.3. Hardware used for Implementation of classical neural networks
Here we describe the hardware configuration needed to implement the classical neural
networks, which will be used in the benchmark section. Note that all codes are
implemented with C++ ‡.
2.3.1. CPU (for SA and HN) SA and HN are iterative updating algorithms for discrete
spins. We can achieve CPU implementation efficiently by SIMD bitwise operations in
parallel§. In this paper, we mainly used Intel Xeon E3-1225 v3 @ 3.2 GHz (Haswell
architecture shipped in 2013). Note that the performance of SA is slightly improved
from the previous paper, in which SA is run on an older processor (Intel Xeon X5650 @
2.67 GHz Westmere architecture shipped in 2010) [10]. We did not use any accelerators
for HN and SA in this study since it is already parallelized by SIMD operations in CPU
and the cache hit rate is high enough as 98.8%.
‡ We used Ubuntu 16.04.4 with GCC 5.4.0 (CPU) and CentOS 7.1.1503 with GCC 4.8.3 (PEZY-SC)
§ The code is available here. https://github.com/haribara/SA-complete-graph
6PEZY-SC (1024PE)
Prefecture (256PE)
City (16PE) Village (4PE) PE
Program Counter × 8
L1 Instruction Cache
64bit × 256w (2KB)
ALU
4FP ops/cycle
Register File
32bit × 256w (1KB)
Local Storage
32bit x 4096w (16KB)
PE
L1 Data 
Cache 2KB
Village
(4PE)
Village
(4PE)
Village
(4PE)
Village
(4PE)
Special Function UnitCity
(16PE)
City
(16PE)
City
(16PE)
City
(16PE)
City
(16PE)
City
(16PE)
City
(16PE)
City
(16PE)
City
(16PE)
City
(16PE)
City
(16PE)
City
(16PE)
City
(16PE)
City
(16PE)
City
(16PE)
City
(16PE)
Prefecture (256PE) Prefecture (256PE)
Prefecture (256PE) Prefecture (256PE)
DDR4 DRAM 19GB/s
PCIe Gen3 x8
PCIe Gen3 x8
PCIe Gen3 x8
PCIe Gen3 x8
L3 Data Cache
2MB
DDR4 DRAM 19GB/s
DDR4 DRAM 19GB/s
DDR4 DRAM 19GB/s
DDR4 DRAM 19GB/s
DDR4 DRAM 19GB/s
DDR4 DRAM 19GB/s
DDR4 DRAM 19GB/s
ARM
926
ARM
926
L2 Data Cache
64KB
L2  Instruction Cache
32KB
L3  Instruction Cache
128KB
Atomic Cache
128B
L1 Data 
Cache 2KB
PE PE
PE
Thread0
front
Thread1
front
Thread3
front
Thread2
front
Thread1
back
Thread0
back
Thread3
back
Thread2
back
clock
clock
clockclock
(b)
(a)
Figure 2. (A) The hierarchical architecture of a PEZY-SC many core processor.
There are 1024 processing elements (PE) packed in a single PEZY-SC chip. (B) Each
PE core handles 8 threads independently.
2.3.2. MIMD many core processor PEZY-SC (for HTNN) Since HTNN is based
on ordinary differential equations (a continuous-valued continuous-time system) and
requires floating-point arithmetic, it is better to parallelize by accelerators. We used a
MIMD many core processor PEZY-SC @ 733 MHz with 1024 cores and 8192 threads
on a chip (the architecture is shown in figure 2), which is set in Shoubu supercomputer
at Riken (Japan). We parallelized matrix-vector multiplication and neuron updates in
8192-thread parallel. The coupling matrix is efficiently stored as a 1-bit matrix (since
Jij = ±1 has no empty entry) and neuronal states as floating points (32-bit float). Note
that it was 1.4 times faster than storing matrix values in 32 bits. The benchmark of the
hardware itself is shown in Appendix A.1.
3. Results
We compared the performance of HN, SA, HTNN, and CIM by solving the MAX-CUT
problems on a dense graph. The particular problem instance is a complete graph, in
which all pair of N = 2000 vertices are connected and edges are weighted by {+1,−1}
in uniform distribution (the identical instance as in Ref. [10]). Figure 3 shows the
performance on the complete graph, while the detailed computation time to target and
the hardware configurations are summarized in table 2.
7CIM(experiment)HN SA HTNN
10-5 10-4 0.001 0.010
-30
-25
-20
-15
-10
-5
0
Time (s)
Is
in
g
en
er
gy
Figure 3. Energy descent when solving a {+1,−1}-weighted N = 2000 complete
graph. Each thick line is the ensemble average of 100 trials (except for CIM experiment,
which consists of 26 trials), while the lower and upper shaded error bars show the best
and worst envelopes for each computational model. The gray dotted line is the target
energy (= −60278/n) which is obtained by the SDP relaxation algorithm [20].
We ran 100 different trials for the same problem instance (except for CIM, which
consists of 26 trials). Each solid line in the figure 3 indicates the ensemble average of all
trials, while the lower and upper shaded lines indicate the best and worst case envelopes,
respectively. Here, parameters for SA and HTNN are optimized to achieve the shortest
computation time to the target which is obtained by the SDP relaxation algorithm [20].
The computation time to the SDP-produced target is shorter in the order of CIM,
HN, SA, HTNN on the instance. The data from CIM are noisy due to experimental
noise, but it can find better solutions than the target in all 26 trials. HN is faster than
SA since HN can be regarded as a derandomized version of SA. Note that in the worst
case, HN cannot reach the target (it fails 3 times in 100 trials as it can be seen partly in
the worst case in figure 3). It can be understand that HTNN performs much slower than
HN/SA since it solves ODEs which deals with the analog variables. Note that HTNN
achieves lower energy than SA in figure 3 but the performance of SA heavily depends on
temperature scheduling. We optimized to reach the target shorter but slower scheduling
ends up lower energy generally.
4. Discussion
In this section, we will add the two discussions to justify the above conclusions,
• Validity for the hardware selection.
• Optimization for PEZY-SC implementation for HTNN.
8Table 2. Time to target and hardware configurations. The best (shortest) time to
reach the target value and the time to cross the ensemble averaged line are listed in the
upper table. Note that the target value is obtained by the SDP relaxation algorithm
which has performance guaranty to the 87% of the optimal value.
Best (ms) Average (ms) Hardware
CIM 0.071 0.264 fiber DOPO+2FPGAs
HN 0.924 1.84 CPU
SA 2.10 3.20 CPU
HTNN 7.04 9.67 PEZY-SC
Hardware Model Clock and Architecture Release date
FPGA Xilinx Virtex-7 VX690T 125 MHz 693k logic cells 2010
CPU Intel Xeon E3-1225 v3 3.2 GHz, Haswell 2013
MIMD many core PEZY-SC 733 MHz, 1024 core 2014
4.1. Validity for the hardware selection
HTNN is apparently efficient on PEZY-SC than on CPU, which is shown in Appendix
A.1. On the other hand, we did not use any accelerator for HN and SA in this study.
This is because we do not expect significant speed-up by naive implementation since
they are already parallelized by SIMD operations in CPU and the cache hit rate is so
high as 98.8% (measured by perf command in Linux). Generally, asynchronous update
in HN/SA seems to be not suitable for parallel implementation.
4.2. Optimization for PEZY-SC implementation for HTNN
We tried to optimize the implementation by storing matrix data efficiently. Since the
given adjacency matrix has only the 1-bit entry (Jij = ±1), we packed each value in
1 bit. This contributes the 1.4 times speed-up than having a 32-bit float matrix. But
putting the data in local memory does not contribute to significant speed-up since
its bottleneck in computation is not in memory transfer. There is a possibility of
speed-up if we replace the multiplication by the selector. Rather, it is possible to
scale out for parallel distributed processing by using multiple PEZY-SC chips in Soubu
supercomputer, especially when the problem size is larger.
5. Conclusion
In this paper we compared the performance of the CIM implemented on DOPOs and
FPGAs against the family of classical neural-network-based algorithms: HN, SA and
HTNN. To accelerate the performance of the classical neural networks, HN and SA are
implemented on CPU with bit operations and HTNN is implemented on a many core
processor PEZY-SC. It is shown that the CIM can achieve faster computational time
than HN (13.0 times for the best case and 6.97 times for the average), SA (29.6 times
9Table A1. Accelerator configurations for parallel implementation of the c-number
stochastic differential equations in figure A1.
Hardware Model Clock and Architecture Release date
CPU Intel Xeon W3530 2.80 GHz, Nehalem-WS 2010
GPU NVIDIA Tesla C2075 1.15 GHz, Fermi 2011
MIMD many core PEZY-SC 733 MHz, 1024 core 2014
for the best case and 12.1 times for the average) and HTNN (99.2 times for the best
case and 36.7 times for the average).
Acknowledgments
The authors thank H. Takesue and T. Inagaki for providing the experimental data, K.
Kawarabayashi, S. Tamate and T. Sonobe for accelerating SA implementation, Y. Saito
for comments on the FPGA configuration, T. Leleu and M. Oku for general discussion.
We used a super computer Shoubu in Riken (Saitama, Japan) to benchmark on PEZY-
SC. This research is supported by the Impulsing Paradigm Change Through Disruptive
Technologies (ImPACT) Program of the Council of Science, Technology and Innovation
(Cabinet Office, Government of Japan).
Appendix A.
Appendix A.1. Processor performance of PEZY-SC
We show the elapsed time for the CIM simulation by the following c-number stochastic
differential equations [18, 21]
dci = [(p− c2i − s2i )ci] dt+
1
As
√
c2i + s
2
i +
1
2
dWci, (A.1)
ci(t+ ∆t) 7→
√
1− Tmesci(t) +
√
Tmes
fi
As
, (A.2)
ci(t+ ∆t) 7→
√
1− Tinjci(t) +
√
Tinjξ
n∑
j=1
Jij c˜j, (A.3)
implemented on different processor configurations. Note that the simulation by the
Langevin equation is simplified version and we employed this model to contrast the
processor performance (detailed simulation will be reported in [22, 23]). Here, CPU
indicates serialized calculation on a single thread, CPU+GPU indicates matrix-vector
multiplication is off-roaded to GPU while other part of differential equation is calculated
on the same processor as CPU, PEZY-SC indicates that all processes are paralellized.
We conclude that is it preferable to implement HTNN on PEZY-SC than CPU or GPU
which we listed in the table A1 since the Langevin equations are similar to ODEs of
HTNN except for random number generation.
10
●
● ● ●
■ ■ ■ ■ ■ ■ ■
■ ■
◆ ◆ ◆ ◆ ◆ ◆
◆ ◆CP
U
CPU
+GPU
PEZ
Y-SC
CIM (experimental)
500 1000 5000 104
10-4
0.001
0.01
0.1
1
10
100
Problem size N
S
im
ul
at
io
n
tim
e
(s)
Figure A1. Simulation time for the c-number stochastic differential equations (200
round trips on complete graphs).
References
[1] Hennessy J L and Patterson D A 2011 Computer architecture: a quantitative approach (Elsevier)
[2] Waldrop M M 2016 Nature News 530 144
[3] Khan M, Lester D, Plana L A, Rast A, Jin X, Painkras E and Furber S B 2008 Spinnaker:
mapping neural networks onto a massively-parallel chip multiprocessor Neural Networks, 2008.
IJCNN 2008.(IEEE World Congress on Computational Intelligence). IEEE International Joint
Conference on (IEEE) pp 2849–2856
[4] Utsunomiya S, Takata K and Yamamoto Y 2011 Optics express 19 18091–18108
[5] Garey M R and Johnson D S 2002 Computers and intractability vol 29 (wh freeman New York)
[6] Wang Z, Marandi A, Wen K, Byer R L and Yamamoto Y 2013 Physical Review A 88 063853
[7] Marandi A, Wang Z, Takata K, Byer R L and Yamamoto Y 2014 Nature Photonics 8 937–942
[8] Takata K, Marandi A, Hamerly R, Haribara Y, Maruo D, Tamate S, Sakaguchi H, Utsunomiya S
and Yamamoto Y 2016 Scientific Reports 6 34089
[9] McMahon P L, Marandi A, Haribara Y, Hamerly R, Langrock C, Tamate S, Inagaki T, Takesue
H, Utsunomiya S, Aihara K, Byer R L, Fejer M M, Mabuchi H and Yamamoto Y 2016 Science
354 614–617
[10] Inagaki T, Haribara Y, Igarashi K, Sonobe T, Tamate S, Honjo T, Marandi A, McMahon P L,
Umeki T, Enbutsu K, Tadanaga O, Takenouchi H, Aihara K, Kawarabayashi K, Inoue K,
Utsunomiya S and Takesue H 2016 Science 354 603–606
[11] McCulloch W S and Pitts W 1943 The bulletin of mathematical biophysics 5 115–133
[12] Hodgkin A L and Huxley A F 1952 The Journal of physiology 117 500
[13] Fukushima K 1980 Biological cybernetics 36 193–202
[14] Hopfield J J 1982 Proceedings of the National Academy of Sciences 79 2554–2558
[15] Hopfield J J 1984 Proceedings of the National Academy of Sciences 81 3088–3092
[16] Hopfield J J and Tank D W 1986 Science 233 625–633
[17] Takata K, Marandi A and Yamamoto Y 2015 Phys. Rev. A 92(4) 043821
[18] Maruo D, Utsunomiya S and Yamamoto Y 2016 Physica Scripta 91 083010
[19] Kirkpatrick S, Gelatt C D and Vecchi M P 1983 Science 220 671–680
[20] Goemans M X and Williamson D P 1995 Journal of the ACM (JACM) 42 1115–1145
11
[21] Haribara Y, Utsunomiya S and Yamamoto Y 2016 Entropy 18 151
[22] Shoji T et al. 2017 (in preparation)
[23] Yamamura A et al. 2017 (in preparation)
