Combinatorial optimizations are widely adopted in scientific and engineering applications, such as VLSI design, automated machine learning (AutoML), and compiler design. Combinatorial optimization problems are notoriously challenging to exactly solve due to the NP-hardness. Scientists have long discovered that numerically simulating classical nonlinear Hamiltonian systems can effectively solve many well-known combinatorial optimization problems. However, such physical simulation typically requires a massive amount of computation, which even outstrips the logic capability of modern reconfigurable digital fabrics. In this work, we proposed an FPGA-based general combinatorial optimization problem solver which achieved ultra-high performance and scalability. Specifically, we first reformulated a broad range of combinatorial optimization problems with a general graph-based data structure called the Ising model. Second, instead of utilizing classical simulated annealing to find an approximate solution, we utilized a new heuristic algorithm, simulated bifurcation, to search for solutions. Third, we designed an efficient hardware architecture to fully exploit FPGAs' potentials to accelerate the algorithm, and proposed three hardware-software cooptimizations to further improve the performance. By experimenting on benchmarks, our proposal outperformed the state-of-the-art simulated annealing optimization solver by up to 10.91 times.
Introduction
Successfully performing modern EDA (Electronic Design Automation) in VLSI largely hinges on satisfactorily solving logic partitioning, packing, placement, and routing. All these tasks can be formulated as solving well-known combinatorial optimization Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. FPGA '20, February 23-25, 2020, Seaside, CA, USA © 2020 Association for Computing Machinery. ACM ISBN 978-1-4503-7099-8/20/02. . . $15.00 https://doi.org /10.1145/3373087.3375298 problems such as min/max-flow, graph cut, and partitioning problems [3] . However, combinatorial optimization problems are notorious for their non-deterministic polynomial-time hardness (NPhardness), i.e., their complexity explodes exponentially with their problem sizes. Taking a graph partitioning as an example, when dealing with a graph with |E| number of edges, each edge is a potential cut, such that there are 2 |E | number of possible solutions. This huge solution space renders an exhaustive search intractable, and an exact solution is impossible to find. However, in most cases, an exact solution to these combinatorial optimization problems is unnecessary, as an approximate solution that is good enough to fulfill the goal is sufficient. As such, many studies have been conducted to effectively solve combinatorial optimization problems with heuristic algorithms, such as hill climbing and simulated annealing [12] .
Given the huge importance and extreme challenge of performing combinatorial optimization, we aim to design a flexible and general FPGA-based acceleration framework to approximately solve combinatorial optimizations, which consists of three key components: (1) A general, flexible, and hardware-friendly data structure to represent a large number of optimization problems. (2) A fast heuristic algorithm to search for an approximate solution for the problem. (3) A dedicated hardware architecture to fully exploit hardware potentials and accelerate the heuristic algorithm leveraging the hardware-friendly problem representation.
Simulating Ising Model to Solve Combinatorial Optimization
More specifically, we choose Ising model as an abstract mathematical model to reformulate a broad range of combinatorial optimizations. An Ising model typically consists of a set of spins interconnected with each other by a weighted edge on a 2D lattice, where each spin σ i has two discrete values σ i ∈ {−1, +1}. The local energy of each spin H i (σ i ) is determined by the current spin value, interactions, and neighbor spins' values. The total energy H (⃗ σ ) is defined as a sum of local energies. Many combinatorial optimization problems can be solved through searching for a specific configuration of spins, ⃗ σ = σ 1 , σ 2 , . . . , σ n , that minimizes H (⃗ σ ) [14] . Current Ising computing mainly focuses on 2D-lattice Ising models, however, the rigid 2D structure limits the ability to represent optimization problems. Furthermore, to embed an arbitrary optimization problem in a fixed lattice Ising model, a time-consuming graph minor embedding [7] [23] [8] is required which is an NP-hard problem itself. Instead of that, we focus on a more general Ising model with arbitrary topology, where each spin is potentially connected with all the other spins. By doing this, real-world applications can be directly mapped to Ising models without time-consuming preprocessing steps.
Simulated Bifurcation vs. Annealing
Simulated annealing (SA) is a heuristic algorithm mimicking the thermal annealing in physics. Essentially, it improves over the hill-climbing heuristic algorithm, by utilizing randomness to avoid being trapped at a local minimum while the randomness decreases when getting closer to the global minimum. When deploying the simulated annealing on the Ising model, at each iteration, it randomly chooses and flips one spin according to the change of the energy [3] . When using simulated annealing on an Ising model, to guarantee convergence, neighbor spins cannot be evaluated and updated simultaneously [18] [19] . Therefore, to improve performance, current state-of-the-art Ising computing methods evaluate and update non-neighbor spins simultaneously. However, when processing an arbitrary-topology Ising model, since each spin is possibly connected with all the other spins, spin updates have to be sequentially one-by-one scheduled affecting solution searching time. Simulated bifurcation (SB) is a new approach which was recently proposed to solve Ising model problems, which is a heuristic algorithm inspired by quantum adiabatic optimization using a nonlinear oscillator network [6] . Essentially, simulated bifurcation uses a pair of ordinary differential equations to search for an approximate solution. Compared to simulated annealing, simulated bifurcation does not require a one-by-one update to guarantee convergence [6] , exposing massive processing parallelism which is hardware friendly. The original simulated bifurcation algorithm was designed to solve fully-connected Ising models, however, most of the real-world Ising models are sparse and locally connected, such that a large number of multiplications outputting zeros are useless. Moreover, the adjacency matrix of J requires O (|V | 2 ) space complexity, which limits the size of the Ising model a machine can process. To improve performance of the simulated bifurcation on real-world sparse Ising models, we proposed a new graph-based edge-centric simulated bifurcation (GraphSB). By treating the simulated bifurcation as a sequence of operations over edges, only nonzero connections are traversed. By storing the adjacency matrix as a COO list, space complexity is reduced from O (|V | 2 ) to O (|E|). As shown in Table 1 , compared with quantum annealing [10] [5], digital annealing [7] [23] [21] , and original simulated bifurcation [6] which assume the whole topology can fit on the device, GraphSB stores the model using DRAM such that arbitrary-size models can be processed by the framework while providing speedup over the classical simulated annealing approach [12] .
Approach
Speed Scalability 
Hardware Optimization Techniques
Furthermore, we proposed three optimizations to accelerate the edge-centric simulated bifurcation. By partitioning the graph such that each partition can fit into the on-chip block memory, data are fully reused and the framework can process very large Ising models which have to be stored in DRAM. By splitting the vertex buffer into multiple banks and scheduling edges to avoid bank conflicts, multiple edges can be processed concurrently to fully utilize FPGA parallelism and DRAM bandwidth. By designing a hazard detection unit to monitor the processing pipeline, data hazards are avoided such that the datapath can be aggressively fully pipelined, and stalled and flushed on a conflict.
Specifically, we made the following contributions: • We proposed an edge-centric simulated bifurcation algorithm to solve sparse Ising models. DRAM bandwidth and computation resources are fully utilized by skipping zeros. • We designed a dedicated hardware architecture to accelerate our proposed algorithm and proposed three optimizations to further improve performance. • Leveraging the Ising model, we designed an FPGA-based combinatorial optimization problem solver which is flexible and general to solve a large number of optimization problems. • We prototyped a complete working system and experimented using the max-cut as a case study. By experimenting, our platform achieves up to 10.91 times speedup compared to a state-of-the-art GPU baseline implementation using simulated annealing.
Background And Related Work 2.1 Ising Model And Ising Computing
Ising model is an abstract mathematical model to describe ferromagnetism in statistical mechanics. An Ising model typically consists of a set of spins interconnected with each other with a weighted edge. The local energy or Hamiltonian of each spin is defined as:
where J i, j is the interaction weight between σ i and σ j , and h i is a bias or external force acting on σ i . The total energy is defined as a sum of local energies:
Ising model problem or Ising computing is to search for a setting ⃗ σ = σ 1 , σ 2 , · · · , σ n such that the total energy H (⃗ σ ) is minimized, and the solution configuration ⃗ σ is called the ground state. Equation 1 can be reformed as:
When all the other spins are fixed, minimization of a spin's local energy leads to a total energy minimization. To minimize local energy, the local spin σ i is determined by interaction weights and neighbor spin values, as shown in Equation 3. If S > 0, σ i = 1, otherwise, σ i = −1. Moreover, multiple uncorrelated spins with no interactions can be locally updated simultaneously. A typical Ising model topology is a 2D lattice where each spin connects to four nearest neighbors as shown in Figure 1a . However, this model limits the expressibility of the model to handle general problems with arbitrary connections. In this work, since we want to use Ising model to express all combinatorial optimization problems, we focus on Ising models with arbitrary topologies as shown in Figure 1b .
Since N number of spins contain 2 N number of possible configurations, the complexity of searching for the ground state explodes exponentially with problem size, such that an exhausted searching is intractable. Furthermore, as shown in Figure 2 , the energy landscape of an Ising model is normally multimodal such that the searching process is vulnerable to local minima. It was shown that many computationally intractable problems can be mapped to Ising models, such as graph coloring [17] , maxcut [3] [6] [7] [23] , and some other problems [14] . Throughout this paper, we use the max-cut problem as a typical application since it was commonly used in other works, and it was one of the original problems shown to be NP-hard by Karp [11] . Given an undirected weighted graph G = (V , E) and nodes are divided into two groups, U and V − U , a cut is defined as a set of edges crossing two groups,
and a cut value is the total weight of edges across the two groups. An example cut is shown in Figure 3 . The max-cut problem is to find a partitioning scheme to divide nodes such that the cut value is maximized. The max-cut problem has a lot of applications in EDA and VLSI design. A Hamiltonian of an N-spin Ising problem without external magnetic fields is formulated as:
By setting the coupling coefficients as J i, j = −w i, j , where w i, j = w j,i is the weight between the v i and v j in the graph, the maxcut problem to find a good division is transformed into assigning {−1, +1} to each spin such that the constructed spin model has a minimal energy. The max-cut solution is mapped to the ground state of the Ising model and {−1, +1} denotes two node groups. Figure 3 : An example cut across two node groups, gray nodes and white nodes Given an Ising model solution, the cut value can be calculated as:
Simulated Annealing
Simulated annealing [12] is a heuristic algorithm mimicking the thermal annealing in physics. Unlike hill climbing which is vulnerable to local minima, simulated annealing utilizes randomness to avoid being trapped. Essentially, when dealing with an optimization problem with a multimodal energy landscape, a temperature parameter is initially set to a high value and decays over the annealing process, and this temperature controls the degree of randomness. At each step of an iterative solution searching process, neighbor states are evaluated and a new state is accepted with a probability even if the solution quality of the new state is worse than the current state. The key idea of simulated annealing is that initially when searching space is very large, setting a high randomness by giving a high temperature could prevent the searching process from being trapped at local minima, and a low randomness is desired when the searching space becomes narrow as the process runs such that the process can eventually converge to an approximate solution.
Simulated annealing is an iterative algorithm and at each iteration, a spin σ i is randomly chosen, and the local energy change ∆E i by the flip of this spin is calculated as:
The flip of σ i is performed following the Metropolis rule [15] , which a flip is accepted if
where T is the temperature and r is a random number between 0 and 1. Isakov et al. [9] proposed different optimizations to improve the performance of simulated annealing, including sequentially updating spins instead of randomly picking a spin, and forward computation of ∆E i such that SA is accelerated when temperature is low. Other works [3] [7] [8] [21] adopted a simplified SA avoiding expensive floating-point operations of the Metropolis criteria. As mentioned in Section 2.1, uncorrelated spins can be updated simultaneously, therefore, to further accelerate simulated annealing, [18] and [22] evaluate and flip uncorrelated spins in parallel, however, only lattice Ising models are considered in the literatures.
Although simulated annealing is an efficient heuristic algorithm to solve Ising models, they are not hardware friendly since only uncorrelated spins can be evaluated and flipped concurrently to guarantee the convergence [6] . This is not a big deal for lattice Ising models since grid models can be easily scheduled, and obvious parallelism is easy to exploit. When dealing with arbitrary-topology Ising models, since each spin is potentially connected with any other spin, in the worst case, all spin updates are serialized. An arbitrarytopology Ising model can be preprocessed with the graph vertex coloring to divide spins into multiple batches such that spins within each batch can be updated simultaneously, but graph coloring itself is an NP-hard problem. Furthermore, since each batch is different, the variability increases the complexity of hardware design.
Simulated Bifurcation
Simulated bifurcation (SB) is a new heuristic algorithm proposed by Goto et al. [6] , inspired by quantum adiabatic optimization using a nonlinear oscillator network. It numerically simulates adiabatic evolutions of classical nonlinear Hamiltonian systems which exhibit bifurcation phenomena. In this work, we will not discuss the derivation of the algorithm, and we refer interested readers to [6] for more details. Essentially, the simulated bifurcation consists of a pair of differential equations.
The real part and the imaginary part, x i and y i are a pair of canonical conjugate variables, such as a position and a momentum, for the i-th Kerr-nonlinear parametric oscillator (KPO) [6] . The symplectic method [13] is used to solve the differential equations Eq. 10 and Eq. 11.
The complete SB algorithm is as follows. Initially, there are two vectors ⃗
x and ⃗ y, where ⃗ x is initialized to zero and ⃗ y is a random vector chosen from [−0.1, +0.1]. K, ∆, and ξ 0 are hyperparameters which need to be manually tuned. As recommended by the original paper [6] , K and ∆ are normally set as 1.0, and ξ 0 is set as
where N is the total number of spins and σ is the standard deviation of the adjacency matrix J . Increasing p(t ) linearly from 0 to 1, Eq. 12 and Eq. 13 are solved over all spins, and the sign of final x i provides the i-th spin of the approximate solution. As recommended by [6] , the symplectic method is stable when ∆ t ≤ 0.5.
Compared to SA, SB has three major advantages: (1) In each iteration, all spins are updated simultaneously. In contrast to SA which is an asynchronous algorithm such that a sequential one-byone update over spins is required, SB is a barrier-synchronization algorithm which each spin updates using the other spins' lastiteration values. At each iteration, there is no dependency between spin updates, such that spins can be updated concurrently. This difference between SB and SA indicates that SB exposes more massive processing parallelism and is easier to be accelerated on hardware.
(2) SB does not rely on randomness. Unlike SA relying on randomness to skip out of local minimum, SB requires a shorter solution searching time by avoiding randomness. (3) SB is amenable to matrix operations. As shown in Eq. 14 and Eq. 15, SB can be easily rewritten as a sequence of matrix operations, such that state-of-the-art matrix operation libraries can be utilized.
Hardware-Accelerated Ising Computing
In [21] , a novel CMOS-based simulated annealing processor was proposed in which spins are stored in an SRAM array, and all operations are implemented using NMOS FETs. [7] proposed two operations approximating the random behavior to avoid random number generators to save hardware resources and improve area efficiency. However, [21] has a 3D lattice topology and [7] has a 2D lattice topology, and rigid topologies limit the application of the hardware. [23] proposed a new hierarchical topology by placing units in a 2D lattice topology while multiple spins are fully connected within each unit. ASIC implementations with fixed topologies are not flexible and can only handle a specific set of problems. Given a real-world application with a large number of connections, a timeconsuming preprocessing is required to embed connections into the fixed hardware connection topology, such as graph minor embedding [7] [23] [8] which is another NP-hard problem. There are also a lot of novel works for Ising computing on GPUs [2] [1] [19] , however, they focus on Monte-Carlo simulation of Ising models instead of using Ising models to solve real-world optimization problems. Furthermore, they only focus on nearest neighbor models which are not flexible to describe practical applications. Cook et al. [3] proposed a GPU acceleration framework to improve the performance of simulated annealing to solve Ising models. By breaking the SA's one-by-one update rule to expose more parallelism, which is friendly to the GPU's SIMD architecture, GPU's threads are fully utilized, even though they did not discuss the correctness of the rule breaking. Tatsumura et al. [16] prototyped an FPGA-based simulated bifurcation machine, however, they store all the coupling coefficients to on-chip RAM such that the scalability is limited.
FPGA-Accelerated Graph Processing
There are a lot of FPGA-based architectures to accelerate graph processing due to its importance in data mining, social network analysis, artificial intelligence, etc. Zhou et. al [24] proposed an FPGA-based edge-centric graph processing. By partitioning edges Session: Applications I FPGA '20, February 23-25, 2020, Seaside, CA, USA and sorting edges according to destination vertices, random DRAM accesses are greatly decreased such that memory bandwidth can be efficiently utilized by sequential accesses. FPGP [4] partitions vertices into intervals and then partitions edges into sub-shards in 2 dimensions according to source vertices and destination vertices. When processing each edge sub-shard, source vertex intervals and destination vertex intervals are pre-buffered first, and then edge sub-shards are sequentially streamed in from DRAM. Due to the pre-buffering, vertices can be accessed within one clock cycle, such that edge processing throughput could be highly improved. And this two-dimension edge partitioning and vertex pre-buffering has already become a common technique to process large-scale graphs.
In this work, we also adopt this grid partitioning approach. In contrast to FPGP, GridGAS [27] focused on a special set of problems where vertices are much more than edges, such as a graph neural network where each vertex is a vector of thousands of features, and they proposed a vertex streaming framework. FASTCF [25] formulated collaborative filtering into a bipartite graph and proposed three-level optimizations to accelerate collaborative filtering. Our hardware optimization efforts are inspired by preprocessing techniques used in FASTCF, and we tailored preprocessing techniques specifically to the Ising model graph.
Graph-Based Simulated Bifurcation
As discussed in Section 2.3, simulated bifurcation can be easily rewritten as a sequence of matrix operations, as shown in Eq. 14 and Eq. 15. The original algorithm was designed to solve fullyconnected Ising models, however, most of real-world Ising models are sparse and locally connected, such that most of entries of the adjacency matrix J are zeros. Therefore, when dealing with a sparse Ising model, a lot of computation resources are wasted on operations outputting zeros. In addition, an adjacency matrix requires O (|V | 2 ) space complexity, which the problem size is limited and a lot of memory resources are wasted on storing zero entires. To deal with a sparse Ising model, the adjacency matrix J can be stored in compressed sparse row (CSR) or compressed sparse column (CSC) format, and the matrix-vector multiplication J · ⃗
x can be replaced with a sparse matrix-vector multiplication (SpMV). However, the irregular memory layout of the sparse matrix generates a lot of random DRAM accesses, which is not hardware friendly, especially for FPGAs [26] . Therefore, we rewrite the algorithm into an edge-centric graph algorithm, such that by treating the simulated bifurcation as an update operation over edges, only non-zero connections are traversed, as shown in Algorithm 1. By storing the adjacency matrix as a coordinate list (COO), space complexity is reduced from O (|V | 2 ) to O (|E|). Furthermore, by reformatting the algorithm into an edge-centric graph update operation, edges are stored and traversed sequentially. Since normally in a graph, edges are much more than vertices, sequentially streaming edges from DRAM could greatly increase utilization of the DRAM bandwidth.
Similar to the original simulated bifurcation algorithm, the graphbased simulated bifurcation is an iterative algorithm over a predefined number of steps, and at each step, all edges are traversed sequentially. We treat an Ising model as an undirected graph, such that when traversing an edge, the source vertex and the sink vertex update each other concurrently. To accelerate the proposed graph-based simulated bifurcation, there are three major challenges we need to overcome.
First, when an edge is streamed in from DRAM, the sink vertex and the source vertex are required to start the update, therefore, it is desirable to pre-buffer vertices to the on-chip memory such that they can be retrieved within one clock cycle. However, when dealing with a large Ising model, all the vertices cannot fit in the on-chip memory, such that DRAM accesses are needed to fetch vertices from external storage. Beyond that, since the order among edges' source vertices and sink vertices are not guaranteed, in the worst case, each edge's sink vertex and source vertex are different from the previous edge, such that DRAM is randomly accessed. A random DRAM access incurs a long latency and consequently the edge-streaming pipeline stalls and the accelerator hardly provides speedup. Therefore, a dedicated graph partitioning scheme is required such that vertices of each subgraph can fit in the on-chip memory to improve data reuse.
Second, as shown in Algorithm 1, when processing an edge, both sink vertex and source vertex are read, updated and written back to memory. In a multi-stage pipelined datapath, a data hazard refers to a situation that two pipelines read, update, and write the same vertex, and the second pipeline reads data before the first one finishes the modification. To avoid data hazard, the latter has to stall and wait until the former finishes updating and writes the modified vertex back to memory. If edges are not scheduled carefully such that consecutive edges update same vertices, the processing pipeline stalls frequently and a sequential processing is consequently incurred providing no speedup. Therefore, edges need to be rescheduled such that edges accessing same vertices are not streamed in consecutively.
Third, FPGA relies on massive parallelism to increase processing throughput and reduce latency. However, the on-chip Block RAM on FPGAs only has two ports (one for write and one for read), such that each clock cycle, only one write request and one read request can be served. When multiple edges are streamed in, multiple write and read requests are issued concurrently, and these requests have to be serialized such that parallelism is limited to one. Therefore, a multi-port on-chip memory is desired to serve multiple requests. Moreover, when multiple ports access the same memory bank, which is called the bank conflict and will be discussed in Section 4.5, requests are also serialized. Consequently, a low-overhead edge rescheduling to make sure within each batch edges access different banks is required.
To fully utilize DRAM bandwidth and FPGA hardware resources, we proposed three software-hardware co-optimizations, graph partitioning (Section 4.4), memory banking (Section 4.5), and data hazard resolving (Section 4.6). For each optimization, a one-shot-cost preprocessing and an efficient hardware architecture is provided.
Overview Architecture
The overall architecture is as shown in Figure 4 . Edge shards and vertex intervals, which will be discussed in the later sections, and metadata are stored in DRAM. A data controller is responsible for scheduling data transfer between processing engine and DRAM. The processing engine contains B number of processing elements. A hazard detection unit (HDU) monitors the data hazard and pushes an edge batch to the processing engine if no conflict.
Shards
Intervals Metadata 
Datapath
The datapath within each PE is as shown in Figure 5 . When an edge is streamed in, both source vertex and destination vertex are updated. Each vertex is stored as a tuple < x, y, deд, isV isited >, deд indicates the degree of the vertex, and isV isited indicates if the source is visited before, since in single iteration x is only updated once. Each edge is stored as a tuple < srcIdx, dstIdx, weiдht >, where srcIdx and dstIdx indicate offsets of the source vertex and the sink vertex in the source vertex interval and the sink vertex interval respectively. The datapath is fully pipelined such that one edge can be processed per clock cycle. 
Graph Partitioning
To simplify the discussion later, for an undirected edge e i, j , we refer the first vertex v i as the source vertex, and the second vertex v j as the destination/sink vertex, but there is no difference between two expressions e i, j and e j,i . "Destination" and "sink" will be used interchangeably throughout the work. To deal with a large Ising model such that vertices and edges cannot fit in the on-chip memory, a model is divided into multiple partitions, such that when processing each graph partition, all required vertices are pre-buffered into on-chip block memories. Due to the fact that the edge-centric algorithm will generate random accesses to vertices, buffering vertices increases vertex reuse, consequently greatly decreasing random DRAM accesses. Furthermore, by pre-buffering vertices, edges can be streamed in without stalls, which eventually highly utilizes DRAM bandwidth, since sequential DRAM accesses provide much higher bandwidth. Given a large Ising model, vertices are partitioned into P intervals, and edges are partitioned into P (P +1) 2
shards. Each edge shard P i, j (i ≥ j) contains all undirected edges between interval I i and interval I j . P is selected such that two intervals can both fit in on-chip memories, 2 · M · β ≤ K, where M is the number of vertices in each interval, β is the size of each vertex in bytes, and K indicates available on-chip memory resources in bytes. Algorithm 2 Scheduling of vertex intervals and edge shards for i ← 0 · · · P do Load I i as destination interval Load I i as source interval for e ← S ii do Update e.src and e.dst end for for j ← (i + 1) · · · P do Load I j as source interval for e ← S ji do Update e.src and e.dst end for Store I j back to DRAM end for Store I i back to DRAM end for The scheduling of vertex intervals and edge shards are shown in Algorithm 2. I i , i ∈ [0, P ) is loaded as the destination interval, and I j , j ∈ [i, P ) are sequentially loaded as the source interval. After that, edges within shard S ji are streamed in sequentially. When processing S ii , I i is loaded into both the source interval buffer and the destination interval buffer. When an edge e pq streamed in, v p is updated in the source interval buffer while v q is updated in the destination interval buffer. Therefore, the modification within I i is inconsistent, since the source interval buffer (I i ) and the destination interval buffer (I i ) both contain a part of updates to I i . And since I i is still needed to process S ji , j ∈ [(i + 1), P ), all modifications should be committed before using I i . Therefore, after processing S ii , the source interval I i and the destination interval I i should both be written back to DRAM, and reloaded to the destination vertex buffer for the later use. This causes performance overhead of writing two vertex intervals to DRAM and reading one vertex interval from DRAM. To handle this, when partitioning edges into shards, if the source vertex and the destination vertex belongs to the same vertex interval, e pq , v p ∈ I i and v q ∈ I i , we also add e qp into the edge shard. By doing this, since e pq and e qp are both contained in the edge shard, when processing S ii , v p and v q in source interval buffer and destination interval buffer are all updated. After processing, the sink interval buffer contains all the updated vertices and can be directly used for later updates without any need to commit the change to DRAM first. By doing this, for each vertex interval I i , two vertex interval reads and one vertex interval write are saved. 
Interval Bu er
Multi-Write Ports Multi-Read Ports Figure 6 : Multi-bank interval buffer To increase parallelism and improve DRAM bandwidth utilization, multiple edges are streamed in concurrently. However, each block memory contains at most 2 ports, one for read and one for write. To support multi-edge processing, multi-port memories are desired. To increase memory ports, we further split each interval buffer to B banks, such that each bank is stored as a separate onchip block memory. Among the split B banks, vertex v i is put at i B offset within the (i%B)-th bank. By splitting the interval memory into B banks, up to B writes and B reads can be served in parallel, such that a batch of B edges can be streamed in and processed concurrently. However, when processing multiple edges in parallel, if two or more edges access the same interval bank, which is called a back conflict, accesses to the bank have to be serialized, which in the worst case, all the B edges are serialized and no parallelism is utilized. Therefore, edges within each shard are scheduled such that there is no bank conflict within each B-edge batch. We adopts a simple greedy algorithm to systematically reschedule edges to guarantee edges without conflicts are grouped within a batch, as shown in Algorithm 3. Edges within a shard are further grouped into B 2 number of sub-shards, SS i j , i, j ∈ [0, B), and SS i j contains edges of which source vertices belong to the i-th bank and destination vertices belong to the j-th bank. A matrix table T ∈ R B×B is constructed with each entry T [i, j] containing the number of edges within SS i j . After that, an iterative edge grouping is conducted by choosing an entry from each row, which contains the most number of edges and has not been chosen before and the column is not selected in previous rows, such that at each iteration, B number of entries are selected each with a different column index and a different row index from others. At each iteration, after selecting B entries, we iteratively pick up one edge from each sub-shard and combine them into a batch. If edges within a sub-shard are empty, an edge with an invalid flag is picked. Algorithm 3 is an example rescheduling algorithm for B = 4. Algorithm 3 Rescheduling edges to avoid bank conflicts Figure 7 : Hazard detection unit If two processing pipelines access the same vertex, the second one has to stall and wait for the completion of the first. Frequent pipeline stalls will greatly affect the performance because FPGAs rely on the dedicated pipelined datapath to improve throughput. Therefore, edges within each shard need to be rescheduled and further partitioned into multiple groups, such that there is no more than one edges accessing the same source vertex or destination vertex within each group. We formulate this problem into an edgecoloring problem. For each edge shard, an undirected graph is constructed and we want to color edges such that each edge has a different color with neighbor edges, while a neighbor edge is defined as an edge has a common vertex with the current edge. As shown in Algorithm 4, we adopt a simple greedy edge-coloring algorithm by using at most 2 × max_deд number of colors, where max_deд denotes the maximum degree of the constructed graph. Edges are sequentially traversed and colored, and each edge is colored using the first available color. By edge coloring the constructed graph according to the edge shard, edges with the same color access different source vertices and destination vertices, such that the datapath can be fully pipelined. Even though more efficient edge coloring algorithms can apply to this problem, since this is only a one-shot-cost preprocessing, we find this greedy algorithm is already good enough. end for However, data hazard is still possible when edges with one color is streamed in while edges with the former color are still being processed by the multi-stage pipeline. To resolve this data hazard, we design a data hazard detection unit (HDU) as shown in Figure 7 , such that when an edge is ready to be pushed into the pipeline but it uses a vertex which is used by an edge being processed in the pipeline, the edge is stalled and not pushed into the pipeline. To realize this, a shift register array is constructed to record what vertices are being processed in the pipeline. The depth of the shift register array is the same as the depth of the pipeline L, such that as soon as the conflict edge is finished, the stalled edge can be pushed into the pipeline. The edge queue is implemented using a first word fall through (FWFT) FIFO [20] . At each clock cycle, if there is a edge batch waiting to be processed, HDU compares the candidate edge batch with edge batches in the shift register array. If there is no conflict, the edge batch is streamed in and also logged in the shift register array. If there is no edge batch waiting to be processed or the edge batch is stalled, HDU pushes an edge batch tagged as invalid to the shift register array.
Experiments 5.1 Experimental Setup
We implemented our FPGA design on ZCU106 using an Ultra-Scale+ XCZU7EV-2FFVC1156 FPGA. By developing the pipeline using Vivado HLS, the depth L was set to 42. The clock frequency of our design was set as 150 MHz. Throughout experiments, the number of banks was set to 8 by default and used in the comparison if not specifically noted. We tested our design on G-set benchmark problems 1 problems which were used in the baseline GPU-based Ising computing [3] . In G-set, each benchmark graph is stored as an edge list, where each edge is stored as a tuple {src, dst, wt }, wt ∈ {−1, +1}.
Beyond that we also tested the FPGA design on three custom large datasets which could not fit into on-chip memories such that vertices and edges have to be stored in DRAM. The statistics of the three large datasets are as shown in Table 2 . We compared our proposal, graph-based simulated bifurcation (GraphSB) with [3] which is the state-of-the-art GPU accelerator design to use simulated annealing to solve Ising model problems. The GPU accelerator was implemented on a NVIDIA Tesla K40c GPU which has 3072 CUDA cores and a 12GB memory. Furthermore, we also implemented a simulated bifurcation code using the sparse matrix-vector multiplication on a CPU written in Python with Numpy libraries. The detailed low-level implementation of Numpy is the Intel MKL library. The CPU baseline was implemented on an Intel Xeon Silver 4110 CPU with 16 cores. For each implementation on each benchmark, we ran 1000 steps and repeated 10 times to calculate average latency of finishing 1000 steps, which is the same as [3] .
Performance Comparison
We first compared all the three implementations on small datasets which could fit into the on-chip memory such that the graph partitioning is unnecessary. Data loading time was not considered and GPU performance data were from the original paper [3] . The preprocessing time for the FPGA implementation was also included. As shown in Figure 8 , When dealing with small datasets, CPUbased SB performed as well as the GPU-based SA, and on a few datasets, the CPU baseline even outperformed the GPU baseline a little bit. And the gap between the GPU baseline and the CPU baseline becomes larger as the problem size grows. This proves the power of the simulated bifurcation when solving Ising model problems by exposing massive parallelism such that even the CPU can greatly get benefit from. Our graph-based SB implementation on the FPGA outperformed the other two baselines on all the datasets. The performance gain mainly comes from the dedicated hardware and hardware-software co-optimizations. We only compared the FPGA implementation and the CPU implementation on custom large datasets, since the GPU baseline [3] did experiments on some custom large datasets but they were not open-sourced. As shown in Figure 9 , the FPGA design greatly outperformed the CPU baseline on large datasets, and this performance improvement mainly comes from the vertex data reuse by partitioning the graph such that each can fit into the on-chip memory.
Accuracy Comparison
Furthermore, we compared accuracy among three implementations. Best known cut values and average cut values of SA on GPU were reported in [3] , and we measured average cut values and best cut values over 10 repetitions for our FPGA design. As shown in Figure 10 , our FPGA design is almost as good as GPU-based simulated annealing, and we expect higher accuracy can be achieved by tuning hyperparameters more carefully. Hyperparameters can be tuned using the grid search, and we leave this to the future work to design an automatic and systematic optimizer. Furthermore, since the FPGA design is much faster than the GPU design, within the same time budget, more steps can be done on the FPGA implementation, consequently getting better solutions.
Performance Breakdown
We further broke down the execution time of our FPGA design to analyze effects of different operations. As shown in the Table 3 , when running 1000 steps, the one-shot preprocessing overhead is extremely small and is easily mitigated by the runtime. And during the runtime, the data communication time (vertex loading from and storing back to DRAM) spends almost the same time as edge processing. This is because ZCU106 only has one DRAM and only one port is available, such that when streaming edges vertices for the next edge shard cannot be pre-buffered. Consequently, whenever an edge shard is finished, the pipeline is stalled and the updated vertex interval is written back to DRAM and a new vertex interval is loaded from DRAM, after vertex loading is done, the pipeline is restarted and edges are streamed again. This communication time can be easily overlapped by double buffering if two or more DRAMs are available and the pipeline will not have to stall due to the vertex data transfer. We leave this performance improvement to our future work.
Performance vs. Parallelism
We further studied the effect of parallelism B on the performance. When increasing the number of banks, more edges can be processed concurrently, however, the number of data dependencies and the total number of edges after preprocessing are also affected. We compared three settings (8, 4, 2) of B and studied the effect and tradeoff of the parallelism. As shown in the Figure 11 , when increasing the number of parallelism, the performance is greatly improved since more parallelism can be leveraged. However, when grouping multiple uncorrelated edges into a batch, some edges with invalid flags are used for padding, the total number of edges after regrouping is also affected. As shown in the Figure 12 , on the three large datasets, the total number of edges when B = 8 is about 1.5 times more than other settings, which causes 1.5 times storage overhead. Beyond that, since a edge batch contains more edges, the number of data dependencies is also affected. We did a tradeoff study on three small datasets and three large datasets respectively. On small datasets, the number of data dependencies is the most when B = 4, and for the large datasets, B = 8 has the most data dependencies. The relationship between data dependencies and parallelism highly depends on the Ising graph, and it proves that some hand-tuning is necessary to fully exploit the potentials of the framework.
Computational Bottleneck Analysis
As shown in Figure 14 , we further did a computational bottleneck analysis of our current implementation. Our quantitative analysis was conducted on the three large datasets. As discussed earlier, since DRAM only has one port, the vertex data transfer and the edge processing cannot be overlapped. In our implementation, each vertex is stored as 128 bits and each edge is stored as 64 bits. The bandwidth of DDR4, which is used in the implementation, is 19.2 GB/s. According to our measurement, the bandwidth of vertex data communication is 17.9 GB/s which already saturates the external memory bandwidth. Therefore, for shortening data communication time, more external memory ports are necessary such that it can be overlapped by computation. The bandwidth of edge processing G13   G34  G19  G21  G20  G18  G51  G53  G54  G50  G47  G40  G39  G42  G41  G9 Figure 13 : Effect of parallelism on data dependencies is 8.87 GB/s, which is almost half of the DDR4 bandwidth. However, constrained by the data hazard unit, parallelism could not be increased from 8 to 16 to fully utilize the bandwidth. In the data hazard unit, whenever an edge batch is streamed in, the batch needs to be compared with all the edge batches which are already stored in the shift register. Given the parallelism B and the length of shift register L, at each clock cycle, L × B 2 number of edge comparisons are required. This generates a large MUX in the circuit, and in our experiment, the clock frequency is significantly decreased from 150 MHz when further increasing the parallelism. 
Conclusion
Effectively solving combinatorial optimization problems poses tremendous challenges due to their NP-hardness. Yet, it is crucial to the success of many engineering and scientific endeavors. Recent discovery of quantum adiabatic optimization using a nonlinear oscillator network offers an unprecedented opportunity to efficiently solve many difficult combinatorial optimization problems with great accuracy. Unfortunately, directly utilizing conventional FPGA digital fabrics to simulate adiabatic evolutions of classical nonlinear Hamiltonian systems proves to be quite cumbersome and hardware-inefficient. The key contribution of this work is to propose a novel graph-based design framework for synthesizing efficient FPGA-based circuits to massively simulate adiabatic bifurcations that can maximize both memory and compute parallelism by fully exploiting the inherent advantages of FPGA devices. With multiple benchmarks of different sizes, we have shown that our method can outperform the state-of-the-art simulated annealing of both CPU and GPU by more than 10 times in performance.
