Abstract-Graphs are important in many applications however their analysis on conventional computer architectures is generally inefficient because it involves highly irregular access to memory when traversing vertices and edges. As an example, when finding a path from a source vertex to a target one the performance is typically limited by the memory bottleneck whereas the actual computation is trivial. This paper presents a methodology for embedding graphs into silicon, where graph vertices become finite state machines communicating via the graph edges. With this approach many common graph analysis tasks can be performed by propagating signals through the physical graph and measuring signal propagation time using the on-chip clock distribution network. This eliminates the memory bottleneck and allows thousands of vertices to be processed in parallel. We present a domainspecific language for graph description and transformation, and demonstrate how it can be used to translate application graphs into an FPGA board, where they can be analysed up to 1000x faster than on a conventional computer.
I. INTRODUCTION
Network science is an emerging field combining models, theories and methods across application areas that involve processing large-scale graphs, e.g. social, telecommunication and biological networks [1] . A lot of on-going research is dedicated to the discovery of new algorithms for processing graphs, particularly focusing on improving their asymptotic complexity, where the runtime of an algorithm is characterised using the big O notation and constant factors are ignored. For example, breadth-first search, a textbook algorithm for computing the shortest path between two vertices in an unweighted graph, takes O(|V | + |E|) time to process a graph with the vertex set V and the edge set E, e.g. see [2] . Here the big O notation hides the constant factors c 1 and c 2 in the more precise runtime bound c 1 |V |+c 2 |E| for the sake of convenience and simplicity. The research on graph algorithms, therefore, tends to focus on improving the asymptotic complexity while paying little attention to constant factors.
In this paper we take an orthogonal approach: our primary focus is the improvement of the underlying constant factors by physically embedding graphs in silicon and making vertices communicate locally and directly, which is radically different from conventional graph processing in software that involves an indirect traversal of a graph data structure stored in memory. We do not claim any asymptotic improvements to classic graph algorithms, but we demonstrate that the proposed approach can provide 2-3 orders of magnitude speedups for sizable real-life graphs compared to a software implementation.
We automate the presented approach by developing a graph transformation language that allows the user to parse application graphs, perform common graph transformations, write graphs into an FPGA in the form of a circuit netlist, and execute graph analysis queries.
It is important to note that the presented approach is suitable only for those applications where the overhead associated with embedding the graph into an FPGA is negligible compared to the graph processing runtime. For example, executing a single shortest-path query takes only a fraction of the time required to synthesise the FPGA netlist. One therefore needs to execute thousands of such queries in order to achieve any improvement compared to a conventional software implementation. Our case study requires the execution of millions of graph analysis queries, which justifies the upfront cost of graph embedding.
The idea of using FPGAs to accelerate graph processing is not new. However, to the best of our knowledge the existing approaches follow the conventional Turing-style computing paradigm, where the graph is stored in memory and the algorithm is executed by a collection of processing cores that manipulate the graph representation by reading and writing to the memory. These approaches typically achieve the acceleration through innovative memory architectures that exploit the flexibility of FPGAs [3] [4] and/or through custom processing cores optimised for graph algorithms [5] [6] . In this paper we abandon the conventional data-compute separation: we synthesise both the data (the graph) and the compute (the mathematical formulation of the breadth-first search algorithm) together into an FPGA. This solution is not generalpurpose, but can achieve much higher acceleration factors.
The contributions of the paper are:
• We present a domain-specific language for manipulating and embedding graphs in silicon in Section II. The language is implemented in Haskell [7] , which allows us to reuse standard functional programming abstractions to efficiently manipulate graphs.
• Execution of graph analysis queries requires support of an on-FPGA infrastructure that is developed in Section III.
• We demonstrate the presented methodology on the example of biological network analysis in Section IV. 1 > g1 <-readGraphML "example.graphml" 2 > print g1 3 edges [("A","B"), ("B","C"), ("B","D"), ("C","E"), ("D","E")] 4 5 > g2 = mergeVertices ["C","D"] "CD" g1 6 > print g2 7 edges [("A","B"), ("B","CD"), ("CD","E")] 8 9 > g3 = splitVertex "CD" ["C","D"] g2 10 > print g3 11 edges [("A","B"), ("B","C"), ("B","D"), ("C","E"), ("D","E")] 12 13 > relevantProtein p = p`notElem`["A","D","CD"] 14 > induce relevantProtein g3 15 edges [("B","C"), ("C","E")] 16 17 > writeVHDL g3 "circuit.vhdl" 
II. GRAPH TRANSFORMATION LANGUAGE
This section presents Centrifuge, a framework for constructing, manipulating, and embedding graphs in silicon, where they can be analysed using the accelerator presented in Section III. The framework provides a domain-specific language for graph transformation, a parser for GraphML files, and a hardware synthesis engine for generating VHDL netlists. It is open-source and publicly available under the MIT license [8] .
A. Graph Data Type
Centrifuge is built on algebraic graphs [9] , which allows us to reuse existing functional programming abstractions, and formally prove the correctness of graph transformation algorithms. Graphs are represented by the abstract data type Graph a, where a corresponds to the type of the graph vertices. For example, Graph Int and Graph String are graphs whose vertices are integers and alpha-numeric strings, respectively. Our case study (Section IV) is dedicated to protein-interaction graphs, further referred to simply as networks, for the purpose of drug discovery. Vertices of these networks are proteins, and edges are known protein interactions. The following type synonyms are defined for convenience:
type Protein = String type Network = Graph Protein That is, a Protein is represented simply by its name, and a Network is a graph whose vertices are proteins.
The developed graph transformation language is fully polymorphic with respect to the type of graph vertices, however we only discuss protein-interaction networks in the rest of the section for the sake of simplicity. We refer an interested reader to the framework documentation [8] , which provides specifications and examples of using fully polymorphic versions of the functions we discuss.
B. Reading and Parsing GraphML Files
GraphML is a popular XML-based file format for graph storage supported by most graph processing engines. The framework supports reading and parsing GraphML files using the readGraphML function, whose type is shown in line 2 of Fig. 1 . The graph transformation language is embedded in Haskell and is therefore a purely functional language, requiring all side-effects to be explicitly reflected in function types. In the case of the readGraphML function, the type says that the function takes the path to a GraphML file as the input and returns the resulting network as the output, performing some side-effects during the execution (specifically, IO operations such as reading a file). The Centrifuge API is pure, which improves the testability and scalability of the framework. 
C. Transforming Graphs
This section describes examples of graph transformation supported by the presented language. Fig. 2 shows an example of an interactive graph manipulation session, and Fig. 3 illustrates the transformations. We cover transformations relevant to the drug discovery case study, namely vertex merging and splitting, as well as computing induced subnetworks. The library provides more functionality -see the documentation.
The session starts with parsing a GraphML file (line 1 of the session in Fig. 2 ), which is given the name g1 and printed using the function print -see the graph in Fig. 3(a) .
Vertex merging can be used to model the formation of protein complexes in the context of protein-interaction networks. A protein complex is a structure where two or more proteins physically come together and form a functional unit, the complex, where all the components are required for the complex to function. To merge a list of proteins into a complex, one can use the mergeVertices function, as demonstrated in lines 5-7 of Fig. 2 . The implementation is based on the standard functional programming abstraction called functor [10] : we apply a mapping function to each vertex of a given graph, as illustrated in Fig. 3(b) , and if two vertices are mapped into the same target they are merged.
Protein complexes may be unstable and their dissociation could be modelled by splitting the corresponding vertex of the network preserving its connectivity. Lines 9-11 of the session demonstrate the use of the function splitVertices to undo the effect of merging vertices C and D. Note that the resulting graph g3 coincides with the original graph g1. Vertex splitting is implemented using another standard functional programming abstraction called monad [10] [11] , where each vertex of the graph can be substituted with a subgraph, subsequently flattening the result, as illustrated in Fig. 3(c) .
An important step in the drug discovery process is the identification of proteins that are relevant to a specific biological process, and discarding the remaining ones from the network under consideration, i.e. computing the induced subnetwork on the set of relevant proteins. This can be achieved using the induce function, see lines 13-15 in the session. The example predicate relevantProtein discards proteins A, D and CD, and can be implemented as follows:
Here the function notElem returns True if the given element does not belong to the given list. Note that discarding the non-existent vertex CD is allowed.
The implementation of induce can also be expressed in terms of the monad abstraction, where discarded vertices are substituted with empty subgraphs, which effectively removes these vertices from the graph -see Fig. 3(d) .
Another reason to remove a vertex in the drug discovery context is to account for the introduction of a drug into the system, which can bind to certain proteins, preventing them from participating in their usual interactions. The removal of a vertex v can be expressed as computing an induced subgraph on all vertices but v.
A graph transformation session is typically ended by saving the result in a GraphML file or embedding it into an FPGA for accelerating its analysis, as described in Section III. Graph embedding is performed by the function writeVHDL.
The presented graph transformation language relies on powerful functional programming abstractions, such as functor and monad, which allows the user of the framework to implement common graphs transformations in a concise and clear manner, and reuse existing functional programming libraries. Further examples of graph transformation can be found in [9] . 
III. EMBEDDING GRAPHS INTO FPGAS
While the presented language is an expressive and powerful tool for manipulating graph descriptions, executing graph algorithms in software is generally inefficient due to the memory bottleneck. To this end we present an FPGA-based hardware acceleration backend that supplements the language described in Section II. The backend provides significant improvement in shortest path calculations compared to an optimised software implementation (we compare the two in Section IV).
Even though there are other hardware-based solutions for accelerating graph computations including many-core/clusterbased systems [12] [13] and GPUs [14] [15], we argue that FPGAs are better suited for the following reasons: 1) They are more cost effective than clusters of commodity general purpose processors. 2) They allow a direct mapping between network elements and physical silicon structures (i.e. vertices can be mapped to flip-flops and edges to interconnect paths, as will be discussed shortly). This increases both the scale and absolute performance of computations compared to more abstract network representations that are necessary in GPU and DSP implementations. 3) They are programmable, so a single device can be used to analyse multiple networks, unlike ASICs. This is particularly useful if the underlying network is frequently updated (e.g. due to acquisition of new data).
A. General Architecture
An architectural overview of the accelerator is shown in Fig. 4 . At the core, the accelerator consists of an in silico instance of the graph to be processed, synthesised by mapping vertices to individual memory elements (flip-flops) and edges to combinational paths between these elements. The resulting hardware graph is encapsulated by the control circuitry to enable/disable selected vertices, coordinate computation, and read shortest-path computation results. An on-chip software processor (NIOS II) is also included to communicate with the host computer and provide an Application Programming Interface (API) for graph processing.
B. Graph Traversal in Hardware
The basic idea behind representing graphs using flip-flops and combinational paths is that we wish to perform graph traversal by propagating logic high values between flip-flops. The logic state of each flip-flop therefore indicates whether a given vertex has been visited (logic high) or not (logic low). To propagate a "visited" state between flip-flops, we OR the outputs of all vertex neighbours and use it as an input to the vertex flip-flop. This mapping scheme is illustrated in Fig. 5 .
Using this hardware representation, shortest path calculation from a starting vertex S is performed as follows. Initially, all vertex flip-flops except for S are reset (indicating an unvisited state). On the first clock cycle following the initial state, the "visited" state of S propagates to its immediate neighbours. The newly-visited vertices then propagate this state to their own neighbours in the following cycle and so on until the graph is fully traversed (i.e. when all vertices have been visited). In short, this computation is a classic breadth-first search where each iteration is performed in a clock cycle and involves visiting flip-flops by changing their state to logic high.
Note that the maximum number of clock cycles required to traverse the graph is equal to the graph diameter, which is often very small for real-life graphs. For example, biological networks in our case study comprise thousands of vertices yet their diameter is typically around 5 due to the small-world phenomenon. These networks can therefore be traversed in few clock cycles, which is faster than a single memory access on a commodity computer. This forms the basis for the significant acceleration factors reported in Section IV.
C. Calculating Average Shortest Path
The accelerator is designed primarily to calculate the average shortest path ϑ(G) over all pairs of source and destination vertices (a, b) where a = b, or
where N is the number of vertices and D(a, b) is the shortest distance between vertices a and b. For better correspondence with the hardware implementation, presented next, we reformulate ϑ(G) as
where C(a, k) is the number of vertices at a distance k from vertex a. In this case the inner loop terminates when C(v i , k) = 0 since this implies C(v i , h) = 0 for h > k.
D. Implementation Details
We now describe in more details how the accelerator computes ϑ. The graph circuit interfaces with three registers: an initialisation shift register (SR), an enable register (RE) and an output register (RO). Register SR initialises vertex values at the beginning of each traversal while RE enables/disables selected vertices and RO detects which vertices have been visited during the current traversal step. Additionally, a counter CT maintains the step count during each traversal.
Computing ϑ involves N traversals, each amounting to calculating the inner sum in (2) . During each step (of each traversal), the number of bits in RO is multiplied by CT and the result is added to an accumulator AC. Each traversal is completed when RO = 0 (i.e. when no new vertices are visited). After N traversals, each starting from a different vertex, the value held in AC is divided by N (N − 1) to obtain ϑ.
Register SR initialises the graph in preparation for a traversal operation. As discussed earlier, all vertices except for the starting vertex S are initialised in an unvisited state. The value of SR is therefore a one-hot encoding of the index of S.
The accelerator is meant to be used in applications where selected graph vertices can be disabled and the impact of this on ϑ can be determined. Register RE provides this functionality; it is an N -bit register that can be prepopulated by the user (via API calls). Any 0 bit entries in this register will disable the corresponding vertices during the traversal process, effectively removing them from the graph. This approach is inspired by Conditional Partial Order Graphs [16] , which are used to efficiently represent large graph families in hardware.
The accelerator is controlled by a host computer; computations are started, monitored and their results are read via API calls. This provides a programmatic interface enabling the accelerator to be used as a step in an automatic quantitative workflow involving manipulating a base graph via vertex removal and evaluating the impact by re-calculating ϑ. Using the language presented in Section II, an input graph can be converted into VHDL code and synthesised into an FPGA within the accelerator framework. Therefore, developers can read a graph description, synthesise and implement the accelerator, and then use it to analyse the graph, all while remaining within the same programming environment.
IV. CASE STUDY
The presented graph transformation language and accompanying hardware implementation have been applied successfully by e-Therapeutics, a pioneer in computational drug discovery, to accelerate their analyses of protein interactions networks. We discuss this use case here.
A. Computational Drug Discovery: An Overview
Biological systems can be modeled at different levels of abstraction by extensive networks of interactions. At the base level, molecular interactions give rise to a rich set of interactions at the cellular level while the latter mediate higher forms of interactions and so on. If normal cellular function arises from the molecular interactions within the cell then disease mechanisms can be considered as emerging from collections of pathological interactions that only occur in the disease state [17] [18] . If the cellular mechanisms underlying (certain) diseases can be described as a complex system then drug discovery aimed at combating those diseases can be considered as the identification of agents that have a significant effect when used to perturb those systems.
Approaching the discovery of new therapeutic agents from this direction has a number of theoretical benefits, such as being better placed to address complex diseases that arise due to interactions between multiple components, to tackle inherent cellular robustness mechanisms [19] , potentially reduce the capacity to develop resistance [20] , and aid in the discovery of personalised therapeutics [21] . The robustness and resilience properties of complex systems implies they can withstand the failure, or functional perturbation, of small numbers of their constituent elements. Thus, substantial levels of change in system behavior can only occur when multiple elements are perturbed simultaneously. The fact that linear superposition does not hold implies that the identification of such element collections is not trivial, and is certainly not as simple as choosing those that appear most important individually. In the context of drug discovery this leads to the concept of the identification of collections of multiple proteins, that, when perturbed simultaneously, can have a large effect on biological function. The experimental identification of effective protein sets within intact cellular systems is tricky due to both experimental limitations and combinatorial explosion. Conversely, computational approaches are ideally suited for problems plagued by combinatoric issues. e-Therapeutics has developed a practical, in silico, systems based approach to drug discovery based on the above principles [22] . Networks can be considered a mathematical abstraction of a complex system [1] and have become a very useful tool for modeling the molecular interactions within a cell and guiding the understanding of integrated biological function. Percolation theory applied to complex networks [23] is concerned with exploring the change in network structure and behavior due to perturbation of collections of system elements. A key result [24] demonstrates that certain networks, including biological networks, show tolerance to random vertex failure but are vulnerable to targeted attacks. As such, network percolation forms a computational framework to develop analysis approaches aimed at the identification of effective protein sets in disease networks.
B. Drug Discovery by Network Analysis
The impact I of a perturbation, or removal, of vertices from a network is defined as I X = |ζ n − ζ 0 | ζ 0 where ζ n is the value of network measure X when n vertices have been removed from the network.
Numerous measures X have been used in studies of network percolation and robustness with two commonly used measures being network diameter [24] and the average shortest path [25] . Both these measures rely on the calculation of all shortest paths between every pair of vertices in the network, or those in the largest connected component if the network becomes fragmented. The practical use of percolation experiments during the drug discovery approach at e-Therapeutics necessitates the calculation of impact on a network of multiple different proteins sets with the total number of protein sets into the hundreds of thousands. As such, improvements in calculation speed of the core impact measures based on shortest path calculations would have a major positive effect on reducing computational bottlenecks in the discovery process.
C. Biological Networks on FPGAs
We embedded six protein interaction networks in an FPGA using the approach described in Section III. The networks are used to evaluate and compare the effects of different drug candidates on complex cellular systems. The impact of each drug is evaluated by disabling selected vertices that the drug is known to perturb and then calculating ϑ. Our motivation was: (i) to compare the accelerator performance to a software implementation, and (ii) to test the scalability of our hardware implementation using benchmarks from an industrial application. Our benchmarks ranged from very small test networks to considerably large real drug discovery networks (3487 vertices, 115898 edges) and had a small diameter (up to 5). Table I summarises accelerator resource utilisation and performance for the six networks. We found that the FPGA device we used (Altera Stratix IV -EP4SGX230) was sufficiently large to accommodate the largest network (n5). For this network, combinational logic and register utilisation represented a considerable but still permitting percentage of device resources (25%) while peak interconnect utilisation approached the device's limit (91%). This result is not surprising given that the high degree of connectivity in biological networks cannot be matched by the planar interconnect network of an FPGA. Many real-world networks have comparable degrees of connectivity and so we expect that the scalability of our hardware implementation will be upper-bounded by FPGA interconnect density. Nevertheless, n5 is the largest within its class of proteins interaction networks used at e-Therapeutics and so our hardware implementation and choice of FPGA device have been sufficient for this particular application.
The increase in network scale also meant that the clock frequency at which the network could be clocked was lower, a trend clearly visible in Table I . This is because neighboring vertices had to be mapped to more distant flip-flops to accommodate the entire network and worst-case propagation delay had to be increased correspondingly. Nevertheless, we found that we could achieve a target clock frequency higher than 100 MHz for the largest network (n5). Another upshot of increasing network scale is that the number of cycles to calculate ϑ also increases since shortest path computations have to be repeated for a larger number of vertices. This decreased performance further compared to smaller networks.
Compared to our reference software implementation (singlethreaded program written in C++, running on Intel i7-6700HQ 2.60GHz CPU, 16GB RAM, 6MB cache), the throughput of average path calculations using our hardware accelerator was higher by 2-3 orders of magnitude. Even though larger networks could be clocked at lower frequencies and required more cycles to calculate ϑ, the relative performance of the accelerator with these networks was higher (compared to software). Again, this is a trend that we expected; our approach scales much better with respect to network size compared to a software implementation. The performance benefit is therefore more prominent when processing larger networks.
V. CONCLUSIONS AND FUTURE RESEARCH
The paper presented a domain-specific language for graph construction and transformation, and a hardware acceleration backend for processing graphs on FPGAs. We demonstrate 1000x acceleration compared to a conventional software implementation on real-life drug discovery benchmarks.
The main idea behind our approach is to physically embed the application graph into silicon, where each vertex is mapped into a tiny core of just a few of gates, which can communicate concurrently and directly, avoiding the memory bottleneck.
Our future research will focus on investigating the applicability of the developed technology to graph processing in other domains, where graphs are typically fixed apart from minor perturbations and can therefore be efficiently analysed using the presented approach. One such example is smart grids, where vertices and edges are rarely added or removed. Deep learning networks are also suitable for embedding in hardware and it would be instructive to compare the developed technology to those produced by other groups in this active research area. We believe that the presented graph transformation language may be particularly well-suited for compiling machine-learning networks developed using frameworks such as Tensorflow [26] into hardware. Finally, with the advent of cloud FPGA technology it becomes possible to provide easily accessible and highly scalable graph manipulation and processing infrastructure for a wide range of applications and users, which is our long-term goal.
