Abstract: Graph algorithms and techniques are increasingly used in scientific and commercial applications to express relations and explore large data sets. Although conventional or commodity computer architectures, such as CPUs and GPUs, handle dense graph algorithms fairly well, they are often inadequate for processing large sparse graph applications. The memory access patterns, memory bandwidth requirements and on-chip network communications of these applications do not fit the conventional program execution flow. In this work, we propose and design a new architecture for fast processing of large graph applications. To address the lack of spatial and temporal locality in these applications and to support scalable computational models, we design the architecture around two key concepts. (1) The architecture is a multicore processor of independently clocked processing elements. These elements communicate in a self-timed manner and use handshaking to perform synchronization, communication, and sequencing of operations. Because the design is asynchronous, the operating speed of each processing element is determined by actual local latencies rather than global worst-case latencies. We create a specialized ISA to support these operations. (2) The application compilation and mapping process uses a graph clustering algorithm to optimize parallel computation of graph operations and load balancing. Through the clustering process, we make scalability an inherent property of the architecture, where task-to-element mapping can be done at the graph node level or at the node cluster level. A prototyped version of the architecture outperforms a comparable CPU by 10x to 20x across all benchmarks and provides 2x to 5x better power efficiency when compared to a GPU.
I. INTRODUCTION
Advances in mobile computing, coupled with the proliferation of online social networks, have given rise to a new class of applications and computing challenges [1]. These applications tend to be relational by nature: they express or encode relations, communications, connectivity and interactions between people, places, objects or systems. As such, the data of interest in these applications are often best represented as graphs. Graph-based applications range from social network analysis to anomaly detection [2].
For computing purposes, graphs are commonly represented in one of two forms: (1) as an adjacency matrix or (2) as an adjacency list. An adjacency matrix works well for densely connected graphs, i.e., graphs in which the number of edges is close to the maximum possible number of edges. In general, computing on dense graphs is easily parallelized, and GPU and SIMD architectures have proven to be the platforms of choice for executing such graph-based applications [2]. Unfortunately, the vast majority of large graph-based applications are sparse. For efficient storage of large sparse graphs, adjacency lists or other compressed representation schemes are used.
Memory access and load balancing are among the key bottlenecks to the efficient processing of large sparse graph algorithms and applications [1]. The memory access patterns often lack spatial and temporal locality, resulting in high cache miss rates. Current cache-based processor architectures are simply not well suited to the computational flow of graph processing. In addition to the storage problem, computing on large sparse graphs presents a number of challenges, including the need for effective programming abstractions and models of computation that leverage the graph structure in the application.
In this work, we present a domain-specific architecture tailored to graph-based algorithms and applications. Figure 1 shows an illustration of the proposed architecture.
The three key modules of the architecture are (1) the graph processor, (2) the co-processor and (3) the main memory. The graph processor (1) module has a Memory Interface unit (1a) to coordinate batch accesses to the main memory or external memory units, a Dispatch Logic (1b) to perform scatter operations on data from the main memory, an Output Logic (1c) to gather output data from the graph processor, and a systolic array of simple processing elements called Node Arithmetic Logic Engines (NALEs) (1d) to carry out the actual graph computations. The co-processor (2) performs three key functions. It (1) executes non-graph parts of the application, (2) schedules the graph part of the application and (3) monitors the execution flow of the graph.
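The storage trade-off between the two graph representations discussed in this introduction can be made concrete with a small sketch. The helper names and the six-node ring graph below are hypothetical and chosen only for illustration; they are not part of the proposed architecture or its toolchain.

```python
def to_adjacency_matrix(num_nodes, edges):
    """Dense form: one cell per possible directed edge, O(V^2) storage."""
    matrix = [[0] * num_nodes for _ in range(num_nodes)]
    for src, dst in edges:
        matrix[src][dst] = 1
    return matrix


def to_adjacency_list(num_nodes, edges):
    """Sparse form: one entry per existing edge, O(V + E) storage."""
    neighbors = [[] for _ in range(num_nodes)]
    for src, dst in edges:
        neighbors[src].append(dst)
    return neighbors


# Hypothetical sparse example: a directed ring of 6 nodes,
# which uses only 6 of the 36 possible directed edges.
NUM_NODES = 6
EDGES = [(i, (i + 1) % NUM_NODES) for i in range(NUM_NODES)]

matrix = to_adjacency_matrix(NUM_NODES, EDGES)
adj_list = to_adjacency_list(NUM_NODES, EDGES)

matrix_cells = sum(len(row) for row in matrix)   # cells stored by the dense form
list_entries = sum(len(n) for n in adj_list)     # entries stored by the sparse form
```

For this graph the matrix stores a cell for every node pair, while the list stores one entry per edge. The gap widens quadratically as a sparse graph grows, which is why compressed representations are preferred for large sparse graphs.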
II. PROPOSED GRAPH PROCESSOR ARCHITECTURE
Graph Processor Micro-Architecture
The graph processor is a systolic array of simple processing elements called Node Arithmetic Logic Engines (NALEs). It is envisioned that the graph processor will contain a thousand to hundreds of thousands of NALEs. Figure [?] depicts the block representation of a NALE and Figure [?] shows the micro-architecture. The NALE is optimized for fast MAC (Multiply-And-Accumulate) operations with a three-state output comparator for fast node value sorting. It has two FIFO structures, one to communicate with neighbors and one internal FIFO to emulate multiple graph nodes (node cluster mode execution).
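As a rough behavioral illustration of the NALE described above, the following sketch models the MAC unit, the three-state comparator and the two FIFOs in software. The class name, method names and the (weight, value) operand format are assumptions made for illustration; they do not reflect the actual RTL or instruction encoding. The `step` method fires only when input data are present, in keeping with the FIFO-driven operation of the design.

```python
from collections import deque


class Nale:
    """Toy behavioral model of one NALE (names and interfaces are hypothetical)."""

    def __init__(self):
        self.neighbor_fifo = deque()   # operands arriving from neighboring NALEs
        self.internal_fifo = deque()   # values of the graph nodes emulated locally
        self.accumulator = 0

    def mac(self, weight, value):
        """Multiply-and-accumulate: accumulator += weight * value."""
        self.accumulator += weight * value
        return self.accumulator

    @staticmethod
    def compare(a, b):
        """Three-state comparison: -1 if a < b, 0 if a == b, +1 if a > b."""
        return (a > b) - (a < b)

    def step(self):
        """Consume one (weight, value) pair if present; return True if fired."""
        if not self.neighbor_fifo:
            return False               # no input data: the element simply waits
        weight, value = self.neighbor_fifo.popleft()
        self.mac(weight, value)
        return True


nale = Nale()
nale.neighbor_fifo.extend([(2, 3), (4, 5)])   # two weighted inputs from neighbors
while nale.step():                            # fire until the input FIFO drains
    pass
nale.internal_fifo.append(nale.accumulator)   # park the result for the emulated node
```

The three-way result of `compare` is what makes the comparator useful for sorting node values: a single operation distinguishes less-than, equal and greater-than.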
The NALE has a FIFO-based, memory-oblivious and latency-insensitive Instruction Set Architecture (ISA). In other words, there are no explicit load and store instructions in the instruction set. Instructions are 16 bits long and execute in four steps: (1) fetch the instruction, (2) read the input FIFOs when data are present, (3) perform the operation and (4) write to the output FIFOs. Figure [?] shows the instruction format.
Figure 2 shows the micro-architecture of a NALE. Each NALE operates independently of the others, depending on the readiness of its inputs. Communicating only through FIFOs allows each NALE to run at its own clock speed. Furthermore, this approach allows us to adopt a GasP asynchronous [3] design methodology that can seamlessly scale to hundreds of thousands of NALEs. Figure 3(a) illustrates the clockless handshake logic between NALEs. In addition to the scalability benefits, the absence of a global clock allows the underlying data dependencies to dictate application execution time.
Model of computation and compilation: An asynchronous model of computation is adopted to take full advantage of the graph processor. Given a graph application specification and a number of available NALEs for its computation, the execution preprocessing flow follows five key steps. Figure 4 illustrates these steps.
The application is first profiled to extract the graph topology. The remaining steps are, in order, node clustering, cluster dependency analysis, placement and, finally, compilation.
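The middle three steps of this flow can be sketched as follows. The text here does not specify the clustering or placement algorithms, so the contiguous chunking and the one-cluster-per-NALE placement below are assumed stand-ins, and all function names are hypothetical. Profiling is represented only by its output (an adjacency list), and the compilation step is omitted.

```python
from collections import defaultdict


def cluster_nodes(adjacency, num_nales):
    """Assumed stand-in for clustering: split the sorted node ids into
    contiguous, near-equal chunks. Returns {node: cluster_id}."""
    nodes = sorted(adjacency)
    size = -(-len(nodes) // num_nales)          # ceiling division
    return {node: index // size for index, node in enumerate(nodes)}


def cluster_dependencies(adjacency, cluster_of):
    """Cluster A depends on cluster B if an edge runs from a node in B
    to a node in A. Returns {cluster_id: set of cluster_ids}."""
    deps = defaultdict(set)
    for src, dsts in adjacency.items():
        for dst in dsts:
            if cluster_of[src] != cluster_of[dst]:
                deps[cluster_of[dst]].add(cluster_of[src])
    return dict(deps)


def place(cluster_of, num_nales):
    """Assumed placement: cluster i runs on NALE i. Returns {nale_id: [nodes]}."""
    placement = {nale: [] for nale in range(num_nales)}
    for node, cluster in cluster_of.items():
        placement[cluster].append(node)
    return placement


# Hypothetical profiled topology: a directed 4-node ring, mapped onto 2 NALEs.
topology = {0: [1], 1: [2], 2: [3], 3: [0]}
cluster_of = cluster_nodes(topology, num_nales=2)
deps = cluster_dependencies(topology, cluster_of)
placement = place(cluster_of, num_nales=2)
```

With more NALEs than nodes, each node becomes its own cluster (node-level mapping). With fewer, several nodes share a NALE (node cluster mode).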
III. ARCHITECTURE EVALUATION
Experimental setup: To obtain high-fidelity performance and power measurements for the proposed architecture, we prototype it on an FPGA alongside a conventional CPU and a GPU of comparable complexity. The Xilinx Virtex7-XC7VX980T FPGA device is used as our prototyping platform. We implement a synthesizable RTL version of the graph processor. We use the 7-stage RISC core in the Heracles [4] RTL simulator for the CPU. We adopt the MIAOW open-source general-purpose graphics processor (GPGPU) based on the AMD Southern Islands ISA [5].
