Abstract-With the recent emergence of large-scale knowledge discovery, data mining and social network analysis, irregular applications have gained renewed interest. Cache-based architectures do not provide optimal performance with such workloads, mainly due to the low spatial and temporal locality of their control and memory access patterns. This paper presents a multi-node, multi-core, multi-threaded shared-memory system architecture designed for the execution of large-scale irregular applications, and built on three pillars that support these workloads. First, transparent hardware support for Partitioned Global Address Space (PGAS) provides a large globally-shared address space with no software library overhead. Second, multithreaded multi-core processing nodes provide the latency tolerance required when accessing physically distributed global memory. Third, hardware support is provided for inter-thread synchronization on the global address space. An analytical performance model that accounts for the main architecture and application characteristics is presented. The hardware design of the proposed custom architectural building blocks is then described. Finally, a multi-board FPGA prototype of the proposed system with typical irregular kernels and benchmarks is presented. The experimental evaluation demonstrates how the performance of the architecture scales across different configurations of the whole system.
I. INTRODUCTION
Data mining, social network analysis and, in general, knowledge discovery are new classes of large-scale data-intensive applications that show high latent parallelism. However, they often use dynamically changing pointer-based data structures, such as unbalanced trees, graphs and unstructured grids, which present poor temporal and spatial locality. Current High Performance Computing (HPC) systems integrate many powerful cache-based processors that rely on spatial and temporal data locality to reduce memory access latencies, and therefore execute this class of applications poorly. Moreover, the data sets require significantly more memory than is currently available on a single node of a typical HPC cluster, and they are usually difficult to partition without incurring load imbalances or high communication overheads. Finally, these applications present high synchronization intensity, due to the frequent memory accesses performed while exploring the data structure.
Very few machines have been explicitly designed to address the requirements of irregular applications. The Cray XMT [1] is one such machine. The XMT uses custom processors that employ heavy hardware multi-threading to tolerate the system-wide latency for accessing memory on remote nodes. Moreover, the XMT has hardware support for extremely fine-grained inter-thread synchronization, and its programming model provides the programmer with a PGAS abstraction on top of a physically distributed memory. However, being a fully custom machine, the XMT is more expensive than clusters based on commodity processors and components. Therefore, new affordable designs are needed that can provide good performance with irregular workloads.
In this paper, we present the design of a new full-system architecture for irregular applications based on commodity processors. Starting from an architecture based on off-the-shelf soft-core processors, we introduce the hardware and software components required to optimize it for multi-node irregular applications. We show how, by considering multi-threading, fine-grained synchronization and a scrambled global address space, it is possible to integrate hardware and software modules that enable efficient development and execution of irregular applications on commodity processors.
The main contributions of this paper are: an analytical model that correlates architectural parameters (number of cores, number of threads, frequency, thread switching overheads) with memory and network bandwidth and latency; a detailed description of the custom hardware and software components that enable support of irregular applications; and an FPGA-based prototyping platform employed to validate the approach.
II. TOLERATING SYSTEM-WIDE MEMORY LATENCY
This section describes a simple model that links system performance with the main processor features (number of cores, multi-threading, thread-switching overhead, core frequency) and with memory/network latency and bandwidth. Since the performance of irregular applications is essentially memory- and network-bound [2], it is critical to model the sustained network injection rate. For the network interface of a processing node in a multi-node machine, ignoring the effects of contention on the interconnection internal to the node, the injection rate can be expressed as:
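The expression itself is not present in this text. Reading it off the symbol definitions that follow, and assuming the listed factors simply multiply (an assumption, not a confirmed form of equation 1), a dimensionally consistent version in bits per second would be:

```latex
\mathrm{Inj.Rate}_{node} = K \cdot C \cdot f \cdot S \cdot P \cdot \mathrm{Inj.Rate}_{pipe}
```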
where K is the ratio between remote and local memory references, C the number of cores in the node, f the core clock frequency (assuming a single clock domain), S the average size of a memory reference in bits, P the number of memory units per core and Inj.Rate_pipe the memory reference injection rate of each pipeline. The average network latency Lat_NET for a remote memory reference depends on a variety of static factors, such as the network topology, routing policy and number of nodes, but also on dynamic factors, such as the spatial distribution of the accesses, which in turn affects the network contention. When a processing core issues a remote memory reference, its average time to completion is D_net [cycles]. If no multi-threading is present, the pipeline stalls until the memory reference completes. In this case, the injection rate Inj.Rate_pipe of a single pipeline is 1/D_net. If instead the core can switch thread of execution to tolerate the D_net latency, the resulting injection rate increases. Assuming the thread is switched at each remote memory reference, we use I_sw to indicate the number of cycles a thread spends executing before a remote memory reference occurs. We indicate the thread switching delay with D_sw, measured in processor cycles.
When multi-threading is enabled, the pipeline injection rate Inj.Rate_pipe is no longer 1/D_net; instead, it depends on the number of threads T available for execution. If there are enough available threads that the average latency D_net of one thread can be completely covered by executing other threads, then the total number of remote memory references issued during that latency is D_net/(I_sw + D_sw) and the injection rate is 1/(I_sw + D_sw). We refer to this condition as complete latency tolerance. If instead there are not enough threads available to cover the latency, the injection rate can be expressed as T/D_net. These expressions are valid under the assumption that a thread can issue only one remote memory reference before it is suspended. The occurrence of remote memory references in the instruction mix is represented by I_sw. From the concept of complete latency tolerance, a first approximation of the minimum (threshold) number of threads necessary to cover the latency of a remote memory reference can be obtained as:
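The threshold follows from equating the two regimes described above, T/D_net = 1/(I_sw + D_sw). The form below is reconstructed from that reasoning and should be read as a sketch of the threshold expression, not necessarily as its exact original form:

```latex
T_{th} \approx \frac{D_{net}}{I_{sw} + D_{sw}}
```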
Expression 2 includes the dependency on the machine configuration (number of nodes, link latency, topology) in the D_net term, while I_sw accounts for the instruction mix. The term D_sw models how the processing core implements multi-threading and context switching. For instance, a full-hardware multi-threading implementation, where the available threads are scheduled on a cycle-by-cycle basis, has D_sw = 0, while software context saves and restores typically lead to a D_sw of hundreds of cycles. Substituting the expressions found for the pipeline injection rate Inj.Rate_pipe into equation 1, we obtain the complete expression for the node injection rate, which we can compare with the network bandwidth available at the node:
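As a minimal numerical sketch of the model described so far, the two pipeline regimes and the thread threshold can be evaluated as below. The function names are hypothetical, D_sw is treated as a constant (the text lets it vary with the thread count), and the node-level product is an assumed form of equation 1 rather than a confirmed one.

```python
import math


def pipeline_injection_rate(threads, d_net, i_sw, d_sw):
    """Remote references issued per cycle by one pipeline.

    With too few threads the pipeline is latency bound (threads / d_net);
    with enough threads it is issue bound (1 / (i_sw + d_sw)).
    """
    if threads <= 1:
        return 1.0 / d_net  # no multi-threading: stall until completion
    return min(threads / d_net, 1.0 / (i_sw + d_sw))


def threshold_threads(d_net, i_sw, d_sw):
    """Smallest whole number of threads giving complete latency tolerance."""
    return math.ceil(d_net / (i_sw + d_sw))


def node_injection_rate(k, cores, freq_hz, size_bits, mem_units, pipe_rate):
    """Node injection rate in bit/s, ASSUMING the factors simply multiply."""
    return k * cores * freq_hz * size_bits * mem_units * pipe_rate


# Prototype-like values: 147-cycle remote access and 80-cycle software
# switch (Section V); I_sw = 10 is borrowed from the Section II assumptions.
print(threshold_threads(d_net=147, i_sw=10, d_sw=80))  # 2
```

With these inputs the threshold is 147/90, about 1.6, which rounds up to two threads per core.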
The network bandwidth actually used depends on the size of the network packet, i.e. the sum of the memory reference size S in bits and the network overhead, referred to as Overhead. Irregular applications typically benefit from fine-grained network traffic with respect to the memory reference size S. Thus, a strategy to reduce the impact of the Overhead term is to aggregate several memory references directed to the same destination node in a single network packet, as explored in [3] for an XMT-like architecture. However, this work does not focus on memory reference aggregation, and a single memory reference per network packet is considered.
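To illustrate why aggregation reduces the impact of the Overhead term, the sketch below computes the fraction of each packet that carries payload. It assumes the packet size is simply S times the number of references plus one Overhead; the 40-bit overhead is a hypothetical value chosen only for illustration, and `payload_fraction` is not a name from the paper.

```python
def payload_fraction(size_bits, overhead_bits, refs_per_packet=1):
    """Fraction of transmitted bits that carry memory-reference payload."""
    payload = size_bits * refs_per_packet
    return payload / (payload + overhead_bits)


S = 160        # reference size in bits used in Section II
OVERHEAD = 40  # HYPOTHETICAL per-packet overhead in bits

single = payload_fraction(S, OVERHEAD)         # one reference per packet
aggregated = payload_fraction(S, OVERHEAD, 4)  # four references per packet
print(single, aggregated)
```

Under these assumed numbers, one reference per packet uses 160 of 200 bits for payload, while four aggregated references use 640 of 680.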
Using these equations, Figures 1a and 1b show the relation between the achievable node injection rate, which we intend to maximize, and the number of cores per node and threads per core. Figures 1a and 1b show a 4-node and a 128-node system configuration respectively (i.e. increasing system size), assuming:
- 3D-torus network topology with 0.5 µs latency per link,
- 10 Gbps link bandwidth, 90% of peak at 256-bit size,
- 160-bit remote memory reference size S,
- thread switching cost D_sw dependent on the number of threads. We modeled a context saving/restoring based on a scratchpad-shared memory hierarchy, where the scratchpad stores only thread contexts, while the shared memory is also exposed to normal application access. Therefore, when the number of threads per core increases over the scratchpad size, the delay D_sw increases,
- an average value of I_sw = 10, meaning a memory reference occurs on average every 10 instructions,
- core operating frequency of 800 MHz,
- network traffic that is uniformly scrambled amongst all the nodes of the machine, resulting in a (N − 1)/N ratio of remote memory references over the total references.
Figures 1a and 1b show how the node injection rate, which is key to overall system performance, peaks at a specific number of threads T_th which, neglecting the effects of network contention, increases with the system size (from 4 to 128 nodes). Using more than T_th threads degrades performance due to the overhead of scheduling and context switching. Using these equations, we also modeled the behavior of the prototype system described in detail in Section IV. The system has four nodes, each including up to 32 cores and up to 4 software threads per core. The cores run at 100 MHz and the channel has a bandwidth of 0.6 Gbps and a latency of around 0.3 µs.
We analyzed the percentage of physical bandwidth used for an increasing number of threads per core and cores per node, obtaining the results shown in Figure 2. For this specific configuration, using 4 threads per core is no more beneficial than using 2 threads per core, regardless of the number of cores in each node. Moreover, once the utilized bandwidth saturates (around 80% of the peak value, due to the network overhead), adding more cores does not provide additional benefits. In reality, we expect the saturation region not to be flat but slightly decreasing because of intra-node network contention effects, which the model does not take into account.
III. NODE ARCHITECTURE DESCRIPTION
This section will present the node architecture of the system. Figure 3 shows the building blocks of the node architecture, namely the processing core, the Global Memory Access Scheduler (GMAS) module, the Global Network Interface (GNI) module and the Global SYNChronization (GSYNC) module. Each processing core is attached to a GMAS module. Each node includes one GNI and one GSYNC module. The architecture also includes a scratchpad memory for each core and a memory controller for the DDR3 RAM shared across all the cores in a node.
A. The processing core
The architecture has been designed to be independent from the specific ISA of the processing cores that it includes. The only requirement that we pose for the processing cores is that they have in-order pipelines and the possibility to process precise interrupt signals.
B. The GMAS module
The GMAS component provides the global address space across multiple nodes of the system. It is connected as a slave to a core and as a master to the on-chip bus. It intercepts all the load/store operations (local and remote) that the core issues. Figure 4 shows its structure. The local address space includes the scratchpad memory, the (memory-mapped) on-chip peripherals and a configurable portion of the DDR3 RAM. The global address space is scrambled across the DDR modules of the different nodes in the system with a configurable fine granularity (64 B by default). All the references to the local address space are simply forwarded by the GMAS. References to any location of the global address space go through the specific logic that provides support for the PGAS mechanism. This logic includes the address decoder, scrambler and translator in Figure 4.
Fig. 4: GMAS internal structure
In detail, the global address goes through a hardware scrambler to determine the destination node identifier. If the destination node equals the local node (global address scrambled to local memory), the reference is forwarded to the local memory controller through the on-chip interconnect. If instead the destination node is remote, the reference is forwarded to the local GNI module for transmission into the external network. At the same time, the load/store queue (LSQ in Fig. 4) is updated with the pending reference, at the position identified by the running thread. When a remote memory reference is detected, the GMAS notifies the processing core by sending a fake response and simultaneously raising an interrupt signal.
The hardware scheduler is a key component of the GMAS module. When multi-threading is enabled, the scheduler takes part in the thread switching process by determining which thread will execute after the suspending thread. The hardware scheduler is automatically triggered in two circumstances: the first is the occurrence of a remote memory reference; the second is explicit signaling from the processing core, implemented through a memory-mapped read-write register. In our architecture, a thread is available for execution when it is active and does not have a pending remote memory reference in its LSQ.
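The GMAS behavior described above can be sketched in software as follows. This is only an illustration under stated assumptions: the scrambler is modeled as plain block-cyclic interleaving at the 64 B default granularity, and the scheduler as round-robin, neither of which the text specifies. All function names are hypothetical.

```python
BLOCK_BYTES = 64  # default scrambling granularity stated in the text


def scramble(global_addr, num_nodes, block_bytes=BLOCK_BYTES):
    """Map a global address to (node_id, local_offset), block-cyclic.

    ASSUMPTION: plain round-robin interleaving of blocks across nodes;
    the actual hardware scrambler may use a different function.
    """
    block, byte_in_block = divmod(global_addr, block_bytes)
    local_block, node_id = divmod(block, num_nodes)
    return node_id, local_block * block_bytes + byte_in_block


def route(global_addr, local_node, num_nodes):
    """Decide where the GMAS forwards a global reference."""
    node_id, offset = scramble(global_addr, num_nodes)
    target = "local_memory" if node_id == local_node else "gni"
    return target, node_id, offset


def next_thread(current, active, pending):
    """Round-robin pick of the next runnable thread, or None.

    A thread is runnable when it is active and has no pending remote
    reference in its LSQ slot (one slot per thread, as in the text).
    ASSUMPTION: the round-robin policy is chosen only for illustration.
    """
    n = len(active)
    for step in range(1, n + 1):
        candidate = (current + step) % n
        if active[candidate] and not pending[candidate]:
            return candidate
    return None
```

For example, with four nodes, address 64 falls in the second 64 B block and is routed to node 1, so `route(64, 0, 4)` reports the GNI as target while `route(64, 1, 4)` reports the local memory.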
C. The GNI module
The GNI interfaces the on-chip interconnection with the external network. It decodes and encodes memory transactions, encapsulates them in network packets, routes the packets across the network and translates addresses and identifiers between the on-chip interconnection format and the network format. When a remote memory reference is issued by the processing core, a transaction is generated that traverses three different network domains in its round trip: the sender on-chip interconnect, the inter-node network and the receiver on-chip interconnect. Each of these transitions requires an address translation and the generation of a unique transaction identifier.
Figure 5 shows the internal organization of the GNI module. A master/slave interface to the on-chip interconnect directs outgoing and incoming network references to and from the network. Outgoing requests are received at the on-chip slave interface and sent through the sender channel (TX); vice versa, incoming requests are received through the receiver channel (RX) and sent through the on-chip master interface. Conversely, outgoing responses are received at the on-chip master interface and sent through the sender channel, while incoming responses are received through the receiver channel and sent on the on-chip slave interface.
The address translation and identifier generation phases that take place for every remote load/store transaction are the following. First, the processing core issues a load/store operation to a global address. The global address is scrambled by the GMAS module to form a {destination node, memory address} pair, and the remote destination node identifier is calculated. The GMAS module then sends a transaction over the on-chip interconnect to the GNI module.
A first transaction identifier is generated for the sender node on-chip interconnect domain, as the concatenation of the processing core id (core id) and, if multi-threading is enabled, the source thread id (thread id). Within the sender node on-chip domain, the transaction identifier is unique across all transactions.
Second, when the GNI module receives the transaction, it creates the packet to send out into the network through its sender channel. It inserts the network addresses of the source and destination nodes into the packet header. The packet is then routed to the destination node through the external network. No further tagging is required, since the external network treats requests and responses of the same load/store operation as different transactions.
Third, at the destination node, the GNI module receives the network transaction and generates a transaction for its local memory through the receiver channel. It extends the physical memory address found in the received packet with the bits necessary to address the local memory in the on-chip interconnect. The transaction id found in the received packet is replaced with a new id and stored in a hash table (id translation table). The network address of the source node is also stored in the hash table. The new id is the concatenation of the GNI id and a unique pending transaction counter.
Fourth, upon return of the destination node memory access, the GNI receives the response and uses the id translation table to retrieve the source node network address and the previously stored source node transaction id. The GNI is now able to send a network transaction back to the original source node.
Finally, the original source GNI module receives the response packet and generates a response on the on-chip interconnect of the source node. The proper unique transaction id is found directly in the packet header.
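The identifier handling in the steps above can be sketched as a small software model. The class and function names are hypothetical, ids are represented as tuples rather than concatenated bit fields, and the counter is assumed to grow monotonically, whereas a hardware counter would be bounded.

```python
def source_txn_id(core_id, thread_id):
    """Sender-side id: concatenation of core id and thread id."""
    return (core_id, thread_id)


class DestinationGNI:
    """Toy model of the id translation done at the destination node."""

    def __init__(self, gni_id):
        self.gni_id = gni_id
        self.counter = 0    # pending-transaction counter (assumed unbounded)
        self.id_table = {}  # local id -> (source node, source transaction id)

    def on_request(self, src_node, src_txn_id):
        """Swap the incoming id for a locally unique one and remember it."""
        local_id = (self.gni_id, self.counter)
        self.counter += 1
        self.id_table[local_id] = (src_node, src_txn_id)
        return local_id

    def on_response(self, local_id):
        """Recover where the response must go and free the table entry."""
        return self.id_table.pop(local_id)


gni = DestinationGNI(gni_id=7)
first = gni.on_request(src_node=2, src_txn_id=source_txn_id(3, 1))
second = gni.on_request(src_node=0, src_txn_id=source_txn_id(3, 1))
```

Two requests carrying the same sender-side id from different source nodes receive distinct local ids here, which is why the destination node needs its own translation table.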
D. The GSYNC module
The GSYNC implements fine-grained synchronization for the global address space. It is a memory-mapped slave component that integrates a lock table. The GSYNC is responsible for managing all the lock and unlock operations on the memory addresses of the node exposed as part of the global memory. Thus, when a core issues a synchronization operation on a remote location, only the GSYNC of the destination node handles it. The entries of the lock table are directly mapped to the local addresses of the global memory. GMAS and GSYNC handle basic single-transaction lock and unlock operations without requiring any support for atomic transactions from either the on-chip interconnect or the external network.
The synchronization mechanism works as follows. The processing core writes the address to lock/unlock into a dedicated memory-mapped register of the GMAS. The GMAS has two different registers for lock and unlock. The write to the memory-mapped register returns immediately. In the case of a lock operation, the processor then polls the same register to wait for the response. In the case of an unlock operation, the processing core does not poll. If the lock/unlock generates a remote memory operation, the GMAS raises an interrupt for the processor, potentially triggering thread switching. For a lock operation, the GMAS sends a read to the GSYNC, while for an unlock operation it sends a write. When the GSYNC (local or remote) receives a lock/unlock request, it acts like a bank of test-and-set registers. Lock attempts succeed only if the selected entry is empty, and they always return the current value stored in the entry. Unlock attempts always succeed.
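The test-and-set behavior of the lock table can be sketched as below. The 0/1 encoding of empty and locked entries, the modulo indexing and the class name are assumptions made for illustration; the text only states that entries are directly mapped to local addresses.

```python
class GSync:
    """Toy lock table behaving like a bank of test-and-set registers."""

    def __init__(self, entries):
        self.table = [0] * entries  # 0 = empty (free), 1 = locked

    def _index(self, local_addr):
        # ASSUMPTION: direct mapping by modulo over the table size.
        return local_addr % len(self.table)

    def lock(self, local_addr):
        """Handle the read sent by the GMAS: return old value, then set."""
        i = self._index(local_addr)
        old = self.table[i]
        self.table[i] = 1
        return old  # 0 means the caller acquired the lock

    def unlock(self, local_addr):
        """Handle the write sent by the GMAS: always succeeds."""
        self.table[self._index(local_addr)] = 0
```

A core that reads back 0 from `lock` has acquired the entry; any other value means the entry was already taken and the core must retry.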
IV. PROTOTYPING PLATFORM
FPGA prototyping has recently gained momentum as an enabler for architectural studies [4]. This section describes the multi-node FPGA prototyping platform developed to evaluate the designed architecture. We implemented the FPGA design using the Xilinx ISE Embedded Design Suite, version 13.4. To prototype a multi-node system, we employed 4 ML605 boards, each mounting an LX240T Virtex-6 FPGA. The two boards communicate through the RocketIO GTX transceivers and are interconnected with four coaxial cables forming a full-duplex link (two differential signaling cables per direction), as in Figure 6. They exchange data through the Aurora protocol with 8B/10B encoding. Aurora is a lightweight link-layer network protocol developed by Xilinx that allows the use of any upper-layer protocol on high-speed serial links.
Each node of the prototype system includes multiple Xilinx MicroBlaze 32-bit cores. Nevertheless, the proposed approach is independent of the specific Instruction Set Architecture (ISA) of the cores, as long as in-order implementations with support for precise interrupt processing are selected. The on-chip interconnect sub-system is a multi-layer configuration of the [...] Finally, each node also includes a UART controller and a hardware debug module (MDM).
We synthesized the architecture with up to 32 MicroBlaze cores per node. The cores run at a frequency of 100 MHz, while the GTX transceiver provides a bandwidth of 600 Mbit/s. These parameters match the values used with the analytical model in Section II to produce Figure 2. Table I shows the FPGA area occupation of the three custom modules presented in Section III, their maximum operating frequencies and traversal latencies. The percentages refer to an LX240T Virtex-6 device. The most critical occupation figures are those of the GMAS, as every processing core attaches to a GMAS module, while only one GNI and one GSYNC module are included per node. To give a sense of comparison, the MicroBlaze area is almost three times larger than that of the GMAS. In terms of maximum operating frequency, none of the custom components is critical.
V. EXPERIMENTAL VALIDATION
This section describes the experiments that we ran on the prototype described in Section IV. The first experiments characterized the individual components of the prototype in terms of latency (Table I). These latency numbers are measured without any contention in the on-chip network. The latency necessary to access the GNI module is higher than the one to access the GSYNC because a clock domain conversion happens in the GNI bus interface. Traversing the GMAS takes 1-3 clock ticks, depending on whether the memory reference is accessing the local scratchpad memory, the local DDR controller or the remote memory in the second node. In the remote case, this added latency is not significant for the overall remote access latency, which is 147 cycles. Our software thread-switching routine takes approximately 80 cycles to execute. In Section II, the analytical model (eq. 2) estimates the number of threads necessary for complete latency tolerance as 2, which is consistent with Figure 2.
Figure 7a shows the network bandwidth utilization of the prototype when executing a pointer-chasing kernel, while increasing the number of cores per node and of threads per core. Bandwidth utilization is measured at the GNI interface. With one thread per core, increasing the number of cores results in an increased injection rate into the network and, consequently, the bandwidth utilization increases almost linearly. With two threads per core the injection rate increases even more, exploiting multi-threading for better latency tolerance. On average, using two threads per core improves bandwidth utilization by 75% with respect to a single thread, up to 16 cores. Further increasing the number of threads to 4, instead, does not lead to significant improvements (only 2% more with 16 cores). As previously discussed, more than two threads do not increase the network injection rate, but only increase overhead.
This is consistent with what we found using the analytical model in Figure 2, obtained by modeling a system similar to our multi-node FPGA prototype. In all cases, bandwidth utilization saturates at 85%. This behavior depends both on the overhead introduced by the network header, which however is very small for this 4-node prototype, and on the maximum number of concurrent operations on the AXI4 on-chip bus, which limits the injection rate into the network.
We also evaluated the prototype with a typical irregular algorithm, Breadth First Search (BFS). Figure 7b shows the number of Traversed Edges Per Second (TEPS) of the algorithm while increasing the number of cores per node with different numbers of threads per core. We executed the algorithm on a graph of 50,000 vertices and 2,495,423 edges. The algorithm is both memory- and synchronization-intensive, since it performs a lock for each explored vertex. With a single thread per core, performance keeps increasing as the number of cores increases. At 32 cores, the algorithm obtains a speedup of 23. This matches the speedup obtained by applying Amdahl's law to the algorithm (considering a parallel portion of 98.9%). With two and three threads per core, the algorithm reaches higher performance than with a single thread, up to 32 cores. At 8 cores, the performance with 3 threads is 40% higher; at 16 and 32 cores, it is 20% higher. Finally, with four threads per core, the system generally provides lower performance than with two or three threads. The performance of four threads per core is comparable to that of two threads per core only up to 8 cores. As previously discussed, using 4 threads does not increase the injection rate and therefore does not improve overall performance. However, a higher number of threads increases the scheduling overhead and the contention on synchronization variables, which introduces additional overhead.
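The Amdahl's law figure quoted above for the single-thread case can be checked with a short calculation. The sketch uses the standard form of the law with the parallel portion and core count given in the text; the function name is hypothetical.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup when only part of the work scales with core count."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Parallel portion and core count quoted in the text.
print(round(amdahl_speedup(0.989, 32), 1))  # 23.9
```

The ideal value of about 23.9 is close to the measured speedup of 23 at 32 cores.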
