Organizing and optimizing data objects on networks with support for data migration and failing nodes is a complicated problem to handle as systems expand to hundreds of thousands of nodes. The goal of this work is to demonstrate that high levels of speedup can be achieved by moving responsibility for finding, fetching, and staging data into an FPGA-based network interface. We present a system for implicit routing of data via FPGA-based network cards. In this system, data structures are requested by name, and the network cooperatively finds the data and returns the information to the requester. This is achieved through successive examination of hardware hash tables implemented in the individual FPGA network cards. By avoiding the complex network software stacks between nodes, the data is quickly transferred entirely through FPGA-FPGA interaction. The performance of this system is approximately 26x faster vs. the software network on a per-node basis. This is due to the improved speed of the hash tables, higher levels of network abstraction and lowered latency between the network nodes.
Introduction
With our sights currently set on exascale computing resources, the management of parallelism provided by hundreds of thousands of processing cores represents an extreme challenge. Legacy and dusty deck codes were not designed to take advantage of the large parallel resources provided by current and upcoming systems. New approaches are required to harness the power of these systems. Key to overcoming these difficulties will be increasing the productivity of our scientists and our computer systems. The execution speed of a given code is of little value if the system is obsolete before any useful code can be run on it. When the level of abstraction is raised by the programming environment, developers are able to work in a domain specific to the problem at hand, not the domain of the machine. This approach provides a more natural and comfortable interface to the machine: there is no need to learn the semantics of an all-purpose environment or to "wedge" the problem into the model of traditional programming languages.
This work is released under LA-UR 10-07974. Los Alamos National Laboratory, an affirmative action/equal opportunity employer, is operated by the Los Alamos National Security, LLC for the National Nuclear Security Administration of the U.S. Department of Energy under contract DE-AC52-06NA25396. By acceptance of this article, the publisher recognizes that the U.S. Government retains a nonexclusive, royalty-free license to publish or reproduce the published form of this contribution, or to allow others to do so, for U.S. Government purposes. Los Alamos National Laboratory requests that the publisher identify this article as work performed under the auspices of the U.S. Department of Energy. Los Alamos National Laboratory strongly supports academic freedom and a researcher's right to publish; as an institution, however, the Laboratory does not endorse the viewpoint of a publication or guarantee its technical correctness.
With an ever-increasing number of nodes and cores, communication between processing elements in HPC systems is a performance bottleneck. Experience at LANL has shown that up to 50% of application execution time is actually just waiting on the network [2]. The network topologies in HPC systems are typically designed to accommodate the communication patterns of any generic application, but are not typically optimized to maximize the performance of any specific application. Ideally the network behavior and application would be co-designed to optimize performance by matching the topology to the communication needs of the application.
We are currently developing the idea of the "RDI" or Reconfigurable Data Interface. The RDI is a Field Programmable Gate Array (FPGA) based "Smart" network card connected via PCI-E that can make decisions about data movement and migration in order to optimize a computation. The programmer then interacts with the network via abstracted calls that provide them with some separation from the dirty details of working with an FPGA. The network application interface (API) supports the abstraction of a global name space, similar to a virtual address space. This allows the programmer to adopt a new paradigm, but not have to worry about performance. With a good abstraction layer, they will likely continue to use the new system. Accelerated "Tuple Spaces" are a potential answer to the problem of keeping track of data and computation on a large network. Performing load balancing on a large system requires data and computation to migrate to available resources, making it difficult for a program to manage computation effectively. By providing a fast mechanism to manage data and computation, we alleviate some of the burden from the programmer. This is accomplished through a distributed associative memory technique called Tuple Spaces. The attractiveness of the Tuple Space paradigm is that a user requests data by key or name, rather than by its location. This abstraction layer allows the system to maintain the location of data, similar to the way virtual memory works in common microprocessors.
The abstraction layer is modeled after NoSQL databases, which can scale to large amounts of data or across large clusters of computers. These databases are less structured, typically lack tables and are not relational. This decreases the overhead needed for storing data and eases the difficulty of scaling. An example of storage of this type would be key/value pairs. Key/value pair datastores organize the data using a key to store the data or value. Thus, the data can be recalled using the key that was used to store it. Memcached [7] is a prime example of the key/value datastore and is used by several major websites (Facebook, Twitter, YouTube, Wikipedia, etc. [4]) to cache data. Amazon uses a similar approach in their proprietary system called Dynamo [3], which powers the Amazon Web Services.
Another approach used by NoSQL databases is the distributed hash table (DHT). DHTs are commonly used in Peer-to-Peer systems (such as Bittorrent [16]) to store and distribute data. Pernicious systems, such as botnets, use DHTs to store and distribute data as well. DHTs have three properties that make them useful and potentially a key technology for high performance computing problems: Decentralization, Scalability and Fault Tolerance. Current investigation [12] shows how useful this approach could be for high performance computing.
In [6], the real performance of a Tuple Space implementation on a small cluster is approximately 300 tuple lookups per second, with a latency of about 60ms. This particular implementation is in Java and has the associated performance degradation of running in a virtual machine rather than a native implementation. By moving the details of managing the Tuple Space to hardware and allowing it to be accessed via standard API calls, we believe that the interest in implicit data routing may be rejuvenated.
Reconfigurable Data Interface
The Reconfigurable Data Interface (RDI) steers an application's computation and communication. As shown in Figure 1, the RDI is not just a network interface card. Adding a small local router to each processing node extends the ideas that are currently being applied to multi-core processing chips. Distributed routing eases the burden placed on higher-level routers required to build up very large machines. In multi-core nodes, a router is available for a small number of cores, enabling the transfer of data to be optimized at a local level. In this same way, the RDI's programmable local router provides the ability to off-load the transfer of data between local nodes and optimize the problem between the nodes within a given shelf of processing nodes. Programmable NICs have demonstrated speedups for various operations (e.g., [17], [13]). FPGAs have shown a 10x speedup for certain MPI collectives [8] and 1000x for network processing [1], which demonstrates the potential for custom operations within a programmable network to provide system speedup.
FPGAs provide a fabric upon which applications can be built. In these devices, an SRAM serves as the configuration memory that controls all of the functionality of the device. The look-up tables can produce any combinational logic functionality necessary, the registers provide integrated state elements, and the interconnect routes values into the appropriate paths to produce the desired operations. Devices also include a variety of custom blocks such as fast multipliers, Ethernet MACs, PCI-E end points, local RAMs, and clock managers.
Because the RDI is based on FPGA technology, application-specific instructions can be performed very quickly in the programmable hardware. A configurable network processor allows a system to be molded to the application and provides the customization necessary to handle different kinds of operations. For instance, customized network collectives can be implemented in hardware. These are operations that use a large number of nodes, and often are a determining factor in the overall performance of the application. Potential operations we plan on addressing include collectives such as all-reduce, distributed tree traversals, and scatter-gather primitives. A potential candidate for movement into the RDI is distributed tree traversal, where the network card automatically fetches the child nodes of a tree and presents them as a package for processing to the CPU or GPU.
A second key advantage that the RDI has is intelligent data routing. Normally, a network card interrupts the CPU to let the CPU know that some small amount of data has arrived and that it will transfer it to a known location in memory. The network card does not know what the data is or what needs to be done with it. In the case of exascale systems, it will be necessary for a more intelligent network interface to route the data or computation to the specific core or coprocessor that needs it. In the case of the RDI, this may be a CPU or PCI-E attached GPU.
The RDI approach reduces the time the nodes spend waiting for network operations and raises the level of network abstraction available to the programmer. This is key to producing a more efficient and manageable system.
Tuple Spaces
Tuple Spaces are based on giving data keys or names that are separate from their address in memory. This is similar to a virtual address with fully associative mapping. However, it can be extended across network nodes and can support more complex data structures than a flat page of memory.
In past implementations, the secondary lookup of the data's name has hindered the performance of a Tuple Space. In the RDI system, as shown in Figure 2, the FPGA maintains the tables that keep track of the current location of data blocks as well as ongoing requests for computation. This is done using the local memory on the FPGA as well as external DRAM. The FPGA also directly manages the network links, which run over SATA or other network connections. The FPGA interacts with the host using the PCI-E connection.
The approach of network-based management provides many benefits. First, the RDI is directly connected to the network and can keep track of data movement more closely than a CPU that is several levels removed from the network. Second, the RDI can fetch data directly from the GPU without interfering with the CPU's computation work. Third, the distributed nature of the Tuple Space means that data can be stored across the network, rather than solely at a given node. This positively impacts the reliability because the failure of one node does not bring down the entire computation. The ability to load balance across the network means that scaling and usability will potentially be improved over current MPI systems.
We have implemented the Tuple insertion, search, and deletion operations in an FPGA hardware system. This provides a significant speedup for finding and accessing data remotely in the system. At this point, we have not addressed the larger problem of inserting computation requests in the Tuple itself. Although this relieves the programmer of the burden of determining where computation will occur, it is not clear that the cost of moving computation and data can be mitigated. For some kernels and a limited set of computations it may be efficient, but for requests that include patterns (for instance, regular expressions), the process of handling the patterns would be very expensive to implement in hardware. This is particularly true with the memory-based hashing system we have devised: a more sophisticated pattern matching system would require Content-Addressable Memory, which would significantly increase the per-entry cost of the hash table.
Tuple Architecture
We have demonstrated a prototype distributed hardware-based Tuple Space system. Similar in spirit to the software-based LINDA [9] tuple system, the system builds a distributed hardware hash table across several FPGA nodes. The FPGAs communicate across multi-gigabit links to transfer data between nodes automatically. This emphasizes a co-design approach to large cluster design: allow certain data management and collection functions to migrate to hardware, while avoiding full-custom hardware design of a computation application kernel.
The hash architecture is the core of the tuple system. The hash key is the name of the tuple data element. The hash lookup results in the address of the data in local DRAM memory store. The architecture consists of the blocks in Figure 4 . The main block of the hash system is a CRC generator and a block RAM for storing the keys. The Cyclic Redundancy Check (CRC) generator is used to produce an initial hash address from a large, unique key for storing the data. The tuple space process starts when the hash table controller state machine resets the CRC generator and inputs the requested key. The output of the CRC is used as an address in the key block RAM. The output data format is shown in Table 1 .
If the requested operation is a write operation on a new key, the occupied bit is checked. If it is unset, then that address becomes the location of the new hash entry, and the key, together with the length and address of the data in bulk memory, is written into the block memory.
The CRC can take in a key of arbitrary size and produce a hash that acts as the address in a small local RAM. In our prototype, we used a 32 bit key which results in a 16 bit hash. The useful point of using the CRC generator is that it can accept much larger keys by progressively taking in chunks of the key. The large key is thus turned into a smaller index into the memory table. Two (or more) keys can still hash to the same index, but collisions are handled in hardware; thus even with collisions the system is still fast.
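The chunk-by-chunk folding of a key into a table index can be sketched in a few lines of Python. This is a software illustration only: the paper does not state which CRC polynomial the prototype uses, so the CRC-CCITT routine from Python's standard `binascii` module is an assumption, and the function name `key_to_index`, the 4-byte chunk size, and the long key name are hypothetical.

```python
import binascii

def key_to_index(key: bytes, chunk_size: int = 4) -> int:
    """Fold an arbitrary-length key into a 16-bit table index, one chunk at a time."""
    crc = 0  # reset the generator before each lookup
    for start in range(0, len(key), chunk_size):
        # Feed the next chunk, carrying the running CRC forward.
        crc = binascii.crc_hqx(key[start:start + chunk_size], crc)
    return crc  # value in 0..65535, usable as an address into a 64K-entry table

# A 32-bit key, as in the prototype, and a longer (hypothetical) key name.
short_key = (0xDEADBEEF).to_bytes(4, "big")
long_key = b"simulation/step-42/particle-block-7"

index_short = key_to_index(short_key)
index_long = key_to_index(long_key)
print(f"short key -> index {index_short:#06x}")
print(f"long key  -> index {index_long:#06x}")
```

Feeding the key in chunks gives the same result as hashing it in one pass, which is why the key length does not affect the width of the index.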
If the occupied bit is set, implying a collision, then the output CRC is fed back into the CRC generator to try again in a new location. Another approach is to increment the address and check again. This would save one cycle in the lookup, but could create clusters of data in the hash table. In software, collisions are often handled via dynamically created linked lists, but this is difficult to implement in hardware. Frequency and area performance is about the same for either option.
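The two collision strategies described above, feeding the CRC output back through the CRC versus incrementing the address, can be compared with a small Python sketch. It is not the hardware implementation: the CRC-CCITT polynomial from `binascii`, the helper names, and the probe limit are all assumptions made for illustration.

```python
import binascii

TABLE_BITS = 16
TABLE_SIZE = 1 << TABLE_BITS

def crc16(data: bytes) -> int:
    return binascii.crc_hqx(data, 0)

def next_by_rehash(index: int) -> int:
    """Feed the previous CRC output back through the CRC to pick the next slot."""
    return crc16(index.to_bytes(2, "big"))

def next_by_increment(index: int) -> int:
    """Step to the neighbouring slot; simple, but tends to cluster entries."""
    return (index + 1) % TABLE_SIZE

def find_free_slot(occupied, key, next_slot, max_probes=TABLE_SIZE):
    """Return (slot, probes) for the first unoccupied slot, or (None, max_probes)."""
    index = crc16(key)
    for probes in range(1, max_probes + 1):
        if not occupied[index]:
            return index, probes
        index = next_slot(index)
    # A rehash sequence can cycle without visiting every slot, so the
    # search is bounded rather than assumed to terminate on its own.
    return None, max_probes

occupied = [False] * TABLE_SIZE
key = (0x12345678).to_bytes(4, "big")
first = crc16(key)
occupied[first] = True  # force a collision on the first probe

slot_inc, probes_inc = find_free_slot(occupied, key, next_by_increment)
slot_rehash, probes_rehash = find_free_slot(occupied, key, next_by_rehash)
print("increment:", slot_inc, "after", probes_inc, "probes")
print("rehash:   ", slot_rehash, "after", probes_rehash, "probes")
```

The increment strategy is guaranteed to visit every slot, while the rehash strategy scatters colliding keys more widely at the cost of needing an explicit bound on the search.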
If the requested operation is a read or delete, the CRC output is checked for the occupied bit as well as comparing the input key with the stored key. If the occupied bit is set but the keys do not match, the entry is assumed to be from an earlier collision. By following the collision chain until the keys match, the collision can be resolved. In the case of a delete operation, the key is zeroed and the occupied bit is unset. If an unset occupied bit is found while resolving a collision chain, this implies that the key is no longer in the system, or was never there in the first place.
A collision chain with deleted entries is somewhat problematic, because a chain of collisions is determined to have ended if an unset occupied bit is found. If delete is not to corrupt the table, there are two options. First, all deleted key entries can be set to an arbitrary value, perhaps -1. Thus, any collision chains that happen to pass through the deleted entry will continue. However, this also prevents any new key from being entered in the location, effectively creating a memory leak.
The second option is to add another bit to the hash data structure to serve as a delete flag. An entry with this bit set is not a valid entry, but can be used for new data. This has the trade-off of increasing the size of the hash memory footprint. Because we envision the hash system running for extended periods of time with extensive New and Delete operations, spending an extra bit per entry was deemed a better option than an intentional memory leak.
Figure 5 illustrates the speed at which new entries are added to the system. This data was collected for a relatively small hash table to better illustrate the effect of collisions on the hash table performance. The initial entries generally do not encounter collisions, which means that they can be entered into the first address generated by the CRC. As the table becomes congested, the average time to find an empty slot increases. There are no quality of service requirements, except a desire for it to be fast, so the system will continue to search for a location until it has searched every slot. This may be unacceptable in a real system, but can only be addressed in hardware through allocating a larger memory at design time to the hash table, or moving the data somewhere else, be it a backing store in a larger DRAM or on another host entirely. We will address the second option in Section 3.3.
The mMIPS soft processor handles high-level commands as well as initiating and receiving transfers from the network. In addition to the block RAM, there is also a large DRAM store for placing the data itself. The data is transferred to and from the host and/or network via the switch box.
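The delete-flag option can be modeled in software to show why collision chains stay intact after a deletion. The sketch below is an assumption-laden illustration, not the FPGA design: it uses increment probing (the alternative named earlier) so that every slot is reachable, a CRC-CCITT hash from `binascii`, and hypothetical names (`Entry`, `TupleHashTable`). The entry fields follow the text (occupied bit, key, length, address) plus the extra delete flag.

```python
import binascii
from dataclasses import dataclass

@dataclass
class Entry:
    occupied: bool = False
    deleted: bool = False  # the extra flag bit: slot is reusable, chain continues
    key: int = 0
    length: int = 0
    address: int = 0

class TupleHashTable:
    def __init__(self, size: int = 8):
        self.slots = [Entry() for _ in range(size)]

    def _probe(self, key: int):
        start = binascii.crc_hqx(key.to_bytes(4, "big"), 0) % len(self.slots)
        for step in range(len(self.slots)):
            yield (start + step) % len(self.slots)

    def insert(self, key: int, length: int, address: int):
        reusable = None
        for i in self._probe(key):
            e = self.slots[i]
            if e.occupied and e.key == key:      # overwrite an existing key
                e.length, e.address = length, address
                return i
            if e.deleted and reusable is None:   # remember the first deleted slot
                reusable = i
            if not e.occupied and not e.deleted: # never-used slot ends the chain
                if reusable is None:
                    reusable = i
                break
        if reusable is None:
            return None                          # table is full
        self.slots[reusable] = Entry(True, False, key, length, address)
        return reusable

    def lookup(self, key: int):
        for i in self._probe(key):
            e = self.slots[i]
            if e.occupied and e.key == key:
                return e.length, e.address
            if not e.occupied and not e.deleted:
                return None                      # true end of the collision chain
        return None

    def delete(self, key: int) -> bool:
        for i in self._probe(key):
            e = self.slots[i]
            if e.occupied and e.key == key:
                self.slots[i] = Entry(occupied=False, deleted=True)
                return True
            if not e.occupied and not e.deleted:
                return False
        return False
```

A lookup skips over flagged entries instead of stopping, and an insert may reclaim them, so the table neither loses keys behind a deletion nor leaks slots.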
The mMIPS [5] soft processor is a 32 bit CPU that uses a small subset of MIPS-like instructions. Use of the soft CPU has allowed us to quickly prove out ideas in hardware at hardware speeds rather than develop full hardware implementations. Adding a standardized bus system to the CPU has allowed other IP to be quickly developed and combined to make larger and more complex systems. The C compiler for the CPU was modified to allow the use of larger instruction stores, so that we can have more complex programs. These features combined have made the mMIPS a useful vehicle for testing ideas and exploring the space of the RDI hardware system.
The switch box element is essentially an improvement over using the mMIPS processor to move data between the Aurora, PCI-E, and DRAM cores. Like a DMA system, addresses are set up for DRAM and PCI-E transactions, the "to" and "from" FIFOs are specified for streaming data, and the "go" bit is set. The data transfer then progresses without interference from the soft CPU.
Tuple Performance
In the case of JavaSpaces, the tuple space implementation on a small cluster results in 300 tuple lookups per second, with a latency of about 60ms. The performance of tuple insertion and deletion directly impacts overall application performance. In the current prototype system, as shown in Figure 5, tuples are entered into the hash table in roughly 130ns vs 500ns per tuple in a C implementation on a single CPU host. This naive initial measurement ignores a lot of complication in the system. For instance, the number of hash entries tested fits in the CPU cache, so write costs to main memory are ignored, and the FPGA-PCIe latency is not considered. These latencies are considered in more depth shortly.
Using a simple ring architecture, we predict the lookup latency in a 10 node prototype cluster to be 2.5µs, a 1000× performance improvement in locating and fetching remote data.
Given that we have a limited number of boards, we were only able to test up to four nodes, but this allows comparison to a similar setup using Memcached and a custom hash server. The custom hash server is based on persistent C++ TCP sockets and an open source hash library, uthash [10]. The difference between the two packages turned out to be less than 5%. The goal of the experiment was to have requests that fail on the first nodes and succeed on the node at the end of the ring, to determine the worst case performance. Thus, keys are added in the final node and assured to not be on the middle nodes.
We found the time for local accesses to the hash server for 1500 accesses to be about 610ms. The second experiment starts with the same 1500 requests originating on the first node, with each request failing, then failing over to a second node on the same switch. These 1500 requests required 205ms. In the third experiment, the requests fail on the second node and fail over to a third node. In this case, the 1500 accesses require about 400ms. Thus, the incremental cost for individual requests to an additional node in the ring is about 130µs. Much of the cost in the software-based approaches is traversing the network stack, as well as the PCI-E bus to the network. This cost is incurred for every remote request and the response, which adds up quickly. The largest cost in the FPGA implementation is the PCI-E latency, and that cost is only incurred once before the network takes over to search for the data. The incremental cost is approximately 5µs per node, or about a 26x improvement vs. the software network. The hash table operates at 100MHz. It could easily operate faster but that is of limited value given that the hash lookup is generally not the performance bottleneck. The system processes one hash lookup at a time, although the pipeline structure can accommodate three simultaneous lookups with minor changes.
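The worst-case experiment, in which a request misses on every node until the last one in the ring, can be modeled with a short Python sketch. Each node's hash table is stood in for by a plain dictionary, and the function name `ring_lookup` and the key `"block-17"` are hypothetical; no timing is modeled, only the number of nodes examined.

```python
def ring_lookup(nodes, start, key):
    """Walk the ring from `start`; return (value, nodes_examined)."""
    count = len(nodes)
    for step in range(count):
        table = nodes[(start + step) % count]  # next node around the ring
        if key in table:
            return table[key], step + 1
    return None, count  # every node was examined and none held the key

# Four nodes, with the key stored only on the final node of the ring.
ring = [{}, {}, {}, {"block-17": b"payload"}]
value, examined = ring_lookup(ring, start=0, key="block-17")
print(value, "found after examining", examined, "nodes")
```

Placing the key on the last node forces the request to fail on every earlier node, which is the case the measurements are designed to capture.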
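The per-node figures quoted above follow from the two failover measurements. The short Python calculation below reproduces the arithmetic using only numbers from the text; the variable names are illustrative.

```python
REQUESTS = 1500
two_node_ms = 205.0     # requests fail on node 1, succeed on node 2
three_node_ms = 400.0   # requests fail on nodes 1 and 2, succeed on node 3
fpga_per_node_us = 5.0  # incremental cost per node in the FPGA system

# Extra time per request attributable to one additional node in the ring.
software_per_node_us = (three_node_ms - two_node_ms) * 1000.0 / REQUESTS
speedup = software_per_node_us / fpga_per_node_us

print(f"software: {software_per_node_us:.0f} us per additional node")
print(f"speedup:  {speedup:.0f}x")
```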
The Xilinx ML-50x boards support a single lane of PCI-E connectivity, for a maximum of 2 Gbps between the host and the FPGA. In practice, for our small data blocks, bandwidth is less relevant than latency, at approximately 1µs per transaction. The PCI-E interface is based on the BMD design provided by Xilinx. The driver was developed using the Jungo WinDriver [15] package for Linux, but should port to Windows easily. The WinDriver package handles the setup and provides a simple API for communicating with a hardware device. We have measured the PCI-E latency from user software to the FPGA to be approximately 1.6µs per read transaction and about 250ns for write transactions. In a wiser implementation, the read transactions would be implemented as direct writes from the FPGA to the CPU's memory, but this is saved for future work. The latency of the other transactions is quite short in comparison with the PCI-E read latency, $T_{read\_pci}$ = 1.6µs. The hash lookups, $T_{hash}$, ignoring the occasional multiple-collision lookup, take less than ten cycles, or about 100ns. The time to fetch an address from the DRAM is approximately 155ns with 8 bytes every 5ns (burst). The hops between the FPGAs of the $n$ nodes via MGT links running Aurora are in the tens of cycles, $T_{hop}$, but are counted twice, for both the request and response legs. The total result is shown in Equation 1.
(1)
The ring structure reveals its limitations in this analysis. While the total time at any given node is low, the aggregation of hash lookups in every node the search passes through becomes expensive. Clearly a ring structure is not ideal in most cases, but is convenient when prototyping with the ML-50x series of boards (see Figure 3 for an out-of-box photo). These boards feature two SATA connectors, allowing for easy building of ring structures. The board also has a set of SMA connectors, allowing for a third link, so a tree structure is also quite feasible but somewhat less elegant.
The hash table system as implemented occupies 60% of the block RAM resources in the LX110T. The soft processor's memory occupies the balance of the block RAM. The hash table, including the CRC, operates at 100 MHz. This is not the maximum achievable speed, but decreases the time required for place and route. The mMIPS processor operates at 50 MHz and is responsible for steering data between the PCI-E connection, the hash table, and the two Aurora links. The Aurora links operate at 1.5 Gbps. The Aurora links can go much faster, up to 6 Gbps with Gen 3 cabling. The single-lane PCI-E connectors on the ML-50x boards are limited to 2 Gbps. Simply upgrading to Xilinx's ML-605 would increase the block RAM available for the hash table and bring an eight-lane connector for a proportional increase in bandwidth.
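As a rough aid to reading these figures, the latency components can be combined in a small Python model. This is only one plausible composition and should not be taken as the paper's Equation 1: it assumes one PCI-E read, one hash lookup per node searched, one DRAM fetch on the node that holds the data, and two link traversals (request and response) per hop. The per-hop time is given in the text only as "tens of cycles", so it is left as a required parameter; the function and parameter names are hypothetical.

```python
def ring_lookup_latency_ns(nodes_searched: int,
                           t_hop_ns: float,
                           t_pci_read_ns: float = 1600.0,
                           t_hash_ns: float = 100.0,
                           t_dram_ns: float = 155.0) -> float:
    """Estimate the time to find and fetch a tuple after searching `nodes_searched` nodes."""
    if nodes_searched < 1:
        raise ValueError("at least the local node is searched")
    hops = nodes_searched - 1  # links crossed on the way out
    return (t_pci_read_ns                 # host reads the result over PCI-E
            + nodes_searched * t_hash_ns  # one hash lookup in every node visited
            + 2 * hops * t_hop_ns         # request leg plus response leg
            + t_dram_ns)                  # fetch from DRAM on the owning node

# The 200 ns per hop used here is an arbitrary placeholder, not a measured value.
for n in (1, 2, 4):
    print(n, "nodes:", ring_lookup_latency_ns(n, t_hop_ns=200.0), "ns")
```

Under these assumptions the PCI-E read dominates for short searches, while the per-node hash and hop terms grow linearly with the number of nodes the request passes through.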
Because we developed the system using a combination of Xilinx ML-505 and ML-507 boards [14] , we must target a combination of "GTP" and "GTX" multi-gigabit transceivers. This is inconvenient, as the Aurora source trees are somewhat different. One convenience we added is a merging of the source trees so that any MGT on any transceiver tile on a particular V5 device can be targeted without delving deeply into the source code.
The largest performance improvement comes with knowledge of data structures. For instance, when traversing a tree or a linked list, the FPGAs can be given knowledge of the data structure in question. Traditionally, the traversal would require pulling data from the network, pushing it up through the PCI-E bus to decode the data structure and then back through the PCI-E bus to the network card for the next request. In our simple, single node tests with memcached [4] , traversing a 7 element linked list took roughly 125µs. In the hardware system with a single node, this same operation took 3µs, where about 2 µs is the PCI-E latency, and the pointer chasing essentially falls into the noise. The speedup is entirely due to traversing the PCI-E bus twice for the entire operation: first, to make the initial request, and second, to fetch the result. The pointer chasing through the hash table happens entirely in hardware. The data structure is handled by the soft processor, so adding knowledge about a particular data structure is software effort, not more expensive hardware development.
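The difference between the two traversal styles comes down to how many times the host must issue a request. The Python sketch below counts host requests for a linked list stored as named tuples; the dictionary stands in for the tuple store, and the function names and element names are hypothetical. It models request counts only, not measured latencies.

```python
def build_list(values):
    """Store a linked list as name -> (value, next_name) entries."""
    names = [f"elem-{i}" for i in range(len(values))]
    store = {}
    for i, value in enumerate(values):
        next_name = names[i + 1] if i + 1 < len(names) else None
        store[names[i]] = (value, next_name)
    return store, names[0]

def host_driven_traversal(store, head):
    """The host decodes each element and issues the next request itself."""
    values, requests, name = [], 0, head
    while name is not None:
        value, name = store[name]  # one host request per element
        requests += 1
        values.append(value)
    return values, requests

def offloaded_traversal(store, head):
    """The host issues one request; the chain is followed on the network card side."""
    values, name = [], head
    while name is not None:
        value, name = store[name]  # pointer chasing stays off the host bus
        values.append(value)
    return values, 1

store, head = build_list(list(range(7)))
print(host_driven_traversal(store, head)[1], "host requests when host-driven")
print(offloaded_traversal(store, head)[1], "host request when offloaded")
```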
Handling Larger Data Sets
In the hash table architecture discussion, we came upon an important question. What happens when the hash table is full, or full enough that the hash performance is degraded? In a software-based system, the size of the hash table can be expanded dynamically to available memory. This is more difficult in hardware unless we are willing to degrade performance by switching out of the limited on-FPGA block memory resources and into the essentially limitless external DRAM. However, given the distributed nature of the system, the latency to store entries on another node's block RAM can actually be less than going to external DRAM locally. This can be generalized into a system where hashed addresses can point anywhere in the system. Unfortunately, this global addressing has many characteristics of the systems we are trying to improve upon. The system is intended to be resistant to node or link failure (also addressed in Section 3.3). Thus, keeping a static pointer to a particular node is self-defeating. This is easily avoided by not using the node's address per se, but rather a secondary key structure that a node can claim. In this mode, the request traverses the network structure until it finds a matching node, then performs a hash lookup to find an index into the local DRAM.
Migration
Repeated global access to keys that are requested locally but stored at a remote location can cause unnecessary traffic on the network and extra work for the FPGAs. Migrating the data and hash table entries to the local node could be a solution, but can also increase the complexity of managing the data.
There are two potential pathways to achieve the goal of migration.
First, repeated requests to a particular tuple can be tracked in the tuple data structure. Requests that exceed a threshold can cause the entire data structure to be moved, including a deletion of the tuple from the original location. In this mode, a single copy of the data is accessed by all nodes in the system. This allows for simple coherence, but hurts performance. Migration allows for the copy to move to where it is most often used, but multiple readers could cause excessive migration. The movement of data will also cause the physical separation of the reader and owner of a piece of data. One strength of this approach is the ability to change values in the data structure pointed to by the tuple in-situ, rather than reading it out and completely re-writing it in memory.
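A threshold-based policy of this kind can be sketched as follows. Since migration has not been implemented in the hardware, everything here is hypothetical: the class name `MigratingTuple`, the per-requester counters, and the default threshold of three remote accesses are illustrative choices.

```python
from collections import Counter

class MigratingTuple:
    """Single-copy tuple that moves to a node once that node requests it often enough."""

    def __init__(self, name: str, owner: int, threshold: int = 3):
        self.name = name
        self.owner = owner            # node currently holding the only copy
        self.threshold = threshold
        self.remote_hits = Counter()  # remote requests seen, per requesting node

    def access(self, requester: int) -> bool:
        """Record an access; return True if the tuple migrated to the requester."""
        if requester == self.owner:
            return False              # local access, nothing to track
        self.remote_hits[requester] += 1
        if self.remote_hits[requester] >= self.threshold:
            self.owner = requester    # move the data; the old location is deleted
            self.remote_hits.clear()
            return True
        return False

t = MigratingTuple("block-17", owner=0)
moves = [t.access(2) for _ in range(3)]
print(moves, "owner is now node", t.owner)
```

With several readers alternating, a counter like this keeps triggering moves, which is the excessive-migration risk noted above.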
A more traditional compiler approach is to require static single assignment (SSA). In this mode, every data element is written once, and no re-writing is allowed. This allows references to be tracked more easily, as any change to the data structure requires a completely new tuple entry and allocation in the bulk storage. In this mode, any requester of a particular data structure can create a local copy of the data. Since it cannot be changed by another node, the data node remains valid. Any updates to the data will take place in a newly created tuple entry. Then, the updated tuple name/data combination can be propagated as nodes request the new tuple name.
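The write-once rule can be illustrated with a minimal Python model. This is a sketch under assumptions, not the planned implementation: the class name `WriteOnceTupleStore` and the `name@version` naming scheme for updated tuples are hypothetical.

```python
class WriteOnceTupleStore:
    """Tuple store in which each name may be written exactly once."""

    def __init__(self):
        self._data = {}

    def put(self, name: str, value: bytes) -> None:
        if name in self._data:
            raise KeyError(f"tuple {name!r} was already written")
        self._data[name] = value

    def get(self, name: str) -> bytes:
        return self._data[name]

    def update(self, name: str, version: int, value: bytes) -> str:
        """Publish a changed value under a new tuple name and return that name."""
        new_name = f"{name}@{version}"
        self.put(new_name, value)
        return new_name

store = WriteOnceTupleStore()
store.put("pressure", b"\x01\x02")
local_copy = store.get("pressure")  # safe to cache: this name never changes
new_name = store.update("pressure", 1, b"\x03\x04")
print(new_name, store.get(new_name), "original still", store.get("pressure"))
```

Because an existing name is never rewritten, a cached copy needs no invalidation; readers that want the new value ask for the new name.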
The issues with migration are theoretical at this point as they have not been implemented or tested in real hardware. However, the work is largely in software as the cases are special enough to implement solely in software in the embedded processor.
Improvements to Network
We also plan on exploring the issues of network topology in the future. Our second implementation will allow the network topology to dynamically change by creating and closing dedicated circuit-switched links as needed during program execution. This is akin to the operation of the original circuit-switched telephone network where switchboard operators were in place to physically connect and disconnect callers based on customer demand and connection availability as opposed to having dedicated lines between customers. In our case, unused circuit-switched links would immediately become part of the packet-switched network, providing more network capacity for generalized communications.
The basic system structure is shown in Figure 6. Each component of the system is commodity hardware. This will allow us to estimate the performance of the initial system without a large outlay for custom hardware. The network is logically built around a central MindSpeed crossbar switch [11]. This crossbar acts as the bulk circuit switching resource in the system, providing the ability to configure the network topology as needed, including hardware-based multicast. As shown in Figure 6, the crossbar forms an essentially direct electrical connection between ports, forming the desired network topology. The network interface cards (NICs), inside each PC, are connected to the crossbar. The NICs are responsible for two main functions. First, they provide the fast PCI-E interface with the CPU and memory of the computer. Second, they provide any packet and circuit switching capability that is not covered by the crossbar. Because the crossbar can only provide connectivity that is a subset of that enabled through physical wiring, the NICs provide any other needed connection paths and alternate routes. Using Aurora as a starting point, we will add a protocol layer providing flow control and packet headers for switch-mode links. This protocol layer will support both packet- and circuit-switched communications, meaning that the hybrid network can fluidly transition between operating modes without change of protocol.
Conclusion
We have presented a system for implicit routing of data via FPGA-based network cards. In this system, data structures are requested by name, and the network of FPGAs finds the data within the network and returns the structure to the requester. This is achieved through successive examination of hardware hash tables implemented in the FPGA. By avoiding software stacks between nodes, the data is quickly fetched entirely through FPGA-FPGA interaction. The performance of this system is 26x faster for each additional node searched, and orders of magnitude faster than software implementations for data structure traversals. This is due to the improved speed of the hash tables and lowered latency between the network nodes, and avoidance of PCI-E transactions.
