ABSTRACT Vector similarity searching consists of comparing a query vector against a high volume of entries in a reference data set, according to a chosen similarity metric such as the L1-norm, L2-norm, or Hamming distance. Large-scale research and commercial applications of these computations are developing rapidly across artificial intelligence fields as diverse as semantic text querying, retrieval of multimedia materials, and prediction of the properties of pharmacological molecules and engineered materials. While vector similarity searching is, at present, predominantly implemented on standard central processing unit (CPU) hardware running optimized indexing algorithms, the interest in massively-parallel computing architectures is increasing. Accordingly, a range of systems based on graphics processing units (GPU) and field-programmable gate arrays (FPGA) have been proposed; however, the availability of the design materials for these systems remains largely confined to a small number of corporations and research institutions. Here, we introduce a fully open-source hardware accelerator for vector similarity searching, based on an array of 21 FPGAs densely intertwined with 42 GB of high-speed dynamic memory and installed on a custom-designed compute node board, which yields an aggregate bandwidth of 33.6 GB/s and can be seamlessly reconfigured to implement nearly arbitrary distance calculations. A novel logic and software architecture, based on a lane-wise organization of independent engines implementing distributed distance calculation and sorting, allows attaining noteworthy query latency and power consumption performance on both single- and multi-node system configurations. The entire circuit board hardware, FPGA logic, and host software design is herein presented and freely provided for unlimited use, supporting open innovation and research in this area.
I. INTRODUCTION
The associate editor coordinating the review of this article and approving it for publication was Chintan Amrit.
VOLUME 7, 2019. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
Information retrieval, that is, extracting and reporting all entries from a data set which match some given search conditions, has represented a significant application of computing since its inception. With the exponential increase in the amount of stored data and its diversification, over the last two decades the ability to retrieve and sort approximate matches according to their relevance has grown in importance; it is now central to virtually all practical search engines, and to many artificial intelligence applications in general. While a large number of material-specific structured representations of information and indexing algorithms have been developed, a more general approach based on feature vectors often remains preferable, when not unavoidable. In this approach, an initially large amount of data such as an image or document is reduced to a small set of carefully chosen scalar values (features), usually on the order of 50-500 numbers, which encompass those properties deemed most relevant to the given search problem. For example, each feature may represent how intensely a corresponding semantic field is reflected in a text. Information retrieval, then, requires finding the vectors in a data set that have the shortest distance to a query vector according to a chosen distance (that is, similarity) measure, such as the $\ell_1$-norm, [...] reconfigurability, which allows realizing application-specific logic that minimizes unused structures and optimizes data paths, often resulting in higher performance and lower power consumption compared to both CPUs and GPUs. Accordingly, besides in indexing-based applications such as genomics, FPGAs have been extensively developed as an efficient means of implementing convolutional neural networks featuring heterogeneous size, topology, and precision.
Furthermore, of particular relevance to the present study, promising results have been consistently obtained regarding the application of FPGAs in the parallel calculation of vector distance measures such as the $\ell_1$- and $\ell_2$-norms [13]-[29]. Vector similarity searching is a heavily memory-intensive computation, wherein a constant stream of data set feature vectors needs to be steadily and efficiently supplied to the units calculating the distances at the level of the vector elements and summing them. The application is, therefore, paradigmatic in exposing the memory bottleneck inherent in the Von Neumann architecture. Consequently, a wide range of solutions have been proposed, most of which can be broadly grouped into in-memory computing, wherein calculations physically take place within a memory cell array, and near-memory computing, wherein memory and computation logic are brought closer, for example, through intertwining them on a single die. An early and paradigmatic example was the Zero Instruction Set Computer (ZISC), which could calculate vector distances in a massively parallel manner and perform sorting in a distributed form through a tailored bus logic infrastructure. Unfortunately, an issue with these approaches is that the fabrication processes typically used for realizing compute elements are well-suited for building Static Random-Access Memory (SRAM) but not Dynamic RAM (DRAM), which requires different, often competing process choices and optimizations. This represents an issue because the storage density of SRAM is drastically lower in terms of both unit area and unit power, even though the bandwidth can be higher. As a consequence, SRAM-based in- and near-memory computing tends to deliver high throughput and low query latencies, but at a density which is inadequate for most mainstream data center applications [30]-[39].
In this framework, FPGAs acquire particular interest as a means of realizing optimized board-level hardware tightly intertwining memory and compute infrastructure, by leveraging the high density and low cost of commercial DRAM, together with seamless reconfigurability of the distance calculation units, for example in terms of distance measure, vector depth and width. In this paper, we present the complete hardware, logic, and software design of a novel co-processor accelerating large-scale vector similarity searching and realizing these aspects. Its tailored architecture is based on an array of independent engines implementing distributed distance calculation and sorting; it allows attaining noteworthy query latency and power consumption performance on both single- and multi-node system configurations. Furthermore, in contrast with GPU hardware, which is inherently proprietary, and with the majority of published FPGA accelerators, which are at prototype stage and also closed-source, here we fully provide the blueprint of all elements realizing a complete system, for unlimited community use and further development.
II. SYSTEM OVERVIEW
The iFlex system is a server running a collection of software and containing one or more custom compute node boards. The system is designed to accelerate nearest-neighbor searches, but is entirely reconfigurable, within the boundaries considered in the Discussion section; hence, it is in principle capable of performing a wide variety of functions related to vector similarity searching. In its present realization, it accelerates searches by performing massively-parallel Distance Calculations between a query vector and the entire set of reference data set vectors stored in one or more servers.
To perform the massively parallel calculations, the system utilizes custom-designed Compute Node boards, each one comprising an array of FPGA devices that perform Distance Calculations on known data vectors stored locally, alleviating the memory bottleneck which affects other architectures. Each FPGA streams the known data vectors from its local fragment of a reference data set, and performs Distance Calculations at the highest speed at which it is able to continuously read data from its memory bank. This architecture provides a low and largely deterministic latency in response to any query vector search request.
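The streaming search principle described above can be illustrated in software. The following is a minimal sketch under stated assumptions, not the iFlex implementation: the data set is split into fragments, one per hypothetical engine, each fragment is scanned exhaustively, and the per-fragment best matches are merged. The function names, the round-robin assignment of vectors to fragments and the use of the L1 distance are illustrative choices that are not taken from the source.

```python
import heapq

def l1(a, b):
    """L1 distance between two equal-length integer vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def split_into_fragments(vectors, n_engines):
    """Assign vectors to engines round-robin (an assumption), keeping the global index."""
    fragments = [[] for _ in range(n_engines)]
    for idx, vec in enumerate(vectors):
        fragments[idx % n_engines].append((idx, vec))
    return fragments

def scan_fragment(fragment, query, k):
    """Exhaustively scan one fragment; return its k best (distance, index) pairs."""
    return heapq.nsmallest(k, ((l1(query, vec), idx) for idx, vec in fragment))

def search(vectors, query, k, n_engines=21):
    """Scan every fragment (done in parallel by the hardware), then merge the results."""
    fragments = split_into_fragments(vectors, n_engines)
    partial = [scan_fragment(f, query, k) for f in fragments]
    return heapq.nsmallest(k, (hit for hits in partial for hit in hits))

# Small synthetic data set of 100 three-element vectors.
dataset = [[i % 7, (3 * i) % 11, (5 * i) % 13] for i in range(100)]
best = search(dataset, query=[0, 0, 0], k=3)
```

Because every fragment reports its own k best candidates, the merged list is identical to the one an exhaustive scan of the whole data set would produce.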
The iFlex system software provides an easy-to-use Application Programming Interface (API) for managing the data set vectors stored on the compute node boards, submitting queries to the system, and retrieving the results. The software handles all aspects of communicating with the boards, and can seamlessly support multiple boards, located in a single or multiple server chassis. Query vector calculations can be run in parallel across all boards in the system, further decreasing the latency and increasing the throughput.
The remainder of this paper describes the system architecture, focusing on the board hardware and the system software. To support academic research as well as commercial open innovation in this area, the entire hardware and software design is herein provided for unlimited use under the standard terms of the GNU General Public License version 3 [40].
III. HARDWARE SYSTEM ARCHITECTURE
A. CIRCUIT BOARD DESIGN
The iFlex compute node is a full-length Peripheral Component Interconnect express (PCIe) form-factor board (312 × 112 mm); it is equipped, as shown in Fig. 1, [...] (Lattice Semiconductor, Inc., Portland OR, USA; detailed description in Ref. [42]) connected to two industry-standard 8 Gbit Double Data-Rate 3 (DDR3) DRAMs; the latter provide 2 GB of local and independent storage for each Calculation Engine, in a 32-bit configuration running at a clock rate of up to 400 MHz. The corresponding theoretical aggregate bandwidth from the memory to the calculation logic is 33.6 GB/s.
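The capacity and bandwidth figures quoted above can be cross-checked with simple arithmetic. The short sketch below only divides the stated numbers; the derived per-engine bandwidth and the full-scan time are idealized lower bounds that ignore refresh and protocol overheads, and are not figures reported in the source.

```python
N_ENGINES = 21             # Calculation Engines per board (3 lanes x 7 engines)
MEM_PER_ENGINE_GB = 2      # two 8 Gbit DDR3 devices per engine
AGGREGATE_BW_GB_S = 33.6   # theoretical aggregate bandwidth quoted in the text

# Total on-board data set memory.
total_memory_gb = N_ENGINES * MEM_PER_ENGINE_GB

# Share of the aggregate bandwidth available to each engine.
per_engine_bw_gb_s = AGGREGATE_BW_GB_S / N_ENGINES

# Idealized time for an engine to stream its whole local memory once.
full_scan_s = MEM_PER_ENGINE_GB / per_engine_bw_gb_s

print(f"total memory: {total_memory_gb} GB")
print(f"per-engine bandwidth: {per_engine_bw_gb_s:.2f} GB/s")
print(f"idealized full scan: {full_scan_s:.2f} s")
```

Since all engines stream their local memories concurrently, the same idealized scan time applies to the board as a whole.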
As visible in Fig. 1 and Fig. 2, the Calculation Engines are organized as three lanes with seven Calculation Engines per lane, each one arranged as a Compute Island, comprising an FPGA having an independent memory array. The Board Master FPGA interfaces to the host PCIe bus, communicating commands and results between the host and the Calculation Engines. Communication between the Board Master and the Calculation Engines in each lane is accomplished via a bi-directional daisy-chain bus, referred to as the Parallel Bus (PBus), comprising data lines configurable as 20 single-ended (SSTL-135 standard, [43]) or 10 differential-pair (DIFF-SSTL-135 standard, [43]) connections, together with a further 3 single-ended status lines. To alleviate the design challenges related to timing closure, source-synchronous clocking is implemented in both directions; in the present version of the system, the PBus operates at up to 100 MHz in differential, single-data-rate mode, but the hardware can also support double-data-rate operation. To ensure compliance with data center-grade requirements, the board is powered by seven high-efficiency switching-mode converters supplied by an external 12 V input, generating multiple high-current 1.0 V, 1.1 V, 1.35 V, 2.0 V and 2.5 V rails, and monitored by a microcontroller suitable for implementing the Intelligent Platform Management Interface protocol (IPMI, [44]). All programming and debugging signals are consolidated into a service connector located on the top card edge.
FIGURE 2. Layout design of the iFlex board. Left side: Board Master FPGA with associated DRAMs. Right side: compute array, comprising 3 horizontal lanes of 7 compute islands, connected in a point-to-point manner via the PBus and each containing one FPGA alongside two dedicated DRAMs (see inset). Green, yellow, blue and violet: selected internal signal routing copper layers (DRAM: dynamic random-access memory; FPGA: field programmable gate array; PBus: parallel bus; SMPS: switching-mode power supply).
The circuit board design, visible in Fig. 3, has been optimized for manufacturing also in resource-constrained settings despite its high density, in that it does not require technologies such as buried or blind vias. Power bypassing and filtering are accomplished with under-package 0402-size capacitors (i.e., 10 nF, 1 µF and 10 µF) and additional 1210-size capacitors (i.e., 100 µF), which were selected through spectral response simulations. The board contains 12 layers, of which 6 are used for power supply and ground distribution and 6 are allocated to controlled-impedance signal routing; it contains ≈ 13,000 vias, the smallest finished size of which is 0.2 mm, with a smallest copper feature width of 0.1 mm, rendering the design well-suited for mass production. Preliminary analyses indicate that a single-height copper heat-sink having an area of 252 × 94 mm and a height of 12 mm, with parallel fins aligned with the chassis air flow direction, is sufficient for passive cooling under the majority of normal operating conditions in a 3U-6U size server or suitable workstation.
The board schematics were entered using Orcad Capture ver. 17.2 (Cadence Design Systems, Inc., San Jose CA, USA) and the layout design was performed under PADS ver. VX.2.1 (Mentor Graphics, Inc., Wilsonville OR, USA). The complete fabrication materials for the main board and the service adapter, including the RS-274X X-Gerber files, drilling files, netlist, schematics and bill-of-materials are freely available on the specific vsx-board project repository on GitHub [45] , and a snapshot is provided as Supplementary Material to this article. Fully-built and tested prototypes, referred to as VSX boards, are also available upon request [46] . 
B. BOARD MASTER LOGIC
The logic blocks instantiated within the Board Master FPGA and used to interface the lanes containing the Calculation Engines to the host are shown in Fig. 4. The Board Master is connected to the host computer via a PCIe interface, which is operated in Direct Memory Access (DMA) mode by means of a dedicated core (Xillybus, available at Ref. [47]).
To maximize its configuration flexibility, the Calculation Engine interface logic consists of a separate Lane Communication logic instance for each lane. Each Lane Communication block connects to the Xillybus block by a pair of First-In First-Out (FIFO) buffers: a write FIFO receiving commands and data from the host for sending them to the Calculation Engines (i.e., upstream), and a read FIFO transmitting command response and data from the Calculation Engines towards the host (i.e., downstream).
The lane read and write FIFOs are connected, respectively, to the Packet Transmitter and Packet Receiver logic blocks. The Packet Transmitter receives data from the host and, after packet recognition, formats each packet for use by the Calculation Engines, then sends it to the PBus Transport Logic block for transmission to the same. Correspondingly, the packets transmitted by the Calculation Engines are received by the PBus Transport Logic block; they are subsequently passed on to the Packet Receiver block which, in turn, formats each packet for the host and stores it in the read FIFO. The PBus Transport Logic handles all aspects of the PBus physical protocol for transmitting and receiving the packets on each lane.
A pair of additional FIFOs are connected to the Xillybus core for communicating to a Serial Peripheral Interface (SPI) Master module, which is used in downloading the programming bit-stream to the Calculation Engine FPGAs and in resetting them. Furthermore, the Differential Clock Generator generates 21 differential clock pairs, each of which supplies the primary clock input of one Calculation Engine.
The complete logic-level source code of this block, written in the VHSIC Hardware Description Language (VHDL), is freely available on the specific vsx-core-agent project repository on GitHub [48], and a snapshot is provided as Supplementary Material to this article. It was synthesized and implemented using the Vivado Design Suite ver. 2017.2 (Xilinx, Inc.). To aid understanding, automatically-generated block diagrams for the first two hierarchical levels of the VHDL code are also provided as Supplementary Material.
C. CALCULATION ENGINES
As represented in Fig. 5 , there are two major sections in the logic which constitutes a Calculation Engine and is implemented within the FPGA in each Compute Island: the Packet Transport Layer, and the array of Calculation Modules. Furthermore, supporting infrastructure includes a DDR3 Memory Controller (enclosing core logic provided by the FPGA manufacturer) and a Configuration Module, which implements the interface controlling the operation of the Calculation Engines.
The Packet Transport Engine contains three primary components: the downstream PBus Transport Logic module, the Packet Layer module, and the upstream PBus Transport Logic module. Similarly to the PBus Transport Logic module instantiated within the Board Master, the PBus Transport Logic handles the physical protocol for transmitting and receiving all packets on the PBus. The Packet Layer module inputs packets that are being streamed through the bus, awaiting any that match its hardwired Calculation Engine address. All packets transmitted from the Board Master are visible to all Calculation Engines, with each Calculation Engine Packet Layer being responsible for accepting and processing those packets which are addressed to the corresponding Calculation Engine. Furthermore, the Packet Layer performs the functions related to creating and transmitting packets containing data, status and results to the Board Master.
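The address-based acceptance of packets along a lane can be sketched as follows. This is a simplified behavioral model under stated assumptions, not the VHDL logic: the `Packet` and `Engine` classes, the integer addresses 0 to 6 and the payload contents are hypothetical, and the physical PBus protocol is not modeled.

```python
from dataclasses import dataclass, field

@dataclass
class Packet:
    address: int     # hypothetical destination engine address
    payload: bytes   # hypothetical command or data content

@dataclass
class Engine:
    address: int                                   # stands in for the hardwired address
    accepted: list = field(default_factory=list)   # payloads processed by this engine

    def on_packet(self, packet):
        """Accept the packet only if it is addressed to this engine."""
        if packet.address == self.address:
            self.accepted.append(packet.payload)

def stream_along_lane(lane, packet):
    """Pass a packet along the daisy chain; every engine observes it."""
    for engine in lane:
        engine.on_packet(packet)

# One lane of seven engines, as on the board.
lane = [Engine(address=a) for a in range(7)]
stream_along_lane(lane, Packet(address=3, payload=b"write-query"))
```

In this model only the engine whose address matches keeps the payload, while the packet still travels past all other engines in the lane.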
The Configuration Registers module stores the parameters of the Calculation Engine and holds status information which can be retrieved by the host. Reading and writing these registers is accomplished via the PBus, with the Command Packet Detector accessing the Configuration Registers through the Register Read/Write Interface. The DDR3 Memory Controller enables receiving data set vectors from the Command Packet Detector and storing them in the DRAMs. It receives multiple control signals from the Configuration Registers, such as the count and length of the data set vectors. In addition, the same controller reads the data set vectors stored in the DRAMs and supplies them to the Distance Calculator row module, which is detailed below.
The Distance Calculation (DC) Slots Array module performs the Distance Calculations and the pre-sorting of results. It is a flexible design based on the so-called enhancement of radial basis function with restricted Coulomb energy learning, or kNN based classifiers, described in detail in the Refs. [49] and [50] ; these were, in turn, partly inspired by the initial idea of the ZISC, which could operate both as a neuromorphic device and as a content-addressable memory, as further specified in the Refs. [30] and [31] .
All settings that specify the Distance Calculation parameters such as the value of k in a kNN calculation, or the threshold value in a threshold-mode calculation, are stored in the Configuration Registers. The DC Slots Array harbors at least one slot, while the maximum number of slots is determined by the capacity of the Compute Island FPGA together with the complexity of the chosen distance measure (presently, the maximum is 16 slots).
The calculations in a slot begin whenever the Write Query Vector command is received by the associated Command Packet Decoder, which subsequently loads the Query Vector Buffer into the first open slot, as determined by the Query Vector Multiplexer. For each query vector, the data set vectors are written in parallel to all the slots. After the Distance Calculation is finished, the results are written to the Results Capture module, then passed to the Response Packet Generator module of the Packet Layer.
The Response Packet Generator receives the responses from the associated Calculation Engine and thereafter forms the downstream packets. Furthermore, it accepts downstream packets from the right neighbor Calculation Engine through its right-side Transport Layer. All packets from both sources are transmitted towards the downstream Calculation Engine (or Board Master FPGA) through the PBus. The operation of the DC Slots Array is described in further detail below.
The complete logic-level source code of this block, written in the VHDL language, is freely available on the specific vsx-core-dce project repository on GitHub [51], and a snapshot is provided as Supplementary Material to this article. It was synthesized and implemented using the Lattice Diamond Software ver. 3.11 (Lattice Semiconductor, Inc.). To aid understanding, automatically-generated block diagrams for the first two hierarchical levels of the VHDL code are also provided as Supplementary Material.
D. COMMAND, TRANSPORT LAYER AND PARALLEL BUS
As visible in Fig. 5 , each Calculation Engine includes a Command Packet Decoder module, which receives and parses packets, detecting commands and writing them to the corresponding module. The Command Packet Decoder is also responsible for forwarding packets to the upstream neighbor Calculation Engine.
As shown in Fig. 1, the seven Calculation Engine FPGAs in each lane on a board are connected to their first neighbor(s) using a point-to-point bidirectional Parallel Bus. The Board Master is, in turn, connected to the first calculation FPGA in each lane via a separate bidirectional Parallel Bus. The default bus direction is upstream, that is, from the Board Master towards the chain of Calculation Engines. The bus direction is changed from upstream to downstream whenever a Calculation Engine asserts a signal indicating that a packet is available for transmission to the Board Master. The receiving downstream Calculation Engine then acknowledges the request by changing a dedicated direction control signal. At this point, the upstream Calculation Engine begins sending packets to the downstream Calculation Engine.
Downstream transmission continues until at least one of the following two cases is verified. The first case is that the upstream Calculation Engine indicates that it does not have any more downstream packets to send. The second case is that the upstream Calculation Engine signals that it has more packets to transmit, but also the downstream Calculation Engine has packets to send upstream. Given that only the downstream Calculation Engine can change the direction of the bus between itself and its neighboring upstream Calculation Engine, during downstream transmission, the downstream Calculation Engine checks, every few clock cycles, for the presence of packets that it needs to transmit. In both cases, the downstream Calculation Engine switches the bus direction at the next packet boundary, then transmits its packet upstream. 
IV. DISTANCE CALCULATION MODULES
A. DISTANCE CALCULATION ROW
The Distance Calculation (DC) row module performs Distance Calculations then pre-sorts the calculated distance results, and its architecture is shown in Fig. 6 . Each DC Row includes 16 Distance Calculation Math (DCM) units, each one receiving two data inputs. The first data input is received from the Query Vector Buffer (QVB), which is common to all DCMs in each row and holds the incoming Query Vector, whose distances are to be calculated with respect to all vectors in the stored data set. The second data input comes from the Data Set Storage (DSS) memory; this input is unique to each calculation unit in the DC Row.
As visible in Fig. 5 , the sixteen DCM units receive data from the local DRAMs via a DDR3 Memory Controller. Each DRAM has a 16 bit-wide data bus, yielding a total of 32 bits. The logic clock runs at half the frequency of the DRAM controller; therefore, the logic-to-memory clock ratio is such that 128 bits of data are available from the DRAMs for every logic clock cycle. A gearing is thus implemented: this is specified in Fig. 7 , where the two inputs to the DCMs, namely from the DDR3 Memory Controller and from the QVB, are visible.
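The 128-bit figure follows from the bus widths and clock ratio given above. The arithmetic below is a sketch of one plausible reading, in which the double data rate of DDR3 contributes two transfers per memory clock; the final division across the sixteen DCM units is an inference that happens to match the 8-bit element width used for the L1-norm, and is not stated explicitly in the source.

```python
DRAM_COUNT = 2               # DDR3 devices per Calculation Engine
DRAM_BUS_BITS = 16           # data bus width of each device
TRANSFERS_PER_MEM_CLK = 2    # double data rate: two transfers per memory clock
MEM_CLKS_PER_LOGIC_CLK = 2   # logic clock runs at half the controller frequency
DCM_UNITS = 16               # Distance Calculation Math units per DC Row

# Combined width of the memory data bus.
bus_bits = DRAM_COUNT * DRAM_BUS_BITS

# Data delivered to the logic in one logic clock cycle.
bits_per_logic_clk = bus_bits * TRANSFERS_PER_MEM_CLK * MEM_CLKS_PER_LOGIC_CLK

# Share of that data available to each DCM unit per logic clock cycle.
bits_per_dcm = bits_per_logic_clk // DCM_UNITS

print(bus_bits, bits_per_logic_clk, bits_per_dcm)
```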
Upon completion of the Distance Calculation, the 16 calculated distances enter the Row Tree module, which is responsible for pre-sorting the results and outputting them to the Row Best Result module in a low-to-high sorted order. Sixteen clock cycles are required to clock out the sixteen results from the DC Row to the Row Best module.
As shown in Fig. 5 , each Calculation Engine supports multiple DC Rows in parallel. This enables multiple search operations to be performed simultaneously, which drastically improves the search throughput of the system. Each DC Row can receive a Query Vector independently, or asynchronously, of all the other rows. Each DC Row can start its Query Vector search operation on a data set vector boundary. Therefore, each DC Row can effectively begin its Query Vector calculations at any data set vector boundary, and complete its search at the initial data set vector minus one (the end of its data set).
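The ability of a row to join the data set stream at any vector boundary and finish one vector before its starting point can be modeled as a circular scan. The sketch below is a behavioral illustration only: the function `circular_scan`, its `joins` argument and the use of the L1 distance with a single best result per query are assumptions made for brevity.

```python
def l1(a, b):
    """L1 distance between two equal-length integer vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def circular_scan(dataset, joins):
    """Simulate queries joining a circular stream of data set vectors.

    joins maps a stream step (a vector boundary) to a (query_id, query) pair
    loaded at that step. Returns {query_id: (best_distance, best_index)}.
    """
    n = len(dataset)
    pending = dict(joins)
    active = {}     # query_id -> [query, vectors_remaining, best_so_far]
    results = {}
    step = 0
    while pending or active:
        if step in pending:
            qid, query = pending.pop(step)
            active[qid] = [query, n, None]
        idx = step % n                 # vector currently streamed from memory
        vec = dataset[idx]             # the same vector is seen by all active rows
        for qid in list(active):
            query, remaining, best = active[qid]
            cand = (l1(query, vec), idx)
            if best is None or cand < best:
                best = cand
            remaining -= 1
            if remaining == 0:         # the row has seen the whole data set once
                results[qid] = best
                del active[qid]
            else:
                active[qid] = [query, remaining, best]
        step += 1
    return results

dataset = [[0, 0], [5, 5], [9, 1], [2, 8]]
results = circular_scan(dataset, {0: ("q0", [1, 1]), 2: ("q1", [9, 0])})
```

Here the second query joins at vector index 2, wraps around, and completes at index 1, while the memory is read only once per stream step regardless of the number of active queries.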
A key characteristic of this architecture is that it is agnostic to the specific distance measure coded in the Calculation Engine. In the current version of the system, two measures are provided. First, the $\ell_1$-norm, which is defined as $\|x\|_1 = \sum_{i=1}^{n} |x_i|$ and implemented for 8-bit integers. Second, the Hamming distance, which is defined as $D_H = \sum_{i=1}^{n} |x_i - y_i|$, where $x_i, y_i \in \{0, 1\}$, and implemented bit-wise for 32-bit words. By design, all calculations occur at the rate of one clock cycle per feature vector element; hence, there is a linear dependence of the calculation time on the feature vector length, regardless of the implemented distance measure. In addition, an overhead of approximately 16 clock cycles is incurred at the end of each vector.
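The two distance measures and the linear timing model can be written out in a few lines. This is a software sketch for reference, not the FPGA logic: the function names are hypothetical, the L1 distance is taken between a query and a data set vector, and the cycle estimate simply adds the stated overhead of approximately 16 cycles to the element count.

```python
def l1_distance(q, v):
    """L1 distance between two vectors of 8-bit integers."""
    return sum(abs(a - b) for a, b in zip(q, v))

def hamming_distance(q_words, v_words):
    """Hamming distance between bit vectors packed into 32-bit words."""
    # XOR marks the differing bits; counting the ones gives the distance.
    return sum(bin((a ^ b) & 0xFFFFFFFF).count("1") for a, b in zip(q_words, v_words))

def cycles_per_vector(n_elements, overhead=16):
    """One clock cycle per element plus the approximate end-of-vector overhead."""
    return n_elements + overhead

d1 = l1_distance([10, 200, 3], [12, 190, 3])
dh = hamming_distance([0b1011, 0xFFFFFFFF], [0b0001, 0x00000000])
cycles = cycles_per_vector(128)
```

For a 128-element vector the model gives 144 cycles, so the fixed overhead becomes proportionally smaller as the feature vectors grow longer.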
B. ROW BEST RESULTS
The Row Best Results (RBR) block receives the calculation results from the sixteen DCMs in the DC Row module, determines the best results in accordance with the selected calculation algorithm, then sends them to the Results Capture module. Two different Row Best Results algorithms are implemented: the kNN mode and the Threshold mode.
In kNN mode, the best k results are to be found. In this case, the RBR receives all distances calculated by the DC Row module for a query vector, then selects the k best results (shortest distances, i.e., smallest values). The RBR receives the calculated distances batch by batch as input data, until a dedicated signal indicates that the results of the last stored data set batch have been received. Until then, the calculated distances are alternately stored in two FIFOs, A and B, whose read and write roles are controlled by a toggle signal. When the first batch's distances are received, they are written to FIFO A; the toggle logic then switches FIFO A to read mode and FIFO B to write mode. When the next batch's distances are received, they are input, together with the previous distances read from FIFO A, to the kNN Node, which is responsible for merge-sorting the two sorted sequences. The merged result is written to FIFO B, after which the toggle logic switches FIFO A to write mode and FIFO B to read mode. In other words, each batch's distances enter the kNN Node together with the previously sorted distances from the FIFO currently in read mode, and the best k distances are written to the FIFO currently in write mode. The process continues, sorting and storing the best k results, until the signal flagging the last batch is received. At that point, the previously merged and sorted results enter the kNN Node together with the final batch's results, and the final sorting is performed. After final sorting, the best k results are written to the Results FIFO Z and, in turn, routed to the Results Capture module.
In Threshold mode, the RBR module receives the calculated distances from the DC Row module as input data, along with the specified Threshold value. A comparator determines whether each result meets the Threshold criterion, then transmits the distances that meet it to the Results FIFO Z, from which these are, in turn, routed towards the Results Capture module.
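The two Row Best Results modes can be sketched in software as follows. This is a behavioral illustration under stated assumptions rather than the hardware design: Python lists stand in for FIFOs A and B, each batch is assumed to arrive already sorted (as output by the Row Tree), the first batch is truncated to k entries for simplicity, and the Threshold criterion is assumed to be "less than or equal to", which the source does not specify.

```python
import heapq
from itertools import islice

def knn_stream(batches, k):
    """Keep the k smallest distances over a stream of pre-sorted batches."""
    fifos = {"A": [], "B": []}
    read, write = "B", "A"       # the first batch is therefore written to FIFO A
    for batch in batches:
        # Merge the best-so-far list with the new batch (both already sorted).
        merged = heapq.merge(fifos[read], batch)
        # Store only the best k merged distances in the write-mode FIFO.
        fifos[write] = list(islice(merged, k))
        # Toggle the read and write roles of the two FIFOs.
        read, write = write, read
    return fifos[read]           # final content, as would be sent to Results FIFO Z

def threshold_stream(batches, threshold):
    """Pass through, in arrival order, every distance meeting the threshold."""
    return [d for batch in batches for d in batch if d <= threshold]

batches = [[3, 9, 12], [1, 4, 20], [2, 2, 30]]
best3 = knn_stream(batches, k=3)
below = threshold_stream(batches, threshold=4)
```

Alternating between two buffers means that one always holds a consistent best-k list for reading while the other is being written.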
V. SOFTWARE SYSTEM ARCHITECTURE
A. OVERVIEW AND MODULES
Finally, the iFlex software bridges the user application with the hardware system. It is based on a Transmission Control Protocol/Internet Protocol (TCP/IP) client-server architecture. The user interacts with the system by sending high-level commands over a TCP/IP connection. The software has two modules: a client module and a server module. The client module is installed on the application machine, while the server module is installed on the server housing the board(s), if different. The server module translates the high-level commands received from the client module into low-level commands, then sends them to the hardware. The server module further collects the results from each board, aggregates them, then returns them via the client module. The server application architecture is depicted in Fig. 8. The Configuration Parser module is responsible for reading the configuration file, wherein operational settings including the server listening address, port, API key, log file path, log level, local data set directory path, and slave node definitions are stored. The parser makes these values available to the other server application components. The Logger module records all application events for debugging and performance analysis. The Listener module is the entry point for client connections: it accepts each incoming connection and creates a session for it. In turn, the Session(s) handle the request, and include the Request Parser, Checksum(s) and API Key Validation, Request Dispatching to Master Node and Response Builder.
Communication between the client and the server is performed via packet transfers. The packets consist of a fixed-length header followed by a variable-length body. The Request Parser module is responsible for inspecting the packet header and calculating its checksum: if this is valid, it reads the body and validates its checksum, otherwise it drops the packet. Validation of the API key is performed following a successful Request Parser operation. The Dispatcher module dispatches the request packet after performing Key Validation. When the API key is valid, the session dispatches the request packet to the Master Node module and awaits its response. The Response Builder is invoked after receiving a response from the Master Node module. The Master Node module passes a response to the Response Packet Builder, which, in turn, creates a response packet that is sent to the client.
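The header-then-body validation order described above can be illustrated with a small framing example. The layout below is entirely hypothetical and is not the wire format used by the vsx-agent and vsx-client software: the magic value `VSXP`, the field order, and the use of CRC32 for both checksums are assumptions chosen only to make the sketch self-contained.

```python
import struct
import zlib

# Hypothetical fixed-length header, big-endian:
# magic (4 bytes), body length, body checksum, header checksum.
HEADER_FMT = ">4sIII"
HEADER_LEN = struct.calcsize(HEADER_FMT)
MAGIC = b"VSXP"

def build_packet(body: bytes) -> bytes:
    """Prepend a header carrying the body length and both checksums."""
    partial = struct.pack(">4sII", MAGIC, len(body), zlib.crc32(body))
    return partial + struct.pack(">I", zlib.crc32(partial)) + body

def parse_packet(packet: bytes):
    """Return the body if both checksums are valid, otherwise None (packet dropped)."""
    if len(packet) < HEADER_LEN:
        return None
    magic, length, body_crc, header_crc = struct.unpack(HEADER_FMT, packet[:HEADER_LEN])
    # Validate the header first; the body is read only if the header is intact.
    if magic != MAGIC or zlib.crc32(packet[:HEADER_LEN - 4]) != header_crc:
        return None
    body = packet[HEADER_LEN:HEADER_LEN + length]
    if len(body) != length or zlib.crc32(body) != body_crc:
        return None
    return body

packet = build_packet(b'{"command": "query"}')
body = parse_packet(packet)
```

Checking the header before the body means that a corrupted length field is detected before it is used to decide how many body bytes to read.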
The server module has been written in the C++11 language, leveraging the following third-party open source libraries: Boost [52] , HDF [53] , spdlog [54] , JsonCpp [55] and HighFive [56] . The complete source code is freely available on the specific vsx-agent project repository on GitHub [57] alongside a dedicated tool for downloading the FPGA configuration bit-streams vsx-dce-prgmr [58] , and a snapshot is provided as Supplementary Material to this article. The client module has been written in the Python language. The complete source code is freely available on the specific vsx-client project repository on GitHub [59] , and a snapshot is also provided as Supplementary Material to this article.
B. MASTER NODE
The Master Node is instantiated once per system installation, and performs the following operations: i) Receiving a request from the session, ii) Parsing and validating the session commands, iii) Dispatching commands to the slave nodes, iv) Collecting the results from the slave nodes, v) Processing the results, vi) Forwarding the results to the session for response building. Its architecture is visible in Fig. 8.
The Command Parser parses each request packet, ascertaining whether a command is valid and supported, then passes its attributes to the Command Validator block. In turn, the Command Validator block verifies whether the command has the required attributes received from the Command Parser, and whether the provided attribute values are valid. If the command is valid, the associated data are passed to the Command Dispatcher and Result Collector module. This dispatches the received command to the slave nodes, collects the results from the same, and submits them to the Result Processor module.
The Result Processor module is responsible for collecting the results from each Slave Node, then sorting them. For example, a user may have requested the top 100 kNN results from the system. Each board would return the top 100 kNN results from each lane, and the Result Processor would sort the data from all lanes and return the final top 100 kNN. Any further required data transformations are performed on the results, after which the processed data are returned to the session.
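As an illustration of this final reduction step, the following minimal sketch merges per-lane result lists into a global top-k. It assumes that each lane returns its candidates already sorted by ascending distance, as (distance, vector_id) pairs; this data layout and the function name are hypothetical and chosen only for the example.

```python
import heapq
import itertools


def merge_top_k(lane_results, k):
    """Merge per-lane sorted (distance, vector_id) lists into the global top-k.

    Each inner list is assumed to be sorted by ascending distance.
    """
    # heapq.merge performs a lazy k-way merge of already sorted inputs,
    # so only the first k merged elements are ever materialized.
    return list(itertools.islice(heapq.merge(*lane_results), k))
```

Since every lane already returns at most k sorted candidates, the merge only needs to examine the heads of the lane lists rather than re-sorting all results.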
C. SLAVE NODE(S)
The Slave Nodes act as bridges between the server application and the hardware nodes; accordingly, each lane on the boards is treated as a separate node entity. Each Slave Node is responsible for executing commands on the corresponding lane and collecting the results from it. A Slave Node consists of three separate threads: Input, Output, and Response Parser. The Input and Output (I/O) threads are responsible for packet serializing/deserializing to/from the hardware; the architecture is optimized for rapidly retrieving results from the lanes, thus keeping the channel open for creating and sending results. If a lane is unable to transmit results due to a lag in the I/O threads processing the results, then the lane will have to pause calculation until it is able to send results again. The Response Parser thread is responsible for receiving packets from the Input thread, validating them, and then passing them to the results collector. When all of the expected results for a particular command have been collected, the Slave Node passes them back to the Master Node.
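The hand-off between the Input and Response Parser threads, and the stalling behavior when the host side lags, can be sketched with a bounded queue. This is only an illustrative model under stated assumptions: the lane is simulated by a Python list, packets are (distance, vector_id) pairs, the validity check is a placeholder, and all names are hypothetical rather than taken from the actual C++ implementation.

```python
import queue
import threading

SENTINEL = None  # hypothetical end-of-stream marker for one command


def input_thread(lane_results, raw_packets):
    """Stand-in for the Input thread: forwards packets arriving from a lane."""
    for packet in lane_results:
        # put() blocks while the bounded queue is full, which models the
        # lane pausing until the host side is able to accept results again.
        raw_packets.put(packet)
    raw_packets.put(SENTINEL)


def response_parser_thread(raw_packets, collected):
    """Stand-in for the Response Parser thread: validates, then collects."""
    while True:
        packet = raw_packets.get()
        if packet is SENTINEL:
            break
        distance, _vector_id = packet
        if distance >= 0:  # placeholder validity check (assumption)
            collected.append(packet)


def run_slave_node(lane_results, queue_size=4):
    """Run both threads for one command and return the collected results."""
    raw_packets = queue.Queue(maxsize=queue_size)
    collected = []
    workers = [
        threading.Thread(target=input_thread, args=(lane_results, raw_packets)),
        threading.Thread(target=response_parser_thread, args=(raw_packets, collected)),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return collected
```

A small queue_size makes the producer block more often, mirroring a lane that must pause calculation; a larger one decouples the two sides at the cost of memory.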
As shown in Fig. 9, in order to load a data set into the Calculation Engine memories, the client first initiates a connection, which is in turn accepted by the Listener module, which opens a session corresponding to it. The session validates the request and then sends it to the Master Node. The Master Node, in turn, validates the command and its attributes, then loads the HDF5-formatted data set file from the hard disk (or other non-volatile source). Here, HDF5 formatting is used to maximize the data transfer rate, particularly for the benefit of dense multi-board configurations. The Master Node next dispatches the data set load function between the Slave Nodes. After all Slave Nodes have completed loading their own portions of the data set asynchronously, they return to the Master Node, which sends the operation status to the session, which returns it to the client.
As similarly visible in Fig. 9 , a query operation corresponds to requesting the system to find a given number of nearest neighbors for a single incoming vector with respect to the entire stored data set. The client initiates a connection with the server, so that the Listener module accepts the connection, and opens a session for it. The session then validates the request and forwards it to the Master Node. This, in turn, validates the command and dispatches the query function to the Slave Nodes. After all Slave Nodes have finished their query operation, and have returned their results asynchronously, the results are sent to the Master Node. If necessary, the Master Node subsequently performs a merge sort on the host computer, then sends the operation status and data to the session, and finally to the client.
VI. BENCHMARKS
To illustrate the performance of the system, representative benchmarks are given for the L1-norm at 8-bit integer precision, with a feature vector length of 300 entries, which corresponds to an application scenario of intermediate complexity [3]-[8]. Namely, we considered a system configuration involving 10 DC slots operating at a clock frequency of 100 MHz (PBus frequency: 50 MHz, DRAM frequency: 300 MHz; these clock frequencies were kept lower than the maximum as, in this configuration, performance is capped by the DC slots clock). The FPGA occupancy and maximum clock frequency obtained from the respective synthesis and implementation flows are given in Table 1, additionally for a reduced configuration providing 1 DC slot.
In this configuration, one fully-programmed iFlex card consumes ≈ 34.2 W (i.e., ≈ 2.8 A at 12 V) when idle, which goes up by about 35% under full calculation load (all DC slots busy) to ≈ 46.3 W (i.e., ≈ 3.9 A); for comparison, when no data set is loaded (i.e., no DRAM refresh occurring etc.) the consumption is ≈ 8.7 W (i.e., ≈ 0.7 A).
For a single query, as visible in Fig. 10a, the query time ranged from ≈ 10 ms for a data set of 1 million vectors, up to ≈ 1.35 s for a data set of 150 million vectors (maximum board capacity), increasing approximately linearly between these extremes. As expected, the number of nearest neighbors requested had a smaller effect on query time; for example, for a data set of 10 million vectors the query time was ≈ 89.8 ms for k = 1, ≈ 91.7 ms for k = 10 and ≈ 105.8 ms for k = 100.
TABLE 2. Performance comparison between iFlex, CPU and GPU (t /t iFlex refers to the CPU or GPU vs. iFlex in 1 or 10 DC configuration, whichever was faster. q denotes query batch size.).
Assuming 16 bytes per lane and 21 lanes per Compute Node (board), one has 336 bytes transferred per clock cycle. Given a data set size of 10 million vectors, containing 300 features (bytes) each, one also determines that 8.9 × 10^6 clock cycles are required to scan through the entire data set. For a clock frequency of 100 MHz, then, the corresponding theoretical time would be 89.3 ms, which is in close agreement with the experimental measurement; more noticeable overheads are expected for k > 1 and m > 1 (see below).
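The arithmetic behind this estimate can be reproduced with the short sketch below. It uses only the figures stated above (16 bytes per lane per cycle, 21 lanes, 100 MHz clock, one byte per feature); the function name and its default arguments are illustrative, and the model deliberately ignores sorting, transfer and host overheads.

```python
def theoretical_scan_time(num_vectors, bytes_per_vector,
                          bytes_per_lane_per_cycle=16, lanes=21,
                          clock_hz=100e6):
    """Return (clock cycles, seconds) for one full scan of the data set.

    Idealized model: every lane consumes its full word on every cycle.
    """
    bytes_per_cycle = bytes_per_lane_per_cycle * lanes  # 336 bytes per cycle
    cycles = num_vectors * bytes_per_vector / bytes_per_cycle
    return cycles, cycles / clock_hz


# 10 million vectors of 300 one-byte features at 100 MHz.
cycles, seconds = theoretical_scan_time(10_000_000, 300)
print(f"{cycles:.3g} cycles, {seconds * 1e3:.1f} ms")

# The same per-cycle figure gives the aggregate bandwidth in bytes/s.
aggregate_bandwidth = 16 * 21 * 100e6
```

Under these assumptions the aggregate bandwidth evaluates to 33.6 GB/s, matching the figure quoted for a single Compute Node.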
As also shown in Fig. 10b, in terms of query processing rate, there was an effect of batching, observed largely between submitting single queries and batches of q = 16 queries, which was primarily due to the filling of all available DC slots (i.e., q ≥ 10); notably, a similar dependency is also observed in CPU- and GPU-based algorithms [60]. As visible in Fig. 10c, with increasing data set size, the processing time per-query per-vector-comparison quickly settled on an approximately constant value, which was ≈ 9 ns for single queries, and ≈ 1 ns for batches of q = 16 queries or more, with a smaller effect of batch size beyond that value (≈ 1.1 ns for q = 16 down to ≈ 0.9 ns for q = 256 and above).
As detailed in Fig. 10d, the advantage of using a system involving multiple Compute Nodes, namely 10 iFlex boards, gradually increased with data set size. Assuming a query batch size of q ≥ 16, the acceleration expressed as t_single/t_multi was ≈ 1 for a data set size of 1 million vectors, which increased to ≈ 6.4 for 10 million vectors and reached ≈ 9.3 for 100 million vectors; these results are dependent upon the specific settings for spreading the reference vectors over multiple boards and combining their results.
A detailed performance comparison with CPU- and GPU-based systems is left for future work, as it will need to be conducted in the framework of specific application scenarios and optimized code in order to ensure its validity. Nevertheless, for purely illustrative purposes, the case of a cloud-based compute instance, representative of the industry state of CPU computing, is considered. Namely, a compute-optimized instance on Amazon AWS (Amazon.com, Inc., Seattle WA, USA), type c5.4xlarge, providing 16 virtual processors and 68 EC2 compute units, was used. An optimized implementation of the L2-norm, with 32-bit single-precision floating point data type, and a feature vector length of 300 entries, was considered. Vector similarity calculations were performed using a highly-optimized indexing implementation, the Facebook AI Similarity Search library, available in Ref. [61] and described in detail in Ref. [60]; the full test source code is provided in Ref. [62]. In addition, a general-purpose GPU, namely the Tesla K80 (NVIDIA Inc., Santa Clara CA, USA), which is equipped with 4,992 cores and 24 GB DRAM, was tentatively considered.
The comparative benchmarks are provided in Table 2, separately for single queries and small batches (q = 1, 16) and data set sizes of 1 and 10 million vectors. The query times provided by the CPU and GPU were compared to iFlex configured for 10 DC slots at 100 MHz or 1 DC slot at 150 MHz (whichever was faster). The acceleration ratios compared to both were rather consistent regardless of query batching and data set size. With respect to the CPU, the acceleration was on average ≈ 32.4. Though EC2 power consumption figures are not published, assuming a 2/3 loading of an Intel Xeon Platinum 8175 processor [63] and neglecting the DRAM and chip-set power supply, one may estimate a draw of ≈ 110 W, providing a power reduction P_CPU/P_iFlex ≈ 2.38. With respect to the GPU, the acceleration was less considerable, namely on average ≈ 1.7. Nevertheless, also in this case the power saving was potentially considerable, given that the published measurement for the chosen GPU is in the range of 150-300 W [64]. It should finally be noted that explicit comparison of the 1 and 10 DC configurations revealed that the optimal level of Calculation Engine FPGA occupancy depends on the query batching: for single queries, the 1 DC mode (19% occupancy, 150 MHz rate) was faster by a factor of ≈ 1.3, whereas for multiple queries, the 10 DC mode (80% occupancy, 100 MHz rate) was faster by a factor of ≈ 5.5.
VII. DISCUSSION
In this paper, a novel general-purpose co-processor architecture for accelerating vector similarity searching was proposed. Its tailored architecture allows parallelizing distance calculation and sorting operations over an arbitrarily large number of FPGAs. As each one is endowed with a local, dedicated and independent DRAM storage harboring a portion of the reference data set, comparatively high aggregate bandwidth is attained, while maintaining the implementation straightforwardness and undemanding power consumption associated with a relatively low clock frequency. In the present system, a single Compute Node could store 42 GB of reference data accessible at 33.6 GB/s. Detailed comparative benchmarks should be conducted in future work, addressing configurations tailored to specific applications in order to fully represent all aspects of the hardware-software system. Nevertheless, and notwithstanding the fact that the memory bandwidth is lower compared to current CPU and GPU solutions, the initial data presented in this paper suggest a possibly considerable acceleration with respect to a high-end CPU-backed cloud computing instance; this was accompanied by potentially significant power savings compared to both CPU and GPU-based solutions. These data, however, should only be considered as purely indicative due to the need to carefully optimize CPU and GPU code for specific scenarios [3], [9]-[11], [20].
An inherent advantage of FPGAs is their seamless reconfigurability, which will enable a virtually unlimited variety of distance measures to be implemented on the present platform. In this regard, one aspect that will require particular consideration is the calculation precision. At present, the bulk of neural network and vector similarity applications are developed with single- or double-precision floating point representation, whereas the present results were given for 8-bit integer features, and therefore do not represent a fully like-for-like comparison. Nevertheless, there is increasing recognition that, when suitable normalizations are applied, vector similarity calculations, deep neural network evaluation, training and related operations may be realized without substantial performance loss also at low, in some instances even binary, precision. The flexibility of the present FPGA-based architecture in this regard appears particularly relevant, in that it allows tailoring the level of precision to the specific application requirements, in principle also supporting mixed-precision representations. In the present experiments, the ability to reconfigure the FPGAs for different clock rate vs. logic occupancy trade-offs also allowed tailoring the system for better performance given single or batched queries [65]-[69].
In considering the limitations of the present system, it is important to note that vector similarity searching as implemented here, though highly prevalent, does not constitute a neural network and thus corresponds to only a part of the existing artificial intelligence and machine learning applications. Unquestionably, convolutional neural networks, alongside other types of neural networks and array algorithms, also play a fundamental role. As is well known, the associated computational requirements are substantially different, because there is a need to perform large numbers of multiply-and-accumulate and non-linear transform operations, over large numbers of neurons and layers, while supplying the input layer with data and extracting the outputs at a high rate. In principle, because each of the Calculation Engine FPGAs in the present system includes 156 multipliers, for a total of 3276 per board, the system might seem applicable for accelerating these networks also. However, two constraints are problematic. One is the low bandwidth of the daisy chain bus together with the subdivision in independent lanes, which was central to reducing design complexity and power consumption: while this is not an issue for vector similarity searching, as it is only necessary to stream queries and retrieve results having a small size, it effectively prevents aggregating the compute power across the FPGAs for realizing a practically useful neural network. Towards such an end, it is vastly preferable to use a single, larger FPGA, as done in other studies focusing on the hardware acceleration of convolutional neural networks and deep learning; however, due to the limited number of input-output lines available, this approach limits the number of independent memory banks which can be attached. A trade-off, therefore, exists.
The other issue highlights a more general limitation of FPGA technology compared to CPUs and GPUs, namely, the inherently lower clock frequency, which is often on the order of 100 MHz as opposed to 1 GHz and beyond. With these limitations in mind, the present system, and similar alternatives based on FPGAs, should be considered primarily of relevance whenever the reconfigurability of programmable logic together with the fine-grained data set memory subdivision confers a possible advantage, for example, towards the realization of distributed sorting and aggregation. Future work should systematically compare this and other FPGA-based solutions to CPUs and GPUs across different types of machine learning algorithms [9] , [10] , [12] , [13] , [17] , [18] , [22] - [29] .
A notable advantage of the present system over the prevailing indexing-based solutions is the absence of any index. For this reason, it is possible to rapidly add, delete and modify data set vectors, without incurring any penalty: this can enable new artificial intelligence applications in contexts where the reference material is highly dynamic, for example, when it is drawn from a real-time stream of data such as news, feeds or ongoing time-series measurements. By comparison, the generation of indexes for large data sets may take up to several CPU or GPU hours [1], [2], [15], [60].
Besides, the uniqueness of the system described in this paper lies in its fully open-source nature: to the authors' knowledge, the present one is the only instance of a high-performance hardware architecture offered in a nearly product-ready stage for unlimited use by the community under a standard license. Regardless of the continuous performance improvements of CPUs and GPUs, this important advantage is expected to persist. Importantly, the core aspects of novelty in the present distributed distance calculation and sorting approach can also be used for future realization in different hardware forms, such as application-specific integrated circuits, yielding higher clock frequencies; accordingly, the distributed organization in compute engines and lanes may retain important advantages not only for board-level but also package-level (i.e., multi-chip module) and die-level (i.e., single-chip) integration. Mirroring other developments and trends in cloud computing, we therefore anticipate that this offering will bolster and stimulate research, hands-on education and open innovation in the area of hardware-accelerated vector similarity searching [70], [71].
In 2013, he completed online courses on hardware/software interfacing (material from courses taught in Washington University) and cryptography (material from courses taught in Stanford University). In 2014, he completed the course on digital systems from logic gates to processors offered by the Universitat Autonoma De Barcelona. From 2005 to 2010, he was a developer of forensic hard drive data capturing products, after which he transferred to the role of systems engineer in a company developing software-defined radio including RF and DSP technologies. Since 2014, he has been a senior R&D engineer on pattern recognition and neural networks. He has held research and engineering roles with Yerevan State University and private companies. He has authored 11 articles and 2 textbooks.
He is currently an Assistant Professor with Yerevan State University. His research interests include microwave and terahertz waves radiation and propagation, nonlinear communication systems, and antenna systems.
HAYK GHALTAGHCHYAN received the Ph.D. degree in physics from the Russian Armenian University, Yerevan, Armenia, in 2016. His research interests mainly concern the investigation of electronic properties of quantum nanostructures (quantum wells, wires, and dots), optical properties (interband and intraband transitions, impurity and excitonic light absorption, and direct and nondirect interband light absorption) of quantum nanostructures (quantum wells, wires, and dots), few body problems in quantum dots, and electrodynamic and spin characteristics of quantum dots, and stationary adiabatic approximation for the description of quantum nanostructures.
CHRIS MCCORMICK received the B.S. degree in electrical engineering with a software focus from Stanford University, in 2006.
He began his career in private industry working as an Embedded Software Engineer with Texas Instruments for six years. In 2012, he moved from embedded software to become a Computer Vision Researcher with CogniMem Technologies for three years. In 2015, he became a Machine Learning Researcher with Nearist.Ai investigating approaches for vectorizing content (feature extraction techniques) and benchmarking the acceleration of brute-force k-nearest neighbor searches with specialized hardware such as GPUs and Nearist's own hardware design. Since 2013, he has been maintaining a machine learning blog at mccormickml.com where he has written a variety of tutorials on subjects related to his research.
MICK FANDRICH received the M.S. degree in engineering technology from DeVry University, in 1980. He has held a number of engineering and management positions in the semiconductor industry, primarily with Intel Corporation, where he was a two-time recipient of the prestigious Intel Achievement Award. His positions have included microcontroller characterization and design, flash memory interface and algorithm design, business and engineering design automation, and solid-state drive system engineering. He is the author or coauthor of more than 45 U.S. patents. He is currently the Founder and the CEO of Intermotion Technology Inc. VOLUME 7, 2019 
