Introduction
Big data is a popular research area, with researchers and companies developing new tools to analyze data sets that were previously too large to analyze. Clustering is often the very first step of information extraction because most clustering algorithms work in an unsupervised manner. One of the simplest is the K-means algorithm (Lloyd's algorithm). For K-means, the execution time is approximately the time spent in each iteration multiplied by the number of iterations, although the time for each iteration may vary slightly (for example, because of differences between experiments or cache behavior). The average time per iteration is important since it largely determines how long the algorithm takes to run for a given number of iterations. K-means is generally implemented in software written in C/C++ or other languages, including MATLAB. The execution time is not a major issue for small inputs. However, if the input is an image with several features per pixel, the time for each iteration is strongly affected by the number of pixels and the number of clusters. For example, each iteration could take only 0.05 seconds for a 500x500 image with 5 features (the red, green and blue color components and the pixel position x, y) and 8 clusters. Meanwhile, for a 1000x1000 image with 5 features, each iteration can take 0.5 seconds with 8 clusters and 2.5 seconds with 64 clusters.
These empirical results indicate that, when the number of features remains constant, the execution time for each iteration grows rapidly as the data size and the number of clusters increase.
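As a rough, illustrative check on these numbers, the sketch below counts the point-to-center distance evaluations in one iteration (pixels times clusters) for the three cases quoted above and prints them next to the reported times. The operation-count model and the helper name `distance_evals` are assumptions made for illustration; only the times come from the text.

```python
# Illustrative only: count distance evaluations per K-means iteration.
# The N*K cost model is an assumption; the times are the ones quoted in the text.

def distance_evals(width, height, clusters):
    """Number of point-to-center distance evaluations in one iteration."""
    return width * height * clusters

cases = [
    # (width, height, clusters, reported seconds per iteration)
    (500, 500, 8, 0.05),
    (1000, 1000, 8, 0.5),
    (1000, 1000, 64, 2.5),
]

for w, h, k, seconds in cases:
    evals = distance_evals(w, h, k)
    print(f"{w}x{h}, K={k}: {evals:,} distance evaluations, {seconds} s reported")
```

Under this model, going from 500x500 to 1000x1000 pixels quadruples the operation count while the reported time grows tenfold, which suggests that factors beyond the raw operation count, such as memory behavior, also contribute.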
If the data size is extremely large, each iteration can take a very long time, which makes K-means extremely slow overall. In this case, software alone, even when optimized with parallel methods, is not the best option for accelerating the K-means algorithm, because the workload performs poorly on current cache-memory designs. Hence, we choose to implement the K-means algorithm on FPGAs for better performance.
For the hardware implementation, we selected two FPGA platforms: Gidel and AWS. We build K-means on both platforms, each with a user software application on the host for top-level data transfer and runtime control.
CHAPTER1. Introduction
Contributions
In this research, we make the following contributions:
1) Implement the K-means algorithm as an RTL design on FPGA with multiple levels of parallelism.
2) Enhance the scalability of the K-means algorithm on FPGA.
3) Explore the implementation of this algorithm on two different platforms, Gidel and Amazon Web Services (AWS), and compare their development flows and system architectures.
4) Compare the hardware performance to three parallel software implementations (OpenMP, MPI and GPU).
5) Explore the system performance of Gidel and AWS.
Thesis Organization
Chapter 2 discusses features of the algorithm. It also presents background on the parallel programming techniques that we applied to the K-means algorithm during this research, including OpenMP, MPI and GPU programming. This chapter also presents the hardware-related toolkits of Gidel and AWS and some of the core IP provided by those vendors.
Chapter 3 covers the methodology and design of the pipelined K-means algorithm implemented in FPGA hardware, together with its interface to the host for data transfer.
Chapter 4 describes the hardware and software setup as well as the input test cases used in simulation and implementation. The experimental results for the software and hardware implementations are then presented.
Chapter 5 presents our conclusions from this exploration of hardware acceleration for the algorithm and discusses future work.
Background

K-means Algorithm
The K-means clustering algorithm was first proposed by Stuart Lloyd in 1957. The aim of this algorithm is to divide N points in d dimensions into K clusters so that the within-cluster sum of squares is minimized.
The input to the algorithm is a matrix of N data points, each with d dimensions, and K initial cluster centers [5]. The algorithm is an iterative refinement method that partitions the data points into K clusters by assigning each point to the cluster center at the smallest Euclidean distance. In each iteration, data points may be moved to a different cluster. The means are updated after each iteration and are used as the new centers for the following iteration.
The algorithm is expressed as two steps:
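Assignment step: Assign each data point to the cluster whose mean yields the least within-cluster sum of squares (WCSS):
$$S_i^{(t)} = \left\{\, x_n : \lVert x_n - C_i^{(t)} \rVert^2 \le \lVert x_n - C_j^{(t)} \rVert^2 \;\; \forall j,\ 1 \le j \le K \,\right\}$$
where $S_i^{(t)}$ is the set of points assigned to cluster $i$ in iteration $t$, $C_i^{(t)}$ is the mean of that cluster, and each $x_n$ is assigned to exactly one cluster, even if it could be assigned to two or more of them.
Update step: Calculate the new means to be the centroids of the observations in the new clusters:
$$C_i^{(t+1)} = \frac{1}{\lvert S_i^{(t)} \rvert} \sum_{x_n \in S_i^{(t)}} x_n$$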
The algorithm pseudocode is:
Algorithm 1 K-means Clustering Algorithm
procedure MAIN ALGORITHM
    C_K ← Initial Centers
    t ← iterations
top:
    if t > MAX_ITERATIONS or State = Converged then return false
    t ← t + 1
loop:
    if distance(x_n, C_i^(t)) ≤ distance(x_n, C_j^(t)), ∀ j, 1 ≤ j ≤ K then
        idx_n^(t) ← i.
        sum_i ← sum_i + x_n.
    goto loop.
    close;
    C_k^(t+1) ← sum_k / |S_k|.
    State ← checkConvergence(idx^(t), idx^(t−1))
    goto top.
Here idx_n^(t) is the cluster index assigned to point x_n in iteration t, sum_k accumulates the points assigned to cluster k, and |S_k| is the number of points assigned to cluster k.
Sometimes, the assignment step and the update step are also called the expectation step and the maximization step, respectively.
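The two steps can be sketched in a few lines of Python. This is a minimal software illustration, not the implementation evaluated in this work; the function names, the tie-breaking rule (lowest index wins) and the handling of empty clusters are assumptions of the sketch.

```python
# A minimal sketch of Lloyd's algorithm; all names are illustrative.

def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centers):
    """Assignment step: index of the nearest center for every point."""
    return [
        min(range(len(centers)), key=lambda i: squared_distance(x, centers[i]))
        for x in points
    ]

def update(points, idx, centers):
    """Update step: each new center is the mean of the points assigned to it."""
    d = len(centers[0])
    sums = [[0.0] * d for _ in centers]
    counts = [0] * len(centers)
    for x, i in zip(points, idx):
        counts[i] += 1
        for f in range(d):
            sums[i][f] += x[f]
    # An empty cluster keeps its previous center (an assumption of this sketch).
    return [
        [s / counts[i] for s in sums[i]] if counts[i] else list(centers[i])
        for i in range(len(centers))
    ]

def kmeans(points, centers, max_iterations):
    """Iterate until assignments stop changing or the iteration cap is reached."""
    idx = None
    for _ in range(max_iterations):
        new_idx = assign(points, centers)
        if new_idx == idx:  # converged: no point changed cluster
            break
        idx = new_idx
        centers = update(points, idx, centers)
    return centers, idx

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, idx = kmeans(points, [(0.0, 0.0), (10.0, 10.0)], max_iterations=10)
print(centers, idx)
```

For this small input, the first two points end up in cluster 0 and the last two in cluster 1, and the loop stops in the second iteration because no assignment changes.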
Initialization
The description of initialization below follows the Wikipedia article on K-means clustering.
For the K-means algorithm, the most common initialization methods are Forgy and Random Partition.
The Forgy method randomly chooses K data points from the data set and uses them as the initial means.
The Random Partition method first randomly assigns a cluster index to each data point and then proceeds to the update step, treating data points with the same index as belonging to the same cluster and computing the initial means as the centroids of each cluster's randomly assigned points. In a real application, the mean table should be determined on the host and then sent to the FPGA for further processing. However, for simplicity, the initial mean table is currently hard-coded into the FPGA RTL design.
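Both initialization methods are short enough to sketch in Python. The code below is only an illustration of the two definitions; the function names, the fixed random seed and the fallback for clusters that receive no points are assumptions, and the design in this work hard-codes the initial means instead.

```python
import random

def forgy_init(points, k, rng):
    """Forgy: pick k distinct data points as the initial means."""
    return [list(p) for p in rng.sample(points, k)]

def random_partition_init(points, k, rng):
    """Random Partition: random cluster index per point, then take centroids."""
    d = len(points[0])
    sums = [[0.0] * d for _ in range(k)]
    counts = [0] * k
    for p in points:
        i = rng.randrange(k)
        counts[i] += 1
        for f in range(d):
            sums[i][f] += p[f]
    # Clusters that received no points fall back to a random data point
    # (an assumption of this sketch; the text does not cover this case).
    return [
        [s / counts[i] for s in sums[i]] if counts[i] else list(rng.choice(points))
        for i in range(k)
    ]

rng = random.Random(0)  # fixed seed so the example is repeatable
points = [(float(x), float(x % 3)) for x in range(20)]
print(forgy_init(points, 3, rng))
print(random_partition_init(points, 3, rng))
```

Forgy tends to spread the initial means across the data, while Random Partition places them all close to the overall centroid of the data set.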
CHAPTER2. Background
Distance Selection
K-means minimizes within-cluster variance. The definition of variance is identical to the sum of squared Euclidean distances from the center. The basic idea of the K-means algorithm is to minimize squared errors. With other distance functions, K-means may no longer converge. The common proof of convergence is that the assignment step and the mean update step both optimize the same criterion, and there are only finitely many possible assignments. Therefore, the algorithm must converge after a finite number of improvements.
The Manhattan-distance variant of K-means is also known as k-medians, because the median is the best L1 estimator. Another variant, which works with arbitrary distance functions, is k-medoids (partitioning around medoids). The medoid minimizes arbitrary distances (because it is defined as the minimizer), and only a finite number of possible medoids exist. However, it is much more expensive to compute than the mean.
In this research, the standard squared Euclidean distance is applied. However, the Manhattan distance could reduce hardware resource use and run faster in a software implementation. In practice, the distance is generally selected according to the application.
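To make the difference between the two distances concrete, the following sketch computes both for one pixel and one center. The pixel and center values are made up for illustration; the feature order (R, G, B, x, y) follows the five features used in this work.

```python
def squared_euclidean(p, q):
    """Sum of squared differences; needs one multiplication per feature."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def manhattan(p, q):
    """Sum of absolute differences; needs no multiplications."""
    return sum(abs(a - b) for a, b in zip(p, q))

pixel = (120.0, 64.0, 32.0, 10.0, 20.0)    # hypothetical (R, G, B, x, y)
center = (100.0, 60.0, 40.0, 12.0, 18.0)   # hypothetical cluster mean

print(squared_euclidean(pixel, center))  # 400 + 16 + 64 + 4 + 4 = 488.0
print(manhattan(pixel, center))          # 20 + 4 + 8 + 2 + 2 = 36.0
```

The Manhattan distance replaces each multiplication with an absolute value, which is the source of the potential hardware savings mentioned above.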
Termination
Convergence is defined as the ideal situation in which no data pixels are reassigned after a finite number of iterations. However, in real-life applications, data pixels may thrash between clusters. Hence, a termination condition needs to be defined. The termination condition could be one of the following:
1) A fixed number of iterations has been completed. This condition limits the runtime of the clustering algorithm, but in some cases the quality of the clustering will be poor because of an insufficient number of iterations.
2) Assignment of data pixels to clusters does not change between iterations. Except for cases with a bad local minimum, this produces a good clustering, but runtime may be unacceptably long.
3) Centroids do not change between iterations. In practice, the change of the centroids between iterations can be tracked, and the algorithm is often terminated as soon as the rate of change or the absolute change is smaller than a given threshold. This is generally associated with situations where some data pixels are close to two different centroids, so that classifying them into either cluster makes very little difference. If the K-means algorithm runs for a large number of iterations, the changes to the centroids become smaller and smaller. In practice, the classification of those data pixels is not very significant.
4) Terminate when the WCSS falls below a threshold. This criterion ensures that the clustering is of a desired quality after termination, which follows directly from the definition of the within-cluster sum of squares. Alternatively, the algorithm can terminate when the change in WCSS between iterations is smaller than some threshold, which indicates that it is close to convergence.
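The four conditions above can be collected into a single check, sketched below. The function name, argument list and returned labels are hypothetical and only serve to show how the conditions relate; the order in which they are tested is also an assumption.

```python
def should_stop(iteration, max_iterations, idx, prev_idx,
                centers, prev_centers, wcss, center_tol, wcss_threshold):
    """Return the name of the first termination condition that holds, else None."""
    if iteration >= max_iterations:
        return "max_iterations"                # condition 1
    if prev_idx is not None and idx == prev_idx:
        return "assignments_unchanged"         # condition 2
    if prev_centers is not None:
        # Largest absolute change of any centroid coordinate since last iteration.
        shift = max(abs(a - b)
                    for c, p in zip(centers, prev_centers)
                    for a, b in zip(c, p))
        if shift < center_tol:
            return "centroids_stable"          # condition 3
    if wcss < wcss_threshold:
        return "wcss_below_threshold"          # condition 4
    return None

# Centroids moved by only 1e-4, which is below the tolerance of 1e-3.
print(should_stop(2, 10, [0, 1], [1, 1], [[0.0]], [[0.0001]], 9.0, 1e-3, 1.0))
```

Only condition 1 can be evaluated without inspecting the data, which is one reason a fixed iteration count is the simplest choice for a hardware implementation.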
In this research, the first termination condition is selected, for the following reasons. First, the primary motivation is to explore the speedup of an FPGA implementation over a software implementation. If the number of iterations is fixed, it is much easier to determine the speedup factor. Second, this termination condition is easier to implement in hardware and avoids extra hardware resource use. Finally, a fixed iteration count simplifies debugging and testing, because the point at which the algorithm terminates is known in advance.
K-means Parallel Methods
There are many different ways to extract parallelism from this algorithm. In this research, three of them are explored. The implementation assigns every data element to a cluster before updating the cluster centers, so the assignment and update steps are serialized for each data point, but there is opportunity for parallelism within each step and for pipelining between data points. The three parallel approaches are OpenMP, MPI and CUDA.
OpenMP
OpenMP is an easy-to-use multithreaded programming interface compatible with C and C++. The master thread forks a number of worker threads so that the system can divide tasks among them. This parallel region achieves speedup because the threads run concurrently, accessing shared memory in the processor.
The parallel section is marked with a platform-independent set of compiler pragmas, directives, function calls, and environment variables that explicitly instruct the compiler how and where to insert threads into the application. Each thread has an id attached to it, which can be obtained with a function call (omp_get_thread_num). The thread id is an integer, and the master thread has an id of 0. After the execution of the parallel region, the threads join back into the master thread, which continues to the following sections of the program.
In the OpenMP implementation, we use a shared memory model and a multi-threaded program to spawn parallel computation for the assignment step, that is, the distance calculation. Parallelization carries its own overhead, so parallelizing code does not always produce a speedup. In the update step, the sequential version performs better than the parallelized version, so the sequential version is used in this implementation.
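The fork-join structure described here, with a parallel assignment step followed by a sequential update, can be mimicked in Python with a thread pool standing in for the OpenMP threads. This is an analog that shows only the structure, not the OpenMP code used in this work: the function names and the chunking scheme are assumptions, and Python threads will not reproduce OpenMP's speedup for this kind of compute-bound loop.

```python
from concurrent.futures import ThreadPoolExecutor

def nearest(x, centers):
    """Index of the center with the smallest squared Euclidean distance to x."""
    return min(range(len(centers)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centers[i])))

def assign_chunk(chunk, centers):
    """Work done by one worker: assign every point in its chunk."""
    return [nearest(x, centers) for x in chunk]

def parallel_assign(points, centers, workers):
    """Split the points into contiguous chunks and assign them concurrently."""
    size = max(1, (len(points) + workers - 1) // workers)
    chunks = [points[i:i + size] for i in range(0, len(points), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in chunk order, so indices line up with points.
        parts = list(pool.map(assign_chunk, chunks, [centers] * len(chunks)))
    # Join: back on the calling ("master") thread, flatten the partial results.
    return [i for part in parts for i in part]

points = [(float(i), 0.0) for i in range(10)]
centers = [(0.0, 0.0), (9.0, 0.0)]
print(parallel_assign(points, centers, workers=3))
```

Because every point is assigned independently of the others, the chunks need no synchronization until the join, which is what makes the assignment step the natural target for parallelization.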
MPI
GPU Implementation
For the GPU implementation, there are different ways of parallelizing this algorithm. Given the architecture of its streaming multiprocessors (SMs), a GPU is even better suited than OpenMP and MPI to parallel, independent workloads such as image pixel processing, because the number of parallel resources available to OpenMP and MPI is not as large as on a GPU. As with OpenMP and MPI, the assignment step is close to the scheme used in the previous parallel implementations, but the GPU also provides unified memory, which removes the need for explicit memory allocation and data movement between the host and the GPU. Since there are trade-offs between the cost of data access in different architectures and the strong compute capability of the GPU, we implemented various versions, including versions that use unified memory and versions that compute the means in a reduction fashion on the GPU instead of on the host. In the results section, we report the run time of the best GPU version.
Software Parallelization Summary
All of these implementations have a master or host node and workers or (in the case of the GPU) a device for implementing parallelism. In all cases, initialization and convergence checking are done on the master (host). The data is transferred once to the workers/device, as it does not change during the computation; subsequent iterations can make use of the same data without retransmitting it. The assignment step is done in parallel on the workers (device), and the mean update is done on the master node. Assignment is parallelized by distributing the data among the threads or workers. As the assignment step is where most of the time in the algorithm is spent, it makes sense to parallelize it and leave the update step on the host. The master (host) decides whether to terminate or to initiate another iteration.
Thus, the assignment step is done in parallel, while other computations are serialized. For the GPU, data movement between device and host could be either explicit or implicit, and the GPU architecture is better adapted to massively parallel, independent data processing.
For FPGAs, although the current trend is away from pure RTL hardware design, pipeline stages can be the critical source of hardware parallelism. Pipelining can work around data dependencies and increase the level of parallelism in a hardware implementation. The data dependence in the original software implementation is preserved, but the required circuit is divided into a chain of independent stages. All stages in the chain run in parallel on the same clock cycle. The only difference is the source of data for each stage: each stage in the computation receives its data values from the result computed by the preceding stage during the previous clock cycle [9]. This implies a pipelined architecture in which raw data, partially computed data and final data exist simultaneously, and each stage result is captured in its own set of registers. Thus, although the latency of such a computation is multiple cycles, a new result can be produced on every cycle, and only the first result takes as many clock cycles as the latency.
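The latency and throughput behavior of such a pipeline can be illustrated with a small cycle-level simulation in Python. This is a software model under simplifying assumptions (one register per stage, one new input per cycle, no stalls, and no data value equal to None); the function name `run_pipeline` is illustrative.

```python
def run_pipeline(inputs, stages):
    """Simulate a linear pipeline; stages is a list of one-argument functions.

    Returns (outputs, cycles). One register sits after every stage, and all
    stages advance together on each simulated clock cycle.
    """
    regs = [None] * len(stages)   # pipeline registers, None marks a bubble
    outputs, cycles = [], 0
    pending = list(inputs)
    while len(outputs) < len(inputs):
        # Evaluate from the last stage backwards so that each stage reads the
        # value its predecessor held during the previous cycle.
        for s in range(len(stages) - 1, 0, -1):
            regs[s] = stages[s](regs[s - 1]) if regs[s - 1] is not None else None
        regs[0] = stages[0](pending.pop(0)) if pending else None
        cycles += 1
        if regs[-1] is not None:
            outputs.append(regs[-1])
    return outputs, cycles

stages = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
outputs, cycles = run_pipeline([1, 2, 3, 4, 5], stages)
print(outputs, cycles)  # [1, 3, 5, 7, 9] 7
```

With three stages and five inputs, the first result appears after three cycles and one more result follows on each subsequent cycle, for a total of 3 + 5 - 1 = 7 cycles rather than 3 x 5 = 15.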
FPGA
For FPGA implementations, we consider two platforms: Gidel and AWS.
Gidel
Gidel is a system development and integration company. With its project-level approach, it creates tools for high-performance system development, and it has been continuously developing FPGA-based reconfigurable systems and development tools for diverse applications.
In addition to the development boards, Gidel also provides a variety of accessories and interface standards that connect directly to its FPGA Proc boards for data acquisition and inter-board connectivity. The Gidel Proc boards are designed for modular customization such that the user can tailor the system interfaces according to the design specifications.
Gidel's Proc Development Kit is a set of building blocks designed to facilitate the development task and hardware-software integration, and to enhance system productivity. It is a system solution including boards, software tools, IP and optional daughter boards. 
Gidel Development Infrastructure
Gidel ProcWizard is a hardware-software integration application that was designed to simplify the project development task. Working in conjunction with Gidel Proc boards, ProcWizard enables the user to rapidly build a design that may be automatically translated into HDL and C++ code. The generated C++ code communicates with the generated HDL design via the PCIe bus to ensure easy hardware-software integration.
In addition to code generation, ProcWizard enables the developer to test and debug the design in the PC environment.
Gidel ProcWizard enables hardware and software designers to work in parallel, sharing the same information. It features automatic generation of HDL and C++ code (and of the interface documentation), enabling software and hardware engineers to share the same interface and database. The generated HDL code includes an interface unit (which handles the host communication protocol), the IP cores that were added to the design and other units needed for the design. A top-level design that connects all these units together is generated as well. Usually, the interface between the user logic and the top level is defined in ProcWizard; at the RTL level, users then add their logic in their own functional modules as sub-designs to complete the whole design.
MultiPort IP
Gidel ProcMultiPort is a DRAM memory controller that enables efficient usage of the memory featured on Gidel Proc boards. With ProcMultiPort, special-purpose memories may be replaced with standard on-board DRAM memory blocks and FPGA internal memory, while keeping a common interface. ProcMultiPort provides a way to achieve high system performance with the on-board SDRAM / DDR / DDR II / DDR III memories [2].
Gidel Hardware-Software Co-design
The development of K-means on FPGA is a hardware-software co-design process, and the Gidel ProcWizard automatically generates HDL code according to the design created by the user. Every item in the design (memories, registers etc.) is implemented in this generated code, along with any additional code needed to communicate with these items. All the registers that were declared in the project are physically generated. For each memory, a select signal is generated which goes high whenever the software accesses the corresponding memory. In addition, user sub-designs that were added to the project are connected to the top-level design.
Once the user sub-designs are implemented and tested, the remaining design flow mainly depends on the compilation steps in Altera's Quartus tool, which is needed to synthesize the whole design and generate the bitstream for downloading to the Proc board. The board is connected to the motherboard of the PC through the PCIe bus, and the hardware necessary for communication between host and FPGA is already implemented as RTL sub-designs of the top-level design, creating a complete hardware design.
Additionally, ProcWizard is able to generate a software driver header for the software part of the design. This is a C++ application driver that makes the hardware application appear as a regular C++ object containing all the structures in the user's design, such as memories, registers and register arrays. The Application Driver object is connected to the hardware, thus enabling easy communication with the hardware by calling the object's methods.
The Application Driver object also performs:
1. Hardware initialization, including loading the logic design and setting the clocks on the Proc board.
2. Access to application data members.
3. Board general services, such as DMA transfer handling, interrupt handling etc.
The software starts with a C++ constructor that automatically loads the hardware design file into the board; it is then able to communicate with the board via the registers and memories that the user has created. These registers and memories can be read and set through the class member functions.
AWS
Related Work
The K-means algorithm is one of the most popular algorithms in data analysis because of its simplicity and speed. Various methods have been applied to improve its execution performance or reduce the amount of computation required. Many previous works have investigated data reduction for improving K-means on large data. We focus instead on efficient parallel computation for large datasets.
Recent research has implemented the K-means clustering algorithm on shared memory (OpenMP) and distributed memory (MPI) platforms [14, 15]. Other work explores K-means on GPUs [16, 17]. The authors of [20] present FPGA-based K-means clustering using a tree-based data structure. They also filter the input data, because pruning some candidates greatly reduces the number of distance calculations between data pixels and centers. This approach requires two passes over the input data and requires that the data fit fully in on-chip memory on the FPGA, thus restricting the size of data that can be clustered. We implement the basic K-means algorithm without filtering the data and instead focus on parallelism. Our focus is on scalable designs that easily support large data.
Other researchers [21] presented a multi-core implementation of K-means on an FPGA. Their design targets a Xilinx Zynq and uses the ARM core for the update step and the FPGA fabric for the assignment step. Their assignment uses the Manhattan distance rather than the more popular Euclidean distance. They show how their design scales to a large number of clusters. Their results do not include the data transfer time to the Zynq processor. Choi and So [22] present an implementation of K-means with map-reduce on an FPGA-accelerated computer cluster. They make use of three FPGA boards, two for mapping and one for reduction, and achieve approximately 20x speedup over software.
Summary
In this chapter, we introduced the basic background of K-means and discussed different software parallelization methods: OpenMP, MPI and GPU. We also discussed hardware parallelism on FPGAs and introduced the system architecture of the two platforms, along with their development flows and hardware and software toolkits. Finally, we listed related work on this algorithm.
In the following chapter, the detailed hardware implementation and the software-hardware co-design are discussed.
Methodology and Design
There are many underlying levels, such as the kernel level and the system interrupt level. However, these levels are provided by either Gidel or AWS. The vendors also provide part of the FPGA design, including the hardware logic that connects to PCIe and the on-board DDR memories. Developers are mainly responsible for their own data path and user sub-design.
FPGA parallelism
The main distinction between the K-means algorithm implemented in software and in hardware lies in the parallelism shown in Figure 3.1. In this figure, CC stands for clock cycle number, DC denotes distance calculation, CMP denotes comparison, ST means store and dispatch, and ACC denotes accumulation.
For software, the sequential code is divided into two sections: Assignment (assigning data to different clusters) and Update (calculating the new means for each cluster), as shown in Figure 3.2. For hardware, the whole process is pipelined, and the pixel data can be passed to the sum accumulation module along with the cluster number to which it is assigned. This means that the assignment and update steps of the algorithm are totally overlapped. Theoretically, if the hardware resources can hold a completely parallel design, one index is calculated by the assignment step of the algorithm in each clock cycle.
To get the best performance and balance data reading against processing, the update step should accumulate one pixel into its corresponding cluster sum in each clock cycle. In this research, the number of clusters ranges from 8 to 32. The maximum of 32 was chosen for two reasons. First, the time for hardware synthesis is acceptable and the routing requirements can be met. Second, the hardware resources of the Gidel board limit the upper bound of the number of clusters.
Even though we target two platforms, we try to be consistent across them. As mentioned before, the optimal situation is that one index is calculated in each clock cycle. In this situation, exactly K distance calculators should be instantiated, and the K distances are the inputs to a K-input comparator.
However, if the number of clusters K increases to a relatively large number, it takes too long to compile and fit the design to the FPGA board. If K increases further, the hardware resources can be exceeded. In these two situations, obtaining one result from the distance calculator module in every clock cycle is not possible, and it takes several clock cycles to get one index. For this reason, the maximum K in this work is set to 32 to remain consistent across the two platforms; we do not currently consider larger K.
Read Generator
• Receive Sync Signal 
Comparator
The comparator is a tree structure for parallel comparison. The comparator has at most 32 distances to compare because we set 32 as the upper limit for the number of clusters. Due to the limits of I/O pins, we use three levels of 8-input comparators, which together act as a 32-input comparator. A valid signal is asserted high to indicate that the output of the comparator is valid. Since the whole process is pipelined, exactly n clock cycles are necessary for receiving n indexes. The design also works when the number of clusters is not a power of two, because the extra input ports are set to a constant, the maximum finite single-precision floating-point value 0x7f7fffff, and therefore never win the comparison. As a result, the index assigned to a pixel is always a number between 0 and K-1.
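The padding trick can be illustrated with a software model of a comparator tree. The sketch below uses a tree of pairwise (2-input) comparisons rather than the 8-input comparators of the actual design, which is a simplifying assumption; the names `FLOAT_MAX` and `comparator_tree` are illustrative. Unused ports carry the constant 0x7f7fffff, so the winning index always belongs to a real cluster.

```python
import struct

# Largest finite single-precision value, the padding constant named in the text.
FLOAT_MAX = struct.unpack(">f", bytes.fromhex("7f7fffff"))[0]

def comparator_tree(distances, ports=32):
    """Return the index of the smallest distance using pairwise tree reduction.

    ports must be a power of two. Unused ports are padded with FLOAT_MAX so
    that they can never win against a real distance.
    """
    assert len(distances) <= ports
    level = [(d, i) for i, d in enumerate(distances)]
    level += [(FLOAT_MAX, ports)] * (ports - len(level))   # padded ports
    while len(level) > 1:
        # Each tree level halves the number of candidates; on equal distances
        # the tuple comparison keeps the lower index.
        level = [min(level[j], level[j + 1]) for j in range(0, len(level), 2)]
    return level[0][1]

print(comparator_tree([5.0, 2.0, 9.0, 2.0, 7.0]))  # 1
```

In hardware, each level of the tree is a pipeline stage, so a new set of distances can enter the comparator on every clock cycle.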
With at most 32 clusters, the hardware resources are sufficient, and obtaining one index per clock cycle from the assignment part (the comparator) is attainable. We can parallelize the update step of the algorithm (mainly the accumulation module) accordingly.
Update Step
Accumulation Data Input
The assignment part of the algorithm (mainly the distance calculator and the comparator) is overlapped with the update part (the accumulation module). The inputs to the accumulation module are the index and the pixel data. The index is the output of the comparator. The five features are passed to the accumulation module in parallel, using another 5 ports of 32-bit width. We do not use shift registers to hold the pixel values, because this would waste too many hardware resources and the length of the shift register would depend on the output latency, in clock cycles, of the preceding floating-point IP (adders, subtractors and multipliers). It is much more convenient to read the data from the on-board memory in each clock cycle, which requires no extra hardware resources to store these data.
Data Dispatch Scheme
For the 'Accumulating data points to clusters' block in Figure 3.2, we need a data dispatch scheme because the data pixels are stored in the on-chip memory associated with each cluster. For the update step, we need to read the source data and send it to the on-chip RAM. In this data dispatch scheme, we instantiate K different pre-dispatch modules, each with a RAM module, a data fetcher, a pixel read generator and a data aligner, as shown in Figure 3.3. This parallelizes the step at the cost of extra RTL-level control logic.
For the update step, a valid index and data pixel are needed in each clock cycle, and we want to achieve maximum throughput. This can be done with some low-level hardware parallelism. Adding two floating-point data pixels is quite expensive. We therefore use a small on-chip RAM as storage, resembling the functionality of a FIFO, and implement extra hardware control logic to fetch multiple data pixels and sum them to improve the computational throughput. The read generator generates the read signal that fetches 8 data pixels from the RAM module. The read generator checks the RAM contents every 8 clock cycles, synchronized with the following levels of floating-point adders. In this way, the update step can process one pixel per clock cycle on average and, as a result, the algorithm can be completely parallelized.
If the number of data pixels is not a multiple of eight, the last few data pixels remaining in the RAMs are processed to complete the update step: a final read signal is sent to each cluster to read the residual data pixels and obtain the final sums. Thus, the correctness of the final results is guaranteed.
Hence, the sum and the number of pixels can be stored for each cluster.
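The batching and residual flush can be modeled in software as follows. This is a behavioral sketch only: it uses scalar values instead of five features per pixel, it reads a buffer as soon as it holds a full batch rather than polling every 8 clock cycles, and the function name is illustrative.

```python
def dispatch_and_accumulate(indexed_pixels, k, batch=8):
    """Buffer pixels per cluster and add them to the cluster sum in batches.

    indexed_pixels: iterable of (cluster_index, value) pairs.
    Returns (sums, counts).
    """
    buffers = [[] for _ in range(k)]   # stand-ins for the per-cluster RAMs
    sums = [0.0] * k
    counts = [0] * k
    for idx, value in indexed_pixels:
        buffers[idx].append(value)
        counts[idx] += 1
        if len(buffers[idx]) == batch:   # a full batch is ready to be read
            sums[idx] += sum(buffers[idx])
            buffers[idx].clear()
    for idx in range(k):                 # final read: flush residual pixels
        sums[idx] += sum(buffers[idx])
        buffers[idx].clear()
    return sums, counts

pixels = [(i % 3, float(i)) for i in range(50)]
sums, counts = dispatch_and_accumulate(pixels, k=3)
print(sums, counts)  # [408.0, 425.0, 392.0] [17, 17, 16]
```

None of the three clusters in this example receives a multiple of eight pixels, so each final sum depends on the residual flush at the end.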
Sum and Mean Calculation
This step follows the per-cluster accumulation. It is performed by the Sum and Mean module, which consists of an adder, a divider and a converter from unsigned integer to floating point for each cluster.
The module keeps accumulating data pixels as long as the data fetcher keeps providing data. In principle, each cluster would have its own adder to calculate the sum. However, every 8 clusters can share one adder for resource efficiency.
The mechanism is as follows: the adder is pipelined with an output latency of eight clock cycles, which means that it can hold the partial sums of at most eight clusters at a time. This requires that a different sync signal be passed to each cluster, each delayed by one clock cycle relative to the previous cluster. The output from the data aligner of each cluster is then also delayed by one clock cycle. Thus, for K clusters, we instantiate only K/8 adders to hold the sums for all clusters.
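The idea of sharing one pipelined adder among eight clusters can be modeled by treating the adder pipeline as a queue of partial sums. The sketch below is a behavioral model under assumptions: the function name, the use of zero as the operand when a cluster has no data, and the scalar operands are all illustrative and are not taken from the RTL design.

```python
from collections import deque

def shared_adder(streams, latency=8):
    """Accumulate len(streams) independent sums with one pipelined adder.

    streams[c] is the list of operands for cluster c. Cluster c is offered to
    the adder on cycles where cycle % latency == c, so the result that leaves
    the pipeline on that cycle is cluster c's own partial sum.
    """
    assert 0 < len(streams) <= latency
    pipeline = deque([0.0] * latency)    # partial sums in flight, oldest first
    longest = max(len(s) for s in streams)
    for cycle in range(longest * latency):
        cluster = cycle % latency
        step = cycle // latency
        partial = pipeline.popleft()     # result issued `latency` cycles ago
        has_data = cluster < len(streams) and step < len(streams[cluster])
        operand = streams[cluster][step] if has_data else 0.0
        pipeline.append(partial + operand)   # re-enter the adder pipeline
    # After a whole number of rounds, slot c holds the sum for cluster c.
    return list(pipeline)[:len(streams)]

print(shared_adder([[1.0, 2.0, 3.0], [10.0], [0.5, 0.5]]))  # [6.0, 10.0, 1.0]
```

Because each cluster's operands arrive exactly one latency period apart, the adder output is always the partial sum that the current cluster needs, and no cluster ever waits for another.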
The pixel read generator already records the number of pixels for each cluster. The converter from unsigned integer to floating point converts the storage address to a floating-point value. Once all the pixels have been stored into the RAMs, an End Storage signal is pulled high, and it may take one more sync signal period (eight clock cycles) to flush all the pixels. The FlushComplete signal is then pulled high for each cluster once the read address is equal to or greater than the storage address. Note that because a data mask is applied here, we do not need to be concerned about whether the data fetcher passes redundant pixel data into the Sum and Mean module.
Finally, the sums are fetched and stored in separate register arrays for the 5 features. Using the number of pixels in each cluster, the new means of the 5 features of each cluster are calculated and later written to the mean holders (RAMs) for the following iteration.
Data transmission and HW-SW interface
In this design, the entire algorithm is implemented in RTL code, but the data is still stored on the host. The data transmission from host to FPGA is done differently on the two platforms.
Data transmission is an explicit communication overhead in a heterogeneous system design with an FPGA. Data is originally stored in the RAM of the host and, for further processing, needs to be sent to the memory system in the FPGA or to external memory blocks connected to the FPGA. The memory requirements depend heavily on the nature of the application, and memory throughput can be the most critical requirement in this hardware-software co-design system. The memory system of an FPGA board is often described in terms of levels. In this implementation of K-means, two types of memory are used: on-chip memory and on-board memory. On-chip memory is the simplest type of memory to use in an FPGA-based design.
In the user software design, register access and bulk data transfer are provided by the platforms, Gidel and AWS. The software API and library calls are defined by each platform and are closely tied to how it implements the hardware, but the functionality is quite similar. As discussed earlier, Gidel encapsulates all implementation details: users place a MultiPort IP in the hardware design and use a write/read handle for bulk data transfer. Register access is exposed to users as a field get/set method in a class and as a signal port in the user sub-design. AWS adopts the AMBA architecture and uses separate AXI4 and AXI4-Lite protocols for different data sizes. For users, a single data transaction goes through an AXI4-Lite bus, with extra state machine control logic in hardware and a peek/poke API at the software level. For bulk data transfer, AWS provides an additional data transfer driver (EDMA).
One major bottleneck for K-means on FPGA is the data transmission throughput between host and FPGA.
For real-life applications, the data transmission size could be as large as several gigabytes. Even on Gidel, the platform with the slower memory throughput, the data transfer rate can reach several hundred megabytes per second, which is acceptable for an application that would need to run for several minutes on a CPU.
Since the hardware can achieve a hundredfold speedup, this overhead is small compared to the computational gain brought by the FPGA hardware.
Gidel
In the whole hardware architecture, the main functionality of data transfer depends on the MultiPort IP.
Each Port of ProcMultiPort may be configured to work in one of two modes: Random or Sequential. The Random mode enables random read and write operations through the Port. Each time, the Port accesses the memory using a specific address, so bursts cannot be implemented with a Random Port. Random Ports have the highest priority when accessing the memory. Sequential mode enables bursts to and from the on-board memory. Sequential Ports may have one direction only: they must be defined either as READ or WRITE. All Sequential Ports have a FIFO interface. The data is not read/written directly from/to memory but goes through an internal FIFO implemented inside the MultiPort IP.
The MultiPort IP also specifies the memory bandwidth in the configuration flow. Because this IP supports at most 16 read/write ports to the DDR memory, the bandwidth is always calculated before hardware design generation. In this application, the data layout in memory is arranged so that each feature of the source data is stored in its own contiguous memory block with its own read/write port. This layout is compatible with the MultiPort IP Sequential mode and is well suited to burst transfers. Each read port is associated with a different starting address, so the ports do not interfere with each other.
Data transfer in hardware follows a handshake rule in most scenarios, and this also holds for the MultiPort IP. Internally, it contains FIFO structures that are used to better control the output flow, so some extra latency may be incurred until the FIFO is filled to a certain level and is ready to be read.
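The per-feature layout can be illustrated with a small host-side sketch. It assumes 32-bit floating-point values and blocks placed back to back from a base address; the function names and the byte width are illustrative assumptions, not part of the Gidel API.

```python
BYTES_PER_VALUE = 4  # assumption: each feature value is a 32-bit float


def feature_base_addresses(num_pixels, num_features=5, base=0):
    """Starting byte address of each feature block when blocks are contiguous."""
    block_bytes = num_pixels * BYTES_PER_VALUE
    return [base + f * block_bytes for f in range(num_features)]


def to_planar(pixels):
    """Turn [(r, g, b, x, y), ...] into one contiguous list per feature."""
    return [list(column) for column in zip(*pixels)]


# A 500x500 image has 250000 pixels, so each feature block is 1000000 bytes.
addresses = feature_base_addresses(500 * 500)
planes = to_planar([(1, 2, 3, 0, 0), (4, 5, 6, 1, 0)])
```

Because every block is contiguous, a sequential port that starts at its own base address can stream one feature with burst reads without touching the address range of any other port.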
AWS
Amazon Web Services (AWS) provides a memory system architecture that contains a data path, from the host to the DDRs, which is composed of AXI-4 and AXI-4-Lite protocols. The K-means design on the FPGA is a stream application, whereas the data in the DDRs is accessed through a memory-mapped interface. The data movement from the memory-mapped domain to the stream domain is handled by a Xilinx IP core, the AXI DataMover. It provides the basic AXI4 Read to AXI4-Stream and AXI4-Stream to AXI4 Write data transport and protocol conversion. The DataMover provides MM2S and S2MM AXI4-Stream channels that operate independently in a full-duplex-like manner. The AXI DataMover IP core is a key building block with 4 KB address boundary protection and automatic burst partitioning, and it provides the ability to queue multiple transfer requests, using nearly the full bandwidth capabilities of the AXI4-Stream protocol [24].
The DataMover always behaves as a master and initiates read address channel requests to the interconnect, which passes the requests to the DDRs. The read channel of the interconnect then sends the data through the relevant ports to the DataMover. The DataMover outputs the final data to a stream port, which is the input data source for the K-means algorithm.
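The idea behind 4 KB boundary protection and burst partitioning can be sketched in software. The snippet below splits one read request into bursts that never cross a 4 KB address boundary. It only models the rule; it is not the DataMover's actual logic, and the maximum burst size of 1024 bytes is a hypothetical parameter chosen for the example.

```python
BOUNDARY = 4096  # AXI bursts must not cross a 4 KB address boundary


def partition_bursts(start_addr, num_bytes, max_burst_bytes=1024):
    """Split a transfer into (address, length) bursts.

    max_burst_bytes is a hypothetical limit used for illustration only.
    """
    bursts = []
    addr = start_addr
    remaining = num_bytes
    while remaining > 0:
        # Bytes left before the next 4 KB boundary.
        to_boundary = BOUNDARY - (addr % BOUNDARY)
        length = min(remaining, max_burst_bytes, to_boundary)
        bursts.append((addr, length))
        addr += length
        remaining -= length
    return bursts


# A 200-byte request starting 96 bytes before a boundary becomes two bursts.
bursts = partition_bursts(4000, 200)
```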
Summary
In this chapter, we discussed the system design of the hardware and software. We paid particular attention to the data movement on the two platforms and explained how we implemented the K-means module in RTL. In the following chapter, we discuss the experimental setup and results.
Hardware Board
The board we use is a ProcV D8 series board. The basic model is ProceVD8-BXSM. The FPGA is a 5SGSD8 series device with speed grade -2. The board has two DDR III SODIMM sockets, and the FPGA has 695K LEs, 2567 M20K blocks and 3926 18x18 multipliers. The typical system frequency for the ProcVD8 board is 150-450 MHz, and the internal local clock runs at 100 MHz. For memory performance, the board has up to 32 DMA channels. The memory capacity and throughput are listed in Table 4.1.
The required memory space for an image can be as large as several GBytes, which can exceed the embedded memory on the board. It can also exceed 200 MBytes. This is why we add an 8 GByte DDR III SODIMM to the board. In practice, we write the raw data to the SODIMM and then read the data back from the SODIMM.
Gbps in each direction. The FPGAs within the f1.16xlarge share access to a 400 Gbps bidirectional ring for low-latency, high bandwidth communication.
The specific FPGA part is a Xilinx xcvu9p series. This FPGA contains 2,586 K system logic cells, 6800 DSP slices, 345.9 Mb on-chip Memory and 832 I/O pins. 
System Data Input
Our K-means clustering implementation takes color images as input datasets. We cluster the pixels in an image based on five features: the three RGB channels and the position (x, y) of each pixel. We choose random data points to initialize each cluster centroid, and then use Euclidean distance to find the nearest cluster for each data point. The parameters of K-means include the number of desired clusters (K), the number of iterations (I), and the size of the input dataset (N). The number of features is fixed for this set of experiments, but can easily be changed in our design. For our experiments, we run K-means clustering on a compute cluster with NVIDIA GPUs as well as on a ProcV FPGA board from Gidel and an AWS F1 instance. Details of the inputs, setup and experiments are given in this section.
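The initialization and assignment steps described above can be written as a short software reference. This is a minimal sketch, not the RTL or the benchmarked software; the function names and the fixed seed are assumptions. It compares squared distances, which selects the same nearest cluster as the Euclidean distance while avoiding the square root.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def init_means(points, k, seed=0):
    """Pick k distinct random data points as the initial centroids."""
    rng = random.Random(seed)
    return [list(p) for p in rng.sample(points, k)]


def assign(points, means):
    """Return, for each point, the index of its nearest cluster mean."""
    labels = []
    for p in points:
        dists = [squared_distance(p, m) for m in means]
        labels.append(dists.index(min(dists)))
    return labels


# Three pixels with features (R, G, B, x, y) and two fixed cluster means.
points = [(0, 0, 0, 0, 0), (1, 1, 1, 1, 1), (10, 10, 10, 10, 10)]
labels = assign(points, [[0] * 5, [10] * 5])
```

In the hardware design the K distance computations for one pixel run in parallel, whereas this reference evaluates them in a loop.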
Our data sources are images of different sizes, ranging roughly from 300x200 to 3000x4000 pixels. The size of the 10 images in pixels is given in Table 4.3. Raw data is read from the bitmap image files and then translated to floating point for processing. This translation is done once and used by all the different implementations. Similarly, the data for the initial means is computed once and shared by all implementations. The main factor that affects resource allocation on the FPGA (as shown in Table 4.4) is the number of clusters, K. To achieve parallelism, K distance calculators and K dispatch modules are instantiated.
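The translation from raw pixels to floating-point feature vectors can be sketched as follows. The input is assumed to be already decoded into rows of (R, G, B) tuples; bitmap parsing and any feature scaling are left out, and the function name is hypothetical.

```python
def image_to_features(rgb_rows):
    """Flatten an image into one [R, G, B, x, y] float vector per pixel.

    rgb_rows[y][x] is an (R, G, B) tuple; no scaling is applied (assumption).
    """
    features = []
    for y, row in enumerate(rgb_rows):
        for x, (r, g, b) in enumerate(row):
            features.append([float(r), float(g), float(b), float(x), float(y)])
    return features


# A 2x2 example image.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (10, 20, 30)],
]
feats = image_to_features(image)
```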
Software Implementation Setup
Hardware resource utilization is generated by Quartus Prime (version 15.1). The largest K reported is K = 32, where the ALM usage is 109% of the 262400 ALMs available on this board. As the area scales linearly with K, larger designs (K = 64, 128, ...) cannot fit on the target device using our current design. Thus, we consider implementing the design across multiple FPGA boards.
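A rough estimate of how the area grows with K can be derived from the two figures quoted above. The sketch assumes that ALM usage is purely proportional to K with no fixed overhead, and that a design could be split across boards with no extra logic; both are simplifying assumptions, so the result should be read as a lower bound rather than a measured requirement.

```python
import math

TOTAL_ALMS = 262_400                   # ALMs available on the board (from the text)
ALMS_AT_K32 = 109 * TOTAL_ALMS // 100  # 109% usage reported for K = 32


def estimated_alms(k):
    """Estimated ALM usage for k clusters.

    Assumption: usage is proportional to k, with no fixed overhead.
    """
    return ALMS_AT_K32 * k / 32


def boards_needed(k):
    """Lower bound on the number of boards, ignoring partitioning overhead."""
    return math.ceil(estimated_alms(k) / TOTAL_ALMS)


for k in (16, 32, 64, 128):
    print(f"K={k}: about {estimated_alms(k):.0f} ALMs, at least {boards_needed(k)} board(s)")
```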
Data Transfer Time between CPU and FPGA
Gidel
One of the dominant times in our design is the data transfer time between the CPU and the FPGA, given in Table 4.5. The reported time spans from the moment the data resides in CPU software buffers to the moment it resides in the on-board memory of the FPGA board. During the initial transfer, the data is also copied to on-chip FPGA memory so that processing can be overlapped with data transfer for the first iteration.
AWS
AWS also provides the hardware and software framework for data transfer from host to on-board memory.
The detailed transfer time and the iteration time are listed in Table 4.6.
From the two tables above, we find that the EDMA driver provided by AWS for data transfer from host to on-board memory performs better than the MultiPort IP provided by Gidel. However, this result is influenced by many factors, such as host hardware parameters and the FPGA running frequency.
In either case, the time for data transfer is acceptable given that the whole application runs for several seconds.
Our experiments compare sequential, parallel software (OpenMP and MPI) and accelerated (GPU and FPGA) versions. The GPU and parallel software methods are taken from Janki's work [16].
The sequential software version implements the K-means algorithm by first computing the assignment of each data point to a cluster, and then, after the entire image has been processed, updating the means. As expected, this sequential code is significantly slower than any of the parallel methods. We report the total run time for 8, 16 and 32 clusters. For 128 clusters or more, the execution time for 50 or 100 iterations is extremely long.
The FPGA implementations on the Gidel board all have nearly the same run time, independent of the number of clusters. The run time for K = 8 is given in Table 4.5. Since the FPGA implementations instantiate the corresponding hardware resources for each K, the only difference in run time comes from the data size. Note that per-iteration run times are on the millisecond scale, even for the largest images, and this time is smaller than the time it takes to transfer the data to the FPGA. It is clear from this table that for a small number of iterations, the computational benefit of the hardware does not guarantee a big speedup for the entire application.
version is an order of magnitude slower than any of the software versions. For really small images, around several MBytes or smaller, OpenMP provides the best speedup, followed by MPI. This is because hardware acceleration for small datasets does not compensate for the data transfer overhead. For larger images, with sizes greater than 20 MBytes, the FPGA and GPU implementations are the fastest. This is partly because cached memory systems do not handle large datasets efficiently.
Hardware can be better tuned to handle large data in an efficient manner.
Specifically, for an image size of 6000 × 8892, with K = 32 and 50 iterations, the execution time and speedup factor for the different versions are shown in Table 4.7.
The FPGA implementation significantly outperforms all other methods independent of data size. This is partly due to our FPGA design, where all K cluster comparisons are done in parallel, and partly due to the fact that with more iterations the initial data transfer time becomes a smaller percentage of the overall run time. For larger datasets, we expect to use more clusters and more iterations to achieve performance. Hence, these results show that FPGAs are the best choice for K-means clustering on large data.
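The amortization argument can be made concrete with a simple model in which the total run time is one initial transfer plus a fixed time per iteration. The transfer and per-iteration times below are hypothetical values chosen for illustration; they are not taken from the tables in this chapter.

```python
def total_runtime(transfer_s, per_iter_s, iterations):
    """One initial host-to-FPGA transfer followed by equal-length iterations."""
    return transfer_s + per_iter_s * iterations


def transfer_share(transfer_s, per_iter_s, iterations):
    """Fraction of the total run time spent on the initial transfer."""
    return transfer_s / total_runtime(transfer_s, per_iter_s, iterations)


TRANSFER_S = 0.5    # hypothetical transfer time in seconds
PER_ITER_S = 0.005  # hypothetical time per iteration in seconds

for iterations in (10, 50, 1000):
    share = transfer_share(TRANSFER_S, PER_ITER_S, iterations)
    print(f"{iterations:5d} iterations: transfer is {share:.0%} of the run time")
```

Under this model the transfer share falls monotonically as the iteration count grows, which is why runs with many iterations benefit most from the hardware.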
AWS Experiments
For AWS, the stand-alone K-means module is tested. We have not yet completed integration with on-board memory. The number of iterations is again set to 50, and the run time for different input sizes is listed in Table 4.9. From this table, we can compare the performance of the K-means module without the memory access section. K is 8 and the number of iterations is 50. The default frequency for AWS is 125 MHz, while the frequency for Gidel is 100 MHz. If we normalize the Gidel results to the same frequency, the runtime is nearly the same for this part, as seen in 4.4. However, AWS can also target higher frequencies, and it provides better data transfer performance between the host and the FPGA board. Considering the end-to-end application of this hardware-software co-design, the AWS platform is therefore expected to provide better runtime performance. The whole-system runtime will be reported in the future.
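The frequency normalization used in this comparison amounts to scaling a measured run time by the ratio of the two clock frequencies. The sketch assumes the module takes the same number of clock cycles on both platforms; the 1.0 s input is a hypothetical measurement, not a value from Table 4.9.

```python
GIDEL_HZ = 100e6  # Gidel clock frequency stated in the text
AWS_HZ = 125e6    # AWS default clock frequency stated in the text


def normalize_runtime(runtime_s, measured_hz, target_hz):
    """Scale a run time to another clock, assuming an identical cycle count."""
    return runtime_s * measured_hz / target_hz


# A hypothetical 1.0 s run at 100 MHz would take 0.8 s at 125 MHz.
normalized = normalize_runtime(1.0, GIDEL_HZ, AWS_HZ)
```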
Summary
In this chapter, we presented the experiments for the K-means implementation on FPGAs. We first focused on the data transfer performance of the two platforms. We then compared the sequential version, the parallel software versions, the GPU version and the FPGA version (Gidel), and discussed the trade-off between data transfer overhead and computation. Finally, we presented the run-time test of the stand-alone K-means module on AWS. In the following chapter, we discuss conclusions and future work.
Conclusion and Future Work
Conclusion
In this research, the K-means design is implemented in RTL on FPGAs on two different platforms: Gidel and AWS. The hardware implementation is developed together with runtime software and compared to sequential software, parallel software and GPU versions. Since the purpose of this research is to improve K-means performance by parallelizing it on FPGAs, the design relies heavily on parallelism and pipelining. We test K-means with relatively large datasets and store the original data in on-board SODIMM memory for further hardware processing on the FPGA. The data transfer time is the critical part of this hardware-software design, and we test it separately on the two platforms. On Gidel, the software driver takes an extra tens of seconds to initialize, and the data is transferred to the on-board memory through the PCIe bus at roughly 700 MB/s to 800 MB/s. On AWS, since there are four DDRs on board, the memory access is not uniform, but the average bandwidth is around 1.5 GB/s; the effective bandwidth also varies with the transfer size. The stand-alone K-means module on the FPGA has the following advantages. First, while the execution time of each software iteration varies with the number of clusters and the number of pixels, hardware resources can be pre-allocated and pre-fitted for the maximum K, so the time of each iteration is the same for different K. Second, parallelism can be achieved at different levels of the hardware implementation on an FPGA. The data path can be parallelized, and different sub-modules can be overlapped and pipelined. Therefore, the total latency of every iteration is scalable. In terms of performance, the FPGA can achieve a substantial speedup over the parallel software methods and even the GPU method.
In this case, hardware is much more suitable for image processing algorithms with extremely large datasets, because streaming data processing is scalable and benefits from pipelining and multiple levels of parallelism.
Comparing the two FPGA platforms, AWS delivers better performance. This is because the hardware setup for the AWS experiments is better: it has a higher CPU frequency, larger memory bandwidth and a high-end FPGA board targeting a higher frequency. Additionally, AWS provides several FPGA boards in one host. There are several different connection protocols to support inter-board data movement. Although this support is not yet released, it can be used to extend to larger designs for processing stream data.
Future Work
There is still future work to do:
1) We need to get the K-means end-to-end application working on AWS.
2) We currently use one FPGA board to execute K-means on large datasets. In the future, multiple FPGA boards can work together to parallelize the K-means algorithm, which will be suitable for datasets with more features. Since communication between boards is efficient, the K-means design can then handle datasets with larger size, more features and more clusters.
3) Bandwidth from on-board memory to the FPGA is very important. For Gidel, the local bus frequency limits the data bandwidth from the local bus to the user logic design. Currently, the internal logic frequency is only 100 MHz, so the data bandwidth between the local bus and the user's design is limited to 2 GB/s. We plan to investigate ways to improve this to accommodate more features.
4) We do not include the convergence module in the hardware, but it can still be parallelized with the update step of the algorithm (the accumulation part). The only difficulty arises when the dataset is so large that on-chip memory cannot hold all the results. For large designs, the index table will have to be kept in on-board memory. One method is to write the new index back to on-board memory, read the previous index and compare the two in one clock cycle. This can be almost fully parallelized with the assignment step of the algorithm, so the hardware execution time for each iteration does not have to wait for the convergence judgment. However, a fully parallelized version requires even larger memory bandwidth.
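The convergence check described in item 4 reduces to comparing each pixel's previous cluster index with its new one. The following software sketch assumes that convergence means no pixel changed cluster between two iterations; the function names and this exact criterion are assumptions, since the convergence module is not part of the current hardware.

```python
def count_changes(prev_labels, new_labels):
    """Number of pixels whose cluster index changed between two iterations."""
    return sum(1 for old, new in zip(prev_labels, new_labels) if old != new)


def has_converged(prev_labels, new_labels):
    """Assumed criterion: converged when no pixel changes its cluster."""
    return count_changes(prev_labels, new_labels) == 0


prev_labels = [0, 1, 1, 2]
changed = count_changes(prev_labels, [0, 1, 2, 2])    # one pixel moved
stable = has_converged(prev_labels, [0, 1, 1, 2])     # nothing moved
```

In hardware, the comparison for each pixel could be issued as soon as its new index is produced by the assignment step, which is what allows the check to overlap with the rest of the iteration.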
Finally, hardware parallelism and pipelining are well suited to streaming data processing, and we believe they can be applied to a wide range of machine learning algorithms.
