The k-means algorithm is one of the most common clustering algorithms and is widely used in data mining and pattern recognition. The increasing computational requirements of big data applications make hardware acceleration of the k-means algorithm necessary. In this paper, a coarse-grained Map-Reduce architecture is proposed to implement the k-means algorithm on an FPGA. Algorithmic segmentation, data path elaboration and automatic control are applied to optimize the architecture for high performance. In addition, high-level synthesis is used to reduce development time and complexity. For a single iteration of the k-means algorithm, a throughput of 28.74 Gbps is achieved. This corresponds to a speedup of at least 3.93x over four representative existing FPGA-based implementations and can satisfy the demands of big data applications.
INTRODUCTION
The k-means algorithm is an unsupervised clustering algorithm that partitions the input samples into k clusters, so that samples within a cluster share similar attributes, while dissimilar samples are grouped into different clusters [1]. The algorithm is widely applied in fields ranging from data mining and pattern recognition to bioinformatics.
Albeit powerful, the k-means algorithm becomes time-consuming as the input sample set grows large, rendering hardware acceleration necessary, especially for big data applications.
The Field-Programmable Gate Array (FPGA) is inherently parallel, which makes it a suitable platform for exploiting algorithmic parallelism and accelerating the k-means algorithm. Over the last decade, several hardware designs for the k-means algorithm have been proposed and implemented on FPGAs. Despite displaying respectable performance, most of these implementations can hardly satisfy the demand for computing power and precision required by today's large-scale clustering tasks. Among these implementations, some are not optimized for high performance due to their inefficient algorithmic segmentation and data path [1], while others lack precision because of the use of fixed-point arithmetic or bitwidth truncation [2] [3].
To solve the problems above, a coarse-grained Map-Reduce architecture is proposed to implement the k-means algorithm on an FPGA in this paper. The main contributions of this paper are listed below:
• The k-means algorithm is adapted to the proposed coarse-grained Map-Reduce architecture by algorithmic segmentation, so that the intrinsic parallelism of the algorithm can be fully exploited.
• A high-performance and high-precision hardware accelerator for the k-means algorithm is developed with high-level synthesis (HLS). Stream interfaces and an elaborated system architecture are applied to ensure high-speed data transmission between the memory and the accelerator.
• The host programs for task scheduling and data management in traditional Map-Reduce frameworks are implemented in hardware circuits to reduce the communication overhead between the host and the hardware accelerator.
The proposed architecture is implemented on the Xilinx ZC706 board [4]. Evaluation results show that a throughput of 28.74 gigabits per second (Gbps) is achieved for a single iteration of the k-means algorithm. This is a speedup of at least 3.93x over four representative existing FPGA-based implementations.
The remainder of this paper is organized as follows: Section 2 gives the background of the k-means algorithm and an overview of related work. Section 3 discusses the design considerations. Section 4 provides details of the implementation. Section 5 shows the experimental results. Section 6 concludes the paper.
BACKGROUND AND RELATED WORK

K-means Clustering
The k-means algorithm proceeds in iterations until convergence is reached. At first, cluster centroids are initialized randomly or heuristically. Each iteration then consists of two steps: sample clustering and cluster centroid updating. In the sample clustering step, the distances between each sample and all the cluster centroids are calculated, then each sample is assigned to the nearest cluster and marked with a label that indicates the cluster it belongs to. The distance calculation is based on a distance metric, such as the Euclidean or Manhattan distance. In the cluster centroid updating step, the mean of the samples in each cluster is calculated by accumulation and division to update the cluster centroids for use in the next iteration. During each iteration, a distortion error is also calculated, which measures the sum of distances between each sample and the cluster centroid to which it has been assigned [5]. The iteration repeats until the change in the distortion error between two consecutive iterations is sufficiently small.
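The two steps and the stopping rule described above can be sketched in software as follows. This is a minimal serial illustration in Python, not the hardware design; the function names, the `threshold` and `max_iter` parameters, and the choice to leave an empty cluster's centroid unchanged are assumptions made for the example.

```python
import math

def kmeans_iteration(samples, centroids):
    """One k-means iteration: sample clustering, then centroid update."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    labels = []
    distortion = 0.0
    for x in samples:
        # Sample clustering: pick the nearest centroid (Euclidean distance).
        dists = [math.dist(x, c) for c in centroids]
        label = min(range(k), key=dists.__getitem__)
        labels.append(label)
        distortion += dists[label]
        # Accumulate the sample for the centroid update step.
        counts[label] += 1
        for d in range(dim):
            sums[label][d] += x[d]
    # Centroid update: mean of each cluster. An empty cluster keeps its
    # previous centroid (an assumption; the text does not cover this case).
    new_centroids = [
        [s / counts[j] for s in sums[j]] if counts[j] else list(centroids[j])
        for j in range(k)
    ]
    return new_centroids, labels, distortion

def kmeans(samples, centroids, threshold=1e-6, max_iter=100):
    """Iterate until the distortion error changes by at most `threshold`."""
    prev = math.inf
    labels, distortion = [], math.inf
    for _ in range(max_iter):
        centroids, labels, distortion = kmeans_iteration(samples, centroids)
        if abs(prev - distortion) <= threshold:
            break
        prev = distortion
    return centroids, labels, distortion
```

The inner distance computation touches every sample, every centroid and every dimension once per iteration, which is the part that dominates the runtime for large sample sets.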
Related Work
Early k-means implementations on FPGAs focus on clustering hyper-spectral and multi-spectral images. Dominique Lavenier [6] designs a systolic array architecture to accelerate the distance calculation in the k-means algorithm. The design is evaluated on different FPGA boards for comparison. The work is later improved and implemented on a hybrid processor [7]. A maximum speedup of 11.8 is achieved over a software implementation. Mike Estlick et al. [2] apply algorithmic transformations to map the k-means algorithm to reconfigurable hardware. Selected metrics for distance calculation and bitwidth truncation are used to optimize the performance and reduce the consumption of hardware resources. During evaluation, a speedup of 50x over a software implementation is achieved. All the designs above implement only part of the k-means algorithm in hardware, while the remaining part is executed on the host.
Venkatesh Bhaskaran [8] is the first to implement a complete k-means algorithm on an FPGA. The division is implemented in hardware using dividers from the Xilinx Core Generator. Hussain et al. [9] then propose a multi-core architecture for the k-means algorithm to process microarrays. A 51.7x speedup is achieved over a software implementation when five cores are used. In addition, several implementations improve the performance through algorithmic optimizations. In references [3] [10] [11], the kd-tree data structure and the triangle inequality are applied to improve the performance of a single iteration of the k-means algorithm. Despite displaying respectable performance, the implementations above can hardly satisfy the demand for high throughput required by today's large-scale clustering tasks, due to the insufficient bandwidth of processing elements or the inefficient data path. Moreover, the use of fixed-point arithmetic and bitwidth truncation reduces the precision of calculation and the accuracy of clustering.
To adapt the k-means algorithm for large-scale clustering tasks, two widely used programming models for parallel computing, OpenCL and Map-Reduce, are implemented on reconfigurable hardware. Ramanathan et al. [12] propose an OpenCL-based architecture to accelerate the k-means algorithm, in which work stealing is leveraged for runtime load balancing. Choi et al. [1] propose a multi-FPGA implementation following the Map-Reduce programming model. The system mainly consists of Mappers and Reducers, and each iteration of the k-means algorithm is regarded as a Map-Reduce job. The sample clustering is assigned to the Mappers, while the update of cluster centroids is executed in the Reducers. As the numbers of Mappers and Reducers are configurable and samples are stored on the hard disk rather than in the on-chip memory of an FPGA, this implementation provides a practical framework for big data applications. However, both of the implementations above conform too strictly to their corresponding programming models, which are general-purpose and thus contain elements that are unnecessary for the k-means algorithm. These elements prevent the implementations from achieving satisfactory performance. In contrast, a coarse-grained Map-Reduce architecture for the k-means algorithm is proposed in this paper. Unnecessary elements such as the shuffle and sort processes and the key/value pairs of the traditional Map-Reduce model are removed. The proposed architecture can be optimized to achieve high performance in big data applications.
DESIGN CONSIDERATIONS
In the k-means algorithm, the most computationally intensive part lies in sample clustering, since it requires the calculation of distances between all the samples and all the cluster centroids. Let N be the number of input samples, D be the dimensionality of each sample and K be the number of clusters that samples are partitioned into. If sample clustering is executed in a serial way, its time complexity will be O(N·K·D).
In this paper, a coarse-grained Map-Reduce architecture is proposed to implement the k-means algorithm on an FPGA. The architecture consists of M Mappers and one Reducer, as shown in Figure 1. Each iteration of the k-means algorithm is executed as a Map-Reduce job. The performance of each Map-Reduce job is optimized by algorithmic segmentation, data path elaboration and automatic control. In addition, 32-bit single-precision floating-point arithmetic is used to enhance the precision of calculation and the accuracy of clustering.
Algorithmic Segmentation
Each iteration of the k-means algorithm, a Map-Reduce job in our design, is segmented into the Map phase and the Reduce phase. The Map phase is responsible for sample clustering and accumulation, which are executed in the M Mappers. The Reduce phase mainly takes charge of generating new cluster centroids by division and is executed in the single Reducer. Compared with the architecture proposed in [1], accumulations of samples in each cluster are offloaded to the Map phase rather than the Reduce phase. This helps reduce the latency caused by data transfer, since samples are no longer required by the Reducer. Figure 1 shows the job distribution between the M Mappers and the single Reducer. Each Map-Reduce job is executed in steps as follows: resource, M is much smaller than N, hence the Map phase is responsible for the majority of the total runtime when N grows large. Additionally, M can be configured to strike a balance among the system performance, hardware resource and memory bandwidth.
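The segmentation described in this section can be illustrated with a small software model. The sketch below is written in Python under stated assumptions: the names `map_phase`, `reduce_phase` and `map_reduce_iteration` are hypothetical, the Mappers run one after another here although they run concurrently in hardware, and an empty cluster keeps its old centroid (a case the text does not discuss).

```python
import math

def map_phase(chunk, centroids):
    """Mapper: cluster one chunk of samples and accumulate partial results."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    distortion = 0.0
    labels = []
    for x in chunk:
        dists = [math.dist(x, c) for c in centroids]
        label = min(range(k), key=dists.__getitem__)
        labels.append(label)
        counts[label] += 1
        distortion += dists[label]
        for d in range(dim):
            sums[label][d] += x[d]
    # Only the partial sums, counts and distortion go on to the Reducer.
    return labels, (sums, counts, distortion)

def reduce_phase(partials, centroids):
    """Reducer: merge the Mappers' partial results, then divide."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    distortion = 0.0
    for p_sums, p_counts, p_distortion in partials:
        distortion += p_distortion
        for j in range(k):
            counts[j] += p_counts[j]
            for d in range(dim):
                sums[j][d] += p_sums[j][d]
    new_centroids = [
        [s / counts[j] for s in sums[j]] if counts[j] else list(centroids[j])
        for j in range(k)
    ]
    return new_centroids, distortion

def map_reduce_iteration(samples, centroids, m):
    """One iteration with the sample set divided equally among m Mappers."""
    size = max(1, math.ceil(len(samples) / m))
    chunks = [samples[i:i + size] for i in range(0, len(samples), size)]
    results = [map_phase(chunk, centroids) for chunk in chunks]
    labels = [lab for chunk_labels, _ in results for lab in chunk_labels]
    new_centroids, distortion = reduce_phase([p for _, p in results], centroids)
    return new_centroids, labels, distortion
```

In this model each Mapper hands the Reducer K·D partial sums, K counts and one distortion value, independent of how many samples it processed, which is why the samples themselves never need to reach the Reducer.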
Data Path
In the proposed Map-Reduce architecture, data are transferred between the FPGA and the memory. Time-consuming data copies in the memory are avoided by elaborating the memory space allocation. In addition, the Xilinx Advanced eXtensible Interface (AXI) protocol [14] is adopted for high-performance data transfer. Figure 1 shows the data interaction between the memory and the FPGA, as well as the memory space allocation. Samples in the sample set are stored contiguously in the memory and equally divided into M parts for the M Mappers. The intermediate results generated by the Mappers are also stored contiguously in the memory, so that they can later be transferred to the Reducer through a Direct Memory Access (DMA) engine. The input cluster centroids for the Mappers and the new cluster centroids generated by the Reducer share the same memory space. The new cluster centroids overwrite the old ones and are used by the Mappers in the next iteration. In this way, data copies in the memory are avoided, which largely reduces latency.
Two AXI4 interfaces, AXI4-Master and AXI4-Stream, are adopted for high-speed data transfer. The AXI4-Master interface is memory-mapped and allows high-performance burst transmission of up to 256 data transfer cycles with a single address phase. The AXI4-Stream interface is for high-speed streaming data transmission without address phases and allows unlimited data burst size [14] . When an AXI4-Stream interface is applied for data transfer between the memory and the FPGA, a DMA engine is required. Briefly, the AXI4-Master interface is commonly used for high-performance data transfer in relatively small amount, while the AXI4-Stream interface is suitable for high-speed data transmission in large scale. 
Automatic Control
In traditional Map-Reduce frameworks, a master node is required to schedule the operation of Mappers and Reducers and to control the data transfer. Typically in FPGA-based implementations, the master node is a host CPU with programs for scheduling and control. The communication overhead between the host and the FPGA may reduce the performance of the entire system significantly. Therefore, the host programs are implemented by hardware circuits in the proposed Map-Reduce architecture. In addition, at the end of each iteration in the k-means algorithm, the system should adjudicate on whether the algorithm is completed according to the distortion error. This arbitration is also implemented on hardware. Details for automatic control are discussed in Sections 4.3 and 4.4.
K-MEANS IMPLEMENTATION
To implement the Map-Reduce architecture described in Section 3, four IP cores are designed: Kmeans_Map, Kmeans_Reduce, DmaScheduler and Iteration_Controller. The first two IP cores implement the functions of the Mapper and the Reducer respectively. The DmaScheduler is designed to control the data transfer between the memory and the FPGA. The Iteration_Controller is used to control the execution of the k-means algorithm. The Kmeans_Map and Kmeans_Reduce IP cores are described in C and synthesized by the high-level synthesis tool Vivado HLS provided by Xilinx [15], while the DmaScheduler and the Iteration_Controller are described in Hardware Description Languages (HDLs). Details of the implementations of the four IP cores are discussed as follows.
Implementation of Mapper
Traditionally, IP cores are designed with HDLs such as VHDL or Verilog, which can be time-consuming for complex algorithmic implementations. HLS allows users to describe their designs at a higher level of abstraction in languages such as C/C++, which largely reduces the development complexity and time. The HLS tool translates the high-level description to the register transfer level and schedules the timing automatically. In addition, the HLS tool provides pragmas that let users exploit the parallelism in their implementations conveniently. The Kmeans_Map and Kmeans_Reduce IP cores are developed with HLS.
The Kmeans_Map IP core implements the functionality of the Mapper in the proposed Map-Reduce architecture. Figure 2 shows the pseudo-code describing the behavior of the Kmeans_Map IP core. Interfaces for inputs and outputs are specified in Lines 2-5, according to the consideration of data volume discussed in Section 3.2. In our implementation, samples are stored in the DDR rather than in the on-chip memory of the FPGA due to the relatively small size of the on-chip memory. At the beginning of the execution, cluster centroids are first loaded from the DDR (Line 7). Then samples are streamed into the FPGA sequentially (Line 10). The cluster centroids and the sample being processed are frequently accessed, hence they are cached in the local memory of the FPGA to reduce transmission latency. Subsequently, distances between the sample being processed and all the cluster centroids are calculated according to the Euclidean metric. Meanwhile, the current minimum distance and the index of its corresponding cluster are updated (Lines 11-15). Once the minimum distance is found, the sample is given a label indicating the cluster it belongs to, and the accumulator and counter of the corresponding cluster, along with the distortion error, are updated at the same time (Lines 16-19). Two pragmas are applied in the Kmeans_Map IP core design for further optimization. In Line 9, a PIPELINE pragma is applied to guide the HLS tool to construct pipelines automatically, which helps to reduce the initiation interval between the processing of two consecutive samples by allowing concurrent execution [13]. In addition, the ARRAY_PARTITION pragma in Line 12 is used to instruct the HLS tool to map the arrays in the C code to registers rather than Block RAMs (BRAMs), which removes the potential bottlenecks caused by BRAM accesses.
After high-level synthesis, the initiation interval is equal to the dimensionality of each sample, since the dimensions of a sample have to be accessed sequentially through the AXI4-Stream interface. Each Kmeans_Map IP core can achieve a throughput of 2.9 Gbps, according to the synthesis report provided by the HLS tool.
Each Kmeans_Map IP core, combined with an Axi_DMA IP core [16], constitutes a Mapper Unit, as shown in Figure 3. The Axi_DMA IP core is a DMA engine encapsulated by AXI4 interfaces. As discussed in Section 3.2, the sample input and the label output are implemented as AXI4-Stream interfaces so that data are transferred through a DMA engine. In contrast, the cluster centroids and the intermediate results share an AXI4-Master interface for transmission.
Implementation of Reducer
The Kmeans_Reduce IP core implements the functionality of the Reducer. The pseudo-code describing the behavior of the Kmeans_Reduce IP core is shown in Figure 4. Interfaces for the input and output are specified in Lines 2-3 according to the consideration of data volume discussed in Section 3.2. The execution proceeds in three steps. First, the sample counts and the partial sums of samples in each cluster, along with the partial sums of the distortion error, all of which are generated by the Mappers as intermediate results, are accumulated (Lines 5-9). Then the mean of the samples in each cluster is calculated by division to generate the new cluster centroids (Lines 10-14). A PIPELINE pragma is applied to allow the divisions to be executed in parallel to reduce latency (Line 11). Finally, the new cluster centroids are sent to the DDR for use in the next iteration (Line 15), while the distortion error is compared to its value in the previous iteration (Line 16). If the change in the distortion error is within a user-defined error threshold, a one-bit signal iteration_done is asserted, which indicates that the k-means algorithm is done and the final clustering results are ready (Lines 17-19).
The Kmeans_Reduce IP core, combined with an Axi_DMA IP core, constitutes the Reducer Unit, as shown in Figure 5. According to the discussion in Section 3.2, an AXI4-Stream interface is applied to the intermediate results input so that data are transferred through a DMA engine. In contrast, an AXI4-Master interface is applied to the new cluster centroids output.
Both the Kmeans_Map and Kmeans_Reduce IP cores are flexible and scalable. Parameters such as the number of samples that one IP core can process each time, the dimensionality of each sample and the number of clusters that the sample set is partitioned into can be configured easily, allowing users to customize their designs for particular applications.
Hardware Integration
The proposed Map-Reduce architecture for the k-means algorithm is implemented on the Zynq FPGA provided by Xilinx [16], although it can be easily migrated to other hardware platforms. Figure 6 shows the hardware architecture on the FPGA. The system is divided into three hierarchies. First, as mentioned in 
DmaScheduler
The DmaScheduler is designed to automatically control the data transmission by scheduling the operation of the DMA engines in the system. As shown in Figure 6, two DmaSchedulers are instantiated, one in the Mapper Block and the other in the Reducer Block. The structure of the DmaScheduler is shown in Figure 7. Each DmaScheduler consists of several DMA Managers and a finite state machine.
As discussed in Sections 4.1 and 4.2, an Axi_DMA IP core is instantiated for data transfer through the AXI4-Stream interface between the FPGA and the DDR in each Mapper Unit and the Reducer Unit. The operation of these DMA engines needs to be scheduled systematically. In each DmaScheduler, the DMA Managers are designed to control the behaviors of the DMA engines. Each DMA Manager is responsible for one DMA engine and consists of a Global-ID Generator, an Address Calculator and a Simple-Mode DMA Driver. Before the data transfer, the DMA engine needs to acquire the start address of the data block to be transferred. The start address is generated by the Address Calculator. As stated in Section 3.2, data in data blocks including the sample set, intermediate results and labels are stored continuously. Hence the start addresses of these data blocks can be calculated easily according to their specific IDs, which are assigned to each data block by the Global-ID Generator. The start address for each data block is formulated as a sum:
Address = BaseAddr + ID × Length
Besides, the Simple-Mode DMA Driver in each DMA Manager is used to manage the behaviors of the DMA engine automatically. In the implementation, each DMA engine operates in Simple Mode [16]. The Simple Mode allows a DMA engine to be configured and used easily, but only supports up to 8 MB of data per transmission. Data volumes larger than 8 MB can be partitioned and transferred successively.
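The address calculation and the partitioning of large transfers can be modeled in a few lines. The following Python sketch is illustrative only: the function names and the example base address are hypothetical, and the 8 MB limit is taken directly from the text (the exact limit of a real DMA engine depends on its configuration).

```python
MAX_SIMPLE_MODE_BYTES = 8 * 1024 * 1024  # per-transfer limit stated in the text

def start_address(base_addr, block_id, length):
    """Address = BaseAddr + ID x Length, for blocks stored contiguously."""
    return base_addr + block_id * length

def split_transfer(addr, nbytes, limit=MAX_SIMPLE_MODE_BYTES):
    """Split one transfer into successive (address, size) pieces <= limit."""
    pieces = []
    offset = 0
    while offset < nbytes:
        size = min(limit, nbytes - offset)
        pieces.append((addr + offset, size))
        offset += size
    return pieces

# Example: the block with ID 3 in a region of 20 MB blocks (hypothetical values).
block_len = 20 * 1024 * 1024
addr = start_address(0x10000000, 3, block_len)
transfers = split_transfer(addr, block_len)
```

With these example values the 20 MB block is issued as three successive transfers of 8 MB, 8 MB and 4 MB, each starting where the previous one ended.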
In addition, the finite state machine in the DmaScheduler is designed to control the operation of the IP core. Taking the DmaScheduler in the Mapper Block as an example, the operation is divided by states as follows: 
EXPERIMENTAL RESULTS
The proposed design is evaluated on the Xilinx ZC706 FPGA board with an xc7z045ffg900-2 FPGA. In the evaluation, the input sample set is fetched from the host through the PCIe Gen2 ×4 interface, and the transmission bandwidth reaches 7.6 Gbps in our implementation. Since the transfer of samples from the host to the FPGA happens only once for each k-means execution and the overhead is small compared to the total runtime, this overhead is neglected in the following evaluation. To avoid the bottleneck caused by the memory bandwidth, two on-board DDRs are utilized: the DDR in the Processing System and the DDR in the Programmable Logic [4]. The system runs at 100 MHz for evaluation, although a higher frequency can be achieved.
The evaluation is divided into three parts. In Section 5.1, the performance of our implementation is evaluated and compared with other FPGA-based implementations. Then the effects of varying the parameters (M and N) are explored in Section 5.2. Finally, the resource utilization is discussed in Section 5.3.
Performance Evaluation and Comparison
The performance of our implementation is measured by runtime and throughput. For better comparison, the same sample set as in [1] is used for evaluation. The sample set is a power consumption data set from the UCI Machine Learning Repository [18]. It records the electric power consumption in one household with a one-minute sampling rate over 4 years and contains 2075259 samples with 9 attributes each. Two input sample sets are created by extracting attributes from the original data set. One consists of 2-D samples, whose data are extracted from the attributes global_active_power and global_reactive_power. The other consists of 4-D samples, whose data are extracted from the attributes global_active_power, sub_metering_1, sub_metering_2 and sub_metering_3. The initial cluster centroids are generated randomly in software, and 12 Mappers are used due to the limitation of hardware resources.
The evaluation results are shown in Table 1. They show that the 2-D and the 4-D sample sets converge after 34 and 11 iterations on average respectively when partitioned into 4 clusters. The runtime required for a single iteration is similar for the two sample sets, which means that the throughput for each iteration of the 4-D sample set is twice that of the 2-D sample set. A throughput of 28.74 Gbps for a single iteration is achieved for the 4-D sample set.
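As a rough cross-check of these figures, the arithmetic can be written out as below. The sketch assumes that throughput is defined as input bits divided by single-iteration runtime and that 1 Gbps means 10^9 bit/s; neither definition is stated explicitly in the text, so the derived values are indicative only and the measured values in Table 1 take precedence.

```python
N_SAMPLES = 2075259       # samples in the data set (from the text)
BITS_PER_VALUE = 32       # single-precision floating point
NUM_MAPPERS = 12
MAPPER_GBPS = 2.9         # per-Mapper throughput from the HLS synthesis report
MEASURED_GBPS_4D = 28.74  # measured single-iteration throughput, 4-D set

def input_bits(n_samples, dims, bits=BITS_PER_VALUE):
    """Size of the sample set in bits."""
    return n_samples * dims * bits

def runtime_ms(n_samples, dims, gbps):
    """Implied runtime, ASSUMING throughput = input bits / runtime."""
    return input_bits(n_samples, dims) / (gbps * 1e9) * 1e3

# Implied single-iteration runtime for the 4-D set (about 9.24 ms).
t_4d = runtime_ms(N_SAMPLES, 4, MEASURED_GBPS_4D)

# Same runtime with half the bits gives half the throughput for the 2-D set.
gbps_2d = input_bits(N_SAMPLES, 2) / (t_4d * 1e-3) / 1e9

# Measured throughput relative to 12 Mappers at the synthesis-report rate.
efficiency = MEASURED_GBPS_4D / (NUM_MAPPERS * MAPPER_GBPS)
```

Under these assumptions, the measured 28.74 Gbps is about 83% of the 34.8 Gbps that twelve Mappers would deliver at the per-core rate reported by the HLS tool.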
The performance is compared with that of four representative FPGA-based implementations for k-means acceleration. The first one is a conventional implementation for clustering microarrays [9]. The second one is an optimized implementation based on the kd-tree data structure [11]. The third one is an implementation that 
Comparison results are shown in Table 2. It can be seen that the proposed implementation provides at least 3.93x speedup over the other four implementations. Moreover, the first two implementations use fixed-point arithmetic, while 32-bit single-precision floating-point arithmetic is applied in our implementation. Hence our implementation achieves high performance while maintaining high precision.
Additionally, the benefit brought by applying the automatic control on hardware is evaluated. The runtime of a single iteration is reduced by 1.2% after integrating the Iteration_Controller and the DmaScheduler into the Map-Reduce architecture. Since the runtime can grow large for large-scale inputs, the 1.2% performance enhancement for each iteration is not negligible. Furthermore, the workload of the host CPU is reduced and the control procedures are simplified by applying automatic control.
Effects of Parameters
As mentioned in Section 3.1, the number of Mappers (M) can be configured to strike a balance among the system performance, hardware resource and memory bandwidth. Adding Mappers can enhance performance by increasing parallelism, but consumes more resources of an FPGA in the meantime. Thus in this section, the effect on performance caused by adding Mappers is explored.
The sample set used for evaluation is generated in MATLAB and consists of 512000 4-D samples. Figure 8 displays the throughput of a single iteration as M increases. It can be seen that the throughput increases approximately linearly with M, which means that the performance of the system benefits from the extra parallelism brought by adding Mappers. The growth of throughput slows down slightly when M grows larger than 8. This is caused by the sharing of a single High Performance (HP) transmission port [17] by two or more Mappers.
Additionally, the performance is evaluated as input samples increase, especially under large-scale inputs. A series of 4-D . This means that the proposed architecture is applicable for big data applications.
Moreover, the ratio between the runtime of the Map phase and the total runtime is measured as N varies. As shown in Figure 10, the ratio increases rapidly when the input is small, while it tends to remain high and constant for large inputs. When N grows larger than 10000, the Map phase is responsible for more than 85% of the total runtime. This agrees with the theoretical prediction in Section 3.1: the Map phase occupies most of the total runtime when N grows large, since the runtime of the Map phase is approximately proportional to N, whereas the variation of N has no effect on the Reduce phase.
Resource Utilization
Resource utilization of the FPGA for implementing the proposed Map-Reduce architecture is also evaluated. As shown in Table 3, the implementation consumes more than 80% of LUTs and nearly half of FFs and DSPs on the xc7z045ffg900-2 FPGA when 12 Mappers are used. The utilization of hardware resource on the FPGA is maximized to achieve the highest performance.
CONCLUSION
In this paper, a coarse-grained Map-Reduce architecture is proposed to implement the k-means algorithm on an FPGA. Algorithmic segmentation, data path elaboration and automatic control are applied to optimize the performance of each iteration of the k-means algorithm. In addition, HLS is used to reduce the development complexity and time, and floating-point arithmetic is used to increase the accuracy of clustering. During evaluation, the implementation shows a throughput of 28.74 Gbps for a single iteration of the k-means algorithm and a speedup of at least 3.93x over four representative existing implementations. The performance can satisfy the high computational requirements of big data applications. Future work includes extending the proposed architecture to multi-FPGA implementations and other data mining algorithms.
