Abstract-In this brief, we propose a low-complexity architectural implementation of the K-means-based clustering algorithm used widely in mobile health monitoring applications for unsupervised and supervised learning. The algorithm is iterative: it computes the distance of each data point from the respective centroids and re-forms the clusters until convergence, which presents a significant challenge in mapping it onto a low-power architecture. This has been addressed by the use of a 2-D Coordinate Rotation Digital Computer (CORDIC)-based low-complexity engine for computing the n-dimensional Euclidean distance involved during clustering. The proposed clustering engine was synthesized using the TSMC 130-nm technology library, and place and route was performed, following which the core area and power were estimated as 0.36 mm² and 9.21 mW at 100 MHz, respectively, making the design applicable for low-power real-time operations within a sensor node.
I. INTRODUCTION
The fundamental concept of cluster analysis is to form groups of similar objects as a means of distinguishing them from each other [1]. Clustering techniques have been successfully used in diverse fields involving multivariate data, such as medicine (EEG and activity recognition) and marketing, and can be conveniently deployed with limited resources (memory and CPU) [1]. The K-means clustering algorithm, owing to its computational simplicity and efficiency, has been an attractive choice for a wide variety of signal processing applications [1]. Cluster analysis is widely regarded in the research community as a tool primarily for unsupervised learning, where the class labels for the training data are not available. However, the K-means algorithm can also be used for supervised learning, where the class labels of the training data are known a priori [15]. Apart from its use as a learning algorithm, K-means has also been utilized for signal preprocessing, feature reduction, and time-domain signal analysis [2]. Hence, using K-means for real-time cluster analysis in resource-constrained sensor nodes for remote health care monitoring systems, where online multimodal data acquisition and analysis is the key (e.g., cardiovascular disease prognosis), requires an effective implementation strategy. The fundamental requirements for such applications are to cut down on continuous transmission and to operate at low power so as to prolong battery life. Hence, a holistic algorithm-to-architecture mapping is required to sustain low-power operation over long durations. The K-means algorithm is iterative: it computes the distance of each data sample from the centroids until convergence.
This is generally achieved by the use of power-hungry multipliers, square rooters (for Euclidean distance computation), and multiplexers [3]-[9], thereby rendering direct mapping of this algorithm to architecture infeasible on resource-constrained platforms. In [10], an attempt was made to replace the Euclidean distance by a combination of Manhattan and Max distances, but at the cost of trading accuracy for power consumption. Therefore, an optimization is necessary to achieve a tradeoff between algorithm efficiency and architectural complexity. Coordinate Rotation Digital Computer (CORDIC)-based architectures, which exploit its different transcendental functions to compute complex arithmetic operations [11], [12], have been widely used for computationally intensive signal processing algorithms [11]-[14], which apply the K-means clustering algorithm. Hence, in this brief, we investigate the use of a CORDIC-based low-complexity engine to implement the K-means clustering algorithm.
The rest of this brief is organized as follows. Section II discusses the proposed methodology, Section III analyzes the hardware complexity, and Section IV concludes this brief.
II. PROPOSED METHODOLOGY
The K-means algorithm iterates to minimize the squared error between the empirical mean of a cluster and the individual data points, which is defined as the cost function. Initially, k centroids are defined and each data vector is assigned a cluster label according to the centroid it is closest to. The k centroids are then recalculated from the newly defined clusters, and the reassignment of each data vector to the nearest new centroid is repeated. The algorithm iterates over this loop until the data vectors form stable clusters and the cost function is minimized [2]. CORDIC is an efficient implementation technique for vector rotation and arctangent computation using shift-add operations. In this brief, we use CORDIC in vectoring mode for our implementation.
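The assign-and-update loop described above can be illustrated with a minimal software sketch. This is not the proposed hardware: the function names `euclidean` and `kmeans`, the convergence tolerance `tol`, and the rule of keeping an empty cluster's old centroid are assumptions made here for illustration only.

```python
import math


def euclidean(p, q):
    """Conventional Euclidean distance (squares, adds, one square root)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, centroids, max_iter=100, tol=1e-9):
    """Plain K-means: assign each point to its nearest centroid, then
    recompute centroids, until the centroids stop moving (assumed test)."""
    centroids = [tuple(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: label each point with the index of the closest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: euclidean(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = tuple(
                    sum(m[d] for m in members) / len(members)
                    for d in range(len(old))
                )
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # assumption: keep an empty cluster's centroid
        # Convergence check: largest centroid displacement in this iteration.
        shift = max(euclidean(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids, labels


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    print(kmeans(data, [(0.0, 0.0), (10.0, 10.0)]))
```

The `euclidean` helper is the part that the brief replaces with a CORDIC-based unit; the surrounding loop is unchanged by that substitution.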
The architecture of the proposed CORDIC-based K-means clustering engine is shown in Fig. 1. The input data are stored in memory and transmitted to the different blocks via a control unit (CU). The Euclidean distance is calculated in the distance unit using a low-complexity CORDIC vectoring module. The distances from each point to each of the centroids are sent to a comparator block to identify the cluster to which the point belongs. Once the clustering is done, the centroid calculation block is activated to compute the new centroids. If these new centroids differ significantly from their values in the previous iteration, clustering is repeated; otherwise, the clustered data are sent to the output. The CU governs the data flow among all the modules. The proposed engine utilizes CORDIC to compute the Euclidean distance between two points, which is the metric used to form the clusters; this is explained through an illustrative example.
In this brief, our focus is to propose a methodology for utilizing CORDIC to compute the Euclidean distance between two points, which is the metric used to form the clusters. In 2-D signal space, if $(x_1, x_2)$ and $(y_1, y_2)$ are two points, the Euclidean distance between these two points will be
$$\sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}.$$
One square-root, two squaring, one addition, and two subtraction operations are involved in this computation. If we give a and b as the x- and y-inputs to the vectoring mode CORDIC, the x-output will be the magnitude of the vector (a, b), which is $(a^2 + b^2)^{1/2}$. So, with $(x_1 - y_1)$ and $(x_2 - y_2)$ as the x- and y-inputs, respectively, the vectoring mode generates the distance between the two points. The architecture of the 2-D distance measurement unit using vectoring mode CORDIC is shown in Fig. 2(a). We can extend this methodology to n-dimensional (nD) signal space to formulate the distance between two nD vectors. Considering the case of 3-D signal space (n = 3), the distance between the two points $(x_1, x_2, x_3)$ and $(y_1, y_2, y_3)$ will be
$$\sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + (x_3 - y_3)^2}.$$
With $(x_1 - y_1)$ and $(x_2 - y_2)$ as the inputs to the vectoring mode, as in the 2-D case, the x-output of the vectoring mode CORDIC ($\mathrm{vec}_{l1x}$) is represented as
$$\mathrm{vec}_{l1x} = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}.$$
If we pass this x-output to the next CORDIC level as one input, with $(x_3 - y_3)$ as the second input, the x-output will be the desired 3-D distance
$$\sqrt{\mathrm{vec}_{l1x}^2 + (x_3 - y_3)^2} = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + (x_3 - y_3)^2}.$$
Fig. 2(b) shows the 3-D distance measurement unit. Since these two stages are executed sequentially, the same CORDIC unit can be reused at the expense of only two multiplexers, as shown in Fig. 2(c). Similarly, the distance between two nD vectors, $\sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$, can be calculated using (n − 1) CORDIC stages, as shown in Fig. 2(d).
For calculating the distance between two nD vectors, the inputs to the vectoring mode CORDIC block will be as follows: at the first level, the x- and y-inputs are $(x_1 - y_1)$ and $(x_2 - y_2)$; at each subsequent level i ($2 \le i \le n - 1$), the x-input is the x-output of level i − 1 and the y-input is $(x_{i+1} - y_{i+1})$. Thus, by recursively using the fundamental 2-D CORDIC unit, we can generalize it for computing the nD distance. As the calculations are done sequentially, the same CORDIC unit can be reused across the levels, at the expense of only one two-input multiplexer and one (n − 1)-input multiplexer at the inputs of the CORDIC unit, as shown in Fig. 2(d).
Multiplexers at the input and demultiplexers at the output of the CORDIC block have been used accordingly.
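The recursive use of a 2-D vectoring-mode CORDIC for nD distance can be sketched in software as follows. This is a floating-point behavioral model under stated assumptions, not the RTL: multiplication by `2.0 ** -i` stands in for the hardware right shift, the CORDIC gain is divided out explicitly (the text treats the x-output directly as the magnitude), and the names `cordic_gain`, `cordic_vectoring_magnitude`, and `cordic_distance` are hypothetical.

```python
import math


def cordic_gain(stages):
    """Accumulated CORDIC scale factor after `stages` microrotations."""
    gain = 1.0
    for i in range(stages):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return gain


def cordic_vectoring_magnitude(a, b, stages=16):
    """Vectoring mode: rotate (a, b) toward the x-axis so that the
    x-output converges to gain * sqrt(a^2 + b^2)."""
    # The magnitude does not depend on signs; abs() keeps x >= 0,
    # which is inside the CORDIC convergence range.
    x, y = abs(a), abs(b)
    for i in range(stages):
        shift = 2.0 ** -i  # models a right shift by i bits
        if y >= 0:
            x, y = x + y * shift, y - x * shift
        else:
            x, y = x - y * shift, y + x * shift
    return x / cordic_gain(stages)  # gain compensation (assumed to be done somewhere)


def cordic_distance(p, q, stages=16):
    """nD distance using (n - 1) sequential passes through the 2-D unit."""
    diffs = [pi - qi for pi, qi in zip(p, q)]
    acc = diffs[0]
    for d in diffs[1:]:
        # x-input: previous level's x-output; y-input: next coordinate difference.
        acc = cordic_vectoring_magnitude(acc, d, stages)
    return abs(acc)


if __name__ == "__main__":
    print(cordic_distance((1.0, 2.0, 3.0), (4.0, 6.0, 3.0)))  # close to 5
    print(math.dist((1.0, 2.0, 2.0), (0.0, 0.0, 0.0)),
          cordic_distance((1.0, 2.0, 2.0), (0.0, 0.0, 0.0)))
```

Because each level only consumes one new coordinate difference, the loop in `cordic_distance` mirrors the reuse of a single CORDIC unit with input multiplexers.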
III. RESULTS, ANALYSIS, AND DISCUSSION

A. Methodology Validation
The proposed architecture was coded in the Verilog hardware description language and functionally verified on a set of 24 data sets, each having 60 samples of kinematic data collected from a wrist-worn triaxial accelerometer measuring human arm movements. This data set was chosen in view of the popularity of K-means for analyzing human movement in daily living scenarios using inertial sensors [15]. Here, we have verified the design for a number of clusters (k) ranging from 1 to 16 and an input data dimension (n) of 3. It is important to note here that although the input data are 16 bits wide, the width of the data path in the CORDIC unit is 22 bits. In order to achieve the desired N-bit accuracy, the word length should be selected according to the formulation $N + \log_2 N + 2$, with at least N iterations; for N = 16, this gives 16 + 4 + 2 = 22 bits and at least 16 iterations. Therefore, to obtain a high accuracy, a 22-bit CORDIC was used for this implementation. The output was validated against a MATLAB model of the K-means algorithm with the same initial seed values. The results agree in both the predicted number of clusters and the required number of iterations. Moreover, we achieve precision up to eight decimal places using a 16-stage vectoring mode CORDIC with input magnitudes ranging from 1 to $10^{15}$. Fig. 3(a) shows the resulting clustered data from the MATLAB inbuilt K-means function, and Fig. 3(b) shows the RTL output, which obtains 100% accuracy, exhibiting a robust system. A detailed error analysis is provided in Section III-C.
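The word-length rule quoted above is simple enough to check numerically. The sketch below assumes that the logarithm is rounded up when N is not a power of two; that rounding, and the function name `cordic_word_length`, are assumptions and are not stated in the text.

```python
import math


def cordic_word_length(n_bits):
    """Data-path width for n_bits of accuracy: N + log2(N) + 2.

    Assumption: log2(N) is rounded up for N that is not a power of two.
    """
    return n_bits + math.ceil(math.log2(n_bits)) + 2


if __name__ == "__main__":
    # 16-bit accuracy -> 16 + 4 + 2 = 22-bit data path, as used in the brief.
    print(cordic_word_length(16))
```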
B. Hardware Complexity Analysis
Throughout the hardware complexity analysis, we keep a generalized view of the word length b and follow a procedure similar to that used in [11]. Since the distance computation in K-means clustering using CORDIC is an iterative procedure, we consider only a single iteration, because the same hardware can be reused for the next iterations as well as for the successive stages of CORDIC in vectoring mode for dimensions higher than 2.
Computation of the distance between two nD points $(x_1, x_2, x_3, \ldots, x_n)$ and $(y_1, y_2, y_3, \ldots, y_n)$ using the conventional method, i.e., $((x_1 - y_1)^2 + \cdots + (x_n - y_n)^2)^{1/2}$, requires n squaring operations, (n − 1) addition operations, and one square-root operation. Since this distance is only used for comparison, the absolute value of the distance is not required and hence, the square-root operation can be omitted without compromising on accuracy. To provide a comparison on a uniform platform, we consider only the ripple carry adder (RCA) and the conventional array multiplier (CAM) as the means of implementing the arithmetic operations. One b-bit RCA requires b full adders (FAs) (in a simplified view) [11], and a $b \times b$ CAM requires $b(b - 2)$ FAs plus b half adders (HAs) and $b^2$ AND gates [11]. Similarly, one b-bit square root needs $0.125(b + 6)b$ FAs and XOR gates [11]. In addition, considering that one FA cell requires 24 transistors, one HA cell and one two-input XOR gate each consist of 12 transistors, and a two-input AND gate consists of six transistors [11], we can calculate $TC_A = 24b$ and $TC_M = 24b(b - 2) + 12b + 6b^2 = 6b(5b - 6)$, where $TC_A$ and $TC_M$ are the transistor counts (TCs) for the RCA and CAM, respectively. Following the same procedure used in [11], the savings in terms of arithmetic operations for distance computation in different dimensions without using CORDIC are computed. For nD distance computation, (n − 1) RCAs and n CAMs are required. The transistors used here will not be required when the proposed CORDIC-based engine is used for K-means clustering, since we are reusing the CORDIC unit. Therefore, the total TC computed here will be the transistor saving (TS), given by
$$TS_{nD} = (n - 1)\,TC_A + n\,TC_M = 24b(n - 1) + 6nb(5b - 6).$$
Expressing $TS_{nD}$ in terms of the total number of transistors saved and normalizing with respect to b, a metric TS per word length (TSPW) can be computed following the approach presented in [16].
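The transistor counts above can be tabulated with a short script. It is a sketch under the assumptions stated in the text (24 transistors per FA, 12 per HA and per two-input XOR, 6 per two-input AND) and takes the saving for nD distance as (n − 1) RCAs plus n CAMs; the function names are hypothetical, and TSPW is taken here simply as the saving divided by b.

```python
# Per-cell transistor counts quoted in the text (from [11]).
FA_T, HA_T, AND_T = 24, 12, 6


def tc_rca(b):
    """b-bit ripple carry adder: b full adders."""
    return FA_T * b


def tc_cam(b):
    """b x b conventional array multiplier: b(b-2) FAs, b HAs, b^2 AND gates."""
    return FA_T * b * (b - 2) + HA_T * b + AND_T * b * b


def ts_nd(n, b):
    """Transistor saving for nD distance: (n - 1) RCAs and n CAMs avoided."""
    return (n - 1) * tc_rca(b) + n * tc_cam(b)


def tspw(n, b):
    """Transistor saving per word length (assumed here to be TS / b)."""
    return ts_nd(n, b) / b


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(n, ts_nd(n, 16), tspw(n, 16))
```

The closed form $6b(5b - 6)$ for the multiplier follows directly from `tc_cam`, which is a convenient cross-check on the per-cell assumptions.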
C. Error Analysis
Accuracy is determined by comparing the outputs of the proposed methodology with the MATLAB inbuilt "K-means" function. In the proposed technique, CORDIC is used for calculating the Euclidean distance. The accuracy depends on the input dimension (D) and the number of microrotations, which is equal to the number of stages (n) of CORDIC. A set of 1000 randomly generated signals is taken as input, and the mean absolute percentage error (MAPE) is calculated for various numbers of CORDIC stages and different input dimensions. Fig. 6 shows the variation of MAPE for the proposed architecture with dimensions ranging from 2-D to 6-D, with CORDIC stages n = 8, 12, 16, and 20, and with word lengths b = 4, 8, 16, and 32. For a given dimension, MAPE is of the order of $10^{-2}$ with n = 8 CORDIC stages, while with n = 16, MAPE decreases to the order of $10^{-4}$. From Fig. 6, it is evident that as the number of CORDIC stages increases, MAPE decreases due to the increase in the resolution of CORDIC. Furthermore, for a fixed n, MAPE increases with the dimension D of the input, because subsequent dimensions are passed on to CORDIC in a cascaded fashion. In addition, from Fig. 6, it is observed that for a given dimension and number of CORDIC stages, MAPE does not change significantly with the word length b. It is important to note that although Fig. 6 presents results for data up to 6-D, the architecture can support up to 16-D input data. For a 16-stage CORDIC, all the stages are sequential and, at any instant of time, only one of these 16 stages works on a particular data point. Hence, 16 different data points can be processed using the 16 stages of CORDIC in any clock cycle. Moreover, this MAPE is the error in the Euclidean distance, which is used to cluster the data points. The final output is the coordinates of the cluster centroids and not the Euclidean distance itself, which ensures a robust output.
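The trend of MAPE against the number of CORDIC stages can be reproduced qualitatively with a floating-point model. This sketch does not model the fixed-point word length b, so its absolute values are not expected to match Fig. 6; the input range, the seed, and the function names are assumptions chosen for illustration.

```python
import math
import random


def cordic_mag(a, b, stages):
    """Gain-compensated vectoring-mode magnitude of (a, b), floating point."""
    x, y, gain = abs(a), abs(b), 1.0
    for i in range(stages):
        s = 2.0 ** -i  # models a right shift by i bits
        if y >= 0:
            x, y = x + y * s, y - x * s
        else:
            x, y = x - y * s, y + x * s
        gain *= math.sqrt(1.0 + s * s)
    return x / gain


def cordic_distance(diffs, stages):
    """Cascade the 2-D unit over the coordinate differences."""
    acc = diffs[0]
    for d in diffs[1:]:
        acc = cordic_mag(acc, d, stages)
    return abs(acc)


def mape(dim, stages, trials=1000, seed=0):
    """Mean absolute percentage error of the CORDIC distance versus sqrt."""
    rng = random.Random(seed)  # fixed seed (assumed) for repeatability
    total = 0.0
    for _ in range(trials):
        diffs = [rng.uniform(1.0, 100.0) for _ in range(dim)]  # assumed input range
        exact = math.sqrt(sum(d * d for d in diffs))
        total += abs(cordic_distance(diffs, stages) - exact) / exact
    return 100.0 * total / trials


if __name__ == "__main__":
    for stages in (8, 12, 16, 20):
        print(stages, [mape(dim, stages) for dim in range(2, 7)])
```

In this model the error comes only from the residual rotation angle, which shrinks as stages are added; word-length effects would need a fixed-point version of `cordic_mag`.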
D. Analysis of Computational Time
In the proposed architecture, an n-stage CORDIC (n = 16) takes n clock cycles to compute a single 2-D Euclidean distance and, therefore, achieves 100% throughput with 16 dimensions/cycle (see Table I), with a latency of n clock cycles (= 16), equal to the number of stages of the CORDIC operation.
For each cluster (k), distances from every cluster centroid to every data point have to be measured. To calculate an nD distance, (n − 1) stages of CORDIC are needed. Hence, each iteration takes $k \cdot num \cdot (D - 1) + 16$ clock cycles, where num is the number of data points, D is the dimensionality of the data, and k is the number of clusters. The variation of the required number of clock cycles with varying dimensions and number of clusters for a data set having 64 points is shown in Fig. 7. For clustering 64 data points (D = 3) into three clusters, a 16-stage CORDIC distance measuring unit takes 3 × 64 × 2 + 16 = 400 clock cycles. The proposed architecture at a 100-MHz operating frequency will therefore take 4 µs to complete a single iteration step. Here, the number of iterations depends on the initial seeds and the distribution of the data. Considering a worst case scenario of 1000 iterations, the total K-means operation for this example takes about 4 ms (1000 × 4 µs) using the proposed architecture, thereby achieving real-time standards. Fig. 8 presents the relationship between the surface power density (SPD, power per unit area) and a range of functionally verified input frequencies (1 to 360 MHz), obtained from Design Compiler (DC) synthesis using the TSMC 130-nm technology library. A third-order polynomial fit is used to describe this relationship over the selected frequency range, wherein we achieve 100% throughput using the shared CORDIC resource. The reported cost function ($\mathrm{SPD} = 8.9 \times 10^{-13} F^3 - 5.2 \times 10^{-10} F^2 + 1.8 \times 10^{-7} F + 1.2 \times 10^{-5}$) and the associated goodness-of-fit parameters, namely, the low values of the sum of squares due to error and the root-mean-square error, and the high value of the adjusted R-square, as mentioned in Fig. 8, indicate the best fit among the tested models.
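The cycle-count expression can be checked against the worked example with a few lines of code. The function names are hypothetical, and the 16-cycle term is taken as the CORDIC latency from the preceding subsection.

```python
def cycles_per_iteration(k, num, dim, stages=16):
    """Clock cycles for one K-means iteration: k * num * (D - 1) + latency."""
    return k * num * (dim - 1) + stages


def iteration_time_us(k, num, dim, f_mhz=100.0, stages=16):
    """Time for one iteration in microseconds at clock frequency f_mhz."""
    return cycles_per_iteration(k, num, dim, stages) / f_mhz


if __name__ == "__main__":
    # 64 three-dimensional points, three clusters, 16-stage CORDIC at 100 MHz.
    print(cycles_per_iteration(3, 64, 3))  # 400 cycles
    print(iteration_time_us(3, 64, 3))     # 4.0 microseconds
```

Sweeping `k` and `dim` with `num = 64` gives the kind of surface that Fig. 7 describes.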
E. Comparison With Other Architectures
The proposed architecture was synthesized using Synopsys Design Compiler, and place and route was performed using Synopsys IC Compiler with a 0.13-µm standard cell CMOS technology. The core area and power consumption of the proposed engine are 0.36 mm² and 9.21 mW at a 100-MHz frequency for VDD = 1.2 V. The engine consumes 62% less power, with a comparable area, with respect to the state-of-the-art architecture for ASIC implementation in [8] (the power reported is from backend simulation using SoC Encounter). A comparison of the area requirement and power consumption of the proposed engine with the state-of-the-art architectures is given in Table I. As different architectures use different technologies, the area and power values are normalized to the same technology node [17]. These results are provided to give an insight into the performance of the architecture based on the proposed methodology. The proposed CORDIC-based clustering engine compares favorably in terms of area with respect to the other reported architectures, as illustrated in Table I. It is to be noted that due to the unavailability of an appropriate memory module in our standard cell library, the architecture is implemented using registers. We believe that the use of an appropriate memory will significantly reduce the area and power consumption.
IV. CONCLUSION
In this brief, we propose a novel CORDIC-based clustering engine implementing the K-means algorithm, generalized to compute nD distance using a fundamental 2-D core by exploiting the recursive nature of the algorithm. The hardware analysis shows a minimal amount of transistor overhead, and the small area and low power consumption at a relatively high clock speed (i.e., 100 MHz) make it suitable for on-board sensor processing in pervasive healthcare applications. This engine can be suitably integrated onto a sensor platform in the form of a dedicated ASIC as the first step toward point-of-care diagnostics for applications involving the activities of daily living using inertial sensors, where clustering techniques are an integral part of unsupervised learning approaches on data sets with no corresponding class information.
