Sensor systems for machine condition monitoring face many challenges in a world of digitized data, with increasing emphasis being placed on high performance computing systems for feature extraction. These systems must be robust, reliable and economically viable before they will be adopted by industry. This paper describes a condition monitoring platform that has been developed using a field programmable gate array (FPGA) for real-time signal processing in a complex grinding process. An architecture that can sample 16 channels at 12.8kHz and perform 1024 bin FFTs on a National Instruments CompactRIO with a Xilinx Virtex 5 LX50 FPGA is described. FPGA resource utilization figures for multiple configurations of this FFT are reported. The FPGA also computes an exponentially time-weighted RMS on 10 acceleration channels and samples four quadrature encoded axis channels in real time. Results are displayed as conditioned data on a machine operator HMI for machine and process evaluation. The research demonstrates how a multisensor, multiprocessing platform approach can be realized and implemented in industry for future high-end condition and process monitoring applications.
Introduction
Acquiring information from a myriad of electronic devices, and the associated concept of big data, have become commonplace in this increasingly technology-pervasive world. Condition monitoring for metal cutting machines is no different: sensor systems are playing an ever more important role in high performance machining.
A recent comprehensive review of the monitoring of machining operations [1] discusses many of the available sensor types, among them force, torque, vibration, acoustic emission (AE), motor power and integrated tool temperature devices. Signal processing techniques for extracting information are essential if these big data are to make sense. The review outlines many of the signal features (SF) that have been applied to the relevant machining operations, ranging from common time domain measures such as root mean square (RMS) and statistical moments to more advanced techniques such as linear regression and moving average modelling, principal component analysis (PCA), singular spectrum analysis (SSA) and permutation entropy $H_p$. Frequency domain processing such as the well-established fast Fourier transform (FFT), and time-frequency domain techniques such as the discrete wavelet transform (DWT) and the Hilbert-Huang transform (HHT), are also presented.
Reconfigurable and flexible machining [2, 3] is becoming a necessity for modern manufacturing systems. Likewise, sensor systems must be flexible and reconfigurable for the many different and complex processes encountered. The extracted signal features may have very different characteristics for these varying processes, so the signal processing must be highly flexible and able to handle an increasing volume of data in a manageable time frame. One such technology, which has matured over the last decade, is the field programmable gate array (FPGA), which has been applied in machine tool control mainly in the area of servo motor positioning [4]. FPGAs are integrated circuits that may be reconfigured in the field to perform tasks that would normally have been performed by dedicated application specific integrated circuits (ASIC). This technology is now being applied to the monitoring of machining processes [5, 6], where a system on chip (SoC) has been developed to acquire and process multi-channel vibration signals and to apply feature extraction techniques based on the FFT, DWT and short time Fourier transform (STFT).

© 2014 Published by Elsevier B.V. Open access under CC BY-NC-ND license. Selection and peer-review under responsibility of the International Scientific Committee of the 6th CIRP International Conference on High Performance Cutting
However, the expertise required to develop FPGA systems is notoriously specialised, and the development itself is very labour intensive. This problem is being addressed, most notably, by National Instruments [7], whose graphical programming language, LabVIEW, is used to program their data acquisition and control instrumentation. The language lends itself well to the parallel nature of FPGAs and has been deployed on their range of reconfigurable input/output (RIO) technology. A number of researchers [8, 9] have reported on this technology for signal acquisition and power spectrum processing.
The work described in this paper employed a CompactRIO system, comprising an NI cRIO-9022 controller and a cRIO-9118 chassis with a Xilinx Virtex 5 LX50 FPGA, for data acquisition and real-time signal processing on up to 16 channels of data sampled at 12.8kHz. An exponentially time-weighted RMS and quadrature decoding on four axis channels were performed. The architecture for a 16 channel, 1024 bin FFT analyser is also discussed, and FPGA resource figures are presented for a range of configurations.
Grinding case study
The monitoring of complex grinding processes requires the collation of multiple sensor inputs in order to infer machine and process condition and, ultimately, workpiece quality. Within this research, the case study selected was a complex grinding process: experimental investigations were undertaken into the precision grinding of cylinders at an industrial printing machine manufacturing plant. In this case study, surface quality was the fundamental criterion, and the primary cause of surface defects has been attributed to undesirable vibration between the tool and workpiece [10]. This phenomenon is referred to as chatter, and the marks it leaves on the cylinder are normally referred to as chatter marks. The mechanisms of chatter are well known and have been described [11]; it is sufficient to say here that undesirable machine vibrations of one kind or another may lead to chatter, and detecting these vibrations before they reach a level where damage occurs is vital.
Workpiece, grinding machine and process
The cylinder workpieces were manufactured from heat treated CK45 steel; each was 180mm in diameter and 535mm in length, and weighed 70kg. The profile of a cylinder workpiece is illustrated in Fig. 1; it consists of a concentric surface terminated by two off-centre arced surfaces, which form a transition into the channel opening along the cylinder. The machine under investigation, illustrated in Fig. 2(a), was a Schaudt PF61 cylindrical grinder with three major axes of motion. The two linear motion axes, illustrated in Fig. 2(b), were the X infeed axis and the Z cross-feed axis. The C axis, illustrated in Fig. 2(c), was the workpiece rotational axis.
Three well-established categories of grinding process [12] were involved in machining the cylinder workpiece, together with a number of intermediate grinding wheel dressing and balancing operations. The initial process comprised a number of plunge grinding operations along the length of the workpiece, where the objective was primarily to centre the cylinder. The second process involved a number of rough traverse grinding passes, in which the cylinder surface was refined so that the cylinder dimensions were consistent along its full length. The final machining process comprised a number of polish traverses of the cylinder, in which the grinding wheel cutting speed and Z-axis feed rate were reduced to produce a blemish-free cylinder surface at the specified cylinder dimensions.
As is apparent, this grinding operation was complex and depended upon many varying parameters, including multiple feed rates on all axes, varying depths of cut and an asymmetric cylindrical surface. A state- and position-based monitoring approach was therefore pursued.
CompactRIO platform
The hardware for the condition monitoring system is illustrated in Fig. 3 and used a National Instruments CompactRIO with C-Series data acquisition modules. The module chassis was an 8-slot NI cRIO-9118 with a Xilinx Virtex 5 LX50 FPGA. The embedded controller was a NI cRIO-9022 with a 533 MHz PowerPC processor, 256MB DRAM and 2GB of Flash storage. Two Ethernet ports and one USB port were also provided. The human machine interface (HMI) was an Advantech PPC-L157T flat panel touch screen PC with an Intel Atom N270 1.6GHz processor and 2GB DDR2 SDRAM. This was connected to the embedded controller over the Ethernet link and provided all machine operator input/output.
Vibration sensors were employed throughout the system, the locations of which are illustrated in Fig. 2(a) . The accelerometers used were a combination of three tri-axial Kistler model types 8762A5A/50A and a single axis Kistler type 8141A. A National Instruments simultaneous sampling NI 9234 analogue to digital converter module provided direct signal conditioning for four accelerometer channels while two Kistler 5134B1 Piezotron couplers were used for conditioning the remaining six accelerometer channels. An Artis MU-3 active power monitor was also installed to measure spindle power. A NI 9205 16 channel multiplexed analogue to digital converter was used to acquire any remaining analogue signals.
Two NI 9411 digital input modules were used to interface to the quadrature encoded signals from the Gemac interpolators. Each module had six differential input channels, subdivided into two sets of three channels dedicated to quadrature encoded signals: phase A, phase B and an index signal. Each of the two modules could therefore accept two quadrature encoded axes at a 2MHz sample rate.
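The modules deliver only the raw phase A and phase B levels; converting these into an axis position was part of the FPGA processing. As an illustration, a conventional ×4 quadrature decoder can be modelled in software with a lookup table indexed by the previous and current (A, B) states. This is a sketch under stated assumptions, not the LabVIEW FPGA code used in this work: the function name `decode_quadrature`, the choice of "A leads B" as the positive direction and the treatment of double transitions as errors are all hypothetical, and index-pulse handling is omitted.

```python
# Hypothetical x4 quadrature decoder model (not the paper's LabVIEW FPGA code).
# A state is encoded as (A << 1) | B. In this sketch "A leads B" counts up:
# (A, B) = 00 -> 10 -> 11 -> 01 -> 00 gives +1 per transition.

# Indexed by (previous_state << 2) | current_state:
# +1 / -1 = one count forward / backward, 0 = no change or invalid jump.
_TRANSITION = [
     0, -1, +1,  0,   # previous 00 -> 00, 01, 10, 11
    +1,  0,  0, -1,   # previous 01 -> 00, 01, 10, 11
    -1,  0,  0, +1,   # previous 10 -> 00, 01, 10, 11
     0, +1, -1,  0,   # previous 11 -> 00, 01, 10, 11
]


def decode_quadrature(samples_a, samples_b):
    """Return (position_count, error_count) for sampled phase A/B levels (0 or 1)."""
    if not samples_a:
        return 0, 0
    position = 0
    errors = 0
    prev = (samples_a[0] << 1) | samples_b[0]
    for a, b in zip(samples_a[1:], samples_b[1:]):
        cur = (a << 1) | b
        if prev ^ cur == 0b11:      # both phases changed between samples:
            errors += 1             # direction is ambiguous (sampling too slow)
        position += _TRANSITION[(prev << 2) | cur]
        prev = cur
    return position, errors


# One full forward cycle (A leads B) and one full reverse cycle.
print(decode_quadrature([0, 1, 1, 0, 0], [0, 0, 1, 1, 0]))
print(decode_quadrature([0, 0, 1, 1, 0], [0, 1, 1, 0, 0]))
```

Sampling at 2MHz, well above the expected edge rate of the encoder signals, is what makes the "both phases changed at once" case rare enough to be flagged as an error rather than guessed.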
FPGA signal processing
The FPGA on the cRIO-9118 chassis controlled all aspects of the signal acquisition modules, as it interfaces directly with each module over a separate serial peripheral interface (SPI) bus. The FPGA also performed real-time signal processing on the acquired data; the architecture of the FPGA implementation is illustrated in Fig. 4. A real-time RMS calculation was applied to each analogue channel on a sample-by-sample basis, and a 1024 bin FFT was applied in various channel configurations. The processed data, together with all raw data, were then transferred to the embedded PowerPC controller over the peripheral component interconnect (PCI) bus using two direct memory access (DMA) channels.
[Fig. 4: FPGA architecture; labelled blocks include the PowerPC Embedded Controller, PCI Bus and Hanning Window]
RMS calculation
Root mean square (RMS) is probably the most common signal processing technique used in measurement systems and is expressed by eqn (1):

$$V_{rms} = \sqrt{\frac{1}{T}\int_{0}^{T} V(t)^{2}\,dt} \qquad (1)$$

where $V_{rms}$ is the RMS value, $T$ is the duration of measurement, and $V(t)$ is the instantaneous voltage, a function of time but not necessarily periodic.
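For sampled data, the integral in eqn (1) is commonly approximated by the mean of the squared samples over the measurement window. The sketch below shows this plain, unweighted form for comparison with the exponentially weighted form described next; the helper name `block_rms` and the 100 Hz test tone are illustrative choices, not part of the system described here.

```python
import math


def block_rms(samples):
    """Discrete form of eqn (1): square root of the mean of the squared samples."""
    if not samples:
        raise ValueError("need at least one sample")
    return math.sqrt(sum(v * v for v in samples) / len(samples))


fs = 12_800.0      # sample rate used in this work (Hz)
f = 100.0          # arbitrary test tone (Hz)
n = int(fs)        # one second of data = 100 whole periods of the tone
sine = [math.sin(2.0 * math.pi * f * i / fs) for i in range(n)]
print(round(block_rms(sine), 4))  # prints 0.7071, i.e. 1/sqrt(2) for a 1 V sine
```

Because every sample in the window carries equal weight, this form needs the whole window in memory; the exponentially weighted version avoids that by keeping only a running filter state.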
Many sensor conditioning units have a DC signal output representing the RMS of the raw sensor signal. For example, the Kistler 5127 Piezotron coupler for accelerometer conditioning has an RMS output whose time constant is user configured using modular plug-in capacitors. On investigation, the integrated circuit used for this function was found to be from Analog Devices. The operation of this device may be illustrated in Fig. 2 using an explicit method for finding the RMS, although the actual implementation uses a more subtle implicit method. For the purposes here, however, it is relevant to note the use of the capacitor-resistor pair, which forms the exponentially weighted averaging [13]. This averaging circuit is effectively a single pole low pass filter and provides a means of adjusting the time constant of the RMS calculation. Strictly, eqn (1) should include a term to account for this exponential weighting, but it is sufficient to say here that the integration of the signal over the observed time will be a single pole filter, as shown in Fig. 5. This filter has a characteristic time constant $\tau = RC$, where $R$ is the resistance and $C$ the capacitance. The cutoff frequency $F_c$ of this single pole filter is given by:

$$F_c = \frac{1}{2\pi RC} \qquad (2)$$

This single pole filter may be implemented in a digital system as an infinite impulse response (IIR) filter using the difference equation in eqn (3), for the discrete input signal $x_i(t)$ and its output $y_i(t)$:

$$a_0\,y_i(t) = b_0\,x_i(t) + b_1\,x_i(t-1) - a_1\,y_i(t-1) \qquad (3)$$
where the coefficients $a_0$, $a_1$, $b_0$ and $b_1$ are calculated for a cutoff frequency corresponding to the desired time constant $RC$, as in eqn (2). The coefficients were calculated using Matlab and are shown in Fig. 6 for a time constant of 100ms and a sample rate of 12.8kHz. It can be seen from the values of the coefficients $a$ and $b$ in Fig. 6 that $a_0$ is unity and $b_0 = b_1$, so the calculation in eqn (3) may be reduced to that in eqn (4):

$$y_i(t) = b_0\left[x_i(t) + x_i(t-1)\right] - a_1\,y_i(t-1) \qquad (4)$$

In order to calculate the RMS, each digital sample must first be squared; the filter is then applied to the squared samples as in eqn (4), which consists of two multiplies and effectively two additions, and is followed by a square root. In the monitoring system there were 10 acceleration channels, and rather than duplicate the RMS function 10 times it was possible to have only one instance of the function, because the FPGA could perform the calculation for all channels well within the 78μs available between samples at 12.8kHz. However, the state of the filter must be stored for each channel, which requires some register memory to hold the previous values of the filter's input and output.
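The procedure above (square, filter with eqn (4), square root), with one shared routine and a separately stored filter state per channel, can be modelled in floating point as follows. This is a sketch under stated assumptions, not the fixed-point LabVIEW code: the paper only says the coefficients came from Matlab, so the sketch assumes a first-order Butterworth design by the bilinear transform with prewarping (the method Matlab's `butter` uses), and the names `single_pole_coefficients` and `SharedExpRms` are hypothetical.

```python
import math


def single_pole_coefficients(tau, fs):
    """First-order low-pass coefficients (b0, b1, a0, a1) for time constant tau.

    Assumption: bilinear transform with prewarping at the eqn (2) cutoff,
    as produced by a first-order Butterworth design such as Matlab's butter(1, Wn).
    """
    fc = 1.0 / (2.0 * math.pi * tau)        # eqn (2): Fc = 1 / (2*pi*RC)
    k = math.tan(math.pi * fc / fs)         # prewarped cutoff
    b0 = b1 = k / (1.0 + k)                 # b0 = b1, as observed in Fig. 6
    a0, a1 = 1.0, (k - 1.0) / (k + 1.0)     # a0 = 1
    return b0, b1, a0, a1


class SharedExpRms:
    """One RMS routine time-shared by all channels (hypothetical model).

    The per-channel lists hold each channel's filter state between samples.
    """

    def __init__(self, n_channels, tau=0.100, fs=12_800.0):
        self.b0, _, _, self.a1 = single_pole_coefficients(tau, fs)
        self.x_prev = [0.0] * n_channels    # previous squared sample per channel
        self.y_prev = [0.0] * n_channels    # previous filter output per channel

    def update(self, samples):
        """Take one new sample per channel and return each channel's RMS."""
        rms = []
        for ch, v in enumerate(samples):    # same computation reused for every channel
            x = v * v                                                          # square
            y = self.b0 * (x + self.x_prev[ch]) - self.a1 * self.y_prev[ch]   # eqn (4)
            self.x_prev[ch], self.y_prev[ch] = x, y
            rms.append(math.sqrt(max(y, 0.0)))                                 # square root
        return rms


# Usage: 10 channels; channel ch carries a 100 Hz sine of amplitude ch + 1.
fs = 12_800.0
b0, b1, a0, a1 = single_pole_coefficients(0.100, fs)
engine = SharedExpRms(n_channels=10)
for n in range(int(fs)):                    # 1 s of data = 10 time constants
    s = math.sin(2.0 * math.pi * 100.0 * n / fs)
    out = engine.update([(ch + 1) * s for ch in range(10)])
print([round(r, 2) for r in out])           # each close to (ch + 1) / sqrt(2)
```

With $\tau = 100$ms and 12.8kHz this design gives $b_0 = b_1 \approx 3.905\times10^{-4}$ and $a_1 \approx -0.99922$; if the authors used a different design routine, the values in Fig. 6 may differ slightly.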
A screenshot of the RMS LabVIEW code block is shown in Fig. 7; it is implemented inside a single-cycle timed loop. At a base frequency of 40MHz, this means that the filter calculation is performed within 25ns. The square root also takes 25ns, so for 10 channels the total time taken is approximately 0.5μs, which is over two orders of magnitude faster than the 78μs limit for real-time operation. The feedback nodes store the filter state for each channel. The real-time RMS data can be displayed on the front panel of the HMI for operator inspection. Fig. 8 shows an example screenshot of the three-axis acceleration response at the spindle headstock during part of a plunge grinding operation.
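The timing figures above can be checked with simple arithmetic. The sketch assumes, as the text implies, one 25ns clock cycle for the filter and one for the square root per channel, executed sequentially for all 10 channels.

```python
# Timing budget for the shared RMS routine (figures taken from the text).
f_clk = 40e6                 # FPGA base clock (Hz)
fs = 12.8e3                  # sample rate per channel (Hz)
channels = 10                # acceleration channels
cycles_per_channel = 2       # assumed: 1 cycle for the filter, 1 for the square root

t_cycle = 1.0 / f_clk                                  # 25 ns per single-cycle loop
t_total = channels * cycles_per_channel * t_cycle      # 0.5 us for all channels
t_budget = 1.0 / fs                                    # 78.125 us between samples
margin = t_budget / t_total                            # about 156x

print(f"{t_total * 1e6:.2f} us used of {t_budget * 1e6:.3f} us; margin {margin:.0f}x")
```

A margin of roughly 156× is what justifies a single shared routine: even a far larger channel count would fit within one sample period at this clock rate.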
FFT Implementation
LabVIEW FPGA 2011 contains an express virtual instrument (VI) for the FFT that may easily be configured, for a range of frequency resolutions (numbers of bins), for implementation as a hardware description language (HDL) block. However, each implementation of this FFT requires significant FPGA logic, and for multiple channels it would quickly consume all of the FPGA's resources. To conserve these resources, it is possible to have a single instance of the FFT, buffer the data for each channel and then feed the blocks of data through the FFT sequentially. The code is based on [14]; the FFT takes only real data as input, while both real and imaginary data are output.
At a sample rate of 12.8kHz, all operations must be completed within 78μs, as established above. A single 1024 bin FFT, for example, takes one clock cycle to output each data point, and the latency before valid data are available is twice the number of bins in clock cycles. The FFT must operate on a complete block equal in size to the number of bins, in this case 1024 data points from the same data stream. This means that while these data are being clocked through for one channel, data from all the other channels must be stored for subsequent processing. A memory block of at least 1024 samples per channel is therefore required, and the time $t_{FFT}$ taken to process all the data stored in memory is given by eqn (5):

$$t_{FFT} = \frac{N_{ch}\,N_{bins}}{f_{clk}} \qquad (5)$$

where $N_{ch}$ is the number of channels, $N_{bins}$ the number of FFT bins and $f_{clk}$ the FPGA base clock rate. For 1024 bins, 4 channels and a base clock rate of 40MHz, the time taken to run the FFT on all data is 102μs. For real-time operation, all the data would need to be processed within 78μs, before the next sample is acquired; since 102μs exceeds this, the block RAM needs to hold at least 1024 + 1 samples, otherwise data will be overwritten. Conveniently, the block RAM allocation is always $2^n + 5$, where $n$ is a positive integer. From eqn (5), for 4096 bins and 4 channels the time required to clock all the data through the FFT is 409μs. Considering the extra 5 FIFO elements, the time available between FFT calculations before data are overwritten is 6 × 78μs ≈ 468μs. Therefore, for the selected FPGA base clock of 40MHz and sample rate of 12.8kHz, 4096 is the maximum number of bins achievable with 4 channels without additional memory or a faster clock. An example implementation for four channels and 1024 FFT bins is illustrated in Fig. 9. An optional Hanning window is first applied to all channel data, represented as fixed point with a word length of 24 bits and an integer word length of 4 bits. The data are then converted to an 18 bit word length, as this is the base size for the multiply accumulate (MAC) blocks.
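The sizing argument above can be captured in a few lines, which makes it easy to test other bin and channel combinations. This sketch follows the text's own reasoning (the 5 spare FIFO elements plus the current sample give 6 sample periods of headroom) and, like eqn (5), ignores the FFT's initial latency; the function names are hypothetical.

```python
F_CLK = 40e6        # FPGA base clock (Hz)
FS = 12.8e3         # sample rate per channel (Hz)
T_S = 1.0 / FS      # 78.125 us between samples


def t_fft(n_bins, n_channels, f_clk=F_CLK):
    """Eqn (5): time to clock every channel's block through one shared FFT."""
    return n_bins * n_channels / f_clk


def fifo_depth(n_bins):
    """Block RAM FIFO allocation of 2^n + 5 elements (n_bins assumed a power of two)."""
    return n_bins + 5


def fits_in_real_time(n_bins, n_channels):
    """True if one FFT pass ends before the spare FIFO elements are used up.

    Follows the text: 5 spare elements plus the current sample give
    6 sample periods (about 468 us) of headroom.
    """
    spare = fifo_depth(n_bins) - n_bins
    return t_fft(n_bins, n_channels) <= (spare + 1) * T_S


for bins, ch in [(1024, 4), (4096, 4), (8192, 4), (1024, 16)]:
    print(bins, ch, f"{t_fft(bins, ch) * 1e6:.1f} us", fits_in_real_time(bins, ch))
```

Because eqn (5) depends only on the product of bins and channels, 4096 bins on 4 channels and 1024 bins on 16 channels take the same 409.6μs, which is consistent with the 16 channel, 1024 bin configuration also being feasible.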
Data are then placed in the channel-specific FIFOs and successively clocked into the FFT. As outlined, 1029 elements will be allocated for each of the four channel FIFOs. As valid data are clocked out of the FFT, they are converted to single precision floating point and placed in a transmit FIFO, from which a DMA engine transfers them over the PCI bus to the embedded controller. The output FIFO is 1029 elements in size, and with a PCI bus speed of 50MHz it is possible to transfer all the data to the embedded controller faster than the FFT generates them. However, transfer delays may be such that this FIFO needs to be larger.
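A behavioural model of this scheme, in which each channel's block is windowed, converted to 18 bit fixed point and then passed through a single FFT one channel at a time, is sketched below. It is a software model under stated assumptions, not the LabVIEW FFT express VI: the periodic Hanning definition, the 4 bit integer word length after conversion, round-half-to-even quantisation with saturation, the recursive radix-2 FFT and all function names are hypothetical choices for illustration.

```python
import cmath
import math


def hanning(n):
    """Periodic Hanning window, w[i] = 0.5 * (1 - cos(2*pi*i/n)) (assumed form)."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * i / n)) for i in range(n)]


def to_fixed(v, word=18, integer=4):
    """Quantise to signed fixed point <word, integer> with saturation.

    Uses Python's round() (round half to even), an assumption.
    """
    lsb = 2.0 ** -(word - integer)
    lo, hi = -2.0 ** (integer - 1), 2.0 ** (integer - 1) - lsb
    return min(max(round(v / lsb) * lsb, lo), hi)


def fft(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out


def multiplexed_fft(channel_buffers, window=True):
    """Feed each channel's complete block through one shared FFT, in turn."""
    n = len(channel_buffers[0])
    w = hanning(n) if window else [1.0] * n
    spectra = []
    for block in channel_buffers:            # sequential use of a single FFT
        fixed = [to_fixed(v * wi) for v, wi in zip(block, w)]
        spectra.append(fft(fixed))           # real input, complex output
    return spectra


# Usage: four channels, one test tone each, 1024 bins at 12.8 kHz (12.5 Hz per bin).
n_bins, fs = 1024, 12_800.0
tones = [100.0, 500.0, 1000.0, 2000.0]
buffers = [[math.sin(2 * math.pi * f * i / fs) for i in range(n_bins)] for f in tones]
spectra = multiplexed_fft(buffers)
peaks = [max(range(n_bins // 2), key=lambda k: abs(s[k])) for s in spectra]
print(peaks)  # [8, 40, 80, 160]
```

Each tone lands exactly on a bin (100 Hz / 12.5 Hz = 8, and so on), so the windowed peak falls on that bin; the complex output retains the phase information mentioned in the conclusions.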
FPGA resource utilization
Table 1 shows the resources used for a 1024 bin FFT as the channel count is doubled. An entry has also been generated for a compilation with 0 channels, which has no FFT, Hanning window or output FIFO. FPGA utilization figures using four channels were also compiled for a range of FFT bin numbers from 256 to 4096; the results for these configurations are given in Table 2. It can be seen from Table 1 that the 1024 bin configuration is repeated with different utilization figures. This would have been a result of using the recommended Xilinx compiler options, whereas the compilations for the bin length adjustment used a custom setting with high priority on area and high priority overall.
Between 17 and 18 block RAMs were consumed for bin lengths between 256 and 1024; this then grew quickly to 23 for 2048 bins and 34 for 4096 bins, the same as for a 16 channel, 1024 bin FFT. Device slice logic utilization changed little across all bin lengths, ranging from 78 to 80%.
Conclusions
The exponentially weighted RMS and quadrature decoding implemented on the FPGA function well but consume 65% of the FPGA slice resources. Some of this may be due to the overhead that LabVIEW FPGA imposes for controller communication. It should, however, be possible to reduce this utilization by employing an implicit calculation of the square root function, thereby reducing the required dynamic range of the fixed point numbers.
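One common formulation of implicit RMS computation, the principle behind RMS-to-DC converters of the Analog Devices type mentioned earlier, averages $x^2/y$ instead of $x^2$, so that at equilibrium $y = \overline{x^2/y}$, i.e. $y^2 = \overline{x^2}$. Intermediate values then scale with the signal level rather than its square, which is the dynamic range saving referred to above. The floating-point sketch below illustrates only the principle: it reuses the eqn (4) filter form with an assumed bilinear-transform design, and the function name, start value and small floor on $y$ are hypothetical. It replaces the square root with a division, so whether it is cheaper in FPGA logic is not established here.

```python
import math


def implicit_rms(samples, tau=0.100, fs=12_800.0, y_floor=1e-6):
    """Implicit RMS: filter x^2 / y instead of x^2, so no square root is needed.

    Same single-pole filter form as eqn (4); y settles where y^2 = mean(x^2).
    """
    k = math.tan(1.0 / (2.0 * tau * fs))      # assumed bilinear design, fc = 1/(2*pi*tau)
    b0 = k / (1.0 + k)
    a1 = (k - 1.0) / (k + 1.0)
    u_prev, y = 0.0, y_floor                  # start from a small positive value
    out = []
    for x in samples:
        u = x * x / max(y, y_floor)           # squarer/divider: scales like x, not x^2
        y = b0 * (u + u_prev) - a1 * y        # averaging filter, eqn (4) form
        u_prev = u
        out.append(y)
    return out


fs = 12_800.0
sig = [2.0 * math.sin(2.0 * math.pi * 100.0 * n / fs) for n in range(int(fs))]
est = implicit_rms(sig)
print(round(est[-1], 2))  # close to 2 / sqrt(2), about 1.41, after 1 s
```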
The FFT can be implemented for 16 channels with 1024 bins. Higher frequency resolution, using a bin length of 4096 on the four simultaneously sampled channels, may be an important option. Both the real and imaginary data are output from the FPGA, so phase information may be readily obtained by the condition monitoring platform.
The FPGA can be reconfigured for other signal feature extraction techniques, reducing the burden on the embedded controller, which must implement the cognitive rationale needed by a reliable and robust condition monitoring system. The approach has been demonstrated as a working solution for state- and position-oriented condition based analysis of grinding.
