Abstract-This paper describes an image processing algorithm and an efficient hardware architecture for it. The architecture processes images of microelectrode arrays (MEAs) and micro-wells captured by a microscope camera in a dielectrophoresis (DEP)-based system, which also includes digital switches for turning the DEP force 'on' or 'off'. The images are processed to determine whether a neuron has entered any of the micro-wells, in which case the corresponding switch turns the DEP force 'off'. This process must run in real time to prevent more than one cell from being loaded into a micro-well. The proposed architecture has been implemented and tested on a Zynq SoC. Results show that the system can process one image in 9 ms, which meets the minimum real-time requirements of this DEP system.
I. INTRODUCTION
Microelectrode arrays (MEAs) are devices that provide bidirectional communication between in-vitro cultured neuronal networks or brain slices and instrumentation [1] [2] [3] [4]. They comprise an array of micrometre-scale electrodes printed on a substrate such as glass or silicon, onto which biological neuronal networks or slices are cultured. Each electrode in the array can be used to stimulate cells or a specific region of a brain slice and to record evoked extracellular activity.
Traditionally, MEAs have been used as biosensors for pharmacological screening [5] [6] [7] [8], because the electrophysiological behaviour of cells depends on the composition of the culture medium that bathes them. Such experiments usually involve randomly dispersing neurons on MEAs, culturing them on the device for a few days so that they grow processes and form networks, and recording their electrical activity before and after the composition of the culture medium is altered. However, placing neurons randomly on MEAs means that there is no control over the position of cells with respect to the electrodes. As a result, the quality of the recorded signals is degraded, and an electrode may record signals from several nearby cells, which requires spike sorting algorithms to determine the origin of each signal [9]. Moreover, because neurons move within the cultured network over time [10], even cells placed one by one on top of each MEA electrode can still change their position with respect to the electrodes. Due to these limitations, mapping the electrical behaviour of the neuronal network at the single-cell level is quite challenging.
To confine a single neuron within the area of each MEA electrode and form single-cell-per-electrode networks, three-dimensional microstructures have been used by several researchers [11] [12] [13]. These include micro-wells fabricated on top of each MEA electrode to confine the neurons loaded in them, and micro-trenches that connect adjacent micro-wells and allow the outgrowth of neural processes from one cell to its neighbours for the purpose of network formation.
Conventionally, a single neuron is loaded into each MEA micro-well using micropipettes guided by micromanipulators. This method is time-consuming and can load only a few micro-wells, because cells are left without incubation for several minutes during the loading process. Faster neuron loading has been investigated using dielectrophoresis (DEP) [14] [15] [16] [17], an electrokinetic force that can drive neurons towards the MEA electrodes and consequently into the micro-wells. In [18] we reported a DEP-based system for single-cell-per-electrode loading. The system, which is illustrated in Fig. 1, uses a set of digital switches for turning the DEP force 'on' or 'off' and a microscope camera for capturing images of the MEA and micro-wells. The images are then processed in MATLAB to determine whether a neuron has entered any of the micro-wells, in which case the switch that corresponds to that particular micro-well turns the DEP force 'off'. It should be noted that DEP is induced by a non-uniform electric field created between the MEA electrodes and an ITO counter-electrode.
Testing the system on MEAs with 16 micro-wells revealed that on average 43% of the wells were loaded with a single cell, 2% contained more than one cell, and the remaining 55% could not be assessed due to problems related to the fabrication of the wells. The low success rate was attributed, among other factors, to the relatively long time taken by the image processing. The system required 417 ms to capture and process each camera image and check for the presence of neurons in the wells, and during this time it was likely that more cells were loaded [18].
In pursuit of a shorter processing time, the PC/MATLAB part of the system was replaced by a hardware-based solution that performs the image processing task and controls the DEP switches. The new system brings the micro-well image processing time down to 9 ms. The remaining sections of this manuscript describe the details and the performance of the system. Section II describes the cell detection algorithm and the proposed architecture. The hardware implementation and results are presented in Section III. Section IV concludes the paper.
II. IMAGE PROCESSING SYSTEM
Images from the microscope in the DEP system are used to check for the presence of a cell inside each micro-well, and the corresponding DEP force is turned 'off' when a cell is detected, in order to prevent more than one cell from being trapped in one micro-well. The micro-wells are aligned with a mask image, which consists of 16 squares that overlay the image of the wells, as shown in Fig. 2.
[Figure annotations: 'A cell is not trapped in this micro-well'; 'A cell is trapped in this micro-well']
During single-cell positioning, the image processing system processes the captured image and analyses the pixels within the 16 regions of interest (i.e. the micro-well regions). If the pixel values for a specific region of interest exceed a specified threshold value, which indicates that a cell is positioned inside the micro-well, then the switch that corresponds to that region is opened in order to prevent more cells from being attracted to that micro-well.
A. Trapped Cell Detection Algorithm
The input image is first converted from RGB colour into a greyscale image, and the trapped cell detection algorithm then localises each micro-well. Since the planar microelectrode array and the camera are fixed, the position of each embedded micro-well in the captured images is constant. Let (x_1, y_1) denote the coordinates of the top-left corner of the first micro-well (i.e. the top-left one), and let g denote the horizontal or vertical gap distance between adjacent micro-wells. The coordinates of the top-left corner of the micro-well at the i-th row and j-th column of the mask image can then be calculated as follows:
where {i ∈ ℤ | 1 ≤ i ≤ 4} and {j ∈ ℤ | 1 ≤ j ≤ 4}. Once the top-left corner of a micro-well is located, all pixels within its rectangular region, from the top-left corner (x_i, y_j) to the bottom-right corner (x_i + a, y_j + b), are converted into binary pixels using a threshold t_1, where a and b are the height and width of the rectangular region. The binary pixels are then summed to give S_1, and a predefined threshold t_2 is used to decide whether the micro-well has trapped a cell: if S_1 ≥ t_2, a cell is trapped in the micro-well; otherwise, no cell has been trapped. In Fig. 3(b), red rectangles mark the micro-wells that have trapped a cell. The remaining micro-wells are not marked.
As soon as a cell is detected inside a micro-well, the DEP force is switched off for that micro-well, which prevents additional cells from being trapped. The processing time per image is therefore crucial to the accuracy of the entire system. Although the proposed image processing algorithm is not computationally intensive, the high resolution of the input images, the number of micro-wells per image and the real-time requirements of this application impose a maximum processing time of 25 ms per image, which could not be achieved using the software-based solution [18]. Acceleration of the image processing is thus necessary. One approach is a hardware-based solution, in which a customised hardware architecture is designed to meet the real-time requirements.
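The detection steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `detect_trapped_cells`, the uniform `pitch` between the top-left corners of adjacent wells, the half-open region bounds, the `>=` comparisons, the counting of bright pixels as '1' and the default value of `t2` are all hypothetical choices made for the example. Only the 4×4 layout and the value 128 for the binarisation threshold appear in the paper.

```python
# Hypothetical sketch of the trapped-cell test. All names, the pitch-based
# geometry and the default t2 are illustrative assumptions.

def detect_trapped_cells(grey, x1, y1, pitch, a, b, t1=128, t2=100, n=4):
    """Return an n x n matrix of booleans, True where a cell is judged trapped.

    grey  : greyscale image as a list of rows, indexed grey[x][y]
            (x pairs with the row index i and the height a, as in the text)
    x1,y1 : top-left corner of the first well
    pitch : ASSUMED distance between top-left corners of adjacent wells
    a, b  : height and width of each well region
    """
    trapped = [[False] * n for _ in range(n)]
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            # Top-left corner of the well at row i, column j (assumed form).
            xi = x1 + (i - 1) * pitch
            yj = y1 + (j - 1) * pitch
            # Binarise with t1 and sum the resulting 0/1 pixels to get S_1.
            s1 = sum(
                1
                for x in range(xi, xi + a)
                for y in range(yj, yj + b)
                if grey[x][y] >= t1
            )
            # Compare S_1 against t2 to decide whether a cell is trapped.
            trapped[i - 1][j - 1] = s1 >= t2
    return trapped
```

Because the array and camera are fixed, the geometry arguments would be measured once and reused for every frame; only `grey` changes between calls.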
B. Proposed Hardware Architectures
A commonly used technique for accelerating image processing algorithms is to exploit pipelining and parallelism when designing the hardware architecture, in order to improve the throughput and the execution time. In this work, the input image is first buffered in a RAM, and a pixel stream is then generated and propagated through a chain of processing blocks, as illustrated in Fig. 4. The function of each processing block is as follows:
1) Pixel Loading Block: This block is used to buffer the pixels from external memory and transfer them to the 'RGB to Greyscale' block.
2) RGB to Greyscale Block: This block is in charge of converting the RGB pixels into luminance values using the following equation:
where R, G, and B are the corresponding colour values.
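As a minimal sketch of such a conversion, the example below assumes the widely used ITU-R BT.601 luma weights (0.299, 0.587, 0.114), approximated in 8-bit fixed point so that only integer multiplies and a shift are needed. Both the weights and the function name `rgb_to_grey` are assumptions made for illustration; the coefficients used in this design may differ.

```python
# Illustrative RGB-to-luminance conversion. The BT.601 weights and their
# fixed-point approximation are ASSUMPTIONS, not taken from the paper.

W_R, W_G, W_B = 77, 150, 29   # approx. 0.299, 0.587, 0.114 scaled by 256


def rgb_to_grey(r, g, b):
    """Map 8-bit R, G, B values to an 8-bit luminance value."""
    # The weights sum to 256, so the result stays within 0..255.
    return (W_R * r + W_G * g + W_B * b) >> 8
```

Integer weights of this kind map naturally onto hardware multipliers, which is one reason a fixed-point form is often preferred over floating-point coefficients in FPGA designs.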
3) Micro-well Region Checking Block: This block checks the coordinates of the current pixel to determine whether it belongs to a known micro-well region. There are two possible outcomes: if the pixel does not belong to any known micro-well region, it is ignored by the subsequent stages; otherwise, it is identified as a micro-well pixel and sent to the binarisation block for further processing.
4) Binarisation Block: This block converts each greyscale pixel value into a binary value based on a predefined threshold t_1. In this work, t_1 has been set to 128 because the input images are captured in a constant indoor environment, where the impact of noise is minor.
5) Trapped Cell Detection Block: This block counts the number of pixels with value '1' within a particular micro-well. Since the pixel stream is generated row by row, four micro-wells can appear in the same row of the micro-well region; four counters are therefore used to record the pixel counts. The internal structure of this block is shown in Fig. 5.
The architecture in Fig. 5 consists of four separate accumulators. Each accumulator is in charge of one micro-well located in a row of the 4×4 micro-well regions. Once a row has been scanned, the threshold t_2 is used to determine whether each micro-well has trapped a cell. The result is stored in an output matrix, and the four buffers are then reinitialised to zero. The same procedure is applied to all rows. In Fig. 5, the vertical coordinate of an input pixel is compared with four predefined sets of coordinate ranges stored in C_1, C_2, C_3 and C_4. These ranges cover the vertical coordinates of the four micro-well regions in the same column. Once a predefined range is met, the output of the corresponding block C_i is set and the outputs of the remaining blocks are cleared. If the input binary value is equal to '1', the corresponding accumulator increments the content of the buffer 'Buffer_i' by 1. For instance, if the pixel is found to belong to the first micro-well, it is passed to the first accumulator. Since the maximum area of a micro-well region is 23×24 = 552 pixels, a count that needs 10 bits (2^9 = 512 < 552 < 2^10 = 1024), the accumulator and 'Buffer_i' are each 10 bits wide.
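The behaviour of the region checking, binarisation and accumulator stages can be modelled in software as a single pass over the pixel stream. The sketch below is a behavioural illustration only, not the RTL or the authors' C source: the names `make_ranges`, `find_range` and `stream_detect`, the pitch-based geometry, the half-open ranges, the `>=` comparisons and the default `t2` are assumptions, and the mapping of image rows and columns onto the comparators C_1 to C_4 is one possible reading of Fig. 5.

```python
# Behavioural model of the streaming stages. Geometry, thresholds, names and
# the row/column convention are illustrative ASSUMPTIONS.

# Bits needed to count up to the largest region (23 x 24 = 552 pixels).
ACC_BITS = (23 * 24).bit_length()


def make_ranges(start, size, pitch, n=4):
    """Half-open coordinate ranges [lo, hi) for n equally spaced wells."""
    return [(start + k * pitch, start + k * pitch + size) for k in range(n)]


def find_range(value, ranges):
    """Index of the range containing value, or None (models the comparators)."""
    for k, (lo, hi) in enumerate(ranges):
        if lo <= value < hi:
            return k
    return None


def stream_detect(grey_stream, width, row_ranges, col_ranges, t1=128, t2=100):
    """Consume greyscale pixels in raster order; return a trapped-cell matrix."""
    n = len(col_ranges)
    result = [[False] * n for _ in row_ranges]
    acc = [0] * n                       # one accumulator per well in a band
    for idx, pixel in enumerate(grey_stream):
        r, c = divmod(idx, width)       # pixel position in the frame
        band = find_range(r, row_ranges)
        well = find_range(c, col_ranges)
        # Region check, then binarisation, then accumulation of '1' pixels.
        if band is not None and well is not None and pixel >= t1:
            acc[well] += 1
        # After the last pixel of a band's final row: decide and reset.
        if band is not None and c == width - 1 and r == row_ranges[band][1] - 1:
            result[band] = [s >= t2 for s in acc]
            acc = [0] * n
    return result
```

Reusing four accumulators for every band of wells, rather than keeping sixteen, mirrors the resource saving described above: only the wells that the raster scan is currently crossing need live counters.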
III. HARDWARE IMPLEMENTATION
Prior to the hardware implementation, the proposed image processing system was validated using MATLAB. Fifty-four captured microscope images were used to evaluate the performance of the trapped cell detection algorithm, and all trapped cells within the micro-wells were successfully detected. On average, MATLAB required 160 ms to check all the micro-wells in a sample image. A hardware-based implementation can reduce the execution time significantly, so that the image processing system responds quickly enough to stop the DEP force and prevent more than one cell from being attracted to the same micro-well.
Specialist hardware platforms are a valuable means of achieving real-time performance for image processing algorithms. Currently, the most commonly used hardware for such problems comprises Digital Signal Processors (DSPs), Graphics Processing Units (GPUs), Application-Specific Integrated Circuits (ASICs) and Field Programmable Gate Arrays (FPGAs). Among these, an FPGA allows a user's architecture to be implemented as an ad-hoc digital circuit and optimised for a specific design task. This makes it a powerful tool for accelerating image processing algorithms, and it bridges the gap between software and hardware design, offering both performance and flexibility during development.
The target hardware for the implementation of the proposed architecture is a Xilinx ZC702 evaluation board equipped with a Zynq XC7Z020 CLG484-1 SoC [19]. The Xilinx Vivado high-level synthesis (HLS) tool [20] was used for the design and development of the proposed hardware architecture. The design was first implemented in C, and a C-level simulation was then performed in the Vivado HLS environment to verify that the results of the algorithm match those obtained from the MATLAB implementation. After that, C synthesis was performed to translate the C code into a Hardware Description Language (HDL); VHDL was selected as the target HDL. A register transfer level (RTL) simulation followed, in which the same C testbench used in the C-level simulation was reused to evaluate the final RTL implementation, which simplifies the evaluation of the image processing algorithm. Finally, the synthesised design was imported into a Vivado Zynq Base Targeted Reference Design (TRD) for the hardware implementation [21].
Two HLS directives have been used in the C to VHDL translation:
• set_directive_loop_flatten off: the loop flatten directive collapses nested loops into a single loop hierarchy to save clock cycles; its 'off' option excludes the specified loop from flattening.
• set_directive_pipeline II = 2: this pipelines a loop with an initiation interval of 2.
The achieved maximum running frequency and latency of the design are 175 MHz and 15 clock cycles, respectively. The RTL simulation was performed in Vivado HLS, with ModelSim used as an external simulator to evaluate the synthesised results. Processing a captured image with a resolution of 1024×768 takes 1585679 clock cycles. At 175 MHz, the calculated time for processing the entire image is therefore approximately 9 ms, which meets the minimum real-time requirement of the DEP system and is around 18 times faster than the MATLAB implementation. Table I summarises the results obtained after place-and-route for the proposed algorithm. As can be seen from Table I, the proposed architecture consumes less than 4% of the available hardware resources, which means that the remaining resources can be used for implementing other components of the system (e.g. the control of the DEP force, a function generator, etc.).
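The timing figures quoted above can be cross-checked with a few lines of arithmetic. The constants below are taken from the text; the variable names are illustrative, and the breakdown of the cycle count assumes that the pipelined loop accepts one pixel every II cycles, which is an interpretation rather than a statement from the paper.

```python
# Cross-check of the reported timing figures. Constants come from the text;
# the per-pixel pipelining model is an ASSUMPTION.

CLOCK_HZ = 175e6           # maximum running frequency
CYCLES = 1_585_679         # clock cycles per 1024x768 frame
MATLAB_MS = 160.0          # average MATLAB processing time per image
BUDGET_MS = 25.0           # maximum allowed processing time per image
II = 2                     # initiation interval set by the pipeline directive

frame_ms = CYCLES / CLOCK_HZ * 1e3          # time per frame in milliseconds
speedup = MATLAB_MS / frame_ms              # ratio to the MATLAB run time
pipelined_cycles = 1024 * 768 * II          # cycles if one pixel enters per II
overhead_cycles = CYCLES - pipelined_cycles # remainder not explained by II

print(f"{frame_ms:.2f} ms per frame, {speedup:.1f}x faster than MATLAB")
print(f"{pipelined_cycles} pipelined cycles + {overhead_cycles} other cycles")
```

Under this reading, almost the entire cycle count is explained by the frame size and the initiation interval, so a lower II or a smaller region of interest would be the main levers for further reducing the processing time.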
The on-chip power consumption consists mainly of two parts: static and dynamic power consumption. The static power is consumed due to transistor leakage. The dynamic power is consumed as the design runs, e.g. by the Zynq7 Processing System (PS7), the clocks, logic, signals and BRAMs, and is directly affected by the chip clock frequency and the utilised chip area. The details of the estimated power consumption of the implementation are summarised in Table II. As can be seen from Table II, the PS7 consumes much more power than the programmable logic; this is because the ARM dual-core Cortex-A9 based processing system runs at a much higher frequency than the programmable logic and runs the operating system and other high-level user interfaces provided by the Zynq TRD. Compared to the PS7, the proposed video processing IP core consumes only a small portion of the total on-chip power, which means that it has very little effect on the entire system.
IV. CONCLUSION
A hardware-based solution to accelerate the processing of images captured by a microscope camera in a DEP-based system has been proposed. The proposed solution, in which parallelism and pipelining have been exploited, replaces the PC/MATLAB part of the system. The new system operates in real time, processing an image in 9 ms, which is around 18 times faster than the PC/MATLAB implementation. The proposed hardware architecture consumes less than 4% of the available on-chip resources of the Zynq SoC platform, which leaves enough room for the entire DEP system to be implemented on a single chip.
