ABSTRACT: Local Binary Pattern (LBP) is a simple yet efficient texture operator that has become a popular approach in texture classification. In this paper, we propose a novel hardware architecture for LBP-based texture classification algorithms that can be executed efficiently on field-programmable gate arrays (FPGAs). A new memory structure and window operations are used in this hardware design to accelerate image processing. The new architecture is implemented on a Xilinx Virtex-6 FPGA. The experiments show that the new method executes more efficiently than standard implementations based on central processing units and graphics processing units.
INTRODUCTION
Texture classification is a basic vision problem with applications in numerous areas such as biomedical image analysis, object recognition, industrial surface inspection, content-based image retrieval, and face analysis. In many practical applications, textures are captured at arbitrary angles. Rotation invariance is therefore considered a particularly important property of successful texture descriptors.
Early methods for texture classification focused on the statistical analysis of texture images. These methods include the co-occurrence matrix [1] and filtering-based approaches [2]. Kashyap and Khotanzad were among the first to study rotation invariant texture classification using a circular autoregressive model [3]. Numerous other models were then explored, such as the multi-resolution autoregressive model [4] and the Gaussian Markov random field [5]. Ojala et al. [6] proposed using local binary patterns (LBP), an operator for image description based on the signs of differences among neighboring pixels, for texture classification. LBP is computed rapidly and is invariant to monotonic gray-scale changes. The original LBP method was enhanced by deriving a rotation invariant LBP descriptor [7]. Since then, numerous variants of LBP have been introduced in the literature [8] [9] [10] [11] [12] [13]. The original LBP approach was extended to the dominant local binary pattern (DLBP) approach [8] to effectively capture the dominating patterns in texture images. Guo et al. [9] proposed three operators, CLBP_S, CLBP_M, and CLBP_C, which extract the sign and magnitude features of the local difference and the local gray level of the image, respectively. Zhao et al. [10] focused on rotation invariant image features for static texture description. A noise-resistant LBP (NRLBP) [11] was proposed to preserve the local image structures in the presence of noise. In [12], Pinjari et al. developed a watermarking method using the concept of LBP. A texture classification method called scLBP [13] was proposed, which encodes consecutive LBP patterns in a sorted manner and uses a kd-tree-based dictionary for scLBP.
Graphics processing units (GPUs) have been used to accelerate texture classification in order to satisfy real-time requirements. Zolynski et al. presented a novel implementation of an LBP-based texture analysis operator on a GPU [14], which yielded a 14-fold to 18-fold run time reduction compared with standard central processing unit (CPU) implementations. Leibstein et al. proposed a new algorithm called radial LBP [15], which builds on ideas from [14]. These methods achieve good parallel performance, but all of them are software implementations.
Field-programmable gate arrays (FPGAs) have become a medium of choice for designing fast and reconfigurable algorithm hardware. Numerous FPGA- and LBP-based parallel approaches have been developed in the past decades for facial analysis and object recognition [16] [17] [18]. However, no research has yet been conducted on using FPGAs to accelerate texture classification algorithms.
In this paper, we propose a novel hardware architecture for texture classification algorithms based on LBP that can be executed efficiently on FPGAs. A new memory structure and window operations are used in this hardware design to accelerate image processing. The proposed hardware architecture was implemented on a Xilinx Virtex-6 FPGA. Our experiments show that the new method executes more efficiently than standard implementations based on CPUs and GPUs. The rest of the paper is organized as follows. Section 2 briefly reviews the implemented texture classification algorithms. Section 3 provides an overview of the proposed hardware architecture and describes the detailed design of its submodules. Section 4 presents the FPGA-based implementation of the proposed hardware architecture, the synthesis, and the experimental results. Finally, we provide a conclusion and discuss plans for future work in the last section.
LBP-BASED TEXTURE CLASSIFICATION ALGORITHM
The LBP algorithm is currently one of the most efficient descriptors used in texture analysis. The LBP descriptor is both computationally inexpensive and easily parallelizable compared with more sophisticated texture analysis algorithms. Although LBP is already an efficient algorithm on a CPU, the additional computational power of an FPGA can further enhance its performance. The algorithms we implemented for texture classification are discussed in the following subsections.
Original LBP
The LBP operator was first introduced by Ojala et al. [6]. Since then, it has proven to be an effective descriptor in texture classification [7]. An input texture image is first converted into gray scale, and the LBP operator is then applied to each pixel within the image to create an LBP representation. The original LBP operator works on a 3 × 3 neighborhood, with the center value as the threshold. The neighboring pixels are set to 0 or 1 by thresholding them against the center pixel value. An LBP code is produced by multiplying the thresholded values by the weights assigned to the corresponding pixels and summing the results. The neighborhood consists of 8 pixels; thus, a total of 2^8 = 256 different labels can be obtained, depending on the relative gray values of the center pixel and the other pixels in the neighborhood. An example of a 3 × 3 LBP and the resulting description value of the center pixel, with a pattern of "11001011" and an original LBP value of 128+64+0+0+8+0+2+1=203, is shown in Figure 1. In [7], an extended LBP operator was described with a circularly symmetric neighborhood defined by R and P, where R is the distance of the neighbors from the center and P is the number of samples taken at that distance. If a sample location does not fall exactly at the center of a pixel, its value is estimated through interpolation. The resulting sequence of 0s and 1s is then known as the LBP. If $g_c$ denotes the center pixel and $g_p$ is its p-th neighbor, then the $LBP_{P,R}$ operator is defined as:
$$LBP_{P,R} = \sum_{p=0}^{P-1} s(g_p - g_c)\, 2^p \qquad (1)$$
where s(x) is the thresholding function
$$s(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases} \qquad (2)$$
A feature vector describing the textural properties of the input image is obtained from a histogram of the LBP values of the image. LBP is invariant to intensity changes (or any monotonic change to the channel), thus making it attractive because of its robustness to lighting variations.
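The operator and histogram described above can be illustrated with a minimal Python sketch. It is a software model for illustration only, not the hardware design. The neighbor ordering in `OFFSETS` and the sample gray values are assumptions (Figure 1 may use a different ordering and different values); the values were chosen so that the pattern matches the "11001011" example.

```python
# Offsets (row, col) of the 8 neighbors of a 3 x 3 window. This ordering
# (start at the right neighbor, go counter-clockwise) is an assumption.
OFFSETS = [(0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1), (1, 0), (1, 1)]


def lbp_code(neighbors, center):
    """Original LBP: threshold each neighbor g_p against the center g_c
    and weight the resulting bit by 2**p."""
    return sum((1 if g >= center else 0) << p for p, g in enumerate(neighbors))


def lbp_histogram(image):
    """256-bin histogram of LBP codes over all interior pixels of a
    gray-scale image given as a list of rows."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            neighbors = [image[r + dr][c + dc] for dr, dc in OFFSETS]
            hist[lbp_code(neighbors, image[r][c])] += 1
    return hist


# Hypothetical gray values g_0..g_7 around a center value of 5; the
# thresholded bits, read from g_7 down to g_0, are 1 1 0 0 1 0 1 1.
example_code = lbp_code([6, 9, 2, 7, 1, 3, 8, 5], center=5)  # 203
```

Adding the same offset to every pixel leaves the comparisons, and therefore the code, unchanged, which is the simplest case of the monotonic gray-scale invariance mentioned above.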
Rotation invariant LBP ($LBP^{ri}_{P,R}$)
The $LBP_{P,R}$ operator produces $2^P$ different output values corresponding to the $2^P$ different binary patterns that can be formed by the P pixels in the neighbor set. When the image is rotated, the gray values $g_p$ correspondingly move along the perimeter of the circle around $g_c$. Because $g_0$ is always assigned as the gray value of element (0, R) to the right of $g_c$, rotating a particular binary pattern naturally results in a different $LBP_{P,R}$ value. This does not apply to patterns comprising only 0s (or only 1s), which remain constant at all rotation angles. To remove the effect of rotation, that is, to assign a unique identifier to each rotation invariant LBP, we define:
$$LBP^{ri}_{P,R} = \min\{\, ROR(LBP_{P,R}, i) \mid i = 0, 1, \ldots, P-1 \,\} \qquad (3)$$
where ROR(x, i) performs a circular bit-wise right shift on the P-bit number x, i times.
$LBP^{ri}_{P,R}$ is obtained by taking the minimum value over all circular shifts of the binary pattern, so that every rotated version of a pattern maps to the same code. The 36 unique rotation invariant LBPs that can occur for P = 8, that is, $LBP^{ri}_{8,R}$, are illustrated in Figure 2.
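The rotation invariant mapping can be sketched in a few lines of Python, assuming the LBP code is held as an ordinary integer with P significant bits. The function names `ror` and `lbp_ri` are illustrative and do not come from the hardware design.

```python
def ror(x, i, P=8):
    """Circular bit-wise right shift of the P-bit number x by i positions."""
    i %= P
    mask = (1 << P) - 1
    return ((x >> i) | (x << (P - i))) & mask


def lbp_ri(code, P=8):
    """Rotation invariant LBP: the minimum over all P circular shifts."""
    return min(ror(code, i, P) for i in range(P))


# All 256 codes for P = 8 collapse onto 36 rotation invariant values.
unique_ri_values = sorted({lbp_ri(code) for code in range(256)})
```

For the pattern "11001011" (203), the smallest rotation is "00101111", so `lbp_ri(203)` evaluates to 47.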
Rotation invariant uniform LBP ($LBP^{riu2}_{P,R}$)
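A uniformity measure U("pattern") is introduced, which corresponds to the number of spatial transitions (bitwise 0/1 changes) in the "pattern", to formally define "uniform" patterns. The patterns in the first row of Figure 2 have a U value of at most 2: the all-0 and all-1 patterns have no transitions (U = 0), and the remaining patterns have exactly two 0/1 transitions. The other 27 patterns have a U value of at least 4. We designate the patterns with a U value of at most 2 as "uniform" and propose the following operator for grayscale and rotation invariant texture description instead of $LBP^{ri}_{P,R}$:
$$LBP^{riu2}_{P,R} = \begin{cases} \sum_{p=0}^{P-1} s(g_p - g_c), & U(LBP_{P,R}) \le 2 \\ P + 1, & \text{otherwise} \end{cases} \qquad (4)$$
where
$$U(LBP_{P,R}) = \left| s(g_{P-1} - g_c) - s(g_0 - g_c) \right| + \sum_{p=1}^{P-1} \left| s(g_p - g_c) - s(g_{p-1} - g_c) \right| \qquad (5)$$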
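The uniformity measure U and the uniform-pattern operator described above can be sketched in Python as follows. The sketch works on an integer LBP code rather than on gray values, which is an assumption made for brevity; bit p of the code stands for $s(g_p - g_c)$.

```python
def uniformity(code, P=8):
    """U value: number of 0/1 transitions when the P bits are read
    circularly (bit 0 is compared with bit P-1)."""
    bits = [(code >> p) & 1 for p in range(P)]
    # bits[p - 1] with p = 0 indexes bits[-1], i.e. the circular wrap-around.
    return sum(bits[p] != bits[p - 1] for p in range(P))


def lbp_riu2(code, P=8):
    """Uniform patterns are labeled by their number of 1 bits (0..P);
    every non-uniform pattern receives the label P + 1."""
    if uniformity(code, P) <= 2:
        return bin(code).count("1")
    return P + 1


# Labels produced over all 256 codes for P = 8.
riu2_labels = sorted({lbp_riu2(code) for code in range(256)})
```

For P = 8 this yields the labels 0 to 9. The pattern "11001011" has four transitions, so it is non-uniform and receives the label 9.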
By definition, exactly P+1 "uniform" binary patterns can occur in a circularly symmetric neighbor set of P pixels. Equation (4) assigns a unique label to each "uniform" pattern (indexed 0, 1, ..., P), whereas all "non-uniform" patterns are labeled with P+1. In practice, the mapping from $LBP_{P,R}$ to $LBP^{riu2}_{P,R}$ is implemented with a lookup table (see the LUT module in Section 3).
THE PROPOSED HARDWARE ARCHITECTURE
System overview
The hardware architecture of the proposed system is presented in Figure 3. The proposed system consists of five submodules: control, init, processing, LUT, and write back. The control module generates the control signals that are sent to the other modules. The init module consists of RAM0, which stores the input image, and two first-in, first-out (FIFO) memories, which are used for data reuse. The processing module has a 3 × 3 window operator and a calculating unit. After passing through the processing module, each pixel in the input image is transformed into its original LBP value. The LUT module transforms the original LBP value into a rotation invariant LBP value using the rotation invariant operator described in Section 2. A block RAM (RAM1) is used as the LUT for storing the map table. In the write back module, a dedicated hardware structure calculates the histogram of LBP values, and a block RAM (RAM2) stores the final histogram vectors.
Memory structure and data reuse
Window operations are frequently used because the LBP-based texture classification algorithm works on the center and neighboring pixels. In this paper, we propose a novel memory structure and window operations based on the work of Dong [19] [20] [21]. The new memory structure discards control switches, and a novel shift structure is proposed to realize the sliding of the window.
The proposed memory structure is shown in Figure 4. The input image, which has been preprocessed into a grayscale image, is stored in RAM0. Each pixel is represented by an 8-bit unsigned integer. Two FIFOs with a depth of 1024 pixels are employed for reusing input data. These FIFOs are called shift buffers because the data shift by one position in each cycle. The output of the first FIFO is connected to the input of the second FIFO. When the data in both FIFOs are ready, the first two lines of the input image have been cached in the FIFOs. In the next cycle, the data in the first group step into the pipeline registers. A 3 × 3 window operator is implemented by the shift registers. The block RAM and FIFOs operate together with the window operator to use the input data efficiently. This memory structure is easily extended to support larger window operators. For instance, three FIFOs with a temporary register can support a 4 × 4 window operator.
The finite-state machine (FSM) of the shift buffers and window operator is illustrated in Figure 5. The initial state is IDLE. The start signal triggers the initialization. In the first cycle, the first datum is read into the temporary register and the FSM state is set to RAM_READY. The state then moves to INIT_FIFO. When the data of rows i and i+1 are ready, the state is set to FIFO_READY. The next three states are INIT_WIN, WIN_READY, and CAL, during which the first three data of row i+2 are read into the window operator registers and the calculating unit is then started. When the data of row i are finished, the window operator waits two cycles until the data of row i+1 and the first three data of row i+2 are ready before processing continues. When all input data have been processed, the state goes to FINISH, which indicates that the output data are ready. The proposed hardware structure can process images with a width of less than 1024 pixels. The shift buffers can be extended to support larger input images by using deeper FIFOs.
Once the shift buffers are completely filled, an LBP value is obtained in almost every cycle. Two additional cycles are needed to fill the window operator at the start of each new row.
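The data reuse provided by the two shift buffers and the 3 × 3 window can be modeled in software. The following Python sketch is a behavioral illustration under stated assumptions, not the Verilog design: the FIFOs are given a depth equal to the image width instead of the fixed 1024, one pixel is consumed per loop iteration (standing in for one clock cycle), the FSM states are not modeled, and the neighbor ordering in `OFFSETS` is assumed.

```python
from collections import deque

# Assumed neighbor ordering (row, col) relative to the window center.
OFFSETS = [(0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1), (1, 0), (1, 1)]


def lbp_from_window(win):
    """Original LBP code of the center of a 3 x 3 window (list of 3 rows)."""
    center = win[1][1]
    return sum((win[1 + dr][1 + dc] >= center) << p
               for p, (dr, dc) in enumerate(OFFSETS))


def stream_lbp(image):
    """Stream pixels in raster order through two row-deep FIFOs and a
    3 x 3 shift-register window; return LBP codes in raster order."""
    height, width = len(image), len(image[0])
    fifo1 = deque([0] * width)  # delays the pixel stream by one row
    fifo2 = deque([0] * width)  # delays it by a second row
    win = [[0] * 3 for _ in range(3)]  # win[0] is the oldest row
    codes = []
    for r in range(height):
        for c in range(width):
            x = image[r][c]        # pixel (r, c) read from the image RAM
            y = fifo1.popleft()    # pixel (r-1, c)
            fifo1.append(x)
            z = fifo2.popleft()    # pixel (r-2, c)
            fifo2.append(y)
            # Shift each window row left by one and insert the new column.
            for row, new in zip(win, (z, y, x)):
                row.pop(0)
                row.append(new)
            # The window is valid only after two full rows and two extra
            # pixels of the current row have been shifted in.
            if r >= 2 and c >= 2:
                codes.append(lbp_from_window(win))
    return codes
```

Each pixel is read from the image memory exactly once, and the two skipped iterations at the start of every row correspond to the two additional cycles needed to refill the window.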
Texture algorithm implementation
The hardware architecture of the original LBP operator is presented in Figure 6. A 3 × 3 window operator is implemented with nine shift registers. Eight calculating units each compare the center pixel with one neighboring pixel and produce a single output bit, 0 or 1. The addition operation is replaced by an 8-bit operation.
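One plausible reading of replacing the addition with an 8-bit operation is that the eight comparator outputs are placed directly into the bit positions of the result, because the weights are distinct powers of two and no carries can occur. The Python sketch below illustrates this equivalence; the interpretation and the function names are assumptions, not details taken from Figure 6.

```python
def lbp_weighted_sum(neighbors, center):
    """Reference form: multiply each thresholded bit by 2**p and add."""
    return sum((1 if g >= center else 0) * (1 << p)
               for p, g in enumerate(neighbors))


def lbp_bit_concat(neighbors, center):
    """Adder-free form: each comparator output drives one bit of the
    8-bit result (shift and OR only)."""
    code = 0
    for p, g in enumerate(neighbors):
        code |= (1 if g >= center else 0) << p
    return code
```

Both functions return the same code for any input, so the comparators alone determine the LBP value.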
A 3 × 3 window operator produces 8-bit binary patterns with 256 different values. A map table is used to implement the rotation invariant transform discussed in Section 2. The LUT has 768 entries of 8 bits each; therefore, the total size of the LUT is 768 bytes. RAM1 is used to store the map table.
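The map table can be precomputed offline and then applied with a single memory read per pixel. The Python sketch below builds one 256-entry table for the rotation invariant mapping and accumulates a histogram through it, mirroring the LUT and write back steps. It is an illustration under assumptions: the source does not say how the 768 entries are organized (768 = 3 × 256 would be consistent with three 256-entry tables, but that is a guess), and the function names are hypothetical.

```python
def build_ri_lut(P=8):
    """Table mapping every P-bit LBP code to its rotation invariant value."""
    mask = (1 << P) - 1

    def ror(x, i):
        # Circular right shift of the P-bit number x by i positions.
        return ((x >> i) | (x << (P - i))) & mask

    return [min(ror(code, i) for i in range(P)) for code in range(1 << P)]


def histogram_via_lut(codes, lut):
    """Translate each LBP code through the table and count the results."""
    hist = [0] * len(lut)
    for code in codes:
        hist[lut[code]] += 1
    return hist


RI_LUT = build_ri_lut()
```

Every entry of `RI_LUT` fits in 8 bits, which matches the entry width stated above.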
EXPERIMENTS AND RESULTS
The proposed system was designed in the Verilog hardware description language based on the block diagram presented in Figure 3. The Verilog code was simulated using ModelSim SE 6.5a. The system was synthesized for a Virtex-6 (XC6VLX240T-1FF1156) FPGA device using the Xilinx ISE 13.2 Design Suite. The Xilinx ML605 evaluation board was chosen as the target platform. The hardware architecture proposed in Section 3 produces a histogram as output. We implemented the classifier in MATLAB to complete the texture classification.
The experiments measured the accuracy and speed of the implemented texture algorithms, which were then compared with the works of Zolynski [14] and Leibstein [15].
Accuracy
The input images come from the database of Brodatz textures rotated at six different angles [23]. We followed the evaluation methodology presented in [15] to benchmark the algorithms discussed in Section 2. The image data comprised 13 textures from the Brodatz album. For each texture, 512 × 512 images digitized at 6 different rotation angles (0°, 30°, 60°, 90°, 120°, and 150°) were included. Each image was divided into 16 disjoint 128 × 128 sub-images, giving 1248 samples in total, with 96 in each of the 13 classes. The classifiers were trained on one angle (a total of 208 images) and tested on the other five angles (the remaining 1040 images). This procedure was repeated six times, once per training angle. The average classification accuracy over the six runs was used as the evaluation parameter.
The experimental results are broadly consistent with those in [22]. The original LBP algorithm, which is not rotation invariant, clearly has poor classification performance, with a classification accuracy of approximately 25%. This suggests that rotation invariance plays an important role.
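The sample counts and the train/test rotation of this protocol can be checked with a short Python sketch. The tuple layout `(texture, angle, sub_image)` and the `evaluate` callback are hypothetical placeholders; the actual classifier was written in MATLAB and is not reproduced here.

```python
TEXTURES = 13
ANGLES = (0, 30, 60, 90, 120, 150)
SUBIMAGES = 16  # a 512 x 512 image split into 4 x 4 tiles of 128 x 128

# One entry per sample: (texture index, rotation angle, sub-image index).
SAMPLES = [(t, a, s)
           for t in range(TEXTURES)
           for a in ANGLES
           for s in range(SUBIMAGES)]


def split(train_angle):
    """Train on the samples of one angle, test on the other five."""
    train = [x for x in SAMPLES if x[1] == train_angle]
    test = [x for x in SAMPLES if x[1] != train_angle]
    return train, test


def average_accuracy(evaluate):
    """Average the accuracy returned by evaluate(train, test) over the
    six runs, one per training angle."""
    return sum(evaluate(*split(a)) for a in ANGLES) / len(ANGLES)
```

The counts follow directly: 13 × 6 × 16 = 1248 samples, 13 × 16 = 208 training images per run, and 1248 − 208 = 1040 test images.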
Speed
Four images with dimensions of 512 × 512 pixels from the Brodatz database were merged into a single image with a dimension of 1024 × 1024 pixels for comparison with related studies.
We implemented the design on the target platform with four parallel blocks, each as illustrated in Figure 3, to make full use of the hardware resources. Each block processed a 512 × 512 image. The device utilization and timing summary is presented in Table 1. The maximum frequency after the place and route phases is 392.927 MHz. A total of 262154 cycles are needed to process a 1024 × 1024 image. The computation time is presented in Table 2. The best processing time, at the maximum frequency, was approximately 0.7 ms. The running frequency of the target platform is 200 MHz, at which processing the input image takes approximately 1.3 ms. Reference [14] reported a computation time of 5 ms on a fast GPU and 1150 ms on a fast CPU. Kryjak [18] used a newer, more powerful GPU, which took approximately 6 ms to process a 1024 × 1024 image.
Table 2. Computation time.
Platform        Time      Speedup
CPU             1150 ms   --
GPU GT7600      20 ms     --
GPU GT8600      11 ms     --
GPU GTS8800     5 ms      1.0
GPU 2010        6 ms      0.8
FPGA max        0.7 ms    7.1
FPGA v6         1.3 ms    3.8
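The reported times follow from the cycle count and the clock frequency, as the Python sketch below shows. Two points are inferences rather than statements from the source: 262154 equals 512 × 512 + 10, which is consistent with each of the four blocks consuming one pixel per cycle plus a few cycles of overhead, and the last column of Table 2 appears to be the 5 ms GPU time divided by the rounded processing time.

```python
CYCLES = 262154          # cycles reported for a 1024 x 1024 image
F_MAX_HZ = 392.927e6     # maximum frequency after place and route
F_BOARD_HZ = 200e6       # running frequency of the target platform
GPU_REFERENCE_MS = 5.0   # GPU time taken from reference [14]


def processing_time_ms(cycles, freq_hz):
    """Processing time in milliseconds for a given cycle count and clock."""
    return cycles / freq_hz * 1e3


def speedup(reference_ms, time_ms):
    """How many times faster time_ms is than reference_ms."""
    return reference_ms / time_ms


t_max_ms = processing_time_ms(CYCLES, F_MAX_HZ)      # about 0.67 ms
t_board_ms = processing_time_ms(CYCLES, F_BOARD_HZ)  # about 1.31 ms
```

Using the rounded times of 0.7 ms and 1.3 ms, `speedup(GPU_REFERENCE_MS, 0.7)` and `speedup(GPU_REFERENCE_MS, 1.3)` round to 7.1 and 3.8.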
CONCLUSION
In this paper, we presented a novel hardware architecture for LBP-based texture classification algorithms that can be efficiently executed on an FPGA. Three LBP-based algorithms were implemented for texture classification. The experiments demonstrated that the image-processing speed improved significantly and that real-time performance was achieved.
In future work, we plan to use DRAM to store the input image and to improve the memory structure for processing very large images. We will also design a reconfigurable window operator and use a logic circuit to replace the LUT.
