Most image processing algorithms are parallelizable, i.e. the calculation of one pixel does not affect another. SIMD architectures, including Intel's WMMX and SSE and ARM's NEON, can exploit this fact by processing multiple pixels at a time, which can result in significant speedups. This study investigates the use of NEON SIMD instructions for two image processing algorithms. The algorithms are altered to process four pixels at a time, for which a theoretical speedup factor of four can be achieved. In addition, parts of the original implementation have been replaced with inline functions or modified at the assembly code level. Experimental benchmark data shows the actual execution speed to be between two and three times higher than the original reference. These results show that SIMD instructions can significantly speed up image processing algorithms through proper code manipulations.
INTRODUCTION
Increasing the performance of computer applications has always been of major interest. Recently, higher resolution images and videos place higher demands on a computer processor and require more time to process. One solution to this problem is the use of Single Instruction Multiple Data (SIMD) instructions. SIMD instructions take advantage of the inherent parallelism present in many algorithms by processing multiple operands within one instruction cycle. The architecture splits the registers into multiple lanes that are loaded with the data to be processed. The instruction is then executed simultaneously across all the lanes, i.e. all lanes are processed in parallel.
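As a rough illustration of the lane model described above, the following portable C sketch treats a 128-bit register as sixteen 8-bit lanes and applies one saturating add to every lane. The type vec128_u8 and the functions lane_add_sat and brighten are hypothetical names chosen for this sketch and do not come from the study; on real SIMD hardware the loop inside lane_add_sat would correspond to a single instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LANES 16  /* 128-bit register divided into 8-bit lanes */

/* Hypothetical model of one 128-bit SIMD register split into 8-bit lanes. */
typedef struct {
    uint8_t lane[LANES];
} vec128_u8;

/* Lane-wise saturating add: every lane is computed independently. */
vec128_u8 lane_add_sat(vec128_u8 a, vec128_u8 b)
{
    vec128_u8 r;
    for (int i = 0; i < LANES; ++i) {
        unsigned sum = (unsigned)a.lane[i] + b.lane[i];
        r.lane[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
    return r;
}

/* Brighten n pixels, 16 at a time. In this sketch n must be a multiple of 16. */
void brighten(uint8_t *px, size_t n, uint8_t delta)
{
    assert(n % LANES == 0);
    vec128_u8 d;
    memset(d.lane, delta, LANES);           /* broadcast delta to all lanes */
    for (size_t i = 0; i < n; i += LANES) {
        vec128_u8 v;
        memcpy(v.lane, px + i, LANES);      /* load 16 pixels */
        v = lane_add_sat(v, d);             /* one SIMD-style operation */
        memcpy(px + i, v.lane, LANES);      /* store 16 pixels */
    }
}
```

The scalar equivalent would perform sixteen separate load, add, and store steps, which is where the theoretical speedup comes from.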
Traditionally, SIMD instructions were extensions of the main processor instruction set. The instructions used the same functional units as the non-SIMD instructions, modified to accommodate data sets with operand widths smaller than the machine's word width. This allowed for greater performance with little additional hardware. However, SIMD instructions were limited by the size of the registers and by how many lanes each register could be split into. Newer architectures attempt to solve this problem using a separate coprocessor for the SIMD instructions [1]. This coprocessor has additional registers, which are often four to eight times larger than those of the main processor. As a result, larger data sets can be processed in parallel.
One primary use of SIMD instructions is in image processing. Most image processing algorithms are linear such that the result from the calculation of one pixel does not affect another one [2]. This linearity allows pixels to be processed in parallel rather than sequentially. Images usually contain pixel values of either 8 or 16 bits. Therefore, multiple pixels can be loaded into a single register. Both of these properties allow image processing algorithms to significantly benefit from the use of SIMD instructions.
The paper continues with references to related work in section 2, followed by a description of the study's setup and test cases in sections 3 and 4, respectively. Section 5 captures the results and presents the lessons learned from this study.
RELATED WORK
SIMD instructions have already been shown to provide some performance benefits to image and video processing algorithms. Theoretically, speedups of four to eight times are possible depending on how many pixels can be processed with each instruction. Previous studies have shown the real speedup to be less than the theoretical one. In [3], the speedup for a video processing algorithm is shown to be no greater than 1.5 times.
In another study [4], the authors report the implementation of two image processing algorithms using Intel's Streaming SIMD Extensions (SSE). The four tests consisted of a sepia filter and a crossfade filter, each using either integer or floating point numbers. The most significant speedup was achieved with the integer based algorithms, which showed a speedup of between 2.6 and 2.7 times for both filters. The floating point based algorithms showed speedups of about 1.9 times for both filters.
STUDY SETUP
In this study two image processing algorithms were used to determine the performance benefits and develop best practices in using SIMD instructions for image processing. The algorithms were tested on the BeagleBone prototyping board from Beagleboard.org. Section 3.1 describes the hardware setup, section 3.2 describes the software setup including the compiler, and section 3.3 describes the two algorithms that have been used.
Hardware Setup
The hardware used for testing is a BeagleBone prototyping board that contains a Texas Instruments AM3359 Cortex-A8 processor [5]. The Cortex-A8 implements the ARMv7 architecture with a Vector Floating Point (VFP) coprocessor, a dual issue pipeline, four performance counters, and a NEON advanced SIMD coprocessor. The VFP coprocessor allows for fast single and double precision floating point operations. The performance counters can be used to collect runtime information. These counters can measure metrics such as cache accesses and misses, branch accesses and misses, and how the ARM and NEON units work together.
The NEON coprocessor was developed by ARM specifically to execute SIMD instructions. It contains a separate register file that can be viewed as either 16 128-bit registers or 32 64-bit registers; each 128-bit register is divided in half to create two 64-bit registers. The registers can be split into lanes of 64, 32, 16, or 8 bits containing floating point numbers, integer numbers, or polynomials. The ARM processor issues the SIMD instructions directly to the NEON coprocessor. The coprocessor then decodes and executes the instructions. The coprocessor can send data to or receive data from the ARM processor, the L1 cache, or the L2 cache directly [6].
The NEON architecture is very useful for image processing. If one uses 8-bit pixels, then the registers can hold up to 16 pixels. Performing the manipulations on 16 pixels at a time could achieve a speedup of up to 16. Many algorithms will need more than 8 bits for computational accuracy and the SIMD instructions have some overhead, so the speedup will likely not be that high.
Software Setup
The BeagleBone comes preloaded with the Angstrom Linux distribution, which uses Linux kernel version 3.2.0. The operating system allows for easy file management, file transfer, and execution of programs. The kernel was recompiled to enable the performance counters from a kernel module.
A cross-compiler is used to compile programs on the host PC for the BeagleBone. The GNU compiler version 4.5.1 was chosen for its advanced optimization options, such as automatic vectorization of loops, and its open source nature. Vectorization means that the compiler chooses when to use SIMD instructions. The compiler also accepts NEON intrinsic functions. These functions can be called directly from C and compile to NEON instructions. They allow the efficient use of SIMD instructions without writing assembly code.
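To show what calling intrinsic functions from C can look like, the following sketch adds two 8-bit pixel buffers with saturation using the intrinsics vld1q_u8, vqaddq_u8, and vst1q_u8 from arm_neon.h. The function name add_sat_u8 is hypothetical and this is not code from the study; a scalar fallback is included so the sketch also builds on targets without NEON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Saturating add of two 8-bit pixel buffers: dst[i] = min(a[i] + b[i], 255). */
void add_sat_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);
    size_t i = 0;
#if HAVE_NEON
    /* 16 pixels per iteration using one 128-bit register per operand. */
    for (; i + 16 <= n; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);
        uint8x16_t vb = vld1q_u8(b + i);
        vst1q_u8(dst + i, vqaddq_u8(va, vb));
    }
#endif
    /* Scalar tail, and the full fallback on targets without NEON. */
    for (; i < n; ++i) {
        unsigned sum = (unsigned)a[i] + b[i];
        dst[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

With GCC targeting a Cortex-A8, options such as -mcpu=cortex-a8 -mfpu=neon are typically needed to enable the NEON path; the exact flags depend on the toolchain and are an assumption here, not a detail reported by the study.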
Algorithms
The study has applied SIMD instructions to two image processing algorithms. The first is the bilinear interpolation algorithm, which expands or contracts an image to a set dimension. In this work it is used to expand an image. When an image is expanded, pixels are added to the image and the value of these new pixels is found through interpolation. Interpolation in this context averages the values of the nearest pixels in order to determine the value of the new pixel.
The bilinear interpolation algorithm chosen was written by Etienne Sobole [7]. It does not have any floating point operations and processes the image in one pass. The number of floating point variables is kept low, because most processors manipulate integers faster than floating point numbers. The image is processed in one pass, which helps eliminate unnecessary jumps. This algorithm is used as the baseline for comparison for the bilinear interpolation test cases.
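An integer-only bilinear interpolation of a single 8-bit channel could be sketched as follows. This is not the implementation from [7]; the function name bilerp_u8 and the choice of 8 fractional bits for the weights are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Interpolate one 8-bit channel between four neighbouring samples.
 * p00 and p10 are the top-left and top-right samples, p01 and p11 the
 * bottom-left and bottom-right ones. fx and fy are the fractional offsets
 * in fixed point, in the range 0..256 where 256 represents 1.0. */
uint8_t bilerp_u8(uint8_t p00, uint8_t p10, uint8_t p01, uint8_t p11,
                  unsigned fx, unsigned fy)
{
    assert(fx <= 256 && fy <= 256);
    unsigned top    = p00 * (256 - fx) + p10 * fx;     /* scaled by 256 */
    unsigned bottom = p01 * (256 - fx) + p11 * fx;     /* scaled by 256 */
    unsigned value  = top * (256 - fy) + bottom * fy;  /* scaled by 65536 */
    return (uint8_t)((value + 32768u) >> 16);          /* round to nearest */
}
```

The largest intermediate value is 255 * 256 * 256, which fits comfortably in 32 bits, so no floating point arithmetic is required.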
The second algorithm used for testing is a distortion algorithm. This algorithm was originally written by HP to correct the distortion of a captured image. The program accepts an uncompressed image and a distortion matrix. The matrix has multiple 2D vectors that define how the pixels in the source image map to the destination image. The matrix's floating point values are converted to integer values.
The program flow for this algorithm is shown in Figure 1. The loop begins with setting the x and y index into the matrix. The distortion matrix is used to obtain the distortion vectors dvx and dvy in the routine GetDistortionVector. The vectors are then used to retrieve the source image pixels and perform bilinear interpolation to determine the value of the destination pixel in the image processing part of the code. The loop continues through all the pixels of the destination image. This algorithm is used as the baseline for comparison for the distortion test cases.
Fig. 1 Program flow of the distortion algorithm
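The loop structure described above might be sketched as follows. The original HP source is not reproduced in this paper, so every name and signature here (dist_vec, get_distortion_vector, undistort) is a hypothetical stand-in. The sketch also simplifies two steps: it picks the nearest matrix entry instead of using the study's indexing routines, and it copies the nearest source pixel instead of interpolating bilinearly.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int dvx, dvy; } dist_vec;  /* integer displacement in pixels */

/* Hypothetical stand-in for GetDistortionVector: picks the nearest entry of
 * a gw-by-gh matrix spread evenly over a w-by-h destination image. */
dist_vec get_distortion_vector(const dist_vec *grid, int gw, int gh,
                               int x, int y, int w, int h)
{
    int gx = (w > 1) ? (x * (gw - 1) + (w - 1) / 2) / (w - 1) : 0;
    int gy = (h > 1) ? (y * (gh - 1) + (h - 1) / 2) / (h - 1) : 0;
    return grid[gy * gw + gx];
}

/* Clamp v to the range lo..hi. */
int clampi(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* For every destination pixel: look up the vector, then sample the source. */
void undistort(uint8_t *dst, const uint8_t *src, int w, int h,
               const dist_vec *grid, int gw, int gh)
{
    assert(w > 0 && h > 0 && gw > 0 && gh > 0);
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            dist_vec v = get_distortion_vector(grid, gw, gh, x, y, w, h);
            int sx = clampi(x + v.dvx, 0, w - 1);
            int sy = clampi(y + v.dvy, 0, h - 1);
            /* Nearest sample only; the study interpolates bilinearly here. */
            dst[y * w + x] = src[sy * w + sx];
        }
    }
}
```

Because each destination pixel depends only on the source image and the matrix, the iterations of the inner loop are independent of one another.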
TEST CASES
For the bilinear interpolation algorithm we built three test cases. The first test is the baseline test, NEON-Bi0, which is described in section 3.3. The other two test cases use the SIMD intrinsic functions to produce NEON (SIMD) machine code. In the second test case, NEON-Bi1, the three color channels are calculated in parallel. Each of the four pixels needed for interpolation is allocated to one half of a 128-bit SIMD register. This allows each color channel to have 16 bits for calculations. Figure 2 shows how the registers are allocated for this test. There are four unused lanes (shown in grey) because the registers allow four lanes for each pixel, but each pixel has only three color channels.
The third test case for the bilinear interpolation, NEON-Bi2, calculates four pixels in parallel, but each color sequentially. Each pixel is allocated to a 32-bit lane of a SIMD register. The test closely follows the execution of the baseline, except that four pixels are processed simultaneously. Part of the algorithm is not parallelizable, and thus a loop is added to accomplish this part.
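A minimal sketch of processing four pixels in four 32-bit lanes is given below, assuming one color channel per call and a fixed-point weight f in the range 0..256. The function name lerp4_u32 is hypothetical and the code is not taken from the NEON-Bi2 test case; the scalar branch performs the same arithmetic on targets without NEON.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define USE_NEON 1
#else
#define USE_NEON 0
#endif

/* Blend one color channel of four pixels at once:
 * out = (a * (256 - f) + b * f) / 256, with one pixel per 32-bit lane. */
void lerp4_u32(uint32_t out[4], const uint32_t a[4], const uint32_t b[4],
               uint32_t f)
{
    assert(f <= 256);
#if USE_NEON
    uint32x4_t va  = vld1q_u32(a);               /* four pixels, one per lane */
    uint32x4_t vb  = vld1q_u32(b);
    uint32x4_t acc = vmulq_n_u32(va, 256 - f);   /* a * (256 - f) in each lane */
    acc = vmlaq_n_u32(acc, vb, f);               /* acc += b * f */
    vst1q_u32(out, vshrq_n_u32(acc, 8));         /* divide by 256 */
#else
    for (int i = 0; i < 4; ++i)                  /* scalar fallback */
        out[i] = (a[i] * (256 - f) + b[i] * f) >> 8;
#endif
}
```

The 32-bit lanes leave ample headroom for the intermediate products, at the cost of holding only four values per 128-bit register.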
The distortion algorithm has three test cases as well. The first test, NEON-Dist0, is the baseline as described in section 3.3. The next test, NEON-Dist1, uses the SIMD intrinsic functions to parallelize most of the image processing part of the code. In the final test, NEON-Dist2, the assembly code from NEON-Dist1 is modified.
The NEON-Dist1 test not only parallelizes the image processing, but it also incorporates other modifications to increase performance. To parallelize the image processing the SIMD register is split into four 32-bit lanes, which allows four pixels to be calculated during each iteration of the inner loop. The SetIndexX function is moved inside the GetDistortionVector function to eliminate a function call. The GetDistortionVector function is moved to the middle of the image-processing loop. Because this function uses mainly ARM instructions, and the image processing is done mainly using SIMD instructions, this change attempts to increase the concurrent use of the ARM processor and the NEON coprocessor. The compiler includes a function that prefetches the data that will be accessed next. The prefetch function is used to load the L2 cache with the next likely data for the source and destination images.
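GCC provides the builtin __builtin_prefetch for issuing such hints; whether the study used this exact builtin is an assumption. The sketch below inverts an 8-bit image buffer while hinting the next block of source and destination data. The function name invert_with_prefetch and the prefetch distance of 64 bytes are illustrative guesses, not values from the study.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_AHEAD 64  /* bytes ahead; a tuning guess for illustration */

/* Invert an 8-bit image buffer while hinting upcoming reads and writes. */
void invert_with_prefetch(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; ++i) {
#if defined(__GNUC__)
        /* Once per block, hint the block that will be touched next. */
        if ((i % PREFETCH_AHEAD) == 0 && i + PREFETCH_AHEAD < n) {
            __builtin_prefetch(src + i + PREFETCH_AHEAD, 0, 1);  /* read */
            __builtin_prefetch(dst + i + PREFETCH_AHEAD, 1, 1);  /* write */
        }
#endif
        dst[i] = (uint8_t)(255 - src[i]);
    }
}
```

A prefetch is only a hint and does not change the result, so the best distance has to be found by measurement on the target hardware.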
The final test case, NEON-Dist2, uses the assembly generated from the NEON-Dist1 test with a few modifications. The first modification forces the compiler to inline the SetIndexY and GetDistortionVector functions, to help reduce the stalls related to function calls. The second major modification is the switch from processing four pixels to eight pixels during every loop iteration. During the image-processing portion of the code not all NEON registers are being used, and thus processing eight pixels at a time allows registers to be more fully utilized. The third major modification changes the location of where static variables are saved. The GetDistortionVector function has many static variables. Initially, the memory address of these variables was saved on the stack. Loading or storing these variables meant loading the address from the stack, and then loading or storing the variable from this address. This is changed to loading or storing these variables directly from/to the stack, which eliminates a load instruction for each one.
RESULTS AND DISCUSSIONS
With SIMD instructions, the bilinear interpolation algorithm achieved a speedup of about two in its best test case on an image of one million pixels, interpolated using five different factors. The speedups of all three test cases are shown in Figure 3. As can be seen, the interpolation factor has only a small effect on the overall speedup. The average speedups for the NEON-Bi1 and NEON-Bi2 tests are 2.04 and 0.84, respectively; that is, the NEON-Bi2 test runs slower than the baseline.
Using SIMD instructions changes the number of L2 cache accesses and branch mispredictions. The NEON-Bi1 and NEON-Bi2 tests have more L2 cache accesses than the baseline, because the NEON coprocessor can directly access the L2 cache. The NEON-Bi2 test has about 29% more L2 cache accesses than the NEON-Bi1 test, which may contribute to its lower speedup. Branch mispredictions decrease by 72% for the NEON-Bi1 test but increase by 38% for the NEON-Bi2 test. With each misprediction causing a 13-cycle stall, this could also be a reason why the NEON-Bi2 test does not perform as well.
The NEON coprocessor works concurrently with the ARM processor for 56% and 77% of the cycles in the NEON-Bi1 and NEON-Bi2 tests, respectively. NEON-Bi2 is the better test by this metric, because fully utilizing the unit requires the ARM and NEON processors to work in parallel for as many cycles as possible. However, the NEON-Bi2 test has about 45% more stalls related to a full instruction or memory access queue than the NEON-Bi1 test. These stalls could also contribute to the low speedup of the NEON-Bi2 test.
The distortion algorithm's NEON-Dist1 and NEON-Dist2 test cases perform increasingly better than the baseline. These test cases use an image of eight million pixels and a 23 by 17 distortion matrix. Figure 4 shows the results from these test cases. The NEON-Dist1 and NEON-Dist2 tests have a speedup of 2.20 and 3.09, respectively. The NEON-Dist2 test outperforms the NEON-Dist1 test by removing unneeded instructions and making greater use of parallel processing, i.e. SIMD instructions. The speedup does not depend on the type of distortion as long as the matrix template is preserved.
The speedup in the NEON-Dist2 test case is mostly due to the use of SIMD instructions. Non-SIMD techniques, such as function inlining and changing the location of the static variables, account for 27% of the speedup. SIMD techniques, such as processing eight pixels at a time, account for 73% of the speedup.
The SIMD test cases have more L2 cache accesses, but fewer branch mispredictions. Similar to the bilinear interpolation algorithm, the SIMD instructions create more L2 cache accesses, because the NEON coprocessor can directly access the L2 cache. As a result, the NEON-Dist1 and NEON-Dist2 tests have about two times the cache accesses of the baseline. When using SIMD instructions the branch mispredictions decrease slightly. The NEON-Dist1 and NEON-Dist2 tests have 7% and 13% fewer branch misses, respectively. Although the change is small, the 13-cycle branch misprediction penalty makes any change significant. Fewer branch mispredictions therefore contribute to the speedup in the SIMD test cases.
The use of SIMD instructions shifts some of the workload from the ARM processor to the NEON coprocessor. Ideally, the two would work 100% in parallel, but in practice, this is also the source of additional stalls in the NEON pipeline. The two processors work concurrently for 36% and 50% of the total cycles in the NEON-Dist1 and NEON-Dist2 test cases, respectively. When adding more SIMD instructions, the NEON instruction and memory queues stall significantly more. In the NEON-Dist2 test, the stalls are more than double those of the NEON-Dist1 test case. Although there are more stalls from SIMD instructions, having the processors work more in parallel helps the NEON-Dist2 test outperform the NEON-Dist1 test.
CONCLUSION
This study shows that using SIMD instructions can provide significant speedups to image processing algorithms when processing multiple colors/pixels at a time is possible. Both algorithms under investigation, the bilinear interpolation and the distortion algorithms, achieved a speedup greater than two using SIMD intrinsic functions. Furthermore, the distortion algorithm achieved a speedup greater than three after modifications were made to the assembly code. Processing more pixels at a time, removing branches and memory accesses, prefetching data, and interleaving ARM and NEON instructions are all factors that contribute to the speedup. Although using SIMD instructions can increase performance, one has to consider the time required for the modifications necessary to achieve it. This time, especially at the assembly level, can become significant.
