Abstract -Optical Character Recognition (OCR) converts images of
Introduction
Smartphones and handheld devices have gained widespread popularity by placing compute power and novel applications conveniently in the hands of end users. The iPhone, iPad and Blackberry are good examples of this usage trend [18] [19]. The introduction of new low-power general-purpose processors such as Intel's Atom™ processor family enables future handhelds to run a larger base of general-purpose applications. One of the emerging consumer electronics applications that has entered the mobile domain is Optical Character Recognition (OCR). OCR converts images of handwritten or printed text captured by a camera or scanner into electronic text. It has been widely used in database systems, where scanned books and other documents are converted into text so that they can be accessed conveniently by search engines. While OCR continues to be used for databases and search engines on desktop/server platforms, it is gaining traction on mobile platforms as well [3]. One such usage model for OCR in the mobile domain is a reading device, which helps people with vision and reading problems by reading a book or other text aloud to them. Examples include the KNFB Reader Mobile, which is loaded onto Nokia cell phones, and the Intel Reader [1] [17].
The following is an example of the OCR usage model. A person is reading a book or magazine and has a smartphone or handheld device at hand. The person takes a picture of the reading material, as shown in Figure 1(a). The handheld device recognizes the text in the image and converts it into electronic text, as shown in Figure 1(b).
Once the text is available, it can be read out to the reader or even translated into other languages, as in [2]. Though image capture and text post-processing are also involved in this usage model, we focus on OCR due to its performance constraints. [7]. A few works, such as [23], focus on acceleration of Arabic script OCR on DSP processors. As OCR becomes a popular workload for handhelds, where both power and performance are major concerns, it is important to understand its performance characteristics so that architectural improvements can be made to improve the user experience. We address this by analyzing the compute and memory requirements of an OCR workload on an Intel® Atom platform.
In this paper, we take a reference software implementation [8] of an OCR application, characterize its various phases, analyze its compute requirements, identify the key hotspot functions and propose software optimizations. To the best of our knowledge, this is the first work with a detailed performance analysis of OCR workloads on handheld platforms. The main contributions of this paper are:
1. Detailed compute, cache and memory characterization of OCR software.
2. Analysis of the key hotspot functions of OCR in terms of their scaling behavior.
3. Analysis of performance optimizations such as compute, multi-threading, and sampling step optimizations.
4. Design of a hardware accelerator for a hotspot in OCR.
The rest of this paper is organized as follows. Section 2 gives an overview of an OCR system (OCRopus) [8] and describes the algorithm phases it uses. Section 3 describes the profiling characteristics of the OCRopus software on the Atom platform and analyzes the key hotspot functions. Section 4 describes a set of software optimizations that we implemented and shows their associated performance benefits. Section 5 describes the hardware accelerator for a hotspot. We conclude in Section 6 by outlining the direction for future work on this topic.
[11]. Column finding identifies whitespace using a maximal whitespace rectangle algorithm to find vertical whitespace rectangles with a high aspect ratio. Rectangles that are adjacent to character-sized components on their left and right sides are likely to be column boundaries. Figure 3 shows the input and output of the layout analysis phase. The text in the input image is bounded by boxes, the paragraphs are segmented from each other, and columns are identified and demarcated.
Tesseract recognition phase: Tesseract uses a two-pass shape comparison between protos (character shapes in the database) and features identified in the input image. In the first pass, the words that are recognized with good accuracy are passed to an adaptive classifier as training data. As the classifier is updated to adapt to the input words, it does a better job on text lines further down the page. Figure 4 shows the computation involved in the recognition phase between the identified features and the protos for the character 'h'. The coordinates x and y represent the row and column position of the features/protos, and theta is the angle.
Statistical language modeling: This phase resolves ambiguous characters obtained in the previous phase. Statistical language models can be dictionaries, stochastic grammars, etc. OCRopus uses the open source OpenFST library [12] as its language modeling tool. This phase improves the accuracy of text recognition.
OCR Application Overview

Components that we study
Among the four major phases, the language modeling phase is optional in OCRopus. Therefore, we focus only on the first three phases in this paper. While pre-processing is supposed to be the very first step, some of its functions are scattered among other phases for performance efficiency. For example, image smoothing for noise reduction is performed after layout analysis, because this function is needed only for text line regions and not for white space. Therefore we break down the OCRopus code into four components: binarization, segmentation (which represents the layout analysis phase), image smoothing and text recognition. We focus on these four components and their performance in this paper.
OCR Application Performance on Atom-based Platform
In this section, we analyze the OCRopus implementation running on the Atom platform. We start with an overall performance analysis and then dive into each component.
Platform Configuration and inputs
Our measurements are based on a 1.6 GHz Atom processor running CentOS 4.1 with Linux kernel 2.6. The CPU has hyperthreading enabled and supports 2 hardware threads. The core has a 24KB L1 data cache with 6-way associativity, a 32KB instruction cache with 8-way associativity, and a 512KB unified L2 cache with 8-way associativity. The front-side bus runs at 400 MHz and is connected to 1GB of DDR2-533 DRAM. The performance of OCRopus depends on various image parameters, as listed in Table 1. Although other factors such as skew degree and clarity can affect performance, these factors are the important and representative features of images that allow us to understand the OCRopus behavior. We used the sample input images from the OCRopus package as well as images that we took of random text pages. Only images with very high recognition accuracy are considered for this study.

Table 1. Input Image Parameters
Factor                                   Values
Image resolution                         1-15 megapixels (MP)
Number of text lines                     1, 5, 10, 100, 200
Character pixels / total pixels ratio    30-60%

Figure 5 shows the overall execution time as a function of image resolution. It also shows the breakdown for the four major steps. The base input image has a resolution of 5 megapixels (MP), and we scaled the same image to various resolutions for this experiment.
Overall Performance
This study shows that although the total execution time increases with the image resolution, resolution has a different impact on each of the four components. Binarization and smoothing time increase linearly with resolution, whereas the recognition time stays almost the same for all but the 1MP image. As mentioned earlier, there are various factors besides resolution that impact the execution time. Therefore, in the following subsections, we analyze each component in detail and show its scaling behavior.
The binarization phase converts the grey-scale or colored image into a binary image. It computes a threshold based on the average pixel value of the entire image and sets each pixel in the binarized image to 0 or 1 by comparing its original value to the threshold. Hence, the higher the image resolution, the more time is spent on binarization. However, this phase constitutes a negligible part of the total execution time compared to the other three components. Therefore we focus on the other phases in the scaling behavior studies.
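The global-threshold binarization described above can be illustrated with a minimal sketch. It is not the OCRopus code: the function name `binarize`, the list-of-rows image layout and the polarity (values above the mean map to 1) are assumptions made for illustration only.

```python
def binarize(gray):
    """Binarize a grayscale image (list of rows) with a global mean threshold."""
    total = sum(sum(row) for row in gray)
    count = sum(len(row) for row in gray)
    threshold = total / count  # average pixel value of the entire image
    # Assumed polarity: pixels brighter than the mean become 1, the rest 0.
    return [[1 if px > threshold else 0 for px in row] for row in gray]


# Hypothetical 2x3 grayscale image; its mean is 715 / 6, roughly 119.2.
gray = [[10, 200, 30],
        [220, 15, 240]]
binary = binarize(gray)
print(binary)  # [[0, 1, 0], [1, 0, 1]]
```

Every pixel is visited a fixed number of times (once for the mean, once for the comparison), which is consistent with the linear scaling with resolution reported for this phase.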
Scaling Behavior
In this section, we measure the execution time of each OCR component and analyze its scaling behavior as a function of the parameter it depends on. Since the various OCR components scale differently with respect to resolution and text content, we characterize each phase independently of the remaining phases.
1) Segmentation
Segmentation, which represents the layout analysis phase, consists of four main functions, as listed in Table 2. This is a compute-intensive phase, as seen in Figure 5. Each input image factor has a different impact on these functions; therefore segmentation is affected by multiple factors to varying degrees. We characterize these factors separately, function by function.
To understand the scaling behavior of the remaining functions in segmentation, we used a metric called the character-to-total pixels ratio, which represents the text content in an image. Character pixels are the pixels that are classified as part of a character (data) during the segmentation phase; the rest are classified as non-data pixels. These non-data pixels are separate from the image area that is classified as column boundaries. We used this metric because we observed that different font sizes can lead to different amounts of text content in an image, and OCR processing depends on pixels rather than on the actual content. We fixed the image resolution at 8 MP for this study and varied the ratio of character to total pixels by increasing the data content in the image. The results are shown in Figure 7. As expected, the first two functions were not affected by the image content, as the resolution was fixed. As we increased the character-to-total pixels ratio, the execution time of the compute() function decreased while that of extract() increased. As described in Table 2, the compute() function examines only non-data pixels and searches for the maximal whitespace rectangle to identify columns. Its execution time therefore grows with the amount of whitespace and shrinks as the ratio increases. On the other hand, the extract() function depends on text regions, i.e. the actual character content in the image. Hence its execution time increases with the text content (i.e. character pixels).

Figure 7. Impact of character-total pixels ratio on Segmentation

2) Image smoothing
Image smoothing removes noise and smooths the boundaries of the characters. Binary morphology [13] [14] is widely used in this process, and the main function used in this implementation is binary dilation. This function calculates the maximal value around each pixel within a given radius. In Figure 8, each box represents a pixel. The central red box is the pixel currently being smoothed. When the radius is set to three, all the yellow pixels fall within a circle centered on the red box. The maximum of these 29 values is calculated and used as the value for the new image.
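To make the radius-3 neighbourhood concrete, the following sketch enumerates the pixel offsets inside the circle and applies a straightforward binary dilation. It is an illustration, not the OCRopus implementation: the helper names `disk_offsets` and `dilate` are hypothetical, and ignoring neighbours that fall outside the image is an assumed border policy.

```python
RADIUS = 3


def disk_offsets(radius):
    """All (dy, dx) offsets whose distance from the centre is at most radius."""
    return [(dy, dx)
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)
            if dy * dy + dx * dx <= radius * radius]


def dilate(image, radius=RADIUS):
    """Binary dilation: each output pixel is the maximum over the disk around it.

    Neighbours outside the image are skipped (assumed border policy).
    """
    h, w = len(image), len(image[0])
    offsets = disk_offsets(radius)
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            best = 0
            for dy, dx in offsets:
                yy, xx = y + dy, x + dx
                if 0 <= yy < h and 0 <= xx < w:
                    best = max(best, image[yy][xx])
            out[y][x] = best
    return out


print(len(disk_offsets(RADIUS)))  # 29

# A single set pixel in the middle of a 9x9 image grows into a 29-pixel disk.
image = [[0] * 9 for _ in range(9)]
image[4][4] = 1
dilated = dilate(image)
print(sum(map(sum, dilated)))  # 29
```

The count of 29 comes from rows of 1, 5, 5, 7, 5, 5 and 1 pixels, which is the same row structure that the hardware accelerator section relies on.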
Figure 8. Image smoothing process (input and output)

Figure 9 shows the impact of image resolution on the image smoothing process. We scaled a 15MP image down to 1MP for this study. The execution time scales linearly with the resolution. Although we show resolutions from 1 to 15 megapixels for our scaling study, we have observed that this trend holds at intermediate resolutions as well, because smoothing is applied to each pixel of the segmented image. Therefore this step takes a significant share of the total processing time.

3) Recognition
In our studies, we do not consider clarity, as there is no good metric to define it. Figure 10 shows the execution time for sample images with the number of text lines varied from 1 to 5. We observe that, with other parameters such as image resolution held constant, the recognition time scales linearly with the number of lines.

Figure 10. Impact of line count on recognition

We also found that, for a given resolution, the recognition time depends primarily on the actual text content in an image. This is a more general case (superset) of the above example, in which we varied the line count. Figure 11 shows the recognition time for various character-to-total pixels ratios of a 5MP image. As seen in Figure 12, the recognition time depends strongly on the character-to-total pixels ratio but does not increase linearly with it. Besides the line count and text content ratio, recognition time can be affected by other factors such as the clarity and sharpness of the image. The investigation of these factors will be part of our future work.

Figure 13 shows the various architectural characteristics of OCR measured using the VTune [15] utility for different image resolutions. Figure 13(a) shows that the Cycles per Instruction (CPI) is high for all phases of OCR. This is due to the low throughput of the in-order Atom core.

Figure 13. Architectural Characteristics of OCR Components

CPI remains constant for the image smoothing and recognition phases across all image resolutions but increases significantly (by almost 50%) at 1MP for the segmentation phase. This is due to the increased L1 data cache Misses per Instruction (MPI) at 1MP, as shown in Figure 14(a). Figure 13(b) shows the L2 MPI and Figure 13(c) shows the memory bandwidth for the various phases. These graphs highlight the low L2 miss rate and show the relatively low memory bandwidth utilization (~200MB/s, which is less than 8% of the maximum throughput available in the platform).
Figure 14 shows the various cache statistics obtained using CMPsim, a Pin-based cache simulator [16]. Figure 14(a) highlights the L1 MPI for the segmentation phase. L1 MPI increases by almost 50% at 1MP, a significant degradation compared to 5MP. This contributes to the higher CPI of the segmentation phase at 1MP observed in Figure 13(a). The segmentation phase has significant spatial locality. Hence, at higher resolutions, the effectiveness of the prefetchers becomes pronounced and the L1 data cache MPI is reduced. Figure 14(b) shows the L2 MPI of the workload using CMPsim. We used a 5MP input image for this study and varied the L2 cache size from 512KB to 2MB with 8-way associativity. The results show that the working set size for a 5MP image is around 1MB. L2 MPI varies from 0.0025 for a 128KB cache to around 0.001 for a 1MB cache. This shows that this workload is not memory bound and can fit in a small cache (512KB) such as the one found in the Atom platform. These results, along with the VTune results, corroborate that the application is compute bound.
Architectural Characteristics of OCRopus
Figure 13 (c). DRAM Bandwidth
Figure 14 (b). OCR L2 MPI for various cache sizes
Software optimizations
As shown in previous sections, the execution time for text recognition is on the order of several seconds, which is not appealing for real-time interactive usage. Further, the execution time increases with the image resolution, and our empirical analysis shows that a 5 or 8 megapixel image resolution is needed to achieve accurate recognition in a reasonable time. Moreover, these resolutions are supported by handheld devices as well [17] [20].
In this section, we analyze software optimizations for the various OCR phases to speed up their execution on the Atom processor.
Image smoothing optimizations
As shown in previous sections, image smoothing takes a significant amount of time in the OCRopus implementation, especially at higher image resolutions. Hence image smoothing, which depends on the image resolution, is one of the main hotspots of OCR. Image smoothing is performed on every pixel of the binarized image. Each new pixel value is computed independently of the neighboring new pixel values. As there is no data dependency in this process, we start with multithreading. Figure 16 shows the results of the various software optimizations at different resolutions for the image smoothing phase.
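Because each output pixel depends only on the input image, the dilation can be split into independent row bands, one per hardware thread. The sketch below illustrates only this partitioning; it uses Python's `concurrent.futures` rather than pthreads, the names `dilate_rows` and `dilate_threaded` are hypothetical, and in CPython the threads will not deliver the speedup measured on Atom.

```python
from concurrent.futures import ThreadPoolExecutor


def dilate_rows(image, row_start, row_end, radius=3):
    """Dilate rows [row_start, row_end); reads the shared input, returns a new band."""
    h, w = len(image), len(image[0])
    r2 = radius * radius
    band = []
    for y in range(row_start, row_end):
        out_row = []
        for x in range(w):
            best = 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if dy * dy + dx * dx <= r2 and 0 <= yy < h and 0 <= xx < w:
                        best = max(best, image[yy][xx])
            out_row.append(best)
        band.append(out_row)
    return band


def dilate_threaded(image, workers=2):
    """Split the rows into `workers` contiguous bands and dilate them concurrently."""
    h = len(image)
    bounds = [(i * h // workers, (i + 1) * h // workers) for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        bands = list(pool.map(lambda b: dilate_rows(image, b[0], b[1]), bounds))
    # pool.map preserves input order, so the bands can be concatenated directly.
    return [row for band in bands for row in band]


# Hypothetical 10x8 test pattern.
image = [[1 if (3 * y + 5 * x) % 13 == 0 else 0 for x in range(8)] for y in range(10)]
sequential = dilate_rows(image, 0, len(image))
threaded = dilate_threaded(image, workers=2)
print(threaded == sequential)  # True
```

Splitting by columns instead of rows would follow the same pattern, with each worker owning a range of `x` values.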
1) Multi-threading (MT):
We make use of the multithreading capability (2 hardware threads) available in Atom. We threaded this function using the pthreads library in Linux. We partitioned the binary dilation across rows and across columns and found the results to be similar. As shown in Figure 16, multithreading improves the performance by about 24%.

2) Computation Optimization (CO):
Figure 15 shows the pseudo code for the base implementation of this function along with comments. The pixel for which binary dilation is computed is loaded again for each comparison, as shown in lines 6 and 7, which leads to 29 additional loads. Furthermore, boundary conditions are checked for each pixel before the actual computation, as shown in line 7. Therefore, we rewrote the code to eliminate the unnecessary computations. These miscellaneous compute optimizations (CO) yield a significant reduction in runtime. Our results show that CO alone improves the performance by over 3X.

3) Sampling (SP):
[22]. Instead of computing the maximum value for each pixel, we modified the algorithm to compute the maximal value for every other pixel. This effectively reduced the number of pixels computed by 75%, as we skipped the even rows and columns. By sampling, we were able to reduce the execution time significantly while maintaining the recognition accuracy. Our results show that sampling reduces the execution time by about 2.5X relative to the code that already incorporates the other optimizations. We also combined the various optimizations (MT+CO), (MT+CO+SP), and the results are shown in Figure 16. With all three optimizations, the execution time for binary dilation can be reduced by as much as 9X.
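The compute and sampling optimizations for binary dilation can be sketched together as follows. This is one possible reading of the text, not the authors' code: the image is padded once so that the inner loop needs no border check, the running maximum is kept in a local variable instead of reloading the centre pixel, and a hypothetical `step` parameter selects every other row and column. Leaving skipped pixels at their input value is an assumption, since the text does not say how they are filled.

```python
def dilate_optimized(image, radius=3, step=1):
    """Disk dilation with one-time zero padding and optional sampling.

    Returns (output image, number of pixels actually computed).
    """
    h, w = len(image), len(image[0])
    blank = [0] * (w + 2 * radius)
    # Pad once so that no per-pixel boundary check is needed in the inner loop.
    padded = ([list(blank) for _ in range(radius)]
              + [[0] * radius + list(row) + [0] * radius for row in image]
              + [list(blank) for _ in range(radius)])
    offsets = [(dy, dx)
               for dy in range(-radius, radius + 1)
               for dx in range(-radius, radius + 1)
               if dy * dy + dx * dx <= radius * radius]
    out = [list(row) for row in image]  # assumption: skipped pixels keep their input value
    computed = 0
    for y in range(0, h, step):
        for x in range(0, w, step):
            best = 0  # running maximum held locally, no repeated centre-pixel load
            for dy, dx in offsets:
                v = padded[y + radius + dy][x + radius + dx]
                if v > best:
                    best = v
            out[y][x] = best
            computed += 1
    return out, computed


image = [[0] * 8 for _ in range(8)]
image[4][4] = 1
full, n_full = dilate_optimized(image, step=1)
sampled, n_sampled = dilate_optimized(image, step=2)
print(n_full, n_sampled)  # 64 16
```

With `step=2` only 16 of the 64 pixels are computed, which matches the 75% reduction in computed pixels described above. Whether rows are counted from 0 or 1 decides which ones are called "even"; the sketch simply starts at index 0.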
Segmentation optimization
After examining the code, we primarily applied the multi-threading optimization to segmentation. The various functions in the segmentation phase iterate over pixels or bounding boxes (identified earlier in the phase). The segmentation phase has a high CPI due to the in-order nature of Atom. The independent instructions that would fill the instruction pipeline on an out-of-order processor cannot be issued early on an in-order core. By executing independent threads simultaneously, hyper-threading makes better use of the available issue slots and helps to remove some of the pipeline bubbles. Figure 17 shows that multi-threading improves execution time by about 27% for various image resolutions. Every function in segmentation benefits equally from multi-threading, as they all exhibit similar behavior.
Recognition optimization
This phase has the highest CPI among the OCR components due to significant data dependency among instructions. Hence multi-threading provides the maximum benefit for this phase. The multi-threading benefits are independent of the image resolution and text content, as the recognition of each character can be executed in parallel. Figure 18(a) shows the multi-threading results for different text content in the image, and Figure 18(b) shows them for various resolutions. The execution time is reduced by about 31% for this phase.

Base binary dilation pseudo code (Figure 15), inner loop:
    5. for(t=0; t<image.length; t++) //stride through columns
    6.   new_image(p,t) = //load pixel and compare
    7.     max(image(p,t), image(check_border(p-i,t-j)))

Figure 19 summarizes all the software optimizations for the various OCRopus phases. We reduced the overall execution time by at least 2X for a 5MP image and the image smoothing time (a hotspot) by almost 9X using various software and algorithm optimization techniques. The performance improvements are significant across all resolutions and increase with resolution.
Software optimizations summary
Hardware Acceleration
We designed and implemented a hardware accelerator to speed up the image smoothing process and optimize for power. A naïve approach is to calculate the maximum of the 29 values surrounding the current pixel (marked as the red box in Figure 8), then move the current pixel rightwards (and downwards to the leftmost pixel after a whole line is finished). However, in this naïve approach, each value is read from memory many times. A more efficient way is to read each pixel from memory just once and save the temporary maximum values of the adjacent 5 and 7 values in registers for later use. The key component of this implementation, called the computing unit (CU), consists of two comparators, seven Row Registers (RR) and one control unit. As shown in Figure 20, one row (7 pixels) of the image is read from SRAM. The first comparator calculates three maximum values, over 1, 5, and 7 pixels respectively, and stores them in a 3-byte row register. Then the next row of 7 pixels is read and its three maximum values are calculated and stored in the next row register, and so on. After seven rows of pixels have been read and processed, the seven row registers are filled with the maximum values for each row. Then the control unit selects one byte from each of the row registers and feeds them into the second comparator. This comparator calculates the maximum value among the seven bytes, which is the final value for the new pixel. The new value can then be stored back to the SRAM. The image smoothing process is performed on a binarized image, where the pixel values are 0 or 1 (indicating a black or white pixel). Each pixel is stored as a byte but takes only these two values, so we replaced the comparators with OR logic to obtain the maximum value. This reduced the execution time and die area.

Figure 20. Overview of a Computing Unit

Note that the control unit controls which row register to fill in the next iteration and ensures that the appropriate byte from each row register is enabled.
Essentially, each row register is read seven times before it is overwritten. It provides one byte in each cycle in the following order: M1, M5, M5, M7, M5, M5 and M1 (M1, M5 and M7 represent the maximum value of the adjacent 1, 5 and 7 pixels around the current pixel). For instance, to calculate the first pixel, M1 from RR0 and RR6, M7 from RR3, and M5 from the rest of the row registers are chosen. When the next row is processed, RR0 is overwritten with the three new maximum values, and the next pixel is calculated using M1 from RR0 and RR1, M7 from RR4, and M5 from the rest of the row registers. An alternative way of controlling the row registers is to always store the three values of a processed row into RR0 and to shift each register to the next one in each cycle. For example, RR0 is shifted to RR1, RR1 to RR2, and so on. In this case, we always read M1 from RR0 and RR6, M7 from RR3, and M5 from the rest of the registers. However, this second approach incurs increased energy consumption. Therefore we adopted the first approach.
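The row-register scheme can be checked with a small software model. The sketch below is a behavioural illustration only, not the RTL: `row_maxima`, `cu_column` and `direct_disk` are hypothetical names, the row registers are modelled as a circular buffer that overwrites the oldest entry, and OR stands in for the comparators as described for 0/1 pixels.

```python
SELECT = (0, 1, 1, 2, 1, 1, 0)  # per-row choice, top to bottom: M1, M5, M5, M7, M5, M5, M1


def row_maxima(row7):
    """First stage: (M1, M5, M7) of a 7-pixel row centred on index 3."""
    m1 = row7[3]
    m5 = row7[1] | row7[2] | row7[3] | row7[4] | row7[5]
    m7 = m5 | row7[0] | row7[6]
    return (m1, m5, m7)


def cu_column(rows):
    """Stream 7-pixel rows (top to bottom) through seven row registers.

    Produces one output pixel per row once the first seven rows are loaded.
    """
    regs = [None] * 7  # RR0..RR6
    out = []
    for n, row in enumerate(rows):
        regs[n % 7] = row_maxima(row)  # overwrite the oldest register
        if n >= 6:
            top = (n + 1) % 7  # register holding the top row of the window
            value = 0
            for k in range(7):
                value |= regs[(top + k) % 7][SELECT[k]]  # second stage, OR logic
            out.append(value)
    return out


def direct_disk(padded, y, x, radius=3):
    """Reference: maximum over the 29-pixel disk centred at padded[y][x]."""
    return max(padded[y + dy][x + dx]
               for dy in range(-radius, radius + 1)
               for dx in range(-radius, radius + 1)
               if dy * dy + dx * dx <= radius * radius)


# Hypothetical 12x10 binary image with 3 padding pixels on every side.
H, W = 12, 10
image = [[1 if (5 * y + 3 * x) % 11 == 0 else 0 for x in range(W)] for y in range(H)]
padded = ([[0] * (W + 6) for _ in range(3)]
          + [[0] * 3 + row + [0] * 3 for row in image]
          + [[0] * (W + 6) for _ in range(3)])

hw = [cu_column([padded[y][x:x + 7] for y in range(H + 6)]) for x in range(W)]
ref = [[direct_disk(padded, y + 3, x + 3) for y in range(H)] for x in range(W)]
print(hw == ref)  # True
```

For the first output the model reads M1 from RR0 and RR6 and M7 from RR3; after the next row overwrites RR0 it reads M1 from RR1 and RR0 and M7 from RR4, matching the register selection described above.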
Once the first pixel value for the new image is obtained, the CU pipeline is filled. As we keep reading and processing the next row in the CU, we obtain a column of new pixels at a rate of one pixel (byte) per CU cycle, assuming that one row of seven pixels can be read from SRAM in one CU cycle. To obtain multiple columns of new pixels, we can use multiple CUs simultaneously. As shown in Figure 21, we can process M*N pixels using M CUs in (T + N) cycles, where T is the initialization time before the pipeline is filled. To speed up boundary pixel processing, we designed a boundary control unit (BCU). The BCU identifies boundary pixels and inserts padding pixels into the SRAM when the input image is loaded. For an image size of X*Y pixels, the BCU inserts 6X+6Y+36 padding pixels in total. If the SRAM can hold (X+6)*(Y+6) pixels, the new image with X*Y pixels can be obtained in X*(T+Y)/M cycles. Clearly, the performance can be improved with more CUs. However, this speedup is limited by the data transfer time (between memory and SRAM), which is limited by the main memory bandwidth. Our approach finds a reasonable M such that the next line is transferred to SRAM before it is needed by the CU, i.e., the computation time and the transfer time can be overlapped.

Figure 21. Using multiple CUs to process columns in parallel

The SRAM in our accelerator is divided into input and output halves to hold the input and output image respectively. Let us assume an input SRAM with X*Y bytes. The data transfer efficiency increases with X, as each byte is used seven times. In addition, the larger Y is, the higher the computation efficiency we can achieve, since the initialization time (to fill the pipeline) is better amortized. However, since a larger SRAM incurs more area, we chose a 128B by 256B (32KB) SRAM for both input and output, for a total of 64KB. When the 64B output is ready, the data is transferred back to the main memory.
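The padding and cycle-count expressions above can be evaluated with a short helper. This is a sketch under stated assumptions: the function names are hypothetical, the pipeline fill time T is not given in the text and is set to 7 cycles purely for illustration, and the 122 by 250 tile is chosen only because, with 3 padding pixels per side, it exactly fills a 128 by 256 byte SRAM.

```python
def padding_pixels(x, y, border=3):
    """Padding added around an X*Y tile: (X+6)*(Y+6) - X*Y = 6X + 6Y + 36 for border=3."""
    return (x + 2 * border) * (y + 2 * border) - x * y


def smoothing_cycles(x, y, num_cus, fill_cycles):
    """Cycle estimate X*(T+Y)/M for X columns of Y pixels processed by M CUs."""
    return x * (fill_cycles + y) / num_cus


X, Y = 122, 250  # hypothetical tile size
T = 7            # assumed pipeline fill time in cycles (not stated in the text)

print(padding_pixels(X, Y))            # 2268
print((X + 6) * (Y + 6))               # 32768
print(smoothing_cycles(X, Y, 1, T))    # 31354.0
print(smoothing_cycles(X, Y, 2, T))    # 15677.0
```

Because T is paid once per column, taller tiles (larger Y) amortize it better, which is the efficiency argument made for the SRAM dimensions.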
We implemented this accelerator on a Xilinx 110T FPGA running at 250 MHz. Table 3 lists the detailed information about our implementation with 1 and 2 CUs along with 64KB of SRAM. The configurations with 1 and 2 CUs were chosen based on the sustainable memory bandwidth available in the platform, which is about 2GB/s. Figure 22 shows the execution time of image smoothing (in milliseconds) using our accelerator for various image resolutions. For a 5MP image, the execution time with 2 CUs is about 12 ms, which is 33 times faster than our optimized code.
We also synthesized our accelerator using 65nm technology. Table 4 lists the area and power consumption for 3 different frequencies. Each of these implementations gives performance comparable to or better than the FPGA. This implementation adds negligible overhead to an SoC platform in terms of power and area. Figure 23 shows the final execution time of OCRopus with all the software and hardware optimizations. The execution time has been reduced significantly, from ~10 seconds to almost 4 seconds for a 5MP image. Figure 24 shows the energy-delay product of the image smoothing process with the software and hardware optimizations on a logarithmic scale. We measured the Atom's dynamic CPU power with a power meter to be 700mW and plotted the energy-delay product for this phase compared to an FPGA implementation using 2 computation units. The energy-delay product has been reduced by orders of magnitude for a 5MP image compared to the base and software-optimized code for this phase.
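As a rough sanity check on these figures, the ideal accelerator time and the energy-delay product can be computed as below. This is a back-of-envelope sketch, not a measurement: it assumes a 5MP image has 5,000,000 pixels, one pixel per CU per cycle, and no pipeline fill or transfer overhead. The software time of about 0.396 s is derived here from the reported 12 ms and 33X figures, and the function names are hypothetical.

```python
def ideal_smoothing_time_s(pixels, num_cus, clock_hz):
    """Lower bound on smoothing time: one pixel per CU per cycle, no overheads."""
    return pixels / (num_cus * clock_hz)


def energy_delay_product(power_w, time_s):
    """Energy-delay product: (power * time) * time, in joule-seconds."""
    energy_j = power_w * time_s
    return energy_j * time_s


t_ideal = ideal_smoothing_time_s(5_000_000, 2, 250e6)
print(round(t_ideal * 1e3, 3))  # 10.0 (ms), close to the reported ~12 ms

# Software side: 700 mW dynamic CPU power, time derived as 12 ms * 33.
t_software = 0.012 * 33
print(round(energy_delay_product(0.7, t_software), 4))  # about 0.1098 J*s
```

Since the delay enters the product squared, a 33X reduction in time alone lowers the energy-delay product by roughly three orders of magnitude at equal power, which is in line with the trend described for Figure 24.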
Conclusions
In this paper, we analyzed the execution time of OCRopus processing on the Intel Atom CPU for handheld devices. We showed that the base software implementation requires more than 10 seconds for OCR processing even on a 1.6GHz core. We presented several software optimizations (multithreading, image sampling) as well as hardware acceleration for the OCRopus hotspots, implemented them, and showed that they can improve the overall processing time by as much as 2X for a 5MP image and by almost an order of magnitude for a hotspot. We also described our hardware accelerator for image smoothing, a hotspot in OCR, which reduces the power consumption significantly. As part of future work, we are looking into accelerator designs for the other phases, segmentation and recognition, to further reduce execution time and enable real-time recognition in text-to-speech or other interactive applications. We also plan to enhance the software optimizations using the vectorization available in the platform. Finally, we would like to characterize OCR performance with respect to precision for various inputs and characterize the application based on image clarity.
