Abstract. The maximum subarray problem is used to identify the subarray of a two-dimensional array, where the sum of elements is maximized. In terms of image processing, the solution has been used to find the brightest region within an image. Two parallel algorithms of the maximum subarray problem solve this problem in OðnÞ and Oðlog nÞ time. A field programmable gate array implementation has verified theoretical maximum performance; however, extensive customization is required, restricting general application. A more convenient platform for this work is a graphics processor unit since it offers a flexible trade-off between hardware customization and performance. Implementation of the maximum subarray algorithm on a graphics processor unit is discussed in this article for rectangular solutions and convex extensions are explored.
Introduction
Efficient algorithms applied to two-dimensional (2-D) (image) data are of significance to meet real-time performance demands of today's mixed signal technologies. High sample rates supported by state-of-the-art radio, optical, and x-ray sensor technology often exceed signal processing capabilities of computer hardware. Thus, the adaptation of traditional image processing algorithms to meet these demands, particularly in research fields such as astronomical and medical imaging, is of paramount importance.
A commonly used image processing function is blob detection. 1 Blob detection is a generalized term used in computer vision that encompass a variety of methods to detect points or regions within an image that vary from neighbouring regions, such as in brightness or color. Extensions to the maximum subarray problem (MSP) over either rectangular and, more recently, convex regions, employing the maximum convex sum problem (MCSP), provide an alternative approach to existing methods of blob detection. However, the ease with which such methods can be supported as a mesh structure, directly implementable in parallel hardware configurations for significant speed-up, is of prime consideration and forms the basis of this article.
The organization of this article is as follows. Section 2 provides a descriptive background and formal definition of the MSP, including K-maximum and convex sum extensions. In Sec. 3, an overview of suitable hardware platforms for the implementation of these algorithms is given. Section 4 details an image processing application that is suitable for MSP implementation on a general-purpose graphical processor unit (GPU). The results of this implementation are presented in Sec. 5. Last, Sec. 6 concludes with a summary of this article and future work is described.
2 Maximum Subarray Problem MSP can be defined as follows. Consider a collection of onedimensional or 2-D array elements, where the selection of a segment of consecutive array elements will form the largest possible sum of all possibilities over the given array. An efficient method is required to return the coordinates, i.e., the location of the maximum subarray. In terms of a 2-D array used to represent digital images, the upper bound based on serial processing is cubic or near cubic time. 2 For example, let a 2-D array, a½1 · · · m; 1 · · · n, be given as input, then the MSP is to maximize the array portion a½k · · · i; l · · · j and to return the indices ðk; lÞ and ði; jÞ representing the top-left and bottom-right positions of the rectangular maximum sum. Let a be defined as 0
Then the maximum sum, s ¼ 15, is given by the rectangle defined by the upper-left corner (3, 5) and the lower-right corner (4, 6) , as shown by the vertical bars.
The MSP was originated by Bentley in 1984. 3 An iterative approach was taken by Bentley using Kardane's method on extended rows to find the maximum subsequence within a 2-D array and was of Oðm 2 nÞ time complexity for an m × n array ðm ≤ nÞ, which is cubic when m ¼ n.
A fundamental requirement for the correct operation of these algorithms is that data are represented either as signed integer or floating point, i.e., single or double precision, IEEE 754, such that the array contains both positive and negative values. 3 For example, if the array contained only unsigned integers, the maximum sum would be the entire array. Similarly, if all elements have negative values, the solution is an empty subsequence.
Parallel 2-D Prefix-Sum Method
The parallel prefix-sum method employs a list of prefix sums to determine the maximum subarray. Each prefix sum, s, is updated progressively as both row, i, and column, j, indices iterate through a 2-D array, a½i; j. While the prefix sum is accumulated, the minimum of the preceding prefix sums is also maintained. By subtracting the minimum prefix sum from the prefix sum, a candidate is found, which may be the solution, "sum", for the maximum subsequence problem. Two registers, hð·Þ and gð·Þ, act as array pointers and are continuously updated to provide the scope of row-wise prefix sum, rð·Þ, and vertical accumulation of the row-wise prefix sum, sð·Þ, respectively. A schematic that shows a graphical representation of the prefix-sum method is shown in Fig. 1 .
The location of the maximum subarray found at each iteration of i and j within the region ½1; · · · ; i; 1; · · · ; j is stored in row and column registers, ði 1 ; j 1 Þjði 2 ; j 2 Þ. These registers represent the top-left and bottom-right location of the maximum subarray, respectively. A formal proof of the MSP is presented in the Appendix.
Parallel 2-D Mesh-Kadane Method
The implementation of the mesh algorithm for a 2-D matrix is similar to the prefix-sum method. Both algorithms are mesh-based, i.e., a processor and control unit are allocated to each cell; for an image processing application, each cell represents a single pixel. Thus, a fine-grained architecture is defined where each cell receives data from above and from the left neighboring cell. A control unit is used to synchronize the transfer of data from neighboring cells. For example, cell (2, 2), as shown in Fig. 2 , passes to cells (3, 2) and (2, 3) the partial maximum sum, s, of the group of cells directly above and to the left, and the location of the maximum sum region. After each iteration, a diagonal of updated cells propagate from the top-left toward the bottom-right of the array. A special case is considered for boundary cells. A detail of Bae's original architecture, based on a systolic array of cells, is shown in Fig. 2 .
The mesh-Kadane algorithm was designed for a parallel programming environment and therefore is suitable for either GPU implementation or custom VSLI programmable logic. As with the prefix sum, both algorithms are heavily dependent on computer hardware resources; this is discussed in Secs. 6.2 and 6.3. A detailed discussion of both serial and parallel variants of the MSP is provided by Bae and Takaoka. 4 
Formal Definition of the Mesh Algorithm for
the Maximum Subarray Problem In this section, we prove the correctness of the algorithm using an extended Hoare logic. Our assertion involves three indices ði; j; kÞ and several arrays indexed by i, j, and k. The algorithm has a sequential structure of the outermost loop, by k. Inside this loop, there is a parallel control structure for all i and j.
In Fig. 3 , at cell ði; jÞ, a½i½j is added to row r if j ≤ k for all i and j in parallel. Arrays max, min, s, and solution are similarly updated in parallel. In the following description of the mesh algorithm, max½i½j is effectively floating, i.e., the left boundary and right boundary are not fixed. We set up the following six assertions and prove they are invariants under the parallel execution of the mesh connected cells. Fig. 1 The prefix sum method showing the scopes of r ði; jÞ and sði; jÞ, and scope registers gði; jÞ and hði; jÞ. for i ¼ 1; · · · ; n and j ¼ 1; · · · ; k P1 r½i½j is the row sum at ði; jÞ, that is, the sum of Initialization for all i; j between 0 and n do in parallel r½i½j¼0; min½i½j¼0; max½i½j¼−∞; control½i½j¼ 0; solution½i½j ¼ −∞; end for all i do in parallel control½i½0 ¼ 1;
for all i; j between 1 and n do in parallel
Proof rules are given in the Appendix.
Generalization of the Maximum Subarray
Problem Generalizing the mesh algorithm outlined in Sec. 2.2 requires determination of coordinates for the 2nd−, 3rd−, . . . , and K-maximum sum. When the subarray is disjoint, this problem is easily solved by setting each cell containing the maximum subarray to −∞. However, the Kadane algorithm was not found suitable for determining overlapping Kmaximum sums. Thus, the prefix sum method was developed for such requirements. 5 Bae originally outlined 43 versions of the maximum subarray solution in the form of 43 algorithms; these are listed from Algorithm 1 to Algorithm 43. 2 The work presented here focuses on four of these algorithms. A description of these algorithms is listed in Table 1 .
In this article, we are primarily concerned with parallel implementations of the MSP, i.e., Algorithms 38 and 39, and their suitability for application to astronomical imaging problems. For consistency, the K-maximum sum extension is not considered, i.e., the location of only the brightest portion of an image or subimage is compared using the four aforementioned algorithms.
3 Computer Hardware Considerations MSP parallel algorithms have been implemented using a field programmable gate array (FPGA) 6 and, more recently, GPU computer hardware. This section provides a brief overview of general-purpose GPU hardware.
Graphics cards work by having a number of processing units arranged in parallel. Each processing unit is capable of executing a small instruction set. By itself, a single core of a GPU performs slower than a CPU of similar cost. By performing many concurrent operations on a GPU, performance can be increased to exceed that of a CPU.
For a GPU to be more effective than a CPU, as many cores as possible must be utilized. This means that code running on a GPU takes a different form than code targeted for a CPU. GPUs are available on many computers, but often are only used by very specific programs that require graphical processing. This is due to the historic lack of parallel applications, and somewhat related to this, the increased difficulty of programming in parallel. The aim of the latest NVidia graphics cards is to make coding a GPU easier and effective on all platforms.
Maximum Subarray Applications
Very large-scale integration (VLSI) has been used to implement the Shack Hartmann wavefront sensor, where wavefront slope is determined iteratively by subdivision of the pupil into zones. 7 However, more efficient VLSI implementations employing the MSP have been used as an alternative to centroiding algorithms for optical wavefront sensing. 6, 8 To accelerate processing, centroids of each zone are calculated concurrently using an FPGA, where OðnÞ performance was achieved for small n × n regions.
Attention has recently been drawn to the need for an efficient implementation to determine the maximum subarray regions of simulated radio telescope images developed for use in the Australian square kilometer array pathfinder (ASKAP) project. 9 A convenient platform for this work is a GPU, and our adaptation of the maximum subarray algorithm has shown significant performance gains with an efficient implementation of two parallel algorithms introduced in Sec. 2.4.
In this section, the maximum sum algorithm was implemented on a GPU to determine the location of the brightest region (subarray) from a simulated radio telescope image. Processing times were reported for various-sized images. Two parallel algorithms, suitable for GPU execution, were benchmarked against two serial algorithms that were run on a CPU core.
GPU Image Processing for Radio Astronomy
The ASKAP project is a major consortium to develop 36 antennas, each of 12 m diameter, in Murchison Western Australia with a baseline of 6 km. The aim of the project is to instantaneously view 30 square degrees of sky and planned operation is 2013. Seven working groups have been established, the first of which (WG 1) is responsible for simulations and imaging. As part of a wider collaboration with New Zealand, the ANZSKA consortium has been established, and research on image extraction is being conducted using the maximum sum algorithm as an efficient method to classify source objects.
Data, comprising a simulated sky with added complex bright sources and large-scale diffuse sources, as observed by an ASKAP telescope using models supplied by EMU WG1, 9 were used to test the performance of the MSP using the four algorithms listed in Table 1 . Due to memory constraints imposed by the GPU used for this evaluation (GTX 480), the size of the original 6144 × 6144 image, representing the sky model with source objects for a full continuum observation, was cropped to a maximum size of 4000 × 4000 pixels. algorithms described in Sec. 2.4. To highlight structure within the continuum, void of bright sources, Fig. 4(b) shows the fluctuation in background intensity. Also shown is the location of the ROI given in Fig. 4(c) , where the underlaying structure adjacent to the brightest (darkest for inverted) source object is shown. Figure 4(d) shows the intensity variation between a bright object in the top-left of the frame within the continuum.
GPU implementation of the MSP algorithm was achieved using the compute unified device architecture (CUDA). CUDA is a programming language developed for NVidia graphics cards. 10 It is an extension on C, and it facilitates parallel programming by using a Kernel to separate CPU operations, GPU serial operations, and GPU parallel operations. Code contained within the Kernel is distributed to a specified number of thread blocks, where each thread block executes the Kernel code concurrently. 11 Multiple copies of each Kernel can be spread to each thread block by specifying block height and block width, and in this way, operations can be pipelined.
When the number of thread blocks exceeds the number of logical processors, the thread blocks are queued. Since each thread block can be executed independent of other thread blocks, the number of active thread blocks and the width of the queue can be determined on the fly to suit any graphics card configuration. Operations are also pipelined. With a sufficient number of queued operations, pipelining can become optimally efficient. 
Preprocessing
The current CUDA MSP code can only process images to a maximum of 4000 × 4000 pixels. This is limited by the number of simultaneous threads supported by the GTX480 (∼16 million). There are two alternatives for processing images larger than 4000 × 4000 pixels. The first is to separate images into multiple overlapping blocks, process each image independently, and compare the results. This assumes that the maximum subarray does not extend past the common regions of two sub-blocks.
The second method is to scale the image to 4000× 4000 pixels. However, information is lost in this process. The amount of information loss depends on the original size of the image. The suitability of this approach also depends on the data being used. When an ROI is known a priori, cropping an image can be performed as an alternative to scaling.
Results
The MSP algorithm was implemented on a GPU and tested using the ASKA simulated telescope continuum image discussed in Sec. 4.1. Due to the wide dynamic range of image data, double precision floating point operations were employed. The continuum image served as the basis for test data, and ROIs of sizes 20 × 20; 50 × 50;100 × 100; · · · ; 2500 × 2500 and 4000 × 4000 were generated. The purpose of these tests was to compare the speed of different mesh algorithms over varying memory sizes for both CPU and GPU platforms. The tests can be summarized as follows:
• Algorithms 38 and 39 compiled with CUDA and executed using a GTX480 GPU.
• Algorithms 8 and 9 compiled with GNU compiler collection and executed using an Intel i7 2.6 GHz CPU.
The results of these tests are shown in Fig. 5 . First, these results in Fig. 5 show up to a 5× improvement in speedup using either parallel algorithm 38 or 39 on a GPU, over either serial algorithm 8 or 9 executed on a CPU for image sizes in excess of 500 × 500 pixels. Second, the difference between either mesh algorithm was minimal. Since the coding of both algorithms is very similar, and memory utilization was identical, this was to be expected. Last, the CPU was actually faster for small images compared to the GPU. This is attributed to inefficiencies exhibited by loading GPU memory from main memory for relatively small images.
Conclusion and Future Work
In this article, we have described a selection of four efficient algorithms that can be used to solve the MSP. These MSP algorithms comprised two serial and two parallel approaches and were applied to a set of radio telescope images as a computationally efficient method to determine the brightest region within an image and return the coordinates of this region. The systolic structure of both parallel mesh and prefix-sum algorithms are well suited for implementation on low-cost hardware platforms, such as general-purpose GPUs. A mathematical proof of the parallel 2-D mesh algorithm provides insight into the maximum subarray problem and is used as a basis for further adaptation. These proof rules are detailed in the appendix. GPU results were compared with a serial algorithm, compiled, and processed on an i7 CPU, and a 5× speedup was reported. Further development of both GPU and FPGA platforms are required to reach a compromise between computational efficiency that was demonstrated using VLSI custom hardware, and relative simplicity when implementing the MSP on GPUs with minimal hardware constraints. Such extensions will be the result of future work and are described in the remainder of this section.
Maximum Convex Sum Problem
An extension to the MSP is the MCSP, where we generalize the MSP to the MCSP by using dynamic programming approach. That is, the shape is not restricted to a rectangle. The convex shape is a more flexible shape to cover various data distributions. The convex shape is defined as a shape that has a center column linked with W and N shapes; such a column is called an anchor column, [12] [13] [14] as shown in Fig. 6 . Other researchers call the shape a rectilinear convex shape. 15 In this study, we will use the concept of the convex shape, which is not following the geometrical convex shape definition. 12, 13 A W shape can be described as a region with a top contour inclining or remaining horizontal and a bottom contour declining or remaining flat from left to right, 12 whereas the N shape is a mirror image of the W. 12, 13 The combination of the W and N shapes is defined as the convex shape for our descriptive purposes. In terms of this article, where two applications have been discussed, the MCSP is more appropriate due to the convex shapes that astronomical objects typically represent.
By our experiments, the convex sum has about twice as much gain over rectangular shapes. 12 Thus, this problem can solve the brightest spot problem more accurately. Our convex sum algorithms simplified Fukuda et al. 15 to solve the MCSP in Oðn 3 Þ time, when m ¼ n, based on dynamic programming. The proof of the simplification is discussed in Ref. 12 . We extended the simplified algorithm to the K-MCSP with time complexity of Oðkn 3 Þ for the disjoint case, 12 whereas the current time complexity for the overlap case is Oðn 3 þ kn 2 Þ.
14 Our final target will be to achieve a parallel algorithm for the K-MCSP with OðnÞ time.
Optimizing Hardware for Improved Utilization
Solving a complex problem such as the MSP in OðnÞ time is possible by implementing a mesh algorithm; however, it comes at a cost. The demands on complex hardware modules, i.e., adders, comparators, and registers, are such that even a small image can consume the entire resources of programmable logic devices, such as FPGAs, even for relatively small (100 × 100) array sizes. Simply migrating to a larger FPGA when available is not a solution. Optimizing hardware resources through better interpretation of the hardware description language, in terms of its behavioral definition, is required to meet these demands. Currently, work is continuing on the use of FPGAs that contain up to 2M logic cells. We are also exploring the possibility of cascading processing over multiple devices, essentially resulting in 2-D mosaic of VLSI devices.
GPU Memory Transfers
For GPU implementation, algorithms 38 and 39 are limited in performance and image size by the memory transfer between thread blocks. These algorithms could be modified to facilitate different thread block layouts, to improve performance, image size, or both. Due to the thread-block limit on the GTX480, image size is limited to just over 4000 × 4000 pixel images. By arranging the cells so that more than a single cell is computed by each thread, the maximum allowable image size could be increased. Another benefit is that cells that share a thread can transfer data via thread registers, as opposed to relatively slow global GPU memory. To accommodate this change, the algorithm itself would increase in complexity, and therefore the computation time for each cell to complete one step may increase. The trade-off between memory-transfer time and raw computation time is currently under investigation, as well as the efficiency of different cell-to-thread configurations.
Appendix
Proof rules are given below. For simplicity, cells are described in a one-dimensional manner. We assume the corresponding statement in each cell is executed in a synchronized way. Some of Hoare's axioms 16 are extended to concurrent processes below.
Parallel assignment
Let x 1 ; · · · ; x n be local variables in cell 1; · · · ; cell n. Let assðiÞ be defined by an assignment statement x i ¼ e i . The expression e i can include variables in other cells. fP½e 1 ∕x 1 ; · · · ; e n ∕x n g assð1Þk · · · kassðnÞfPg:
P½e 1 ∕x 1 ; · · · ; e n ∕x n is a simultaneous substitution, that is, e 1 ; · · · ; e n are independently substituted for x 1 ; · · · ; x n .
Parallel if-then statement: Let "if B i then S i " be a corresponding if-then statement in cellðiÞ.
Parallel execution: Let S i ½l be the l'th statement of m statements in cell i.
For rule
We index our assertions with the time index k in the following:
At the beginning of the k'th iteration, C1ðk − 1Þ holds.
fC1ðk − 1Þcellð1; 1Þ · · · cellði; jÞ; · · · cellðn; nÞfC1ðkÞg.
Proof for P1: P1(0) is true. Suppose P1ðk − 1Þ is true. r½i½j ¼ r½i½j − 1 þ a½i½j is performed for j ¼ 1; · · · ; k in parallel. Then P1ðkÞ is true.
Proof for P2: P1(0) is true. Suppose P2ðk − 1Þ is true. s½i½j ¼ s½i − 1½j þ r½i½j is performed for j ¼ 1; · · · ; k in parallel. Then P2ðkÞ is true.
Proof for P3: P3(0) is true. Suppose P3ðk − 1Þ is true. min½i½j ¼ minimumðmin½i½j − 1; s½i½jÞ is performed for j ¼ 1; · · · ; k in parallel. Then P3ðkÞ is true.
Proof for P4: P4(0) is true. Suppose P4ðk − 1Þ is true. max½i½j ¼ maximumðmax½i½j − 1; s½i½j − min½i½jÞ is performed for j ¼ 1; · · · ; k in parallel. Then P4ðkÞ is true.
Proof for P5: P5(0) is true. Suppose P5ðk − 1Þ is true. solution½i½j ¼ maximumðsolution½i½j; solution½i½j − 1; solution½i − 1½j; max½i½jÞ is performed for j ¼ 1; · · · ; k in parallel. Then P5(k) is true. P5ð2n − 1Þ at ðn; nÞ is: solution½n½n is the maximum sum in a½1; · · · ; n½1; · · · ; n.
Extended proof of fPðk − 1ÞgSfPðkÞg Note that in the following proof, "at time k" means "at the end of the k'th iteration." P1: At time k − 1, r½i½j − 1 is the row sum of a½i½1; · · · ; j for j ¼ 1; · · · ; k − 1.
At time k, r½i½j ¼ r½i½j − 1 þ a½i½j is performed for j ¼ 1; · · · ; k.
Thus, r½i½j is the row sum of a½i½1; · · · ; j for j ¼ 1; : : : ; k.
P2: At time k − 1, s½i − 1½j is the sum of a½i − 1 þ j− ðk − 1Þ; ···;i − 1½1; ···;j ¼ a½i þ j − k; ···;i − 1½1; ···;j.
At time k, s½i½j ¼ s½i − 1½j þ r½i½j is performed. Thus, s½i½j is the sum of a½i þ j − k; · · · ; i½1; · · · ; j. P3: At time k − 1, min½i½j − 1 is the minimum of s½i½l, l ¼ 1; · · · ; j − 1.
At time k, min½i½j ¼ minimumðmin½i½j − 1; s½i½jÞ is performed.
Thus, min½i½j is the minimum of s½i½l; l ¼ 1; · · · ; j. P4: At time k − 1, max½i½j − 1 is the maximum of the sum of a½i þ j − 1 − ðk − 1Þ; · · · ; i½l; · · · ; l 0 ¼ a½i þ j − k; · · · ; i½l; · · · ; l 0 for 1 ≤ l ≤ l 0 ≤ j − 1. At time k, max½i½j ¼ maximumðs½i½j − min½i½j; max½i½j − 1Þ is performed.
Thus, max½i½j is the maximum of the sum of a½i þ j − k; · · · ; i½l; · · · ; l 0 for 1 ≤ l ≤ l 0 ≤ j. P5: At time k − 1, solution½i − 1½j is the maximum sum of a½i − 1 þ j − ðk − 1Þ; · · · ; i − 1½1; · · · ; j ¼ a½i þ j− k; · · · ; i − 1½1; · · · ; j.
solution½i½j − 1 is the maximum sum of a½i þ j − 1− ðk − 1Þ; ···;i½1; ···;j − 1 ¼ a½i þ j − k; ···;i½1; ···;j− 1.
solution½i½j is the maximum sum of a½i þ j − kþ 1; · · · ; i½1; · · · ; j At time k, solution½i½j ¼ maximumðsolution½i½j; solution½i − 1½j; solution½i½j − 1; max½i½jÞ is performed.
Thus, solution½i½j is the maximum sum of a½i þ j − k; · · · ; i½1; · · · ; j.
