The Euler number of a binary image is a fundamental topological feature that remains invariant under translation, rotation, scaling, and rubber-sheet transformation of the image. In this work, a run-based method for computing the Euler number is formulated and a new hardware implementation is described. Analysis of time complexity and performance measures is provided to demonstrate the efficiency of the method. The sequential version of the proposed algorithm requires significantly fewer pixel accesses than the existing methods and tools based on bit-quad counting or quad-trees, both in the worst case and in the average case. A pipelined architecture is designed with a single adder tree to implement the algorithm on-chip by exploiting its inherent parallelism. The architecture uses O(N) 2-input gates and requires O(N log N) time to compute the Euler number of an N × N image. The same hardware, with minor modification, can be used to handle arbitrarily large pixel matrices. A standard cell based VLSI implementation of the architecture is also reported. As the Euler number is a widely used parameter, the proposed design can be readily used to save computation time in many image processing applications.
Introduction
Topological properties that remain invariant under various transformations are useful in image characterization for matching shapes, recognizing objects, retrieving images from a database, and in many other image processing and computer vision applications. An important topological feature of a binary image is the Euler number (or genus), which is defined as the difference between the number of connected components (objects) and the number of holes [1, 2]. The Euler number remains invariant under translation, rotation, scaling, and rubber-sheet transformation of the image. Many critical image processing applications involve large amounts of data and at the same time demand quick real-time response. The Euler number provides a simple and fast method of screening in such cases. The Euler number of cell images is widely used in medical diagnosis [3]. It has recently been observed that the Euler number is the most clinically useful feature that discriminates many cervical disorders [4]. Being a fundamental topological feature, it has numerous applications in image processing, e.g., optical character recognition, document image processing [5], reflectance-based object recognition [6], analysis of sandstone for geological applications [7], and shadow detection [8].
Other topological properties based on the convex deficiency and convex hull of the shape are used in conjunction with the Euler number for classifying typewritten letters or binary silhouettes [9]. Recently, based on the Euler number, the concept of the Euler vector has been introduced to characterize a graytone image [10].
The classical algorithm for computing the Euler number of a binary image is based on counting certain (2 × 2) pixel patterns called bit-quads [2, 11] over the entire image. Gray [11] used the fact that the Euler number of a region of space is locally countable [12]. The classical graph-theoretic definition of the Euler number relating vertices, edges and faces is applied to an image followed by triangulation, and then its Euler number is computed as the difference between the number of connected components and that of holes. Alternatively, the Euler number is shown to be the difference between left-facing convexities and concavities in the image. Computation of these parameters needs counting of specific types of bit-quads in the image, namely … . In [14], an algorithm was proposed to compute the Euler number of an image represented by a quad-tree. Samet and Tamminen [15] improved the algorithm further by using a new staircase type of data structure to represent the blocks that have already been processed. Juan et al. [16] considered a skeletonized version of the binary image and computed the Euler number in terms of the number of terminal points $T_p$ and the number of three-edge points $TE_p$, as $(T_p - TE_p)/2$. The terminal and three-edge points are defined from pixel neighborhood relations. Chen and Yen [3] developed a parallel localized algorithm using square graphs to calculate the Euler number of a given binary image on a square grid. Chiavetta and Gesú [17] used a connectivity graph representation of a binary image and developed an algorithm for computing the Euler number. The connectivity graph was derived from the discrete version of the cylindrical algebraic decomposition of the digital plane. The authors also described a parallel implementation of the same algorithm on a linear array network topology.
Rosenfeld and Kak [18] observed that the Euler number can be computed from the run length representation of an image. If, for each run r, k(r) is the number of runs on the preceding row to which r is adjacent, then the Euler number can be expressed as $E = \sum_r (1 - k(r))$. Further results on 2D and 3D images are reported in [19]. Di Zenzo et al. [20] suggested a planar graph representation of an image from run descriptions. Next, the number of connected components C in the image is computed by applying a standard graph algorithm. The number of holes H is then computed from Euler's formula for a planar graph as $H = 1 + m - v$, where m is the number of edges and v is the number of nodes in the graph. Finally, the Euler number is calculated as $E = C - H$. Dey et al. [21] have reported a divide-and-conquer algorithm that can be parallelized for computing the Euler number of a binary image.
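As a small worked illustration of the run-length expression of [18] (the 3 × 3 ring image used here is chosen purely for illustration and does not appear in the original text), consider a ring of object pixels around one background pixel, with runs taken row by row:

```latex
\[
\begin{array}{ccc}
1 & 1 & 1\\
1 & 0 & 1\\
1 & 1 & 1
\end{array}
\qquad
\begin{array}{lll}
\text{row 1:} & \text{one run},\ k(r) = 0 & \Rightarrow\ 1 - k(r) = 1\\
\text{row 2:} & \text{two runs},\ k(r) = 1 \text{ each} & \Rightarrow\ (1 - 1) + (1 - 1) = 0\\
\text{row 3:} & \text{one run},\ k(r) = 2 & \Rightarrow\ 1 - k(r) = -1
\end{array}
\]
\[
E = \sum_r \bigl(1 - k(r)\bigr) = 1 + 0 - 1 = 0
\]
```

This agrees with the definition: the ring has one connected component and one hole, so E = 1 − 1 = 0.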
In this work, we revisit the run-based expression of Rosenfeld and Kak [18] and present a novel hardware design for computing the Euler number. We show that certain properties of runs and neighboring runs, and their distributions in the pixel matrix, can be exploited to compute the Euler number of a binary image very efficiently. Performance analysis of the algorithm indicates that the proposed technique is significantly faster than the existing bit-quad counting and quad-tree based methods [2, 11, 13-15]. Experiments on a logo database give very favorable results. This run-based algorithm, with its inherent parallelism, provides the basis for building a pipeline architecture for on-chip computation of the Euler number. Given a pixel matrix of size (N × N), the upper and lower bounds on the value of the Euler number are also derived. This result allows us to size the architecture correctly. We further derive analytical expressions for performance measures of the proposed architecture to establish its efficiency. It is also shown that the circuit, with minor modifications, can handle a pixel matrix of arbitrary size. A standard cell based VLSI implementation of the architecture on a real technology is described, and relevant data on circuit area and delay are reported. To the best of our knowledge, no on-chip design of a run-based technique has been reported to date.
The rest of the paper is organized as follows. Section 2 presents the formulation for computing the Euler number based on runs, theoretical analysis of complexities, and experimental results on run distributions. Next, in Section 3, a simple parallel version of the algorithm, the proposed pipeline architecture, and results on its performance analysis are reported. Section 4 describes the VLSI implementation of the design for a 256 × 256 image. Section 5 concludes the paper.
Proposed algorithm

Theme
Let the binary image be represented by a 0-1 pixel matrix of size (N × M), in which an object (background) pixel is denoted as 1 (0). In a binary image, a connected component is a set of object pixels such that any object pixel in the set is in the 8- (or 4-) neighborhood of at least one object pixel of the same set. A hole is a set of background pixels such that any background pixel in the set is in the 4- (or 8-) neighborhood of at least one background pixel of the same set and this entire set of background pixels is enclosed by a connected component. The sets referred to in the definition are sets contained in the image. A run in any row (or column) of the pixel matrix is defined to be a maximal sequence of consecutive 1's in that row (or column). Let R(i) denote the number of such runs in the ith row (column). R(i) can be counted as the number of 0 → 1 transitions in that row with a 0 padded at the start of the row (the padded 0 handles the case where the row starts with a 1). See Figs. 1 and 2 for illustration.
Fact 1. There can be no holes in a single row, and the number of connected components in that row is the same as the number of runs in that row.
Fact 2. The Euler number satisfies the local additive property. Given two images $I_1$ and $I_2$ with Euler numbers $E(I_1)$ and $E(I_2)$ respectively, the Euler number of the image $I = I_1 \cup I_2$ is given by $E(I) = E(I_1 \cup I_2) = E(I_1) + E(I_2) - E(I_1 \cap I_2)$ (see [11, 12]).
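The run count R(i) defined above can be sketched in a few lines of Python. This is a minimal illustration, not code from the original work; the function name `count_runs` and the list-of-0/1 row representation are assumptions made here.

```python
def count_runs(row):
    """Count maximal runs of 1s in a 0/1 sequence via 0 -> 1 transitions."""
    runs = 0
    prev = 0  # the 0 padded before the first pixel of the row
    for pixel in row:
        if prev == 0 and pixel == 1:
            runs += 1  # a 0 -> 1 transition marks the start of a run
        prev = pixel
    return runs
```

Because of the padded 0, a row that begins with a 1 still registers a transition at its first pixel, so `count_runs([1, 1, 0, 1])` counts two runs.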
The union (∪) of two images is defined as the simple juxtaposition of $I_1$ and $I_2$, either vertically or horizontally, without any overlap. The intersection (∩) of $I_1$ and $I_2$ is the image formed by the last row (or column) of $I_1$ and the first row (or column) of $I_2$, if the images $I_1$ and $I_2$ are joined horizontally (or vertically). The intersection image is always two pixel rows (or columns) wide. Without any loss of generality, let $I_1$ and $I_2$ be joined horizontally. Thus, the last row of $I_1$ will lie above the first row of $I_2$, and $(I_1 \cap I_2)$ is the two-row wide image containing the last row of $I_1$ and the first row of $I_2$. See Fig. 1 for an example. Two runs appearing in two adjacent rows are said to be neighboring if at least one pixel of one run is in the 8- (or 4-) neighborhood of a pixel of the other run (we follow the 8-neighborhood convention throughout). No holes can be present in a two-row wide image, and the number of connected components in a two-row wide image is the number of neighboring runs. So, we have the following observation.
Fact 3. The Euler number of the two-row intersection image $(I_1 \cap I_2)$ equals the number of neighboring runs between its two rows.
We now use Facts 1-3 iteratively to compute the Euler number of the entire image, as follows.
Let $I_{i-1}$ be the partial image consisting of rows 1, 2, ..., (i − 1) of the pixel matrix, and let $E(I_{i-1})$ be its Euler number. The Euler number of the image consisting of row i alone is R(i) (by Fact 1). Row i is now added to $I_{i-1}$ to form the union image $I_i$. The intersection image is formed by the (i − 1)th and the ith rows. Let the number of neighboring runs between them be $O_i$. Hence, by Fact 2,
$$E(I_i) = E(I_{i-1}) + R(i) - O_i, \qquad E(I_N) = \sum_{i=1}^{N} R(i) - \sum_{i=2}^{N} O_i,$$
where $I_N$ denotes the entire image. The above analysis proves the following known result [18] and provides the basis of our hardware implementation.
Theorem 1. The Euler number of a given binary image is the difference between the sum of the number of runs for all rows (or columns) and the sum of the neighboring runs between all consecutive pairs of rows (or columns).
Proof. The proof follows directly from the analysis above. □
Two runs in two adjacent rows are neighboring if at least one pixel of a run in one row is in the 8-neighborhood of a pixel of a run in the other row. To count the number of neighboring runs, the occurrence of the start of a neighboring run is to be determined. A neighboring run between two rows can start only when there is an occurrence of a run (0 → 1 transition) in at least one of the two rows, and there is at least one 1-pixel in the 8-neighborhood of the location where the run starts.
endfor return E(I) as the Euler number.
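The complete listing of the sequential algorithm is not reproduced here, so the following is only a minimal Python sketch of Theorem 1, not the authors' listing. It assumes that the neighboring-run count $O_i$ is the number of pairs of runs in rows i − 1 and i that touch under 8-connectivity (which matches the sum of k(r) in the expression of [18]); the helper names `find_runs`, `count_neighboring` and `euler_number` are hypothetical.

```python
def find_runs(row):
    """Return (start, end) column indices of each maximal run of 1s."""
    runs, start = [], None
    for col, pixel in enumerate(row):
        if pixel == 1 and start is None:
            start = col
        elif pixel == 0 and start is not None:
            runs.append((start, col - 1))
            start = None
    if start is not None:
        runs.append((start, len(row) - 1))
    return runs


def count_neighboring(prev_runs, cur_runs):
    """Count pairs of runs in adjacent rows that touch under 8-connectivity."""
    count = i = j = 0
    while i < len(prev_runs) and j < len(cur_runs):
        (s1, e1), (s2, e2) = prev_runs[i], cur_runs[j]
        # Runs touch when their column spans overlap after widening by one.
        if s1 <= e2 + 1 and s2 <= e1 + 1:
            count += 1
        # Advance whichever run ends first; it cannot touch any later run.
        if e1 < e2:
            i += 1
        else:
            j += 1
    return count


def euler_number(image):
    """Sum of runs over all rows minus neighboring runs between row pairs."""
    total, prev_runs = 0, []
    for row in image:
        runs = find_runs(row)
        total += len(runs) - count_neighboring(prev_runs, runs)
        prev_runs = runs
    return total
```

Only the run end points of two consecutive rows are held at any time, which is the property the space analysis relies on.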
Space and time complexity
The lower bound on the computation of the Euler number is $\Omega(N^2)$, by the definition of the Euler number. In that sense, most of the algorithms are asymptotically optimal in that they compute the Euler number in $O(N^2)$ time. But in image processing applications where huge amounts of data are involved and real-time response is needed, the constant hidden in the O-notation becomes important. This has motivated several algorithms for computing the Euler number. So, we determine the exact number of pixel matrix accesses for an N × N matrix under the sequential RAM model for the previous algorithms and compare it with that of the run-based method.
The bit-quad counting technique [2, 11] checks a bit-quad (i.e., 4 pixels) for each entry and also has to check for convexity and concavity along the borders. This takes $4N^2 + 4N$ accesses. Note that this number of pixel accesses is required irrespective of the matrix entries. So, the average-case number of accesses is the same.
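For comparison, a bit-quad count can be sketched as follows. This sketch uses the standard textbook form of Gray's formula for 8-connectivity, E = (n(Q1) − n(Q3) − 2 n(QD))/4, where Q1, Q3 and QD denote 2 × 2 windows with one object pixel, three object pixels, and two diagonally placed object pixels. The notation and the zero-padding of the border are assumptions made here and may differ from the formulation used in [2, 11].

```python
def euler_bitquad(image):
    """Euler number (8-connectivity) from counts of 2x2 bit-quad patterns."""
    rows, cols = len(image), len(image[0])
    # Zero-pad so that border pixels take part in complete 2x2 windows.
    padded = [[0] * (cols + 2)]
    padded += [[0] + list(r) + [0] for r in image]
    padded.append([0] * (cols + 2))

    q1 = q3 = qd = 0
    for r in range(rows + 1):
        for c in range(cols + 1):
            # Four pixel accesses per window, regardless of image content.
            a, b = padded[r][c], padded[r][c + 1]
            d, e = padded[r + 1][c], padded[r + 1][c + 1]
            ones = a + b + d + e
            if ones == 1:
                q1 += 1
            elif ones == 3:
                q3 += 1
            elif ones == 2 and a == e:
                qd += 1  # the two 1s sit on a diagonal
    return (q1 - q3 - 2 * qd) // 4
```

Every window costs four pixel reads whatever the image contains, which is why the average-case cost of this approach equals its worst-case cost.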
The Euler number computation based on the thinned version of the image [16] checks for two types of terminal points. For that, the 8 neighbors of each pixel are to be accessed, which takes $8N^2$ accesses. Adding the time that thinning takes, this algorithm clearly requires more pixel accesses than the bit-quad counting technique [2, 11].
The Euler number computation based on the quad-tree [14, 15] involves computing the quad-tree from the pixel entries followed by a traversal of the quad-tree. The time complexity T(n) for formation of the quad-tree satisfies the following recurrence:
$$T(n) = \begin{cases} 4\,T(n/4) + n, & n > 1,\\ 1, & n = 1. \end{cases}$$
Solving this recurrence, $T(n) = n \log_4 n + n$. In our case $n = N^2$, so the number of accesses is $N^2 \log_2 N + N^2$. Now, traversal of a quad-tree for computing the Euler number takes time linear in the number of leaf nodes of the quad-tree [14, 15]. The number of leaf nodes can be $N^2$. So, Euler number computation takes $N^2 \log_2 N + N^2 + cN^2$ accesses. This is clearly greater than the number of accesses required by the bit-quad counting technique.
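The stated solution can be checked by unrolling, assuming the recurrence has the usual quad-tree form $T(n) = 4T(n/4) + n$ with $T(1) = 1$ (this form is an assumption made here; it is the one consistent with the solution quoted above):

```latex
\begin{align*}
T(n) &= 4\,T(n/4) + n \\
     &= 4^2\,T(n/4^2) + 2n \\
     &\;\;\vdots \\
     &= 4^k\,T(n/4^k) + kn, \qquad k = \log_4 n \\
     &= n\,T(1) + n \log_4 n \;=\; n \log_4 n + n .
\end{align*}
% With n = N^2:  \log_4 N^2 = \log_2 N,  so  T(N^2) = N^2 \log_2 N + N^2 .
```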
The above analysis shows that, among these existing methods, the bit-quad counting algorithm requires the fewest pixel accesses. That might be the reason why the commercial image processing toolbox of MATLAB uses this algorithm [13].
Next, we present the worst-case and average-case analysis of the proposed run-based technique to show its efficacy.
We require N × N space to store all the pixels of the given image. To calculate the number of neighboring runs between two consecutive rows, we need to store, for each run, the column numbers where the run starts and terminates. The worst case arises in a checkerboard situation, when a row has the maximum number of runs, i.e., when every alternate pixel is an object pixel. For N columns, the maximum number of runs in a row is N/2. We calculate the number of neighboring runs between two consecutive rows at a time. Since a run can be designated by its two ends, the total space required is $(N \times N) + 4 \times (N/2) = O(N^2)$, i.e., linear in the number of pixels.
Computation of runs for all the rows needs 2 × (N × N) pixel accesses. The number of runs R(i) in a row in the worst case is N/2. The number of pixel accesses required to check and store the two end points of all runs present in the matrix = …
Checking whether a run in row (i − 1) is in the neighborhood of a run of row i requires 2 accesses. Also, a neighboring run is to be checked only after the occurrence of a run. So, determination of neighboring runs for all consecutive pairs of rows needs 2 … $(N^2)/2 - (N/2)$ accesses. Note that the upper limit on the summation of $O_i$ is R(i) (due to Lemma 2) and not N. Therefore, the total number of accesses is …
For the average case analysis, we assume a random input of 0 and 1 with a probability of 1/2 each. As the number of accesses depends on the number of runs, we compute the expected number of runs in a row of width N. Let $X_t$ be the number of 1-runs in the first t entries and $Y_t$ be the number of 0- or 1-runs in the first t entries. By symmetry, $E(X_t) = E(Y_t)/2$, where E(X) denotes the expectation of X. Now, the conditional expectation of $Y_{t+1}$ when $Y_t = c$ is $E(Y_{t+1} \mid Y_t = c) = c + \tfrac{1}{2}$, because a 0 or a 1 can occur with probability 1/2 in the (t + 1)th entry, and a new run starts exactly when that entry differs from the tth entry. Now, $E(Y_{t+1}) = \sum_c E(Y_{t+1} \mid Y_t = c)\,P(Y_t = c) = E(Y_t) + \tfrac{1}{2}$, which with $E(Y_1) = 1$ gives $E(Y_N) = (N + 1)/2$ and hence $E(X_N) = (N + 1)/4$. So, the number of accesses in the expected case is … . Thus, it can be seen that the proposed method of Euler number computation requires fewer accesses than the bit-quad counting method.
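The expected number of 1-runs under the random-input model above can be checked by brute force for small widths. The following sketch is an illustration added here, not part of the original analysis; it enumerates all rows of a given width and averages their run counts exactly, and the function name `expected_one_runs` is hypothetical.

```python
from fractions import Fraction
from itertools import product


def expected_one_runs(width):
    """Exact mean number of 1-runs over all 2**width equally likely rows."""
    total = 0
    for row in product((0, 1), repeat=width):
        prev = 0  # padded 0 before the first entry
        for pixel in row:
            if prev == 0 and pixel == 1:
                total += 1
            prev = pixel
    return Fraction(total, 2 ** width)
```

For every width tried, the enumeration gives (width + 1)/4, which follows from linearity of expectation: a run starts at the first entry with probability 1/2 and at each later entry with probability 1/4.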
In the proposed method, the number of pixel accesses depends on the distribution of the runs in the pixel matrix, in contrast to the method in [11]. It has been observed that for most images $R(i) \ll M$ and also $O_i \ll M$, and typically both R(i) and $O_i$ have a value around 4. Thus, on average, the number of pixel accesses will be much smaller than that of the other methods. Empirical evidence supports this rationale. We have considered a database of 1039 logo images and normalized each of them to the same size. From the experimental results, the expected value of the number of runs R(i) present in a column is observed to be 4.252741, and that of the neighboring runs between two consecutive columns, $O_i$, is found to be 4.266170 (see Fig. 3(a) and (b)). Thus, the expected number of runs in actual images is nearly constant and much less than the expected number of runs under the random-input model. Therefore, the number of accesses would be much less than the average-case value. The above analytical expressions and experimental validation suggest that the proposed method of computing the Euler number will significantly outperform the existing techniques.
We observed that the average CPU time required for computing the Euler number of a 256 × 256 logo image by running the proposed algorithm is around $11 \times 10^3$ μs on a Sun Ultra-5_10 Sparc workstation (233 MHz) with SunOS Release 5.7 Generic OS. On the same platform, the bit-quad based algorithm [2, 11] takes around $19.5 \times 10^3$ μs.
Hardware implementation
The Euler number satisfies the local additive property, or additive set property [11]. This property allows us to split the image into smaller subimages and calculate the overall Euler number by combining the Euler numbers of the smaller subimages. An essential feature common to most of these methods is the addition and subtraction of the Euler numbers, or some other parameters, of the subimages [3, 11, 16-18, 21]. Thus, for fast computation of the Euler number, the design of the adder is most vital. It may be noted that, to reduce CPU time, once the local property counting involving a set of pixels has been completed, the same pixels should not be required again. This indicates that, in terms of precedence relations, the addition process follows the local property counting and can be overlapped with the next round of local property counting involving another set of pixels. This observation suggests that addition can be pipelined with the local property counting. In this section, the hardware design is reported. We also derive the upper and lower bounds of the Euler number, which lead to an estimate of the time complexity as well as the overall gate count. The design proposed here is then compared with other methods. The run-based sequential algorithm can be easily parallelized so as to make it suitable for hardware implementation.
Parallel algorithm
Hardware architecture
To implement the proposed algorithm in hardware, two types of processing elements (PEs) are required: $P_1$ for identifying the start of a run in a row, and $P_2$ for signalling the start of a neighboring run.
Using Lemma 1, the PE $P_1$ (shown in Fig. 4(a)) is used to detect the transition from 0 to 1; the number of runs is equal to the number of such detections. A delay (D) flip-flop is initialized to 0 at the start of processing each row. It holds the value of the previous pixel for the purpose of checking a transition. The pixels in a row are pipelined into $P_1$. At any instant of time $t_i$, the ith pixel and the (i − 1)th pixel of a row are checked for a 0 → 1 transition. The maximum number of runs in a row having M columns is $\lceil M/2 \rceil$.
In Fig. 4(b), $P_1$ is shown within a box, and the remaining portion of the circuit constitutes the processing element $P_2$. It is used to detect the start of a neighboring run between two consecutive rows, following Lemma 2. The PE $P_2$ checks for the condition when a run in a row begins and whether it is in the neighborhood of another run in its adjacent row. The pixels corresponding to the columns of two adjacent rows are fed to $P_2$ in a pipeline. In addition, it receives data from the outputs of the D flip-flops of these two rows. At any instant of time $t_i$, the ith and (i − 1)th pixels of two consecutive rows are checked for a neighboring run. The maximum number of neighboring runs is $2 \times \lceil M/2 \rceil$. To process an (N × M) image in parallel, we require N copies of $P_1$ and (N − 1) copies of $P_2$.
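The behaviour of the two processing elements can be mimicked in software by feeding one column per clock tick to all rows at once. The sketch below is a behavioural model only: the gate-level circuits of Fig. 4 are not reproduced here, so the $P_2$ logic used (a run starts in one row while the adjacent row holds a 1 in the current or the previous column) is an assumption chosen to be consistent with the description above, and the function name `simulate_pes` is hypothetical.

```python
def simulate_pes(image):
    """Column-serial model: every row consumes one pixel per clock tick."""
    n_rows = len(image)
    n_cols = len(image[0]) if image else 0
    prev = [0] * n_rows  # D flip-flops, cleared before the first column
    total = 0
    for t in range(n_cols):
        cur = [row[t] for row in image]
        # P1: high when the previous pixel is 0 and the current pixel is 1.
        p1 = [int(prev[i] == 0 and cur[i] == 1) for i in range(n_rows)]
        # P2 (assumed logic): a run starts in one of the two rows while the
        # other row has a 1 in the current or the previous column.
        p2 = [
            int((p1[i] and (cur[i + 1] or prev[i + 1]))
                or (p1[i + 1] and (cur[i] or prev[i])))
            for i in range(n_rows - 1)
        ]
        total += sum(p1) - sum(p2)  # runs minus neighboring runs
        prev = cur
    return total
```

Under this model each pair of touching runs raises $P_2$ exactly once, at the column where the later of the two runs begins, so the accumulated total is the sum of runs minus the sum of neighboring runs.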
A. Bishnu et al. / Journal of Systems Architecture 51 (2005) 470-487
To compute the Euler number of the whole image, we add up the number of runs (calculated by summing all 0 → 1 transitions in all rows) and deduct from it the number of neighboring runs (calculated by adding up the number of all neighboring runs between consecutive rows). At any instant of time $t_i$, the output of $P_1$ or $P_2$ goes high to indicate the start of a run or of a neighboring run, respectively. A neighboring run implies the presence of a run in either or both of the corresponding rows. Thus, a high (1) output from $P_2$ is accompanied by a high output from either one or both of the $P_1$'s connected to it. The interrelation of runs and neighboring runs is depicted as a truth table in Table 1, where × represents a don't care entry.
Truth table and combinational circuit
In order to perform the addition process efficiently, we combine the local results in such a way that the final subtraction can be avoided and a single adder tree can be designed to output the final result.
We take the outputs from the two modules generating $P_1$ and the two modules producing $P_2$ for consecutive rows. To distinguish them, we use upper and lower case symbols ($P_1$, $p_1$, $P_2$, $p_2$) in Table 1. The sum of the two $P_1$ outputs is computed, and the sum of the two $P_2$ outputs is then subtracted from it. It is easy to prove that the result is always either −1, 0, or 1. This follows from the properties of runs and neighboring runs as mentioned in the previous subsection. The entries corresponding to rows 5 and 6 in the truth table are not feasible, as $P_2 = 1$ implies that either or both of $P_1$ and $p_1$ must be 1. The case in row 10 does not appear, as $p_2 = 1$ and $p_1 = 0$ imply that the PE $P'_1$ associated with $p_2$ (which is not included in this group of four $P_1$'s and $P_2$'s, and as such is not shown in the truth table) should be high. If $P'_1 = 1$ and $p_2 = 1$, it implies the presence of a continuing run (not a run start) in the row associated with $p_1$; otherwise $p_2$ cannot become high. If there exists a continuing run associated with $p_1$ and also $P_1 = 1$, then $P_2$ should be high, but that does not happen. Hence, this input combination never arises. Rows 11 and 12 are also inadmissible, as in both cases $P_1 = 1$ and $p_1 = 1$ imply that $P_2 = 1$, which is not true. To represent −1, 0, 1 in 2's complement, we require 2 bits (the sign bit and the data bit). A combinational circuit C, as shown in Fig. 5, is designed following the truth table (Table 1) to produce the sign and data bits from the different input combinations of the two $P_1$'s and two $P_2$'s.
Table 1. Truth table (the output columns include the sign bit s).
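The arithmetic performed by the combinational circuit C can be sketched as follows. This is a behavioural illustration only: the truth table itself is not reproduced here, so the function name `c_module` and the (sign bit, data bit) output order are assumptions, and the check below rejects only input combinations whose sum falls outside [−1, 1], not every infeasible row of the table.

```python
def c_module(P1, P2, p1, p2):
    """Return (sign_bit, data_bit) for P1 + p1 - P2 - p2 in 2-bit 2's complement."""
    value = P1 + p1 - P2 - p2
    if value not in (-1, 0, 1):
        raise ValueError("sum outside [-1, 1]; combination cannot occur")
    sign_bit = 1 if value < 0 else 0   # -1 is encoded as 11
    data_bit = 1 if value != 0 else 0  # 0 is 00, +1 is 01
    return sign_bit, data_bit
```

Folding the subtraction into this 2-bit signed value is what lets a single adder tree produce the final result without a separate subtractor.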
Adder design and time complexity
The sum of the combination of the outputs from the two modules generating $P_1$ and the two modules generating $P_2$ for consecutive rows lies in the range [−1, 1]. To design the adder circuit for Euler number computation, it is essential to determine the range of values the Euler number can assume for an (N × M) image, so that the width of the adders can be properly determined. The adders should be designed so that they are able to handle this range of numbers. Alternatively, it can also be proved from the truth table entries as given in Table 1 … . If entry 3 occurred at any clock instant $t_i$, then for $p_1$ (see Table 1) to have a high (1) output, a 1 was fired at time instant $t_i$ and a 0 at $t_{i-1}$, for a run (0 → 1 transition) to be detected. At time instant $t_{i+1}$, the 9th or the 15th entry of Table 1 cannot occur. If the 9th entry occurred, there should be an occurrence of a neighboring run, because with $p_1$ going high at time instant $t_i$ and $P_1$ going high at time instant $t_{i+1}$, $P_2$ should also go high at time instant $t_{i+1}$. So, the 9th entry should have been $P_1 = 1$, $P_2 = 1$, $p_1 = 0$ and $p_2 = 0$. This is a contradiction, and as such the 9th entry cannot occur after the 3rd entry. Similarly, it can be shown that the pairs (3, 15), (9, 15), (2, 8), (2, 14) and (8, 14) cannot occur at consecutive time instants. □
The complete hardware design for a (16 × 16) image is shown in Fig. 7. The output of C represents a 2-bit 2's complement number generated by two $P_1$'s and two $P_2$'s. An adder circuit is needed to sum up the outputs of all these C-modules to obtain the final result. We use a pipelined binary adder tree [22] to accelerate the addition process. Addition of two 2-bit 2's complement numbers may produce a 3-bit number. To implement such a scheme, we use, at the leaf level of the adder tree, a set of 3-bit adders, each with a sign bit extension. For each subsequent level in the tree, the adder width is increased by one bit. The depth of the adder tree is $\lceil \log_2 N \rceil - 1$. As the width of the adders increases by 1 at each level, the width of the adder $A_r$ at the root is $(3 - 1) + \lceil \log_2 N \rceil - 1 = \lceil \log_2 N \rceil + 1$. Finally, a sequential full adder FA is used to accumulate the sum. The range of numbers FA should be able to handle is $[-(M/2) \times \lceil N/2 \rceil,\ (M/2) \times \lceil N/2 \rceil]$ (see Lemma 3). Thus, the number of bits T required for the adder FA is $\lceil \log_2 \{M \times \lceil N/2 \rceil + 1\} \rceil$. With the assumption that $M \approx N$, T is $O(\log_2 N)$.
The number of stages of the pipeline is two more than the depth of the adder tree (one for the PEs $P_1$, $P_2$ and C, and the other for the adder FA). Therefore, the number of stages of the pipeline equals $(\lceil \log_2 N \rceil - 1) + 2 = \lceil \log_2 N \rceil + 1$. The clock period of the linear pipeline is determined by the sum of the delay of the longest pipeline stage and the delay of the latches, and hence is equal to T + d, where T is the time taken by the adder FA and d is the time delay of the latches. The number of clock cycles required by the linear pipeline to perform the entire addition is $(\lceil \log_2 N \rceil + 1) + (M - 1) = \lceil \log_2 N \rceil + M$. Therefore, the total time needed to produce the final output in 2's complement form is $(T + d) \times \{M + \lceil \log_2 N \rceil\}$ and, M being approximately equal to N, the time taken is $O(N \log N + 2 \times \log_2^2 N) \approx O(N \log N)$.
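The sizing expressions above can be evaluated directly. The sketch below simply restates them in Python for an image with `n_rows` rows and `n_cols` columns; the function and key names are hypothetical, and `ceil_log2` uses integer bit lengths to avoid floating-point rounding.

```python
def ceil_log2(x):
    """Ceiling of log2(x) for a positive integer x, in exact integer arithmetic."""
    return (x - 1).bit_length()


def pipeline_parameters(n_rows, n_cols):
    """Adder widths, pipeline stages and clock cycles from the formulas above."""
    tree_depth = ceil_log2(n_rows) - 1
    root_width = ceil_log2(n_rows) + 1
    # FA must cover n_cols * ceil(n_rows / 2) + 1 distinct values.
    fa_width = ceil_log2(n_cols * ((n_rows + 1) // 2) + 1)
    stages = tree_depth + 2
    cycles = stages + (n_cols - 1)
    return {
        "tree_depth": tree_depth,
        "root_width": root_width,
        "fa_width": fa_width,
        "stages": stages,
        "cycles": cycles,
    }
```

For a 16 × 16 image this gives a tree depth of 3, a 5-bit root adder, an 8-bit FA, 5 pipeline stages and 20 clock cycles; for a 256 × 256 image it gives 9 stages and 264 clock cycles.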
The performance measures, e.g., speed-up, efficiency and throughput [22], for the linear pipeline designed here are given below.
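Speed-up, $S_k$: The speed-up is $nk/(k + (n - 1))$, where n is the number of tasks and k is the number of stages of the pipeline. So, in our case for an (N × N) image, with $n = N$ and $k = \lceil \log_2 N \rceil + 1$,
$$S_k = \frac{N(\lceil \log_2 N \rceil + 1)}{N + \lceil \log_2 N \rceil}.$$
Efficiency, η: The efficiency is the ratio of the speed-up and the number of stages. So, here the efficiency is
$$\eta = \frac{S_k}{k} = \frac{N}{N + \lceil \log_2 N \rceil}.$$
Throughput, w: … neglecting d.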
The graphs in Fig. 6(a)-(c) show the speed-up, efficiency, and throughput, respectively, as the image size N varies.
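The speed-up and efficiency of a k-stage linear pipeline processing n tasks can be tabulated with a few lines of Python. This sketch assumes one task per image column (n = N) and k = ⌈log₂ N⌉ + 1 stages, as suggested by the clock-cycle count derived earlier; the function names are hypothetical and exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def pipeline_speedup(n_tasks, n_stages):
    """Speed-up nk / (k + n - 1) of a k-stage linear pipeline over n tasks."""
    return Fraction(n_tasks * n_stages, n_stages + n_tasks - 1)


def pipeline_efficiency(n_tasks, n_stages):
    """Efficiency: speed-up divided by the number of stages."""
    return pipeline_speedup(n_tasks, n_stages) / n_stages


if __name__ == "__main__":
    for size in (16, 64, 256, 1024):
        stages = (size - 1).bit_length() + 1  # ceil(log2 N) + 1
        print(size, float(pipeline_speedup(size, stages)),
              float(pipeline_efficiency(size, stages)))
```

For N = 256 and k = 9 this gives a speed-up of 96/11 (about 8.7) and an efficiency of 32/33, and the efficiency approaches 1 as N grows.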
If the final adder FA performs a carry lookahead addition [23] taking O(log T) time, then the clock period of the pipeline can be reduced further and would be determined by the delay of $A_r$.
Circuit cost and gate count
We require the following components to implement the hardware architecture for processing an (N × M) binary image: …
Hence, the total gate count is O(N).
The complete circuit for a (16 × 16) image with appropriate adder blocks is shown in Fig. 7.
Comparisons with other methods
The various formulations for computing the Euler number require addition of two or more properties and finally their subtraction. The formulation in [11], given as … , requires addition and subtraction of three properties, viz. v, t and d, pertaining to the bit-quads $Q_1$, $Q_2$ and $Q_D$. Three types of processing elements would be required to detect the bit-quads $Q_1$, $Q_2$ and $Q_D$. Also, the adder complexity would be higher, as more variables are to be dealt with than in the run-based method. The formulation has a divide-by-4 operation. So, the width of the adder would be larger, and shift registers would be needed for the division. This bit-quad formulation for computing the Euler number has another drawback regarding border pixels. Any implementation of the bit-quad algorithm has to deal with convexity on the border by employing separate processing elements for the border pixels, in addition to the ones required for bit-quad checking. A parallel implementation based on Euler number computation from the thinned version [16] does not hold promise, as parallel thinning is in itself a non-trivial task [24]. Hardware implementations of quad-tree based algorithms [14, 15] are complicated, as the sizes of the blocks represented by the leaf nodes may be unequal. Further, the number of leaf nodes may vary widely for different image samples. The best known parallel algorithm has been proposed in [17]; it uses the connectivity graph (CG) derived from the cylindrical algebraic decomposition of the Euclidean plane. The authors describe a parallel implementation of their algorithm on a linear array network topology that uses the CG as the image data structure and performs a parallel search of the subgraphs in the CG. The estimated time complexity is $O((\lceil M/5 \rceil / 16)(1 + \log_2 M))$, where M is the number of arcs in the CG. A careful analysis reveals that the number of arcs M in the CG can be linear in the number of pixel entries (which is $N^2$), i.e., M can be $O(N^2)$, and so can the time complexity.
Our proposed algorithm with O(N log N) time complexity and O(N) gates thus compares favorably against existing algorithms.
Handling large images
Given an architecture for computing the Euler number of an image matrix of size N × N, the Euler number of a larger image of size K × K, where K = x × N, can be easily determined. The matrix is partitioned into several N × K sub-blocks $B_1, B_2, \ldots, B_x$, where each $B_i$ is an N × K matrix.
The Euler number of each such block of size N × K is computed using our proposed architecture for an N × N block by changing the size of the adder FA. Let $E_i$ be the Euler number of a block $B_i$. The adder tree can handle an image matrix with N rows, as it deals with only one column at a time. The adder FA adds up the values from all columns. When the number of columns changes from N to K, the width W of the adder FA should be made equal to $\lceil \log_2 \{K \times \lceil N/2 \rceil + 1\} \rceil$. To calculate the number of neighboring runs ($O_i$) between the last row of block $B_i$ and the first row of block $B_{i+1}$, we require processing elements $P_1$ and $P_2$. The output of $P_2$ is fed as an input to the last C-module in the column (Fig. 8). The outputs of FA corresponding to each block of size N × K are pipelined to the final sequential adder FRA (Final Root Adder) (see Fig. 8). Using Lemma 3, it can be deduced that the number of bits T required for the adder FRA is $\lceil \log_2 \{K \times \lceil K/2 \rceil + 1\} \rceil$. The clock period of the linear pipeline is T + d, where d is the delay of the latches. The number of stages $S_N$ of the linear pipeline in the scalable circuit is one more than in the earlier case and is $\lceil \log_2 N + 2 \rceil$. Therefore, the total time needed to compute the Euler number scalably is $(T + d) \times S_N \approx O(Nx \log_2 (Nx))$. The speed-up, efficiency and throughput are as follows:
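Speed-up, $S_k$: The number of tasks is now $Kx = K^2/N$ and the number of stages is $S_N$. So, using the speed-up expression $nk/(k + (n - 1))$ with $n = Kx$ and $k = S_N$, the speed-up is
$$S_k = \frac{Kx \cdot S_N}{S_N + Kx - 1}.$$
The ideal speed-up is $\lceil \log_2 N + 2 \rceil$.
Efficiency, η: The efficiency is the ratio of the speed-up and the number of stages and is
$$\eta = \frac{Kx}{S_N + Kx - 1}.$$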
It can be seen that, for a fixed N, the speed-up and efficiency of the pipeline increase with x and the throughput decreases with x. This is a desirable feature of the pipeline.
The circuit for computing the Euler number of an image of size (256 × 256) using the circuit module for a (16 × 16) image is shown in Fig. 8. We have assumed sequential full adders in our complexity analysis for both the normal and the scalable cases. It can be observed that the time complexity can be improved further by using carry-save addition [23].
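The adder widths and stage count of the scalable circuit follow from the expressions given for FA, FRA and $S_N$. The sketch below evaluates them for a K × K image processed with a module built for N rows; the function name, argument names and dictionary keys are hypothetical.

```python
def ceil_log2(x):
    """Ceiling of log2(x) for a positive integer x, in exact integer arithmetic."""
    return (x - 1).bit_length()


def scalable_parameters(block_rows, image_size):
    """Sizing of the scalable circuit for a K x K image split into N x K blocks."""
    if image_size % block_rows != 0:
        raise ValueError("image size K must be a multiple of block height N")
    n_blocks = image_size // block_rows                      # x = K / N
    fa_width = ceil_log2(image_size * ((block_rows + 1) // 2) + 1)
    fra_width = ceil_log2(image_size * ((image_size + 1) // 2) + 1)
    stages = ceil_log2(block_rows) + 2                       # one more than before
    return {
        "blocks": n_blocks,
        "fa_width": fa_width,
        "fra_width": fra_width,
        "stages": stages,
    }
```

For the (256 × 256) image built from the (16 × 16) module this gives 16 blocks, a 12-bit FA, a 16-bit FRA and 6 pipeline stages.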
VLSI implementation of the architecture
The proposed architecture has been designed on-chip using Mentor Graphics Leonardo Spectrum, Modelsim, and IC Station, run on a SUN Blade 2000 Workstation. The design is coded in VHDL and, after simulating and verifying it using Modelsim, all the VHDL modules are synthesized to generate the Verilog netlist using 0.18 micron technology. The Verilog netlist file is then fed to the tool IC Station, which produces the final chip layout using the standard cell design style. All the geometric information of the layout is produced in Graphic Design System-II (GDS-II) code, which is needed by the foundry for fabrication of the chip. The synthesized netlist report is presented in Table 2 for a 256 × 256 image. The internal zone area is 400,980.9 μm². The overall chip X-dimension is 607.9 μm and the Y-dimension is 671.2 μm. The critical delay in the circuit is 2.29 ns. The on-chip time to compute the Euler number of a 256 × 256 binary logo image turns out to be 0.6 μs.
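As a rough consistency check on these figures, the reported on-chip time matches the number of clock cycles of the linear pipeline, $M + \lceil \log_2 N \rceil$, multiplied by the critical delay. This assumes that the clock period equals the reported critical delay and that N = M = 256, which is an interpretation made here rather than a statement from the synthesis report:

```latex
\[
\bigl(M + \lceil \log_2 N \rceil\bigr) \times 2.29\ \text{ns}
  = (256 + 8) \times 2.29\ \text{ns}
  = 604.56\ \text{ns}
  \approx 0.6\ \mu\text{s}
\]
```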
Conclusions and discussions
A run-based algorithm for computing the Euler number of a binary image is formulated and its performance is analyzed. The algorithm is based on certain combinatorial and statistical properties of the runs present in the pixel matrix of the image. Analytical and experimental studies on a logo database show that the proposed algorithm significantly outperforms existing methods based on bit-quads or quad-trees. A new hardware implementation using a pipeline architecture for fast on-chip computation of the Euler number is also reported. The hardware design uses O(N) gates to compute the Euler number of an N × N image in O(N log N) time. This improves on the best known parallel implementation, which takes $O(N^2)$ time on a linear array network topology. The basic module can be used to handle arbitrarily large pixel matrices. The architecture has been implemented in VLSI, and relevant chip data on area and speed are reported. It has been observed that the on-chip computation is extremely fast, and hence the design will be useful in many real-time applications. The design of algorithms applicable to higher dimensions needs further investigation.
