The main contribution of this work is to present several hardware implementations of an "n choose k" counter (C(n, k) counter for short), which lists all n-bit numbers with (n − k) 0's and k 1's, and to show their applications. We first present the concept of C(n, k) counters and their efficient implementations on an FPGA. We then evaluate their performance in terms of the number of used slices and the clock frequency for the Xilinx VirtexII family FPGA XC2V3000-4. As a real-life application, we use a C(n, k) counter to accelerate a digital halftoning method that generates a binary image reproducing an original gray-scale image. This method repeatedly replaces the image pattern in a small square region of a binary image with the best one. By the partial exhaustive search using a C(n, k) counter, we succeeded in accelerating the task of finding the best image pattern and achieved a speedup factor of more than 2.5 over the simple exhaustive search.
key words: FPGA-based computing, instance-specific solutions, digital halftoning
Introduction
An FPGA (Field Programmable Gate Array) is a programmable VLSI in which a hardware design can be embedded quickly. Typical FPGAs consist of an array of programmable logic elements, distributed memory blocks, and programmable interconnections between them. The logic block usually contains either a four-input logic function or a multiplexer and several flip-flops. The distributed memory block is usually a dual-port RAM on which a word of data for possibly distinct addresses can be read/written at the same time. Design tools are available to the users to embed their hardware logic designs into the FPGAs. Our goal is to use FPGAs to accelerate useful computations. In particular, it is very challenging to develop FPGA-based solutions that are faster and more efficient than traditional software solutions.
Let C(n, k) denote the set of all n-bit binary numbers that have (n − k) 0's and k 1's. For example,

C(6, 3) = {000111, 001011, 001101, 001110, 010011, 010101, 010110, 011001, 011010, 011100, 100011, 100101, 100110, 101001, 101010, 101100, 110001, 110010, 110100, 111000}. (1)
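As a software reference for the definition above, the following minimal sketch lists C(n, k) in lexicographical order. The function name `choose_numbers` is only illustrative and does not come from the paper; numbers are represented as bit strings with the leftmost character as the most significant bit.

```python
from itertools import combinations


def choose_numbers(n, k):
    """Return all n-bit strings with exactly k 1's, in lexicographical order."""
    result = []
    for ones in combinations(range(n), k):
        bits = ["0"] * n
        for pos in ones:
            bits[pos] = "1"
        result.append("".join(bits))
    # For strings of equal length, string order equals numeric order.
    return sorted(result)


if __name__ == "__main__":
    print(choose_numbers(6, 3))
```

Calling `choose_numbers(6, 3)` reproduces the 20 numbers of the C(6, 3) example.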
An "n choose k" counter (C(n, k) counter for short) is a counter that lists all numbers in C(n, k). The main contribution of this paper is to present several hardware implementations of C(n, k) counters. We first present the concept of C(n, k) counters and discuss several straightforward implementations on an FPGA. We then go on to present several efficient implementations of C(n, k) counters on an FPGA. The second contribution of this paper is to use a C(n, k) counter to accelerate a digital halftoning method [6], which repeats the partial exhaustive search. Digital halftoning is a key operation to obtain binary images for printing [7], [8]. We use the partial exhaustive search to reduce the search space of the exhaustive search performed by the digital halftoning method presented in [6]. We have developed a halftoning system using a PCI-connected FPGA board with a Xilinx VirtexII family FPGA, XC2V3000-4 [15]. By the partial exhaustive search using a C(n, k) counter, we have achieved a speedup factor of 2.5 to 4.0 for digital halftoning.
This paper is organized as follows. In Sect. 2, we introduce the concept of C(n, k) counters and the motivation for our research. We then discuss several straightforward implementations of C(n, k) counters in Sect. 3. In Sect. 4, we show basic ideas for efficient implementations of C(n, k) counters. Sections 5 and 6 present the details of our implementations of a C(n, k) counter. In Sect. 7, we evaluate the performance of these implementations using the Xilinx VirtexII FPGA, XC2V3000-4. Section 8 shows how we apply C(n, k) counters to digital halftoning. Section 9 offers concluding remarks.
Concept and Motivation for C(n, k) Counters
It is well known that an n-bit binary counter can be simply implemented using n DFFs (D-type Flip Flops) and n HAs (Half Adders). A binary counter is mainly used to count the number of events, which are represented as edge triggers of a signal. On the other hand, an n-bit binary counter can also be used to list all $2^n$ binary numbers. For example, suppose that we have a function $f : \{0, 1\}^n \to \{0, 1, \ldots, m\}$ for some positive integer m, and we need to find an n-bit binary number r such that f(r) takes the minimum value over all possible $2^n$ n-bit binary numbers x. In other words, our task is to compute r such that

$$f(r) = \min\{ f(x) \mid x \in \{0, 1\}^n \}. \quad (2)$$
This task is a kind of combinatorial optimization, which has many practical applications. A fast and efficient solution for this task is to design an instance-specific solution using an FPGA as follows. We design a circuit that computes f(x) for any given n-bit binary number x. The output x of the n-bit counter is given to this circuit computing f(x). A comparator is used to compare the current value of f(x) and the minimum value obtained so far. If the current value f(x) is smaller, then the current minimum f(x) and x are updated. We refer the reader to Fig. 1 for an illustration of the hardware computing formula (2). This hardware approach is promising whenever there exists an efficient (i.e., compact and of small depth) circuit computing f. An example of a problem for which this approach works efficiently is the MAX-SAT problem. An input instance of the MAX-SAT problem is a set of m Boolean formulas $f_1, f_2, \ldots, f_m$ of n Boolean variables. The MAX-SAT problem is a combinatorial optimization problem to find an assignment of Boolean variable values that maximizes the number of satisfied formulas (or minimizes the number of unsatisfied formulas). To solve the MAX-SAT problem using the above approach, we define function $f : \{0, 1\}^n \to \{0, 1, \ldots, m\}$ such that f(x) is the number of formulas that are not satisfied by x, that is,

$$f(x) = \left|\{\, i \mid 1 \le i \le m,\ f_i(x) = 0 \,\}\right|. \quad (3)$$
It should be clear that r in formula (2) for function f in (3) is an optimal solution of the MAX-SAT problem. Also, Boolean formulas can be implemented in the FPGA by a combinational circuit in an obvious way. For example, an AND binary operator in a Boolean formula can be implemented using an AND gate with fan-in 2. Thus, the circuit computing f(x) above can be implemented in the FPGA very efficiently and the above approach works for the MAX-SAT problem. This approach is an instance-specific solution [1], [13], [19], because the circuit embedded in the FPGA depends on the input instance (i.e., the m Boolean formulas) of the problem. The above approach is also called (simple) exhaustive search, and it has a quite large search space of all $2^n$ values. It is not practical even if n is not large, say, n = 40. So, many researchers have devoted effort to developing practical methods to solve this type of combinatorial optimization problem. For example, heuristic approaches such as local search, genetic algorithms, approximation algorithms, and randomized algorithms are used to find either a nearly optimal solution or the best solution with high probability [4], [11]. Also, several FPGA-based instance-specific approaches for solving the SAT problem have been presented [17]-[19]. This paper presents a different approach that we call partial exhaustive search. This approach reduces the size of the search space.
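The counter, evaluation circuit, and comparator described above can be mimicked in software. The sketch below is only an illustration under stated assumptions: the instance is given as CNF-style clauses (lists of signed variable indices), whereas the text allows arbitrary Boolean formulas, and the names `count_unsatisfied` and `exhaustive_search` are hypothetical.

```python
def count_unsatisfied(clauses, x):
    """f(x): number of clauses not satisfied by assignment x.

    Assumption: each clause is a list of nonzero integers; literal v means
    variable v is true and -v means variable v is false. x[v - 1] is 0 or 1.
    """
    unsatisfied = 0
    for clause in clauses:
        satisfied = any((x[abs(lit) - 1] == 1) == (lit > 0) for lit in clause)
        if not satisfied:
            unsatisfied += 1
    return unsatisfied


def exhaustive_search(f, n):
    """Simple exhaustive search over all 2**n inputs (the n-bit binary counter)."""
    best_x, best_value = None, None
    for value in range(2 ** n):
        x = tuple((value >> i) & 1 for i in range(n))
        current = f(x)
        # Comparator: keep the input only if it is strictly better so far.
        if best_value is None or current < best_value:
            best_x, best_value = x, current
    return best_x, best_value


if __name__ == "__main__":
    clauses = [[1], [-1], [1, 2], [-2]]
    print(exhaustive_search(lambda x: count_unsatisfied(clauses, x), 2))
```

The loop body corresponds to one clock cycle of the hardware: one counter value, one evaluation of f, one comparison.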
Sometimes, function f has some properties which enable us to reduce the size of the search space. Let us see some examples. The first example is a property of biased input. Suppose that an input instance of the MAX-SAT problem is given in CNF (Conjunctive Normal Form) and most of the literals in the input formula are negative. If this is the case, it is expected that the optimal solution has few 1 (or true) assignments. Hence, we can omit the evaluation of f(x) for inputs x that have many 1's.
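Another example of a property of function f that enables us to reduce the size of the search space is concavity. Let $r_k$ be the optimal solution of f(x) over all numbers in C(n, k), that is,

$$f(r_k) = \min\{ f(x) \mid x \in C(n, k) \}. \quad (4)$$

It should be clear that

$$f(r) = \min\{ f(r_k) \mid 0 \le k \le n \}. \quad (5)$$

A function f is concave if there exists i (0 ≤ i ≤ n) such that

$$f(r_0) \ge f(r_1) \ge \cdots \ge f(r_i) \le f(r_{i+1}) \le \cdots \le f(r_n). \quad (6)$$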
Clearly, if f is concave and satisfies the above relation, then $f(r) = f(r_i)$ and $r_i$ is an optimal solution. If f is concave, we can find $r_i$ by the binary search or linear search techniques on $f(r_0), f(r_1), \ldots, f(r_n)$. Hence, we do not have to evaluate f over all $2^n$ n-bit numbers. Since the exhaustive search is performed to compute $f(r_k)$ for each k, we call this approach the partial exhaustive search.
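A minimal software sketch of the linear search variant follows. It assumes that the per-class minima first do not increase and then do not decrease as k grows, which is the concavity property in the sense used here; if f lacks this property the early stop may miss the optimum. The names `best_in_class` and `partial_exhaustive_search` are hypothetical.

```python
from itertools import combinations


def best_in_class(f, n, k):
    """Exhaustive search restricted to C(n, k); returns (f(r_k), r_k)."""
    best = None
    for ones in combinations(range(n), k):
        x = tuple(1 if i in ones else 0 for i in range(n))
        value = f(x)
        if best is None or value < best[0]:
            best = (value, x)
    return best


def partial_exhaustive_search(f, n):
    """Linear search over k = 0, 1, ..., stopping once f(r_k) starts to increase.

    Assumption: f(r_0) >= ... >= f(r_i) <= ... <= f(r_n) for some i.
    """
    best = best_in_class(f, n, 0)
    previous = best[0]
    for k in range(1, n + 1):
        current = best_in_class(f, n, k)
        if current[0] > previous:
            break  # past the minimum; larger k cannot improve under the assumption
        if current[0] < best[0]:
            best = current
        previous = current[0]
    return best


if __name__ == "__main__":
    print(partial_exhaustive_search(lambda x: abs(sum(x) - 2), 6))
```

For the toy function in the usage line, only the classes k = 0, 1, 2, 3 are evaluated, so 42 of the 64 possible inputs are visited.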
If function f satisfies these properties, it is sufficient to compute $r_k$ in (4) for several values of k. To compute $r_k$ by the instance-specific FPGA-based approach, we can use the hardware illustrated in Fig. 1, where the n-bit binary counter is replaced by a C(n, k) counter. Thus, designing efficient implementations of C(n, k) counters is significant work.
Straightforward Implementations of C(n, k) Counters
As we have mentioned, it is well known that an n-bit binary counter can be implemented using n DFFs and n HAs. However, the implementation of a C(n, k) counter is not trivial. We classify implementations of a C(n, k) counter using the following terminology:
lexicographical: an implementation of a C(n, k) counter is lexicographical if it outputs all numbers in lexicographical order. More precisely, in lexicographical order, for any two C(n, k) numbers x and y, x must appear before y if x < y.

redundant: an implementation of a C(n, k) counter is redundant if it outputs more than $\binom{n}{k}$ numbers including all numbers in C(n, k). A redundant implementation must provide a redundant bit indicating that the current output is redundant. In other words, the redundant bit is low (or 0) in exactly $\binom{n}{k}$ clock cycles and every number in C(n, k) is provided in these clock cycles.
If an implementation of a C(6, 3) counter outputs all numbers in (1) in this order, it is lexicographical. It is also non-redundant if all the 20 6-bit numbers in (1) are provided one by one in every clock cycle. Sometimes, a lexicographical implementation of a C(n, k) counter is necessary, because the lexicographically first best solution is required in some combinatorial optimization problems [10]. All of the implementations presented in this paper are lexicographical.
Let us observe a simple example of a redundant implementation of a C(n, k) counter that we call the naive implementation. This implementation uses an n-bit binary counter and a tree of adders. The output sequence of the naive implementation for a C(4, 2) counter is as follows:

0000[1], 0001[1], 0010[1], 0011[0], 0100[1], 0101[0], 0110[0], 0111[1], 1000[1], 1001[0], 1010[0], 1011[1], 1100[0], 1101[1], 1110[1], 1111[1],

where [0] and [1] represent the redundant bit. The redundant bit is 1 iff the number of 1's in a 4-bit number is not 2. The n-bit output is exactly the output of an n-bit binary counter. The Muller-Preparata circuit [9], [12]-[14], which counts the number of 1's in an n-bit binary number, is used to compute the number of 1's in the current output. The basic structure of the Muller-Preparata circuit is a tree of adders, which has O(n) gates with depth O(log n) [14]. We can determine if the current output has exactly k 1's using a (log n)-bit comparator, and so the redundant bit can be provided in an obvious way. However, this naive implementation has too many redundant numbers. If we use it for the partial exhaustive search, the search operation for all possible $2^n$ instances has to be performed.

Another simple implementation of a C(n, k) counter is the ROM implementation, which uses a memory block of an FPGA as a ROM. In the ROM implementation, we use an n-bit $2^n$-word ROM, which stores all numbers in C(n, 0), C(n, 1), . . ., C(n, n). More specifically, the j-th ($0 \le j \le \binom{n}{k} - 1$) number of C(n, k) is stored in the word with address $\binom{n}{0} + \binom{n}{1} + \cdots + \binom{n}{k-1} + j$. It should be clear that, by reading the words with addresses

$$\sum_{i=0}^{k-1}\binom{n}{i},\ \sum_{i=0}^{k-1}\binom{n}{i} + 1,\ \ldots,\ \sum_{i=0}^{k-1}\binom{n}{i} + \binom{n}{k} - 1$$

in this order, we can list all numbers in C(n, k).
The ROM implementation is possible only if a ROM that can store all necessary $2^n$ n-bit numbers is available. Since current FPGAs have memory blocks of up to several megabits, the ROM implementation is possible if n is small, say n = 16. If n = 16, we need $16 \cdot 2^{16}$ = 1M bits. In this paper, we focus on implementations of a C(n, k) counter that do not use memory blocks of the FPGA.

We first present two non-redundant implementations of a C(n, k) counter that we call the simple shift and the binary shift. The simple shift implementation runs at a high frequency for small n, although it uses so many gates that it does not fit in the FPGA for large n. The binary shift implementation uses a much smaller number of gates, but it runs at a low frequency. We then go on to present two redundant implementations of a C(n, k) counter that we call the left shift and the right shift. The key idea of these implementations is to use a shift register to find the next number. These implementations work at a higher frequency than the non-redundant implementations for large n. The right shift implementation has fewer redundant states than the left shift implementation if k > n/2 and has more if k < n/2. If k = n/2, they have the same number of redundant states.

Table 1 summarizes the theoretical analysis of the used ROM bits, the number of used gates, the maximum delay between DFFs, and the clock cycles necessary to list all $\binom{n}{k}$ numbers in C(n, k). Note that an implementation is redundant if it runs in more than $\binom{n}{k}$ clock cycles. Every implementation uses n DFFs to store the current n-bit number and O(log n) DFFs for storing the value of k and for the state control. The naive and the ROM implementations use a binary counter involving an n-bit adder, which can be implemented in O(n) gates with O(log n) depth using the carry lookahead technique [2], [16]. The ROM implementation uses a ROM with $n2^n$ bits. The details of the theoretical analysis of our implementations will be given later.
Although theoretical analysis is important for large n, it often does not reflect real performance for practically small n. Hence, we also evaluate the clock frequency and the amount of FPGA hardware resources used.
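The two straightforward implementations of this section can be modeled in a few lines of software. The sketch below is an illustration only: `naive_counter` yields one (output, redundant bit) pair per clock cycle of the naive implementation, and `rom_base_address` gives the first ROM address of C(n, k) under the address layout described for the ROM implementation. Both function names are hypothetical.

```python
from math import comb


def naive_counter(n, k):
    """Model of the naive implementation: an n-bit binary counter plus a
    redundant bit that is 0 only when the output has exactly k 1's."""
    for value in range(2 ** n):
        bits = format(value, "0{}b".format(n))
        redundant = 0 if bits.count("1") == k else 1
        yield bits, redundant


def rom_base_address(n, k):
    """Address of the first number of C(n, k) when C(n, 0), ..., C(n, n)
    are stored consecutively in the ROM."""
    return sum(comb(n, i) for i in range(k))


if __name__ == "__main__":
    for bits, redundant in naive_counter(4, 2):
        print("{}[{}]".format(bits, redundant))
    print(rom_base_address(4, 2))
```

The model makes the cost of the naive implementation visible: it always takes 2**n cycles, regardless of k.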
Basic Ideas for Implementing C(n, k) Counters
The main purpose of this section is to show basic ideas for implementing non-trivial C(n, k) counters.
We can list all numbers in C(n, k) in lexicographical order by the following five rules:
Rule 0: (initialization) Let the current number be $0^{n-k}1^k$.

In Rules 1 and 2, the swap operation is performed to find the next number. Both the swap and the shift operations are performed when Rule 3 is applied.
First, we show how we implement the swap operation which is performed in Rules 1, 2, and 3. For this purpose, we determine index p, the smallest index such that $x_{p+1} = 0$ and $x_p = 1$. Let

$$y_i = \overline{x_{i+1}} \wedge x_i, \qquad z_i = y_i \vee y_{i-1} \vee \cdots \vee y_1 \qquad (1 \le i \le n-1).$$

Since z is the prefix OR of y, z can be obtained using the parallel prefix circuit [2], [3], which has O(n) gates of depth O(log n). Let $u_1 = z_1$, and $u_i = z_i \wedge \overline{z_{i-1}}$ for each i (2 ≤ i ≤ n − 1). It should be clear that $u_i = 1$ iff p = i. We refer the reader to Table 2 for examples of x, y, z, and u. The swap operation can be simply done by

$$x_i \leftarrow x_i \oplus (u_i \vee u_{i-1}) \qquad (1 \le i \le n), \quad (7)$$

where $u_n = u_0 = 0$ and ⊕ denotes the XOR operator.

Next, we will show how the shift operation is implemented. Recall that the shift operation is performed for Rule 3. Let $s_i = \overline{z_i} \wedge x_i$ for each i (1 ≤ i ≤ n − 2). Clearly, s is the sequence of bits to be shifted to the right. Let $t_{n-2} t_{n-3} \cdots t_1$ be the sequence of bits obtained by repeatedly shifting $s_{n-2} s_{n-3} \cdots s_1$ to the right until the rightmost bit is 1. We refer the reader to Table 2 for examples of s and t. Once t is obtained, we can perform the shift operation by the following formula:

$$x_i \leftarrow (x_i \wedge \overline{s_i}) \vee t_i \qquad (1 \le i \le n), \quad (8)$$

where $s_n = s_{n-1} = t_n = t_{n-1} = 0$ for simplicity. We assume that every bit of t is 0 when all bits of s are 0. Then, when Rule 1 or 2 is applied, $s_i = t_i = 0$ for all i. Thus, combining formulas (7) and (8), regardless of the applied rule, the next number can be obtained by a single formula as follows:

$$x_i \leftarrow \big( (x_i \oplus (u_i \vee u_{i-1})) \wedge \overline{s_i} \big) \vee t_i \qquad (1 \le i \le n). \quad (9)$$
Note that if $z_{n-1} = 0$ then $y_{n-1} = y_{n-2} = \cdots = y_1 = 0$. If this is the case, there exists no p such that $x_{p+1} = 0$ and $x_p = 1$. In other words, $x_n x_{n-1} \cdots x_1 = 1^k 0^{n-k}$ and Rule 4 (termination) should be applied.
As we have seen, y can be obtained by n − 1 NOT gates and n − 1 AND gates. The prefix OR circuit, which can be implemented using O(n) gates of depth O(log n) [2], is used to compute z. Once z is obtained, u and s can each be computed using n − 2 NOT gates and n − 2 AND gates. After that, if t is obtained, each $x_i$ can be computed using two OR gates, one AND gate, and one XOR gate. Thus, a C(n, k) counter can be implemented using O(n) gates of depth O(log n), excluding the circuit for computing t from s. However, it is not easy to obtain t. In what follows, we will show how we obtain t from s. For later reference, let $s = 0^{N-l-m}1^l0^m$, where N = n − 2. Clearly, we need to compute $t = 0^{N-l}1^l$.
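The bit-vector construction of this section can be checked in software. The sketch below is a model under stated assumptions, not the hardware design: it takes y_i = NOT x_{i+1} AND x_i, z as the prefix OR of y, u_i = z_i AND NOT z_{i-1}, and s_i = NOT z_i AND x_i, and it obtains t simply by counting the 1's in s (the part that the later sections implement in hardware). The function name `next_number` is hypothetical.

```python
def next_number(x_bits):
    """Return the successor of x_bits (string x_n ... x_1) in C(n, k),
    or None when the termination rule applies."""
    n = len(x_bits)
    # x[i] for 1 <= i <= n, with x[1] the rightmost bit; x[0] is unused.
    x = [0] + [int(c) for c in reversed(x_bits)]

    # y_i = NOT x_{i+1} AND x_i, and z = prefix OR of y (z[0] = 0).
    y = [0] * (n + 1)
    z = [0] * (n + 1)
    for i in range(1, n):
        y[i] = (1 - x[i + 1]) & x[i]
        z[i] = z[i - 1] | y[i]
    if n < 2 or z[n - 1] == 0:
        return None  # no index p exists: x = 1^k 0^(n-k)

    # u marks the swap position p; u[0] = u[n] = 0.
    u = [0] * (n + 1)
    for i in range(1, n):
        u[i] = z[i] & (1 - z[i - 1])

    # s holds the 1's below position p; s[n-1] = s[n] = 0.
    s = [0] * (n + 1)
    for i in range(1, n - 1):
        s[i] = (1 - z[i]) & x[i]

    # t packs the 1's of s to the right end (computed here by counting).
    t = [0] * (n + 1)
    for i in range(1, sum(s) + 1):
        t[i] = 1

    # Combined swap-and-shift update.
    nxt = [0] * (n + 1)
    for i in range(1, n + 1):
        swapped = x[i] ^ (u[i] | u[i - 1])
        nxt[i] = (swapped & (1 - s[i])) | t[i]
    return "".join(str(nxt[i]) for i in range(n, 0, -1))


if __name__ == "__main__":
    current = "000111"
    while current is not None:
        print(current)
        current = next_number(current)
```

Starting from 000111, repeated calls walk through the 20 numbers of C(6, 3) in lexicographical order and then return `None`.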
Non-redundant Implementations of C(n, k) Counters
The main purpose of this section is to show two implementations, the simple shift and the binary shift, that compute t from s in a single clock cycle. Thus, these implementations of a C(n, k) counter are non-redundant.
The Simple Shift Implementation
The simple shift implementation uses all the shifted sequences of s. For each i and j (1 ≤ i ≤ N, 0 ≤ j ≤ N − 1), let

$$s^{[j]}_i = \begin{cases} s_{i+j} & \text{if } i + j \le N, \\ 0 & \text{otherwise.} \end{cases} \quad (10)$$

In other words, $s^{[j]}$ is the sequence obtained by shifting s by j bits to the right. Then, t is obtained from the shifted sequences $s^{[0]}, s^{[1]}, \ldots, s^{[N-1]}$ by formula (11).

Let us confirm that t is correctly computed by formulas (10) and (11). Recall that $s = 0^{N-l-m}1^l0^m$, so that $s^{[m]} = 0^{N-l}1^l$, whose i-th bit is 1 exactly when 1 ≤ i ≤ l. Therefore, $t_1 = t_2 = \cdots = t_l = 1$ and $t_{l+1} = t_{l+2} = \cdots = t_N = 0$, and thus t is computed correctly.

Let us evaluate the number of gates used to compute t. Each $t_i$ can be computed using N − i + 1 AND gates and N − i OR gates. Thus, t can be computed using $O(N^2) = O(n^2)$ AND gates and OR gates. Since each $t_i$ can be computed by a tree of at most N − 1 OR gates with fan-in 2, the depth of the circuit is O(log N) = O(log n).

The Binary Shift Implementation
The binary shift implementation computes the binary representation of the number of 1's in s and generates the same number of 1's by exponential shifting. For simplicity, we assume that $N = 2^u - 1$ for some integer u. Let l be the number of 1's in s and $l_u l_{u-1} \cdots l_1$ be the binary representation of l. The binary representation $l_u l_{u-1} \cdots l_1$ can be computed by the Muller-Preparata adder tree circuit [12]. Let $s^j$ (0 ≤ j ≤ u) be a sequence of length $2^j - 1$ determined by the following procedure†:

$$s^j = \begin{cases} s^{j-1}\, 1^{2^{j-1}} & \text{if } l_j = 1, \\ 0^{2^{j-1}}\, s^{j-1} & \text{if } l_j = 0 \end{cases} \qquad (1 \le j \le u).$$

If $l_j = 1$ then $2^{j-1}$ 1's are added to the sequence. Thus, it is not difficult to see that $t = s^u$ holds. Further, each $s^j$ can be computed from $s^{j-1}$ using $2^j - 1$ multiplexers whose output is determined by $l_j$. Thus, t can be computed using at most $(2^1 - 1) + (2^2 - 1) + \cdots + (2^u - 1) < 2N < 2n$ multiplexers. Also, it is easy to confirm that the depth of the circuit is O(log n).
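The two single-cycle ways of obtaining t from s can be compared with a small software model. This is a sketch under assumptions: `simple_shift` tries the right shifts one after another and returns the first whose rightmost bit is 1 (in hardware all shifts exist in parallel and the selection logic may differ), and `binary_shift` builds the result from the bits of l by appending 2^(j-1) 1's when bit j of l is 1 and prepending 2^(j-1) 0's otherwise. Both function names are hypothetical, and s is a string whose rightmost character is s_1.

```python
def simple_shift(s):
    """Return t for s = 0...0 1...1 0...0 by examining the right shifts of s."""
    n_bits = len(s)
    for j in range(n_bits):
        shifted = "0" * j + s[: n_bits - j]  # s shifted right by j bits
        if shifted.endswith("1"):
            return shifted
    return "0" * n_bits  # s contains no 1's


def binary_shift(s):
    """Return t using the binary representation of the number of 1's in s.

    Assumption: len(s) = 2**u - 1 for some integer u.
    """
    n_bits = len(s)
    u = (n_bits + 1).bit_length() - 1
    if 2 ** u - 1 != n_bits:
        raise ValueError("length must be 2**u - 1")
    ones = s.count("1")
    sequence = ""  # the null string of length zero
    for j in range(1, u + 1):
        block = 2 ** (j - 1)
        if (ones >> (j - 1)) & 1:
            sequence = sequence + "1" * block
        else:
            sequence = "0" * block + sequence
    return sequence


if __name__ == "__main__":
    print(simple_shift("0011100"), binary_shift("0011100"))
```

Both functions return 0000111 for s = 0011100, which is the expected t for l = 3.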
Redundant Implementations of C(n, k) Counters
In this section, we present two implementations called right shift and left shift that compute t from s in several clock cycles. Since we do not have to compute t in a single clock cycle, we can obtain high clock frequency. The key idea is to (cyclically) shift s by one position either to the right or to the left until we obtain t. Both implementations use an N-bit shift register.
The Right Shift Implementation
The right shift implementation uses an N-bit shift register to store s and shifts it by one position to the right in every clock cycle. Again, recall that $s = 0^{N-l-m}1^l0^m$. Thus, it takes m clock cycles to obtain t. Note that, when Rule 3 is applied, the swap operation is performed before the shift operation starts. For a C(6, 3) counter, the right shift implementation has 7 redundant states, in which the redundant bit is high ([1]).
Let us evaluate the number of redundant states of a C(n, k) counter using the right shift implementation. For this purpose, let us observe the numbers that appear in the redundant state. Such numbers for the C(6, 3) counter are as follows: 010110, 011010, 101100, 100110, 101010, 110100, 110010.
The numbers in the redundant state satisfy the following properties:
(1) the rightmost bit is 0, (2) the three (i.e., k) 1's are not consecutive, and (3) every number satisfying (1) and (2) appears exactly once.
If the rightmost bit of a number is 1, then the shift operation is completed and it is not a redundant state. Hence (1) must be satisfied. Also, a number that has k consecutive 1's cannot be a number in the redundant state, because the swap operation is performed before the shift operation starts. Thus, (2) must be satisfied. If a number satisfying (1) and (2) appears twice, then the same number appears after the shift operation is completed. Since this is not possible, no number satisfying (1) and (2) appears twice. Further, every number satisfying (1) and (2) must appear, and thus (3) must be satisfied. It is not difficult to confirm that $\binom{n-1}{k}$ numbers satisfy (1), and n − k of them have k consecutive 1's. Thus, we have the following theorem.

Theorem 1: The right shift implementation of a C(n, k) counter has $\binom{n-1}{k} - (n - k)$ redundant states.

Since we have $\binom{n}{k}$ non-redundant states, the ratio of the redundant to the non-redundant states is approximately

$$\binom{n-1}{k} \Big/ \binom{n}{k} = \frac{n-k}{n}.$$

Therefore, the right shift implementation is efficient for larger k.

† Note that $s^0$ is the null string of length zero.
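The counting argument for the right shift implementation can be checked by simulation. The following sketch is a software model, not the circuit: it performs the swap first and then emits one redundant output per right shift of the low bits until the rightmost bit is 1. The function name `right_shift_redundant_states` is hypothetical, and the count it is compared against, C(n-1, k) - (n - k), is the one that follows from properties (1) to (3) for 1 ≤ k ≤ n − 1.

```python
from math import comb


def right_shift_redundant_states(n, k):
    """Return the redundant outputs, in order, of a right shift C(n, k) counter."""
    states = []
    x = "0" * (n - k) + "1" * k
    while True:
        stripped = x.rstrip("0")
        m = len(x) - len(stripped)        # trailing 0's
        head = stripped.rstrip("1")
        a = len(stripped) - len(head)     # length of the lowest run of 1's
        if head == "":
            return states                 # x = 1^k 0^(n-k): termination
        swapped_head = head[:-1] + "10"   # swap operation at position p
        low = "1" * (a - 1) + "0" * m     # bits to be shifted (the sequence s)
        while "1" in low and low.endswith("0"):
            states.append(swapped_head + low)  # redundant state
            low = "0" + low[:-1]               # shift right by one position
        x = swapped_head + low


if __name__ == "__main__":
    print(right_shift_redundant_states(6, 3))
    print(comb(5, 3) - (6 - 3))
```

For C(6, 3) the simulation yields the seven numbers 010110, 011010, 101100, 100110, 101010, 110100, 110010 listed in the text.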
The Left Shift Implementation
In the left shift implementation, we shift the register storing s cyclically by one position to the left in every clock cycle. For example, if s = 0011100, then the cyclic shift is performed as follows:
0011001, 0010011, 0000111.
For a C(6, 3) counter, the left shift implementation also has 7 redundant states. To simplify the evaluation of the number of redundant states, we assume that the swap operation is performed after the shift operation is completed.
Let us evaluate the number of the redundant states. Again, let us observe the numbers that appear in the redundant states using those for C(6, 3) as follows: 001101, 001011, 010101, 011001, 010011, 100101, 101001.
Similarly, these numbers in the redundant state satisfy the following properties:
(1) the rightmost bit is 1, (2) the three (i.e., n − k) 0's are not consecutive, and (3) every number satisfying (1) and (2) appears exactly once.
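Since we can verify these properties similarly to those for the right shift implementation, we omit the proof. It is easy to see that $\binom{n-1}{k-1}$ numbers satisfy (1), and k of them have n − k consecutive 0's. Thus, we have the following theorem.

Theorem 2: The left shift implementation of a C(n, k) counter has $\binom{n-1}{k-1} - k$ redundant states.

Since we have $\binom{n}{k}$ non-redundant states, the ratio of the redundant to the non-redundant states is approximately k/n.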
Therefore, for small k, the left shift implementation has fewer redundant states than the right shift implementation. We should note that by inverting the output of a C(n, n− k) counter using n NOT gates, we can obtain a C(n, k) counter. Thus, we can obtain a C(n, k) counter with fewer redundant states using the right shift implementation for small k. However, the resulting output sequence is not lexicographical.
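The cyclic left shift example given earlier in this subsection (0011100 to 0011001, 0010011, 0000111) can be reproduced with a short model. It rests on an assumption inferred from that example only: the rotation acts on the bits from the leftmost 1 of s downward, while the leading 0's stay in place. The function name `left_shift_steps` is hypothetical.

```python
def left_shift_steps(s):
    """Return the register contents after each cyclic left shift until t is reached.

    Assumption: only the part of s starting at its leftmost 1 is rotated.
    """
    first_one = s.find("1")
    if first_one < 0:
        return []  # nothing to shift
    high, low = s[:first_one], s[first_one:]
    target = "0" * low.count("0") + "1" * low.count("1")
    steps = []
    while low != target:
        low = low[1:] + low[0]  # cyclic shift by one position to the left
        steps.append(high + low)
    return steps


if __name__ == "__main__":
    print(left_shift_steps("0011100"))
```

Under this model the number of shifts equals the number of 1's in s whenever s ends in a 0, in contrast to the right shift implementation, where it equals the number of trailing 0's.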
Performance Evaluation
This section presents the performance evaluation for the Xilinx VirtexII family FPGA XC2V3000-4, which has 14336 slices. A slice is a unit block of the VirtexII FPGA, which has two four-input function generators, carry logic, multiplexers, and two storage elements [5]. We used the Xilinx ISE logic design tool (Ver 6.3i) to analyze the timing and the number of slices used. We wrote the source code of the C(n, k) counter implementations at RTL (Register Transfer Level) in Verilog HDL. We used default parameter values, for example "Optimization goal = Speed" and "Optimization effort = Normal", for logic synthesis with the Xilinx ISE logic design tool, and gave no user constraints when synthesizing our Verilog HDL source code. Figure 2 shows the clock frequency and the number of used slices for n = 8, 16, 32, 64, 128, 256, 512, and 1024, estimated from the net list obtained by the XST logic synthesis tool, which is a part of the Xilinx ISE logic design tool. Note that these performance figures are obtained from the net list. After the implementation (i.e., mapping and routing), the actual clock frequency can be 10%-30% lower.
For n ≥ 512, the simple shift implementation does not fit in the XC2V3000-4. The simple shift implementation runs at the highest frequency for small n, because the circuit is simple and compact for small n. The binary shift implementation uses fewer slices than the simple shift one, but it runs at a lower frequency. The binary shift, the left shift, and the right shift implementations are comparable in the number of used slices. The clock frequency of the binary shift is the worst of the four implementations. Recall that the binary shift implementation has two circuits: (1) the Muller-Preparata adder tree circuit to compute the number of 1's and (2) the multiplexer tree to generate consecutive 1's. Although both circuits have O(log n) depth, the adder tree is complicated and has large depth. To confirm this fact from the practical point of view, we have performed the logic synthesis for the Verilog HDL source code of these circuits independently. The XST logic synthesis tool reported that the adder tree and the multiplexer tree run at 67 MHz and 244 MHz, respectively, for n = 128. It follows that the adder tree is the bottleneck for the performance of the binary shift implementation. Further, the simple shift implementation runs at 71 MHz for n = 128. Hence, as long as the Muller-Preparata adder tree is used, the clock frequency of the binary shift implementation cannot be better than that of the simple shift.
The fake implementation in Fig. 2 is an implementation of a C(n, k) counter from which the circuit for computing t is removed. In other words, the fake implementation consists of the circuits common to the four implementations. Although the fake implementation does not compute C(n, k) numbers correctly, it is useful for analyzing the complexity of the four implementations: it bounds the performance of C(n, k) counter implementations in the sense that the performance of any C(n, k) counter implementation cannot be better than that of the fake implementation. For n = 128, the fake, the simple shift, and the binary shift implementations run at 112 MHz, 70 MHz, and 45 MHz, respectively. It follows that the delay for computing t in the binary shift implementation is dominant when n = 128, while that in the simple shift is small. On the other hand, for n = 128, the numbers of used slices are 324, 4034, and 788, respectively. Hence, most of the slices in the simple shift implementation are used to compute t.
Suppose that, for each k (0 ≤ k ≤ n), a C(n, k) counter is used h(k) times to solve some problem. Recall that the binary shift implementation runs in $\binom{n}{k}$ clock cycles to list all numbers in C(n, k), since it is non-redundant. Thus, it runs in

$$T = \sum_{0 \le k \le n} h(k) \binom{n}{k}$$

clock cycles if we use the binary shift implementation. On the other hand, the right shift implementation runs approximately in

$$\sum_{0 \le k \le n} h(k) \left(1 + \frac{n-k}{n}\right) \binom{n}{k}$$

clock cycles to solve the problem. Hence, we can guarantee that the right shift implementation (and the left shift implementation) never runs in more than 2T cycles. Further, if a C(n, k) counter is used almost symmetrically, that is, if h(k) ≈ h(n − k) for every k, then the right shift and the left shift implementations run in approximately

$$\frac{3}{2} T$$

clock cycles. If this is the case, the right shift and the left shift implementations list all numbers faster than the binary shift implementation when their clock frequency is more than 3/2 times that of the binary shift. Actually, the right shift and the left shift implementations are faster for n ≥ 256 and the binary shift implementation is faster for n ≤ 32 to list all C(n, k) numbers.
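The cycle counts compared above are easy to evaluate numerically. The sketch below assumes the approximate redundancy ratio (n − k)/n for the right shift implementation, so that each use of a C(n, k) counter costs about C(n, k) + C(n−1, k) cycles; the function names and the usage profile `h` are hypothetical.

```python
from math import comb


def cycles_non_redundant(n, h):
    """T: clock cycles of a non-redundant counter, where h[k] is the number
    of times the C(n, k) counter is used."""
    return sum(h[k] * comb(n, k) for k in range(n + 1))


def cycles_right_shift(n, h):
    """Approximate clock cycles of the right shift implementation.

    Uses (1 + (n - k) / n) * C(n, k) = C(n, k) + C(n - 1, k).
    """
    return sum(h[k] * (comb(n, k) + comb(n - 1, k)) for k in range(n + 1))


if __name__ == "__main__":
    n = 10
    uniform = [1] * (n + 1)
    print(cycles_non_redundant(n, uniform), cycles_right_shift(n, uniform))
```

For a symmetric profile such as the uniform one in the usage lines, the second value is exactly 3/2 of the first; for any profile it stays at or below twice the first.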
Applications to the Partial Exhaustive Search
The main purpose of this section is to present how we use a C(n, k) counter for a digital halftoning method presented in [6] , which finds a high quality binary image reproducing an original gray-scale image.
Suppose that an original gray-scale image $A = (a_{i,j})$ of size n × n is given, where $a_{i,j}$ denotes the intensity level at position (i, j) (1 ≤ i, j ≤ n) taking a real number in the range [0, 1]. The goal of halftoning is to find a binary image $B = (b_{i,j})$ of the same size that reproduces original image A, where each $b_{i,j}$ is either 0 (black) or 1 (white). Halftoning is one of the necessary tasks to print gray-scale images using laser and ink jet printers. We measure the goodness of output binary image B using the Gaussian filter that approximates the characteristics of the human visual system. Let $V = (v_{s,t})$ denote a Gaussian filter, i.e., a 2-dimensional symmetric matrix of size (2w + 1) × (2w + 1) satisfying $\sum_{-w \le s,t \le w} v_{s,t} = 1$, where each $v_{s,t}$ (−w ≤ s, t ≤ w) is determined by a 2-dimensional Gaussian distribution. The image $C = (c_{i,j})$ restored from a binary image $B = (b_{i,j})$ by applying the Gaussian filter is a gray-scale image:

$$c_{i,j} = \sum_{-w \le s,t \le w} v_{s,t}\, b_{i+s,j+t} \qquad (1 \le i, j \le n). \quad (12)$$

From $\sum_{-w \le s,t \le w} v_{s,t} = 1$, each $c_{i,j}$ takes a real number in the range [0, 1]. The error at each pixel position (i, j) of image B is defined by

$$e_{i,j} = a_{i,j} - c_{i,j}, \quad (13)$$

and the total error is defined by

$$\mathrm{Error}(A, B) = \sum_{1 \le i,j \le n} |e_{i,j}|. \quad (14)$$
Since the Gaussian filter approximates the characteristics of the human visual system, we can think that image B reproduces original gray-scale image A if Error(A, B) is small enough.
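The error measure of this section can be sketched in plain Python. Several details below are assumptions made for illustration: the filter is a sampled Gaussian with a hypothetical parameter `sigma`, pixels outside the image are treated as 0, and the total error is taken as the sum of absolute differences between A and the restored image. The function names are hypothetical.

```python
import math


def gaussian_filter(w, sigma):
    """Return a (2w+1) x (2w+1) Gaussian filter as a dict {(s, t): v}, summing to 1."""
    raw = {
        (s, t): math.exp(-(s * s + t * t) / (2.0 * sigma * sigma))
        for s in range(-w, w + 1)
        for t in range(-w, w + 1)
    }
    total = sum(raw.values())
    return {key: value / total for key, value in raw.items()}


def restore(B, V):
    """Apply filter V to binary image B (list of rows); outside pixels count as 0."""
    n = len(B)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for (s, t), v in V.items():
                ii, jj = i + s, j + t
                if 0 <= ii < n and 0 <= jj < n:
                    acc += v * B[ii][jj]
            C[i][j] = acc
    return C


def total_error(A, B, V):
    """Sum over all pixels of |a - c|, where C is the image restored from B."""
    C = restore(B, V)
    n = len(A)
    return sum(abs(A[i][j] - C[i][j]) for i in range(n) for j in range(n))


if __name__ == "__main__":
    A = [[0.5, 0.5], [0.5, 0.5]]
    B = [[1, 0], [0, 1]]
    print(total_error(A, B, gaussian_filter(1, 1.0)))
```

With w = 0 the filter degenerates to the identity, which gives a quick sanity check of the error computation.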
It is known that a good binary image B with small total error can be obtained by the partial exhaustive search for windows of the binary image [6]. We briefly explain the idea of the partial exhaustive search. Suppose that an original image A and a temporary binary image B are given. Let W(i, j) be a window of size m × m in B whose top-left corner is at position (i, j), x be a binary pattern in W(i, j), and B/x be the binary image such that W(i, j) of B is replaced by x. We refer the reader to Fig. 3 for an illustration. We define function f by

$$f(x) = \mathrm{Error}(A, B/x). \quad (15)$$
By computing f(x) for all possible $2^{m^2}$ bit patterns x, we can obtain an optimal pattern r in formula (2). Clearly, the total error of B/r is not larger than that of B. The idea of the partial exhaustive search is to repeat this operation for a window moving in the raster scan order. The halftoning algorithm presented in [6] first initializes a binary image by random thresholding, and then repeats the partial exhaustive search in raster scan order until no more improvement is possible. The resulting binary images are sharp and of high quality, and reproduce the continuous tone of original images very well. Now, we have the following conjecture in terms of formula (15).
Conjecture 3:
Function f of formula (15) is concave.
We have no proof for this conjecture, but we believe it is correct because we have not found a counterexample. Assuming that f satisfies formula (6), we can find the best binary pattern r in formula (2) by the binary search or linear search techniques.
We have performed the linear search technique to find r as follows: Let β be the total intensity in window W(i, j) of the current binary image B, that is,

$$\beta = \sum_{(i', j') \in W(i, j)} b_{i', j'}.$$

Since B is an intermediate solution, it is not a "bad" binary image.

We have used a C(n, k) counter to accelerate a digital halftoning method that generates a binary image reproducing an original gray-scale image. By the partial exhaustive search using a C(n, k) counter, we accelerated this task and achieved a speedup factor of more than 2.5 over the simple exhaustive search. For a whitish gray-scale image, the speedup factor is more than 4. The partial exhaustive search is helpful for reducing the size of the search space. In particular, if a combinatorial problem is represented by a function f that is either biased or concave, this approach may work very well. It would be of interest to apply this approach to real-life combinatorial optimization problems.
