Proposed here is a unique cell histogram architecture that processes k data items in parallel to compute $2^q$ histogram bins per time step. An array of $m/2^q$ cells computes an m-bin histogram with a speedup factor of k; k ≥ 2 makes it faster than current dual-ported memory implementations. Furthermore, simple mechanisms for conflict-free storing of the histogram bins into an external memory array are discussed.
Introduction:
The real-time computation of a histogram is a common operation in computer vision and image processing, for example in tone reproduction and contrast enhancement in the image-processing engines of digital still cameras [1].
Parallel histogram methods that exploit software and hardware techniques do exist [2]. The two most conventional techniques for hardware-based histogram implementation use either an array of counters or a memory array. However, an implementation with an array of counters suffers from inefficient use of resources [3]; therefore, most techniques use memory arrays. In general, the main challenge for parallel histogram computation using a memory array is handling updates to a particular bin count when at least two data items map to the same bin, which results in a memory write conflict. Most mechanisms for parallel histogram computation therefore require multi-port memory arrays, where memory write conflicts have to be dealt with.
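The write conflict described above can be illustrated with a small software model. The following Python sketch is an assumption-laden illustration, not a description of any cited hardware: the function names `naive_parallel_update` and `merged_update` are hypothetical, and one "cycle" is modelled as all ports reading first and then all ports writing back.

```python
def naive_parallel_update(hist, batch):
    """Model one cycle in which every port reads its bin, then writes count + 1.

    Assumption: all reads complete before any write, as in a simple
    multi-port read-modify-write. Duplicate bins in `batch` collide.
    """
    hist = list(hist)
    reads = [hist[b] for b in batch]      # every port reads the old count
    for b, old in zip(batch, reads):      # every port writes old + 1
        hist[b] = old + 1                 # a later write to the same bin overwrites
    return hist


def merged_update(hist, batch):
    """Resolve conflicts by counting duplicates before a single write per bin."""
    hist = list(hist)
    for b in set(batch):
        hist[b] += batch.count(b)
    return hist


start = [0, 0, 0, 0]
lost = naive_parallel_update(start, [3, 3])   # two items map to bin 3
kept = merged_update(start, [3, 3])
```

In this model the naive update records only one of the two items that map to bin 3, while the merged update records both; the extra comparison logic in the second function stands in for the conflict handling that a multi-port memory scheme would need.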
Due to practical limitations, a dual-port memory array is the common solution, but the maximum speedup is then limited to a factor of two. This Letter argues that these challenges are easily overcome by revisiting the original idea of an array of counters, but distributing the counting of bins in a fully pipelined manner.
Performance speedups of up to a factor of k are achieved with the cell architecture proposed here. In general, for n data items, the histogram is computed in $O(n/k)$ time steps. In this case, the $pT_P$ cost [5] of the k-way parallel cell is $O(n)$, while the $pT_P$ cost of a (serial) pipeline cell is also $O(n)$, as seen in Table 1.
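To make the idea of distributed, pipelined counting concrete, the following Python sketch simulates one possible reading of the cell array. It is a behavioural model under stated assumptions, not the circuit proposed in this Letter: the function name `cell_array_histogram`, the rule that cell j owns bins j·2^q to (j+1)·2^q − 1, and the one-stage-per-step shifting of k-item groups are all illustrative choices.

```python
def cell_array_histogram(data, m, q, k):
    """Simulate an array of m / 2**q cells, each holding 2**q local counters.

    Assumptions (hypothetical, for illustration only):
      * data values are integers in the range [0, m);
      * a group of k items enters the first cell per time step and moves
        one cell further along the pipeline on every following step;
      * cell j counts only the items whose bin index lies in its own range.
    Returns the m-bin histogram and the number of simulated time steps.
    """
    bins_per_cell = 1 << q
    if m % bins_per_cell:
        raise ValueError("m must be a multiple of 2**q")
    num_cells = m // bins_per_cell
    counters = [[0] * bins_per_cell for _ in range(num_cells)]

    groups = iter([data[i:i + k] for i in range(0, len(data), k)])
    stages = [None] * num_cells          # pipeline register in front of each cell
    steps = 0
    while True:
        # Shift the pipeline by one cell and feed the next k-item group.
        stages = [next(groups, None)] + stages[:-1]
        if all(group is None for group in stages):
            break                        # pipeline has drained
        steps += 1
        for j, group in enumerate(stages):
            if group is None:
                continue
            for x in group:
                if x >> q == j:          # item belongs to this cell's bins
                    counters[j][x & (bins_per_cell - 1)] += 1

    hist = [count for cell in counters for count in cell]
    return hist, steps


hist, steps = cell_array_histogram([0, 1, 1, 5, 7, 7, 7, 2], m=8, q=1, k=2)
```

Because every bin has exactly one owning cell, no two cells ever update the same counter, so the model needs no write-conflict handling. In this particular sketch the step count is the number of k-item groups plus the pipeline depth minus one, which is a property of the model's assumptions rather than a figure taken from the Letter.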
