This paper presents a low-area, high-throughput design and implementation of a JPEG encoder on FPGA. The design consists of three main components: (1) a 2-D DCT module employing the row-column decomposition technique, (2) quantization in zigzag order utilizing look-up tables, and (3) an entropy coder transforming the quantized DCT coefficients into JPEG words. All components are fully pipelined and optimized for FPGA resource utilization. The proposed JPEG encoder can encode 143 and 71 SDTV frames per second with 720×480 gray scale and color pixels per frame, respectively, on a Xilinx Spartan 6 FPGA. Moreover, the proposed architecture can encode at least 53 and 26 HD Ready TV frames per second with 1280×720 gray scale and color pixels per frame, respectively, on the same FPGA chip. Thus, the proposed JPEG encoder architecture is well suited to image and video compression applications where both performance and area are important.
INTRODUCTION
The Joint Photographic Experts Group introduced the JPEG compression standard for still photographic images [1], which has been a widely used lossy compression standard ever since. JPEG is heavily used in high-resolution image transmission applications, including digital cameras and image scanners. Such applications require both high-speed and low-cost implementations of image encoding and decoding. In order to meet the needs of various applications, the JPEG standard specifies two classes of encoding and decoding processes: Discrete Cosine Transform (DCT) based processes for lossy compression and predictor based processes for lossless compression, each with a few modes of operation [1, 2]. Among the different operation modes of the DCT based lossy compression, the baseline mode is widely implemented in software and hardware for JPEG compression [3] [4] [5] [6] [7] [8] [9] [10]. Consequently, the JPEG IP core architecture proposed in this study is based on the baseline mode.
In the literature, several studies are devoted to FPGA [3] [4] [5] [6] [7] and ASIC [8] [9] [10] implementations of the baseline JPEG compression. The common approach of these studies is to split the baseline JPEG compression process into four different modules handling 2-D DCT operation, quantization step, zigzag ordering step, and entropy coding operations separately. Among these studies, only [4, 9, 10] provide detailed hardware design of each of these modules, while the others describe how each of them operates in general without elaborating on the hardware design. As a result, the hardware design of JPEG IP core proposed in this study is compared against [4, 9, 10] : (i) The proposed design implements 2-D DCT based on single 1-D DCT hardware, while they use two 1-D DCT modules. (ii) The proposed design performs the quantization in zig-zag ordering in a single module, whereas they do the quantization and zig-zag ordering in different modules. (iii) All designs are fully pipelined to obtain the highest throughput possible.
This study proposes a novel IP core for high-throughput JPEG compression on low-cost FPGAs. In order to achieve the best throughput possible, each module is designed to be fully pipelined and optimized for FPGA resource utilization, while meeting the requirements of the JPEG standard [1, 2]. All modules were captured in Verilog HDL. The proposed design is synthesized, simulated and verified with the Xilinx ISE 14.7 tool. According to the results from ISE 14.7, the proposed JPEG IP core is capable of compressing more than 143 SDTV frames per second with 720×480 gray scale pixels per frame and 53 HD Ready TV frames per second with 1280×720 gray scale pixels per frame on a low-cost Xilinx Spartan 6 FPGA chip.
The rest of the article is organized as follows: the design philosophy and hardware implementation details related to the proposed JPEG IP core architecture are presented in Section 2. The implementation results of the proposed JPEG core on different Xilinx FPGAs are given in Section 3. Furthermore, comparisons between the proposed design and other JPEG cores from the open literature are given in Section 3. Finally, the article is concluded in Section 4.
JPEG IP CORE ARCHITECTURE
Based on [1, 2], the proposed JPEG IP core is composed of three main components: 2-D DCT, quantization in zigzag order, and entropy coder. FIFO-based input and output interfaces provide low-complexity flow control between the modules, with write and read semantics similar to those of a FIFO buffer interface. They are explained as follows:
• The input interface: A new data word is received by the module on its writeData bus at the next rising edge of the clock when the writeEn signal is asserted and the full signal is de-asserted during the current clock cycle. When the full signal is asserted by the module, it cannot accept a new data word in the current clock cycle. The bit length and the direction of the input interface signals are given below.
  • writeData bus, input, 8-, 12-, or 96-bit
  • writeEn signal, input, 1-bit
  • full signal, output, 1-bit
• The output interface: A new data word is ready on the readData bus at the next rising edge of the clock when the readEn signal is asserted and the empty signal is de-asserted during the current clock cycle. When the empty signal is not asserted by the module, there is always a valid data word available on its readData bus. Once the module cannot produce a new data word in the current clock cycle, the empty signal is asserted. The bit length and the direction of the output interface signals are given below.
  • readData, output, 12-, or 96-bit
  • readEn, input, 1-bit
  • empty, output, 1-bit

The following sections provide the necessary details for each component in the proposed hardware design.
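The write and read semantics of these interfaces can be modeled in a few lines of Python. This is a minimal behavioral sketch, not the Verilog of the proposed core: the class name `FifoInterface`, the `clock` method, and the `depth` parameter are hypothetical, and the model only captures which transfers are accepted at a rising clock edge.

```python
from collections import deque

class FifoInterface:
    """Behavioral model of the full/empty handshake (hypothetical names)."""

    def __init__(self, depth=2):
        self.depth = depth          # assumed storage depth, not taken from the paper
        self.words = deque()

    @property
    def full(self):
        return len(self.words) >= self.depth

    @property
    def empty(self):
        return len(self.words) == 0

    def clock(self, write_en=False, write_data=None, read_en=False):
        """Model one rising clock edge; return the word read, or None."""
        # The flags are sampled as they were during the current cycle.
        was_full, was_empty = self.full, self.empty
        read_word = None
        if read_en and not was_empty:
            read_word = self.words.popleft()
        if write_en and not was_full:
            self.words.append(write_data)
        return read_word
```

A write is dropped whenever `full` was asserted during the cycle, and a read returns nothing whenever `empty` was asserted, which mirrors the two interface rules stated above.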
2-D DCT Architecture
In this study, a low-area implementation of the 2-D DCT is designed and implemented on FPGA. Its architecture is adopted and modified from the ASIC implementation proposed in [11]. There are three important reasons for selecting the architecture of [11] for implementation on FPGA: (1) it is based on the row-column decomposition technique, so low area utilization is achieved since only a single 1-D DCT component is used in a time-shared fashion, (2) a shift-register based transpose buffer is implemented in order to improve the utilization of Block RAM resources, and (3) simpler finite state machines control the datapath since the control logic is distributed among the components. The resulting 2-D DCT hardware is shown in Figure 1. The main differences between the 2-D DCT architecture of [11] and that of this study are a different operation of the pong buffer, the inclusion of a pipeline register and of rounding logic in the 1-D DCT operation, an output buffer, and a special pipeline that can be stalled. In the following sections, each of the five components is explained separately by specifying what each module is responsible for computing in the 2-D DCT.
Ping-Pong buffers
The ping-pong buffering method uses two separate buffers: the ping and pong buffers. While the ping buffer is being loaded with input data, the 1-D DCT component performs the 1-D DCT operation using the data stored in the pong buffer.
The ping buffer is simply a 96-bit shift register controlled by a finite state machine (FSM) with two states {empty, full}.
According to the proposed pong buffer operation, 64+8=72 clock cycles are needed to compute either the 1-D or the 2-D DCT coefficients. Therefore, a total of 144 clock cycles are required by the pong buffer to complete the processing of an 8×8 matrix of pixels. In a fully pipelined operation, the ping buffer latency is completely overlapped with the pong buffer latency, so together they introduce a latency of 144 clock cycles.
1-D DCT
Eight-point 1-D DCT operation is computed by using the row-column decomposition technique [13] and it is given as follows:
where zi denotes the transformed coefficient, xi denotes the pixel data, i=0,1,…,7, a=C1, b=C2, c=C3, d=C4, e=C5, f=C6, g=C7, and Ck=0.5cos(kπ/16) for k=1,2,…,7.
During the computation of 1-D DCT coefficients, as long as writeEn is asserted, a new 88-bit word is stored into the register in every clock cycle and the machine stays in the full state; otherwise, it goes to the empty state. While computing 2-D DCT coefficients, on the other hand, a new 88-bit word is stored into the register only if both writeEn and readEn_outbuff are asserted, so as to guarantee that the new 2-D coefficient is accepted by the output buffer. In this state, the full signal is asserted only if the output buffer becomes full during the computation of 2-D DCT coefficients.
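As an illustration of the coefficient definitions above, the following Python sketch evaluates an eight-point 1-D DCT in the usual even/odd form, in which every output needs only four multiplications by the constants a to g. The matrix layout is the textbook decomposition that is consistent with Ck=0.5cos(kπ/16); it is an assumption about the form of the equation rather than a transcription of it, and the function names are hypothetical.

```python
import math

# C_k = 0.5 * cos(k * pi / 16), with a = C_1, ..., g = C_7 as in the text.
C = [0.5 * math.cos(k * math.pi / 16) for k in range(8)]
a, b, c, d, e, f, g = C[1:8]

def dct8_even_odd(x):
    """8-point 1-D DCT using four multiplications per output coefficient."""
    s = [x[i] + x[7 - i] for i in range(4)]   # sums feed the even outputs
    t = [x[i] - x[7 - i] for i in range(4)]   # differences feed the odd outputs
    even = [[d, d, d, d], [b, f, -f, -b], [d, -d, -d, d], [f, -b, b, -f]]
    odd = [[a, c, e, g], [c, -g, -a, -e], [e, -a, g, c], [g, -e, c, -a]]
    z = [0.0] * 8
    for n in range(4):
        z[2 * n] = sum(m * v for m, v in zip(even[n], s))
        z[2 * n + 1] = sum(m * v for m, v in zip(odd[n], t))
    return z

def dct8_direct(x):
    """Reference: direct evaluation of the orthonormal 8-point DCT-II."""
    z = []
    for k in range(8):
        scale = math.sqrt(0.5) if k == 0 else 1.0
        z.append(0.5 * scale * sum(
            x[n] * math.cos((2 * n + 1) * k * math.pi / 16) for n in range(8)))
    return z
```

Both functions return the same coefficients up to floating-point error; the hardware works in fixed point instead, so its results differ by the rounding described in the text.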
The 22-bit result {r21, r20, …, r0} of each signed multiplication is rounded to a 12-bit 2's complement value by a combinational logic circuit, depending on the sign of the result, as follows:
• When the result is positive, there are three cases:
  • If {r21, r20, …, r10} is the maximum 12-bit positive number, the rounded result is equal to {r21, r20, …, r10}.
  • If {r21, r20, …, r10} is not the maximum 12-bit positive number and r9 is equal to 0, the rounded result is equal to {r21, r20, …, r10}.
  • If {r21, r20, …, r10} is not the maximum 12-bit positive number and r9 is equal to 1, the rounded result is equal to {r21, r20, …, r10} + 1.
• When the result is negative: If r9 is equal to 0, the rounded result is equal to {r21, r20, …, r10}; otherwise, it is equal to {r21, r20, …, r10} + 1.
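The rounding cases can be checked with a short Python model. This is a sketch under the assumption that the 22-bit product is handled as a raw 2's complement bit pattern; the function name `round_22_to_12` is hypothetical.

```python
def round_22_to_12(r):
    """Round a 22-bit two's complement word to a signed 12-bit integer.

    r may be given as a raw bit pattern (0 .. 2**22 - 1) or as a signed
    Python integer; it is masked to 22 bits first.
    """
    r &= 0x3FFFFF                 # keep bits r21..r0
    top = r >> 10                 # {r21, ..., r10}
    r9 = (r >> 9) & 1             # most significant discarded bit
    negative = (r >> 21) & 1
    if not negative and top == 0x7FF:
        rounded = top             # largest positive value: adding 1 would overflow
    else:
        rounded = (top + r9) & 0xFFF
    # Reinterpret the 12-bit pattern as a signed Python integer.
    return rounded - 0x1000 if rounded & 0x800 else rounded
```

For example, the pattern for 512 (one half in units of 1024) rounds up to 1, while the pattern for -513 rounds to -1. The only special case is the largest positive value, which is left unchanged so that the 12-bit result cannot wrap around.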
After the rounding operation, a four-input 12-bit adder tree is used to compute either the 1-D or the 2-D coefficient values in 2's complement format.
Transpose and output
The transpose buffer is simply a shift register of 63×12=756 bits, adopted from [11]. There are two scenarios to consider for this module:
• serial-in: If the writeEn signal is asserted during the 1-D DCT computation, a new 12-bit coefficient is serially shifted into this buffer.
• parallel-out: Consider the shift_register={reg62, reg61, …, reg0}, which is composed of 63 12-bit registers. When the shift register becomes full, the set of eight registers column={reg56, reg48, reg40, reg32, reg24, reg16, reg8, reg0} stores the first column of 1-D DCT coefficients. In the next clock cycle, the 64th 1-D DCT coefficient is shifted in while the first column is received into the pong buffer. After the right shift, column stores the second column of 1-D DCT coefficients.
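The serial-in, parallel-out behavior can be reproduced with a small Python model. This is a behavioral sketch with hypothetical names (`TransposeBuffer`, `transpose_columns`); it assumes that the 1-D coefficients arrive in row-major order and that a right shift moves the content of reg(k+1) into reg(k).

```python
class TransposeBuffer:
    """63-stage shift register with eight column taps (behavioral sketch)."""
    TAPS = (0, 8, 16, 24, 32, 40, 48, 56)

    def __init__(self):
        self.regs = [0] * 63      # regs[k] models reg_k

    def shift_in(self, word):
        # Right shift: reg_k takes the value of reg_{k+1}; the new word enters reg62.
        self.regs = self.regs[1:] + [word]

    def column(self):
        # Parallel-out, returned in the order reg0, reg8, ..., reg56.
        return [self.regs[t] for t in self.TAPS]

def transpose_columns(coeffs):
    """Feed 64 row-major coefficients and collect the eight columns."""
    buf = TransposeBuffer()
    for word in coeffs[:63]:
        buf.shift_in(word)
    columns = [buf.column()]            # first column, before the 64th word arrives
    stream = coeffs[63:] + [0] * 6      # 64th word, then filler standing in for the next block
    for word in stream:
        buf.shift_in(word)
        columns.append(buf.column())
    return columns
```

With the coefficients numbered 0 to 63 in row-major order, the taps deliver 0, 8, …, 56 first and 7, 15, …, 63 last, so 63 registers are enough to transpose a 64-entry block.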
The output buffer stores the coefficients of the 2-D DCT operation and isolates the 2-D DCT hardware in Figure 1 from the following quantization component. The output buffer consists of two registers, namely reg0 and reg1, which are controlled by a three-state {empty, almost-full, full} finite state machine whose states are defined as follows:
• empty: Both registers are empty. When the writeEn signal is asserted, a new 12-bit word is loaded into reg0 and the FSM changes its state to almost-full. In this state, the empty signal is asserted.
• almost-full: When the writeEn and readEn signals are both asserted or both de-asserted, the FSM stays in this state. Moreover, when both are asserted, a new word is received into reg0. If writeEn is asserted but readEn is de-asserted, a new word is received into reg1 and the FSM goes to the full state. If writeEn is de-asserted but readEn is asserted, the FSM changes its state to empty since reg0 has been read. In this state, the empty signal is de-asserted.
• full: If the readEn signal is asserted, the FSM makes a transition to the almost-full state while the old content of reg1 is loaded into reg0. In this state, the full signal is asserted.
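The three-state machine can be sketched in Python as follows. The class name `OutputBuffer` and the `clock` method are hypothetical, and the model assumes that a write arriving in the full state is simply refused, in line with the meaning of the full signal on the input interface.

```python
class OutputBuffer:
    """Two-register buffer with states empty, almost-full, full (sketch)."""

    def __init__(self):
        self.state = "empty"
        self.reg0 = None
        self.reg1 = None

    @property
    def empty(self):
        return self.state == "empty"

    @property
    def full(self):
        return self.state == "full"

    def clock(self, write_en=False, write_data=None, read_en=False):
        """Model one rising clock edge; return the word read from reg0, or None."""
        if self.state == "empty":
            if write_en:
                self.reg0 = write_data
                self.state = "almost-full"
            return None
        if self.state == "almost-full":
            read_word = self.reg0 if read_en else None
            if write_en and read_en:
                self.reg0 = write_data          # replaces the word being read
            elif write_en:
                self.reg1 = write_data          # second word is parked in reg1
                self.state = "full"
            elif read_en:
                self.reg0 = None
                self.state = "empty"
            return read_word
        # full: writes are refused; a read moves reg1 into reg0
        if read_en:
            read_word = self.reg0
            self.reg0, self.reg1 = self.reg1, None
            self.state = "almost-full"
            return read_word
        return None
```

The second register lets the producer complete one more write after the consumer stops reading, which is what allows the upstream pipeline to be stalled without losing a coefficient.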
In a fully pipelined operation, the latency of the proposed 2-D DCT architecture is optimal in the sense that there are no wasted clock cycles.
Quantization in Zigzag Order
The 2-D DCT coefficients Zij of an 8×8 block are uniformly quantized according to the quantizer step sizes 1≤Qij≤255 taken from an 8×8 matrix called the quantization table. Specifically, quantization is defined as the division of each 2-D DCT coefficient by its corresponding quantizer step size, followed by rounding to the nearest integer:

$$Z_{ij}^{*} = \mathrm{round}\left(\frac{Z_{ij}}{Q_{ij}}\right)$$

where Zij* is the quantized 2-D DCT coefficient.
As a result of the quantization, most of the coefficients towards the lower right corner of the 8×8 matrix, which are the high-frequency coefficients, become zero. The zigzag ordering rearranges the two-dimensional array of coefficients into a one-dimensional vector in which the low-frequency coefficients are placed before the high-frequency coefficients, so as to maximize the compression during the entropy coding stage.
In the literature, it is typical that the quantization and zigzag ordering are handled in two different components. In the proposed design of JPEG encoder, however, these two components are combined into one architecture in Figure 3 by performing the quantization in the zigzag order.
The zigzag ordering is achieved by a two state {write_ram, read_ram} FSM together with a 64-entry RAM and 64-entry look-up table (Zigzag Table) as follows:
• write_ram: The DCT coefficients, which are received in column-major order from the 2-D DCT architecture, are written into the RAM one by one. While the last coefficient is being written, the FSM goes to the other state.
• read_ram: The RAM is read in the zigzag order defined by [1], which is available from the Zigzag Table in Figure 3.

Each coefficient read from the RAM is then quantized by a multiplication (Figure 3), where Qij is the related quantizer step size from the quantization table in [1]. Note that there are luminance and chrominance quantization tables, one of which is chosen by the chrom signal accordingly. The multiplier is followed by a register controlled by an FSM with two states {empty, full}, which is the same as the one in the 1-D DCT. This register receives the 24-bit multiplication result in every clock cycle if its writeEn is asserted and either it is in the empty state or its readEn is asserted. After that, the 24-bit multiplication result from the register is rounded to an 11-bit integer by a rounding logic similar to the one in Figure 2, and the quantized 2-D DCT coefficients are written into the output buffer one by one for the next stage.
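The idea of quantizing while reading the RAM in zigzag order can be sketched in Python. This is an illustrative model under stated assumptions: the zigzag table is generated on the fly instead of being stored in a look-up table, the quantization uses a plain division where the hardware uses a multiplier, and ties are rounded upwards. The function names are hypothetical.

```python
import math

def zigzag_order(n=8):
    """Return (row, col) pairs of an n x n block in JPEG zigzag order."""
    order = []
    for s in range(2 * n - 1):
        diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        if s % 2 == 0:
            diag.reverse()        # even diagonals run from bottom-left to top-right
        order.extend(diag)
    return order

def quantize_in_zigzag(ram, qtable):
    """Quantize 64 coefficients while reading them in zigzag order.

    ram: coefficients stored in column-major order (address = col * 8 + row).
    qtable: 8 x 8 quantizer step sizes indexed as qtable[row][col].
    """
    out = []
    for row, col in zigzag_order():
        z = ram[col * 8 + row]
        q = qtable[row][col]
        out.append(math.floor(z / q + 0.5))   # round to nearest; ties go up (assumption)
    return out
```

Because the read address and the step size are selected by the same zigzag index, a separate reordering stage after the quantizer is not needed.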
Entropy Coder Architecture
Entropy coding is the last step in JPEG compression: it takes the quantized 2-D DCT coefficients as input and produces the compressed and assembled JPEG words at its output. The entropy coding is achieved by three components, namely run length coding, Huffman coding, and the assembler, each of which is elaborated below.
Run length coding
After the quantization in zigzag order, a vector of 64 coefficients will be ready for the entropy coding. Among these coefficients, the first element is called DC component and all other elements are known as AC components. According to the JPEG standard, the run length coding (RLC) must be applied to only AC components. In the proposed design of JPEG encoder, the RLC is simply achieved by a three state {dc_coeff, ac_coeff, insert_zrl} FSM followed by an output buffer as follows:
• dc_coeff: The FSM first receives the DC coefficient Z00* from the quantization architecture, inserts {1'b1, 4'b0000, Z00*} into its output buffer, and goes to the ac_coeff state, where 1'b1 signals that this is the DC coefficient and 4'b0000 is the related run length of zeros.

According to this FSM, it takes 64 clock cycles at best or 70 clock cycles at worst to run length code 64 coefficients, which is sufficient to support a fully pipelined operation without any stall cycles.
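A software view of the run length coder may help to see what the FSM produces. The sketch below emits (DC flag, run length, coefficient) words; the handling of the AC coefficients follows the baseline JPEG convention of a zero-run-length (ZRL) word for sixteen consecutive zeros and an end-of-block word for trailing zeros, which is an assumption about the ac_coeff and insert_zrl states rather than a description of them. The function name is hypothetical.

```python
def run_length_code(coeffs):
    """Run length code 64 quantized coefficients given in zigzag order.

    Returns a list of (dc_flag, run_length, value) tuples.
    """
    words = [(1, 0, coeffs[0])]           # DC word: flag 1, run length 0
    run = 0
    for value in coeffs[1:]:
        if value == 0:
            run += 1
            continue
        while run > 15:                   # a run longer than 15 needs a ZRL word
            words.append((0, 15, 0))      # ZRL stands for sixteen zeros
            run -= 16
        words.append((0, run, value))
        run = 0
    if run > 0:
        words.append((0, 0, 0))           # end-of-block: only zeros remain
    return words
```

A block with a single nonzero AC coefficient collapses to three words: the DC word, one AC word carrying the run length, and the end-of-block word.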
Huffman coding
The output of the RLC is a 16-bit word in the form {1-bit DC flag, 4-bit run length, 11-bit DC/AC coefficient}. The Huffman coding architecture proposed in Figure 4 receives such run length coded words and provides the Huffman codes in a pipelined fashion as follows:
When a new output word is available from the RLC, the registers in the first stage of the pipeline latch the new word as zrl = 4-bit run length, Zij* = 11-bit coefficient, and dc = 1-bit DC flag.
In addition to these received signals, the 11-bit diff and 1-bit lumin signals are computed by the differential coder and the luminance block marker architectures, respectively, and sent to the first stage. According to the JPEG specification, the DC components must first be differentially coded by a simple subtraction between the DC component (Z00*) of the current 8×8 block and the DC component (prevZ00*) of the previous block from the same source image component (Y, Cb, Cr). Even though it is not shown in Figure 4, the differential coder architecture consists of one adder and three registers that store the previous DC components of the luminance and chrominance components. As a result, when the reception of a new DC component is signaled by asserting the 1-bit dc signal, diff = Z00* − prevZ00* is computed and Z00* is written into the related register. The luminance block marker, on the other hand, is a simple counter that asserts its lumin output signal only if the current block being processed by the Huffman coder belongs to a luminance source image component.
• Stage-2 (Category Selection): The differentially coded DC coefficients and the AC coefficients are passed through an encoder shown as the cat (category selection) block in Figure 4. That is, the cat block receives 11-bit coefficients and encodes them into 4-bit numbers according to the JPEG standard, where a 4-bit number indicates how many bits are significant in the current coefficient.
• Stage-3 (Huffman coding): According to the JPEG standard, Zij* (x in Figure 4) is decremented by one if Zij* is negative (signx=0 in Figure 4); otherwise, it is kept the same.
For any DC coefficient, its 4-bit category value is Huffman coded into 15 bits (an 11-bit Huffman code and a 4-bit Huffman code length) by means of either the luminance DC table or the chrominance DC table. Note that the 4-bit Huffman code length of each AC code word is stored decremented by one relative to its real code length in the AC tables in order to optimize the area usage. To compensate for this, the 4-bit Huffman code length of each AC code word is incremented by one before it is written into the output buffer.
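The differential coding, category selection, and decrement of negative coefficients described for the Huffman coder can be expressed compactly in Python. This is a sketch with hypothetical names; the Huffman tables themselves are omitted, and the previous DC values are assumed to start at zero.

```python
def category(value):
    """Number of significant bits in a coefficient (0 for a zero value)."""
    return abs(value).bit_length()

def amplitude_bits(value):
    """Bits appended after the Huffman code.

    A non-negative value is used as is; a negative value is decremented
    by one first. The result is truncated to category(value) bits.
    """
    cat = category(value)
    if value < 0:
        value -= 1
    return value & ((1 << cat) - 1)

class DifferentialCoder:
    """Keeps one previous DC value per source image component (Y, Cb, Cr)."""

    def __init__(self):
        self.prev = {"Y": 0, "Cb": 0, "Cr": 0}   # assumed initial values

    def diff(self, component, dc):
        d = dc - self.prev[component]
        self.prev[component] = dc
        return d
```

For example, the coefficient -5 has category 3 and amplitude bits 010, while 5 has the same category and amplitude bits 101, so the leading amplitude bit distinguishes the sign.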
Assembler
The assembler in Figure 5 is used to convert the variable length compressed data coming from the Huffman coder into a stream of 32-bit compressed data.
The assembler works as follows. In the first stage of the pipeline, an OR-mask circuit is used to set the insignificant bits of the 11-bit coefficient (Zij + ) to zero according to the 4-bit category (cat) of this coefficient. Meanwhile, the 16-bit Huffman code (Hij) is left shifted by a Shifter by a variable number of positions according to the same category value in order to bit align the Huffman code with the masked coefficient data. Finally, the left shifted Huffman code is ORed with the masked coefficient data, and the result is written into register A. In addition, the total length of the Huffman code and the coefficient, which is the sum of the category (cat) and the Huffman code length (len), is placed into register length-A. Note that the resulting length cannot exceed 27 bits since the maximum length of the Huffman code is 16 bits and that of the coefficient is 11 bits.
In the second stage of the pipeline, the content of register B is first left shifted by another Shifter by a variable number of positions according to length-A in order to bit align it with the content of register A, which currently keeps {Huffman code, masked category}. Then, the left shifted register B is ORed with register A, and the result and its total length are stored into register B and register length-B, respectively.
In the third and final stage of the pipeline, a new 32-bit data word is written into the output port whenever length-B is equal to or greater than 32 and the output buffer is not full. In order to find the next 32-bit word to be sent, register B is right shifted by a variable number of positions according to (length-B − 32).
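The three assembler stages can be modeled in Python as a bit accumulator. This is a behavioral sketch with hypothetical names (`Assembler`, `push`); it models the masking with a bitwise AND, keeps the leftover bits in register B after a word has been sent (an assumption, since the text does not describe it), and leaves out the back-pressure from the output buffer.

```python
class Assembler:
    """Packs variable length (Huffman code, coefficient) pairs into 32-bit words."""

    def __init__(self):
        self.b = 0            # models register B
        self.len_b = 0        # models register length-B
        self.out = []         # emitted 32-bit words

    def push(self, code, code_len, amp, cat):
        # Stage 1: mask the coefficient to cat bits and align the code above it.
        a = (code << cat) | (amp & ((1 << cat) - 1))
        len_a = code_len + cat            # at most 16 + 11 = 27 bits
        # Stage 2: shift register B left by length-A and merge register A.
        self.b = (self.b << len_a) | a
        self.len_b += len_a
        # Stage 3: emit a word once at least 32 bits are available.
        if self.len_b >= 32:
            shift = self.len_b - 32
            self.out.append((self.b >> shift) & 0xFFFFFFFF)
            self.b &= (1 << shift) - 1    # keep the leftover bits (assumption)
            self.len_b = shift
```

Since each push adds at most 27 bits to fewer than 32 buffered bits, at most one 32-bit word can become available per push, so a single comparison per cycle is enough.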
IMPLEMENTATION RESULTS
The JPEG IP core architecture proposed in this article is captured in Verilog HDL in a device-independent form, and simulated and verified by a series of testbenches using Xilinx ISim. It is synthesized using Xilinx ISE 14.7 for several Xilinx FPGAs, including Spartan 3 (XC3S1000-5FG320), Virtex 4 (XC4VSX35-12FF668), and Spartan 6 (XC6SLX75T-3FFG676) devices.
Table 1 summarizes the synthesis results for the JPEG IP core and its three main modules for the Spartan 3 and Spartan 6 FPGAs. According to Table 1, the entropy coder uses most of the FPGA's LUT and FF resources, followed by the 2-D DCT and quantizer modules. FF utilization is similar on both Spartan 3 and Spartan 6, while LUT utilization is lower on Spartan 6. This is due to the fact that Spartan 6, which is equipped with 6-input LUTs (as compared to the 4-input LUTs of Spartan 3), provides a more efficient implementation of combinational logic. Furthermore, the AC Huffman tables are mapped to two BRAMs on Spartan 6 instead of the LUT-based implementation on Spartan 3. The 2-D DCT module has the lowest operation frequency among the three modules and defines the maximum operation frequency (Fmax) of the JPEG IP core. Finally, since Spartan 6 is a newer technology than Spartan 3, the complete design and all individual modules achieve better maximum operation frequencies on it.
Recall that it takes at most 144 clock cycles to compute all 2-D coefficients of a block. Furthermore, the fully pipelined design of the JPEG IP core allows it to process an 8×8 block of pixels every 144 clock cycles. Thus, the JPEG IP core, when mapped to a Spartan 6 FPGA, reaches a minimum processing period of 1.28 µs per 8×8 block and processing rates up to 49.74 Msamples/s. This processing rate is sufficient for compressing more than 143 and 71 SDTV frames per second with 720×480 gray scale and color pixels per frame, respectively.
Furthermore, the proposed IP core can compress more than 53 and 26 HD Ready TV frames per second with 1280×720 gray scale and color pixels per frame, respectively. These results indicate that the proposed JPEG IP core, implemented on a low-cost FPGA such as Xilinx Spartan 6, can be deployed as the core of an M-JPEG video compressor for both SDTV and HD Ready TV applications.
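The reported frame rates follow from the processing rate by simple arithmetic, which the Python sketch below reproduces. The function name is hypothetical, and the factor of two samples per pixel for color frames is an assumption inferred from the reported numbers (it is consistent with, for example, 4:2:2 chroma subsampling); the text does not state the color format.

```python
def frames_per_second(samples_per_second, width, height, samples_per_pixel=1):
    """Whole frames per second at a given sample rate (rounded down)."""
    return int(samples_per_second // (width * height * samples_per_pixel))

RATE = 49.74e6   # samples per second on Spartan 6, from the text

sdtv_gray = frames_per_second(RATE, 720, 480)        # gray scale SDTV
sdtv_color = frames_per_second(RATE, 720, 480, 2)    # color SDTV (assumed 2 samples per pixel)
hd_gray = frames_per_second(RATE, 1280, 720)         # gray scale HD Ready
hd_color = frames_per_second(RATE, 1280, 720, 2)     # color HD Ready (assumed 2 samples per pixel)
```

Under these assumptions the four values come out as 143, 71, 53, and 26, matching the frame rates quoted for the Spartan 6 implementation.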
The proposed JPEG IP core is compared against four other competitive designs in Table 2. According to Table 2, the proposed IP core is clearly superior to [6] and [7] in terms of both resource utilization and maximum operation frequency. Note that comparing Stratix IV with Virtex 5 is slightly unfair to the proposed design, since [14] reports that Stratix IV is 35% faster than Virtex 5 and packs 1.8X more logic elements. The JPEG compressors of [3] and [5] are better than the proposed one in terms of LUT utilization, which is due to implementing the tables on BRAMs, and in terms of Fmax. On the other hand, they tend to use more FF resources and multipliers. Thus, there is no clear winner among [3, 5] and the proposed design.
CONCLUSIONS
This article proposes a fully pipelined and low-area JPEG encoder architecture. The key element of the proposed methodology is to carefully design the pipeline stages in order to increase resource sharing and decrease the total number of clock cycles needed to complete the overall JPEG operation. Furthermore, we compare the implementation results of this study with four competitive designs from the open literature. In order to minimize the FPGA resource utilization, the proposed architecture employs the row-column decomposition technique for the 2-D DCT step of JPEG encoding, which saves the resources of one 1-D DCT unit. The proposed low-area design achieves high processing rates that are well suited to various image and video compression applications.
