In this paper we present an approach for designing an adaptive video compression system that allows regions of interest to be identified and the picture size and quality to be configured to optimize performance for a system's computation and communication capabilities. We present an FPGA prototype of the complete system, as well as a prototyping environment that allows users to easily explore and evaluate design alternatives. Design exploration can be performed on the Motion JPEG coding standard, with an operating frequency of up to 52 MHz, a frame rate of over 37 fps at a resolution of 720 × 480, and a compression ratio of 47:1 at 0.51 bits per pixel.
Introduction
The continued advancements within our computing and communication infrastructure are making computer systems more pervasive and ubiquitous. Mobile systems with video capabilities in particular have proliferated due to the availability of cheap cameras and embedded computational platforms. As mobile video systems become more ubiquitous, their use cases continue to expand from real time target recognition in custom designed high end systems to cheap commodity set top boxes used for personal entertainment and social interaction. For some applications where video requirements are relatively modest, minimal compression is required to meet the system's transmission bandwidth capabilities. Other applications, however, may require computationally complex compression algorithms to maintain high resolution images over wireless channels with limited bandwidth. Historical application domains such as surveillance drones require real time transmission of the large, high resolution video data sets needed for identification, planning and response. On the other hand, consumer video recorders used for documenting social events may not require the highest resolutions, nor need to support real time video capture and processing rates. Commercially available Field Programmable Gate Array (FPGA) components have now reached a maturation point in gate densities where they can support complete system-on-chip architectures. These devices have reached a cost-performance equilibrium that allows their integration into consumer electronics and also provides significant performance capabilities for high end custom designed systems. Although the devices are available, the design tools needed to help designers rapidly develop a wide range of systems that optimize the delivered quality of video processing to match application needs and the bandwidth of the supporting communication network are lagging.
The most challenging computational requirements for video systems are typically associated with data compression algorithms. A video sequence of HDTV (1920 × 1080) at 30 frames per second with 24 bits per pixel requires 177.9 megabytes per second in uncompressed form. The same sequence when JPEG compressed would require only 2.97 megabytes per second (depending on the compression ratio). However, this comes at a cost. Performing this type of compression requires a dedicated co-processor to meet standard real time video rate requirements. In many instances the complete video stream may not require high-quality compression. If the system contained adaptive intelligence capabilities that allowed a subset of the image area to be identified and passed through a high quality, computationally complex compression, then the remaining portion of the image could be transmitted with a lower quality and less computationally complex compression algorithm. If this capability were dynamically and autonomously tunable, then the system could adapt to time varying bandwidth availability. Under such a scenario, when excess bandwidth is available the system could continue to adaptively process more raw video data through the high resolution compression algorithm until a balanced computation-communication equilibrium is met. When bandwidth becomes limited, the system could adaptively maintain an overall quality of service by starting with the most important areas of the picture for high-quality compression, and then reducing the quality of compression for the remaining areas to achieve a balance between quality of service and system capability.
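The bandwidth figures above can be checked with a few lines of arithmetic. The sketch below is only illustrative: it assumes that "megabytes" means 2^20 bytes, and the 60:1 compression ratio is an assumption inferred from the two quoted numbers, not a value stated in the text.

```python
def raw_rate_bytes_per_s(width, height, fps, bits_per_pixel=24):
    """Uncompressed video data rate in bytes per second."""
    return width * height * fps * bits_per_pixel // 8

MIB = 1024 * 1024  # assumption: megabyte taken as 2**20 bytes

raw = raw_rate_bytes_per_s(1920, 1080, 30)
raw_mib = raw / MIB

assumed_ratio = 60  # assumption: ratio implied by the quoted figures
compressed_mib = raw_mib / assumed_ratio

print(f"raw: {raw_mib:.2f} MiB/s, compressed at {assumed_ratio}:1: {compressed_mib:.2f} MiB/s")
```

With a different compression ratio the second figure scales accordingly, which is why the text qualifies it as depending on the compression ratio.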
In this paper we present preliminary work in developing the design tools needed to field such an adaptable system. Figure 1 shows our overall design flow, which allows designers to rapidly explore various hardware/software configurations for a wide range of mobile video compression system requirements. Designers can engage in what-if type design space explorations, and select and integrate a set of proposed building blocks into a complete deployable system. While our current prototype set of tools focuses on adaptivity during design time and not run time, these tools are a necessary step towards fielding fully autonomous and adaptive systems. We have focused our tools on rapid prototyping and fielding of FPGA-based platforms; however, without loss of generality, our design flow and tools can also be applied to the design of ASIC solutions. The examples used throughout this work target our interest in real time mobile surveillance cameras for the automotive industry. We show the versatility of our approach by showing how multiple implementations, ranging from computationally simple sequential designs to fully parallel custom designs for detecting pedestrians and other objects around automobiles, can be rapidly explored. The rest of the paper is organized as follows: In section 2 we provide a brief background on video compression. Section 3 discusses how we identify regions of interest within an image. Section 4 provides a short overview of the JPEG compression standard. Section 5 provides a description of our generic system architecture. Our approach to design space exploration is presented in section 6. Finally section 6.1 concludes with a discussion of future work.
Related work
Architectures for real-time image and video processing systems have been well studied. The literature is well represented with proposed custom ASIC ([2], [8], [1]) and FPGA implementations ([5], [3], [4]) as well as off the shelf microprocessor and digital signal processor solutions. In [8] KyungHyun proposed a fully pipelined architecture for realtime JPEG compression of 1024 × 768 images without downsampling. The compressor was targeted at various applications such as scanners, color FAX and network cameras. However, due to the size of the generated compressed file, the architecture was not suitable for realtime applications and could not be used on a reconfigurable device. Lopez Lorenzo [5] improved upon this proposed architecture by first downsampling the input image and then computing the discrete cosine transform. This had the advantage of reducing the size of the compressed file and increasing the throughput. That architecture nevertheless could only be used for small images with a resolution of 352 × 288 and was not suitable for realtime video purposes (only 40 frames per second). In [3] Mohammed Elbadri et al. proposed an FPGA-based design for a JPEG decoder with a low 67 MHz operating frequency, while Simpson A. et al. [7] implemented a realtime JPEG CoDec for refresh rates up to 25 frames/s using an FPGA-centric processing platform and reusable IP cores. The CoDec could handle frames of relatively high resolution such as VGA (640 × 480).
However, despite the high throughput achieved by these architectures, the fact that they cannot be adapted to user needs (high image quality, power consumption, bandwidth utilization, etc.) is seen nowadays as an important limitation. In this paper we present an adaptive system whose goal is to build realtime applications according to user defined constraints.
Computing Regions Of Interest
This section gives an overview of the overall architecture of our system, whose first step is the identification and marking of regions of interest. For systems with sufficient bandwidth, this step can clearly be bypassed. However, for systems with bandwidth constraints this first step is used to reduce the quantity of data that needs high resolution, computationally complex compression. For our automobile application area, we considered two approaches to detect high value areas of an image: object contours and shadows. While contour detection is well adapted to many objects of interest, it is computationally complex. Detecting shadows is computationally cheaper for relatively big objects such as automobiles. Shadows can be identified using simple filters implemented as simple and regular hardware co-processors. The procedure has three steps: 1) finding the shadows through image processing (hypothesis generation), 2) extracting the identified shadows to infer vehicle positions, and 3) marking the positions and sending this information to the compression module as an augmentation of the picture.
Shadow Identification
Shadows can be identified through simple thresholding techniques as areas below a certain brightness level when compared to the remainder of the image. Thresholding can become more difficult if multiple light sources are present. Threshold levels between shadow and non-shadow regions can differ greatly depending on the overall lighting conditions and road surface properties. A dynamic approach can automatically adapt the threshold to each situation. A promising approach for thresholding car shadows is to estimate the average road color or brightness. This technique works on the assumption that the road surface is approximately uniform and that shadow areas are significantly darker than all other areas. To approximate the average color of the road, we calculate the average color of the complete image under the assumption that the camera is focused towards the road. These assumptions work well for monitoring cameras used in automobiles for detecting nearby objects.
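A minimal sketch of such a dynamic threshold is given below. It assumes a grayscale image stored as a list of rows, and the scaling factor `factor` is a hypothetical tuning parameter introduced here for illustration; the text only states that the threshold is derived from the average image brightness.

```python
def shadow_mask(gray, factor=0.5):
    """Binarize a grayscale image (rows of 0..255 values) into shadow (1) / non-shadow (0).

    The threshold is a fraction of the mean brightness of the whole image,
    which stands in for the average road brightness. `factor` is a
    hypothetical parameter, not a value taken from the paper.
    """
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    threshold = factor * mean
    mask = [[1 if p < threshold else 0 for p in row] for row in gray]
    return mask, threshold

# A bright road surface with a single dark pixel in the middle.
image = [
    [200, 200, 200],
    [200,  40, 200],
    [200, 200, 200],
]
mask, threshold = shadow_mask(image)
```

Because the threshold follows the image mean, the same code adapts to darker or brighter scenes without a fixed brightness constant.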
Binarizing the image by applying the calculated threshold alone would mark any single pixel below the threshold as a shadow. However, due to image noise and road structure, not all pixels below the threshold need be considered. Only large connected areas need be considered for further processing. To eliminate small errors, we apply a basic noise reduction filter. One robust approach is to apply a mean filter before binarization. All pixels contained in a certain environment around the considered pixel are summed up and divided by the number of pixels in the environment. The resulting mean value is then assigned to the pixel in the center. The desired effect is a smoothing and blurring of the whole image, thereby eliminating small errors and insignificant details. A second approach is based on an erosion (from the shadows' point of view) of the binarized image. Erosion starts by considering a set of pixels within an area of interest. If all pixels are 1 (as the image is binarized), the pixel in question is labeled 1. If any of the pixels in the environment is 0, the center pixel is labeled 0, thereby shrinking larger areas and eliminating smaller areas completely. Under this approach, only regions exceeding the size of the mask remain (Figure 2).
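The strict erosion described above can be sketched as follows. The mask size defaults and the choice to label border pixels 0 (where the mask does not fit) are assumptions made for this illustration.

```python
def erode(mask, rows=3, cols=3):
    """Strict binary erosion: a pixel stays 1 only if every pixel in the
    rows x cols environment centered on it is 1.

    Assumption: border pixels, where the environment would leave the image,
    are labeled 0.
    """
    h, w = len(mask), len(mask[0])
    ry, rx = rows // 2, cols // 2
    out = [[0] * w for _ in range(h)]
    for y in range(ry, h - ry):
        for x in range(rx, w - rx):
            if all(mask[y + dy][x + dx]
                   for dy in range(-ry, ry + 1)
                   for dx in range(-rx, rx + 1)):
                out[y][x] = 1
    return out

# A 3x3 shadow region plus one isolated noise pixel in the top right corner.
noisy = [
    [0, 0, 0, 0, 1],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
eroded = erode(noisy)  # the noise pixel disappears, the region shrinks to its center
```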
While both approaches yield roughly the same results, a mean filter is more computationally complex as it requires a division by the number of pixels contained within the area of interest. Erosion, conversely, is a simple counting process and thus computationally simpler. However, simple erosion has a drawback that makes it unsuitable for our system. While eliminating small positive noise (pixels that are falsely classified as shadow), it also rejects areas with small negative noise (pixels that are falsely classified as non-shadow). In practice, erosion enlarges small negative errors, which instead need to be bridged and thereby eliminated. To address this issue, we relax the tight constraints on erosion:
Instead of accepting only regions that are completely filled and rejecting all others, regions containing at least a certain percentage of positive pixels are accepted. When the number of required pixels is equal to or above max(⌈columns/2⌉ · rows, ⌈rows/2⌉ · columns), the filter will shrink larger areas. The higher the percentage, the more positive noise is eliminated, while the size of bridgeable gaps decreases as well. Therefore, a design space trade-off exists for designers between reducing positive and negative noise. An example of the benefits compared to Figure 2 is given in Figure 3. To perform the desired image erosion, pixels in a given area (or environment) are summed. The integral image, an image representation easily calculated from the original image, is computed to simplify this procedure. In the integral image, each pixel represents the sum of all pixels left and above, including its own value from the original picture (Figure 4). This image representation makes it possible to calculate the sum of any rectangle in constant time from only four values. As Figure 5 shows, the sum of any rectangle can be calculated by taking the sum of the rectangle from the origin to the bottom right corner (D), then subtracting both rectangles left (C) and above (B) the rectangle. As the area that is both left and above it has now been subtracted twice, it has to be added again (A) for a correct result.
Figure 5. Calculation of the sum of contents of a rectangle using the integral image.
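The integral image and the relaxed erosion can be sketched together as shown below. This is an illustrative software model, not the hardware implementation: the extra zero row and column in the integral image are an implementation convenience assumed here, and the default `required` count simply uses the bound max(⌈columns/2⌉ · rows, ⌈rows/2⌉ · columns) quoted above.

```python
def integral_image(img):
    """Integral image with one extra zero row and column (a convenience assumed
    here), so ii[y + 1][x + 1] is the sum of img over rows 0..y and columns 0..x."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def rect_sum(ii, top, left, bottom, right):
    """Sum over rows top..bottom-1 and columns left..right-1 from four values:
    D - B - C + A in the notation of Figure 5."""
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]

def relaxed_erode(mask, rows=3, cols=3, required=None):
    """Label a pixel 1 if at least `required` pixels of its environment are 1."""
    if required is None:
        # ceil(cols / 2) * rows and ceil(rows / 2) * cols, using integer arithmetic
        required = max(-(-cols // 2) * rows, -(-rows // 2) * cols)
    h, w = len(mask), len(mask[0])
    ry, rx = rows // 2, cols // 2
    ii = integral_image(mask)
    out = [[0] * w for _ in range(h)]
    for y in range(ry, h - ry):
        for x in range(rx, w - rx):
            count = rect_sum(ii, y - ry, x - rx, y + ry + 1, x + rx + 1)
            out[y][x] = 1 if count >= required else 0
    return out
```

For a 3 × 3 mask the default requirement is 6 of 9 pixels, so a shadow region with a single falsely unmarked pixel in its center is still accepted, whereas strict erosion would reject it.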
Data Extraction
Now that shadows have been found on the basis of single pixels, it is important to group them correctly so that a connected region is treated as a single entity. The extraction algorithm is divided into two phases. In the horizontal merging phase, horizontally connected marked pixels are represented together by the 4-tuple object = (X_start, Y_start, X_end, Y_end), representing the horizontal bar's coordinates in the image. In the vertical merging phase, the algorithm tries to connect these tuples vertically. Two tuples are merged if they are vertically adjacent and overlapping horizontally (thus being connected in the image), or if they are vertically sufficiently close to each other and horizontally overlapping to a certain degree. The second condition in particular depends on parameter settings (allowed vertical gap size, required horizontal overlapping region). This process results in one remaining tuple for each possible car, which is then appended to the image and forwarded to the compressor.
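The two merging phases can be sketched as follows. The parameter names `max_gap` and `min_overlap` are hypothetical stand-ins for the allowed vertical gap size and the required horizontal overlap, and the greedy single-pass merge is a simplification assumed for this illustration.

```python
def horizontal_runs(mask):
    """Horizontal merging: one (x_start, y_start, x_end, y_end) tuple per run of 1s."""
    runs = []
    for y, row in enumerate(mask):
        x = 0
        while x < len(row):
            if row[x]:
                start = x
                while x < len(row) and row[x]:
                    x += 1
                runs.append((start, y, x - 1, y))
            else:
                x += 1
    return runs

def overlap(a, b):
    """Number of columns shared by two tuples (zero or negative if disjoint)."""
    return min(a[2], b[2]) - max(a[0], b[0]) + 1

def merge_vertical(runs, max_gap=0, min_overlap=1):
    """Vertical merging: greedily attach each run to the first box that is close
    enough vertically and overlaps enough horizontally."""
    boxes = []
    for run in runs:  # runs arrive in top-to-bottom order
        for i, box in enumerate(boxes):
            rows_between = run[1] - box[3] - 1
            if rows_between <= max_gap and overlap(run, box) >= min_overlap:
                boxes[i] = (min(box[0], run[0]), box[1],
                            max(box[2], run[2]), max(box[3], run[3]))
                break
        else:
            boxes.append(run)
    return boxes

mask = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
]
boxes = merge_vertical(horizontal_runs(mask))
```

Raising `max_gap` lets the merge bridge empty rows inside a shadow, at the cost of possibly joining two distinct objects.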
JPEG Compression Overview
For the realtime video compression we decided to use the JPEG standard, among other reasons because its computation is well suited to parallel processing. By focusing on some parts of the computation chain, we could considerably speed up the processing or increase the data throughput. Motion JPEG (M-JPEG) represents a class of video formats where each video frame or interlaced field of a digital video sequence is separately compressed as a JPEG image. As defined in [9], a JPEG Baseline encoder system is composed of 4 steps: color space conversion and downsampling, computing the discrete cosine transform, quantization, and entropy encoding, as shown in Figure 1. The JPEG Baseline coding for our images is based on 8-bit samples using Huffman encoding. The first step in the process is downsampling. The human eye is more sensitive to brightness (luminance) than color (chrominance) [6]. Thus most JPEG encoders reduce the chrominance components to half of the resolution in both dimensions by taking the mean value of each 2 × 2 block. This sampling method is called "4:2:0" (the related "4:2:2" method halves only the horizontal resolution) and is applied to images stored in the YCbCr color space rather than the RGB color space. This step is one of two steps in the compression process where information is lost.
Next the data is regrouped into Minimum Coded Units (MCU), which are the smallest groups of interleaved data units. An MCU consists of blocks of 8 × 8 samples, with the number of blocks per MCU depending on the chosen sampling method. For example, with the "4:2:0" sampling method in Baseline JPEG, an MCU has 4 luminance blocks and 2 chrominance blocks of 8 × 8 samples.
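As an illustration of the color conversion and downsampling step, the sketch below uses the RGB to YCbCr conversion defined for JFIF files and a 2 × 2 mean for the chrominance planes. It assumes planes with even width and height, stored as lists of rows; it is not the hardware module itself.

```python
def rgb_to_ycbcr(r, g, b):
    """RGB to YCbCr conversion as used by JFIF (8-bit samples, full range)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 128
    cr = 0.5 * r - 0.418688 * g - 0.081312 * b + 128
    return y, cb, cr

def downsample_2x2(plane):
    """Halve a chrominance plane in both dimensions by averaging each 2x2 block.

    Assumption: the plane has even width and height.
    """
    h, w = len(plane), len(plane[0])
    return [
        [(plane[y][x] + plane[y][x + 1] + plane[y + 1][x] + plane[y + 1][x + 1]) / 4
         for x in range(0, w, 2)]
        for y in range(0, h, 2)
    ]

cb_plane = [
    [100, 102, 90, 90],
    [104, 106, 90, 90],
]
cb_small = downsample_2x2(cb_plane)  # one value per 2x2 block
```

The luminance plane is left at full resolution, so a 16 × 16 pixel area yields four 8 × 8 luminance blocks but only one 8 × 8 block for each chrominance component.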
The Forward DCT, which is the next step in the processing, transforms the image into the frequency domain, where low frequencies are separated from high frequencies. Since neighboring pixels are highly correlated, the DCT output concentrates most of the block energy in the lower spatial frequencies. Higher frequencies will have values equal to or close to zero, such that they can be ignored without introducing noticeable loss in image quality.
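A direct, unoptimized software model of the 8 × 8 forward DCT is sketched below to make the energy compaction concrete. It follows the standard DCT-II definition used by JPEG; the level shift that JPEG applies before the transform is omitted here, and a hardware DCT-2D module would use a much faster factorization.

```python
import math

def dct_8x8(block):
    """Naive 2-D forward DCT (DCT-II) of an 8x8 block, as defined for JPEG."""
    def c(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0

    out = [[0.0] * 8 for _ in range(8)]
    for u in range(8):
        for v in range(8):
            s = 0.0
            for x in range(8):
                for y in range(8):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / 16)
                          * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = 0.25 * c(u) * c(v) * s
    return out

# A perfectly flat block: all energy ends up in the DC coefficient out[0][0].
flat = [[10] * 8 for _ in range(8)]
coeffs = dct_8x8(flat)
```

For the flat block every coefficient except the DC term is zero up to rounding error, which is the extreme case of the correlation argument above.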
In the Quantization step, the DCT outputs are divided by the 64 values of an 8 × 8 matrix called the quantization table. No information is lost in the division of the coefficients itself, but the result is then rounded to an integer value. This represents the second step of the compression where information is lost. This is a key step in the compression process, since less important information is discarded according to the desired image quality. The outputs of this step are rearranged such that most of the zeroes are placed at the end. This is referred to as Zigzag mapping. The array, with many consecutive zeroes at the end, can now be compressed efficiently by the entropy encoding.
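Quantization and zigzag mapping can be sketched as below. The uniform quantization table in the usage example is a placeholder chosen for illustration, not one of the tables used in our prototypes, and rounding halves away from zero is an assumption about the rounding rule.

```python
def round_half_away(x):
    """Round to the nearest integer, halves away from zero (assumed rounding rule)."""
    return int(x + 0.5) if x >= 0 else -int(-x + 0.5)

def quantize(coeffs, table):
    """Divide each DCT coefficient by its table entry and round to an integer."""
    return [[round_half_away(coeffs[r][c] / table[r][c]) for c in range(8)]
            for r in range(8)]

def zigzag_order(n=8):
    """(row, col) positions in zigzag order: anti-diagonals in turn, alternating direction."""
    return sorted(
        ((r, c) for r in range(n) for c in range(n)),
        key=lambda rc: (rc[0] + rc[1], rc[0] if (rc[0] + rc[1]) % 2 else rc[1]),
    )

def zigzag_scan(block):
    """Flatten a square block into a list following the zigzag order."""
    return [block[r][c] for r, c in zigzag_order(len(block))]

# Placeholder data: a strong DC term, one low-frequency term, small values elsewhere.
coeffs = [[0.4] * 8 for _ in range(8)]
coeffs[0][0], coeffs[0][1] = 80.0, 33.0
table = [[16] * 8 for _ in range(8)]  # hypothetical uniform table
scan = zigzag_scan(quantize(coeffs, table))
```

In this example only the first two entries of `scan` are non-zero; everything after them is a single run of zeroes that the entropy coder can represent very compactly.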
The entropy coding uses run-length encoding (RLE), Huffman coding, variable length coding (VLC) and differential coding (DPCM) to decrease the number of bits required to represent the image [9]. The reduced data is appended to the JFIF header to form the output JPEG bitstream. This header specifies the source image characteristics, the number of components in the frame, the sampling factors for each component, and the destinations from which the quantization tables to be used with each component are retrieved.
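The two preprocessing stages that precede the Huffman coder can be sketched as follows: differential coding of the DC coefficients across consecutive blocks, and run-length coding of the zigzag-ordered AC coefficients. This is a simplified model: the Huffman tables are not shown, and the special handling of zero runs longer than 15 required by Baseline JPEG is omitted.

```python
EOB = (0, 0)  # end-of-block marker: no further non-zero coefficients follow

def dc_differences(dc_values):
    """DPCM: each DC coefficient is coded as the difference to the previous block's DC."""
    prev = 0
    diffs = []
    for dc in dc_values:
        diffs.append(dc - prev)
        prev = dc
    return diffs

def run_length_ac(ac):
    """Turn zigzag-ordered AC coefficients into (zero_run, value) pairs.

    Simplification: runs longer than 15 zeroes are not split here.
    """
    last_nonzero = max((i for i, v in enumerate(ac) if v != 0), default=-1)
    pairs = []
    run = 0
    for v in ac[:last_nonzero + 1]:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    if last_nonzero < len(ac) - 1:
        pairs.append(EOB)
    return pairs

pairs = run_length_ac([2, 0, 0, -1] + [0] * 59)
diffs = dc_differences([5, 7, 4])
```

The long tail of zeroes produced by quantization collapses into the single end-of-block pair, which is where most of the compression gain comes from.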
The Prototyping Environment
During the implementation process, several parameters and constraints must be considered. User requirements such as image quality, power constraints, bandwidth constraints, etc. affect the overall design. These requirements must then be evaluated based on the capabilities and limitations of the specific target technology. Figure 7 presents the prototyping workflow.
MJPEG Prototyping
Once the user goals and the target technology have been defined, the modules used for the prototype are chosen from a database of JPEG building blocks and connected together to form an efficient motion JPEG encoder/decoder. The database contains various hardware implementations of modules used in JPEG compression, as well as the synthesis results of these implementations. Using these results and the user constraints, the selection of candidate modules for our prototype can be made. The generated architecture is then synthesized and evaluated to determine if the design can fit into a target device. If synthesis fails, the architecture can be refined, trading off performance capabilities against technology mapping. Once the evaluation stage has been successfully navigated, the system can be validated and the final architecture generated. One of the advantages of baseline JPEG compression for this type of design space exploration is that its regular structure can be pipelined and partitioned into parallel channels. Indeed, during the downsampling step, the components of an image are treated separately to reduce the chrominance samples. After this stage the three components Y, Cb and Cr can be processed independently until they are interleaved in the Huffman coding step to prepare the scan codes for the JPEG bitstream. Design space exploration during MJPEG prototyping explores these possible parallel paths.
Figure 7. The prototyping environment
As explained earlier, the Baseline process allows us to replicate the Forward DCT module, Quantization module, ZigZag module, and part of the encoding module for the Y, Cb and Cr components. We explore the parallelization of these modules in the following section. The generated prototype is classified into one of three modes according to the parallelization of the Y, Cb and Cr components: sequential, semi-parallel, and parallel.
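The selection step of this workflow can be pictured with the toy model below. Everything in it is hypothetical: the candidate entries, their resource figures and the relative throughput values are placeholders invented for illustration and are not synthesis results from our database.

```python
# Hypothetical database entries: all numbers are placeholders, not measured results.
CANDIDATES = [
    {"mode": "parallel", "slices": 9000, "brams": 30, "relative_throughput": 3},
    {"mode": "semi-parallel", "slices": 6000, "brams": 20, "relative_throughput": 2},
    {"mode": "sequential", "slices": 3000, "brams": 10, "relative_throughput": 1},
]

def fits(candidate, device):
    """True if the candidate's resource needs do not exceed what the device offers."""
    return (candidate["slices"] <= device["slices"]
            and candidate["brams"] <= device["brams"])

def select_architecture(candidates, device, min_throughput=1):
    """Pick the fastest candidate that fits the device and meets the user constraint.

    Returns None when no candidate is feasible, which corresponds to the
    refinement loop in the workflow.
    """
    feasible = [c for c in candidates
                if fits(c, device) and c["relative_throughput"] >= min_throughput]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c["relative_throughput"])

device = {"slices": 7000, "brams": 25}  # hypothetical target device budget
choice = select_architecture(CANDIDATES, device)
```

In the environment described here this decision is still made manually by the designer; the sketch only shows the kind of comparison between stored synthesis results and device limits that the workflow relies on.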
The Sequential Mode
Design space exploration can begin in the sequential mode for evaluating picture quality. Sequential mode is the most computationally modest of the three modes, and represents the normal pipelining of JPEG baseline compression. Indeed, after the downsampling stage, the three components are merged into one channel and are processed in a pipelined fashion, one after the other. Pipelining does provide reasonable throughput and is appropriate when the goal is to minimize technology requirements. For FPGAs this sequential pipelined mode minimizes the number of memories and slices required to meet non-performance based user requirements. This simple sequential mode can be appropriate for systems that do not require high resolution compression such as movies, or for low-power architectures such as cellphones.
Figure 8. A prototype in the sequential mode
The Semi-Parallel Mode
In the semi-parallel mode, a higher throughput can be achieved. An architecture is in the semi-parallel mode when at least one of the aforementioned parallelizable modules is replicated within the design. Figure 9, for example, shows a semi-parallel architecture where the DCT-2D, the Quantization and the Zigzag scanning for luminance and chrominance samples are processed separately. In this design, we use two DCT-2D blocks, two Quantizer blocks and two Zigzag scanner blocks. This is the ideal semi-parallel architecture, since all the parallelizable modules are replicated. Another option would be to replicate only the Zigzag module, because the DCT-2D module is extremely computationally intensive and the Quantization module uses memories for its tables. The semi-parallel mode could be used for realtime processing applications where the amount of data to be processed is moderate.
Figure 9. A prototype in the semi-parallel mode
The Fully Parallel Mode
Architectures in fully parallel mode are those where replication occurs to fully maximize performance. For our running example all parallelizable modules are replicated twice (for a total of three channels). This mode is used to increase the throughput in a step where the need arises. Of course, all the parallelizable blocks can be included in a design, as shown in Figure 10. The prototype of Figure 10 uses three DCT-2D blocks and three Zigzag blocks but only two Quantization blocks. Indeed, after the Zigzag step, the chrominance samples Cb and Cr are merged into one channel before starting with the Quantization. This can be useful to reduce memory utilization. In this mode, the compressor achieves its highest speed, since all three components of the image can be processed in parallel. This mode is intended for applications where a large amount of resources is available, and it is also well suited to realtime systems.
Figure 10. A prototype in the fully-parallel mode
Evaluation
Once a JPEG encoder prototype has been generated, it has to be evaluated. The evaluation consists of synthesizing the prototype and checking whether the timing constraints for the design are respected and whether the design has enough resources (slices, memory, etc.) to fit into the target technology. For our experimentation we use an FPGA-designed circuit (Figure 11 left) into which the different prototypes are integrated. In the FPGA-based JPEG encoder system, the incoming image from the camera is first processed through a video controller (video to encoder module) and converted into an RGB pattern. The resulting samples are then sent to the encoder prototype through a Fast Simplex Link Bus [10] for compression. Once the image is compressed, it is sent to a JPEG decoder for decompression. After the decompression step, the reconstituted frame is sent back to the video controller (Decoder to memory module) and stored in the DDRAM, from which it is read and displayed on a monitor using an XPS Thin Film Transistor controller [11]. The JPEG encoder modules have been written in Handel-C and synthesized in Xilinx ISE 12.3. The FPGA board is a Virtex5-xc5vlx110t connected to a camera (see Figure 11 right).
During our experiments three different prototypes were evaluated. The synthesis results are presented in Table 1, followed by the output images (Figure 12) and the comparisons of compression factors (Table 2).

Results Interpretation
The three prototypes (see Table 1) we built during our experimentation phase were used to generate the output images in Figures 12(b), 12(c) and 12(d) respectively. Compared to the existing hardware implementations, we can see that the generated prototypes perform quite well. The fact that the modules used in these prototypes have been written in Handel-C, a high level language, makes them easier to modify than the VHDL implementations used in most of the existing architectures. Furthermore, using our system, we are no longer bound by the image resolution, memory constraints, etc. A new set of user constraints leads to a new prototype.
Table 2. Comparison of compression factors (quality, ratio and speed) for various parameters
This paper has presented an environment for rapid prototyping of adaptive motion JPEG encoders. The goal of the environment is to design architectures according to the user constraints on one side and the target technologies on the other. Before being validated, a prototype can be refined several times until it meets particular constraints. Thus, for various applications, a dedicated image compression system can be rapidly built based on this proposed environment. Our design operates at a frequency of 52 MHz, and can encode from 10 to 13 megapixels per second depending upon the desired image resolution. Although we are working towards systems that can adapt to time varying system capabilities, we performed our refinement step manually during design time. In future work we will automate this step in order to accelerate the design process and make it less complex. In this way, a system could be implemented and dedicated to automatically analyzing the synthesis results and redesigning the prototype when necessary.
