Abstract-This paper introduces a reconfigurable architecture for secure, real-time video delivery through a novel parameterized construction of the Discrete Wavelet Transform (DWT). This parameterized construction enables multimedia encryption and is well suited to hardware implementation because we derive rational filter coefficients. We achieve an efficient, high-throughput reconfigurable hardware implementation by using LUT-based constant multipliers, which enable run-time reconfiguration of the encryption key. We also compare our prototype (on a Xilinx Virtex-4 FPGA) with several existing implementations in the research literature and show that it outperforms both traditional CPU-based and custom VLSI approaches while adding features for secure multimedia delivery.
I. INTRODUCTION
Security is a major concern in an increasingly multimedia-defined world. The recent emergence of embedded multimedia applications such as mobile TV, video messaging, and telemedicine has increased the impact of multimedia and its security on our personal lives. For example, a significant increase in the use of distributed video surveillance technology to monitor traffic and public places has raised concerns regarding the privacy and security of the targeted subjects.
The large data volumes, interactive operations, real-time responses, and scalability features inherent to real-time multimedia delivery limit the practical application of traditional private and public-key cryptographic schemes. Furthermore, embedded multimedia systems have constraints on power consumption, available computational resources, and performance, which restrict the operation of full-encryption or selective-encryption schemes [1], [2].
The authors in [3] present a joint signal processing and cryptographic approach to multimedia encryption. They use index mapping and constrained shuffling to achieve confidentiality protection. This ensures that the encrypted bitstream still complies with state-of-the-art multimedia coding techniques. The scheme gives good results; however, it requires several extra computations (and hence extra hardware resources) to implement. The work in [4] presents a multimedia encryption scheme based on wavelet coefficient confusion. A scheme based on wavelet coefficient permutations alone is bound to be separable and weak against cryptanalysis. In this work, we do use a wavelet coefficient permutation, referred to as 'subband re-orientation', which is optimized for implementation without any computational overhead. However, our cryptographic approach has more parameters that build the key space, which prevents an adversary from easily cracking our scheme by parallel brute-force trials in the individual subbands.
The initial results regarding a parameterized construction of the Discrete Wavelet Transform (DWT) were presented in [5]. In this paper, we use the parameterized DWT along with the subband re-orientation property to build a large and secure multimedia encryption keyspace. Furthermore, we map this encryption architecture to reconfigurable hardware and perform additional optimizations to obtain a zero-overhead scheme. Reconfigurable Constant Multipliers (RCMs) were used in the design to obtain a LUT-based abstraction and LUT-based hardware acceleration.
The main contributions of this work can be summarized as follows:
1) This paper presents a reconfigurable hardware architecture for multimedia encryption.
2) RCMs are used to accelerate DWT operations and provide LUT-based hardware abstraction. The use of an RCM typically boosts the clock frequency by mapping complex multiplications into consecutive look-up and addition operations. A multiplier-less hardware implementation is presented in the paper.
3) An initial investigation of the security features of the proposed scheme is provided.
4) A performance comparison with existing CPU-based software implementations and custom hardware implementations favors the design of such reconfigurable architectures for secure multimedia delivery.
Our DWT-based architecture serves as a compression-cum-encryption system that provides the dual features of high-throughput multimedia compression and embedded multimedia security. Section II gives a brief introduction to the mathematics underlying the DWT operation and derives an expression for the parameterized DWT. Section III discusses the security performance of our scheme, while Section IV gives details of the hardware architecture and optimizations. Section V presents the synthesis results for the DWT architecture on a Xilinx Virtex-4 FPGA, its performance comparison with traditional architectures, and the speedup over a CPU-based implementation. Section VI gives a short summary of our work.
II. PARAMETERIZATION OF THE DISCRETE WAVELET TRANSFORM

A. Preliminaries
The two most common DWT filters used in image compression are Le Gall's 5/3 filter and the Daubechies 9/7 filter [6], both accepted in the JPEG2000 standard. Le Gall's filter has rational coefficients and its hardware implementation requires fewer resources. The Daubechies 9/7 filter has better compression performance; however, it has irrational coefficients and leads to lossy compression.
B. Parameterized DWT derivation
Bi-orthogonal wavelet filter banks are used in image compression because of their excellent compression properties. They must satisfy the perfect reconstruction (PR) condition and are expected to have a large number of vanishing moments (VMs) for good approximation properties [7]. The Daubechies 9/7 filter has 4 VMs each for the analysis and synthesis low pass filters. Its irrational coefficients limit the precision of its implementation on fixed-point hardware such as FPGAs and ASICs.
Liu et al. [8] discuss the derivation of rational-coefficient DWT filters with an arbitrary number of taps. The parameterization is achieved by reducing the number of VMs in the filter expression by two in order to introduce a free parameter α in the design.
Let $H_1(z)$ and $H_2(z)$ denote the analysis and synthesis low pass filters. On introducing a free parameter α in the equations for $H_1(z)$, the corresponding $H_2(z)$ is obtained by solving the conditions for linear phase, PR, and a low pass response [8].
where
The values $q_n$ are calculated by the following expression:
An approximate expression for the rational representation of the 9/7 filter coefficients ($q_n$) is obtained by simplifying the Taylor series expansion of the above expression. We get $q_0 = 1$, $q_1 = 5 - 2\alpha$, $q_2 = 4\alpha^2 - 14\alpha + 16$, and $q_3 = 36\alpha - 8\alpha^2 - 60 + 32/\alpha$. Simplifying these expressions, we get the following expressions for $H_1(z)$ and $H_2(z)$.
The rational terms in the expressions for these filters can be implemented in hardware using shifts and adds instead of multiplication operations. This saves considerable hardware relative to the original Daubechies filter. However, we still need to perform multiplications with the free parameter α and its powers. This filter is implemented in our DWT architecture and is explained in the following sections.
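To illustrate why a rational α keeps the coefficients rational, the following minimal Python sketch evaluates the four $q_n$ expressions quoted above with exact fractions. The function name `q_coefficients` is ours and purely illustrative; the sketch covers only the $q_n$ terms, not the full filters $H_1(z)$ and $H_2(z)$.

```python
from fractions import Fraction


def q_coefficients(alpha):
    """Evaluate q_0..q_3 exactly for a rational, non-zero alpha."""
    a = Fraction(alpha)
    if a == 0:
        raise ValueError("alpha must be non-zero because q_3 contains 32/alpha")
    q0 = Fraction(1)
    q1 = 5 - 2 * a
    q2 = 4 * a**2 - 14 * a + 16
    q3 = 36 * a - 8 * a**2 - 60 + 32 / a
    return [q0, q1, q2, q3]


if __name__ == "__main__":
    # Every printed value is an exact ratio of integers.
    for q in q_coefficients(Fraction(3, 2)):
        print(q)
```

Because each $q_n$ is a polynomial in α plus a single $1/\alpha$ term, any rational α yields rational values, which is the property that the shift-and-add hardware relies on.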
III. MULTIMEDIA SECURITY
In this section, we give a brief summary of the security aspects of our design. One level of DWT decomposition requires two sets of low and high pass filters (one each along the rows and columns). For high-definition video (with a resolution of 1024 × 768 pixels and above), we can have up to nine levels of wavelet decomposition. Since each of these filters can use its own value of α, DWT parameterization gives us a wide range of α values to choose from.
Moreover, we can implement simple transformations such as transposing the matrix and reverse-ordering along the rows and columns of the individual subbands resulting from the DWT operations. Thus, we use DWT parameterization and subband re-orientation operations to build a key space for multimedia encryption. These simple transformations can be implemented without any computational overhead by changing the memory access pattern, but such shuffling of subband information leads to significant degradation in images reconstructed with an incorrect key. Figure 1 shows the variation in PSNR of the reconstructed image at a bitrate of 0.2 bpp using the set partitioning in hierarchical trees (SPIHT) coder with the parameter α. Varying α beyond the range of 1 to 4 yields a poor PSNR value, reflecting the poor compression property of the resultant DWT coefficients. Thus, we can divide the interval from 1 to 4 into subintervals of 0.0117 so that it takes 8 bits to encode one α parameter. For an image of 512 × 512 pixels, 6 levels of DWT are applied, each requiring two filters: one along the rows and the other along the columns. Thus, we have 12 DWT kernels, each requiring 8 bits to represent its parameter α, giving us a 96-bit keyspace for the DWT operation.
Fig. 2. Possible transpose relationships for subbands. A is the original matrix. The eight permutations are achieved using the transpose relationship (') and reverse-ordering of the subbands (− for reverse, + for forward read access) along both rows and columns.
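The 8-bit quantization of α described above can be sketched as follows. This is a minimal illustration under our own assumptions: a code c is mapped to the lower edge of its subinterval, α = 1 + c × 3/256, and one key byte is used per DWT kernel. The text does not specify the mapping within a subinterval, and the names `alpha_from_code` and `alphas_from_key` are hypothetical.

```python
ALPHA_MIN = 1.0
ALPHA_MAX = 4.0
ALPHA_BITS = 8
# Width of one subinterval: (4 - 1) / 2**8 = 0.01171875, i.e. about 0.0117.
STEP = (ALPHA_MAX - ALPHA_MIN) / (1 << ALPHA_BITS)


def alpha_from_code(code):
    """Map an 8-bit code to an alpha value (assumed: lower edge of its subinterval)."""
    if not 0 <= code < (1 << ALPHA_BITS):
        raise ValueError("code must fit in 8 bits")
    return ALPHA_MIN + code * STEP


def alphas_from_key(key_bytes):
    """Decode one alpha per DWT kernel from a byte string (assumed: one byte per kernel)."""
    return [alpha_from_code(b) for b in key_bytes]


if __name__ == "__main__":
    # 6 levels x 2 filters = 12 kernels, i.e. 12 bytes = 96 key bits.
    example_key = bytes([0, 17, 34, 51, 68, 85, 102, 119, 136, 153, 170, 255])
    print(alphas_from_key(example_key))
```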
A. Building the Keyspace
The parent-child coding gain in DWT-based coders was quantified by Marcellin et al. [9]. These dependencies are generally credited for the excellent mean square error (MSE) performance of zero-tree-based compression algorithms such as SPIHT. In those experiments, the subbands were rotated by 90° with respect to the previous scale prior to zero-tree coding. The experiments indicate that, for natural images, the coding gain due to these dependencies is not considerable (typically around 0.40 dB for SPIHT-NC and 0.25 dB for SPIHT-AC).
The novel parameterization of DWT was first presented in [5] . The above discussion on rotation of subbands allows us to improve the performance of our parameterized DWT scheme by using simple transformations such as transposing the matrix, reverse-ordering of the subbands along the rows and columns. This allows us to build a larger key space without any computational overhead or any significant loss in compression.
In Figure 2 , we illustrate how we can represent the same subband in eight different orientations: we have four orientations of the subband decided by the forward or reverse ordering of the matrix along rows or columns. We get four more orientations by transposing the above four, summing up to eight possible transformations for each subband. We need a 3 bit value to represent this transformation. We have 19 subbands in a 6 level DWT decomposition, each one of which can be independently transformed giving us a 57 bit key space. Thus we have a total key space of 12 × 8 + 19 × 3 = 153 bits.
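The eight orientations and the key-size arithmetic can be checked with a short Python sketch. The bit assignment inside the 3-bit code (bit 2 for transpose, bit 1 for reversing the read order within rows, bit 0 for reversing the order of the rows) is our own assumption for illustration; the text only fixes that three bits select one of eight orientations.

```python
def orient(block, code):
    """Apply one of the eight orientations selected by a 3-bit code."""
    rows = [list(r) for r in block]
    if code & 0b100:                      # transpose
        rows = [list(c) for c in zip(*rows)]
    if code & 0b010:                      # reverse read order within each row
        rows = [r[::-1] for r in rows]
    if code & 0b001:                      # reverse the order of the rows
        rows = rows[::-1]
    return rows


def unorient(block, code):
    """Undo orient() by applying the inverse steps in reverse order."""
    rows = [list(r) for r in block]
    if code & 0b001:
        rows = rows[::-1]
    if code & 0b010:
        rows = [r[::-1] for r in rows]
    if code & 0b100:
        rows = [list(c) for c in zip(*rows)]
    return rows


# Key-size arithmetic from the text: 12 kernels x 8 bits + 19 subbands x 3 bits.
DWT_LEVELS = 6
KERNELS = 2 * DWT_LEVELS          # 12
SUBBANDS = 3 * DWT_LEVELS + 1     # 19
KEY_BITS = KERNELS * 8 + SUBBANDS * 3

if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    for code in range(8):
        print(code, orient(a, code))
    print(KEY_BITS)
```

In hardware, none of these steps moves data: each orientation corresponds to a different read-address sequence over the same subband memory, which is why the text describes the transformation as free of computational overhead.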
B. Multimedia Security
We can further strengthen the security of this scheme by introducing more parameterizations and by applying non-linear transformations to the input key [11].
The main advantage of our lightweight encryption scheme is that, while maintaining competitive compression performance and providing security, it comes at an extremely low computational overhead. The work in [12] uses a wavelet filter parameterization scheme to provide key dependency to a blind watermarking algorithm. Similarly, the 153-bit keyspace can be used for real-time encryption of the input frames of a video. Figure 3 shows the results of encrypting our image with a random 153-bit key. Fig. 3(i) gives the original image. Fig. 3(ii) gives the reconstructed image when the same key was used for reconstruction. Fig. 3(iii-vi) give the results of reconstruction with randomly generated keys. The PSNR values between the original image and the reconstructed images are given in Table I. The low PSNR values obtained with wrong random keys indicate poor reconstruction.
Next, we perform experiments to evaluate the performance of our encryption scheme by flipping a certain number of bits in the key randomly. We assume that the adversary only has the information of the output images (something similar to cipher-text only attacks). Any attempts to recover the correct key information from incorrect trials would use correlation between different key-trials.
Shannon's 1949 paper [13], which serves as the foundational treatment of modern cryptography, calls this the 'confusion' property. Ideally, a change in one bit of the key should change the ciphertext completely.
We performed 1000 simulations in which we flipped k bits of the key and performed the inverse transform with the modified key. Results are given in Table II. It can be observed that k > 4 gives a low PSNR value, which shows the low correlation of the reconstructed image with the input image. We can therefore argue that changing a few bits of the key (k > 4) leads to poor correlation between input and output images. This level of security can suffice for the soft encryption requirements of mobile multimedia applications [14] and the surveillance applications mentioned previously.
IV. DWT ARCHITECTURE AND DESIGN
Figure 4 provides an overview of our parameterized DWT architecture. The input data x (one pixel per cycle) is pipelined for eight cycles. In this block, we perform shift and add operations to implement fractional multiplications for the DWT filter. The high and low pass filter coefficients are the final outputs of the DWT filter.
We performed several optimization steps to reduce the cost of the underlying hardware, as summarized below:
1) Division by binary coefficients (e.g. 1/64, 1/16, 1/4) was performed using arithmetic shift operations. This removes the need for multipliers for these coefficients and reduces the number of multipliers from 69 to 23.
2) Eight out of the nine inputs are passed through four adders to reduce the number of variables to five. These values (labeled $w_0$, $w_1$, $w_2$, $w_3$ and $w_4$) are multiplied by $\alpha$, $\alpha^2$ and $\alpha^{-1}$ to get the necessary intermediate values, which are input to the shift and add logic. This optimization gives substantial savings in hardware: it reduces the number of adders in the design from 70 to 41 and the number of multipliers from 23 to 13.
3) The input stream was pipelined. As shown in Figure 4, our architecture takes one pixel (or channel input) as the input and outputs the low and high pass signal coefficients with a finite latency. Increasing the system latency allows us to achieve a higher clock speed (and hence higher throughput). A direct implementation of this architecture on a Virtex-4 XC4VLX40 FPGA resulted in a clock frequency of 60 MHz, which was improved by pipelining the critical path [5].
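The first two optimizations can be sketched in Python as follows. This is an illustration under stated assumptions, not the circuit itself: we assume the nine pipelined samples form a window whose center sample is left unpaired and whose remaining eight samples are added in symmetric pairs, and the coefficient values in the usage example are arbitrary integers rather than the filter's actual taps. All function names are hypothetical.

```python
def fold_symmetric_taps(window):
    """Reduce a 9-sample window to 5 values using 4 additions (assumed pairing)."""
    if len(window) != 9:
        raise ValueError("expected a 9-sample window")
    c = 4  # index of the center sample
    return [window[c]] + [window[c - i] + window[c + i] for i in range(1, 5)]


def symmetric_fir(window, half_coeffs):
    """Evaluate a symmetric 9-tap filter with 5 multiplications instead of 9.

    half_coeffs[i] is the coefficient shared by the taps at distance i from the center.
    """
    w = fold_symmetric_taps(window)
    return sum(h * wi for h, wi in zip(half_coeffs, w))


def divide_by_power_of_two(value, log2_divisor):
    """Divide by 2**log2_divisor with an arithmetic right shift (rounds toward -inf)."""
    return value >> log2_divisor


if __name__ == "__main__":
    window = [3, 1, 4, 1, 5, 9, 2, 6, 5]
    half = [6, 4, 2, 1, -1]
    print(fold_symmetric_taps(window))
    print(symmetric_fir(window, half))
    print(divide_by_power_of_two(192, 6))
```

Folding works because a linear-phase filter applies the same coefficient to both samples of a symmetric pair, so the pair can be added once before the multiplication.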
A. Reconfigurable Constant Multiplier
The parameterized low and high pass filters were implemented using the architecture given in Figure 4 with several multiplier units. The values $w_i$, $i \in \{0, 1, 2, 3, 4\}$, are obtained by summing the inputs for symmetric taps in the DWT implementation, as shown in Figure 4. $w_i$ is calculated as follows (where $x(i-j)$ is the input x pipelined by j cycles):
Then, we can represent the filter expressions as:
Here $K_i(\alpha)$ and $\tilde{K}_i(\alpha)$ are functions of the variable α, and the $w_i$ are obtained from the pipelined input. The values of $K_i(\alpha)$ and $\tilde{K}_i(\alpha)$ remain the same as long as the parameter α is unchanged. These functions therefore behave as constants and change only when we change the encryption key (and the associated parameter α). Their values can thus be precomputed and hard-coded into the circuit. This constant multiplication maps easily to reconfigurable hardware with programmable LUTs. If the input is represented by $B_1$ bits and the constant by $B_2$ bits, each of the $B_1 + B_2$ output bits is a function of the $B_1$ input bits only, so we can use $(B_1 + B_2)$ $B_1$-input LUTs to get the output values of $H_1(k)$ and $H_2(k)$. Alternatively, we can break down a $(B_1 \times B_2)$-bit multiplication into LUTs with fewer inputs. Thus, the LUT-based multiplication can be reconfigured to incorporate any change in the encryption key.
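The idea of breaking a constant multiplication into smaller look-ups can be sketched in Python as follows. This is a behavioral model under our own assumptions (unsigned operands, 4-bit input slices, one table of partial products per slice); it does not reproduce the exact LUT mapping of the architecture, and the names are hypothetical.

```python
SLICE_BITS = 4  # assumed slice width, matching a 4-input LUT


def build_tables(constant, input_bits):
    """Precompute, for each 4-bit input slice, the products constant * slice_value.

    Changing the key changes `constant`, so reconfiguration amounts to
    rebuilding these tables; the look-up and add structure stays the same.
    """
    n_slices = (input_bits + SLICE_BITS - 1) // SLICE_BITS
    return [[constant * v for v in range(1 << SLICE_BITS)] for _ in range(n_slices)]


def lut_multiply(x, tables):
    """Multiply x by the constant using only look-ups, shifts and additions."""
    acc = 0
    for s, table in enumerate(tables):
        nibble = (x >> (SLICE_BITS * s)) & ((1 << SLICE_BITS) - 1)
        acc += table[nibble] << (SLICE_BITS * s)
    return acc


if __name__ == "__main__":
    tables = build_tables(constant=2731, input_bits=8)   # 12-bit constant, 8-bit input
    print(lut_multiply(200, tables), 2731 * 200)
    tables = build_tables(constant=1365, input_bits=8)   # "re-keyed" constant
    print(lut_multiply(200, tables), 1365 * 200)
```

The design choice illustrated here is that the datapath never contains a general multiplier: only the table contents depend on the key.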
We discuss the implementation of a 4 × 4 bit multiplier to explain the LUT mappings.
1) 4 × 4 Bit Multiplier using LUTs: Arbitrary hardware multipliers can be implemented using the propagate-and-generate algorithm [15]. We make some observations that lead to a direct LUT-based multiplier.
Let A and B be the two operands, both 4 bits long. We define $P_i = A_i \oplus B_i$ and $G_i = A_i B_i$. The output bit and the sum at each stage can be represented as:
On simplification, we get
We can observe that $S_i$ is a function of the inputs and is characterized uniquely by a logical expression. If one of the inputs (say B) is a constant, $S_i$ can be represented as a logic function of the bit values of A.
The truth tables of these functions $f_i(\ldots)$ can be evaluated either by logical simplification or by exhaustive search over the input values. Thus, we can implement a 4 × 4 bit constant multiplication using eight 4-input LUTs or, more generally, an M × K bit constant multiplication (M-bit input, K-bit constant) using (M + K) M-input LUTs.
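The exhaustive-search route to the truth tables can be sketched as follows. This Python model assumes unsigned operands and treats each output bit of the product as one LUT indexed by the 4-bit input value; the function names are hypothetical.

```python
def constant_multiplier_luts(constant, in_bits=4, const_bits=4):
    """Build one truth table per output bit by enumerating every input value."""
    out_bits = in_bits + const_bits
    luts = []
    for i in range(out_bits):
        # Entry a of table i is bit i of the product constant * a.
        luts.append([((constant * a) >> i) & 1 for a in range(1 << in_bits)])
    return luts


def evaluate(luts, a):
    """Reassemble the product from the individual output-bit look-ups."""
    return sum(lut[a] << i for i, lut in enumerate(luts))


if __name__ == "__main__":
    luts = constant_multiplier_luts(constant=11)
    print(len(luts), "LUTs with", len(luts[0]), "entries each")
    print(evaluate(luts, 13), 11 * 13)
```

For a 4-bit input and a 4-bit constant this yields eight tables of sixteen entries, i.e. eight 4-input LUTs, in line with the count stated above.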
It has been shown that LUT sizes of 4 to 6 provide the best area-delay product for an FPGA [16]. Most commercial reconfigurable devices such as FPGAs have 4-input LUTs. We therefore discuss the mapping of an M × K bit constant multiplier into 4-LUTs in the next subsection. A (K + 1)-input LUT can be built from two K-input LUTs (as shown in Figure 5). For example, we can build an 8-LUT from two 7-LUTs, which can in turn be synthesized from 2 × 2 = 4 6-LUTs. Thus, one 8-LUT can be made from $2^4 = 16$ 4-LUTs, and an arbitrary M-LUT from $2^{M-4}$ 4-LUTs. Figure 6 gives an example of the multiplication of an 8-bit number with a 12-bit constant (M = 8, K = 12). Figure 6 (a) depicts 
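The LUT composition rule used in this subsection, where a (K + 1)-input LUT is built from two K-input LUTs, can be sketched in Python as a recursive split of a truth table on its most significant input. This is a behavioral illustration only; the selection between the two halves is modeled by a plain conditional and is not counted in the $2^{M-4}$ figure, and the function names are hypothetical.

```python
def lut4_count(m):
    """Number of 4-LUTs for one m-input LUT under the 2**(m - 4) rule (m >= 4)."""
    if m < 4:
        raise ValueError("rule applies for m >= 4")
    return 1 << (m - 4)


def split_lut(table):
    """Split a (K+1)-input truth table into two K-input tables on the top input bit."""
    half = len(table) // 2
    return table[:half], table[half:]


def eval_lut(table, x, n_inputs):
    """Evaluate an n-input LUT using only 4-input tables at the leaves."""
    if n_inputs == 4:
        return table[x]
    low, high = split_lut(table)
    top = (x >> (n_inputs - 1)) & 1            # most significant input selects a half
    rest = x & ((1 << (n_inputs - 1)) - 1)     # remaining inputs index into that half
    return eval_lut(high if top else low, rest, n_inputs - 1)


if __name__ == "__main__":
    # One output bit of an 8-bit input times a 12-bit constant, as an 8-input table.
    table = [((2731 * a) >> 7) & 1 for a in range(256)]
    print(all(eval_lut(table, a, 8) == table[a] for a in range(256)))
    print(lut4_count(8))
```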
V. HARDWARE PERFORMANCE
We estimated the hardware performance of our architecture by synthesizing the design on a Xilinx Virtex-4 XC4VLX40 FPGA, using ModelSim SE 6.4 for simulation and Xilinx ISE 10.1 for synthesis. This design serves only as a proof of concept for our architecture. An ASIC implementation with fixed interconnects for the LUTs can achieve significant improvements in clock speed and throughput.
We achieved a clock frequency of 103 MHz for our reconfigurable architecture. The hardware requirements of our implementation are summarized and compared with other implementations in Table III. The critical path of the design depends on the mapping of the LUTs. Thus, a hard-wired LUT implementation would give considerable improvements in clock frequency over the present value of 103 MHz. To the best of our knowledge, this work, together with [5], is the first hardware implementation of the parameterized DWT filter. Thus, comparison with reported architectures only indicates the trade-offs involved in building a secure DWT scheme. A large number of 4-LUTs (over 5000) were used in the design. Table IV gives the speed-up of our hardware implementation on a Virtex-4 FPGA over an optimized software implementation. The software implementation was done using Matlab R2008a on a dual-core Intel 2.0 GHz processor with 2 GB RAM. The exact speed-up is highly platform dependent, but we still get an order of magnitude speed-up from the FPGA implementation.
VI. CONCLUSIONS
The proposed multimedia encryption scheme gives promising results for image and video encryption, while the underlying hardware architecture is built from LUTs, allowing reconfiguration and providing high throughput. The key information is embedded into the configuration bitstream of the reconfigurable hardware. To the best of our knowledge, this is the first such scheme optimized to provide high-throughput multimedia delivery alongside multimedia encryption using parameterization of compression blocks.
