This paper reviews the three main hardware architectures for the two-dimensional discrete wavelet transform (2D-DWT) and describes optimization techniques applicable to all three. The main contribution of this work is a quantitative comparison among these design alternatives for the 2D-DWT. The comparison is performed in terms of memory requirements, throughput, and energy dissipation, and is based on a theoretical analysis of the alternative architectures and schedules. Memory requirements, throughput, and energy are expressed by analytical equations with parameters from both the 2D-DWT algorithm and the implementation platform. The parameterized equations enable an early yet efficient exploration of the trade-offs involved in selecting one architecture over the others.
I. INTRODUCTION
The inherent time-scale locality of Discrete Wavelet Transforms (DWT) has established them as powerful tools for numerous applications such as signal analysis, signal compression, and numerical analysis.
This has led numerous research groups to develop algorithms and hardware architectures to implement the DWT. In [1], [2], [3], and [4] VLSI architectures for the 1D and 2D DWT have been proposed. Additionally, comparisons among the architectures and scheduling algorithms for the DWT, regarding their efficiency when the DWT is mapped onto custom VLSI architectures, have been performed in [5], [6].
Although the comparisons presented in [5] and [6] are enlightening, the related analysis is performed at an abstract level, ignoring implementation platform parameters (e.g. memory latency, number of ports, type of filters, etc.) that can heavily affect the results of such a comparison. Additionally, no direct comparison in terms of energy efficiency has been attempted so far. Furthermore, the possible optimizations and their effect on the design parameters are not discussed in prior work. What prior work has pointed out, however, is that none of the alternative architectures has a clear lead in terms of memory requirements, throughput, or energy dissipation for all possible sets of parameters. Hence, the researcher or designer has not yet been provided with an analytical comparison that enables the early and secure selection of one among the alternative architectures for the 2D-DWT. In this paper we attempt to fill this gap. Specifically, the main VLSI architectures for the 2D-DWT are analytically described. Additionally, throughput and memory minimization optimizations are presented and their effect is analyzed. The main contribution of this paper is the comparative study of the alternative architectures, which is based on the development of analytical equations for memory requirements, throughput, and energy. The analysis focuses on the forward 2D-DWT. The comparison results are considered valid for the inverse 2D-DWT as well, since hardware architectures for the inverse 2D-DWT use the same resources and a reversed control flow.
Dept. of Electrical and Computer Engineering, University of Patras, Patras, Greece.
The rest of this paper is structured as follows. Section II gives the basic background needed to follow this paper. Section III describes the core structure of any architecture for the 2D-DWT, namely the filters, and discusses throughput optimizations. Section III also describes a memory minimization technique applicable to any architecture for the 2D-DWT. In Sections IV, V, and VI the three architecture alternatives for the 2D-DWT are analyzed. In Section VII we compare the alternative architectures, while in Section VIII some conclusions are drawn.
II. BASIC BACKGROUND
In this section the necessary background to follow this paper is reviewed. Specifically, subsection II-A briefly describes the 1D and 2D DWT decomposition, while subsection II-B presents the energy model used for the characterization of the alternative architectures.
A. The Discrete Wavelet Transform
The 1D-DWT can be viewed as the multiresolution decomposition of a sequence [7]. It takes a length-$N$ sequence $I_N(n)$ and generates an output sequence of length $N$. The output is a multiresolution representation of $I_N(n)$. The highest resolution level is of length $N/2$, the next resolution level is of length $N/4$, and so on. We denote the number of frequencies, or resolution levels, or simply levels, with the symbol $L$. The 1D-DWT filter bank structure, realizing the 1D-DWT dyadic decomposition, is illustrated in Fig. 1. The 2D-DWT binary-tree decomposition is illustrated in Fig. 2. For each level the input signal is filtered along rows and the resulting signal is filtered along columns. In this way, the 2D decomposition of an input signal $I$ of size $N \times M$, with $M$ columns and $N$ rows, is described by the following equations:
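[Decomposition equations lost in extraction; the only surviving fragment is the layer term $HH^{j+1}$, indexed by row and column.]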
In the rest of this document we use the term layer to indicate both intermediate and output signals, i.e., $L$, $H$, $LL$, $LH$, $HH$, and $HL$, while level is used for each decomposition stage.
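The dyadic structure described above can be illustrated with a minimal Python sketch. It assumes the two-tap Haar filter pair purely for brevity (the filters studied in this paper are longer), and the function name `haar_dwt_1d` is hypothetical. The point is only to show that a length-$N$ input yields bands of length $N/2$, $N/4$, and so on, whose lengths sum to $N$.

```python
def haar_dwt_1d(signal, levels):
    """Dyadic 1D decomposition with the Haar pair (an assumption made for brevity).

    Returns [H1, H2, ..., HL, LL]: the high-pass band of every level followed by
    the final low-pass band.
    """
    approx = list(signal)
    bands = []
    for _ in range(levels):
        if len(approx) % 2:
            raise ValueError("band length must be even at every level")
        half = len(approx) // 2
        # Filtering followed by down-sampling by two: one output pair per two inputs.
        low = [(approx[2 * i] + approx[2 * i + 1]) / 2 for i in range(half)]
        high = [(approx[2 * i] - approx[2 * i + 1]) / 2 for i in range(half)]
        bands.append(high)
        approx = low  # only the low-pass band is decomposed further
    bands.append(approx)
    return bands


bands = haar_dwt_1d(range(8), levels=3)
print([len(band) for band in bands])
```

For an input of length 8 and three levels, the band lengths are 4, 2, 1, and 1, so the output has the same length as the input.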
B. Energy Model
For the energy characterization of the alternative hardware architectures for the 2D-DWT, only energy consumed due to data storage and transfers is taken into account. This suffices for the purposes of this paper for two reasons:
1. In hardware implementations of data-intensive algorithms, such as the 2D-DWT, the energy dissipation due to data storage and transfers forms the dominant component (up to [?]0%) of the total power budget [8]. Indicatively, a transfer to/from an on-chip memory consumes 10 times more power than one addition, while an off-chip access requires 10 to 100 times more power than an on-chip access [8].
2. The different hardware architectures perform exactly the same number of filtering operations. Thus, the energy consumed by arithmetic operations is a common cost for all architectures.
The energy dissipated on the memory hierarchy is approximated by the energy dissipation due to on-chip memory accesses plus the energy dissipation due to off-chip memory accesses.
The energy consumed on the on-chip interconnect (busses) is much smaller than the internal power consumption of on-chip memories; thus the energy cost of an on-chip memory transfer is approximated by the energy cost of the memory access itself. The energy consumed on accesses to the on-chip memories is estimated using the model presented by Landman in [9], [10]. According to this model, the energy dissipated on memory accesses is a function of the memory size in terms of stored words, the number of bits per stored word, the number of accesses, the technology, and the number and type (R or R/W) of ports. It is assumed that the energy is linearly proportional to the number of accesses and sub-linear in the memory size. We also assume a supply voltage $V$ for all architectures. Thus, the energy dissipation due to on-chip memory accesses is given as:
For a given supply voltage, the function of Eq. 10 determines the relation between the memory energy consumption and the memory size, and depends only on technology. Such a function is described in [9], [10], [8] and is used for the estimation of memory access energy cost in this paper.
During an off-chip memory access, power is consumed by the bus driver, the memory and processing element(s), the chip I/O pins (bonding wires and pads), the bus wires, and the memory banks. An accurate high-level estimation of the effective capacitance corresponding to each of these sources of power consumption is very difficult. However, a rough but still useful (for the purpose of comparison) estimate can be obtained by considering typical values for the effective capacitance corresponding to each of the above factors [8]. More precisely, the off-chip memories are assumed to be the most power-conscious ones, i.e., low-power SRAM [11]. The internal power consumption of the off-chip memories is also modeled by Eq. 10. For the I/O pins, the bus driver, and the bus wires a capacitance of 40 pF per bus line is assumed. It is also assumed that on average half of the off-chip bus (bit) lines make a transition per off-chip memory access.
Thus, the effective capacitance can be derived by multiplying the number of bus lines (word length) by half of the 40 pF. This results in the following formula:
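As an illustration of the bus part of this estimate, the following Python sketch computes the energy attributed to the I/O pins, bus driver, and bus wires only. It is a sketch under stated assumptions: the energy per access is taken as the effective capacitance times the square of the supply voltage, the internal energy of the off-chip memory (modeled by Eq. 10) is left out, and the function and parameter names are hypothetical.

```python
def off_chip_bus_energy(accesses, word_bits, supply_voltage, line_capacitance=40e-12):
    """Rough energy (in joules) spent on the off-chip bus for a number of accesses.

    Assumptions (not the paper's exact equation): half of the bus lines toggle per
    access, each line contributes `line_capacitance` farads, and one access costs
    C_eff * V^2. The internal energy of the off-chip memory is not included.
    """
    effective_capacitance = (word_bits / 2) * line_capacitance
    return accesses * effective_capacitance * supply_voltage ** 2


# Example: one million accesses on a 16-bit bus at an assumed 3.3 V supply.
print(off_chip_bus_energy(1_000_000, word_bits=16, supply_voltage=3.3))
```

Because the estimate is linear in the number of accesses and in the word length, halving either one halves the bus energy, which is why the number of off-chip accesses is a central quantity in the comparison that follows.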
III. COMMON ISSUES
This paper does not cover all architectures proposed in the past, but focuses on those that are likely to be implemented in real-life designs. So, to be realistic, we consider RAM-based architectures that use parallel filters. We choose RAM-based architectures since they offer the highest regularity/density of storage and scale more easily compared to architectures based on systolic or semi-systolic routing [3], [5]. Finally, we choose parallel filters because i) they offer a throughput equal to one output per cycle, and ii) they can be pipelined at any level, unlike serial filters.
High throughput is imposed by the application domain of the 2D-DWT, namely image/video compression, in which real-time operation is typically required. Recall that the 2D-DWT is a computationally intensive algorithm, whose complexity in terms of filtering operations is in the order of $O(N \times M)$ up to a constant factor, where $N$, $M$ are the input's dimensions. High throughput is also significant for low-power applications, where it can be traded for a reduced power supply [12]. Of course, the high throughput of parallel filters comes at the expense of a greater number of multipliers. Nevertheless, unlike the 1D case, in architectures for the 2D-DWT it is the storage that dominates the design's size and complexity, not the number of multipliers [5].
A. Parallel Filter Architectures for the DWT
As far as parallel filters for the DWT are concerned, the conventional and a throughput-optimized hardware architecture are studied. The conventional architecture consists of an input FIFO with width equal to $N_W$.
Additionally, the same $N_W$ multipliers are used for the computation of the low- and the high-frequency outputs.
The conventional architecture implements Eq. 1 and 2 in an interleaved manner. Specifically, in the even clock cycles the multipliers are fed with the constant coefficients of the low-pass filter, while in the odd cycles the same multipliers are fed with the constant coefficients of the high-pass filter. In this way, a pair of high- and low-frequency coefficients is produced every two clock cycles. The throughput-optimized architecture consists of a modified FIFO that receives two inputs per clock cycle and a separate data-path for the low-pass and the high-pass filtering. Thus, the throughput-optimized architecture produces a pair of high and low coefficients every clock cycle. Fig. 3 illustrates the conventional and the throughput-optimized parallel filter for a 4/3 DWT.
Although the conventional architecture is expected to occupy less area than the throughput-optimized one, this is not always the case, because the throughput-optimized architecture enables the efficient application of an additional optimization. Specifically, since each of its multipliers is always fed with the same constant coefficient, a multiplication between a variable (input sample) and a constant can easily be reduced to a number of shift and add operations, resulting in a much smaller implementation [13]. For example, a multiplication by 3 is reduced to a left shift by one and one addition of the input itself. Table I gives the area occupied by the non-pipelined conventional and throughput-optimized architectures, with and without operation strength reduction, for the widely used 9/7 filter. It can be observed that the throughput-optimized architecture with operation strength reduction is in all cases faster and in most cases smaller than its non-optimized counterparts.
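The strength reduction mentioned above can be sketched in a few lines of Python. This is an illustration of the general shift-and-add idea, not the specific decomposition used in [13]; the function names are hypothetical, and the generic version simply follows the binary representation of a non-negative integer constant.

```python
def times3(x):
    """Multiply by 3 as one left shift plus one addition: 3x = (x << 1) + x."""
    return (x << 1) + x


def shift_add_multiply(x, constant):
    """Multiply x by a non-negative integer constant using only shifts and adds.

    One shifted copy of x is added for every set bit of the constant, so the
    cost in adders is the number of ones in the constant minus one.
    """
    if constant < 0:
        raise ValueError("this sketch handles non-negative constants only")
    result = 0
    shift = 0
    while constant:
        if constant & 1:
            result += x << shift
        constant >>= 1
        shift += 1
    return result


print(times3(7), shift_add_multiply(5, 10))
```

In hardware the shifts are just wiring, so a constant with few set bits costs only a handful of adders. This is why fixing the coefficient seen by each multiplier makes the optimization effective.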
Another difference between the two alternative parallel filter architectures for the DWT is that the throughput-optimized architecture receives two inputs in parallel. For RAM-based DWT architectures, this requires dual-port memories and wider input-data busses. This is expected to slightly increase the [...]
B. Parallelization of Filtering Operation
According to the decomposition of Fig. 2, the 2D-DWT is computed by applying the high- and low-pass filters along the rows and columns of a layer. To speed up the process of filtering along a one-dimensional input sequence, a linear array of filters can be used. Due to down-sampling by two, two new input coefficients are required to produce the next low- and high-frequency coefficient. Thus, to perform $N_U$ successive filtering operations along a one-dimensional input sequence, $N_W + 2 \times (N_U - 1)$ input coefficients are needed. Hence, such a linear array of $N_U$ parallel filters requires an input FIFO of width $N_W + 2 \times (N_U - 1)$.
The first filter in the array receives input from position 0 up to $N_W - 1$ of the FIFO, the next filter receives input from position 2 up to $N_W + 1$, and so on (Fig. 4). The use of a linear array of filters is not the only way to achieve parallelism in the computation of the 2D-DWT. Another option is to employ parallelism among filtering operations of different layers of the 2D-DWT. In this paper, this form of parallelism is not studied, since it requires more complex control and does not allow for a 100% utilization factor in the general case. The following analysis of different 2D-DWT hardware architectures uses the parameter $N_U$ to model the number of parallel filters in the linear array.
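The FIFO width and the positions tapped by each filter of the linear array can be checked with a short Python sketch. The function names are hypothetical; the arithmetic follows the text directly (each filter spans as many positions as the filter width and is offset by two positions from its neighbour because of the down-sampling by two).

```python
def fifo_width(n_w, n_u):
    """Input FIFO width for n_u parallel filters of width n_w: n_w + 2 * (n_u - 1)."""
    return n_w + 2 * (n_u - 1)


def filter_windows(n_w, n_u):
    """Inclusive (first, last) FIFO positions read by each filter in the array."""
    return [(2 * k, 2 * k + n_w - 1) for k in range(n_u)]


width = fifo_width(n_w=3, n_u=4)
windows = filter_windows(n_w=3, n_u=4)
print(width, windows)
```

The last window always ends at position `width - 1`, which confirms that the FIFO is exactly as wide as the array needs and no wider.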
C. In-Place Mapping for the 1D and 2D DWT
Typically, in any architecture for the 1D or the 2D DWT two different memory blocks are allocated to store the input and the output of the transform. Recall that the input and the output of the transform, and thus also the corresponding memory blocks, are of equal size. In this subsection, we describe an in-place mapping scheme that allows the 1D or 2D DWT to be performed using only one of these memory blocks. The key concept is simple: store filtering outputs in place of no-longer-needed filtering inputs.
For example, consider the 1D sequence of Fig. 5 and assume that input coefficients are fed from the input memory to the filter through a FIFO, which we name the filtering FIFO. Additionally, assume that the input memory stores coefficient $I_N(0)$ at address 0, coefficient $I_N(1)$ at address 1, and so on. The pair of coefficients $L^1(0)$, $H^1(0)$ is produced by filtering the first 3 input coefficients, after performing a symmetrical mirroring. Since input coefficients $I_N(0)$ and $I_N(1)$ are currently in the filtering FIFO and will not be fetched again from the input memory, their addresses can be reused to store the output pair. [...] Now, for the 2D-DWT, consider the input and decomposition layers as two-dimensional arrays of coefficients. In an analogous way, the coefficients at position ($row$, $col$) of decomposition layers $L$ and $H$ are stored at the addresses:
The coefficients at position ($row$, $col$) of decomposition layers $HL$, $LH$, $LL$, and $HH$ are stored at the addresses:
The in-place mapping for an input of size [?] $\times$ [?] is illustrated in Fig. 6.
Note that the addressing expressions required for the in-place mapping can be implemented in hardware using very simple structures, since they consist of multiplications of indexes by a power of two (easily implemented by shift operations) and increments by one. This additional hardware does consume additional energy. However, this energy penalty is insignificant, since addressing is responsible for only a very small fraction of the total energy budget of architectures for the 2D-DWT, which is dominated by the energy dissipation due to data storage and transfers.
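The idea of overwriting no-longer-needed inputs can be sketched in Python for the 1D case. The address expressions below are an assumption: they are one common in-place layout built only from shifts and an increment by one, chosen to match the description above, and are not copied from the paper's own equations. The Haar filter pair and all function names are likewise assumptions made to keep the sketch short.

```python
def low_address(level, index):
    """Assumed address of low-pass coefficient `index` at `level`: index * 2^level."""
    return index << level


def high_address(level, index):
    """Assumed address of high-pass coefficient `index` at `level` (level >= 1):
    (2 * index + 1) * 2^(level - 1)."""
    return ((index << 1) + 1) << (level - 1)


def haar_in_place(memory, levels):
    """Multi-level Haar decomposition that writes every output pair over the two
    inputs it was computed from, so a single memory block is enough."""
    n = len(memory)
    for level in range(1, levels + 1):
        for i in range(n >> level):
            # The two inputs are low-pass coefficients of the previous level.
            a = low_address(level - 1, 2 * i)
            b = low_address(level - 1, 2 * i + 1)
            x0, x1 = memory[a], memory[b]
            memory[low_address(level, i)] = (x0 + x1) / 2   # same address as a
            memory[high_address(level, i)] = (x0 - x1) / 2  # same address as b
    return memory


print(haar_in_place(list(range(8)), levels=3))
```

With a two-tap filter the inputs can be read directly from memory. With longer filters the same overwrite is safe only because the inputs still needed by later outputs are held in the filtering FIFO, as described above.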
IV. ARCHITECTURE I: LEVEL-BY-LEVEL
The level-by-level architecture is the straightforward implementation of the two-dimensional decomposition of Fig. 2. Specifically, the input image is scanned in a row-by-row manner and the filtering of different layers is not interleaved. This means that for each level, the filtering along columns is performed after the completion of the filtering along rows. Furthermore, the filtering of level $j$ is initiated after the completion of filtering at level $j-1$. The filtering operation schedule is described by the pseudo-code of Fig. 7, while the hardware architecture is illustrated in Fig. 8. Note that the initialization and finalization process (needed at the image limits) is ignored in Fig. 7, and throughout this paper, for the sake of simplicity. The memory $ImM$ (Fig. 8) initially stores the input image. Thus, the size of $ImM$ in terms of coefficients is $N \cdot M$.
A linear array of $N_U$ parallel filters is used to perform the necessary filtering operations. Recall that the use of such an array is sensible only if $ImM$ has $N_U \times p$ read and write ports. After their production, each layer's coefficients are written back to $ImM$ according to the in-place mapping scheme described in Subsection III-C. It must be stressed that with this architecture it is pointless to introduce local memories to store intermediate results (i.e. coefficients of layers $H$, $L$, and $LL$), because of the large size that such memories would have. For example, storing the coefficients of layers $H^1$ and $L^1$ requires storage equal to the initial image size.
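The level-by-level schedule can be sketched in Python as follows. This is not the pseudo-code of Fig. 7: it assumes the two-tap Haar filter pair, keeps each layer in its own list instead of using the in-place mapping, and uses hypothetical function names. What it does reproduce is the order of operations: the complete row pass of a level, then the complete column pass, then the next level.

```python
def haar_pairs(seq):
    """Low- and high-pass Haar outputs of a sequence, down-sampled by two."""
    half = len(seq) // 2
    low = [(seq[2 * i] + seq[2 * i + 1]) / 2 for i in range(half)]
    high = [(seq[2 * i] - seq[2 * i + 1]) / 2 for i in range(half)]
    return low, high


def filter_rows(layer):
    """Filter every row of a layer; returns the L and H layers."""
    lows, highs = [], []
    for row in layer:
        low, high = haar_pairs(row)
        lows.append(low)
        highs.append(high)
    return lows, highs


def transpose(layer):
    return [list(column) for column in zip(*layer)]


def filter_columns(layer):
    """Filter every column of a layer by filtering the rows of its transpose."""
    lows, highs = filter_rows(transpose(layer))
    return transpose(lows), transpose(highs)


def level_by_level_dwt(image, levels):
    """Level j starts only after level j-1 has finished; within a level the
    column pass starts only after the row pass has finished."""
    ll = [list(row) for row in image]
    subbands = []
    for _ in range(levels):
        l_layer, h_layer = filter_rows(ll)   # row pass over the whole LL layer
        ll, lh = filter_columns(l_layer)     # column pass over L
        hl, hh = filter_columns(h_layer)     # column pass over H
        subbands.append({"LH": lh, "HL": hl, "HH": hh})
    return ll, subbands


ll, subbands = level_by_level_dwt([[1.0] * 4 for _ in range(4)], levels=2)
print(ll, len(subbands))
```

The intermediate layers `l_layer` and `h_layer` together are as large as the layer they were computed from, which is the reason given above for not buffering them in local memories.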
A. Number of Memory Accesses
Under the constraint related to the number of ports of $ImM$, the number of read (write) accesses to $ImM$ required to perform the 2D-DWT with the level-by-level approach is found as follows: to produce the $H^j$, $L^j$ layers, we need to read all the coefficients of layer $LL^{j-1}$. Layer $LL^{j-1}$ has a size of $(N/2^{j-1}) \times (M/2^{j-1})$ coefficients.
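The counting argument can be illustrated with a small Python sketch. It is a reconstruction under stated assumptions rather than a transcription of Eq. 13: each level is assumed to read its $LL$ input once for the row pass and the resulting $L$ and $H$ layers once for the column pass, boundary effects are ignored, and the coefficient count is divided by the number of ports to obtain accesses. The function name is hypothetical.

```python
import math


def level_by_level_read_accesses(n, m, levels, n_u=1, p=1):
    """Approximate number of read accesses to the image memory.

    Assumptions: per level, the row pass reads the whole LL layer of the previous
    level and the column pass reads the L and H layers, which together have the
    same size; n_u * p coefficients are read per access.
    """
    coefficients_read = 0
    for j in range(1, levels + 1):
        rows = n >> (j - 1)
        cols = m >> (j - 1)
        coefficients_read += 2 * rows * cols  # row pass + column pass
    return math.ceil(coefficients_read / (n_u * p))


print(level_by_level_read_accesses(1024, 1024, levels=3, n_u=1, p=2))
```

Because each level works on a quarter of the data of the previous one, the total stays below $8/3$ times the image size in coefficients, however many levels are computed.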
B. Throughput
To come up with a formula for the throughput of the level-by-level approach we need to define an extra parameter, namely the latency of $ImM$. We denote by $t_{ImM}$ the number of clock cycles (latency) per $ImM$ access (read or write). Now, if we assume 100% utilization of all filters in the linear array, then the number of clock cycles needed to perform the 2D-DWT is:
Finally, the throughput in terms of input coefficients processed per second is:
where $f_{clk}$ is the clock frequency.
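The relation between cycle count and throughput can be written as a one-line Python helper. It only encodes the definition stated above (input coefficients processed per second), takes the cycle count as an input rather than deriving it, and uses hypothetical names.

```python
def throughput_coefficients_per_second(n, m, cycles, clock_hz):
    """Input coefficients processed per second for an n x m image that needs
    `cycles` clock cycles at a clock frequency of `clock_hz` hertz."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return n * m * clock_hz / cycles


# Example: a 1024 x 1024 image processed in 2 * 1024 * 1024 cycles at 100 MHz.
print(throughput_coefficients_per_second(1024, 1024, 2 * 1024 * 1024, 100e6))
```

Any change that reduces the number of cycles, such as a lower memory latency or more ports, raises the throughput in direct proportion.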
C. Energy
$ImM$ can be stored either off-chip or on-chip, depending on integration technology capabilities and image dimensions. When $ImM$ is stored off-chip, the energy consumption of the level-by-level architecture is estimated using Eq. 11. Replacing in Eq. 11 the number of words with $N \cdot M$, the number of ports with $N_U \cdot p$, the number of bits per word with $d$, and the number of accesses with $2 \times$ Eq. 13 results in the formula:
If $ImM$ is integrated on-chip, then the energy consumption of the level-by-level architecture is estimated using Eq. 10 and thus the above equation is reduced to:
V. ARCHITECTURE II: LINE-BASED
The line-based architecture scans the input image in a row-by-row manner and complies with the following concept: proceed to the next layer's filtering as soon as possible (ASAP), without interleaving filtering along a row. The line-based architecture is based on an algorithm analogous to the 1D-RPA algorithm [3]. Although extensions of the RPA algorithm to the 2D-DWT are referenced by some researchers (e.g. [5], [6]), to the best of the authors' knowledge such an algorithm has not been presented yet. However, architectures based on the above concept have been described in [14] and [15], the latter of which is also proposed by the JPEG 2000 committee. In this paper a recursive algorithm for the 2D-DWT is proposed. We call this algorithm two-dimensional RPA (2D-RPA). The 2D-RPA is illustrated in Fig. 9 and is the basis of the line-based architecture described here.
A direct consequence of interchanging layers' filtering ASAP is that latency is minimized. Additionally, the 2D-RPA enables the use of small local memories for the reused data, which are the coefficients of layers $L^j$ and $H^j$ for $j = 1, 2, \ldots, L$, and $LL^j$ for $j = 1, 2, \ldots, L-1$ (see Fig. 2). This feature is highly favorable in many cases, since localizing memory accesses can result in lower energy consumption and higher throughput, at the cost, of course, of a larger integration area. To identify how local memories can be used under the line-based architecture, consider the example of [...]
A. Local Memories
Recall that the coefficients of layers $L^{j+1}$ and $H^{j+1}$ are produced by filtering along the rows of layer $LL^j$, and that $LL^j$ is produced in a row-by-row manner. Hence, the storage required to interchange between the production of $LL^j$ coefficients and the production of $L^{j+1}$ and $H^{j+1}$ coefficients is just one row of coefficients of layer $LL^j$. This storage requirement is satisfied by a local memory, here called $RowsM$. The size of $RowsM$ is:
B. Number of Memory Accesses
Since each input coefficient is read and written only once, and the number of ports of all memories is $N_U \cdot p$, the number of read and write accesses to $ImM$ is: [...] coefficients, $2 N_W \cdot M/2^j$ $L^j$ (or $H^j$) coefficients must be read. Since at level $j$ the total number of pairs of rows of $LL$-$LH$ (or $HL$-$HH$) coefficients is $N/2^j$, the total number of accesses to $ColsM$ can be found as follows:
To produce one row of coefficients at layers $L^{j+1}$ and $H^{j+1}$, $M/2^j$ coefficients must be read from $RowsM$. Thus, the number of coefficients read from $RowsM$ is equal to the product of $M/2^j$ and the number of rows at layer $L^{j+1}$, which is $N/2^j$. Hence, the total number of read accesses from (write accesses to) $RowsM$ is:
C. Throughput
In the following analysis we use the symbols $t_{ImM}$, $t_{ColsM}$, and $t_{RowsM}$ to indicate the number of clock cycles per access to $ImM$, $ColsM$, and $RowsM$ respectively. Thus, assuming 100% utilization of all filters in the linear array, the number of clock cycles needed to perform the 2D-DWT with the line-based architecture is:
Hence, throughput in terms of input coefficients processed per time unit is:
D. Energy
To come up with a formula for the energy dissipation of the line-based architecture, we first consider the typical case in which: i) the input image is stored off-chip and ii) the local memories ($ColsM$ and $RowsM$) are stored on-chip. This allocation of memory blocks is considered typical mainly due to the memory blocks' size. For example, in the extreme case that $N = M = 1024$, $N_W = [?]$, and $L = [?]$, the sum of the sizes of $ColsM$ and $RowsM$ is less than 20K coefficients, which modern technologies allow to be stored on-chip. On the other hand, the size of $ImM$ is 64K coefficients even when $N = M = 256$. Under this allocation the energy dissipation is:
VI. ARCHITECTURE III: BLOCK-BASED
[...] Fig. 12. This block-by-block traversal is a disadvantage for the block-based architecture, since this sort of input image scanning is unsuitable for streaming applications and, furthermore, requires rather complex addressing.
A block diagram of the block-based hardware architecture is illustrated in Fig. 14. [...] A slightly different hardware architecture results if, instead of fetching one block at a time, a super-block of $N_{[?]} \times N_{[?]}$ blocks is fetched into the $IPM$. Filtering within such a super-block enables the production of $N_{[?]} \times N_{[?]}$ pcts without accessing $ImM$. The differences between the initial description and this variation of the block-based architecture lie only in memory sizes and in the number of accesses to each memory. For this reason, from this point forward we study both in a unified way. Specifically, we refer to both alternatives as the "block-based architecture/approach", and the following analysis considers the two super-block dimensions $N_{[?]}$ and $N_{[?]}$ as two extra implementation parameters.
The pseudo-code of Fig. 15 provides a more compact description of the basic steps of the data and control flow of the block-based architecture. In Fig. 15 initialization and finalization phenomena are omitted for the sake of simplicity. 
A. Local Memories
As previously mentioned, the $IPM$ can store $N_{[?]} \cdot N_{[?]}$ blocks of size $2^L \cdot 2^L$. Thus:
The input image is scanned in a row of (super-)blocks by row of (super-)blocks order. For each block, filtering along all layers of the decomposition is performed. This traversal requires $OMCols$ to store $N_W - 2$ coefficients for each column and each decomposition level. The number of columns at decomposition level $j$ is $M/2^j$.
Hence, the size of $OMCols$ is:
B. Memory Accesses
Since each input coefficient is read and written only once, and the number of ports of all memories is $N_U \cdot p$, the number of read and write accesses to $ImM$ is:
Since for each (super-)block the 2D-DWT is performed in the same way as in the level-by-level approach, we can use Eq. 13 to compute the number of coefficients read from the $IPM$ per block. Of course, $ImM$ in Eq. 13 must be replaced with $IPM$. Additionally, each block is written back to $ImM$. The total number of (super-)blocks is $N \cdot M / (N_{[?]} \cdot N_{[?]} \cdot 4^L)$. Hence:
$N_W - 2$ coefficients are fetched/stored from/to $OMCols$ for each column of each decomposition level of each (super-)block. The number of (super-)blocks is $N \cdot M / (N_{[?]} \cdot N_{[?]} \cdot 4^L)$; the number of columns at decomposition level $j$ of a block is $N_{[?]} \cdot 2^L / 2^j$. Thus:
$N_W - 2$ coefficients are fetched/stored from/to $OMRows$ for each row of each decomposition level of each (super-)block. Thus, in the same way as in the case of $OMCols$, the number of read (write) accesses to $OMRows$ is:
C. Throughput
In the following analysis we use the symbols $t_{ImM}$, $t_{IPM}$, $t_{OMCols}$, and $t_{OMRows}$ to indicate the number of clock cycles per access to $ImM$, $IPM$, $OMCols$, and $OMRows$ respectively. Thus, assuming 100% utilization of all filters in the linear array, the number of clock cycles needed to perform the 2D-DWT with the block-based architecture is:
D. Energy
To come up with a formula for the energy dissipation of the block-based architecture, we must first allocate each memory block on- or off-chip. $IPM$, $OMRows$, and $OMCols$ are typically stored on-chip, since the sum of their sizes is less than 32K coefficients even for the extreme case that $N = M = 1024$, $L = [?]$, $N_W = [?]$, and $N_{[?]} = N_{[?]} = 2$. For $ImM$ we consider two cases: in the first case $ImM$ lies off-chip, while in the second case $ImM$ lies on-chip. In the first case the energy dissipation of the block-based architecture is estimated by the following equation:
In the case that $ImM$ is stored on-chip, the term $[\ldots] \cdot N_U \cdot p \cdot 40 \cdot 10^{-12} \cdot V^2$ should be removed from Eq. 36.
From Eq. 36 and Eqs. 27 to 33, the energy dissipation can be expressed in terms of the basic parameters used throughout this paper.
VII. COMPARISONS
In this section we attempt a comparison, in terms of memory requirements, throughput and energy dissipation, of the design alternatives for the 2D-DWT presented in Sections IV, V and VI. The comparison is based on the parametric Eqs. 13 to 36. Table II summarizes the parameters of the analysis, while Table III summarizes the parametric equations. Although analytical equations are available, identifying in a purely analytical fashion the conditions under which one architecture outperforms the others is almost infeasible. For this reason we perform the comparison for typical cases of the 2D-DWT, and draw general conclusions whenever possible. We consider as typical examples some of the 2D-DWTs included in the JPEG2000 final committee draft [16], namely the 2D-DWT based on the 5/3, 9/7 and 10/18 filters.
A. Memory Requirements
The level-by-level approach has smaller memory requirements than the line-based and block-based approaches, since it uses no local memories. The local-memory size of the latter two varies with the input image dimensions, the number of decomposition levels, and the filter width. Fig. 16 illustrates the sizes of the local memories of the line-based and block-based architectures for the 5/3, 9/7 and 10/18 2D-DWT.
As can be observed, the block-based architecture has 1[?]% to [?]% smaller requirements for local storage than the line-based architecture in the majority of cases. This is not true when the number of decomposition levels is greater than 6 (e.g. 20[?] or 12[?]), where the block-based architecture has 2[?]% to 100% greater requirements for local storage than the line-based architecture.
Furthermore, an interesting observation is that the number of decomposition levels only slightly affects memory sizes for the line-based architecture, while it heavily affects memory sizes for the block-based architecture. For example, in the case of a 1024 $\times$ 1024 input image and the 5/3 2D-DWT, the line-based architecture requires local storage of [?]K coefficients for 3 levels and 11[?]K coefficients for [?] levels of decomposition, while the block-based architecture requires local storage of 3[?]K coefficients and 22[?]K coefficients respectively. This must be taken into account especially when the design goal is a DWT engine programmable in terms of decomposition levels.
B. Throughput
To compare the alternative architectures in terms of throughput, we consider a typical case in which i) the clock frequency is 100 MHz, and ii) all memories are dual-ported ($N_U \cdot p = 2$). [...] 2D-DWT, when $t_{ImM} = t_{localmems}$. In this case, the parameter that defines the comparison outcome is the total number of memory accesses, since there is no latency difference between accesses to $ImM$ and to the local memories. Hence, as expected, the level-by-level architecture, which performs the smallest possible number of memory accesses, is faster than the other two architectures for all filter sets and numbers of decomposition levels. The comparison between the other two architectures leads to the same conclusions as in the case where $t_{ImM}$ and $t_{localmems}$ differ, but now the percentage differences are greater, from [?]% to 1[?]%.
Summarizing the above leads to the following general statements: 
C. Energy
One off-chip access is 10 to 100 times more energy consuming than one on-chip access, depending on technology, off-chip bus length, etc. For this reason the comparison of the three alternative architectures for the 2D-DWT is performed separately for the cases that $ImM$ is stored off-chip and on-chip. [...] Since the line-based and block-based architectures perform the same number of accesses to $ImM$ (Eq. 20, Eq. 30), the comparison between them focuses only on the energy consumed due to accesses to the on-chip local memories. Fig. 19 shows the energy consumed due to on-chip memory accesses for the cases of the 5/3, 9/7 and 10/18 2D-DWT. From this figure it can be observed that in all cases the line-based architecture consumes more energy due to on-chip memory accesses than the block-based architecture. The difference between the two architectures in terms of energy dissipation due to accesses to the local memories is significant and lies in the range of 13% to [?]%. This difference is due to the fact that the line-based architecture performs the dominant majority of on-chip memory accesses to $ColsM$, while the block-based architecture performs the dominant majority of on-chip memory accesses to the smaller $IPM$, where the energy cost per access is much lower. [...] Fig. 20, for the 5/3 2D-DWT the level-by-level architecture consumes on average [?]% more energy than the block-based one, while for the 10/18 2D-DWT and $N \times M$ [?] the level-by-level architecture consumes on average almost 22% less than the block-based architecture. Finally, the line-based architecture is [?]% to [?]% less energy efficient than the other two architectures.
Summarizing the above results in the following statements:
In the case that $ImM$ lies off-chip and the local memories lie on-chip, the block-based architecture is the most energy-efficient, while the level-by-level architecture is the least energy-efficient.
D. Discussion
From the comparison performed for the three filters of JPEG2000, it is evident that the efficient and secure selection of one among the alternative architectures requires a detailed exploration of the various trade-offs. None of the alternative architectures has a clear lead in all cases and/or for all sets of parameters. The comparison turns in favor of the architecture that, for a certain implementation platform and a certain type of filters and number of decomposition levels of the 2D-DWT, combines low integration cost (related to the number of filtering units and the memory requirements), sufficient throughput for the given task, and the lowest energy dissipation. This sort of exploration is facilitated by the derived formulas (Table III) and the conclusions of this paper, the most important of which are summarized in Table IV. Specifically, the exploration performed in subsections VII-A to VII-C indicated that, in cases where technology and cost allow for the on-chip integration of $ImM$, the level-by-level architecture, which is the easiest to implement, combines relatively high throughput and low energy dissipation while requiring the smallest amount of storage. However, if $ImM$ is stored off-chip (which today is the typical case), then the level-by-level architecture becomes the slowest and most energy-hungry among the architectures studied here.
In the latter case and for relatively small filter widths, the line-based architecture offers high throughput at the expense of increased energy dissipation, while the block-based architecture requires the lowest energy budget at the expense of processing speed. Furthermore, in the case of relatively large filter widths the block-based architecture outperforms the other two in terms of both energy dissipation and throughput. However, the block-based architecture has the disadvantages of being the hardest to implement, of not being suitable for streaming applications, and of significantly changing its storage requirements when the number of decomposition levels varies.
VIII. CONCLUSIONS
In this paper, alternative hardware architectures for the 2D-DWT have been analyzed and compared in terms of memory requirements, throughput, and energy dissipation. This paper does not cover all architectures proposed in the past, but focuses on those that are likely to be implemented in real-life designs. The comparison is based on theoretically derived formulas for memory requirements, throughput, and energy dissipation. The formulas are generic in terms of parameters of both the 2D-DWT and the implementation platform. The comparison has indicated that none of the architectures has a clear lead for all sets of parameters, but it has also led to the identification of the strengths and weaknesses of each architecture, and of the conditions under which each architecture outperforms the others in terms of storage requirements, processing speed and energy dissipation.
