I. INTRODUCTION

Motion estimation is an essential part of standard video coders such as H.26x, MPEG-1, MPEG-2, and MPEG-4. By removing temporal redundancies existing in adjacent frames, motion estimation can reduce the coding bit-rate significantly. The block-matching algorithm (BMA) is used as the motion estimation method in most video coding systems. Its goal is to find the block that is most similar to a current block within a pre-defined search area in a reference frame. As a straightforward method, the full search BMA (FSBMA) is widely used because of its high performance and low control overhead. However, FSBMA is computationally expensive in a video encoder, and about 70% of the total encoding time is spent on this algorithm. This heavy computational load limits the performance of the encoder in terms of encoding speed and power consumption. Hence, many VLSI architectures for FSBMA have been reported previously for fast implementation [1]-[5]. Nevertheless, due to its high computational complexity, FSBMA usually requires a large number of gates and high memory bandwidth in hardware implementation for real-time applications.
As a result, in order to reduce the heavy computational load of FSBMA, active research has focused on fast BMAs for a long time. Most of the fast BMAs exploit a unimodal error surface model [6] - [18] , a pixel sub-sampling technique [19] , spatial/temporal correlation [19] - [21] , or a hierarchical/multi-resolution frame structure [22] - [27] . In [6] - [18] , the number of search points is reduced by selectively checking the points under the unimodal error surface model, where the matching error is assumed to increase monotonically as the search moves away from the position of global minimum error. Even though they suffer from considerable PSNR degradation as the search range increases, many hardware architectures have been proposed due to their simplicity and regularity [28] - [32] .
On the other hand, in [19] - [21] , the spatial/temporal correlation of MV fields is exploited. The main idea is to compose a set of candidates from the MVs of spatially/temporally adjacent MBs, and then choose the best one among them as an initial estimate for further refinement according to a certain rule. Theoretically, the initial estimate can be obtained from an autoregressive (AR) model of MV fields. However, the AR model often encounters noise with an unexpectedly large variance due to a discontinuous MV field, which stems from moving boundaries or scene changes. To prevent performance degradation caused by the discontinuity, one approach takes advantage of anti-causal prediction as well as causal prediction [19] , [21] . But, these algorithms are not suitable for real-time implementation due to an inherently long latency and data-fetch repetition.
The multi-resolution schemes of [21]-[27] are based on the idea of predicting an initial estimate at the coarse level and refining the estimate at the fine level. Typically, two approaches are popular. One is to use a variable block size at each level [22]-[25], and the other is to use a constant block size [21], [26], [27]. In the former, an initial estimate is obtained from a large measurement block at the coarse level and becomes a search center for the next level, based on the assumption that the initial estimate approaches the true MV. Generally, the motion vector obtained from a larger block is more noise resistant than that from a smaller block. However, an initial estimate obtained from a larger block may not be a good estimate, since several motions may exist in one block as the block size increases. The latter approach is relieved from this problem since the block size is constant at all levels. Hence, this technique tends to provide fast computation with little performance degradation, even for sequences having complex motion. However, its motion vectors may be less robust to noise.
To increase the robustness of a motion vector in a constant-block multi-resolution scheme, several algorithms have been proposed by combining the scheme with a multiple candidate search [33]-[37]. In [33]-[35], since candidates are selected only on the basis of minimum distortion, many candidates are needed to achieve a prediction performance close to that of FSBMA. Hence, this approach requires a high computational cost. In [36], the neighborhood relaxation scheme is adapted to a multi-resolution algorithm with multiple candidates. However, this algorithm aims to improve the motion vector coding efficiency in very low bit-rate coding rather than the motion estimation performance itself. As a recent effort, the algorithm in [37] uses four candidates corresponding to four superblocks with different shapes and adopts binary block matching to reduce the computational complexity due to multiple candidates. This algorithm is mainly focused on MPEG-2 and HDTV coding rather than low bit-rate coding.
Meanwhile, a few hardware architectures have been proposed for multi-resolution BMA [23] , [38] , [40] , [41] . But they do not aim at small-area implementation for low-bit applications. Namely, the architecture in [23] is not efficient in terms of chip area due to its large on-chip memories, the one in [38] is focused on implementation for high bit-rate applications such as MPEG-2, and each multi-resolution level in [40] , [41] has its own specified systolic array which cannot be commonly used among different levels.
In this paper, we propose a novel multi-resolution BMA, which is effective in terms of both estimation performance and LSI implementation. We also provide its corresponding architecture with a small hardware size for low bit-rate video-coding applications. The proposed algorithm can be distinguished from existing algorithms in providing all the following aspects simultaneously.
1) The proposed algorithm can reach a high estimation performance close to that of the FSBMA.
2) The performance becomes more reliable and efficient as the search area increases.
3) The proposed algorithm is appropriate for LSI implementation with small-size hardware.
4) The proposed architecture can produce 8 × 8 block-based MVs as well as 16 × 16 macroblock (MB)-based MVs simultaneously.
This paper is organized as follows. In Section II, we propose a multi-resolution motion search algorithm (MRMCS), and show intensive simulation results. The proposed LSI architecture is described and its application to the advanced prediction mode is discussed in Section III, together with the implementation results. Finally, conclusions are given in Section IV.
II. MRMCS
As mentioned above, conventional multi-resolution search algorithms tend to fall into a local minimum due to their reduced search points at coarser levels. This local minimum problem may partially be overcome by increasing the number of candidates at each level as in [33] - [35] . However, this approach requires high computational cost to get the prediction performance close to that of FSBMA because multiple MV candidates are required for local searches even at the finest level, as well as the middle level. The use of multiple MV candidates at the finest level increases the overall I/O bandwidth of the algorithm because the I/O data bandwidth at the finest level generally dominates the bandwidth in the conventional multi-resolution BMAs. Furthermore, as the search range increases, its estimation accuracy deteriorates because the interval between adjacent search points at the coarsest level becomes larger. Therefore a significant local minimum phenomenon occurs. To overcome these drawbacks, the proposed MRMCS limits MV candidates to only three at the middle level, and sets the total number of levels to three. In order to maintain high performance with those three candidates, we introduce a candidate based on spatial correlation in a MV field in addition to the candidates based on minimum distortion. This improves the accuracy of MV candidates. Thereby, a single candidate at the final-level search is enough to provide the desired performance. As a result, the MRMCS performs only one local search at the finest level, and its overall computational cost and data bandwidth burden decrease.
A. Multi-Resolution Frame Structure
The proposed MRMCS consists of three resolution levels. Level numbers are ordered from 0 to 2, and levels 0 and 2 represent the finest and coarsest levels, respectively. For the $k$-th input frame, $f_k^{0}$, the upper level images are constructed via the following sub-sampling:

$$f_k^{l}(x, y) = f_k^{l-1}(2x, 2y), \quad l = 1, 2 \tag{1}$$

where $f_k^{l}(x, y)$ represents the intensity value at the position $(x, y)$ of the $k$-th frame at level $l$. For a cost-effective implementation, low-pass filtering is not applied before sub-sampling. Performance degradation due to the absence of low-pass filtering is found to be negligible. The number of pixels at the next upper level is reduced to one fourth of that at the lower level. The multi-resolution frame structure is illustrated in Fig. 1. The MB size becomes 16 × 16, 8 × 8, and 4 × 4 at levels 0, 1, and 2, respectively.

The sum of absolute differences (SAD) is widely used as the matching criterion for BMA due to its low computational cost. For a 16 × 16 MB, the SAD at level $l$ can be defined as

$$SAD^{l}(x, y) = \sum_{i=0}^{N_l - 1} \sum_{j=0}^{N_l - 1} \left| f_k^{l}(i, j) - f_{k-1}^{l}(i + x, j + y) \right| \tag{2}$$

where $l$ is the level number, $N_l = 16 / 2^{l}$, and $(i, j)$ is measured from the top-left corner of the block. Since both the block size and the number of search points covering a given range shrink by a factor of four per level, the computational complexity of the matching process is reduced by the fourth power of the sub-sampling factor at each level (i.e., to 1/16 at level 1 and 1/256 at level 2).
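As a minimal sketch of the frame structure and matching criterion described above, the following Python fragment builds the three levels by plain 2:1 sub-sampling without low-pass filtering and evaluates an SAD on any level. The function names, the list-of-rows frame representation, and the convention that a displacement (dx, dy) is added to the block position in the reference frame are illustrative assumptions, not part of the original design.

```python
def subsample(frame):
    """Keep every second pixel in both directions (no low-pass filtering)."""
    return [row[::2] for row in frame[::2]]


def build_pyramid(frame, levels=3):
    """Return [level 0, level 1, level 2]; level 0 is the finest."""
    pyramid = [frame]
    for _ in range(levels - 1):
        pyramid.append(subsample(pyramid[-1]))
    return pyramid


def sad(cur, ref, bx, by, dx, dy, n):
    """SAD between the n x n block of `cur` at (bx, by) and the block of
    `ref` displaced by (dx, dy). The caller must keep the displaced block
    inside the reference frame."""
    total = 0
    for j in range(n):
        for i in range(n):
            total += abs(cur[by + j][bx + i] - ref[by + dy + j][bx + dx + i])
    return total


# Hypothetical 32 x 32 test frame, used only to exercise the functions.
frame = [[(3 * x + 5 * y) % 256 for x in range(32)] for y in range(32)]
pyramid = build_pyramid(frame)
sizes = [len(level) for level in pyramid]   # 32, 16, 8 rows
```

At level l the same macroblock is addressed with a block size of 16 >> l and with its position divided by 2^l, so one SAD routine can serve all three levels.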
B. Framework of MRMCS
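The MRMCS is based on the multi-resolution frame structure mentioned above. The overall scheme of MRMCS is illustrated in Fig. 2. Let the whole search range at level 0, or $SR_0$, be $[-w, w-1]$ in both directions.
1) Search at Level 2:
We choose three MV candidates, i.e., $\mathbf{c}_1$, $\mathbf{c}_2$, and $\mathbf{c}_3$, based on the spatial correlation in MV fields as well as the minimum SAD, and employ them as initial search centers at level 1. First, $\mathbf{c}_1$ and $\mathbf{c}_2$ having minimum SAD are obtained by full search within a given search area at this level

$$\mathbf{c}_1 = \arg\min_{(x,y) \in SR_2} SAD^{2}(x, y), \qquad \mathbf{c}_2 = \arg\min_{(x,y) \in SR_2,\ (x,y) \neq \mathbf{c}_1} SAD^{2}(x, y) \tag{3}$$

where $SR_2 = \{(x, y) \mid -w/4 \le x, y \le w/4 - 1\}$ and $SAD^{l}$ denotes the SAD computed on the level-$l$ images. From the perspective of level 0, the full search at level 2 is equivalent to examining regularly sub-sampled points within $SR_0$, which is usually useful for the searching of random or complex motion. Thus, $\mathbf{c}_1$ and $\mathbf{c}_2$ play a main role in finding the true MV in a complex motion area. Then, the final candidate $\mathbf{c}_3$ is predicted from adjacent MVs at level 0 via the component-based median predictor employed for MV coding in MPEG-4 and H.263 [42], [43]. In a continuous MV field, $\mathbf{c}_3$, based on spatial MV correlation, can be a better initial estimate than $\mathbf{c}_1$ or $\mathbf{c}_2$.
2) Search at Level 1:
Local searches are performed around the three candidates in order to find a MV candidate, $\mathbf{v}_1$, for the search at level 0. The MV candidate is calculated as

$$\mathbf{v}_1 = \arg\min_{(x,y) \in SR_1} SAD^{1}(x, y) \tag{4}$$

where $SR_1$ is the union of the three $[-2, 2]$ windows centered at $2\mathbf{c}_1$, $2\mathbf{c}_2$, and $\mathbf{c}_3 / 2$, i.e., at the three candidates scaled to the resolution of level 1.
3) Search at Level 0:
A final MV, $\mathbf{v}_0$, is found from a local search around $2\mathbf{v}_1$, i.e.,

$$\mathbf{v}_0 = \arg\min_{(x,y) \in SR_0'} SAD^{0}(x, y) \tag{5}$$

where $SR_0'$ is the $[-2, 2]$ window centered at $2\mathbf{v}_1$.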
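The three-level search described above can be sketched as follows. This is an illustrative reading rather than the authors' implementation: it assumes a level-0 search range of [-w, w-1], a level-2 full search over [-w/4, w/4-1], local windows of [-2, 2] at levels 1 and 0, three neighbouring MVs for the median predictor, and a macroblock far enough from the frame border that at least one candidate is valid at every level. All function names are hypothetical.

```python
from statistics import median


def subsample(frame):
    """2:1 sub-sampling in both directions, no low-pass filtering."""
    return [row[::2] for row in frame[::2]]


def sad(cur, ref, bx, by, mv, n):
    """SAD of the n x n block at (bx, by); None if the displaced
    reference block would leave the frame."""
    x0, y0 = bx + mv[0], by + mv[1]
    if x0 < 0 or y0 < 0 or x0 + n > len(ref[0]) or y0 + n > len(ref):
        return None
    return sum(abs(cur[by + j][bx + i] - ref[y0 + j][x0 + i])
               for j in range(n) for i in range(n))


def ranked(cur, ref, bx, by, n, points):
    """(sad, mv) pairs for all valid, distinct candidates, best first."""
    scored = []
    for mv in dict.fromkeys(points):
        s = sad(cur, ref, bx, by, mv, n)
        if s is not None:
            scored.append((s, mv))
    return sorted(scored)


def window(center, r=2):
    """All displacements within +/- r of `center`."""
    return [(center[0] + dx, center[1] + dy)
            for dy in range(-r, r + 1) for dx in range(-r, r + 1)]


def mrmcs(cur, ref, bx, by, w, neighbour_mvs):
    """Return the level-0 MV of the 16 x 16 macroblock at (bx, by)."""
    cur1, ref1 = subsample(cur), subsample(ref)
    cur2, ref2 = subsample(cur1), subsample(ref1)

    # Level 2: full search with a 4 x 4 block, keep the two best positions.
    q = w // 4
    full = [(dx, dy) for dy in range(-q, q) for dx in range(-q, q)]
    best2 = ranked(cur2, ref2, bx // 4, by // 4, 4, full)
    c1, c2 = best2[0][1], best2[1][1]

    # Third candidate: component-wise median of neighbouring level-0 MVs.
    mx = int(median(m[0] for m in neighbour_mvs))
    my = int(median(m[1] for m in neighbour_mvs))

    # Level 1: local searches around the three scaled candidates.
    centers = [(2 * c1[0], 2 * c1[1]), (2 * c2[0], 2 * c2[1]),
               (mx // 2, my // 2)]
    points1 = [p for c in centers for p in window(c)]
    v1 = ranked(cur1, ref1, bx // 2, by // 2, 8, points1)[0][1]

    # Level 0: a single local search around the scaled level-1 result.
    return ranked(cur, ref, bx, by, 16, window((2 * v1[0], 2 * v1[1])))[0][1]


def shifted_copy(frame, dx, dy):
    """Synthetic 'current' frame whose true MV is (dx, dy) everywhere."""
    h, w = len(frame), len(frame[0])
    return [[frame[(y + dy) % h][(x + dx) % w] for x in range(w)]
            for y in range(h)]
```

A call such as `mrmcs(cur, ref, 24, 24, 16, [(0, 0)] * 3)` with `cur = shifted_copy(ref, 8, -4)` exercises the minimum-SAD candidates, while passing neighbouring MVs close to the true motion exercises the median candidate. Rounding of the halved median uses floor division here, which is a choice of this sketch.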
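The search complexity of MRMCS, for a whole search range of $[-w, w-1]$ at level 0, can be described as follows:

$$C_{MRMCS} = \sum_{l=0}^{2} C_l = \left[ \frac{(w/2)^2}{16} + \frac{3 \cdot 5^2}{4} + 5^2 \right] I \, M \, F \quad \text{operations/s} \tag{6}$$

where $C_l$ denotes the search complexity at level $l$, and $I$, $M$, and $F$ are the image size, the number of operations for computing SAD per pixel, and the frame rate, respectively. The three terms in the bracket correspond to levels 2, 1, and 0. In the case of FSBMA, the computational complexity is given as

$$C_{FSBMA} = (2w)^2 \, I \, M \, F \quad \text{operations/s} \tag{7}$$

From (6) and (7), we can expect that the computational burden of the MRMCS can be reduced to 4.7% and 1.5% of that of the FSBMA for $w$ of 16 and 32, respectively. This relative burden is reduced further as the search range increases.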
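The quoted ratios can be checked with a short calculation. The sketch below counts search points per level-0 pixel under assumptions that are inferred rather than stated explicitly in the text: the level-0 range spans 2w positions per dimension, level 2 is a full search over w/2 positions per dimension on 1/16 of the pixels, and every local search at levels 1 and 0 covers 5 × 5 points. The common factors (image size, operations per pixel, frame rate) cancel in the ratio.

```python
def mrmcs_points_per_pixel(w, candidates=3, local=5):
    """Search points per level-0 pixel, summed over the three levels."""
    level2 = (w // 2) ** 2 / 16            # full search on 1/16 of the pixels
    level1 = candidates * local ** 2 / 4   # three local searches on 1/4
    level0 = local ** 2                    # one local search at full size
    return level2 + level1 + level0


def fsbma_points_per_pixel(w):
    """Full search over 2w positions in each direction."""
    return (2 * w) ** 2


def relative_burden(w):
    return mrmcs_points_per_pixel(w) / fsbma_points_per_pixel(w)


for w in (16, 32):
    print(w, f"{100 * relative_burden(w):.1f}%")   # 4.7% and 1.5%
```

Only the level-2 term grows with w, and it grows 64 times more slowly than the FSBMA count, which is why the relative burden keeps falling as the search range increases.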
C. Experimental Results
To evaluate the performance of MRMCS, we use seven MPEG test video sequences: "mother and daughter (m&d)," "news (news)," "car phone (car)," "foreman (fman)," "flower garden (fg)," "football (fb)," and "table tennis (tt)." The first two sequences have relatively small motions and the others have fast or complex motions. All the sequences consist of 300 frames at a frame rate of 30 fps, and the size of each frame is 352 × 288 for the first four sequences and 352 × 240 for the others. The integer-pel search range is set to $[-w, w-1]$, e.g., $w = 8$, 16, or 32. As a performance evaluation measure, PSNR is used and defined as follows:

$$PSNR = 10 \log_{10} \frac{255^2}{\dfrac{1}{I} \sum_{(x,y)} \left[ f_k(x, y) - \hat{f}_k(x, y) \right]^2} \tag{8}$$

where $I$ and $\hat{f}_k$ denote the image size and the $k$-th motion compensated image, respectively, and $f_k$ is the $k$-th original image.
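For reference, the PSNR measure can be computed as in the following sketch, which assumes 8-bit samples (peak value 255) and frames stored as lists of rows. The function name and the handling of a perfect match are choices made here for illustration.

```python
import math


def psnr(original, compensated, peak=255):
    """PSNR in dB between an original frame and its motion compensated
    prediction; returns infinity when the two frames are identical."""
    n_pixels = len(original) * len(original[0])
    squared_error = sum((a - b) ** 2
                        for row_o, row_c in zip(original, compensated)
                        for a, b in zip(row_o, row_c))
    mse = squared_error / n_pixels
    if mse == 0:
        return math.inf
    return 10 * math.log10(peak ** 2 / mse)


# Hypothetical 4 x 4 frames differing by 10 grey levels at every pixel.
flat = [[100] * 4 for _ in range(4)]
offset = [[110] * 4 for _ in range(4)]
value = psnr(flat, offset)   # MSE is 100, so roughly 28.1 dB
```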
We compare the performance of MRMCS with four existing algorithms: FSBMA; the $n$-step search ($n$SS, with $n = 3$, 4, 5 for search-range parameters $w = 8$, 16, 32, respectively), which is a general version of the 3-step search (3SS) that covers increased search ranges [39]; the multi-resolution search algorithm based on multiple candidates [33] (we will call it MRMC-$n$, for the algorithm has $n$ candidates at each resolution level); and the multi-resolution search algorithm using spatio-temporal correlations (MRST) [21]. The results are given in Table I. $n$SS exhibits the lowest computational complexity with a consistency that is suitable for hardware implementation. However, it is noted that $n$SS provides lower PSNR performance, especially for the sequences having fast motion. Also, in this approach, 4SS or 5SS sometimes provides even worse performance than 3SS with $w$ of 8, because a large search area causes a more severe local minimum problem. Even though MRMC-$n$ also needs a consistent computational complexity, it provides worse PSNR performance than MRMCS and MRST for similar computational complexity, and its performance becomes noticeably worse as the search range increases for all the video sequences. Meanwhile, MRMCS and MRST provide good PSNR performance that is close to that of the FSBMA. However, compared to MRMCS, MRST has a longer frame delay due to its noncausal processing, requires larger storage for the MVs of the previous frame in addition to those of spatially adjacent blocks, and has an inconsistent computational complexity. Hence, it is suited to implementation in software rather than hardware.
The use of spatial MV correlation improves the estimation accuracy of our algorithm. This can be examined in Table I by comparing the performance between MRMC-4 and MRMCS. By replacing a MV candidate based on minimum SAD at level 1 with the one based on spatial MV correlation, MRMCS can provide a more reliable MV candidate at level 0. (For sequences having large motion, the effect of the adoption of spatial MV correlation becomes more noticeable.) As a result, in MRMCS, only one local search is enough at level 0, thereby its overall computational cost and data bandwidth burden decrease.
III. PROPOSED LSI ARCHITECTURE
Based on the algorithm proposed in Section II, we propose an area-efficient LSI architecture for low bit-rate video coding in H.263 and MPEG-4. A search range of is adopted. In this architecture, we try to reduce the number of PEs and on-chip memory size, so that the search algorithm can be implemented with a much smaller number of gates than FSBMA while keeping the degradation of coding performance to a negligible level.
A. Overall Architecture
As described in Section II, the proposed motion-estimation scheme consists of three levels, and matching processes at each level have different block sizes and search ranges (see Table II ). A systolic array processor is popular for implementing a motion estimator due to its simple and regular structure. However, the size of the systolic array is directly related to the block size and search range of the matching process. This makes it difficult to find an efficient architecture for a multi-resolution search BMA having variable block sizes and search ranges depending on each level. A straightforward architecture for a multi-resolution search scheme is to have different sized systolic array processors for the search of each level [40] , [41] . But, this kind of architecture is inefficient because it occupies a large chip area and may cause an idle state of systolic array processors due to inter-level dependency.
In order to produce a systematic and area-efficient multi-resolution architecture, we introduce the concept of a basic searching unit (BSU) that performs a full search for a 4 × 4 sub-block at all levels. Note that one BSU can be commonly utilized at all levels, because our multi-resolution scheme employs the block size of 4 × 4 as the smallest processing block size, and the search range of $[-2, 2]$ (25 search points) as the smallest search range among the three levels (see Table II). The proposed architecture consists of a BSU, an address generator, two comparators, six register arrays, and memory banks, as in Fig. 3. The register array for MB stores 4 × 4 sub-block based SADs ($SAD_{4\times4}$'s), which are obtained from the BSU operation, in order to calculate $SAD_{8\times8}$ or $SAD_{16\times16}$. Four register arrays for blocks are used for the advanced prediction mode (8 × 8 prediction mode), and a register array for MVs is used to store neighboring MVs. Memory banks provide a scheduled data flow to the BSU in order to calculate $SAD_{4\times4}$'s.
B. BSU
The number of PEs is decided by clock rate, picture size, and search range. If a 1-D systolic array is sufficient to process the data in time, it will be chosen to reduce the number of PEs; otherwise, a 2-D systolic array is necessary. The number of PEs is also restricted by the maximum clock rate. For an area-efficient hardware design, the BSU is implemented by adopting the simple 1-D systolic array processor in [3] (see Fig. 4). The minimum number of PEs required for real-time application is given as

$$N_{PE} = \left\lceil \frac{C_{MRMCS}}{f_{clk}} \right\rceil \tag{9}$$

where $C_{MRMCS}$ is the computational complexity given in (6) and $f_{clk}$ is the clock rate. Here, $\lceil x \rceil$ represents the minimum integer not smaller than $x$. For the design parameters of a pixel rate of CIF@30 Hz, the adopted search range, and an assumed maximum clock rate of 40 MHz, the required minimum number of PEs is five according to (9), including the computational complexity of the half-pel search. Meanwhile, it is desirable for the number of PEs to coincide with the dimension of the search range or the processing block. Since the search range of the unit matching process is $[-2, 2]$, i.e., five search points per row, five can be selected as the proper number of PEs. For a higher throughput or lower clock rate, a 16- or 25-PE array may be adopted as a BSU. As shown in Fig. 4, a BSU consists of five PEs, D flip-flops (DFFs), multiplexers (MUXs), and simple logic for flow control of the search-area data. The PE that calculates $SAD_{4\times4}$ is, in principle, the same as the one in other BMA hardware designs.
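The PE count quoted above can be reproduced with a rough calculation. The sketch below rests on several assumptions that are not stated in the text: each PE performs one absolute-difference accumulation per clock cycle, the integer-pel search costs 47.75 search points per pixel (the per-level counts for a level-0 range spanning 32 positions), and the half-pel search adds 8 search points per pixel.

```python
import math


def min_pes(width, height, fps, points_per_pixel, clock_hz):
    """Smallest number of PEs that sustains the workload, assuming one
    absolute-difference accumulation per PE per clock cycle."""
    ops_per_second = width * height * fps * points_per_pixel
    return math.ceil(ops_per_second / clock_hz)


# Assumed per-pixel search-point counts (see the lead-in above).
INTEGER_POINTS = (16 // 2) ** 2 / 16 + 3 * 25 / 4 + 25   # levels 2, 1, 0
HALF_PEL_POINTS = 8                                       # 8 half-pel neighbours

with_half_pel = min_pes(352, 288, 30, INTEGER_POINTS + HALF_PEL_POINTS, 40e6)
integer_only = min_pes(352, 288, 30, INTEGER_POINTS, 40e6)
```

Under these assumptions the integer-pel search alone would need four PEs, and adding the half-pel search raises the requirement to five, which also matches the five search points per row of the unit matching process.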
As mentioned above, the BSU performs every unit matching process, i.e., a full search for a 4 × 4 sub-block. Therefore, for a unit matching process, it requires 4 × 4 sub-block pixels in the current frame and 8 × 8 block pixels in the previous frame. In the BSU in Fig. 4, current block pixels $c$ are sequentially shifted into DFFs in a row-scan order at each cycle. Then, $c$ is available to each successive PE with one cycle delay. To avoid idle time of PEs at the boundary of a search area, we divide the 8 × 8 search area into two regions, $SA_L$ and $SA_R$; $SA_L$ is the left half of the 8 × 8 search area, and $SA_R$ is the right half as in [3] (see Fig. 5). Search-area pixels of $SA_L$ and $SA_R$ are also sequentially read and broadcast through two ports at each cycle in a row-scan order, $SA_L$ through one port and $SA_R$ through the other port, and one of them is selected and fed to each PE through a 2-to-1 MUX. The current block and the first four rows of $SA_L$ and $SA_R$, which are required for evaluating the five search points in the first row of a search range, are read in 16 cycles. During the next 16 cycles, the current block is provided again, and four rows of $SA_L$ and $SA_R$ from the second to the fifth row, which are required for evaluating the five search points in the next row of the search range, are also provided. In this way, to provide all the data needed for the five rows of a unit matching process, 80 cycles are required in the BSU.
In all cycles, PE0 is supplied only with $SA_L$ and PE4 is supplied only with $SA_R$. PE1, PE2, and PE3 select either $SA_L$ or $SA_R$ as in Fig. 6. Thereby, the five PEs, PE0 to PE4, select their corresponding search-area data within the first 16-cycle period, and produce the $SAD_{4\times4}$'s of the five search points in the first row of the search range at cycle 16. In the following 16-cycle periods, the rest of the $SAD_{4\times4}$'s in a search range of $[-2, 2]$ can be calculated. This search procedure is summarized in Fig. 7, and the corresponding MUX control signals generated from the MUX control logic are plotted in Fig. 4.
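A functional model can make the data flow of one unit matching process concrete. The sketch below is not cycle accurate: it ignores the one-cycle skew between PEs and simply assumes that PE p evaluates the p-th horizontal search position, reading from the left half of the 8 × 8 search area when the required column index is below 4 and from the right half otherwise. That selection rule is an inference from the description, and all names are hypothetical.

```python
def bsu_unit_match(cur4, area8):
    """Model one unit matching process: a 4 x 4 current sub-block against
    an 8 x 8 search area, i.e. 25 search points in [-2, 2] x [-2, 2].
    Returns (sads, cycles, halves_used_by_each_pe)."""
    sads = {}
    halves = {p: set() for p in range(5)}
    cycles = 0
    for row in range(5):                  # one 16-cycle pass per search row
        acc = [0] * 5                     # one accumulator per PE
        for t in range(16):               # one current pixel per cycle
            j, i = divmod(t, 4)           # row-scan order inside the block
            for p in range(5):            # PE p handles horizontal offset p
                col = i + p               # column needed in the search area
                halves[p].add("L" if col < 4 else "R")
                acc[p] += abs(cur4[j][i] - area8[row + j][col])
        cycles += 16
        for p in range(5):
            sads[(p - 2, row - 2)] = acc[p]
    return sads, cycles, halves


# Hypothetical data: the current sub-block is cut out of the search area
# at displacement (1, -1), so that search point must give a zero SAD.
area8 = [[(7 * x + 13 * y * y + x * y) % 256 for x in range(8)]
         for y in range(8)]
cur4 = [[area8[1 + j][3 + i] for i in range(4)] for j in range(4)]
sads, cycles, halves = bsu_unit_match(cur4, area8)
```

In this model PE0 only ever needs left-half columns and PE4 only right-half columns, while PE1 to PE3 need both, which is consistent with the MUX arrangement described above.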
C. SAD Calculation
Since the BSU calculates 25 $SAD_{4\times4}$'s for a search range of $[-2, 2]$, it can be directly applied to the matching process of level 2, which requires $SAD_{4\times4}$'s. However, matching processes at level 1 and level 0 are based on $SAD_{8\times8}$'s and $SAD_{16\times16}$'s, respectively. Therefore, to obtain a $SAD_{8\times8}$, its corresponding four $SAD_{4\times4}$'s are to be added together as follows:

$$SAD_{8\times8}(x, y) = \sum_{i=0}^{3} SAD_{4\times4}^{i}(x, y) \tag{10}$$

where $SAD_{4\times4}^{i}$ is the SAD of the $i$-th 4 × 4 sub-block. Similarly, $SAD_{16\times16}$'s are obtained by adding 16 $SAD_{4\times4}$'s as follows:

$$SAD_{16\times16}(x, y) = \sum_{i=0}^{15} SAD_{4\times4}^{i}(x, y) \tag{11}$$

A register array for MB and an adder in Fig. 3 are used for the addition operations above. During the unit matching process, 25 $SAD_{4\times4}$'s obtained from the BSU are stored sequentially in the register array. During the next unit matching process, each of the 25 newly obtained $SAD_{4\times4}$'s is added to the previous SAD. Then, the newly accumulated SADs are stored again in the register array. Through this operation, we can obtain $SAD_{8\times8}$'s in every 4 unit matching processes, and $SAD_{16\times16}$'s in every 16 unit matching processes.
D. BSU Processing at Each Level
1) Level 2: Since the matching process at level 2 is performed on a 4 × 4 block, the processing block size is the same as that of the unit matching process performed by the BSU. Therefore, $SAD_{4\times4}$'s obtained from the BSU can be compared with each other directly to find a minimum SAD without using a register array. However, the search range at level 2 is larger than the $[-2, 2]$ range of the unit matching process. So, this search range is covered with four search ranges of $[-2, 2]$ that are centered at $(-2, -2)$, $(2, -2)$, $(-2, 2)$, and $(2, 2)$, respectively, and thereby requires four successive unit matching processes. Then, the locations corresponding to the first and second smallest SADs are selected as two out of the three search centers required at level 1. Since the matching process needs four unit matching (BSU) processes, 320 cycles are required at level 2.
2) Level 1: Three matching processes are performed for 8 × 8 blocks with a search range of $[-2, 2]$, because level 1 has three MV candidates: two candidates obtained from level 2, and the other one obtained by the correlation of neighboring MVs. One matching process consists of four unit matching processes. In other words, an 8 × 8 block is divided into four 4 × 4 sub-blocks, and then the matching processes for each sub-block are performed sequentially using a BSU. "Register array for MB" accumulates $SAD_{4\times4}$'s to obtain a $SAD_{8\times8}$. All the $SAD_{8\times8}$'s obtained from the three local searches are compared and one location corresponding to the minimum SAD is selected as a search center at level 0. Since the matching process consists of 12 unit matching processes, 960 cycles are needed at level 1.
3) Level 0: The matching process at level 0 is performed on a 16 × 16 MB with a search range of $[-2, 2]$ around the search center. Since a MB can be divided into 16 sub-blocks for BSU processing as in Fig. 8, the matching process consists of 16 unit matching processes. "Register array for MB" is also used to obtain $SAD_{16\times16}$'s. All the $SAD_{16\times16}$'s are compared, and then a location corresponding to the minimum SAD is selected as the final MV. Since the matching process consists of 16 unit matching processes, 1280 cycles are required at level 0.
E. Advanced Prediction Mode (8 × 8 Prediction Mode)
In the advanced prediction mode in MPEG-4, every 8 × 8 block motion vector ($MV_8$) in a MB is to be obtained [42], and a pixel refinement around the final 16 × 16 MB motion vector ($MV_{16}$) is exemplified for obtaining an $MV_8$ for each block [42], [43]. Due to the matching strategy and hierarchical multi-resolution frame structure of the adopted algorithm, the proposed architecture can efficiently realize the advanced prediction mode in a different manner from [42], [43]. In our scheme, instead of refining the $MV_8$ after the final $MV_{16}$ is obtained, refinement is performed during the search of $MV_{16}$ in level 0 so that an $MV_{16}$ and four $MV_8$'s are obtained simultaneously. In level 0, we first assign a processing order for 16 sub-blocks as in Fig. 8, and perform matching processes for each sub-block in this order. Next, 16 $SAD_{4\times4}$'s are accumulated together to obtain $SAD_{16\times16}$'s for a MB. Concurrently, to obtain $SAD_{8\times8}$'s for each 8 × 8 block, every four $SAD_{4\times4}$'s are accumulated in each 'register array for block' separately. Finally, an $MV_{16}$ is obtained by comparing $SAD_{16\times16}$'s, and four $MV_8$'s are obtained by comparing $SAD_{8\times8}$'s for each block. These operations can be described as the following equations:

$$SAD_{8\times8}^{j}(x, y) = \sum_{i=4j}^{4j+3} SAD_{4\times4}^{i}(x, y), \quad j = 0, 1, 2, 3 \tag{12}$$

$$SAD_{16\times16}(x, y) = \sum_{j=0}^{3} SAD_{8\times8}^{j}(x, y) \tag{13}$$

$$MV_8^{j} = \arg\min_{(x,y)} SAD_{8\times8}^{j}(x, y), \qquad MV_{16} = \arg\min_{(x,y)} SAD_{16\times16}(x, y) \tag{14}$$

where $SAD_{4\times4}^{i}$ and $SAD_{8\times8}^{j}$ denote the SAD for the $i$-th sub-block and that for the $j$-th block, respectively, and $MV_8^{j}$ denotes the MV for the $j$-th block. It should be noticed in (12) that 16 $SAD_{4\times4}$'s are calculated in the processing order, and every four $SAD_{4\times4}$'s are added together for $SAD_{8\times8}$'s. This approach for the advanced prediction mode provides the following two advantages.
1) There is no additional cycle to calculate $SAD_{8\times8}$'s. Hence, the throughput increases.
2) The same search-area data are used for both $MV_{16}$ and $MV_8$. Since there is no additional data access, memory bandwidth is lowered.
The performances of the advanced prediction based on the proposed architecture and the existing one in [42], [43] are compared for various sequences in terms of PSNR. As a result, it was found that their difference is negligible.
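The simultaneous derivation of the macroblock MV and the four block MVs can be sketched as below. The sketch assumes that the processing order visits the four sub-blocks of one 8 × 8 block consecutively, so that sub-block i belongs to block i // 4; the actual order is defined by Fig. 8, which is not reproduced here. Each sub-block contributes a dictionary mapping a search point to its 4 × 4 SAD, and all names are hypothetical.

```python
def accumulate_mvs(sub_block_sads):
    """From 16 per-sub-block SAD tables, derive the 16 x 16 MV and the
    four 8 x 8 MVs in a single pass over the data."""
    mb_sad = {}                                # register array for MB
    block_sad = [{} for _ in range(4)]         # register arrays for blocks
    for i, table in enumerate(sub_block_sads):
        block = block_sad[i // 4]              # assumed processing order
        for mv, value in table.items():
            mb_sad[mv] = mb_sad.get(mv, 0) + value
            block[mv] = block.get(mv, 0) + value
    mv16 = min(mb_sad, key=mb_sad.get)
    mv8 = [min(b, key=b.get) for b in block_sad]
    return mv16, mv8


# Hypothetical SADs for two search points: the first block prefers (0, 0),
# the other three blocks prefer (1, 0).
tables = ([{(0, 0): 1, (1, 0): 5}] * 4) + ([{(0, 0): 4, (1, 0): 3}] * 12)
mv16, mv8 = accumulate_mvs(tables)
```

Because both accumulations consume the same 4 × 4 SADs as they arrive, the block MVs need no extra matching cycles and no extra search-area reads, which is the point of the two advantages listed above.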
F. Half Pel Search
After the final integer-pel MV is determined in Section III-D or Section III-E, a half-pel search is to be performed around it. As at level 0, we divide the MB into 16 sub-blocks for BSU processing. Then, for each sub-block processing, 6 × 6 pixels are needed to generate an interpolated image (see Fig. 9). The figure shows three kinds of interpolated pixels: the 4-pixel average and the vertical and horizontal 2-pixel averages. These three values are calculated simultaneously by an interpolator, which consists of eight registers and three adders as in Fig. 10. Then, they are separately stored in on-chip memories, and used in the BSU. Four PEs are used for a half-pel search so that each PE can calculate the SADs of two search points (see Fig. 11). In total, 512 cycles are needed for a half-pel search.
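The cycle counts given for the three levels and for the half-pel search can be combined into a per-macroblock budget. The sketch below assumes 80 cycles per unit matching process, the 4, 12, and 16 unit matching processes stated for levels 2, 1, and 0, and 512 cycles for the half-pel search; idle cycles between levels are ignored, so the result is a lower bound on the required clock rate.

```python
CYCLES_PER_UNIT = 80
UNIT_PROCESSES = {"level 2": 4, "level 1": 12, "level 0": 16}
HALF_PEL_CYCLES = 512


def cycles_per_mb():
    """Busy cycles needed to process one macroblock."""
    integer = sum(UNIT_PROCESSES.values()) * CYCLES_PER_UNIT
    return integer + HALF_PEL_CYCLES


def required_clock_hz(width, height, fps):
    """Clock rate needed to sustain the given picture size and frame rate."""
    mbs_per_frame = (width // 16) * (height // 16)
    return mbs_per_frame * fps * cycles_per_mb()


busy = cycles_per_mb()                    # 320 + 960 + 1280 + 512
cif = required_clock_hz(352, 288, 30)     # about 36.5 MHz
```

For CIF at 30 Hz this gives roughly 36.5 MHz of busy cycles, which leaves some margin below a 40 MHz clock.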
G. On-Chip Memory Organization
During the matching process, most pixels are used several times to evaluate different candidate locations. To avoid high bandwidth requirements for memory systems, on-chip memories are necessary for motion estimators. As is well known, there is a tradeoff between the memory bandwidth and the size of on-chip memory. However, if the memory bandwidth is acceptable, a smaller sized on-chip memory is preferred. In our design, the minimal required memory size corresponds to the memory size for a unit matching process. This size is mainly determined by $D_{unit}$, which is the data size for a unit matching process, or the sum of 16 bytes for a current sub-block and 64 bytes for its corresponding search area. The memory size becomes 288 bytes by considering double buffering and half-pel memories. For this minimum size, the required memory bandwidth is given as

$$BW = D_{unit} \cdot N_{unit} \cdot N_{MB} \cdot F \quad \text{bytes/s} \tag{15}$$

where $N_{MB}$ and $N_{unit}$ denote the number of MBs per frame and the number of unit matching processes per MB, respectively, and $F$ is the frame rate. For CIF@30 Hz, $BW$ becomes about 30 Mbytes/s in the proposed architecture, which is considered acceptable. Therefore, we adopt the minimum on-chip memory size as mentioned above. It should be noted that, in the case of FSBMA or fast algorithms like TSS, the on-chip memory cannot be reduced to a size similar to the proposed one with an acceptable memory bandwidth [44].
Fig. 12 depicts the on-chip memory configuration for the proposed architecture. During the unit matching process for integer-pel search, 16 pixels of a 4 × 4 current sub-block and 64 pixels of the 8 × 8 search area are read sequentially in the row-scan order and stored in on-chip memories. Therefore, three random access memories are required: one 16-byte memory for current block data and two 32-byte memories for the search-area data, i.e., the left and right halves of the search area. During the first 16 cycles, current block data are stored in the row-scan order. Then, in the next 64 cycles, the two halves of the search area are stored row by row into their own memories for efficient memory access.
Finally, the current block data and the two search-area halves are loaded simultaneously for the matching process. However, during the unit matching process for half-pel search, the interpolated data are stored separately into four on-chip memories. Therefore, two more 32-byte memories are needed for storing the interpolated image. In general, double buffering is employed for simultaneous I/O and computation. In our system, a pair of memories is adopted for double buffering. Therefore, the total on-chip memory size is 288 bytes for both integer and half-pel search considering double buffering.
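The memory figures above can be cross-checked with a few lines of arithmetic. The sketch assumes one 16-byte current-block memory, two 32-byte search-area memories, two additional 32-byte memories for half-pel data, and double buffering of the whole set. For the bandwidth it assumes 80 bytes per unit matching process and 32 unit matching processes per MB (4 + 12 + 16 for the three levels), with half-pel traffic left out; that last point is an assumption of this sketch.

```python
def onchip_memory_bytes():
    """Total on-chip memory with double buffering."""
    current_block = 16
    search_area = 2 * 32          # left and right halves
    half_pel_extra = 2 * 32       # two more memories for interpolated data
    return 2 * (current_block + search_area + half_pel_extra)


def memory_bandwidth(width, height, fps, units_per_mb=32, bytes_per_unit=80):
    """External memory traffic in bytes per second."""
    mbs_per_frame = (width // 16) * (height // 16)
    return bytes_per_unit * units_per_mb * mbs_per_frame * fps


size = onchip_memory_bytes()                  # 288 bytes
bandwidth = memory_bandwidth(352, 288, 30)    # about 30 Mbytes/s for CIF@30 Hz
```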
H. Scalability of a Search Range
Although the proposed architecture covers a search range of , a larger search range can be obtained by simply increasing the search range of level 2. For example, by using a search range of at level 2, the search range can simply be increased to . However, to increase the search range further, the 4-level MRMCS becomes more effective for high-performance fast search rather than the 3-level MRMCS.
I. Implementation Results
The proposed architecture for MRMCS is compared with other architectures for FSBMA and 3SS in Table III. The table shows that the proposed architecture has a smaller number of PEs and input data pins than the other architectures. Although PEs are idle between levels due to the inter-level dependency of a hierarchical algorithm, the idle time is insignificant compared with the total execution time, and the PE utilization is 96.97%. The architecture has small on-chip memories of 288 bytes. It is noted that the throughput of the proposed architecture is not very high because it is aimed at an area-efficient design for low bit-rate coding. The architecture can be expanded for applications requiring higher throughput by replacing the 5-PE BSU with a 16- or 25-PE BSU at the expense of additional gates. If a lower memory bandwidth is required, the memory bandwidth requirement of the architecture can also be reduced to 1024 at the expense of a larger size on-chip memory. 3SS seems to be the best scheme for hardware implementation, but its implementation is usually limited to a small search range, since a larger search range may not be meaningful due to performance degradation.
The proposed architecture is simulated and synthesized by using VHDL. It is found that the architecture needs an area equivalent to a gate count of about 25K with 288 bytes of on-chip memories. In the architecture for FSBMA, the data storage and control logic occupy a small portion of the total gate count since PEs occupy most of the total count. However, in the proposed architecture, data storage and data flow control logic occupy about 90% of the total gate count because of the very small number of PEs. It should be noted that the total gate count includes the additional data storage for block-based SADs for the advanced prediction mode, which is about 20% of the total, as well as the PEs and the control and address generation circuitry overheads. Even with a larger search range and better searching performance, the total gate count of the proposed architecture is comparable to that of 3SS (see Table III). The clock rate of 40 MHz is sufficient for real-time application for CIF@30 Hz.
IV. CONCLUSIONS
This paper presents a novel multi-resolution motion estimation algorithm, MRMCS, that efficiently uses multiple MV candidates and spatial correlation in MV fields for fast search. The MRMCS demonstrates its superiority by providing not only robust PSNR performance close to that of the FSBMA, but also a regular and simple search scheme suitable for LSI implementation. We also provide the LSI architecture with a search range of for the proposed algorithm, by applying it to a low-power video encoder for H.263 or MPEG-4 simple profile. For area-efficient implementation, the architecture iterates the process on a single BSU to reduce the gate number, and uses a small size on-chip memory. The architecture also supports the advanced prediction mode (8 8 prediction mode) for H.263 and MPEG-4. The proposed LSI architecture is implemented with a smaller number of equivalent gates (25K with 288 bytes of RAM) using the synthesizable VHDL.
