Abstract-An efficient feature matching architecture targets at real-time video stabilization is revealed in this paper. For some applications, such as vehicular application, real-time video stabilization is needed to provide instant stable video input. However, feature matching is usually the bottleneck to achieve high performance. High speed feature matching architecture is proposed to accelerate the performance of video stabilization. Locality sensitive hashing (LSH) helps us realize the feature matching procedure in hardware implementation. By applying the proposed dynamic table allocation and on-chip cache mechanism, this work achieves 422K queries/s and real-time feature matching with 90% in memory reduction and more than 50% in relieving the feature bus burden of the system.
I. INTRODUCTION
Video camera becomes smaller and eventually portable in many applications. As a result, more smart video sensors will be placed around to bring about a more convenient life. However, the captured video may be blurred and shaky owing to unstable camera platforms. This not only results in uncomfortable watching experience but also introduces severe artifact to back-end applications. Therefore, many real-time applications need on-the-fly video stabilization to provide steady video inputs.
The basic idea of video stabilization is to correctly estimate the camera movement and to determine which part of the movement is undesired. Existing video stabilization algorithms are roughly classified into three major categories, specialimage-structure-based, block-motion-based, and feature-based methods:
• Special-image-structure-based method. Utilizing the image structure or prior knowledge to fix the unstable video. This kind of idea has been commonly adopted in vehicular applications. Liang [1] proposed a stabilization mechanism based on fixing the position and slope of lane lines. However, the requirement for image clearance for traits makes this method hard to maintain robust performance under various scenarios.
• Block-motion-based method. This category is the most widely employed approach. Through dividing the image into blocks and calculating the block movements across consecutive frames, the image motion can be estimated [2] , [3] . Generally speaking, large number of block motions guarantees a consistent stabilization result, but it usually implies timeconsuming computation.
• Feature-based method. For reducing the weighty computa- tion for calculating all block motion vectors, recent researches focus on interested points that are regarded as some corners or features. Feature motion vectors are calculated for global motion estimation [4] , [5] . Feature-based methods guarantee a more reliable solution for video stabilization.
However, even with reduced interested feature points, current feature-based methods still can not meet real-time requirement through CPU implementation, especially for a high resolution video input. The bottleneck is the feature matching stage. In order to real-time video stabilization can be achieved, an efficient feature matching engine is required. Previous literature tries to adopt a processor or a RISC to accelerate the matching procedure [6] , but the result shows that it can only support small number of features. In this paper, a high speed feature matching architecture is proposed. Firstly, optimized hardware for locality sensitive hashing (LSH) [7] reduces the computational complexity for matching features. Secondly, applying on-chip cache to minimize the bus usage. The result shows that our system provides fast feature matching flow and further assist real-time video stabilization under high video resolution.
This paper is organized as follows. The basic feature-based video stabilization algorithm is described in Sec. II. Feature matching method is deeply discussed in Sec. III Sec. IV reveals the proposed architecture. Architecture analysis is presented in Sec. V and Sec. VI concludes this work.
II. FEATURE-BASED VIDEO STABILIZATION ALGORITHM
Based on the previous work [8] , Fig. 1 illustrates the block diagram of the feature-based video stabilization system. Input video is first processed to extract those potentially interested feature points. Each feature is represented by a feature description. Secondly, we find the corresponding position for every extracted feature by probing the feature pool. The matched pairs are collected to estimate the camera motion and to compensate the unwanted camera movement.
From the timing analysis illustrated in Fig. 2 , we observe that more than half of computation time is spent on feature extraction and feature matching. In the feature extraction and matching step, matching features is much more difficult when the feature pool is too large for a feature to find its corresponding point. This causes frequent memory access to the off-chip memory. Thus, we present a searching scheme for achieving efficient matching in a large feature pool.
III. FEATURE MATCHING VIA LSH Most techniques adopt traversing the tree-structure feature pool such as KD-tree to search for the most possible feature candidates. However, when the feature dimension is above about 20, the tree based search methods do not yield any performance improvement over brute force methods [9] , as in the case of our algorithm, a 64-dimension SURF descriptor.
Approximate methods, such as LSH, can dramatically reduce the computational complexity. The idea of LSH is shown in Fig. 3 . Traditional hashing tries to spread out the features into buckets randomly to achieve that each bucket contains just a few features. In LSH, the basic motivation is to spread out in the buckets while preserving the spatial relationships between features. Through utilizing the locality information, we can search the corresponding feature in the feature pool by examining all its neighboring buckets.
In order to maintain the spatial information, given a feature vector V , the hashing function h a,b is depicted in Eq. 1:
where a is a n dimensional vector with entries chosen independently from Stable distribution. The parameter b is a real number chosen uniformly from the range [0, r], in which r represents the maximum value of a single hash index. The h a,b maps a n dimensional vector V onto a set of integers. Through selecting k sets of hashing functions, we can combine these single hash indexes into a final LSH index function g i as shown in Eq. 2
The resulting index function guarantees the probability of finding the most similar feature in the neighboring indexes is higher than ρ [7] . Fig. 4 indicates the way of inserting features into a LSH table. There are multiple LSH tables for feature insertion. Each LSH table bas its unique g i , which helps to vary the insertion rules to achieve better performance. Generally speaking, more LSH tables would lead to higher accuracy of feature matching but more processing time and memory requirement. The feature searching process is shown in Fig. 5 . After determining the hashing index, we search the corresponding bucket and its neighboring buckets.
IV. PROPOSED ARCHITECTURE
The proposed architecture for feature matching is shown in Fig. 6 . For matching features, the LSH index is assigned through index computation. It determine which addresses should be probed to find all possible feature candidates by fetching from the on-chip cache and the feature bus. The final matched result is outputted after sequentially comparing their feature descriptors. After finish feature matching, we store the Cache Table 3  Table 3 LSH Table 1 Index Computation newly collected feature into the LSH tables. Through the index calculation procedure, features can be inserted into the feature pool and waited to be compared by the features extracted from the next frame. However, this process requires large memory space to construct LSH tables. Besides, limited bus bandwidth degrades the system performance while fetching the off-chip feature pool. Two major schemes, dynamic table allocation and cachebased feature searching, are brought up to solve the problems. The memory usage is significantly reduced and the processing time is largely accelerated. The following subsections describe the detailed architecture and operation.
LSH

A. Dynamic Table Allocation
In order to avoid huge memory allocation for constructing hash tables, we adopt the dynamic memory allocation in our architecture design. We cut the data space in each bucket into small pieces which are gathered together and controlled by a dynamic table distributor (DT distributor). The dynamic table structure is revealed in Fig. 7 . The LSH table is composed of a DT distributor, a bucket, and a DT.
Each calculated LSH table index points to a bucket index. The bucket contains two information, the number of data stored in this bucket and the first memory address where we can find the feature data in DT. Assume that a new feature is going to be inserted into a certain bucket index, the bucket check whether there is enough space or not. If there is free space, the feature is inserted according to the calculated bucket index. Otherwise, the feature will be inserted after additional bucket memory allocated by the DT distributor. When the DT distributor receives a request, an empty DT slot would be released to the bucket. A DT slot describes N number of feature index data. The last part of the slot indicates the next DT slot if there is any descendant.
B. Cache-based Feature Matching
Owing to the bandwidth of the system bus, we need at least 18 cycles to complete a feature transmission to or from the offchip memory. For each feature matching procedure, it might involve hundreds of features loading from off-chip memory and there are nearly a thousand features waiting to run the matching procedure. Fortunately, there is a high probability that different features may retrieve the same feature candidates from the feature pool. In order to prevent loading frequentlymatched features several times from off-chip memory, we propose cache-based feature matching to reduce multiple offchip access and relieve the burden of feature bus.
We adopt a 2-way set-associative cache to avoid retrieving redundant features. Each time we try to find the matched pair of a feature, the cache is searched first, the feature information, which is the feature descriptor, will be obtained in 2 cycles. This is much less than the cycles needed to access feature information from the off-chip memory. Increasing the hit rate can avoid the time needed to access the outside memory.
V. ANALYSIS AND SYNTHESIS RESULT
A. Design Analysis
The proposed architecture is synthesized and analyzed in a 90nm CMOS process. We set the specification of the targeted video input is a 1280 × 960 video stream with 30 frames per second. Approximately 1000 features are extracted for each frame and stored in the feature pool. We refresh contents in the feature pool and match features every frame. The SURF feature adopted here needs 544 bits to represent its 64-dimension descriptor. In addition, the bandwidth of the feature bus is set to reasonable 32 bits.
There are two significant factors that determine the system performance. The first one is the length N of a slot in the dynamic table. The second one is the cache size S. Large cache size relieves the bus loading and compensates the penalty incurred by the cache miss. We examine the factors and decide the proper setting for the proposed architecture.
• Dynamic table length, N . Larger N means there are more data can be stored in one DT slot. However, if most of the buckets are empty or contain only a few data, large N would lead to excessive memory allocation as shown in Fig. 8(a) . In our profiling, the memory usage meets the minimum when N = 2. Fig. 8(b) depicts that with increasing N , feature probing time can be largely reduced. Considering that we want to reduce the table memory, N = 2 is set in our architecture.
• Cache size, S. In order to avoid penalty due to cache miss, larger cache memory size would lead to higher hit rate as shown in Fig. 9(a) . Without frequently accessing the off-chip memory, the total processing time to match all features is reduced as depicted in Fig. 9(b) . Choosing large cache size remarkably accelerate the system performance, but memory usage comes as the cost. We set the baseline that the hit rate should be higher than 50% to reduce the potential cache miss punishment. Therefore, 64 entries 2-way set-associative cache is selected.
B. Synthesis Result
Combined with the two techniques mentioned above, the proposed architecture provides a high efficiency feature matching engine, which is the most critical part in the video stabilization system. Synthesis result reveals that the proposed architecture supports 422K feature queries per second. Compared with state-of-the-art feature matching architecture [10] , the proposed feature matching architecture shows better performance as described in Table I . The proposed architecture supports more hash functions and hash tables with less onchip memory. The matching performance is approximately 10x higher. With fast feature matching engine, the feature motion vectors can be calculated much faster and leads to overall system enhancement.
VI. CONCLUSION
A high performance feature matching architecture is proposed to accelerate the feature-based video stabilization system. Locality sensitive hashing is utilized to realize the feature matching procedure in hardware implementation. To cope with redundant memory usage and insufficient bandwidth for feature bus, dynamic hash table allocation and cache-based feature matching are employed to enhance system performance. By applying dynamic table allocation, the memory consumption is reduced by more than 90% compared to allocate full size of hash tables. 2-way set-associative cache is adopted to reduce frequent feature fetching from the off-chip memory. More than 50% reduction of the bandwidth requirement of feature bus is achieved. Through the proposed architecture, 422K feature queries/s allows the video stabilization to achieve real-time assistance even with high resolution video input.
