Simultaneous Localization and Mapping (SLAM) is a critical task for autonomous navigation. However, due to the computational complexity of SLAM algorithms, it is very difficult to achieve real-time implementation on low-power platforms. We propose an energy-efficient architecture for a real-time ORB (Oriented-FAST and Rotated-BRIEF) based visual SLAM system that accelerates the most time-consuming stages, feature extraction and matching, on an FPGA platform. Moreover, the original ORB descriptor pattern is reformulated in a rotationally symmetric manner, which is much more hardware-friendly. Optimizations including rescheduling and parallelizing are further utilized to improve the throughput and reduce the memory footprint. Compared with Intel i7 and ARM Cortex-A9 CPUs on the TUM dataset, our FPGA realization achieves up to 3× and 31× frame rate improvement, as well as up to 71× and 25× energy efficiency improvement, respectively.
optical flow method or direct method. Among feature-based approaches, ORB (Oriented-FAST and Rotated-BRIEF) [8] is the most widely adopted feature because of its high efficiency and robustness. However, the high computational intensity of feature extraction and matching makes it very challenging to run ORB-based visual SLAM on low-power embedded platforms, such as drones and mobile robots, for real-time applications.
Several prior efforts have been made to accelerate visual SLAM on low-power platforms, but no fully integrated ORB-based visual SLAM has been proposed on such platforms so far. Feature matching and ORB extraction are accelerated on FPGA for visual SLAM systems in [2] and [4], respectively. A SIFT-feature based SLAM is implemented on FPGA in [6], where only the matrix computation is accelerated and the most time-consuming part, feature extraction, is not involved. An optical-flow based visual inertial odometry is implemented on ASIC in [11]. It is less computationally intensive, but it may fail in scenarios with varying illumination or large motions/displacements, because the basic assumptions of the optical flow method are invalid in these scenarios [5].
In this paper, eSLAM is proposed as a heterogeneous architecture for an ORB-based visual SLAM system. The most time-consuming procedures, feature extraction and matching, are accelerated on FPGA, while the remaining tasks, including pose estimation, pose optimization and map updating, are performed on the host ARM processor. The main contributions of this paper are as follows:
• A novel ORB-based visual SLAM accelerator is proposed for real-time applications on energy-efficient FPGA platforms.
• A rotationally symmetric ORB descriptor pattern is utilized to make our algorithm much more hardware-friendly.
• Optimizations including rescheduling and parallelizing are further exploited to improve the computation throughput.
The remainder of this paper is organized as follows. Section 2 presents the ORB-based visual SLAM framework and the introduced rotationally symmetric descriptor. Section 3 illustrates the detailed architecture of eSLAM. Experimental results are evaluated for the proposed eSLAM in Section 4. Concluding remarks are given in Section 5.
ORB-BASED VISUAL SLAM SYSTEM
ORB-SLAM Framework
The ORB-based visual SLAM system takes RGB-D (RGB and depth) images for mapping and localization. Its framework, as shown in Figure 1 , consists of five main procedures: feature extraction, feature matching, pose estimation, pose optimization and map updating. In this work, feature extraction and feature matching are accelerated on FPGA, and remaining tasks are performed on ARM processor.
Feature Extraction: In this function, ORB features are extracted from the input RGB images. ORB is a very efficient and robust combination of FAST (Features from Accelerated Segment Test) keypoint and BRIEF (Binary Robust Independent Elementary Features) [1] descriptor. It calculates orientations of every feature and rotates the descriptor pattern accordingly to make the features rotationally invariant. And to obtain scale invariance, a 4-layer pyramid is generated from the original image. Aiming to implement ORB algorithm on hardware efficiently, a hardware-friendly, rotationally symmetric BRIEF descriptor pattern is proposed in this work and illustrated in Section 2.2.
Feature Matching: In feature matching, each feature detected in the current frame is matched with a 3D map point in the global map according to the distance between their BRIEF descriptors. Since BRIEF descriptors are binary strings, the distance between two descriptors is measured by their Hamming distance.
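As an illustration of the matching criterion, the following minimal sketch stores each descriptor as a Python integer and picks the map descriptor with the smallest Hamming distance. The function names and the 8-bit toy descriptors are assumptions made for this example; they are not taken from the eSLAM implementation.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two descriptors stored as integers."""
    return bin(a ^ b).count("1")


def match(descriptor: int, map_descriptors: list[int]) -> tuple[int, int]:
    """Return (index, distance) of the map descriptor closest to `descriptor`."""
    distances = [hamming(descriptor, m) for m in map_descriptors]
    best = min(range(len(distances)), key=distances.__getitem__)
    return best, distances[best]


# Toy 8-bit descriptors (hypothetical values, chosen only for readability).
map_points = [0b11110000, 0b10101010, 0b00001111]
print(match(0b10101110, map_points))
```

A brute-force search like this is linear in the number of map points per feature, which is one reason the matching stage is a natural candidate for hardware acceleration.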
Pose Estimation: We apply PnP (Perspective-n-Points) method to the matched feature pairs to estimate the translation and the rotation of the camera. RANSAC (Random Sample Consensus) is used to eliminate the mismatches.
Pose Optimization: In this function, the camera pose estimated by PnP is optimized by minimizing the reprojection error of the observed map points. Assume that the pixel coordinates of the features in the current frame are $(c_1, c_2, \ldots, c_n)$, the positions of the matched map points are $(g_1, g_2, \ldots, g_n)$, the pose of the camera is $p$, and $h(g_i, p)$ is the pixel coordinate of $g_i$ when it is projected to the current frame. The reprojection error $E$ can then be defined as:

$$E = \sum_{i=1}^{n} \left\| c_i - h(g_i, p) \right\|^2$$
Levenberg-Marquardt method [7] is applied iteratively to minimize E while adjusting the camera pose p.
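The quantity being minimized can be sketched as follows, assuming that E is the sum of squared pixel distances between each observed feature c_i and the projection h(g_i, p). The pinhole camera model, the representation of the pose as a rotation matrix R plus a translation t, and the intrinsic parameters fx, fy, cx, cy are assumptions of this example; the paper does not spell out its parameterization.

```python
def project(g, pose, fx, fy, cx, cy):
    """Assumed pinhole model for h(g, p): 3D map point -> pixel coordinate."""
    R, t = pose
    # Transform the map point into the camera frame: X_c = R * g + t.
    xc, yc, zc = (sum(R[r][k] * g[k] for k in range(3)) + t[r] for r in range(3))
    return fx * xc / zc + cx, fy * yc / zc + cy


def reprojection_error(pixels, points, pose, intrinsics):
    """Sum of squared distances between observed and projected pixels."""
    fx, fy, cx, cy = intrinsics
    error = 0.0
    for c, g in zip(pixels, points):
        u, v = project(g, pose, fx, fy, cx, cy)
        error += (c[0] - u) ** 2 + (c[1] - v) ** 2
    return error


# Hypothetical setup: identity pose, one map point 2 m in front of the camera.
identity_pose = ([[1, 0, 0], [0, 1, 0], [0, 0, 1]], [0, 0, 0])
intrinsics = (500.0, 500.0, 320.0, 240.0)
print(reprojection_error([(323.0, 244.0)], [(0.0, 0.0, 2.0)], identity_pose, intrinsics))
```

An optimizer such as Levenberg-Marquardt would repeatedly evaluate this error (and its derivatives) while adjusting the pose.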
Map Updating: Map updating is only executed in key frames. Key frames are a set of frames where the translation or rotation of the camera is larger than a threshold. When a key frame is detected, the 3D map points in the key frame are added to the global map, and the map points that have not been matched for a long period of time are deleted from the global map to prevent it from becoming too large. 
Rotationally Symmetric BRIEF
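To compute the BRIEF descriptor of a feature, 2 sets of locations in the neighborhood around the feature, $L_S$ and $L_D$, are tested against each other: each bit of the descriptor is obtained by comparing the intensity at a location in $L_S$ with the intensity at the corresponding location in $L_D$. Originally, $L_S$ and $L_D$ are randomly selected in the neighborhood according to a Gaussian distribution, and every location has to be rotated by the feature orientation $\theta$ using the following formula:

$$x' = x\cos\theta - y\sin\theta, \qquad y' = x\sin\theta + y\cos\theta$$

where $(x, y)$ refers to the initial location and $(x', y')$ refers to the location after rotation. Since 512 locations are required to be rotated in order to compute the descriptor of each feature, the rotation procedure is quite compute-intensive.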
To reduce the computation cost of rotation procedure, a popular approach is to pre-compute the rotated BRIEF patterns [8] instead of computing them directly each time. In this approach, the orientation of features is discretized into 30 different values, i.e., 12 degrees, 24 degrees, 36 degrees, etc. Then 30 BRIEF patterns after rotation are pre-computed and built as a lookup table. The lookup table is utilized to obtain the descriptors when necessary so that the computation cost could be reduced significantly.
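The pre-computation approach can be sketched as below: the pattern is rotated once for each of the 30 discretized orientations, and at run time the nearest 12-degree bin selects the pre-rotated pattern. The function names and the two-point toy pattern are assumptions for illustration only; a real pattern would contain all test locations.

```python
import math

NUM_BINS = 30               # discretized orientations
BIN_DEG = 360 / NUM_BINS    # 12 degrees per bin


def rotate(point, deg):
    """Standard 2D rotation of (x, y) about the origin by `deg` degrees."""
    a = math.radians(deg)
    x, y = point
    return (x * math.cos(a) - y * math.sin(a), x * math.sin(a) + y * math.cos(a))


def build_lut(pattern):
    """Pre-compute the rotated pattern for every orientation bin."""
    return [[rotate(p, k * BIN_DEG) for p in pattern] for k in range(NUM_BINS)]


def lookup(lut, orientation_deg):
    """Pick the pre-rotated pattern whose bin is nearest to the orientation."""
    k = round(orientation_deg / BIN_DEG) % NUM_BINS
    return lut[k]


# Toy pattern with two locations (hypothetical values).
lut = build_lut([(15.0, 0.0), (0.0, 7.0)])
rotated = lookup(lut, 100.0)   # nearest bin is 96 degrees (k = 8)
```

The table holds 30 copies of the full pattern, which is the storage cost that becomes a concern for hardware implementations.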
One drawback of the above approach is the degradation in accuracy. Because the orientation of features is discretized, there will be a deviation from the true value which is up to 6 degrees (half of 12 degrees). However, considering that the test locations are selected from a circular patch with a radius of 15 pixels, the maximum error of a test location is about 1 pixel on the smoothened image. Hence, the influence on the accuracy is almost negligible.
Although the pre-computing approach reduces the computation cost significantly at the algorithm level, it is still difficult to implement directly on hardware platforms. For FPGA implementations, all 30 BRIEF patterns have to be pre-computed and stored as a lookup table, which consumes a considerable amount of extra resources, so the required energy efficiency still cannot be satisfied.
In order to make descriptor computing more hardware-friendly, we put forward a special way to select the test locations and propose a 32-fold rotationally symmetric BRIEF pattern (RS-BRIEF). The procedure to generate the RS-BRIEF pattern is as follows. First, 2 sets of seed locations, $L_{S1}$ and $L_{D1}$, are selected in the neighborhood around the feature according to a Gaussian distribution. Each of the 2 sets contains 8 locations. Then, $L_{S1}$ and $L_{D1}$ are rotated in increments of 11.25 degrees, i.e., 11.25, 22.5, ..., 348.75 degrees, which yields 31 further sets each, $L_{S2}, \ldots, L_{S32}$ and $L_{D2}, \ldots, L_{D32}$. The resulting 2 × 32 × 8 = 512 locations are the final test locations. The RS-BRIEF pattern is visualized and compared with the original BRIEF pattern in Figure 2.
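The generation procedure can be sketched as follows: two seed sets of 8 locations are drawn from a Gaussian and then replicated at 32 rotations of 11.25 degrees each. The Gaussian standard deviation, the random seed and all function names are assumptions of this example; the radius of 15 pixels follows the patch size mentioned earlier.

```python
import math
import random

NUM_ROT = 32                   # 32-fold rotational symmetry
SEED_SIZE = 8                  # locations per seed set
STEP_DEG = 360.0 / NUM_ROT     # 11.25 degrees


def rotate(point, deg):
    """Standard 2D rotation of (x, y) about the origin by `deg` degrees."""
    a = math.radians(deg)
    x, y = point
    return (x * math.cos(a) - y * math.sin(a), x * math.sin(a) + y * math.cos(a))


def gaussian_seed(rng, count, sigma=5.0, radius=15.0):
    """Draw `count` locations from a Gaussian, rejecting points outside the patch.
    sigma is a hypothetical value, not taken from the paper."""
    points = []
    while len(points) < count:
        x, y = rng.gauss(0, sigma), rng.gauss(0, sigma)
        if math.hypot(x, y) <= radius:
            points.append((x, y))
    return points


def make_rs_brief(seed_s, seed_d):
    """Return 256 (s, d) test pairs, ordered by rotation step, then seed index."""
    pattern = []
    for k in range(NUM_ROT):
        for s, d in zip(seed_s, seed_d):
            pattern.append((rotate(s, k * STEP_DEG), rotate(d, k * STEP_DEG)))
    return pattern


rng = random.Random(0)
pattern = make_rs_brief(gaussian_seed(rng, SEED_SIZE), gaussian_seed(rng, SEED_SIZE))
print(len(pattern))
```

With this ordering, rotating the whole pattern by one step of 11.25 degrees produces the same set of test pairs shifted cyclically by 8 entries, which is the property that lets a rotation be replaced by a reordering.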
In summary, the rotationally symmetric pattern (RS-BRIEF) is generated by rotating the two sets of seed locations, $L_{S1}$ and $L_{D1}$. To calculate descriptors with the RS-BRIEF pattern, rotating the test locations reduces to changing the order of these locations or, equivalently, shifting the generated descriptor. Consequently, RS-BRIEF is much more hardware-friendly than the original BRIEF descriptor: it dramatically reduces the computation without introducing extra memory footprint.

The overall architecture of the proposed ORB-based visual SLAM accelerator, eSLAM, is shown in Figure 3. It is partially accelerated on the programmable logic of the FPGA and hosted by an ARM processor. The ORB Extractor and the BRIEF Matcher are implemented to accelerate feature extraction and matching, which account for over 90% of the runtime on general computing platforms. The Image Resizing module generates the image pyramid layer by layer for the ORB Extractor: while the ORB Extractor is processing one layer, the Image Resizing module applies nearest-neighbor downsampling to the same layer to generate the next one, until the whole image pyramid is processed. The ARM processor performs pose estimation, pose optimization and map updating.
ORB Extractor
The ORB Extractor aims to extract ORB features from images. It reads data from SDRAM via the AXI bus and computes the ORB features with a local cache. After feature extraction is finished, it sends the results back to SDRAM and the descriptors of the features to the BRIEF Matcher. The original workflow of ORB feature extraction can be summarized as follows:
(1) Detecting keypoints from the input image. Assume that M keypoints are detected.
(2) Filtering the M keypoints so that only the N keypoints with the best Harris scores are kept.
(3) Computing the descriptors of the N remaining keypoints.
There are two major problems when implementing the original workflow on hardware platforms. Firstly, the Detecting and Filtering procedures can be executed in parallel, while the descriptor Computing procedure has to idle until Filtering is finished. Furthermore, a large amount of on-chip cache is required to store the intermediate data when Computing the descriptors. In order to improve the computation throughput and reduce the memory consumption, the workflow of ORB feature extraction is rescheduled in a streaming manner as follows:
(1) Detecting keypoints from the input image. Assume that M keypoints are detected.
(2) Computing the descriptors of the M keypoints.
(3) Filtering the M features so that only the N features with the best Harris scores are kept.
After rescheduling, the descriptor Computing procedure is executed before the Filtering procedure, so that they can run simultaneously and be pipelined over the streaming keypoints. Compared with the original workflow, descriptors are computed for M − N extra keypoints, which introduces some overhead, but the latency is reduced significantly because the idle states are eliminated. Moreover, the required on-chip cache is also reduced dramatically thanks to the streaming processing manner.
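The rescheduled order (detect, then compute, then filter) can be mimicked in software as a single streaming pass: each keypoint gets its descriptor as soon as it arrives, and a bounded heap keeps only the best-scoring features. This is a sketch under stated assumptions, not the hardware design: the function names are hypothetical, and Python's `heapq` is a min-heap keyed on the score (so the weakest retained feature is evicted first), which may differ from how the hardware heap is organized.

```python
import heapq


def extract_streaming(keypoints, compute_descriptor, n_keep):
    """keypoints: iterable of (score, x, y) in detection order.
    Returns the n_keep best features as (score, order, (x, y), descriptor)."""
    heap = []  # min-heap on score: heap[0] is the weakest retained feature
    for order, (score, x, y) in enumerate(keypoints):
        descriptor = compute_descriptor(x, y)  # computed for all M keypoints
        item = (score, order, (x, y), descriptor)
        if len(heap) < n_keep:
            heapq.heappush(heap, item)
        elif score > heap[0][0]:
            heapq.heapreplace(heap, item)  # evict the current weakest feature
    return sorted(heap, reverse=True)


# Toy run: M = 4 detected keypoints, N = 2 retained (hypothetical scores).
calls = []


def fake_descriptor(x, y):
    calls.append((x, y))
    return x


kept = extract_streaming([(10, 0, 0), (50, 1, 0), (30, 2, 0), (40, 3, 0)], fake_descriptor, 2)
print([k[0] for k in kept], len(calls))
```

The descriptor function is called M times rather than N times, which is the overhead accepted in exchange for never stalling the pipeline and never buffering more than N features.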
The detailed architecture of the ORB Extractor is shown in Figure 4.
FAST Detection: The FAST Detection module takes a 7 × 7 pixel patch from the Image Cache as input. It detects FAST keypoints on this patch and computes the Harris corner score for each keypoint. If a FAST keypoint is detected, the corresponding Harris score is written into the Score Cache.
Image Smoother: This module applies a Gaussian blur to the 7 × 7 pixel patch of the original image for smoothing. The smoothened image is then used for calculating the descriptors and orientations of features.
NMS: The NMS module applies non-maximum suppression to the results of the FAST Detection module. It removes FAST keypoints that are too close to each other and retains only the one with the maximum Harris score in any 3 × 3 pixel patch.
Orientation Computing: This module determines the orientation of each feature. The orientation is defined as the direction of the vector from the center of the feature to the mass center of the circular patch. The position $(u, v)$ of the mass center is defined as:

$$u = \frac{\sum_{(x,y) \in C} x \, I(x, y)}{\sum_{(x,y) \in C} I(x, y)}, \qquad v = \frac{\sum_{(x,y) \in C} y \, I(x, y)}{\sum_{(x,y) \in C} I(x, y)}$$

where $C$ refers to the circular patch and $I(x, y)$ refers to the intensity of the pixel located at $(x, y)$.
BRIEF Rotator: The BRIEF Rotator shifts the descriptor according to the feature orientation, which gives the same result as rotating the test locations of RS-BRIEF. Assuming that the feature orientation is n (in steps of 11.25 degrees), the BRIEF Rotator moves the first 8 × n bits of the descriptor to the end.
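The two steps above can be sketched together: the intensity-weighted mass center gives an orientation, the orientation is quantized to one of 32 steps, and the descriptor is rotated cyclically by 8 bits per step. The function names, the list-of-bits descriptor representation, and the sign convention linking the angle to the shift direction are assumptions of this example.

```python
import math

ORIENT_STEPS = 32    # 32-fold symmetry, 11.25 degrees per step
BITS_PER_STEP = 8    # 8 test pairs per seed set


def mass_center(patch, radius):
    """Intensity centroid (u, v) of the circular patch, relative to its center.
    patch is a square list of rows of side 2 * radius + 1, indexed patch[row][col].
    The patch is assumed to contain at least one non-zero pixel."""
    total = sum_x = sum_y = 0
    for y in range(-radius, radius + 1):
        for x in range(-radius, radius + 1):
            if x * x + y * y <= radius * radius:
                intensity = patch[y + radius][x + radius]
                total += intensity
                sum_x += x * intensity
                sum_y += y * intensity
    return sum_x / total, sum_y / total


def orientation_index(u, v):
    """Quantize the direction of (u, v) to the nearest of 32 orientation steps."""
    angle = math.atan2(v, u) % (2 * math.pi)
    return round(angle / (2 * math.pi / ORIENT_STEPS)) % ORIENT_STEPS


def rotate_descriptor(bits, n):
    """Move the first 8 * n bits of the descriptor to the end."""
    shift = (BITS_PER_STEP * n) % len(bits)
    return bits[shift:] + bits[:shift]


# Toy 7 x 7 patch with a single bright pixel straight below the center (x = 0, y = +3).
r = 3
patch = [[0] * (2 * r + 1) for _ in range(2 * r + 1)]
patch[2 * r][r] = 255
n = orientation_index(*mass_center(patch, r))
print(n, rotate_descriptor(list(range(256)), n)[:4])
```

Because the rotation is a pure cyclic shift, it needs no multipliers and no lookup table, only a reordering of bits.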
Heap: The Heap stores and filters the descriptors, coordinates and Harris scores of features. To filter out superfluous features, a max-heap structure is utilized to guarantee that only the 1024 features with the best Harris scores are retained. Once feature extraction is finished and the results are stored in the heap, the descriptors and coordinates are sent to SDRAM through the AXI Interface, and the descriptors are also delivered to the BRIEF Matcher.
Cache: There are 3 caches in the ORB Extractor: the Image Cache, which stores pixels of the input image; the Score Cache, which stores the Harris scores of the keypoints; and the Smoothened Image Cache, which stores the smoothened image. These caches use a "ping-pong mechanism" so that the streaming data can be received and processed simultaneously. The Image Cache is taken as an example to explain the data I/O mechanism. The Image Cache consists of 3 cache lines, each of which stores 8 columns of image pixels. As shown in Figure 5, the 3 cache lines receive input data in turn. The data I/O of the cache lines is controlled by a finite-state machine (FSM). The FSM is initialized by pre-storing 16 columns of pixels in cache lines A and B. In each FSM state, one cache line receives input data while the other two send their data to the output.
BRIEF Matcher
HD(H_{i1}, H_{i2}, ..., H_{im}), the Comparator searches through HD, finds the minimum value to determine the matching result, and stores it into the Result Cache.
Parallelizing Mechanism
Since eSLAM is a heterogeneous system with the ARM processor as the host controller and the FPGA as the acceleration module, the parallelizing mechanism is critical to improving the computation throughput. The parallelized pipeline is shown in Figure 7. When processing normal frames, the ORB Extractor and BRIEF Matcher perform feature extraction and feature matching for the next frame while the ARM processor is performing pose estimation and pose optimization. Key frames are processed differently, because map updating is executed on the ARM processor after pose estimation and pose optimization. The ORB Extractor still performs feature extraction on the FPGA in parallel with the ARM processor, but the BRIEF Matcher does not start until map updating is finished.
With the parallelizing mechanism above, the stages can be executed efficiently as a pipeline. For normal frames, feature extraction and matching run in parallel with pose estimation and optimization. For key frames, feature extraction runs in parallel with pose estimation and optimization. This parallel processing improves the computing throughput significantly.
EXPERIMENTAL RESULTS
Experimental Setup
Hardware Implementation: The proposed eSLAM system is implemented on a Xilinx Zynq XCZ7045 SoC [12], which integrates an ARM Cortex-A9 processor and FPGA resources. The clock frequency of the ARM processor is 767 MHz, and the clock frequency of the accelerating modules is 100 MHz. The resource utilization of the proposed system is shown in Table 1. Since only about 1/4 of the resources on the XCZ7045 are utilized, it is possible to prototype the system on SoCs with fewer resources and a lower price, such as the XCZ7030/XCZ7020.
Dataset: The proposed eSLAM is evaluated on the TUM dataset [10]. It contains RGB images along with depth information and is widely used in the visual SLAM community. The image resolution is 640 × 480. Five different sequences in the dataset, fr1/xyz, fr1/desk, fr1/room, fr2/xyz and fr2/rpy, are used for evaluation. Each sequence contains a ground truth trajectory that is obtained by a high-accuracy motion-capture system.
Accuracy Analysis
The accuracy of the visual SLAM system is measured by the trajectory error, i.e., the difference between the ground truth trajectory and the estimated trajectory. As shown in Figure 8, the average trajectory error is compared with the original ORB based SLAM implementation on the five sequences from the TUM dataset. For the fr1/xyz, fr1/room and fr2/xyz sequences, the implementation with the original ORB is more accurate than the one with the RS-BRIEF descriptor. However, the implementation with the RS-BRIEF descriptor is more accurate on the fr1/desk and fr2/rpy sequences. Across the five sequences, the average error of the RS-BRIEF based implementation is about 4.3 cm and that of the original ORB based implementation is about 4.16 cm, which indicates that the accuracy of the RS-BRIEF descriptor is comparable to that of the original descriptor.
Meanwhile, the trajectories estimated by the RS-BRIEF based implementation and the original ORB based implementation are compared with the ground truth trajectory on the fr1/desk sequence and visualized in Figure 9. To display the trajectories clearly, only a segment of each is shown.
Performance Evaluation
The performance of the proposed eSLAM system is compared with software implementations on the integrated ARM Cortex-A9 processor of the XCZ7045 SoC and on an Intel i7-4700MQ processor [9]. The runtime comparison is shown in Table 2. Accelerated by the ORB Extractor and the BRIEF Matcher, the latency of feature extraction and feature matching in eSLAM is reduced to 9.1 ms and 4 ms, respectively. Compared with the Intel CPU and the ARM processor, eSLAM achieves 3.6× and 32× speedup in feature extraction, and 4.9× and 61.6× speedup in feature matching.
Table 3 compares the average runtime per frame, the frame rate, the energy consumed per frame and the power consumption of eSLAM with those of the ARM processor and the Intel CPU. For normal frames, eSLAM performs feature extraction (FE) and matching (FM) simultaneously with pose estimation (PE) and optimization (PO). The average runtime is therefore the sum of the processing times of PE and PO, 17.9 ms. For key frames, FE is performed simultaneously with PE. The average runtime of eSLAM is 31.8 ms, which is the sum of the processing times of FM, PE, PO and map updating (MU). Compared with the ARM processor, eSLAM achieves about 17.8× speedup when processing key frames and 31× speedup for normal frames. Compared with the Intel i7 processor, it achieves 1.7× to 3× speedup.
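The relation between the per-frame runtimes quoted above and the resulting frame rates is simple arithmetic, sketched here for reference. The blended-rate helper and its key-frame ratio parameter are assumptions added for illustration; the paper only reports the two extreme cases.

```python
def fps(runtime_ms: float) -> float:
    """Frame rate in frames per second for a given per-frame runtime."""
    return 1000.0 / runtime_ms


def blended_fps(key_ratio: float, key_ms: float, normal_ms: float) -> float:
    """Average frame rate when a fraction `key_ratio` of frames are key frames.
    key_ratio is a hypothetical parameter, not a figure reported in the paper."""
    return 1000.0 / (key_ratio * key_ms + (1.0 - key_ratio) * normal_ms)


NORMAL_MS = 17.9   # PE + PO, with FE and FM hidden behind them
KEY_MS = 31.8      # FM + PE + PO + MU

print(round(fps(NORMAL_MS), 2), round(fps(KEY_MS), 2))
```

These two values bound the achievable frame rate; the actual rate on a sequence depends on how often key frames occur.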
In terms of energy consumption, the proposed eSLAM also shows great advantage compared with the ARM and Intel CPU. Although the power consumption of eSLAM is increased by about 23% compared with the ARM processor due to the additional FPGA accelerating modules, the energy consumed per frame is still reduced by 14× to 25× depending on the key frame rate. Compared with the Intel i7 processor, the energy consumption is reduced by 41× to 71×.
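The energy figures follow from the speedups and the power increase, since energy per frame is power multiplied by runtime. The sketch below uses the rounded numbers quoted in the text (23% more power, 17.8× and 31× speedup over the ARM processor), so it reproduces the reported range only approximately; the function name is an assumption.

```python
def energy_reduction(speedup: float, power_ratio: float) -> float:
    """Ratio of baseline energy per frame to accelerated energy per frame.

    energy = power * runtime, so
    E_base / E_acc = (T_base / T_acc) / (P_acc / P_base) = speedup / power_ratio
    """
    return speedup / power_ratio


POWER_RATIO = 1.23  # eSLAM power relative to the ARM processor alone

print(round(energy_reduction(17.8, POWER_RATIO), 1))  # key frames
print(round(energy_reduction(31.0, POWER_RATIO), 1))  # normal frames
```

The result is consistent with the stated 14× to 25× reduction: a modest increase in power is far outweighed by the shorter time spent on each frame.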
Discussions
As shown in Table 3, the key frame rate of eSLAM is 31.45 fps and the normal frame rate is 55.87 fps, both of which are much lower than the 171 fps achieved by Navion [11]. This gap is mainly due to the different algorithms adopted. Navion adopts the optical-flow method, in which only keypoints are detected, while descriptor calculation and feature matching are not required. However, the feature-based approach adopted in eSLAM is much more robust in many scenarios where optical-flow methods may fail, because optical-flow methods are only valid under two basic assumptions: constant illumination and small motions/displacements [5].
Compared with the ORB extractor implemented on FPGA in [4], the ORB extractor in eSLAM employs hardware-friendly optimizations such as RS-BRIEF and workflow rescheduling. Hence, the latency of feature extraction in eSLAM is approximately 39% lower than that of [4], even though eSLAM processes 48% more pixels because of the two extra layers in its image pyramid.
CONCLUSIONS
In this paper, a heterogeneous ORB-based visual SLAM system, eSLAM, is proposed for energy-efficient and real-time applications and evaluated on the Zynq platform. The ORB descriptor is first reformulated with a rotationally symmetric pattern for hardware-friendly implementation. Meanwhile, the most time-consuming stages, i.e., feature extraction and matching, are accelerated on FPGA to reduce the latency significantly. eSLAM is also designed in a pipelined manner to further improve the throughput and reduce the memory footprint. The evaluation results on the TUM dataset show that eSLAM achieves 1.7× to 3× speedup in frame rate and 41× to 71× improvement in energy efficiency compared with the Intel i7 CPU. Compared with the ARM processor, eSLAM achieves 17.8× to 31× speedup in frame rate and 14× to 25× improvement in energy efficiency.
