Abstract-Bundle adjustment (BA) is a fundamental optimization technique used in many crucial applications, including 3D scene reconstruction, robotic localization, camera calibration, autonomous driving, space exploration, and street view map generation. Essentially, BA is a joint non-linear optimization problem that can consume a significant amount of time and power, especially for large problems. Previous approaches to optimizing BA performance rely heavily on parallel processing or distributed computing, which trade higher power consumption for higher performance. In this paper we propose π-BA, the first hardware-software co-designed BA engine on an embedded FPGA-SoC that exploits custom hardware for higher performance and power efficiency. Specifically, based on our key observation that not all points appear on all images in a BA problem, we designed and implemented a Co-Observation Optimization technique to accelerate BA operations with optimized usage of memory and computation resources. Experimental results confirm that π-BA outperforms the existing software implementations in terms of performance and power consumption.
Index Terms-bundle adjustment, SLAM, structure from motion, FPGA
I. INTRODUCTION
Bundle adjustment (BA) is the problem of refining a visual reconstruction to produce jointly optimal estimates of the 3D structure and the viewing parameters (camera pose and calibration). Optimal means that the parameter estimates are found by minimizing some cost function that quantifies the model fitting error, and jointly means that the solution is simultaneously optimal with respect to both structure and camera variations [1], [2]. Given a set of measured image feature locations and correspondences, the goal of BA is to find 3D point positions and camera parameters that minimize the reprojection error. This optimization problem is usually formulated as a non-linear least squares problem, where the error is the squared L2 norm of the difference between the observed feature location and the projection of the corresponding 3D point on the image plane of the camera. However, we are not limited to using the L2 norm; even when robust loss functions such as the Huber norm are used, the problem can be cast as a re-weighted non-linear least squares problem. In essence, BA is a large sparse geometric parameter estimation problem, the parameters being the combined 3D feature coordinates, camera poses and calibrations.
BA is widely used in many modern applications. First, BA is the core component of 3D scene reconstruction applications: Agarwal et al. present a system that can match and reconstruct 3D scenes from extremely large collections of photographs, such as those found by searching for a given city on Internet photo sharing sites [3]. The authors designed and implemented a cluster with 500 compute nodes to reconstruct cities consisting of 150K images in less than a day. In addition, BA is crucial in robotic localization applications: Mur-Artal et al. developed ORB-SLAM, a feature-based simultaneous localization and mapping (SLAM) system. The system consists of four modules: tracking, mapping, relocalization, and loop closing. BA is used in the mapping stage to optimize the visual feature map so that the robot can better localize itself [4]. Moreover, BA is used heavily in autonomous driving applications, especially in the production of high-definition maps [5]. BA is also used in space exploration: in multiple Mars exploration missions, NASA used BA to optimize the localization accuracy of Mars explorers [6]. Finally, BA is used in commercial products, such as Google street map, to perform scene reconstruction optimization [7].
In both online real-time localization applications and offline visual reconstruction applications, BA remains the primary performance and power consumption bottleneck. For real-time localization systems (including mobile robots, autonomous vehicles, and space explorers) that perform local BA involving tens to hundreds of images, the latency of BA can be extremely high, so the system fails to provide optimal localization updates in real time. For offline visual reconstruction systems (including 3D scene reconstruction, street view maps, and high-definition maps) that perform global BA involving thousands to millions of images, the power consumption of BA can be extremely costly. Previous approaches to optimizing BA performance rely heavily on parallel processing or distributed computing, which trade higher power consumption for higher performance. To enable both online and offline applications effectively and efficiently, we need a BA solution that simultaneously optimizes performance and power consumption, and thus we explore hardware acceleration techniques.
In this paper, aiming to achieve optimal performance and power efficiency for BA, we present π-BA, the first hardware-software co-designed BA engine on an embedded FPGA-SoC. The contribution of this paper is three-fold: first, this paper is the first exploration study of implementing a BA hardware accelerator, and the proposed π-BA's implementation has been proven effective. Second, based on our key observation that not all points appear on all images in a BA problem, we developed a novel Co-Observation Optimization technique for designing BA hardware accelerators. Third, in addition to achieving performance and power efficiency, we also demonstrate that the proposed π-BA optimizes computing and memory resource usage.
The rest of this paper is organized as follows. In Section II, the related research works are presented. In Section III, we review the fundamentals of BA problems to help readers understand the challenges and complexities of designing BA hardware accelerators. In Sections IV and V we describe the π-BA architecture and delve into the novel Co-Observation Optimization design. In Section VI, we share the detailed experimental methodologies and results to demonstrate the effectiveness of π-BA architecture. Finally, we summarize the conclusions in Section VII.
II. RELATED WORK
In this section, we review several existing approaches to optimizing BA performance. First, to optimize BA performance on CPUs, Jeong et al. exploit the block-sparsity pattern that arises in a reduced camera system and speed up the bundler with BLAS-accelerated matrix operations, efficient memory handling, and fast block-based linear solving [8]. Furthermore, the authors proposed novel embedded point iterations, which substantially improved the convergence speed by yielding a high cost decrease from each camera update step. Their experimental results show the improved performance of the proposed bundler and provide useful and detailed comparisons among various choices when composing a bundler.
Parallel processing on multicore CPUs or GPUs can also be applied to optimize BA performance. Wu et al. presented multicore solutions to the problem of bundle adjustment that run on currently available CPUs and GPUs [9]. The authors concluded that using multicore systems delivers a 10x to 30x boost in speed over existing systems while reducing the amount of memory used. This was achieved by carefully restructuring the matrix-vector product used in the PCG iterations into easily parallelizable operations. This restructuring also opens the door to a matrix-free implementation, which leads to substantial reductions in memory consumption as well as execution time. The authors also showed that single-precision arithmetic, when combined with appropriate normalization, gives numerical performance comparable to double-precision solvers while further reducing the memory and time cost. The resulting system enabled running the largest bundle adjustment problems to date on a single GPU.
Distributed computing is another effective way to optimize BA performance. Eriksson et al. proposed a consensus framework to deal with large-scale bundle adjustment in distributed systems [10]. Instead of merging small problems by optimizing their overlapping regions, the consensus framework uses the proximal splitting method to formulate the bundle adjustment problem, in which the small problems are merged by averaging points, decreasing the cost of merging. The merging process for the same parameters guarantees the consensus of points in different nodes. This design may suffer from several problems. First, in each iteration, each node in the distributed system has to broadcast all overlapping points to the master node to complete the merging process, which is a huge overhead for large-scale datasets. Second, the parameters of each camera are treated as independent of those of other cameras, whereas in practice some cameras may share the same intrinsic parameters. Third, merging by averaging points converges somewhat slowly on very large-scale datasets and may converge prematurely to a local minimum.
Similarly, Zhang et al. proposed a distributed approach for very large scale global bundle adjustment computation [11] . The proposed distributed formulation was derived from the classical optimization algorithm alternating direction method of multipliers, based on the global camera consensus. The authors analyzed the conditions under which the convergence of this distributed optimization would be guaranteed and they adopted over-relaxation and self-adaption schemes to improve the convergence rate. Also, the authors proposed to split the large scale camera-point visibility graph in order to reduce the communication overheads of the distributed computing.
This paper proposes π-BA, the first BA hardware accelerator, and its implementation on an FPGA. Compared to existing acceleration techniques, π-BA simultaneously optimizes both performance and power consumption, thus enabling both real-time local robotic localization applications and efficient offline visual reconstruction applications.
III. PROBLEM STATEMENT
In the following sections, we use boldface to represent vectors and matrices.
A. Perspective camera model
In computer vision, a camera is a device that performs a central projection, mapping 3D points onto a 2D image plane. Fig. 1 illustrates the perspective camera model that projects a 3D point onto an image plane. By employing projective geometry and coordinate transformation, the perspective projection is modeled by Eq. (1),
where X is a 4 × 1 vector representing the position of a 3D point in the world coordinate system, and x is a 3 × 1 vector representing the location of the projected point in the image plane. Note that X and x are represented in homogeneous coordinates.
[R|t] is a 3 × 4 matrix composed of a 3 × 3 rotation matrix R and a 3 × 1 translation vector t. R and t are referred to as the extrinsic parameters of the camera and specify the rigid transformation from the world coordinate to the camera
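As a rough illustration of the projection described above, the following Python sketch assumes the standard pinhole form x ∼ K[R|t]X, where K is a 3 × 3 intrinsic matrix. Eq. (1) and the definition of K are not reproduced in this text, so the form of the model, the function name `project`, and all numeric values are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def project(K: Matrix, R: Matrix, t: Sequence[float],
            X_world: Sequence[float]) -> Tuple[float, float]:
    """Project a 3D world point to pixel coordinates (assumed pinhole model)."""
    # Rigid transformation [R|t]: world coordinates -> camera coordinates.
    X_cam = [sum(R[r][c] * X_world[c] for c in range(3)) + t[r] for r in range(3)]
    # Homogeneous image point x = K * X_cam (a 3-vector).
    x = [sum(K[r][c] * X_cam[c] for c in range(3)) for r in range(3)]
    if x[2] == 0.0:
        raise ValueError("point lies on the camera's principal plane")
    # Divide by the third homogeneous coordinate to obtain pixel coordinates.
    return x[0] / x[2], x[1] / x[2]


# Hypothetical camera: focal length 100 px, principal point (50, 40),
# identity rotation and zero translation.
K = [[100.0, 0.0, 50.0], [0.0, 100.0, 40.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]
u, v = project(K, R, t, [1.0, 2.0, 4.0])
```

The final division by the third component is what makes the mapping non-linear in the point and pose parameters, which is why BA is a non-linear least squares problem.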
B. Bundle adjustment
For visual simultaneous localization and mapping problems, bundle adjustment is employed in the last stage of the processing pipeline to further refine camera trajectories and 3D structures. It aims to minimize the discrepancies between the observations of 3D points and the predicted projections of the corresponding 3D points. Assume that there are a 3D points observed in b images. Let $p_i$ be the i-th 3D point, $o_{ij}$ be the observation of the i-th point in the j-th image, and $c_j$ be the parameters of the j-th camera. $P(p_i, c_j)$ denotes the projection function. Generally, bundle adjustment is formulated as an optimization problem, defined by Eq. (2). In the equation, $\sigma_{ij}$ evaluates to 1 if the i-th 3D point is observed by the j-th camera; otherwise its value is 0. This formulation shows that solving the bundle adjustment problem means determining the camera parameters and 3D point positions such that the observations are closely approximated by the corresponding re-projected points. Note that in visual simultaneous localization and mapping problems, the intrinsic parameters of the cameras are known beforehand. As a result, only the extrinsic parameters need to be optimized by bundle adjustment.
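Eq. (2) is not reproduced in this text. A form consistent with the definitions above would be the minimization, over all $p_i$ and $c_j$, of $\sum_i \sum_j \sigma_{ij} \| P(p_i, c_j) - o_{ij} \|^2$. The Python sketch below evaluates such a cost under that assumption; storing only the observed (i, j) pairs in a dictionary plays the role of $\sigma_{ij}$. The names `reprojection_cost` and `toy_project` and the translation-only toy camera are hypothetical.

```python
from typing import Callable, Dict, Sequence, Tuple

Pixel = Tuple[float, float]


def reprojection_cost(points: Sequence[Sequence[float]],
                      cameras: Sequence[Sequence[float]],
                      observations: Dict[Tuple[int, int], Pixel],
                      project: Callable[[Sequence[float], Sequence[float]], Pixel]
                      ) -> float:
    """Sum of squared reprojection errors over the observed (point, image) pairs."""
    total = 0.0
    # Only pairs with sigma_ij = 1 are present as keys, so unobserved pairs
    # contribute nothing to the sum.
    for (i, j), (u_obs, v_obs) in observations.items():
        u, v = project(points[i], cameras[j])
        total += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return total


def toy_project(p: Sequence[float], c: Sequence[float]) -> Pixel:
    """Hypothetical camera: translation-only pose c = (tx, ty, tz), unit focal length."""
    x, y, z = p[0] + c[0], p[1] + c[1], p[2] + c[2]
    return x / z, y / z


points = [(0.0, 0.0, 2.0), (2.0, 4.0, 2.0)]
cameras = [(0.0, 0.0, 0.0), (0.0, 0.0, 2.0)]
exact = {(0, 0): (0.0, 0.0), (1, 0): (1.0, 2.0), (1, 1): (0.5, 1.0)}
noisy = dict(exact)
noisy[(1, 0)] = (4.0, 6.0)  # shift one observation by (3, 4) pixels

cost_exact = reprojection_cost(points, cameras, exact, toy_project)
cost_noisy = reprojection_cost(points, cameras, noisy, toy_project)
```

Point 0 is not observed by camera 1 in this toy setup, which mirrors the observation that not all points appear in all images.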
C. Levenberg-Marquardt's algorithm
The Levenberg-Marquardt (LM) algorithm is a non-linear least squares algorithm. It is widely used to find a local minimum of functions that are expressed as a sum of squares of several nonlinear functions. LM combines the merits of the steepest descent and Gauss-Newton methods, and it can converge from a wide range of initial conditions. LM has become a standard algorithm for performing bundle adjustment in visual SLAM and 3D reconstruction problems [1], [12].
Algorithm 1 shows the pseudo code of the LM algorithm. In the pseudo code, $\|\mathbf{m}\|$ and $\|\mathbf{m}\|_\infty$ denote the 2-norm and the infinity norm of vector m, respectively. $\mathbf{M}^T$ and $\mathbf{m}^T$ are the transposes of matrix M and vector m. Assume that there are a 3D points, b cameras and o observations. The inputs of the algorithm are an n × 1 measurement vector x and an m × 1 initial parameter vector $\mathbf{p}_0$. According to Eq. (1), it can be derived that m equals 3a + 6b and n equals 2o. f is a vector function that maps a parameter vector p to an estimated measurement vector. The output of the algorithm is an optimized parameter vector $\mathbf{p}^+$ that minimizes $\boldsymbol{\epsilon}^T\boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon} = \mathbf{x} - f(\mathbf{p})$. J is the Jacobian matrix of f(p).
Algorithm 1. LM algorithm (listing incomplete in this text; surviving lines shown)
 9:     stop := true;
10:   else
11:     p_new := p + δp;
 ...
13:     if (ρ > 0)
14:       p := p_new;
15:       J := ∂f(p)/∂p; ε := x − f(p);
 ...
21:     endif
22:   endif
23: endwhile
24: p+ := p
The LM algorithm solves a nonlinear optimization problem by iteratively linearizing the nonlinear function and solving the linearized equation. In each iteration, it first computes the change of p, namely $\delta\mathbf{p}$, by solving the linear equation, and then updates p. The stop conditions of the LM algorithm are: 1) the magnitude of the gradient, $\|\mathbf{g}\|_\infty$, is less than $\varepsilon_1$; 2) the magnitude of the change of p, $\|\delta\mathbf{p}\|$, is less than $\varepsilon_2\|\mathbf{p}\|$; 3) the maximum number of iterations, $k_{max}$, is reached. $\varepsilon_1$, $\varepsilon_2$, $k_{max}$ and $\tau$ are parameters specified by users. More details of the LM algorithm for bundle adjustment can be found in [13].
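The iteration and the three stop conditions described above can be sketched in plain Python as follows. Since Algorithm 1 is incomplete in this text, the details are assumptions: the damping term is taken as µ·diag(JᵀJ), µ is initialized to τ, and µ is updated with the commonly used gain-ratio rule (shrink on an accepted step, grow by a doubling factor ν on a rejected step). The names `solve` and `levenberg_marquardt` and the exponential test model are hypothetical.

```python
import math
from typing import Callable, List

Vector = List[float]
Matrix = List[List[float]]


def solve(A: Matrix, b: Vector) -> Vector:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[r]] for r, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def norm(v: Vector) -> float:
    return math.sqrt(sum(e * e for e in v))


def levenberg_marquardt(f: Callable[[Vector], Vector],
                        jac: Callable[[Vector], Matrix],
                        x: Vector, p0: Vector, eps1: float = 1e-12,
                        eps2: float = 1e-12, k_max: int = 100,
                        tau: float = 1e-3) -> Vector:
    p, m = p0[:], len(p0)
    mu, nu = tau, 2.0  # assumed initialization of the damping parameters
    for _ in range(k_max):  # stop condition 3: iteration limit
        # Recomputed every iteration for brevity, even after a rejected step.
        J = jac(p)
        eps = [xi - fi for xi, fi in zip(x, f(p))]
        A = [[sum(row[r] * row[c] for row in J) for c in range(m)] for r in range(m)]
        g = [sum(row[r] * e for row, e in zip(J, eps)) for r in range(m)]
        if max(abs(gi) for gi in g) <= eps1:  # stop condition 1: small gradient
            break
        damped = [[A[r][c] + (mu * A[r][r] if r == c else 0.0) for c in range(m)]
                  for r in range(m)]
        dp = solve(damped, g)
        if norm(dp) <= eps2 * norm(p):  # stop condition 2: small relative step
            break
        p_new = [pi + di for pi, di in zip(p, dp)]
        eps_new = [xi - fi for xi, fi in zip(x, f(p_new))]
        # Gain ratio: actual cost decrease over the decrease predicted by the
        # linearized model.
        predicted = sum(d * (mu * A[r][r] * d + g[r]) for r, d in enumerate(dp))
        actual = sum(e * e for e in eps) - sum(e * e for e in eps_new)
        rho = actual / predicted if predicted > 0.0 else -1.0
        if rho > 0.0:  # accept the step and relax the damping
            p = p_new
            mu *= max(1.0 / 3.0, 1.0 - (2.0 * rho - 1.0) ** 3)
            nu = 2.0
        else:  # reject the step and increase the damping
            mu *= nu
            nu *= 2.0
    return p


# Hypothetical test problem: fit y = p0 * exp(p1 * s) to noise-free samples.
samples = [0.0, 0.5, 1.0, 1.5, 2.0]
model = lambda p: [p[0] * math.exp(p[1] * s) for s in samples]
model_jac = lambda p: [[math.exp(p[1] * s), p[0] * s * math.exp(p[1] * s)]
                       for s in samples]
p_fit = levenberg_marquardt(model, model_jac, model([2.0, -1.0]), [1.0, 0.0])
```

In BA the dense `solve` call is the expensive part, since the system has 3a + 6b unknowns; the Schur elimination discussed next is what makes that step tractable.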
In the LM algorithm, the Jacobian matrix, J, is a $2o \times (3a+6b)$ matrix, and $\mathbf{J}^T\mathbf{J} + \mu\mathbf{D}^T\mathbf{D}$ is a $(3a+6b) \times (3a+6b)$ square matrix. Directly solving Eq. (3) requires $(3a+6b)^3$ arithmetic operations, which is computationally intensive. In practice, a matrix elimination technique and Cholesky factorization are used to reduce the computational complexity of solving Eq. (3).
The parameter vector p can be divided into a 3D points part and a camera parameter part, and is expressed as $\mathbf{p} = [\mathbf{p}_p; \mathbf{p}_c]$. Similarly, the Jacobian matrix J can be divided into a 3D points Jacobian matrix and a camera parameter Jacobian matrix, as shown in Eq. (4), where $\mathbf{J}_p$ and $\mathbf{J}_c$ represent the Jacobians of 3D points and cameras, respectively.
$$\mathbf{J} = [\,\mathbf{J}_p \;\; \mathbf{J}_c\,] \tag{4}$$
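Let $\mathbf{A} = \mathbf{J}^T\mathbf{J}$ and $\mu\mathbf{D}^T\mathbf{D} = \mathrm{diag}(\mathbf{A})$. Using Eq. (4), $\mathbf{A} + \mu\mathbf{D}^T\mathbf{D}$ can be expressed as a simple block matrix, shown in Eq. (5). U and V are $3a \times 3a$ and $6b \times 6b$ matrices, respectively, and W is a $6b \times 3a$ matrix.
$$\mathbf{A} + \mu\mathbf{D}^T\mathbf{D} = \begin{bmatrix} \mathbf{U} & \mathbf{W}^T \\ \mathbf{W} & \mathbf{V} \end{bmatrix} \tag{5}$$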
By combining Eq. (3) and Eq. (5), we obtain Eq. (6).
By eliminating the lower left block, W, Eq. (7) is obtained. In the equation, $\mathbf{V} - \mathbf{W}\mathbf{U}^{-1}\mathbf{W}^T$ is the Schur complement matrix, which is symmetric and is denoted as S in this paper. The vector $(\mathbf{J}_c^T\boldsymbol{\epsilon} - \mathbf{W}\mathbf{U}^{-1}\mathbf{J}_p^T\boldsymbol{\epsilon})$ is denoted as r in this paper. Then, $\delta\mathbf{p}_c$ can be obtained by solving Eq. (7). In practice, $\delta\mathbf{p}_c$ is solved by Cholesky factorization.
After obtaining $\delta\mathbf{p}_c$, the change of the 3D points vector, $\delta\mathbf{p}_p$, can be obtained by back substitution. Eq. (8) describes the closed-form solution of $\delta\mathbf{p}_p$. Note that U is a block-diagonal matrix whose diagonal blocks are 3 × 3 matrices, so the cost of inverting U is low.
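Eqs. (3) and (6)-(8) referenced in this subsection are not reproduced in this text. Assuming the standard damped normal equations of LM, with $\boldsymbol{\epsilon} = \mathbf{x} - f(\mathbf{p})$, and with U, V and W being the blocks of $\mathbf{A} + \mu\mathbf{D}^T\mathbf{D}$ (W the lower left block), they would plausibly read as follows. This is a reconstruction from the surrounding definitions, not the original typesetting.

```latex
\begin{align}
(\mathbf{J}^T\mathbf{J} + \mu\mathbf{D}^T\mathbf{D})\,\delta\mathbf{p}
  &= \mathbf{J}^T\boldsymbol{\epsilon} \tag{3}\\
\begin{bmatrix}\mathbf{U} & \mathbf{W}^T\\ \mathbf{W} & \mathbf{V}\end{bmatrix}
\begin{bmatrix}\delta\mathbf{p}_p\\ \delta\mathbf{p}_c\end{bmatrix}
  &= \begin{bmatrix}\mathbf{J}_p^T\boldsymbol{\epsilon}\\
     \mathbf{J}_c^T\boldsymbol{\epsilon}\end{bmatrix} \tag{6}\\
\underbrace{(\mathbf{V} - \mathbf{W}\mathbf{U}^{-1}\mathbf{W}^T)}_{\mathbf{S}}\,
  \delta\mathbf{p}_c
  &= \underbrace{\mathbf{J}_c^T\boldsymbol{\epsilon}
     - \mathbf{W}\mathbf{U}^{-1}\mathbf{J}_p^T\boldsymbol{\epsilon}}_{\mathbf{r}} \tag{7}\\
\delta\mathbf{p}_p
  &= \mathbf{U}^{-1}\bigl(\mathbf{J}_p^T\boldsymbol{\epsilon}
     - \mathbf{W}^T\delta\mathbf{p}_c\bigr) \tag{8}
\end{align}
```

Under this reading, Eq. (7) follows from multiplying the first block row of Eq. (6) by $\mathbf{W}\mathbf{U}^{-1}$ and subtracting it from the second block row, and Eq. (8) is the first block row solved for $\delta\mathbf{p}_p$.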
The computation of S is called Schur elimination in this paper. Directly calculating S according to its expression, $\mathbf{V} - \mathbf{W}\mathbf{U}^{-1}\mathbf{W}^T$, is computationally expensive. Since U, V and W are sparse matrices, the complexity of calculating S can be substantially reduced by exploiting the structure of these sparse matrices. Algorithm 2 describes the procedure of calculating S and r. According to the algorithm, computing S requires $ab^2$ arithmetic operations. Employing Cholesky factorization to solve Eq. (7) requires $(6b)^3/3$ operations. For vSLAM and 3D reconstruction problems, the number of 3D points a is much larger than the number of images b. Therefore, Schur elimination is the most computationally intensive step when solving the linear equation in the LM algorithm.
Fig. 2. System architecture
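Algorithm 2 is not reproduced in this text. The following Python sketch shows one way the sparsity could be exploited, assuming S and r are accumulated point by point: for each point i, only the camera blocks in its co-observation set are touched. All names are hypothetical, the inverses $\mathbf{U}_i^{-1}$ are assumed to be precomputed, and the example uses 1 × 1 camera blocks instead of the 6 × 6 blocks of real BA to keep the numbers easy to check.

```python
from typing import Dict, List, Tuple

Mat = List[List[float]]


def matmul(A: Mat, B: Mat) -> Mat:
    return [[sum(A[r][k] * B[k][c] for k in range(len(B)))
             for c in range(len(B[0]))] for r in range(len(A))]


def transpose(A: Mat) -> Mat:
    return [list(col) for col in zip(*A)]


def schur_eliminate(U_inv: Dict[int, Mat], W: Dict[int, Dict[int, Mat]],
                    V: Dict[int, Mat], g_p: Dict[int, Mat], g_c: Dict[int, Mat]
                    ) -> Tuple[Dict[Tuple[int, int], Mat], Dict[int, Mat]]:
    """Accumulate S = V - W U^-1 W^T and r = g_c - W U^-1 g_p block by block.

    U_inv[i] is the inverse of the 3x3 block of point i, W[i][j] is the block
    coupling camera j and point i, V[j] is the diagonal block of camera j, and
    g_p[i] / g_c[j] are the point / camera parts of the right-hand side, stored
    as column vectors. Only blocks (j, k) with j <= k are stored, since S is
    symmetric.
    """
    S = {(j, j): [row[:] for row in V[j]] for j in V}
    r = {j: [row[:] for row in g_c[j]] for j in g_c}
    for i, blocks in W.items():
        cams = sorted(blocks)  # co-observation set of point i
        for j in cams:
            Y = matmul(blocks[j], U_inv[i])  # W_ij * U_i^-1
            for a, row in enumerate(matmul(Y, g_p[i])):
                r[j][a][0] -= row[0]
            for k in cams:
                if k < j:
                    continue  # lower triangle is implied by symmetry
                update = matmul(Y, transpose(blocks[k]))
                block = S.setdefault((j, k), [[0.0] * len(update[0]) for _ in update])
                for a, row in enumerate(update):
                    for c, value in enumerate(row):
                        block[a][c] -= value
    return S, r


# Toy example: one point seen by two cameras, 1x1 camera blocks.
U_inv = {0: [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]]}
W = {0: {0: [[1.0, 2.0, 0.0]], 1: [[0.0, 2.0, 2.0]]}}
V = {0: [[10.0]], 1: [[10.0]]}
g_p = {0: [[2.0], [2.0], [2.0]]}
g_c = {0: [[5.0]], 1: [[5.0]]}
S, r = schur_eliminate(U_inv, W, V, g_p, g_c)
```

In this organization a point with co-observation value $CO_i$ triggers $CO_i(CO_i+1)/2$ block updates of S, at most about $b^2/2$ per point, which is consistent with the $ab^2$ bound quoted above.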
IV. SYSTEM ARCHITECTURE OF LM IMPLEMENTATION
After introducing Schur elimination, the computations in one iteration of the LM algorithm can be divided into five parts: Jacobian update (JU), Schur elimination (SE), Cholesky factorization solving δp (CFS), gain ratio evaluation (GRE) and trust region expansion (TRE). One of the most time-consuming parts is SE, which has complexity $O(ab^2)$. Therefore, we propose a hardware-software co-design in which SE is accelerated in hardware and the other parts are implemented in software. The whole system architecture is shown in Fig. 2. The amount of data transferred between hardware and software is $18o + b^2/2$. We use AXI Direct Memory Access (DMA) at 6400 Mbit/s. The measured performance shows that the data transfer time is less than the hardware computation time; in addition, data transfer and computation are pipelined to reduce the data transfer overhead.
As shown in Algorithm 2, the SE mainly computes S (6b × 6b) given input J (2o × (3a + 6b)) and D (3a + 6b). Given the large matrices, data storage format is a key to system performance and size.
The Jacobian matrix is a block sparse matrix, as illustrated in Fig. 3(a). Therefore, the Jacobian matrix uses a block compressed sparse row (BCSR) storage format [14]. A typical BCSR format uses three attributes: values, which stores the values of all blocks linearly in row-wise order; rows, which stores the block-column indices in a row; and block-starts, which stores the indices of the first element of each block in values. In this paper, because the block size of J is fixed in BA, the block-starts information is not needed. In addition, we convert the rows attribute to a set of co-observations, that is, the set of images observing a point i, together with the number of elements in the set ($CO_i$), which is called the co-observation value and indicates how many images observe point i. In Fig. 3(a), the set of co-observations of the second point is {1, 2} and $CO_2 = 2$ because the second point is observed by the first and second images. The conversion is shown in Fig. 3(b). The advantages of doing this are that 1) the set and $CO_i$ can address values, and 2) they carry the physical meaning of the BA problem and can be used directly in subsequent computations without on-the-fly calculation in hardware. In addition, $\mathbf{J}_p$ and $\mathbf{J}_c$ are stored separately in off-chip memory.
The S matrix is a dense matrix obtained by continuously accumulating sub-matrices, as shown in Algorithm 2. In the entire calculation, the S matrix is divided into three parts. The first part is a diagonal matrix calculated in Line 2 of Algorithm 2, which is stored as a vector, as shown in Fig. 4(a). The second part is a block-diagonal matrix computed in Line 9 of Algorithm 2. Only half of each diagonal block is stored, as shown in Fig. 4(b). The last part is a partial accumulation of the S matrix, as shown in Line 15 of Algorithm 2. Because the S matrix is symmetric positive definite, only about half of S is stored, as shown in Fig. 4(c). To simplify the hardware control design, the diagonal blocks are stored in full in this part, which keeps the computation regular across diagonal and off-diagonal blocks.
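A small Python sketch of the conversion described above is given below. It assumes the observations are available as (point, image) index pairs, which is a hypothetical input layout; the example data are made up, chosen so that the second point reproduces the {1, 2}, $CO_2 = 2$ case mentioned for Fig. 3(a). The block offsets illustrate how the sets and $CO_i$ can address the values array when blocks are stored point by point.

```python
from typing import Dict, Iterable, List, Tuple


def co_observations(observations: Iterable[Tuple[int, int]]
                    ) -> Tuple[Dict[int, List[int]], Dict[int, int], Dict[int, int]]:
    """Build co-observation sets, co-observation values and block offsets.

    observations: (point index, image index) pairs, one per Jacobian block row.
    Returns (sets, counts, offsets), all keyed by point index in ascending order.
    """
    sets: Dict[int, List[int]] = {}
    for point, image in observations:
        sets.setdefault(point, []).append(image)
    # Sort points and images so the layout is deterministic.
    sets = {p: sorted(set(imgs)) for p, imgs in sorted(sets.items())}
    counts = {p: len(imgs) for p, imgs in sets.items()}
    # Offset (in blocks) of each point's first block: a running sum of CO values.
    offsets: Dict[int, int] = {}
    running = 0
    for p, co in counts.items():
        offsets[p] = running
        running += co
    return sets, counts, offsets


# Hypothetical data: point 1 seen by image 1, point 2 by images 1 and 2,
# point 3 by image 2.
co_sets, co_counts, co_offsets = co_observations([(1, 1), (2, 1), (2, 2), (3, 2)])
```

Because the block size is fixed, multiplying an offset by the number of values per block gives the position of a point's data in values, so no separate block-starts array is needed.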
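To make the "about half of S" storage concrete, the sketch below indexes the 6 × 6 blocks of a symmetric S in a packed row-major upper-triangular layout that keeps every diagonal block whole. The exact layout of Fig. 4(c) is not described in this text, so this particular ordering and the function names are assumptions.

```python
def packed_block_index(j: int, k: int, b: int) -> int:
    """Index of block (j, k) of a symmetric b x b block matrix in packed storage.

    Blocks are stored row by row for j <= k (0-based), with diagonal blocks
    kept in full. A request for (j, k) with j > k maps to block (k, j), whose
    transpose holds the same values.
    """
    if not (0 <= j < b and 0 <= k < b):
        raise IndexError("block index out of range")
    if j > k:
        j, k = k, j
    # Rows 0..j-1 contribute b, b-1, ..., b-j+1 blocks before row j starts.
    return j * b - j * (j - 1) // 2 + (k - j)


def packed_block_count(b: int) -> int:
    """Number of stored blocks: the upper triangle including the diagonal."""
    return b * (b + 1) // 2


order = [packed_block_index(j, k, 3) for j in range(3) for k in range(j, 3)]
stored_values = packed_block_count(50) * 36   # 6 x 6 values per block
dense_values = 50 * 50 * 36
```

For b = 50 images this stores 45900 values instead of 90000, slightly more than half because the diagonal blocks are not halved.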
V. FPGA IMPLEMENTATION OF SCHUR ELIMINATION WITH CO-OBSERVATION OPTIMIZATION
The customized SE hardware implementation is sketched in Fig. 5. The input buffer temporarily stores the data of the Jacobian matrix transmitted from the off-chip memory. The middle part, called the SE processing element, performs the main computation of SE. The accumulation unit computes Line 2 of Algorithm 2 and adds up the intermediate computational results of S and r, respectively. The output buffer temporarily stores S and r, which need to be transferred back to the off-chip memory. Next, we present the detailed design of the SE processing element (PE). Because the co-observation value $CO_i$ affects the speed, computation and memory resource usage of the PE, we propose a Co-Observation Optimization design method.
As shown in Fig. 5, the PE is partitioned into four stages according to data dependencies and as-soon-as-possible scheduling. The first stage performs the computations of Lines 4-11 in Algorithm 2. In the second stage, the inverse of $\mathbf{U}_i$ is calculated, corresponding to Line 12 of the algorithm. The matrix computation $-\mathbf{W}_{ij} \times inv$, where inv is the inverse from the second stage, is completed in the third stage. The fourth stage completes the remaining computations of Lines 14 and 16. The matrix multiplications and additions in the matrix S processing unit (SPU) in the fourth stage are fully parallelized.
The latencies of the four stages are affected by $CO_i$. The latencies of the first and third stages are $36\,CO_i$ cycles each. The latency of the fourth stage is $18(CO_i^2 + CO_i)$ cycles. The latency of the second stage is 70 cycles, independent of $CO_i$. The fourth stage is therefore the bottleneck. To speed up the operations, the SPU is duplicated for parallel computation on different data. The number of duplications depends on $CO_i$ and the available hardware resources. A mismatch between the number of SPUs and $CO_i$ leads to either slow performance or inefficient resource usage. In practice, $CO_i$ varies significantly among points, as shown in the results section. Therefore, the structure of the PE should be optimized to match the $CO_i$ of the majority of points.
In our tested data, more than 30% of the points have $CO_i = 2$. We customize one PE mainly for processing points with $CO_i = 2$. The number of SPUs (q) of the PE is roughly determined by Eq. (9), resulting in q = 2.
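The stage latencies quoted above can be turned into a small model. Eq. (9) is not reproduced in this text; the rule used below, which picks the smallest q such that the fourth-stage latency divided by q does not exceed the slowest of the other three stages, is only one plausible reading. It does reproduce q = 2 for $CO_i = 2$. The function names are hypothetical.

```python
import math
from typing import Tuple


def stage_latencies(co: int) -> Tuple[int, int, int, int]:
    """Cycle counts of the four PE stages for a point with co-observation value co."""
    return 36 * co, 70, 36 * co, 18 * (co * co + co)


def spus_needed(co: int) -> int:
    """Assumed balancing rule: enough SPUs that stage 4 keeps up with stages 1-3."""
    s1, s2, s3, s4 = stage_latencies(co)
    return max(1, math.ceil(s4 / max(s1, s2, s3)))


latencies_co2 = stage_latencies(2)
q_co2 = spus_needed(2)
q_co10 = spus_needed(10)
```

For $CO_i = 2$ the stages take 72, 70, 72 and 108 cycles, so two SPUs bring the fourth stage down to the level of the others; under the same rule a PE dedicated to $CO_i = 10$ would want six SPUs, which illustrates why resources limit q in practice.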
In addition, the amount of intermediate computational results is related to $CO_i$, as shown in Lines 7-11 of Algorithm 2. As a result, the size of the RAMs in a PE is determined by the maximum $CO_i$ processed by the PE. In the PE customized for $CO_i = 2$, the on-chip memory usage can therefore be reduced.
Ideally, PEs can be customized with different q for the points with larger $CO_i$ accordingly. In practice, q is determined not only by Eq. (9), but also by the available resources. When the available resources are not sufficient, SPU duplication is not feasible. In this case, the first three stages of the PE can be slowed down to save computational resources. For example, one multiplier and one adder can be removed from the third stage when the fourth stage is the bottleneck. Moreover, the second stage involves a matrix inverse computation. In the original software implementation, Cholesky factorization is used, which contains complex operations such as square root and division. In particular, the multiple square root operations have interdependencies and cannot be parallelized. To solve this issue, and given that the matrix is only 3 × 3, we invert it using the determinant and the adjoint matrix: $\mathbf{U}_i^{-1} = \operatorname{adj}(\mathbf{U}_i) / \det(\mathbf{U}_i)$.
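The adjoint-based inversion mentioned above can be written out in a few lines of Python. This is a software illustration of the formula only, not the hardware datapath; the function name and the test matrix are hypothetical. Note that it needs a single division and no square roots, and that the nine cofactors are independent of each other.

```python
from typing import List

Mat = List[List[float]]


def inverse_3x3(U: Mat) -> Mat:
    """Invert a 3x3 matrix as adj(U) / det(U)."""
    (a, b, c), (d, e, f), (g, h, i) = U
    # Adjoint (adjugate) matrix: the transpose of the cofactor matrix.
    adj = [[e * i - f * h, c * h - b * i, b * f - c * e],
           [f * g - d * i, a * i - c * g, c * d - a * f],
           [d * h - e * g, b * g - a * h, a * e - b * d]]
    # Determinant by expansion along the first row, reusing three cofactors.
    det = a * adj[0][0] + b * adj[1][0] + c * adj[2][0]
    if det == 0.0:
        raise ValueError("matrix is singular")
    inv_det = 1.0 / det  # the only division
    return [[value * inv_det for value in row] for row in adj]


U_example = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
U_example_inv = inverse_3x3(U_example)
product = [[sum(U_example[r][k] * U_example_inv[k][c] for k in range(3))
            for c in range(3)] for r in range(3)]
```

The product of the example matrix and its computed inverse should be the identity up to rounding error.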
VI. EXPERIMENTAL RESULTS
The proposed hardware-software co-designed BA implementation is evaluated on a Zynq xc7z030sbg485-1 FPGA platform. The software part, which uses double-precision floating-point numerical representation, is executed on an ARM Cortex-A9 at 667 MHz. The hardware part, which uses single-precision floating-point numerical representation, is synthesized and implemented on the programmable logic with a maximum clock frequency of 180 MHz. For the hardware part, three versions are implemented: one PE with one SPU (named Schur 1), one PE with two SPUs (named Schur 2), and two PEs each having two SPUs (named Schur 3).
Table I. Number of images, points and observations of the five datasets
Dataset  Images  Points  Observations
1        16      22106   83718
2        21      11315   36455
3        39      18060   63551
4        49      7776    31843
5        50      20431   73967
Our implementation is compared with a software implementation that uses the non-linear optimization library Ceres-Solver [15]. The software implementation, which uses double-precision floating-point numerical representation, runs on an Intel Pentium G2030 CPU at 3.0 GHz with 4 GB of RAM and on an ARM Cortex-A9 core, respectively.
The experimental results include three parts. The first part analyzes the characteristics of datasets, showing the distribution of CO i . The second part shows the resource usage of the designs. The last part evaluates the speed and power consumption of the designs.
A. Dataset analysis
The datasets used in our experiments are from Bundle Adjustment in the Large (BAL) [2]. We choose five datasets, each containing at most 50 input images. This is because local BA on 50 images is enough for SLAM applications and for simple-scene SfM. The numbers of images, points and observations of the five datasets are shown in Table I. The five datasets contain different combinations of images, points and observations, and represent different scenes.
As mentioned earlier, the co-observation value $CO_i$ affects the speed, efficiency and RAM usage of the Schur elimination processing element. Therefore, we first perform a statistical analysis of the datasets. From Table II we can see that, for the five datasets, about 50% of the points are observed in two images ($CO_i = 2$). As the $CO_i$ value increases, the percentage decreases. That is, the probability of a point appearing in a large number of images is low. This feature is exploited in our design to customize the hardware.
In our design Schur 3 with two PEs, the first PE is customized mainly for processing points with $CO_i = 2$. Due to the on-chip memory resource limitation of the target FPGA device, the second PE is designed with only two SPUs. Because the processing time differs for points with different $CO_i$, the workloads of the two PEs need to be balanced. We design a software controller to assign workloads. For the five datasets, points with $CO_i \in [2, 10]$ are assigned to the first PE, while points with $CO_i \in [5, 50]$ are assigned to the second PE. Here the upper bound is set to 50 because the maximum $CO_i$ is 50 for datasets with up to 50 images.
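The two ranges overlap for $CO_i$ between 5 and 10, which leaves the software controller room to balance the PEs. The policy of the controller is not described in this text; the Python sketch below is a hypothetical greedy policy that sends each point in the overlap to the PE with the smaller accumulated load, using the fourth-stage latency $18(CO_i^2 + CO_i)$ as the cost estimate. Names and example values are made up.

```python
from typing import List, Sequence, Tuple


def assign_points(co_values: Sequence[int],
                  small_range: Tuple[int, int] = (2, 10),
                  large_range: Tuple[int, int] = (5, 50)
                  ) -> Tuple[List[str], Tuple[int, int]]:
    """Greedily assign points to the 'small' or 'large' PE by estimated load."""
    load_small = load_large = 0
    assignment: List[str] = []
    for co in co_values:
        fits_small = small_range[0] <= co <= small_range[1]
        fits_large = large_range[0] <= co <= large_range[1]
        if not (fits_small or fits_large):
            raise ValueError(f"no PE accepts co-observation value {co}")
        cost = 18 * (co * co + co)  # assumed cost: fourth-stage latency in cycles
        # Points in the overlap go to whichever PE currently has less work.
        if fits_small and (not fits_large or load_small <= load_large):
            assignment.append("small")
            load_small += cost
        else:
            assignment.append("large")
            load_large += cost
    return assignment, (load_small, load_large)


pe_assignment, pe_loads = assign_points([2, 2, 3, 7, 12, 6])
```

With this policy the point with $CO_i = 7$ goes to the large PE while it is idle, and the later point with $CO_i = 6$ goes back to the small PE once the large PE has accumulated more work.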
B. Resource usage
The resource usage of the three hardware designs of the Schur elimination module is reported in Table III. From the table we can make the following observations. First, compared to a double-precision floating-point implementation, using single-precision floating point saves 14% of FFs, 14% of LUTs, 25% of BRAMs and 28% of DSPs, while the computation accuracy of the Schur elimination is not affected: our experiments show that the difference in the Frobenius norm of matrix S between the double-precision and single-precision implementations is within $10^{-6}$. Second, comparing Schur 2 with Schur 1, duplicating the SPU increases the computational resource usage by up to 5%, while the BRAM usage increases significantly, by 18%. This is because two SPUs lead to almost double the storage space for matrix S. Third, in Schur 3, the two PEs are customized, leading to DSP or BRAM savings. PE Small corresponds to the PE processing points with $2 \le CO_i \le 10$. It saves 13.5 BRAM blocks compared to the PE in Schur 2, which processes points with $2 \le CO_i \le 50$. PE Large corresponds to the PE processing points with $5 \le CO_i \le 50$. It saves 5 DSP blocks compared to the PE in Schur 2. The DSP saving can be increased if the lower bound increases. Lastly, for the parallel Schur elimination implementation, on-chip BRAM is the main limiting resource, and its consumption mainly stems from matrix S storage. This points to a future research direction on reducing matrix S storage.
C. Execution time and power consumption
We evaluate the execution time of Schur elimination with different datasets on different computing platforms. The results are shown in Table IV. On average, Schur 3 is 3.7 times and 2 times faster than Schur 1 and Schur 2, respectively. The hardware implementation Schur 3 is about 3.1 times and 57 times faster than the Intel and ARM implementations, respectively.
Moreover, we also evaluate the performance and power consumption of the BA implementations on the three computing platforms. Averaged over the five datasets, the execution times per BA iteration of the Intel implementation, the ARM implementation and our software-hardware design are 0.11 s, 1.87 s and 1.29 s, respectively. The nominal power consumption of the Intel CPU used is 55 W, while the power consumption of the ARM core and of our design are 1.5 W and 2.8 W, respectively, as reported by the Xilinx power estimator. The Intel CPU implementation is the fastest, but is not suitable for embedded applications due to its high power consumption. Our design is 1.46 times faster than the ARM implementation. Note that in our design the Schur elimination part is accelerated in hardware and the rest of BA is executed on the ARM core. The Schur elimination part takes about 30% of the execution time of BA. In the future, other parts of BA such as the Jacobian update will also be accelerated in hardware to achieve a higher performance improvement.
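The reported numbers can be cross-checked with Amdahl's law. The sketch below assumes that the 30% share of Schur elimination refers to the ARM-only software baseline, which the text does not state explicitly, and it uses the 57x Schur elimination speedup of Schur 3 over the ARM implementation reported above. The function name is hypothetical.

```python
def amdahl_speedup(accelerated_fraction: float, kernel_speedup: float) -> float:
    """Overall speedup when only a fraction of the run time is accelerated."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / kernel_speedup)


# Assumption: Schur elimination is about 30% of the ARM-only BA iteration time.
predicted_speedup = amdahl_speedup(0.30, 57.0)
# Ratio of the reported per-iteration times: ARM (1.87 s) over the co-design (1.29 s).
measured_speedup = 1.87 / 1.29
# Upper bound if Schur elimination took no time at all.
speedup_limit = 1.0 / (1.0 - 0.30)
```

Under this assumption the predicted speedup is about 1.42, close to the measured ratio of about 1.45, and even an infinitely fast Schur elimination unit would cap the gain near 1.43 times. This is consistent with the stated plan to accelerate further parts of BA, such as the Jacobian update.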
VII. CONCLUSION
BA is a fundamental optimization technique used in many crucial applications, and it is often the primary performance and power consumption bottleneck in these applications. However, due to the complexity of BA algorithms, designing hardware to accelerate BA is extremely challenging. Previous approaches to optimizing BA performance rely heavily on parallel processing or distributed computing, which trade higher power consumption for higher performance. In this paper, we presented π-BA, the first hardware-software co-designed BA engine on an embedded FPGA-SoC. Specifically, we developed a novel Co-Observation Optimization technique, and experimental results confirmed that π-BA outperformed existing BA solutions in both performance and power consumption. π-BA can enable more robotic localization and visual reconstruction applications by allowing larger-scale online local BA on power-constrained embedded devices and more efficient offline global BA that uses fewer computing resources and less power.
