Minimal Solutions for Relative Pose with a Single Affine Correspondence
Abstract
In this paper we present four cases of minimal solutions for two-view relative pose estimation by exploiting the affine transformation between feature points, and we demonstrate efficient solvers for these cases. It is shown that under the planar motion assumption, or with knowledge of a vertical direction, a single affine correspondence is sufficient to recover the relative camera pose. The four cases considered are two-view planar relative motion for calibrated cameras as a closed-form and a least-squares solution, a closed-form solution for unknown focal length, and the case of a known vertical direction. These algorithms can be used efficiently for outlier detection within a RANSAC loop and for initial motion estimation. All the methods are evaluated on both synthetic data and real-world datasets from the KITTI benchmark. The experimental results demonstrate that our methods outperform comparable state-of-the-art methods in accuracy, with the benefit of a reduced number of RANSAC iterations.
1 Introduction
Simultaneous localization and mapping (SLAM), visual odometry (VO) and Structure-from-Motion (SfM) have been active research topics in computer vision for decades [32, 33]. These technologies have been used successfully in a wide variety of applications and they play an important role in future technologies like autonomous driving. Relative pose estimation from two views is regarded as a fundamental algorithm and is an essential part of SLAM and SfM pipelines. Thus, improving the accuracy, efficiency and robustness of relative pose estimation algorithms remains of significant interest [1, 5, 34].
Most SLAM and SfM pipelines follow the scheme where 2D-2D putative correspondences between subsequent views are established by feature matching. Then a robust motion estimation framework such as Random Sample Consensus (RANSAC) [13] is typically adopted to identify and remove matching outliers. Finally, only inlier matches between subsequent views are used to estimate the final relative pose [32]. This outlier removal step is critical for the robustness and reliability of the pose estimation step. Moreover, the efficiency of the outlier removal process directly affects the real-time performance of SLAM and SfM, in particular because the computational complexity of the RANSAC estimator increases exponentially with the number of data points required per sample. Thus minimal-case solutions for relative pose estimation are still of significant importance [4, 5, 38, 11].
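The dependence of RANSAC's effort on the minimal sample size can be made concrete with the standard iteration-count formula; the sketch below (inlier ratio and confidence values are illustrative, not from the paper) shows why shrinking the sample from five correspondences to one pays off so strongly:

```python
import math

def ransac_iterations(inlier_ratio: float, sample_size: int, confidence: float = 0.99) -> int:
    """Number of RANSAC iterations needed to draw at least one
    all-inlier minimal sample with the given confidence."""
    return math.ceil(math.log(1.0 - confidence)
                     / math.log(1.0 - inlier_ratio ** sample_size))

# With 50% inliers, compare minimal sample sizes (e.g. 5-point vs. 1-AC solvers):
for s in (5, 3, 2, 1):
    print(s, ransac_iterations(0.5, s))   # 5 -> 146 iterations, 1 -> 7 iterations
```

The iteration count grows exponentially in the sample size, which is the efficiency argument behind single-correspondence solvers.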
The idea of minimal solutions for relative pose estimation dates back to the work of Hartley and Zisserman with the seven-point method [18]. Other classical works are the five-point method [27] and the homography estimation method [18]. By exploiting motion constraints on camera movements or utilizing an additional sensor like an inertial measurement unit (IMU), the minimal number of point correspondences needed can be further reduced, which makes the outlier removal more efficient and numerically more stable. For instance, two points are sufficient to recover camera motion under the planar motion assumption since the pose only has two degrees of freedom (DOF) [28, 8, 9]. Another example is to make use of the Ackermann steering principle, which allows us to parameterize the camera motion with only one point correspondence [31, 19]. These scenarios are typical for self-driving vehicles and ground robots. For unmanned aerial vehicles (UAV) and smartphones, a camera is often used in combination with an IMU. The partial IMU measurements can be used to provide a known gravity direction for the camera images. In this case relative pose estimation is possible with only three point correspondences [14, 26, 37, 30].
It is now possible to replace simple point correspondences with matches from affine-covariant feature detectors, such as ASIFT [24] and MODS [23]. Such an affine correspondence (AC) consists of a point correspondence and an affine transformation, see Figure 1. It has been proven that one AC yields three constraints on the geometric model estimation [7, 29, 2]. In this paper we exploit these additional affine parameters in the process of relative pose estimation, which allows us to reduce the number of correspondences needed. We propose the following novel minimal solutions for relative pose estimation using a single affine correspondence:

Three solvers under the planar motion constraint are proposed. We prove that a single affine correspondence is sufficient to recover the planar motion of a calibrated camera (2 DOF) and of a partially uncalibrated camera for which only the focal length is unknown (3 DOF).

A fourth solver for the case of a known vertical direction is proposed. The ego-motion estimation of a calibrated camera with a common direction has 3 DOF, and we show that only a single affine correspondence is required to estimate the relative pose in this case.
The remainder of the paper is organized as follows. First we review related work in Section 2. We propose three minimal solutions for planar motion estimation in Section 3. In Section 4, we propose a minimal solution for two-view relative motion estimation with known vertical direction. In Section 5, we evaluate the performance of the proposed methods using both synthetic and real-world datasets. Finally, concluding remarks are given in Section 6.
2 Related Work
For uncalibrated cameras, a minimum of 7 point correspondences is required to estimate the fundamental matrix [18]. If the camera is partially uncalibrated such that only the common focal length is unknown, a minimum of 6 point correspondences is required to estimate the relative pose [35]. For calibrated cameras, at least 5 point correspondences are needed to estimate the essential matrix [27]. If all the 3D points lie on a plane, the point correspondences are related by a planar homography and the number of required point correspondences is reduced to 4 [18]. The relative pose of the two views can be recovered by decomposing the essential matrix or the homography.
To further improve the computational efficiency and reliability of relative pose estimation, assumptions about the camera motion or additional information can help to reduce the number of required point correspondences across views. For example, if the camera is mounted on ground robots and follows planar motion, the relative pose of two views has only 2DOF and can be estimated by using 2 point correspondences [28, 8, 9]. By taking into account the Ackermann motion model, only 1 point correspondence is sufficient to recover the camera motion [31].
When additional information can be provided by an additional sensor, such as an IMU, the DOF of relative pose estimation can also be reduced. If the rotation of the camera is fully provided by an IMU, only the translation of the two views is unknown and can easily be solved with 2 point correspondences [20]. It is more often the case that a common direction of rotation is assumed to be known. This common direction can be determined from an IMU (which provides the known pitch and roll angles of the camera), but also from vanishing points extracted across the two views. When the common direction of rotation is known, a variety of algorithms have been proposed to estimate the relative pose utilizing this information [14, 26, 37, 30, 16, 10].
Recently, a number of methods have been proposed which reduce the number of required points by exploiting the additional affine parameters between two feature matches. This additional information can come from the feature's rotation and scale estimates when the SIFT [22] or SURF [6] feature detectors are used. From five point correspondences extended by the rotational angles of the features, the essential matrix can be computed [5]. Similarly, the homography can be estimated from two correspondences when including the corresponding rotational angles and scales of the features [3]. Of high interest are methods which use affine correspondences obtained by an affine-covariant feature detector, such as ASIFT [24] and MODS [23]. One AC yields three constraints on the geometric model estimation. This allows the estimation of a fundamental matrix from 3 ACs [7]. The estimation of a homography and an essential matrix can be accomplished from 2 ACs [29, 12, 2]. Furthermore, it is shown in [29] that ACs have benefits compared to point correspondences for visual odometry in the presence of many outliers.
3 Relative Pose Estimation Under Planar Motion
For planar motion, shown in Figure 2, we derive three minimal solvers by exploiting only one affine correspondence. (1) We develop two minimal solvers for calibrated cameras. Since one AC provides three independent equations and the pose has only two unknowns, the equation system is overdetermined. We propose two variants for this scenario: a closed-form solution and a least-squares solution. (2) For uncalibrated cameras where only the focal length is unknown, we propose a minimal solver as well.
3.1 Solver for Planar Motion with Calibrated Camera
With known intrinsic camera parameters, the epipolar constraint between views 1 and 2 is given as follows [18]:

$$\mathbf{x}_2^\top \mathbf{E}\,\mathbf{x}_1 = 0, \qquad (1)$$

where $\mathbf{x}_1 = [u_1, v_1, 1]^\top$ and $\mathbf{x}_2 = [u_2, v_2, 1]^\top$ are the normalized homogeneous image coordinates of a feature point in views 1 and 2, respectively. $\mathbf{E} = [\mathbf{t}]_\times \mathbf{R}$ is the essential matrix, where $\mathbf{R}$ and $\mathbf{t}$ represent the relative rotation and translation, respectively.
For planar motion, we assume without loss of generality that the image plane of the camera is perpendicular to the ground plane, see Figure 2. There is only a Y-axis rotation and a 2D translation between two different views, so the rotation and the translation from view 1 to view 2 can be written as:

$$\mathbf{R} = \begin{bmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{bmatrix}, \qquad (2)$$

$$\mathbf{t} = \rho\,[\cos\phi,\; 0,\; \sin\phi]^\top, \qquad (3)$$

where $\rho$ is the distance between views 1 and 2. Based on Eqs. (2) and (3), the essential matrix under planar motion is reformulated as:

$$\mathbf{E} = [\mathbf{t}]_\times \mathbf{R} = \rho \begin{bmatrix} 0 & -\sin\phi & 0 \\ \sin(\theta+\phi) & 0 & -\cos(\theta+\phi) \\ 0 & \cos\phi & 0 \end{bmatrix}. \qquad (4)$$
By substituting the above equation into Eq. (1), the epipolar constraint can be written as:

$$-u_2 v_1 \sin\phi + v_1 \cos\phi + u_1 v_2 \sin(\theta+\phi) - v_2 \cos(\theta+\phi) = 0, \qquad (5)$$

where $\mathbf{x}_1 = [u_1, v_1, 1]^\top$, $\mathbf{x}_2 = [u_2, v_2, 1]^\top$, and the scale $\rho$ has been dropped since the constraint is homogeneous.
Moreover, widely used affine-covariant feature detectors, e.g. ASIFT [24], provide affine correspondences between two views directly. Here, we exploit the affine transformation in the relative pose estimation under planar motion to further reduce the number of required point correspondences. First, we introduce the affine correspondence, which is considered as a triplet $(\mathbf{x}_1, \mathbf{x}_2, \mathbf{A})$. The local affine transformation $\mathbf{A}$, which relates the patches surrounding $\mathbf{x}_1$ and $\mathbf{x}_2$, is defined as the Jacobian of the image-to-image mapping [5]:

$$\mathbf{A} = \begin{bmatrix} \partial u_2 / \partial u_1 & \partial u_2 / \partial v_1 \\ \partial v_2 / \partial u_1 & \partial v_2 / \partial v_1 \end{bmatrix}. \qquad (6)$$
The relationship between the essential matrix and the local affine transformation can be described as follows [2]:

$$\mathbf{A}^\top (\mathbf{l}_2)_{(1:2)} + (\mathbf{l}_1)_{(1:2)} = \mathbf{0}, \qquad (7)$$

where $\mathbf{l}_1 = \mathbf{E}^\top \mathbf{x}_2$ and $\mathbf{l}_2 = \mathbf{E}\,\mathbf{x}_1$ are the epipolar lines in views 1 and 2, respectively, and $(\cdot)_{(1:2)}$ denotes the first two components of a vector. $\mathbf{A}$ is a $2 \times 2$ matrix:

$$\mathbf{A} = \begin{bmatrix} a_1 & a_2 \\ a_3 & a_4 \end{bmatrix}. \qquad (8)$$
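The relation in Eq. (7) can be checked numerically. The sketch below assumes the notation used here, i.e. the constraint $\mathbf{A}^\top(\mathbf{E}\mathbf{x}_1)_{(1:2)} + (\mathbf{E}^\top\mathbf{x}_2)_{(1:2)} = \mathbf{0}$; the affine correspondence is generated from a plane-induced homography with hypothetical scene values, and $\mathbf{A}$ is obtained by finite differences:

```python
import numpy as np

def skew(v):
    return np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])

def Ry(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

R = Ry(0.3)                       # relative rotation (any rotation works here)
t = np.array([1.0, 0.2, 0.5])     # relative translation
E = skew(t) @ R                   # essential matrix

# A plane n^T X = d in the first camera induces the homography H = R + t n^T / d,
# from which a consistent affine correspondence can be derived.
n, d = np.array([0.1, -0.9, 0.3]), 5.0
H = R + np.outer(t, n) / d

def warp(uv):
    p = H @ np.array([uv[0], uv[1], 1.0])
    return p[:2] / p[2]

x1 = np.array([0.2, -0.1, 1.0])            # normalized coordinates in view 1
x2 = np.append(warp(x1[:2]), 1.0)          # corresponding point in view 2

# A = Jacobian of the warp (the local affine transformation), by central differences
eps = 1e-6
A = np.column_stack([(warp(x1[:2] + eps * e) - warp(x1[:2] - eps * e)) / (2 * eps)
                     for e in np.eye(2)])

l1 = E.T @ x2     # epipolar line in view 1
l2 = E @ x1       # epipolar line in view 2
residual = A.T @ l2[:2] + l1[:2]
print(np.abs(residual).max())     # close to zero, up to finite-difference error
```

The residual vanishes up to finite-difference error, confirming that one AC contributes the two extra constraints beyond the point-wise epipolar constraint.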
By substituting Eq. (4) into Eq. (7), two equations which relate the affine transformation to the relative pose are obtained:

$$-a_1 v_1 \sin\phi + (a_3 u_1 + v_2)\sin(\theta+\phi) - a_3 \cos(\theta+\phi) = 0, \qquad (9)$$

$$-(a_2 v_1 + u_2)\sin\phi + \cos\phi + a_4 u_1 \sin(\theta+\phi) - a_4 \cos(\theta+\phi) = 0, \qquad (10)$$

where $a_1, \ldots, a_4$ are the entries of $\mathbf{A}$ in row-major order.
3.1.1 Closed-Form Solution
For an affine correspondence, the combination of Eqs. (5), (9) and (10) can be expressed as $\mathbf{M}\mathbf{x} = \mathbf{0}$, where $\mathbf{x} = [\sin\phi,\; \cos\phi,\; \sin(\theta+\phi),\; \cos(\theta+\phi)]^\top$ and

$$\mathbf{M} = \begin{bmatrix} -u_2 v_1 & v_1 & u_1 v_2 & -v_2 \\ -a_1 v_1 & 0 & a_3 u_1 + v_2 & -a_3 \\ -(a_2 v_1 + u_2) & 1 & a_4 u_1 & -a_4 \end{bmatrix}. \qquad (11)$$

By ignoring the implicit constraints between the entries of $\mathbf{x}$, i.e., $\sin^2\phi + \cos^2\phi = 1$ and $\sin^2(\theta+\phi) + \cos^2(\theta+\phi) = 1$, $\mathbf{x}$ should lie in the null space of $\mathbf{M}$. Thus the solution of the system can be obtained directly by using the singular value decomposition (SVD) of the matrix $\mathbf{M}$. Once $\mathbf{x}$ has been obtained by SVD, the angles $\phi$ and $\theta$ are

$$\phi = \arctan2(x_1, x_2), \qquad \theta = \arctan2(x_3, x_4) - \phi. \qquad (12)$$
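The closed-form solver above can be sketched end to end. The sketch assumes the parameterization used here (Y-axis rotation $\theta$, translation direction $\phi$, unit baseline); the ground-plane point, plane depth and motion values are hypothetical, and the affine transformation is obtained by finite differences through the plane-induced homography:

```python
import numpy as np

def Ry(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

# Ground-truth planar motion (hypothetical values)
theta, phi = 0.3, 0.2
R = Ry(theta)
t = np.array([np.cos(phi), 0.0, np.sin(phi)])   # rho = 1

# One affine correspondence from a ground-plane point (plane n^T X = d),
# generated through the plane-induced homography H = R + t n^T / d
n, d = np.array([0.0, 1.0, 0.0]), 1.5
H = R + np.outer(t, n) / d

def warp(uv):
    p = H @ np.array([uv[0], uv[1], 1.0])
    return p[:2] / p[2]

u1, v1 = 0.2, 0.4                               # normalized coordinates in view 1
u2, v2 = warp((u1, v1))
eps = 1e-6                                      # finite differences for the 2x2 affinity
A = np.column_stack([(warp(np.array([u1, v1]) + eps * e)
                      - warp(np.array([u1, v1]) - eps * e)) / (2 * eps)
                     for e in np.eye(2)])
a1, a2, a3, a4 = A[0, 0], A[0, 1], A[1, 0], A[1, 1]

# Coefficient matrix of Eqs. (5), (9), (10) in
# x = [sin(phi), cos(phi), sin(theta+phi), cos(theta+phi)]
M = np.array([[-u2 * v1,       v1,  u1 * v2,      -v2],
              [-a1 * v1,       0.0, a3 * u1 + v2, -a3],
              [-a2 * v1 - u2,  1.0, a4 * u1,      -a4]])

x = np.linalg.svd(M)[2][-1]                     # right null vector of M
x *= np.sign(x[1])                              # resolve the sign so that cos(phi) > 0
phi_est = np.arctan2(x[0], x[1])
theta_est = np.arctan2(x[2], x[3]) - phi_est
print(theta_est, phi_est)                       # recovers theta = 0.3, phi = 0.2
```

A single AC thus fixes both angles; in practice the null vector is only defined up to sign, which here is resolved by assuming a forward-facing motion (positive $\cos\phi$).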
3.1.2 Least-Squares Solution
Eqs. (5), (9) and (10), together with the implicit constraints of the trigonometric functions, can be reformulated as:

$$\mathbf{M}\mathbf{x} = \mathbf{0}, \qquad x_1^2 + x_2^2 = 1, \qquad x_3^2 + x_4^2 = 1. \qquad (13)$$

The entries of $\mathbf{M}$ denote the problem coefficients in Eqs. (5), (9) and (10). This equation system has four unknowns and five independent constraints, thus it is overconstrained. We find the least-squares solution by

$$\min_{\mathbf{x}} \|\mathbf{M}\mathbf{x}\|^2 \qquad (14)$$

$$\text{s.t.} \quad x_1^2 + x_2^2 = 1, \quad x_3^2 + x_4^2 = 1.$$

The Lagrange multiplier method is used to find all stationary points of problem (14). The Lagrangian is

$$L(\mathbf{x}, \lambda_1, \lambda_2) = \|\mathbf{M}\mathbf{x}\|^2 + \lambda_1 (1 - x_1^2 - x_2^2) + \lambda_2 (1 - x_3^2 - x_4^2). \qquad (15)$$

By taking the partial derivatives with respect to $\mathbf{x}$, $\lambda_1$ and $\lambda_2$ and setting them to zero, we obtain an equation system in the six unknowns $x_1, \ldots, x_4, \lambda_1, \lambda_2$ of maximum degree two, see the supplementary material. A Gröbner basis solver for this system can be obtained by an automatic solver generator [21], which also shows that there are at most 8 solutions.
3.2 Solver for Planar Motion and Unknown Focal Length
In this subsection, we assume a camera with known intrinsic parameters except for an unknown focal length. This case is typical in practice: for most cameras, it is reasonable to assume square-shaped pixels and a principal point well approximated by the image center [17]. By assuming that the only unknown calibration parameter is the focal length $f$, the intrinsic matrix simplifies to $\mathbf{K} = \mathrm{diag}(f, f, 1)$.

Since the intrinsic matrix is unknown, we cannot obtain the coordinates of point features in the normalized image plane. Recall that the normalized homogeneous image coordinates of the points in views 1 and 2 are $\mathbf{x}_1$ and $\mathbf{x}_2$, respectively. Without loss of generality, we set the principal point as the center of the image plane, and denote the coordinates of a point in the original image planes as $[\hat{u}_1, \hat{v}_1]^\top$ and $[\hat{u}_2, \hat{v}_2]^\top$, respectively. We also denote $q = 1/f$ and obtain the following relations:

$$\mathbf{x}_1 = [q\hat{u}_1,\; q\hat{v}_1,\; 1]^\top, \qquad \mathbf{x}_2 = [q\hat{u}_2,\; q\hat{v}_2,\; 1]^\top. \qquad (16)$$
By substituting Eq. (16) into Eqs. (5), (9) and (10), we again obtain three equations. To reduce the notational burden, we substitute Eq. (11) into the three equations. Combining them with the two trigonometric constraints, we obtain a polynomial equation system as follows:

(17)

The above equation system contains the five unknowns $\sin\phi$, $\cos\phi$, $\sin(\theta+\phi)$, $\cos(\theta+\phi)$ and $q$. A Gröbner basis solver can be obtained by an automatic solver generator [21], which also reports the maximal number of solutions.
4 Relative Pose Estimation with Known Vertical Direction
In this section we present a minimal solution for two-view relative motion estimation with known vertical direction, which uses only one affine correspondence, see Figure 3. In this case, an IMU is coupled with the camera. Assuming the roll and pitch angles of the camera can be obtained directly from the IMU, we can align every camera coordinate system with the measured gravity direction. The Y-axis of the camera is then parallel to the gravity direction and the XZ-plane of the camera is orthogonal to the gravity direction. The rotation matrix for aligning the camera coordinate system with the gravity-referenced coordinate system can be expressed as

$$\mathbf{R}_v = \mathbf{R}_x(\theta_p)\,\mathbf{R}_z(\theta_r),$$

where $\theta_p$ and $\theta_r$ represent the pitch and roll angles, respectively.
Furthermore, denote $\mathbf{R}_{v1}$ and $\mathbf{R}_{v2}$ as the alignment rotations computed from the orientation information delivered by the IMU for views 1 and 2, respectively. Then the aligned image coordinates in views 1 and 2 can be expressed as

$$\bar{\mathbf{x}}_1 = \mathbf{R}_{v1}\mathbf{x}_1, \qquad \bar{\mathbf{x}}_2 = \mathbf{R}_{v2}\mathbf{x}_2. \qquad (18)$$
By leveraging the IMU measurements, the essential matrix between the original views 1 and 2 can be written as

$$\mathbf{E} = \mathbf{R}_{v2}^\top\, \bar{\mathbf{E}}\, \mathbf{R}_{v1}. \qquad (19)$$

Note that $\bar{\mathbf{E}} = [\bar{\mathbf{t}}]_\times \bar{\mathbf{R}}$ denotes the simplified essential matrix between the aligned views 1 and 2, where $\bar{\mathbf{t}}$ is the translation between the aligned views, and $\bar{\mathbf{R}} = \mathbf{R}_y(\theta)$ is the rotation matrix between the aligned views. Now, we substitute Eq. (19) into Eq. (7):
(20) 
By multiplying both sides of Eq. (20) by the alignment rotation matrix, we obtain
(21) 
The above equation can be reformulated based on Eq. (18):

(22)

where $\bar{\mathbf{A}}$ denotes the affine transformation between the aligned image features $\bar{\mathbf{x}}_1$ and $\bar{\mathbf{x}}_2$.
For the further derivation, we denote the entries of the simplified essential matrix $\bar{\mathbf{E}}$ as follows:

(23)
In addition, the epipolar constraint between the aligned views can be written as:

(26)
For an affine correspondence $(\bar{\mathbf{x}}_1, \bar{\mathbf{x}}_2, \bar{\mathbf{A}})$, the combination of Eqs. (24)-(26) can be expressed as a homogeneous linear system, whose solution is

(27)

where the null space basis vectors are computed from the SVD of the coefficient matrix, and $\alpha$ and $\beta$ are the unknown coefficients of their linear combination.
To determine the coefficients $\alpha$ and $\beta$, note that there are two internal constraints on the essential matrix, i.e., the singularity of the essential matrix and the trace constraint:

$$\det(\bar{\mathbf{E}}) = 0, \qquad (28)$$

$$2\,\bar{\mathbf{E}}\bar{\mathbf{E}}^\top\bar{\mathbf{E}} - \mathrm{tr}(\bar{\mathbf{E}}\bar{\mathbf{E}}^\top)\,\bar{\mathbf{E}} = \mathbf{0}. \qquad (29)$$

By substituting Eq. (27) into Eqs. (28) and (29), a polynomial equation system in the unknowns $\alpha$ and $\beta$ can be generated. A straightforward way to solve this system is a general automatic solver generator [21]. Inspired by [14], we use a simpler method to convert the equation system into a univariate quartic equation, see the supplementary material for details. Once the coefficients $\alpha$ and $\beta$ have been obtained, the simplified essential matrix $\bar{\mathbf{E}}$ is determined by Eq. (27) and can be decomposed into $\bar{\mathbf{R}}$ and $\bar{\mathbf{t}}$ by exploiting Eq. (23). Finally, the relative pose between views 1 and 2 is obtained by

$$\mathbf{R} = \mathbf{R}_{v2}^\top\, \bar{\mathbf{R}}\, \mathbf{R}_{v1}, \qquad \mathbf{t} = \mathbf{R}_{v2}^\top\, \bar{\mathbf{t}}. \qquad (30)$$
5 Experiments
The performance of the proposed methods is evaluated using both synthetic and real scene data. To deal with outliers, the minimal solvers can be integrated into a robust estimator such as RANSAC, or used for histogram voting. For histogram voting, we estimate the relative pose by selecting the peak of the histogram formed by the poses estimated from all affine correspondences. For RANSAC, the maximum number of iterations is fixed and the relative pose which produces the highest number of inliers is chosen.
For relative pose estimation under planar motion, the proposed solvers in Section 3.1 are referred to as 1AC-Voting (which uses histogram voting with the closed-form solution), 1AC-CS (which uses RANSAC with the closed-form solution), and 1AC-LS (which uses RANSAC with the least-squares solution). The solver for planar motion with unknown focal length in Section 3.2 is referred to as 1AC-UnknownF, which also uses RANSAC. The comparative methods are 5pt-Nister [27], 2AC-Barath [2] and 2pt-Choi [8]. All comparative methods are integrated into a RANSAC scheme.
For relative pose estimation with known vertical direction, our solver proposed in Section 4 is referred to as the 1AC method. The proposed solver is compared against 5pt-Nister [27], 3pt-Saurer [30], 2pt-Saurer [30] and 2AC-Barath [2]. All of these minimal solvers are integrated into a RANSAC scheme. Due to space limits, the efficiency comparison is provided in the supplementary material.
To demonstrate the suitability of our methods in real scenarios, the KITTI dataset [15] is used to validate the performance.
5.1 Experiments on Synthetic Data
The synthetic scene consists of a ground plane and 50 random planes, which are randomly distributed in the range of -5 to 5 meters (X-axis direction), -5 to 5 meters (Y-axis direction), and 10 to 20 meters (Z-axis direction). 50 points are randomly generated on the ground plane. We randomly choose one point on each random plane, so there are also 50 points on the random planes. The ground truth affine transformation related to each point correspondence is computed from the noisy image coordinates and the ground truth homography [2]. The baseline between the two views is set to 2 meters. The resolution of the simulated camera is 640 x 480 pixels. The focal length is set to 400 pixels and the principal point is set to (320, 240) pixels.
The rotation and translation errors are assessed by their root mean square error (RMSE). We report the results on the data points within the first two intervals of a 5-quantile partitioning (quintiles) of 1000 trials; k-quantiles divide an ordered dataset into k regular intervals. The relative rotation and translation between views 1 and 2 are compared separately in the synthetic experiments. The rotation error measures the angular difference between the ground truth rotation and the estimated rotation. The translation error also measures an angular difference, since the estimated translation between views 1 and 2 is only known up to scale. Specifically, we define:

Rotation error: $\varepsilon_R = \arccos\big((\mathrm{tr}(\mathbf{R}_{gt}^\top \mathbf{R}_{est}) - 1)/2\big)$

Translation error: $\varepsilon_t = \arccos\big(\mathbf{t}_{gt}^\top \mathbf{t}_{est} / (\|\mathbf{t}_{gt}\|\,\|\mathbf{t}_{est}\|)\big)$

In the above criteria, $\mathbf{R}_{gt}$ and $\mathbf{t}_{gt}$ denote the ground truth rotation and translation, and $\mathbf{R}_{est}$ and $\mathbf{t}_{est}$ the corresponding estimates.
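The angular error metrics used in the synthetic evaluation can be computed as follows; the handling of the translation sign ambiguity via an absolute value is an assumption of this sketch:

```python
import numpy as np

def rotation_error_deg(R_gt, R_est):
    # Angle of the residual rotation R_gt^T R_est, clipped for numerical safety
    c = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

def translation_error_deg(t_gt, t_est):
    # Angle between translation directions; the sign of t is ambiguous in
    # two-view geometry, so the absolute cosine is used here (an assumption)
    c = t_gt @ t_est / (np.linalg.norm(t_gt) * np.linalg.norm(t_est))
    return np.degrees(np.arccos(np.clip(abs(c), 0.0, 1.0)))

def Ry(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

print(rotation_error_deg(np.eye(3), Ry(np.radians(5.0))))                      # -> 5.0
print(translation_error_deg(np.array([1.0, 0, 0]), np.array([0.0, 0, 1.0])))   # -> 90.0
```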
5.1.1 Planar Motion Estimation
In this scenario the motion of the camera is described by the angle pair $(\theta, \phi)$, see Figure 2. Both angles vary over the tested range. Figure 4(a) and (b) show the performance of the proposed methods with respect to the magnitude of added image noise. All of our proposed methods for planar motion provide better results than the comparative methods under perfect planar motion. It is worth mentioning that our 1AC-UnknownF method performs better than the comparative methods even though it does not use the ground-truth focal length.
To test the performance of our methods under non-planar motion, we generate the non-planar components of a 6 DOF relative pose randomly and add them to the camera motion; these include the X-axis rotation, the Z-axis rotation, and the direction of the YZ-plane translation [8]. The magnitude of the non-planar motion noise is modeled as Gaussian noise with increasing standard deviation, while the image noise is set to a fixed pixel standard deviation. Figure 4(c) and (d) show the performance of the proposed methods with respect to the magnitude of the non-planar motion noise. The 2AC-Barath method and the 5pt-Nister method do not show an obvious trend over the non-planar noise levels, because both estimate the full 6 DOF relative pose of the two views. The proposed four methods perform better than the 2pt-Choi method and the 5pt-Nister method up to the maximum tested magnitude of non-planar motion noise. Meanwhile, the accuracy of these four methods is also better than that of the 2AC-Barath method when the non-planar motion noise is small.
5.1.2 Motion with Known Vertical Direction
In this set of experiments the directions of the camera motion are set to forward, sideways and random motions, respectively. The second view is rotated around every axis, with the three rotation angles varying over the tested range. The roll and pitch angles are known and used to align the camera coordinate system with the gravity direction. The proposed 1AC method is compared with 5pt-Nister [27], 3pt-Saurer [30], 2pt-Saurer [30] and 2AC-Barath [2]. To save space we only show the results under random motion; the results under forward and sideways motions are available in the supplementary material. Figure 5(a) and (b) show the performance of the proposed method with respect to the magnitude of image noise with perfect IMU data. Our method is robust to increasing image noise and provides clearly better results than the previous methods.
Figure 5(c)-(f) show the performance of the proposed method for increasing noise on the IMU data, while the image noise is set to a fixed pixel standard deviation. The 1AC method consistently outperforms the 2pt-Saurer method and the 3pt-Saurer method. The 2AC-Barath method and the 5pt-Nister method are not influenced by the pitch and roll errors, because their computation does not utilize the known vertical direction as a prior. It is interesting to see that our method performs better than the 2AC-Barath method and the 5pt-Nister method in the random motion case even for substantial rotation noise. Under forward and sideways motion, the accuracy of our method is also better than that of these two methods as long as the rotation noise stays moderate.
5.2 Experiments on Real Data
The performance of our methods on real image data is evaluated on the KITTI dataset [15]. All sequences which provide ground truth data are utilized in these experiments. There are about 23,000 images in total, available as sequences 00 to 10.
5.2.1 Pose Estimation on Image Pairs
Two experimental settings are evaluated on the KITTI dataset: planar motion estimation and relative pose estimation with known vertical direction. ASIFT feature extraction and matching [24] is performed to obtain the affine correspondences between consecutive frames. Both the RANSAC and the histogram voting schemes were tested in this experiment. An inlier threshold in pixels and a fixed number of iterations are set in RANSAC.
In the first experiment, we test the relative pose estimation algorithms under planar motion. The motion estimation results between consecutive images are compared to the corresponding ground truth. The median error over each individual sequence is used to evaluate the performance. The proposed methods are compared with 2pt-Choi [8]. The rotation and translation errors under the planar motion assumption are shown in Table 1, which demonstrates that all of our planar motion methods provide better results than the 2pt-Choi method. The overall performance of the 1AC-Voting method is the best among all the methods; in particular, its rotation accuracy is significantly higher than that of the other methods.
Seq.  2pt-Choi [8]  1AC-CS  1AC-LS  1AC-Voting
(each cell: rotation error, translation error)

00  0.203 5.169  0.133 1.335  0.155 1.345  0.016 1.493 
01  0.150 3.617  0.117 1.135  0.134 1.149  0.010 1.165 
02  0.154 3.364  0.062 1.152  0.082 1.191  0.017 1.029 
03  0.177 6.441  0.084 1.157  0.100 1.152  0.013 1.225 
04  0.115 2.871  0.029 1.132  0.041 1.155  0.012 1.018 
05  0.143 4.407  0.071 1.276  0.085 1.304  0.011 1.614 
06  0.152 3.379  0.051 1.302  0.068 1.340  0.008 1.655 
07  0.127 4.764  0.059 1.487  0.074 1.462  0.014 1.769 
08  0.137 4.312  0.064 1.428  0.081 1.427  0.014 1.591 
09  0.141 3.508  0.062 1.215  0.081 1.218  0.021 1.221 
10  0.145 3.829  0.067 1.299  0.090 1.299  0.018 1.464 
In the second experiment, we test the relative pose estimation algorithm with known vertical direction, i.e., the 1AC method. To simulate IMU measurements, which provide a known gravity direction for the camera views, the image coordinates are pre-rotated using the roll and pitch angles obtained from the ground truth data. Table 2 lists the results of the rotation and translation estimation. The proposed method is compared against 5pt-Nister [27], 3pt-Saurer [30], 2pt-Saurer [30] and 2AC-Barath [2]. Table 2 demonstrates that our method is significantly more accurate than the other methods, except for the translation error on sequences 02 and 10.
Seq.  5pt-Nister [27]  3pt-Saurer [30]  2pt-Saurer [30]  2AC-Barath [2]  1AC method
(each cell: rotation error, translation error)

00  .137 2.254  .153 2.231  .336 7.675  .196 4.673  .038 2.006 
01  .120 1.988  .091 2.211  .186 9.806  .111 4.198  .050 1.507 
02  .134 1.787  .113 1.723  .293 6.034  .251 4.694  .039 1.861 
03  .109 2.507  .161 2.620  .316 9.249  .175 6.064  .041 2.143 
04  .111 1.692  .043 1.616  .141 4.816  .184 4.036  .033 1.538 
05  .116 2.059  .115 1.961  .253 7.238  .162 4.481  .031 1.725 
06  .130 1.783  .111 1.658  .232 5.750  .176 4.026  .046 1.538 
07  .113 2.434  .159 2.217  .378 8.293  .161 4.649  .033 2.009 
08  .122 2.335  .102 2.266  .241 7.556  .182 5.044  .036 2.201 
09  .133 1.843  .176 1.812  .409 6.606  .224 4.924  .045 1.799 
10  .131 1.839  .145 2.004  .308 7.324  .216 4.520  .037 1.935 
5.2.2 Visual Odometry
We demonstrate the use of the 1AC method in a monocular visual odometry pipeline to evaluate its performance in a real application. Our monocular visual odometry is based on ORB-SLAM2 [25]. The affine correspondences extracted by ASIFT feature matching replace the ORB features. The relative pose between two consecutive frames is estimated with the 1AC method inside RANSAC, replacing the original map initialization and the constant velocity motion model. The estimated trajectories after alignment with the ground truth are illustrated in Figure 6. The color along the trajectory encodes the absolute trajectory error (ATE) [36]. Due to space limits, we show the trajectories of two sequences only; the results for the other sequences can be found in the supplementary material. (Both ORB-SLAM2 and our monocular visual odometry fail to produce a valid result for sequence 01, because it is a highway scene with few trackable close objects.) It can be seen that the proposed 1AC method has the smallest ATE among the compared trajectories.
Seq.  ORB-SLAM2 [25]  1AC-SLAM  Seq.  ORB-SLAM2 [25]  1AC-SLAM

00  0.821 0.923  0.803 0.421  06  0.142 1.478  0.126 0.995 
02  0.200 1.052  0.156 0.686  07  0.149 0.879  0.137 0.330 
03  0.113 0.244  0.118 0.185  08  0.177 1.778  0.159 0.659 
04  0.151 0.417  0.097 0.307  09  0.221 0.777  0.172 0.502 
05  0.264 0.681  0.254 0.306  10  0.129 0.633  0.238 1.008 
Moreover, we also evaluate the Relative Pose Error (RPE) between the estimated trajectory and the ground truth trajectory, which measures the relative accuracy of the trajectory over fixed time intervals [36]. The RMSE for rotation and translation using the RPE metric is listed in Table 3. Our monocular visual odometry generally has smaller rotation and translation errors than ORB-SLAM2.
6 Conclusion
In this paper, we showed that by exploiting the affine parameters it is possible to estimate the relative pose of a camera from only one affine correspondence under the planar motion assumption. Three minimal-case solutions have been proposed to recover the planar motion of the camera, among them a solver which can even handle an unknown focal length. In addition, a minimal-case solution has been proposed to estimate the relative pose of a camera for the case of a known vertical direction. The assumptions of these methods are common in scenes in which self-driving cars and ground robots operate. By evaluating our algorithms on synthetic data and real-world image datasets, we demonstrated that our methods can be used efficiently for outlier removal and for initial motion estimation in visual odometry.
Supplementary Material
Appendix A Least-Squares Solution
By taking the partial derivatives of the Lagrangian with respect to $\mathbf{x}$, $\lambda_1$ and $\lambda_2$ and setting them to zero, we obtain an equation system in the unknowns $x_1, \ldots, x_4$ and $\lambda_1, \lambda_2$:

The above equation system contains six unknowns and has maximum degree two.
Appendix B Relative Pose Estimation with Known Vertical Direction
We show the solution procedure for the coefficients $\alpha$ and $\beta$. To derive the solution, we start by substituting Eq. (27) into Eqs. (28) and (29). Six equations from the trace constraint Eq. (29), together with one equation from the singularity of the essential matrix Eq. (28), form a system of seven polynomial equations in the unknowns $\alpha$ and $\beta$ with a maximum polynomial degree of three. First, we stack the polynomial equations in matrix form as

(31)

where the vector contains the ten monomials in $\alpha$ and $\beta$ up to degree three, and the coefficient matrix has size 7 x 10.
Since there are linear dependencies between the elements of the simplified essential matrix, i.e., $\bar{E}_{11} = \bar{E}_{33}$, $\bar{E}_{13} = -\bar{E}_{31}$ and $\bar{E}_{22} = 0$, the rank of the coefficient matrix is only 6. By performing Gaussian elimination and row operations on the linearly independent equations, we set up a new polynomial equation system as follows:
where $p_5(\alpha, \beta)$ and $p_6(\alpha, \beta)$ represent the polynomials in the fifth and sixth rows, respectively.
In order to eliminate the highest-order monomial, we multiply one of the two polynomials by a suitable factor and subtract it from the other:

(32)

Now we obtain a polynomial of degree up to 4 in a single unknown. This unknown has at most 4 solutions, which can be computed as the eigenvalues of the companion matrix of the quartic. The corresponding solution for the other unknown is then obtained directly by back-substitution.
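The companion-matrix step can be sketched generically. The quartic below uses hypothetical coefficients with known roots rather than coefficients from the actual elimination, so it illustrates only the root-finding mechanism:

```python
import numpy as np

def solve_quartic(c):
    """Roots of c[0]*x^4 + c[1]*x^3 + c[2]*x^2 + c[3]*x + c[4] = 0 as the
    eigenvalues of the companion matrix, as used for the univariate
    polynomial obtained after elimination."""
    c = np.asarray(c, dtype=float) / c[0]   # make the polynomial monic
    C = np.zeros((4, 4))
    C[1:, :3] = np.eye(3)                   # ones on the sub-diagonal
    C[:, 3] = -c[4:0:-1]                    # last column: -c4, -c3, -c2, -c1
    return np.linalg.eigvals(C)

# Quartic with known roots 1, -2, 3, 0.5 (coefficients in decreasing powers)
coeffs = np.polynomial.polynomial.polyfromroots([1, -2, 3, 0.5])[::-1]
roots = np.sort(solve_quartic(coeffs).real)
print(roots)   # approximately [-2, 0.5, 1, 3]
```

Eigenvalue-based root extraction is numerically robust, which is why it is preferred over closed-form quartic formulas in solver implementations.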
Appendix C Experiments
C.1 Efficiency Comparison
We evaluate the runtime of our solvers and the comparative solvers on an Intel(R) Core(TM) i7-8550U CPU at 1.80 GHz using MATLAB. All algorithms are implemented in MATLAB, except that 5pt-Nister is implemented in C through a MEX file. All timings are averaged over 10,000 runs. Table 4 summarizes the runtimes of the planar motion estimation algorithms. (The runtimes of the 5pt-Nister method and the 2AC-Barath method are shown in Table 5.) The runtimes of the 1AC-Voting method and the 1AC-CS method are the same and quite low, because both methods use the same solver, whose computational cost is dominated by the singular value decomposition (SVD) of a small matrix. For the 1AC-LS method and the 1AC-UnknownF method, the higher runtimes are due to the complexity of the Gröbner basis solution.
Methods  2pt-Choi [8]  1AC-CS  1AC-LS  1AC-Voting  1AC-UnknownF

Timings  0.098  0.012  0.120  0.012  0.196 
Table 5 summarizes the runtimes of the motion estimation algorithms with known vertical direction. The runtime of the 3pt-Saurer method is higher than that of the 1AC method due to the complexity of the Gröbner basis solution. Since a MEX file is used, the runtime of the 5pt-Nister method is low. The runtime of the 1AC method is significantly lower than that of the 2AC-Barath method, because the essential matrix between two views is simplified when the common direction of rotation is known, and we use a low-complexity approach to solve for the essential matrix as shown in Section B.
Methods      | 5pt-Nister [27] | 3pt-Saurer [30] | 2pt-Saurer [30] | 2AC-Barath [2] | 1AC method
Timings (ms) | 0.118           | 2.066           | 0.097           | 65.101         | 1.212

Table 5: Average runtimes of the solvers with known vertical direction, in milliseconds.
C.2 Motion with Known Vertical Direction
C.3 Visual Odometry
Here we show additional trajectories for the experiments on the KITTI dataset (both ORB-SLAM2 and our monocular visual odometry fail to produce a valid result for sequence 01, because it is a highway scene with few trackable close objects); see Figure 9. The proposed 1AC method has the smallest ATE among all compared methods.
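The ATE metric used for this comparison can be sketched as the RMSE of position differences between the estimated and ground-truth trajectories. This is a simplified illustration with made-up coordinates; it assumes the two trajectories are already aligned (e.g., by a similarity transform), a step that the standard benchmark evaluation performs first.

```python
import numpy as np

def ate_rmse(gt_positions, est_positions):
    """Absolute trajectory error: RMSE over per-frame position differences."""
    diff = np.asarray(gt_positions) - np.asarray(est_positions)
    return float(np.sqrt(np.mean(np.sum(diff**2, axis=1))))

# Toy 3-frame trajectories (already aligned); values are illustrative only.
gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
est = np.array([[0.0, 0.0, 0.0], [1.0, 0.1, 0.0], [2.0, -0.1, 0.0]])
print(ate_rmse(gt, est))  # -> approx 0.0816
```

A single scalar per trajectory makes ATE convenient for ranking methods across sequences, at the cost of hiding where along the trajectory the drift accumulates.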
References
 [1] (2017) On the existence of epipolar matrices. International Journal of Computer Vision 121 (3), pp. 403–415.
 [2] (2018) Efficient recovery of essential matrix from two affine correspondences. IEEE Transactions on Image Processing 27 (11), pp. 5328–5337.
 [3] (2019) Homography from two orientation- and scale-covariant features. In IEEE International Conference on Computer Vision, pp. 1091–1099.
 [4] (2017) A minimal solution for two-view focal-length estimation using two affine correspondences. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 6003–6011.
 [5] (2018) Five-point fundamental matrix estimation for uncalibrated cameras. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 235–243.
 [6] (2008) Speeded-up robust features (SURF). Computer Vision and Image Understanding 110 (3), pp. 346–359.
 [7] (2014) Conic epipolar constraints from affine correspondences. Computer Vision and Image Understanding 122, pp. 105–114.
 [8] (2018) Fast and reliable minimal relative pose estimation under planar motion. Image and Vision Computing 69, pp. 103–112.
 [9] (2018) A two-stage sampling for robust feature matching. Journal of Field Robotics.
 [10] (2019) An efficient solution to the homography-based relative pose problem with a common reference direction. In IEEE International Conference on Computer Vision.
 [11] (2019) PLMP – point-line minimal problems in complete multi-view visibility. In IEEE International Conference on Computer Vision.
 [12] (2018) Affine correspondences between central cameras for rapid relative pose estimation. In European Conference on Computer Vision, pp. 482–497.
 [13] (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24 (6), pp. 381–395.
 [14] (2010) A minimal case solution to the calibrated relative pose problem for the case of two known orientation angles. In European Conference on Computer Vision, pp. 269–282.
 [15] (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361.
 [16] (2018) Visual odometry using a homography formulation with decoupled rotation and translation estimation using minimal solutions. In IEEE International Conference on Robotics and Automation, pp. 2320–2327.
 [17] (2012) An efficient hidden variable approach to minimal-case camera motion estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (12), pp. 2303–2314.
 [18] (2003) Multiple view geometry in computer vision. Cambridge University Press.
 [19] (2019) Motion estimation of non-holonomic ground vehicles from a single feature correspondence measured over n views. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 12706–12715.
 [20] (2011) Robust real-time visual odometry with a single camera and an IMU. In British Machine Vision Conference.
 [21] (2017) Efficient solvers for minimal problems by syzygy-based reduction. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 820–828.
 [22] (2004) Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60 (2), pp. 91–110.
 [23] (2015) MODS: fast and robust method for two-view matching. Computer Vision and Image Understanding 141, pp. 81–93.
 [24] (2009) ASIFT: a new framework for fully affine invariant image comparison. SIAM Journal on Imaging Sciences 2 (2), pp. 438–469.
 [25] (2017) ORB-SLAM2: an open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics 33 (5), pp. 1255–1262.
 [26] (2012) Two efficient solutions for visual odometry using directional correspondence. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (4), pp. 818–824.
 [27] (2004) An efficient solution to the five-point relative pose problem. IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (6), pp. 756–777.
 [28] (2001) Indoor robot motion based on monocular images. Robotica 19 (3), pp. 331–342.
 [29] (2016) Theory and practice of structure-from-motion using affine correspondences. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5470–5478.
 [30] (2016) Homography based ego-motion estimation with a common direction. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (2), pp. 327–341.
 [31] (2009) Real-time monocular visual odometry for on-road vehicles with 1-point RANSAC. In IEEE International Conference on Robotics and Automation, pp. 4293–4299.
 [32] (2011) Visual odometry: the first 30 years and fundamentals. IEEE Robotics & Automation Magazine 18 (4), pp. 80–92.
 [33] (2016) Structure-from-motion revisited. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4104–4113.
 [34] (2019) Perturbation analysis of the 8-point algorithm: a case study for wide FoV cameras. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 11757–11766.
 [35] (2005) A minimal solution for relative pose with unknown focal length. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 789–794.
 [36] (2012) A benchmark for the evaluation of RGB-D SLAM systems. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 573–580.
 [37] (2014) Solving for relative pose with a partially known rotation is a quadratic eigenvalue problem. In International Conference on 3D Vision.
 [38] (2020) Minimal case relative pose computation using ray-point-ray features. IEEE Transactions on Pattern Analysis and Machine Intelligence.