Artificial Intelligence Boosts Classical Monocular Localization and Mapping Technologies
Artificial Intelligence (AI), in particular Deep Neural Networks (DNNs), has shown great success in many areas of computer vision, for example, object detection, face recognition, and semantic segmentation. However, for Visual Odometry (VO) and Simultaneous Localization and Mapping (SLAM), classical approaches still dominate the field thanks to their superior accuracy and robustness compared to end-to-end trained DNNs. Recently, Artisense researchers Nan Yang, Lukas von Stumberg, and Rui Wang developed a novel method that incorporates the power of DNNs into classical direct visual SLAM. It shows that with AI, monocular VO methods, that is, tracking with a single camera, can deliver performance similar to stereo or stereo-inertial methods. This work has been accepted at CVPR 2020, the premier conference in computer vision, which will be held (potentially remotely) in Seattle, WA this year.
Empowering classical SLAM with DNNs raises three key questions: What should the DNNs predict? What should be kept from the classical methods? And how should the two sides be integrated? Artisense researchers propose training DNNs to predict three quantities: depth maps, which capture the geometry of the scene; the relative poses of consecutive images, which encode the motion pattern of the camera; and photometric uncertainties, which reflect how reliable each pixel is when used as a feature for tracking. More importantly, none of these three predictions is treated as a final estimate; instead, they serve as a priori knowledge of the environment and are refined within the optimization framework of classical visual SLAM.
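To make the integration concrete, here is a minimal sketch of how such predictions could enter a classical photometric energy. It is not the authors' implementation: the pinhole intrinsics K, the helper functions, and the heteroscedastic uncertainty weighting are illustrative assumptions. The predicted depths and relative pose define a warp from one image into the other, while the predicted uncertainties down-weight pixels that are unreliable for tracking:

```python
# Illustrative sketch (not the authors' code): DNN-predicted depth, relative
# pose, and per-pixel uncertainty plugged into a classical photometric energy.
import numpy as np

def backproject(depth, K):
    """Lift every pixel to a 3D point using its predicted depth (pinhole model)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x N
    return np.linalg.inv(K) @ pix * depth.reshape(1, -1)               # 3 x N

def warp(pts, T, K, shape):
    """Transform 3D points by a 4x4 relative pose T and project into the other view."""
    h, w = shape
    pts_h = np.vstack([pts, np.ones((1, pts.shape[1]))])               # 4 x N
    cam = (T @ pts_h)[:3]
    proj = K @ cam
    uv = proj[:2] / np.clip(proj[2:], 1e-6, None)
    return uv.T.reshape(h, w, 2)

def photometric_energy(I_ref, I_tgt, depth, T, sigma, K):
    """Uncertainty-weighted photometric error: pixels with large predicted
    sigma (unreliable for tracking) contribute less to the energy."""
    h, w = I_ref.shape
    uv = warp(backproject(depth, K), T, K, (h, w))
    ui = np.clip(np.round(uv[..., 0]).astype(int), 0, w - 1)
    vi = np.clip(np.round(uv[..., 1]).astype(int), 0, h - 1)
    r = I_ref - I_tgt[vi, ui]                                          # photometric residuals
    return np.mean(r**2 / sigma**2 + np.log(sigma**2))                 # heteroscedastic weighting

# Toy sanity check: with the identity pose and identical images, the residuals
# vanish and (with sigma = 1) the energy is zero.
h, w = 48, 64
K = np.array([[50.0, 0.0, w / 2], [0.0, 50.0, h / 2], [0.0, 0.0, 1.0]])
I = np.random.rand(h, w)
print(photometric_energy(I, I, np.full((h, w), 2.0), np.eye(4), np.ones((h, w)), K))
```

In a full system, a nonlinear least-squares optimizer would refine the pose (and the depths) by minimizing such an energy, starting from the DNN predictions rather than from scratch.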
The figure below shows the input images, the predicted depths (brighter pixels indicate closer depths), and the predicted photometric uncertainties (brighter pixels indicate higher uncertainties) from the proposed DNNs. The DNNs perceive both the distance of objects from the camera and which pixels are unreliable for tracking, for example, the moving cyclist and the reflective windows on the cars.
The figure below compares the trajectories estimated by the proposed method, a classical monocular VO method, and an end-to-end trained DNN. The blue curves are the estimated trajectories, while the green and red curves show the ground-truth trajectories and the differences between estimate and ground truth, respectively. The proposed method notably outperforms the other two.
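For readers curious how such trajectory differences are typically quantified, the sketch below computes an absolute trajectory error. It assumes the trajectories are given as N x 3 arrays of camera positions already associated by timestamp; the similarity alignment (rotation, translation, and scale, the latter being necessary for monocular methods) and the RMSE metric are illustrative choices, not necessarily the paper's exact evaluation protocol:

```python
# Illustrative absolute-trajectory-error computation (Umeyama-style alignment).
import numpy as np

def absolute_trajectory_error(est, gt):
    """Align the estimated trajectory to ground truth with a similarity
    transform (rotation + translation + scale) and return the positional RMSE."""
    mu_e, mu_g = est.mean(0), gt.mean(0)
    E, G = est - mu_e, gt - mu_g                 # centered point sets, N x 3
    U, S, Vt = np.linalg.svd(G.T @ E)            # 3 x 3 cross-covariance
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = U @ D @ Vt                               # best-fit rotation
    s = (S * np.diag(D)).sum() / (E**2).sum()    # best-fit scale (monocular)
    aligned = s * (R @ E.T).T + mu_g
    return np.sqrt(np.mean(np.sum((aligned - gt)**2, axis=1)))
```

Running this on the blue estimated curve against the green ground-truth curve would yield a single RMSE value summarizing the red differences shown in the figure.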
Finally, we show the 3D reconstruction (mapping) results of the proposed method. It is able to reconstruct large-, medium-, and small-scale environments despite using only a monocular camera.
To learn more about our novel approach to combining classical SLAM with deep-learning methods, please refer to the published paper here.