Depth Estimation of Endoscopic Scene in Robotic Surgery

The Challenge

The future of robotic surgery, with the incorporation of AR/VR, can lead to the development and deployment of novel safety tools that augment the capabilities of surgeons. One of the key engineering building blocks needed to make this a reality is knowing the depth of the different anatomical structures within the endoscopic view. Depth enables a better understanding of the spatial relationships between surgical and anatomical objects, giving the surgical team options to minimize risk while keeping human experts in control. In this case study, we leveraged the surgeon's stereo view (both left and right images) along with the camera calibration metadata provided by Intuitive Surgical to develop a depth estimation model that predicts the depth of every pixel in each frame. Using only the stereo view, an interactive 3D reconstruction can be built to view the surgical scene from multiple perspectives, which in turn enables the development of more accurate and useful safety tools.
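To illustrate how a calibrated stereo view yields per-pixel depth, here is a minimal sketch of the standard pinhole relation depth = focal_length × baseline / disparity. The function name and the toy numbers are our own illustration, not values from the challenge data:

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_mm):
    """Convert a stereo disparity map to metric depth.

    depth = focal_length * baseline / disparity; pixels with zero
    disparity (no stereo match) are mapped to infinity.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full_like(disparity, np.inf)
    valid = disparity > 0
    depth[valid] = focal_px * baseline_mm / disparity[valid]
    return depth

# Toy example: a 5 mm stereo baseline and a 1000 px focal length.
disp = np.array([[10.0, 0.0],
                 [5.0, 2.0]])
depth = disparity_to_depth(disp, focal_px=1000.0, baseline_mm=5.0)
# depth[0, 0] = 1000 * 5 / 10 = 500 mm
```

The same relation also explains why calibration metadata matters: without an accurate focal length and baseline, disparities cannot be converted into metric depths.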



Data Preparation: As a first step, we separated the video into frames and augmented the ground-truth point cloud with a concept of locality based on depth. We then applied the camera rotation to the point cloud to generate a per-frame depth map for training.
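The projection step above can be sketched as follows. This is a simplified z-buffer rendering under an assumed pinhole camera model with intrinsics K and pose (R, t); the function name and shapes are illustrative, not the exact challenge pipeline:

```python
import numpy as np

def point_cloud_to_depth_map(points, K, R, t, height, width):
    """Render a depth map by projecting a 3D point cloud through a
    pinhole camera with intrinsics K and pose (R, t).

    Keeps the nearest depth when several points land on the same
    pixel (a simple z-buffer)."""
    cam = points @ R.T + t                      # world -> camera frame
    depth_map = np.full((height, width), np.inf)
    cam = cam[cam[:, 2] > 0]                    # keep points in front
    uvw = cam @ K.T                             # project to image plane
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    ok = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    for ui, vi, z in zip(u[ok], v[ok], cam[ok, 2]):
        depth_map[vi, ui] = min(depth_map[vi, ui], z)
    return depth_map
```

Pixels that no point projects onto stay at infinity; in practice those regions would be masked out of the training loss.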


Base Model Selection: We explored several base models to validate our data preparation process and settled on UNet for this purpose. We then iteratively extended the UNet model, eventually arriving at a PSMNet-like architecture that produced the best inferences.
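A core ingredient of PSMNet-style stereo architectures is the disparity cost volume, built by pairing left features with right features shifted by each candidate disparity. A minimal NumPy sketch of that idea (shapes and names are illustrative; the real model builds this from learned CNN features):

```python
import numpy as np

def build_cost_volume(left_feat, right_feat, max_disp):
    """Build a concatenation-based stereo cost volume.

    left_feat, right_feat: (C, H, W) feature maps.
    Returns a (max_disp, 2*C, H, W) volume where slice d pairs each
    left-image feature with the right-image feature d pixels away.
    """
    c, h, w = left_feat.shape
    volume = np.zeros((max_disp, 2 * c, h, w), dtype=left_feat.dtype)
    for d in range(max_disp):
        volume[d, :c, :, d:] = left_feat[:, :, d:]
        volume[d, c:, :, d:] = right_feat[:, :, : w - d]
    return volume
```

In PSMNet this volume is then processed by 3D convolutions to regress a disparity per pixel, which calibration converts to depth.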


Model Training: We adjusted the magnitude of the gradient as needed (gradient clipping), masked the portions of the frames where we found bad data, and used a mean absolute error loss with the Adam optimizer, weight decay, and other hyperparameters, iteratively assigning weights to different samples based on the data at hand.
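The masking and gradient-clipping ideas above can be sketched as follows. The helper names are hypothetical, and a real training loop would rely on a deep learning framework's autograd rather than NumPy:

```python
import numpy as np

def masked_mae(pred, target, valid_mask):
    """Mean absolute error over valid pixels only, so regions with
    missing or corrupted ground-truth depth do not pollute training."""
    diff = np.abs(pred - target)[valid_mask]
    return diff.mean() if diff.size else 0.0

def clip_gradient(grad, max_norm):
    """Scale the gradient down whenever its L2 norm exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad
```

Masking the loss rather than the input keeps the network seeing full frames while only learning from pixels with trustworthy depth.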


Experiments: We performed a number of experiments to generate a rotating 3D view of the surgical scene, built from each 2D frame and its depth map, so that the scene could be viewed from different perspectives.
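Back-projecting a depth map into a point cloud and rotating it is the basis of such a rotating view. A minimal sketch under a pinhole camera assumption (the function names are our own):

```python
import numpy as np

def depth_to_points(depth, K):
    """Back-project a depth map into 3D camera-frame points using
    the camera intrinsics K."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T      # unit-depth viewing rays
    return rays * depth.reshape(-1, 1)

def rotate_y(points, angle_rad):
    """Rotate points about the vertical (y) axis to simulate viewing
    the scene from a different perspective."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    R = np.array([[c, 0., s], [0., 1., 0.], [-s, 0., c]])
    return points @ R.T
```

Coloring each 3D point with the corresponding 2D frame pixel and re-rendering at a sweep of angles produces the rotating view described above.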


We won the 2019 Stereo Correspondence and Reconstruction of Endoscopic Data challenge with an average depth error of about 3 mm. Because a registration error within 2 mm is considered clinically significant, we may be close to a solution that can be deployed to assist surgeons in a clinically meaningful way.


Our model won the competition with a mean absolute error of 3 mm across two different test scenes. If surgeons have access to the 3D structure of a scene, they can inspect it from multiple perspectives; with that, complex surgical tasks can perhaps be planned and executed more effectively, and the exact location of the camera matters less. The results are encouraging, but we need more accurate depth data across varied surgical scenes to build more precise, production-ready depth perception models.