eprintid: 10189942
rev_number: 7
eprint_status: archive
userid: 699
dir: disk0/10/18/99/42
datestamp: 2024-04-05 13:27:27
lastmod: 2024-04-05 13:27:27
status_changed: 2024-04-05 13:27:27
type: article
metadata_visibility: show
sword_depositor: 699
creators_name: Cui, B
creators_name: Islam, M
creators_name: Bai, L
creators_name: Ren, H
title: Surgical-DINO: adapter learning of foundation models for depth estimation in endoscopic surgery
ispublished: inpress
divisions: UCL
divisions: B04
divisions: C05
divisions: F42
keywords: Adapter learning, Depth estimation, Foundation models, Surgical scene understanding
note: © 2024 Springer Nature. This article is licensed under a Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
abstract: PURPOSE: Depth estimation in robotic surgery is vital for 3D reconstruction, surgical navigation and augmented reality visualization. Although foundation models (e.g., DINOv2) exhibit outstanding performance in many vision tasks, including depth estimation, recent works have observed their limitations in medical and surgical domain-specific applications. This work presents a low-rank adaptation (LoRA) of a foundation model for surgical depth estimation. METHODS: We design a foundation model-based depth estimation method, referred to as Surgical-DINO, a low-rank adaptation of DINOv2 for depth estimation in endoscopic surgery. We build LoRA layers and integrate them into DINO to adapt it to surgery-specific domain knowledge, instead of conventional fine-tuning. During training, we freeze the DINO image encoder, which shows excellent visual representation capacity, and only optimize the LoRA layers and depth decoder to integrate features from the surgical scene. RESULTS: Our model is extensively validated on SCARED, a MICCAI challenge dataset collected from da Vinci Xi endoscope surgery. We empirically show that Surgical-DINO significantly outperforms all the state-of-the-art models in endoscopic depth estimation tasks. Ablation studies show evidence of the remarkable effect of our LoRA layers and adaptation. CONCLUSION: Surgical-DINO sheds some light on the successful adaptation of foundation models to the surgical domain for depth estimation. There is clear evidence in the results that zero-shot prediction with weights pre-trained on computer vision datasets, or naive fine-tuning, is not sufficient to use the foundation model in the surgical domain directly.
date: 2024
date_type: published
publisher: Springer Science and Business Media LLC
official_url: http://dx.doi.org/10.1007/s11548-024-03083-5
oa_status: green
full_text_type: pub
language: eng
primo: open
primo_central: open_green
verified: verified_manual
elements_id: 2256268
doi: 10.1007/s11548-024-03083-5
medium: Print-Electronic
pii: 10.1007/s11548-024-03083-5
lyricists_name: Islam, Mobarakol
lyricists_id: MISLB53
actors_name: Flynn, Bernadette
actors_id: BFFLY94
actors_role: owner
funding_acknowledgements: C4026-21G, GRF 14211420 14203323 [Hong Kong Research Grants Council (RGC) Collaborative Research Fund and General Research Fund]; 202108233000303 [Shenzhen-Hong Kong-Macau Technology Research Programme (Type C) STIC Grant SGDX20210823103535014]
full_text_status: public
publication: International Journal of Computer Assisted Radiology and Surgery
event_location: Germany
citation: Cui, B; Islam, M; Bai, L; Ren, H; (2024) Surgical-DINO: adapter learning of foundation models for depth estimation in endoscopic surgery. International Journal of Computer Assisted Radiology and Surgery 10.1007/s11548-024-03083-5 <https://doi.org/10.1007/s11548-024-03083-5>. (In press). Green open access
document_url: https://discovery.ucl.ac.uk/id/eprint/10189942/1/s11548-024-03083-5.pdf